Screen scraping your way into RSS
Written by Dennis Pallett
Introduction

RSS is one of the hottest technologies at the moment, and even big web publishers (such as the New York Times) are getting into RSS as well. However, there are still a lot of websites that do not have RSS feeds.

If you still want to be able to check those websites in your favourite aggregator, you need to create your own RSS feed for those websites. This can be done automatically with PHP, using a method called screen scraping. Screen scraping is usually frowned upon, as it's mostly used to steal content from other websites. I personally believe that in this case, to automatically generate an RSS feed, screen scraping is not a bad thing.

Now, on to the code!

Getting the content

For this article, we'll use PHPit as an example, despite the fact that PHPit already has RSS feeds. We'll want to generate an RSS feed from the content listed on the front page.

The first step in screen scraping is getting the complete page. In PHP this can be done very easily by using implode("", file("[the url here]")); if your web host allows it. If you can't use file(), you'll have to use a different method of getting the page, e.g. using the cURL library.

Now that we have the content available, we can parse it using some regular expressions. The key to screen scraping is looking for patterns that match the content, e.g. are all the content items wrapped in <div>'s or something else? If you can successfully discover a pattern, then you can use preg_match_all() to get all the content items.

For PHPit, the pattern that matches the content is <div class="contentitem">[Content Here]</div>. You can verify this yourself by going to the main page of PHPit and viewing the source.

Now that we have a match we can get all the content items. The next step is to retrieve the individual information, i.e. url, title, author, text. This can be done by using some more regular expressions and str_replace() on each content item.
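As a rough sketch of those last two steps, the snippet below runs preg_match_all() over a small hard-coded HTML string and then pulls a url, title and text out of each item. The markup inside each contentitem div (an <h2> wrapping a link, followed by a paragraph), the sample URLs and the variable names are all assumptions made for illustration. PHPit's real markup may well differ, in which case the inner pattern needs adjusting.

```php
<?php
// Hypothetical sample page; the real site's markup may look different.
$data = '<div class="contentitem"><h2><a href="http://www.phpit.net/a1">First article</a></h2>'
      . '<p>Intro text one.</p></div>'
      . '<div class="contentitem"><h2><a href="http://www.phpit.net/a2">Second article</a></h2>'
      . '<p>Intro text two.</p></div>';

// Step 1: grab every content item (non-greedy; the "s" flag lets . span newlines).
preg_match_all('/<div class="contentitem">(.*?)<\/div>/s', $data, $matches);

// Step 2: pull the individual fields out of each item.
$items = array();
foreach ($matches[1] as $html) {
    if (preg_match('/<a href="([^"]+)">(.*?)<\/a>/s', $html, $m)) {
        $items[] = array(
            'url'   => $m[1],
            'title' => trim(strip_tags($m[2])),
            // Whatever is left after removing the link is treated as the text.
            'text'  => trim(strip_tags(str_replace($m[0], '', $html))),
        );
    }
}

print_r($items);
```

Note that a non-greedy pattern like this stops at the first closing </div>, so it only works while the content items contain no nested <div>'s.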
By now we have the following code:

<?php
// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items
preg_match_all('/<div class="contentitem">([^`]*?)<\/div>/', $data, $matches);

Like I said, the next step is to retrieve the individual information, but first let's make a beginning on our feed, by setting the appropriate header (text/xml) and printing the channel information, etc.

// Begin feed
header("Content-Type: text/xml; charset=ISO-8859-1");
echo '<?xml version="1.0" encoding="ISO-8859-1" ?>' . "\n";
?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:admin="http://webns.net/mvcb/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>PHPit Latest Content</title>
<description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
<link>http://www.phpit.net</link>
<language>en-us</language>
Mastering Regular Expressions in PHP
Written by Dennis Pallett
What are Regular Expressions?

A regular expression is a pattern that can match various text strings. Using regular expressions you can find (and replace) certain text patterns, for example "all the words that begin with the letter A" or "find only telephone numbers". Regular expressions are often used in validation classes, because they are a really powerful tool to verify e-mail addresses, telephone numbers, street addresses, zip codes, and more.

In this tutorial I will show you how regular expressions work in PHP, and give you a short introduction on writing your own regular expressions. I will also give you several example regular expressions that are often used.

Regular Expressions in PHP

Using regex (regular expressions) is really easy in PHP, and there are several functions that exist to do regex finding and replacing. Let's start with a simple regex find.

Have a look at the documentation of the preg_match function. As you can see from the documentation, preg_match is used to perform a regular expression match. In this case no replacing is done, only a simple find. Copy the code below to give it a try.

<?php
// Example string
$str = "Let's find stuff <bla>in between</bla> these two previous brackets";

// Let's perform the regex (the / in the closing tag is escaped,
// because / is also the pattern delimiter)
$do = preg_match("/<bla>(.*)<\/bla>/", $str, $matches);

// Check if the regex was successful (preg_match returns 1 on a match)
if ($do === 1) {
    // Matched something, show the matched string
    echo htmlentities($matches[0]);

    // Also show the text in between the tags
    echo '<br />' . $matches[1];
} else {
    // No match
    echo "Couldn't find a match";
}
?>

After having run the code, it's probably a good idea if I do a quick run through the code. Basically, the whole core of the above code is the line that contains preg_match. The first argument is your regex pattern. This is probably the most important one. Later on in this tutorial, I will explain some basic regular expressions, but if you really want to learn regular expressions then it's best if you look on Google for specific regular expression examples.

The second argument is the subject string.
I assume that needs no explaining. Finally, the third argument is optional, but if you want to get the matched text, or the text in between something, it's a good idea to use it (just like I used it in the example).

The preg_match function stops after it has found the first match. If you want to find ALL matches in a string, you need to use the preg_match_all function. That works pretty much the same, so there is no need to explain it separately.

Now that we've had finding, let's do a find-and-replace with the preg_replace function. The preg_replace function works much like the preg_match function, but it takes an extra argument for the replacement string. Copy the code below, and run it.

<?php
// Example string
$str = "Let's replace <bla>stuff between</bla> bla brackets";

// Do preg replace
$result = preg_replace("/<bla>(.*)<\/bla>/", "<bla>new stuff</bla>", $str);
echo htmlentities($result);
?>

The result would then be the same string, except it would now say 'new stuff' between the bla tags. This is of course just a simple example, and more advanced replacements can be done.

You can also use keys in the replacement string. Say you still want the text between the brackets, and just want to add something? You use the $1, $2, etc. keys for those. For example:

<?php
// Example string
$str = "Let's replace <bla>stuff between</bla> bla brackets";

// Do preg replace; single quotes keep PHP from treating $1 as a variable
$result = preg_replace("/<bla>(.*)<\/bla>/", '<bla>new stuff (the old: $1)</bla>', $str);
echo htmlentities($result);
?>

This would then print "Let's replace <bla>new stuff (the old: stuff between)</bla> bla brackets". $2 is for the second "catch-all", $3 for the third, etc.
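To round off the functions covered above, here is a small sketch of preg_match_all() and of a replacement that uses both $1 and $2. The sample string and the variable names are made up for illustration. The non-greedy (.*?) is a choice made for this sketch, so that each tag pair is matched on its own instead of as one long match.

```php
<?php
// Sample string with two tag pairs (invented for this example).
$str = "First <bla>one</bla> and then <bla>two</bla>";

// preg_match_all() returns the number of full matches;
// $matches[0] holds the full matches, $matches[1] the first group.
$count = preg_match_all('/<bla>(.*?)<\/bla>/', $str, $matches);

echo $count . "\n";                      // 2
echo implode(', ', $matches[1]) . "\n";  // one, two

// Two capture groups: $1 is the tag name, $2 the text inside the tags.
$swapped = preg_replace('/<(bla)>(.*?)<\/bla>/', '[$1: $2]', $str);
echo $swapped . "\n";                    // First [bla: one] and then [bla: two]
```

With the greedy (.*) used earlier in this tutorial, the same string would give a single match running from the first <bla> to the last </bla>.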