Screen scraping your way into RSS

Written by Dennis Pallett


Introduction

RSS is one of the hottest technologies at the moment, and even big web publishers (such as the New York Times) are getting into RSS. However, there are still a lot of websites that do not have RSS feeds.

If you still want to be able to check those websites in your favourite aggregator, you need to create your own RSS feed for them. This can be done automatically with PHP, using a method called screen scraping. Screen scraping is usually frowned upon, as it's mostly used to steal content from other websites.

I personally believe that in this case, to automatically generate an RSS feed, screen scraping is not a bad thing. Now, on to the code!

Getting the content

For this article, we'll use PHPit as an example, despite the fact that PHPit already has RSS feeds.

We want to generate an RSS feed from the content listed on the front page. The first step in screen scraping is getting the complete page. In PHP this can be done very easily with implode("", file("[the url here]")), if your web host allows it. If you can't use file() on URLs, you'll have to use a different method of getting the page, e.g. the cURL library.
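As a minimal sketch of that fallback, the helper below tries file() first and only then turns to cURL. The function name fetch_page() is hypothetical (it is not part of PHP or PHPit), and the cURL options shown are just one reasonable choice.

```php
<?php
// Hypothetical helper: return the page at $url as one string, or false on failure.
function fetch_page($url)
{
    // file() returns an array of lines; on URLs it needs allow_url_fopen.
    // The @ hides the warning PHP raises when the host forbids it.
    $lines = @file($url);
    if ($lines !== false) {
        return implode("", $lines);
    }

    // Fall back to the cURL extension when it is loaded.
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        $data = curl_exec($ch);
        return $data === false ? false : $data;
    }

    return false;
}
```

Calling fetch_page("http://www.phpit.net/") would then give the same string as the implode()/file() one-liner on hosts that allow it.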

Now that we have the page available, we can parse it for the content using some regular expressions. The key to screen scraping is looking for patterns that match the content, e.g. are all the content items wrapped in <div>'s or something else? If you can discover a pattern, you can use preg_match_all() to get all the content items.

For PHPit, the pattern that matches the content is <div class="contentitem">[Content Here]</div>. You can verify this yourself by going to the main page of PHPit and viewing the source.
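To see what such a pattern captures, here is a small sketch that runs preg_match_all() on an inline sample instead of a live page. The sample markup is made up for illustration. The pattern uses # as the delimiter and a lazy (.*?) with the s modifier, which is one possible way to write it.

```php
<?php
// Inline sample standing in for a fetched page (hypothetical markup).
$data = '<div class="contentitem"><a href="/one">First</a></div>' . "\n"
      . '<div class="contentitem"><a href="/two">Second</a></div>';

// "#" as delimiter avoids escaping the "/" in </div>;
// the "s" modifier lets "." match newlines inside an item.
$count = preg_match_all('#<div class="contentitem">(.*?)</div>#s', $data, $matches);

// $matches[0] holds the full matches, $matches[1] the captured inner HTML.
foreach ($matches[1] as $item) {
    echo htmlentities($item), "\n";
}
```

Note that a lazy pattern like this stops at the first closing </div>, so it would cut an item short if the item itself contained nested <div> elements.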

Now that we have a pattern, we can get all the content items. The next step is to retrieve the individual information, i.e. URL, title, author and text. This can be done by using some more regular expressions and str_replace() on each content item.
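A rough sketch of that extraction step on a single item might look like this. The item markup, the "by " prefix and the first_group() helper are all assumptions made for the example; the real PHPit markup may well differ, so the patterns would need adjusting.

```php
<?php
// Hypothetical item markup; the real PHPit markup may differ.
$item = '<h3><a href="/article/demo/">Demo title</a></h3>'
      . '<span class="author">by Jane Doe</span>'
      . '<p>Some <b>intro</b> text.</p>';

// Hypothetical helper: return the first captured group, or "" when nothing matches.
function first_group($pattern, $subject)
{
    return preg_match($pattern, $subject, $m) ? $m[1] : '';
}

$url    = first_group('#<a href="([^"]+)"#', $item);
$title  = first_group('#<a [^>]*>(.*?)</a>#s', $item);
// str_replace() strips the assumed "by " prefix from the author line.
$author = str_replace('by ', '', first_group('#<span class="author">(.*?)</span>#s', $item));
// strip_tags() removes inline markup such as <b> from the text.
$text   = strip_tags(first_group('#<p>(.*?)</p>#s', $item));
```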

By now we have the following code:

<?php

// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items
preg_match_all("/<div class=\"contentitem\">([^`]*?)<\/div>/", $data, $matches);

Like I said, the next step is to retrieve the individual information, but first let's make a start on our feed by setting the appropriate header (text/xml) and printing the channel information, etc.

// Begin feed
header("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>
<rss version="2.0"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:content="http://purl.org/rss/1.0/modules/content/"
 xmlns:admin="http://webns.net/mvcb/"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>PHPit Latest Content</title>
<description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
<link>http://www.phpit.net</link>
<language>en-us</language>

Mastering Regular Expressions in PHP

Written by Dennis Pallett


What are Regular Expressions?

A regular expression is a pattern that can match various text strings. Using regular expressions you can find (and replace) certain text patterns, for example "all the words that begin with the letter A" or "find only telephone numbers". Regular expressions are often used in validation classes, because they are a really powerful tool to verify e-mail addresses, telephone numbers, street addresses, zip codes, and more.

In this tutorial I will show you how regular expressions work in PHP, and give you a short introduction to writing your own regular expressions. I will also give you several example regular expressions that are often used.

Regular Expressions in PHP

Using regex (regular expressions) is really easy in PHP, and there are several functions for regex finding and replacing. Let's start with a simple regex find.

Have a look at the documentation of the preg_match function. As you can see from the documentation, preg_match is used to perform a regular expression match. In this case no replacing is done, only a simple find. Copy the code below to give it a try.

<?php

// Example string
$str = "Let's find the stuff <bla>in between</bla> these two previous brackets";

// Let's perform the regex (the "/" in </bla> must be escaped inside "/" delimiters)
$do = preg_match("/<bla>(.*)<\/bla>/", $str, $matches);

// Check if regex was successful (preg_match returns 1 on a match)
if ($do === 1) {
    // Matched something, show the matched string
    echo htmlentities($matches[0]);

    // Also show the text in between the tags
    echo '<br />' . $matches[1];
} else {
    // No match
    echo "Couldn't find a match";
}

?>

Now that you've run the code, let me quickly run through it. The core of the above code is the line that contains preg_match. The first argument is your regex pattern. This is probably the most important one. Later on in this tutorial I will explain some basic regular expressions, but if you really want to learn regular expressions, it's best to search Google for specific regular expression examples.

The second argument is the subject string. I assume that needs no explaining. Finally, the third argument is optional, but if you want to get the matched text, or the text in between something, it's a good idea to use it (just like I did in the example). The preg_match function stops after it has found the first match. If you want to find ALL matches in a string, you need to use the preg_match_all function. That works pretty much the same, so there is no need to explain it separately.
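The difference between the two functions is easy to see side by side. The following is a small sketch on a made-up string, with a simple pattern for numbers of the form 555-1234:

```php
<?php
$str = "Call 555-1234 or 555-9876 today";

// preg_match stops at the first match and returns 1 (or 0 for no match).
$found = preg_match('/\d{3}-\d{4}/', $str, $first);

// preg_match_all keeps going and returns the number of matches.
$total = preg_match_all('/\d{3}-\d{4}/', $str, $all);

echo $first[0], "\n";              // the first match only
echo implode(", ", $all[0]), "\n"; // every match
```

With preg_match_all, $all[0] is an array of all full matches, whereas with preg_match, $first[0] is a single string.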

Now that we've covered finding, let's do a find-and-replace with the preg_replace function. The preg_replace function works much like preg_match, but takes an extra argument for the replacement string. Copy the code below and run it.

<?php

// Example string
$str = "Let's replace the <bla>stuff between</bla> the bla brackets";

// Do the preg replace
$result = preg_replace("/<bla>(.*)<\/bla>/", "<bla>new stuff</bla>", $str);

echo htmlentities($result);
?>

The result is the same string, except it now says 'new stuff' between the bla tags. This is of course just a simple example, and more advanced replacements can be done.
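One thing to watch with a pattern like (.*) is that it is greedy: when the subject contains more than one pair of tags, it matches all the way to the last closing tag. The sketch below, on a made-up string, compares it with the lazy form (.*?):

```php
<?php
$str = "<bla>one</bla> and <bla>two</bla>";

// Greedy: (.*) runs to the LAST </bla>, so both pairs collapse into one.
$greedy = preg_replace('#<bla>(.*)</bla>#', '<bla>new</bla>', $str);

// Lazy: (.*?) stops at the first </bla>, so each pair is replaced separately.
$lazy = preg_replace('#<bla>(.*?)</bla>#', '<bla>new</bla>', $str);

echo htmlentities($greedy), "\n";
echo htmlentities($lazy), "\n";
```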

You can also use keys in the replacement string. Say you still want the text between the brackets, and just want to add something? You use the $1, $2, etc. keys for those. For example:

<?php

// Example string
$str = "Let's replace the <bla>stuff between</bla> the bla brackets";

// Do the preg replace ($1 is filled in by preg_replace, so single quotes keep PHP from touching it)
$result = preg_replace("/<bla>(.*)<\/bla>/", '<bla>new stuff (the old: $1)</bla>', $str);

echo htmlentities($result);
?>

This would then print "Let's replace the <bla>new stuff (the old: stuff between)</bla> the bla brackets". $2 is for the second "catch-all" (capturing group), $3 for the third, etc.
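As a sketch of using several keys at once, the example below rearranges a date. The input string and the target format are made up for illustration:

```php
<?php
$str = "Published on 2005-03-14";

// Three capturing groups: year ($1), month ($2), day ($3).
// The replacement puts them back in a different order.
$result = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', '$3/$2/$1', $str);

echo $result, "\n";
```

The groups are numbered by the position of their opening parenthesis, counting from the left.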
