By Flexo

If your RSS feed is being “scraped” and republished in full somewhere else — as is currently happening to my main blog, Consumerism Commentary — you run two important risks. First of all, your scraper is likely pasting ads alongside your content on his website, trying to earn income directly off your hard efforts. Secondly, there’s a possibility that you can be penalized by the search engines for duplicate content, so you indirectly lose income.

This is a tough problem to combat, especially if you use a service like FeedBurner. I use FeedBurner for my main blog’s RSS feed because of its many interesting statistics features. I’ve never published my FeedBurner address as my main feed. Rather, I used redirects, so those subscribed to my original feed at consumerismcommentary.com/index.xml never had broken links. That is the only URL I’ve ever published for the feed, so it is the one that was discovered by scrapers.

If I had publicized my FeedBurner feed address, the scrapers would be using that instead, and I would not be able to do what I am doing today to avoid being scraped, because requests to that address never pass through my own server.

Now, for dealing with the scrapers.

Step 1. Create a fake feed. There’s no need to make this difficult. Take Consumerism Commentary’s fake feed, edit the links and text to your liking, and save or upload it as fakefeed.xml on your web server. The item dates are set in the future so the feed keeps looking fresh. This future-dating could have been done with simple PHP to advance the dates automatically, but I was a little lazy.
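If you would rather not hand-edit the dates, here is a minimal sketch of the automatic future-dating idea. It is written in Python rather than the PHP mentioned above, and everything in it is an assumption for illustration: the function name build_fake_feed, the placeholder channel text, and the example.com address are not part of my actual fake feed.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime
import xml.etree.ElementTree as ET


def build_fake_feed(site_url, titles, now=None):
    """Return RSS 2.0 text whose items are dated 1, 2, 3... days after `now`."""
    if now is None:
        now = datetime.now(timezone.utc)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # Placeholder channel text (hypothetical); edit to your liking.
    ET.SubElement(channel, "title").text = "Placeholder feed"
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Served to unauthorized republishers."

    for offset, title in enumerate(titles, start=1):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = site_url
        ET.SubElement(item, "description").text = (
            "This content was republished without permission. "
            "Read the original at " + site_url
        )
        # RFC 2822 date, `offset` days in the future relative to `now`.
        ET.SubElement(item, "pubDate").text = format_datetime(
            now + timedelta(days=offset)
        )

    return ET.tostring(rss, encoding="unicode")
```

Running this on a schedule and writing the returned text to fakefeed.xml (with an XML declaration line in front) would keep the dates ahead of the current day without manual edits.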

Step 2. Locate your scrapers. The software that scrapes RSS feeds usually resides on the same machine as the website publishing the stolen content. Simply run nslookup followed by the offending site’s domain name, at a DOS prompt or in a UNIX/Linux shell, to determine the IP address of that machine. Let’s say the IP of the first offending machine is 128.175.13.63. Let’s also say you have a second offending machine at 192.193.217.120.
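If you have several scraper sites to look up, the same lookup can be scripted. This is only a sketch using Python’s standard socket module; the function name resolve_ips and its resolver parameter are my own invention for illustration, and the lookup returns IPv4 addresses only.

```python
import socket


def resolve_ips(hostname, resolver=socket.gethostbyname_ex):
    """Return the sorted, de-duplicated IPv4 addresses for `hostname`.

    `resolver` defaults to the real DNS lookup; it is a parameter only so
    the function can be exercised without network access.
    """
    _canonical_name, _aliases, addresses = resolver(hostname)
    return sorted(set(addresses))
```

Calling resolve_ips with the scraper site’s domain name returns a list of address strings that you can carry into Step 3.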

Step 3. Redirect requests to the fake feed. To redirect HTTP requests, you’ll need access to the .htaccess file in your domain’s mapped root folder, on an Apache server with mod_rewrite enabled. Add the following to your .htaccess file:

RewriteEngine On
RewriteBase /

RewriteRule ^fakefeed\.xml$ - [S=300]

RewriteCond %{REMOTE_ADDR} ^128\.175\.13\.63$ [OR]
RewriteCond %{REMOTE_ADDR} ^192\.193\.217\.120$
RewriteRule .* http://www.consumerismcommentary.com/fakefeed.xml [R=302,L]

The first two lines may already exist. If they do, there’s no need to duplicate them. Just place the rest of the above content below the already existing lines. Of course, replace the IP addresses in the RewriteCond lines with those of your scrapers, and replace the URL in the second RewriteRule with the URL to your fake feed.

The [S=300] flag on the first RewriteRule tells the web server to skip the next 300 rewrite rules in the .htaccess file if fakefeed.xml is requested. This is important, because otherwise the scraper’s redirected request for fakefeed.xml would itself be redirected to fakefeed.xml, and your server would enter an endless loop of redirects. In the highly unlikely event that you have an incredibly long .htaccess file with more than 300 rules, you will have to increase the number from 300 to a larger value.
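As the list of offending IP addresses grows, typing the escaped patterns by hand gets error-prone. The following sketch generates the same block of directives from a plain list of addresses, with the dots escaped and the file-name pattern anchored at both ends. The function name build_rules and its parameters are hypothetical, and the output is meant to be pasted into .htaccess by hand after you have checked it.

```python
import re


def build_rules(ips, fake_feed_url, fake_feed_path="fakefeed.xml", skip=300):
    """Return .htaccess text that redirects the given IPs to the fake feed."""
    if not ips:
        # With no RewriteCond lines, the final rule would redirect every visitor.
        raise ValueError("at least one IP address is required")

    lines = [
        "RewriteEngine On",
        "RewriteBase /",
        "",
        # Requests for the fake feed itself skip the rules below (no loop).
        f"RewriteRule ^{re.escape(fake_feed_path)}$ - [S={skip}]",
        "",
    ]
    for index, ip in enumerate(ips):
        # Every condition except the last is joined to the next with [OR].
        flag = " [OR]" if index < len(ips) - 1 else ""
        lines.append(f"RewriteCond %{{REMOTE_ADDR}} ^{re.escape(ip)}$" + flag)
    lines.append(f"RewriteRule .* {fake_feed_url} [R=302,L]")
    return "\n".join(lines)
```

Printing the result of build_rules for the two example addresses from Step 2 and my fake feed URL gives the same directives shown in Step 3.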

That’s it! Once the scrapers realize you’re providing bad content, they’ll likely stop scraping your feed. If they don’t stop, at least they’re not taking your real content’s full entries or excerpts.

Note: It’s too late to do anything about the entries that have already been scraped. Depending on the scraper’s software, these may not go away until the perpetrator removes the entire website. The redirect will prevent future items from being scraped… until a new IP address is used.


6 Responses to “Being Scraped? Here is Something You Can Do.”  

  1. Brian

    Great idea, a little bit more work than I’d like, but it works. Since the scrapers almost always use Adsense, I’ve found that sending an e-mail to adsense-abuse@google.com with a note that they have duplicate content (a violation of their TOS) usually gets the site taken down (sometimes within 48 hours).

    Alternatively (or in addition to), just make sure you add a lot of link backs to your previous posts. That way, the people reading your scraped content come to your site.

  2. Nick

    It’s never too late to do something about entries that have already been scraped. I’ve found DMCA notices to ISPs an effective way to deal with such matters.

  3. Ryan Dlugosz

    It seems like FeedBurner could pretty effectively offer this as a service. They could go one step further and use their community of users to identify the scrapers and blacklist them for everybody.

