For the last four years, Ctrl blog’s syndication feed (“RSS”) subscribers have received early access to most articles as an incentive to subscribe. This program is now being reversed, and syndication feeds will no longer receive new articles until a few hours after they’re published.
Every article I publish on this blog appears on a couple of dozen other websites almost immediately. What all the content scrapers have in common is that they use syndication feeds to be notified of new articles, which allows them to quickly fetch new pages and reproduce them.
Some search engines, including Yandex, Google, and Bing, can have trouble identifying the original source of an article, and often rank one of the unlicensed, plagiarized reproductions higher than the original. Some of the plagiarists use quite aggressive means to promote their copies, and that also seems to pay off in search rankings.
It’s difficult to deal with content scrapers
Defending copyrights as an individual isn’t an easy matter. It can be quite time-consuming to find contact details for the owners of many websites or for any of their service providers. Most simply ignore any attempt at contacting them, and a few send angry replies arguing that it’s none of my business what they put up on their websites (even though they sometimes show my name as the author).
Filing DMCA notices against copyists is no laughing matter, and you wouldn’t believe how angry some site owners get when their copying machines stop working due to takedown notices sent to their service providers. Let’s also say I’ve not made friends by getting websites that have copied significant portions of my work kicked out of advertising programs.
In the past, I’ve used somewhat aggressive means to take out content scraping sites that copy and redistribute my articles. This requires significant work and verification, doesn’t work in all instances, and frankly isn’t worth the effort and risks involved.
I’ve experimented with including hidden and visible “This article was originally published on Ctrl blog” paragraphs with links back to the original source. However, these have proven ineffective and were nearly always stripped out.
Delaying syndication feeds to slow down republications
Until now, I’ve provided feed subscribers with early access to new articles as a reward for subscribing. I’ve not promoted this, and I’m sure most subscribers never noticed that articles appeared in their feed readers anywhere from one to six hours before they appeared on the front page or anywhere else. The idea was that subscribers could get more exposure for sharing somewhat more exclusive links, whether they realized it or not.
Unfortunately, this scheme has enabled some websites to grab articles early and republish them before Googlebot, Bingbot, and friends have had a chance to index the original page.
I did some experimentation and discovered that it isn’t enough to stop giving feed subscribers early access. I also had to delay new articles from appearing in the feeds until a little over an hour after publication to significantly impact the rankings of plagiarized copies across the leading search engines.
The delay seems to give search engines, which are notified about new pages as they’re published but still take some time to crawl them, enough time to determine for certain which website is the original. The plagiarized copies are still made, but they don’t get nearly as much reach when they’re pushed down in rankings or even entirely out of search results.
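The delay itself can be sketched as a simple filter over publication times. This is a minimal illustration and not the code that runs this blog: the `Article` class, the `feed_visible` function, and the 75-minute value are all assumptions standing in for “a little over an hour”.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed value; the text above only says "a little over an hour".
FEED_DELAY = timedelta(minutes=75)


@dataclass
class Article:
    title: str
    published: datetime  # when the page went live on the website


def feed_visible(articles, now, delay=FEED_DELAY):
    """Return the articles old enough to appear in the feed, newest first."""
    ready = [a for a in articles if a.published + delay <= now]
    return sorted(ready, key=lambda a: a.published, reverse=True)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
articles = [
    Article("Old post", now - timedelta(days=2)),
    Article("Fresh post", now - timedelta(minutes=20)),
]

# The fresh post is live on the website but held back from the feed.
visible_titles = [a.title for a in feed_visible(articles, now)]
print(visible_titles)
```

The website and any search engine notifications would still use the real publication time; only the feed query applies the cutoff.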
Implementation notes for delayed feed updates
In a delayed feed, the <updated> element in the Atom syndication format should specify when an <entry> was added to the feed and not the time of first publication. It’s important to set this timestamp correctly, as some feed clients take shortcuts and skip processing of feed entries with update timestamps older than the time the client last fetched the feed. Additionally, the <updated> element for the root <feed> element should reflect when the newest entry was added to the feed.
The <pubDate> element in RSS feeds is less strictly defined, and it can be interpreted to mean just about anything. My interpretation, however, is that it’s supposed to be the time an item was added to the feed and not the time the item was first published.
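To illustrate why the timestamp matters, here is a rough sketch that builds a bare-bones Atom feed with Python’s standard library and runs it past a simulated shortcut-taking client. The function names, the 75-minute delay, and the choice to set the feed-level timestamp to the newest entry’s time are my assumptions for the example, and it leaves out elements such as authors and links that a real feed would carry.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

ATOM = "http://www.w3.org/2005/Atom"
FEED_DELAY = timedelta(minutes=75)  # assumed value for illustration


def build_feed(posts, now, delay=FEED_DELAY):
    """Build a minimal Atom feed from (title, url, published) tuples."""
    ET.register_namespace("", ATOM)
    feed = ET.Element(f"{{{ATOM}}}feed")
    # Keep only posts whose delay has passed, paired with their feed time.
    visible = [(t, u, p + delay) for t, u, p in posts if p + delay <= now]
    if visible:
        # Assumption: the feed-level timestamp follows the newest entry.
        newest = max(added for _, _, added in visible)
        ET.SubElement(feed, f"{{{ATOM}}}updated").text = newest.isoformat()
    for title, url, added in visible:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}title").text = title
        ET.SubElement(entry, f"{{{ATOM}}}id").text = url
        # Time the entry was added to the feed, not first publication.
        ET.SubElement(entry, f"{{{ATOM}}}updated").text = added.isoformat()
    return feed


def naive_client_processes(entry_updated, last_fetch):
    """Mimic a client that skips entries older than its last fetch."""
    return entry_updated > last_fetch


published = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
last_fetch = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 11, 0, tzinfo=timezone.utc)

feed = build_feed([("Example post", "https://example.com/post", published)], now)
entry_updated = datetime.fromisoformat(
    feed.find(f"{{{ATOM}}}entry/{{{ATOM}}}updated").text
)

print(naive_client_processes(published, last_fetch))      # publish time
print(naive_client_processes(entry_updated, last_fetch))  # feed time
```

In this scenario the client last fetched the feed half an hour after the article went live but before the entry was released. Stamping the entry with the publication time makes the simulated client skip it, while stamping it with the time it entered the feed lets it through.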
There are at least two plugins available for WordPress that implement delayed feed delivery, but neither sets the correct update times in the feeds.
Exploring other means of copy-protection
There are surprisingly few tools available to help publishers assert ownership or remove unauthorized copies of their content.
There are a few companies that offer copy-protection services, where “copy-protection” means one of two things: 1) they’ll crawl the web and notify you by email when they discover copies of your registered articles, or 2) you pay to display a little copy-protection badge on your website (and that is all you get). These services aren’t super useful.
Yandex has a free API called Original texts that allows webmasters to submit original copy and associate it with a domain name prior to making the copy available on the web. This can help Yandex determine the original source of published materials, but none of the other search engines have similar systems.
My own Internet Archive plugin for WordPress saves a copy of everything I publish in the Internet Archive. This copy is excluded from search engines, and I consider the services of the Internet Archive a net positive for the web ecosystem. It also means that I can use that copy later to assert that an article was published on my website on or before a particular point in time. This would of course only potentially be useful as part of a complicated legal process.
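Looking up such a copy afterwards could be sketched like this, using the Internet Archive’s Wayback availability endpoint, which returns the snapshot closest to a requested timestamp. This is not how the plugin works; the helper names and the sample response values are made up for illustration, and the sketch only builds the query URL and parses a response without making a network request.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

WAYBACK_FORMAT = "%Y%m%d%H%M%S"  # Wayback timestamps are in UTC


def availability_url(page_url, around):
    """Build a query for the snapshot closest to the given time."""
    query = urlencode({
        "url": page_url,
        "timestamp": around.strftime(WAYBACK_FORMAT),
    })
    return f"https://archive.org/wayback/available?{query}"


def snapshot_time(response):
    """Return the capture time of the closest snapshot, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    captured = datetime.strptime(closest["timestamp"], WAYBACK_FORMAT)
    return captured.replace(tzinfo=timezone.utc)


published = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
print(availability_url("https://example.com/post", published))

# Illustrative response in the shape the endpoint returns (made-up values).
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20240101090212/https://example.com/post",
            "timestamp": "20240101090212",
            "status": "200",
        }
    }
}
print(snapshot_time(sample))
```

A capture time shortly after the publication time is what would support the claim that the article existed on the original website by then.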
There isn’t much you can do proactively other than use the above tools, monitor for copies, and reach out to plagiarists to ask them to delete your property from their websites.