Has your blog content ever been stolen? Have you ever used Google search and been incensed to discover that the stolen version, i.e. the duplicate content, appears higher in the SERPs (search engine results pages) than your original article?
Duplicate content is content that can be accessed at more than one URL. “Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar.” If search engine spiders can’t tell which version of a web page or document is the original or canonical version, the consequence is less search visibility.
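To make “appreciably similar” more concrete, here is a minimal sketch that scores the overlap between two pieces of text by comparing their runs of consecutive words (often called shingles). This is only an illustration: the function names, the shingle size and the sample sentences are assumptions of mine, and it is not how Google or any other search engine actually detects duplicates.

```python
def shingles(text, size=3):
    """Return the set of consecutive word runs (shingles) found in the text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b, size=3):
    """Jaccard overlap of the two shingle sets: 0.0 (nothing shared) to 1.0 (identical)."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical example: a scraped copy with a single word changed.
original = "the quick brown fox jumps over the lazy dog near the river bank"
scraped = "the quick brown fox jumps over the sleepy dog near the river bank"
print(round(similarity(original, scraped), 2))
```

A score close to 1.0 suggests a near copy, while unrelated posts should score close to 0.0. Where to draw the line between the two is a judgment call.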
Duplicate content within a domain
Duplicate content within a domain is a common problem on blogs where multiple URLs can refer to the same content, for example, if you have full posts displaying in Archives, Categories pages and Tag pages. On self-hosted WordPress.org installs, the “noindex, follow” robots meta tag can be used to instruct Google and the other search engines to crawl the page and follow its links but not add the page to their index. This cannot be done on free hosted blogs on WordPress.com, as it’s a multi-user blogging platform where users cannot access and edit themes or templates. With Panda rolling out globally and Google advising site owners to remove duplicate and non-original content, what is one to do?
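If you run a self-hosted install and want to confirm that an archive, category or tag page really carries the noindex, follow instruction, a small check along these lines may help. It is a sketch only: the class and function names are mine, and the sample page is a made-up stand-in for whatever HTML your theme emits.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for part in attr_map.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def is_noindex_follow(html):
    """True if the page says: follow my links, but do not index this page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives and "nofollow" not in parser.directives


# Hypothetical <head> of a tag archive page on a self-hosted blog.
archive_page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(is_noindex_follow(archive_page))
```

To check a live page you would first save or download its HTML and pass that text to `is_noindex_follow`.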
To reduce duplicate content within my domain I have taken these steps:
- I have set my RSS feeds (Settings > Reading) to “Summary” rather than “Full” to reduce content theft.
- I have a Copyright page and copyright notices, also to reduce content theft.
- I do not use a theme that displays full posts in Archive pages, Categories pages and Tags pages. Instead I use the Inuit Types theme as it is a theme that automatically provides excerpts of post content on the Front page, Archives, Categories and Tags pages.
- I copy and paste a sentence from my latest post into Google search a few hours after publication to search for duplicates.
- I use Copyscape to search for duplicates.
- I also use plagium (beta) to track plagiarism.
- I have set up Google Alerts for my domain names.
- I act immediately when I discover my content has been stolen and file a DMCA takedown notice when required.
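The step of pasting a sentence from a new post into Google search can be sped up with a small helper that builds the search link for you. This is a rough sketch under a few assumptions: the function name is mine, example.com stands in for your own domain, and it relies only on Google’s ordinary exact-phrase quotes and the -site: operator.

```python
from urllib.parse import urlencode


def duplicate_search_url(sentence, own_domain=None):
    """Build a Google search link for an exact sentence, optionally hiding your own site."""
    # Wrapping the sentence in double quotes asks for an exact-phrase match.
    query = f'"{sentence.strip()}"'
    if own_domain:
        # -site: leaves out results from your own blog so only copies remain.
        query += f" -site:{own_domain}"
    return "https://www.google.com/search?" + urlencode({"q": query})


# Hypothetical sentence and domain, for illustration only.
print(duplicate_search_url("my unique sentence", own_domain="example.com"))
```

Open the printed link in a browser a few hours after publication. Pick a sentence that is distinctive enough that it is unlikely to appear anywhere else by chance.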
Duplicate content across domains
Though it isn’t the only cause, the most obvious cause of duplicate content is people intentionally lifting content from other sites for their own use. Many content thieves are using Blogspot free hosting and AdSense (Google owns both) to make money from stolen blog content. In March Google changed its search algorithm with the “Panda update.” It was aimed at rooting out duplicate content from content farms, thereby delivering relevant results and enriching users’ search experience. The bad news is that Google’s new “Panda” algorithm is ranking some stolen content higher than the original versions.
Kunal Pradhan of Ahmedabad, India, posed this question to Matt Cutts of Google:
“Google crawls site A every hour and site B once in a day. Site B writes an article, site A copies it changing time stamp. Site A gets crawled first by Googlebot. Whose content is original in Google’s eyes and rank highly? If it’s A, then how does that do justice to site B?”
How can I make sure that Google knows my content is original?
Updated June 21st, 2011
Will showing recent posts on my homepage cause a duplicate content issue?
Further reading on the Google Panda Algorithm update:
Why you should offer partial feeds after Google Panda Update
The Panda that hates farms (Matt Cutts and Amit Singhal Wired interview)