Scraping Off the Blog Scrapers

Advertising income motivates blog scrapers to steal content. It’s only a matter of time before you discover your copyright has been violated and your content has been duplicated on a site you don’t want to be associated with. So let’s consider the impact of that duplication and how to counter it.

15 Plagiarism Detection Tools for Bloggers and Writers

There’s no difference between copyright law in cyberspace and print media; the same law applies both online and offline. You don’t have to remain a silent victim of make-money-blogging rip-off artists. You can check to see if your content has been stolen, and file DMCA take-down notices with the appropriate web host when it has been.

Reposting content from other sites

In one ear we hear that cross-posting a blog post or update to a number of other sites and social media platforms will increase reach and lead to success. In the other we hear that cross-posting too many automated links will get us classed as spammers. In this post you can view a video in which Matt Cutts of Google answers a cross-posting question.

Official Google Webmaster Central Blog: Raising awareness of cross-domain URL selections

Now Google will alert you when its algorithms canonicalize your URLs to a different domain.

“To be transparent about cross-domain URL selection decisions, we’re launching new Webmaster Tools messages that will attempt to notify webmasters when our algorithms select an external URL instead of one from their website. The details about how these messages work are in our Help Center article about the topic.” — via Official Google Webmaster Central Blog: Raising awareness of cross-domain URL selections.

Related posts found in this blog:
Can Google detect which content is original?
Duplicate Content in the SERPs Sucks!

Can Google detect which content is original?

Has your blog content ever been stolen? Have you ever used Google search and been incensed to discover the stolen version, i.e. the duplicate content, appearing in a higher position in the SERPs (search engine results pages) than your original article?

Duplicate content is content that can be accessed on more than one URL. “Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar.” If search engine spiders can’t tell which version of a web page or document is the original or canonical version, then the consequence will be reduced search visibility.

Duplicate content within a domain

Duplicate content within a domain is a common problem on blogs where multiple URLs can refer to the same content, for example, if you have full posts displaying in Archives, Categories pages and Tags pages. On self-hosted WordPress.org installs the “noindex, follow” robots meta tag can be used to instruct Google and the other search engines to crawl the page and follow its links, but not add the page to their indexes. This cannot be done on free hosted blogs on WordPress.com, as it’s a multi-user blogging platform where users cannot access and edit themes or templates. With Panda rolling out globally and Google advising the removal of duplicate and non-original content, what is one to do?
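
As a minimal sketch, assuming a theme whose archive, category, and tag templates you can edit (the exact template file varies by theme, so treat the placement as an assumption), the robots meta tag looks like this:

<!-- Placed in the <head> of archive, category, and tag templates.
     "noindex" keeps the page out of search engine indexes;
     "follow" still lets crawlers follow the links on the page. -->
<meta name="robots" content="noindex, follow">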

To reduce duplicate content within my domain I have taken these steps:

  1. I have set Settings > Reading to show “Summary” rather than “Full text” for each article in my RSS feed, to reduce content theft.
  2. I have a Copyright page and display copyright notices, also to deter content theft.
  3. I do not use a theme that displays full posts in Archives, Categories and Tags pages. Instead I use the Inuit Types theme, which automatically provides excerpts of post content on the Front page and in Archives, Categories and Tags pages.
  4. I copy and paste a sentence from my latest post into Google search a few hours after publication to search for duplicates.
  5. I use Copyscape to search for duplicates.
  6. I also use plagium (beta)  to track plagiarism.
  7. I have set up Google Alerts for my domain names.
  8. I act immediately when I discover my content has been stolen and file a DMCA take-down notice when required.

Duplicate content across domains

Though it isn’t the only cause, the most obvious cause of duplicate content is people intentionally lifting content from other sites for their own use. Many content thieves use Blogspot free hosting and AdSense (Google owns both) to make money from stolen blog content. In March Google decided to change the search algorithm by means of the “Panda” update. It was aimed at rooting out duplicate content from content farms, thereby delivering relevant results and enriching users’ search experience. The bad news is that Google’s new “Panda” algorithm is ranking some stolen content higher than the original versions.

Kunal Pradhan, Ahmedabad, India posed this question to Matt Cutts of Google:

“Google crawls site A every hour and site B once in a day. Site B writes an article, site A copies it changing time stamp. Site A gets crawled first by Googlebot. Whose content is original in Google’s eyes and rank highly? If it’s A, then how does that do justice to site B?”

How can I make sure that Google knows my content is original?

Updated June 21st, 2011

Will showing recent posts on my homepage cause a duplicate content issue?

Further reading on the Google Panda Algorithm update:
Why you should offer partial feeds after Google Panda Update
The Panda that hates farms (Matt Cutts and Amit Singhal Wired interview)

Duplicate Content in the SERPs Sucks!

The theme of this post is: don’t create multiple pages, subdomains, or domains with substantially duplicate content. Almost every day when I visit new blogs on the internet I spot duplicated content. The most common instance I witness is bloggers who set up free blogs on WordPress.com, where blogger-initiated advertising and duplicate content are not allowed, and then create a mirror site on a free Blogger blog containing all the same content, so they can benefit from the meagre income provided by Google AdSense. The second most common is published articles from article directories duplicated on multiple sites. The third is very similar content on multiple sites that differs only in that a few words or paragraphs have been added to the core text.

What constitutes duplicate content?

Duplicate content is content that can be accessed on more than one URL.
“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:

  • Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
  • Store items shown or linked via multiple distinct URLs
  • Printer-only versions of web pages

If your site contains multiple pages with largely identical content, there are a number of ways you can indicate your preferred URL to Google. (This is called “canonicalization”.) However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results.”

Why is duplicate content an issue?

One of the biggest issues with SEO is duplicate content. If search engine spiders can’t tell which version of a web page or document is the original or canonical version, then the consequence will be less than ideal search visibility. Most duplicate content is created by blog-scraping sploggers who steal content by subscribing to RSS feeds. Some duplicate content is created by the authors of the content themselves, and the latter is what this article focuses on.

Search engines are designed to provide the most relevant results to those who use them. When a blog fails to make the ascent to the top of the search engine rankings and SERPs (search engine results pages), the issue of duplicate content often arises. Search engines like Google, Yahoo, Bing, and Ask have developed tools and filters that locate and remove web pages containing duplicated content, in order to deliver the most relevant and timely results to searchers. Duplicate content does not have to be identical to be spotted and removed by a search engine crawler, but web pages with a similarity of over 60% will very likely be detected, impeding any ranking success a blogger is aiming for.

Matt Cutts of Google introduces the canonical link element

Whenever content on a site can be found at multiple URLs, it should be canonicalized for search engines. This can be accomplished using a 301 redirect to the correct URL, using the rel=canonical link element or, in some cases, using the Parameter handling tool in Google Webmaster Central. The ways of properly handling cross-domain content duplication are found in Handling legitimate cross-domain content duplication on the Official Google Webmaster Central Blog.
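
For illustration (the URL is a placeholder, not a real post), the canonical link element is a single line placed in the <head> of each duplicate or alternate page:

<!-- In the <head> of every duplicate or alternate URL; the href
     (a placeholder here) is the one preferred URL that search
     engines should index and credit. -->
<link rel="canonical" href="http://example.com/original-post/" />

A 301 redirect consolidates more forcefully, since human visitors are sent to the preferred URL as well; the canonical link element is the gentler option when the duplicate pages need to remain accessible.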

Get with the program, please!

On my regular read-around today I came across the following comment relating to traffic generation and link building:
“Submit some of your more popular posts to article directories in order to gain greater exposure.”
“Let me just make myself 100% clear on this statement…. It is false; do not submit any content from your site/blog to article marketing directories. If you do, it will be labeled duplicate content and no doubt your page will be thrown into the supplementary index.” — Tim Grice in SEO – Some Common Newbie Mistakes

1. It seems clear to me that those creating duplicate-content mirror blogs on WordPress.com and Blogger (Blogspot) are motivated by greed, and fall into the group who deliberately duplicate content across domains in an attempt to manipulate search engine rankings and/or secure more traffic. I report all such sites when I encounter them.

The types of blogs allowed and not allowed on the WordPress.com blogging platform, and the Terms of Service, prevent using a WordPress.com blog as a publicly available and indexed duplicate-content blog. WordPress.com Staff will suspend or delete all duplicate-content blogs reported to them. If you have exported your content out of a blog on another blogging platform such as Blogger, Israblog, LiveJournal, Movable Type, TypePad, Posterous, Spaces, Tapuz, Vox, or Yahoo! 360, and then imported it into a WordPress.com free hosted blog, change the visibility on the original blog to “private” so there will be no duplicate-content issue. If you don’t do that, then my understanding is that the first content to be indexed will be considered the original, and all other copies will be considered duplicates.

2. EzineArticles and most article directories do accept articles that have been previously published elsewhere, provided you are the person who holds the copyright to them. However, HubPages, Buzzle, eHow and Knol do not allow duplicate content. They want only unique content on their sites and will delete your article(s) and your account if you persist. It seems to me that anyone who can write can also rewrite. So smart bloggers are not duplicating content and letting the copies in article directories, etc. outrank their blog content in the SERPs.

3. Reputable blog directories do not allow duplicate-content sites to be registered. If such sites do slip in under the radar and are reported to the directory admins, they will be deleted from the directory.

4. When syndicating content via RSS, create different versions of the article you want to syndicate, rather than posting the same article everywhere.

Further reading: Six Easy Ways to Eliminate Pesky Duplicate Content

Plagiarism checkers:

There are many free plagiarism checkers you can use online. Copyscape is one: it lets you detect duplicate content and check whether your articles are original.

plagium (beta) – Track plagiarism by pasting your original text.

Conclusion:

I rely on search engines to do research for my contracted work and prior to creating and publishing blog posts. And I resent going through screen after screen of duplicated-content results presented to me in the SERPs. I think it is a good strategy for search engines to penalize sites with duplicate content by omitting them from the search results. Google’s algorithm will continue to be adjusted over time to fit one simple goal: return the most relevant, helpful pages for any particular search. Really? Then why isn’t Google doing a better job? Duplicate Content in the SERPs Sucks!

Update: Google Webmaster Central: Duplicate content summit at SMX Advanced.