Beating Duplicate Content with Robots.txt

By: John Elder | Posted in SEO


Hello good people!

Yesterday I posted a video explaining the Google Supplemental Index (video removed) and how to get your web site out of it by tweaking your robots.txt file to keep Google from indexing certain duplicate pages on your site.

In this article I’m going to explain that in more detail.

First off, what is a robots.txt file?

Search engines constantly scan your web site to determine how to rank it. They do that by sending automated programs to your site. Those programs are called spiders, robots, or some other funny little name like that.

A robots.txt file is a simple text file that you place on your web site. All robots check it before they scan your site, and it tells them what is, and what isn't, okay for them to scan.

It's YOUR site, after all. If you don't want those grubby little spiders scanning your pages and running up your hosting bandwidth, it's perfectly all right to tell them to go away.

Of course, we don’t want to tell them to go away completely, because if they can’t scan your site, they can’t add your site to their search engine.
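Just for reference, here's what telling every robot to stay away from your entire site looks like (don't use this unless you really mean it):

User-agent: *
Disallow: /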

What we want to do is selectively tell them what they can and can't scan, so they don't freak out when they find duplicate content.

Why would you have duplicate content? Lots of reasons. If you run your site on WordPress, then you probably have LOTS of duplicate content in the form of archive pages, category pages, calendar pages, etc. that WordPress creates automatically every time you make a blog post.
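For example, one and the same blog post often shows up at several different addresses like these (hypothetical paths — yours will depend on your theme and permalink settings):

http://www.EXAMPLESITE.com/my-blog-post/
http://www.EXAMPLESITE.com/category/seo-news/
http://www.EXAMPLESITE.com/2013/05/
http://www.EXAMPLESITE.com/author/john/

To Google, that looks like several different pages all carrying the same content.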

So we want to add all that duplicate nonsense to a robots.txt file so Google can’t see it, while at the same time allowing Google to scan the main parts of your web site.

Here's an example robots.txt file:

Sitemap: http://www.EXAMPLESITE.com/sitemap.xml

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /author/
Disallow: /page/
Disallow: /category/
Disallow: /category/search-engine-optimization
Disallow: /category/seo-videos
Disallow: /category/episode-transcripts
Disallow: /category/keyword-marketing
Disallow: /category/seo-news
Disallow: /category/internet-marketing
Disallow: /category/uncategorized/
Disallow: /wp-images/
Disallow: /images/
Disallow: /backup/
Disallow: /banners/
Disallow: /archives/
Disallow: /trackback/
Disallow: /feed/
Disallow: /2013/
Disallow: /archive

Feel free to use this robots.txt file as a template for your own site.

It's pretty straightforward. The first line tells the robots where to find my main sitemap. Every other line is simply a part of my web site that I don't want the spiders to scan.

You can see I've banned them from scanning my category pages, my archive pages, my WordPress administrative login pages, backup pages, author pages, etc.

It's really just that simple. Go through your own blog and find all the duplicate-content-type pages your WordPress theme creates, plunk them into a robots.txt file, and upload it to the main (root) folder of your web site so that the file can be found at www.yoursite.com/robots.txt
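If you want to double-check that your file behaves the way you expect once it's uploaded, here's a quick sketch using Python's built-in urllib.robotparser module (EXAMPLESITE.com and the sample paths are just stand-ins for your own domain and pages):

from urllib.robotparser import RobotFileParser

# Point the parser at your live robots.txt file and download it
rp = RobotFileParser()
rp.set_url("http://www.EXAMPLESITE.com/robots.txt")
rp.read()

# Ask whether a generic robot ("*") is allowed to fetch a given URL
print(rp.can_fetch("*", "http://www.EXAMPLESITE.com/category/seo-news/"))  # should print False
print(rp.can_fetch("*", "http://www.EXAMPLESITE.com/some-blog-post/"))     # should print True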

That’s all there is to it!

Do you use a robots.txt file in any sort of creative way? Comment below…

-John Elder
The Marketing Fool!

John Elder is an Entrepreneur, Web Developer, and Writer with over 27 years' experience creating & running some of the most interesting websites on the Internet. Contact him here.


