How To Tackle The Duplicate Content Problem

by Gary on September 26, 2007 · 19 comments

in SEO Basics

Google and other search engines can penalise you for having duplicate content on your website. The duplication may be intentional or unintentional on your part. If it is intentional, then either you have your reasons or you can get rid of the problem. If, on the other hand, duplicate content is being generated on your website without your knowledge, you can end up paying a big price through no fault of your own.

Duplicate content can be generated by the way your CMS builds pages, by multiple versions of the same content, or when both the http:// and http://www. versions of your domain serve the same pages.

For Google, http://www.phoenixrealm.com and http://phoenixrealm.com are two separate URLs, whereas for you and me they are the same thing. This is because you could, in theory, run two different websites at those addresses (one www and one non-www), although that rarely happens. When Google finds two URLs that appear to carry the same content, it indexes the one it considers the most apt, and in doing so it can easily miss the page you actually want it to find.

The duplicate content problem can significantly harm your search engine rankings. Fortunately, canonical redirection can solve this problem of content duplication.

So what is canonical redirection?

With canonical redirection, you pick one canonical page from among the pages carrying the same content and redirect all the others to it. So if someone goes to http://phoenixrealm.com, he or she should be automatically redirected to http://www.phoenixrealm.com. Similarly, http://www.phoenixrealm.com/index.php should be redirected to http://www.phoenixrealm.com, as these two URLs are also treated as separate by Google.

How can you achieve canonical redirection?

For most websites running on Apache the easiest solution is editing the .htaccess file. This file normally lives in the root folder of your website. Using your FTP program you can check whether the file already exists (if you have never needed one, it may not be there). If it does exist, download it and add to it, because you don't want to lose the directives already in the file. If it doesn't exist, you can create it with Notepad or any other text editor. In the .htaccess file add the following lines:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^domain\.com$ [NC]
RewriteRule (.*) http://www.domain.com/$1 [R=301,L]

Replace "domain" with your own domain name. The 301 tells search engine crawlers that the non-www URL has been permanently moved to the www version.
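The index.php case mentioned above can be handled in the same file. As a rough sketch only (assuming your home page is served by index.php; adjust the file name to whatever your site actually uses), something like this below the lines above should do it:

# Redirect requests for index.php back to the bare site URL.
# Checking THE_REQUEST means we only act when the visitor actually asked
# for index.php, which avoids a redirect loop with Apache's DirectoryIndex.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /index\.php
RewriteRule ^index\.php$ http://www.domain.com/ [R=301,L]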

If you would rather not touch your .htaccess file, you can achieve the same redirection with a few lines of PHP. Here's how you do it:

<?php
// Redirect any non-www request to the www version of the same URL
if (substr($_SERVER['HTTP_HOST'], 0, 3) != 'www') {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://www.' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI']);
    exit; // stop rendering the page once the redirect header has been sent
}
?>

You can insert this code somewhere in your header.php or top.php file.
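One caveat: the header() calls only work if nothing has been sent to the browser yet, so the snippet must run before any HTML output. A minimal sketch of the placement, assuming you keep the redirect code in a file called canonical.php (a made-up name, purely for illustration):

<?php
// The redirect has to run before any HTML is echoed, otherwise header()
// fails with a "headers already sent" warning.
require_once 'canonical.php'; // hypothetical file holding the redirect snippet above
?>
<html>
<head><title>Example page</title></head>
<body>Page content goes here.</body>
</html>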

  • WTL

Very true. I would like to share this list of 301 redirect methods, which can be very handy as it covers the variety of server configurations you might run across.

    • Gary

      Yep, nice list – Thanks.

  • Adrian

What do you think about a slash at the end, like domain.com/some-article and domain.com/some-article/? Are these two different URLs?

    • Gary

      I believe these are both the same URL, so no problem there.

      Sure someone will correct me if I am wrong :-)

      • SEO Carly

A trailing slash on a folder makes a different URL; on the homepage, however, they are the same because Apache enforces the slash.

        For instance the Google Adsense page:

        http://www.google.com/intl/en/ads/

        http://www.google.com/intl/en/ads

With Apache you can have 2 different pages at these URLs. PhoenixRealm forces a trailing slash, which is the correct way to do it and stops these problems. You can try it by deleting the last / on this page's URL and hitting enter.

Apache can also show the same page with and without the slash, and the two can have 2 different PageRank values, which will result in duplicate problems.

My site is like this. I'll have to fix it one day, but it's only a few pages and the trailing slash versions aren't referenced anywhere or cached, so it's not a problem.

You can see it by going to /contact, which is a PR4, and /contact/, which shows the same page but PR N/A.

        • Gary

Argh!! I had to approve this for some unknown reason. Hopefully the new version of WP will cure this glitch, it's starting to get on my nerves.

On my servers all folders are redirected to a URL with a trailing slash. Like you say, it must be the way Apache is set up; I did not realise it could be set up any differently.

          Something else for me to watch out for – Thanks Carly.

          • SEO Carly

No probs Gary, you can use .htaccess to enforce a trailing slash on all URLs that are not files, which will clear up any canonical problems. For example this code will force:

            /my-great-page

            to

            /my-great-page/

            In .htaccess put:

            # Force Trailing Slash
            RewriteCond %{REQUEST_URI} ^/[^.]+[^/]$
            RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1/ [R=301,L]

As for the comment moderation, I think it just happens to me :(

I posted and it vanished. I have a tendency to highlight & copy before I submit in case something goes wrong, so I posted again and WP told me I already said that. :)

          • Gary

            Thanks, this works a treat. Will use it on all sites in the future.

  • SEO Carly

I'm glad you did a post on Duplicate Content because a lot of people have a hard time understanding how Google deals with it.

While it's not a penalty as such, Google will choose the version which it deems more important or authoritative. Generally internal PageRank is the deciding factor, so the problem arises when Google lists the version you "don't" want displayed, such as a printer friendly version, an XML feed, a mobile version etc. that's devoid of your site's branding and navigation.

So duplicate content doesn't really harm your site; one version of the document will still enjoy good rankings. But by having multiple versions you are wasting link equity by passing weight to secondary duplicate documents, as well as wasting crawl budget by having a number of the same pages crawled when Google could be crawling your fresh unique content instead.

Also Gary, I had a peek and your Phoenixrealm.com/index.php URL isn't redirecting to the plain .com any more, and it's PR4 with a cache 5 days different from the .com, so it's getting some weight passed to it from somewhere and not redirecting.

    • Gary

      Ooops, how embarrassing!!

  • Vincent

Is it too late to implement this on my site if I have already been penalised for duplicate content?

  • SEO Carly

    @ Vincent

It's never too late, and it can only make your situation better. Using a combination of .htaccess to sort out your canonical URLs, plus robots.txt and nofollow on the versions of duplicate pages you don't want ranking (printer friendly, XML etc.), will ensure more weight (PageRank) gets passed to your good pages, which will help lift their rankings. (There's a rough sketch of the robots.txt part at the end of this comment.)

@ Gary lol don't worry, it happens. I don't know how many times I've had a "set and forget" configuration working, then tinkered with something else and undone it without realizing.

Speaking of .htaccess, I put in a rewrite to serve up an alternate image to a site hotlinking my pictures. In a rush I wrote up the .htaccess rule, went out and checked it a few days later, only to find it was serving a nasty "i am a bandwidth thief" image on my own site as well lol (a sketch of the kind of rule that avoids this is just below).

So you don't have to feel bad :-)
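For reference, the usual way to avoid serving the replacement image to your own visitors is to whitelist your own site (and empty referrers) before the rule fires. A rough sketch only, with example.com and nohotlink.png standing in for your own domain and replacement image:

# Hotlink protection sketch (example.com and nohotlink.png are placeholders)
RewriteEngine On
# Let requests with no referrer through (direct requests, some proxies)
RewriteCond %{HTTP_REFERER} !^$
# Let your own site through - forgetting this line is what serves the
# replacement image to your own visitors as well
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Never rewrite the replacement image itself, or the rule would loop
RewriteCond %{REQUEST_URI} !/nohotlink\.png$ [NC]
RewriteRule \.(gif|jpe?g|png)$ /nohotlink.png [L]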
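And to make the robots.txt part of the advice to Vincent concrete, a minimal sketch (the /print/ and /feed/ paths are only placeholders for wherever your duplicate versions actually live):

# robots.txt - keep crawlers away from duplicate versions of your pages
User-agent: *
Disallow: /print/
Disallow: /feed/

For an individual page you don't want indexed, a <meta name="robots" content="noindex,nofollow"> tag in its head section does a similar job.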

  • Chris Boswell

A very interesting read. I'd also point out that there are some more nuanced scenarios, such as when multiple sites quote substantively from the same source (another website, an academic journal, etc.). It can hold you back a couple of pages in the SERPs if this happens.

I've used a few methods to get around this. The first is a dead cert: I use an image for the material that's being quoted, so that only unique content gets spidered. Or (this one I'm still experimenting with) I use the blockquote tag, so that I'm signposting that I don't intend this to be taken as unique content. Google hates duplicate content and loves honesty, so I would expect this to work, but it'll be a while before I know whether it has definite efficacy.

  • SEO Carly

    A good method is to paraphrase the original material.

Say you have a several-paragraph article from a journal: after every paragraph, write a paragraph of your own views, opinions or thoughts.

    This way it’s only 50% duplicate, plus you have added your own keywords which will assist ranking for different things that the original would not.

I posted the major parts of the original Google thesis on my site and it pulls SERP traffic because I left out trivial paragraphs, so keywords now have different relationships to each other, plus the page title is different from the original.

    So subtle differences can change the scope of a document to an algo.

  • Chris Boswell

    I’m afraid you still risk modest penalisation doing it this way – use the blockquote tag and let the spiders know what is duplicate and what is original content, or put the quoted text into an image.

Paraphrasing is also effective, but 50% duplication is too much. It works much the same as the way I used to get my undergrads to understand what plagiarism is: only one in every 5 words should be the same. With a little skill you can still get the same key phrases in there, since volume of content, organisation of content and letting the bots know which bits are important are all different things.

I'm faced with the daunting task of working the same copy into 4 different versions at the moment for a new customer with lots of duplicated content across his sites. For some it's only costing a few places; for other sites it's costing him a few pages in the SERPs.

  • SEO Carly

"I'm afraid you still risk modest penalisation doing it this way"

Ok firstly, there is no such thing as a duplicate content penalty. When 2 documents are identical, Google will choose the more authoritative of the two to rank highest, all things being equal.

    However, both versions of the document can rank first on various phrases depending on the anchor text being pointed at it.

Say we both have an identical document, both with links from the same pages. If all my links use the anchor "SEO" and all yours have the anchor "SEO Company", mine will pull up for SEO and yours for SEO Company.

50% duplicate is fine in most cases; however, if you have a very weak site then you may want to write two paragraphs for every one, making it 33% duplicate.

    If you have a very strong site, you can publish it word for word and beat the original.

  • Chris Boswell

    Hi Carly,

I take your point here, but I have seen cloned content pages completely fail to rank. It happened to someone who nicked a page of mine a few years ago, and he had the cheek to come to me six months later and ask why he wasn't ranking. There were other factors involved as well, but this looked to be the primary reason for that page.

I do agree that it's possible to do well with duplicated content, and of course relevancy and importance are built up through targeted, contextual linking. I just find that the more unique the content, the better the results. In fact I got told off recently by another SEO I work very closely with in Canada for not dealing with enough duplicate content issues for a customer we both work with, so I've got this on my mind at the moment.

I believe the word from Matt Cutts on this subject is that the 'prettiest URL' often wins as well, between two sites with closely similar content (can't remember the blog – probably Dave N's).

As an aside, I'm amazed that I managed to write half-coherent sentences last night – I had just got back from a bad play with too many pints of beer swilling around inside me – apologies if I was a bit cranky.

  • Gary

This is related to my targeting geographic areas post.

    Take a look at this page:

    http://www.tamba.co.uk/web-design-and-development/web-design-liverpool.html

    95% of the content is the same as many other pages on the site, just the town/city changes.

The link in the site footer leads to many pages that are nearly exactly the same. Most pages have a PR of 5, which I find hard to understand and is at odds with the thread above.

Any ideas as to how/why these pages rank so well?

  • Gary