Duplicate Content – rel=Canonical and 301 Redirect Cures
Search Engine Optimization is a very tricky subject. It’s certainly something you should be aware of if you do any kind of business on the Internet. Unfortunately, there is also a lot of bad information about the topic out there. Sometimes that information is simply outdated. Other times it’s misinformed, and some “snake oil salesmen” will just tell you what you want to hear to get your business. Fortunately, Duplicate Content is a relatively simple concept for even non-technical minds to understand.
What is Duplicate Content?
Duplicate Content issues arise when a website serves the same content from multiple different URLs. As an example, let’s consider an internet retailer with many products. We’ll call the website Example.com, for demonstration purposes. The store management software for Example.com uses a Product ID parameter to query the database and show the correct product information. To access a particular product’s page, you’d need to visit a URL like http://example.com/product_info.php?product_id=2534. When you go to that URL, the server pulls ‘2534’ out of the parameter part of the URL and queries the database for all information about the product with an ID of 2534.
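To make that lookup concrete, here is a minimal Python sketch of the "pull the ID out of the URL" step. The function name is made up for illustration, and the real store software (a PHP script, judging by the URL) will certainly do this differently.

```python
from urllib.parse import urlparse, parse_qs

def product_id_from_query(url):
    """Return the product_id query parameter as an int, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("product_id")
    if not values:
        return None
    # Assumes the value is numeric; a real handler would validate this.
    return int(values[0])

product_id = product_id_from_query("http://example.com/product_info.php?product_id=2534")
```

The resulting number is what the store software would then hand to its database query.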
The store manager realizes that such a URL isn’t very memorable for people. It’s also difficult for a search engine robot (or spider) to deduce what that particular page is about. With a better user experience in mind, as well as Search Engine Optimization, the store manager decides to implement a “pretty URL” structure. On most Linux Apache servers, this is typically done within the .htaccess file. The store manager decides to use the Product’s Name as a part of the URL, in conjunction with the Product ID. So rather than the ugly, unmemorable, and non-descriptive URL shown above, the same product will now be found at http://example.com/full-product-name-product-2534.html. Now a search engine spider can tell without much analysis that the page in question is about “Full Product Name”, while at the same time your own web server knows that the Product ID of the page in question is 2534. This is done with URL rewriting. Essentially, the server matches whatever comes after “product-” and before “.html” and treats that as the Product ID. All is well! Users will see URLs that are more descriptive, while bots see URLs that give some indication of what the page is about.
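The matching step can be sketched in a few lines of Python. On Apache this would normally be a rewrite rule in .htaccess, so treat the pattern and the function name below as assumptions chosen for illustration, not the store's actual configuration.

```python
import re

# Hypothetical pattern: any slug, then "-product-", the numeric ID, then ".html".
PRETTY_URL = re.compile(r"^/(?P<slug>[a-z0-9-]+)-product-(?P<product_id>\d+)\.html$")

def rewrite(path):
    """Map a pretty URL path to the internal script URL, or None if it does not match."""
    match = PRETTY_URL.match(path)
    if match is None:
        return None
    return "/product_info.php?product_id=" + match.group("product_id")

internal = rewrite("/full-product-name-product-2534.html")
```

Notice that the slug is captured but never used: only the ID decides which product is shown.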
Introducing the Duplicate Content Problem
After seeing some great results from implementing “pretty URLs”, the store manager decides that some of the products should be renamed. As it turns out, products whose names include the color of the item have a higher conversion rate than those that don’t. So http://example.com/full-product-name-product-2534.html becomes http://example.com/red-full-product-name-product-2534.html. All good, right? Wrong!
The Core of the Duplicate Content Problem
You might think that everything is fine at this point. The new product URL works fine, and all the links on your site are automatically updated thanks to your E-commerce content management system. Users and bots are both directed to the new, correct URL. So, what’s the problem?
Search Engines like Google, Bing and Yahoo! have extremely long memories. Like, infinitely long. The Internet never forgets. They remember that http://example.com/full-product-name-product-2534.html pointed to a valid page on your site. They also remember that http://example.com/product_info.php?product_id=2534 pointed to a functioning page on your site, and they know that http://example.com/red-full-product-name-product-2534.html is a valid page on your site. All of these URLs are in Google’s index of the Internet. The problem is that the content at all 3 URLs is exactly the same. You could go to any of the 3 listed URLs and the page will appear identical to users and bots.
Why This Is a Problem
This becomes a problem for several reasons. The first is that over the course of the website’s life users have posted links to these product pages. As the site evolved, the URLs that people linked to evolved with it. There will be links pointing to all 3 URLs from all over the web. This means that your PageRank (in Google’s vocabulary) is split between 3 different URLs. The importance of the “Correct” URL is diluted from an algorithmic viewpoint. There might be 60 people that linked to your Full Product Name. But from a bot’s-eye view, it appears that only 20 people linked to each of your 3 different URLs. As a general rule of thumb, the importance of a particular page is determined by the number of links pointing to that page from around the Internet. This is an overly simplistic view, but it should help you understand why splitting those links across 3 different URLs can be an issue.
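The 60-versus-20 arithmetic can be shown with a small Python sketch. The even 20/20/20 split comes from the example above; the helper function and its pattern are assumptions made for illustration, and real ranking is far more involved than counting links.

```python
import re
from collections import Counter

# 60 inbound links, spread evenly over the three URLs of the same product.
inbound_links = (
    ["http://example.com/product_info.php?product_id=2534"] * 20
    + ["http://example.com/full-product-name-product-2534.html"] * 20
    + ["http://example.com/red-full-product-name-product-2534.html"] * 20
)

def product_id(url):
    """Extract the product ID from either URL style (hypothetical helper)."""
    match = re.search(r"(?:product_id=|-product-)(\d+)", url)
    return match.group(1) if match else None

# What a bot sees: three separate pages with 20 links each.
links_per_url = Counter(inbound_links)

# What the publisher means: one product with 60 links.
links_per_product = Counter(product_id(url) for url in inbound_links)
```

Counted per URL, no single page gets credit for more than 20 links; counted per product, all 60 land in one place.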
Confusing the Algorithm
The second problem occurs when somebody goes to Google and types in “Full Product Name”. Ideally, Google looks at its index and says “Hey, there’s this page at Example.com that’s all about Full Product Name, we should offer that up as a choice to the user.” Except that Google now has 3 pages at Example.com that are about Full Product Name, and they’re all exactly the same. Which one to serve the user? The newest URL might be the most relevant, or it might be the oldest one because it’s more authoritative. Without some indication from the publisher, it’s nearly impossible to determine which is the “correct” URL for the page in question. The search engine is then forced to choose one of the URLs to serve up to the User, and it won’t necessarily be the one you want them directed to.
When dealing with E-Commerce sites especially, you’ll often have multiple parameters being used for a product page. If you’re selling your product outside of North America, you’ll probably be displaying products in multiple currencies, like Euros, British Pounds, or Australian Dollars. The currency to display is often passed to the server as a URL parameter. Let’s say the store offers 5 different currencies. If not handled properly, Google will see 5 different URLs, one for each currency. Pile this on top of the 3 different page URLs from above, and you’ve got 15 different URLs for the exact same content.
Many E-Commerce platforms also use Session IDs to track a user throughout the site and checkout process. These are typically unique strings that are randomly generated for every new user to the site. Unfortunately, a bot is typically handed a new Session ID each time it crawls the site, and when that ID is carried in the URL, every crawl produces a new URL. This can result in literally hundreds or thousands of variations of the same URL in the search engine’s index.
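Here is a rough Python sketch of how parameters multiply URLs, and how ignoring them collapses the variants again. The parameter names `currency` and `sid`, the two extra currency codes, and the function name are all assumptions made for this example.

```python
from itertools import product
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Assumed names for parameters that do not change what the page is about.
IGNORED_PARAMS = {"currency", "sid"}

def strip_ignored_params(url):
    """Return the URL with display-only parameters removed from its query string."""
    parts = urlparse(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query)
            if key not in IGNORED_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

pages = [
    "http://example.com/product_info.php?product_id=2534",
    "http://example.com/full-product-name-product-2534.html",
    "http://example.com/red-full-product-name-product-2534.html",
]
currencies = ["USD", "CAD", "EUR", "GBP", "AUD"]  # 5 currencies, codes assumed

# 3 page URLs x 5 currencies = 15 crawlable variants.
variants = [
    page + ("&" if "?" in page else "?") + "currency=" + code
    for page, code in product(pages, currencies)
]

# Dropping the currency parameter brings the 15 variants back down to 3.
distinct_pages = {strip_ignored_params(url) for url in variants}
```

Session IDs behave the same way, except that there is no fixed list of values, so the number of variants has no upper bound.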
There’s also a nefarious group of people on the web who like to practice what is known as Negative SEO. The premise behind Negative SEO is fairly simple: You own a website that competes with Example2.com. You’ve done everything in your power to optimize your site using White Hat, legitimate SEO strategies and tactics. Despite your best efforts, Example2.com still ranks better than your site in the Search Engines. This could be due to age, domain name, or any of the other hundreds of signals Search Engines use to rank their result pages.
Most people, the honest, hardworking folks most of us live and work with every day, would be content that they’ve tried their best, and will settle for #2 or #3, or #120 on Google’s SERP (Search Engine Results Page). Business will continue, and hopefully by applying SEO Best Practices their page will climb the SERPs. They continue to work hard and optimize the page for success.
As with most situations in any business, there is a dark side… an opportunity to do evil. In the Brick and Mortar business world, the easiest way to beat your competitor is to simply burn his business to the ground. In essence this is what Negative SEO is. Rather than putting resources into making their own business better, its practitioners put resources into taking their competitors down.
People with a more technical background will understand the URL rewriting techniques described above. They’ll also understand that because the product page in question can be accessed at http://example.com/full-product-name-product-2534.html or http://example.com/red-full-product-name-product-2534.html, it stands to reason that the same page can be accessed at http://example.com/terrible-low-quality-product-2534.html or http://example.com/do-not-buy-this-product-2534.html, or any variation that ends with -product-2534.html.
Black Hat SEOs know that Google looks at the words contained in a URL as descriptive. All the Black Hat needs to do is place various links around the Internet pointing to some made-up URLs that satisfy the URL rewriting pattern, and they’ll create a duplicate content issue for Example.com with each one. They’ll also be associating the Correct Content with URLs that contain a lot of “negative” keywords. Thus, rather than pulling themselves up the SERPs, all they need to do is push you down.
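A short Python sketch shows why this works. The pattern below is an assumed stand-in for the store's rewrite rule; it is deliberately as loose as the one described earlier, checking the ID but never the words in front of it.

```python
import re

# Assumed rewrite pattern: the slug is accepted without any validation.
LOOSE_PATTERN = re.compile(r"^/[a-z0-9-]+-product-(\d+)\.html$")

paths = [
    "/full-product-name-product-2534.html",
    "/red-full-product-name-product-2534.html",
    "/terrible-low-quality-product-2534.html",
    "/do-not-buy-this-product-2534.html",
]

# Every path, including the two invented ones, resolves to the same product ID.
resolved = {path: LOOSE_PATTERN.match(path).group(1) for path in paths}
```

As far as the server is concerned, all four paths are the same perfectly valid page.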
Solving Duplicate Content Problems
Lucky for us, the major search engines have provided us with solutions to this problem. The oldest is called a 301 Redirect, while the newer actor is the rel=canonical tag.
A 301 Redirect is a simple concept. The 301 code is returned by your web server to the user’s browser, and it means “Moved Permanently”. So when you type in http://example.com/full-product-name-product-2534.html or click on a link to that URL, your web server tells the user’s browser that the page has permanently moved to http://example.com/red-full-product-name-product-2534.html. Search Spiders get the same notification, and when they’re working correctly they’ll update their index and remember that all references to http://example.com/full-product-name-product-2534.html should now refer to http://example.com/red-full-product-name-product-2534.html. It’s like a change of address notification, except for the Internet. All of this happens seamlessly, and the user won’t know anything has changed unless they’re inspecting page URLs while they browse. Which is the kind of thing only geeky web developers do. The only downside of using 301 Redirects is that they usually require manual human input. Which is to say that they tie up resources that could be better applied elsewhere.
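The decision a server makes can be sketched in Python as follows. This is only an illustration under stated assumptions: the `CURRENT_SLUGS` table stands in for a lookup in the store's product catalog, and the pattern and function name are invented for the example.

```python
import re

# Hypothetical stand-in for looking up the current product name in the catalog.
CURRENT_SLUGS = {2534: "red-full-product-name"}

PRETTY_URL = re.compile(r"^/(?P<slug>[a-z0-9-]+)-product-(?P<product_id>\d+)\.html$")

def respond(path):
    """Return (status, location): 301 plus the current URL for outdated slugs."""
    match = PRETTY_URL.match(path)
    if match is None:
        return 404, None
    product_id = int(match.group("product_id"))
    current_slug = CURRENT_SLUGS.get(product_id)
    if current_slug is None:
        return 404, None
    if match.group("slug") != current_slug:
        # Old or made-up slug: tell the client the page has moved permanently.
        return 301, f"/{current_slug}-product-{product_id}.html"
    return 200, None

old_url_response = respond("/full-product-name-product-2534.html")
```

A check like this would also answer the made-up Negative SEO URLs with a redirect to the real page, though whether your platform can do it automatically depends on the software.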
An alternative method has come into existence in the last few years, and that’s the rel=canonical link tag. This tag indicates to a Search Engine which URL should be considered the authoritative URL for the given page. While there may be links pointing to several different URL variations of the same page, Google will know which is the URL you want Users to arrive at. Unlike a 301 redirect, it won’t change the URL the User sees. In fact, it won’t change the User Experience at all; the only things that read such tags are Search Engines.
So while the store manager might update and refine the Product Name multiple times in a week, each time Google crawls the site it will see the new and correct URL. Simple, right? At least in theory it works this way. In practice, Canonical Tags can literally take months to have the desired effect. That’s why it’s so important to have Canonical Tags in place from the initial setup of the website. The upside, however, is that they require next to no overhead to maintain. If a page’s URL is changed, the old URLs of that page should automatically list the current URL as the Canonical URL of that page. This makes it much easier for Google to know which URL to return in its results.
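For reference, the tag itself is a single line in the head of the page. Here is a minimal Python sketch of generating it; the function name is made up, and your platform’s templates will have their own way of doing this.

```python
from html import escape

def canonical_link(canonical_url):
    """Build the rel=canonical element that belongs in the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'

tag = canonical_link("http://example.com/red-full-product-name-product-2534.html")
# tag is:
# <link rel="canonical" href="http://example.com/red-full-product-name-product-2534.html">
```

Every variation of the product page would emit this same tag, whichever URL was used to reach it.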
Hopefully you’ve learned a little bit about Duplicate Content issues, how they can negatively affect your SERP rankings, and what you can do to solve the problem. It may seem like a trivial concept at first, but I’ve worked with clients who should have had 1,000 pages in Google’s index, but in fact had over 100,000 pages. That’s 100 variations of each and every page. A lot of websites can perform quite well despite Duplicate Content issues. But it’s largely dependent on how competitive the keywords they’re targeting are. In highly competitive industries, even a marginal penalty can have large repercussions. Luckily, if you need some help with Duplicate Content issues, I know a guy.