Feb 13, 2009

rel=canonical

I'm glad to see that yesterday all three major US search engines jointly announced support for the rel=canonical tag.

I'm more excited than most people, because I'm largely responsible for making this idea a reality at Google. To be fair, many people have had similar ideas in the past, and the effort was certainly a joint one between Google, Yahoo, and Microsoft. I just did my small part.

I was speaking at SES San Jose in 2007 and listening to the frustrations of webmasters dealing with canonicalization issues. In the past, our suggestion was to always do a 301 redirect to the canonical version of a page if Google wasn't getting things straight. Usually we do "the right thing" automatically, but we aren't perfect. A 301 was the answer when you wanted to be sure (and it is still a valid answer).

However, there were some scenarios that people pointed out to me for which 301 was not ideal:

  1. 301 simply won't work for print-friendly versions of a page.
  2. The same is true for any page where the content is constant but the UI elements on the page change with the URL, for example a sort field. Sure, you could use cookies or other mechanisms to track sort fields, but that breaks bookmarking and sharing URLs with other people, and it breaks the entire experience if the user has disabled cookies.
  3. Many webmasters don't have control over the server headers, for example those on free hosts like Blogspot, or webmasters who aren't terribly tech-savvy.
  4. Sometimes a system or a client requires session IDs in URLs to track users throughout a site.
  5. A 301 redirect adds an extra round trip: the user's browser has to make two requests to your server before it gets the content, which can roughly double the load time of a page. Faster is my favorite feature.

Here is a specific example (URLs are fake, no need to click on them):

  • http://stuff.com/breadcrumbs/tents/bags/red/tent_bag.html
  • http://stuff.com/breadcrumbs/bags/tents/red/tent_bag.html
  • http://stuff.com/breadcrumbs/bags/tents/red/tent_bag.html?view=print

All of these pages have similar (although not exactly duplicate) content, as they are generated by mod_rewrite and a database script. Search engines might automatically group them in some cases and might mess up in others. Any of them (except for the printer version) can be the canonical, and the site owner probably doesn't care which, as long as exactly one of them is. The owner wants all of the URLs to exist (that is, doesn't want to use 301s) because they reflect the user's browsing path to the red tent bag product, and hence the URLs themselves are a good user experience. All PageRank and links pointing to any of these pages should flow to whichever canonical is chosen. There was no good solution to a problem like this. You had to sacrifice something: usability or search engine optimization.

The new rel=canonical suggestion is simply to add the same single tag to all three of these pages:

<link rel="canonical" href="http://stuff.com/breadcrumbs/bags/tents/red/tent_bag.html">

In this case, I arbitrarily sorted the breadcrumbs. This could easily be done server side without having to manually pick which path I want to be the canonical. I could sort by something other than alphabetical too if there was a reason I wanted one URL in the index instead of another.
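To make the server-side idea concrete, here is a minimal Python sketch. It assumes the URL layout from the fake example above (a fixed "breadcrumbs" prefix, then breadcrumb segments, then the page name) and assumes the query string only changes presentation, as with ?view=print. The function names are made up for illustration. Note that a plain alphabetical sort gives bags/red/tents rather than the bags/tents/red order I used in the tag above; which order you pick doesn't matter as long as every variant maps to the same URL.

```python
import html
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url, prefix="breadcrumbs"):
    """Map every breadcrumb variant of a page to one canonical URL.

    Assumed layout (hypothetical): /<prefix>/<crumb>/.../<page>
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if len(segments) < 3 or segments[0] != prefix:
        # Not a breadcrumb-style URL: leave it alone.
        return url
    # Sort only the breadcrumb segments; keep the prefix and page name fixed.
    crumbs = sorted(segments[1:-1])
    path = "/" + "/".join([prefix, *crumbs, segments[-1]])
    # Drop the query string and fragment (assumed to be presentation-only).
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def canonical_link_tag(url):
    """Build the tag to emit in the <head> of the page served at `url`."""
    href = html.escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="%s">' % href


if __name__ == "__main__":
    for u in (
        "http://stuff.com/breadcrumbs/tents/bags/red/tent_bag.html",
        "http://stuff.com/breadcrumbs/bags/tents/red/tent_bag.html",
        "http://stuff.com/breadcrumbs/bags/tents/red/tent_bag.html?view=print",
    ):
        print(canonical_link_tag(u))
```

A CMS could call something like canonical_link_tag from its page template, so all three example URLs emit an identical tag without anyone picking a canonical by hand.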

I wouldn't expect most webmasters who work on their sites by hand to use this often, but for CMS software it gives an easy way to avoid using 301s or robots.txt for canonical issues. In many cases where webmasters use packaged CMS software, they need not even be aware of the canonical tag.

I've seen some discussion on the web indicating that this is just a hacky way to solve problems that are simply a reflection of bad website design, or people who are concerned that they need to implement yet one more thing for all of their web sites. I think people are missing the cases where the previous options really weren't ideal. Also, this is not required maintenance - all search engines rightly assume that the vast majority of the web won't mess with this stuff and as a result they all do their best to crawl and index the "right way" regardless.

There are definitely other great tools out there for solving these problems: cookie-based sessions, 301s, robots.txt. Some time ago Yahoo even launched a feature in Site Explorer called dynamic URL rewriting. I think all of these tools are great. One of the best parts about rel=canonical, though, is that any search engine out there can support it easily without webmasters having to change a thing. Set up rel=canonical tags for Yahoo, and Google, Microsoft, and others can start indexing your content more effectively without any additional work on your part as a webmaster.
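To give a feel for why the tag is easy for any crawler to consume, here is a rough Python sketch that reads it from a page using only the standard library. This is an illustration, not how Google or any other search engine actually processes the tag. The CanonicalFinder class and find_canonical function are made-up names, and falling back to the page's own URL when no tag is present is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated values, in any letter case.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonical = attr["href"]


def find_canonical(page_url, html_text):
    """Return the canonical URL declared by a page, or page_url if none."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    if finder.canonical is None:
        return page_url
    # Resolve a relative href against the URL the page was fetched from.
    return urljoin(page_url, finder.canonical)


if __name__ == "__main__":
    page = (
        '<html><head><link rel="canonical" '
        'href="http://stuff.com/breadcrumbs/bags/tents/red/tent_bag.html">'
        "</head><body>Red tent bag</body></html>"
    )
    print(find_canonical(
        "http://stuff.com/breadcrumbs/tents/bags/red/tent_bag.html", page))
```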
