May 6, 2009

Why do we even need URL Shorteners?

My first thought was to title this post "Why are URLs long?" but I realized that the reason I'm writing it is the recent issues being raised around URL Shorteners (such as TinyURL).  While this post is over a month late to the party, the context still seems relevant.

So, why do we even need URL Shorteners?  The answer is simple: because URLs are too long.  This may be an issue made more obvious by Twitter, cell phones, or any kind of manual text entry, but it isn't limited to them.  Essentially, most interesting content on the web has a URL that is too long to remember/type in/share.  This can be a problem if you are:
  1. Sending an email to someone who uses a crappy email client that wraps (breaks) lines over some character limit.
  2. Hanging posters in your dorm with a URL to get more information.
  3. Giving a talk at a conference and want the audience to write down/remember some URL later.
  4. Having a verbal conversation with a friend: "I'll send you a link later" is a symptom of this issue.
Worse than just long, most URLs are a crappy User Interface.

Root Causes:
When "moving pictures" (video) first became available to a large audience, we largely just recorded plays - what we were used to pre-video.  Only with time did we learn that the new medium afforded interesting new possibilities: camera angles, shifting scenes, overlaid audio, special effects, etc.

The web evolved similarly.  In the original web, most web servers were designed to be a way to access a collection of files on a server somewhere.  We were familiar with file systems, and the pre-web internet was a lot of FTP and BBS servers.  Our URLs naturally then mirrored file systems.  There was certainly nothing that I know of in the HTTP spec that said they had to.  This got us into some trouble:

With the file system as a metaphor, URLs got extensions (.html, .php, .asp).  Even though the HTTP spec defined a way to communicate the content type outside the URL structure, we were familiar with the extension UI element.  However, the vast majority of the URLs we interacted with were all one content type: HTML.  Sure, HTML embedded .gif and .js, but users didn't often interact with those URLs directly; they were hidden.  What type of software generated the page (.php, .asp, .jsp) wasn't remotely interesting.  For the vast majority of URLs we were viewing, the information presented in the extension was redundantly obvious or plain irrelevant.  Even this post will have a URL that ends with .html, 5 characters of needless redundancy!
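
To make the point about content types concrete, here is a minimal sketch in Python of a server-side handler that announces HTML through the Content-Type response header while the path carries no extension at all.  The path /url-shorteners, the page body, and the helper name call are made up for illustration, and the app is invoked directly rather than over a real network connection.

```python
def app(environ, start_response):
    """Minimal WSGI app: the media type travels in a header, not in the path."""
    # Hypothetical page table; note the key has no ".html" suffix.
    pages = {"/url-shorteners": "<h1>Why are URLs long?</h1>"}
    body = pages.get(environ.get("PATH_INFO", "/"))
    if body is None:
        start_response("404 Not Found",
                       [("Content-Type", "text/plain; charset=utf-8")])
        return [b"not found"]
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body.encode("utf-8")]


def call(path):
    """Invoke the app without a socket and capture status, headers, and body."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app({"PATH_INFO": path, "REQUEST_METHOD": "GET"},
                        start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call("/url-shorteners")
print(status, headers["Content-Type"])
```

The browser learns that the response is HTML from the header alone, so in this sketch the extension adds nothing the client needs.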

With the file system as a metaphor, URLs became organized hierarchically into directories.  We grouped them by topic, date or whatever with well-defined levels of hierarchy.  Each file in one folder.  Most early HTTP servers would even automatically generate and serve an "index" page which listed all the files in a particular directory.  What was a weak metaphor for a hard drive file system became worse on the internet.  Hyperlinks made certain of that.  Instead of there being only one path to navigate through a series of directories to a document on the internet, links made sure there were plenty of paths to navigate.  Our URLs looked like a tree, but on closer inspection, we had really built a web.

Take this post for example.  Its path looks something like:

However, I sincerely doubt that you navigated to this post by first looking for documents that I created in 2009, followed by those I created in May (month 05).  You came through either a hyperlink or a feed reader.  The directory structure here is showing information that isn't usually that interesting to a user actually interacting with a URL.  How often are book titles based on Dewey Decimal categories?

Search Engines
The file system metaphor can't explain all our woes.  After all, who in their right mind would ever name a file something as long as why-do-we-even-need-url-shorteners.html?  And originally, web pages weren't named this way.  Had I chosen it, this page might have a name of url-shorteners.html or long-URLs-rant.html.  But then search engines came along.  And before long it became known that one of their ranking signals was words contained in the URLs.  Users didn't type in URLs anyway, right?  They just clicked on them, so it quickly became more important to create URLs for Search Engine Marketing than for Usability: more keywords are always better.

But you can't blame Search Engines.  People frequently named their pages with descriptive URLs.  Using this as a signal made lots of sense.  And once webmasters noticed it and reacted to it, this custom was only further reinforced.  As a result, we have why-do-we-even-need-url-shorteners.html (39 characters) instead of url-shorteners.html (19 characters).

The HTML spec isn't completely blameless either.  Since our metaphor was a file system, we never really expected significant amounts of dynamic content.  When HTML forms were designed, we imagined things like a way to leave a comment for a webmaster, or a way to upload a file.  After all, what other interactions had we really done in the days of FTP or BBS systems?

As content on the internet became more dynamic, forms started to be used more frequently for navigation: search boxes, preference settings pages, JavaScript drop-down elements.  All of these things created URLs that were strictly defined by how the HTML spec required GET method forms to interact.  For example, when submitting a form, even if only one of the fields is filled in, all of the fields become part of the URL: ?q=foo+bar&page=&sort=&width=  Repeated values create repeated keys as well: ?opt=red&opt=blue&opt=green  What a waste.
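
The form behavior described above can be reproduced with Python's standard urllib.parse module.  This is only a sketch: the field names come from the example query strings in this post, and compact_query is a hypothetical helper of my own, not anything the HTML spec provides.

```python
from urllib.parse import urlencode

# Fields as a browser would submit them from a GET form: every named text
# field is sent, even when the user left it blank.
form_fields = [("q", "foo bar"), ("page", ""), ("sort", ""), ("width", "")]
raw_query = urlencode(form_fields)

# A checkbox group or multi-select repeats the key once per chosen value.
options = [("opt", "red"), ("opt", "blue"), ("opt", "green")]
repeated_query = urlencode(options)


def compact_query(fields):
    """Hypothetical helper: drop blank fields before building the query."""
    return urlencode([(key, value) for key, value in fields if value != ""])


print("?" + raw_query)                   # ?q=foo+bar&page=&sort=&width=
print("?" + repeated_query)              # ?opt=red&opt=blue&opt=green
print("?" + compact_query(form_fields))  # ?q=foo+bar
```

A site that builds its own links, instead of letting the form submission decide, is free to emit only the last form.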


Historically each hostname (subdomain) generally referred to a different machine.  Most machines exposed to the internet were not running HTTP servers.  As a result, most uses of hostnames were for things other than a web browser.  Since the default was not HTTP, we needed a way to refer to the machine running the HTTP server.  A custom arose - the HTTP server would run on the machine named www.  It was short, easy to type, memorable, and unique.  These days with hardware load balancers, HTTP hostnames rarely refer to individual machines directly.  Instead a single hostname can refer to hundreds of separate machines.  However www has stuck around because people have come to expect it.  The mere presence of a www prefix calls up the concept of a web page in most minds.  As you'll notice, doesn't have a www and neither do url shorteners - 4 unneeded characters that will be with most URLs for a long time.

Change you can believe in:
Fortunately, this is not a chicken-and-egg problem.  If you run a website or a CMS, you could write better URLs today without waiting for your customers to do something first.  Not all chickens have that much control, but many do.  And many websites are already paying attention.  Take a close look at how Twitter carefully crafts their URLs to be user interface elements in themselves.

A few of my suggested rules of thumb, but first an important disclaimer.  I do work for a search engine company, but the opinions expressed on my blog are my own and not necessarily those of my employer.  These recommendations may not be valid in the context of search engine optimization.  They are simply my opinions about how URLs could be effectively used as a User Interface Element.  With that out of the way, here we go:

  1. Drop the www.  But if your users type it, make sure you still get them to the right place.
  2. Drop the extensions (.html, .php) for HTML pages - they are the default.  Keep them for non-HTML documents (PDF, images, text) because they are useful hints to a user about what to expect.
  3. Don't let HTML forms dictate your URL structure.  They are a necessary evil for actual user-input, but they create awful URL UI experiences.
  4. Use directory structures for things users care about, not uninteresting categorization.  Each level you add makes the URL longer and potentially harder to remember/reuse.
  5. URLs should be descriptive.  Long numbers are often really bad; a few words are really good.
Finally, think about the shortest URL for a given page that would still be specific and convey a lot of information about what you might expect to find there.

For example, this URL could easily have been as long as:  (80 chars)

Or it could potentially have been as short and descriptive as: (34 chars)
34 chars isn't bad.  Even a tinyurl would look like (25 chars).  And consider how much more information is conveyed in the short and descriptive URL for a cost of 9 measly characters.
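
As a rough sketch of rules 1 through 3 above, the following Python function rewrites a long URL into a shorter equivalent.  Everything here is an assumption for illustration: the function name shorten_url, the list of extensions treated as "HTML pages", and the example.com addresses, which are placeholders rather than the real URLs of this post.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of extensions that only signal "this is an HTML page".
HTML_EXTENSIONS = (".html", ".htm", ".php", ".asp", ".jsp")


def shorten_url(url):
    """Apply rules 1-3: drop www., drop HTML extensions, drop blank fields."""
    parts = urlsplit(url)

    # Rule 1: drop the www. prefix from the hostname.
    host = parts.netloc
    if host.lower().startswith("www."):
        host = host[len("www."):]

    # Rule 2: drop extensions for HTML pages, keep them for PDFs, images, etc.
    path = parts.path
    for ext in HTML_EXTENSIONS:
        if path.lower().endswith(ext):
            path = path[:-len(ext)]
            break

    # Rule 3: keep only the query fields the user actually filled in.
    fields = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode([(key, value) for key, value in fields if value])

    return urlunsplit((parts.scheme, host, path, query, parts.fragment))


print(shorten_url("http://www.example.com/url-shorteners.html?q=foo+bar&page=&sort="))
print(shorten_url("http://example.com/slides.pdf"))
```

On a real site the long form would normally answer with a redirect to the short form, so that users who still type the www. or the extension end up in the right place.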


Michael Wyszomierski said...

Right on, I generally find shorter URLs to be more readable, especially in contrast to some of the URLs generated by CMS systems. For example, even my high school's homepage redirects to something quite scary.

As for limiting directory structure to things that users care about, I agree, but think that on a blog, the 2009/05 bit actually falls into the "useful" category. When I'm searching for an old post on my blog, for instance, I often find it easier to hack the YYYY/MM values in the URL to get around instead of clicking on links. It also lets me identify in a search result if a post is old or not. If I see "2004" in the URL and I'm looking for the latest data on browser market share, I know not to click on that link.

Another good use of directory structure is found on Flickr. Despite its questionable search features, I can generally get what I want by browsing via the address bar:
Want pictures of cats?
Just my cats?

Greg said...

I agree. Flickr's URL structure is pretty nice too. The only gripe I would have is their insistence on using www. hostname prefixes. It's relatively minor though, and either version works for type-ins.

I disagree with you about dates being all that useful. They aren't completely useless, we just disagree on the degree. Admittedly, there is no perfect answer to the right URL structure - dates are just one tiny example and not a black&white one.

Patrick Chapman said...

Your url should be :P (29 chars) Also, what limits CMS's from allowing users to type OR Both seem like a reasonable url.

As far as url shorteners go, Ugh! They are useful for saving characters, but that's it. Just try to remember that 6 or 7 character hash for any length of time, forget it. It's a horrible UI for people.

I think there is a good middle ground between short urls and descriptive urls that hasn't been tapped yet.