Disabling URL rewriting for the Googlebot

by Jelmer Kuperus, September 8, 2008

HTTP is a stateless protocol. To work around the problems this causes, web applications have the concept of a session. When a user requests a web page for the first time, the user is assigned a unique 32-character string. This string can be sent along in subsequent requests to indicate that those requests in fact originate from the same user. The most common way to pass along this string, or session identifier, is to send it in a cookie. But what if a user chooses to disable cookies? In that case a servlet container falls back on URL rewriting: the session identifier is appended to the end of every link in your application. So a link to your homepage might look like this after rewriting:

/index.html;jsessionid=AA922A8B781AC4F95E68F88B0AF8CCB3

When you click this link, the container parses the jsessionid value and determines that you are the same user who made the previous request. This way even privacy-conscious users can continue to use your web site. All of this just works as long as you use something like the JSTL url tag. When it detects that the user has disabled cookies, it automatically starts rewriting all the URLs in your application.
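To make the parsing step concrete, here is a minimal sketch in plain Java of how a session identifier could be pulled out of a rewritten URL. It is an illustration only, not the code any servlet container actually runs; the class name SessionIdParser and the method extractSessionId are hypothetical.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; real containers do this internally. */
final class SessionIdParser {
    private static final String MARKER = ";jsessionid=";

    /** Returns the jsessionid path parameter of the URL, if there is one. */
    static Optional<String> extractSessionId(String url) {
        int start = url.indexOf(MARKER);
        if (start < 0) {
            return Optional.empty(); // URL was not rewritten
        }
        start += MARKER.length();
        // The identifier ends at the next path parameter, path segment,
        // query string, or fragment, whichever comes first.
        int end = url.length();
        for (int i = start; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == ';' || c == '/' || c == '?' || c == '#') {
                end = i;
                break;
            }
        }
        return Optional.of(url.substring(start, end));
    }
}
```

Calling extractSessionId on the example link above yields AA922A8B781AC4F95E68F88B0AF8CCB3, while a plain /index.html yields an empty Optional.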

Most of the time this is what you want. However, there is an unfortunate side effect to this strategy. The Googlebot, which constantly spiders the internet for new content, does not support cookies. This means that it will see, and index, the rewritten URLs. A quick search suggests that this is a fairly common problem. The rewritten URLs will hurt your Google ranking because less of the URL will match a user's search query. So how do you solve it? It turned out to be fairly trivial.

I created a response wrapper (encodeURL and encodeRedirectURL live on HttpServletResponse, so it is an HttpServletResponseWrapper) that overrides the encodeURL and encodeRedirectURL methods so that they no longer append the session identifier. A servlet filter applies the wrapper only when it determines that the request originates from the Googlebot. You can check this fairly easily by inspecting the User-Agent header sent along with every request. I included the source below:

SeoResponseWrapper.java
SeoFilter.java
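
As a rough illustration of the idea behind those two classes (this is not the original source), the sketch below models the decision in plain Java without the servlet API. The class name SeoUrlEncoder and both method names are hypothetical, and the substring match on "googlebot" is an assumption about how the User-Agent check might be done. Unlike a real container, this sketch always rewrites the URL for non-crawler requests; a container only does so while it cannot confirm that the client accepts cookies.

```java
import java.util.Locale;

/** Hypothetical, container-free model of the filter plus wrapper. */
final class SeoUrlEncoder {

    /** Assumption: a crawler is recognised by "googlebot" in its User-Agent. */
    static boolean isGooglebot(String userAgent) {
        return userAgent != null
                && userAgent.toLowerCase(Locale.ROOT).contains("googlebot");
    }

    /** Appends ;jsessionid=... to the URL unless the request is from the Googlebot. */
    static String encodeUrl(String url, String sessionId, String userAgent) {
        if (isGooglebot(userAgent)) {
            return url; // keep crawler-facing URLs clean
        }
        // The session id belongs to the path, so it goes before any
        // query string or fragment.
        int cut = url.length();
        int query = url.indexOf('?');
        int fragment = url.indexOf('#');
        if (query >= 0) {
            cut = query;
        }
        if (fragment >= 0 && fragment < cut) {
            cut = fragment;
        }
        return url.substring(0, cut) + ";jsessionid=" + sessionId + url.substring(cut);
    }
}
```

In a real application the same two decisions would sit in a servlet filter (the User-Agent check) and in the wrapper's encodeURL and encodeRedirectURL overrides (returning the URL unchanged).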