Google mini blues – 406 ownage!


We used to use a Google Mini to index our site and serve search results to our users. It worked great for a while, until at some point it just stopped indexing. It would hit a fatal error and stop after crawling a couple of URLs. It also wasn't kind enough to send me an email to let me know that it had stopped. It was still serving the pages it had indexed before the problem started, so we didn't realize it wasn't indexing new content.

Then one day we logged in to the admin console to see what was going on and found out it wasn't indexing. I checked the Mini's logs and saw that the crawl was stopping due to too many 4xx errors. The logs on our web server showed that Apache was returning a 406 (Not Acceptable) response code, which w3.org describes as:

The resource identified by the request is only capable of generating response entities which have content characteristics not acceptable according to the accept headers sent in the request.

Unless it was a HEAD request, the response SHOULD include an entity containing a list of available entity characteristics and location(s) from which the user or user agent can choose the one most appropriate. The entity format is specified by the media type given in the Content-Type header field. Depending upon the format and the capabilities of the user agent, selection of the most appropriate choice MAY be performed automatically. However, this specification does not define any standard for such automatic selection.
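For anyone who wants to check their own web server logs for the same symptom, here is a minimal sketch that counts response codes in an Apache access log. It assumes the common/combined log format; the sample lines, IP addresses, and the "crawler" user agent are made up for illustration and are not taken from our logs.

```python
import re
from collections import Counter

# Matches the quoted request line and the status code that follows it
# in an Apache common/combined format log line (assumed format).
LOG_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def count_statuses(lines):
    """Count HTTP status codes in an iterable of access-log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:  # silently skip lines that do not look like log entries
            counts[match.group("status")] += 1
    return counts


# Hypothetical sample lines, invented for this example.
sample = [
    '10.0.0.5 - - [01/Jan/2008:10:00:00 +0000] "GET /about HTTP/1.1" 406 312 "-" "crawler"',
    '10.0.0.5 - - [01/Jan/2008:10:00:01 +0000] "GET /news HTTP/1.1" 406 312 "-" "crawler"',
    '10.0.0.9 - - [01/Jan/2008:10:00:02 +0000] "GET /about HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

print(count_statuses(sample))
```

To run it against a real log, pass an open file instead of the sample list, for example `count_statuses(open(path))` where `path` points at your own access log.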

I also read on some site that 406 errors can happen when MultiViews is enabled in the Apache configuration, but nowhere did it say how to fix it. Since we had recently switched to using MultiViews, it was reasonable to assume that it might be the cause. Obviously we didn't want to go back to not using MultiViews. I contacted Google support to ask how to fix it and got a response the next day (which in my opinion is slow when you rely on your Mini for crucial search functionality). This is the response I got from them:

The 406 Error occurs when the web server wants to send back a
content-type that’s not included in the Mini’s Accept header.

The easiest way to correct this is to add the following line to the “Additional HTTP Headers for Crawler” under
Google Mini > Crawl and Index > HTTP Headers for Crawler field on the Crawler Parameters page:

Accept:text/html,text/plain,application/pdf,text/pdf,application/vnd.ms-excel,
text/vnd.ms-excel,application/rtf,text/rtf,application/msword,text/msword,
application/vnd.ms-powerpoint,text/vnd.ms-powerpoint,
application/x-shockwave-flash,text/x-shockwave-flash,
application/postscript,text/postscript,application/x-gzip,
application/octet-stream,application/*,text/

And as any good sysadmin would do, I copied and pasted the header where they told me to, hoped for the best, and started crawling again. Sure enough, it didn't work. 406 errors continued to spam the logs every time I tried crawling. So it was back to Google to search for answers, and this time I knew the answer had to do with the Accept header. I tried to figure out what Googlebot sends as its Accept header, since I knew Googlebot got 200 responses while crawling our site. After looking around I found a site that lists a bunch of search engines and the headers they send, and lets you test your site against them. Sure enough, Googlebot has been updated (Google probably figured out this would be a common problem) and handles this differently. So I added the header specified on that site: Accept:*/*
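To show why a bare wildcard is the safest thing a crawler can send, here is a rough sketch of matching a response content type against an Accept header. This is my own simplification: it ignores q-values and is not how Apache's MultiViews negotiation actually decides, so it does not explain why the longer list failed for us. The function names and the shortened example header are made up for illustration.

```python
def parse_accept(header):
    """Split an Accept header value into (type, subtype) media ranges.

    Parameters such as q-values are dropped (a deliberate simplification).
    """
    ranges = []
    for item in header.split(","):
        media = item.split(";")[0].strip().lower()
        if "/" in media:
            main, sub = media.split("/", 1)
            ranges.append((main, sub))
    return ranges


def is_acceptable(accept_header, content_type):
    """Return True if content_type matches any media range in accept_header."""
    main, sub = content_type.split(";")[0].strip().lower().split("/", 1)
    for range_main, range_sub in parse_accept(accept_header):
        if range_main == "*" and range_sub == "*":
            return True  # */* matches every content type
        if range_main == main and range_sub in ("*", sub):
            return True  # exact match, or a type/* wildcard
    return False


# A shortened, hypothetical explicit list versus the bare wildcard.
explicit = "text/html,text/plain,application/pdf,application/*"

print(is_acceptable(explicit, "text/html; charset=utf-8"))
print(is_acceptable(explicit, "image/png"))
print(is_acceptable("*/*", "image/png"))
```

An explicit list only covers what someone remembered to put in it, while `*/*` leaves the server free to send whatever it has.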

And ever since then, my Mini has been happily crawling our site. I hope this helps someone out there who is having similar Google Mini problems. As always, if you know a better way or have a suggestion or comment, feel free to leave it here.

