Why query strings in URLs drive Googlebot and other search engine crawlers insane

February 8th, 2008

“We have to reinvent the wheel every once in a while, not because we need a lot of wheels; but because we need a lot of inventors.”
- Bruce Joyce

I wrote about my experience writing a site crawler in PHP in an earlier post, and I’m going to use some of the background from it to make my point here, so it might help to read that first if you haven’t already.

[Google’s crawler [Googlebot] isn’t that sophisticated/writing a crawler in php]

From my casual observation of the way Googlebot crawls some of the sites I work on, I have reached the conclusion that it works in much the same way that a crawler I wrote a year ago worked.

Googlebot goes from page to page, gathering the links on each page and tacking them onto the URL it is currently at. So why do query strings give it such a problem?

The answer is simple. Imagine this URL for an item that doesn’t exist anymore:

www.example.com/store.php?buyid=29&catid=12

When a crawler encounters this URL and tests it to see if it returns a 404 … it doesn’t.

Why?

Because www.example.com/store.php is usually still a valid page. It won’t give the crawler an error unless you explicitly code it to.
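A minimal sketch of what "explicitly coding it" could look like, written in Python rather than PHP purely for illustration. The `CATALOGUE` dictionary and the `store_status` function are hypothetical names, not anything from a real store script; the idea is only that the page checks whether the requested item exists before deciding on a status code.

```python
from urllib.parse import parse_qs

# Hypothetical catalogue of item ids that still exist (assumption for this sketch).
CATALOGUE = {39: "Widget", 41: "Gadget"}

def store_status(query_string, catalogue=CATALOGUE):
    """Return the HTTP status code the store page would send for a query string."""
    params = parse_qs(query_string)
    if "buyid" not in params:
        # No item requested: the plain store page is served as usual.
        return 200
    try:
        buy_id = int(params["buyid"][0])
    except ValueError:
        # A buyid that is not a number cannot match any item.
        return 404
    # An item that no longer exists gets an explicit 404 instead of a silent 200.
    return 200 if buy_id in catalogue else 404
```

With this check in place, `store_status("buyid=29&catid=12")` gives 404 because item 29 is not in the catalogue, while a query for an item that still exists gives 200.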

So the crawler now tosses www.example.com/store.php?buyid=29&catid=12 onto its list of pages to be crawled. Can you see the disaster waiting to happen?

www.example.com/store.php?buyid=29&catid=12 and any other non-existent URLs like it are basically just mapping to the still valid www.example.com/store.php, but in the crawler’s mind they are all different URLs.

Now, if there are other links on that page (store.php), for related products for example, Google just takes each link and tacks it onto the URL it thinks it is at right now. So it winds up with:

www.example.com/store.php?buyid=29&catid=12store.php?buyid=39&catid=11

It does that for every invalid query string URL that has store.php as its base. It then goes back and crawls them again, and now it has:

www.example.com/store.php?buyid=29&catid=12store.php?buyid=39&catid=11store.php?buyid=39&catid=11

The crawler is now in a tailspin, going around in circles trying to crawl your site, chewing up your CPU cycles and generally being a nuisance.
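The tacking-on behaviour described above can be reproduced in a few lines. This is only a sketch of the naive approach, in Python for illustration; `naive_join` is a hypothetical helper, and the `http://` scheme is added so the standard library can parse the URLs. For contrast, the last line shows what Python's `urllib.parse.urljoin` returns for the same inputs. Nothing here establishes what any real crawler actually does.

```python
from urllib.parse import urljoin

def naive_join(current_url, link):
    # Hypothetical helper: glue the link straight onto the current URL,
    # the way the crawler described above appears to behave.
    return current_url + link

current = "http://www.example.com/store.php?buyid=29&catid=12"
link = "store.php?buyid=39&catid=11"  # a relative "related product" link

once = naive_join(current, link)   # first pass over the page
twice = naive_join(once, link)     # second pass: the URL keeps growing

# Standard relative-reference resolution replaces the last path segment
# and the query string instead of appending to them.
resolved = urljoin(current, link)

print(once)
print(twice)
print(resolved)
```

Every pass with `naive_join` produces a new, longer URL that the server still answers with the same page, so the list of pages to crawl never stops growing. The resolved form stays at `http://www.example.com/store.php?buyid=39&catid=11` no matter how many times the page is revisited.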

I hope this helps you understand why Googlebot hates query strings so much.

I haven’t tried this yet, but I think making the base URL of a query string page resolve to a 404 error would help a lot.

So, as an example,

www.example.com/store.php?buyid=29&catid=12

should return a code 200/OK

and

www.example.com/store.php

should give a 404 error.

This is just my theory; I don’t know that it’d be practical.
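To make the idea concrete, here is a rough sketch of that rule, again in Python for illustration only. The `proposed_status` function is a hypothetical name, and the rule it encodes (200 only when both `buyid` and `catid` are present, 404 otherwise) is one assumed reading of the theory above, untested against any real crawler. The `http://` scheme is added so that `urlsplit` can separate the path from the query.

```python
from urllib.parse import urlsplit, parse_qs

def proposed_status(url):
    """Return 200 for store.php with a full query string, 404 for the bare page."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    # Assumption: a "full" query string means both buyid and catid are present.
    has_item = "buyid" in params and "catid" in params
    if parts.path == "/store.php" and has_item:
        return 200
    # The base URL on its own (or any other path) is reported as not found.
    return 404

print(proposed_status("http://www.example.com/store.php?buyid=29&catid=12"))
print(proposed_status("http://www.example.com/store.php"))
```

Note that this only looks at the shape of the URL. It does not check whether item 29 still exists, so it would need to be combined with a lookup against the actual catalogue to handle removed items.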

PS: I hope this also helps you understand why search engine crawlers hate PHP session IDs in the URLs of your content.

Entry Filed under: developers, php
