Archive for February 8th, 2008

Why query strings in urls drive Googlebot and other search engine crawlers insane

“We have to reinvent the wheel every once in a while, not because we need a lot of wheels; but because we need a lot of inventors.”
- Bruce Joyce

I wrote about my experience writing a site crawler in PHP in an earlier post, and I’m going to use some of the background from it to make my point here, so it might help to read it first if you haven’t already.

[Google’s crawler [Googlebot] isn’t that sophisticated/writing a crawler in php]

From my casual observation of the way Googlebot crawls some of the sites I work on, I have reached the conclusion that it works in much the same way that a crawler I wrote a year ago worked.

Googlebot goes from page to page, gathering the links on each page and tacking them onto the url it is at right then. So why do query strings give it such a problem?

The answer is simple. Imagine this url for an item that doesn’t exist anymore.

www.example.com/store.php?buyid=29&catid=12

When a crawler encounters this url and tests it to see if it returns a 404 … it doesn’t.

Why?

Because www.example.com/store.php is usually still a valid page. It won’t give the crawler an error unless you explicitly code it to.

So the crawler now tosses www.example.com/store.php?buyid=29&catid=12 onto its list of pages to be crawled. Can you see the disaster waiting to happen?

www.example.com/store.php?buyid=29&catid=12 and any other non-existent urls like it are basically just mapping to the still valid www.example.com/store.php, but in the crawler’s mind they are all different urls.

Now, suppose there are other urls on that page (store.php), for related products for example. Google just takes each url and tacks it onto the url (it thinks) it’s at right now. So it winds up with

www.example.com/store.php?buyid=29&catid=12store.php?buyid=39&catid=11

It does that for every invalid query string url that has store.php as its base. It then goes back and crawls them again, and now it has

www.example.com/store.php?buyid=29&catid=12store.php?buyid=39&catid=11store.php?buyid=39&catid=11

The crawler is now in a tailspin … going around in circles trying to crawl your site, chewing up your CPU cycles and generally being a nuisance.
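The runaway described above can be reproduced in a few lines. This is only a sketch in Python, and `naive_join` is a hypothetical helper that models the tack-it-on behaviour this post assumes. It is not Googlebot's actual code.

```python
def naive_join(current_url: str, href: str) -> str:
    """Tack a relative href straight onto the current url (the behaviour assumed in this post)."""
    return current_url + href


# Hypothetical starting point: a dead item url that still returns 200.
url = "www.example.com/store.php?buyid=29&catid=12"
related_link = "store.php?buyid=39&catid=11"  # a related-product link found on store.php

seen = []
for _ in range(2):  # two crawl passes over the same page
    url = naive_join(url, related_link)
    seen.append(url)

for u in seen:
    print(u)
```

Every pass produces a url the crawler has never seen before, so a "have I crawled this already?" check never stops it.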

I hope this helps you understand why Googlebot hates query strings so much.

I haven’t tried this yet, but I think it should be clear that making the base url of a query string resolve to a 404 error would help a lot.

So, as an example,

www.example.com/store.php?buyid=29&catid=12

should return a 200 OK, and

www.example.com/store.php

should return a 404 error.

This is just my theory; I don’t know that it’d be practical.
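To make the idea concrete, here is a minimal sketch of the status-code decision, written in Python rather than the PHP a real store.php would use. `LIVE_ITEMS` and `status_for` are hypothetical names, and the sketch assumes item 29 in category 12 is still for sale. It also goes one step beyond the proposal above by returning a 404 for items that no longer exist, which is the "explicitly code it to" case mentioned earlier.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical catalogue: (buyid, catid) pairs that still exist.
LIVE_ITEMS = {("29", "12"), ("39", "11")}


def status_for(url: str) -> int:
    """Return the HTTP status a store.php-style page could send for this url."""
    query = parse_qs(urlsplit(url).query)
    buyid = query.get("buyid", [None])[0]
    catid = query.get("catid", [None])[0]
    if buyid is None or catid is None:
        return 404  # bare store.php with no item selected
    if (buyid, catid) not in LIVE_ITEMS:
        return 404  # the item has been removed
    return 200


print(status_for("http://www.example.com/store.php?buyid=29&catid=12"))
print(status_for("http://www.example.com/store.php"))
```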

PS: I hope this further helps you understand why search engine crawlers also hate PHP session ids on your content.

February 8th, 2008

Google’s crawler [Googlebot] isn’t that sophisticated/writing a crawler in php

I spent a lot of time early last year trying to write a crawler in PHP (I know, I know).

It was supposed to sit on the server so that when you went to its url, it’d generate a Google sitemap for your entire site.

What I found out was that writing a good crawler is very hard work. Not because of the recursion involved, but because of the endless ways link tags can be written.

Now Google has validated my experience (more on this in a second).

Just a few of the things I had to consider with my crawler were:

  • I had to program it to look for a base tag, so that I’d know whether to treat the links as relative or absolute.
  • I had to check each link to figure out whether it was internal or external, so I’d know whether to crawl it.
  • I had to keep a running list of links crawled, so that I’d know whether I had crawled a link before.
  • I had to tell the program that a leading “/” in a relative link meant it should substitute the domain for it.
  • I had to know how to deal with “../” … this was a pain and a half.
  • I had to tell it how to deal with mailtos, javascript links, and improperly written urls like <a href="www.concept47.com"> (more on this later) … etc.
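Most of the checks in the list above can be sketched with a standard url library. The example below is Python, not the PHP crawler described in this post. `classify_link` and its parameters are hypothetical names, and it leans on `urllib.parse.urljoin` to handle the base tag, leading “/” and “../” cases.

```python
from urllib.parse import urljoin, urlsplit


def classify_link(page_url, href, base_href=None):
    """Return (absolute_url, kind) where kind is 'internal', 'external' or 'skip'."""
    href = href.strip()
    if not href or href.startswith("#"):
        return None, "skip"  # empty link or same-page anchor
    scheme = urlsplit(href).scheme.lower()
    if scheme and scheme not in ("http", "https"):
        return None, "skip"  # mailto:, javascript:, and so on
    # A <base href="..."> tag, if present, replaces the page url as the base.
    base = urljoin(page_url, base_href) if base_href else page_url
    absolute = urljoin(base, href)  # resolves "/", "../" and plain relative links
    same_host = urlsplit(absolute).netloc.lower() == urlsplit(page_url).netloc.lower()
    return absolute, ("internal" if same_host else "external")


page = "http://www.example.com/blog/2008/post.html"  # hypothetical page being crawled
for link in ["../about.html", "/contact", "mailto:me@example.com",
             "javascript:void(0)", "http://other.example.org/x"]:
    print(link, "->", classify_link(page, link))
```

Note that a badly written link such as href="www.concept47.com" has no scheme, so this sketch treats it as an internal relative link, which is exactly the trap discussed further down.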

My crawler worked by keeping a list of urls to be crawled, which it continually added to. As it crawled each url, it put it in a second array, and every new url had to be cross-checked against that array before being added to the list to be crawled.
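The two-list bookkeeping described above looks roughly like this. It is a sketch in Python under a few assumptions: `get_links` is a hypothetical callable that returns the absolute urls found on a page, `SITE` is a made-up four-page site, and a set is swapped in for the second array so the cross-check is a single lookup.

```python
from collections import deque


def crawl(start_url, get_links):
    """Breadth-first crawl that visits each url exactly once."""
    to_crawl = deque([start_url])  # urls waiting to be crawled
    seen = {start_url}             # every url ever queued, for the cross-check
    crawled = []                   # urls in the order they were crawled
    while to_crawl:
        url = to_crawl.popleft()
        crawled.append(url)
        for link in get_links(url):
            if link not in seen:   # skip anything already queued or crawled
                seen.add(link)
                to_crawl.append(link)
    return crawled


# Hypothetical site: page -> links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/b"],
    "/b": ["/a", "/c"],
    "/c": [],
}
order = crawl("/", lambda url: SITE.get(url, []))
print(order)
```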

The problem, though, is the way people write html markup. As many of you know, there are some nastily written pages out there … so if someone wrote <a href="www.concept47.com"> or <a href="concept47.com"> or even <a href="screwyougoogle">, my crawler had to know not to add it to the list to be crawled.

This is very difficult to do correctly, and for all the time I spent on it, I found no real way to deal with it. You could write special cases for <a href="www.concept47.com">, but what about <a href="ww.concept47.com"> or <a href="w.concept47.com"> … see the problem?

Even though the urls give you a 404, they’ll make it onto your list to be crawled and waste the crawler’s time. I felt like such a loser for not being able to figure out this issue, but it seems Googlebot has the same problem.

Look at this.

[Image: Googlebot has problems with bad link tags]

This is from my webmaster tools console.

The problem here was that I had a link tag on one of my blog posts that went like this

<a href="www.unfuddle.com">

As you can see, even the mighty Googlebot didn’t pick up on the error. It just tacked the url onto the current url it was at and went on about its business.
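For what it’s worth, any resolver that follows the standard url rules does the same thing, because an href with no scheme counts as a relative path. A small Python sketch shows it; the blog post url here is a hypothetical stand-in, not the real address of the post.

```python
from urllib.parse import urljoin

page = "http://blog.example.com/2008/02/some-post/"  # hypothetical post url
href = "www.unfuddle.com"                             # the scheme-less link from the post

resolved = urljoin(page, href)
print(resolved)
```

The result is the post url with www.unfuddle.com tacked onto the end, which is consistent with what the webmaster tools console reported.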

Validation!

Read the next in this series [Why query strings in urls drive Googlebot and other search engine crawlers insane]

February 8th, 2008

