
How Spiders Fetch & Classify Data

Have you ever really wondered how spiders read data? Like, really read it? I’m going to drop something you may find interesting, something that could change your day-to-day drastically if you write your own content, or even just leave comments. A side note: I wouldn’t be posting this if it weren’t for a friend, since one of his posts reminded me about it. I think you’ll find it interesting nonetheless.

So let’s think about how spiders work in general. Spiders go to your website, and for the most part they see flat HTML. Do they execute JavaScript? Do they execute Java? What about jQuery? If you answered no to all of those questions, you are a dumbass. I’ve actually seen a Google crawler join an IRC channel via a Java app. The app was on a website it was crawling, which shows that it does actually read Java. And no, I wasn’t imagining things or lying: the hostmask was the crawl googlebot hostmask, coming from a Google-owned and operated IP.

Is it safe to say that they take whatever data they can get about the scripts they run? Maybe they can’t interpret everything that comes out of them, like when I said “Hey Googlebot!”, but they can certainly look at the description, the data inside (including the IRC URL), and so on. Along the same lines, is it out of reach to think that Google’s crawlers use OCR to read the text in banners? Is it really that far-fetched? From what I’ve seen done with OCR and CAPTCHAs, I don’t doubt it for one second. For the doubters: 150-200 links/minute anyone? All CAPTCHA “protected”? ;)

Anyway, back to the subject at hand: what can Google’s crawlers read and not read? By this point I’m sure there’s at least a handful of people asking “why hasn’t he mentioned Flash yet?”, so I’ll address that right now. One of my “research friends” wrote up an article about indexing and Flash over at SERPable. If you give it a full read, he goes into a lot of detail about what he found out while getting Flash indexed.

When you start to think about how spiders are configured and made to ‘think’ (algorithms, etc.), you can start to work out exactly what they’re looking at and seeing. Now think outside the box about how the data they collect is manipulated and categorized. Here’s an example: if you have a 500-word unique article that you can “spin” back into being unique again, would you stop at just 5 spins? What about re-ordering all the sentences and doing it again? What about running it through a synonym function to swap in stronger synonyms, or even taking a word, grabbing an antonym, and writing “not <word>”?

An example of this would be: “The quick brown fox jumps over the lazy dog” => “The not so slow brown fox leaps over the lethargic dog”. Do you really think the spider can tell the difference? It knows a few things: unique or not unique, keyword density, and related words. If you have a list of synonyms and antonyms, so do they. Keep that in mind.
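To make that concrete, here’s a rough sketch of the kind of naive synonym-swap pass I’m talking about. This is not the code from my WickedFire post; the function name and the tiny synonym map are made up purely for illustration.

<?php
// Naive "spin" pass: swap words for random synonyms from a lookup table.
// The map and function name here are illustrative only.
$synonyms = array(
    'quick' => array('fast', 'not so slow'),
    'jumps' => array('leaps', 'hops'),
    'lazy'  => array('lethargic', 'idle'),
);

function spin_text($text, $synonyms)
{
    $words = explode(' ', $text);
    foreach ($words as $i => $word) {
        $key = strtolower(trim($word, '.,!?'));
        if (isset($synonyms[$key])) {
            // Pick one of the listed synonyms at random.
            $words[$i] = $synonyms[$key][array_rand($synonyms[$key])];
        }
    }
    return implode(' ', $words);
}

echo spin_text('The quick brown fox jumps over the lazy dog', $synonyms);
// e.g. "The fast brown fox leaps over the lethargic dog"

Re-ordering sentences or negating antonyms would just be more passes of the same idea, and the spider still only sees word-level signals like uniqueness and keyword density.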

As for other data and cloaking, I won’t really touch on that right now. I used to experiment with cloaking and it’s very hard to pull off, especially without an IP list of Google’s spiders. If you think checking the User-Agent is the best way, I’d try something else after kicking yourself in the ass. Remember that Google’s entire job is to make sure the best quality sites make it to the top. My job is to figure out how to get a site up there so that it stays. :)
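Since I brought up the User-Agent point: below is a minimal sketch of the usual alternative, a reverse-DNS / forward-DNS double lookup. Anyone can spoof a User-Agent string, but a genuine Googlebot IP resolves to a googlebot.com or google.com hostname that resolves back to the same IP. The helper name here is hypothetical, not something from Google’s docs or from my own toolset.

<?php
// Verify a claimed Googlebot by DNS instead of trusting the User-Agent header.
// is_real_googlebot() is a made-up helper name for this sketch.
function is_real_googlebot($ip)
{
    $host = gethostbyaddr($ip);               // reverse lookup, e.g. crawl-66-249-66-1.googlebot.com
    if ($host === false || $host === $ip) {
        return false;                         // no PTR record at all
    }
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;                         // hostname isn't under Google's crawl domains
    }
    return gethostbyname($host) === $ip;      // forward lookup must point back to the same IP
}

// Usage: only treat the visitor as Googlebot if the UA claim AND the DNS check agree.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
$looksLikeGoogle = stripos($ua, 'Googlebot') !== false && is_real_googlebot($ip);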

Anyway, I hope this helped you all think a little outside the box. If you’re not a member of WickedFire, I hope you sign up and take a peek at my post here, which has some PHP code I posted for making unique content. Have a good one ;)

PS – Bored? Skype me @ Contempt.me (that’s the username). If you don’t catch me in the middle of coding we can have a little chat …

About the Author

My name is Rob Adler and I'm an algo-holic. I spend most of my time coding, data mining, spidering and consulting for SEO. I hope the posts here are beneficial for you, and hopefully I can blow your mind every now and again.

9 Comments to How Spiders Fetch & Classify Data
    • Chris Monty
    • Hey, nice one. I actually managed to escape from the banking world and get a job as an SEO for a local sports marketing firm. We’ll have to grab lunch soon. I’m off to read your php code post about unique content.

    • xentech
    • I always put my jQuery in /js and disable it in robots.txt, not sure of its effectiveness though.

    • Mark
    • @xentech – robots.txt asks crawlers not to fetch those files, but it’s only a request, not enforcement, and it doesn’t keep the URLs out of the index.

      Back to the post. Unless I’m missing something, I was waiting for the big “ohhh” and it never came.

      Google have publicly announced that:

      1) They can crawl Flash and are trying to do so.

      2) They can execute JavaScript and have been rolling this out.

      I didn’t get any insight into Contempt’s thoughts on, for instance, what the point is of a Googlebot joining a Java IRC channel. Think that was just an observed accident?

      I was looking through some of David Sontag’s papers on machine learning (http://people.csail.mit.edu/dsontag/) and it strikes me that, even using high-end stuff like his, you’d have trouble trying to categorize or order any data that came from plugging around inside unknown Java.

      There’s a whole minefield of problems to deal with.

    • Contempt
    • Good call Mark. Yeah, there really wasn’t an “ohhh” moment with any of this. But you bring up a good point.

      Do you think Google spiders have the ability to probe JS or Java the way that a pokerbot reads off of the poker client? Reading the actual IO and display streams to the screen to decipher what it all is/means?

    • Mark
    • I think Google is looking at Javascript more in the case when it’s used to trigger circumstances for following pages. So, working out what it needs to pass to get some output and see what the user is seeing.

      The problem with an “open-ended” approach to decoding JS/Java is the almost unlimited variations of application. You can do some incredibly advanced stuff with JS and especially Java. If Google started probing large portions of (full app style) code, they would need some cut-off. Can you imagine the amount of resource it would take to fully test inputs/outputs on an unknown app? Not to mention, if this were the case there would be all kinds of ways to trap and trick bots.

      I think a full-execution would open up so many more flaws (and costs) than it would reap in benefit.

      My guess (and it is just that) is that they look for footprints of “known” code, which will have some kind of predictability in what the inputs/outputs are going to be, and quickly run them, rather than bots intelligently trying to decipher how some specific code works / what it is outputting.

      Bear in mind pokerbots are very bespoke bits of software.

    • Still
    • I know I’m all late to the party, but I just wanted to give a quick shoutout – I had gotten a lil complacent in my day-to-day and your blog has made me step back and THINK again.

      Especially your EPN experiment! WOW!

      **jotting off to spend some quality time at WF…I’ve obviously been missing out…

      Thanks Bruh!

    • Trontastic
    • It’s only going to be a matter of time before Google really starts to put the smack-down on article spinning. With personalized search, LSI, and what they are doing with OCR, Google has clearly displayed the ability to stay ahead of the curve.

      Okay, leave me alone now. I have to go scrape a few thousand pages before lunch.
