Have you ever really wondered how spiders read data? Like, really read it? I’ll drop something you may find interesting, and it could drastically change your day-to-day routine if you write your own content, or even if you just leave comments. Just a side note: I wouldn’t be posting this if one of a friend’s posts hadn’t reminded me about it. I think you’ll find it interesting nonetheless.
Is it safe to say that they actually take whatever data they can about the scripts they’re running? Maybe they won’t be able to interpret all of the data coming from a script, like when I said “Hey Googlebot!”, but they could certainly look at the description, the data in it including the IRC URL, etc. Along the same lines, is it out of reach to consider that Google’s crawlers may actually use OCR to see the text in banners? Is it really that far out of reach? Personally, from what I’ve seen done with OCR against CAPTCHAs, I don’t doubt it for one second. For you doubters: 150-200 links/minute, anyone? All CAPTCHA “protected”?
Anyway, back to the subject at hand: what can Google’s crawlers read and not read? By this point I’m sure there’s at least a handful of people going “why hasn’t he mentioned Flash yet?” I’ll address that right now. One of my “research friends” wrote up this article about Indexing and Flash over at SERPable. If you give it a full read, he goes into a lot of detail about what he found out about indexing Flash.
When you start to think more along the lines of how spiders are configured and made to ‘think’ (algorithms, etc.), you can start to work out exactly what they’re looking at and seeing. Now think outside the box about how the data they collect is manipulated and categorized. Here’s an example: if you have a 500-word unique article that you can “spin” into being unique again, would you stop at just 5 spins? What about re-ordering all the sentences and doing it again? What about running it through a synonym function to swap in synonyms, or even taking a word, getting an antonym and saying “not <word>”?
An example of this would be: “The quick brown fox jumps over the lazy dog” => “The not so slow brown fox leaps over the lethargic dog”. Do you really think the spider would be able to tell the difference? It knows a few things: whether the text is unique or not, keyword density, and related words. If you have a list of synonyms and antonyms, so do they. Keep that in mind.
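To make that point concrete, here is a minimal Python sketch of both sides of the fox example. It is not how any real spider works and it is not the PHP code from my WickedFire post. The word lists, the function names and the 3-word "shingle" comparison are all assumptions I picked for illustration. The idea: a naive word-overlap check sees the spun sentence as brand new, but anyone holding the same synonym and antonym list can map it straight back.

```python
import re

# Hypothetical, hand-made word lists (illustration only, not a real thesaurus).
SYNONYMS = {"jumps": "leaps", "lazy": "lethargic"}
ANTONYMS = {"quick": "slow"}

# Reverse mapping: what someone holding the same lists could use to undo a spin.
CANONICAL = {spun: word for word, spun in SYNONYMS.items()}
CANONICAL.update({"not so " + ant: word for word, ant in ANTONYMS.items()})


def spin(sentence: str) -> str:
    """Swap each known word for its synonym, or for "not so <antonym>"."""
    out = []
    for word in sentence.split():
        key = word.lower()
        if key in SYNONYMS:
            out.append(SYNONYMS[key])
        elif key in ANTONYMS:
            out.append("not so " + ANTONYMS[key])
        else:
            out.append(word)
    return " ".join(out)


def normalize(text: str) -> str:
    """Lowercase the text and map known spun phrases back to one canonical word."""
    text = text.lower()
    for phrase, word in CANONICAL.items():
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", word, text)
    return text


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams ("shingles") in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two shingle sets, from 0.0 (disjoint) to 1.0 (same)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


original = "The quick brown fox jumps over the lazy dog"
spun = spin(original)

print(spun)
print("raw similarity:", similarity(original, spun))
print("after normalizing:", similarity(normalize(original), normalize(spun)))
```

The raw comparison shares no 3-word shingles at all, while the normalized comparison is a perfect match. A real thesaurus is messier than a two-entry dictionary, but the takeaway is the same: swapping words only hides the text from someone who doesn't have the same list.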
As for other data and cloaking, I won’t really touch on that right now. I used to experiment with cloaking, and it’s very hard to pull off, especially without an IP list of Google spiders. If you think checking the UserAgent is the best way, kick yourself in the ass and then try something else. Remember that Google’s entire job is to ensure that the best quality sites make it to the top. My job is to figure out how to get a site up there so that it stays.
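To show why the UserAgent alone is worthless (anyone can type “Googlebot” into a header), here is a rough Python sketch of a check that doesn’t trust it. It is not the IP list approach I mentioned. It swaps in a reverse DNS lookup followed by a forward lookup, which is the verification Google itself describes for its crawlers. The function names and the injectable lookup parameters are my own assumptions for illustration, and the hostname suffixes are the ones I understand Google to use for Googlebot.

```python
import socket

# Assumption: genuine Googlebot hostnames end in one of these domains.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname registered for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> list:
    """Return the IPv4 addresses that a hostname resolves to."""
    return socket.gethostbyname_ex(hostname)[2]


def is_google_crawler(ip: str, reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """Check a visitor IP with a reverse lookup, then confirm it with a forward lookup.

    The lookups are parameters (a hypothetical design choice) so the logic
    can be exercised without touching the network.
    """
    try:
        hostname = reverse(ip).rstrip(".").lower()
        # The suffix check must anchor at the END of the hostname, otherwise
        # "googlebot.com.example.net" would slip through.
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must lead back to the same IP, since anyone
        # can set an arbitrary PTR record on their own address space.
        return ip in forward(hostname)
    except OSError:
        # Covers socket.herror and socket.gaierror (failed lookups).
        return False
```

Calling `is_google_crawler(visitor_ip)` does two live DNS queries, so in practice you would cache the answer per IP instead of looking it up on every hit.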
Anyway, I hope this helped you all think a little bit outside the box. If you’re not a member of WickedFire, I hope you sign up and take a peek at my post here that has some PHP code I posted about making unique content. Have a good one.
PS – Bored? Skype me @ Contempt.me (that’s the username). If you don’t catch me in the middle of coding we can have a little chat …