<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: How Spiders Fetch &amp; Classify Data</title>
	<atom:link href="http://contempt.me/how-spiders-fetch-classify-data/feed/" rel="self" type="application/rss+xml" />
	<link>http://contempt.me/how-spiders-fetch-classify-data/</link>
	<description>Getting Ranked ... And Everything After</description>
	<lastBuildDate>Mon, 16 Apr 2012 04:16:17 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
	<item>
		<title>By: Trontastic</title>
		<link>http://contempt.me/how-spiders-fetch-classify-data/comment-page-1/#comment-4693</link>
		<dc:creator>Trontastic</dc:creator>
		<pubDate>Tue, 29 Dec 2009 16:24:51 +0000</pubDate>
		<guid isPermaLink="false">http://contempt.me/?p=233#comment-4693</guid>
		<description>It&#039;s only going to be a matter of time before Google really starts to put the smack-down on article spinning. With personalized search, LSI, and what they are doing with OCR, Google has clearly displayed the ability to stay ahead of the curve.

Okay, leave me alone now. I have to go scrape a few thousand pages before lunch.</description>
		<content:encoded><![CDATA[<p>It&#8217;s only going to be a matter of time before Google really starts to put the smack-down on article spinning. With personalized search, LSI, and what they are doing with OCR, Google has clearly displayed the ability to stay ahead of the curve.</p>
<p>Okay, leave me alone now. I have to go scrape a few thousand pages before lunch.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Raaj</title>
		<link>http://contempt.me/how-spiders-fetch-classify-data/comment-page-1/#comment-4004</link>
		<dc:creator>Raaj</dc:creator>
		<pubDate>Mon, 09 Nov 2009 07:28:40 +0000</pubDate>
		<guid isPermaLink="false">http://contempt.me/?p=233#comment-4004</guid>
		<description>Great article. I could never have imagined that Google bots were getting smarter to this extent.</description>
		<content:encoded><![CDATA[<p>Great article. I could never have imagined that Google bots were getting smarter to this extent.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Still</title>
		<link>http://contempt.me/how-spiders-fetch-classify-data/comment-page-1/#comment-2892</link>
		<dc:creator>Still</dc:creator>
		<pubDate>Sun, 31 May 2009 19:56:25 +0000</pubDate>
		<guid isPermaLink="false">http://contempt.me/?p=233#comment-2892</guid>
		<description>I know I&#039;m all late to the party, but I just wanted to give a quick shoutout --I had gotten a lil complacent in my day-to-day and your blog has made me step back and THINK again. 

Especially your EPN experiment! WOW!

**jotting off to spend some quality time at WF...I&#039;ve obviously been missing out...

Thanks Bruh!</description>
		<content:encoded><![CDATA[<p>I know I&#8217;m all late to the party, but I just wanted to give a quick shoutout &#8211;I had gotten a lil complacent in my day-to-day and your blog has made me step back and THINK again. </p>
<p>Especially your EPN experiment! WOW!</p>
<p>**jotting off to spend some quality time at WF&#8230;I&#8217;ve obviously been missing out&#8230;</p>
<p>Thanks Bruh!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://contempt.me/how-spiders-fetch-classify-data/comment-page-1/#comment-2866</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Tue, 26 May 2009 15:35:03 +0000</pubDate>
		<guid isPermaLink="false">http://contempt.me/?p=233#comment-2866</guid>
		<description>I think Google is looking at JavaScript more in cases where it&#039;s used to trigger the circumstances for following pages. So, working out what it needs to pass to get some output and see what the user is seeing.

The problem with an &quot;open-ended&quot; approach to decoding JS/Java is the almost unlimited variations of application. You can do some incredibly advanced stuff with JS and especially Java. If Google started probing large portions of (full app style) code, they would need some cut-off. Can you imagine the amount of resources it would take to fully test inputs/outputs on an unknown app? Not to mention, if this were the case there would be all kinds of ways to trap and trick bots.

I think full execution would open up far more flaws (and costs) than it would reap in benefits.

My guess (and it is just that) is that they look for footprints of &quot;known&quot; code, which will have some kind of predictability in what the inputs/outputs are going to be, and quickly run them, rather than using bots that intelligently try to decipher how some specific code works / what it is outputting.

Bear in mind pokerbots are very bespoke bits of software.</description>
		<content:encoded><![CDATA[<p>I think Google is looking at JavaScript more in cases where it&#8217;s used to trigger the circumstances for following pages. So, working out what it needs to pass to get some output and see what the user is seeing.</p>
<p>The problem with an &#8220;open-ended&#8221; approach to decoding JS/Java is the almost unlimited variations of application. You can do some incredibly advanced stuff with JS and especially Java. If Google started probing large portions of (full app style) code, they would need some cut-off. Can you imagine the amount of resources it would take to fully test inputs/outputs on an unknown app? Not to mention, if this were the case there would be all kinds of ways to trap and trick bots.</p>
<p>I think full execution would open up far more flaws (and costs) than it would reap in benefits.</p>
<p>My guess (and it is just that) is that they look for footprints of &#8220;known&#8221; code, which will have some kind of predictability in what the inputs/outputs are going to be, and quickly run them, rather than using bots that intelligently try to decipher how some specific code works / what it is outputting.</p>
<p>Bear in mind pokerbots are very bespoke bits of software.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Contempt</title>
		<link>http://contempt.me/how-spiders-fetch-classify-data/comment-page-1/#comment-2800</link>
		<dc:creator>Contempt</dc:creator>
		<pubDate>Tue, 19 May 2009 02:04:59 +0000</pubDate>
		<guid isPermaLink="false">http://contempt.me/?p=233#comment-2800</guid>
		<description>Good call Mark. Yeah, there really wasn&#039;t an &quot;ohhh&quot; moment with any of this. But you bring up a good point.

Do you think Google spiders have the ability to probe JS or Java the way that a pokerbot reads off of the poker client? Reading the actual IO and display streams to the screen to decipher what it all is/means?</description>
		<content:encoded><![CDATA[<p>Good call Mark. Yeah, there really wasn&#8217;t an &#8220;ohhh&#8221; moment with any of this. But you bring up a good point.</p>
<p>Do you think Google spiders have the ability to probe JS or Java the way that a pokerbot reads off of the poker client? Reading the actual IO and display streams to the screen to decipher what it all is/means?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://contempt.me/how-spiders-fetch-classify-data/comment-page-1/#comment-2790</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Mon, 18 May 2009 11:04:53 +0000</pubDate>
		<guid isPermaLink="false">http://contempt.me/?p=233#comment-2790</guid>
		<description>@xentech - robots.txt will request that something not appear in the public index, but that certainly doesn&#039;t mean it isn&#039;t crawled.

Back to the post. Unless I&#039;m missing something, I was waiting for the big &quot;ohhh&quot; and it never came.

Google have publicly announced that:

1) They can crawl Flash and are trying to do so.

2) They can execute JavaScript and have been rolling this out.

I didn&#039;t get any insight into Contempt&#039;s thoughts on, for instance - what is the point of a Googlebot joining a Java IRC channel? Think that was just an observed accident?

I was looking through some of David Sontag&#039;s papers on machine learning (http://people.csail.mit.edu/dsontag/) and it strikes me that, even using high-end stuff like his, you&#039;d have trouble trying to categorize or order any data that came from plugging around inside unknown Java.

There&#039;s a whole minefield of problems to deal with.</description>
		<content:encoded><![CDATA[<p>@xentech &#8211; robots.txt will request that something not appear in the public index, but that certainly doesn&#8217;t mean it isn&#8217;t crawled.</p>
<p>Back to the post. Unless I&#8217;m missing something, I was waiting for the big &#8220;ohhh&#8221; and it never came.</p>
<p>Google have publicly announced that:</p>
<p>1) They can crawl Flash and are trying to do so.</p>
<p>2) They can execute JavaScript and have been rolling this out.</p>
<p>I didn&#8217;t get any insight into Contempt&#8217;s thoughts on, for instance &#8211; what is the point of a Googlebot joining a Java IRC channel? Think that was just an observed accident?</p>
<p>I was looking through some of David Sontag&#8217;s papers on machine learning (<a href="http://people.csail.mit.edu/dsontag/" rel="nofollow">http://people.csail.mit.edu/dsontag/</a>) and it strikes me that, even using high-end stuff like his, you&#8217;d have trouble trying to categorize or order any data that came from plugging around inside unknown Java.</p>
<p>There&#8217;s a whole minefield of problems to deal with.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: xentech</title>
		<link>http://contempt.me/how-spiders-fetch-classify-data/comment-page-1/#comment-2737</link>
		<dc:creator>xentech</dc:creator>
		<pubDate>Tue, 12 May 2009 08:46:35 +0000</pubDate>
		<guid isPermaLink="false">http://contempt.me/?p=233#comment-2737</guid>
		<description>I always put my jQuery in /js and disallow it in robots.txt, not sure of its effectiveness though.</description>
		<content:encoded><![CDATA[<p>I always put my jQuery in /js and disallow it in robots.txt, not sure of its effectiveness though.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Contempt</title>
		<link>http://contempt.me/how-spiders-fetch-classify-data/comment-page-1/#comment-2736</link>
		<dc:creator>Contempt</dc:creator>
		<pubDate>Tue, 12 May 2009 06:28:04 +0000</pubDate>
		<guid isPermaLink="false">http://contempt.me/?p=233#comment-2736</guid>
		<description>Hell chea. :D</description>
		<content:encoded><![CDATA[<p>Hell chea. <img src='http://contempt.me/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Chris Monty</title>
		<link>http://contempt.me/how-spiders-fetch-classify-data/comment-page-1/#comment-2729</link>
		<dc:creator>Chris Monty</dc:creator>
		<pubDate>Mon, 11 May 2009 02:13:52 +0000</pubDate>
		<guid isPermaLink="false">http://contempt.me/?p=233#comment-2729</guid>
		<description>Hey, nice one. I actually managed to escape from the banking world and get a job as an SEO for a local sports marketing firm. We&#039;ll have to grab lunch soon. I&#039;m off to read your PHP code post about unique content.</description>
		<content:encoded><![CDATA[<p>Hey, nice one. I actually managed to escape from the banking world and get a job as an SEO for a local sports marketing firm. We&#8217;ll have to grab lunch soon. I&#8217;m off to read your PHP code post about unique content.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

