Monday, December 15, 2008

Crawling vs. Indexing


Crawling means sucking content without processing the results. Crawlers are rather dumb processes that fetch the content Web servers deliver in answer to HTTP requests for particular URIs, and hand that content over to other processes, e.g. crawling caches or directly to indexers. Crawlers get their URIs from a crawling engine that’s fed from different sources, including links extracted from previously crawled Web documents, URI submissions, foreign Web indexes, and whatnot.
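
To make that concrete, here’s a rough sketch of such a dumb fetch-and-hand-off loop in Python. It’s not any particular engine’s code; names like frontier and hand_off are made up for illustration.

from collections import deque
from urllib.request import urlopen

def crawl(seed_uris, hand_off):
    # The frontier is fed by the crawling engine (submitted URIs,
    # links extracted from earlier crawls, foreign indexes, ...).
    frontier = deque(seed_uris)
    while frontier:
        uri = frontier.popleft()
        try:
            # Fetch whatever the Web server delivers for this URI.
            with urlopen(uri, timeout=10) as response:
                content = response.read()
        except OSError:
            continue  # fetch failed; the crawler doesn't care why
        # Hand the raw content to a crawling cache or an indexer.
        # No processing of the results happens here.
        hand_off(uri, content)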

Indexing means making sense out of the retrieved contents and storing the processing results in a (more or less complex) document index. Link analysis is a way to measure URI importance, popularity, trustworthiness and so on. Link analysis is often just a helper within the indexing process, sometimes an end in itself, but traditionally it’s a task of the indexer, not the crawler (highly sophisticated crawling engines do use link data to steer their crawlers, but that has nothing to do with link analysis in document indexes).
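
A correspondingly crude sketch of the indexer side, assuming the crawler’s hand-off from above. The inbound-link counter merely stands in for real link analysis, which is of course far more involved.

from collections import defaultdict

inverted_index = defaultdict(set)   # term -> URIs containing it
inbound_links = defaultdict(int)    # URI  -> crude popularity signal

def index_document(uri, text, outgoing_links):
    # Make sense of the retrieved content: here, a naive tokenization.
    for term in text.lower().split():
        inverted_index[term].add(uri)
    # Link analysis lives here, in the indexer, not in the crawler.
    for target in outgoing_links:
        inbound_links[target] += 1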

A crawler directive like “disallow” in robots.txt can direct crawlers, but means nothing to indexers.
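
For example, a polite crawler written in Python could use the standard library’s robots.txt parser to honor Disallow before it ever requests a URI (host and user agent below are placeholders):

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")  # placeholder host
robots.read()  # fetch and parse the site's robots.txt

# The crawler asks this question *before* requesting the URI.
# An indexer working on already-fetched content never looks at it.
if not robots.can_fetch("ExampleBot", "https://example.com/private/page.html"):
    print("Disallow matched: don't fetch this URI")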

An indexer directive like “noindex” in an HTTP header, an HTML document’s HEAD section, or even a robots.txt file can direct indexers, but means nothing to crawlers, because a crawler has to fetch the document in order to enable the indexer to obey those (inline) directives.
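
The indexer-side check might look roughly like this (deliberately naive header and HTML handling, just to show where “noindex” gets evaluated; note that the document has already been fetched by the time this code runs):

import re

def may_index(headers, html):
    # "noindex" in the X-Robots-Tag HTTP header.
    if "noindex" in headers.get("X-Robots-Tag", "").lower():
        return False
    # "noindex" in a robots meta tag in the document's HEAD section.
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE)
    if meta and "noindex" in meta.group(1).lower():
        return False
    return True

# Example: may_index({"X-Robots-Tag": "noindex"}, "<html>...</html>") -> False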

So when a Web service offers an indexer directive to keep particular content out of its index, but doesn’t offer a crawler directive like

User-agent: SEOservice
Disallow: /

then this Web service doesn’t crawl.

That’s not about semantics, that’s about Web standards.
