Three Main Issues with Scaling up a Web Crawler
1: Normal politeness needed on the web
Each domain's robots.txt can include a line that tells you how often you may crawl that domain.
The problem is that multiple domains can be served by the same hosting service, and even from a single machine.
So even if you are polite to each individual domain, you can still be hurting the hosting service or the machine behind it.
You therefore have to plan the crawl so that you do not hit any one machine or domain too hard.
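One way to picture this is a small gate that checks both the domain and the machine behind it before each fetch. This is only a sketch, not something from the talk: the class name PolitenessGate, the single fixed delay, and the idea of keying on a resolved IP address are all assumptions made for illustration.

```python
import time


class PolitenessGate:
    """Tracks the earliest time each key (a domain or a host IP) may be hit again."""

    def __init__(self, min_delay_seconds, clock=time.monotonic):
        # Assumption: one fixed delay for every key; a real crawler would
        # likely use per-domain values taken from robots.txt.
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.next_allowed = {}  # key -> earliest allowed fetch time

    def try_acquire(self, domain, host_ip):
        """Return True and reserve both keys only if the domain AND its machine are idle."""
        now = self.clock()
        keys = ("domain:" + domain, "ip:" + host_ip)
        if any(self.next_allowed.get(k, float("-inf")) > now for k in keys):
            return False  # either the domain or the machine was hit too recently
        for k in keys:
            self.next_allowed[k] = now + self.min_delay
        return True
```

With a gate like this, two different domains that resolve to the same IP address block each other, which is the case that per-domain politeness alone misses.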
2: How to get a bunch of machines involved in crawling
If you crawl on just one machine, the state is very good: it is really easy to keep the state of the crawl on one machine.
But then it is hard to build a very big search engine, because you only get as much as one machine can crawl. So in practice you need to crawl on thousands of machines.
If you are going to obey politeness, each of your thousand machines needs to tell the others what it has crawled, what it is about to crawl, and what it is going to crawl in 20 minutes. All of this communication overhead can slow down the crawler, which hurts your aim of having a big search engine.
3: How to consume a lot of bandwidth to keep the expensive resources busy
One thing that people do to lessen the communication is to pre-process the corpus they already have. Once you have crawled a set of pages, you can process them and extract all the links they point to. You can then normalize those links so that yahoo.com and xxxxx.yahoo.com wind up being the same string, and farm those strings out to your thousand machines. The machines then don't have to communicate, because they know they are not going to hit the same domain.
They could still accidentally hit the same hosting machine, but you can take care of that ahead of time as well.
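The normalize-and-farm-out step might look roughly like the sketch below. The function names (domain_key, machine_for, partition) are made up for illustration, and the normalization is deliberately naive: it assumes absolute URLs and keeps only the last two labels of the hostname, which is wrong for suffixes such as co.uk. A real crawler would more likely consult the Public Suffix List.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def domain_key(url):
    """Naive normalization (an assumption): keep the last two labels of the hostname."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:])


def machine_for(key, num_machines):
    """Stable assignment: hash the domain key, then reduce it modulo the machine count."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


def partition(urls, num_machines):
    """Group URLs by the machine that owns their normalized domain."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[machine_for(domain_key(url), num_machines)].append(url)
    return buckets
```

Because the assignment depends only on the normalized string, every link to yahoo.com or xxxxx.yahoo.com lands on the same machine without any machine having to ask another.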
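The talk does not say how the shared-machine case is handled ahead of time. One possible approach, sketched here purely as an assumption, is to resolve each domain before the crawl starts and hand all domains that share an IP address to the same crawler. The names group_by_host and assign_groups are hypothetical, and the sketch ignores domains that resolve to several addresses or fail to resolve.

```python
import socket
from collections import defaultdict


def group_by_host(domains, resolve=socket.gethostbyname):
    """Map each resolved IP address to the sorted list of domains served from it."""
    groups = defaultdict(list)
    for domain in domains:
        # resolve is injectable so the grouping can be tested without DNS lookups.
        groups[resolve(domain)].append(domain)
    return {ip: sorted(names) for ip, names in groups.items()}


def assign_groups(groups, num_machines):
    """Give every domain on one IP to the same crawler, spreading IPs round-robin."""
    assignment = {}
    for index, ip in enumerate(sorted(groups)):
        for domain in groups[ip]:
            assignment[domain] = index % num_machines
    return assignment
```

Since a whole IP group goes to one crawler, that crawler can enforce politeness toward the hosting machine locally, with no cross-machine messages.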
--Notes from Anna Patterson's talk











