Oceanator Framewurk @oceanator - Tumblr Blog

Three Main Issues with Scaling up a Web Crawler

1: Normal politeness needed on the web

Each domain's robots.txt can contain a line that tells you how often you may crawl that domain.

The problem is that multiple domains can be hosted by the same hosting service, and even on the same machine.

Therefore, even though you are polite to each individual domain, you may still be hurting a hosting service or a single machine.

So you have to plan ahead so that you do not hit any one machine or domain too hard.
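One way to picture this is a throttle that is keyed by the machine a URL resolves to rather than by its domain name. The sketch below is only an illustration, not something from the talk: the class name `PerMachineThrottle`, the `min_delay_s` value, and the injectable `resolve` and `clock` parameters are all assumptions made for the example, and it treats one IP address as one machine.

```python
import socket
import time
from urllib.parse import urlsplit


class PerMachineThrottle:
    """Space out fetches that land on the same machine (approximated by IP)."""

    def __init__(self, min_delay_s=1.0, resolve=socket.gethostbyname,
                 clock=time.monotonic):
        self.min_delay_s = min_delay_s  # placeholder; could come from robots.txt
        self.resolve = resolve          # hostname -> IP string
        self.clock = clock              # returns the current time in seconds
        self.next_allowed = {}          # IP -> earliest start time of next fetch

    def wait_time(self, url):
        """Reserve the next slot for this URL's machine; return seconds to wait."""
        host = urlsplit(url).hostname
        ip = self.resolve(host)
        now = self.clock()
        start = max(now, self.next_allowed.get(ip, now))
        self.next_allowed[ip] = start + self.min_delay_s
        return start - now
```

Because the key is the IP, two different domains on one shared host end up queued behind each other, while a domain on another machine is not delayed.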

2: How to get a bunch of machines involved in crawling

If you crawl on just one machine, the state is in very good shape: it is really easy to keep the state of the crawl on one machine.

But then it is hard to build a very big search engine, because you only get as much as one machine can crawl. So in practice, of course, you need to crawl on thousands of machines.

If you are going to obey politeness, each of your thousand machines needs to tell the others what it has crawled, what it is about to crawl, and what it is going to crawl in 20 minutes. All of this communication overhead can slow down the crawler, which hurts your aim of having a big search engine.

3: How to consume a lot of bandwidth to keep the expensive resources busy

One thing people do to lessen the communication is to pre-process the corpus they already have. Once you have crawled a set of pages, you can process them and extract all the links they point to. You can then normalize those links so that yahoo.com and xxxxx.yahoo.com wind up being the same string, and farm those strings out to your thousand machines. The machines then don't have to communicate, because they know they are not going to hit the same domain.
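A rough sketch of that pre-processing step follows. It is an illustration under stated assumptions, not the method from the talk: the function names are made up, and `domain_key` simply keeps the last two labels of the hostname, which is wrong for suffixes such as co.uk (a real crawler would more likely consult the Public Suffix List).

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def domain_key(url):
    """Collapse a URL to a crude domain string: the last two hostname labels."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    return ".".join(labels[-2:])


def shard_for(key, num_machines):
    """Map a domain key to a machine index with a hash stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


def partition_links(links, num_machines):
    """Group extracted links so that all URLs of one domain go to one machine."""
    buckets = defaultdict(list)
    for url in links:
        buckets[shard_for(domain_key(url), num_machines)].append(url)
    return dict(buckets)
```

A cryptographic hash is used instead of Python's built-in `hash()` because the built-in string hash is randomized per process, so different machines would disagree about who owns a domain.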

However, the machines could still accidentally hit the same hosting machine, but you can take care of that ahead of time as well.
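The notes do not say how this is handled ahead of time. One possible approach, sketched here purely as an assumption, is to resolve the domains before farming them out and keep every domain that shares an IP address on the same crawler machine. The function names and the greedy balancing are hypothetical, and the sketch assumes each domain resolves to a single IP.

```python
from collections import defaultdict


def group_domains_by_machine(domains, resolve):
    """Return lists of domains that resolve to the same IP address."""
    by_ip = defaultdict(list)
    for domain in sorted(set(domains)):
        by_ip[resolve(domain)].append(domain)
    return [by_ip[ip] for ip in sorted(by_ip)]


def assign_groups(groups, num_machines):
    """Give each whole group to the crawler machine with the least load so far."""
    load = [0] * num_machines
    assignment = {}
    # Place the biggest groups first so the load evens out better.
    for group in sorted(groups, key=len, reverse=True):
        target = load.index(min(load))
        load[target] += len(group)
        for domain in group:
            assignment[domain] = target
    return assignment
```

Since a group is never split, only one crawler machine ever talks to a given hosting machine, so no cross-machine coordination is needed for it.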

--Notes from Anna Patterson's talk

#web-crawler

Pathological!

I can't find my earphones or my student ID... Whenever I can't find something, I get the urge to Google where it is.

An accurate picture of how I really feel every time I merge.

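Windows 10: Hands-On Experience

As I just mentioned, it feels very clean. Clean in what way?

Right away I noticed the drop shadows on the borderless windows and the animation when a window opens. So this is Mac OS + Material Design.

Multiple desktops, finally! Hey, Linux, look over here!

Trivia: Ctrl+Win+D adds a desktop. Ctrl+Win+F4 closes one.

What else? You can finally use Ctrl+V in cmd!

Anyway, it just feels clean.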

#windows10

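How to Fix the Black Screen in Windows 10 Technical Preview

I have been using Windows 10 for two days now, and it feels very clean. Brave adventurers who don't mind the OneDrive bug should come and conquer 10!

Installing Windows 10 was extremely smooth. I ran into only one mini-boss: shortly after installation the screen went black, and after a reboot it went straight to a black screen where only the cursor was visible when I moved the mouse.

It was clearly a graphics card problem. I rebooted in safe mode, ruthlessly removed the latest graphics driver in Device Manager, and rebooted again.

And with that, you have successfully tamed Windows 10!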

#windows10

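Bittersweet Lollipop

The Lollipop update finally arrived last week.

So this week has been pretty exhausting too.

On Monday I got drafted as QA grunt labor, and every day since has been an endless 10+ hours of work. The devs had it even worse: QA kept their hands pinned to their keyboards, and a whiteboard completely filled with deliverables had them frazzled, to the point that Friday's QA summit turned into the following denunciation session: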

“XX yelled at me when...”

"Then I just yelled back" etc.

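So basically everyone was just yelling back and forth at each other.

Next week I finally get to do dev work. A great dev doesn't need QA! Do your own Q and your own A! You only qualify as a dev if you can guarantee the quality of your own code.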

#android lollipop

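I'm Back

Long time no see.

My long-awaited Southern Hemisphere summer break started a week ago!

This break I'm going to code and rant to my heart's content.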
