Writing a crawler in less than 10 minutes
Hola !
I came across Scrapy a few days ago, and it turned out to be easy to get a simple crawler running with it.
This post assumes you have read a little about Scrapy and walks through the steps to create a simple application with it.
1) Install scrapy
pip install scrapy
2) Start a project with scrapy:
scrapy startproject myscraper
3) Define Items to be scraped in items.py:
Inside myscraper/myscraper/items.py you will find a template for defining the items to be scraped. If you are familiar with Django, this is like defining models, except that there are no data types: every attribute of an Item is a Field().
https://gist.github.com/3563409
4) Define allowed domains/start URLs:
Once you have defined the items, create a new file, say myspider.py, in myscraper/myscraper/spiders/. Subclass BaseSpider (called Spider in newer Scrapy releases), or CrawlSpider if you want to specify which links to allow or deny. Give the crawler a name, and list its allowed_domains and start_urls.
https://gist.github.com/3563319
Specify what to extract from the URLs in the parse method. Scrapy uses XPath by default to select content. You can read more about XPath here. The parse method accepts the HTTP response object and returns an Item. In the above example, I assign the contents of the title tag to ‘bookname’ and the contents of the span tag with the price attribute to ‘price’.
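To get a feel for the kind of XPath expressions involved, here is a small sketch that does not use Scrapy at all. It relies on the limited XPath subset in Python's standard xml.etree.ElementTree module, so it only illustrates the expressions, not Scrapy's own selectors. The sample page, the extract_book helper, and the class="price" attribute are all assumptions made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real HTML is often messier).
SAMPLE_PAGE = """<html>
  <head><title>Example Book</title></head>
  <body><span class="price">9.99</span></body>
</html>"""


def extract_book(page_source):
    """Pull the title and price out of a page using XPath-style queries."""
    root = ET.fromstring(page_source)
    # ".//title" matches the first <title> element anywhere below the root.
    title = root.find(".//title")
    # "[@class='price']" keeps only <span> elements with that attribute value.
    price = root.find(".//span[@class='price']")
    return {
        "bookname": title.text if title is not None else None,
        "price": price.text if price is not None else None,
    }


book = extract_book(SAMPLE_PAGE)
print(book)
```

Inside a real spider the same two expressions would be evaluated against the response by Scrapy's selectors, and the results would be stored in an Item rather than a plain dict.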
5) Define Pipelines:
It's important to store the scraped data somewhere. For this purpose, Scrapy provides pipelines. Define your own pipelines in pipelines.py and enable that class in the ITEM_PIPELINES setting in settings.py.
For example, to write to a JSON file:
https://gist.github.com/3563350
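and in the settings.py file:
ITEM_PIPELINES = ['myscraper.pipelines.MyScraperPipeline']
(Newer Scrapy releases expect a dict here instead, mapping the class path to an order number, e.g. ITEM_PIPELINES = {'myscraper.pipelines.MyScraperPipeline': 300}.)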
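A pipeline that writes JSON could look roughly like the sketch below. It is not necessarily identical to the gist above. It assumes the open_spider, process_item and close_spider hooks of Scrapy's pipeline interface, writes one JSON object per line, and adds a hypothetical path argument (defaulting to results.out) so the class can be tried outside a crawl.

```python
import json


class MyScraperPipeline(object):
    """Writes each scraped item as one JSON object per line."""

    def __init__(self, path="results.out"):
        # Hypothetical parameter; the default lets Scrapy create the
        # pipeline without passing any arguments.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w")

    def process_item(self, item, spider):
        # dict(item) works for Scrapy Items and plain dicts alike.
        self.file.write(json.dumps(dict(item)) + "\n")
        # Return the item so any later pipelines still receive it.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        if self.file is not None:
            self.file.close()
```

Because the class imports nothing from Scrapy, you can call the three methods by hand with a plain dict to check the output format before running a full crawl.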
6) Run your crawler
That's it! You’re done. Run your crawler using:
scrapy crawl avispider
You can see the scraped items in the results.out file. :) Hope this was useful. You can see a full-fledged application using Scrapy here

















