Scrapy 0.12: Parsing with Python
Based on Scrapy Tutorial (dead link: doc.scrapy.org/intro/tutorial.html)
- Install scrapy and dependencies
sudo apt-get install python-lxml
sudo easy_install -U Scrapy
- Create project
scrapy startproject dmoz
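startproject generates a project skeleton roughly like this (the inner dmoz/ package is where items.py and spiders/ live):
dmoz/
    scrapy.cfg
    dmoz/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py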
- Create item models (in dmoz/items.py)
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
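Items behave like dicts, so a quick sanity check from a Python shell might look like this (the field values below are just illustrative):
from dmoz.items import DmozItem

item = DmozItem()
item['title'] = ['Example title']         # setting a declared field works like a dict
item['link'] = ['http://example.com/']
print item['title']                       # ['Example title']
Assigning to a field that was not declared in DmozItem raises an error, which helps catch typos in the spider.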
- Create spiders (in dmoz/spiders/)
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from dmoz.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz.org"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # parse() receives the downloaded response for each start URL
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
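The XPath expressions are easiest to work out interactively. The Scrapy shell downloads a URL and, in the 0.x series, exposes (among other objects) an hxs selector like the one used above, so the selects can be tried before putting them in the spider:
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
>>> hxs.select('//ul/li/a/text()').extract()
>>> hxs.select('//ul/li/a/@href').extract()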
- Run the spider, optionally exporting the scraped items as a JSON or CSV feed
scrapy crawl dmoz.org
scrapy crawl dmoz.org --set FEED_URI=items.json --set FEED_FORMAT=json
scrapy crawl dmoz.org --set FEED_URI=items.csv --set FEED_FORMAT=csv
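After the JSON export has run, the output can be sanity-checked with nothing but the standard library (this assumes the items.json produced by the second command above):
import json

with open('items.json') as f:
    items = json.load(f)

print len(items)            # number of scraped items
print items[0]['title']     # title list of the first item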