SEP | 20 |
Title | Bulk Item Loader |
Author | Steven Almeroth |
Created | 2012-02-24 |
Status | Draft |
Just as Item Loaders "provide a convenient mechanism for populating scraped Items", the Bulk Item Loader provides a convenient mechanism for populating Item Loaders.
There are certain markup patterns that lend themselves quite nicely to automated parsing. For example, the <table> tag outlines such a pattern for populating a database table, with the embedded <tr> elements denoting the rows and the further embedded <td> elements denoting the individual fields.
One pattern that is particularly well suited for auto-populating an Item Loader is the definition list:
<div class="geeks">
  <dl>
    <dt> hacker
    <dd> a clever programmer
    <dt> nerd
    <dd> technically bright but socially inept person
  </dl>
</div>
Within the <dl>, each <dt> would contain the Field name and the following <dd> would contain the Field value.
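The pairing can be illustrated without Scrapy. The following is a minimal sketch that uses only the Python 3 standard library. The names `DefinitionListParser` and `parse_definition_list` are hypothetical and are not part of this proposal. Like the `dd[1]` XPath used later in this document, it keeps only the first <dd> after each <dt>.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect term -> description text from <dt>/<dd> elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.pairs = {}        # term text -> raw description text
        self._current = None   # "dt", "dd", or None while outside both
        self._term = None      # text of the most recent <dt>

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._current, self._term = "dt", ""
        elif tag == "dd":
            term = self._term.strip() if self._term else ""
            if term and term not in self.pairs:
                # First <dd> after this <dt>: start collecting its text.
                self._term = term
                self.pairs[term] = ""
                self._current = "dd"
            else:
                # No usable term, or a later <dd> for the same term: skip it.
                self._current = None

    def handle_endtag(self, tag):
        if tag in ("dt", "dd", "dl"):
            self._current = None

    def handle_data(self, data):
        if self._current == "dt":
            self._term += data
        elif self._current == "dd":
            self.pairs[self._term] += data


def parse_definition_list(html):
    """Return {term: description} with whitespace collapsed."""
    parser = DefinitionListParser()
    parser.feed(html)
    parser.close()
    return {term: " ".join(text.split()) for term, text in parser.pairs.items()}
```

Fed the "geeks" markup above, `parse_definition_list` yields one entry for "hacker" and one for "nerd", without either term being named in the code.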
Without a bulk loader, a programmer has to hardcode every entry that is needed. With the bulk loader, on the other hand, only a seed point is required.
# Without the bulk loader: one add_xpath() call per term
xpath = '//div[@class="geeks"]/dl/dt[contains(text(),"%s")]/following-sibling::dd[1]//text()'
gl = XPathItemLoader(response=response, item=dict())
gl.default_output_processor = Compose(TakeFirst(), lambda v: v.strip())
gl.add_xpath("hacker", xpath % "hacker")
gl.add_xpath("nerd", xpath % "nerd")

# With the bulk loader: only the seed XPath of the list is given
bil = BulkItemLoader(response=response)
bil.parse_dl('//div[@class="geeks"]/dl')
This is a working code sample that covers just the basics.
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose


class BulkItemLoader(XPathItemLoader):
    """Item loader based on specified pattern recognition"""

    default_item_class = dict
    base_xpath = "//body"
    ignore = ()

    def _get_label(self, entity):
        """Pull the text label out of selected markup

        :param entity: Found markup
        :type entity: Selector
        """
        label = " ".join(entity.xpath(".//text()").extract())
        # Non-ASCII characters become numeric character references here,
        # so a non-breaking space shows up as "&#160;".
        label = label.encode("ascii", "xmlcharrefreplace") if label else ""
        label = label.strip("&#160;") if "&#160;" in label else label
        label = label.strip(":") if ":" in label else label
        label = label.strip()
        return label

    def _get_entities(self, xpath):
        """Retrieve the list of selectors for a given sub-pattern

        :param xpath: The xpath to select
        :type xpath: String
        :return: The list of selectors
        :rtype: list
        """
        return self.selector.xpath(self.base_xpath + xpath)

    def parse_dl(self, xpath="//dl"):
        """Look for the specified definition list pattern and store all found
        values for the enclosed terms and descriptions.

        :param xpath: The xpath to select
        :type xpath: String
        """
        for term in self._get_entities(xpath + "/dt"):
            label = self._get_label(term)
            if label and label not in self.ignore:
                value = term.xpath("following-sibling::dd[1]//text()")
                if value:
                    self.add_value(
                        label, value.extract(), MapCompose(lambda v: v.strip())
                    )
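The label clean-up and the `ignore` filter can be tried in isolation. The following is a rough Python 3 sketch of that step, not the loader's own code. The function name `clean_label` is hypothetical. It also differs from `_get_label` in two ways: it works on `str` rather than ASCII-encoded bytes, and it collapses runs of whitespace, including non-breaking spaces.

```python
def clean_label(fragments, ignore=()):
    """Join text fragments into one label; return "" if the label is ignored.

    Illustrative stand-in for the label handling in the loader above.
    """
    label = " ".join(fragments)
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes the non-breaking space U+00A0.
    label = " ".join(label.split())
    # Drop leading/trailing colons, then any whitespace they exposed.
    label = label.strip(":").strip()
    return "" if label in ignore else label
```

A subclass that sets `ignore = ("Notes",)` would, under this reading, skip any term whose cleaned label is exactly "Notes".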
This spider uses the bulk loader above.
from pprint import pprint

from scrapy.spider import BaseSpider
from scrapy.contrib.loader.bulk import BulkItemLoader


class W3cSpider(BaseSpider):
    name = "w3c"
    allowed_domains = ["w3.org"]
    start_urls = ("http://www.w3.org/TR/html401/struct/lists.html",)

    def parse(self, response):
        el = BulkItemLoader(response=response)
        el.parse_dl("//dl[2]")
        item = el.load_item()
        pprint(item)
2012-11-19 14:21:22-0600 [scrapy] INFO: Scrapy 0.17.0 started (bot: scrapy-loader)
2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats, HttpCacheMiddleware
2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled item pipelines:
2012-11-19 14:21:22-0600 [w3c] INFO: Spider opened
2012-11-19 14:21:22-0600 [w3c] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-11-19 14:21:22-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-11-19 14:21:22-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-11-19 14:21:22-0600 [w3c] DEBUG: Crawled (200) <GET http://www.w3.org/TR/html401/struct/lists.html> (referer: None) ['cached']
{'Notes': [u'The recipe may be improved by adding raisins.'],
 'The ingredients': [u'',
                     u'100 g. flour',
                     u'',
                     u'10 g. sugar',
                     u'',
                     u'1 cup water',
                     u'',
                     u'2 eggs',
                     u'',
                     u'salt, pepper',
                     u''],
 'The procedure': [u'',
                   u'Mix dry ingredients thoroughly.',
                   u'',
                   u'Pour in wet ingredients.',
                   u'',
                   u'Mix for 10 minutes.',
                   u'',
                   u'Bake for one hour at 300 degrees.',
                   u'']}
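The empty strings in the output appear where a text node is reduced to `''` by `v.strip()`, presumably the whitespace between the nested list items. One possible refinement, sketched here in plain Python 3 as a post-processing step on the loaded item, is to drop those blanks. The function name `drop_blank_values` is hypothetical and not part of the proposal.

```python
def drop_blank_values(item):
    """Return a copy of a {label: [values]} mapping without blank strings."""
    cleaned = {}
    for label, values in item.items():
        kept = [value.strip() for value in values if value.strip()]
        if kept:  # skip labels whose values were all blank
            cleaned[label] = kept
    return cleaned
```

Applied to the item printed above, this would leave 'The ingredients' with its five ingredient strings and no empty entries.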
Other parsers can also be dropped in, such as:
- parse_table() with column designations for key and value,
- parse_ul() with a key/value separator designation,
- parse_ol() with a key/value separator designation,
- parse() with a designation for key/value tags.
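None of these parsers is specified further here, so the following is only a sketch of what the "separator" and "column" designations might mean. It works on text that has already been extracted. The function names `pairs_from_items` and `pairs_from_rows` and their parameters are assumptions made for illustration.

```python
def pairs_from_items(items, separator=":"):
    """Split each list-item text at the first separator into key and value.

    Items without the separator, or with an empty key, are skipped.
    """
    pairs = {}
    for text in items:
        key, found, value = text.partition(separator)
        if found and key.strip():
            pairs[key.strip()] = value.strip()
    return pairs


def pairs_from_rows(rows, key_column=0, value_column=1):
    """Take one cell as the key and another as the value from each table row.

    Rows that are too short, or whose key cell is blank, are skipped.
    """
    pairs = {}
    for cells in rows:
        if max(key_column, value_column) < len(cells) and cells[key_column].strip():
            pairs[cells[key_column].strip()] = cells[value_column].strip()
    return pairs
```

A `parse_ul()` or `parse_ol()` along these lines would pass the text of each <li> to `pairs_from_items`, and a `parse_table()` would pass the <td> texts of each <tr> to `pairs_from_rows`.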
This touches on the subject of embedded intelligence: with a little bootstrapping to indicate what goes where, it would be possible for a general parser to just go out and grab all of the above.