
1 Scrapy

HelloScrapy
Scrapy

—— Google

Scrapy

2 HTML XPath
HTML DOM XPath
HTML

Chrome XPath

3
Scrapy
UR2IM——

4 Scrapy

Scrapy

app

5
JSON APIs AJAX

30
Excel

6 Scrapinghub

7
Scrapy

1——

HTTP
2——

3——

4—— Crawlera

Scrapy

8 Scrapy
Scrapy Twisted

Twisted I/O——Python
Scrapy
1—— pipeline
2——

9 Pipelines
REST APIs
treq
Elasticsearch pipeline
pipeline Google Geocoding API
Elasticsearch
Python
pipeline MySQL
Twisted
pipeline Redis
CPU
pipeline CPU
pipeline

10 Scrapy
Scrapy ——

Scrapy

1——CPU
2-
3- “ ”
4-
5-item /
6-

11 Scrapyd

Scrapyd
URL
settings URL
scrapyd

Apache Spark streaming

1 Scrapy
Scrapy
Scrapy
Scrapy

HelloScrapy
Scrapy
Excel 3
Scrapy

Scrapy Scrapy
7

Scrapy 8 9

Scrapy
100 Scrapy 16
16
1600 3
16 4800 9
4800 Scrapy 4800
Scrapy
API Scrapy

Scrapy
Scrapy Scrapy

Scrapy HTML

Scrapy BeautifulSoup lxml Scrapy Selector lxml


XPath HTML

Scrapy Scrapy https://groups.google.com/forum/#!forum/scrapy-users Stack Overflow http://stackoverflow.com/questions/tagged/scrapy http://scrapy.org/community/

Scrapy Python pipelines


Scrapy

http://doc.scrapy.org/en/latest/news.html
/

Scrapy

Scrapy

50000 Scrapy
MySQL Redis Elasticsearch Google geocoding API
Apach Spark
HTML XPath
2 8
8 9

Python Python Python


Python Scrapy
“Scrapy ” Python
Python Coursera
Python Scrapy

Scrapy

bug

“ ”

“ ” Eric Ries
MVP

Scrapy
App
App “ 1” “ 2” “
433” “Samsung UN55J6200 55-Inch TV” “Richard S.”
MVP

Scrapy 4
App

—— Google

Stack Overflow GitHub


T

DoS

Scrapy
7

User-Agent
Scrapy BOT_NAME
User-Agent URL

Scrapy RobotsTxtMiddleware robots.txt


http://www.google.com/robots.txt

Scrapy
Scrapy

Scrapy Apache Nutch Scrapy


Scrapy XPath
CSS Apache Nutch

Scrapy Apache Solr Elasticsearch Lucene Scrapy


“ ” Scrapy Solr
Elasticsearch 9 Scrapy Scrapy

Scrapy MySQL MongoDB Redis


Scrapy
Scrapy

Scrapy

HTML XPath Scrapy

2 HTML XPath
HTML HTML
XPath

HTML DOM XPath

URL URL , gumtree.com


URL cookies request
HTML XML JSON
HTML
HTML DOM

URL
URL DNS https://mail.google.com/mail/u/0/#inbox mail.google.com DNS
IP 173.194.71.83 https://mail.google.com/mail/u/0/#inbox
https://173.194.71.83/mail/u/0/#inbox

URL

HTML
URL HTML HTML
TextMate Notepad vi Emacs HTML
World Wide Web Consortium
HTML http://example.com HTML

<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type"
content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width,
initial-scale=1" />
<style type="text/css"> body { background-color: ...
} </style>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to be used for
illustrative examples examples in documents.
You may use this domain in examples without
prior coordination or asking for permission.</p>
<p><a href="http://www.iana.org/domains/example">
More information...</a></p>
</div>
</body>
</html>

HTML HTML

HTML

URL href
Example Domain
Scrapy

DOM

HTML HTML
HTML

HTML
Chrome

HTML HTML
HTML
HTML Chrome
Lynx

CSS
HTML Scrapy CSS

DOM

HTML
DOM DOM

XPath HTML
XPath HTML

XPath

Chrome lets you test XPath expressions in its JavaScript console.
For example, on a web page you can run:
$x('//h1')

JavaScript

XPath
HTML
http://example.com/
$x('/html')
[ <html>...</html> ]
$x('/html/body')
[ <body>...</body> ]
$x('/html/body/div')
[ <div>...</div> ]
$x('/html/body/div/h1')
[ <h1>Example Domain</h1> ]
$x('/html/body/div/p')
[ <p>...</p>, <p>...</p> ]
$x('/html/body/div/p[1]')
[ <p>...</p> ]
$x('/html/body/div/p[2]')
[ <p>...</p> ]

p[1] p[2]

$x('//html/head/title')
[ <title>Example Domain</title> ]

XPath //
//p p //a

$x('//p')
[ <p>...</p>, <p>...</p> ]
$x('//a')
[ <a href="http://www.iana.org/domains/example">More information...</a> ]

//a //div//a a
//div/a

$x('//div//a')
[ <a href="http://www.iana.org/domains/example">More information...</a> ]
$x('//div/a')
[ ]
http://example.com/ href

$x('//a/@href')
[href="http://www.iana.org/domains/example"]

text()

$x('//a/text()')
["More information..."]

$x('//div/*')
[<h1>Example Domain</h1>, <p>...</p>, <p>...</p>]

@class XPath //a[@href]


//a[@href="http://www.iana.org/domains/example"]
XPath

$x('//a[@href]')
[<a href="http://www.iana.org/domains/example">More information...</a>]
$x('//a[@href="http://www.iana.org/domains/example"]')
[<a href="http://www.iana.org/domains/example">More information...</a>]
$x('//a[contains(@href, "iana")]')
[<a href="http://www.iana.org/domains/example">More information...</a>]
$x('//a[starts-with(@href, "http://www.")]')
[<a href="http://www.iana.org/domains/example">More information...</a>]
$x('//a[not(contains(@href, "abc"))]')
[ <a href="http://www.iana.org/domains/example">More information...</a>]

http://www.w3schools.com/xsl/xsl_functions.asp

Scrapy

scrapy shell "http://example.com"


HTML HtmlResponse
Chrome xpath( ) $x

response.xpath('/html').extract()
[u'<html><head><title>...</body></html>']
response.xpath('/html/body/div/h1').extract()
[u'<h1>Example Domain</h1>']
response.xpath('/html/body/div/p').extract()
[u'<p>This domain ... permission.</p>', u'<p><a href="http://www.iana.org/domains
/example">More information...</a></p>']
response.xpath('//html/head/title').extract()
[u'<title>Example Domain</title>']
response.xpath('//a').extract()
[u'<a href="http://www.iana.org/domains/example">More information...</a>']
response.xpath('//a/@href').extract()
[u'http://www.iana.org/domains/example']
response.xpath('//a/text()').extract()
[u'More information...']
response.xpath('//a[starts-with(@href, "http://www.")]').extract()
[u'<a href="http://www.iana.org/domains/example">More information...</a>']

Chrome XPath Scrapy

Chrome XPath
Chrome XPath
HTML
Copy XPath


$x('/html/body/div/p[2]/a')
[<a href="http://www.iana.org/domains/example">More information...</a>]

XPath

id firstHeading div span text

//h1[@id="firstHeading"]/span/text()

id toc div ul URL

//div[@id="toc"]/ul//a/@href

class ltr class skin-vector h1 text


class

//*[contains(@class,"ltr") and contains(@class,"skin-vector")]//h1//text()

XPath class CSS


HTML class class link
class link active CSS
link link
active XPath contains( ) class

class infobox table URL

//table[@class="infobox"]//img[1]/@src
class reflist div URL

//div[starts-with(@class,"reflist")]//a/@href

div URL div


References

//*[text()="References"]/../following-sibling::div//a

URL

//img/@src

HTML XPath

Chrome

//*[@id="myid"]/div/div/div[1]/div[2]/div/div[1]/div[1]/a/img

HTML
img id class

//div[@class="thumbnail"]/a/img

class class CSS


class class class

//div[@class="thumbnail"]/a/img
//div[@class="preview green"]/a/img

class class
thumbnail green class thumbnail green
departure-time departure-time div
departure-time

id
id id JavaScript
id XPath

//*[@id="more_info"]//text( )

id

//[@id="order-F4982322"]

XPath id HTML
id

XPath HTML
HTML XPath Chrome XPath
XPath XPath 3
3
Scrapy

$ echo hello world


hello world

echo hello world $

>>> print 'hi'


hi

Python Scrapy >>>

Vagrant
Notepad Notepad++ Sublime Text TextMate Eclipse PyCharm
Linux/Unix vim emacs
nano

Scrapy
Scrapy
Scrapy Vagrant Linux
Vagrant

MacOS

Vagrant MacOS Scrapy

$ easy_install scrapy
Xcode

Windows
Windows Scrapy Windows
Virtualbox Vagrant 64
Windows http://scrapybook
.com/

Linux
Linux Scrapy

Scrapy 1.0.3
1.4

Ubuntu Debian Linux


Ubuntu Ubuntu 14.04 Trusty Tahr - 64 bit apt
Scrapy

$ sudo apt-get update


$ sudo apt-get install python-pip python-lxml python-crypto python-
cssselect python-openssl python-w3lib python-twisted python-dev libxml2-
dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
$ sudo pip install scrapy
PyPI Scrapy
“install Scrapy Ubuntu packages”

Red Hat CentOS Linux


On yum-based Linux distributions (Red Hat, CentOS), install Scrapy as follows:

sudo yum update


sudo yum -y install libxslt-devel pyOpenSSL python-lxml python-devel gcc
sudo easy_install scrapy

GitHub
Scrapy Scrapy Python
https://github.com/scrapy/scrapy

$ git clone https://github.com/scrapy/scrapy.git
$ cd scrapy
$ python setup.py install

virtualenv

Scrapy
Scrapy Scrapy pip easy_install aptitude

$ sudo pip install --upgrade Scrapy

$ sudo easy_install --upgrade scrapy

Scrapy

$ sudo pip install Scrapy==1.0.0


$ sudo easy_install scrapy==1.0.0

Vagrant

Vagrant

Vagrant “ ” Vagrant

—— Scrapy

A Vagrant git Vagrant


$ git clone https://github.com/scalingexcellence/scrapybook.git
$ cd scrapybook

Vagrant

$ vagrant up --no-parallel

Vagrant

$ vagrant ssh

book

$ cd book
$ ls
ch03 ch04 ch05 ch07 ch08 ch09 ch10 ch11 ...

vagrant ssh vagrant halt


vagrant status vagrant halt VirtualBox
vagrant global-status id vagrant halt
Vagrant

Scrapy

UR2IM——

Scrapy
Scrapy UR2IM

The URL
URL URL https://www.gumtree.com/ Gumtree

http://www.gumtree.com/flats-houses/london
URL URL URL https://www.gumtree.com/p/studios-bedsits-rent/split-level Gumtree URL XPath
Gumtree
Scrapy

scrapy shell -s USER_AGENT="Mozilla/5.0" <your url here e.g. http://www.gumtree.com/p/studios-bedsits-rent/...>

Scrapy –pdb

scrapy shell --pdb https://gumtree.com

Gumtree Gumtree URL


Vagrant
Gumtree
Vagrant Gumtree Vagrant http://web:9312/ http://localhost:9312/

Scrapy Vagrant
$ scrapy shell http://web:9312/properties/property_000000.html
...
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x2d4fb10>
[s] item {}
[s] request <GET http://web:9312/.../property_000000.html>
[s] response <200 http://web:9312/.../property_000000.html>
[s] settings <scrapy.settings.Settings object at 0x2d4fa90>
[s] spider <DefaultSpider 'default' at 0x3ea0bd0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local...
[s] view(response) View response in a browser
>>>

Python Ctrl+D

Scrapy Scrapy
GET 200
response.body 50

>>> response.body[:50]
'<!DOCTYPE html>\n<html>\n<head>\n<meta charset="UTF-8"'

Gumtree HTML 5

Item HTML XPath


logo
HTML
HTML

HTML XPath
Chrome XPath h1
SEO HTML h1 h1
SEO

h1

>>> response.xpath('//h1/text()').extract()
[u'set unique family well']

h1 text() h1 text()

>>> response.xpath('//h1').extract()
[u'<h1 itemprop="name" class="space-mbs">set unique family well</h1>']

title

Gumtree itemprop=name XPath //*


[@itemprop="name"][1]/text() XPath 1 [] 1

itemprop="name" Gumtree “You


may also like”

HTML

<strong class="ad-price txt-xlarge txt-emphasis" itemprop="price">£334.39pw</strong>

itemprop="name" XPath //*[@itemprop="price"][1]/text()

>>> response.xpath('//*[@itemprop="price"][1]/text()').extract()
[u'\xa3334.39pw']
Unicode £ 350.00pw

>>> response.xpath('//*[@itemprop="price"][1]/text()').re('[.0-9]+')
[u'334.39']

//*[@itemprop="description"][1]/text()
//*[@itemtype="http://schema.org/Place"][1]/text()

//img[@itemprop="image"][1]/@src text()
URL

XPath

title        //*[@itemprop="name"][1]/text()                     Example value: [u'set unique family well']

price        //*[@itemprop="price"][1]/text()                    Example value (using re()): [u'334.39']

description  //*[@itemprop="description"][1]/text()              Example value: [u'website court warehouse\r\npool...']

address      //*[@itemtype="http://schema.org/Place"][1]/text()  Example value: [u'Angel, London']

image_URL    //*[@itemprop="image"][1]/@src                      Example value: [u'../images/i01.jpg']

HTML XPath Python

Scrapy
Scrapy shell Scrapy
Ctrl+D Scrapy shell Scrapy shell XPath Scrapy
properties
$ scrapy startproject properties
$ cd properties
$ tree
.
├── properties
│ ├── __init__.py
│ ├── items.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders
│ └── __init__.py
└── scrapy.cfg
2 directories, 6 files

Scrapy items.py,
pipelines.py, settings.py spiders
settings pipelines scrapy.cfg

items
items.py
class PropertiesItem

Python

images pipeline image_URL

Location

server url date

Python

url      response.url                   Example value: 'http://web.../property_000000.html'

project  self.settings.get('BOT_NAME')  Example value: 'properties'

spider   self.name                      Example value: 'basic'

server   socket.gethostname()           Example value: 'scrapyserver1'

date     datetime.datetime.now()        Example value: datetime.datetime(2015, 6, 25...)

PropertiesItem properties/items.py

from scrapy.item import Item, Field

class PropertiesItem(Item):
# Primary fields
title = Field()
price = Field()
description = Field()
address = Field()
image_URL = Field()

# Calculated fields
images = Field()
location = Field()

# Housekeeping fields
url = Field()
project = Field()
spider = Field()
server = Field()
date = Field()

Python
tab
tab PropertiesItem
{} begin – end Python

UR2IM

scrapy genspider
$ scrapy genspider basic web

“basic” “basic”

properties.spiders.basic

basic.py properties/spiders basic


web URL basic
scrapy genspider –l –t

properties/spiders/basic.py file ,

import scrapy
class BasicSpider(scrapy.Spider):
name = "basic"
allowed_domains = ["web"]
start_urls = (
'http://www.web/',
)
def parse(self, response):
pass

import Scrapy BasicSpider


scrapy.Spider Scrapy
Spider
parse()
self response self response
Scrapy shell

start_URL Scrapy URL


log() properties/spiders/basic.py

import scrapy
class BasicSpider(scrapy.Spider):
name = "basic"
allowed_domains = ["web"]
start_urls = (
'http://web:9312/properties/property_000000.html',
)
def parse(self, response):
self.log("title: %s" % response.xpath(
'//*[@itemprop="name"][1]/text()').extract())
self.log("price: %s" % response.xpath(
'//*[@itemprop="price"][1]/text()').re('[.0-9]+'))
self.log("description: %s" % response.xpath(
'//*[@itemprop="description"][1]/text()').extract())
self.log("address: %s" % response.xpath(
'//*[@itemtype="http://schema.org/'
'Place"][1]/text()').extract())
self.log("image_URL: %s" % response.xpath(
'//*[@itemprop="image"][1]/@src').extract())

scrapy crawl

$ scrapy crawl basic


INFO: Scrapy 1.0.3 started (bot: properties)
...
INFO: Spider opened
DEBUG: Crawled (200) <GET http://...000.html>
DEBUG: title: [u'set unique family well']
DEBUG: price: [u'334.39']
DEBUG: description: [u'website...']
DEBUG: address: [u'Angel, London']
DEBUG: image_URL: [u'../images/i01.jpg']
INFO: Closing spider (finished)
...

XPath

scrapy parse URL —spider

$ scrapy parse --spider=basic http://web:9312/properties/property_000001.html


PropertiesItem properties item.py
properties.items

from properties.items import PropertiesItem

parse() item =
PropertiesItem()

item['title'] = response.xpath('//*[@itemprop="name"][1]/text()').extract()

return item properties/spiders/basic.py

import scrapy
from properties.items import PropertiesItem
class BasicSpider(scrapy.Spider):
name = "basic"
allowed_domains = ["web"]
start_urls = (
'http://web:9312/properties/property_000000.html',
)
def parse(self, response):
item = PropertiesItem()
item['title'] = response.xpath(
'//*[@itemprop="name"][1]/text()').extract()
item['price'] = response.xpath(
'//*[@itemprop="price"][1]/text()').re('[.0-9]+')
item['description'] = response.xpath(
'//*[@itemprop="description"][1]/text()').extract()
item['address'] = response.xpath(
'//*[@itemtype="http://schema.org/'
'Place"][1]/text()').extract()
item['image_URL'] = response.xpath(
'//*[@itemprop="image"][1]/@src').extract()
return item

“DEBUG ”
DEBUG: Scraped from <200
http://...000.html>
{'address': [u'Angel, London'],
'description': [u'website ... offered'],
'image_URL': [u'../images/i01.jpg'],
'price': [u'334.39'],
'title': [u'set unique family well']}

PropertiesItem Scrapy Items


pipelines “Feed export”

$ scrapy crawl basic -o items.json


$ cat items.json
[{"price": ["334.39"], "address": ["Angel, London"], "description":
["website court ... offered"], "image_URL": ["../images/i01.jpg"],
"title": ["set unique family well"]}]
$ scrapy crawl basic -o items.jl
$ cat items.jl
{"price": ["334.39"], "address": ["Angel, London"], "description":
["website court ... offered"], "image_URL": ["../images/i01.jpg"],
"title": ["set unique family well"]}
$ scrapy crawl basic -o items.csv
$ cat items.csv
description,title,url,price,spider,image_URL...
"...offered",set unique family well,,334.39,,../images/i01.jpg
$ scrapy crawl basic -o items.xml
$ cat items.xml
<?xml version="1.0" encoding="utf-8"?>
<items><item><price><value>334.39</value></price>...</item></items>

Scrapy
CSV and XML are popular because programs such as Excel open them directly, while JSON is popular on
the web thanks to JavaScript. A plain .json export must be parsed as a whole, which becomes painful
for large feeds (say 1 GB), whereas a .jl (JSON Lines) file holds one JSON object per line and can
be processed incrementally.
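As a minimal illustration of the .jl advantage, the sketch below (assuming an items.jl file like the one produced above) reads one item per line without loading the whole feed into memory:

import json

# Stream a JSON Lines feed one record at a time
with open('items.jl') as f:
    for line in f:
        item = json.loads(line)
        print item['title']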

Scrapy
FTP S3 bucket
$ scrapy crawl basic -o "ftp://user:pass@ftp.scrapybook.com/items.json "
$ scrapy crawl basic -o "s3://aws_key:aws_secret@scrapybook/items.json"

URL S3

scrapy parse

$ scrapy parse --spider=basic http://web:9312/properties/property_000001.html


INFO: Scrapy 1.0.3 started (bot: properties)
...
INFO: Spider closed (finished)
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items ------------------------------------------------
[{'address': [u'Plaistow, London'],
'description': [u'features'],
'image_URL': [u'../images/i02.jpg'],
'price': [u'388.03'],
'title': [u'belsize marylebone...deal']}]
# Requests ------------------------------------------------
[]

scrapy parse debug

——

ItemLoader extract() xpath() parse()

def parse(self, response):


l = ItemLoader(item=PropertiesItem(), response=response)
l.add_xpath('title', '//*[@itemprop="name"][1]/text()')
l.add_xpath('price', './/*[@itemprop="price"]'
'[1]/text()', re='[,.0-9]+')
l.add_xpath('description', '//*[@itemprop="description"]'
'[1]/text()')
l.add_xpath('address', '//*[@itemtype='
'"http://schema.org/Place"][1]/text()')
l.add_xpath('image_URL', '//*[@itemprop="image"][1]/@src')
return l.load_item()
self-documenting

ItemLoaders
http://doc.scrapy.org/en/latest/topics/loaders
ItemLoaders XPath/CSS Join() //p

MapCompose() Python Python


MapCompose(float) MapCompose(unicode.strip, unicode.title)

Join()

MapCompose(unicode.strip)

MapCompose(unicode.strip, unicode.title)

MapCompose(float)

MapCompose(lambda i: i.replace(',', ''), float)

MapCompose(lambda i: urlparse.urljoin(response.url, i))    response.url URL

Python unicode.strip() unicode.title()


replace() urljoin()
Lambda

myFunction = lambda i: i.replace(',', '')

def myFunction(i):
    return i.replace(',', '')
Lambda replace() urljoin()
scrapy URL

>>> from scrapy.loader.processors import MapCompose, Join


>>> Join()(['hi','John'])
u'hi John'
>>> MapCompose(unicode.strip)([u' I',u' am\n'])
[u'I', u'am']
>>> MapCompose(unicode.strip, unicode.title)([u'nIce cODe'])
[u'Nice Code']
>>> MapCompose(float)(['3.14'])
[3.14]
>>> MapCompose(lambda i: i.replace(',', ''), float)(['1,400.23'])
[1400.23]
>>> import urlparse
>>> mc = MapCompose(lambda i: urlparse.urljoin('http://my.com/test/abc', i))
>>> mc(['example.html#check'])
['http://my.com/test/example.html#check']
>>> mc(['http://absolute/url#help'])
['http://absolute/url#help']

XPath/CSS

def parse(self, response):


l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
MapCompose(unicode.strip, unicode.title))
l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
MapCompose(lambda i: i.replace(',', ''), float),
re='[,.0-9]+')
l.add_xpath('description', '//*[@itemprop="description"]'
'[1]/text()', MapCompose(unicode.strip), Join())
l.add_xpath('address',
'//*[@itemtype="http://schema.org/Place"][1]/text()',
MapCompose(unicode.strip))
l.add_xpath('image_URL', '//*[@itemprop="image"][1]/@src',
MapCompose(
lambda i: urlparse.urljoin(response.url, i)))

scrapy crawl basic


'price': [334.39],
'title': [u'Set Unique Family Well']

add_value() Python XPath/CSS


“ ” URL

l.add_value('url', response.url)
l.add_value('project', self.settings.get('BOT_NAME'))
l.add_value('spider', self.name)
l.add_value('server', socket.gethostname())
l.add_value('date', datetime.datetime.now())

import datetime socket

Items
items
Scrapy 25

ItemLoaders Python
lambda
ItemLoader
ItemLoaders

def parse(self, response):


""" This function parses a property page.
@url http://web:9312/properties/property_000000.html
@returns items 1
@scrapes title price description address image_URL
@scrapes url project spider server date
"""
URL
scrapy check

$ scrapy check basic


----------------------------------------------------------------
Ran 3 contracts in 1.640s
OK
url
FAIL: [basic] parse (@scrapes post-hook)
------------------------------------------------------------------
ContractFail: 'url' field is missing

XPath

from scrapy.loader.processors import MapCompose, Join


from scrapy.loader import ItemLoader
from properties.items import PropertiesItem
import datetime
import urlparse
import socket
import scrapy

class BasicSpider(scrapy.Spider):
name = "basic"
allowed_domains = ["web"]
# Start on a property page
start_urls = (
'http://web:9312/properties/property_000000.html',
)
def parse(self, response):
""" This function parses a property page.
@url http://web:9312/properties/property_000000.html
@returns items 1
@scrapes title price description address image_URL
@scrapes url project spider server date
"""
# Create the loader using the response
l = ItemLoader(item=PropertiesItem(), response=response)
# Load fields using XPath expressions
l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
MapCompose(unicode.strip, unicode.title))
l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
MapCompose(lambda i: i.replace(',', ''),
float),
re='[,.0-9]+')
l.add_xpath('description', '//*[@itemprop="description"]'
'[1]/text()',
MapCompose(unicode.strip), Join())
l.add_xpath('address',
'//*[@itemtype="http://schema.org/Place"]'
'[1]/text()',
MapCompose(unicode.strip))
l.add_xpath('image_URL', '//*[@itemprop="image"]'
'[1]/@src', MapCompose(
lambda i: urlparse.urljoin(response.url, i)))
# Housekeeping fields
l.add_value('url', response.url)
l.add_value('project', self.settings.get('BOT_NAME'))
l.add_value('spider', self.name)
l.add_value('server', socket.gethostname())
l.add_value('date', datetime.datetime.now())
return l.load_item()

URL
start_URL URL
URL

start_urls = (
'http://web:9312/properties/property_000000.html',
'http://web:9312/properties/property_000001.html',
'http://web:9312/properties/property_000002.html',
)

URL

start_urls = [i.strip() for i in
              open('todo.URL.txt').readlines()]

Gumtree
http://www.gumtree.com/flats-houses/london


——
——

XPath Next page URL


li li next XPath //*[contains(@class,"next")]//@href

URL itemprop="url //*[@itemprop="url"]/@href


scrapy
$ scrapy shell http://web:9312/properties/index_00000.html
>>> URL = response.xpath('//*[contains(@class,"next")]//@href').extract()
>>> URL
[u'index_00001.html']
>>> import urlparse
>>> [urlparse.urljoin(response.url, i) for i in URL]
[u'http://web:9312/scrapybook/properties/index_00001.html']
>>> URL = response.xpath('//*[@itemprop="url"]/@href').extract()
>>> URL
[u'property_000000.html', ... u'property_000029.html']
>>> len(URL)
30
>>> [urlparse.urljoin(response.url, i) for i in URL]
[u'http://..._000000.html', ... /property_000029.html']

URL

manual.py

$ ls
properties scrapy.cfg
$ cp properties/spiders/basic.py properties/spiders/manual.py

properties/spiders/manual.py from scrapy.http import Request


Request manual start_URL parse()
parse_item() parse()

def parse(self, response):


# Get the next index URL and yield Requests
next_selector = response.xpath('//*[contains(@class,'
'"next")]//@href')
for url in next_selector.extract():
yield Request(urlparse.urljoin(response.url, url))

# Get item URL and yield Requests


item_selector = response.xpath('//*[@itemprop="url"]/@href')
for url in item_selector.extract():
yield Request(urlparse.urljoin(response.url, url),
callback=self.parse_item)
yield return return yield

next_requests = []
for url in...
next_requests.append(Request(...))
for url in...
next_requests.append(Request(...))
return next_requests

yield Python

5
-s CLOSESPIDER_ITEMCOUNT=90 7
90

$ scrapy crawl manual -s CLOSESPIDER_ITEMCOUNT=90


INFO: Scrapy 1.0.3 started (bot: properties)
...
DEBUG: Crawled (200) <...index_00000.html> (referer: None)
DEBUG: Crawled (200) <...property_000029.html> (referer: ...index_00000.html)
DEBUG: Scraped from <200 ...property_000029.html>
{'address': [u'Clapham, London'],
'date': [datetime.datetime(2015, 10, 4, 21, 25, 22, 801098)],
'description': [u'situated camden facilities corner'],
'image_URL': [u'http://web:9312/images/i10.jpg'],
'price': [223.88],
'project': ['properties'],
'server': ['scrapyserver1'],
'spider': ['manual'],
'title': [u'Portered Mile'],
'url': ['http://.../property_000029.html']}
DEBUG: Crawled (200) <...property_000028.html> (referer: ...index_00000.
html)
...
DEBUG: Crawled (200) <...index_00001.html> (referer: ...)
DEBUG: Crawled (200) <...property_000059.html> (referer: ...)
...
INFO: Dumping Scrapy stats: ...
'downloader/request_count': 94, ...
'item_scraped_count': 90,
index_00000.html,
debug URL
property_000029.html, property_000028.html ... index_00001.html
referer( index_00000.html) property_000059.html referer
index_00001

Scrapy uses a LIFO queue by default, so the most recently scheduled Requests tend to be processed
first. You can yield Requests and Items in whatever order suits your parsing code. If some Requests
are more important than others, set their priority argument in Request(); the default priority is 0.
Scrapy also filters out duplicate URLs automatically; if you need to re-fetch a URL you have already
visited, pass dont_filter=True to Request().
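Both arguments are part of the standard scrapy Request signature; a minimal sketch of how a callback might use them (the index URL is illustrative):

import urlparse
from scrapy.http import Request

def parse(self, response):
    # priority: larger values are scheduled earlier (the default is 0)
    yield Request(urlparse.urljoin(response.url, 'index_00001.html'), priority=1)
    # dont_filter=True lets a previously seen URL through the duplicate filter
    yield Request(response.url, callback=self.parse_item, dont_filter=True)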

CrawlSpider
Scrapy
CrawlSpider genspider
-t

$ scrapy genspider -t crawl easy web


Created spider 'crawl' using template 'crawl' in module:
properties.spiders.easy

properties/spiders/easy.py

...
class EasySpider(CrawlSpider):
name = 'easy'
allowed_domains = ['web']
start_urls = ['http://www.web/']
rules = (
Rule(LinkExtractor(allow=r'Items/'),
callback='parse_item', follow=True),
)
def parse_item(self, response):
...

CrawlSpider
Spider CrawlSpider rules parse()

start_URL parse_item() parse()


rules rules

rules = (
Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
callback='parse_item')
)

XPath a href LinkExtractor


a href tags attrs LinkExtractor()
Requests(self.parse_item)
parse_item callback Rule URL callback
Rule Rule callback return/yield
Rule() follow True Items

$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=90

Scrapy UR2IM
Items ItemLoaders XPath Items yield
Requests CrawlSpider
Rules

app Scrapy

4 Scrapy
app Appery.io Scrapy Excel
PhoneGap Appcelerator Appcelerator jQuery Mobile Sencha Touch

Appery.io PhoneGap jQuery Mobile iOS Android


Windows Phone HTML5 Appery.io
Appery.io 14
Appery.io
REST APIs
app PhoneGap
PhoneGap

Appery.io

Appery.io Emai
Appery.io

1. Databases 1
2. Create new database 2 scrapy 3
3. Create 4 Scrapy

Appery.io
Appery.io Users
Appery.io

root pass
Users 1 +Row 2 user/row 3,4

Scrapy properties Create new collection


5 properties 6 Add 7
+Col 8
+Col 8 9 10 Create
column 11

Scrapy
API key Settings 1 2 Collections
3

easy.py
tomobile.py

$ ls
properties scrapy.cfg
$ cat properties/spiders/tomobile.py
...
class ToMobileSpider(CrawlSpider):
name = 'tomobile'
allowed_domains = ["scrapybook.s3.amazonaws.com"]
# Start on the first index page
start_urls = (
'http://scrapybook.s3.amazonaws.com/properties/'
'index_00000.html',
)
...

http://web:9312 http://scrapybook.s3.amazonaws.com URL
app

Appery.io pipline Scrapy pipelines items


Python 8 easy_install pip
Vagrant

$ sudo easy_install -U scrapyapperyio


$ sudo pip install --upgrade scrapyapperyio

Scrapy API key 7


properties/settings.py

ITEM_PIPELINES = {'scrapyapperyio.ApperyIoPipeline': 300}


APPERYIO_DB_ID = '<<Your API KEY here>>'
APPERYIO_USERNAME = 'root'
APPERYIO_PASSWORD = 'pass'
APPERYIO_COLLECTION_NAME = 'properties'

APPERYIO_DB_ID API key Appery.io


Appery.io Scrapy

$ scrapy crawl tomobile -s CLOSESPIDER_ITEMCOUNT=90


INFO: Scrapy 1.0.3 started (bot: properties)
...
INFO: Enabled item pipelines: ApperyIoPipeline
INFO: Spider opened
...
DEBUG: Crawled (200) <GET https://api.appery.io/rest/1/db/login?username=
root&password=pass>
...
DEBUG: Crawled (200) <POST https://api.appery.io/rest/1/db/collections/
properties>
...
INFO: Dumping Scrapy stats:
{'downloader/response_count': 215,
'item_scraped_count': 105,
...}
INFO: Spider closed (closespider_itemcount)

ApperyIoPipeline pipeline
100 200 / Appery.io pipeline
api.appery.io URL

Appery.io properties 1 2

Apps 1 Create new app 2


properties 3 Create 4

Appery.io
Scrapy CREATE NEW 5 Datebase Services 6
scrapy 7 properties 8
List 9 Import selected
services 10
app DESIGN

Pages 1 startScreen 2 UI
Scrapy App

Grid 5
6
Rows 1 Apply 7,8

9
10 11

DESIGN DATA
1

Service 2 Add
3 Add Before send Success
Success Mapping

Mapping action editor


UI Expand all
5


Appery.io
Save and return 6

DATA UI DESIGN 7
EVENTS 8 EVENTS Appery.io UI
UI startScreen Load
Invoke service action Datasource restservice1 9
Save 10

app
app UI TEST 1

2
View on Phone

Scrapy http://devcenter.appery.io/tutorials/ Appery.io
EXPORT app

IDE app

Scrapy Appery.io
RESTful API

app Lorem Ipsum

Scrapy

5
3 Items
UR2IM R Request Response

http://web:9312/dynamic http://localhost:9312/dynamic
“user” “pass”
Scrapy

Chrome Network
1 Login 2

Login Network http://localhost:9312/dynamic/login Request Method: POST

GET POST

POST
Chrome
302 FOUND 5
/dynamic/gated
http://localhost:9312/dynamic/gated http://localhost:9312/dynamic/error gated 6
RequestHeaders 7 Cookie 8

HTTP cookie

POST HTTP
Scrapy
3 easy login

class LoginSpider(CrawlSpider):
name = 'login'

github ch05 ch05/properties

http://localhost:9312/dynamic/login POST Scrapy


FormRequest 3 Request formdata

from scrapy.http import FormRequest

start_URL start_requests() URL


FormRequest

# Start with a login request


def start_requests(self):
return [
FormRequest(
"http://web:9312/dynamic/login",
formdata={"user": "user", "pass": "pass"}
)]

CrawlSpider parse() LoginSpider 3


Rules LinkExtractors Scrapy cookies
Scrapy cookies scrapy crawl

$ scrapy crawl login


INFO: Scrapy 1.0.3 started (bot: properties)
...
DEBUG: Redirecting (302) to <GET .../gated> from <POST .../login >
DEBUG: Crawled (200) <GET .../data.php>
DEBUG: Crawled (200) <GET .../property_000001.html> (referer: .../data.
php)
DEBUG: Scraped from <200 .../property_000001.html>
{'address': [u'Plaistow, London'],
'date': [datetime.datetime(2015, 11, 25, 12, 7, 27, 120119)],
'description': [u'features'],
'image_URL': [u'http://web:9312/images/i02.jpg'],
...
INFO: Closing spider (finished)
INFO: Dumping Scrapy stats:
{...
'downloader/request_method_count/GET': 4,
'downloader/request_method_count/POST': 1,
...
'item_scraped_count': 3,

dynamic/login dynamic/gated
POST GET dynamic/gated

URL

$ scrapy crawl login


INFO: Scrapy 1.0.3 started (bot: properties)
...
DEBUG: Redirecting (302) to <GET .../dynamic/error > from <POST .../
dynamic/login>
DEBUG: Crawled (200) <GET .../dynamic/error>
...
INFO: Spider closed (closespider_itemcount)

Scrapy
POST
cookies

http://localhost:9312/dynamic/nonce
Chrome nonce
http://localhost:9312/dynamic/nonce-login
nonce

Scrapy

NonceLoginSpider start_requests()
Request callback parse_welcome()
parse_welcome() FormRequest from_response()
FormRequest FormRequest FormRequest.from_response()

from_response() formname
formnumber

formdata
user pass FormRequest

# Start on the welcome page


def start_requests(self):
return [
Request(
"http://web:9312/dynamic/nonce",
callback=self.parse_welcome)
]
# Post welcome page's first form with the given user/pass
def parse_welcome(self, response):
return FormRequest.from_response(
response,
formdata={"user": "user", "pass": "pass"}
)

$ scrapy crawl noncelogin


INFO: Scrapy 1.0.3 started (bot: properties)
...
DEBUG: Crawled (200) <GET .../dynamic/nonce>
DEBUG: Redirecting (302) to <GET .../dynamic/gated > from <POST .../
dynamic/login-nonce>
DEBUG: Crawled (200) <GET .../dynamic/gated>
...
INFO: Dumping Scrapy stats:
{...
'downloader/request_method_count/GET': 5,
'downloader/request_method_count/POST': 1,
...
'item_scraped_count': 3,

GET /dynamic/nonce POST /dynamic/nonce-login


/dynamic/gated

JSON APIs AJAX


HTML http://localhost:9312/static/
1,2 DOM HTML scrapy shell
Chrome 3,4 HTML

Network 5
static/ jquery.min.js
JavaScript api.json 6 Preview
7 http://localhost:9312/properties/api.json IDs 8

[{
"id": 0,
"title": "better set unique family well"
},
... {
"id": 29,
"title": "better portered mile"
}]

JSON API APIs POST


JSON XPath

Python JSON import json


json.loads response.body JSON Python
3 manual.py JSON IDs
URL Request api.py ApiSpider api
start_URL

start_urls = (
'http://web:9312/properties/api.json',
)

POST start_requests()
Scrapy URL Response parse() import
json JSON

def parse(self, response):


base_url = "http://web:9312/properties/"
js = json.loads(response.body)
for item in js:
id = item["id"]
url = base_url + "property_%06d.html" % id
yield Request(url, callback=self.parse_item)

json.loads response.body JSON Python


The URL is built by appending property_%06d.html to base_url. %06d is Python string formatting:
%d formats the id as a decimal integer and 06 pads it with zeros up to six digits. For each
resulting URL we then yield a Request with a callback, exactly as in Chapter 3.
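A quick check of the %06d padding in a Python shell:

>>> "property_%06d.html" % 5
'property_000005.html'
>>> "property_%06d.html" % 34322
'property_034322.html'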

$ scrapy crawl api


INFO: Scrapy 1.0.3 started (bot: properties)
...
DEBUG: Crawled (200) <GET ...properties/api.json>
DEBUG: Crawled (200) <GET .../property_000029.html>
...
INFO: Closing spider (finished)
INFO: Dumping Scrapy stats:
...
'downloader/request_count': 31, ...
'item_scraped_count': 30,
31 api.json

JSON APIs Item


JSON API “better” “Covent
Garden” API “Better Covent Garden” Items “bette”
parse() parse_item()

parse() Request parse_item()


Response Request meta Response
title JSON

title = item["title"]
yield Request(url, meta={"title": title},callback=self.parse_item)

parse_item() XPath

l.add_value('title', response.meta['title'],
MapCompose(unicode.strip, unicode.title))

add_xpath() add_value() XPath


PropertyItems api.json

30
Scrapy
XPath

&show=50 10 50 100
30 100
3000 100 3000

Gumtree

HTML
itemtype="http://schema.org/Product"

Scrapy shell XPath

$ scrapy shell http://web:9312/properties/index_00000.html


While within the Scrapy shell, let's try to select everything with the Product tag:
>>> p=response.xpath('//*[@itemtype="http://schema.org/Product"]')
>>> len(p)
30
>>> p
[<Selector xpath='//*[@itemtype="http://schema.org/Product"]' data=u'<li
class="listing-maxi" itemscopeitemt'...]

30 Selector Selector Response


XPath
XPath XPath
“.” .//*[@itemprop="name"][1]/text()

>>> selector = p[3]


>>> selector
<Selector xpath='//*[@itemtype="http://schema.org/Product"]' ... '>
>>> selector.xpath('.//*[@itemprop="name"][1]/text()').extract()
[u'l fun broadband clean people brompton european']

Selector for 30 3
manual.py fast.py parse() parse_item()

def parse(self, response):


# Get the next index URL and yield Requests
next_sel = response.xpath('//*[contains(@class,"next")]//@href')
for url in next_sel.extract():
yield Request(urlparse.urljoin(response.url, url))
# Iterate through products and create PropertiesItems
selectors = response.xpath(
'//*[@itemtype="http://schema.org/Product"]')
for selector in selectors:
yield self.parse_item(selector, response)

parse_item() yield

def parse_item(self, selector, response):


# Create the loader using the selector
l = ItemLoader(item=PropertiesItem(), selector=selector)
# Load fields using XPath expressions
l.add_xpath('title', './/*[@itemprop="name"][1]/text()',
MapCompose(unicode.strip, unicode.title))
l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
MapCompose(lambda i: i.replace(',', ''), float),
re='[,.0-9]+')
l.add_xpath('description',
'.//*[@itemprop="description"][1]/text()',
MapCompose(unicode.strip), Join())
l.add_xpath('address',
'.//*[@itemtype="http://schema.org/Place"]'
'[1]/*/text()',
MapCompose(unicode.strip))
make_url = lambda i: urlparse.urljoin(response.url, i)
l.add_xpath('image_URL', './/*[@itemprop="image"][1]/@src',
MapCompose(make_url))
# Housekeeping fields
l.add_xpath('url', './/*[@itemprop="url"][1]/@href',
MapCompose(make_url))
l.add_value('project', self.settings.get('BOT_NAME'))
l.add_value('spider', self.name)
l.add_value('server', socket.gethostname())
l.add_value('date', datetime.datetime.now())
return l.load_item()

ItemLoader selector Response ItemLoader

“.” XPath XPath

XPath
XPath

response.url URL Item URL


URL .//*[@itemprop="url"][1]/@href URL
MapCompose URL

$ scrapy crawl fast -s CLOSESPIDER_PAGECOUNT=3


...
INFO: Dumping Scrapy stats:
'downloader/request_count': 3, ...
'item_scraped_count': 90,...

90 93

scrapy parse spider

$ scrapy parse --spider=fast http://web:9312/properties/index_00000.html


...
>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items --------------------------------------------
[{'address': [u'Angel, London'],
... 30 items...
# Requests ---------------------------------------------------
[<GET http://web:9312/properties/index_00001.html>]

parse() 30 Items scrapy parse


—depth=2

Excel

XPath

ch05 properties

$ pwd
/root/book/ch05/properties
$ cd ..
$ pwd
/root/book/ch05

generic fromcsv

$ scrapy startproject generic


$ cd generic
$ scrapy genspider fromcsv example.com

.csv Excel
URL XPath scrapy.cfg todo.csv
csv


$ cat todo.csv
url,name,price
a.html,"//*[@id=""itemTitle""]/text()","//*[@id=""prcIsum""]/text()"
b.html,//h1/text(),//span/strong/text()
c.html,"//*[@id=""product-desc""]/span/text()"

Python csv import csv dict


csv Python

$ pwd
/root/book/ch05/generic2
$ python
>>> import csv
>>> with open("todo.csv", "rU") as f:
...     reader = csv.DictReader(f)
...     for line in reader:
...         print line

header dict
dict for

{'url': ' http://a.html', 'price': '//*[@id="prcIsum"]/text()', 'name': '//*[@id="itemTitle"]/text()'}
{'url': ' http://b.html', 'price': '//span/strong/text()', 'name': '//h1/text()'}
{'url': ' http://c.html', 'price': '', 'name': '//*[@id="product-desc"]/span/text()'}

generic/spiders/fromcsv.py .csv URL


start_URL allowed_domains .csv

URL start_requests()
Request request,meta csv XPath
parse() Item ItemLoader Item

import csv
import scrapy
from scrapy.http import Request
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field
class FromcsvSpider(scrapy.Spider):
name = "fromcsv"
def start_requests(self):
with open("todo.csv", "rU") as f:
reader = csv.DictReader(f)
for line in reader:
request = Request(line.pop('url'))
request.meta['fields'] = line
yield request
def parse(self, response):
item = Item()
l = ItemLoader(item=item, response=response)
for name, xpath in response.meta['fields'].iteritems():
if xpath:
item.fields[name] = Field()
l.add_xpath(name, xpath)
return l.load_item()

csv

$ scrapy crawl fromcsv -o out.csv


INFO: Scrapy 1.0.3 started (bot: generic)
...
DEBUG: Scraped from <200 a.html>
{'name': [u'My item'], 'price': [u'128']}
DEBUG: Scraped from <200 b.html>
{'name': [u'Getting interesting'], 'price': [u'300']}
DEBUG: Scraped from <200 c.html>
{'name': [u'Buy this now']}
...
INFO: Spider closed (finished)
$ cat out.csv
price,name
128,My item
300,Getting interesting
,Buy this now

Items ItemLoader

item = Item()
l = ItemLoader(item=item, response=response)
Item fields ItemLoader

item.fields[name] = Field()
l.add_xpath(name, xpath)

todo.csv Scrapy
-a -a variable=value self.variable
Python getattr()
getattr(self, 'variable', 'default') with open…

with open(getattr(self, "file", "todo.csv"), "rU") as f:

todo.csv -a
another_todo.csv

$ scrapy crawl fromcsv -a file=another_todo.csv -o out.csv

Scrapy FormRequest /
meta XPath Selectors .csv

6 Scrapinghub 7 Scrapy

6 Scrapinghub

Amazon RackSpace
Scrapinghub

Scrapinghub Scrapy Amazon


http://scrapinghub.com/
+Service 1

properties 2 Create 3 new 4

Jobs Spiders
Periodic Jobs

Settings 1 Scrapinghub

Scrapy Deploy 2

Scrapy Deploy url scrapy.cfg


[deploy] 4 properties
4
settings.py Appery.io pipeline

ch06 ch06/properties

$ pwd
/root/book/ch06/properties
$ ls
properties scrapy.cfg
$ cat scrapy.cfg
...
[settings]
default = properties.settings
# Project: properties
[deploy]
url = http://dash.scrapinghub.com/api/scrapyd/
username = 180128bc7a0.....50e8290dbf3b0
password =
project = 28814

Scrapinghub shub pip install shub


shub login Scrapinghub
$ shub login
Insert your Scrapinghub API key : 180128bc7a0.....50e8290dbf3b0
Success.

scrapy.cfg API key Scrapinghub


API key API key shub deploy

$ shub deploy
Packing version 1449092838
Deploying to project "28814"in {"status": "ok", "project": 28814,
"version":"1449092838", "spiders": 1}
Run your spiders at: https://dash.scrapinghub.com/p/28814/

Scrapy Scrapinghub

$ ls
build project.egg-info properties scrapy.cfg setup.py
$ rm -rf build project.egg-info setup.py

Scrapinghub Spiders 1 tomobile

2
Schedule 3 Schedule 4

Running Jobs Requests Items

Scrapinghub

6 Stop 7

Completed Jobs 8

Items Requests Log 10


11 Items 12
13 CSV JSON JSON Lines

Scrapinghub Items API


URL
https://dash.scrapinghub.com/p/28814/job/1/1/

URL 28814 scrapy.cfg 1 “tomobile” ID


1 curl
https://storage.scrapinghub.com/items/<project id>/<spider id>/<job id> API key

$ curl -u 180128bc7a0.....50e8290dbf3b0: https://storage.scrapinghub.com/items/28814/1/1
{"_type":"PropertiesItem","description":["same\r\nsmoking\r\nr...
{"_type":"PropertiesItem","description":["british bit keep eve...
...

Scrapinghub

Periodic Jobs 1 Add 2 3 4


Save 5
Scrapy Scrapinghub
API Scrapinghub

7
Scrapy Scrapy
Scrapy
Scrapy

Scrapy
Scrapy
scrapy/settings/default_settings.py

settings.py

custom_settings
Pipelines
-s -s CLOSESPIDER_PAGECOUNT=3
API settings.py

$ scrapy settings --get CONCURRENT_REQUESTS


16

settings.py CONCURRENT_REQUESTS
14 14

$ scrapy settings --get CONCURRENT_REQUESTS -s CONCURRENT_REQUESTS=19


19
scrapy crawl scrapy settings

$ scrapy shell -s CONCURRENT_REQUESTS=19


>>> settings.getint('CONCURRENT_REQUESTS')
19

Scrapy

Scrapy

Scrapy

Scrapy DEBUG INFO WARNING ERROR CRITICAL


SILENT Scrapy Log Stats
LOGSTATS_INTERVAL 60
5 LOG_FILE
LOG_ENABLED False
LOG_STDOUT True Scrapy
print
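A sketch of how these logging settings might look in settings.py; the values and the log file name are just examples:

# settings.py - illustrative logging configuration
LOG_LEVEL = 'INFO'            # DEBUG, INFO, WARNING, ERROR or CRITICAL
LOG_FILE = 'crawl.log'        # write the log to a file instead of stderr
LOG_ENABLED = True            # set False to silence logging completely
LOG_STDOUT = True             # capture print output in the log as well
LOGSTATS_INTERVAL = 60.0      # seconds between Log Stats lines (the default)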

STATS_DUMP Stats Collector


DOWNLOADER_STATS DEPTH_STATS

DEPTH_STATS_VERBOSE True STATSMAILER_RCPTS email

Scrapy Python Scrapy


TELNETCONSOLE_ENABLED TELNETCONSOLE_PORT

1——
Scrapy

ch07 ch07/properties

$ pwd
/root/book/ch07/properties
$ ls
properties scrapy.cfg
Start a crawl as follows:
$ scrapy crawl fast
...
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023

6023

$ telnet localhost 6023


>>>
Scrapy Python engine
est()

>>> est()
Execution engine status
time()-engine.start_time : 5.73892092705
engine.has_capacity() : False
len(engine.downloader.active) : 8
...
len(engine.slot.inprogress) : 10
...
len(engine.scraper.slot.active) : 2

10

>>> import time


>>> time.sleep(1) # Don't do this!

>>> engine.pause()
>>> engine.unpause()
>>> engine.stop()
Connection closed by foreign host.

10
CONCURRENT_REQUESTS
/IPs
CONCURRENT_REQUESTS_PER_DOMAIN CONCURRENT_REQUESTS_PER_IP
IP
CONCURRENT_REQUESTS_PER_IP CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS = 16
16/0.25 = 64 CONCURRENT_ITEMS
100 10 1
pipelines
CONCURRENT_REQUESTS = 16 CONCURRENT_ITEMS = 100 1600
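A settings.py sketch of the concurrency knobs just discussed, with the arithmetic from the text repeated as comments (the values are examples, not recommendations):

# settings.py - concurrency settings
CONCURRENT_REQUESTS = 16            # overall number of requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain limit
CONCURRENT_REQUESTS_PER_IP = 0      # if non-zero, overrides the per-domain limit
CONCURRENT_ITEMS = 100              # items processed in parallel per response
# At roughly 0.25 s per page, 16 concurrent requests give about 16 / 0.25 = 64
# pages per second, and up to 16 * 100 = 1600 items can be in the item
# pipelines at the same time.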

DOWNLOAD_TIMEOUT 180
16 5 10
0 DOWNLOAD_DELAY
DOWNLOAD_DELAY ±50%
RANDOMIZE_DOWNLOAD_DELAY False
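The same settings in settings.py form, with example values:

# settings.py - download politeness settings
DOWNLOAD_TIMEOUT = 180             # give up on a download after 180 s (the default)
DOWNLOAD_DELAY = 0.25              # seconds between requests; the default is 0
RANDOMIZE_DOWNLOAD_DELAY = True    # vary the delay by +/-50% (the default)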

DNS DNSCACHE_ENABLED DNS

Scrapy CloseSpider
CLOSESPIDER_TIMEOUT(in seconds) CLOSESPIDER_ITEMCOUNT
CLOSESPIDER_PAGECOUNT CLOSESPIDER_ERRORCOUNT

$ scrapy crawl fast -s CLOSESPIDER_ITEMCOUNT=10


$ scrapy crawl fast -s CLOSESPIDER_PAGECOUNT=10
$ scrapy crawl fast -s CLOSESPIDER_TIMEOUT=10

HTTP
Scrapy HttpCacheMiddleware HTTP
HTTPCACHE_POLICY
scrapy.contrib.httpcache.RFC2616Policy RFC2616
HTTPCACHE_ENABLED True HTTPCACHE_DIR

HTTPCACHE_STORAGE
scrapy.contrib.httpcache.DbmCacheStorage HTTPCACHE_DBM_MODULE
anydbm
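A minimal settings.py sketch that enables the HTTP cache using the policy and storage classes named above (values are illustrative):

# settings.py - HTTP cache configuration
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'    # stored under the project's .scrapy directory
HTTPCACHE_POLICY = 'scrapy.contrib.httpcache.RFC2616Policy'
HTTPCACHE_STORAGE = 'scrapy.contrib.httpcache.DbmCacheStorage'
HTTPCACHE_DBM_MODULE = 'anydbm'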

2——
$ scrapy crawl fast -s LOG_LEVEL=INFO -s CLOSESPIDER_ITEMCOUNT=5000

$ scrapy crawl fast -s LOG_LEVEL=INFO -s CLOSESPIDER_ITEMCOUNT=5000 -s HTTPCACHE_ENABLED=1
...
INFO: Enabled downloader middlewares:...*HttpCacheMiddleware*

HttpCacheMiddleware

$ tree .scrapy | head


.scrapy
└── httpcache
└── easy
├── 00
│ ├── 002054968919f13763a7292c1907caf06d5a4810
│ │ ├── meta
│ │ ├── pickled_meta
│ │ ├── request_body
│ │ ├── request_headers
│ │ ├── response_body
...

$ scrapy crawl fast -s LOG_LEVEL=INFO -s CLOSESPIDER_ITEMCOUNT=4500 -s


HTTPCACHE_ENABLED=1

CLOSESPIDER_ITEMCOUNT

$ rm -rf .scrapy
Scrapy DEPTH_LIMIT 0
DEPTH_PRIORITY
LIFO FIFO

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

Scrapy
DEPTH_LIMIT 3

robots.txt
ROBOTSTXT_OBEY True Scrapy True

CookiesMiddleware cookie session


COOKIES_ENABLED False cookies
REFERER_ENABLED True RefererMiddleware
Referer headers DEFAULT_REQUEST_HEADERS headers
settings.py
USER_AGENT
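A settings.py sketch of these request-identity settings; the extra header and the User-Agent string are hypothetical examples:

# settings.py - cookies, Referer and default headers
COOKIES_ENABLED = True                # set False to disable cookie/session handling
REFERER_ENABLED = True                # RefererMiddleware fills in Referer headers
DEFAULT_REQUEST_HEADERS = {
    'Accept-Language': 'en',          # example header sent with every request
}
USER_AGENT = 'properties-bot (+http://www.example.com)'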

Feeds
Feeds Scrapy
FEED_URI scrapy crawl fast -o "%(name)s_%(time)s.jl"
%(foo)s, feed
foo S3 FTP URI
FEED_URI='s3://mybucket/file.json' Amazon AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY Amazon S3 JSON
JSON Lines CSV XML FEED_FORMAT Scrapy
FEED_URI FEED_STORE_EMPTY True
FEED_EXPORT_FIELDS .csv
header FEED_URI_PARAMS FEED_URI
Scrapy Image Pipeline

IMAGES_STORE
URL image_URL IMAGES_URLS_FIELD
image IMAGES_RESULT_FIELD
IMAGES_MIN_WIDTH IMAGES_MIN_HEIGHT IMAGES_EXPIRES
IMAGES_THUMBS
Scrapy

Files Pipelines FILES_STORE


FILES_EXPIRES FILES_URLS_FIELD FILES_RESULT_FIELD pipelines

3——
pip install image
Image Pipeline settings.py ITEM_PIPELINES
scrapy.pipelines.images.ImagesPipeline IMAGES_STORE "images"
IMAGES_THUMBS

ITEM_PIPELINES = {
...
'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'images'
IMAGES_THUMBS = { 'small': (30, 30) }

Item image_URL

$ scrapy crawl fast -s CLOSESPIDER_ITEMCOUNT=90


...
DEBUG: Scraped from <200 http://web:9312/.../index_00003.html/property_000001.html>{
'image_URL': [u'http://web:9312/images/i02.jpg'],
'images': [{'checksum': 'c5b29f4b223218e5b5beece79fe31510',
'path': 'full/705a3112e67...a1f.jpg',
'url': 'http://web:9312/images/i02.jpg'}],
...
$ tree images
images
├── full
│ ├── 0abf072604df23b3be3ac51c9509999fa92ea311.jpg
│ ├── 1520131b5cc5f656bc683ddf5eab9b63e12c45b2.jpg
...
└── thumbs
└── small
├── 0abf072604df23b3be3ac51c9509999fa92ea311.jpg
├── 1520131b5cc5f656bc683ddf5eab9b63e12c45b2.jpg
...

Images jpg
rm -rf images

Scrapy AWS access key AWS_ACCESS_KEY_ID


secret key AWS_SECRET_ACCESS_KEY

s3:// http:// URL


media pipelines s3://
s3:// settings.py

Scrapy HttpProxyMiddleware http_proxy https_proxy


no_proxy

4—— Crawlera
DynDNS IP Scrapy shell checkip.dyndns.org
IP

$ scrapy shell http://checkip.dyndns.org


>>> response.body
'<html><head><title>Current IP Check</title></head><body>Current IP
Address: xxx.xxx.xxx.xxx</body></html>\r\n'
>>> exit()

shell export HMA


http://proxylist.hidemyass.com/
IP 10.10.1.1 80

$ # First check if you already use a proxy


$ env | grep http_proxy
$ # We should have nothing. Now let's set a proxy
$ export http_proxy=http://10.10.1.1:80

Scrapy shell IP
IP Scrapy shell unset http_proxy

Crawlera Scrapinghub IP
http_proxy

$ export http_proxy=myusername:mypassword@proxy.crawlera.com:8010

HTTP Scrapy Crawlera

Scrapy Scrapy

BOT_NAME SPIDER_MODULES
Scrapy
startproject genspider
MAIL_FROM MailSender
STATSMAILER_RCPTS MEMUSAGE_NOTIFY_MAIL
SCRAPY_SETTINGS_MODULE SCRAPY_PROJECT Scrapy
Django scrapy.cfg
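MAIL_FROM and the other mail-related settings mentioned above feed Scrapy's MailSender helper. A minimal, hypothetical sketch of sending a notification (the recipient address is made up; settings would normally be crawler.settings):

from scrapy.mail import MailSender

# Build a mailer from the project settings and send a short notification
mailer = MailSender.from_settings(settings)
mailer.send(to=['ops@example.com'], subject='Crawl finished',
            body='The spider has finished running.')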

Scrapy
Scrapy ITEM_PIPELINES
Item Processing Pipelines 9 pipelines
Scrapy 8 COMMANDS_MODULE
properties/hi.py

from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
default_settings = {'LOG_ENABLED': False}
def run(self, args, opts):
print("hello")

settings.py COMMANDS_MODULE='properties.hi' Scrapy help


hi default_settings settings.py

Scrapy -_BASE FEED_EXPORTERS_BASE


settings.py non-_BASE
FEED_EXPORTERS

Scrapy DOWNLOADER SCHEDULER


scrapy.core.downloader.Downloader
DOWNLOADER

RETRY_, REDIRECT_ METAREFRESH_* Retry Redirect Meta-Refresh


REDIRECT_PRIORITY_ADJUST 2
REDIRECT_MAX_TIMES 20 20

HTTPERROR_ALLOWED_CODES
URLLENGTH_LIMIT

AUTOTHROTTLE_*

DOWNLOAD_DELAY 0
MEMUSAGE_*
MEMUSAGE_LIMIT_MB 0
email Unix

MEMDEBUG_ENABLED MEMDEBUG_NOTIFY
trackref

LOG_ENCODING LOG_DATEFORMAT LOG_FORMAT


Splunk Logstash Kibana
DUPEFILTER_DEBUG COOKIES_DEBUG
session

Scrapy
Scrapy

8 Scrapy
scrapy

Items

Items

Scrapy Scrapy

Scrapy Scrapy
Scrapy Twisted
Scrapy Twisted
Scrapy Twisted Python Twisted
Scrapy
Scrapy GitHub

Twisted

Twisted

Item

CPUs

Twisted/Scrapy I/O select() poll()


epoll() “ ” result = i_block() Twisted
deferred = i_dont_block()
deferred.addCallback(process_result) Twisted
Twisted Twisted “ ”

I/O
CPU

OS
Twisted
inlineCallbacks

I/O Node.js Node.js


Node.js APIs Java Netty NIO
Apache Storm Spark C++11 std::future std::promise
libevent plain POSIX
Twisted Twisted APIs

ch08 ch08/deferreds.py file ./deferreds.py 0

Python

$ python
>>> from twisted.internet import defer
>>> # Experiment 1
>>> d = defer.Deferred()
>>> d.called
False
>>> d.callback(3)
>>> d.called
True
>>> d.result
3

d callback called
True result

>>> # Experiment 2
>>> d = defer.Deferred()
>>> def foo(v):
... print "foo called"
... return v+1
...
>>> d.addCallback(foo)
<Deferred at 0x7f...>
>>> d.called
False
>>> d.callback(3)
foo called
>>> d.called
True
>>> d.result
4
foo() d callback(3) foo() d

>>> # Experiment 3
>>> def status(*ds):
...     return [(getattr(d, 'result', "N/A"), len(d.callbacks)) for d in ds]
>>> def b_callback(arg):
... print "b_callback called with arg =", arg
... return b
>>> def on_done(arg):
... print "on_done called with arg =", arg
... return arg
>>> # Experiment 3.a
>>> a = defer.Deferred()
>>> b = defer.Deferred()

a
b_callback() b a on_done() status()
[('N/A', 2), ('N/A', 0)]
a [(, 1), ('N/A', 1)] a
b b_callback() on_done()
b[(4, 0), (None, 0)]

>>> # Experiment 3.b


>>> a = defer.Deferred()
>>> b = defer.Deferred()
>>> a.addCallback(b_callback).addCallback(on_done)
>>> status(a, b)
[('N/A', 2), ('N/A', 0)]
>>> b.callback(4)
>>> status(a, b)
[('N/A', 2), (4, 0)]
>>> a.callback(3)
b_callback called with arg = 3
on_done called with arg = 4
>>> status(a, b)
[(4, 0), (None, 0)]
a 3 b b [('N/A', 2), (4, 0)] a

b b

Twisted defer.DeferredList

>>> # Experiment 4
>>> deferreds = [defer.Deferred() for i in xrange(5)]
>>> join = defer.DeferredList(deferreds)
>>> join.addCallback(on_done)
>>> for i in xrange(4):
... deferreds[i].callback(i)
>>> deferreds[4].callback(4)
on_done called with arg = [(True, 0), (True, 1), (True, 2),
(True, 3), (True, 4)]

for on_done()
deferreds[4].callback() on_done()
True /False

Twisted I/O——Python
Python

# ~*~ Twisted - A Python tale ~*~


from time import sleep
# Hello, I'm a developer and I mainly setup Wordpress.
def install_wordpress(customer):
# Our hosting company Threads Ltd. is bad. I start installation and...
print "Start installation for", customer
# ...then wait till the installation finishes successfully. It is
# boring and I'm spending most of my time waiting while consuming
# resources (memory and some CPU cycles). It's because the process
# is *blocking*.
sleep(3)
print "All done for", customer
# I do this all day long for our customers
def developer_day(customers):
for customer in customers:
install_wordpress(customer)
developer_day(["Bill", "Elon", "Steve", "Mark"])

$ ./deferreds.py 1
------ Running example 1 ------
Start installation for Bill
All done for Bill
Start installation
...
* Elapsed time: 12.03 seconds

Four customers at 3 seconds each take about 12 seconds.

import threading
# The company grew. We now have many customers and I can't handle the
# workload. We are now 5 developers doing exactly the same thing.
def developers_day(customers):
# But we now have to synchronize... a.k.a. bureaucracy
lock = threading.Lock()
#
def dev_day(id):
print "Goodmorning from developer", id
# Yuck - I hate locks...
lock.acquire()
while customers:
customer = customers.pop(0)
lock.release()
# My Python is less readable
install_wordpress(customer)
lock.acquire()
lock.release()
print "Bye from developer", id
# We go to work in the morning
devs = [threading.Thread(target=dev_day, args=(i,)) for i in
range(5)]
[dev.start() for dev in devs]
# We leave for the evening
[dev.join() for dev in devs]
# We now get more done in the same time but our dev process got more
# complex. As we grew we spend more time managing queues than doing dev
# work. We even had occasional deadlocks when processes got extremely
# complex. The fact is that we are still mostly pressing buttons and
# waiting but now we also spend some time in meetings.
developers_day(["Customer %d" % i for i in xrange(15)])

$ ./deferreds.py 2
------ Running example 2 ------
Goodmorning from developer 0Goodmorning from developer
1Start installation forGoodmorning from developer 2
Goodmorning from developer 3Customer 0
...
from developerCustomer 13 3Bye from developer 2
* Elapsed time: 9.02 seconds

Five developers handle 15 customers at 3 seconds each: 45 seconds of sequential work finishes in about 45 / 5 = 9 seconds.

Twisted

# For years we thought this was all there was... We kept hiring more
# developers, more managers and buying servers. We were trying harder
# optimising processes and fire-fighting while getting mediocre
# performance in return. Till luckily one day our hosting
# company decided to increase their fees and we decided to
# switch to Twisted Ltd.!
from twisted.internet import reactor
from twisted.internet import defer
from twisted.internet import task
# Twisted has a slightly different approach
def schedule_install(customer):
# They are calling us back when a Wordpress installation completes.
# They connected the caller recognition system with our CRM and
# we know exactly what a call is about and what has to be done
# next.
#
# We now design processes of what has to happen on certain events.
def schedule_install_wordpress():
def on_done():
print "Callback: Finished installation for", customer
print "Scheduling: Installation for", customer
return task.deferLater(reactor, 3, on_done)
#
def all_done(_):
print "All done for", customer
#
# For each customer, we schedule these processes on the CRM
# and that
# is all our chief-Twisted developer has to do
d = schedule_install_wordpress()
d.addCallback(all_done)
#
return d
# Yes, we don't need many developers anymore or any synchronization.
# ~~ Super-powered Twisted developer ~~
def twisted_developer_day(customers):
print "Goodmorning from Twisted developer"
#
# Here's what has to be done today
work = [schedule_install(customer) for customer in customers]
# Turn off the lights when done
join = defer.DeferredList(work)
join.addCallback(lambda _: reactor.stop())
#
print "Bye from Twisted developer!"

# Even his day is particularly short!


twisted_developer_day(["Customer %d" % i for i in xrange(15)])
# Reactor, our secretary uses the CRM and follows-up on events!
reactor.run()

$ ./deferreds.py 3
------ Running example 3 ------
Goodmorning from Twisted developer
Scheduling: Installation for Customer 0
....
Scheduling: Installation for Customer 14
Bye from Twisted developer!
Callback: Finished installation for Customer 0
All done for Customer 0
Callback: Finished installation for Customer 1
All done for Customer 1
...
All done for Customer 14
* Elapsed time: 3.18 seconds
Fifteen customers, 45 seconds of sequential work, served in just over 3 seconds.
sleep() task.deferLater()
15

CPUs
CPU CPUs
I/O CPUs task.deferLater()

Goodmorning from Twisted developer Bye from Twisted developer!

Twisted reactor.run()
CRM
reactor.run()

# Twisted gave us utilities that make our code way more readable!
@defer.inlineCallbacks
def inline_install(customer):
print "Scheduling: Installation for", customer
yield task.deferLater(reactor, 3, lambda: None)
print "Callback: Finished installation for", customer
print "All done for", customer
def twisted_developer_day(customers):
... same as previously but using inline_install()
instead of schedule_install()
twisted_developer_day(["Customer %d" % i for i in xrange(15)])
reactor.run()

$ ./deferreds.py 4
... exactly the same as before

inlineCallbacks Python
inline_install() inline_install() yield
inline_install()

15 10000 10000
HTTP
Scrapy
CONCURRENT_ITEMS

@defer.inlineCallbacks
def inline_install(customer):
... same as above
# The new "problem" is that we have to manage all this concurrency to
# avoid causing problems to others, but this is a nice problem to have.
def twisted_developer_day(customers):
print "Goodmorning from Twisted developer"
work = (inline_install(customer) for customer in customers)
#
# We use the Cooperator mechanism to make the secretary not
# service more than 5 customers simultaneously.
coop = task.Cooperator()
join = defer.DeferredList([coop.coiterate(work) for i in xrange(5)])
#
join.addCallback(lambda _: reactor.stop())
print "Bye from Twisted developer!"
twisted_developer_day(["Customer %d" % i for i in xrange(15)])
reactor.run()
# We are now more lean than ever, our customers happy, our hosting
# bills ridiculously low and our performance stellar.
# ~*~ THE END ~*~

$ ./deferreds.py 5
------ Running example 5 ------
Goodmorning from Twisted developer
Bye from Twisted developer!
Scheduling: Installation for Customer 0
...
Callback: Finished installation for Customer 4
All done for Customer 4
Scheduling: Installation for Customer 5
...
Callback: Finished installation for Customer 14
All done for Customer 14
* Elapsed time: 9.19 seconds

3 5

Scrapy

Requests Responses Items


Items

Item Item Pipelins process_item()


process_item() Items pipelines
Item DropItem
pipelines Item open_spider() / close_spider()
Item Pipelines
Items
4 Appery.io pipeline Items
Appery.io

Scrapy cookies

Scrapy GitHub settings/default_settings.py


DOWNLOADER_MIDDLEWARES_BASE

Scrapy

URL HTTPS
HTTP

item pipelines
Items Items item pipeline
Scrapy GitHub settings/default_settings.py
SPIDER_MIDDLEWARES_BASE

Item Pipelines
Scrapy API
Item

Item Pipelines process_item()


Items
Scrapy GitHub settings/default_settings.py
EXTENSIONS_BASE

Scrapy MiddlewareManager
from_crawler() from_settings() Settings
crawler.settings from_crawler() Settings
Crawler

1—— pipeline
Python

pipelines items

from datetime import datetime


class TidyUp(object):
    def process_item(self, item, spider):
        item['date'] = map(datetime.isoformat, item['date'])
        return item

process_item() pipeline
3 tidyup.py

pipeline

settings.py ITEM_PIPELINES

ITEM_PIPELINES = {'properties.pipelines.tidyup.TidyUp': 100 }

dict 100 pipelines pipeline Items


pipeline

ch8/properties

$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=90


...
INFO: Enabled item pipelines: TidyUp
...
DEBUG: Scraped from <200 ...property_000060.html>
...
'date': ['2015-11-08T14:47:04.148968'],

ISO

Item crawler.signals.connect() 11

Item Pipeline
items

def parse(self, response):
    for i in range(2):
        item = HooksasyncItem()
        item['name'] = "Hello %d" % i
        yield item
    raise Exception("dead")

Item Item Pipeline DropItem

ch08/hooksasync

$ scrapy crawl test


... many lines ...
# First we get those two signals...
INFO: Extension, signals.spider_opened fired
INFO: Extension, signals.engine_started fired
# Then for each URL we get a request_scheduled signal
INFO: Extension, signals.request_scheduled fired
...# when download completes we get response_downloaded
INFO: Extension, signals.response_downloaded fired
INFO: DownloaderMiddleware process_response called for example.com
# Work between response_downloaded and response_received
INFO: Extension, signals.response_received fired
INFO: SpiderMiddleware process_spider_input called for example.com
# here our parse() method gets called... and then SpiderMiddleware used
INFO: SpiderMiddleware process_spider_output called for example.com
# For every Item that goes through pipelines successfully...
INFO: Extension, signals.item_scraped fired
# For every Item that gets dropped using the DropItem exception...
INFO: Extension, signals.item_dropped fired
# If your spider throws something else...
INFO: Extension, signals.spider_error fired
# ... the above process repeats for each URL
# ... till we run out of them. then...
INFO: Extension, signals.spider_idle fired
# by hooking spider_idle you can schedule further Requests. If you don't
# the spider closes.
INFO: Closing spider (finished)
INFO: Extension, signals.spider_closed fired
# ... stats get printed
# and finally engine gets stopped.
INFO: Extension, signals.engine_stopped fired
11
spider_idle spider_error request_scheduled response_received
response_downloaded
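The log above mentioned that hooking spider_idle lets you schedule further Requests. As a sketch of what that can look like (the extension name and the URL are made up for illustration), an extension connects to the signal, schedules one more Request through the engine, and raises DontCloseSpider to keep the spider alive:

from scrapy import signals, Request
from scrapy.exceptions import DontCloseSpider

class KeepAlive(object):
    """Illustrative sketch: feed one more URL whenever the spider goes idle."""

    def __init__(self, crawler):
        self.crawler = crawler
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_idle(self, spider):
        url = "http://web:9312/benchmark/index?p=1"  # hypothetical URL
        # spider.parse is assumed to be a valid callback for this spider
        self.crawler.engine.crawl(Request(url, callback=spider.parse), spider)
        raise DontCloseSpider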

2——
pipelines

Scrapy Log Stats Scrapy GitHub


scrapy/extensions/logstats.py
request_scheduled response_received item_scraped
item_scraped
Responses request_scheduled Requests response_received
Response Request Request
meta dict Requests meta dict

AutoThrottle
scrapy/extensions/throttle.py request.meta.get 'download_latency'
scrapy/core/downloader/webclient.py download_latency
Scrapy

class Latencies(object):
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        self.crawler = crawler
        self.interval = crawler.settings.getfloat('LATENCIES_INTERVAL')
        if not self.interval:
            raise NotConfigured
        cs = crawler.signals
        cs.connect(self._spider_opened, signal=signals.spider_opened)
        cs.connect(self._spider_closed, signal=signals.spider_closed)
        cs.connect(self._request_scheduled, signal=signals.request_scheduled)
        cs.connect(self._response_received, signal=signals.response_received)
        cs.connect(self._item_scraped, signal=signals.item_scraped)
        self.latency, self.proc_latency, self.items = 0, 0, 0

    def _spider_opened(self, spider):
        self.task = task.LoopingCall(self._log, spider)
        self.task.start(self.interval)

    def _spider_closed(self, spider, reason):
        if self.task.running:
            self.task.stop()

    def _request_scheduled(self, request, spider):
        request.meta['schedule_time'] = time()

    def _response_received(self, response, request, spider):
        request.meta['received_time'] = time()

    def _item_scraped(self, item, response, spider):
        self.latency += time() - response.meta['schedule_time']
        self.proc_latency += time() - response.meta['received_time']
        self.items += 1

    def _log(self, spider):
        irate = float(self.items) / self.interval
        latency = self.latency / self.items if self.items else 0
        proc_latency = self.proc_latency / self.items if self.items else 0
        spider.logger.info(("Scraped %d items at %.1f items/s, avg latency: "
                            "%.2f s and avg time in pipelines: %.2f s") %
                           (self.items, irate, latency, proc_latency))
        self.latency, self.proc_latency, self.items = 0, 0, 0

Crawler
from_crawler(cls, crawler) crawler
init() crawler.settings NotConfigured
FooBar FOOBAR_ENABLED False
settings.py ITEM_PIPELINES
Scrapy AutoThrottle
HttpCache
LATENCIES_INTERVAL

init() crawler.signals.connect()
_spider_opened()
LATENCIES_INTERVAL _log() _spider_closed()
_request_scheduled() _response_received() request.meta
_item_scraped() items _log()
settings.py latencies.py
settings.py

EXTENSIONS = { 'properties.latencies.Latencies': 500, }


LATENCIES_INTERVAL = 5

$ pwd
/root/book/ch08/properties
$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=1000 -s LOG_LEVEL=INFO
...
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
INFO: Scraped 0 items at 0.0 items/sec, average latency: 0.00 sec and
average time in pipelines: 0.00 sec
INFO: Scraped 115 items at 23.0 items/s, avg latency: 0.84 s and avg time
in pipelines: 0.12 s
INFO: Scraped 125 items at 25.0 items/s, avg latency: 0.78 s and avg time
in pipelines: 0.12 s

Log Stats reports roughly 24-25 items/s with an average latency of about 0.78 s. Applying Little's law, N = T*S = 25*0.78 ≅ 19 requests in flight, which is close to our CONCURRENT_REQUESTS setting.
CONCURRENT_REQUESTS_PER_DOMAIN 100% CPU
30 10

scrapy/settings/default_settings.py Scrapy

DOWNLOAD_HANDLERS_BASE HTTP HTTPS S3 FTP


URL DOWNLOAD_HANDLERS
DOWNLOAD_HANDLERS
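For example (a sketch using the standard DOWNLOAD_HANDLERS setting; the handler class path below is hypothetical), settings.py can register a handler for a custom URL scheme or disable one of the built-in schemes:

# settings.py (sketch)
DOWNLOAD_HANDLERS = {
    'myproto': 'properties.handlers.MyProtoDownloadHandler',  # hypothetical
    'ftp': None,  # disable the built-in FTP handler
}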

Scrapy
Scrapy

scrapy crawl CrawlerProcess Crawler
Crawler Scrapy settings signals spider
extensions.crawler ExtensionManager engine
ExecutionEngine Scheduler Downloader Scraper Scheduler
URL Downloader Scraper Downloader
DownloaderMiddleware DownloadHandler Scraper SpiderMiddleware
ItemPipeline MiddlewareManager Scrapy feeds
FeedExporter
URL S3 XML Pickle
FEED_STORAGES FEED_EXPORTERS scrapy
check SPIDER_CONTRACTS

Scrapy Twisted
Item Processing Pipeline

9 Pipelines
Scrapy pipelines
REST APIs CPU

Vagrant ping ping es ping mysql


REST APIs

REST APIs
REST SOAP web
REST web CRUD Create Read Update
Delete HTTP GET POST PUT DELETE web
URL http://api.mysite.com/customer/john URL
john
XML JSON
REST

Scrapy pipeline REST API

treq
treq Python Twisted Python requests GET
POST HTTP pip install treq

Scrapy Request/crawler.engine.download() API treq


10
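As a tiny hedged sketch of what treq usage looks like inside Twisted code (the URL is illustrative), a GET returns a deferred Response whose body can also be read asynchronously:

from twisted.internet import defer
import treq

@defer.inlineCallbacks
def fetch_status(url):
    # treq.get() returns a deferred Response; text()/json() are async too
    response = yield treq.get(url, timeout=5)
    body = yield response.text()
    defer.returnValue((response.code, len(body)))

# e.g. yield fetch_status("http://es:9200")  # from inside other Twisted code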

Elasticsearch pipeline
ES Elasticsearch Items ES
MySQL ES ES
treq ES txes2
Python/Twisted ES

Vagrant ES ES

$ curl http://es:9200
{
"name" : "Living Brain",
"cluster_name" : "elasticsearch",
"version" : { ... },
"tagline" : "You Know, for Search"
}
You can visit http://localhost:9200 and http://localhost:9200/properties/property/_search in your browser to inspect what ES stores.

$ curl -XDELETE http://es:9200/properties

pipeline

ch09 ch09/properties/properties/pipelines/es.py

@defer.inlineCallbacks
def process_item(self, item, spider):
    data = json.dumps(dict(item), ensure_ascii=False).encode("utf-8")
    yield treq.post(self.es_url, data)

process_item() 8

data ensure_ascii=False ASCII


JSON JSON UTF-8

treq post() POST ElasticSearch


es_url http://es:9200/properties/property settings.py ES_PIPELINE_URL
ES IP es:9200
properties property

pipeline settings.py ITEM_PIPELINES ES_PIPELINE_URL

ITEM_PIPELINES = {
'properties.pipelines.tidyup.TidyUp': 100,
'properties.pipelines.es.EsWriter': 800,
}
ES_PIPELINE_URL = 'http://es:9200/properties/property'
$ pwd
/root/book/ch09/properties
$ ls
properties scrapy.cfg

$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=90


...
INFO: Enabled item pipelines: EsWriter...
INFO: Closing spider (closespider_itemcount)...
'item_scraped_count': 106,

http://localhost:9200/properties/property/_search 10
hits/total ?size=100
q= URL
http://localhost:9200/properties/property/_search?q=title:london
to find the properties that mention London. See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html for the full ES query-string syntax.

ES http://localhost:9200/properties/

If you rerun scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=1000, the average time spent in pipelines rises only slightly, from about 0.12 s to 0.15 s, and the average latency from 0.78 s to 0.81 s, while throughput stays around 25 items/s.

pipelines Items

pipelines
Twisted APIs

pipeline Google Geocoding API

Google Geocoding API


curl URL

$ curl "https://maps.googleapis.com/maps/api/geocode/json?sensor=false&ad
dress=london"
{
"results" : [
...
"formatted_address" : "London, UK",
"geometry" : {
...
"location" : {
"lat" : 51.5073509,
"lng" : -0.1277583
},
"location_type" : "APPROXIMATE",
...
],
"status" : "OK"
}

JSON location

results[0].geometry.location

treq Google Geocoding API


pipelines geo.py

@defer.inlineCallbacks
def geocode(self, address):
    endpoint = 'http://web:9312/maps/api/geocode/json'
    parms = [('address', address), ('sensor', 'false')]
    response = yield treq.get(endpoint, params=parms)
    content = yield response.json()
    geo = content['results'][0]["geometry"]["location"]
    defer.returnValue({"lat": geo["lat"], "lon": geo["lng"]})

URL
endpoint = 'https://maps.googleapis.com/maps/api/geocode/json' Google
address sensor URL treq get()
params yield response.json()
Python dict
defer.returnValue() inlineCallbacks
Scrapy

geocode() process_item()

item["location"] = yield self.geocode(item["address"][0])

pipeline ITEM_PIPELINES ES
ES

ITEM_PIPELINES = {
...
'properties.pipelines.geo.GeoPipeline': 400,

$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=90 -L DEBUG


...
{'address': [u'Greenwich, London'],
...
'image_URL': [u'http://web:9312/images/i06.jpg'],
'location': {'lat': 51.482577, 'lon': -0.007659},
'price': [1030.0],
...

Items location Google API URL

File "pipelines/geo.py" in geocode (content['status'], address))


Exception: Unexpected status="OVER_QUERY_LIMIT" for
address="*London"

The full geocode() implementation checks the Geocoding API's status field and raises an exception when it isn't OK. OVER_QUERY_LIMIT means we exceeded Google's quota; Scrapy issues requests so fast that we hit it almost immediately. The Geocoder API's free tier allows roughly 2,500 requests per 24 hours at no more than 5 requests per second, so even a modest crawl runs into this limit unless we throttle ourselves.

Twisted

class Throttler(object):
    def __init__(self, rate):
        self.queue = []
        self.looping_call = task.LoopingCall(self._allow_one)
        self.looping_call.start(1. / float(rate))

    def stop(self):
        self.looping_call.stop()

    def throttle(self):
        d = defer.Deferred()
        self.queue.append(d)
        return d

    def _allow_one(self):
        if self.queue:
            self.queue.pop(0).callback(None)

throttle() appends a fresh deferred to the queue and returns it. A task.LoopingCall() fires _allow_one() at the configured rate, and _allow_one() pops the oldest deferred and fires its callback(), letting exactly one waiter proceed per tick. We create the Throttler in the pipeline's __init__ and stop it when the spider closes:

class GeoPipeline(object):
    def __init__(self, stats):
        self.throttler = Throttler(5)  # 5 Requests per second

    def close_spider(self, spider):
        self.throttler.stop()

Inside process_item(), before calling geocode(), we yield on throttle() so the call waits for its turn:

yield self.throttler.throttle()
item["location"] = yield self.geocode(item["address"][0])
Each yield suspends until the throttler lets it through; with, say, 11 requests queued and a limit of 5 per second, the last one waits about 11/5 = 2.2 s.

Throttler
Python dict
API Python Twisted

class DeferredCache(object):
    def __init__(self, key_not_found_callback):
        self.records = {}
        self.deferreds_waiting = {}
        self.key_not_found_callback = key_not_found_callback

    @defer.inlineCallbacks
    def find(self, key):
        rv = defer.Deferred()
        if key in self.deferreds_waiting:
            self.deferreds_waiting[key].append(rv)
        else:
            self.deferreds_waiting[key] = [rv]
            if not key in self.records:
                try:
                    value = yield self.key_not_found_callback(key)
                    self.records[key] = lambda d: d.callback(value)
                except Exception as e:
                    self.records[key] = lambda d: d.errback(e)
            action = self.records[key]
            for d in self.deferreds_waiting.pop(key):
                reactor.callFromThread(action, d)
        value = yield rv
        defer.returnValue(value)

The cache uses two structures: self.deferreds_waiting, a dict of deferreds waiting for a given key, and self.records, a dict of keys we have already resolved (or are busy resolving).

find() returns a deferred immediately. If someone else is already waiting for the same key, the new deferred just joins that waiting list.

If the key isn't in self.records yet, we call yield self.key_not_found_callback(key) exactly once. Python's try/except around the yield captures either the returned value or the raised exception, and we store a small lambda in self.records that replays it: d.callback(value) on success, d.errback(e) on failure. Any later lookup of the same key gets the same result (or the same exception) without calling key_not_found_callback(key) again.

While the callback was running, other find() calls for the same key may have queued deferreds in self.deferreds_waiting. Once the key is resolved we pop them all and fire each one with the stored action via reactor.callFromThread(action, d), so the callbacks run in the reactor thread.

In __init__() we create the cache with the API call as its not-found callback, and process_item() simply queries the cache:

def __init__(self, stats):
    self.cache = DeferredCache(self.cache_key_not_found_callback)

@defer.inlineCallbacks
def cache_key_not_found_callback(self, address):
    yield self.throttler.enqueue()
    value = yield self.geocode(address)
    defer.returnValue(value)

@defer.inlineCallbacks
def process_item(self, item, spider):
    item["location"] = yield self.cache.find(item["address"][0])
    defer.returnValue(item)

ch09/properties/properties/pipelines/geo2.py

pipeline settings.py ITEM_PIPELINES

ITEM_PIPELINES = {
'properties.pipelines.tidyup.TidyUp': 100,
'properties.pipelines.es.EsWriter': 800,
# DISABLE 'properties.pipelines.geo.GeoPipeline': 400,
'properties.pipelines.geo2.GeoPipeline': 400,
}

$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=1000


...
Scraped... 15.8 items/s, avg latency: 1.74 s and avg time in pipelines:
0.94 s
Scraped... 32.2 items/s, avg latency: 1.76 s and avg time in pipelines:
0.97 s
Scraped... 25.6 items/s, avg latency: 0.76 s and avg time in pipelines:
0.14 s
...
: Dumping Scrapy stats:...
'geo_pipeline/misses': 35,
'item_scraped_count': 1019,

Only 35 addresses missed the cache and triggered an API call; the other 1019 - 35 = 984 items got their location from values already cached. If you use a real Google API key you can raise the throttle from Throttler(5) to Throttler(10), i.e. from 5 to 10 requests per second. The full implementation also records geo_pipeline/retries and geo_pipeline/errors stats, and geo_pipeline/already_set counts items whose location was already filled in. Checking http://localhost:9200/properties/property/_search again, the ES documents now contain locations such as:
{..."location": {"lat": 51.5269736, "lon": -0.0667204}...}

Elasticsearch
HTTP POST
Angel {51.54, -0.19}

$ curl http://es:9200/properties/property/_search -d '{
    "query" : {"term" : { "title" : "angel" } },
    "sort": [{"_geo_distance": {
        "location": {"lat": 51.54, "lon": -0.19},
        "order": "asc",
        "unit": "km",
        "distance_type": "plane"
    }}]}'

"failed to find mapper for [location] for


geo distance based sort" location
$ curl 'http://es:9200/properties/_mapping/property' > property.txt

Open property.txt and find the line:

"location":{"properties":{"lat":{"type":"double"},"lon":{"type":"double"}}}

Replace it with:

"location": {"type": "geo_point"}

Also delete the enclosing {"properties":{"mappings": at the start of the file and the two extra }} at the end.


schema

$ curl -XDELETE 'http://es:9200/properties'


$ curl -XPUT 'http://es:9200/properties'
$ curl -XPUT 'http://es:9200/properties/_mapping/property' --data
@property.txt

JSONs sort

Python
Python Database API 2.0 MySQL PostgreSQL Oracle
Microsoft SQL Server SQLite Twisted
Twisted Scrapy
twisted.enterprise.adbapi MySQL

pipeline MySQL
MySQL pipeline
MySQL MySQL
$ mysql -h mysql -uroot -ppass

mysql> MySQL

mysql> create database properties;


mysql> use properties
mysql> CREATE TABLE properties (
url varchar(100) NOT NULL,
title varchar(30),
price DOUBLE,
description varchar(30),
PRIMARY KEY (url)
);
mysql> SELECT * FROM properties LIMIT 10;
Empty set (0.00 sec)

MySQL properties
pipeline MySQL exit

MySQL properties

mysql> DELETE FROM properties;

MySQL Python dj-database-url


IP pip install dj-database-url MySQL-
python MySQL pipeline

from twisted.enterprise import adbapi
...

class MysqlWriter(object):
    ...
    def __init__(self, mysql_url):
        conn_kwargs = MysqlWriter.parse_mysql_url(mysql_url)
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                                            charset='utf8',
                                            use_unicode=True,
                                            connect_timeout=5,
                                            **conn_kwargs)

    def close_spider(self, spider):
        self.dbpool.close()

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        try:
            yield self.dbpool.runInteraction(self.do_replace, item)
        except:
            print traceback.format_exc()
        defer.returnValue(item)

    @staticmethod
    def do_replace(tx, item):
        sql = """REPLACE INTO properties (url, title, price, description)
                 VALUES (%s,%s,%s,%s)"""
        args = (
            item["url"][0][:100],
            item["title"][0][:30],
            item["price"][0],
            item["description"][0].replace("\r\n", " ")[:30]
        )
        tx.execute(sql, args)

ch09/properties/properties/pipelines/mysql.py

MYSQL_PIPELINE_URL mysql://user:pass@ip/database URL


init() adbapi.ConnectionPool() adbapi
MySQL MySQL
MySQLdb MySQL Unicode adbapi
MySQLdb.connect()
close()
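The parse_mysql_url() helper referenced in __init__() isn't shown in the excerpt above. A minimal sketch (assuming dj_database_url, mentioned earlier, is installed) could turn the URL into MySQLdb connection keyword arguments like this:

import dj_database_url

@staticmethod
def parse_mysql_url(mysql_url):
    """Sketch: turn mysql://user:pass@host/db into MySQLdb.connect() kwargs."""
    params = dj_database_url.parse(mysql_url)
    conn_kwargs = {}
    conn_kwargs['host'] = params['HOST']
    conn_kwargs['user'] = params['USER']
    conn_kwargs['passwd'] = params['PASSWORD']
    conn_kwargs['db'] = params['NAME']
    conn_kwargs['port'] = params['PORT']
    # Drop empty values so MySQLdb falls back to its defaults
    conn_kwargs = dict((k, v) for k, v in conn_kwargs.iteritems() if v)
    return conn_kwargs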

process_item() dbpool.runInteraction()
Transaction Transaction DB-API
API do_replace() @staticmethod
self
SQL
Transaction execute() SQL REPLACE
INTO INSERT INTO
SQL SELECT dbpool.runQuery()
adbapi.ConnectionPool() cursorclass
cursorclass=MySQLdb.cursors

pipeline settings.py ITEM_PIPELINES


MYSQL_PIPELINE_URL

ITEM_PIPELINES = { ...
'properties.pipelines.mysql.MysqlWriter': 700,
...
MYSQL_PIPELINE_URL = 'mysql://root:pass@mysql/properties'

scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=1000

MySQL

mysql> SELECT COUNT(*) FROM properties;


+----------+
| COUNT(*) |
+----------+
|     1006 |
+----------+
mysql> SELECT * FROM properties LIMIT 4;
+------------------+--------------------------+--------+-------------+
| url              | title                    | price  | description |
+------------------+--------------------------+--------+-------------+
| http://...0.html | Set Unique Family Well   | 334.39 | website c   |
| http://...1.html | Belsize Marylebone Shopp | 388.03 | features    |
| http://...2.html | Bathroom Fully Jubilee S | 365.85 | vibrant own |
| http://...3.html | Residential Brentford Ot | 238.71 | go court    |
+------------------+--------------------------+--------+-------------+
4 rows in set (0.00 sec)

Twisted
treq REST APIs Scrapy Twisted
MongoDB “MongoDB Python”
PyMongo / pipeline
Twisted PyMongo “MongoDB Twisted Python” txmongo
Twisted Scrapy Twisted
Twisted Redis

pipeline Redis
Google Geocoding API IP IPs

API /

Redis dict Vagrant Redis


redis-cli

$ redis-cli -h redis
redis:6379> info keyspace
# Keyspace
redis:6379> set key value
OK
redis:6379> info keyspace
# Keyspace
db0:keys=1,expires=0,avg_ttl=0
redis:6379> FLUSHALL
OK
redis:6379> info keyspace
# Keyspace
redis:6379> exit

“Redis Twisted” txredisapi Python


Twisted reactor.connectTCP() Twisted
Redis txredisapi Twisted
dj_redis_url pip Redis URL sudo pip install txredisapi
dj_redis_url

RedisCache pipeline
from txredisapi import lazyConnectionPool

class RedisCache(object):
    ...
    def __init__(self, crawler, redis_url, redis_nm):
        self.redis_url = redis_url
        self.redis_nm = redis_nm
        args = RedisCache.parse_redis_url(redis_url)
        self.connection = lazyConnectionPool(connectTimeout=5,
                                             replyTimeout=5,
                                             **args)
        crawler.signals.connect(
            self.item_scraped, signal=signals.item_scraped)

pipeline Redis URL


parse_redis_url()
redis_nm txredisapi lazyConnectionPool()
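parse_redis_url() isn't shown here either; a sketch along the same lines (assuming the dj_redis_url helper mentioned above) could look like:

import dj_redis_url

@staticmethod
def parse_redis_url(redis_url):
    """Sketch: turn redis://host:port/db into txredisapi connection kwargs."""
    params = dj_redis_url.parse(redis_url)
    conn_kwargs = {}
    conn_kwargs['host'] = params['HOST']
    conn_kwargs['port'] = params['PORT']
    conn_kwargs['password'] = params['PASSWORD']
    conn_kwargs['dbid'] = params['DB']
    # Drop empty values so txredisapi falls back to its defaults
    conn_kwargs = dict((k, v) for k, v in conn_kwargs.iteritems() if v)
    return conn_kwargs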

pipeline geo-pipeline Redis


geo-pipeline API
Redis signals.item_scraped
item_scraped()

ch09/properties/properties/pipelines/redis.py

Item Redis
dict
geo-pipeline Items

process incoming Items:


@defer.inlineCallbacks
def process_item(self, item, spider):
    address = item["address"][0]
    key = self.redis_nm + ":" + address
    value = yield self.connection.get(key)
    if value:
        item["location"] = json.loads(value)
    defer.returnValue(item)

txredisapi connection get() Redis


JSON Redis JSON
Item pipelines Redis

from txredisapi import ConnectionError

def item_scraped(self, item, spider):
    try:
        location = item["location"]
        value = json.dumps(location, ensure_ascii=False)
    except KeyError:
        return
    address = item["address"][0]
    key = self.redis_nm + ":" + address
    quiet = lambda failure: failure.trap(ConnectionError)
    return self.connection.set(key, value).addErrback(quiet)

txredisapi set()
set() @defer.inlineCallbacks signals.item_scraped
connection.set() yield
Scrapy Redis
connection.set()
trap() ConnectionError Twisted API trap()

pipeline settings.py ITEM_PIPELINES


REDIS_PIPELINE_URL geo-pipeline

ITEM_PIPELINES = { ...
'properties.pipelines.redis.RedisCache': 300,
'properties.pipelines.geo.GeoPipeline': 400,
...
REDIS_PIPELINE_URL = 'redis://redis:6379'

$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=100


...
INFO: Enabled item pipelines: TidyUp, RedisCache, GeoPipeline,
MysqlWriter, EsWriter
...
Scraped... 0.0 items/s, avg latency: 0.00 s, time in pipelines: 0.00 s
Scraped... 21.2 items/s, avg latency: 0.78 s, time in pipelines: 0.15 s
Scraped... 24.2 items/s, avg latency: 0.82 s, time in pipelines: 0.16 s
...
INFO: Dumping Scrapy stats: {...
'geo_pipeline/already_set': 106,
'item_scraped_count': 106,

GeoPipeline RedisCache RedisCache


geo_pipeline/already_set: 106 GeoPipeline Redis
Google API Redis Google API
GeoPipeline 5
API Redis

CPU
Twisted Twisted
Scrapy Twisted reactor.spawnProcess() Python

pipeline CPU
8
Twisted reactor.callInThread() API

pipeline

class UsingBlocking(object):
    @defer.inlineCallbacks
    def process_item(self, item, spider):
        price = item["price"][0]
        out = defer.Deferred()
        reactor.callInThread(self._do_calculation, price, out)
        item["price"][0] = yield out
        defer.returnValue(item)

    def _do_calculation(self, price, out):
        new_price = price + 1
        time.sleep(0.10)
        reactor.callFromThread(out.callback, new_price)
This pipeline adjusts the price of each Item. _do_calculation() is the blocking part: it adds 1 to the price and then time.sleep()s for 100 ms to simulate heavy computation. We run it with reactor.callInThread(), so it executes on a thread-pool thread instead of blocking the reactor, and we pass an out deferred that _do_calculation() uses to hand the new price back.

If _do_calculation() ran in the main thread, its 100 ms sleep would cap us at about 10 items per second. Running it in threads removes that cap, but the result must be delivered safely: out.callback(new_price) is invoked through reactor.callFromThread(), which hops back into the reactor thread. process_item(), suspended on yield out, resumes with the new price and returns the Item.

There is a subtlety: _do_calculation() now runs on several threads at once, so any shared state it touches needs protection. To demonstrate, let's add two counters, beta and delta:

class UsingBlocking(object):
    def __init__(self):
        self.beta, self.delta = 0, 0
    ...
    def _do_calculation(self, price, out):
        self.beta += 1
        time.sleep(0.001)
        self.delta += 1
        new_price = price + self.beta - self.delta + 1
        assert abs(new_price - price - 1) < 0.01
        time.sleep(0.10)
        ...

The assert occasionally fails: with several threads interleaving, another thread can bump self.beta between our increments of beta and delta, so the two counters drift apart. The standard fix is a lock around the critical section, for example Python's threading.RLock():

class UsingBlocking(object):
    def __init__(self):
        ...
        self.lock = threading.RLock()
    ...
    def _do_calculation(self, price, out):
        with self.lock:
            self.beta += 1
            ...
            new_price = price + self.beta - self.delta + 1
            assert abs(new_price - price - 1) < 0.01
        ...

ch09/properties/properties/pipelines/computation.py

pipeline settings.py ITEM_PIPELINES

ITEM_PIPELINES = { ...
'properties.pipelines.computation.UsingBlocking': 500,

Even though each item still costs 100 ms of blocking work, the crawl keeps running at around 25 items/s, because that work now happens on threads rather than in the reactor.

pipeline

Twisted
reactor.spawnProcess() API protocol.ProcessProtocol

#!/bin/bash
trap "" SIGINT
sleep 3
while read line
do
# 4 per second
sleep 0.25
awk "BEGIN {print 1.20 * $line}"
done

This bash script traps SIGINT so that Ctrl + C doesn't kill it directly; Scrapy handles Ctrl + C itself and shuts things down cleanly. After an initial 3-second start-up sleep, the script takes about 250 ms per line of input, multiplying each value by 1.20 with awk, a standard Linux text tool. That means a single process can handle at most 1/0.25 = 4 items per second. Try it in an interactive session:

$ properties/pipelines/legacy.sh
12 <- If you type this quickly you will wait ~3 seconds to get results
14.40
13 <- For further numbers you will notice just a slight delay
15.60

Use Ctrl + C or Ctrl + D to end the interactive session. To drive this legacy process from Scrapy we wrap it in a ProcessProtocol:

class CommandSlot(protocol.ProcessProtocol):
    def __init__(self, args):
        self._queue = []
        reactor.spawnProcess(self, args[0], args)

    def legacy_calculate(self, price):
        d = defer.Deferred()
        self._queue.append(d)
        self.transport.write("%f\n" % price)
        return d

    # Overriding from protocol.ProcessProtocol
    def outReceived(self, data):
        """Called when new output is received"""
        self._queue.pop(0).callback(float(data))


class Pricing(object):
    def __init__(self):
        self.slot = CommandSlot(['properties/pipelines/legacy.sh'])

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        item["price"][0] = yield self.slot.legacy_calculate(item["price"][0])
        defer.returnValue(item)

CommandSlot is a ProcessProtocol, and Pricing creates one in its __init__(). CommandSlot's constructor starts the shell script with reactor.spawnProcess(), passing itself as the protocol, the executable as args[0], and the full argument list as args.

The pipeline's process_item() delegates to CommandSlot.legacy_calculate(), which creates a deferred, queues it, and writes the price to the child process's stdin via transport.write() (transport is provided by ProcessProtocol). When the script prints a result, Twisted calls outReceived(), which pops the oldest deferred and fires it with the parsed value, resuming the corresponding process_item().
pipeline
ITEM_PIPELINES

ITEM_PIPELINES = {...
'properties.pipelines.legacy.Pricing': 600,

pipeline

class Pricing(object):
    def __init__(self):
        self.concurrency = 16
        args = ['properties/pipelines/legacy.sh']
        self.slots = [CommandSlot(args)
                      for i in xrange(self.concurrency)]
        self.rr = 0

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        slot = self.slots[self.rr]
        self.rr = (self.rr + 1) % self.concurrency
        item["price"][0] = yield slot.legacy_calculate(item["price"][0])
        defer.returnValue(item)

With 16 slots, i.e. 16 child processes served round-robin, the pipeline can handle up to 16*4 = 64 prices per second.

$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=1000


...
Scraped... 0.0 items/s, avg latency: 0.00 s and avg time in pipelines:
0.00 s
Scraped... 21.0 items/s, avg latency: 2.20 s and avg time in pipelines:
1.48 s
Scraped... 24.2 items/s, avg latency: 1.16 s and avg time in pipelines:
0.52 s
Despite the 250 ms per-item cost of the legacy process, throughput stays around 25 items/s. A real implementation might also batch several values per transport.write() or make the shell protocol smarter; see the full code in the Git repository.

Scrapy pipelines Twisted


Item Processing Pipelines Items
pipelines

By Little's law the downloader holds about N = T*S = 25*0.77 ≅ 19 requests, while the pipelines hold about N = 25*3.33 ≅ 83 items at a time.


Twisted 10 Scrapy

10 Scrapy
Scrapy
Scrapy Scrapy
Amdahl
Dr.Goldratt
Scrapy

Scrapy

Scrapy ——
1
Little's law relates the number of elements in a system (N), the throughput (T, elements per second), and the latency (S, seconds each element spends inside): N = T*S, or equivalently T = N/S and S = N/T.

Picture a pipe (figure 1): its volume V equals its length L times its cross-section area A, i.e. V = L*A.

The mapping is L ≈ S, V ≈ N, and A ≈ T, so the pipe formula is just Little's law in disguise: a longer pipe (higher latency) or a wider pipe (higher throughput) holds more elements.
/
A2

Applying Little's law to Scrapy's downloader: if we keep N = 8 requests in flight and each takes about S = 250 ms, the throughput is T = N/S = 8/0.25 = 32 requests per second.

If we raise N from 8 to 16 or 32, Little's law predicts T = N/S = 16/0.25 = 64 requests/s and T = 32/0.25 = 128 requests/s respectively: throughput scales linearly with concurrency, as long as the response time S stays constant, i.e. as long as neither the remote server nor our own machine becomes the bottleneck. The sketch after this paragraph spells out the arithmetic.
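A quick back-of-the-envelope check of these numbers (plain arithmetic, nothing Scrapy-specific):

# Little's law: T = N / S, with S = 0.25 s per request
S = 0.25
for N in (8, 16, 32):
    T = N / S
    print "N=%2d concurrent requests -> T = %.0f requests/second" % (N, T)
# prints 32, 64 and 128 requests/second respectively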

Scrapy URL
URL

Scrapy

Scrapy

Scrapy
Scrapy 3

Scrapy

Request URL

5MB

Scrapy
/ Python/Twisted

CONCURRENT_REQUESTS*
Response Item Request

Item Pipelines Request


Items CONCURRENT_ITEMS
pipelines pipelines 100

pipelines pipelines
CPU

Requests/Items
Scrapy Requests/Responses/Items

Scrapy 6023
Scrapy Python
time.sleep() est()

6023 est()

ch10 ch10/speed

$ pwd
/root/book/ch10/speed
$ ls
scrapy.cfg speed
$ scrapy crawl speed -s SPEED_PIPELINE_ASYNC_DELAY=1
INFO: Scrapy 1.0.3 started (bot: speed)
...

scrapy crawl speed

$ telnet localhost 6023


>>> est()
...
len(engine.downloader.active) : 16
...
len(engine.slot.scheduler.mqs) : 4475
...
len(engine.scraper.slot.active) : 115
engine.scraper.slot.active_size : 117760
engine.scraper.slot.itemproc_size : 105

Ctrl+D Ctrl+C
dqs JOBDIR
dqs len(engine.slot.scheduler.dqs) mqs

mqs 4475
len(engine.downloader.active) 16
CONCURRENT_REQUESTS len(engine.scraper.slot.active) 115
(engine.scraper.slot.active_size) 115kb
105 Items pipelines(engine.scraper.slot.itemproc_size) 10
mqs

stats dict
via stats.get_stats() p()

$ p(stats.get_stats())
{'downloader/request_bytes': 558330,
...
'item_scraped_count': 2485,
...}

item_scraped_count stats.get_value ('item_scraped_count')


items

speed/spiders/speed.py

http://localhost:9312/benchmark/... handlers
4 URL /Scrapy
http://localhost:9312/benchmark/index?p=
1 http://localhost:9312/benchmark/id:3/rr:5/index?p=1
item items
http://localhost:9312/benchmark/ds:100/d
etail?id0=0 speed/settings.py SPEED_T_RESPONSE =
0.125 SPEED_TOTAL_ITEMS = 5000 Items

SpeedSpider SPEED_START_REQUESTS_STYLE
start_requests() parse_item() crawler.engine.crawl()
URL
pipeline DummyPipeline /
/ (SPEED_PIPELINE_BLOCKING_DELAY— )
(SPEED_PIPELINE_ASYNC_DELAY— ) treq API
(SPEED_PIPELINE_API_VIA_TREQ— ) Scrapy crawler.engine.download()
API (SPEED_PIPELINE_API_VIA_DOWNLOADER— ) pipeline

settings.py

Linux

$ time scrapy crawl speed


...
INFO: s/edule d/load scrape p/line done mem
INFO: 0 0 0 0 0 0
INFO: 4938 14 16 0 32 16384
INFO: 4831 16 6 0 147 6144
...
INFO: 119 16 16 0 4849 16384
INFO: 2 16 12 0 4970 12288
...
real 0m46.561s
Column Metric
s/edule len(engine.slot.scheduler.mqs)
d/load len(engine.downloader.active)
scrape len(engine.scraper.slot.active)
p/line engine.scraper.slot.itemproc_size
done stats.get_value('item_scraped_count')
mem engine.scraper.slot.active_size

5000 URL done 5000


The defaults give 16 concurrent requests and a do-nothing pipeline. The job finished in about 46 s for 5,000 items at concurrency 16, i.e. 46*16/5000 = 147 ms per request, reasonably close to the 125 ms page response time once overhead is added.

Scrapy Scrapy

CONCURRENT_REQUESTS
CONCURRENT_REQUESTS_PER_DOMAIN CONCURRENT_REQUESTS_PER_IP
CONCURRENT_REQUESTS
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP CONCURRENT_REQUESTS_PER_DOMAIN
IP

If you want per-IP limits, set CONCURRENT_REQUESTS_PER_IP to a non-zero value; CONCURRENT_REQUESTS_PER_DOMAIN is then ignored. For the benchmarks here we set CONCURRENT_REQUESTS_PER_DOMAIN to a huge number (e.g. 1,000,000) so that CONCURRENT_REQUESTS is the only effective limit, as sketched below.
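In settings terms, a sketch of that benchmark-friendly configuration (values are illustrative):

# settings.py (sketch): make CONCURRENT_REQUESTS the only effective limit
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 1000000   # effectively unlimited per-domain
CONCURRENT_REQUESTS_PER_IP = 0             # keep per-IP limiting disabled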
Linux
Each download takes t_download = t_response + t_overhead, where t_overhead is the small per-request cost added by Twisted/Python.
Items pipeline

toverhead tstart/stop
CONCURRENT_REQUESTS
CPU tresponse

2000 items tresponse∈{0.125s 0.25s 0.5s}


CONCURRENT_REQUESTS∈{8 16 32 64}

$ for delay in 0.125 0.25 0.50; do for concurrent in 8 16 32 64; do


time scrapy crawl speed -s SPEED_TOTAL_ITEMS=2000 \
-s CONCURRENT_REQUESTS=$concurrent -s SPEED_T_RESPONSE=$delay
done; done

2000

5
Plotting x = N/CONCURRENT_REQUESTS against y = t_job - x*t_response, the measurements fall on a straight line y = t_overhead*x + t_start/stop. Fitting it (e.g. with LINEST in Excel) gives t_overhead ≈ 6 ms and t_start/stop ≈ 3.1 s. The per-request overhead is negligible, but the start/stop time is significant when a job crawls only a handful of URLs.

N tjob T
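Instead of LINEST in Excel, the same straight-line fit can be sketched in Python (numpy assumed available; the sample numbers below are made up for illustration):

import numpy as np

# x = N / CONCURRENT_REQUESTS, y = t_job - x * t_response (made-up samples)
x = np.array([125.0, 250.0, 500.0, 1000.0])
y = np.array([3.9, 4.6, 6.1, 9.2])

t_overhead, t_startstop = np.polyfit(x, y, 1)  # slope, intercept
print "t_overhead ~ %.1f ms, t_start/stop ~ %.1f s" % (t_overhead * 1000,
                                                       t_startstop)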

Scrapy

Scrapy

1——CPU

Unix/Linux ps Windows CPU


CPU

$ for concurrent in 25 50 100 150 200; do


time scrapy crawl speed -s SPEED_TOTAL_ITEMS=5000 \
-s CONCURRENT_REQUESTS=$concurrent
done

5000 URL CPU


Scrapy CPU
CPU 80-90%

CONCURRENT_REQUESTS CPU
11 CPU

2-

SPEED_SPIDER_BLOCKING_DELAY
SPEED_PIPELINE_BLOCKING_DELAY 100ms
100 URL 2 3 13

for concurrent in 16 32 64; do


time scrapy crawl speed -s SPEED_TOTAL_ITEMS=100 \
-s CONCURRENT_REQUESTS=$concurrent -s SPEED_SPIDER_BLOCKING_DELAY=0.1
done

With CONCURRENT_REQUESTS=1 you would expect roughly 100 URLs * 100 ms = 10 s plus t_start/stop; raising concurrency barely helps, because the blocking calls serialize everything anyway.

pipelines
pipeline
APIs
9 Scrapy
11

pipelines
pipeline pipelines
pipelines
pipeline pipelines
meta
pipelines item_scraped
Twisted/ Twisted
SPEED_PIPELINE_BLOCKING_DELAY SPEED_PIPELINE_ASYNC_DELAY
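One way to realize the meta/item_scraped idea mentioned above (a sketch; the 'source_rule' key and class name are made up): the spider tags each Request, and a small extension picks the tag up from response.meta when item_scraped fires:

from scrapy import signals

# In the spider (sketch): tag the request
#   yield Request(url, callback=self.parse_item,
#                 meta={'source_rule': 'index'})

class TagCollector(object):
    """Illustrative extension: count scraped items per 'source_rule' tag."""

    def __init__(self, crawler):
        self.counts = {}
        crawler.signals.connect(self._item_scraped,
                                signal=signals.item_scraped)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def _item_scraped(self, item, response, spider):
        tag = response.meta.get('source_rule', 'unknown')
        self.counts[tag] = self.counts.get(tag, 0) + 1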

3- “ ”
1000 0.25 16
19 pipeline crawler.engine.download()
API HTTP http://localhost:9312/benchmark/ar:1/api?text=hello

$ time scrapy crawl speed -s SPEED_TOTAL_ITEMS=1000 -s SPEED_T_


RESPONSE=0.25 -s SPEED_API_T_RESPONSE=1 -s SPEED_PIPELINE_API_VIA_
DOWNLOADER=1
...
s/edule d/load scrape p/line done mem
968 32 32 32 0 32768
952 16 0 0 32 0
936 32 32 32 32 32768
...
real 0m55.151s

16
Scrapy
Requests

$ telnet localhost 6023


>>> engine.downloader.active
set([<POST http://web:9312/ar:1/ti:1000/rr:0.25/benchmark/api>, ... ])

APIs

crawler.engine.download() Scrapy
robots.txt pipeline APIs
APIs Python requests
Twisted treq
API URL pipelines
crawler.engine.download() HTTP CONCURRENT_REQUESTS

CONCURRENT_REQUESTS

1 API 0.25
API

treq crawler.engine.download()
API CONCURRENT_REQUESTS
API

treq

$ time scrapy crawl speed -s SPEED_TOTAL_ITEMS=1000 -s SPEED_T_


RESPONSE=0.25 -s SPEED_API_T_RESPONSE=1 -s SPEED_PIPELINE_API_VIA_TREQ=1
...
s/edule d/load scrape p/line done mem
936 16 48 32 0 49152
887 16 65 64 32 66560
823 16 65 52 96 66560
...
real 0m19.922s

pipeline (p/line) items (d/load)


The downloader is saturated again with 16 requests, so T = N/S = 16/0.25 = 64 pages/s and the done column climbs accordingly. Each item then spends an extra second in the pipeline waiting for the API, so by Little's law the pipelines hold about N = T*S = 64*1 = 64 items at any moment, which is what the p/line column shows; the pipeline no longer limits the downloader.

4-

treq 120kB HTML


31 20

$ time scrapy crawl speed -s SPEED_TOTAL_ITEMS=1000 -s SPEED_T_


RESPONSE=0.25 -s SPEED_API_T_RESPONSE=1 -s SPEED_PIPELINE_API_VIA_TREQ=1
-s SPEED_DETAIL_EXTRA_SIZE=120000
s/edule d/load scrape p/line done mem
952 16 32 32 0 3842818
917 16 35 35 32 4203080
876 16 41 41 67 4923608
840 4 48 43 108 5764224
805 3 46 27 149 5524048
...
real 0m30.611s
“ ”
max_active_size = 5000000
1kB

Scrapy pipelines pipelines


pipelines
pipelines 80

$ time scrapy crawl speed -s SPEED_TOTAL_ITEMS=10000 -s SPEED_T_


RESPONSE=0.25 -s SPEED_PIPELINE_ASYNC_DELAY=85

pipelines
APIs pipelines

5-item /
Items

1000 100 items 0.25 pipelines


3 CONCURRENT_ITEMS 10 150

for concurrent_items in 10 20 50 100 150; do


time scrapy crawl speed -s SPEED_TOTAL_ITEMS=100000 -s \
SPEED_T_RESPONSE=0.25 -s SPEED_ITEMS_PER_DETAIL=100 -s \
SPEED_PIPELINE_ASYNC_DELAY=3 -s \
CONCURRENT_ITEMS=$concurrent_items
done
...
s/edule d/load scrape p/line done mem
952 16 32 180 0 243714
920 16 64 640 0 487426
888 16 96 960 0 731138
...

Items
1300 Items

Comparing the scrape and p/line columns: p/line can be at most CONCURRENT_ITEMS times scrape, because scrape counts Responses being processed while p/line counts the Items spawned from them.

11
CPU

CPU
5MB pipelines

6-
CONCURRENT_REQUESTS
1
T = N/S = N/1 = CONCURRENT_REQUESTS

$ time scrapy crawl speed -s SPEED_TOTAL_ITEMS=500 \


-s SPEED_T_RESPONSE=1 -s CONCURRENT_REQUESTS=64
s/edule d/load scrape p/line done mem
436 64 0 0 0 0
...
real 0m10.99s

64 11 500 URL 64
S=N/T+tstart/stop=500/64+3.1=10.91
URL
SPEED_START_REQUESTS_STYLE=UseIndex URL
20 URL

$ time scrapy crawl speed -s SPEED_TOTAL_ITEMS=500 \


-s SPEED_T_RESPONSE=1 -s CONCURRENT_REQUESTS=64 \
-s SPEED_START_REQUESTS_STYLE=UseIndex
s/edule d/load scrape p/line done mem
0 1 0 0 0 0
0 21 0 0 0 0
0 21 0 0 20 0
...
real 0m32.24s

T = N/(t_job - t_start/stop) = 500/(32.2 - 3.1) ≅ 17 requests per second.

d/load URL
URL 20 URL+
20 URL

URL
URL 50

$ for details in 10 20 30 40; do for nxtlinks in 1 2 3 4; do


time scrapy crawl speed -s SPEED_TOTAL_ITEMS=500 -s SPEED_T_RESPONSE=1 \
-s CONCURRENT_REQUESTS=64 -s SPEED_START_REQUESTS_STYLE=UseIndex \
-s SPEED_DETAILS_PER_INDEX_PAGE=$details \
-s SPEED_INDEX_POINTAHEAD=$nxtlinks
done; done

12 URL
LIFO

SPEED_INDEX_RULE_LAST=1
SPEED_INDEX_HIGHER_PRIORITY=1
URL

URL
100 1 51

-s SPEED_INDEX_SHARDS

$ for details in 10 20 30 40; do for shards in 1 2 3 4; do


time scrapy crawl speed -s SPEED_TOTAL_ITEMS=500 -s SPEED_T_RESPONSE=1 \
-s CONCURRENT_REQUESTS=64 -s SPEED_START_REQUESTS_STYLE=UseIndex \
-s SPEED_DETAILS_PER_INDEX_PAGE=$details -s SPEED_INDEX_SHARDS=$shards
done; done
Scrapy CONCURRENT_REQUESTS

CPU > 80-90%

5Mb
mqs/dqs
CPU

Scrapy Scrapy
Twisted Netty Node.js

Scrapy
“ / / ” Scrapy

11 Scrapyd
HTML XPath
Scrapy Scrapy
Scrapy Python Twisted Scrapy

Scrapy
Scrapyd 6

Apache Spark Spark


Spark Streaming API
Python Python

“ ” “ ”
“ ”

For example, suppose we observe that properties whose description mentions a jacuzzi have an average price of $1,300, while the rest average $995 and the overall average is about $1,000; then shift_jacuzzi = (1300 - 995)/1000 = 30.5%. Terms whose shift exceeds, say, 5% are the interesting ones.

Scrapyd
Scrapyd Scrapyd
3

Scrapyd http://localhost:6800/

Jobs Items Logs Documentation


API

scrapy.cfg

$ pwd
/root/book/ch03/properties
$ cat scrapy.cfg
...
[settings]
default = properties.settings
[deploy]
url = http://localhost:6800/
project = properties

url scrapyd-
client scrapyd-deploy scrapyd-client Scrapy
pip install scrapyd-client

$ scrapyd-deploy
Packing version 1450044699
Deploying to project "properties" in http://localhost:6800/addversion.
json
Server response (200):
{"status": "ok", "project": "properties", "version": "1450044699",
"spiders": 3, "node_name": "dev"}

Scrapyd Available projects

$ curl http://localhost:6800/schedule.json -d project=properties -d spider=easy


{"status": "ok", "jobid": " d4df...", "node_name": "dev"}

Jobs jobid schedule.json cancel.json

$ curl http://localhost:6800/cancel.json -d project=properties -d job=d4df...


{"status": "ok", "prevstate": "running", "node_name": "dev"}

Logs Items items

http_port Scrapyd
http://scrapyd.readthedocs.org/
max_proc 0 CPU
Scrapyd max_proc 4 4

Scrapy

URL URL Scrapyd FTP .jl


Items Spark FTP HDFS Apache Kafka
FTP FEED_URI Scrapy Scrapyd
Spark S3

FTP
Pure-FTPd /root/items

Spark /root/items
Spark Python
24
Spark

Spark Scrapy MapReduce Apache Storm

items 9
pipelines Spark
Items Spark

Spark Scrapy Scrapyd


“ ”
Kafka
RabbitMQ
URL scrapyd
URL

With 16 concurrent requests and about 0.25 s per page, a single spider fetches 16/0.25 = 64 pages per second. We want 50,000 detail pages; at 30 details per index page that means roughly 1,667 index pages, and at 64 pages/s the index crawl alone takes about 1667/64 ≅ 26 s.

3 easy Rule
callback='parse_item'

ch11

scrapy crawl 10

$ ls
properties scrapy.cfg
$ pwd
/root/book/ch11/properties
$ time scrapy crawl easy -s CLOSESPIDER_PAGECOUNT=10
...
DEBUG: Crawled (200) <GET ...index_00000.html> (referer: None)
DEBUG: Crawled (200) <GET ...index_00001.html> (referer: ...index_00000.
html)
...
real 0m4.099s

10 4 26 1700
1
16 URL
URL 20 16
20

>>> map(lambda x: 1667 * x / 20, range(20))


[0, 83, 166, 250, 333, 416, 500, ... 1166, 1250, 1333, 1416, 1500, 1583]

start_URL

start_URL = ['http://web:9312/properties/index_%05d.html' % id
for id in map(lambda x: 1667 * x / 20, range(20))]

(CONCURRENT_REQUESTS,
CONCURRENT_REQUESTS_PER_DOMAIN) 16

$ time scrapy crawl easy -s CONCURRENT_REQUESTS=16 -s CONCURRENT_


REQUESTS_PER_DOMAIN=16
...
real 0m32.344s

1667 /32 =52 /


52*30=1560 Rule

50000/52=16

10
URL
scrapyd URL URL scrapyd URL
scrapyd

URL
URL scrapyds

Scrapy
process_spider_output() Requests
CrawlSpider GET POST

Scrapy GitHub SPIDER_MIDDLEWARES_BASE


Scrapy 1.0 HttpErrorMiddleware OffsiteMiddleware RefererMiddleware
UrlLengthMiddleware DepthMiddleware OffsiteMiddleware 60
allowed_domains URL
URL scrapyds

def __init__(self, crawler):
    settings = crawler.settings
    self._target = settings.getint('DISTRIBUTED_TARGET_RULE', -1)
    self._seen = set()
    self._URL = []
    self._batch_size = settings.getint('DISTRIBUTED_BATCH_SIZE', 1000)
    ...

def process_spider_output(self, response, result, spider):
    for x in result:
        if not isinstance(x, Request):
            yield x
        else:
            rule = x.meta.get('rule')
            if rule == self._target:
                self._add_to_batch(spider, x)
            else:
                yield x

def _add_to_batch(self, spider, request):
    url = request.url
    if not url in self._seen:
        self._seen.add(url)
        self._URL.append(url)
        if len(self._URL) >= self._batch_size:
            self._flush_URL(spider)

process_spider_output() passes Items through untouched and only inspects Requests. CrawlSpider tags every Request it generates with the index of the Rule that produced it in meta['rule']; if that index matches DISTRIBUTED_TARGET_RULE, the Request's URL is handed to _add_to_batch() instead of being yielded, otherwise the Request continues through the engine as usual.
_add_to_batch() deduplicates URLs with the _seen set and appends new ones to _URL; once the batch reaches _batch_size (DISTRIBUTED_BATCH_SIZE) it calls _flush_URL() to ship it off. Here is _flush_URL() along with the extra members it needs:

def __init__(self, crawler):
    ...
    self._targets = settings.get("DISTRIBUTED_TARGET_HOSTS")
    self._batch = 1
    self._project = settings.get('BOT_NAME')
    self._feed_uri = settings.get('DISTRIBUTED_TARGET_FEED_URL', None)
    self._scrapyd_submits_to_wait = []

def _flush_URL(self, spider):
    if not self._URL:
        return

    target = self._targets[(self._batch - 1) % len(self._targets)]

    data = [
        ("project", self._project),
        ("spider", spider.name),
        ("setting", "FEED_URI=%s" % self._feed_uri),
        ("batch", str(self._batch)),
    ]

    json_URL = json.dumps(self._URL)
    data.append(("setting", "DISTRIBUTED_START_URL=%s" % json_URL))

    d = treq.post("http://%s/schedule.json" % target,
                  data=data, timeout=5, persistent=False)

    self._scrapyd_submits_to_wait.append(d)

    self._URL = []
    self._batch += 1

_flush_URL() uses the _batch counter to pick the next scrapyd host from _targets (DISTRIBUTED_TARGET_HOSTS) in round-robin order and POSTs the batch to that host's schedule.json, just like the curl call we used earlier. The extra settings it passes make the remote scrapyd effectively run something like:

scrapy crawl distr \


-s DISTRIBUTED_START_URL='[".../property_000000.html", ... ]' \
-s FEED_URI='ftp://anonymous@spark/%(batch)s_%(name)s_%(time)s.jl' \
-a batch=1

FEED_URI
DISTRIBUTED_TARGET_FEED_URL

Scrapy FTP scrapyds FTP Item Spark


%(name)s %(time)s
%(batch) Scrapy
scrapyd schedule.json API

FEED_URI schedule.json
FEED_URI
DISTRIBUTED_START_URL URL JSON JSON

Scrapy
Redis Scrapy ID _flush_URL()
process_start_requests()

treq.post() POST Scrapyd persistent=False


5 _scrapyd_submits_to_wait
_URL _batch

def __init__(self, crawler):
    ...
    crawler.signals.connect(self._closed, signal=signals.spider_closed)

@defer.inlineCallbacks
def _closed(self, spider, reason, signal, sender):
    # Submit any remaining URL
    self._flush_URL(spider)
    yield defer.DeferredList(self._scrapyd_submits_to_wait)

_closed() Ctrl + C
URL _closed() _flush_URL(spider)
treq.post()
scrapyd_submits_to_wait
treq.post() defer.DeferredList() _closed()
@defer.inlineCallbacks yield

DISTRIBUTED_START_URL URL scrapyds scrapyds


start_URL

settings URL
process_start_requests() start_requests
DISTRIBUTED_START_URL JSON URL
CrawlSpider _response_downloaded()
meta['rule'] Rule Scrapy
CrawlSpider
def __init__(self, crawler):
    ...
    self._start_URL = settings.get('DISTRIBUTED_START_URL', None)
    self.is_worker = self._start_URL is not None

def process_start_requests(self, start_requests, spider):
    if not self.is_worker:
        for x in start_requests:
            yield x
    else:
        for url in json.loads(self._start_URL):
            yield Request(url, spider._response_downloaded,
                          meta={'rule': self._target})

settings.py

SPIDER_MIDDLEWARES = {
'properties.middlewares.Distributed': 100,
}
DISTRIBUTED_TARGET_RULE = 1
DISTRIBUTED_BATCH_SIZE = 2000
DISTRIBUTED_TARGET_FEED_URL = ("ftp://anonymous@spark/"
"%(batch)s_%(name)s_%(time)s.jl")
DISTRIBUTED_TARGET_HOSTS = [
"scrapyd1:6800",
"scrapyd2:6800",
"scrapyd3:6800",
]

DISTRIBUTED_TARGET_RULE
custom_settings :

custom_settings = {
'DISTRIBUTED_TARGET_RULE': 3
}

$ scrapy crawl distr -s \


DISTRIBUTED_START_URL='["http://web:9312/properties/property_000000.html"]'
FTP Spark

scrapy crawl distr -s \


DISTRIBUTED_START_URL='["http://web:9312/properties/property_000000.html"]' \
-s FEED_URI='ftp://anonymous@spark/%(batch)s_%(name)s_%(time)s.jl' -a batch=12

ssh Spark /root/items 12_distr_date_time.jl


scrapyd

CrawlSpider meta

scrapyds URL IDs


scrapyds URL

scrapyd
scrapyd scrapy.cfg
[deploy:target-name]

$ pwd
/root/book/ch11/properties
$ cat scrapy.cfg
...
[deploy:scrapyd1]
url = http://scrapyd1:6800/
[deploy:scrapyd2]
url = http://scrapyd2:6800/
[deploy:scrapyd3]
url = http://scrapyd3:6800/

scrapyd-deploy -l

$ scrapyd-deploy -l
scrapyd1 http://scrapyd1:6800/
scrapyd2 http://scrapyd2:6800/
scrapyd3 http://scrapyd3:6800/
scrapyd-deploy

$ scrapyd-deploy scrapyd1
Packing version 1449991257
Deploying to project "properties" in http://scrapyd1:6800/addversion.json
Server response (200):
{"status": "ok", "project": "properties", "version": "1449991257",
"spiders": 2, "node_name": "scrapyd1"}

build project.egg-info setup.py


scrapyd-deploy addversion.json

scrapyd-deploy –L

$ scrapyd-deploy -L scrapyd1
properties

touch scrapyd1-3 scrapyd


bash loop for i in scrapyd*;
do scrapyd-deploy $i; done

scrapyd
Scrapy scrapy monitor scrapyd
monitor.py settings.py COMMANDS_MODULE = 'properties.monitor'
scrapyd listjobs.json API
We reuse scrapyd-deploy's helper to read the targets from scrapy.cfg (see https://github.com/scrapy/scrapyd-client/blob/master/scrapyd-client/scrapyd-deploy); its _get_targets() returns the configured target names and URLs:

class Command(ScrapyCommand):
    requires_project = True

    def run(self, args, opts):
        self._to_monitor = {}
        for name, target in self._get_targets().iteritems():
            if name in args:
                project = self.settings.get('BOT_NAME')
                url = target['url'] + "listjobs.json?project=" + project
                self._to_monitor[name] = url
        l = task.LoopingCall(self._monitor)
        l.start(5)  # call every 5 seconds
        reactor.run()

run() builds a dict, _to_monitor, mapping each requested target name to its listjobs.json API URL, then uses task.LoopingCall() to invoke _monitor() every five seconds. _monitor() uses treq and deferreds, so it is decorated with @defer.inlineCallbacks:

@defer.inlineCallbacks
def _monitor(self):
    all_deferreds = []
    for name, url in self._to_monitor.iteritems():
        d = treq.get(url, timeout=5, persistent=False)
        d.addBoth(lambda resp, name: (name, resp), name)
        all_deferreds.append(d)
    all_resp = yield defer.DeferredList(all_deferreds)
    for (success, (name, resp)) in all_resp:
        json_resp = yield resp.json()
        print "%-20s running: %d, finished: %d, pending: %d" % (
            name, len(json_resp['running']),
            len(json_resp['finished']), len(json_resp['pending']))

This is classic Twisted: treq fires one GET per scrapyd API endpoint, defer.DeferredList waits for all of them and yields all_resp, and each body is parsed with treq's Response.json(). We then print the number of running, finished, and pending jobs per server.

Apache Spark streaming


Scrapy Apache Spark

Scrapy
boostwords.py

# Monitor the files and give us a DStream of term-price pairs
raw_data = ssc.textFileStream(args[1])
word_prices = preprocess(raw_data)

# Update the counters using Spark's updateStateByKey
running_word_prices = word_prices.updateStateByKey(update_state_function)

# Calculate shifts out of the counters
shifts = running_word_prices.transform(to_shifts)

# Print the results
shifts.foreachRDD(print_shifts)

textFileStream() gives us a Spark DStream of the new files that appear in the items directory; preprocess() turns them into term/price pairs; update_state_function() keeps running sums and counts per term via Spark's updateStateByKey(); to_shifts() converts the counters into shifts and print_shifts() prints them. The interesting part is to_shifts():

def to_shifts(word_prices):
    (sum0, cnt0) = word_prices.values().reduce(add_tuples)
    avg0 = sum0 / cnt0

    def calculate_shift((isum, icnt)):
        avg_with = isum / icnt
        avg_without = (sum0 - isum) / (cnt0 - icnt)
        return (avg_with - avg_without) / avg0

    return word_prices.mapValues(calculate_shift)

Spark mapValues() calculate_shift Spark

vagrant ssh

1 CPU

$ alias provider_id="vagrant global-status --prune | grep 'docker-


provider' | awk '{print \$1}'"
$ vagrant ssh $(provider_id)
$ docker ps --format "{{.Names}}" | xargs docker stats

ssh docker provider VM VM docker Linux

2 scrapy monitor

$ vagrant ssh
$ cd book/ch11/properties
$ scrapy monitor scrapyd*

scrapyd* scrapy monitor scrapy monitor


scrapyd1 scrapyd2 scrapyd3

3
$ vagrant ssh
$ cd book/ch11/properties
$ for i in scrapyd*; do scrapyd-deploy $i; done
$ scrapy crawl distr

for scrapyd-deploy
scrapy crawl distr scrapy crawl distr -s
CLOSESPIDER_PAGECOUNT=100 100 3000
4 Spark

$ vagrant ssh spark


$ pwd
/root
$ ls
book items
$ spark-submit book/ch11/boostwords.py items

boostwords.py items
watch ls -1 items item

CPU

3 4 CPU16 *4 / =768 /
4G 8CPU Macbook Pro 2 40 50000
URL 315 / EC2 m4.large 2 vCPUs 8G CPU
6 12 134 / EC2 m4.4xlarge 16 vCPUs 64G
1 44 480 / scrapyd 6 Vagrantfile
scrapy.cfg settings.py 1 15 667 /

250ms
Twisted scrapyds URL FTP Spark Items
scrapyd 2.5 scrapyd poll_interval
scrapyd
500000

URL

Scrapy Scrapy
Scrapy
Scrapy
Scrapy

You might also like