《Learning Scrapy》Chinese Edition
HelloScrapy
Scrapy
Scrapy
2 HTML XPath
HTML DOM XPath
HTML
Chrome XPath
3
Scrapy
UR2IM——
4 Scrapy
Scrapy
app
5
JSON APIs AJAX
30
Excel
6 Scrapinghub
7
Scrapy
1——
HTTP
2——
3——
4—— Crawlera
Scrapy
8 Scrapy
Scrapy Twisted
Twisted I/O——Python
Scrapy
1—— pipeline
2——
9 Pipelines
REST APIs
treq
Elasticsearch pipeline
pipeline Google Geocoding API
Elasticsearch
Python
pipeline MySQL
Twisted
pipeline Redis
CPU
pipeline CPU
pipeline
10 Scrapy
Scrapy ——
Scrapy
1——CPU
2-
3- “ ”
4-
5-item /
6-
11 Scrapyd
Scrapyd
URL
settings URL
scrapyd
1 Scrapy
Scrapy
Scrapy
Scrapy
HelloScrapy
Scrapy
Excel 3
Scrapy
Scrapy Scrapy
7
Scrapy 8 9
Scrapy
100 Scrapy 16
16
1600 3
16 4800 9
4800 Scrapy 4800
Scrapy
API Scrapy
Scrapy
Scrapy Scrapy
Scrapy HTML
http://doc.scrapy.org/en/latest/news.html
/
Scrapy
Scrapy
50000 Scrapy
MySQL Redis Elasticsearch Google geocoding API
Apache Spark
HTML XPath
2 8
8 9
Scrapy
bug
“ ”
“ ” Eric Ries
MVP
Scrapy
App
App “ 1” “ 2” “
433” “Samsung UN55J6200 55-Inch TV” “Richard S.”
MVP
Scrapy 4
App
DoS
Scrapy
7
User-Agent
Scrapy BOT_NAME
User-Agent URL
Scrapy
Scrapy
Scrapy
2 HTML XPath
HTML HTML
XPath
URL
When you visit a URL such as https://mail.google.com/mail/u/0/#inbox, the domain mail.google.com is resolved by DNS to an IP address such as 173.194.71.83, so the request effectively goes to https://173.194.71.83/mail/u/0/#inbox.
URL
HTML
URL HTML HTML
TextMate Notepad vi Emacs HTML
World Wide Web Consortium
HTML http://example.com HTML
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type"
content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width,
initial-scale=1" />
<style type="text/css"> body { background-color: ...
} </style>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to be used for
illustrative examples in documents.
You may use this domain in examples without
prior coordination or asking for permission.</p>
<p><a href="http://www.iana.org/domains/example">
More information...</a></p>
</div>
</body>
</html>
HTML HTML
HTML
URL href
Example Domain
Scrapy
DOM
HTML HTML
HTML
HTML
Chrome
HTML HTML
HTML
HTML Chrome
Lynx
CSS
HTML Scrapy CSS
DOM
HTML
DOM DOM
XPath HTML
XPath HTML
XPath
Chrome XPath
functionality. For example, on a web page you can try:
$x('//h1')
JavaScript
XPath
HTML
http://example.com/
$x('/html')
[ <html>...</html> ]
$x('/html/body')
[ <body>...</body> ]
$x('/html/body/div')
[ <div>...</div> ]
$x('/html/body/div/h1')
[ <h1>Example Domain</h1> ]
$x('/html/body/div/p')
[ <p>...</p>, <p>...</p> ]
$x('/html/body/div/p[1]')
[ <p>...</p> ]
$x('/html/body/div/p[2]')
[ <p>...</p> ]
p[1] p[2]
$x('//html/head/title')
[ <title>Example Domain</title> ]
XPath //
//p p //a
$x('//p')
[ <p>...</p>, <p>...</p> ]
$x('//a')
[ <a href="http://www.iana.org/domains/example">More information...</a> ]
//a //div//a a
//div/a
$x('//div//a')
[ <a href="http://www.iana.org/domains/example">More information...</a> ]
$x('//div/a')
[ ]
http://example.com/ href
$x('//a/@href')
[href="http://www.iana.org/domains/example"]
text()
$x('//a/text()')
["More information..."]
$x('//div/*')
[<h1>Example Domain</h1>, <p>...</p>, <p>...</p>]
$x('//a[@href]')
[<a href="http://www.iana.org/domains/example">More information...</a>]
$x('//a[@href="http://www.iana.org/domains/example"]')
[<a href="http://www.iana.org/domains/example">More information...</a>]
$x('//a[contains(@href, "iana")]')
[<a href="http://www.iana.org/domains/example">More information...</a>]
$x('//a[starts-with(@href, "http://www.")]')
[<a href="http://www.iana.org/domains/example">More information...</a>]
$x('//a[not(contains(@href, "abc"))]')
[ <a href="http://www.iana.org/domains/example">More information...</a>]
http://www.w3schools.com/xsl/xsl_functions.asp
Scrapy
response.xpath('/html').extract()
[u'<html><head><title>...</body></html>']
response.xpath('/html/body/div/h1').extract()
[u'<h1>Example Domain</h1>']
response.xpath('/html/body/div/p').extract()
[u'<p>This domain ... permission.</p>', u'<p><a href="http://www.iana.org/domains
/example">More information...</a></p>']
response.xpath('//html/head/title').extract()
[u'<title>Example Domain</title>']
response.xpath('//a').extract()
[u'<a href="http://www.iana.org/domains/example">More information...</a>']
response.xpath('//a/@href').extract()
[u'http://www.iana.org/domains/example']
response.xpath('//a/text()').extract()
[u'More information...']
response.xpath('//a[starts-with(@href, "http://www.")]').extract()
[u'<a href="http://www.iana.org/domains/example">More information...</a>']
Chrome XPath
Chrome XPath
HTML
Copy XPath
$x('/html/body/div/p[2]/a')
[<a href="http://www.iana.org/domains/example">More information...</a>]
XPath
//h1[@id="firstHeading"]/span/text()
//div[@id="toc"]/ul//a/@href
//table[@class="infobox"]//img[1]/@src
class reflist div URL
//div[starts-with(@class,"reflist")]//a/@href
//*[text()="References"]/../following-sibling::div//a
URL
//img/@src
HTML XPath
Chrome
//*[@id="myid"]/div/div/div[1]/div[2]/div/div[1]/div[1]/a/img
HTML
img id class
//div[@class="thumbnail"]/a/img
//div[@class="thumbnail"]/a/img
//div[@class="preview green"]/a/img
class class
thumbnail green class thumbnail green
departure-time departure-time div
departure-time
id
id id JavaScript
id XPath
//*[@id="more_info"]//text( )
id
//*[@id="order-F4982322"]
XPath id HTML
id
XPath HTML
HTML XPath Chrome XPath
XPath XPath 3
3
Scrapy
Vagrant
Notepad Notepad++ Sublime Text TextMate Eclipse PyCharm
Linux/Unix vim emacs
nano
Scrapy
Scrapy
Scrapy Vagrant Linux
Vagrant
MacOS
$ easy_install scrapy
Xcode
Windows
Windows Scrapy Windows
Virtualbox Vagrant 64
Windows http://scrapybook.com/
Linux
Linux Scrapy
Scrapy 1.0.3
1.4
GitHub
Scrapy Scrapy Python
https://github.com/scrapy/scrapy
$ git clone https://github.com/scrapy/scrapy.git
$ cd scrapy
$ python setup.py install
virtualenv
Scrapy
Scrapy Scrapy pip easy_install aptitude
Scrapy
Vagrant
Vagrant
Vagrant “ ” Vagrant
—— Scrapy
Vagrant
$ vagrant up --no-parallel
Vagrant
$ vagrant ssh
book
$ cd book
$ ls
ch03 ch04 ch05 ch07 ch08 ch09 ch10 ch11 ...
Scrapy
UR2IM——
Scrapy
Scrapy UR2IM
The URL
URL URL https://www.gumtree.com/ Gumtree
http://www.gumtree.com/flats-houses/london
URL URL URL https://www.gumtree.com/p/studios-bedsits-rent/split-level Gumtree URL XPath
Gumtree
Scrapy
Scrapy --pdb
Scrapy Vagrant
$ scrapy shell http://web:9312/properties/property_000000.html
...
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x2d4fb10>
[s] item {}
[s] request <GET http://web:9312/.../property_000000.html>
[s] response <200 http://web:9312/.../property_000000.html>
[s] settings <scrapy.settings.Settings object at 0x2d4fa90>
[s] spider <DefaultSpider 'default' at 0x3ea0bd0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local...
[s] view(response) View response in a browser
>>>
Python Ctrl+D
Scrapy Scrapy
GET 200
We can print the first 50 characters of response.body:
>>> response.body[:50]
'<!DOCTYPE html>\n<html>\n<head>\n<meta charset="UTF-8"'
Gumtree HTML 5
logo
HTML
HTML
HTML XPath
Chrome XPath h1
SEO HTML h1 h1
SEO
h1
>>> response.xpath('//h1/text()').extract()
[u'set unique family well']
h1 text() h1 text()
>>> response.xpath('//h1').extract()
[u'<h1 itemprop="name" class="space-mbs">set unique family well</h1>']
title
HTML
>>> response.xpath('//*[@itemprop="price"][1]/text()').extract()
[u'\xa3334.39pw']
Unicode £ 350.00pw
>>> response.xpath('//*[@itemprop="price"][1]/text()').re('[.0-9]+')
[u'334.39']
//*[@itemprop="description"][1]/text()
//*[@itemtype="http://schema.org/Place"][1]/text()
//img[@itemprop="image"][1]/@src text()
URL
XPath
image_URL //*[@itemprop="image"][1]/@src Example value: [u'../images/i01.jpg']
Scrapy
Scrapy shell Scrapy
Ctrl+D Scrapy shell Scrapy shell XPath Scrapy
properties
$ scrapy startproject properties
$ cd properties
$ tree
.
├── properties
│ ├── __init__.py
│ ├── items.py
│ ├── pipelines.py
│ ├── settings.py
│ └── spiders
│ └── __init__.py
└── scrapy.cfg
2 directories, 6 files
Scrapy items.py,
pipelines.py, settings.py spiders
settings pipelines scrapy.cfg
items
items.py
class PropertiesItem
Python
Location
Python
PropertiesItem properties/items.py
class PropertiesItem(Item):
# Primary fields
title = Field()
price = Field()
description = Field()
address = Field()
image_URL = Field()
# Calculated fields
images = Field()
location = Field()
# Housekeeping fields
url = Field()
project = Field()
spider = Field()
server = Field()
date = Field()
Python
tab
tab PropertiesItem
{} begin – end Python
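To make the difference from a plain Python dict concrete, here is a minimal interactive sketch (the field value is made up, and "bathrooms" is a deliberately undeclared field used only to show the error):
>>> from properties.items import PropertiesItem
>>> item = PropertiesItem()
>>> item['title'] = ['set unique family well']   # a declared field behaves like a dict key
>>> item['title']
['set unique family well']
>>> item['bathrooms'] = 2   # an undeclared field makes Scrapy raise KeyError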
UR2IM
scrapy genspider
$ scrapy genspider basic web
“basic” “basic”
properties.spiders.basic
properties/spiders/basic.py file:
import scrapy
class BasicSpider(scrapy.Spider):
name = "basic"
allowed_domains = ["web"]
start_URL = (
'http://www.web/',
)
def parse(self, response):
pass
import scrapy
class BasicSpider(scrapy.Spider):
name = "basic"
allowed_domains = ["web"]
start_URL = (
'http://web:9312/properties/property_000000.html',
)
def parse(self, response):
self.log("title: %s" % response.xpath(
'//*[@itemprop="name"][1]/text()').extract())
self.log("price: %s" % response.xpath(
'//*[@itemprop="price"][1]/text()').re('[.0-9]+'))
self.log("description: %s" % response.xpath(
'//*[@itemprop="description"][1]/text()').extract())
self.log("address: %s" % response.xpath(
'//*[@itemtype="http://schema.org/'
'Place"][1]/text()').extract())
self.log("image_URL: %s" % response.xpath(
'//*[@itemprop="image"][1]/@src').extract())
scrapy crawl
XPath
parse() item =
PropertiesItem()
item['title'] = response.xpath('//*[@itemprop="name"][1]/text()').extract()
import scrapy
from properties.items import PropertiesItem
class BasicSpider(scrapy.Spider):
name = "basic"
allowed_domains = ["web"]
start_URL = (
'http://web:9312/properties/property_000000.html',
)
def parse(self, response):
item = PropertiesItem()
item['title'] = response.xpath(
'//*[@itemprop="name"][1]/text()').extract()
item['price'] = response.xpath(
'//*[@itemprop="price"][1]/text()').re('[.0-9]+')
item['description'] = response.xpath(
'//*[@itemprop="description"][1]/text()').extract()
item['address'] = response.xpath(
'//*[@itemtype="http://schema.org/'
'Place"][1]/text()').extract()
item['image_URL'] = response.xpath(
'//*[@itemprop="image"][1]/@src').extract()
return item
“DEBUG ”
DEBUG: Scraped from <200
http://...000.html>
{'address': [u'Angel, London'],
'description': [u'website ... offered'],
'image_URL': [u'../images/i01.jpg'],
'price': [u'334.39'],
'title': [u'set unique family well']}
Scrapy
CSV XML
Excel JSON JavaScript JSON JSON
Line .json JSON 1GB
.jl JSON
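For reference, the output format is selected by the file extension passed to the -o option of scrapy crawl; a quick sketch (the file names are arbitrary):
$ scrapy crawl basic -o items.json   # a single JSON array
$ scrapy crawl basic -o items.jl     # JSON Lines, one item per line
$ scrapy crawl basic -o items.csv    # CSV, easy to open in Excel
$ scrapy crawl basic -o items.xml    # XML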
Scrapy
FTP S3 bucket
$ scrapy crawl basic -o "ftp://user:pass@ftp.scrapybook.com/items.json "
$ scrapy crawl basic -o "s3://aws_key:aws_secret@scrapybook/items.json"
URL S3
scrapy parse
——
ItemLoaders
http://doc.scrapy.org/en/latest/topics/loaders
ItemLoaders XPath/CSS Join() //p
Join()
MapCompose(unicode.strip)
MapCompose(unicode.strip, unicode.title)
MapCompose(float)
def myFunction(i):
    return i.replace(',', '')
Lambda replace() urljoin()
scrapy URL
XPath/CSS
l.add_value('url', response.url)
l.add_value('project', self.settings.get('BOT_NAME'))
l.add_value('spider', self.name)
l.add_value('server', socket.gethostname())
l.add_value('date', datetime.datetime.now())
Items
items
Scrapy 25
ItemLoaders Python
lambda
ItemLoader
ItemLoaders
XPath
class BasicSpider(scrapy.Spider):
name = "basic"
allowed_domains = ["web"]
# Start on a property page
start_URL = (
'http://web:9312/properties/property_000000.html',
)
def parse(self, response):
""" This function parses a property page.
@url http://web:9312/properties/property_000000.html
@returns items 1
@scrapes title price description address image_URL
@scrapes url project spider server date
"""
# Create the loader using the response
l = ItemLoader(item=PropertiesItem(), response=response)
# Load fields using XPath expressions
l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
MapCompose(unicode.strip, unicode.title))
l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
MapCompose(lambda i: i.replace(',', ''),
float),
re='[,.0-9]+')
l.add_xpath('description', '//*[@itemprop="description"]'
'[1]/text()',
MapCompose(unicode.strip), Join())
l.add_xpath('address',
'//*[@itemtype="http://schema.org/Place"]'
'[1]/text()',
MapCompose(unicode.strip))
l.add_xpath('image_URL', '//*[@itemprop="image"]'
'[1]/@src', MapCompose(
lambda i: urlparse.urljoin(response.url, i)))
# Housekeeping fields
l.add_value('url', response.url)
l.add_value('project', self.settings.get('BOT_NAME'))
l.add_value('spider', self.name)
l.add_value('server', socket.gethostname())
l.add_value('date', datetime.datetime.now())
return l.load_item()
URL
start_URL URL
URL
start_URL = (
'http://web:9312/properties/property_000000.html',
'http://web:9312/properties/property_000001.html',
'http://web:9312/properties/property_000002.html',
)
URL
Gumtree
http://www.gumtree.com/flats-houses/london
——
——
URL
manual.py
$ ls
properties scrapy.cfg
$ cp properties/spiders/basic.py properties/spiders/manual.py
next_requests = []
for url in ...:
    next_requests.append(Request(...))
for url in ...:
    next_requests.append(Request(...))
return next_requests
Instead of building and returning a list of Requests like this, it is cleaner to use Python's yield, as in the sketch below.
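A rough sketch of what the yield-based parse() of manual.py might look like, reusing the "next page" and item-link XPath expressions used later in this chapter (treat this as an illustration, not the exact listing):
import urlparse
from scrapy.http import Request

# Inside the spider class in manual.py:
def parse(self, response):
    # Horizontal crawling: follow the "next page" links
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Vertical crawling: follow links to the detail pages
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)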
5
-s CLOSESPIDER_ITEMCOUNT=90 7
90
Scrapy LIFO
URL Requests
Requests
Request() 0 0
Scrapy
URL
If the same URL genuinely needs to be requested more than once, pass dont_filter=True to Request() to bypass the duplicate filter.
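For example, inside a parse method:
yield Request(url, callback=self.parse_item, dont_filter=True)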
CrawlSpider
Scrapy
CrawlSpider genspider
-t
properties/spiders/easy.py
...
class EasySpider(CrawlSpider):
name = 'easy'
allowed_domains = ['web']
start_URL = ['http://www.web/']
rules = (
Rule(LinkExtractor(allow=r'Items/'),
callback='parse_item', follow=True),
)
def parse_item(self, response):
...
CrawlSpider
Spider CrawlSpider rules parse()
rules = (
Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
callback='parse_item')
)
Scrapy UR2IM
Items ItemLoaders XPath Items yield
Requests CrawlSpider
Rules
app Scrapy
4 Scrapy
app Appery.io Scrapy Excel
PhoneGap Appcelerator Appcelerator jQuery Mobile Sencha Touch
Appery.io
Appery.io Email
Appery.io
1. Databases 1
2. Create new database 2 scrapy 3
3. Create 4 Scrapy
Appery.io
Appery.io Users
Appery.io
root pass
Users 1 +Row 2 user/row 3,4
Scrapy
API key Settings 1 2 Collections
3
easy.py
tomobile.py
$ ls
properties scrapy.cfg
$ cat properties/spiders/tomobile.py
...
class ToMobileSpider(CrawlSpider):
name = 'tomobile'
allowed_domains = ["scrapybook.s3.amazonaws.com"]
# Start on the first index page
start_URL = (
'http://scrapybook.s3.amazonaws.com/properties/'
'index_00000.html',
)
...
http://web:9312 http://scrapybook.s3.amazonaws.com URL
app
ApperyIoPipeline pipeline
100 200 / Appery.io pipeline
api.appery.io URL
Appery.io properties 1 2
Appery.io
Scrapy CREATE NEW 5 Database Services 6
scrapy 7 properties 8
List 9 Import selected
services 10
app DESIGN
Pages 1 startScreen 2 UI
Scrapy App
Grid 5
6
Rows 1 Apply 7,8
9
10 11
DESIGN DATA
1
Service 2 Add
3 Add Before send Success
Success Mapping
Appery.io
Save and return 6
DATA UI DESIGN 7
EVENTS 8 EVENTS Appery.io UI
UI startScreen Load
Invoke service action Datasource restservice1 9
Save 10
app
app UI TEST 1
2
View on Phone
Scrapy http://devcenter.appery.io/tutorials/ Appery.io
EXPORT app
IDE app
Scrapy Appery.io
RESTful API
Scrapy
5
3 Items
UR2IM R Request Response
http://web:9312/dynamic http://localhost:9312/dynamic
“user” “pass”
Scrapy
Chrome Network
1 Login 2
GET POST
POST
Chrome
302 FOUND 5
/dynamic/gated
http://localhost:9312/dynamic/gated http://localhost:9312/dynamic/error gated 6
RequestHeaders 7 Cookie 8
HTTP cookie
POST HTTP
Scrapy
3 easy login
class LoginSpider(CrawlSpider):
name = 'login'
dynamic/login dynamic/gated
POST GET dynamic/gated
URL
Scrapy
POST
cookies
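A minimal sketch of how such a login spider usually starts: start_requests() POSTs the credentials with a FormRequest (the URL and form field names follow the demo pages used in this chapter):
from scrapy.http import FormRequest

# Inside the LoginSpider class:
def start_requests(self):
    return [FormRequest(
        "http://web:9312/dynamic/login",
        formdata={"user": "user", "pass": "pass"})]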
http://localhost:9312/dynamic/nonce
Chrome nonce
http://localhost:9312/dynamic/nonce-login
nonce
Scrapy
NonceLoginSpider start_requests()
Request callback parse_welcome()
parse_welcome() FormRequest from_response()
FormRequest FormRequest FormRequest.from_response()
from_response() formname
formnumber
formdata
user pass FormRequest
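Putting the pieces together, the nonce-based login can be sketched roughly like this: the first Request fetches the login page, and parse_welcome() lets FormRequest.from_response() copy every form field, including the hidden nonce, while we only override user and pass (an illustrative sketch, not the exact listing):
from scrapy.http import Request, FormRequest

# Inside the NonceLoginSpider class:
def start_requests(self):
    return [Request(
        "http://web:9312/dynamic/nonce",
        callback=self.parse_welcome)]

def parse_welcome(self, response):
    return FormRequest.from_response(
        response,
        formdata={"user": "user", "pass": "pass"})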
Network 5
static/ jquery.min.js
JavaScript api.json 6 Preview
7 http://localhost:9312/properties/api.json IDs 8
[{
"id": 0,
"title": "better set unique family well"
},
... {
"id": 29,
"title": "better portered mile"
}]
start_URL = (
'http://web:9312/properties/api.json',
)
POST start_requests()
Scrapy URL Response parse() import
json JSON
title = item["title"]
yield Request(url, meta={"title": title}, callback=self.parse_item)
parse_item() XPath
l.add_value('title', response.meta['title'],
MapCompose(unicode.strip, unicode.title))
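As a sketch of that flow: parse() decodes api.json, builds each detail-page URL from the id, and passes the title along through meta (the property_%06d.html pattern is assumed from the demo site's file names):
import json
from scrapy.http import Request

# Inside the api spider:
def parse(self, response):
    base_url = "http://web:9312/properties/"
    js = json.loads(response.body)
    for entry in js:
        # e.g. id 0 -> property_000000.html
        url = base_url + "property_%06d.html" % entry["id"]
        yield Request(url, meta={"title": entry["title"]},
                      callback=self.parse_item)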
30
Scrapy
XPath
&show=50 10 50 100
30 100
3000 100 3000
Gumtree
HTML
itemtype="http://schema.org/Product"
Selector for 30 3
manual.py fast.py parse() parse_item()
parse_item() yield
XPath
XPath
90 93
Excel
XPath
ch05 properties
$ pwd
/root/book/ch05/properties
$ cd ..
$ pwd
/root/book/ch05
generic fromcsv
.csv Excel
URL XPath scrapy.cfg todo.csv
csv
$ cat todo.csv
url,name,price
a.html,"//*[@id=""itemTitle""]/text()","//*[@id=""prcIsum""]/text()"
b.html,//h1/text(),//span/strong/text()
c.html,"//*[@id=""product-desc""]/span/text()"
$ pwd
/root/book/ch05/generic2
$ python
>>> import csv
>>> with open("todo.csv", "rU") as f:
reader = csv.DictReader(f)
for line in reader:
print line
header dict
dict for
URL start_requests()
Request request,meta csv XPath
parse() Item ItemLoader Item
import csv
import scrapy
from scrapy.http import Request
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field
class FromcsvSpider(scrapy.Spider):
name = "fromcsv"
def start_requests(self):
with open("todo.csv", "rU") as f:
reader = csv.DictReader(f)
for line in reader:
request = Request(line.pop('url'))
request.meta['fields'] = line
yield request
def parse(self, response):
item = Item()
l = ItemLoader(item=item, response=response)
for name, xpath in response.meta['fields'].iteritems():
if xpath:
item.fields[name] = Field()
l.add_xpath(name, xpath)
return l.load_item()
csv
Items ItemLoader
item = Item()
l = ItemLoader(item=item, response=response)
Item fields ItemLoader
item.fields[name] = Field()
l.add_xpath(name, xpath)
todo.csv Scrapy
-a -a variable=value self.variable
Python getattr()
getattr(self, 'variable', 'default') with open…
todo.csv -a
another_todo.csv
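A sketch of how the -a argument plugs into start_requests() via getattr(), so the spider falls back to todo.csv unless -a file=another_todo.csv is given (csv and Request are imported at the top of the listing above):
# Inside the fromcsv spider:
def start_requests(self):
    # "file" is set by: scrapy crawl fromcsv -a file=another_todo.csv
    file_path = getattr(self, 'file', 'todo.csv')
    with open(file_path, "rU") as f:
        reader = csv.DictReader(f)
        for line in reader:
            request = Request(line.pop('url'))
            request.meta['fields'] = line
            yield request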
Scrapy FormRequest /
meta XPath Selectors .csv
6 Scrapinghub 7 Scrapy
6 Scrapinghub
Amazon RackSpace
Scrapinghub
Jobs Spiders
Periodic Jobs
Settings 1 Scrapinghub
Scrapy Deploy 2
ch06 ch06/properties
$ pwd
/root/book/ch06/properties
$ ls
properties scrapy.cfg
$ cat scrapy.cfg
...
[settings]
default = properties.settings
# Project: properties
[deploy]
url = http://dash.scrapinghub.com/api/scrapyd/
username = 180128bc7a0.....50e8290dbf3b0
password =
project = 28814
$ shub deploy
Packing version 1449092838
Deploying to project "28814"in {"status": "ok", "project": 28814,
"version":"1449092838", "spiders": 1}
Run your spiders at: https://dash.scrapinghub.com/p/28814/
Scrapy Scrapinghub
$ ls
build project.egg-info properties scrapy.cfg setup.py
$ rm -rf build project.egg-info setup.py
2
Schedule 3 Schedule 4
Scrapinghub
6 Stop 7
Completed Jobs 8
Scrapinghub
7
Scrapy Scrapy
Scrapy
Scrapy
Scrapy
Scrapy
scrapy/settings/default_settings.py
settings.py
custom_settings
Pipelines
-s -s CLOSESPIDER_PAGECOUNT=3
API settings.py
settings.py CONCURRENT_REQUESTS
14 14
Scrapy
Scrapy
Scrapy
1——
Scrapy
ch07 ch07/properties
$ pwd
/root/book/ch07/properties
$ ls
properties scrapy.cfg
Start a crawl as follows:
$ scrapy crawl fast
...
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
6023
>>> est()
Execution engine status
time()-engine.start_time : 5.73892092705
engine.has_capacity() : False
len(engine.downloader.active) : 8
...
len(engine.slot.inprogress) : 10
...
len(engine.scraper.slot.active) : 2
10
>>> engine.pause()
>>> engine.unpause()
>>> engine.stop()
Connection closed by foreign host.
10
CONCURRENT_REQUESTS
/IPs
CONCURRENT_REQUESTS_PER_DOMAIN CONCURRENT_REQUESTS_PER_IP
IP
CONCURRENT_REQUESTS_PER_IP CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS = 16
16/0.25 = 64 CONCURRENT_ITEMS
100 10 1
pipelines
CONCURRENT_REQUESTS = 16 CONCURRENT_ITEMS = 100 1600
DOWNLOAD_TIMEOUT 180
16 5 10
0 DOWNLOAD_DELAY
DOWNLOAD_DELAY ±50%
RANDOMIZE_DOWNLOAD_DELAY False
Scrapy CloseSpider
CLOSESPIDER_TIMEOUT(in seconds) CLOSESPIDER_ITEMCOUNT
CLOSESPIDER_PAGECOUNT CLOSESPIDER_ERRORCOUNT
HTTP
Scrapy HttpCacheMiddleware HTTP
HTTPCACHE_POLICY
scrapy.contrib.httpcache.RFC2616Policy RFC2616
HTTPCACHE_ENABLED True HTTPCACHE_DIR
HTTPCACHE_STORAGE
scrapy.contrib.httpcache.DbmCacheStorage HTTPCACHE_DBM_MODULE
anydbm
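As an illustration, enabling the cache with the settings just mentioned might look like this in settings.py (values are examples; the same keys can be passed with -s on the command line):
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_POLICY = 'scrapy.contrib.httpcache.RFC2616Policy'
HTTPCACHE_STORAGE = 'scrapy.contrib.httpcache.DbmCacheStorage'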
2——
$ scrapy crawl fast -s LOG_LEVEL=INFO -s CLOSESPIDER_ITEMCOUNT=5000
HttpCacheMiddleware
CLOSESPIDER_ITEMCOUNT
$ rm -rf .scrapy
Scrapy DEPTH_LIMIT 0
DEPTH_PRIORITY
LIFO FIFO
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
Scrapy
DEPTH_LIMIT 3
robots.txt
ROBOTSTXT_OBEY True Scrapy True
Feeds
Feeds Scrapy
FEED_URI. FEED_URI scrapy crawl fast -o "%(name)s_%(time)s.jl"
%(foo)s, feed
foo S3 FTP URI
FEED_URI='s3://mybucket/file.json' Amazon AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY Amazon S3 JSON
JSON Lines CSV XML FEED_FORMAT Scrapy
FEED_URI FEED_STORE_EMPTY True
FEED_EXPORT_FIELDS .csv
header FEED_URI_PARAMS FEED_URI
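A hedged settings.py sketch that ties these feed options together (the bucket name and field list are placeholders):
FEED_URI = 's3://mybucket/%(name)s_%(time)s.json'
FEED_FORMAT = 'json'
FEED_STORE_EMPTY = True
FEED_EXPORT_FIELDS = ['title', 'price', 'description']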
Scrapy Image Pipeline
IMAGES_STORE
URL image_URL IMAGES_URLS_FIELD
image IMAGES_RESULT_FIELD
IMAGES_MIN_WIDTH IMAGES_MIN_HEIGHT IMAGES_EXPIRES
IMAGES_THUMBS
Scrapy
3——
pip install image
Image Pipeline settings.py ITEM_PIPELINES
scrapy.pipelines.images.ImagesPipeline IMAGES_STORE "images"
IMAGES_THUMBS
ITEM_PIPELINES = {
...
'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'images'
IMAGES_THUMBS = { 'small': (30, 30) }
Item image_URL
Images jpg
rm -rf images
4—— Crawlera
DynDNS IP Scrapy shell checkip.dyndns.org
IP
Scrapy shell IP
IP Scrapy shell unset http_proxy
Crawlera Scrapinghub IP
http_proxy
$ export http_proxy=myusername:mypassword@proxy.crawlera.com:8010
Scrapy Scrapy
BOT_NAME SPIDER_MODULES
Scrapy
startproject genspider
MAIL_FROM MailSender
STATSMAILER_RCPTS MEMUSAGE_NOTIFY_MAIL
SCRAPY_SETTINGS_MODULE SCRAPY_PROJECT Scrapy
Django scrapy.cfg
Scrapy
Scrapy ITEM_PIPELINES
Item Processing Pipelines 9 pipelines
Scrapy 8 COMMANDS_MODULE
properties/hi.py
HTTPERROR_ALLOWED_CODES
URLLENGTH_LIMIT
AUTOTHROTTLE_*
DOWNLOAD_DELAY 0
MEMUSAGE_*
MEMUSAGE_LIMIT_MB 0
email Unix
MEMDEBUG_ENABLED MEMDEBUG_NOTIFY
trackref
Scrapy
Scrapy
8 Scrapy
scrapy
Items
Items
Scrapy Scrapy
Scrapy Scrapy
Scrapy Twisted
Scrapy Twisted
Scrapy Twisted Python Twisted
Scrapy
Scrapy GitHub
Twisted
Twisted
Item
CPUs
I/O
CPU
OS
Twisted
inlineCallbacks
Python
$ python
>>> from twisted.internet import defer
>>> # Experiment 1
>>> d = defer.Deferred()
>>> d.called
False
>>> d.callback(3)
>>> d.called
True
>>> d.result
3
d callback called
True result
>>> # Experiment 2
>>> d = defer.Deferred()
>>> def foo(v):
... print "foo called"
... return v+1
...
>>> d.addCallback(foo)
<Deferred at 0x7f...>
>>> d.called
False
>>> d.callback(3)
foo called
>>> d.called
True
>>> d.result
4
foo() d callback(3) foo() d
>>> # Experiment 3
>>> def status(*ds):
...     return [(getattr(d, 'result', "N/A"), len(d.callbacks)) for d in ds]
>>> def b_callback(arg):
... print "b_callback called with arg =", arg
... return b
>>> def on_done(arg):
... print "on_done called with arg =", arg
... return arg
>>> # Experiment 3.a
>>> a = defer.Deferred()
>>> b = defer.Deferred()
>>> a.addCallback(b_callback).addCallback(on_done)
>>> status(a, b)
[('N/A', 2), ('N/A', 0)]
>>> a.callback(3)
b_callback called with arg = 3
>>> status(a, b)
[(<Deferred at 0x...>, 1), ('N/A', 1)]
>>> b.callback(4)
on_done called with arg = 4
>>> status(a, b)
[(4, 0), (None, 0)]
Twisted defer.DeferredList
>>> # Experiment 4
>>> deferreds = [defer.Deferred() for i in xrange(5)]
>>> join = defer.DeferredList(deferreds)
>>> join.addCallback(on_done)
>>> for i in xrange(4):
... deferreds[i].callback(i)
>>> deferreds[4].callback(4)
on_done called with arg = [(True, 0), (True, 1), (True, 2),
(True, 3), (True, 4)]
for on_done()
deferreds[4].callback() on_done()
True /False
Twisted I/O——Python
Python
$ ./deferreds.py 1
------ Running example 1 ------
Start installation for Bill
All done for Bill
Start installation
...
* Elapsed time: 12.03 seconds
4 3 12
import threading
# The company grew. We now have many customers and I can't handle the
# workload. We are now 5 developers doing exactly the same thing.
def developers_day(customers):
# But we now have to synchronize... a.k.a. bureaucracy
lock = threading.Lock()
#
def dev_day(id):
print "Goodmorning from developer", id
# Yuck - I hate locks...
lock.acquire()
while customers:
customer = customers.pop(0)
lock.release()
# My Python is less readable
install_wordpress(customer)
lock.acquire()
lock.release()
print "Bye from developer", id
# We go to work in the morning
devs = [threading.Thread(target=dev_day, args=(i,)) for i in
range(5)]
[dev.start() for dev in devs]
# We leave for the evening
[dev.join() for dev in devs]
# We now get more done in the same time but our dev process got more
# complex. As we grew we spend more time managing queues than doing dev
# work. We even had occasional deadlocks when processes got extremely
# complex. The fact is that we are still mostly pressing buttons and
# waiting but now we also spend some time in meetings.
developers_day(["Customer %d" % i for i in xrange(15)])
$ ./deferreds.py 2
------ Running example 2 ------
Goodmorning from developer 0Goodmorning from developer
1Start installation forGoodmorning from developer 2
Goodmorning from developer 3Customer 0
...
from developerCustomer 13 3Bye from developer 2
* Elapsed time: 9.02 seconds
5 15 3 45 5 9
Twisted
# For years we thought this was all there was... We kept hiring more
# developers, more managers and buying servers. We were trying harder
# optimising processes and fire-fighting while getting mediocre
# performance in return. Till luckily one day our hosting
# company decided to increase their fees and we decided to
# switch to Twisted Ltd.!
from twisted.internet import reactor
from twisted.internet import defer
from twisted.internet import task
# Twisted has a slightly different approach
def schedule_install(customer):
# They are calling us back when a Wordpress installation completes.
# They connected the caller recognition system with our CRM and
# we know exactly what a call is about and what has to be done
# next.
#
# We now design processes of what has to happen on certain events.
def schedule_install_wordpress():
def on_done():
print "Callback: Finished installation for", customer
print "Scheduling: Installation for", customer
return task.deferLater(reactor, 3, on_done)
#
def all_done(_):
print "All done for", customer
#
# For each customer, we schedule these processes on the CRM
# and that
# is all our chief-Twisted developer has to do
d = schedule_install_wordpress()
d.addCallback(all_done)
#
return d
# Yes, we don't need many developers anymore or any synchronization.
# ~~ Super-powered Twisted developer ~~
def twisted_developer_day(customers):
print "Goodmorning from Twisted developer"
#
# Here's what has to be done today
work = [schedule_install(customer) for customer in customers]
# Turn off the lights when done
join = defer.DeferredList(work)
join.addCallback(lambda _: reactor.stop())
#
print "Bye from Twisted developer!"
$ ./deferreds.py 3
------ Running example 3 ------
Goodmorning from Twisted developer
Scheduling: Installation for Customer 0
....
Scheduling: Installation for Customer 14
Bye from Twisted developer!
Callback: Finished installation for Customer 0
All done for Customer 0
Callback: Finished installation for Customer 1
All done for Customer 1
...
All done for Customer 14
* Elapsed time: 3.18 seconds
15 45 3
sleep() task.deferLater()
15
CPUs
CPU CPUs
I/O CPUs task.deferLater()
Twisted reactor.run()
CRM
reactor.run()
# Twisted gave us utilities that make our code way more readable!
@defer.inlineCallbacks
def inline_install(customer):
print "Scheduling: Installation for", customer
yield task.deferLater(reactor, 3, lambda: None)
print "Callback: Finished installation for", customer
print "All done for", customer
def twisted_developer_day(customers):
... same as previously but using inline_install()
instead of schedule_install()
twisted_developer_day(["Customer %d" % i for i in xrange(15)])
reactor.run()
$ ./deferreds.py 4
... exactly the same as before
inlineCallbacks Python
inline_install() inline_install() yield
inline_install()
15 10000 10000
HTTP
Scrapy
CONCURRENT_ITEMS
@defer.inlineCallbacks
def inline_install(customer):
... same as above
# The new "problem" is that we have to manage all this concurrency to
# avoid causing problems to others, but this is a nice problem to have.
def twisted_developer_day(customers):
print "Goodmorning from Twisted developer"
work = (inline_install(customer) for customer in customers)
#
# We use the Cooperator mechanism to make the secretary not
# service more than 5 customers simultaneously.
coop = task.Cooperator()
join = defer.DeferredList([coop.coiterate(work) for i in xrange(5)])
#
join.addCallback(lambda _: reactor.stop())
print "Bye from Twisted developer!"
twisted_developer_day(["Customer %d" % i for i in xrange(15)])
reactor.run()
# We are now more lean than ever, our customers happy, our hosting
# bills ridiculously low and our performance stellar.
# ~*~ THE END ~*~
$ ./deferreds.py 5
------ Running example 5 ------
Goodmorning from Twisted developer
Bye from Twisted developer!
Scheduling: Installation for Customer 0
...
Callback: Finished installation for Customer 4
All done for Customer 4
Scheduling: Installation for Customer 5
...
Callback: Finished installation for Customer 14
All done for Customer 14
* Elapsed time: 9.19 seconds
3 5
Scrapy
Scrapy cookies
Scrapy
URL HTTPS
HTTP
item pipelines
Items Items item pipeline
Scrapy GitHub settings/default_settings.py
SPIDER_MIDDLEWARES_BASE
Item Pipelines
Scrapy API
Item
Scrapy MiddlewareManager
from_crawler() from_settings() Settings
crawler.settings from_crawler() Settings
Crawler
1—— pipeline
Python
pipelines items
process_item() pipeline
3 tidyup.py
pipeline
settings.py ITEM_PIPELINES
ch8/properties
ISO
Item crawler.signals.connect() 11
Item Pipeline
items
ch08/hooksasync
2——
pipelines
AutoThrottle
scrapy/extensions/throttle.py request.meta.get 'download_latency'
scrapy/core/downloader/webclient.py download_latency
Scrapy
class Latencies(object):
@classmethod
def from_crawler(cls, crawler):
return cls(crawler)
def __init__(self, crawler):
self.crawler = crawler
self.interval = crawler.settings.getfloat('LATENCIES_INTERVAL')
if not self.interval:
raise NotConfigured
cs = crawler.signals
cs.connect(self._spider_opened, signal=signals.spider_opened)
cs.connect(self._spider_closed, signal=signals.spider_closed)
cs.connect(self._request_scheduled, signal=signals.request_scheduled)
cs.connect(self._response_received, signal=signals.response_received)
cs.connect(self._item_scraped, signal=signals.item_scraped)
self.latency, self.proc_latency, self.items = 0, 0, 0
def _spider_opened(self, spider):
self.task = task.LoopingCall(self._log, spider)
self.task.start(self.interval)
def _spider_closed(self, spider, reason):
if self.task.running:
self.task.stop()
def _request_scheduled(self, request, spider):
request.meta['schedule_time'] = time()
def _response_received(self, response, request, spider):
request.meta['received_time'] = time()
def _item_scraped(self, item, response, spider):
self.latency += time() - response.meta['schedule_time']
self.proc_latency += time() - response.meta['received_time']
self.items += 1
def _log(self, spider):
irate = float(self.items) / self.interval
latency = self.latency / self.items if self.items else 0
proc_latency = self.proc_latency / self.items if self.items else 0
spider.logger.info(("Scraped %d items at %.1f items/s, avg
latency: "
"%.2f s and avg time in pipelines: %.2f s") %
(self.items, irate, latency, proc_latency))
self.latency, self.proc_latency, self.items = 0, 0, 0
Crawler
from_crawler(cls, crawler) crawler
init() crawler.settings NotConfigured
FooBar FOOBAR_ENABLED False
settings.py ITEM_PIPELINES
Scrapy AutoThrottle
HttpCache
LATENCIES_INTERVAL
init() crawler.signals.connect()
_spider_opened()
LATENCIES_INTERVAL _log() _spider_closed()
_request_scheduled() _response_received() request.meta
_item_scraped() items _log()
settings.py latencies.py
settings.py
$ pwd
/root/book/ch08/properties
$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=1000 -s LOG_LEVEL=INFO
...
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
INFO: Scraped 0 items at 0.0 items/sec, average latency: 0.00 sec and
average time in pipelines: 0.00 sec
INFO: Scraped 115 items at 23.0 items/s, avg latency: 0.84 s and avg time
in pipelines: 0.12 s
INFO: Scraped 125 items at 25.0 items/s, avg latency: 0.78 s and avg time
in pipelines: 0.12 s
Log Stats 24
0.78 Little
N = T·S = 43 · 0.45 ≅ 19 CONCURRENT_REQUESTS
CONCURRENT_REQUESTS_PER_DOMAIN 100% CPU
30 10
scrapy/settings/default_settings.py Scrapy
Scrapy
Scrapy
scrapy crawl CrawlerProcess Crawler
Crawler Scrapy settings signals spider
extensions.crawler ExtensionManager engine
ExecutionEngine Scheduler Downloader Scraper Scheduler
URL Downloader Scraper Downloader
DownloaderMiddleware DownloadHandler Scraper SpiderMiddleware
ItemPipeline MiddlewareManager Scrapy feeds
FeedExporter
URL S3 XML Pickle
FEED_STORAGES FEED_EXPORTERS scrapy
check SPIDER_CONTRACTS
Scrapy Twisted
Item Processing Pipeline
9 Pipelines
Scrapy pipelines
REST APIs CPU
REST APIs
REST SOAP web
REST web CRUD Create Read Update
Delete HTTP GET POST PUT DELETE web
URL http://api.mysite.com/customer/john URL
john
XML JSON
REST
treq
treq Python Twisted Python requests GET
POST HTTP pip install treq
Elasticsearch pipeline
ES Elasticsearch Items ES
MySQL ES ES
treq ES txes2
Python/Twisted ES
Vagrant ES ES
$ curl http://es:9200
{
"name" : "Living Brain",
"cluster_name" : "elasticsearch",
"version" : { ... },
"tagline" : "You Know, for Search"
}
http://localhost:9200 http://localhost:9200/properties/property/_search ES
pipeline
ch09 ch09/properties/properties/pipelines/es.py
@defer.inlineCallbacks
def process_item(self, item, spider):
data = json.dumps(dict(item), ensure_ascii=False).encode("utf- 8")
yield treq.post(self.es_url, data)
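For context, the surrounding pipeline class is shaped roughly as follows: from_crawler() reads ES_PIPELINE_URL from the settings and process_item() POSTs the serialized item with treq (a simplified sketch, not the full es.py listing):
import json
import treq
from twisted.internet import defer

class EsWriter(object):
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('ES_PIPELINE_URL'))

    def __init__(self, es_url):
        self.es_url = es_url

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        # Serialize the item and index it asynchronously
        data = json.dumps(dict(item), ensure_ascii=False).encode("utf-8")
        yield treq.post(self.es_url, data)
        defer.returnValue(item)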
process_item() 8
ITEM_PIPELINES = {
'properties.pipelines.tidyup.TidyUp': 100,
'properties.pipelines.es.EsWriter': 800,
}
ES_PIPELINE_URL = 'http://es:9200/properties/property'
$ pwd
/root/book/ch09/properties
$ ls
properties scrapy.cfg
http://localhost:9200/properties/property/_search 10
hits/total ?size=100
q= URL
http://localhost:9200/properties/property/_search?q=title:london
London https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html ES
ES http://localhost:9200/properties/
pipelines Items
pipelines
Twisted APIs
$ curl "https://maps.googleapis.com/maps/api/geocode/json?sensor=false&ad
dress=london"
{
"results" : [
...
"formatted_address" : "London, UK",
"geometry" : {
...
"location" : {
"lat" : 51.5073509,
"lng" : -0.1277583
},
"location_type" : "APPROXIMATE",
...
],
"status" : "OK"
}
JSON location
results[0].geometry.location
@defer.inlineCallbacks
def geocode(self, address):
endpoint = 'http://web:9312/maps/api/geocode/json'
parms = [('address', address), ('sensor', 'false')]
response = yield treq.get(endpoint, params=parms)
content = yield response.json()
geo = content['results'][0]["geometry"]["location"]
defer.returnValue({"lat": geo["lat"], "lon": geo["lng"]})
URL
endpoint = 'https://maps.googleapis.com/maps/api/geocode/json' Google
address sensor URL treq get()
params yield response.json()
Python dict
defer.returnValue() inlineCallbacks
Scrapy
geocode() process_item()
pipeline ITEM_PIPELINES ES
ES
ITEM_PIPELINES = {
...
'properties.pipelines.geo.GeoPipeline': 400,
}
Twisted
class Throttler(object):
def __init__(self, rate):
self.queue = []
self.looping_call = task.LoopingCall(self._allow_one)
self.looping_call.start(1. / float(rate))
def stop(self):
self.looping_call.stop()
def throttle(self):
d = defer.Deferred()
self.queue.append(d)
return d
def _allow_one(self):
if self.queue:
self.queue.pop(0).callback(None)
_allow_one() _allow_one()
callback() Twisted task.LoopingCall()
API _allow_one() Throttler pipeline init
class GeoPipeline(object):
    def __init__(self, stats):
        self.throttler = Throttler(5)  # 5 Requests per second

    def close_spider(self, spider):
        self.throttler.stop()

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        yield self.throttler.throttle()
        item["location"] = yield self.geocode(item["address"][0])
        defer.returnValue(item)
yield 11
5 11/5=2.2
Throttler
Python dict
API Python Twisted
class DeferredCache(object):
def __init__(self, key_not_found_callback):
self.records = {}
self.deferreds_waiting = {}
self.key_not_found_callback = key_not_found_callback
@defer.inlineCallbacks
def find(self, key):
rv = defer.Deferred()
if key in self.deferreds_waiting:
self.deferreds_waiting[key].append(rv)
else:
self.deferreds_waiting[key] = [rv]
if not key in self.records:
try:
value = yield self.key_not_found_callback(key)
self.records[key] = lambda d: d.callback(value)
except Exception as e:
self.records[key] = lambda d: d.errback(e)
action = self.records[key]
for d in self.deferreds_waiting.pop(key):
reactor.callFromThread(action, d)
value = yield rv
defer.returnValue(value)
self.deferreds_waiting
self.records dict
find()
self.deferreds_waiting dict key_not_found_callback()
key_not_found_callback()
action(d)
reactor.callFromThread()
ch09/properties/properties/pipelines/geo2.py
ITEM_PIPELINES = {
'properties.pipelines.tidyup.TidyUp': 100,
'properties.pipelines.es.EsWriter': 800,
# DISABLE 'properties.pipelines.geo.GeoPipeline': 400,
'properties.pipelines.geo2.GeoPipeline': 400,
}
35
1019 - 35= 984 API
Google API API Throttler(5) Throttler(10) 5
10 geo_pipeline/retries stat
API geo_pipeline/errors stat
geo_pipeline/already_set stat http://localhost:9200/properties/property/_search ES
{..."location": {"lat": 51.5269736, "lon": -0.0667204}...}
Elasticsearch
HTTP POST
Angel {51.54, -0.19}
property.txt
"location":{"properties":{"lat":{"type":"double"},"lon":{"type":"double"}}}
JSONs sort
Python
Python Database API 2.0 MySQL PostgreSQL Oracle
Microsoft SQL Server SQLite Twisted
Twisted Scrapy
twisted.enterprise.adbapi MySQL
pipeline MySQL
MySQL pipeline
MySQL MySQL
$ mysql -h mysql -uroot -ppass
mysql> MySQL
MySQL properties
pipeline MySQL exit
MySQL properties
ch09/properties/properties/pipelines/mysql.py
process_item() dbpool.runInteraction()
Transaction Transaction DB-API
API do_replace() @staticmethod
self
SQL
Transaction execute() SQL REPLACE
INTO INSERT INTO
SQL SELECT dbpool.runQuery()
adbapi.ConnectionPool() cursorclass
cursorclass=MySQLdb.cursors.DictCursor
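A condensed sketch of the structure described above: process_item() hands the item to dbpool.runInteraction(), and do_replace() is a @staticmethod that runs REPLACE INTO on the transaction cursor (connection details and column list are illustrative; the real pipeline parses them from MYSQL_PIPELINE_URL):
import traceback
import MySQLdb.cursors
from twisted.internet import defer
from twisted.enterprise import adbapi

class MysqlWriter(object):
    def __init__(self):
        # In the real pipeline these values come from MYSQL_PIPELINE_URL
        self.dbpool = adbapi.ConnectionPool(
            "MySQLdb", host="mysql", user="root", passwd="pass",
            db="properties", charset="utf8", use_unicode=True,
            cursorclass=MySQLdb.cursors.DictCursor)

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        try:
            yield self.dbpool.runInteraction(self.do_replace, item)
        except Exception:
            print traceback.format_exc()
        defer.returnValue(item)

    @staticmethod
    def do_replace(tx, item):
        """Runs the actual REPLACE INTO statement on the transaction cursor."""
        sql = ("REPLACE INTO properties (url, title, price, description) "
               "VALUES (%s, %s, %s, %s)")
        args = (item["url"][0], item["title"][0],
                item["price"][0], item["description"][0])
        tx.execute(sql, args)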
ITEM_PIPELINES = { ...
'properties.pipelines.mysql.MysqlWriter': 700,
...
MYSQL_PIPELINE_URL = 'mysql://root:pass@mysql/properties'
MySQL
Twisted
treq REST APIs Scrapy Twisted
MongoDB “MongoDB Python”
PyMongo / pipeline
Twisted PyMongo “MongoDB Twisted Python” txmongo
Twisted Scrapy Twisted
Twisted Redis
pipeline Redis
Google Geocoding API IP IPs
API /
$ redis-cli -h redis
redis:6379> info keyspace
# Keyspace
redis:6379> set key value
OK
redis:6379> info keyspace
# Keyspace
db0:keys=1,expires=0,avg_ttl=0
redis:6379> FLUSHALL
OK
redis:6379> info keyspace
# Keyspace
redis:6379> exit
RedisCache pipeline
from txredisapi import lazyConnectionPool
class RedisCache(object):
...
def __init__(self, crawler, redis_url, redis_nm):
self.redis_url = redis_url
self.redis_nm = redis_nm
args = RedisCache.parse_redis_url(redis_url)
self.connection = lazyConnectionPool(connectTimeout=5,
replyTimeout=5,
**args)
crawler.signals.connect(
self.item_scraped, signal=signals.item_scraped)
ch09/properties/properties/pipelines/redis.py
Item Redis
dict
geo-pipeline Items
txredisapi set()
set() @defer.inlineCallbacks signals.item_scraped
connection.set() yield
Scrapy Redis
connection.set()
trap() ConnectionError Twisted API trap()
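A rough sketch of the item_scraped handler just described: it caches the geocoded location under a key derived from the address using txredisapi's set(), and swallows connection errors so a Redis outage does not abort the crawl (key format and stored fields are illustrative; the original traps ConnectionError on the returned deferred rather than using try/except):
import json
from twisted.internet import defer
from twisted.internet.error import ConnectionError

# Inside the RedisCache class:
@defer.inlineCallbacks
def item_scraped(self, item, spider):
    try:
        key = self.redis_nm + ":" + item["address"][0]
        value = {"lat": item["location"]["lat"],
                 "lon": item["location"]["lon"]}
        yield self.connection.set(key, json.dumps(value))
    except ConnectionError:
        # Redis being unavailable should not kill the crawl
        pass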
ITEM_PIPELINES = { ...
'properties.pipelines.redis.RedisCache': 300,
'properties.pipelines.geo.GeoPipeline': 400,
...
REDIS_PIPELINE_URL = 'redis://redis:6379'
CPU
Twisted Twisted
Scrapy Twisted reactor.spawnProcess() Python
pipeline CPU
8
Twisted reactor.callInThread() API
pipeline
class UsingBlocking(object):
@defer.inlineCallbacks
def process_item(self, item, spider):
price = item["price"][0]
out = defer.Deferred()
reactor.callInThread(self._do_calculation, price, out)
item["price"][0] = yield out
defer.returnValue(item)
def _do_calculation(self, price, out):
new_price = price + 1
time.sleep(0.10)
reactor.callFromThread(out.callback, new_price)
pipeline Item
_do_calculation() time.sleep()
reactor.callInThread()
out _do_calculation() out
Item
_do_calculation() 1 100ms
10
100ms
out.callback(new_price)
reactor.callFromThread()
process_item() yield Item
_do_calculation()
beta delta
class UsingBlocking(object):
def __init__(self):
self.beta, self.delta = 0, 0
...
def _do_calculation(self, price, out):
self.beta += 1
time.sleep(0.001)
self.delta += 1
new_price = price + self.beta - self.delta + 1
assert abs(new_price-price-1) < 0.01
time.sleep(0.10)...
self.beta self.delta
beta/delta beta delta
Python threading.RLock()
class UsingBlocking(object):
def __init__(self):
...
self.lock = threading.RLock()
...
def _do_calculation(self, price, out):
with self.lock:
self.beta += 1
...
new_price = price + self.beta - self.delta + 1
assert abs(new_price-price-1) < 0.01 ...
ch09/properties/properties/pipelines/computation.py
ITEM_PIPELINES = { ...
'properties.pipelines.computation.UsingBlocking': 500,
pipeline 100ms 25
items
pipeline
Twisted
reactor.spawnProcess() API protocol.ProcessProtocol
#!/bin/bash
trap "" SIGINT
sleep 3
while read line
do
# 4 per second
sleep 0.25
awk "BEGIN {print 1.20 * $line}"
done
bash Ctrl + C
Ctrl + C Scrapy Ctrl + C
250ms
1.20 Linux awk
1/250ms=4 Items session
$ properties/pipelines/legacy.sh
12 <- If you type this quickly you will wait ~3 seconds to get results
14.40
13 <- For further numbers you will notice just a slight delay
15.60
class CommandSlot(protocol.ProcessProtocol):
def __init__(self, args):
self._queue = []
reactor.spawnProcess(self, args[0], args)
def legacy_calculate(self, price):
d = defer.Deferred()
self._queue.append(d)
self.transport.write("%f\n" % price)
return d
# Overriding from protocol.ProcessProtocol
def outReceived(self, data):
"""Called when new output is received"""
self._queue.pop(0).callback(float(data))
class Pricing(object):
def __init__(self):
self.slot = CommandSlot(['properties/pipelines/legacy.sh'])
@defer.inlineCallbacks
def process_item(self, item, spider):
item["price"][0] = yield self.slot.legacy_calculate(item["price"][0])
defer.returnValue(item)
ITEM_PIPELINES = {...
'properties.pipelines.legacy.Pricing': 600,
pipeline
class Pricing(object):
def __init__(self):
self.concurrency = 16
args = ['properties/pipelines/legacy.sh']
self.slots = [CommandSlot(args)
for i in xrange(self.concurrency)]
self.rr = 0
@defer.inlineCallbacks
def process_item(self, item, spider):
slot = self.slots[self.rr]
self.rr = (self.rr + 1) % self.concurrency
item["price"][0] = yield
slot.legacy_calculate(item["price"][0])
defer.returnValue(item)
16 pipeline 16*4 = 64
10 Scrapy
Scrapy
Scrapy Scrapy
Amdahl
Dr.Goldratt
Scrapy
Scrapy
Scrapy ——
1
Little
N, T, S: N = T·S, T = N/S, S = N/T
1 V L
A V=L*A
L S L≈S V≈N
A≈T Little
L≈S V≈N
A≈T
/
A2
Little 1
Scrapy / N =8
/ S S=250ms /
T = N/S = 8/0.25 = 32 requests/s
N 8 16 32 1
Little 16
T = N/S = 16/0.25 = 64 requests/s 32 T = N/S = 32/0.25 = 128 requests/s
/
T 2
Scrapy URL
URL
Scrapy
Scrapy
Scrapy
Scrapy 3
Scrapy
Request URL
5MB
Scrapy
/ Python/Twisted
CONCURRENT_REQUESTS*
Response Item Request
pipelines pipelines
CPU
Requests/Items
Scrapy Requests/Responses/Items
Scrapy 6023
Scrapy Python
time.sleep() est()
6023 est()
ch10 ch10/speed
$ pwd
/root/book/ch10/speed
$ ls
scrapy.cfg speed
$ scrapy crawl speed -s SPEED_PIPELINE_ASYNC_DELAY=1
INFO: Scrapy 1.0.3 started (bot: speed)
...
Ctrl+D Ctrl+C
dqs JOBDIR
dqs len(engine.slot.scheduler.dqs) mqs
mqs 4475
len(engine.downloader.active) 16
CONCURRENT_REQUESTS len(engine.scraper.slot.active) 115
(engine.scraper.slot.active_size) 115kb
105 Items pipelines(engine.scraper.slot.itemproc_size) 10
mqs
stats dict
via stats.get_stats() p()
$ p(stats.get_stats())
{'downloader/request_bytes': 558330,
...
'item_scraped_count': 2485,
...}
speed/spiders/speed.py
http://localhost:9312/benchmark/... handlers
4 URL /Scrapy
http://localhost:9312/benchmark/index?p=
1 http://localhost:9312/benchmark/id:3/rr:5/index?p=1
item items
http://localhost:9312/benchmark/ds:100/detail?id0=0 speed/settings.py SPEED_T_RESPONSE = 0.125 SPEED_TOTAL_ITEMS = 5000 Items
SpeedSpider SPEED_START_REQUESTS_STYLE
start_requests() parse_item() crawler.engine.crawl()
URL
pipeline DummyPipeline /
/ (SPEED_PIPELINE_BLOCKING_DELAY— )
(SPEED_PIPELINE_ASYNC_DELAY— ) treq API
(SPEED_PIPELINE_API_VIA_TREQ— ) Scrapy crawler.engine.download()
API (SPEED_PIPELINE_API_VIA_DOWNLOADER— ) pipeline
settings.py
Linux
Scrapy Scrapy
CONCURRENT_REQUESTS
CONCURRENT_REQUESTS_PER_DOMAIN CONCURRENT_REQUESTS_PER_IP
CONCURRENT_REQUESTS
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP CONCURRENT_REQUESTS_PER_DOMAIN
IP
per-IP CONCURRENT_REQUESTS_PER_IP
0 CONCURRENT_REQUESTS_PER_DOMAIN 1000000
CONCURRENT_REQUESTS
Linux
Twisted/Python t_download = t_response + t_overhead
Items pipeline
toverhead tstart/stop
CONCURRENT_REQUESTS
CPU tresponse
2000
5
y = t_overhead·x + t_start/stop, where x = N/CONCURRENT_REQUESTS and y = t_job − x·t_response
LINEST Excel t_overhead = 6 ms
t_start/stop = 3.1 s t_overhead URL
N tjob T
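To make the model concrete, here is a quick back-of-the-envelope calculation using the fitted values above (t_overhead = 6 ms, t_start/stop = 3.1 s) and the benchmark defaults that appear in the following cases (N = 1000 responses, t_response = 0.25 s, CONCURRENT_REQUESTS = 16):
# t_job ~= t_start/stop + (N / CONCURRENT_REQUESTS) * (t_response + t_overhead)
N, concurrent = 1000, 16
t_response, t_overhead, t_startstop = 0.25, 0.006, 3.1

t_job = t_startstop + (N / float(concurrent)) * (t_response + t_overhead)
print "estimated job time: %.1f s" % t_job   # roughly 19 s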
Scrapy
Scrapy
1——CPU
Scrapy CPU
CPU 80-90%
CONCURRENT_REQUESTS CPU
11 CPU
2-
SPEED_SPIDER_BLOCKING_DELAY
SPEED_PIPELINE_BLOCKING_DELAY 100ms
100 URL 2 3 13
CONCURRENT_REQUESTS=1
100 URLs · 100 ms (delay) = 10 s + t_start/stop
pipelines
pipeline
APIs
9 Scrapy
11
pipelines
pipeline pipelines
pipelines
pipeline pipelines
meta
pipelines item_scraped
Twisted/ Twisted
SPEED_PIPELINE_BLOCKING_DELAY SPEED_PIPELINE_ASYNC_DELAY
3- “ ”
1000 0.25 16
19 pipeline crawler.engine.download()
API HTTP http://localhost:9312/benchmark/ar:1/api?text=hello
16
Scrapy
Requests
APIs
crawler.engine.download() Scrapy
robots.txt pipeline APIs
APIs Python requests
Twisted treq
API URL pipelines
crawler.engine.download() HTTP CONCURRENT_REQUESTS
CONCURRENT_REQUESTS
1 API 0.25
API
treq crawler.engine.download()
API CONCURRENT_REQUESTS
API
treq
16 T = N/S = 16/0.25 = 64 /
done 0.25 pipelines
1 API pipeline N = T * S = 64 * 1 = 64 Items
pipelines pipelines
4-
pipelines
APIs pipelines
5-item /
Items
Items
1300 Items
11
CPU
CPU
5MB pipelines
6-
CONCURRENT_REQUESTS
1
T = N/S = N/1 = CONCURRENT_REQUESTS
64 11 500 URL 64
S = N/T + t_start/stop = 500/64 + 3.1 = 10.91 s
URL
SPEED_START_REQUESTS_STYLE=UseIndex URL
20 URL
T = N/(S − t_start/stop) = 500/(32.2 − 3.1) ≅ 17 requests/s
d/load URL
URL 20 URL+
20 URL
URL
URL 50
12 URL
LIFO
SPEED_INDEX_RULE_LAST=1
SPEED_INDEX_HIGHER_PRIORITY=1
URL
URL
100 1 51
-s SPEED_INDEX_SHARDS
5Mb
mqs/dqs
CPU
Scrapy Scrapy
Twisted Netty Node.js
Scrapy
“ / / ” Scrapy
11 Scrapyd
HTML XPath
Scrapy Scrapy
Scrapy Python Twisted Scrapy
Scrapy
Scrapyd 6
“ ” “ ”
“ ”
we observe that the average price of houses with a jacuzzi is $1300 versus $995: shift_jacuzzi = (1300 − 995)/1000 = 30.5%
5%
Scrapyd
Scrapyd Scrapyd
3
Scrapyd http://localhost:6800/
scrapy.cfg
$ pwd
/root/book/ch03/properties
$ cat scrapy.cfg
...
[settings]
default = properties.settings
[deploy]
url = http://localhost:6800/
project = properties
url scrapyd-client scrapyd-deploy scrapyd-client Scrapy
pip install scrapyd-client
$ scrapyd-deploy
Packing version 1450044699
Deploying to project "properties" in http://localhost:6800/addversion.
json
Server response (200):
{"status": "ok", "project": "properties", "version": "1450044699",
"spiders": 3, "node_name": "dev"}
http_port Scrapyd
http://scrapyd.readthedocs.org/
max_proc 0 CPU
Scrapyd max_proc 4 4
Scrapy
FTP
Pure-FTPd /root/items
Spark /root/items
Spark Python
24
Spark
items 9
pipelines Spark
Items Spark
3 easy Rule
callback='parse_item'
ch11
scrapy crawl 10
$ ls
properties scrapy.cfg
$ pwd
/root/book/ch11/properties
$ time scrapy crawl easy -s CLOSESPIDER_PAGECOUNT=10
...
DEBUG: Crawled (200) <GET ...index_00000.html> (referer: None)
DEBUG: Crawled (200) <GET ...index_00001.html> (referer: ...index_00000.
html)
...
real 0m4.099s
10 4 26 1700
1
16 URL
URL 20 16
20
start_URL
start_URL = ['http://web:9312/properties/index_%05d.html' % id
for id in map(lambda x: 1667 * x / 20, range(20))]
(CONCURRENT_REQUESTS,
CONCURRENT_REQUESTS_PER_DOMAIN) 16
50000/52=16
10
URL
scrapyd URL URL scrapyd URL
scrapyd
URL
URL scrapyds
Scrapy
process_spider_output() Requests
CrawlSpider GET POST
_batch scrapyd
_targets DISTRIBUTED_TARGET_HOSTS scrapyd schedule.json
POST curl
scrapyd
FEED_URI
DISTRIBUTED_TARGET_FEED_URL
FEED_URI schedule.json
FEED_URI
DISTRIBUTED_START_URL URL JSON JSON
Scrapy
Redis Scrapy ID _flush_URL()
process_start_requests()
_closed() Ctrl + C
URL _closed() _flush_URL(spider)
treq.post()
scrapyd_submits_to_wait
treq.post() defer.DeferredList() _closed()
@defer.inlineCallbacks yield
settings URL
process_start_requests() start_requests
DISTRIBUTED_START_URL JSON URL
CrawlSpider _response_downloaded()
meta['rule'] Rule Scrapy
CrawlSpider
def __init__(self, crawler):
...
self._start_URL = settings.get('DISTRIBUTED_START_URL', None)
self.is_worker = self._start_URL is not None
def process_start_requests(self, start_requests, spider):
if not self.is_worker:
for x in start_requests:
yield x
else:
for url in json.loads(self._start_URL):
yield Request(url, spider._response_downloaded,
meta={'rule': self._target})
settings.py
SPIDER_MIDDLEWARES = {
'properties.middlewares.Distributed': 100,
}
DISTRIBUTED_TARGET_RULE = 1
DISTRIBUTED_BATCH_SIZE = 2000
DISTRIBUTED_TARGET_FEED_URL = ("ftp://anonymous@spark/"
"%(batch)s_%(name)s_%(time)s.jl")
DISTRIBUTED_TARGET_HOSTS = [
"scrapyd1:6800",
"scrapyd2:6800",
"scrapyd3:6800",
]
DISTRIBUTED_TARGET_RULE
custom_settings :
custom_settings = {
'DISTRIBUTED_TARGET_RULE': 3
}
CrawlSpider meta
scrapyd
scrapyd scrapy.cfg
[deploy:target-name]
$ pwd
/root/book/ch11/properties
$ cat scrapy.cfg
...
[deploy:scrapyd1]
url = http://scrapyd1:6800/
[deploy:scrapyd2]
url = http://scrapyd2:6800/
[deploy:scrapyd3]
url = http://scrapyd3:6800/
scrapyd-deploy -l
$ scrapyd-deploy -l
scrapyd1 http://scrapyd1:6800/
scrapyd2 http://scrapyd2:6800/
scrapyd3 http://scrapyd3:6800/
scrapyd-deploy
$ scrapyd-deploy scrapyd1
Packing version 1449991257
Deploying to project "properties" in http://scrapyd1:6800/addversion.json
Server response (200):
{"status": "ok", "project": "properties", "version": "1449991257",
"spiders": 2, "node_name": "scrapyd1"}
scrapyd-deploy -L
$ scrapyd-deploy -L scrapyd1
properties
scrapyd
Scrapy scrapy monitor scrapyd
monitor.py settings.py COMMANDS_MODULE = 'properties.monitor'
scrapyd listjobs.json API
URL scrapyd-deploy https://github.com/scrapy/scrapyd-client/blob/master/scrapyd-client/scrapyd-deploy
_get_targets() URL
class Command(ScrapyCommand):
requires_project = True
def run(self, args, opts):
self._to_monitor = {}
for name, target in self._get_targets().iteritems():
if name in args:
project = self.settings.get('BOT_NAME')
url = target['url'] + "listjobs.json?project=" + project
self._to_monitor[name] = url
l = task.LoopingCall(self._monitor)
l.start(5) # call every 5 seconds
reactor.run()
@defer.inlineCallbacks
def _monitor(self):
all_deferreds = []
for name, url in self._to_monitor.iteritems():
d = treq.get(url, timeout=5, persistent=False)
d.addBoth(lambda resp, name: (name, resp), name)
all_deferreds.append(d)
all_resp = yield defer.DeferredList(all_deferreds)
for (success, (name, resp)) in all_resp:
json_resp = yield resp.json()
print "%-20s running: %d, finished: %d, pending: %d" %
(name, len(json_resp['running']),
len(json_resp['finished']), len(json_resp['pending']))
Scrapy
boostwords.py
def to_shifts(word_prices):
(sum0, cnt0) = word_prices.values().reduce(add_tuples)
avg0 = sum0 / cnt0
def calculate_shift((isum, icnt)):
avg_with = isum / icnt
avg_without = (sum0 - isum) / (cnt0 - icnt)
return (avg_with - avg_without) / avg0
return word_prices.mapValues(calculate_shift)
vagrant ssh
1 CPU
2 scrapy monitor
$ vagrant ssh
$ cd book/ch11/properties
$ scrapy monitor scrapyd*
3
$ vagrant ssh
$ cd book/ch11/properties
$ for i in scrapyd*; do scrapyd-deploy $i; done
$ scrapy crawl distr
for scrapyd-deploy
scrapy crawl distr scrapy crawl distr -s
CLOSESPIDER_PAGECOUNT=100 100 3000
4 Spark
boostwords.py items
watch ls -1 items item
CPU
3 4 CPU16 *4 / =768 /
4G 8CPU Macbook Pro 2 40 50000
URL 315 / EC2 m4.large 2 vCPUs 8G CPU
6 12 134 / EC2 m4.4xlarge 16 vCPUs 64G
1 44 480 / scrapyd 6 Vagrantfile
scrapy.cfg settings.py 1 15 667 /
250ms
Twisted scrapyds URL FTP Spark Items
scrapyd 2.5 scrapyd poll_interval
scrapyd
500000
URL
Scrapy Scrapy
Scrapy
Scrapy
Scrapy