Crawl data with Scrapy [Public] [Draft]
Scrapy Architecture (source: scrapy.org)
License
The document is licensed under a Creative Commons Share-alike 4.0 license.
Contributors
No. Name Email
1 Cuong Tran tranhuucuong91@gmail.com
2 Nguyễn Quang Dương duongnq094@gmail.com
3 Nguyễn Đình Khải khainguyenptiter@gmail.com
4 Nguyễn Bá Cường cuongnb14@gmail.com
5 Phan Công Huân nauh94@gmail.com
Table of contents
License
Contributors
1. Scrapy Architecture
1.1 Components
1.2 Data flow
2. Tutorial with Scrapy
Install Scrapy
Main steps in the tutorial
1. Defining our Item
2. Our first Spider
3. Crawling
4. Extracting Items
5. Storing the scraped data
3. Problems to solve with Scrapy
Python Conventions
XPath
Docker
4. Store into database
4.1 MongoDB: a NoSQL database
MongoDB to MySQL
Store into mongo database
Export/Import MongoDB
4.2 MySQL database
5. Pipeline
Duplicates filter
Write items to a JSON file
Price validation and dropping items with no prices
Clean whitespace and HTML
6. Extractor, Spider
Crawling all pages
Passing additional data to callback functions
XPath pattern
XPath Tips from the Web Scraping Trenches
Configuring ItemLoader defaults: extract first and strip
Extractor libraries
7. Downloader
Configuring and using a proxy
Working with http proxy
Scrapy Download images
8. Scrapy setting
ITEM_PIPELINES
DownloaderStats
Scrapy handle AJAX Website
Scrapy handle AJAX Website with Splash
Scrapy debug
Scrapy caching
Scrapy revisit for update
Continue download: Jobs: pausing and resuming crawls
Monitoring scrapy, status, log
Handling multiple spiders in one project
Scrapy-fake-useragent
Crawling websites that require login
Filling Login Forms Automatically
Scrapy - how to manage cookies/sessions
Multiple cookie sessions per spider
How to send cookie with scrapy CrawlSpider requests?
By-pass anti-crawler
Real-world experience
Alibaba redirect, login required
Handling product lists
Getting the list of companies
The next_page issue
Deploying a Scrapy project with ScrapingHub
Scheduling spider runs
Django-dynamic-scraper (DDS)
Requirements
Documents
SETUP
1. Install docker, compose
2. Run docker django-dynamic-scraper
3. Defining the object to be scraped
4. Run crawl data
5. Run schedule crawl
References
1. Scrapy Architecture
http://doc.scrapy.org/en/latest/topics/architecture.html
Figure 1: Scrapy Architecture
1.1 Components

● Scheduler: schedules the order in which URLs are downloaded.
● Downloader: downloads the data, handles download errors, and filters duplicates.
● Spiders: parse the downloaded data into items and new requests.
● Item Pipeline: processes the extracted data and saves it to the database.
● Scrapy Engine: coordinates the components above.
1.2 Data flow

Step 1: The starting URL (start_url) is turned into a Request and stored in the Scheduler.
Steps 2 - 3: The Scheduler hands Requests, one at a time, to the Downloader.
Steps 4 - 5: The Downloader fetches the data from the internet and sends the resulting
Responses to the Spiders.
Steps 6 - 7: The Spiders:
● Extract data into Items and send them to the Item Pipeline.
● Extract URLs and create new Requests, which are sent to the Scheduler.
Step 8: The Item Pipeline processes the extracted data. The simplest case is saving the
data to a database.
Step 9: Does the Scheduler still hold any Requests?
● Yes: go back to Step 2.
● No: finish.
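The loop in the steps above can be pictured with a toy model. The sketch below is not Scrapy's engine: it replaces the network with a hypothetical in-memory dictionary (FAKE_WEB) and uses made-up function names, only to show how the scheduler, downloader, spider and pipeline hand work to each other until the queue is empty.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", []),
}


def download(url):
    """Downloader stand-in: return the 'response' for a URL."""
    return FAKE_WEB[url]


def spider_parse(url, response):
    """Spider stand-in: return one item and the URLs to follow."""
    text, links = response
    return {"url": url, "text": text}, links


def crawl(start_url):
    scheduler = deque([start_url])   # Step 1: seed the scheduler
    seen = {start_url}               # duplicate filter
    stored = []                      # what the item pipeline saved
    while scheduler:                 # Step 9: any requests left?
        url = scheduler.popleft()    # Steps 2-3: next request
        response = download(url)     # Steps 4-5: fetch it
        item, links = spider_parse(url, response)  # Steps 6-7: parse
        stored.append(item)          # Step 8: pipeline stores the item
        for link in links:
            if link not in seen:     # only schedule URLs not seen before
                seen.add(link)
                scheduler.append(link)
    return stored


items = crawl("http://example.com/")
print([item["url"] for item in items])
```

Real Scrapy runs these stages asynchronously rather than one request at a time, so the order of items is not guaranteed there.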
2. Tutorial with Scrapy

References:
1. Scrapy Tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
2. Scraping and crawling the web with Scrapy and SQLAlchemy:
https://viblo.asia/naa/posts/6BkGyxOLM5aV
3. Advanced web scraping and crawling techniques with Scrapy and SQLAlchemy:
https://viblo.asia/naa/posts/6BkGyxzeM5aV
4. Github: https://github.com/tranhuucuong91/scrapy-tutorial
Install Scrapy
# install virtualenv
sudo pip install virtualenv
virtualenv venv -p python3

source venv/bin/activate
# install scrapy dependencies

sudo apt-get install -y gcc g++
sudo apt-get install -y python3-dev
sudo apt-get install -y libssl-dev libxml2-dev libxslt1-dev libffi-dev
# install mysql dependencies

sudo apt-get install -y libmysqlclient-dev
# install python libs: scrapy, mysql

pip install -r requirements.txt
pip install twisted w3lib lxml cssselect pydispatch
# install scrapy system-wide (Python 2)
sudo apt-get install -y libssl-dev libxml2-dev libxslt1-dev
sudo apt-get install -y python-dev
sudo pip2 install scrapy pyOpenSSL
# install scrapy system-wide (Python 3)
sudo apt-get install -y libssl-dev libxml2-dev libxslt1-dev
sudo apt-get install -y python3-dev
sudo pip3 install scrapy
Main steps in the tutorial:
Create a Scrapy project:
scrapy startproject tutorial
1. Define the Items to extract
2. Write a spider to crawl a site and extract Items
3. Write an Item Pipeline to store the extracted Items
Source files:

├── __init__.py
├── items.py : defines the structure of the data to extract.
├── pipelines.py : defines the functions that insert the data into the database.
├── settings.py : configuration settings.
└── spiders
    ├── __init__.py
    └── vietnamnet_vn.py : defines the functions that extract the data
1. Defining our Item

Items are containers that are loaded with the scraped data. They work like a Python dict,
with some additional features.
Create a class in the file tutorial/items.py.
2. Our first Spider

A Spider is a class that we define and that Scrapy uses to scrape information from a
domain (or a group of domains).
We define an initial list of URLs to download, how to follow links, and how to parse the
contents of the pages to extract items.
To create a spider, we subclass scrapy.Spider and define some attributes:
● name: identifies the spider; it must be unique.
● start_urls: a list of URLs where the spider begins crawling. The first pages downloaded
are the ones listed here; the remaining URLs are generated from the data retrieved.
● parse(): a method that is called with the downloaded Response object of each start
URL. The response is passed to the method as its first and only argument. This method
is responsible for parsing the response data and extracting scraped data (as scraped
items) and more URLs to follow (as Request objects).
Create a spider in the tutorial/spiders directory.
3. Crawling
Go to the project's root directory and run:
scrapy crawl dmoz   # dmoz is the spider's name (its name attribute)
=> What happens:
- Scrapy creates a scrapy.Request for each URL in the spider's start_urls list and assigns
the parse method as their callback function.
- The Requests are scheduled, then executed, and return scrapy.http.Response objects, which
are fed back to the spider through the parse() method.
4. Extracting Items
Introduction to Selectors
- Scrapy uses a mechanism based on XPath or CSS expressions called Scrapy Selectors.
Note: XPath is more powerful than CSS.
- Scrapy provides the Selector class plus some conventions and shortcuts for working with
XPath and CSS expressions.
- A Selector object represents nodes in a structured document. So the first selector
created is attached to the root node, that is, the entire document.
- Selectors have 4 basic methods:
1. xpath(): returns a list of selectors, each representing a node selected by the
XPath expression given as argument.
2. css(): returns a list of selectors, each representing a node selected by the
CSS expression given as argument.
3. extract(): returns a list of unicode strings with the selected data -> use
extract_first() to get only the first element.
4. re(): returns a list of unicode strings extracted by applying the regular
expression given as argument.
Note: the response object has a selector attribute, which is an instance of the Selector
class. We can query it with: response.selector.xpath() or response.selector.css()
or use the shortcuts: response.xpath() or response.css()
Using our item

Item objects are custom Python dicts; their fields can be accessed like this:
item = DmozItem()  # DmozItem is the name of the class that defines the item
item['title'] = 'Example title'
Use the item in the parse() method (yield the Item object).

What yield is: http://phocode.com/python/python-iterator-va-generator/
Following links
url = response.urljoin(href.extract())
yield scrapy.Request(url, callback=self.parse_dir_contents)
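To see what the URL-joining step does, here is a small sketch using urllib.parse.urljoin from the standard library, which response.urljoin wraps with the response's own URL as the base. The page URL and hrefs below are made-up examples, not taken from the tutorial site.

```python
from urllib.parse import urljoin

# Assumed URL of the page whose links we are following.
page_url = "http://example.com/catalog/books/index.html"

# An href starting with "/" is resolved against the site root.
absolute = urljoin(page_url, "/about.html")

# A bare file name is resolved against the page's directory.
sibling = urljoin(page_url, "page2.html")

# ".." climbs one directory before descending again.
parent = urljoin(page_url, "../music/")

print(absolute)
print(sibling)
print(parent)
```

Joining before building the Request matters because hrefs extracted from a page are often relative, and a Request needs an absolute URL.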
5. Storing the scraped data
Run:
scrapy crawl dmoz -o items.json
3. Problems to solve with Scrapy

Scrapy TODO:
Getting data
[x] How to extract data?
[ ] A library of extraction patterns and extraction examples.
[ ] Walking through pages to collect data
[x] Store data into database
Speed and performance
[x] Using proxy with scrapy.
[ ] Cache
[x] Speed up (multithreading). -> Scrapy does not support multithreading, but it does
support an asynchronous mechanism.
[ ] Scrapy download continue.
[ ] Re-visit for update.
[ ] Monitoring scrapy, status, log.
[ ] Using scrapy with Docker.
Scrapy for dev
[ ] Limit total request (for testing)
[ ] Scrapy debug?
Multi-level crawling.
Duplicate filtering.
Multiple spiders.
Solutions to find for the following problems:

1. Scrapy:
- re-extractor (for example, using caching)
- download continue
- re-visit for update.
- Handling caching. Experiment.
2. Monitoring scrapy, status, log: find a way to get information about the crawler's state,
like Scrapy stats but in real time. The goal is to know how the crawler is doing.
3. Using scrapy with Docker: package Scrapy into Docker and run it inside Docker.
Python Conventions
https://www.python.org/dev/peps/pep-0008/
http://docs.python-guide.org/en/latest/writing/style/
XPath
Tool for trying out XPath expressions:
https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?utm_source=chrome-app-launcher-info-dialog
XPath documentation:
https://drive.google.com/open?id=0ByyO0Po-LQ5aVnlobzNBOHhjWW8
Docker
Install docker and docker-compose
# install docker
wget -qO- https://get.docker.com/ | sh
sudo usermod -a -G docker `whoami`
# install docker-compose
sudo wget https://github.com/docker/compose/releases/download/1.9.0/docker-compose-`uname -s`-`uname -m` -O /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
Reference:
https://github.com/tranhuucuong91/docker-training
Note: when crawling large amounts of data or strictly moderated sites, use a proxy to avoid
having your IP banned.
See the section on configuring and using a proxy.
4. Store into database
4.1 MongoDB: a NoSQL database

Pull the MongoDB docker image and run it.
mongodb
https://hub.docker.com/_/mongo/
Create a docker-compose.yml file with the following content:
version: "2"
services:
  mongodb:
    image: mongo:3.2
    ports:
      - "27017:27017"
    volumes:
      - ./mongodb-data/:/data/db
    # hostname: mongodb
    # domainname: coclab.lan
    cpu_shares: 512 # 0.5 CPU
    mem_limit: 536870912 # 512 MB RAM
    # privileged: true
    # restart: always
    # stdin_open: true
    # tty: true
MongoDB to MySQL
MongoDB terms and their MySQL equivalents:
Database == Database
Collection == Table
Document == Row
Querying MongoDB: https://docs.mongodb.org/getting-started/python/query/
Further reading: Why MongoDB Is a Bad Choice for Storing Our Scraped Data
https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
-> covers the problems encountered when using MongoDB:
1. Locking
2. Poor space efficiency
3. Too Many Databases
4. Ordered data
5. Skip + Limit Queries are slow
6. Restrictions
7. Impossible to keep the working set in memory
8. Data that should be good, ends up bad!
Store into mongo database

http://doc.scrapy.org/en/latest/topics/item-pipeline.html
Template pipeline:
import pymongo


class MongoPipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from the project settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() replaces insert(), which was removed in PyMongo 4
        self.db[self.collection_name].insert_one(dict(item))
        return item
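The pipeline reads MONGO_URI and MONGO_DATABASE from the project settings, and it only runs once it is listed in ITEM_PIPELINES. A minimal settings.py sketch follows; the module path tutorial.pipelines, the database name crawler and the localhost address are assumptions to adjust to your own project and MongoDB instance.

```python
# settings.py (fragment)

# Assumed module path: replace "tutorial" with your own project name.
ITEM_PIPELINES = {
    "tutorial.pipelines.MongoPipeline": 300,
}

# Assumed connection details for a MongoDB instance published on the
# default port of the local machine.
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "crawler"
```

The number 300 is only an ordering value: pipelines with lower numbers run first.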
Export/Import MongoDB
Export from server:

mongodump --archive=crawler.`date +%Y-%m-%d"_"%H-%M-%S`.gz --gzip --db crawler
Import to mongodb:
# copy gzip file from local to container
docker cp /path/to/file container_id:/root
# restore
mongorestore --gzip --archive=/root/crawler.2016-04-18_07-40-11.gz --db crawler
4.2 MySQL database

Which character set and collation are appropriate?
utf8mb4_unicode_ci vs utf8mb4_general_ci
- utf8mb4_unicode_ci: sorts accurately, slower.
- utf8mb4_general_ci: sorts less accurately, faster.
-> choose: utf8mb4_unicode_ci
CREATE DATABASE crawler CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Install mysqlclient
sudo apt-get install -y libmysqlclient-dev
sudo pip2 install mysqlclient
sudo pip2 install sqlalchemy
MySQL commands:
# Login
mysql -u username -p
# Create new database
> CREATE DATABASE name;
# Import:
> use name
> source import.sql
The pipeline controls the storing process. The models handle creating the database tables.
__init__: connects to the database and creates the tables; it also sets up the object
that saves data into the tables (sessionmaker).
process_item: takes item and spider as parameters. The data crawled by the spider is put
into the item, which the session then stores in the database.
TODO: write a demo file that saves data into MySQL.
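As a rough illustration of the connect-then-store split described above, here is a sketch that uses the standard-library sqlite3 module as a stand-in for MySQL and SQLAlchemy, so that it runs without a database server. The class name, the articles table and its columns are hypothetical; a real MySQL pipeline would swap the connection for a SQLAlchemy engine and session.

```python
import sqlite3


class SQLitePipeline:
    """Stand-in for a MySQL pipeline: same shape, different driver."""

    def __init__(self, path=":memory:"):
        # Connect and create the table once, when the pipeline is built.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider):
        # Called once per scraped item; placeholders keep the SQL safe.
        self.conn.execute(
            "INSERT OR REPLACE INTO articles (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()


# Exercise the pipeline by hand with a plain dict standing in for an Item.
pipeline = SQLitePipeline()
pipeline.process_item({"url": "http://example.com/1", "title": "Hello"}, spider=None)
rows = pipeline.conn.execute("SELECT url, title FROM articles").fetchall()
print(rows)
```

Using the URL as primary key means a page crawled twice overwrites its earlier row instead of creating a duplicate.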
5. Pipeline
Duplicates filter
from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        # Drop any item whose id has already been seen in this run
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
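Write items to a JSON file

import json


class JsonWriterPipeline(object):
    def __init__(self):
        # Text mode: json.dumps() returns str, which a 'wb' file rejects on Python 3
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        # One JSON object per line (JSON Lines)
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item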
Price validation and dropping items with no prices

from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            # Add VAT only when the scraped price excludes it
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
Clean whitespace and HTML

import re


# cleans whitespace & HTML
class CleanerPipeline(object):
    def process_item(self, item, spider):
        # general tidying up
        for (name, val) in item.items():
            #utils.devlog("Working on %s [%s]" % (name, val))
            if val is None:
                item[name] = ""
                continue
            item[name] = re.sub(r'\s+', ' ', val).strip()  # remove whitespace
            #item['blurb'] = re.sub('<[^<]+?>', '', item['blurb']).strip()  # remove HTML tags
        # spider specific
        if spider.name == "techmeme":
            item['blurb'] = item['blurb'].replace(' — ', '').strip()
        return item
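The two regular expressions used in this pipeline can be tried on their own. The sketch below wraps them in two helper functions whose names are made up for illustration; the tag-stripping pattern is the naive one from the commented-out line and is not a full HTML parser.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def strip_tags(html):
    """Naive tag removal; adequate for simple markup only."""
    return re.sub(r"<[^<]+?>", "", html).strip()


raw = "  <p>Hello,\n\t <b>Scrapy</b>   world</p> "
cleaned = collapse_whitespace(strip_tags(raw))
print(cleaned)
```

Stripping tags before collapsing whitespace avoids leaving double spaces where a tag used to sit between two words.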
6. Extractor, Spider
Crawling all pages
Be clear about the level of each page:
● start_urls (level 1)
● parse follows links to the level-2 pages -> these are the pages whose child pages must
all be visited
● parse_next_page: crawls the data
def parse_articles_follow_next_page(self, response):
    for article in response.xpath("//article"):
        item = ArticleItem()
        # ... extract article data here
        yield item

    next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
    if next_page:
        url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(url, self.parse_articles_follow_next_page)
Note: to test what a selector returns, use the Scrapy shell.
Syntax: scrapy shell [url]
https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell
Passing additional data to callback functions

http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
In some cases you may be interested in passing arguments to those callback functions so
you can receive the arguments later, in the second callback.
You can use the Request.meta attribute for that.
Scenario: walking the categories -> ... -> company website -> extracting the company data.
Problem: we want to keep the category the company belongs to.
-> Solution: pass the extra information to the callback functions.
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
See the hands-on exercise below.
XPath pattern
//p[contains(string(),'Address:')]
Field             XPath
vietnamese_title  //p[contains(string(),'Vietnamese Title:')]
english_title     //p[contains(string(),'English Title:')]
address           //p[contains(string(),'Address:') or contains(string(),'Địa chỉ:') or contains(string(),'Trụ sở chính:')]
province          //p[contains(string(),'Province:')]
director          //p[contains(string(),'Director:')]
tel               //p[contains(string(),'Tel:') or contains(string(),'Điện thoại:')]
fax               //p[contains(string(),'Fax:')]
email             //p[contains(string(),'Email:')]
main_business     //p[contains(string(),'Main Business:')]
business          //p[(contains(string(),'Business:') and not(contains(string(),'Main Business'))) or contains(string(),'Ngành nghề kinh doanh:')]
website           //p[contains(string(),'Website:')]
company_title     //p[contains(string(),'Vietnamese Title:')]
                  fn:tokenize(//p[contains(string(),'Vietnamese Title:')], ':')[0]
XPath Tips from the Web Scraping Trenches

In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to write
specifications of document locations more flexibly than CSS selectors. In case you're looking for a
tutorial, here is a XPath tutorial with nice examples.
Avoid using contains(.//text(), 'search text') in your XPath conditions. Use contains(.,
'search text') instead. The reason is that .//text() yields a node-set, and when a node-set
is converted to a string only its first node is used.
GOOD:
>>> xp("//a[contains(., 'Next Page')]")
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']
BAD:
>>> xp("//a[contains(.//text(), 'Next Page')]")
[]
GOOD:
>>> xp("substring-after(//a, 'Next ')")
[u'Page']
BAD:
>>> xp("substring-after(//a//text(), 'Next ')")
[u'']
Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.
(//node)[1] selects all the nodes in the document, and then gets only the first of them.
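The difference is easy to check on a small document. The sketch below uses the standard-library xml.etree.ElementTree, whose limited XPath subset supports the positional predicate but has no grouping parentheses, so a Python index on the full match list plays the role of (//li)[1]. The sample markup is made up.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<div>"
    "<ul><li>a</li><li>b</li></ul>"
    "<ul><li>c</li><li>d</li></ul>"
    "</div>"
)

# Counterpart of //li[1]: the first <li> under each parent.
first_per_parent = [li.text for li in doc.findall(".//li[1]")]

# Counterpart of (//li)[1]: collect every <li>, then keep only the first.
first_overall = doc.findall(".//li")[0].text

print(first_per_parent)
print(first_overall)
```

With two lists in the document, the first form returns one element per list, while the second returns a single element for the whole document.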
When selecting by class, be as specific as necessary

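If you want to select elements by a CSS class, the XPath way to do that is the rather verbose:
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
Many times you can just use a CSS selector instead, and even combine the two if needed: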
>>> sel.css(".content").extract()
[u'<p class="content text-wrap">Some content</p>']
>>> sel.css('.content').xpath('@class').extract()
[u'content text-wrap']
Learn to use all the different axes

It is handy to know how to use the axes; you can follow through the examples given in the
tutorial to quickly review this.
Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents:
//*[not(self::script or self::style)]/text()[normalize-space(.)]
This excludes the content from script and style tags and also skips whitespace-only text nodes.
Source: http://stackoverflow.com/a/19350897/2572383
Configuring ItemLoader defaults: extract first and strip
http://stackoverflow.com/questions/17000640/scrapy-why-extracted-strings-are-in-this-format
There's a nice solution to this using Item Loaders. Item Loaders are objects that get data
from responses, process the data and build Items for you. Here's an example of an Item
Loader that will strip the strings and return the first value that matches the XPath, if any:
# Older Scrapy releases exposed these as scrapy.contrib.loader.XPathItemLoader
# and scrapy.contrib.loader.processor
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst


class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    default_input_processor = MapCompose(lambda string: string.strip())
    default_output_processor = TakeFirst()
And you use it like this:
def parse(self, response):
    loader = MyItemLoader(response=response)
    loader.add_xpath('desc', 'a/text()')
    return loader.load_item()
Extractor libraries
http://jeroenjanssens.com/2013/08/31/extracting-text-from-html-with-reporter.html
3 HTML text extractors in Python

1. python-readability
https://github.com/buriy/python-readability
2. python-boilerpipe
https://github.com/misja/python-boilerpipe
Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages
3. python-goose
https://github.com/grangier/python-goose
Html Content / Article Extractor, web scraping lib in Python
pip2 install goose-extractor
https://github.com/codelucas/newspaper
pip3 install newspaper3k
Examples:
https://blog.openshift.com/day-16-goose-extractor-an-article-extractor-that-just-works/
https://github.com/shekhargulati/day16-goose-extractor-demo
http://vietnamnet.vn/vn/thoi-su/du-bao-thoi-tiet-hom-nay-6-12-ha-noi-ret-14-do-mien-trung-mua-cuc-lon-344734.html
7. Downloader
Configuring and using a proxy
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html
https://rohitnarurkar.wordpress.com/2013/10/29/scrapy-working-with-a-proxy-network/
1. Go into your project directory
2. Create a file middlewares.py and add the following code:
# base64 is needed ONLY if the proxy we are going to use requires authentication
import base64


# Start your middleware class
class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY:PORT"
        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # setup basic authentication for the proxy
        # (b64encode replaces encodestring, which was removed in Python 3.9)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
3. Add the following lines in your settings.py script:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'sample.middlewares.ProxyMiddleware': 100,
}
http://wayback.archive.org/web/20150828053704/http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/
Working with http proxy
When crawling information from some websites, such as google shop, the site detects the source IP
and restricts some services to specific IP addresses. Scrapy can handle this situation
by making the request through a proxy.
Scrapy provides HttpProxyMiddleware to support HTTP proxies. If you want your
web crawler to go through a proxy, the first thing you need to do is modify your settings file.
Convert socket proxy to http proxy
http://stackoverflow.com/questions/19446536/proxy-ip-for-scrapy-framework
Until now I have been using a middleware in Scrapy to manually rotate IPs from the free proxy
lists available on various websites.
Now I am confused about which option I should choose:

1. Buy premium proxy list from http://www.ninjasproxy.com/ or http://hidemyass.com/
2. Use TOR
3. Use VPN Service like http://www.hotspotshield.com/
4. Any Option better than above three
Free proxy list

http://proxylist.hidemyass.com/
Proxy checker: Python script
Pick a random proxy from a list:
https://pypi.python.org/pypi/proxylist
from proxylist import ProxyList
pl = ProxyList()
pl.load_file('/web/proxy.txt')
pl.random()
# <proxylist.base.Proxy object at 0x7f1882d599e8>
pl.random().address()
# '1.1.1.1:8085'
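The manual rotation mentioned above can be sketched without any extra library. The class below has the shape of a Scrapy downloader middleware but is plain Python: the proxy addresses are hypothetical, random.choice replaces the proxylist package, and a SimpleNamespace object stands in for scrapy.Request so the method can be exercised outside a crawl.

```python
import random
from types import SimpleNamespace

# Hypothetical proxy addresses; replace with proxies you are allowed to use.
PROXIES = [
    "http://10.0.0.1:8085",
    "http://10.0.0.2:8085",
    "http://10.0.0.3:8085",
]


class RandomProxyMiddleware:
    """Downloader-middleware-shaped class that picks a proxy per request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware honours request.meta['proxy'].
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # None tells Scrapy to keep processing the request


# Stand-in for scrapy.Request, just enough to call the method by hand.
fake_request = SimpleNamespace(url="http://example.com/", meta={})
middleware = RandomProxyMiddleware(PROXIES)
result = middleware.process_request(fake_request, spider=None)
print(fake_request.meta["proxy"])
```

In a real project the class would be registered in DOWNLOADER_MIDDLEWARES with a lower order number than HttpProxyMiddleware, as in the settings shown earlier in this section.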
Scrapy Download images
http://doc.scrapy.org/en/0.24/topics/images.html
8. Scrapy setting
http://doc.scrapy.org/en/latest/topics/settings.html
ITEM_PIPELINES
Default: {}
A dict containing the item pipelines to use, and their orders. The dict is empty by
default. Order values are arbitrary, but it's customary to define them in the 0-1000
range.
ITEM_PIPELINES = {
    'mybot.pipelines.validate.ValidateMyItem': 300,
    'mybot.pipelines.validate.StoreMyItem': 800,
}
DownloaderStats
Middleware that stores statistics on all the requests, responses and exceptions that
pass through it.
To disable it, map it to None in DOWNLOADER_MIDDLEWARES:
'scrapy.downloadermiddlewares.stats.DownloaderStats': None,
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.stats
Scrapy handle AJAX Website

http://stackoverflow.com/questions/24652170/scrapy-how-to-catch-the-unexpected-case-of-return-a-response-with-partial-html
During my crawling, some pages return a response with a partial HTML body and status 200.
When I compare the response body with the one I open in the browser, the former is missing
something. How can I catch this unexpected partial response body case in the spider or in
the download middleware?
Below is a log example:
2014-01-23 16:31:53+0100 [filmweb_multi] DEBUG: Crawled (408)
<http://www.filmweb.pl/film/Labirynt-2013-507169/photos>
(referer:http://www.filmweb.pl/film/Labirynt-2013-507169) ['partial']
Answer:
It's not partial content as such. The rest of the content is dynamically loaded by a
JavaScript AJAX call.
To debug what content is being sent as response for a particular request, use Scrapy's
open_in_browser() function.
There's another thread on How to extract dynamic content from websites that are using
AJAX?. Refer to it for a workaround.
To handle websites that use AJAX, one option is to use Selenium to fetch the page content,
the same way a normal browser does: the browser renders the data.
With this approach the crawler is much slower.
Selenium with the Firefox webdriver requires a graphical interface. To run on a server
without a graphical interface, use PhantomJS.
Headless with PhantomJS: using PhantomJS for automation testing and crawling
Selenium with headless phantomjs webdriver
https://realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/
1. Setup
2. Example
3. Benchmarking
https://dzone.com/articles/python-testing-phantomjs
scrapy with AJAX

http://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax
scrape hidden web data with python

http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/
Using PhantomJS as the downloader

https://github.com/flisky/scrapy-phantomjs-downloader/blob/master/scrapy_phantomjs/downloader/handler.py
https://github.com/flisky/scrapy-phantomjs-downloader
Element not found in the cache - perhaps the page has changed since it was looked up
-> the element must be looked up again after the page reloads
Web elements you stored before clicking on the login button will not be present in the cache
after login because of the page refresh or page changes. You need to store these web
elements again in order to make them available in the cache. I have modified your code a bit
which might help:
selenium: select dropdown loop

http://dnxnk.moit.gov.vn/
-> Solution:

- Save the index of the option already selected.
- On each selection, look up the select element again.
element = driver.find_element_by_name('Years')
all_options = element.find_elements_by_tag_name('option')
for index in range(len(all_options)):
    # Look the element up again on every iteration, since the page changes after a click
    element = driver.find_element_by_name('Years')
    all_options = element.find_elements_by_tag_name('option')
    option = all_options[index]
    print('Select: year is {}'.format(option.get_attribute('value')))
    option.click()
Message: 'phantomjs' executable needs to be in PATH.

-> Add the phantomjs path to PATH (edit ~/.bashrc or ~/.zshrc)
Scrapy handle AJAX Website with Splash

https://github.com/scrapy-plugins/scrapy-splash
Using Splash with a proxy:

https://github.com/tranhuucuong91/docker-training/blob/master/compose/splash/docker-compose.yml
In code, to use a proxy for Splash requests:

SplashRequest(url, self.parse_data, args={'wait': 0.5, 'proxy': 'splash_proxy'})
Reference: Crawling dynamic pages: Splash + Scrapyjs => S2
http://www.thecodeknight.com/post_categories/search/posts/scrapy_python
Scrapy debug
http://doc.scrapy.org/en/latest/topics/debug.html#open-in-browser
How to use pycharm to debug scrapy projects

http://unknownerror.org/opensource/scrapy/scrapy/q/stackoverflow/21788939/how-to-use-pycharm-to-debug-scrapy-projects
Scrapy caching
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-storage-fs
HTTPCACHE_ENABLED = True
HTTPCACHE_GZIP = True
HTTPCACHE_EXPIRATION_SECS = 30 * 24 * 60 * 60
Scrapy revisit for update:
http://stackoverflow.com/questions/23950184/avoid-scrapy-revisit-on-a-different-run
pipelines.py
from mybot.utils import connect_url_database


class DedupPipeline(object):
    def __init__(self):
        self.db = connect_url_database()

    def process_item(self, item, spider):
        # Record the URL so that later runs can skip it
        url = item['url']
        self.db.insert(url)
        return item
middlewares.py
import logging

from scrapy.exceptions import IgnoreRequest
from mybot.utils import connect_url_database

logger = logging.getLogger(__name__)


class DedupMiddleware(object):
    def __init__(self):
        self.db = connect_url_database()

    def process_request(self, request, spider):
        url = request.url
        if self.db.has(url):
            logger.debug('ignore duplicated url: <%s>', url)
            raise IgnoreRequest()
settings.py
ITEM_PIPELINES = {
    'mybot.pipelines.DedupPipeline': 0
}
DOWNLOADER_MIDDLEWARES = {
    'mybot.middlewares.DedupMiddleware': 0
}
Continue download: Jobs: pausing and resuming crawls

http://scrapy.readthedocs.io/en/latest/topics/jobs.html
Sometimes, for big sites, it's desirable to pause crawls and be able to resume them later.
How to use it
To start a spider with persistence support enabled, run it like this:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a
signal), and resume it later by issuing the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Monitoring scrapy, status, log:

https://github.com/scrapinghub/scrapyrt/tree/master/scrapyrt
Handling multiple spiders in one project:

http://stackoverflow.com/questions/8372703/how-can-i-use-different-pipelines-for-different-spiders-in-a-single-scrapy-proje
I can think of at least four approaches:
1. Use a different scrapy project per set of spiders+pipelines (might be appropriate if
your spiders are different enough to warrant being in different projects)
2. On the scrapy tool command line, change the pipeline setting with scrapy settings in
between each invocation of your spider
3. Isolate your spiders into their own scrapy tool commands, and define the
default_settings['ITEM_PIPELINES'] on your command class to the pipeline list you
want for that command. See line 6 of this example.
4. In the pipeline classes themselves, have process_item() check what spider it's
running against, and do nothing if it should be ignored for that spider. See the
example using resources per spider to get you started. (This seems like an ugly
solution because it tightly couples spiders and item pipelines. You probably shouldn't
use this one.)
class CustomPipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'spider1':
            # do something
            return item
        return item
Scrapy-fake-useragent
Random User-Agent middleware based on fake-useragent. It picks up User-Agent strings
based on usage statistics from a real world database.
Configuration
Turn off the built-in UserAgentMiddleware and add RandomUserAgentMiddleware.
In Scrapy >=1.0:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
6 examples of using scrapy http request

http://www.programcreek.com/python/example/71420/scrapy.http.Request
Scrapy web tutorial

http://hopefulramble.blogspot.com/2014/08/web-scraping-with-scrapy-first-steps_30.html
Crawling websites that require login

http://blog.javachen.com/2014/06/08/using-scrapy-to-cralw-zhihu.html
Filling Login Forms Automatically

https://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/
We often have to write spiders that need to login to sites, in order to scrape data from them.
Our customers provide us with the site, username and password, and we do the rest.
The classic way to approach this problem is:
1. launch a browser, go to site and search for the login page
2. inspect the source code of the page to find out:
1. which one is the login form (a page can have many forms, but usually one of
them is the login form)
2. which are the field names used for username and password (these could vary
a lot)
3. if there are other fields that must be submitted (like an authentication token)
3. write the Scrapy spider to replicate the form submission using FormRequest (here is
an example)
Being fans of automation, we figured we could write some code to automate point 2 (which is
actually the most time-consuming) and the result is loginform, a library to automatically fill
login forms given the login page, username and password.
The original post (linked above) includes the code of a simple spider that uses loginform to
log in to sites automatically.
In addition to being open source, loginform code is very simple and easy to hack (check the
README on Github for more details). It also contains a collection of HTML samples to keep
the library well-tested, and a convenient tool to manage them. Even with the simple code so
far, we have seen accuracy rates of 95% in our tests. We encourage everyone with similar
needs to give it a try, provide feedback and contribute patches.
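To make the form-inspection step (point 2 in the list above) concrete, here is a minimal sketch using only the Python standard library. It is not the loginform implementation: it simply assumes that the login form is the first form containing a password input. The names `LoginFormFinder` and `guess_login_form` are hypothetical.

```python
from html.parser import HTMLParser


class LoginFormFinder(HTMLParser):
    """Collect the <input> fields of every <form> on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per form: {"action": str, "fields": list}
        self._current = None   # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "", "fields": []}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            field_type = (attrs.get("type") or "text").lower()
            self._current["fields"].append(
                (attrs.get("name"), field_type, attrs.get("value") or "")
            )

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def guess_login_form(html):
    """Return the first form that has a password input, or None."""
    finder = LoginFormFinder()
    finder.feed(html)
    for form in finder.forms:
        if any(field_type == "password" for _, field_type, _ in form["fields"]):
            return form
    return None
```

The returned field list shows the username and password field names as well as any hidden fields (such as an authentication token) that must be submitted with the form.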
https://blog.scrapinghub.com/2016/05/11/monkeylearn-addon-retail-classifier-tutorial/
Scrapy - how to manage cookies/sessions

http://stackoverflow.com/questions/4981440/scrapy-how-to-manage-cookies-sessions
My script:
1. My spider has a start URL of searchpage_url.
2. The search page is requested by parse() and the search form response gets passed
to search_generator().
3. search_generator() then yields lots of search requests using FormRequest and the
search form response.
4. Each of those FormRequests, and subsequent child requests, needs to have its own
session, so it needs its own individual cookiejar and its own session cookie.
=> Solution
Multiple cookie sessions per spider
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
There is support for keeping multiple cookie sessions per spider by using the cookiejar
Request meta key. By default it uses a single cookie jar (session), but you can pass an
identifier to use different ones.
For example:
    for i, url in enumerate(urls):
        yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                             callback=self.parse_page)
Keep in mind that the cookiejar meta key is not "sticky". You need to keep passing it along
on subsequent requests. For example:
    def parse_page(self, response):
        # do some processing
        return scrapy.Request("http://www.example.com/otherpage",
                              meta={'cookiejar': response.meta['cookiejar']},
                              callback=self.parse_other_page)
How to send cookie with scrapy CrawlSpider requests?

http://stackoverflow.com/questions/32623285/how-to-send-cookie-with-scrapy-crawlspider-requests
    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/45.0.2454.85 Safari/537.36'}
        for i, url in enumerate(self.start_urls):
            yield Request(url, cookies={'over18': '1'}, callback=self.parse_item,
                          headers=headers)
Don't know what's wrong with CrawlSpider, but a plain Spider works anyway.
    # encoding: utf-8
    import scrapy

    class MySpider(scrapy.Spider):
        name = 'redditscraper'
        allowed_domains = ['reddit.com', 'imgur.com']
        start_urls = ['https://www.reddit.com/r/nsfw']

        def request(self, url, callback):
            """
            wrapper for scrapy.request
            """
            request = scrapy.Request(url=url, callback=callback)
            request.cookies['over18'] = 1
            request.headers['User-Agent'] = (
                'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, '
                'like Gecko) Chrome/45.0.2454.85 Safari/537.36')
            return request

        def start_requests(self):
            for i, url in enumerate(self.start_urls):
                yield self.request(url, self.parse_item)

        def parse_item(self, response):
            titleList = response.css('a.title')
            for title in titleList:
                item = {}
                item['url'] = title.xpath('@href').extract()
                item['title'] = title.xpath('text()').extract()
                yield item
            url = response.xpath('//a[@rel="nofollow next"]/@href').extract_first()
            if url:
                yield self.request(url, self.parse_item)
            # you may consider scrapy.pipelines.images.ImagesPipeline :D
By-pass anti-crawler
- Some ways to tune the Scrapy settings:
  + Increase the delay time
  + Reduce the number of concurrent requests
  + Use a proxy
  + Other tricks, depending on the site (ajax, ...)
- https://learn.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
- http://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned
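As a rough illustration of the first two tuning points, a settings.py fragment could look like the sketch below. The setting names are standard Scrapy settings, but every value here is an assumption chosen for illustration and should be tuned per target site. Proxy configuration is covered in the proxy section of this document and is not repeated here.

```python
# settings.py fragment (illustrative values only, tune them per site)

# Base delay in seconds between requests to the same site. With
# RANDOMIZE_DOWNLOAD_DELAY on, Scrapy waits a random time between
# 0.5x and 1.5x of DOWNLOAD_DELAY.
DOWNLOAD_DELAY = 2.0
RANDOMIZE_DOWNLOAD_DELAY = True

# Fewer parallel requests, overall and per domain.
CONCURRENT_REQUESTS = 8
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Let the AutoThrottle extension adapt the delay to the server's latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```

Lower concurrency and longer delays make a crawl slower but less likely to trigger rate limits, so it is usually worth starting conservative and loosening the values gradually.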
Practical experience
Alibaba redirects and requires login
-> the spider can perform the login step.
    def parse(self, response):
        if response.url.startswith('https://login.alibaba.com'):
            self.logger.debug('Login: {}'.format(response.url))
            return self.login(response)
        else:
            return self.parse_product(response)

    def login(self, response):
        return [FormRequest.from_response(response,
            formdata={'loginId': 'csdk@gmail.com', 'password': 'BSvrvY'},
            callback=self.parse)]
Handling the product list
Example: http://www.alibaba.com/Agricultural-Growing-Media_pid144
    scrapy shell http://www.alibaba.com/Agricultural-Growing-Media_pid144
company: //div[@class="cbrand ellipsis"]

next_page: //a[@class="next"]/@href
=> Problem: the buttons can change when moving to the next page.

From page 2 onward, the "Next" button can no longer be obtained, so urljoin on its href stops working.
    response.xpath('//div[@class="cbrand ellipsis"]')
next_page is: http://www.alibaba.com/catalogs/products/CID144/2
Problem getting the list of companies:

Open the page: http://www.alibaba.com/catalogs/products/CID144/2
company: //div[@class="stitle util-ellipsis"]
Run the commands:
    scrapy shell http://www.alibaba.com/catalogs/products/CID144/2
    response.xpath('//div[@class="stitle util-ellipsis"]')
-> empty result. No data is returned.
-> Cause: the website uses JavaScript to render the data.

- In a browser, the XPath returns the data.
- In Scrapy, it does not yet.
Approach:

When the data is visible in the browser and can be extracted there with XPath, but Scrapy
cannot extract it
-> suspect that the browser renders the data differently from what Scrapy receives.
- The browser runs JavaScript and CSS and renders the data.
- Scrapy only gets the raw HTML.
-> fetch the raw HTML and analyze it.
Example:
    wget http://www.alibaba.com/catalogs/products/CID144/2
Reading the raw HTML shows:
    page.setPageData({"baseServer":"//www.alibaba.com","isForbiddenSell":false,
    "isForbidden":false,"clearAllHref":"//www.alibaba.com/catalogs/products/CID144",
    "quotationSupplierNum":250375,"allCategory":null,"searchbarFixed":
-> we need a way to process this data with Scrapy.
Example: get raw data

Extracting the JSON part gives the JSON data of the two pages:
http://pastebin.com/TBxYswGD
http://pastebin.com/ZFEMcuST
The following site can be used to read the JSON file; it looks nicer than the one I was using:
http://www.jsoneditoronline.org/
-> Approach 1:

1. Use a regular expression to get the JSON part.
2. Read the JSON data and pick out the desired fields.
Example:
    json_data = response.xpath('string(//body)').re(r'page\.setPageData\((.*?\})\);')[0]
    # continue processing the JSON
Or, approach 2:
1. Use a regular expression to extract exactly the desired data.
Example:
Pattern:
    "supplierHref":"http://dfl-vermiculite.en.alibaba.com/company_profile.html#top-nav-bar"
Code extract:
    response.xpath('string(//body)').re(r'"supplierHref":"([^#]+)')
In this problem we choose the simpler option, approach 2: use a regular expression to
extract exactly the data we want.
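Both approaches can be tried outside Scrapy with the standard library. The sketch below is only an illustration: RAW_HTML is a made-up sample shaped like the page.setPageData(...) call shown earlier, the "items" key and the second supplier URL are hypothetical, and the function names are not from the original project.

```python
import json
import re

# Hypothetical raw HTML, shaped like the page.setPageData(...) call above.
RAW_HTML = '''
<script>
page.setPageData({"baseServer":"//www.alibaba.com","items":[
{"supplierHref":"http://dfl-vermiculite.en.alibaba.com/company_profile.html#top-nav-bar"},
{"supplierHref":"http://example-supplier.en.alibaba.com/company_profile.html#top-nav-bar"}
]});
</script>
'''

# Approach 1: cut out the JSON argument of setPageData, then parse it.
PAGE_DATA_RE = re.compile(r'page\.setPageData\((\{.*?\})\);', re.DOTALL)


def extract_page_data(html):
    """Return the setPageData argument as a dict, or None if absent."""
    match = PAGE_DATA_RE.search(html)
    return json.loads(match.group(1)) if match else None


# Approach 2: pull the wanted field straight out of the text,
# stopping before the '#' fragment of each URL.
SUPPLIER_RE = re.compile(r'"supplierHref":"([^#"]+)')


def extract_supplier_urls(html):
    """Return every supplierHref value without its fragment."""
    return SUPPLIER_RE.findall(html)


data = extract_page_data(RAW_HTML)
print(data["baseServer"])
print(extract_supplier_urls(RAW_HTML))
```

Note that the non-greedy pattern in approach 1 stops at the first `});` it meets, so it would cut the JSON short if a string value happened to contain that sequence. Approach 2 avoids parsing the JSON at all, which is why it is the simpler choice here.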
Lessons:
1. When the data is visible in a browser but not in Scrapy
-> fetch the raw data and analyze it.
2. Pay attention to the difference between what the browser shows and the raw data. "What we
see is not what we think it is."
This is also one of the ways websites limit data crawling.
3. When XPath cannot be used, use regular expressions. Regular expressions are the most basic
tool and handle most extraction problems.
The next_page problem:
It can be solved by walking through the pages with an increasing index.
http://www.alibaba.com/catalogs/products/CID144/2
-> next_page: http://www.alibaba.com/catalogs/products/CID144/3
    origin_url = 'http://www.alibaba.com/catalogs/products/CID144/2'
    url_token = origin_url.split('/')
    next_page_url = '/'.join(url_token[:-1] + [str(int(url_token[-1]) + 1)])
    print(next_page_url)
When do the pages run out?
-> when the Next button has no <a> tag.
So the solution for next_page is:

Check: next_page = response.xpath('//a[@class="next"]/@href').extract_first('').strip()
=> following this href directly is incorrect, so another rule is needed:
- If next_page is non-empty: the next page URL is the current URL with its index increased by 1.
- Otherwise: stop.
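This rule can be written as a small helper. It is a sketch under two assumptions: the current URL ends with a numeric page index, as in .../CID144/2, and the caller has already checked whether the page has a Next link. The function name and the `has_next_link` parameter are hypothetical.

```python
def next_page_url(current_url, has_next_link):
    """Return the URL of the following page, or None when crawling should stop.

    has_next_link: whether the current page contains a Next <a> tag.
    """
    if not has_next_link:
        return None
    base, _, last = current_url.rstrip('/').rpartition('/')
    if not last.isdigit():
        # The URL does not end with a page index, so it cannot be incremented.
        return None
    return '{}/{}'.format(base, int(last) + 1)


print(next_page_url('http://www.alibaba.com/catalogs/products/CID144/2', True))
```

In a spider, `has_next_link` could be the truthiness of the next_page value extracted by the XPath check above.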
Crawling product information:

Additional information still needs to be collected:
- description
- cost, currency
- category, if available
Deploying a Scrapy project with ScrapingHub
Source: https://scrapinghub.com/
Scrapinghub is a cloud-based web crawling platform that supports deploying and scaling, so you
do not have to worry about servers, monitoring, backups or scheduling for your Scrapy project.
It offers many add-ons for extending your spiders, along with a smart rotating proxy that
helps avoid being blocked by websites and speeds up crawling.
Main features:

● Jobs dashboard: an interface for managing jobs with detailed statistics, which makes jobs
easy to manage and run.
● Item browser: displays the crawled data in a clean, structured view.
● Log inspector: inspects the logs produced during a run; errors that occur are shown
clearly.
● Data storage, usage reports and API: all crawled data is stored in ScrapingHub's
database and can be accessed through the API. There is also a scheduling system and
ready-made add-ons that support crawling, such as Monkeylearn, splash, crawlera,
BigML, DeltaFetch, Images, Monitoring, Portia ...
Install shub, the Scrapinghub command-line client (the shub command used below comes from
the shub package; the scrapinghub package is the Python API client library):
    $ pip install shub
Log in before deploying:
    $ shub login
Enter your API key to log in (the API key is available at: https://dash.scrapinghub.com/account/apikey)
After logging in, the credentials are saved to ~/.scrapinghub.yml
Deploying a project to scrapinghub:
Create a new project under Scrapy Cloud Projects on scrapinghub to hold your project.
Click the newly created project on scrapinghub and open Code & Deploys to get the API
identifier of that project.
    $ cd <your project scrapy>
    $ shub deploy
Optionally, install extra libraries for the run:
Edit the file scrapinghub.yml:
    projects:
      default: 123
    requirements_file: requirements.txt
Enter the project's API identifier; the project is then deployed to scrapinghub.
To run the project, click Run and choose the spider name. There are a few options when running,
such as priority, tags and arguments, depending on your needs.
You can only run one spider at a time; all subsequent runs are placed in
next jobs.
The data is exported to CSV and kept in scrapinghub's data storage for 1
week. To keep it longer, you need to upgrade to a paid plan.
Scheduling spider runs:
Scrapinghub supports scheduling spider runs, and the feature is very easy to use.
The free tier is enough to deploy and run a project, with the supporting features and tools,
on 1 GB of RAM and 1 concurrent crawl. If needed, you can upgrade to customize
and use the other add-ons.
Django-dynamic-scraper (DDS)
Django-dynamic-scraper builds on Scrapy and the Django framework, and uses the Django
admin interface to create Scrapy crawlers for many websites.
Dockerfile: https://github.com/khainguyen95/django-dynamic-scraper
Image: https://hub.docker.com/r/khainguyendinh/django-dynamic-scraper/
Requirements:
● Python 2.7+ or 3.4+
● Django 1.8/1.9
● Scrapy 1.1
● Scrapy-djangoitem 1.1
● Python JSONPath RW 1.4+
● Python future
● scrapyd
● django-celery
● django-dynamic-scraper
Documents
Tutorial DDS
Scrapyd-client
DjangoItem in scrapy
SETUP:
1. Install docker, compose

Install docker:
    $ wget -qO- https://get.docker.com/ | sh
    $ sudo usermod -a -G docker `whoami`
Install compose:
    $ sudo wget -q https://github.com/docker/compose/releases/download/1.6.2/docker-compose-`uname -s`-`uname -m` \
        -O /usr/local/bin/docker-compose
    $ sudo chmod +x /usr/local/bin/docker-compose

Tip: after that, log out and log back in so that the new docker group membership takes effect.
2. Run docker django-dynamic-scraper

Pull docker images:
    $ docker pull khainguyendinh/django-dynamic-scraper
3. Defining the object to be scraped

● create a utf8 database:
    CREATE DATABASE news CHARACTER SET utf8 COLLATE utf8_general_ci;
    $ cd djangoItem
● create the admin user:
    $ python manage.py createsuperuser
● run the django server:
    $ python manage.py runserver 0.0.0.0:8000
● open the django admin in a browser:
    http://localhost:8000/admin
● add New Scraped object classes
● add New Scrapers
● add News websites
4. Run the crawl

Run:
    $ scrapy crawl [--output=FILE --output-format=FORMAT] SPIDERNAME -a id=REF_OBJECT_ID
        [-a do_action=(yes|no) -a run_type=(TASK|SHELL) -a max_items_read={Int}
        -a max_items_save={Int} -a max_pages_read={Int}
        -a output_num_mp_response_bodies={Int} -a output_num_dp_response_bodies={Int}]
    $ scrapy crawl news -a id=1 -a do_action=yes
5. Run scheduled crawls:

Deploy the scrapy project:
    $ cd crawl
    $ scrapyd-deploy -p crawl
    $ scrapyd
Run the scrapy scheduler:
    $ python manage.py celeryd -l info -B --settings=example_project.settings
    $ python manage.py celeryd -l info -B --settings=djangoItem.settings
Run the XPath error check:

    $ scrapy crawl news_checker -a id=ITEM_ID -a do_action=yes
References
1. Scrapy Documentation: https://doc.scrapy.org/en/latest/
2. Scrapinghub Documentation: https://doc.scrapinghub.com
3. Django Dynamic Scraper: http://django-dynamic-scraper.readthedocs.io/
TODO: There are many other sources as well; each section carries its own reference links. We will
update the full reference list soon.