Crawling The Web With Python and Scrapy

Gaurav Singhal
Jun 25, 2019 • 19 Min read • 9,313 Views
Data | Python

Introduction
Web scraping has become popular over the last few years, as it is an effective way to extract the required information from different websites so that it can be used for further analysis.

If you are new to web scraping, check out my previous guide on extracting data with Beautiful Soup.

In this guide, we will learn how to scrape the products from the Zappos product list page.
Why Scrapy?
Beautiful Soup is widely used for scraping, but it is best suited to small-scale scraping of static HTML pages. Remember, Beautiful Soup is only a parsing library that parses an HTML document; Scrapy, by contrast, is a full web-crawling framework that also handles requests, link following, and data export. It is still easy to learn, so you can quickly use it to extract the data you want.
Getting Started

Before getting started with this guide, make sure you have Python 3 installed on your system. If not, you can install it from here.

The next thing you need is the Scrapy package. Let's install it with pip:
shell
pip3 install scrapy
Jumping to the Code

Now that you have installed Scrapy on your system, let us jump into a simplistic example. As discussed in the introduction, we will be scraping the Zappos product list page for the keyword men running shoes, which is available in paginated form.
Step 1: Start a New Project
shell
scrapy startproject tutorial

docs
tutorial
├── scrapy.cfg          -- deploy configuration file of the Scrapy project
└── tutorial            -- your Scrapy project module
    ├── __init__.py     -- module initializer (empty file)
    ├── items.py        -- project item definitions
    ├── middlewares.py  -- project middlewares
    ├── pipelines.py    -- project pipelines
    ├── settings.py     -- project settings
    └── spiders         -- directory where spiders are kept
        └── __init__.py
Step 2: Inspecting the Target Page

Each product on the listing page is wrapped in an article tag, with markup similar to the following:

html
<article>
  <a aria-label="" itemprop="url" href="PRODUCT URL HERE">
    <meta itemprop="image" content="">
    <div>
      <span>
        <img src="PRODUCT IMG SRC HERE" alt="alt tag">
      </span>
    </div>
  </a>
  <div>
    <button>
    </button>
    <p itemprop="brand">
      <span itemprop="name">PRODUCT BY HERE</span>
    </p>
    <p itemprop="name">PRODUCT NAME HERE</p>
    <p><span>PRODUCT PRICE HERE</span></p>
    <p>
      <span itemprop="aggregateRating" data-star-rating="PRODUCT RATING HERE">
        <meta><meta>
        <span></span>
        <span class="screenReadersOnly"></span>
      </span>
    </p>
  </div>
</article>
From the above HTML code snippet, we are going to scrape the following things from each product:

Product name
Product by (brand)
Product price
Product stars
Product image URL
Step 3: Creating Our First Spider

Now let's create our first spider. To create a new spider, you can use the genspider command, which takes the spider name and the start URL as arguments.
shell
scrapy genspider zappos www.zappos.com
After you run the above command, you will notice that a new .py file has been created in your spiders folder.

In that spider's Python file, you will see a class named ZapposSpider, which inherits from the scrapy.Spider class and contains a method named parse that we will discuss in the next step.
python
import scrapy


class ZapposSpider(scrapy.Spider):
    name = 'zappos'
    allowed_domains = ['www.zappos.com']
    start_urls = ['http://www.zappos.com/']

    def parse(self, response):
        pass
To run a spider, you can use either the crawl command or the runspider command.

The crawl command takes the spider name as an argument:

shell
scrapy crawl zappos
Or you can use the runspider command, which takes the location of the spider file:

shell
scrapy runspider tutorial/spiders/zappos.py
After you run either of the above commands, you will see output in the terminal similar to this:

shell
2019-06-17 15:45:11 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tut
2019-06-17 15:45:11 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml
2019-06-17 15:45:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME'
2019-06-17 15:45:11 [scrapy.extensions.telnet] INFO: Telnet Password: 8ddf4
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled extensions:
[...]
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled downloader middleware
[...]
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled spider middlewares:
[...]
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-06-17 15:45:11 [scrapy.core.engine] INFO: Spider opened
2019-06-17 15:45:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at
2019-06-17 15:45:11 [scrapy.extensions.telnet] INFO: Telnet console listeni
2019-06-17 15:45:13 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://w
2019-06-17 15:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://w
2019-06-17 15:45:13 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-17 15:45:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{
...
}
2019-06-17 15:45:13 [scrapy.core.engine] INFO: Spider closed (finished)
Now, let's write our parse method. Before jumping to it, we have to change start_urls to the web page URL that we wish to scrape.

We will use CSS selectors in this guide, since CSS is the easiest option for iterating over the products. The other commonly used selector is the XPath selector. For more info about Scrapy selectors, refer to the documentation.
Now run the spider again:

shell
scrapy crawl zappos
Jumping to the Code This time, you will see the names of all the products (100) which were
Conclusion listed on first page appear in the output:
Top
shell
...
...
{'name': 'Motion 7'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
{'name': 'Fate 5'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
{'name': 'Gravity 8'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
{'name': 'Distance 8'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
...
...
Now, let's expand our yielded dictionary by adding the price, stars, by, and image URL fields.
python
# -*- coding: utf-8 -*-
import scrapy


class ZapposMenShoesSpider(scrapy.Spider):
    name = "zappos"
    start_urls = ['https://www.zappos.com/men-running-shoes']
    allowed_domains = ['www.zappos.com']

    def parse(self, response):
        for product in response.css("article"):
            yield {
                "name": product.css("p[itemprop='name']::text").extract_first(),
                "by": product.css("p[itemprop='brand'] span[itemprop='name']::text").extract_first(),
                "price": product.css("p span::text").extract()[1],
                "stars": product.css(
                    "p span[itemprop='aggregateRating']::attr('data-star-rating')"
                ).extract_first(),
                "img-url": product.css(
                    "div span img::attr('src')").extract_first()
            }
This time, you will see all the information about each product listed on the first page appear in the output:
shell
...
...
{'name': 'Motion 7', 'by': 'Newton Running', 'price': '$131.25', 'stars': '4
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
{'name': 'Fate 5', 'by': 'Newton Running', 'price': '$140.00', 'stars': Non
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
{'name': 'Gravity 8', 'by': 'Newton Running', 'price': '$175.00', 'stars':
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
{'name': 'Distance 8', 'by': 'Newton Running', 'price': '$155.00', 'stars':
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
...
...
To preserve the output in a file, you can use the -o flag followed by the filename while running the spider.

Scrapy allows you to export your extracted data into several different file formats. Some of the commonly used feed exports are:

CSV
JSON
XML
pickle
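For example, if you export with -o zappos.json, Scrapy writes the items as a single JSON array, which you can load back with the standard json module. The two sample items below are taken from the output shown above:

```python
import json

# A miniature version of what a Scrapy JSON feed export looks like:
# one JSON array containing every item the spider yielded.
feed = '''[
  {"name": "Motion 7", "by": "Newton Running", "price": "$131.25"},
  {"name": "Fate 5", "by": "Newton Running", "price": "$140.00"}
]'''

items = json.loads(feed)
print(len(items))        # 2
print(items[0]["name"])  # Motion 7
```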
So far, we have only scraped the first page. To crawl the rest of the paginated results, we need the URL of the next page. You can grab it from the href attribute of the a tag which has another unique attribute, rel, whose value is next for this element.

So, the CSS selector for grabbing it will be:
a[rel='next']::attr('href')
Modify your code as follows:
python
# -*- coding: utf-8 -*-
import scrapy


class ZapposMenShoesSpider(scrapy.Spider):
    name = "zappos_p"
    start_urls = ['https://www.zappos.com/men-running-shoes']
    allowed_domains = ['www.zappos.com']

    def parse(self, response):
        for product in response.css("article"):
            yield {
                "name": product.css("p[itemprop='name']::text").extract_first(),
                "by": product.css("p[itemprop='brand'] span[itemprop='name']::text").extract_first(),
                "price": product.css("p span::text").extract()[1],
                "stars": product.css(
                    "p span[itemprop='aggregateRating']::attr('data-star-rating')"
                ).extract_first(),
                "img-url": product.css(
                    "div span img::attr('src')").extract_first()
            }

        next_url_path = response.css(
            "a[rel='next']::attr('href')").extract_first()
        if next_url_path:
            yield scrapy.Request(
                response.urljoin(next_url_path),
                callback=self.parse
            )
You will notice from the previous code that we have added just two new statements.

The first statement grabs the next page URL, if one exists, and stores it in the variable next_url_path.

The second statement checks whether next_url_path exists. If it does, we yield a new request for that URL with self.parse as its callback, so the next page is scraped in the same way.
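The response.urljoin call matters because the href we grab may be relative. It resolves the href against the URL of the current page, much like the standard library's urljoin (the next-page path below is illustrative):

```python
from urllib.parse import urljoin

# What response.urljoin does, in essence: resolve a (possibly relative)
# next-page href against the URL of the current page.
base = "https://www.zappos.com/men-running-shoes"

# An absolute path is resolved against the site root; the example path is illustrative.
next_page = urljoin(base, "/men-running-shoes/.zso?p=1")
print(next_page)  # https://www.zappos.com/men-running-shoes/.zso?p=1

# An already-absolute URL is returned unchanged.
absolute = urljoin(base, "https://www.zappos.com/other")
print(absolute)   # https://www.zappos.com/other
```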
Finally, you can run your spider, specifying the output file:

shell
scrapy crawl zappos_p -o zappos.csv
Conclusion

In this guide, you have built a spider that extracts all the products of the specified category, available in paginated form, in just 25 lines of code. This is a great start, but there is a lot more that you can do with the spider. For a deeper understanding, you can follow the Scrapy documentation.
Here are some of the ways you can expand your code for learning purposes:

Extract the URL of the product.
Scrape for multiple keywords. In this example, we scraped for just a single keyword (men-running-shoes).
Try to scrape other, different types of websites.
I hope you have learned a lot in this guide. Try experimenting with another website to gain a better understanding of the Scrapy framework. You can always follow the Scrapy documentation for a deeper understanding. For more information on scraping data from the web, check out Extracting Data with Beautiful Soup.