Crawling the Web with Python and Scrapy

Gaurav Singhal
Jun 25, 2019 • 19 Min read • 9,313 Views

Introduction

Web scraping has become popular over the last few years, as it is an
effective way to extract information from different websites so that it
can be used for further analysis.

If you are new to using web scraping, check out my previous guide on
extracting data with Beautiful Soup.

According to the documentation on Scrapy:

Scrapy is an application framework for crawling websites and
extracting structured data which can be used for a wide range of
useful applications, like data mining, information processing, or
historical archival.

In this guide, we will learn how to scrape the products from the
product page of Zappos. We will be scraping men's running shoes,
which have been paginated into 100 products per page, and then
export the data into a CSV file. Scraping data from a website that
has been paginated is not always easy, and this guide will establish a
strong groundwork for such websites. Zappos is just an example; the
same technique can be used on numerous websites, like Amazon.

Why Scrapy?

Beautiful Soup is widely used for scraping, but it is best suited to
small-scale scraping of static HTML pages. Remember, Beautiful Soup is
only a parsing library which parses the HTML document. However, it is
easy to learn, so you can quickly use it to extract the data you want.

On the other hand, Scrapy is a web crawling framework that provides
developers a complete tool for scraping. In Scrapy, we create
Spiders, which are Python classes that define how a certain site (or
group of sites) will be scraped. So, if you want to build a robust,
scalable, large-scale scraper, then Scrapy is a good choice for you.

The biggest advantage of Scrapy is that it is built on top of
the Twisted library, an asynchronous networking library that
allows you to write non-blocking (asynchronous) code for
concurrency, which improves spider performance to a great
extent.

Getting Started

Before getting started with this guide, make sure you have Python 3
installed on your system. If not, you can install it from here.

The next thing you need is the Scrapy package; let's install it with pip.

shell
pip3 install scrapy

Note: If you are using Windows, use pip instead of pip3.

For Windows users only:

If you get the error Microsoft Visual C++ 14.0 is required while
installing the Twisted library, then you need to install the C++ build
tools from the following link:
https://visualstudio.microsoft.com/visual-cpp-build-tools/
Under Downloads you will find Tools for Visual Studio 2019; download
Build Tools for Visual Studio 2019. After downloading and installing
it, install the Visual C++ build tools, which take up almost 1.5 GB.

Jumping to the Code

Now that you have installed Scrapy on your system, let us jump into a
simple example. As discussed in the Introduction, we will be scraping
the Zappos product list page for the keyword men running shoes, which
is available in paginated form.

Step 1: Start a New Project

Since Scrapy is a framework, we need to follow some of the framework's
standards. To create a new project in Scrapy, use the command
startproject. I have named my project tutorial.

shell
scrapy startproject tutorial

This will create a tutorial directory with the following contents:

docs
tutorial
├── scrapy.cfg          -- deploy configuration file of the Scrapy project
└── tutorial            -- your Scrapy project module
    ├── __init__.py     -- module initializer (empty file)
    ├── items.py        -- project item definition file
    ├── middlewares.py  -- project middlewares file
    ├── pipelines.py    -- project pipelines file
    ├── settings.py     -- project settings file
    └── spiders         -- directory where spiders are kept
        ├── __init__.py

Step 2: Analyze the Website


The next important step when scraping is analyzing the content of the
webpage that you want to scrape: identify how the information can be
retrieved from the HTML text by examining what makes the desired
element unique.

To inspect the page in Chrome, open Developer Tools by right-clicking
on the page.

In this example, we intend to scrape all the information about each
product from the list of products. Every piece of product
information is available between the article tags. The sample HTML
layout of a product (article) is:

html
<article>
  <a aria-label="" itemprop="url" href="PRODUCT URL HERE">
    <meta itemprop="image" content="">
    <div>
      <span>
        <img src="PRODUCT IMG SRC HERE" alt="alt tag">
      </span>
    </div>
  </a>
  <div>
    <button>
    </button>
    <p>
      <span itemprop="name">PRODUCT BY HERE</span>
    </p>
    <p itemprop="name">PRODUCT NAME HERE</p>
    <p><span>PRODUCT PRICE HERE</span></p>
    <p>
      <span itemprop="aggregateRating" data-star-rating="PRODUCT RATING HERE">
        <meta><meta>
        <span></span>
        <span class="screenReadersOnly"></span>
      </span>
    </p>
  </div>
</article>

From the above HTML code snippet, we are going to scrape the
following things from each product:

Product name
Product by
Product price
Product stars
Product image URL
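
Before writing the spider, you can test these selectors interactively. The sketch below uses Scrapy's built-in interactive shell; the exact output depends on what Zappos serves when you run it:

shell
scrapy shell 'https://www.zappos.com/men-running-shoes'
>>> len(response.css("article"))    # number of product cards found on the page
>>> response.css("article")[0].css("p[itemprop='name']::text").extract_first()
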
Step 3: Creating Our First Spider

Now let's create our first spider. To create a new spider, you can use
the genspider command, which takes a spider name and a start URL as
arguments.

shell
scrapy genspider zappos www.zappos.com

After you run the above command, you will notice that a new .py file
has been created in your spiders folder.

In that spider Python file, you will see a class named ZapposSpider,
which inherits from the scrapy.Spider class and contains a method named
parse, which we will discuss in the next step.

python
import scrapy


class ZapposSpider(scrapy.Spider):
    name = 'zappos'
    allowed_domains = ['www.zappos.com']
    start_urls = ['http://www.zappos.com/']

    def parse(self, response):
        pass

To run a spider, you can use either the crawl command or the
runspider command.

The crawl command takes the spider name as an argument:

shell
scrapy crawl zappos

Or you can use the runspider command, which takes the location of the
spider file:

shell
scrapy runspider tutorial/spiders/zappos.py

After you run any of the above commands, you will see the output in
the terminal showing something like this:

shell
2019-06-17 15:45:11 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tut
2019-06-17 15:45:11 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml
2019-06-17 15:45:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME'
2019-06-17 15:45:11 [scrapy.extensions.telnet] INFO: Telnet Password: 8ddf4
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled extensions:
[...]
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled downloader middleware
[...]
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled spider middlewares:
[...]
2019-06-17 15:45:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-06-17 15:45:11 [scrapy.core.engine] INFO: Spider opened
2019-06-17 15:45:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at
2019-06-17 15:45:11 [scrapy.extensions.telnet] INFO: Telnet console listeni
2019-06-17 15:45:13 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://w
2019-06-17 15:45:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://w
2019-06-17 15:45:13 [scrapy.core.engine] INFO: Closing spider (finished)
2019-06-17 15:45:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{
...
}
2019-06-17 15:45:13 [scrapy.core.engine] INFO: Spider closed (finished)

Step 4: Extracting the Data from the Page

Now, let's write our parse method. Before jumping to the parse
method, we have to change start_urls to the web page URL that we wish
to scrape.

We will use CSS selectors in this guide, since CSS is the easiest
option for iterating over the products. The other commonly used
selector is XPath. For more info about Scrapy selectors, refer to this
documentation.
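
For comparison, the same product-name lookup with an XPath selector would look roughly like the sketch below (the selector is an assumption based on the sample layout in Step 2; we stick with CSS in this guide):

shell
>>> response.xpath("//article//p[@itemprop='name']/text()").extract_first()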

As discussed in Step 2, while inspecting the elements on the web page
we saw that every product is wrapped in an article tag. So, we have to
loop through each article tag and then extract the product information
from the resulting product object.

The product object has all the information regarding each product. We
can use further selectors on the product object to find information
about the product. Let's try to extract only the product name from
each product on the first page.

python
# -*- coding: utf-8 -*-
import scrapy


class ZapposMenShoesSpider(scrapy.Spider):
    name = "zappos"
    start_urls = ['https://www.zappos.com/men-running-shoes']
    allowed_domains = ['www.zappos.com']

    def parse(self, response):
        for product in response.css("article"):
            yield {
                "name": product.css("p[itemprop='name']::text").extract_first(),
            }

You’ll notice the following things going on in the above code:

We use the selector p[itemprop='name'] for fetching the product
name. It says, "Hey, find the p tag that has an attribute named
itemprop which is set to name," from the product object.

We append ::text to our selector for the name because we just
want to extract the text enclosed between the tags. This is called a
CSS pseudo-selector.

We call extract_first() on the object returned by product.css(CSS
SELECTOR) because we just want the first element that matches
the selector. This gives us a string, rather than a list of elements
which may match other similar CSS patterns.
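
To make the difference concrete, here is a small sketch in the Scrapy shell (the 'Motion 7' value mirrors the sample output below; actual values depend on the live page):

shell
>>> product = response.css("article")[0]
>>> product.css("p[itemprop='name']::text").extract()        # every match, as a list
['Motion 7']
>>> product.css("p[itemprop='name']::text").extract_first()  # first match only, or None
'Motion 7'
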
Save the spider file and run the scraper again:

shell
scrapy crawl zappos

This time, you will see the names of all the products (100) that were
listed on the first page appear in the output:

shell
...
...
{'name': 'Motion 7'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
{'name': 'Fate 5'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
{'name': 'Gravity 8'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
{'name': 'Distance 8'}
2019-06-18 00:09:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://w
...
...

Now, let's expand our yield dictionary by adding price, stars, by, image
URL, etc.

by: For extracting the product brand from a product object, the
p[itemprop='brand'] span[itemprop='name']::text selector can be
used. It says: from the product object, find the p tag that has an
attribute named itemprop set to brand and which has a child span
element with an attribute named itemprop whose value is name.

price: For the price, the p span::text selector can be used. Note that
we have two matching results for this selector, so we have to use
the second one, i.e., match at index 1 (see the shell sketch below).

stars: The star total of a product can be extracted from an attribute
value. The selector will be
p span[itemprop='aggregateRating']::attr('data-star-rating'). It says:
in the product object, find the p tag that has a child span element
with an attribute named itemprop set to aggregateRating, and then
extract the attribute value of data-star-rating.

image url: For extracting the src attribute from the img tag, we will
use the div span img::attr('src') selector.
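
As a quick sanity check on the price selector, here is a sketch in the Scrapy shell; the '$131.25' value mirrors the sample output later in this step, and the text of the first match will vary with the live markup:

shell
>>> product = response.css("article")[0]
>>> prices = product.css("p span::text").extract()   # matches twice; the price is the second entry
>>> prices[1]
'$131.25'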

python
# -*- coding: utf-8 -*-
import scrapy


class ZapposMenShoesSpider(scrapy.Spider):
    name = "zappos"
    start_urls = ['https://www.zappos.com/men-running-shoes']
    allowed_domains = ['www.zappos.com']

    def parse(self, response):
        for product in response.css("article"):
            yield {
                "name": product.css("p[itemprop='name']::text").extract_first(),
                "by": product.css(
                    "p[itemprop='brand'] span[itemprop='name']::text"
                ).extract_first(),
                "price": product.css("p span::text").extract()[1],
                "stars": product.css(
                    "p span[itemprop='aggregateRating']::attr('data-star-rating')"
                ).extract_first(),
                "img-url": product.css(
                    "div span img::attr('src')").extract_first()
            }

This time, you will see all the information about each product listed
on the first page appear in the output:

shell
...
...
{'name': 'Motion 7', 'by': 'Newton Running', 'price': '$131.25', 'stars': '4
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https:/
{'name': 'Fate 5', 'by': 'Newton Running', 'price': '$140.00', 'stars': Non
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https:/
{'name': 'Gravity 8', 'by': 'Newton Running', 'price': '$175.00', 'stars':
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https:/
{'name': 'Distance 8', 'by': 'Newton Running', 'price': '$155.00', 'stars':
2019-06-18 00:36:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https:/
...
...

To preserve the output in a file, you can use the -o flag followed by
the filename while running the spider.

Scrapy allows you to export your extracted data into several
different file formats (refer to the feed exports documentation). Some
of the commonly used file exports are:

CSV
JSON
XML
pickle

For example, let's export the data into a CSV file.


shell
scrapy crawl zappos -o zappos.csv
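
The feed format is inferred from the file extension, so exporting to JSON instead is just a matter of changing the filename:

shell
scrapy crawl zappos -o zappos.json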

Step 5: Crawling Multiple Pages

We have successfully extracted products for the first page. Now, let's
extend our spider so that it navigates to all the available pages for
the given keyword by fetching the next page URL.

You will notice a Next Page link at the bottom of the page, which has
an element as follows:


html
<a rel="next" href="/men-running-shoes/.zso?t=men running shoes&amp;p=1">Next
  <span> Page</span>
</a>

You can grab the next page URL from the href attribute of the a tag,
which has another unique attribute, rel, set to next for this element.

So, the CSS selector for grabbing it will be:

a[rel='next']::attr('href')

Modify your code as follows:

python
# -*- coding: utf-8 -*-
import scrapy


class ZapposMenShoesSpider(scrapy.Spider):
    name = "zappos_p"
    start_urls = ['https://www.zappos.com/men-running-shoes']
    allowed_domains = ['www.zappos.com']

    def parse(self, response):
        for product in response.css("article"):
            yield {
                "name": product.css("p[itemprop='name']::text").extract_first(),
                "by": product.css(
                    "p[itemprop='brand'] span[itemprop='name']::text"
                ).extract_first(),
                "price": product.css("p span::text").extract()[1],
                "stars": product.css(
                    "p span[itemprop='aggregateRating']::attr('data-star-rating')"
                ).extract_first(),
                "img-url": product.css(
                    "div span img::attr('src')").extract_first()
            }

        next_url_path = response.css(
            "a[rel='next']::attr('href')").extract_first()
        if next_url_path:
            yield scrapy.Request(
                response.urljoin(next_url_path),
                callback=self.parse
            )

You will notice from the previous code that we have just added two
new statements.

The first statement will grab the next page URL, if it exists, and
store it in the variable next_url_path.
The second statement checks whether next_url_path exists. If it
does, we yield a new scrapy.Request for that URL with self.parse as
its callback.
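
While iterating on a paginated crawl like this, it can be handy to cap how many pages the spider fetches. One option, assuming the default CloseSpider extension is enabled, is to pass Scrapy's CLOSESPIDER_PAGECOUNT setting on the command line:

shell
scrapy crawl zappos_p -s CLOSESPIDER_PAGECOUNT=2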

Finally, you can run your spider, specifying the output file (note
that this version of the spider is named zappos_p):

shell
scrapy crawl zappos_p -o zappos.csv

Conclusion
In this guide, you have successfully built a spider that extracts all
the products of the specified category, available in paginated form,
in just 25 lines of code. This is a great start, but there are a lot
of things that you can do with the spider. For a greater
understanding, you can follow the documentation of Scrapy.

Here are some of the ways that you can expand your code for learning
purposes:

Extract the URL of the product (a starting sketch follows this list).
Scrape for multiple keywords. In this example, we have just scraped
for a single keyword (men-running-shoes).
Try to scrape other, different types of websites.
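
For the first item, here is a starting sketch: the sample layout in Step 2 shows the product link as an a tag with itemprop='url', so one plausible addition inside the for loop of parse is shown below. Treat the selector as an assumption and verify it against the live page.

python
# Sketch: an extra field for the for-product loop in parse().
# The selector is based on the Step 2 sample layout, not verified live.
href = product.css("a[itemprop='url']::attr('href')").extract_first()
yield {
    # ...existing fields...
    "url": response.urljoin(href) if href else None,  # make relative hrefs absolute
}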

I hope you have learned a lot in this guide. Try experimenting with
another website to gain more understanding of the Scrapy framework.
You can always follow the Scrapy documentation for a better and
deeper understanding. For more information on scraping data from the
web, check out Extracting Data with Beautiful Soup.
