BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
Submitted By:
T. TUSHARA PRIYA (19UP1A05E8)
K. SATHWIKA (19UP1A05C4)
U. SAHITHI (19UP1A05F1)
Mrs. P. ARCHANA
Assistant Professor
CSE VMTW
VIGNAN’S INSTITUTE OF MANAGEMENT AND TECHNOLOGY FOR WOMEN
CERTIFICATE
This is to certify that the Project work titled “TO COMPARE PRICE OF THE PRODUCT USING
WEB SCRAPING” submitted by
K. SATHWIKA (19UP1A05C4),
U. SAHITHI (19UP1A05F1),
in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer
Science and Engineering to Vignan’s Institute of Management and Technology for Women, is a record of
bonafide work carried out by them under my guidance and supervision. The results embodied in this project
report have not been submitted to any other university for the award of any degree, and the results achieved
are satisfactory.
(External Examiner)
VIGNAN’S INSTITUTE OF MANAGEMENT AND TECHNOLOGY FOR WOMEN
DECLARATION
We hereby declare that the project entitled “TO COMPARE THE PRICE OF A PRODUCT USING
WEB SCRAPING” is bonafide work duly completed by us. It does not contain any part of a project
submitted by any other candidate to this or any other institute or university. All materials that have
been obtained from other sources have been duly acknowledged.
ACKNOWLEDGEMENT
We would like to express our sincere gratitude to Dr. G. Appa Rao Naidu, Principal, Vignan’s Institute of
Management and Technology for Women, for his timely suggestions, which helped us to complete
the project in time.
We would also like to thank Mrs. M. Parimala, Head of the Department, Computer
Science and Engineering, for providing us with constant encouragement and resources, which helped
us to complete the project in time.
We would like to thank our project guide, Mrs. P. Archana, Assistant Professor, Computer Science
and Engineering, for her timely cooperation and valuable suggestions throughout the project. We
are indebted to her for the opportunity given to work under her guidance.
Our sincere thanks to all the teaching and non-teaching staff of Department of Computer Science
and Engineering for their support throughout our project work.
INDEX
CONTENTS PAGE NO
ABSTRACT
LIST OF FIGURES
1. INTRODUCTION 1-3
1.1. Web Scraping 3-4
1.2. Existing System 4-5
1.3. Proposed System
2. LITERATURE SURVEY 6-8
2.1. About the Project 9
2.2. How It Benefits the Business 10-12
2.3. Challenges 12-13
3. METHODOLOGY 14-18
4. SYSTEM REQUIREMENTS 19
4.1. Hardware Requirements
4.2. Software Requirements
5. SYSTEM ARCHITECTURE 20
5.1. System Design 21-24
6. UML DIAGRAMS
6.1. Use Case 25
6.2. Sequence Diagram 26
6.3. Class Diagram 27
7. SOFTWARE MODELLING AND SETUP 28-34
7.1. Installation
7.2. Setting Up the System
8. IMPLEMENTATION 35-39
8.1. Code 40-46
8.2. Results 47-48
9. CONCLUSION 49-50
10. FUTURE SCOPE/ENHANCEMENT 51-52
11. BIBLIOGRAPHY 53
ABSTRACT
Web scraping is an automated method for browsing websites and other online
sources to locate and access data. It uses software engineering techniques and custom
programs to extract data or other content from online sources, copy that information, and
save it in an external archive for later review. Web scraping is often called automatic data
gathering, web data extraction, web crawling, or content mining. Forms of web scraping have
existed nearly as long as the World Wide Web itself, but the technique is used mainly in the
context of data analytics and is generally associated with e-commerce.
Web scraping provides a broad collection of options and can serve
various purposes. At a minimum, a web scraper automates the normally manual work
of gathering price quotations and product details from websites. More ambitiously, it can
uncover previously inaccessible sources of price data and compile a
survey of all accessible price information. The scraping process is performed using
different technologies, from automated software tools to manual methods. This
report provides an overall review of web scraping technology, how it is carried out, and the
effects of this technology.
LIST OF FIGURES
2 Comparative study 7
4 Flipkart website 14
5 Amazon website 15
6 Inspecting page 15
7 Importing libraries 17
8 Block diagram 18
9 System architecture 20
10 Design diagram 21
12 Sequence diagram 26
13 Class diagram 27
17 Data stored in CSV 39
18 Result screen 47
19 Future benefits 51
INTRODUCTION
Web scraping lets us collect data from web pages across the internet. In this
project, a script searches for a product via its URL and finds the price of the product. This
is particularly useful when we want to monitor the price of a specific item across
multiple e-commerce platforms. Here, the project queries three major e-commerce
websites to find the price of the product. On each execution, all the websites are crawled,
the product is located, and the price of the same product from all the sources is obtained and
displayed on the console window. The buyer can then compare the prices and decide to
buy from the platform that offers the lowest price.
Every website has a different structure, which is why web scrapers are usually built to target
one website.
Web scraping tools are used to extract information from web hosts, and they serve
many applications: web indexing, web mining and data mining, online price-change
monitoring and price comparison, product review scraping (to watch the
competition), gathering real-estate listings, weather data monitoring, website change detection, research,
tracking online presence and reputation, web mashups, and web data integration. Web pages are
built using text-based markup languages (HTML and XHTML) and frequently
contain a wealth of useful information in text form. However, most web pages are
designed for human end users, not for ease of automated use. That is why toolkits
that scrape web content were created.
Due to e-commerce sites’ dynamic pricing, extracting and keeping this
data up to date is challenging: prices change frequently, which complicates the task.
Scraping social media websites is another way to gather price intelligence.
EXISTING SYSTEM
In the existing system, the manual web data extraction process has two major
problems. Firstly, it does not scale: costs escalate very quickly, since data
collection costs rise as more data is gathered from each website. To conduct a
manual extraction, businesses need to hire a large number of staff, which increases labour
costs significantly. Secondly, manual extraction is known to be error-prone. Further, if
a business process is very complex, then cleaning up the data can get expensive and time-consuming.
The existing system doesn’t enable us to rapidly scrape many websites at the
same time without having to watch and control every single request. It is also not easy to
implement: a one-time investment is not enough to keep the data flowing. Competitor
monitoring is likewise difficult in the market and the business world.
The world of retail is changing rapidly. Many brick-and-mortar locations are
closing and being replaced by online stores, direct-to-consumer brands, and subscription
services. However, while breadth of assortment is what drives customers to a
website, many e-commerce platforms fail to sell through a high percentage of merchandise.
DISADVANTAGES
• The existing system doesn’t enable us to rapidly scrape many websites at the same
time without having to watch and control every single request.
• Manual collection is slow: what an automated scraper could set up once and gather
from a whole website within an hour or much less would take a single person a week
to complete.
• It is not easy to implement: a one-time investment is not enough, so the data cannot
be collected continuously.
• Competitor monitoring: it is not easy to monitor the competitors in the market and
the business world.
PROPOSED SYSTEM
To find the right price, you need to understand and be able to predict how
your customers react to price changes. Web scraping allows you to compare prices of the
products that you want to buy, track how customers react to changes in your
competitors’ prices, or tweak your own prices and monitor how that affects sales.
It also lets you create applications for tools that don’t have a public developer API. Web
scraping services provide an essential service at a low cost. The advantage of web scraping is
that it is time-efficient and low-maintenance. For example, downloading big data may take hours,
while analysing every single row manually could consume an entire month.
ADVANTAGES
1. Time Efficiency
Web scraping is time-efficient and low-maintenance. For example, downloading big data
may take hours, while analysing every single row manually could consume an entire month.
2. Complete Automation
• Automation doesn’t get bored or tired, does not require any breaks, and never gets
distracted; it simply follows the given instructions.
• While humans have advantages in tasks like analysis, running an algorithm across a large
dataset is faster and more effective than having someone manually read through every
document one by one.
3. Cost Efficiency
• Web scraping services provide essential services at a competitive cost, as it is
much cheaper than hiring a company to perform the same task.
• By monitoring listings and sales data, scraping lets you see how well different products
are performing. Keeping track of your business has never been easier.
4. Data Accuracy
• With no humans involved in the process, simple extraction errors that could lead to
major issues are avoided. Web scraping is not only a fast process, but it is also very
accurate, and it is essential to ensure that the data is accurate.
LITERATURE SURVEY
To understand how the data extraction process has evolved, one must
understand the techniques involved in web scraping. Scraping has
been around nearly as long as the web itself. The motive behind commercial web scraping has
always been to gain a simple business advantage, and includes things like
undercutting a competitor’s promotional pricing, stealing leads, hijacking marketing campaigns,
diverting APIs, and the outright theft of content and data.
The first aggregators and comparison engines appeared hot on the heels
of the e-commerce boom and operated largely unchallenged until the legal
challenges of the mid-2000s. Early scraping tools were quite basic: manually
copying and pasting anything visible from a site. Once programmers got involved,
scraping graduated to the Unix grep command and regular-expression matching techniques,
then to issuing remote HTTP requests using socket programming and parsing sites using
data query languages. Today, however, it is
an entirely different story: web scraping is big business, with powerful
tools and services to match.
Data extraction and analysis are widely used by digital
publishers and directories, and in travel, real estate, and e-commerce. Analytics and
computation, on the other hand, go way back, with advances in storage mechanisms and the
invention of real databases: information came to be seen and handled as data to be prepared for
analysis. The pivotal turning point was the arrival of the relational database (RDB) in
the 1980s, which enabled users to write SQL queries to retrieve data from the
database. For users, the advantage of RDB and SQL is the ability to pull out
their data on demand; it made accessing data simple and spread database use.
Data warehouses differ from conventional relational databases in that they are
generally optimised for query response time. The growth of data
mining was made possible thanks to database and data warehouse advances, which
enable organisations to store more data and still analyse it in a reasonable way. A general
commercial pattern developed, in which services began to “foresee” customers’ potential
needs based on analysis of historical purchasing patterns.
This paper describes a standard data-analysis pipeline based on the user’s requirements. The method
is divided into three parts: the web scraper fetches the desired links from the web, then
the information is extracted (scraped) from the source, and lastly it
is stored in a CSV document. Because of Python’s huge community, its library
resources, and its simple coding style, it is one of the most
suitable languages for scraping the desired information from a target website [1].
This book teaches web scraping and crawling techniques to access unlimited data from any web
source in any format. Ideal for programmers, security professionals, and web administrators
familiar with Python, it covers not just essential web scraping mechanics, but also
digs into more advanced topics, for example analysing raw data or using
scrapers for front-end website testing [2].
Paper 3: Web Scraping with Python — successfully scrape data from any
website
The Internet contains the most useful collection of information ever assembled,
largely openly accessible free of charge. However, this information isn’t easily
reusable. It is embedded within the structure and style of websites and must be carefully
extracted to be useful. Web scraping is becoming increasingly valuable as a way to
effortlessly gather and organise the plenty of data available online. Using a
straightforward language like Python, you can crawl the information out of complex websites using
simple programs [3].
Web scraping can be performed by teams in the steps shown in Figure 2. The
scraping organisation takes from the client the details of the websites from which the data is to be
extracted, and analysis is done by the experts. They then get the plan approved by the client. After
approval, the extraction process is carried out for the required data along with data configuration,
and the final information is delivered to the client, followed by collecting feedback.
For example, suppose you are selling a frying pan from a reputed company. You would
scrape prices of the same product across different websites to understand the market value
and set an attractive price on your own website. Consumers will flock to your website if you
can offer a comparatively lower price than most other competitors. Amazon all but forced
Quidsi to merge with the retail giant after an aggressive years-long price war over diaper
sales.
However, in cases where prices change too frequently, such as the Sensex or flight
prices, ordinary automated scraping bots fall short, as the pricing data updates every second. This
is why real-time web scraping is important. With a real-time scraper, there are no intervals
between scraping sessions: the data is extracted as soon as it is updated, which enables
quick responsive action and better analysis. This also means that the extracted data need not
be stored, as everything happens in real time.
The growth of a business rests on the shoulders of effective marketing. However, for
marketing efforts to bear fruit, the business needs to generate leads. Web scraping can
collect high volumes of data, which in turn drives lead generation. Through its
surgical precision, it can generate lead data quickly and accurately. Plus, this information will
be in CSV or similar formats, which can be easily processed or integrated with other tools.
Analyse & Predict Market Trends
Sometimes, the market is not as black and white as selling woollens during winters. E-
Commerce is transforming rapidly, and you need to keep up with it. When it comes to
finalizing sales, timing is everything. Scraping e-commerce sites and monitoring similar or
competitor products over several months can help provide insights on a specific market and
product trends. These data points can help you predict the best time to launch a product and at
the most optimal price. Competitive pricing and in-season launch will result in a magical
recipe that will boost sales. Further, depending on the prevailing or projected market trends,
you can effectively manage the stock and inventory of your products.
Given several options, consumers select the one priced the cheapest while simultaneously
ensuring good quality. Thus, if you run an e-commerce website, it is this consumer
mentality that will drive your business.
Offering sales, buy-one-get-one-free deals, and exchange offers will also bring more
traffic to your website. When you know competitor prices, it becomes easier for you to make
logic-backed decisions. Optimal pricing also improves brand reputation, and thus
you gain more customers.
Finding the right consumer preference takes time. It cannot be done by sitting in a boardroom
and deciding preferences on customers’ behalf. It is only through data analytics that companies can
stay connected with their customers. Integrating your analytics with web-scraped data will
allow you to implement data-backed decisions. With data at your disposal, you will begin to
understand customer preferences and the quality that is demanded at each price point.
Practices like mailing customers when something they are eyeing becomes
cheaper can secure a purchase. But such practices need conversion-rate tracking,
compared against the loss or profit that you end up making with such aggressive pricing
strategies. Using historical pricing data to analyse and forecast future trends, and then
stocking up accordingly, can also mean more business for you.
When it comes to web scraping, not everything is a bed of roses; it has its fair share of thorns
too. E-Commerce websites, especially your competitors, do not want you stealing
information from their websites. And as web scrapers get better and more effective at
extracting product data, the website admins are also coming up with creative ways of
thwarting such attempts.
Here are some of the challenges that might keep you from using web scrapers:
Whether it’s intentional or just amateur coding standards, an e-commerce website may be
difficult to navigate with bots due to its design and structure, or the ever-changing layout of
the website. Keeping up with all these changes requires time and effort.
2. Use of Unique Elements
Websites responsible for storing sensitive data ensure the protection of information through
Honey Pot traps, which can detect scrapers and crawlers. Through this method, they
strategically place invisible links on a webpage that are not meant for visitors but are present
for scrapers. These are specially designed to trap and block web scrapers and bots as soon as
they attempt to crawl them. On setting off the trigger, the IP address corresponding to the
scraper is instantly blocked.
5. Use of CAPTCHAs
Fun Fact: The technology behind CAPTCHA (Completely Automated Public
Turing Test to Tell Computers and Humans Apart) is based on the Turing Test, which can
test whether a machine can think like humans!
The very role of CAPTCHA is to block automated scripts from performing repetitive
actions on a website. It essentially brings an element of randomness into an otherwise
predictable workflow. Web scrapers are confronted with images containing distortions and
randomness, and solving CAPTCHAs is something a robot cannot reliably do!
METHODOLOGY
The methodology used for the project is to gather all the data extracted from various sources,
using a web crawler driven by scripts written in Python, and to further analyse it as per the
requirements of the customer, with the results stored in a CSV file.
The initial step is to find the URL that you want to scrape. Here we extract product
details from Flipkart and Amazon; the URLs of these pages are https://www.flipkart.com and
https://www.amazon.com/
Next, the scraper must be told what it has to scrape. Right-clicking anywhere on the front end
of a website gives you the option to ‘Inspect Element’ or ‘View Page Source.’ This reveals the
site’s back-end code, which is what the scraper will read.
Then, import the required libraries, such as pandas, Beautiful Soup and Selenium, and write the code.
The target website is crawled to obtain the required data. The page is then
inspected, and the required “div” tags holding the price, name and ratings are identified. The data
present in those “div” tags is extracted from the website when the code runs. If the
data is obtained, it is saved in CSV format for future reference. If the
data is not obtained, the website is crawled again to obtain the data.
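As a sketch of this extract-and-save step, the snippet below parses product “div” tags with Beautiful Soup and writes the fields to a CSV file. The inline HTML and the class names (`product`, `name`, `price`, `rating`) are illustrative stand-ins: real Flipkart and Amazon pages use different, frequently changing class names, and the page would first be fetched with requests or Selenium.

```python
import csv
from bs4 import BeautifulSoup

# Inline HTML standing in for a fetched product page; real sites use
# different (and frequently changing) class names.
html = """
<div class="product">
  <div class="name">Acme Frying Pan</div>
  <div class="price">Rs. 1,299</div>
  <div class="rating">4.3</div>
</div>
"""

def extract_products(page_html):
    """Pull the name, price and rating out of each product div."""
    soup = BeautifulSoup(page_html, "html.parser")
    rows = []
    for product in soup.find_all("div", class_="product"):
        rows.append({
            "name": product.find("div", class_="name").get_text(strip=True),
            "price": product.find("div", class_="price").get_text(strip=True),
            "rating": product.find("div", class_="rating").get_text(strip=True),
        })
    return rows

def save_csv(rows, path):
    """Save the scraped rows in CSV format for future reference."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price", "rating"])
        writer.writeheader()
        writer.writerows(rows)

rows = extract_products(html)
save_csv(rows, "products.csv")
```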
SYSTEM REQUIREMENTS
Hardware Requirements
RAM: 8.00 GB
Software Requirements
Platform: Jupyter (Python 3.x with the Selenium, Beautiful Soup and pandas libraries installed)
SYSTEM ARCHITECTURE
System architecture defines the structure of a software system. This is usually a series of
diagrams that illustrate services, components, layers and interactions. A scheduler is a
software product that allows an enterprise to schedule and track computer batch tasks.
SYSTEM DESIGN
System Design is the process of designing the architecture, components, and interfaces for
a system so that it meets the end-user requirements. Web scraping requires two parts, namely
the crawler and the scraper. The crawler is a program that browses
the web in search of the particular data required, following links across the internet.
The scraper, on the other hand, is a tool built specifically to extract data from the website.
DESIGN COMPONENTS
1. Input Seed URLs
Firstly, your crawler will need ‘seed URLs’. Once it has the initial input, it will continue
extracting and storing data recursively. This list of seed URLs or absolute URLs is fed to the
‘URL frontier’.
2. URL Frontier
The job of module 2, the URL frontier, is to build and store a list of URLs to be downloaded
from the internet. For focused or topical web crawlers, the URL frontier will also prioritize the URLs to be crawled.
4. DNS Resolver
Before the HTML fetcher can actually download the page content, an additional step is
required. This is where the DNS resolver comes in. A DNS resolver, or DNS
lookup tool, component 4 in the diagram, maps a hostname to its IP address. Though DNS
resolution can be requested from an external server each time, that would take a lot of time,
given the large number of URLs to be crawled. Instead, the better option is to create a
customized DNS resolver, as you can see in the diagram, to complement the basic crawler
design. Your custom DNS resolver gives the HTML fetcher the IP address of the hostname
that is to be fetched. Once it has the IP address, the fetcher downloads the content of the page
available at that address.
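A minimal sketch of such a customized resolver, caching lookups with Python’s standard `socket` module (the class name is ours, not from the report’s diagram):

```python
import socket

class CachingDNSResolver:
    """Map hostnames to IP addresses, caching each answer so repeated
    lookups for the same host skip the slow DNS round trip."""

    def __init__(self):
        self._cache = {}

    def resolve(self, hostname):
        if hostname not in self._cache:
            # Only the first lookup for a host pays the DNS cost.
            self._cache[hostname] = socket.gethostbyname(hostname)
        return self._cache[hostname]

resolver = CachingDNSResolver()
```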
5. Caching
Next, the content downloaded from the internet by the fetcher is cached. Since stored data is
typically compressed and can be time-consuming to retrieve, an open-source
data-structure store such as Redis can be used to cache the document. This makes it easier for
the other processors in your web crawler design to fetch and re-read the data without consuming
unnecessary time.
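The caching step can be sketched as follows; a plain dictionary stands in for Redis here so the example is self-contained, with documents stored compressed as described:

```python
import zlib

class DocumentCache:
    """Keep fetched pages compressed in memory. In the architecture
    above, an external store such as Redis would play this role."""

    def __init__(self):
        self._store = {}

    def put(self, url, html):
        # Documents are stored compressed, as described in the design.
        self._store[url] = zlib.compress(html.encode("utf-8"))

    def get(self, url):
        blob = self._store.get(url)
        return None if blob is None else zlib.decompress(blob).decode("utf-8")

cache = DocumentCache()
cache.put("https://example.com", "<html><body>hello</body></html>")
```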
Next, a checksum of each downloaded document is compared to all the checksums present in a store
called ‘Doc FPs’ to see if the file has already been crawled. If the checksum already exists,
the document is discarded at this point.
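A sketch of this content-seen test, using a SHA-256 checksum as the document fingerprint (the report does not specify the hash function; SHA-256 is our assumption):

```python
import hashlib

seen_fingerprints = set()   # stands in for the 'Doc FPs' store

def content_seen(document):
    """Return True if an identical document was already crawled;
    otherwise record its fingerprint and return False."""
    fingerprint = hashlib.sha256(document.encode("utf-8")).hexdigest()
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False
```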
7. Storage
If the document passes the content-seen test in the previous module, it is saved into
persistent storage.
8. Processing Data
You can have multiple processors in your customized web crawler design, depending on what
you plan on doing with the crawler. All the processing is carried out on the cached document,
rather than the stored database since it’s easier to retrieve. The three most common processors
that are almost always present include:
The URL filter receives the set of unique, standardized URLs from the link extractor module.
Next, depending on how you’re using the web crawler, the URL filter keeps the links
that are required and discards the rest.
You can design a URL filter that filters by filetype. For example, a web crawler that crawls
only jpg files will keep all the links that end with ‘.jpg’ and discard the rest. Other than the
filetype, you can also filter links by their prefix or domain name. For example, if you don’t
want to crawl Wikipedia links, you can design your URL filter to ignore the links pointing to
Wikipedia.
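A sketch of such a filter in Python, with both filetype and domain rules (the function name and parameters are illustrative):

```python
from urllib.parse import urlparse

def url_filter(urls, keep_suffix=None, skip_domains=()):
    """Keep links that match the wanted filetype suffix and drop any
    link pointing at an excluded domain."""
    kept = []
    for url in urls:
        host = urlparse(url).netloc
        if any(host.endswith(domain) for domain in skip_domains):
            continue  # e.g. ignore links pointing to Wikipedia
        if keep_suffix and not url.endswith(keep_suffix):
            continue  # e.g. keep only '.jpg' links
        kept.append(url)
    return kept
```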
It is at this point that we can implement the robots exclusion protocol. Since the URL fetcher
will already have fetched a document called robots.txt and mapped the off-limits pages to the
URL list, the URL filter will discard all the links that the website does not permit
downloading. The output of the URL filter is all the URLs that we want to keep and pass on
to the URL frontier after some further processing.
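Python’s standard library already implements this check; a sketch using `urllib.robotparser` with an illustrative robots.txt body (a real crawler would fetch the file from the site itself):

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt body; a real crawler fetches this file
# from <site>/robots.txt before filtering its links.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def allowed(url):
    """True if robots.txt permits any crawler ('*') to fetch the URL."""
    return parser.can_fetch("*", url)
```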
URL De-Dup is typically implemented after the URL filter module. The stream of URLs
coming out of the URL filter might not be unique. You may have multiple URLs in the stream
that point to the same document. We wouldn’t want to crawl the same document twice, so a
De-Dup test is performed on each filtered link before passing it ahead. Your crawler should
ideally store a database of all the crawled URLs — let’s call it the URL set. Each URL to be
tested is checked against the URLs in the set to detect a repetition.
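A sketch of the De-Dup test with an in-memory URL set (the trailing-slash normalisation shown is deliberately simplistic):

```python
crawled_urls = set()   # the 'URL set' of everything already admitted

def is_new_url(url):
    """De-Dup test: normalise the URL, then admit it only once."""
    normalised = url.rstrip("/")   # deliberately simplistic normalisation
    if normalised in crawled_urls:
        return False
    crawled_urls.add(normalised)
    return True
```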
OVERVIEW
As you can see in the system design diagram, the loop is initiated through a set of ‘seed URLs’
that is created and fed into the URL frontier. The URL frontier applies algorithms to build
URL queues based on certain constraints, prioritization and politeness. The URL fetcher
receives the URLs waiting in the queue one by one, obtains the address for each from the DNS
resolver, and downloads the content from that page. The content is cached by module 5 for
easier access by the processors. It is also compressed and stored after passing the document
De-Dup (content-seen) test, which checks whether the content has already been crawled.
UML DIAGRAMS
USE CASE
SEQUENCE DIAGRAM
CLASS DIAGRAM
Create a working folder, for example:
Documents/python/scraping
Do not use any spaces in your folder names. If you must use punctuation, do not use
anything other than an underscore (_). It’s best if you use only lowercase letters. Change into
the folder:
cd Documents/python/scraping
This is how you install any Python library that exists in the Python Package Index. Pretty
handy. pip, the tool you just used, installs Python packages.
You should now be at the >>> prompt — the Python interactive shell prompt.
1. You imported two Python modules, urlopen and BeautifulSoup (the first two lines).
2. You used urlopen to copy the entire contents of the given URL into a new Python variable.
3. You used the BeautifulSoup function to process the value of that variable (the plain-text
contents of the file at that URL) through a built-in HTML parser called html.parser.
4. The result: all the HTML from the file is now in a BeautifulSoup object with the new
Python variable name soup. (It is just a variable name.)
5. Last line: using the syntax of the Beautiful Soup library, you printed the first h1 element.
The command soup.h1 would work the same way for any HTML tag (if it exists in the file).
heading = soup.h1
print(heading.text)
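Put together, the steps above look roughly like this; a `data:` URL stands in for a real page address so the sketch runs offline (with a live site you would pass its http(s) URL):

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# A data: URL stands in for a real address so the example runs
# offline; %20 is an encoded space.
url = "data:text/html,<html><body><h1>Product%20Page</h1></body></html>"

page = urlopen(url)                        # copy the contents of the URL
soup = BeautifulSoup(page, "html.parser")  # parse through html.parser
heading = soup.h1                          # the first h1 element
print(heading.text)
```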
Once we have extracted the information, we can use other Python commands (and libraries) to write the data into a
database, CSV file, or other usable format — and then we can search it, sort it, etc.
Many programming languages include objects as a data type. Python does, JavaScript does,
etc. An object is an even more powerful and complex data type than an array (JavaScript) or
a list (Python) and can contain many other data types in a structured format.
When you extract information from an object with a Beautiful Soup command, sometimes
you get a single Tag object, and sometimes you get a Python list (similar to an array in
JavaScript) of Tag objects. The way you treat that extracted information will
be different depending on whether it is one item or a list (usually, but not always,
containing more than one item).
SELENIUM
A way of automating and simulating a human browsing with a web browser can
be accomplished by using a tool called Selenium. It is primarily intended for testing
web applications, but it is a relevant choice for web scraping. Using the Selenium
WebDriver API in conjunction with a browser driver (such as ChromeDriver for the Google
Chrome browser) acts the same way as if a user manually opened up the browser to do the
desired actions. Because of this, loading and scraping web pages that make use of JavaScript
to update the DOM is not a problem. The Selenium WebDriver can be used from Java, Python,
C#, JavaScript, Haskell, Ruby and more.
Selenium scripts are built to automate tedious tasks using headless web browsers,
for example searching for questions on different search engines and storing the results in a
file by visiting each link. Such a task could take a long time for a human, but with
Selenium scripts it is easily automated.
Some of you may be wondering what a headless web browser is. It is simply a
browser that can be controlled by Selenium scripts for automating web tasks.
Selenium scripts can be programmed in various languages such as JavaScript, Java,
Python, etc.
Whichever operating system you are using, the command for installing the Selenium
library is the same.
First Method
Open a terminal (or cmd) and run:
pip install selenium
Second Method
Alternatively, you can download the source distribution here, unarchive it, and run the
command below:
REQUESTS
It is a Python module with which you can send HTTP requests to retrieve content.
It helps you access a website’s HTML content or an API by sending GET or POST requests.
PANDAS
pandas is another multi-purpose Python library used for data manipulation and indexing. It
can be used to scrape the web in conjunction with Beautiful Soup. The main benefit of using
pandas is that analysts can carry out the entire data analytics process using one language
(avoiding the need to switch to other languages, such as R).
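A small illustration of pandas in this project’s setting, with made-up prices for one product across three sites:

```python
import pandas as pd

# Prices scraped for one product across three sites (values made up).
prices = pd.DataFrame({
    "site": ["Flipkart", "Amazon", "Snapdeal"],
    "price": [1299, 1249, 1325],
})

cheapest = prices.loc[prices["price"].idxmin()]
print(cheapest["site"])            # the site offering the lowest price
prices.to_csv("prices.csv", index=False)
```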
TOOLS
Several software tools are available that can be used to configure web-crawling
strategies. Such a program can automatically recognise a page’s data structure, or
provide a recording interface that removes the need to write web-scraping code
manually, along with parsing functions that can be used to retrieve and transform content,
and spreadsheet applications that can archive the scraped information.
There are different kinds of websites from which web scraping is done.
Websites can be divided into small, average and large based on how many users visit them.
IMPLEMENTATION
In essence, web scraping is used to fetch unstructured data from web pages and transform it into a structured presentation, or for storage in an external database. It is also considered an efficient technique for collecting big data, where gathering large amounts of data is important. Search engines use web scraping in conjunction with web crawling to index the World Wide Web, with the purpose of making its vast number of pages searchable. The crawlers, also called spiders, follow every link they can find and store them in their databases. On every website, metadata and site contents are scraped to determine which site best fits the user's search terms. One example of a way to rank the pages is the PageRank algorithm, which looks at how many links go out from a website and how often the website is linked to from elsewhere.
The proposed system works as follows. The backend consists of two important techniques: web crawling and web scraping. Web scraping is a technique used to extract information in a human-readable format and display it on the destination terminal. Before the output can be scraped, web crawlers are responsible for navigating to the destination; once the crawler reaches the correct page and matches it with the products, the scraping process starts. Web scraping essentially consists of two tasks: first, loading the desired web page, and second, parsing the page's HTML to locate the intended information. In this system, scraping is done in Python, as it provides a rich set of libraries for these tasks: requests is used to load the URLs and the Beautiful Soup library is used to parse the web pages. After the product information has been scraped from the different e-commerce websites, the data is displayed on the website. The front end consists of the main website. The client searches for the required product in the search bar, and the query is run against the local database, sqlite3. The website is built with the Django web framework, which is written in Python. The matching results are retrieved and displayed on the main website, where the client can compare the prices of products available on the e-commerce sites. As soon as the client selects what is, in their view, the best deal, they are redirected to the original e-commerce website. Another feature is a price alert, which the user can set in order to be notified by the website whenever a suitable price comes up.
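The database lookup step can be sketched roughly as follows (the table schema and product values are assumptions made for illustration; the real site would issue the equivalent query through Django):

```python
import sqlite3

# in-memory stand-in for the site's local sqlite3 database (schema is assumed)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, site TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("Phone A", "Flipkart", 9999.0), ("Phone A", "Amazon", 10499.0)],
)

# the search-bar query: fetch matching products, cheapest first
rows = conn.execute(
    "SELECT site, price FROM products WHERE name LIKE ? ORDER BY price",
    ("%Phone A%",),
).fetchall()
print(rows)  # [('Flipkart', 9999.0), ('Amazon', 10499.0)]
```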
The three phases that make up web scraping are:
Fetching phase
First, in what is commonly called the fetching phase, the web site that contains the relevant data has to be accessed. This is done via HTTP, the Internet protocol used to send requests to and receive responses from a web server; it is the same technique web browsers use to access web page content. Tools such as curl and wget can be used in this phase, sending an HTTP GET request to the desired location (URL) and getting the HTML document back in the response.
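The request sent in this phase is plain text; a minimal GET request (the host and path are placeholders) looks like this:

```python
# the literal text of a minimal HTTP GET request, as curl or wget would send it
request_text = "\r\n".join([
    "GET /search?q=phone HTTP/1.1",   # method, path and protocol version
    "Host: example.com",              # placeholder host
    "",                               # blank line ends the header block
    "",
])
print(request_text)
```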
Extraction phase
Once the HTML document is fetched, the data of interest needs to be extracted. This second phase is called the extraction phase, and the technologies used are regular expressions, HTML parsing libraries and XPath queries. XPath stands for XML Path Language and is used to locate information in documents.
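As a minimal extraction sketch (the HTML fragment and class names are invented for illustration), Beautiful Soup can pull individual fields out of a fetched document:

```python
from bs4 import BeautifulSoup

# a small fragment standing in for a fetched product page (markup is invented)
html = ('<div class="product">'
        '<span class="title">Phone A</span>'
        '<span class="price">9999</span>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

title = soup.select_one(".title").get_text()
price = soup.select_one(".price").get_text()
print(title, price)  # Phone A 9999
```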
Transformation phase
Now that only the data of interest is left, it can be transformed into a structured form, either for storage or for presentation.
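A sketch of this phase (the field names and values are illustrative): the pairs extracted in the previous phase become rows in a CSV table.

```python
import csv
import io

# extracted (title, price) pairs from the extraction phase (values illustrative)
records = [("Phone A", "9999"), ("Phone B", "12499")]

# write them out as structured CSV, header row first
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["title", "price"])
writer.writerows(records)
print(buf.getvalue())
```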
Technique used
There are several approaches one can take when implementing a web scraper. A common path is to use libraries: the web scraper is developed like any other software program, in a programming language of choice. Popular languages for building web scrapers include Java, Python, Ruby and JavaScript (in the Node.js framework). Programming languages usually offer libraries that use the HTTP protocol to fetch the HTML of a web page; popular tools for this include curl and wget. Regular expressions or parsing libraries can then be used to parse the HTML.
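A regular-expression parse, for instance, can be sketched like this (the HTML fragment is invented; real pages are better handled by an HTML parser, since regexes are fragile against markup changes):

```python
import re

# pull the price out of an invented HTML fragment with a regular expression
html = '<span class="price">9,999</span>'
match = re.search(r'class="price">([^<]+)<', html)
print(match.group(1))  # 9,999
```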
CODE
import csv
import json
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen as uReq
from msedge.selenium_tools import Edge, EdgeOptions
# (req_data, d and data_list are built earlier in the original listing)
        jd = req_data[i]["widget"]["data"]["products"]
        for j in range(len(jd)):
            jd2 = jd[j]["productInfo"]["value"]
            d["title"] = jd2["titles"]["title"]
            d["keySpecs"] = jd2["keySpecs"]
            d["rating"] = jd2["rating"]["average"]
            d["ratingCount"] = jd2["rating"]["count"]
            d["price"] = jd2["pricing"]["finalPrice"]["value"]
            d["url"] = jd2["smartUrl"]
            data_list.append(d)
    except:
        pass

# dump the scraped data to the flipkart.json result file
with open("flipkart" + ".json", "w") as fp:
    json.dump(data_list, fp)

# now write the same data to a CSV file
data_file = open("flipkart" + ".csv", "w")
csv_writer = csv.writer(data_file)
# counter used to write the headers to the CSV file only once
count = 0
for data in data_list:
    if count == 0:
        # writing headers of CSV file
        csv_writer.writerow(data.keys())
    count += 1
    # writing data of CSV file
    csv_writer.writerow(data.values())
data_file.close()
# (names and the query string q are defined earlier in the original listing)
with open("flipkart.csv") as csv_file:
    reader = csv.reader(csv_file, delimiter=",")
    rows = list(reader)

i, j = 0, 2
while i in range(len(rows)):
    try:
        name = rows[j][0]
        names.append(name)
        i += 1
        j += 2
    except:
        break
print("Best results", len(names))

if len(names) < 10:
    # too few cached results: clear the file and scrape the search page afresh
    f = open("flipkart.csv", "w")
    f.truncate()
    f.close()
    flipkart_url = "https://www.flipkart.com/search?q=" + q
    print(flipkart_url)
    uClient = uReq(flipkart_url)
    flipkartPage = uClient.read()
    uClient.close()
    soup = BeautifulSoup(flipkartPage, "html.parser")
    info = soup.select("[class~=s1Q9rs]")
    if info == []:
        info = soup.select("[class~=IRpwTa]")
    flipPrices = soup.select("[class=_30jeq3]")
    prodNames = [i.get("title") for i in info]
    names = prodNames
    df = pd.DataFrame(list(zip(prodNames, flipPrices)),
                      columns=["product_name", "Flipkart_price"])
    df.to_csv("test.csv")
    print(df)
else:
    with open("flipkart.csv") as csv_file:
        # (handling of the cached CSV continues on the next page of the listing)
        pass

i = 0
while i in range(len(names)):
    print(names[i])
    i += 1
# (headers, amazonlist and amazonName are defined earlier in the original listing)
def amazon(name):
    try:
        global amazon
        name = " ".join(name.split(" ")[0:2])
        name1 = name.replace(" ", "-")
        name2 = name.replace(" ", "+")
        amazon = f"https://www.amazon.in/{name1}/s?k={name2}"
        res = requests.get(amazon, headers=headers)
        print("\nSearching in amazon:")
        soup = BeautifulSoup(res.text, "html.parser")
        amazon_page = soup.select(".a-color-base.a-text-normal")
        amazon_page_length = len(amazon_page)
        for i in range(amazon_page_length):
            name = name.upper()
            amazon_name = amazon_page[i].getText().strip().upper()
            if name in amazon_name[0:20]:
                amazon_price = soup.select(".a-price-whole")[i].getText().strip()
                amazonlist.append(amazon_price)
                print("Amazon:")
                print(amazon_name)
                amazonName.append(amazon_name)
                print("₹" + amazon_price)
                print("-----------------------")
                break
            else:
                if i + 1 == amazon_page_length:
                    print("amazon : No product found!")
                    print("-----------------------")
                    amazon_price = "0"
                    amazonlist.append(amazon_price)
                    amazonName.append("No similar product")
                    break
        return amazon_price
    except:
        pass
df = pd.DataFrame(list(zip(amazonName, amazonlist)),
                  columns=["Product_name", "Amazon_price"])
print(df)

# strip the rupee sign from the Flipkart price tags
test = flipPrices
flip = flipPrices
idk = []
for i in range(len(flip)):
    try:
        x = flip[i].text.replace("₹", "")
        print(x)
        idk.append(x)
    except:
        idk = test

# write the combined Flipkart and Amazon data to a CSV file
file = open("flipkartandamazon.csv", "w+", newline="")
with file:
    write = csv.writer(file)
    write.writerows(names)
    write.writerows(idk)
    write.writerows(amazonName)
    write.writerows(amazonlist)

df = pd.DataFrame(list(zip(names, idk, amazonName, amazonlist)),
                  columns=["Product_name_Flipkart", "Flipkart_price",
                           "Product_name_Amazon", "Amazon_price"])
print(df)
RESULT
The overall results of the project turn out to be helpful for understanding the prices of products. The web scraper extracted the data and saved it in CSV file format. The script written to extract the data located each of these sources with great ease. Moreover, the analysis performed identified the most-rated product on each site, based on the products' review counts.
CONCLUSION
The main outcomes of this project were a user-friendly search interface, indexing, query processing, and an effective data-extraction technique based on web structure. Web scraping lets us access large-scale product data and also helps in obtaining data in a readable format, as required.
This project presents a survey of web scraping technology: what it is, how it works, the popular tools and technologies of web scraping, the websites on which the technology is used, and the top fields making use of it.
The data could be used in real time to keep pricing in line with rival companies, or to track the misuse of data and illegal sales.
A further purpose of this project was to evaluate state-of-the-art web scraping tools against established software-evaluation metrics, so that a recommendation based on those tools can be presented with these metrics in mind.
The website provides users with useful information that will help them make informed decisions. This price comparison website solves working people's problem of checking prices before buying products. It will let users analyse the prices present on different e-commerce shopping websites, so that they learn the cheapest price for a product along with the best deal. This will surely save buyers' effort and valuable time. Ultimately, it will bring together strategies, best offers and deals from all leading online stores and help buyers shop online.
FUTURE ENHANCEMENT
As we go forward, marketing will become an even more competitive exercise. Those who wish to arrive at a suitable marketing strategy will need to derive deeper insights about the market and base their marketing decisions on data more than on other considerations.
For this reason, the future of marketing is closely linked with comparison of product prices aggregated from various media sites, social media platforms, web traffic and so on.
At present, a trend has started in which sentiment analysis plays a part in arriving at a strategy. In future, its role in decision making is set to increase many times over; it will become an integral part of policy framing and strategic planning in all fields.
To put it in perspective, suppose a company launches a new product. How is it going to analyse the product's efficacy in future? How will it derive insights about the product design or the service provided? Right now there are companies that analyse user comments and feedback to learn something new about their products, but the practice is not yet widespread. In future, the scraping of user reviews, product feedback and service feedback will grow manifold, and sentiment analysis using web scraping will become a vital driver of policy and strategy. Companies that invest in web scraping will reap huge dividends in terms of sentiment analysis and rich insights into customer expectations and overall customer behaviour.
Whether perfectly legal or not, web scraping has grown into an essential requirement for a whole set of Internet stakeholders. Starting with Google, everyone needs data to process, analyse and streamline information. The world of business has become more dynamic, responding to change immediately and at times frequently. Prices keep fluctuating on e-commerce websites, and a number of businesses are keenly watching and analysing this data to rework their own strategies.
BIBLIOGRAPHY
[1] Renita Crystal Pereira and Vanitha T., "Web Scraping of Social Networks," International Journal of Innovative Research in Computer and Communication Engineering, vol. 3, pp. 237-239, Oct. 2018.
[2] Kaushal Parikh, Dilip Singh, Dinesh Yadav and Mansingh Rathod, "Detection of Web Scraping Using Machine Learning," Open Access International Journal of Science and Engineering, vol. 3, pp. 114-118, 2018.
[3] Anand V. Saurkar, Kedar G. Pathare and Shweta A. Gode, "An Overview on Web Scraping Techniques and Tools," International Journal on Future Revolution in Computer Science & Communication Engineering, vol. 4, pp. 363-367, 2018.
[4] Federico Polidoro, Riccardo Giannini, Rosanna Lo Conte, Stefano Mosca and Francesca Rossetti, "Web Scraping Techniques to Collect Data on Consumer Electronics and Airfares for Italian HICP Compilation," Statistical Journal of the IAOS, pp. 165-176, 2015.
[5] Jan Kinne and Janna Axenbeck, "Web Mining of Firm Websites: A Framework for Web Scraping and a Pilot Study for Germany," 2019.
[6] Ingolf Boettcher, "Automatic Data Collection on the Internet," pp. 1-9, 2015.
[7] Erin J. Farley and Lisa Pierotte, "An Emerging Data Collection Method for Criminal Justice Researchers," Justice Research and Statistics Association, pp. 1-9, 2017.