
A

Mini Project Report


On
TO COMPARE THE PRICE OF A PRODUCT
USING WEB SCRAPING
Submitted in partial fulfilment of the requirements for the award of the degree
of

BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
Submitted By:
T. TUSHARA PRIYA (19UP1A05E8)
K. SATHWIKA (19UP1A05C4)
U. SAHITHI (19UP1A05F1)

Under the guidance of

Mrs. P. ARCHANA

Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


VIGNAN’S INSTITUTE OF MANAGEMENT AND TECHNOLOGY FOR WOMEN

(Affiliated to JNTUH, Hyderabad, Accredited by NBA)


Kondapur(v), Ghatkesar(M), Medchal-Malkajgiri(D)-501301
[2022-2023]

VIGNAN’S INSTITUTE OF MANAGEMENT AND TECHNOLOGY FOR WOMEN

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE
This is to certify that the Project work titled “TO COMPARE THE PRICE OF A PRODUCT USING
WEB SCRAPING” submitted by

T. TUSHARA PRIYA (19UP1A05E8),

K. SATHWIKA (19UP1A05C4),

U. SAHITHI (19UP1A05F1),

in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer
Science and Engineering at Vignan’s Institute of Management and Technology for Women, is a record of
bonafide work carried out by them under my guidance and supervision. The results embodied in this project
report have not been submitted to any other university for the award of any degree, and the results achieved
are satisfactory.

INTERNAL GUIDE HEAD OF THE DEPARTMENT


Mrs. P. ARCHANA Mrs. M. PARIMALA
(Assistant Professor) (Associate Professor)

(External Examiner)

VIGNAN’S INSTITUTE OF MANAGEMENT AND TECHNOLOGY FOR

WOMEN

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

DECLARATION

We hereby declare that the project entitled “TO COMPARE THE PRICE OF A PRODUCT USING
WEB SCRAPING” is bonafide work duly completed by us. It does not contain any part of a project
submitted by any other candidate to this or any other institute or university. All materials that have
been obtained from other sources have been duly acknowledged.

T. TUSHARA PRIYA (19UP1A05E8)


K. SATHWIKA (19UP1A05C4)
U. SAHITHI (19UP1A05F1)

ACKNOWLEDGEMENT
We would like to express our sincere gratitude to Dr. G. Appa Rao Naidu, Principal, Vignan’s Institute of
Management and Technology for Women, for his timely suggestions, which helped us to complete
the project in time.

We would also like to thank Mrs. M. Parimala, Head of the Department, Computer
Science and Engineering, for providing us with constant encouragement and resources, which helped
us to complete the project in time.

We would like to thank our project guide, Mrs. P. Archana, Assistant Professor, Computer Science
and Engineering, for her timely cooperation and valuable suggestions throughout the project. We
are indebted to her for the opportunity given to work under her guidance.

Our sincere thanks to all the teaching and non-teaching staff of the Department of Computer Science
and Engineering for their support throughout our project work.

T. TUSHARA PRIYA (19UP1A05E8)


K. SATHWIKA (19UP1A05C4)
U. SAHITHI (19UP1A05F1)

INDEX

CONTENTS
ABSTRACT
LIST OF FIGURES
1. INTRODUCTION
1.1 Web Scraping
1.2 Existing System
1.3 Proposed System
2. LITERATURE SURVEY
2.1 About the Project
2.2 How It Benefits the Business
2.3 Challenges
3. METHODOLOGY
4. SYSTEM REQUIREMENTS
4.1 Hardware Requirements
4.2 Software Requirements
5. SYSTEM ARCHITECTURE
5.1 System Design
6. UML DIAGRAMS
6.1 Use Case Diagram
6.2 Sequence Diagram
6.3 Class Diagram
7. SOFTWARE MODELLING AND SETUP
7.1 Installation
7.2 Setting Up the System
8. IMPLEMENTATION
8.1 Code
8.2 Results
9. CONCLUSION
10. FUTURE SCOPE/ENHANCEMENT
11. BIBLIOGRAPHY

ABSTRACT
Web scraping is basically an automated method for browsing websites and other online
sources and accessing their data. It uses software engineering techniques and custom software
programming to extract data or any other content from online sources, copying the information and
saving it in an external archive for review. Web scraping is often called automated data gathering,
web data extraction, web crawling, or content mining. Web scraping has possibly existed since before
the start of the World Wide Web, but it has been used mainly in the context of data analytics, and is
generally associated with e-commerce.

The web scraping technique provides a broad collection of options and can serve
various purposes. A web scraper's least demanding use is to automate the normally manual work
of gathering price quotes and website article details. A more ambitious use is to discover formerly
inaccessible sources of price data, and to include a survey of all accessible price information. The
scraping process is performed using different technologies, which can be automated application tools
or manual methods. This paper provides an overall review of web scraping technology, how it is
carried out, and the effects of this technology.

LIST OF FIGURES

Fig 1: Web Scraping Technology
Fig 2: Comparative Study
Fig 3: Phases of Web Scraping
Fig 4: Flipkart Website
Fig 5: Amazon Website
Fig 6: Inspecting Page
Fig 7: Importing Libraries
Fig 8: Block Diagram
Fig 9: System Architecture
Fig 10: Design Diagram
Fig 11: Use Case Diagram
Fig 12: Sequence Diagram
Fig 13: Class Diagram
Fig 14: Fields Including Web Scraping
Fig 15: Web Scraping Process
Fig 16: Entering Required Product Name
Fig 17: Data Stored in CSV
Fig 18: Result Screen
Fig 19: Future Benefits


INTRODUCTION
Web scraping lets us collect data from web pages across the internet. In this
project the script searches for a product via URL and finds the price of the product. This
project is particularly useful when we want to monitor the price of a specific item on
multiple e-commerce platforms. Here, in this project we have three major e-commerce
websites from which to find the price of the product. On each execution, all the websites are crawled,
the product is located, and the price of the same product from all the sources is obtained and
displayed on the console window. So the buyer can see the prices and decide to
buy from the platform which offers the lowest price.

Fig 1: Web Scraping Technology

Web scraping consists of gathering data available on websites. This can be done manually by a
human user or by a bot. The latter can of course gather data much faster than a human user, and
that is why we are going to focus on it. It is therefore technically possible to collect all the data of a
website in a matter of minutes with this kind of bot. The legality of this practice is not well defined,
however; websites usually describe in their terms of use and in their robots.txt file whether they
allow scrapers or not. Web scrapers gather website data in the same way a human would do it: the
scraper goes onto a web page of the website, gets the relevant data, and moves forward to the next
web page.


Every website has a different structure; that is why web scrapers are usually built to explore
one website.

Web scraping was not initially developed for social science research; as a
result, analysts using this method may incorporate unknown assumptions into their work.
Because web scraping does not usually require direct contact between the analyst and those who
originally collected the information and placed it online, data analysis issues may
easily arise. Research teams using web scraping techniques as an information gathering
method still have to be acquainted with the accuracy and correct analysis of the details
retrieved from the website. One final problem analysts must address is the potential effect of
web scraping on a website's functionality, as certain web scraping actions can unintentionally
overload and shut down a web page. A web scraper which is appropriately designed and
executed could help analysts overcome obstacles to data access, gather online information
more efficiently, and eventually answer research questions that cannot be answered by
conventional means of collection and examination. Figure 1 above shows an overview
of how web scraping is done.

Web scraping tools are used to derive information from web hosts, and among
their uses are web indexing, web mining and data mining, online price-change monitoring and
price comparison, product review scraping (to watch the competition), gathering real-estate
listings, weather data monitoring, website change detection, research, tracking online presence
and reputation, web mashups, and web data integration. Pages are built using text-based markup
languages (HTML and XHTML), and frequently contain a wealth of useful information in text
form. However, most web pages are designed for human end users and not for simplicity of
automated use. Thus, toolkits that scrape web information were made.

There are a variety of competitor price-monitoring services that allow
you to compare prices across retailers, and many of them are presented as easy-to-use
tools. Price comparison engines scrape and gather data from multiple sites about
products and services, including descriptions, features, and reviews. These details are
then used on price comparison websites to tailor results based on the visitors’
interests. Customers can compare listings of the same product from different vendors
on the platform after searching for a product on the website. After comparing the
listings, the buyer can decide which deal is best for them. These algorithms use data
as input. Due to e-commerce sites’ dynamic pricing, extraction and updating of this
data are challenging: prices change frequently, and that artificially complicates things.
Scraping social media websites is another way to gather price intelligence.

EXISTING SYSTEM
In the existing system, the manual web data extraction process has two major
problems. Firstly, it does not scale cost-efficiently, and costs can escalate very quickly: the data
collection costs increase as more data is collected from each website, and in order to conduct
manual extraction, businesses need to hire a large number of staff, which increases the cost of
labour significantly. Secondly, manual extraction is known to be error-prone. Further, if
any business process is very complex, then cleaning up the data can get expensive and time-
consuming.

The existing system doesn’t enable us to rapidly scrape many websites at the
same time without having to watch and control every single request. It is not easy to
implement, which means that with a one-time investment the data cannot be collected. Competitor
monitoring is also hard: it is not easy to monitor the competitors in the market and the business world.

The world of retail is changing rapidly. Many brick-and-mortar locations are
closing and being replaced by online stores, direct-to-consumer brands, and subscription
services. However, while breadth of assortment is something that drives customers to a
website, a lot of e-commerce platforms fail to sell through a high percentage of merchandise.

DISADVANTAGES

• The existing system doesn’t enable us to rapidly scrape many websites at the same
time without having to watch and control every single request.

• Manual collection is slow: work that a scraper, once set up, could finish for a whole
website within an hour or much less would take a single person a week to complete.

• It is not easy to implement, which means that with a one-time investment, the data cannot
be collected.

• Competitor monitoring: it is not easy to monitor the competitors in the market and
the business world.


PROPOSED SYSTEM
To find the right price, you need to understand and be able to predict how
your customers react to price changes. Web scraping allows you to compare the prices of the
products that you want to buy, track how customers are reacting to changes in your
competitors’ prices, or tweak your own prices and monitor how it affects sales.

Web scraping also lets you create applications for tools that don’t have a public developer API. Web
scraping services provide an essential service at a low cost. The advantage of web scraping is
that it is time-efficient and low-maintenance. For example, downloading a large dataset may take hours,
and then analyzing every single row manually could take an entire month.

Web scraping services provide essential services at a competitive cost
because they are much cheaper than hiring a company to perform the same task. By monitoring
listings and sales data, web scraping allows you to see how well different products are performing;
keeping track of your business has never been easier. No humans are involved in
this process, and since simple errors in data extraction may lead to major issues, it is
necessary to ensure that the data is accurate: web scraping is not only a fast process,
but a very accurate one too.

ADVANTAGES

1. Time Efficiency

The advantage of web scraping is that it is time-efficient and low-maintenance. For example,
downloading a large dataset may take hours, and analyzing every single row manually could
take an entire month.

2. Complete Automation

• Some advantages of automation are that it doesn’t get bored or tired, does not require
any breaks, and never gets distracted; it simply follows the given instructions.

• While humans have advantages in tasks like analysis, running an algorithm across a large
dataset is faster and more effective than having someone manually read through every
document one by one.


3. Cost Efficiency

• Web scraping services provide essential services at a competitive cost because they are
much cheaper than hiring a company to perform the same task.

4. Track product performance

• By monitoring listings and sales data, web scraping allows you to see how well different
products are performing. Keeping track of your business has never been easier.

5. Data Accuracy

• No humans are involved in this process, and simple human errors in data extraction
may lead to major issues. Web scraping is not only a fast process, but it’s also very
accurate, which helps ensure that the data is accurate.


LITERATURE SURVEY
To know how the data extraction process has evolved so much, one must
understand the techniques involved in web scraping; scraping has
been around nearly as long as the web. The motivation behind commercial web scraping has
always been to gain a simple business advantage, and it includes things like
undercutting a competitor's special pricing, taking leads, hijacking marketing campaigns,
redirecting APIs, and the outright theft of content and information.

The first aggregators and comparison engines appeared hot on the heels
of the e-commerce boom and operated largely unchallenged until the legal
challenges of the mid-2000s. Early scraping tools were really basic: manually
copy-pasting anything visible from the site. When software engineers got involved,
scraping graduated to the Unix grep command or regular-expression matching techniques,
to posting remote HTTP requests using socket programming, and to parsing sites using
data query languages. Today, in any case, it's an altogether different story: web scraping is
big business with powerful tools and services to match.
Extraction and analysis of information are widely used by digital
publishers and directories, travel, real estate, and e-commerce. Analytics and
computing, in turn, go way back to advances in storage mechanisms and the invention
of real databases: data came to be seen and dealt with as information to be prepared for
analysis. The pivotal turning point was the appearance of the relational database (RDB) during
the 1980s, which enabled users to write SQL to retrieve data from the
database. For users, the advantage of RDB and SQL is the ability to analyse
their data on demand; it made the process of getting data simple and spread database use.
Data warehouses differ from regular relational databases in that they are
generally optimized for response time to queries. The development of data
mining was made possible thanks to database and data warehouse advancements, which
enable organizations to store more data and still analyse it in a reasonable manner. A general
commercial pattern developed, where services began to "foresee" customers' potential
needs based on analysis of their historical purchasing patterns.


1. Paper title: Data Analysis by Web Scraping Using Python. Authors: David Mathew Thomas, Sandeep Mathur. Method: Python, web scraping (Beautiful Soup), implementing web scrape. Advantage: easy to implement. Disadvantage: time consuming.

2. Paper title: Web Scraping Using Python. Author: Ryan Mitchell. Method: Python, web scraping (Beautiful Soup). Advantage: good approach explained. Disadvantage: difficult to understand.

3. Paper title: Successfully Scrape Data from Any Website. Author: Richard Lawson. Method: Python, web scraping (Beautiful Soup). Advantage: low maintenance and speed. Disadvantage: protection policies.

Fig 2: Comparative study

Paper 1: Compare the Price of Products by Web Scraping Using Python

This paper depicts a standard data analysis based on the user requirements. The method
is divided into three parts: the web scraper first draws the desired links from the web, and
afterwards the information is extracted (scraped) to get the data from the source; lastly it
stores the information in a CSV document. Because of the huge community and library
resources for Python and the simplicity of the coding style of the Python language, it is the most
suitable one for scraping the desired information from the target website [1].

Paper 2: Web Scraping Using Python

This book teaches web scraping and crawling techniques to access unlimited data from any web
source in any format. Ideal for programmers, security professionals, and web administrators
familiar with Python, it teaches not only basic web scraping mechanics, but also
digs into more advanced topics, such as analyzing raw data or using
scrapers for frontend website testing [2].

Paper 3: Web Scraping with Python: Successfully Scrape Data from Any
Website

The Internet contains the most useful collection of information ever assembled,
largely openly accessible free of charge. However, this information isn't easily
reusable: it is embedded within the structure and style of websites and must be carefully
extracted to be useful. Web scraping is becoming increasingly valuable as a way to
easily gather and organize the wealth of information available on the web. Using a
straightforward language like Python, you can crawl the data out of complex websites using
simple programming [3].

ABOUT THE PROJECT

Web scraping can be performed by teams in the steps shown in figure 3. The
scraping organisation takes from the clients the details of the websites from which the data is to be
extracted, and analysis is done by the experts. Then they get it approved by the clients. After
approval, the extraction process is performed for the required data along with data configuration,
and then the final information is delivered to the client, followed by collecting feedback.


Fig 3: Phases of Web Scraping

How extracting data from e-commerce sites benefits businesses
While a price comparison and tracking system may be difficult to implement initially, it
comes with multiple benefits:

Staying Ahead of Your Competition

Staying ahead of your competition in an industry that rewards the winner
tremendously means that analysing and monitoring competitors becomes an important
factor. The e-commerce industry depends on its consumer base, and to ensure its growth it
has to provide optimum services. By scraping product prices from different websites and then
deciding on a price for your own website, you ensure that there is an improvement in
sales.

For example, suppose you are selling a frying pan from a reputed company. You will
scrape prices of the same product across different websites to understand the market value
and specify an attractive price on your website. Consumers will flock to your website if you
can provide a comparatively lower price than most other competitors. Amazon almost forced
Quidsi to merge itself with the retail giant after an aggressive price war over diaper selling that
lasted years.

However, in cases where prices change very frequently, such as the Sensex or flight
prices, ordinary automated scraping bots fall short because the pricing data is updated every second.
This is why real-time web scraping is important. With a real-time scraper, there are no intervals
between scraping sessions: the data is extracted as soon as it gets updated, which ensures
quick responsive action and better analysis. This also means that the extracted data need not
be stored, as everything happens in real time.

High-Quality Lead Generation

The growth of a business rests on the shoulders of effective marketing. However, for
marketing efforts to bear fruit, the business needs to generate leads. Web scraping can
collect high volumes of data, which will subsequently trigger lead generation. Through its
surgical precision, it can generate lead data quickly and accurately. Plus, this information will
be in CSV or similar formats, which can be easily processed or integrated with other tools.
Analyse & Predict Market Trends

Sometimes, the market is not as black and white as selling woollens during winters. E-
Commerce is transforming rapidly, and you need to keep up with it. When it comes to
finalizing sales, timing is everything. Scraping e-commerce sites and monitoring similar or
competitor products over several months can help provide insights on a specific market and
product trends. These data points can help you predict the best time to launch a product and at
the most optimal price. Competitive pricing and in-season launch will result in a magical
recipe that will boost sales. Further, depending on the prevailing or projected market trends,
you can effectively manage the stock and inventory of your products.

Determining Pricing Strategy from Scraping eCommerce Websites

Before completing a purchase, a single customer browses thousands of products. To get a
customer to make a purchase, you will need a good pricing strategy. By offering the right
prices, you can attract more buyers. For example, suppose you are a customer on the hunt for
olive oil. While you will look for good brands with good reviews of the product, you
will also compare the prices of products across popular websites. After doing so, you will

select the one priced the cheapest while simultaneously ensuring good quality. Thus, if you
are an eCommerce website, it is this consumer mentality that will drive your business.

Offering sales, buy-one-get-one-free deals, and exchange offers will also bring in more
traffic to your website. When you know competitor prices, it becomes easier for you to make
logic-backed decisions. Through optimal pricing, brand reputation is also improved, and thus
you gain more customers.

Analysing the Scraped Data

Finding the right consumer preference will take time. It cannot be done by sitting in a boardroom
and deciding the preferences for the customers. It is only through data analytics that companies can
stay connected with their customers. Integrating your analytics with the web-scraped data will
allow you to implement data-backed decisions. With data at your disposal, you will begin to
understand customer preferences and the quality that is demanded at each price point.

Practices like mailing customers when something they are eyeing becomes
cheaper can ensure a purchase. But such practices need tracking of conversion rates,
compared against the loss or profit that you end up making with such aggressive pricing
strategies. Using historical pricing data to analyse and forecast future trends, and then
stocking up accordingly, can also mean more business for you.

Challenges with large-scale data extraction and product data scraping

When it comes to web scraping, not everything is a bed of roses; it has its fair share of thorns
too. E-Commerce websites, especially your competitors, do not want you stealing
information from their websites. And as web scrapers get better and more effective at
extracting product data, the website admins are also coming up with creative ways of
thwarting such attempts.
Here are some of the challenges that might keep you from using web scrapers:

1. Site Design and Layout Changes


A web scraper is based on the structure of the website. However, this structure is prone to
changing and that too often, which could be a pain point for web scraping companies.


Whether it’s intentional, or just amateur coding standards, an e-commerce website may be
difficult to navigate with bots due to the design and structure, or the ever-changing layout of
the website. Keeping up with all these changes requires time and effort.
2. Use of Unique Elements

Modern elements in website design can enhance a site's responsiveness. However, there is
a trade-off, as web scraping becomes even more difficult. Design elements introduce
complexities that might sometimes slow down or interrupt efficient scraping of data.
In addition to these modern elements, the inclusion of dynamic content that makes
use of transitions like lazy-loading images, "show more info" buttons, and infinite scrolling
makes it difficult for the scraper to read the data.

3. Use of Anti-Scraping Technologies


Websites may use multiple security protocols and techniques to block potential
scraping attempts. Some of these techniques include content copy protection, using
JavaScript for rendering content, user-agent validations, etc.
Additionally, websites can track which IP address your requests are coming from.
If they flag any request as suspicious (e.g., sending too many requests within a short time),
they might ban the particular IP address from sending further requests. The issue worsens
with the fact that you cannot mask your IP address because websites can detect and block IP
addresses from well-known rotating IP providers as well.

4. Honey Pot Traps

Websites responsible for storing sensitive data ensure the protection of information through
Honey Pot traps, which can detect scrapers and crawlers. Through this method, they
strategically place invisible links on a webpage that are not meant for visitors but are present
for scrapers. These are specially designed to trap and block web scrapers and bots as soon as
they attempt to crawl them. On setting off the trigger, the IP address corresponding to the
scraper is instantly blocked.

5. Use of CAPTCHAs
Fun Fact: The technology behind CAPTCHA (Completely Automated Public
Turing Test to Tell Computers and Humans Apart) is based on the Turing Test, which can
test whether a machine can think like humans!


The very role of CAPTCHA is to block automated scripts from performing repetitive
actions on a website. It essentially brings an element of randomness into an otherwise
predictable workflow. Web scrapers are tasked to decipher images containing distortions and
randomness. Solving captchas is something that a robot cannot perform successfully!


METHODOLOGY

The methodology used for the project is to gather all the data extracted from various sources
using a web crawler driven by scripts written in the Python language, and to further analyse it
as per the requirements of the customer, with the data stored in a CSV file.

Step - 1: Find the desired URL to scrape

The initial step is to find the URL that you want to scrape. Here we are extracting product
details from Flipkart and Amazon. The URLs of these pages are https://www.flipkart.com and
https://www.amazon.com/

Fig 4: Flipkart Website


Fig 5: Amazon Website

Step - 2: Inspecting the page


It is necessary to inspect the page carefully because the data is usually contained within
tags. So, we need to inspect the page to select the desired tag. To inspect the page, right-click on
the element and click "Inspect". Before coding your web scraper, you need to identify
what it has to scrape. Right-clicking anywhere on the frontend of a website gives you the
option to ‘Inspect element’ or ‘View page source.’ This reveals the site’s backend code,
which is what the scraper will read.

Fig 6: Inspecting page

Step - 3: Find the data for extracting

Extract the price and name, which are contained in "div" tags. You’ll need
to identify where these are located in the backend code. Most browsers automatically
highlight selected frontend content with its corresponding code on the backend. Your aim
is to identify the unique tags that enclose (or ‘nest’) the relevant content (e.g., <div> tags).
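
For example, a minimal sketch with Beautiful Soup (the class names used here are placeholders; the real ones must be read off the inspected page, and they change whenever the site's layout changes):

from bs4 import BeautifulSoup
import requests

# Fetch a search results page (URL and headers shown for illustration)
resp = requests.get("https://www.flipkart.com/search?q=iphone",
                    headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(resp.text, "html.parser")

# The name and price sit in specific "div" tags; the class names below
# are hypothetical and must match what "Inspect" shows in the browser
for name_tag, price_tag in zip(soup.find_all("div", class_="product-name"),
                               soup.find_all("div", class_="product-price")):
    print(name_tag.get_text(strip=True), price_tag.get_text(strip=True))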
Step - 4: Importing libraries and code execution

Once you’ve found the appropriate nest tags, you’ll need to incorporate these into your
preferred scraping software. This basically tells the bot where to look and what to extract.
It’s commonly done using Python libraries, which do much of the heavy lifting. You need
to specify exactly what data types you want the scraper to parse and store. Import all the
required libraries, such as pandas, Beautiful Soup and Selenium, and write the code.

Fig 7: Importing Libraries

Fig 8: Block diagram

The required website is crawled to obtain the required data. The website page is then
inspected and the required “div” tags of price, name and ratings are gathered. The data
present in the “div” tags is obtained from the website during the execution of the code. If the
data is obtained, it is saved in CSV file format for future reference. If the
data is not obtained, the website is crawled again to obtain the data.
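
A minimal sketch of this loop, using the libraries named above (the search URL, the tag classes and the retry count are assumptions for illustration):

import pandas as pd
import requests
from bs4 import BeautifulSoup

def crawl_once(url):
    # Crawl the page and gather the "div" tags holding name and price
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")
    names = [d.get_text(strip=True) for d in soup.find_all("div", class_="name")]
    prices = [d.get_text(strip=True) for d in soup.find_all("div", class_="price")]
    return list(zip(names, prices))

rows = []
for attempt in range(3):  # if no data is obtained, crawl the website again
    rows = crawl_once("https://www.flipkart.com/search?q=iphone")
    if rows:
        break

# Save the obtained data in CSV format for future reference
pd.DataFrame(rows, columns=["product_name", "price"]).to_csv("products.csv", index=False)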


SYSTEM REQUIREMENTS
Hardware Requirements

Processor: 11th Gen Intel(R) Core(TM) i3-1115G4 @ 3.00 GHz

RAM: 8.00 GB

System type: 64-bit operating system, x64-based processor

Software Requirements

Operating System: Windows 64-bit OS

Platform: Jupyter (Python 3.x with the Selenium, Beautiful Soup and Pandas libraries installed)

Web Browser: Microsoft Edge Version 105.0.1343.50


SYSTEM ARCHITECTURE
System architecture defines the structure of a software system. This is usually a series of
diagrams that illustrate services, components, layers and interactions. A scheduler is a
software product that allows an enterprise to schedule and track computer batch tasks.

Fig 9: System Architecture


Job schedulers may also manage the job queue for a computer cluster. A scheduler starts by
processing a prepared job-control algorithm or through communication with a
human user, taking the required URL. A download manager is basically a computer
program dedicated to the task of downloading stand-alone files from the internet. Here, we are
going to create a simple download manager with the help of threads in Python. Using
multi-threading, a file can be downloaded in the form of chunks simultaneously from
different threads. To implement this, we are going to create a simple command-line tool
which accepts the URL of the file and then downloads it. Downloads are put into the
download queue and prioritised. From this we get the required data from the website, and it can
be stored in the required format.
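
As a sketch of the idea (not the exact scheduler described here), a file can be downloaded in chunks from several threads using HTTP Range requests; the URL is a placeholder, and the server is assumed to support Range headers and report Content-Length:

import threading
import requests

def download_chunk(url, start, end, part):
    # Each thread asks the server for one byte range of the file
    r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    with open(f"part{part}", "wb") as f:
        f.write(r.content)

url = "https://example.com/file.bin"  # placeholder URL
size = int(requests.head(url).headers["Content-Length"])
n = 4  # number of threads/chunks
chunk = size // n
threads = []
for i in range(n):
    start = i * chunk
    end = size - 1 if i == n - 1 else start + chunk - 1
    t = threading.Thread(target=download_chunk, args=(url, start, end, i))
    t.start()
    threads.append(t)
for t in threads:
    t.join()

# Stitch the downloaded parts back together in order
with open("file.bin", "wb") as out:
    for i in range(n):
        with open(f"part{i}", "rb") as p:
            out.write(p.read())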


SYSTEM DESIGN

System Design is the process of designing the architecture, components, and interfaces for
a system so that it meets the end-user requirements. Web scraping requires two parts, namely
the crawler and the scraper. The crawler is an artificial intelligence algorithm that browses
the web to search for the particular data required by following the links across the internet.
The scraper, on the other hand, is a specific tool created to extract data from the website.

Fig 10: Design diagram

DESIGN COMPONENTS
1. Input Seed URLs
Firstly, your crawler will need ‘seed URLs’. Once it has the initial input, it will continue
extracting and storing data recursively. This list of seed URLs or absolute URLs is fed to the
‘URL frontier’.
2. URL Frontier
The job of module 2, the URL frontier, is to build and store a list of URLs to be downloaded
from the internet. For focused or topical web crawlers, the URL frontier will also prioritize the


URLs in the queue.


3. Fetching Data
Whenever the URL frontier is requested for a URL, it will send the next URL in the priority
queue to module 3, the HTML fetcher. The HTML fetcher then downloads the document
against the fetched URL, once the DNS resolver gives it the IP address (find details under the
next heading). The crawler downloads the file based on the network protocol that the file is
running. Your crawler can also have multiple protocol modules to download different file
types. The fetcher, also called the worker, will invoke the appropriate protocol module to
download the page on the URL.

4. DNS Resolver
Before the HTML fetcher can actually download the page content, an additional step is
required. This is where the role of a DNS resolver comes in. A DNS resolver, or a DNS
lookup tool, component 4 in the diagram, maps a hostname to its IP address. Though DNS
resolution can be requested from the server, it will take a lot of time to complete the step,
given the large number of URLs to be crawled. Instead, the better option is to create a
customized DNS resolver, as you can see in the diagram, to complement the basic crawler
design. So your custom DNS resolver will give the HTML fetcher the IP address of the hostname
that is to be fetched. Once it has the IP address, the fetcher downloads the content of the page
available at that address.

5. Caching
Next, the content downloaded from the internet by the fetcher is cached. Since the data is
typically stored after being compressed and can be time-consuming to retrieve, an open-source
data-structure store, such as Redis, can be used to cache the document. This makes it easier for
other processors in your web crawler design to fetch the data and re-read it without consuming
unnecessary time.
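
A minimal caching sketch with the redis-py client, assuming a Redis server is running locally (the compression step mirrors the "stored after being compressed" point above):

import zlib
import redis

r = redis.Redis(host="localhost", port=6379)

def cache_page(url, html):
    # Store the page compressed, keyed by its URL
    r.set(url, zlib.compress(html.encode("utf-8")))

def get_cached_page(url):
    # Other processors re-read the cached copy instead of re-fetching
    raw = r.get(url)
    return zlib.decompress(raw).decode("utf-8") if raw else None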

6. Content Seen Module


Another aspect to consider is whether the URL content has already been seen by the crawler.
Sometimes, multiple URLs can have the same content in them. If the document is already in
the crawler database, you’re going to discard it here without sending it to storage.
We’ll use module 6, the content seen module or the document De-Dup test, so that
the crawler doesn’t download the same document multiple times. Fingerprint mechanisms,
like checksums or shingles, can be used to detect duplication. The checksum of the current
document is compared to all the checksums present in a store called ‘Doc FPs’ to see if the file
has already been crawled. If the checksum already exists, the document is discarded at this
point.
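
A checksum-based content seen test can be sketched with hashlib; the in-memory set here stands in for the 'Doc FPs' store:

import hashlib

doc_fps = set()  # stands in for the 'Doc FPs' store

def content_seen(html):
    # Fingerprint the document and compare against previously crawled ones
    fp = hashlib.md5(html.encode("utf-8")).hexdigest()
    if fp in doc_fps:
        return True  # duplicate: discard without sending to storage
    doc_fps.add(fp)
    return False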

7. Storage
If the document passed the content seen test in the previous module, it’s saved into the
persistent storage.

8. Processing Data
You can have multiple processors in your customized web crawler design, depending on what
you plan on doing with the crawler. All the processing is carried out on the cached document,
rather than the stored database since it’s easier to retrieve. The three most common processors
that are almost always present include:

8a. Link Extractor


URL extractor or link extractor can be taken as the default processor. A copy of the crawled
page is fed into the link extractor from Redis or any other in-memory data store. The extractor
will parse the network protocol and extract all the links on the page. The links could either be
pointing to a specific location on the same page, a different page on the same website, or a
different website. A set of normalization techniques will need to be incorporated to make the
link list more manageable. The links in the list should follow a standard format to make them
easily understandable by the crawler modules, as sketched below. You can:

• Map all child domains to the main domain. For example, links from mail.yahoo.com and
music.yahoo.com can all be tagged under www.yahoo.com.
• If the components of the link are uppercased, convert them to lowercase.
• Add the network protocol to the beginning of the link, if it’s missing.
• Add/remove backslashes at the end of the link.
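
These rules can be sketched with urllib.parse; the child-domain mapping below is a simplified assumption, while real crawlers use curated rules:

from urllib.parse import urlparse, urlunparse

def normalize(link, default_scheme="https"):
    p = urlparse(link)
    scheme = p.scheme or default_scheme      # add the protocol if missing
    host = p.netloc.lower()                  # lowercase the components
    parts = host.split(".")
    if len(parts) > 2:                       # map child domains to the main domain
        host = "www." + ".".join(parts[-2:])
    path = p.path.rstrip("/")                # remove trailing slashes
    return urlunparse((scheme, host, path, "", p.query, ""))

print(normalize("HTTP://mail.yahoo.com/Inbox/"))  # http://www.yahoo.com/Inbox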

8b. URL Filtering

The URL filter receives the set of unique, standardized URLs from the link extractor module.
Next, depending on how you’re using the web crawler, the URL filter will keep the files
that are required and discard the rest.

You can design a URL filter that filters by filetype. For example, a web crawler that crawls
only jpg files will keep all the links that end with ‘.jpg’ and discard the rest. Other than the
filetype, you can also filter links by their prefix or domain name. For example, if you don’t
want to crawl Wikipedia links, you can design your URL filter to ignore the links pointing to
Wikipedia.

It is at this point that we can implement the robot exclusion protocol. Since the URL fetcher
will already have fetched a document called robots.txt and mapped the off-limit pages to the
URL list, the URL filter will discard all the links that the website does not permit
downloading. The output of the URL filter is all the URLs that we want to keep and pass on
to the URL frontier after some further processing.
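
A combined filetype and domain filter of this kind could look like the following sketch (the blocked domain and allowed extensions are example choices):

from urllib.parse import urlparse

BLOCKED_DOMAINS = {"wikipedia.org"}           # example: ignore Wikipedia links
ALLOWED_EXTENSIONS = (".html", ".htm", ".jpg")

def url_filter(urls):
    kept = []
    for url in urls:
        parsed = urlparse(url)
        # discard links whose domain is off-limits
        if any(parsed.netloc.lower().endswith(d) for d in BLOCKED_DOMAINS):
            continue
        # keep the required filetypes (and extension-less pages), discard the rest
        last = parsed.path.rsplit("/", 1)[-1]
        if parsed.path.endswith(ALLOWED_EXTENSIONS) or "." not in last:
            kept.append(url)
    return kept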

8c. URL De-Dup

URL De-Dup is typically implemented after the URL filter module. The stream of URLs
coming out of the URL filter might not be unique. You may have multiple URLs in the stream
that point to the same document. We wouldn’t want to crawl the same document twice, so a
De-Dup test is performed on each filtered link before passing it ahead. Your crawler should
ideally store a database of all the crawled URLs, which we'll call the URL set. Each URL to be
tested is checked against the URLs in the set to detect a repetition.
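
In its simplest form the De-Dup test is set membership against the URL set, as in this sketch:

url_set = set()  # database of all URLs the crawler has already seen

def dedup(urls):
    fresh = []
    for url in urls:
        if url not in url_set:   # only URLs not crawled before pass the test
            url_set.add(url)
            fresh.append(url)
    return fresh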

OVERVIEW
As you can see in the system design diagram, the loop is initiated through a set of ‘seed URLs’
that is created and fed into the URL frontier. The URL frontier applies algorithms to build
URL queues based on certain constraints, prioritization and politeness, which were discussed
above. The URL fetcher receives the URLs waiting in the queue one by one, receives the
address for each from the DNS resolver, and downloads the content from that page. The content
is cached by module 5 for easier access by the processors. It is also compressed and stored after
going through the document De-Dup test at module 6. This test checks whether the content has
already been crawled.


UML DIAGRAMS
USE CASE

Fig 11: USECASE DIAGRAM


SEQUENCE DIAGRAM

Fig 12: SEQUENCE DIAGRAM


CLASS DIAGRAM

Fig 13: CLASS DIAGRAM


SOFTWARE MODELLING AND SETUP


BEAUTIFUL SOUP
Beautiful Soup is a Python HTML parsing library. While not being a complete
web scraping tool, it can be used in conjunction with the requests package, which allows
for making HTTP calls in Python. Beautiful Soup is a simple, pythonic way to navigate, search
and modify parse trees, such as an HTML tree. It is intended to be easy to use and provides
traversal functionality such as finding all links or all tables matching some condition.

Setup for Beautiful Soup library


Beautiful Soup is a scraping library for Python. We want to run all our scraping projects in a
virtual environment, so we will set that up first.
Create a directory and change into it.
The first step is to create a new folder (directory) for all your scraping projects.

Documents/python/scraping

Do not use any spaces in your folder names. If you must use punctuation, do not use
anything other than an underscore _ . It’s best if you use only lowercase letters. Change into

that directory. The command is

cd Documents/python/scraping

Create a new virtual environment in that directory and activate it.


Create a new virtual environment there (this is done only once).

Install the Beautiful Soup library

In MacOS or Windows, at the command prompt, type


pip install beautifulsoup4

This is how you install any Python library that exists in the Python Package Index. Pretty
handy. Pip is a tool for installing Python packages, which is what you just did.

Test Beautiful Soup


Start Python. Because you are already in a Python 3 virtual environment, Mac users need
only type python (NOT python3 ). Windows users also type python as usual.

You should now be at the >>> prompt — the Python interactive shell prompt.

In MacOS or Windows, type (or copy/paste) one line at a time:

from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen("https://weimergeeks.com/examples/scraping/example1.html")
soup = BeautifulSoup(page, "html.parser")
print(soup.h1)

1. You imported two Python modules, urlopen and BeautifulSoup (the first two lines).

2. You used urlopen to copy the entire contents of the URL given into a new Python

variable, page (line 3).

3. You used the BeautifulSoup function to process the value of that variable (the plain-text

contents of the file at that URL) through a built-in HTML parser called html.parser.

4. The result: All the HTML from the file is now in a Beautiful Soup object with the new
Python variable name soup . (It is just a variable name.)

5. Last line: Using the syntax of the Beautiful Soup library, you printed the first h1 element

(including its tags) from that parsed value.


If it works, you’ll see:


<h1>We Are Learning About Web Scraping!</h1>

The command soup.h1 would work the same way for any HTML tag (if it exists in the file).

Instead of printing it, you might stash it in a variable:

heading = soup.h1

Then, to see the text in the element without the tags:

print(heading.text)

Understanding Beautiful Soup


Beautiful Soup is a Python library that enables us to extract information from web pages and
even entire websites. We use Beautiful Soup commands to create a well-structured
data object (more about objects below) from which we can extract, for example, everything
with an <li> tag, or everything with class="product price". After extracting the desired
information, we can use other Python commands (and libraries) to write the data into a
database, CSV file, or other usable format, and then we can search it, sort it, etc.

What is the Beautiful Soup object?


It’s important to understand that many of the Beautiful Soup commands work on
an object, which is not the same as a simple string.

Many programming languages include objects as a data type. Python does, JavaScript does,
etc. An object is an even more powerful and complex data type than an array (JavaScript) or
a list (Python) and can contain many other data types in a structured format.
When you extract information from an object with a Beautiful Soup command, sometimes
you get a single Tag object, and sometimes you get a Python list (similar to an array in
JavaScript) of Tag objects. The way you treat that extracted information will
be different depending on whether it is one item or a list (usually, but not always,
containing more than one item).
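
For example, continuing with the soup object created earlier, find returns a single Tag object while find_all returns a list of Tag objects, and the two are treated differently:

heading = soup.find("h1")     # a single Tag object (or None if absent)
print(heading.text)           # a Tag has .text, .attrs, and so on

links = soup.find_all("a")    # a Python list of Tag objects
for link in links:            # a list must be iterated
    print(link.get("href"))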


SELENIUM

A way of automating and simulating a human browsing with a web browser can
be accomplished by using a tool called Selenium. It is primarily used and intended for testing
of web applications, but is a relevant choice for web scraping. Using the Selenium
WebDriver API in conjunction with a browser driver (such as Chrome Driver for the Google
Chrome browser) will act the same way as if a user manually opened up the browser to do the
desired actions. Because of this, loading and scraping web pages that makes use of JavaScript
to update the DOM is not a problem. The Selenium WebDriver can be used in Java, Python,
C#, JavaScript, Haskell, Ruby and more.

Selenium scripts are built to do tedious tasks which can be automated
using headless web browsers: for example, searching for some questions on different
search engines and storing the results in a file by visiting each link. This task could take a long
time for a normal human being, but with the help of Selenium scripts one can easily do it.
Now, some of you may be wondering what headless web browsers are. A headless browser is
nothing but a browser that can be controlled using these Selenium scripts for automation (web tasks).
Selenium scripts can be programmed using various languages such as JavaScript, Java,
Python etc.

How to use Selenium with Python in a Linux environment

Python should already be installed. It can be version 2.* or 3.*.
Steps:
1. Installing Selenium

2. Installing web drivers


Installing Selenium

Whatever operating system you are using, the Python command for installing the Selenium
library is the same.

First method
Open a terminal/cmd and write the command as below:

python -m pip install selenium


Second method
Alternatively, you can download the source distribution, unarchive it, and run the
command below:

python setup.py install

INSTALLING WEB DRIVERS

For using Chrome you may need to install Chromium.
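
A minimal usage sketch with the Selenium 4 API, assuming Chrome and its driver are installed (the URL is the example page used earlier):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # starts a Chrome session via the driver
driver.get("https://weimergeeks.com/examples/scraping/example1.html")
# Selenium sees the DOM after JavaScript has run, unlike a plain HTTP request
heading = driver.find_element(By.TAG_NAME, "h1")
print(heading.text)
driver.quit()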

REQUESTS

It is a Python module with which you can send HTTP requests to retrieve content.
It helps you to access a website's HTML content or an API by sending GET or POST requests.
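
For example (the POST endpoint is a public test service used only for illustration):

import requests

# GET request: .text holds the HTML, .status_code the HTTP status
resp = requests.get("https://www.flipkart.com/search?q=iphone",
                    headers={"User-Agent": "Mozilla/5.0"})
print(resp.status_code)

# POST request sending form data
resp = requests.post("https://httpbin.org/post", data={"q": "iphone"})
print(resp.json()["form"])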
PANDAS
Pandas is another multi-purpose Python library, used for data manipulation and indexing. It
can be used to scrape the web in conjunction with Beautiful Soup. The main benefit of using
pandas is that analysts can carry out the entire data analytics process using one language
(avoiding the need to switch to other languages, such as R).
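
For example, scraped rows can be loaded into a DataFrame, analysed and written out without leaving Python (the rows here are sample data):

import pandas as pd

rows = [("Phone A", 49999), ("Phone B", 45999)]  # sample scraped data
df = pd.DataFrame(rows, columns=["product_name", "price"])
df = df.sort_values("price")   # e.g. cheapest product first
df.to_csv("prices.csv", index=False)
print(df.head())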
TOOLS
Several software products are available that can be used to configure web scraping
solutions. Such a program may attempt to automatically recognize a page's data structure, or
provide a recording interface that eliminates the need to write web scraping code
manually, or offer scripting functions that can be used to extract and transform content,
and spreadsheet or database interfaces that can archive the scraped information.


Fig 14: FIELDS INCLUDING WEB SCRAPING

There are different kinds of websites from which web scraping is done.
Websites are divided into small, average and large based on how many users visit these
sites.


IMPLEMENTATION
In essence, web scraping is used to fetch unstructured data from web pages
and transform it into a structured presentation, or for storage in an external database. It is also
considered an efficient technique for collecting big data, where gathering large amounts of
data is important. Search engines use web scraping in conjunction with web crawling to index
the World Wide Web, with the purpose of making the vast number of pages searchable. The
crawlers, also called spiders, follow every link that they can find and store them in their
databases. On every website, metadata and site contents are scraped to allow for determining
which site best fits the user's search terms. One example of a way to "rank" the pages is an
algorithm called PageRank. PageRank looks at how many links are outgoing from a
website, and how often the website is linked to from elsewhere.
The working of the proposed system is as follows. The backend system consists of
two important techniques: web crawling and web scraping. Web scraping is a technique that
is used to extract information in a human-readable format and display it on the destination
terminal. Before scraping the output, web crawlers are responsible for navigating to the
destination; once the crawler reaches the correct page and matches it up with the products,
the scraping process starts. Web scraping essentially consists of two tasks: the first is to load the
desired web page, and the second is to parse the HTML information of the page to locate the intended
information. In this system scraping is done using Python, as it provides a rich set of libraries
to address these tasks: "requests" is used to load the URLs and the "Beautiful Soup" library is
used to parse the web page. After scraping the product information from different e-
commerce websites, the data is displayed on the website. The front end consists of the main
website. The client searches for the required product in the search bar and the query is fired against
the local database, i.e., sqlite3. The website is designed using the Django web framework, which is
written in Python. The required results are retrieved and displayed on the main website. The client
can then compare the prices of products that are available on the e-commerce websites. As soon as the
client selects the best deal, they are redirected to the original e-commerce
website. Another feature provided is a price alert, which the user can set in order to get notified
by the website whenever a suitable price comes up.
The three different phases that make up web scraping are:


Fetching phase
First, in what is commonly called the fetching phase, the desired website that
contains the relevant data has to be accessed. This is done via the HTTP protocol, an Internet
protocol used to send requests and receive responses from a web server.
This is the same technique used by web browsers to access web page content.
Libraries such as curl and wget can be used in this phase by sending an HTTP GET
request to the desired location (URL), getting the HTML document sent back in the response.
Extraction phase
Once the HTML document is fetched, the data of interest needs to be
extracted. This phase is called the extraction phase, and the technologies used are regular
expressions, HTML parsing libraries or XPath queries. XPath stands for XML Path Language
and is used to find information in documents. This is considered the second phase.
Transformation phase
Now that only the data of interest is left it can be transformed into a
structured version, either for storage or presentation.
Technique used
There are several approaches one can take when implementing a web
scraper. A common path is to use libraries. Using this approach the web scraper is developed
in the same vein as a software program using a programming language of choice. Popular
programming languages for building web scrapers include Java, Python, Ruby or JavaScript,
in the framework Node.js. Programming languages usually offer libraries to use the HTTP
protocol to fetch the HTML from a web page. Popular libraries for using the HTTP protocol
include curl and wget. After this process regular expressions or other libraries can be used to
parse the HTML.
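
The three phases map directly onto a few lines of Python; a sketch using requests for fetching, Beautiful Soup for extraction and csv for the structured output (the URL and the span class are placeholders):

import csv
import requests
from bs4 import BeautifulSoup

# Fetching phase: HTTP GET request, HTML document in the response
html = requests.get("https://example.com/products",
                    headers={"User-Agent": "Mozilla/5.0"}).text

# Extraction phase: parse the HTML and pull out the data of interest
soup = BeautifulSoup(html, "html.parser")
prices = [t.get_text(strip=True) for t in soup.find_all("span", class_="price")]

# Transformation phase: turn the extracted data into a structured CSV file
with open("prices.csv", "w", newline="") as f:
    csv.writer(f).writerows([[p] for p in prices])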


Fig 15: Web Scraping Process


Fig 16: Entering required product name


Fig 17: Data stored in CSV file


CODE

import csv
import json
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen as uReq
from msedge.selenium_tools import Edge, EdgeOptions

q = input("Enter product name (Searching on flipkart)")


q = q.replace(" ","+")
test = []
names = []
flipPrices = []
prodNames = []
info,price = [],[]
url = "https://flipkart.com/search?q="
# query to search for.
# q = input("Enter a query: ")
file_name = q.replace(" ", "_")
# response received in bytes
resp = requests.get(url + q)
# parse the response content with the BeautifulSoup class, so that we can perform operations on it
parsed_html = bs(resp.content, 'html.parser')
# data cleaning
raw_data = parsed_html.find("script", attrs={"id": "is_script"})
data = raw_data.contents[0].replace("window.__INITIAL_STATE__ = ", "").replace(";", "")
json_data = json.loads(data)
req_data = json_data["pageDataV4"]["page"]["data"]["10003"]
# [10]["widget"]["data"]["products"][3]["productInfo"]
# req_json_data = json_data["seoMeta"]["answerBox"]["data"]["renderableComponents"][0]["value"]["data"]
data_list = []
# print(len(req_data))
try:
    for i in range(1, len(req_data)):
        d = {}
        jd = req_data[i]["widget"]["data"]["products"]
        # print(len(jd))
        # print("i: ", i, end="\n")
        for j in range(len(jd)):
            jd2 = jd[j]["productInfo"]["value"]

            d["title"] = jd2["titles"]["title"]
            d["keySpecs"] = jd2["keySpecs"]
            d["rating"] = jd2["rating"]["average"]
            d["ratingCount"] = jd2["rating"]["count"]
            d["price"] = jd2["pricing"]["finalPrice"]["value"]
            # d["warranty"] = jd2["warrantySummary"]
            d["url"] = jd2["smartUrl"]

            # You can uncomment the lines below to print the json output on the terminal
            # print("Title: ", jd2["titles"]["title"])
            # print("Key specs: ", jd2["keySpecs"])
            # print("Rating: ", jd2["rating"]["average"])
            # print("Total ratings: ", jd2["rating"]["count"])
            # print("Price: ", jd2["pricing"]["finalPrice"]["value"])
            # print("Warranty: ", jd2["warrantySummary"])
            # print("Smart url: ", jd2["smartUrl"])
            data_list.append(d)
except:
    pass
# dumping data to the flipkart.json file
# print(list(data_list))
with open("flipkart" + '.json', 'w') as fp:
    json.dump(data_list, fp)

# Now let us write our data to a csv file
data_file = open("flipkart" + '.csv', 'w')
# create the csv writer object
csv_writer = csv.writer(data_file)
# Counter variable used for writing headers to the CSV file
count = 0
for data in data_list:
    if count == 0:
        # Writing the headers of the CSV file
        header = data.keys()
        csv_writer.writerow(header)
        count += 1
    # Writing the data rows of the CSV file
    csv_writer.writerow(data.values())
with open('flipkart.csv') as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    rows = list(reader)

i, j = 0, 2
while i in range(len(rows)):
    try:
        name = rows[j][0]
        # name = " ".join(name.split(' ')[0:2])
        # print("name = ", name)
        names.append(name)
        i += 1
        j += 2
    except:
        break
print("Best results", len(names))
# print(names, len(names))
if len(names) < 10:
    f = open("flipkart.csv", "w")
    f.truncate()
    f.close()
    flipkart_url = "https://www.flipkart.com/search?q=" + q
    print(flipkart_url)
    uClient = uReq(flipkart_url)
    flipkartPage = uClient.read()
    uClient.close()
    flipkart_html = bs(flipkartPage, "html.parser")
    bigboxes = flipkart_html.find_all("a", {"class": "s1Q9rs"})
    soup = BeautifulSoup(flipkartPage, 'html.parser')
    info = soup.select("[class~=s1Q9rs]")
    if info == []:
        info = soup.select("[class~=IRpwTa]")
    flipPrices = soup.select("[class=_30jeq3]")
    prodNames = [i.get('title') for i in info]
    names = prodNames
    df = pd.DataFrame(list(zip(prodNames, flipPrices)),
                      columns=['product_name', 'Flipkart_price'])
    df.to_csv('test.csv')
    print(df)
else:
    with open('flipkart.csv') as csv_file:
        reader = csv.reader(csv_file, delimiter=',')
        rows = list(reader)
    i, j = 0, 2
    while i in range(len(rows)):
        try:
            price = rows[j][4]
            # print("price = ", price)
            flipPrices.append(price)
            i += 1
            j += 2
        except:
            break
    df = pd.DataFrame(list(zip(names, flipPrices)),
                      columns=['Product_name', 'Flipkart_price'])
    df.to_csv('test.csv')
    print(df)
data_file.close()
df = pd.read_csv("test.csv", sep=",")
df.head(100)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
flipkart = ''
ebay = ''
amazon = ''
olx = ''
amazonlist = []
amazonName = []

CSE VMTW
P a g e | 42

i=0
while i in range(len(names)):
print(names[i])
def amazon(name):
try:
global amazon
name = " ".join(name.split(' ')[0:2])
name1 = name.replace(" ","-")
name2 = name.replace(" ","+")
amazon=f'https://www.amazon.in/{name1}/s?k={name2}'
res =
requests.get(f'https://www.amazon.in/{name1}/s?k={name2}',headers=headers)
print("\nSearching in amazon:")
soup = BeautifulSoup(res.text,'html.parser')
amazon_page = soup.select('.a-color-base.a-text-normal')
amazon_page_length = int(len(amazon_page))
for i in range(0,amazon_page_length):
name = name.upper()
amazon_name = soup.select('.a-color-base.a-text-
normal')[i].getText().strip().upper()
if name in amazon_name[0:20]:
amazon_name = soup.select('.a-color-base.a-text-
normal')[i].getText().strip().upper()
amazon_price = soup.select('.a-price-whole')[i].getText().strip().upper()
amazonlist.append(amazon_price)
print("Amazon:")
print(amazon_name)
amazonName.append(amazon_name)
print("₹"+amazon_price)
print("-----------------------")
break
else:
i+=1
i=int(i)
if i==amazon_page_length:
print("amazon : No product found!")
print("-----------------------")
amazon_price = '0'
amazonlist.append(amazon_price)
amazonName.append("No similar product")
break

return amazon_price
except:

CSE VMTW
P a g e | 43

print("amazon: No product found!")


print("-----------------------")
amazon_price = '0'
amazonlist.append(amazon_price)
amazonName.append("No similar product")
return amazon_price
amazon_price = amazon(names[i])
flipkart=''
ebay=''
croma=''
amazon=''
olx=''
i += 1
# flipPrices holds Tag objects when scraped from the HTML page and plain
# strings when read back from the CSV; normalise both cases
test = flipPrices
flip = flipPrices
idk = []
for i in range(len(flip)):
    try:
        x = flip[i].text.replace('₹', '')
        print(x)
        idk.append(x)
    except:
        # entries read from the CSV are already plain strings
        idk = test

df = pd.DataFrame(list(zip(names, idk, amazonName, amazonlist)),
                  columns=["Product_name_Flipkart", "Flipkart_price",
                           "Product_name_Amazon", "Amazon_price"])
df.to_csv('flipkartandamazon.csv')
df

df = pd.DataFrame(list(zip(amazonName, amazonlist)),
                  columns=['Product_name', 'Amazon_price'])
print(df)


import csv
# rewrite flipkartandamazon.csv with one properly paired row per product
file = open('flipkartandamazon.csv', 'w+', newline='')
with file:
    write = csv.writer(file)
    write.writerow(["Product_name_Flipkart", "Flipkart_price",
                    "Product_name_Amazon", "Amazon_price"])
    # writerows expects one iterable per row, so zip the four columns
    write.writerows(zip(names, idk, amazonName, amazonlist))

df = pd.DataFrame(list(zip(names, idk, amazonName, amazonlist)),
                  columns=["Product_name_Flipkart", "Flipkart_price",
                           "Product_name_Amazon", "Amazon_price"])
df


RESULT

The prices of a product on different e-commerce websites are compared and the result is displayed on a single web interface. The website aims at providing the best possible deal to the user for the required product by comparing product prices and displaying the minimum price across e-commerce websites such as Amazon, Flipkart and Croma, which are among the leading sites to shop from. To achieve this, web mining is performed to fetch the required product details, and the concepts of a web crawler and a web scraper are used to extract information about these products from the different e-commerce websites. The system also allows users to be redirected to the original website of the product they select as the best deal. The website thus serves as a time-saving tool for frequent online buyers, who can compare prices at one stop instead of searching for the same product on several websites. The following images show how the analysis and comparison of products across e-commerce sites is done.
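As an illustration of the final comparison step, the short sketch below picks the cheaper site for every product from the combined flipkartandamazon.csv file produced by the script above; the Best_site and Best_price columns are added here purely for illustration.

import pandas as pd

df = pd.read_csv('flipkartandamazon.csv')

# prices may carry commas (e.g. "1,299"); normalise them to numbers first
for col in ['Flipkart_price', 'Amazon_price']:
    df[col] = pd.to_numeric(df[col].astype(str).str.replace(',', '', regex=False),
                            errors='coerce')

# pick the cheaper of the two sites for every product row
df['Best_site'] = df[['Flipkart_price', 'Amazon_price']].idxmin(axis=1)
df['Best_price'] = df[['Flipkart_price', 'Amazon_price']].min(axis=1)
print(df[['Product_name_Flipkart', 'Best_site', 'Best_price']])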

Fig 18: RESULT SCREEN


The overall results of the project help users understand the prices of products. The web scraper extracted the data and saved it in CSV format, and the extraction script handled each of these sources with ease. Moreover, the analysis also identifies the most-rated product on a site, based on the number of ratings and reviews it has received.
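As a small sketch of that analysis, the snippet below reads the flipkart.json file written earlier (each record carries the title, rating, ratingCount and price fields extracted above) and reports the product with the most ratings.

import json

with open('flipkart.json') as fp:
    products = json.load(fp)

# the product with the highest number of ratings is the most reviewed one
most_rated = max(products, key=lambda p: p.get('ratingCount', 0))
print("Most rated:", most_rated['title'])
print("Ratings:", most_rated['ratingCount'], "| Average:", most_rated['rating'])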

Scraping also helps you update your own product. Scraping descriptions and reviews from your competitors' online stores can help you find the sweet spot between quality and price. Cheaper products are often the result of lower material costs and fewer labour expenses. Such trade-offs may be fine, but what happens if your customers notice the decline in quality? Will the cheaper prices be enough compensation for them? These questions can all be answered through price scraping, which allows you to extract descriptions from your competitors' online stores, identify openings for greater savings, and adjust your own strategy.


CONCLUSION

The main outcomes of this project were a user-friendly search interface, indexing, query processing, and an effective data extraction technique based on web structure. Web scraping helps us obtain large-scale product data, and delivers the data we require in a readable format.

This project also presents a survey of web scraping technology: what it is, how it works, the popular tools and technologies of web scraping, the kinds of websites it is used on, and the top fields which make use of it.

Whether in e-commerce or e-marketing, the technique of web scraping will be a key to success, as it provides insight into the target market and supports decision makers.

Web scraping has become a modern necessity for staying competitive in business, helping organizations utilize data to track trends and strategize for the future. The data can be used in real time to keep pricing in line with rival companies, or to track the misuse of data and illegal sales.

A further aim of this project was to evaluate state-of-the-art web scraping tools against established software evaluation metrics, so that a recommendation among these tools can be made with the different software metrics in mind.

Extracting data through scraping technology is a new, evolving activity in the technology-harvesting arena. Though many companies still use a manual process for extracting data, web scraping solutions will transform this traditional method. With the exponential growth throughout this field, the day is not far when scraping becomes a phenomenon and most companies understand the value of scraping innovation and how it keeps them ahead in the race.

The website provides users with useful information that helps them make informed decisions. A price comparison website solves the problem working people face of checking prices before buying products: it lets users analyse the prices listed on different e-commerce shopping websites so that they learn the cheapest price and the best deal for a product. This saves buyers effort and valuable time. Ultimately, it brings together strategies, best offers and deals from all the leading online stores and helps buyers shop online.

By tracking competitors' product reviews, we can also learn what our customers want and need, taking advantage of competitors' mistakes as well as their successes.


FUTURE ENHANCEMENT

As we go forward, marketing will become an even more competitive exercise. Those who wish to arrive at a suitable marketing strategy will need to derive deeper insights about the market and base their marketing decisions on data rather than on other aspects.

For this, the future of marketing is closely linked with the comparison of product prices aggregated from various media sites, social media platforms, web traffic, etc.

Fig 19: FUTURE BENEFITS

At present, a trend has started wherein sentiment analysis plays a part in arriving at a strategy. In future, its role in decision making is set to increase many times over, and it will become an integral part of policy framing and strategic planning in all fields.

To put it in perspective, say a company launches a new product. How is it going to analyse the product's efficacy? How will it derive insights about the product design or the service provided? Right now there are companies that analyse user comments and feedback to learn something new about their products, but the practice is not yet widespread. In future, the scraping of user reviews, product feedback and service feedback will grow manifold, and sentiment analysis using web scraping will become a vital driver of policy and strategy. Companies which invest in web scraping will reap huge dividends in terms of sentiment analysis and rich insights into customer expectations and overall customer behaviour.
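As a rough sketch of how scraped review text could feed such sentiment analysis, the snippet below scores two hypothetical review strings with NLTK's VADER analyser; the reviews list merely stands in for text that a future version of this project would scrape.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon

# hypothetical reviews standing in for scraped review text
reviews = [
    "Excellent phone, the battery easily lasts a full day.",
    "Screen cracked within a week, very disappointed.",
]
analyzer = SentimentIntensityAnalyzer()
for review in reviews:
    score = analyzer.polarity_scores(review)['compound']  # -1 (negative) to +1 (positive)
    label = 'positive' if score > 0.05 else 'negative' if score < -0.05 else 'neutral'
    print(f"{label:8s} {score:+.2f}  {review}")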

Whether perfectly legal or not, web scraping has grown into an essential requirement for a whole set of stakeholders on the Internet. Starting with Google, everyone needs data to process, analyse and streamline information. The world of business has become more dynamic and responds to change immediately, and at times frequently; prices keep fluctuating on e-commerce websites, and a number of businesses are keenly watching and analysing this data to rework their own strategies.

Sentiment analysis is a popular way for organizations to determine and categorize opinions about a product, service or idea, and, as noted above, its role in marketing decisions is only set to grow.


