Web Scraping

(m9 marketing research in international business) 03.11.22
web-scraping: big data and non-reactive research
@ aiur zhamsaranov

Borah, Abhishek & Boegershausen, Johannes & Stephen, Andrew. (2021). Fields of Gold: Web Scraping for Consumer Research
Johannes Boegershausen, Hannes Datta, Abhishek Borah, Andrew Stephen

what is web-scraping?
the process of designing and deploying code that automatically extracts and parses information from websites (Borah, 2021)

“if programming is magic, then web scraping is wizardry”
@ Ryan Mitchell

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime

# Browser-like request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/84.0",
           "Accept-Encoding": "gzip, deflate",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "DNT": "1", "Connection": "close",
           "Upgrade-Insecure-Requests": "1"}
result = requests.get("https://www.amazon.com/Best-SellersCell-Phones/zgbs/wireless/7072561011", headers=headers)
#result = requests.get("https://www.amazon.com/best-sellersbooks-Amazon/zgbs/books", headers=headers)
print(result.status_code)
if result.status_code < 500:
    src = result.content
    soup = BeautifulSoup(src, 'lxml')
    links = soup.find_all('span', {'class': 'aok-inline-block zgitem'})
    # define dataframe
    df =  # (the rest of this statement is cut off on the slide)

now = datetime.now()
current_time = now.strftime("%H%M%S")
# Strip the currency sign and thousands separator to get a numeric price
# (regex=False so that '$' is treated as a literal character)
df["pricedata"] = df["pricedata"].str.replace('$', '', regex=False)
df["pricedata"] = df["pricedata"].str.replace(',', '', regex=False)
df['price'] = df['pricedata'].apply(lambda x: x.split('.')[0])
df['price'] = df['price'].astype(int)
# Add a new column
df['currency'] = '$'
# Find the 10 highest-priced products
data = df.sort_values(["price"], axis=0, ascending=False)[:10]
data['tittledata'] = data['tittledata'].str.slice(0, 20)
# Plot
y_pos = np.arange(len(data['tittledata']))
# Create horizontal bars
plt.barh(y_pos, data['price'])
# Create names on the y-axis
plt.yticks(y_pos, data['tittledata'])
# Label the axes
plt.xlabel('Amount in dollars')
plt.ylabel('Labels')
plt.title('10 products with the highest prices')
plt.legend()
# Show graphic
plt.show()
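
The price-cleaning and top-N steps from the slide code can be tried on their own, without a network request. The sketch below is only an illustration: the helper name parse_price and the sample rows are hypothetical, and it uses plain Python instead of pandas so that it runs without extra packages.

```python
def parse_price(raw: str) -> int:
    """Turn a price string such as '$1,299.99' into whole dollars (1299)."""
    cleaned = raw.replace("$", "").replace(",", "").strip()
    # Keep only the part before the decimal point, as the slide code does.
    return int(cleaned.split(".")[0])


# Hypothetical sample rows standing in for scraped (title, price) pairs.
rows = [
    ("Phone A with a very long product title", "$1,299.99"),
    ("Phone B", "$89.50"),
    ("Phone C", "$349"),
]

# Shorten titles to 20 characters and sort by numeric price, highest first.
top = sorted(
    ((title[:20], parse_price(price)) for title, price in rows),
    key=lambda item: item[1],
    reverse=True,
)

for title, price in top:
    print(f"{title:<20} {price:>6}")
```

Truncating the price at the decimal point rounds down; if cents matter, converting to float or Decimal would be the safer choice.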

6,37 h/day
spent online by the average consumer worldwide
Source: Digital 2022: October global statshot report
web-scraping: big data

244m reviews
556k projects
> 1b reviews
500m/day
Source: Borah, 2021
127,251,840 websites
5,075,587,500 internet users
2,500,000,000,000,000,000 bytes of data are created every day
Source: Digital 2022: October global statshot report

web-scraping: non-reactive research

non-reactive research:
the extraction and copying of data from a website into a structured format using a computer program
Source: Dictionary.com, 2022

advantage:
scraping yields not only numbers but also sentiment and behavioral analysis, so a business can learn what types of audience it has and which ads they want to see (Borah, 2021)
types of data
textual: reviews, tweets, articles
numeric: star ratings, final auction price, number of followers
visual: avatars, Airbnb apartment photos, Instagram photos
metadata
Source: Borah, 2021
different ways to harvest data

using preexisting web datasets
creating novel datasets from web (API)
web-scraping (software or self-coding)
Source: Borah, 2021

web-scraping: roadmap
01 can data from the web inform the research question?
no: experiments, ethnography, surveys
yes: what is the purpose of the study?
    illustration
    part of a multi-study package
    primary source of data
Source: Borah, 2021

02 is there a preexisting dataset?
yes: is the dataset sufficient for addressing the research question?
    yes: use preexisting dataset
    no
no: does the target website offer easy access to its content?
    yes: are the APIs good for the research?
        yes: use APIs
        no: use web-scraping!
    no: use web-scraping!
Source: Borah, 2021
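
The questions in this step of the roadmap can be read as a small decision function. The sketch below is an illustration only: the function name and its boolean parameters are hypothetical, and it assumes that an insufficient dataset is handled the same way as a missing one, which the slide does not state explicitly.

```python
def choose_data_source(has_dataset: bool, dataset_sufficient: bool,
                       easy_access: bool, api_good: bool) -> str:
    """Walk the step-2 questions and return the suggested approach."""
    if has_dataset and dataset_sufficient:
        return "use preexisting dataset"
    # Assumption: an insufficient dataset is treated like a missing one,
    # so the website questions are asked next.
    if easy_access and api_good:
        return "use APIs"
    # No easy access, or APIs that do not fit the research.
    return "use web-scraping"


print(choose_data_source(has_dataset=False, dataset_sufficient=False,
                         easy_access=True, api_good=False))
```

Web scraping is the fallback here: it is chosen only once a preexisting dataset and the site's APIs have both been ruled out.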

make or buy?
coding: open source, free, more powerful, steep learning curve
software: closed source, rarely free, limited, easy to learn
Source: Kasereka, 2020
web-scraping steps
1 robots.txt
2 html loading
3 extracting
4 selecting
5 output
Source: Kasereka, 2020
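
Step 1, checking robots.txt, can be done with Python's standard library. A minimal sketch, assuming a made-up robots.txt and a made-up bot name; a real file would be fetched from the target site before any page is requested.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "my-research-bot" is an illustrative user-agent name.
for url in ("https://example.com/products",
            "https://example.com/private/data"):
    allowed = parser.can_fetch("my-research-bot", url)
    print(url, "->", "allowed" if allowed else "disallowed")
```

For a live site, RobotFileParser.set_url() followed by read() loads the file over the network instead of from a string.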

example
Source: Edureka, 2022
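
Steps 2 to 5 of the web-scraping steps (html loading, extracting, selecting, output) can be sketched end to end on a small page. This is not the Edureka example: the HTML snippet, the class names, and ItemParser are all hypothetical, and the standard library's html.parser stands in for BeautifulSoup so the sketch runs without extra packages.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page source; in practice this comes from an HTTP response.
HTML = """
<ul>
  <li class="item"><span class="title">Phone A</span><span class="price">$199.00</span></li>
  <li class="item"><span class="title">Phone B</span><span class="price">$1,049.99</span></li>
</ul>
"""


class ItemParser(HTMLParser):
    """Collect the text of <span class="title"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows: {"title": ..., "price": ...}
        self._field = None   # field currently being read, if any
        self._current = {}   # row being assembled

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("title", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.rows.append(self._current)
            self._current = {}


parser = ItemParser()
parser.feed(HTML)                 # steps 2-4: load, extract, select

buffer = io.StringIO()            # step 5: output as CSV (in memory here)
writer = csv.DictWriter(buffer, fieldnames=["title", "price"])
writer.writeheader()
writer.writerows(parser.rows)
print(buffer.getvalue())
```

Writing to a file instead of io.StringIO only requires replacing the buffer with open("items.csv", "w", newline="").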
question
what is the most important thing a researcher should do before web scraping?
Source: Altapress, 2022

references
1. Borah, Abhishek & Boegershausen, Johannes & Stephen, Andrew. (2021). Fields of Gold: Web Scraping for Consumer Research. Marketing Science Institute Working Paper Series 2021. Report No. 21-101. URL: https://www.msi.org/working-papers/fields-of-gold-web-scraping-for-consumer-research/ (Date of Access: 23.10.2022)
2. Kasereka, Henrys. (2020). Importance of web scraping in e-commerce and e-marketing. SSRN Electronic Journal. 10.6084/m9.figshare.13611395.v1. URL: https://www.researchgate.net/publication/347999311_Importance_of_web_scraping_in_e-commerce_and_e-marketing (Date of Access: 23.10.2022)
3. Digital 2022: October global statshot report. URL: https://datareportal.com/reports/digital-2022-october-global-statshot (Date of Access: 23.10.2022)
4. Hiremath, Omkar. (2022). A Beginner’s Guide to learn web scraping with python! URL: https://www.edureka.co/blog/web-scraping-with-python/ (Date of Access: 23.10.2022)
5. Web scraping – definition. (2022). Dictionary LLC. URL: https://www.dictionary.com/browse/web-scraping (Date of Access: 23.10.2022)
the end!
do you have any questions?
zhamaur@gmail.com
CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, infographics & images by Freepik
