web-scraping: big data and non-reactive research
@ aiur zhamsaranov
(m9 marketing research in international business) 2
Borah, Abhishek & Boegershausen, Johannes & Stephen, Andrew. (2021). Fields
of Gold: Web Scraping for Consumer Research
what is web-scraping?
the process of designing and deploying code that automatically extracts and parses information from websites (Borah, 2021)
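The definition above can be illustrated with a minimal sketch. In a real scraper the HTML would come from a request to a live site; here a small invented inline document stands in for the downloaded page, so the tag and class names are placeholders, not from the slides:

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for a page fetched with requests.get(url).text
html = """
<html><body>
  <h1>Best Sellers</h1>
  <span class="item">Phone A <span class="price">$199.99</span></span>
  <span class="item">Phone B <span class="price">$49.00</span></span>
</body></html>
"""

# Parse the document and extract information automatically
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text(strip=True)
prices = [s.get_text() for s in soup.find_all("span", {"class": "price"})]
```

The same two moves, locating elements and pulling out their text, are what the Amazon example later in the slides does at larger scale.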
"web scraping is wizardry"
@ Ryan Mitchell
6,37 h/day spent online by the average consumer world-wide
Source: Digital 2022: October global statshot report
127,251,840 websites
5,075,587,500 internet users
2,500,000,000,000,000,000 bytes of data are created every day
types of data
- textual: reviews, tweets, articles
- numeric: star ratings, final auction price, number of followers
- visual: avatars, airbnb apartment photos, Instagram photos
- metadata
Source: Borah, 2021
web-scraping: roadmap
01. can data from the web inform the research question?
    - no: stop (web data will not help)
    - yes: continue
02. is there a preexisting dataset?
    - yes: is the dataset sufficient for addressing the research question?
        - yes: use the preexisting dataset
        - no: proceed as if no dataset existed
    - no: does the target website offer easy access to its content?
        - yes: are the APIs good for the research?
            - yes: use APIs
            - no: use web-scraping!
        - no: use web-scraping!
make or buy?
coding                  software
open source             closed source
free                    rarely free
more powerful           limited
steep learning curve    easy to learn
Source: Kasereka, 2020

example: scraping Amazon's best-seller list

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime

now = datetime.now()
current_time = now.strftime("%H%M%S")

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/84.0",
           "Accept-Encoding": "gzip, deflate",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "DNT": "1",
           "Connection": "close",
           "Upgrade-Insecure-Requests": "1"}
result = requests.get("https://www.amazon.com/Best-Sellers-Cell-Phones/zgbs/wireless/7072561011",
                      headers=headers)
#result = requests.get("https://www.amazon.com/best-sellers-books-Amazon/zgbs/books", headers=headers)
print(result.status_code)
if result.status_code < 500:
    src = result.content
    soup = BeautifulSoup(src, 'lxml')
    links = soup.find_all('span', {'class': 'aok-inline-block zg-item'})
    # define dataframe (the construction of df is truncated on the original slide)
    df = ...
    # Split price to get a numeric value
    df["pricedata"] = df["pricedata"].str.replace('$', '', regex=False)
    df["pricedata"] = df["pricedata"].str.replace(',', '', regex=False)
    df['price'] = df['pricedata'].apply(lambda x: x.split('.')[0])
    df['price'] = df['price'].astype(int)
    # Add a new column
    df['currency'] = '$'
    # Find the 10 highest-priced products
    data = df.sort_values(["price"], axis=0, ascending=False)[:10]
    data['tittledata'] = data['tittledata'].str.slice(0, 20)
    # Plot
    y_pos = np.arange(len(data['tittledata']))
    # Create horizontal bars
    plt.barh(y_pos, data['price'])
    # Create names on the y-axis
    plt.yticks(y_pos, data['tittledata'])
    plt.xlabel('Amount in dollars')
    plt.ylabel('Labels')
    plt.title('10 products with the highest prices')
    # Show graphic
    plt.show()
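Because the construction of df is truncated on the slide, the cleaning and ranking steps cannot be run as-is. The following self-contained sketch reproduces exactly those steps on invented rows; the column names follow the slide, but the titles and prices are made up for illustration:

```python
import pandas as pd

# Invented rows standing in for the truncated dataframe construction
df = pd.DataFrame({
    "tittledata": ["Phone A", "Phone B", "Phone C"],
    "pricedata": ["$1,299.99", "$49.00", "$599.95"],
})

# Same cleaning as on the slide: strip '$' and ',', drop cents, cast to int
df["pricedata"] = df["pricedata"].str.replace("$", "", regex=False)
df["pricedata"] = df["pricedata"].str.replace(",", "", regex=False)
df["price"] = df["pricedata"].apply(lambda x: x.split(".")[0]).astype(int)

# Highest-priced items first (the slide keeps the top 10)
data = df.sort_values(["price"], ascending=False)[:10]
```

Note the regex=False: with the default regex interpretation, '$' would be treated as an end-of-string anchor rather than a literal dollar sign.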
web-scraping: steps
1. robots.txt
2. html loading
3. extracting
4. selecting
5. output
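The five steps can be sketched in Python. Step 1 is shown executable below; the robots.txt rules are an invented inline example rather than a fetched file, and the remaining steps are indicated as comments pointing at the code slide:

```python
from urllib.robotparser import RobotFileParser

# 1. robots.txt: check whether a path may be crawled at all.
#    Normally the rules are read from https://<site>/robots.txt;
#    here an invented rule set is parsed inline for illustration.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])
allowed = robots.can_fetch("*", "https://example.com/zgbs/wireless")

# 2. html loading: requests.get(url, headers=...) as on the code slide
# 3. extracting:   BeautifulSoup(result.content, 'lxml')
# 4. selecting:    soup.find_all('span', {'class': ...})
# 5. output:       e.g. a pandas DataFrame, a plot, or DataFrame.to_csv(...)
```

Checking robots.txt first is what separates polite scraping from abusive crawling, which is also the point of the question slide below.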
example
(screenshots of the example are not recoverable from the extraction)
Source: Edureka, 2022
question
what is the most important thing a researcher should do before web scraping?
references
1. Borah, Abhishek & Boegershausen, Johannes & Stephen, Andrew. (2021). Fields of Gold: Web Scraping for Consumer Research. Marketing Science Institute Working Paper Series 2021. Report No. 21-101. URL: https://www.msi.org/working-papers/fields-of-gold-web-scraping-for-consumer-research/ (Date of Access: 23.10.2022)
2. Kasereka, Henrys. (2020). Importance of web scraping in e-commerce and e-marketing. SSRN Electronic Journal. 10.6084/m9.figshare.13611395.v1. URL: https://www.researchgate.net/publication/347999311_Importance_of_web_scraping_in_e-commerce_and_e-marketing (Date of Access: 23.10.2022)
3. Digital 2022: October global statshot report. URL: https://datareportal.com/reports/digital-2022-october-global-statshot (Date of Access: 23.10.2022)
4. Hiremath, Omkar. (2022). A Beginner’s Guide to learn web scraping with python! URL: https://www.edureka.co/blog/web-scraping-with-python/ (Date of Access: 23.10.2022)
5. Web scraping – definition. (2022). Dictionary LLC. URL: https://www.dictionary.com/browse/web-scraping (Date of Access: 23.10.2022)
the end!
do you have any questions?
zhamaur@gmail.com