Web Scraping

(m9 marketing research in international business) 03.11.22

web-scraping: big data and non-reactive research
@ aiur zhamsaranov

Borah, Abhishek & Boegershausen, Johannes & Stephen, Andrew. (2021). Fields of Gold: Web Scraping for Consumer Research

Johannes Boegershausen, Hannes Datta, Abhishek Borah, Andrew Stephen



what is web-scraping?
the process of designing and deploying code that automatically extracts and parses information from websites (Borah, 2021)
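The "extracts and parses" part of this definition can be illustrated with a minimal sketch. It parses a small, made-up HTML snippet using only Python's standard library, so it runs without a network connection. The markup, the class name `title`, and the helper names `TitleParser` and `extract_titles` are assumptions for illustration and are not taken from any real site.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<ul>
  <li><span class="title">Phone A</span><span class="price">$199.99</span></li>
  <li><span class="title">Phone B</span><span class="price">$1,299.00</span></li>
</ul>
"""


class TitleParser(HTMLParser):
    """Collects the text of every <span class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and dict(attrs).get("class") == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


def extract_titles(html):
    """Parse the HTML string and return the list of product titles."""
    parser = TitleParser()
    parser.feed(html)
    return parser.titles


print(extract_titles(SAMPLE_HTML))
```

Libraries such as BeautifulSoup do the same job with less code; the standard-library parser is used here only to keep the sketch dependency-free.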

“if programming is magic, then web scraping is wizardry”
@ Ryan Mitchell

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime

# Browser-like request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/84.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
result = requests.get("https://www.amazon.com/Best-SellersCell-Phones/zgbs/wireless/7072561011", headers=headers)
#result = requests.get("https://www.amazon.com/best-sellersbooks-Amazon/zgbs/books", headers=headers)
print(result.status_code)
if result.status_code < 500:
    src = result.content
    soup = BeautifulSoup(src, 'lxml')
    # Select the best-seller items on the page
    links = soup.find_all('span', {'class': 'aok-inline-block zgitem'})
    #define dataframe
    df = ...  # cut off on the slide; the code below expects 'tittledata' and 'pricedata' columns

now = datetime.now()
current_time = now.strftime("%H%M%S")
#Split price to get a numeric number
# regex=False so that '$' is removed literally instead of being read as a regex anchor
df["pricedata"] = df["pricedata"].str.replace('$', '', regex=False)
df["pricedata"] = df["pricedata"].str.replace(',', '', regex=False)
df['price'] = df['pricedata'].apply(lambda x: x.split('.')[0])
df['price'] = df['price'].astype(int)
#Add a new column
df['currency'] = '$'
#Let's find the 10 highest-priced products
data = df.sort_values(["price"], axis=0, ascending=False)[:10].copy()
data['tittledata'] = data['tittledata'].str.slice(0, 20)
#Plot
y_pos = np.arange(len(data['tittledata']))
# Create horizontal bars
plt.barh(y_pos, data['price'], label='price')
# Create names on the y-axis
plt.yticks(y_pos, data['tittledata'])
plt.xlabel('Amount in dollars')
plt.ylabel('Labels')
plt.title('10 products with the highest prices')
plt.legend()
# Show graphic
plt.show()
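The price-cleaning and ranking steps of the script can be tried in isolation. The following is a minimal sketch in plain Python, assuming price strings shaped like `$1,299.99`. The sample rows and the helper name `price_to_int` are made up for illustration. It mirrors the script's approach: drop `$` and `,`, keep the part before the decimal point, convert to an integer, then sort and keep the top 10.

```python
def price_to_int(pricedata):
    """Turn a price string such as '$1,299.99' into whole dollars (1299)."""
    cleaned = pricedata.replace("$", "").replace(",", "")
    return int(cleaned.split(".")[0])


# Hypothetical scraped rows: (title, raw price string)
rows = [
    ("Phone A", "$199.99"),
    ("Phone B", "$1,299.00"),
    ("Phone C", "$49.50"),
]

# Sort by numeric price, highest first, and keep at most 10 rows
top = sorted(rows, key=lambda row: price_to_int(row[1]), reverse=True)[:10]

for title, raw_price in top:
    # Titles are shortened to 20 characters, as in the script
    print(title[:20], price_to_int(raw_price))
```

In pandas, `Series.str.replace('$', '')` treats `$` as a regular expression in versions before 2.0, so passing `regex=False` is the safer form when the same cleaning is done on a DataFrame column.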

6 h 37 min/day
spent online by the average consumer worldwide
Source: Digital 2022: October global statshot report

web-scraping: big data

244m reviews
556k projects
> 1b reviews
500m/day
Source: Borah, 2021

127,251,840
websites

5,075,587,500
internet users
2,500,000,000,000,000,000
bytes of data are created every day

Source: Digital 2022: October global statshot report



web-scraping: non-reactive research

non-reactive research:
the extraction and copying of data from a website into a structured format using a computer program
Source: Dictionary.com, 2022

advantage:
scraping gives numbers and also sentiment and behavioral analysis, so a business can learn its audience types and which ads they want to see
(Borah, 2021)

types of data

textual
reviews, tweets, articles

numeric
star ratings, final auction price, number of followers

visual
avatars, Airbnb apartment photos, Instagram photos

metadata
Source: Borah, 2021
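One way to keep these four data types together for a single scraped item is a small record type. The sketch below is an assumption for illustration: the class name `ScrapedReview`, its field names, and the sample values are hypothetical and do not come from the cited paper.

```python
from dataclasses import dataclass, field


@dataclass
class ScrapedReview:
    """One scraped review, with one field per data type from the slide."""
    text: str                                        # textual: the review body
    star_rating: float                               # numeric: the rating given
    photo_urls: list = field(default_factory=list)   # visual: links to images
    metadata: dict = field(default_factory=dict)     # metadata: e.g. scrape date


# Hypothetical example record
review = ScrapedReview(
    text="Great battery life",
    star_rating=4.5,
    photo_urls=["https://example.com/photo1.jpg"],
    metadata={"scraped_at": "2022-10-23"},
)

print(review)
```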

different ways to harvest data


using preexisting web datasets

creating novel datasets from web (API)

web-scraping (software or self-coding)

Source: Borah, 2021



web-scraping: roadmap
01
can data from the web inform the research question?
  no: experiments, ethnography, surveys
  yes: what is the purpose of the study?
    illustration
    part of a multi-study package
    primary source of data

Source: Borah, 2021



02
is there a preexisting dataset?
  yes: is the dataset sufficient for addressing the research question?
    yes: use preexisting dataset
    no: does the target website offer easy access to its content?
  no: does the target website offer easy access to its content?
    yes: are the APIs good for the research?
      yes: use APIs
      no: use web-scraping!
    no: use web-scraping!

Source: Borah, 2021
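The decision flow on this slide can be written as a short function. This is a sketch under one assumption: every "no" branch falls through to the next question and ends at web-scraping. The slide names only three outcomes and does not show the arrows explicitly. The function name `choose_data_source` and its parameter names are hypothetical.

```python
def choose_data_source(has_dataset, dataset_sufficient, easy_access, apis_good):
    """Return the data source suggested by the roadmap.

    Assumption: an insufficient dataset leads on to the website questions,
    and a website without easy access or good APIs leads to web-scraping.
    """
    if has_dataset and dataset_sufficient:
        return "use preexisting dataset"
    if easy_access and apis_good:
        return "use APIs"
    return "use web-scraping!"


# A dataset exists but does not answer the question; the site has a good API
print(choose_data_source(True, False, True, True))
# No dataset and no easy access to the site's content
print(choose_data_source(False, False, False, False))
```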



make or buy?

coding                 software
open source            closed source
free                   rarely free
more powerful          limited
steep learning curve   easy to learn

Source: Kasereka, 2020

web-scraping steps
1 robots.txt
2 html loading
3 extracting
4 selecting
5 output

Source: Kasereka, 2020
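Steps 1 and 5 can be sketched with Python's standard library alone. The example below checks a robots.txt policy before fetching and writes results as CSV. The robots.txt content, the user agent name `my-research-bot`, the example.com URLs, and the helper names `is_allowed` and `to_csv` are assumptions for illustration. A real scraper would download the site's own robots.txt and would load, extract, and select the HTML (steps 2 to 4) in between.

```python
import csv
import io
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; normally fetched from <site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Step 1: check whether robots.txt permits this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def to_csv(rows, header):
    """Step 5: serialize the selected rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


print(is_allowed(ROBOTS_TXT, "my-research-bot", "https://example.com/products"))
print(is_allowed(ROBOTS_TXT, "my-research-bot", "https://example.com/private/data"))
print(to_csv([("Phone A", 199), ("Phone B", 1299)], ["title", "price"]))
```

Note that robots.txt covers only crawler access; a site's terms of use may add further restrictions.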



example
01 (Source: Edureka, 2022)
02
03

question
what is the most important
thing a researcher should do
before web scraping?

Source: Altapress, 2022



references
1. Borah, Abhishek & Boegershausen, Johannes & Stephen, Andrew. (2021). Fields of Gold: Web Scraping for Consumer Research. Marketing Science Institute Working Paper Series 2021. Report No. 21-101. URL: https://www.msi.org/working-papers/fields-of-gold-web-scraping-for-consumer-research/ (Date of Access: 23.10.2022)
2. Kasereka, Henrys. (2020). Importance of web scraping in e-commerce and e-marketing. SSRN Electronic Journal. 10.6084/m9.figshare.13611395.v1. URL: https://www.researchgate.net/publication/347999311_Importance_of_web_scraping_in_e-commerce_and_e-marketing (Date of Access: 23.10.2022)
3. Digital 2022: October global statshot report. URL: https://datareportal.com/reports/digital-2022-october-global-statshot (Date of Access: 23.10.2022)
4. Hiremath, Omkar. (2022). A Beginner’s Guide to learn web scraping with python! URL: https://www.edureka.co/blog/web-scraping-with-python/ (Date of Access: 23.10.2022)
5. Web scraping – definition. (2022). Dictionary LLC. URL: https://www.dictionary.com/browse/web-scraping (Date of Access: 23.10.2022)

the end!
do you have any questions?
zhamaur@gmail.com

CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, infographics & images by Freepik
