4251 Assignment 2



Student Name: Renu Tamsekar


C Number: C22020111255
Date: 5/2/24

Aim: Web Scraping and Data Visualization using Python

Brief Information: Web Scraping and Data Visualization are two powerful techniques in the
realm of data science and analysis.

Basics of Web Scraping:


Web scraping is the process of extracting data from websites. It involves fetching the
HTML code of a webpage and then parsing it to extract the desired information. Key
concepts include:

1. HTTP Requests: Fetching webpage content using libraries like `requests`.

2. HTML Parsing: Extracting relevant data from HTML using libraries like `Beautiful
Soup` or `lxml`.

3. Navigating DOM: Understanding the structure of HTML to locate specific elements


containing desired data.

4. Handling Dynamic Content: Techniques like using Selenium for scraping dynamic
websites.

5. Respecting Robots.txt: Adhering to website rules and policies.
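
To make steps 2 and 3 above concrete, here is a minimal sketch of HTML parsing and DOM navigation with Beautiful Soup. An inline HTML snippet stands in for a fetched page so the sketch runs offline; in practice the markup would come from `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Inline HTML snippet used in place of a live request (illustration only)
html_page = """
<html><body>
  <table id="stats">
    <tr><th>Country</th><th>Cases</th></tr>
    <tr><td>USA</td><td>1000</td></tr>
    <tr><td>India</td><td>800</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html_page, "html.parser")

# Navigate the DOM: locate the table by id, then walk its rows and cells
table = soup.find("table", id="stats")
rows = [[cell.text for cell in row.find_all("td")]
        for row in table.find_all("tr")[1:]]  # skip the header row
print(rows)  # [['USA', '1000'], ['India', '800']]
```

The same `find` / `find_all` pattern is what the full scraping code later in this assignment uses against the live Worldometers table.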

Basics of Data Visualization:

Data visualization is the graphical representation of data to uncover insights and patterns.
It involves creating visualizations such as charts, graphs, and maps to communicate
information effectively. Key concepts include:

1. Types of Visualizations: Understanding when to use different types of charts like bar
charts, line charts, scatter plots, etc.

2. Choosing the Right Visualization: Selecting appropriate visualizations based on the data
and the insights you want to convey.

3. Color Theory and Design: Understanding color schemes and design principles to create
visually appealing and informative visuals.

4. Interactivity: Adding interactivity to visualizations for better exploration and
understanding.

5. Storytelling with Data: Presenting data in a narrative format to convey a clear message.

Various graphs and their usage:

1. Bar Chart: Comparing categories of data.

2. Line Chart: Showing trends over time.

3. Pie Chart: Displaying proportions of a whole.

4. Scatter Plot: Visualizing relationships between two variables.

5. Histogram: Representing the distribution of a continuous variable.

6. Heatmap: Displaying data in a matrix format with colors.

Important Libraries used in Python Code:

1. Web Scraping:

- `requests`: For making HTTP requests.

- `Beautiful Soup` or `lxml`: For parsing HTML.

- `Selenium`: For scraping dynamic websites.

2. Data Visualization:

- `matplotlib`: Basic plotting library in Python.

- `Seaborn`: Statistical data visualization library based on matplotlib.

- `Plotly`: Interactive visualization library.

- `Bokeh`: Interactive visualization library with a focus on web-ready plotting.

- `Altair`: Declarative statistical visualization library.


Website used for scraping: "https://www.worldometers.info/coronavirus/"

Code For Web Scraping and Visualization:

import requests
import pandas as pd
from bs4 import BeautifulSoup

def data_Downloader():
    url = "https://www.worldometers.info/coronavirus/"
    html_page = requests.get(url).text
    soup = BeautifulSoup(html_page, "lxml")

    get_table = soup.find("table", id="main_table_countries_today")
    get_table_data = get_table.tbody.find_all("tr")

    dic = {}
    for i in range(len(get_table_data)):
        try:
            key = get_table_data[i].find_all("a", href=True)[0].string
        except IndexError:
            # Rows without a country link (e.g. the "World" row) have no <a> tag
            key = get_table_data[i].find_all("td")[0].string
        value = [j.string for j in get_table_data[i].find_all("td")]
        dic[key] = value  # store the row's cell values

    column_names = ["Country", "Total Cases", "New Cases", "Total Deaths",
                    "New Deaths", "Total Recovered", "New Recovered",
                    "Active Cases", "Serious Critical", "Tot Cases", "Deaths",
                    "Total Tests", "Tests", "Population"]

    table = pd.DataFrame(dic).iloc[2:, :]
    table = table.reset_index(drop=True).T.iloc[:, :14]
    table.index.name = "Country"
    table.columns = column_names
    table = table.rename(index={None: "World"})
    table.head(20)
    table.to_csv("Corona Data.csv")
    print('code worked')

data_Downloader()

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Function to scrape data
def scrape_corona_data():
    url = "https://www.worldometers.info/coronavirus/"
    html_page = requests.get(url).text
    soup = BeautifulSoup(html_page, "html.parser")

    # Find the table containing the data
    table = soup.find("table", id="main_table_countries_today")

    # Extract column names
    columns = [header.text.strip() for header in table.find_all("th")]

    # Extract data rows
    rows = []
    for row in table.find_all("tr")[1:]:
        row_data = [data.text.strip() for data in row.find_all("td")]
        rows.append(row_data)

    # Create DataFrame
    df = pd.DataFrame(rows, columns=columns)

    # Clean data
    df.replace('', 'N/A', inplace=True)
    return df

# Scrape data
corona_data = scrape_corona_data()

# Display the data
corona_data.head()

# Save data to CSV in Google Colab
corona_data.to_csv('/content/drive/MyDrive/Corona_Data.csv', index=False)

import pandas as pd
import matplotlib.pyplot as plt

# Read CSV file
df = pd.read_csv('/content/drive/MyDrive/Corona_Data.csv')  # Change the path to your CSV file

# Drop rows with NaN values in the 'Country,Other' column
df_cleaned = df.dropna(subset=['Country,Other']).copy()

# Remove commas and convert 'TotalCases' to numeric so the sort is numeric, not alphabetical
df_cleaned['TotalCases'] = (
    df_cleaned['TotalCases'].astype(str).str.replace(',', '').astype(int)
)

# Sort the DataFrame by 'TotalCases' in descending order and keep the top 10
df_sorted = df_cleaned.sort_values(by='TotalCases', ascending=False)
top_10_countries = df_sorted.head(10)

# Re-sort ascending so the bars grow from smallest to largest
top_10_countries = top_10_countries.sort_values(by='TotalCases', ascending=True)

# Print the top 10 rows
print("Top 10 rows of 'Country,Other' after sorting and removing NaN:")
print(top_10_countries[['Country,Other', 'TotalCases']])

# Plotting
plt.figure(figsize=(12, 6))
plt.bar(top_10_countries['Country,Other'], top_10_countries['TotalCases'], color='blue')
plt.xlabel('Country/Region')
plt.ylabel('Total Confirmed Cases')
plt.title('Top 10 COVID-19 Confirmed Cases by Country (Ascending Order)')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better visibility
plt.tight_layout()
plt.show()

import pandas as pd
import matplotlib.pyplot as plt

# Read CSV file
df = pd.read_csv('/content/drive/MyDrive/Corona_Data.csv')  # Change the path to your CSV file

# Drop rows with NaN values in the 'Country,Other' column
df_cleaned = df.dropna(subset=['Country,Other']).copy()

# Remove commas and convert 'TotalCases' to numeric before sorting
df_cleaned['TotalCases'] = (
    df_cleaned['TotalCases'].astype(str).str.replace(',', '').astype(int)
)

# Sort the DataFrame by 'TotalCases' in descending order
df_sorted = df_cleaned.sort_values(by='TotalCases', ascending=False)

# Select the first 10 rows after sorting
top_10_countries = df_sorted.head(10)

# Plotting
plt.figure(figsize=(8, 8))
plt.pie(top_10_countries['TotalCases'], labels=top_10_countries['Country,Other'],
        autopct='%1.1f%%', startangle=140)
plt.title('Distribution of COVID-19 Cases by Country')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()
Data Visualization:
1. Sample Screenshot of .csv file created

2. Visualization Graphs (minimum 2)


Conclusion:
Web Scraping in Python involves extracting data from websites, while Data Visualization is the
process of representing data graphically. Python offers powerful libraries like `requests` and
`Beautiful Soup` for web scraping, and `matplotlib`, `Seaborn`, and others for data visualization.
Combining these techniques allows for the extraction, analysis, and presentation of valuable
insights from web data, facilitating informed decision-making across various domains.

Applications of Data Scraping:


Data scraping finds applications in market research, lead generation, competitor analysis,
financial analysis, real estate, academic research, job market analysis, social media monitoring,
weather forecasting, and government/public data analysis. It helps automate data collection from
various sources, enabling organizations to gather insights for decision-making and analysis in
diverse domains.
