4251 Assignment 2



Student Name: Renu Tamsekar


C Number: C22020111255
Date: 5/2/24

Aim: Web Scraping and Data Visualization using Python

Brief Information: Web Scraping and Data Visualization are two powerful techniques in the
realm of data science and analysis.

Basics of Web Scraping:


Web scraping is the process of extracting data from websites. It involves fetching the
HTML code of a webpage and then parsing it to extract the desired information. Key
concepts include:

1. HTTP Requests: Fetching webpage content using libraries like `requests`.

2. HTML Parsing: Extracting relevant data from HTML using libraries like `Beautiful
Soup` or `lxml`.

3. Navigating DOM: Understanding the structure of HTML to locate specific elements


containing desired data.

4. Handling Dynamic Content: Techniques like using Selenium for scraping dynamic
websites.

5. Respecting Robots.txt: Adhering to website rules and policies.
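
To make steps 2 and 3 above concrete, here is a minimal sketch of HTML parsing and DOM navigation with Beautiful Soup. An inline HTML snippet stands in for a fetched page so the sketch runs offline; in practice the markup would come from `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Inline HTML snippet used in place of a live request (illustration only)
html_page = """
<html><body>
  <table id="stats">
    <tr><th>Country</th><th>Cases</th></tr>
    <tr><td>USA</td><td>1000</td></tr>
    <tr><td>India</td><td>800</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html_page, "html.parser")

# Navigate the DOM: locate the table by id, then walk its rows and cells
table = soup.find("table", id="stats")
rows = [[cell.text for cell in row.find_all("td")]
        for row in table.find_all("tr")[1:]]  # skip the header row
print(rows)  # [['USA', '1000'], ['India', '800']]
```

The same `find` / `find_all` pattern is what the full scraping code later in this assignment uses against the live Worldometers table.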

Basics of Data Visualization:

Data visualization is the graphical representation of data to uncover insights and patterns.
It involves creating visualizations such as charts, graphs, and maps to communicate
information effectively. Key concepts include:

1. Types of Visualizations: Understanding when to use different types of charts like bar
charts, line charts, scatter plots, etc.

2. Choosing the Right Visualization: Selecting appropriate visualizations based on the data
and the insights you want to convey.

3. Color Theory and Design: Understanding color schemes and design principles to create
visually appealing and informative visuals.

4. Interactivity: Adding interactivity to visualizations for better exploration and
understanding.

5. Storytelling with Data: Presenting data in a narrative format to convey a clear message.

Various graphs and their usage:

1. Bar Chart: Comparing categories of data.

2. Line Chart: Showing trends over time.

3. Pie Chart: Displaying proportions of a whole.

4. Scatter Plot: Visualizing relationships between two variables.

5. Histogram: Representing the distribution of a continuous variable.

6. Heatmap: Displaying data in a matrix format with colors.

Important Libraries used in Python Code:

1. Web Scraping:

- `requests`: For making HTTP requests.

- `Beautiful Soup` or `lxml`: For parsing HTML.

- `Selenium`: For scraping dynamic websites.

2. Data Visualization:

- `matplotlib`: Basic plotting library in Python.

- `Seaborn`: Statistical data visualization library based on matplotlib.

- `Plotly`: Interactive visualization library.

- `Bokeh`: Interactive visualization library with a focus on web-ready plotting.

- `Altair`: Declarative statistical visualization library.


Website used for scraping: "https://www.worldometers.info/coronavirus/"

Code For Web Scraping and Visualization:

import requests
import pandas as pd
from bs4 import BeautifulSoup

def data_Downloader():
    url = "https://www.worldometers.info/coronavirus/"
    html_page = requests.get(url).text
    soup = BeautifulSoup(html_page, "lxml")

    get_table = soup.find("table", id="main_table_countries_today")
    get_table_data = get_table.tbody.find_all("tr")

    dic = {}
    for i in range(len(get_table_data)):
        try:
            key = get_table_data[i].find_all("a", href=True)[0].string
        except IndexError:
            # Rows without a country link (e.g. the "World" row) have no <a> tag
            key = get_table_data[i].find_all("td")[0].string
        value = [j.string for j in get_table_data[i].find_all("td")]
        dic[key] = value  # store the row's cell values

    column_names = ["Country", "Total Cases", "New Cases", "Total Deaths",
                    "New Deaths", "Total Recovered", "New Recovered",
                    "Active Cases", "Serious Critical", "Tot Cases", "Deaths",
                    "Total Tests", "Tests", "Population"]

    table = pd.DataFrame(dic).iloc[2:, :]
    table = table.reset_index(drop=True).T.iloc[:, :14]
    table.index.name = "Country"
    table.columns = column_names
    table = table.rename(index={None: "World"})
    table.head(20)
    table.to_csv("Corona Data.csv")
    print('code worked')

data_Downloader()

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Function to scrape data
def scrape_corona_data():
    url = "https://www.worldometers.info/coronavirus/"
    html_page = requests.get(url).text
    soup = BeautifulSoup(html_page, "html.parser")

    # Find the table containing the data
    table = soup.find("table", id="main_table_countries_today")

    # Extract column names
    columns = [header.text.strip() for header in table.find_all("th")]

    # Extract data rows
    rows = []
    for row in table.find_all("tr")[1:]:
        row_data = [data.text.strip() for data in row.find_all("td")]
        rows.append(row_data)

    # Create DataFrame
    df = pd.DataFrame(rows, columns=columns)

    # Clean data
    df.replace('', 'N/A', inplace=True)
    return df

# Scrape data
corona_data = scrape_corona_data()

# Display the data
corona_data.head()

# Save data to CSV in Google Colab
corona_data.to_csv('/content/drive/MyDrive/Corona_Data.csv', index=False)

import pandas as pd
import matplotlib.pyplot as plt

# Read CSV file
df = pd.read_csv('/content/drive/MyDrive/Corona_Data.csv')  # Change the path to your CSV file

# Drop rows with NaN values in the 'Country,Other' column
df_cleaned = df.dropna(subset=['Country,Other']).copy()

# Remove commas and convert 'TotalCases' to numeric so the sort is numeric, not alphabetical
df_cleaned['TotalCases'] = (
    df_cleaned['TotalCases'].astype(str).str.replace(',', '').astype(int)
)

# Sort the DataFrame by 'TotalCases' in descending order and keep the top 10
df_sorted = df_cleaned.sort_values(by='TotalCases', ascending=False)
top_10_countries = df_sorted.head(10)

# Re-sort ascending so the bars grow from smallest to largest
top_10_countries = top_10_countries.sort_values(by='TotalCases', ascending=True)

# Print the top 10 rows
print("Top 10 rows of 'Country,Other' after sorting and removing NaN:")
print(top_10_countries[['Country,Other', 'TotalCases']])

# Plotting
plt.figure(figsize=(12, 6))
plt.bar(top_10_countries['Country,Other'], top_10_countries['TotalCases'], color='blue')
plt.xlabel('Country/Region')
plt.ylabel('Total Confirmed Cases')
plt.title('Top 10 COVID-19 Confirmed Cases by Country (Ascending Order)')
plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for better visibility
plt.tight_layout()
plt.show()

import pandas as pd
import matplotlib.pyplot as plt

# Read CSV file
df = pd.read_csv('/content/drive/MyDrive/Corona_Data.csv')  # Change the path to your CSV file

# Drop rows with NaN values in the 'Country,Other' column
df_cleaned = df.dropna(subset=['Country,Other']).copy()

# Remove commas and convert 'TotalCases' to numeric before sorting
df_cleaned['TotalCases'] = (
    df_cleaned['TotalCases'].astype(str).str.replace(',', '').astype(int)
)

# Sort the DataFrame by 'TotalCases' in descending order
df_sorted = df_cleaned.sort_values(by='TotalCases', ascending=False)

# Select the first 10 rows after sorting
top_10_countries = df_sorted.head(10)

# Plotting
plt.figure(figsize=(8, 8))
plt.pie(top_10_countries['TotalCases'], labels=top_10_countries['Country,Other'],
        autopct='%1.1f%%', startangle=140)
plt.title('Distribution of COVID-19 Cases by Country')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.tight_layout()
plt.show()
Data Visualization:
1. Sample Screenshot of .csv file created

2. Visualization Graphs (minimum 2)


Conclusion:
Web Scraping in Python involves extracting data from websites, while Data Visualization is the
process of representing data graphically. Python offers powerful libraries like `requests` and
`Beautiful Soup` for web scraping, and `matplotlib`, `Seaborn`, and others for data visualization.
Combining these techniques allows for the extraction, analysis, and presentation of valuable
insights from web data, facilitating informed decision-making across various domains.

Applications of Data Scraping:


Data scraping finds applications in market research, lead generation, competitor analysis,
financial analysis, real estate, academic research, job market analysis, social media monitoring,
weather forecasting, and government/public data analysis. It helps automate data collection from
various sources, enabling organizations to gather insights for decision-making and analysis in
diverse domains.
