Duplication - Typecasting-Problem Statement Answer

Duplication & Typecasting
Instructions:
Please fill in your answers inline in the Word document. Submit Python code and R code
files wherever applicable.
Please ensure you update all the details:

Name: SANCHARI DEY
Batch Id: 1503_2024
Topic: Preliminaries for Data Analysis
Problem statement:
Collected data may contain duplicate entries, for example because the data were not
collected at regular intervals, or for other reasons. Building a reliable solution on such
data is difficult. The common treatments are either to remove the duplicates
completely or to substitute the affected values with logical data. There are various
techniques for treating these types of problems.
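The two treatments named above can be sketched on a toy frame. The rows below are made up for illustration and are not taken from the assignment data. `drop_duplicates` removes exact repeats, while the `groupby` aggregation is only one possible reading of "substituting with logical data".

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are exact repeats.
orders = pd.DataFrame({
    "InvoiceNo": [1001, 1002, 1002, 1003],
    "StockCode": ["A1", "B2", "B2", "C3"],
    "Quantity": [6, 2, 2, 8],
})

# Option 1: remove exact duplicate rows, keeping the first occurrence.
deduplicated = orders.drop_duplicates(keep="first").reset_index(drop=True)

# Option 2: collapse repeats into one row per key and combine the values
# (here by summing Quantity) instead of discarding them.
aggregated = orders.groupby(["InvoiceNo", "StockCode"], as_index=False)["Quantity"].sum()

print(deduplicated)
print(aggregated)
```

Which option is appropriate depends on whether a repeated row is a recording error (drop it) or a genuine second transaction (combine it).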
Q1. For the given dataset, perform type casting (convert the data types, e.g. float to int).
Q2. Check for duplicate values and handle them (e.g. drop them).
Q3. Perform exploratory data analysis (EDA), such as a histogram, boxplot, scatterplot, etc.
InvoiceNo  StockCode  Description                          Quantity  InvoiceDate     UnitPrice  CustomerID  Country
536365     85123A     WHITE HANGING HEART T-LIGHT HOLDER   6         12/1/2010 8:26  2.55       17850       United Kingdom
536365     71053      WHITE METAL LANTERN                  6         12/1/2010 8:26  3.39       17850       United Kingdom
536365     84406B     CREAM CUPID HEARTS COAT HANGER       8         12/1/2010 8:26  2.75       17850       United Kingdom
536365     84029G     KNITTED UNION FLAG HOT WATER BOTTLE  6         12/1/2010 8:26  3.39       17850       United Kingdom
536365     84029E     RED WOOLLY HOTTIE WHITE HEART.       6         12/1/2010 8:26  3.39       17850       United Kingdom
536365     22752      SET 7 BABUSHKA NESTING BOXES         2         12/1/2010 8:26  7.65       17850       United Kingdom
536365     21730      GLASS STAR FROSTED T-LIGHT HOLDER    6         12/1/2010 8:26  4.25       17850       United Kingdom
536366     22633      HAND WARMER UNION JACK               6         12/1/2010 8:28  1.85       17850       United Kingdom
536366     22632      HAND WARMER RED POLKA DOT            6         12/1/2010 8:28  1.85       17850       United Kingdom
© 360DigiTMG. All Rights Reserved.
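The type casting asked for in Q1 can be tried on a few rows from the sample table before touching the full file. This is a minimal sketch under stated assumptions: Quantity and CustomerID are entered as floats and one CustomerID is set to missing purely to illustrate the casts, and the month/day/year reading of InvoiceDate is assumed from the sample values.

```python
import pandas as pd

# Three rows based on the sample table. The float storage and the missing
# CustomerID are assumptions made for illustration, not facts about the file.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366"],
    "Quantity": [6.0, 8.0, 6.0],
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 8:26", "12/1/2010 8:28"],
    "UnitPrice": [2.55, 2.75, 1.85],
    "CustomerID": [17850.0, None, 17850.0],
})

# float -> int is safe here because Quantity has no missing values.
sample["Quantity"] = sample["Quantity"].astype("int64")

# The nullable Int64 dtype keeps a missing ID as <NA> instead of raising.
sample["CustomerID"] = sample["CustomerID"].astype("Int64")

# Parse the timestamp text, assuming month/day/year order.
sample["InvoiceDate"] = pd.to_datetime(sample["InvoiceDate"], format="%m/%d/%Y %H:%M")

print(sample.dtypes)
```

A plain `astype(int)` on a column that contains missing values raises an error, which is why the nullable `Int64` dtype is used for CustomerID.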
Hints:
For each assignment, the solution should be submitted in the format below.
1. Work on each feature of the dataset to create a data dictionary as displayed in the
image below:
[Image: sample data dictionary]
2. Consider the OnlineRetail.csv dataset.
3. Research and perform all possible steps for obtaining the solution.
4. All code (executable programs) should execute without errors.
5. Code modularization should be followed.
6. Each line of code should have comments explaining the logic and why you are using that
function.
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
data = pd.read_csv('OnlineRetail.csv')
# Display the first few rows of the dataset
print(data.head())
# Type casting: Convert float columns to int where appropriate
# Assuming Quantity and CustomerID should be integers
data['Quantity'] = data['Quantity'].astype(int)
data['CustomerID'] = data['CustomerID'].astype('Int64') # Int64 to handle possible NaNs
# Check the data types after type casting
print("Data types after type casting:")
print(data.dtypes)
# Check for duplicate values
duplicate_rows = data[data.duplicated()]
print(f"Number of duplicate rows: {duplicate_rows.shape[0]}")
# Drop duplicate values
data = data.drop_duplicates()
# Confirm that duplicates are removed
print(f"Number of rows after dropping duplicates: {data.shape[0]}")
# Basic Exploratory Data Analysis (EDA)

# 1. Histogram for Quantity
plt.figure(figsize=(10, 6))
plt.hist(data['Quantity'], bins=50, edgecolor='k')
plt.title('Histogram of Quantity')
plt.xlabel('Quantity')
plt.ylabel('Frequency')
plt.show()
# 2. Boxplot for UnitPrice
# Pass the column as x so the box is drawn horizontally, matching the x-axis label
sns.boxplot(x=data['UnitPrice'])
plt.title('Boxplot of UnitPrice')
plt.xlabel('UnitPrice')
plt.show()
# 3. Scatterplot for Quantity vs. UnitPrice
plt.scatter(data['Quantity'], data['UnitPrice'], alpha=0.5)
plt.title('Scatterplot of Quantity vs. UnitPrice')
plt.xlabel('Quantity')
plt.ylabel('UnitPrice')
plt.show()
# 4. Count plot for Country
sns.countplot(y=data['Country'], order=data['Country'].value_counts().index)
plt.title('Count plot of Country')
plt.xlabel('Frequency')
plt.ylabel('Country')

plt.show()
# Function to build the data dictionary
def data_dictionary(df):
    # DataFrame.append was removed in pandas 2.0, so collect one record
    # per column in a list and build the frame in a single step
    rows = [
        {
            "Column Name": column,
            "Data Type": df[column].dtype,
            "Description": "Description of " + column,
        }
        for column in df.columns
    ]
    return pd.DataFrame(rows, columns=["Column Name", "Data Type", "Description"])
# Display the data dictionary
data_dict = data_dictionary(data)
print(data_dict)
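The hint on code modularization could be met by wrapping each step in a small function. The following is one possible arrangement, not the required structure: the function names `cast_types`, `remove_duplicates`, and `run_pipeline` are hypothetical, and the inline CSV text (two sample rows plus one deliberate repeat) stands in for OnlineRetail.csv so the sketch runs on its own.

```python
import io
import pandas as pd

def cast_types(df):
    """Return a copy with CustomerID as nullable Int64 and InvoiceDate as datetime."""
    out = df.copy()
    out["CustomerID"] = out["CustomerID"].astype("Int64")
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"], format="%m/%d/%Y %H:%M")
    return out

def remove_duplicates(df):
    """Return the de-duplicated frame and the number of rows dropped."""
    n_dupes = int(df.duplicated().sum())
    return df.drop_duplicates().reset_index(drop=True), n_dupes

def run_pipeline(source):
    """Load a CSV from a path or file-like object, then cast and de-duplicate."""
    raw = pd.read_csv(source)
    return remove_duplicates(cast_types(raw))

# Stand-in for OnlineRetail.csv: the last row repeats the second on purpose.
csv_text = """InvoiceNo,StockCode,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,71053,6,12/1/2010 8:26,3.39,17850,United Kingdom
536366,22633,6,12/1/2010 8:28,1.85,17850,United Kingdom
536366,22633,6,12/1/2010 8:28,1.85,17850,United Kingdom
"""

clean, dropped = run_pipeline(io.StringIO(csv_text))
print(f"Rows kept: {len(clean)}, duplicates dropped: {dropped}")
```

To run it on the real file, `run_pipeline('OnlineRetail.csv')` would replace the `io.StringIO` call, assuming the file uses the same column names and date format as the sample.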
