Sales Data Practice Assignment


sales-data-practice-assignment

May 16, 2024

[1]: import numpy as np
import pandas as pd

[69]: df = pd.read_csv('sales_data.csv')

[70]: df.head()

[70]: TransactionID CustomerID ProductID ProductCategory Quantity UnitPrice \


0 1 1685 50 Home & Kitchen 2 61.267975
1 2 1560 64 Electronics 3 51.859550
2 3 1630 10 Clothing 9 14.871596
3 4 1193 25 Home & Kitchen 5 23.993986
4 5 1836 69 Home & Kitchen 1 58.666911

TotalPrice Timestamp Age Gender City


0 122.535950 28-02-2022 37 Male Los Angeles
1 155.578651 12-05-2022 48 Female New York
2 133.844366 24-03-2020 53 Male Phoenix
3 119.969930 15-06-2021 0 Female Houston
4 58.666911 14-11-2022 59 Female Los Angeles

1. Data Loading and Exploration:


a. How many rows and columns are there in the dataset?
[71]: num_rows = df.shape[0]
num_columns = df.shape[1]

print("Rows: ", num_rows)
print("Columns: ", num_columns)

Rows: 5000
Columns: 11
b. What are the data types of each column?
[72]: df.dtypes

[72]: TransactionID int64
CustomerID int64
ProductID int64
ProductCategory object
Quantity int64
UnitPrice float64
TotalPrice float64
Timestamp object
Age int64
Gender object
City object
dtype: object

c. Are there any missing values in the dataset? If yes, how many and in which columns?
[73]: df.isnull().sum()

[73]: TransactionID 0
CustomerID 0
ProductID 0
ProductCategory 0
Quantity 0
UnitPrice 250
TotalPrice 250
Timestamp 0
Age 0
Gender 517
City 0
dtype: int64

2. Data Manipulation:
a. How many unique customers are there in the dataset?
[74]: df["CustomerID"].nunique()

[74]: 993

b. What is the total revenue generated by the store?


[75]: df["TotalPrice"].sum()

[75]: 1436068.8463578

c. Calculate the total quantity sold for each product category.


[76]: df.groupby("ProductCategory")["Quantity"].sum()

[76]: ProductCategory
Books 5718
Clothing 5672
Electronics 5337
Home & Kitchen 5761
Sports & Outdoors 5459
Name: Quantity, dtype: int64

d. Create a new column named “AgeGroup” based on the age of customers (e.g., “18-30”, “31-45”, “46-60”, “60+”).

[77]: age_groups = [(0, 30), (31, 45), (46, 60), (61, float('inf'))]
age_group_labels = ['18-30', '31-45', '46-60', '60+']

def grouping_age(age):
    for i, (start, end) in enumerate(age_groups):
        if start <= age <= end:
            return age_group_labels[i]

df['AgeGroup'] = df['Age'].apply(grouping_age)
df['AgeGroup'].fillna('Prefer not to say', inplace=True)

C:\Users\sai.kanthamneni\AppData\Local\Temp\ipykernel_5608\3210157570.py:8:
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series
through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work
because the intermediate object on which we are setting values always behaves as
a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using


'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value)
instead, to perform the operation inplace on the original object.

df['AgeGroup'].fillna('Prefer not to say', inplace=True)
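An alternative that produces the same buckets and avoids the chained-assignment FutureWarning is pd.cut with explicit bin edges and a plain column assignment. A minimal sketch, assuming the same labels and integer ages as above:

# Same ranges as above: 0-30, 31-45, 46-60, 61+ (assumes integer ages)
bins = [0, 30, 45, 60, float('inf')]
labels = ['18-30', '31-45', '46-60', '60+']
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True)
# Assign the filled column back rather than calling fillna(..., inplace=True) on it;
# ages outside the bins (e.g. negative values) fall through to the placeholder label
df['AgeGroup'] = df['AgeGroup'].cat.add_categories(['Prefer not to say']).fillna('Prefer not to say')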

[78]: df

[78]: TransactionID CustomerID ProductID ProductCategory Quantity \


0 1 1685 50 Home & Kitchen 2
1 2 1560 64 Electronics 3
2 3 1630 10 Clothing 9
3 4 1193 25 Home & Kitchen 5
4 5 1836 69 Home & Kitchen 1
… … … … … …
4995 4996 1957 23 Home & Kitchen 6
4996 4997 1595 5 Home & Kitchen 3
4997 4998 1477 51 Books 5
4998 4999 1575 97 Clothing 9
4999 5000 1908 40 Sports & Outdoors 2

UnitPrice TotalPrice Timestamp Age Gender City AgeGroup
0 61.267975 122.535950 28-02-2022 37 Male Los Angeles 31-45
1 51.859550 155.578651 12-05-2022 48 Female New York 46-60
2 14.871596 133.844366 24-03-2020 53 Male Phoenix 46-60
3 23.993986 119.969930 15-06-2021 0 Female Houston 18-30
4 58.666911 58.666911 14-11-2022 59 Female Los Angeles 46-60
… … … … … … … …
4995 54.421139 326.526833 28-02-2020 23 Female Chicago 18-30
4996 81.503247 244.509740 16-10-2021 58 Female Phoenix 46-60
4997 91.173962 455.869812 22-08-2020 48 Male Chicago 46-60
4998 44.403988 399.635891 26-09-2020 67 Female Houston 60+
4999 89.000682 178.001364 23-06-2023 56 Male Phoenix 46-60

[5000 rows x 12 columns]

3. Data Analysis:
a. Analyze the data to identify the leading product category, determined by the cumulative
quantity of products sold within each category.
[79]: category_quantity = df.groupby('ProductCategory')['Quantity'].sum()
leading_category = category_quantity.idxmax()
leading_category

[79]: 'Home & Kitchen'

b. Examine the pricing trends within different product categories and evaluate the company’s
yearly earnings.
[80]: df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Year'] = df['Timestamp'].dt.year
avg_unit_price = df.groupby(['ProductCategory', 'Year'])['UnitPrice'].mean().reset_index()

avg_unit_price

C:\Users\sai.kanthamneni\AppData\Local\Temp\ipykernel_5608\1499781126.py:1:
UserWarning: Parsing dates in %d-%m-%Y format when dayfirst=False (the default)
was specified. Pass `dayfirst=True` or specify a format to silence this warning.
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
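Since the timestamps in this dataset follow the day-first DD-MM-YYYY pattern shown in the head above, the warning can be avoided by passing the format explicitly. A minimal sketch:

# Parse day-first dates explicitly to silence the dayfirst warning
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%d-%m-%Y')
df['Year'] = df['Timestamp'].dt.year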

[80]: ProductCategory Year UnitPrice


0 Books 2020 54.207737
1 Books 2021 57.477909
2 Books 2022 54.386839
3 Books 2023 54.952147
4 Clothing 2020 54.669198
5 Clothing 2021 52.166971

6 Clothing 2022 55.212257
7 Clothing 2023 53.014750
8 Electronics 2020 54.868886
9 Electronics 2021 52.640335
10 Electronics 2022 54.731241
11 Electronics 2023 54.408497
12 Home & Kitchen 2020 54.979156
13 Home & Kitchen 2021 52.126756
14 Home & Kitchen 2022 51.480199
15 Home & Kitchen 2023 56.653768
16 Sports & Outdoors 2020 52.634402
17 Sports & Outdoors 2021 52.765040
18 Sports & Outdoors 2022 53.909761
19 Sports & Outdoors 2023 57.490032

[81]: df['Earnings'] = df['Quantity'] * df['UnitPrice']


yearly_earnings = df.groupby(['ProductCategory', 'Year'])['Earnings'].sum().reset_index()

yearly_earnings

[81]: ProductCategory Year Earnings


0 Books 2020 71991.108635
1 Books 2021 80203.998934
2 Books 2022 74000.078012
3 Books 2023 74125.829146
4 Clothing 2020 70689.300882
5 Clothing 2021 64882.760070
6 Clothing 2022 77014.533850
7 Clothing 2023 72413.610178
8 Electronics 2020 81153.116963
9 Electronics 2021 64346.901361
10 Electronics 2022 63882.928903
11 Electronics 2023 66376.135116
12 Home & Kitchen 2020 83589.651428
13 Home & Kitchen 2021 73736.197968
14 Home & Kitchen 2022 59292.271211
15 Home & Kitchen 2023 79099.175466
16 Sports & Outdoors 2020 67842.166651
17 Sports & Outdoors 2021 64346.081552
18 Sports & Outdoors 2022 76750.709769
19 Sports & Outdoors 2023 70332.290267

c. Segment customers into quantiles according to their total expenditure to gain insights into
spending behavior.
[85]: customer_expenditure = df.groupby('CustomerID')['TotalPrice'].sum()
customer_expenditure

quantiles = pd.qcut(customer_expenditure, q=[0, 0.25, 0.5, 0.75, 1], labels=['Low', 'Medium', 'High', 'Very High'])

df['ExpenditureQuantile'] = df['CustomerID'].map(quantiles)
df.reset_index()

[85]: index TransactionID CustomerID ProductID ProductCategory \


0 0 1 1685 50 Home & Kitchen
1 1 2 1560 64 Electronics
2 2 3 1630 10 Clothing
3 3 4 1193 25 Home & Kitchen
4 4 5 1836 69 Home & Kitchen
… … … … … …
4995 4995 4996 1957 23 Home & Kitchen
4996 4996 4997 1595 5 Home & Kitchen
4997 4997 4998 1477 51 Books
4998 4998 4999 1575 97 Clothing
4999 4999 5000 1908 40 Sports & Outdoors

Quantity UnitPrice TotalPrice Timestamp Age Gender City \


0 2 61.267975 122.535950 2022-02-28 37 Male Los Angeles
1 3 51.859550 155.578651 2022-05-12 48 Female New York
2 9 14.871596 133.844366 2020-03-24 53 Male Phoenix
3 5 23.993986 119.969930 2021-06-15 0 Female Houston
4 1 58.666911 58.666911 2022-11-14 59 Female Los Angeles
… … … … … … … …
4995 6 54.421139 326.526833 2020-02-28 23 Female Chicago
4996 3 81.503247 244.509740 2021-10-16 58 Female Phoenix
4997 5 91.173962 455.869812 2020-08-22 48 Male Chicago
4998 9 44.403988 399.635891 2020-09-26 67 Female Houston
4999 2 89.000682 178.001364 2023-06-23 56 Male Phoenix

AgeGroup Year Earnings ExpenditureQuantile


0 31-45 2022 122.535950 Low
1 46-60 2022 155.578650 Very High
2 46-60 2020 133.844366 Low
3 18-30 2021 119.969930 Medium
4 46-60 2022 58.666911 Medium
… … … … …
4995 18-30 2020 326.526833 Medium
4996 46-60 2021 244.509740 High
4997 46-60 2020 455.869812 Very High
4998 60+ 2020 399.635891 Very High
4999 46-60 2023 178.001364 Low

[5000 rows x 16 columns]

d. Conduct an in-depth analysis of customer demographics and product preferences across different spending segments. Identify the top-selling product category based on the total quantity sold. Determine the average unit price for each product category. Calculate the total revenue generated per year.
[ ]:
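The cell above was left empty. One possible sketch for this question, reusing the ExpenditureQuantile, Year and other columns created earlier (the variable names below are illustrative):

# Demographic profile of each spending segment
segment_profile = df.groupby('ExpenditureQuantile').agg(
    customers=('CustomerID', 'nunique'),
    avg_age=('Age', 'mean'),
    total_quantity=('Quantity', 'sum'),
    total_revenue=('TotalPrice', 'sum'),
)

# Most purchased product category within each spending segment
category_by_segment = (
    df.groupby(['ExpenditureQuantile', 'ProductCategory'])['Quantity'].sum()
      .groupby(level='ExpenditureQuantile').idxmax()
)

# Top-selling category overall, average unit price per category, and revenue per year
top_category = df.groupby('ProductCategory')['Quantity'].sum().idxmax()
avg_price_per_category = df.groupby('ProductCategory')['UnitPrice'].mean()
revenue_per_year = df.groupby('Year')['TotalPrice'].sum()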

4. Data Aggregation:
a. Segment the data by product categories and assess the typical quantity sold and total revenue
generated for each category.
[37]: category_analysis = df.groupby("ProductCategory").agg({"Quantity": 'mean', 'TotalPrice': 'sum'})

category_analysis

[37]: Quantity TotalPrice


ProductCategory
Books 5.529981 300321.014728
Clothing 5.588177 285000.204977
Electronics 5.423780 275759.082342
Home & Kitchen 5.795775 295717.296073
Sports & Outdoors 5.610483 279271.248237

b. Aggregate the dataset by gender and determine the total sales revenue attributed to each
gender category.
[41]: gender_analysis = df.groupby("Gender").agg({"TotalPrice":'sum'})
gender_analysis

[41]: TotalPrice
Gender
Female 641115.956773
Male 643077.966904

5. Data Cleaning:
a. Handle missing values in the dataset (e.g., fill null values in the “ProductCategory” column).

[46]: mean_unit_price = df["UnitPrice"].mean()
df["UnitPrice"].fillna(mean_unit_price, inplace = True)
df

C:\Users\sai.kanthamneni\AppData\Local\Temp\ipykernel_5608\2190109594.py:2:
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series
through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work
because the intermediate object on which we are setting values always behaves as
a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using

'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value)
instead, to perform the operation inplace on the original object.

df["UnitPrice"].fillna(mean_unit_price, inplace = True)

[46]: TransactionID CustomerID ProductID ProductCategory Quantity \


0 1 1600 50 Home & Kitchen 2
1 2 1600 64 Electronics 3
2 3 1600 10 Clothing 9
3 4 1600 25 Home & Kitchen 5
4 5 1600 69 Home & Kitchen 1
… … … … … …
4995 4996 1600 23 Home & Kitchen 6
4996 4997 1600 5 Home & Kitchen 3
4997 4998 1600 51 Books 5
4998 4999 1600 97 Clothing 9
4999 5000 1600 40 Sports & Outdoors 2

UnitPrice TotalPrice Timestamp Age Gender City AgeGroup \


0 61.267975 122.535950 2022-02-28 37 Male Los Angeles 31-45
1 51.859550 155.578651 2022-05-12 48 Female New York 46-60
2 14.871596 133.844366 2020-03-24 53 Male Phoenix 46-60
3 23.993986 119.969930 2021-06-15 0 Female Houston 18-30
4 58.666911 58.666911 2022-11-14 59 Female Los Angeles 46-60
… … … … … … … …
4995 54.421139 326.526833 2020-02-28 23 Female Chicago 18-30
4996 81.503247 244.509740 2021-10-16 58 Female Phoenix 46-60
4997 91.173962 455.869812 2020-08-22 48 Male Chicago 46-60
4998 44.403988 399.635891 2020-09-26 67 Female Houston 60+
4999 89.000682 178.001364 2023-06-23 56 Male Phoenix 46-60

Year Earnings
0 2022 122.535950
1 2022 155.578650
2 2020 133.844366
3 2021 119.969930
4 2022 58.666911
… … …
4995 2020 326.526833
4996 2021 244.509740
4997 2020 455.869812
4998 2020 399.635891
4999 2023 178.001364

[5000 rows x 14 columns]

[45]: mean_total_price = df["Quantity"] * mean_unit_price
df["TotalPrice"].fillna(mean_total_price, inplace = True)
df

C:\Users\sai.kanthamneni\AppData\Local\Temp\ipykernel_5608\3526706954.py:2:
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series
through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work
because the intermediate object on which we are setting values always behaves as
a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using


'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value)
instead, to perform the operation inplace on the original object.

df["TotalPrice"].fillna(mean_total_price, inplace = True)

[45]: TransactionID CustomerID ProductID ProductCategory Quantity \


0 1 1600 50 Home & Kitchen 2
1 2 1600 64 Electronics 3
2 3 1600 10 Clothing 9
3 4 1600 25 Home & Kitchen 5
4 5 1600 69 Home & Kitchen 1
… … … … … …
4995 4996 1600 23 Home & Kitchen 6
4996 4997 1600 5 Home & Kitchen 3
4997 4998 1600 51 Books 5
4998 4999 1600 97 Clothing 9
4999 5000 1600 40 Sports & Outdoors 2

UnitPrice TotalPrice Timestamp Age Gender City AgeGroup \


0 61.267975 122.535950 2022-02-28 37 Male Los Angeles 31-45
1 51.859550 155.578651 2022-05-12 48 Female New York 46-60
2 14.871596 133.844366 2020-03-24 53 Male Phoenix 46-60
3 23.993986 119.969930 2021-06-15 0 Female Houston 18-30
4 58.666911 58.666911 2022-11-14 59 Female Los Angeles 46-60
… … … … … … … …
4995 54.421139 326.526833 2020-02-28 23 Female Chicago 18-30
4996 81.503247 244.509740 2021-10-16 58 Female Phoenix 46-60
4997 91.173962 455.869812 2020-08-22 48 Male Chicago 46-60
4998 44.403988 399.635891 2020-09-26 67 Female Houston 60+
4999 89.000682 178.001364 2023-06-23 56 Male Phoenix 46-60

Year Earnings
0 2022 122.535950
1 2022 155.578650

2 2020 133.844366
3 2021 119.969930
4 2022 58.666911
… … …
4995 2020 326.526833
4996 2021 244.509740
4997 2020 455.869812
4998 2020 399.635891
4999 2023 178.001364

[5000 rows x 14 columns]

[47]: df.isna().sum()

[47]: TransactionID 0
CustomerID 0
ProductID 0
ProductCategory 0
Quantity 0
UnitPrice 0
TotalPrice 0
Timestamp 0
Age 0
Gender 517
City 0
AgeGroup 7
Year 0
Earnings 250
dtype: int64

[48]: df["Gender"].fillna("Prefer not to say", inplace = True)

[53]: df.isna().sum()

[53]: TransactionID 0
CustomerID 0
ProductID 0
ProductCategory 0
Quantity 0
UnitPrice 0
TotalPrice 0
Timestamp 0
Age 0
Gender 0
City 0
AgeGroup 0
Year 0

Earnings 250
dtype: int64

[54]: df["Earnings"].fillna(df["Earnings"].mean(), inplace = True)

C:\Users\sai.kanthamneni\AppData\Local\Temp\ipykernel_5608\440114691.py:1:
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series
through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work
because the intermediate object on which we are setting values always behaves as
a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using


'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value)
instead, to perform the operation inplace on the original object.

df["Earnings"].fillna(df["Earnings"].mean(), inplace = True)

[55]: df.isna().sum()

[55]: TransactionID 0
CustomerID 0
ProductID 0
ProductCategory 0
Quantity 0
UnitPrice 0
TotalPrice 0
Timestamp 0
Age 0
Gender 0
City 0
AgeGroup 0
Year 0
Earnings 0
dtype: int64

[56]: df = df[df['Quantity'] >= 0]
df = df[df['UnitPrice'] >= 0]
df = df[df['Age'] >= 0]

[57]: df

[57]: TransactionID CustomerID ProductID ProductCategory Quantity \


0 1 1600 50 Home & Kitchen 2
1 2 1600 64 Electronics 3
2 3 1600 10 Clothing 9

3 4 1600 25 Home & Kitchen 5
4 5 1600 69 Home & Kitchen 1
… … … … … …
4995 4996 1600 23 Home & Kitchen 6
4996 4997 1600 5 Home & Kitchen 3
4997 4998 1600 51 Books 5
4998 4999 1600 97 Clothing 9
4999 5000 1600 40 Sports & Outdoors 2

UnitPrice TotalPrice Timestamp Age Gender City AgeGroup \


0 61.267975 122.535950 2022-02-28 37 Male Los Angeles 31-45
1 51.859550 155.578651 2022-05-12 48 Female New York 46-60
2 14.871596 133.844366 2020-03-24 53 Male Phoenix 46-60
3 23.993986 119.969930 2021-06-15 0 Female Houston 18-30
4 58.666911 58.666911 2022-11-14 59 Female Los Angeles 46-60
… … … … … … … …
4995 54.421139 326.526833 2020-02-28 23 Female Chicago 18-30
4996 81.503247 244.509740 2021-10-16 58 Female Phoenix 46-60
4997 91.173962 455.869812 2020-08-22 48 Male Chicago 46-60
4998 44.403988 399.635891 2020-09-26 67 Female Houston 60+
4999 89.000682 178.001364 2023-06-23 56 Male Phoenix 46-60

Year Earnings
0 2022 122.535950
1 2022 155.578650
2 2020 133.844366
3 2021 119.969930
4 2022 58.666911
… … …
4995 2020 326.526833
4996 2021 244.509740
4997 2020 455.869812
4998 2020 399.635891
4999 2023 178.001364

[4993 rows x 14 columns]

[ ]:
