Sales Data Practice Assignment
[69]: df = pd.read_csv('sales_data.csv')
[70]: df.head()
Rows: 5000
Columns: 11
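The cell that printed the row and column counts is not shown. A minimal sketch of one way to get them, using `DataFrame.shape` on a small hypothetical stand-in for `sales_data.csv` (the `sample` frame and its values are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in for sales_data.csv (the real file has 5000 rows, 11 columns).
sample = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "Quantity": [2, 3, 9],
    "UnitPrice": [61.27, 51.86, 14.87],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape
print(f"Rows: {n_rows}")
print(f"Columns: {n_cols}")
```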
b. What are the data types of each column?
[72]: df.dtypes
[72]: TransactionID int64
CustomerID int64
ProductID int64
ProductCategory object
Quantity int64
UnitPrice float64
TotalPrice float64
Timestamp object
Age int64
Gender object
City object
dtype: object
c. Are there any missing values in the dataset? If yes, how many and in which columns?
[73]: df.isnull().sum()
[73]: TransactionID 0
CustomerID 0
ProductID 0
ProductCategory 0
Quantity 0
UnitPrice 250
TotalPrice 250
Timestamp 0
Age 0
Gender 517
City 0
dtype: int64
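To answer "in which columns" directly, the counts can be filtered down to the columns that actually have gaps. A small sketch on hypothetical data (the `sample` frame is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only UnitPrice and Gender contain missing values.
sample = pd.DataFrame({
    "UnitPrice": [61.27, np.nan, 14.87, 23.99],
    "Gender": ["Male", "Female", None, None],
    "City": ["Phoenix", "Houston", "Chicago", "New York"],
})

missing = sample.isnull().sum()
missing_only = missing[missing > 0]  # keep only columns with at least one gap
print(missing_only)
```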
2. Data Manipulation:
a. How many unique customers are there in the dataset?
[74]: df["CustomerID"].nunique()
[74]: 993
[75]: 1436068.8463578
[76]: ProductCategory
Books 5718
Clothing 5672
Electronics 5337
Home & Kitchen 5761
Sports & Outdoors 5459
Name: Quantity, dtype: int64
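The questions and code for outputs [75] and [76] are missing from this export. Output [76] looks like a per-category sum of `Quantity`, and [75] may be a sum of `TotalPrice`; that reading is an assumption. A sketch of both computations on hypothetical data:

```python
import pandas as pd

# Hypothetical transactions, not taken from sales_data.csv.
sample = pd.DataFrame({
    "ProductCategory": ["Books", "Clothing", "Books", "Electronics"],
    "Quantity": [2, 3, 5, 1],
    "TotalPrice": [20.0, 45.0, 50.0, 99.0],
})

# Overall revenue: a single scalar.
total_revenue = sample["TotalPrice"].sum()

# Units sold per category: a Series indexed by ProductCategory.
quantity_by_category = sample.groupby("ProductCategory")["Quantity"].sum()
print(total_revenue)
print(quantity_by_category)
```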
d. Create a new column named “AgeGroup” based on the age of customers (e.g., “18-30”,
“31-45”, “46-60”, “60+”)
[77]: age_groups = [(0, 30), (31, 45), (46, 60), (61, float('inf'))]
age_group_labels = ['18-30', '31-45', '46-60', '60+']
def grouping_age(age):
    for i, (start, end) in enumerate(age_groups):
        if start <= age <= end:
            return age_group_labels[i]
df['AgeGroup'] = df['Age'].apply(grouping_age)
df['AgeGroup'].fillna('Prefer not to say', inplace=True)
C:\Users\sai.kanthamneni\AppData\Local\Temp\ipykernel_5608\3210157570.py:8:
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series
through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work
because the intermediate object on which we are setting values always behaves as
a copy.
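The FutureWarning above comes from calling `fillna(..., inplace=True)` on a selected column. One possible alternative, sketched on hypothetical ages, bins the column with `pd.cut` and assigns the filled result back. The bin edges are chosen to mirror the integer ranges in the cell above, and the `-5` entry is a deliberately invalid age added for illustration:

```python
import pandas as pd

# Hypothetical ages; -5 falls outside every bin and becomes NaN.
ages = pd.DataFrame({"Age": [37, 48, 0, 67, 23, -5]})

# Right-inclusive bins: (-1, 30], (30, 45], (45, 60], (60, inf].
bins = [-1, 30, 45, 60, float("inf")]
labels = ["18-30", "31-45", "46-60", "60+"]
ages["AgeGroup"] = pd.cut(ages["Age"], bins=bins, labels=labels)

# pd.cut returns a categorical column, so the fallback label must be
# registered as a category before it can be used as a fill value.
ages["AgeGroup"] = (
    ages["AgeGroup"]
    .cat.add_categories("Prefer not to say")
    .fillna("Prefer not to say")
)
print(ages)
```

Assigning the result back to `ages["AgeGroup"]` avoids the chained in-place call that the warning describes.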
[78]: df
UnitPrice TotalPrice Timestamp Age Gender City AgeGroup
0 61.267975 122.535950 28-02-2022 37 Male Los Angeles 31-45
1 51.859550 155.578651 12-05-2022 48 Female New York 46-60
2 14.871596 133.844366 24-03-2020 53 Male Phoenix 46-60
3 23.993986 119.969930 15-06-2021 0 Female Houston 18-30
4 58.666911 58.666911 14-11-2022 59 Female Los Angeles 46-60
… … … … … … … …
4995 54.421139 326.526833 28-02-2020 23 Female Chicago 18-30
4996 81.503247 244.509740 16-10-2021 58 Female Phoenix 46-60
4997 91.173962 455.869812 22-08-2020 48 Male Chicago 46-60
4998 44.403988 399.635891 26-09-2020 67 Female Houston 60+
4999 89.000682 178.001364 23-06-2023 56 Male Phoenix 46-60
3. Data Analysis:
a. Analyze the data to identify the leading product category, determined by the cumulative
quantity of products sold within each category.
[79]: category_quantity = df.groupby('ProductCategory')['Quantity'].sum()
leading_category = category_quantity.idxmax()
leading_category
b. Examine the pricing trends within different product categories and evaluate the company’s
yearly earnings.
[80]: df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Year'] = df['Timestamp'].dt.year
avg_unit_price = df.groupby(['ProductCategory', 'Year'])['UnitPrice'].mean().reset_index()
avg_unit_price
C:\Users\sai.kanthamneni\AppData\Local\Temp\ipykernel_5608\1499781126.py:1:
UserWarning: Parsing dates in %d-%m-%Y format when dayfirst=False (the default)
was specified. Pass `dayfirst=True` or specify a format to silence this warning.
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
6 Clothing 2022 55.212257
7 Clothing 2023 53.014750
8 Electronics 2020 54.868886
9 Electronics 2021 52.640335
10 Electronics 2022 54.731241
11 Electronics 2023 54.408497
12 Home & Kitchen 2020 54.979156
13 Home & Kitchen 2021 52.126756
14 Home & Kitchen 2022 51.480199
15 Home & Kitchen 2023 56.653768
16 Sports & Outdoors 2020 52.634402
17 Sports & Outdoors 2021 52.765040
18 Sports & Outdoors 2022 53.909761
19 Sports & Outdoors 2023 57.490032
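The UserWarning raised by cell [80] suggests passing `dayfirst=True` or an explicit format. A short sketch of the explicit-format variant, using three date strings in the same day-month-year layout as the `Timestamp` column:

```python
import pandas as pd

# Day-first date strings, as they appear in the Timestamp column.
stamps = pd.Series(["28-02-2022", "12-05-2022", "24-03-2020"])

# An explicit format removes the ambiguity between day and month.
parsed = pd.to_datetime(stamps, format="%d-%m-%Y")
years = parsed.dt.year
print(years.tolist())
```

With the format spelled out, `12-05-2022` is read as 12 May rather than 5 December.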
yearly_earnings
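The cell that built `yearly_earnings` is missing from this export. One plausible way to compute revenue per year is sketched below on hypothetical rows; the per-year `Earnings` column name is an assumption and may not match what the original cell did:

```python
import pandas as pd

# Hypothetical transactions with day-first timestamps.
sample = pd.DataFrame({
    "Timestamp": ["28-02-2022", "12-05-2022", "24-03-2020", "15-06-2021"],
    "TotalPrice": [122.54, 155.58, 133.84, 119.97],
})
sample["Year"] = pd.to_datetime(sample["Timestamp"], format="%d-%m-%Y").dt.year

# Sum TotalPrice within each year; "Earnings" is an assumed column name.
yearly_earnings = (
    sample.groupby("Year")["TotalPrice"].sum().reset_index(name="Earnings")
)
print(yearly_earnings)
```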
c. Segment customers into quantiles according to their total expenditure to gain insights into
spending behavior.
[85]: customer_expenditure = df.groupby('CustomerID')['TotalPrice'].sum()
customer_expenditure
quantiles = pd.qcut(customer_expenditure, q=[0, 0.25, 0.5, 0.75, 1],
                    labels=['Low', 'Medium', 'High', 'Very High'])
df['ExpenditureQuantile'] = df['CustomerID'].map(quantiles)
df.reset_index()
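A quick way to sanity-check the segmentation is to summarise spend within each quantile label. The sketch below repeats the `pd.qcut` step on hypothetical orders and then reports the count and spend range per segment; the `orders` data is made up for illustration:

```python
import pandas as pd

# Hypothetical orders for eight customers.
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 4, 4, 5, 6, 7, 8],
    "TotalPrice": [10.0, 5.0, 40.0, 80.0, 60.0, 60.0, 200.0, 25.0, 300.0, 150.0],
})

# Total spend per customer, then four equal-frequency segments.
spend = orders.groupby("CustomerID")["TotalPrice"].sum()
segments = pd.qcut(spend, q=4, labels=["Low", "Medium", "High", "Very High"])

# Count and spend range per segment.
summary = spend.groupby(segments, observed=True).agg(["count", "min", "max"])
print(summary)
```

Each segment should hold roughly a quarter of the customers, and the min/max ranges should not overlap.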
d. Conduct an in-depth analysis of customer demographics and product preferences across different
spending segments. Identify the top-selling product category based on the total quantity
sold. Determine the average unit price for each product category. Calculate the total revenue
generated per year.
[ ]:
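The cell for this question is empty. As one possible starting point, the sketch below cross-tabulates spending segment against gender and against product category on hypothetical rows. The column names follow the notebook; the data and the choice of `crosstab` and `pivot_table` are assumptions:

```python
import pandas as pd

# Hypothetical rows that already carry an ExpenditureQuantile label.
sample = pd.DataFrame({
    "ExpenditureQuantile": ["Low", "Low", "High", "High", "High"],
    "Gender": ["Male", "Female", "Female", "Female", "Male"],
    "ProductCategory": ["Books", "Books", "Clothing", "Books", "Clothing"],
    "Quantity": [2, 1, 4, 3, 5],
})

# Share of each gender within a spending segment (rows sum to 1).
gender_mix = pd.crosstab(
    sample["ExpenditureQuantile"], sample["Gender"], normalize="index"
)

# Units sold per category within each spending segment.
category_qty = sample.pivot_table(
    index="ExpenditureQuantile",
    columns="ProductCategory",
    values="Quantity",
    aggfunc="sum",
    fill_value=0,
)
print(gender_mix)
print(category_qty)
```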
4. Data Aggregation:
a. Segment the data by product categories and assess the typical quantity sold and total revenue
generated for each category.
[37]: category_analysis = df.groupby("ProductCategory").agg({"Quantity": 'mean',
                                                       'TotalPrice': 'sum'})
category_analysis
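The same aggregation can also be written with named aggregation, which gives the output columns descriptive names. A sketch on hypothetical data; `AvgQuantity` and `TotalRevenue` are illustrative names, not from the notebook:

```python
import pandas as pd

# Hypothetical transactions.
sample = pd.DataFrame({
    "ProductCategory": ["Books", "Books", "Clothing"],
    "Quantity": [2, 4, 5],
    "TotalPrice": [20.0, 40.0, 75.0],
})

# Each keyword is an output column: (source column, aggregation).
category_analysis = sample.groupby("ProductCategory").agg(
    AvgQuantity=("Quantity", "mean"),
    TotalRevenue=("TotalPrice", "sum"),
)
print(category_analysis)
```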
b. Aggregate the dataset by gender and determine the total sales revenue attributed to each
gender category.
[41]: gender_analysis = df.groupby("Gender").agg({"TotalPrice":'sum'})
gender_analysis
[41]: TotalPrice
Gender
Female 641115.956773
Male 643077.966904
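By default `groupby` drops rows whose key is missing, so revenue from transactions without a recorded gender does not appear in the table above. A sketch, on hypothetical data, of keeping that group visible with `dropna=False`:

```python
import pandas as pd

# Hypothetical rows; one transaction has no recorded gender.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", None, "Female"],
    "TotalPrice": [10.0, 20.0, 5.0, 30.0],
})

# dropna=False keeps the missing-key group as its own row.
by_gender = sample.groupby("Gender", dropna=False)["TotalPrice"].sum()
print(by_gender)
```

With the missing group included, the per-gender totals add up to the overall revenue.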
5. Data Cleaning:
a. Handle missing values in the dataset (e.g., fill null values in the “ProductCategory” column).
C:\Users\sai.kanthamneni\AppData\Local\Temp\ipykernel_5608\2190109594.py:2:
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series
through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work
because the intermediate object on which we are setting values always behaves as
a copy.
'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value)
instead, to perform the operation inplace on the original object.
Year Earnings
0 2022 122.535950
1 2022 155.578650
2 2020 133.844366
3 2021 119.969930
4 2022 58.666911
… … …
4995 2020 326.526833
4996 2021 244.509740
4997 2020 455.869812
4998 2020 399.635891
4999 2023 178.001364
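The code for this step is not shown in the export. The warning text itself recommends `df.fillna({col: value})` over a chained in-place call, so a sketch of that form on hypothetical data might look like the following; the choice of the most frequent category and of a "Prefer not to say" placeholder is an assumption:

```python
import pandas as pd

# Hypothetical rows with gaps in two text columns.
sample = pd.DataFrame({
    "ProductCategory": ["Books", None, "Books", "Clothing"],
    "Gender": ["Male", None, "Female", None],
})

# Most frequent category as the fill value (an illustrative choice).
most_common = sample["ProductCategory"].mode().iloc[0]

# One fillna call with a per-column mapping, assigned back to the frame.
sample = sample.fillna({
    "ProductCategory": most_common,
    "Gender": "Prefer not to say",
})
print(sample)
```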
[45]: mean_total_price = df["Quantity"] * mean_unit_price
df["TotalPrice"].fillna(mean_total_price, inplace = True)
df
C:\Users\sai.kanthamneni\AppData\Local\Temp\ipykernel_5608\3526706954.py:2:
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series
through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work
because the intermediate object on which we are setting values always behaves as
a copy.
Year Earnings
0 2022 122.535950
1 2022 155.578650
2 2020 133.844366
3 2021 119.969930
4 2022 58.666911
… … …
4995 2020 326.526833
4996 2021 244.509740
4997 2020 455.869812
4998 2020 399.635891
4999 2023 178.001364
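Cell [45] relies on a `mean_unit_price` defined in a cell that is not shown. Assuming it is the mean of the `UnitPrice` column, the two price columns could be filled without the chained in-place call roughly as follows (hypothetical data):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second transaction is missing both prices.
sample = pd.DataFrame({
    "Quantity": [2, 3, 4],
    "UnitPrice": [10.0, np.nan, 20.0],
    "TotalPrice": [20.0, np.nan, 80.0],
})

# Assumption: mean_unit_price is the column mean (NaN is skipped by default).
mean_unit_price = sample["UnitPrice"].mean()

# Assign the filled columns back instead of using inplace=True.
sample["UnitPrice"] = sample["UnitPrice"].fillna(mean_unit_price)
sample["TotalPrice"] = sample["TotalPrice"].fillna(
    sample["Quantity"] * mean_unit_price
)
print(sample)
```

The fill value for `TotalPrice` is a Series, so each missing row receives its own `Quantity * mean_unit_price` rather than one shared constant.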
[47]: df.isna().sum()
[47]: TransactionID 0
CustomerID 0
ProductID 0
ProductCategory 0
Quantity 0
UnitPrice 0
TotalPrice 0
Timestamp 0
Age 0
Gender 517
City 0
AgeGroup 7
Year 0
Earnings 250
dtype: int64
[53]: df.isna().sum()
[53]: TransactionID 0
CustomerID 0
ProductID 0
ProductCategory 0
Quantity 0
UnitPrice 0
TotalPrice 0
Timestamp 0
Age 0
Gender 0
City 0
AgeGroup 0
Year 0
Earnings 250
dtype: int64
C:\Users\sai.kanthamneni\AppData\Local\Temp\ipykernel_5608\440114691.py:1:
FutureWarning: A value is trying to be set on a copy of a DataFrame or Series
through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work
because the intermediate object on which we are setting values always behaves as
a copy.
[55]: df.isna().sum()
[55]: TransactionID 0
CustomerID 0
ProductID 0
ProductCategory 0
Quantity 0
UnitPrice 0
TotalPrice 0
Timestamp 0
Age 0
Gender 0
City 0
AgeGroup 0
Year 0
Earnings 0
dtype: int64
[57]: df
3 4 1600 25 Home & Kitchen 5
4 5 1600 69 Home & Kitchen 1
… … … … … …
4995 4996 1600 23 Home & Kitchen 6
4996 4997 1600 5 Home & Kitchen 3
4997 4998 1600 51 Books 5
4998 4999 1600 97 Clothing 9
4999 5000 1600 40 Sports & Outdoors 2
Year Earnings
0 2022 122.535950
1 2022 155.578650
2 2020 133.844366
3 2021 119.969930
4 2022 58.666911
… … …
4995 2020 326.526833
4996 2021 244.509740
4997 2020 455.869812
4998 2020 399.635891
4999 2023 178.001364
[ ]: