Exercise Univariate Analysis - Andoni Fikri - 13118111


Import Library

In [1]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Import Data Set and Describe the Data


In [70]: #importing the data set
df = pd.read_csv('cp_data_eng_trial2.csv')

In [3]: #First 5 rows of the data set
df.head()

Out[3]:    NO  Customer_ID  Response  Sex   Age             Job  questionnaire1  questionnaire2  questionnaire3
        0   1     80000018  no reply    F  35.0             NaN             0.0             0.0             0.
        1   2     80000042     reply    M  39.0  general payer1             0.0             0.0             0.
        2   3     80000234  no reply    F  43.0             NaN             0.0             0.0             0.
        3   4     80000273  no reply    F  45.0             NaN             0.0             0.0             0.
        4   5     80000529  no reply    M  33.0  general payer1             0.0             0.0             0.

In [4]: #Data Set Info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1293 entries, 0 to 1292
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NO 1293 non-null int64
1 Customer_ID 1293 non-null int64
2 Response 1003 non-null object
3 Sex 1232 non-null object
4 Age 1232 non-null float64
5 Job 771 non-null object
6 questionnaire1 1095 non-null float64
7 questionnaire2 1095 non-null float64
8 questionnaire3 1095 non-null float64
9 questionnaire4 1095 non-null float64
10 questionnaire5 1095 non-null float64
11 avg_charge 1033 non-null float64
12 charge_avg_per_mon 1033 non-null float64
13 charge_Monday 567 non-null float64
14 contraction_day 1086 non-null float64
15 contraction_day_JP 1086 non-null object
dtypes: float64(10), int64(2), object(4)
memory usage: 161.8+ KB

In [5]: #Missing Values (percentage) of the data
df.isnull().sum()/len(df)*100

Out[5]: NO 0.000000
Customer_ID 0.000000
Response 22.428461
Sex 4.717711
Age 4.717711
Job 40.371230
questionnaire1 15.313225
questionnaire2 15.313225
questionnaire3 15.313225
questionnaire4 15.313225
questionnaire5 15.313225
avg_charge 20.108275
charge_avg_per_mon 20.108275
charge_Monday 56.148492
contraction_day 16.009281
contraction_day_JP 16.009281
dtype: float64

In [72]: #Remove missing data, especially in the Response column, because we need to know whether each customer replied
df = df[~df['Response'].isnull()] #Keep only rows where the Response column is not null
df.isnull().sum()/len(df)*100 #Checking for missing values after filtering the data

Out[72]: NO 0.000000
Customer_ID 0.000000
Response 0.000000
Sex 0.000000
Age 0.000000
Job 29.312064
questionnaire1 0.000000
questionnaire2 0.000000
questionnaire3 0.000000
questionnaire4 0.000000
questionnaire5 0.000000
avg_charge 0.000000
charge_avg_per_mon 0.000000
charge_Monday 47.657029
contraction_day 0.000000
contraction_day_JP 0.000000
dtype: float64

Work 1

Frequency Distribution of Response Feature


In [50]: #Plotting for Response Frequency Distribution
response_freq = df.groupby('Response')['Customer_ID'].nunique().reset_index()
sns.barplot(data=response_freq, x='Response', y='Customer_ID')
plt.title('Response Frequency Distribution', fontweight='bold')
plt.ylabel('Count')

Out[50]: Text(0, 0.5, 'Count')

Conclusion:

From the frequency distribution plot above, we can conclude that only a small fraction of
customers (around 10%) responded to the email campaign. This indicates that the response rate
of the campaign is very low and that a strategy/solution is needed to increase it.
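
Rather than reading the rate off the bar chart, the exact share of each Response value can be computed directly. A minimal sketch, assuming the same filtered df as above:

#Exact share of each Response value in the filtered data
df['Response'].value_counts(normalize=True)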

Frequency Distribution of Sex Feature


In [51]: #Plotting the Sex frequency distribution, only for customers who have Response data
sex_freq = df.groupby('Sex')['Customer_ID'].nunique().reset_index()
sns.barplot(data=sex_freq, x='Sex', y='Customer_ID')
plt.title('Sex Frequency Distribution', fontweight='bold')
plt.ylabel('Count')

Out[51]: Text(0, 0.5, 'Count')

Conclusion:

The recipients of the email campaign are mostly female. Based on this information, we can
check if females have a higher response rate than males. If the opposite is true, it could
mean that Mr. Matsui targeted the wrong customers.

In [52]: #Check the female/male response rate
sex_freq_resp = df.groupby(['Sex','Response'])['Customer_ID'].nunique().reset_index()
sex_response = sex_freq_resp.merge(sex_freq, how='left', on='Sex')
sex_response['response_rate'] = sex_response['Customer_ID_x']/sex_response['Customer_ID_y']
sex_response[sex_response['Response']=='reply'][['Sex', 'response_rate']]

Out[52]:   Sex  response_rate
        1    F       0.170984
        3    M       0.144208

Conclusion:

Female customers have a higher response rate than male customers. This means that an email
sent to a female customer has a higher chance of receiving a reply.

Frequency Distribution of Job Feature

In [53]: job_freq = df.groupby('Job')['Customer_ID'].nunique().reset_index().sort_values(by='Customer_ID', ascending=False)
plt.figure(figsize=(15,6))
sns.barplot(data=job_freq, x='Job', y='Customer_ID')
plt.title('Job Frequency Distribution', fontweight='bold')
plt.ylabel('Count')

Out[53]: Text(0, 0.5, 'Count')

Conclusion:

The recipients of the email campaign are mostly part-time employees or students. However,
there is a concern regarding this feature because it contains a significant amount of missing
data (around 29%), which means that the insights derived from it may not be accurate enough.
From this data, it should also be checked whether the client is targeting the right segment
or not.
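
One possible way to keep the roughly 29% of customers with a missing Job visible in the analysis, instead of silently dropping them, is to label them as their own category. This is only a sketch; the 'unknown' label is an assumed placeholder, not part of the original data:

#Label missing Job values explicitly so they appear as their own category
job_filled = df['Job'].fillna('unknown')
job_filled.value_counts(normalize=True).head()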

In [58]: #See whether a customer's job correlates with the response
job_freq_resp = df.groupby(['Job','Response'])['Customer_ID'].nunique().reset_index()
job_response = job_freq_resp.merge(job_freq, how='left', on='Job')
job_response['response_rate'] = job_response['Customer_ID_x']/job_response['Customer_ID_y']
job_response[job_response['Response']=='reply'][['Job', 'response_rate']].sort_values(by='response_rate', ascending=False)
Out[58]:                     Job  response_rate
        15  sole proprietorship       0.250000
        3        general payer1       0.207317
        9                lawyer       0.171429
        7        general payer3       0.133333
        11                other       0.133333
        1                doctor       0.108108
        5        general payer2       0.101449
        13        part time job       0.097561
        17              student       0.057377

Insight:

The client might be targeting the wrong customers based on their Job. As we can see, the
response rates for students and part-time employees are the lowest compared to the other
job types. The client can address this by sending more emails to sole proprietorship,
general payer1, or lawyer customers.

The data also needs some transformation to break down what general payer1, 2, and 3 actually
mean, so that the analysis can be more detailed, as sketched below.
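
As a rough illustration of that transformation, the sketch below splits values such as general payer1 into a base category and a trailing numeric level. The names job_base and job_level are hypothetical, and the split assumes the level is always a digit at the end of the value:

#Split Job values like 'general payer1' into a base category and a numeric suffix
job_split = df['Job'].str.extract(r'^(?P<job_base>.+?)(?P<job_level>\d*)$')
job_split['job_level'] = pd.to_numeric(job_split['job_level'], errors='coerce')
job_split['job_base'].value_counts().head()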

Work 2

Age Feature Distribution Plot


In [100]: #Plotting The Distribution of Age Feature
plt.figure(figsize=(15,6))
df['Age'].plot(kind='hist', bins=50)
plt.title('Age Feature Distribution', fontweight='bold')
plt.xlabel('Age Value')
plt.ylabel('Frequency')
plt.show()

In [99]: #Density Distribution (KDE Plot)
plt.figure(figsize=(15,6))
df['Age'].plot(kind='hist', density=True, bins=50)
df['Age'].plot(kind='kde')
plt.title('Age Feature KDE Plot', fontweight='bold')
plt.xlabel('Age Value')
plt.show()

In [93]: # Degree of Deviation / Statistical Summary of the Feature
df['Age'].describe()

Out[93]: count 1003.000000
mean 35.522433
std 12.712868
min 14.000000
25% 28.000000
50% 35.000000
75% 42.000000
max 200.000000
Name: Age, dtype: float64

Conclusion:

The Age feature seems to have a roughly normal distribution shape. This means that the data is
well distributed and unlikely to contain a large number of outliers. With a normal distribution,
both the median and the mean can serve as a representative value of the feature.

From the data distribution, we can see that most customers are between 25 and 45 years old, or,
put simply, most of the email recipients are adults, with a few teenagers.

One transformation that is needed is removing the outlier (Age equal to 200), because a customer
age of 200 does not make any sense (see the sketch at the end of this conclusion).

Representative Value = 35.52, or roughly 35

Degree of Deviation = 12.712 (std dev) → meaning the data is clustered around the average value
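
A minimal sketch of that outlier removal; the cutoff of 100 is an assumption used here to treat implausible ages as data-entry errors:

#Drop implausible ages (assumed cutoff: 100) before recomputing the summary statistics
df_age_clean = df[df['Age'] <= 100]
df_age_clean['Age'].describe()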

charge_Monday Feature Distribution Plot


In [101]: #Plotting The Distribution of charge_Monday Feature
plt.figure(figsize=(15,6))
df['charge_Monday'].plot(kind='hist', bins=50)
plt.title('charge_Monday Feature Distribution', fontweight='bold')
plt.xlabel('charge_Monday Value')
plt.ylabel('Frequency')
plt.show()

In [102]: #Density Distribution (KDE Plot)
plt.figure(figsize=(15,6))
df['charge_Monday'].plot(kind='hist', density=True, bins=50)
df['charge_Monday'].plot(kind='kde')
plt.title('charge_Monday Feature KDE Plot', fontweight='bold')
plt.xlabel('charge_Monday Value')
plt.show()

In [103]: # Degree of Deviation / Statistical Summary of the Feature
df['charge_Monday'].describe()

Out[103]: count 525.000000
mean 42643.047619
std 55666.307408
min 2040.000000
25% 8370.000000
50% 24170.000000
75% 50230.000000
max 361110.000000
Name: charge_Monday, dtype: float64

Conclusion:

From this data we can conclude that most customer charges are roughly between 2,000 and
40,000. However, the data also tells us that there are customers with a charge larger than
100,000. The distribution appears to be right-skewed, which means the data contains some
outliers.

Because the distribution is skewed, the median would be the best representative value of the
data (see the sketch below).

Representative Value: 24,170 (median)

Degree of Deviation: 55,666.307 → meaning the data is very spread out
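
To back up the choice of the median over the mean, a minimal sketch comparing the two and checking the skewness (pandas' skew() ignores NaN values):

#Compare mean vs. median and check skewness; a clearly positive skew confirms the right-skewed shape
charge = df['charge_Monday']
print('mean  :', charge.mean())
print('median:', charge.median())
print('skew  :', charge.skew())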
