Exercise Univariate Analysis - Andoni Fikri - 13118111


Import Library

In [1]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

Import Data Set and Describe the Data


In [70]: #importing the data set
df = pd.read_csv('cp_data_eng_trial2.csv')

In [3]: #First 5 rows of the data set
df.head()

Out[3]:    NO  Customer_ID  Response  Sex   Age             Job  questionnaire1  questionnaire2  questionnaire3
        0   1     80000018  no reply    F  35.0             NaN             0.0             0.0             0.
        1   2     80000042     reply    M  39.0  general payer1             0.0             0.0             0.
        2   3     80000234  no reply    F  43.0             NaN             0.0             0.0             0.
        3   4     80000273  no reply    F  45.0             NaN             0.0             0.0             0.
        4   5     80000529  no reply    M  33.0  general payer1             0.0             0.0             0.

In [4]: #Data Set Info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1293 entries, 0 to 1292
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NO 1293 non-null int64
1 Customer_ID 1293 non-null int64
2 Response 1003 non-null object
3 Sex 1232 non-null object
4 Age 1232 non-null float64
5 Job 771 non-null object
6 questionnaire1 1095 non-null float64
7 questionnaire2 1095 non-null float64
8 questionnaire3 1095 non-null float64
9 questionnaire4 1095 non-null float64
10 questionnaire5 1095 non-null float64
11 avg_charge 1033 non-null float64
12 charge_avg_per_mon 1033 non-null float64
13 charge_Monday 567 non-null float64
14 contraction_day 1086 non-null float64
15 contraction_day_JP 1086 non-null object
dtypes: float64(10), int64(2), object(4)
memory usage: 161.8+ KB

In [5]: #Missing Values (percentage) of the data
df.isnull().sum()/len(df)*100

Out[5]: NO 0.000000
Customer_ID 0.000000
Response 22.428461
Sex 4.717711
Age 4.717711
Job 40.371230
questionnaire1 15.313225
questionnaire2 15.313225
questionnaire3 15.313225
questionnaire4 15.313225
questionnaire5 15.313225
avg_charge 20.108275
charge_avg_per_mon 20.108275
charge_Monday 56.148492
contraction_day 16.009281
contraction_day_JP 16.009281
dtype: float64

In [72]: #Remove missing data, especially in the Response column, because we need to know whether each customer replied
df = df[~df['Response'].isnull()] #Keep only rows where the Response column is not null
df.isnull().sum()/len(df)*100 #Checking for missing values after filtering the data

Out[72]: NO 0.000000
Customer_ID 0.000000
Response 0.000000
Sex 0.000000
Age 0.000000
Job 29.312064
questionnaire1 0.000000
questionnaire2 0.000000
questionnaire3 0.000000
questionnaire4 0.000000
questionnaire5 0.000000
avg_charge 0.000000
charge_avg_per_mon 0.000000
charge_Monday 47.657029
contraction_day 0.000000
contraction_day_JP 0.000000
dtype: float64

Work 1

Frequency Distribution of Response Feature


In [50]: #Plotting for Response Frequency Distribution
response_freq = df.groupby('Response')['Customer_ID'].nunique().reset_index()
sns.barplot(data=response_freq, x='Response', y='Customer_ID')
plt.title('Response Frequency Distribution', fontweight='bold')
plt.ylabel('Count')

Out[50]: Text(0, 0.5, 'Count')

Conclusion:

From the frequency distribution plot above, we can conclude that only a small fraction of
customers (around 10%) responded to the email campaign. This indicates that the response rate
of the campaign is very low and that a strategy/solution is needed to increase it.
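
Rather than reading the rate off the bar chart, the exact share of each Response value can be computed directly. A minimal sketch, assuming the same filtered df as above:

#Exact share of each Response value in the filtered data
df['Response'].value_counts(normalize=True)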

Frequency Distribution of Sex Feature


In [51]: #Plotting the Sex frequency distribution, only for customers who have Response data
sex_freq = df.groupby('Sex')['Customer_ID'].nunique().reset_index()
sns.barplot(data=sex_freq, x='Sex', y='Customer_ID')
plt.title('Sex Frequency Distribution', fontweight='bold')
plt.ylabel('Count')

Out[51]: Text(0, 0.5, 'Count')

Conclusion:

The recipients of the email campaign are mostly female. Based on this information, we can
check if females have a higher response rate than males. If the opposite is true, it could
mean that Mr. Matsui targeted the wrong customers.

In [52]: #Check the female/male response rate
sex_freq_resp = df.groupby(['Sex','Response'])['Customer_ID'].nunique().reset_index()
sex_response = sex_freq_resp.merge(sex_freq, how='left', on='Sex')
sex_response['response_rate'] = sex_response['Customer_ID_x']/sex_response['Customer_ID_y']
sex_response[sex_response['Response']=='reply'][['Sex', 'response_rate']]

Out[52]:   Sex  response_rate
        1    F       0.170984
        3    M       0.144208

Conclusion:

Female customers have a higher response rate than male customers. This means that an email
sent to a female customer has a higher chance of receiving a reply.

Frequency Distribution of Job Feature

In [53]: job_freq = df.groupby('Job')['Customer_ID'].nunique().reset_index().sort_values(by='Customer_ID', ascending=False)
plt.figure(figsize=(15,6))
sns.barplot(data=job_freq, x='Job', y='Customer_ID')
plt.title('Job Frequency Distribution', fontweight='bold')
plt.ylabel('Count')

Out[53]: Text(0, 0.5, 'Count')

Conclusion:

The recipients of the email campaign are mostly part-time employees or students. However,
there is a concern regarding this feature because it contains a significant amount of missing
data (around 29%), which means that the insights derived from it may not be accurate enough.
From this data, it should also be checked whether the client is targeting the right segment
or not.
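
One possible way to keep the roughly 29% of customers with a missing Job visible in the analysis, instead of silently dropping them, is to label them as their own category. This is only a sketch; the 'unknown' label is an assumed placeholder, not part of the original data:

#Label missing Job values explicitly so they appear as their own category
job_filled = df['Job'].fillna('unknown')
job_filled.value_counts(normalize=True).head()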

In [58]: #See whether a customer's job correlates with the response
job_freq_resp = df.groupby(['Job','Response'])['Customer_ID'].nunique().reset_index()
job_response = job_freq_resp.merge(job_freq, how='left', on='Job')
job_response['response_rate'] = job_response['Customer_ID_x']/job_response['Customer_ID_y']
job_response[job_response['Response']=='reply'][['Job', 'response_rate']].sort_values(by='response_rate', ascending=False)
Out[58]:                     Job  response_rate
        15  sole proprietorship       0.250000
        3        general payer1       0.207317
        9                lawyer       0.171429
        7        general payer3       0.133333
        11                other       0.133333
        1                doctor       0.108108
        5        general payer2       0.101449
        13        part time job       0.097561
        17              student       0.057377

Insight:

The client might be targeting the wrong customers based on their Job. As we can see, the
response rates for students and part-time employees are the lowest compared to the other
job types. The client can address this by sending more emails to sole proprietorship,
general payer1, or lawyer customers.

The data also needs some transformation to break down what general payer1, 2, and 3 actually
mean, so that the analysis can be more detailed, as sketched below.
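
As a rough illustration of that transformation, the sketch below splits values such as general payer1 into a base category and a trailing numeric level. The names job_base and job_level are hypothetical, and the split assumes the level is always a digit at the end of the value:

#Split Job values like 'general payer1' into a base category and a numeric suffix
job_split = df['Job'].str.extract(r'^(?P<job_base>.+?)(?P<job_level>\d*)$')
job_split['job_level'] = pd.to_numeric(job_split['job_level'], errors='coerce')
job_split['job_base'].value_counts().head()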

Work 2

Age Feature Distribution Plot


In [100]: #Plotting The Distribution of Age Feature
plt.figure(figsize=(15,6))
df['Age'].plot(kind='hist', bins=50)
plt.title('Age Feature Distribution', fontweight='bold')
plt.xlabel('Age Value')
plt.ylabel('Frequency')
plt.show()

In [99]: #Density Distribution (KDE Plot)
plt.figure(figsize=(15,6))
df['Age'].plot(kind='hist', density=True, bins=50)
df['Age'].plot(kind='kde')
plt.title('Age Feature KDE Plot', fontweight='bold')
plt.xlabel('Age Value')
plt.show()

In [93]: # Degree of Deviation / Statistical Summary of the Feature
df['Age'].describe()

Out[93]: count 1003.000000
mean 35.522433
std 12.712868
min 14.000000
25% 28.000000
50% 35.000000
75% 42.000000
max 200.000000
Name: Age, dtype: float64

Conclusion:

The Age feature seems to have a roughly normal distribution shape. This means that the data is
well distributed and unlikely to contain a large number of outliers. With a normal distribution,
both the median and the mean can serve as a representative value of the feature.

From the data distribution, we can see that most customers are between 25 and 45 years old, or,
put simply, most of the email recipients are adults, with a few teenagers.

One transformation that is needed is removing the outlier (Age equal to 200), because a customer
age of 200 does not make any sense (see the sketch at the end of this conclusion).

Representative Value = 35.52, or roughly 35

Degree of Deviation = 12.712 (std dev) → meaning the data is clustered around the average value
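
A minimal sketch of that outlier removal; the cutoff of 100 is an assumption used here to treat implausible ages as data-entry errors:

#Drop implausible ages (assumed cutoff: 100) before recomputing the summary statistics
df_age_clean = df[df['Age'] <= 100]
df_age_clean['Age'].describe()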

charge_Monday Feature Distribution Plot


In [101]: #Plotting The Distribution of charge_Monday Feature
plt.figure(figsize=(15,6))
df['charge_Monday'].plot(kind='hist', bins=50)
plt.title('charge_Monday Feature Distribution', fontweight='bold')
plt.xlabel('charge_Monday Value')
plt.ylabel('Frequency')
plt.show()

In [102]: #Density Distribution (KDE Plot)
plt.figure(figsize=(15,6))
df['charge_Monday'].plot(kind='hist', density=True, bins=50)
df['charge_Monday'].plot(kind='kde')
plt.title('charge_Monday Feature KDE Plot', fontweight='bold')
plt.xlabel('charge_Monday Value')
plt.show()

In [103]: # Degree of Deviation / Statistical Summary of the Feature
df['charge_Monday'].describe()

Out[103]: count 525.000000
mean 42643.047619
std 55666.307408
min 2040.000000
25% 8370.000000
50% 24170.000000
75% 50230.000000
max 361110.000000
Name: charge_Monday, dtype: float64

Conclusion:

From this data we can conclude that most customer charges are roughly between 2,000 and
40,000. However, the data also tells us that there are customers with a charge larger than
100,000. The distribution appears to be right-skewed, which means the data contains some
outliers.

Because the distribution is skewed, the median would be the best representative value of the
data (see the sketch below).

Representative Value: 24,170 (median)

Degree of Deviation: 55,666.307 → meaning the data is very spread out
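
To back up the choice of the median over the mean, a minimal sketch comparing the two and checking the skewness (pandas' skew() ignores NaN values):

#Compare mean vs. median and check skewness; a clearly positive skew confirms the right-skewed shape
charge = df['charge_Monday']
print('mean  :', charge.mean())
print('median:', charge.median())
print('skew  :', charge.skew())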
