Exercise Univariate Analysis - Andoni Fikri - 13118111
1 2 80000042 reply M 39.0 general payer1 0.0 0.0 0.
4 5 80000529 no reply M 33.0 general payer1 0.0 0.0 0.
Out[5]: NO 0.000000
Customer_ID 0.000000
Response 22.428461
Sex 4.717711
Age 4.717711
Job 40.371230
questionnaire1 15.313225
questionnaire2 15.313225
questionnaire3 15.313225
questionnaire4 15.313225
questionnaire5 15.313225
avg_charge 20.108275
charge_avg_per_mon 20.108275
charge_Monday 56.148492
contraction_day 16.009281
contraction_day_JP 16.009281
dtype: float64
In [72]: #Remove missing data, especially for the Response column, because we need to know whether
df = df[~df['Response'].isnull()] #filtering data where the response column is not
df.isnull().sum()/len(df)*100 #Checking for missing values after filtering the data
Out[72]: NO 0.000000
Customer_ID 0.000000
Response 0.000000
Sex 0.000000
Age 0.000000
Job 29.312064
questionnaire1 0.000000
questionnaire2 0.000000
questionnaire3 0.000000
questionnaire4 0.000000
questionnaire5 0.000000
avg_charge 0.000000
charge_avg_per_mon 0.000000
charge_Monday 47.657029
contraction_day 0.000000
contraction_day_JP 0.000000
dtype: float64
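The missing-value check and the filter on Response can be reproduced on a small example. This is a minimal sketch on a made-up toy frame (the values are illustrative, not the campaign data); `isna().mean()` is used here as an equivalent of `isnull().sum()/len(df)`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the campaign data; all values are made up.
toy = pd.DataFrame({
    "Customer_ID": [1, 2, 3, 4],
    "Response": ["reply", None, "no reply", "no reply"],
    "Age": [39.0, np.nan, 33.0, 51.0],
})

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean() * 100

# Keep only the rows where the target column (Response) is known.
filtered = toy.dropna(subset=["Response"])

# Re-check the missing share after filtering.
missing_after = filtered.isna().mean() * 100
print(missing_pct)
print(missing_after)
```

In this toy frame the row with a missing Response is also the row with a missing Age, so both columns drop to 0% after filtering. Columns whose gaps sit in other rows, like Job and charge_Monday in the output above, keep their missing values.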
Work 1
Conclusion:
From the frequency distribution plot above, we can conclude that only a small share of
customers (around 10%) responded to the email campaign. This indicates that the response
rate of the campaign is very low, and a strategy is needed to increase it.
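The overall response rate can also be read as a number rather than from the plot. The following is a minimal sketch on made-up values, using `value_counts(normalize=True)` to turn counts into shares.

```python
import pandas as pd

# Made-up responses: 1 reply out of 10 recipients (illustrative only).
responses = pd.Series(["reply"] + ["no reply"] * 9, name="Response")

# normalize=True returns the share of each category instead of raw counts.
shares = responses.value_counts(normalize=True)

# Share of recipients who replied; 0.0 if nobody replied.
reply_rate = shares.get("reply", 0.0)
print(shares)
```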
Conclusion:
The recipients of the email campaign are mostly female. Based on this information, we can
check if females have a higher response rate than males. If the opposite is true, it could
mean that Mr. Matsui targeted the wrong customers.
1 F 0.170984
3 M 0.144208
Conclusion:
Female customers have a higher response rate than male customers (about 17.1% versus
14.4%). This means an email sent to a female customer has a higher chance of being replied to.
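A response rate per sex can be computed with a single groupby. This is a sketch on made-up data that assumes one row per customer; the notebook itself counts unique Customer_ID values, which gives the same result under that assumption.

```python
import pandas as pd

# Made-up sample: 4 female and 4 male recipients (illustrative only).
toy = pd.DataFrame({
    "Sex": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Response": ["reply", "reply", "no reply", "no reply",
                 "reply", "no reply", "no reply", "no reply"],
})

# The mean of a boolean "replied" flag within each group is the response rate.
reply_rate_by_sex = toy["Response"].eq("reply").groupby(toy["Sex"]).mean()
print(reply_rate_by_sex)
```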
Frequency Distribution of Job Feature
In [53]: job_freq = df.groupby('Job')['Customer_ID'].nunique().reset_index().sort_values(by=
plt.figure(figsize=(15,6))
sns.barplot(data=job_freq, x='Job', y='Customer_ID')
plt.title('Job Frequency Distribution', fontweight='bold')
plt.ylabel('Count')
Conclusion:
The recipients of the email campaign are mostly part-time employees or students. However,
the Job column contains a significant share of missing values (around 29%), so the
insights derived from it may not be accurate enough.
From this data, it should also be checked whether the client is targeting the right
segment.
In [58]: #See whether a customer's job has a correlation with the response
job_freq_resp = df.groupby(['Job','Response'])['Customer_ID'].nunique().reset_index
job_response = job_freq_resp.merge(job_freq, how='left', on='Job')
job_response['response_rate'] = job_response['Customer_ID_x']/job_response['Custome
job_response[job_response['Response']=='reply'][['Job', 'response_rate']].sort_valu
Out[58]: Job response_rate
9 lawyer 0.171429
11 other 0.133333
1 doctor 0.108108
17 student 0.057377
Insight:
The client might be targeting the wrong customers based on their job. As we can see,
students and part-time employees have the lowest response rates of all customer job
types. The client could address this by sending more emails to sole proprietorship,
general payer1, or lawyer customers.
The data would also benefit from a transformation that breaks down what general payer1,
2, and 3 mean, to make the job categories more detailed.
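When comparing response rates across jobs, it helps to show the group size next to each rate, since a rate computed from a handful of customers is unstable. The sketch below uses made-up data and assumes one row per customer; the column names `customers` and `response_rate` are chosen here for illustration. It is an alternative to the merge-based calculation, not a reproduction of it.

```python
import pandas as pd

# Made-up sample (illustrative only); one recipient has no Job recorded.
toy = pd.DataFrame({
    "Job": ["student"] * 4 + ["lawyer"] * 2 + [None],
    "Response": ["reply", "no reply", "no reply", "no reply",
                 "reply", "no reply", "reply"],
})

summary = (
    toy.assign(replied=toy["Response"].eq("reply"))
       .groupby("Job")["replied"]          # rows with a missing Job are dropped here
       .agg(["size", "mean"])              # group size and share of replies
       .rename(columns={"size": "customers", "mean": "response_rate"})
       .sort_values("response_rate", ascending=False)
)
print(summary)
```

Because `groupby` drops rows with a missing key by default, the roughly 29% of customers without a Job would not appear in such a table.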
Work 2
Conclusion:
The Age feature appears to be roughly normally distributed. This means that the data is well
distributed and less likely to contain a large number of outliers. It also means that both
the median and the mean can serve as representative values of the feature.
From the data distribution, we can see that most customers are between 25 and 45 years old;
in other words, most of the email recipients are adults, with a few teenagers.
One transformation needed is removing the outlier (Age equal to 200), because a customer
age of 200 is not plausible.
Degree of Deviation = 12.712 (std dev) → meaning the data is clustered around the average value
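Removing an implausible age and re-checking the spread can be sketched as follows. The ages are made up, and the cutoff `MAX_PLAUSIBLE_AGE = 120` is an assumption chosen for illustration; the notebook only states that the value 200 should be removed.

```python
import pandas as pd

# Made-up ages with one implausible value (illustrative only).
ages = pd.Series([25, 30, 35, 40, 45, 200], name="Age")

# Assumed upper bound for a plausible customer age (not from the notebook).
MAX_PLAUSIBLE_AGE = 120

# Keep only the plausible ages.
cleaned = ages[ages <= MAX_PLAUSIBLE_AGE]

# Sample standard deviation before and after removing the outlier.
std_before = ages.std()
std_after = cleaned.std()
print(std_before, std_after)
```

A single extreme value inflates the standard deviation considerably, so the deviation is best reported after the outlier has been removed.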
Conclusion:
From this data we can conclude that most of the customer charge is around 0 2.000 to
40.000. However, the data also shows that some customers have a charge larger than
100.000. The distribution appears to be right-skewed, with a long tail of high values that
includes some outliers.
Because the distribution is skewed, the median is the best representative value of the
data.
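The effect of right skew on the mean and the median can be checked numerically. This is a minimal sketch on made-up charges (not the avg_charge column itself), using `Series.skew()` as one possible skewness measure.

```python
import pandas as pd

# Made-up charges with one very large value (illustrative only).
charges = pd.Series([2000, 5000, 8000, 12000, 20000, 150000], name="avg_charge")

skewness = charges.skew()      # positive for a right-skewed distribution
mean_charge = charges.mean()   # pulled upward by the large value
median_charge = charges.median()  # unaffected by how extreme the tail is
print(skewness, mean_charge, median_charge)
```

In this toy example the mean lies well above the median, which is why the median is the more representative value for a right-skewed feature.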