
EDA on VEHICLE INSURANCE DATA

A company has a customer table that contains 8 columns of customer details, and another table named customer_policy that contains the policy details of those customers.

The company intends to offer a premium discount to certain customers. To decide whom, it asked its data science team for some information. The DS team therefore decided to perform the following tasks:

1. Add the column names to both datasets:


i. Column Name for customer details table:

customer_id,

Gender,

age,

driving licence present,

region code,

previously insured,

vehicle age

and vehicle damage, in respective order.


ii. Column Name for customer_policy table:

customer_id,

annual premium (in Rs),

sales channel code,

vintage and response.


2. Checking and Cleaning Data Quality:
i. Null values

• Generate a summary of count of all the null values column wise


• Drop null values for customer_id, because imputing an id from central tendencies is not meaningful.
• Replace all null values for numeric columns by mean.
• Replace all null values for Categorical value by mode.
ii. Outliers

• Generate a summary of count of all the outliers column wise


• Replace all outlier values for numeric columns by mean.

(Hint 1: for outlier treatment, use the IQR method: for a column X, calculate Q1 = 25th percentile and Q3 = 75th percentile, then IQR = Q3 - Q1. Anything lower than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR is an outlier.

Hint 2: to get percentile values, explore the DataFrame.describe() method.)

iii. White spaces

• Remove white spaces

iv. case correction(lower or upper, any one)


v. Convert nominal data (categorical) into dummies

for future modeling use if required


vi. Drop Duplicates (duplicated rows)

3. Create a Master table for future use. Join the customer table and customer_policy table to get a master table using customer_id in both tables.
(Hint: use pd.merge() function)
4. The company needs some important information from the master table to make decisions for future growth. They need the following information:

• i. Gender wise average annual premium


• ii. Age wise average annual premium
• iii. Is your data balanced between the genders?

(Hint: Data is balanced if number of counts in each group is approximately same)

• iv. Vehicle age wise average annual premium.

5. Is there any relation between Person Age and annual premium?


Hint: use correlation function (Correlation describes the relationship between two variables).

Correlation coefficient < -0.5 - Strong negative relationship

Correlation coefficient > 0.5 - Strong positive relationship

-0.5 < Correlation coefficient < 0.5 - There is no strong relationship.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Loading the data files

df1 = pd.read_csv("customer_details.csv")

df2 = pd.read_csv("customer_policy_details.csv")
1. Add the column names to both datasets:
1.(i) Adding columns to customer details table

headers = ['customer_id','Gender','age','driving licence present','region code','previously insured','vehicle age','vehicle damage']

df1.columns=headers

df1.head()
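A side note, as a sketch under an assumption: renaming via `df.columns` suggests the CSV files ship without a header row, in which case passing `names=` to `pd.read_csv` avoids silently consuming the first data row as a header. The toy CSV text below stands in for the real file.

```python
import io
import pandas as pd

# Toy CSV text standing in for customer_details.csv (hypothetical values)
csv_text = "1.0,Male,44.0\n2.0,Male,76.0\n"
headers = ["customer_id", "Gender", "age"]

# header=None + names= assigns column names at load time,
# so the first data row is kept as data rather than read as a header
df = pd.read_csv(io.StringIO(csv_text), header=None, names=headers)
```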
   customer_id  Gender   age  driving licence present  region code  previously insured vehicle age vehicle damage
0          1.0    Male  44.0                      1.0         28.0                 0.0   > 2 Years            Yes
1          2.0    Male  76.0                      1.0          3.0                 0.0    1-2 Year             No
2          3.0    Male  47.0                      1.0         28.0                 0.0   > 2 Years            Yes
3          4.0    Male  21.0                      1.0         11.0                 1.0    < 1 Year             No
4          5.0  Female  29.0                      1.0         41.0                 1.0    < 1 Year             No

1.(ii) Adding columns to customer policy details table


df2.columns=['customer_id','annual premium (in Rs)','sales channel code','vintage','responce']
df2.head()
customer_id annual premium (in Rs) sales channel code vintage responce
0 1.0 40454.0 26.0 217.0 1.0
1 2.0 33536.0 26.0 183.0 0.0
2 3.0 38294.0 26.0 27.0 1.0
3 4.0 28619.0 152.0 203.0 0.0
4 5.0 27496.0 152.0 39.0 0.0

2. Checking and cleaning Data quality


2.(i). NULL Values

2.i.(a). Generate a summary of count of all the null values column wise

# the column wise count of null values on customer details table

df1_null = df1.isnull()
for i in df1_null.columns.values.tolist():
    print(i)
    print(df1_null[i].value_counts())
    print("")
customer_id
False 380723
True 386
Name: customer_id, dtype: int64

Gender
False 380741
True 368
Name: Gender, dtype: int64

age
False 380741
True 368
Name: age, dtype: int64

driving licence present


False 380716
True 393
Name: driving licence present, dtype: int64
region code
False 380717
True 392
Name: region code, dtype: int64

previously insured
False 380728
True 381
Name: previously insured, dtype: int64

vehicle age
False 380728
True 381
Name: vehicle age, dtype: int64

vehicle damage
False 380702
True 407
Name: vehicle damage, dtype: int64

• SUMMARY OF NULL VALUES ON CUSTOMER DETAILS TABLE


• customer_id has 386 null values

• Gender has 368 null values

• age has 368 null values

• driving licence present has 393 null values

• region code has 392 null values

• previously insured has 381 null values

• vehicle age has 381 null values

• vehicle damage has 407 null values
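A more concise way to get the same per-column null summary is `isnull().sum()`; a minimal sketch on toy data (hypothetical values, not the real dataset):

```python
import pandas as pd
import numpy as np

# Small illustrative frame (hypothetical values, not the real dataset)
df = pd.DataFrame({
    "customer_id": [1.0, 2.0, np.nan, 4.0],
    "age": [44.0, np.nan, 47.0, 21.0],
})

# One line instead of the loop above: count of nulls per column
null_counts = df.isnull().sum()
print(null_counts)
```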

# the column wise count of null values on customer policy details table
df2_null = df2.isnull()
for i in df2_null.columns.values.tolist():
    print(i)
    print(df2_null[i].value_counts())
    print("")
customer_id
False 380722
True 387
Name: customer_id, dtype: int64

annual premium (in Rs)


False 380763
True 346
Name: annual premium (in Rs), dtype: int64

sales channel code


False 380709
True 400
Name: sales channel code, dtype: int64

vintage
False 380721
True 388
Name: vintage, dtype: int64

responce
False 380748
True 361
Name: responce, dtype: int64

• SUMMARY OF NULL VALUES ON CUSTOMER-POLICY-DETAILS TABLE


• customer_id has 387 null values

• annual premium has 346 null values

• sales channel code has 400 null values

• vintage has 388 null values

• responce has 361 null values
2.i.(b). Dropping NULL values for customer_id

# Dropping the rows that contain null values of customer_id in the customer details table
df1.dropna(subset=['customer_id'], axis=0,inplace=True)

# resetting index because some rows deleted


df1.reset_index(drop = True, inplace = True)

# Dropping the rows that contain null values of customer_id in the customer policy details table

df2.dropna(subset=['customer_id'], axis=0,inplace=True)

# resetting index because some rows deleted


df2.reset_index(drop = True, inplace = True)
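The drop-then-reindex pattern used above can be seen on toy data (hypothetical values): rows whose customer_id is NaN are dropped, then the index is rebuilt so it runs 0, 1, 2, … again.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing customer_id (hypothetical values)
demo = pd.DataFrame({"customer_id": [1.0, np.nan, 3.0],
                     "age": [44.0, 76.0, 47.0]})

# Drop rows lacking an id, then reset the index since a row was removed
demo = demo.dropna(subset=["customer_id"]).reset_index(drop=True)
```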

2.i.(c). Replacing all null values for numeric columns by mean

df1.head()
   customer_id  Gender   age  driving licence present  region code  previously insured vehicle age vehicle damage
0          1.0    Male  44.0                      1.0         28.0                 0.0   > 2 Years            Yes
1          2.0    Male  76.0                      1.0          3.0                 0.0    1-2 Year             No
2          3.0    Male  47.0                      1.0         28.0                 0.0   > 2 Years            Yes
3          4.0    Male  21.0                      1.0         11.0                 1.0    < 1 Year             No
4          5.0  Female  29.0                      1.0         41.0                 1.0    < 1 Year             No

In this table, age and region code contain numerical data, whereas driving licence present and previously insured contain categorical data.

# Replacing the NaN values of age by its mean value

df1['age'].fillna(df1['age'].mean(),inplace = True)

# Replacing the NaN values of region code by its mean value

df1['region code'].fillna(df1['region code'].mean(),inplace = True)

df2.head()
   customer_id  annual premium (in Rs)  sales channel code  vintage  responce
0          1.0                 40454.0                26.0    217.0       1.0
1          2.0                 33536.0                26.0    183.0       0.0
2          3.0                 38294.0                26.0     27.0       1.0
3          4.0                 28619.0               152.0    203.0       0.0
4          5.0                 27496.0               152.0     39.0       0.0

In this table, annual premium, sales channel code, and vintage contain numerical data, whereas responce holds categorical data.
# Replacing the NaN values of annual premium by its mean value
df2['annual premium (in Rs)'].fillna(df2['annual premium (in Rs)'].mean(),inplace = True)

# Replacing the NaN values of sales channel code by its mean value
df2['sales channel code'].fillna(df2['sales channel code'].mean(),inplace = True)

# Replacing the NaN values of vintage by its mean value


df2['vintage'].fillna(df2['vintage'].mean(),inplace = True)

2.i.(d) Replacing categorical null values by the mode


# it replaces the null values with the value which exist maximum no of times i.e. mode
df1['driving licence present'].fillna (df1['driving licence present'].mode()[0], inplace=True)
# it replaces the null values with the value which exist maximum no of times i.e. mode
df1['previously insured'].fillna (df1['previously insured'].mode()[0], inplace=True)

# it replaces the null values with the value which exist maximum no of times i.e. mode

df2['responce'].fillna(df2['responce'].mode()[0], inplace=True)
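The column-by-column treatment above can be wrapped into a small helper. This is a sketch, not part of the original notebook: it fills numeric columns with the mean and everything else with the mode.

```python
import numpy as np
import pandas as pd

def impute(df):
    """Fill numeric columns with the mean and non-numeric columns with the mode.
    A generic sketch of the column-by-column treatment used above."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy frame (hypothetical values): one numeric gap, one categorical gap
demo = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                     "Gender": ["Male", None, "Male"]})
clean = impute(demo)
```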

2.(ii) OUTLIERS
2.ii.(a). Summary of Count of Outliers

In the customer details table we need to check for outliers in the columns age and region code; the others are non-numerical, so we ignore them.

# this is the function to boxplot for easy visualisation of outliers..

def plot_boxplot(df,ft):
    df.boxplot(column=[ft])
    plt.grid(False)
    plt.show()

plot_boxplot(df1,'age')
plot_boxplot(df1,'region code')

From the above boxplots we can see there are no outliers in the customer details table. Let's confirm with a summary of the outliers.

# this functions finds outliers if present by IQR method..

def finding_outliers(df):
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3-q1
    outlier = df[((df<(q1-1.5*iqr)) | (df>(q3+1.5*iqr)))]
    return outlier

outliers = finding_outliers(df1['age'])

print('Number of outliers :',len(outliers))


print('Maximum outlier value :',outliers.max())
print('Minimum outlier value :',outliers.min())
Number of outliers : 0
Maximum outlier value : nan
Minimum outlier value : nan

outliers = finding_outliers(df1['region code'])


print('Number of outliers :',len(outliers))
print('Maximum outlier value :',outliers.max())
print('Minimum outlier value :',outliers.min())
Number of outliers : 0
Maximum outlier value : nan
Minimum outlier value : nan

From the above results we can conclude that the customer details table has no outliers.

Finding outliers for the customer policy details table:


plot_boxplot(df2,'annual premium (in Rs)')

plot_boxplot(df2,'sales channel code')

plot_boxplot(df2,'vintage')
From the above boxplots we can conclude that only the annual premium column has outliers. Let's look at a summary of those outliers.

outliers = finding_outliers(df2['annual premium (in Rs)'])

print('Number of outliers :',len(outliers))


print('Maximum outlier value :',outliers.max())
print('Minimum outlier value :',outliers.min())
Number of outliers : 10332
Maximum outlier value : 540165.0
Minimum outlier value : 61858.0

outliers = finding_outliers(df2['sales channel code'])

print('Number of outliers :',len(outliers))


print('Maximum outlier value :',outliers.max())
print('Minimum outlier value :',outliers.min())
Number of outliers : 0
Maximum outlier value : nan
Minimum outlier value : nan
outliers = finding_outliers(df2['vintage'])

print('Number of outliers :',len(outliers))


print('Maximum outlier value :',outliers.max())
print('Minimum outlier value :',outliers.min())
Number of outliers : 0
Maximum outlier value : nan
Minimum outlier value : nan

2.ii.(b).Replacing the outliers with mean values


def replace_outlier(df):
    # IQR bounds; anything outside them is replaced by the column mean
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3-q1
    upper = q3 + 1.5*iqr
    lower = q1 - 1.5*iqr
    df = np.where((df > upper) | (df < lower), df.mean(), df)
    return df

df2['annual premium (in Rs)'] = replace_outlier(df2['annual premium (in Rs)'])
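As a quick sanity check of the IQR rule on a toy series (hypothetical values, not the real data), the single extreme value should be replaced by the column mean:

```python
import pandas as pd

# Toy series: 100.0 is far outside the IQR fences of the other values
s = pd.Series([10.0, 12.0, 11.0, 13.0, 100.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Keep in-range values, replace everything outside the fences by the mean
cleaned = s.where(s.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr), s.mean())
```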

2.(iii). Remove white spaces


df1 = df1.apply(lambda x: x.str.strip() if x.dtype=='object' else x)
df1
        customer_id  Gender   age  driving licence present  region code  previously insured vehicle age vehicle damage
0               1.0    Male  44.0                      1.0         28.0                 0.0   > 2 Years            Yes
1               2.0    Male  76.0                      1.0          3.0                 0.0    1-2 Year             No
2               3.0    Male  47.0                      1.0         28.0                 0.0   > 2 Years            Yes
3               4.0    Male  21.0                      1.0         11.0                 1.0    < 1 Year             No
4               5.0  Female  29.0                      1.0         41.0                 1.0    < 1 Year             No
...             ...     ...   ...                      ...          ...                 ...         ...            ...
380718     381105.0    Male  74.0                      1.0         26.0                 1.0    1-2 Year             No
380719     381106.0    Male  30.0                      1.0         37.0                 1.0    < 1 Year             No
380720     381107.0    Male  21.0                      1.0         30.0                 1.0    < 1 Year             No
380721     381108.0  Female  68.0                      1.0         14.0                 0.0   > 2 Years            Yes
380722     381109.0    Male  46.0                      1.0         29.0                 0.0    1-2 Year             No
NOTE: The customer policy details table contains only float-type columns, so there is no need to remove spaces.

2.(iv). Case Correction

convert all the characters to upper case.

df1 = df1.apply(lambda x: x.str.upper() if x.dtype=='object' else x)
df1
        customer_id  Gender   age  driving licence present  region code  previously insured vehicle age vehicle damage
0               1.0    MALE  44.0                      1.0         28.0                 0.0   > 2 YEARS            YES
1               2.0    MALE  76.0                      1.0          3.0                 0.0    1-2 YEAR             NO
2               3.0    MALE  47.0                      1.0         28.0                 0.0   > 2 YEARS            YES
3               4.0    MALE  21.0                      1.0         11.0                 1.0    < 1 YEAR             NO
4               5.0  FEMALE  29.0                      1.0         41.0                 1.0    < 1 YEAR             NO
...             ...     ...   ...                      ...          ...                 ...         ...            ...
380718     381105.0    MALE  74.0                      1.0         26.0                 1.0    1-2 YEAR             NO
380719     381106.0    MALE  30.0                      1.0         37.0                 1.0    < 1 YEAR             NO
380720     381107.0    MALE  21.0                      1.0         30.0                 1.0    < 1 YEAR             NO
380721     381108.0  FEMALE  68.0                      1.0         14.0                 0.0   > 2 YEARS            YES
380722     381109.0    MALE  46.0                      1.0         29.0                 0.0    1-2 YEAR             NO

2.(v). Convert nominal data into dummy variables

Indicator Variable (or Dummy Variable)


What is an indicator variable?
An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning.
Why do we use indicator variables?
We use indicator variables so that categorical variables can be used in regression analysis in later modules.
Example

We see the column "vehicle damage" has two unique values: "Yes" or "No". Regression doesn't understand words, only numbers. To use this attribute in
regression analysis, we convert "vehicle damage" to indicator variables.

We will use pandas' method 'get_dummies' to assign numerical values to different categories of vehicle damage.

# Creating dummy variable for vehicle damage

dummy_variable_1 = pd.get_dummies(df1["vehicle damage"])

dummy_variable_1.head()
No Yes
0 0 1
1 1 0
2 0 1
3 1 0
4 1 0

# Changing the column names for dummy variable

dummy_variable_1.rename(columns={'No':'vehicle-damage-No', 'Yes':'vehicle-damage-Yes'}, inplace=True)

dummy_variable_1.head()
vehicle-damage-No vehicle-damage-Yes
0 0 1
1 1 0
2 0 1
3 1 0
4 1 0

# merge data frame "df1" and "dummy_variable_1"


df1 = pd.concat([df1, dummy_variable_1], axis=1)

# drop original column "vehicle damage" from "df1"


df1.drop("vehicle damage", axis = 1, inplace=True)

Similarly, we can create a dummy variable for Gender if needed.
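That Gender suggestion can be sketched as follows (hypothetical toy data, not run on the real table); `drop_first=True` keeps a single indicator column, which is enough for a binary category:

```python
import pandas as pd

# Toy Gender column (hypothetical values)
demo = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

# One indicator column suffices for two categories;
# drop_first=True drops the first (alphabetical) category, "Female"
gender_dummies = pd.get_dummies(demo["Gender"], prefix="Gender",
                                drop_first=True).astype(int)
```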

2.(vi). Drop Duplicates


# Dropping duplicates from customer details table

df1.drop_duplicates(inplace=True)

# Dropping duplicates from customer policy details table

df2.drop_duplicates(inplace=True)
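On toy data (hypothetical values) the effect of drop_duplicates is easy to see: fully identical rows are removed, keeping the first occurrence.

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical values)
demo = pd.DataFrame({"customer_id": [1, 1, 2], "age": [44, 44, 76]})

# Remove duplicated rows and rebuild the index
deduped = demo.drop_duplicates().reset_index(drop=True)
```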

3. Merging the data sets.


master_data = pd.merge(df1,df2,on='customer_id')

master_data
        customer_id  Gender   age  driving licence present  region code  previously insured vehicle age  vehicle-damage-No  vehicle-damage-Yes  annual premium (in Rs)  sales channel code  vintage  responce
0               1.0    Male  44.0                      1.0         28.0                 0.0   > 2 Years                  0                   1                 40454.0                26.0    217.0       1.0
1               2.0    Male  76.0                      1.0          3.0                 0.0    1-2 Year                  1                   0                 33536.0                26.0    183.0       0.0
2               3.0    Male  47.0                      1.0         28.0                 0.0   > 2 Years                  0                   1                 38294.0                26.0     27.0       1.0
3               4.0    Male  21.0                      1.0         11.0                 1.0    < 1 Year                  1                   0                 28619.0               152.0    203.0       0.0
4               5.0  Female  29.0                      1.0         41.0                 1.0    < 1 Year                  1                   0                 27496.0               152.0     39.0       0.0
...             ...     ...   ...                      ...          ...                 ...         ...                ...                 ...                     ...                 ...      ...       ...
380331     381105.0    Male  74.0                      1.0         26.0                 1.0    1-2 Year                  1                   0                 30170.0                26.0     88.0       0.0
380332     381106.0    Male  30.0                      1.0         37.0                 1.0    < 1 Year                  1                   0                 40016.0               152.0    131.0       0.0
380333     381107.0    Male  21.0                      1.0         30.0                 1.0    < 1 Year                  1                   0                 35118.0               160.0    161.0       0.0
380334     381108.0  Female  68.0                      1.0         14.0                 0.0   > 2 Years                  0                   1                 44617.0               124.0     74.0       0.0
380335     381109.0    Male  46.0                      1.0         29.0                 0.0    1-2 Year                  1                   0                 41777.0                26.0    237.0       0.0
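The join behaviour can be illustrated on toy frames (hypothetical ids): pd.merge with the default how='inner' keeps only the customer_ids present in both tables.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values):
# id 3 has no policy, id 4 has no customer record
cust = pd.DataFrame({"customer_id": [1, 2, 3], "age": [44, 76, 47]})
policy = pd.DataFrame({"customer_id": [1, 2, 4], "premium": [40454, 33536, 28619]})

# Inner join on the shared key: only ids 1 and 2 survive
master = pd.merge(cust, policy, on="customer_id")
```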

4. Getting the required information for decisions


4.(i). Gender wise average annual premium
gender_data = master_data.groupby('Gender')['annual premium (in Rs)'].mean()

gender_data
Gender
Female 29273.474247
Male 29323.022594
Name: annual premium (in Rs), dtype: float64
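The groupby-mean pattern used here, shown on toy data (hypothetical premiums):

```python
import pandas as pd

# Toy frame (hypothetical values): group rows by Gender, average the premium
demo = pd.DataFrame({"Gender": ["Male", "Female", "Male"],
                     "premium": [30000.0, 29000.0, 31000.0]})
avg = demo.groupby("Gender")["premium"].mean()
```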

# plotting a line graph

gender_data.plot()
# plotting a bar graph

gender_data.plot.bar(title='Gender vs average annual premium')

4.(ii). age wise average annual premium


age_data = master_data.groupby('age')['annual premium (in Rs)'].mean()

age_data
age
20.0 26342.073517
21.0 29751.791916
22.0 29946.848634
23.0 29838.344763
24.0 30125.557096
...
81.0 29287.910702
82.0 36480.586199
83.0 28995.818172
84.0 35440.818182
85.0 26637.454525
Name: annual premium (in Rs), Length: 67, dtype: float64
# showing the results on a line graph
age_data.plot(xlabel="age",ylabel="annual premium(in Rs)",title="Age vs Average annual premium")

# plotting a bar graph

age_data.plot.bar()

4.(iii). Checking whether the data is balanced between genders

master_data.groupby('Gender').count()
        customer_id     age  driving licence present  region code  previously insured  vehicle age  vehicle-damage-No  vehicle-damage-Yes  annual premium (in Rs)  sales channel code  vintage  responce
Gender
Female       174485  174485                   174485       174485              174485       174309             174485              174485                  174485              174485   174485    174485
Male         205484  205484                   205484       205484              205484       205279             205484              205484                  205484              205484   205484    205484

• Summary
• The ratio of male to female counts is approximately 1.18, i.e. close to 1.
Since the counts per gender are approximately the same, the data is balanced between genders.
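The balance check can also be done with value_counts; a sketch on a toy series (hypothetical values):

```python
import pandas as pd

# Toy Gender column: 3 Male, 2 Female (hypothetical values)
demo = pd.Series(["Male", "Female", "Male", "Female", "Male"])

counts = demo.value_counts()
# A ratio close to 1 between the largest and smallest group means balanced data
ratio = counts.max() / counts.min()
```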

4.(iv). Vehicle age wise average annual premium

vehicle_age = master_data.groupby('vehicle age')['annual premium (in Rs)'].mean()

vehicle_age
vehicle age
1-2 Year 29099.066738
< 1 Year 29188.150594
> 2 Years 32943.540830
Name: annual premium (in Rs), dtype: float64

vehicle_age.plot() # line graph


vehicle_age.plot.bar() # bar graph

5. Correlation between age and annual premium


Pearson Correlation

The Pearson Correlation measures the linear dependence between two variables X and Y.

The resulting coefficient is a value between -1 and 1 inclusive, where:


• 1: Perfect positive linear correlation.
• 0: No linear correlation, the two variables most likely do not affect each other.
• -1: Perfect negative linear correlation.
• Correlation coefficient < -0.5 - Strong negative relationship
• Correlation coefficient > 0.5 - Strong positive relationship
• -0.5 < Correlation coefficient < 0.5 - There is no relationship.
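To see that pandas' `.corr()` really computes the Pearson coefficient, it can be checked against the definition cov(x, y) / (std(x) * std(y)) on toy data (hypothetical values):

```python
import numpy as np
import pandas as pd

# Toy age/premium pairs (hypothetical values)
x = pd.Series([20.0, 30.0, 40.0, 50.0])
y = pd.Series([25000.0, 26000.0, 31000.0, 30000.0])

# Pearson r by the definition: sample covariance over product of sample stds
manual_r = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# pandas' built-in Pearson correlation
pandas_r = x.corr(y)
```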

# finding the correlation coefficient in pandas

master_data['age'].corr(master_data['annual premium (in Rs)'])


0.0506575892861754

# Matrix form of correlation

master_data[['age','annual premium (in Rs)']].corr()


age annual premium (in Rs)
age 1.000000 0.050658
annual premium (in Rs) 0.050658 1.000000

Summary
Since the Pearson coefficient (≈0.051) lies between -0.5 and 0.5, there is no strong relationship between age and annual premium.
