DATA ANALYTICS ON VEHICLE INSURANCE DATA
A company has a customer table that contains 8 columns of customer details, and another table named customer_policy that contains the policy details of each customer.
The company intends to offer a discount on the premium to certain customers. To do that, they ask their data science (DS) team for some information.
The DS team therefore decided to perform the following tasks:
Customer details columns: customer_id, Gender, age, region code, previously insured, vehicle age
Customer policy columns: customer_id,
For example: for a column X, calculate Q1 = 25th percentile and Q3 = 75th percentile, then IQR = Q3 - Q1. To check for outliers, any value lower than Q1 - 1.5 * IQR or greater than Q3 + 1.5 * IQR is an outlier.
3. Create a master table for future use. Join the customer table and the customer_policy table on customer_id, which is present in both tables.
(Hint: use the pd.merge() function.)
4. The company needs some important information from the master table to make decisions for future growth. They need the following information:
import pandas as pd

# Load the two raw tables
df1 = pd.read_csv("customer_details.csv")
df2 = pd.read_csv("customer_policy_details.csv")
1. Add the column names to both datasets:
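1.(i) Adding column names to the customer details table
# Column names as they appear in the df1.head() output
headers = ['customer_id', 'Gender', 'age', 'driving licence present', 'region code', 'previously insured', 'vehicle age', 'vehicle damage']
df1.columns = headers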
df1.head()
   customer_id  Gender   age  driving licence present  region code  previously insured  vehicle age  vehicle damage
0          1.0    Male  44.0                      1.0         28.0                 0.0     >2 Years             Yes
1          2.0    Male  76.0                      1.0          3.0                 0.0     1-2 Year              No
2          3.0    Male  47.0                      1.0         28.0                 0.0     >2 Years             Yes
3          4.0    Male  21.0                      1.0         11.0                 1.0      <1 Year              No
4          5.0  Female  29.0                      1.0         41.0                 1.0      <1 Year              No
2.i.(a). Generate a summary of count of all the null values column wise
df1_null = df1.isnull()
for i in df1_null.columns.values.tolist():
    print(i)
    print(df1_null[i].value_counts())
    print("")
customer_id
False 380723
True 386
Name: customer_id, dtype: int64
Gender
False 380741
True 368
Name: Gender, dtype: int64
age
False 380741
True 368
Name: age, dtype: int64
previously insured
False 380728
True 381
Name: previously insured, dtype: int64
vehicle age
False 380728
True 381
Name: vehicle age, dtype: int64
vehicle damage
False 380702
True 407
Name: vehicle damage, dtype: int64
# The column-wise count of null values in the customer policy details table
df2_null = df2.isnull()
for i in df2_null.columns.values.tolist():
    print(i)
    print(df2_null[i].value_counts())
    print("")
customer_id
False 380722
True 387
Name: customer_id, dtype: int64
vintage
False 380721
True 388
Name: vintage, dtype: int64
responce
False 380748
True 361
Name: responce, dtype: int64
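The per-column loops above print one True/False tally per column. A more compact way to get the same null counts is to sum the boolean mask. The following is a minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the customer details table
toy = pd.DataFrame({
    "customer_id": [1.0, 2.0, np.nan, 4.0],
    "Gender": ["Male", None, "Female", "Male"],
    "age": [44.0, np.nan, np.nan, 21.0],
})

# isnull() gives a boolean mask; summing it counts the nulls in each column
null_counts = toy.isnull().sum()
print(null_counts)
```

On the real tables the same call would be df1.isnull().sum() and df2.isnull().sum().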
# Dropping the rows that contain null values of customer_id in the customer details table
df1.dropna(subset=['customer_id'], axis=0, inplace=True)
# Dropping the rows that contain null values of customer_id in the customer policy details table
df2.dropna(subset=['customer_id'], axis=0, inplace=True)
df1.head()
   customer_id  Gender   age  driving licence present  region code  previously insured  vehicle age  vehicle damage
0          1.0    Male  44.0                      1.0         28.0                 0.0     >2 Years             Yes
1          2.0    Male  76.0                      1.0          3.0                 0.0     1-2 Year              No
2          3.0    Male  47.0                      1.0         28.0                 0.0     >2 Years             Yes
3          4.0    Male  21.0                      1.0         11.0                 1.0      <1 Year              No
4          5.0  Female  29.0                      1.0         41.0                 1.0      <1 Year              No
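In this table, the NaN values of the numerical column age are replaced by its mean value.
# Assigning the result back avoids relying on inplace=True on a column selection
df1['age'] = df1['age'].fillna(df1['age'].mean())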
df2.head()
   customer_id  annual premium (in Rs)  sales channel code  vintage  responce
0          1.0                 40454.0                26.0    217.0       1.0
1          2.0                 33536.0                26.0    183.0       0.0
2          3.0                 38294.0                26.0     27.0       1.0
3          4.0                 28619.0               152.0    203.0       0.0
4          5.0                 27496.0               152.0     39.0       0.0
In this table, annual premium, sales channel code, and vintage contain numerical data, whereas responce contains categorical data.
# Replacing the NaN values of annual premium by its mean value
df2['annual premium (in Rs)'] = df2['annual premium (in Rs)'].fillna(df2['annual premium (in Rs)'].mean())
# Replacing the NaN values of sales channel code by its mean value
df2['sales channel code'] = df2['sales channel code'].fillna(df2['sales channel code'].mean())
# Replacing the null values of responce with the most frequent value, i.e. the mode
df2['responce'] = df2['responce'].fillna(df2['responce'].mode()[0])
2.(ii) OUTLIERS
2.ii.(a). Summary of Count of Outliers
In the customer details table we need to check for outliers in the columns age and region code. The other columns are identifiers or categorical values, so we ignore them.
import matplotlib.pyplot as plt

def plot_boxplot(df, ft):
    # Draw a boxplot of the single column `ft` of DataFrame `df`
    df.boxplot(column=[ft])
    plt.grid(False)
    plt.show()

plot_boxplot(df1, 'age')
plot_boxplot(df1, 'region code')
From the above boxplots we can see that there are no outliers in the customer details table.
Let's look at a summary of the outliers for the customer details table.
def finding_outliers(df):
    # Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outlier = df[(df < (q1 - 1.5 * iqr)) | (df > (q3 + 1.5 * iqr))]
    return outlier

outliers = finding_outliers(df1['age'])
From the above results we can conclude that the customer details table doesn't have outliers.
plot_boxplot(df2,'vintage')
From the above boxplots we can conclude that only the annual premium column has outliers. Let's find the summary of the outliers.
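One way such a summary could be produced is to count the values that fall outside the IQR fences. The sketch below is illustrative only: count_outliers is a hypothetical helper name, and the sample values (including the deliberately extreme 500000.0) are made up rather than taken from the dataset.

```python
import pandas as pd

def count_outliers(series: pd.Series) -> int:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(mask.sum())

# Hypothetical premiums; the last value is an artificial extreme
premium = pd.Series([27496.0, 28619.0, 33536.0, 38294.0, 40454.0, 500000.0])
n_outliers = count_outliers(premium)
print(n_outliers)  # prints 1
```

Applied to the real data, the call would be count_outliers(df2['annual premium (in Rs)']).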
We see the column "vehicle damage" has two unique values: "Yes" or "No". Regression doesn't understand words, only numbers. To use this attribute in
regression analysis, we convert "vehicle damage" to indicator variables.
We will use pandas' method 'get_dummies' to assign numerical values to different categories of vehicle damage.
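The conversion described above might look like the following minimal sketch. The toy frame and the variable name dummies are assumptions for illustration. The renamed columns follow the names that appear in the outputs of this notebook.

```python
import pandas as pd

# Hypothetical toy frame with the same categorical column
toy = pd.DataFrame({"vehicle damage": ["Yes", "No", "Yes", "No", "No"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy["vehicle damage"], dtype=int)

# Rename the indicator columns to more descriptive names
dummies = dummies.rename(columns={"No": "vehicle-damage-No",
                                  "Yes": "vehicle-damage-Yes"})

# Attach the indicators and drop the original text column
toy = pd.concat([toy, dummies], axis=1).drop(columns="vehicle damage")
print(toy.head())
```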
dummy_variable_1.head()
No Yes
0 0 1
1 1 0
2 0 1
3 1 0
4 1 0
dummy_variable_1.head()
vehicle-damage-No vehicle-damage-Yes
0 0 1
1 1 0
2 0 1
3 1 0
4 1 0
df1.drop_duplicates(inplace=True)
df2.drop_duplicates(inplace=True)
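The master table from task 3 is obtained by joining the two tables on customer_id. The sketch below shows the idea on two hypothetical toy frames. The choice of how="inner" is an assumption, since the join type used in the notebook is not shown.

```python
import pandas as pd

# Hypothetical stand-ins for the customer and customer_policy tables
customers = pd.DataFrame({
    "customer_id": [1.0, 2.0, 3.0],
    "Gender": ["Male", "Male", "Male"],
    "age": [44.0, 76.0, 47.0],
})
policies = pd.DataFrame({
    "customer_id": [1.0, 2.0, 4.0],
    "annual premium (in Rs)": [40454.0, 33536.0, 28619.0],
})

# Keep only the customers that appear in both tables (assumed inner join)
master = pd.merge(customers, policies, on="customer_id", how="inner")
print(master)
```

With an inner join, customers 3.0 and 4.0 are dropped because each appears in only one of the two toy tables.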
master_data
        customer_id  Gender   age  driving licence present  region code  previously insured  vehicle age  vehicle-damage-No  vehicle-damage-Yes  annual premium (in Rs)  sales channel code  vintage  responce
0               1.0    Male  44.0                      1.0         28.0                 0.0     >2 Years                  0                   1                 40454.0                26.0    217.0       1.0
1               2.0    Male  76.0                      1.0          3.0                 0.0     1-2 Year                  1                   0                 33536.0                26.0    183.0       0.0
2               3.0    Male  47.0                      1.0         28.0                 0.0     >2 Years                  0                   1                 38294.0                26.0     27.0       1.0
3               4.0    Male  21.0                      1.0         11.0                 1.0      <1 Year                  1                   0                 28619.0               152.0    203.0       0.0
4               5.0  Female  29.0                      1.0         41.0                 1.0      <1 Year                  1                   0                 27496.0               152.0     39.0       0.0
...             ...     ...   ...                      ...          ...                 ...          ...                ...                 ...                     ...                 ...      ...       ...
380331     381105.0    Male  74.0                      1.0         26.0                 1.0     1-2 Year                  1                   0                 30170.0                26.0     88.0       0.0
380332     381106.0    Male  30.0                      1.0         37.0                 1.0      <1 Year                  1                   0                 40016.0               152.0    131.0       0.0
380333     381107.0    Male  21.0                      1.0         30.0                 1.0      <1 Year                  1                   0                 35118.0               160.0    161.0       0.0
380334     381108.0  Female  68.0                      1.0         14.0                 0.0     >2 Years                  0                   1                 44617.0               124.0     74.0       0.0
380335     381109.0    Male  46.0                      1.0         29.0                 0.0     1-2 Year                  1                   0                 41777.0                26.0    237.0       0.0
gender_data
Gender
Female 29273.474247
Male 29323.022594
Name: annual premium (in Rs), dtype: float64
gender_data.plot()
# plotting a bar graph
age_data
age
20.0 26342.073517
21.0 29751.791916
22.0 29946.848634
23.0 29838.344763
24.0 30125.557096
...
81.0 29287.910702
82.0 36480.586199
83.0 28995.818172
84.0 35440.818182
85.0 26637.454525
Name: annual premium (in Rs), Length: 67, dtype: float64
# showing the results on a linear graph
age_data.plot(xlabel="age",ylabel="annual premium(in Rs)",title="Age vs Average annual premium")
age_data.plot.bar()
master_data.groupby('Gender').count()
        customer_id     age  driving licence present  region code  previously insured  vehicle age  vehicle-damage-No  vehicle-damage-Yes  annual premium (in Rs)  sales channel code  vintage  responce
Gender
Female       174485  174485                   174485       174485              174485       174309             174485              174485                  174485              174485   174485    174485
Male         205484  205484                   205484       205484              205484       205279             205484              205484                  205484              205484   205484    205484
• Summary
• The ratio of male to female customers is approximately 1 (205484 / 174485, about 1.18).
The two genders are represented in roughly equal numbers, so the data is balanced with respect to Gender.
vehicle_age
vehicle age
1-2 Year 29099.066738
< 1 Year 29188.150594
> 2 Years 32943.540830
Name: annual premium (in Rs), dtype: float64
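The per-group averages shown above (gender_data, age_data, vehicle_age) have the shape of a groupby followed by a mean on annual premium (in Rs). The code that produced them is not shown, so the following is only a sketch of that pattern on a hypothetical toy frame with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for master_data
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "annual premium (in Rs)": [40454.0, 33536.0, 27496.0, 30000.0],
})

# Average annual premium for each gender
avg_by_gender = toy.groupby("Gender")["annual premium (in Rs)"].mean()
print(avg_by_gender)
```

Replacing "Gender" with "age" or "vehicle age" gives the corresponding per-age and per-vehicle-age averages.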
The Pearson Correlation measures the linear dependence between two variables X and Y.
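As an illustration, the coefficient can be computed with pandas' Series.corr. The sketch below uses hypothetical toy columns chosen to be perfectly linear, so the coefficient reaches its extreme values of +1 and -1. The column pair is an assumption made for the example only.

```python
import pandas as pd

# Hypothetical toy data: premium_up rises linearly with age, premium_down falls
toy = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0, 60.0]})
toy["premium_up"] = 2.0 * toy["age"] + 1.0
toy["premium_down"] = -3.0 * toy["age"] + 500.0

# Pearson coefficient: +1 is a perfect increasing line, -1 a perfect decreasing line
r_up = toy["age"].corr(toy["premium_up"], method="pearson")
r_down = toy["age"].corr(toy["premium_down"], method="pearson")
print(r_up, r_down)
```

Values close to 0 indicate little linear dependence. Note that the coefficient captures only linear relationships.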
Summary
Since the Pearson coefficient lies between -0.5 and 0.5, there is no strong linear relation between