Big Data Hadoop and Spark Developer: Certification Project


Car Insurance Analysis

Objective: To use Hadoop features for data engineering and analysis of
car insurance data, and to share patterns and actionable insights

Problem Statement:
A car insurance company wants to look at its historical data to understand
and predict the probability of a customer making a claim based on multiple
features other than MVR_POINTS. The data set comprises 10K+ submitted
claim records with 14+ features. The scope of this project is limited to
data engineering and analysis.

Steps to perform:
1) Create the following Hive table as per the schema provided below:
TABLE: <your own database>.CAR_INSURANCE

Column Name         Datatype
Policyid            BIGINT
Age                 INT
Cust_age_category   STRING
Income              INT
Parent1             STRING
Home_value          INT
Married             STRING
Gender              STRING
Education           STRING
Occupation          STRING
Travel_time         INT
Car_usage           STRING
IDV                 INT
Car_type            STRING
Red_car             STRING
Old_claim           INT
Claim_freq          INT
Mvr_points          INT
Claim_amount        INT
Car_age             INT
Car_age_category    STRING
Claim_status        INT
Urbanicity          STRING
Solution:
create table rs_bdhs.car_insurance(
  policyid bigint, age int, cust_age_category string, income int,
  parent1 string, home_value int, married string, gender string,
  education string, occupation string, travel_time int, car_usage string,
  IDV int, car_type string, red_car string, old_claim int, claim_freq int,
  mvr_points int, claim_amount int, car_age int, car_age_category string,
  claim_status int, urbanicity string);
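
If the rs_bdhs database does not exist yet, it must be created first; the
schema can then be verified. A minimal sketch (rs_bdhs is the database name
used throughout this solution):

create database if not exists rs_bdhs;
describe rs_bdhs.car_insurance;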

2) Create a data pipeline using Sqoop and Flume to pull the table data
below from the MySQL server into an HBase table with one column family.
MySQL DATABASE NAME: BDHS_PROJECT
Table: Car_insurance
Solution:
Step 1: Use Sqoop to pull data from the MySQL table to the local file system.
sqoop import -fs local -jt local \
  --connect jdbc:mysql://ip-10-0-1-10.ec2.internal/bdhs_project \
  --username labuser --password simplilearn \
  --table=car_insurance \
  --target-dir=file:///home/rakeshsrivastva75_gmail/car_insurance -m 1

Step 2: Create a table in HBase for storing car_insurance data.


create_namespace 'rs_bdhs'
create 'rs_bdhs:hbase_car_insurance','car_ins_details'

Step 3: Run the Flume agent to read from the Sqoop import directory and
insert into the HBase table created in Step 2 above.
Agent configuration below:
car_agent.sources = dirSource
car_agent.sources.dirSource.type = spooldir
car_agent.sources.dirSource.spoolDir = /home/rakeshsrivastva75_gmail/car_insurance

car_agent.channels = fileChannel
car_agent.channels.fileChannel.type = file

car_agent.sinks = hbaseSink
car_agent.sinks.hbaseSink.type = hbase
car_agent.sinks.hbaseSink.table = rs_bdhs:hbase_car_insurance
car_agent.sinks.hbaseSink.columnFamily = car_ins_details
car_agent.sinks.hbaseSink.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer

car_agent.sources.dirSource.channels = fileChannel
car_agent.sinks.hbaseSink.channel = fileChannel

Step 4: Run the agent.

flume-ng agent --conf-file car_insurance_flume_agent.conf --name car_agent

3) Create a staging Hive table pointing to the HBase table.


Solution:
Step 1: Create a table in Hive mapping to the table in HBase. Note that
hbase.columns.mapping needs one entry per Hive column, including :key for
the HBase row key.
CREATE EXTERNAL TABLE car_insurance_staging(key string, records string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,car_ins_details:pCol")
TBLPROPERTIES ("hbase.table.name" = "rs_bdhs:hbase_car_insurance",
  "hbase.mapred.output.outputtable" = "rs_bdhs:hbase_car_insurance");

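Before moving on, it is worth sampling the staging table to confirm the
mapping works; a minimal check, assuming the Flume pipeline from step 2 has
already populated the HBase table (key holds the generated HBase row key,
records holds the full CSV line written by SimpleHbaseEventSerializer into
pCol):

select key, records from car_insurance_staging limit 5;
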
Step 2: Create a new temp table in Hive matching the structure of the
imported car insurance data.
create table rs_bdhs.car_insurance_temp(
  policy_id bigint, DOB string, age int, income string, parent1 string,
  home_value string, married string, gender string, education string,
  occupation string, travel_time int, car_usage string, IDV string,
  car_type string, red_car string, old_claim string, claim_freq int,
  MVR_points int, claim_amount string, car_age int, claim_status int,
  urbanicity string);

Step 3: Change the field delimiter of this temp table to ','.

ALTER TABLE rs_bdhs.car_insurance_temp SET SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ',');
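
To confirm the SerDe change took effect, the table's storage information
can be inspected (the field.delim property should now show ','):

describe formatted rs_bdhs.car_insurance_temp;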

Step 4: Read the data from the car_insurance_staging table and create a
data file under the Hive warehouse folder of the table created in Step 3
above.
INSERT OVERWRITE DIRECTORY
'/user/hive/warehouse/rs_bdhs.db/car_insurance_temp'
STORED AS TEXTFILE
select records from car_insurance_staging;
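
A simple row count on the temp table verifies that the export landed and
the rows are readable; a minimal check (the exact count depends on the
imported data):

select count(*) from rs_bdhs.car_insurance_temp;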
4) Perform the following transformations before loading the data from the
staging table into the main table:
i. Filter out the records where age is missing.
ii. Convert $-prefixed values to integer values.
iii. Remove the additional characters (z_) from all fields
wherever present.
iv. Perform feature engineering on the age field and add a new
field (cust_age_category) to the main Hive table as shown
below:
1. 16 to 18: Learner
2. 19 to 40: Young
3. 41 to 60: mid-age
4. 60+: old-age
v. Perform feature engineering on the car_age field and add
a new field (car_age_category) to the main Hive table as
shown below:
1. 0 to 5: New
2. 6 to 12: used
3. 13 to 20: old
4. 21+: vintage
Solution (i, ii, iii):
create table car_insurance_clean as
select policy_id, DOB, age, income, parent1,
  trim(regexp_replace(home_value, "\\$", "")) home_value,
  trim(regexp_replace(married, "z_", "")) married,
  trim(regexp_replace(gender, "z_", "")) gender,
  trim(regexp_replace(education, "z_", "")) education,
  trim(regexp_replace(occupation, "z_", "")) occupation,
  travel_time, car_usage,
  trim(regexp_replace(IDV, "\\$", "")) IDV,
  trim(regexp_replace(car_type, "z_", "")) car_type,
  red_car,
  trim(regexp_replace(old_claim, "\\$", "")) old_claim,
  claim_freq, MVR_points,
  trim(regexp_replace(claim_amount, "\\$", "")) claim_amount,
  car_age, claim_status,
  trim(regexp_replace(urbanicity, "z_", "")) urbanicity
from car_insurance_temp
where age > 0;

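To illustrate what the cleanup expressions above do, the standalone query
below (illustrative literal values, not taken from the data set) strips the
z_ marker and the $ sign; note that '\\$' escapes the $ so it is matched
literally rather than as the regex end-of-line anchor:

select trim(regexp_replace('z_High School', 'z_', '')) as education_clean,
       trim(regexp_replace('$4500', '\\$', '')) as claim_amount_clean;
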
Solution (iv, v):
use rs_bdhs;
insert into car_insurance
select policy_id, age,
  case when age between 16 and 18 then "Learner"
       when age between 19 and 40 then "young"
       when age between 41 and 60 then "mid-age"
       when age > 60 then "old-age"
       else "invalid age" end as cust_age_category,
  income, parent1, home_value, married, gender, education, occupation,
  travel_time, car_usage, IDV, car_type, red_car, old_claim, claim_freq,
  MVR_points, claim_amount, car_age,
  case when car_age between 0 and 5 then "New"
       when car_age between 6 and 12 then "used"
       when car_age between 13 and 20 then "old"
       else "vintage" end as car_age_category,
  claim_status, urbanicity
from car_insurance_clean
where car_age >= 0;
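
A quick sanity check (optional, not part of the required steps) confirms
that every row landed in one of the new categories:

select cust_age_category, car_age_category, count(*) as cnt
from car_insurance
group by cust_age_category, car_age_category
order by cnt desc;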
5) Perform data analysis using Hive to display the repeat-claim customer
groups, based on age category, marital status, gender, and urbanicity,
falling in the top 75%.
Solution:
select cust_age_category, married, gender, urbanicity,
  cume_dist() over (order by claim_freq_count) as claim_dist
from (select cust_age_category, married, gender, urbanicity,
        count(claim_freq) claim_freq_count
      from car_insurance
      group by cust_age_category, married, gender, urbanicity
      having count(claim_freq) > 1) x
order by claim_dist desc;

Output:
age_category  married  gender  urbanicity           claim_dist
mid-age       Yes      F       Highly Urban; Urban  1.0
mid-age       Yes      M       Highly Urban; Urban  0.9615384615384616
mid-age       No       M       Highly Urban; Urban  0.9230769230769231
mid-age       No       F       Highly Urban; Urban  0.8846153846153846
young         Yes      F       Highly Urban; Urban  0.8461538461538461
young         No       F       Highly Urban; Urban  0.8076923076923077
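
The query above ranks all repeat-claim groups by cume_dist; to return only
the groups falling in the top 75%, one option (a sketch, interpreting "top
75%" as claim_dist >= 0.25) is to wrap the ranking query and filter:

select * from (
  select cust_age_category, married, gender, urbanicity,
    cume_dist() over (order by claim_freq_count) as claim_dist
  from (select cust_age_category, married, gender, urbanicity,
          count(claim_freq) claim_freq_count
        from car_insurance
        group by cust_age_category, married, gender, urbanicity
        having count(claim_freq) > 1) x) y
where claim_dist >= 0.25
order by claim_dist desc;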

6) Find the relation between claim frequency, car age, and car usage.


Solution:
select car_age_category, car_usage, count(claim_freq) as claim_freq
from car_insurance
where claim_freq > 0
group by car_age_category, car_usage
order by claim_freq desc;
Output:
+-------------------+-------------+-------------+--+
| car_age_category | car_usage | claim_freq |
+-------------------+-------------+-------------+--+
| Used | Private | 890 |
| New | Private | 734 |
| Used | Commercial | 711 |
| New | Commercial | 539 |
| Old | Private | 519 |
| Old | Commercial | 307 |
| Vintage | Private | 35 |
| Vintage | Commercial | 19 |
+-------------------+-------------+-------------+--+

7) Find the top two car types with the maximum claim amounts.
Solution:
select car_type, sum(claim_amount) claim_amt from car_insurance
group by car_type order by claim_amt desc limit 2;
Output:
+--------------+------------+--+
| car_type | claim_amt |
+--------------+------------+--+
| SUV | 4287058 |
| Pickup | 2830903 |
+--------------+------------+--+
2 rows selected (32.723 seconds)

8) Display the average rate of claim approval for repeat claims vs. new claims.
Solution:
select
  new_claim_app.app_count / (new_claim_app.app_count + new_claim_rej.rej_count)
    as new_claim_app_rate,
  repeat_claim_app.app_count / (repeat_claim_app.app_count + repeat_claim_rej.rej_count)
    as repeat_claim_app_rate
from
  (select count(1) as app_count from car_insurance
   where claim_freq = 1 and claim_status = 1) new_claim_app,
  (select count(1) as rej_count from car_insurance
   where claim_freq = 1 and claim_status = 0) new_claim_rej,
  (select count(1) as app_count from car_insurance
   where claim_freq > 1 and claim_status = 1) repeat_claim_app,
  (select count(1) as rej_count from car_insurance
   where claim_freq > 1 and claim_status = 0) repeat_claim_rej;
Output:
new_claim_app_rate repeat_claim_app_rate
0.3951137320977254 0.41137514608492404
Time taken: 121.518 seconds, Fetched: 1 row(s)
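
An equivalent, more compact formulation (a sketch using conditional
aggregation; it assumes claim_status is 1 for approved and 0 for rejected,
as in the query above) avoids the four-way cross join, since
avg(claim_status) over a subset is exactly that subset's approval rate:

select avg(case when claim_freq = 1 then claim_status end) as new_claim_app_rate,
       avg(case when claim_freq > 1 then claim_status end) as repeat_claim_app_rate
from car_insurance;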
