
Predictive modelling is a statistical technique to predict future behaviour. Predictive-modelling solutions are a form of data-mining technology that works by analyzing historical and current data and generating a model to help predict future outcomes.

PREDICTIVE MODELLING PROJECT

Created by Pranjal Singh


PGP-DSBA Online
05/02/2022

Table of Contents

Problem 1
Executive Summary
Introduction
Data Description
1.1 Read the data and do exploratory data analysis. Describe the data briefly. Perform Univariate, Bivariate and Multivariate Analysis
    Sample
    EDA
    Five-Point Summary
    Univariate Analysis
    Bivariate Analysis
    Multivariate Analysis
1.2 Impute null values if present; also check for the values which are equal to zero
1.3 Encode the data (having string values) for Modelling
1.4 Inference: Based on these predictions, what are the business insights and recommendations

Problem 2
Introduction
Data Description
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics
    Sample
    Descriptive Statistics
    Univariate Analysis
    Bivariate Analysis
    Multivariate Analysis
2.2 Do not scale the data. Encode the data (having string values) for Modelling
2.3 Performance Metrics
2.4 Inference: Based on these predictions, what are the insights and recommendations

List of Figures

Problem 1
Fig. 1 - runqsz Countplot
Fig. 2 - scall Histplot
Fig. 3 - scall Boxplot
Fig. 4 - usr Histplot
Fig. 5 - usr Boxplot
Fig. 6 - usr v/s scall Scatterplot
Fig. 7 - runqsz v/s scall Boxplot
Fig. 8 - runqsz v/s sread Boxplot
Fig. 9 - runqsz v/s swrite Boxplot
Fig. 10 - Correlation Heatmap
Fig. 11 - usr v/s fork Scatterplot
Fig. 12 - usr v/s exec Scatterplot
Fig. 13 - rchar Histplot
Fig. 14 - rchar Boxplot
Fig. 15 - wchar Histplot
Fig. 16 - wchar Boxplot
Fig. 17 - With Outliers Boxplot
Fig. 18 - After Outlier Removal Boxplot

Problem 2
Fig. 19 - Wife_age Histplot
Fig. 20 - Wife_age Boxplot
Fig. 21 - No_of_children_born Histplot
Fig. 22 - No_of_children_born Boxplot
Fig. 23 - With Outliers Boxplot
Fig. 24 - Histplots & Boxplots
Fig. 25 - Countplots
Fig. 26 - Correlation Heatmap
Fig. 27 - Scatterplot with Contraceptive_method_used as hue
Fig. 28 - ROC Curve of Training Data (Logistic Regression)
Fig. 29 - ROC Curve of Testing Data (Logistic Regression)
Fig. 30 - Confusion Matrix of Training Data (Logistic Regression)
Fig. 31 - Confusion Matrix of Testing Data (Logistic Regression)
Fig. 32 - ROC Curve of Training & Test Data (LDA)
Fig. 33 - Confusion Matrix of Training & Test Data (LDA)
Fig. 34 - ROC Curve of Training Data (CART)
Fig. 35 - ROC Curve of Testing Data (CART)
Fig. 36 - Confusion Matrix of Training & Test Data (CART)

List of Tables

Problem 1
Table 1: Dataset Sample
Table 2: Five-Point Summary of the Data
Table 3: Categorical Column Summary
Table 4: Null Values Count
Table 5: After Encoding Data Sample
Table 6: Model Summary using StatsModels on Training Data
Table 7: VIF Summary
Table 8: VIF Summary after Removing Multicollinear Columns
Table 9: Final Model Summary on Training Data
Table 10: Final Model Summary on Testing Data
Table 11: Model Comparison Summary

Problem 2
Table 12: Dataset Sample
Table 13: Descriptive Summary
Table 14: Null Values Summary
Table 15: After Encoding Data Sample
Table 16: Model Comparison Summary

Datasets Used
Dataset for Problem 1: compactiv.xlsx
Dataset for Problem 2: Contraceptive_method_dataset.xlsx

Problem 1 Linear Regression

Executive Summary
The comp-activ database is a collection of computer system activity measures.
The data was collected from a Sun SPARCstation 20/712 with 128 MB of memory, running in a multi-user
university department. Users would typically be doing a large variety of tasks, ranging from accessing the
internet and editing files to running very CPU-bound programs.

Introduction
The purpose of this exercise is to explore the dataset, find a linear equation to build a model
to predict 'usr' (the portion of time (%) that CPUs run in user mode), and determine how each attribute affects
the time the system spends in 'usr' mode, using a list of system attributes.

Data Description
System measures used:
lread - Reads (transfers per second) between system memory and user memory
lwrite - Writes (transfers per second) between system memory and user memory
scall - Number of system calls of all types per second
sread - Number of system read calls per second
swrite - Number of system write calls per second
fork - Number of system fork calls per second
exec - Number of system exec calls per second
rchar - Number of characters transferred per second by system read calls
wchar - Number of characters transferred per second by system write calls
pgout - Number of page-out requests per second
ppgout - Number of pages paged out per second
pgfree - Number of pages per second placed on the free list
pgscan - Number of pages checked per second for whether they can be freed
atch - Number of page attaches (satisfying a page fault by reclaiming a page in memory) per second
pgin - Number of page-in requests per second
ppgin - Number of pages paged in per second
pflt - Number of page faults caused by protection errors (copy-on-writes)
vflt - Number of page faults caused by address translation
runqsz - Process run queue size (the number of kernel threads in memory that are waiting for a CPU to run; typically this value should be less than 2, and consistently higher values mean that the system might be CPU-bound)
freemem - Number of memory pages available to user processes
freeswap - Number of disk blocks available for page swapping
------------------------
usr - Portion of time (%) that cpus run in user mode

1.1 Read the data and do exploratory data analysis. Describe the data
briefly. (Check the Data types, shape, EDA, 5 point summary). Perform
Univariate, Bivariate Analysis, Multivariate Analysis.

Sample of the dataset:

Table 1 Dataset Sample

The dataset has 22 variables; the runqsz variable records 2 different types of process runs: CPU_Bound and Not_CPU_Bound.

Exploratory Data Analysis


Let us check the types of variables in the data frame.

There are 8192 rows and 22 columns in the dataset. Of the 22 columns, 13 are of float type, 8 are of
integer type, and the remaining 1 (runqsz) is of object type.
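As an illustration, this initial check can be reproduced with pandas along the following lines (a minimal sketch; the file name is taken from the Datasets Used section, and a single flat sheet layout is assumed):

```python
import pandas as pd

# Load the Problem 1 dataset (file name from the "Datasets Used" section)
df = pd.read_excel("compactiv.xlsx")

print(df.shape)                   # expected: (8192, 22)
print(df.dtypes.value_counts())   # 13 float, 8 int, 1 object (runqsz)
```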

Five-Point Summary

Table 2 Five Point Summary of the Data

It can be seen that there are some blank values in the rchar and wchar columns, as their counts are less
than the total number of rows in the dataset. Also, as per the above summary, the variables are on
different scales, so the data may need to be scaled before performing the linear regression steps.

Univariate Analysis

Fig 1 runqsz Countplot Table 3 Categorical Column Summary

Not CPU Bound process runs (runqsz) are more frequent than CPU Bound runs.

Fig 2 scall Histplot Fig 3 scall Boxplot

The number of system calls (scall) variable is heavily right-skewed and has many outliers.

Fig 4 usr Histplot Fig 5 usr Boxplot

The portion of time the CPU runs in user mode (usr) is highly left-skewed and has a few outliers in the lower range.

Bivariate Analysis

Fig 6 usr v/s scall Scatterplot

From the above plot we can infer that the number of system calls per second is slightly lower when the portion of time
the CPU runs in user mode is high; a negative correlation can be seen between these two variables.

Fig 7 runqsz v/s scall Boxplot Fig 8 runqsz v/s sread Boxplot Fig 9 runqsz v/s swrite Boxplot

We can see that the median value of scall for Not CPU Bound process runs is slightly lower than for CPU Bound process
runs.

There is no noticeable difference in median sread or median swrite between CPU Bound and Not CPU Bound process runs.

Correlation Heatmap

Fig 10 Correlation Heatmap

We observe that some columns, such as pgfree, pgscan, ppgin, ppgout, pflt and vflt, are highly correlated with each
other, with correlations of approximately 0.92.

The usr column shows a negative correlation with all the other attributes, indicating that the higher the portion of
time (%) the CPUs run in user mode, the lower the system activity counts (calls or pages).

Multivariate Analysis

Fig 11 usr v/s fork Scatterplot Fig 12 usr v/s exec Scatterplot

We see a slightly higher number of Not_CPU_Bound process runs with a lower number of system fork calls
when the portion of time (%) the CPU runs in user mode is between 50 and 100.
A similar trend can be seen in Fig 12 between the usr and exec attributes.

1.2 Impute null values if present, also check for the values which are
equal to zero. Do they have any meaning or do we need to change them
or drop them? Check for the possibility of creating new features if
required. Also check for outliers and duplicates if there.
There are 104 and 15 null values in the rchar and wchar columns respectively, as shown below:

Table 4 Null values Count

Let's check which statistic is appropriate for imputing the null values:

Fig 13 rchar Histplot Fig 14 rchar Boxplot



Fig 15 wchar Histplot Fig 16 wchar Boxplot


From the above descriptive statistics and plots, we see that the rchar and wchar fields are right-skewed and
have outliers. Hence, the median is the right statistic to impute.
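A minimal sketch of this imputation, assuming the data is in the DataFrame df from above:

```python
# Median is robust to the right skew and outliers seen above
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].median())

print(df[["rchar", "wchar"]].isnull().sum())  # both counts should now be 0
```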

Also, there are no values equal to zero in the two columns mentioned above. There are values equal to zero
in other columns, but I would refrain from changing or dropping them, as they are part of the raw data and
there is no indication that they are anomalies or bad data.

There are no duplicates in the dataset.

Let's check for outliers and treat them if any:

There are outliers in almost all the variables. As there are outliers among the lower values as well, let's treat
them using the IQR method.
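A sketch of the IQR capping described above (whether the target usr was also treated is not stated in the report; here it is assumed to be left untouched):

```python
# Cap each predictor to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
num_cols = df.select_dtypes(include="number").columns.drop("usr")
for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```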

Fig 17 With Outliers Boxplot Fig 18 After Outlier Removal Boxplot



1.3 Encode the data (having string values) for Modelling. Split the data
into train and test (70:30). Apply Linear regression using scikit learn.
Perform checks for significant variables using appropriate method from
statsmodel. Create multiple models and check the performance of
Predictions on Train and Test sets using Rsquare, RMSE & Adj Rsquare.
Compare these models and select the best one with appropriate
reasoning.
We will now encode the runqsz column, the only categorical field in the data, using the
get_dummies function; the output is shown below:
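A one-line sketch of this step; drop_first=True keeps a single dummy column, which matches the runqsz_Not_CPU_Bound variable used later in the regression equation:

```python
# 'CPU_Bound' is dropped as the reference level, leaving runqsz_Not_CPU_Bound
df = pd.get_dummies(df, columns=["runqsz"], drop_first=True)
```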

Table 5 After Encoding Data Sample

Now, to split the data into training and test sets, we first separate the target and predictor variables into
two dataframes, X and y, where X contains all the predictor variables and y contains the target variable,
which is "usr" in this dataset.
We then split the data 70:30 using train_test_split from the sklearn library.
Let's check for significant variables using methods from statsmodels; for this we fit the model, whose
results are below:
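A sketch of the split and the statsmodels fit (the random_state is an assumption; the report does not state the seed used):

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

X = df.drop("usr", axis=1)   # predictors
y = df["usr"]                # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# statsmodels OLS needs an explicit intercept column;
# astype(float) guards against boolean dummy columns in newer pandas
olsmod = sm.OLS(y_train, sm.add_constant(X_train).astype(float)).fit()
print(olsmod.summary())      # coefficients, p-values, R-squared, adj. R-squared
```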

Table 6 Model Summary using StatsModel on Training Data

Interpretation of R-squared
 The R-squared value tells us that our model can explain 79.6% of the variance in the training set.

Interpretation of Coefficients
 The coefficients tell us how one unit change in X can affect y.
 The sign of the coefficient indicates if the relationship is positive or negative.
 In this dataset, for example, a one-unit increase in reads between system and user memory (lread) is
associated with a 0.0635 decrease in the portion of time (%) that CPUs run in user mode, holding the
other predictors constant.
 Earlier we saw that the relationship of usr with fork and exec is almost the same (as usr increases,
each variable decreases, and vice versa). This suggests that both factors have a similar effect on usr,
i.e., an increase in either decreases usr. Therefore, the signs of their coefficients should be the same.
But we observe that they are not, which indicates the presence of multicollinearity in our data.
 Multicollinearity occurs when predictor variables in a regression model are correlated. This
correlation is a problem because predictor variables should be independent. If the collinearity
between variables is high, we might not be able to trust the p-values to identify independent
variables that are statistically significant.
 When we have multicollinearity in the linear model, the coefficients that the model suggests are
unreliable.

Interpretation of p-values (P > |t|)


 For each predictor variable there is a null hypothesis and an alternate hypothesis.
- Null hypothesis: the predictor variable is not significant
- Alternate hypothesis: the predictor variable is significant
 (P > |t|) gives the p-value for each predictor variable, used to test the null hypothesis.
 If the level of significance is set to 5% (0.05), p-values greater than 0.05 indicate that the
corresponding predictor variables are not significant.
 However, due to the presence of multicollinearity in our data, the p-values will also be distorted.
 We need to ensure that there is no multicollinearity before interpreting the p-values.

Let's check for multicollinearity using VIF (Variance Inflation Factor), for which the general rule of thumb is as below:

 If VIF is 1, then there is no correlation among the kth predictor and the remaining predictor
variables, and hence, the variance of βk is not inflated at all.

 If VIF exceeds 5, we say there is moderate multicollinearity, and if it is 10 or above, there are signs of
high multicollinearity.
 The purpose of the analysis should dictate which threshold to use.
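VIF values like those in Table 7 can be computed with statsmodels, roughly as follows (a sketch; it assumes the encoded training predictors X_train from the split above):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Include the constant so the VIFs match a model fitted with an intercept
X_c = sm.add_constant(X_train).astype(float)
vif = pd.Series(
    [variance_inflation_factor(X_c.values, i) for i in range(X_c.shape[1])],
    index=X_c.columns,
)
print(vif.sort_values(ascending=False))
```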

Table 7 VIF Summary

 The VIF values indicate that the features lread, sread, swrite are moderately correlated and
features like fork, pgout, ppgout, pgfree, pgin, ppgin, pflt and vflt are highly correlated with one or
more independent features.
 Multicollinearity affects only the specific independent variables that are correlated. Therefore, in
this case, we can trust the p-values of lwrite, scall, exec, rchar, wchar, atch, freemem, freeswap and
runqsz variables.
 To treat high multicollinearity, we will have to drop one or more of the correlated features (fork,
pgout, ppgout, pgfree, pgin, ppgin, pflt and vflt)
 We will drop the variable that has the least impact on the adjusted R-squared of the model.

Steps :
 Let's remove/drop multicollinear columns one by one and observe the effect on our predictive
model.
 On dropping ppgout, pgfree, ppgin, vflt, pgin, fork and sread, the adj. R-squared decreased by 0.001,
and it remained the same after dropping the lread column.
 Since dropping these columns has only a very small effect (0.001) or no effect on the adj. R-squared,
we can remove them from the training set.
 After dropping these columns, the VIF for all features is below 4, which is acceptable as per the rule
of thumb.

Table 8 VIF Summary after Removing Multi-collinear columns

 Now that we do not have multicollinearity in our data, the p-values of the coefficients have become
reliable and we can remove the non-significant predictor variables.

Building Model on Training Data

Table 9 Final Model Summary on Train Data

 As observed in the above model (olsres_9), there is no variable whose p-value is greater than 0.05.
So, we can conclude that all the above variables are significant in predicting usr.
 After dropping the features causing strong multicollinearity and the statistically insignificant ones,
our model performance hasn't dropped sharply (adj. R-squared has dropped from 0.795 to 0.783).
This shows that those variables did not have much predictive power.

Building Model on Testing Data



Table 10 Final Model Summary on Testing Data

We now have the below 2 models; let's analyse their prediction performance on the train and test sets using
the R-squared, RMSE and adjusted R-squared metrics:

Model   | No. of Predictors | No. of Target Variables | R-Squared                    | Adjusted R-Squared           | RMSE                         | Constant
Model 1 | 21                | 1                       | 0.796 (Train) / 0.767 (Test) | 0.795 (Train) / 0.765 (Test) | 4.419 (Train) / 4.652 (Test) | 84.12
Model 2 | 11                | 1                       | 0.784 (Train) / 0.773 (Test) | 0.783 (Train) / 0.771 (Test) | 4.550 (Train) / 4.652 (Test) | 83.17

Table 11 Model Comparison Summary
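These metrics can be computed along the following lines (a sketch; olsmod stands for whichever fitted model is being evaluated):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

pred = olsmod.predict(sm.add_constant(X_test).astype(float))

r2 = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
n, k = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # adjusted R-squared
print(f"R2={r2:.3f}  Adj.R2={adj_r2:.3f}  RMSE={rmse:.3f}")
```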

Model 2 is preferable to Model 1 as it has fewer predictors, a simpler linear equation, and is more viable
because its variables are not collinear with each other and are statistically significant. Also, the R-squared
difference between train and test is smaller for Model 2 than for Model 1, which indicates that the model is
better suited to predicting usr and is neither overfitted nor underfitted. The RMSE for the test set is almost
the same as Model 1's, and the test R-squared is highest for Model 2 (0.773). Hence, I would prefer the
second model over the first.

The final Linear Regression equation is:

usr = b0 + b1*lwrite + b2*scall + b3*swrite + b4*exec + b5*rchar + b6*wchar + b7*atch + b8*pflt + b9*freemem + b10*freeswap + b11*runqsz_Not_CPU_Bound

usr = 83.17 + (-0.033)*lwrite + (-0.000)*scall + (-0.006)*swrite + (-0.457)*exec + (-6.388e-06)*rchar + (-5.896e-06)*wchar + (-0.307)*atch + (-0.041)*pflt + (-0.000)*freemem + (9.347e-06)*freeswap + (1.541)*runqsz_Not_CPU_Bound

When freeswap increases by 1 unit, usr increases by 9.347e-06 units, keeping all other predictors
constant. When runqsz_Not_CPU_Bound changes from 0 to 1 (i.e., the process run is Not CPU Bound), usr
increases by 1.541 units, keeping all other predictors constant, and so on.

Almost all the variables other than freeswap and runqsz_Not_CPU_Bound have negative coefficients;
for instance, lwrite has a coefficient of -0.033.

1.4 Inference: Based on these predictions, what are the business insights
and recommendations.

Summing up all the above steps as below:


 Analysed the dataset thoroughly through EDA to understand the different variables and their relationships
with each other; pre-processed the data, as there were outliers and missing values, and checked for the
scope of creating a new feature.
 Encoded the categorical variable, split the data into train and test sets (70:30), and created a model using
all the variables.
 Checked and analysed the coefficients for multicollinearity using the VIF method and dropped attributes
that had little or no effect on the model.
 Created the model again with fewer variables and analysed it.
 Compared both models on the basis of the simplicity of the linear equation, the significance of the
variables, and the R-squared, adjusted R-squared and RMSE scores, and decided to go with Model 2.
 Quoted the linear equation based on the above decision.

Based on the above predictions and the linear model, the following insights can be drawn:
 The greater the number of disk blocks available for page swapping (freeswap), the more sharply the
portion of time (%) that CPUs run in user mode goes up.
 When there are more Not CPU Bound process runs, the portion of time that CPUs run in user mode
also increases.
 There is a drastic decrease in the portion of time that CPUs run in user mode when the number of
characters transferred per second by system read or write calls is high.

END OF PROBLEM 1

Problem 2 Logistic Regression, LDA and CART

Introduction
You are a statistician at the Republic of Indonesia Ministry of Health, and you are provided with data on
1473 females collected from a Contraceptive Prevalence Survey. The samples are married women who
were either not pregnant or did not know if they were pregnant at the time of the survey.
The problem is to predict whether or not they use a contraceptive method of choice, based on their
demographic and socio-economic characteristics.

Data Description
Variables used:
1. Wife's age (numerical)
2. Wife's education (categorical) 1=uneducated, 2, 3, 4=tertiary
3. Husband's education (categorical) 1=uneducated, 2, 3, 4=tertiary
4. Number of children ever born (numerical)
5. Wife's religion (binary) Non-Scientology, Scientology
6. Wife's now working? (binary) Yes, No
7. Husband's occupation (categorical) 1, 2, 3, 4(random)
8. Standard-of-living index (categorical) 1=very low, 2, 3, 4=high
9. Media exposure (binary) Good, Not good
10. Contraceptive method used (class attribute) No, Yes

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, check for duplicates and outliers and write an
inference on it. Perform Univariate and Bivariate Analysis and Multivariate
Analysis.

Sample of the dataset:

Table 12 Dataset Sample

Descriptive Statistics:

Let us check the types of variables in the data frame.



There are 1473 rows and 10 columns in the dataset. Of the 10 columns, 2 are of float type, 1 is of integer
type, and the remaining 7 are of object type. The categorical variables are not in encoded format.

Table 13 Descriptive Summary

The average age of the females in this dataset is 32.6 years, with a minimum of 16 and a maximum of 49.
There are indications of some null/missing values in the Wife_age and No_of_children_born variables,
which we will check separately. Most of the women have a high standard-of-living index. Also, as per the
summary, most of the females use contraceptive methods.

Let us check for null values and treat them.

Table 14 Null Values Summary

There are 71 and 21 missing values in the Wife_age and No_of_children_born columns respectively.

Let's check which statistic is appropriate for imputing the null values:
As per the histplot and boxplot below, the Wife_age variable is evenly distributed; hence the mean is the
right statistic to impute the null values in this field.

Fig 19 Wife_age Histplot Fig 20 Wife_age Boxplot

Now let's check the other column:

Fig 21 No_of_children_born Histplot Fig 22 No_of_children_born Boxplot

As this variable is not evenly distributed and is right-skewed, the median is the best statistic to impute.

On checking for duplicates, we find 82 duplicate rows in the dataset after imputing the null values.
We will remove these duplicates, as they could make our model overfitted.
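A sketch of these two steps, assuming the Problem 2 data was loaded into a DataFrame df2 with the column names used in this report:

```python
# Mean for the roughly symmetric column, median for the right-skewed one
df2["Wife_age"] = df2["Wife_age"].fillna(df2["Wife_age"].mean())
df2["No_of_children_born"] = df2["No_of_children_born"].fillna(
    df2["No_of_children_born"].median())

# Drop the duplicate rows that surface after imputation
df2 = df2.drop_duplicates()
```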

Let's check for outliers and treat them if any:

There are a few outliers, only in the No_of_children_born variable, but I choose not to treat them, as
these are true outliers and represent natural variation in the population.

Fig 23 With Outliers Boxplot

Univariate Analysis
Before doing univariate analysis, let's convert the Husband_occupation variable to the object datatype, as it
is a categorical column but is stored as an integer in the dataset.

Fig 24 Histplots & Boxplots


In the above graphs we can see that the Wife_age variable is almost evenly distributed, approximating a normal
distribution, while the No_of_children_born variable is right-skewed and has a few true outliers.

On analysing the categorical variables below, we can infer that the religion of the wives is mostly Scientology and
most of the wives are non-working. Also, the standard of living is very high for most of the wives, most of them are
exposed to media, and a large number of them prefer using contraceptive methods.

Fig 25 Countplots

Bivariate Analysis
Correlation Heatmap

Fig 26 Correlation Heatmap

As per the above heatmap, there is a moderate correlation of 0.53 between the two numeric variables,
which does not indicate a strong relationship.

Multivariate Analysis
As per the below scatterplot, we can infer that wives across all age groups use contraceptive methods in
large numbers, and no clear correlation can be seen between the numeric variables.

Fig 27 Scatterplot by Contraceptive_method_used as hue

2.2 Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis) and CART.
Let's encode the categorical columns below as per the instructions mentioned in the problem:

 Wife_education: encoded in an ordinal manner, where 1 indicates 'Uneducated', 2 indicates 'Primary',
3 indicates 'Secondary' and 4 indicates 'Tertiary'.
 Husband_education: also encoded in an ordinal manner, where 1 indicates 'Uneducated', 2 indicates
'Primary', 3 indicates 'Secondary' and 4 indicates 'Tertiary'.
 Standard_of_living_index: encoded in an ordinal manner as well, where 1 indicates 'Very Low',
2 indicates 'Low', 3 indicates 'High' and 4 indicates 'Very High'.
 Contraceptive_method_used: the target column is converted to numeric using the LabelEncoder
functionality in sklearn, applying the created LabelEncoder object to the target class so that 0 is
assigned to 'No' and 1 to 'Yes'.
 Wife_education, Husband_education, Standard_of_living_index and Husband_Occupation (already
encoded as 1, 2, 3, 4) are then converted to numeric data types to avoid encoding them twice when
the dummy function is used for the other binary categorical variables.
 The other binary categorical variables are encoded as dummy variables, as shown in the sketch below.

Table 15 After Encoding Data Sample
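A sketch of this encoding (the exact category strings in the raw file are assumptions based on the data description):

```python
from sklearn.preprocessing import LabelEncoder

edu_map = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}
df2["Wife_education"] = df2["Wife_education"].map(edu_map)
df2["Husband_education"] = df2["Husband_education"].map(edu_map)
df2["Standard_of_living_index"] = df2["Standard_of_living_index"].map(
    {"Very Low": 1, "Low": 2, "High": 3, "Very High": 4})

# Target: LabelEncoder assigns 0 to 'No' and 1 to 'Yes' (alphabetical order)
df2["Contraceptive_method_used"] = LabelEncoder().fit_transform(
    df2["Contraceptive_method_used"])

# Remaining binary categoricals become single dummy columns
df2 = pd.get_dummies(
    df2, columns=["Wife_religion", "Wife_Working", "Media_exposure"],
    drop_first=True)
```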



Now, to split the data into training and test sets, we first separate the target and predictor variables into
two dataframes, X and y, where X contains all the predictor variables and y contains the target variable,
which is "Contraceptive_method_used" in this dataset.
We then split the data 70:30 using train_test_split from the sklearn library.
We apply Logistic Regression, LDA and CART one by one, making some adjustments to the parameters of
the Logistic Regression class to get better accuracy, and perform predictions on the training and test sets.
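A sketch of the split and the three classifiers (the random_state and the logistic-regression solver settings are assumptions; the report only says some parameters were adjusted):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

X = df2.drop("Contraceptive_method_used", axis=1)
y = df2["Contraceptive_method_used"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

logreg = LogisticRegression(solver="newton-cg", max_iter=10000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
cart = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
```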

2.3 Performance Metrics: Check the performance of predictions on the train
and test sets using Accuracy and the Confusion Matrix, plot the ROC curve,
and get the ROC_AUC score for each model. Final Model: Compare the models
and write an inference on which model is best/optimized.
1. Let's first check the performance of predictions on the train and test sets for the Logistic Regression model:

 Accuracy Score on Training Data: 0.67
 Accuracy Score on Testing Data: 0.65
 ROC_AUC score on Training Data: 0.717
 ROC_AUC score on Testing Data: 0.717
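A sketch of how these numbers and the plots below can be produced for the logistic-regression model (the same pattern applies to lda and cart):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)

for name, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    proba = logreg.predict_proba(X_)[:, 1]    # P(class 1)
    pred = logreg.predict(X_)
    print(name, accuracy_score(y_, pred), roc_auc_score(y_, proba))
    print(confusion_matrix(y_, pred))
    fpr, tpr, _ = roc_curve(y_, proba)
    plt.plot(fpr, tpr, label=name)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```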

Fig 28 ROC Curve of Training Data Fig 29 ROC Curve of Testing Data

Fig 30 Confusion Matrix of Training Data Fig 31 Confusion Matrix of Testing Data

Inferences :
For predicting Contraceptive Method Used- No (Label 0)

Precision (64%) – 64% of the females predicted to not use contraceptive methods are actually not using any
contraceptive method.

Recall (46%) – Out of all the females actually not using contraceptive methods, 46% have been predicted
correctly.

For predicting Contraceptive Method Used- Yes (Label 1)

Precision (65%) – 65% of the females predicted to use contraceptive methods are actually using contraceptive
methods.

Recall (79%) – Out of all the females actually using contraceptive methods, 79% have been predicted
correctly.

(Note: Precision tells us how many of the total predicted positives are actually positive. Recall tells us how many
observations of the positive class are correctly predicted as positive.)

Overall accuracy of the model – 65% of total predictions are correct

Accuracy, AUC_ROC and precision for the test data are almost in line with the training data. This indicates that no
overfitting or underfitting has happened, and overall the model is a good model for classification.

2. Let's now check the performance of predictions on the train and test sets for Linear Discriminant
Analysis (LDA):

 Accuracy Score on Training Data: 0.67
 Accuracy Score on Testing Data: 0.64
 ROC_AUC score on Training Data: 0.716
 ROC_AUC score on Testing Data: 0.664

Fig 32 ROC Curve of Training & Test Data

Fig 33 Confusion Matrix of Training & Test Data


Inferences :
Linear Discriminant Function = -0.78 + (-0.07)*Wife_age + (0.51)*Wife_education + (0.03)*Husband_education + (0.31)*No_of_children_born + (0.17)*Husband_Occupation + (0.31)*Standard_of_living_index + (0.5)*Wife_religion_Scientology + (0.19)*Wife_Working_Yes + (-0.33)*Media_exposure_Not-Exposed

By the above equation and the coefficients it is clear that:

 the predictor 'Wife_education' has the largest coefficient magnitude (0.51) and thus contributes the
most to the classification
 the predictor 'Husband_education' has the smallest coefficient magnitude (0.03) and thus contributes
the least to the classification

For predicting Contraceptive Method Used- No (Label 0)

 Precision (64%) – 64% of the females predicted to not use contraceptive methods are actually not using
any contraceptive method.
 Recall (44%) – Out of all the females actually not using contraceptive methods, 44% have been predicted
correctly.

For predicting Contraceptive Method Used- Yes (Label 1)

 Precision (64%) – 64% of the females predicted to use contraceptive methods are actually using
contraceptive methods.
 Recall (80%) – Out of all the females actually using contraceptive methods, 80% have been predicted
correctly.

Overall accuracy of the model – 64% of total predictions are correct

Accuracy, AUC_ROC and precision for the test data are almost in line with the training data. This indicates that no
overfitting or underfitting has happened, and overall the model is a good model for classification.

3. Let's now check the performance of predictions on the train and test sets for CART:

 Accuracy Score on Training Data: 0.98
 Accuracy Score on Testing Data: 0.61
 ROC_AUC score on Training Data: 0.999
 ROC_AUC score on Testing Data: 0.596

Fig 34 ROC Curve of Training Data Fig 35 ROC Curve of Testing Data

Fig 36 Confusion Matrix of Training Data & Test Data


Inferences :

For predicting Contraceptive Method Used- No (Label 0)

 Precision (58%) – 58% of the females predicted to not use contraceptive methods are actually not using
any contraceptive method.
 Recall (58%) – Out of all the females actually not using contraceptive methods, 58% have been predicted
correctly.

For predicting Contraceptive Method Used- Yes (Label 1)



 Precision (63%) – 63% of the females predicted to use contraceptive methods are actually using
contraceptive methods.
 Recall (64%) – Out of all the females actually using contraceptive methods, 64% have been predicted
correctly.

Overall accuracy of the model – 61% of total predictions are correct

Accuracy, AUC_ROC and precision for the test data are not at all in line with the training data. This indicates that
overfitting has happened, and overall the model is not a good model for classification.
No_of_children_born, Wife_age and Wife_education (in that order of importance) are the most important variables
in determining whether a female will use a contraceptive method, as shown in the sketch below.
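The variable-importance ranking quoted above can be read off the fitted tree, for example (a sketch using the cart model from earlier):

```python
import pandas as pd

# Gini-based feature importances of the fitted CART model
importances = pd.Series(cart.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())
```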

Let's quickly compare the performance metrics of the three models above and find the best model among them:

Metric                    | Logistic Regression (Train / Test) | LDA (Train / Test)             | CART (Train / Test)
Accuracy                  | 0.67 / 0.65                        | 0.67 / 0.64                    | 0.98 / 0.61
ROC_AUC_Score             | 0.717 / 0.717                      | 0.716 / 0.664                  | 0.999 / 0.596
Confusion Matrix Analysis | No overfitting or underfitting     | No overfitting or underfitting | Overfitted model

Table 16 Model Comparison Summary

On comparing the above, we can conclude that the model built with Logistic Regression is the best model, as its
overall accuracy (65%) and ROC_AUC score (0.717) on the test set are the highest among the models shown above.

This model is also neither overfitted nor underfitted, as the precision and recall for the test data are almost in line
with the training data.

2.4 Inference: Based on these predictions, what are the insights and
recommendations.
Summing up all the above steps:
 Analysed the dataset thoroughly through EDA to understand the different variables and their relationships
with each other; pre-processed the data, as there were duplicates, outliers and missing values.
 Encoded all the categorical variables and split the data into train and test sets (70:30).
 Created different classification models: Logistic Regression, LDA and CART.
 Checked the performance of all three models on the train and test sets using the accuracy score, ROC_AUC
score and confusion matrix, and analysed their inferences one by one.
 Compared all the models after analysing the performance metrics and decided to select the model built with
Logistic Regression as the best among them.

Based on the predictions of the Logistic Regression model, the following business insights can be drawn:
 65% of the total predictions are correct.
 Out of all the females actually using contraceptive methods, 79% have been predicted correctly.
 It can therefore be concluded that most of the females tend to use contraceptive methods.
 Contraceptive methods are mostly used by women who already have 2 or more children, across all age
groups.
 The wife's education also slightly affects the use of contraceptive methods.

END OF PROBLEM 2
