Predictive Modelling
Project
Advanced Statistics
Contents:
Problem 1: Linear Regression
The comp-activ database is a collection of computer system activity measures.
The data was collected from a Sun Sparcstation 20/712 with 128 Mbytes of memory running in a multi-user
university department. Users would typically be doing a large variety of tasks, ranging from accessing
the internet and editing files to running very CPU-bound programs.
As a budding data scientist, you want to fit a linear equation that predicts 'usr' (the portion of time (%)
that CPUs run in user mode) from a list of system attributes, and to find out how each attribute affects
the time the system spends in 'usr' mode.
DATA DICTIONARY:
-----------------------
System measures used:
lread - Reads (transfers per second) between system memory and user memory
lwrite - Writes (transfers per second) between system memory and user memory
scall - Number of system calls of all types per second
sread - Number of system read calls per second
swrite - Number of system write calls per second
fork - Number of system fork calls per second
exec - Number of system exec calls per second
rchar - Number of characters transferred per second by system read calls
wchar - Number of characters transferred per second by system write calls
pgout - Number of page out requests per second
ppgout - Number of pages, paged out per second
pgfree - Number of pages per second placed on the free list.
pgscan - Number of pages checked if they can be freed per second
atch - Number of page attaches (satisfying a page fault by reclaiming a page in memory) per second
pgin - Number of page-in requests per second
ppgin - Number of pages paged in per second
pflt - Number of page faults caused by protection errors (copy-on-writes).
vflt - Number of page faults caused by address translation
runqsz - Process run queue size (the number of kernel threads in memory that are waiting for a CPU to
run). Typically, this value should be less than 2; consistently higher values mean that the system
might be CPU-bound.
freemem - Number of memory pages available to user processes
freeswap - Number of disk blocks available for page swapping.
------------------------
usr - Portion of time (%) that cpus run in user mode
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data types,
shape, EDA, 5-point summary). Perform Univariate, Bivariate Analysis, Multivariate Analysis.
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any
meaning or do we need to change them or drop them? Check for the possibility of creating new features
if required. Also check for outliers and duplicates if there.
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30). Apply
Linear regression using scikit learn. Perform checks for significant variables using appropriate method
from stats model. Create multiple models and check the performance of Predictions on Train and Test
sets using R-square, RMSE & Adj R-square. Compare these models and select the best one with
appropriate reasoning.
1.4 Inference: Basis on these predictions, what are the business insights and recommendations.
Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present.
Problem 2: Logistic Regression, LDA and CART
The problem is to predict whether or not women use a contraceptive method of choice, based on their
demographic and socio-economic characteristics.
Data Dictionary:
Important Note: Please reflect on all that you have learned while working on this project. This step is
critical in cementing all your concepts and closing the loop. Please write down your thoughts here.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
check for duplicates and outliers and write an inference on it. Perform Univariate and Bivariate Analysis
and Multivariate Analysis.
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data
into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis) and CART.
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final Model: Compare Both the
models and write inference which model is best/optimized.
2.4 Inference: Basis on these predictions, what are the insights and recommendations.
Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present.
Solutions:
Problem 1:
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data types,
shape, EDA, 5-point summary). Perform Univariate, Bivariate Analysis, Multivariate Analysis.
Ans.:
#The dataset has 8192 rows and 22 columns. Of the 22 variables, 13 are of float type, 8 are of integer
type and 1 is of object type.
Outliers are present in the columns and need to be treated before further analysis. From the histograms,
most of the predictors are right-skewed, with long upper tails, while 'usr' is left-skewed.
The 'runqsz' (process run queue size) variable has 3861 observations labelled CPU_Bound and 4331
labelled Not_CPU_Bound.
1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any
meaning or do we need to change them or drop them? Check for the possibility of creating new
features if required. Also check for outliers and duplicates if there.
Ans.:
#There are null values in ‘rchar’ and ‘wchar’. As both are continuous variables, the missing values can
be imputed with the column mean.
#The zeros can be kept in the dataset for further analysis, as several features can legitimately be 0
when the system is idle.
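The null and zero checks described above can be sketched roughly as follows. This is a minimal illustration only: the small DataFrame and its values are made up, and only the column names 'rchar', 'wchar' and 'pgout' come from the data dictionary.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
df = pd.DataFrame({
    "rchar": [1200.0, np.nan, 800.0, 1000.0],
    "wchar": [300.0, 500.0, np.nan, 100.0],
    "pgout": [0.0, 0.0, 2.4, 0.0],
})

# Count missing values and exact zeros per column before changing anything.
null_counts = df.isnull().sum()
zero_counts = (df == 0).sum()

# Mean imputation for the two columns reported as having nulls.
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].mean())

print(null_counts.to_dict())
print(zero_counts.to_dict())
print(df)
```

Counting zeros separately from nulls keeps the two questions apart: nulls are filled, while zeros are only inspected and left in place.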
#The univariate analysis clearly shows that outliers are present in the dataset, so they need to be
treated:
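The report does not show which outlier treatment was applied. One common choice, assumed here purely for illustration, is to cap values at the boxplot whiskers, that is at Q1 - 1.5*IQR and Q3 + 1.5*IQR. The helper name `cap_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up values with one obvious outlier (500).
lread = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 500.0])
capped = cap_outliers_iqr(lread)
print(capped.tolist())
```

Capping keeps every row in the dataset, which matters when many columns each have a few extreme values; dropping rows instead would shrink the sample quickly.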
1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30).
Apply Linear regression using scikit learn. Perform checks for significant variables using appropriate
method from stats model. Create multiple models and check the performance of Predictions on Train
and Test sets using R-square, RMSE & Adj R-square. Compare these models and select the best one
with appropriate reasoning.
Ans.:
#Variance Inflation Factor (VIF) is one way to check whether the independent variables are correlated
with each other. Correlated predictors are not ideal for linear regression, because they inflate the
standard errors of the coefficient estimates. As a result, the regression model becomes less reliable
and harder to interpret.
##General rule of thumb: a VIF of 1 means there is no multicollinearity. A VIF of 5 or more indicates
moderate multicollinearity. A VIF of 10 or more indicates high multicollinearity.
##From the above, I conclude that the variables are moderately correlated.
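For reference, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing predictor j on all the other predictors. statsmodels offers `variance_inflation_factor` for this; the sketch below instead computes the textbook quantity by hand with NumPy on synthetic data, so the function name `vif` and the generated columns are assumptions made for illustration, not the report's code.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Return the VIF of each column of X (rows are observations)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()           # R^2 of column j on the rest
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # unrelated to a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a
values = vif(np.column_stack([a, b, c]))
print(values.round(2))
```

In this synthetic setup the near-duplicate columns get large VIFs while the independent column stays close to 1, which is the pattern the rule of thumb is meant to flag.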
**Linear Regression R-squared on the training data: 0.7858835725266415, so about 79% of the variation
in usr is explained by the predictors in the model for the train set.
**Linear Regression R-squared on the testing data: 0.7931214712225988, so about 79% of the variation in
usr is explained by the predictors in the model for the test set.
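The three metrics requested in the question relate to each other as in the sketch below. It is written in plain Python so the formulas are visible; the function name `regression_metrics` and the five sample values are hypothetical and are not taken from the project data.

```python
import math

def regression_metrics(y_true, y_pred, n_predictors):
    """Return (R-squared, adjusted R-squared, RMSE) for paired sequences."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R-squared penalises each extra predictor.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(ss_res / n)
    return r2, adj_r2, rmse

# Invented actual and predicted 'usr' values.
y_true = [90.0, 85.0, 70.0, 95.0, 60.0]
y_pred = [88.0, 86.0, 74.0, 93.0, 59.0]
r2, adj_r2, rmse = regression_metrics(y_true, y_pred, n_predictors=1)
print(round(r2, 4), round(adj_r2, 4), round(rmse, 4))
```

Because adjusted R-squared falls when a predictor adds little, it is the more useful of the two R-squared figures when comparing models with different numbers of variables.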
usr = (82.8072) * const + (-0.0539) * lread + (0.0446) * lwrite + (-0.0008) * scall + (0.0024) * sread
+ (-0.0051) * swrite + (-0.1464) * fork + (-0.2312) * exec + (-0.0) * rchar + (-0.0) * wchar
+ (-0.4547) * pgout + (0.0316) * ppgout + (0.0411) * pgfree + (-0.0) * pgscan + (0.5216) * atch
+ (0.0059) * pgin + (-0.058) * ppgin + (-0.0314) * pflt + (-0.0064) * vflt + (-0.0005) * freemem
+ (0.0) * freeswap + (1.8468) * runqsz_Not_CPU_Bound
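The term runqsz_Not_CPU_Bound suggests that the categorical 'runqsz' column was dummy-encoded with one level dropped. A minimal sketch of how such a column can arise with pandas is given below; the toy rows are invented, and the use of `drop_first=True` is an assumption about the encoding rather than something the report states.

```python
import pandas as pd

# Toy rows; only the column names and category labels follow the report.
df = pd.DataFrame({
    "lread": [1.0, 5.0, 2.0],
    "runqsz": ["CPU_Bound", "Not_CPU_Bound", "CPU_Bound"],
})

# drop_first=True drops the first level (CPU_Bound), which becomes the baseline.
encoded = pd.get_dummies(df, columns=["runqsz"], drop_first=True)
encoded["runqsz_Not_CPU_Bound"] = encoded["runqsz_Not_CPU_Bound"].astype(int)
print(encoded)
```

Under that assumption, the coefficient on runqsz_Not_CPU_Bound is read as the expected difference in 'usr' relative to the CPU_Bound baseline, with the other predictors held fixed.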
1.4 Inference: Basis on these predictions, what are the business insights and recommendations.
Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present.
Ans.:
Problem 2:
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
check for duplicates and outliers and write an inference on it. Perform Univariate and Bivariate
Analysis and Multivariate Analysis.
Ans.:
#The dataset has 10 variables: 7 are of object type, 2 are of float type and 1 is of integer type.
Contraceptive_method_used is the dependent variable.
#There are null values in ‘Wife_age’ and ‘No_of_children_born’. As both are numeric variables, the
missing values can be imputed with the column mean.
#Bivariate analysis(Pair-plot) :
#Multi-variate analysis:
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the
data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis) and
CART.
#VIF is used again here to check whether the independent variables are correlated with each other,
since correlated predictors inflate the standard errors of the coefficient estimates and make the
model less reliable and harder to interpret.
##General rule of thumb: a VIF of 1 means no multicollinearity, a VIF of 5 or more indicates moderate
multicollinearity, and a VIF of 10 or more indicates high multicollinearity.
##From the above, I conclude that the variables are moderately correlated.
#Logistic Regression Model Accuracy of Training data before applying GridSearchCV: 0.6646153846153846
#Logistic Regression Model Accuracy of Test data before applying GridSearchCV: 0.645933014354067
#Logistic Regression Model Accuracy of Training data after applying GridSearchCV: 0.6646153846153846
#Logistic Regression Model Accuracy of Test data after applying GridSearchCV: 0.645933014354067
#CART Model :
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final Model: Compare Both
the models and write inference which model is best/optimized.
Ans.:
#Comparing both models, we find that the results are the same, but LDA works better when the target
variable is categorical.
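The metrics named in the question (accuracy, confusion matrix, ROC_AUC) can be sketched in plain Python as below. In practice they would normally come from `sklearn.metrics` (`accuracy_score`, `confusion_matrix`, `roc_auc_score`). The labels, scores and 0.5 threshold here are invented for illustration, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

def roc_auc(y_true, scores):
    """AUC as the chance that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = uses a contraceptive method) and predicted probabilities.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.2, 0.6, 0.7, 0.4, 0.9, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tn + tp) / len(y_true)
auc = roc_auc(y_true, scores)
print((tn, fp, fn, tp), round(accuracy, 4), round(auc, 4))
```

Accuracy depends on the chosen threshold, while AUC summarises ranking quality across all thresholds, so two models with the same accuracy can still differ in AUC.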
2.4 Inference: Basis on these predictions, what are the insights and recommendations.
Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present.
Ans.: #The EDA clearly indicates that women with a tertiary education and a very high standard of
living used contraceptive methods. Women aged 21 to 38 generally use contraceptive methods more.
#Usage of contraceptive methods need not depend on every demographic or socio-economic factor: it was
almost the same for working and non-working women.
#The use of contraceptive methods was high for both Scientology and Non-Scientology women.