
12/18/2022

Predictive Modelling
Project
Advanced Statistics

Contents:
Problem 1: Linear Regression
The comp-activ database is a collection of computer system activity measures.
The data was collected from a Sun Sparcstation 20/712 with 128 Mbytes of memory running in a multi-
user university department. Users would typically be doing a large variety of tasks ranging from accessing
the internet, editing files or running very CPU-bound programs.

As a budding data scientist, you want to build a linear model to predict 'usr' (the portion of time (%) that CPUs run in user mode) and to find out how each attribute affects the time the system spends in 'usr' mode, using the list of system attributes below.

Dataset for Problem 1: compactiv.xlsx

DATA DICTIONARY:
-----------------------
System measures used:

lread - Reads (transfers per second ) between system memory and user memory
lwrite - writes (transfers per second) between system memory and user memory
scall - Number of system calls of all types per second
sread - Number of system read calls per second .
swrite - Number of system write calls per second .
fork - Number of system fork calls per second.
exec - Number of system exec calls per second.
rchar - Number of characters transferred per second by system read calls
wchar - Number of characters transferred per second by system write calls
pgout - Number of page out requests per second
ppgout - Number of pages, paged out per second
pgfree - Number of pages per second placed on the free list.
pgscan - Number of pages checked if they can be freed per second
atch - Number of page attaches (satisfying a page fault by reclaiming a page in memory) per second
pgin - Number of page-in requests per second
ppgin - Number of pages paged in per second
pflt - Number of page faults caused by protection errors (copy-on-writes).
vflt - Number of page faults caused by address translation .
runqsz - Process run queue size (The number of kernel threads in memory that are waiting for a CPU to
run.
Typically, this value should be less than 2. Consistently higher values mean that the system might be
CPU-bound.)
freemem - Number of memory pages available to user processes
freeswap - Number of disk blocks available for page swapping.
------------------------
usr - Portion of time (%) that CPUs run in user mode

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data types,
shape, EDA, 5-point summary). Perform Univariate, Bivariate Analysis, Multivariate Analysis.

1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any
meaning or do we need to change them or drop them? Check for the possibility of creating new features
if required. Also check for outliers and duplicates if there.


1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30). Apply
Linear regression using scikit learn. Perform checks for significant variables using appropriate method
from stats model. Create multiple models and check the performance of Predictions on Train and Test
sets using R-square, RMSE & Adj R-square. Compare these models and select the best one with
appropriate reasoning.

1.4 Inference: Based on these predictions, what are the business insights and recommendations.

Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present.

Problem 2: Logistic Regression, LDA and CART


You are a statistician at the Republic of Indonesia Ministry of Health, and you are provided with data on 1473 women collected from a Contraceptive Prevalence Survey. The samples are married women who were either not pregnant or did not know if they were pregnant at the time of the survey.

The problem is to predict whether or not they use a contraceptive method of choice, based on their demographic and socio-economic characteristics.

Dataset for Problem 2: Contraceptive_method_dataset.xlsx

Data Dictionary:

1. Wife's age (numerical)


2. Wife's education (categorical) 1=uneducated, 2, 3, 4=tertiary
3. Husband's education (categorical) 1=uneducated, 2, 3, 4=tertiary
4. Number of children ever born (numerical)
5. Wife's religion (binary) Non-Scientology, Scientology
6. Wife's now working? (binary) Yes, No
7. Husband's occupation (categorical) 1, 2, 3, 4(random)
8. Standard-of-living index (categorical) 1=very low, 2, 3, 4=high
9. Media exposure (binary) Good, not good
10. Contraceptive method used (class attribute) No, yes.

Important Note: Please reflect on all that you have learned while working on this project. This step is
critical in cementing all your concepts and closing the loop. Please write down your thoughts here.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
check for duplicates and outliers and write an inference on it. Perform Univariate and Bivariate Analysis
and Multivariate Analysis.

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the data
into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis) and CART.

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy and Confusion Matrix, plot the ROC curve and get the ROC_AUC score for each model. Final Model: Compare the models and write an inference on which model is best/optimized.

2.4 Inference: Based on these predictions, what are the insights and recommendations.

Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present.

Solutions:
Problem 1:
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data types,
shape, EDA, 5-point summary). Perform Univariate, Bivariate Analysis, Multivariate Analysis.

Ans.:

#There are a total of 8192 rows and 22 columns in the dataset. Of the 22 columns, 13 are of float type, 8 of integer type, and 1 is an object-type variable.
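A minimal sketch of how the data could be read and summarised with pandas (the file name comes from the problem statement; the df defined here is reused in the later sketches):

```python
import pandas as pd

# Load the dataset (assumes compactiv.xlsx is in the working directory)
df = pd.read_excel("compactiv.xlsx")

print(df.shape)          # (8192, 22)
print(df.dtypes)         # 13 float, 8 int, 1 object (runqsz)
df.info()                # non-null counts per column
print(df.describe().T)   # count, mean, std and the 5-point summary for numeric columns
```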


#Data frame summary:

#Univariate analysis for numerical data (Histplot & Boxplot):
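A minimal sketch of the univariate plots, assuming seaborn and matplotlib and the df loaded above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# df: DataFrame loaded from compactiv.xlsx as above
num_cols = df.select_dtypes(include="number").columns

for col in num_cols:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], kde=True, ax=ax1)   # distribution shape / skewness
    sns.boxplot(x=df[col], ax=ax2)            # outliers beyond the whiskers
    fig.suptitle(col)
    plt.tight_layout()
    plt.show()
```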


We have outliers in several columns, which need to be treated before further analysis. From the histplots, all the features have a left-skewed distribution, while the ‘usr’ feature has a right-skewed distribution.

#Univariate analysis for categorical data:

From the analysis of ‘runqsz’ (process run queue size), we have 3861 observations labelled CPU_Bound and 4331 labelled Not_CPU_Bound.

#Bivariate analysis: Pairplot (pairwise relationships between variables):


#Bivariate analysis: Heatmap (Check for presence of correlations):
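A minimal sketch of the correlation heatmap (numeric columns only), assuming the df loaded above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# df: DataFrame loaded from compactiv.xlsx as above
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between system activity measures")
plt.show()
```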

#From the heatmap we can see that several features are correlated with one another.


1.2 Impute null values if present, also check for the values which are equal to zero. Do they have any
meaning or do we need to change them or drop them? Check for the possibility of creating new
features if required. Also check for outliers and duplicates if there.

Ans.:

#We can see null values in ‘rchar’ and ‘wchar’. As these are continuous variables, the mean value can be imputed.

#We can keep the zeros in our dataset for further analysis, as some features can legitimately be 0 when the system stays idle.
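A minimal sketch of the null-value imputation and the zero check described above:

```python
# df: DataFrame loaded from compactiv.xlsx as above
print(df.isnull().sum())   # only rchar and wchar contain nulls

# Impute the missing values with the column mean, as described above
df["rchar"] = df["rchar"].fillna(df["rchar"].mean())
df["wchar"] = df["wchar"].fillna(df["wchar"].mean())

# Count zero values per column; zeros are kept, since an idle system can
# legitimately report 0 for many of these activity measures
print((df == 0).sum())
```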


#Unique values for categorical variables:

#Converting categorical to dummy variables:

#Sample data set after data-encoding:
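A minimal sketch of the dummy-variable encoding with pandas, assuming runqsz is the only object column:

```python
import pandas as pd

# df: DataFrame with nulls imputed as above
print(df["runqsz"].unique())   # ['CPU_Bound', 'Not_CPU_Bound'] per the univariate analysis

# One-hot encode the only object column, dropping the first level so that a
# single indicator column runqsz_Not_CPU_Bound remains
df = pd.get_dummies(df, columns=["runqsz"], drop_first=True)
print(df.head())
```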

#From the Univariate analysis we can clearly see that outliers are present in the data-set, so we need to
treat those outliers:

#Function to treat outliers: remove_outlier
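The report's remove_outlier helper is not shown; below is a plausible IQR-capping sketch, an assumption rather than the author's exact code:

```python
# A plausible IQR-based version of the remove_outlier helper (assumption, not
# necessarily the author's exact code)
def remove_outlier(col):
    """Return the lower and upper whisker limits for a numeric Series (1.5 * IQR rule)."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap every numeric column at its whisker limits instead of dropping rows
for col in df.select_dtypes(include="number").columns:
    low, high = remove_outlier(df[col])
    df[col] = df[col].clip(lower=low, upper=high)
```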


#Outlier check after treatment:


1.3 Encode the data (having string values) for Modelling. Split the data into train and test (70:30).
Apply Linear regression using scikit learn. Perform checks for significant variables using appropriate
method from stats model. Create multiple models and check the performance of Predictions on Train
and Test sets using R-square, RMSE & Adj R-square. Compare these models and select the best one
with appropriate reasoning.

Ans.:

#We have already encoded the object-type column (‘runqsz’) into dummy variables.

#Checking Multicollinearity using Variance Inflation Factor (VIF):

#Variance Inflation Factor (VIF) is one of the methods to check if independent variables have
correlation between them. If they are correlated, then it is not ideal for linear regression models as
they inflate the standard errors which in turn affects the regression parameters. As a result, the
regression model becomes non-reliable and lacks interpretability.

##General rule of thumb: a VIF of 1 means there is no multicollinearity; VIF values between 5 and 10 indicate moderate multicollinearity; a VIF of 10 or more indicates high multicollinearity.

##From the above I can conclude that variables have moderate correlation.
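A minimal sketch of how the VIF values could be computed with statsmodels, assuming df is the encoded DataFrame and 'usr' is the target:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# X: the encoded independent variables (everything except the target 'usr')
X = add_constant(df.drop("usr", axis=1))

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))   # read VIFs for the predictors, ignoring the constant
```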


#Linear Regression using scikit-learn:

We fit the training data to a LinearRegression() model.

#The coefficients for each of the independent attributes:

The intercept: 82.80717337506023


*Mean Squared Error (MSE) on training data: 20.65
*Mean Squared Error (MSE) on test data: 18.97
**Root Mean Squared Error (RMSE) on training data: 4.5441663714860345
**Root Mean Squared Error (RMSE) on test data: 4.355572383064068
***Co-efficient of Determination (r-square) on the train data: 0.7858835725266415
***Co-efficient of Determination (r-square) on the test data: 0.7931214712225988
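A minimal sketch of how the model above and its metrics could be produced with scikit-learn; the 70:30 split and the random_state are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = df.drop("usr", axis=1)
y = df["usr"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

lr = LinearRegression().fit(X_train, y_train)
print("Intercept:", lr.intercept_)
print("Coefficients:", dict(zip(X.columns, lr.coef_)))

for name, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = lr.predict(Xs)
    mse = mean_squared_error(ys, pred)
    print(f"{name}: MSE={mse:.2f}, RMSE={np.sqrt(mse):.4f}, R-square={r2_score(ys, pred):.4f}")
```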

**The R-square of the Linear Regression model on the training data is 0.7859, so about 78.6% of the variation in ‘usr’ is explained by the predictors for the train set.

**The R-square on the test data is 0.7931, so about 79.3% of the variation in ‘usr’ is explained by the predictors for the test set.

#Linear Regression using statsmodels (OLS):

*Mean Squared Error (MSE) on training data: 20.65
*Mean Squared Error (MSE) on test data: 18.97
**Root Mean Squared Error (RMSE) on training data: 4.5441663714860345
**Root Mean Squared Error (RMSE) on test data: 4.355572383063754
***Co-efficient of Determination (r-square) on the train data: 0.7858835725266415
***Co-efficient of Determination (r-square) on the test data: 0.7931214712226287
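A minimal sketch of the statsmodels OLS fit that supports the significance checks; the setup (add_constant on the same 70:30 split) is an assumption:

```python
import statsmodels.api as sm

# Reuse the same 70:30 split as the scikit-learn model
X_train_sm = sm.add_constant(X_train)
X_test_sm = sm.add_constant(X_test)

ols_model = sm.OLS(y_train, X_train_sm).fit()
print(ols_model.summary())   # coefficients, p-values (significance checks), R-squared and Adj. R-squared

# Predictors with p-values above 0.05 can be dropped and the model refitted,
# which is how alternative models can be built and compared.
```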


#We can write the linear regression equation as:

usr = 82.8072 + (-0.0539) * lread + (0.0446) * lwrite + (-0.0008) * scall + (0.0024) * sread + (-0.0051) * swrite + (-0.1464) * fork + (-0.2312) * exec + (-0.0) * rchar + (-0.0) * wchar + (-0.4547) * pgout + (0.0316) * ppgout + (0.0411) * pgfree + (-0.0) * pgscan + (0.5216) * atch + (0.0059) * pgin + (-0.058) * ppgin + (-0.0314) * pflt + (-0.0064) * vflt + (-0.0005) * freemem + (0.0) * freeswap + (1.8468) * runqsz_Not_CPU_Bound

1.4 Inference: Based on these predictions, what are the business insights and recommendations.

Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present.

Ans.:


Problem 2:

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
check for duplicates and outliers and write an inference on it. Perform Univariate and Bivariate
Analysis and Multivariate Analysis.

Ans.:

#The dataset has 10 variables, of which 7 are object type, 2 are float type, and 1 is integer type. Contraceptive_method_used is the dependent variable.

#We have 1473 rows and 10 columns in our Data-set.

#Data frame summary:


#We can see null values in ‘Wife_age’ and ‘No_of_children_born’. As these are continuous variables, the mean value can be imputed.

#Univariate analysis for numerical data:


#Univariate analysis for categorical data:


#Bivariate analysis(Pair-plot) :

‘Wife_age’ and ‘No_of_children_born’ are slightly correlated.


#Multi-variate analysis:


#Catplot for categorical vs numerical analysis:


2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split: Split the
data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis) and
CART.

Ans.: Encoded the categorical variables Wife_education, Husband_education, Wife_religion, Wife_Working, Standard_of_living_index, Media_exposure and Contraceptive_method_used in ascending order from worst to best, since LDA does not take string variables as inputs for model building. Below is the encoding used for the ordinal values (a code sketch of this encoding follows the list):

Wife_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Husband_education: Uneducated = 1, Primary = 2, Secondary = 3, Tertiary = 4.
Wife_religion: Scientology = 1 and Non-Scientology = 2.
Wife_Working: Yes = 1 and No = 2.
Standard_of_living_index: Very Low = 1, Low = 2, High = 3, Very High = 4.
Media_exposure: Exposed = 1 and Not-Exposed = 2.
Contraceptive_method_used: Yes = 1 and No = 0.
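A minimal sketch of this encoding with pandas; the column names and category spellings are taken from the report and may need adjusting to match the file exactly:

```python
import pandas as pd

cmd = pd.read_excel("Contraceptive_method_dataset.xlsx")   # file name from the problem statement

# Ordinal / label maps following the encoding listed above
edu_map = {"Uneducated": 1, "Primary": 2, "Secondary": 3, "Tertiary": 4}
cmd["Wife_education"] = cmd["Wife_education"].map(edu_map)
cmd["Husband_education"] = cmd["Husband_education"].map(edu_map)
cmd["Wife_religion"] = cmd["Wife_religion"].map({"Scientology": 1, "Non-Scientology": 2})
cmd["Wife_Working"] = cmd["Wife_Working"].map({"Yes": 1, "No": 2})
cmd["Standard_of_living_index"] = cmd["Standard_of_living_index"].map(
    {"Very Low": 1, "Low": 2, "High": 3, "Very High": 4}
)
cmd["Media_exposure"] = cmd["Media_exposure"].map({"Exposed": 1, "Not-Exposed": 2})
cmd["Contraceptive_method_used"] = cmd["Contraceptive_method_used"].map({"Yes": 1, "No": 0})

print(cmd.dtypes)   # no object columns should remain
```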


#We have no object type data now.

#Bivariate analysis: Pairplot (pairwise relationships between variables):


#Check for presence of correlations:

#Checking Multicollinearity using Variance Inflation Factor (VIF):

VIF for Wife_age ---> 23.986752368551485
VIF for Wife_education ---> 16.16229659265143
VIF for Husband_education ---> 27.364195172617837
VIF for No_of_children_born ---> 4.376622562511411
VIF for Wife_religion ---> 12.615598921821046
VIF for Wife_Working ---> 14.783600009367635
VIF for Husband_Occupation ---> 6.921852898686767
VIF for Standard_of_living_index ---> 13.36777566587339
VIF for Media_exposure ---> 14.976189325600197

#Variance Inflation Factor (VIF) is one of the methods to check if independent variables have
correlation between them. If they are correlated, then it is not ideal for linear regression models as
they inflate the standard errors which in turn affects the regression parameters. As a result, the
regression model becomes non-reliable and lacks interpretability.

##General rule of thumb: a VIF of 1 means there is no multicollinearity; VIF values between 5 and 10 indicate moderate multicollinearity; a VIF of 10 or more indicates high multicollinearity.

##From the above VIF values I can conclude that most variables show high multicollinearity (VIF well above 10).


#Logistic Regression Model :

# Getting the probabilities on the training set:

# Getting the probabilities on the test set:

#Logistic Regression Model Accuracy of Training data before applying GridSearchCV: 0.6646153846153846

#Logistic Regression Model Accuracy of Test data before applying GridSearchCV: 0.645933014354067


##Applying GridSearchCV for Logistic Regression:
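A minimal sketch of the grid search for the Logistic Regression model; the hyperparameter grid, split and random_state are assumptions, since the report does not show them:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# cmd: the encoded DataFrame from 2.2, with nulls imputed by the mean as in 2.1
X = cmd.drop("Contraceptive_method_used", axis=1)
y = cmd["Contraceptive_method_used"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# Illustrative hyperparameter grid
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["lbfgs", "newton-cg", "liblinear"],
}
grid = GridSearchCV(LogisticRegression(max_iter=10000), param_grid, scoring="accuracy", cv=3)
grid.fit(X_train, y_train)

best_lr = grid.best_estimator_
print("Best parameters:", grid.best_params_)
print("Train accuracy:", best_lr.score(X_train, y_train))
print("Test accuracy:", best_lr.score(X_test, y_test))
```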


#Getting the probabilities on the training set:

#Getting the probabilities on the test set:

#Classification report and Confusion matrix:

#AUC and ROC for the training data:

*AUC for the Training Data: 0.7074


*AUC for the Test Data: 0.6922
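A minimal sketch of how the confusion matrix, classification report, ROC curve and AUC could be produced for a fitted model (best_lr here; the same pattern applies to the LDA and CART models below):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

for name, Xs, ys in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    probs = best_lr.predict_proba(Xs)[:, 1]      # probability of class 1 (uses a contraceptive)
    preds = best_lr.predict(Xs)
    print(name, "AUC:", round(roc_auc_score(ys, probs), 4))
    print(confusion_matrix(ys, preds))
    print(classification_report(ys, preds))
    fpr, tpr, _ = roc_curve(ys, probs)
    plt.plot(fpr, tpr, label=f"{name} ROC")

plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```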


#Logistic Regression Model Accuracy of Training data after applying GridSearchCV: 0.6646153846153846

#Logistic Regression Model Accuracy of Test data after applying GridSearchCV: 0.645933014354067

#LDA (linear discriminant analysis) Model :
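A minimal sketch of the LDA model fit with scikit-learn, reusing the same train/test split as the logistic regression:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Probability predictions used for the ROC/AUC analysis below
lda_train_probs = lda.predict_proba(X_train)[:, 1]
lda_test_probs = lda.predict_proba(X_test)[:, 1]

print("Train accuracy:", lda.score(X_train, y_train))
print("Test accuracy:", lda.score(X_test, y_test))
```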


#Training Data Probability Prediction :

#Test Data Probability Prediction

#Classification report and Confusion matrix:

#AUC and ROC for the training data:

*AUC for the Training Data: 0.707


*AUC for the Test Data: 0.694


#LDA Model Accuracy of Training data: 0.6615384615384615

#LDA Model Accuracy of test data: 0.6555023923444976

#CART Model :

Decision tree using Graphviz:

Pruned Decision tree using Graphviz:
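A minimal sketch of a CART model with grid-search pruning and a Graphviz export; the pruning grid is an assumption, since the report does not show it:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Full-grown tree
cart = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X_train, y_train)

# Prune by searching over depth / leaf-size constraints (illustrative grid)
param_grid = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_leaf": [10, 20, 50],
    "min_samples_split": [30, 60, 150],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train, y_train)
pruned_cart = grid.best_estimator_

print("Train accuracy:", pruned_cart.score(X_train, y_train))
print("Test accuracy:", pruned_cart.score(X_test, y_test))

# Export the pruned tree for rendering with Graphviz (dot -Tpng pruned_tree.dot -o tree.png)
export_graphviz(pruned_cart, out_file="pruned_tree.dot",
                feature_names=list(X_train.columns),
                class_names=["No", "Yes"], filled=True)
```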


#Training Data Probability Prediction:

#Test Data Probability Prediction:


#Classification report and Confusion matrix:

#AUC and ROC for the training data:

*AUC for the Training Data: 0.815


*AUC for the Test Data: 0.725


#CART Model Accuracy of training data: 0.7487179487179487

#CART Model Accuracy of test data: 0.6746411483253588.

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy and Confusion Matrix, plot the ROC curve and get the ROC_AUC score for each model. Final Model: Compare the models and write an inference on which model is best/optimized.

Ans.:

#Comparing the models: Logistic Regression and LDA give almost identical results on both the train and test sets (accuracy around 0.66, AUC around 0.70), while CART scores higher on the training data (accuracy 0.749, AUC 0.815) but drops more on the test data (accuracy 0.675, AUC 0.725). Between Logistic Regression and LDA, LDA works slightly better here, as it is well suited to a categorical target variable.

2.4 Inference: Based on these predictions, what are the insights and recommendations.

Please explain and summarise the various steps performed in this project. There should be proper
business interpretation and actionable insights present.

Ans.: #The EDA clearly indicates that women with a tertiary education and a very high standard of living use contraceptive methods more. Women aged roughly 21 to 38 generally use contraceptive methods more often.

#The usage of contraceptive methods does not appear to depend on the wife's working status, since usage was almost the same for both working and non-working women.

#The use of contraceptive methods was high for both Scientology and non-Scientology women.

