Predictive Modelling Project

Problem 1: Linear Regression


The comp-activ database is a collection of computer systems activity measures.
The data was collected from a Sun Sparcstation 20/712 with 128 Mbytes of memory running in
a multi-user university department. Users would typically be doing a large variety of tasks
ranging from accessing the internet, editing files or running very cpu-bound programs.

As a budding data scientist, you want to find a linear equation that predicts 'usr' (the portion
of time (%) that the CPUs run in user mode) from a list of system attributes, and to find out how
each attribute affects the time the system spends in 'usr' mode.

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the Data
types, shape, EDA, 5 point summary). Perform Univariate, Bivariate Analysis, Multivariate
Analysis.

Sample of dataset
Exploratory data analysis
Checking the shape and data types

There are a total of 8192 rows and 22 columns in the dataset. Of the 22 columns, 13 are float,
8 are integer and 1 is an object type variable.

Data Description
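
The shape, data type and five-point summary checks described above could be run roughly as follows. This is a minimal sketch with pandas on a toy data frame: the values and the 'load_level' column are made up for illustration, and only the names scall, rchar and usr are borrowed from the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset. The values and the 'load_level'
# column are illustrative assumptions, not taken from the project data.
df = pd.DataFrame({
    "scall": [2147, 170, 2162, 160],
    "rchar": [40671.0, 448.0, None, 15.0],
    "usr": [95, 97, 87, 98],
    "load_level": ["low", "high", "low", "high"],
})

# Shape: (number of rows, number of columns)
n_rows, n_cols = df.shape

# Count how many columns there are of each data type
dtype_counts = df.dtypes.astype(str).value_counts().to_dict()

# describe() reports min, 25%, 50%, 75% and max (the five-point summary)
# plus count, mean and std for the numeric columns
summary = df.describe()

print(n_rows, n_cols)
print(dtype_counts)
print(summary)
```

On the real data the same three calls would be applied to the data frame returned by the file reader.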

Unique values in categorical data


Univariate Analysis – Continuous Variables

Distribution plot of scall

Distribution plot of sread


Distribution plot of fork

Distribution plot of exec


Distribution plot of pflt

Distribution plot of freemem


Distribution plot of freeswap

Distribution plot of usr


1.2 Impute null values if present, also check for the values which are equal to zero. Do they
have any meaning or do we need to change them or drop them? Check for the possibility of
creating new features if required. Also check for outliers and duplicates if there.

There are 104 missing values in rchar and 15 missing values in wchar.
As both are continuous variables, the missing values can be imputed with the mean or the median.
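
A minimal sketch of the imputation step, assuming the median is chosen (the report allows either mean or median). The data frame below is a toy example; only the column names rchar and wchar come from the dataset.

```python
import pandas as pd

# Toy data with one missing value per column (illustrative values only)
df = pd.DataFrame({
    "rchar": [10.0, None, 30.0, 1000.0],
    "wchar": [5.0, 7.0, None, 9.0],
})

missing_before = df.isnull().sum()

# The median is less sensitive to extreme values than the mean,
# which can matter for skewed variables.
for col in ["rchar", "wchar"]:
    df[col] = df[col].fillna(df[col].median())

missing_after = df.isnull().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```

Replacing median() with mean() gives the mean-imputed alternative.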

Checking for duplicate rows

There are no duplicate rows
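
The duplicate check can be sketched as below, assuming pandas. The toy frame deliberately contains one repeated row so that the check has something to find; the real dataset has none.

```python
import pandas as pd

# Toy data: the third row repeats the second (illustrative values only)
df = pd.DataFrame({"scall": [1, 2, 2, 3], "usr": [90, 95, 95, 97]})

# duplicated() flags every row that is identical to an earlier row
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```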


Checking for outliers
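
The report does not state which rule was used to flag outliers. One common choice, shown here purely as an assumption, is the 1.5 x IQR rule that box plots use, followed by capping values at the whisker limits. The helper name iqr_bounds and the sample values are hypothetical.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) limits of the k * IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy values with one obvious extreme point (illustrative only)
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 200.0], name="scall")

lower, upper = iqr_bounds(values)

# Count the points that fall outside the limits
n_outliers = int(((values < lower) | (values > upper)).sum())

# Cap (rather than drop) the extreme points at the limits
capped = values.clip(lower=lower, upper=upper)

print(lower, upper, n_outliers)
```

Capping keeps every row in the dataset, whereas dropping outliers would reduce the sample size.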

1.3 Encode the data (having string values) for Modelling. Split the data into train
and test (70:30). Apply Linear regression using scikit-learn. Perform checks for
significant variables using an appropriate method from statsmodels. Create multiple
models and check the performance of predictions on the train and test sets using
R-square, RMSE and adjusted R-square. Compare these models and select the best one
with appropriate reasoning.

Encoding the String Values by Get Dummies

Sample of dataset after encoding
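
A minimal sketch of dummy encoding with pandas get_dummies. The column 'load_level' and its values are hypothetical, and drop_first=True is an assumption (the report does not say whether the first level was dropped); it avoids a redundant column when the encoded data feeds a linear model.

```python
import pandas as pd

df = pd.DataFrame({
    "load_level": ["low", "high", "low"],  # hypothetical string column
    "usr": [95, 87, 98],
})

# One indicator column per category, minus the first (alphabetical) level
encoded = pd.get_dummies(df, columns=["load_level"], drop_first=True, dtype=int)

print(encoded)
```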

Data Split: Split the data into test and train

The X and y for the data were formulated and split in a 70:30 ratio with random_state = 1.
The X_train and y_train data were then fit to the Linear Regression model.
The coefficients for each of the independent attributes are as below.
Intercept for the model
The intercept of the model is 84.13143842096603
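
The split and fit steps above can be sketched with scikit-learn as follows. The data is synthetic (generated so that the true intercept is 84 and the true coefficients are 2, -1.5 and 0), so the printed numbers are not the project's results; only the 70:30 split and random_state = 1 follow the report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic predictors and target (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 84.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# 70:30 split with the same random_state the report mentions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Fit on the training portion only
model = LinearRegression().fit(X_train, y_train)

print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)
```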

R square on training data


The R square on training data is 0.7961565330395104

R square on test data


The R square on test data is 0.7676695029858404

Root Mean Square Error (RMSE) on Training data


The RMSE on training data is 0.20690072466418796

Root Mean Square Error (RMSE) on test data


The RMSE on test data is 0.21647817772382874
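
For reference, the three metrics named in the task (R-square, RMSE and adjusted R-square) can be computed by hand as sketched below. The function names and the four sample points are illustrative assumptions; the formulas are the standard definitions.

```python
import math


def rmse(y_true, y_pred):
    """Root of the mean squared difference between actual and predicted."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalise R-square for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

r2 = r_squared(y_true, y_pred)
print(rmse(y_true, y_pred), r2, adjusted_r_squared(r2, n_samples=4, n_features=1))
```

Adjusted R-square only rises when an added variable improves the fit by more than chance would, which is why it is useful when comparing models with different numbers of predictors.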
We can drop the scall and fork variables as they have high p-values.

Problem 2: Logistic Regression, LDA and CART

You are a statistician at the Republic of Indonesia Ministry of Health and you are
provided with data on 1473 women collected from a Contraceptive Prevalence
Survey. The samples are married women who were either not pregnant or did not know
if they were at the time of the survey.
The problem is to predict whether or not they use a contraceptive method of choice based on
their demographic and socio-economic characteristics.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition
check, check for duplicates and outliers and write an inference on it. Perform Univariate and
Bivariate Analysis and Multivariate Analysis.

Sample of the dataset

Checking the types of variables in the data frame


The dataset has 10 variables, of which 7 are object, 2 are float and 1 is integer type.
Contraceptive_method_used is the dependent variable.

Check for missing values in the dataset

Check for duplicate values in the dataset

Now we go ahead and remove the null values and duplicate rows.

Unique values in the categorical data

Percentage of target variable

The split indicates 52.9% have used contraceptive methods


Univariate Analysis – Categorical Variables

There are more uneducated wives than husbands, and most of the husbands have completed their
tertiary education.
Working Yes = 0
Working No = 1
Box plot
Correlation Matrix Plot

Wife age and No_of_children_born are slightly correlated


Pairplot

Boxplot before treating outliers


Boxplot after treating outlier in No of children variable

Do not scale the data. Encode the data (having string values) for Modelling. Data
Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA
(linear discriminant analysis) and CART.
Encoding the Categorical Values by Get Dummies
This is the new data frame with additional columns

Data Split: Split the data into test and train


The X and y for the data were formulated and split in a 70:30 ratio with
random_state = 1.

The above is the value count of y train dataset.

Getting the probabilities on the test set
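
A minimal sketch of fitting a logistic regression and getting class probabilities on the test set with scikit-learn. The two-feature data is synthetic and max_iter=1000 is an assumed setting; only the 70:30 split and random_state = 1 follow the report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary target driven by two features (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One row per test sample: column 0 is P(class 0), column 1 is P(class 1)
proba_test = clf.predict_proba(X_test)
pred_test = clf.predict(X_test)
test_accuracy = (pred_test == y_test).mean()

print(proba_test[:5])
print(test_accuracy)
```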

Confusion matrix on the training data


Confusion matrix on the test data
Accuracy - Training Data

Accuracy of the training data is 0.6582694414019715

AUC and ROC for the training data


Accuracy - test Data

Accuracy of the test data is 0.6454081632653061

AUC and ROC for the test data
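
The AUC and the points of the ROC curve can be obtained from scikit-learn as sketched below. The four labels and scores are a toy example, not the project data; in the project the scores would be the predicted probabilities of class 1.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted scores (illustrative only)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# AUC: probability that a random positive is scored above a random negative
auc = roc_auc_score(y_true, scores)

# False positive rate and true positive rate at each score threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

print(auc)
print(list(zip(fpr, tpr)))
```

Three of the four positive-negative pairs are ranked correctly here, so the AUC is 0.75.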

Classification Report on train data

For predicting that they used a contraceptive method (Label 1):


Precision = 0.66
Recall = 0.73
f1 score = 0.69

Classification Report on test data

Precision = 0.64
Recall = 0.77
f1 score = 0.70
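
As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the reported F1 values can be recomputed from the precision and recall figures above. The helper name f1_from is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


train_f1 = f1_from(0.66, 0.73)  # precision and recall on the train data
test_f1 = f1_from(0.64, 0.77)   # precision and recall on the test data

print(round(train_f1, 2), round(test_f1, 2))  # prints: 0.69 0.7
```

Both values agree with the classification reports after rounding to two decimals.
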
We fit the data to an LDA model.

Training Data Class Prediction with a cut-off value of 0.5


Test Data Class Prediction with a cut-off value of 0.5
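
Class prediction with a 0.5 cut-off can be sketched as follows, assuming scikit-learn's LinearDiscriminantAnalysis. The two-class data is synthetic; the cut-off is applied to the predicted probability of class 1, so changing the cutoff value trades precision against recall.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic classes centred at -1 and +1 (illustrative assumption)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(100, 2)),
    rng.normal(1.0, 1.0, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)

# Probability of class 1 for every sample
prob_class1 = lda.predict_proba(X)[:, 1]

# Label a sample 1 when its probability exceeds the cut-off
cutoff = 0.5
pred = (prob_class1 > cutoff).astype(int)
accuracy = (pred == y).mean()

print(accuracy)
```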

Confusion matrix
AUC and ROC for training and test data
Comparing both the models
            LR Train   LR Test   LDA Train   LDA Test
Accuracy    0.66       0.65      0.66        0.65
Precision   0.66       0.64      0.66        0.64
Recall      0.73       0.77      0.75        0.79
F1-score    0.69       0.70      0.70        0.71
AUC         0.72       0.72      0.72        0.68

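Comparing the two models, we find the results are nearly the same: accuracy and
precision match, LDA has slightly higher recall and F1-score, and Logistic Regression
has the higher test AUC (0.72 against 0.68). LDA may work better when the target
variable is categorical.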

The EDA clearly indicates that women with a tertiary education and a very
high standard of living used contraceptive methods.
Women aged 21 to 38 generally use contraceptive methods more.

I believe the usage of contraceptive methods need not depend on demographic or
socioeconomic background, since the use of contraceptive methods was almost the
same for both working and non-working women.

The use of contraceptive methods was high for both Scientology and
Non-Scientology women.
