Surabhi Charu Project

Data Mining Project
Submitted by Surabhi Charu

BABI
July Batch
Exploratory Data Analysis with R
Question 1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs

Exploratory data analysis (EDA) is an approach to data analysis that summarises and
visualises the important characteristics of a data set. The four types of EDA are
• Univariate non-graphical
• Multivariate non-graphical
• Univariate graphical
• Multivariate graphical
Beyond these four categories, each type of EDA has further divisions based on the
role (outcome or explanatory) and type (categorical or quantitative) of the variable(s) being
examined.
Non-graphical (quantitative) methods include summary statistics, while graphical methods include plots and
charts. Univariate methods analyse one variable at a time, while bivariate and multivariate methods
analyse two or more variables to examine their underlying relationships. EDA also depends
on the role of the variable being examined:
Outcome variable
Explanatory variable
Univariate non-graphical measures include the measures of central tendency (mean, median and mode)
along with measures of spread and position:
• Min
• Max
• Range
• Quartiles
• Variance
• Standard deviation
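As a minimal sketch, the measures listed above can be computed in base R. The built-in mtcars data set and its mpg column are stand-ins, since the project data is not shown in this report, and stat_mode is a hypothetical helper name.

```r
# Illustrative only: mtcars$mpg stands in for a numeric column of the project data.
x <- mtcars$mpg

# Base R has no built-in statistical mode, so a small helper is defined here.
# When several values tie for most frequent, the smallest one is returned.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

univariate <- list(
  mean      = mean(x),
  median    = median(x),
  mode      = stat_mode(x),
  min       = min(x),
  max       = max(x),
  range     = diff(range(x)),                       # max - min
  quartiles = quantile(x, probs = c(0.25, 0.5, 0.75)),
  variance  = var(x),
  sd        = sd(x)
)
print(univariate)
```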
The graphical methods used include:
• Histogram
• Box plots
• Bar plots
• Kernel density plots
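The four plot types above could be drawn with base R graphics roughly as follows. This is a sketch on the built-in mtcars data, used here as a stand-in for the project data.

```r
# Illustrative only: mtcars stands in for the project data.
x <- mtcars$mpg

# Histogram of a numeric variable
h <- hist(x, main = "Histogram of mpg", xlab = "mpg")

# Box plot of the same variable
b <- boxplot(x, main = "Box plot of mpg", ylab = "mpg")

# Bar plot of the counts of a categorical variable
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, main = "Cars per cylinder count", xlab = "cyl", ylab = "count")

# Kernel density estimate of the numeric variable
d <- density(x)
plot(d, main = "Kernel density of mpg")
```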
Libraries used:
install.packages("Hmisc")
install.packages("pastecs")
install.packages("psych")
Some of the basic commands used are shown below.
Summary of data
Missing values
newdata <- na.omit(data)
We have removed the missing values from the data set.
The new data set now has no missing values.
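A small sketch of the missing-value step: the data frame below is hypothetical (the project data is not shown), but na.omit, is.na and anyNA are standard base R functions.

```r
# Hypothetical data frame with missing values; column names are made up.
data <- data.frame(
  age    = c(25, NA, 41, 37, 52),
  income = c(49, 34, NA, 100, 45)
)

summary(data)                 # basic summary, including NA counts per column
colSums(is.na(data))          # number of missing values in each column

newdata <- na.omit(data)      # drop every row that contains at least one NA
anyNA(newdata)                # check whether any missing values remain
nrow(newdata)                 # number of complete rows kept
```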

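library(dplyr)          # provides glimpse()
library(funModeling)    # provides df_status()
library(Hmisc)          # provides describe()
library(DataExplorer)   # provides plot_histogram()
glimpse(data)           # column types and first values
df_status(newdata)      # zeros, NAs and unique values per column
describe(data)          # detailed per-variable summary
plot_histogram(newdata) # histograms of all continuous columns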
Multivariate Analysis
That marks the end of the univariate analysis and the beginning of the bivariate/multivariate
analysis, starting with correlation analysis.
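As a hedged illustration of the correlation step, the sketch below computes a Pearson correlation matrix with base R. The mtcars columns are stand-ins for the numeric variables of the project data.

```r
# Illustrative only: numeric columns of mtcars stand in for the project data.
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Pearson correlation between every pair of numeric variables
corr_matrix <- cor(num_vars, method = "pearson")
print(round(corr_matrix, 2))

# Pairwise scatter plots give a graphical view of the same relationships.
pairs(num_vars, main = "Pairwise scatter plots")
```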
Applying CART <plot the tree>
We will focus on CART, which uses Gini impurity as its impurity measure.
CART (Classification And Regression Tree)

This is a binary classification problem.
We are using these libraries:
library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)
Set the working directory to Main 1.
We divide the data into a development data set and a testing data set, so that the model's performance can be
validated on unseen data. The rows are assigned at random, and a 70-30 split between development and validation
is the desired separation.
We add a random number to each row of the data, generated with the runif function.
We then check the proportion of the target classes in each sample.
The proportions are close, so we proceed.
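The split described above might look like the following sketch. The data frame toy, its columns and the target definition are simulated assumptions, since the project data is not shown; runif and prop.table are the functions named in the text.

```r
set.seed(123)  # fixed seed so the random split is reproducible

# Simulated stand-in for the project data; column names are hypothetical.
n <- 1000
toy <- data.frame(
  income = runif(n, 10, 200),
  age    = sample(23:65, n, replace = TRUE)
)
toy$target <- factor(ifelse(toy$income + rnorm(n, sd = 20) > 150, 1, 0))

# Attach a uniform random number to every row and split on it (about 70/30).
toy$random <- runif(nrow(toy), 0, 1)
dev_data <- toy[toy$random <= 0.7, ]
val_data <- toy[toy$random > 0.7, ]

# The class proportions should be similar in both samples.
prop.table(table(dev_data$target))
prop.table(table(val_data$target))
```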
Now we remove the random variable from the data set and set the control parameters for the tree.
To see how the model performs, we use the original data without the control parameters and run the test
case on that data.
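A minimal sketch of fitting a CART model with control parameters, using the rpart package. The data is simulated and the control values are illustrative assumptions, not the ones used in this project.

```r
library(rpart)

set.seed(123)
# Simulated stand-in for the development data; column names are hypothetical.
n <- 1000
dev_data <- data.frame(
  income = runif(n, 10, 200),
  age    = sample(23:65, n, replace = TRUE)
)
dev_data$target <- factor(ifelse(dev_data$income + rnorm(n, sd = 20) > 150, 1, 0))

# Control parameters; these values are illustrative only.
ctrl <- rpart.control(minsplit = 50, minbucket = 15, cp = 0, xval = 10)

# Classification tree that splits on Gini impurity.
cart_model <- rpart(target ~ ., data = dev_data, method = "class",
                    parms = list(split = "gini"), control = ctrl)
print(cart_model)

# Plot the tree with rpart's own graphics. rpart.plot or rattle's
# fancyRpartPlot give nicer output if those packages are installed.
plot(cart_model, uniform = TRUE, margin = 0.1)
text(cart_model, use.n = TRUE, cex = 0.7)
```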
Interpret the CART model output <pruning, remarks on pruning, plot the pruned tree>
After pruning:
Now we predict the class in the development data set using the code below.
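Pruning and prediction could be sketched as follows. Choosing the cp value with the lowest cross-validated error is one common rule and is an assumption here; the report does not state which cp was used. The data and the column names predict.class and predict.score are hypothetical.

```r
library(rpart)

set.seed(123)
# Simulated stand-in for the development data; column names are hypothetical.
n <- 1000
dev_data <- data.frame(
  income = runif(n, 10, 200),
  age    = sample(23:65, n, replace = TRUE)
)
dev_data$target <- factor(ifelse(dev_data$income + rnorm(n, sd = 20) > 150, 1, 0))

# Grow a deliberately large tree (cp = 0), then inspect the complexity table.
full_tree <- rpart(target ~ ., data = dev_data, method = "class",
                   control = rpart.control(minsplit = 20, cp = 0))
printcp(full_tree)

# Prune at the cp value with the lowest cross-validated error (xerror).
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)

# Predicted class and probability of class "1" on the development data.
dev_data$predict.class <- predict(pruned_tree, dev_data, type = "class")
dev_data$predict.score <- predict(pruned_tree, dev_data, type = "prob")[, "1"]
table(actual = dev_data$target, predicted = dev_data$predict.class)
```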
Random Forest Generation:
We split the data set using a random variable with the same split as before, creating the development and
validation data sets.
We check the class proportions in the split data with the prop.table function.
To get the same output from the random forest on every run, we use the set.seed command.
Random forest is a bagging algorithm, which means randomness is introduced when the data set is sampled.
Random forest is considered a black-box model.
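A hedged sketch of building the forest with the randomForest package. The data is simulated, and the ntree, mtry and nodesize values are illustrative assumptions rather than the settings used in this project.

```r
library(randomForest)

set.seed(123)  # same seed, same forest on every run

# Simulated stand-in for the development data; column names are hypothetical.
n <- 1000
dev_data <- data.frame(
  income = runif(n, 10, 200),
  age    = sample(23:65, n, replace = TRUE),
  family = sample(1:4, n, replace = TRUE)
)
dev_data$target <- factor(ifelse(dev_data$income + rnorm(n, sd = 20) > 150, 1, 0))

# Each tree is grown on a bootstrap sample, with mtry variables tried per split.
rf_model <- randomForest(target ~ ., data = dev_data,
                         ntree = 501, mtry = 2, nodesize = 10,
                         importance = TRUE)
print(rf_model)  # the printout includes the OOB estimate of the error rate

# OOB error after the last tree
oob_error <- rf_model$err.rate[rf_model$ntree, "OOB"]
oob_error
```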
Interpretation
We check the OOB estimate of the error rate, which is 1.53%, because the best
random forest is chosen by its out-of-bag error.
The predicted values
Using the importance function we can get more insight into the model.
A higher mean decrease in Gini indicates higher importance of the variable.
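The importance function could be used as in the sketch below. The data is simulated so that only the hypothetical income column carries signal, which is why it should rank first here; the project's own variables are not shown.

```r
library(randomForest)

set.seed(123)
# Simulated stand-in for the development data; column names are hypothetical.
n <- 1000
dev_data <- data.frame(
  income = runif(n, 10, 200),
  age    = sample(23:65, n, replace = TRUE),
  family = sample(1:4, n, replace = TRUE)
)
dev_data$target <- factor(ifelse(dev_data$income + rnorm(n, sd = 20) > 150, 1, 0))

rf_model <- randomForest(target ~ ., data = dev_data, ntree = 201, importance = TRUE)

# Importance matrix: per-class measures, MeanDecreaseAccuracy, MeanDecreaseGini
imp <- importance(rf_model)
print(round(imp, 2))

# Variables sorted by mean decrease in Gini, most important first
imp[order(-imp[, "MeanDecreaseGini"]), "MeanDecreaseGini"]

varImpPlot(rf_model)  # dot chart of the same measures
```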

To choose the optimum number of trees.
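One way to pick the number of trees is to look at how the OOB error changes as trees are added. The sketch below reads it from the err.rate component of a fitted forest. The data is simulated, and taking the tree count with the lowest OOB error is an assumption about the selection rule.

```r
library(randomForest)

set.seed(123)
# Simulated stand-in for the development data; column names are hypothetical.
n <- 1000
dev_data <- data.frame(
  income = runif(n, 10, 200),
  age    = sample(23:65, n, replace = TRUE),
  family = sample(1:4, n, replace = TRUE)
)
dev_data$target <- factor(ifelse(dev_data$income + rnorm(n, sd = 20) > 150, 1, 0))

rf_model <- randomForest(target ~ ., data = dev_data, ntree = 501)

# err.rate has one row per tree; the "OOB" column is the running OOB error.
oob <- rf_model$err.rate[, "OOB"]
best_ntree <- which.min(oob)   # first tree count with the lowest OOB error
best_ntree

plot(rf_model, main = "Error rate vs number of trees")
```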
Tuning the random forest
With the final model we generated the predicted values and obtained the
confusion matrix.
We can use the above code to get the accuracy.
Now we take the validation data set and use the code below.
Confusion matrix for the validation data set
We can see that the accuracy of the random forest we created is 98.94%.
We can see the true and false values under the prediction variable.
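A confusion matrix and accuracy on the validation data could be computed along these lines. The data and split are simulated, so the numbers will not match the 98.94% reported above.

```r
library(randomForest)

set.seed(123)
# Simulated stand-in for the project data; column names are hypothetical.
n <- 1000
toy <- data.frame(
  income = runif(n, 10, 200),
  age    = sample(23:65, n, replace = TRUE),
  family = sample(1:4, n, replace = TRUE)
)
toy$target <- factor(ifelse(toy$income + rnorm(n, sd = 20) > 150, 1, 0))

# 70/30 random split into development and validation data
in_dev   <- runif(n) <= 0.7
dev_data <- toy[in_dev, ]
val_data <- toy[!in_dev, ]

rf_model <- randomForest(target ~ ., data = dev_data, ntree = 201)

# Predicted classes for the unseen validation rows
val_pred <- predict(rf_model, newdata = val_data, type = "response")

# Rows are actual classes, columns are predicted classes.
conf_mat <- table(actual = val_data$target, predicted = val_pred)
print(conf_mat)

# Accuracy = correctly classified rows / all rows
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```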
Interpretation of other Model Performance Measures <KS, AUC, GINI>
We measure the performance of the model with the R code below, using the rpart and
rpart.plot libraries.
We have added the prediction and probability columns to the data set.
Now we compute the rank ordering and the response rate with the
code below.
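A rank-ordering table by score decile might be built as in this base R sketch. The data frame scored and its columns are simulated assumptions. With tree-based scores there are often fewer than ten distinct values, which is why duplicate break points are removed before cutting.

```r
set.seed(123)
# Simulated scores and outcomes; in the project these would be the
# probability column and the actual target. Names are hypothetical.
n <- 1000
scored <- data.frame(predict.score = runif(n))
scored$target <- rbinom(n, 1, scored$predict.score)

# Decile buckets on the score; bucket 10 holds the highest scores.
breaks <- unique(quantile(scored$predict.score, probs = seq(0, 1, by = 0.1)))
scored$decile <- cut(scored$predict.score, breaks = breaks,
                     include.lowest = TRUE, labels = FALSE)

# Count of rows and responders per decile
cnt      <- tapply(scored$target, scored$decile, length)
cnt_resp <- tapply(scored$target, scored$decile, sum)
rank_table <- data.frame(decile   = as.integer(names(cnt)),
                         cnt      = as.vector(cnt),
                         cnt_resp = as.vector(cnt_resp))

# Highest decile first, then response rate and cumulative responders
rank_table <- rank_table[order(-rank_table$decile), ]
rank_table$cnt_non_resp <- rank_table$cnt - rank_table$cnt_resp
rank_table$rrate        <- round(100 * rank_table$cnt_resp / rank_table$cnt, 2)
rank_table$cum_resp     <- cumsum(rank_table$cnt_resp)
print(rank_table, row.names = FALSE)
```

A model that rank-orders well shows the response rate falling steadily from the top decile to the bottom one.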
Calculating the other performance measures
We use the code below and get the following values.
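KS, AUC and Gini can be computed in base R as sketched below. The scores are simulated, and Gini is taken here as 2 * AUC - 1. The report does not show which Gini definition it used, so this may differ from the value it obtained.

```r
set.seed(123)
# Simulated scores and outcomes; names are hypothetical.
n <- 1000
s <- runif(n)              # predicted probability of class 1
y <- rbinom(n, 1, s)       # actual class

n1 <- sum(y == 1)
n0 <- sum(y == 0)

# AUC from the rank-sum (Mann-Whitney) statistic; ties get average ranks.
auc <- (sum(rank(s)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Gini coefficient derived from the AUC
gini <- 2 * auc - 1

# KS: largest gap between the score distributions of the two classes
thresholds <- sort(unique(s))
ks <- max(abs(ecdf(s[y == 0])(thresholds) - ecdf(s[y == 1])(thresholds)))

round(c(KS = ks, AUC = auc, Gini = gini), 4)
```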
Now we calculate the concordance values using the
code.
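Concordance compares every responder with every non-responder and counts how often the responder has the higher score. A base R sketch on simulated scores (names are hypothetical) follows. It builds all pairs in memory, so it suits moderate sample sizes only.

```r
set.seed(123)
# Simulated scores and outcomes; names are hypothetical.
n <- 1000
s <- runif(n)
y <- rbinom(n, 1, s)

s1 <- s[y == 1]   # scores of responders
s0 <- s[y == 0]   # scores of non-responders

# Score difference for every (responder, non-responder) pair
diffs <- outer(s1, s0, "-")

concordance <- mean(diffs > 0)   # responder scored higher
discordance <- mean(diffs < 0)   # responder scored lower
tied        <- mean(diffs == 0)  # equal scores

round(c(Concordance = concordance, Discordance = discordance, Tied = tied), 4)
```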
Looking at the actual and predicted values on the test data, we can conclude that the model has high
accuracy and can be used on other data sets. The model is a good fit for both data sets.
