
Data Mining Project

Submitted by Surabhi Charu


BABI
July Batch
Exploratory Data Analysis with R

Question 1 EDA - Basic data summary, Univariate, Bivariate analysis, graphs


Exploratory data analysis (EDA) is an approach to data analysis for summarising and
visualising the important characteristics of a data set. The four types of EDA are:

• Univariate non-graphical
• Multivariate non-graphical
• Univariate graphical
• Multivariate graphical

Beyond these four categories, each category of EDA has further divisions based on the
role (outcome or explanatory) and type (categorical or quantitative) of the variable(s) being
examined.

Non-graphical methods include summary statistics, while graphical methods include plots and
charts. Univariate methods analyze one variable at a time, while bivariate and multivariate
methods analyze two or more variables to examine their underlying relationships. EDA also
depends on the role of the variable being examined:

Outcome variable

Explanatory variable

Univariate non-graphical analysis covers the measures of central tendency (mean, median and
mode) and the following measures of spread:

• Min
• Max
• Range
• Quartiles
• Variance
• Standard deviation
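
The measures listed above can be computed with base R alone. The sketch below is only an illustration: the vector x is a hypothetical sample, and stat_mode is a small helper written here because base R has no built-in function for the statistical mode.

```r
# Hypothetical numeric sample; replace with a column of your own data.
x <- c(12, 15, 15, 18, 21, 24, 30)

# Helper (not part of base R): most frequent value in a vector.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

stats <- list(
  mean      = mean(x),
  median    = median(x),
  mode      = stat_mode(x),
  min       = min(x),
  max       = max(x),
  range     = max(x) - min(x),
  quartiles = quantile(x, probs = c(0.25, 0.5, 0.75)),
  variance  = var(x),  # sample variance (n - 1 denominator)
  sd        = sd(x)    # sample standard deviation
)
print(stats)
```

The built-in summary(x) gives min, quartiles, median, mean and max in one call; the list above simply adds the remaining measures.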

For graphical analysis, the following plot types are used:

• Histogram
• Box plots
• Bar plots
• Kernel density plots

Libraries used:

install.packages("Hmisc")
install.packages("pastecs")
install.packages("psych")

Some of the basic commands used are shown below.

Summary of data

Missing value

newdata <- na.omit(data)

We have removed the rows with missing values from the data set.

The new data set, newdata, has no missing values.
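
To make the effect of na.omit concrete, here is a minimal sketch on a made-up data frame. The column names age and income are assumptions for illustration only and do not come from the project data.

```r
# Hypothetical data frame with two missing entries.
data <- data.frame(
  age    = c(25, NA, 40, 31),
  income = c(50, 62, NA, 45)
)

colSums(is.na(data))        # number of missing values per column

newdata <- na.omit(data)    # keep only rows with no missing value

nrow(data) - nrow(newdata)  # number of rows that were dropped
anyNA(newdata)              # FALSE: nothing is missing any more
```

Note that na.omit drops the whole row when any column is missing, so it is worth checking how many rows are lost before going further.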


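library(dplyr)         # provides glimpse()
library(funModeling)   # provides df_status()
library(Hmisc)         # provides describe()
library(DataExplorer)  # provides plot_histogram()

glimpse(data)            # column types and first few values
df_status(newdata)       # zeros, NAs and unique values per column
describe(data)           # detailed summary of every variable
plot_histogram(newdata)  # histograms of all continuous columns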
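
The four plot types listed earlier can also be drawn with base R graphics, without extra packages. This is a sketch on simulated data; x and grp are hypothetical variables, not columns of the project data.

```r
set.seed(1)
x   <- rnorm(200, mean = 50, sd = 10)                 # hypothetical numeric variable
grp <- factor(sample(c("A", "B"), 200, replace = TRUE))  # hypothetical category

h <- hist(x, main = "Histogram", xlab = "x")          # distribution of x
boxplot(x ~ grp, main = "Box plot by group")          # spread and outliers per group
barplot(table(grp), main = "Bar plot of counts")      # frequency of each category
d <- density(x)                                       # kernel density estimate
plot(d, main = "Kernel density plot")
```

hist and density also return objects (h and d here), which is handy when the bin counts or the estimated bandwidth are needed later.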
Multivariate Analysis
That marks the end of univariate analysis and the beginning of bivariate/multivariate
analysis, starting with Correlation analysis.
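
As an illustration of correlation analysis, the sketch below builds a small simulated data frame and computes its Pearson correlation matrix with cor. The columns x1, x2 and x3 are assumptions made up for this example.

```r
set.seed(42)
n  <- 100
df <- data.frame(x1 = rnorm(n))
df$x2 <- 2 * df$x1 + rnorm(n, sd = 0.5)  # built to be strongly related to x1
df$x3 <- rnorm(n)                        # unrelated noise

corr <- cor(df, method = "pearson")      # pairwise correlation matrix
print(round(corr, 2))
```

Values near 1 or -1 point to a strong linear relationship, values near 0 to none; on real data, cor needs numeric columns only and complete rows (or use = "complete.obs").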
• Applying CART <plot the tree>
We will focus on CART, which uses the Gini impurity as its impurity measure.
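
To show what the Gini impurity measures, here is a small sketch in base R. The function names gini_impurity and split_gini are our own helpers for illustration, not functions from rpart.

```r
# Gini impurity of a node: 1 - sum of squared class proportions.
gini_impurity <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Weighted impurity after splitting a node into a left and a right child.
split_gini <- function(labels, go_left) {
  n     <- length(labels)
  left  <- labels[go_left]
  right <- labels[!go_left]
  length(left) / n * gini_impurity(left) +
    length(right) / n * gini_impurity(right)
}

mixed <- c("yes", "yes", "no", "no")
pure  <- c("yes", "yes", "yes", "yes")
gini_impurity(mixed)  # 0.5, the maximum for two classes
gini_impurity(pure)   # 0, a pure node

# A split that separates the classes perfectly has impurity 0.
labels <- c("yes", "yes", "yes", "no", "no", "no")
split_gini(labels, go_left = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))
```

At each node, CART picks the split with the lowest weighted impurity, that is, the largest drop from the parent's impurity.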

CART (Classification And Regression Tree)


Binary classification

We use these libraries:

library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)

Set the working directory to Main 1.

We divide the data set into two parts: a development data set used to build the model and a testing data set used
to validate its performance on unseen data.

The rows are assigned to the two parts at random, with a 70-30 split between development and validation.

We add a random-number column to the data, generated with the runif function.

We then check the proportion of the target classes in each part.

The proportions are close, so we proceed.
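
The split described above can be sketched as follows. The data are simulated, and the column names target and random are assumptions for illustration; the idea is only to show runif driving a 70-30 split and prop.table checking the class balance.

```r
set.seed(123)
n <- 1000
data <- data.frame(
  x      = rnorm(n),
  target = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.9, 0.1)))
)

data$random <- runif(n)             # one uniform random number per row
dev <- data[data$random <= 0.7, ]   # about 70% of rows: development
val <- data[data$random >  0.7, ]   # about 30% of rows: validation

# Class proportions should be similar in all three tables.
p_all <- prop.table(table(data$target))
p_dev <- prop.table(table(dev$target))
p_val <- prop.table(table(val$target))
print(rbind(all = p_all, dev = p_dev, val = p_val))
```

Because the assignment is random, the split is only approximately 70-30, and the class proportions only approximately equal.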

Next we remove the random variable from the data set and set the control parameters.
To see how the model performs, we first fit it on the original data without the control parameters and test it on
the data.
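
A minimal sketch of these steps with rpart is given here. The data are simulated, and the control values (minsplit, minbucket, cp, xval) are assumptions chosen for the example, not the settings used in the project.

```r
library(rpart)

set.seed(123)
n <- 1000
data <- data.frame(x1 = runif(n), x2 = runif(n))
data$target <- factor(ifelse(data$x1 + 0.2 * rnorm(n) > 0.6, "yes", "no"))

# 70-30 split with a helper random column, then drop that column.
data$random <- runif(n)
dev <- data[data$random <= 0.7, ]
val <- data[data$random >  0.7, ]
dev$random <- NULL
val$random <- NULL

# Tree with rpart's default settings (no explicit control parameters).
tree_default <- rpart(target ~ ., data = dev, method = "class")

# Tree with explicit control parameters and the Gini splitting rule.
ctrl <- rpart.control(minsplit = 50, minbucket = 15, cp = 0, xval = 10)
tree_ctrl <- rpart(target ~ ., data = dev, method = "class",
                   parms = list(split = "gini"), control = ctrl)
print(tree_ctrl)

# To draw the tree (requires the rpart.plot package):
# rpart.plot::rpart.plot(tree_ctrl)
```

Setting cp = 0 lets the tree grow fully within the minsplit and minbucket limits, which is why a pruning step normally follows.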

• Interpret the CART model output <pruning, remarks on pruning, plot the pruned tree>
After pruning:

Now we predict the class for each row of the development data set.
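
Pruning and prediction can be sketched as below. The data are simulated, and choosing the cp value with the lowest cross-validated error is one common rule that we assume here; the column names predict.class and predict.score are also our own.

```r
library(rpart)

set.seed(123)
n <- 1000
dev <- data.frame(x1 = runif(n), x2 = runif(n))
dev$target <- factor(ifelse(dev$x1 + 0.2 * rnorm(n) > 0.6, "yes", "no"))

# Grow a deliberately large tree (cp = 0) with 10-fold cross-validation.
full_tree <- rpart(target ~ ., data = dev, method = "class",
                   control = rpart.control(minsplit = 20, cp = 0, xval = 10))
printcp(full_tree)  # complexity table: CP, nsplit, rel error, xerror, xstd

# Pick the cp with the lowest cross-validated error and prune to it.
cp_table    <- full_tree$cptable
best_cp     <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)

# Predicted class and predicted probability of "yes" for every row.
dev$predict.class <- predict(pruned_tree, dev, type = "class")
dev$predict.score <- predict(pruned_tree, dev, type = "prob")[, "yes"]
head(dev)
```

The pruned tree never has more nodes than the full tree; if the two are identical, the full tree was not overfitting according to the cross-validated error.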

Random Forest Generation:

We split the data set in the same way as before, creating a random variable and using it to assign rows to the
development and validation data sets.

To check the class proportions in the split data, we use the prop.table function.

To get the same random forest on every run, we call set.seed before fitting, which makes the output
reproducible.

Random forest is a bagging algorithm, which means each tree is grown on a random bootstrap sample of the data set.
A random forest is regarded as a black-box model.
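
To show what bagging and the out-of-bag idea mean, here is a hand-rolled sketch that bags plain rpart trees. This is not the randomForest implementation (a random forest also samples a random subset of predictors at each split); the data and the number of trees are assumptions for illustration.

```r
library(rpart)

set.seed(123)  # same seed, same bootstrap samples, same result
n <- 500
data <- data.frame(x1 = runif(n), x2 = runif(n))
data$target <- factor(ifelse(data$x1 + 0.2 * rnorm(n) > 0.6, "yes", "no"))

n_trees   <- 25
votes_yes <- numeric(n)  # out-of-bag votes for "yes" per row
votes_all <- numeric(n)  # number of trees for which the row was out of bag

for (b in seq_len(n_trees)) {
  in_bag <- sample(n, n, replace = TRUE)     # bootstrap sample of row indices
  oob    <- setdiff(seq_len(n), in_bag)      # rows this tree never saw
  tree   <- rpart(target ~ ., data = data[in_bag, ], method = "class")
  pred   <- predict(tree, data[oob, ], type = "class")
  votes_yes[oob] <- votes_yes[oob] + (pred == "yes")
  votes_all[oob] <- votes_all[oob] + 1
}

# Majority vote over the trees that did not train on each row.
scored    <- votes_all > 0
oob_pred  <- ifelse(votes_yes[scored] / votes_all[scored] > 0.5, "yes", "no")
oob_error <- mean(oob_pred != data$target[scored])
oob_error
```

Each row is scored only by trees that did not see it, so the out-of-bag error behaves like a built-in validation estimate.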

Interpretation

The OOB estimate of the error rate is 1.53%. We use the out-of-bag (OOB) error as the
measure for choosing the best random forest.

The predicted values

Using the importance function, we can get more insight into the model.

A higher mean decrease in Gini indicates higher importance of the variable.
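
A minimal sketch of fitting a random forest and reading its OOB error and variable importance follows. It needs the randomForest package; the data are simulated and the values of ntree, mtry and nodesize are assumptions for the example, not the project's settings.

```r
library(randomForest)

set.seed(123)
n <- 500
dev <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
# Only x1 drives the target; x2 and x3 are noise.
dev$target <- factor(ifelse(dev$x1 + 0.2 * rnorm(n) > 0.6, "yes", "no"))

rf <- randomForest(target ~ ., data = dev, ntree = 201, mtry = 2,
                   nodesize = 10, importance = TRUE)
print(rf)  # shows the OOB estimate of error rate and the confusion matrix

# OOB error after the last tree.
oob_error <- rf$err.rate[rf$ntree, "OOB"]

# Variable importance, sorted by mean decrease in Gini.
imp <- importance(rf)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
```

In this simulated example x1 should come out on top, since it is the only variable related to the target.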


To choose the optimum value for the trees, we tune the random forest.
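
One possible way to tune a random forest is sketched here, assuming the randomForest package: the OOB error curve suggests how many trees are enough, and tuneRF searches over mtry. The data and all parameter values are assumptions for illustration; the source does not say which settings were tuned.

```r
library(randomForest)

set.seed(123)
n <- 500
dev <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n))
dev$target <- factor(ifelse(dev$x1 + dev$x2 + 0.2 * rnorm(n) > 1, "yes", "no"))

# OOB error as a function of the number of trees.
rf <- randomForest(target ~ ., data = dev, ntree = 301)
best_ntree <- which.min(rf$err.rate[, "OOB"])
plot(rf, main = "Error rate vs number of trees")

# Search over mtry (number of predictors tried at each split).
predictors <- dev[, c("x1", "x2", "x3", "x4")]
tuned <- tuneRF(x = predictors, y = dev$target,
                mtryStart = 2, ntreeTry = 101, stepFactor = 2,
                improve = 0.01, trace = FALSE, plot = FALSE, doBest = FALSE)
print(tuned)  # one row per mtry tried, with its OOB error
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
```

The forest can then be refitted with the chosen mtry and a number of trees at or beyond the point where the OOB curve flattens.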

With the final model we generate the predicted values and obtain the confusion matrix,
from which we compute the accuracy.

We now apply the same steps to the validation data set.
Confusion matrix for the validation data set:
We can see that the accuracy is 98.94% for the random forest we have created.
We can see the true values alongside the false values under each predicted class.
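
As a small illustration of how a confusion matrix and the accuracy are obtained, here is a sketch in base R. The vectors actual and predicted are made up and do not reproduce the figures reported above.

```r
# Hypothetical actual and predicted classes for ten rows.
actual <- factor(c("yes", "no", "no", "yes", "no", "no", "yes", "no", "no", "no"),
                 levels = c("no", "yes"))
predicted <- factor(c("yes", "no", "no", "no", "no", "no", "yes", "yes", "no", "no"),
                    levels = c("no", "yes"))

# Rows are the actual classes, columns the predicted classes.
conf_mat <- table(Actual = actual, Predicted = predicted)
print(conf_mat)

# Accuracy: correctly classified rows (the diagonal) over all rows.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy  # 0.8
```

The off-diagonal cells are the errors: actual "yes" predicted "no" (missed responders) and actual "no" predicted "yes" (false alarms).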

Interpretation of other Model Performance Measures <KS, AUC, GINI>
We measure the performance of the model in R, using the rpart and rpart.plot
libraries.

We have added the prediction and probability columns to the data set.
Next we compute the rank of each row and the response rate for each rank.
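
A rank-ordering table can be sketched in base R as follows. We assume here that the rank means score deciles; score and target are simulated stand-ins for the probability column and the actual response.

```r
set.seed(123)
n <- 1000
score  <- runif(n)                           # hypothetical predicted probability
target <- rbinom(n, size = 1, prob = score)  # response is likelier at high scores

# Decile 10 holds the highest scores, decile 1 the lowest.
decile <- ceiling(10 * rank(score, ties.method = "first") / n)

cnt      <- tapply(target, decile, length)   # rows per decile
cnt_resp <- tapply(target, decile, sum)      # responders per decile

rank_tbl <- data.frame(
  decile   = as.integer(names(cnt)),
  cnt      = as.vector(cnt),
  cnt_resp = as.vector(cnt_resp)
)
rank_tbl$cnt_non_resp <- rank_tbl$cnt - rank_tbl$cnt_resp
rank_tbl$rrate        <- rank_tbl$cnt_resp / rank_tbl$cnt  # response rate
rank_tbl <- rank_tbl[order(-rank_tbl$decile), ]            # best decile first
print(rank_tbl)
```

For a model that ranks well, the response rate falls steadily from the top decile to the bottom one.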
Next we calculate the other performance measures: KS, AUC and Gini.
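
KS, AUC and Gini can be computed from the scores with base R alone, as in this sketch. The data are simulated, and Gini is taken here as 2 * AUC - 1, which is one common convention; the source does not say which definition or package was used.

```r
set.seed(123)
n <- 1000
score  <- runif(n)                           # hypothetical predicted probability
target <- rbinom(n, size = 1, prob = score)  # hypothetical actual response

pos <- score[target == 1]  # scores of responders
neg <- score[target == 0]  # scores of non-responders
n1  <- length(pos)
n0  <- length(neg)

# AUC from the rank-sum (Mann-Whitney) formula.
r   <- rank(score)
auc <- (sum(r[target == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# KS: largest gap between the two cumulative score distributions.
thr <- sort(unique(score))
ks  <- max(abs(ecdf(neg)(thr) - ecdf(pos)(thr)))

# Gini coefficient of the model under the 2 * AUC - 1 convention.
gini <- 2 * auc - 1

round(c(KS = ks, AUC = auc, Gini = gini), 3)
```

An AUC of 0.5 (Gini 0, KS near 0) means the scores do not separate the classes at all, while values near 1 mean almost perfect separation.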
Finally, we calculate the concordance values.
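
Concordance compares every responder with every non-responder and asks whether the responder got the higher score. The sketch below does this in base R on simulated data; score and target are hypothetical.

```r
set.seed(123)
n <- 300
score  <- runif(n)                           # hypothetical predicted probability
target <- rbinom(n, size = 1, prob = score)  # hypothetical actual response

pos <- score[target == 1]  # scores of responders
neg <- score[target == 0]  # scores of non-responders

# Score difference for every (responder, non-responder) pair.
diff_mat <- outer(pos, neg, "-")
n_pairs  <- length(diff_mat)

concordance <- sum(diff_mat > 0) / n_pairs   # responder scored higher
discordance <- sum(diff_mat < 0) / n_pairs   # responder scored lower
ties        <- sum(diff_mat == 0) / n_pairs  # equal scores

round(c(Concordance = concordance, Discordance = discordance, Tied = ties), 3)
```

The three shares add up to 1, and concordance plus half the tied share equals the AUC. The outer call builds one cell per pair, so for large data sets a rank-based formula is the lighter option.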

Comparing the test and predicted values, we can conclude that the model has high
accuracy and can be applied to other data sets. The model is a good fit for both the development and validation data.
