Surabhi Charu Project
Beyond the four created categories, each category of EDA has further divisions based on the
role (outcome or explanatory) and type (categorical or quantitative) of the variable(s) being
examined.
Quantitative methods include summary statistics, while graphical methods include plots and
charts. Univariate methods analyze one variable at a time, while bivariate and multivariate
methods analyze two or more variables to examine their underlying relationships. EDA also depends
on the role of the variable being examined:
Outcome variable
Explanatory variable
Univariate quantitative methods cover the measures of central tendency (mean, median and mode)
along with some measures of dispersion:
Min
Max
Range
Quartiles
Variance
Standard deviation
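The measures listed above can be computed with base R. The following is a minimal sketch; the vector x is a hypothetical stand-in for one numeric column of the project data, which is not shown in this report.

```r
# Hypothetical numeric vector standing in for one column of the data.
x <- c(12, 15, 15, 18, 21, 24, 30)

# Measures of central tendency
mean_x   <- mean(x)
median_x <- median(x)
# Base R has no built-in statistical mode, so take the most frequent value.
mode_x   <- as.numeric(names(which.max(table(x))))

# Measures of dispersion
min_x       <- min(x)
max_x       <- max(x)
range_x     <- max_x - min_x
quartiles_x <- quantile(x, probs = c(0.25, 0.50, 0.75))
var_x       <- var(x)   # sample variance (divides by n - 1)
sd_x        <- sd(x)    # sample standard deviation

print(c(mean = mean_x, median = median_x, mode = mode_x, range = range_x))
```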
For the graphical methods, several techniques are used:
Histogram
Box plots
Bar plots
Kernel density plots
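Each of the four plot types above has a base R equivalent. The sketch below uses the built-in mtcars data as a stand-in, since the project data set is not reproduced here; the column choices are assumptions for illustration only.

```r
# mtcars is a built-in data set used here only as a stand-in.
mpg <- mtcars$mpg          # a quantitative variable
cyl <- factor(mtcars$cyl)  # a categorical variable

# Histogram: frequency of values falling in each bin
h <- hist(mpg, main = "Histogram of mpg", xlab = "mpg")

# Box plot: median, quartiles and potential outliers
boxplot(mpg, main = "Box plot of mpg", ylab = "mpg")

# Bar plot: counts per level of a categorical variable
barplot(table(cyl), main = "Bar plot of cyl", xlab = "cyl", ylab = "count")

# Kernel density plot: smoothed estimate of the distribution
d <- density(mpg)
plot(d, main = "Kernel density of mpg")
```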
Libraries used:
install.packages("Hmisc")
install.packages("pastecs")
install.packages("psych")
Some of the basic commands used are shown below.
Summary of data and missing values:
library(funModeling)   # provides df_status()
df_status(newdata)
library(Hmisc)
describe(data)
library(DataExplorer)  # provides plot_histogram()
plot_histogram(newdata)
Multivariate Analysis
That marks the end of univariate analysis and the beginning of bivariate/multivariate
analysis, starting with Correlation analysis.
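A correlation analysis of this kind can be sketched with the base R cor function. The built-in mtcars data and the three columns chosen below are stand-ins, not the project's variables.

```r
# Stand-in data: three numeric columns from the built-in mtcars data set.
num_vars <- mtcars[, c("mpg", "wt", "hp")]

# Pearson correlation matrix; use = "complete.obs" drops rows with missing values.
cor_matrix <- cor(num_vars, method = "pearson", use = "complete.obs")
print(round(cor_matrix, 2))

# Correlation between one pair of variables
cor_mpg_wt <- cor_matrix["mpg", "wt"]
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.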
Applying CART <plot the tree>
We will focus on CART, which uses Gini impurity as its impurity measure.
library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)
We have to divide the data set into two separate sets: a development data set and a testing data set. The testing
set is used to validate the model and check its performance on unseen data.
Rows are assigned to the development and validation sets at random, with 70-30 as the desired
split.
To do this, we add a random number column to the data, generated with the runif function.
We then remove the random variable from the data set and move on to the control parameters.
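The split described above can be sketched as follows. The data frame, its column names and the seed are hypothetical placeholders; only the use of runif and the 70-30 proportion come from the text.

```r
set.seed(123)  # assumed seed, so that the split is reproducible

# Hypothetical data standing in for the project data set.
data <- data.frame(x = rnorm(1000), target = rbinom(1000, 1, 0.3))

# Add a uniform random number in [0, 1] to every row.
data$random <- runif(nrow(data), min = 0, max = 1)

# About 70% of rows go to development, the rest to validation.
dev <- data[data$random <= 0.7, ]
val <- data[data$random >  0.7, ]

# Remove the helper column so it cannot be used as a predictor.
dev$random <- NULL
val$random <- NULL

print(c(development = nrow(dev), validation = nrow(val)))
```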
To see how the model performs, we first fit it on the original data without the control parameters and run the
test case on that data.
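A CART fit with Gini impurity and explicit control parameters might look like the sketch below. It uses the kyphosis data shipped with rpart as a stand-in, and the control values (minsplit, minbucket, cp, xval) are illustrative assumptions rather than the settings used in this project.

```r
library(rpart)

set.seed(123)  # cross-validation inside rpart is random

# Control parameters: assumed values for illustration only.
ctrl <- rpart.control(minsplit = 20, minbucket = 7, cp = 0, xval = 10)

# Classification tree; parms = list(split = "gini") selects Gini impurity.
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data    = kyphosis,
             method  = "class",
             parms   = list(split = "gini"),
             control = ctrl)

print(fit)
# With rpart.plot installed, the tree can be drawn with:
# rpart.plot::rpart.plot(fit)

# Predicted class for each row of the data the tree was fitted on
pred <- predict(fit, newdata = kyphosis, type = "class")
```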
Interpret the CART model output <pruning, remarks on pruning, plot the pruned tree>
After pruning:
Now we will predict the class in the development data set using the code below.
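Pruning and class prediction can be sketched as follows. Choosing the complexity parameter with the lowest cross-validated error is one common rule and is an assumption here, as is the kyphosis stand-in data.

```r
library(rpart)

set.seed(123)

# Grow a deliberately large tree (cp = 0) so that there is something to prune.
full_tree <- rpart(Kyphosis ~ Age + Number + Start,
                   data    = kyphosis,
                   method  = "class",
                   parms   = list(split = "gini"),
                   control = rpart.control(cp = 0, minsplit = 10, xval = 10))

# cptable lists the cross-validated error (xerror) for each candidate cp.
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Cut the tree back to the chosen complexity.
pruned_tree <- prune(full_tree, cp = best_cp)

# Predicted class and probability of the "present" class
dev_pred_class <- predict(pruned_tree, newdata = kyphosis, type = "class")
dev_pred_prob  <- predict(pruned_tree, newdata = kyphosis, type = "prob")[, "present"]
```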
We split the data set by creating a random variable with the same split as before and adding it to the
data to form the development and validation data sets.
To check the proportion of each class in the split data, we use the prop.table function.
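The proportion check can be sketched with table and prop.table. The data frame and the target column name are hypothetical placeholders.

```r
set.seed(1)

# Hypothetical data with a binary target.
data <- data.frame(target = rbinom(2000, 1, 0.1))
data$random <- runif(nrow(data))

dev <- data[data$random <= 0.7, ]
val <- data[data$random >  0.7, ]

# Share of each class in the two samples; the shares should be similar.
dev_prop <- prop.table(table(dev$target))
val_prop <- prop.table(table(val$target))

print(round(dev_prop, 3))
print(round(val_prop, 3))
```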
To get the same output from the random forest on every run, we use the set.seed command.
Random forest is also a bagging algorithm, which means randomness is introduced when the data set is sampled.
A random forest is considered a black-box model.
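The effect of set.seed can be seen in a short base R sketch; the seed value is arbitrary.

```r
# Same seed, same random draws.
set.seed(1000)
first_run <- runif(5)

set.seed(1000)
second_run <- runif(5)

print(identical(first_run, second_run))

# Without resetting the seed, the next draws differ.
third_run <- runif(5)
```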
Interpretation
We check the OOB estimate of the error rate, which is 1.53%, since we choose the best
random forest by its out-of-bag error.
Using the importance function, we can get more insight into the model.
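Reading the out-of-bag error and the variable importance from a fitted forest could look like the sketch below. It assumes the randomForest package is installed and uses the built-in iris data as a stand-in; the seed and ntree value are illustrative, not the project's settings.

```r
library(randomForest)

set.seed(1000)  # fixes the bootstrap samples and feature sampling

# importance = TRUE also stores the permutation-based importance.
rf <- randomForest(Species ~ ., data = iris, ntree = 501, importance = TRUE)

# OOB error after the last tree; print(rf) reports the same figure.
oob_error <- rf$err.rate[rf$ntree, "OOB"]
print(oob_error)

# Importance measures, including MeanDecreaseAccuracy and MeanDecreaseGini
imp <- importance(rf)
print(round(imp, 2))
```

A higher MeanDecreaseGini or MeanDecreaseAccuracy suggests that the variable contributes more to the forest's splits or predictions.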
Using the final model we generated, we compute the predicted values and obtain the
confusion matrix.
We can use the above code to get the accuracy.
Next, we take the validation data set and use the code below.
Confusion matrix for the validation data set
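Accuracy follows from the confusion matrix as the share of cases on its diagonal. The sketch below uses small hypothetical vectors of actual and predicted classes, not the project's results.

```r
# Hypothetical actual and predicted classes.
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 0, 0, 1, 1, 1, 0), levels = c(0, 1))

# Rows are actual classes, columns are predicted classes.
conf_matrix <- table(Actual = actual, Predicted = predicted)
print(conf_matrix)

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(accuracy)
```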
We can see that the accuracy of the random forest we have created is 98.94%.
We can see the true and false values under the prediction variables.
We have added the prediction and probability columns to the data set.
Now we will compute the rank, along with the response rate, using the
code below.
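A rank ordering with response rates can be sketched in base R as follows. The scored data, the column names and the choice of ten equal-sized deciles are assumptions for illustration.

```r
set.seed(42)

# Hypothetical scored data: predicted probability and actual 0/1 outcome.
n <- 1000
scored <- data.frame(prob = runif(n))
scored$target <- rbinom(n, 1, scored$prob)

# Decile 10 holds the highest predicted probabilities, decile 1 the lowest.
scored$decile <- ceiling(10 * rank(scored$prob, ties.method = "first") / n)

# Count, responders and response rate per decile
cnt  <- tapply(scored$target, scored$decile, length)
resp <- tapply(scored$target, scored$decile, sum)
rank_table <- data.frame(decile     = as.integer(names(cnt)),
                         cnt        = as.vector(cnt),
                         responders = as.vector(resp))
rank_table$response_rate <- rank_table$responders / rank_table$cnt

# Show the highest-scoring decile first.
rank_table <- rank_table[order(-rank_table$decile), ]
print(rank_table)
```

For a well-ordered model, the response rate should fall steadily from the top decile to the bottom one.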
Calculating the other performance measures
We use the code below and get the following values.
Now we calculate the concordance values using the
code
Comparing the actual and predicted values, we can conclude that the model has high
accuracy and can be applied to other data sets. The model is a good fit for both the development and validation data.
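Concordance can be computed by comparing every responder with every non-responder. The helper below is a hypothetical base R sketch, not the function used in this project, and the input vectors are made up.

```r
# Hypothetical helper: share of (responder, non-responder) pairs in which
# the responder received the higher predicted probability.
concordance <- function(actual, prob) {
  pos <- prob[actual == 1]
  neg <- prob[actual == 0]
  diffs   <- outer(pos, neg, "-")  # one entry per pair
  n_pairs <- length(pos) * length(neg)
  c(concordance = sum(diffs > 0) / n_pairs,
    discordance = sum(diffs < 0) / n_pairs,
    tied        = sum(diffs == 0) / n_pairs)
}

# Made-up example data
actual <- c(1, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.6, 0.4, 0.7, 0.8, 0.2)

res <- concordance(actual, prob)
print(round(res, 3))
```

Here 8 of the 9 pairs are concordant, so the concordance is about 0.889. The outer call builds a full pair matrix, so for very large data sets a sampled or rank-based calculation may be preferable.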