Assignment EDA
Note: Use the Adult_nm.csv for the following exercises. The target variable is income, and
the goal is to classify income based on the other variables.
2. Construct a table of the first 10 records of the data set, in order to get a feel for the data.
adultnm <- read.csv("Adult_nm.csv")  # load the data; assumes the file is in the working directory
head(adultnm, 10)
library(dplyr)
# Keep only the numeric columns (selected by position) in one data frame
adultnm4 <- select(adultnm, 1:2, 4, 6, 12:14)
head(adultnm4)
# Pairwise correlations among the numeric variables
cor(adultnm4)
4. For each pair of categorical variable and target variable, construct a cross-tabulation.
Discuss your salient results.
counts16 <-
table(adultnm$income,adultnm$marital.status,dnn=c("income","marital.status"))
counts16
counts18 <-
table(adultnm$income,adultnm$relationship,dnn=c("income","relationship"))
counts18
counts20 <-
table(adultnm$income,adultnm$native.country,dnn=c("income","native.country"))
counts20
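Raw counts are hard to compare when the categories differ a lot in size, so it can help to convert each cross-tabulation to column proportions. A minimal sketch, using a small made-up data frame `toy` as a stand-in for `adultnm` (the values are illustrative only, not taken from Adult_nm.csv):

```r
# Toy stand-in for adultnm; the values are made up for illustration only
toy <- data.frame(
  income = c("<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"),
  sex    = c("Male", "Male", "Female", "Female", "Male", "Female")
)

counts <- table(toy$income, toy$sex, dnn = c("income", "sex"))

# margin = 2 divides each column by its total, so every column sums to 1
col_props <- prop.table(counts, margin = 2)
round(col_props, 2)
```

The same call should work on `counts16`, `counts18` and `counts20`, for example `prop.table(counts16, margin = 2)`.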
5. For each of the categorical variables, construct a bar chart of the variable, with an overlay
of the target variable. Plot the bar chart and normalized graph side by side.
library(ggplot2)
# The "+" must end a line; a line that starts with "+" is not joined to the plot above it
ggplot(adultnm,aes(race))+geom_bar(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
ggplot(adultnm,aes(relationship))+geom_bar(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
ggplot(adultnm,aes(education))+geom_bar(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
ggplot(adultnm,aes(marital.status))+geom_bar(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
ggplot(adultnm,aes(sex))+geom_bar(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
ggplot(adultnm,aes(native.country))+geom_bar(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
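The question asks for the count bar chart and the normalized chart side by side, while the calls above draw only the normalized version. One possible way to get both panels, sketched here with base R graphics rather than ggplot2 and with a made-up data frame `toy` standing in for `adultnm`:

```r
# Toy stand-in for adultnm; the values are made up for illustration only
toy <- data.frame(
  income = c("<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"),
  sex    = c("Male", "Male", "Female", "Female", "Male", "Female", "Male", "Female")
)

counts <- table(toy$income, toy$sex)
props  <- prop.table(counts, margin = 2)  # each column rescaled to sum to 1

old_par <- par(mfrow = c(1, 2))           # one row, two panels
barplot(counts, legend.text = TRUE, main = "Counts", ylab = "Count")
barplot(props, main = "Normalized", ylab = "Proportion")
par(old_par)                              # restore the previous layout
```

With ggplot2 the same pairing would need an extra layout package to place two plots next to each other, which is why this sketch stays with base graphics.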
6. Discuss the relationship, if any, each of these variables has with the target variables.
# The graphs and contingency tables show several trends. The more educated a person is,
# the higher the chance of earning >50K. Race also matters: the proportion of people
# earning >50K is higher for the White and Asian/Pacific groups than for the others.
# Relationship status shows a trend as well: Husband and Wife have a higher proportion
# of >50K earners than the other categories.
# Marital status also matters: Married-AF-Spouse and Married-CIV-Spouse have a high
# proportion of >50K earners.
# Gender also matters: males have a higher proportion of >50K earners than females.
# Native country is also associated with income >50K, as the graph shows.
7. Which variables would you expect to make a significant appearance in any data mining
classification model we work with?
8. Report the mean, median, minimum, maximum, and standard deviation for each of the
numerical variables.
summary(adultnm4)
# summary() does not report the standard deviation, so compute it per column
sd(adultnm4$X)
sd(adultnm4$age)
sd(adultnm4$demogweight)
sd(adultnm4$education.num)
sd(adultnm4$capital.gain)
sd(adultnm4$capital.loss)
sd(adultnm4$hours.per.week)
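The five requested statistics can also be collected into a single table instead of being read off several separate calls. A minimal sketch, assuming a data frame that contains only numeric columns; `toy_num` and its values are made up for illustration:

```r
# Toy stand-in for the numeric columns; the values are made up for illustration only
toy_num <- data.frame(
  age            = c(25, 38, 52, 41),
  hours.per.week = c(40, 50, 40, 60)
)

# One column of statistics per variable
stats <- sapply(toy_num, function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
})

t(stats)  # transpose so that each variable is a row
```

Applied to the assignment data, the call would be `sapply(adultnm4, ...)` with the same function.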
# Normalized histograms of each numeric variable with an income overlay
# (the "+" must end a line so that the scale is attached to the plot)
ggplot(adultnm,aes(hours.per.week))+geom_histogram(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
ggplot(adultnm,aes(capital.loss))+geom_histogram(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
ggplot(adultnm,aes(education.num))+geom_histogram(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
ggplot(adultnm,aes(capital.gain))+geom_histogram(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
ggplot(adultnm,aes(age))+geom_histogram(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
ggplot(adultnm,aes(demogweight))+geom_histogram(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
10. Which variables would you expect to make a significant appearance in any data mining
classification model we work with?
# For the categorical variables, both the bar plots and the Chi-Square tests indicate
# a significant association with the target variable.
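The Chi-Square tests mentioned in this answer are not shown in the code. A minimal sketch of how such a test could be run on one cross-tabulation, using base R's `chisq.test`; the data frame `toy` and its counts are made up for illustration and are not results from Adult_nm.csv:

```r
# Toy stand-in for adultnm; the counts are made up for illustration only
toy <- data.frame(
  sex    = rep(c("Male", "Female"), times = c(100, 100)),
  income = c(rep(c("<=50K", ">50K"), times = c(60, 40)),   # males
             rep(c("<=50K", ">50K"), times = c(85, 15)))   # females
)

tab <- table(toy$income, toy$sex, dnn = c("income", "sex"))

# Test of independence between income and sex
test <- chisq.test(tab)
test$p.value  # a small p-value suggests the two variables are associated
```

On the assignment data the same call could be applied to `counts16`, `counts18` and `counts20`. For tables with many sparse cells, such as native.country, R may warn that the Chi-Square approximation is unreliable.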
11. For each pair of numerical variables, construct a scatter plot of the variables. Discuss
your salient results.
Ans.11)
# A fixed colour belongs in geom_point(), not inside aes(), which would treat "red" as data
ggplot(adultnm,aes(x=age,y=demogweight))+geom_point(colour="red")
ggplot(adultnm,aes(x=age,y=education.num))+geom_point(colour="red")
ggplot(adultnm,aes(x=age,y=capital.gain))+geom_point(colour="red")
ggplot(adultnm,aes(x=age,y=capital.loss))+geom_point(colour="red")
ggplot(adultnm,aes(x=age,y=hours.per.week))+geom_point(colour="red")
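The plots above pair age with each of the other variables, whereas the question asks for every pair of numerical variables. A scatter plot matrix covers all pairs in one figure. A minimal sketch with base R's `pairs`, using a made-up data frame `toy_num` in place of the numeric columns:

```r
# Toy stand-in for the numeric columns; the values are made up for illustration only
toy_num <- data.frame(
  age            = c(25, 38, 52, 41, 29, 60),
  education.num  = c(9, 13, 10, 14, 9, 16),
  hours.per.week = c(40, 50, 40, 60, 35, 45)
)

# One panel per pair of variables
pairs(toy_num, col = "red", pch = 19)

# The correlation matrix gives a numeric summary of the same pairs
cm <- round(cor(toy_num), 2)
cm
```

On the assignment data this would be `pairs(adultnm4)`. With a large data set the matrix can take a while to draw, so plotting a random sample of rows may be more practical.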
12. Based on your EDA so far, identify interesting sub-groups of records within the data set
that would be worth further investigation.
Ans.12)
For categorical variables, we observed in the bar plots as well as Chi-Square tests
that they are significant for the target variable.
13. Apply binning to Capital-Loss Variable. Do it in such a way as to maximize the effect of
the classes thus created (following the suggestions in the text). Apply the other two
binning methods (equal width, and equal number of records) to this same variable.
Compare the results and discuss the differences. Which method do you prefer?
Ans.13) We applied both equal-width and equal-frequency binning. With equal-width binning
the resulting sub-groups are informative about income; with equal-frequency binning they
offer no insight.
EDA lets us examine the data at a finer level of detail. For the categorical variables we
drew bar plots against the income variable and ran Chi-Square tests, both of which showed
how income varies within sub-groups.
For the numeric variables, equal-width binning gave us insight into how each variable
relates to the income variable.
With these results, further investigation can focus on the target sub-groups rather than
on the entire variable, which should give better results.
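The two binning methods compared in Ans.13 can be sketched as follows. The vector `x` is a made-up stand-in for capital.loss, and the sketch assumes that most of its values are zero, which is an assumption about the data rather than something stated above. Under that assumption several quantiles coincide, which may explain why equal-frequency bins turn out uninformative:

```r
# Made-up, zero-heavy stand-in for capital.loss (20 values, for illustration only)
x <- c(rep(0, 14), 1500, 1700, 1900, 2100, 2400, 4000)

# Equal width: 4 intervals of the same length across the range of x
equal_width <- cut(x, breaks = 4)
table(equal_width)

# Equal frequency: break at the quartiles; unique() is needed because the
# repeated zeros make several quartiles identical
q <- unique(quantile(x, probs = seq(0, 1, by = 0.25)))
equal_freq <- cut(x, breaks = q, include.lowest = TRUE)
table(equal_freq)
```

On this toy vector the equal-width call keeps four bins, while the quartile breaks collapse into two, one of which holds all the zeros. Whether the real variable behaves the same way would need to be checked with `table(adultnm$capital.loss == 0)`.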