
Exercise on Exploratory Data Analysis

Note: Use the Adult_nm.csv for the following exercises. The target variable is income, and
the goal is to classify income based on the other variables.

The data description is available at https://archive.ics.uci.edu/ml/datasets/Adult


# The continuous variable demogweight represents the final weight, which is the number of units
# in the target population that the responding unit represents.
# The variable education_num stands for the total number of years of education, which is a
# continuous representation of the discrete variable education. The variable relationship
# represents the responding unit's role in the family.
# capital_gain and capital_loss record gains and losses from investment sources other than wage/salary.
# (In the R code below these columns appear as education.num, capital.gain, and capital.loss.)

1. Which variables are categorical and which are continuous?


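adultnm <- read.csv("Adult_nm.csv")  # assumes the file is in the working directory
str(adultnm)  # numeric/integer columns are continuous; character/factor columns are categorical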
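The split between categorical and continuous variables can also be listed programmatically by testing each column's type. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in for the real data) so that it runs on its own; with the real data one would pass `adultnm` instead.

```r
# Hypothetical toy data standing in for adultnm (values are illustrative only)
toy <- data.frame(
  age = c(39, 50, 38, 53),
  education = c("Bachelors", "Bachelors", "HS-grad", "11th"),
  hours.per.week = c(40, 13, 40, 40),
  income = c("<=50K", "<=50K", "<=50K", ">50K"),
  stringsAsFactors = TRUE
)

# Numeric columns are treated as continuous, everything else as categorical
is_num <- sapply(toy, is.numeric)
continuous_vars <- names(toy)[is_num]
categorical_vars <- names(toy)[!is_num]

print(continuous_vars)
print(categorical_vars)
```

Note that a numerically coded category (for example an ID column) would be reported as continuous by this check, so the result still needs a manual look.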

2. Construct a table of the first 10 records of the data set, in order to get a feel for the data.
head(adultnm,10)

3. Investigate whether we have any correlated variables.

library(dplyr)
# Keep the numeric columns by position: 1:2, 4, 6 and 12:14
# (X, age, demogweight, education.num, capital.gain, capital.loss, hours.per.week)
adultnm4 <- select(adultnm, 1:2, 4, 6, 12:14)
# Pairwise Pearson correlations between the numeric columns
cor(adultnm4)
4. For each pair of categorical variable and target variable, construct a cross-tabulation.
Discuss your salient results.

counts15 <- table(adultnm$income, adultnm$education, dnn = c("income", "education"))
counts15

counts16 <- table(adultnm$income, adultnm$marital.status, dnn = c("income", "marital.status"))
counts16

counts17 <- table(adultnm$income, adultnm$occupation, dnn = c("income", "occupation"))
counts17

counts18 <- table(adultnm$income, adultnm$relationship, dnn = c("income", "relationship"))
counts18

counts19 <- table(adultnm$income, adultnm$race, dnn = c("income", "race"))
counts19

# Stored under its own name so the race table above is not overwritten
counts20 <- table(adultnm$income, adultnm$sex, dnn = c("income", "sex"))
counts20

counts21 <- table(adultnm$income, adultnm$native.country, dnn = c("income", "native.country"))
counts21
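Raw counts are hard to compare when the categories differ in size, so it can help to convert each cross-tabulation into column proportions before discussing it. The following is a minimal sketch on made-up data (`toy` is hypothetical, not the Adult data); `prop.table()` is base R.

```r
# Hypothetical toy data standing in for adultnm (values are illustrative only)
toy <- data.frame(
  income = c("<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"),
  sex    = c("Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male")
)

counts <- table(toy$income, toy$sex, dnn = c("income", "sex"))

# margin = 2 divides each column by its total, giving the share of each
# income class within every category of the predictor
col_props <- prop.table(counts, margin = 2)
print(round(col_props, 2))
```

With the real data, the same call on any of the tables above (for example `prop.table(counts15, margin = 2)`) gives the share of >50K earners within each category.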

5. For each of the categorical variables, construct a bar chart of the variable, with an overlay
of the target variable. Plot the bar chart and normalized graph side by side.

library(ggplot2)
# The "+" must end a line, not start one; otherwise R treats the next line as a new statement
ggplot(adultnm, aes(race)) + geom_bar(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
ggplot(adultnm, aes(relationship)) + geom_bar(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
ggplot(adultnm, aes(education)) + geom_bar(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
ggplot(adultnm, aes(marital.status)) + geom_bar(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
ggplot(adultnm, aes(sex)) + geom_bar(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
ggplot(adultnm, aes(native.country)) + geom_bar(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
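The exercise asks for the count bar chart and the normalized chart side by side, whereas the calls above draw only the normalized version. One possible way to place both on one page is sketched below. It assumes the `gridExtra` package is installed (it is not used elsewhere in this document) and uses a made-up data frame `toy` in place of `adultnm`.

```r
library(ggplot2)
library(gridExtra)  # assumption: installed; provides grid.arrange()

# Hypothetical toy data standing in for adultnm (values are illustrative only)
toy <- data.frame(
  sex    = c("Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male"),
  income = c("<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K")
)

# Stacked counts: shows how many records fall into each category
p_count <- ggplot(toy, aes(sex)) +
  geom_bar(aes(fill = income)) +
  scale_y_continuous("Count")

# Normalized bars: shows the income split within each category
p_norm <- ggplot(toy, aes(sex)) +
  geom_bar(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")

# Draw the two plots next to each other
grid.arrange(p_count, p_norm, ncol = 2)
```

The count chart keeps the category sizes visible, which the normalized chart hides, so reading the two together guards against over-interpreting a proportion based on very few records.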
6. Discuss the relationship, if any, each of these variables has with the target variables.

# The graphs and the contingency tables show several trends.
# Education: the more educated a person is, the higher the chance of earning >50K.
# Race: the proportion of people with income >50K is higher for White and Asian/Pacific
# Islander respondents than for the other groups.
# Relationship: husbands and wives have a higher proportion of >50K income than the other roles.
# Marital status: Married-AF-spouse and Married-civ-spouse have high proportions of >50K earners.
# Sex: males have a higher proportion of >50K income than females.
# Native country: the proportion of >50K income also varies by country, as the graph shows.

Chi-Square Tests to check the significance

# To check whether each categorical variable is associated with income, we run a chi-square
# test of independence.
# If p-value <= 0.05, we reject the null hypothesis of independence (significant).
# If p-value > 0.05, we fail to reject the null hypothesis (not significant).

sex <- table(adultnm$sex, adultnm$income)
sex
sex1 <- chisq.test(sex)
sex1
# Sex is significant

race <- table(adultnm$race, adultnm$income)
race
race1 <- chisq.test(race)
race1
# Race is significant

occupation <- table(adultnm$occupation, adultnm$income)
occupation
occupation1 <- chisq.test(occupation)
occupation1
# Occupation is significant

relationship <- table(adultnm$relationship, adultnm$income)
relationship
relationship1 <- chisq.test(relationship)
relationship1
# Relationship is significant

marital.status <- table(adultnm$marital.status, adultnm$income)
marital.status
marital.status1 <- chisq.test(marital.status)
marital.status1
# Marital status is significant

education <- table(adultnm$education, adultnm$income)
education
education1 <- chisq.test(education)
education1
# Education is significant

native.country <- table(adultnm$native.country, adultnm$income)
native.country
native.country1 <- chisq.test(native.country)
native.country1
# Native country is significant
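The repeated table-and-test pattern above can be condensed into a loop that collects one p-value per predictor. This is a sketch on deterministic made-up data (`toy` and its columns are hypothetical): income is constructed to depend on `sex` but not on `race`, so the two outcomes of the decision rule are both visible.

```r
# Hypothetical toy data standing in for adultnm (values are illustrative only)
toy <- data.frame(
  sex    = rep(c("Female", "Male"), each = 50),
  income = c(rep(c("<=50K", ">50K"), times = c(40, 10)),   # Female: 40 vs 10
             rep(c("<=50K", ">50K"), times = c(20, 30))),  # Male:   20 vs 30
  race   = rep(c("A", "B"), times = 50)                    # unrelated to income
)

cat_vars <- c("sex", "race")

# One chi-square test of independence per predictor; keep only the p-value
p_values <- sapply(cat_vars, function(v) {
  chisq.test(table(toy[[v]], toy$income))$p.value
})

print(data.frame(p.value = p_values, significant = p_values <= 0.05))
```

With the real data, `cat_vars` would hold the categorical column names and `toy` would be replaced by `adultnm`. For sparse tables such as native.country, `chisq.test()` may warn that the approximation is unreliable, which is worth checking before trusting the p-value.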

7. Which variables would you expect to make a significant appearance in any data mining
classification model we work with?

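The chi-square tests in Q6 found every categorical variable tested (sex, race, occupation, relationship, marital.status, education, and native.country) to be significantly associated with income, so we would expect these variables to appear in a classification model.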

8. Report the mean, median, minimum, maximum, and standard deviation for each of the
numerical variables.
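summary(adultnm4)  # minimum, median, mean, and maximum of every numeric column
sd(adultnm4$X)
sd(adultnm4$age)   # age was missing from the original list of sd() calls
sd(adultnm4$demogweight)
sd(adultnm4$education.num)
sd(adultnm4$capital.gain)
sd(adultnm4$capital.loss)
sd(adultnm4$hours.per.week)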
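The five requested statistics can also be gathered into a single table with one `sapply()` call. The sketch below runs on a small made-up data frame (`toy` is hypothetical); with the real data, `adultnm4` would take its place.

```r
# Hypothetical toy data standing in for adultnm4 (values are illustrative only)
toy <- data.frame(
  age            = c(20, 30, 40, 50),
  hours.per.week = c(10, 40, 40, 50)
)

# Apply the same five summary functions to every column
stats <- sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
})

# Transpose so that each variable is a row and each statistic a column
print(round(t(stats), 2))
```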

9. Construct a normalized histogram of each numerical variable, with an overlay of the
target variable income.

# The "+" must end a line, not start one; otherwise R treats the next line as a new statement
ggplot(adultnm, aes(hours.per.week)) + geom_histogram(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
ggplot(adultnm, aes(capital.loss)) + geom_histogram(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
ggplot(adultnm, aes(education.num)) + geom_histogram(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
ggplot(adultnm, aes(capital.gain)) + geom_histogram(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
ggplot(adultnm, aes(age)) + geom_histogram(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
ggplot(adultnm, aes(demogweight)) + geom_histogram(aes(fill = income), position = "fill") +
  scale_y_continuous("Proportion")
10. Which variables would you expect to make a significant appearance in any data mining
classification model we work with?

# For the categorical variables, both the bar plots and the chi-square tests indicate
# a significant association with the target variable.

# For the numerical variables, equal-width binning was applied to check the effect of the
# sub-groups on the target variable.
# Conclusion: for five of the six numerical variables, the sub-groups offer insight into the
# target variable. Only for demogweight do the sub-groups show a similar distribution of the
# target variable, so binning that variable is not insightful.
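The equal-width binning referred to here is not shown in the worksheet. A minimal sketch of how it might be done with base R's `cut()` follows; the data frame `toy`, the column `age_bin`, and the choice of three bins are all assumptions made for illustration.

```r
# Hypothetical toy data standing in for adultnm (values are illustrative only)
toy <- data.frame(
  age    = c(18, 22, 25, 31, 38, 44, 47, 52, 58, 63, 70, 78),
  income = c("<=50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K",
             ">50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K")
)

# A single number for `breaks` splits the observed range into that many
# intervals of equal width
toy$age_bin <- cut(toy$age, breaks = 3)

# margin = 1 gives the share of each income class within every bin
bin_props <- prop.table(table(toy$age_bin, toy$income), margin = 1)
print(round(bin_props, 2))
```

If the income shares differ clearly between bins, the sub-grouping carries information about the target; if they are roughly the same in every bin, as reported for demogweight, it does not.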

11. For each pair of numerical variables, construct a scatter plot of the variables. Discuss
your salient results.
Ans.11)
# colour is set outside aes() so the points are actually drawn in red
ggplot(adultnm, aes(x = age, y = demogweight)) + geom_point(colour = "red")
ggplot(adultnm, aes(x = age, y = education.num)) + geom_point(colour = "red")
ggplot(adultnm, aes(x = age, y = capital.gain)) + geom_point(colour = "red")
ggplot(adultnm, aes(x = age, y = capital.loss)) + geom_point(colour = "red")
ggplot(adultnm, aes(x = age, y = hours.per.week)) + geom_point(colour = "red")
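The calls above pair age with each of the other numerical variables, while the exercise asks for every pair. One way to cover all 15 pairs of the six variables is sketched below, assuming ggplot2 3.0 or later for the `.data` pronoun; the data frame `toy` holds made-up values and `plots` is a hypothetical name.

```r
library(ggplot2)

# Hypothetical toy data standing in for adultnm (values are illustrative only)
toy <- data.frame(
  age            = c(25, 38, 52, 46),
  demogweight    = c(226802, 89814, 336951, 160323),
  education.num  = c(7, 9, 12, 10),
  capital.gain   = c(0, 0, 0, 7688),
  capital.loss   = c(0, 0, 1902, 0),
  hours.per.week = c(40, 50, 40, 40)
)

num_vars <- names(toy)

# combn() enumerates every unordered pair; FUN builds one scatter plot per pair
plots <- combn(num_vars, 2, simplify = FALSE, FUN = function(p) {
  ggplot(toy, aes(x = .data[[p[1]]], y = .data[[p[2]]])) +
    geom_point(colour = "red") +
    labs(x = p[1], y = p[2])
})

length(plots)       # choose(6, 2) pairs
print(plots[[1]])   # age vs demogweight
```

Base R's `pairs(toy)` draws a comparable scatter plot matrix in a single call, at the cost of less control over each panel.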
12. Based on your EDA so far, identify interesting sub-groups of records within the data set
that would be worth further investigation.
Ans.12)
For the categorical variables, both the bar plots and the chi-square tests indicate
a significant association with the target variable.

For the numerical variables, equal-width binning was applied to check the effect of the
sub-groups on the target variable.
Conclusion: for five of the six numerical variables, the sub-groups offer insight into the
target variable. Only for demogweight do the sub-groups show a similar distribution of the
target variable, so binning that variable is not insightful.

# For capital.loss (sub-groups are insightful)
# For age (sub-groups are insightful)
# For demogweight (sub-groups are not insightful)
# For education.num (sub-groups are insightful)
# For capital.gain (sub-groups are insightful)

13. Apply binning to Capital-Loss Variable. Do it in such a way as to maximize the effect of
the classes thus created (following the suggestions in the text). Apply the other two
binning methods (equal width, and equal number of records) to this same variable.
Compare the results and discuss the differences. Which method do you prefer?

Ans.13) We applied both equal-width and equal-frequency binning. The sub-groupings are
insightful with respect to income under equal-width binning, but offer no insight under
equal-frequency binning.

Equal Width binning for Capital-Loss Variable in terms of income


Equal Frequency binning for Capital-Loss Variable in terms of income
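The two binning methods compared above can be sketched in base R as follows. The vector `capital_loss` is made up for illustration; it is mostly zeros, as capital.loss is in the Adult data, which suggests one plausible reason why equal-frequency binning was uninformative here: the quantile cut points collapse onto the same value.

```r
# Hypothetical zero-inflated values mimicking capital.loss (illustrative only)
capital_loss <- c(rep(0, 16), 1408, 1902, 2415, 3900)

# Equal width: three intervals of the same length across the observed range
equal_width <- cut(capital_loss, breaks = 3)

# Equal frequency: cut points at the tertiles of the data
q <- quantile(capital_loss, probs = seq(0, 1, length.out = 4))
print(q)

# cut() requires unique break points, so duplicated quantiles are dropped
equal_freq <- cut(capital_loss, breaks = unique(q), include.lowest = TRUE)

print(table(equal_width))
print(table(equal_freq))
```

In this toy case the tertiles are all zero except the maximum, so equal-frequency binning leaves a single bin that contains every record, while equal-width binning separates the zero records from the moderate and large losses.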
14. Summarize your salient EDA findings from the above exercises, just as if you were
writing a report.
Ans.14)
EDA lets us be more specific when investigating the target variable.
A simple table gives only the overall counts and proportions of the income classes.

EDA goes further and reveals finer detail. For the categorical variables, we drew bar plots
against income and ran chi-square tests; both showed how income varies across the
sub-groups of each variable.

For the numerical variables, equal-width binning showed how each variable relates to income.
With these results, further investigation can focus on the relevant sub-groups rather than on
entire variables, which should give better results.
