Assignment EDA

Exercise on Exploratory Data Analysis
Note: Use the Adult_nm.csv for the following exercises. The target variable is income, and
the goal is to classify income based on the other variables.
The data description is available at https://archive.ics.uci.edu/ml/datasets/Adult

# The continuous variable demogweight represents final weight, which is the number of units
# in the target population that the responding unit represents.
# The variable education_num stands for the total number of years of education, which is a
# continuous representation of the discrete variable education. The variable relationship
# represents the responding unit's role in the family.
# capital_gain and capital_loss are income from investment sources other than wage/salary.
1. Which variables are categorical and which are continuous?

adultnm <- read.csv("Adult_nm.csv")  # load the data set named in the note above
str(adultnm)
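The output of str() has to be read column by column. A minimal sketch of separating the two kinds of variables programmatically is shown below. The data frame toy is a hypothetical stand-in for adultnm with made-up values, and treating every non-numeric column as categorical is an assumption.

```r
# Hypothetical toy stand-in for adultnm (values are made up for illustration)
toy <- data.frame(
  age = c(39, 50, 38, 53),
  education = c("Bachelors", "Bachelors", "HS-grad", "11th"),
  hours.per.week = c(40, 13, 40, 40),
  income = c("<=50K", "<=50K", "<=50K", ">50K")
)

# Flag numeric columns; everything else is treated as categorical
is_num <- sapply(toy, is.numeric)
continuous_vars  <- names(toy)[is_num]
categorical_vars <- names(toy)[!is_num]

print(continuous_vars)
print(categorical_vars)
```

A row-index column would also be flagged as numeric by this check, so the result still needs a quick sanity review.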
2. Construct a table of the first 10 records of the data set, in order to get a feel for the data.
head(adultnm,10)
3. Investigate whether we have any correlated variables.
library(dplyr)
# Keep the numeric columns by position: 1:2, 4, 6 and 12:14
adultnm4 <- select(adultnm, 1:2, 4, 6, 12:14)
cor(adultnm4)
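A full correlation matrix is hard to scan for the strongest relationships. The sketch below lists each pair of variables once, sorted by absolute correlation. The data frame toy and its values are hypothetical, used only so the example runs on its own.

```r
# Hypothetical toy data standing in for the numeric columns of adultnm
toy <- data.frame(
  age = c(25, 32, 47, 51, 38, 29),
  education.num = c(9, 13, 10, 14, 13, 9),
  hours.per.week = c(40, 45, 40, 60, 50, 35)
)

cm <- cor(toy)

# Take each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r = cm[idx]
)
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
print(pairs_df)
```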
4. For each pair of categorical variable and target variable, construct a cross-tabulation.
Discuss your salient results.
counts15 <- table(adultnm$income,adultnm$education,dnn=c("income","education"))
counts15
counts16 <- table(adultnm$income,adultnm$marital.status,dnn=c("income","marital.status"))
counts16
counts17 <- table(adultnm$income,adultnm$occupation,dnn=c("income","occupation"))
counts17
counts18 <- table(adultnm$income,adultnm$relationship,dnn=c("income","relationship"))
counts18
counts19 <- table(adultnm$income,adultnm$race,dnn=c("income","race"))
counts19
# Use a new name so the race table in counts19 is not overwritten
counts20 <- table(adultnm$income,adultnm$sex,dnn=c("income","sex"))
counts20
counts21 <- table(adultnm$income,adultnm$native.country,dnn=c("income","native.country"))
counts21
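Raw counts are hard to compare when the categories differ in size. A minimal sketch of converting a cross-tabulation to column proportions with prop.table() follows. The toy data frame is hypothetical.

```r
# Hypothetical toy data (values are made up for illustration)
toy <- data.frame(
  income = c("<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"),
  sex = c("Male", "Male", "Female", "Female", "Male", "Female")
)

counts <- table(toy$income, toy$sex, dnn = c("income", "sex"))

# margin = 2 normalizes within each column, so each sex sums to 1
props <- prop.table(counts, margin = 2)
print(round(props, 2))
```

Reading down a column then gives the share of each income class within that category.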
5. For each of the categorical variables, construct a bar chart of the variable, with an overlay
of the target variable. Plot the bar chart and normalized graph side by side.
library(ggplot2)
# The "+" must end the line; a line that starts with "+" is parsed as a new expression
ggplot(adultnm,aes(race))+geom_bar(aes(fill=income),position="fill")+
  scale_y_continuous("Percent")
ggplot(adultnm,aes(relationship))+geom_bar(aes(fill=income),position="fill")
ggplot(adultnm,aes(education))+geom_bar(aes(fill=income),position="fill")
ggplot(adultnm,aes(marital.status))+geom_bar(aes(fill=income),position="fill")
ggplot(adultnm,aes(sex))+geom_bar(aes(fill=income),position="fill")
ggplot(adultnm,aes(native.country))+geom_bar(aes(fill=income),position="fill")
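The exercise asks for the count chart and the normalized chart side by side. One possible sketch uses base graphics with par(mfrow = ...) instead of ggplot2, so that it needs no extra packages. The toy data frame is hypothetical.

```r
# Hypothetical toy data (values are made up for illustration)
toy <- data.frame(
  income = c("<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"),
  sex = c("Male", "Male", "Female", "Female", "Male", "Female", "Male", "Female")
)

counts <- table(toy$income, toy$sex)      # rows: income, columns: sex
props  <- prop.table(counts, margin = 2)  # normalize within each sex

# Two panels in one row: stacked counts on the left, proportions on the right
old_par <- par(mfrow = c(1, 2))
barplot(counts, legend.text = TRUE, main = "Counts", ylab = "Count")
barplot(props, main = "Normalized", ylab = "Proportion")
par(old_par)
```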
6. Discuss the relationship, if any, each of these variables has with the target variables.
#The graphs and the contingency tables show some trends. The more educated a person is,
#the higher his/her chances of making >50K. Race also affects income: we see a higher
#proportion of people with income >50K among White and Asian/Pacific Islander respondents
#than among the other groups.
#Relationship also shows a trend: Husband and Wife have a higher proportion of >50K
#income relative to the others.
#Marital status also has an effect: Married-AF-spouse and Married-civ-spouse have a high
#share of >50K earners.
#Gender also affects income: males have a higher proportion of >50K than females.
#Native country also affects the proportion of income >50K, as we can see in the graph.
Chi-Square Tests to check the significance
#To check the significance of each categorical variable, we run a chi-square test.

#If p-value <= 0.05, we reject the null hypothesis (significant).
#If p-value > 0.05, we fail to reject the null hypothesis (not significant).
sex <- table(adultnm$sex,adultnm$income)

sex
sex1 <- chisq.test(sex)
sex1
#Sex is significant
race <- table(adultnm$race,adultnm$income)

race
race1 <- chisq.test(race)
race1
#Race is significant
occupation <- table(adultnm$occupation,adultnm$income)

occupation
occupation1 <- chisq.test(occupation)
occupation1
#occupation is significant
relationship <- table(adultnm$relationship,adultnm$income)

relationship
relationship1 <- chisq.test(relationship)
relationship1
#relationship is significant
marital.status <- table(adultnm$marital.status,adultnm$income)

marital.status
marital.status1 <- chisq.test(marital.status)
marital.status1
#Marital Status is significant
education <- table(adultnm$education,adultnm$income)

education
education1 <- chisq.test(education)
education1
#Education is significant
native.country <- table(adultnm$native.country,adultnm$income)

native.country
native.country1 <- chisq.test(native.country)
native.country1
#native.country is significant
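The same test is repeated for every categorical variable, so it can be run in a loop. The sketch below collects the p-values with sapply() and applies the 0.05 rule. The toy data frame, including the variable region, is hypothetical.

```r
# Hypothetical toy data: sex is related to income, region is not
toy <- data.frame(
  sex = rep(c("Male", "Female"), times = c(60, 40)),
  region = rep(c("A", "B"), times = 50),
  income = c(rep(c(">50K", "<=50K"), times = c(30, 30)),
             rep(c(">50K", "<=50K"), times = c(5, 35)))
)

cat_vars <- c("sex", "region")

# One chi-square test per categorical variable against income
p_values <- sapply(cat_vars, function(v) {
  chisq.test(table(toy[[v]], toy$income))$p.value
})
print(p_values)

# Variables for which we reject the null hypothesis at the 0.05 level
significant <- names(p_values)[p_values <= 0.05]
print(significant)
```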
7. Which variables would you expect to make a significant appearance in any data mining
classification model we work with?
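Based on the chi-square tests in Q6, all the categorical variables tested (sex, race,
occupation, relationship, marital.status, education and native.country) are significant,
so we would expect them to appear in a classification model.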
8. Report the mean, median, minimum, maximum, and standard deviation for each of the
numerical variables.
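# summary() reports the mean, median, minimum and maximum of each column
summary(adultnm4)
# Standard deviations are computed separately
sd(adultnm4$X)
sd(adultnm4$age)
sd(adultnm4$demogweight)
sd(adultnm4$education.num)
sd(adultnm4$capital.gain)
sd(adultnm4$capital.loss)
sd(adultnm4$hours.per.week)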
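The five statistics can also be gathered into a single table. A minimal sketch with sapply() follows. The toy data frame is hypothetical.

```r
# Hypothetical toy data (values are made up for illustration)
toy <- data.frame(
  age = c(25, 32, 47, 51, 38),
  hours.per.week = c(40, 45, 40, 60, 50)
)

# One column of statistics per variable
stats <- sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
})

# Transpose so that each row is a variable
print(round(t(stats), 2))
```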
9. Construct a normalized histogram of each numerical variable, with an overlay of the
target variable income.
ggplot(adultnm,aes(hours.per.week))+geom_histogram(aes(fill=income),position="fill")
ggplot(adultnm,aes(capital.loss))+geom_histogram(aes(fill=income),position="fill")
ggplot(adultnm,aes(education.num))+geom_histogram(aes(fill=income),position="fill")
ggplot(adultnm,aes(capital.gain))+geom_histogram(aes(fill=income),position="fill")
ggplot(adultnm,aes(age))+geom_histogram(aes(fill=income),position="fill")
ggplot(adultnm,aes(demogweight))+geom_histogram(aes(fill=income),position="fill")
10. Which variables would you expect to make a significant appearance in any data mining
classification model we work with?
#For categorical variables, we observed in the bar plots as well as the chi-square tests
#that they are significant for the target variable.
#For numerical variables:

#Equal-width binning was applied to check the impact of the sub-groups on the target variable.
#Conclusion: for five of the six numerical variables, the sub-groups offer insight on the
#target variable. Only for demogweight do the sub-groups have a similar distribution of the
#target variable, so that sub-grouping is not insightful.
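The binning code itself is not shown above. A minimal sketch of equal-width binning with cut(), followed by the share of each income class per bin, could look like this. The toy data frame and the choice of three bins are assumptions.

```r
# Hypothetical toy data (values are made up for illustration)
toy <- data.frame(
  age = c(19, 23, 28, 34, 41, 45, 52, 58, 63, 70),
  income = c("<=50K", "<=50K", "<=50K", ">50K", ">50K",
             "<=50K", ">50K", ">50K", "<=50K", "<=50K")
)

# breaks = 3 asks cut() for three intervals of equal width
toy$age_bin <- cut(toy$age, breaks = 3)

# Share of each income class within each bin (rows sum to 1)
bin_table <- table(toy$age_bin, toy$income)
bin_props <- prop.table(bin_table, margin = 1)
print(round(bin_props, 2))
```

If the rows of bin_props look alike, the sub-groups say little about the target; if they differ, the binning is informative.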
11. For each pair of numerical variables, construct a scatter plot of the variables. Discuss
your salient results.
Ans.11)
# The colour is set in geom_point(); inside aes() "red" would be treated as a data value
ggplot(adultnm,aes(x=age,y=demogweight))+geom_point(col="red")
ggplot(adultnm,aes(x=age,y=education.num))+geom_point(col="red")
ggplot(adultnm,aes(x=age,y=capital.gain))+geom_point(col="red")
ggplot(adultnm,aes(x=age,y=capital.loss))+geom_point(col="red")
ggplot(adultnm,aes(x=age,y=hours.per.week))+geom_point(col="red")
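The plots above pair age with each other variable, while the exercise asks for every pair. With six numerical variables there are choose(6, 2) = 15 pairs. A sketch that lists the pairs with combn() and draws them all at once with the base function pairs() is shown below, on a hypothetical three-column toy data frame.

```r
# Hypothetical toy data standing in for the numeric columns of adultnm
toy <- data.frame(
  age = c(25, 32, 47, 51, 38, 29),
  education.num = c(9, 13, 10, 14, 13, 9),
  hours.per.week = c(40, 45, 40, 60, 50, 35)
)

# Every unordered pair of variables, one pair per column
num_pairs <- combn(names(toy), 2)
print(t(num_pairs))

# Scatter plot matrix: one panel per pair of variables
pairs(toy, col = "red")
```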
12. Based on your EDA so far, identify interesting sub-groups of records within the data set
that would be worth further investigation.
Ans.12)
For categorical variables, we observed in the bar plots as well as the chi-square tests
that they are significant for the target variable.
For numerical variables:

Equal-width binning was applied to check the impact of the sub-groups on the target variable.
Conclusion: for five of the six numerical variables, the sub-groups offer insight on the
target variable. Only for demogweight do the sub-groups have a similar distribution of the
target variable, so that sub-grouping is not insightful.
#For capital.loss (Subgroups are insightful)
#For age (Subgroups are insightful)
#For demogweight (Subgroups are not insightful)
#For education.num (Subgroups are insightful)
#For capital.gain (Subgroups are insightful)
13. Apply binning to Capital-Loss Variable. Do it in such a way as to maximize the effect of
the classes thus created (following the suggestions in the text). Apply the other two
binning methods (equal width, and equal number of records) to this same variable.
Compare the results and discuss the differences. Which method do you prefer?
Ans.13) We applied both equal-width binning and equal-frequency binning. The sub-groupings
are insightful vis-à-vis income in the case of equal-width binning, but offer no insight
in the case of equal-frequency binning.
Equal Width binning for Capital-Loss Variable in terms of income

Equal Frequency binning for Capital-Loss Variable in terms of income
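The two methods can be sketched as below. The vector toy_loss is hypothetical and assumes that most records have a capital loss of zero. Under that assumption the quantile break points coincide, which is one possible reason equal-frequency binning gives no useful sub-groups.

```r
# Hypothetical toy values: mostly zero, with a few large losses
toy_loss <- c(rep(0, 16), 1408, 1590, 1887, 2415)

# Equal width: three intervals of the same width over the range
equal_width <- cut(toy_loss, breaks = 3)
print(table(equal_width))

# Equal frequency: break points at the 0, 1/3, 2/3 and 1 quantiles
q_breaks <- quantile(toy_loss, probs = seq(0, 1, length.out = 4))
print(q_breaks)

# With so many zeros the inner break points collapse onto 0,
# so only one usable interval remains
n_distinct_breaks <- length(unique(q_breaks))
print(n_distinct_breaks)
```

Passing q_breaks directly to cut() would fail here because the breaks are not unique.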
14. Summarize your salient EDA findings from the above exercises, just as if you were
writing a report.
Ans.14
EDA allows us to be more specific when investigating the target variable.
A simple table gives us only the counts and proportions of the target variable,
but EDA lets us analyze the data in a way that reveals finer details. For the categorical
variables, we drew bar plots against the income variable and ran chi-square tests; these
gave us insights about the income variable within sub-groups.
For the numeric variables, we applied equal-width binning, which gave us insights on
the variables vis-à-vis the income variable.
With these results, we can direct our further investigations at the target sub-groups
rather than at the entire variable, and obtain better results.
