
Use the same file for this assignment that you used for the practice assignment on Univariate Statistics.

Does your data follow a normal distribution?


Note: you can revisit this question once you have gone through the hypothesis testing module.
Find out whether the variable hours.per.week follows a normal distribution using density plots. Your plot should look like this:

setwd("/Users/lalitsachan/Desktop/March onwards/CBAP with R/Data/")


# You'll have to choose the path according to the location of the file on your machine

d=read.csv("census_income.csv",stringsAsFactors = F)
library(ggplot2)

d$temp=dnorm(d$hours.per.week,mean=mean(d$hours.per.week),sd=sd(d$hours.per.week))

ggplot(d,aes(x=hours.per.week))+
geom_density(color="red")+
geom_line(aes(x=hours.per.week,y=temp),color="green")

[Figure: kernel density of hours.per.week (red) overlaid with the fitted normal curve (green); x-axis: hours.per.week, y-axis: density]

d$temp=NULL
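
The comparison above can also be sketched without the census file. The block below is a minimal illustration, not part of the original solution: the `hours` vector is simulated (an assumption standing in for d$hours.per.week), and `max_gap` is an informal, hypothetical measure of how far the estimated density sits from the fitted normal curve. The ggplot2 part uses stat_function() to draw the normal curve directly, so no temporary column is needed.

```r
# Simulated stand-in for d$hours.per.week (assumption: not the real census data)
set.seed(42)
hours <- pmin(pmax(round(rnorm(1000, mean = 40, sd = 12)), 1), 99)

m <- mean(hours)
s <- sd(hours)

# Kernel density estimate and the fitted normal curve on the same grid
kde <- density(hours, from = min(hours), to = max(hours), n = 200)
normal_curve <- dnorm(kde$x, mean = m, sd = s)

# Largest vertical gap between the two curves: a rough, informal measure of fit
max_gap <- max(abs(kde$y - normal_curve))
print(max_gap)

# Same overlay in ggplot2, built only if the package is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(hours = hours), aes(x = hours)) +
    geom_density(color = "red") +
    stat_function(fun = dnorm, args = list(mean = m, sd = s), color = "green")
  # print(p) draws the plot
}
```

With real data, a large gap around the most common values would suggest the variable is not close to normal.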

Correlation between two categorical variables
Use stacked bar charts to find out whether the outcome Y is affected by [behaves differently across] the variables race or
relationship. Your code should produce the following plots:

ggplot(d,aes(x=race,fill=Y))+
geom_bar(position = "fill")+
ylab("Percentage")+
theme(axis.text.x = element_text(face="bold", color="#993333",
size=9, angle=45))

[Figure: stacked bar chart showing the share of each Y class (<=50K, >50K) within each race category (Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other, White); x-axis: race, y-axis: Percentage]

ggplot(d,aes(x=relationship,fill=Y))+
geom_bar(position="fill")+
ylab("Percentage")+
theme(axis.text.x = element_text(face="bold", color="#993333",
size=9, angle=45))

[Figure: stacked bar chart showing the share of each Y class (<=50K, >50K) within each relationship category (Husband, Not-in-family, Other-relative, Own-child, Unmarried, Wife); x-axis: relationship, y-axis: Percentage]

Note: You'll have to explore the position option in geom_bar.
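
As a rough illustration of what position="fill" does, the sketch below computes the same within-category shares by hand. The `toy` data frame is made up (an assumption standing in for the race and Y columns of d); geom_bar also accepts position="stack" (the default, raw counts) and position="dodge" (side-by-side bars).

```r
# Tiny made-up data frame (assumption: stands in for columns race and Y of d)
toy <- data.frame(
  race = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  Y    = c("<=50K", ">50K", "<=50K", "<=50K", "<=50K",
           ">50K", ">50K", "<=50K", ">50K")
)

# Counts per category: what position = "stack" would draw
counts <- table(toy$race, toy$Y)

# Row-wise shares: what position = "fill" would draw (each bar sums to 1)
shares <- prop.table(counts, margin = 1)
print(shares)
```

Because every bar is rescaled to height 1, differences in the split between the Y classes stand out even when the categories have very different sizes.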

Use smoothing to see natural patterns


Show visually that hours per week vary similarly across age for both sexes. Your code should produce the
following plot:

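# alpha is a fixed setting, so it goes outside aes(); inside aes() it is treated as a variable and adds a spurious "0.2" legend
ggplot(d,aes(x=age,y=hours.per.week,color=sex))+
geom_point(alpha=0.2)+
geom_smooth()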

[Figure: scatter plot of hours.per.week against age, coloured by sex (Female, Male), with a smoothed trend line for each sex; x-axis: age, y-axis: hours.per.week]
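
To get a feel for what a smoother adds on top of the raw points, here is a minimal sketch on simulated data. Everything in `toy` is made up (including the assumed rise-then-fall shape of hours over age), and base R loess() is used as a stand-in smoother; geom_smooth() chooses its own method depending on the sample size, so the curves it draws on the census data need not match a loess fit.

```r
set.seed(1)
n <- 300

# Simulated stand-in for age, sex and hours.per.week (assumption)
toy <- data.frame(
  age = sample(18:80, n, replace = TRUE),
  sex = sample(c("Female", "Male"), n, replace = TRUE)
)
# Assumed shape: hours rise until mid-career, then fall, plus noise
toy$hours <- 45 - 0.015 * (toy$age - 45)^2 + rnorm(n, sd = 8)

# One local-regression fit per sex
fits <- lapply(split(toy, toy$sex), function(g) loess(hours ~ age, data = g))

# Smoothed values on a common age grid, one column per sex
grid <- data.frame(age = seq(25, 70, by = 5))
smoothed <- sapply(fits, predict, newdata = grid)
print(cbind(grid, round(smoothed, 1)))
```

Comparing the two columns row by row is the numerical counterpart of checking whether the two smoothed lines in the plot follow a similar path.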

Using point plots with jitter instead of boxplots


Prepare the following plots of capital.gain and capital.loss against the outcome Y to examine their behaviour across
both outcomes.

ggplot(d,aes(y=capital.gain,x=Y,color=Y))+
geom_point()

[Figure: point plot of capital.gain for each class of Y (<=50K, >50K), coloured by Y; x-axis: Y, y-axis: capital.gain]

ggplot(d,aes(y=capital.loss,x=Y,color=Y))+
geom_point()

[Figure: point plot of capital.loss for each class of Y (<=50K, >50K), coloured by Y; x-axis: Y, y-axis: capital.loss]

As you can see, these plots give a fair idea of the ranges of capital.gain and capital.loss but say nothing about the
density or count of observations at particular levels, which is crucial information. We can remove this
drawback by adding some jitter to the point positions. Your code should produce the following plots:

# geom_jitter() already draws the points, so a separate geom_point() layer would plot every observation twice
ggplot(d,aes(y=capital.gain,x=Y,color=Y))+
geom_jitter()

[Figure: jittered point plot of capital.gain for each class of Y (<=50K, >50K), coloured by Y; x-axis: Y, y-axis: capital.gain]

# geom_jitter() already draws the points, so a separate geom_point() layer would plot every observation twice
ggplot(d,aes(y=capital.loss,x=Y,color=Y))+
geom_jitter()

[Figure: jittered point plot of capital.loss for each class of Y (<=50K, >50K), coloured by Y; x-axis: Y, y-axis: capital.loss]
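
The overplotting problem and the effect of jitter can be sketched with a few lines of base R. The `gain` vector is made up (an assumption standing in for capital.gain within one class of Y), and the jitter range of 0.3 is an arbitrary choice for illustration.

```r
set.seed(7)

# Made-up stand-in for capital.gain within one outcome class:
# mostly zeros plus a few repeated values (assumption)
gain <- c(rep(0, 90), rep(5000, 6), rep(99999, 4))
x_pos <- rep(1, length(gain))   # every point in one class shares the same x

# Without jitter, identical observations land on the same spot
distinct_before <- nrow(unique(data.frame(x_pos, gain)))

# Horizontal-only jitter: spread x a little, leave the measured value alone
x_jit <- x_pos + runif(length(gain), min = -0.3, max = 0.3)
distinct_after <- nrow(unique(data.frame(x_jit, gain)))

print(c(before = distinct_before, after = distinct_after))
```

In ggplot2 the equivalent setting would be geom_jitter(width=0.3, height=0). By default geom_jitter() also moves points vertically, which slightly changes the plotted values, so setting height=0 is worth considering when the y-axis carries the measurement.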

Violin Plots
Examine the behaviour of fnlwgt across the classes of Y [and, within each class, for both sexes]. Your code should
produce the following plot:

ggplot(d,aes(x=Y,y=fnlwgt,color=Y,fill=sex))+
geom_violin()

[Figure: violin plot of fnlwgt for each class of Y (<=50K, >50K), outlined by Y and filled by sex (Female, Male); x-axis: Y, y-axis: fnlwgt]
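
A violin is essentially a kernel density estimate drawn sideways and mirrored, one per group. The sketch below shows that underlying calculation with base R density(); the `toy` data are simulated from log-normal distributions, which is purely an assumption made for illustration and not a claim about the real fnlwgt values.

```r
set.seed(3)

# Simulated stand-in for fnlwgt in the two classes of Y (assumption)
toy <- data.frame(
  Y = rep(c("<=50K", ">50K"), each = 200),
  fnlwgt = c(rlnorm(200, meanlog = 12, sdlog = 0.5),
             rlnorm(200, meanlog = 12.1, sdlog = 0.4))
)

# One kernel density estimate per class; the half-width of a violin
# at a given height is proportional to the estimated density there
dens <- lapply(split(toy$fnlwgt, toy$Y), density)

# Where each violin would be widest
peaks <- sapply(dens, function(k) k$x[which.max(k$y)])
print(round(peaks))
```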

Which visualisation to select


Examine hours.per.week by workclass. Which would be the better choice: a box plot, a violin plot, or a simple
point plot with jitter? Think about and discuss the pros and cons associated with each.

ggplot(d,aes(x=workclass,y=hours.per.week,fill=workclass))+
geom_boxplot()+
theme(axis.text.x=element_blank())

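[Figure: box plot of hours.per.week for each workclass (?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay), filled by workclass; x-axis: workclass (tick labels hidden), y-axis: hours.per.week]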

ggplot(d,aes(x=workclass,y=hours.per.week,fill=workclass))+
geom_violin()+
theme(axis.text.x=element_blank())

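[Figure: violin plot of hours.per.week for each workclass (?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay), filled by workclass; x-axis: workclass (tick labels hidden), y-axis: hours.per.week]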

# geom_jitter() already draws the points, so a separate geom_point() layer would plot every observation twice
ggplot(d,aes(x=workclass,y=hours.per.week,color=workclass))+
geom_jitter()+
theme(axis.text.x=element_blank())

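[Figure: jittered point plot of hours.per.week for each workclass (?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay), coloured by workclass; x-axis: workclass (tick labels hidden), y-axis: hours.per.week]
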
Discussion of the pros and cons of choosing any of these visualisations is left for the QA forum.
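
As one possible starting point for that discussion, the sketch below puts group sizes next to the five-number summaries that a box plot is roughly built from. The `toy` data are simulated, with group sizes chosen by assumption to be very unequal; a box drawn from 5 observations looks just as solid as one drawn from 500, which a jittered point plot would make visible.

```r
set.seed(11)

# Simulated stand-in for workclass and hours.per.week (assumption)
toy <- data.frame(
  workclass = c(rep("Private", 500), rep("State-gov", 60), rep("Never-worked", 5)),
  hours = c(rnorm(500, 40, 10), rnorm(60, 39, 8), rnorm(5, 30, 12))
)

# Number of observations per group: visible in a jitter plot, hidden in a box plot
n_by_class <- table(toy$workclass)

# Five-number summary per group: roughly what each box and its whiskers encode
five_num <- sapply(split(toy$hours, toy$workclass), fivenum)
rownames(five_num) <- c("min", "lower hinge", "median", "upper hinge", "max")

print(n_by_class)
print(round(five_num, 1))
```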
