Professional Documents
Culture Documents
LectureSlidesDA00 Topics
LectureSlidesDA00 Topics
LectureSlidesDA00 Topics
Data Analysis
Jan Graffelman1
1 Department of Statistics and Operations Research
Universitat Politècnica de Catalunya
Barcelona, Spain
jan.graffelman@upc.edu
Data Analysis
Multivariate analysis
Organizing methods
Bibliography
Preprocessing
Classification of missingness
Approaches
You reduce your sample size and loose power to detect effects.
Statistical inference may be biased if the missing observations
are not MCAR.
A better idea is to impute the missings somehow.
● ● ●
● ●
● ●● ● ●●
● ●
●● ● ●● ●
● ●
● ●
● ●● ● ●●
● ● ● ●
● ● ● ● ●
● ● ● ● ● ●●
●● ●● ● ●● ●
y
y
● ● ● ●
● ●
● ● ●
●
● ●
● ● ●● ●
● ● ● ● ● ●
● ● ● ● ● ●
● ● ● ●
● ● ● ●
● ●
● ●
● ● ● ● ● ●
x x
● ● ● ● ●
●
● ●
●
● ● ● ●
● ●
●
● ●
● ●
●
●
●
●
●
● ● ●●● ●
●
● ● ● ● ●●
●
● ● ● ● ● ●
● ● ● ●
● ● ● ●
● ●
● ●
● ● ● ● ● ●
About zeros
Example
0 25 50 75 100
100
80
●
●
●
●
●
●
●
●
●
●
●
●
●
●
60
Tinnitus (VAS)
●
●
●
40
20
0
1 2 3 4 5 6 7 8 9 10 11 12
Treatment number
Is it significant?
100
80
95% CI of mean VAS score
60
● ●
●
●
●
●
40
●
20
●
0
2 4 6 8 10 12
Treatment number
100
80
●
60
% zero values
●
40
●
20
●
●
●
0
2 4 6 8 10 12
Treatment number
100
80
60
Tinnitus (VAS)
40
20
0
1 2 3 4 5 6 7 8 9 10 11 12
Treatment number
Outliers
Transformations
62 Mammals
5000
●
4000
Weight Brain
3000
2000
●
1000
●
●
●● ●
● ●
●
●●●●●
●
●●
●
●
●
0
Weight Body
Identifying outliers
62 Mammals
African_elephant ●
5000
Asian_elephant ●
4000
Weight Brain
3000
2000
● Human
1000
●
●
●● ●
● ●
●
●●●●●
●
●●
●
●
●
0
Weight Body
62 Mammals
●
●
8 ●
●
● ●
●● ●
6
● ● ● ●
● ●
● ●●
●
ln(Weight Brain)
●
● ●
4
●
●
●
●●
●
● ●
●
● ●●●●
●
●
2
● ● ●
●●
●
● ●
●
●
● ● ●
●
●
● ●
0
●
● ●
●
−2
−4 −2 0 2 4 6 8
ln(Weight Body)
Box-Cox transformation
Exploring relationships
5 15 25 40 60 0 15000 35000
●● ● ● ●● ● ●
50
● ● ● ● ● ● ● ●● ●
●
●● ● ●● ●● ●● ● ●● ● ●●●●
● ●● ● ● ●
●●●● ●●●● ● ● ●
●● ●● ●● ●
● ●
● ● ●●● ●● ●●● ● ●●● ● ● ●
●
● ● ●● ● ●
●
●● ●
● ●●● ●● ● ● ●●●●● ●● ●● ● ●
●
●●●●
● ●● ● ● ●● ● ● ●● ●● ● ● ●
●●● ●
● ● ●●● ●● ● ● ● ● ● ● ● ●
●
●●
●●●●● ● ●●● ●●● ● ●
● ● ●●●
Birth ●● ●●
●●● ● ● ●●
● ●● ● ● ●
● ●● ●
●
●●●
● ● ● ●●
●
●
●●
●● ●●
●●
30
● ●●
●●
● ● ● ●●● ● ● ●●● ●●
●●● ●●●● ●● ●
● ●●● ● ● ●●●● ●● ●● ● ● ●
● ● ●● ●●● ●●● ●●
●
●● ●●●● ●● ● ●● ● ● ●● ●
● ●●● ● ●● ● ●●● ●
● ●● ● ●
● ● ● ●● ●●
● ●
● ●
●●●●
●
●
● ●●● ●
●●
●●
● ●●●●●● ●
●● ●
● ●●●● ● ●● ●●●● ●●
● ●● ●●●●●●● ●
●
●●
●
●
●
●●
●●
●●●●
●●●
●●
●●
●●
●
●
●●●●
●
●●●
●
● ●● ● ●● ● ●●● ●
● ●
10
●● ●
●
●● ● ●
● ● ●●● ● ●●
25
● ● ● ● ●
● ● ● ● ● ● ● ● ●●
● ● ●● ● ●
●●● ●●
● ●
● ●
●
● ●
●
●
● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●
●
●
●● ● ● ●●● ●●● ●
●
15
●●● ●●●
● ●● ●
●● ● ●● ●● ●
●
●
●
●
●●
●
●
●●
●●
●
●
●
●●● ●●●
●● ● ●●●
●
● ●
●
●●●●
●
Death ●
●
●
●
●
●
●
●
●●
●●
●●●
● ●
●●
●●
●
● ● ●
●●
●
●
● ●
●
● ●
●
●
●
●●●
●●●
●
● ●●
●●
●
● ●
●
●
●
●
●
●
●●●
●●●
●●●
●●●
●●
●●
● ●●●
● ●
● ● ●
●●
●● ● ●
●
●●●
●
●
●
●●
●●
●●
●●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●●
● ●
● ●
●●
● ●
●
●●●● ● ● ●●●●●● ●
●●
●●● ● ●●● ●
● ● ●●●●●●●
● ●● ●●● ●●
●●●
●● ●●
●
●● ● ●● ● ●●
●●
●
●●● ● ● ●●●
●
● ●●● ●● ● ●● ●●
● ● ●●●● ●●● ●
●●●
●●
● ●
●
●● ●
●●●
●
● ● ●●
● ●
●●●● ●●
●●
● ●●
● ●●
● ●
●● ●●●
● ●● ● ●
●
●
●●
● ●
● ●
● ● ●●● ● ●●● ●
● ●●● ● ●●●
● ● ●● ●
5
● ●
● ● ●● ●● ●●●
● ● ● ● ●
● ● ● ● ●
150
● ● ● ● ●
●●
●● ●● ● ●●● ●●●
●●● ●● ● ● ● ●● ● ● ●●● ● ● ● ●●●● ●
●
●
●
●
●●
● ● ● ● ● ● ● ● ● ● ●●
●● ● ●●●●●● ● ● ●●
● ●● ●●●
● ●●● ●●●● ● ● ●
●●●
●
●● ●
● ● ●
● ●● ● ●
●
●● ●
●● ●● ●
●
●●
●
●
● ●
● ●
●● ●
●●●●●
● Infant ●● ● ●
●● ●●●●●●●
●●
●
●●
●●● ●
●●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●
●●
● ●
●
● ●
●
50
●● ● ●
● ● ●●● ● ● ● ● ●
●●
● ● ●● ● ●
●● ●● ● ● ●●
●
● ● ● ●● ●
● ●
●● ●
●● ●
● ●●
●● ● ●●●● ● ● ●
●● ●● ● ● ●●●
●●●●●●
●● ●
●●●
●● ●● ● ●●
●●
● ●●●
●
●
●
●
●
● ●
●●●●
●
●
●●●
●●
●●
● ●● ●● ● ●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●● ● ●
●
●● ●●●
●●●●
●●●
●
● ●
●
●
●
●●● ●●
●●●
●●
●
●
●●
●
●●●
●
● ●●
●●
●
●●
● ●
●●●
●●
●●●
● ●●●●
●● ● ●
● ●●●
●●●●●
●●● ●
0
●● ●●● ● ● ●
●●●
● ● ●●● ●
●
● ●●●
● ● ●●●●●●● ●
●
●●
●●●
●● ●● ● ● ●
●●●●●●●
●● ●
●● ●
●● ● ●● ● ●
●● ● ●● ●●
●
●● ● ●
●●●
●●
● ●●
●
●
●●●
●● ●
●● ●
● ● ● ● ●●
● ●
●●● ●●●● ●
●
●●●●● ● ●
●
●
●●●
● ●●
● ●
● ●
● ●
●
●●● ● ●●●
●● ●●
●●
● ● ●●
● ●
● ●
●
●●●
● ●●
●● ● ●● ● ● ● ●● ●●
●●
●● ● ● ● ●
●●●
●●
●●● ● ●●
●● ●●
● ●
●●
●
●●●●
●● ● ● ●●●
●●
●
●●● ● ●
●●●
● ●●
●●
● ●●●●
● ●● ●●
●
●
●●
●●●●
60
●● ● ● ●● ●
● ● ●● ●● ●●● ● ●
●
● ●●●●
● ●●●●●
●●
●●
●●●● ●
●●●●● ●
● ●●
●
●● ●●
●●
● ● ●● ●
● LM ●●
●
●
●●
●●
●●
●
●
●●●● ●
●
●
●
●
●
●● ●
●
●
●●
●●
● ● ●●● ●
●●●
●● ●●● ●
● ●●
●
● ● ●
●●
●● ●● ● ● ●● ●
● ● ●● ● ●●● ●
●●● ● ● ●●● ● ●
●
● ●● ● ● ●
40
● ● ● ● ●● ●
● ● ● ● ●● ●
●
●●● ● ●●● ● ●● ●
80
●● ● ●●●●●●
●
●●
●●
●●
●
●●● ● ● ●● ●● ●
●
●
●
● ●
●●
●
●● ● ●●●
● ●
●
●
● ●● ●●●
● ● ● ●●
●
●
●
●
●●● ●
●
●
●
● ● ●● ●●●● ● ●
● ●● ● ● ●
●
●●●
● ● ●
●●●● ●●
● ● ●●
●● ● ●●●
● ●
●●●● ●● ● ● ●
●
●
●● ●●●●
●
●●●
● ●●
●● ● ● ●●
●
●●
●●
●
●●
●
●
●●
●
●●
●
● ●
●●●●
●
● ●● ●
● ● ●● ●
●● ●● ●
● ●
●●● ●● ●
●
● ●
●●
●●
●●
●●●● ●
●●●●●● ●
●
● ●●●
● ●●
● ●
● ●●
● ●
●
● ● ●●
● ● ●
●●● ●●●●
●●●
●●● ●● ● ● ●
●●
● ●
●
●●● ●●
●●●
LF ●●●
● ● ●
60
● ●● ●● ●●●● ● ● ●● ● ● ● ●●● ●
● ●
●● ● ● ● ●
●●●
● ●●●● ● ●●● ●
●●● ●● ●●
● ●●
●●
● ● ●●● ●● ● ●●●● ●●●
● ●
●●
●●
●
●
●
●● ● ●●● ● ●●● ●●●●● ●●
● ●● ● ● ● ●
●● ● ●
●
●
●
●● ●● ● ●
●● ●
●●
● ● ● ●● ● ● ● ●●● ●
40
● ●
35000
● ● ● ● ●
●● ● ● ●
● ● ● ●●
● ● ● ● ●● ●
●
●●
● ● ● ● ●● ●
●
● ●●●
● ●● ● ● ● ●
15000
● ● ● ●● ●
●●●
●
●●
● ●● ●
● ●
●
●●
●● ●
●●● ●
●●
●
●
●
●
●●●●
●
● ●●
●
●
●●●
● ●●
●
● GNP
● ● ● ●
● ● ● ● ● ● ●● ● ● ●●
● ● ●● ● ●● ● ●● ● ●● ●● ● ●●
●
●●
●●●●●● ●●
● ●●● ● ●
●
●●● ●●● ●● ●
●
●
●●●●
● ● ● ●●●
●●
●
●● ● ●
●
●●
●● ● ● ●● ●●
●●●●●
●
●
●●●
● ●●● ●●
●
● ● ●
●
●●●
●
● ●
● ●
● ●●● ●
● ●●
●●
● ●●●● ●●●●●
●
●●●
●●●
● ●
●
●
●●
●
● ●
●●
●●● ●●
●
●
●●●
●
●● ●
●●●
●● ●
●●●●
●●
●
●● ●●
●● ●●
● ●● ●● ●●●●●●
●●●●
●
●●
●
●●
● ●●●
●●
●● ●
●
● ●
●
●●
●●●
● ● ●●●● ●
●●●●●
●● ●
●● ●●
● ●●●●●
●● ●●
●●● ●● ●
●●●
●●●●
● ●●●
●
●●
●●
●●●●
●●
●● ●
●●● ●
●
0
● ●● ● ● ● ● ● ● ●● ●●●
10 30 50 0 50 150 40 60 80
Boxplot of GNP
GNP
35000
●
30000
25000
●
●
●
●
●
●
●
20000
●
●
●
●
●
●
15000
●
●
10000
5000
0
> sum(X$GNP==-99)
[1] 6
> X$GNP[X$GNP==-99] <- NA
> sum(is.na(X$GNP))/nrow(X)
[1] 0.06185567
A quick solution
me <- median(X$GNP,na.rm=TRUE)
X$GNP[is.na(X$GNP)] <- me
Box-Cox transformation
95%
−280
−300
log−Likelihood
−320
−340
−360