Week 2 - EDA: Univariate and Multivariate

EXPLORATORY DATA ANALYSIS (EDA):

UNIVARIATE AND MULTIVARIATE

Prof. Nur Iriawan PhD.


Objectives
• Exploratory Data Analysis (EDA) is one of the stages in the process of learning data science.
• Understanding the characteristics of the data through EDA is essential for gaining a deeper, descriptive grasp of the information the data contain.
• EDA should be carried out before any "data-driven" feature engineering and modeling.
REFERENCE

• R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, Sixth Edition, PHI, 2007.
• Joseph F. Hair Jr., Rolph E. Anderson, Ronald L. Tatham, and William C. Black, Multivariate Data Analysis, Fifth Edition, Pearson Education, 2019.
• R. K. Pearson, Exploratory Data Analysis Using R, CRC Press, Taylor & Francis Group, 2018.
• R. D. Peng, Exploratory Data Analysis with R, Leanpub, 2016.
• Free electronic books


Variables and Data

The dimensions and viewpoints of a research object
How Are Data Measured?
1. Nominal Scale
– Categories
• e.g., Male-Female
– Count
– Big number doesn't represent big difference (coding only)

2. Ordinal Scale
– Categories
– Ordering implied
• e.g., High-Low
– Count
• Big number doesn't represent big difference (ordered coding only)

3. Interval Scale
– Equal intervals
– No true 0
• e.g., Degrees Celsius
– Measurement

4. Ratio Scale
– Equal intervals
– True 0
– Meaningful ratios
• e.g., Height in inches
Summarizing Variables
• Categorical variables
✓ Frequency tables - how many observations in each
category?
✓ Relative frequency table - percent in each category.
✓ Bar chart and other plots.

• Continuous variables
✓ Bin the observations (create categories, e.g., 0-10, 11-20, etc.), then treat as ordered categorical.
✓ Plots specific to continuous variables.

The goal for both categorical and continuous data is data reduction while preserving/extracting key information about the process under investigation.
Challenge: Big Data
Big Data Everywhere!
• Huge amounts of data are collected massively and simultaneously at widely scattered locations:
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Social networks
– CCTV for monitoring traffic, home security, etc.
Small or Big Data
– a problem –
What is the maximum number of rows and columns in
- an Excel worksheet?
- a MINITAB worksheet?
Limitations of Microsoft Excel

Limitations of MINITAB 16
EDA – UNIVARIATE

Univariate
• categorized as
–Measures of location,
–Measures of variability,
–Measures of heterogeneity,
–Measures of concentration,
–Measures of asymmetry, and
–Measures of kurtosis
Data in MINITAB

Column C1 = GENDER (mean 0.67), column C2 = SALARY (mean 511.67)

• What does the mean of column C2 (511.67) tell us?
– Can it be interpreted meaningfully?
– What additional information is needed to interpret it properly?
• What does the mean of column C1 (0.67) tell us?
– Can it be interpreted meaningfully?
– What additional information is needed to interpret it properly?
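The two questions above have concrete answers once the coding and units are known. A minimal sketch, assuming (hypothetically) that gender is coded 1 = male and 0 = female, with made-up values chosen only so that the means match the slide: the mean of a 0/1 code is simply the proportion coded 1, while the mean of salary cannot be read without its unit and period.

```python
from statistics import mean

# Hypothetical data: the slide shows neither the raw values nor the coding scheme.
# Assumed coding for illustration: 1 = male, 0 = female.
gender = [1, 1, 0]
salary = [500, 535, 500]   # unit and pay period are NOT given in the slide

gender_mean = mean(gender)   # proportion of observations coded 1
salary_mean = mean(salary)   # only meaningful once unit and period are known

print(f"mean(gender) = {gender_mean:.2f}")
print(f"mean(salary) = {salary_mean:.2f}")
```

Under this assumed coding, 0.67 reads as "about two thirds of the sample is male". If the codes were 1 and 2 instead, the same average would mean something different, which is why the coding has to be documented.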
Network Of
Univariate
Distribution
Which Is More Prosperous?

[Figure: two distribution curves, g1 for Sidoarjo and g2 for Surabaya; x-axis: Rp., y-axis: number of residents]
Pattern of the number of residents testing positive for COVID-19, monitored monthly

[Figure: two panels, Country A and Country B; x-axis: month, y-axis: number of residents, with the level 200 marked in both panels]
Numerical Data Properties

Central Tendency
(Location)

Variation
(Dispersion)

Shape
Descriptive: MINITAB Output (1)
Results for: Market.MTW
Descriptive Statistics: Sales, Advertis, Capital

Variable    N     Mean  Median  TrMean  StDev  SE Mean
Sales       8   103.25  102.50  103.25   8.83     3.12
Advertis    8    15.88   15.50   15.88   5.38     1.90
Capital     8    21.50   22.50   21.50  12.26     4.33

Variable  Minimum  Maximum     Q1      Q3
Sales       92.00   116.00  95.00  111.75
Advertis     9.00    24.00  10.75   21.00
Capital      6.00    36.00   9.00   32.75
Data Summary
Descriptive Statistics
Variable: MSData

Anderson-Darling Normality Test
A-Squared: 0.312
P-Value: 0.551

Mean 5.01805
StDev 1.92771
Variance 3.71608
Skewness 4.19E-02
Kurtosis 9.67E-02
N 1000

Minimum -0.7060
1st Quartile 3.7674
Median 4.9697
3rd Quartile 6.2907
Maximum 11.5726

95% Confidence Interval for Mu: 4.8984 to 5.1377
95% Confidence Interval for Sigma: 1.8468 to 2.0161
95% Confidence Interval for Median: 4.8384 to 5.0884

[Figure: histogram of MSData on an axis from 0 to 10, with interval plots of the 95% confidence intervals for Mu and for the Median on an axis from 4.85 to 5.15]

MMT – ITS, Statistik Bisnis Industri, Prof. Nur Iriawan, PhD.
Pascasarjana FE - UNAIR, Advanced Statistics, Prof. Nur Iriawan, PhD.
Mean
1. Measure of Central Tendency
2. Most Common Measure
3. Acts as ‘Balance Point’
4. Affected by Extreme Values (‘Outliers’)
5. Formula (Sample Mean)
\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}
Consider the Following Example

          Data1            Data2
Values    1, 2, 3, 4, 5    1, 2, 3, 4, 5, 1000
Mean      3                169.1667
Median
1. Measure of Central Tendency
2. Middle Value In Ordered Sequence
– If Odd n, Middle Value of Sequence
– If Even n, Average of 2 Middle Values
3. Position of Median in Sequence
Positioning Point = \frac{n + 1}{2}
4. Not Affected by Extreme Values
Consider the Following Example

          Data1            Data2
Values    1, 2, 3, 4, 5    1, 2, 3, 4, 5, 1000
Median    3                3.5
Mode
1. Measure of Central Tendency
2. Value That Occurs Most Often
3. Not Affected by Extreme Values
4. May Be No Mode or Several Modes
5. May Be Used for Numerical & Categorical Data
Consider the Following Example

          Data1               Data2
Values    1, 2, 3, 3, 4, 5    1, 2, 3, 3, 4, 5, 1000
Mode      3                   3
Midrange
1. Measure of Central Tendency
2. Middle of Smallest & Largest Observation
Midrange = \frac{X_{smallest} + X_{largest}}{2}
3. Affected by Extreme Values
Consider the Following Example

            Data1            Data2
Values      1, 2, 3, 4, 5    1, 2, 3, 4, 5, 1000
Midrange    3                500.5
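The four worked examples for the mean, median, mode, and midrange can be reproduced in a few lines. This is a minimal sketch using only Python's standard library; the helper `midrange` is a name introduced here for illustration.

```python
from statistics import mean, median, multimode

data1 = [1, 2, 3, 4, 5]
data2 = [1, 2, 3, 4, 5, 1000]   # same values plus one extreme observation

def midrange(x):
    """Average of the smallest and largest observation."""
    return (min(x) + max(x)) / 2

for name, x in (("Data1", data1), ("Data2", data2)):
    print(name, "mean:", round(mean(x), 4), "median:", median(x), "midrange:", midrange(x))

# Mode example from the slides: the value 3 occurs twice in both data sets.
mode1 = multimode([1, 2, 3, 3, 4, 5])
mode2 = multimode([1, 2, 3, 3, 4, 5, 1000])
print("modes:", mode1, mode2)
```

The single value 1000 moves the mean and the midrange far from the bulk of the data, while the median shifts only from 3 to 3.5 and the mode does not move at all.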
Quartiles
1. Measure of Noncentral Tendency
2. Split Ordered Data into 4 Quarters
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

3. Position of i-th Quartile
Positioning Point of Q_i = \frac{i (n + 1)}{4}
Midhinge
1. Measure of Central Tendency
2. Middle of 1st & 3rd Quartiles
Midhinge = \frac{Q_1 + Q_3}{2}
3. Not Affected by Extreme Values
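The quartile positioning rule i(n + 1)/4 and the midhinge can be sketched as follows. The linear interpolation between neighboring ranks is an assumption (the slides give only the position formula), and the data are hypothetical.

```python
def quartile(values, i):
    """i-th quartile at position i*(n+1)/4 in the ordered data, linearly interpolated."""
    x = sorted(values)
    n = len(x)
    pos = i * (n + 1) / 4          # 1-based position
    pos = min(max(pos, 1), n)      # keep the position inside the data
    lower = int(pos)
    upper = min(lower + 1, n)
    frac = pos - lower
    return x[lower - 1] + frac * (x[upper - 1] - x[lower - 1])

def midhinge(values):
    return (quartile(values, 1) + quartile(values, 3)) / 2

clean = list(range(1, 12))           # 1, 2, ..., 11 (hypothetical data)
with_outlier = clean[:-1] + [1000]   # largest value replaced by an extreme one

print(quartile(clean, 1), quartile(clean, 3), midhinge(clean))
print(midhinge(with_outlier))
```

Both data sets give the same midhinge because the extreme value lies beyond the Q3 position. In very small samples, interpolation can place Q3 between the last ordinary value and the outlier, so the resistance to extreme values is not unconditional.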
Summary of Central Tendency Measures

Measure     Equation                             Description
Mean        \sum X_i / n                         Balance Point
Median      position (n + 1) / 2                 Middle Value When Ordered
Mode        none                                 Most Frequent
Midrange    (X_{smallest} + X_{largest}) / 2     Middle of Smallest & Largest
Midhinge    (Q_1 + Q_3) / 2                      Middle of 1st & 3rd Quartile, where Q_i is at position i (n + 1) / 4
Numerical Data - Properties & Measures

Numerical Data Properties
• Central Tendency: Mean, Median, Mode, Midrange, Midhinge
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coeff. of Variation
• Shape: Skew, Kurtosis
Variance & Standard Deviation
1. Measures of Dispersion
2. Most Common Measures
3. Consider How Data Are Distributed
4. Show Variation About Mean (\bar{X} or \mu_X)

[Figure: observations plotted on an axis from 4 to 12, with \bar{X} = 8.3]

Degree of freedom (df)
Coefficient of Variation
1. Measure of Relative Dispersion
2. Always a %
3. Shows Variation Relative to Mean
4. Used to Compare 2 or More Groups
5. Formula (Sample)
CV = \frac{S}{\bar{X}} \cdot 100\%
Coefficient of Variation Example
Group 1 Data: 1 2 3
Group 2 Data: 100 200 300

Group 1: CV = \frac{S}{\bar{X}} \cdot 100\% = \frac{1}{2} \cdot 100\% = 50\%
Group 2: CV = \frac{S}{\bar{X}} \cdot 100\% = \frac{100}{200} \cdot 100\% = 50\%
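The coefficient-of-variation example can be checked directly. A minimal sketch with the standard library; the function name `coefficient_of_variation` is introduced here for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(x):
    """Sample CV in percent: (S / X-bar) * 100, with S computed using n - 1."""
    return stdev(x) / mean(x) * 100

group1 = [1, 2, 3]
group2 = [100, 200, 300]

cv1 = coefficient_of_variation(group1)
cv2 = coefficient_of_variation(group2)
print(cv1, cv2)
```

The two groups differ by a factor of 100 in both mean and standard deviation, so their relative variation is identical. This is what makes the CV suitable for comparing groups measured on different scales.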
Summary of Variation Measures

Measure                            Equation                                      Description
Range                              X_{largest} - X_{smallest}                    Total Spread
Interquartile Range                Q_3 - Q_1                                     Spread of Middle 50%
Standard Deviation (Sample)        \sqrt{\sum (X_i - \bar{X})^2 / (n - 1)}       Dispersion about Sample Mean
Standard Deviation (Population)    \sqrt{\sum (X_i - \mu_X)^2 / N}               Dispersion about Population Mean
Variance (Sample)                  \sum (X_i - \bar{X})^2 / (n - 1)              Squared Dispersion about Sample Mean
Coeff. of Variation                (S / \bar{X}) \cdot 100\%                     Relative Variation
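The difference between the sample formulas (divisor n - 1, the degrees of freedom) and the population formula (divisor N) is easy to see numerically. A minimal sketch on a small hypothetical sample:

```python
from statistics import pstdev, stdev, variance

x = [4, 6, 8, 10, 12]        # hypothetical sample, mean 8

data_range = max(x) - min(x)  # total spread
s2 = variance(x)              # sample variance: squared deviations divided by n - 1
s = stdev(x)                  # sample standard deviation
sigma = pstdev(x)             # population standard deviation: divided by N

print(data_range, s2, round(s, 4), round(sigma, 4))
```

Dividing by n - 1 instead of N always yields the larger value, which compensates for measuring deviations from the sample mean rather than from the unknown population mean.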
Standard Notation

Measure        Sample     Population
Mean           \bar{X}    \mu_X
Stand. Dev.    S          \sigma_X
Variance       S^2        \sigma_X^2
Size           n          N
What is "normal" anyway?
• With enough measurements, many variables are approximately normally distributed
• But in order to fully describe data we need to introduce the idea of a standard deviation

[Figure: three curves labeled mesokurtic, leptokurtic, and platykurtic]
Descriptive data analysis: Real Phenomena
• Point Estimation
– Mean
– Median
– Mode
– Trimmed mean
• Confidence interval
• Deviation
– Variance
– Range
Skewness?
Shape
1. Describes How Data Are Distributed
2. Measures of Shape
– Kurtosis = How Peaked or Flat
– Skew = Symmetry

Left-Skewed: Mean < Median < Mode
Symmetric: Mean = Median = Mode
Right-Skewed: Mode < Median < Mean
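Skew and kurtosis can be computed from central moments. The sketch below uses the plain moment coefficients (g1 = m3 / m2^1.5 and m4 / m2^2 - 3); this is one common definition, chosen here as an assumption, and statistical packages often apply small-sample adjustments, so their output can differ slightly. The data are hypothetical.

```python
def central_moment(x, k):
    """k-th central moment with divisor n."""
    m = sum(x) / len(x)
    return sum((v - m) ** k for v in x) / len(x)

def skewness(x):
    """Moment coefficient of skewness: negative = left-skewed, positive = right-skewed."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def excess_kurtosis(x):
    """Moment kurtosis minus 3, so a normal curve scores about 0."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 5, 1000]
left_skewed = [-1000, 1, 2, 3, 4, 5]

for name, x in (("symmetric", symmetric), ("right", right_skewed), ("left", left_skewed)):
    print(name, round(skewness(x), 3), round(excess_kurtosis(x), 3))
```

The sign of the skewness matches the direction of the long tail, which is also the direction in which the mean is pulled away from the median.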
Box-and-Whisker Plot

1. Graphical Display of Data Using 5-Number Summary

[Figure: box-and-whisker plot marking X_{smallest}, Q_1, Median, Q_3, and X_{largest} on an axis from 4 to 12]
Shape & Box-and-Whisker Plot

[Figure: box-and-whisker plots for left-skewed, symmetric, and right-skewed data, each marking Q_1, Median, and Q_3]
Bell-shaped interval

P(\bar{x} - s \le x \le \bar{x} + s) \approx 68\%
P(\bar{x} - 2s \le x \le \bar{x} + 2s) \approx 95\%
P(\bar{x} - 3s \le x \le \bar{x} + 3s) \approx 99.7\%
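For an exactly normal variable, these three coverage figures follow from the error function: P(|X - mu| < k sigma) = erf(k / sqrt(2)). A short check, with `normal_coverage` as an illustrative name:

```python
from math import erf, sqrt

def normal_coverage(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normally distributed X."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * normal_coverage(k):.2f}%")
```

The exact values are about 68.27%, 95.45%, and 99.73%, so the rule is a rounding of the normal model. For skewed or heavy-tailed data the actual coverage can differ noticeably.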
Balanced Density Study (1)

(\bar{x} \pm tolerance)
Balanced Density Study (2)
(Credible Interval)
Balanced Density Study
(Credible Interval)
Confidence Interval (1)
Confidence Interval (2)
(Credible Interval)
Balanced Density (Example 2)
(Credible Interval)
HPD (Highest Posterior Density)
(Credible Interval)
\bar{x} Control Chart

(1-\alpha) \cdot 100%    Lower Control Limit    Upper Control Limit
95.0                     71.3953                109.4810
97.5                     64.4857                110.9149
99.0                     55.3356                112.7754


Central Posterior Interval (Confidence Interval)
vs.
HPD Interval (Credible Interval)

(Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin, 2014)
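The contrast between a central (equal-tailed) interval and an HPD interval shows up clearly on a skewed distribution. The sketch below works from simulated draws; the Gamma(2, 1) "posterior" and the function names are assumptions for illustration, and the HPD search assumes a unimodal density.

```python
import math
import random

def central_interval(samples, mass=0.95):
    """Equal-tailed interval: drop the same number of draws from each tail."""
    x = sorted(samples)
    k = math.ceil(mass * len(x))          # number of draws kept inside
    start = (len(x) - k) // 2
    return x[start], x[start + k - 1]

def hpd_interval(samples, mass=0.95):
    """Shortest interval that still contains `mass` of the draws."""
    x = sorted(samples)
    k = math.ceil(mass * len(x))
    widths = [x[i + k - 1] - x[i] for i in range(len(x) - k + 1)]
    start = widths.index(min(widths))
    return x[start], x[start + k - 1]

random.seed(1)
draws = [random.gammavariate(2.0, 1.0) for _ in range(5000)]  # hypothetical skewed posterior

central = central_interval(draws)
hpd = hpd_interval(draws)
print("central:", central, "width", central[1] - central[0])
print("HPD:    ", hpd, "width", hpd[1] - hpd[0])
```

Both intervals contain the same share of the draws, but the HPD interval is never wider than the central one. For a symmetric unimodal density the two coincide; for a right-skewed one the HPD interval shifts toward the mode.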


HPD example using Maple
Single Variable Visualization
• Histogram:
– Shows center, variability, skewness, modality, outliers, or strange patterns
– Bin width and position matter
– Beware of real zeros
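Why bin width matters can be seen without any plotting library. The sketch below counts observations per bin for two widths on hypothetical bimodal data; `histogram` is an illustrative helper, not a library function.

```python
from collections import Counter

def histogram(values, width, origin=0.0):
    """Count observations per bin [origin + j*width, origin + (j+1)*width)."""
    counts = Counter(int((v - origin) // width) for v in values)
    return {origin + j * width: counts[j] for j in sorted(counts)}

# Hypothetical bimodal data: one cluster near 2 and another near 8.
data = [1.6, 1.8, 2.0, 2.1, 2.3, 2.4, 7.5, 7.7, 8.0, 8.2, 8.4, 8.5]

for width in (1.0, 10.0):
    print(f"bin width {width}:")
    for left, count in histogram(data, width).items():
        print(f"  [{left:4.1f}, {left + width:4.1f}) {'#' * count}")
```

With a width of 1 the two clusters are clearly separated; with a width of 10 everything falls into a single bar and the bimodality disappears. Changing `origin` shifts the bin edges and can alter the picture in a similar way.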

But be careful with axes and scales!
Mixture-of-Mixture Modeling in Portfolio Selection

Seminar Nasional Statistika UNDIP, 5/21/2011

Skewed and Multi-Modal Data Patterns as a Representation of Mixture Forms

[Figure: three normals with different variances]

[Figure: two normals with different means and variances]

The Mixture Phenomenon in Stock Returns
Time Series
If your data has a temporal component, be sure to exploit it

[Figure: air-travel time series annotated with: summer peaks; summer bifurcations in air travel (favor early/late); steady growth trend; New Year bumps]
Price and Return Series of Astra Agro Lestari Tbk (AALI) Stock, from listing through 5 April 2011

Return Series and Marginal Plot for AALI Stock

Density Estimation for AALI Stock with a Unimodal Univariate Normal

Identifying a Unimodal Normal vs. a Multi-Modal Mixture

Which distribution representation is more appropriate?
The function g (blue) is a mixture of two Gaussians. We draw 20 samples from it, which are shown as blue dots. We use the samples to generate the histogram (yellow) and its kernel density estimate f (red).
The Matlab script is twoGaussKernelDensity1.m
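The Matlab script itself is not reproduced in the slides. As a rough Python counterpart, the sketch below draws 20 samples from a two-Gaussian mixture and evaluates a Gaussian kernel density estimate on a grid. The mixture parameters, the bandwidth of 0.8, and all function names are assumptions chosen for illustration.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x):
    """g: equal-weight mixture of N(-2, 1) and N(2, 1) (hypothetical parameters)."""
    return 0.5 * normal_pdf(x, -2.0, 1.0) + 0.5 * normal_pdf(x, 2.0, 1.0)

def kde(x, samples, bandwidth):
    """Kernel density estimate f(x): average of Gaussian kernels centered on the samples."""
    return sum(normal_pdf(x, s, bandwidth) for s in samples) / len(samples)

random.seed(0)
samples = [random.gauss(-2.0, 1.0) if random.random() < 0.5 else random.gauss(2.0, 1.0)
           for _ in range(20)]

grid = [i / 10 for i in range(-80, 81)]            # -8.0, -7.9, ..., 8.0
f = [kde(x, samples, bandwidth=0.8) for x in grid]
area = sum(f) * 0.1                                 # Riemann sum of f over the grid

print("area under f:", round(area, 3))
print("g(0) =", round(mixture_pdf(0.0), 4), " f(0) =", round(kde(0.0, samples, 0.8), 4))
```

The estimate f integrates to about 1 like any density, but with only 20 samples its shape depends strongly on the bandwidth: a small value produces spurious bumps, a large one can merge the two modes.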
Empirical distribution function Continued

20-Year Treasury Bond Monthly Returns (1957-1996)

[Figure: histogram of monthly returns; x-axis: return bins in percent, y-axis: frequency from 0 to 125]
• 1957 to 1996 = 480 monthly obs
• 5% of distribution: 480 × 0.05 = 24th obs
EDA – MULTIVARIATE

Multivariate
• discuss about
– Dependent, independent, and association
• Variance and covariance matrix
• Correlation matrix
• Probability: joint, conditional, and odds of success
– Dimensionality
• Two-way, three-way, and so on
• Reduction of dimensionality
– Eigenvalue
– PCA
• Distance measure for classification and clustering
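The variance-covariance matrix and the correlation matrix listed above can be computed from first principles. A minimal sketch in plain Python on hypothetical data (not the Market.MTW worksheet); the function names are illustrative.

```python
from math import sqrt

def covariance_matrix(columns):
    """Sample covariance matrix (divisor n - 1) for a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
             for cb, mb in zip(columns, means)]
            for ca, ma in zip(columns, means)]

def correlation_matrix(columns):
    """Covariances rescaled by the standard deviations, so the diagonal is 1."""
    cov = covariance_matrix(columns)
    size = len(cov)
    return [[cov[i][j] / sqrt(cov[i][i] * cov[j][j]) for j in range(size)]
            for i in range(size)]

# Hypothetical observations of three variables.
x1 = [2.0, 4.0, 6.0, 8.0]
x2 = [1.0, 3.0, 2.0, 5.0]
x3 = [9.0, 7.0, 6.0, 2.0]

S = covariance_matrix([x1, x2, x3])
R = correlation_matrix([x1, x2, x3])
for row in R:
    print([round(v, 3) for v in row])
```

The diagonal of S holds the variances and the off-diagonal entries the covariances. R removes the measurement units, which is why it is the usual starting point for eigenvalue analysis and PCA when variables are on different scales.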
confirmatory experiments &
exploratory studies
exploratory studies:
- research question
- unexpected phenomenon
- curiosity about a system's behaviour in
particular conditions
Exploratory studies lead to confirmatory studies
Choosing a Spouse (Wife)

• Would you dare to choose your future wife based on just one variable?
• Would you dare to say yes to a woman as your future wife based only on a front-view, upper-body photo?

Any sensible person would answer: no, I would not dare.
WHY?
Number of rain events, rain duration, and depth of rainwater in the soil

Hour of day: 00 01 02 03 04 05 06 07 … 23 (hour positions of the values not recoverable)

Count   Row   Recorded values
3       1     25 30 10
3       2     10 18 11
3       3     32 21 14
6       4     10 13 11 16 2 3
7       5     8 9 6 9 2 2 3
3       6     12 8 4
3       7     35 27 12
3       8     23 20 10

• The distribution of the number of rain events may be Poisson
• The distribution of rain duration may be Gamma
• The distribution of water depth may be Weibull
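One way to see how these three distributions interact is to simulate them together: a Poisson count decides how many rain events occur, and each event then receives a Gamma duration and a Weibull depth. All parameter values below are hypothetical, and the events are treated as independent, which is a simplifying assumption.

```python
import math
import random

def poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (adequate for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_day(rng):
    """One day: number of rain events, then a duration and a depth per event."""
    events = poisson(3.0, rng)                                        # hypothetical rate
    durations = [rng.gammavariate(2.0, 10.0) for _ in range(events)]  # shape, scale
    depths = [rng.weibullvariate(12.0, 1.5) for _ in range(events)]   # scale, shape
    return events, durations, depths

rng = random.Random(42)
days = [simulate_day(rng) for _ in range(1000)]

mean_events = sum(d[0] for d in days) / len(days)
total_depth = [sum(d[2]) for d in days]   # daily total is a random sum of random terms
print("average events per day:", round(mean_events, 2))
print("largest daily total depth:", round(max(total_depth), 1))
```

The daily total depth is a compound quantity: its distribution depends on both the count distribution and the per-event distribution, which is the kind of interaction a single univariate summary would miss.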
Interaction of Several Distributions
Interaction of several distributions in a supermarket
• In a supermarket shopping system, where do the interactions occur, and between which distributions?
• Consider the Market Basket case
Market Basket

To be bought, in the trolley: orange juice, some bananas, detergent, glass cleaner, and soda.

• Is there a relationship between buying soda and buying bananas? Could there be a particular brand of soda that has to be drunk after eating bananas?
• Was this shopper influenced by a neighbor who also bought detergent, orange juice, and so on?
• What do you think should be in the trolley but is missing?
• Does buying detergent and orange juice mean that glass cleaner must be bought as well?

Note the number of items of each product type in the trolley.
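Questions like these are usually quantified with support, confidence, and lift. These measures come from association-rule analysis and are swapped in here as an illustration; the slides do not name them. The transactions are hypothetical, with item names taken from the trolley above.

```python
# Hypothetical transactions, one set of items per shopper.
baskets = [
    {"orange juice", "banana", "detergent", "glass cleaner", "soda"},
    {"orange juice", "detergent", "glass cleaner"},
    {"banana", "soda"},
    {"orange juice", "soda"},
    {"detergent", "glass cleaner"},
    {"banana", "soda", "detergent"},
]

def support(items):
    """Share of baskets that contain every item in `items`."""
    return sum(items <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Above 1: the items co-occur more often than independence would predict."""
    return confidence(antecedent, consequent) / support(consequent)

for a, c in (({"banana"}, {"soda"}), ({"detergent"}, {"glass cleaner"})):
    print(sorted(a), "->", sorted(c),
          "support", round(support(a | c), 2),
          "confidence", round(confidence(a, c), 2),
          "lift", round(lift(a, c), 2))
```

A lift above 1 signals association only, not a reason for it. Whether soda is bought because of the bananas is exactly the kind of question the data alone cannot settle.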
Bivariate Data
Visual Display of Bivariate Data
So, you have examined each variable for
mistakes, outliers and distribution and made
any necessary alterations. Now what?
Look at the relationship between 2 (or more)
variables at a time
Visual Displays of Bivariate Data

Variable 1     Variable 2     Display          Example
Categorical    Categorical    Crosstabs        [image]
Categorical    Continuous     Box plots        [image]
Continuous     Continuous     Scatter plots    [image]
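For the categorical-by-categorical case, a crosstab is just a count of each pair of categories. A minimal sketch with hypothetical data, using only the standard library:

```python
from collections import Counter

# Hypothetical paired observations of two categorical variables.
gender = ["M", "F", "F", "M", "F", "M", "F", "F"]
rating = ["High", "High", "Low", "Low", "High", "Low", "High", "Low"]

counts = Counter(zip(gender, rating))   # frequency of each (row, column) pair
rows = sorted(set(gender))
cols = sorted(set(rating))

print("      " + "".join(f"{c:>6}" for c in cols))
for r in rows:
    print(f"{r:<6}" + "".join(f"{counts[(r, c)]:>6}" for c in cols))
```

Dividing each cell by its row total turns the table into conditional relative frequencies, which are easier to compare when the groups differ in size.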


Bivariate Distribution

[Figure: scatterplot of EXP against NORMAL with marginal histograms. EXP: Mean = .95, Std. Dev = .85, N = 100. NORMAL: Mean = -.16, Std. Dev = 1.02, N = 100.]
With Outlier and Out of Range Value

[Figure: four panels for the variables DURING and AFTER, N = 10 each]
• Normal quantile plot of DURING: M = 5.15, Sd = 3.67, Sk = -0.19, K = -1.51
• Scatterplot of AFTER against DURING: r = -0.57, B = -0.6, t = -1.97, p = 0.08, N = 10
• Scatterplot of DURING against AFTER: r = -0.57, B = -0.55, t = -1.97, p = 0.08, N = 10
• Normal quantile plot of AFTER: M = 6.35, Sd = 3.82, Sk = 2.01*, K = 3.12*
Two Continuous Variables
• For two numeric variables, the scatterplot is the obvious choice

[Figure: two scatterplots, each annotated "interesting?"]
2D Scatterplots
• standard tool to display relation between 2 variables
– e.g. y-axis = response, x-axis = suspected indicator
• useful to answer:
– x, y related?
• linear
• quadratic
• other
– variance(y) depend on x?
– outliers present?

[Figure: two scatterplots, each annotated "interesting?"]
Scatter Plot: No apparent relationship
Scatter Plot: Linear relationship

The Trouble with Summary Stats
Looking at Data
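The figures for these two slides are not recoverable from the text. A classic illustration of the same point, offered here as an assumption about what the slides may have shown, is Anscombe's quartet: data sets with nearly identical summary statistics but very different scatterplots. The sketch uses the first two of its four data sets.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def correlation(x, y):
    """Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# First two data sets of Anscombe's quartet (Anscombe, 1973); they share the same x.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for name, y in (("set 1", y1), ("set 2", y2)):
    print(name, "mean(y) =", round(mean(y), 2), " r =", round(correlation(x, y), 3))
```

Means and correlations agree to two decimals, yet a scatterplot shows set 1 as a noisy straight line and set 2 as a smooth curve. Summary statistics alone cannot tell them apart.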
Scatter Plot: Quadratic relationship

Scatter plot: Homoscedastic

Why is this important in classical statistical modelling?

Scatter plot: Heteroscedastic

Variation in Y differs depending on the value of X, e.g., Y = annual tax paid, X = income
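Heteroscedasticity can be checked numerically by comparing the spread of Y around its trend in different ranges of X. The model below (tax as 10% of income, with noise that grows in proportion to income) is entirely hypothetical and serves only to make the pattern visible.

```python
import random
from statistics import stdev

random.seed(7)

# Hypothetical model: tax = 10% of income + noise whose spread grows with income.
income = [random.uniform(10, 100) for _ in range(2000)]
tax = [0.10 * x + random.gauss(0, 0.02 * x) for x in income]

# Deviations from the known trend line, split into a low-income and a high-income half.
residuals = [(x, y - 0.10 * x) for x, y in zip(income, tax)]
low = [r for x, r in residuals if x < 55]
high = [r for x, r in residuals if x >= 55]

spread_low, spread_high = stdev(low), stdev(high)
print("residual SD, low income: ", round(spread_low, 3))
print("residual SD, high income:", round(spread_high, 3))
```

A clearly larger spread in the high-income half is the numeric counterpart of the fan shape in the scatterplot. Classical regression inference assumes the two spreads are equal.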

Two variables - continuous
• Scatterplots
– But can be bad with lots of data

Two variables - continuous
• What to do for large data sets
– Contour plots

Transparent plotting
Alpha-blending:
• plot(rnorm(1000), rnorm(1000), col = "#0000ff22", pch = 16, cex = 3)

More than two variables
Pairwise scatterplots

Can be somewhat ineffective for categorical data
