Week 2 - EDA: Univariate and Multivariate
How Are Data Measured?
1. Nominal Scale
– Categories
• e.g., Male-Female
– Count
– Big number doesn't represent big difference (coding only)
3. Interval Scale
– Equal Intervals
– No True 0
• e.g., Degrees Celsius
– Measurement
• Continuous variables
✓ Bin the observations (create categories, e.g., (0-10), (11-20), etc.), then treat as ordered categorical.
✓ Plots specific to continuous variables.
Big Data Everywhere!
• Huge amounts of data are collected at massive scale, simultaneously, at widely distributed locations:
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Social networks
– CCTV for monitoring traffic, home security, etc.
Small or Big Data
– a problem –
What is the maximum number of rows and columns in
- an Excel worksheet?
- a MINITAB worksheet?
Limitations of Microsoft Excel
Limitations of MINITAB 16
EDA – UNIVARIATE
Univariate
• Univariate measures are categorized as
– Measures of location,
– Measures of variability,
– Measures of heterogeneity,
– Measures of concentration,
– Measures of asymmetry, and
– Measures of kurtosis
Data in MINITAB
Columns: JENIS KELAMIN (gender), GAJI (salary)
Column means shown in MINITAB: 0.67 and 511.67
• What does the mean of the data in column C2 mean?
– Can it be interpreted meaningfully?
– What additional information is needed to interpret it properly?
• What does the mean of the data in column C1 mean?
– Can it be interpreted meaningfully?
– What additional information is needed to interpret it properly?
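One way to approach these questions is to look at what the mean of a coded categorical column actually computes. The sketch below uses hypothetical data and assumes a 0/1 coding for gender, which the slide does not state. Under that assumption, the mean is simply the proportion of observations coded 1, so it can only be interpreted once the coding scheme is known.

```python
from statistics import mean

# Hypothetical codes: the slide does not state the coding scheme,
# so 1 = male, 0 = female is an assumption made for illustration.
gender_codes = [1, 1, 0]

code_mean = mean(gender_codes)
share_coded_one = gender_codes.count(1) / len(gender_codes)

# For a 0/1 variable the mean and the share of 1s coincide.
print(round(code_mean, 2), round(share_coded_one, 2))
```

With a different coding (for example 1/2), the same column would give a different mean with no direct interpretation as a proportion.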
Network of Univariate Distributions
Which Is More Prosperous?
[Figure: two distributions, g1 (Sidoarjo) and g2 (Surabaya); x-axis: Rp., y-axis: number of residents.]
Pattern of the number of residents testing positive for COVID-19, monitored monthly
[Figure: two curves, Negara A (Country A) and Negara B (Country B); x-axis: month, y-axis: number of residents, with 200 marked on each.]
Numerical Data Properties
• Central Tendency (Location)
• Variation (Dispersion)
• Shape
Descriptive: MINITAB Output (1)
Results for: Market.MTW
Descriptive Statistics: Sales, Advertis, Capital
Mean            5.01805
StDev           1.92771
Variance        3.71608
Skewness        4.19E-02
Kurtosis        9.67E-02
N               1000
Minimum         -0.7060
1st Quartile    3.7674
Median          4.9697
3rd Quartile    6.2907
Maximum         11.5726
95% Confidence Interval for Mu       4.8984   5.1377
95% Confidence Interval for Sigma    1.8468   2.0161
95% Confidence Interval for Median   4.8384   5.0884
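As a rough check on how the confidence interval for Mu relates to the other figures in this output, the sketch below recomputes it from the reported mean, standard deviation, and N. It assumes a normal critical value, which is an approximation for this sample size and not necessarily the exact rule MINITAB applies.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics copied from the MINITAB output.
mean, stdev, n = 5.01805, 1.92771, 1000

# Assumption: with n = 1000 the normal and Student-t critical values
# are nearly identical, so the normal one is used here.
z = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value

std_error = stdev / sqrt(n)
ci_low = mean - z * std_error
ci_high = mean + z * std_error

print(f"95% CI for mu: ({ci_low:.4f}, {ci_high:.4f})")
```

The result should agree with the reported interval to about three decimal places.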
MMT – ITS: Business and Industrial Statistics. Graduate Program, FE - UNAIR: Advanced Statistics. Prof. Nur Iriawan, PhD.
Mean
1. Measure of Central Tendency
2. Most Common Measure
3. Acts as ‘Balance Point’
4. Affected by Extreme Values (‘Outliers’)
5. Formula (Sample Mean)
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
Consider the following example:
Data1: 1, 2, 3, 4, 5 (mean = 3)
Data2: 1, 2, 3, 4, 5, 1000 (mean = 169.1667)
Median
1. Measure of Central Tendency
2. Middle Value In Ordered Sequence
– If Odd n, Middle Value of Sequence
– If Even n, Average of 2 Middle Values
3. Position of Median in Sequence
$$\text{Positioning Point} = \frac{n + 1}{2}$$
4. Not Affected by Extreme Values
Consider the following example:
Data1: 1, 2, 3, 4, 5 (median = 3)
Data2: 1, 2, 3, 4, 5, 1000 (median = 3.5)
Mode
1. Measure of Central Tendency
2. Value That Occurs Most Often
3. Not Affected by Extreme Values
4. May Be No Mode or Several Modes
5. May Be Used for Numerical & Categorical Data
Consider the following example:
Data1: 1, 2, 3, 3, 4, 5 (mode = 3)
Data2: 1, 2, 3, 3, 4, 5, 1000 (mode = 3)
Midrange
1. Measure of Central Tendency
2. Middle of Smallest & Largest Observation
$$\text{Midrange} = \frac{X_{\text{smallest}} + X_{\text{largest}}}{2}$$
3. Affected by Extreme Values
Consider the following example:
Data1: 1, 2, 3, 4, 5 (midrange = 3)
Data2: 1, 2, 3, 4, 5, 1000 (midrange = 500.5)
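The Data1 and Data2 examples for the mean, median, mode, and midrange can be reproduced with a short sketch. The `midrange` helper is a hypothetical name introduced here; the other functions come from Python's standard `statistics` module.

```python
from statistics import mean, median, multimode

def midrange(values):
    """Average of the smallest and largest observation."""
    return (min(values) + max(values)) / 2

data1 = [1, 2, 3, 4, 5]
data2 = data1 + [1000]  # same data plus one extreme value

for name, data in (("Data1", data1), ("Data2", data2)):
    print(name, "mean:", round(mean(data), 4),
          "median:", median(data),
          "midrange:", midrange(data))

# Mode example: the value 3 occurs twice, with and without the outlier.
mode_data1 = [1, 2, 3, 3, 4, 5]
mode_data2 = mode_data1 + [1000]
print("modes:", multimode(mode_data1), multimode(mode_data2))
```

Adding the single value 1000 moves the mean and midrange substantially, while the median shifts only slightly and the mode does not change.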
Quartiles
1. Measure of Noncentral Tendency
2. Split Ordered Data into 4 Quarters
25% 25% 25% 25%
Q1 Q2 Q3
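A minimal sketch of computing the three quartiles follows. It relies on Python's `statistics.quantiles` with its default interpolation rule, which is an assumption: statistical packages use several slightly different conventions, so values for small samples may differ between tools.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5]  # illustrative ordered sample

# n=4 requests the three cut points that split the data into quarters.
# Assumption: Python's default "exclusive" interpolation rule is used;
# other software may follow a different convention.
q1, q2, q3 = quantiles(data, n=4)

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={q3 - q1}")
```

Q2 coincides with the median, and Q3 - Q1 is the interquartile range.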
Numerical Data Properties
• Central Tendency: Mean, Median, Mode, Midrange, Midhinge
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coeff. of Variation
• Shape: Skew, Kurtosis
Variance & Standard Deviation
1. Measures of Dispersion
2. Most Common Measures
3. Consider How Data Are Distributed
4. Show Variation About Mean ($\bar{X}$ or $\mu$)
[Figure: data on an axis from 4 to 12, with $\bar{X} = 8.3$]
Degrees of freedom (df)
Coefficient of Variation
1. Measure of Relative Dispersion
2. Always a %
3. Shows Variation Relative to Mean
4. Used to Compare 2 or More Groups
5. Formula (Sample)
$$CV = \frac{S}{\bar{X}} \cdot 100\%$$
Coefficient of Variation Example
Group 1 Data: 1 2 3
Group 2 Data: 100 200 300
Group 1: $CV = \frac{S}{\bar{X}} \cdot 100\% = \frac{1}{2} \cdot 100\% = 50\%$
Group 2: $CV = \frac{S}{\bar{X}} \cdot 100\% = \frac{100}{200} \cdot 100\% = 50\%$
Summary of Variation Measures
Measure                          Equation                                         Description
Range                            $X_{\text{largest}} - X_{\text{smallest}}$       Total spread
Interquartile Range              $Q_3 - Q_1$                                      Spread of middle 50%
Standard Deviation (Sample)      $\sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$    Dispersion about sample mean
Standard Deviation (Population)  $\sqrt{\frac{\sum (X_i - \mu)^2}{N}}$            Dispersion about population mean
Variance (Sample)                $\frac{\sum (X_i - \bar{X})^2}{n - 1}$           Squared dispersion about sample mean
Coeff. of Variation              $(S / \bar{X}) \cdot 100\%$                      Relative variation
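The measures in this summary can be computed together for the two groups from the coefficient of variation example. The sketch below uses Python's standard `statistics` module; `variation_summary` is a hypothetical helper name, and the quartiles follow Python's default interpolation rule, which may differ from other software.

```python
from statistics import mean, pstdev, quantiles, stdev, variance

def variation_summary(values):
    """Return the variation measures from the summary table as a dict."""
    q1, _, q3 = quantiles(values, n=4)  # Python's default quartile rule
    s = stdev(values)                   # sample SD, divisor n - 1
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sample_sd": s,
        "population_sd": pstdev(values),     # divisor N
        "sample_variance": variance(values),
        "cv_percent": s / mean(values) * 100,
    }

group1 = variation_summary([1, 2, 3])
group2 = variation_summary([100, 200, 300])

print("Group 1 CV:", group1["cv_percent"], "Group 2 CV:", group2["cv_percent"])
```

Both groups have the same coefficient of variation even though their standard deviations differ by a factor of 100, which is why the CV is suited to comparing groups measured on different scales.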
Standard Notation
leptokurtic
platykurtic
Descriptive data analysis: Real Phenomena
• Point Estimation
– Mean
– Median
– Mode
– Trimmed mean
• Confidence interval
• Deviation
– Variance
– Range
Skewness?
Shape
1. Describes How Data Are Distributed
2. Measures of Shape
– Kurtosis = How Peaked or Flat
– Skew = Symmetry
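A minimal sketch of the two shape measures follows, using simple moment-based definitions: skewness as the third central moment divided by the 1.5 power of the second, and excess kurtosis as the fourth central moment divided by the square of the second, minus 3. This is an assumption about the formula; statistical packages often apply small-sample corrections, so their values can differ slightly. The function names are illustrative.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment with divisor n (no small-sample correction)."""
    mu = fmean(values)
    return fmean([(x - mu) ** k for x in values])

def skewness(values):
    # 0 for symmetric data, positive for a long right tail
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    # 0 for a normal distribution, negative for flatter shapes
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 5, 1000]

print("symmetric:", skewness(symmetric), excess_kurtosis(symmetric))
print("right-skewed:", skewness(right_skewed), excess_kurtosis(right_skewed))
```

The symmetric sample has zero skewness, while the extreme value in the second sample produces a clearly positive skew.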
Shape & Box-and-Whisker Plot
$P(\bar{x} - s \le x \le \bar{x} + s) \approx 68\%$
$P(\bar{x} - 2s \le x \le \bar{x} + 2s) \approx 95\%$
$P(\bar{x} - 3s \le x \le \bar{x} + 3s) \approx 99.7\%$
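These three percentages are the coverage probabilities of a normal distribution within one, two, and three standard deviations of its mean. The sketch below checks them with Python's `statistics.NormalDist`, assuming the data are exactly normal; for other shapes the percentages can differ.

```python
from statistics import NormalDist

# Assumption: the variable is normally distributed. The coverage does not
# depend on the mean or standard deviation, so the standard normal is used.
standard_normal = NormalDist(mu=0, sigma=1)

coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, p in coverage.items():
    print(f"within {k} standard deviation(s): {p:.2%}")
```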
Balanced Density Study (1)
(x tolerance)
Balanced Density Study (2)
(Credible Interval)
Balanced Density Study
(Credible Interval)
Confidence Interval (1)
Confidence Interval (2)
(Credible Interval)
Balanced Density (Example 2)
(Credible Interval)
HPD (Highest Posterior Density)
(Credible Interval)
(1-α)100%    x Control Chart    Lower Control Limit    Upper Control Limit
Single Variable Visualization
• Histogram:
– Shows center, variability, skewness, modality, outliers, or strange patterns.
– Bin width and position matter
– Beware of real zeros
But be careful with axes and scales!
Mixture-of-Mixture Modeling in Portfolio Selection
[Figure labels: steady growth, trend]
Price and Return Series of Astra Agro Lestari Tbk (AALI) Stock, from listing through 5 April 2011
Return Series and Marginal Plot of AALI Stock
5/21/2011
Density Estimation for AALI Stock with a Unimodal Univariate Normal
Identification: Unimodal Normal vs Multi-Modal Mixture
Which distribution representation is more appropriate?
The function g (in blue) is a mixture of two Gaussians. We draw 20 samples, which are shown as blue dots.
We use the samples to generate the histogram (yellow)
and its kernel density estimate f (red).
The Matlab script is twoGaussKernelDensity1.m.
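The idea of drawing samples from a two-Gaussian mixture and smoothing them with a kernel density estimate can be sketched in Python. This is not a port of the Matlab script: the mixture weights, means, standard deviations, and the bandwidth are all illustrative assumptions, and the function names are hypothetical.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical mixture parameters (not taken from the slide or the script).
WEIGHTS = (0.5, 0.5)
MEANS = (-2.0, 2.0)
SIGMAS = (1.0, 1.0)

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x):
    """True density g: weighted sum of the two Gaussian components."""
    return sum(w * normal_pdf(x, m, s)
               for w, m, s in zip(WEIGHTS, MEANS, SIGMAS))

def draw_sample():
    """Pick a component by weight, then draw from that Gaussian."""
    k = random.choices(range(len(WEIGHTS)), weights=WEIGHTS)[0]
    return random.gauss(MEANS[k], SIGMAS[k])

def kde(x, samples, bandwidth):
    """Gaussian kernel density estimate f at point x."""
    return sum(normal_pdf(x, s, bandwidth) for s in samples) / len(samples)

samples = [draw_sample() for _ in range(20)]
bandwidth = 0.8  # assumed smoothing parameter

for x in (-4, -2, 0, 2, 4):
    print(f"x={x:+d}  g(x)={mixture_pdf(x):.3f}  f(x)={kde(x, samples, bandwidth):.3f}")
```

With only 20 samples the estimate f can differ noticeably from g, and the choice of bandwidth controls how smooth f looks.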
Empirical distribution function (continued)
20-Year Treasury Bond
Monthly Returns (1957-1996)
[Figure: histogram of monthly returns; x-axis: return (%), y-axis: number of months. Annotations: 1957 to 1996 = 480 monthly obs; 5% of distribution: 480 * 0.05 = 24th obs.]
EDA – MULTIVARIATE
Multivariate
• Topics discussed:
– Dependence, independence, and association
• Variance and covariance matrix
• Correlation matrix
• Probability: joint, conditional, and odds of success
– Dimensionality
• Two-way, three-way, and so on
• Reduction of dimensionality
– Eigenvalue
– PCA
• Distance measures for classification and clustering
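Two of the objects listed here, the variance-covariance matrix and the correlation matrix, can be sketched in plain Python. The data are hypothetical and the function names are illustrative; the sample divisor n - 1 is assumed, matching the sample variance used elsewhere in these notes.

```python
from math import sqrt
from statistics import fmean

def covariance_matrix(columns):
    """Sample covariance matrix (divisor n - 1) for equal-length columns."""
    n = len(columns[0])
    means = [fmean(col) for col in columns]
    p = len(columns)
    return [
        [
            sum((a - means[i]) * (b - means[j])
                for a, b in zip(columns[i], columns[j])) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]

def correlation_matrix(columns):
    """Scale each covariance by the two standard deviations."""
    cov = covariance_matrix(columns)
    sd = [sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]

# Hypothetical variables: y is an exact linear function of x, z is not.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
z = [5, 3, 4, 1, 2]

cov = covariance_matrix([x, y, z])
corr = correlation_matrix([x, y, z])

for row in corr:
    print([round(value, 2) for value in row])
```

The diagonal of the covariance matrix holds the variances, and the correlation matrix rescales every entry to lie between -1 and 1.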
Confirmatory experiments & exploratory studies
Exploratory studies:
- research question
- unexpected phenomenon
- curiosity about a system's behaviour in particular conditions
Exploratory studies lead to confirmatory studies
Choosing a Spouse (Wife)
[Figure: plots of the variables EXP and NORMAL, including an axis from -3 to 3; axis ticks omitted.]
With Outlier and Out of Range Value
[Figure: four panels for the variables DURING and AFTER (N=10). Panel titles:
M=5.15, Sd=3.67, Sk=-0.19, K=-1.51
r=-0.57, B=-0.6, t=-1.97, p=0.08, N=10
r=-0.57, B=-0.55, t=-1.97, p=0.08, N=10
M=6.35, Sd=3.82, Sk=2.01*, K=3.12*
Axis labels: Standard Normal Quantiles, DURING, AFTER. Slide annotation: "interesting?"]
2D Scatterplots
• standard tool to display relation between 2 variables
– e.g. y-axis = response, x-axis = suspected indicator
• useful to answer:
– x,y related?
• linear
• quadratic
• other
– variance(y) depend on x?
– outliers present?
[Slide annotation: "interesting?"]
Scatter Plot: No apparent relationship
Scatter Plot: Linear relationship
The Trouble with Summary Stats
Looking at Data
Scatter Plot: Quadratic relationship
Scatter plot: Homoscedastic
Scatter plot: Heteroscedastic
Two variables - continuous
• Scatterplots
– But can be bad with lots of data
• What to do for large data sets
– Contour plots
Transparent plotting
Alpha-blending:
• plot( rnorm(1000), rnorm(1000), col="#0000ff22", pch=16,cex=3)
More than two variables
Pairwise scatterplots
Can be somewhat ineffective for categorical data