BookSlides 3B Data Exploration
2 Data Preparation
Normalization
Binning
Sampling
3 Summary
Advanced Data Exploration Data Preparation Summary
Figure: (a) AGE plotted against SPONSORSHIP EARNINGS; (b) AGE plotted against HEIGHT.
Figure: three bar plots of density by CAREER STAGE (mid-career, rookie, veteran).
Figure: three bar plots of density by POSITION (center, forward, guard).
Figure: bar plots of frequency and percentage, split by SHOE SPONSOR (no, yes): (a) CAREER STAGE and SHOE SPONSOR; (b) POSITION and SHOE SPONSOR.
Figure: density histograms of AGE (20 to 35) for POSITION = guard, POSITION = center, and POSITION = forward.
Figure: density histograms of HEIGHT (160 to 220) for POSITION = guard, POSITION = center, and POSITION = forward.
Figure: Using box plots to visualize the relationship between the AGE and the POSITION feature.
Figure: box plots of HEIGHT (160 to 220) by POSITION (center, forward, guard).
\[
\mathrm{cov}(\text{HEIGHT}, \text{WEIGHT}) = \frac{7{,}009.9}{29} = 241.72
\]
\[
\mathrm{cov}(\text{HEIGHT}, \text{AGE}) = \frac{570.8}{29} = 19.7
\]
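The divisor of 29 in the covariance calculations suggests a sample covariance over 30 instances (sum of products of deviations divided by n − 1). The following is a minimal sketch of that calculation. The function name `sample_covariance` and the toy values are illustrative assumptions, not the dataset behind the slides; only the two final divisions reuse figures from the slide.

```python
def sample_covariance(a, b):
    """Sample covariance: sum of products of deviations from the mean, divided by n - 1."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


# Hypothetical toy values, NOT the data used on the slides.
toy_heights = [180.0, 190.0, 200.0]
toy_weights = [70.0, 85.0, 100.0]
toy_cov = sample_covariance(toy_heights, toy_weights)

# Reproducing the slide arithmetic from the reported sums (n - 1 = 29).
slide_cov_height_weight = 7009.9 / 29
slide_cov_height_age = 570.8 / 29
```

Covariance is expressed in the product of the two features' units, which makes values such as 241.72 and 19.7 hard to compare directly.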
Correlation values fall into the range [−1, 1], where values
close to −1 indicate a very strong negative correlation (and
hence a negative covariance), values close to 1 indicate a very
strong positive correlation, and values around 0 indicate no
correlation.
Features that have no correlation are said to be
independent, although strictly a correlation of 0 rules out
only a linear relationship.
\[
\mathrm{corr}(\text{HEIGHT}, \text{WEIGHT}) = \frac{241.72}{13.6 \times 19.8} = 0.898
\]
\[
\mathrm{corr}(\text{HEIGHT}, \text{AGE}) = \frac{19.7}{13.6 \times 4.2} = 0.345
\]
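The correlation calculations divide each covariance by the product of the two features' standard deviations. A minimal sketch of that normalization follows, assuming sample statistics (n − 1 divisor) throughout. The function names and the toy values are illustrative assumptions; only the last two expressions reuse numbers from the slide.

```python
import statistics


def sample_covariance(a, b):
    """Sample covariance with an n - 1 divisor."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def correlation(a, b):
    """Correlation: covariance divided by the product of the sample standard deviations."""
    return sample_covariance(a, b) / (statistics.stdev(a) * statistics.stdev(b))


# Hypothetical toy values, NOT the data used on the slides.
toy_heights = [180.0, 190.0, 200.0, 210.0]
toy_weights = [70.0, 85.0, 95.0, 110.0]
toy_ages = [30.0, 22.0, 27.0, 25.0]
r_height_weight = correlation(toy_heights, toy_weights)
r_height_age = correlation(toy_heights, toy_ages)

# Reproducing the slide arithmetic from the reported covariances and standard deviations.
slide_corr_height_weight = 241.72 / (13.6 * 19.8)
slide_corr_height_age = 19.7 / (13.6 * 4.2)
```

Because the result is unit-free and bounded by −1 and 1, the two relationships can be compared directly, which is not possible with the raw covariances.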
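The covariance matrix, usually denoted as $\Sigma$, between a
set of continuous features, $\{a, b, \ldots, z\}$, is given as
\[
\Sigma_{\{a,b,\ldots,z\}} =
\begin{bmatrix}
\mathrm{var}(a) & \mathrm{cov}(a, b) & \cdots & \mathrm{cov}(a, z) \\
\mathrm{cov}(b, a) & \mathrm{var}(b) & \cdots & \mathrm{cov}(b, z) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}(z, a) & \mathrm{cov}(z, b) & \cdots & \mathrm{var}(z)
\end{bmatrix}
\tag{3}
\]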
Figure: four plots of feature B (4 to 13) against feature A (4 to 19).
Data Preparation
Normalization
Binning
Figure: density histograms of Value (50 to 90) using (e) 3 bins, (f) 14 bins, and (g) 60 bins.
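The 3, 14, and 60 bin panels show how the number of bins changes the picture of the same values. The sketch below assumes equal-width bins, as a standard histogram uses; the slides' own binning procedure is not recoverable from this text. The function name `equal_width_counts` and the toy values are illustrative assumptions.

```python
def equal_width_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("values must span a non-zero range")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical toy values in the 50 to 90 range, NOT the data used on the slides.
toy_values = [50, 55, 61, 64, 66, 70, 71, 73, 78, 90]
counts_by_bins = {n: equal_width_counts(toy_values, n) for n in (3, 14, 60)}

for n, counts in counts_by_bins.items():
    empty = counts.count(0)
    print(f"{n} bins: {empty} empty, largest bin holds {max(counts)} values")
```

With few bins the shape of the distribution is smoothed away; with many bins most bins are empty or hold a single value, so the choice of bin count is a trade-off.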
Figure: density plots over Value (0 to 400), shown in pairs on two consecutive Binning slides.
Sampling
Summary