ExploratoryDataAnalysis DSaa1920
2D arrays Heatmap
Dynamic
$T_p(x) = a x^p + b$
1 € = 1936.27 Lit.
p = 1, a = 1936.27, b = 0
°C = 5/9 (°F - 32)
p = 1, a = 5/9, b = -160/9
This type of transformation improves
interpretability
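A minimal sketch of the transformation with p = 1, reproducing the two unit conversions above. The function name `power_transform` is an assumption chosen for illustration.

```python
def power_transform(x, a, p=1.0, b=0.0):
    """Apply T_p(x) = a * x**p + b to a single value."""
    return a * x ** p + b

# Euro to Italian lira: p = 1, a = 1936.27, b = 0
lire = power_transform(10.0, a=1936.27)

# Fahrenheit to Celsius: p = 1, a = 5/9, b = -160/9
celsius = power_transform(212.0, a=5 / 9, b=-160 / 9)
```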
Data Transformation
Logarithmic transformations
$T_p(x) = c \log x + d$
This can be applied when dealing with positive
values
Smooth log-normal distribution probabilities
Example: used for smoothing season peaks
Task: regularize variance
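As a rough illustration, the logarithmic transformation can be sketched as follows. The function name `log_transform` and the sample series are assumptions made for this example, not data from the course.

```python
import math

def log_transform(values, c=1.0, d=0.0):
    """Apply T(x) = c * log(x) + d; defined only for positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("log transformation requires positive values")
    return [c * math.log(v) + d for v in values]

# Hypothetical series with one seasonal peak (illustrative numbers only)
sales = [10.0, 12.0, 11.0, 100.0, 13.0]
smoothed = log_transform(sales)

# The peak is compressed: the max/min ratio shrinks after the transformation
ratio_before = max(sales) / min(sales)
ratio_after = max(smoothed) / min(smoothed)
```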
Data Transformation
There are different transformations which can be
used to regularize data variance
$T_p(x) = a x^p + b$
Root transformation
p = 1/c, c integer
Used to regularize particular probability distributions (e.g.
Poisson)
Reciprocal transformation
p<0
Used when the data are time series and the variance increases
sharply with the average value
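The sketch below illustrates how a root transformation (p = 1/2) can even out the variance of two groups whose spread grows with their mean, and shows a reciprocal transformation (p = -1). The helper name `power_transform_all` and the two groups are hypothetical, chosen only so the effect is easy to see.

```python
import statistics

def power_transform_all(values, p, a=1.0, b=0.0):
    """Apply T_p(x) = a * x**p + b to every value in a list."""
    return [a * v ** p + b for v in values]

# Hypothetical count data: the second group has a larger mean and a larger spread
low = [4, 9, 1, 4]
high = [100, 121, 81, 100]

var_before = (statistics.pvariance(low), statistics.pvariance(high))

# Root transformation: p = 1/c with c = 2
root_low = power_transform_all(low, p=0.5)
root_high = power_transform_all(high, p=0.5)
var_after = (statistics.pvariance(root_low), statistics.pvariance(root_high))

# Reciprocal transformation: p < 0
reciprocal = power_transform_all([2.0, 4.0, 8.0], p=-1)
```

In this toy case the two variances differ by more than an order of magnitude before the transformation and coincide after it.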
Data Normalization
What is the goal of normalization?
Changing the values of numeric features to a
common scale, without distorting differences in value
ranges
When features have different ranges and learning
algorithm is based on some distance
Normalization affects the whole learning process
Normalization parameters need to be stored
min-max normalization
z-score normalization
normalization by decimal scaling
Data Normalization
Normalization by decimal scaling
Transform the data by moving the decimal points of
values of a selected feature. The number of decimal
points moved depends on the maximum absolute
value of feature.
Let v be a given feature; it is normalized by
$v'(i) = \frac{v(i)}{10^K}$, where $K$ is the smallest integer such that $\max_i |v'(i)| < 1$
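A minimal sketch of decimal scaling, assuming K is restricted to non-negative integers and the strict bound max |v'(i)| < 1 is used. The function name `decimal_scaling` is hypothetical.

```python
def decimal_scaling(values):
    """Divide by 10**k, with k the smallest non-negative integer
    such that the largest absolute scaled value is below 1."""
    max_abs = max(abs(v) for v in values)
    k = 0
    while max_abs / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values], k

# The maximum absolute value is 991, so the decimal point moves 3 places
scaled, k = decimal_scaling([-991, 45, 320])
```

The returned k is the normalization parameter that has to be stored in order to scale new data in the same way.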
Data Normalization
Min-Max Normalization
Transform data from a range [min(v), max(v)] into the range [0,1]
It is based on the a-priori knowledge of min/max values
Operatively we use sample minimum and maximum value
$v'(i) = \frac{v(i) - \min_i v(i)}{\max_i v(i) - \min_i v(i)}$
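The formula can be sketched as follows, using the sample minimum and maximum. The function name `min_max_normalize` and the handling of constant features are assumptions.

```python
def min_max_normalize(values):
    """Map values linearly from [min, max] to [0, 1]; also return the parameters."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant feature: min-max normalization is undefined")
    return [(v - lo) / (hi - lo) for v in values], (lo, hi)

normalized, params = min_max_normalize([65, 70, 80, 90, 100])
```

The pair `params` holds the sample minimum and maximum, which need to be kept so that unseen data can be mapped with the same scale.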
Data normalization
Z-score normalization (standardization)
Transform data by converting the values to a
common scale with an average of zero and a
standard deviation of one.
A value v(i) of feature v is normalized to v'(i) by computing:
$v'(i) = \frac{v(i) - \mathrm{mean}(v)}{\mathrm{sd}(v)}$
where v is the feature, and mean(v) and sd(v) are its mean and standard deviation
Drawback: data are completely modified
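A short sketch of standardization, assuming the population standard deviation is meant by sd(v) (the sample standard deviation is an equally common choice). The function name `z_score` is hypothetical.

```python
import statistics

def z_score(values):
    """Return standardized values together with the mean and sd used."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("constant feature: z-score is undefined")
    return [(v - mean) / sd for v in values], (mean, sd)

z, (mean, sd) = z_score([2, 4, 4, 4, 5, 5, 7, 9])
```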
Data Smoothing
Estimate f(x) when the shape is unknown, but it is
assumed to be smooth
General idea is to group data points that are
expected to have similar expectations and compute
the average or fit a simple parametric model
Smoothers can be used to discretize continuous
features (converting them into discrete ones)
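One simple smoother that follows the general idea above is a moving average: neighbouring points are assumed to have similar expectations, so each value is replaced by the average of its window. The sketch assumes evenly spaced observations; the name `moving_average` and the sample values are illustrative.

```python
def moving_average(y, half_width=1):
    """Replace each value by the mean of its neighbours within half_width positions.
    Windows are truncated at the two ends of the series."""
    smoothed = []
    for i in range(len(y)):
        window = y[max(0, i - half_width): i + half_width + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

noisy = [1.0, 3.0, 2.0, 6.0, 4.0]
smooth = moving_average(noisy)
```

A wider window gives a smoother estimate at the cost of blurring genuine local changes.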
Data: differences and ratios
Data transformation can be based simply on
arithmetic operations on feature values,
aimed at obtaining new features
Example: Body mass index (BMI), body mass
divided by the square of the body height (kg/m²)
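The BMI example can be written as a derived feature computed from two existing ones. The records and field names below are hypothetical.

```python
def bmi(mass_kg, height_m):
    """Body mass index: mass divided by the square of the height (kg/m^2)."""
    return mass_kg / height_m ** 2

# Hypothetical records; a new "bmi" feature is derived as a ratio
records = [
    {"mass_kg": 70.0, "height_m": 1.75},
    {"mass_kg": 90.0, "height_m": 1.80},
]
for record in records:
    record["bmi"] = bmi(record["mass_kg"], record["height_m"])
```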
Missing data
• Even in a well-designed and controlled
study, missing data occurs
• Missing data can reduce the statistical power
of a study and can produce biased
estimates, leading to invalid conclusions
• Missing data (or missing values) defined as
the data value that is not stored for a
variable in the observation of interest
Missing data
Is there any reason? “Difficulties” (any type you can think
about) to compute/obtain/measure feature value on a
specific sample
Rubin (Rubin DB. Inference and missing data. Biometrika. 63:581–
592, 1976) first described and classified the types of
missing data according to the assumptions made about the
reasons for the missing data.
There are three types of missing data according to the
mechanisms of missingness
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)
Missing data: MCAR
• MCAR: defined as when the probability that the
data are missing is not related to either the specific
value which is supposed to be obtained or the set
of observed responses
• When data are missing by design, because of an
equipment failure, or because the samples are lost
in transit or technically unsatisfactory, such data are
regarded as being MCAR.
• The statistical advantage of MCAR is that any
(statistical) analysis remains unbiased
• Power may be lost in the design, but the estimated
parameters are not biased by the absence of the data
Missing data: MCAR
• The presence of MCAR can be confirmed by
splitting the samples into two groups (with and
without missing data) and performing a t-test on
the average differences between the features to
highlight whether the two groups of samples
have (or not) significant differences
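The check described above can be sketched with a hand-written Welch t statistic, which is one common variant of the two-sample t-test; the slide does not say which variant is intended. The rows below are hypothetical, and turning the statistic into a p-value needs the t distribution (a statistical table or a statistics library), which this sketch leaves out.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / se

# Hypothetical rows: (age, income), with None marking a missing income
rows = [(25, 30.0), (31, None), (42, 52.0), (38, None),
        (29, 33.0), (45, 58.0), (33, None), (40, 49.0)]

# Split the samples into two groups: with and without the missing value
age_missing = [age for age, income in rows if income is None]
age_observed = [age for age, income in rows if income is not None]

# Compare the fully observed feature (age) between the two groups
t_stat = welch_t(age_missing, age_observed)
```

A statistic close to zero means the two groups do not differ noticeably on the observed feature, which is consistent with MCAR but does not prove it.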
Missing data: MAR
• MAR: when the probability that the responses are
missing depends on the set of observed responses,
but is not related to the specific missing values
that are expected to be obtained
• More common than MCAR
• As we tend to consider randomness as not
producing bias, we may think that MAR does not
present a problem.
• However, MAR does not mean that the missing
data can be ignored.
Missing data: MNAR
• If missing data do not meet the conditions of MCAR or MAR,
then they fall into the category of missing not at random
(MNAR)
• Non-ignorable missingness (Missing Not at random)
• The missing values are not randomly distributed over all
observations, but the probability of finding a missing
data cannot be estimated using the variables in the
model
• The cases of MNAR data are problematic
• Treatment: replace the missing data based on some
external a priori knowledge on the learning task
Missing data
How to treat missing values?
There is no simple and safe solution for cases where a significant
number of missing values appears
Try to evaluate the importance of missing data by
experimenting with learning techniques with and without
the attributes that present these data
Methods for processing missing data:
Sequential (or pre-processing methods): the incomplete
dataset is converted into a complete dataset
Parallel: missing values are handled within the knowledge
acquisition process, and the learning algorithm is modified
to manage such data
Missing data: sequential method
Dataset reduction (simpler solution)
Elimination of examples with missing values (listwise or
casewise deletion)
Solution to be used when dataset size is large and/or the
missing value percentage is low
Loss (sometimes significant) of information
Replacement of missing values with constant values
a global value (usually the most common feature value)
average of the corresponding feature (for numerical type
attributes)
average of the feature within the same class (in classification problems)
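The reduction and replacement strategies listed above can be sketched on a tiny dataset. The data and variable names are hypothetical, and the last strategy is read here as the mean of the feature over the examples of the same class, which is an interpretation of the slide.

```python
import statistics

# Hypothetical dataset: (feature value, class label); None marks a missing value
data = [(1.0, "a"), (None, "a"), (3.0, "a"),
        (10.0, "b"), (None, "b"), (14.0, "b")]

# Listwise deletion: drop every example with a missing value
complete = [(x, c) for x, c in data if x is not None]

# Replacement with the mean of the feature over all complete examples
global_mean = statistics.mean(x for x, _ in complete)
mean_imputed = [(global_mean if x is None else x, c) for x, c in data]

# Replacement with the mean of the feature within the same class
class_means = {
    label: statistics.mean(x for x, c in complete if c == label)
    for label in {c for _, c in complete}
}
class_imputed = [(class_means[c] if x is None else x, c) for x, c in data]
```

Deletion shrinks the dataset from six to four examples, while the two replacement strategies keep all six but fill the gaps with different values.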
Missing data: sequential method
Global closest fit: replace missing value with the
value of the most similar attribute
Comparisons between two features (one containing
the missing value and the closest fit candidate)
some distance between these two vectors is calculated
search is carried out on all features
the vector with the minimum distance is used to determine
the missing value
Missing data: sequential method
Distance measure adopted by Global Closest Fit
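$\mathrm{dist}(x, y) = \sum_{i=1}^{n} \mathrm{dist}(x_i, y_i)$
$\mathrm{dist}(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{if } x_i, y_i \text{ are symbolic and } x_i \neq y_i, \text{ or } x_i = ? \text{ or } y_i = ? \\ \frac{|x_i - y_i|}{r} & \text{if } x_i \text{ and } y_i \text{ are numbers and } x_i \neq y_i \end{cases}$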
r = difference between the maximum and the minimum value of the feature
containing the missing value
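A minimal sketch of the distance above and of a closest-fit replacement. It rests on several assumptions that the slides do not state: the vectors x and y are taken to be dataset rows, `None` plays the role of "?", r is computed separately for each numerical component from its known values, and the replacement is copied from the closest row that has a known value in that position. All names are hypothetical.

```python
def component_distance(xi, yi, r):
    """Distance between two components, following the piecewise definition."""
    if xi is None or yi is None:      # "?" in either vector
        return 1.0
    if xi == yi:
        return 0.0
    if isinstance(xi, str) or isinstance(yi, str):  # symbolic and different
        return 1.0
    return abs(xi - yi) / r           # numerical and different

def global_closest_fit(rows):
    """Fill each missing value with the value found in the closest row."""
    n = len(rows[0])
    ranges = []
    for j in range(n):
        known = [row[j] for row in rows if isinstance(row[j], (int, float))]
        ranges.append(max(known) - min(known) if known else None)

    def distance(x, y):
        return sum(component_distance(x[j], y[j], ranges[j]) for j in range(n))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = [other for k, other in enumerate(rows)
                          if k != i and other[j] is not None]
            if candidates:
                best = min(candidates, key=lambda other: distance(row, other))
                filled[i][j] = best[j]
    return filled

# Hypothetical data: the second row lacks its third value
rows = [[1.0, "red", 10.0], [2.0, "red", None], [9.0, "blue", 30.0]]
filled = global_closest_fit(rows)
```

Here the second row is at distance 1.125 from the first row and 2.875 from the third, so the missing value is copied from the first row.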
show outliers
Example: 80, 75, 90, 95, 65, 65, 80, 85, 70, 100
Sort the data in ascending order
Determine the first quartile (70), the median (80), the
third quartile (90), the largest (100) and the smallest
value (65)
Visual representation of single
variable: box-plot
[Figure: box-plot of the example data, axis from 65 to 100]
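The five values of the box-plot example can be computed as sketched below. Quartile conventions differ between textbooks and libraries; this sketch takes the medians of the lower and upper halves of the sorted data, which reproduces the values quoted in the example. The 1.5 × IQR rule used to flag outliers is a common convention and an assumption here, since the slides do not state a rule.

```python
import statistics

def five_number_summary(values):
    """Minimum, quartiles (medians of the two halves), median and maximum."""
    data = sorted(values)
    half = len(data) // 2
    lower, upper = data[:half], data[-half:]
    return {
        "min": data[0],
        "q1": statistics.median(lower),
        "median": statistics.median(data),
        "q3": statistics.median(upper),
        "max": data[-1],
    }

values = [80, 75, 90, 95, 65, 65, 80, 85, 70, 100]
summary = five_number_summary(values)

# Assumed outlier rule: points beyond 1.5 * IQR from the quartiles
iqr = summary["q3"] - summary["q1"]
outliers = [v for v in values
            if v < summary["q1"] - 1.5 * iqr or v > summary["q3"] + 1.5 * iqr]
```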
Scatter-plot
• Scatter plot (scatter chart, scatter diagram,
scatter graph): 2-D plot displaying the joint
variation of two data items.
• Shows data in Cartesian coordinates in a
graphic that displays the relationship
between two variables (x and y axes)
• Allows one to determine visually whether the
data points are related or not
• How are the data spread?
• Are they closely related?
Scatter-plot
Explanatory variable (independent variable)
Response variable (dependent variable)
Scatter-plot
Some statistical properties can be observed
using scatter-plot
Dispersion, linear correlation, outliers
Positive association (increasing trend)
Negative association (decreasing trend)
Lack of association (cloud trend)
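The linear correlation that a scatter-plot shows visually can also be quantified. The sketch below computes the Pearson correlation coefficient by hand; the function name `pearson_r` and the sample data are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with an increasing and a decreasing trend
x = [1.0, 2.0, 3.0, 4.0, 5.0]
increasing = [2.0, 4.1, 5.9, 8.2, 9.8]
decreasing = [9.8, 8.2, 5.9, 4.1, 2.0]

r_pos = pearson_r(x, increasing)   # close to +1: positive association
r_neg = pearson_r(x, decreasing)   # close to -1: negative association
```

A coefficient near zero would correspond to the cloud-like pattern with no linear association.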
Bubble-plot
• A bubble plot displays data points with three
values and shows the relationship that exists
among at least three variables
• Two of them are the plot axes, while the third
one is shown by the bubble size
• Each bubble is an observation.
• Colors can be used to represent an additional
measure
Scatter-matrix
[Figure: conditioned plot of petal width by Iris class]
Trellis diagrams: Iris Data
[Figure: trellis panels for the classes setosa and versicolor, each conditioned on Petal L. in [1.0, 4.4] and in [4.4, 7.1]]