
Exploratory data analysis and basic visual representation of multivariate data
N. Del Buono
EDA (Exploratory Data Analysis)
What is it?
 Wikipedia: EDA is an approach to analyzing datasets and summarizing their main characteristics, often with visual methods
 A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task
EDA (Exploratory Data Analysis)
What is it?
 EDA is an essential step in any data analysis task
 It aims to:
 Examine data for distribution, outliers and anomalies, to direct specific testing of your hypothesis
 Provide tools for hypothesis generation by visualizing and understanding the data, usually through graphical representation
 Show hidden relationships and attributes present in our data, even before a machine learning model is applied
EDA (Exploratory Data Analysis)
When is it done?
• EDA is a fundamental early step after
• data collection
• data pre-processing
• EDA is devoted to visualizing, plotting and manipulating the data, without any assumptions, in order to help assess the quality of the data and build models
• EDA methods can be cross-classified as
• graphic or non-graphic
• univariate or multivariate methods
Few EDA techniques
The use of an "appropriate" EDA technique depends on the type of data and the objective of the analysis
Useful EDA techniques depending on the type of data

Type of data               Suggested EDA techniques
Categorical                Descriptive statistics
Univariate continuous      Line plot, histograms
Bivariate continuous       2D scatter plots
2D arrays                  Heatmap
Multivariate: trivariate   3D scatter plot, or 2D scatter plot with a 3rd variable represented in different color, shape or size
Multiple groups            Side-by-side boxplot
Few EDA techniques
Useful EDA techniques depending on the objective

Objective                                                          Suggested EDA techniques
Getting an idea of the distribution of a variable                  Descriptive statistics
Finding outliers                                                   Line plot, histograms
Quantifying the relationship between two variables                 2D scatter plots
Visualizing the relationship between variables (2 input, 1 output) Heatmap
Visualization of high-dimensional data                             PCA + 2D/3D scatterplot
Multiple groups                                                    Side-by-side boxplot
Data (1/2)
 Data can be classified in:
 Structured
 Semi-structured
 Unstructured
• Most scientific databases contain structured data: well-defined fields with numeric or alphanumeric values
• Semi-structured data: electronic images, medical or web documents, …
• Unstructured data: video and multimedia recordings
Structured data (1/2)
 Generally a 2-D array representation (a table describing a single relation)
 Columns are features (measurements of some observable phenomenon)
 Rows are the specific values the example has for those features
[Figure: a data table whose columns are features and whose rows are samples; each cell holds the feature value for a specific sample]
Structured data (2/2)
 Data quality (is any pre-processing needed?)
 Data should be
 accurate
 stored according to the "data type" (numerical data, "character" data, real data, ...)
 non-redundant (redundancy must be tackled!)
 complete (how must missing values be treated?)
Features
 Machine learning and pattern recognition: a feature is an individual measurable property or characteristic of a phenomenon being observed
 A crucial step for effective algorithms in pattern recognition, classification and regression is choosing informative, discriminating and independent features
 There are different types of features
 There may exist non-observable features which nevertheless influence the data (numerical) model
Features - Type of data
• Most common feature types
• Numerical
• discrete features
• continuous features
• Categorical (or symbolic)
• Alpha-numeric features
Numerical Features
 They can be real, integer, complex numbers
 Age, velocity, length, …
 Main properties
 Order relationship (2 < 5)
 Distance relationship ( d(2.3,4.2)=1.9 )
Symbolic Features
• Categorical (or symbolic) features do not possess numerical values (they are not measurable by numbers): eye color, nationality, sex, ...
• No order or distance relationships
• Only the equality relationship exists between categorical features
• the values of a categorical feature may be the same or not (blue = blue, red is not equal to black)
Symbolic Features
o A 2-valued categorical feature can be converted into a binary numeric variable (0/1)
o An N-valued symbolic feature can be converted into N numeric Boolean variables
o a binary variable for each value in the category
o the categorical features thus treated are called "dummy variables"
o Example: eye color
o blue = 1000; black = 0100; green = 0010; brown = 0001
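A minimal sketch of this conversion in Python with pandas (the column name `eye_color` and the sample values are illustrative):

```python
import pandas as pd

# Hypothetical categorical feature with four possible values
df = pd.DataFrame({"eye_color": ["blue", "black", "green", "brown", "blue"]})

# Convert the N-valued symbolic feature into N binary dummy variables
dummies = pd.get_dummies(df["eye_color"], prefix="eye")
print(dummies)
# Each row contains exactly one 1, e.g. blue -> eye_blue = 1, all others 0
```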
Continuous and discrete features
 Continuous features: quantitative (or metric) variables represented by real or integer numbers
 Measuring methods
 Interval scale: scale of measurement possessing magnitude and equal intervals, but not an absolute zero
 Example: temperature, since 0 °F is not the absence of heat
 Ratio scale: scale of measurement possessing magnitude, equal intervals and an absolute zero
 Example: weight, height, salary, ...
 The two scales differ in the definition of zero!
Discrete Feature – nominal scale
 Discrete features: qualitative variables represented by symbols (alpha-numeric characters)
 Numbers must be treated as labels; they indicate only different values of the attribute
 NO ordering
 Measuring method:
 Nominal scale: any scale that contains no magnitude
 an order-less scale that uses symbols or numbers
 Example: classification or description
 user identification code (residential/non-residential, A/B, 1/0)
Discrete Feature – ordinal scale
 Ordinal scale: any scale that reflects only magnitude but does not contain equal intervals or an absolute zero
 Example: rating or rank ordering
 Discrete features on an ordinal scale are categorical variables for which the order relationship is valid, but no distance can be defined
 Ordinal relationships are not linear
 4th classified runner, 5th classified runner
 15th grade student, 16th grade student
 Only >, >=, ==, <, <= relationships between attributes
 Ordinal variables are closely related to linguistic variables (fuzzy)
 age (young, middle-aged, old)
Special defined features
 Periodic: discrete variables for which there is a distance relationship but NO ordering
 Example: days in a week/month/year
 Data depending on time
 Static
 Dynamic
 A general assumption in learning is independence of the observations involved: observations should be equally likely to show a particular value. Data which present some dependencies (hierarchically dependent, time-dependent or linked data) could be problematic
 Which will be our data? Mainly static datasets
Data: do they need any pre-processing?
 Data transformation
 Normalization
 Smoothing
 Differences and ratios
 Missing values
 Outliers
Data transformation
 What is the rationale behind transformations?
 Learning algorithms generally perform better when some pre-processing is applied to the data
 Reduce (mitigate) problems in the data
 Errors or incompleteness
 Unsatisfactory data distributions
 Strong asymmetries
 Presence of several highly varying values (peaks)
 The particular data transformation to be used depends on the data types, the number of samples, a priori information on the problem to be tackled, and so on…
Data transformation
 A transformation T can be defined as Y = T(X) such that
 the "relevant" information embedded in the attribute (feature) X is preserved
 Y mitigates some of the problems affecting X
 Y is more useful than X
 Main aims data transformations pursue:
 Equalizing variance
 Normalizing distributions
 Linearizing variable relationships
 Secondary effects: simplify data manipulations, represent data on a more appropriate scale, …
Data transformation
 Exponential transformation

$$T_p(x) = \begin{cases} a x^p + b & (p \neq 0) \\ c \log x + d & (p = 0) \end{cases}$$

 a, b, c, d, p real values
 Preserves ordering
 Preserves some statistics
 Continuous functions
 They are differentiable
 They are defined through simple analytical functions
Data Transformation
 Linear transformations (p = 1)

$$T_1(x) = a x + b$$

 1 € = 1936.27 Lit.
 p = 1, a = 1936.27, b = 0
 °C = 5/9 (°F − 32)
 p = 1, a = 5/9, b = −160/9
 This type of transformation improves interpretability
Data Transformation
 Logarithmic transformations

$$T_0(x) = c \log x + d$$

 Can be applied when dealing with positive values
 Smooths log-normally distributed data
 Example: used for smoothing seasonal peaks
 Task: regularize variance
Data Transformation
 There are different transformations which can be used to regularize data variance

$$T_p(x) = a x^p + b$$

 Root transformation
 p = 1/c, c integer
 used to regularize particular probability distributions (e.g. Poisson)
 Reciprocal transformation
 p < 0
 used when data are time series and the variance increases strongly with the average value
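A minimal sketch of these transformations with NumPy (the coefficients and the sample data are illustrative):

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 100.0, 2500.0])  # illustrative positive data

linear      = 1936.27 * x        # T_1(x) = a*x + b, the Euro/Lira example
logarithmic = np.log(x)          # T_0(x) = c*log(x) + d with c = 1, d = 0
root        = x ** (1 / 2)       # root transformation, p = 1/c with c = 2
reciprocal  = x ** (-1)          # reciprocal transformation, p < 0

# The log and root transforms compress large values, regularizing variance
print(logarithmic, root)
```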
Data Normalization
 What is the goal of any normalization?
 Changing the values of numeric features to a common scale, without distorting differences in value ranges
 Needed when features have different ranges and the learning algorithm is based on some distance
 Normalization affects the whole learning process
 The normalization parameters need to be stored
 min-max normalization
 z-score normalization
 normalization by decimal scaling
Data Normalization
 Normalization by decimal scaling
 Transform the data by moving the decimal point of the values of a selected feature. The number of decimal places moved depends on the maximum absolute value of the feature.
 Let v be a given feature; it is normalized by

$$v'(i) = \frac{v(i)}{10^K}$$

 where K is the smallest integer such that $\max_i |v'(i)| < 1$
Data Normalization
 Min-Max Normalization
 Transforms data from a range [min(v), max(v)] into the range [0,1]
 It is based on a-priori knowledge of the min/max values
 Operatively, we use the sample minimum and maximum values
 General form, for a target range [new_min, new_max]:

$$v'(i) = \frac{\bigl(v(i) - \min_i v(i)\bigr)\,(\text{new\_max} - \text{new\_min})}{\max_i v(i) - \min_i v(i)} + \text{new\_min}$$

 Special case, for the target range [0,1]:

$$v'(i) = \frac{v(i) - \min_i v(i)}{\max_i v(i) - \min_i v(i)}$$
Data normalization
 Z-score normalization (standardization)
 Transforms data by converting the values to a common scale with an average of zero and a standard deviation of one.
 A value v(i) of the feature v is normalized to v'(i) by computing:

$$v'(i) = \frac{v(i) - \text{mean}(v)}{\text{sd}(v)}$$

 mean(v) and sd(v) are the mean and standard deviation of the feature v
 Drawback: the data are completely modified
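A minimal sketch of the three normalizations above with NumPy (the sample feature values are illustrative):

```python
import numpy as np

v = np.array([120.0, 350.0, -80.0, 990.0, 45.0])  # illustrative feature values

# Decimal scaling: divide by 10^K so that max |v'| < 1
K = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal_scaled = v / 10**K

# Min-max normalization into [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: zero mean, unit standard deviation
z_score = (v - v.mean()) / v.std()

print(decimal_scaled, min_max, z_score, sep="\n")
```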
Data Smoothing
 Estimate f(x) when the shape is unknown, but it is assumed to be smooth
 The general idea is to group data points that are expected to have similar expectations and compute the average, or fit a simple parametric model (see the sketch below)
 Smoothers can be used to discretize continuous features (turning them into discrete ones)
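A minimal sketch of the grouping-and-averaging idea as a moving-average smoother with pandas (the window size and the synthetic data are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative noisy signal: a smooth trend plus random noise
x = np.linspace(0, 10, 200)
y = pd.Series(np.sin(x) + np.random.default_rng(0).normal(scale=0.3, size=x.size))

# Group neighbouring points and average them: a rolling-mean smoother
smoothed = y.rolling(window=15, center=True).mean()
```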
Data: differences and ratios
 Data transformations can also be based on simple arithmetic operations on feature values, aimed at obtaining new features
 Example: Body Mass Index (BMI), body mass divided by the square of the body height (kg/m²)
Missing data
• Even in a well-designed and controlled study, missing data occur
• Missing data can reduce the statistical power of a study and can produce biased estimates, leading to invalid conclusions
• Missing data (or missing values) are defined as data values that are not stored for a variable in the observation of interest
Missing data
 Is there any reason? "Difficulties" (of any type you can think of) in computing/obtaining/measuring a feature value on a specific sample
 Rubin (Rubin DB. Inference and missing data. Biometrika, 63:581–592, 1976) first described and divided the types of missing data according to the assumptions on the reasons for the missingness
 There are three types of missing data, according to the mechanisms of missingness:
 Missing completely at random (MCAR)
 Missing at random (MAR)
 Missing not at random (MNAR)
Missing data: MCAR
• MCAR: the probability that the data are missing is related neither to the specific value which is supposed to be obtained nor to the set of observed responses
• When data are missing by design, because of an equipment failure, or because the samples are lost in transit or technically unsatisfactory, such data are regarded as MCAR
• The statistical advantage of MCAR is that any (statistical) analysis remains unbiased
• Power may be lost in the design, but the estimated parameters are not biased by the absence of the data
Missing data: MCAR
• The presence of MCAR can be checked by splitting the samples into two groups (with and without missing data) and performing a t-test on the differences in the feature averages, to highlight whether the two groups of samples have (or do not have) significant differences; a sketch follows below
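A minimal sketch of this check with SciPy (the DataFrame and the column names are illustrative):

```python
import pandas as pd
from scipy import stats

# Illustrative dataset: 'income' has missing values, 'age' is fully observed
df = pd.DataFrame({
    "age":    [23, 45, 31, 52, 60, 38, 27, 49],
    "income": [30e3, None, 42e3, 58e3, None, 39e3, 31e3, 55e3],
})

# Split samples by whether 'income' is missing, compare 'age' across groups
missing = df[df["income"].isna()]["age"]
observed = df[df["income"].notna()]["age"]

t, p = stats.ttest_ind(missing, observed, equal_var=False)
# A large p-value gives no evidence against MCAR for this pair of features
print(f"t = {t:.2f}, p = {p:.3f}")
```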
Missing data: MAR
• MAR: the probability that the responses are missing depends on the set of observed responses, but is not related to the specific missing values which are expected to be obtained
• More common than MCAR
• As we tend to consider randomness as not producing bias, we may think that MAR does not present a problem.
• However, MAR does not mean that the missing data can be ignored.
Missing data: MNAR
• If missing data do not meet the conditions of MCAR or MAR, then they fall into the category of missing not at random (MNAR)
• Non-ignorable missingness
• The missing values are not randomly distributed over the observations, and the probability of finding a missing datum cannot be estimated using the variables in the model
• Cases of MNAR data are problematic
• Treatment: replace the missing data based on some external a-priori knowledge of the learning task
Missing data
 How are missing values treated?
 There is no simple and safe solution to cases where a significant number of missing values appears
 Try to evaluate the importance of missing data by experimenting with learning techniques with and without the attributes that present these data
 Methods for processing missing data:
 Sequential (or pre-processing) methods: the incomplete dataset is converted into a complete dataset
 Parallel methods, in which missing values are handled within the knowledge acquisition process and the learning algorithm is modified to manage such data
Missing data: sequential methods
 Dataset reduction (the simplest solution)
 Elimination of examples with missing values (listwise or casewise deletion)
 Solution to be used when the dataset size is large and/or the missing value percentage is low
 Loss (sometimes significant) of information
 Replacement of missing values with constant values (see the sketch below)
 a global value (usually the most common feature value)
 the average of the corresponding feature (for numerical attributes)
 the per-class average of the feature (in classification problems)
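A minimal sketch of these replacement strategies with pandas (the DataFrame and the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "color":  ["red", None, "blue", "red", None],   # categorical feature
    "height": [1.75, 1.62, None, 1.80, 1.68],       # numerical feature
    "label":  ["A", "A", "B", "B", "A"],            # class (output)
})

# Most common value for a categorical feature
df["color"] = df["color"].fillna(df["color"].mode()[0])

# Feature average for a numerical feature
df["height"] = df["height"].fillna(df["height"].mean())

# Alternative for classification problems: fill with the per-class average
# df["height"] = df.groupby("label")["height"].transform(lambda s: s.fillna(s.mean()))
```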
Missing data: sequential methods
 Global closest fit: replace the missing value with the corresponding value from the most similar sample
 Comparison between two feature vectors (the one containing the missing value and a closest-fit candidate)
 some distance between these two vectors is calculated
 the search is carried out over all samples
 the vector with the minimum distance is used to determine the missing value
Missing data: sequential methods
 Distance measure adopted by Global Closest Fit:

$$dist(x, y) = \sum_{i=1}^{n} dist(x_i, y_i)$$

$$dist(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{if } x_i, y_i \text{ are symbolic and } x_i \neq y_i \text{, or } x_i = {?} \text{ or } y_i = {?} \\ \dfrac{|x_i - y_i|}{r} & \text{if } x_i \text{ and } y_i \text{ are numbers and } x_i \neq y_i \end{cases}$$

r = difference between the maximum and the minimum value of the feature containing the missing value

 Class Closest Fit is used in classification problems
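A minimal sketch of this distance and of the closest-fit search in Python (the samples, the ranges and the use of None as the '?' missing marker are illustrative):

```python
def gcf_distance(x, y, ranges):
    """Global Closest Fit distance between two samples x and y.

    ranges[i] is max - min of feature i over the whole dataset;
    None stands for a missing value ('?').
    """
    total = 0.0
    for xi, yi, r in zip(x, y, ranges):
        if xi is None or yi is None:
            total += 1                     # missing value: maximal penalty
        elif xi == yi:
            total += 0                     # equal values, any feature type
        elif isinstance(xi, str):
            total += 1                     # differing symbolic values
        else:
            total += abs(xi - yi) / r      # differing numeric values
    return total

# Illustrative samples (age, eye color); the age range over the dataset is 40
ranges = [40, None]
x = [25, "blue"]
candidates = [[30, "blue"], [60, "brown"], [26, None]]
best = min(candidates, key=lambda y: gcf_distance(x, y, ranges))  # [30, "blue"]
```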


Missing data: other strategies
• Maximum likelihood: assumes that the observed data are a sample drawn from a multivariate normal distribution
• Compute the statistical parameters using the available data
• Estimate the missing data based on the parameters which have just been obtained from the known data
Missing data: other strategies
• Expectation-Maximization (EM): create a new dataset in which all missing values are imputed with values estimated by maximum likelihood methods
• Expectation step: the parameters (e.g., variances, covariances, and means) are estimated (using listwise deletion)
• These estimates are then used to create regression equations to predict the missing data
• Maximization step: uses those equations to fill in the missing data
Missing data: EM
• The expectation step is then repeated with the new parameters, and new regression equations are determined to "fill in" the missing data
• The expectation and maximization steps are repeated until the system stabilizes, i.e. when the covariance matrix for the subsequent iteration is virtually the same as that for the preceding iteration
Missing data: EM
• Advantage
• when the new dataset with no missing values is generated, a random disturbance term for each imputed value is incorporated in order to reflect the uncertainty associated with the imputation
• Drawbacks
• long time to converge (especially when there is a large fraction of missing data)
• can lead to biased parameter estimates
• can underestimate the standard error
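As a sketch, scikit-learn's IterativeImputer implements a closely related regression-based imputation loop (MICE-style, not the exact EM algorithm described above): each feature with missing values is modelled as a regression on the others, and estimation and imputation alternate until the values stabilize. The data matrix here is illustrative, and the class is marked experimental in scikit-learn:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Illustrative data matrix with missing entries (np.nan)
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

# Alternate regression-based estimation and imputation until convergence
imputer = IterativeImputer(max_iter=10, random_state=0)
X_complete = imputer.fit_transform(X)
```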
Outliers
 Data points that deviate markedly from the others
 An exact definition of an outlier often depends on hidden assumptions regarding the data structure and the applied detection method
 Some general (enough) definitions:
 An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism (Hawkins, Identification of Outliers, Chapman and Hall, 1980)
Outliers
 Some general (enough) definitions:
 An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs (Barnett and Lewis, Outliers in Statistical Data, John Wiley, 1994)
 An outlier is an observation in a dataset which appears to be inconsistent with the remainder of that set of data (Johnson, Applied Multivariate Statistical Analysis, Prentice Hall, 1992)
Outliers: detection analysis
 Detecting outliers is an important task for many
applications
 credit card fraud detection
 clinical trials
 voting irregularity analysis
 data cleansing
 network intrusion
 weather prediction
 geographic information systems
 …….
Outlier detection methods: a taxonomy
 Outlier detection methods can be divided into
 univariate methods (i.i.d. assumption)
 multivariate methods
 parametric (statistical) methods
 a known underlying distribution of the observations is assumed, or
 they are based on statistical estimates of unknown distribution parameters
 these methods flag as outliers those observations that deviate from the model assumptions
 often unsuitable for high-dimensional data
 nonparametric (model-free) methods
 distance-based methods: based on local distance measures
 clustering techniques: clusters of small size can be considered as clustered outliers
Outlier detection: univariate statistical methods
 Assumptions
 a known underlying distribution of the data (identically and independently distributed, i.i.d.)
 or knowledge of the distribution parameters and of the type of expected outliers
 Simplest mechanism
 compute the mean and variance of the data
 define a threshold (as a function of the variance)
 values exceeding the fixed threshold are considered outliers
Outlier detection: univariate statistical methods
 Simplest mechanism: example
 age = {3, 56, 23, 39, 156, 52, 41, 22, 9, 28, 139, 55, 20, -67, 37, 11, 55, 45, 37}
 mean = 39.9
 standard deviation = 45.65
 threshold = mean ± 2 std → [-51.4, 131.2]
 age is positive → [0, 131.2]
 Observations 156, -67 and 139 are outliers
 possible typing errors
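A minimal sketch reproducing this mechanism with NumPy:

```python
import numpy as np

age = np.array([3, 56, 23, 39, 156, 52, 41, 22, 9, 28,
                139, 55, 20, -67, 37, 11, 55, 45, 37])

mean, std = age.mean(), age.std()
low, high = mean - 2 * std, mean + 2 * std
low = max(low, 0)                       # age is positive

outliers = age[(age < low) | (age > high)]
print(outliers)                         # -> [156 139 -67]
```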
Outlier detection: multivariate methods
 Methods based on distance
 Outliers are those samples which have no close neighbours
 A sample s_i is an outlier if there exists a subset of p other samples having a distance from s_i greater than a given value d
 The parameters p and d are assigned a priori
 High computational complexity
 Some distance measure has to be computed for all samples in the multidimensional dataset
Outlier detection: multivariate methods
 Mahalanobis distance
 Let V_n be the covariance matrix of the data:

$$V_n = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)(x_i - \bar{x}_n)^T$$

 The Mahalanobis distance of a single sample is:

$$M_i = \left( (x_i - \bar{x}_n)^T \, V_n^{-1} \, (x_i - \bar{x}_n) \right)^{1/2}, \qquad i = 1, \ldots, n$$

 outliers: samples with a large value of M_i
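A minimal sketch of the computation with NumPy (the data matrix, the planted outlier and the cutoff of 3 are illustrative):

```python
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))   # illustrative samples
X = np.vstack([X, [8.0, 8.0, 8.0]])                  # one planted outlier

x_bar = X.mean(axis=0)
V = np.cov(X, rowvar=False)          # covariance matrix V_n (1/(n-1) form)
V_inv = np.linalg.inv(V)

diff = X - x_bar
M = np.sqrt(np.einsum("ij,jk,ik->i", diff, V_inv, diff))  # Mahalanobis M_i

# Samples with large M_i are flagged: the planted outlier, plus possibly
# a few extreme normal samples
print(X[M > 3])
```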


Outlier detection: multivariate methods
 Methods based on deviation
 Based on dissimilarity functions (sequential-exception technique)
 Example: the total variance of the data
 The basic characteristics of the set of samples are established
 Samples with values deviating from these characteristics are labelled as outliers
Visual data analysis
• What is Visual Data Analysis best used for?
• Making it easier for human beings to understand data
• Allowing the human eye-brain system to easily identify patterns in the data
• Facilitating "data-driven hypothesis generation" (in contrast to methods of "hypothesis testing")
Visual data analysis
• Many visualization methods have been developed to represent large amounts of information and to examine it
• histogram, line chart, table, pie chart, bar chart, scatter plot, bubble plot, area chart, flow chart
• Venn diagram, data flow diagram, entity-relationship diagram
• heat-map, tree map, …
Visual representation of a single variable: histogram
• A histogram visualizes the distribution of data over a continuous interval or a certain time period
• Each bar in a histogram represents the tabulated frequency in each interval/bin
• It lets us estimate how the values are concentrated, what the extremes are and whether there are any gaps or unusual values
• Useful for getting a rough view of the probability distribution
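A minimal sketch with Matplotlib (the sample data and the bin count are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.random.default_rng(0).normal(loc=170, scale=10, size=500)

# Each bar counts how many values fall in the corresponding bin
plt.hist(data, bins=20, edgecolor="black")
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()
```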
Visual representation of a single variable: Pareto chart
 A Pareto chart is a variation of the histogram
 It contains both bars and a line graph (Lorenz curve), where individual values are represented in descending order by the bars, and the cumulative total is represented by the line
 categories are arranged so that the most frequent one is on the left of the graph, followed by those of lower frequency
 it allows one to establish which are the major factors that influence a given phenomenon
 it highlights the most important among a set of factors
Visual representation of a single variable: box-plot
 The box-and-whiskers plot (box-plot) is a method for graphically depicting groups of numerical data through their quartiles
 The box includes the median and is bounded by the 25th and 75th percentiles
 The whiskers mark the minimum and maximum values
 It displays variation in samples of a statistical population without making any assumption on the underlying statistical distribution (non-parametric)
Visual representation of a single variable: box-plot
 The spacings between the different parts of the box indicate
 the degree of dispersion (spread)
 the skewness in the data
 and show the outliers
 Example: 80, 75, 90, 95, 65, 65, 80, 85, 70, 100
 order the data in ascending order
 determine the first quartile (70), the median (80), the third quartile (90), the largest (100) and the smallest value (65)
Visual representation of a single variable: box-plot
[Figure: box-plot of the ordered data 65, 65, 70, 75, 80, 80, 85, 90, 95, 100, with the first quartile (70), the median/second quartile (80) and the third quartile (90) marked on an axis running from 65 to 100]
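A minimal sketch reproducing this box-plot with Matplotlib:

```python
import matplotlib.pyplot as plt

data = [80, 75, 90, 95, 65, 65, 80, 85, 70, 100]

# The box spans the quartiles (70 to 90) with the median (80) inside;
# the whiskers reach the smallest (65) and largest (100) values
plt.boxplot(data, vert=False)
plt.show()
```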
Scatter-plot
• A scatter plot (scatter chart, scatter diagram, scatter graph) is a 2-D plot displaying the joint variation of two data items
• It shows data in Cartesian coordinates in a graphic which displays the relationship existing between two variables (x and y axes)
• It allows one to determine (in visual form) whether the data points are related or not
• How are the data spread?
• Are they closely related?
Scatter-plot
 Explanatory variable (independent variable)
 Response variable (dependent variable)
Scatter-plot
 Some statistical properties can be observed using a scatter-plot
 Dispersion, linear correlation, outliers
 Positive association (increasing trend)
 Negative association (decreasing trend)
 Lack of association (cloud pattern)
Bubble-plot
• A bubble plot plots three-valued data points and shows the relationship existing among a minimum of three variables
• Two of them are the plot axes, while the third is represented by the bubble size
• Each bubble is an observation
• Colors can be used to represent an additional measure
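A minimal sketch with Matplotlib (the three illustrative variables are x, y and a size variable z):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y, z = rng.random(30), rng.random(30), rng.random(30)

# x and y give the position, z drives the bubble size (scaled for visibility)
plt.scatter(x, y, s=1000 * z, alpha=0.5)
plt.show()
```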
Scatter-matrix
 A scatter matrix consists of several pair-wise scatter plots of variables presented in a matrix format
 It can be used to determine whether the variables are correlated and whether the correlation is positive or negative
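A minimal sketch using pandas (the DataFrame and its columns are illustrative):

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = 2 * df["a"] + rng.normal(scale=0.5, size=100)   # correlated with a
df["c"] = rng.normal(size=100)                            # uncorrelated

# One pair-wise scatter plot per pair of variables; histograms on the diagonal
scatter_matrix(df, diagonal="hist")
plt.show()
```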
Co-plot and Trellis plot
 Trellis Graphics: a family of techniques for viewing complex, multi-variable data sets (formalized by researchers at Bell Laboratories during the 1990s)
 Multi-panel conditioning plots
Co-plot and Trellis plot
 Coplot: a sequence of conditioned scatter-plots
 each panel corresponds to a particular range of values of a third variable, called the conditioning factor
 It allows one to highlight how an output variable depends on an input variable, given other descriptive variables
 Different ways of representation
 Given panels: variability intervals of the conditioning variable
 Dependence panels: bivariate scatterplots of the "susceptible" variable with respect to the remaining descriptive variables
Example: IRIS data
 Iris flower classification problem
 Three classes: Setosa, Versicolor, Virginica
[Figure: photographs of Iris Setosa, Iris Versicolor and Iris Virginica]
Example: IRIS data
 150 samples
 4 features:
 Length and width of the sepals (elements of the calyx of the flower)
 Length and width of the petals (elements of the corolla of the flower)
 Each sample is a 5-D vector with 4 continuous values and 1 categorical feature

Iris features (input)    class (output)
5.4  3.9  1.7  0.4       Iris-virginica
Example: IRIS data
 Bivariate scatterplot
 (Sepal length vs Petal length)
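A minimal sketch of this plot using scikit-learn's bundled copy of the dataset:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
sepal_length = iris.data[:, 0]   # feature 0: sepal length (cm)
petal_length = iris.data[:, 2]   # feature 2: petal length (cm)

# Color each point by its class: Setosa, Versicolor or Virginica
plt.scatter(sepal_length, petal_length, c=iris.target)
plt.xlabel("sepal length (cm)")
plt.ylabel("petal length (cm)")
plt.show()
```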
Scatterplot matrix: IRIS data
Box-and-whisker plot: IRIS data
Trellis diagram: Iris Data
 Trellis diagram
 3D graph
 Sepal length (x-axis)
 Sepal width (y-axis)
 Petal length (z-axis)
 Conditioned on
 Petal width
 Iris class
Trellis diagrams: Iris Data
[Figure: trellis panels conditioned on class and petal length; panel titles: setosa / Petal L.: [1.0, 4.4], setosa / Petal L.: [4.4, 7.1], versicolor / Petal L.: [1.0, 4.4], versicolor / Petal L.: [4.4, 7.1]]
