Multivariate


Multivariate Methods

Multivariate Analysis

• Many statistical techniques focus on just one or two variables.

• Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once.
Multivariate Data
• In many applications, several measurements are made on each individual or event, generating an observation vector. The sample may be viewed as an N × d data matrix X,
• where the d columns correspond to d variables denoting the results of measurements made on an individual or event.
Multivariate Data
• These are also called inputs, features, or
attributes. The N rows correspond to independent
and identically distributed observations,
examples, or instances on N individuals or events.
• For example, in deciding on a loan application,
an observation vector is the information
associated with a customer and is composed of
age, marital status, yearly income, and so forth,
and we have N such past customers.
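As a minimal sketch (with hypothetical customer fields), such a sample can be stored as an N × d array, one row per past customer and one column per variable:

```python
import numpy as np

# Hypothetical loan-application data: one row per past customer,
# columns = d variables (age in years, married flag, yearly income).
X = np.array([
    [25, 0, 32000.0],
    [41, 1, 58000.0],
    [33, 1, 45000.0],
    [52, 0, 61000.0],
])
N, d = X.shape   # N = 4 observations, d = 3 variables
print(N, d)
```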
Multivariate Data
• These measurements may be of different
scales, for example, age in years and yearly
income in monetary units.
• Some like age may be numeric, and some like
marital status may be discrete.
• Typically these variables are correlated.
• If they are not, there is no need for a
multivariate analysis.
Multivariate Data
• Our aim may be simplification, that is, summarizing this large body of data by means of relatively few parameters; or our aim may be exploratory, and we may be interested in generating hypotheses about the data.
• In some applications, we are interested in
predicting the value of one variable from the
values of other variables.
Multivariate Data
• If the predicted variable is discrete, this is
multivariate classification, and if it is numeric,
this is a multivariate regression problem.
• Types of data
– Nominal, e.g., month, name of the department
– Ordinal, e.g., rank (low, medium, high)
– Interval, e.g., continuous data such as temperature
– Ratio, e.g., percentages
Parameter Estimation
• The mean vector μ is defined such that each of its
elements is the mean of one column of X:
E[x] = μ = [μ1, . . . , μd]T

• The variance of Xi is denoted σi², and the covariance of two variables Xi and Xj is defined as
σij ≡ Cov(Xi, Xj) = E[(Xi − μi)(Xj − μj)] = E[XiXj] − μiμj
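A quick NumPy sketch of estimating these parameters from a small, illustrative data matrix: the column means give the sample mean vector, and np.cov gives the sample covariance matrix.

```python
import numpy as np

X = np.array([[2.0, 4.0], [1.0, 3.0], [0.0, 1.0], [-1.0, 0.5]])

mu = X.mean(axis=0)           # sample estimate of the mean vector
S = np.cov(X, rowvar=False)   # sample covariance matrix (unbiased, divides by N - 1)

# E[XiXj] - mu_i mu_j with sample averages gives the biased estimate (divides by N)
s01 = (X[:, 0] * X[:, 1]).mean() - mu[0] * mu[1]
print(mu, S, s01)
```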
Parameter Estimation
• If two variables are related in a linear way, then
the covariance will be positive or negative
depending on whether the relationship has a
positive or negative slope.
• But the size of the relationship is difficult to
interpret because it depends on the units in
which the two variables are measured.
• If two variables are independent, then their
covariance, and hence their correlation, is 0.
Covariance matrix
• The variances σi² and covariances σij can be collected into the d × d covariance matrix
Σ ≡ Cov(X) = E[(X − μ)(X − μ)T]
Estimation of Missing Values
• Frequently, values of certain variables may be
missing in observations.
• The best strategy is to discard those observations altogether, but generally we do not have large enough samples to be able to afford this, and we do not want to lose data as the non-missing entries do contain information.
Estimation of Missing Values
• We try to fill in the missing entries by
estimating them. This is called imputation.
• In mean imputation, for a numeric variable, we
substitute the mean (average) of the available
data for that variable in the sample.
• For a discrete variable, we fill in with the most
likely value, that is, the value most often seen
in the data.
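A minimal sketch of mean imputation for a numeric column and most-frequent-value imputation for a discrete column, using pandas (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [32000.0, np.nan, 45000.0, 61000.0],                # numeric variable
    "marital_status": ["single", "married", None, "married"],     # discrete variable
})

# Numeric: substitute the mean of the available data.
df["income"] = df["income"].fillna(df["income"].mean())
# Discrete: substitute the most frequently seen value (the mode).
df["marital_status"] = df["marital_status"].fillna(df["marital_status"].mode()[0])
print(df)
```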
Estimation of Missing Values
• In imputation by regression, we try to predict
the value of a missing variable from other
variables whose values are known for that
case.
• Depending on the type of the missing variable,
we define a separate regression or
classification problem that we train by the data
points for which such values are known.
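A sketch of imputation by regression under the same hypothetical columns: the variable with missing entries is treated as the target, a regressor is trained on the rows where it is known, and the missing entries are predicted (scikit-learn):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age":    [25, 41, 33, 52, 29],
    "income": [32000.0, 58000.0, np.nan, 61000.0, np.nan],
})

known = df["income"].notna()
reg = LinearRegression().fit(df.loc[known, ["age"]], df.loc[known, "income"])
df.loc[~known, "income"] = reg.predict(df.loc[~known, ["age"]])   # fill the missing entries
print(df)
```

With several variables missing at once, the means would serve as initial estimates and this prediction step would be iterated until the imputed values stabilize, as described next.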
Estimation of Missing Values
• If many different variables are missing, we
take the means as the initial estimates and the
procedure is iterated until predicted values
stabilize.
• Depending on the context, however,
sometimes the fact that a certain attribute
value is missing may be important.
Estimation of Missing Values
• For example, in a credit card application, if the
applicant does not declare his or her telephone
number, that may be a critical piece of
information.
• In such cases, this is represented as a separate
value to indicate that the value is missing and
is used as such.
Normal Distribution
Multivariate Normal Distribution
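These slides present the (multivariate) normal density; as a hedged sketch, it can be evaluated and sampled with SciPy (the mean vector and covariance matrix below are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])               # mean vector
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])          # covariance matrix (symmetric, positive definite)

rv = multivariate_normal(mean=mu, cov=Sigma)
print(rv.pdf([0.5, 1.0]))               # density at a point
print(rv.rvs(size=3, random_state=0))   # a few random draws
```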
Multivariate classification
• Binary classification
– Spam or not
• Multi-class classification
– Email foldering, topic classification, scene classification
• Examples
– KNN, Naïve Bayes
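A brief sketch comparing the two example classifiers on a standard multi-class dataset (scikit-learn; the dataset and split are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)      # multi-class problem, 4 features per observation
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
nb = GaussianNB().fit(X_tr, y_tr)
print("KNN accuracy:", knn.score(X_te, y_te))
print("Naive Bayes accuracy:", nb.score(X_te, y_te))
```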
Scenario
One vs All
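One-vs-all (one-vs-rest) reduces a multi-class problem to several binary ones: one classifier per class, each trained to separate that class from all the others. A minimal sketch with scikit-learn (the logistic-regression base estimator is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary classifier is fitted per class (class k vs. the rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))    # 3 binary classifiers for 3 classes
print(ovr.predict(X[:5]))
```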
Dimensionality Reduction
Data Dimensionality
• From a theoretical point of view, increasing the number of features should lead to better performance.

• In practice, the inclusion of more features leads to worse performance (i.e., the curse of dimensionality).

• The number of training examples required increases exponentially with dimensionality.
Dimensionality Reduction
• Significant improvements can be achieved by
first mapping the data into a lower-dimensional
space.
Dimensionality Reduction
• Dimensionality can be reduced by:

− Combining features using a linear or non-linear transformation.

− Selecting a subset of features (i.e., feature selection).
Dimensionality Reduction
• Linear combinations are particularly attractive
because they are simple to compute and
analytically tractable.
• Given x ∈ RN, the goal is to find an N × K matrix U such that:

y = UTx ∈ RK, where K << N
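A small sketch of such a linear mapping: project an N-dimensional point onto K directions stored in the columns of U (here U is an arbitrary orthonormal basis, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 2                                     # original and reduced dimensionality

x = rng.normal(size=N)                          # a point in R^N
U = np.linalg.qr(rng.normal(size=(N, K)))[0]    # N x K matrix with orthonormal columns

y = U.T @ x                                     # y = U^T x lives in R^K
print(y.shape)                                  # (2,)
```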


Dimensionality Reduction
• Represent data in terms of basis vectors in a
lower dimensional space (embedded within the
original space).
(1) Higher-dimensional space representation: x = a1v1 + a2v2 + · · · + aNvN, where v1, . . . , vN is a basis of the original N-dimensional space.
Dimensionality Reduction
(2) Lower-dimensional sub-space representation: x̂ = b1u1 + b2u2 + · · · + bKuK, where u1, . . . , uK is a basis of the K-dimensional subspace.
Dimensionality Reduction
• Classical approaches for finding an optimal linear transformation:
– Principal Components Analysis (PCA): Seeks a projection that preserves as much information in the data as possible (in a least-squares sense).
– Linear Discriminant Analysis (LDA): Seeks a projection that best separates the data (in a least-squares sense).
PCA Algorithm
• PCA algorithm:
– 1. X ← create the N × d data matrix, with one row vector xn per data point
– 2. X ← subtract the mean x̄ from each row vector xn in X
– 3. Σ ← covariance matrix of X
– 4. Find the eigenvectors and eigenvalues of Σ
– 5. PCs ← the M eigenvectors with the largest eigenvalues
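A direct NumPy sketch of these steps (mean-centre, covariance matrix, eigendecomposition, keep the M eigenvectors with the largest eigenvalues); the small dataset reuses the worked example below:

```python
import numpy as np

def pca(X, M):
    # X: (N, d) data matrix, one row vector xn per data point.
    X_centered = X - X.mean(axis=0)            # step 2: subtract the mean from each row
    Sigma = np.cov(X_centered, rowvar=False)   # step 3: covariance matrix of X
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # step 4: eigenvalues/eigenvectors (Sigma is symmetric)
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
    PCs = eigvecs[:, order[:M]]                # step 5: M eigenvectors with the largest eigenvalues
    return X_centered @ PCs, PCs

X = np.array([[2.0, 4.0], [1.0, 3.0], [0.0, 1.0], [-1.0, 0.5]])
Z, PCs = pca(X, M=1)
print(PCs.ravel())    # direction of the first principal component (~[0.61, 0.79], up to sign)
```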
Principal Components
Example
A city can be rated on various parameters like

• Climate and Terrain
• Housing
• Health Care & the Environment
• Crime
• Transportation
• Education
• The Arts
• Recreation
• Economics
Example
                  Principal Component
Variable            1       2       3       4       5
Climate           0.158   0.069   0.800   0.377   0.041
Housing           0.384   0.139   0.080   0.197  -0.580
Health            0.410  -0.372  -0.019   0.113   0.030
Crime             0.259   0.474   0.128  -0.042   0.692
Transportation    0.375  -0.141  -0.141  -0.430   0.191
Education         0.274  -0.452  -0.241   0.457   0.224
Arts              0.474  -0.104   0.011  -0.147   0.012
Recreation        0.353   0.292   0.042  -0.404  -0.306
Economy           0.164   0.540  -0.507   0.476  -0.037


Example
 x     y
 2     4
 1     3
 0     1
-1     0.5

Calculate the covariance matrix.


Example
 x     y      A = xi − x̄    B = yi − ȳ     A·B      A²      B²
 2     4         1.5           1.875        2.81     2.25    3.52
 1     3         0.5           0.875        0.44     0.25    0.77
 0     1        -0.5          -1.125        0.56     0.25    1.27
-1     0.5      -1.5          -1.625        2.44     2.25    2.64

x̄ = 0.5,   ȳ = 2.125


Example
• Cov(x,x) = Σ(i=1..n) (xi − x̄)² / (n − 1) = 5/3 = 1.67
• Cov(y,y) = 8.1875/3 = 2.73
• Cov(x,y) = 6.25/3 = 2.083

A = | 1.67    2.083 |
    | 2.083   2.73  |
Example
|A − λI| = 0
(1.67 − λ)(2.73 − λ) − 2.083² = 0
⇒ λ² − 4.4λ + 0.2202 = 0
⇒ λ1 = 4.3494,   λ2 = 0.0506
Example

z1 = a11 x1 + a12 x2
z2 = a21 x1 + a22 x2

For λ1 = 4.3494, (A − λ1 I)a = 0 gives:
−2.6794 a11 + 2.083 a12 = 0
2.083 a11 − 1.6194 a12 = 0

⇒ a11 = 0.61,   a12 = 0.79

Similarly, a21 and a22 are obtained from λ2 = 0.0506.
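These hand calculations can be checked numerically; the eigenvalues and the leading eigenvector of A should match λ1 ≈ 4.35, λ2 ≈ 0.05 and (a11, a12) ≈ (0.61, 0.79) up to rounding and sign:

```python
import numpy as np

A = np.array([[1.670, 2.083],
              [2.083, 2.730]])          # covariance matrix from the example

eigvals, eigvecs = np.linalg.eigh(A)    # eigh: ascending eigenvalues for a symmetric matrix
print(eigvals)                          # approximately [0.0506, 4.3494]
print(eigvecs[:, -1])                   # leading eigenvector, approximately [0.61, 0.79]
```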
• The most important (principal) eigenvector points in the direction along which the variables are most strongly correlated, that is, the direction of greatest variance in the data.
• The eigenvectors with the largest eigenvalues are chosen as the principal components.
• Data with n features yields n eigenvectors, of which we keep p principal components, with p < n.
• Final data = row feature vector × row data adjust (the chosen eigenvectors multiplied by the mean-adjusted data).
PCA Vs LDA
• Principal Component Analysis (PCA) is an unsupervised learning algorithm: it ignores the class labels and finds the directions that maximize the variance in the dataset.
• Note that PCA does not select a subset of features and discard the others; instead, it infers new features from the existing ones that best describe the data.
PCA Vs LDA
• PCA works on eigenvectors and eigenvalues of the
covariance matrix, which is the equivalent of fitting
those straight, principal-component lines to the
variance of the data. 
• Linear Discriminant Analysis is a supervised
algorithm as it takes the class label into
consideration.
• It is a way to reduce ‘dimensionality’ while at the
same time preserving as much of the class
discrimination information as possible.
PCA Vs LDA
• LDA helps you find the boundaries around clusters of classes. It projects data points onto a line so that the clusters are as well separated as possible, with each cluster as tight as possible around its centroid.
• Basically, LDA finds the centroid of each class. For example, with thirteen different features, LDA finds the centroid of each class in the thirteen-dimensional feature space.
PCA Vs LDA
• On this basis, it then determines a new dimension, an axis that should satisfy two criteria:
1. Maximize the distance between the centroids of the classes.
2. Minimize the variation within each class (which LDA calls scatter, denoted s²).
Example
• Compute the Linear Discriminant projection for the following two-dimensional dataset.
• Samples for class ω1: X1 = (x1, x2) = {(4,2), (2,4), (2,3), (3,6), (4,4)}
• Samples for class ω2: X2 = (x1, x2) = {(9,10), (6,8), (9,5), (8,7), (10,8)}
Example
• Apply steps similar to those used in PCA and solve for the eigenvectors to obtain the projection.
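A sketch of this computation for the dataset above. For two classes the optimal direction can also be obtained in closed form as w ∝ SW⁻¹(m1 − m2), which is equivalent to the leading eigenvector of the scatter-matrix formulation; the code below uses that closed form, so treat it as one illustrative route:

```python
import numpy as np

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)    # class w1
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)  # class w2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)      # class centroids

# Within-class scatter: sum of outer products of the centred samples of each class.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(S_W, m1 - m2)              # direction proportional to S_W^{-1}(m1 - m2)
w = w / np.linalg.norm(w)
print(w)                                        # projection direction (up to sign)
print(X1 @ w, X2 @ w)                           # the two classes separate along this direction
```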
Factor Analysis

• A data reduction technique designed to represent a wide range of attributes on a smaller number of dimensions.
• A statistical approach that can be used to analyze interrelationships among a large number of variables and to explain these variables in terms of their common underlying dimensions.
Factor Analysis

• Factor analysis is a general name denoting a class of procedures primarily used for data reduction and summarization.
• Variables are not classified as either dependent or independent. Instead, the whole set of interdependent relationships among the variables is examined in order to define a set of common underlying dimensions called factors.
Purpose of Factor Analysis
• To identify underlying dimensions, called factors, that explain the correlations among a set of variables.
-- e.g., lifestyle statements may be used to measure the psychographic profile of consumers.
Purpose of Factor Analysis
• To identify a new, smaller set of uncorrelated
variables to replace the original set of
correlated variables for subsequent analysis
such as Regression or Discriminant Analysis.
• -- psychographic factors may be used as
independent variables to explain the difference
between loyal and non-loyal customers.
Exploratory FA
• Summarizing data by grouping correlated variables; investigating sets of measured variables related to theoretical constructs. The method is similar to principal components analysis.
• In factor analysis, we model the observed variables as linear functions of the "factors."
• In principal components, we create new variables that are linear combinations of the observed variables.
Steps
• Collect all of the variables Xi into a vector X for each individual subject. Let Xi denote observable trait i. These are the data from each subject, and they are collected into a vector of traits:
X = [X1, . . . , Xp]T
Steps
• This is a random vector, with a population mean. Assume that the vector of traits X is sampled from a population with population mean vector:
μ = [μ1, . . . , μp]T
Steps
• Consider m unobservable common factors f1, f2, . . . , fm. The ith common factor is fi. Generally, m is going to be substantially less than p.
• The common factors are also collected into a vector:
f = [f1, . . . , fm]T
Steps
• Our factor model can be thought of as a series of multiple regressions, predicting each of the observable variables Xi from the values of the unobservable common factors fi:
Xi = μi + li1 f1 + li2 f2 + · · · + lim fm + εi
Steps
• The regression coefficients lij (the partial
slopes) for all of these multiple regressions are
called factor loadings.
Steps
• And finally, the errors εi are called the specific factors. Here, εi is the specific factor for variable i. The specific factors are also collected into a vector:
ε = [ε1, . . . , εp]T
Steps
• In summary, the basic model is like a regression model. Each of our response variables Xi is predicted as a linear function of the unobserved common factors f1, f2, . . . , fm. Thus, our explanatory variables are f1, f2, . . . , fm, and we have m unobserved factors that control the variation among our data.
• We will generally reduce this to matrix notation, as shown here:
X = μ + Lf + ε
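A small sketch of fitting this model with scikit-learn's FactorAnalysis; its estimated loading matrix plays the role of L in X = μ + Lf + ε (the data below are synthetic and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n, p, m = 500, 5, 2                      # n subjects, p observed variables, m common factors

L_true = rng.normal(size=(p, m))         # true loadings
f = rng.normal(size=(n, m))              # common factors
eps = 0.3 * rng.normal(size=(n, p))      # specific factors (noise)
X = f @ L_true.T + eps                   # data generated from the factor model (zero mean here)

fa = FactorAnalysis(n_components=m).fit(X)
L_hat = fa.components_.T                 # (p, m) estimated loading matrix, up to rotation
print(L_hat.shape)
```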
Multidimensional Scaling
• Let us say that for N points we are given the distances between pairs of points, dij, for all i, j = 1, . . . , N. We do not know the exact coordinates of the points, their dimensionality, or how the distances were calculated.
• Multidimensional scaling (MDS) is the method for placing these points in a low-dimensional (for example, two-dimensional) space such that the Euclidean distance between them there is as close as possible to dij, the given distances in the original space.
Multidimensional Scaling
• In the archetypical example of multidimensional
scaling, we take the road travel distances between
cities, and after applying MDS, we get an
approximation to the map.
• The map is distorted such that in parts of the country
with geographical obstacles like mountains and lakes
where the road travel distance deviates much from
the direct bird-flight path (Euclidean distance), the
map is stretched out to accommodate longer
distances.
Multidimensional Scaling
• MDS can be used for dimensionality reduction
by calculating pairwise Euclidean distances in
the d-dimensional x space and giving this as
input to MDS, which then projects it to a
lower-dimensional space so as to preserve
these distances.
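A sketch of exactly this use: compute the pairwise Euclidean distances in the original d-dimensional space and hand them to MDS as a precomputed dissimilarity matrix (scikit-learn; the dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

X, _ = load_iris(return_X_y=True)                 # points in the original d-dimensional space
D = pairwise_distances(X, metric="euclidean")     # the given pairwise distances d_ij

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Z = mds.fit_transform(D)                          # 2-D placement that tries to preserve d_ij
print(Z.shape, mds.stress_)                       # stress_ measures the remaining mismatch
```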
Locally Linear Embedding
• Locally linear embedding (LLE) recovers
global nonlinear structure from locally linear
fits.
• The idea is that each local patch of the manifold can be approximated linearly and, given enough data, each point can be written as a linear, weighted sum of its neighbors.
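A sketch of LLE on a standard nonlinear manifold, the Swiss roll, where points lying on a rolled-up 2-D sheet in 3-D are unrolled from locally linear fits (scikit-learn; the neighbourhood size is an illustrative choice):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)   # 3-D points on a 2-D manifold

# Each point is reconstructed as a weighted sum of its nearest neighbours, and a 2-D
# embedding is found that preserves those local reconstruction weights.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Z = lle.fit_transform(X)
print(Z.shape)                                           # (1000, 2)
```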