Data analysis
Regression

Prof. Dr. Juan José Egozcue

Dep. Matemática Aplicada III


Universidad Politécnica de Catalunya
September 2008
Normal Correlation Linear regression Tasks

Summary

1 Normal distribution

2 Correlation

3 Linear regression
Linear model
Example ADCP

4 Tasks
Multiple regression Lab.
Bibliography

Multivariate normal distribution (non-singular)

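Multivariate random vector: $X = (X_1, \dots, X_n)'$

Sample space and support: $\mathbb{R}^n$
Probability density:

$$f_X(x) = \frac{1}{(2\pi)^{n/2} (\det \Sigma)^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right)$$

Mean: $E[X] = \mu$
Covariance: $\mathrm{Cov}[X] = E[(X - \mu)(X - \mu)'] = \Sigma$, with $\det \Sigma \neq 0$
The density is symmetric with respect to $\mu$
Sums, marginals and conditionals of jointly normal variables are normal
Sums of many independent non-normal variables are approximately normal (under the conditions of the central limit theorem)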

Simulation of a k-multivariate normal

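Goal: simulate $x \sim N(\mu, \Sigma)$ in dimension $k$
Factorise (Cholesky): $\Sigma = TT'$
Simulate $k$ independent $N(0,1)$ values → $u$
Compute $z = Tu$
Add the mean: $x = z + \mu$
Covariance of $z$:

$$\mathrm{Cov}[z] = E[zz'] = E[T u u' T'] = T \, E[uu'] \, T' = TT' = \Sigma$$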
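The simulation steps above can be sketched in Python with NumPy. This is only an illustration: the function name `simulate_mvn`, the values of `mu` and `sigma`, and the sample size are assumptions chosen for the example, not part of the slides.

```python
import numpy as np

def simulate_mvn(mu, sigma, n_samples, rng):
    """Draw n_samples vectors from N(mu, sigma) using the Cholesky factor."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    T = np.linalg.cholesky(sigma)                   # lower triangular, sigma = T @ T.T
    u = rng.standard_normal((n_samples, mu.size))   # each row: k independent N(0,1)
    z = u @ T.T                                     # each row is (T u)'
    return z + mu                                   # add the mean

# Illustrative (assumed) parameters
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])

x = simulate_mvn(mu, sigma, 20000, rng)
sample_mean = x.mean(axis=0)
sample_cov = np.cov(x, rowvar=False)
```

With a large sample, `sample_mean` and `sample_cov` should be close to `mu` and `sigma`, which is a quick empirical check of $\mathrm{Cov}[z] = \Sigma$.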



PP and QQ-plots

The Quantile-Quantile plot, QQ-plot

Goal: compare a distribution (cdf) to a reference cdf.


Scale: original variable
$F_R$: reference cdf
$F_X$: a distribution, e.g. the sample distribution
x-axis: values of the variate $X$ (observed quantile)
y-axis: $F_R^{-1}(F_X(x))$ (R-predicted quantile)
Representation of $F_X$: the points $(x, F_R^{-1}(F_X(x)))$
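A minimal sketch of how the QQ-plot coordinates could be computed for a sample, using only the Python standard library. The plotting positions $(i - 0.5)/n$ used for $F_X$ are an assumption of this sketch (other conventions exist), and `qq_points` is a hypothetical helper name.

```python
import random
from statistics import NormalDist

def qq_points(sample, reference=NormalDist(0.0, 1.0)):
    """Return (observed quantile, reference-predicted quantile) pairs."""
    xs = sorted(sample)
    n = len(xs)
    # (i - 0.5)/n stands in for F_X at the i-th ordered value; it keeps the
    # probabilities strictly inside (0, 1), where inv_cdf is defined.
    return [(x, reference.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(xs, start=1)]

random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(50)]   # simulated N(0,1), n = 50
points = qq_points(sample)
```

If the sample follows the reference distribution, the pairs in `points` should lie close to the diagonal.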
QQ-plot of a simulated sample

Figure: QQ-plot of a simulated sample N(0, 1), n = 50, against the reference N(0, 1). x-axis: Q observed; y-axis: Q predicted.

The Probability-Probability plot, PP-plot

Goal: compare a distribution (cdf) to a reference cdf.


Scale: probability scale
$F_R$: reference cdf
$F_X$: a distribution, e.g. the sample distribution
x-axis: $F_X(x)$ (probability of the X-quantile)
y-axis: $F_R(x)$ (probability of the R-predicted quantile)
Representation of $F_X$: the points $(F_X(x), F_R(x))$
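The PP-plot coordinates can be sketched in the same way, again with the standard library only. As before, the plotting positions $(i - 0.5)/n$ for $F_X$ and the helper name `pp_points` are assumptions of this example.

```python
import random
from statistics import NormalDist

def pp_points(sample, reference=NormalDist(0.0, 1.0)):
    """Return (F_X(x), F_R(x)) pairs for the ordered sample values."""
    xs = sorted(sample)
    n = len(xs)
    return [((i - 0.5) / n, reference.cdf(x))
            for i, x in enumerate(xs, start=1)]

random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(50)]   # simulated N(0,1), n = 50
points = pp_points(sample)

# Largest vertical distance of the points from the diagonal
max_gap = max(abs(p_obs - p_ref) for p_obs, p_ref in points)
```

Both coordinates live on the probability scale, so the plot is confined to the unit square and `max_gap` gives a rough measure of disagreement with the reference.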
PP-plot of a simulated sample

Figure: PP-plot of a simulated sample N(0, 1), n = 50, against the reference N(0, 1). x-axis: P observed; y-axis: P predicted.

Covariance

Covariance measures the linear association between two variables $X_i$ and $X_j$
Definition:

$$\sigma_{ij} = \mathrm{Cov}[X_i, X_j] = E[(X_i - \mu_i)(X_j - \mu_j)], \quad E[X_i] = \mu_i$$

$$\sigma_{ij} = \int_{x_i} \int_{x_j} (x_i - \mu_i)(x_j - \mu_j) \, f_{ij}(x_i, x_j) \, dx_i \, dx_j$$

Covariance matrix: $(\Sigma)_{ij} = \sigma_{ij}$



Properties of covariance

Variance of a linear combination $Y = a'X = \sum_i a_i X_i$:

$$0 \le \mathrm{Var}[Y] = a' \Sigma a$$

Non-negative definiteness: for all $a$, $a' \Sigma a \ge 0$
All eigenvalues of $\Sigma$ are non-negative
If $\det \Sigma = 0$, then $a'X$ is constant (zero variance) for some non-null $a$
If $\Sigma^{-1}$ exists, the equation $x' \Sigma^{-1} x = k$, with $k > 0$, defines an ellipsoid
Covariance is an inner product of (centred) random variables
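The first properties can be checked numerically. The sketch below uses an arbitrary covariance matrix `sigma` and coefficient vector `a` (both are assumptions made up for the illustration) and compares $a' \Sigma a$ with a Monte Carlo estimate of $\mathrm{Var}[a'X]$.

```python
import numpy as np

# Illustrative (assumed) covariance matrix and coefficients
sigma = np.array([[4.0, 1.2, 0.0],
                  [1.2, 1.0, 0.3],
                  [0.0, 0.3, 2.0]])
a = np.array([1.0, -2.0, 0.5])

var_y = a @ sigma @ a                      # Var[a'X] = a' Sigma a
eigenvalues = np.linalg.eigvalsh(sigma)    # eigenvalues of a symmetric matrix

# Monte Carlo cross-check: sample X ~ N(0, sigma) and take the variance of a'X
rng = np.random.default_rng(0)
x = rng.multivariate_normal(np.zeros(3), sigma, size=50000)
var_y_mc = (x @ a).var(ddof=1)
```

Here all entries of `eigenvalues` are positive, so this particular `sigma` is non-singular and `var_y` is strictly positive for any non-null `a`.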

Correlation

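Correlation is a standardisation of the covariance:

$$\rho_{ij} = \frac{\mathrm{Cov}[X_i, X_j]}{\sqrt{\mathrm{Var}[X_i] \, \mathrm{Var}[X_j]}} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}$$

Correlation is the covariance of standardised variables:

$$Y_i = \frac{X_i - \mu_i}{\sigma_i} \;\Rightarrow\; \mathrm{Cov}[Y_i, Y_j] = \rho_{ij}$$

It ranges over $-1 \le \rho_{ij} \le 1$ (Cauchy-Schwarz inequality)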
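Both readings of the correlation coefficient can be compared on a sample. In this sketch the data are simulated, and the dependence `x2 = 0.5 * x1 + noise` is an arbitrary assumption used only to produce correlated variables.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(500)
x2 = 0.5 * x1 + rng.standard_normal(500)   # illustrative dependence

# Correlation as standardised covariance
cov12 = np.cov(x1, x2)[0, 1]
rho_from_cov = cov12 / (x1.std(ddof=1) * x2.std(ddof=1))

# Correlation as covariance of standardised variables
y1 = (x1 - x1.mean()) / x1.std(ddof=1)
y2 = (x2 - x2.mean()) / x2.std(ddof=1)
rho_from_standardised = np.cov(y1, y2)[0, 1]
```

The two values agree up to floating-point error and match `np.corrcoef(x1, x2)[0, 1]`.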
Simulated bivariate normal samples

Figures: seven scatter plots of simulated bivariate normal samples with $n = 70$, $\mu_i = 0$, $\sigma_i = 1$ and $\rho_{12} = -0.99, -0.90, -0.50, 0.00, 0.50, 0.90, 0.99$.

Interpretation of correlation (I)

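Consider two r.v. linearly related:

$$Y - \mu_Y = \lambda \, (X - \mu_X)$$

Multiplying by $Y - \mu_Y$ and by $X - \mu_X$ in turn, and taking $E[\cdot]$,

$$\sigma_Y^2 = \lambda \, \mathrm{Cov}[X, Y], \quad \lambda \, \sigma_X^2 = \mathrm{Cov}[X, Y]$$

Multiplying both equalities gives $\mathrm{Cov}[X, Y]^2 = \sigma_X^2 \sigma_Y^2$ (for $\lambda \neq 0$). Hence, if there is an exact linear relationship between $X$ and $Y$,

$$\rho_{XY} = \pm 1$$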

Interpretation of correlation (II)


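Simple linear regression:

$$Y = b_0 + b_1 X + R, \quad E[R] = 0, \quad \mathrm{Cov}[X, R] = 0$$

Variance and covariance of $Y$:

$$\sigma_Y^2 = b_1^2 \sigma_X^2 + \sigma_R^2, \quad \mathrm{Cov}[X, Y] = b_1 \sigma_X^2$$

$$b_1 = \rho_{XY} \frac{\sigma_Y}{\sigma_X}$$

Variance of $Y$ explained by $X$:

$$\underbrace{\sigma_Y^2}_{\text{Total Var } Y} = \underbrace{\rho_{XY}^2 \sigma_Y^2}_{\text{Var } Y \text{ expl. by } X} + \underbrace{\sigma_R^2}_{\text{Non-explained Var}}$$

$\rho_{XY}^2$ is the fraction (per unit) of $\sigma_Y^2$ that is (linearly) explained by $X$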
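The variance decomposition also holds exactly for sample moments when the line is fitted by least squares. The sketch below illustrates this on simulated data; the intercept 1.0, slope 0.8 and noise level are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=200)
y = 1.0 + 0.8 * x + rng.normal(0.0, 0.5, size=200)   # illustrative b0, b1, noise

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # b1 = rho * sigma_Y / sigma_X
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

explained = r**2 * y.var(ddof=1)         # variance of Y explained by X
unexplained = resid.var(ddof=1)          # non-explained variance
```

`explained + unexplained` reproduces `y.var(ddof=1)`, and `b1` coincides with the least squares slope (for instance `np.polyfit(x, y, 1)[0]`).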



Linear model

The linear regression model

Response data: $y_1, y_2, \dots, y_n$
Covariables (predictors): $x_{i1}, x_{i2}, \dots, x_{ik}$
Residuals (errors): $e_1, e_2, \dots, e_n$
Model: for $i = 1, 2, \dots, n$,

$$y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + e_i$$

Matrix notation:

$$y = Xb + e$$

with $x_{i0} = 1$, $i = 1, 2, \dots, n$, and $b = (\beta_0, \beta_1, \dots, \beta_k)^t$

Least squares solution

Model: $y = Xb + e$
Criterion: find the $\beta$'s such that $\|e\|^2$ is minimum
Taking derivatives and equating to 0:

$$\hat{b} = (X^t X)^{-1} X^t y$$

Dimensions: $\hat{b}$ is $(k+1, 1)$; $X^t$ is $(k+1, n)$; $X$ is $(n, k+1)$; $y$ is $(n, 1)$

Pseudo-inverse of $X$: $X^\dagger = (X^t X)^{-1} X^t$

$$X^\dagger X X^\dagger = X^\dagger, \quad X X^\dagger X = X$$

Symmetry of $X^\dagger X$ and $X X^\dagger$
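A minimal numerical sketch of the least squares solution and of the pseudo-inverse identities, using NumPy. The design (two simulated predictors, `b_true`, noise level) is an assumption made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 2
predictors = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), predictors])   # first column: x_i0 = 1
b_true = np.array([0.5, 2.0, -1.0])             # illustrative coefficients
y = X @ b_true + rng.normal(scale=0.1, size=n)

# Normal equations: (X^t X) b = X^t y
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudo-inverse as written on the slide
X_dagger = np.linalg.inv(X.T @ X) @ X.T
```

In practice `np.linalg.lstsq(X, y, rcond=None)` is usually preferred to forming $(X^t X)^{-1}$ explicitly, since it is numerically more stable; for a full-rank $X$ it returns the same `b_hat`.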



Sums of squares

Total variability: total sum of squares (SST)

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

Regression: sum of squares of the regression (SSR)

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \quad \hat{y}_i = x_i^t \hat{b}$$

Error variability: residual sum of squares (SSE)

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Multiple regression coefficient

Main property of the sums of squares (Pythagoras' theorem):

$$SST = SSR + SSE$$

Multiple regression coefficient: the fraction (per unit) of variability explained by the regression,

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

$R^2 = 0$ ⇒ the regression model is useless
$R^2 = 1$ ⇒ no errors at all
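The decomposition can be verified on any least squares fit that includes an intercept. The sketch below uses simulated data; the predictors ($x$ and $x^2$) and the coefficients are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x, x**2])   # intercept plus two illustrative predictors
y = 1.0 + 0.5 * x - 0.03 * x**2 + rng.normal(scale=0.4, size=n)

b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b_hat

sst = np.sum((y - y.mean())**2)       # total sum of squares
ssr = np.sum((y_hat - y.mean())**2)   # regression sum of squares
sse = np.sum((y - y_hat)**2)          # residual sum of squares
r2 = ssr / sst
```

Up to rounding, `sst` equals `ssr + sse` and `r2` equals `1 - sse / sst`. The identity relies on the column of ones in `X`; without an intercept it does not hold in general.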

Testing hypothesis (ANOVA)


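Assumption: $e_i \sim N(0, \sigma_e^2)$, iid
Consequence:

$$\frac{SSE}{\sigma_e^2} \sim \chi^2_{n-k-1}$$

Hypothesis:

$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \;\Leftrightarrow\; SSR = 0 \;\Leftrightarrow\; R^2 = 0$$

Statistics: if $H_0$ holds, then

$$\frac{SSR}{\sigma_e^2} \sim \chi^2_k, \quad \frac{SST}{\sigma_e^2} \sim \chi^2_{n-1}$$

$$F = \frac{SSR/k}{SSE/(n-k-1)} \sim F_{k,\,n-k-1}$$

$H_0$ is rejected for large $F$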
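The behaviour of the $F$ statistic can be illustrated by simulation. In this sketch the null distribution is approximated by Monte Carlo instead of using $F$ tables; the design matrix, the sample size and the coefficients of the alternative are all assumptions for the example, and `f_statistic` is a hypothetical helper.

```python
import numpy as np

def f_statistic(X, y):
    """F statistic for H0: all slopes are zero. X must include the column of ones."""
    n, p = X.shape                 # p = k + 1
    k = p - 1
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ b_hat
    ssr = np.sum((y_hat - y.mean())**2)
    sse = np.sum((y - y_hat)**2)
    return (ssr / k) / (sse / (n - k - 1))

rng = np.random.default_rng(0)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

# Null distribution by simulation: y is pure noise, so H0 holds
f_null = np.array([f_statistic(X, rng.normal(size=n)) for _ in range(4000)])

# A response that does depend on the predictors gives a much larger F
y_alt = X @ np.array([0.0, 1.0, -1.0]) + rng.normal(size=n)
f_alt = f_statistic(X, y_alt)
p_value_mc = np.mean(f_null >= f_alt)   # Monte Carlo p-value
```

Under $H_0$ the simulated values should follow $F_{2,22}$, whose mean is $22/20 = 1.1$; `f_alt` falls far in the upper tail, so $H_0$ is rejected.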

Example ADCP

Fitting a model of current velocity with depth

Current velocity ($v$) at water depth ($d$) is measured using two on-board ADCPs
Goal: fit a model of velocities

$$v = \beta_0 + \beta_1 \ln h + \beta_2 h, \quad h = d - d_0$$

Is the linear term $\beta_2 h$ useless?
Additional issue: are the measurements from both ADCPs credibly equal?
Data: Jiménez-González, S., R. Mayerle and J. J. Egozcue (2006): A proposed approach for the determination of the accuracy of acoustic profilers for field conditions. Die Küste, 69, 409–420.
Data and model fitting

Figure: data from Seston (blue) and Südfall (green). x-axis: velocity, vel (cm/s); y-axis: distance to bottom, dist fondo (m).
Correlation matrix

Correlations
(vel: velocity; dfondo: distance to bottom; logdfondo: log of distance to bottom)

                                   vel     dfondo  logdfondo
Pearson correlation   vel          1.000   .742    .919
                      dfondo       .742    1.000   .842
                      logdfondo    .919    .842    1.000
Sig. (one-tailed)     vel          .       .000    .000
                      dfondo       .000    .       .000
                      logdfondo    .000    .000    .
N                     vel          60      60      60
                      dfondo       60      60      60
                      logdfondo    60      60      60
Linear regression

Figure: current velocity predicted by log-depth. x-axis: log distance to bottom, log-dist-fondo (m); y-axis: velocity, vel (cm/s).
Regression coefficient and ANOVA

Model 1, stepwise method (criterion: probability of F to enter <= .050, probability of F to remove >= .100): variable entered logdfondo.
a. Dependent variable: vel

Model summary (b)

Model   R        R squared   Adjusted R squared   Std. error of the estimate
1       .919 a   .845        .843                 .04571
a. Predictors: (Constant), logdfondo
b. Dependent variable: vel

ANOVA (b)

Model            Sum of squares   df   Mean square   F         Sig.
1  Regression    .663             1    .663          317.312   .000 a
   Residual      .121             58   .002
   Total         .784             59
a. Predictors: (Constant), logdfondo
b. Dependent variable: vel
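The entries of the two tables are linked by the formulas of the previous slides, which can be checked with a few lines of arithmetic. The inputs below are the rounded values printed in the tables, so small discrepancies are expected. The adjusted $R^2$ formula, $1 - \frac{SSE/(n-k-1)}{SST/(n-1)}$, is the usual definition and is an addition of this sketch, not something stated in the slides.

```python
# Values copied from the model summary and ANOVA tables (rounded as printed)
ssr, sse, sst = 0.663, 0.121, 0.784
df_reg, df_res = 1, 58          # k and n - k - 1, with n = 60
r = 0.919

r2_from_ss = ssr / sst                         # R^2 = SSR / SST
r2_from_r = r**2                               # R^2 as the square of R
f_from_ss = (ssr / df_reg) / (sse / df_res)    # F = (SSR/k) / (SSE/(n-k-1))

# Usual definition of adjusted R^2 (assumption, see lead-in)
adj_r2 = 1 - (sse / df_res) / (sst / (df_reg + df_res))
```

The recomputed values agree with the printed ones up to rounding of the inputs: about .845 for $R^2$, about .843 for the adjusted $R^2$, and an $F$ close to the reported 317.312.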
Coefficients and model

Coefficients (a)

                  Unstandardised coefficients   Standardised coefficients
Model             B       Std. error            Beta                        t        Sig.
1  (Constant)     .614    .010                                              59.083   .000
   logdfondo      .093    .005                  .919                        17.813   .000
a. Dependent variable: vel

Excluded variables (b)

                                                                  Collinearity statistics
Model        Beta in    t        Sig.   Partial correlation      Tolerance
1  dfondo    -.110 a    -1.155   .253   -.151                    .291
a. Predictors in the model: (Constant), logdfondo
b. Dependent variable: vel
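Read together with the model slide, the coefficients table suggests the fitted model $v = 0.614 + 0.093 \ln h$. The sketch below evaluates it and checks that, with a single predictor, the squared $t$ of the slope reproduces the ANOVA $F$. It assumes that `logdfondo` is the natural logarithm of the distance to the bottom in metres; the function name `predict_velocity` is hypothetical, and the velocity units are whatever the data use.

```python
import math

b0, b1 = 0.614, 0.093                  # column B of the coefficients table
t_slope, f_anova = 17.813, 317.312     # t of logdfondo and F of the ANOVA table

def predict_velocity(h):
    """Fitted model v = b0 + b1 * ln(h); h is assumed to be the distance to the bottom (m)."""
    return b0 + b1 * math.log(h)

v_at_1m = predict_velocity(1.0)        # ln(1) = 0, so this is the intercept
v_at_10m = predict_velocity(10.0)
t_squared = t_slope**2                 # equals F when there is one predictor
```

The excluded-variables table gives the complementary information: once `logdfondo` is in the model, `dfondo` has $t = -1.155$ (Sig. .253), so the linear term is not retained by the stepwise criterion.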


Residuals

Figure: residuals of current velocity predicted by log-dist-to-bottom. x-axis: log-dist-fondo (m); y-axis: resid vel (cm/s).

Figure: velocity residuals, normal QQ-plot and KS-critical region. x-axis: resid. quantile (m/s); y-axis: normal resid. quantile.
Multiple regression Lab.

Use the velocity-depth data.
Fit a linear model using $h$, $\log h$, $h^2$, $h^3$.
Plot the residuals in normal QQ and PP plots.
Using the previous model, introduce the factor "boat" and decide whether both sets of measurements can be assumed "equal".
Briefly discuss the results.
Bibliography: Regression

Canavos, G. (1987): Probabilidad y estadística: Aplicaciones y métodos, McGraw–Hill, México, 651 p.

Draper, N.R. and Smith, H. (1981): Applied regression analysis, 2nd ed., John Wiley and Sons, New York, NY (USA), 407 p.

Peña, D. (1987): Estadística, modelos y métodos. Modelos lineales y series temporales, Alianza Editorial Textos, Madrid, 683 p.
