How to do a linear regression?

Tom Broekel
Diagnostics: Residuals

Regression (residual) diagnostics show if requirements for test of statistical significance are met

If requirements not met - inference of statistical significance of regression parameters (~p-values) not valid


Normally distributed residuals

Heteroscedasticity (non-constant variance of residuals)

Auto-correlation (independence of residuals)

First impression of residuals by looking at their scatterplot

Regression on NUTS2 regions: GDP ~ Pop_Den + Patents

[Figure: scatterplot of the regression residuals]

Test if residuals are normally distributed

Normal distribution: No systematic biases & just random noise

Explicit test: Shapiro-Wilk test compares distribution of residuals with normal distribution

Significant result (p-value below chosen level of significance) indicates rejection of normal distribution hypothesis

Insignificant result (p-value above chosen level of significance) indicates not to reject normal distribution hypothesis

Rejection of normal distribution hypothesis

Coefficients correct

But: Test of coefficients’ significance not reliable

Results cannot be interpreted

Rejection of normal distribution hypothesis: What to do?

Wrong functional relation? (Non-linearity?)

Missing variables?

Inherent characteristics of data ➡ different empirical approach (e.g., …)


Testing for normal distribution of regression residuals in R

Function ols_test_normality() in package olsrr directly applicable to regression results object

Function reports additional tests, usually with little difference in results

Normal distribution hypothesis to be rejected because p-value below 0.01
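
The normality check can be sketched in base R as follows. This is a minimal illustration on simulated data, not the regional data set from the slides: the variable names mirror the slides' model, but all values are made up, and the noise is deliberately skewed so that the test rejects. It uses shapiro.test() from base R; with olsrr installed, the call named on the slide would be applied to the same regression object.

```r
# Simulated data (assumption): names follow the slides' model, values are made up
set.seed(1)
n <- 200
dat <- data.frame(Pop_Den = runif(n, 0, 100), Patents = runif(n, 0, 50))

# Skewed (exponential) noise, so the residuals are non-normal by construction
dat$GDP <- 10 + 0.5 * dat$Pop_Den + 0.8 * dat$Patents + rexp(n, rate = 0.2)

fit <- lm(GDP ~ Pop_Den + Patents, data = dat)

# Shapiro-Wilk test on the residuals (H0: residuals are normally distributed)
sw <- shapiro.test(residuals(fit))
print(sw)

alpha <- 0.01  # chosen level of significance
if (sw$p.value < alpha) {
  message("p-value below ", alpha, ": reject the normal distribution hypothesis")
} else {
  message("p-value above ", alpha, ": do not reject the normal distribution hypothesis")
}

# With olsrr installed, the function named on the slide would be called as:
# olsrr::ols_test_normality(fit)
```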

Test for homoscedastic residuals

Test of significance based on the assumption of residuals’ variance being constant across their distribution

“No heteroscedasticity”

No relationship between residuals and explanatory variables

Variance of residuals should be constant across the distribution of fitted values of the dependent variable

Comparison of fitted values of dependent variable with residuals


Breusch-Pagan test of heteroscedasticity

Regression of squared residuals on same explanatory variables

If “too” much variance explained by regression - residuals not independent of explanatory variables

Significant result of BP-Test suggests rejection of homoscedasticity


More tests available
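
The logic of the Breusch-Pagan test can be sketched by hand in base R. The example below is an illustration under stated assumptions: the data are simulated with a noise variance that grows with Pop_Den, and the statistic is the studentized (Koenker) variant, n times the R² of the auxiliary regression. Package implementations may scale the statistic differently or use other auxiliary regressors, so their numbers need not match.

```r
# Simulated data (assumption): noise standard deviation grows with Pop_Den
set.seed(2)
n <- 300
dat <- data.frame(Pop_Den = runif(n, 1, 100), Patents = runif(n, 0, 50))
dat$GDP <- 10 + 0.5 * dat$Pop_Den + 0.8 * dat$Patents +
  rnorm(n, sd = 0.2 * dat$Pop_Den)

fit <- lm(GDP ~ Pop_Den + Patents, data = dat)

# Auxiliary regression: squared residuals on the same explanatory variables
dat$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ Pop_Den + Patents, data = dat)

# Test statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution with k degrees of freedom
k <- length(coef(aux)) - 1          # number of explanatory variables
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = k, lower.tail = FALSE)

cat("BP statistic:", round(bp_stat, 2), " p-value:", signif(bp_p, 3), "\n")
# A small p-value suggests rejecting homoscedasticity
```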

Rejection of homoscedasticity

Coefficients correct

Test of significance not reliable

Results cannot be interpreted

Rejection of homoscedasticity - Causes and consequences

Wrong functional relation? (Test for non-linearities?)

Missing variables?

Inherent characteristics of data ➡ different empirical approach (e.g., …)


Testing for heteroscedasticity in R

Function ols_test_breusch_pagan() in package olsrr directly applicable to regression results object

Prob>Chi2 = p-value

p-value above 0.01 implying homoscedasticity cannot be rejected

Autocorrelation: Observations correlate with themselves

Observations in some kind of order

Temporal: observations’ values in t correlate with their values in t-1 (temporal autocorrelation) -> Panel & time series analysis

Spatial: observations correlate with others in geographical proximity (spatial autocorrelation)

Autocorrelation implies correlated residuals, i.e., residuals not independent of each other and hence include structural biases and not just random noise

Tests of significance not reliable

High & low GDP values geographically clustered

Spatial autocorrelation: Residuals of observation i (region i) correlate with those from its neighbouring regions

Almost always a problem in case of spatial data

Spatial autocorrelation hints at similarities across regions or strong relations between them

Unobserved regional characteristics or relations

Regions part of “larger” regions - regional borders not optimal

Spatially autocorrelated residuals: mapping regression residuals

Residuals clearly geographically structured / clustered

North Europe: under-estimating GDP from population density & patents

East Europe: over-estimating GDP from population density & patents

Regression residuals

Residual = Y − Ŷ (observed minus fitted value)



Exact test of spatial autocorrelation: Moran’s I

Extension of Pearson’s correlation coefficient to spatial structure

Comparison of value for region i with those of (direct) neighbouring regions


Moran-correlation coefficient I

$$
I = \frac{n \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \; \sum_{i=1}^{n} (x_i - \bar{x})^2}
$$

Example: spatial relations reflected by direct neighbourhood

Region A’s neighbours: B, C, D
Region D’s neighbours: A, E
Region E’s neighbours: D, F, G, H

Arranging residuals according to spatial neighbourhood:

Region       Neighbours
ResidualA    ResidualB
ResidualA    ResidualC
ResidualA    ResidualD
ResidualD    ResidualA
ResidualD    ResidualE
ResidualE    ResidualD
ResidualE    ResidualF
ResidualE    ResidualG
ResidualE    ResidualH

Estimation of correlation coefficient = Moran’s I
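
The coefficient can be computed by hand for a small example like the one above. The sketch below is illustrative only. It uses the three neighbour lists given for regions A, D and E, assumes that neighbourhood is symmetric and that regions B, C, F, G and H have no further neighbours, and uses made-up residual values. The helper morans_i() is a hypothetical name, not a package function.

```r
# Moran's I from a value vector x and a spatial weights matrix W
# (hypothetical helper written for this illustration)
morans_i <- function(x, W) {
  n <- length(x)
  z <- x - mean(x)                       # deviations from the mean
  (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

regions <- LETTERS[1:8]                  # regions A to H
nb <- list(A = c("B", "C", "D"),         # neighbour lists from the example
           D = c("A", "E"),
           E = c("D", "F", "G", "H"))

# Direct neighbourhood: weight 1 for neighbours, 0 otherwise (symmetric by assumption)
W <- matrix(0, 8, 8, dimnames = list(regions, regions))
for (r in names(nb)) {
  W[r, nb[[r]]] <- 1
  W[nb[[r]], r] <- 1
}

# Made-up residuals: positive around A, negative around E
res <- c(A = 2.0, B = 1.5, C = 1.8, D = 0.5, E = -1.2, F = -1.6, G = -0.9, H = -2.1)

I_example <- morans_i(res, W)
print(I_example)                         # positive: similar residuals are neighbours
```

A positive value indicates that neighbouring regions tend to have similar residuals; whether it is statistically significant is a separate question answered by the test.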

Moran’s I test of spatial autocorrelation

Values between -1 (negative autocorrelation) and 1 (positive autocorrelation)

Significant result indicates presence of autocorrelation

Problem with Moran’s I

Different ways to define “neighbourhood”

Direct neighbourhood (weight of neighbouring values = 1, all others = 0)

Weighting based on distance (growing distance from region i implies less weight)

Neighbourhood definition impacts estimation results

Motivate choice from theory: What type of dependencies are relevant?

When spatial autocorrelation present

Use different spatial units (definition of regions)

Consideration of spatial characteristics, e.g., urban vs. rural

Model spatial dependencies with dummy variables (e.g., country dummies)

Use of spatial regression models (not this class)

Multi-level regression (not this class)

How to test for spatial autocorrelation in R

Load spatial information concerning geographical locations of observations

Usually, a “map”

Maps = so called “shapefiles” that link geographical information (latitude and longitude data) to empirical observations, e.g., regions

R with excellent capabilities of handling spatial information using the package sf

Use of sf (simple features) library makes working with such data easy

Full compatibility with tidyverse and all its features

Easy integration with ggplot

How to test for spatial autocorrelation in R

Load shapefile with read_sf() of sf library

Add regression residuals to original data set using add_residuals() from modelr library (arguments: data set, regression object, name of new column)

Merge extended data set with shapefile to “geolocated” observations

Merge via the ID columns of shapefile and data set

Shapefile (sf) object includes data.frame with merged data

Merged columns of regional data
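
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the slides' actual code: a synthetic 3 x 3 grid stands in for the shapefile, the column name region_id and all data values are made up, and the merge uses base merge() on the sf object (a dplyr join would be an alternative). It requires the sf and modelr packages.

```r
library(sf)
library(modelr)

# In practice the map would be loaded from disk, e.g.:
# shape <- read_sf("path/to/regions.shp")   # hypothetical path
# Here a synthetic 3 x 3 grid of square regions stands in for the shapefile
grid  <- st_make_grid(st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 3, ymax = 3))),
                      n = c(3, 3))
shape <- st_sf(region_id = seq_along(grid), geometry = grid)

# Made-up regional data sharing the ID column with the map
set.seed(3)
dat <- data.frame(region_id = 1:9,
                  Pop_Den = runif(9, 0, 100),
                  Patents = runif(9, 0, 50))
dat$GDP <- 10 + 0.5 * dat$Pop_Den + 0.8 * dat$Patents + rnorm(9)

fit <- lm(GDP ~ Pop_Den + Patents, data = dat)

# Add residuals as a new column: data set, regression object, name of new column
dat <- add_residuals(dat, fit, var = "resid")

# Merge the extended data set with the map via the shared ID column
geo <- merge(shape, dat, by = "region_id")
print(geo)
```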

Before calculating Moran’s I, create information on spatial relations (who is neighbour of whom?)

Extract neighbourhood information from shapefile with poly2nb() and transform into spatial dependency object (spatial weights)

Some regions with no neighbours (islands)

Moran’s I test implemented in spdep library
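
A possible version of these steps with spdep is sketched below. It is an illustration under assumptions: a synthetic 5 x 5 grid replaces the real map, the residuals are made up with a west-to-east trend so that neighbours resemble each other, and nb2listw() and moran.test() are the spdep functions chosen here for the weights and the test. The zero.policy = TRUE argument lets the functions proceed when some regions (islands) have no neighbours.

```r
library(sf)
library(spdep)

# Synthetic 5 x 5 grid of square regions (assumption: stands in for a real shapefile)
grid <- st_make_grid(st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 5, ymax = 5))),
                     n = c(5, 5))
regions <- st_sf(region_id = seq_along(grid), geometry = grid)

# Made-up residuals with a west-to-east trend plus noise
xy <- st_coordinates(st_centroid(st_geometry(regions)))
set.seed(42)
regions$resid <- xy[, "X"] + rnorm(nrow(regions), sd = 0.5)

# Who is neighbour of whom? Regions sharing a border or a corner (queen contiguity)
nb <- poly2nb(regions, queen = TRUE)

# Transform neighbour list into row-standardised spatial weights
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

# Moran's I test on the residuals
mt <- moran.test(regions$resid, lw, zero.policy = TRUE)
print(mt)
```

The printed output reports the Moran's I statistic together with its p-value; a small p-value indicates spatially autocorrelated residuals.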


Highly significant spatial autocorrelation!

Neighbours’ residuals correlate with 0.65 (correlation coefficient)

Estimating regression model considering the full set of country dummies (almost) solves the issue

Slightly significant

Weak spatial autocorrelation!
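
How country dummies absorb country-level structure in the residuals can be illustrated with a small simulation. The sketch below makes several assumptions: the data, the three country codes and the country-level shifts are made up, and it only shows that residual means per country vanish once dummies are included. It does not reproduce the result reported above.

```r
# Simulated data (assumption): three countries with different GDP levels
set.seed(7)
country_shift <- c(DE = 20, FR = 10, PL = -15)   # made-up country-level effects
dat <- data.frame(Country = rep(names(country_shift), each = 30),
                  Pop_Den = runif(90, 0, 100),
                  Patents = runif(90, 0, 50))
dat$GDP <- 5 + 0.3 * dat$Pop_Den + 0.6 * dat$Patents +
  unname(country_shift[dat$Country]) + rnorm(90)

# Model without and with the full set of country dummies
fit_plain   <- lm(GDP ~ Pop_Den + Patents, data = dat)
fit_dummies <- lm(GDP ~ Pop_Den + Patents + factor(Country), data = dat)

# Mean residual per country: structured without dummies, zero with dummies
means_plain   <- tapply(residuals(fit_plain), dat$Country, mean)
means_dummies <- tapply(residuals(fit_dummies), dat$Country, mean)
print(round(means_plain, 2))
print(round(means_dummies, 2))
```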

How to do linear regressions!

Ex-ante checks

Number of observations

Type of dependent variable


Ex-post checks with potential model refinement



Normal distribution of residuals


Autocorrelation (spatial/temporal)


