ECONOMETRICS


Multicollinearity
Def: multicollinearity occurs when high correlation exists among two or more independent variables → it can distort the regression results, because the coefficients become unstable and their signs may not match prior expectations.
Example: Pie sales = 100 + 25*Temperature (°C) + Temperature (°F)
Sales: 350 units
Here temperature in °C and in °F measure the same thing (°F = 1.8*°C + 32), so the two variables are perfectly correlated and their separate coefficients cannot be interpreted.
Search for the highest VIF and eliminate that variable, then run the regression again; if another VIF is still too high, repeat until all variables are below 5.
Rule: if the VIF is greater than 5, we should eliminate the variable (a minimal sketch of this loop follows below).
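A minimal Python sketch of this elimination loop (the notes use SAS; Python is used here purely for illustration), assuming a pandas DataFrame X holding the explanatory variables:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Repeatedly drop the column with the highest VIF until all are below threshold."""
    X = X.copy()
    while True:
        design = sm.add_constant(X)  # VIFs are computed on a design with an intercept
        vifs = pd.Series(
            [variance_inflation_factor(design.values, i + 1) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() < threshold:
            return X
        X = X.drop(columns=[vifs.idxmax()])  # remove the worst offender, then re-check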

Outliers (extreme value)


Use scatter plots or box plots to find outliers; removing them can change the estimated coefficients and their significance.
Cook's distance detects the influence of one observation on all predicted values
→ if it is high, the observation is influential and may be an outlier.
If above 1, it means that the observation has a strong effect on the fit and may be an outlier (see the sketch below).
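A minimal sketch of this check with statsmodels, on made-up data with one planted extreme value (all names and numbers here are illustrative):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=50)
y[0] += 15                                   # plant an artificial extreme value

results = sm.OLS(y, sm.add_constant(X)).fit()
cooks_d, _ = results.get_influence().cooks_distance

# Rule of thumb from the notes: Cook's D above 1 => strongly influential point.
print("Possible outliers (Cook's D > 1):", np.where(cooks_d > 1)[0])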

Analysis of R²: the selected variables explain X% of the variation in Y (the explanatory power of the model).
Adjusted R² should be close to R²; otherwise the model is not good.
Global significance: F-test → p-value < 5%

FINAL REVISION SHEET

Linear regression

We can write the formula as Y = f(X) + ε,


X is the explanatory variable, Y is a continuous quantitative variable, and ε is the error term; it is necessary because a model is never perfect:
some X values in the database may be faulty, Y may be explained by other variables, the function may underestimate the effect of X on Y, or the sample may not be representative of the population.
Global quality: to validate your model you need to measure its overall quality of fit.

SST (Total Sum of Squares) = SSR (Sum of Squares due to Regression) + SSE (Sum of Squares due to Error)
So SST = SSR + SSE
→ From that we can calculate R² = SSR/SST, or R² = SSR/(SSR + SSE)
→ 0 ≤ R² ≤ 1 (1 = perfect linear relationship)
Adjusted R² shows the proportion of variation in Y explained by the X variables while penalizing for the number of variables; it is smaller than R² and is the statistic to use when deciding which model is best (see the sketch below).
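A minimal sketch of these formulas in Python (function and variable names are illustrative):

import numpy as np

def r2_and_adjusted(y, y_hat, k):
    """k = number of explanatory variables (intercept excluded)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    sst = np.sum((y - y.mean()) ** 2)              # total sum of squares
    sse = np.sum((y - y_hat) ** 2)                 # error sum of squares
    ssr = sst - sse                                # regression sum of squares
    r2 = ssr / sst                                 # = SSR / (SSR + SSE)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalizes extra variables
    return r2, adj_r2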

Is the model globally significant?


Run an F-test; it shows whether there is a linear relationship between Y and the set of X variables.
F-stat = MSR / MSE = (SSR/k) / (SSE/(n-k-1)), where k is the number of explanatory variables and n the number of observations.
The bigger the F, the smaller the p-value and the better the chances to reject H0 and to prove that at least one variable influences Y → the model is significant (see the sketch below).
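A minimal sketch of this computation (the function name is illustrative):

from scipy import stats

def global_f_test(ssr, sse, n, k):
    """F = (SSR/k) / (SSE/(n-k-1)); a small p-value => reject H0 => model significant."""
    f_stat = (ssr / k) / (sse / (n - k - 1))    # MSR / MSE
    p_value = stats.f.sf(f_stat, k, n - k - 1)  # right-tail probability of the F distribution
    return f_stat, p_value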

Assumptions of regression: LINE

Linearity → the relationship between X and Y is linear


Independence of errors → error values are statistically independent
Normality of errors → error values are normally distributed for any given value of X
Equal variance → the probability distribution of the errors has constant variance

STEPS OF A LINEAR REGRESSION IN SAS


Summary statistics: median, mean, quartiles… data used to understand the database and present the variables.
Student's test (two-tailed two-sample t-test; confidence interval around the mean).
The linear regression with the Fisher test, with p-value, R² and adjusted R².
→ ( Y = -0.07739 (intercept) + 0.00002918*number of cigarettes per smoker + 0.00395*middle age - 0.00001199*average salary per month - 0.00011649*average price of a package + 0.04332*fertility rate - 0.00000207*GDP in billions - 0.00788*annual mean temperature - 0.00120*health access + 0.26258*men who smoke - 0.59258*mortality rate + 0.00187*unemployment rate + 0.39462*women who smoke + εi )
→ Take the values from the "Parameter Estimates" table, "Parameter Estimates" column.

LINE assumptions
Linearity and Independence
→ Linearity: look at the residuals-by-predicted plot for … (variable); if there is no distinguishable shape, it is good.
→ Independence: look at the scatter plot; we shouldn't distinguish any shape either, with the residuals spread around 0.
Searching for outliers → Cook's distance; if there are not many, we can keep them.
Normality: the graph from the linear regression shows the residuals as a histogram; the curve should be similar to the reference (normal) curve.
Kolmogorov-Smirnov test (table: distribution of analysis: residuals …): the p-value (multiply the value by 100 to get a percentage) is the risk of being wrong when rejecting H0; if it is above the chosen threshold (e.g. 10%), we do not reject H0 → the distribution of the residuals is normal.
(This can also be checked with the two other tests, Cramér-von Mises (CvM) and Anderson-Darling (AD).)
+ QQ plots for the graphical check of the assumption: the points must follow the line (see the sketch below).
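A minimal sketch of the KS test and QQ plot, assuming a fitted statsmodels OLS result named results (as in the Cook's distance sketch above; matplotlib is needed for the plot):

import scipy.stats as st
import statsmodels.api as sm

z = (results.resid - results.resid.mean()) / results.resid.std(ddof=1)  # standardize
ks_stat, p_value = st.kstest(z, "norm")  # H0: residuals follow a normal distribution

# A large p-value (e.g. above 10%) => do not reject H0 => normality is acceptable.
print(f"KS stat = {ks_stat:.4f}, p-value = {p_value:.4f}")
sm.qqplot(z, line="45")                  # the points should follow the 45° line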
Equal variance or homoscedasticity: no specific shape in the distribution of the residuals.
Run a White test: compare (R² value of the auxiliary regression * number of observations) with the Chi² critical value;
if the White test statistic is < the Chi² value, we don't reject H0 → the regression model is not subject to heteroscedasticity (see the sketch below).
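A minimal sketch with statsmodels' implementation of the White test, again assuming a fitted OLS result named results:

from statsmodels.stats.diagnostic import het_white

# The exog matrix must include the constant (sm.add_constant did that at fit time).
lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(results.resid, results.model.exog)

# lm_stat is n * R² of the auxiliary regression; if it is below the Chi² critical
# value (equivalently, lm_pvalue > 5%), do not reject H0 => no heteroscedasticity.
print(f"White LM stat = {lm_stat:.3f}, p-value = {lm_pvalue:.4f}")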

→ LINE ASSUMPTIONS RESPECTED

Multicollinearity and VIF (+ p-value)


VIF = 1/(1 - R²), where R² comes from regressing that variable on the other explanatory variables.
→ The rule for VIF is to remove the variables with a VIF value higher than 5.
Then we remove the variables one by one according to the Pr value (p-value) until none is over 5% (a sketch of this elimination follows below).
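A minimal sketch of this p-value-based backward elimination (the function name is illustrative; the VIF step above is assumed to have been done first):

import pandas as pd
import statsmodels.api as sm

def backward_eliminate(y, X: pd.DataFrame, alpha: float = 0.05):
    """Refit, dropping the least significant variable, until every p-value < alpha."""
    X = X.copy()
    while True:
        results = sm.OLS(y, sm.add_constant(X)).fit()
        pvals = results.pvalues.drop("const")  # the intercept always stays in the model
        if pvals.empty or pvals.max() < alpha:
            return results
        X = X.drop(columns=[pvals.idxmax()])   # remove the variable with the worst p-value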
T-test: hypothesis about the mean (standard deviation unknown).
Linearize: transform a variable before putting it into the Y formula, for example ln(X) (see the sketch after this list).
Quadratic effect: include X² as a regressor to capture a curved relationship.
Dummy variable: a qualitative variable that can be recoded as 0 and 1 (e.g. gender).
Heteroscedasticity: when the standard deviation of the errors is not constant (the scatter plot shows an increasing fan shape) → White test.

Consequences: significance tests too high or too low, standard errors biased.
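A minimal sketch of the linearization, quadratic-effect and dummy-variable recodings mentioned in the list above, on made-up data (column names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": [1200, 2500, 3100, 1800],
    "gender": ["F", "M", "F", "M"],
})
df["ln_salary"] = np.log(df["salary"])          # linearization via ln(X)
df["salary_sq"] = df["salary"] ** 2             # quadratic effect term
df["male"] = (df["gender"] == "M").astype(int)  # dummy variable: 1 = male, 0 = female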
