
STATS FINAL

THEME 1
Vocabulary:
 Population: the entire group of subjects (or units) we wish to study (e.g. the collection of invoices submitted by the employees).
 Sample: a group of units drawn from the population (e.g. some subset of the invoices submitted by the employees).
 Variable of interest: a feature or property of every member of the population we wish to study (e.g. the amount of inadmissible expenses on an invoice). This feature or property varies from one individual to another. It is often denoted by X. (Each unit of the population has its own value.)
 Parameter: a population-specific quantity associated with the variable X, almost always unknown to us (e.g. the mean amount of inadmissible expenses on invoices). It is always fixed and free from sampling error; examples: mean (μ), standard deviation (σ), correlation (ρ), proportion (p). It defines the population, is a fixed quantity, and is what we are looking for.
 Estimator: a statistic defined from a sample that enables the estimation of an unknown parameter. Its value will vary with the sample (it has its own distribution, mean and standard deviation), so it carries an estimation error (sampling error).
 Point estimate: the value of the estimator for a given observed sample. It depends on the sample used to calculate it and varies from sample to sample (sampling variability); it is the tool used to approximate the parameter of interest.

Statistical Inference vs Descriptive Statistics


Descriptive statistics: techniques aiming to describe features of data sets, by using graphs, tables and summary statistics (standard deviation, mode, describing the data). They only describe; no conclusions are drawn (facts, numbers).

Statistical inference: drawing conclusions about a population from a sample (inferring beyond the data). Make sure the sample is representative of the population (we extrapolate to the whole population and make generalizations).

Experiments (or controlled experiments)


Controlled experiment: the experimenter divides individuals or units into two (or more) groups receiving different
treatments in order to study the validity of a hypothesis regarding the effectiveness of a treatment. (researchers
intervention) Can establish cause/effect relationships
** groups need to be built in a similar way
 Treatment group: group that receives the treatment to be studied. There can be several treatment groups in a
single study, so that multiple treatments can be simultaneously measured.
 Control group: group used as a baseline measure. It will not receive the treatment. (placebo group)
Methods to create a control group:
o Simple randomization (flip a coin): good because it eliminates bias, but it does not necessarily lead to equally-sized groups.
o To get equal-sized groups: draw the names out of a hat.

Observational Study
Observational study:
A study where individuals are merely observed. The variable of interest is measured but no attempt is made to
influence the response. (researcher’s passive observation)

Longitudinal study vs. Cross-sectional study


Longitudinal study: the characteristic of interest is measured at different moments in time on the same individuals (units). Repeated over time; tracks the evolution of a feature over time.
Cross-sectional study: data are collected at a single point in time (one moment in time).
THEME 2:
To be able to draw conclusions from data, and generalize any conclusion to the entire population, the sample must be
representative of the population and sufficiently large. + feature must be properly measured
Bias
Bias: the systematic error between the parameter of interest and an estimate of the latter.
 Selection bias: the method used to select the sample may create a sample that is not representative of the population. This method is likely to lead us to an erroneous result (e.g. when some people targeted by the study are excluded by default).
Example: a telephone survey using a sample taken from the phone book (doesn't include unlisted numbers).
 Avoid selection bias by:
 Using a probabilistic sampling method (uses chance), which then allows us to draw inferences.

 Measurement bias: the method used to measure the feature we wish to study does not adequately measure what
we really want to measure. (answering incorrectly or not wanting to answer honestly)
Example: Getting teachers to measure swearing at recess as a way of measuring the amount of swearing in the
school.
 Problems:
 Lack of integrity (so do it anonymously) (sensitive topics)
 Bad formulation (avoid double negation and make it clear); use impartial vocabulary (not suggesting an answer)
 Avoid memory/recall bias by doing a longitudinal study (instead of a cross-sectional one)

 Nonresponse: if we cannot measure the feature(s) of interest for some individuals (units) in the sample, a bias may occur if the individuals who answered differ from those who didn't (because some people refuse to answer or can't be reached); missing values/holes in the data set.
 To avoid it (when sensitive topic, wrong timing, data collection methodology, …):
 Attempt to reach the individuals in the sample multiple times.

Probabilistic sampling methods


 Simple random sampling with replacement (SRSWR) : we select units at random from the population until we have
the target size. Units can be chosen more than once.

 Simple random sampling without replacement (SRSWOR) : we select units at random from the population until we
have the target size. Units can only be chosen once. This is more natural!
When the sampling fraction is small (i.e. if the population is large compared to the sample), this method provides a
good approximation to SRSWR.

 Systematic sampling: we systematically select the units from a list (e.g. every 12th unit) having previously chosen at
random the starting point. This is the easiest way to approximate a SRSWR.

 Stratified sampling: the population is divided into separate groups called strata (e.g. by age, sex, etc.). A simple
random sample is then chosen from each group. This ensures the different strata are well represented in the
sample in order to improve the precision of the estimates.

 Cluster sampling: if the population already appears in groups of units, named clusters (e.g. houses, locations from a
region, buildings or blocks in a city, etc.), we can take a simple random sample of clusters. All individuals from the
sampled clusters are part of the final sample. This method often reduces costs and complexity, but it also reduces
precision of the estimates for a similar sample size.
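The course does all sampling in Excel; purely as an illustration, here is a minimal Python sketch of the four probabilistic methods above, using a hypothetical population of 1,000 unit IDs and made-up strata:

```python
# Minimal sketch (illustrative only): population, strata and sample size
# below are all hypothetical.
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(1000)   # hypothetical population of 1000 unit IDs
n = 20                         # target sample size

srswr = rng.choice(population, size=n, replace=True)    # SRSWR: units may repeat
srswor = rng.choice(population, size=n, replace=False)  # SRSWOR: no repeats

# Systematic sampling: every k-th unit after a random starting point
k = len(population) // n
start = rng.integers(k)
systematic = population[start::k][:n]

# Stratified sampling: a simple random sample within each stratum
strata = {"A": population[:500], "B": population[500:]}   # hypothetical strata
stratified = np.concatenate(
    [rng.choice(units, size=n // 2, replace=False) for units in strata.values()]
)
print(srswr, srswor, systematic, stratified, sep="\n")
```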
Want to avoid:
 Convenience samples
 Volunteer samples
THEME 3:
Common Estimators

Sample proportions are means: X̄ = p̂ when X is an indicator (0/1) variable.

Risk of error and sample variability


Margin of error / Precision of a sample depends on its variability (stdev) and sample size
 E[X̄] = μ: the expected value of X̄ is μ. That is, the sample mean is "correct on average" over many samples. Such an estimator is said to be unbiased. On average, the estimation error is zero.

 σ_X̄ = √V(X̄) = σ/√n: roughly speaking, the average estimation error; it is the standard deviation of the sample mean.
This means that:
 The larger the sample size n, the less the sample mean varies from one sample to another (the smaller σ_X̄). Larger sample sizes thus lead to more precise estimates of the mean of the population.
 The less volatile the variable of interest in the population (the weaker the standard deviation σ), the smaller σ_X̄. Less volatile variables of interest thus lead to more precise estimates of the mean of the population.
The estimation error corresponds to the difference between the point estimate and the true population mean, namely x̄ − μ.
Differences between S and S/√n
 S estimates the variability of the variable of interest across the units in the population.
 S/√n estimates sampling variability, i.e. the variability of the sample mean across possible samples of size n. It is the sample standard deviation of the sample mean, an evaluation of the average estimation error. Example: we estimate that the sample mean duration deviates on average by 3.22 minutes from the population mean duration (= average estimation error).
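A minimal numpy sketch of this distinction (the durations below are made up; the 3.22-minute interpretation above comes from a course example):

```python
# S measures unit-to-unit variability; S/sqrt(n) estimates the variability of
# the sample mean (the average estimation error). Data are hypothetical.
import numpy as np

durations = np.array([12.0, 15.5, 9.8, 20.1, 14.3, 11.7, 18.2, 16.4])
n = len(durations)

x_bar = durations.mean()      # point estimate of mu
S = durations.std(ddof=1)     # sample standard deviation (across units)
se = S / np.sqrt(n)           # estimated standard deviation of the sample mean

print(f"x_bar = {x_bar:.2f}, S = {S:.2f}, S/sqrt(n) = {se:.2f}")
```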
Confidence Interval

CI= estimator ± margin of error

 As n increases, the sample mean values become increasingly concentrated around the true mean μ.
 As n increases, the distribution of the sample mean gets closer to the Normal distribution.
The Central Limit Theorem
If the sample size n is sufficiently large (n ≥ 30), the Normal distribution describes the variability of the sample mean across all possible simple random samples (SRSWR): X̄ ~ N(μ, σ_X̄) = N(μ, σ/√n).

 The Central Limit Theorem is true no matter the distribution of the variable X in the population.
 the Central Limit Theorem allows estimation by confidence intervals of the mean of any population as long as
the sample size is large enough.
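A small simulation can check both facts above (E[X̄] = μ and σ_X̄ = σ/√n) for a non-Normal population; the exponential distribution below is an arbitrary choice of skewed variable:

```python
# Draw 10,000 simple random samples of size n = 30 from a skewed population
# (exponential with mean = std dev = 10) and look at the sample means.
import numpy as np

rng = np.random.default_rng(0)
mu = sigma = 10.0
n, n_samples = 30, 10_000

means = rng.exponential(scale=mu, size=(n_samples, n)).mean(axis=1)

print(means.mean())        # close to mu = 10 (the sample mean is unbiased)
print(means.std(ddof=1))   # close to sigma/sqrt(30) ≈ 1.83
```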
Normal distribution: pointed curve ↔ small dispersion ↔ small standard deviation; flat curve ↔ large dispersion ↔ large standard deviation.
Student's t-distribution with n−1 degrees of freedom: we use it because we don't know the population σ appearing in X̄ ± z_{α/2} · σ/√n.
Confidence interval for μ, at a 1−α confidence level: X̄ ± t_{n−1,α/2} · S/√n.
This interval is valid if at least one of the following conditions is satisfied:

1. Variable of interest X in the population has a Normal distribution;


OR
2. Sample size n is sufficiently large (n ≥ 30).

INTERPRETATION
“The interval (14, 20) was generated by a process that captures the population mean μ 95% of the time.”
“At a 95% confidence level, we estimate that the real mean is located between 14 and 20.”
At the 95% confidence level, we estimate that the average income of households within 30km of the site is between
$64,264 and $67,003.
At the 95% confidence level, we estimate that the proportion of clients that spend at least $100 during a visit lies
between 29.0% and 40.5%.
95% CI: 1 out of 20 times, on average, the true parameter is not contained within the interval.
The MOE is 2.9%: 95% of samples produce an estimate within the stated MOE of the true proportion.
At the 95% confidence level, we estimate that cars of the studied model consume on average between 8.560 and 8.920 liters per 100 km.
 Use a "quantile calculation" to obtain the t or z value, as sketched below.
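As a sketch of that quantile calculation and the resulting interval (the course uses an Excel template; the consumption data below are hypothetical):

```python
# t-based confidence interval for a mean: x_bar ± t_{n-1, alpha/2} * S/sqrt(n)
import numpy as np
from scipy import stats

x = np.array([8.6, 8.9, 8.7, 8.5, 8.8, 9.0, 8.4, 8.7])   # hypothetical sample
n, alpha = len(x), 0.05

x_bar, S = x.mean(), x.std(ddof=1)
t_quant = stats.t.ppf(1 - alpha / 2, df=n - 1)   # the quantile calculation
moe = t_quant * S / np.sqrt(n)                   # margin of error

print(f"95% CI for mu: ({x_bar - moe:.3f}, {x_bar + moe:.3f})")
```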

Special case of a proportion


Confidence interval for a proportion: P̂ ± t_{n−1,α/2} · √(P̂(1−P̂)/n),

since S = √(n/(n−1)) · √(P̂(1−P̂)) ≈ √(P̂(1−P̂)) when n is large.
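The same interval applied to a proportion, on hypothetical counts:

```python
# For an indicator variable, p_hat plays the role of x_bar and
# sqrt(p_hat * (1 - p_hat)) approximates S when n is large.
import numpy as np
from scipy import stats

n, successes = 260, 90        # hypothetical: 90 of 260 clients spent $100+
p_hat = successes / n
alpha = 0.05

t_quant = stats.t.ppf(1 - alpha / 2, df=n - 1)
moe = t_quant * np.sqrt(p_hat * (1 - p_hat) / n)

print(f"95% CI for p: ({p_hat - moe:.3f}, {p_hat + moe:.3f})")
```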

Increasing the level of confidence requires a larger margin of error, thus decreasing the accuracy.
Increasing the accuracy for a fixed confidence level
To keep the same confidence level (not modifying the quantile), but still reduce the MOE, we need to reduce the variability of the estimator by:
 increasing the sample size (reduces the MOE/ the interval) (what we control)
 using a more sophisticated sampling technique (not this class)
Sample size calculation
The sample size that ensures the MOE does not exceed Emax at a 1-α level of confidence
Mean: n ≥ (z_{α/2} · σ̃ / E_max)²
Proportion: n ≥ (z_{α/2} · 0.5 / E_max)²
Express E_max as 0.01, not 1% (always round n up).
σ̃ is the standard deviation from a previous study of the same variable in the same population,
OR the standard deviation from a preliminary study on a small sample.
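A quick sketch of both formulas; σ̃ and E_max below are hypothetical inputs:

```python
# Sample size so that the MOE does not exceed E_max at a 1 - alpha level.
import math
from scipy import stats

alpha, E_max = 0.05, 0.01
z = stats.norm.ppf(1 - alpha / 2)          # z_{alpha/2} ≈ 1.96

sigma_tilde = 0.12                         # hypothetical prior std dev
n_mean = math.ceil((z * sigma_tilde / E_max) ** 2)   # always round up
n_prop = math.ceil((z * 0.5 / E_max) ** 2)           # worst case for a proportion

print(n_mean, n_prop)
```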

THEME 4
Hypothesis testing
A hypothesis test is a statistical process that, based on a sample, chooses between two competing hypotheses about one or more parameters (and hence about the population). It is statistical inference, because it draws conclusions about the population from a sample.

 H₀: Null hypothesis
o It usually represents the status quo, what is standard or normal.
o It is the "default".
o It is the claim that we will try to find evidence against; we try to reject it.
o The world as we expect it to be; what we expect.
o Always include the equality sign (=, ≤, ≥) in the null hypothesis.

 H₁: Alternative hypothesis
o It is usually the alternative we want to prove, and which will trigger an action or a change.
o What we want to prove / what we test.
o Wrongly rejecting H₀ in favor of this hypothesis usually has the more serious consequences.

p-values
The p-value corresponds to the probability of observing, a priori, a sample with similar or more extreme characteristics than those collected from the real sample, assuming that H₀ is true for the population. In other words, it is the probability of observing a sample like ours (or a more extreme one) by luck if H₀ is true. The test statistic measures the distance between the observed outcome and what is expected under H₀.

 If this probability is small, it would be surprising to observe a sample like ours, at least if H₀ effectively describes the situation in the population. Most likely H₀ is false, and hence we should reject H₀ in favor of H₁.
 If this probability is large, the observed sample is not unusual or surprising at all under the hypothesis that H₀ is true. Therefore, there is no real evidence against the hypothesis, and thus we do not reject H₀.
A p-value is always between 0 and 1.

Steps in hypothesis testing


1. Formulate the hypotheses H₀ and H₁.
2. Specify the significance level of the test, α. It is typical to choose α = 1%, 5% or 10%.
3. Collect a sample (following the best practices – see Theme 2).
4. Calculate the p-value based on the observed sample.
o For all tests about one or two means, use the template "tests_and_intervals_means.xlsx".
5. Draw a conclusion by applying the decision rule:
p-value ≤ α ⇒ Reject H₀
p-value > α ⇒ Do not reject H₀
and interpret the conclusion in the context of the problem.
 When rejecting H₀:
"At the α significance level, the observed data allow us to reject the null hypothesis."
"At the α significance level, the observed data provide statistical evidence in favor of the alternative hypothesis."
 When not rejecting H₀:
"At the α significance level, the observed data do not provide sufficient evidence to reject the null hypothesis."
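An illustrative sketch of steps 4 and 5 (the course computes the p-value with "tests_and_intervals_means.xlsx"; the data and null value below are hypothetical):

```python
# One-sample t-test with the decision rule: p-value <= alpha => reject H0.
import numpy as np
from scipy import stats

x = np.array([21.3, 19.8, 22.5, 20.1, 23.0, 21.7, 20.9, 22.2])  # hypothetical
alpha = 0.05

# H0: mu = 20  vs  H1: mu != 20
t_stat, p_value = stats.ttest_1samp(x, popmean=20)

if p_value <= alpha:
    print(f"p = {p_value:.4f} <= {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} > {alpha}: do not reject H0")
```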
Potential errors

When the risk of committing a Type I error decreases, the risk of committing a Type II error increases, and vice versa.
 We prioritize control of the Type I error (because these errors are more harmful).

Type I risk = P(reject H₀ | H₀ is true) ≤ α

The significance level of the test corresponds to the maximum risk of a Type I error.
By fixing α, we control the risk of a Type I error.
When increasing the significance level of the test, we increase its capacity to detect a given deviation from the null hypothesis, that is, we increase its power. (Type I risk ↑ ⇒ power ↑ ⇒ Type II risk = 1 − power ↓)

Power of a test
We can't control the risk of a Type II error.
Type II risk = P(do not reject H₀ | H₁ is true)
Power = 1 − risk of a Type II error = P(reject H₀ | H₁ is true).
 The higher the power, the better the test.
 Generally, the larger the samples, the weaker the risk of a Type II error (for the same significance level α ) thus
the larger the power. (financial and time constraints for the sample size)
 The smaller the Type I error risk, the weaker the power of the test.

Comparison of means: pay attention to the type of data!


Independent samples
Data are obtained through the random and independent selection of individuals from two populations.
Paired samples
The individuals are selected from a single population, but two measures are taken on each of them.
 Work with the sample of differences between the 2 measures on each individual (to eliminate the individual-specific effects).
 Conduct a test for the mean (here μ_D = μ₁ − μ₂) of a single population (where the characteristic under study is the difference D = X₁ − X₂); see the sketch below.
 Same validity conditions.
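A minimal sketch of the equivalence described above, on hypothetical before/after measurements:

```python
# Testing mu_D = 0 on the differences is the same as a paired t-test.
import numpy as np
from scipy import stats

before = np.array([7.1, 6.8, 7.5, 6.9, 7.3, 7.0])   # hypothetical measure 1
after = np.array([6.7, 6.9, 7.0, 6.5, 7.1, 6.6])    # hypothetical measure 2

d = before - after                          # sample of differences D = X1 - X2
t1, p1 = stats.ttest_1samp(d, popmean=0)    # test on mu_D
t2, p2 = stats.ttest_rel(before, after)     # equivalent paired t-test

print(p1, p2)   # identical p-values
```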
Testing a single mean

t-test statistic

Test to compare the means μ₁ and μ₂ of two populations, based on two independent samples.
Test on the mean μ of a single population.
Conditions for the validity of test (for means)
When a test on one or two means is considered, either Condition 1 or 2 must be valid for the p-value (computed by
Excel) to be reliable:

1. The variable of interest has a Normal distribution in the population(s) under study;
OR
2. The sizes of the samples are sufficiently large (if so, the Central Limit Theorem applies and the sample mean is approximately Normal).
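For two independent samples, a sketch along the same lines (hypothetical data; equal_var=False gives Welch's version, which does not assume equal population variances):

```python
# Two-sample t-test: H0: mu1 = mu2  vs  H1: mu1 != mu2.
import numpy as np
from scipy import stats

group1 = np.array([5.2, 6.1, 5.8, 6.4, 5.9, 6.2])   # hypothetical sample 1
group2 = np.array([5.0, 5.5, 5.1, 5.6, 5.3, 5.2])   # hypothetical sample 2

t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(t_stat, p_value)
```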

Testing vs. confidence intervals


Both should always lead to the same conclusion. In fact, you can always conduct a two-tailed hypothesis test at the significance level α using a confidence interval at a 1−α confidence level. The conclusions will always be the same.
THEME 5:

Variable types
Numerical: values measured on a scale.
• Discrete: the possible values are isolated.
Example: Number of clients, number of children
• Continuous: the possible values run over an interval or a collection of intervals.
Example: Time, profit, weight, size, energy consumption
 Relationship btwn 2 numerical variables
 Descriptive analysis:
 Scatter plot
 See if there's a relationship between the 2 variables
 Visualize the shape of the relationship (linear or other)
 check if some values differ from the trend (inconsistent values or outliers)
 Correlation coefficient (r)
 measures the strength and direction of the linear relationship between two variables on a scale
from -1 to 1
 the closer it is to 1 or -1, the stronger it is (intensity)
 r=0: no linear relationship btwn the variables (NOT no relationship)
 Sign indicates the direction of the relationship. Positive: the variables vary in the same direction; negative: they vary in opposite directions.
 A high correlation coefficient (positive or negative) does not necessarily indicate a linear relationship. r = 1 means a perfect positive linear relationship.
 Test for independence between variables:
 Test of independence using the correlation coefficient
H₀: ρ = 0 Absence of linear relationship (ALWAYS)
H₁: ρ ≠ 0 Presence of a linear relationship

or H₁: ρ < 0 Presence of a negative linear relationship

or H₁: ρ > 0 Presence of a positive linear relationship between A and B
Parameter under study: ρ, the correlation coefficient (e.g. between the score of Extraversion and the score of Narcissism).
Conditions for the validity of the test
1. The variables have a Normal distribution within the population
OR
2. The size of the sample is large enough (n ≥ 30).
 Conclusion: At an α = 5% significance level, the observed data allow us to reject the null hypothesis and conclude there is a linear relationship between the grade in math and the sleep efficiency of children in elementary schools of the South Shore (Montreal).
The larger the sample estimate of r (positive or negative), the more likely we are to reject the null hypothesis.
Intuitively, we reject H₀ when t takes on extreme values. Decision rule: p-value ≤ α ⇒ we reject H₀.
 Excel: The template “Test_CorrelationCoefficient.xlsx”
Significant relationship ≠ Strong relationship
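An illustrative Python equivalent of the template (the scores below are hypothetical):

```python
# pearsonr returns r and the p-value of the test H0: rho = 0 vs H1: rho != 0.
import numpy as np
from scipy import stats

extraversion = np.array([3.1, 4.2, 5.0, 2.8, 4.8, 3.9, 5.5, 4.1])  # hypothetical
narcissism = np.array([2.5, 3.9, 4.8, 2.2, 4.1, 3.5, 5.2, 3.6])    # hypothetical

r, p_value = stats.pearsonr(extraversion, narcissism)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
# A significant relationship (small p) is not necessarily a strong one (|r| near 1).
```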

Categorical: describes characteristics or attributes.


• Nominal: a label is associated to each category, but it does not correspond to a well-known unit of measure and
there is no natural order between labels.
Example: Gender, marital status, color, brand
• Ordinal: the categories are naturally ordered, but they do not have a numerical interpretation.
Example: Appreciation, scale with adjectives (e.g. slow, medium, fast)
 Descriptive analysis:
 Contingency table
 Cramér's contingency coefficient
o To measure the strength of the relationship for 2 categorical variables.
o Values run from 0 to 1: the bigger the number, the stronger the association between the variables; 0 = no relationship.
o ContingencyTable_ChiSquareTest.xls on Excel
 Test for independence between variables:
 Test of independence for two categorical variables: the chi-square test
H₀: variables X and Y are independent
H₁: variables X and Y are dependent
Conditions for the validity of the test
1. The sample size must be sufficiently large (n ≥ 30)
2. All expected cell frequencies under H0 must be ≥ 5
Both conditions must be satisfied!
 Expected frequencies: (under the hypothesis of independence; what we would get if indep)
Expected counts = Sum of the line x Sum of the column / Total Sum
 we compare the distance between the observed counts and those expected under H 0
χ² = Σ (observed frequency − expected frequency)² / expected frequency

For 2 categorical variables, the strength of the relationship can be measured by: Cramer’s contingency coefficient
It takes values from 0 to 1.
0 value means there is no relationship between the variables. Larger Cramer coefficients mean a stronger association
between the variables.
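A sketch of the chi-square test plus Cramér's coefficient on a hypothetical 2×2 table (scipy also returns the expected counts needed to check the validity conditions):

```python
# Chi-square test of independence and Cramer's contingency coefficient.
import numpy as np
from scipy import stats

observed = np.array([[30, 10],      # hypothetical contingency table
                     [20, 40]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

n = observed.sum()
k = min(observed.shape)             # smaller of (number of rows, columns)
cramer_v = np.sqrt(chi2 / (n * (k - 1)))

print(p_value, cramer_v)
print(expected)   # validity: n >= 30 and all expected counts >= 5
```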

 Correlation ≠ causation
 Dependence ≠ causality (stork effect)

Confounding Factor
In order to establish a causal relationship, i.e. a statement of the form X is the cause of Y , we must make sure there is
no confounding variable Z which is associated with both X and Y and is making you believe that the factor X under study
is causing Y while in fact it isn’t.
 A controlled experiment tells us whether there is causation or not.

THEME 6:
Analysis of Variance (ANOVA) – Single Factor
To compare two or more means from independent samples.
Does the factor or treatment have an impact on the variable of interest?
Are the means μ₁, μ₂, …, μ_k for all k groups equal, or is there some difference?
Hypotheses in a Single Factor ANOVA:
H₀: μ₁ = μ₂ = … = μ_k (no difference)

H₁: At least one mean is different.

Two sources of variability: (The goal of an ANOVA is to decompose the variability of the data )

1. Observation error (the variability within groups)


It comprises all variability sources which affect the response (e.g. the sales) other than the studied factor or treatment. It corresponds to all individual differences not explained by the factor.
Examples: supermarket geographical location, client profile, …

2. Variation due to the factor/treatment (the variability between groups) (explained by the regression model)

Total variability = variability due to factor + variability due to error


SST = SSF + SSE

ANOVA tables

Fisher’s test
F = (mean variation due to factor) / (mean variation due to error) = MSF / MSE
Fisher’s test is valid if the following three conditions are met:
1. The underlying variables of interest (X_i) follow a Normal distribution in all k populations, or the collected samples are sufficiently large.
2. The k samples are independent.
3. The variances in the k populations are equal (σ₁² = σ₂² = … = σ_k² = σ²). (This can't be controlled by the researcher, so it limits the applicability of the test.)
Welch’s test
H₀: μ₁ = μ₂ = … = μ_k
H₁: At least one mean is different
Conditions for the validity of the test:
1. At least one of the following conditions is met:
i. The underlying variables of interest (X_i) follow a Normal distribution in all k populations.
OR
ii. The samples are sufficiently large.
 For two to nine groups, all group sample sizes should be ≥ 15.
 For 10 to 12 groups, all group sample sizes should be ≥ 20.
2. The k samples are independent.
Use the Excel file “ANOVA_Welch_Holm.xlsx”
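An illustrative sketch of the single-factor ANOVA on hypothetical sales data (scipy's f_oneway implements Fisher's equal-variance test; a Welch-type version exists in statsmodels):

```python
# H0: mu1 = mu2 = mu3  vs  H1: at least one mean is different.
import numpy as np
from scipy import stats

entrance = np.array([52, 48, 55, 60, 57])   # hypothetical bottle sales
cashiers = np.array([41, 45, 39, 44, 42])
aisle = np.array([40, 43, 38, 42, 41])

F, p_value = stats.f_oneway(entrance, cashiers, aisle)   # Fisher's test
print(F, p_value)
# For Welch's test (unequal variances), see statsmodels'
# stats.oneway.anova_oneway(..., use_var="unequal").
```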

Pairwise comparisons/ pairwise test


If we reject H₀, there is at least one mean that is different from the rest. But which one?
 Test H₀: μ_i = μ_j vs. H₁: μ_i ≠ μ_j for all pairs (i, j).
 Use the Holm-Bonferroni correction for the p-values (test for 2 independent groups).
 Compare the largest p-value (with correction) with the significance level.
 We can multiply each p-value by r and compare the corrected p-values with the global level α.
 For a global 5% significance level, use the Holm-Bonferroni correction; a single test at the 5% level just compares its raw p-value with 5%.
 In general, with k groups there are k(k−1)/2 different pairs.
At the global significance level of 1%, we can conclude that there is a difference in the mean sales of bottles sold when
they are displayed at the entrance versus when they are at one of the two other sites considered, however, no
difference in mean sales is detected depending on whether the bottles are near the cashiers or in the aisle of cleaning
products.
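A sketch of the pairwise procedure, with the Holm-Bonferroni correction coded by hand on the hypothetical groups from the ANOVA sketch above:

```python
# One Welch test per pair (k = 3 groups -> k(k-1)/2 = 3 pairs), then Holm.
from itertools import combinations
import numpy as np
from scipy import stats

groups = {"entrance": np.array([52, 48, 55, 60, 57]),   # hypothetical data
          "cashiers": np.array([41, 45, 39, 44, 42]),
          "aisle": np.array([40, 43, 38, 42, 41])}

raw = {(a, b): stats.ttest_ind(groups[a], groups[b], equal_var=False).pvalue
       for a, b in combinations(groups, 2)}

# Holm: multiply the i-th smallest p-value by (r - i), keep a running max,
# and cap at 1, where r is the number of tests.
pairs = sorted(raw, key=raw.get)
r = len(pairs)
adjusted, running_max = {}, 0.0
for i, pair in enumerate(pairs):
    running_max = max(running_max, (r - i) * raw[pair])
    adjusted[pair] = min(running_max, 1.0)

print(adjusted)   # compare each corrected p-value with the global alpha
```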

The anchor effect


The anchor effect is a cognitive bias originating from the common human tendency to rely too heavily on first impressions: we keep an initial piece of information (the anchor) and use it to make subsequent judgements.

Blind experiments
 Blind experiment: if information is hidden from the participants until the end of the study to avoid bias in the
results.
 Double-blind experiment, both participants and researchers don’t have access to all the information. This reduces
bias. (experimenters don’t know who has the drug or placebo)

Multiple comparisons
 Even rare events become likely when many cases are examined.
 When H 0 is true, a test will rarely reject H 0. Indeed, the frequency of incorrect rejections (Type I Error) is set by the
significance level, e.g. α =5 % .
 When performing many tests, we are at risk of finding effects that do not exist!

The Bonferroni correction


If we run r tests and want to keep the global significance level at α, we can run each test at level α* = α/r.

Equivalently, we can multiply each p-value by r and compare the corrected p-values with the global level α.
 Note: Decreasing the significance level or increasing the p-values makes rejecting H₀ more difficult! The Bonferroni correction is conservative and rarely rejects the null hypothesis. It reduces the Type I risk, but also the power (not ideal).

The Holm-Bonferroni correction


 keeps the Type I error at α , but doesn’t lower the power as much.
Gives you the desired global significance level α .

THEME 7.1:
Regression line (or regression equation)

The regression line is the line that best represents the linear relationship between two variables, as shown in the
scatterplot below.
Ordinary least squares (OLS) line = regression line: it's the "best line", the one that minimizes the sum of the squared errors ("least squares").

Use it to
1. to make predictions (We can use the value of one variable to predict the value of the other)
 Predict a new (or future) value of Y
 prediction in regression template: MOE is much smaller for a prediction of an average than for an
individual
2. To test for the presence of a linear relationship between variables (inference)

The variables

1- Dependent or response variable (Y): Y depends on X


 The variable of primary interest whose values are to be explained or predicted. The dependent variable must be a quantitative variable that may take on a large number of values (continuous variable or discrete variable with many possible values). It is what we want to predict / what we want to explain.

2- One or several independent or explanatory variables ( X or X 1 , X 2 , … , X k )


 used to explain or predict values of Y . They are numerical or categorical (or indicator: 0/1). The regression is
simple or multiple, depending on whether one or several independent variables are used to explain Y.

Simple linear regression


The simple linear regression model is:
Y = β₀ + β₁X + ε.

 β₀ is the y-intercept, i.e. the value of Y when X = 0.
 β₁ is the slope: Y increases by β₁ units, on average, whenever X increases by 1 unit. The slope is constant.
 β₀ + β₁X = E[Y | X] = the average value of Y for all population individuals that have the specified value of X.
 Y = the observation of a specific individual.
 The random error term ε varies from one individual to another and results from factors other than X that may explain the value of Y but that do not appear in the model. ε is assumed to have zero mean.

Sales = 6.47 + 54.37*Budget


• With a $0 budget (a situation that never happens within the data set), we estimate that a franchise would have sales of
$6.47 million during the first year.
• For each increase of $1 million in the advertising budget, we estimate that sales of a given franchise during the first
year would increase by $54.37 million on average.
salary = 1199 + 2.88 × number of months of experience
• It is estimated that an additional month of experience results, on average, in an increase in salary of £ 2.88 per month.
• It is estimated that a beginner teacher (with 0 months of experience) earns, on average, £ 1199 per month.
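A minimal sketch of fitting such a line (the budget/sales numbers below are made up, not the course's data set):

```python
# numpy.polyfit with degree 1 returns the least-squares slope and intercept.
import numpy as np

budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0])           # hypothetical ($M)
sales = np.array([60.0, 118.0, 170.0, 225.0, 280.0])   # hypothetical ($M)

b1, b0 = np.polyfit(budget, sales, deg=1)   # slope, then intercept
print(f"sales_hat = {b0:.2f} + {b1:.2f} * budget")

# Prediction for a new budget (stay within the observed range!)
print(b0 + b1 * 3.5)
```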

General linear regression

Multiple linear model:

Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + β_pX_p + ε.
 β₀ + β₁X₁ + β₂X₂ + ⋯ + β_pX_p represents the average value of Y for population individuals corresponding to a given set of values (X₁, X₂, …, X_p).
 ε is the difference between the Y-value of an individual and the average Y-value of population individuals with the same characteristics (X₁, X₂, …, X_p). It is the difference between a point in the scatter plot and the line: observed Y − estimated Y.
 Excel template “Regression.xlsx”.

Interpretation of a numerical variable (quantitative)


 It is estimated that an additional month of experience leads to a £2.67 increase in average monthly salary, when the other variables are otherwise identical (held fixed).
 For numerical explanatory/independent variables, each unit increase (each additional month) brings an average increase in the response variable equal to the variable's coefficient (£2.67 here), if the other variables included in the model are otherwise identical.
Interpretation of an indicator variable (a comparison between 2 groups)
 It is estimated that for the same level of experience, type of school, and extended absence status, teachers with higher education degrees earn, on average, £478.86 more per month than those who do not have such degrees.
 With all the other variables fixed.
 For indicator variables, the associated coefficient estimates the average difference in the response variable between the two groups defined by the indicator variable, e.g. the average wage difference between those with or without a post-graduate degree, when all other variables included in the model are otherwise identical.
o The interpretation of the coefficient is about performing a COMPARISON BETWEEN THE TWO GROUPS.

When categorical transformed into indicator


 We estimate that an adolescent in Secondary 3 updates his or her Facebook status 1.053 fewer times per week than an adolescent in Secondary 1 (the reference category), when all other variables are kept constant (same gender and same scores on extraversion and narcissism).
If it does not make sense: we estimate that a female adolescent (gender = 0) with a score of 0 for narcissism updates her Facebook status on average −1.177 times per week. This is an extrapolation and cannot be interpreted (the values are outside the range of values in the sample).

Test on the correlation coefficient


H₀: ρ = 0 vs H₁: ρ ≠ 0.
The correlation allows us to test for the presence of a linear relation between only two variables, i.e. when Y = β₀ + β₁X.

Global test of significance


Determines if there is a linear relationship between a response variable and several explanatory variables, i.e. when Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + β_pX_p.
You often want to know if the model is globally significant, i.e. if at least some of the variability of Y is explained by the explanatory variables in the model.
H₀: β₁ = β₂ = … = β_p = 0 ↔ The model is not significant
 The model reduces to Y = β₀ + ε. X₁, X₂, …, X_p do not appear in this expression, which means that they do not generate variations in Y.
 None of the explanatory variables are useful in explaining variations of Y, therefore you may simply conclude that there is no linear relationship. This model is useless for predictions; abandon it.
 If you want to explain Y, you will need to look for new explanatory variables. But the correct model could also be nonlinear: you should not abandon your variables before you have looked at them.

H₁: at least one of the coefficients β₁, …, β_p is different from zero ↔ The model is significant
 Reject H₀ (small p-value).
 There exists a linear relationship between Y and the explanatory variables.

Conclusion: At the level of significance 1%, the model is globally significant (Look in regression, result, p-value ANOVA )
What should you do if the regression is significant, but not useful?
 In this case, a relationship between Y and the explanatory variables has been established, but it is relatively
weak ( R2 is small). This means that the explanatory variables explain variations in Y to some extent, but are not
sufficient to explain them well.
 To improve the quality of predictions, it would be desirable to incorporate other explanatory variables to the
ones we already have in a new linear regression model.

How strong is the linear relationship? R-squared (or R²)


R² = (variability explained by the regression model) / (total variability in the data)
R2 is the proportion of variability in the data explained by the model.
 R² takes on values between 0 and 1 (in business, around 50 to 60% is typically desired).
 A value close to 1 means that the points are close to the regression line (or regression equation).
 If only one explanatory variable is included in the regression model, then R² = r², where r measures the strength of the linear relationship between two variables.
 Thus, R² should not be used to compare two models with different numbers of explanatory variables.
 The adjusted R-squared coefficient R²_adj (Adj. R-sq. in the table) is a correction that takes into consideration the number of variables in the model. It is used to compare models with a different number of variables.
 The higher R²_adj is, the better the model.
Interpret R²:
 R² = 0.756: the regression model accounts for 75.6% of the variability in the observed salaries.
 It is the proportion of the variability of variable Y that can be explained by variable X.
 R² = 0.0825: the model explains only 8.25% of the observed variability in the frequency of Facebook status updates per week. (Evaluate and interpret the model's performance.)
 Compare the R² to know which model is the best predictor.
 Significant regression: the data allow us to reject H₀: β₁ = … = β_p = 0 (small p-value). There exists a linear relationship between Y and the explanatory variables.
 Useful regression: R² is large. There is a strong linear relationship between Y and the explanatory variables.

Which variables contribute to the model?


Individual tests on the coefficients
To know which explanatory variables included in the model are actually contributing.
Which of the explanatory variables contribute significantly to this 4-variable model?
H₀: β_i = 0 ↔ Variable X_i does not contribute significantly;
H₁: β_i ≠ 0 ↔ Variable X_i contributes significantly to the model;
for i = 1, 2, …, p, i.e. for each explanatory variable X_i present in the model.
To find the p-value: look in Regression, results, p-value of the coefficients
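A single statsmodels fit reports everything above: the global F-test, R², and the individual coefficient tests. An illustrative sketch on hypothetical salary records (the course uses "Regression.xlsx"):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({                        # hypothetical sample
    "months_exp": [3, 12, 24, 36, 48, 60, 72, 84],
    "degree": [0, 0, 1, 0, 1, 1, 0, 1],    # indicator variable
    "salary": [1210, 1240, 1790, 1300, 1870, 1905, 1395, 1980],
})

X = sm.add_constant(df[["months_exp", "degree"]])
model = sm.OLS(df["salary"], X).fit()

print(model.f_pvalue)                       # global test: H0: beta1 = ... = 0
print(model.rsquared, model.rsquared_adj)   # R-squared and adjusted R-squared
print(model.pvalues)                        # individual tests: H0: beta_i = 0
```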

Conditions that need to be satisfied:


- Analysis of residuals

At the 1% level of significance, the average wages of men and women differ
 A ‘Significant’ beta parameter in front of the gender variable (sex) means that it is statistically too large to be
explained by chance alone

Control variables: possible/potential confounding variables (other variables that could explain differences in salary)
Controlling for confounders allows us to detect whether a variable has an effect on Y when other specific factors are
taken into account.
Only if the data come from a controlled randomized experiment can causality be established.
An association does not necessarily mean causation.

THEME 7.2:
How can we incorporate categorical variables?

Only numerical and indicator explanatory variables may be included. (non-binary) categorical variables should never be
included directly in a linear regression model.

1. Choose a reference category (arbitrarily): (ex: type 1 is the reference)


2. Define an indicator variable for all other values of the categorical variable:
Example:
X₃₂ = 1 if X₃ = 2, 0 otherwise
X₃₃ = 1 if X₃ = 3, 0 otherwise
X₃₄ = 1 if X₃ = 4, 0 otherwise
X₃₅ = 1 if X₃ = 5, 0 otherwise
Use the Excel function IF (SI in the French version) to perform this coding.

3. Incorporate the categorical information (e.g. the X₃ info) into the model using the indicator variables (e.g. X₃₂, X₃₃, X₃₄ and X₃₅). This will allow you to compare each category against the reference category.
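A sketch of steps 1 to 3 using pandas instead of Excel's IF (the categories are hypothetical; drop_first makes category 1 the reference):

```python
import pandas as pd

df = pd.DataFrame({"X3": [1, 2, 3, 2, 4, 5, 1, 3]})   # hypothetical categories

# One 0/1 indicator per category, dropping the reference category (X3 = 1)
indicators = pd.get_dummies(df["X3"], prefix="X3", drop_first=True).astype(int)
print(indicators.head())   # columns X3_2, X3_3, X3_4, X3_5
```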
Confidence intervals for prediction 7.2.15
A. A confidence interval for the mean (e.g. average income of all female teachers with 25 months of experience and
with a degree of type 1):
B. A confidence interval for the salary of a specific person (e.g. Sophie):
o The interval for a specific person is therefore larger since it incorporates an additional source of
uncertainty.
- Predicting the value of y for a future observation: a wider confidence interval.
Warning of prediction:
 Don't extrapolate far beyond the observed range of the predictors X
Predictions should be limited to values of the explanatory variables (the X ) that are within (or close to) the range of
sample values used to fit the model. Example: If you have data ranging from 1900 to 2015, the predictions for 2016
or 2020 may be reliable, but the prediction for 2200 may not.
 Don't extrapolate beyond the population
Predictions should be made only for units belonging to the population under study (from which the sample has been
selected). Example: The British teacher regression equation cannot be used to draw a conclusion for Deborah, a
Canadian teacher.

Multicollinearity: be cautious
 There is multicollinearity when the explanatory variables in the model are highly correlated. (redundant info)
 Multicollinearity tends to increase the uncertainty in the estimation of the model coefficients.
 Thus, significant variables may lose their significance because of the additional uncertainty in estimating the model
parameters and, moreover, the interpretation of the coefficients may be distorted.
 To avoid the multicollinearity problem when carrying out inference, it is recommended to remove from the model
the variable with the lowest correlation with the response variable Y among any pair of explanatory variables
with a strong correlation (e.g. |r|> 0.7).
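A sketch of that screening rule; the data frame below is hypothetical and the 0.7 threshold follows the rule of thumb above:

```python
import pandas as pd

def flag_collinear(df: pd.DataFrame, threshold: float = 0.7):
    """Return explanatory-variable pairs with |r| above the threshold."""
    corr = df.corr()
    return [(a, b, round(corr.loc[a, b], 3))
            for i, a in enumerate(corr.columns)
            for b in corr.columns[i + 1:]
            if abs(corr.loc[a, b]) > threshold]

df = pd.DataFrame({"x1": [1, 2, 3, 4, 5],
                   "x2": [2, 4, 6, 8, 11],    # nearly collinear with x1
                   "x3": [5, 3, 8, 1, 9]})
print(flag_collinear(df))
# For each flagged pair, drop the variable least correlated with Y.
```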

Interaction
 There is an interaction between two explanatory variables X_i and X_j when the effect of X_i on the response variable Y differs according to the value of X_j, and vice-versa (the value of one modifies the rate of change of the other).
 There is interaction between two explanatory variables when their effect is multiplicative, not only additive. Such an effect is incorporated into a regression model by adding a new variable defined as the product of the two original variables.
 Adding an interaction term makes the lines of the graph different (different slopes).
 The hypotheses are: H₀: β_nb_stores×sex = 0 (the slope is the same) vs. H₁: β_nb_stores×sex ≠ 0 (the slopes are different).
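A sketch of adding and testing the product term (the nb_stores/sex data are hypothetical):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({"nb_stores": [2, 5, 8, 3, 9, 6, 4, 7],   # hypothetical
                   "sex": [0, 1, 0, 1, 1, 0, 1, 0],
                   "y": [11, 28, 30, 19, 49, 24, 25, 27]})
df["nb_stores_x_sex"] = df["nb_stores"] * df["sex"]   # the interaction term

X = sm.add_constant(df[["nb_stores", "sex", "nb_stores_x_sex"]])
fit = sm.OLS(df["y"], X).fit()
print(fit.pvalues["nb_stores_x_sex"])   # H0: interaction coefficient = 0
```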

Analysis of residuals
 All tests and confidence intervals presented in Theme 7 (Parts 1 and 2) are valid only if the errors are independent and follow the same Normal distribution with zero mean: ε ~ N(0, σ_ε).
 The residuals are the differences between the sample points and what the model predicts they should be.
 The conditions for validity (of the regression model) can be checked with tests (not covered in these course
notes) and plots of the residuals. This inspection is called analysis of residuals.
Before going any further, let us detail the assumptions formulated previously:
1. The error term ε has mean 0 (the relation actually is linear),
2. The variance of ε is constant over the individuals of the population (homoscedasticity = uniform variance),
3. The errors ε are normally distributed,
4. The errors ε are independent (or uncorrelated).

1. Linearity
If the relation is not linear, all results are invalidated.
No problem is detected if the points are randomly scattered around the horizontal axis (or we do not see any particular
trend). (compare Residuals and X)

2. Homogeneity of the variance

The residuals must have a constant variance. In addition to the plots of residuals with respect to each X, we also look at the residuals with respect to the predicted values of Y (compare residuals vs. predicted values).

3.Normality of residuals
We can check graphically that the residuals follow approximately a Normal distribution. The figure below is called a QQ-
plot. If the residuals are normal, it will resemble a line.

4.Independence between observations


The best strategy is to be careful when collecting the data. In other words, the data collection methodology itself is the best guarantee of independence between different observations.
Look at residuals vs. observation number (which represents time): if the observations vary freely, there is no dependence problem.
7.2.47: this analysis applies only if the order of the observations is chronological.
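An illustrative residual-analysis sketch covering the plots above (the model and data are simulated, not the course's):

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)                     # simulated data
y = 3 + 2 * x + rng.normal(0, 1.5, size=50)
fit = sm.OLS(y, sm.add_constant(x)).fit()
residuals = fit.resid

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(fit.fittedvalues, residuals)   # linearity + constant variance
axes[0].axhline(0, color="gray")
axes[0].set(xlabel="Predicted Y", ylabel="Residuals")
stats.probplot(residuals, dist="norm", plot=axes[1])   # QQ-plot: normality
axes[2].plot(residuals, marker="o")            # independence (if order = time)
axes[2].set(xlabel="Observation number", ylabel="Residuals")
plt.tight_layout()
plt.show()
```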

THEME 8: Making Predictions with Linear Regression


Prediction (…)
In the current theme, we shall use linear regression with the sole objective of providing predictions (in the form of
point estimates) of the response variable Y with the highest accuracy possible for individuals for which the response
has not been measured.
We are no longer interested in interpreting, measuring and testing the effects of given factors on the response
and do not wish to compute any confidence intervals. So conditions for validity for linear regression do not need
to be verified and multicollinearity is not a concern. No interpretation of the slopes

Objective: predict response variable (y) with explanatory variables (x)


Predict a point estimate (y) with the highest accuracy possible
You want to predict the value of Y , for events that have not been observed yet.
 Solution:
o Fit a linear regression to the historical data.
o Then use the equation of the regression to make predictions.

The lesson
 Several models may be proposed to predict a characteristic (a variable).
 Predictive models make errors.
 A better model is one that leads to smaller errors.
 It is not reasonable to assess the performance of a model by assessing how well it performs on the data that were used to fit this model.

Cross-validation
To assess the performance of a model, you MUST use data that were not involved in the fitting. One option is:
1. To split (randomly) your data into two parts: training data and test data.
2. To use only the training data to fit the model.
3. To measure performance on the test data.
Training data ↔ Fitting the model (data tab)
Test data ↔ Measure performance (cross-validation tab)

To measure the performance of a model: use RMSE


 RMSE can be interpreted as the average size of the prediction error committed by the model. The units are the
same as those of Y . (square root of the mean square of errors)
 A smaller RMSE indicates a better performance (a better prediction).
 Don’t forget that the RMSE should be computed using the test data, not the training data.
- Use the Regression template, then the data and cross-validation tabs, in Excel.
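A sketch of the three steps above on simulated data (the course does this with the template's tabs):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
df = pd.DataFrame({"x1": rng.normal(size=100),   # simulated data
                   "x2": rng.normal(size=100)})
df["y"] = 5 + 3 * df["x1"] - 2 * df["x2"] + rng.normal(scale=0.8, size=100)

test_idx = rng.choice(len(df), size=30, replace=False)     # 1. random split
train, test = df.drop(index=test_idx), df.loc[test_idx]

X_cols = ["x1", "x2"]
fit = sm.OLS(train["y"], sm.add_constant(train[X_cols])).fit()  # 2. fit on train

pred = fit.predict(sm.add_constant(test[X_cols]))               # 3. test RMSE
rmse = np.sqrt(((test["y"] - pred) ** 2).mean())
print(f"RMSE on test data: {rmse:.3f}")
```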

Interpret RMSE: On average, the model including (for ex all variables) makes prediction errors of about $822.15.
Should we always use all variables? NO
We should:
o consider as many variables as possible,
o but then choose the best model using possibly only a subset of these variables.
Creating variables from existing ones
 More generally, adding new variables that are powers or interaction terms of existing variables can boost the
performance of a model.
The all-subsets approach consists of fitting all the models and selecting the best one. However, the number of models quickly gets out of control; so use:
Stepwise methods

Backward selection with a 5% significance level


Starts with the full model (containing all variables). Then,
1. Fit the model.
2. Identify the variable with the largest p-value (from the test for individual contribution); this is the least useful variable.
3. If that p-value is ≤ 0.05, stop and keep the current model (a variable is considered useful if its p-value is ≤ 0.05).
If the p-value is > 0.05, remove this variable from the model and repeat from step 1 with the reduced model.
It is important to fit and re-examine the reduced model after each elimination.
This won't necessarily give the best model, but a reasonable one (useful when the total number of models is unmanageable).
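A sketch of the backward-selection loop with statsmodels (simulated data; the 5% threshold follows the rule above):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(80, 4)), columns=["x1", "x2", "x3", "x4"])
df["y"] = 4 + 2.5 * df["x1"] - 1.5 * df["x3"] + rng.normal(size=80)

variables = ["x1", "x2", "x3", "x4"]           # start with the full model
while variables:
    fit = sm.OLS(df["y"], sm.add_constant(df[variables])).fit()
    pvals = fit.pvalues.drop("const")          # coefficient p-values
    worst = pvals.idxmax()                     # least useful variable
    if pvals[worst] <= 0.05:                   # everything significant: stop
        break
    variables.remove(worst)                    # remove it, refit the rest

print(variables)   # likely ['x1', 'x3'] for this simulated data
```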

Alternative measures of performance


What if I have a very small sample and cannot afford to divide into training and test data?
 An alternative to cross-validation is to use statistics that are designed to account for the number of variables in the
model:
o R²_adj: a larger R²_adj is better (it takes into account the number of variables).
o AIC : Akaike Information Criterion: A smaller AIC means a better fit.
o BIC : Bayesian Information Criterion : the smaller the better.
 The three criteria presented previously do not necessarily agree: they can suggest different models. When they
contradict one another, the proposed models are all reasonable choices.

 If hesitating between 2 variables to include (e.g. grad or degree), compare the models with R²_adj, AIC or BIC (they take into account the number of variables involved in the model).
