Statistical Inference Vs Descriptive Statistics: Conclusions
THEME 1
Vocabulary:
Population: the entire group of subjects (or units) we wish to study (e.g. the collection of invoices submitted by the
employees).
Sample: a group of units drawn from the population (e.g. some subset of the invoices submitted by the employees).
Variable of interest: a feature or property of every member of the population we wish to study (e.g. the amount of
inadmissible expenses on an invoice). This feature or property varies from one individual to another. It is often
denoted by X (each unit of the population has its own value).
Parameter: a population-specific quantity associated with the variable X, almost always unknown to us (e.g. the
mean amount of inadmissible expenses on invoices). It is always fixed and free from sampling error, e.g. mean (μ), std dev (σ),
correlation (ρ), proportion (p). It defines the population; a fixed quantity; what we are looking for.
Estimator: a statistic defined from a sample that enables the estimation of an unknown parameter. Its value varies
with the sample (it has its own distribution, mean and std dev), so it carries an estimation (sampling) error.
Point estimate: the value of the estimator for a given observed sample; the tool used to approximate the parameter
of interest. It depends on the sample used to calculate it, so it varies from sample to sample (sampling variability).
Statistical Inference: drawing conclusions about a population from a sample. Make sure the sample is
representative of the population so the conclusions can be extrapolated (generalized) to the whole population.
Observational Study
Observational study:
A study where individuals are merely observed. The variable of interest is measured but no attempt is made to
influence the response. (researcher’s passive observation)
Measurement bias: the method used to measure the feature we wish to study does not adequately measure what
we really want to measure. (answering incorrectly or not wanting to answer honestly)
Example: Getting teachers to measure swearing at recess as a way of measuring the amount of swearing in the
school.
Problems:
Lack of integrity (so collect answers anonymously, especially for sensitive topics)
Bad formulation (avoid double negation and make the question clear); use impartial vocabulary (do not suggest an answer)
Avoid memory/recall bias by doing a longitudinal study (instead of a cross-sectional one)
Nonresponse: if we cannot measure the feature(s) of interest for some individuals (units) in the sample, a bias may
occur if the individuals who answered differ from those who didn’t (because some people refuse to answer or can’t
be reached); missing values/holes in the data set
To avoid it (when sensitive topic, wrong timing, data collection methodology, …)
Attempt to reach out to the individuals in the sample multiple times
Simple random sampling without replacement (SRSWOR) : we select units at random from the population until we
have the target size. Units can only be chosen once. This is more natural!
When the sampling fraction is small (i.e. if the population is large compared to the sample), this method provides a
good approximation to simple random sampling with replacement (SRSWR).
Systematic sampling: we systematically select the units from a list (e.g. every 12th unit) having previously chosen at
random the starting point. This is the easiest way to approximate a SRSWR.
Stratified sampling: the population is divided into separate groups called strata (e.g. by age, sex, etc.). A simple
random sample is then chosen from each group. This ensures the different strata are well represented in the
sample in order to improve the precision of the estimates.
Cluster sampling: if the population already appears in groups of units, named clusters (e.g. houses, locations from a
region, buildings or blocks in a city, etc.), we can take a simple random sample of clusters. All individuals from the
sampled clusters are part of the final sample. This method often reduces costs and complexity, but it also reduces
precision of the estimates for a similar sample size.
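A minimal sketch of three of the sampling designs above, using Python's standard library on a hypothetical population of 100 unit IDs (the population, strata split, and sample sizes are invented for illustration):

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 unit IDs
random.seed(42)                   # fixed seed so the draw is reproducible

# SRSWOR: units drawn at random, each chosen at most once
srswor = random.sample(population, 10)

# Systematic sampling: every k-th unit after a randomly chosen starting point
k = len(population) // 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified sampling: an SRSWOR inside each stratum (here: two halves of 50)
strata = [population[:50], population[50:]]
stratified = [u for s in strata for u in random.sample(s, 5)]

print(len(srswor), len(systematic), len(stratified))  # 10 10 10
```

Cluster sampling would instead draw whole groups with `random.sample` over the list of clusters and keep every unit inside the selected clusters.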
Want to avoid:
Convenience samples
Volunteer samples
THEME 3:
Common Estimators
Sample proportions are means: X̄ = p̂ when X is an indicator (0/1) variable.
σ_X̄ = √V(X̄) = σ/√n. Roughly speaking, this is the average estimation error, i.e. the standard deviation of the
sample mean.
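The σ/√n formula can be checked by simulation: draw many SRSWR samples from a hypothetical population and measure how much the sample mean varies. The population (exponential values with mean 50) and the sample sizes are invented for illustration:

```python
import random
import statistics

random.seed(0)
# Hypothetical skewed population of 100,000 values with mean and std dev ~50
population = [random.expovariate(1 / 50) for _ in range(100_000)]
sigma = statistics.pstdev(population)

def se_of_mean(n, reps=1000):
    """Empirical std dev of the sample mean across many SRSWR samples of size n."""
    means = [statistics.fmean(random.choices(population, k=n)) for _ in range(reps)]
    return statistics.stdev(means)

for n in (25, 100, 400):
    # empirical standard error vs the theoretical sigma / sqrt(n)
    print(n, round(se_of_mean(n), 2), round(sigma / n**0.5, 2))
```

Each time n is multiplied by 4, the standard error is roughly halved, matching the 1/√n in the formula.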
This means that:
The larger the sample size n, the less the sample mean varies from one sample to another (the smaller σ_X̄).
Larger sample sizes thus lead to more precise estimates of the population mean.
The less volatile the variable of interest in the population (the smaller the standard deviation σ), the smaller
σ_X̄. Less volatile variables of interest thus lead to more precise estimates of the population mean.
The estimation error corresponds to the difference between the point estimate and the true population mean, namely
𝑥̅−𝜇
Differences between S and S/√n
S estimates the variability of the variable of interest across the units in the population.
S/√n estimates sampling variability, i.e. the variability of the sample mean across possible samples of size n.
It is the sample std dev of the sample mean, an evaluation of the average estimation error. Example: we estimate
that the sample mean duration deviates on average by 3.22 minutes from the population mean duration (= average
estimation error).
Confidence Interval
The Central Limit Theorem is true no matter the distribution of the variable X in the population.
The Central Limit Theorem allows estimation by confidence intervals of the mean of any population as long as
the sample size is large enough.
For both the Normal distribution and Student’s t-distribution with n−1 degrees of freedom:
Pointed curve ↔ small dispersion ↔ small standard deviation
Flat curve ↔ large dispersion ↔ large standard deviation
The z-interval X̄ ± z_{α/2} · σ/√n requires the population standard deviation σ, which we do not know.
Confidence interval for μ, at a 1−α confidence level: X̄ ± t_{n−1,α/2} · S/√n (we use t because σ is replaced by
its sample estimate S).
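A sketch of the interval computation on a hypothetical sample of 80 simulated values. The standard library has no t quantile, so the Normal quantile z_{α/2} stands in for t_{n−1,α/2}, which is a close approximation at this sample size:

```python
import random
from statistics import NormalDist, fmean, stdev

random.seed(1)
sample = [random.gauss(17, 6) for _ in range(80)]  # hypothetical sample, n = 80

n = len(sample)
xbar, s = fmean(sample), stdev(sample)
alpha = 0.05
# For large n, t_{n-1, alpha/2} is close to the Normal quantile z_{alpha/2}
z = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for a 95% level
moe = z * s / n**0.5                     # margin of error
ci = (xbar - moe, xbar + moe)
print(round(ci[0], 2), round(ci[1], 2))
```

In the course's Excel workflow, the exact t quantile would come from T.INV.2T; the structure of the interval (estimate ± quantile × S/√n) is the same.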
This interval is valid if at least one of the following conditions is satisfied:
1. The variable of interest has a Normal distribution in the population;
2. The sample size is sufficiently large (so the Central Limit Theorem applies).
INTERPRETATION
“The interval (14, 20) was generated by a process that captures the population mean μ 95% of the time.”
“At a 95% confidence level, we estimate that the real mean is located between 14 and 20.”
At the 95% confidence level, we estimate that the average income of households within 30km of the site is between
$64,264 and $67,003.
At the 95% confidence level, we estimate that the proportion of clients that spend at least $100 during a visit lies
between 29.0% and 40.5%.
95% CI: 1 out of 20 times, on average, the true parameter is not contained in the interval
The ME is 2.9%, and 95% of samples estimate the true proportion within the ME provided
At the 95% confidence level, we estimate that cars of the studied model consume on average between 8.560 and
8.920 liters per 100 km.
Use a “quantile calculation” to obtain the t or z value.
Confidence interval for a proportion: P̂ ± t_{n−1,α/2} · S/√n,
since S = √(n/(n−1)) × √(P̂(1−P̂)) ≈ √(P̂(1−P̂)) when n is large.
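A sketch of that proportion interval on hypothetical counts chosen to land near the "clients spending at least $100" example above (the z quantile again stands in for t_{n−1,α/2}, valid for large n):

```python
from statistics import NormalDist

# Hypothetical: 120 of 345 sampled clients spent at least $100 during a visit
n, successes = 345, 120
p_hat = successes / n
z = NormalDist().inv_cdf(0.975)  # ~= t_{n-1, 0.025} when n is large
# Large-n simplification: S ~= sqrt(p_hat * (1 - p_hat))
moe = z * (p_hat * (1 - p_hat) / n) ** 0.5
print(round(p_hat - moe, 3), round(p_hat + moe, 3))  # roughly (0.298, 0.398)
```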
Increasing the level of confidence requires a larger margin of error, thus decreasing the accuracy.
Increasing the accuracy for a fixed confidence level
To keep the same CL (not modify q), but still reduce the MOE, need to reduce the variability of the estimator by:
increasing the sample size (reduces the MOE/ the interval) (what we control)
using a more sophisticated sampling technique (not this class)
Sample size calculation
The sample size that ensures the MOE does not exceed Emax at a 1-α level of confidence
Mean: n ≥ (z_{α/2} · σ̃ / E_max)²
Proportion: n ≥ (z_{α/2} · 0.5 / E_max)²
Write E_max as a decimal (0.01, not 1%) and always round n up.
σ̃ is the standard deviation from a previous study of the same variable in the same population,
OR the standard deviation from a preliminary study on a small sample.
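The two formulas above can be written directly as functions; the function names and the example inputs (σ̃ = 15, E_max = 2 or 0.03) are invented for illustration:

```python
import math
from statistics import NormalDist

def n_for_mean(sigma_tilde, e_max, conf=0.95):
    """Smallest n so the MOE for a mean does not exceed e_max."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil((z * sigma_tilde / e_max) ** 2)  # always round up

def n_for_proportion(e_max, conf=0.95):
    """Worst-case (p = 0.5) sample size for a proportion; e_max as a decimal."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return math.ceil((z * 0.5 / e_max) ** 2)

print(n_for_proportion(0.03))  # MOE of 3 percentage points → 1068
print(n_for_mean(15, 2))       # sigma_tilde = 15, MOE of 2 units → 217
```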
THEME 4
hypothesis testing
A hypothesis test is a statistical process that, based on a sample, chooses between two competing hypotheses about
one or more parameters (so about the population). It is statistical inference because it draws conclusions about
the population from a sample.
H 0 : Null hypothesis
o It usually represents the status quo, what is standard or normal.
o It is the “default”
o It is the claim that we will try to find evidence against. Try to reject
o The world as we expect it to be; what we expect.
o Always include the equality sign (=, ≤, ≥) in the null hypothesis.
H 1 : Alternative hypothesis
o It is usually the alternative we want to prove, and which will trigger an action or a change
o What we want to prove / what we test
o wrongly rejecting H 0 in favor of this hypothesis usually has the more serious consequences.
p-values
The p-value corresponds to the probability of observing, a priori, a sample with characteristics similar to or more
extreme than those collected from the real sample, assuming that H 0 is true for the population. In other words, it
is the probability of observing a sample like ours (or more extreme) by chance if H 0 is true. The test statistic
measures the distance between the observed outcome and what is expected under H 0.
If this probability is small, it would be surprising to observe a sample like ours, at least if H 0 effectively describes
the situation in the population. Most likely H 0 is false, and hence we should reject H 0 in favor of H 1.
If this probability is large, the observed sample is not unusual or surprising at all under the hypothesis that H 0 is
true. Therefore, there is no real evidence against the hypothesis, and thus we do not reject H 0.
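A sketch of a p-value calculation for a one-sample test of H0: μ = 50. The data are invented, and the Normal cdf is used as a large-sample stand-in for the t distribution (the standard library has no t cdf):

```python
from statistics import NormalDist, fmean, stdev

# Hypothetical sample; H0: mu = 50 vs H1: mu != 50
sample = [52.1, 48.3, 55.0, 51.7, 49.9, 53.4, 56.2, 50.8, 54.1, 52.6,
          47.5, 53.9, 51.2, 55.7, 49.0, 52.8, 54.6, 50.1, 53.2, 51.9]
mu0 = 50
n = len(sample)

# Test statistic: distance between the observed mean and mu0, in std-error units
t_obs = (fmean(sample) - mu0) / (stdev(sample) / n**0.5)

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value = 2 * (1 - NormalDist().cdf(abs(t_obs)))
print(round(t_obs, 2), round(p_value, 4))
```

Here the p-value is far below 0.05, so a sample like this one would be very surprising under H0, and we reject H0.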
A p-value is always between 0 and 1.
When the risk of committing a Type I error decreases, the risk of committing a Type II error increases, and vice versa.
prioritize control of the Type I error (because these errors are considered more harmful).
Power of a test
We cannot directly control the risk of a Type II error.
Risk of a Type II error = P(do not reject H 0 | H 1 is true)
Power = 1 − risk of a Type II error = P(reject H 0 | H 1 is true).
The higher the power, the better the test.
Generally, the larger the samples, the weaker the risk of a Type II error (for the same significance level α ) thus
the larger the power. (financial and time constraints for the sample size)
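The claim that power grows with sample size can be illustrated by a small Monte Carlo sketch (invented scenario: testing H0: μ = 0 when the true mean is 0.5 and σ = 1, using a large-sample z-test):

```python
import random
from statistics import NormalDist, fmean, stdev

def power(n, mu_true, mu0=0, sigma=1, alpha=0.05, reps=2000):
    """Monte Carlo power of a two-sided large-sample z-test of H0: mu = mu0."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        x = [random.gauss(mu_true, sigma) for _ in range(n)]
        z = (fmean(x) - mu0) / (stdev(x) / n**0.5)
        rejections += abs(z) > z_crit  # count correct rejections of H0
    return rejections / reps

random.seed(7)
print(power(25, 0.5), power(100, 0.5))  # power grows with n
```

With n = 25 the power is around 0.7; with n = 100 it is close to 1, which is why larger samples (when affordable) give a better test at the same α.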
The smaller the Type I error risk, the weaker the power of the test.
t-test statistic
Test to compare the means μ1 and μ2of two populations, based on two independent samples
Test on the mean μ of a single population
Conditions for the validity of test (for means)
When a test on one or two means is considered, either Condition 1 or 2 must be valid for the p-value (computed by
Excel) to be reliable:
1. The variable of interest has a Normal distribution in the population(s) under study;
OR
2. The sizes of the samples are sufficiently large (if true, the distribution is also approximately Normal because the Central Limit Theorem applies).
Variable types
Numerical: values measured on a scale.
• Discrete: the possible values are isolated.
Example: Number of clients, number of children
• Continuous: the possible values run over an interval or a collection of intervals.
Example: Time, profit, weight, size, energy consumption
Relationship btwn 2 numerical variables
Descriptive analysis:
Scatter plot
See if there’s a relationship between the 2 variables
Visualize the shape of the relationship (linear or other)
check if some values differ from the trend (inconsistent values or outliers)
Correlation coefficient (r)
measures the strength and direction of the linear relationship between two variables on a scale
from -1 to 1
the closer it is to 1 or -1, the stronger it is (intensity)
r=0: no linear relationship btwn the variables (NOT no relationship)
Sign indicates the direction of the relationship. Positive: the variables vary in the same
direction. Negative: they vary in opposite directions.
A high correlation coefficient (positive or negative) does not necessarily indicate a linear
relationship. r = 1 means a perfect positive linear relationship.
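The definition of r can be computed by hand; a short sketch on invented data showing the two perfect cases (r = 1 for a rising line, r = −1 for a falling line):

```python
from statistics import fmean

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [1, 2, 3, 4, 5]
print(round(pearson_r(x, [2 * v + 1 for v in x]), 3))   # 1.0: perfect rising line
print(round(pearson_r(x, [10 - 3 * v for v in x]), 3))  # -1.0: perfect falling line
```

Note that r only measures linear association: a perfect curve (e.g. y = x²) can still give an r far from ±1.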
Test for independence between variables:
Test of independence using the correlation coefficient
H 0: ρ = 0 Absence of linear relationship (ALWAYS)
H 1: ρ ≠ 0 Presence of linear relationship
For 2 categorical variables, the strength of the relationship can be measured by: Cramer’s contingency coefficient
It takes values from 0 to 1.
0 value means there is no relationship between the variables. Larger Cramer coefficients mean a stronger association
between the variables.
Correlation ≠ causation
Dependence ≠ causality (stork effect)
Confounding Factor
In order to establish a causal relationship, i.e. a statement of the form X is the cause of Y , we must make sure there is
no confounding variable Z which is associated with both X and Y and is making you believe that the factor X under study
is causing Y while in fact it isn’t.
A controlled experiment tells us whether there is causation or not.
THEME 6:
Analysis of Variance (ANOVA) – Single Factor
To compare the means of 2 or more independent samples
Does the factor or treatment have an impact on the variable of interest?
Are the means μ1 , μ 2 , … , μk for all k groups equal or is there some difference ?
Hypotheses in a Single Factor ANOVA:
H 0 : μ1 = μ2 = … = μk (no difference)
H 1 : at least one mean is different
Two sources of variability (the goal of an ANOVA is to decompose the variability of the data):
1. Variation due to error (the variability within groups) (unexplained)
2. Variation due to the factor/treatment (the variability between groups) (explained by the model)
ANOVA tables
Fisher’s test
F = (Mean variation due to factor) / (Mean variation due to error) = MSF / MSE
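A hand-computed sketch of the F statistic on three invented groups, decomposing the variability exactly as above (between-group MSF over within-group MSE):

```python
from statistics import fmean

def anova_f(groups):
    """Single-factor ANOVA F statistic: MSF / MSE, computed from the sums of squares."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = fmean([x for g in groups for x in g])
    # Between-group variation (due to the factor) and within-group variation (error)
    ssf = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    sse = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    msf, mse = ssf / (k - 1), sse / (n - k)
    return msf / mse

groups = [[20, 22, 19, 21], [25, 27, 26, 24], [20, 21, 22, 19]]  # hypothetical data
print(round(anova_f(groups), 2))  # → 20.0
```

A large F means the group means spread out far more than the within-group noise would explain, which is evidence against H0.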
Fisher’s test is valid if the following three conditions are met:
1. The underlying variables of interest (X i) follow a Normal distribution in all k populations, or the collected
samples are sufficiently large.
2. The k samples are independent.
3. The variances in the k populations are equal (σ1² = σ2² = … = σk² = σ²). (This condition can’t be controlled by
the researcher, so it limits the applicability of the test.)
Welch’s test
H 0 :μ 1=μ2=…=μk
H 1 : At least one mean is different
Conditions for the validity of the test:
1. At least one of the following conditions is met:
i. The underlying variables of interest (X i) follow a Normal distribution in all k populations.
OR
ii. The samples are sufficiently large.
For two to nine groups, all group sample sizes should be ≥ 15.
For 10 to 12 groups, all group sample sizes should be ≥ 20.
2. The k samples are independent.
Use the Excel file “ANOVA_Welch_Holm.xlsx”
Blind experiments
Blind experiment: if information is hidden from the participants until the end of the study to avoid bias in the
results.
In a double-blind experiment, both participants and researchers lack access to all the information. This reduces
bias (experimenters don’t know who got the drug or the placebo).
Multiple comparisons
Even rare events become likely when many cases are examined.
When H 0 is true, a test will rarely reject H 0. Indeed, the frequency of incorrect rejections (Type I Error) is set by the
significance level, e.g. α =5 % .
When performing many tests, we are at risk of finding effects that do not exist!
THEME 7.1:
Regression line (or regression equation)
The regression line is the line that best represents the linear relationship between two variables, as shown in the
scatterplot below.
Ordinary least squares (OLS) line = regression line: it’s the “best line”, the one that minimizes the sum of the
squared errors (“least squares”).
Use it to
1. to make predictions (We can use the value of one variable to predict the value of the other)
Predict a new (or future) value of Y
prediction in regression template: MOE is much smaller for a prediction of an average than for an
individual
2. To test for the presence of a linear relationship between variables (inference)
The variables
Y = β0 + β 1 X 1 + β 2 X 2 + ⋯+ β p X p +ε .
β 0 + β 1 X 1 + β 2 X 2 +⋯+ β p X p represents the average value of Y for population individuals corresponding to a given
set of values ( X 1 , X 2 , … , X p) .
ε is the difference between the Y -value of an individual and the average Y -value of population individuals with the
same characteristics ( X 1 , X 2 , … , X p) . It is the difference between a point of the scatterplot and the line:
observed Y − estimated Y.
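For one explanatory variable, the least-squares coefficients have a closed form; a sketch on invented data roughly following Y = 2X (the residuals of an OLS fit with an intercept sum to zero, which is one way to check the fit):

```python
from statistics import fmean

def ols_line(x, y):
    """Least-squares slope and intercept for a simple linear regression."""
    mx, my = fmean(x), fmean(y)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical data, roughly y = 2x
b0, b1 = ols_line(x, y)
print(round(b0, 3), round(b1, 3))
# The fitted residuals (observed Y - estimated Y) sum to ~0 for OLS with intercept
residuals = [b - (b0 + b1 * a) for a, b in zip(x, y)]
```

The Excel template fits the same coefficients (plus standard errors and p-values) for the multiple-variable case.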
Excel template “Regression.xlsx”.
Conclusion: At the level of significance 1%, the model is globally significant (Look in regression, result, p-value ANOVA )
What should you do if the regression is significant, but not useful?
In this case, a relationship between Y and the explanatory variables has been established, but it is relatively
weak ( R2 is small). This means that the explanatory variables explain variations in Y to some extent, but are not
sufficient to explain them well.
To improve the quality of predictions, it would be desirable to incorporate other explanatory variables to the
ones we already have in a new linear regression model.
At the 1% level of significance, the average wages of men and women differ
A ‘Significant’ beta parameter in front of the gender variable (sex) means that it is statistically too large to be
explained by chance alone
Control variables: possible/potential confounding variables (other variables that could explain differences in salary)
Controlling for confounders allows us to detect whether a variable has an effect on Y when other specific factors are
taken into account.
Only if the data come from a controlled randomized experiment does the analysis prove causality
An association does not necessarily mean causation
THEME 7.2:
How can we incorporate categorical variables?
Only numerical and indicator explanatory variables may be included. (non-binary) categorical variables should never be
included directly in a linear regression model.
3. Incorporate the categorical information (e.g. X 3 info) into the model using the indicator variables (e.g. X 32, X 33 ,
X 34 and X 35 ). This will allow you to compare each category against the reference category.
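A sketch of building those indicator variables: one 0/1 column per category, dropping a reference category so each coefficient compares a category against it (the variable name `degree`, the category labels, and the `one_hot` helper are invented for illustration):

```python
def one_hot(values, reference):
    """Indicator (dummy) columns for a categorical variable, dropping the
    reference category so each kept column compares against it."""
    levels = [v for v in sorted(set(values)) if v != reference]
    return {f"is_{lvl}": [1 if v == lvl else 0 for v in values] for lvl in levels}

degree = ["type1", "type2", "type3", "type1", "type2"]  # hypothetical column
dummies = one_hot(degree, reference="type1")
print(sorted(dummies))          # ['is_type2', 'is_type3']
print(dummies["is_type2"])      # [0, 1, 0, 0, 1]
```

Keeping one indicator per non-reference category (k−1 columns for k categories) also avoids the perfect multicollinearity that including all k columns would create.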
Confidence intervals for prediction 7.2.15
A. A confidence interval for the mean (e.g. average income of all female teachers with 25 months of experience and
with a degree of type 1):
B. A confidence interval for the salary of a specific person (e.g. Sophie):
o The interval for a specific person is therefore larger since it incorporates an additional source of
uncertainty.
- Predicting the value of Y for a future (individual) observation: a wider interval
Warning of prediction:
Don't extrapolate far beyond the observed range of the predictors X
Predictions should be limited to values of the explanatory variables (the X ) that are within (or close to) the range of
sample values used to fit the model. Example: If you have data ranging from 1900 to 2015, the predictions for 2016
or 2020 may be reliable, but the prediction for 2200 may not.
Don't extrapolate beyond the population
Predictions should be made only for units belonging to the population under study (from which the sample has been
selected). Example: The British teacher regression equation cannot be used to draw a conclusion for Deborah, a
Canadian teacher.
Multicollinearity: be cautious
There is multicollinearity when the explanatory variables in the model are highly correlated. (redundant info)
Multicollinearity tends to increase the uncertainty in the estimation of the model coefficients.
Thus, significant variables may lose their significance because of the additional uncertainty in estimating the model
parameters and, moreover, the interpretation of the coefficients may be distorted.
To avoid the multicollinearity problem when carrying out inference, it is recommended to remove from the model
the variable with the lowest correlation with the response variable Y among any pair of explanatory variables
with a strong correlation (e.g. |r|> 0.7).
Interaction
There is an interaction between the two explanatory variables X i and X j when the effect of X i on the response
variable Y differs according to the value of X j, and vice-versa. (valeur de l’une modifie le taux de variation de
l’autre
there is interaction between two explanatory variables when their effect is multiplicative, not only additive. Such
an effect is incorporated into a regression model by adding a new variable defined as the product
of the two original variables.
Adding an interaction term lets the fitted lines in the graph have different slopes.
The hypotheses are: H0: β_nb_stores*sex = 0 (the slope is the same) vs. H1: β_nb_stores*sex ≠ 0 (the slopes are different).
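Building the interaction column is literally a product of the two original columns; a tiny sketch with invented values for the `nb_stores` and `sex` variables used in the example above:

```python
# Interaction term: the product of the two original variables.
# Hypothetical columns: nb_stores (numerical) and sex (indicator, 1 = female).
nb_stores = [3, 5, 2, 8, 6]
sex = [1, 0, 1, 0, 1]
interaction = [a * b for a, b in zip(nb_stores, sex)]
print(interaction)  # [3, 0, 2, 0, 6]
```

This new column is then added to the regression alongside the two originals, and its coefficient is what the H0/H1 test above is about.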
Analysis of residuals
All tests and confidence intervals presented in Theme 7 (Parts 1 and 2) are valid only if the errors are
independent and follow the same Normal distribution with zero mean: ε ~ N(0, σ_ε).
The residuals are the differences between the sample points and what the model predicts they should be.
The conditions for validity (of the regression model) can be checked with tests (not covered in these course
notes) and plots of the residuals. This inspection is called analysis of residuals.
Before going any further, let us detail the assumptions formulated previously:
1. The error term ε has mean 0 (the relation actually is linear),
2. The variance of ε is constant over the individuals of the population (homoscedasticity = uniform
variance),
3. The errors ε are normally distributed,
4. The errors ε are independent (or uncorrelated).
1. Linearity
If the relation is not linear, all results are invalidated.
No problem is detected if the points are randomly scattered around the horizontal axis (or we do not see any particular
trend). (compare Residuals and X)
2. Constant variance (homoscedasticity)
The residuals must have a constant variance. In addition to the plots of residuals with respect to each X , we also look at
the residuals with respect to the predicted values of Y (compare residuals vs. predicted values).
3. Normality of residuals
We can check graphically that the residuals follow approximately a Normal distribution. The figure below is called a QQ-
plot. If the residuals are normal, it will resemble a line.
The lesson
Several models may be proposed to predict a characteristic (a variable).
Predictive models make errors.
A better model is one that leads to smaller errors.
It is not reasonable to assess the performance of a model by assessing how well it performs on the data that
were used to fit this model.
Cross-validation
To assess the performance of a model, you MUST use data that were not involved in the fitting. One option is:
1. To split (randomly) your data into two parts: training data and test data.
2. To use only the training data to fit the model.
3. To measure performance on the test data.
Training data ↔ Fitting the model (data tab)
Test data ↔ Measure performance (cross-validation tab)
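The three steps above in a minimal sketch: invented data roughly following Y = 2X + noise, a random 150/50 split, a simple OLS fit on the training part only, and RMSE measured on the held-out test part:

```python
import random
from statistics import fmean

def rmse(actual, predicted):
    """Root mean squared prediction error."""
    return fmean([(a - p) ** 2 for a, p in zip(actual, predicted)]) ** 0.5

random.seed(3)
x = [random.uniform(0, 10) for _ in range(200)]       # hypothetical predictor
y = [2 * v + random.gauss(0, 1) for v in x]           # y ~ 2x + noise (sd = 1)

# 1. Split the data (randomly) into training and test parts
idx = list(range(200))
random.shuffle(idx)
train, test = idx[:150], idx[150:]

# 2. Fit the model (here: simple OLS) on the training data only
mx = fmean([x[i] for i in train])
my = fmean([y[i] for i in train])
b1 = sum((x[i] - mx) * (y[i] - my) for i in train) / sum((x[i] - mx) ** 2 for i in train)
b0 = my - b1 * mx

# 3. Measure performance on the test data
pred = [b0 + b1 * x[i] for i in test]
print(round(rmse([y[i] for i in test], pred), 2))
```

Since the noise has standard deviation 1, the test RMSE comes out near 1; a test RMSE much larger than the training RMSE would signal overfitting.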
Interpret RMSE: on average, the model (including, for example, all variables) makes prediction errors of about $822.15.
Should we always use all variables? NO
We should:
o consider as many variables as possible,
o but then choose the best model using possibly only a subset of these variables.
Creating variables from existing ones
More generally, adding new variables that are powers or interaction terms of existing variables can boost the
performance of a model.
The all-subsets approach consists in fitting all the models and selecting the best one. However, the number of models
quickly gets out of control; so use:
Stepwise methods
If hesitating between 2 variables to include (grad or degree), compare the models with R²_adj, AIC or BIC (they take
into account the number of variables involved in the model).