Week 4


BDM 2053

Big Data Algorithms and Statistics


Weekly Course Objectives
● What is the t-distribution?
● Hypothesis testing.
○ What are p-values?
● How do we assess linear models?
○ MSE, MAE, F-tests
● Feature selection in linear models.
○ Backward selection
○ Forward selection
● Do some examples in Python!
History on Student’s t-distribution
● z-tables were always used until the early 1900s.
● However, assuming the normal distribution has one issue… on
small samples, we tend to get a lot of variation (less data, more
variability).
○ This was the case with Guinness. They needed to measure
the content of their barley in their beer, but taking too
much of a sample would mean they sell less (which was
tough enough during prohibition). If they took too little,
they didn’t have enough information on their estimates.
● William Gosset recognized this issue. The standard error (σ/√n)
is not a good measure when the sample is small, because in
practice we use s, the sample standard deviation, not σ, the
population standard deviation.
History on Student’s t-distribution cont.
● Say we had a standard deviation of 10. Depending on n, our
sample size, our standard error measure would change.
● When n ≥ 30, we see that this curve tends to flatten out. In
other words, to minimize our standard error we must
increase our sample size!
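A quick sketch of this in Python, using the σ = 10 from the slide, shows how the standard error shrinks as n grows:

```python
import math

sigma = 10  # population standard deviation from the slide's example

# Standard error sigma / sqrt(n) shrinks as the sample size grows
for n in [5, 10, 30, 100]:
    se = sigma / math.sqrt(n)
    print(f"n = {n:3d}  ->  standard error = {se:.3f}")
```

The curve of se against n flattens out past n ≈ 30, which is exactly the point made above.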
History on Student’s t-distribution cont.
● Size matters!... sample size that is. Bigger is better. It reduces
variability and therefore, when making inferences, tends to
reduce uncertainty.
● This led Gosset to release a paper in the early 1900s called
“The Application of the Law of Error to the Work of the
Brewery”, where he discusses a new distribution that is flatter
but still bell-shaped when compared to the normal
distribution.
● This allowed Gosset to account for smaller sample sizes, which
addressed this concern of bias in our sample mean test statistic.
In the figure, we can see that the distribution Gosset
devised accounts for the variability introduced by the
sample size: the smaller the sample, the more spread
out the distribution.
Recall that the area under the curve equals 1, so the
curve must be stretched and flattened out.
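The flattening can be checked numerically with scipy: the peak of the t density at 0 is lower for small degrees of freedom (a flatter, heavier-tailed curve) and rises toward the normal's peak as the degrees of freedom grow.

```python
from scipy.stats import norm, t

# Peak height of the density at 0: a lower peak means a flatter curve
for df in [1, 5, 30]:
    print(f"df = {df:2d}: t pdf at 0 = {t.pdf(0, df):.4f}")
print(f"normal pdf at 0 = {norm.pdf(0):.4f}")
```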
History on Student’s t-distribution cont.
● Gosset didn’t change the calculation of the test statistic, just
the distribution to model the statistic when the sample size
was small.
● Gosset didn’t want to publish his paper, but his paper still
appeared under the name Student.
○ Some say Guinness didn’t want him to because they’d then
be admitting there’s variability in their beer.
○ Others say Guinness didn’t want their competitors to know
they are using science.
○ Some say Gosset published secretly to get around the
prohibition rules.
● Originally, the tables were published under “Student’s z
Distribution” because they referenced the normal distribution.
● Later, another statistician, Fisher, suggested reducing the
sample size of each distribution by 1 to account for
uncertainty. This is called degrees of freedom.
● For this reason, the revised tables were called “Student’s t
Distribution”.
Degrees of Freedom
● “The number of degrees of freedom is the number of values in
the final calculation of a statistic that are free to vary.”
● What does this even mean?
○ Say we have 5 observations and we calculate a mean of 3.
○ If I assign x1=2, x2=1, x3=7 and x4=4, what does the last
value, x5 have to equal?
○ Since the sum of the 5 values must be 15 (so that 15/5 = 3,
the mean in this case), we get x5 = 15 - x1 - x2 - x3 - x4 = 1.
The fifth value has to equal 1, or else the mean won’t be 3.
○ The first 4 values have the freedom to take on any observed
values, but because the statistic is fixed, the fifth value is
therefore predetermined!
○ This is why the t-distribution took on n-1 degrees of
freedom when estimating the sample mean!
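The arithmetic in this example can be checked with a couple of lines of Python (values taken from the slide):

```python
n, mean = 5, 3
fixed = [2, 1, 7, 4]        # the four freely chosen values x1..x4
x5 = n * mean - sum(fixed)  # the fifth value is forced by the mean
print(x5)  # -> 1
```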
When do we use the z-test vs t-test?
● Put simply: use the z-test when the population standard
deviation σ is known or the sample is large (n ≥ 30); use the
t-test when σ is unknown and the sample is small.
t-test examples!
● A random sample of statistics students were asked to estimate
the total number of hours they spend watching Netflix in an
average week. The responses are recorded below. Use this
sample data to construct a 98% confidence interval for the
mean number of hours statistics students spend watching
Netflix in one week.
0 3 1 20 9 5 10 1 10 4 1 4 2 4 4 5
○ x̄ = 92/15 = 6.133
○ s = 5.514
○ d.o.f. = 15 - 1 = 14
○ α = 1 - 0.98 = 0.02
■ α/2 = 0.01
○ Therefore, t0.99,14 = 2.6245
○ (x̄ - t0.99,14 · s/√15 , x̄ + t0.99,14 · s/√15) = (2.3964, 9.8701)
○ Fun fact! The mean is in the middle of the confidence interval.
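This interval can be reproduced with scipy's t distribution, working from the summary statistics on the slide (x̄ = 6.133, s = 5.514, n = 15):

```python
import math

from scipy.stats import t

xbar, s, n = 6.133, 5.514, 15        # summary statistics from the slide
alpha = 0.02                          # 98% confidence
t_crit = t.ppf(1 - alpha / 2, n - 1)  # t_{0.99, 14}
margin = t_crit * s / math.sqrt(n)
print(f"t critical = {t_crit:.4f}")
print(f"CI = ({xbar - margin:.4f}, {xbar + margin:.4f})")
```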
Hypothesis testing
● A hypothesis is a claim that can be tested, specifically through
data.
○ “The mean salary of graduate students is $80,000” is a
hypothesis, because we have data to test this claim.
○ “The United States would do better under a Bernie Sanders
administration than under Trump” is not a hypothesis, but more
of an opinion. We do not have data to test this!
● Hypothesis testing typically has a null hypothesis and an
alternative hypothesis.
○ The null hypothesis is what we claim to be true, or what we
believe is the reality, until there is evidence against our
claim. Denoted H0.
○ The alternative hypothesis is the opposite of our null
hypothesis. Denoted HA.
Types of hypothesis tests
● There are 2 main families of hypothesis tests
○ 1-tail
○ 2-tail
● A 1-tailed test is a statistical hypothesis test set up to show that
the sample mean would be higher or lower than the population
mean, but not both.
● A 2-tailed test is a statistical hypothesis test set up to show that
the sample mean could be higher or lower than the population
mean, but not one exclusively.
● To reject or fail to reject the null hypothesis, we must look
at the alternative hypothesis.
○ The alternative hypothesis is what formulates the rejection
region.
● To reject or fail to reject, we must make reference to some
region denoted by α, which gives rise to a rejection region.
This must be specified before your experiment.
Types of hypothesis tests cont.

[Figure: one-sided t-test vs. two-sided t-test rejection regions]

● If our test statistic is at least as extreme as the critical value, falling
within the region more “in favor” of the alternative, then we must
reject the null.
Hypothesis testing example
● It is believed that a bag of Lays is actually only 60% air, and
the rest of the bag is chips. A factory worker states that this is
no longer true and that Lays does not make bags that are 60%
air. What are the null and alternative hypotheses?
○ H0: µ = 0.6 The mean air content in a bag of Lays is equal to 60%.
○ HA: µ ≠ 0.6 The mean air content in a bag of Lays is not equal to 60%.

● Atinder thought he made a fair lab that would take less
than 40 minutes to complete. What are the null and alternative
hypotheses?
○ H0: µ < 40 The mean completion time of the lab is less than 40 minutes.
○ HA: µ ≥ 40 The mean completion time of the lab is greater than or equal to 40 minutes.
● We only reject or fail to reject (not accept) the null.
● We might intuitively understand hypothesis testing, but how
do we actually test it?
The notorious p-values
● You may have heard of p-values… they aren’t simply
probabilities, but rather probabilities of observing specific
events.
● More precisely, a p-value is the probability of observing an
event at least as extreme as the one we observed, assuming
that the null hypothesis is true.
○ … what does that even mean :S ?
● All this means is that, if we had some data and gathered some
statistics (mean, standard deviation), we can then use this
information to determine how “extreme” our hypothesis is,
assuming it’s the truth.
○ If our null hypothesis is off, the probability will be very
small! But how small is small… this is where we reflect
back on our rejection level α
p-value example
● People were asked, ”What do you think is the ideal number of
children for a family to have?”. A popular blog reports that
women want 3 kids. A sample of 50 female respondents had a mean of
3.22 and a standard deviation of 1.99. Test this hypothesis at α =
0.05.
○ H0: µ = 3
○ HA: µ ≠ 3
● Since the sample here is large, we can assume it comes from
either a normal distribution or t-distribution. Let’s assume a
t-distribution for now…
○ x̄ = 3.22, s = 1.99,
○ n = 50, d.o.f. = 50-1 = 49
○ α = 0.05
■ α/2 is needed here because it goes in both directions of
the distribution!
p-value example cont.
● If we want to test the null hypothesis, we let µ = 3 and compute
the probability of this event. Therefore we get the following:
● t49 = (x̄ - µ0)/(s/√n) = (3.22 - 3)/(1.99/√50) = 0.782

[Figure: the test statistic lands between the two rejection regions;
anything outside these regions is considered rare.]

● What is the probability of getting a value that is at least as
extreme as the test statistic? aka…
● P(T > t49) = 1 - P(T < t49) ≈ 0.22
○ Since we only reject if this probability falls inside the rejection
region of 2.5%, we fail to reject the null hypothesis!
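The same calculation in scipy, from the summary statistics on the slide:

```python
import math

from scipy.stats import t

xbar, s, n, mu0 = 3.22, 1.99, 50, 3
t_stat = (xbar - mu0) / (s / math.sqrt(n))  # test statistic with 49 d.o.f.
p_one_sided = t.sf(t_stat, n - 1)           # P(T > t_49), upper-tail probability
print(f"t = {t_stat:.3f}, P(T > t) = {p_one_sided:.3f}")
```

Since the upper-tail probability is far larger than the 2.5% rejection region, we fail to reject.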
Let’s assume a more wild case…
● Say we wanted to test the null hypothesis that the population
mean is 5. Very extreme in hindsight! But this will help us get
the idea.
● Recalculating our t statistic yields:
○ t49 = (x̄ - µ0)/(s/√n) = (3.22 - 5)/(1.99/√50) = -6.324

[Figure: the test statistic lands WAY out in the left tail, well inside the
rejection region; anything outside these regions is considered rare.]

● Again, finding the probability of this event:
● P(T < t49) ≈ 0
● The probability of observing this test statistic, or something
more extreme, assuming µ0 = 5, would be ≈ 0, aka very rare! REJECT!
One more example.
● Six coins of the same type are discovered at an archaeological
site. Archaeologists believe that their mean weight is more than
5.25 g, indicating that they are from another provenance. The
coins are weighed and have mean 4.73 g with sample standard
deviation 0.18 g. Perform the relevant test at the 2% level of
significance, assuming a normal distribution for the weights of all
such coins.
● H0: µ > 5.25 vs. HA: µ ≤ 5.25
● x̄ = 4.73, s = 0.18
● n = 6, d.o.f. = 6-1 = 5
● α = 0.02
● Since the sample size is small, and we do not know the
population standard deviation, we must use a t-distribution!
One more example.
● t5 = (x̄ - µ0)/(s/√n) = (4.73 - 5.25)/(0.18/√6) = -7.0763

[Figure: the test statistic lands WAY out in the left tail. Very extreme!]

● You can likely imagine that the probability of observing this
event is EXTREMELY small.
● P(T ≤ t5) ≈ 0. So we REJECT the null hypothesis in favor of the
alternative hypothesis.
● Alternatively, once we have our test statistic, we can
determine what the critical value needs to be such that we
reject.
○ t0.02,5 = -2.7565. Since our test statistic is lower, REJECT.
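Both routes, the p-value and the critical value, can be checked with scipy using the slide's summary statistics:

```python
import math

from scipy.stats import t

xbar, s, n, mu0 = 4.73, 0.18, 6, 5.25
t_stat = (xbar - mu0) / (s / math.sqrt(n))  # test statistic with 5 d.o.f.
p_lower = t.cdf(t_stat, n - 1)              # P(T <= t_5), lower-tail p-value
t_crit = t.ppf(0.02, n - 1)                 # rejection cutoff at the 2% level
print(f"t = {t_stat:.4f}, p = {p_lower:.6f}, cutoff = {t_crit:.4f}")
```

The test statistic is far below the cutoff, so both routes reject.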
BREAK!
Other measures to assess linear regression
● We left off fitting a linear regression model, finding the Least
Squares estimation and assessing our model using R2.
● What other measures can we use to assess linear regression
models?
● The first is mean squared error (MSE): literally the average
of the squared residuals (errors).
MSE = (1/n) Σ (yi - ŷi)²
● The second is mean absolute error (MAE): literally the
average absolute value of the residuals (errors).
MAE = (1/n) Σ |yi - ŷi|
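Both measures take a few lines of numpy (the y values here are hypothetical, just to show the formulas):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # hypothetical observed values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # hypothetical model predictions

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)   # average squared residual
mae = np.mean(np.abs(residuals))  # average absolute residual
print(mse, mae)  # -> 0.875 0.75
```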
● We’ll look at this top down, starting with the F-test…
○ Top-down meaning, start at the broadest test, and then
narrow down our methods.
F-tests
● Without involving the statistical theory, all we need to know is
that the F-test is much like a z-test or a t-test - it’s another
statistic from a distribution.
● However, this distribution tests if all the coefficients of the
linear regression model are 0. It’s a hypothesis test!
○ H0: β1 = β2 = β3 = β4 = … = βp = 0
○ HA: at least one βi ≠ 0
● In English, the null hypothesis states that your model is better
off having no variables (just use the average value of your
dependent variable to predict). The alternative is that at least one
coefficient is not equal to 0.
● The p-value for the F-test is assuming the null is true. If it’s less
than 5%, we REJECT the null hypothesis and have evidence
that having at least one of the coefficients is beneficial to our
model!
Individual t-tests for coefficients!
● Some of you may have spotted that our coefficients had t-value
statistics associated with them. What does this mean?...
● Since our model is based on our data, we ultimately come up
with a test statistic for our coefficients in the process!
● Our hypothesis testing for a simple linear regression model for
β1 would be:
○ H0: β1 = 0
○ HA: β1 ≠ 0
○ This means that the population coefficient for the first
independent variable equals 0 is our null hypothesis. A
p-value less than 5% indicates we reject our null
hypothesis.
Variable / Feature selection for linear models
● We have now established that the p-values for the coefficients
are a good way of deciding whether a coefficient should be
removed from our model. But what do we do now?
● Backward Selection is a process where we start with every
variable in the model and remove one variable at a time until a
stopping rule (like a p-value threshold, adjusted R², etc.) is hit,
or until no more can be removed.
● Forward Selection is a process where we start with no
variables in the model and add one variable at a time based on a
stopping rule (greatest improvement to R², greatest reduction
in MSE, etc.) until no more can be added.
● Stepwise Selection is a process where you add a variable,
evaluate, then try backward selection, and then repeat by
adding another variable, trying backward selection, etc.
Challenges with step selections
● Any of the previously described methods comes with
potential challenges…
● If you have a lot of variables and a lot of data, refitting all
these models can be very computationally expensive, so you
may have to be less strict with your stopping rule… which can
yield poorer model performance.
● A variable dropped or not considered to be added might be
better to keep at a later step.
○ Say you did forward selection on 10 variables, and
currently your selection stops with Var1, Var4 and
Var6. Maybe Var8 is worth adding and would do a good job of
helping your model, but only after you add a less
significant variable like Var5.
Recommended Strategy

1) Evaluate by dropping or adding one variable at a time.
2) Assess the p-values.
○ Drop the variable with the highest p-value, or add the
variable with the lowest.
3) Remove / add the recommended variable.
4) Compare your model to the previous one. Keep an eye on MSE!
Let’s do examples in Python!
Thank You
