AP SHAH ADS Notes Mod 2 p2


Subject: Applied Data Science Semester: VIII

Hypothesis testing

Hypothesis testing helps in data analysis by providing a way to make inferences about a
population based on a sample of data. It allows analysts to decide whether to reject a
given assumption or hypothesis about the population based on the evidence provided by
the sample data. For example, hypothesis testing can be used to determine whether a
sample mean is significantly different from a hypothesized population mean, or whether a
sample proportion is significantly different from a hypothesized population proportion.

In statistical analysis, hypothesis testing is used to make inferences about a population
based on a sample of data.

In machine learning, hypothesis testing is used to evaluate the performance of a model
and determine the significance of its parameters. For example, a t-test or z-test can be
used to compare the means of two groups of data to determine if there is a significant
difference between them. This information can then be used to improve the model, or
select the best set of features.

Definition of Hypothesis Testing

A hypothesis is a statement, assumption or claim about the value of a population parameter
(mean, variance, median etc.).
A hypothesis is an educated guess about something in the world around you. It should be
testable, either by experiment or observation.

For example, suppose we make the statement “Dhoni is the best Indian captain ever.” This is
an assumption that we are making based on the average wins and losses the team had under
his captaincy. We can test this statement using all the match data.

Null and Alternative Hypothesis Testing

The null hypothesis is the hypothesis to be tested for possible rejection under the
assumption that it is true. The concept of the null is similar to “innocent until proven guilty”:
we assume innocence until we have enough evidence to prove that a suspect is guilty.

Prof.Ramya R B Dept.of Computer Engineering



In simple language, we can understand the null hypothesis as an already accepted
statement. For example, “The sky is blue.” We already accept this statement.

It is denoted by H0.
The alternative hypothesis complements the Null hypothesis. It is the opposite of the null
hypothesis such that both Alternate and null hypothesis together cover all the possible
values of the population parameter.

It is denoted by H1.
Let’s understand this with an example:

A soap company claims that its product kills, on average, 99% of the germs.

Brands such as Lifebuoy make similar claims, for example that the soap kills 99.9% of germs.
How can they say so? There has to be a testing technique to check such a claim. Hypothesis
testing is used to test a claim or an assumption.

To test the claim of this company we will formulate the null and alternate hypothesis.

Null Hypothesis(H0): Average =99%

Alternate Hypothesis(H1): Average is not equal to 99%.

Note: When we test a hypothesis, we assume the null hypothesis to be true until there is
sufficient evidence in the sample to prove it false. In that case, we reject the
null hypothesis and support the alternate hypothesis. If the sample fails to provide
sufficient evidence for us to reject the null hypothesis, we cannot say that the null
hypothesis is true because it is based on just the sample data. For saying the null
hypothesis is true we will have to study the whole population data.

Simple and Composite Hypothesis Testing

When a hypothesis specifies an exact value of the parameter, it is a simple hypothesis and
if it specifies a range of values then it is called a composite hypothesis.

e.g. A motorcycle company claiming that a certain model gives an average mileage of
100 km per litre is a case of a simple hypothesis.

The statement that the average age of students in a class is greater than 20 is a composite
hypothesis.

One-tailed and two-tailed Hypothesis Testing

If the alternate hypothesis gives the alternate in both directions (less than and greater than)
of the value of the parameter specified in the null hypothesis, it is called a Two-tailed test.

If the alternate hypothesis gives the alternate in only one direction (either less than or
greater than) of the value of the parameter specified in the null hypothesis, it is called
a One-tailed test.

e.g. if H0: mean = 100 and H1: mean ≠ 100

Here, according to H1, the mean can be greater than or less than 100. This is an example of a
two-tailed test.

Similarly, if H0: mean ≥ 100, then H1: mean < 100.

Here, H1 allows the mean to differ in one direction only (less than 100). It is called a one-tailed test.

Critical Region

The critical region is that region in the sample space in which if the calculated value lies
then we reject the null hypothesis.

Let’s understand this with an example:

Suppose you are looking to rent an apartment. You list all the available apartments
from different real estate websites. You have a budget of Rs. 15,000/month and cannot
spend more than that. The apartments on your list have prices ranging from
Rs. 7,000/month to Rs. 30,000/month.

You select a random apartment from the list and assume below hypothesis:

H0: You will rent the apartment.

H1: You won’t rent the apartment.


Now, since your budget is 15000, you have to reject all the apartments above that price.

Here all the Prices greater than 15000 become your critical region. If the random
apartment’s price lies in this region, you have to reject your null hypothesis and if the
random apartment’s price doesn’t lie in this region, you do not reject your null hypothesis.

The critical region lies in one tail or two tails on the probability distribution curve according
to the alternative hypothesis. The critical region is a pre-defined area corresponding to a
cut-off value in the probability distribution curve. The size (area) of the critical region is
denoted by α.

Critical values are the values separating the region where we reject the null hypothesis from
the region where we do not, and they are calculated on the basis of α.

We will see more examples later on, and it will become clear how we choose α.

Based on the alternative hypothesis, three cases of critical region arise:

Case 1) H1: µ ≠ µ0. The critical region is split between both tails. This is a two-tailed
(double-tailed) test.

Case 2) H1: µ < µ0. The critical region lies in the left tail. This scenario is also called a
left-tailed test.

Case 3) H1: µ > µ0. The critical region lies in the right tail. This scenario is also called a
right-tailed test.
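
As a rough illustration of how critical values follow from α in the three cases, the sketch below uses the standard normal distribution from Python's standard library. The function name critical_values and the tail labels "two", "left" and "right" are assumptions made for this example, not part of the notes.

```python
from statistics import NormalDist


def critical_values(alpha, tail):
    """Return the critical z value(s) for a significance level alpha.

    tail is "two", "left" or "right" (labels chosen for this sketch).
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        # alpha is split equally between the two tails
        c = std_normal.inv_cdf(1 - alpha / 2)
        return (-c, c)
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)
    raise ValueError("tail must be 'two', 'left' or 'right'")


two = critical_values(0.05, "two")      # roughly (-1.96, 1.96)
left = critical_values(0.05, "left")    # roughly (-1.645,)
right = critical_values(0.05, "right")  # roughly (1.645,)
print(two, left, right)
```

A test statistic falling beyond these cut-offs lies in the critical region, so the null hypothesis would be rejected at that α.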

Type I and Type II Error

Type I and Type II errors are among the most important topics in hypothesis testing. Let’s
simplify the topic by breaking it down into smaller parts.

A false positive (type I error): when you reject a true null hypothesis.

A false negative (type II error): when you fail to reject a false null hypothesis.

The probability of committing a Type I error (false positive) is equal to the significance level,
or size of the critical region, α.

α = P [rejecting H0 when H0 is true]

The probability of committing a Type II error (false negative) is denoted by β. The quantity
1 − β is called the ‘power of the test’.

β = P [not rejecting H0 when H1 is true]
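
To make α, β and power concrete, here is a minimal sketch for a right-tailed z-test. All numbers (µ0 = 100, a true mean of 105, σ = 15, n = 36) are hypothetical values chosen for illustration, and the function name is an assumption of this example.

```python
from math import sqrt
from statistics import NormalDist


def right_tailed_z_power(mu0, mu1, sigma, n, alpha=0.05):
    """Return (beta, power) for a right-tailed z-test of H0: mu = mu0
    when the true mean is actually mu1 (hypothetical scenario)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)     # reject H0 when z > z_crit
    shift = (mu1 - mu0) / (sigma / sqrt(n))    # true mean measured in standard errors
    beta = std_normal.cdf(z_crit - shift)      # P[not rejecting H0 | H1 is true]
    return beta, 1 - beta


beta, power = right_tailed_z_power(mu0=100, mu1=105, sigma=15, n=36)
print(f"beta = {beta:.3f}, power = {power:.3f}")
```

Lowering α moves the cut-off further into the tail, which reduces Type I errors but increases β for the same sample size.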


Let’s take another example to understand.

A person is on trial for a criminal offense, and the judge needs to provide a verdict on his
case. Now, there are four possible combinations in such a case:

• First Case: The person is innocent, and the judge identifies the person as innocent
• Second Case: The person is innocent, and the judge identifies the person as guilty
• Third Case: The person is guilty, and the judge identifies the person as innocent
• Fourth Case: The person is guilty, and the judge identifies the person as guilty

Here

H0: Person is innocent

H1: Person is guilty


As you can clearly see, there can be two types of error in the judgment:

A Type I error occurs if the judge convicts the person [rejects H0] although the person is
innocent [H0 is true].

A Type II error occurs if the judge releases the person [does not reject H0] although
the person is guilty [H1 is true].

According to the presumption of innocence, the person is considered innocent until proven
guilty. We consider the null hypothesis to be true until we find strong evidence against
it. Only then do we reject it in favor of the alternate hypothesis. That means the judge must
find evidence which convinces him “beyond a reasonable doubt.” This notion
of “beyond a reasonable doubt” can be understood as the significance level (⍺), i.e.
P(judge decides guilty | person is innocent) should be small. Thus, if ⍺ is smaller, it
will require more evidence to reject the null hypothesis.

The basic concepts of Hypothesis Testing are actually quite analogous to this situation.

Steps to Perform Hypothesis Testing

There are four steps to performing Hypothesis Testing:

1. Set the Null and Alternate Hypotheses
2. Set the Significance Level, Criteria for a decision
3. Compute the test statistic
4. Make a decision

It must be noted that z-tests and t-tests are parametric tests, which means that the null
hypothesis is about a population parameter being less than, greater than, or equal to
some value. Steps 1 to 3 are quite self-explanatory, but on what basis can we make a
decision in step 4? The decision is based on the p-value. What does this p-value indicate?

We can understand this p-value as the measurement of the Defense Attorney’s argument.
If the p-value is less than ⍺ , we reject the Null Hypothesis, and if the p-value is greater
than ⍺, we fail to reject the Null Hypothesis.

Level of significance(α)

The significance level, in the simplest of terms, is the threshold probability of incorrectly
rejecting the null hypothesis when it is in fact true. This is also known as the type I error
rate. It is also the size of the critical region.

Generally, strong control of α is desired, and in tests it is fixed in advance at very low
levels like 0.05 (5%) or 0.01 (1%).

If H0 is not rejected at a significance level of 5%, it means only that the sample does not
provide sufficient evidence against H0 at that level. It does not mean that the null hypothesis
is true with 95% assurance.

The p-value is the smallest level of significance at which a null hypothesis can be
rejected.

Decision making with p-value

We compare p-value to significance level(alpha) for taking a decision on Null Hypothesis.

If p-value is greater than alpha, we do not reject the null hypothesis.

If p-value is smaller than alpha, we reject the null hypothesis.
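
The decision rule above can be sketched as a small helper. The function name decide and the returned strings are illustrative assumptions; a p-value exactly equal to alpha is treated here as "fail to reject".

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the significance level and return the decision."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.001))        # p-value below alpha
print(decide(0.30))         # p-value above alpha
print(decide(0.03, 0.01))   # a stricter alpha changes the decision
```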


p-value
To understand the p-value, we will use the normal distribution:

p-value is the cumulative probability (area under the curve) of the values to the right of the red
point in the figure above.
Or,

p-value corresponding to the red point tells us about the ‘total probability’ of getting any value
to the right hand side of the red point, when the values are picked randomly from the population
distribution.
A large p-value implies that sample scores are more aligned or similar to the population score.

Alpha value is nothing but a threshold p-value, which the group conducting the
test/experiment decides upon before conducting a test of similarity or significance ( Z-test or a
T-test).

Consider the above normal distribution again. The red point in this distribution represents the
alpha value or the threshold p-value. Now, let’s say that the green and orange points represent
different sample results obtained after an experiment.


We can see in the plot that the leftmost green point has a p-value greater than the alpha. As a
result, these values can be obtained with fairly high probability, and the sample results are
attributed to luck (chance).

The point on the rightmost side (orange) has a p-value less than the alpha value (red). As a
result, the sample results are a rare outcome and very unlikely to be due to luck. Therefore, they
are considered significantly different from the population.

The alpha value is decided depending on the test being performed. An alpha value of 0.05 is
considered a good convention if we are not sure of what value to consider.

Let’s look at the relationship between the alpha value and the p-value closely.

p-value < alpha


Consider the following population distribution:

Here, the red point represents the alpha value. This is basically the threshold p-value. We can
clearly see that the area under the curve to the right of the threshold is very low.

The orange point represents the p-value computed from the sample. In this case, we can
clearly see that the p-value is less than the alpha value (the area to the right of the red point is
larger than the area to the right of the orange point). This can be interpreted as:

The results obtained from the sample are at an extremity of the population distribution (an
extremely rare event), and hence there is a good chance that they belong to some other
distribution (as shown below).


Considering our definitions of alpha and the p-value, we consider the sample results obtained
as significantly different. We can clearly see that the p-value is far less than the alpha value.

p-value > alpha:


A p-value greater than the alpha means that the results are consistent with the null hypothesis and
therefore we fail to reject it. This result does not support the alternate hypothesis (that the obtained
results are from another distribution): the results obtained are not significant and can be explained
as a matter of chance or luck.

Again, consider the same population distribution curve with the red point as alpha and the
orange point as the calculated p-value from the sample:

So, p-value > alpha (considering the area under the curve to the right-hand side of the red and
the orange points) can be interpreted as follows:

The sample results are not rare enough under the population distribution and are very
likely to have been obtained by luck.

We can clearly see that the area under the population curve to the right of the orange point is
much larger than the alpha value. This means that the obtained results are more likely to be
part of the same population distribution than being a part of some other distribution.

Example of p-value in Statistics

In the National Academy of Archery, the head coach intends to improve the performance of
the archers ahead of an upcoming competition. What do you think is a good way to improve
the performance of the archers?


He proposed and implemented the idea that breathing exercises and meditation before the
competition could help. The statistics before and after experiments are below:

Interesting. The results favor the assumption that the overall score of the archers improved. But
the coach wants to make sure that these results are because of the improved ability of the
archers and not by luck or chance. So what do you think we should do?

This is a classic example of a similarity test (Z-test in this case) where we want to check
whether the sample is similar to the population or not. In order to solve this, we will follow a
step-by-step approach:

1. Understand the information given and form the alternate and null hypothesis
2. Calculate the Z-score and find the area under the curve
3. Calculate the corresponding p-value
4. Compare the p-value and the alpha value
5. Interpret the final results

Step 1: Understand the given information

• Population Mean = 74
• Population Standard Deviation = 8
• Sample Mean = 78
• Sample Size = 60

We have the population mean and standard deviation with us and the sample size is over
30, which means we will be using the Z-test.

According to the problem above, there can be two possible conditions:


1. The after-experiment results are a matter of luck, i.e. mean before and after experiment
are similar. This will be our “Null Hypothesis”
2. The after-experiment results are indeed very different from the pre-experiment ones.
This will be our “Alternate Hypothesis”

Step 2: Calculating the Z-Score

On plugging the corresponding values into $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{78 - 74}{8 / \sqrt{60}}$, the Z-score comes out to be 3.87.

Step 3: Referring to the Z-table and finding the p-value:


If we look up the Z-table for 3.87, we get a value of ~0.999. This is the area under the curve
or probability under the population distribution. But this is the probability of what?

The probability that we obtained is to the left of the Z-score (Red Point) which we calculated.
The value 0.999 represents the “total probability” of getting a result “less than the sample
score 78”, with respect to the population.

Here, the red point signifies where the sample mean lies with respect to the population
distribution. But we have studied earlier that p value is to the right-hand side of the red point,
so what do we do?

For this, we will use the fact that the total area under the normal Z distribution is
1. Therefore the area to the right of Z-score (or p-value represented by the unshaded region)
can be calculated as:

p-value = 1 – 0.999= 0.001

0.001 (p-value) is the unshaded area to the right of the red point. The value 0.001 represents
the “total probability” of getting a result “greater than the sample score 78”, with respect to
the population.


Step 4: Comparing p-value and alpha value:

We were not given any value for alpha, therefore we can consider alpha = 0.05. According to
our understanding, if the likelihood of obtaining the sample result (the p-value) is less than the alpha
value, we consider the sample results obtained as significantly different.

We can clearly see that the p-value is far less than the alpha value:

0.001 (red region) << 0.05 (orange region)

This says that obtaining a mean of 78 is a rare event with respect to the
population distribution. Therefore, it is reasonable to conclude that the increase in the
performance of the archers in the sample is not the result of luck. The sample
appears to belong to some other (better, in this case) distribution.
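
The archery calculation can be reproduced with a short sketch using only Python's standard library. The variable names are chosen for this example; the numbers are the ones given in Step 1. Note that computing the tail area directly gives a p-value on the order of 0.00005, which is smaller than the 0.001 obtained from the rounded table value 0.999; the conclusion is the same either way.

```python
from math import sqrt
from statistics import NormalDist

# Values from Step 1 of the example
pop_mean, pop_sd = 74, 8
sample_mean, n = 78, 60

# Step 2: z-score of the sample mean
z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))

# Step 3: right-tail area beyond z under the standard normal curve
p_value = 1 - NormalDist().cdf(z)

# Step 4: compare with alpha
alpha = 0.05
reject_h0 = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.6f}, reject H0: {reject_h0}")
```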


Box plot

A box and whisker plot, also called a box plot, displays the five-number summary of
a set of data. The five-number summary is the minimum, first quartile, median, third
quartile, and maximum.

In a box plot, we draw a box from the first quartile to the third quartile. A vertical line
goes through the box at the median. The whiskers go from the first quartile to the
minimum and from the third quartile to the maximum.

Minimum: The minimum value in the given dataset.
First Quartile (Q1): The first quartile is the median of the lower half of the data set.
Median: The median is the middle value of the dataset, which divides the given
dataset into two equal parts. The median is considered as the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half of the data.
Maximum: The maximum value in the given dataset.

Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile and first quartile
is known as the interquartile range, i.e. IQR = Q3 − Q1.
Outlier: Data that falls on the far left or right side of the ordered data is tested as a
possible outlier. Generally, outliers fall more than a specified distance from the
first and third quartiles, i.e. outliers are greater than Q3 + 1.5 × IQR or less than
Q1 − 1.5 × IQR.

Applications
It is used to know:

• The outliers and their values
• Symmetry of data
• Tight grouping of data
• Data skewness – if, in which direction and how.
o Positively Skewed: If the distance from the median to the maximum is
greater than the distance from the median to the minimum, then the box
plot is positively skewed.
o Negatively Skewed: If the distance from the median to minimum is
greater than the distance from the median to the maximum, then the box
plot is negatively skewed.
o Symmetric: The box plot is said to be symmetric if the median is
equidistant from the maximum and minimum values.

Example:
3, 7, 8, 5, 12, 14, 21, 15, 18, 50
Step 1: Sort the values
3, 5, 7, 8, 12, 14, 15, 18, 21, 50
Step 2: Find the median.
Q2=13
Step 3: Find the quartiles.
First quartile, Q1 = data value at position (N + 2)/4=12/4=3rd position
Third quartile, Q3 = data value at position (3N + 2)/4=8th position
Q1=7
Q3=18
Step 4: Complete the five-number summary by finding the min and the max.

The min is the smallest data point, which is 3.

The max is the largest data point, which is 50.

The five-number summary is 3,7,13,18,50.

Here IQR = Q3 − Q1 = 18 − 7 = 11

Any point beyond Q3 + 1.5 IQR (18 + 1.5 × 11 = 18 + 16.5 = 34.5) or Q1 − 1.5 IQR (7 − 1.5 × 11 = −9.5) is
considered an outlier. Here, 50 is greater than 34.5, so it is an outlier.
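
The five-number summary and the outlier fences for the sorted values of this example can be checked with a short sketch. It uses the "median of each half" definition of the quartiles given earlier; for an odd number of values it leaves the median out of both halves, which is one common convention and an assumption of this sketch.

```python
from statistics import median

data = sorted([3, 5, 7, 8, 12, 14, 15, 18, 21, 50])
n = len(data)
half = n // 2

# Lower and upper halves of the sorted data
lower = data[:half]
upper = data[half:] if n % 2 == 0 else data[half + 1:]

q1, q2, q3 = median(lower), median(data), median(upper)
iqr = q3 - q1

# Fences: points outside [low_fence, high_fence] are flagged as outliers
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

five_number_summary = (data[0], q1, q2, q3, data[-1])
print(five_number_summary, iqr, (low_fence, high_fence), outliers)
```

Different quartile conventions (for example, interpolation-based ones used by some libraries) can give slightly different Q1 and Q3 for the same data.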


So the box plot excluding outliers is as follows:

Box plot generated using matplotlib:


Scatter plot
Scatter plots are graphs that present the relationship between two variables in a
data set. They represent data points on a two-dimensional plane or Cartesian
system. The independent variable or attribute is plotted on the X-axis, while the
dependent variable is plotted on the Y-axis.
A scatter plot is also called a scatter chart, scattergram, scatter graph, scatter diagram,
or XY graph. The scatter diagram graphs numerical data pairs, with one variable on each
axis, to show their relationship.
Scatter plots are used in any of the following situations:

• When we have paired numerical data
• When there are multiple values of the dependent variable for a unique value of
an independent variable
• In determining the relationship between variables in some scenarios, such as
identifying potential root causes of problems, checking whether two products
that appear to be related both occur with the same cause, and so on.

The line drawn in a scatter plot that lies close to almost all the points in the plot is
known as the “line of best fit” or “trend line”.

Types of correlation
The scatter plot explains the correlation between two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –

1. Positive Correlation
2. Negative Correlation
3. No Correlation

Positive Correlation
A scatter plot with increasing values of both variables can be said to have a positive
correlation. Now positive correlation can further be classified into three categories:

• Perfect Positive – The points form a perfectly straight line
• High Positive – The points lie close to one another
• Low Positive – The points are widely scattered

Negative Correlation
A scatter plot with an increasing value of one variable and a decreasing value for
another variable can be said to have a negative correlation. These are also of three
types:

• Perfect Negative – The points form a perfectly straight line
• High Negative – The points lie close to one another
• Low Negative – The points are widely scattered

No Correlation
A scatter plot with no clear increasing or decreasing trend in the values of the variables
is said to have no correlation.


Scatter plot Example

Let us understand how to construct a scatter plot with the help of the below example.
Question:
Draw a scatter plot for the given data that shows the number of games played and
scores obtained in each instance.

No. of games     3    5    2    6    7    1    2    7    1    7
Scores          80   90   75   80   90   50   65   85   40  100

Solution:
X-axis or horizontal axis: Number of games
Y-axis or vertical axis: Scores
Now, the scatter graph will be:


Note: We can also combine several scatter plots on one sheet to read and
understand the higher-level structure in multivariate data sets, that is, data sets with
more than two variables.
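
The direction and strength of the relationship in the games-versus-scores example can also be summarised numerically. The sketch below computes Pearson's correlation coefficient by hand; choosing Pearson's r as the measure is an assumption of this example (the notes only describe correlation visually), and the function name pearson_r is illustrative.

```python
from math import sqrt

games = [3, 5, 2, 6, 7, 1, 2, 7, 1, 7]
scores = [80, 90, 75, 80, 90, 50, 65, 85, 40, 100]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r(games, scores)
print(f"r = {r:.2f}")  # a value close to +1 indicates a high positive correlation
```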

Scatter plot Matrix

For data variables x1, x2, x3, ..., xn, the scatter plot matrix presents all the
pairwise scatter plots of the variables in a single illustration, arranged
in a matrix format. For n variables, the scatter plot matrix will contain n
rows and n columns. A plot of variables xi vs xj will be located at the intersection
of the ith row and jth column. We can say that each row and column is one dimension, whereas
each cell plots a scatter plot of two dimensions.


Z-Test

Z-tests are a statistical way of testing a Null Hypothesis when either:

• We know the population variance, or
• We do not know the population variance, but our sample size is large (n ≥ 30).

If we have a sample size of less than 30 and do not know the population variance, we must use a t-test. This is
how we judge when to use the z-test vs the t-test. Further, it is assumed that the z-statistic follows a standard
normal distribution. In contrast, the t-statistics follows the t-distribution with a degree of freedom equal to n-1,
where n is the sample size.
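
The rule of thumb above for choosing between the two tests can be written as a tiny helper. The function name choose_test and its parameter names are illustrative assumptions.

```python
def choose_test(n, population_variance_known):
    """Pick a test using the rule of thumb from the notes:
    z-test if the population variance is known or n >= 30, otherwise t-test."""
    if population_variance_known or n >= 30:
        return "z-test"
    return "t-test"


print(choose_test(n=20, population_variance_known=True))    # known variance, small sample
print(choose_test(n=60, population_variance_known=False))   # large sample
print(choose_test(n=12, population_variance_known=False))   # small sample, unknown variance
```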

It must be noted that the samples used for a z-test or t-test must be independent, and must also be drawn from
the same distribution as the population. This makes sure that the sample is not “biased” towards or against
the Null Hypothesis which we want to validate or invalidate.

For the null hypothesis H0: µ = µ0, the alternate hypothesis can be one of:

H1: µ > µ0 (Right-tailed test)

H1: µ < µ0 (Left-tailed test)

H1: µ ≠ µ0 (Two-tailed test)

One-Sample Z-Test
We perform the One-Sample z-Test when we want to compare a sample mean with the population mean.

Here’s an Example to Understand a One Sample z-Test


Let’s say we need to determine if girls on average score higher than 600 in the exam. We have the information
that the standard deviation for girls’ scores is 100. So, we collect the data of 20 girls by using random samples
and record their marks. Finally, we also set our ⍺ value (significance level) to be 0.05.


In this example:
• Mean Score for Girls is 641
• The number of data points in the sample is 20
• The population mean is 600
• Standard Deviation for Population is 100

Since the P-value is less than 0.05, we can reject the null hypothesis and conclude based on our result that
Girls on average scored higher than 600.
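
The one-sample z-test for this example can be worked through as follows. This is a sketch that assumes a right-tailed test (H1: µ > 600), since the question is whether girls score higher than 600; the variable names are chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Values from the example
sample_mean, n = 641, 20
mu0, sigma = 600, 100
alpha = 0.05

# z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
z = (sample_mean - mu0) / (sigma / sqrt(n))

# Right-tailed p-value: area to the right of z under the standard normal curve
p_value = 1 - NormalDist().cdf(z)
reject_h0 = p_value < alpha

print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```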

Two-Sample Z-Test


We perform a Two Sample z-test when we want to compare the mean of two samples.

Here’s an Example to Understand a Two Sample Z-Test


Here, let’s say we want to know if Girls on an average score 10 marks more than the boys. We have the
information that the standard deviation for girls’ Score is 100 and for boys’ score is 90. Then we collect the data
of 20 girls and 20 boys by using random samples and record their marks. Finally, we also set our ⍺ value
(significance level) to be 0.05.

In this example:
• Mean Score for Girls (Sample Mean) is 641
• Mean Score for Boys (Sample Mean) is 613.3
• Standard Deviation for the Population of Girls’ is 100
• Standard deviation for the Population of Boys’ is 90
• Sample Size is 20 for both Girls and Boys
• Hypothesized difference between the population means is 10


Thus, we can conclude based on the p-value that we fail to reject the Null Hypothesis. We don’t have enough
evidence to conclude that girls on average score 10 marks more than the boys. Pretty simple, right?
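
The two-sample z-test for this example can be sketched in the same way. It assumes a right-tailed test of H0: µ_girls − µ_boys = 10 against H1: µ_girls − µ_boys > 10, with the standard error of the difference taken as sqrt(σ1²/n1 + σ2²/n2); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values from the example
mean_girls, mean_boys = 641, 613.3
sd_girls, sd_boys = 100, 90
n_girls, n_boys = 20, 20
hypothesized_diff = 10
alpha = 0.05

# Standard error of the difference between two independent sample means
se = sqrt(sd_girls ** 2 / n_girls + sd_boys ** 2 / n_boys)

z = ((mean_girls - mean_boys) - hypothesized_diff) / se
p_value = 1 - NormalDist().cdf(z)   # right-tailed
reject_h0 = p_value < alpha

print(f"z = {z:.3f}, p-value = {p_value:.3f}, reject H0: {reject_h0}")
```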

T-Test

T-tests are a statistical way of testing a hypothesis when:


• We do not know the population variance
• Our sample size is small, n < 30.

The t-test is a type of inferential statistic used to study whether there is a statistically significant difference
between two groups. Mathematically, it sets up the problem by assuming that the means of the two distributions
are equal (H₀: µ₁=µ₂). If the t-test rejects the null hypothesis (H₀: µ₁=µ₂), it indicates that the groups are
very probably different.

The statistical test can be one-tailed or two-tailed. The one-tailed test is appropriate when we want to test for a
difference between groups in a specific direction. It is less common than the two-tailed test. When choosing a t-test, you
will need to consider two things: whether the groups being compared come from a single population or two
different populations, and whether you want to test the difference in a specific direction.


There are three main types of t-test :

• One Sample t-test : Compares mean of a single group against a known/hypothesized/ population mean.
• Two Sample: Paired Sample T Test: Compares means from the same group at different times.
• Two Sample: Independent Sample T Test: Compares means for two different groups.

One Sample t-test:

t = (Sample Mean − Population Mean) / Standard Error

$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

$\bar{x}$: Sample mean
$\mu$: Population mean
$s$: Sample standard deviation
$n$: Sample size

Standard deviation can be calculated as:

$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

Degree of freedom = n − 1
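
A minimal sketch of the one-sample t statistic follows. The sample values and the hypothesized mean are made up for illustration. Python's standard library has no t-distribution, so the two-tailed critical value 2.262 (df = 9, α = 0.05) is taken from a standard t-table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample and hypothesized population mean
sample = [12, 14, 11, 13, 15, 12, 14, 13, 12, 14]
mu0 = 12

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)            # sample standard deviation (divides by n - 1)

t = (x_bar - mu0) / (s / sqrt(n))
df = n - 1

# Two-tailed critical value for df = 9 at alpha = 0.05, from a t-table
t_critical = 2.262
reject_h0 = abs(t) > t_critical

print(f"t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```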

Two-sample - Paired Sample t-test

$t = \frac{\bar{d}}{s / \sqrt{n}}$

$\bar{d}$: Mean of the differences
$s$: Standard deviation of the differences
$n$: Sample size (number of pairs)

Degree of freedom = n − 1

Standard deviation can be calculated as:

$s = \sqrt{\frac{\sum d^2 - n(\bar{d})^2}{n - 1}}$
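
The paired t statistic can be sketched with hypothetical before/after scores for five subjects (the data are invented for illustration). The standard deviation of the differences is computed with the shortcut formula above and is numerically the same as statistics.stdev applied to the differences. The critical value 2.776 (df = 4, two-tailed, α = 0.05) is taken from a standard t-table.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical paired measurements on the same five subjects
before = [70, 72, 68, 75, 71]
after = [74, 75, 70, 78, 72]

d = [a - b for a, b in zip(after, before)]   # per-subject differences
n = len(d)
d_bar = mean(d)

# s = sqrt((sum of d^2 - n * d_bar^2) / (n - 1))
s = sqrt((sum(x * x for x in d) - n * d_bar ** 2) / (n - 1))

t = d_bar / (s / sqrt(n))
df = n - 1

t_critical = 2.776   # df = 4, two-tailed, alpha = 0.05 (from a t-table)
reject_h0 = abs(t) > t_critical

print(f"t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```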

Two Sample: Independent Sample T Test:

$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$

$s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$

$\bar{x}_1$: Mean of the first sample
$\bar{x}_2$: Mean of the second sample
$n_1$: Number of items in the first sample
$n_2$: Number of items in the second sample
$s_1$: Standard deviation of the first sample
$s_2$: Standard deviation of the second sample
$s_p$: Pooled (combined) standard deviation

Degree of freedom is n1 + n2 − 2.
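
The pooled-variance t statistic can be sketched from summary statistics. The means, standard deviations and sample sizes below are hypothetical, and the function name pooled_t is an assumption of this example. The critical value 2.086 (df = 20, two-tailed, α = 0.05) is taken from a standard t-table.

```python
from math import sqrt


def pooled_t(x1_bar, x2_bar, s1, s2, n1, n2):
    """Return (t, df) for an independent two-sample t-test with pooled variance."""
    sp = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    t = (x1_bar - x2_bar) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical summary statistics for two independent groups
t, df = pooled_t(x1_bar=85, x2_bar=80, s1=5, s2=6, n1=10, n2=12)

t_critical = 2.086   # df = 20, two-tailed, alpha = 0.05 (from a t-table)
reject_h0 = abs(t) > t_critical

print(f"t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```

The pooled form assumes the two populations have equal variances; when that assumption is doubtful, a different variant of the test is normally used.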

• If the calculated t value (in absolute value, for a two-tailed test) is greater than the critical t value
(obtained from a critical value table called the T-distribution table), then reject the null hypothesis.
• P-value < significance level (𝜶) => Reject your null hypothesis in favor of your alternative
hypothesis. Your result is statistically significant.
• P-value >= significance level (𝜶) => Fail to reject your null hypothesis. Your result is not statistically
significant.
