AD8412 - Data Analytics - Staff COPY - V1
For
B.TECH
(ARTIFICIAL INTELLIGENCE AND DATA
SCIENCE)
(Anna University Regulation 2017)
For the Batch (2020 to 2024)
Semester: IV
Academic Year: 2021-2022
LABORATORY MANUAL
Reg No ………………………………………………………
Course Code :
Programme :
It is certified that this is the bonafide record of the work carried out by
_____________________________of____________________________class
AD8412 DATA ANALYTICS LABORATORY
COURSE OUTCOMES
After the completion of this course, students will be able to:
CO1: Use various packages in Python skillfully
CO2: Demonstrate an understanding of data distribution with various samples
CO3: Implement T-test, ANOVA and Z-test on sample data sets
CO4: Apply mathematical models to real-world problems
CO5: Conduct time series analysis and draw conclusions
LIST OF EXERCISES
1. Random Sampling
2. Z-test case study
3. T-test case studies
4. ANOVA case studies
5. Regression
6. Logistic Regression
7. Time series Analysis
AD8412 - DATA ANALYTICS LABORATORY
INDEX
Expt. No.   Name of the Experiment                                              Date   Page No.   Signature
1           Demonstration of random sampling
2           Demonstration of probability sampling from a known population
3           Implementation of Z-Test – One Sample Z-Test and Two Sample Z-Test
Ex.No: 1 Demonstration of random sampling using Python
Date :
1. Problem statement :
This exercise helps students understand how to randomly sample items from lists and
how to generate pseudorandom numbers in Python.
3. Problem analysis:
Python provides many useful tools for random sampling as well as functions for
generating random numbers. Random sampling has applications in statistics, where
often a random subset of a population is observed and used to make inferences about
the overall population. Further, random number generation has many applications in
the sciences. For example, in chemistry and physics, Monte Carlo simulations require
random number generation.
4. Algorithm:
1. Create a list
2. Use the ‘random.choice()’ method to randomly select individual values from this list
3. Use ‘random.sample()’ method for randomly sampling N items from a list
4. In addition to random selection and sampling, the random module has a function for
shuffling items in a list. Perform randomly shuffling items in a list using
‘random.shuffle()’
5. The random module has a function for generating a random integer within a provided
range of values. Generate the random integers using ‘random.randint()’
6. The random module also has a function for generating a random floating point value
between 0 and 1. Generate the random floating point values using ‘random.random()’
7. Scale the random float numbers. If we want random numbers between 0 and 500, we
just multiply our random number by 500
8. If we want to add a lower bound as well, we can add a conditional statement before
appending
9. The random module has a function for computing uniformly distributed numbers.
Compute uniformly distributed numbers with ‘random.uniform()’
10. The random module has a function for computing normally distributed numbers.
Compute normally distributed numbers with ‘random.gauss()’
5. Code:
bmi_list = [29, 18, 20, 22, 19, 25, 30, 28,22, 21, 18, 19, 20, 20, 22, 23]
Use the ‘random.choice()’ method to randomly select individual BMI values from this list:
import random
print("First random choice:", random.choice(bmi_list))
print("Second random choice:", random.choice(bmi_list))
print("Third random choice:", random.choice(bmi_list))
Run the code multiple times and check what output you get each time.
The ‘random.sample()’ method is useful for randomly sampling N items from a list.
Consider sample N=5 items and apply it to our BMI list:
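The sampling call itself is not printed in the handout; a minimal sketch of the N=5 draw:

```python
import random

bmi_list = [29, 18, 20, 22, 19, 25, 30, 28, 22, 21, 18, 19, 20, 20, 22, 23]

# Randomly sample N=5 items (without replacement) from the BMI list
bmi_sample = random.sample(bmi_list, 5)
print("Random sample of 5 BMI values:", bmi_sample)
```

Because `random.sample()` draws without replacement, the five values come from five distinct positions in the list.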
Print our BMI list and then print the result of shuffling our BMI list
print("BMI list: ", bmi_list)
random.shuffle(bmi_list)
print("Shuffled BMI list: ", bmi_list)
The random module has a function for generating a random integer provided a range of
values. Let’s generate a random integer in the range from 1 to 5
random_ints_list = []
for i in range(1,50):
n = random.randint(1,5)
random_ints_list.append(n)
print("My random integer list: ", random_ints_list)
random_float_list = []
for i in range(1,5):
n = random.random()
random_float_list.append(n)
print("My random float list: ", random_float_list)
Scale the random float numbers by multiplying our random number by 500
random_float_list = []
for i in range(1,5):
n = random.random()*500
random_float_list.append(n)
print("My random float list: ", random_float_list)
To add a lower bound as well, add a conditional statement before appending, and
generate random numbers between 100 and 500
random_float_list = []
for i in range(1,10):
n = random.random()*500
if n>=100.0:
random_float_list.append(n)
print("My random float list: ", random_float_list)
import numpy as np
uniform_list = np.random.uniform(-10,1,50)
print("Uniformly Distributed Numbers: ", uniform_list)
normal_list = np.random.normal(0, 1, 50)   # draw from a Gaussian (mean 0, std 1)
print("Normally Distributed Numbers: ", normal_list)
6. Result
7. Viva voce
b. What are the types of Random sampling? Explain each of the sampling methods
1. Simple random sampling
In simple random sampling, every member of the population has an equal chance of
being selected, and the sample is chosen entirely by chance.
2. Systematic sampling
Systematic sampling is the selection of specific individuals or members from an entire
population. The selection often follows a predetermined interval (k). The systematic
sampling method is comparable to the simple random sampling method; however, it is less
complicated to conduct.
3. Stratified sampling
Stratified sampling involves the partitioning of a population into subclasses with
notable distinctions and variances. The stratified sampling method is useful, as it allows
the researcher to make more reliable and informed conclusions by confirming that each
respective subclass has been adequately represented in the selected sample.
4. Cluster sampling
Cluster sampling, like the stratified sampling method, involves dividing a
population into subclasses. Each of the subclasses should portray comparable
characteristics to the entire selected sample. This method entails the random selection of a
whole subclass, as opposed to the sampling of members from each subclass. This method
is ideal for studies that involve widely spread populations.
c. Give a real time example for random sampling
A company currently employs 850 individuals. The company wishes to conduct a survey
to determine employee satisfaction based on a few identified variables. The research team
decides to have the sample set at 85 employees. The 85 employees will be part of the
survey and will be used as a representation for the total population of 850 employees.
In such a scenario, the sample is the 85 employees, and the population is the entire
workforce consisting of 850 individuals. Based on the sample size, any employee from the
workforce can be selected for the survey. That is to say, each employee has an
equal probability of being randomly selected for the survey.
Ex.No: 2 Demonstration of probability sampling using Python
Date :
1. Problem statement :
Probability sampling is used in cases when every unit from a given population has
the same probability of being selected. This technique includes simple random
sampling, systematic sampling, cluster sampling and stratified random sampling
What is Sampling?
Sampling is needed in the following cases:
Cases where it is impossible to study the entire population due to its size
Cases where the sampling process involves destructive testing of samples
Cases where there are time and cost constraints
Sampling Techniques
There are two types of sampling techniques:
Probability sampling: cases when every unit from a given population has the same
probability of being selected. This technique includes simple random sampling,
systematic sampling, cluster sampling and stratified random sampling.
Non-probability sampling: cases when units from a given population do not have
the same probability of being selected. This technique includes convenience
sampling, quota sampling, judgement sampling and snowball sampling. In
comparison with probability sampling, this technique is more prone to end up with
a non-representative sample group, leading to wrong conclusions about the
population.
4. Algorithm:
1. Create a sample from a set of 10 products using probability sampling to determine the
population mean of a particular measure of interest.
2. Implement Simple Random Sampling
The simple random sampling method selects random samples from a process or
population where every unit has the same probability of getting selected
3. Implement Systematic Sampling
The systematic sampling method selects units based on a fixed sampling
interval
4. Implement Cluster Sampling
The cluster sampling method divides the population into clusters of equal size n
and selects clusters every Tth time
5. Implement Stratified Random Sampling
The stratified random sampling method divides the population into subgroups
and selects random samples where every unit has the same probability of
getting selected
5. Code
1. Create Sample
import numpy as np
import pandas as pd
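The DataFrame used by the sampling functions in this section is never constructed in the handout, and the simple random sampling step (Algorithm step 2) has no code. A minimal sketch, where the seed and sample size are assumptions:

```python
import numpy as np
import pandas as pd

number_of_products = 10
np.random.seed(0)  # seed is an assumption, for reproducibility

# Build the 10-product DataFrame described in the algorithm
df = pd.DataFrame({
    'product_id': np.arange(1, number_of_products + 1),
    'measure': np.round(np.random.normal(loc=10, scale=0.5,
                                         size=number_of_products), 3)
})

# Simple random sampling: every unit has the same probability of selection
simple_random_sample = df.sample(n=4, random_state=0)
print(simple_random_sample)
```

`DataFrame.sample()` draws rows uniformly at random without replacement, which matches the definition of simple random sampling given in the algorithm.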
# Systematic sampling: select every 'step'-th unit
def systematic_sampling(df, step):
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample
# Obtain a systematic sample and save it in a new variable
systematic_sample = systematic_sampling(df, 3)
# The cluster-sampling fragment lacked its function definition; reconstructed wrapper:
def cluster_sampling(df, number_of_clusters):
    try:
        # Divide the units into clusters of equal size
        df['cluster_id'] = np.repeat(
            [range(1, number_of_clusters + 1)], len(df) // number_of_clusters)
        # Append the indexes from the clusters that meet the criteria
        # For this formula, cluster id must be an even number
        indexes = []
        for i in range(0, len(df)):
            if df['cluster_id'].iloc[i] % 2 == 0:
                indexes.append(i)
        cluster_sample = df.iloc[indexes]
        return cluster_sample
    except:
        print("The population cannot be divided into clusters of equal size!")
# Create data dictionary
data = {'product_id':np.arange(1, number_of_products+1).tolist(),
'product_strata':np.repeat([1,2], number_of_products//2).tolist(),
'measure':np.round(np.random.normal(loc=10, scale=0.5,
size=number_of_products),3)}
# Import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit
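The handout stops at the import. One way the stratified split might be applied to the data dictionary above (the test_size of 4 and the random seed are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

number_of_products = 10
np.random.seed(0)
data = {'product_id': np.arange(1, number_of_products + 1).tolist(),
        'product_strata': np.repeat([1, 2], number_of_products // 2).tolist(),
        'measure': np.round(np.random.normal(loc=10, scale=0.5,
                                             size=number_of_products), 3)}
df = pd.DataFrame(data)

# Draw a stratified sample of 4 units, preserving the strata proportions
split = StratifiedShuffleSplit(n_splits=1, test_size=4, random_state=0)
for _, sample_idx in split.split(df, df['product_strata']):
    stratified_sample = df.iloc[sample_idx]
print(stratified_sample)
```

Because the two strata are of equal size, the 4-unit sample contains exactly 2 units from each stratum, so every subgroup is represented proportionally.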
6. Output
7. Result
8. Viva voce
b. Determine a suitable sample frame: Your frame should consist of a sample from
your population of interest and no one from outside to collect accurate data.
c. Select your sample and start your survey: It can sometimes be challenging to find
the right sample and determine a suitable sample frame. Even if all factors are in
your favor, there still might be unforeseen issues like cost factor, quality of
respondents, and quickness to respond. Getting a sample to respond to a
probability survey accurately might be difficult but not impossible
1. When you want to reduce the sampling bias: This sampling method is used when
the bias has to be minimum. The selection of the sample largely determines the
quality of the research’s inference. How researchers select their sample largely
determines the quality of a researcher’s findings. Probability sampling leads to
higher quality findings because it provides an unbiased representation of the
population.
2. When the population is usually diverse: Researchers use this method extensively
as it helps them create samples that fully represent the population. Say we want to
find out how many people prefer medical tourism over getting treated in their own
country. This sampling method will help pick samples from various socio-economic
strata, backgrounds, etc. to represent the broader population.
f. What are the applications of cluster sampling?
This sampling technique is used in area or geographical cluster sampling for market
research. A broad geographic area can be expensive to survey in comparison to surveys
that are sent to clusters that are divided based on region. The sample numbers have to be
increased to achieve accurate results, but the cost savings involved still make this
clustering process worthwhile.
The technique is widely used in statistics where the researcher can’t collect data from the
entire population as a whole. It is the most economical and practical solution for
statisticians doing research. Take the example of a researcher who is looking to
understand the smartphone usage in Germany. In this case, the cities of Germany will
form clusters. This sampling method is also used in situations like wars and natural
calamities to draw inferences of a population, where collecting data from every
individual residing in the population is impossible.
population. Due to the accuracy involved, it is highly probable that the required
sample size will be much lesser and that will help researchers in saving time and
efforts
Ex.No: 3
Implementation of Z-Test – One Sample Z-Test and Two Sample Z-Test
Date :
1. Problem statement :
The Z-test has several assumptions which need to be fulfilled before using it. The
assumptions are as follows:
The one sample Z statistic is computed as:
Z = (x̄ − μ) / (σ / √n)
Where,
x̄ is the Sample Mean, μ is the Population Mean,
σ is the Population Standard Deviation, and n is the Sample Size.
After calculating the Z-score, the conclusion may change based on the tails of the
test. Generally, there are three types of tailed tests: Left-tailed test, Right-tailed
test, Two-tailed test. The rule of thumb is: if we have a < or > sign between the Ha and
the given value, it is a one-tailed test (left or right tail, respectively), and if we have
≠ between the Ha and the given value, it is a two-tailed test.
Thus, after calculating the z statistic value, if our test is left tailed, our conclusion
can be determined by the rule: If the calculated Z-statistic value is less than the
critical Z value, Reject Null Hypothesis H0 else we fail to reject the H0. If our test
is right-tailed, our conclusion can be determined by the rule: If the calculated Z-
statistic value is more than the critical Z value, Reject Null Hypothesis H0 else we
fail to reject the H0. If our test is two-tailed, our conclusion can be determined by
the rule: If the calculated Z-statistic value is less than the lower critical Z value or
greater than the upper critical Z value, Reject Null Hypothesis H0, else we fail to
reject the H0.
The sample size should be greater than 30. Otherwise, we should use the t-
test.
Samples should be drawn at random from the population.
The standard deviation of the population should be known.
Samples that are drawn from the population should be independent of each
other.
The data should be normally distributed, however for large sample size, it is
assumed to have a normal distribution.
Type of Z-test
Left-tailed Test: In this test, our region of rejection is located to the extreme left of
the distribution. Here our null hypothesis is that the claimed value is greater than or
equal to the mean population value.
Right-tailed Test: In this test, our region of rejection is located to the extreme right
of the distribution. Here our null hypothesis is that the claimed value is less than or
equal to the mean population value.
Two-tailed Test: In this test, our region of rejection lies in both tails of the
distribution. Here our null hypothesis is that the claimed value is equal to the mean
population value.
You can use the ztest() function from the statsmodels package to perform one
sample and two sample z-tests in Python.
ztest(x1, x2=None, value=0)
where:
x1: values for the first sample
x2: values for the second sample (when performing a two sample z-test)
value: the mean under the null hypothesis (one sample test) or the hypothesized
difference between the two population means (two sample test)
4. Algorithm:
Step 1: Evaluate the data distribution.
Step 2: Formulate Hypothesis statement symbolically
Step 3: Define the level of significance (alpha)
Step 4: Calculate Z test statistic or Z score.
Step 5: Derive P-value for the Z score calculated.
Step 6: Make decision:
Step 6.1: P-Value <= alpha, then we reject H0.
Step 6.2: If P-Value > alpha, Fail to reject H0
5. Code:
A researcher wants to know if a new drug affects IQ levels, so he recruits 20 patients to try
it and records their IQ levels.
The following code shows how to perform a one sample z-test in Python to determine if
the new drug causes a significant difference in IQ levels:
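The z-test call itself is missing from the handout. A sketch that reproduces the reported result: the 20 IQ values are an assumption (they are not printed here), chosen to be consistent with the quoted test statistic 1.5976 and p-value 0.1101, and the statistic is computed with plain Python rather than statsmodels' ztest():

```python
import math
from statistics import mean, stdev

# The 20 recorded IQ values are an assumption (not printed in the handout),
# chosen to be consistent with the reported z = 1.5976 and p = 0.1101
data = [88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
        105, 109, 109, 109, 110, 112, 112, 113, 114, 115]

value = 100  # hypothesized population mean IQ under H0

# One sample z statistic: z = (x_bar - mu0) / (s / sqrt(n))
n = len(data)
z = (mean(data) - value) / (stdev(data) / math.sqrt(n))
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

print(round(z, 4), round(p, 4))  # 1.5976 0.1101
```

The same numbers would come out of `ztest(data, value=100)` from `statsmodels.stats.weightstats`, which applies exactly this formula.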
Conclusion:
The test statistic for the one sample z-test is 1.5976 and the corresponding p-value is
0.1101.
Since this p-value is not less than .05, we do not have sufficient evidence to reject the
null hypothesis. In other words, the new drug does not significantly affect IQ level.
A researcher wants to know if the mean IQ level between individuals in city A and city B
are different, so she selects a simple random sample of 20 individuals from each city and
records their IQ levels.
cityB = [90, 91, 91, 91, 95, 95, 99, 99, 108, 109,
109, 114, 115, 116, 117, 117, 128, 129, 130, 133]
Conclusion:
The test statistic for the two sample z-test is -1.9953 and the corresponding p-value is
0.0460.
Since this p-value is less than .05, we have sufficient evidence to reject the null
hypothesis. In other words, the mean IQ level is significantly different between the two
cities.
6. Output
7. Result
8. Viva voce
a. What is the difference between one-sample and two sample z-test?
The two-sample z-test tests the difference between the means of two groups, whereas
the one-sample z-test tests the difference between a single group mean and the
hypothesized population value.
b. What is a 2 sample z-test?
Two-Sample Z-Test. The Two-Sample Z-test is used to compare the means of two
samples to see if it is feasible that they come from the same population. The null
hypothesis is: the population means are equal.
Z-scores may be positive or negative, with a positive value indicating the score is
above the mean and a negative score indicating it is below the mean
Ex.No: 4
Implementation of Z-Test – using Titanic case study
Date :
1. Problem statement :
A hypothesis is a new research question. Say there is a proposal for a new
drug: what is the significance of investing in manufacturing it if its effect
on people is trivial? The decision should be driven by a hypothesis test,
which should show statistically significant results, so that we make a well-informed
decision to either go with the new research claim or not.
There are many varieties of hypothesis tests; based on the objective of our research
and the data that we have, we choose an appropriate type of hypothesis test. It is
important to get an intuition of what is happening in these tests and which test is
suitable for a real scenario. In this exercise, we will cover only the Z-test in detail,
across different scenarios.
First, let’s understand the distribution of data in general. The green graph shows
the normal distribution of data with mean=5 and standard deviation = 2, converting
this to a standard normal distribution(grey graph) will shift the central location to
0(Mean=0) and will be 1 standard deviation away from mean 0. This is nothing but
a Z score. Z score translates the data that we have from normal distribution into a
standard normal distribution.
The formula for Z score is:
z = (x1 − μ) / σ
In the above example, we just had one value x1=3. Now when we have a
population, it’s hard to validate each element of the population. That is the reason
we evaluate the Z score using sampling distribution of means and population mean.
z = (x̄ − μ) / (σ / √n)
Here x̄ is the sample mean, μ is the population mean, σ is the population standard
deviation, and n is the sample size.
Standardizing makes it easy to work with data. Once we have the Z score, we can
use the Z table(standard normal distribution table) which allows us to find the area
of the region located under the bell curve. This is useful to calculate the probability
of occurrence within our normal distribution. This can also be used to compare 2
scores that are from different normal distributions.
Today we use software programs or say libraries that will give us the Z score and
probability values(p-value) in a click. Still, it’s important to understand what goes
in the background to get an intuition.
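The Z-table lookup described above can be reproduced directly from the error function; a small sketch using the mean = 5, standard deviation = 2 example from the text:

```python
import math

def norm_cdf(z):
    """Area under the standard normal curve to the left of z --
    what a Z-table lookup returns."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Example from the text: x1 = 3 in a distribution with mean = 5 and
# standard deviation = 2
z = (3 - 5) / 2          # z score = -1.0
area = norm_cdf(z)       # P(X <= 3)
print(z, round(area, 4))
```

About 15.87% of the distribution lies below x1 = 3, exactly the value a Z-table gives for z = -1.0.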
Z test
Z-test is used with continuous variables. Continuous random variables can take an
infinite number of values, for example: Age, Fare, Weight, Height.
Parameters of interest for the Z test are:
mean (μ)
proportion (p)
4. Algorithm:
Step 1: Evaluate the data distribution.
Step 2: Formulate Hypothesis statement symbolically
Step 3: Define the level of significance (alpha)
Step 4: Calculate Z test statistic or Z score.
Step 5: Derive P-value for the Z score calculated.
Step 6: Make decision:
Step 6.1: P-Value <= alpha, then we reject H0.
Step 6.2: If P-Value > alpha, Fail to reject H0
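The six steps above can be sketched as one small helper; this is a generic one sample z routine, and the variable names and the example numbers at the bottom are illustrative, not taken from the Titanic data:

```python
import math

def one_sample_z(sample_mean, pop_mean, pop_sd, n, alpha=0.05,
                 alternative='two-sided'):
    """Steps 4-6 of the algorithm: compute z, derive p, make the decision."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))   # Step 4
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))             # P(Z <= z)
    if alternative == 'larger':          # right-tailed test
        p = 1 - phi
    elif alternative == 'smaller':       # left-tailed test
        p = phi
    else:                                # two-tailed test
        p = math.erfc(abs(z) / math.sqrt(2))
    decision = 'Reject H0' if p <= alpha else 'Fail to reject H0'  # Step 6
    return z, p, decision

# Hypothetical numbers (not from the Titanic data): claimed mean 28,
# sample mean 29, population sd 14, 60 samples, right-tailed test
print(one_sample_z(29, 28, 14, 60, alternative='larger'))
```

The `alternative` argument mirrors the left-, right- and two-tailed cases discussed earlier; only the tail from which the p-value is taken changes.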
Z test implementation
5. Code
https://github.com/datasciencedojo/datasets/blob/master/titanic.csv
1) Some new survey/research claims that the average age of passengers in Titanic
who survived is greater than 28.
2) Is there a difference in average age between the two genders who survived?
3) Greater than 50% of passengers who survived in Titanic are in the age group of
20–40.
4) Greater than 50% of passengers in Titanic are in the age group of 20–40 ( including
both survived and non-survived passengers)
Titanic data set overview:
In this dataset, Age and Fare are the continuous variables on which we can perform the Z
test.
1) Some new survey/research claims that the average age of passengers in Titanic who
survived is greater than 28.
First, let’s look at the data in hand. As shown below (Graph 1), the population is not
normally distributed. So, as per the Central Limit Theorem, we will take the sampling
distribution of 60 sample means (each computed from a random sample of 60 survived
passengers), which will approximate a normal distribution (Graph 2).
H0: Average age of survived passengers in Titanic is less than or equal to 28: μ ≤ 28
HA: Average age of survived passengers in Titanic is greater than 28: μ > 28
Conclusion: As per the Z test, we reject H0 and go with the alternate theory which says the
Average age of survived passengers is > 28. With a confidence interval of 95%, we can
say that the average age ranges between 28.01 and 28.98.
2. Is there a difference in average age between the two genders who survived?
Evaluating the data distribution of survived male and female passengers’ ages:
the population data is not normal, so I took the sampling distribution of 60 sample
means, which will approximate a normal distribution.
H0: No difference in mean age of male & female passengers who survived: μ_male
= μ_female, or μ_male − μ_female = 0
HA: There is a difference in mean age of male & female passengers who survived: μ_male
≠ μ_female, or μ_male − μ_female ≠ 0
Conclusion: We can go with the alternate theory, which says “there is a difference in the
mean age of male & female passengers who survived”.
Also, let’s check whether the mean age of males is greater than the female mean age.
Conclusion:
We do not have a significant result to conclude that the mean age of male passengers is
greater than that of female passengers who survived.
3. Greater than 50% of passengers who survived in Titanic are in the age group of 20–40.
H0: p ≤ 0.5, Less than 50% of passengers who survived in Titanic are in the age group of
20–40
H1:p > 0.5 , Greater than 50% of passengers who survived in Titanic are in the age group
of 20–40
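The proportion test itself is not shown in the handout; a sketch of a right-tailed one-proportion z-test (the survivor counts are hypothetical, since the Titanic CSV is not reproduced here):

```python
import math

def one_proportion_z(successes, n, p0=0.5):
    """Right-tailed one-proportion z-test: H0: p <= p0, H1: p > p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)            # standard error under H0
    z = (p_hat - p0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # right-tail area, 1 - Phi(z)
    return z, p_value

# Hypothetical counts: say 180 of the 342 survivors fall in the 20-40 band
z, p_value = one_proportion_z(180, 342)
print(round(z, 3), round(p_value, 4))
```

With these illustrative counts the sample proportion barely exceeds 0.5 and the p-value is large, which matches the "fail to reject H0" conclusion drawn in the text.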
Conclusion:
We fail to reject H0: we do not have a significant result, so we cannot go with the alternate
theory that says more than 50% of passengers who survived are in the 20–40 age range.
With 95% confidence, we can say the proportion of survived passengers in the 20–40 age
range is between 47.01% and 58.0%.
4. Greater than 50% of passengers in Titanic are in the age group of 20–40 ( including
both survived and non-survived passengers)
Conclusion: We have significant results to go with the alternate theory: greater than 50%
of passengers in Titanic are in the age group of 20–40. With 95% confidence, we can say
the proportion of passengers aged 20–40 is between 50.3% and 57.6%.
7. Results
8. Viva Voce
a) What is a null hypothesis? Explain with an example.
If the average earnings from the sample data are sufficiently far from zero, then the
gambler will reject the null hypothesis and conclude the alternative hypothesis—
namely, that the expected earnings per play are different from zero. If the average
earnings from the sample data are near zero, then the gambler will not reject the null
hypothesis, concluding instead that the difference between the average from the data
and zero is explainable by chance alone.
b) What are Type I and Type II errors? What's the difference between Type 1 error
and Type 2 error?
In statistics, a Type I error means rejecting the null hypothesis when it's actually true,
while a Type II error means failing to reject the null hypothesis when it's actually false
A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is
actually true in the population; a type II error (false-negative) occurs if the investigator
fails to reject a null hypothesis that is actually false in the population.
c) What is a decision rule in hypothesis testing?
The decision rule is a statement that tells under what circumstances to reject the
null hypothesis. The decision rule is based on specific values of the test statistic
(e.g., reject H0 if Z > 1.645). The decision rule for a specific test depends on 3
factors: the research or alternative hypothesis, the test statistic and the level of
significance.
d) To reject the null hypothesis, what are the steps to be performed. Explain the steps
in detail
Step 1: State the null hypothesis. When you state the null hypothesis, you also have
to state the alternate hypothesis. Sometimes it is easier to state the alternate
hypothesis first, because that reflects the researcher’s thoughts about the experiment.
Step 2: Support or reject the null hypothesis. Several methods exist, depending on
what kind of sample data you have. For example, you can use the P-value method:
compute the p-value and compare it with the significance level; if the p-value is less
than or equal to the significance level, reject the null hypothesis.
Ex.No: 5
Implementation of T-Test – one sample t-test
Date :
1. Problem statement :
To perform a one sample t-test to determine whether the mean of a population is equal
to some value or not
A very simple example: let’s say you have a cold and you try a naturopathic remedy.
Your cold lasts a couple of days. The next time you have a cold, you buy an over-the-
counter pharmaceutical and the cold lasts a week. You survey your friends and they all
tell you that their colds were of a shorter duration (an average of 3 days) when they
took the naturopathic remedy. What you really want to know is: are these results
repeatable? A t-test can tell you by comparing the means of the two groups and letting
you know the probability of those results happening by chance.
One sample t-test : The One Sample t Test determines whether the sample mean is
statistically different from a known or hypothesised population mean. The One
Sample t Test is a parametric test.
The one sample t test compares the mean of your sample data to a known value. For
example, you might want to know how your sample mean compares to the population
mean. You should run a one sample t test when you don’t know the population
standard deviation or you have a small sample size.
Assumptions of the test (your data should meet these requirements for the test to be
valid):
Data is independent.
Data is collected randomly. For example, with simple random sampling.
The data is approximately normally distributed.
Please notice that the formula is a ratio. A common analogy is that the t-value is the
signal-to-noise ratio.
The numerator is the signal. You simply take the sample mean and subtract the null
hypothesis value. If your sample mean is 10 and the null hypothesis is 6, the
difference, or signal, is 4.
If there is no difference between the sample mean and null value, the signal in the
numerator, as well as the value of the entire ratio, equals zero. For instance, if your
sample mean is 6 and the null value is 6, the difference is zero.
As the difference between the sample mean and the null hypothesis mean increases in
either the positive or negative direction, the strength of the signal increases.
Noise
The denominator is the noise: it measures how accurately your sample estimates the
mean of the population. A larger number indicates that your sample estimate is less
precise because it has more random error.
This random error is the “noise.” When there is more noise, you expect to see larger
differences between the sample mean and the null hypothesis value even when the null
hypothesis is true. We include the noise factor in the denominator because we must
determine whether the signal is large enough to stand out from it.
Signal-to-Noise ratio
Both the signal and noise values are in the units of your data. If your signal is 6 and
the noise is 2, your t-value is 3. This t-value indicates that the difference is 3 times the
size of the standard error. However, if there is a difference of the same size but your
data have more variability (a noise of 6), your t-value is only 1. The signal is at the
same scale as the noise.
In this manner, t-values allow you to see how distinguishable your signal is from the
noise. Relatively large signals and low levels of noise produce larger t-values. If the
signal does not stand out from the noise, it’s likely that the observed difference
between the sample estimate and the null hypothesis value is due to random error in
the sample rather than a true difference at the population level.
4. Algorithm :
Step 1: Create some dummy age data for the population of voters in the entire
country
Step 2: Create a sample of voters in Minnesota and test whether the average age
of voters in Minnesota differs from the population
Step 3: Conduct a t-test at a 95% confidence level and see if it correctly rejects the
null hypothesis that the sample comes from the same distribution as the population.
Step 4: If the t-statistic lies outside the quantiles of the t-distribution corresponding
to our confidence level and degrees of freedom, we reject the null hypothesis.
Step 5: Calculate the chances of seeing a result as extreme as the one being
observed (known as the p-value) by passing the t-statistic in as the quantile to the
stats.t.cdf() function
5. Code:
A one-sample t-test checks whether a sample mean differs from the population mean. Let's
create some dummy age data for the population of voters in the entire country and a
sample of voters in Minnesota, and test whether the average age of voters in Minnesota
differs from the population:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import math
np.random.seed(6)
# The sample-generation code is missing from this handout; the following
# reconstruction is consistent with the printed means below
population_ages1 = stats.poisson.rvs(loc=18, mu=35, size=150000)
population_ages2 = stats.poisson.rvs(loc=18, mu=10, size=100000)
population_ages = np.concatenate((population_ages1, population_ages2))
minnesota_ages1 = stats.poisson.rvs(loc=18, mu=30, size=30)
minnesota_ages2 = stats.poisson.rvs(loc=18, mu=10, size=20)
minnesota_ages = np.concatenate((minnesota_ages1, minnesota_ages2))
print( population_ages.mean() )
print( minnesota_ages.mean() )
43.000112
39.26
Notice that we used a slightly different combination of distributions to generate the sample
data for Minnesota, so we know that the two means are different. Let's conduct a t-test at a
95% confidence level and see if it correctly rejects the null hypothesis that the sample
comes from the same distribution as the population. To conduct a one sample t-test, we
can use the stats.ttest_1samp() function:
stats.ttest_1samp(a = minnesota_ages, # Sample data
popmean = population_ages.mean()) # Pop mean
Ttest_1sampResult(statistic=-2.5742714883655027,
pvalue=0.013118685425061678)
The test result shows the test statistic "t" is equal to -2.574. This test statistic tells us how
much the sample mean deviates from the null hypothesis. If the t-statistic lies outside the
quantiles of the t-distribution corresponding to our confidence level and degrees of
freedom, we reject the null hypothesis. We can check the quantiles with stats.t.ppf():
stats.t.ppf(q=0.025, df=49)  # Lower quantile
-2.0095752344892093
stats.t.ppf(q=0.975, df=49)  # Upper quantile
2.009575234489209
We can calculate the chances of seeing a result as extreme as the one we observed (known
as the p-value) by passing the t-statistic in as the quantile to the stats.t.cdf() function:
stats.t.cdf(x = -2.5742,   # T-test statistic
            df = 49) * 2   # Multiply by two for a two-tailed test
Notice this value is the same as the p-value listed in the original t-test output. A p-value of
0.01311 means we'd expect to see data as extreme as our sample due to chance about 1.3%
of the time if the null hypothesis was true. In this case, the p-value is lower than our
significance level α (equal to 1-conf.level or 0.05) so we should reject the null hypothesis.
If we were to construct a 95% confidence interval for the sample it would not capture
population mean of 43:
sigma = minnesota_ages.std()/math.sqrt(50) # Sample stdev/sample size
stats.t.interval(0.95, # Confidence level
df=49, # Degrees of freedom
loc=minnesota_ages.mean(), # Sample mean
scale=sigma) # Standard deviation estimate
(36.369669080722176, 42.15033091927782)
On the other hand, since there is a 1.3% chance of seeing a result this extreme due to
chance, it is not significant at the 99% confidence level. This means if we were to
construct a 99% confidence interval, it would capture the population mean:
stats.t.interval(0.99, df=49, loc=minnesota_ages.mean(), scale=sigma)
(35.40547994092107, 43.11452005907893)
With a higher confidence level, we construct a wider confidence interval and increase the
chances that it captures the true mean, thus making it less likely that we'll reject the null
hypothesis. In this case, the p-value of 0.013 is greater than our significance level of 0.01,
so we fail to reject the null hypothesis.
6. Output
7. Result
8. Viva Voce
a. What does a t-test measure?
A t-test measures the difference in group means divided by the pooled standard error
of the two group means.
In this way, it calculates a number (the t-value) illustrating the magnitude of the
difference between the two group means being compared, and estimates the likelihood
that this difference exists purely by chance (p-value).
b. When is a paired t-test used?
A paired t-test is used to compare a single population before and after some
experimental intervention or at two different points in time (for example, measuring
student performance on a test before and after being taught the material).
Ex.No: 6
Implementation of T-Test – Two sample t-test and
Paired T-Test
Date :
1. Problem statement :
To perform a two-sample t-test and a paired t-test to determine whether the means of two
populations are equal.
3. Problem analysis:
For the 2-sample t-test, the numerator is again the signal, which is the difference
between the means of the two samples. For example, if the mean of group 1 is 10, and
the mean of group 2 is 4, the difference is 6.
The default null hypothesis for a 2-sample t-test is that the two groups are equal. You
can see in the equation that when the two groups are equal, the difference (and the
entire ratio) also equals zero. As the difference between the two groups grows in either
a positive or negative direction, the signal becomes stronger.
In a 2-sample t-test, the denominator is still the noise, but Minitab can use two
different values. You can either assume that the variability in both groups is equal or
not equal, and Minitab uses the corresponding estimate of the variability. Either way,
the principle remains the same: you are comparing your signal to the noise to see how
much the signal stands out.
Just like with the 1-sample t-test, for any given difference in the numerator, as you
increase the noise value in the denominator, the t-value becomes smaller. To
determine that the groups are different, you need a t-value that is large.
Each type of t-test uses a procedure to boil all of your sample data down to one value,
the t-value. The calculations compare your sample mean(s) to the null hypothesis and
incorporate both the sample size and the variability in the data. A t-value of 0
indicates that the sample results exactly equal the null hypothesis. In statistics, we call
the difference between the sample estimate and the null hypothesis the effect size. As
this difference increases, the absolute value of the t-value increases.
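As a rough sketch of this "boiling down", the one-sample t-value described above can be computed by hand and checked against SciPy. The sample values and hypothesized mean below are invented purely for illustration:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical sample and null-hypothesis mean, for illustration only
sample = np.array([48.0, 52.0, 50.0, 47.0, 53.0, 51.0, 49.0, 50.0])
mu0 = 45.0

# t = (sample mean - hypothesized mean) / (sample std / sqrt(n))
n = sample.size
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# SciPy computes the same statistic
t_scipy, p_value = stats.ttest_1samp(sample, popmean=mu0)
print(t_manual, t_scipy)
```

Both computations yield the same t-value, confirming that the t-statistic is just the signal (effect size) divided by the noise (standard error).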
That’s all nice, but what does a t-value of, say, 2 really mean? From the discussion
above, we know that a t-value of 2 indicates that the observed difference is twice the
size of the variability in your data. However, we use t-tests to evaluate hypotheses
rather than just figuring out the signal-to-noise ratio. We want to determine whether
the effect size is statistically significant.
A paired t-test (also called a correlated pairs t-test, a paired samples t-test or dependent
samples t-test) is where you run a t-test on dependent samples. Dependent samples are
essentially connected: they are tests on the same person or thing.
Choose the paired t-test if you have two measurements on the same item, person or
thing. You should also choose this test if you have two items that are being measured
with a unique condition. For example, you might be measuring car safety performance
in vehicle research and testing and subject the cars to a series of crash tests. Although
the manufacturers are different, you might be subjecting them to the same conditions.
With a "regular" two sample t-test, you're comparing the means of two different
samples. For example, you might test two different groups of customer service
associates on a business-related test, or test students from two universities on their
English skills. If you take a random sample from each group separately and they have
different conditions, your samples are independent and you should run an independent
samples t-test (also called between-samples and unpaired-samples).
4. Algorithm :
Step 1: Create the data
Step 2: Conduct a two sample t-test.
Step 3: Interpret the results
5. Code
Two-Sample T-Test
A two-sample t-test investigates whether the means of two independent data samples differ
from one another. In a two-sample test, the null hypothesis is that the means of both
groups are the same. Unlike the one-sample test where we test against a known population
parameter, the two-sample test only involves sample means. You can conduct a two-
sample t-test with the stats.ttest_ind() function. Let's generate a sample of voter
age data for Wisconsin and test it against the sample we made earlier:
np.random.seed(12)
wisconsin_ages1 = stats.poisson.rvs(loc=18, mu=33, size=30)
wisconsin_ages2 = stats.poisson.rvs(loc=18, mu=13, size=20)
wisconsin_ages = np.concatenate((wisconsin_ages1, wisconsin_ages2))
print( wisconsin_ages.mean() )
42.8
stats.ttest_ind(a= minnesota_ages,
b= wisconsin_ages,
equal_var=False) # Do not assume the samples have equal variance
Ttest_indResult(statistic=-1.7083870793286842, pvalue=0.09073104343957748)
The test yields a p-value of 0.0907, which means there is a 9% chance we'd see sample
data this far apart if the two groups tested are actually identical. If we were using a 95%
confidence level we would fail to reject the null hypothesis, since the p-value is greater
than the corresponding significance level of 5%.
Paired T-Test
The basic two sample t-test is designed for testing differences between independent
groups. In some cases, you might be interested in testing differences between samples of
the same group at different points in time. For instance, a hospital might want to test
whether a weight-loss drug works by checking the weights of the same group patients
before and after treatment. A paired t-test lets you check whether the means of samples
from the same group differ.
We can conduct a paired t-test using the scipy function stats.ttest_rel(). Let's generate
some dummy patient weight data and do a paired t-test:
np.random.seed(11)
before = stats.norm.rvs(scale=30, loc=250, size=100) # Weights before treatment
after = before + stats.norm.rvs(scale=5, loc=-1.25, size=100) # Weights after treatment
weight_df = pd.DataFrame({"weight_before":before,
"weight_after":after,
"weight_change":after-before})
weight_df.describe() # Check a summary of the data
The summary shows that patients lost about 1.23 pounds on average after treatment. Let's
conduct a paired t-test to see whether this difference is significant at a 95% confidence
level:
stats.ttest_rel(a=before, b=after)
The result of a statistical hypothesis test and the corresponding decision of whether to
reject or accept the null hypothesis is not infallible. A test provides evidence for or against
the null hypothesis and then you decide whether to accept or reject it based on that
evidence, but the evidence may lack the strength to arrive at the correct conclusion.
Incorrect conclusions made from hypothesis tests fall in one of two categories: type I error
and type II error.
Type I error describes a situation where you reject the null hypothesis when it is actually
true. This type of error is also known as a "false positive" or "false hit". The type 1 error
rate is equal to the significance level α, so setting a higher confidence level (and therefore
lower alpha) reduces the chances of getting a false positive.
Type II error describes a situation where you fail to reject the null hypothesis when it is
actually false. Type II error is also known as a "false negative" or "miss". The higher your
confidence level, the more likely you are to make a type II error.
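The claim that the type I error rate equals the significance level α can be illustrated with a small simulation; the population parameters, sample size and trial count below are arbitrary choices for the sketch:

```python
import numpy as np
import scipy.stats as stats

# Simulate many experiments where the null hypothesis is TRUE
# (the population mean really is 50) and count false positives.
rng = np.random.default_rng(0)
trials = 2000
false_positives = 0
for _ in range(trials):
    sample = rng.normal(loc=50, scale=10, size=30)
    _, p_value = stats.ttest_1samp(sample, popmean=50)
    if p_value < 0.05:          # reject at the 95% confidence level
        false_positives += 1

type1_rate = false_positives / trials
print(type1_rate)  # close to alpha = 0.05
```

Across many repetitions the rejection rate hovers near 5%, matching α; raising the confidence level (lowering α) would shrink this rate at the cost of more type II errors.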
plt.figure(figsize=(12,10))
plt.fill_between(x=np.arange(-4,-2,0.01),
y1= stats.norm.pdf(np.arange(-4,-2,0.01)) ,
facecolor='red',
alpha=0.35)
plt.fill_between(x=np.arange(-2,2,0.01),
y1= stats.norm.pdf(np.arange(-2,2,0.01)) ,
facecolor='grey',
alpha=0.35)
plt.fill_between(x=np.arange(2,4,0.01),
y1= stats.norm.pdf(np.arange(2,4,0.01)) ,
facecolor='red',
alpha=0.5)
plt.fill_between(x=np.arange(-4,-2,0.01),
y1= stats.norm.pdf(np.arange(-4,-2,0.01),loc=3, scale=2) ,
facecolor='grey',
alpha=0.35)
plt.fill_between(x=np.arange(-2,2,0.01),
y1= stats.norm.pdf(np.arange(-2,2,0.01),loc=3, scale=2) ,
facecolor='blue',
alpha=0.35)
plt.fill_between(x=np.arange(2,10,0.01),
y1= stats.norm.pdf(np.arange(2,10,0.01),loc=3, scale=2),
facecolor='grey',
alpha=0.35)
plt.show()
Conclusion:
In the plot above, the red areas indicate type I errors assuming the alternative hypothesis is
not different from the null for a two-sided test with a 95% confidence level.
The blue area represents type II errors that occur when the alternative hypothesis is
different from the null, as shown by the distribution on the right. Note that the Type II
error rate is the area under the alternative distribution within the quantiles determined by
the null distribution and the confidence level.
6. Output
7. Result
8. Viva Voce
a. Differences between the two-sample t-test and paired t-test
A two-sample t-test is used when the data of the two samples are statistically independent,
while the paired t-test is used when the data is in the form of matched pairs.
There are also some technical differences between them. To use the two-sample t-test, we
need to assume that the data from both samples are normally distributed and they have the
same variances. For paired t-test, we only require that the difference of each pair is
normally distributed. An important parameter in the t-distribution is the degrees of
freedom.
A paired t-test is used when we are interested in the difference between two variables for
the same subject. Often the two variables are separated by time. For example, in the Dixon
and Massey data set we have cholesterol levels in 1952 and cholesterol levels in 1962 for
each subject
Subjects must be independent. Measurements for one subject do not affect measurements
for any other subject. Each of the paired measurements must be obtained from the same
subject. For example, the before-and-after weight for a smoker in the example above must
be from the same person
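A minimal sketch of such a before-and-after comparison with SciPy's stats.ttest_rel(); the per-subject readings below are invented for illustration:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical before/after readings for the same 8 subjects
before = np.array([210, 225, 198, 240, 215, 230, 205, 220], dtype=float)
after = np.array([200, 218, 195, 230, 210, 222, 199, 214], dtype=float)

# The paired t-test operates on the per-subject differences,
# so each pair must come from the same subject
t_stat, p_value = stats.ttest_rel(before, after)
print(t_stat, p_value)
```

Because every subject's reading dropped, the per-subject differences are consistently positive and the test reports a significant decrease.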
Ex.No: 7
Implementation of Variance Analysis ( ANOVA)
Date :
1. Problem statement :
3. Problem analysis:
ANOVA tests if there is a difference in the mean somewhere in the model (testing if
there was an overall effect), but it does not tell us where the difference is (if there is
one). To find where the difference is between the groups, we have to conduct post-hoc
tests.
To perform any tests, we first need to define the null and alternate hypothesis:
Null Hypothesis – There is no significant difference among the groups
Alternate Hypothesis – There is a significant difference among the groups
The result of the ANOVA formula, the F statistic (also called the F-ratio), allows for
the analysis of multiple groups of data to determine the variability between samples
and within samples. The formula for the one-way ANOVA test can be written as:
F = MSB / MSW
where MSB = SSB / (k − 1) is the mean square between groups, MSW = SSW / (N − k) is
the mean square within groups, k is the number of groups and N is the total number of
observations.
When we construct the ANOVA table, all the above components (the sums of squares,
degrees of freedom, mean squares and the F statistic) can be seen in it.
In general, if the p-value associated with the F is smaller than 0.05, then the null
hypothesis is rejected and the alternative hypothesis is supported. If the null
hypothesis is rejected, we can conclude that the means of all the groups are not equal.
3. N-Way ANOVA: A researcher can also use more than two independent variables,
and this is an n-way ANOVA (with n being the number of independent variables you
have). Note that this differs from MANOVA, which involves more than one dependent
variable.
4. Algorithm :
A. Input:
A group of students from different colleges take the same exam. You want to see if one
college outperforms the others, hence your null hypothesis is that the mean marks in
each group are equivalent to those of the other groups. To keep it simple, we will consider
3 groups (college 'A', 'B', 'C') with 6 students each.
A=[25,25,27,30,23,20]
B=[30,30,21,24,26,28]
C=[18,30,29,29,24,26]
Null Hypothesis: The mean marks in each group are equivalent to those of the other groups.
Alternate Hypothesis: There is a significant difference among the groups.
B. Output:
To determine whether the null hypothesis or the alternate hypothesis is acceptable.
1. The mean of each group is calculated.
2. The overall mean of all the values is calculated.
3. For each value, the difference between the value and the mean of its group is calculated.
4. Each of these differences is squared.
5. The squared difference values are added. The result is a value that relates to the total
deviation of rows from the mean of their respective groups. This value is referred to as the
sum of squares within groups, or S2Wthn.
6. For each group, the difference between the total mean and the group mean is squared
and multiplied by the number of values in the group. The results are added. The result is
referred to as the sum of squares between groups or S2Btwn.
7. The two sums of squares are used to obtain a statistic for testing the null hypothesis, the
so-called F-statistic. The F-statistic is calculated as:
F = (S2Btwn / dfBtwn) / (S2Wthn / dfWthn)
where dfBtwn (degrees of freedom between groups) equals the number of groups minus 1,
and dfWthn (degrees of freedom within groups) equals the total number of values minus the
number of groups.
5. Code:
import pandas as pd
import numpy as np
import scipy.stats as stats
a=[25,25,27,30,23,20]
b=[30,30,21,24,26,28]
c=[18,30,29,29,24,26]
list_of_tuples = list(zip(a, b,c))
df = pd.DataFrame(list_of_tuples, columns = ['A', 'B', 'C'])
df
m1=np.mean(a)
m2=np.mean(b)
m3=np.mean(c)
print('Average mark for college A: {}'.format(m1))
print('Average mark for college B: {}'.format(m2))
print('Average mark for college C: {}'.format(m3))
m=(m1+m2+m3)/3
print('Overall mean: {}'.format(m))
SSb=6*((m1-m)**2+(m2-m)**2+(m3-m)**2)
print('Between-groups Sum of Squared Differences: {}'.format(SSb))
MSb=SSb/2
print('Between-groups Mean Square value: {}'.format(MSb))
err_a=list(a-m1)
err_b=list(b-m2)
err_c=list(c-m3)
err=err_a+err_b+err_c
ssw=[]
for i in err:
    ssw.append(i**2)
SSw=np.sum(ssw)
print('Within-group Sum of Squared Differences: {}'.format(SSw))
MSw=SSw/15
print('Within-group Mean Square value: {}'.format(MSw))
F=MSb/MSw
print('F-score: {}'.format(F))
print(stats.f_oneway(a,b,c))
6. Output
7. Result
8. Viva Voce
a. What is ANOVA used for?
You would use ANOVA to help you understand how your different groups respond,
with a null hypothesis for the test that the means of the different groups are equal. If
there is a statistically significant result, then it means that the populations are
unequal (or different).
b. What are the assumptions of a factorial ANOVA?
The factorial ANOVA has several assumptions that need to be fulfilled: (1) interval
data of the dependent variable, (2) normality, (3) homoscedasticity, and (4) no
multicollinearity.
c. Why is ANOVA better than multiple t tests?
Two-way anova would be better than multiple t-tests for two reasons: (a) the within-
cell variation will likely be smaller in the two-way design (since the t-test ignores the
2nd factor and interaction as sources of variation for the DV); and (b) the two-way
design allows for test of interaction of the two factors.
ANOVA compares three or more such groups at once, whereas a single t-test compares
only two. Running many separate t-tests inflates the overall error risk, which a single
ANOVA avoids. For example, samples of class A and class B students given the same
mathematics course may have different means and standard deviations.
There is a thin line of demarcation between the t-test and ANOVA: when the population
means of only two groups are to be compared, the t-test is used, but when the means of
more than two groups are to be compared, ANOVA is preferred.
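For exactly two groups the two tests agree: the one-way ANOVA F statistic equals the square of the equal-variance two-sample t statistic, and the p-values coincide. A quick check, reusing the college marks from this exercise's input:

```python
import scipy.stats as stats

a = [25, 25, 27, 30, 23, 20]   # college A marks (from this exercise)
b = [30, 30, 21, 24, 26, 28]   # college B marks

t_stat, p_t = stats.ttest_ind(a, b)   # equal-variance two-sample t-test
f_stat, p_f = stats.f_oneway(a, b)    # one-way ANOVA on the same two groups

# For exactly two groups, F = t**2 and the p-values coincide
print(f_stat, t_stat**2)
```

ANOVA's advantage only appears with three or more groups, where it tests all means in a single procedure instead of many pairwise t-tests.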
f. Can ANOVA be used for hypothesis testing?
The specific test considered here is called analysis of variance (ANOVA) and is a test
of hypothesis that is appropriate to compare means of a continuous variable in two or
more independent comparison groups. For example, in some clinical trials there are
more than two comparison groups.
g. What distribution does ANOVA use?
F-distribution
The second is one-way analysis of variance (ANOVA), which uses the F-distribution
to test to see if three or more samples come from populations with the same mean
Ex.No: 8
Demonstration of Linear Regression
Date :
1. Problem statement :
3. Problem analysis:
Consider the linear line y = a + bx. Find the slope b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²)
and the intercept a = (Σy − bΣx) / n, substitute the values of a and b into y = a + bx,
and solve the equation. Once you have the values of a and b, you can regress the y value
for any x.
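As a sketch, these closed-form least-squares formulas can be verified on a tiny made-up dataset:

```python
import numpy as np

# Tiny illustrative dataset (made up)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)
n = x.size

# Slope and intercept from the closed-form least-squares formulas
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x * x) - np.sum(x) ** 2)
a = (np.sum(y) - b * np.sum(x)) / n
print(a, b)  # a = 2.2, b = 0.6 for this data
```

Any x can then be regressed as y = a + bx, e.g. y = 2.2 + 0.6 × 6 = 5.8 for x = 6.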
4. Algorithm
5. Code
import numpy as np
import matplotlib.pyplot as plt
def estimate_coef(x, y):
    # number of observations
    n = np.size(x)
    # means of the x and y vectors
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)
def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color = "g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()
# observations
x = np.array([25, 23, 25, 31, 32, 25, 36, 27, 28, 29])
y = np.array([3.2, 3, 3.5, 3, 3.6, 3.7, 3.3, 3.6, 3.2, 3.1])
# estimating coefficients
b = estimate_coef(x, y)
print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
# plotting regression line
plot_regression_line(x, y, b)
6. Output
7. Result
8. Viva Voce
a. What are the limitations to linear regression?
The Disadvantages of Linear Regression:
Linear regression only looks at the mean of the dependent variable, modelling the
relationship between the mean of the dependent variable and the independent variables.
Linear regression is sensitive to outliers.
The data must be independent.
For example, if parents were very tall, the children tended to be tall but shorter than
their parents. If parents were very short, the children tended to be short but taller than
their parents were. This discovery he called "regression to the mean," with the word
"regression" meaning a coming back toward the average.
Linear regression is commonly used for predictive analysis and modeling. For
example, it can be used to quantify the relative impacts of age, gender, and diet (the
predictor variables) on height (the outcome variable)
Statisticians say that a regression model fits the data well if the differences between the
observations and the predicted values are small and unbiased. Unbiased in this context
means that the fitted values are not systematically too high or too low anywhere in the
observation space.
Ex.No: 9
Demonstration of Logistic Regression
Date :
1. Problem statement :
To train the students to understand the basics of logistic regression using Python.
3. Problem analysis:
Let’s say that your goal is to build a logistic regression model in Python in order to
determine whether candidates would get admitted to a prestigious university.
The odds are defined as:
θ = p / (1 − p)
The values of the odds range from zero to ∞, while the values of probability lie between
zero and one. Consider the linear equation:
y = β0 + β1·x
Here, 𝛽0 is the y-intercept
𝛽1 is the slope of the line
x is the value of the x coordinate
y is the value of the prediction
Now, to predict the odds of success, we use the following formula:
Let Y = e^(β0 + β1·x)
Then p(x) / (1 − p(x)) = Y
p(x) = Y(1 − p(x))
p(x) = Y − Y·p(x)
p(x) + Y·p(x) = Y
p(x)(1 + Y) = Y
p(x) = Y / (1 + Y)
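The derivation can be checked numerically: p(x) = Y / (1 + Y) with Y = e^(β0 + β1·x) is exactly the standard sigmoid function. The values of β0, β1 and x below are arbitrary illustrative choices:

```python
import numpy as np

beta0, beta1 = -1.0, 0.5   # arbitrary coefficients for illustration
x = 2.0

Y = np.exp(beta0 + beta1 * x)
p = Y / (1 + Y)                                      # formula derived above
p_sigmoid = 1 / (1 + np.exp(-(beta0 + beta1 * x)))   # standard sigmoid form
print(p, p_sigmoid)
```

Both expressions give the same probability; here β0 + β1·x = 0, so p = 0.5, the midpoint of the sigmoid.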
The sigmoid curve obtained from the above equation is as follows:
4. Algorithm:
A. Input: GMAT score, GPA and Years of work experience directly given in the program
as input.
5. Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
candidates = {'gmat': [780,750,690,710,680,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,660,640,620,660,660,680,650,670,580,590,690],
'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
'admitted': [1,1,0,1,0,1,0,1,1,0,0,1,1,0,1,0,0,1,0,0,1,0,0,0,0,1,1,0,1,1,0,0,1,1,1,0,0,0,0,1]
}
df = pd.DataFrame(candidates,columns= ['gmat', 'gpa','work_experience','admitted'])
print (df)
X = df[['gmat', 'gpa','work_experience']]
y = df['admitted']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
print (X_train)
print (y_train)
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'],
colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))
print (X_test) #test dataset
print (y_pred) #predicted values
print('confusion_matrix:', confusion_matrix, sep='\n', end='\n\n')
plt.show()
6. Output
7. Result
8. Viva Voce
a. What is logistic regression analysis used for?
Logistic regression estimates the probability of an event occurring using a logistic
regression equation. This type of analysis can help you predict the likelihood of an
event happening or a choice being made.
h. What is a good sample size for logistic regression?
In conclusion, for observational studies that involve logistic regression in the analysis,
this study recommends a minimum sample size of 500 to derive statistics that can
represent the parameters in the targeted population
Ex.No: 10
Demonstration of Multiple-Linear Regression
Date :
1. Problem statement :
To train the students to understand Multiple Linear Regression using a Python program.
3. Problem analysis:
Multiple linear regression attempts to model the relationship between two or more features
and a response by fitting a linear equation to the observed data.
Consider a dataset with p features (or independent variables) and one response (or
dependent variable).
We define:
X (feature matrix) = a matrix of size n X p where x_{ij} denotes the values of jth feature
for ith observation.
and
y (response vector) = a vector of size n where y_{i} denotes the value of response for ith
observation.
The regression line for these p features is represented as:
h(x_i) = b_0 + b_1·x_{i1} + b_2·x_{i2} + … + b_p·x_{ip}
where h(x_i) is the predicted response value for the ith observation and b_0, b_1, …, b_p
are the regression coefficients.
The observed response can then be written as y_i = h(x_i) + e_i, where e_i represents the
residual error in the ith observation. We can generalize our linear model a little bit more
by representing the feature matrix X with a leading column of ones to absorb the
intercept b_0:
So now, the linear model can be expressed in terms of matrices as:
y = Xb + e
where b is the vector of regression coefficients and e is the vector of residual errors.
As already explained, the least squares method determines b' for which the total residual
error is minimized. The resulting estimate is:
b' = (X'X)^(-1) X'y
where ' represents the transpose of the matrix while ^(-1) represents the matrix inverse.
Knowing the least squares estimates b', the multiple linear regression model can now be
estimated as:
y' = Xb'
where y' is the estimated response vector.
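The least squares estimate b' = (X'X)^(-1) X'y can be sketched directly with NumPy. The design matrix and true coefficients below are invented, and the response is noiseless so the estimate recovers the coefficients exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Design matrix: an intercept column of ones plus two random features
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
true_b = np.array([1.0, 2.0, -0.5])
y = X @ true_b            # noiseless response for a clean check

# Least squares estimate: b' = (X'X)^(-1) X'y
b_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(b_hat)  # recovers [1.0, 2.0, -0.5]
```

With noisy data the recovered coefficients would only approximate the true ones; in practice np.linalg.lstsq or scikit-learn's LinearRegression is preferred over an explicit matrix inverse for numerical stability.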
4. Algorithm:
5. Code:
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
# load a sample regression dataset (the scikit-learn diabetes data is assumed here)
X, y = datasets.load_diabetes(return_X_y=True)
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
# create the linear regression object
reg = linear_model.LinearRegression()
# train the model using the training sets
reg.fit(X_train, y_train)
# regression coefficients
print('Coefficients: ', reg.coef_)
# plot the residual errors of the predictions on the test data
plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test, s=10, label='Test data')
# plotting legend
plt.legend(loc = 'upper right')
# plot title
plt.title("Residual errors")
plt.show()
6. Output
7. Result
8. Viva Voce
Linear regression attempts to draw a line that comes closest to the data by finding
the slope and intercept that define the line and minimize regression errors. If two
or more explanatory variables have a linear relationship with the dependent
variable, the regression is called a multiple linear regression
You can use multiple linear regression when you want to know how strong the
relationship is between two or more independent variables and one dependent variable
(e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
It is also widely used for predicting the value of one dependent variable from the
values of two or more independent variables. When there are two or more independent
variables, it is called multiple regression.
Ex.No: 11
Implementation of Time Series Analysis
Date :
1. Problem statement :
Dataset: https://www.kaggle.com/chirag19/air-passengers
To train the students to understand the various aspects of the inherent nature of a time
series so that they are better informed to create meaningful and accurate forecasts.
3. Problem analysis:
Time series analysis is the preparatory step before you develop a forecast of the series.
Besides, time series forecasting has enormous commercial significance because quantities
that matter to a business, like demand and sales, the number of visitors to a website, and
stock prices, are essentially time series data.
Time series analysis involves understanding various aspects about the inherent nature of
the series so that you are better informed to create meaningful and accurate forecasts.
Across industries, organizations commonly use time series data, which means any
information collected over a regular interval of time, in their operations. Examples include
daily stock prices, energy consumption rates, social media engagement metrics and retail
demand, among others. Analyzing time series data yields insights like trends, seasonal
patterns and forecasts of future events that can help generate profits. For example, by
understanding the seasonal trends in demand for retail products, companies can plan
promotions to maximize sales throughout the year.
4. Algorithm
To start, let’s import the Pandas library and read the airline passenger data into a data
frame:
import pandas as pd
df = pd.read_csv("AirPassengers.csv")
Now, let’s display the first five rows of data using the data frame head() method:
print(df.head())
Let’s take a look at the last five records the data using the tail() method:
print(df.tail())
We see that the data ends in 1960. The next thing we will want to do is convert the month
column into a datetime object. This will allow it to programmatically pull time values like
the year or month for each record. To do this, we use the Pandas to_datetime() method:
df['Month'] = pd.to_datetime(df['Month'])
print(df.head())
Note that this process automatically inserts the first day of each month, which is basically
a dummy value since we have no daily passenger data.
The next thing we can do is convert the month column to an index. This will allow us to
more easily work with some of the packages we will be covering later:
76
df.index = df['Month']
del df['Month']
print(df.head())
Next, let’s generate a time series plot using Seaborn and Matplotlib. This will allow us to
visualize the time series data.
First, let's import Matplotlib and Seaborn:
import matplotlib.pyplot as plt
import seaborn as sns
Now we can generate the plot:
sns.lineplot(df)
plt.ylabel("Number of Passengers")
Analysis
Stationarity is a key part of time series analysis. Simply put, stationarity means that the
manner in which time series data changes is constant. A stationary time series will not
have any trends or seasonal patterns. You should check for stationarity because it not only
makes modeling time series easier, but it is an underlying assumption in many time series
methods.
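A rolling mean makes the idea concrete: for a series with a trend, the rolling mean drifts upward instead of staying flat. The synthetic series below is made up for illustration (the airline data itself shows the same behaviour):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic non-stationary series: an upward trend plus noise
series = pd.Series(np.arange(100, dtype=float) + rng.normal(scale=2.0, size=100))

rolling_mean = series.rolling(12).mean().dropna()

# For a stationary series this drift would be near zero;
# here the rolling mean clearly rises with the trend
drift = rolling_mean.iloc[-1] - rolling_mean.iloc[0]
print(drift)
```

The large drift signals non-stationarity visually; the augmented Dickey-Fuller test used next makes the same judgment formally.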
Let’s test for stationarity in our airline passenger data. To start, let’s calculate a seven
month rolling mean:
rolling_mean = df.rolling(7).mean()
rolling_std = df.rolling(7).std()
Next, let’s overlay our time series with the seven month rolling mean and seven month
rolling standard deviation. First, let’s make a Matplotlib plot of our time series:
plt.plot(df, color="blue", label="Original Passenger Data")
plt.plot(rolling_mean, color="red", label="Rolling Mean Passenger Number")
plt.plot(rolling_std, color="black", label="Rolling Standard Deviation in Passenger Number")
And a legend:
plt.legend(loc="best")
Next, let's import the augmented Dickey-Fuller test from the statsmodels package:
from statsmodels.tsa.stattools import adfuller
Next, let’s pass our data frame into the adfuller method. Here, we specify the autolag
parameter as “AIC”, which means that the lag is chosen to minimize the information
criterion:
adft = adfuller(df, autolag="AIC")
output_df = pd.DataFrame({"Values": [adft[0], adft[1], adft[2], adft[3], adft[4]['1%'], adft[4]['5%'], adft[4]['10%']],
"Metric": ["Test Statistic", "p-value", "No. of lags used", "Number of observations used", "critical value (1%)", "critical value (5%)", "critical value (10%)"]})
print(output_df)
We can see that our data is not stationary from the fact that our p-value is greater than 5
percent and the test statistic is greater than the critical value. We can also draw these
conclusions from inspecting the data, as we see a clear, increasing trend in the number of
passengers.
Forecasting
Time series forecasting allows us to predict future values in a time series given current and
past data. Here, we will use the ARIMA method to forecast the number of passengers.
ARIMA allows us to forecast future values in terms of a linear combination of past values.
We will use the auto_arima package, which will allow us to forgo the time consuming
process of hyperparameter tuning.
First, let’s split our data for training and testing and visualize the split:
df['Date'] = df.index
train = df[df['Date'] < pd.to_datetime("1960-08", format='%Y-%m')]
train['train'] = train['#Passengers']
del train['Date']
del train['#Passengers']
test = df[df['Date'] >= pd.to_datetime("1960-08", format='%Y-%m')]
del test['Date']
test['test'] = test['#Passengers']
del test['#Passengers']
plt.plot(train, color="black")
plt.plot(test, color="red")
plt.title("Train/Test split for Passenger Data")
plt.ylabel("Passenger Number")
plt.xlabel('Year-Month')
sns.set()
plt.show()
Let's import auto_arima from the pmdarima package, train our model and generate
predictions:
from pmdarima.arima import auto_arima
model = auto_arima(train, trace=True, error_action='ignore', suppress_warnings=True)
model.fit(train)
forecast = model.predict(n_periods=len(test))
forecast = pd.DataFrame(forecast, index=test.index, columns=['Prediction'])
6. Output
7. Result
8. Viva Voce
a. How do you analyze time series?
Step 1: Visualize the time series. It is essential to analyze the trends prior to
building any kind of time series model.
Step 2: Stationarize the series.
Step 3: Find the optimal parameters.
Step 4: Build the ARIMA model.
Step 5: Make predictions.