
LABORATORY MANUAL

For

AD8412 : DATA ANALYTICS LABORATORY


Of

B.TECH
(ARTIFICIAL INTELLIGENCE AND DATA
SCIENCE)
(Anna University Regulation 2017)
For the Batch (2020 to 2024)
Semester: IV
Academic Year: 2021-2022

KCG COLLEGE OF TECHNOLOGY, CHENNAI – 600 097

LABORATORY MANUAL

Reg No ………………………………………………………

Course Code :

Name of the Course :

Programme :

It is certified that this is the bonafide record of the work carried out by

_____________________________of____________________________class

during the year 2021 – 2022.

Faculty In-Charge:_______________________ HoD:___________________________

LABORATORY RECORD

Reg No ………………………………………………………

Course Code :

Name of the Course :

Programme :

It is certified that this is the bonafide record of the work carried out by

_____________________________of____________________________class

during the year 2021 – 2022.

Int.Examiner:____________________________ Ext. Examiner: ___________________

Date of the Examination: __________________

AD8412 - DATA ANALYTICS LABORATORY

COURSE OUTCOMES

After the completion of this course, students will be able to:
CO1: Become skilled in using various packages in Python
CO2: Demonstrate an understanding of data distribution with various samples
CO3: Implement T-Test, ANOVA and Z-Test on sample data sets
CO4: Apply mathematical models to real-world problems
CO5: Conduct time series analysis and draw conclusions

LIST OF EXERCISES
1. Random Sampling
2. Z-test case study
3. T-test case studies
4. ANOVA case studies
5. Regression
6. Logistic Regression
7. Time series Analysis

AD8412 - DATA ANALYTICS LABORATORY
INDEX
Expt. No.   Name of the Experiment                                               Date   Page No.   Signature

1    Demonstration of random sampling
2    Demonstration of probability sampling from a known population
3    Implementation of Z-Test – One Sample Z-Test and Two Sample Z-Test
4    Implementation of Z-Test – using Titanic case study
5    Implementation of T-Test – One sample t-test
6    Implementation of T-Test – Two sample t-test and Paired T-Test
7    Implementation of Variance Analysis (ANOVA)
8    Demonstration of Linear Regression
9    Demonstration of Logistic Regression
10   Demonstration of Multiple Linear Regression
11   Implementation of Time Series Analysis

Ex.No: 1 Demonstration of random sampling using Python

Date :

1. Problem statement :

To implement random sampling using various packages available in Python


2. Expected Learning outcomes :

It helps the students to understand how to randomly sample items from lists as well as how
to generate pseudorandom numbers in Python.
3. Problem analysis:
Python provides many useful tools for random sampling as well as functions for
generating random numbers. Random sampling has applications in statistics, where a
random subset of a population is often observed and used to make inferences about the
overall population. Further, random number generation has many applications in the
sciences. For example, in chemistry and physics, Monte Carlo simulations require random
number generation.

4. Algorithm:

1. Create a list
2. Use the ‘random.choice()’ method to randomly select individual values from this list
3. Use ‘random.sample()’ method for randomly sampling N items from a list
4. In addition to random selection and sampling, the random module has a function for
shuffling items in a list. Perform random shuffling of items in a list using ‘random.shuffle()’
5. The random module has a function for generating a random integer within a provided
range of values. Generate random integers using ‘random.randint()’
6. The random module also has a function for generating a random floating point value
between 0 and 1. Generate random floating point values using ‘random.random()’
7. Scale the random float numbers. If we want random numbers between 0 and 500 we just
multiply our random number by 500
8. If we want to add a lower bound as well, we can add a conditional statement before
appending
9. The random module has a function for computing uniformly distributed numbers.
Compute uniformly distributed numbers with ‘random.uniform()’
10. The random module has a function for computing normally distributed numbers.
Compute normally distributed numbers with ‘random.gauss()’

5. Code:

Picking Random Items in a List using ‘random.choice()’

Consider a list of BMI values for people living in a rural area:

bmi_list = [29, 18, 20, 22, 19, 25, 30, 28,22, 21, 18, 19, 20, 20, 22, 23]

Use the ‘random.choice()’ method to randomly select individual BMI values from this list:

import random
print("First random choice:", random.choice(bmi_list))
print("Second random choice:", random.choice(bmi_list))
print("Third random choice:", random.choice(bmi_list))

Run the code multiple times and check what output you are getting.

Picking Random Items in a List using ‘random.sample()’

The ‘random.sample()’ method is useful for randomly sampling N items from a list.
Sample N=5 items from our BMI list:

print("Random sample, N = 5 :", random.sample(bmi_list, 5))

Now try sampling 10 items:

print("Random sample, N = 10:", random.sample(bmi_list, 10))

Randomly Shuffling Items in a List using ‘random.shuffle()’

Print our BMI list and then print the result of shuffling our BMI list

print("BMI list: ", bmi_list)
random.shuffle(bmi_list)
print("Shuffled BMI list: ", bmi_list)

Generating Random Integers using ‘random.randint()’

The random module has a function for generating a random integer provided a range of
values. Let’s generate a random integer in the range from 1 to 5

print("Random Integer: ", random.randint(1,5))

Generate a list of random integers using a for-loop

random_ints_list = []
for i in range(1,50):
    n = random.randint(1,5)
    random_ints_list.append(n)
print("My random integer list: ", random_ints_list)

Generating Random Floating Point Values

Generate a random floating point value between 0 and 1

print("Random Float: ", random.random())

generate a list of random floats between 0 and 1

random_float_list = []
for i in range(1,5):
    n = random.random()
    random_float_list.append(n)
print("My random float list: ", random_float_list)

Scale the random float numbers by multiplying our random number by 500

random_float_list = []
for i in range(1,5):
    n = random.random()*500
    random_float_list.append(n)
print("My random float list: ", random_float_list)

To add a lower bound as well, add a conditional statement before appending, and generate
random numbers between 100 and 500:

random_float_list = []
for i in range(1,10):
    n = random.random()*500
    if n >= 100.0:
        random_float_list.append(n)
print("My random float list: ", random_float_list)

Computing Uniformly Distributed Numbers with ‘random.uniform()’

Generate 50 uniformly distributed numbers between -10 and 1

import numpy as np
uniform_list = np.random.uniform(-10,1,50)
print("Uniformly Distributed Numbers: ", uniform_list)

Computing Normally Distributed Numbers with ‘random.gauss()’

Generate 50 normally distributed numbers centred in the range -50 to 0, for example with
mean -25 and standard deviation 8:

normal_list = np.random.normal(loc=-25, scale=8, size=50)
print("Normally Distributed Numbers: ", normal_list)

6. Result

7. Viva voce

a. Define random sampling


Random sampling, or probability sampling, is a sampling method that allows for the
randomization of sample selection, i.e., each sample has the same probability as other
samples to be selected to serve as a representation of an entire population

b. What are the types of Random sampling? Explain each of the sampling methods

1. Simple random sampling


Simple random sampling is the randomized selection of a small segment of individuals or
members from a whole population. It provides each individual or member of a population
with an equal and fair probability of being chosen. The simple random sampling method is
one of the most convenient and simple sample selection techniques.

2. Systematic sampling
Systematic sampling is the selection of specific individuals or members from an entire
population. The selection often follows a predetermined interval (k). The systematic
sampling method is comparable to the simple random sampling method; however, it is less
complicated to conduct.

3. Stratified sampling
Stratified sampling includes the partitioning of a population into subclasses with
notable distinctions and variances. The stratified sampling method is useful, as it allows
the researcher to make more reliable and informed conclusions by confirming that each
respective subclass has been adequately represented in the selected sample.

4. Cluster sampling
Cluster sampling, similar to the stratified sampling method, includes dividing a
population into subclasses. Each of the subclasses should portray comparable
characteristics to the entire selected sample. This method entails the random selection of a
whole subclass, as opposed to the sampling of members from each subclass. This method
is ideal for studies that involve widely spread populations.

c. Give a real time example for random sampling
A company currently employs 850 individuals. The company wishes to conduct a survey
to determine employee satisfaction based on a few identified variables. The research team
decides to have the sample set at 85 employees. The 85 employees will be part of the
survey and will be used as a representation for the total population of 850 employees.

In such a scenario, the sample is the 85 employees, and the population is the entire
workforce consisting of 850 individuals. Based on the sample size, any employee from the
workforce can be selected for the survey. It goes to say that each employee has an
equivalent probability of being randomly selected for the survey.

It is important to keep in mind that samples do not always produce an accurate


representation of a population in its entirety; hence, any variations are referred to as
sampling errors. A sampling error can be defined as the difference between the respective
statistics (sample values) and parameters (population values). The sampling error is
inevitable when sample data is being used.

d. Compare Probability (Random) Sampling vs. Non-Probability Sampling

Probability – or random sampling – is the random selection of sample participants to


derive conclusions and assumptions about an entire population. On the other hand, non-
probability sampling is the selection of sample participants based on specified criteria or
suitability

Ex.No: 2 Demonstration of probability sampling using Python

Date :

1. Problem statement :

To understand and implement probability sampling using various packages available in Python
2. Expected Learning outcomes:
It helps the students to obtain information and draw conclusions about a
population based on the statistics of sampled units (i.e. the sample), without
having to study the entire population
3. Problem analysis:

Probability sampling is used in cases when every unit from a given population has
the same probability of being selected. This technique includes simple random
sampling, systematic sampling, cluster sampling and stratified random sampling

What is Sampling?

Sampling is the process of selecting a random number of units from a known


population. It allows obtaining information and drawing conclusions about a
population based on the statistics of such units (i.e. the sample), without the need
of having to study the entire population.

Why is Sampling Used?

Sampling is performed for multiple reasons, including:

 Cases where it is impossible to study the entire population due to its size
 Cases where the sampling process involves destructive testing of samples
 Cases where there are time and cost constraints

Sampling Techniques

There are two types of sampling techniques:

Probability sampling: cases when every unit from a given population has the same
probability of being selected. This technique includes simple random sampling,
systematic sampling, cluster sampling and stratified random sampling.

Non-probability sampling: cases when units from a given population do not have
the same probability of being selected. This technique includes convenience
sampling, quota sampling, judgement sampling and snowball sampling. In
comparison with probability sampling, this technique is more prone to end up with
a non-representative sample group, leading to wrong conclusions about the
population.

4. Algorithm:
1. Create a sample from a set of 10 products using probability sampling to determine the
population mean of a particular measure of interest.
2. Implement Simple Random Sampling
The simple random sampling method selects random samples from a process or
population where every unit has the same probability of getting selected
3. Implement Systematic Sampling
The systematic sampling method selects units based on a fixed sampling
interval
4. Implement Cluster Sampling
The cluster sampling method divides the population in clusters of equal size n
and selects clusters every Tth time
5. Implement Stratified Random Sampling
The stratified random sampling method divides the population in subgroups
and selects random samples where every unit has the same probability of
getting selected

5. Code

1. Create Sample

# Import required libraries

import numpy as np
import pandas as pd

# Set random seed


np.random.seed(42)

# Define total number of products


number_of_products = 10

# Create data dictionary


data = {'product_id':np.arange(1, number_of_products+1).tolist(),
'measure':np.round(np.random.normal(loc=10, scale=0.5,
size=number_of_products),3)}

# Transform dictionary into a data frame


df = pd.DataFrame(data)

# Store the real mean in a separate variable


real_mean = round(df['measure'].mean(),3)

# View data frame


df

2. Implement Simple Random Sampling

# Obtain simple random sample


simple_random_sample = df.sample(n=4).sort_values(by='product_id')

# Save the sample mean in a separate variable


simple_random_mean = round(simple_random_sample['measure'].mean(),3)

# View sampled data frame


simple_random_sample

3. Implement Systematic Sampling

# Define systematic sampling function


def systematic_sampling(df, step):
    indexes = np.arange(0, len(df), step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample

# Obtain a systematic sample and save it in a new variable
systematic_sample = systematic_sampling(df, 3)

# Save the sample mean in a separate variable


systematic_mean = round(systematic_sample['measure'].mean(),3)

# View sampled data frame


systematic_sample

4. Implement Cluster Sampling

def cluster_sampling(df, number_of_clusters):
    try:
        # Divide the units into clusters of equal size
        df['cluster_id'] = np.repeat(
            [range(1, number_of_clusters + 1)], len(df) // number_of_clusters)

        # Create an empty list
        indexes = []

        # Append the indexes from the clusters that meet the criteria
        # For this formula, cluster ids must be an even number
        for i in range(0, len(df)):
            if df['cluster_id'].iloc[i] % 2 == 0:
                indexes.append(i)
        cluster_sample = df.iloc[indexes]
        return cluster_sample

    except Exception:
        print("The population cannot be divided into clusters of equal size!")

# Obtain a cluster sample and save it in a new variable


cluster_sample = cluster_sampling(df,5)

# Save the sample mean in a separate variable


cluster_mean = round(cluster_sample['measure'].mean(),3)

# View sampled data frame


cluster_sample

5. Implement Stratified Random Sampling

# Create data dictionary
data = {'product_id':np.arange(1, number_of_products+1).tolist(),
'product_strata':np.repeat([1,2], number_of_products/2).tolist(),
'measure':np.round(np.random.normal(loc=10, scale=0.5,
size=number_of_products),3)}

# Transform dictionary into a data frame


df = pd.DataFrame(data)

# View data frame


df

# Import StratifiedShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit

# Set the split criteria


split = StratifiedShuffleSplit(n_splits=1, test_size=4)

# Perform data frame split


for x, y in split.split(df, df['product_strata']):
    stratified_random_sample = df.iloc[y].sort_values(by='product_id')

# View sampled data frame


stratified_random_sample

# Obtain the sample mean for each group


stratified_random_sample.groupby('product_strata').mean().drop(['product_id'], axis=1)

6. Output

7. Result

8. Viva voce

A. What are the steps involved in probability sampling?


a. Choose your population of interest carefully: Carefully think about and choose from
the population the people whose opinions you believe should be collected, and then
include them in the sample.

b. Determine a suitable sample frame: Your frame should consist of a sample from
your population of interest and no one from outside to collect accurate data.

c. Select your sample and start your survey: It can sometimes be challenging to find
the right sample and determine a suitable sample frame. Even if all factors are in
your favor, there still might be unforeseen issues like cost factor, quality of
respondents, and quickness to respond. Getting a sample to respond to a
probability survey accurately might be difficult but not impossible

B. When to use probability sampling?

1. When you want to reduce the sampling bias: This sampling method is used when
the bias has to be minimum. How researchers select their sample largely determines
the quality of the research's findings. Probability sampling leads to higher quality
findings because it provides an unbiased representation of the population.

2. When the population is usually diverse: Researchers use this method extensively
as it helps them create samples that fully represent the population. Say we want to

find out how many people prefer medical tourism over getting treated in their own
country. This sampling method will help pick samples from various socio-
economic strata, background, etc. to represent the broader population.

3. To create an accurate sample: Probability sampling help researchers create


accurate samples of their population. Researchers use proven statistical methods to
draw a precise sample size and obtain well-defined data.

C. What are the Advantages of probability sampling


1. It’s Cost-effective: This process is both cost and time effective, and a larger
sample can also be chosen based on numbers assigned to the samples and then
choosing random numbers from the more significant sample.

2. It’s simple and straightforward: Probability sampling is an easy way of sampling


as it does not involve a complicated process. It’s quick and saves time. The time
saved can thus be used to analyze the data and draw conclusions.

3. It is non-technical: This method of sampling doesn’t require any technical


knowledge because of its simplicity. It doesn’t require intricate expertise and is not
at all lengthy.

D. List few advantages of simple random sampling

 It is a fair method of sampling, and if applied appropriately, it helps to reduce any
bias involved compared to other sampling methods.
 Since it involves a large sample frame, it is usually easy to pick a smaller sample
size from the existing larger population.
 The person conducting the research doesn’t need to have prior knowledge of the
data he/she is collecting; to ask the questions needed to gather it, the researcher need
not be a subject expert.
 This sampling method is a fundamental method of collecting the data. You don’t
need any technical knowledge. You only require essential listening and recording
skills.
 Since the population size is vast in this type of sampling method, there is no
restriction on the sample size that the researcher needs to create. From a larger
population, you can get a small sample quite quickly.
 The data collected through this sampling method is well informed; the more
samples, the better the quality of the data.

E. What is the difference between probability sampling and non-probability sampling?

F. What are the Applications of cluster sampling

This sampling technique is used in area or geographical cluster sampling for market
research. A broad geographic area can be expensive to survey in comparison to surveys
that are sent to clusters divided based on region. The sample numbers have to be
increased to achieve accurate results, but the cost savings involved make this clustering
process attainable.

Cluster sampling in statistics

The technique is widely used in statistics where the researcher can’t collect data from the
entire population as a whole. It is the most economical and practical solution for
statisticians doing research. Take the example of a researcher who is looking to
understand the smartphone usage in Germany. In this case, the cities of Germany will
form clusters. This sampling method is also used in situations like wars and natural
calamities to draw inferences of a population, where collecting data from every
individual residing in the population is impossible.

G. When to use Stratified Random Sampling?

 Stratified random sampling is an extremely productive method of sampling in


situations where the researcher intends to focus only on specific strata from the
available population data. This way, the desired characteristics of the strata can be
found in the survey sample.
 Researchers rely on this sampling method in cases where they intend to establish a
relationship between two or more different strata. If this comparison is conducted
using simple random sampling, there is a higher likelihood of the target groups being
not equally represented.
 Samples from a population which is difficult to access or contact can easily be
involved in the research process using the stratified random sampling technique.
 The accuracy of statistical results is higher than simple random sampling since the
elements of the sample are chosen from relevant strata. The diversification within
the strata will be much less than the diversification which exists in the target
population. Due to the accuracy involved, it is highly probable that the required
sample size will be much smaller, and that will help researchers in saving time and
effort.

Ex.No: 3
Implementation of Z-Test – One Sample Z-Test and Two Sample Z-Test
Date :

1. Problem statement :

To Perform One Sample & Two Sample Z-Tests in Python


2. Expected Learning outcomes:
It helps the students to analyze the dataset using Z-test for the proportions. In other
words this is a statistical test that helps to evaluate our beliefs about certain
proportions in the population based on the sample at hand.
3. Problem analysis:

Z-test is a kind of Hypothesis test based on Standard Normal Distribution. It is also


known as Standard Normal Z Test. Using this test, we calculate the Z-score or Z-
statistic value. Z-test is used for testing the following:

1. Mean of a single population (μ)

2. Difference between means of two populations (μ1 – μ2)

3. Proportion of a single population (P)

4. Difference between proportions of two populations (P1 – P2)

The Z-test has several assumptions which need to be fulfilled before using it. The
assumptions are as follows:

1. The sample size should be more than 30.

2. Sample data should be selected at random from the Population.

3. The Samples should be drawn from Normal Population Data.

4. The Population Variance should be known beforehand.

5. The Samples (or Populations) should be independent of each other.

The Z-statistic value can be calculated using the formula:

z = (x̄ − μ) / (σ / √n)

Where,

x̄ is the Sample Mean,

μ is Population Mean,

σ is Sample Standard Deviation

n is the sample size.

Note: This formula is for one sample Z-test.
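
As a quick illustration of the formula above, the Z-statistic can be computed directly with NumPy. This is a minimal sketch using made-up numbers; the sample values, μ and σ below are illustrative assumptions, not part of the exercises that follow.

import numpy as np

# Illustrative sample of n = 36 observations (hypothetical values, not exercise data)
sample = np.array([102, 98, 104, 110, 95, 101] * 6, dtype=float)

mu = 100      # population mean under H0 (assumed for illustration)
sigma = 15    # known standard deviation (assumed for illustration)
n = len(sample)

# Z = (sample mean - population mean) / (sigma / sqrt(n))
z = (sample.mean() - mu) / (sigma / np.sqrt(n))
print("Z-statistic:", round(z, 3))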

After calculating the Z-score, the conclusion may change based on the tails of the
test. Generally, there are three types of tailed tests: left-tailed test, right-tailed
test, and two-tailed test. The thumb rule is: if Ha contains a < or > sign against the
given value, it is a one-tailed test (left- or right-tailed, respectively); if Ha
contains a ≠ sign, it is a two-tailed test.

Thus, after calculating the Z-statistic value, if our test is left-tailed, our conclusion
can be determined by the rule: if the calculated Z-statistic value is less than the
(negative) critical Z value, reject the null hypothesis H0, else we fail to reject H0. If
our test is right-tailed: if the calculated Z-statistic value is greater than the critical Z
value, reject H0, else we fail to reject H0. If our test is two-tailed: if the calculated
Z-statistic value is less than the negative critical Z value or greater than the positive
critical Z value, reject H0, else we fail to reject H0.
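
The decision rules above can also be expressed with p-values computed from the standard normal distribution. A minimal sketch, assuming a Z-statistic has already been calculated and α = 0.05 (the z value shown is an arbitrary illustration):

from scipy import stats

z = 1.75        # example Z-statistic (illustrative value)
alpha = 0.05    # level of significance

p_left = stats.norm.cdf(z)                   # left-tailed:  P(Z <= z)
p_right = 1 - stats.norm.cdf(z)              # right-tailed: P(Z >= z)
p_two = 2 * (1 - stats.norm.cdf(abs(z)))     # two-tailed

for name, p in [("left", p_left), ("right", p_right), ("two", p_two)]:
    decision = "Reject H0" if p <= alpha else "Fail to reject H0"
    print(f"{name}-tailed p-value = {p:.4f} -> {decision}")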

When to Use Z-test:

 The sample size should be greater than 30. Otherwise, we should use the t-
test.
 Samples should be drawn at random from the population.
 The standard deviation of the population should be known.
 Samples that are drawn from the population should be independent of each
other.
 The data should be normally distributed, however for large sample size, it is
assumed to have a normal distribution.

Type of Z-test

Left-tailed Test: In this test, our region of rejection is located to the extreme left of
the distribution. Here our null hypothesis is that the population mean is greater than or
equal to the claimed value, and the alternative is that it is less.

Right-tailed Test: In this test, our region of rejection is located to the extreme right
of the distribution. Here our null hypothesis is that the population mean is less than or
equal to the claimed value, and the alternative is that it is greater.

You can use the ztest() function from the statsmodels package to perform one
sample and two sample z-tests in Python.

This function uses the following basic syntax:

statsmodels.stats.weightstats.ztest(x1, x2=None, value=0)

where:

x1: values for the first sample


x2: values for the second sample (if performing a two sample z-test)
value: mean under the null (in one sample case) or mean difference (in two sample
case)

4. Algorithm:
Step 1: Evaluate the data distribution.
Step 2: Formulate Hypothesis statement symbolically
Step 3: Define the level of significance (alpha)
Step 4: Calculate Z test statistic or Z score.
Step 5: Derive P-value for the Z score calculated.
Step 6: Make decision:
Step 6.1: P-Value <= alpha, then we reject H0.
Step 6.2: If P-Value > alpha, Fail to reject H0

5. Code:

One Sample Z-Test in Python


Suppose the IQ in a certain population is normally distributed with a mean of μ = 100 and
standard deviation of σ = 15.

A researcher wants to know if a new drug affects IQ levels, so he recruits 20 patients to try
it and records their IQ levels.

The following code shows how to perform a one sample z-test in Python to determine if
the new drug causes a significant difference in IQ levels:

from statsmodels.stats.weightstats import ztest as ztest

#enter IQ levels for 20 patients


data = [88, 92, 94, 94, 96, 97, 97, 97, 99, 99,
105, 109, 109, 109, 110, 112, 112, 113, 114, 115]

#perform one sample z-test


ztest(data, value=100)

Conclusion:

The test statistic for the one sample z-test is 1.5976 and the corresponding p-value is
0.1101.

Since this p-value is not less than .05, we do not have sufficient evidence to reject the
null hypothesis. In other words, the new drug does not significantly affect IQ level.

Two Sample Z-Test in Python


Suppose the IQ levels among individuals in two different cities are known to be normally
distributed with known standard deviations.

A researcher wants to know if the mean IQ level between individuals in city A and city B
are different, so she selects a simple random sample of 20 individuals from each city and
records their IQ levels.

from statsmodels.stats.weightstats import ztest as ztest


#enter IQ levels for 20 individuals from each city
cityA = [82, 84, 85, 89, 91, 91, 92, 94, 99, 99,
105, 109, 109, 109, 110, 112, 112, 113, 114, 114]

cityB = [90, 91, 91, 91, 95, 95, 99, 99, 108, 109,
109, 114, 115, 116, 117, 117, 128, 129, 130, 133]

#perform two sample z-test


ztest(cityA, cityB, value=0)

Conclusion:
The test statistic for the two sample z-test is -1.9953 and the corresponding p-value is
0.0460.

Since this p-value is less than .05, we have sufficient evidence to reject the null
hypothesis. In other words, the mean IQ level is significantly different between the two
cities.

6. Output

7. Result
8. Viva voce
a. What is the difference between one-sample and two sample z-test?
The two-sample z-test tests the difference between the means of two groups, whereas
a one-sample z-test tests the difference between a single group's mean and the
hypothesized population value.
b. What is a 2 sample z-test?

Two-Sample Z-Test. The Two-Sample Z-test is used to compare the means of two
samples to see if it is feasible that they come from the same population. The null
hypothesis is: the population means are equal.

c. Is a one-sample z test reported differently for one-tailed and two-tailed tests?

No, the same values are reported.

d. Can the z-score be negative?

Z-scores may be positive or negative, with a positive value indicating the score is
above the mean and a negative score indicating it is below the mean

e. How do you change a negative z-score to a positive?


In short, subtract the values in the standard normal (Z) table from 1. As z-scores move
from negative to positive they move from left to right on the bell curve.

Ex.No: 4
Implementation of Z-Test – using Titanic case study
Date :

1. Problem statement :

To Perform Z-test on Titanic case study

2. Expected Learning outcomes:


It helps the students to analyze the dataset using Z-test for the proportions. In other
words this is a statistical test that helps to evaluate our beliefs about certain
proportions in the population based on the sample at hand.
3. Problem analysis:

A hypothesis is a new research question. Let's say there is a proposal for a new
drug: what is the significance of investing in manufacturing it if its effect on people
is trivial? The decision should be driven by a hypothesis test, which should show
statistically significant results, so that we make a well-informed decision to either
go with the new research claim or not.

There are many varieties of Hypothesis tests, based on the objective of our research
and the data that we have, we choose an appropriate type of hypothesis test. It’s
important to get an intuition of what’s happening in these tests and which test is
suitable for a real scenario. In this post, we will only cover Z-test in detail covering
different scenarios.

First, let’s understand the distribution of data in general. The green graph shows
the normal distribution of data with mean=5 and standard deviation = 2, converting
this to a standard normal distribution(grey graph) will shift the central location to
0(Mean=0) and will be 1 standard deviation away from mean 0. This is nothing but
a Z score. Z score translates the data that we have from normal distribution into a
standard normal distribution.
The formula for Z score is:

z = (x₁ − μ) / σ

In the above example, we just had one value x1=3. Now when we have a
population, it’s hard to validate each element of the population. That is the reason
we evaluate the Z score using sampling distribution of means and population mean.

The formula for Z score becomes:

z = (x̄ − μ) / (σ / √n)

Here x̄ is the sample mean, μ is the population mean, σ is the standard deviation and n is
the sample size.

Now, Why standardise?

Standardizing makes it easy to work with data. Once we have the Z score, we can
use the Z table(standard normal distribution table) which allows us to find the area
of the region located under the bell curve. This is useful to calculate the probability
of occurrence within our normal distribution. This can also be used to compare 2
scores that are from different normal distributions.

Today we use software programs or say libraries that will give us the Z score and
probability values(p-value) in a click. Still, it’s important to understand what goes
in the background to get an intuition.

Z test

Z-test is used with continuous variables. Continuous random variables can take an
infinite number of values, for example: Age, Fare, Weight, Height.
Parameter of interest for the Z test is:

mean(μ)
proportion(p)

A few requirements need to be met to use the Z-test:

1) The sample should be an independent random sample from the population, to ensure
there is no bias in the data.
2) The Z-test is also known as a parametric test. It assumes a normal distribution: if the
population is normal, then sample means will have a normal distribution
independent of sample size.
3) Central Limit Theorem: In many real scenarios, the population will not be
normally distributed, but if we take a large sample size "n", that is > 30 (30 is
the usual rule of thumb), then the sampling distribution of the sample mean approximates
a normal distribution for any population distribution shape with identical sample
sizes (a small demonstration follows this list).
4) For assessing the mean, the standard deviation of the population should be known.
5) For assessing proportions p, there are only 2 possible outcomes. If the probability
of one option is p, then the other is (1-p). Both n*p and n*(1-p) should be at least
10. The sampling distribution of a sample proportion (p̂) is approximately normal
as long as the expected numbers of successes and failures are both at least 10.
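
The sketch below illustrates the Central Limit Theorem mentioned in point 3: even for a skewed population, the distribution of sample means is approximately normal. This is a minimal, self-contained sketch; the exponential population and the seed are assumptions chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)

# A clearly non-normal (right-skewed) population, chosen only for illustration
population = rng.exponential(scale=30, size=100_000)

# Take 60 samples of size 60 (> 30) and record each sample mean
sample_means = [rng.choice(population, size=60).mean() for _ in range(60)]

print("Population mean:", round(population.mean(), 2))
print("Mean of sample means:", round(float(np.mean(sample_means)), 2))
print("Std dev of sample means:", round(float(np.std(sample_means)), 2))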

4. Algorithm:
Step 1: Evaluate the data distribution.
Step 2: Formulate Hypothesis statement symbolically
Step 3: Define the level of significance (alpha)
Step 4: Calculate Z test statistic or Z score.
Step 5: Derive P-value for the Z score calculated.
Step 6: Make decision:
Step 6.1: P-Value <= alpha, then we reject H0.
Step 6.2: If P-Value > alpha, Fail to reject H0

Z test implementation

We will implement hypothesis tests on the cases below:

a. Some new survey/research claims that the average age of passengers in Titanic who
survived is greater than 28.
b. There is a difference in average age between the two genders who survived.
c. Greater than 50% of passengers who survived in Titanic are in the age group of 20–40.
d. Greater than 50% of passengers in Titanic are in the age group of 20–40 (including
both survived and non-survived passengers).

5. Code

Download Titanic Data set from the below link

https://github.com/datasciencedojo/datasets/blob/master/titanic.csv

We will implement hypothesis tests on the cases below:

1) Some new survey/research claims that the average age of passengers in Titanic
who survived is greater than 28.
2) There is a difference in average age between the two genders who survived?
3) Greater than 50% of passengers who survived in Titanic are in the age group of
20–40.
4) Greater than 50% of passengers in Titanic are in the age group of 20–40 ( including
both survived and non-survived passengers)

Titanic data set overview:

In this dataset, Age and Fare are the continuous variables on which we can perform the Z
test.

1) Some new survey/research claims that the average age of passengers in Titanic who
survived is greater than 28.
First, let’s look at the data in hand. As shown below (Graph 1), the population is not
normally distributed. So, as per the Central Limit Theorem, we take 60 sample means of
60 randomly sampled survived passengers each; this sampling distribution of the mean
approximates a normal distribution (Graph 2).

H0: The average age of survived passengers is less than or equal to 28: μ ≤ 28

HA: New research claims the mean age is greater than 28: μ > 28
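
The screenshots of the original implementation are not reproduced here. The sketch below shows one way this test could be coded; it assumes the dataset from the link above is saved as titanic.csv, uses that file's Age and Survived column names, and follows the 60-sample-means-of-60-ages approach described above. The seed is arbitrary, so the exact numbers may differ slightly from those reported in the conclusion.

import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ztest

df = pd.read_csv("titanic.csv")                        # file from the link above
survived_age = df[df["Survived"] == 1]["Age"].dropna()

# Sampling distribution of the mean: 60 sample means of 60 survivor ages each
np.random.seed(42)
sample_means = [survived_age.sample(60, replace=True).mean() for _ in range(60)]

# Right-tailed one-sample z-test of H0: mu <= 28 against HA: mu > 28
z_stat, p_value = ztest(sample_means, value=28, alternative="larger")
print("Z =", round(z_stat, 3), " right-tailed p =", round(p_value, 4))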

Conclusion: As per the Z test, we reject H0 and go with the alternate theory which says the
Average age of survived passengers is > 28. With a confidence interval of 95%, we can
say that the average age ranges between 28.01 and 28.98.

2. There is a difference in average age between the two genders who survived?

Evaluating the data distribution of survived male and female passengers’ ages: the
population data is not normal, so 60 sample means of 60 observations each were taken
for each gender, which approximates a normal distribution.

H0: No difference in mean age of male & female passengers who survived: μ_male
= μ_female, i.e. μ_male − μ_female = 0

HA: There is a difference in mean age of male & female passengers who survived:
μ_male ≠ μ_female, i.e. μ_male − μ_female ≠ 0
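
A minimal sketch of how the two-sample version could be coded, under the same assumptions as the previous sketch (titanic.csv from the link above, its Sex, Age and Survived columns, sampling distributions of 60 sample means per gender, and an arbitrary seed):

import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ztest

df = pd.read_csv("titanic.csv")
survivors = df[df["Survived"] == 1]

male_age = survivors[survivors["Sex"] == "male"]["Age"].dropna()
female_age = survivors[survivors["Sex"] == "female"]["Age"].dropna()

# Sampling distributions of the mean for each gender
np.random.seed(42)
male_means = [male_age.sample(60, replace=True).mean() for _ in range(60)]
female_means = [female_age.sample(60, replace=True).mean() for _ in range(60)]

# Two-sided two-sample z-test of H0: no difference in mean age
z_stat, p_value = ztest(male_means, female_means, value=0)
print("Z =", round(z_stat, 3), " two-tailed p =", round(p_value, 4))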

Conclusion: We can go with the alternate theory, which says there is a difference in the
mean age of male & female passengers who survived.

Let's also check whether the mean age of male survivors is greater than that of female survivors:

H0: μ_male ≤ μ_female

H1: μ_male > μ_female

Conclusion:
We do not have a significant result to conclude that the mean age of male survivors is
greater than that of female survivors.

3. Greater than 50% of passengers who survived in Titanic are in the age group of 20–40.

H0: p ≤ 0.5, 50% or fewer of the passengers who survived in Titanic are in the age group of
20–40

H1: p > 0.5, Greater than 50% of passengers who survived in Titanic are in the age group
of 20–40
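
A minimal sketch of the one-sample proportion z-test for this case, under the same titanic.csv assumption; proportions_ztest from statsmodels is used here. Dropping the Survived filter gives the corresponding test for case 4 below.

import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("titanic.csv")
age = df[df["Survived"] == 1]["Age"].dropna()

count = ((age >= 20) & (age <= 40)).sum()   # survivors aged 20-40
nobs = len(age)                             # survivors with a known age

# Right-tailed test of H0: p <= 0.5 against H1: p > 0.5
z_stat, p_value = proportions_ztest(count, nobs, value=0.5, alternative="larger")
print("Z =", round(z_stat, 3), " p =", round(p_value, 4))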

Conclusion:
We fail to reject H0; we do not have a significant result, so we cannot go with the alternate
theory that says more than 50% of passengers who survived are in the 20–40 age range.
With 95% confidence, we can say the proportion of survived passengers in this age group
is between 47.01% and 58.0%.

4. Greater than 50% of passengers in Titanic are in the age group of 20–40 ( including
both survived and non-survived passengers)

Conclusion: We have significant results to go with the alternate theory: greater than 50%
of passengers in Titanic are in the age group of 20–40. With 95% confidence, we can say
that the proportion of passengers in this age group is between 50.3% and 57.6%.

6. Output & Conclusions

7. Results

8. Viva Voce

a) What Is a Null Hypothesis? Give example

A null hypothesis is a type of hypothesis in statistics that proposes that there is no


difference between certain characteristics of a population (or data-generating process).

For example, a gambler may be interested in whether a game of chance is fair. If it is


fair, then the expected earnings per play come to zero for both players. If the game is
not fair, then the expected earnings are positive for one player and negative for the
other. To test whether the game is fair, the gambler collects earnings data from many
repetitions of the game, calculates the average earnings from these data, then tests the
null hypothesis that the expected earnings are not different from zero.

If the average earnings from the sample data are sufficiently far from zero, then the
gambler will reject the null hypothesis and conclude the alternative hypothesis—
namely, that the expected earnings per play are different from zero. If the average
earnings from the sample data are near zero, then the gambler will not reject the null
hypothesis, concluding instead that the difference between the average from the data
and zero is explainable by chance alone.

b) What are Type I and Type II errors? What's the difference between Type 1 error
and Type 2 error?

In statistics, a Type I error means rejecting the null hypothesis when it's actually true,
while a Type II error means failing to reject the null hypothesis when it's actually false
A type I error (false-positive) occurs if an investigator rejects a null hypothesis that is
actually true in the population; a type II error (false-negative) occurs if the investigator
fails to reject a null hypothesis that is actually false in the population.

c) What is decision rule hypothesis testing

The decision rule is a statement that tells under what circumstances to reject the
null hypothesis. The decision rule is based on specific values of the test statistic
(e.g., reject H0 if Z > 1.645). The decision rule for a specific test depends on 3
factors: the research or alternative hypothesis, the test statistic and the level of
significance.

d) To reject the null hypothesis, what are the steps to be performed. Explain the steps
in detail

Step 1: State the null hypothesis. When you state the null hypothesis, you also have
to state the alternate hypothesis. Sometimes it is easier to state the alternate
hypothesis first, because that’s the researcher’s thoughts about the experiment.

Step 2: Support or reject the null hypothesis. Several methods exist, depending on
what kind of sample data you have. For example, you can use the P-value method.
For a rundown on all methods, see: Support or reject the null hypothesis.

Ex.No: 5
Implementation of T-Test – one sample t-test
Date :

1. Problem statement :

To perform a one sample t-test to determine whether the mean of a population is equal
to some value or not

2. Expected Learning outcomes:


It helps the students to analyze the dataset using T-test of a single population mean
3. Problem analysis:

T- Test :- A t-test is a type of inferential statistic which is used to determine if there is


a significant difference between the means of two groups which may be related in
certain features. It is mostly used when the data sets, like the set of data recorded as the
outcome from flipping a coin 100 times, would follow a normal distribution and may
have unknown variances. T test is used as a hypothesis testing tool, which allows
testing of an assumption applicable to a population.

A very simple example: Let’s say you have a cold and you try a naturopathic remedy.
Your cold lasts a couple of days. The next time you have a cold, you buy an over-the-
counter pharmaceutical and the cold lasts a week. You survey your friends and they all
tell you that their colds were of a shorter duration (an average of 3 days) when they
took the naturopathic remedy. What you really want to know is, are these results
repeatable? A t test can tell you by comparing the means of the two groups and letting
you know the probability of those results happening by chance

T-test has 2 types :

1. One sampled t-test


2. Two-sampled t-test.

One sample t-test : The One Sample t Test determines whether the sample mean is
statistically different from a known or hypothesised population mean. The One
Sample t Test is a parametric test.

The one sample t test compares the mean of your sample data to a known value. For
example, you might want to know how your sample mean compares to the population
mean. You should run a one sample t test when you don’t know the population
standard deviation or you have a small sample size. For a full rundown on which test
to use, see: T-score vs. Z-Score.

Assumptions of the test (your data should meet these requirements for the test to be
valid):

Data is independent.
Data is collected randomly. For example, with simple random sampling.
The data is approximately normally distributed.

HOW 1-SAMPLE T-TESTS CALCULATE T-VALUES


Understanding this process is crucial to understanding how t-tests work.

The formula to calculate t for a 1-sample t-test is t = (x̄ − μ0) / (s / √n), where x̄ is the
sample mean, μ0 is the null hypothesis value, s is the sample standard deviation and n is
the sample size.

Please notice that the formula is a ratio. A common analogy is that the t-value is the
signal-to-noise ratio.

Signal (a.k.a. the effect size)

The numerator is the signal. You simply take the sample mean and subtract the null
hypothesis value. If your sample mean is 10 and the null hypothesis is 6, the
difference, or signal, is 4.

If there is no difference between the sample mean and null value, the signal in the
numerator, as well as the value of the entire ratio, equals zero. For instance, if your
sample mean is 6 and the null value is 6, the difference is zero.

As the difference between the sample mean and the null hypothesis mean increases in
either the positive or negative direction, the strength of the signal increases.

Noise

The denominator is the noise. The equation in the denominator is a measure of


variability known as the standard error of the mean. This statistic indicates how

accurately your sample estimates the mean of the population. A larger number
indicates that your sample estimate is less precise because it has more random error.

This random error is the “noise.” When there is more noise, you expect to see larger
differences between the sample mean and the null hypothesis value even when the null
hypothesis is true. We include the noise factor in the denominator because we must
determine whether the signal is large enough to stand out from it.

Signal-to-Noise ratio

Both the signal and noise values are in the units of your data. If your signal is 6 and
the noise is 2, your t-value is 3. This t-value indicates that the difference is 3 times the
size of the standard error. However, if there is a difference of the same size but your
data have more variability (6), your t-value is only 1. The signal is at the same scale as
the noise.

In this manner, t-values allow you to see how distinguishable your signal is from the
noise. Relatively large signals and low levels of noise produce larger t-values. If the
signal does not stand out from the noise, it’s likely that the observed difference
between the sample estimate and the null hypothesis value is due to random error in
the sample rather than a true difference at the population level.

4. Algorithm :
Step 1: Create some dummy age data for the population of voters in the entire
country
Step 2: Create a sample of voters in Minnesota and test whether the average age
of voters in Minnesota differs from the population
Step 3: Conduct a t-test at a 95% confidence level and see if it correctly rejects the
null hypothesis that the sample comes from the same distribution as the population.

Step 4: If the t-statistic lies outside the quantiles of the t-distribution corresponding
to our confidence level and degrees of freedom, we reject the null hypothesis.
Step 5: Calculate the chances of seeing a result as extreme as the one being
observed (known as the p-value) by passing the t-statistic in as the quantile to the
stats.t.cdf() function

5. Code:

A one-sample t-test checks whether a sample mean differs from the population mean. Let's
create some dummy age data for the population of voters in the entire country and a
sample of voters in Minnesota, and test whether the average age of voters in Minnesota
differs from the population:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import math
np.random.seed(6)

population_ages1 = stats.poisson.rvs(loc=18, mu=35, size=150000)


population_ages2 = stats.poisson.rvs(loc=18, mu=10, size=100000)
population_ages = np.concatenate((population_ages1, population_ages2))

minnesota_ages1 = stats.poisson.rvs(loc=18, mu=30, size=30)


minnesota_ages2 = stats.poisson.rvs(loc=18, mu=10, size=20)
minnesota_ages = np.concatenate((minnesota_ages1, minnesota_ages2))

print( population_ages.mean() )
print( minnesota_ages.mean() )

43.000112
39.26

Notice that we used a slightly different combination of distributions to generate the sample
data for Minnesota, so we know that the two means are different. Let's conduct a t-test at a
95% confidence level and see if it correctly rejects the null hypothesis that the sample
comes from the same distribution as the population. To conduct a one sample t-test, we
can use the stats.ttest_1samp() function:

stats.ttest_1samp(a = minnesota_ages, # Sample data
popmean = population_ages.mean()) # Pop mean
Ttest_1sampResult(statistic=-2.5742714883655027,
pvalue=0.013118685425061678)

The test result shows the test statistic "t" is equal to -2.574. This test statistic tells us how
much the sample mean deviates from the null hypothesis. If the t-statistic lies outside the
quantiles of the t-distribution corresponding to our confidence level and degrees of
freedom, we reject the null hypothesis. We can check the quantiles with stats.t.ppf():

stats.t.ppf(q=0.025, # Quantile to check


df=49) # Degrees of freedom

-2.0095752344892093

stats.t.ppf(q=0.975, df=49)

2.009575234489209

We can calculate the chances of seeing a result as extreme as the one we observed (known
as the p-value) by passing the t-statistic in as the quantile to the stats.t.cdf() function:

stats.t.cdf(x= -2.5742,    # T-test statistic
            df= 49) * 2    # Multiply by two for a two-tailed test

0.013121066545690117

Notice this value is the same as the p-value listed in the original t-test output. A p-value of
0.01311 means we'd expect to see data as extreme as our sample due to chance about 1.3%
of the time if the null hypothesis was true. In this case, the p-value is lower than our
significance level α (equal to 1-conf.level or 0.05) so we should reject the null hypothesis.
If we were to construct a 95% confidence interval for the sample it would not capture
population mean of 43:

sigma = minnesota_ages.std()/math.sqrt(50) # Sample stdev/sample size

stats.t.interval(0.95, # Confidence level


df = 49, # Degrees of freedom
loc = minnesota_ages.mean(), # Sample mean
scale= sigma) # Standard dev estimate

(36.369669080722176, 42.15033091927782)

On the other hand, since there is a 1.3% chance of seeing a result this extreme due to
chance, it is not significant at the 99% confidence level. This means if we were to
construct a 99% confidence interval, it would capture the population mean:

stats.t.interval(alpha = 0.99, # Confidence level


df = 49, # Degrees of freedom
loc = minnesota_ages.mean(), # Sample mean
scale= sigma) # Standard dev estimate

(35.40547994092107, 43.11452005907893)

With a higher confidence level, we construct a wider confidence interval and increase the
chances that it captures the true mean, thus making it less likely that we'll reject the null
hypothesis. In this case, the p-value of 0.013 is greater than our significance level of 0.01
and we fail to reject the null hypothesis.

6. Output

7. Result

8. Viva Voce
a. What does a t-test measure?
A t-test measures the difference in group means divided by the pooled standard error
of the two group means.

In this way, it calculates a number (the t-value) illustrating the magnitude of the
difference between the two group means being compared, and estimates the likelihood
that this difference exists purely by chance (p-value).

b. What type of t-test should I use?


When choosing a t-test, you will need to consider two things: whether the groups being
compared come from a single population or two different populations, and whether
you want to test the difference in a specific direction.
c. One-sample, two-sample, or paired t-test?
If the groups come from a single population (e.g. measuring before and after an
experimental treatment), perform a paired t-test.
If the groups come from two different populations (e.g. two different species, or people
from two separate cities), perform a two-sample t-test (a.k.a. independent t-test).
If there is one group being compared against a standard value (e.g. comparing the
acidity of a liquid to a neutral pH of 7), perform a one-sample t-test.
d. One-tailed or two-tailed t-test?
If you only care whether the two populations are different from one another, perform a
two-tailed t-test.
If you want to know whether one population mean is greater than or less than the other,
perform a one-tailed t-test.

e. What is the difference between a one-sample t-test and a paired t-test?


A one-sample t-test is used to compare a single population to a standard value (for
example, to determine whether the average lifespan of a specific town is different from
the country average).

A paired t-test is used to compare a single population before and after some
experimental intervention or at two different points in time (for example, measuring
student performance on a test before and after being taught the material).

Ex.No: 6
Implementation of T-Test – Two sample t-test and
Paired T-Test
Date :

1. Problem statement :

To perform a two sample t-test and paired t-test to determine whether the means of two
populations are equal or not

2. Expected Learning outcomes:


It helps the students to analyze the dataset using T-test and investigates whether the
means of two independent data samples differ from one another

3. Problem analysis:

HOW TWO-SAMPLE T-TESTS CALCULATE T-VALUES


The 2-sample t-test takes your sample data from two groups and boils it down to the t-
value. The process is very similar to the 1-sample t-test, and you can still use the
analogy of the signal-to-noise ratio. Unlike the paired t-test, the 2-sample t-test
requires independent groups for each sample.

The formula is below, and then some discussion.

The formula to calculate t for a 2-sample t-test (unequal-variance form) is
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2), where x̄1 and x̄2 are the sample means, s1 and s2 the
sample standard deviations, and n1 and n2 the sample sizes.

For the 2-sample t-test, the numerator is again the signal, which is the difference
between the means of the two samples. For example, if the mean of group 1 is 10, and
the mean of group 2 is 4, the difference is 6.

The default null hypothesis for a 2-sample t-test is that the two groups are equal. You
can see in the equation that when the two groups are equal, the difference (and the
entire ratio) also equals zero. As the difference between the two groups grows in either
a positive or negative direction, the signal becomes stronger.

In a 2-sample t-test, the denominator is still the noise, but Minitab can use two
different values. You can either assume that the variability in both groups is equal or
not equal, and Minitab uses the corresponding estimate of the variability. Either way,

the principle remains the same: you are comparing your signal to the noise to see how
much the signal stands out.

Just like with the 1-sample t-test, for any given difference in the numerator, as you
increase the noise value in the denominator, the t-value becomes smaller. To
determine that the groups are different, you need a t-value that is large.

WHAT DO T-VALUES MEAN?

Each type of t-test uses a procedure to boil all of your sample data down to one value,
the t-value. The calculations compare your sample mean(s) to the null hypothesis and
incorporates both the sample size and the variability in the data. A t-value of 0
indicates that the sample results exactly equal the null hypothesis. In statistics, we call
the difference between the sample estimate and the null hypothesis the effect size. As
this difference increases, the absolute value of the t-value increases.

That’s all nice, but what does a t-value of, say, 2 really mean? From the discussion
above, we know that a t-value of 2 indicates that the observed difference is twice the
size of the variability in your data. However, we use t-tests to evaluate hypotheses
rather than just figuring out the signal-to-noise ratio. We want to determine whether
the effect size is statistically significant.

To see how we get from t-values to assessing hypotheses and determining statistical
significance, read the other post in this series, Understanding t-Tests: t-values and t-
distributions.

A paired t test (also called a correlated pairs t-test, a paired samples t test or dependent
samples t test) is where you run a t test on dependent samples. Dependent samples are
essentially connected — they are tests on the same person or thing. For example:

Knee MRI costs at two different hospitals,


Two tests on the same person before and after training,
Two blood pressure measurements on the same person using different equipment.

Choose the paired t-test if you have two measurements on the same item, person or
thing. You should also choose this test if you have two items that are being measured
with a unique condition. For example, you might be measuring car safety performance
in vehicle research and testing and subject the cars to a series of crash tests. Although
the manufacturers are different, you might be subjecting them to the same conditions.

With a “regular” two sample t test, you’re comparing the means for two different
samples. For example, you might test two different groups of customer service

associates on a business-related test or testing students from two universities on their
English skills. If you take a random sample from each group separately and they have
different conditions, your samples are independent and you should run an independent
samples t test (also called between-samples and unpaired-samples).

4. Algorithm :
Step 1: Create the data
Step 2: Conduct a two sample t-test.
Step 3: Interpret the results

5. Code

Two-Sample T-Test

A two-sample t-test investigates whether the means of two independent data samples differ
from one another. In a two-sample test, the null hypothesis is that the means of both
groups are the same. Unlike the one sample-test where we test against a known population
parameter, the two-sample test only involves sample means. You can conduct a two-sample t-test by passing both samples to the stats.ttest_ind() function. Let's generate a sample of voter age data for Wisconsin and test it against the Minnesota sample we made earlier:

np.random.seed(12)
wisconsin_ages1 = stats.poisson.rvs(loc=18, mu=33, size=30)
wisconsin_ages2 = stats.poisson.rvs(loc=18, mu=13, size=20)
wisconsin_ages = np.concatenate((wisconsin_ages1, wisconsin_ages2))

print( wisconsin_ages.mean() )
42.8

stats.ttest_ind(a = minnesota_ages,
                b = wisconsin_ages,
                equal_var = False)   # do not assume equal variance (Welch's t-test)

Ttest_indResult(statistic=-1.7083870793286842, pvalue=0.09073104343957748)

The test yields a p-value of 0.0907, which means there is a 9% chance we'd see sample
data this far apart if the two groups tested are actually identical. If we were using a 95%
confidence level we would fail to reject the null hypothesis, since the p-value is greater
than the corresponding significance level of 5%.

Paired T-Test

The basic two sample t-test is designed for testing differences between independent
groups. In some cases, you might be interested in testing differences between samples of
the same group at different points in time. For instance, a hospital might want to test
whether a weight-loss drug works by checking the weights of the same group of patients
before and after treatment. A paired t-test lets you check whether the means of samples
from the same group differ.

We can conduct a paired t-test using the scipy function stats.ttest_rel(). Let's generate
some dummy patient weight data and do a paired t-test:

np.random.seed(11)

before= stats.norm.rvs(scale=30, loc=250, size=100)

after = before + stats.norm.rvs(scale=5, loc=-1.25, size=100)

weight_df = pd.DataFrame({"weight_before":before,
"weight_after":after,
"weight_change":after-before})

weight_df.describe() # Check a summary of the data

The summary shows that patients lost about 1.23 pounds on average after treatment. Let's
conduct a paired t-test to see whether this difference is significant at a 95% confidence
level:

stats.ttest_rel(a = before, b = after)


Ttest_relResult(statistic=2.5720175998568284, pvalue=0.011596444318439857)

Type I and Type II Error

The result of a statistical hypothesis test and the corresponding decision of whether to
reject or accept the null hypothesis is not infallible. A test provides evidence for or against
the null hypothesis and then you decide whether to accept or reject it based on that
evidence, but the evidence may lack the strength to arrive at the correct conclusion.
Incorrect conclusions made from hypothesis tests fall in one of two categories: type I error
and type II error.

Type I error describes a situation where you reject the null hypothesis when it is actually
true. This type of error is also known as a "false positive" or "false hit". The type 1 error
rate is equal to the significance level α, so setting a higher confidence level (and therefore
lower alpha) reduces the chances of getting a false positive.

Type II error describes a situation where you fail to reject the null hypothesis when it is
actually false. Type II error is also known as a "false negative" or "miss". The higher your
confidence level, the more likely you are to make a type II error.
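
To make the Type I error definition concrete, here is a small simulation (a sketch with assumed parameters, not part of the original exercise): when both samples really come from the same population, a test run at alpha = 0.05 should wrongly reject the null hypothesis about 5% of the time.

import numpy as np
from scipy import stats

np.random.seed(0)
trials = 2000
false_positives = 0
for _ in range(trials):
    a = np.random.normal(loc=50, scale=10, size=30)   # both samples drawn from
    b = np.random.normal(loc=50, scale=10, size=30)   # the same population
    if stats.ttest_ind(a, b).pvalue < 0.05:           # reject H0 at alpha = 0.05
        false_positives += 1
print("Empirical Type I error rate:", false_positives / trials)   # close to 0.05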

Let's investigate these errors with a plot:

plt.figure(figsize=(12,10))

plt.fill_between(x=np.arange(-4,-2,0.01),
y1= stats.norm.pdf(np.arange(-4,-2,0.01)) ,
facecolor='red',
alpha=0.35)

plt.fill_between(x=np.arange(-2,2,0.01),
y1= stats.norm.pdf(np.arange(-2,2,0.01)) ,
facecolor='grey',
alpha=0.35)

plt.fill_between(x=np.arange(2,4,0.01),
y1= stats.norm.pdf(np.arange(2,4,0.01)) ,
facecolor='red',
alpha=0.5)

plt.fill_between(x=np.arange(-4,-2,0.01),
y1= stats.norm.pdf(np.arange(-4,-2,0.01),loc=3, scale=2) ,
facecolor='grey',
alpha=0.35)

plt.fill_between(x=np.arange(-2,2,0.01),
y1= stats.norm.pdf(np.arange(-2,2,0.01),loc=3, scale=2) ,

facecolor='blue',
alpha=0.35)

plt.fill_between(x=np.arange(2,10,0.01),
y1= stats.norm.pdf(np.arange(2,10,0.01),loc=3, scale=2),
facecolor='grey',
alpha=0.35)

plt.text(x=-0.8, y=0.15, s= "Null Hypothesis")


plt.text(x=2.5, y=0.13, s= "Alternative")
plt.text(x=2.1, y=0.01, s= "Type 1 Error")
plt.text(x=-3.2, y=0.01, s= "Type 1 Error")
plt.text(x=0, y=0.02, s= "Type 2 Error");

Conclusion:

In the plot above, the red areas indicate type I errors assuming the alternative hypothesis is
not different from the null for a two-sided test with a 95% confidence level.

The blue area represents type II errors that occur when the alternative hypothesis is
different from the null, as shown by the distribution on the right. Note that the Type II
error rate is the area under the alternative distribution within the quantiles determined by
the null distribution and the confidence level.
6. Output

7. Result
8. Viva Voce
a. Differences between the two-sample t-test and paired t-test
Two-sample t-test is used when the data of two samples are statistically independent,
while the paired t-test is used when data is in the form of matched pairs

There are also some technical differences between them. To use the two-sample t-test, we
need to assume that the data from both samples are normally distributed and they have the
same variances. For paired t-test, we only require that the difference of each pair is
normally distributed. An important parameter in the t-distribution is the degrees of
freedom.

b. Where is paired t-test used?

A paired t-test is used when we are interested in the difference between two variables for
the same subject. Often the two variables are separated by time. For example, in the Dixon
and Massey data set we have cholesterol levels in 1952 and cholesterol levels in 1962 for
each subject

c. What are the assumptions for a paired t-test?

Paired t-test assumptions

Subjects must be independent. Measurements for one subject do not affect measurements
for any other subject. Each of the paired measurements must be obtained from the same
subject. For example, the before-and-after weight for a smoker in the example above must
be from the same person

Ex.No: 7
Implementation of Variance Analysis (ANOVA)
Date :

1. Problem statement :

Write a Python application program to demonstrate the Analysis of Variance (ANOVA).

2. Expected Learning outcomes:


To train the students to understand the basics of Analysis of Variance (ANOVA) using a Python program.

3. Problem analysis:

An Analysis of Variance Test, or ANOVA, can be thought of as a generalization of


the t-tests for more than 2 groups. The independent t-test is used to compare the means
of a condition between two groups. ANOVA is used when we want to compare the
means of a condition between more than two groups.

ANOVA tests if there is a difference in the mean somewhere in the model (testing if
there was an overall effect), but it does not tell us where the difference is (if there is
one). To find where the difference is between the groups, we have to conduct post-hoc
tests.

To perform any tests, we first need to define the null and alternate hypothesis:
Null Hypothesis – There is no significant difference among the groups
Alternate Hypothesis – There is a significant difference among the groups

Basically, ANOVA is performed by comparing two types of variation, the variation


between the sample means, as well as the variation within each of the samples. The formula below represents the one-way ANOVA test statistic.

The result of the ANOVA formula, the F statistic (also called the F-ratio), allows for the analysis of multiple groups of data to determine the variability between samples and within samples. The formula for the one-way ANOVA test can be written as:

F = MSB / MSW = [SSB / (k − 1)] / [SSW / (N − k)]

where SSB is the sum of squares between groups, SSW is the sum of squares within groups, k is the number of groups and N is the total number of observations. When the ANOVA table is laid out, all of the above components appear in it.

In general, if the p-value associated with the F is smaller than 0.05, then the null
hypothesis is rejected and the alternative hypothesis is supported. If the null
hypothesis is rejected, we can conclude that not all of the group means are equal.

Types of ANOVA Tests

1. One-Way ANOVA: A one-way ANOVA has just one independent variable


o For example, differences in Corona cases can be assessed by Country, and a
Country can have 2, 20, or more different categories to compare

2. Two-Way ANOVA: A two-way ANOVA (also called factorial ANOVA) refers to


an ANOVA using two independent variables
o Expanding the example above, a two-way ANOVA can examine differences in
Corona cases (the dependent variable) by Age group (independent variable 1) and
Gender (independent variable 2). Two-way ANOVA can be used to examine the
interaction between the two independent variables. Interactions indicate that
differences are not uniform across all categories of the independent variables
o For example, Old Age Group may have higher Corona cases overall compared to
the Young Age group, but this difference could be greater (or less) in Asian
countries compared to European countries

3. N-Way ANOVA: A researcher can also use more than two independent variables,
and this is an n-way ANOVA (with n being the number of independent variables you
have). When multiple dependent variables are analyzed together, the test is instead called a MANOVA.

o For example, potential differences in Corona cases can be examined by Country,


Gender, Age group, Ethnicity, etc, simultaneously
o An ANOVA will give you a single (univariate) f-value while a MANOVA will
give you a multivariate F-value
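
As a hedged illustration of the two-way (factorial) ANOVA described above, the sketch below uses the statsmodels formula API on a small invented dataset (the column names and values are assumptions for demonstration only):

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# hypothetical data: a score measured across two factors, age_group and gender
data = pd.DataFrame({
    "score":     [12, 15, 14, 20, 22, 19, 11, 13, 12, 18, 21, 20],
    "age_group": ["young"] * 3 + ["old"] * 3 + ["young"] * 3 + ["old"] * 3,
    "gender":    ["M"] * 6 + ["F"] * 6,
})

# two-way ANOVA with an interaction term between the factors
model = ols("score ~ C(age_group) * C(gender)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))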

4. Algorithm :
A. Input:
A group of students from different colleges take the same exam. You want to see if one college outperforms the others, hence your null hypothesis is that the mean GPAs of the groups are equivalent. To keep it simple, we will consider 3 groups (college 'A', 'B', 'C') with 6 students each.

A=[25,25,27,30,23,20]
B=[30,30,21,24,26,28]
C=[18,30,29,29,24,26]
Null Hypothesis: GPAs in each group are equivalent to those of the other groups.
Alternate Hypothesis – There is a significant difference among the groups
B. Output:
To find the null hypothesis or alternate hypothesis is acceptable or not.

1. Rows are grouped according to their value in the category column.


2. The total mean value of the value column is computed.

3. The mean within each group is computed.


4. The difference between each value and the mean value for the group is calculated and
squared.

5. The squared difference values are added. The result is a value that relates to the total
deviation of rows from the mean of their respective groups. This value is referred to as the
sum of squares within groups, or S2Wthn.
6. For each group, the difference between the total mean and the group mean is squared
and multiplied by the number of values in the group. The results are added. The result is
referred to as the sum of squares between groups or S2Btwn.

7. The two sums of squares are used to obtain a statistic for testing the null hypothesis, the so-called F-statistic. The F-statistic is calculated as:

F = (S2Btwn / dfBtwn) / (S2Wthn / dfWthn)

where dfBtwn (degrees of freedom between groups) equals the number of groups minus 1, and dfWthn (degrees of freedom within groups) equals the total number of values minus the number of groups.

8. The F-statistic is distributed according to the F-distribution (commonly presented in


mathematical tables/handbooks). The F-statistic, in combination with the degrees of
freedom and an F-distribution table, yields the p-value.
The p-value is the probability of the actual or a more extreme outcome under the null hypothesis. The lower the p-value, the stronger the evidence against the null hypothesis.

5. Code:

import pandas as pd
import numpy as np
import scipy.stats as stats
a=[25,25,27,30,23,20]
b=[30,30,21,24,26,28]
c=[18,30,29,29,24,26]
list_of_tuples = list(zip(a, b,c))
df = pd.DataFrame(list_of_tuples, columns = ['A', 'B', 'C'])
df
m1=np.mean(a)

m2=np.mean(b)
m3=np.mean(c)
print('Average mark for college A: {}'.format(m1))
print('Average mark for college B: {}'.format(m2))
print('Average mark for college C: {}'.format(m3))
m=(m1+m2+m3)/3
print('Overall mean: {}'.format(m))
SSb=6*((m1-m)**2+(m2-m)**2+(m3-m)**2)
print('Between-groups Sum of Squared Differences: {}'.format(SSb))
MSb=SSb/2
print('Between-groups Mean Square value: {}'.format(MSb))
err_a=list(np.array(a)-m1)
err_b=list(np.array(b)-m2)
err_c=list(np.array(c)-m3)
err=err_a+err_b+err_c
ssw=[]
for i in err:
    ssw.append(i**2)
SSw=np.sum(ssw)
print('Within-group Sum of Squared Differences: {}'.format(SSw))
MSw=SSw/15
print('Within-group Mean Square value: {}'.format(MSw))
F=MSb/MSw
print('F-score: {}'.format(F))
print(stats.f_oneway(a,b,c))
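
The problem analysis above notes that ANOVA does not tell us where the difference lies; a post-hoc test is needed for that. As a sketch (assuming the statsmodels package is installed), a Tukey HSD post-hoc test on the same three college groups could look like this:

from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = np.array(a + b + c)                           # reuse the lists defined above
groups = ["A"] * len(a) + ["B"] * len(b) + ["C"] * len(c)
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))   # pairwise group comparisons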

6. Output

7. Result

8. Viva Voce

a. Why do we need the ANOVA

You would use ANOVA to help you understand how your different groups respond,
with a null hypothesis for the test that the means of the different groups are equal. If
there is a statistically significant result, then it means that at least one of the group means differs from the others.

b. What are the two assumptions of ANOVA?

The factorial ANOVA has several assumptions that need to be fulfilled: (1) interval data of the dependent variable, (2) normality, (3) homoscedasticity, and (4) no multicollinearity.
c. Why is ANOVA better than multiple t tests?

Two-way anova would be better than multiple t-tests for two reasons: (a) the within-
cell variation will likely be smaller in the two-way design (since the t-test ignores the
2nd factor and interaction as sources of variation for the DV); and (b) the two-way
design allows for test of interaction of the two factors.

d. Why is ANOVA more powerful than t-test?

A t-test compares only two groups, so comparing three or more groups requires several separate t-tests, and each extra test inflates the overall risk of a Type I error. A single ANOVA compares three or more groups at once while keeping the error rate at the chosen significance level, which is why it is preferred in that situation.

e. When should you use ANOVA instead of t-tests?

There is a thin line of demarcation between the t-test and ANOVA: when the population means of only two groups are to be compared, the t-test is used, but when the means of more than two groups are to be compared, ANOVA is preferred.
f. Can ANOVA be used for hypothesis testing?
The specific test considered here is called analysis of variance (ANOVA) and is a test
of hypothesis that is appropriate to compare means of a continuous variable in two or
more independent comparison groups. For example, in some clinical trials there are
more than two comparison groups.
g. What distribution does ANOVA use?
F-distribution
The second is one-way analysis of variance (ANOVA), which uses the F-distribution
to test to see if three or more samples come from populations with the same mean

Ex.No: 8
Demonstration of Linear Regression
Date :

1. Problem statement :

Write a Python application program to demonstrate Linear Regression.


2. Expected Learning outcomes:
To train the students to understand the concept of linear regression in Python.

3. Problem analysis:

Problem Analysis: Consider the linear line y = a + bx. Find the slope b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] and the intercept a = [Σy − b(Σx)] / n, substitute the values of a and b into y = a + bx, and solve the equation. Once you have the values of a and b, you can regress the y value for any x.

4. Algorithm

A. Input: Get any value of x.


B.Output: Find the value of y for any x.

Step1: Consider a set of values x, y.


Step2: Take the linear set of equation y = a+bx.
Step3: Compute the values of a and b from the given data: b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²], a = [Σy − b(Σx)] / n.
Step4: Implement the value of a, b in the equation y = a+ bx.
Step5: Regress the value of y for any x.

5. Code

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m", marker = "o", s = 30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color = "g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()

# observations
x = np.array([25, 23, 25, 31, 32, 25, 36, 27, 28, 29])
y = np.array([3.2, 3, 3.5, 3, 3.6, 3.7, 3.3, 3.6, 3.2, 3.1])
# estimating coefficients
b = estimate_coef(x, y)
print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
# plotting regression line
plot_regression_line(x, y, b)
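
As an optional cross-check (assuming SciPy is available), the hand-computed coefficients can be compared against scipy.stats.linregress, which should return the same intercept and slope for this data:

from scipy import stats

result = stats.linregress(x, y)        # x and y are the observation arrays above
print("intercept (b_0):", result.intercept)
print("slope (b_1):", result.slope)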

6. Output

7. Result

8. Viva Voce
a. What are the limitations to linear regression?
The Disadvantages of Linear Regression

Linear Regression Only Looks at the Mean of the Dependent Variable. Linear
regression looks at a relationship between the mean of the dependent variable and the
independent variables. ...
Linear Regression Is Sensitive to Outliers. ...
Data Must Be Independent.

b. Why is it called linear regression?

For example, if parents were very tall, the children tended to be tall but shorter than their parents. If parents were very short, the children tended to be short but taller than their parents. Galton called this discovery "regression to the mean," with the word "regression" meaning to come back toward the average, and the name carried over to the statistical method.

c. Can linear regression be curved?


Yes. Despite its name, linear regression can produce curved fits (for example by including polynomial terms), because "linear" refers to linearity in the coefficients rather than to a straight line; likewise, nonlinear regression is not named for its curved lines.

d. When the slope is zero What is the rise?


When the 'rise' is zero, the line is horizontal, or flat, and the slope of the line is zero.

e. Where is linear regression used?

Linear regression is commonly used for predictive analysis and modeling. For
example, it can be used to quantify the relative impacts of age, gender, and diet (the
predictor variables) on height (the outcome variable)

f. How do you tell if a regression model is a good fit?

Statisticians say that a regression model fits the data well if the differences between the
observations and the predicted values are small and unbiased. Unbiased in this context
means that the fitted values are not systematically too high or too low anywhere in the
observation space.

Ex.No: 9
Demonstration of Logistic Regression
Date :

1. Problem statement :

Write a Python application program to perform classification using Logistic Regression.
2. Expected Learning outcomes:

To train the students to understand the basics of logistic regression using Python.

3. Problem analysis:

Let’s say that your goal is to build a logistic regression model in Python in order to
determine whether candidates would get admitted to a prestigious university.

To understand logistic regression, let’s go over the odds of success.

Odds (θ) = Probability of an event happening / Probability of an event not happening

θ = p / (1 − p)

The values of the odds range from zero to ∞, while the values of probability lie between zero and one.

Consider the equation of a straight line:

y = β0 + β1·x

Here, β0 is the y-intercept,
β1 is the slope of the line,
x is the value of the x coordinate,
y is the value of the prediction.
Now, to predict the odds of success, we model the log-odds as a linear function of x:

ln[ p(x) / (1 − p(x)) ] = β0 + β1·x

Exponentiating both sides, we have:

Let Y = e^(β0 + β1·x)
Then p(x) / (1 − p(x)) = Y
p(x) = Y(1 − p(x))
p(x) = Y − Y·p(x)
p(x) + Y·p(x) = Y
p(x)(1 + Y) = Y
p(x) = Y / (1 + Y)

Substituting back, the equation of the sigmoid function is:

p(x) = 1 / (1 + e^−(β0 + β1·x))

Plotting this equation produces the characteristic S-shaped sigmoid curve.
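
A minimal sketch (with arbitrary coefficient values chosen purely for illustration) that evaluates the sigmoid and plots the resulting S-shaped curve:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x, b0=0.0, b1=1.0):
    # p(x) = 1 / (1 + e^-(b0 + b1*x))
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

x = np.linspace(-6, 6, 200)
plt.plot(x, sigmoid(x))
plt.xlabel("x")
plt.ylabel("p(x)")
plt.title("Sigmoid curve")
plt.show()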

4. Algorithm:

A. Input: GMAT score, GPA and Years of work experience directly given in the program
as input.

B. Output: Aspiring candidate get admitted or not.

Step1: Initialize the variables


Step2: Set the Data frame
Step3: Split the data set into training and testing.
Step4: Fit the data into logistic regression function.
Step5: Predict the test data set.
Step6: Print the results.

5. Code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt
candidates = {'gmat': [780,750,690,710,680,730,690,720,740,690,610,690,710,680,770,610,580,650,540,590,620,600,550,550,570,670,660,580,650,660,640,620,660,660,680,650,670,580,590,690],
              'gpa': [4,3.9,3.3,3.7,3.9,3.7,2.3,3.3,3.3,1.7,2.7,3.7,3.7,3.3,3.3,3,2.7,3.7,2.7,2.3,3.3,2,2.3,2.7,3,3.3,3.7,2.3,3.7,3.3,3,2.7,4,3.3,3.3,2.3,2.7,3.3,1.7,3.7],
              'work_experience': [3,4,3,5,4,6,1,4,5,1,3,5,6,4,3,1,4,6,2,3,2,1,4,1,2,6,4,2,6,5,1,2,4,6,5,1,2,1,4,5],
              'admitted': [1,1,0,1,0,1,0,1,1,0,0,1,1,0,1,0,0,1,0,0,1,0,0,0,0,1,1,0,1,1,0,0,1,1,1,0,0,0,0,1]
              }
df = pd.DataFrame(candidates,columns= ['gmat', 'gpa','work_experience','admitted'])
print (df)
X = df[['gmat', 'gpa','work_experience']]
y = df['admitted']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
print (X_train)
print (y_train)
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'],
colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))
print (X_test) #test dataset
print (y_pred) #predicted values
print('confusion_matrix:', confusion_matrix, sep='\n', end='\n\n')
plt.show()
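
As a brief follow-up, the fitted model can also score a single new applicant (the GMAT, GPA and experience values below are hypothetical); predict_proba returns the probability of each class:

new_candidate = pd.DataFrame({'gmat': [700], 'gpa': [3.5], 'work_experience': [4]})
print(logistic_regression.predict(new_candidate))         # predicted class (0 or 1)
print(logistic_regression.predict_proba(new_candidate))   # probabilities for class 0 and 1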

6. Output

7. Result

8. Viva Voce

a. What is logistic regression used for?


It is used in statistical software to understand the relationship between the dependent
variable and one or more independent variables by estimating probabilities using a

logistic regression equation. This type of analysis can help you predict the likelihood
of an event happening or a choice being made.

b. What is difference between logistic regression and linear regression?

Linear Regression is used to handle regression problems whereas Logistic regression is


used to handle classification problems. Linear regression provides a continuous output, but logistic regression provides a discrete output.

c. What does logistic regression predict?


Logistic regression is used to predict the class (or category) of individuals based on
one or multiple predictor variables (x). It is used to model a binary outcome that is a
variable, which can have only two possible values: 0 or 1, yes or no, diseased or non-
diseased

d. What are the assumptions of logistic regression?


Basic assumptions that must be met for logistic regression include independence of
errors, linearity in the logit for continuous variables, absence of multicollinearity, and
lack of strongly influential outliers

e. What are the steps in logistic regression?

Logistic Regression by Stochastic Gradient Descent


Calculate Prediction. Let's start off by assigning 0.0 to each coefficient and calculating
the probability of the first training instance that belongs to class 0. ...
Calculate New Coefficients.
Repeat the Process.
Make Predictions.

f. How many predictors can be used in logistic regression?


A logistic regression uses one or more independent variables, or predictors. The IVs, or predictors, can be continuous (interval/ratio) or categorical (ordinal/nominal).

g. What type of data would you use with logistic regression?

Logistic Regression is used when the dependent variable(target) is categorical. For


example, To predict whether an email is spam (1) or (0) Whether the tumor is
malignant (1) or not (0)

h. What is a good sample size for logistic regression?
In conclusion, for observational studies that involve logistic regression in the analysis,
this study recommends a minimum sample size of 500 to derive statistics that can
represent the parameters in the targeted population

Ex.No: 10
Demonstration of Multiple-Linear Regression
Date :

1. Problem statement :

Write a Python application program to demonstrate Multiple Linear Regression.

2. Expected Learning outcomes:

To train the students to understand Multiple Linear Regression using a Python program.

3. Problem analysis:

Multiple linear regression attempts to model the relationship between two or more features
and a response by fitting a linear equation to the observed data.

Clearly, it is nothing but an extension of simple linear regression.

Consider a dataset with p features (or independent variables) and one response (or
dependent variable).

Also, the dataset contains n rows/observations.

We define:

X (feature matrix) = a matrix of size n X p where x_{ij} denotes the values of jth feature
for ith observation.

so that

X = [ x11  x12  …  x1p
      x21  x22  …  x2p
      …
      xn1  xn2  …  xnp ]

and

y (response vector) = a vector of size n where y_{i} denotes the value of response for ith
observation.

The regression line for p features is represented as:

h(x_i) = b_0 + b_1·x_i1 + b_2·x_i2 + … + b_p·x_ip

where h(x_i) is the predicted response value for the ith observation and b_0, b_1, …, b_p are the regression coefficients.

Also, we can write:

y_i = h(x_i) + e_i, that is, y_i = b_0 + b_1·x_i1 + … + b_p·x_ip + e_i

where e_i represents the residual error in the ith observation. We can generalize our linear model a little bit more by adding a column of ones to the feature matrix X, so that the intercept b_0 is handled like any other coefficient. The linear model can then be expressed in terms of matrices as:

y = Xb + e

where b = [b_0, b_1, …, b_p]' is the coefficient vector and e = [e_1, e_2, …, e_n]' is the residual (error) vector.

Now, we determine an estimate of b, i.e. b’ using the Least Squares method.

As already explained, the Least Squares method tends to determine b’ for which total
residual error is minimized.

We present the result directly here:

b' = (X'X)^(−1) X'y

where ' represents the transpose of the matrix and ^(−1) represents the matrix inverse. Knowing the least squares estimate b', the multiple linear regression model can now be written as:

y' = Xb'

where y' is the estimated response vector.
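
To connect the matrix formula above to code, here is a minimal sketch (with a small made-up dataset) that estimates b' via the least squares solution of the normal equations; its output can be compared with scikit-learn's LinearRegression:

import numpy as np

# hypothetical data: n = 5 observations, p = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([6.0, 5.0, 12.0, 11.0, 16.0])

# prepend a column of ones so the first coefficient acts as the intercept b_0
X1 = np.column_stack([np.ones(len(X)), X])

# solve the least squares problem (equivalent to b' = (X'X)^-1 X'y)
b_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("Estimated coefficients (b_0, b_1, b_2):", b_hat)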

4. Algorithm:

A. Input: Boston house pricing dataset using Scikit-learn.


B. Output: List of all Coefficients, variance score and residual error plots
Step1: Get the multi-attribute dataset using the Scikit-learn data source.
Step 2: Create a regression object.
Step 3: Train the dataset with the regression model fit.
Step 4: Get and print the regression coefficients and variance.
Step 5. Plot the residual error.

5. Code:

import matplotlib.pyplot as plt


import numpy as np
from sklearn import datasets, linear_model, metrics

# load the boston dataset


boston = datasets.load_boston(return_X_y=False)

# defining feature matrix(X) and response vector(y)


X = boston.data
y = boston.target

# splitting X and y into training and testing sets


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# create linear regression object


reg = linear_model.LinearRegression()

# train the model using the training sets
reg.fit(X_train, y_train)

# regression coefficients
print('Coefficients: ', reg.coef_)

# variance score: 1 means perfect prediction


print('Variance score: {}'.format(reg.score(X_test, y_test)))

# plot for residual error

# setting plot style


plt.style.use('fivethirtyeight')

# plotting residual errors in training data


plt.scatter(reg.predict(X_train), reg.predict(X_train) - y_train,
color = "green", s = 10, label = 'Train data')

# plotting residual errors in test data


plt.scatter(reg.predict(X_test), reg.predict(X_test) - y_test,
color = "blue", s = 10, label = 'Test data')

# plotting line for zero residual error


plt.hlines(y = 0, xmin = 0, xmax = 50, linewidth = 2)

# plotting legend
plt.legend(loc = 'upper right')

# plot title
plt.title("Residual errors")

# method call for showing the plot


plt.show()
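
Note: load_boston was removed from recent scikit-learn releases (1.2 and later), so the code above may fail on a new installation. A simple workaround, sketched here, is to substitute another built-in regression dataset such as the California housing data:

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X, y = housing.data, housing.target   # use these in place of boston.data and boston.target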

6. Output

7. Result
8. Viva Voce

a. What is the difference between linear and multiple regression?

Linear regression attempts to draw a line that comes closest to the data by finding
the slope and intercept that define the line and minimize regression errors. If two
or more explanatory variables have a linear relationship with the dependent
variable, the regression is called a multiple linear regression

b. What is multiple regression used for?


Multiple regression is a statistical technique that can be used to analyze the
relationship between a single dependent variable and several independent variables.
The objective of multiple regression analysis is to use the independent variables whose
values are known to predict the value of the single dependent value.

c. Give few examples of multiple regression?


For example, if you're doing a multiple regression to try to predict blood pressure (the
dependent variable) from independent variables such as height, weight, age, and hours
of exercise per week, you'd also want to include sex as one of your independent
variables

d. When should we use multiple linear regressions?

You can use multiple linear regressions when you want to know: How strong the
relationship is between two or more independent variables and one dependent variable
(e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth)

e. How many variables can you use in a multiple regression?

Two or more. Multiple regression is widely used for predicting the value of one dependent variable from the values of two or more independent variables; when there are two or more independent variables, the model is called a multiple regression.

f. What are the assumptions of multiple regression?


Multivariate Normality–Multiple regression assumes that the residuals are normally
distributed. No Multicollinearity—Multiple regression assumes that the independent
variables are not highly correlated with each other. This assumption is tested using
Variance Inflation Factor (VIF) value

Ex.No: 11
Implementation of Time Series Analysis
Date :

1. Problem statement :

Implement a Python application program to analyze the characteristics of a given time series using the following data set:

https://www.kaggle.com/chirag19/air-passengers

2. Expected Learning outcomes:

To train the students to understand the various aspects of the inherent nature of a time series so that they are better informed to create meaningful and accurate forecasts.

3. Problem analysis:

Time series is a sequence of observations recorded at regular time intervals. Depending on


the frequency of observations, a time series may typically be hourly, daily, weekly,
monthly, quarterly and annual. Sometimes, you might have seconds and minute-wise time
series as well, like, number of clicks and user visits every minute etc.

Why even analyze a time series?

Because it is the preparatory step before you develop a forecast of the series. Besides, time
series forecasting has enormous commercial significance because stuff that is important to
a business like demand and sales, number of visitors to a website, stock price etc are
essentially time series data.

So what does analyzing a time series involve?

Time series analysis involves understanding various aspects about the inherent nature of
the series so that you are better informed to create meaningful and accurate forecasts.

Across industries, organizations commonly use time series data, which means any
information collected over a regular interval of time, in their operations. Examples include
daily stock prices, energy consumption rates, social media engagement metrics and retail demand, among others. Analyzing time series data yields insights like trends, seasonal patterns and forecasts of future events that can help generate profits. For example, by
understanding the seasonal trends in demand for retail products, companies can plan
promotions to maximize sales throughout the year.

4. Algorithm

Step1: Loading the time series dataset correctly in Pandas
Step2: Indexing in time-series data
Step3: Time-resampling using Pandas
Step4: Rolling time series
Step5: Plotting time-series data using Pandas
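
The resampling step (Step3 above) can be sketched as follows, assuming the Month column has already been parsed and set as a DatetimeIndex as done in the code below; it aggregates the monthly passenger counts into yearly means:

# assumes df has a DatetimeIndex (see the indexing step in the code below)
yearly = df['#Passengers'].resample('Y').mean()   # average passengers per year
print(yearly.head())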

5. Code & Output

To start, let’s import the Pandas library and read the airline passenger data into a data
frame:

import pandas as pd
df = pd.read_csv("AirPassengers.csv")

Now, let’s display the first five rows of data using the data frame head() method:

print(df.head())

Let’s take a look at the last five records the data using the tail() method:

print(df.tail())

We see that the data ends in 1960. The next thing we will want to do is convert the month
column into a datetime object. This will allow it to programmatically pull time values like
the year or month for each record. To do this, we use the Pandas to_datetime() method:

df['Month'] = pd.to_datetime(df['Month'], format='%Y-%m')

print(df.head())

Note that this process automatically inserts the first day of each month, which is basically
a dummy value since we have no daily passenger data.
The next thing we can do is convert the month column to an index. This will allow us to
more easily work with some of the packages we will be covering later:

df.index = df['Month']
del df['Month']
print(df.head())

Next, let’s generate a time series plot using Seaborn and Matplotlib. This will allow us to
visualize the time series data.
First, let’s import Matplotlib and Seaborn:

import matplotlib.pyplot as plt


import seaborn as sns

Next, let’s generate a line plot using Seaborn:

sns.lineplot(df)

And label the y-axis with Matplotlib:

plt.ylabel("Number of Passengers")

Analysis
Stationarity is a key part of time series analysis. Simply put, stationarity means that the
manner in which time series data changes is constant. A stationary time series will not
have any trends or seasonal patterns. You should check for stationarity because it not only
makes modeling time series easier, but it is an underlying assumption in many time series
methods.

Let’s test for stationarity in our airline passenger data. To start, let’s calculate a seven
month rolling mean:

rolling_mean = df.rolling(7).mean()
rolling_std = df.rolling(7).std()

Next, let’s overlay our time series with the seven month rolling mean and seven month
rolling standard deviation. First, let’s make a Matplotlib plot of our time series:

plt.plot(df, color="blue", label="Original Passenger Data")

Then the rolling mean:

plt.plot(rolling_mean, color="red", label="Rolling Mean Passenger Number")

And finally the rolling standard deviation:

plt.plot(rolling_std, color="black", label="Rolling Standard Deviation in Passenger Number")

Let’s then add a title

plt.title("Passenger Time Series, Rolling Mean, Standard Deviation")

And a legend:

plt.legend(loc="best")

Next, let's import the augmented Dickey-Fuller test from the statsmodels package (the adfuller function in statsmodels.tsa.stattools):

from statsmodels.tsa.stattools import adfuller

Next, let’s pass our data frame into the adfuller method. Here, we specify the autolag
parameter as “AIC”, which means that the lag is chosen to minimize the information
criterion:

adft = adfuller(df, autolag="AIC")

Next, let's store our results in a data frame and display it:

output_df = pd.DataFrame({"Values": [adft[0], adft[1], adft[2], adft[3], adft[4]['1%'], adft[4]['5%'], adft[4]['10%']],
                          "Metric": ["Test Statistics", "p-value", "No. of lags used", "Number of observations used", "critical value (1%)", "critical value (5%)", "critical value (10%)"]})

print(output_df)

We can see that our data is not stationary from the fact that our p-value is greater than 5
percent and the test statistic is greater than the critical value. We can also draw these
conclusions from inspecting the data, as we see a clear, increasing trend in the number of
passengers.
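
Since the series is not stationary, a common remedy (shown here as a sketch, not part of the original walkthrough) is first-order differencing, after which the Dickey-Fuller test can be re-run:

# first-order differencing removes the trend; drop the NaN produced in the first row
passengers_diff = df['#Passengers'].diff().dropna()
adft_diff = adfuller(passengers_diff, autolag="AIC")
print("p-value after differencing:", adft_diff[1])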

Forecasting

Time series forecasting allows us to predict future values in a time series given current and
past data. Here, we will use the ARIMA method to forecast the number of passengers.
ARIMA allows us to forecast future values in terms of a linear combination of past values.
We will use the auto_arima package, which will allow us to forgo the time consuming
process of hyperparameter tuning.

First, let’s split our data for training and testing and visualize the split:

df['Date'] = df.index
train = df[df['Date'] < pd.to_datetime("1960-08", format='%Y-%m')]
train['train'] = train['#Passengers']
del train['Date']
del train['#Passengers']
test = df[df['Date'] >= pd.to_datetime("1960-08", format='%Y-%m')]
del test['Date']
test['test'] = test['#Passengers']
del test['#Passengers']
plt.plot(train, color="black")
plt.plot(test, color="red")
plt.title("Train/Test split for Passenger Data")
plt.ylabel("Passenger Number")
plt.xlabel('Year-Month')
sns.set()
plt.show()

Let's import auto_arima from the pmdarima package, train our model and generate predictions:

from pmdarima.arima import auto_arima

model = auto_arima(train, trace=True, error_action='ignore', suppress_warnings=True)
model.fit(train)
forecast = model.predict(n_periods=len(test))
forecast = pd.DataFrame(forecast, index=test.index, columns=['Prediction'])
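
As a brief follow-up (a sketch that assumes the train/test split above), the forecast can be plotted against the held-out test data and scored with the root mean squared error:

import numpy as np

plt.plot(train, label="Train")
plt.plot(test, label="Test")
plt.plot(forecast, label="Prediction")
plt.legend(loc="best")
plt.show()

rmse = np.sqrt(np.mean((test['test'].values - forecast['Prediction'].values) ** 2))
print("RMSE:", rmse)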

6. Output
7. Result
8. Viva Voce
a. How do you analyze time series?

Step 1: Visualize the Time Series. It is essential to analyze the trends prior to
building any kind of time series model. ...
Step 2: Stationarize the Series. ...
Step 3: Find Optimal Parameters. ...
Step 4: Build ARIMA Model. ...
Step 5: Make Predictions

b. What is the purpose of time series analysis?


Time series analysis helps organizations understand the underlying causes of trends
or systemic patterns over time. Using data visualizations, business users can see
seasonal trends and dig deeper into why these trends occur.

c. Why do we decompose time series?


Time series decomposition involves thinking of a series as a combination of level,
trend, seasonality, and noise components. Decomposition provides a useful abstract
model for thinking about time series generally and for better understanding
problems during time series analysis and forecasting.

d. What are the characteristics of a time series?


When plotted, many time series exhibit one or more of the following features:
Trends.
Seasonal and nonseasonal cycles.
Pulses and steps.
Outliers.

e. What are the 3 key characteristics of time series data?


Main idea: the 3 basic characteristics of a time series are stationarity, trend and seasonality. Prerequisites: the definition of a time series, and statistics such as mean, variance and covariance.

f. How many models are there in time series?


Types of Models
There are two basic types of "time domain" models: (1) models that relate the present value of a series to past values and past prediction errors, which are called ARIMA models (for Autoregressive Integrated Moving Average), and (2) ordinary regression models that use time indices and other variables as predictors.
