2021 Stat Notes
Regression
Linear Regression
• After having established the fact that 2 variables are closely related, we should
estimate/predict the value of one variable given the value of another.
• Eg. If advertising and sales are correlated, we can find out:
• The expected amount of sales for a given advertisement expenditure or
• The required amount of expenditure for achieving a fixed sales target
• The statistical tool with the help of which we are in a position to estimate or predict the
unknown values of one variable from known values of another variable is called regression
• It helps to find the average probable change in one variable given a certain amount of change
in another
• Regression is a technique for measuring the linear association between a dependent (Y) and
an independent variable (X)
• Regression analysis attempts to predict the values of a continuous DV from specific values of
the IV
3 advantages of regression analysis
• It provides estimates of the values of the DV from the values of the IV, with
the help of the regression line, which describes the average relationship
existing between X and Y variables
• It calculates the standard error: the error involved in using the
regression line as a basis for estimation
• If the scatter is lesser and the line fits the data closely, it means that there is
relatively little scatter of observations around the regression line. This means,
we can make a good estimate of Y, the DV
• But, if the observations are scattered around the fitted regression line, it will
not produce accurate estimates of the DV
Difference between correlation and
regression
Correlation:
• Precedes regression
• Tool for ascertaining the degree of relationship
Regression:
• Succeeds correlation
• Tool for studying the nature of the relation, i.e. the cause-and-effect
relationship
Linear Bivariate Regression Model
• In this, we proceed by observing the sample data,
• and use the results obtained as estimates of the corresponding population relationship
• For a bivariate population, the model chosen is simple linear regression model
• Assumptions:
• The value of Y is dependent upon the value of X
• The average relationship between X & Y can be described as a linear equation
Y=a+bX, which gives a straight line graph
• Y: DV,
• X: IV,
• a: Y-intercept and
• b : slope, i.e. the average amount of change of Y per unit of change in the value of
X. The sign of b indicates the type of relationship between X & Y (direct/ inverse)
Regression lines
• Considering 2 variables X & Y, we have 2 regression lines
• Regression line of X on Y (is the line which gives the best estimate for the value of X for any specified value of
Y)
• Regression line of Y on X (is the line which gives the best estimate for the value of Y for any specified value of
X)
• The farther these lines are from each other, the lesser the degree of
correlation between them and vice versa
• The 2 regression lines show the average relationship between the 2 variables
based on the 2 equations known as Regression equations
• Regression equations: are algebraic expressions of the regression lines. Since
there are 2 regression lines, there are 2 regression equations (RE)
• RE of X on Y: describes the variations in the values of X for given change(s) in Y
and
• RE of Y on X: describes the variations in the values of Y for given change(s) in X
Regression Equation of Y on X
• Y= a + b X
• Y is the DV or criterion variable to be estimated and X is the IV or the predictor variable
• a & b are 2 unknown constants (fixed numerical values) which determine the position of the line
completely
• The constants are called parameters of the line
• If the value of either or both of them is changed, another line is determined
• Parameter ‘a’ determines the level of the fitted line (i.e. the distance of the line directly above or
below the origin)
• Parameter ‘b’ determines the slope of the line (i.e. change in Y for unit change in X)
• Once the values of a and b are obtained, we can determine the line. This is done by the METHOD OF
LEAST SQUARES
• It states that the line should be drawn through the plotted points in such a manner that the sum of the
squares of the vertical deviations of the actual Y values from the estimated Y values IS THE LEAST
• i.e. ∑(Actual Y − Estimated Y)² is minimum
• Such a line is called the LINE OF BEST FIT
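The method of least squares described above can be sketched in a few lines of Python; the (x, y) data here are purely illustrative:

```python
# Least-squares fit of the line of best fit Y' = a + bX.
# Hypothetical data: x could be advertising expenditure, y sales.
def least_squares(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    a = sy / n - b * sx / n                        # Y-intercept
    return a, b

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = least_squares(x, y)
print(a, b)              # fitted line y' = a + b*x
print(a + b * 6)         # estimate of Y for X = 6
```

Note that the fitted line has the stated property: the sum of the vertical deviations of the actual Y values from the estimated Y values is zero, and the sum of their squares is the least possible.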
The line of best fit
Unit 6
• Introduction to Probability
• Basic Concepts of Probability
Uncertainties
An experiment is any process that generates well-defined outcomes. The sample
space for an experiment is the set of all experimental outcomes. An
experimental outcome is also called a sample point.
Probability as a Numerical Measure
of the Likelihood of Occurrence
(Probability scale: 0 = impossible, 0.5 = as likely as not, 1 = certain)
Two basic requirements for assigning probabilities:
1. 0 ≤ P(Ei) ≤ 1, where Ei is the ith experimental outcome
and P(Ei) is its probability
2. The sum of the probabilities for all experimental
outcomes must equal 1
Subjective Method
Assigning probabilities based on judgment
Classical Method
Assigning probabilities assuming equally likely outcomes: if an experiment
has n possible outcomes, the classical method would assign a probability of
1/n to each outcome.
Example: rolling a die, where each of the 6 faces has probability 1/6.
Venn Diagram
(Event A and its complement Ac shown within the sample space S)
Union of Two Events
(Venn diagram: events A and B within the sample space S; the union A ∪ B
contains all sample points in A, in B, or in both)
Intersection of Two Events
(Venn diagram: events A and B within the sample space S; the overlap is the
intersection of A and B, A ∩ B)
Addition Law
(Venn diagram: events A and B within the sample space S)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Mutually Exclusive Events
For mutually exclusive events P(A ∩ B) = 0, so there is no need to
include “− P(A ∩ B)”: P(A ∪ B) = P(A) + P(B)
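The addition law can be checked numerically on the die-rolling experiment; a small sketch using exact fractions:

```python
# Checking the addition law P(A U B) = P(A) + P(B) - P(A n B)
# on the classical die-rolling experiment (each outcome has probability 1/6).
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}          # sample space
A = {2, 4, 6}                   # event: even number
B = {4, 5, 6}                   # event: number greater than 3

def P(event):
    return Fraction(len(event), len(S))

lhs = P(A | B)                  # P(A U B) counted directly
rhs = P(A) + P(B) - P(A & B)    # addition law
print(lhs, rhs)                 # both equal 2/3
```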
Independent Events
Two events A and B are independent if P(A ∩ B) = P(A) × P(B)
• Importance
• Components
• Trend
• Free hand method
• Methods of semi-averages, moving averages and least squares
• Problems based on these
• Forecasting, methods of forecasting
Introduction
• When estimates of future conditions are made on a systematic
basis, the process is referred to as forecasting
• The figure or statement obtained is called forecast
• Forecasting is a service whose purpose is to offer the best available
basis for ‘management expectations of the future’
• Forecasting aims at reducing the area of uncertainty that surrounds
management decision-making, with respect to costs, profit, sales,
production, pricing, capital investment etc.
• Forecasting is the process of making predictions of the future,
based on the past & present data by the analysis of trends
• The knowledge of forecasting methods is essential for decision
makers to make reliable and accurate estimates and assess or
evaluate the future consequences of decisions in the face of
uncertainty
What Is Forecasting?
• An essential tool in any decision-making process
• Process of predicting a future event
• Underlying basis of all business decisions
• Planning production in expectation of certain levels of
sales
• Building warehouses in expectation of certain levels of
stocks and sales
• Setting prices in expectation of certain levels of raw
material costs, financial constraints, wages and sales
• Recruiting labour, buying materials, arranging finance or planning
factories in expectation of certain levels of sales and
other activity
Objectives of forecasting
• Creating plans of action
• Monitoring its progress
• Developing a warning-system of the critical factors
Types of forecasts
• Demand forecasts: prediction of demands for products/services based
on the sales and marketing information
• Environmental forecasts: concerned with the social, political and
economic environments of the state/ country
• Technological forecasts: concerned with the new developments in
existing technologies
Timing of forecasts
• Short-range: 0-3 months, may go up to 1 year; for job scheduling, work
force levels, job assignments etc.
• Uses mathematical techniques like moving averages, exponential
smoothing, trend exploration
• Medium-range: 1-3 yrs time span- used for sales planning, production
planning, budgeting etc.
• Long-range: >=3 yrs: for designing/ installing new plants, facility
location, R&D etc.
Forecasting Approaches
Qualitative Methods:
• Used when situation is vague & little data exist
• New products, new technology
• Involve intuition, experience
Quantitative Methods:
• Used when situation is ‘stable’ & historical data exist
• Existing products, current technology
• Involve mathematical techniques
Quantitative Forecasting - process
• Limitations of the free hand method:
• Highly subjective, as the trend line depends on personal judgement
• Time-consuming
Smoothing methods
• This provides pattern of movements in the data over time, by eliminating random
variations due to irregular components of time series
• 3 smoothing methods are:
Moving averages: a subjective method which depends on the length of the period
for calculating moving averages
• It is a technique to get an overall idea of the trends in a data set
• It is described as moving because old data points get replaced by new figures in its calculation
• Focuses on long-term trend in a time series
Weighted moving averages: a moving average where some time periods are
weighted differently than others
• The most recent observations are assigned larger weights, and the weight
decreases for older data values
• WMA = ∑(weight for period n × data value in period n) / ∑ weights
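The WMA formula can be sketched as follows; the data (the weekly sales series used later in these notes) and the weights 1, 2, 3 (oldest to newest) are illustrative:

```python
# Weighted moving average forecast: the most recent observation gets the
# largest weight. The weights 1, 2, 3 (oldest..newest) are illustrative.
def wma_forecast(series, weights):
    recent = series[-len(weights):]               # the last k observations
    num = sum(w * y for w, y in zip(weights, recent))
    return num / sum(weights)

sales = [17, 21, 19]                          # weeks 1-3 (thousands of Rs.)
f4 = wma_forecast(sales, weights=[1, 2, 3])   # forecast for week 4
print(f4)                                     # (1*17 + 2*21 + 3*19) / 6
```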
Semi-average method: used to estimate the slope and intercept of the trend line, if
time series is represented by a linear function
Steps in semi-average method
1. Data are divided into 2 parts
2. Their respective arithmetic means are computed
3. These 2 means are plotted corresponding to the midpoint of the data/
class interval covered by the respective part
4. These points are joined by a straight line to get the required trend line
5. The AM of the 1st part is the intercept value ‘a’
6. Slope: ratio of the difference between the two AMs to the number of years
between them, b = Δy/Δx = (AM1 − AM2)/(year 1 − year 2), where year 1 and
year 2 are the midpoints of the two parts
7. Time series of the form y’ =a+bx, where y’ is the predicted y
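The steps above can be sketched in Python; the annual data for 2015-2020 are hypothetical:

```python
# Semi-average trend line: split the series into two halves, average each
# half, and join the two means. Hypothetical annual data.
years  = [2015, 2016, 2017, 2018, 2019, 2020]
values = [10, 12, 14, 16, 18, 20]

half = len(values) // 2
am1 = sum(values[:half]) / half          # AM of first part
am2 = sum(values[half:]) / half          # AM of second part
mid1 = years[half // 2]                  # midpoint year of first part
mid2 = years[half + half // 2]           # midpoint year of second part

b = (am2 - am1) / (mid2 - mid1)          # slope
a = am1                                  # intercept (AM of the 1st part)

def trend(year):                         # y' = a + b*x, x measured from mid1
    return a + b * (year - mid1)

print(b, trend(2021))                    # slope and a one-year-ahead estimate
```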
Trend projection method
(linear, exponential or quadratic)
• It fits a trend line to a time series data and then projects medium-to-
long range forecasts
• Helps to describe the long-term general direction of any business
activity over a long period of time
• The study of trend facilitates making intermediate and long-term
forecasting projections
Linear projection method
• The method of least squares from regression analysis is used to find
the trend line of best fit to a time series data.
• This line is defined as y’ = a + bx, where
• y’ is the predicted value of the DV
• a is the y intercept
• b is the slope of regression line Δy/ Δx
• x is the IV represented as time in year/ month etc.
Characteristics of the trend line of best fit
• The sum of all vertical deviations about the line of best fit is zero
• ∑(y-y’)=0
• The sum of all vertical deviations squared is minimum
• ∑(y-y’)^2 is the least
• The line of best fit passes through the mean values of variables x & y
• The values of the 2 constants a & b can be found by the simultaneous
solution of the normal equations: ∑y = na + b∑x and ∑xy = a∑x + b∑x²
What is a Time Series?
• It is used to detect patterns of change in statistical
information over regular intervals of time
• We project these patterns to arrive at an estimate for the
future
• Thus it helps to cope with uncertainty about the future
Time Series Patterns
Components of a time series: Secular trend, Cyclical variations,
Seasonal variations, Irregular variations
(© Wiley 2010)
Secular Trend Component
Value of the variable tends to increase or decrease over a long
period of time: an overall upward or downward movement in the
average value of the forecast variable
• Due to long-term factors like population increase, changing
demographic characteristics, technology, consumer preferences
etc.
• Data taken over a period of years
(Chart: Sales vs Time, showing an upward trend)
Seasonal Component
• Fluctuations are repeated within a year- daily, weekly, monthly, quarterly etc.
• Regular patterns of upward or downward swings- high degree of regularity
• Due to climate, weather, customs, traditions etc.- tend to be repeated from year to
year (Observed Within One Year mostly)
• daily traffic volume shows within-the-day “seasonal” behavior, with peak levels
occurring during rush hours, moderate flow during the rest of the day and early
evening, and light flow from midnight to early morning
(Chart: Sales vs Time (monthly or quarterly), with recurring seasonal swings
across Winter, Spring, Summer and Fall)
Irregular Component
• Rapid changes caused by short-term, unanticipated and non-recurring
factors
• Due to random variation or unforeseen events
• Nature (flood, earthquake), accidents, Union strike, War
• Erratic, unpredictable, unsystematic, random, ‘residual’ fluctuations
• Short duration & non-repeating
Time Series Forecasting
(Flowchart: Time Series → Trend? If No → Smoothing Methods: Moving Average,
Exponential Smoothing. If Yes → Trend Models: Linear, Quadratic, Exponential,
Auto-Regressive)
3 forecasting methods
• Three forecasting methods that are appropriate for a time series with
a horizontal pattern:
• moving averages, weighted moving averages, and exponential smoothing
• Since the objective of each of these methods is to “smooth out” the random
fluctuations in the time series, they are referred to as smoothing
methods
Moving Averages
• The moving averages method uses the average of the most
recent data values in the time series as the forecast for the
next period
• The term moving is used because every time a new
observation becomes available for the time series, it replaces
the oldest observation in the equation and a new average is
computed
• Eg: sales of women’s blouses in the first three weeks (in
thousands of Rs.) are 17, 21 and 19
• forecast of sales in week 4 using the average of the time
series values in weeks 1–3, F4= average of weeks 1-
3=(17+21+19)/3=19
• Thus, the moving average forecast of sales in week 4 is 19 or
Rs. 19,000
• But the actual value observed in week 4 is 23, so the forecast
error in week 4 is 23 − 19 = 4 (Rs. 4000)
• Next, we compute the forecast of sales in week 5 by averaging the time
series values in weeks 2–4.
• F5 average of weeks 2-4= (21+19+23)/3 =21
• Hence, the forecast of sales in week 5 is 21; the actual value observed in
week 5 being 18, the error associated with this forecast is 18 − 21 = −3
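The moving-average example above can be reproduced as a short sketch:

```python
# Three-week moving average forecast for the blouse-sales series.
def moving_average_forecast(series, k=3):
    return sum(series[-k:]) / k       # average of the k most recent values

sales = [17, 21, 19]                  # weeks 1-3 (thousands of Rs.)
f4 = moving_average_forecast(sales)   # forecast for week 4
sales.append(23)                      # actual week-4 value arrives
f5 = moving_average_forecast(sales)   # forecast for week 5 uses weeks 2-4
print(f4, f5)                         # 19.0 21.0
```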
WEIGHTED MOVING AVERAGES
• Select a different weight for each data value and then compute a
weighted average of the most recent values as the forecast.
• In most cases, the most recent observation receives the most weight, and
the weight decreases for older data values.
Exponential smoothing
• Exponential smoothing also uses a weighted average of past time
series values as a forecast; it is a special case of the weighted moving
averages method in which we select only one weight—the weight for
the most recent observation.
• The weights for the other data values are computed automatically
and become smaller as the observations move farther into the past
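A sketch of exponential smoothing on the same hypothetical weekly sales; the smoothing constant α = 0.2 and the convention that the first forecast equals the first observation are assumptions for illustration:

```python
# Exponential smoothing: F(t+1) = alpha*Y(t) + (1 - alpha)*F(t).
# Convention assumed here: the first forecast equals the first observation.
def exp_smooth_forecasts(series, alpha):
    forecasts = [series[0]]                      # F2 = Y1
    for y in series[1:]:
        f = alpha * y + (1 - alpha) * forecasts[-1]
        forecasts.append(f)
    return forecasts                             # F2, F3, F4, ...

sales = [17, 21, 19, 23]                # weekly sales (thousands of Rs.)
f = exp_smooth_forecasts(sales, alpha=0.2)
print(f)                                # F2 = 17, F3 ≈ 17.8, F4 ≈ 18.04
```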
Steps in forecasting
• Define organizational objective of forecasting
• Select the variables to be forecasted- eg. Capital investment,
employment level, etc.
• Determine the time horizon- short/ medium or long-term of the
forecast, to predict the future
• Select appropriate forecasting method
• Collect relevant data for forecasting
• Make the forecast and implement the results
Unit – VII: Introduction to Inferential
Statistics
• Meaning & Purpose of inferential statistics, Introduction to testing of Hypothesis:
Procedure for testing hypothesis - Setting of Hypothesis -Null and alternative
hypotheses,
• Computation of Test statistics ( simple problems)-
• Types of errors in hypothesis testing - Level of significance, Critical region and
value - Decision making.
• Test of significance for Large and small sample tests, Z and t tests for mean and
proportion,
• One way ANOVA, Chi-square test for goodness of fit and independence of
attributes.
• (Simple problems)
Hypothesis
• An unproven statement or supposition that tentatively explains
certain facts or phenomena; a proposition that is empirically testable
• Null hypothesis (H0): a statement in which no difference or no effect
is expected. It is the hypothesis that is always tested.
• Alternative hypothesis (H1): A statement that some difference or
effect is expected. It is a statement indicating the opposite of the null
hypothesis
The role of hypothesis in a research study
Types of hypothesis
1. Hypotheses based on Empirical Uniformities.
2. Hypotheses based on association between Variables.
Stating hypotheses
Stated in declarative form.
States generally a relationship between variables.
Ideally reflects the theoretical framework of the study
based on a theory/body of literature.
Is brief and to the point.
Example of Hypothesis
(Table: a Research Idea, its Objective, and the corresponding Hypothesis)
Right-tailed test: the critical (rejection) region is the right tail, beyond +zα.
Left-tailed test: the critical region is the left tail, beyond −zα.
2-tailed test: the area of acceptance (95%) lies between the critical values
−zα/2 and +zα/2; at α = 0.05 these are ±1.96, with 0.025 in each tail.
What does the critical region mean?
• The critical region of the sampling distribution of a
statistic is also known as the alpha region.
• The critical region of a hypothesis test is the set of
all outcomes which, if they occur, cause the H0 to be
rejected and the H1 accepted.
• The values within the acceptance region are called
acceptable at the 95% Confidence Level and if we
find that our sample mean lies within this region, we
would conclude that Ho is true and we accept it.
• The critical region CR, or rejection region RR, is a set of values of the test
statistic for which the null hypothesis is rejected in a hypothesis test.
• That is, the sample space for the test statistic is partitioned into two regions;
one region (the critical region) will lead us to reject the null hypothesis Ho, the
other will not.
• So, if the observed value of the test statistic is a member of the critical region,
we conclude "Reject Ho"; if it is not a member of the critical region then we
conclude "Do not reject Ho".
• For instance, if the acceptance region at a 95% confidence level is between
10 and 20, a sample mean falling inside that range leads us to conclude
"Do not reject Ho".
Hypotheses pertaining to left, right and 2-tailed tests
• There are three ways to set up the null and alternative
hypotheses, mathematically.
Type I and Type II errors
• Whenever we draw inferences about a population, there is a risk that
an incorrect conclusion will be reached
• Two types of errors can occur: Type I error and Type II error
• H0: Patient is alive (because null hypothesis represents no
change)
• H1: Patient is not alive (dead)
• Possible states of nature: (based on H0)
• Patient is alive (H0 true & H1 false)
• Patient is dead (H0 false & H1 true)
• Decisions are something the researcher has control over; we
make a correct or an incorrect decision
• Possible decisions (based on Ho) / conclusions (based on claim)
• Reject Ho: sufficient evidence to say patient is dead
• Fail to reject or accept Ho: insufficient evidence to say patient is
dead
• Four possibilities that can occur based on 2 possible states of
nature and the 2 decisions which we can make:
Testing of Hypotheses
Errors in hypothesis testing

Condition      Decision: Accept H0    Decision: Reject H0
H0 is True     Correct Decision       Type I Error
H0 is False    Type II Error          Correct Decision
Type I error
• Occurs when the sample results lead to the rejection of H0 when it is in fact true
• Type I error in this eg. would occur if we concluded, based on the sample data,
that the proportion of customers preferring the new service plan was greater than
0.40 when in fact it was less than or equal to 0.40 (i.e. when H0 is true). Hence
we make the mistake of introducing the service, incurring a huge loss.
• The probability of Type I error (α) is also called the level of significance
Type II error (β)
• Occurs when, based on the sample results, H0 is not rejected when it is
in fact false.
• In our eg., a Type II error would occur if we concluded, based on the sample
data, that the proportion of customers preferring the new service plan was less
than or equal to 0.40 when, in fact, it was greater than 0.40. Hence we make
the mistake of not introducing the service.
1. Test of hypothesis concerning
population mean
• Test concerning mean of one population
To test Ho: μ = μo against
a) H1: μ > μo
b) H1: μ < μo
c) H1: μ ≠ μo
• A sample of size n (n>30) is taken from the population with unknown
mean μ and known SD σ
• Let x be the sample mean
• Test statistic: z = (x − μo) / (σ/√n)
Case a) H1: μ> μo
• This is right tailed test.
• The rule is: “If z > zα (the tabled value), the test is significant. There is a
significant difference between the sample mean and the hypothetical
mean, and hence we reject Ho at (1−α)100% confidence level
(i.e. at the α level of significance)”
• “If z < zα (the tabled value), the test is not significant. There is no
significant difference between the sample mean and the hypothetical
mean, and we fail to reject Ho at (1−α)100% confidence level
(i.e. we accept Ho at the α level of significance)”
Case b) H1: μ< μo
• This is left tailed test.
• The rule is: “If z ≤ −zα (the tabled value), the test is significant. There is a
significant difference between the sample mean and the hypothetical
mean, and hence we reject Ho at (1−α)100% confidence level”
• “If z > −zα (the tabled value), the test is not significant. There is no
significant difference between the sample mean and the hypothetical
mean, and we fail to reject Ho at (1−α)100% confidence level”
Case c) H1: μ ≠ μo
• This is two tailed test.
• The rule is: “If the absolute value of z, i.e. |z| > zα/2 (the tabled value), the test is
significant. There is a significant difference between the sample mean and the
hypothetical mean, and hence we reject Ho at (1−α)100% confidence level”
• “If |z| < zα/2 (the tabled value), the test is not significant. There is no significant
difference between the sample mean and the hypothetical mean, and
we fail to reject Ho at (1−α)100% confidence level”
• To save time and effort, the table below relates critical z values to alpha
levels and type of test (one-tailed or two-tailed).

Alpha   Tails   Critical z
0.05    two     ±1.96
0.05    right   +1.645
0.05    left    −1.645
0.01    two     ±2.58
0.01    right   +2.33
0.01    left    −2.33
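These tabled values can be verified with the standard-normal inverse CDF in Python's standard library (`statistics.NormalDist`, available from Python 3.8), a small verification sketch:

```python
# Verifying the table of critical z values with the standard-normal
# inverse CDF from the Python standard library.
from statistics import NormalDist

z = NormalDist()                          # standard normal: mean 0, sd 1

two_tailed_05 = z.inv_cdf(1 - 0.05 / 2)   # ~ 1.96
right_05      = z.inv_cdf(1 - 0.05)       # ~ 1.645
two_tailed_01 = z.inv_cdf(1 - 0.01 / 2)   # ~ 2.58
right_01      = z.inv_cdf(1 - 0.01)       # ~ 2.33
print(two_tailed_05, right_05, two_tailed_01, right_01)
```

By symmetry, the left-tailed values are the negatives of the right-tailed ones.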
Practice 1
• A sample of 100 students is taken from the students of a college, with
heights having standard deviation 10 cm. The mean height of the
sample of students was found to be 168.8 cm. Can we accept the
assumption that the mean height of the students of the college is 170
cm? Significance level = 0.05
Solution 1
• σ=10
• x = 168.8
• n=100
• To test Ho: μ = 170 against H1: μ ≠ 170
• This is a 2-tailed test
• α= 0.05, then zα/2 = 1.96
• Applying the formula, z= -1.2
• Here IzI < zα/2 and hence we accept the assumption that the mean
height of the students of the college is 170 cm
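The calculation above can be reproduced in a few lines of Python (a verification sketch):

```python
# Two-tailed z test for Practice 1: Ho: mu = 170 vs H1: mu != 170.
from math import sqrt

sigma, n = 10, 100
x_bar, mu0 = 168.8, 170
z_crit = 1.96                      # z(alpha/2) at alpha = 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))
print(z)                           # ~ -1.2
print(abs(z) < z_crit)             # True -> fail to reject Ho
```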
Practice 2
• A sample of 400 observations was taken from a population with
standard deviation 15. If the mean of the sample is 27, test the
hypothesis that the mean of the population is less than 24.
• α = 0.05
Solution 2
• To test Ho: μ = 24 against H1: μ < 24
• σ = 15, α = 0.05
• x = 27, so −zα = −1.645
• n = 400
• Applying the formula, z = (27 − 24)/(15/20) = 4
• This is a left-tailed test. Since z > −zα, the test is not
significant. We accept Ho at 95% CL.
• Hence, the mean is reasonably accepted to be 24.
2. Test of hypothesis concerning population
proportions
• Test concerning one population proportion
To test Ho: p = po against (read po as “p-nought”, the hypothesized proportion)
a) H1: p > po
b) H1: p < po
c) H1: p ≠ po
• Given a sample of size n from the population.
• x is the number of items having a particular characteristic
• Sample proportion p=x/n
• Formula to calculate z: z = (p − po) / √(po(1 − po)/n)
Practice 1
• In a survey of 70 business firms, it was found that 45 are planning to
expand their capacities next year. Does the sample information
contradict the hypothesis that 70% of the firms are planning to
expand next year?
Solution
• To test H0: p = 0.7 against H1: p ≠ 0.7
• This is a 2-tailed test. At α = 0.05, zα/2 = 1.96
• n = 70, x = 45
• p = x/n = 45/70 ≈ 0.643
• z = (p − po)/√(po(1 − po)/n)
= (0.643 − 0.70)/√((0.7 × 0.3)/70)
= −0.057/0.0548 ≈ −1.04
|z| = 1.04 < zα/2 (i.e. 1.96)
The test is not significant and we accept Ho at 95% CL. There is no reason to doubt the
hypothesis that 70% of the companies are going to expand their capacities.
• Practice 2
• An e-commerce research company claims that 60%
or more graduate students have bought
merchandise on-line. A consumer group is
suspicious of the claim and thinks that the
proportion is lower than 60%. A random sample of
80 graduate students show that only 22 students
have ever done so. Is there enough evidence to
show that the true proportion is lower than 60%?
Conduct the test at 5% Type I error rate, and use
the rejection region approaches.
• Left tailed test
• To test H0: p= 0.6 against H1: p< 0.6
• n=80, x=22; p= x/n =22/80=0.275;
• po = 0.6; 1- po = 0.4
• Z= (p- po) / √ po(1- po)/n
• =(0.275−0.6)/√[0.6×0.4]/80= −5.93
• Z< - zα; Test is significant and we reject Ho at
5% SL. There is enough evidence to show
that the true proportion is lower than 60%
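This proportion test can be reproduced as a short sketch:

```python
# Left-tailed z test for a proportion: Ho: p = 0.6 vs H1: p < 0.6.
from math import sqrt

n, x = 80, 22
p0 = 0.6
p_hat = x / n                                  # 0.275

z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
print(z)                                       # ~ -5.93
print(z < -1.645)                              # True -> reject Ho at 5% SL
```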
3. Test of hypothesis concerning Chi Square
Statistic
• When the assumption that ‘the samples are drawn from a normal population’ cannot be
justified, we use statistical procedures generally referred to as non-parametric tests.
• Chi square is one such test belonging to this category, first used by Karl Pearson
Properties
• Chi square distribution is a continuous probability distribution which has a
value zero as its lower limit and extends to infinity in the positive direction
• It can never have a negative value, as the difference between observed
and expected frequencies is squared
• The exact shape of the distribution depends upon the degrees of freedom
• For a small df, the shape of the curve is positively skewed. As the df
becomes larger, it becomes symmetrical and approximates to the shape of
a normal distribution
• It makes no assumptions about the population being sampled
• The greater the chi square value, the greater is the discrepancy between
observed and expected frequencies
3. Test of hypothesis concerning Chi Square
Statistic
Calculate the chi square statistic by completing the following steps:
• For each observed number in the table, subtract the corresponding
expected number (O − E).
• Square the difference: (O − E)².
• Divide the square obtained for each cell in the table by the expected
number for that cell: (O − E)²/E.
• Sum all the values of (O − E)²/E. This is the chi square statistic.
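As a sketch, the steps can be applied to a hypothetical goodness-of-fit example: testing whether a die is fair from 60 hypothetical rolls (the tabled chi-square for df = 5 at α = 0.05 is 11.07):

```python
# Chi-square goodness-of-fit sketch: is a die fair? The observed counts
# from 60 rolls are hypothetical; expected frequency under Ho is 10 per face.
observed = [8, 12, 9, 11, 10, 10]
expected = [sum(observed) / len(observed)] * len(observed)

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_sq)                 # 1.0

# df = 6 - 1 = 5; tabled chi-square at alpha = 0.05 is 11.07
print(chi_sq > 11.07)         # False -> not significant, accept Ho
```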
Expected value
• Eij = (Ri * Cj)/n
where
Ri= total observed frequency in the ith row
Cj= total observed frequency in the jth column
and n is the sample size
Conditions
• Minimum 50 observations in the sample (n > 50)
• Each cell frequency should not be less than 5 observations, otherwise
increase the sample size per cell
• The data should be expressed in original units (frequencies/counts),
i.e. frequencies and not in percentage or ratio form
• Sample data to be drawn at random from the target population
• Formulate the null hypothesis and determine the expected
frequency of each answer
• Ho: the two attributes are independent/ there is no association
between the attributes
• H1: X is dependent on Y/ there is a significant association
between the 2 attributes
• Determine the appropriate significance level
• Calculate the chi-square value, using the observed (from
sample) and expected frequencies
• Make the statistical decision by comparing the calculated chi
square with the critical (tabled) value
Decision rule- X2 test
• The tabled X² value is X²(k−1, α), where k is the number of classes and k−1
is the degrees of freedom used to find the tabled value.
• If calculated X² > tabled X²(k−1, α), the test is significant and we reject H0 at
(1−α)100% CL.
• Otherwise we accept H0
Problem
• A company has to choose among three proposed pension plans. The company
wishes to test the hypothesis ‘preference for plans is independent of job
classification’. It asks the opinion of a sample of employees and obtains the
information presented in the table. Test the hypothesis which the company
wishes to do.

Observed frequencies (no. of employees favouring each plan):

Job classification    Plan A   Plan B   Plan C   Total
Factory employees       160      30       10      200
Clerical employees      148      40       20      208
Supervisors              72      10       10       92
Executives               70      20       10      100
Total                   450     100       50      600

Expected frequencies, Eij = (Ri × Cj)/n:

Factory employees       150     33.33    16.67
Clerical employees      156     34.67    17.33
Supervisors              69     15.33     7.67
Executives               75     16.67     8.33

Ho: preference for plans is independent of job classification
H1: preference for plans is dependent on job classification

X² = ∑(O − E)²/E ≈ 9.34 (for example, the Plan A cells contribute
100/150 = 0.67, 64/156 = 0.41, 9/69 = 0.13 and 25/75 = 0.33)
dof = (4 − 1) × (3 − 1) = 6
X² calc (9.34) < X² 6,0.05 (= 12.59), so the test is not significant: we accept
Ho that preference for plans is independent of job classification.
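The chi-square computation for this table can be verified in Python (the tabled value 12.59 for df = 6, α = 0.05 is from the chi-square table):

```python
# Chi-square test of independence for the pension-plan table:
# expected frequencies Eij = (row total * column total) / n.
observed = [
    [160, 30, 10],   # factory employees
    [148, 40, 20],   # clerical employees
    [72, 10, 10],    # supervisors
    [70, 20, 10],    # executives
]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi_sq += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)   # (4-1)*(3-1) = 6
print(round(chi_sq, 2), df)        # ~ 9.34, df = 6
print(chi_sq > 12.59)              # False -> accept Ho (independence)
```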
• This has a X2 distribution with (r-1)*(c-1) degrees of freedom
• If calculated X2 > tabled X2 (r-1)*(c-1) ,α, the test is significant and we reject H0
at (1-α)100% CL. There is evidence to believe that the two attributes
(variable 1 and variable 2) are dependent or related
• Otherwise we accept H0
• Cell: section of a table representing a specific combination of two
variables or a specific value of a variable
Chi square problem
• Of the 1000 workers in a factory exposed to Covid-19, 700 in all were
attacked; 400 had been inoculated, and of these 200 were attacked
• On the basis of this information, can it be said that inoculation and
attack are independent?
Table (cell frequencies derived from the information above):

                  Attacked   Not attacked   Total
Inoculated           200          200         400
Not inoculated       500          100         600
Total                700          300        1000
Tests based on statistics following Student’s t distribution:
the study of statistical inference with small samples
• If the original population is normally distributed and the SD of the
population is unknown, the sampling distribution of the mean
derived from small samples (n<30) will follow a t-distribution
• The shape of the t-distribution is influenced by its degrees of
freedom (d.o.f or d.f)
• The number of d.o.f is equal to the number of useful items of
information generated by a sample of given size with respect to the
estimation of a given population parameter
• It is calculated as df= n-1
• In statistics, Student's t-
distribution (or simply
the t-distribution) is a
probability distribution
that arises in the problem
of estimating the mean of
a normally distributed
population when the
sample size is small
• Assumptions:
1.Population is normal
2.SD of population is
unknown
• Properties:
1. It ranges from -∞ to +∞
2. It is bell shaped and symmetrical around the mean
3. Its shape changes with the change in df
4. It is more platykurtic than the normal distribution
5. As n approaches 30, the t-distribution approaches the normal form
• The t-table:
1. Its value is called tα or tα/2
2. Determined from the table given a particular df and level of significance
• A sample of size n (n<30) is taken from a normal population with
unknown population standard deviation
• Let x be the sample mean and s be the sample SD
• Then t = (x − μ) / (s/√n)
• μ is the hypothesized population mean
• s = √( ∑(xi − x)² / (n − 1) ), where the xi are the sample observations
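A sketch of the t computation with a hypothetical small sample (n = 8 packet weights, testing Ho: μ = 52; the tabled two-tailed t(0.05, 7) = 2.365 is from the t-table):

```python
# One-sample t statistic for a hypothetical small sample (n = 8),
# using the n-1 sample standard deviation.
from math import sqrt
from statistics import mean, stdev   # stdev divides by n-1

sample = [48, 50, 49, 51, 52, 47, 50, 53]   # hypothetical data
mu0 = 52                                    # hypothesized mean

n = len(sample)
x_bar = mean(sample)                 # 50.0
s = stdev(sample)                    # 2.0
t = (x_bar - mu0) / (s / sqrt(n))
print(t)                             # ~ -2.83; |t| > 2.365 -> reject Ho
```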
Questions you may ask to arrive at a decision
between t and Z
• Is the population standard deviation (σ) known?
• If the answer is yes, the Z-distribution is appropriate
• When σ is unknown, a second question is asked: “Is the sample size
greater than 30?”
• If the answer is no, the t-distribution should be used, if it is yes, the Z-
distribution should be used (because as the sample size increases, the t-
distribution becomes increasingly similar to the Z-distribution)
Test concerning mean of one population
Case a) H1: μ> μo
• This is right tailed test.
• The golden rule is: “If calculated t> tα,n-1 (the tabled value with n-1 degrees
of freedom), the test is significant. We reject Ho at (1- α)*100% confidence
level”
• “Otherwise, we accept (fail to reject) Ho at (1- α)100% confidence level”
Case b) H1: μ < μo
• This is a left-tailed test.
• The golden rule is: “If calculated t < -tα,n-1 (the tabled value with n-1 degrees of freedom), the test is significant. We reject Ho at the (1-α)100% confidence level”
• “Otherwise, we accept (fail to reject) Ho at the (1-α)100% confidence level”
Case c) H1: μ ≠ μo
• This is a two-tailed test.
• The golden rule is: “If the absolute value of the calculated t, |t| > tα/2,n-1 (the tabled value with n-1 degrees of freedom), the test is significant. We reject Ho at the (1-α)100% confidence level”
• “Otherwise, we accept (fail to reject) Ho at the (1-α)100% confidence level”
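The three golden rules can be sketched as one decision function; the function name and example numbers are my own, while the critical values are standard t-table entries for 9 degrees of freedom:

```python
def t_test_decision(t_calc: float, t_crit: float, tail: str) -> str:
    """Apply the golden rules above. t_crit is t(alpha, n-1) from the
    t-table (use t(alpha/2, n-1) for a two-tailed test)."""
    if tail == "right":        # Case a) H1: mu > mu0
        reject = t_calc > t_crit
    elif tail == "left":       # Case b) H1: mu < mu0
        reject = t_calc < -t_crit
    else:                      # Case c) "two": H1: mu != mu0
        reject = abs(t_calc) > t_crit
    return "reject H0" if reject else "fail to reject H0"

# From a t-table: t(0.05, 9) = 1.833 and t(0.025, 9) = 2.262
print(t_test_decision(2.10, 1.833, "right"))  # reject H0
print(t_test_decision(2.10, 2.262, "two"))    # fail to reject H0
```

Note how the same calculated t can be significant one-tailed but not two-tailed, because the two-tailed critical value is larger.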
Analysis Of Variance (ANOVA)
• Analysis of variance (ANOVA) is a collection of statistical models and their associated
estimation procedures (such as the "variation" among and between groups) used to
analyze the differences among means. ANOVA was developed by the statistician Ronald
Fisher.
• When the means of more than two groups or populations are to be compared, one-way
analysis of variance, a bivariate statistical technique, is the appropriate statistical tool
• One way because there is only one independent variable (though several levels of that
variable may be present)
• It is the analysis of the effects of one treatment variable on an interval-scaled or ratio-
scaled dependent variable; a technique to determine if statistically significant
differences in means occur between two or more groups
• E.g. Students from different colleges take the same exam. You want to see if one college outperforms the others.
Example of an ANOVA problem
• To compare women who are working full-time outside the
home, working part-time outside the home, and not working
outside the home on their willingness to purchase a personal
computer
• This eg: has only one IV- working status with 3 levels:
• Full time employment
• Part-time employment and
• No employment outside the home
• Because there are 3 levels (groups), a t-test cannot be used to
test for statistical significance.
Contd…
• The null hypothesis, here, can be stated as “All the means are equal”
or
• Ho: μ1= μ2 = μ3
Means: X̄1 = 104.75, X̄2 = 134.75, X̄3 = 119.25
Grand mean (the mean of all 3 group means, since each group has the same size): X̄ = 119.58
• Total sum of squares = within-group sum of squares + between-group sum of squares
i.e. SStotal = SSwithin + SSbetween,
where SSwithin = Σi Σj (Xij - X̄j)², summing over the observations i within each of the c groups j
Applying the formula…
• SSwithin = (130-104.75)² + (118-104.75)²
+ (87-104.75)² + (84-104.75)²
+ (145-134.75)² + (143-134.75)²
+ (120-134.75)² + (131-134.75)²
+ (153-119.25)² + (129-119.25)²
+ (96-119.25)² + (99-119.25)²
= 4148.25
To calculate SSbetween
• SSbetween, the variability of the group means about the grand mean, is calculated by squaring the deviation of each group mean from the grand mean, multiplying by the number of items in the group, and summing these scores:
• SSbetween = Σj=1..c nj (X̄j - X̄)² where
• X̄ = grand mean
• nj = number of items in the jth group (here nj = 4 for every group)
Applying the formula…
• SSbetween = 4(104.75-119.58)²
+ 4(134.75-119.58)²
+ 4(119.25-119.58)²
= 1800.68
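Both sums of squares can be reproduced from the twelve observations in the example. A small caveat: using the unrounded grand mean (119.583…) gives SSbetween ≈ 1800.67, while the notes' 1800.68 comes from rounding the grand mean to 119.58 first:

```python
# Observations recovered from the SSwithin calculation above
groups = [[130, 118, 87, 84],    # group 1, mean 104.75
          [145, 143, 120, 131],  # group 2, mean 134.75
          [153, 129, 96, 99]]    # group 3, mean 119.25

means = [sum(g) / len(g) for g in groups]
all_obs = [x for g in groups for x in g]
grand = sum(all_obs) / len(all_obs)  # 119.583...

ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
ss_total = sum((x - grand) ** 2 for x in all_obs)

print(round(ss_within, 2), round(ss_between, 2))  # 4148.25 1800.67
# The decomposition SStotal = SSwithin + SSbetween holds:
print(round(abs(ss_total - ss_within - ss_between), 6))  # 0.0
```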
• The next calculation requires dividing the various sums of squares by
their appropriate degrees of freedom.
• These divisions produce the variances, or mean squares
MSbetween
• To obtain mean square between groups, SSbetween is divided by c-1
degrees of freedom:
• MSbetween = SSbetween / (c-1) = 1800.68 / (3-1) = 900.34
MSwithin
• To obtain the mean square within groups, SSwithin is divided by cn-c
degrees of freedom
• MSwithin = SSwithin / (cn-c) = 4148.25 / (12-3) = 460.91
F-ratio
• F-ratio is calculated by taking the ratio of the mean square between
groups to the mean square within groups
• The between-groups mean square is used as the numerator and the
within-groups mean square is used as the denominator:
F = MSbetween / MSwithin = 900.34 / 460.91 = 1.95
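Continuing the same sketch, the mean squares and F-ratio follow directly from the sums of squares (c = 3 groups, 12 observations in total):

```python
ss_between = 1800.68   # from the worked example
ss_within = 4148.25
c, n_total = 3, 12

ms_between = ss_between / (c - 1)      # 900.34
ms_within = ss_within / (n_total - c)  # 460.92 (the notes truncate to 460.91)
f_ratio = ms_between / ms_within

print(round(f_ratio, 2))  # 1.95
```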
Summary for Analysis Of Variance
Source of variation | Sum of squares                | Degrees of freedom | Mean square                  | F-ratio
Between groups      | SSbetween = Σj nj (X̄j - X̄)²   | c-1                | MSbetween = SSbetween/(c-1)  | --
Within groups       | SSwithin = Σi Σj (Xij - X̄j)²  | cn-c               | MSwithin = SSwithin/(cn-c)   | F = MSbetween/MSwithin
From the F-distribution table, the critical value at the 0.05 level for 2 (numerator, n1) and 9 (denominator, n2) degrees of freedom is F = 4.26
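The final comparison against the tabled critical value is then a one-line decision (the critical value is the F-table entry just stated):

```python
f_calc = 1.95
f_crit = 4.26  # F(0.05; df1=2, df2=9) from the F-table

decision = "reject H0" if f_calc > f_crit else "fail to reject H0"
print(decision)  # fail to reject H0
```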
Pricing experiment: ANOVA table
• As the calculated F = 1.95 < 4.26 (the tabled value), we fail to reject H0 at the 95% confidence level
• We conclude that all the price treatments produce approximately the same sales volume

Source of variation | Sum of squares | Degrees of freedom | Mean square | F-ratio
Between groups      | 1800.68        | 2                  | 900.34      | --
Within groups       | 4148.25        | 9                  | 460.91      | 1.953
Total               | 5948.93        | 11                 | --          | --
All the best!