stats_cheatsheet.docx
b) No women
We choose 3 men out of 4 (the total number of selections is any 3 people from the 7, i.e. (7 choose 3) = 35).
P(no women) = (4 choose 3) / (7 choose 3) = 4/35 = 0.1143
P(all 3 women) = (3 choose 3) / (7 choose 3) = 1/35
P(at least two women) = P(2 women) + P(3 women) = 12/35 + 1/35 = 13/35 = 0.3714
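These counts can be cross-checked with Python's `math.comb` (a sketch; the pool of 4 men and 3 women, giving C(7, 3) = 35 total selections, is taken from the worked example above):

```python
from math import comb

total = comb(7, 3)                      # any 3 people from 4 men + 3 women = 35
p_no_women = comb(4, 3) / total         # all 3 chosen are men
p_all_women = comb(3, 3) / total        # all 3 chosen are women
# at least two women = exactly 2 women + exactly 3 women
p_at_least_two = (comb(3, 2) * comb(4, 1) + comb(3, 3)) / total
print(total, round(p_no_women, 4), round(p_at_least_two, 4))  # 35 0.1143 0.3714
```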
1. Selection Bias: The choice of items included in the index can introduce bias if it does not
accurately represent the market or population being measured. For example, if certain goods
or services are excluded or underrepresented, the index may not reflect true changes in the
cost of living or economic activity.
2. Weighting: Assigning appropriate weights to different components of the index is crucial for
accurately reflecting their relative importance. If weights are outdated or misallocated, the
index may not accurately reflect changes in the underlying data.
3. Substitution Bias: Index numbers may not fully account for consumer behavior changes,
such as substitution effects. For example, if the price of one item in the index rises
significantly, consumers may switch to cheaper alternatives, but the index may not adequately
reflect this behavior.
4. Quality Changes: Changes in the quality of goods or services can pose challenges for index
construction. If quality improvements are not properly accounted for, the index may overstate
inflation or understate changes in real income.
5. Base Year Choice: The choice of base year can affect the interpretation of index numbers,
especially when comparing data over long periods. Revising the base year can lead to
revisions in historical data and alter perceptions of trends.
6. Lack of Data: In some cases, data limitations may prevent the construction of comprehensive
index numbers. This can be particularly problematic in developing countries or for niche
markets where data availability is limited.
7. Formula Bias: The formula used to aggregate individual prices into an index can influence
the results. Different formulas, such as Laspeyres, Paasche, or Fisher, may yield different
index values and interpretations.
8. Seasonal Variations: Index numbers may not adequately capture seasonal variations in prices
or economic activity, leading to distortions in the data.
9. Subgroup Analysis: Aggregating data into broad categories may obscure important
variations within subgroups. For example, regional differences or differences between income
levels may not be adequately reflected in the index.
Theory of probability:
1. Sample Space
The sample space for a random experiment is the set of all experimental outcomes.
1. No two or more of these outcomes can occur simultaneously;
2. Exactly one of the outcomes must occur, whenever the experiment is performed.
2. Event:
• In the theory of probability, the term event is used to denote any phenomenon which occurs in
a random experiment.
• An event is a collection of sample points; this is a subset of sample space.
• E.g. tossing a coin; sample space S= {H,T}; and the event could be E1= {H}; E2= {T}
3. Mutually Exclusive Events (Disjoint events):
If two or more events cannot occur simultaneously in a single trial of an experiment, then such events
are called mutually exclusive events or disjoint events.
In other words: Two events are said to be mutually exclusive if the events have no sample points in
common.
Events A and B are mutually exclusive if, when one event occurs, the other cannot occur.
If A and B are mutually exclusive then 𝐴∩𝐵 = ∅
5. Complement of an Event:
Given an event A, the complement of A is defined to be the event consisting of all sample points that
are not in A. The complement of A is denoted by Aᶜ.
Therefore, we have
P(A) + P(Aᶜ) = 1
P(A) = 1 − P(Aᶜ)
10. Joint Probability: If A and B are dependent events, then the joint probability, as discussed under
the statistical dependence case, is no longer equal to the product of their respective probabilities. That
is, for dependent events
P(A and B) = P(A ∩ B) ≠ P(A) × P(B)
P(A) ≠ P(A | B) and P(B) ≠ P(B | A)
The joint probability of events A and B occurring together or in succession under statistical
dependences is given by
P(A ∩ B) = P(A) × P(B | A)
P(A ∩ B) = P(B) × P(A | B)
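A small illustration of the multiplication rule for dependent events (a hypothetical example, not from the text: drawing two cards from a standard deck without replacement):

```python
from fractions import Fraction

# A = first card is an ace, B = second card is an ace (no replacement)
p_a = Fraction(4, 52)           # P(A)
p_b_given_a = Fraction(3, 51)   # P(B | A): one ace already removed from the deck
p_both = p_a * p_b_given_a      # P(A ∩ B) = P(A) × P(B | A)
print(p_both)                   # 1/221
```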
If E1 is a subset of E2, then P(E1) is less than or equal to P(E2). That is, P(E1) ≤ P(E2).
4. P(Aᶜ) = 1 − P(A); that is, the probability that an event does not occur is equal to one minus
the probability that it does occur (the probability rule for complementary events).
Approach of Probability
There are three approaches
1. Classical Approach
It is based on the assumption that all the possible outcomes (finite in number) of an experiment are
mutually exclusive and equally likely.
P(E) = (Number of favourable outcomes) / (Total number of outcomes)
For example, if a fair die is rolled, then on any trial, each event (face or number) is equally likely to
occur since there are six equally likely exhaustive events, each will occur 1/6 of the time, and
therefore the probability of any one event occurring is 1/6.
2. Relative Frequency Approach
P(A) = c(s)/n, where c(s) is the number of times event A occurs in n repeated trials of the experiment.
3. Subjective Approach
● The subjective approach of calculating probability is always based on the degree of
beliefs, convictions, and experience concerning the likelihood of occurrence of a random
event.
● Probability assigned for the occurrence of an event may be based on just a guess or on
having some idea about the relative frequency of past occurrences of the event.
A bag contains 6 red and 8 green balls.
a) If one ball is drawn at random, then what is the probability of the ball being green?
b) If two balls are drawn at random, then what is the probability that one is red and the other
green?
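A sketch of both parts in Python (6 red + 8 green gives 14 balls in all):

```python
from math import comb
from fractions import Fraction

p_green = Fraction(8, 14)                  # part (a): 8 green out of 14 balls
p_one_each = Fraction(6 * 8, comb(14, 2))  # part (b): one red AND one green
print(p_green, p_one_each)                 # 4/7 48/91
```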
Unit II
Random Variable
• A continuous random variable can take any numerical value in an interval or collection of
intervals. A continuous random variable is usually the result of experimental outcomes that
are based on measurement scales
• For instance, measurement of time, weight, distance, temperature, and so on are all treated as
continuous random variables
• This distribution describes discrete data resulting from an experiment called a Bernoulli
process (named after Jacob Bernoulli, 1654–1705, the first of the Bernoulli family of Swiss
mathematicians).
• For each trial of an experiment, there are only two possible complementary (mutually
exclusive) outcomes such as, defective or good, head or tail, zero or one, boy or girl.
• In such cases the outcome of interest is referred to as a ‘success’ and the other as a ‘failure’.
It is applicable when the probability, p, of occurrence of an outcome of interest (i.e., success) in each
trial is very small, but the number of independent trials n is sufficiently large; then the average number
of times that the event occurs in a certain period of time, λ = np, is also small.
P(x = r) = (e^(−λ) λ^r) / r!,  r = 0, 1, 2, …
where e = 2.7183 and λ = np
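The Poisson formula above can be written directly as a function (a sketch; the n and p values below are illustrative, not from the text):

```python
from math import exp, factorial

def poisson_pmf(r, lam):
    """P(X = r) = e^(-λ) λ^r / r!"""
    return exp(-lam) * lam ** r / factorial(r)

lam = 200 * 0.01                          # λ = np with illustrative n = 200, p = 0.01
print(round(poisson_pmf(0, lam), 4))      # P(no occurrences) = e^-2 ≈ 0.1353
```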
1. The probability of an occurrence is the same for any two intervals of equal length.
1. The normal curve is a bell-shaped curve. The top of the bell is directly above the mean
(μ).
2. The curve is symmetrical about the line (X= µ), (Z=0), i.e. it has the same shape on either
side of the line (X= µ), (Z=0).
3. Since the distribution is symmetrical, the mean, median and mode coincide.
mean = median = mode
4. Since mean = median = mode, the ordinate at X = µ (Z = 0) divides the whole area
into two equal parts. Further, since the total area under the normal probability curve is 1, the area
to the right of the ordinate as well as to the left of the ordinate at X = µ (Z = 0) is 0.5.
5. No portion of the curve lies below the x-axis, since p(x) being the probability can never
be negative.
6. The range of the distribution is from − ∞ 𝑡𝑜 + ∞.
7. By virtue of symmetry, the quartiles are equidistant from the median, i.e.
Q3 − Med = Med − Q1 ⇒ Q1 + Q3 = 2 Med
𝑥 = μ + 𝑍σ
Important Point
10. When x is less than the mean (μ), the value of z is negative
11. When x is more than the mean (μ), the value of z is positive
12. When x = μ, the value of z = 0.
13. 68.3% of the values of a normal random variable are within plus or minus one
standard deviation of its mean.
14. 95.4% of the values of a normal random variable are within plus or minus two
standard deviations of its mean.
15. 99.7% of the values of a normal random variable are within plus or minus three
standard deviations of its mean.
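Points 13-15 can be checked numerically: for a normal variable, P(μ − kσ ≤ X ≤ μ + kσ) = erf(k/√2), so (a sketch):

```python
from math import erf, sqrt

for k in (1, 2, 3):
    prob = erf(k / sqrt(2))        # probability of falling within k standard deviations
    print(k, round(100 * prob, 1)) # prints 68.3, 95.4, 99.7 for k = 1, 2, 3
```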
Unit IV
Correlation Analysis
Correlation analysis: A statistical technique that is used to analyse the strength and direction of the
relationship between two quantitative variables, is called correlation analysis.
• An analysis of the relationship of two or more variables is usually called correlation. — (A.
M. Tuttle)
• When the relationship is of a quantitative nature, the appropriate statistical tool for
discovering and measuring the relationship and expressing it in a brief formula is known as
correlation. — Croxton and Cowden
The coefficient of correlation, is a number that indicates the strength (magnitude) and direction of
statistical relationship between two variables.
• The strength of the relationship is determined by the closeness of the points to a straight line
when a pair of values of two variables are plotted on a graph. A straight line is used as the
frame of reference for evaluating the relationship.
• The direction is determined by whether one variable generally increases or decreases when
the other variable increases.
The importance of examining the statistical relationship between two or more variables can be divided
into the following questions and accordingly requires the statistical methods to answer these
questions:
1. Is there an association between two or more variables? If yes, what is the form and degree of
that relationship?
3. Can the relationship be used for predictive purposes, that is, to predict the most likely value of
a dependent variable corresponding to the given value of independent variable or variables?
TYPES OF CORRELATIONS
A positive (or direct) correlation refers to the same direction of change in the values of variables. In
other words, if values of variables are varying (i.e., increasing or decreasing) in the same direction,
then such correlation is referred to as positive correlation.
A negative (or inverse) correlation refers to the change in the values of variables in opposite
direction.
A linear correlation implies a constant change in one of the variable values with respect to a change
in the corresponding values of another variable.
𝑐 = α + β𝑦
A non-linear (or curvilinear) correlation implies an absolute change in one of the variable values with
respect to changes in values of another variable.
c = α + βy²
The distinction between simple, partial, and multiple correlation is based upon the number of
variables involved in the correlation analysis.
• If only two variables are chosen to study correlation between them, then such a correlation is
referred to as simple correlation.
• In partial correlation, two variables are chosen to study the correlation between them, but the
effect of other influencing variables is kept constant. For example, the yield of a crop is
influenced by the amount of fertilizer applied, rainfall, quality of seed, type of soil, and
pesticides; partial correlation could study yield against fertilizer alone, holding the other
factors constant.
• In multiple correlation, the relationship between three or more variables is considered
simultaneously for study.
• The correlation between two ratio-scaled (numeric) variables is represented by the letter r
which takes on values between –1 and +1 only.
• A low value of r does not indicate that the variables are unrelated but indicates that the
relationship is poorly described by a straight line. A non-linear relationship may also exist.
1. The value of r depends on the slope of the line passing through the data points and the
scattering of the pair of values of variables x and y about this line.
2. The sign of the correlation coefficient indicates the direction of the relationship. A positive
correlation indicates that the two variables tend to increase (or decrease) together (a positive
association) and a negative correlation indicates that when one variable increases the other is
likely to decrease (a negative association).
4. The value of r = +1 or –1 indicates that there is a perfect linear relationship between two
variables.
Where cov(x, y) = (1/n) Σ(x − x̄)(y − ȳ)

σx = √( Σ(x − x̄)² / n )
σy = √( Σ(y − ȳ)² / n )

r = cov(x, y) / (σx σy) = [ (1/n) Σ(x − x̄)(y − ȳ) ] / [ √(Σ(x − x̄)²/n) √(Σ(y − ȳ)²/n) ]

r = [ n Σxy − (Σx)(Σy) ] / [ √(n Σx² − (Σx)²) √(n Σy² − (Σy)²) ]

In terms of deviations from the means (x = X − X̄, y = Y − Ȳ):
r = Σxy / √( Σx² Σy² )

Spearman's rank correlation coefficient:
r = 1 − 6 Σd² / [ n (n² − 1) ]
The ranks of 15 students in two subjects A and B, are given below. The two numbers within
brackets denote the ranks of a student in A and B subjects respectively.
(1, 10), (2, 7), (3, 2), (4, 6), (5, 4), (6, 8), (7, 3), (8, 1), (9, 11), (10, 15), (11, 9), (12, 5),
(13, 14), (14, 12), (15, 13)
Find Spearman’s rank correlation coefficient.
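A sketch of the computation using the formula r = 1 − 6Σd²/[n(n² − 1)] with the ranks listed above:

```python
# (rank in A, rank in B) for each of the 15 students, from the problem statement
ranks = [(1, 10), (2, 7), (3, 2), (4, 6), (5, 4), (6, 8), (7, 3), (8, 1),
         (9, 11), (10, 15), (11, 9), (12, 5), (13, 14), (14, 12), (15, 13)]
n = len(ranks)
d_sq = sum((a - b) ** 2 for a, b in ranks)   # Σd² = 272
r_s = 1 - 6 * d_sq / (n * (n * n - 1))       # 1 - 6·272 / (15·224)
print(d_sq, round(r_s, 4))                   # 272 0.5143
```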
Unit V
Statistical Hypothesis:
• A statistical hypothesis is a claim (assertion, statement, belief or assumption) about an
unknown population parameter value.
For instance
• A pharmaceutical company claims the efficacy of medicine against a disease that 95 percent
of all persons suffering from the said disease get cured.
Hypothesis Testing:
• The process that enables a decision maker to test the validity (or significance) of his claim by
analyzing the difference between the value of sample statistic and the corresponding
hypothesized population parameter value, is called hypothesis testing.
Step 1: State the Null Hypothesis (𝐻0) and Alternative Hypothesis (𝐻1)
• 𝐻0 represents the claim or statement made about the value or range of values of the
population parameter.
• The null hypothesis is considered true until it is proved false on the basis of results observed
from the sample data.
• H0 is written in terms of a hypothesized parameter value, e.g. H0: µ = µ0, where µ0 is the
hypothesized value of the population mean. Only one sign out of ≤, =, and ≥ appears in the null
hypothesis at a time.
Alternative Hypothesis
• 𝐻1, is the counter claim (statement) made against the value of the particular population
parameter.
• Parameter value is not equal to the value stated in the null hypothesis and is written as:
H1: µ ≠ µ0
• The level of significance, usually denoted by α (alpha), is specified before the samples are
drawn.
• It is specified in terms of the probability of making a Type I error when the null hypothesis is true
as an equality.
The area under the sampling distribution curve of the test statistic is divided into two mutually
exclusive regions. These regions are called the acceptance region and the rejection (or critical)
region.
• The acceptance region is a range of values of the sample statistic spread around the null
hypothesized population parameter. If values of the sample statistic fall within the limits of
acceptance region, the null hypothesis is accepted, otherwise it is rejected.
• The rejection region is the range of sample statistic values within which if values of the
sample statistic falls (i.e. outside the limits of the acceptance region), then null hypothesis is
rejected.
• The value of the sample statistic that separates the regions of acceptance and rejection is
called critical value.
(a) Whether the test involves one sample, two samples, or k samples?
1. Sample size
Compare the calculated value of the test statistic with the critical value (also called standard table
value of test statistic).
• Accept 𝐻0 if the test statistic value falls within the area of acceptance.
• Reject otherwise.
Type I Error
• The probability of making a Type I error is denoted by the symbol α. It is represented by the
area under the sampling distribution curve over the region of rejection.
• Type II Error: This is the probability of accepting the null hypothesis when it is false.
Chi-squared test
• A chi-squared test (symbolically represented as χ²) analyses data on the basis of
observations of a random set of variables; usually it is a comparison of two statistical data
sets. The test was introduced by Karl Pearson in 1900 for categorical data analysis and
distribution, and is therefore also known as Pearson's chi-squared test.
Use of Chi-Square Test
1. χ² test as a test of goodness of fit.
2. χ² test as a test of the independence of two variables.
3. χ² test as a test of significance for association/dependence.
1. Unlike the normal distribution, the chi-square distribution takes only positive values and
ranges from 0 to infinity.
2. Unlike the normal distribution, the chi-square distribution is a skewed distribution yet
becomes more symmetric and approaches the normal distribution as the degree of freedom
increases.
Chi-square (χ²) Distribution
• While the Z and t distributions are used for the sampling distributions of the sample mean, the
chi-square (χ²) distribution is used for the sampling distribution of the sample variance.

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
E = (RT × CT) / N
where RT and CT are the row and column totals and N is the grand total.
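The χ² statistic above is a straightforward sum (a sketch using hypothetical observed/expected counts, not data from the text):

```python
# Hypothetical counts for four categories, for illustration only
observed = [18, 22, 30, 30]
expected = [25, 25, 25, 25]
# χ² = Σ (Oi - Ei)² / Ei
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))  # 4.32
```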
A selection board consisting of two experts for the post of general manager in a company
interviewed 10 candidates, to whom the two experts assigned ranks as given below. Find the rank
correlation.
Expert I:   7   9   2   4   5   5   8  10   3   1
Expert II:  8  10   4   6   4   4   7   9   1   2
Find the Karl Pearson’s coefficient of correlation between price and supply of a commodity
from the following data.
Price 8 10 15 17 20 22 24 25
Supply 25 30 32 35 37 40 42 45
Calculate Karl Pearson’s coefficient of skewness, the first four moments, and β1 and β2 for the
following table.
Calculate mean, median and mode from the following table and comment on the relationship
between them.
Class Frequency
0-10 8
10-20 14
20-30 24
30-40 31
40-50 47
50-60 54
60-70 35
70-80 26
80-90 15
90-100 6
Calculate the Consumer Price Index number through Aggregate Expenditure Method and
Family Budget Method for the following data.
a) Geometric Mean
b) Harmonic Mean
X: 0-20  20-40  40-60  60-80  80-100  100-120  120-140
f:    7     15     30     40      17       18       19
a) Laspeyres Index
b) Paasche Index
c) Fisher’s Index
d) Marshall-Edgeworth Index
Item  Price (2020)  Quantity (2020)  Price (2023)  Quantity (2023)
A 35 15 43 33
B 25 30 30 45
C 17 15 35 40
D 37 60 62 32
E 18 40 26 60
F 60 20 90 22
G 23 30 95 25
H 112 30 140 25
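The four index numbers can be computed directly from the 2020/2023 table above (a sketch; each row is read as (p0, q0, p1, q1), i.e. base-year price and quantity followed by current-year price and quantity):

```python
# (p0, q0, p1, q1) for items A-H, transcribed from the table above
data = [(35, 15, 43, 33), (25, 30, 30, 45), (17, 15, 35, 40), (37, 60, 62, 32),
        (18, 40, 26, 60), (60, 20, 90, 22), (23, 30, 95, 25), (112, 30, 140, 25)]

laspeyres = 100 * sum(p1 * q0 for p0, q0, p1, q1 in data) / sum(p0 * q0 for p0, q0, p1, q1 in data)
paasche   = 100 * sum(p1 * q1 for p0, q0, p1, q1 in data) / sum(p0 * q1 for p0, q0, p1, q1 in data)
fisher    = (laspeyres * paasche) ** 0.5      # geometric mean of Laspeyres and Paasche
marshall  = 100 * (sum(p1 * (q0 + q1) for p0, q0, p1, q1 in data)
                   / sum(p0 * (q0 + q1) for p0, q0, p1, q1 in data))
print(round(laspeyres, 2), round(paasche, 2), round(fisher, 2), round(marshall, 2))
```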
Given the following data, calculate the consumer price index under the aggregate expenditure method
and the family budget method.
Sleep Time: 5.5   6   4  7.5    8   5  6.5   6
Gym Time:    40  60  35   90  120  45   65  50
The following table shows distance travelled (in km) by rural women to come to Bangalore for work,
and the daily wages they earn. Use Spearman’s Rank Correlation to find the level and nature of
association between the two variables.
Distance:  25   30   20   10   60   50   40   25
Wages:    350  500  600  650  750  400  900  620
A company that employs 6000 employees in Bangalore, 4000 employees in Kochi, and 5000
employees in Hyderabad decided to hand out bonuses to its employees at the end of a financial
year for outstanding performance. If the probability of an employee getting the bonus was 40% in
Bangalore, equally likely in Kochi, and one-third likely in Hyderabad, find the average number of
employees who got the bonus in each office. Also comment on the spread of the results for each
office.
The total points accrued by four Premier League football clubs are given in the following table
Year      A   B   C   D
2010-11  68  58  71  80
2011-12  70  52  64  89
2012-13  73  61  75  89
2013-14  79  84  82  64
2014-15  75  62  87  70
2015-16  71  60  50  66
2016-17  75  76  93  69
2017-18  63  75  70  81
2018-19  70  97  72  66
2019-20  56  99  66  66
2020-21  61  69  67  74
2021-22  69  92  74  58
2022-23  84  67  44  75
Calculate the average points accrued in the given period by each team, as well as their individual
standard deviations. Based on your analysis, interpret the performance of all four teams in the study
period.
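A sketch of the requested averages and standard deviations (the population standard deviation is assumed; the columns are transcribed from the table above):

```python
from statistics import mean, pstdev

points = {
    "A": [68, 70, 73, 79, 75, 71, 75, 63, 70, 56, 61, 69, 84],
    "B": [58, 52, 61, 84, 62, 60, 76, 75, 97, 99, 69, 92, 67],
    "C": [71, 64, 75, 82, 87, 50, 93, 70, 72, 66, 67, 74, 44],
    "D": [80, 89, 89, 64, 70, 66, 69, 81, 66, 66, 74, 58, 75],
}
for team, pts in points.items():
    # mean points per season and spread (population s.d.) over the 13 seasons
    print(team, round(mean(pts), 2), round(pstdev(pts), 2))
```

A lower standard deviation indicates more consistent season-to-season performance.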
For a given intense monsoon in Meghalaya, find the average amount of rainfall and the standard
deviation of the precipitation. Also calculate the skewness of the dataset and describe the shape of
distribution with the help of a suitable diagram, in comparison with a normal distribution.
Commodity  Quantity 1995  Quantity 2000  Value (Rs) 1995  Value (Rs) 2000
A 100 150 500 900
B 80 100 320 500
C 60 72 150 360
D 30 33 360 297
Calculate the first four moments about the mean from the following data. Also calculate the
value of β1 and β2
Marks   M    f    d=X-A  d*=d/10  d*²  d*³  d*⁴  fd*   fd*²  fd*³  fd*⁴
0-10    5    5    -30    -3       9    -27  81   -15   45    -135  405
10-20   15   12   -20    -2       4    -8   16   -24   48    -96   192
20-30   25   18   -10    -1       1    -1   1    -18   18    -18   18
30-40   35   40   0      0        0    0    0    0     0     0     0
40-50   45   15   10     1        1    1    1    15    15    15    15
50-60   55   7    20     2        4    8    16   14    28    56    112
60-70   65   3    30     3        9    27   81   9     27    81    243
Total        100                                 -19   181   -97   985
µ1' = first moment about 35 = (Σfd* / Σf) × 10 = (−19 / 100) × 10 = −1.9
µ2' = second moment about 35 = (Σfd*² / Σf) × 100 = (181 / 100) × 100 = 181
µ3' = third moment about 35 = (Σfd*³ / Σf) × 1000 = (−97 / 100) × 1000 = −970
µ4' = fourth moment about 35 = (Σfd*⁴ / Σf) × 10000 = (985 / 100) × 10000 = 98500

β1 = µ3² / µ2³
β2 = µ4 / µ2²

Moments about the mean:
µ1 = 0
µ2 = µ2' − (µ1')² = 181 − (−1.9)² = 181 − 3.61 = 177.39
µ3 = µ3' − 3µ2'µ1' + 2(µ1')³ = −970 − 3 × 181 × (−1.9) + 2 × (−1.9)³ = −970 + 1031.7 − 13.718 = 47.98
µ4 = µ4' − 4µ3'µ1' + 6µ2'(µ1')² − 3(µ1')⁴
   = 98500 − 4 × (−970) × (−1.9) + 6 × 181 × 3.61 − 3 × (−1.9)⁴
   = 98500 − 7372 + 3920.46 − 39.10 = 95009.36
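The step-deviation computation above can be replicated programmatically (a sketch; origin A = 35 and class width h = 10, as in the table):

```python
mids = [5, 15, 25, 35, 45, 55, 65]   # class midpoints M
f    = [5, 12, 18, 40, 15, 7, 3]     # frequencies
A, h, N = 35, 10, sum(f)             # assumed origin and class width from the table

def raw_moment(k):
    """k-th moment about A via step deviations d* = (x - A)/h, rescaled by h^k."""
    return sum(fi * ((m - A) / h) ** k for fi, m in zip(f, mids)) / N * h ** k

m1, m2, m3, m4 = (raw_moment(k) for k in range(1, 5))
# convert raw moments (about A) to central moments (about the mean)
mu2 = m2 - m1 ** 2
mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2
print(round(mu2, 2), round(mu3, 2), round(mu4, 2), round(beta1, 4), round(beta2, 4))
```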
The daily earnings (in rupees) of employees working on a daily basis in a firm are:
Daily earnings (Rs.):  100  120  140  160  180  200  220
Number of employees:     3    6   10   15   24   42   75
Calculate the average daily earning for all employees
A company is planning to improve plant safety. For this, accident data for the last 50
weeks was compiled. These data are grouped into the frequency distribution as shown
below. Calculate the A.M. of the number of accidents per week.
Number of accidents:  0-4  5-9  10-14  15-19  20-24
Number of weeks:        5   22     13      8      2
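A sketch of the grouped A.M. using class midpoints (the midpoints 2, 7, 12, 17, 22 assume the inclusive classes shown above):

```python
mids  = [2, 7, 12, 17, 22]   # midpoints of 0-4, 5-9, 10-14, 15-19, 20-24
weeks = [5, 22, 13, 8, 2]    # frequencies (number of weeks)
# weighted mean: Σ(f · x) / Σf
mean_accidents = sum(m * w for m, w in zip(mids, weeks)) / sum(weeks)
print(mean_accidents)        # 10.0 accidents per week
```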
1. The value of A.M. cannot be calculated accurately for unequal and open-ended class
intervals either at the beginning or end of the given frequency distribution.
2. The A.M. is reliable and reflects all the values in the data set. However, it is very
much affected by the extreme observations (or outliers) which are not representative
of the rest of the data set.
3. The calculation of A.M. sometimes becomes difficult because every data element is
used in the calculation.
1. The sum of the deviations of all observations from the A.M. is always zero:
Σ(xᵢ − x̄) = Σxᵢ − n x̄ = Σxᵢ − n (Σxᵢ / n) = 0
2. The sum of the squares of the deviations of all the observations from the A.M. is less
than the sum of the squares of all the observations from any other quantity.
Σ(xᵢ − x̄)² ≤ Σ(xᵢ − a)², for any value a.
3. It is possible to calculate the combined (or pooled) arithmetic mean of two or more
than two sets of data of the same nature.
x̄₁₂ = (n₁x̄₁ + n₂x̄₂) / (n₁ + n₂)
AVERAGES OF POSITION
Median:
• Median may be defined as the middle value in the data set when its elements are
arranged in a sequential order, that is, in either ascending or descending order of
magnitude.
• It is called a middle value in an ordered sequence of data in the sense that half of the
observations are smaller and half are larger than this value.
• In this case, the data is arranged in either ascending or descending order of magnitude.
1. If the number of observations (n) is odd, then the median (Med) is the value of the
((n + 1)/2)th observation.
2. If the number of observations (n) is even, then the median is defined as the arithmetic
mean of the middle two observations:
Med = [ (n/2)th observation + ((n/2) + 1)th observation ] / 2
Grouped Data
Med = l + [ (n/2 − cf) / f ] × h
where l = lower limit of the median class, cf = cumulative frequency of the class preceding the
median class, f = frequency of the median class, and h = width of the median class.
In a factory employing 3000 persons, 5 per cent earn less than Rs. 150 per day, 580
earn from Rs. 151 to Rs. 200 per day, 30 per cent earn from Rs. 201 to Rs. 250 per
day, 500 earn from Rs. 251 to Rs. 300 per day, 20 per cent earn from Rs. 301 to Rs.
350 per day, and the rest earn Rs. 351 or more per day. What is the median wage?
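One way to set the problem up (a sketch; the class boundaries are assumed adjusted to 200.5-250.5 for the median class, which is a modelling choice):

```python
counts = {
    "below 150": int(0.05 * 3000),   # 5% of 3000 = 150
    "151-200":   580,
    "201-250":   int(0.30 * 3000),   # 30% of 3000 = 900
    "251-300":   500,
    "301-350":   int(0.20 * 3000),   # 20% of 3000 = 600
}
counts["351+"] = 3000 - sum(counts.values())             # the rest = 270
n = 3000
cf_before = counts["below 150"] + counts["151-200"]      # 730, below the median class
# the 1500th worker falls in the 201-250 class; Med = l + (n/2 - cf)/f * h
median = 200.5 + (n / 2 - cf_before) / counts["201-250"] * 50
print(round(median, 2))  # 243.28
```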
Advantages
1. Median is unique, i.e. like mean, there is only one median for a set of data.
2. The value of median is easy to understand and may be calculated from any type of
data. The median in many situations can be located simply by inspection.
3. The sum of the absolute differences of all observations in the data set from median
value is minimum.
4. In other words, the absolute difference of observations from the median is less than
from any other value in the distribution. That is, Σ | x − Med| = a minimum value.
5. The extreme values in the data set do not affect the calculation of the median value,
and therefore it is a useful measure of central tendency when such values do occur.
6. The median is considered the best statistical technique for studying the qualitative
attribute of an observation in the data set.
7. The median value may be calculated for an open-ended distribution of data set.
Disadvantages
1. The median is not capable of algebraic treatment. For example, the median of two or
more sets of data cannot be determined.
2. The value of median is affected more by sampling variations, that is, it is affected by
the number of observations rather than the values of the observations.
4. The calculation of median in case of grouped data is based on the assumption that
values of observations are evenly spaced over the entire class interval.
Applications
Advantages
1. Mode value is easy to understand and to calculate. Mode class can also be located by
inspection.
2. The mode is not affected by the extreme values in the distribution. The mode value can also
be calculated for open-ended frequency distributions.
3. The mode can be used to describe quantitative as well as qualitative data. For example, its
value is used for comparing consumer preferences for various types of products, say
cigarettes, soaps, toothpastes, or other products.
Disadvantages
1. Mode is not a rigidly defined measure as there are several methods for calculating its value.
4. When a data set contains more than one mode, such values are difficult to interpret and
compare.
Advantages
1. It is independent of the measure of central tendency and easy to calculate and understand.
2. It is quite useful in cases where the purpose is only to find out the extent of extreme variation,
such as industrial quality control, temperature, rainfall, and so on.
Disadvantages
1. The calculation of range is based on only two values—largest and smallest in the data set and
fails to take into account of any other observations.
2. It is largely influenced by two extreme values and completely independent of the other values.
For example, range of two data sets {1, 2, 3, 7, 12} and {1, 1, 1, 12, 12} is 11, but the two
data sets differ in terms of overall dispersion of values.
3. Its value is sensitive to changes in sampling, that is, different samples of the same size from
the same population may have widely different ranges.
Applications of Range
1. Fluctuation in share prices: The range is useful in the study of small variations among values
in a data set, such as variation in share prices and other commodities that are very sensitive to
price changes from one period to another.
2. Quality control: It is widely used in industrial quality control. Quality control is exercised by
preparing suitable control charts. These charts are based on setting an upper control limit
(range) and a lower control limit (range) within which produced items shall be accepted. The
variation in the quality beyond these ranges requires necessary correction in the production
process or system.
3. Weather forecasts: The concept of range is used by meteorological departments to determine
and announce the difference between the maximum and minimum temperature or rainfall for the
information of the general public.
• The limitations or disadvantages of the range can partially be overcome by using another
measure of variation which measures the spread over the middle half of the values in the data
set so as to minimise the influence of outliers (extreme values) in the calculation of range.
• Since a large number of values in the data set lie in the central part of the frequency
distribution, therefore it is necessary to study the Interquartile Range (also called mid-spread).
• To compute this value, the entire data set is divided into four parts each of which contains 25
percent of the observed values.
• The interquartile range is a measure of dispersion or spread of values in the data set between
the third quartile, Q3 and the first quartile, Q1.
• Half the distance between Q1 and Q3 is called the semi-interquartile range or the quartile
deviation (QD).
• QD = (Q3 − Q1) / 2
Advantages
1. It is not difficult to calculate but can only be used to evaluate variation among observed
values within the middle of the data set. Its value is not affected by the extreme (highest and
lowest) values in the data set.
Disadvantages
1. The value of Q.D. is based on the middle 50 percent observed values in the data set, therefore
it cannot be considered as a good measure of variation as it is not based on all the
observations.
3. The Q.D. has no relationship with any particular value or an average in the data set for
measuring the variation. Its value is not affected by the distribution of the individual values
within the interval of the middle 50 percent observed values.
MAD = (1/N) Σ |x − μ|   for a population
MAD = (1/n) Σ |x − x̄|   for a sample
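The two MAD formulas differ only in the centre used; a small helper makes this explicit (a sketch with illustrative data):

```python
def mad(values, center=None):
    """Mean absolute deviation about `center` (defaults to the mean of the data)."""
    c = sum(values) / len(values) if center is None else center
    return sum(abs(v - c) for v in values) / len(values)

# deviations from the mean 5 are 3, 1, 1, 3, so MAD = 8/4
print(mad([2, 4, 6, 8]))  # 2.0
```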
Skewness
• A frequency distribution of the set of values that is not ‘symmetrical (normal)’ is called
asymmetrical or skewed.
• In a skewed distribution, extreme values in a data set move towards one side or tail of a
distribution, thereby lengthening that tail.
• When extreme values move towards the upper or right tail, the distribution is positively
skewed.
• When such values move towards the lower or left tail, the distribution is negatively skewed.
• For a positively skewed distribution A.M. > Median > Mode, and for a negatively skewed
distribution A.M. < Median < Mode.
• The relationship between these measures of central tendency is used to develop a measure of
skewness called the coefficient of skewness to understand the degree to which these three
measures differ.
MEASURES OF SKEWNESS
• The degree of skewness in a distribution can be measured both in the absolute and relative
sense.
• For an asymmetrical distribution, the distance between mean and mode may be used to
measure the degree of skewness because the mean is equal to mode in a symmetrical
distribution. Thus,
𝑆𝑘 = 𝑀𝑒𝑎𝑛 − 𝑀𝑜𝑑𝑒
Or 𝑆𝑘 = (𝑄3 + 𝑄1) − 2 𝑀𝑒𝑑𝑖𝑎𝑛 (if measured in terms of quartiles)
• For a positively skewed distribution, Mean > Mode and therefore 𝑆𝑘 is a positive value,
otherwise it is a negative value
Relative Measures of Skewness
Karl Pearson’s Coefficient of Skewness
Sk_P = (Mean − Mode) / standard deviation = (x̄ − M₀) / σ
Since a mode does not always exist uniquely in a distribution, therefore it is convenient to
define this measure using median
Sk_P = 3(x̄ − Med) / σ
Theoretically, the value of Sk_P varies between ±3, although in practice these limits are rarely
attained.
For a symmetrical distribution the quartiles are equidistant from the median, so that
Median = (Q3 + Q1) / 2
• This shows that the value of median is the mean value of Q1 and Q3. Obviously in such a
case, the absolute value of the coefficient of skewness will be zero.
• When a distribution is asymmetrical, quartiles are not at equal distance from the median
• The distribution is positively skewed if Q3 − Med > Med − Q1, otherwise negatively skewed.
KURTOSIS
The measure of kurtosis, describes the degree of concentration of frequencies (observations)
in a given distribution. That is, whether the observed values are concentrated more around the
mode (a peaked curve) or away from the mode towards both tails of the frequency curve.
The word ‘kurtosis’ comes from a Greek word meaning ‘humped’. In statistics, it refers to the
degree of flatness or peakedness in the region about the mode of a frequency curve.
But if d = (X − A) / h, h > 0, then σₓ = h × σ_d
2. Standard deviation is the minimum value of the root mean square deviation.
σ² = (1/N) Σ fᵢ(xᵢ − x̄)²
And the mean square deviation s² = (1/N) Σ fᵢ(xᵢ − A)²
Then s² ≥ σ² ⇒ s ≥ σ; the sign of equality holds if and only if A = x̄.
3. Standard deviation is less than or equal to Range.
4. Standard deviation is suitable for further mathematical treatment. If we know the size, means,
and standard deviations of two or more groups, then we can obtain the standard deviation of
the group obtained by combining all the groups.
5. The standard deviation of the first n natural numbers 1, 2, 3, …, n is √((n² − 1) / 12).
6. The empirical rule: for any symmetrical bell-shaped distribution, we have approximately the
following area properties.
a) 68% of the observations lie in the range: mean± 1. σ
b) 95% of the observations lie in the range: mean± 2. σ
c) 99% of the observations lie in the range: mean± 3. σ
7. The approximate relationship between quartile deviation, mean deviation, and standard
deviation is:
Q.D. ≅ (2/3) σ and M.D. ≅ (4/5) σ ⇒ Q.D. : M.D. : S.D. :: 10 : 12 : 15
8. For any discrete distribution, standard deviation is not less than mean deviation about mean.
𝑆. 𝐷(σ)≥𝑀𝑒𝑎𝑛 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 𝑎𝑏𝑜𝑢𝑡 𝑀𝑒𝑎𝑛