
There are 4 men and 3 women.

If 3 people are selected at random, find the probability that:


a) Exactly two are women
b) No women
c) At least one woman
d) At least 2 women
e) At the most 2 women
Solutions:
The total number of ways to select 3 people out of 7 (4 men and 3 women):
Total outcomes = C(7, 3) = 7! / [3! (7 − 3)!] = (7 × 6 × 5 × 4!) / (3 × 2 × 4!) = (7 × 6 × 5) / (3 × 2) = 35

a) Probability of selecting exactly two women

We choose 2 women out of 3, and 1 man out of 4.

P(exactly two women) = [C(3, 2) × C(4, 1)] / 35 = (3 × 4) / 35 = 12/35 ≈ 0.3429

b). No women
We choose 3 men out of 4.
P(no women) = C(4, 3) / 35 = 4/35 ≈ 0.1143

c). Probability of selecting at least one woman:


This is the complement of selecting no women.
P(at least one woman) = 1 − P(no women) = 1 − 4/35 = 31/35 ≈ 0.8857

d). Probability of selecting at least two women:


This is the probability of selecting exactly two women plus the probability of selecting all three women:

P(at least two women) = P(exactly 2 women) + P(all 3 women)

P(all 3 women) = C(3, 3) / 35 = 1/35

P(at least two women) = 12/35 + 1/35 = 13/35 ≈ 0.3714

e). Probability of selecting at most two women:


This is the complement of selecting all three women.

P(at most two women) = 1 − P(all 3 women) = 1 − 1/35 = 34/35 ≈ 0.9714
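The same answers can be checked with a few lines of Python, shown here as a minimal sketch using the standard library's math.comb (the variable names are ours, not part of the notes):

from math import comb

total = comb(7, 3)                                 # 35 ways to choose 3 people from 7

p_two_women   = comb(3, 2) * comb(4, 1) / total    # exactly 2 women and 1 man
p_no_women    = comb(4, 3) / total                 # 3 men, no women
p_atleast_one = 1 - p_no_women                     # complement of "no women"
p_all_women   = comb(3, 3) / total
p_atleast_two = p_two_women + p_all_women
p_atmost_two  = 1 - p_all_women                    # complement of "all 3 women"

print(p_two_women, p_no_women, p_atleast_one, p_atleast_two, p_atmost_two)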

Explain the problems in the construction of index numbers

1. Selection Bias: The choice of items included in the index can introduce bias if it does not
accurately represent the market or population being measured. For example, if certain goods
or services are excluded or underrepresented, the index may not reflect true changes in the
cost of living or economic activity.
2. Weighting: Assigning appropriate weights to different components of the index is crucial for
accurately reflecting their relative importance. If weights are outdated or misallocated, the
index may not accurately reflect changes in the underlying data.
3. Substitution Bias: Index numbers may not fully account for consumer behavior changes,
such as substitution effects. For example, if the price of one item in the index rises
significantly, consumers may switch to cheaper alternatives, but the index may not adequately
reflect this behavior.
4. Quality Changes: Changes in the quality of goods or services can pose challenges for index
construction. If quality improvements are not properly accounted for, the index may overstate
inflation or understate changes in real income.
5. Base Year Choice: The choice of base year can affect the interpretation of index numbers,
especially when comparing data over long periods. Revising the base year can lead to
revisions in historical data and alter perceptions of trends.
6. Lack of Data: In some cases, data limitations may prevent the construction of comprehensive
index numbers. This can be particularly problematic in developing countries or for niche
markets where data availability is limited.
7. Formula Bias: The formula used to aggregate individual prices into an index can influence
the results. Different formulas, such as Laspeyres, Paasche, or Fisher, may yield different
index values and interpretations.
8. Seasonal Variations: Index numbers may not adequately capture seasonal variations in prices
or economic activity, leading to distortions in the data.
9. Subgroup Analysis: Aggregating data into broad categories may obscure important
variations within subgroups. For example, regional differences or differences between income
levels may not be adequately reflected in the index.

Theory of Probability: Basic Terms
1. Sample Space
The sample space for a random experiment is the set of all experimental outcomes.
1. No two or more of these outcomes can occur simultaneously;
2. Exactly one of the outcomes must occur, whenever the experiment is performed.
2. Event:
• In the theory of probability, the term event is used to denote any phenomenon which occurs in
a random experiment.
• An event is a collection of sample points; this is a subset of sample space.
• E.g. tossing a coin; sample space S= {H,T}; and the event could be E1= {H}; E2= {T}
3. Mutually Exclusive Events (Disjoint events):
If two or more events cannot occur simultaneously in a single trial of an experiment, then such events
are called mutually exclusive events or disjoint events.
In other words: Two events are said to be mutually exclusive if the events have no sample points in
common.
Events A and B are mutually exclusive if, when one event occurs, the other cannot occur.
If A and B are mutually exclusive then 𝐴∩𝐵 = ∅

4. Collectively Exhaustive Events:


A list of events is said to be collectively exhaustive when, taken together, the events include every possible outcome of the experiment.
Or events whose union is equal to sample space.
If A, B and C are collective exhaustive then 𝐴∪𝐵∪𝐶 = 𝑆

5. Complement of an Event:
Given an event A, the complement of A is defined to be the event consisting of all sample points that are not in A. The complement of A is denoted by Aᶜ. (In a Venn diagram, the complement of event A is the shaded region outside A.)

Therefore, we have

P(A) + P(Aᶜ) = 1

P(A) = 1 − P(Aᶜ)

6. Union of two events


The union of A and B is the event containing all sample points belonging to A or B
or both. The union is denoted by 𝐴∪𝐵.

7. Intersection of two events


Given two events A and B, the intersection of A and B is the event containing the sample points
belonging to both A and B. The intersection is denoted by A ∩ B.

8. Marginal Probability: A marginal or unconditional probability is the simple probability of the


occurrence of an event. For example, in a fair coin toss, the outcome of each toss is an event that is
statistically independent of the outcomes of every other toss of the coin.
9. Joint Probability: The probability of two or more independent events occurring together or in
succession is called the joint probability. The joint probability of two or more independent events is
equal to the product of their marginal probabilities. In particular, if A and B are independent events,
the probability that both A and B will occur is given by
P(AB) = P(A ∩ B) = P(A)×P(B)

10. Joint Probability (dependent events): If A and B are dependent events, then the joint probability as discussed under the statistical dependence case is no longer equal to the product of their respective probabilities. That is, for dependent events
P(A and B) = P(A ∩ B) ≠ P(A) × P(B)
P(A) ≠ P(A | B) and P(B) ≠ P(B | A)
The joint probability of events A and B occurring together or in succession under statistical dependence is given by
P(A ∩ B) = P(A) × P(B | A)
P(A ∩ B) = P(B) × P(A | B)
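As a small illustration of the two cases (our own example, not from the notes), consider drawing two aces from a standard 52-card deck, with and without replacement:

# Joint probability sketch: drawing two aces from a 52-card deck.
p_ace = 4 / 52

# Independent case (draw with replacement): P(A and B) = P(A) * P(B)
p_with_replacement = p_ace * p_ace

# Dependent case (draw without replacement): P(A and B) = P(A) * P(B | A)
p_without_replacement = p_ace * (3 / 51)

print(round(p_with_replacement, 4), round(p_without_replacement, 4))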

Fundamental Rules /Properties of Probability


1. Each probability should fall between 0 and 1, i.e. 0 ≤ P(Eᵢ) ≤ 1.
2. The sum of the probabilities for all the experimental outcomes must equal 1. For n experimental outcomes, this requirement can be written as
   P(E₁) + P(E₂) + … + P(Eₙ) = 1
3. If E₁ and E₂ are two events in S and the occurrence of E₁ implies that E₂ occurs, that is, if E₁ is a subset of E₂, then P(E₁) is less than or equal to P(E₂). That is, P(E₁) ≤ P(E₂).
4. P(Aᶜ) = 1 − P(A), that is, the probability that an event does not occur is equal to one minus the probability that the event does occur (the probability rule for complementary events).

Three unbiased coins are tossed. What is the probability of obtaining:


A. All heads
B. Two heads
C. One head
D. At least one head
E. At least two heads
F. All tails
A bag contains 7 white and 9 black balls. 3 balls are drawn together. What is the probability
that,
a) All are black
b) All are white
c) 1 white and 2 black
d) 2 white and 1 black
There are 7 blue, 9 red, 4 yellow, and 5 green balls inside a box. If three balls are drawn at
random from the box, find the probability of pulling:
a) Either all blue or all red balls.
b) Only green balls.
c) One yellow and 2 red balls.
d) No red balls.
e) One red, one blue, and one yellow ball.
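A hedged sketch of how two parts of this last exercise could be computed with math.comb (the variable names and the parts chosen are ours):

from math import comb

total = comb(25, 3)            # 7 + 9 + 4 + 5 = 25 balls, 3 drawn together

# a) either all blue or all red: the two cases are mutually exclusive, so add them
p_all_blue_or_all_red = (comb(7, 3) + comb(9, 3)) / total

# e) one red, one blue and one yellow
p_one_each = comb(9, 1) * comb(7, 1) * comb(4, 1) / total

print(round(p_all_blue_or_all_red, 4), round(p_one_each, 4))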

Approaches to Probability
There are three approaches:
1. Classical Approach
It is based on the assumption that all the possible outcomes (finite in number) of an experiment are
mutually exclusive and equally likely.
P(E) = (Number of favourable outcomes) / (Total number of outcomes)

For example, if a fair die is rolled, then on any trial, each event (face or number) is equally likely to
occur since there are six equally likely exhaustive events, each will occur 1/6 of the time, and
therefore the probability of any one event occurring is 1/6.

2. Relative Frequency Approach


• It is based on the assumption that a random experiment can be repeated a large number of times under identical conditions, where the trials are independent of each other.
• While conducting a random experiment, we may or may not observe the desired event. But as
the experiment is repeated many times, that event may occur some proportion of time.

P(A) = c(s) / n as n becomes large, where c(s) is the number of trials in which event A occurs and n is the total number of trials.

3. Subjective Approach
● The subjective approach of calculating probability is always based on the degree of
beliefs, convictions, and experience concerning the likelihood of occurrence of a random
event.
● Probability assigned for the occurrence of an event may be based on just a guess or on
having some idea about the relative frequency of past occurrences of the event.
A bag contains 6 red and 8 green balls.
a) If one ball is drawn at random, then what is the probability of the ball being green?
b) If two balls are drawn at random, then what is the probability that one is red and the other
green?

Unit II

Random Variable

A random variable is a numerical description of the outcome of an experiment.


1. Discrete random variable
2. Continuous Random Variable
A discrete random variable can take on only a finite or countably infinite number of different values such as 0, 1, 2, … For example: the number of students present in the class.

• A continuous random variable can take any numerical value in an interval or collection of
intervals. A continuous random variable is usually the result of experimental outcomes that
are based on measurement scales
• For instance, measurement of time, weight, distance, temperature, and so on are all treated as
continuous random variables

Binomial Probability Distribution:

• Binomial probability distribution is a widely used probability distribution for a discrete


random variable.

• This distribution describes discrete data resulting from an experiment called a Bernoulli
process (named after Jacob Bernoulli, 1654–1705, the first of the Bernoulli family of Swiss
mathematicians).

• For each trial of an experiment, there are only two possible complementary (mutually
exclusive) outcomes such as, defective or good, head or tail, zero or one, boy or girl.

• In such cases the outcome of interest is referred to as a ‘success’ and the other as a ‘failure’.

Properties of Binomial Experiment

1. The experiment consists of a sequence of n identical trials.


2. Two outcomes are possible on each trial. We refer to one outcome as a success and the other outcome as a failure.
3. The probability of a success, denoted by p, does not change from trial to trial.
Consequently, the probability of a failure, denoted by 1 − p, does not change from trial to
trial.
4. The trials are independent.

Binomial Probability Density Function

If X denotes the number of successes in n trials, then

P(X = r) = C(n, r) p^r (1 − p)^(n − r),   r = 0, 1, 2, …, n

This is known as the binomial distribution with parameters n and p.

The mean and standard deviation of a binomial distribution are

Mean = np   and   Standard deviation = √(np(1 − p))
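A brief sketch of how the binomial formula above could be evaluated in Python (the values n = 10 and p = 0.3 are hypothetical):

from math import comb, sqrt

def binomial_pmf(r, n, p):
    # P(X = r) = C(n, r) * p^r * (1 - p)^(n - r)
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.3                           # hypothetical: 10 trials, success probability 0.3
print(binomial_pmf(3, n, p))             # probability of exactly 3 successes
print(n * p, sqrt(n * p * (1 - p)))      # mean and standard deviation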

Poisson Probability Distribution

It is applicable when

1. Probability of success, p is very small


2. The number of trials, n, is very large.

If the probability, p of occurrence of an outcome of interest (i.e., success) in each trial is very small,
but the number of independent trials n is sufficiently large, then the average number of times that an
event occurs in a certain period of time, λ = np is also small.

So the Poisson Distribution function can be written as

P(X = r) = (e^(−λ) · λ^r) / r!,   r = 0, 1, 2, …

where e = 2.7183 (approximately) and λ = np.
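The Poisson formula is equally easy to evaluate directly; a minimal sketch with an assumed rate λ = 2.5:

from math import exp, factorial

def poisson_pmf(r, lam):
    # P(X = r) = e^(-lam) * lam^r / r!
    return exp(-lam) * lam**r / factorial(r)

lam = 2.5                                # hypothetical rate, e.g. lam = n * p
print(poisson_pmf(0, lam), poisson_pmf(3, lam))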

Properties of Poisson Distribution

1. The probability of an occurrence is the same for any two intervals of equal length.

2. The occurrence or nonoccurrence in any interval is independent of the occurrence or


nonoccurrence in any other interval.

The Normal Distribution


Properties of Normal Distribution

1. The normal curve is a bell-shaped curve. The top of the bell is directly above the mean (μ).
2. The curve is symmetrical about the line (X = µ), (Z = 0), i.e. it has the same shape on either side of the line (X = µ), (Z = 0).
3. Since the distribution is symmetrical, the mean, median and mode coincide:
   mean = median = mode
4. Since mean = median = mode, the ordinate at X = µ (Z = 0) divides the whole area into two equal parts. Further, since the total area under the normal probability curve is 1, the area to the right of the ordinate as well as to the left of the ordinate at X = µ (Z = 0) is 0.5.
5. No portion of the curve lies below the x-axis, since p(x) being the probability can never
be negative.
6. The range of the distribution is from − ∞ 𝑡𝑜 + ∞.
7. By virtue of symmetry, the quartiles are equidistant from the median, i.e.
   Q₃ − Med = Med − Q₁ ⇒ Q₃ + Q₁ = 2 Med

8. Since the distribution is symmetrical, the moment coefficient of skewness is given by


β₁ = 0 ⇒ γ₁ = 0
9. The standardized normal random variable, Z (also called the z-statistic, z-score or normal variate), is
   Z = (x − μ) / σ,   equivalently   x = μ + Zσ

Important Point

10. When x is less than the mean (μ), the value of z is negative
11. When x is more than the mean (μ), the value of z is positive
12. When x = μ, the value of z = 0.

13. 68.3% of the values of a normal random variable are within plus or minus one
standard deviation of its mean.
14. 95.4% of the values of a normal random variable are within plus or minus two
standard deviations of its mean.
15. 99.7% of the values of a normal random variable are within plus or minus three
standard deviations of its mean.
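These percentages can be verified against the standard normal distribution; a short sketch using Python's statistics.NormalDist:

from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)
for k in (1, 2, 3):
    within = z.cdf(k) - z.cdf(-k)           # P(-k < Z < +k)
    print(f"within ±{k} sd: {within:.1%}")  # approximately 68.3%, 95.4%, 99.7%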

Unit IV

Correlation Analysis

Correlation analysis: A statistical technique that is used to analyse the strength and direction of the
relationship between two quantitative variables, is called correlation analysis.

A few definitions of correlation analysis are:

• An analysis of the relationship of two or more variables is usually called correlation. — (A.
M. Tuttle)
• When the relationship is of a quantitative nature, the appropriate statistical tool for
discovering and measuring the relationship and expressing it in a brief formula is known as
correlation. — Croxton and Cowden

The coefficient of correlation is a number that indicates the strength (magnitude) and direction of the statistical relationship between two variables.

• The strength of the relationship is determined by the closeness of the points to a straight line
when a pair of values of two variables are plotted on a graph. A straight line is used as the
frame of reference for evaluating the relationship.

• The direction is determined by whether one variable generally increases or decreases when
the other variable increases.

The importance of examining the statistical relationship between two or more variables can be framed in terms of the following questions, each of which requires appropriate statistical methods to answer:

1. Is there an association between two or more variables? If yes, what is the form and degree of
that relationship?

2. Is the relationship strong or significant enough to be useful to arrive at a desirable conclusion?

3. Can the relationship be used for predictive purposes, that is, to predict the most likely value of
a dependent variable corresponding to the given value of independent variable or variables?

TYPES OF CORRELATIONS

1. Positive and negative,

2. Linear and non-linear

3. Simple, partial, and multiple.

1. Positive and negative:

A positive (or direct) correlation refers to the same direction of change in the values of variables. In
other words, if values of variables are varying (i.e., increasing or decreasing) in the same direction,
then such correlation is referred to as positive correlation.

A negative (or inverse) correlation refers to the change in the values of variables in opposite
direction.

2. Linear and non-linear:

A linear correlation implies a constant change in one of the variable values with respect to a change
in the corresponding values of another variable.

𝑐 = α + β𝑦

A non-linear (or curvi-linear) correlation implies an absolute change in one of the variable values with
respect to changes in values of another variable.
c = α + βy²

Simple, Partial, and Multiple Correlation

The distinction between simple, partial, and multiple correlation is based upon the number of
variables involved in the correlation analysis.

• If only two variables are chosen to study correlation between them, then such a correlation is
referred to as simple correlation.

• In partial correlation, two variables are chosen to study the correlation between them, but the
effect of other influencing variables is kept constant. For example (i) yield of a crop is
influenced by the amount of fertilizer applied, rainfall, quality of seed, type of soil, and
pesticides.

• In multiple correlation, the relationship among three or more variables is considered simultaneously for study.

METHODS OF CORRELATION ANALYSIS

• The correlation between two ratio-scaled (numeric) variables is represented by the letter r
which takes on values between –1 and +1 only.

Scatter Diagram Method:

Interpretation of Correlation Coefficients

• A low value of r does not indicate that the variables are unrelated but indicates that the
relationship is poorly described by a straight line. A non-linear relationship may also exist.

• A correlation does not imply a cause-and-effect relationship; it is merely an observed


association.
Properties of the Correlation coefficient

1. The value of r depends on the slope of the line passing through the data points and the
scattering of the pair of values of variables x and y about this line.

2. The sign of the correlation coefficient indicates the direction of the relationship. A positive
correlation indicates that the two variables tend to increase (or decrease) together (a positive
association) and a negative correlation indicates that when one variable increases the other is
likely to decrease (a negative association).

3. The values of the correlation coefficient range from + 1 to – 1.

4. The value of r = +1 or –1 indicates that there is a perfect linear relationship between two
variables.

Karl Pearson’s Correlation Coefficient


r = cov(x, y) / √(var(x) · var(y)) = cov(x, y) / (σx · σy)

where cov(x, y) = (1/n) Σ(x − x̄)(y − ȳ)

σx = √[ Σ(x − x̄)² / n ]   and   σy = √[ Σ(y − ȳ)² / n ]

so that

r = [ (1/n) Σ(x − x̄)(y − ȳ) ] / √{ [Σ(x − x̄)² / n] · [Σ(y − ȳ)² / n] }

In terms of the raw observations,

r = [ n Σxy − (Σx)(Σy) ] / { √[ n Σx² − (Σx)² ] · √[ n Σy² − (Σy)² ] }

and, when x and y denote deviations from their respective means,

r = Σxy / √( Σx² · Σy² )
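A short sketch of the raw-score formula above (the paired data at the end are hypothetical):

from math import sqrt

def pearson_r(x, y):
    # raw-score formula: r = (nΣxy − ΣxΣy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)]
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sx2 - sx**2) * (n * sy2 - sy**2))

print(round(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]), 3))   # hypothetical paired data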

Spearman’s Rank Correlation Coefficient

• Spearman’s rank correlation coefficient, denoted by the symbol ρ (rho), is a non-parametric


measure of statistical dependence between two variables.
The Spearman correlation coefficient is calculated as follows:
1. Assign ranks to each observation in each variable separately, from smallest to largest. If there
are ties (i.e., identical values), assign each tied value the average of the ranks it would have
received if it were unique.
2. Calculate the difference between the ranks of corresponding observations in the two variables.
3. Square each of these differences.
4. Sum up these squared differences.
5. Use the formula to calculate ρ.

ρ = 1 − [ 6Σd² / (n(n² − 1)) ]

Where d = 𝑅1 − 𝑅2 (difference in a pair of ranks)


n = number of paired observations or individuals being ranked
R1 = rank of observations with respect to first variable
R2 = rank of observations with respect to second variable

The ranks of 15 students in two subjects A and B, are given below. The two numbers within
brackets denote the ranks of a student in A and B subjects respectively.
(1, 10), (2, 7), (3, 2), (4, 6), (5, 4), (6, 8), (7, 3), (8, 1), (9, 11), (10, 15), (11, 9), (12, 5),
(13, 14), (14, 12), (15, 13)
Find Spearman’s rank correlation coefficient.
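A sketch of how the formula could be applied to the 15 rank pairs listed above (the variable names are ours):

# Spearman's rank correlation for the 15 (rank in A, rank in B) pairs given above.
pairs = [(1, 10), (2, 7), (3, 2), (4, 6), (5, 4), (6, 8), (7, 3), (8, 1),
         (9, 11), (10, 15), (11, 9), (12, 5), (13, 14), (14, 12), (15, 13)]

n = len(pairs)
sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in pairs)   # Σd²
rho = 1 - 6 * sum_d2 / (n * (n**2 - 1))
print(sum_d2, round(rho, 4))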

Unit V

Statistical Hypothesis:
• A statistical hypothesis is a claim (assertion, statement, belief or assumption) about an
unknown population parameter value.
For instance
• A pharmaceutical company claims that its medicine is effective against a disease, in the sense that 95 percent of all persons suffering from the said disease get cured.
Hypothesis Testing:
• The process that enables a decision maker to test the validity (or significance) of his claim by
analyzing the difference between the value of sample statistic and the corresponding
hypothesized population parameter value, is called hypothesis testing.

GENERAL PROCEDURE FOR HYPOTHESIS TESTING

Step 1: State the Null Hypothesis (𝐻0) and Alternative Hypothesis (𝐻1)
• 𝐻0 represents the claim or statement made about the value or range of values of the

population parameter.

• The null hypothesis is considered true until it is proved false on the basis of results observed from the sample data.

• The null hypothesis is always expressed in the form of mathematical statement.

H₀: x̄ (≤, =, ≥) µ

• where x̄ is the sample mean and µ represents a hypothesized value. Only one sign out of ≤, = and ≥ will appear at a time when stating the null hypothesis.

Alternative Hypothesis

• 𝐻1, is the counter claim (statement) made against the value of the particular population

parameter.

• That is, 𝐻1 must be true when the 𝐻0 is found to be false.

• Parameter value is not equal to the value stated in the null hypothesis and is written as:

H₁: x̄ ≠ μ

Null hypothesis and Alternative hypothesis

Step 2: State the Level of Significance, α (alpha)

• The level of significance, usually denoted by α (alpha), is specified before the samples are
drawn.

• It is specified in terms of the probability of making a Type I error when the null hypothesis is true as an equality.

The level of significance defines the likelihood of rejecting 𝐻0 when it is true.


● α = 0.05 is selected for research projects
● α = 0.01 is selected for quality assurance
● α = 0.10 is selected for political polling.

Step 3: Establish Critical or Rejection Region

The area under the sampling distribution curve of the test statistic is divided into two mutually
exclusive regions. These regions are called the acceptance region and the rejection (or critical)
region.

• The acceptance region is a range of values of the sample statistic spread around the null
hypothesized population parameter. If values of the sample statistic fall within the limits of
acceptance region, the null hypothesis is accepted, otherwise it is rejected.

• The rejection region is the range of sample statistic values within which if values of the
sample statistic falls (i.e. outside the limits of the acceptance region), then null hypothesis is
rejected.

• The value of the sample statistic that separates the regions of acceptance and rejection is
called critical value.

Step 4: Select the Suitable Test of Significance

(a) Whether the test involves one sample, two samples, or k samples?

(b) Whether two or more samples used are independent or related?

(c) Is the measurement scale nominal, ordinal, interval, or ratio?

It is also important to know:

1. Sample size

2. The number of samples, and their size.


Step 5: Formulate a Decision Rule to Accept or Reject the Null Hypothesis

Compare the calculated value of the test statistic with the critical value (also called standard table
value of test statistic).

The decision rules for null hypothesis are as follows:

• Accept 𝐻0 if the test statistic value falls within the area of acceptance.

• Reject otherwise.

ERRORS IN HYPOTHESIS TESTING

Type I Error

• This is the probability of rejecting the null hypothesis when it is true.

• The probability of making a Type I error is denoted by the symbol α. It is represented by the
area under the sampling distribution curve over the region of rejection.

• Type II Error: This is the probability of accepting the null hypothesis when it is false.

• The probability of making a Type II is denoted by the symbol β.

Chi-squared test

• A chi-squared test (symbolically represented as χ2) is basically a data analysis on the basis of
observations of a random set of variables. Usually, it is a comparison of two statistical data
sets. This test was introduced by Karl Pearson in 1900 for categorical data analysis and
distribution. So, it was mentioned as Pearson’s chi-squared test.
Use of Chi-Square Test

It is a non-parametric test of hypothesis testing.

1. χ² test as a test of goodness of fit.

2. χ² test as a test of the independence of two variables.

3. χ² test as a test of significance for association/dependence.

Conditions for Chi-Square Test

• Randomly collect and record the observations.

• In the sample, all the entities must be independent.

Properties of chi-square distribution

1. Unlike the normal distribution, the chi-square distribution takes only positive values and
ranges from 0 to infinity.

2. Unlike the normal distribution, the chi-square distribution is a skewed distribution yet
becomes more symmetric and approaches the normal distribution as the degree of freedom
increases.

Chi-square (χ²) Distribution

• While the Z and t distributions are used for the sampling distributions of the sample mean, the chi-square (χ²) distribution is used for the sampling distribution of the sample variance.

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

Where 𝑂𝑖 is observed value and 𝐸𝑖 is expected value.

E = (RT × CT) / N

RT = the row total for the row containing the cell

CT = the column total for the column containing the cell

N = Total number of observations.
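A brief sketch of the statistic for a hypothetical 2×2 contingency table, with each expected count taken as E = (RT × CT)/N:

observed = [[30, 20],
            [20, 30]]                    # hypothetical 2x2 table of observed counts

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total   # E = RT x CT / N
        chi_sq += (o - e) ** 2 / e

print(round(chi_sq, 3))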


Find the mean deviation from median and standard deviation for the following table

Wages ('000 Rs.):  0-20  20-40  40-60  60-80  80-100  100-120  120-140  140-160
No. of workers:       8     16     12     14      18       24       20        8

A selection board consisting of two experts for the post of general manager in a company interviewed 10 candidates, to whom the two experts assigned ranks as given below. Find the rank correlation.

Expert I:   7   9   2   4   5   5   8  10   3   1
Expert II:  8  10   4   6   4   4   7   9   1   2

Find the Karl Pearson’s coefficient of correlation between price and supply of a commodity
from the following data.

Price 8 10 15 17 20 22 24 25
Supply 25 30 32 35 37 40 42 45

Calculate Karl Pearson’s coefficient of skewness, the first four moments, and β₁ and β₂ for the following table.

Marks No. of Students


0-10 5
10-20 12
20-30 26
30-40 35
40-50 48
50-60 64
60-70 50
70-80 32
80-90 20
90-100 8

Calculate mean, median and mode from the following table and comment on the relationship
between them.
Class Frequency
0-10 8
10-20 14
20-30 24
30-40 31
40-50 47
50-60 54
60-70 35
70-80 26
80-90 15
90-100 6

Calculate the Consumer Price Index number through Aggregate Expenditure Method and
Family Budget Method for the following data.

Commodity   Weight   Price per unit, 2012 (Rs.)   Price per unit, 2018 (Rs.)
Rice 50 15 35
Milk 10 17 28
Oil 35 75 120
Wheat 40 20 45
Sugar 20 15 40
Vegetables 35 15 75

For the following data, find the following:

a) Geometric Mean
b) Harmonic Mean

X:  0-20  20-40  40-60  60-80  80-100  100-120  120-140
f:     7     15     30     40      17       18       19

Calculate quantity indices for the following data using:

a) Laspeyres Index

b) Paasche Index

c) Fisher’s Index

d) Marshall-Edgeworth Index

Item   Price (2020)   Quantity (2020)   Price (2023)   Quantity (2023)
A 35 15 43 33
B 25 30 30 45
C 17 15 35 40
D 37 60 62 32
E 18 40 26 60
F 60 20 90 22
G 23 30 95 25
H 112 30 140 25
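A hedged sketch of the quantity-index formulas this exercise calls for; the function and the two sample rows (items A and B) are ours, and the full table can be passed in the same way:

from math import sqrt

def quantity_indices(items):
    # items: list of (p0, q0, p1, q1) = (base price, base qty, current price, current qty)
    lasp = sum(p0 * q1 for p0, q0, p1, q1 in items) / sum(p0 * q0 for p0, q0, p1, q1 in items) * 100
    paas = sum(p1 * q1 for p0, q0, p1, q1 in items) / sum(p1 * q0 for p0, q0, p1, q1 in items) * 100
    fisher = sqrt(lasp * paas)                          # geometric mean of the two
    marsh = (sum((p0 + p1) * q1 for p0, q0, p1, q1 in items)
             / sum((p0 + p1) * q0 for p0, q0, p1, q1 in items) * 100)
    return lasp, paas, fisher, marsh

rows = [(35, 15, 43, 33), (25, 30, 30, 45)]      # items A and B from the table, as a quick check
print([round(v, 2) for v in quantity_indices(rows)])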

Given the following data, calculate the consumer price index under the aggregate expenditure method
and the family budget method.

Item   Price (2020)   Quantity (2020)   Price (2023)   Quantity (2023)
A 140 20 180 42
B 60 25 93 55
C 76 77 107 126
D 50 60 68 80
E 95 30 135 67
F 37 150 65 174
G 136 20 204 43
H 105 35 166 71
The average sleep time (in hours) and time spent in a gymnasium (in minutes) is given in the
following table. Find Karl Pearson’s correlation coefficient.

Sleep Time:  5.5    6    4  7.5    8    5  6.5    6
Gym Time:     40   60   35   90  120   45   65   50

The following table shows distance travelled (in km) by rural women to come to Bangalore for work,
and the daily wages they earn. Use Spearman’s Rank Correlation to find the level and nature of
association between the two variables.

Distance:   25   30   20   10   60   50   40   25
Wages:     350  500  600  650  750  400  900  620

A company that employs 6000 employees in Bangalore, 4000 employees in Kochi, and 5000 employees in Hyderabad decided to hand out bonuses to its employees at the end of a financial year for outstanding performance. If the probability of an employee getting the bonus was 40% in
Bangalore, equally likely in Kochi, and one-third likely in Hyderabad, find the average number of
employees who got the bonus in each office. Also comment on the spread of the results for each
office.

The total points accrued by four Premier League football clubs are given in the following table

Year       A   B   C   D
2010-11   68  58  71  80
2011-12   70  52  64  89
2012-13   73  61  75  89
2013-14   79  84  82  64
2014-15   75  62  87  70
2015-16   71  60  50  66
2016-17   75  76  93  69
2017-18   63  75  70  81
2018-19   70  97  72  66
2019-20   56  99  66  66
2020-21   61  69  67  74
2021-22   69  92  74  58
2022-23   84  67  44  75

Calculate the average points accrued in the given period by each team, as well as their individual
standard deviations. Based on your analysis, interpret the performance of all four teams in the study
period.

For a given intense monsoon in Meghalaya, find the average amount of rainfall and the standard
deviation of the precipitation. Also calculate the skewness of the dataset and describe the shape of
distribution with the help of a suitable diagram, in comparison with a normal distribution.

Rainfall:  10-30  30-50  50-70  70-90  90-110  110-130  130-150  150-170
Days:         60     70     90     30      20       15        7        4
Compute price index and quantity index numbers for the year 2000 with 1995 as base year, using
Laspeyre’s method and Paasche’s method, Fisher’s price and quantity index.

Commodity   Quantity 1995   Quantity 2000   Value 1995 (Rs)   Value 2000 (Rs)
A 100 150 500 900
B 80 100 320 500
C 60 72 150 360
D 30 33 360 297

Calculate the mean for the following frequency distribution

Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70


No. of
Students 6 5 8 15 7 6 3

Calculate the first four moments about the mean from the following data. Also calculate the
value of β1 and β2

Marks 0-10 10-20 20-30 30-40 40-50 50-60 60-70


No. of
Students 5 12 18 40 15 7 3

Marks    M    f   d = X-A   d* = d/10   d*²   d*³   d*⁴   fd*   fd*²   fd*³   fd*⁴
0-10      5    5     -30        -3        9   -27    81   -15     45   -135    405
10-20    15   12     -20        -2        4    -8    16   -24     48    -96    192
20-30    25   18     -10        -1        1    -1     1   -18     18    -18     18
30-40    35   40       0         0        0     0     0     0      0      0      0
40-50    45   15      10         1        1     1     1    15     15     15     15
50-60    55    7      20         2        4     8    16    14     28     56    112
60-70    65    3      30         3        9    27    81     9     27     81    243
Total        100                                           -19    181    -97    985
'
∑𝑓𝑑
' −19×10
µ1 = first moment about 35 = × 10 = 100
=− 1. 9
∑𝑓
2'
∑𝑓𝑑
' 181×100
µ2 = second moment about 35 = × 100 = 100
= 181
∑𝑓

3'
∑𝑓𝑑
' −97×1000
µ3 = third moment about 35 = × 1000 = 100
=− 970
∑𝑓

4'
∑𝑓𝑑
' 985×10000
µ4 = fourth moment about 35 = × 10000 = 100
=98500
∑𝑓

2
µ3
β1 = 3
µ2

µ4
β2 = 2
µ2

µ1 = 0

' '2 2
µ2 = µ2 − µ1 = 181 − (− 1. 9) = 181 − 3. 61 = 177. 39

3
' '
µ3 = µ3 − 3µ2µ1 + 2µ1
'
( )
'
=− 970 − 3×181×(− 1. 9) + (2×(− 1. 9))
3

− 970 + 1031. 7 − 54. 872

− 970 + 1086. 57 = 6. 828

2 4
'
µ4 = µ4 − 4µ1µ3 + 6µ2 µ1
' ' '
( )
'
( )
'
− 3 µ1

= 98500 − 4(− 1. 9)(− 970) + 6×181×3. 61 − 3×13. 03

= 98500 − 7372 + 3920. 46 − 39. 09 = 102, 420. 46 − 7411. 09 = 95009. 37
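The table totals and the central moments can be reproduced with a short script; a sketch of the same step-deviation calculation using the mid-points and frequencies above:

mids  = [5, 15, 25, 35, 45, 55, 65]
freqs = [5, 12, 18, 40, 15, 7, 3]
A, h, N = 35, 10, sum(freqs)

d = [(m - A) / h for m in mids]                   # the d* column
raw = [sum(f * di**k for f, di in zip(freqs, d)) / N * h**k for k in (1, 2, 3, 4)]
m1, m2, m3, m4 = raw                              # moments about A = 35

mu2 = m2 - m1**2
mu3 = m3 - 3 * m2 * m1 + 2 * m1**3
mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1**2 - 3 * m1**4
beta1 = mu3**2 / mu2**3
beta2 = mu4 / mu2**2
print(raw, mu2, mu3, mu4, beta1, beta2)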

The daily earnings (in rupees) of employees working on a daily basis in a firm are:

Daily earnings (Rs.) 100 120 140 160 180 200 220
Number of
employees 3 6 10 15 24 42 75
Calculate the average daily earning for all employees
A company is planning to improve plant safety. For this, accident data for the last 50
weeks was compiled. These data are grouped into the frequency distribution as shown
below. Calculate the A.M. of the number of accidents per week.

Number of
accidents 0-4 05-09 10-14 15-19 20-24
Number of
weeks 5 22 13 8 2

Disadvantages of Arithmetic Mean

1. The value of A.M. cannot be calculated accurately for unequal and open-ended class
intervals either at the beginning or end of the given frequency distribution.

2. The A.M. is reliable and reflects all the values in the data set. However, it is very
much affected by the extreme observations (or outliers) which are not representative
of the rest of the data set.

3. The calculation of the A.M. sometimes becomes difficult because every data element is used in the calculation.

4. The mean cannot be calculated for qualitative characteristics such as intelligence,


honesty, beauty, or loyalty.

Properties of Arithmetic Mean

1. The algebraic sum of deviations of all the observations xi (i = 1, 2 . . ., n) from the


A.M. is always zero, that is,

Σ(xᵢ − x̄) = Σxᵢ − n·x̄ = Σxᵢ − n·(Σxᵢ / n) = 0

2. The sum of the squares of the deviations of all the observations from the A.M. is less than the sum of the squares of the deviations of the observations from any other value a, that is,

Σ(xᵢ − x̄)² ≤ Σ(xᵢ − a)²
3. It is possible to calculate the combined (or pooled) arithmetic mean of two or more
than two sets of data of the same nature.

x̄₁₂ = (n₁x̄₁ + n₂x̄₂) / (n₁ + n₂)

AVERAGES OF POSITION

Median:

• Median may be defined as the middle value in the data set when its elements are
arranged in a sequential order, that is, in either ascending or descending order of
magnitude.

• It is called a middle value in an ordered sequence of data in the sense that half of the
observations are smaller and half are larger than this value.

• The median is thus a measure of the location or centrality of the observations.

Measurement of the median for ungrouped data

• In this case, the data is arranged in either ascending or descending order of magnitude.

1. If the number of observations (n) is an odd number, then the median (Med) is the numerical value corresponding to the ((n + 1)/2)th ordered observation. That is,

   Med = size or value of the ((n + 1)/2)th observation in the data array.

2. If the number of observations (n) is an even number, then the median is defined as the arithmetic mean of the numerical values of the (n/2)th and (n/2 + 1)th observations in the data array. That is,

   Med = [ (n/2)th observation + (n/2 + 1)th observation ] / 2

Grouped Data

Med = l + [ (n/2 − cf) / f ] × h

• 𝑙 = lower class limit (or boundary) of the median class interval


• 𝑐𝑓 = cumulative frequency of the class prior to the median class interval, that is, the
sum of all the class frequencies upto, but not including, the median class interval

• f = frequency of the median class

• h = width of the median class interval

• n = total number of observations in the distribution (sum of frequencies).
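A sketch of the grouped-data median formula in code (the class boundaries and frequencies at the end are hypothetical):

def grouped_median(class_lowers, freqs, h):
    # Med = l + ((n/2 - cf) / f) * h, where l is the lower limit of the median class
    n = sum(freqs)
    cf = 0
    for lower, f in zip(class_lowers, freqs):
        if cf + f >= n / 2:            # the median class is reached
            return lower + ((n / 2 - cf) / f) * h
        cf += f

# hypothetical distribution with classes 0-10, 10-20, ..., 40-50 of width h = 10
print(grouped_median([0, 10, 20, 30, 40], [8, 14, 24, 31, 13], h=10))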

In a factory employing 3000 persons, 5 per cent earn less than Rs. 150 per day, 580
earn from Rs. 151 to Rs. 200 per day, 30 per cent earn from Rs. 201 to Rs. 250 per
day, 500 earn from Rs. 251 to Rs. 300 per day, 20 per cent earn from Rs. 301 to Rs.
350 per day, and the rest earn Rs. 351 or more per day. What is the median wage?

Advantages, Disadvantages, and Applications of Median

Advantages

1. Median is unique, i.e. like mean, there is only one median for a set of data.

2. The value of median is easy to understand and may be calculated from any type of
data. The median in many situations can be located simply by inspection.

3. The sum of the absolute differences of all observations in the data set from median
value is minimum.

4. In other words, the absolute difference of observations from the median is less than
from any other value in the distribution. That is, Σ | x − Med| = a minimum value.

5. The extreme values in the data set do not affect the calculation of the median value, and therefore it is a useful measure of central tendency when such values do occur.

6. The median is considered the best statistical technique for studying the qualitative
attribute of an observation in the data set.

7. The median value may be calculated for an open-ended distribution of data set.

Disadvantages
1. The median is not capable of further algebraic treatment. For example, the median of a combined data set cannot be determined from the medians of two or more individual sets of data.

2. The value of median is affected more by sampling variations, that is, it is affected by
the number of observations rather than the values of the observations.

3. Since median is an average of position, therefore arranging the data in ascending or


descending order of magnitude is time-consuming in case of a large number of
observations.

4. The calculation of median in case of grouped data is based on the assumption that
values of observations are evenly spaced over the entire class interval.

Applications

The median is helpful in understanding the characteristics of a data set when

1. Observations are qualitative in nature

2. Extreme values are present in the data set

3. A quick estimate of an average is desired.

Advantages and Disadvantages of Mode Value

Advantages

1. Mode value is easy to understand and to calculate. Mode class can also be located by
inspection.

2. The mode is not affected by the extreme values in the distribution. The mode value can also
be calculated for open-ended frequency distributions.

3. The mode can be used to describe quantitative as well as qualitative data. For example, its
value is used for comparing consumer preferences for various types of products, say
cigarettes, soaps, toothpastes, or other products.

Disadvantages
1. Mode is not a rigidly defined measure as there are several methods for calculating its value.

2. It is difficult to locate modal class in the case of multi-modal frequency distributions.

3. Mode is not suitable for algebraic manipulations.

4. When a data set contains more than one mode, such values are difficult to interpret and compare.

Advantages, Disadvantages and Applications of Range

Advantages

1. It is independent of the measure of central tendency and easy to calculate and understand.

2. It is quite useful in cases where the purpose is only to find out the extent of extreme variation,
such as industrial quality control, temperature, rainfall, and so on.

Disadvantages

1. The calculation of the range is based on only two values, the largest and the smallest in the data set, and fails to take into account any of the other observations.

2. It is largely influenced by two extreme values and completely independent of the other values.
For example, range of two data sets {1, 2, 3, 7, 12} and {1, 1, 1, 12, 12} is 11, but the two
data sets differ in terms of overall dispersion of values.

3. Its value is sensitive to changes in sampling, that is, different samples of the same size from
the same population may have widely different ranges.

4. It cannot be computed in the case of open-ended frequency distributions because no highest or lowest value exists in an open-ended class.

Applications of Range

1. Fluctuation in share prices: The range is useful in the study of small variations among values
in a data set, such as variation in share prices and other commodities that are very sensitive to
price changes from one period to another.

2. Quality control: It is widely used in industrial quality control. Quality control is exercised by
preparing suitable control charts. These charts are based on setting an upper control limit
(range) and a lower control limit (range) within which produced items shall be accepted. The
variation in the quality beyond these ranges requires necessary correction in the production
process or system.

3. Weather forecasts: The concept of range is used by meteorological departments to determine and announce to the general public the difference between the maximum and minimum temperature or rainfall.

Interquartile Range or Deviation

• The limitations or disadvantages of the range can partially be overcome by using another
measure of variation which measures the spread over the middle half of the values in the data
set so as to minimise the influence of outliers (extreme values) in the calculation of range.

• Since a large number of values in the data set lie in the central part of the frequency
distribution, therefore it is necessary to study the Interquartile Range (also called mid-spread).

• To compute this value, the entire data set is divided into four parts each of which contains 25
percent of the observed values.

• The interquartile range is a measure of dispersion or spread of values in the data set between
the third quartile, Q3 and the first quartile, Q1.

• Interquartile range (IQR) = Q3 – Q1

• Half the distance between Q1 and Q3 is called the semi-interquartile range or the quartile
deviation (QD).
• QD = (Q₃ − Q₁) / 2

Advantages and Disadvantages of Quartile Deviation

Advantages

1. It is not difficult to calculate but can only be used to evaluate variation among observed
values within the middle of the data set. Its value is not affected by the extreme (highest and
lowest) values in the data set.

2. It is an appropriate measure of variation for a data set summarized in open-ended class


intervals.
3. Since it is a positional measure of variation, therefore it is useful in case of erratic or highly
skewed distributions, where other measures of variation get affected by extreme values in the
data set.

Disadvantages

1. The value of Q.D. is based on the middle 50 percent observed values in the data set, therefore
it cannot be considered as a good measure of variation as it is not based on all the
observations.

2. The value of Q.D. is very much affected by sampling fluctuations.

3. The Q.D. has no relationship with any particular value or an average in the data set for
measuring the variation. Its value is not affected by the distribution of the individual values
within the interval of the middle 50 percent observed values.

Mean Absolute Deviation (MAD)

MAD = (1/N) Σ |x − μ|   for a population

MAD = (1/n) Σ |x − x̄|   for a sample

where | | indicates the absolute value.

For a grouped frequency distribution, MAD is given by

MAD = Σ fᵢ |x − x̄| / Σ fᵢ
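A minimal sketch of the sample MAD for ungrouped data (the list of values is hypothetical):

def mean_absolute_deviation(values):
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

print(mean_absolute_deviation([4, 8, 6, 5, 3, 7]))   # hypothetical sample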

Skewness
• A frequency distribution of the set of values that is not ‘symmetrical (normal)’ is called
asymmetrical or skewed.
• In a skewed distribution, extreme values in a data set move towards one side or tail of a
distribution, thereby lengthening that tail.
• When extreme values move towards the upper or right tail, the distribution is positively
skewed.
• When such values move towards the lower or left tail, the distribution is negatively skewed.
• For a positively skewed distribution A.M. > Median > Mode, and for a negatively skewed
distribution A.M. < Median < Mode.
• The relationship between these measures of central tendency is used to develop a measure of
skewness called the coefficient of skewness to understand the degree to which these three
measures differ.

MEASURES OF SKEWNESS
• The degree of skewness in a distribution can be measured both in the absolute and relative
sense.
• For an asymmetrical distribution, the distance between mean and mode may be used to
measure the degree of skewness because the mean is equal to mode in a symmetrical
distribution. Thus,
𝑆𝑘 = 𝑀𝑒𝑎𝑛 − 𝑀𝑜𝑑𝑒
Or 𝑆𝑘 = (𝑄3 + 𝑄1) − 2 𝑀𝑒𝑑𝑖𝑎𝑛 (if measured in terms of quartiles)
• For a positively skewed distribution, Mean > Mode and therefore 𝑆𝑘 is a positive value,
otherwise it is a negative value
Relative Measures of Skewness
Karl Pearson’s Coefficient of Skewness
Sk_P = (Mean − Mode) / (standard deviation) = (x̄ − M₀) / σ

where Sk_P = Karl Pearson's coefficient of skewness.

Since a mode does not always exist uniquely in a distribution, it is convenient to define this measure using the median:

Sk_P = 3(x̄ − Med) / σ
Theoretically, the value of Sk_P varies between ±3, but for a moderately skewed distribution it lies between ±1. Because it is based on quartiles rather than on the mean and standard deviation, Bowley's coefficient of skewness (below) is the measure that is particularly useful in open-end distributions.

Bowley’s Coefficients of Skewness


• The method suggested by Prof. Bowley is based on the relative positions of the median and
the quartiles in a distribution. If a distribution is symmetrical, then Q1 and Q3 would be at
equal distances from the value of the median, that is,
𝑀𝑒𝑑𝑖𝑎𝑛 − 𝑄1 = 𝑄3 − 𝑀𝑒𝑑𝑖𝑎𝑛

Median = (Q₃ + Q₁) / 2

• This shows that the value of median is the mean value of Q1 and Q3. Obviously in such a
case, the absolute value of the coefficient of skewness will be zero.
• When a distribution is asymmetrical, quartiles are not at equal distance from the median
• The distribution is positively skewed if Q₃ − Median > Median − Q₁, and negatively skewed otherwise.

KURTOSIS
The measure of kurtosis, describes the degree of concentration of frequencies (observations)
in a given distribution. That is, whether the observed values are concentrated more around the
mode (a peaked curve) or away from the mode towards both tails of the frequency curve.
The word ‘kurtosis’ comes from a Greek word meaning ‘humped’. In statistics, it refers to the
degree of flatness or peakedness in the region about the mode of a frequency curve.

Properties of standard deviation


1. Standard deviation is independent of change of origin but not of scale.
If d = X − A, then σx = σd.
But if d = (X − A)/h, h > 0, then σx = h × σd.
2. Standard deviation is the minimum value of the root mean square deviation.
   σ² = (1/N) Σ fᵢ(xᵢ − x̄)²
   and the mean square deviation is s² = (1/N) Σ fᵢ(xᵢ − A)².
   Then s² ≥ σ² ⇒ s ≥ σ, and the sign of equality holds if and only if A = x̄.
3. Standard deviation is less than or equal to Range.
4. Standard deviation is suitable for further mathematical treatment. If we know the size, means,
and standard deviations of two or more groups, then we can obtain the standard deviation of
the group obtained by combining all the groups.
5. The standard deviation of the first n natural numbers, viz. 1, 2, 3, …, n, is √[(n² − 1)/12].
6. The empirical rule: for any symmetrical bell-shaped distribution, we have approximately the following area properties.
a) 68% of the observations lie in the range: mean ± 1σ
b) 95% of the observations lie in the range: mean ± 2σ
c) 99.7% of the observations lie in the range: mean ± 3σ
7. The approximate relationship between quartile deviation, mean deviation, and standard deviation is:
   Q.D. ≅ (2/3)σ and M.D. ≅ (4/5)σ, so that Q.D. : M.D. : S.D. :: 10 : 12 : 15

8. For any discrete distribution, standard deviation is not less than mean deviation about mean.
𝑆. 𝐷(σ)≥𝑀𝑒𝑎𝑛 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 𝑎𝑏𝑜𝑢𝑡 𝑀𝑒𝑎𝑛
