QM (UM20MB502) - Unit 1; Introduction to Statistics – Notes

1. What is data and how is it classified?
Data are facts, or existing information and knowledge, represented or coded in a form
suitable for use or processing. Data are measured, collected, analyzed and reported.
Data are classified into:
i) Qualitative data
Data classified according to a qualitative phenomenon which is not capable of
quantitative measurement, such as honesty, gender or intelligence.
If the data are classified into only two classes, such as presence or absence, yes or
no, honest or dishonest, the classification is termed simple or dichotomous.
If there are more than two classes, it is termed manifold classification.
ii) Quantitative data
Data classified on the basis of a phenomenon which is capable of
quantitative measurement, such as age, height, income or sales. The
quantitative phenomenon under study is known as a variable.
These variables can be measured by using 4 scales namely
i) Nominal scale
ii) Ordinal scale
iii) Interval scale
iv) Ratio scale

2. What are the requisites of a good average or measure of central tendency?

I. It should be rigidly defined i.e., the definition should be clear and unambiguous so that it leads to
one and only one interpretation by different persons. In other words, the definition should not leave
anything to the discretion of the investigator or the observer. If it is not rigidly defined then the bias
introduced by the investigator will make its value unstable and render it unrepresentative of the
distribution.
II. It should be easy to understand and calculate even for a non-mathematical person. In other words,
it should be readily comprehensible and should be computed with sufficient ease and rapidity and
should not involve heavy arithmetical calculations. However, this should not be accomplished at the
expense of accuracy or some other advantages which an average may possess.
III. It should be based on all the observations. Thus, in the computation of an ideal average the entire
set of data at our disposal should be used and there should not be any loss of information resulting
from not using the available data. Obviously, if the whole data is not used in computing the average,
it will be unrepresentative of the distribution.
IV. It should be suitable for further mathematical treatment. In other words, the average should possess
some important and interesting mathematical properties so that its use in further statistical theory
is enhanced. For example, if we are given the averages and sizes (frequencies) of a number of
different groups then for an ideal average we should be in a position to compute the average of the
combined group. If an average is not amenable to further algebraic manipulation, then obviously its
use will be very much limited for further applications in statistical theory.
V. It should be affected as little as possible by fluctuations of sampling. By this we mean that if we take
independent random samples of the same size from a given population and compute the average
for each of these samples then, for an ideal average, the values so obtained from different samples
should not vary much from one another. The difference in the values of the average for different
samples is attributed to the so-called fluctuations of sampling. This property is also explained by
saying that an ideal average should possess sampling stability.
VI. It should not be affected much by extreme observations. By extreme observations we mean very
small or very large observations. Thus a few very small or very large observations should not unduly
affect the value of a good average.
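Requisite IV (suitability for further mathematical treatment) can be illustrated with the combined mean of several groups. The following is a minimal Python sketch, assuming hypothetical group means and sizes; the function name combined_mean is chosen here only for illustration.

```python
def combined_mean(means, sizes):
    """Mean of the combined group: sum of (group mean * group size) / total size."""
    total_size = sum(sizes)
    weighted_sum = sum(m * n for m, n in zip(means, sizes))
    return weighted_sum / total_size


# Hypothetical groups: 40 items with mean 50 and 60 items with mean 60.
result = combined_mean([50, 60], [40, 60])
print(result)  # (50*40 + 60*60) / 100 = 56.0
```

Only the group means and sizes are needed; the individual observations are not. The median and mode have no comparable property.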
2. List the merits and Demerits of Arithmetic Mean.
Merits:
a) It is rigidly defined.
b) It is easy to calculate and understand.
c) It is based on all the observations.
d) It is suitable for further mathematical treatment.
e) Of all the averages, arithmetic mean is affected least by fluctuations of sampling. This property is
explained by saying that arithmetic mean is a stable average.

Demerits:
a) The strongest drawback of arithmetic mean is that it is very much affected by extreme
observations.
b) Arithmetic mean cannot be used in the case of open end classes such as less than 10, more than
70, etc., since for such classes we cannot determine the mid-value X of the class intervals
c) It cannot be determined by inspection nor can it be located graphically.
d) Arithmetic mean cannot be used if we are dealing with qualitative characteristics which cannot be
measured quantitatively such as intelligence, honesty, beauty, etc. In such cases median
(discussed later) is the only average to be used.
e) Arithmetic mean cannot be obtained if a single observation is missing or lost or is illegible unless
we drop it out and compute the arithmetic mean of the remaining values.
f) In extremely asymmetrical (skewed) distribution, usually arithmetic mean is not representative of
the distribution and hence is not a suitable measure of location.
g) Arithmetic mean may lead to wrong conclusions if the details of the data from which it is obtained
are not available.
h) Arithmetic mean may not be one of the values which the variable actually takes; in that case it is
termed a fictitious average.
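The effect of an extreme observation on the arithmetic mean can be seen in a small example. This is a sketch with hypothetical income figures, using Python's standard statistics module; the variable names are illustrative.

```python
from statistics import mean, median

# Hypothetical incomes; the second list adds a single extreme value.
incomes = [20, 22, 24, 26, 28]
incomes_with_extreme = incomes + [300]

mean_before = mean(incomes)                    # 120 / 5 = 24
mean_after = mean(incomes_with_extreme)        # 420 / 6 = 70
median_before = median(incomes)                # middle value = 24
median_after = median(incomes_with_extreme)    # (24 + 26) / 2 = 25

print(mean_before, mean_after)
print(median_before, median_after)
```

One extreme value moves the mean from 24 to 70, while the median moves only from 24 to 25.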
3. What is Median? List the merits and Demerits of Median
“The median is that value of the variable which divides the group in two equal parts, one part
comprising all the values greater and the other, all the values less than median”.
Merits:
a) It is rigidly defined.
b) Median is easy to understand and easy to calculate for a non-mathematical person.
c) Since median is a positional average, it is not affected at all by extreme observations and as such
is very useful in the case of skewed distributions.
d) Median can be computed while dealing with a distribution with open end classes.
e) Median can sometimes be located by simple inspection and can also be computed graphically.
f) Median is the only average to be used while dealing with qualitative characteristics which cannot
be measured quantitatively but can still be arranged in ascending or descending order of
magnitude.
Demerits:
a) In case of an even number of observations for ungrouped data, median cannot be determined
exactly; it is estimated as the arithmetic mean of the two middle values.
b) Median, being a positional average, is not based on each and every item of the distribution.
c) Median is not suitable for further mathematical treatment.
d) Median is relatively less stable than mean, particularly for small samples since it is affected more
by fluctuations of sampling as compared with arithmetic mean.
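The positional nature of the median for ungrouped data can be sketched as follows. This is a minimal Python illustration with hypothetical values; for an even number of observations it takes the mean of the two middle values, which is the usual convention and only an estimate.

```python
def median_ungrouped(values):
    """Median of ungrouped data: the middle value after sorting."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: mean of the two middle values


print(median_ungrouped([7, 1, 5]))            # middle of 1, 5, 7
print(median_ungrouped([7, 1, 5, 3]))         # mean of 3 and 5
print(median_ungrouped([1, 3, 5, 7, 1000]))   # the extreme value 1000 has no effect
```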
4. What is Mode? List the merits and Demerits of Mode
“The mode of a distribution is the value at the point around which the items tend to be most heavily
concentrated. It may be regarded as the most typical of a series of values”.
Merits:
a) Mode is easy to calculate and understand. In some cases it can be located merely by inspection.
It can also be estimated graphically from a histogram.
b) Mode is not at all affected by extreme observations and as such is preferred to arithmetic mean
while dealing with extreme observations.
c) It can be conveniently obtained in the case of open end classes, which do not pose any problems
in locating the mode.
Demerits:
a) Mode is not rigidly defined.
b) Since mode is the value of X corresponding to the maximum frequency, it is not based on all the
observations of the series.
c) Mode is not suitable for further mathematical treatment.
d) As compared with mean, mode is affected to a greater extent by the fluctuations of sampling.
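The statement that mode is not rigidly defined can be illustrated by a series in which two values share the maximum frequency. This is a minimal Python sketch with hypothetical data; the helper name modes is illustrative.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the maximum frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes([2, 3, 3, 4, 5]))   # a single mode
print(modes([1, 1, 2, 2, 3]))   # two values tie, so the mode is not unique
```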
5. What are partition values? Explain each of them with formula to calculate
The values which divide the series into a number of equal parts are called the ‘partition values.’
Thus median may be regarded as a particular partition value which divides the given data into two equal
parts. The different types of partition values are:

a) Quartiles
b) Deciles
c) Percentiles
The values which divide the given data into four equal parts are known as quartiles. Obviously
there will be three such points Q1, Q2 and Q3 such that Q1 ≤ Q2 ≤ Q3 termed as the three
quartiles. Q1, known as the lower or first quartile is the value which has 25% of the items of
distribution below it and consequently 75% of the items are greater than it. Incidentally Q2, the
second quartile, coincides with the median and has an equal number of observations above it and
below it. Q3, known as the upper or third quartile, has 75% of the observations below it and
consequently 25% of the observations above it.

The working principle for computing the quartiles is basically the same as that of computing the
median.
To compute Q1, the following steps are required:
1) Find N/4, where N = ∑f is the total frequency.
2) See the (less than) cumulative frequency (c.f.) just greater than N/4.
3) The corresponding value of X gives the value of Q1. In case of continuous frequency distribution,
the corresponding class contains Q1 and the value of Q1 is obtained by the interpolation
formula:

Q1 = l + (h/f) (N/4 - C)

Where l is the lower limit, f is the frequency and h is the magnitude of the class containing Q1
and C is the cumulative frequency of the class preceding the class containing Q1.
Similarly, to compute Q3, see the (less than) c.f. just greater than 3N/4. The corresponding
value of X gives Q3. In case of continuous frequency distribution, the corresponding class
contains Q3 and the value of Q3 is given by the formula:

Q3 = l + (h/f) (3N/4 - C)

Where l is the lower limit, f is the frequency and h is the magnitude of the class containing Q3
and C is the cumulative frequency of the class preceding the class containing Q3.
Deciles are the values which divide the series into ten equal parts. Obviously there are nine deciles
D1, D2, D3,……., D9, such that D1 ≤ D2 ≤…….≤ D9. Incidentally D5 coincides with the median.
To compute the deciles Di (i = 1, 2, …, 9), see the c.f. just greater than i*N/10. The
corresponding value of X is Di. In case of continuous frequency distribution the corresponding
class contains Di and its value is obtained by the formula

Di = l + (h/f) (i*N/10 - C), (i = 1, 2, …, 9)

Where l is the lower limit, f is the frequency and h is the magnitude of the class containing Di and
C is the cumulative frequency of the class preceding the class containing Di.

Percentiles are the values which divide the series into 100 equal parts. Obviously there are 99
percentiles P1, P2, P3, …, P99, such that P1 ≤ P2 ≤ … ≤ P99. Incidentally P50 coincides with the
median.
To compute the percentiles Pi (i = 1, 2, …, 99), see the c.f. just greater than i*N/100. The
corresponding value of X is Pi. In case of continuous frequency distribution the corresponding
class contains Pi and its value is obtained by the formula

Pi = l + (h/f) (i*N/100 - C), (i = 1, 2, …, 99)

Where l is the lower limit, f is the frequency and h is the magnitude of the class containing Pi and
C is the cumulative frequency of the class preceding the class containing Pi.
In particular we shall have,
P25 = Q1, P50 = D5 = Q2, P75 = Q3,
D1 = P10, D2 = P20, D3 = P30……, D9 = P90.
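The quartile, decile and percentile formulas all have the form l + (h/f)(i*N/k - C), with k = 4, 10 or 100. The sketch below applies that form to a hypothetical continuous frequency distribution with equal class width; the function name partition_value and the data are assumptions made for illustration.

```python
def partition_value(lower_limits, width, freqs, i, k):
    """i-th of the k-fold partition values: l + (h/f) * (i*N/k - C)."""
    total = sum(freqs)               # N = sum of frequencies
    target = i * total / k           # i*N/k
    cum = 0                          # C: c.f. of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cum + f > target:         # first class whose c.f. is just greater than i*N/k
            return lower + (width / f) * (target - cum)
        cum += f
    raise ValueError("i must be smaller than k")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 (h = 10), N = 50.
lowers = [0, 10, 20, 30, 40]
freqs = [5, 10, 20, 10, 5]

q1 = partition_value(lowers, 10, freqs, 1, 4)       # 10 + (10/10)(12.5 - 5)  = 17.5
q2 = partition_value(lowers, 10, freqs, 2, 4)       # 20 + (10/20)(25 - 15)   = 25.0
q3 = partition_value(lowers, 10, freqs, 3, 4)       # 30 + (10/10)(37.5 - 35) = 32.5
p25 = partition_value(lowers, 10, freqs, 25, 100)   # same as Q1
d5 = partition_value(lowers, 10, freqs, 5, 10)      # same as the median
print(q1, q2, q3, p25, d5)
```

The results agree with the identities P25 = Q1 and D5 = Q2.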
6. What are the essential requisites for a measure of variation?
The essential requisites for a good measure of variation are listed below. These requisites help in identifying
the merits and demerits of individual measures of variation.

1. It should be rigidly defined.

2. It should be based on all the values (elements) in the data set.
3. It should be calculated easily, quickly and accurately.
4. It should not be unduly affected by the fluctuation of sampling and also by extreme observations.
5. It should be usable for further mathematical or algebraic manipulations.
7. Briefly explain different types of measures of dispersion.
There are various methods of measuring the dispersion of a series which can be broadly classified into
three categories:

1. Dispersion by the method of limits

Under this method, the dispersion of a series is studied by taking into account the extreme
limits of certain factors, viz. values, quartiles, deciles, percentiles, etc. of a series.
The following measures of dispersion come under this method:

(I) Range (II) Inter-Quartile Range (III) Semi-inter Quartile Range or Quartile Deviation, (IV) Decile
Range, and (V) Percentile Range.
2. Dispersion by the method of computation
Under this method, the dispersal character of a series is studied through the process of
computation.
The following measures of dispersion come under this method:
(I) Mean Deviation (II) Standard Deviation (III) Coefficient of Variation (IV) Variance.
3. Dispersion by the method of graphs

Under this method, the dispersion of a series is studied by drawing certain suitable graphs,
viz. the Lorenz curve.

Absolute and Relative Dispersion

Further, each of the above types of dispersion is studied in two different forms, absolute and relative:
Absolute Dispersion: A measure of dispersion which is expressed in terms of the units of the
observations. The absolute measures are:
(I) Range
(II) Quartile Deviation
(III) Mean Deviation
(IV) Standard Deviation

Range: The difference between the highest and the lowest value in a series. This is the simplest
absolute measure of dispersion.
Symbolically: R = L - S
Where R = Range, L = Maximum value and S = Minimum value.
Quartile Deviation: Half of the difference between the third and first quartiles.
Quartile Deviation, Q.D. = (Q3 - Q1)/2
Mean Deviation: Mean deviation of a series is the arithmetic average of the absolute deviations of
the various items from the median or mean of that series.
Standard Deviation: It is also known as root mean square deviation because it is the
square root of the mean of the squared deviations from the arithmetic mean. Standard deviation
is denoted by the small Greek letter sigma (σ).
Relative Dispersion: A measure of dispersion which is independent of the units of measurement. It is
usually obtained by dividing an absolute measure by the average about which the deviations are taken.

(I) Coefficient of Range = (Largest Value - Smallest Value)/(Largest Value + Smallest Value)

(II) Coefficient of Quartile Deviation
Co-efficient of Q.D. = (Q3-Q1)/ (Q3+Q1)

(III) Coefficient of Mean Deviation

Coefficient of Mean Deviation from mean = Mean Deviation (M.D.)/Arithmetic mean

Coefficient of Mean Deviation from median = Mean Deviation (M.D.)/ median

Coefficient of Mean Deviation from mode = Mean Deviation (M.D.)/ mode

(IV) Coefficient of Variation

The standard deviation must be converted into a relative measure of dispersion for the purpose
of comparison. The relative measure is known as the coefficient of variation.
Coefficient of standard deviation=S.D/Mean
Coefficient of variation =S.D*100/Mean
Variance: Square of standard deviation is called variance.
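The absolute and relative measures defined above can be computed together for a small series. This is a minimal Python sketch on hypothetical data; it uses pstdev from the standard statistics module (divisor n), takes the mean deviation about the mean, and leaves out the quartile deviation because quartiles of ungrouped data depend on the convention used. The function name dispersion_summary is illustrative.

```python
from statistics import mean, pstdev


def dispersion_summary(values):
    """Range, mean deviation, S.D., variance and their coefficients."""
    avg = mean(values)
    sd = pstdev(values)                      # root mean square deviation from the mean
    largest, smallest = max(values), min(values)
    return {
        "range": largest - smallest,                                  # R = L - S
        "coefficient_of_range": (largest - smallest) / (largest + smallest),
        "mean_deviation": mean([abs(v - avg) for v in values]),       # about the mean
        "standard_deviation": sd,
        "variance": sd ** 2,
        "coefficient_of_variation": sd * 100 / avg,
    }


summary = dispersion_summary([2, 4, 4, 4, 5, 5, 7, 9])   # hypothetical series, mean = 5
for name, value in summary.items():
    print(name, round(value, 4))
```

For this series the range is 7, the mean deviation is 1.5, the standard deviation is 2 and the coefficient of variation is 40 per cent.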

8. Define (i) Central tendency (ii) Variation (iii) Skewness

i) Central tendency: The numerical value of an observation (also called the central value) around which
most numerical values of the other observations in the data set show a tendency to cluster or group
is called the central tendency.
ii) Variation: The extent to which numerical values are dispersed around the central value is
called variation.
iii) Skewness: The extent of departure of numerical values from a symmetrical (normal)
distribution around the central value is called skewness.

9. Explain standard deviation. What are its advantages and disadvantages?

Standard Deviation may be defined as the square root of the arithmetic average of the squares
of deviations taken from the arithmetic average of the series. Thus if x1, x2, …, xn is a set of n
observations then its standard deviation is given by:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

where $\bar{x}$ is the arithmetic mean of the observations.

Mathematical Properties of Standard Deviation:

1. Standard deviation is independent of change of origin but not of scale.

2. Standard deviation is the minimum value of the root mean square deviation.
3. Standard deviation is less than or equal to range.
4. Standard deviation is suitable for further mathematical treatment.
5. The standard deviation of the first n natural numbers, viz. 1, 2, 3, …, n, is $\sqrt{(n^2 - 1)/12}$.
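Properties 1, 3 and 5 can be checked numerically. This is a minimal Python sketch on hypothetical observations, using pstdev (divisor n) from the standard statistics module; each check is expected to print True.

```python
from math import isclose, sqrt
from statistics import pstdev

x = [3, 7, 8, 12, 15]                  # hypothetical observations
sd = pstdev(x)

# Property 1: a change of origin (adding 100) leaves the S.D. unchanged,
# while a change of scale (multiplying by 4) multiplies the S.D. by 4.
sd_shifted = pstdev([v + 100 for v in x])
sd_scaled = pstdev([4 * v for v in x])

# Property 3: the S.D. does not exceed the range.
data_range = max(x) - min(x)

# Property 5: S.D. of 1, 2, ..., n equals sqrt((n^2 - 1) / 12).
n = 10
sd_natural = pstdev(list(range(1, n + 1)))
sd_formula = sqrt((n * n - 1) / 12)

print(isclose(sd_shifted, sd), isclose(sd_scaled, 4 * sd))
print(sd <= data_range, isclose(sd_natural, sd_formula))
```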

Merits of Standard Deviation

1. It is rigidly defined and free from any ambiguity.

2. Its calculation is based on all the observations of the series and it cannot be correctly calculated
ignoring any item of a series.
3. It strictly follows the algebraic principles and never ignores the + and – signs like the mean
deviation.
4. It is capable of further algebraic treatment as it has a lot of algebraic properties.
5. It is used as a formidable instrument in higher statistical analysis, viz. correlation, skewness,
regression and sample studies, etc.
6. It is not much affected by the fluctuations in sampling, for which it is widely used in testing
hypotheses and for conducting the different tests of significance.
7. In a normal distribution, the interval mean ± 1σ covers 68.27% of the values, for which it is called
a standard measure of dispersion.
8. It exhibits the scatter or dispersion of the various items of a series from its arithmetic mean and
thereby justifies its name as a measure of dispersion.


Demerits of Standard Deviation

1. It is not easily understood by a common man.

2. Its calculation is difficult as it involves many mathematical models and processes.
3. It is affected very much by extreme values of a series, inasmuch as the squares of deviations of big
items become proportionately bigger than the squares of the smaller items.
4. It cannot be used for comparing the dispersion of two or more series given in different units.
10. What are the Objectives of analyzing Skewness?
The following are the chief objectives of measuring skewness:

1. To find out the nature and degree of concentration of the frequency distribution of a series,
i.e. whether the concentration is towards the lower values or the higher values.
2. To find out the extent to which the empirical relationship between the values of the mean,
median and mode, i.e. (1/3)(Mean - Mode) = (Mean - Median), holds good.
3. To ascertain if the distribution is normal and to determine the various measures of a normal
distribution as per the requirement.
11. Differences between Dispersion and Skewness
Dispersion | Skewness
I. Dispersion deals with the dispersal of the items of a series around its central value. | I. Skewness deals with the nature of distribution of a series, i.e. whether the series is symmetrically distributed or not.
II. Dispersion speaks of the amount of variation of the items from the average value. | II. Skewness speaks about the direction of the variation of the items, i.e. whether it is towards the right or left of the distribution.
III. Dispersion is computed both on the basis and form of certain average. | III. Skewness is computed only on the basis of averages, viz. mean, mode, median, quartile and percentile.
IV. Dispersion studies the degree of variation in the data. | IV. Skewness studies the concentration of the data either in the lower or higher values.
V. Dispersion speaks of the representative character of a central value. | V. Skewness speaks of the normalcy or otherwise of the distribution.
VI. Dispersion indicates the general shape of a frequency distribution. | VI. Skewness indicates how the dispersion on the two sides of the mode varies.
12. Explain the significance of measuring dispersion.
The main significance of measuring dispersion may be summarized as follows.

1. To find out the reliability of an average: The measures of variation enable us to find out if the
average is representative of the data. As stated earlier, dispersion gives us an idea about the spread
of the observations about an average value. If the dispersion is small, it means that the given data
values are closer to the central value (average) and hence the average may be regarded as reliable
in the sense that it provides a fairly good estimate of the corresponding population average. If the
dispersion is large, then the data values are more deviated from the central value, thereby implying
that the average is not representative of the data and hence not quite reliable.
2. To control the variation of the data from the central value: The measures of variation help us to
determine the causes and the nature of variation, so as to control the variation itself. It helps to
measure the extent of variation from the standard quality of various works carried in industries. For
example, we use 3-sigma control limits to determine if a manufacturing process is in control or not.
This helps us to identify the causes of variations in the manufactured product and accordingly take
corrective and remedial measures. The government can also take suitable policy decisions to
remove the inequalities in the distribution of income and wealth, after careful study of the
dispersion of the income and wealth.
3. To compare two or more sets of data regarding their variability: The relative measure of dispersion
may be used to compare two or more distributions, even if they are measured in different units, as
regards their variability or uniformity.
4. To obtain other statistical measures for further analysis of data: The measures of variation are
used for computing other statistical measures which are used extensively in correlation analysis,
regression analysis, theory of estimation and testing of hypothesis, statistical quality control and so
on.
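Point 3 (comparison of series measured in different units) can be illustrated with the coefficient of variation, S.D. × 100 / Mean. This is a minimal Python sketch; the height and weight figures are hypothetical and the function name is illustrative.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Relative dispersion in per cent: S.D. * 100 / Mean."""
    return pstdev(values) * 100 / mean(values)


heights_cm = [160, 165, 170, 175, 180]   # hypothetical heights in centimetres
weights_kg = [50, 60, 70, 80, 90]        # hypothetical weights in kilograms

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(round(cv_heights, 2), round(cv_weights, 2))
```

The standard deviations themselves are in centimetres and kilograms and cannot be compared directly. The coefficients of variation are pure numbers; here the weights show the greater relative variability.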
13. Briefly explain the different types of skewness with sketches.
Skewness is defined as asymmetry in the distribution of the sample data values. Values on one side of the
distribution tend to be further from the 'middle' than values on the other side. For skewed data, the usual
measures of location will give different values, for example, mode<median<mean would indicate positive
(or right) skewness. Positive (or right) skewness is more common than negative (or left) skewness.
There are two types of skewness; they are (1) Positive skewness and (2) Negative skewness
Positive Skewness:
A series is said to have positive skewness when the following characteristics are noticed.

• Mean > Median > Mode

• The right tail of the curve is longer than its left tail when the data are plotted as a histogram
or a frequency polygon.
• The formula of skewness and its coefficient give positive figures.

Negative Skewness:
A series is said to have negative skewness when the following characteristics are noticed.

• Mode > Median > Mean

• The left tail of the curve is longer than the right tail when the data are plotted as a histogram
or a frequency polygon.
• The formula of skewness and its coefficient give negative figures.
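The ordering of mean, median and mode in the two cases can be checked on small series. This is a minimal Python sketch with hypothetical data; the sign of Mean - Mode is used here only as a simple indicator of direction, and the function name describe_skew is illustrative.

```python
from statistics import mean, median, mode


def describe_skew(values):
    """Return (mean, median, mode, direction) from the sign of mean - mode."""
    m, med, mo = mean(values), median(values), mode(values)
    if m > mo:
        direction = "positive"
    elif m < mo:
        direction = "negative"
    else:
        direction = "none indicated"
    return m, med, mo, direction


right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]     # few high values
left_tailed = [20 - v for v in right_tailed]       # mirror image: few low values

print(describe_skew(right_tailed))   # mean 5 > median 3 > mode 2
print(describe_skew(left_tailed))    # mode 18 > median 17 > mean 15
```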

The major differences between positive and negative skewness are as follows:
Positive skewness | Negative skewness
1. The right tail is longer. | The left tail is longer.
2. The mass of the distribution is concentrated on the left of the figure. | The mass of the distribution is concentrated on the right of the figure.
3. It has relatively few high values. | It has relatively few low values.
4. The distribution is said to be right-skewed or "skewed to the right". | The distribution is said to be left-skewed or "skewed to the left".
14. What are outliers in the data? What is a box plot?
An outlier is an observation that lies an abnormal distance from other values in a
random sample from a population.
The box plot is a graphical display for describing the behaviour of the data in the
middle as well as at the ends of the distributions. The box plot uses the median and
the lower and upper quartiles (defined as the 25th and 75th percentiles).
A box plot is constructed by drawing a box between the upper and lower quartiles
with a solid line drawn across the box to locate the median. The following quantities
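are needed for identifying extreme values in the tails of the distribution:
1. Central Line: Median
2. Body Lower Level: Q1
3. Body Upper Level: Q3
4. Lower Limit: Q1 - 1.5*IQR
5. Upper Limit: Q3 + 1.5*IQR
Here IQR = Q3 - Q1 is the inter-quartile range. Observations lying outside the lower and upper
limits are treated as outliers.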
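The box plot quantities can be computed as in the following minimal Python sketch. The data are hypothetical. Quartiles of ungrouped data depend on the convention used; the "inclusive" method of statistics.quantiles (Python 3.8 or later) is an assumption made here, and other conventions may give slightly different limits.

```python
from statistics import quantiles


def box_plot_limits(values):
    """Return (Q1, median, Q3, lower limit, upper limit)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                                  # inter-quartile range
    return q1, q2, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [10, 12, 13, 14, 15, 16, 17, 18, 40]        # hypothetical observations
q1, med, q3, lower, upper = box_plot_limits(data)
outliers = [v for v in data if v < lower or v > upper]

print(q1, med, q3)      # quartiles: 13, 15 and 17
print(lower, upper)     # limits: 13 - 1.5*4 = 7 and 17 + 1.5*4 = 23
print(outliers)         # 40 lies above the upper limit
```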

