Computational Statistics

© Amity University Press

All Rights Reserved

No part of this publication may be reproduced, stored in a retrieval system or transmitted
in any form or by any means, electronic, mechanical, photocopying, recording or otherwise
without the prior permission of the publisher.

Advisory Committee

Chairman : Ms. Monica Agarwal
Members : Prof. Arun Bisaria
Dr. Priya Mary Mathew
Prof. Aindril De
Mr. Alok Awtans
Dr. Coral J Barboza
Dr. Monica Rose
Mr. Sachit Paliwal

SLM Review Committee

Mr. Gaurav Agarwal
Ms. Nitika Khanna
Ms. Rashmi Saxena
Ms. Renu Singh
Ms. Mona Chaudhary

Published by Amity University Press for exclusive use of Amity Directorate of Distance and Online Education,
Amity University, Noida-201313
Contents

Module - 1: Introduction to Computational Statistics

1.1 Concept of statistical population
1.1.1 Concept of Statistical Population
1.2 Attributes and variables, different types of scales
1.2.1 Discrete and Continuous Variables
1.2.2 Nominal, Ordinal, Ratio and Interval
1.3 Primary data and Secondary data
1.3.1 Designing a Questionnaire and Schedule
1.3.2 Collection of Primary Data, Checking their Consistency
1.3.3 Scrutiny of Data for Internal Consistency
1.3.4 Detection of Errors of Recording
1.4 Presentation of data
1.4.1 Classification, Tabulation
1.4.2 Diagrammatic Representation of Grouped Data
1.4.3 Graphical Representation of Grouped Data
1.5 Frequency distributions
1.5.1 Cumulative Frequency Distributions and their Graphical Representations
1.5.2 Frequency Polygon and Ogives
1.5.3 Stem and Leaf Plot
1.5.4 Box Plot

Module - 2: Introduction to Computational Statistics

2.1 Introduction to Central Tendency
2.1.1 Central Tendency - Introduction
2.1.2 Central Tendency - Measures
2.2 Measures of Central Tendency
2.2.1 Mean/Average - Meaning and Characteristics
2.2.2 Arithmetic Mean - Intro and Application
2.2.3 Combined Mean - Intro and Application
2.2.4 Weighted Mean - Intro and Application
2.2.5 Median - Meaning and Characteristics
2.2.6 Median - Applications
2.2.7 Mode - Meaning and Characteristics
2.2.8 Mode - Applications
2.2.9 Relationship between Mean, Median and Mode
2.3 Industry examples on Central Tendency Measures
2.3.1 Relevant Industry Example
2.3.2 Case Study
2.4 Introduction to Dispersion
2.4.1 Introduction_dispersion_1
2.5 Measures of Dispersion
2.5.1 Range_Measure_2
2.5.2 mean_deviation_3
2.5.3 SD_Variance_4
2.5.4 SD_Variance_Calculation_5
2.5.5 Combined_Mean
2.5.6 Comparison_measures_dispersion
2.5.7 Coefficient_Variation
2.5.3 Quartile_Deviation
2.5.4 Application_Quartile_Deviation
2.5.5 Data_distribution_Introduction_skewness
2.5.6 Measures_skewness
2.5.7 Calculation_Importance_skewness
2.5.8 Moments
2.5.9 Introduction_Kurtosis
2.5.10 Summary_Module2
2.6 Industry Example for Dispersion
2.6.1 Business_Problem_Introduction_1
2.6.2 Calculation_Business_Metrics_2
2.6.3 Insights_Graphs_3
2.6.4 Data_Interpretation_4
2.6.5 Conclusion_Module2_5

Module - 3: Skewness and Kurtosis

3.1 Skewness; Pearsonian and Bowley’s measure of Skewness
3.1.1 Introduction to Skewness
3.1.2 Application of Skewness
3.1.3 Pearsonian Measure of Skewness
3.1.4 Bowley’s Measure of Skewness
3.2 Kurtosis and Moments (Numerical examples and applications)
3.2.1 Introduction to Kurtosis
3.2.2 Application of Kurtosis
3.2.3 Introduction to Moments
3.2.4 Factorial Moments
3.2.5 Sheppard’s Correction for Moments
3.2.6 Skewness Using Moments
3.2.7 Kurtosis Using Moments

Module - 4: Correlation and Regression Analysis

4.1 Correlation – Meaning, types, Limitations of Correlation
4.1.1 Introduction to Correlation
4.1.2 Types of Correlation
4.1.3 Limitations of Correlation
4.1.4 Diagrammatic Representation for Types of Correlation
4.2 Correlation Coefficient; Meaning, Properties, Measurement of Coefficient of Correlation – Rank Correlation Co-efficient
4.2.1 Introduction to Correlation Coefficient
4.2.2 Meaning of Correlation Coefficient
4.2.3 Properties of Correlation Coefficient
4.2.4 Measurement of Correlation Coefficient
4.2.5 Estimation of Rank Correlation Coefficient
4.2.6 Application of Rank Correlation Coefficient
4.2.7 Kendall’s Measure of Correlation
4.3 Karl Pearson’s correlation coefficient in bivariate distribution; Estimation and interpretations
4.3.1 Meaning of Karl Pearson’s Correlation Coefficient
4.3.2 Karl Pearson’s Correlation Coefficient in Bivariate Distribution
4.3.3 Estimation of Karl Pearson’s Correlation Coefficient
4.3.4 Interpretation of Karl Pearson’s Correlation Coefficient
4.3.5 Intra-Class Correlation
4.3.6 Correlation Ratio
4.4 Regression – Meaning, types, properties and assumptions of Regression
4.4.1 Introduction to Regression Analysis
4.4.2 Meaning of Regression
4.4.3 Properties of Regression Analysis
4.4.4 Assumptions of Regression
4.5 Two variable linear regression; Regression lines and regression Co-Efficient
4.5.1 Linear Regression
4.5.2 Two Variable Linear Regression
4.5.3 Regression Lines
4.5.4 Regression Coefficients
4.5.5 Multiple Regression for Trivariate data

Module - 5: Association of Attributes

5.1 Association of attributes, Independence
5.1.1 Association of attributes
5.1.2 Independence
5.2 Measure of association for 2x2 table
5.2.1 Measure of association for 2x2 table
5.3 Chi-square, Karl Pearson’s and Tschuprow’s coefficient of association
5.3.1 Chi-square coefficient of association
5.3.2 Karl Pearson’s coefficient of association
5.3.3 Tschuprow’s coefficient of association
5.4 Contingency tables with ordered categories
5.4.1 Contingency tables with ordered categories

Module - 1: Introduction to Computational Statistics

Structure:

1.1 Concept of statistical population
1.1.1 Concept of Statistical Population
1.2 Attributes and variables, different types of scales
1.2.1 Discrete and Continuous Variables
1.2.2 Nominal, Ordinal, Ratio and Interval
1.3 Primary data and Secondary data
1.3.1 Designing a Questionnaire and Schedule
1.3.2 Collection of Primary Data, Checking their Consistency
1.3.3 Scrutiny of Data for Internal Consistency
1.3.4 Detection of Errors of Recording
1.4 Presentation of data
1.4.1 Classification, Tabulation
1.4.2 Diagrammatic Representation of Grouped Data
1.4.3 Graphical Representation of Grouped Data
1.5 Frequency distributions
1.5.1 Cumulative Frequency Distributions and their Graphical Representations
1.5.2 Frequency Polygon and Ogives
1.5.3 Stem and Leaf Plot
1.5.4 Box Plot

Amity Directorate of Distance & Online Education


Unit - 1.1: Concept of Statistical Population

Objectives:

At the end of this unit, you will be able to:

●● Understand the concept of statistical population
●● Learn how to select samples from a given population

Introduction

Computational statistics, also known as statistical computing, can be thought of as the
meeting point of statistics and computer science. Computational statistics is a branch of
computational science that focuses on statistics as a mathematical science. The goal of
computational statistics is much like that of traditional statistics: to turn raw data into
knowledge and gain valuable insights from it. The main distinction between computational
statistics and traditional statistical techniques is that computational statistics focuses
on using computer-intensive statistical methods, particularly when the sample size is
extremely large and the datasets are non-homogeneous.

Although the terms “computational statistics” and “statistical computing” are
frequently used interchangeably, Carlo Lauro, a former president of the International
Association for Statistical Computing, believes there is a distinction between the
two. Lauro defined ‘statistical computing’ as “the application of computer science to
statistics,” and ‘computational statistics’ as “the design of algorithms for implementing
statistical methods on computers, including those unimaginable before the computer
age, as well as to deal with analytically intractable problems.”

1.1.1. Concept of Statistical Population

Statistical methods are especially useful for studying, analysing, and learning about
populations of experimental units.

An experimental (or observational) unit is an object (e.g., person, thing, transaction,
or event) about which data is collected.

A population is the collection of all the units (usually people, objects, transactions, or
events) that we want to study.

Populations may include, for example, (1) all employed workers in India, (2)
all registered voters in Delhi, (3) everyone infected with the coronavirus, (4) all cars
produced last year by a specific assembly line, (5) the entire stock of spare parts
available at Maruti’s repair facility, (6) all sales made at a McDonald’s drive-in window
during a given year, or (7) the set of all accidents occurring on a specific street. The
first three examples (1–3) are sets (groups) of people, the next two (4–5) are sets of
objects, the following one (6) is a set of transactions, and the final one (7) is a set of
events. It is also worth noting that each set contains all of the units in the population.

When studying a population, we concentrate on one or more characteristics or
properties of the population’s units. Such characteristics are referred to as variables.

For example, we might be interested in the variables age, gender, and years of
education of people who are currently unemployed in India.

A variable is a characteristic or property of an individual experimental (or
observational) unit in a population. The term variable reflects the fact that any given
characteristic can differ between units in a population.

It is useful to be able to obtain a numerical representation for a variable when
studying it. However, because numerical representations are not always readily
available, measurement plays an important supporting role in statistical studies.
Measurement is the process of assigning numbers to variables of individual population
units. For example, we could assess a political leader’s performance by asking each
registered voter to rate it on a scale of 1 to 10. Similarly, we could simply ask
each worker, “How old are you?” to determine the age of the country’s workforce. In
other cases, measurement is accomplished through the use of instruments such as
stopwatches, scales, and callipers.

If the population we want to study is small enough, we can measure a variable
for each unit in the population. For example, if we are measuring the marks of all
incoming first-year students at a university, we should be able to obtain all the marks.
When we measure a variable for every unit of a population, the result is called a census
of the population. In most applications, however, the populations of interest are much
larger, possibly involving many thousands, if not an infinite number, of units. Examples
of large populations include all graduates of a university or college, all potential buyers
of a new phone, and all pieces of first-class mail handled by the Postal Service. A census
would be prohibitively time consuming or expensive for such populations. A reasonable
alternative is to choose and study a subset (or portion) of the population’s units. Such a
subset is called a sample.

After measuring the variables of interest for each unit in the sample (or population),
the data is analysed using either descriptive or inferential statistical methods. For
example, a pollster who has surveyed a sample of 1,500 voters may be interested only in
describing the voting patterns of that sample. More likely, the pollster will want to use
the information in the sample to draw conclusions about the entire population of voters.

A statistical inference is an estimate, prediction, or other generalisation about a
population based on data from a sample.

In other words, we use the information in the smaller sample to learn about the
larger population. Thus, based on the sample of 1,500 voters, the pollster can estimate
the percentage of all voters who would vote for each presidential candidate if the
election were held on the same day as the poll, or use the results to predict the
outcome on election day.
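
The step from a sample to a statement about the population can be sketched in a few lines of Python. This is only an illustrative simulation: the population of 100,000 voters, the 52% level of support, and the names `population`, `sample`, and `sample_share` are assumptions made for the example, not data from the text.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 voters, 52,000 of whom support candidate A (coded 1).
POPULATION_SIZE = 100_000
SUPPORTERS = 52_000
population = [1] * SUPPORTERS + [0] * (POPULATION_SIZE - SUPPORTERS)

# A census measures every unit of the population.
population_share = sum(population) / len(population)

# A simple random sample of 1,500 voters, drawn without replacement.
sample = random.sample(population, 1500)
sample_share = sum(sample) / len(sample)

print(f"Population share (census):  {population_share:.3f}")
print(f"Sample share (n = 1,500):   {sample_share:.3f}")
```

The sample share will generally differ a little from the population share. Statistical inference is concerned with quantifying how large that difference is likely to be.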

Note that the terms population and sample are frequently used to refer both to the
sets of measurements and to the units on which the measurements are taken. When
measuring a single variable of interest, this usage causes little confusion. However,
to avoid ambiguity, we should try to refer to the measurements as population data sets
and sample data sets, respectively.

Example

The term “Cola wars” refers to the intense competition between Coca-Cola and
Pepsi in their marketing campaigns, which have included movie and television stars,
rock videos, athletic endorsements, and claims of consumer preference based on
taste tests. Assume that 1,000 cola consumers are given a blind taste test as
part of a Pepsi marketing campaign (i.e., a taste test in which the two brand names are
disguised). Each consumer is asked whether they prefer brand A or brand B.

◌◌ Describe the population.
◌◌ Describe the variable of interest.
◌◌ Describe the sample.
◌◌ Describe the inference.

Solution

◌◌ Because the interest is in the responses of cola consumers in a taste
test, the experimental unit is a cola consumer. As a result, the population of
interest is the collection or set of all cola consumers.
◌◌ The variable of interest is the consumer’s cola preference as revealed by a
blind taste test, which is the characteristic that Pepsi wants to measure.
◌◌ The sample consists of the 1,000 cola consumers chosen from the population of all
cola consumers.
◌◌ The inference of interest is the generalisation of the 1,000 sampled
consumers’ cola preferences to the population of all cola consumers. In
particular, the preferences of the consumers in the sample can be used to
estimate the percentages of all cola consumers who prefer each brand.

Check Your Understanding

A. Fill in the Blanks

1. Computational statistics is a branch of computational science that focuses on
statistics as a ____________ science.
A) statistical computing
B) computational
C) mathematical
D) None of the above

2. Computational statistics focuses on using ____________ statistical methods.
A) traditional
B) computer-intensive
C) computational
D) None of the above

3. Lauro defined ‘computational statistics’ as “the design of ____________ for
implementing statistical methods on computers, including those unimaginable before
the computer age, as well as to deal with ____________ intractable problems.”
A) statistical, applications
B) computer, science
C) statistical, analytically
D) algorithms, analytically

4. The term “____________” has a slightly different meaning in statistics than it does in
everyday speech.
A) population
B) statistics
C) computing
D) application

5. The relationship between the ____________ and the population must be such that
true ____________ about the population can be made from that sample.
A) population, size
B) numbers, numbers
C) sample, inferences
D) individuals, parameter

B. True or False

1. The standard error of the mean is calculated by multiplying the standard deviation by
the square root of the number of observations in the sample.
2. The World Bank is an international organisation whose mission is to control global
population by generating awareness in poor countries for projects that improve their
economies and raise their overall standard of living.
3. Latin letters are used to represent the population mean and standard deviation,
respectively.
4. The data from the statistical sample allows statisticians to form hypotheses about the
larger population.
5. The first important feature of a sample is that every individual in the population from
which it is drawn has a known non-zero chance of being included in it.

Summary

●● Computational statistics is a branch of computational science that focuses on
statistics as a mathematical science.
●● The main distinction between computational statistics and traditional statistical
techniques is that computational statistics focuses on using computer-intensive
statistical methods, particularly when the sample size is extremely large and the
datasets are non-homogeneous.
●● The term “population” has a slightly different meaning in statistics than it does in
everyday speech.
●● A well-chosen sample will contain the majority of the information about a specific
population parameter.
●● The first important feature of a sample is that every individual in the population
from which it is drawn has a known non-zero chance of being included in it; a
natural assumption is that these chances are equal.
●● The term “random” refers to the method by which the sample is chosen rather than
the sample itself.
●● Before taking a sample, the investigator must first define the population from which
it will be drawn.
●● Political polling is a good example of the difficulty in selecting a random sample of
the population.
●● A sample is a random selection of members from a population. It is a subset of the
population that shares the characteristics of the entire population.
●● The data from the statistical sample allows statisticians to form hypotheses about
the larger population.
●● In statistical equations, the population size is usually represented by an uppercase N,
while the sample size is represented by a lowercase n.
●● A parameter is a numerical summary that describes an entire population. When taken
from populations, statistics such as averages and standard deviations are referred to as
population parameters.
●● The sample standard deviation is used to estimate the variation in the population
from the variation in the sample.
●● The standard error of the mean is calculated by dividing the standard deviation by
the square root of the number of observations in the sample.
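
The last point can be checked with a short calculation. The sketch below assumes a hypothetical sample of ten quiz scores (the values are invented for illustration) and uses Python's standard `statistics` module. `stdev` is the sample standard deviation with the n - 1 divisor.

```python
import math
import statistics

# Hypothetical sample of ten quiz scores (illustrative values only).
scores = [12, 15, 11, 18, 14, 16, 13, 17, 15, 14]

n = len(scores)
sample_mean = statistics.mean(scores)
sample_sd = statistics.stdev(scores)        # sample standard deviation (n - 1 divisor)
standard_error = sample_sd / math.sqrt(n)   # standard error of the mean = s / sqrt(n)

print(f"n = {n}, mean = {sample_mean}, sd = {sample_sd:.3f}, SE = {standard_error:.3f}")
```

Because the standard deviation is divided by the square root of n, the standard error shrinks as the sample grows. Quadrupling the sample size halves it.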

Activity
1. This hands-on activity helps students understand the difference between a population
and a sample. Before the activity, have students take a short quiz.
2. Step 1: Return the graded quizzes and write the population mean on the board.
Next, inform students that you will randomly choose two samples of five individuals
each from the class. Assign a number from 1 to N to each student (e.g., if there
are 30 students in the class, assign the numbers 1-30). Begin anywhere in a table
of random numbers and work your way through it two digits at a time (because N is a
two-digit number). Record the first ten non-repeating numbers that correspond to a
number assigned to a student in the class, skipping numbers that fall outside the
range 1 to N. As an example:

Random numbers
28 26 44 30 01 25 85 72 05 93 57 09
68 27 98 60 87 51 80 12 88 26 97

Sample 1                 Sample 2
28, 26, 30, 01, 25       05, 09, 27, 12, 26

Step 2: Collect the quiz papers from the two samples and post the results on the
board. Allow the students to compute the sample means for these data.

Step 3: Conclude this activity with a discussion of population and sample concepts.
Discuss the distinction between a population of scores and a sample of scores, as well
as the relationship between a population and a parameter and the relationship between
a random sample and a representative sample.
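
The selection rule in Step 1 can be sketched as a small Python function. The function name `draw_labels` and its parameters are hypothetical, introduced only for this illustration. Note that 26 occurs twice in the random numbers listed above. Under a strict no-repeat rule the second occurrence is skipped, so this short excerpt of the table supplies only nine distinct labels, and a longer table would be needed for a tenth.

```python
def draw_labels(two_digit_numbers, population_size, how_many):
    """Return up to `how_many` distinct labels in 1..population_size, in table order."""
    chosen = []
    for value in two_digit_numbers:
        # Skip numbers outside 1..N and numbers that were already recorded.
        if 1 <= value <= population_size and value not in chosen:
            chosen.append(value)
        if len(chosen) == how_many:
            break
    return chosen

# The two rows of random numbers from the activity, read left to right.
table = [28, 26, 44, 30, 1, 25, 85, 72, 5, 93, 57, 9,
         68, 27, 98, 60, 87, 51, 80, 12, 88, 26, 97]

labels = draw_labels(table, population_size=30, how_many=10)
sample_1, sample_2 = labels[:5], labels[5:]
print("Sample 1:", sample_1)
print("Sample 2:", sample_2)
```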

Glossary
●● Computational Statistics: It is a branch of computational science that focuses on
statistics as a mathematical science.
●● Sample: A sample is a random selection of members from a population.
●● Random: The term “random” refers to the method by which the sample is chosen
rather than the sample itself.
●● Parameter: A parameter is a numerical summary that describes an entire population.
●● Population Parameter: When taken from populations, statistics such as averages
and standard deviations are referred to as population parameters.
●● Standard Deviation: The sample standard deviation is used to estimate the variation
in the population from the variation in the sample.


Questions and Exercises


1. What is the difference between population and sample?
2. Give real world examples of population and samples.

Further Readings
1. James E. Gentle, Wolfgang Karl Härdle and Yuichi Mori (2012) “Handbook of
Computational Statistics: Concepts and Methods”, Second Edition, Springer.
2. James E. Gentle (2002) “Elements of Computational Statistics”, Springer.
3. Geof H. Givens and Jennifer A. Hoeting (2013) “Computational Statistics”,
Second Edition, John Wiley & Sons, Inc.
4. Wolfgang Karl Härdle, Ostap Okhrin and Yarema Okhrin (2017) “Basic
Elements of Computational Statistics”, First Edition, Springer.

Check Your Understanding - Answers

A. Fill in the Blanks
1. C) mathematical
2. B) computer-intensive
3. D) algorithms, analytically
4. A) population
5. C) sample, inferences

B. True/False
F, F, F, T, T

Unit - 1.2: Attributes and Variables, Different Types of Scales

Objectives:
At the end of this unit, you will be able to:

●● Learn what discrete and continuous variables are
●● Understand nominal, ordinal, ratio and interval scales
●● Know the types of measurement scales

Introduction
Statistics are applied to a data set. The data set can be visualised as a two-dimensional
matrix, similar to a blank spreadsheet in software packages such as Excel. The rows of
the data matrix represent the observations. Observations in neuroscience are usually of
organisms (humans, rats, mice), but they can also be of other phenomena such as cell
cultures. The columns of the data matrix contain attributes measured on the
observations, such as gender, parietal lobe activity in a PET (positron emission
tomography) scan, and number of bar presses. This unit describes measurement scales
as well as the various mathematical classes for attributes.

ve
1.2.1 Discrete and Continuous Variables

Continuous Variable

Between any two points on the measurement scale, a continuous variable has an
infinite number of possible values. For example, mouse weight has an infinite number of
possible values between 25 and 26 g because extra decimal places can always be
added to the measurement.

Other examples:

◌◌ Weight or height of students in a class
◌◌ Time taken by an employee to reach the office
◌◌ Distance travelled each day

Formal Definition
A random variable X is said to be continuous if and only if the probability of its
realisation falling within the interval [a, b] can be expressed as an integral:

$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$

where the integrand function $f_X$ is called the probability density function of X.

In the definition of a continuous variable, the integral is the area under the
probability density function in the interval between a and b.
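
To make the "area under the density" reading concrete, the sketch below approximates that area numerically. It assumes, purely for illustration, that X follows an exponential distribution with rate 1. The helper names `exponential_pdf` and `integrate` are hypothetical, and the trapezoidal rule is just one simple way to approximate the integral.

```python
import math

def exponential_pdf(x, rate=1.0):
    """Density of an exponential distribution: rate * exp(-rate * x) for x >= 0."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width

a, b = 0.5, 2.0
area = integrate(exponential_pdf, a, b)   # numerical area under the density on [a, b]
exact = math.exp(-a) - math.exp(-b)       # closed form for the rate-1 exponential

print(f"P({a} <= X <= {b}) is approximately {area:.6f} (closed form {exact:.6f})")
print(f"Area over an interval of zero width: {integrate(exponential_pdf, 1.0, 1.0)}")
```

The second line of output illustrates why a continuous variable assigns zero probability to any single value: an interval of zero width has zero area.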

Discrete Variable
A discrete variable, on the other hand, has a finite (or, more generally, countable)
number of possible values. All categorical variables are discrete by definition, but so
are many variables measured on ratio scales. A count, such as the number of pups in a
rat litter or the number of correct responses on a memory task, is an important type of
discrete variable measured on a ratio scale. Counts are always non-negative integers.

Other examples:

◌◌ Number of students in a class
◌◌ Number of marbles in a jar

A discrete random variable X takes its values from a definite, countable set of
possibilities.

Formal Definition
A random variable X is discrete if its support $R_X$ is countable and there exists a
function $p_X$, called the probability mass function of X, such that

$$p_X(x) = \begin{cases} P(X = x) & \text{if } x \in R_X \\ 0 & \text{if } x \notin R_X \end{cases}$$

where P(X = x) is the probability that X will take the value x.

Example: Let X represent the sum of the numbers shown when two fair dice are rolled.
Each of the 36 ordered outcomes is equally likely, so the probability distribution of X
is as follows:

x          2     3     4     5     6     7     8     9     10    11    12
P(X = x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

[Figure: bar chart of the probability mass function of X, the sum of two dice]
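
The distribution in the two-dice example can be obtained by enumerating every outcome. The following is a minimal sketch that assumes both dice are fair and six-sided. The variable names are chosen for illustration only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely ordered outcomes of two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))
counts = Counter(first + second for first, second in outcomes)

# Probability mass function: p(x) = (number of outcomes with sum x) / 36.
pmf = {total: Fraction(count, len(outcomes)) for total, count in sorted(counts.items())}

for total, prob in pmf.items():
    # Fraction reduces automatically, so 2/36 is shown as 1/18, and so on.
    print(f"P(X = {total:2d}) = {prob}")

# The probabilities of a probability mass function must sum to one.
assert sum(pmf.values()) == 1
```

Using `Fraction` keeps the probabilities exact, which makes the final check on the total an equality rather than an approximate comparison.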

Discrete Variable                                        Continuous Variable
Countable support                                        Uncountable support
Probability mass function                                Probability density function
Probabilities assigned to single values                  Probabilities assigned to intervals of values
Each possible value has strictly positive probability    Each possible value has zero probability

1.2.2 Nominal, Ordinal, Ratio and Interval
A variable is a statistician’s term for an attribute that differs between observations.
A scale is the type of unit used to measure a variable. Statisticians traditionally refer to
four types of measurement scales: (1) nominal, (2) ordinal, (3) ratio, and (4) interval.

Nominal
The term nominal is derived from the Latin word nomen, which means “name.”
Nominal scales merely name differences and are most commonly used for qualitative
variables with discrete groups of observations. The most important feature of a nominal
scale is that there is no inherent quantitative difference between the categories. In the
behavioural sciences, three classic nominal variables are sex, religion, and race. In
biology, taxonomic categories (rodent, primate, canine) are nominal scales. Variables
on a nominal scale are frequently referred to as categorical variables.


Ordinal
Ordinal scales rank-order observations. Examples include class rank and horse
race results. An ordinal scale has two distinguishing features. First, the observations
differ in terms of an underlying quantitative measure. This underlying quantitative
attribute could be composite grade point average for class rank, and time to the finish
line for horse race results. Second, individual differences on the underlying quantitative
measure are either unavailable or ignored. As a result, ranking the horses in a race as
first, second, third, and so on conceals whether the first-place horse won by several
lengths or by a nose.

Ordinal scales may be preferred over a quantitative index of the underlying scale
in a few cases. College admissions officers, for example, may prefer class rank because it
sidesteps the different GPA calculation criteria used by school districts. In general,
however, measuring the underlying quantitative dimension is preferred over rank-ordering
observations because the resulting scale has greater statistical power than
the ordinal scale.
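
What is lost when a quantitative measure is reduced to ranks can be seen in a small sketch. The horse names and finishing times below are invented for illustration.

```python
# Hypothetical finishing times in seconds for four horses (illustrative values only).
finish_times = {"Horse A": 121.4, "Horse B": 121.5, "Horse C": 127.9, "Horse D": 128.0}

# Ordinal scale: keep only the order of finish.
ordered = sorted(finish_times, key=finish_times.get)
ranks = {horse: position for position, horse in enumerate(ordered, start=1)}

# Underlying quantitative scale: the gaps between consecutive finishers.
gaps = [round(finish_times[later] - finish_times[earlier], 1)
        for earlier, later in zip(ordered, ordered[1:])]

print(ranks)
print(gaps)
```

The ranks 1, 2, 3, 4 are evenly spaced, while the gaps in finishing time are not. The ranks alone cannot distinguish a win by a nose from a win by several lengths.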

Ratio
A ratio scale not only has equal intervals, but it also has a true “0” point. As a result,
values on a ratio scale can be multiplied, divided, added, and subtracted. Time units
(msec, hours), distance and length units (cm, kilometres), weight units (mg, kilos), and
volume units (cc) are all ratio scales. Scales that involve the division of two ratio scales
are also ratio scales. As a result, rates (miles per hour) and concentrations (mg/dL) are
ratio scales. It is important to note that, while a ratio scale has a true 0 point, it is possible
that the nature of the variable will prevent a value of “0” from ever being observed.
Human height is measured on a ratio scale, but everyone is taller than “0”. Because
ratio scales have a multiplicative property, it is possible to state that 60 mg of fluoxetine
is three times as large a dose as 20 mg.

Interval
The interval between adjacent values on ordinal scales is not constant. For
example, the difference in finishing time between the first and second place horses
does not have to be the same as the difference between the second and third place
horses. An interval scale has a fixed interval but no true “0” point. As a result, on an
interval scale, one can add and subtract values but not multiply or divide them.

The classic example of an interval scale is temperature as used in day-to-day
weather reports. The assignment of the number “0” to a specific height in a column
of mercury is an arbitrary convenience, as is obvious to anyone who understands the
difference between the Celsius and Fahrenheit scales. As a result, it is not possible
to say that 30°C is twice as warm as 15°C because that statement involves implied
multiplication. To persuade yourself, convert these two to Fahrenheit and ask whether
86°F is twice as hot as 59°F.

Temperature, on the other hand, has constant intervals between numbers, allowing
one to add and subtract. The temperature difference between 28°C and 21°C is
7°C, as is the temperature difference between 53°C and 46°C. Convert these
to Fahrenheit and see whether the difference between 82.4°F and 69.8°F is the same in
Fahrenheit units as the difference between 127.4°F and 114.8°F.
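
These two checks can be carried out directly. The sketch below uses the standard conversion F = C × 9/5 + 32. The function name `c_to_f` is chosen for this illustration.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Ratios are not preserved, because the zero point of each scale is arbitrary.
ratio_celsius = 30 / 15
ratio_fahrenheit = c_to_f(30) / c_to_f(15)   # 86 °F divided by 59 °F
print(ratio_celsius, round(ratio_fahrenheit, 3))

# Differences are preserved, up to the constant unit factor 9/5.
diff_1 = c_to_f(28) - c_to_f(21)
diff_2 = c_to_f(53) - c_to_f(46)
print(round(diff_1, 1), round(diff_2, 1))
```

The ratio changes when the scale changes, so "twice as warm" has no scale-independent meaning. The two 7°C differences, by contrast, map to the same Fahrenheit difference.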



Check Your Understanding

A. Fill in the Blanks

1. Between any two points on the measurement scale, a ____________ variable has an
____________ number of possible values.
A) non-continuous, finite
B) continuous, infinite
C) non-continuous, infinite
D) continuous, finite

2. In the definition of a continuous variable, the ____________ is the area under the
probability density function in the interval between a and b.
A) integral
B) integer
C) probability
D) function

3. A discrete variable, on the other hand, has a ____________ number of possible
values.
A) infinite
B) continuous
C) finite
D) discrete

4. A ____________ random variable X takes its values from a definite, countable set of
possibilities.
A) finite
B) integral
C) infinite
D) discrete

5. The term nominal is derived from the Latin word ____________.
A) nomenclature
B) name
C) nomen
D) None of the above

B. True or False
1. A scale is the type of unit used to compare a variable.
2. Statisticians traditionally refer to three types of measurement scales.
3. Nominal scales merely name differences and are most commonly used for
quantitative variables with discrete groups of observations.
4. A ratio scale not only has equal intervals, but it also has a true “0” point.
5. The interval between adjacent values on ordinal scales is constant.

Summary
●● Statistics are applied to a data set. The observations are represented by the rows
of the data matrix.
●● The columns of the data matrix contain attributes measured on the observations.
●● Between any two points on the measurement scale, a continuous variable has an
infinite number of possible values.
●● In the definition of a continuous variable, the integral is the area under the
probability density function in the interval between a and b.
●● All categorical variables are discrete by definition, but so are many variables
measured on ratio scales.
●● A discrete random variable X takes its values from a definite, countable set of
possibilities.
●● A scale is the type of unit used to measure a variable.
●● Statisticians traditionally refer to four types of measurement scales: (1) nominal,
(2) ordinal, (3) ratio, and (4) interval.
●● Nominal scales merely name differences and are most commonly used for
qualitative variables with discrete groups of observations.
●● The most important feature of a nominal scale is that there is no inherent
quantitative difference between the categories.
●● Ordinal scales rank-order observations.
●● Ordinal scales may be preferred over a quantitative index of the underlying scale
in a few cases.
●● A ratio scale not only has equal intervals, but it also has a true “0” point. As a
result, values on a ratio scale can be multiplied, divided, added, and subtracted.
●● Time units (msec, hours), distance and length units (cm, kilometres), weight units
(mg, kilos), and volume units (cc) are all ratio scales.
●● Scales that involve the division of two ratio scales are also ratio scales. As a result,
rates (miles per hour) and concentrations (mg/dL) are ratio scales.
●● While a ratio scale has a true 0 point, it is possible that the nature of the variable
will prevent a value of “0” from ever being observed.
●● The interval between adjacent values on ordinal scales is not constant.
●● The classic example of an interval scale is temperature as used in day-to-day
weather reports.
●● Temperature has constant intervals between numbers, allowing one to add and
subtract.
m

Activity
1. This activity helps students understand the various levels of measurement.
2. Request that students bring a magazine or newspaper to class. Explain to them that
they should read their reading material, paying special attention to examples of the
various levels of measurement presented in the magazine or newspaper. Assign
students the task of identifying and briefly discussing at least one example of a
nominal, ordinal, interval, or ratio scale. Finish with a discussion of student examples.

Glossary
●● Continuous Variable: A random variable X is said to be continuous if and only if the
probability of its realisation falling within the interval [a, b] can be expressed as an
integral of a probability density function over that interval.
●● Discrete Variable: A discrete variable has a finite or countable number of possible
values.
●● Scale: A scale is the type of unit used to measure a variable.
●● Nominal Scale: Nominal scales merely name differences and are most commonly
used for qualitative variables with discrete groups of observations.
●● Ordinal Scale: Ordinal scales rank-order observations.
●● Ratio Scale: A ratio scale not only has equal intervals, but it also has a true “0”
point.
●● Interval Scale: An interval scale has a fixed interval between adjacent values but no
true “0” point. On an ordinal scale, by contrast, the interval between adjacent values
is not constant.

Questions and Exercises
●● What is the difference between a discrete and a continuous variable?
●● Write short notes on the types of measurement scales.
Further Readings
1. Wolfgang Karl Hardle, Ostap Okhrin and Yarema Okhrin (2017) “Basic
Elements of Computational Statistics”, First Edition, Springer.
2. Geof H. Givens and Jennifer A. Hoeting (2013) “Computational Statistics”,
Second Edition, John Wiley & Sons, Inc.
3. James E. Gentle (2002) “Elements of Computational Statistics”, Springer.
4. James E. Gentle, Wolfgang Karl Hardle and Yuichi Mori (2012) “Handbook of
Computational Statistics: Concepts and Methods”, Second Edition, Springer.

Check Your Understanding - Answers

A. Fill in the Blanks
1. B) continuous, infinite
2. A) integral
3. C) finite
4. D) discrete
5. C) nomen
B. True/False
F, F, F, T, F

Unit - 1.3: Primary Data and Secondary Data


Objectives:

At the end of this unit, you will be able to:

●● Understand the difference between primary and secondary data
●● Learn how to conduct primary data collection
●● Learn the characteristics of a good questionnaire
●● Know how to quantify hypotheses using a standardised questionnaire
●● Understand the concept of data consistency

Introduction

Statistics is the study of facts and figures that can be numerically measured. A
numerical measure of a characteristic is known as an observation, and a collection of
observations is known as data. Individual research workers or organisations collect
data through sample surveys or experiments while keeping the study’s objectives in
mind. The information gathered could include:

◌◌ Primary Data
◌◌ Secondary Data
Primary data is collected first-hand by a researcher (an organisation, person, authority,
agency, or party) through experiments, surveys, questionnaires, focus groups,
interviews, and direct measurements, whereas secondary data has already been
collected by someone else and is available to the public through publications,
journals, and newspapers.

Primary Data
Primary data is raw data (data that has not been processed or tailored) that has
been collected directly from the source and has not undergone any statistical treatment
such as sorting and tabulation. Primary data is sometimes referred to as first-hand
information.

Primary Data Sources


Primary data are collected from primary units such as basic experimental units,
individuals, and households. Typically, the following methods are used to collect data
from primary units, and these methods vary depending on the nature of the primary
unit. (Secondary data, by contrast, is data that has already been published or collected
in the past.)

●● Personal Investigation
The researcher personally conducts the experiment or survey and collects the data
from it. The information gathered is generally accurate and reliable. This method of
collecting primary data is only feasible for small-scale laboratory or field experiments
and pilot surveys; it is impractical for large-scale experiments and surveys due to the
time required.

●● Using Investigators
To collect the necessary data, trained (experienced) investigators are used. In the
case of surveys, they contact individuals, ask for the necessary information, and fill
out the questionnaires themselves, where a questionnaire is an inquiry form with a
number of questions designed to elicit information from respondents. Most
organisations use this method of data collection because it provides reasonably
accurate information, but it is very expensive and may take a long time.
●● Using a Questionnaire
The required information (data) is obtained by mailing a questionnaire (printed or
soft form) to the selected individuals (respondents), who fill it out and return it to the
investigator. This method is less expensive than using investigators, but the
non-response rate is often high because many respondents do not bother to fill out
the questionnaire and send it back to the investigator.

●● Using Local Resources
Local representatives or agents are asked to send the necessary information, which
they provide based on their own experience. This method is quick, but it only provides
rough estimates.
●● By Telephone
Contacting the individuals over the phone is one way to obtain the information. It is
a quick and generally accurate way to get the information you need.
●● Using the Internet
With the advent of information technology, people may be contacted via the internet
and asked to provide relevant information. Google surveys are now widely used as an
online data collection method. There are also numerous paid online survey services.

Before applying a statistical treatment to the primary data, it is critical to go through
it and identify any inconsistent observations.

Secondary Data
Secondary data is data that has already been collected by someone else and may
have been sorted, tabulated, and statistically treated. It is processed or tailored data
rather than raw data.

Secondary Data Sources


Secondary data can be obtained from the following sources:

●● Government Organizations
Federal and Provincial Bureaus of Statistics, Crop Reporting Service-Agriculture
Department, Census and Registration Organizations, and so on.
●● Semi-Governmental Organizations
Municipal committees, District Councils, Commercial and Financial Institutions such
as banks, and so on.
●● Teaching and Research Organizations
●● Research journals and newspapers
●● Internet

1.3.1 Designing a Questionnaire and Schedule
A well-designed questionnaire is essential for the success of any survey.
Unfortunately, there is no theoretical foundation to guide the marketing researcher in
developing a flawless questionnaire. All the researcher has to guide him or her is a long
list of do’s and don’ts based on the experience of previous and current researchers. As
a result, designing questionnaires is more of an art than a science.

The characteristics of a good questionnaire


The design of a questionnaire will be determined by whether the researcher wishes
to collect exploratory data (i.e., qualitative data for the purposes of better understanding
or the generation of hypotheses on a subject) or quantitative data (to test specific
hypotheses that have previously been generated).

Exploratory questionnaires: If the data to be collected is qualitative or will not
be statistically analysed, a formal questionnaire may not be required. For example,
when interviewing the female head of a household to find out how decisions are made
within the family when purchasing breakfast foodstuffs, a formal questionnaire may
limit the discussion and prevent a full exploration of her views and decision processes.
Instead, create a brief guide with about ten major open-ended questions and
appropriate probes/prompts listed under each.

Formal standardised questionnaires: A formal standardised questionnaire is
designed if the researcher wants to test and quantify hypotheses and analyse the data
statistically. In general, such questionnaires are distinguished by:

◌◌ prescribed question wording and order, to ensure that each respondent
receives the same stimuli
◌◌ standardised definitions or explanations for each question, to ensure
interviewers handle questions consistently and can respond to respondents’
requests for clarification if they arise
◌◌ prescribed response format, so that the questionnaire can be completed quickly
during the interviewing process

Given the same task and hypotheses, six different people will almost certainly
create six different questionnaires that differ greatly in their question selection,
line of questioning, use of open-ended questions, and length. There are no
hard and fast rules for creating a questionnaire, but there are a few things to
keep in mind:

◌◌ A well-designed questionnaire should achieve the research goals. This may
appear obvious, but many research surveys omit important aspects due to
insufficient preparation and do not adequately probe specific issues due to
a lack of understanding. To some extent, some of this is unavoidable. Every
survey is bound to leave some questions unanswered and necessitate
additional research, but the goal of good questionnaire design is to minimise
these issues.
◌◌ It should collect the most complete and accurate data possible. The
questionnaire designer must ensure that respondents fully comprehend
the questions and are unlikely to refuse to answer, lie to the interviewer, or
conceal their attitudes. A good questionnaire is organised and worded in such
a way that respondents are encouraged to provide accurate, unbiased, and
complete information.
◌◌ A well-designed questionnaire should make it simple for respondents to
provide the necessary information and for the interviewer to record the
response, and it should be laid out in such a way that sound analysis and
interpretation are possible.
◌◌ It should keep the interview brief and to the point, and it should be structured
in such a way that the respondent(s) remain interested throughout the
interview.


Questionnaire Design Matrix

A good questionnaire’s qualities and characteristics

◌◌ Understand what needs to be measured – Having a clear picture and
understanding of what data needs to be collected helps to improve data
collection quality.
◌◌ Phrase questions in neutral, non-leading words – Whatever your point of view
is, it should never be reflected in the questions. Leading wording can creep in
both intentionally and unintentionally, so it must be checked for.
◌◌ Use the correct word or phrase – The language should be clear so that the
necessary data can be obtained. This also makes the survey questions and
requirements easier to understand, resulting in better responses.
◌◌ Define and qualify terms – This is especially important when conducting
a technical or field-specific survey. If you believe that some terms may be
unfamiliar to the audience being polled, they must be defined in order to
receive a proper response. This will improve the quality while decreasing the
bounce rate or the number of unanswered questions.
◌◌ Avoid using double negatives or more than one negative word in a sentence
– The use of a negative word has a psychological effect and can influence the
response.
◌◌ Provide sufficient alternatives – The available options should contain the most
likely answers.
◌◌ Avoid asking multiple questions in one item – Each question should call for
only one answer. If more than one question must be asked, each one should
be treated as a separate question to improve clarity.
◌◌ Emphasise words that need emphasis – This aids in making a point and
asking a question.
◌◌ Quantify vague alternatives such as good/bad/fair/average, using photographs
or other means – These are very ambiguous terms, and how they are
interpreted varies from person to person.
◌◌ Avoid unwanted assumptions – The goal of a survey is to collect factual data,
so assumptions should be avoided.

What Should You Avoid When Creating a Good Questionnaire?


Aside from the characteristics of a good questionnaire listed above, there are a
few pitfalls that should be avoided as much as possible.

◌◌ More than the required length – If there are too many questions, the chances
of getting them all answered are low. With a hectic schedule, no one wants to
answer too many questions, so keep it brief and to the point. A good
questionnaire is brief and precise enough to meet the requirements; try to
keep it under one page.
◌◌ Subjective questions – Subjective or open-ended questions should be
avoided unless it is absolutely necessary to obtain the opinion of the user or
targeted group. Because everyone has a different point of view, the responses
received may not provide useful information and will be challenging to
quantify and present. It has also been observed that people begin narrating
their experiences in response to open-ended questions, which adds little
value to the research being conducted and consumes a lot of space, time,
and effort.
◌◌ Questions with contradictory answers – Before finalising a questionnaire,
read it three to four times to ensure that no question contradicts or repeats
another.
◌◌ Bias or leading towards a conclusion – This is one of the most common errors
made when creating a questionnaire. Always keep in mind that the survey
must be unbiased and must not reflect the opinion of the company or of the
person preparing the questionnaire. All of the questions should be neutral,
and no question should influence the answer to another.

The Benefits of a Questionnaire
◌◌ The questionnaire is usually mailed to the respondents and contains specific,
unambiguous instructions, so the people in charge of data collection do not need
to go out of their way to provide additional explanations or instructions.
◌◌ It is possible to cover a large number of people spread across a large territory.
It is more cost-effective in terms of time, energy, and money.
◌◌ It is an impersonal method. The standardised wording of questions,
sequence, and instructions ensures consistency from one measurement
situation to the next.
◌◌ It ensures privacy. Respondents are more confident that they will not be
identified as holding a particular point of view.
◌◌ It reduces the pressure on respondents to respond immediately. Before putting
a response in writing, the respondent can carefully consider each point.

Questionnaire Disadvantages
◌◌ It can only be administered to respondents with an adequate level of literacy
and education.
◌◌ The response rate for a mailed questionnaire is low.
◌◌ Incorrect information may result from a misinterpretation of the question.
There is little opportunity to clarify a specific response.
◌◌ Success is determined by the respondent’s sense of responsibility.
◌◌ The investigator cannot observe or adjust the stimuli acting on the
respondent as the study design might require.

Considerations to make when creating a good questionnaire


◌◌ The physical appearance of the questionnaire influences people’s cooperation
or response. A visually appealing questionnaire is a plus. Size, paper quality,
paper colour, arrangement of items on the questionnaire, and the layout/format of
the questionnaire are all important factors in improving physical appearance.
◌◌ Consider who will record the responses. If a highly trained investigator is to
conduct the interviews and enter the responses, the form should be different
from the one designed for the informant to fill out on his own.
◌◌ Word choice is an important consideration. Respondents with limited
vocabularies are more likely to give vague answers; such a respondent may
simply select one of the alternative responses without having any idea what
the response means. Use simple words that have a single meaning. Loaded
words, catchphrases, and words with emotional overtones should be avoided.
Long questions should be avoided at all costs.
◌◌ The order and sequence of questions are critical. Many refusals and
misunderstandings can be avoided with proper sequencing. The initial
questions should be simple to answer. Questions that may cause
embarrassment to the informant should be placed in the middle or at the end
of the questionnaire.
◌◌ If the number of questions is small, their arrangement on the questionnaire will
not necessitate extensive planning.
◌◌ The purpose of the survey should guide the design of the questions. Problem
formulations serve as the starting point for developing the questions. It is
critical to be clear about the information that will be sought and the type of
questionnaire that will be used.
◌◌ It must be designed with an understanding of the likely process of analysis
in mind.

Schedule
The schedule is one of the most effective data collection tools in any type of
research or market study, and it generally yields a high degree of success and accuracy.

A series of questions is asked of the respondent with full assistance, i.e., if the
respondent is unable to understand a question, the interviewer or investigator will
help them understand it.

There is no option for alternate choices in the schedule; instead, the respondent
must answer each question from his or her own ability or knowledge, and the
interviewer writes the answer down.

The type of audience or respondents involved is determined solely by the research
topic.

Schedules are created to achieve objectivity and to facilitate detailed analysis with
high-quality data points or information. It is a method for gathering qualitative data.

The order of questions, the structure of the question set or its parts, and the language
all play a significant role in a schedule; the questions must be asked in the prescribed
order and cannot be jumbled.

It is an expensive method of data collection because interviewers or investigators
must be hired, trained, and dispatched to the respondents’ locations.

The schedule is a time-bound activity and is therefore generally completed on time.

Whether respondents are literate or not, the interviewer will assist them if they have
any problems.

The schedule has a very high success rate, and the data collected tends to be highly
accurate because it is collected personally by an interviewer.

Examples include population census records, voting polls, and so on.

Schedule vs. Questionnaire
The distinction between a questionnaire and a schedule is that a questionnaire is
a structured research tool consisting of a predefined list of questions, which may or
may not come with answer choices from which respondents can choose. A schedule is
a series of structured questions on a specific topic that are asked personally by the
investigator or interviewer.

Respondents to a questionnaire use their own knowledge and experience to answer the
questions. It is used to gather information on a specific subject from a group of people
who fall into the same category, such as age, gender, and so on.

In the case of a schedule, if the respondent has any difficulty understanding the
question, the investigator or interviewer can assist them further.

The Primary Differences Between a Questionnaire and a Schedule
Although both are sound data collection methods with a high degree of
authenticity, there are significant differences between a questionnaire and a schedule:
◌◌ The questionnaire consists of a series of questions with an option of
alternatives or choices as an answer to choose from, whereas schedules do
not provide choices.
◌◌ Respondents fill out the questionnaire on their own, whereas respondents to a
schedule are assisted by the interviewer or investigator.
◌◌ The presence of the investigator or interviewer is not required for the
questionnaire, but the presence of the interviewer is required for the schedule.
◌◌ The questionnaire is a low-cost or cost-effective data collection method,
whereas the schedule is costly.
◌◌ The questionnaire reaches a large audience of respondents, whereas the
schedule reaches a small one.
◌◌ The response rate for the questionnaire is very low because people may or
may not respond to the questionnaire mail, whereas the response rate for the
schedule is extremely high because the interviewer is directly involved.


Parameter | Questionnaire | Schedule
Definition | A questionnaire is a structured data collection method in which a list of pre-defined questions is designed, usually with their best possible answers or choices, from which respondents must choose based on their experience and scope of knowledge. Many times, options are not provided, and the respondent must answer on his or her own. | The schedule is a well-thought-out structured set of questions that are asked personally by the interviewer, or that the respondent is required to answer in writing in the presence of the interviewer or investigator.
Available Options | Alternative answer options can be made available for selection. | There are no alternate answer options to choose from; the respondent must either write the answer or respond to the interviewer.
Technique Type | Quantitative in nature | Qualitative in nature
Grouping | Grouping is done based on various criteria such as age, gender, location, and so on. | Grouping may or may not exist.
Coverage | A questionnaire can easily cover a large number of people. | Generally, a schedule is used when a small group or groups of people are involved.
Extension of Help | There is no assistance provided; the respondent must choose from the options provided, regardless of whether or not he or she understood the question. | The respondent is given complete assistance in understanding the question so that they can express their true feelings or provide the correct answer.
Accuracy | Lower accuracy | Quite high accuracy

1.3.2 Collection of Primary Data, Checking their Consistency


Remember the definition of statistics: “Statistics is a branch of science that deals
with the collection, classification, tabulation, analysis, and interpretation of data.”

In the above definition, the first of the five successive steps (i.e., collection,
classification, tabulation, analysis, and interpretation) used in any statistical investigation
is data collection, and the last step is data interpretation, which is ultimately dependent
on data collection. So, if data collection is not done carefully and sincerely, the
objective(s) of the statistical investigation will not be met and the final results of the
investigation will be unsatisfactory.

Data Consistency
When the frequencies of different classes are counted and any class frequency
obtained is negative, the data is said to be inconsistent. Inconsistency arises as a result
of incorrect counting, inaccurate addition or subtraction, or, in rare cases, a printing
error. To determine whether the data is consistent, all of the class frequencies are
computed, and if none of them are negative, the data is consistent. It should be noted
that just because the data is consistent does not mean that the counting or calculations
are correct. However, if the data is inconsistent, it indicates that there is an error or
misprint in the figures.

To test the consistency of the data, obtain the ultimate class frequencies. If any of
them is negative, the data is inconsistent. It can also be seen that no higher order
class can have a higher frequency than a lower order class. The data is also inconsistent
if any frequency of an attribute or combination of attributes is greater than the total
frequency N (frequency of zero order). Entering the class frequencies in the chart
provided in Section 13.6 of Unit 13 is an easy way to determine whether the ultimate
class frequencies are negative or not (i.e. checking the data for consistency). This will
provide a comprehensive picture of all the ultimate class frequencies.

It is also possible to specify conditions for data consistency.
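
As a minimal sketch of the check described above, assume only two attributes, A and B. The four ultimate class frequencies can then be derived from N, (A), (B) and (AB), and the data are flagged as inconsistent if any of them is negative. The function names and the sample figures are illustrative assumptions, not taken from the text.

```python
def ultimate_class_frequencies(n, a, b, ab):
    """Return the four ultimate class frequencies for two attributes A and B.

    n  : total frequency N (class of order zero)
    a  : frequency of A, written (A)
    b  : frequency of B, written (B)
    ab : frequency of A and B together, written (AB)
    """
    return {
        "AB": ab,                       # A present, B present
        "A,not-B": a - ab,              # A present, B absent
        "not-A,B": b - ab,              # A absent, B present
        "not-A,not-B": n - a - b + ab,  # A absent, B absent
    }


def is_consistent(n, a, b, ab):
    """Data are consistent only if no ultimate class frequency is negative."""
    return all(f >= 0 for f in ultimate_class_frequencies(n, a, b, ab).values())


# Hypothetical figures: N = 100, (A) = 60, (B) = 50
print(is_consistent(100, 60, 50, 30))  # every ultimate frequency is non-negative
print(is_consistent(100, 60, 50, 5))   # not-A,not-B = 100 - 60 - 50 + 5 = -5
```

With more than two attributes the same idea applies, but the number of ultimate classes doubles with each additional attribute.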

1.3.3 Scrutiny of Data for Internal Consistency
When collecting primary data, information provided by informants is recorded in the
form of a questionnaire, a schedule, or other means.

Before tabulating primary data, it is necessary to examine the questionnaires,
schedules, or other methods used to collect primary data in order to ensure
◌◌ data completeness,
◌◌ data consistency,
◌◌ data accuracy, and
◌◌ data homogeneity.

Data Completeness
We may come across some incomplete questionnaires or schedules.

Incomplete questionnaires or schedules should be completed by revisiting the
informants if possible; otherwise, they should be rejected and excluded from the
analysis.

Data Consistency
Sometimes the information provided by informants is implausible or
contradicts other information. For example, suppose an informant declares his age to
be 40 years while his son declares his own age to be 32 years.

This information is implausible.

If the informant’s stated age and date of birth do not match, the two pieces of
information contradict each other. Again, either these facts should be corrected by
revisiting the informants, or they should be rejected.
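
The two checks just described can be sketched in code. This is only an illustration: the record layout, the field names (`stated_age`, `birth_date`, `child_age`), the survey date, and the minimum parent-child age gap of 12 years are all assumptions made for the example.

```python
from datetime import date


def age_on(birth_date, survey_date):
    """Completed years between birth_date and survey_date."""
    years = survey_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred in the survey year.
    if (survey_date.month, survey_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def consistency_problems(record, survey_date, min_parent_gap=12):
    """Return a list of problems found in one informant's record."""
    problems = []
    if age_on(record["birth_date"], survey_date) != record["stated_age"]:
        problems.append("stated age does not match date of birth")
    child_age = record.get("child_age")
    if child_age is not None and record["stated_age"] - child_age < min_parent_gap:
        problems.append("parent-child age gap is implausibly small")
    return problems


survey_day = date(2023, 6, 1)  # hypothetical survey date
record = {"stated_age": 40, "birth_date": date(1983, 3, 15), "child_age": 32}
print(consistency_problems(record, survey_day))
```

Records that return a non-empty list would be candidates for revisiting the informant or for rejection.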

Data Accuracy
Maintaining 100 percent accuracy is a difficult task. We can only correct
specific figures by checking sums, differences, and so on. However, we have no
control over whether or not some informants provide incorrect information. For example,
some informants may provide incorrect information about their annual income.
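
A simple arithmetic check of this kind compares a reported total with the sum of its reported components. The sketch below is illustrative only; the function name and the household figures are hypothetical.

```python
def check_total(components, reported_total):
    """Compare the sum of the components with the total the informant reported.

    Returns a pair: (whether they agree, the computed sum).
    """
    computed = sum(components)
    return computed == reported_total, computed


# Hypothetical record: household members by age group, and the reported total.
ok, computed = check_total([2, 3, 1], reported_total=7)
print(ok, computed)
```

A mismatch shows that at least one figure is wrong, but not which one; that still has to be settled by going back to the source.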

Data homogeneity

Maintaining data homogeneity is also an important aspect of data editing/scrutiny.
Here, we must determine whether the units of measurement used by the informants
are the same or different. Some informants, for example, may give their heights in
centimetres, while others may give them in inches.

Some informants may enter their monthly income, while others may enter their
yearly income, and so on.
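
One way to restore homogeneity is to convert every response to a common unit before tabulation. The sketch below assumes each response records its own unit; the field names and the sample responses are hypothetical.

```python
CM_PER_INCH = 2.54      # exact by definition
MONTHS_PER_YEAR = 12


def height_in_cm(value, unit):
    """Convert a height given in centimetres or inches to centimetres."""
    if unit == "cm":
        return value
    if unit == "in":
        return value * CM_PER_INCH
    raise ValueError(f"unknown height unit: {unit!r}")


def yearly_income(value, period):
    """Convert an income reported per month or per year to a yearly figure."""
    if period == "yearly":
        return value
    if period == "monthly":
        return value * MONTHS_PER_YEAR
    raise ValueError(f"unknown income period: {period!r}")


# Hypothetical responses that mix units.
responses = [
    {"height": 170, "height_unit": "cm", "income": 3000, "income_period": "monthly"},
    {"height": 65, "height_unit": "in", "income": 40000, "income_period": "yearly"},
]

homogeneous = [
    {
        "height_cm": height_in_cm(r["height"], r["height_unit"]),
        "income_yearly": yearly_income(r["income"], r["income_period"]),
    }
    for r in responses
]
print(homogeneous)
```

Raising an error for an unrecognised unit is a deliberate choice here: a response whose unit is unknown should be queried rather than silently passed through.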

1.3.4 Detection of Errors of Recording

When we examine methods for measuring error and indicators for describing data
quality, five sources of error can be identified: sampling error, nonresponse error,
coverage error, measurement error, and processing error.

Sampling Error
The most well-known source of survey error is sampling error, which refers to the
variability that occurs by chance as a result of surveying a sample rather than an entire
population. All statistical agencies should prioritise the reporting of sampling error for
survey estimates. Data from any survey based on a probability sample can be used to
estimate the standard errors of survey estimates. Most estimates’ standard errors can
now be easily computed using software that takes into account the survey’s complex
sample design. The difficulty in calculating standard errors stems from the
multi-purpose nature of many federal surveys. Surveys generate a large number of complex
statistics, and the task of computing and reporting standard errors for all survey
estimates and differences between estimates is enormous.
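
For the simplest case, the standard error of a sample mean can be sketched as below. This assumes simple random sampling, which is a much simpler design than the complex sample designs mentioned above; the function name, the optional finite population correction argument, and the income figures are illustrative assumptions.

```python
import math
import statistics


def standard_error_of_mean(sample, population_size=None):
    """Estimate the standard error of the sample mean as s / sqrt(n).

    If population_size is given, apply the finite population correction
    sqrt(1 - n/N) for sampling without replacement.
    """
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # stdev uses the n - 1 divisor
    if population_size is not None:
        se *= math.sqrt(1 - n / population_size)
    return se


incomes = [42, 38, 51, 47, 45, 39, 50, 44]  # hypothetical incomes, in thousands
mean = statistics.mean(incomes)
se = standard_error_of_mean(incomes)
print(f"mean = {mean:.2f}, standard error = {se:.2f}")
```

Surveys with stratification, clustering, or unequal weights need design-based variance estimators instead, which is what the specialised software referred to above provides.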

Non-response Error
Nonresponse error is a well-known and visible source of non-sampling error. It is
a non-observational error indicating a failed attempt to obtain the desired information
from an eligible unit. Nonresponse reduces sample size, increases variance, and
introduces the possibility of bias in survey estimates. Nonresponse rates are frequently
reported and are frequently used as a proxy for survey quality. Nonresponse rates
can be calculated in a variety of ways for various purposes, and they are frequently
underestimated. Because of the complexities of survey design, calculating and
communicating response rates can be confusing and potentially problematic. While
reporting nonresponse rates is important, it provides no indication of nonresponse bias.
Special studies are required.
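
As noted above, response rates can be defined in several ways. The sketch below uses one simple, unweighted definition (completed interviews divided by eligible sampled units); both the definition chosen and the counts are assumptions for illustration.

```python
def response_rate(completed, eligible):
    """Unweighted unit response rate: completed interviews / eligible units."""
    if eligible <= 0:
        raise ValueError("eligible must be positive")
    if not 0 <= completed <= eligible:
        raise ValueError("completed must lie between 0 and eligible")
    return completed / eligible


# Hypothetical survey counts.
sampled = 1200
ineligible = 200      # e.g. vacant addresses found during fieldwork
completed = 650

eligible = sampled - ineligible
rr = response_rate(completed, eligible)
print(f"response rate = {rr:.1%}, nonresponse rate = {1 - rr:.1%}")
```

The rate alone says nothing about nonresponse bias, which depends on how respondents differ from nonrespondents.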

Coverage Error
Coverage error is the error caused by failing to include some population units
in the sample selection frame (under-coverage) and the error caused by failing to
identify units represented on the frame more than once (over-coverage). The sampling
frame itself is the source of coverage error. It is therefore critical to have knowledge
of the sampling frame’s quality and completeness for the target population. Methods
for measuring coverage error rely on information from outside the survey operations,
such as comparing survey estimates to independent sources or implementing a
case-by-case matching of two lists.

Amity Directorate of Distance & Online Education
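The case-by-case matching of two lists can be sketched in a few lines of Python. The example assumes each unit has a unique identifier and that an independent list of the target population is available; the function name and the identifiers are hypothetical.

```python
from collections import Counter


def coverage_check(frame_ids, population_ids):
    """Match a sampling frame against an independent population list.

    Returns (missing, duplicated):
      missing    - population units absent from the frame (under-coverage)
      duplicated - units listed on the frame more than once (over-coverage)
    """
    frame_counts = Counter(frame_ids)
    missing = sorted(set(population_ids) - set(frame_counts))
    duplicated = sorted(uid for uid, count in frame_counts.items() if count > 1)
    return missing, duplicated


# Hypothetical identifiers.
frame = ["a", "b", "b", "d"]
population = ["a", "b", "c", "d"]
print(coverage_check(frame, population))
```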

Measurement Error
Measurement error is defined as the difference between the observed value of
a variable and its true but unobserved value. In survey data collection, measurement
error is caused by four primary sources: the questionnaire, as the official presentation
or request for information; the data collection method, as the manner in which the
request for information is made; the interviewer, as the question deliverer; and the
respondent, as the recipient of the request for information. These sources constitute the
entirety of data collection, and each source has the potential to introduce error into the
measurement process. For example, measurement error can occur in survey responses
due to respondents misinterpreting the question’s meaning, failing to recall the
information accurately, or failing to construct the response correctly (e.g., by summing
the components of an amount incorrectly). Measurement errors are difficult to quantify
and usually necessitate specialised, costly studies. Approaches used to quantify
measurement error include re-interview programmes, record check studies, behaviour
coding, cognitive testing, and randomised experiments.

Processing Error
After the survey data is collected, processing error occurs during the processes
that convert reported data to published estimates and consistent machine-readable
information. Each processing step, from data collection to the publication of the final
survey results, can result in errors in the data or statistics published. These errors
range from simple transcribing or transmission errors to more complex errors caused
by a poorly specified edit or imputation model. They are rarely treated in the survey
research literature because they are not well-reported or well-documented. Data entry,
coding, editing, and imputation errors are examples of processing errors. Processing
errors include imputation errors because many agencies treat failed edits as missing
and impute values for them. Quality control samples are used to determine error rates;
however, in recent years, authors have advocated for continuous quality management
practices (Morganstein and Marker 1997; Linacre and Trewin 1989).

The above-mentioned classification of error sources in surveys provides a
framework for statistical data users to develop an understanding of the nature of the
data they analyse. An understanding of data limitations can help an analyst develop
methods to compensate for their data’s known shortcomings. Of course, the errors from
various sources are not all the same size or significance. Later chapters will go over
measurement techniques for determining the magnitude of error sources.

Check Your Understanding

A. Fill in the Blanks


1. The collection of numerical measures of the same characteristic is known
as ____________, and the collection of observations is known as ___________.
A) research, observation
B) observation, data
C) facts, measurements
D) data, observation
2. ____________ is collected first-hand by a researcher.
A) Secondary data
B) Primary data
C) Tertiary data
D) Quarterly data
3. The researcher conducts the experiment or survey and collects the ____________
from it.
A) information
B) source
C) data
D) measurement
4. Data that has already been collected by someone may have been sorted,
____________, and ____________ treated.
A) tabulated, statistically
B) discrete, non-tabulated
C) statistically, tabulated
D) tabulated, traditionally
5. A formal standardised ___________ is designed if the researcher wants to test and
quantify hypotheses and analyse the data statistically.
A) focus group
B) questionnaire
C) survey
D) interview

B. True or False

1. If the data to be collected is qualitative or will not be statistically analysed, a formal
questionnaire may not be required.
2. The design of a questionnaire will be determined by whether the researcher wishes
to collect exploratory data or quantitative data.
3. Subjective or open-ended questions should not be avoided unless it is absolutely
necessary to obtain the user’s or targeted group’s opinion.
4. The survey being conducted must be biased and must reflect the opinion of the
company or of the person preparing the questionnaire.
5. The physical appearance of the questionnaire does not influence people’s cooperation
or response.

Summary
●● Statistics is the study of facts and figures that can be numerically measured.
●● The collection of numerical measures of the same characteristic is known as data,
and the collection of observations is known as observation.
●● Primary data is collected first-hand by a researcher through experiments, surveys,
questionnaires, focus groups, conducting interviews, and taking (required)
measurements.
●● Secondary data is readily available (collected by someone else) and is available
to the public through publications, journals, and newspapers.
●● Primary data is raw data that has been collected directly from the source and has
not undergone any statistical treatment such as sorting and tabulation.
●● Primary data are collected from primary units.
●● Secondary data is data that has been published or collected in the past.
●● Before applying a statistical treatment to the primary data, it is critical to go
through it and identify any inconsistent observations.
●● A well-designed questionnaire is essential for the success of any survey.
●● The design of a questionnaire will be determined by whether the researcher
wishes to collect exploratory data or quantitative data.
●● A well-designed questionnaire should achieve the research goals.
●● A well-designed questionnaire should make it simple for respondents to provide
the necessary information and for the interviewer to record the response.
●● If there are too many questions, the chances of getting them answered are
extremely low.
●● Subjective or open-ended questions should be avoided unless it is absolutely
necessary to obtain the user’s or targeted group’s opinion.
●● The survey being conducted must be unbiased and must not reflect the opinion of
the company or of the person preparing the questionnaire.
●● The physical appearance of the questionnaire influences people’s cooperation or
response.
●● The purpose of the survey should guide the design of the questions.
●● The schedule is one of the most effective data collection tools in any type of
research or market study, ensuring maximum success and accuracy.
●● The type of audience or respondents involved is solely determined by the research
topic.
●● Schedules are created to achieve objectivity and to facilitate detailed analysis with
high-quality data points or information.
●● The schedule is a time-bound activity.
●● The schedule has a very high success rate, and the data collected will be of the
highest precision because it is collected personally by an interviewer.

●● A questionnaire is a structured research tool consisting of a predefined list of
questions, which may or may not include outcomes or answer choices from which
respondents can choose.
●● The schedule is a series of structured questions on a specific topic that are
personally asked by the investigator or interviewer.
●● Statistics is a branch of science that deals with the collection, classification,
tabulation, analysis, and interpretation of data.
●● The first of the five successive steps (i.e., collection, classification, tabulation,
analysis, and interpretation) used in any statistical investigation is data collection,
and the last step is data interpretation, which is ultimately dependent on data
collection.
●● When the frequencies of different classes are counted and any class frequency
obtained is negative, the data is said to be inconsistent.
●● To determine whether the data is consistent, all of the class frequencies are
computed, and if none of them are negative, the data is consistent.
●● The data is inconsistent if any frequency of an attribute or combination of attributes
is greater than the total frequency N.
●● Maintaining data homogeneity is also an important aspect of data editing/scrutiny.
●● While examining methods for measuring error sources and indicators for
describing data quality information, the error sources can be: sampling error,
nonresponse error, coverage error, measurement error and processing error.
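The consistency rules listed in the summary (no negative class frequency, and no frequency greater than N) can be sketched for the case of two attributes. The notation is an assumption borrowed from the usual theory of attributes: `n` is the total frequency N, `a` and `b` are the frequencies of attributes A and B, and `ab` is the frequency of units possessing both. The function names and figures are hypothetical.

```python
def ultimate_class_frequencies(n, a, b, ab):
    """Derive the four ultimate class frequencies for two attributes."""
    return {
        "A and B": ab,
        "A, not B": a - ab,
        "not A, B": b - ab,
        "not A, not B": n - a - b + ab,
    }


def is_consistent(n, a, b, ab):
    """Data is consistent if no class frequency is negative or exceeds N."""
    frequencies = ultimate_class_frequencies(n, a, b, ab)
    no_negative = all(f >= 0 for f in frequencies.values())
    none_above_total = max(a, b, ab) <= n
    return no_negative and none_above_total


# Hypothetical figures.
print(is_consistent(100, 60, 50, 30))  # every class frequency is non-negative
print(is_consistent(100, 40, 30, 45))  # (A, not B) would be -5
```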

Activity
1. This activity helps students understand the concept of collecting data.
2. Divide the class into five or six groups. Assign each group the task of identifying a
research problem and formulating a hypothesis for the selected research topic, along
with briefly discussing the steps of collecting data. Finish with a discussion of
examples from actual research, their hypotheses and the ways in which their data
was collected.

Glossary
●● Questionnaire: A questionnaire is a structured research tool consisting of a
predefined list of questions, which may or may not include outcomes or answer
choices from which respondents can choose.
●● Statistics: It is a branch of science that deals with the collection, classification,
tabulation, analysis, and interpretation of data.
●● Schedule: It is a series of structured questions on a specific topic that are
personally asked by the investigator or interviewer.
●● Primary Data: It is raw data that has been collected directly from the source and
has not undergone any statistical treatment such as sorting and tabulation.
●● Secondary Data: Data that has already been collected by someone else and may
have been sorted, tabulated, and statistically treated. It is made-up or customised data.

●● Sampling Error: It refers to the variability that occurs by chance as a result of
surveying a sample rather than an entire population.
●● Non-response Error: It is a non-observational error indicating a failed attempt to
obtain the desired information from an eligible unit.
●● Coverage Error: It is the error caused by failing to include some population units
in the sample selection frame, or by failing to identify units represented on the
frame more than once.
●● Measurement Error: It is defined as the difference between the observed value of
a variable and its true but unobserved value.
●● Processing Error: It occurs during the processes that convert reported data to
published estimates and consistent machine-readable information.

Questions and Exercises

1. Explain the relevance of secondary data in research. What are the sources from
which secondary data can be obtained?
2. What should be avoided while creating a good questionnaire?
3. Explain the difference between questionnaire and schedule.
4. Write short notes on the following:
a) Data completeness
b) Data consistency
c) Data accuracy
d) Data homogeneity
5. Explain the types of detection of errors of recording.

Further Readings

1. James E. Gentle (2002) “Elements of Computational Statistics”, Springer.
2. Wolfgang Karl Hardle, Ostap Okhrin and Yarema Okhrin (2017) “Basic
Elements of Computational Statistics”, First Edition, Springer.
3. James E. Gentle, Wolfgang Karl Hardle and Yuichi Mori (2012) “Handbook of
Computational Statistics: Concepts and Methods”, Second Edition, Springer.
4. Geof H. Givens and Jennifer A. Hoeting (2013) “Computational Statistics”,
Second Edition, John Wiley & Sons, Inc.

Check Your Understanding- Answers


A. Fill in the Blanks
1. D) data, observation
2. B) Primary data
3. C) data
4. A) tabulated, statistically
5. B) questionnaire

B. True/False
T, T, F, F, F


Unit - 1.4: Presentation of Data


Objectives:

In this unit, you will be able to:

●● Learn the meaning, purpose and types of tables
●● Understand the significance of diagrams and diagrammatic representation of
data
●● Know how to represent grouped data in graphical form

Introduction
Once data has been collected, it must be classified and organised in such a way
that it can be easily read and interpreted, i.e., converted to information. It is sometimes
helpful to present data as tables, charts, diagrams, or graphs before calculating
descriptive statistics. Most people consider ‘pictures’ to be far more useful than
‘numbers’ because they, in their opinion, present data more meaningfully.

1.4.1 Classification, Tabulation

Table
Meaning: A table is a systematic arrangement of related statistical data in columns
and rows with a specific goal or purpose in mind. Can you arrange the following data in
a tabular format?

“A college has 50 Chemistry, 50 Mathematics, and 50 Arts students. The number
of students from low-income families is the same for each course, with a total of 30.
Whereas the Chemistry and Mathematics courses are equally popular among wealthy
families, wealthy Arts students outnumber those in each of these courses by a factor of
two. In total, 40 students from wealthy families are enrolled in college. The majority of
students come from middle-class families, and there are 80 of them.”

Let us arrange this information in a table. There is a total of 150 students. A table
has a longer lasting impact on the human mind than statements that say the same
thing. A picture, as they say, is worth a thousand words.

Distribution of students according to course and economic status

                Chemistry    Arts    Mathematics    Total
Rich               10         20         10           40
Middle Class       30         20         30           80
Poor               10         10         10           30
Total              50         50         50          150
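The reasoning that turns the verbal statement into the table can be sketched in Python. This is only one way of setting out the arithmetic; the variable names are hypothetical, and the figures are those given in the statement.

```python
courses = ["Chemistry", "Arts", "Mathematics"]
students_per_course = 50
poor_total, rich_total, middle_total = 30, 40, 80

# Poor students are spread equally over the three courses.
poor = {course: poor_total // len(courses) for course in courses}

# Rich: Chemistry and Mathematics have x students each and Arts has 2x,
# so 4x = 40.
x = rich_total // 4
rich = {"Chemistry": x, "Arts": 2 * x, "Mathematics": x}

# Middle-class students make up the remainder of each course.
middle = {c: students_per_course - rich[c] - poor[c] for c in courses}

table = {"Rich": rich, "Middle Class": middle, "Poor": poor}
col_totals = [sum(table[status][c] for status in table) for c in courses]

# Print the table with row and column totals.
print(f"{'':<14}" + "".join(f"{c:>13}" for c in courses) + f"{'Total':>8}")
for status, counts in table.items():
    row = "".join(f"{counts[c]:>13}" for c in courses)
    print(f"{status:<14}{row}{sum(counts.values()):>8}")
print(f"{'Total':<14}" + "".join(f"{t:>13}" for t in col_totals)
      + f"{sum(col_totals):>8}")
```

The derived middle-class counts add up to the stated total of 80, which serves as a check on the table.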

Purpose: A table’s purpose is to simplify the presentation of related data and to
facilitate comparisons. The reader can easily find the information he or she is looking
for. The purpose of the table below, for example, could be to show the imports and
exports of country ‘Z’ in comparison to other countries V, W, X, and Y.

Imports and Exports of Country ‘Z’ during 2002-05 (Rs crores)

S. No.    Country    Imports    Exports
1         V          70         73
2         W          72         80
3         X          74         85
4         Y          85         80

We can easily identify the country with the highest exports from the table on exports
and imports. Rows of data are read from left to right. Row 1 shows, for example, that
country Z imports 70 from country V and exports 73 to it. Columns of data are read
from top to bottom. The Imports column shows, for example, that country Z imports
70, 72, 74, and 85 (in Rs crores) from countries V, W, X, and Y respectively.
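Reading the table by row and by column can be mirrored in a short Python sketch. The dictionary layout and variable names are assumptions made for illustration; the figures are those of the imports and exports table for country Z.

```python
# Trade of country Z with each partner: (imports, exports).
trade = {"V": (70, 73), "W": (72, 80), "X": (74, 85), "Y": (85, 80)}

# Reading a row: one partner's imports and exports.
imports_v, exports_v = trade["V"]

# Reading a column: all imports, from top to bottom.
imports_column = [imports for imports, _ in trade.values()]

# Comparison made easy by the table: partner with the highest exports.
top_export_partner = max(trade, key=lambda country: trade[country][1])

print(imports_v, exports_v)
print(imports_column)
print(top_export_partner)
```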

Types of Tables: There are two kinds of tables.
◌◌ Reference or general-purpose tables: These tables are essentially a
database with the goal of presenting detailed statistical data. Smaller tables
can be derived from these larger tables. Statistical tables presented by the
Government of India and its various statistical agencies and departments are
generally reference or general-purpose tables.
◌◌ Special purpose or test tables: These tables are smaller than reference tables
and can be obtained from them. They seek to analyse a specific aspect in
order to bring out a specific point or answer a specific question.
To build a table, it is necessary to first understand the components of a good
statistical table. When these components are assembled in a systematic manner, they
form a table.

The most basic way to think about a table is to present the data in rows and
columns with some explanatory notes. Depending on the number of characteristics
involved, tabulation can be done using one-way, two-way, or three-way classification. A
good table should essentially include the following features:

Table 1.4.1: “Example” (Title)
(in Rs crores)

Stub Head       Column Head 1                         Column Head 2
                Sub column head   Sub column head     Sub column head   Sub column head
Stub Entries    Main Body of the Table (field)

Footnote (… )
Source of data (… )

◌◌ Table Numbers: If more than one table has been used or presented in a single
location, it is always preferable to assign them numbers. It facilitates future
references to them. This number is always displayed at the top of the table, for
example, Table 1.4.1, Table 1.4.2, and so on.
◌◌ Title: A table’s title is equivalent to an essay’s heading. It appears at the top
of a table and provides information about what is contained in the table’s
main body. The title should be succinct and direct. It is preferable if the title is
presented in bold or capital letters. Table 1.4.1’s title is “Example”.

◌◌ Head note (or prefatory note): It appears beneath the title. It clarifies the
table’s contents and unit of measurement, such as “in rupees crores” or “in
lakh tonnes” or “in thousand bales of cotton,” and so on. It must be written in
brackets on the right side (top) of the table, directly beneath the title. In Table
1.4.1, for example, the unit of measurement is rupees crores.
◌◌ Stub: A stub is made up of a stub head and stub entries. Whereas the stub head
describes the stub entries below it, each stub entry labels a specific piece
of data in its row. The table’s left column contains both stub head and stub
entries. Stub entries also describe the column heads.
◌◌ Main Body or Field: This is the most important part of the table and contains
the numerical data that is hinted at in the title. For example, if the title is
“Exports and Imports of Country A During 1995-96,” it is clear that the body of
the table contains statistical/numerical information on the value of country A’s
exports and imports with various countries.
◌◌ Footnote: A qualifying statement at the bottom of a table. Its purpose is to
explain any omissions or limitations in the data presented in the table’s main
body. For example, if data for a given year is unavailable, it is noted at the
bottom of the table.
◌◌ Source of Data: Last but not least, the source of the data presented in the
table must be mentioned. It enables the reader to independently verify
the original source of data and obtain additional information on the subject.
This also improves the reliability of the data presented in the table. It should
include information such as the title, edition, page number, and publication
source, among other things.

1.4.2 Diagrammatic Representation of Grouped Data

The following points explain the significance of the diagrams.

●● Simple to Understand
Through diagrams, a large number of observations become understandable. As the
number of observations increases, the analysis becomes more time-consuming, but the
presented data can be easily understood using diagrams. It is also said that a picture
can explain more than 10,000 words.

●● Appealing Appearance
Numbers are tedious, whereas diagrams are visually appealing and impressive.
As a result, when reading a newspaper or magazine, the reader pays more
attention to the diagrams than to the numbers. Consequently, the use of diagrams in
exhibitions, fairs, newspapers, and common festivals is increasing at a rapid pace.

●● Improved Memorising
Diagrams are remembered longer than numbers. Numbers may be difficult to remember,
but diagrams have a greater memorising effect because the impressions they create
remain in the mind for a long time.

●● Data Comparison
The diagrams make it simple to compare data from different areas and times.
Numbers are difficult to read and compare, whereas diagrams can be easily compared
by viewing the presented information.

Diagram Components

When creating diagrams, the following elements should be carefully considered:

●● The Diagram’s Title
Every diagram should be labelled appropriately. The diagram’s title should convey
the main idea in as few words as possible while still including all necessary information.
The diagram’s title should ideally be placed at the top of the diagram.

●● Diagram Dimensions
The size of the diagram should be appropriate. A proper proportion between the
diagram’s height and width should be maintained. The diagram would appear strange if
either the height or the width were too short or too long in proportion.

There are no hard and fast rules about dimensions, but we can follow an important
suggestion made by Lutz in his book “Graphic Presentation,” which states that the
proportion of height to width should be 1:1.414. A diagram with this proportion
appears appealing.

●● The Diagram’s Scale
A proper scale should be determined before constructing the diagram. There are
no hard and fast rules regarding the scale. The guiding factors are the data concerned
and the required size of the diagram. The diagram should not grow too large or too
small. The same scale is required when diagrams are to be compared. The scale should
be clearly stated at the top or bottom of the diagram.

●● Footnotes
Footnotes are to be used to clarify specific points in the diagram. Footnotes can be
included at the bottom of the diagram.

●● Diagram Index
An index should be provided to illustrate different types of lines or shades or
colours, so that the reader can easily understand the meaning of the diagram.

●● A neat and tidy diagram
A good diagram should be extremely neat and tidy. Too much information should
not be presented in a single diagram; otherwise, the reader may become confused.

●● Basic Diagram
A good diagram should be as simple as possible so that the reader can clearly
understand its meaning; otherwise, the complexity may obscure the main theme.

We discussed the significance of diagrams and general rules for diagram
construction in the previous two subsections. In the following subsection, we will simply
list the different types of diagrams. Then, in the following sections, we will go over each
type of diagram in detail.

Different types of diagrams are used in practice, and new ones are constantly
being added. For the sake of application and simplicity, several types of diagrams are
classified as follows:

◌◌ One Dimensional Diagrams or Bar Diagrams
◌◌ Two Dimensional Diagrams
◌◌ Pie Diagrams
◌◌ Pictogram
◌◌ Cartogram
O
One Dimensional Diagrams or Bar Diagrams
The most common diagrams are bar diagrams. A bar has the shape of a rectangle
that is filled with some colour (see Example 1). They are referred to as one-dimensional
diagrams because only the length of the bar is important, not its width. That is, the width
of each bar in a diagram remains constant, but it may vary from diagram to diagram
depending on the space available and the number of bars to be presented. To save
space, lines may be drawn instead of bars for a large number of observations.

The following are the advantages of using bar diagrams or one-dimensional
diagrams:

◌◌ They are simple to understand even for those who are not chart experts.
◌◌ They are the simplest and most straightforward way of comparing two or more
diagrams.
◌◌ They are the only form that can effectively be used for comparing large
numbers of observations.

After considering the benefits of bar diagrams, you’ll want to know how bar
diagrams are made and how many different types of bar diagrams are commonly used.

●● Simple Bar Diagram
If the data must be represented using only one variable, a simple bar diagram can
be used. For example, simple bar diagrams can be used to represent production, profit,
and sales figures for various years. Because the width of each bar is the same and only
the lengths of the bars vary, readers can easily see the variation in the characteristic
under study with respect to time or some other given factor from simple bar diagrams.
In our illustration, we will use the length of the bars along the vertical axis and another
given factor along the horizontal axis. In practice, they are extremely popular. For
example, when presenting a company’s total turnover over the last five decades, simple
bar diagrams can only depict the total turnover amount. In the following example, we’ll
draw a simple bar diagram.

Example: The profits (in Rs crore) of a company from 1990-91 to 1999-2000 are
given below:

Year       Profit (in Rs crore)     Year       Profit (in Rs crore)
1990-91     35.6                    1995-96     87.2
1991-92     46.7                    1996-97    113.1
1992-93     39.8                    1997-98    123.6
1993-94     68.2                    1998-99    119.7
1994-95     93.5                    1999-00    130.8

The simple bar diagram of the above data is given below:
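As a rough text-only rendering of a simple bar diagram, the sketch below draws one bar per year with lengths proportional to the profit. It is an illustration rather than the printed figure: the bars run horizontally instead of vertically, and the function name `text_bar_chart` and the 40-character scale are assumptions.

```python
profits = {
    "1990-91": 35.6, "1991-92": 46.7, "1992-93": 39.8, "1993-94": 68.2,
    "1994-95": 93.5, "1995-96": 87.2, "1996-97": 113.1, "1997-98": 123.6,
    "1998-99": 119.7, "1999-00": 130.8,
}


def text_bar_chart(data, width=40):
    """Return one line per item; the longest bar is `width` characters."""
    largest = max(data.values())
    lines = []
    for label, value in data.items():
        # Bar length is proportional to the value; every bar has equal width.
        bar = "#" * round(value / largest * width)
        lines.append(f"{label:<8}|{bar} {value}")
    return lines


for line in text_bar_chart(profits):
    print(line)
```

Only the lengths of the bars carry information, which is why such a diagram is called one-dimensional.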

●● Subdivided Bar Diagram
When multiple components of a variable must be represented in a single diagram,
subdivided bar diagrams are used. A subdivided bar diagram, for example, could
represent the number of members of teaching staff in various departments of an institute.
In this diagram, each bar is divided into as many parts as there are components. First,
the total or cumulative amount is calculated from the component amounts. The bar is
then divided in proportion to the magnitudes of the components. The length of the bar is
equal to the sum of the component amounts.

The components are usually arranged in order of magnitude, from the largest
component at the base of the bar to the smallest component at the end of the bar, and
the same order of components is maintained in each bar. To distinguish between
different components, different shades or colours are used. The index should be used in
the bar diagram to explain such differences.

Vertical or horizontal representations of subdivided bar diagrams are possible.
When the number of components exceeds 10 or 12, subdivided bar diagrams are not
used because the diagram becomes overloaded with information and is difficult to
compare and understand. Let’s look at an example of a subdivided bar diagram to see
how it’s done:

Example: Represent the following data by a subdivided bar diagram:

Category                 Cost per chair (in Rs), year-wise
                         1990     1995     2000
Cost of Raw Material      15       20       30
Labour Cost               15       18       25
Polish                     5        6       15
Delivery                   5        6       10
Total                     40       50       80

At first, we calculate the cumulative cost on the basis of the given amounts:

Category       1990                      1995                      2000
               Cost      Cumulative      Cost      Cumulative      Cost      Cumulative
               (in Rs)   Cost (in Rs)    (in Rs)   Cost (in Rs)    (in Rs)   Cost (in Rs)
Cost of RM      15         15             20         20             30         30
L Cost          15         30             18         38             25         55
Polish cost      5         35              6         44             15         70
Delivery         5         40              6         50             10         80
Total           40                        50                        80
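The cumulative costs in the table mark the heights at which each bar is subdivided. A minimal sketch of the calculation, using the figures of the example (the variable names are hypothetical):

```python
from itertools import accumulate

components = ["Cost of Raw Material", "Labour Cost", "Polish", "Delivery"]
costs = {
    1990: [15, 15, 5, 5],
    1995: [20, 18, 6, 6],
    2000: [30, 25, 15, 10],
}

# Running totals: the last value of each list is the full length of the bar.
cumulative = {year: list(accumulate(values)) for year, values in costs.items()}

for year, totals in cumulative.items():
    print(year, dict(zip(components, totals)))
```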

On the basis of the above table, the required subdivided bar diagram is given below:

●● Multiple Bar Diagram
We build two or more bars together in a multiple bar diagram. The multiple bars
are built to represent either the various components of the total or the magnitudes of
the variables. All of the bars from one set of data are created together so that the bars
from different sets can be properly compared. The magnitude of the component to be
presented will be represented by the height of the bars, similar to how we do in a simple
bar diagram. There is space between the vertical axis and the first bar of the first group
of bars in this diagram, but no space between the bars of the same group. Space must
also be left between the different groups of bars.

Multiple bar diagrams present two or more groups of interrelated data. The
technique for drawing such diagrams is the same as for drawing a simple bar diagram.
The only difference is that because each group contains more than one component,
different shades, colours, dots, or crossings are used to distinguish between the bars of
the same group, while the same symbols are used for the corresponding components
of the other groups. Multiple bar diagrams are very useful when the number of relative
components is large or the change in the values of one variable’s components is
significant. The following example shows how to draw a multiple bar diagram for given
data.

Example: Draw the multiple bar diagram for the following data.

Year     Sale              Gross profit      Net profit
         (in ’000 Rs)      (in ’000 Rs)      (in ’000 Rs)
1990     100               30                10
1995     120               40                15
2000     130               45                25
2005     150               50                30
2010     200               70                30

Multiple bar diagram for the above data is given below.

●● Percentage Bar Diagram
A percentage bar diagram is a subdivided bar diagram drawn on the basis of a
percentage of the total. When creating such diagrams, the length of all the bars is kept
constant at 100, and segments are formed in these bars to represent the components
based on their percentage of the total. First and foremost, the total of the given variable
is assumed to be 100. The percentage is then calculated for each component of the
variable. The cumulative percentage is then calculated for each component. Finally,
the bars are subdivided into cumulative percentages and displayed in the form of
a subdivided bar diagram. Let us explain the procedure with the help of the following
example.

Example: Draw a percentage bar diagram for the following data:

Category     Cost Per Unit (1990)     Cost Per Unit (2000)
Material      20                       32
Labour        25                       36
Delivery       5                       12
Total         50                       80

At first, the percentage and cumulative percentage are obtained for both years in
each category.

Category     Cost Per Unit (1990)                  Cost Per Unit (2000)
             Cost      % Cost    Cumulative        Cost      % Cost    Cumulative
             (in Rs)             % Cost            (in Rs)             % Cost
Material      20        40         40               32        40         40
Labour        25        50         90               36        45         85
Delivery       5        10        100               12        15        100
Total         50       100                          80       100
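The percentage and cumulative percentage columns can be reproduced with a short Python sketch. The function name `percentage_breakdown` is hypothetical; the figures are those of the example.

```python
from itertools import accumulate


def percentage_breakdown(values):
    """Return (percentages, cumulative_percentages) of the given components."""
    total = sum(values)
    percentages = [100 * value / total for value in values]
    return percentages, list(accumulate(percentages))


# Material, Labour, Delivery costs per unit.
pct_1990, cum_1990 = percentage_breakdown([20, 25, 5])
pct_2000, cum_2000 = percentage_breakdown([32, 36, 12])

print(pct_1990, cum_1990)
print(pct_2000, cum_2000)
```

Because every bar ends at a cumulative percentage of 100, bars for different years have the same length and only their internal divisions differ.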

On the basis of the above table, the required percentage bar diagram is given below:

●● Deviation Bar Diagram
Deviation bar diagrams are used to represent net quantities in excess or deficit,
such as net profit, net loss, net exports, net imports, and so on. This type of bar can
represent both positive and negative values. Positive values can be drawn above
the base line, while negative values can be drawn below it. This type of diagram is
illustrated in the following example:

Example: Draw a deviation diagram for the following data:

Year     Sale     Net profits
1990     20%      35%
2000     15%      50%
2010     35%      30%

Deviation diagram for the given data

●● Broken Bar Diagram
If there is a lot of variation in the values of a certain type of data, i.e., some values
are very small and some are very large, then the large bar(s) may be presented as
broken bars to make room for the smaller bars of the data.

These bars are similar to the others, but their presentation is different due to their
wide range of variation. Let us use the following example to demonstrate the concept of
a broken bar diagram:

Example: Represent the following data by a suitable bar diagram.

Year    Sale of cars
1950    200
1960    360
1970    442
1980    520
1990    587
2000    2860

The sale of cars in the year 2000 is about 14 times that in the year 1950. To leave room for the small 1950 figure, we use a broken bar to represent the sale of cars in the year 2000. The broken bar diagram for the given data is shown below.

Two Dimensional Diagrams


In one dimensional diagrams, only the length of the bar is important, and bar comparisons are made solely on the basis of length. In two dimensional diagrams, both the length and the width of the bars are considered, i.e. the given numerical figures are represented by the areas of the bars. As a result, two-dimensional diagrams are also referred to as “Area Diagrams.”

●● Rectangles Diagram

The areas of rectangles in a rectangles diagram represent numerical figures. We know that the area of a rectangle equals length × breadth. So, a rectangles diagram is created by taking one of the two variables as the lengths of the rectangles and the other variable as the breadths of the rectangles along two axes.


Example: Two businesses, A and B, manufacture the same item. In January 2011, Company A produced 2000 units, while Company B produced 2400 units. The production costs per unit of Company A and Company B were Rs 12 and Rs 10.5 respectively. Use a rectangles diagram to represent these facts.

The lengths of the two companies’ rectangles are in the ratio 2000:2400 and their widths are in the ratio 12:10.5. The area of each rectangle (length × breadth) then represents the total production cost of that company: Rs 24,000 for Company A and Rs 25,200 for Company B. These rectangles are shown below.
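
As a rough check of the areas in this example, the following Python sketch multiplies units produced by cost per unit for each company. The dictionary layout and the name rectangle_areas are illustrative assumptions.

```python
# (units produced, cost per unit in Rs) for each company, from the example.
companies = {"A": (2000, 12.0), "B": (2400, 10.5)}


def rectangle_areas(data):
    """Area of each rectangle = length (units) x breadth (cost per unit)."""
    return {name: units * cost for name, (units, cost) in data.items()}


areas = rectangle_areas(companies)
for name, area in areas.items():
    print(f"Company {name}: total cost Rs {area:,.0f}")
```

The printed totals are the quantities that the two rectangles represent by their areas.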

●● Square Diagram

A squares diagram can be used instead of a rectangles diagram. Just as numerical figures are represented by the areas of rectangles in a rectangles diagram, they are represented by the areas of squares in this diagram, so the side of each square is proportional to the square root of the value it represents.

Example: In a square diagram, represent the following data on the number of schools in a city A from 1970-80 to 2000-10.

Years                          1970-80    1980-90    1990-2000    2000-10
Number of schools in city A    4          9          36           64

The square diagram for the above table is shown below.
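
Because each value is shown as an area, the side of each square is the square root of the value (times any convenient drawing scale). A minimal Python sketch, in which the scale parameter and the name square_sides are assumptions made for illustration:

```python
import math

# Number of schools in city A, from the example.
schools = {"1970-80": 4, "1980-90": 9, "1990-2000": 36, "2000-10": 64}


def square_sides(values, scale=1.0):
    """Side length of each square so that side**2 is proportional to the value."""
    return {label: scale * math.sqrt(value) for label, value in values.items()}


sides = square_sides(schools)
for label, side in sides.items():
    print(f"{label}: side = {side:g} units")
```

With a scale of 1 the sides are in the ratio 2 : 3 : 6 : 8, which is far easier to fit on a page than bars in the ratio 4 : 9 : 36 : 64.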

●● Circles Diagram

A circles diagram is another way to prepare a two-dimensional diagram. In the square diagram, we used the given numerical figures/observations as the areas of the corresponding squares. Similarly, here we use the given numerical figures/observations as the areas of the corresponding circles.

Example: Draw a circles diagram for the data given in the above example.
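
For circles, the area is π × radius², so the radius that gives a circle of a stated area is the square root of (area / π). The sketch below applies this to the schools data; the name circle_radii and the optional scale parameter are illustrative assumptions.

```python
import math

# Number of schools in city A, reused from the square diagram example.
schools = {"1970-80": 4, "1980-90": 9, "1990-2000": 36, "2000-10": 64}


def circle_radii(values, scale=1.0):
    """Radius of each circle so that its area equals scale * value."""
    return {
        label: math.sqrt(scale * value / math.pi)
        for label, value in values.items()
    }


radii = circle_radii(schools)
for label, radius in radii.items():
    print(f"{label}: radius = {radius:.3f} units")
```

As with squares, the radii grow with the square root of the values, so the radius for 64 schools is four times the radius for 4 schools.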

Pie Diagrams
A pie diagram/chart is used when it is necessary to know the relationship between the whole of a thing and its parts, i.e., a pie chart shows how the whole is divided into different parts. For example, suppose a family’s total monthly expenditure is Rs 1000, of which Rs 250 is spent on food, Rs 200 on education, Rs 100 on rent, Rs 150 on transportation, and Rs 300 on miscellaneous items. Food, education, rent, transportation, and miscellaneous items then account for 25%, 20%, 10%, 15%, and 30% of the family’s total expenditure, respectively. If the share spent on food increases from 25% to 30%, the percentages of the other heads must decrease so that the total remains 100%.

Similarly, if the share spent on any head decreases, the percentages of the other head(s) must increase so that the total remains 100%. That is why a pie chart depicts the relationship between the whole and its constituent parts.

Example: A company is started by four persons A, B, C and D, who share the profit or loss in the proportion 4 : 3 : 2 : 1. In the year 2010 the company earned a profit of Rs 14,400. Represent their shares of the profit in a pie chart.

The pie chart showing the shares of profit of the four partners is shown below.
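
Before a pie chart for this example can be drawn, each partner’s share of the profit and the corresponding sector angle (out of 360 degrees) are needed. A minimal Python sketch of that arithmetic, where the name pie_sectors is an illustrative assumption:

```python
# Profit-sharing ratio and total profit (in Rs), from the example.
ratio = {"A": 4, "B": 3, "C": 2, "D": 1}
profit = 14_400


def pie_sectors(parts, whole):
    """Return {name: (share of the whole, sector angle in degrees)}."""
    total_parts = sum(parts.values())
    return {
        name: (whole * part / total_parts, 360 * part / total_parts)
        for name, part in parts.items()
    }


sectors = pie_sectors(ratio, profit)
for name, (share, angle) in sectors.items():
    print(f"{name}: Rs {share:,.0f}, {angle:g} degrees")
```

The angles add up to 360 degrees and the shares add up to the total profit, which is the whole-to-parts relationship a pie chart displays.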

●● It is recommended that the components of the pie diagram be drawn in a logical arrangement, pattern, or sequence. For example, order them by size, with the largest at the top and the others following in clockwise order.
●● A pie chart is used only when
◌◌ the sum of the parts results in a meaningful whole. For example, the total of a family’s expenditures on various items makes a meaningful whole, but if a city has 100 doctors, 40 engineers, 50 milkmen, and 80 businessmen, the total of these does not make a meaningful whole, so a pie chart should not be used here.
◌◌ the observations in the various parts are mutually exclusive. In the case discussed above, for example, a businessman may also be an engineer, so the observations in the different parts are not mutually exclusive.
◌◌ the observations of the various parts are made at the same time.

Let’s look at some of the pie diagram’s limitations.
●● The pie diagram is less effective than bar diagrams for accurate reading and interpretation, especially when the data is divided into a large number of components or the difference between component values is very small.
●● If the number of parts of the whole exceeds 7 or 8, a pie chart becomes cluttered and should be avoided.
●● For comparing component values, the pie diagram is inferior to a simple bar diagram or a divided bar diagram.
●● A pie chart cannot be used when the parts do not add up to a meaningful whole, when the observations in the various parts are not mutually exclusive, or when the observations of the various parts are made at different times.

Pictogram
Pictograms, also known as picture grams, are widely used in statistical data representation. Pictograms are created with the help of images, and the images point to the nature of the facts represented.

Pictograms are visually appealing and simple to understand, making them ideal for presenting statistics to the general public.

The image used as a symbol to represent the units or values of a variable or commodity must be chosen carefully, and it must be self-explanatory. For example, if the growth in the number of airline companies over time is to be shown, an aeroplane would be the appropriate symbol.

Pictograms have the following advantages:

●● By counting the pictures, the magnitudes of the variables can be determined.
●● Even people who cannot read can obtain the information.
●● Facts represented in a pictorial format are more easily remembered.

Example: Draw a pictogram for the data of production of tea (in hundred kg) in a particular area of Assam from year 2006 to 2010.

Year                                2006    2007    2008    2009    2010
Production of Tea (in 100 kg.)      2.5     3.0     4.0     5.5     7.0

The pictogram for the above data is shown below:
Cartogram
Cartogram refers to the representation of numerical facts using a map. By representing the facts on maps, the figures for different geographical areas can be shown and compared. Maps are useful for comparing different districts of a state or different states of a country. Cartograms can be used, for example, to represent the production of wheat in various geographical areas. The quantities on the map can be represented in a variety of ways, such as by shading or colour, by dots, by placing pictograms in each geographical area, or by writing the appropriate numerical figure in each geographical area.

Example: The population density per square kilometre in the different states and union territories of India according to the 2011 census data is given below.

State/Union    Density          State/Union    Density          State/Union        Density
Territory      (per sq. km)     Territory      (per sq. km)     Territory          (per sq. km)
Andhra P       308              Kerala         859              Tripura            350
Arunachal P    17               Madhya P       236              Uttarakhand        189
Assam          397              Maharashtra    365              Uttar P            828
Bihar          1102             Manipur        122              West Bengal        1029
Chhattisgarh   189              Meghalaya      132              Andaman and N I    46
Goa            394              Mizoram        52               Chandigarh         9252
Gujarat        308              Nagaland       119              Dadra and N H      698
Haryana        573              Orissa         269              Daman and Diu      2169
Himachal P     123              Punjab         550              Delhi              11297
J and K        124              Rajasthan      201              Lakshadweep        2013
Jharkhand      414              Sikkim         86               Pondicherry        2598
Karnataka      319              Tamil Nadu     555

The cartogram of the above data is shown below.

1.4.3 Graphical Representation of Grouped Data


Numerical data is frequently complex and difficult to interpret and comprehend. This is true for ordinary people, investigators, and, in our case, teachers.

For example, in the preceding example, student grades were converted to a frequency distribution. However, a frequency distribution may not always serve the purpose, because it is unappealing and complex. A more interesting and appealing form of representation is therefore used: the graphical form. Data is represented graphically as geometric figures that can be easily interpreted and understood by anyone. The geometric figure must, however, be drawn in proportion to the measurements in the data. In a graph, the numerical data is thus represented as a scaled geometric figure.


Graphical representations are important for the following reasons:

◌◌ They are attractive and appealing to the eye.
◌◌ They allow easy visualisation of the data.
◌◌ They make interpretation and decision-making easier.
◌◌ They provide a bird’s-eye view of all the data.
◌◌ They are simple to build.

The various kinds of graphs are:

●● Time Series Line Graph

Line graphs can also be used to present statistical data. The relationship between two variables is depicted by a line graph. A time series line graph is produced when one of the two variables is time in days, weeks, months, or years. For example, let us create a line graph based on the following data on coal production in country ‘A’ from 2009-10 to 2013-14.

Production of Coal (Million Tons)

Year       Production (Million Tons)
2009-10    77.22
2010-11    78.17
2011-12    88.42
2012-13    99.80
2013-14    103.50

Production of coal in country A, 2009-14 (in Million Tons)

The graph above is a time series line graph. Time is represented on the X axis, and production is represented on the Y axis. The two variables in this graph are time and production. As time passes, production changes: it may increase, decrease, or remain constant. Because production changes over time, it is said to be time dependent and is treated as the dependent variable. Because time is unaffected by production, it is treated as the independent variable.

The first point on the line graph (also known as a curve) shows that country ‘A’ produced 77.22 million tons of coal in 2009-10. The other points show the production levels in subsequent years in the same way. The curve rising from left to right indicates that coal production in country ‘A’ has been increasing steadily since 2009-10.
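
The reading of the curve can be confirmed from the figures themselves. The following Python sketch computes the change in production from each year to the next and checks that every change is positive; the helper name year_on_year_changes is an illustrative assumption.

```python
# Coal production in country 'A' (million tons), from the table.
production = [
    ("2009-10", 77.22),
    ("2010-11", 78.17),
    ("2011-12", 88.42),
    ("2012-13", 99.80),
    ("2013-14", 103.50),
]


def year_on_year_changes(series):
    """Return (year, change from the previous year) for each year after the first."""
    return [
        (later[0], round(later[1] - earlier[1], 2))
        for earlier, later in zip(series, series[1:])
    ]


changes = year_on_year_changes(production)
is_rising = all(change > 0 for _, change in changes)

for year, change in changes:
    print(f"{year}: {change:+.2f} million tons")
print("Rising every year:", is_rising)
```

A positive change in every year corresponds to a curve that rises from left to right.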

A time series line graph can display two or more comparable dependent variables. In this case, each dependent variable is plotted as its own curve. Consider the following data on the exports and imports of country ‘X’.

Exports and Imports of Country ‘X’

Year       Imports (in Rs 100 crores)    Exports (in Rs 100 crores)
2009-10    15                            35
2010-11    85                            100
2011-12    90                            70
2012-13    130                           120
2013-14    170                           180

Exports and Imports of Country X in 2009-14 (in Rs 100 crores)

●● Histogram
A histogram is a diagram of adjoining rectangles drawn for a continuous series, in which each rectangle represents a class interval and its frequency. It is a two-dimensional diagram, also known as a frequency histogram.

◌◌ Histogram of equal class intervals:

Example: Present the following data in a histogram:

Marks    Frequency
0-10     2
10-20    5
20-30    8
30-40    11
40-50    10
50-60    9
60-70    4
70-80    1

◌◌ Histogram of unequal class intervals

Example: Represent the following data by means of a histogram.

Marks    No. of students (F)
10-15    6
15-20    19
20-25    28
25-30    15
30-40    12
40-60    12
60-80    8

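Since the class intervals are not equal, the table has to be re-created with adjusted frequencies. Taking the smallest class width (5) as the standard, the frequency of each wider class is multiplied by 5 and divided by the width of that class.

Marks    Frequency    Adjustment       Adjusted Frequency
10-15    6            -                6
15-20    19           -                19
20-25    28           -                28
25-30    15           -                15
30-40    12           (5 × 12)/10      6
40-60    12           (5 × 12)/20      3
60-80    8            (5 × 8)/20       2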
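
The adjustment for unequal class intervals can be sketched in Python as follows. The sketch assumes, as in the table, that the smallest class width is taken as the standard; the name adjusted_frequencies is illustrative.

```python
# ((lower limit, upper limit), frequency) for each class, from the example.
classes = [
    ((10, 15), 6),
    ((15, 20), 19),
    ((20, 25), 28),
    ((25, 30), 15),
    ((30, 40), 12),
    ((40, 60), 12),
    ((60, 80), 8),
]


def adjusted_frequencies(classes):
    """Scale each frequency to the smallest class width."""
    min_width = min(upper - lower for (lower, upper), _ in classes)
    return [
        ((lower, upper), freq * min_width / (upper - lower))
        for (lower, upper), freq in classes
    ]


adjusted = adjusted_frequencies(classes)
for (lower, upper), freq in adjusted:
    print(f"{lower}-{upper}: {freq:g}")
```

The adjusted frequencies are used as the heights of the rectangles, so that the area of each rectangle, rather than its height alone, stays proportional to the class frequency.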
●● Frequency Polygon

A frequency polygon is a diagrammatic presentation of data formed by joining the midpoints of the tops of the rectangles of a histogram. It can, however, also be drawn without a histogram, by plotting the class frequencies against the class midpoints and joining the points with straight lines.

Example: Construct a frequency polygon from the data given below

Marks       0-10    10-20    20-30    30-40    40-50    50-60    60-70
Students    5       10       15       20       12       8        5

●● Cumulative Frequency Curve (OGIVE)

The cumulative frequency curve, also known as the ogive, is a smooth curve formed by plotting cumulative frequency data on graph paper.

There are two ways to build a cumulative frequency curve or ogive.

◌◌ Less Than Method

In the ‘less than’ ogive, the less than cumulative frequencies are plotted against the upper class boundaries of the respective classes. It is an increasing curve with an upward slope from left to right.

◌◌ More Than Method

In the ‘more than’ ogive, the more than cumulative frequencies are plotted against the lower class boundaries of the respective classes. It is a decreasing curve with a downward slope from left to right.

Example: Present the following data in the form of a less than ogive and a more than ogive.

Marks       0-5    5-10    10-15    15-20    20-25    25-30    30-35    35-40
Students    4      6       10       10       25       22       18       5

Cumulative Frequency Distribution

Marks           Cumulative Frequencies    Marks           Cumulative Frequencies
less than 5     4                         more than 0     100
less than 10    4 + 6 = 10                more than 5     100 - 4 = 96
less than 15    10 + 10 = 20              more than 10    96 - 6 = 90
less than 20    20 + 10 = 30              more than 15    90 - 10 = 80
less than 25    30 + 25 = 55              more than 20    80 - 10 = 70
less than 30    55 + 22 = 77              more than 25    70 - 25 = 45
less than 35    77 + 18 = 95              more than 30    45 - 22 = 23
less than 40    95 + 5 = 100              more than 35    23 - 18 = 5
                                          more than 40    5 - 5 = 0

Check Your Understanding

A. Fill in the Blanks
1. Once data has been collected, it must be classified and organised in such a way that it can be easily ____________ and ____________.
A) interpreted, read
B) converted, read
C) read, interpreted
D) calculated, interpreted
2. A table is a systematic arrangement of related statistical data in ____________ and ____________ with a specific goal or purpose in mind.
A) rows, tables
B) columns, rows
C) charts, graphs
D) tables, columns
3. Numbers may be difficult to remember, but diagrams have a greater ____________ effect.
A) remembering
B) impression
C) creation
D) memorising
4. Bars are referred to as ____________ diagrams because only the length of the bar is important, not its ____________.
A) two-dimensional, length
B) one-dimensional, width
C) three-dimensional, width
D) square, length
5. A ____________ is used when it is necessary to know the relationship between the whole of a thing and its parts.
A) bar diagram
B) square diagram
C) pie diagram
D) area diagram

B. True or False
1. The areas of rectangles in a rectangles diagram represent numerical figures.
2. Deviation bar diagrams are used to represent total quantities in excess or deficit.
3. A percentage bar diagram is a subdivided bar diagram drawn on the basis of a percentage of the total.
4. Line graphs cannot be used to present statistical data.
5. A histogram is an irregular series joining rectangular diagram in which each rectangle represents a class interval with frequency.

nl
Summary
●● A table is a systematic arrangement of related statistical data in columns and rows with a specific goal or purpose in mind.
●● A table’s purpose is to simplify the presentation of related data and to facilitate comparisons.
●● There are two kinds of tables: (a) Reference or general-purpose tables (b) Special purpose or test tables.
●● To build a table, it is necessary to first understand the components of a good statistical table.
●● Depending on the number of characteristics involved, tabulation can be done using one-way, two-way, or three-way classification.
●● If more than one table has been used or presented in a single location, it is always preferable to assign them numbers.
●● Through diagrams, a large number of observations become understandable.
●● Diagrams are more visually appealing and impressive than numbers.
●● The diagrams make it simple to compare data from different areas and times.
●● A bar has the shape of a rectangle that is filled with some colour. The width of each bar in a diagram remains constant, but it may vary from diagram to diagram depending on the space available and the number of bars to be presented.
●● The multiple bars are built to represent either the various components of the total or the magnitudes of the variables.
●● A percentage bar diagram is a subdivided bar diagram drawn on the basis of a percentage of the total.
●● Deviation bar diagrams are used to represent net quantities in excess or deficit, such as net profit, net loss, net exports, net imports, and so on.
●● In one dimensional diagrams, only the length of the bar is important, and bar comparisons are done solely on the basis of their lengths, whereas in two dimensional diagrams, both length and width of the bars are considered.
●● A circle diagram is another way to prepare a two-dimensional diagram.
●● A pie chart shows us how the whole thing is divided up into different parts.
●● Pictograms, also known as picture grams, are widely used in statistical data representation. Pictograms are created with the help of images.
●● Pictograms are visually appealing and simple to understand, making them ideal for presenting statistics to the general public.
●● Cartogram refers to the representation of numerical facts using a map.
●● Numerical data is frequently complex and difficult to interpret and comprehend.
●● Line graphs can also be used to present statistical data. The relationship between two variables is depicted by a line graph.
●● A time series line graph is produced when one of the two variables is time in days, weeks, months, or years.
●● A time series line graph can display two or more comparable dependent variables.
●● A histogram is a continuous series joining rectangular diagram in which each rectangle represents a class interval with frequency. It’s a two-dimensional diagram known as a frequency histogram.
●● A polygon is a diagrammatic data presentation formed by joining the midpoints of the tops of rectangles in a diagram. A polygon, on the other hand, can be drawn without the use of a histogram.
●● The cumulative frequency curve, also known as the ogive, is a smooth curve formed by plotting cumulative frequency data on graph paper.

Activity
1. This activity helps students to interpret a graph.
2. This activity requires you to use your skills to analyse the information in a graph and solve some clever problems. Your task is to figure out what this graph could be about, because some important information is missing. To conduct this investigation, examine the column graph below, which displays some “Mystery Data.”

By looking at the graph, you can see that:

◌◌ The vertical axis (the graph’s left edge) is marked in centimetres.
◌◌ The horizontal axis (the graph’s bottom edge) is labelled ‘a’ to ‘m.’
◌◌ This graph does not have a title. At the moment, it is simply referred to as “Mystery Data.”
3. Let the students think about: (a) What is the significance of this graph? (b) What could possibly be that long? (c) What’s the point of having so many columns?
4. Students decide what the title of this graph should be after carefully analysing it, and they record their suggestions and reasons.

Glossary
●● Cumulative Frequency Curve (OGIVE): The cumulative frequency curve, also known as the ogive, is a smooth curve formed by plotting cumulative frequency data on graph paper.
●● Histogram: A histogram is a continuous series joining rectangular diagram in which each rectangle represents a class interval with frequency.
●● Cartogram: Cartogram refers to the representation of numerical facts using a map.
●● Pictogram: Pictograms, also known as picture grams, are widely used in statistical data representation. Pictograms are created with the help of images. These diagrams point to the nature of the facts represented.
●● Pie diagram: A pie diagram/chart is used when it is necessary to know the relationship between the whole of a thing and its parts.
●● Deviation Bar Diagram: Deviation bar diagrams are used to represent net quantities in excess or deficit, such as net profit, net loss, net exports, net imports, etc.

Questions and Exercises
1. Explain the one-dimensional diagrams and their advantages.
2. What elements should be considered while creating diagrams?
3. Explain the two-dimensional diagrams.
4. Write short notes on the following:
a) Time series line graph
b) Cartogram
c) Frequency polygon
d) Broken bar diagram
5. Why is graphical representation important?

Further Readings
1. James E. Gentle (2002) “Elements of Computational Statistics”, Springer.
2. Geof H. Givens and Jennifer A. Hoeting (2013) “Computational Statistics”, Second Edition, John Wiley & Sons, Inc.
3. Wolfgang Karl Härdle, Ostap Okhrin and Yarema Okhrin (2017) “Basic Elements of Computational Statistics”, First Edition, Springer.
4. James E. Gentle, Wolfgang Karl Härdle and Yuichi Mori (2012) “Handbook of Computational Statistics: Concepts and Methods”, Second Edition, Springer.
O
Check Your Understanding - Answers
A. Fill in the Blanks
1. C) read, interpreted
2. B) columns, rows
3. D) memorising
4. B) one-dimensional, width
5. C) pie diagram
B. True or False
1. T  2. F  3. T  4. F  5. F

Unit - 1.5: Frequency Distributions


Objectives:

After completing this unit, you will be able to:

●● Know the significance of cumulative frequency distributions
●● Understand the concept of frequency polygons and ogives
●● Learn the concept of the stem and leaf plot
●● Know how to interpret a box plot

Introduction
A frequency distribution provides only a rough picture of the observations. It reports the counts in each class, but the reader still cannot see the shape of the spread of scores across the classes. After constructing the frequency distribution, it is therefore customary to plot the frequencies on a ‘graph’, a pictorial frame formed of horizontal and vertical lines, in order to display the data. Graphs are also referred to as polygons, charts, or diagrams. A graph is constructed using two mutually perpendicular lines known as the X and Y axes, which are labelled with appropriate scales. The horizontal line is known as the abscissa, and the vertical line is known as the ordinate.

Just as there are different types of frequency distributions, there are many different types of graphs that help the reader gain a better understanding of the data. Among these are bar graphs, line graphs, pie charts, pictographs, and so on.

1.5.1 Cumulative Frequency Distributions and their Graphical Representations
Presenting data in a table or graph makes the data easier to comprehend. The frequency distribution of data is made up of different classes together with their frequencies. The frequency of a value or class is the number of times it occurs in the data. Cumulative frequency helps in determining the number of observations that lie below (or above) a given value. To calculate the cumulative frequency of a class, its frequency is added to the sum of the frequencies of all preceding classes. The cumulative frequency of the last class therefore equals the total number of observations.

Graphical representation makes data easier to understand, since fluctuations and ups and downs can be seen at a glance. For graphical representation of frequencies, bar graphs and frequency polygons are commonly used. An ogive is a graphical representation of cumulative frequency.

1.5.2 Frequency Polygon and Ogives
Frequency polygons and ogives are the frequency distribution curves used to represent data in graphical format.

Let us look again at the examples worked out in the earlier section.

POLYGON

Example: Construct a frequency polygon from the data given below

Marks       0-10    10-20    20-30    30-40    40-50    50-60    60-70
Students    5       10       15       20       12       8        5

Process of Creating the Graph:
◌◌ Create a suitable histogram while keeping all of the fundamental principles in mind.
◌◌ Determine the midpoint of each rectangle’s upper horizontal side.
◌◌ Draw straight lines connecting the midpoints of adjacent rectangles of the histogram.
◌◌ Both axes should be labelled clearly.
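
The points joined in a frequency polygon are the class midpoints paired with the class frequencies. A minimal Python sketch for the marks data above, in which the tuple layout and the name polygon_points are illustrative assumptions:

```python
# (lower limit, upper limit, number of students) for each class of marks.
classes = [
    (0, 10, 5),
    (10, 20, 10),
    (20, 30, 15),
    (30, 40, 20),
    (40, 50, 12),
    (50, 60, 8),
    (60, 70, 5),
]


def polygon_points(classes):
    """Return (class midpoint, frequency) pairs in class order."""
    return [((lower + upper) / 2, freq) for lower, upper, freq in classes]


points = polygon_points(classes)
for midpoint, freq in points:
    print(f"midpoint {midpoint:g}: frequency {freq}")
```

Joining consecutive points with straight lines gives the polygon, whether or not the histogram is drawn first.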

Ogives

Ogives are cumulative frequency curves. They can be plotted in two ways:

◌◌ Less than type cumulative frequency distribution curve
◌◌ More than type cumulative frequency distribution curve

The examples provided below show, step by step, how to find the more than cumulative frequencies and how to calculate the less than cumulative frequencies.

More Than Cumulative Frequency Curve Calculations

Example:

Daily Income of     No. of         Lower     Cumulative
Workers (in Rs)     Workers (F)    Limits    Frequency (CF)
100-150             10             100       50
150-200             15             150       50 - 10 = 40
200-250             12             200       40 - 15 = 25
250-300             13             250       25 - 12 = 13

The table above divides the workers of the company into four groups by daily income and gives the number of workers in each group. For example, the first group in the table contains ten workers earning between 100 and 150 rupees per day.

◌◌ 1st Step: The number of workers column is the frequency column. Frequency, denoted by F, is simply the number of times a value occurs. The total of the frequencies is 50. The cumulative frequency is the running total of the frequencies of the groups in the table.
◌◌ 2nd Step: Because we are constructing the more than type, we must use the lower limits from the table, which are 100, 150, 200, and 250. Reading the lower limits from the table gives the categories more than 100, more than 150, and so on.
◌◌ 3rd Step: The total frequency is 50, so the very first cumulative frequency is written as 50, since all 50 workers earn more than Rs 100. The second row counts the workers earning more than Rs 150, so we subtract the first frequency, 10, from 50 to get 40. We continue in this way, subtracting each frequency from the cumulative frequency obtained so far. In the more than type, the last cumulative frequency should equal the frequency of the last class (here 13).
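
The three steps for the more than type can be sketched in Python as successive subtraction from the total frequency. The function name more_than_cumulative is an illustrative assumption; the data are those of the workers’ income table.

```python
# Lower class limits (Rs) and number of workers in each class, from the table.
lower_limits = [100, 150, 200, 250]
workers = [10, 15, 12, 13]


def more_than_cumulative(frequencies):
    """Start from the total and subtract each class frequency in turn."""
    remaining = sum(frequencies)
    result = []
    for freq in frequencies:
        result.append(remaining)  # observations at or above this lower limit
        remaining -= freq
    return result


more_than = more_than_cumulative(workers)
for limit, cf in zip(lower_limits, more_than):
    print(f"more than {limit}: {cf}")
```

The last value equals the frequency of the last class, which is the check described in the 3rd step.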

Making a graph of the frequency distribution

The following steps must be followed in order to create the cumulative frequency distribution graph:

◌◌ Because cumulative frequency is the dependent variable, it is displayed on the Y-axis, while daily income is displayed on the X-axis. If the values on the X-axis do not start from zero, a kink (break mark) should be shown on the axis before the first value. Before drawing the graph, remember to set the scale. Then plot the points using the numbers from the table.
◌◌ After joining all of the points freehand, the cumulative frequency curve will be a decreasing curve, because the curve obtained in the more than type is always a decreasing curve.

Less Than Cumulative Frequency Curve Calculations
Example:

Daily Income of     No. of         Upper     Cumulative
Workers (in Rs)     Workers (F)    Limits    Frequency (CF)
100-150             10             150       10
150-200             15             200       10 + 15 = 25
200-250             12             250       25 + 12 = 37
250-300             13             300       37 + 13 = 50

This example is similar to the more than type, but there are some differences. The total number of workers is again 50, but here we use the upper limits rather than the lower limits.

The 1st and 2nd Steps are identical to those of the more than type. As a result, we proceed to the 3rd Step right away.

◌◌ 3rd Step: For the first group, the frequency and the cumulative frequency are the same, which is 10. For the second group, add the frequencies of the classes between 100 and 200, which gives 10 + 15 = 25. Continuing in this way, the cumulative frequency obtained in the end is 50.
In the less than type, the last cumulative frequency is always equal to the total frequency.
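
The less than type is simply a running total of the frequencies, recorded against the upper limits. A minimal Python sketch, where the function name less_than_cumulative is an illustrative assumption:

```python
# Upper class limits (Rs) and number of workers in each class, from the table.
upper_limits = [150, 200, 250, 300]
workers = [10, 15, 12, 13]


def less_than_cumulative(frequencies):
    """Running total of the class frequencies."""
    running = 0
    result = []
    for freq in frequencies:
        running += freq
        result.append(running)  # observations below this upper limit
    return result


less_than = less_than_cumulative(workers)
for limit, cf in zip(upper_limits, less_than):
    print(f"less than {limit}: {cf}")
```

The final running total equals the total frequency, which gives a quick check on the arithmetic.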

Making a graph of the frequency distribution

To create a graph of the cumulative frequency distribution, the X-axis represents daily income in rupees, while the Y-axis represents cumulative frequency. Begin plotting the points on the graph after you have set the scale. Unlike the more than type graph, the curve obtained after joining the points freehand is an increasing curve.

1.5.3 Stem and Leaf Plot

Once the data has been collected, it may be useful to summarise it. We could, of course, use a histogram, but a histogram does not preserve the original data values. A stem-and-leaf plot, on the other hand, summarises the data while also preserving it.

The Stem and Leaf plot is a method of organising data in such a way that the frequency of different values can be easily seen. A Stem and Leaf Plot, in other words, is a table in which each data value is divided into a “stem” and a “leaf.” For two-digit data, the “stem” is the left-hand column containing the tens digits, and the “leaves” are listed in the right-hand column, showing the ones digits of all the values in the tens, twenties, thirties, and forties. Although Stem and Leaf plots are a pictorial representation of grouped data, they can also be referred to as modal representations, because the mode can be determined with a quick visual inspection of the plot.

When we have decided that a stem and leaf plot is the best way to present our
data, it should be created as follows:

◌◌ Write the thousands, hundreds, or tens on the left side of the page (all digits but the last one). These are your stems.
◌◌ To the right of these stems, draw a line.
◌◌ Write down the ones on the other side of the line (the last digit of a number). These are your leaves.

If the observed value is 35, for example, the stem is 3 and the leaf is 5. If 467 is the observed value, the stem is 46 and the leaf is 7. Where measurements are precise to one or more decimal places, such as 25.7, the stem is 25 and the leaf is 7. If the range of values is too wide, the number of stems can be limited by rounding; 25.7, for example, can be rounded up to 26.
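
The rule "all digits but the last one form the stem" can be sketched in code. This is an illustrative Python helper, not taken from the text; the function name `stem_and_leaf_of` is hypothetical, and it assumes non-negative numbers written in ordinary decimal notation.

```python
def stem_and_leaf_of(value):
    """Split a number into (stem, leaf), where the leaf is its last digit."""
    digits = str(value).replace(".", "")        # 25.7 -> "257", 467 -> "467"
    stem = int(digits[:-1]) if len(digits) > 1 else 0   # a single digit has stem 0
    leaf = int(digits[-1])
    return stem, leaf

for observed in (35, 467, 25.7, 6):
    print(observed, "->", stem_and_leaf_of(observed))

# Rounding first reduces the number of stems: 25.7 becomes 26.
print(round(25.7), "->", stem_and_leaf_of(round(25.7)))
```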

Making a stem and leaf plot (Example 1)

A teacher asked ten of her students how many books they had read in the previous year. Their responses were as follows:

12, 21, 19, 6, 10, 7, 12, 15, 25, 23

For these data, let’s create a stem and leaf plot.

Please note that the number 6 can be written as 06, which means it has a 0 stem and a 6 leaf.

This is how the stem and leaf plot should look:

Stem   Leaf
0      67
1      29025
2      153

The table above shows that:

◌◌ The class interval 0 to 9 is represented by stem 0;
◌◌ 10 to 19 is represented by stem 1; and
◌◌ 20 to 29 is represented by stem 2.

A stem and leaf plot is typically ordered, which means that the leaves are arranged in ascending order from left to right. Also, because each leaf is always a single digit, there is no need to separate the leaves (digits) with punctuation marks (commas or periods).

The ordered stem and leaf plot for the same data is shown below:

Stem   Leaf
0      67
1      02259
2      135
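
The ordered plot can be reproduced programmatically. The sketch below is one possible Python approach, assuming non-negative whole numbers; the function name `stem_and_leaf` is an illustrative choice rather than anything defined in the text.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: leaves in ascending order}; the leaf is the last digit."""
    plot = defaultdict(list)
    for value in sorted(values):          # sorting first keeps the leaves ordered
        stem, leaf = divmod(value, 10)    # 25 -> stem 2, leaf 5; 6 -> stem 0, leaf 6
        plot[stem].append(leaf)
    return dict(plot)

books = [12, 21, 19, 6, 10, 7, 12, 15, 25, 23]
plot = stem_and_leaf(books)

print("Stem   Leaf")
for stem in sorted(plot):
    print(f"{stem:<6} {''.join(str(leaf) for leaf in plot[stem])}")
```

Because the values are sorted before they are split, every row comes out already ordered, so no separate ordering step is needed.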

Splitting the Stems

Mark is a runner who is preparing for a competition. He ran the following number of 100-metre laps per day for 30 days:

22, 21, 24, 19, 27, 28, 24, 25, 29, 28, 26, 31, 28, 27, 22, 39, 20, 10, 26, 24, 27, 28, 26, 28, 18, 32, 29, 25, 31, 27

Let’s create an ordered stem and leaf plot, then split the stems into five-unit intervals and redraw the plot.

◌◌ The observations range from 10 to 39, hence we will have three stems.

Stem   Leaf
1      089
2      01224445566677778888899
3      1129

The stem and leaf plot shows that Mark usually runs between 20 and 29 laps in training each day.

◌◌ Splitting the stems into five-unit intervals

Stem    Leaf
1 (0)   0
1 (5)   89
2 (0)   0122444
2 (5)   5566677778888899
3 (0)   112
3 (5)   9

The stem 1(0) covers all data between 10 and 14, 1(5) covers all data between 15 and 19, and so on.

The revised stem and leaf plot shows that Mark usually runs between 25 and 29 laps in training each day. The values 10, shown as 1(0) 0, and 39, shown as 3(5) 9, appear to be outliers.
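
Splitting the stems can be sketched in the same way. The Python below is an illustrative assumption about how one might code it (the function name `split_stem_and_leaf` and the `width` parameter are not from the text); each key pairs the tens digit with 0 or 5 to mark the lower or upper half of the stem.

```python
from collections import defaultdict

def split_stem_and_leaf(values, width=5):
    """Group two-digit values by stem and by five-unit half of that stem."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        half = (leaf // width) * width    # 0 for leaves 0-4, 5 for leaves 5-9
        plot[(stem, half)].append(leaf)
    return dict(plot)

laps = [22, 21, 24, 19, 27, 28, 24, 25, 29, 28, 26, 31, 28, 27, 22,
        39, 20, 10, 26, 24, 27, 28, 26, 28, 18, 32, 29, 25, 31, 27]
plot = split_stem_and_leaf(laps)

for (stem, half), leaves in sorted(plot.items()):
    print(f"{stem} ({half})   {''.join(map(str, leaves))}")

# The row with the most leaves is the interval Mark runs most often.
busiest = max(plot, key=lambda key: len(plot[key]))
print("Most frequent interval starts at", busiest[0] * 10 + busiest[1])
```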

Outliers
An outlier is a data value that is extremely high or low. It is an observation value that deviates significantly from the rest of the data. In a set of data, there may be more than one outlier.

Outliers are sometimes important pieces of information that should not be overlooked. They should be ignored when they occur as a result of an error or misinformation.

1.5.4 Box Plot
A box plot, also known as a box-and-whisker plot, is a convenient way of graphically representing numerical data in descriptive statistics. It summarises the data using five numbers: the smallest observation (sample minimum), the lower quartile Q1, the median Q2, the upper quartile Q3, and the largest observation (sample maximum). Box plots are commonly used to describe distributions that are strongly skewed. A box plot also shows which observations, if any, could be considered outliers. It is a simple graphic method for examining one or more sets of data.

Box plots show differences between populations without making any distributional assumptions. The spacing between the various parts of the box indicates the degree of dispersion (spread) and skewness in the data, and helps to identify outliers. Box plots can be drawn vertically or horizontally; here they are drawn vertically.

A typical Box Plot

How to Construct a Box Plot

Box plots are useful for identifying key values when comparing two or more distributions. To understand how box plots are constructed, consider the following data from 37 students in a class who took part in a game involving a box of balls. Their task was to pick a ball from a box in one corner of the room and place it in an empty box in another corner of the room as quickly as possible, and their times (in seconds) were recorded. The times of the 16 boys and 21 girls who took part in the game were compared. The following table summarises the observed data:

Time (in seconds) required to complete the given task

Boys 18, 19, 20, 22, 24, 25, 22, 16, 17, 19, 25, 27, 28, 23, 23, 31
Girls 15, 17, 18, 19, 20, 21, 23, 14, 17, 18, 19, 20, 21, 24, 19, 16, 17, 18, 20, 22, 28

The construction of separate box plots for the boys’ and girls’ data is described below.

A box plot can be built in a variety of ways. The first is based on the quartiles of the distribution of scores together with its lowest and highest values. The figure that follows the worked example depicts how these statistics are applied in the preceding example. For each gender we draw a box extending from the first quartile to the third quartile, with the second quartile drawn within the box. As a result,

◌◌ the bottom of each box represents the first quartile,
◌◌ the top represents the third quartile, and
◌◌ the line in the middle represents the second quartile.
◌◌ The lower whisker is a line drawn from the point corresponding to the smallest observation to the first quartile.
◌◌ The upper whisker is a line drawn from the third quartile to the point corresponding to the largest observation.

To determine the above components, namely Q1, Q2, Q3, the smallest observation, and the largest observation, let us arrange the given data in ascending order for each gender, as shown in the table below.

Gender   Times (in Seconds)
Boys     16, 17, 18, 19, 19, 20, 22, 22, 23, 23, 24, 25, 25, 27, 28, 31
Girls    14, 15, 16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 20, 20, 20, 21, 21, 22, 23, 24, 28

For Boys
◌◌ Lowest observation is xs = 16
◌◌ Largest observation is xl = 31
◌◌ Total number of observations is n = 16
◌◌ First Quartile
Q1 = ((n + 1)/4)th observation = ((16 + 1)/4)th observation
= 4.25th observation = 4th + 0.25 (5th – 4th) observation
= 19 + 0.25 (19 - 19) = 19
◌◌ Second Quartile
Q2 = ((n + 1)/2)th observation = ((16 + 1)/2)th observation
= 8.5th observation = mean of 8th and 9th observation
= (22 + 23) / 2 = 22.5
◌◌ Third Quartile
Q3 = (3(n + 1)/4)th observation = (3(16 + 1)/4)th observation
= 12.75th observation = 12th + 0.75 (13th – 12th) observation
= 25 + 0.75 (25 - 25) = 25

For Girls
◌◌ Lowest observation is xs = 14
◌◌ Largest observation is xl = 28
◌◌ Total number of observations is n = 21
◌◌ First Quartile
Q1 = ((n + 1)/4)th observation = ((21 + 1)/4)th observation
= 5.5th observation = 5th + 0.5 (6th – 5th) observation
= 17 + 0.5 (17 - 17) = 17
◌◌ Second Quartile
Q2 = ((n + 1)/2)th observation = ((21 + 1)/2)th observation
= 11th observation = 19
◌◌ Third Quartile
Q3 = (3(n + 1)/4)th observation = (3(21 + 1)/4)th observation
= 16.5th observation = 16th + 0.5 (17th – 16th) observation
= 21 + 0.5 (21 - 21) = 21
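
The quartile calculations above follow one pattern: locate the k(n + 1)/4-th observation and interpolate between its two neighbours when the position is fractional. The following Python sketch illustrates that pattern; the function name `quartile` is an assumption for illustration, and the code assumes the data are already sorted and contain at least three observations.

```python
def quartile(sorted_values, k):
    """k-th quartile (k = 1, 2, 3) located at the k(n + 1)/4-th observation."""
    n = len(sorted_values)
    position = k * (n + 1) / 4            # 1-based position, possibly fractional
    lower = int(position)                 # whole part: the lower neighbour
    fraction = position - lower           # fractional part used to interpolate
    low_value = sorted_values[lower - 1]  # convert 1-based position to 0-based index
    if fraction == 0:
        return low_value
    high_value = sorted_values[lower]
    return low_value + fraction * (high_value - low_value)

# Ordered data from the table above.
boys = [16, 17, 18, 19, 19, 20, 22, 22, 23, 23, 24, 25, 25, 27, 28, 31]
girls = [14, 15, 16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 20, 20, 20,
         21, 21, 22, 23, 24, 28]

for name, data in (("Boys", boys), ("Girls", girls)):
    q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
    print(f"{name}: min = {data[0]}, Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, max = {data[-1]}")
```

Other textbooks and software packages use slightly different interpolation rules, so quartiles computed elsewhere may differ a little from these values.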

The box plots for girls and boys are shown below:

Box Plot Components

Let us now go over the various components and terminologies used in drawing different types of box plots.

◌◌ Hinges (Upper and Lower)
The upper and lower hinges represent the third and first quartiles, respectively. In the preceding example, the upper and lower hinge values for the girls are 21 and 17, respectively, while the values for the boys are 25 and 19.

◌◌ H-Spread
The H-spread is the difference between the upper and lower hinges. It measures the spread of the data elements between the first and third quartiles. In the preceding example, the H-spread for boys is 25 – 19 = 6, whereas the H-spread for girls is 21 – 17 = 4.
◌◌ Whiskers
Whiskers are the lines that extend above and below the box. The lower whisker extends from the smallest observation point to Q1, while the upper whisker extends from Q3 to the largest observation point.
◌◌ Step
The step is calculated by multiplying the H-spread by 1.5. In the preceding example, the value of 1 step for boys is 1.5 × 6 = 9, whereas for girls it is 1.5 × 4 = 6.

◌◌ Inner Fences (Upper and Lower)
The upper and lower inner fences are determined by adding one step to the upper hinge and subtracting one step from the lower hinge. In other words, the upper inner fence is one step above the upper hinge, while the lower inner fence is one step below the lower hinge. In the preceding example, the upper and lower inner fences for boys are 25 + 9 = 34 and 19 – 9 = 10, respectively, whereas the upper and lower inner fences for girls are 21 + 6 = 27 and 17 – 6 = 11, respectively.
◌◌ Outer Fences (Upper and Lower)
The upper and lower outer fences are determined by adding two steps to the upper hinge and subtracting two steps from the lower hinge. In the preceding example, the upper and lower outer fences for boys are 25 + 2 × 9 = 43 and 19 – 2 × 9 = 1, respectively, whereas the upper and lower outer fences for girls are 21 + 2 × 6 = 33 and 17 – 2 × 6 = 5, respectively.
◌◌ Outside Value
An outside value is a value that lies beyond an inner fence but not beyond an outer fence. Outside values represent the scattered values in the data and are marked with circles.
◌◌ Adjacent Values (Upper and Lower)
The upper and lower adjacent values are the largest and smallest observations that lie within the inner fences. In the preceding example, the upper and lower adjacent values for boys are 31 and 16, respectively, whereas for girls they are 24 and 14, respectively, because the girls’ largest observation, 28, lies beyond the upper inner fence of 27.
◌◌ Far Out Value
A value that lies beyond the upper or lower outer fence is referred to as a far out value. Far out values are marked with asterisks.
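
The components defined in this list can be computed directly from the hinges. The Python below is a minimal sketch under the definitions given above; the function names `box_plot_components` and `classify` are hypothetical, and the hinge values are the ones obtained in the worked example.

```python
def box_plot_components(lower_hinge, upper_hinge):
    """H-spread, step, and inner/outer fences as (lower, upper) pairs."""
    h_spread = upper_hinge - lower_hinge
    step = 1.5 * h_spread
    return {
        "h_spread": h_spread,
        "step": step,
        "inner": (lower_hinge - step, upper_hinge + step),
        "outer": (lower_hinge - 2 * step, upper_hinge + 2 * step),
    }

def classify(values, parts):
    """Return (outside values, far out values) for the given fences."""
    inner_low, inner_high = parts["inner"]
    outer_low, outer_high = parts["outer"]
    outside, far_out = [], []
    for value in values:
        if value < outer_low or value > outer_high:
            far_out.append(value)     # beyond an outer fence
        elif value < inner_low or value > inner_high:
            outside.append(value)     # beyond an inner fence only
    return outside, far_out

boys = [16, 17, 18, 19, 19, 20, 22, 22, 23, 23, 24, 25, 25, 27, 28, 31]
girls = [14, 15, 16, 17, 17, 17, 18, 18, 18, 19, 19, 19, 20, 20, 20,
         21, 21, 22, 23, 24, 28]

boys_parts = box_plot_components(19, 25)    # lower and upper hinges for boys
girls_parts = box_plot_components(17, 21)   # lower and upper hinges for girls

print("Boys:", boys_parts, classify(boys, boys_parts))
print("Girls:", girls_parts, classify(girls, girls_parts))
```

With these definitions, the only flagged observation is 28 in the girls' data, which lies between the upper inner fence (27) and the upper outer fence (33).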

Components of Box Plot for Boys

Components of Box Plot for Girls

Box Plot drawn with Outliers

Extreme observations comprise outside values and far out values. If the data contains extreme observations, they can be represented in a box plot by individual marks. When extreme observations are represented in box plots, they are referred to as outliers. Outside values are those that lie beyond an inner fence but not beyond an outer fence, whereas far out values are those that lie beyond the lower or upper outer fence. In a box plot, the individual marks for extreme values are plotted above and below the whiskers. Outside values, in particular, are denoted by small circles. In the preceding example, 28 is the only outside value in the girls’ data, since it lies beyond the upper inner fence of 27 but within the upper outer fence of 33, whereas no value lies beyond the lower or upper inner fence in the boys’ data.

Box Plot with + sign

The values of other important statistics, such as the mean, mode, and so on, can also be shown in box plots. In box plots, we insert a plus sign to indicate the mean score for a set of values. For the example discussed above, computed from the ordered data, the mean for boys is 359/16 ≈ 22.44 and the mean for girls is 406/21 ≈ 19.33.

Box Plot with whisker, + sign and outliers

Based on the data from the example discussed earlier, we can see that the girls moved the balls from one box to the other faster than the boys. We can also see that one of the boys was faster than almost all of the girls (except 3). The box plot for the girls’ data with whisker, + sign, and outliers is shown below:

Check Your Understanding

A. Fill in the Blanks
1. Graphs are also referred to as ____________, charts or ___________.
A) diagrams, polygons
B) lines, bars
C) squares, diagrams
D) polygons, diagrams
2. The frequency distribution of data is made up of different ____________ and subclasses that indicate the ___________.
A) classes, frequency
B) classes, subclasses
C) frequency, classes
D) observations, frequency
3. Frequency Polygon and ____________ are the frequency distribution curves used to represent data in graphical format.
A) pie charts
B) bar diagrams
C) OGIVES
D) graphs
4. A ____________ summarises the data while also preserving it.
A) leaf-and-stem plot
B) stem-and-leaf plot
C) box plot
D) frequency polygon

5. An ___________ is a data value that is extremely high or low.
A) OGIVE
B) abscissa
C) outlier
D) ordinate

B. True or False
1. An outlier is an observation value that does not deviate significantly from the rest of the data.
2. Box plots are used to depict the distributions of numerical data values, particularly when comparing them across multiple groups.
3. When a data distribution is symmetric, the median should not be located in the exact centre of the box.
4. Outliers should be evenly distributed on both sides of the box.
5. Box plots are least effective when comparing distributions between groups.

Summary
●● The frequency distribution provides a rough picture of the observations.
●● A frequency distribution does not give a visual picture of a sample’s scores; it merely reflects the counts among the classes.
●● A graph is constructed using two mutually perpendicular lines known as the X and Y–axes, which are labelled with appropriate scales.
●● The horizontal line is known as the abscissa, and the vertical line is known as the ordinate.
●● The use of a table or graph to present data makes the data more comprehensible. The frequency distribution of data is made up of different classes and subclasses that indicate the frequency.
●● Cumulative frequency aids in determining the number of observations that lie above (or below) a specific value.
●● Graphical representation makes data easier to understand; drawing graphs allows one to easily see fluctuations and ups and downs.
●● Frequency Polygon and OGIVES are the frequency distribution curves used to represent data in graphical format.
●● Ogives are the Cumulative Frequency curves.
●● A stem-and-leaf plot summarises the data while also preserving it.
●● The Stem and Leaf plot is a method of organising data in such a way that the frequency of different values can be easily seen.
●● The “stem” is the left-hand column with the tens digits. The “leaves” are listed in the right-hand column, with all of the ones digits for the tens, twenties, thirties, and forties.

●● Stem and Leaf plots are a pictorial representation of grouped data; they can also be referred to as modal representations.
●● A stem and leaf plot is typically ordered, which means that the leaves are arranged in ascending order from left to right.
●● An outlier is a data value that is extremely high or low. It is an observation value that deviates significantly from the rest of the data. In a set of data, there may be more than one outlier.
●● Outliers are sometimes important pieces of information that should not be overlooked. They should be ignored when they occur as a result of an error or misinformation.
●● A boxplot is a standardised method of displaying data distribution based on a five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum).
●● A box plot, when drawn, does a good job of graphically dividing the data into fourths.
●● Box plots are used to depict the distributions of numerical data values, particularly when comparing them across multiple groups.
●● Box plots are designed to provide high-level information at a glance, providing general information about the symmetry, skew, variance, and outliers of a group of data.
●● A box plot is built around the quartiles of a dataset, or the values that divide the dataset into equal fourths. The first quartile (Q1) is larger than 25% of the data and smaller than the remaining 75%.
●● The second quartile (Q2) is located in the centre and divides the data in half. Q2 is also referred to as the median. The third quartile (Q3) is larger than 75% of the data but smaller than the remaining 25%.
●● The locations of these three quartiles are marked by the ends of the box and its centre line in a box and whiskers plot.
●● When a data distribution is symmetric, the median should be located in the exact centre of the box.
●● If a distribution is skewed, the median will be off to the side rather than in the centre.
●● A box plot can be aligned so that the boxes are arranged vertically (with groups on the horizontal axis) or horizontally (with groups aligned vertically).
●● When the data represents a sample, notches are used to show the most likely values expected for the median.
●● Box width can be used to indicate how many data points are in each group.
●● Letter-value plots enclose increasing proportions of the dataset with multiple boxes.
●● The letter-value plot is motivated by the fact that as more data is gathered, more stable tail estimates can be made.

Activity
1. This activity will help the students to understand frequency distribution.
2. Form groups of 3-4 students and assign them the task of preparing a presentation on different types of frequency distribution graphs and how to create them.

Glossary

●● Letter-value plots: Letter-value plots are an extension of the standard box plot. Letter-value plots enclose increasing proportions of the dataset with multiple boxes.
●● Whiskers: The whiskers are traditionally extended to the farthest data point within 1.5 times the IQR from each box end.
●● Notches: Notches are used to show the most likely values expected for the median.
●● Box Plots: A boxplot is a standardised method of displaying data distribution based on a five-number summary. Box plots are used to depict the distributions of numerical data values, particularly when comparing them across multiple groups.
●● Outliers: An outlier is a data value that is extremely high or low. It is an observation value that deviates significantly from the rest of the data.
●● Stem and Leaf Plot: It is a method of organising data in such a way that the frequency of different values can be easily seen.

Questions and Exercises

1. Explain the process of creating a graph.
2. Write a short note on Ogives.
3. What are the steps in making a graph of a frequency distribution?
4. Write short notes on the following:
a) Stem and Leaf Plot
b) Box Plot
c) Outliers and whisker range

Check Your Understanding - Answers

Fill in the Blanks
1. D) polygons, diagrams
2. A) classes, frequency
3. C) OGIVES
4. B) stem-and-leaf plot
5. C) outlier

True/False
F, T, F, T, F

Module - 2: Introduction to Computational Statistics


Structure:

2.1 Introduction to Central Tendency
2.1.1 Central Tendency - Introduction
2.1.2 Central Tendency - Measures
2.2 Measures of Central Tendency
2.2.1 Mean/Average - Meaning and Characteristics
2.2.2 Arithmetic Mean - Intro and Application
2.2.3 Combined Mean - Intro and Application
2.2.4 Weighted Mean - Intro and Application
2.2.5 Median - Meaning and Characteristics
2.2.6 Median - Applications
2.2.7 Mode - Meaning and Characteristics
2.2.8 Mode - Applications
2.2.9 Relationship between Mean, Median and Mode
2.3 Industry examples on Central Tendency Measures
2.3.1 Relevant Industry Example
2.3.2 Case Study
2.4 Introduction to Dispersion
2.4.1 Introduction_dispersion_1
2.5 Measures of Dispersion
2.5.1 Range_Measure_2
2.5.2 mean_deviation_3
2.5.3 SD_Variance_4
2.5.4 SD_Variance_Calculation_5
2.5.5 Combined_Mean
2.5.6 Comparison_measures_dispersion
2.5.7 Coefficient_Variation
2.5.3 Quartile_Deviation
2.5.4 Application_Quartile_Deviation
2.5.5 Data_distribution_Introduction_skewness
2.5.6 Measures_skewness
2.5.7 Calculation_Importance_skewness
2.5.8 Moments
2.5.9 Introduction_Kurtosis
2.5.10 Summary_Module2
2.6 Industry Example for Dispersion
2.6.1 Business_Problem_Introduction_1
2.6.2 Calculation_Business_Metrics_2
2.6.3 Insights_Graphs_3
2.6.4 Data_Interpretation_4
2.6.5 Conclusion_Module2_5

Unit - 2.1 Introduction to Central Tendency


Objectives:

At the end of this unit, you will be able to:

●● Understand the term central tendency
●● Understand the measures of central tendency

Introduction

The phrase “central tendency” was coined in the late 1920s. It is the statistical measure that identifies a single value as representative of an entire distribution. Its goal is to provide an accurate summary of the data: it is the single value that most closely represents the data. This aspect of describing data is sometimes informally called “number crunching.” The three most popular measures of central tendency are the mean, median, and mode.

A central tendency (or measure of central tendency) is a typical or central value for a probability distribution in statistics. It is also known as the centre or location of the distribution. Measures of central tendency are commonly called averages.

r
2.1.1 Central Tendency – Introduction
The arithmetic mean, median, and mode are the most popular measures of central tendency. A central tendency can be determined for a finite collection of data or for a theoretical distribution, such as the normal distribution. Authors occasionally use the term “central tendency” to describe “the tendency of quantitative data to cluster around some central value.”

Measures - The following measures can be used with one-dimensional data. Depending on the circumstances, it may be necessary to transform the data before estimating a central tendency. Squaring the values or taking logarithms are two examples. Whether a transformation is suitable, and what it should be, depends strongly on the data being studied.

Arithmetic mean or simply, mean - It is the sum of all measurements divided by the total number of observations in the data set.

Median - It is the value in the centre that separates the data set’s upper and lower halves. The median and mode are the only measures of central tendency that can be used with ordinal data, where values are ordered relative to one another but not measured absolutely.

Mode – It is the most frequently occurring value in the data set. This is the only measure of central tendency that can be applied to nominal data with purely qualitative category assignments.

Geometric mean - It is the nth root of the product of the data values, where n is the number of data values. This measure is only applicable to data measured on a strictly positive scale.

Harmonic mean - It is the reciprocal of the arithmetic mean of the reciprocals of the data values. This measure, too, is only valid for data measured on a strictly positive scale.

Weighted arithmetic mean – It is an arithmetic mean that applies weights to certain data items.

Truncated mean or trimmed mean – It is the arithmetic mean of the data values after a predetermined number or proportion of the highest and lowest data values have been removed.

Interquartile mean – It is a truncated mean calculated using only the data within the interquartile range.

Midrange – It is the arithmetic mean of a data set’s maximum and minimum values.

Midhinge – It is the arithmetic mean of the first and third quartiles.

Trimean – It is the weighted arithmetic mean of the median and the first and third quartiles.

Winsorized mean – It is an arithmetic mean in which extreme values are replaced by values closer to the median.
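
Several of the measures listed above can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the data set is hypothetical, the helpers `trimmed_mean` and `winsorized_mean` are assumptions written for this example, and the trimean uses the common weighting (Q1 + 2·Q2 + Q3)/4.

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 10, 12, 40]    # hypothetical data with one large value

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
geometric = statistics.geometric_mean(data)   # requires strictly positive data
harmonic = statistics.harmonic_mean(data)     # requires strictly positive data
midrange = (min(data) + max(data)) / 2

# Quartiles; the default method places them at the k(n + 1)/4-th observation.
q1, q2, q3 = statistics.quantiles(data, n=4)
midhinge = (q1 + q3) / 2
trimean = (q1 + 2 * q2 + q3) / 4

def trimmed_mean(values, k):
    """Mean after removing the k smallest and k largest values."""
    ordered = sorted(values)
    return statistics.mean(ordered[k:len(ordered) - k])

def winsorized_mean(values, k):
    """Mean after replacing the k most extreme values at each end
    with the nearest remaining value."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return statistics.mean([kept[0]] * k + kept + [kept[-1]] * k)

print(mean, median, mode, midrange, midhinge, trimean)
print(geometric, harmonic, trimmed_mean(data, 1), winsorized_mean(data, 1))
```

In this data set the single large value pulls the mean and midrange upward, while the median, trimean, and Winsorized mean stay near the bulk of the data.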

Any of the above can be applied to each dimension of multidimensional data, although the results may not be rotationally invariant. There is also the Geometric median, which minimises the sum of distances to the data points. When applied to one-dimensional data, this is the same as the median, but it is not the same as taking the median of each dimension separately. It is not invariant to different rescalings of the different dimensions.

Quadratic mean (often known as the root mean square) – It is useful in engineering, although it is rarely used in statistics. This is because it is not a good indicator of the distribution’s centre when the distribution contains negative values.

Simplicial depth – It is the probability that a randomly chosen simplex with vertices drawn from the given distribution will contain the given centre.

Tukey median – It is a point with the property that every half-space containing it also contains many sample points.

Several measures of central tendency can be characterised as solving a variational problem, in the sense of the calculus of variations, namely minimising variation from the centre. That is, given a measure of statistical dispersion, one asks for the measure of central tendency that minimises it: the value for which the deviation from the centre is smallest among all possible choices of centre. As the saying goes, “dispersion precedes location.” These measures are first defined in one dimension, but they can be generalised to many dimensions. This centre may or may not be unique.
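
Two standard instances of this idea are that the arithmetic mean minimises the sum of squared deviations and the median minimises the sum of absolute deviations. The Python sketch below checks both numerically on a hypothetical data set by searching a grid of candidate centres; the grid, the data, and the function names are illustrative assumptions, not part of the text.

```python
def sum_squared(values, centre):
    """Dispersion measured as the sum of squared deviations from the centre."""
    return sum((x - centre) ** 2 for x in values)

def sum_absolute(values, centre):
    """Dispersion measured as the sum of absolute deviations from the centre."""
    return sum(abs(x - centre) for x in values)

data = [2, 3, 3, 5, 7, 8, 10, 12, 40]          # hypothetical data: mean 10, median 7
candidates = [c / 10 for c in range(0, 401)]   # candidate centres 0.0, 0.1, ..., 40.0

best_for_squares = min(candidates, key=lambda c: sum_squared(data, c))
best_for_absolute = min(candidates, key=lambda c: sum_absolute(data, c))

print("Centre minimising squared deviations:", best_for_squares)
print("Centre minimising absolute deviations:", best_for_absolute)
```

The grid search recovers the mean for the squared criterion and the median for the absolute criterion, which shows how the choice of dispersion measure determines the corresponding centre.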

A measure of central tendency is a single value which can be considered as representative of a set of observations. The value around which the observations can be considered as centred is known as an average, average value, or location centre. Since such representative values tend to lie centrally within a set of observations when arranged according to magnitude, these averages are called measures of central tendency.

2.1.2 Central Tendency – Measures


Central tendency has three main measures: mean, median and mode. Each of these measures gives a specific indication of the distribution’s typical or central value.

◌◌ Mean- The mean is the average of the numbers. It is easy to calculate: add up all the numbers, then divide by how many numbers there are. In other words, it is the sum divided by the count.
◌◌ Median- Within a list of numbers sorted in ascending or descending order, the median is the middle number, and it may be more representative of that set of data than the average. The median is often used instead of the mean when the series includes outliers that may distort the average of the values.
◌◌ Mode- The mode is the number that occurs most frequently in a dataset. A collection of numbers may have one mode, more than one mode, or no mode at all. Other popular measures of central tendency include a set’s mean, or average, and a set’s median, or middle value.
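
The three measures, and the effect of an outlier on them, can be illustrated with a short Python sketch. The data values and the helper `modes` are hypothetical and written for this example; `modes` returns an empty list when every value occurs only once, which corresponds to "no mode at all", and it assumes a non-empty list.

```python
import statistics
from collections import Counter

def modes(values):
    """Return all most frequent values, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(v for v, count in counts.items() if count == highest)

scores = [4, 5, 5, 6, 7]
with_outlier = scores + [60]          # add one extreme value

print("Mean:", statistics.mean(scores), "->", statistics.mean(with_outlier))
print("Median:", statistics.median(scores), "->", statistics.median(with_outlier))

print(modes(scores))             # one mode
print(modes([1, 1, 2, 2, 3]))    # more than one mode
print(modes([1, 2, 3]))          # no mode
```

The outlier moves the mean a long way while the median barely changes, which is why the median is preferred when outliers may distort the average.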

si
Check Your Understanding

A. Multiple Choice Question
1. The value around which the observations can be considered as centred is known as an __________.
A) Number
B) Sum
C) Average
D) Mode
2. _______________ has three main measures: mean, median and mode.
A) Central Tendency
B) Mean
C) Simple Mean
D) Simple Depth
3. Average is the sum of all measurements divided by the total number of __________ in the data set.
A) Numerical
B) Sum
C) Average
D) Observation
4. A truncated mean calculated using data from the interquartile range is called __________.
A) Mode
B) Interquartile Mean
C) Median
D) Average
5. The median is the _________ number and may be more representative of that set of data than the average.
A) Right
B) Left
C) Central
D) Middle

B. State True or False
1. A collection of numbers may not have one mode, more than one mode, or no mode at all.
2. Measures of central tendency are a single value which can be considered as representative of a set of observations.
3. A central tendency (or measure of central tendency) is a typical or central value for a probability distribution in statistics.
4. The median and mode are the only measures of central tendency that can be used with ordinal data.
5. Several measures of central tendency cannot be described as solving a variational problem in the calculus of variations, specifically minimising variation from the centre.

Summary
●● A central tendency (or measure of central tendency) is a typical or central value for a probability distribution in statistics. It is also known as the centre or location of the distribution. Measures of central tendency are commonly called averages.
●● The arithmetic mean, median, and mode are the most popular measures of central tendency. A central tendency can be determined for a finite collection of data or for a theoretical distribution, such as the normal distribution. Authors occasionally use the term “central tendency” to describe “the tendency of quantitative data to cluster around some central value.”
●● A collection of numbers may have one mode, more than one mode, or no mode at all. Other popular measures of central tendency include a set’s mean, or average, and a set’s median, or middle value.
●● Several measures of central tendency can be characterised as solving a variational problem, namely minimising variation from the centre. That is, given a measure of statistical dispersion, one asks for the measure of central tendency for which the deviation from the centre is smallest among all possible choices of centre.
●● The mean is the average of the numbers. It is easy to calculate: add up all the numbers, then divide by how many numbers there are. In other words, it is the sum divided by the count.
●● There is also the Geometric median, which minimises the sum of distances to the data points. When applied to one-dimensional data, this is the same as the median, but it is not the same as taking the median of each dimension separately. It is not invariant to different rescalings of the different dimensions.


Activity
1. Discuss how skewed distributions affect the mean and median.
2. Discuss measures of central tendency and their applications.

Questions & Exercises
1. How do outliers influence the measures of central tendency?
2. How does the shape of a distribution influence measures of central tendency?
3. What are the legitimate measures of central tendency, and what are their advantages and disadvantages?
4. What is meant by a skewed distribution, and how does skew affect the mean and median?

Glossary
●● Central tendency - It is a typical or central value for a probability distribution in statistics.
●● Arithmetic mean, or simply mean - It is the sum of all measurements divided by the total number of observations in the data set.
●● Median - It is the value in the centre that separates the data set’s upper and lower halves. The median and mode are the only measures of central tendency that can be used with ordinal data, where values are ordered relative to one another but not measured absolutely.
●● Mode - It is the most frequently occurring value in the data set. This is the only measure of central tendency that can be applied to nominal data, where values are purely qualitative category assignments.

References
1. Bordens, K. S. and Abbott, B. B. (2011). Research Design and Methods: A Process Approach. New Delhi: McGraw Hill Education (India) Private Limited.
2. King, B. M. and Minium, E. W. (2008). Statistical Reasoning in the Behavioural Sciences. Delhi: John Wiley and Sons, Ltd.
3. Mangal, S. K. (2002). Statistics in Psychology and Education. New Delhi: PHI Learning Private Limited.

Check your Understanding-Answers

Multiple Choice Question
1. C) Average
2. A) Central Tendency
3. D) Observation
4. B) Interquartile Mean
5. D) Middle

State True and False
1. False
2. True
3. True
4. True
5. False

Unit - 2.2 Measures of Central Tendency


Objectives
At the end of this unit, you will be able to:

●● Understand the term mean/average
●● Understand the terms arithmetic mean and combined mean
●● Understand the terms weighted mean, median and mode
●● Understand the relationship between mean, median and mode

Introduction
Measures of central tendency help you find the centre, or average, of a data set. The mode, median, and mean are the three most common measures of central tendency.

◌◌ Mode: the most frequent value.
◌◌ Median: the middle number in an ordered data set.
◌◌ Mean: the sum of all values divided by the total number of values.

When using descriptive statistics, it is important to understand the variability and distribution of your data set in addition to its central tendency.

Distributions and central tendency: A data set is a collection of n scores or values.

Normal distribution: Data in a normal distribution are symmetrically distributed, with no skew. Most values cluster around a central location, and values become less frequent further from the centre. In a normal distribution, the mean, mode, and median are all the same.

Example: A sample from a local community is surveyed about the number of books each person read in the previous year. A histogram of the data shows the frequency of responses for every possible number of books, and its shape is that of a normal distribution.

[Figure: histogram of the number of books read in the previous year]

The mean, median and mode of this data set are all equal and the central tendency
is 8.

Skewed distributions: In skewed distributions, more values fall on one side of the centre than the other, and the mean, median, and mode all differ from one another. The tail on one side is longer and more spread out, with fewer points than at the other end. The direction of this tail gives the direction of the skew.

In a positively skewed distribution, there is a cluster of lower scores and a spread-out tail on the right side. In a negatively skewed distribution, there is a cluster of higher scores and a spread-out tail on the left side.

Example of Positively Skewed Histogram

The distribution in this histogram is skewed to the right, and the central tendency of the data set is towards the lower end of possible scores. For a positively skewed distribution:

Mode < Median < Mean

Example of Negatively Skewed Histogram

The distribution in this histogram is skewed to the left, and the central tendency of the data set is towards the higher end of possible scores. For a negatively skewed distribution:

Mean < Median < Mode
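
The ordering of the three measures under skew can be checked on small data sets. The sketch below uses Python's standard statistics module. Both score lists are made-up illustrations, not data from this unit, and the helper name summarise is an assumption introduced here.

```python
from statistics import mean, median, mode

# Hypothetical scores with a long tail to the left (negative skew).
negatively_skewed = [1, 5, 6, 7, 7, 8, 8, 8]
# Hypothetical scores with a long tail to the right (positive skew).
positively_skewed = [2, 2, 2, 3, 3, 4, 5, 9]


def summarise(scores):
    """Return (mean, median, mode) for a list of scores with a single mode."""
    return mean(scores), median(scores), mode(scores)


neg_mean, neg_median, neg_mode = summarise(negatively_skewed)
pos_mean, pos_median, pos_mode = summarise(positively_skewed)

# Negative skew: the tail pulls the mean below the median and the mode.
print(neg_mean < neg_median < neg_mode)
# Positive skew: the tail pulls the mean above the median and the mode.
print(pos_mode < pos_median < pos_mean)
```

These orderings are a rule of thumb for moderately skewed, single-peaked data rather than a guarantee for every data set.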
Mode - The mode is the value that appears most frequently in the data set. A data set may have no mode, one mode, or multiple modes. To find the mode, sort the data numerically or by category, then choose the value that occurs most often.

Example:
In a survey, nine participants are asked whether they consider themselves conservative, moderate or liberal. To find the mode, sort the data by category and see which response was chosen most frequently. A frequency table that counts the values in each category makes this easier.

Political Ideology    Frequency
Conservative          2
Moderate              3
Liberal               4

The mode is Liberal. In a bar graph it is easy to spot, because it is the category with the highest bar.


When to use the mode - The mode works best with data at the nominal level of measurement. Because nominal data are divided into mutually exclusive categories, the mode indicates the most popular category. For continuous variables or ratio levels of measurement, the mode may not be a useful indicator of central tendency. There are many more possible values than at a nominal or ordinal level of measurement, so any exact value is unlikely to occur more than once.

Example:
In a computer task, reaction times are collected, and the values in the data set are all different from one another.

Participant                     1    2    3    4    5    6    7    8    9
Reaction Time (milliseconds)    266  344  421  323  402  311  383  297  304

There is no mode in this data set because each value appears only once.
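
Counting frequencies by hand becomes tedious for larger data sets. The following is a minimal Python sketch of the same procedure using collections.Counter from the standard library. The function name find_modes is an assumption introduced here, and the survey list is rebuilt from the frequency table rather than taken from raw responses.

```python
from collections import Counter


def find_modes(values):
    """Return every value tied for the highest frequency, or [] if none repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs exactly once, so the data set has no mode.
        return []
    return [value for value, count in counts.items() if count == highest]


# Survey answers rebuilt from the political ideology frequency table.
ideology = ["Conservative"] * 2 + ["Moderate"] * 3 + ["Liberal"] * 4
# Reaction times from the computer task example.
reaction_times = [266, 344, 421, 323, 402, 311, 383, 297, 304]

print(find_modes(ideology))
print(find_modes(reaction_times))
```

Returning a list keeps the three cases from the text distinct: an empty list means no mode, one element means a single mode, and several elements mean multiple modes.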

Median - A data set’s median is the value that is exactly in the middle when the data are ordered from low to high.

Example:
Measure the reaction times of seven participants on a computer task and classify them into three categories: slow, medium and fast.

Participant    1     2       3     4     5     6     7
Speed          Slow  Medium  Fast  Slow  Fast  Slow  Fast

To find the median, first sort all of the values from lowest to highest. Then, in this case, find the value in the fourth position of the ordered data set, i.e., Medium.

Ordered Data Set    Slow  Slow  Slow  Medium  Fast  Fast  Fast

In larger data sets, simple formulas can be used to find the position of the middle value in the distribution. Different methods are used to calculate the median depending on whether the total number of values is even or odd.
Median of an odd-numbered data set - In an odd-numbered data set, find the value at the (n+1)/2 position, where n is the number of items in the data set.

Example:
Measure the reaction times of five participants in milliseconds, then order the data set.

Reaction Time (milliseconds)    288  296  346  365  379

The middle position can be calculated using (n+1)/2, where n = 5.

Therefore,

(5+1)/2 = 3

Thus, the median is the 3rd value in the ordered data set above, i.e., 346 milliseconds.

Median of an even-numbered data set - In an even-numbered data set, find the two values in the centre: the values at the n/2 and (n/2) + 1 positions. Then take their mean.

Example:
Measure the reaction times of six participants in milliseconds, then order the data set.

Reaction Time (milliseconds)    288  296  346  354  365  379

The middle positions can be calculated using n/2 and (n/2) + 1, where n = 6.

Therefore,

6/2 = 3

(6/2) + 1 = 4

Thus, the middle values are the 3rd and 4th values, i.e., 346 and 354, respectively.

To calculate the median, add the two middle values and divide by two.

(346 + 354)/2 = 350

Thus, the median is 350 milliseconds.
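
The odd and even cases above can be combined into one routine. This is a minimal Python sketch, not a definitive implementation. The function name median_by_position is an assumption introduced here, and positions are converted from the 1-based positions used in the text to Python's 0-based indexing.

```python
def median_by_position(values):
    """Median found with the (n + 1) / 2 and n / 2, n / 2 + 1 position rules."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd n: take the value at position (n + 1) / 2, counting from 1.
        return ordered[(n + 1) // 2 - 1]
    # Even n: take the mean of the values at positions n / 2 and n / 2 + 1.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


five_times = [288, 296, 346, 365, 379]
six_times = [288, 296, 346, 354, 365, 379]

print(median_by_position(five_times))
print(median_by_position(six_times))
```

Sorting inside the function means the caller does not have to order the data first.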


Mean - The arithmetic mean of a data set (as opposed to its geometric mean) is the sum of all values divided by the total number of values. It is the most commonly used measure of central tendency because every value is used in the calculation.

Example:

Participant                     1    2    3    4    5
Reaction Time (milliseconds)    291  343  375  296  390

Add all the above values:

∑ x = 291 + 343 + 375 + 296 + 390 = 1695

Calculate the mean using the formula ∑ x/n. Since the data set has 5 values, n = 5.

Mean (x̄) = 1695/5 = 339 milliseconds

Outlier effect on the mean - An outlier is a value that deviates significantly from the rest of the values in a data set. Because the mean is calculated using all values, including an extreme outlier can drastically increase or decrease it.

Example:
The value for participant 1 is replaced in this data set with an extreme outlier.

Participant                     1    2    3    4    5
Reaction Time (milliseconds)    758  343  375  299  390

∑ x = 758 + 343 + 375 + 299 + 390 = 2165

Mean (x̄) = ∑ x/n = 2165/5 = 433 milliseconds

Because of the outlier (758), the mean rises substantially, even though the other values in the data set are almost unchanged.
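
The effect of a single outlier can be compared across measures. The sketch below, assuming only Python's standard statistics module, recomputes the mean and median for the two reaction-time data sets; the variable names are illustrative.

```python
from statistics import mean, median

# Reaction times before and after one value is replaced by an outlier.
original = [291, 343, 375, 296, 390]
with_outlier = [758, 343, 375, 299, 390]

mean_shift = mean(with_outlier) - mean(original)
median_shift = median(with_outlier) - median(original)

# The mean moves from 339 to 433, while the median moves from 343 to 375.
print(mean(original), mean(with_outlier), mean_shift)
print(median(original), median(with_outlier), median_shift)
```

The mean shifts by 94 while the median shifts by 32, which is why the median is usually preferred when extreme values are present.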

Population versus sample mean - A data set is a collection of values from a sample or a population. A population is the complete group that you want to study, whereas a sample is a subset of that population. Data from a sample can be used to make educated guesses about a population, but only full population data can provide a complete picture.

In statistics, the notation for a sample mean differs from that for a population mean: the sample mean is written x̄ = ∑ x/n, and the population mean is written μ = ∑ X/N. The method of calculation is the same in both cases.

Uses of mean, median or mode - Because they have complementary strengths and limitations, the three main measures of central tendency are best used together. However, depending on the level of measurement of the variable, only one or two of them may be suitable for your data set.

◌◌ The mode can be used at any level of measurement, but it is most useful at the nominal and ordinal levels.
◌◌ The median can only be applied to data that can be ordered, i.e., ordinal, interval, and ratio data.
◌◌ The mean can only be used at the interval and ratio levels of measurement, because it needs equal spacing between adjacent values or scores on the scale.

You should also examine the distribution of your data set when deciding which measures of central tendency to use.

◌◌ For normally distributed data, all three measures of central tendency give the same answer, so you can use any of them.
◌◌ In skewed distributions, the median is the best measure because it is largely unaffected by extreme outliers or non-symmetric score distributions. The mean and mode can differ from it considerably in skewed distributions.

2.2.1 Mean/Average - Meaning and Characteristics


An average is a single figure that sums up the characteristics of a whole group of figures.

In the words of Clark, “average is an attempt to find one single figure to describe whole of figures.” An average is described as a measure of central tendency because it is more or less a central value around which the various values cluster.

In the words of Croxton and Cowden, “an average is a single value within the range of the data that is used to represent all of the values in the series.” Since an average lies somewhere within the range of the data, it is called a measure of central value.

Objectives Served by Averages
Averages serve the following purposes:

◌◌ To obtain a clear and concise picture of a large body of numerical data.
◌◌ To compare different groups by means of their averages.
◌◌ To obtain a clear picture of a whole group by studying sample data.
◌◌ To express the relationship between different groups in definite quantitative terms.
Characteristics of a Good Average
◌◌ It is rigidly defined and its value is always definite.
◌◌ It is easy to understand and calculate, hence it is very popular.
◌◌ It is based on all the observations, so it is a good representative.
◌◌ It can be easily used for comparisons.
◌◌ It is capable of further algebraic treatment, such as finding the sum of the observations from the mean and the number of observations, or finding the combined arithmetic mean when different groups are given.
◌◌ It is not affected much by sampling fluctuations.

Essentials of a Good Average
The essentials of a good average are as follows:

◌◌ It must be defined rigidly.
◌◌ It must be based on all the observations of the data.
◌◌ It must be readily comprehensible or understandable.
◌◌ It must be capable of being calculated with reasonable ease and rapidity.
◌◌ It must be affected as little as possible by fluctuations of sampling.
◌◌ It must be readily amenable to arithmetic or algebraic treatment.

2.2.2 Arithmetic Mean - Intro and Application


Arithmetic mean is defined as the value obtained by dividing the total of the values of all items in the series by their number. In other words, it is the sum of the given observations divided by the number of observations: add the values of all items together and divide this sum by the number of observations.

Symbolically, x̄ = (x1 + x2 + x3 + … + xn)/n

Properties of Arithmetic Mean
◌◌ The sum of the deviations of all the values of x from their arithmetic mean is zero.
◌◌ The product of the arithmetic mean and the number of items gives the total of all items.
◌◌ The combined arithmetic mean of several groups can be found from the means and sizes of the individual groups.
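
The first two properties can be checked numerically. The sketch below is only an illustration in plain Python; it uses the ten values from Illustration 1 of the direct method, and the variable names are assumptions introduced here.

```python
# The ten values used in Illustration 1 of the direct method.
values = [125, 128, 132, 135, 140, 148, 155, 157, 159, 191]

n = len(values)
arithmetic_mean = sum(values) / n

# Property 1: the deviations from the mean sum to zero.
deviation_sum = sum(x - arithmetic_mean for x in values)

# Property 2: the mean times the number of items gives back the total.
total_from_mean = arithmetic_mean * n

print(arithmetic_mean, deviation_sum, total_from_mean)
```

With other data the deviation sum may differ from zero by a tiny floating-point rounding error, so a tolerance is safer than an exact comparison.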

Demerits of Arithmetic Mean
◌◌ Arithmetic mean is affected by the extreme values.
◌◌ Arithmetic mean cannot be determined by inspection and cannot be located graphically.
◌◌ Arithmetic mean cannot be calculated if a single observation is lost or missing.
◌◌ Arithmetic mean cannot be calculated when open-end class intervals are present in the data.
Arithmetic Mean for Ungrouped Data

A) Individual Series

1. Direct Method

The following steps are involved in calculating the arithmetic mean of an individual series using the direct method:

◌◌ Add up the values of all the items in the series.
◌◌ Divide the sum of the values by the number of items. The result is the arithmetic mean.

The following formula is used: X̄ = ∑ x/N

where X̄ = Arithmetic mean, ∑ x = Sum of the values, N = Number of items.

Illustration 1 – Calculate the arithmetic mean of the following values:
Value (x) – 125 128 132 135 140 148 155 157 159 191

Solution –
Total number of terms = N = 10

∑ x = 125 + 128 + 132 + 135 + 140 + 148 + 155 + 157 + 159 + 191 = 1470

X̄ = ∑ x/N = 1470/10

= 147

2. Short-cut Method or Indirect Method

The following steps are involved in calculating the arithmetic mean of an individual series using the short-cut or indirect method:

1. Assume one of the values in the series as an average. It is called the working mean or assumed average.
2. Find the deviation of each value from the assumed average.
3. Add up the deviations.
4. Apply the following formula: X̄ = A + ∑d/N
where X̄ = Arithmetic mean, A = Assumed average, ∑d = Sum of the deviations, N = Number of items

Illustration - 1
Calculate the arithmetic average of the data given below using the short-cut method.

Roll No    1   2   3   4   5   6   7   8   9   10
Marks      43  48  65  57  31  60  37  48  78  59

Solution -
Take the assumed average A = 60.

Roll No    Marks Obtained (X)    d = X − 60
1          43                    −17
2          48                    −12
3          65                    5
4          57                    −3
5          31                    −29
6          60                    0
7          37                    −23
8          48                    −12
9          78                    18
10         59                    −1
                                 ∑d = −74

X̄ = A + ∑d/N

= 60 + (−74/10) = 52.6 marks
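
The short-cut method gives the same answer as the direct method whichever assumed average is chosen. The Python sketch below checks this for the marks in the illustration. The function name shortcut_mean is an assumption introduced here, and the second assumed average (50) is a made-up value used only to show that the choice does not matter.

```python
import math


def shortcut_mean(values, assumed_mean):
    """Arithmetic mean via the short-cut method: A + (sum of deviations) / N."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)


marks = [43, 48, 65, 57, 31, 60, 37, 48, 78, 59]

direct = sum(marks) / len(marks)      # direct method
via_60 = shortcut_mean(marks, 60)     # assumed average from the illustration
via_50 = shortcut_mean(marks, 50)     # a different, hypothetical assumed average

# All three agree up to floating-point rounding.
print(round(direct, 1), round(via_60, 1), round(via_50, 1))
print(math.isclose(direct, via_60), math.isclose(direct, via_50))
```

The method saves effort by hand because the deviations are small numbers; on a computer the direct method is just as easy.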


2.2.3 Combined Mean - Intro and Application

When the arithmetic means and the numbers of items of two or more related groups are known, the combined mean of the entire group can be calculated. The combined mean of two series is given by the formula –

X̄12 = (n1x1 + n2x2)/(n1 + n2)

where n1 = No. of items of the first group, n2 = No. of items of the second group,

x1 = A.M. of the first group, x2 = A.M. of the second group.

Example - From the following data, find the combined mean salary of a factory consisting of two branches, namely branch A and branch B. In branch A the number of workers is 500 and their average salary is 300. In branch B the number of workers is 1,000 and their average salary is 250.

Solution:
Let the no. of workers in branch A be n1 = 500

Let the no. of workers in branch B be n2 = 1000

Average salary x1 = 300

Average salary x2 = 250

X̄12 = (n1x1 + n2x2)/(n1 + n2)

= (500(300) + 1000(250))/(500 + 1000)

= (1,50,000 + 2,50,000)/1500

= 266.67

Calculating a Combined Mean: Examples

Assume you are conducting a survey on kindergarten math competency (as measured by an achievement exam) and you have results from two separate schools.

◌◌ At School 1, 57 kindergartners were tested, with a mean score of 82.
◌◌ At School 2, 23 kindergartners were tested, with a mean score of 63.

Inserting these numbers into the formula above gives the combined mean:

[(57*82)+(23*63)]/(57+23) = 76.5

Assume you are conducting a survey on reading speed, as measured by the amount of time it takes first graders to read a certain block of text. Your results for five schools have arrived:

School    Number of students    Mean reading time
1         189                   83
2         46                    121
3         89                    82
4         50                    147
5         12                    60

To calculate the combined mean:

1. Multiply column 2 by column 3 for each row.
2. Add up the results from Step 1.
3. Divide the sum from Step 2 by the sum of column 2.

((189*83) + (46*121) + (89*82) + (50*147) + (12*60)) / (189 + 46 + 89 + 50 + 12) = 94.87

The result, 94.87, is the combined mean for all five schools, i.e., the average reading time for all the children.

This technique can be used to combine any number of means.
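
The three steps generalise to any number of groups. The following is a minimal Python sketch under that reading; the function name combined_mean and its argument names are assumptions introduced here.

```python
def combined_mean(sizes, means):
    """Combine group means, weighting each by the number of items in its group."""
    if len(sizes) != len(means):
        raise ValueError("sizes and means must have the same length")
    if sum(sizes) <= 0:
        raise ValueError("the groups must contain at least one item in total")
    # Steps 1 and 2: multiply each group size by its mean and add the products.
    weighted_total = sum(n * m for n, m in zip(sizes, means))
    # Step 3: divide by the total number of items.
    return weighted_total / sum(sizes)


factory = combined_mean([500, 1000], [300, 250])
kindergarten = combined_mean([57, 23], [82, 63])
reading = combined_mean([189, 46, 89, 50, 12], [83, 121, 82, 147, 60])

print(round(factory, 2), round(kindergarten, 1), round(reading, 2))
```

Averaging the group means directly, without the sizes, would give a different and generally wrong answer whenever the groups differ in size.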

2.2.4 Weighted Mean - Intro and Application
Sometimes some observations are relatively more important than others. Each such observation must be given a weight based on its relative importance. In the weighted arithmetic mean, the value of each item is multiplied by its weight, and the sum of these products is divided by the sum of the weights.

Symbolically, X̄w = ∑wx / ∑w

Example – Calculate the simple and weighted averages from the following data –

Month            Jan    Feb    March  April  May    June
Price            42.5   51.25  50     52     44.25  54
No. of tonnes    25     30     40     50     10     45

Solution:

Month    Price per tonne (in 000) (x)    No. of tonnes purchased (w)    wx
Jan      42.5                            25                             1062.5
Feb      51.25                           30                             1537.5
March    50                              40                             2000
April    52                              50                             2600
May      44.25                           10                             442.5
June     54                              45                             2430
N = 6    ∑x = 294                        ∑w = 200                       ∑wx = 10072.5

Simple AM

X̄ = ∑x/n = 294/6 = 49

Weighted AM

X̄w = ∑wx/∑w = 10072.5/200 = 50.36

The correct average price paid is ₹50.36 and not ₹49, i.e., the weighted arithmetic mean is more appropriate here than the simple arithmetic mean.
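
The price example can be recomputed directly from the monthly figures. This is a short Python sketch, with variable names chosen here for illustration; it forms each product wx, sums the products and divides by the total weight.

```python
# Monthly price per tonne (x) and tonnes purchased (w), January to June.
prices = [42.5, 51.25, 50, 52, 44.25, 54]
tonnes = [25, 30, 40, 50, 10, 45]

# Simple mean: every month counts equally.
simple_mean = sum(prices) / len(prices)

# Weighted mean: each price counts in proportion to the tonnes bought at it.
products = [w * x for w, x in zip(tonnes, prices)]
weighted_mean = sum(products) / sum(tonnes)

print(sum(products), sum(tonnes))
print(simple_mean, round(weighted_mean, 2))
```

The products sum to 10072.5 over 200 tonnes, so the weighted mean is about 50.36, compared with a simple mean of 49.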
2.2.5 Median - Meaning and Characteristics

Median is defined as the value of the item that divides the series into two equal halves, where one half contains all values less than (or equal to) it and the other half contains all values greater than (or equal to) it. It is also defined as the “central value of the variable”. To find the median, the values of the items must be arranged in order of their size or magnitude.

Median is a positional average. The term position refers to the place of a value in the series. The median is placed so that an equal number of items lies on either side of it; therefore, it is also called a locative average.

Merits of Median
Following are the advantages of median:

◌◌ It is rigidly defined.
◌◌ It is easy to calculate and understand.
◌◌ It can be located graphically.
◌◌ Unlike the arithmetic mean, it is not affected by extreme values.
◌◌ It can be found by mere inspection.
◌◌ It can be used for qualitative studies.
◌◌ Even if the extreme values are unknown, the median can be calculated if one knows the number of items.

Demerits of Median
Following are the disadvantages of median:

◌◌ In the case of individual observations, the values have to be arranged in order of their size to locate the median. Arranging the data is a tedious task if the number of items is large.
◌◌ If the median is multiplied by the number of items, the total value of all the items cannot be obtained as in the case of the arithmetic average.
◌◌ It is not suitable for complex algebraic or mathematical treatment.
◌◌ It is more affected by sampling fluctuations than the arithmetic mean.

2.2.6 Median – Applications


Example – Determine the median from the following –

25, 15, 23, 40, 27, 25, 23, 25, 20

Solution - Arranging the figures in ascending order –

15, 20, 23, 23, 25, 25, 25, 27, 40

Median = value of the (N+1)/2 th term, where N = 9

= (9+1)/2 = 5th term

= 25

Median in Continuous Series
The following steps are involved in calculating the median in a continuous series:

◌◌ Find the cumulative frequencies.
◌◌ Find the median item, i.e., the (N/2)th item.
◌◌ Find the group or class containing the median item.
◌◌ Estimate the median by applying the following formula:

Me = l + ((N/2 − cf)/fm) × i

where Me = Median

l = Lower limit of the median class

N = Total frequency

cf = Cumulative frequency of the class preceding the median class

i = Size of class interval

fm = Frequency of the median class

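Example 1:
Calculate the median mark from the following cumulative frequency distribution.

Mark    No. of students
0-10    5
0-20    13
0-30    20
0-40    32
0-50    60
0-60    80
0-70    90

Solution:
Convert the cumulative classes into class intervals and find the frequency (F) of each class by subtracting successive cumulative frequencies (CF).

Mark     F     CF
0-10     5     5
10-20    8     13
20-30    7     20
30-40    12    32
40-50    28    60
50-60    20    80
60-70    10    90

N/2 = 90/2 = 45, so the median item lies in the class 40-50. With l = 40, cf = 32, fm = 28 and i = 10:

Median = 40 + ((45 − 32)/28) × 10 = 40 + 4.64 = 44.64 marks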
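
The steps for a continuous series can be sketched in Python as follows. The function name grouped_median and its parameters are assumptions introduced here, and the sketch assumes equal class widths with classes listed in ascending order. It is applied to the marks distribution (class frequencies 5, 8, 7, 12, 28, 20, 10) and to an income distribution with classes 0-20 to 80-100.

```python
def grouped_median(lower_limits, frequencies, width):
    """Estimate the median of grouped data as l + ((N/2 - cf) / f) * i."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("the total frequency must be positive")
    half = total / 2
    cumulative = 0  # cumulative frequency of the classes before the current one
    for lower, freq in zip(lower_limits, frequencies):
        if cumulative + freq >= half:
            # This is the median class: interpolate inside it.
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("lower_limits and frequencies do not match")


marks_median = grouped_median([0, 10, 20, 30, 40, 50, 60],
                              [5, 8, 7, 12, 28, 20, 10], 10)
income_median = grouped_median([0, 20, 40, 60, 80],
                               [82, 112, 150, 95, 48], 20)

print(round(marks_median, 2), round(income_median, 2))
```

The interpolation assumes that observations are spread evenly within the median class, which is the same assumption the formula makes.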

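Example 2:
Find the median from the following series. Also draw the less than ogive and the more than ogive, and locate the median on a graph.

Income (₹)    No. of Persons
0-20          82
20-40         112
40-60         150
60-80         95
80-100        48

Solution:

C.I.      F      Class (Less than)    L.C.F.    Class (More than)    M.C.F.
0-20      82     20                   82        0                    487
20-40     112    40                   194       20                   405
40-60     150    60                   344       40                   293
60-80     95     80                   439       60                   143
80-100    48     100                  487       80                   48

N/2 = 487/2 = 243.5, so the median item lies in the class 40-60. With l = 40, cf = 194, fm = 150 and i = 20:

Median = 40 + ((243.5 − 194)/150) × 20 = 40 + 6.6 = 46.6

On the graph, the median is the income at which the less than ogive and the more than ogive intersect.

[Figure: less than ogive and more than ogive, with the median located at their intersection]
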
2.2.7 Mode - Meaning and Characteristics
The word “mode” is derived from the French “la mode”, which means fashion. So, it can be regarded as the most fashionable item in the series or the group. Croxton and Cowden regard the mode as “the most typical of a series of values”. As a result, it can sum up the characteristics of a group more satisfactorily than the arithmetic mean or median. Mode is defined as the value of the variable occurring most frequently in a distribution. In other words, it is the most frequent size of item in a series.

Merits of Mode
The following are the merits of mode:

◌◌ The most important advantage of the mode is that it is usually an actual value in the series.
◌◌ In the case of discrete series, the mode can be easily located by inspection.
◌◌ Mode is not affected by extreme values.
◌◌ Mode can be determined even if extreme values are not given.
◌◌ It is easy to understand, and this average is used by people in their everyday speech.

Demerits of Mode
The following are the demerits of mode:

◌◌ It is not based on all the observations of the data.
◌◌ In a number of cases there will be more than one mode in the series.
◌◌ If the mode is multiplied by the number of items, the product will not be equal to the total value of the items.
◌◌ It will not truly represent the group if there are a small number of items of the same size in a large group of items of different sizes.
◌◌ It is not suitable for further mathematical treatment.

2.2.8 Mode – Applications

Mode in Ungrouped Data

a) Individual Series
The mode of an individual series can be found by mere inspection. The number that occurs most often is the mode.

Illustration - 1
Locate the mode in the data 7, 12, 8, 5, 9, 6, 10, 9, 4, 9, 9

Solution:
On inspection, the number 9 has the maximum frequency: it occurs 4 times, more often than any other number. Therefore, mode (Z) = 9.

b) Discrete Series
The mode is calculated using a grouping table and an analysis table.

i) Grouping table: It consists of six columns. The 1st column holds the frequencies, the 2nd and 3rd columns hold the frequencies grouped in twos, and the 4th, 5th and 6th columns hold the frequencies grouped in threes.
ii) Analysis table: It consists of two columns, namely tally bar and frequency.

Steps in Calculating Mode in Discrete Series
The following steps are involved in calculating mode in discrete series:

◌◌ Group the frequencies in twos.
◌◌ Leave the first frequency and group the other frequencies in twos.
◌◌ Group the frequencies in threes.
◌◌ Leave the frequency of the first size and add the frequencies of the other sizes in threes.
◌◌ Leave the frequencies of the first two sizes and add the frequencies of the other sizes in threes.
◌◌ Prepare an analysis table to find the size that occurs the maximum number of times. That size is the mode.

c) Continuous Series
The following steps are involved in calculating mode in continuous series.

1. Find the modal class. The modal class can usually be found by inspection: the group containing the maximum frequency is the modal group. Where two or more classes appear to be the modal class, the choice can be made by the grouping process and an analysis table, as described above for discrete series.
2. Calculate the value of the mode by applying the following formula:

Mo = l + ((fm − f1)/(2fm − f1 − f2)) × i

where l = Lower limit of the modal class, fm = Frequency of the modal class, f1 = Frequency of the class preceding the modal class, f2 = Frequency of the class succeeding the modal class, i = Size of class interval.

Example:
Calculate the modal wage.

Daily wages in ₹ (x):    20-25  25-30  30-35  35-40  40-45  45-50
No. of workers (f):      1      2      8      12     7      5

Solution:
Here, the maximum frequency is 12, corresponding to the class interval 35-40, which is the modal class. Therefore, l = 35, fm = 12, f1 = 8, f2 = 7 and i = 5.

X        F
20-25    1
25-30    2
30-35    8     f1
35-40    12    fm
40-45    7     f2
45-50    5

Mo = 35 + ((12 − 8)/(2 × 12 − 8 − 7)) × 5 = 35 + (4/9) × 5 = 37.22

The modal wage is ₹37.22.

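Example 2:
Calculate the mode from the following cumulative frequency distribution.

Less than:     10   20   30   40   50   60   70   80
Frequency:     4    16   40   76   96   112  120  125

Solution:
First find the lower limit of each class: lower limit = upper limit − class length, where class length = 20 − 10 = 10 (e.g., 10 − 10 = 0, and so on). Then find each class frequency by subtracting successive cumulative frequencies.

Value (x)       Class    Working         f
Less than 10    0-10     4               4
Less than 20    10-20    16 − 4 = 12     12
Less than 30    20-30    40 − 16 = 24    24
Less than 40    30-40    76 − 40 = 36    36
Less than 50    40-50    96 − 76 = 20    20
Less than 60    50-60    112 − 96 = 16   16
Less than 70    60-70    120 − 112 = 8   8
Less than 80    70-80    125 − 120 = 5   5

The maximum frequency is 36, so the modal class is 30-40, with l = 30, fm = 36, f1 = 24, f2 = 20 and i = 10.

Mo = 30 + ((36 − 24)/(2 × 36 − 24 − 20)) × 10 = 30 + (12/28) × 10 = 34.29
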
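
The modal-class calculation can be sketched in Python as below. The function name grouped_mode is an assumption introduced here. The sketch assumes equal class widths, picks the first class with the highest frequency, and treats a missing neighbouring class as having frequency 0; it does not implement the grouping and analysis tables used to break ties.

```python
def grouped_mode(lower_limits, frequencies, width):
    """Estimate the mode of grouped data as l + ((fm - f1) / (2fm - f1 - f2)) * i."""
    if not frequencies or len(lower_limits) != len(frequencies):
        raise ValueError("lower_limits and frequencies must match and be non-empty")
    # Index of the modal class (first class with the highest frequency).
    k = max(range(len(frequencies)), key=lambda j: frequencies[j])
    fm = frequencies[k]
    f1 = frequencies[k - 1] if k > 0 else 0                     # preceding class
    f2 = frequencies[k + 1] if k < len(frequencies) - 1 else 0  # succeeding class
    denominator = 2 * fm - f1 - f2
    if denominator == 0:
        raise ValueError("the formula is undefined when 2fm - f1 - f2 is zero")
    return lower_limits[k] + (fm - f1) / denominator * width


# Daily wages example: classes 20-25 ... 45-50.
wage_mode = grouped_mode([20, 25, 30, 35, 40, 45], [1, 2, 8, 12, 7, 5], 5)
# "Less than" example after converting cumulative to class frequencies.
cumulative = [4, 16, 40, 76, 96, 112, 120, 125]
class_freqs = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
score_mode = grouped_mode([0, 10, 20, 30, 40, 50, 60, 70], class_freqs, 10)

print(round(wage_mode, 2), class_freqs, round(score_mode, 2))
```

Converting the cumulative frequencies in code also guards against subtraction slips when the table is built by hand.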
2.2.9 Relationship between Mean, Median and Mode
When the mode is ill defined and its value is difficult to find, an empirical relationship among the mean, median and mode can be used. In a moderately skewed distribution, the median lies between the mode and the mean: the mode lies about two-thirds of the mode-to-mean distance from the median, and the mean lies about one-third of that distance from the median on the other side. In a positively skewed distribution the mode is to the left of the median and the mean to the right; in a negatively skewed distribution the order is reversed. Karl Pearson expressed this relationship as Z = 3M − 2X̄, where Z is the mode, M the median and X̄ the arithmetic mean.

Example - M is 28 and AM is 29; find the mode.

Solution:

Z = 3M − 2X̄

= 3(28) − 2(29)

= 84 − 58

= 26

29 > 28 > 26, i.e., Mean > Median > Mode

Example - M = ?, AM = 39, Z = 36.5

Solution:

Z = 3M − 2X̄

36.5 = 3M − 2(39)

36.5 = 3M − 78

3M = 36.5 + 78 = 114.5

M = 114.5/3

= 38.17

Example Question Using the Mean, Median and Mode Relationship

Question: The median and mean of a moderately skewed distribution are 20 and 22.5, respectively. Estimate the mode using these values.

Solution:
Given,

Mean = 22.5

Median = 20

Mode = x

Now, using the relationship between mean, median and mode, we get,

(Mean – Mode) = 3 (Mean – Median)

So,

22.5 – x = 3 (22.5 – 20)

22.5 – x = 7.5

∴ x = 15

So, Mode = 15.
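
The empirical relationship can be rearranged to estimate whichever measure is missing. The Python sketch below is only an illustration; the two function names are assumptions introduced here, and the relationship itself holds only approximately, for moderately skewed distributions.

```python
def estimate_mode(mean, median):
    """Pearson's empirical relation: Mode = 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean


def estimate_median(mean, mode):
    """The same relation solved for the median: Median = (Mode + 2 * Mean) / 3."""
    return (mode + 2 * mean) / 3


mode_a = estimate_mode(mean=29, median=28)
mode_b = estimate_mode(mean=22.5, median=20)
median_c = estimate_median(mean=39, mode=36.5)

print(mode_a, mode_b, round(median_c, 2))
```

Both forms are the same equation, so (Mean − Mode) = 3 (Mean − Median) can be used as a quick consistency check on the results.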

Check Your Understanding
1. ______ is a single figure that sums up the characteristics of a whole group of figures.
a) Average
b) Median
c) Standard deviation
d) Histogram
2. ______ is defined as the value obtained by dividing the total values of all items in the series by their number.
a) Mean
b) Median
c) Standard deviation
d) Histogram
3. The mean of an entire group, calculated from the arithmetic means and numbers of items of two or more related groups, is known as the ______.
a) Standard arithmetic mean
b) Combined arithmetic mean
c) Actual arithmetic mean
d) Weighted arithmetic mean
4. The ______ table consists of two columns, namely tally bar and frequency.
a) Discrete table
b) Grouping table
c) Analysis table
d) Frequency table
5. The value of the variable occurring most frequently in a distribution is the -
a) Mean
b) Median
c) Mode
d) Average

State True and False
1. The majority of values cluster around a Right location, with values decreasing as one moves out from the centre.
2. The mode is the value that appears the most frequently in the data collection. It is possible to have no mode, one mode, or multiple modes.
3. A data set’s Mode is the value that is exactly in the middle when ordered from low to high.
4. While data from a sample cannot be used to make educated guesses about a population, only full population data can provide a complete picture.
5. Arithmetic mean is defined as the value obtained by dividing the total values of all items in the series by their number.
6. If the median is multiplied by the number of items, the total value of all the items cannot be obtained as in the case of the arithmetic average.

Summary
●● Measures of central tendency: A single value that can be considered representative of a set of observations, and around which the observations can be considered to be centred, is called an ‘average’ (or average value) or a centre of location.
●● Average: It is described as a measure of central tendency because it is more or less a central value around which the various values cluster. In the words of Croxton and Cowden, “an average is a single value within the range of the data that is used to represent all of the values in the series.”
●● Median: It is defined as the value of the item that divides the series into two equal halves, where one half contains all values less than (or equal to) it and the other half contains all values greater than (or equal to) it. It is also defined as the “central value of the variable”.
●● Mode: It is derived from the French “la mode”, meaning fashion. So, it can be regarded as the most fashionable item in the series or the group.
●● Range: The ‘range’ of the data is the difference between the largest value and the smallest value in the data.

Activity
1. Make a chart and differentiate between mean, median and mode.
2. Discuss how averages are used in our daily life.

Questions & Exercises
1. Explain the measures of central tendency.
2. What is an average? What are the characteristics and objectives of a good average?
3. Describe the empirical relationship between mean, median and mode.
4. Explain the steps in calculating the mode in a discrete series.
5. Determine the median of the following: 25, 15, 23, 41, 28, 26, 24, 25, 20.

Glossary
●● Mode - The mode is the value that appears the most frequently in the data
collection. It is possible to have no mode, one mode, or multiple modes. To find the
mode, sort the data numerically or by category, then choose the value that occurs
the most frequently.
●● Median - A data set’s median is the value that is exactly in the middle when
ordered from low to high. In bigger data sets, simple formulas can be used to
determine the position of the middle value in the distribution. Different approaches
are used to find the median depending on whether the total number of items is
even or odd.
●● Mean - A dataset’s arithmetic mean (as opposed to its geometric mean) is the sum
of all values divided by the total number of values. Because all values are used in
the calculation, it is the most often used measure of central tendency.
●● Midrange - It is the arithmetic mean of a data set’s maximum and minimum values.
●● Harmonic mean - It is the reciprocal of the arithmetic mean of the reciprocals of the
data values. This measure is only valid for data that is measured on a strictly
positive scale.
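
The glossary measures can be checked on a small example. The sketch below uses Python's standard `statistics` module; the data values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample, chosen only for illustration (not taken from the text).
data = [2, 4, 4, 5, 8, 10]

mean = statistics.mean(data)               # sum of all values / number of values
median = statistics.median(data)           # middle value; averages the two middle values when n is even
modes = statistics.multimode(data)         # list of every most-frequent value (may hold several modes)
midrange = (max(data) + min(data)) / 2     # arithmetic mean of the maximum and minimum
harmonic = statistics.harmonic_mean(data)  # reciprocal of the mean of reciprocals; needs positive data

print(mean, median, modes, midrange, harmonic)
```

`multimode` returns a list, which matches the glossary's remark that a data set can have one mode or several.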

Further Readings
1. Richard I. Levin, David S. Rubin, Sanjay Rastogi, Masood Husain Siddiqui,
Statistics for Management, Pearson Education, 7th Edition, 2016.
2. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
3. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An
Introduction to Statistical Learning with Applications in R, Springer, 2016.

Answers to Check Your Understanding
1. a) Average
2. a) Mean
3. b) Combined arithmetic mean
4. d) Frequency table
5. c) Mode
State True and False
1. False 2. True 3. False
4. False 5. True 6. True
Unit - 2.3 : Industry Examples on Central Tendency Measures

Objectives
At the end of this unit, you will be able to:
●● Understand central tendency measures through relevant industry examples
●● Understand central tendency measures through a case study

Introduction
According to Sir Arthur Lyon Bowley, “Measures of central tendency (averages) are
statistical constants which enable us to figure out in a single effort the significance of
the whole.” The main aims of central tendency measures are:
a) To reduce data to a single value.
b) To make data comparisons simple.
si
2.3.1 Relevant Industry Example
Three metrics are widely used to describe the centre of a dataset: mean, median,
and mode.

Here’s a quick rundown of each metric:

◌◌ Mean: A dataset’s average value.
◌◌ Median: The value in a dataset that is in the middle.
◌◌ Mode: The value(s) that appear the most frequently in a dataset.

Individuals and businesses use these metrics all the time to obtain a better
understanding of datasets in a variety of disciplines. The examples below show how the
mean, median, and mode are applied in various real-world settings.

Explanation 1: Mean, Median, and Mode in Healthcare
In the healthcare industry, insurance analysts and actuaries frequently employ the
mean, median, and mode.

As an example:

◌◌ Mean: Insurance analysts frequently compute the mean age of the individuals
for whom they provide insurance in order to determine the average age of
their customers.
◌◌ Median: Actuaries frequently calculate the median amount spent on
healthcare by individuals each year in order to determine how much insurance
they need to be able to supply to individuals.
◌◌ Mode: Actuaries also calculate the mode of their clients’ ages (the most frequently
occurring age) so that they may determine which age group is the most likely
to use their insurance.

Explanation 2: Mean, Median, and Mode in Real Estate
Real estate salespeople frequently employ the mean, median, and mode.

As an example:

◌◌ Mean: Real estate agents determine the average price of properties in a given
area so that they may advise their clients on how much they can expect to
spend on a home.
◌◌ Median: Real estate agents compute the median price of homes to get
a better picture of the “average” home price because the median is less
influenced by outliers (such as multi-million dollar properties) than the mean.
◌◌ Mode: Real estate agents compute the mode of the number of bedrooms per
house so that they may educate their clients about the number of bedrooms
they can expect in properties in a specific location.
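
The claim that the median is less influenced by outliers than the mean can be seen in a short sketch. The prices below are hypothetical (in thousands of dollars), and the last one is added deliberately as a multi-million dollar outlier.

```python
import statistics

# Hypothetical home prices in thousands of dollars (illustrative only).
prices = [210, 225, 240, 250, 265]
prices_with_outlier = prices + [4000]  # one multi-million dollar property

mean_before = statistics.mean(prices)
median_before = statistics.median(prices)
mean_after = statistics.mean(prices_with_outlier)
median_after = statistics.median(prices_with_outlier)

# The mean jumps from 238 to 865, while the median only moves from 240 to 245.
print(mean_before, median_before, mean_after, median_after)
```

A single extreme property more than triples the mean but barely shifts the median, which is why the median gives a better picture of a typical home price.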

Explanation 3: Mean, Median, and Mode in Human Resources
Individuals who work in Human Resources departments at businesses frequently
employ the mean, median, and mode.

As an example:

◌◌ Mean: Human resource managers frequently compute the mean income of
persons in a given field in order to determine what type of “average” salary to
offer new employees.
◌◌ Median: Human resource managers sometimes compute the median salary in
specific fields so that they are aware of what the normal “middle” income is in
that field.
◌◌ Mode: Human resource managers compute the mode of various roles in the
company so that they are aware of the most prevalent position of employees
at their organisation.

Explanation 4: Mean, Median, & Mode in Marketing
Marketers frequently use the mean, median, and mode to acquire a better
understanding of how their advertisements perform.

As an example:

◌◌ Mean: Marketers frequently compute the mean revenue earned per
advertisement to see how much money their organisation makes on each ad.
◌◌ Median: Marketers compute the median revenue produced per advertisement
to determine how well the median ad performs.
◌◌ Mode: Marketers calculate the mode of the type of ad utilised (e.g.,
newspaper, TV, radio, digital) so that they know which type of ad their
company employs the most frequently.

Case Study
Masters Students’ Computation Errors on Measures of Central Tendency:
Implications for Andragogy

Green (1992) prompted this study, which investigated computation errors in
statistics, stating that statistical notions are a fascinating field to investigate. Indeed,
statistics terminology that statisticians regard as simple and obvious (mean, mode,
median, variability, distribution) represents the distilled wisdom of numerous generations
of the mathematically gifted. It is too much to expect today that teachers will
not struggle to pass on this inheritance and that students will not struggle to receive and
internalise it for application without errors and mistakes.

This study recognises, on this basis, that mathematics learners,
whether primary or secondary school children or adults at the university level, commit
mathematics computational errors despite their teachers’ wishes and expectations. This
is corroborated by Legutko (2007), who finds that students’ mathematical errors are
unavoidable, arising from the mathematics itself, textbooks, or as a result of education.

Ressell (2001) treats mistakes as an ingredient of learning, observing that the most
effective learning experiences emerge from making mistakes. Booker (1989:101) pointed
to the teacher variable as a source of errors, stating that “the roots of many errors are
anchored not so much in students but in the manner children are introduced to
mathematics.”

From the standpoint that the instructor is the most essential variable in a teaching-
learning encounter, Booker’s (ibid) observation is warranted. All other variables in the
system are activated by the teacher. Teachers establish classroom learning standards
and track deviations from them through student errors.

The discovery of mathematical errors is thus a key aspect of informative
evaluation. It can be done during lecture discussions, as well as when the lecturer
grades students’ assignments and tests in class. Teachers can use students’ mistakes
to reflect on their teaching techniques, the textbooks used, and student needs. Indeed,
known and anticipated errors assist teachers in preparing questions and answers
at the class planning stage. In this situation, error analysis serves as a
foundation for student-specific instruction to fix the errors.

This discussion leads to the conclusion that mathematics error analysis is a must-
do activity for any serious mathematics teacher. It is even more critical for teachers of
adult learners, the vast majority of whom would have been out of formal education for
an extended period of time.

Contextual Analysis
Chinhoyi University of Technology (CUT) offers many first-degree programmes as
well as a postgraduate Master of Science in Strategic Management degree, which
was established in 2005. It has two basic admission requirements: a first degree
in any recognised discipline from an accredited university and at least two years of
administrative experience.

Under these criteria, all students enrolled in this programme are managers.
They are also the working adults in charge of various areas of Zimbabwe’s economy.
Due to the nature of the student population, the Master of Science in Strategic
Management degree was provided on a block-release basis to minimise disruptions to
their work-place schedules.

These entry qualifications are devoid of any mathematical substance, despite
the fact that strategic management functions necessitate correct data analysis
and interpretation. It is reasonable to conclude that such students bring a variety of
mathematical skills and experiences to the classroom.

Quantitative management is one of the required courses taken during the first
semester of the Master of Science in Strategic Management degree. Its primary
goal is to help students build decision-making abilities based on the understanding
of measurable business variables. Algorithms for mathematical computation are not
stressed. The argument is that these students are managers whose job it is to make
judgments based on the outcomes of data analysis rather than on analysis techniques.

A typical Master of Science in Strategic Management class contains 160 adult
students. Large classes are allowed in order for the programme to be self-sustaining.
In the lecture hall, students are taught through lectures
and PowerPoint presentations. Tutorials are required to improve concept development
and application. Assignments requiring mathematical computations are completed
in groups and are overseen during tutorials. This is done so that the manager can be
mathematically literate and can evaluate and supervise subordinates who analyse data.
ve
Research Problem
The researcher, a Quantitative Management tutor, is disturbed by the observation
that Master of Science in Strategic Management students are failing to receive full
marks on word problems requiring the computation and interpretation of measures
of central tendency (mean, mode, and median) for grouped data. Students provide
incomplete solutions, and their responses are sometimes incorrect when compared to
the accepted answer in the marking guide. The investigation of computation
errors is a crucial first step in the development of focused remedial teaching.

Study Objectives
This study was prompted by the need to:

1. Identify faults or mistakes that students make when computing measures of central
tendency for grouped data.
2. Determine the likely causes of such errors.
3. Make specific instructional suggestions to rectify the faults.

Study Rationale
Researchers and lecturers who teach statistical concepts to adult learners will find this
study useful. It is significant to note that:

1. Statistics, notably computational errors on measures of central tendency by adult
students, has received less scholarly attention than other mathematical areas.
2. The majority of statistics research has been conducted in experimental settings, with
little regard for normal classroom practice.
3. The majority of studies on mathematics in other nations concentrated on elementary
school children and college students (pedagogy), with little effort made to explain
errors by adult learners (andragogy) in Zimbabwe’s universities.
4. The majority of the investigations were conducted by psychologists rather than
statisticians.
5. Errors made by students can be analysed and focused educational approaches
developed, because a wrong answer indicates that the learner has a different
understanding, which should be used to refocus the student. This aids in understanding
adult statistics learners as a distinct category.

Adult Learners
The researcher’s experience of tutoring adult learners in Zimbabwe in statistical
concepts confirmed Salih’s (2003) remark that adult learners are not giant children. As a
result, pedagogical skills (child teaching and learning) cannot be an adequate substitute
for andragogy (helping adults to learn). Adult students must be understood through an
examination of their traits, learning styles, and mathematical errors as shown in their
oral and written solutions to problems.

Instruction for adult learners can take guidance from Knowles’s (1984) andragogical
model, which is based on the four primary assumptions and possible instructional
consequences listed below:

1. Adult learners bring in mathematical theories and concepts, as well as mathematics
learning experiences from earlier encounters with the subject. This assumption
necessitates the use of pre-tests to determine the levels of knowledge and expertise
brought into the room. These can be a source of assumed knowledge activities that
serve as a foundation for new information.
2. Before investing effort and resources in the learning process, adults must understand
the objective of their learning. This means that the lecturer should explain the course
objectives and outline the course at the first meeting with students.
3. Because adults are accustomed to making decisions in their daily lives, they require
self-direction over the nature, content, and approach to their learning. Lecturers can
offer a course with optional topic content and learning modes.
4. Adults learn more efficiently when confronted with tasks and issues that they believe
to be real, relevant to, and coming from the demands of their daily lives. This requires
lecturers to shift away from textbook problem examples with foreign material and
toward problem examples created from Zimbabwe’s setting.


Rogers (2002:15) categorises adult learners into four types: activists, observers,
theorists, and experimentalists. As illustrated in the table below, each group has distinct
consequences for instruction:

Characteristics of Adult Learners

Group              Learner Characteristic and Implication
Activists          Learn by involving themselves in various activities. Involve them
                   in actual data collection before teaching and calculation of mean,
                   mode and median.
Observers          Prefer to wait and watch what is going on before they decide to
                   act. Involve them in end of group discussion presentation reports
                   and exercise evaluation.
Theorists          Like to generalize from their experiences and apply what they learn
                   in one arena to another. The lecturer can make them lead groups in
                   real life problem solving tasks.
Experimentalists   Enjoy devising new approaches and trying them out to see what
                   happens. Provide problems which require different techniques.
                   Ask them to evaluate group discussion answers.


This classification means that one group of masters students can be made up
of adult learners from various positions on the spectrum. There are those who want to
be taught everything and those who want to discover everything on their own. As cited
in Bean (2003), Bell and Gilbert (1996) divided adult learners’ learning strategies into
three levels: surface, deep, and strategic. Their learning strategies are shown in the
table below:

Adult learners’ learning strategies

Nature of student: Surface level (student has no background content of statistics)
Main concern: To complete the course with a pass (50%)
Learning strategies:
●● Serious about writing notes and photocopying handouts for procedural knowledge
●● Memorization of fragmentary facts
●● Pays attention to all course content
●● Aims to get the correct or right answer
●● Assimilates unaltered chunks of course material verbatim for regurgitation in the exam

Nature of student: Strategic level (student has knowledge of statistics content and its
applications)
Main concern: To establish appropriate content for specific tests, evaluation and synthesis
Learning strategies:
●● Applies surface level approaches in a flexible combination contingent with the
perceived nature of the task
●● Solves one problem with more than one approach and evaluates the solutions

The student’s statistical background can be taken as the major variable defining
the student’s learning style. Students who dropped mathematics at “O” level and went
on to complete a first degree that required no mathematics are likely to operate at the
surface level. Those who have completed mathematics at “O” level, “A” level, and first
degree will be able to work at the strategic level. Lecturers are encouraged to use
pre-tests to assess students’ knowledge levels.

There are two reasons for examining adult students’ mathematical competence
prior to instruction. First, university students would have gained extensive knowledge
of measures of central tendency (mean, mode, and median) through prior learning and
daily experiences. Second, learning occurs as a result of the integration of what the
learner is taught with his/her current notions or concepts about the issue (Posner, Strike
and Hewson, 1982).

Mathematics as a Discipline
According to Atherton (2003), the subject is not neutral in the interplay between
the teacher, the learner, and the subject being taught. Mathematics, in particular,
enforces its own language and logic, both of which lead to student errors. Mathematics,
according to Johnson and Rising (1972:3), is a way of thinking that includes arithmetic
(the science of numbers and computation), algebra (the language of symbols and
relations), geometry (the study of shapes, size, and space), statistics (the science of
interpreting data and graphs), and calculus (the study of change, infinity and limits).

From an applied standpoint, Howson (1988) defined mathematics as the study
of patterns (any regularity in form or idea, such as sequences). It is a language that
improves communication by employing ideograms (symbols for ideas, such as Σ for
“sum of”) to assist computations.

This study is looking for faults in three areas of mathematics: arithmetic (numbers
and computation), algebra (symbol language), and statistics (data interpretation) within
the context of measures of central tendency for grouped data.

Central tendency measures are a convenient way of representing a set of data
with a single number (Gay, 1979). Measures of central tendency can be thought of
as a descriptive quantification of how values cluster within a distribution. The mode for
nominal data, the median for ordinal data, and the mean for interval and ratio data are
the three indices of central tendency.

The values of these central tendency measures are used to describe the
distribution of large samples in research and in real-world settings. Their relationships
and the distributions they indicate are shown in the table below:

Measures of central tendency distribution description

Relationship              Distribution description
Mean < Median < Mode      Negatively skewed distribution
Mean = Median = Mode      Symmetrical distribution (for example, the normal distribution)
Mean > Median > Mode      Positively skewed distribution
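
The orderings in the table can be tried out on small samples. The sketch below is only an illustration: the function name `describe_shape` and the three data sets are hypothetical, and the ordering is a rule of thumb for unimodal data rather than a guarantee.

```python
import statistics

def describe_shape(data):
    """Label a unimodal sample using the mean/median/mode ordering from the table."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    mode = statistics.mode(data)  # the first most-frequent value
    if mean < median < mode:
        return "negatively skewed"
    if mean > median > mode:
        return "positively skewed"
    if mean == median == mode:
        return "symmetrical"
    return "no clear ordering"

# Hypothetical samples, constructed so that each ordering occurs once.
left_tailed = [1, 11, 15, 16, 17, 17, 18, 18, 18, 19]   # mean 15, median 17, mode 18
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]                 # mean 3, median 3, mode 3
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]          # mean 5, median 3, mode 2

for sample in (left_tailed, symmetric, right_tailed):
    print(describe_shape(sample))
```

The exact-equality test for the symmetrical case is fragile with real measurements, so a practical version would compare the three values within a tolerance.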
m

The following formulas for grouped data are included in students’ formula booklets,
so there is no need for students to memorise them.
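
For reference, the standard textbook forms of the grouped-data formulas are sketched below. The notation is an assumption and may differ from the booklet's: x is a class midpoint, f a class frequency, n = Σf the total frequency, L_m the lower class limit of the median (or modal) class, F the cumulative frequency before the median class, f_m the frequency of the median (or modal) class, f_{m-1} and f_{m+1} the frequencies of the classes before and after the modal class, and c the class width.

```latex
\bar{x} = \frac{\sum f x}{\sum f}
\qquad
\text{Median} = L_m + \left( \frac{\tfrac{n}{2} - F}{f_m} \right) c
\qquad
\text{Mode} = L_m + \left( \frac{f_m - f_{m-1}}{2 f_m - f_{m-1} - f_{m+1}} \right) c
```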
The presence of a formula indicates the growth of procedural knowledge. Mtetwa
(1999) defines procedural knowledge as knowing what, when, and how to change rules
and stages to get the desired result. It may or may not include the rationale for why
the methods work in this manner. Chinamasa (2008) urged teachers to use logic as a
foundation for conceptual understanding. The rationale allows students to investigate
and apply mathematical content structures, relationships, objects, and concepts in real-
world settings.

Students’ Errors in Mathematics
Errors are defined in this study as systematic deviations from accepted
mathematical computing techniques. According to Radatz (1980), students’ errors
in mathematical instruction are not solely the consequence of ignorance, stupidity,
or chance. Most mistakes made by students are not the result of uncertainty,
carelessness, or unusual circumstances. Students’ errors are instead the
outcome or product of earlier encounters in the mathematics classroom.

This perception encourages teachers to analyse students’ mistakes, identify
possible causes, and provide instructive comments for students who are struggling.
According to Frank (2009), the assumptions behind instructors’ studies of learners’
mathematical errors are as follows:

(1) Errors are an indicator of problems faced in acquiring the target concepts.
(2) They allow teachers to predict likely errors in a group of students and provide remedial
instruction.
(3) They point to possible learning tactics adopted by students.
(4) They are a source of information about the stages of growth of learners.

These assumptions call for the current study of the computation errors that adult
learners make on measures of central tendency.
Researchers have approached error analysis from a variety of perspectives.
According to Russell (2002), elementary school pupils make the following errors:

(1) Mechanical errors caused by rushed approaches and missed steps.
(2) Application mistakes indicating a student’s misunderstanding of one or more steps.
(3) Knowledge gaps resulting from a lack of concepts and a lack of familiarity with
terminology.
(4) Incorrect operation order as a result of rote learning.

Cohen and Spenciner (2007) examined word problems and discovered that
students:

(1) Had difficulties reading.
(2) Showed an inability to apply the context of the problem to a real-life situation.
(3) Failed to discriminate between relevant and irrelevant data.
(4) Could not figure out how many steps were needed to solve the problem.
(5) Had difficulty with mathematical operations using directed numbers.

These findings do not account for adult mistakes in Zimbabwe. Although
these studies were conducted on youngsters rather than adults, it is clear that:

(1) Language contributes to errors in mathematical computation.
(2) The researchers examined student response scripts.
(3) Their approaches did not ask the learner to explain why he or she used a particular
procedure.

Aside from the language barrier, Radatz (1979) also reported that students’
mathematics errors were caused by:

(1) A lack of mastery of pre-requisite abilities, such as counting before addition and
subtraction.
(2) Inadequate facts and conceptions.
(3) Wrong association of ideas.
(4) The application of inapplicable operational rules.

Booker (1989) was interested in the use of errors in focused instruction. During the
course of teaching, he or she observed the following sources of error:

(1) Incoherent presentation of mathematics content.
(2) Using improper textbook examples to create concepts.
(3) Inappropriate exercise tasks, such as those that emphasise procedure drill with no
evaluation or explanation.
(4) Underestimating the importance of fundamentals by teaching before pre-testing.
(5) Inappropriate instructor reaction to student mistakes.
(6) Inadequate technique selection for subject knowledge comprehension.
(7) Classwork based on the work of selected students who are invited to work on the
board while the rest of the class copies the solution.

Newman (1977) proposes a more comprehensive method of error analysis that
involves giving a pupil five prompts to help pinpoint where errors occur. The prompts
are as follows:

1. Please read the question to me. If you don’t understand a word, leave it out.
(reading)
2. Describe what the question asks you to do. (comprehension)
3. Describe how you intend to find the answer. (transformation)
4. Show me what to do to get the answer. Speak aloud while doing it so that I
can understand what you’re thinking. (process skills)
5. Lastly, write down your response to the question. (encoding in symbols or words)

Clements (1980) used Newman’s prompts to examine the errors of 726 Grade 5 to
7 students in Papua New Guinea. He/she discovered that 50% of the errors occurred at
the reading, comprehension, and transformation stages.

Clements (ibid) argued that teacher remedial initiatives centred on procedural
method are misdirected since Newman’s prompts are hierarchical. Ellerton and Clements
(1996) discovered that different questions elicited quite different error patterns.

Lankford (1994) extended the use of Newman’s error analysis to adult students.
She discovered that nurses made mistakes at the comprehension and transformation
levels. What is yet unknown is the sort of computation errors, their origins, and their
classification among Zimbabwe’s adult students.

Methodology
Research Design
A descriptive case study was used in this work to aid in the discovery,
quantification, and description of students’ computational errors and their distribution.
Descriptive research approaches make it easier to triangulate data sources and
methods. The design allowed for documentary analysis and the use of Newman’s prompts
in this study. Because only one group of students from one university was used, the case
study is appropriate for accounting for institutional issues.

Instruments
The researcher designed the instructional activity below for error analysis after
teaching the subject. The speed of vehicles on a road near Hillview Primary School is
shown in the table below:

The assignment was completed under test conditions to ensure that students
presented individual responses and to allow the emotional stress of a test to prevail
naturally. To reduce the influence of time pressure, an hour was allowed. The 103 student
answer scripts were the next set of instruments. These were examined for computation
errors made by students. The scripts offered a record of students’ errors but no rationale,
necessitating interviews using Newman’s (1977) prompts as a guide to students’
rationale and likely error classification.

The prompts are as follows:

1. Please read the question aloud to me. If you don’t understand a word, leave it out.
(reading)
2. Describe what the question asks you to do. (comprehension)
3. Explain how you intend to obtain the answer. (transformation)
4. Show me what to do to get the answer. Speak aloud while doing it so that I can
understand what you’re thinking. (process skills)
5. Lastly, write down your response to the question. (encoding in symbols or words)

Population and sampling
This study’s population consisted of all 186 master’s students who enrolled in the
Quantitative Management course between January and July of 2011. All of the students
had completed the course and were excellent sources of errors resulting from their prior
experiences and university instruction. The scripts of the students who were present
were counted, giving 103 scripts.

Because the script population was limited, probability sampling was adequate for
the Newman interviews. Because errors were assumed to be evenly distributed, simple
random sampling was used. The registration numbers of students were linked with
computer-generated random numbers to choose 60 people for interviews.
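
The selection step can be sketched in a few lines. This is not the researcher's actual procedure: the registration numbers, the seed, and the use of Python's `random.sample` are all assumptions made for illustration.

```python
import random

# Hypothetical registration numbers standing in for the 103 students who wrote the test.
registration_numbers = [f"REG{i:03d}" for i in range(1, 104)]

# A fixed seed (an arbitrary choice) makes the draw repeatable for checking.
rng = random.Random(2011)

# Simple random sampling without replacement: every student is equally likely to be chosen.
interviewees = rng.sample(registration_numbers, k=60)

print(len(interviewees), interviewees[:5])
```

Sampling without replacement ensures that no student is selected for interview twice.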

Data collection
Data collection began with the administration of the test. Students were advised
that the test was an error-detection tool. Its findings were to serve as
the foundation for error-focused tutor instruction in their revision session. The tests
were invigilated and the scripts marked by the researcher. During the marking process,
errors were collected and frequency tables were prepared for errors that appeared more
than once. Errors were also noted on students’ scripts so that they might benefit from
the research.

The tutor conducted one-on-one interviews with the 60 students chosen in the sample.
Participants agreed to have their responses classified according to Newman’s prompt
guidance. For discussion, frequency tables were created.
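
A frequency table of this kind can be built with a counter. The error labels below are hypothetical and are not the study's data; the sketch only shows the tallying step, keeping errors that appear more than once.

```python
from collections import Counter

# Hypothetical error labels noted while marking scripts (illustrative only).
recorded_errors = [
    "wrong formula", "interpretation", "wrong formula",
    "arithmetic slip", "interpretation", "wrong formula",
]

counts = Counter(recorded_errors)

# Keep only the errors that appear more than once, most frequent first.
frequency_table = {error: n for error, n in counts.most_common() if n > 1}

for error, n in frequency_table.items():
    print(f"{error:<16} {n}")
```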

Findings and Discussions

According to the table, there are more male (60%) than female (40%) participants
in the group. Male dominance appears to have an impact on the findings. The
largest share of participants (35%) are between the ages of 46 and 50. This group is prone
to committing computation errors. Age, according to Bean (2003), is a significant
determinant in computational jobs. As the complexity of tasks increases, so does the
capacity to perform them.

Discussion of Results
1. Students make contextual interpretation errors for measures of central tendency
such as the mean (31%), median (95%), and mode (61 percent). Students’ use of
memorising of textbook definitions of median as a term that occupies a centre place
when they are in order can explain for this range of errors.
ni

2. Common responses included “the speed on the centre of the chart is 63.3 km/h.” Bell
and Gilbert (1996) classify such students as operating at the surface level: they
display limited statistical background knowledge. Only 31% of those
polled correctly interpreted the mean. This is due to the frequent use of the word
“average” in everyday life, which necessitates the inclusion of everyday examples
during training.
3. Students submitted three primary types of incorrect formulas. Of the 103
students, 41 (40%) copied the formula for the mean of ungrouped data rather than the formula
for the mean of grouped data.
The formula was interpreted incorrectly in two ways:

These errors may arise from a misunderstanding of the reasoning behind the
formula, its process, and its notation. According to Newman (1977), this is a comprehension
error.

The tutor can point out the issue when introducing the mean for grouped data.
The formula notation can be presented in this expanded form:

1. To improve students’ understanding of the formula, the expanded form should be stressed.
It can be beneficial to ask students to exchange notebooks and mark one another’s work
with the instructor, emphasising the error points.
2. Students who did not pay attention to detail gave two inaccurate versions of the
median (87%) and mode (93%) formulas. The first incorrect
version was:

Lm: the formula does not include the lower class limit of the median and
modal classes.

This is a transcription error caused by an overly generalised view, an inability to
understand the formula, or poor vision. Because the lecturers use PowerPoint
presentations, the findings may support Howard’s (2002) observation that older eyes
are more vulnerable to glare and adjust more slowly to shifts between light and dark. This
observation calls for additional research to determine whether the use of PowerPoint
presentations affects adult students’ vision and overall learning outcomes.

The second formula mistake resulted from failing to read the subscripts (m + 1)
and (m − 1) in the mode formula. This could also be due to transcription errors and
a lack of understanding of the rationale for the mode’s placement inside the modal
class. Students gave an inaccurate formula:

The tutor can reduce this inaccuracy by introducing the mode of grouped
data using a graphical method in which the mode is inferred from the histogram, as
illustrated in the picture below:

Classes are identified using bracketed notation for the subscripts, f(m−1) and f(m+1). After
students have copied the formula, they can exchange notebooks and mark only the formula
that has been transcribed.

4. The biggest mistake in the presentation of solutions was the omission of logical statements
and the misuse of equals signs. This yields meaningless expressions such as:
Such errors occur because students focus on computations to obtain the
solution. Bean (2003) anticipates such errors from surface learners with no prior
knowledge of the subject. Tutors might emphasise the importance of logical
presentation and deduct marks for such solutions.

Newman’s Error Analysis Categories (N = 60)

Error category                    Frequency
Reading                           13 (22%)
Comprehension                     27 (45%)
Transformation                    35 (58%)
Process or Skill (Application)    46 (77%)
Encoding (Presentation)           51 (85%)

According to the findings, most computation errors made by master’s students
begin at the transformation stage (58%), where students do not know how to find the solution,
continue through the process stage (77%), and end at the encoding stage (85%).

Compared with Clements (1980), who found that 50% of the errors made by Grade 5
to 7 students in Papua New Guinea occurred at the reading, comprehension, and
transformation levels, one might infer that age and students’ experience account
for the differences.

Lankford’s (1994) finding that nurses made errors at the comprehension and
transformation levels suggests that adult learners make errors at higher levels, implying
that remedial activities focusing on procedural methods are appropriate for adult
learners.

Implications for Andragogy


According to the findings of this study, instructors can reduce computation mistakes
for measures of central tendency by doing the following:

1. Creating introductory activities that address fundamental mathematics concepts such
as:
(a) The order of operations: brackets, then multiplication, division, addition, and
subtraction (BOMDAS)
(b) Fractions
(c) Use of brackets, as covered in New General Mathematics, Book 2
(d) Percentages and decimals

2. Involving students in the collecting of data for which they will calculate, analyse, and
present the mean, mode, and median during group reporting.


3. Adding a graphical representation to the calculation for measures of central tendency.


4. Asking students to exchange and mark each other’s notes during the
teaching-learning process.
5. Encouraging students to explain the procedure as they solve the problem.
6. Teaching for understanding by describing the rationale for manipulating figures and
awarding method marks for a correct method that is presented logically.
7. Supplementing PowerPoint examples with photocopies of the answers
distributed to students as handouts.

Check Your Understanding
Multiple Choice Question

1. Measures of central tendency aid in determining the ............ around which all data
in a group (observations) tend to cluster.
A) Double Value
B) Single Value
C) Triple Value
D) Valueless
2. Because no two scores are precisely the same, the frequency of each value in
................, such as response time measured to many decimals, is one.
A) Present Data
B) Past Data
C) Continuous Data
D) Upcoming Data
3. Individuals and businesses use these metrics all the time to obtain a better
understanding of ........ in a variety of disciplines.
A) Datasets
B) Raw
C) Column
D) Sets
4. Measures of central tendency and ............ are useful for describing and comparing
data sets.
A) Straight
B) Zig Zag
C) Dispersion
D) Downfall
5. Descriptive research approaches make it easier to .......... data sources and methods.
A) Triangulate
B) Quadrilateral
C) Circulate
D) Base
6. Measures of central tendency and dispersion are useful for ................ data sets.
A) Analysing
B) Describing
C) Comparison
D) Action

State True and False


1) Measures of central tendency aid in determining the single value around which
some data in a group (observations) tend to cluster.
2) The central tendency aids in comparing a double value of data to the complete data
set.
3) Individuals and businesses use these metrics all the time to obtain a better
understanding of datasets in a variety of disciplines.
4) Human resource managers frequently compute the mean income of persons in a
given field in order to determine what type of “average” salary to offer new employees.
5) Teachers establish classroom learning standards and track deviations from them
through student errors.

Summary
●● Central tendency is a statistical measure that uses a single number to represent a
group. It is an economical summary of a group’s general features.
●● A measure of central tendency is a single number that attempts to describe a set
of data by identifying its central position. Measures of central tendency are
sometimes called measures of central position. They are also known as summary
statistics.
●● Measures of central tendency are an important tool for working with data and
communicating it alongside graphs. In real-world applications, you can use many types of
tables and graphs to present information and extract information from data to aid in
analyses and forecasts.
●● Central tendency refers to the single value that represents the complete distribution
or dataset and seeks to give an accurate description of all the data in the
distribution.
●● The three main measures are mean, median, and mode.
●● The mean is the sum of all observations divided by the number of observations.
●● The median is the midpoint of a distribution: half of the observations are above it
and half are below it. The median is less susceptible to outliers than the mean. For
a dataset containing extreme values, it is a better measure than the mean.
●● The mode is the most common observation in a distribution.
●● A dataset may contain several modes in some circumstances, while others may
not have any mode at all. For moderately skewed distributions, the three measures
of central tendency, namely mean, median, and mode, are linked by the following
empirical relationship: Mode = 3 Median − 2 Mean.
●● When data are normally distributed, the mean is the preferred measure of
central tendency. When data are skewed, the median is the most useful measure of
central tendency. The mode is the most useful measure of central tendency when
working with nominal variables. If all data values are 0, the mean and
median are both 0. A dataset may, however, have no mode when no value repeats.
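
As an illustration of the points summarised above, the following minimal sketch computes the three measures with Python's standard `statistics` module. The salary figures are hypothetical and chosen only to show how one extreme value pulls the mean away from the median.

```python
from statistics import mean, median, multimode

# Hypothetical monthly salaries (in thousands); the last value is an outlier.
salaries = [30, 32, 32, 35, 38, 40, 250]

salary_mean = mean(salaries)        # sum of observations / number of observations
salary_median = median(salaries)    # middle value of the sorted data
salary_modes = multimode(salaries)  # list of the most frequent value(s)

print(f"mean   = {salary_mean:.2f}")
print(f"median = {salary_median}")
print(f"mode(s) = {salary_modes}")

# When no value repeats, every value occurs once, so multimode returns them all
# and the data have no useful mode.
no_repeat_modes = multimode([1, 2, 3, 4])
print(f"modes when nothing repeats = {no_repeat_modes}")
```

Here the single outlier raises the mean well above the median, which is why the median is usually preferred for skewed data.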

Activity
1. Discuss the usage of the mean, median, and mode in different industries.
2. Discuss the case study: Masters Students’ Computation Errors on Measures of
Central Tendency: Implications for Andragogy.

Questions & Exercises

1. How are the mean, median, and mode used in marketing?
2. What are the benefits of the mode for continuous data?
3. How are the mean, median, and mode used in healthcare?
4. How are the mean, median, and mode used in real estate?
5. How are the mean, median, and mode used in human resources?

Glossary
●● Measure of central tendency: It helps determine the centre of all
observations and is thus also known as a statistical average, an average, or a
measure of central location.
●● Mean: The mean is the most commonly used measure of central
tendency. It is the result of dividing the sum of all observations by the total
number of observations in the dataset.
●● Median: The median is a positional average, often used for
qualitative data that can be ranked, such as intelligence. It divides the data into two
equal parts, where half of the items are less than the median and the other half are
greater than the median.
●● Mode: The mode is the most frequently occurring item or observation in a data set. For
example, a textile manufacturer needs to know which size is most frequently
ordered by clients so that it can produce a large quantity of that size.
●● Graphic Mean: Another name for the slope of the secant line, which is equivalent to the
average rate of change between two points.
●● Sales Revenue: Revenue is the essence of business; therefore, tracking it is
critical for any corporation. It is also included in many corporate performance
measurements. Revenue growth over prior periods is desirable for a firm’s survival and
profitability.
●● Gross Profit Margin: Gross profit margin is a measure
of profitability. It assesses the effectiveness of managing production costs in
relation to sales. The greater the margin, the better. The key to proper
interpretation is a comparison with industry benchmarks.

References
1. Gravetter FJ, Wallnau LB. Statistics for the behavioral sciences. 5th ed. Belmont:
Wadsworth – Thomson Learning; 2000.
2. Rao PS Sundar, Richard J. Introduction to biostatistics and research methods. 4th
ed. New Delhi, India: Prentice Hall of India Pvt Ltd; 2006.
3. Sundaram KR, Dwivedi SN, Sreenivas V. Medical statistics principles and methods.
1st ed. New Delhi, India: BI Publications Pvt Ltd; 2010.

Check your Understanding-Answers

Multiple Choice Question
1. B) Single Value
2. C) Continuous Data
3. A) Datasets
4. C) Dispersion
5. A) Triangulate
6. B) Describing
State True and False
1. False
2. False
3. True
4. True
5. True


Unit - 2.4: Introduction to Dispersion


Objectives

At the end of this unit, you will be able to:

●● Understand the terms dispersion, range and mean deviation
●● Understand the terms standard deviation, variance, combined mean and coefficient of variation
●● Understand the functioning and usage of quartile deviation
●● Understand what data distribution is
●● Understand what skewness and kurtosis are

Introduction

We live in a changing world, and changes are occurring in all aspects of life.
The study of statistics is not particularly interested in things that are constant. To
a researcher, the total size of the Earth may not be very important, but the areas
covered by various crops, forests, and residential and commercial structures are figures of
considerable interest, because these statistics change from time to time and from place
to place. Many professionals are involved in the investigation of shifting phenomena.
Experts from various countries keep an eye on the forces that are responsible
for bringing about changes in domains of human interest. Economists, statisticians,
and other professionals are keenly interested in agricultural, industrial, and mineral
production, as well as their transit from one region to another. Changes in human
populations, living standards, literacy rates, and prices prompt experts to conduct
extensive studies and then relate these changes to human life. Thus, variability or
variation is linked to human life, and its study is critical for humanity.

2.4.1 Introduction to Dispersion

In statistics, the term dispersion has a technical meaning. The average is one
feature of a set of observations: it measures the centre of the data. Another feature
is the way the observations are distributed around the centre. The observations may
be clustered close to the centre or scattered far from it. We say that dispersion,
scatter, or variation is small if the observations are close to the centre (typically the
arithmetic mean or median). If the observations are spread widely
away from the centre, we say that dispersion is large.

Assume we have three groups of students who received the following test scores.
The arithmetic means of the three groups are also presented below:

The arithmetic means of groups A and B are equal, i.e. $\bar{X}_A = \bar{X}_B = 50$.
However, in group A, the observations are concentrated around the centre. All of the
students in group A perform at about the same level. We say that the observations in
group A are consistent. The mean in group B is also 50, but the observations are not
close to the centre: they range from as low as 30 to as high as 70.

As a result, dispersion is larger in group B. The mean in group C is 60, but the
spread of observations about this centre is the same as the spread of
observations in group B about their own centre, which is 50. Thus, the means
of groups B and C differ, but their dispersion is the same. Both the means and the
dispersions of groups A and C are different. Dispersion is an important aspect of
observations, and it is assessed using measures of dispersion, scatter, or variation.
The term variability is also used to describe the concept of dispersion.
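
The comparison of the three groups can be sketched in a few lines of Python. The scores below are hypothetical: they are chosen only to match the description in the text (groups A and B share a mean of 50, group C has a mean of 60 and the same spread as group B) and are not the original table.

```python
from statistics import mean, pstdev

# Hypothetical scores, chosen only to match the description in the text.
groups = {
    "A": [48, 49, 50, 51, 52],   # mean 50, tightly clustered
    "B": [30, 40, 50, 60, 70],   # mean 50, widely spread
    "C": [40, 50, 60, 70, 80],   # mean 60, same spread as B
}

# For each group, store (mean, population standard deviation).
summary = {name: (mean(scores), pstdev(scores)) for name, scores in groups.items()}

for name, (centre, spread) in summary.items():
    print(f"Group {name}: mean = {centre}, standard deviation = {spread:.2f}")
```

The standard deviation is used here as one possible measure of spread; any other measure of dispersion would show the same ordering for these particular values.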

The analysis of dispersion is critical when working with statistical data. Workers will be
satisfied if wages at a particular factory are consistent. However, if some workers get
high wages while others earn low wages, there will be unrest among the low-wage
earners, who may go on strike and organise rallies. When some people in a country
are extremely impoverished while others are extremely wealthy, we refer to this as
economic disparity. This indicates that the dispersion is wide.

The concept of dispersion is essential in the analysis of employees’ pay, commodity
prices, standards of living, wealth distribution, land allocation
among farmers, and many other aspects of life. Here are some brief definitions of
dispersion:
1. Dispersion or variation refers to the degree to which numerical data tend to vary
around an average value.
2. Dispersion or variation can be defined as a statistic that indicates the degree to
which items are distributed around a measure of central tendency.
3. Dispersion or variation is the size of the scatter of items in a series around the
average.
Different series may possess different dispersions of items around the average.
Measures of central tendency are averages of the first order. Measures of dispersion
are averages of the second order. A measure of dispersion gives an idea of the
extent of the lack of uniformity in the sizes and qualities of the items in a series. It helps us
to know the degree of uniformity and consistency in the series. If the differences between
items are large, the dispersion or variation is large, and vice versa.

A measure of dispersion or variation in any data shows the extent to which the
numerical values tend to spread about an average. If the differences between items are
small, the average represents and describes the data adequately. For large differences,
it is appropriate to supplement the average by calculating a measure of dispersion.
A measure of dispersion is useful for the following purposes:

◌◌ To compare current results with past results.
◌◌ To compare two or more sets of observations.
◌◌ To suggest methods to control variation in the data.
A study of variation helps us know the extent of uniformity or consistency in
any data. Uniformity in production is an essential requirement in industry. Quality control
methods are based on the laws of dispersion.


Check Your Understanding


Multiple Choice Question
1. Experts from various countries keep an eye on the forces that are responsible for
bringing about changes in domains of ___________.
a) Public Interest
b) Human Interest
c) Private Interest
d) Wealth Interest
2. ____________, statisticians, and other professionals are keenly interested in
agricultural, industrial, and mineral production, as well as their transit from one
region to another.
a) Scientist
b) Engineer
c) Economists
d) Designer
3. Another aspect of the observation is the way the observations are distributed around
the ________.
a) Centre
b) Middle
c) Front
d) Right
4. We say that dispersion, scatter, or variation is minor if the ___________ are close to
the centre.
a) Result
b) Observations
c) Average
d) Range
5. If some workers get high wages while others earn low wages, there will be unrest
among the __________ earners, who may go on strike and organise rallies.
a) High wage
b) Moderate wage
c) No wage
d) Low wage

State True and False


1. The study of statistics is particularly interested in things that are constant.
2. Many professionals are involved in the investigation of shifting phenomena.
3. Variability or variation is linked to human life, and its study is critical for humanity.
4. Dispersion is not an important aspect of observation, and it is assessed using dispersion,
scatter, or variation measures.
5. A measure of dispersion or variation in any data shows the extent to which the
numerical values tend to spread about a range.

Summary
●● Experts from various countries keep an eye on the forces that are responsible for
bringing about changes in domains of human interest. Economists, statisticians,
and other professionals are keenly interested in agricultural, industrial, and mineral
production, as well as their transit from one region to another.
●● Changes in human populations, living standards, literacy rates, and prices prompt
experts to conduct extensive studies and then relate these changes to human
life. Thus, variability or variation is linked to human life, and its study is critical for
humanity.
●● In statistics, the term dispersion has a technical meaning. The average is one
feature of a set of observations: it measures the centre of the data. Another feature
is the way the observations are distributed around the centre.
●● The observations may be clustered close to the centre or scattered far from it.
We say that dispersion, scatter, or variation is small if the observations are close
to the centre (typically the arithmetic mean or median).
●● The arithmetic means of groups A and B are equal, i.e. $\bar{X}_A = \bar{X}_B = 50$.
However, in group A, the observations are concentrated around the centre. All of the
students in group A perform at about the same level. We say that the observations in
group A are consistent. The mean in group B is also 50, but the observations are
not close to the centre: they range from as low as 30 to as high as 70.
●● Dispersion is an important aspect of observations, and it is assessed using
measures of dispersion, scatter, or variation. The term variability is also used to
describe the concept of dispersion.
●● Measures of dispersion are averages of the second order. A measure of dispersion
gives an idea of the extent of the lack of uniformity in the sizes and qualities of the
items in a series. It helps us to know the degree of uniformity and consistency in
the series.
●● A study of variation helps us know the extent of uniformity or consistency in
any data. Uniformity in production is an essential requirement in industry. Quality
control methods are based on the laws of dispersion.

Activity
1. Discuss how dispersion is helpful in static economics.
2. Try to explain some real-world situations that involve dispersion.


Question and Answer


1. Explain dispersion.
2. What is range?
3. What is mean deviation?
4. Explain standard deviation, variance, combined mean and coefficient of variation.
5. What is quartile deviation?
6. What are skewness and kurtosis?

Glossary
●● Range: It is simply the difference between the data set’s maximum and minimum
values.
●● Absolute Measure of Dispersion: An absolute measure of dispersion is expressed
in the same unit as the original data set. The absolute dispersion approach
expresses variation in terms of the average of the deviations of observations, such
as the standard or mean deviation. It covers measures like range, standard deviation,
quartile deviation, and so on.
●● Variance: The variance is calculated by subtracting the mean from each data point
in the set, squaring each difference, adding the squares, and then dividing
the sum by the total number of values in the data set.
●● Quartiles and Quartile Deviation: The quartiles are values that divide an ordered list
of numbers into quarters. The quartile deviation is equal to half of the difference
between the third and first quartiles.
●● Co-efficient of Dispersion: When two series with widely different averages are
compared, the coefficients of dispersion (along with the measures of dispersion)
are determined. The coefficient of dispersion is also used when comparing two
series with different measurement units. It is abbreviated as C.D.
●● Quartile Scores: These are based on more information than the range and, unlike
the range, are not affected by outliers.
●● Quartile Deviation: The quartile deviation is half of the difference between the
upper quartile and the lower quartile.
●● Inter-quartile Range: The inter-quartile range is the difference between the upper quartile
(third quartile) and the lower quartile (first quartile).

Further Reading
1. Bordens, K. S. and Abbott, B. B. (2011). Research Design and Methods: A
Process Approach. New Delhi: McGraw Hill Education (India) Private Limited.
2. King, Bruce M. and Minium, Edward W. (2008). Statistical Reasoning in the
Behavioural Sciences. Delhi: John Wiley and Sons, Ltd.
3. Mangal, S. K. (2002). Statistics in Psychology and Education. New Delhi:
PHI Learning Private Limited.


Check Your Understanding – Answers


Multiple Choice Question
1. b) Human Interest
2. c) Economists
3. a) Centre
4. b) Observations
5. d) Low wage
State True and False
1. False
2. True
3. True
4. False
5. False

Unit - 2.5: Measures of Dispersion


Objectives

At the end of this unit, you will be able to:

●● Understand the term range
●● Know more about mean deviation, standard deviation and variance, and their calculation
●● Know about combined mean, coefficient of variation and quartile deviation
●● Know about data distribution and skewness
●● Know about the measures, calculation and importance of skewness
●● Know about moments and kurtosis

Introduction
Measures of Dispersion: Measures of dispersion in statistics aid in interpreting
the variability of data, i.e. determining how homogeneous or heterogeneous the data are. In
layman’s terms, they indicate how squeezed or spread out the values of a variable are.

Types of Measures of Dispersion: In statistics, there are two primary types of
measures of dispersion:

◌◌ Absolute measures of dispersion
◌◌ Relative measures of dispersion
Absolute Measure of Dispersion - An absolute measure of dispersion is expressed in the
same unit as the original data set. The absolute dispersion approach expresses
variation in terms of the average of the deviations of observations, such as the standard or
mean deviation. It covers measures like range, standard deviation, quartile deviation, and
so on.

The various absolute measures of dispersion are as follows:

Range: It is simply the difference between the data set’s maximum and minimum
values.

Example: 1, 3, 5, 6, 7 => Range = 7 − 1 = 6

Variance: The variance is calculated by subtracting the mean from each data point
in the set, squaring each difference, adding the squares, and then dividing the sum by
the total number of values in the data set.

$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

Standard Deviation: The standard deviation is defined as the square root of the
variance, i.e.

$\text{S.D.} = \sigma = \sqrt{\sigma^2}$
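
The steps in the definition of the variance can be traced with a short Python sketch. It reuses the data from the range example (1, 3, 5, 6, 7) and assumes the population form of the variance, in which the sum of squared deviations is divided by N.

```python
from math import sqrt

data = [1, 3, 5, 6, 7]

n = len(data)
mu = sum(data) / n                                   # mean of the data
squared_deviations = [(x - mu) ** 2 for x in data]   # (X - mu)^2 for each value
variance = sum(squared_deviations) / n               # population variance
std_dev = sqrt(variance)                             # square root of the variance

print(f"mean = {mu}")
print(f"variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.3f}")
```

For sample data, many texts divide by N − 1 instead of N; that variant is not shown here.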

Quartiles and Quartile Deviation: The quartiles are values that divide an ordered list
of numbers into quarters. The quartile deviation is equal to half of the difference
between the third and first quartiles.

Mean and Mean Deviation: The mean is the average of the numbers, and the mean
deviation is the arithmetic mean of the absolute deviations of the observations from a
measure of central tendency (also called the mean absolute deviation).
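
A minimal sketch of the mean deviation follows. The helper name `mean_deviation` is an illustrative choice, not a library function, and the data are the same five values used in the range example.

```python
from statistics import mean, median

def mean_deviation(values, centre):
    """Arithmetic mean of the absolute deviations of the values from `centre`."""
    return sum(abs(x - centre) for x in values) / len(values)

data = [1, 3, 5, 6, 7]

md_about_mean = mean_deviation(data, mean(data))      # deviations taken from the mean
md_about_median = mean_deviation(data, median(data))  # deviations taken from the median

print(f"mean deviation about the mean   = {md_about_mean:.2f}")
print(f"mean deviation about the median = {md_about_median:.2f}")
```

The centre is passed in explicitly because the definition allows any measure of central tendency to be used.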

Relative Measure of Dispersion - When comparing the distributions of two or
more data sets, relative measures of dispersion are used. These measures are
unit-free, so values can be compared across data sets. Commonly used relative
measures of dispersion include:

◌◌ Co-efficient of Range
◌◌ Co-efficient of Variation
◌◌ Co-efficient of Standard Deviation
◌◌ Co-efficient of Quartile Deviation
◌◌ Co-efficient of Mean Deviation
Co-efficient of Dispersion: When two series with widely different averages
are compared, the coefficients of dispersion (along with the measures of dispersion)
are determined. The coefficient of dispersion is also used when comparing two series
with different measurement units. It is abbreviated as C.D.

The common dispersion coefficients are:

C.D. in terms of             Coefficient of dispersion
Range                        C.D. = (Xmax – Xmin) ⁄ (Xmax + Xmin)
Quartile Deviation           C.D. = (Q3 – Q1) ⁄ (Q3 + Q1)
Standard Deviation (S.D.)    C.D. = S.D. ⁄ Mean
Mean Deviation               C.D. = Mean Deviation ⁄ Average
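
To show why a unit-free coefficient is useful, the sketch below applies C.D. = S.D. ⁄ Mean to two hypothetical series measured in different units. The height and weight values are invented for illustration, and the population standard deviation is assumed.

```python
from statistics import mean, pstdev

def coefficient_of_standard_deviation(values):
    """C.D. = S.D. / Mean, using the population standard deviation."""
    return pstdev(values) / mean(values)

# Hypothetical series in different units with the same absolute spread.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cd_heights = coefficient_of_standard_deviation(heights_cm)
cd_weights = coefficient_of_standard_deviation(weights_kg)

print(f"S.D. of heights = {pstdev(heights_cm):.2f} cm, C.D. = {cd_heights:.4f}")
print(f"S.D. of weights = {pstdev(weights_kg):.2f} kg, C.D. = {cd_weights:.4f}")
```

Both series have the same standard deviation, but the weights vary more relative to their mean, which only the coefficient reveals.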

Characteristics of a good measure of dispersion - An ideal measure of
dispersion should have the following characteristics:

◌◌ It should be rigidly defined.
◌◌ It should be based on all of the items.
◌◌ It should not be unduly affected by extreme items.
◌◌ It should be amenable to algebraic manipulation.
◌◌ It should be straightforward to understand and compute.


Comparisons of Measures of Dispersion - When data are described by a
measure of central tendency (mean, median, or mode), all of the scores are summarised by a
single value. Reports of central tendency are frequently enhanced and complemented by
giving a measure of dispersion. The measures of dispersion described above
differ in ways that can help you decide which one is most effective in a given situation.

Range: The range is the simplest of all dispersion measures to calculate. It is
frequently used as a first indication of dispersion. However, because it only considers the
scores at the two extremes, it is of limited use.

Quartile Scores: These are based on more information than the range and, unlike the
range, are not affected by outliers. However, they are seldom used to
describe dispersion because they are more difficult to calculate than the range and lack the
mathematical properties that make the standard deviation and variance so useful.

The standard deviation (σ or s) and variance (σ² or s²) are more comprehensive
measures of dispersion that take every score in a distribution into account. The other
measures of dispersion we have covered are based on much less information.

However, because variance is calculated using the squared differences of scores
from the mean, a single outlier has a greater impact on the size of the variance than
a single score close to the mean.

Some statisticians consider this trait to be a flaw of variance as a measure of
dispersion, particularly when the accuracy of some of the extreme scores is called into
question.

A researcher, for example, may conclude that a person who reports watching
television for an average of 24 hours per day misinterpreted the question.

A single extreme score may result in a considerably larger standard deviation,
especially if the sample is small. Fortunately, because all scores are included in the
variance calculation, the many non-extreme scores (those closer to the mean) will
tend to offset the misleading influence of any extreme scores.
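
The effect of one extreme score can be sketched as follows. The viewing-hour figures are hypothetical; the 24-hour report stands in for the misinterpreted answer mentioned above, and the population standard deviation is assumed.

```python
from statistics import pstdev

# Hypothetical daily TV-viewing hours reported by respondents.
small_sample = [2, 3, 3, 4, 3]
large_sample = small_sample * 20      # 100 non-extreme scores with the same spread
outlier = 24                          # the implausible report

sd_small = pstdev(small_sample)
sd_small_with_outlier = pstdev(small_sample + [outlier])
sd_large = pstdev(large_sample)
sd_large_with_outlier = pstdev(large_sample + [outlier])

print(f"small sample: {sd_small:.2f} -> {sd_small_with_outlier:.2f} with the outlier")
print(f"large sample: {sd_large:.2f} -> {sd_large_with_outlier:.2f} with the outlier")
```

In both cases the outlier inflates the standard deviation, but the increase is much smaller when many non-extreme scores are present.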

The standard deviation and variance are the most often used measures of
dispersion in the social sciences because:

◌◌ Both take into account the precise difference between each score and the mean. As a
result, these measures are based on as much information as possible.
◌◌ The standard deviation serves as the foundation for the concept of the
standardised score, also known as the “z-score.”
◌◌ Variance in a set of scores on some dependent variable serves as a baseline
for determining the relationship between two or more variables (the degree to
which they are related).

2.5.1 Range Measure

The ‘Range’ of the data is the difference between the largest value and the
smallest value in the data.

Formula: Range = L − S

where L is the largest value and S is the smallest value.

This is an absolute measure of variability. However, if we have to compare two sets
of data, the range may not give a true picture. In such cases, a relative measure of range,
called the coefficient of range, is used. This is given by

Coefficient of range = (L − S) / (L + S)

For individual observations and discrete series, L and S are easily identified. For
continuous series, either of the following two methods is used:

Method 1: L - Upper boundary of the highest class.
          S - Lower boundary of the lowest class.

Method 2: L - Mid value of the highest class.
          S - Mid value of the lowest class.

Example 1:

Find the range and the coefficient of range for the observations 10, 5, 8, 11, 12, 9.

Solution: L = 12, S = 5

Range = L – S

= 12 – 5

= 7

Coefficient of range = (L – S) / (L + S)

= (12 – 5) / (12 + 5)

= 7/17

= 0.4118

Example 2:

Compute the range and the coefficient of range from the following distribution.

C.I.        Frequency (f)
120 - 130   2
130 - 140   9
140 - 150   16
150 - 160   12
160 - 170   5

Solution:

In finding the range, the frequencies are not taken into account. Only the upper limit
of the highest class and the lower limit of the lowest class are used.

Range = L – S

= 170 – 120 = 50

Coefficient of range = (L – S) / (L + S) = (170 – 120) / (170 + 120)

= 50/290

= 0.1724
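The two range formulas above can be checked with a few lines of Python. This is a minimal sketch; the function names `value_range` and `coefficient_of_range` are illustrative choices, not part of the text.

```python
def value_range(largest, smallest):
    """Absolute measure: Range = L - S."""
    return largest - smallest


def coefficient_of_range(largest, smallest):
    """Relative measure: (L - S) / (L + S)."""
    return (largest - smallest) / (largest + smallest)


# Example 1: individual observations
observations = [10, 5, 8, 11, 12, 9]
L1, S1 = max(observations), min(observations)
print(value_range(L1, S1))                      # 7
print(round(coefficient_of_range(L1, S1), 4))   # 0.4118

# Example 2: continuous series, using the class limits (Method 1)
classes = [(120, 130), (130, 140), (140, 150), (150, 160), (160, 170)]
L2 = max(upper for _, upper in classes)
S2 = min(lower for lower, _ in classes)
print(value_range(L2, S2))                      # 50
print(round(coefficient_of_range(L2, S2), 4))   # 0.1724
```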

Interquartile Range and Deviations


Inter-quartile range and quartile deviation are described in the following subsections.

Inter-quartile Range

The inter-quartile range is the difference between the upper quartile (third quartile) and the lower
quartile (first quartile). Thus, Inter-quartile Range = Q3 – Q1

Quartile Deviation

Definition: Quartile deviation is half of the difference between the upper
quartile and the lower quartile.

Formula: Quartile Deviation = QD = (Q3 – Q1)/2

Quartile deviation (QD) also gives the average deviation of the upper and lower
quartiles from the median.

The corresponding relative measure is the coefficient of quartile deviation:

Coefficient of QD = (Q3 – Q1) / (Q3 + Q1)

Example 1:
Weekly wages of labourers are given below. Calculate the Q.D. and the coefficient of Q.D.

Weekly wages:   100   200   400   500   600   Total
No. of weeks:   5     8     21    12    6     52

Solution:

Weekly wages   No. of weeks   Cumulative frequency
100            5              5
200            8              13
400            21             34
500            12             46
600            6              52
               N = 52

Position of Q1 = (N + 1)/4

= (52 + 1)/4

= 13.25

Q1 = 13th value + 0.25 × (14th value – 13th value)

= 200 + 0.25 × (400 – 200)

= 200 + 0.25 × 200

= 200 + 50

= 250

Position of Q3 = 3(N + 1)/4

= 3 × 13.25

= 39.75

Q3 = 39th value + 0.75 × (40th value – 39th value)

= 500 + 0.75 × (500 – 500)

= 500 + 0.75 × 0

= 500

Q.D. = (Q3 – Q1) / 2

= (500 – 250)/2

= 250/2

= 125

Coefficient of Q.D. = (Q3 – Q1) / (Q3 + Q1)

= (500 – 250) / (500 + 250)

= 250/750

= 0.333
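The positional rule used in this example, k(N + 1)/4 with linear interpolation between neighbouring values, can be sketched in Python as follows. The helper name `quartile` and the step of expanding the frequency table into a flat list are illustrative assumptions.

```python
def quartile(sorted_values, k):
    """k-th quartile (k = 1, 2, 3) at position k*(N+1)/4, with linear
    interpolation between the two neighbouring ordered values."""
    n = len(sorted_values)
    position = k * (n + 1) / 4      # 1-based position, may be fractional
    whole = int(position)           # e.g. 13 for position 13.25
    fraction = position - whole     # e.g. 0.25
    if whole < 1:
        return sorted_values[0]
    if whole >= n:
        return sorted_values[-1]
    low = sorted_values[whole - 1]  # the "whole"-th value
    high = sorted_values[whole]     # the next value
    return low + fraction * (high - low)


wages = [100, 200, 400, 500, 600]
weeks = [5, 8, 21, 12, 6]

# Expand the discrete frequency table into an ordered list of 52 values
expanded = [w for w, f in zip(wages, weeks) for _ in range(f)]

q1 = quartile(expanded, 1)
q3 = quartile(expanded, 3)
qd = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1)
print(q1, q3, qd, round(coefficient_qd, 3))   # 250.0 500.0 125.0 0.333
```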

Example 2:
Determine the interquartile range and percentile range of the following distribution:

C.I.      Frequency (f)
11 - 13   8
13 - 15   10
15 - 17   15
17 - 19   20
19 - 21   12
21 - 23   11
23 - 25   4

Solution:

Class interval   Frequency   Less than C.F.
11 - 13          8           8
13 - 15          10          18
15 - 17          15          33
17 - 19          20          53
19 - 21          12          65
21 - 23          11          76
23 - 25          4           80

1. Calculation of Interquartile Range
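For a grouped distribution such as this one, quartiles and percentiles are usually found by interpolating inside the class that contains the required cumulative frequency. The sketch below assumes the standard interpolation formula, value = l + ((target – c.f. before the class) / f) × class width, and assumes that "percentile range" means P90 – P10; neither is spelled out in the text, and the function name `grouped_percentile` is illustrative.

```python
def grouped_percentile(classes, frequencies, percent):
    """Value below which `percent` % of the observations lie, found by
    linear interpolation inside the class that contains the target count."""
    total = sum(frequencies)
    target = percent * total / 100
    cumulative = 0                      # c.f. of the classes already passed
    for (lower, upper), f in zip(classes, frequencies):
        if cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    return classes[-1][1]


classes = [(11, 13), (13, 15), (15, 17), (17, 19), (19, 21), (21, 23), (23, 25)]
frequencies = [8, 10, 15, 20, 12, 11, 4]

q1 = grouped_percentile(classes, frequencies, 25)
q3 = grouped_percentile(classes, frequencies, 75)
p10 = grouped_percentile(classes, frequencies, 10)
p90 = grouped_percentile(classes, frequencies, 90)

print(round(q1, 3), round(q3, 3), round(q3 - q1, 3))   # Q1, Q3, interquartile range
print(round(p10, 3), round(p90, 3), round(p90 - p10, 3))  # P10, P90, percentile range
```

Under these assumptions Q1 falls in the class 15 - 17 and Q3 in the class 19 - 21, giving an interquartile range of about 4.9.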


2.5.2 Mean Deviation
Mean deviation (MD) is the arithmetic mean of the absolute deviations of the values
about their arithmetic mean, median or mode. It indicates how spread out
(dispersed) the data are.

If x1, x2, …, xN are N observations, then

$$\text{MD} = \frac{\sum |d_i|}{N} = \frac{\sum |x_i - \text{Average}|}{N}$$

Where,

di = deviation of each observation = xi – Average

The average used for calculating deviations can be the mean, the median or the mode.
However, usually the mean is used. There is also an advantage in taking deviations
from the median, because the mean deviation from the median is the lowest of all
mean deviations. Since absolute values of the deviations (ignoring sign) are taken
when calculating the mean deviation, it is not amenable to further algebraic
treatment.

Mean Deviation Application


Definition: The relative measure corresponding to the ‘Mean Deviation’ is the
‘coefficient of Mean Deviation’. It is defined as:

Coefficient of mean deviation = Mean Deviation / (Mean or Median or Mode)

It can also be expressed as a percentage by multiplying it by 100.

Formulae:

Coefficient of mean deviation (about mean) = Mean deviation about mean / Mean,

where mean deviation about mean = $\sum |x - \bar{x}| / N$

Coefficient of mean deviation (about median) = Mean deviation about median / Median,

where mean deviation about median = $\sum f|x - M| / N$

Coefficient of mean deviation (about mode) = Mean deviation about mode / Mode,

where mean deviation about mode = $\sum f|x - Z| / N$

Example:
Calculate the mean deviation about the mean for the following:

12, 7, 9, 7, 7, 4, 10, 9, 15, 20

Solution:

Mean = (12 + 7 + 9 + 7 + 7 + 4 + 10 + 9 + 15 + 20) / 10

= 100/10

= 10

Mean deviation about mean = $\sum |x - \bar{x}| / N$

= (2 + 3 + 1 + 3 + 3 + 6 + 0 + 1 + 5 + 10) / 10

= 34/10

= 3.4

Example 1:

MD in Individual Series

Calculate the mean deviation and its coefficient for the following data:

Value (x): 125, 128, 132, 135, 140, 148, 155, 157, 159, 161

Solution:

Step 1: Compute the arithmetic mean (x̄).
Step 2: Find the deviation of each value from x̄, ignoring the negative sign.

Sl. No.   Value (x)   Deviation (x – x̄) = Dx
A         125         125 – 144 = –19
B         128         128 – 144 = –16
C         132         132 – 144 = –12
D         135         135 – 144 = –9
E         140         140 – 144 = –4
F         148         148 – 144 = +4
G         155         155 – 144 = +11
H         157         157 – 144 = +13
I         159         159 – 144 = +15
J         161         161 – 144 = +17
n = 10    Σx = 1440   Σ|Dx| = 120

x̄ = Σx / n = 1440/10 = 144

MD = Σ|Dx| / n = 120/10 = 12

Coefficient of MD = MD / x̄ = 12/144 = 0.083
Example 2:
MD in Discrete Series

Calculate the mean deviation and its coefficient for the following data.

X: 35  40  45  50  55  60  65  70  75  80  85  90  95
f: 3   8   12  9   4   7   15  5   10  7   5   3   2

Solution:

X    f    (X – x̄) = Dx           f|Dx|
35   3    35 – 61.95 = –26.95    80.85
40   8    40 – 61.95 = –21.95    175.60
45   12   45 – 61.95 = –16.95    203.40
50   9    50 – 61.95 = –11.95    107.55
55   4    55 – 61.95 = –6.95     27.80
60   7    60 – 61.95 = –1.95     13.65
65   15   65 – 61.95 = 3.05      45.75
70   5    70 – 61.95 = 8.05      40.25
75   10   75 – 61.95 = 13.05     130.50
80   7    80 – 61.95 = 18.05     126.35
85   5    85 – 61.95 = 23.05     115.25
90   3    90 – 61.95 = 28.05     84.15
95   2    95 – 61.95 = 33.05     66.10
N = 90    Σfx = 5575             Σf|Dx| = 1217.20

x̄ = Σfx / N = 5575/90 = 61.944, taken as 61.95 in the table

MD = Σf|Dx| / N = 1217.20/90 ≈ 13.52

Coefficient of MD = MD / x̄ = 13.52/61.95 = 0.218

Example 3:
MD in Continuous Series

Calculate the mean deviation and its coefficient for the following data:

X       f
10-20   5
20-30   4
30-40   7
40-50   12
50-60   10
60-70   8
70-80   4

Solution:

X       f    Mid point (x)   fx     |x – x̄| = |Dx|   f|Dx|
10-20   5    15              75     31.6             158.0
20-30   4    25              100    21.6             86.4
30-40   7    35              245    11.6             81.2
40-50   12   45              540    1.6              19.2
50-60   10   55              550    8.4              84.0
60-70   8    65              520    18.4             147.2
70-80   4    75              300    28.4             113.6
N = 50       Σfx = 2,330                             Σf|Dx| = 689.6

x̄ = Σfx / N = 2330/50 = 46.6

MD = Σf|Dx| / N = 689.6/50 = 13.792

Coefficient of MD = MD / x̄ = 13.792/46.6 = 0.296
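The mean-deviation calculations in the examples above follow one pattern: find the mean, average the absolute deviations, then divide by the mean for the coefficient. A minimal Python sketch, with `mean_deviation` as an illustrative helper name, is:

```python
def mean_deviation(values, frequencies=None):
    """Return (mean, mean deviation about the mean, coefficient of MD).
    `values` are observations or class mid-points; `frequencies` is optional."""
    if frequencies is None:
        frequencies = [1] * len(values)
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(values, frequencies)) / n
    md = sum(f * abs(x - mean) for x, f in zip(values, frequencies)) / n
    return mean, md, md / mean


# Individual observations: 12, 7, 9, 7, 7, 4, 10, 9, 15, 20
raw_mean, raw_md, _ = mean_deviation([12, 7, 9, 7, 7, 4, 10, 9, 15, 20])
print(raw_mean, raw_md)            # 10.0 3.4

# Continuous series (Example 3): class mid-points and frequencies
mid_points = [15, 25, 35, 45, 55, 65, 75]
freqs = [5, 4, 7, 12, 10, 8, 4]
g_mean, g_md, g_coef = mean_deviation(mid_points, freqs)
print(round(g_mean, 1), round(g_md, 3), round(g_coef, 3))
```

For the continuous series this gives a mean of 46.6, a mean deviation of about 13.792 and a coefficient of about 0.296.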

2.5.3 SD Variance
Variance is defined as the average of the squared deviations of the data points from their
mean.

When the data constitute a sample, the variance is denoted by σx² and the averaging
is done by dividing the sum of the squared deviations from the mean by n – 1. When the
observations constitute the population, the variance is denoted by σ² and we divide by
N for the average.


Different formulas for calculating variance:
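The commonly used forms of the variance formula can be written as below. This is a sketch of the standard expressions, assuming the notation of this section (μ and N for a population, x̄ and n for a sample, f for class frequencies).

```latex
\begin{aligned}
\text{Population variance:}\quad & \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum x_i^2}{N} - \mu^2 \\
\text{Sample variance:}\quad & \sigma_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \\
\text{Frequency distribution:}\quad & \sigma^2 = \frac{\sum f_i (x_i - \bar{x})^2}{N}, \qquad N = \sum f_i
\end{aligned}
```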


Standard Deviation

Standard deviation is the root mean square deviation of the values from their
arithmetic mean. It is denoted by the symbol σ (read sigma). The standard deviation
(SD) of a set of data is the positive square root of the variance of the set. It is also
referred to as the root mean square (RMS) value of the deviations of the data points. The SD
of a sample is the square root of the sample variance, i.e. σx, and the standard
deviation of a population is the square root of the population variance,
denoted by σ.
The properties of standard deviation are:

◌◌ It is the most important and widely used measure of variability.
◌◌ It is based on all the observations.
◌◌ Further mathematical treatment is possible.
◌◌ It is affected least by sampling fluctuations.
◌◌ It is affected by extreme values, and it gives more importance to the values
that are far from the mean.
◌◌ Its main limitation is that we cannot compare the variability of different data sets
given in different units.
Where d = X – A, C is the true class interval, and N = total frequency.

Example:
In a large group of students, 80% have a recommended statistics book. Three
students are selected at random. Find the probability distribution of the number of
students having the book. Also compute the mean and variance of the distribution.

Solution:

Let the event that ‘a student selected at random has the book’ be termed a
success. Since the group of students is large, the 3 trials, i.e., the selection of 3 students,
can be regarded as independent, each with probability of success p = 0.8. Thus, the
given experiment satisfies the conditions of the binomial distribution.

The probability mass function is $P(r) = \binom{3}{r}(0.8)^r(0.2)^{3-r}$,

where r = 0, 1, 2 and 3.

The mean is np = 3 × 0.8 = 2.4 and the variance is npq = 2.4 × 0.2 = 0.48.
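The probability distribution asked for in this example can be tabulated directly from the probability mass function. The sketch below uses Python's `math.comb`; the name `binomial_pmf` is an illustrative choice. It also recomputes the mean and variance from the distribution itself as a check on np and npq.

```python
from math import comb


def binomial_pmf(n, p, r):
    """P(r successes in n independent trials), success probability p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


n, p = 3, 0.8
distribution = {r: binomial_pmf(n, p, r) for r in range(n + 1)}
for r, prob in distribution.items():
    print(r, round(prob, 3))       # 0 0.008, 1 0.096, 2 0.384, 3 0.512

# Mean and variance computed from the distribution
mean = sum(r * prob for r, prob in distribution.items())
variance = sum((r - mean) ** 2 * prob for r, prob in distribution.items())
print(round(mean, 2), round(variance, 2))   # 2.4 0.48
```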

2.5.4 SD Variance Calculation


Example: Find the standard deviation for the following data:

Class interval:  0-10   10-20   20-30   30-40   40-50   50-60   60-70
Frequency:       6      14      10      8       1       3       8

Solution: Direct Method

Class interval   Class mark (mi)   Frequency (fi)   fi × mi   di = (mi – A)   di²    fi × di²
0-10             5                 6                30        –25             625    3750
10-20            15                14               210       –15             225    3150
20-30            25                10               250       –5              25     250
30-40            35                8                280       5               25     200
40-50            45                1                45        15              225    225
50-60            55                3                165       25              625    1875
60-70            65                8                520       35              1225   9800
Total                              Σfi = 50         1500                             19250

Mean = 1500/50 = 30, so the deviations are taken from A = 30.

SD = √(19250/50) = √385 = 19.62
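The direct method for a frequency distribution can be sketched as follows. The function name `grouped_sd` is illustrative, and the calculation divides by N (the population form), as the worked example does.

```python
from math import sqrt


def grouped_sd(class_marks, frequencies):
    """Direct method: mean, variance and SD of a frequency distribution,
    dividing the sum of f * d^2 by the total frequency N."""
    n = sum(frequencies)
    mean = sum(f * m for m, f in zip(class_marks, frequencies)) / n
    variance = sum(f * (m - mean) ** 2 for m, f in zip(class_marks, frequencies)) / n
    return mean, variance, sqrt(variance)


class_marks = [5, 15, 25, 35, 45, 55, 65]
frequencies = [6, 14, 10, 8, 1, 3, 8]
mean, variance, sd = grouped_sd(class_marks, frequencies)
print(mean, variance, round(sd, 2))   # 30.0 385.0 19.62
```

The same function applies to a discrete series by passing the x values in place of the class marks.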

Example 1: Calculate the variance and standard deviation for the following data:

x: 2   4   6   8   10
f: 3   5   9   5   3

Solution:

x     f         fx          D = x – Mean   D²    fD²
2     3         6           –4             16    48
4     5         20          –2             4     20
6     9         54          0              0     0
8     5         40          2              4     20
10    3         30          4              16    48
      Σf = 25   Σfx = 150                        ΣfD² = 136

Mean = Σfx / Σf = 150/25 = 6

Thus, variance = ΣfD²/N = 136/25 = 5.44

Therefore, standard deviation, σ = √5.44 = 2.33

Example 2: Find the variance and standard deviation of the following scores on
an exam:

92, 95, 85, 80, 75, 50

Solution: First find the mean of the data:

Mean, x̅ = (92+95+85+80+75+50)/6 = 477/6 = 79.5

Now find the difference between each score and the mean (the deviation), then
square each of these differences and sum them:

Score   Score – Mean   Difference from mean   Difference squared
92      92 – 79.5      12.5                   156.25
95      95 – 79.5      15.5                   240.25
85      85 – 79.5      5.5                    30.25
80      80 – 79.5      0.5                    0.25
75      75 – 79.5      –4.5                   20.25
50      50 – 79.5      –29.5                  870.25
                       Sum of squares         1317.50

Treating the scores as a sample, divide the sum of squares by n – 1 = 5 to obtain
the variance = 1317.50/5 = 263.5

The standard deviation is then √263.5 = 16.2
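The choice of divisor matters for small data sets. As a minimal sketch, Python's standard `statistics` module provides both forms: `variance` and `stdev` divide by n – 1 (sample), while `pvariance` and `pstdev` divide by N (population).

```python
import statistics

scores = [92, 95, 85, 80, 75, 50]

# Sample forms: divide the sum of squares by n - 1 = 5
sample_variance = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

# Population forms: divide the sum of squares by N = 6
population_variance = statistics.pvariance(scores)
population_sd = statistics.pstdev(scores)

print(sample_variance, round(sample_sd, 1))             # 263.5 16.2
print(round(population_variance, 2), round(population_sd, 2))
```

The sample figures match the worked example; the population figures are smaller because the same sum of squares, 1317.5, is divided by 6 instead of 5.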

2.5.5 Combined Mean

A combined mean is the mean of two or more distinct groups taken together. It is calculated
by:

1. Calculating each group’s mean
2. Combining the results, weighting each mean by the size of its group

The combined mean of two sets is given by the formula:

$$x_c = \frac{m\,x_a + n\,x_b}{m + n}$$

Where:

◌◌ xa = the mean of the first set
◌◌ m = the number of items in the first set
◌◌ xb = the mean of the second set
◌◌ n = the number of items in the second set
◌◌ xc = the combined mean

A combined mean is just a weighted mean with weights equal to the size of each
group.

In the case of more than two groups:

1. Add the means of each group, each weighted by the number of individuals or
data points.
2. Divide the sum from Step 1 by the total number of individuals (or data points).
Calculating a Combined Mean: Examples

Example: Find the combined mean from the following data:

m = 40, xa = 10

n = 60, xb = 15

Solution:

Combined mean:

= ((40 × 10) + (60 × 15)) / (40 + 60)

= (400 + 900) / 100

= 1300 / 100 = 13
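Because the combined mean is a size-weighted mean, the same calculation extends to any number of groups. A minimal sketch, with `combined_mean` as an illustrative function name:

```python
def combined_mean(means, sizes):
    """Weighted mean of group means, with group sizes as weights."""
    total_size = sum(sizes)
    weighted_sum = sum(mean * size for mean, size in zip(means, sizes))
    return weighted_sum / total_size


# Two groups from the example: m = 40, xa = 10 and n = 60, xb = 15
result = combined_mean([10, 15], [40, 60])
print(result)   # 13.0
```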

2.5.6 Comparison Measures Dispersion
A measure of dispersion indicates the scattering of data. It shows how the data values
differ from one another, providing a clear picture of their distribution. A measure
of dispersion tells us how much the individual items vary about the central value.

In other words, dispersion is the degree to which values in a distribution deviate
from the distribution’s average. It gives an indication of how much the items
differ from one another and from the central value.

The variation can be quantified using several numerical measures, such as:

(i) Range: It is the most basic measure of dispersion, defined as the difference
between the largest and smallest items in a particular distribution. If the two extreme
items are Y max and Y min, then
Range = Y max – Y min

(ii) Quartile deviation: It is also referred to as the semi-interquartile range, which is half of
the difference between the upper and lower quartiles. The first quartile (Q1) is the
middle value between the smallest value and the median of the data. The second
quartile (Q2) is the median of the data set. Finally, the third quartile (Q3) is the middle
value between the median and the largest value. The quartile deviation can be
computed as follows:

Q = ½ × (Q3 – Q1)

(iii) Mean deviation: The arithmetic mean (average) of the absolute deviations |D| of the
observations from a central value A (mean or median) is defined as the mean deviation.

The mean deviation can be calculated with the formula: $\text{MD} = \frac{1}{n}\sum_i |x_i - A|$

(iv) Standard deviation: The square root of the arithmetic average of the squares of
the deviations measured from the mean is the standard deviation. The standard
deviation is expressed as:

$$\sigma = \left[\frac{\sum_i (y_i - \bar{y})^2}{n}\right]^{1/2} = \left[\frac{\sum_i y_i^2}{n} - \bar{y}^2\right]^{1/2}$$

In addition to numerical measures, graphical methods are used to estimate
dispersion.

Types of Measures of Dispersion

(1) Absolute measures

◌◌ Absolute measures of dispersion are expressed in the variable’s unit of
measurement, such as kilogrammes, rupees, centimetres, or marks.

(2) Relative measures

◌◌ Relative measures of dispersion are obtained as ratios or percentages of the
average.
◌◌ These are also known as coefficients of dispersion.
◌◌ These are pure numbers or percentages, free of units.

Dispersion Measurement Characteristics

A good measure of dispersion has the following characteristics:

◌◌ It should be straightforward to calculate and understand.
◌◌ It should be based on all of the series’ observations.
◌◌ It must be precisely defined.
◌◌ It should not be unduly affected by extreme values.
◌◌ It should not be significantly influenced by fluctuations in sampling.
◌◌ It should be amenable to further mathematical treatment and
statistical analysis.
Objectives of Computing Dispersion

(1) Comparative Study

◌◌ Dispersion measures provide a single value that indicates the degree of
consistency or uniformity of a distribution. This single value helps us compare
different distributions.
◌◌ The smaller the magnitude (value) of dispersion, the higher the consistency or
uniformity, and vice versa.

(2) Reliability of an Average

◌◌ A low value of dispersion indicates that there is little variation between the
observations and the average. That is, the average is a good representative of the
observations and is very dependable.
◌◌ A larger value of dispersion indicates that the observations are more
dispersed. In this case, the average is not a good representative and
cannot be relied upon.

(3) Control the Variability

◌◌ Different measures of dispersion provide information on variability from different
perspectives, and this knowledge can be useful in controlling variation.
◌◌ These measures of dispersion can be quite important, particularly in
financial analysis in business and in medicine.

(4) Basis for Further Statistical Analysis

◌◌ Measures of dispersion serve as the foundation for subsequent statistical
analysis such as computing correlation, regression, hypothesis testing, and so
on.

Different ‘Absolute Measures’ of Dispersion

The following are the various ‘absolute measures’ of dispersion:

(1) Range

◌◌ It is the most basic measure of dispersion.
◌◌ It is defined as the difference between the largest and smallest items in a given
distribution.
◌◌ Range = Largest item (L) – Smallest item (S)

(2) Interquartile Range

◌◌ It is defined as the difference between a distribution’s upper and
lower quartiles.
◌◌ Interquartile Range = Upper Quartile (Q3) – Lower Quartile (Q1)

(3) Quartile Deviation

◌◌ It is known as the semi-interquartile range, which is half the difference
between the upper and lower quartiles.
◌◌ Quartile Deviation = (Q3 – Q1)/2

(4) Mean Deviation

◌◌ Mean deviation is the arithmetic mean (average) of the absolute deviations
|D| of the observations from a central value (mean or median).

(5) Standard Deviation

◌◌ The standard deviation is defined as the square root of the arithmetic average
of the squared deviations from the mean.

(6) Lorenz Curve

◌◌ The Lorenz curve is a graphical method of assessing dispersion.
◌◌ This curve is frequently used to measure income or wealth inequality in a
society.

Various ‘Relative Measures’ of Dispersion

The relative measures of dispersion are as follows:

(1) Coefficient of Range

◌◌ It is the ratio of the difference between the two extreme items of a distribution to their
sum.
◌◌ Coefficient of Range = (L – S) / (L + S)

(2) Coefficient of Quartile Deviation

◌◌ It is the ratio of the difference between a distribution’s upper and lower quartiles to
their sum.
◌◌ Coefficient of Quartile Deviation = (Q3 – Q1) / (Q3 + Q1)

(3) Coefficient of Mean Deviation

◌◌ Mean deviation is an absolute measure of dispersion.
◌◌ To convert it to a relative measure, divide it by the average from which it was
calculated.
◌◌ This is known as the Coefficient of Mean Deviation.
◌◌ Coefficient of Mean Deviation (about mean) = Mean Deviation about Mean / Mean
◌◌ Coefficient of Mean Deviation (about median) = Mean Deviation about Median / Median

(4) Coefficient of Standard Deviation

◌◌ Coefficient of Standard Deviation = σ / Mean

(5) Coefficient of Variation

◌◌ It is used to compare the stability (or uniformity, consistency
or homogeneity) of two data sets.
◌◌ It expresses the standard deviation as a percentage of
the arithmetic mean.
◌◌ Coefficient of Variation = (σ / Mean) × 100

Merits and Demerits of Range

Merits

1. It is very simple to compute and understand.
2. No special knowledge is required to compute the range.
3. It takes the least amount of time to compute.
4. It provides a broad picture of the data at a glance.

Demerits

1. It is a crude measure because it is based solely on the two extreme values
(highest and lowest).
2. It cannot be determined for open-ended series.
3. Range is highly affected by sampling fluctuations, i.e., it fluctuates greatly from
sample to sample.

Merits and Demerits of Quartile Deviation

Merits

1. It is also quite simple to compute and understand.
2. It is applicable even in the case of open-end distributions.
3. It is less sensitive to extreme values, making it superior to the range.
4. It is most useful when the dispersion of the middle 50% of the data is of interest.

Demerits

1. It is not based on all the observations.
2. It is not amenable to further algebraic or statistical treatment.
3. It is significantly influenced by sampling fluctuations.
4. It is not considered a particularly reliable measure of dispersion because it
disregards 50% of the observations.

Merits and Demerits of Mean Deviation

Merits

1. It is based on all of the series’ observations, unlike the range and QD, which
depend only on particular values.
2. It is straightforward to compute and comprehend.
3. Extreme values have little effect on it.
4. Deviations from any average can be used to calculate mean deviation.

Demerits

1. Ignoring the + and – signs is mathematically unsound.
2. It is not amenable to further mathematical treatment.
3. When the mean or median is a fraction, it is difficult to compute.
4. It may not be applicable in the case of an open-ended series.

2.5.7 Coefficient of Variation


The standard deviation is an absolute measure of dispersion. It is expressed
in the units in which the original data are collected and stated. The standard
deviation of plant heights cannot be compared with the standard deviation of grain weights,
since they are expressed in different units, namely centimetres and kilogrammes.

As a result, for purposes of comparison, the standard deviation must be converted
into a relative measure of dispersion. The coefficient of variation is such a relative measure.
It is calculated as a percentage by dividing the standard
deviation by the mean and multiplying by 100:

CV = (σ / μ) × 100

The CV is also referred to as a measure of relative variability. A smaller value of CV
indicates greater stability and lesser variability.

The C.V. can be used to compare the variability of two or more series. The larger the
C.V. of a series or group of data, the more variable, less stable, less uniform, less
consistent, or less homogeneous the group. If the C.V. is lower, it suggests that the
group is less variable and more stable, uniform, consistent, or homogeneous.

Example:
Two batsmen, A and B, made the following scores in the preliminary round of the World
Cup series of cricket matches.

A: 14, 13, 26, 53, 17, 29, 79, 36, 84 and 49

B: 37, 22, 56, 52, 28, 30, 37, 48, 20 and 40

Whom will you select for the final? Justify your answer.

Solution:

We will first calculate the mean, standard deviation and Karl Pearson’s coefficient of
variation. We will select the player based on the average score as well as consistency.
We want a player who not only scores at a high average but also does so
consistently, so that the probability of his playing a good innings in the final is high.
For Player ‘A’ (Using Direct Method)

Score xi   Deviation (xi – µ)   (xi – µ)²   xi²
14         –26                  676         196
13         –27                  729         169
26         –14                  196         676
53         13                   169         2809
17         –23                  529         289
29         –11                  121         841
79         39                   1521        6241
36         –4                   16          1296
84         44                   1936        7056
49         9                    81          2401

∑ xi = 400   ∑ (xi – µ) = 0   ∑ (xi – µ)² = 5974   ∑ xi² = 21974

Mean µ = ∑ xi / n = 400/10 = 40
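The comparison of the two batsmen can be carried through numerically. The sketch below assumes the population form of the standard deviation (dividing by N), in line with the formula CV = σ/μ × 100; the helper name `coefficient_of_variation` is illustrative.

```python
import statistics


def coefficient_of_variation(scores):
    """Return (mean, population SD, CV in percent)."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)      # divides by N
    return mean, sd, sd / mean * 100


player_a = [14, 13, 26, 53, 17, 29, 79, 36, 84, 49]
player_b = [37, 22, 56, 52, 28, 30, 37, 48, 20, 40]

mean_a, sd_a, cv_a = coefficient_of_variation(player_a)
mean_b, sd_b, cv_b = coefficient_of_variation(player_b)

print(round(mean_a, 2), round(sd_a, 2), round(cv_a, 2))
print(round(mean_b, 2), round(sd_b, 2), round(cv_b, 2))
```

Under this assumption, A has the higher average (40 against 37), but B's coefficient of variation (about 31.5%) is roughly half of A's (about 61.1%), so B is the more consistent scorer. Which player to select then depends on how much weight is given to consistency relative to the average.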


2.5.8 Quartile Deviation


This is a measure of dispersion. The interquartile range is calculated by taking the
difference between the upper and lower quartiles, and the quartile deviation is half of
that difference. Symbolically:

Quartile Deviation = (Q3 – Q1) / 2

Where Q3 = upper quartile and Q1 = lower quartile

Coefficient of QD

The coefficient of QD is obtained from the formula:

Coefficient of Quartile Deviation = (Q3 – Q1) / (Q3 + Q1)

The coefficient of QD is used to evaluate and compare the degree of variation in
different data sets.

The position of the ith quartile in an ordered data set of n observations is given by:

Qi = [i × (n + 1)/4]th observation

First Quartile (Q1)

For a data set with n = 10 observations:

Q1 = [1 × (10 + 1)/4]th observation

Q1 = 2.75th observation

The 2.75th observation lies between the second and third values in the
ordered data, three-quarters of the way from 12 to 14. Hence,

Q1 = 2nd observation + 0.75 × (3rd observation – 2nd observation)

Q1 = 12 + 0.75 × (14 – 12)

Q1 = 12 + 1.50 = 13.50

Third Quartile (Q3)

Q3 = [3 × (n + 1)/4]th observation

Q3 = [3 × (10 + 1)/4]th observation

Q3 = 8.25th observation

The 8.25th observation lies between the 8th and 9th values in the
ordered data, one-quarter of the way from 30 to 35. Hence,

Q3 = 8th observation + 0.25 × (9th observation – 8th observation)

Q3 = 30 + 0.25 × (35 – 30)

Q3 = 31.25

Using the quartile values Q1 and Q3, we calculate the quartile deviation and its
coefficient as shown below:

Quartile Deviation = (Q3 – Q1) / 2

Quartile Deviation = (31.25 – 13.50) / 2

Quartile Deviation = 8.875

Coefficient of Quartile Deviation = (Q3 – Q1) / (Q3 + Q1)

Coefficient of Quartile Deviation = (31.25 – 13.50) / (31.25 + 13.50)

Coefficient of Quartile Deviation = 0.397



2.5.9 Application Quartile Deviation

Definition:

Quartile deviation is defined as half the difference between the third and first
quartiles. It is often referred to as the semi-interquartile range. It can be calculated
for a simple distribution or a frequency distribution.

The quartile deviation formula is:

Q.D. = (Q3 – Q1) / 2

Example –
Quartiles are the values that divide an ordered set of numbers into quarters. Sort
the numbers in ascending order, then divide the list into four equal parts. The cut points are the
quartiles.

For example: 5, 7, 4, 4, 6, 2 and 8.

Arrange them in order: 2, 4, 4, 5, 6, 7 and 8.

Cut the list into quarters:

Quartile 1 (Q1) = 4, the lower quartile

Quartile 2 (Q2) = 5, which is also the median

Quartile 3 (Q3) = 7, the upper quartile

Application:
The interquartile range is significant because it represents the spread of the
middle 50% of the data, and several further measures of deviation may be
calculated from it, which are highly useful in assessing the features of the data. When the
interquartile range is divided by two, the result is referred to as the quartile deviation or
semi-interquartile range.

Example:
Find the quartiles and quartile deviation of the following data:

17, 2, 7, 27, 15, 5, 14, 8, 10, 24, 48, 10, 8, 7, 18, 28

Solution:

The given data is: 17, 2, 7, 27, 15, 5, 14, 8, 10, 24, 48, 10, 8, 7, 18, 28

Ascending order of the data: 2, 5, 7, 7, 8, 8, 10, 10, 14, 15, 17, 18, 24, 27, 28, 48

Number of data values = n = 16

Q2 = Median of the given data set

Since n is even, Median = (1/2) [(n/2)th observation + (n/2 + 1)th observation]

= (1/2) [8th observation + 9th observation]

= (10 + 14)/2

= 24/2 = 12

Thus, Q2 = 12

Now, the lower half of the data is: 2, 5, 7, 7, 8, 8, 10, 10 (even number of observations)

Q1 = Median of the lower half of the data

= (1/2) [4th observation + 5th observation]

= (7 + 8)/2

= 15/2 = 7.5

Also, the upper half of the data is: 14, 15, 17, 18, 24, 27, 28, 48 (even number of
observations)

Q3 = Median of the upper half of the data

= (1/2) [4th observation + 5th observation]

= (18 + 24)/2

= 42/2 = 21

Thus, Quartile Deviation = (Q3 – Q1)/2

= (21 – 7.5)/2

= 13.5/2 = 6.75

Thus, the quartile deviation for the given dataset is 6.75.
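The median-of-halves method used in this solution can be sketched in Python as below. The function name `quartiles_by_halves` is illustrative, and the handling of an odd number of observations (leaving the median out of both halves) is an assumption chosen to agree with the seven-number example earlier in this section. Note that this method can give slightly different quartiles from the (n + 1)/4 positional rule.

```python
from statistics import median


def quartiles_by_halves(data):
    """Q1, Q2, Q3 where Q1 and Q3 are the medians of the lower and upper
    halves of the ordered data (the median itself is excluded when n is odd)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower_half), median(ordered), median(upper_half)


data = [17, 2, 7, 27, 15, 5, 14, 8, 10, 24, 48, 10, 8, 7, 18, 28]
q1, q2, q3 = quartiles_by_halves(data)
quartile_deviation = (q3 - q1) / 2
print(q1, q2, q3, quartile_deviation)   # 7.5 12.0 21.0 6.75
```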

Quartile Formula
Q3 is the upper quartile, the median of the upper half of the data.

Q1 is the lower quartile, the median of the lower half of the data.

Q2 is the median.

If the number of data elements is n, the quartiles are given by:

Q1 = [(n + 1)/4]th item

Q2 = [(n + 1)/2]th item

Q3 = [3(n + 1)/4]th item

For grouped data, the formula for the quartile can be written as:

$$Q_r = l_1 + \frac{r\,(N/4) - c}{f}\,(l_2 - l_1)$$

Where Qr is the rth quartile, l1 and l2 are the lower and upper limits of the quartile class,
f is the frequency of the quartile class, N is the total frequency, and c is the cumulative
frequency of the class preceding the quartile class.

2.5.10 Data Distribution – Introduction Skewness


Skewness is a distortion or asymmetry in a set of data that deviates from the
symmetrical bell curve, or normal distribution. The curve is said to be skewed if it is
shifted to the left or right. Skewness can be expressed as a measure of how far a
given distribution deviates from a normal distribution. A normal distribution has a skew
of zero, whereas a lognormal distribution, for example, has some right-skew.

In addition to positive and negative skew, distributions can have zero or
undefined skew. The data on the right side of a distribution curve may taper differently
from the data on the left side. These taperings are referred to as “tails.” Negative skew
means a longer or fatter tail on the left side of the distribution, whereas positive
skew means a longer or fatter tail on the right.

The mean of positively skewed data will be greater than the median. The
exact converse is true of a negatively skewed distribution: the mean of negatively
skewed data is less than the median. The distribution has zero skewness if the data
graph symmetrically, regardless of how long or fat the tails are.
Types of Skewness
Positive Skewness - A distribution is positively skewed if most of its values are
concentrated on the left and its tail extends to the right. It is also known as a
right-skewed distribution. A tail is the tapering end of the curve, away from where
most of the data points lie.

A positively skewed distribution has a skewness value greater than zero.
Since the distribution is skewed to the right, the mean is pulled towards the right tail
and is greater than the median, and the mode occurs at the highest
frequency of the distribution.
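The ordering of mean, median and mode in a positively skewed data set can be checked on a small example. The sketch below reuses the ten observations from the mean-deviation example in Section 2.5.2, which have a long right tail; it relies only on Python's standard `statistics` module.

```python
import statistics

# Ten observations with a few large values on the right (15 and 20)
data = [12, 7, 9, 7, 7, 4, 10, 9, 15, 20]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

print(mean, median, mode)          # 10 9.0 7

# For right-skewed data we expect mode < median < mean
print(mode < median < mean)        # True

# Absolute measure of skewness: mean minus mode (positive for right skew)
print(mean - mode)                 # 3
```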

Negative Skewness - A distribution is negatively skewed if most of its values are
concentrated on the right and its tail extends to the left. It is also called a left-skewed
distribution.

Any distribution with a negative skew has a skewness value that is less than zero.
Since the distribution is skewed to the left, the mean is pulled towards the left tail
and is less than the median, and the mode occurs at the distribution’s highest
frequency.

How to Interpret
(c

Instead of relying solely on the average, skewness takes into account the dataset’s
extremes. As a result, investors consider skewness when predicting the distribution of
investment returns. If an investor retains a stake for the long term, the average of the
Amity Directorate of Distance & Online Education
Computational Statistics 151

data set determines this. As a result, when looking for short-term and medium-term
Notes

e
securities positions, investors must consider extremes.

Typically, investors use standard deviation to forecast returns, and standard

in
deviation assumes a normal distribution with zero skewness. However, because of the
risk of skewness, it is preferable to derive performance estimates based on skewness.
Furthermore, the likelihood of return distributions approaching normality is modest.

Skewness risk arises when a symmetric distribution is applied to skewed data. Financial models that attempt to forecast an asset’s future performance usually assume a normal distribution. Taking the skewness of the data into account, on the other hand, improves the financial model’s accuracy.

If a return distribution has a positive skew, investors can expect frequent small losses and a few large gains. A negatively skewed distribution, on the other hand, implies many small gains and a few large losses on the investment.

As a result, a positively skewed return distribution is often preferred over a negatively skewed one, since the large gains may more than compensate for the frequent but minor losses. Some investors, on the other hand, may prefer investments with a negatively skewed return distribution: they would rather have frequent small gains and a few big losses than frequent small losses and a few big gains.
2.5.11 Measures Skewness
In addition to measures of central tendency and measures of variation, there are two attributes of the frequency distribution of a data set that may be of interest to managers for effective decision-making. These are skewness and kurtosis.

When the distribution stretches more to the right than it does to the left, the distribution is said to be ‘right skewed’ or ‘positively skewed’. Similarly, a left-skewed distribution is one that stretches asymmetrically to the left. Thus, skewness is a measure of the extent of asymmetry of the distribution. In a symmetrical distribution with a single mode, mean = median = mode, and skewness is zero. In the case of positive skewness (i.e., right skewness) the mean is to the right of the median, which in turn lies to the right of the mode. The opposite holds for negative skewness. Skewness can be measured either in absolute terms, as ‘mean minus mode’, or in relative terms. Some of the relative measures are as follows:

1. Karl Pearson’s coefficient of skewness (SKp). It is defined as:

SKp = (Mean – Mode) / Standard Deviation

2. Bowley’s coefficient of skewness (SKB) (quartile coefficient of skewness). It is defined as:

SKB = (Q3 + Q1 – 2Q2) / (Q3 – Q1)

Where, Q is quartile.

3. Kelly’s coefficient of skewness (SKk). It is defined as:

SKk = (P90 + P10 – 2P50) / (P90 – P10)

Where, P is percentile.

Skewness is also defined in terms of the moments about the mean. One such measure is defined as:

β1 = μ3² / μ2³

Where, μ2 and μ3 are the second and third moments about the mean.

4. Lorenz Curve: It is a special type of graph designed to describe how much a certain distribution varies from a completely uniform distribution. It is a cumulative percentage curve comparing the population and the factor under study. For example, we could plot the cumulative percentage of population against the cumulative percentage of their wealth. The Lorenz curve is very useful for comparing two populations, particularly when their means and standard deviations are the same.

Karl-Pearson’s Coefficient of Skewness


Pearson’s coefficient of skewness is a method developed by Karl Pearson to find skewness in a sample using descriptive statistics such as the mean, the mode and the median. Skewness is one measure of the shape of a set of data. In its mode form, the coefficient is the difference between the mean and the mode divided by the standard deviation. In its median form, the difference between the mean and the median is multiplied by three and the result is divided by the standard deviation.

The formula when using the mode is -

Sk1 = (x̄ – Mo) / s

Where x̄ = the mean, Mo = the mode and s = the standard deviation for the sample. The formula when using the median is -

Sk2 = 3(x̄ – Md) / s

Where x̄ = the mean, Md = the median and s = the standard deviation for the sample.
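Both forms of Pearson’s coefficient can be checked quickly on a small data set. The sketch below is only an illustration: the list `marks` is hypothetical data, the function name `pearson_skewness` is our own, and the sample standard deviation (divisor n – 1) is assumed for s.

```python
import statistics


def pearson_skewness(data):
    """Return (mode-based, median-based) Pearson coefficients of skewness."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    mode = statistics.mode(data)   # most frequent value in the data
    s = statistics.stdev(data)     # sample standard deviation (n - 1 divisor)
    sk_mode = (mean - mode) / s
    sk_median = 3 * (mean - median) / s
    return sk_mode, sk_median


# Hypothetical data with a long right tail
marks = [2, 3, 3, 3, 4, 4, 5, 6, 8, 12]
sk_mode, sk_median = pearson_skewness(marks)
print(round(sk_mode, 3), round(sk_median, 3))
```

Both values come out positive for this data, which agrees with the long tail of high values.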

Bowley’s Coefficient of Skewness

Bowley skewness is a method to figure out whether a distribution is positively or negatively skewed. It is used as an alternative way to find out more about the asymmetry of a distribution. It is very useful if there are extreme data values, i.e., outliers, or if the distribution is open-ended.

Bowley Skewness = (Q3 + Q1 – 2Q2) / (Q3 – Q1)

Skewness = 0 means that the curve is symmetrical.

Skewness > 0 means the curve is positively skewed.

Skewness < 0 means the curve is negatively skewed.

In a symmetric distribution, like the normal distribution, the first (Q1) and third (Q3) quartiles are at equal distances from the median (Q2). In other words, (Q3 – Q2) and (Q2 – Q1) will be equal. In a skewed distribution there will be a difference between those two values.

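Limitations of Bowley Skewness
The numerator of Bowley’s formula, Q3 + Q1 – 2Q2, is an absolute measure of skewness: it gives a result in the units of the data, so it cannot be used to compare the skewness of distributions measured in different units. Dividing by (Q3 – Q1) removes the units and gives the coefficient, which lies between –1 and +1. Because the coefficient uses only the quartiles, it reflects the middle 50% of the data and ignores the shape of the tails. Compare this with the Pearson mode skewness, which is made dimensionless by dividing by the standard deviation.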

Example:
Find the Bowley’s coefficient of the data

Pets         Families    Cumulative Frequency
0            60          60
1            60          120
2            50          170
3            20          190
4            25          215
5            10          225
6 or more    5           230
Solution –
Step 1: Find the positions of the quartiles in the ordered data using the following formulas, where N = 230 is the total frequency:

Q1 = ((N + 1) / 4)th observation = ((230 + 1) / 4)th = 57.75th observation

Q2 = ((N + 1) / 2)th observation = ((230 + 1) / 2)th = 115.5th observation

Q3 = (3(N + 1) / 4)th observation = (3(230 + 1) / 4)th = 173.25th observation

Step 2: Look in the cumulative frequency column to find the value at each position calculated in Step 1:

Q1 = 57.75th observation = 0

Q2 = 115.5th observation = 1

Q3 = 173.25th observation = 3

Step 3: Plug the above values into the formula:

Skq = (Q3 + Q1 – 2Q2) / (Q3 – Q1)

Skq = (3 + 0 – 2(1)) / (3 – 0) = 1/3

Skq = + 1/3, so the distribution is positively skewed.
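The three steps above can be sketched in a few lines of Python. This is an illustration rather than a standard routine: the helper names are our own, the open-ended class “6 or more” is treated as the value 6 (an assumption that does not affect the quartiles here), and the quartile positions follow the (N + 1) rule used in the solution.

```python
from bisect import bisect_left
from itertools import accumulate


def value_at_position(values, freqs, position):
    """Return the value found at a (1-based) position in the ordered data."""
    cumulative = list(accumulate(freqs))
    # first class whose cumulative frequency reaches the position
    return values[bisect_left(cumulative, position)]


def bowley_skewness(values, freqs):
    """Bowley's coefficient from a discrete frequency table."""
    n = sum(freqs)
    q1 = value_at_position(values, freqs, (n + 1) / 4)
    q2 = value_at_position(values, freqs, (n + 1) / 2)
    q3 = value_at_position(values, freqs, 3 * (n + 1) / 4)
    return (q3 + q1 - 2 * q2) / (q3 - q1)


pets = [0, 1, 2, 3, 4, 5, 6]            # "6 or more" treated as 6 (assumption)
families = [60, 60, 50, 20, 25, 10, 5]
skq = bowley_skewness(pets, families)
print(round(skq, 4))
```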

2.5.12 Calculation Importance Skewness

Example:

The first four central moments of a distribution are 0, 2.5, 0.7 and 18.75. Test the skewness and kurtosis of the distribution.

Testing Skewness

We are given μ1 = 0, μ2 = 2.5, μ3 = 0.7 and μ4 = 18.75

Skewness is measured by the coefficient β1

β1 = μ3² / μ2³

Here μ2 = 2.5, μ3 = 0.7

β1 = (0.7)² / (2.5)³ = 0.49 / 15.625 = 0.031

Since β1 = + 0.031 is close to zero and μ3 is positive, the distribution is slightly positively skewed.

2.5.13 Moments
Moments are statistical measures used to describe a distribution. Four moments are frequently used:

●● 1st, Mean: the average.
●● 2nd, Variance:
◌◌ The standard deviation is the square root of the variance: it indicates how widely the values are spread around the mean. A low standard deviation indicates that the data are all similar. If the distribution is normal, about 68 percent of the values will be within one standard deviation of the mean.
●● 3rd, Skewness: It measures the asymmetry of a distribution about its peak; it is a number that describes the distribution’s shape.
◌◌ It is often approximated by Skew = (Mean - Median) / (Std dev).
◌◌ When skewness is positive, the mean exceeds the median and the distribution has a long tail of high values.
◌◌ If the skewness is negative, the mean is less than the median and the distribution has a long tail of low values.
●● 4th, Kurtosis: It measures the peakedness or flatness of a distribution.
◌◌ Positive kurtosis indicates a thin, pointed distribution.
◌◌ Negative kurtosis indicates a broad, flat distribution.

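The following is the formula for calculating moments. The r-th moment about the mean is:

μr = Σ(x – x̄)^r / N (for individual observations)

μr = Σf(x – x̄)^r / Σf (for a frequency distribution)

Where x̄ = the mean, N = the number of observations and f = the frequency of each value. For r = 1, 2, 3 and 4 this gives μ1 (always zero), μ2 (the variance), μ3 and μ4.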
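The r-th moment about the mean is simply the average of the r-th powers of the deviations from the mean, so it can be computed directly. The sketch below uses a small hypothetical data set (`sample`) and a function name of our own choosing; it divides by N, that is, it treats the data as a whole population.

```python
def central_moment(data, r):
    """r-th moment about the mean: average of (x - mean) ** r."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** r for x in data) / n


sample = [2, 4, 4, 5, 7, 8]   # hypothetical observations, mean = 5

for r in range(1, 5):
    print(f"mu{r} =", central_moment(sample, r))
```

The first moment about the mean is always zero, and the second is the (population) variance.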
Use of moments in statistics
●● Moments, like the mean, variance, and so on, are population constants. These constants aid in determining the features of the population, and a population is discussed on the basis of these features.
●● Moments aid in directly determining the population’s arithmetic mean, standard deviation, and variance, as well as the graphic form of the population.
●● Moments can be thought of as constants used in determining the graphic shape of a population, and the graphic shape of a population helps a lot in characterising it.
●● Moments are used to calculate a distribution’s central tendency, dispersion, skewness, and kurtosis.
●● The term “moment” is frequently used in physics, where it measures the turning effect of a force about a given point. The moment of a force about any point is calculated as the product of the force’s magnitude and the perpendicular distance between the point and the line of action of the force.
●● Moments are used in statistics to understand the various properties of a frequency distribution. The central tendency, dispersion, skewness, and kurtosis of a distribution can be investigated using moments.
●● The first two moments determine the normal distribution. Other distribution families can also be determined by their moments. Equating sample moments to population moments is one way of estimating parameters (called the method of moments).

2.5.14 Introduction to Kurtosis


Kurtosis is a measure of the peakedness of a distribution. The larger the kurtosis, the more peaked the distribution. Kurtosis is calculated either as an absolute or a relative value. Absolute kurtosis is always a positive number. The absolute kurtosis of a normal distribution (symmetric bell-shaped distribution) is taken as 3. Relative kurtosis can be calculated as follows:

Relative kurtosis = Absolute kurtosis – 3

◌◌ Relative kurtosis can be negative. Managers usually work with relative kurtosis.
◌◌ Negative relative kurtosis indicates a distribution flatter than the normal distribution, called platykurtic.
◌◌ Positive relative kurtosis indicates a more peaked curve, called leptokurtic.
◌◌ A distribution with the peakedness of the normal distribution (relative kurtosis of zero) is called mesokurtic.

Example:
The first four central moments of a distribution are 0, 2.5, 0.7 and 18.75. Test the kurtosis of the distribution.

Testing Kurtosis:

For testing kurtosis we compute the value of β2.

β2 = μ4 / μ2²

When a distribution is normal, β2 = 3.

When a distribution is more peaked than the normal, β2 is more than 3, and when it is less peaked than the normal, β2 is less than 3.

Here μ2 = 2.5, μ4 = 18.75

β2 = 18.75 / (2.5)² = 18.75 / 6.25 = 3

Since β2 is exactly three, the distribution is mesokurtic.
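The two moment coefficients used in this example and in the skewness example with the same moments, β1 = μ3² / μ2³ and β2 = μ4 / μ2², can be verified with a short sketch. The function names and the tolerance used to decide that β2 “equals” 3 are our own assumptions.

```python
def shape_coefficients(mu2, mu3, mu4):
    """Return (beta1, beta2) from the central moments mu2, mu3 and mu4."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    return beta1, beta2


def classify_kurtosis(beta2, tol=1e-9):
    """Label a distribution by comparing beta2 with 3 (the normal value)."""
    if abs(beta2 - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if beta2 > 3 else "platykurtic"


beta1, beta2 = shape_coefficients(mu2=2.5, mu3=0.7, mu4=18.75)
print(round(beta1, 3), beta2, classify_kurtosis(beta2))
```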

2.5.15 Summary Module 2


●● While measures of central tendency are useful for estimating “normal” values in a dataset, measures of dispersion are useful for describing the spread of the data, or its variation around a central value. Two separate samples may have the same mean or median but very different levels of variability, or vice versa.
●● Both of these features should be included in a proper description of a set of data. There are several ways of measuring the dispersion of a dataset, each with its own advantages and disadvantages.
●● Dispersion (also known as variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. The variance, standard deviation, and interquartile range are common measures of statistical dispersion. When the variance of a data set is high, for example, the data are widely dispersed. When the variance is small, the data are clustered.

●● Dispersion is contrasted with location or central tendency, and together they are the most commonly used properties of distributions.
●● Measures: A measure of statistical dispersion is a nonnegative real number that is zero if all the data are the same and increases as the data become more diverse.
●● Most dispersion measures use the same units as the quantity being measured. In other words, if the measurements are in metres or seconds, the measure of dispersion is also in metres or seconds. Standard deviation, interquartile range (IQR), range, mean absolute difference (also known as Gini mean absolute difference), median absolute deviation (MAD), average absolute deviation (or simply average deviation) and distance standard deviation are all examples of such measures.
●● These are typically used (together with scale factors) as estimators of scale parameters, in which capacity they are referred to as estimates of scale. The IQR and MAD are examples of robust measures of scale that are unaffected by a small number of outliers.
●● All of the preceding measures of statistical dispersion have the useful property of being location-invariant and linear in scale. This means that if a random variable X has a dispersion of SX, then a linear transformation Y = aX + b for real a and b has a dispersion SY = |a|SX, where |a| is the absolute value of a, that is, a without any negative sign.
●● Other dispersion measures are dimensionless. In other words, even if the variable has units, they do not. Examples are the coefficient of variation, the quartile coefficient of dispersion, and the relative mean difference, which is equal to twice the Gini coefficient.
●● Entropy - The entropy of a discrete variable is location-invariant and scale-independent, and hence is not a measure of dispersion in the above sense. The entropy of a continuous variable, by contrast, is location-invariant and additive in scale: if Hz is the entropy of the continuous variable z, and z = ax + b, then Hz = Hx + log |a|.
●● Other dispersion measures include:
◌◌ Variance (the square of the standard deviation): location-invariant but not linear in scale.
◌◌ Variance-to-mean ratio: mostly used for count data, where it is called the coefficient of dispersion. The ratio is dimensionless because count data are themselves dimensionless.
●● Some dispersion measures serve specific needs. The Allan variance can be used in applications where noise disrupts convergence. The Hadamard variance can be used to counteract sensitivity to linear frequency drift.
●● For categorical variables, it is less common to measure dispersion by a single number; see qualitative variation. The discrete entropy is one such measure.
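The location-invariance and scale property stated in this summary (a dispersion measure S satisfies SY = |a|SX when Y = aX + b) can be checked numerically. The sketch below uses hypothetical data and arbitrary constants a and b, with the population standard deviation as the dispersion measure; it also shows that the variance scales with a² instead, which is why it is not linear in scale.

```python
import statistics

x = [12, 15, 11, 18, 14]          # hypothetical observations
a, b = -3, 7                      # arbitrary constants for Y = aX + b
y = [a * xi + b for xi in x]

sd_x = statistics.pstdev(x)
sd_y = statistics.pstdev(y)
var_x = statistics.pvariance(x)
var_y = statistics.pvariance(y)

print(sd_y, abs(a) * sd_x)        # the two standard deviations agree
print(var_y, a ** 2 * var_x)      # the variance scales with a squared
```

Adding b shifts every value equally and so leaves the spread unchanged; only |a| matters.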

Check Your Understanding

Multiple Choice Question

1. The mean deviation and standard deviation determine how far the numbers deviate from the __________.
a) Data
b) Average
c) Range
d) Mean Deviation
2. ______________ reports are frequently enhanced and complemented by giving a measure of dispersion.
a) Central Tendency
b) Dispersion
c) Standard deviation
d) SD Variance
3. A single extreme score may result in a significantly _________ standard deviation, especially if the sample is small.
a) Lower
b) Equal
c) Greater
d) Constant
4. _____________ is the average of the difference between upper quartile and lower quartile.
a) Standard Deviation
b) Mean Deviation
c) Range
d) Quartile Deviation
5. _____________ is the root mean square deviation of the values from their arithmetic mean. It is denoted by the symbol σ (read sigma).
a) Mean Deviation
b) Standard Deviation
c) SD Variance
d) Quartile Deviation

State True and False

1. The absolute dispersion approach expresses changes in terms of the average of observational deviations, such as standard or mean deviations.
2. The variance is calculated by subtracting the mode from each data point in the set, then squaring each of them, adding each square, and then dividing them by the total number of values in the data set.
3. The quartile deviation is equal to one-third of the difference between the third and first quartiles.
4. The mean is the average of numbers, and the mean deviation is the arithmetic mean of the absolute departures of the observations from a measure of central tendency.
5. When comparing two series with different measurement units, the dispersion coefficient is also utilised. It is abbreviated as C.D.

Summary
●● Range is an absolute measure of variability. However, if we have to compare two sets of data, the range may not give a true picture. In such cases a relative measure of range, called the coefficient of range, is used.
●● Inter-quartile range is the difference between the upper quartile (third quartile) and the lower quartile (first quartile). Thus, Inter Quartile Range = (Q3 - Q1)
●● Quartile Deviation is half of the difference between the upper quartile and the lower quartile.

Formulae: Thus, Quartile Deviation = QD = (Q3 - Q1)/2

●● Mean Deviation (MD) is the arithmetic mean of the absolute deviations of the observations from their arithmetic mean (or the median or the mode).
●● Coefficient of mean deviation = Mean Deviation / (Mean or Median or Mode). It can also be expressed as a percentage by multiplying it by 100.
●● When the data constitute a sample, the variance is denoted by σx² and the averaging is done by dividing the sum of the squared deviations from the mean by ‘n – 1’. When the observations constitute the population, the variance is denoted by σ² and we divide by N for the average.
●● Standard Deviation is the root mean square deviation of the values from their arithmetic mean. S.D. is denoted by the symbol σ (read sigma). The Standard Deviation (SD) of a set of data is the positive square root of the variance of the set. It is also referred to as the Root Mean Square (RMS) value of the deviations of the data points.
●● In a large group of students 80% have a recommended statistics book. Three students are selected at random. Find the probability distribution of the number of students having the book. Also compute the mean and variance of the distribution.
●● Dispersion is the degree to which values in a distribution deviate from the distribution’s average. It gives us an indication of how much individual items differ from one another and from the central value.
●● Range is the most basic measure of dispersion, defined as the difference between the largest and smallest items in a particular distribution.
●● Quartile deviation is also referred to as the semi-interquartile range, which is half of the difference between the upper and lower quartiles. The first quartile, denoted by Q1, is the middle value between the lowest value and the median of the data.
●● The standard deviation is an absolute measure of dispersion. It is expressed in the units in which the original data are collected and stated. The standard deviation of plant heights cannot be compared with the standard deviation of grain weights, since they are expressed in different units, namely centimetres and kilogrammes.

●● Quartile deviation is defined as half the difference between the third and first quartiles. It is often referred to as the semi-interquartile range.
●● Skewness is a distortion or asymmetry in a set of data that deviates from the symmetrical bell curve, or normal distribution. The curve is said to be skewed if it is displaced to the left or right. Skewness can be expressed as a measure of how far a given distribution deviates from a normal distribution.
●● Skewness risk arises when a symmetric distribution is applied to skewed data. Financial models that attempt to forecast an asset’s future performance usually assume a normal distribution. Taking the skewness of the data into account, on the other hand, improves the financial model’s accuracy.
●● Pearson’s coefficient of skewness is a method developed by Karl Pearson to find skewness in a sample using descriptive statistics such as the mean, the mode and the median. Skewness is one measure of the shape of a set of data. In its median form, Pearson’s coefficient is calculated by multiplying the difference between the mean and the median by three and dividing the result by the standard deviation.
●● Kurtosis is a measure of the peakedness of a distribution. The larger the kurtosis, the more peaked the distribution. Kurtosis is calculated either as an absolute or a relative value. Absolute kurtosis is always a positive number.
Activity
1. Prepare an MS PowerPoint presentation explaining the major differences between Mean Deviation and Standard Deviation.
2. Give real-life examples of how Range is used in daily life.

Question and Answer

1. Explain the term range as a measure of dispersion.
2. What are mean deviation, standard deviation and variance, and how are they calculated?
3. Explain combined mean, coefficient of variation and quartile deviation.
4. What are data distribution and skewness?
5. What is the importance of skewness?
6. Explain the terms moments and kurtosis.
Glossary
●● Central tendency: It is a typical or central value for a probability distribution in statistics.
●● Median - It is the value in the centre that separates the data set’s upper and lower halves. The median and mode are the only measures of central tendency that can be used with ordinal data, where values are ordered relative to one another but not measured absolutely.
●● Mode – It is the most frequently occurring value in the data set. This is the only measure of central tendency that can be applied to nominal data with purely qualitative category assignments.
●● Geometric mean - It is the nth root of the product of the data values, where n is the number of data values. This measure is only applicable to data that are measured on a strictly positive scale.
●● Weighted arithmetic mean – It is an arithmetic mean that includes weighting for certain data items.
●● Truncated mean or trimmed mean – It is the arithmetic mean of data values after a predetermined number or proportion of the highest and lowest data values have been removed.
●● Interquartile mean – It is a truncated mean calculated using data from the interquartile range.
●● Mean deviation: The arithmetic mean (average) of the absolute deviations of observations from a central value (mean or median) is defined as mean deviation.
●● Quartile deviation: It is referred to as the semi-interquartile range, which is half of the difference between the upper and lower quartiles.
●● Standard deviation: The square root of the arithmetic average of the squares of the deviations measured from the mean is the standard deviation. The standard deviation is expressed as: σ = √(Σ(x – x̄)² / N)

Further Reading
1. Richard I. Levin, David S. Rubin, Sanjay Rastogi, Masood Husain Siddiqui, Statistics for Management, Pearson Education, 7th Edition, 2016.
2. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
3. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2016.
Check Your Understanding Answer

Multiple Choice Question

1. b) Average
2. a) Central Tendency
3. c) Greater
4. d) Quartile Deviation
5. b) Standard Deviation

State True and False
1. True
2. False
3. False
4. True
5. True

Unit - 2.6: Industry Example for Dispersion

Objectives:

At the end of this unit, you will be able to:

●● Understand the business problem
●● Understand the calculation of business metrics
●● Understand insights from graphs
●● Understand data interpretation

Introduction
Recent research has demonstrated huge diversity in business performance and growth, which both drives and is driven by large reallocations of inputs and outputs between firms (churning) within industries and markets. These differences in business-level outcomes and the associated firm turnover affect, and are affected by, numerous economic policies (both labour- and non-labour-oriented), on both a microeconomic and a macroeconomic scale. To evaluate these policies properly, one must be familiar with the sources and effects of firm-level variation and within-industry reallocation.
Example: The tax authority collected the following amounts of tax from different firms in a particular market.

Amount of Taxes (in ‘000 ₹)    10    11    12    13    14
Number of Firms                3     12    18    12    3

Calculate the quartile deviation and the coefficient of quartile deviation.
U

Solution:
Calculation of Quartile Deviation

Amount of Taxes (in ‘000 ₹)    Number of Firms (f)    Cumulative Frequency
10                             3                      3
11                             12                     15
12                             18                     33
13                             12                     45
14                             3                      48
                               Σf = 48

Here N = 48

Q1 = Size of ((N + 1) / 4)th item

= Size of ((48 + 1) / 4)th item

= Size of 12.25th item = 11 (in ‘000 rupees)

Q3 = Size of (3(N + 1) / 4)th item

= Size of (3(48 + 1) / 4)th item

= Size of 36.75th item = 13 (in ‘000 rupees)

Quartile Deviation = (Q3 – Q1) / 2

= (13 – 11) / 2

= 1 (in ‘000 rupees)

Coefficient of Quartile Deviation = (Q3 – Q1) / (Q3 + Q1)

= (13 – 11) / (13 + 11)

= 0.083
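The same calculation can be sketched in Python for any discrete frequency table. The helper `item_at` and the other names are our own; quartile positions follow the (N + 1) rule used in the solution, and the value is read from the first class whose cumulative frequency reaches the position.

```python
from bisect import bisect_left
from itertools import accumulate


def item_at(values, freqs, position):
    """Value of the item at a (1-based) position in the ordered data."""
    cumulative = list(accumulate(freqs))
    return values[bisect_left(cumulative, position)]


def quartile_deviation(values, freqs):
    """Return (Q1, Q3, quartile deviation, coefficient of quartile deviation)."""
    n = sum(freqs)
    q1 = item_at(values, freqs, (n + 1) / 4)
    q3 = item_at(values, freqs, 3 * (n + 1) / 4)
    return q1, q3, (q3 - q1) / 2, (q3 - q1) / (q3 + q1)


taxes = [10, 11, 12, 13, 14]      # in '000 rupees
firms = [3, 12, 18, 12, 3]
q1, q3, qd, coeff_qd = quartile_deviation(taxes, firms)
print(q1, q3, qd, round(coeff_qd, 3))
```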

2.6.1 Business Problem Introduction

Example: A study of 180 companies gives the following information

Profit (in ₹ crores)    0-10    10-20    20-30    30-40    40-50    50-60
No. of Companies        10      20       30       50       40       30

Calculate the standard deviation of the profit earned.

(i) Actual Mean Method
(ii) Step-Deviation Method

Solution:
Calculation of Standard Deviation

Profit (in ₹ crores)   No. of Companies (f)   m    fm     d = (m – 40)   d’ = (m – 45)/10   fd      fd²      fd’     fd’²
0-10                   10                     5    50     -35            -4                 -350    12250    -40     160
10-20                  20                     15   300    -25            -3                 -500    12500    -60     180
20-30                  30                     25   750    -15            -2                 -450    6750     -60     120
30-40                  50                     35   1750   -5             -1                 -250    1250     -50     50
40-50                  40                     45   1800   5              0                  200     1000     0       0
50-60                  30                     55   1650   15             1                  450     6750     30      30
Total                  180                         6300   Σd = -60       Σd’ = -9           -900    40500    -180    540

Applying Actual Mean Method

Mean (x̄) = Σfm / Σf = 6300 / 180 = 35

Standard Deviation (σx) = √(Σfx² / Σf), where x = (m – x̄) is the deviation of each mid-value from the actual mean

Σfx² = 36000, Σf = 180

Therefore, σx = √(36000/180) = √200 = 14.142 (in rupees crores)

Applying Step Deviation Method

d’ = (m – 45)/10, Σfd’² = 540, Σfd’ = -180, Σf = 180, c = 10

σx = c × √(Σfd’² / Σf – (Σfd’ / Σf)²)

= 10 × √(540/180 – (-180/180)²)

= 10 × √(3 – 1)

= 14.142 (in rupees crores)
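As a check, both methods can be coded and compared. The sketch below uses the class mid-values and frequencies from the table; the function names are our own, and the assumed mean (45) and class width (10) are the ones chosen in the solution. Any other assumed mean should give the same answer.

```python
from math import sqrt

midpoints = [5, 15, 25, 35, 45, 55]     # class mid-values m
freqs = [10, 20, 30, 50, 40, 30]        # number of companies f


def sd_actual_mean(m, f):
    """Standard deviation using deviations from the actual mean."""
    n = sum(f)
    mean = sum(fi * mi for fi, mi in zip(f, m)) / n
    return sqrt(sum(fi * (mi - mean) ** 2 for fi, mi in zip(f, m)) / n)


def sd_step_deviation(m, f, assumed_mean, width):
    """Standard deviation using step deviations d' = (m - A) / c."""
    n = sum(f)
    d = [(mi - assumed_mean) / width for mi in m]
    mean_d = sum(fi * di for fi, di in zip(f, d)) / n
    mean_d2 = sum(fi * di ** 2 for fi, di in zip(f, d)) / n
    return width * sqrt(mean_d2 - mean_d ** 2)


sd1 = sd_actual_mean(midpoints, freqs)
sd2 = sd_step_deviation(midpoints, freqs, assumed_mean=45, width=10)
print(round(sd1, 3), round(sd2, 3))
```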

2.6.2 Calculation Business Metrics

Example: The following table shows the daily wages of a random sample of construction workers. Calculate its mean deviation and standard deviation.

Daily Wages (₹)    Number of Workers
200-399            5
400-599            15
600-799            25
800-999            30
1000-1199          18
1200-1399          7
Total              100

Solution:

Computation of Mean Deviation

Daily Wages (₹)    Number of Workers (f)    Class Mark (m)    fm          f|m – x̄| = f|m – 823.5|
200-399            5                        299.5             1497.50     2620
400-599            15                       499.5             7492.50     4860
600-799            25                       699.5             17487.50    3100
800-999            30                       899.5             26985.00    2280
1000-1199          18                       1099.5            19791.00    4968
1200-1399          7                        1299.5            9096.50     3332
Total              100                                        82350.00    21,160

Mean (x̄) = Σfm / Σf = 82350 / 100 = 823.5

Mean Deviation = Σf|m – x̄| / Σf

= 21,160/100

= 211.60

Computation of Standard Deviation

Daily Wages (₹)    Number of Workers (f)    Class Mark (m)    f|m – x̄|²
200-399            5                        299.5             1,372,880
400-599            15                       499.5             1,574,640
600-799            25                       699.5             384,400
800-999            30                       899.5             173,280
1000-1199          18                       1099.5            1,371,168
1200-1399          7                        1299.5            1,586,032
Total              100                                        6,462,400

Standard Deviation = √(6462400/100) = 254.21 (Rupees)
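The two tables can be reproduced with a short sketch. The variable and function names are our own; as in the solution, the class marks stand in for the individual wages and the divisor is Σf (not Σf – 1).

```python
from math import sqrt

class_marks = [299.5, 499.5, 699.5, 899.5, 1099.5, 1299.5]
workers = [5, 15, 25, 30, 18, 7]


def grouped_summary(m, f):
    """Return (mean, mean deviation about the mean, standard deviation)."""
    n = sum(f)
    mean = sum(fi * mi for fi, mi in zip(f, m)) / n
    mean_dev = sum(fi * abs(mi - mean) for fi, mi in zip(f, m)) / n
    std_dev = sqrt(sum(fi * (mi - mean) ** 2 for fi, mi in zip(f, m)) / n)
    return mean, mean_dev, std_dev


mean, mean_dev, std_dev = grouped_summary(class_marks, workers)
print(mean, round(mean_dev, 2), round(std_dev, 2))
```

The standard deviation is larger than the mean deviation because squaring gives more weight to the wages far from the mean.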


Example: The following table shows the summary statistics for the daily wages of two types of workers.

Worker’s Type    Mean Daily Wages    Standard Deviation
I                ₹ 100               ₹ 20
II               ₹ 150               ₹ 24

Compare these two daily wages distributions.

Solution:
Calculation of coefficient of variations

Coefficient of Variation (CV) = (Standard Deviation / Mean) × 100

In comparison        Distribution    Reason
Average magnitude    II > I          x̄II = 150 > x̄I = 100
Variation            I > II          CVI = (20/100) × 100 = 20% > CVII = (24/150) × 100 = 16%
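The comparison above rests on one formula, CV = (standard deviation / mean) × 100, which the sketch below applies to both worker types. The function name is our own.

```python
def coefficient_of_variation(mean, std_dev):
    """Coefficient of variation as a percentage of the mean."""
    return std_dev / mean * 100


cv_type_1 = coefficient_of_variation(mean=100, std_dev=20)
cv_type_2 = coefficient_of_variation(mean=150, std_dev=24)

# The distribution with the larger CV is relatively more variable.
more_variable = "I" if cv_type_1 > cv_type_2 else "II"
print(cv_type_1, cv_type_2, more_variable)
```

Type II has the larger standard deviation in rupees, yet Type I is relatively more variable once the spread is expressed as a share of the mean.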
Example: Prices of shares of a company were noted as under from Monday through Saturday. Find the range and the coefficient of range.

Day          Mon    Tues    Wed    Thurs    Fri    Sat
Price (₹)    200    210     208    160      220    250

Solution:

Here,

Highest value among the prices of shares = 250

Lowest value among the prices of shares = 160

Range = Highest value (H) – Lowest value (L)

or, R = 250 – 160 = 90

Coefficient of Range (CR) = (H – L)/(H + L)

or, CR = (250 – 160)/(250 + 160) = 90/410

Thus, CR = 0.2195 or 0.22 (approx.)

Thus, the Range (R) of the above data is 90 and the Coefficient of Range (CR) is 0.22

Example: The share market has been bullish during the last several months. Collect data on the share prices of any two important companies during the past six months. Calculate the range of share prices. Comment on how volatile the share prices are.

Solution:

Month    Price of shares (Tata Motors)    Price of shares (Reliance)
Oct      325                              913.35
Nov      397                              900.25
Dec      405                              750.90
Jan      415                              780.70
Feb      420                              799.25
Mar      388                              850.35

For Tata Motors

Highest value = 420

Lowest value = 325

Range = Highest value (H) – Lowest value (L)

or, R1 = 420 – 325 = 95

Coefficient of Range (CR1) = (H – L)/(H + L)

= (420 – 325)/(420 + 325)

= 95/745 = 0.128

For Reliance

Highest value = 913.35

Lowest value = 750.90

Range = Highest value (H) – Lowest value (L)

or, R2 = 913.35 – 750.90 = 162.45

Coefficient of Range (CR2) = (H – L)/(H + L)

= (913.35 – 750.90)/(913.35 + 750.90)

= 162.45/1664.25 = 0.098

From the above results, we can observe that the absolute range is larger for Reliance (162.45 against 95), but the two shares trade at very different price levels. On the relative measure, the coefficient of range is higher for Tata Motors (0.128 against 0.098), so the prices of Tata Motors were relatively more volatile than the prices of Reliance over this period.
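The range and coefficient of range for both shares can be computed in one pass. The sketch below uses the monthly prices from the table; the function and variable names are our own.

```python
def range_and_coefficient(prices):
    """Return (range, coefficient of range) for a list of prices."""
    high, low = max(prices), min(prices)
    return high - low, (high - low) / (high + low)


tata_motors = [325, 397, 405, 415, 420, 388]
reliance = [913.35, 900.25, 750.90, 780.70, 799.25, 850.35]

range_tata, cr_tata = range_and_coefficient(tata_motors)
range_reliance, cr_reliance = range_and_coefficient(reliance)

print(range_tata, round(cr_tata, 3))
print(round(range_reliance, 2), round(cr_reliance, 3))
```

Comparing the two results shows why the relative measure matters: the share with the larger absolute range is not the one with the larger coefficient of range.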
m

2.6.3 Insights Graphs


One of the most convincing and appealing ways in which statistical results may be represented is through graphs and diagrams.

Graphs and diagrams are widely used for the following reasons:

(i) Graphs and diagrams are attractive to the eye.
(ii) They are easier to remember.
(iii) They facilitate easy comparison of data from one period to another.
(iv) Graphs and diagrams give a bird’s eye view of the entire data and therefore convey meaning very quickly.

a. Bar Diagram

In a bar diagram, only the length of the bar is taken into account, not the width. In other words, a bar is a thick line whose width is shown merely for visibility; only its length carries information, which is why it is called a one-dimensional diagram.

Simple Bar Diagram
It represents only one variable. Since the bars are of the same width and vary only in length (height), comparison becomes very easy. Simple bar diagrams are very popular in practice. A bar chart can be either vertical or horizontal; for example, sales, production, population figures etc. for various years may be shown by simple bar charts.

Illustration - 1

The following table gives the birth rate per thousand of different countries over a certain period of time.

Country       India    Germany    U. K.    New Zealand    Sweden    China
Birth Rate    33       16         20       30             15        40

Comparing the size of the bars, China’s birth rate is the highest and India’s is next, whereas Germany and Sweden occupy the lowest positions.
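A simple bar diagram can be mimicked in plain text by drawing one symbol per unit, which makes the “length only” idea concrete. The sketch below uses the birth-rate figures from the table; the function name and the choice of “#” as the bar symbol are our own, and a plotting library would normally be used for a real chart.

```python
birth_rates = {
    "India": 33,
    "Germany": 16,
    "U.K.": 20,
    "New Zealand": 30,
    "Sweden": 15,
    "China": 40,
}


def text_bar_chart(data, symbol="#"):
    """Return a horizontal bar chart with one symbol per unit of value."""
    width = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        lines.append(f"{label:<{width}} | {symbol * value} {value}")
    return "\n".join(lines)


chart = text_bar_chart(birth_rates)
print(chart)

highest = max(birth_rates, key=birth_rates.get)
lowest = min(birth_rates, key=birth_rates.get)
```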

Illustration 2 - Represent the data by using a simple bar diagram.

Countries:                          A     B     C     D     E     F
Production of Rice (000’s tons):    38    42    29    28    18    11
Sub-divided Bar Diagram
In a subdivided bar diagram, each bar representing the magnitude of a given value is further subdivided into its various components. Each component occupies a part of the bar proportional to its share in the total.

Illustration -

Present the following data in a sub-divided bar diagram.

Year/Faculty    Science    Humanities    Commerce
2014-2015       240        560           220
2015-2016       280        610           280

Illustration – 2

The number of students in University X during 2008 to 2011 is as follows. Represent the data by a similar diagram.

Year          Arts     Commerce   Science   Total
2008 - 2009   20,000   10,000     5,000     35,000
2009 - 2010   26,000   9,000      7,000     42,000
2010 - 2011   31,000   9,500      7,500     48,000

Multiple Bar Diagram
In a multiple bar diagram, two or more sets of related data are represented, and the components are shown as separate adjoining bars. The height of each bar represents the actual value of the component. The components are distinguished by different shades or colours.

Illustration 1 - Construct a suitable bar diagram for the following data on the number of students in two different colleges in different faculties.

College   Arts   Science   Commerce   Total
A         1200   800       600        2600
B         700    500       600        1800

Fig: A multiple bar diagram showing numbers of students in two different colleges
in different departments.

Illustration 2

Represent the following results of the III semester B.B.A. examination of Mangalore University, held in May 2006, 2007 and 2008, in a multiple bar diagram.

Year   Class I   Class II   Class III   Failed
2006   100       300        500         300
2007   120       400        600         280
2008   100       500        700         300

Percentage Bar Diagram

In a percentage bar diagram, the length of every bar is kept equal to 100 (hundred). The segments of each bar may vary in size, and each represents a component as a percentage of the aggregate.
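Converting absolute components into the segments of a percentage bar is simple arithmetic. The sketch below applies it to the University X student numbers used in the sub-divided bar diagram illustration; the function name `to_percentages` and the rounding to one decimal place are choices made for this example only.

```python
# Student numbers for University X, from the sub-divided bar diagram illustration.
students = {
    "2008-2009": {"Arts": 20000, "Commerce": 10000, "Science": 5000},
    "2009-2010": {"Arts": 26000, "Commerce": 9000, "Science": 7000},
    "2010-2011": {"Arts": 31000, "Commerce": 9500, "Science": 7500},
}


def to_percentages(components):
    """Express each component as a percentage of the bar's total."""
    total = sum(components.values())
    return {name: round(100 * value / total, 1) for name, value in components.items()}


# Every bar now has total length 100; only the segment sizes differ.
percent_bars = {year: to_percentages(parts) for year, parts in students.items()}

for year, segments in percent_bars.items():
    print(year, segments)
```

Because each bar is rescaled to 100, the diagram shows how the composition changes from year to year, while the growth in total enrolment is no longer visible.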

Illustration 1

Year   Men   Women   Children
1995   45%   35%     20%
1996   44%   34%     22%
1997   48%   36%     16%

Line Graph
A line graph is a type of chart used to show information that changes over time. It is plotted as a series of points connected by straight lines. It is also known as a line chart. A line graph has two axes, the x-axis and the y-axis.

◌◌ The horizontal axis is known as the x-axis.
◌◌ The vertical axis is known as the y-axis.

Plotting a line graph


Plotting a line graph is easy. The steps are as follows:

◌◌ Draw the x-axis and y-axis on the graph paper. Write the title above the graph so that it states the purpose of the graph.
◌◌ Assign each factor to an axis. For instance, if one of the factors is time, it goes on the horizontal axis (the x-axis), and the other factor goes on the vertical axis (the y-axis). Label both axes according to their respective factors; for example, the x-axis can be labelled as time or day.
◌◌ Plot the given data values as points on the graph. Once the points are joined, a clear inference about the trend can be made.


Pie Chart
A pie chart, or circle chart, is a circular statistical graphic divided into slices to illustrate numerical proportions. In a pie chart, the arc length of each slice is proportional to the quantity it represents. While it is named for its resemblance to a sliced pie, there are variations in the way it can be presented. Categories of data are represented by wedges of the circle whose sizes are proportional to the percentage of individuals in each category.

Pie charts are very widely used in the business world and the mass media. They are generally used to show percentage or proportional data, and the percentage represented by each category is usually given next to the corresponding slice. Pie charts work well for displaying around six categories or fewer.

Example:

Show the following data on the expenditure of an average working-class family by a suitable diagram.

Item of Expenditure   Percent of Total Expenditure
Food                  65
Clothing              10
Housing               12
Fuel and Lighting     5
Miscellaneous         8

Solution:

The angles of the different sectors are calculated as shown below:

1. Food = 65/100 x 360 = 234°
2. Clothing = 10/100 x 360 = 36°
3. Housing = 12/100 x 360 = 43.2°
4. Fuel and Lighting = 5/100 x 360 = 18°
5. Miscellaneous = 8/100 x 360 = 28.8°

[Figure: Pie chart of the family expenditure data]
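The sector angles of the expenditure example can be checked with a short Python sketch. The function name `sector_angles` is hypothetical; the percentages are those given in the example.

```python
# Percentage shares from the family expenditure example.
expenditure = {
    "Food": 65,
    "Clothing": 10,
    "Housing": 12,
    "Fuel and Lighting": 5,
    "Miscellaneous": 8,
}


def sector_angles(shares):
    """Central angle of each sector: (share / total) x 360 degrees."""
    total = sum(shares.values())
    return {item: share * 360 / total for item, share in shares.items()}


angles = sector_angles(expenditure)
for item, angle in angles.items():
    print(f"{item:<18} {angle:6.1f} degrees")

# The angles of all sectors together make up the full circle.
print("Sum of angles:", sum(angles.values()))
```

Dividing by the total rather than by a fixed 100 means the same function also works when the data are given as absolute amounts instead of percentages.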



Frequency Distribution
A classification of data that shows the different values of a variable together with their respective frequencies of occurrence is called a frequency distribution of the values.

There are two kinds of frequency distributions, namely, discrete frequency distribution (or simple, or ungrouped frequency distribution), and continuous frequency distribution (or condensed, or grouped frequency distribution).

a. Discrete Frequency Distribution
Preparing a discrete frequency distribution is simple. First, all the possible values of the variable are arranged in ascending order in a column. Then a second column of tally marks is prepared to count the number of times each value of the variable occurs. To facilitate counting, tally marks are grouped in blocks of five. The last column contains the frequency. The following example illustrates this.

Example:
Construct a frequency distribution table for the following data on the number of family members in 30 families:

4 3 2 3 4 5 5 7 3 2
3 4 2 1 1 6 3 4 5 4
2 7 3 4 5 6 2 1 5 3

In the table below, each block '||||' stands for five tally marks (four strokes crossed by a fifth).

Number of Family Members   Tally Marks   Frequency
1                          |||           3
2                          ||||          5
3                          |||| ||       7
4                          |||| |        6
5                          ||||          5
6                          ||            2
7                          ||            2
Total                                    N = 30
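The tallying procedure can be reproduced with Python's standard `collections.Counter`. This is a minimal sketch: the helper `tally` and its use of `||||/` to stand for a crossed block of five are conventions assumed here for plain-text output.

```python
from collections import Counter

# Number of family members in the 30 families of the example.
family_sizes = [
    4, 3, 2, 3, 4, 5, 5, 7, 3, 2,
    3, 4, 2, 1, 1, 6, 3, 4, 5, 4,
    2, 7, 3, 4, 5, 6, 2, 1, 5, 3,
]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(family_sizes)


def tally(count):
    """Tally marks in blocks of five; '||||/' denotes one crossed block."""
    blocks, rest = divmod(count, 5)
    return " ".join(["||||/"] * blocks + (["|" * rest] if rest else []))


print(f"{'Members':<8} {'Tally':<10} Frequency")
for value in sorted(frequency):
    print(f"{value:<8} {tally(frequency[value]):<10} {frequency[value]}")
print("Total N =", sum(frequency.values()))
```

Sorting the keys gives the ascending first column of the table, and the sum of the frequencies returns the number of observations, N = 30.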

b. Continuous Frequency Distribution


For continuous data, a 'grouped frequency distribution' is necessary. For discrete data, a discrete frequency distribution is better than an array, but it does not condense the data. A 'grouped frequency distribution' is also useful for condensing discrete data by putting the values into groups or classes called class intervals. Some important terms used in connection with continuous frequency distributions are as follows:

1. Class limits: Class limits denote the lowest and highest values that can be included in a class. The two boundaries of a class are known as its lower limit and upper limit. For example, in the classes 10-18, 20-28, etc., 10 and 18 are the limits of the first class, 20 and 28 are the limits of the second class, and so on.
2. Class interval: The class interval represents the width, span or size of a class. The width may be determined by subtracting the lower limit of one class from the lower limit of the following class. For example, the classes 10-15, 15-20, etc. have class interval 15 – 10 = 5.
3. Class frequency: The number of observations falling within a particular class is known as its class frequency. The total frequency indicates the total number of observations, N = Σf.
4. Class mark or class mid-point: The mid-point of a class is defined as the sum of two successive lower limits divided by two. Thus the class mark is the value lying halfway between the lower and upper class limits. For example, the classes 10-20, 20-30, etc. have class marks 15, 25, etc.
5. Types of class intervals: There are several ways in which the limits of class intervals can be shown; they are described in items 6 to 9.
6. Exclusive method: In this method, the class intervals are arranged so that the upper limit of one class is the lower limit of the next class. The upper limit is always treated as excluded from the class. For example, with class limits 20-25, 25-30, an observation with value 25 is included in the class 25-30.
7. Inclusive method: In this method, the upper limit of a class is included in the class itself. There is then no overlap between the upper limit of one class and the lower limit of the next. For example, with class limits 20-29.5, 30-39.5, 40-49.5, etc., there is no ambiguity, but values between 29.5 and 30, or between 39.5 and 40, etc., are not allowed.
8. Open end: In an open-end distribution, the lower limit of the very first class or the upper limit of the last class is not given. For example, when stating the distribution of the monthly salary of managers in rupees, one may specify the classes as below 10000, 10000-15000, 15000-20000, 20000-25000, above 25000. Similarly, when recording the weights of college students in kg as grouped data, the class intervals could be less than 40, 40 to 50, 50 to 60, 60 to 70, 70 to 80, greater than 80.
9. Unequal class intervals: Class intervals may also be defined so that the width is not the same for all classes. This is of practical use when there are large gaps in the data or the data are unevenly distributed. Such classes can be used for explaining, visualizing and plotting the data; however, the formulae used for calculations must be adjusted accordingly.
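The exclusive method and the class mark defined above can be sketched in Python with the standard `bisect` module. The function name `class_index` and the sample limits are illustrative assumptions, not part of the text.

```python
from bisect import bisect_right


def class_index(value, limits):
    """Index i of the exclusive class [limits[i], limits[i + 1]) holding value."""
    if not limits[0] <= value < limits[-1]:
        raise ValueError(f"{value} lies outside the classes")
    # bisect_right places a value equal to a limit to the right of that limit,
    # so an upper limit is excluded from its class and opens the next one.
    return bisect_right(limits, value) - 1


limits = [20, 25, 30, 35]  # classes 20-25, 25-30, 30-35

i = class_index(25, limits)
print(f"25 falls in class {limits[i]}-{limits[i + 1]}")

# Class marks: halfway between the lower and upper limit of each class.
class_marks = [(lower + upper) / 2 for lower, upper in zip(limits, limits[1:])]
print("Class marks:", class_marks)

# Class interval: difference between two successive lower limits.
class_width = limits[1] - limits[0]
print("Class interval:", class_width)
```

The observation 25 is assigned to the class 25-30, in line with the rule that the upper limit of a class is excluded under the exclusive method.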

Cumulative and Relative Frequency


In many situations, rather than listing the actual frequency opposite each class, it may be appropriate to list either cumulative frequencies or relative frequencies or both.

Cumulative Frequencies
The cumulative frequency of a given class interval represents the total of all the previous class frequencies, including the frequency of the class against which it is written.

Relative Frequencies
Relative frequency is obtained by dividing the frequency of each class by the total number of observations, i.e., the total frequency. If the relative frequency is multiplied by 100, we get the percentage frequency.

There are two important advantages in looking at relative frequencies (percentages) instead of the absolute frequencies in a frequency distribution:

◌◌ Relative frequencies facilitate the comparison of two or more sets of data.
◌◌ Relative frequencies constitute the basis for understanding the concept of probability.
Example: The ages of 50 employees are given. Find the cumulative frequency, relative frequency and percentage frequency.

22 21 37 33 28 42 56 33 32 59
40 47 29 65 45 48 55 43 42 40
37 39 56 54 38 49 60 37 28 27
32 33 47 36 35 42 43 55 53 48
29 30 32 37 43 54 55 47 38 62

Class Interval   Class Frequency   Cumulative Frequency   Relative Frequency   Percentage Frequency
20-30            7                 (0+7) = 7              7/50 = 0.14          14
30-40            16                (7+16) = 23            16/50 = 0.32         32
40-50            15                (23+15) = 38           15/50 = 0.30         30
50-60            9                 (38+9) = 47            9/50 = 0.18          18
60-70            3                 (47+3) = 50            3/50 = 0.06          6
                 N = Σf = 50                              Total = 1            Total = 100
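The table can be reproduced from the raw ages with a short Python sketch. It assumes the exclusive method (a value equal to an upper limit goes into the next class); the function name `frequency_table` is hypothetical.

```python
# Ages of the 50 employees from the example.
ages = [
    22, 21, 37, 33, 28, 42, 56, 33, 32, 59,
    40, 47, 29, 65, 45, 48, 55, 43, 42, 40,
    37, 39, 56, 54, 38, 49, 60, 37, 28, 27,
    32, 33, 47, 36, 35, 42, 43, 55, 53, 48,
    29, 30, 32, 37, 43, 54, 55, 47, 38, 62,
]
limits = [20, 30, 40, 50, 60, 70]


def frequency_table(data, limits):
    """Rows of (class, frequency, cumulative, relative, percentage)."""
    n = len(data)
    rows, cumulative = [], 0
    for lower, upper in zip(limits, limits[1:]):
        # Exclusive method: lower limit included, upper limit excluded.
        f = sum(lower <= x < upper for x in data)
        cumulative += f
        rows.append((f"{lower}-{upper}", f, cumulative, f / n, 100 * f / n))
    return rows


rows = frequency_table(ages, limits)
for label, f, cf, rf, pf in rows:
    print(f"{label:<6} f={f:<3} cf={cf:<3} rf={rf:.2f} pf={pf:.0f}")
```

The last cumulative frequency equals N, the relative frequencies add up to 1, and the percentage frequencies add up to 100, which gives a quick check on the table.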
A frequency distribution is constructed to satisfy three objectives: (i) to facilitate the analysis of data, (ii) to estimate frequencies of the unknown population distribution from the distribution of sample data, and (iii) to facilitate the computation of various statistical measures.

Histogram
A histogram consists of contiguous boxes and has both a horizontal axis and a vertical axis. The horizontal axis is labelled with what the data represent (for instance, distance from your home to school). The vertical axis is labelled either frequency or relative frequency. The graph has the same shape with either label. The histogram (like the stem plot) can show you the shape of the data, the centre, and the spread of the data. (The next section tells you how to calculate the centre and the spread.)

The relative frequency is equal to the frequency of an observed value of the data divided by the total number of data values in the sample. (In the chapter on Sampling and Data (Section 1.1), we defined frequency as the number of times an answer occurs.)

RF = f/n

where f is the frequency, n is the total number of data values (or the sum of the individual frequencies), and RF is the relative frequency.

Example – If 3 students in a Mathematics class of 40 students received from 90% to 100%, then

f = 3, n = 40 and

RF = f/n = 3/40 = 0.075

Seven and a half percent of the students received 90% to 100%. Ninety percent to 100% are quantitative measures.

Example:
Construct a histogram from the following data –

Class Interval   Frequency
10.5 – 18.5      3
18.5 – 26.5      5
26.5 – 34.5      5
34.5 – 42.5      2
42.5 – 50.5      4
50.5 – 58.5      2

Solution:

[Figure: Histogram of the above frequency distribution]

Frequency Polygons
A frequency polygon is obtained by plotting the class frequencies against the mid-points of the class intervals and joining the points thus obtained by line segments. Compared with a histogram, the points of a frequency polygon replace the bars (rectangles).

Also, when several distributions are to be compared on the same graph paper, frequency polygons are better than histograms.

Illustration 1

Draw a histogram and frequency polygon from the following data.

Age in Years   Number of Persons
10-20          3
20-30          16
30-40          22
40-50          35
50-60          24
60-70          15
70-80          2

[Figure: Histogram and frequency polygon of the above data]
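The points of the frequency polygon for this illustration can be computed as follows. This is a sketch under two assumptions that are not stated in the text: all classes have equal width, and the polygon is closed on the x-axis by adding one zero-frequency class at each end, which is a common drawing convention. The function name `polygon_points` is hypothetical.

```python
# (lower limit, upper limit, frequency) for each age class in the illustration.
classes = [
    (10, 20, 3), (20, 30, 16), (30, 40, 22), (40, 50, 35),
    (50, 60, 24), (60, 70, 15), (70, 80, 2),
]


def polygon_points(classes):
    """(mid-point, frequency) pairs, closed with zero-frequency end points."""
    width = classes[0][1] - classes[0][0]  # assumes equal class widths
    points = [((lower + upper) / 2, f) for lower, upper, f in classes]
    first_mid, last_mid = points[0][0], points[-1][0]
    # Convention assumed here: extend by one empty class on each side.
    return [(first_mid - width, 0)] + points + [(last_mid + width, 0)]


points = polygon_points(classes)
for x, y in points:
    print(f"mid-point {x:5.1f}  frequency {y}")

# The highest point of the polygon marks the modal class.
modal_mid, modal_freq = max(points, key=lambda p: p[1])
print("Peak at mid-point", modal_mid, "with frequency", modal_freq)
```

Joining these points in order with straight line segments gives the frequency polygon; the same mid-points are the centres of the histogram bars.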

Ogives
When frequencies are added successively, the running totals are called cumulative frequencies. The curve obtained by plotting cumulative frequencies is called a cumulative frequency curve or an ogive (pronounced 'ojive').

To construct an ogive: (i) Add up the progressive totals of frequencies, class by class, to get the cumulative frequencies. (ii) Plot the classes on the horizontal axis (x-axis) and the cumulative frequencies on the vertical axis (y-axis).

Less than Ogive: To plot a less than ogive, the data are arranged in ascending order of magnitude and the frequencies are cumulated from the top, i.e., by adding. The cumulative frequencies are plotted against the upper class limits. An ogive drawn by this method is a rising (positively sloped) curve.

Greater than Ogive: To plot a greater than ogive, the data are arranged in ascending order of magnitude and the frequencies are cumulated from the bottom, or equivalently subtracted successively from the total starting at the top. The cumulative frequencies are plotted against the lower class limits. An ogive drawn by this method is a falling (negatively sloped) curve.

Uses: Certain values such as the median, quartiles, quartile deviation and coefficient of skewness can be located using ogives. Ogives are also helpful in comparing two distributions.

Illustration 1 –

Draw less than and more than ogive curves for the following frequency distribution and obtain the median graphically. Verify the result.

CI   0-20   20-40   40-60   60-80   80-100   100-120   120-140   140-160
f    5      12      18      25      15       12        8         5

Here lcf is the less than cumulative frequency, plotted against the upper class limit, and mcf is the more than cumulative frequency, plotted against the lower class limit.

Size (upper limit)   lcf   mcf   Size (lower limit)
20                   5     100   0
40                   17    95    20
60                   35    83    40
80                   60    65    60
100                  75    40    80
120                  87    25    100
140                  95    13    120
160                  100   5     140

[Figure: Less than and more than ogives of the above distribution]
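The two cumulative columns, and the median read from the less than ogive, can be verified with the sketch below. It locates the median by linear interpolation along the less than ogive at cumulative frequency N/2, which is the numerical counterpart of reading the value off the graph; the function name `median_from_ogive` is an assumption of this example.

```python
from itertools import accumulate

limits = [0, 20, 40, 60, 80, 100, 120, 140, 160]  # class limits
freqs = [5, 12, 18, 25, 15, 12, 8, 5]             # class frequencies

# Less than cf: running totals, plotted against the upper class limits.
less_than_cf = list(accumulate(freqs))
total = less_than_cf[-1]

# More than cf: total minus everything below each lower class limit.
more_than_cf = [total - below for below in [0] + less_than_cf[:-1]]


def median_from_ogive(limits, cf):
    """Interpolate the less than ogive at cumulative frequency N/2."""
    half = cf[-1] / 2
    xs, ys = limits, [0] + cf  # the ogive starts at (first lower limit, 0)
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if y0 <= half <= y1 and y1 > y0:
            return x0 + (half - y0) / (y1 - y0) * (x1 - x0)
    raise ValueError("median class not found")


median = median_from_ogive(limits, less_than_cf)
print("lcf:", less_than_cf)
print("mcf:", more_than_cf)
print("Median:", median)
```

For this distribution N/2 = 50 falls in the class 60-80, and the interpolation gives 60 + (50 – 35)/25 x 20 = 72, the value at which the two ogives are expected to cross.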

2.6.4 Data Interpretation


Data interpretation is the process of making sense of a set of processed data. The data may be presented in numerous ways, such as bar graphs, line charts and tabular forms, and each requires some kind of interpretation.

Bar Graph
A bar graph is a graphical depiction of data in the form of bars. Bar graphs are used to represent many kinds of data.

The bar graph is a common tool for presenting numerous types of data. It frequently appears in the data interpretation segment of competitive examinations.


In data interpretation, a bar graph is a display of data in which the bars are spaced equally apart. The height (or length) of each bar signifies the value of the data it represents. The width carries no information; it is used merely to make the presentation clear.

Vertical bars are plotted against the horizontal axis, the x-axis. Coloured or shaded vertical bars of equal width can be used to display the values. In a horizontal bar graph, the bars are plotted against the vertical axis, the y-axis. Bar graphs are widely used because the data can be analysed easily.

Types

There are various kinds of bar graphs used to represent data. They are listed below.

1. Simple bar chart
2. Composite bar chart
3. Stacked bar chart
4. Bar chart to show deviation

Simple Bar Chart: As the name implies, the simple bar chart is the simplest bar graph. It relates one continuous variable to one discrete (categorical) variable. The illustration below shows a simple bar chart.

[Figure: Simple bar chart]

Composite Bar Chart: A composite bar chart overcomes the limitations of a simple bar chart. It is used to display two or more continuous variables on the same graph. The figure below shows a composite bar chart.

[Figure: Composite bar chart]

Stacked Bar Chart: For continuous data, stacked bar charts are used to show how a total breaks down into its components. They are quite useful for comparing different sets of data and can be used to describe variables such as revenues, earnings and losses over a couple of years. The diagram below shows a stacked bar chart.

[Figure: Stacked bar chart]
Bar Charts to Show Deviation: Bar charts are commonly employed to display deviation. Deviation charts can display both a deficit and a surplus, or imports and exports, among other things. They are generally needed when both the positive and the negative values of continuous variables have to be represented.

[Figure: Bar chart showing deviation]

As illustrated in the picture, a baseline is established; positive values are shown above the line and negative values are shown below it.

Line Chart
A line chart can be viewed as an extension of the bar chart: it is created by joining the topmost points of the bars with line segments. Continuing this for all the bars gives the line chart. In examinations the calculations can be difficult, but the examiner's main goal is to see whether you can use reasoning to eliminate choices and read the graph to reach the solution.

Tabular Form
Tabular data is one of the most basic methods for analysing and displaying data. In tabular form, the data are organised systematically in rows and columns. Titles (headings) are given in the first row and in the first column. It is a very accurate and simple means of displaying data.

Pie Chart
Pie charts are a form of data visualisation in which the information is presented in the shape of a circle. A pie chart is divided into sectors, and each sector accounts for a fraction of the total (expressed as a percentage). The whole of the data in the pie chart corresponds to 360 degrees, and proportionality is used to calculate the angle that represents each component.

The entire diagram looks like a pie, and its components resemble slices cut from the pie. The pie chart is therefore used to depict the breakdown of a single continuous variable into its component parts.

2.6.5 Conclusion Module 2
●● Measures of central tendency provide a summary measure that describes an entire set of data with a single number representing the middle or centre of its distribution. The mean, median, and mode are the three primary measures of central tendency.
●● When data are symmetrically distributed with a single peak (for example, normally distributed), the mean, median, and mode are all equal and all describe the most typical value of the data set.
●● When assessing measures of central tendency, it is critical to consider the dispersion of the data set.
●● In psychology, central tendency is quite useful. It tells us what is usual or 'average' for a given set of data. It also reduces the data set to a single representative value, which is useful when working with large amounts of data.
●● You can also compare one data set to another using central tendency.
●● Measures of central tendency help you determine the centre, or average, of a data set.
●● The mean, median, and mode are the three most commonly used measures of central tendency.
◌◌ The mode is the most common value.
◌◌ In an ordered data set, the median is the number in the middle.
◌◌ The mean is calculated by taking the sum of all values and dividing it by the total number of values.
●● The measures of central tendency available to you are determined by the level of measurement of your data.
◌◌ For nominal data, only the mode can be used; it gives the most frequent value.
◌◌ For ordinal or ranked data, the median can also be used to locate the value in the middle of your data set.
◌◌ For interval or ratio data, the mean can be used in addition to the mode and median to obtain the average value.
●● The mean is the most commonly used measure of central tendency since it takes all of the values in the data set into account.
●● When dealing with data from skewed distributions, the median is preferable to the mean since it is not affected by exceptionally large or small values.
●● The mode is the only measure that can be used for non-ordered nominal or categorical data.

●● Central tendency is a statistical measure that uses a single number to represent a group. It is an economical summary of a group's general features.
●● A measure of central tendency is a single number that attempts to describe a set of data by identifying its central position. Measures of central tendency are sometimes also called measures of central location. They are also classed as summary statistics.
●● Measures of central tendency are an important aid when working with and communicating through tables and graphs. In real-world applications, you can use many types of tables and graphs to present information and extract it from data to aid analyses and forecasts.
●● Central tendency refers to the statistical measure that represents a complete distribution or data set by a single value and aims to give an accurate description of the whole data.
●● The three main measures are the mean, median, and mode.
●● The mean is the sum of all observations divided by the number of observations.
●● The median is the midpoint of a distribution: half of the observations are above it and half are below it. The median is less susceptible to outliers than the mean. For a data set containing extreme values, it is a better measure than the mean.
●● The mode is the most common observation in a distribution.
●● A data set may contain several modes, while another may have no mode at all. For moderately skewed distributions, the three measures of central tendency are related by the following empirical relationship: Mode = 3 Median – 2 Mean.
●● When data are normally distributed, the mean is the preferred measure of central tendency. When data are skewed, the median is the most useful measure. The mode is the most useful measure when working with nominal variables. A non-empty numerical data set always has a mean and a median; however, it may not have a mode.
●● Dispersion is a statistical term that indicates the size of the spread of values expected for a specific variable. It can be measured using a variety of statistics such as the range, variance, and standard deviation. In finance and investing, dispersion usually refers to the range of probable returns on an investment. It can also be used to assess the risk of a certain security or investment portfolio.
●● The range of prospective outcomes of investments, based on historical volatility or returns, is referred to as dispersion.
●● Alpha and beta, which quantify risk-adjusted returns and returns relative to a benchmark index, can be used to measure dispersion. In general, the greater the dispersion, the riskier the investment, and vice versa.
●● Dispersion is frequently used to quantify the degree of uncertainty, and consequently the risk, associated with a specific security or investment portfolio.
●● Thousands of possible securities are available to investors, and there are numerous factors to consider when deciding where to invest. The risk profile of the investment is high on the list of considerations. Dispersion is one of many statistical measurements used to provide context.
●● The volatility and risk associated with holding an asset are reflected in the dispersion of its return. The more variable an asset's return, the riskier or more volatile it is.
●● While measures of central tendency are useful for estimating "normal" values in a dataset, measures of dispersion are useful for describing the spread of the data, or its variation around a central value. Two separate samples may have the same mean or median but very different levels of variability, or vice versa. Both of these features should be included in a proper description of a set of data. There are several ways of measuring the dispersion of a dataset, each with its own advantages and disadvantages.
●● Over the last few decades, business and economic research, aided by recent releases of micro-level production data, has discovered substantial disparities in performance across enterprises, even those operating in the same industry or market.
●● This dispersion is seen in a wide range of industries, nations, and historical periods, and involves variation in both levels (revenues, employment, and productivity) and changes (growth rates in these or other operational indicators).
●● This volatility both causes and results from industry turnover. Production is regularly reallocated across firms in an industry, whether between continuing operators or through company turnover due to entry and exit.
●● This reallocation occurs even in industries with steady aggregate measures. According to recent studies, this churning is often in a direction that increases the industry's total output.
●● Understanding the nature of this dispersion is important for policies that interact with both microeconomic phenomena, such as the effect of hiring or investment subsidies targeted at specific industries or markets, and macroeconomic phenomena, such as trade policies and losses due to production misallocation across firms. These policies influence labour market outcomes as well as other input and output markets.
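The measures of central tendency and dispersion summarised above can be computed with Python's standard `statistics` module. The sample below is hypothetical and was chosen only because it contains one extreme value; population variance and standard deviation (`pvariance`, `pstdev`) are used as one possible choice.

```python
import statistics

# Hypothetical sample containing one extreme value (40).
data = [2, 3, 3, 4, 5, 5, 5, 6, 40]

mean = statistics.mean(data)        # sum of values / number of values
median = statistics.median(data)    # middle value of the ordered data
modes = statistics.multimode(data)  # most frequent value(s); may be several

data_range = max(data) - min(data)     # simplest measure of dispersion
variance = statistics.pvariance(data)  # population variance
std_dev = statistics.pstdev(data)      # population standard deviation

print(f"mean={mean:.2f} median={median} modes={modes}")
print(f"range={data_range} variance={variance:.2f} std_dev={std_dev:.2f}")

# Dropping the extreme value moves the mean far more than the median.
trimmed = data[:-1]
print(f"without outlier: mean={statistics.mean(trimmed):.3f} "
      f"median={statistics.median(trimmed)}")
```

With the extreme value included, the mean (about 8.11) lies well above the median (5), which illustrates why the median is preferred for skewed data and why a measure of dispersion should accompany any measure of central tendency.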

Check Your Understanding

Multiple Choice Question

1. Problem-solving is described as establishing methods that __________ impediments that prevent you or others from achieving operational and strategic corporate goals.
a) Increase
b) Decrease
c) Constant
d) Equal
2. Understanding the value of problem-solving skills in the workplace will assist you in growing as a __________.
a) Loser
b) Cheater
c) Leader
d) Individual
3. __________ with the staff who have the most expertise about the problem to come up with more solutions.
a) Brainstorm
b) Thinking
c) Living
d) Conflict
4. Break down complex problems into their ________ pieces wherever possible. Then, look for answers to one unit at a time.
a) Biggest
b) Greatest
c) Thinnest
d) Smallest
5. _________ growth from prior times is desirable for firm survival and profitability.
a) Revenue
b) Gross Profit
c) Net Profit
d) Goodwill

State True or False

1. Almost anywhere economic and business analysts investigate, they find enormous disparities in performance across organisations, even those operating in the same, narrowly defined market.
2. Massive disparities in firm-level performance and reallocations within industries are common features of production economics.
3. Companies also differ substantially in terms of productivity, or the efficiency with which they turn inputs into outputs.
4. A difficulty in business is a condition that causes a gap between the desired and actual outcomes.
5. Problem-solving abilities will assist you in resolving key difficulties and conflicts that may arise.

Summary
●● Problem-solving is described as establishing methods that decrease or remove impediments that prevent you or others from achieving operational and strategic corporate goals. A difficulty in business is a condition that causes a gap between the desired and actual outcomes.
●● Problem-solving abilities are specialised techniques that can be utilised to fulfil one or more of the four general problem-solving phases (discussed above).
●● It takes time and effort to learn how to solve business difficulties. Though some people appear to be born with excellent problem-solving ability, outstanding problem solvers usually practise and refine their skills.
●● Business metrics quantify the performance of a business process or of an aspect of a business process. They monitor the performance of business processes in a variety of areas, including finance, marketing, human resources, information technology, operations, manufacturing, and investment.
●● Financial metrics are financial performance indicators that track sales turnover, earnings, expenditures, assets, liabilities, and capital. Organizations in a variety of industries use them to track company operations, increase operational efficiency, and aid in planning and strategy formation.

●● In a bar diagram, only the length of the bar is taken into account, not its width. In other words, a bar is a thick line whose width is shown merely for visibility; only its length carries information, which is why it is called a one-dimensional diagram.
●● In a subdivided bar diagram, each bar representing the magnitude of a given value is further subdivided into its components. Each component occupies a part of the bar proportional to its share in the total.
●● In a multiple bar diagram, two or more sets of related data are represented, and the components are shown as separate adjoining bars. The height of each bar represents the actual value of the component. The components are distinguished by different shades or colours.
●● A line graph is a type of chart used to show information that changes over time. It is plotted as a series of points connected by straight lines. It is also known as a line chart.
●● A pie chart, or circle chart, is a circular statistical graphic divided into slices to illustrate numerical proportions. In a pie chart, the arc length of each slice is proportional to the quantity it represents.
●● There are two kinds of frequency distributions, namely, discrete frequency distribution (or simple, or ungrouped frequency distribution), and continuous frequency distribution (or condensed, or grouped frequency distribution).
●● Preparing a discrete frequency distribution is simple. First, all the possible values of the variable are arranged in ascending order in a column. Then a second column of tally marks is prepared to count the number of times each value of the variable occurs. To facilitate counting, tally marks are grouped in blocks of five.
●● Relative frequency is obtained by dividing the frequency of each class by the total number of observations, i.e., the total frequency.

●● A histogram consists of contiguous boxes and has both a horizontal axis and a vertical axis. The horizontal axis is labelled with what the data represent (for instance, distance from your home to school). The vertical axis is labelled either frequency or relative frequency. The graph has the same shape with either label.
●● Compared with a histogram, the points of a frequency polygon replace the bars (rectangles). Also, when several distributions are to be compared on the same graph paper, frequency polygons are better than histograms.
●● Data interpretation is the process of making sense of a set of processed data. The data may be presented in numerous ways, such as bar graphs, line charts and tabular forms, and each requires some kind of interpretation.
●● The bar graph is a common tool for presenting numerous types of data. It frequently appears in the data interpretation segment of competitive examinations. In data interpretation, it is a display of data in which the bars are spaced equally apart.
●● Some data analysis approaches cannot handle missing data, so the missing values must be "filled in" or imputed. Rubin (1987) contended that repeating imputation merely a few times (5 or fewer) vastly increases estimation quality.

Activity
1. Develop a business problem statement and prepare a presentation on it in
MS PowerPoint.
2. Discuss what types of business problems arise in that situation.

Question and Answer
1. Explain what is meant by a business problem.
2. What is meant by the calculation of business metrics?
3. What is an insights graph?
4. Explain data interpretation.
5. What is a business problem statement?

Glossary
●● Sales Revenue: Revenue is the essence of business; therefore, tracking it is
critical for any corporation. It is also included in many corporate performance
measurements. Revenue growth over prior periods is desirable for a firm’s survival
and profitability.
●● Gross Profit Margin: Gross profit margin is a measure of profitability. It assesses
the effectiveness of managing production costs in relation to sales. The greater
the margin, the better. The key to proper interpretation is a comparison with
industry benchmarks.
●● Net Profit Margin: Another profitability indicator is net profit margin, which
measures how much profit each dollar of revenue generates. If the margin is low,
either selling prices must be raised or costs must be reduced.
●● Net Cash Flow: Net cash flow is the difference between cash inflows and cash
outflows. The amount of cash required depends on the sort of business.
●● Working Capital: Working capital reflects a company’s ability to meet its short-term
obligations. It is the ability to make payments to fulfil short-term obligations using
short-term assets.
●● Debt-to-Equity Ratio: The ratio compares the proportions of debt and equity in a
company’s capital structure. A ratio greater than one shows that the majority of
the capital is derived from debt. The ratio illustrates the risks associated with the
capital structure.
●● Inventory Turnover: It assesses the effectiveness of inventory investment in
generating sales over a given time period. It is a measure of how rapidly a
company can sell its stock. A larger ratio is preferred, as it indicates greater
efficiency.
●● Days Sales Outstanding (DSO): Days sales outstanding indicates credit sales
collection efficiency and measures the average number of days it takes a company
to receive cash from credit sales. The fewer the days, the better the business’s
collection efficiency. It can be computed on a monthly, quarterly, or annual basis.
●● Days Payables Outstanding (DPO): Days payables outstanding computes the
average number of days it takes a corporation to pay its suppliers. The bigger the
number of days, the longer it takes for a company to pay its suppliers and, in some
situations, the stronger the corporation’s bargaining power over suppliers. A higher
figure, on the other hand, may be interpreted as an inability to pay.
●● Current Ratio: The current ratio assesses a company’s liquidity, or its capacity
to meet short-term commitments when they come due. Desirable ratios vary by
industry, although a ratio greater than one is an economy-wide benchmark.

Further Reading
1. Richard I. Levin, David S. Rubin, Sanjay Rastogi, Masood Husain Siddiqui,
Statistics for Management, Pearson Education, 7th Edition, 2016.
2. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
3. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An
Introduction to Statistical Learning with Applications in R, Springer, 2016.

Check Your Understanding: Answers

Multiple Choice Questions
1. b) Decrease 2. c) Leader
3. a) Brainstorm 4. d) Smallest
5. a) Revenue

State True or False
1. True 2. True
3. False 4. True
5. True


Module - 3: Skewness and Kurtosis


Structure:

3.1 Skewness; Pearsonian and Bowley’s measure of Skewness
3.1.1 Introduction to Skewness
3.1.2 Application of Skewness
3.1.3 Pearsonian Measure of Skewness
3.1.4 Bowley’s Measure of Skewness
3.2 Kurtosis and Moments (Numerical examples and applications)
3.2.1 Introduction to Kurtosis
3.2.2 Application of Kurtosis
3.2.3 Introduction to Moments
3.2.4 Factorial Moments
3.2.5 Sheppard’s Correction for Moments
3.2.6 Skewness Using Moments
3.2.7 Kurtosis Using Moments


Unit - 3.1: Skewness; Pearsonian and Bowley’s measure of Skewness

Objectives
At the end of this unit, you will be able to:
●● Understand the concept of skewness.
●● Learn about the applications of skewness.
●● Comprehend the Pearsonian measure of skewness.
●● Analyze Bowley’s measure of skewness.

Introduction

Skewness is a measure of symmetry, or more specifically, the absence of
symmetry. A symmetric distribution or data set is one that looks the same to the left
and right of the centre point. Characterizing the location and variability of a data set
is a crucial task in many statistical analyses; skewness and kurtosis are two further
characteristics of the data. This unit introduces skewness and its applications, and
then covers the Pearsonian and Bowley’s measures of skewness.
3.1.1 Introduction to Skewness
In probability theory and statistics, skewness is a measure of the asymmetry of the
probability distribution of a real-valued random variable about its mean. The
skewness value can be positive, zero, negative, or undefined.

For a unimodal distribution, negative skew commonly indicates that the tail is on the
left side of the distribution, while positive skew indicates that the tail is on the right
side. When one tail is long and the other is fat, skewness does not follow a simple
rule. A zero value, for example, means that the tails on both sides of the mean
balance out overall; this is true for a symmetric distribution, but it can also be true for
an asymmetric distribution in which one tail is long and thin and the other is short
and fat.

Skewness denotes the opposite of symmetry: it is a lack of symmetry. Applied to a
frequency distribution, it indicates that the items are not distributed symmetrically.
In a symmetrical unimodal series the mode, the median and the arithmetic mean are
identical. Skewness, or lack of symmetry, in a series therefore shows up when these
three averages do not coincide.

Skewness can be positive as well as negative. As a rule of thumb, if the mean is
greater than the mode or the median, the skewness is positive; if it is less, the
skewness is negative. In other words, if Mode < Median < Mean, then the skewness
is positive, and if Mean < Median < Mode, the skewness is negative.

Asymmetric Distribution
An asymmetric distribution can show one of two types of skewness: positive or
negative. It is positively skewed (right-skewed) if the long tail extends towards
positive values, and negatively skewed (left-skewed) if the long tail extends towards
negative values.

Take a look at the two distributions in the diagram below. In each graph, the values
on the right side of the distribution taper differently from the values on the left side.

Tails are the tapering sides of a distribution, and they serve as a visual indicator of
which of the two types of skewness a distribution has:

1. Negative skew: The left tail is longer, and the distribution’s mass is concentrated on
the right side of the figure. Although the curve itself appears to lean to the right, the
distribution is said to be left-skewed, left-tailed, or skewed to the left; “left” refers to
the left tail being drawn out and, often, the mean being skewed to the left of a
typical centre of the data. A left-skewed distribution usually appears as a
right-leaning curve.
2. Positive skew: The right tail is longer, and the distribution’s mass is concentrated
on the left side of the figure. Although the curve itself appears to lean to the left, the
distribution is said to be right-skewed, right-tailed, or skewed to the right; “right”
refers to the right tail being drawn out and, often, the mean being skewed to the
right of a typical centre of the data. A right-skewed distribution usually appears as a
left-leaning curve.

Source: https://en.wikipedia.org/wiki/Skewness
Skewness in a data series can sometimes be detected not only graphically,
but also by looking at the numbers themselves. Consider the number sequence (49,
50, 51), whose values are evenly distributed around a central value of 50.
Adding a value far below the mean, which is probably a negative outlier, turns
this sequence into a negatively skewed distribution, e.g. (40, 49, 50, 51).
“Therefore, the mean of the sequence becomes 47.5, and the median is 49.5. Based
on the formula of nonparametric skew, defined as (µ – ν)/σ (mean minus median,
divided by the standard deviation), the skew is negative. Similarly, we
can make the sequence positively skewed by adding a value far above the mean, which
is probably a positive outlier, e.g. (49, 50, 51, 60), where the mean is 52.5, and the
median is 50.5.”
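The two sequences above can be checked with a short Python sketch. It is an illustration only: the function name `nonparametric_skew` is chosen here, and the population standard deviation (`pstdev`) is an assumption, since the text does not say which standard deviation is meant. Only the sign of the result matters for the argument.

```python
from statistics import mean, median, pstdev

def nonparametric_skew(values):
    """Return (mean - median) / standard deviation for a list of numbers."""
    return (mean(values) - median(values)) / pstdev(values)

left_skewed = [40, 49, 50, 51]    # outlier far below the centre
right_skewed = [49, 50, 51, 60]   # outlier far above the centre

for data in (left_skewed, right_skewed):
    # Print the data, its mean, its median and the nonparametric skew
    print(data, mean(data), median(data), round(nonparametric_skew(data), 3))
```

The first sequence gives a negative value and the second a positive value, matching the reasoning in the paragraph.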

As previously stated, zero skewness in a unimodal distribution does not
necessarily imply that the distribution is symmetric. A symmetric unimodal or multimodal
distribution, on the other hand, always has zero skewness.

Important Definitions of Skewness
●● “When a series is not symmetrical it is said to be asymmetrical or skewed.” -
Croxton and Cowden.
●● “Skewness refers to the asymmetry or lack of symmetry in the shape of a
frequency distribution.” - Morris Hamburg.
●● “Measures of skewness tell us the direction and the extent of skewness. In
symmetrical distribution the mean, median and mode are identical. The more
the mean moves away from the mode, the larger the asymmetry or skewness.” -
Simpson & Kafka
●● “A distribution is said to be ‘skewed’ when the mean and the median fall at different
points in the distribution, and the balance (or centre of gravity) is shifted to one
side or the other-to left or right.” - Garrett

Relationship of Mean and Median


The skewness of a distribution is not directly related to the relationship between the
mean and the median: a negatively skewed distribution can have a mean that is
greater than or less than the median, and the same holds for a positively skewed
distribution.

Source: https://en.wikipedia.org/wiki/Skewness

An example of an asymmetric distribution with zero skewness. This illustration
demonstrates that zero skewness does not necessarily imply a symmetric distribution.
(Skewness was calculated using Pearson’s moment coefficient of skewness.)

“In the older notion of nonparametric skew, defined as (µ – ν)/σ, where µ is the
mean, ν is the median, and σ is the standard deviation, the skewness is defined in
terms of this relationship: positive/right nonparametric skew means the mean is greater
than (to the right of) the median, while negative/left nonparametric skew means the
mean is less than (to the left of) the median.”

However, the modern definition of skewness and the traditional nonparametric
definition do not always have the same sign: while they agree for some families of
distributions, they differ in others, and conflating them is misleading.

If the distribution is symmetric, the mean equals the median, and the skewness of
the distribution is zero.

If the distribution is both symmetric and unimodal, then mean = median = mode.
This is the case for a coin toss or the series 1, 2, 3, 4, ... The converse is not true
in general: zero skewness (defined below) does not imply that the mean is equal to
the median.
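A small numerical illustration of this last point follows. The three-point distribution used here is a hypothetical example constructed for this sketch, not one taken from the text: it is clearly asymmetric, yet its third central moment (and therefore its moment coefficient of skewness) is exactly zero, and its median differs from its mean. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Hypothetical discrete distribution (illustration only): value -> probability
dist = {-4: Fraction(1, 3), 1: Fraction(1, 2), 5: Fraction(1, 6)}

# Mean and third central moment computed exactly
mu = sum(x * p for x, p in dist.items())
third_central = sum((x - mu) ** 3 * p for x, p in dist.items())

def dist_median(d):
    """Smallest value whose cumulative probability reaches 1/2."""
    total = Fraction(0)
    for x in sorted(d):
        total += d[x]
        if total >= Fraction(1, 2):
            return x

print("mean =", mu)
print("third central moment =", third_central)
print("median =", dist_median(dist))
```

Here the mean is 0 and the third central moment is 0, so the skewness is zero, while the median is 1.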

Source: https://en.wikipedia.org/wiki/Skewness

A general relationship between the mean and the median under differently skewed
unimodal distributions

According to a 2005 journal article:

Many textbooks state that the mean is right of the median under right skew, and
left of the median under left skew. This rule fails surprisingly often. It can fail in
multimodal distributions, or in distributions where one tail is long but the other is
heavy. Most commonly, though, the rule fails in discrete distributions where the areas
to the left and right of the median are not equal. Such distributions contradict not
only the textbook relationship between mean, median, and skew, but also the textbook
interpretation of the median.

In the distribution of adult residents across US households, for example, the skew is
to the right. However, since the majority of cases are less than or equal to the mode,
which is also the median, the mean sits in the heavier left tail. As a result, the rule of
thumb that the mean is right of the median under right skew fails.

Definition
Fisher’s moment coefficient of skewness
“The skewness of a random variable X is the third standardized moment
$\tilde{\mu}_3$, defined as:

$$\tilde{\mu}_3 = \operatorname{E}\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right] = \frac{\mu_3}{\sigma^{3}} = \frac{\operatorname{E}\left[(X-\mu)^{3}\right]}{\left(\operatorname{E}\left[(X-\mu)^{2}\right]\right)^{3/2}} = \frac{\kappa_3}{\kappa_2^{3/2}}$$

where μ is the mean, σ is the standard deviation, E is the expectation operator,
μ3 is the third central moment, and κt are the t-th cumulants. It is sometimes referred
to as Pearson’s moment coefficient of skewness, or simply the moment coefficient of
skewness, but should not be confused with Pearson’s other skewness statistics (see
below). The last equality expresses skewness in terms of the ratio of the third cumulant
κ3 to the 1.5th power of the second cumulant κ2. This is analogous to the definition of
kurtosis as the fourth cumulant normalized by the square of the second cumulant. The
skewness is also sometimes denoted Skew[X].”

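If σ is finite and non-zero, μ is finite too, and skewness can be expressed in terms of
the non-central moment E[X³] by expanding the previous formula:

$$\operatorname{E}\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right] = \frac{\operatorname{E}[X^{3}] - 3\mu\operatorname{E}[X^{2}] + 3\mu^{2}\operatorname{E}[X] - \mu^{3}}{\sigma^{3}} = \frac{\operatorname{E}[X^{3}] - 3\mu\sigma^{2} - \mu^{3}}{\sigma^{3}}$$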
Example

Skewness can be infinite, as happens when the third cumulant is infinite, or
undefined, as happens when the third cumulant is indeterminate.

The following are some examples of distributions with finite skewness.

◌◌ The skewness of a normal distribution, and of any other symmetric distribution
with a finite third moment, is 0.
◌◌ The skewness of a half-normal distribution is just below 1.
◌◌ The skewness of an exponential distribution is 2.
◌◌ A lognormal distribution can have any positive skewness, depending on its
parameters.
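The values in this list can be reproduced from standard closed-form results. The sketch below is illustrative: the function names are chosen here, and the half-normal and lognormal formulas are textbook results quoted as assumptions rather than derived in this unit. The exponential case is computed from its raw moments E[X^k] = k!/rate^k using the expansion in terms of E[X³], μ and σ.

```python
import math

def skew_from_raw_moments(m1, m2, m3):
    """Skewness from raw moments: (E[X^3] - 3*mu*sigma^2 - mu^3) / sigma^3."""
    var = m2 - m1 ** 2
    return (m3 - 3 * m1 * var - m1 ** 3) / var ** 1.5

def skew_exponential(rate=1.0):
    """Exponential distribution: raw moments are k! / rate^k."""
    return skew_from_raw_moments(1 / rate, 2 / rate ** 2, 6 / rate ** 3)

def skew_half_normal():
    """Half-normal distribution: sqrt(2) * (4 - pi) / (pi - 2)^(3/2)."""
    return math.sqrt(2) * (4 - math.pi) / (math.pi - 2) ** 1.5

def skew_lognormal(sigma):
    """Lognormal distribution: (exp(sigma^2) + 2) * sqrt(exp(sigma^2) - 1)."""
    w = math.exp(sigma ** 2)
    return (w + 2) * math.sqrt(w - 1)

print(round(skew_exponential(), 4))      # 2.0
print(round(skew_half_normal(), 4))      # just below 1
for sigma in (0.1, 0.5, 1.0):
    # The lognormal skewness is positive and grows with sigma
    print(sigma, round(skew_lognormal(sigma), 4))
```

The exponential skewness does not depend on the rate, and the lognormal skewness increases with its shape parameter, which is why any positive value can be reached.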

Sample skewness
“For a sample of n values, two natural method of moments estimators of the
population skewness are

$$b_1 = \frac{m_3}{s^{3}} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{3}}{\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right]^{3/2}}$$

and

$$g_1 = \frac{m_3}{m_2^{3/2}} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{3}}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{2}\right]^{3/2}}$$

where $\bar{x}$ is the sample mean, s is the sample standard deviation, m2 is the
(biased) sample second central moment, and m3 is the sample third central moment.
Another common definition of the sample skewness is”

$$G_1 = \frac{k_3}{k_2^{3/2}} = \frac{n^{2}}{(n-1)(n-2)}\,b_1 = \frac{\sqrt{n(n-1)}}{n-2}\,g_1$$

“where k3 is the unique symmetric unbiased estimator of the third cumulant and
k2 = s² is the symmetric unbiased estimator of the second cumulant (i.e. the sample
variance). This adjusted Fisher–Pearson standardized moment coefficient G1 is the
version found in Excel and several statistical packages including Minitab, SAS and
SPSS.”
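The three sample estimators can be computed side by side. The following is a minimal sketch under the definitions quoted above; the function name `sample_skewness` and the example data are assumptions made for illustration, and the function needs at least three observations that are not all equal.

```python
import math

def sample_skewness(xs):
    """Return the estimators (b1, g1, G1) of skewness for a list of numbers."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three observations")
    xbar = sum(xs) / n
    m2 = sum((x - xbar) ** 2 for x in xs) / n   # biased second central moment
    m3 = sum((x - xbar) ** 3 for x in xs) / n   # third central moment
    s2 = m2 * n / (n - 1)                       # unbiased sample variance s^2
    g1 = m3 / m2 ** 1.5
    b1 = m3 / s2 ** 1.5
    G1 = math.sqrt(n * (n - 1)) / (n - 2) * g1  # adjusted Fisher-Pearson form
    return b1, g1, G1

b1, g1, G1 = sample_skewness([49, 50, 51, 60])
print(round(b1, 3), round(g1, 3), round(G1, 3))
```

For the same data the three values share a sign but differ in size, with b1 smallest and G1 largest in absolute value, so it matters which definition a software package reports.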

“Under the assumption that the underlying random variable X is normally
distributed, it can be shown that all three ratios b1, g1 and G1 are unbiased and
consistent estimators of the population skewness γ1 = 0, with
$\sqrt{n}\,b_1 \xrightarrow{d} N(0, 6)$, i.e., their
distributions converge to a normal distribution with mean 0 and variance 6 (Fisher,
1930). The variance of the sample skewness is thus approximately 6/n for sufficiently
large samples. More precisely, in a random sample of size n from a normal distribution,”

$$\operatorname{var}(G_1) = \frac{6n(n-1)}{(n-2)(n+1)(n+3)}$$

“In normal samples, b1 has the smaller variance of the three estimators, with”

var(b1) < var(g1) < var(G1).

“For non-normal distributions, b1, g1 and G1 are generally biased estimators of the
population skewness γ1; their expected values can even have the opposite sign from
the true skewness. For instance, a mixed distribution consisting of very thin Gaussians
centred at −99, 0.5, and 2 with weights 0.01, 0.66, and 0.33 has a skewness γ1 of about
−9.77, but in a sample of 3 G1 has an expected value of about 0.32, since usually all
three samples are in the positive-valued part of the distribution, which is skewed the
other way.”
3.1.2 Application of Skewness
Following are the applications of skewness:

●● Skewness is a descriptive statistic that can be used, in conjunction with the
histogram and the normal quantile plot, to characterize data or distributions.
●● Skewness indicates the direction and relative magnitude of a distribution’s
departure from the normal distribution.
●● With pronounced skewness, standard statistical inference procedures, such as a
confidence interval for a mean, will be incorrect, not only in the sense that the true
coverage level will differ from the nominal (e.g., 95 percent) level, but also because
they will produce unequal error probabilities on each side.
●● Skewness can be used, via the Cornish-Fisher expansion, to obtain approximate
probabilities and quantiles of distributions (such as value at risk in finance).
●● Many models assume a normal distribution, that is, data that are symmetric about
the mean. The skewness of the normal distribution is zero. In practice, however,
data points may not be perfectly symmetric. Knowing the skewness of a dataset
therefore indicates whether deviations from the mean are more likely to be positive
or negative.
●● The K-squared test, developed by D’Agostino, is a goodness-of-fit normality test
based on sample skewness and sample kurtosis.

Relative Locations in Skewed Distribution

Positive skew: there is a pileup of cases to the left, and the right tail of the
distribution is too long.

Negative skew: there is a pileup of cases to the right, and the left tail of the
distribution is too long.

Measures of Skewness
In addition to measures of central tendency and measures of variation, two
attributes of the frequency distribution of a data set may be of interest to managers
for effective decision-making: skewness and kurtosis. When the
distribution stretches more to the right than it does to the left, the distribution is said to
be ‘right skewed’ or ‘positively skewed’. Similarly, a left-skewed distribution is one
that stretches asymmetrically to the left. Thus, skewness is a measure of the extent
of asymmetry of the distribution. In a symmetrical distribution with a single
mode, mode = mean = median, and the skewness is zero. In the case of
positive skewness (i.e. right skewness) the mean typically lies to the right of the
median, which in turn lies to the right of the mode. The opposite holds for negative
skewness. Skewness can be measured either in absolute terms, as ‘mean minus
mode’, or in relative terms. Some of the relative measures are as follows:

1. Karl Pearson’s coefficient of skewness (SKp). It is defined as:

SKp = (Mean – Mode) / σ

or, when the mode is ill-defined,

SKp = 3 (Mean – Median) / σ

Where, σ is the standard deviation.

2. Bowley’s Coefficient of Skewness (SKB) (quartile coefficient of skewness). It is
defined as:

SKB = (Q3 + Q1 – 2Q2) / (Q3 – Q1)

Where, Q is quartile.

3. Kelly’s coefficient of skewness (SKk). It is defined as:

SKk = (P90 + P10 – 2P50) / (P90 – P10)

Where, P is percentile.

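Skewness can also be described in terms of deviations from the mean. The following
is an example of such a metric:

γ1 = μ3 / σ³

Where, μ3 is the third central moment (the mean of the cubed deviations from the
mean) and σ is the standard deviation.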
4. The Lorenz Curve is a special type of graph used to describe how far a
distribution deviates from a perfectly uniform distribution. It is a cumulative
percentage curve that relates the population to the factor being studied. We could,
for example, plot cumulative population percentages against cumulative wealth
percentages. When comparing two populations, the Lorenz curve is most useful
when their means and standard deviations are the same.
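The quartile-based and percentile-based coefficients in this list can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are chosen here, the data are made up, and `statistics.quantiles` (Python 3.8 or later) is used with its default "exclusive" method. Other quantile conventions give slightly different quartiles and percentiles, and therefore slightly different coefficients.

```python
from statistics import quantiles

def bowley_skewness(data):
    """(Q3 + Q1 - 2*Q2) / (Q3 - Q1), using the three quartile cut points."""
    q1, q2, q3 = quantiles(data, n=4)
    return (q3 + q1 - 2 * q2) / (q3 - q1)

def kelly_skewness(data):
    """(P90 + P10 - 2*P50) / (P90 - P10), using the decile cut points."""
    deciles = quantiles(data, n=10)   # nine cut points: P10, P20, ..., P90
    p10, p50, p90 = deciles[0], deciles[4], deciles[8]
    return (p90 + p10 - 2 * p50) / (p90 - p10)

symmetric = list(range(1, 100))                        # 1, 2, ..., 99
right_skewed = [1, 1, 1, 2, 2, 3, 4, 6, 9, 15, 30]     # made-up data with a long right tail

print(bowley_skewness(symmetric), kelly_skewness(symmetric))
print(round(bowley_skewness(right_skewed), 3), round(kelly_skewness(right_skewed), 3))
```

Both coefficients are zero for the symmetric data and positive for the data with a long right tail. Kelly’s coefficient uses more of the tails than Bowley’s, which looks only at the middle half of the data.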

3.1.3 Pearsonian Measure of Skewness

The Pearson coefficient of skewness is a method established by Karl Pearson for
determining skewness in a sample using descriptive statistics such as the mean and
mode. Skewness is a measure of the shape of a set of data. Excel has no built-in tool
for Pearson’s coefficient of skewness: the skewness reported in the descriptive
statistics section of the Data Analysis ToolPak is calculated from the third power of
deviations around the mean. Pearson’s coefficient of skewness, on the other hand,
uses the mean together with either the mode or the median. You can create a
workaround by combining several basic Excel functions.

In Pearson’s second method, the difference between the mean and the median is
multiplied by three and the result is divided by the standard deviation. To calculate a
value for this measure, use the Excel functions AVERAGE, MEDIAN, and STDEV.S
(or STDEV.P when the data cover the whole population).

The formula when using the mode is –

Sk1 = (x̄ – Mo) / s

Where x̄ = the mean, Mo = the mode and s = the standard deviation for the sample.

The formula when using the median is –

Sk2 = 3 (x̄ – Md) / s

Where x̄ = the mean, Md = the median and s = the standard deviation for the sample.

Pearson’s Measure of Skewness Criteria


Example

For a distribution with a standard deviation of 13, a mean of 59.2 and a mode
of 50.88, find the coefficient of skewness using Karl Pearson’s coefficient.

Solution:
◌◌ Mean = 59.2
◌◌ Mode = 50.88
◌◌ Standard deviation = 13
◌◌ Therefore, using the formula Sk = (Mean – Mode) / Standard deviation,
◌◌ Sk = (59.2 – 50.88) / 13
◌◌ Sk = 0.64

Since Karl Pearson’s coefficient of skewness is 0.64 (positive), the distribution is
positively skewed.
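The same calculation can be written as a short Python sketch. The function names are chosen here for illustration; the first function implements the mode-based formula used in the example and the second the median-based variant.

```python
def pearson_mode_skewness(mean, mode, sd):
    """Pearson's first coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / sd

def pearson_median_skewness(mean, median, sd):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / sd

# Values from the worked example above
sk = pearson_mode_skewness(mean=59.2, mode=50.88, sd=13)
print(round(sk, 2))  # 0.64
```

A positive result indicates positive skew and a negative result negative skew. The median-based version is useful when the mode is not well defined.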

3.1.4 Bowley’s Measure of Skewness

Bowley skewness is a way of determining whether a distribution is positively
or negatively skewed. The Pearson mode skewness formula is one of the most often
used methods for determining skewness. However, you must know the mean, mode (or
median), and standard deviation of your data in order to apply it. You may not have that
information in some cases; instead, you may have information on your quartiles. If this
is the case, you can use Bowley skewness as an alternative to learn more about your
distribution’s asymmetry. It is particularly useful if you have extreme data values
(outliers) or an open-ended distribution.

Bowley Skewness Formula

SKB = (Q3 + Q1 – 2Q2) / (Q3 – Q1)

Where,

SKB = Bowley’s coefficient of skewness,

◌◌ Q1 = first quartile
◌◌ Q2 = second quartile (the median)
◌◌ Q3 = third quartile

◌◌ Skewness = 0 means that the curve is symmetrical.
◌◌ Skewness > 0 means the curve is positively skewed.
◌◌ Skewness < 0 means the curve is negatively skewed.

The above formula can be rewritten as:

SKB = ((Q3 – Q2) – (Q2 – Q1)) / (Q3 – Q1)

In a symmetric distribution, such as the normal distribution, the first (Q1) and third
(Q3) quartiles are at equal distances from the median (Q2). To put it another way,
(Q3 – Q2) and (Q2 – Q1) will be the same. There will be a disparity between those two
values if the distribution is skewed.

Limitations

In its absolute form, (Q3 – Q2) – (Q2 – Q1) = Q3 + Q1 – 2Q2, Bowley skewness is an
absolute measure of skewness. In other words, it returns a result in the units of the
data. The Pearson mode skewness, on the other hand, is dimensionless, because it is
expressed in multiples of the standard deviation. This means that the absolute Bowley
skewness cannot be used to compare the skewness of distributions measured in
different units.

Alternative Bowley Skewness formula

According to Business Statistics, Bowley realised that the absolute measure could not
be used to compare distributions with different units. You cannot compare a
distribution of heights in centimetres with one of weights in pounds, for
example. He therefore proposed a relative formula, which divides the absolute
measure by the interquartile range. If you want to compare multiple distributions
with different units, use this formula:

Relative Skewness = ((Q3 + Q1) – (2 * Median)) / (Q3 – Q1).

Example:
Calculate Bowley’s coefficient of skewness for the following distribution of weekly
wages of workers.

Wages               Below 300   300-400   400-500   500-600   600-700   Above 700
Number of workers   5           8         18        35        27        7

Solution:
First, calculate N/4, N/2 and 3N/4 respectively. Then, calculate Bowley’s
coefficient of skewness by using the formula:

SKB = (Q3 + Q1 – 2Q2) / (Q3 – Q1)

Form the following table:

Class       Frequency   Cumulative Frequency
Below 300   5           5
300-400     8           13
400-500     18          31
500-600     35          66
600-700     27          93
Above 700   7           100

Here, N = 100 and class interval (h) = 100

Q1 = value of the (N/4)th observation = (100/4)th = 25th observation

As we can see in the cumulative frequency column, the 25th observation belongs to
the class 400-500.

Therefore, frequency (f) = 18; cumulative frequency (CF) of the previous class = 13;
lower limit (L) = 400

Thus, Q1 = L + h (N/4 – CF) / f

= 400 + 100 (25 – 13) / 18 = 466.67

Now, Q2 = value of the (N/2)th observation = (100/2)th = 50th observation

As we can see in the cumulative frequency column, the 50th observation belongs to
the class 500-600.

Therefore, frequency (f) = 35; cumulative frequency (CF) of the previous class = 31;
lower limit (L) = 500

Thus, Q2 = L + h (N/2 – CF) / f

= 500 + 100 (50 – 31) / 35 = 554.29

Now, Q3 = value of the (3N/4)th observation = (3*100/4)th = 75th observation

As we can see in the cumulative frequency column, the 75th observation belongs to
the class 600-700.

Therefore, frequency (f) = 27; cumulative frequency (CF) of the previous class = 66;
lower limit (L) = 600

Thus, Q3 = L + h (3N/4 – CF) / f

= 600 + 100 (75 – 66) / 27 = 633.33

Now, Bowley’s coefficient of skewness is:

SKB = (Q3 + Q1 – 2Q2) / (Q3 – Q1)

Thus, SKB = (633.33 + 466.67 – 2(554.29)) / (633.33 – 466.67)

SKB = (1100 – 1108.58) / 166.66 = –0.05

Therefore, Bowley’s coefficient of skewness is –0.05, which indicates that the
distribution is slightly negatively skewed.
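The interpolation steps of this solution can be reproduced with a short Python sketch. It is an illustration only: the function name `grouped_quantile` and the representation of each class as a (lower limit, upper limit, frequency) tuple are choices made here, and the two open-ended classes are given infinite limits because the table does not state their outer boundaries.

```python
import math

# (lower limit, upper limit, frequency); open-ended classes use infinite limits
classes = [
    (-math.inf, 300, 5),
    (300, 400, 8),
    (400, 500, 18),
    (500, 600, 35),
    (600, 700, 27),
    (700, math.inf, 7),
]

def grouped_quantile(classes, fraction):
    """Interpolated quantile L + h * (fraction*N - CF) / f for grouped data."""
    n = sum(f for _, _, f in classes)
    position = fraction * n
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        if cf + f >= position:
            if math.isinf(lower) or math.isinf(upper):
                raise ValueError("quantile falls in an open-ended class")
            return lower + (upper - lower) * (position - cf) / f
        cf += f
    raise ValueError("fraction must lie between 0 and 1")

q1 = grouped_quantile(classes, 0.25)
q2 = grouped_quantile(classes, 0.50)
q3 = grouped_quantile(classes, 0.75)
skb = (q3 + q1 - 2 * q2) / (q3 - q1)
print(round(q1, 2), round(q2, 2), round(q3, 2), round(skb, 2))
# 466.67 554.29 633.33 -0.05
```

Because none of the three quartiles falls in an open-ended class, the unknown outer limits never enter the calculation, which is why Bowley’s measure remains usable for open-ended distributions.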


Example: If Q1 = 80, Q2 = 100, Q3 = 120, find Bowley’s coefficient of skewness.

Solution:
Bowley’s coefficient of skewness = Skb = (Q3 + Q1 – 2Q2) / (Q3 – Q1)

Thus, Skb = (120 + 80 – (2*100)) / (120 – 80) = 0 / 40

Skb = 0

Therefore, the distribution is symmetrical.

Example: The following table gives the number of children of 80 families in a
village. Find Bowley’s coefficient of skewness.

Number of children (x)   0    1    2    3    4    5
Number of families       12   23   16   9    10   10

Solution:
Skb = (Q3 + Q1 – 2Q2) / (Q3 – Q1)

x       Frequency (f)   Cumulative frequency (CF)
0       12              12
1       23              35
2       16              51
3       9               60
4       10              70
5       10              80
Total   80

Q1 = (N/4)th observation = (80/4)th observation = 20th observation

Here, the cumulative frequency just greater than 20 is 35.

Thus, Q1 = 1 (which is the corresponding value of x)

Hence, the lowest 25% of families had 1 child or fewer.

Now, Q2 = (N/2)th observation = (80/2)th observation = 40th observation

Here, the cumulative frequency just greater than 40 is 51.

Thus, Q2 = 2 (which is the corresponding value of x)

Hence, the lowest 50% of families had 2 children or fewer.

Similarly, Q3 = (3N/4)th observation = (3*80/4)th observation = 60th observation

Here, the cumulative frequency just greater than 60 is 70 (the cumulative frequency
at x = 3 equals 60 but does not exceed it).

Thus, Q3 = 4 (which is the corresponding value of x)

Therefore, the lowest 75% of families had 4 children or fewer.

Now, Bowley’s coefficient of skewness =

Skb = (Q3 + Q1 – 2Q2) / (Q3 – Q1)

Skb = (4 + 1 – 2*2) / (4 – 1) = 1/3

Skb = 0.33, thus the distribution is positively skewed.
Amity Directorate of Distance & Online Education


Computational Statistics 205

Check your Understanding


Notes

e
State true or false

in
1. All distributions can be classified as negative or positive skewed.
2. Two halves of a symmetrical distribution are mirror images of each other.
3. The sum of positive and negative deviations from median is always equal to zero in

nl
a symmetrical distribution.
4. It is possible that for some data Arithmetic Mean = Median = Mode, still. it is not
perfectly symmetrical.

O
5. Positive skewness implies that mean value is less than mode.’
6. Median can never be equal to mean in a skewed distribution.

ty
7. Only relative value of skewness is used ‘for comparison even though standard
deviation’ is the same.
8. Skewness cannot he calculated for open end class intervals.

si
9. Skewness does not exist in Bimodel distribution.

Fill in the blanks


1. r
If the mean and the mode of a given distribution are equal then its coefficient of
ve
skewness is ____________.
2. Skewness is positive when mean is __________ mode.
3. In a symmetrical distribution the mean, median and mode are _________.
ni

4. Median can never be equal to _________ in case of skewed distribution.

Summary
U

●● Skewness is a measure of the asymmetry of a real-valued random variable’s


probability distribution around its mean in probability theory and statistics. Positive,
zero, negative, or undefined skewness values are possible.
●● Skewness can be positive as well as negative. If the mean is greater than the
ity

mode or the median, the skewness is positive. If it is less skewness is negative. In


other words, if Mode < Median < Mean, then skewness is positive and if Mean <
Median < Mode, the skewness is negative.
●● Negative skew: The left tail is longer, and the distribution’s mass is concentrated
m

on the right side of the graph. Despite the fact that the curve itself looks to be
skewed or leaning to the right, the distribution is said to be left-skewed, left-tailed,
or skewed to the left; left instead refers to the left tail being dragged out and,
)A

typically, the mean being skewed to the left of a typical centre of the data. A right-
leaning curve represents a left-skewed distribution.
●● Positive skew: The right tail is longer, and the distribution’s mass is concentrated
on the figure’s left side. Despite the fact that the curve itself looks to be skewed or
leaning to the left, the distribution is said to be right-skewed, right-tailed, or skewed
to the right. Right refers to the right tail being dragged out and, typically, the mean

Amity Directorate of Distance & Online Education


being skewed to the right of a typical centre of the data. A left-leaning curve is
typical of a right-skewed distribution.
●● The skewness of a distribution is not directly related to the relationship between
the mean and the median: a negatively skewed distribution can have a mean that is
greater than or less than the median, and so can a positively skewed distribution.
●● The Pearson coefficient of skewness is a measure developed by Karl Pearson for
determining the skewness of a sample from descriptive statistics such as the mean,
the mode and the standard deviation.
●● Bowley skewness is a quartile-based measure for determining whether a distribution
is positively or negatively skewed.

Activity

1. Prepare a presentation on the topic “Pearsonian and Bowley’s measure of Skewness”.
2. Formulate a report describing the application of skewness and its importance in
Computational Statistics.

Questions & Exercises


1. Differentiate between positive skewness and negative skewness.
2. Distinguish between high skewness and moderate skewness.
3. State the formulas of Karl Pearson’s and Bowley’s methods of measuring skewness.
4. What is skewness?
5. Give the absolute and relative measures of skewness.
6. Central tendency, dispersion and skewness are three different measures to analyze
numerical data. Comment.

Glossary
●● Skewness: It refers to the lack of symmetry. Skewness is a measure of the
asymmetry of a real-valued random variable’s probability distribution around its
mean in probability theory and statistics. Positive, zero, negative, or undefined
skewness values are possible.
●● Symmetrical Data: Data in which values of the variable equidistant from the middle
have equal frequencies.
●● U-Shaped Data: Data that has high frequencies at the beginning and end, and the
lowest frequencies in the middle.
●● Negative skew: The left tail is longer, and the distribution’s mass is concentrated
on the right side of the graph. Although the curve itself appears to lean to the right,
the distribution is said to be left-skewed, left-tailed, or skewed to the left; "left" refers
to the left tail being drawn out and, typically, to the mean lying to the left of a typical
centre of the data. A right-leaning curve represents a left-skewed distribution.

●● Positive skew: The right tail is longer, and the distribution’s mass is concentrated
on the figure’s left side. Although the curve itself appears to lean to the left, the
distribution is said to be right-skewed, right-tailed, or skewed to the right; "right" refers
to the right tail being drawn out and, typically, to the mean lying to the right of a typical
centre of the data. A left-leaning curve is typical of a right-skewed distribution.

Further Readings
1. Elhance, D.N. and Veena Elhance, 1988, Fundamentals of Statistics, Kitab
Mahal: Allahabad.
2. Gupta, C.B., An Introduction to Statistical Methods, Vikas Publishing House:
New Delhi.
3. Gupta, S.P., 1989, Elementary Statistical Methods, Sultan Chand & Sons: New
Delhi.
4. Sancheti, D.C. and Kapoor, V.K., 1989, Statistics Theory Methods and
Applications, Sultan Chand & Sons: New Delhi.
5. Shenoy, G.V., Srivastava, V.K. and Sharma, S.C., 1989, Business Statistics,
Wiley Eastern: New Delhi.
6. Simpson, G. and Kafka, F., Basic Statistics, Oxford & IBH Publishing: New
Delhi.

Check your Understanding – Answers


State true or false
1. False
2. True
3. True
4. False
5. False
6. True
7. True
8. False
9. False

Fill in the blanks
1. Zero
2. Greater than
3. Equal
4. Mean

Unit - 3.2: Kurtosis and Moments (Numerical examples and applications)

Objectives
At the end of this unit, you will be able to:

●● Understand the introduction to kurtosis.
●● Learn about the application of kurtosis.
●● Comprehend introduction to moments.
●● Analyze factorial moments.
●● Evaluate Sheppard’s correction for moments.
●● Learn about skewness using moments.
●● Understand kurtosis using moments.

Introduction
This unit will teach you how to distinguish between different shapes of frequency
distributions using various strategies. This is the last unit in the series on univariate
data summarization, and it introduces the idea of kurtosis. The necessity of studying
these ideas stems from the fact that measures of central tendency and dispersion
fall short of completely describing a distribution. Frequency distributions can differ
greatly in nature and composition while still having the same central tendency and
dispersion. As a result, measures in addition to those of central tendency and
dispersion are required.

3.2.1 Introduction to Kurtosis



Kurtosis (from Greek: kyrtos or kurtos, meaning "curved, arching") is, in probability
theory and statistics, a measure of the "tailedness" of the probability distribution of a
real-valued random variable. Like skewness, kurtosis describes the shape of a probability
distribution, and there are various ways of defining it for a theoretical distribution
and of estimating it from a sample of a population. Different measures of kurtosis may
have different interpretations.

Kurtosis is also a feature of a frequency distribution, and it gives a sense of the shape
the distribution might take. It has traditionally been defined as the extent to which a
frequency distribution is peaked as compared to a normal curve, although, as explained
below, the measure is governed by the tails rather than the peak.

The standard measure of a distribution's kurtosis, developed by Karl Pearson, is
a scaled version of the distribution's fourth moment. This number is related to the tails
of the distribution, not the peak; thus, the term "peakedness" is incorrectly applied to
kurtosis. For this measure, higher kurtosis corresponds to greater extremity of deviations
(or outliers), and not to the configuration of data near the mean.

Any univariate normal distribution has a kurtosis of 3, and the kurtosis of other
distributions is commonly compared to this value. Distributions with kurtosis less
than 3 are said to be platykurtic, although this does not imply that the distribution is
"flat-topped," as is sometimes stated. Rather, it means that the distribution produces
fewer and less extreme outliers than the normal distribution does. An example of a
platykurtic distribution is the uniform distribution, which does not produce outliers.

Distributions with kurtosis greater than 3 are said to be leptokurtic. An example is
the Laplace distribution, whose tails approach zero asymptotically more slowly than
those of a Gaussian, and which therefore produces more outliers than the normal
distribution. To provide a comparison with the normal distribution, it is common
practice to use an adjusted version of Pearson's kurtosis, the excess kurtosis, which is
the kurtosis minus 3. Some authors use "kurtosis" by itself to refer to the excess kurtosis.
For clarity and generality, however, this unit follows the non-excess convention and
explicitly indicates where excess kurtosis is meant.

Alternative measures of kurtosis include the L-kurtosis, which is a scaled version of
the fourth L-moment, and measures based on four population or sample quantiles.
These are analogous to the alternative measures of skewness that are not based on
ordinary moments.

Kurtosis has also been described as the characteristic of a frequency distribution that
relates to the concentration of the items in its central part.

In other words, kurtosis is described as the degree of peakedness (or flatness) of the
curve of a frequency distribution, in particular of a single-humped frequency curve.
The measures β2 and γ2 indicate the degree to which the curve of a frequency
distribution is peaked or flat-topped.

Types of Kurtosis
Excess kurtosis is defined as kurtosis minus 3. As detailed below, there are
three distinct regimes.

1. Mesokurtic
2. Leptokurtic
3. Platykurtic

Comparative picture of three types of Kurtosis



Mesokurtic
Distributions with zero excess kurtosis are called mesokurtic, or mesokurtotic.
The most prominent example of a mesokurtic distribution is the normal distribution
family, regardless of the values of its parameters. A few other well-known distributions
can be mesokurtic, depending on parameter values: for example, the binomial
distribution is mesokurtic for p = 1/2 ± √(1/12).

Leptokurtic
A distribution with positive excess kurtosis is called leptokurtic, or leptokurtotic. The
prefix "lepto-" means "slender." In terms of shape, a leptokurtic distribution has fatter
tails. Examples of leptokurtic distributions include the Student's t-distribution, Rayleigh
distribution, Laplace distribution, exponential distribution, Poisson distribution, and
logistic distribution. Such distributions are sometimes termed super-Gaussian.


Platykurtic
A distribution with negative excess kurtosis is called platykurtic, or platykurtotic.
"Platy-" means "broad." In terms of shape, a platykurtic distribution has thinner tails.
Examples of platykurtic distributions include the continuous and discrete uniform
distributions and the raised cosine distribution. The most platykurtic distribution of all
is the Bernoulli distribution with p = 1/2 (for example, the number of times one gets
"heads" when flipping a coin once, a coin toss), for which the excess kurtosis is –2.
Such distributions are sometimes termed sub-Gaussian, a notion originally proposed by
Jean-Pierre Kahane and further described by Buldygin and Kozachenko.

The most platykurtic distribution is the coin toss.
Source: https://en.wikipedia.org/wiki/Kurtosis

Kurtosis Criteria

Measures of Kurtosis
There are two measures of Kurtosis:

◌◌ Karl Pearson’s Measures of Kurtosis
◌◌ Kelly’s Measure of Kurtosis

Karl Pearson’s Measures of Kurtosis:
The second and fourth central moments of the variable are used for calculating the
kurtosis. The following formula, given by Karl Pearson, is used:

β2 = µ4 / µ2²

or γ2 = β2 – 3

where, µ2 = second order central moment of distribution

µ4 = fourth order central moment of distribution

Description:

If β2 = 3 or γ2 = 0, then curve is said to be mesokurtic;

If β2 < 3 or γ2 < 0, then curve is said to be platykurtic;

If β2 > 3 or γ2 > 0, then curve is said to be leptokurtic;
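
The classification above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `central_moment` and `pearson_kurtosis` and the tolerance `tol` are assumptions made for the example, and the moments are computed with divisor N, as in the worked examples of this unit.

```python
import math

def central_moment(data, r):
    """Return the r-th central moment of the data, using divisor N."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** r for x in data) / n

def pearson_kurtosis(data, tol=1e-9):
    """Return (beta2, gamma2, label), where beta2 = mu4 / mu2**2."""
    mu2 = central_moment(data, 2)
    mu4 = central_moment(data, 4)
    beta2 = mu4 / mu2 ** 2
    gamma2 = beta2 - 3
    # tol is a hypothetical tolerance for treating beta2 as equal to 3
    if math.isclose(beta2, 3.0, abs_tol=tol):
        label = "mesokurtic"
    elif beta2 < 3:
        label = "platykurtic"
    else:
        label = "leptokurtic"
    return beta2, gamma2, label

beta2, gamma2, label = pearson_kurtosis([1, 2, 3, 4, 5])
print(round(beta2, 4), round(gamma2, 4), label)  # 1.7 -1.3 platykurtic
```

For the five equally spaced values used here, µ2 = 2 and µ4 = 6.8, so β2 = 1.7 and the data are classed as platykurtic.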



Kelly’s Measure of Kurtosis


Kelly’s measure of kurtosis is based on percentiles. The following formula is used for
calculating the measure of kurtosis:

K = (P75 – P25) / [2 (P90 – P10)]

where P75, P25, P90 and P10 are the 75th, 25th, 90th and 10th percentiles of the
distribution respectively. For a normal distribution, K = 0.26315.

If K > 0.26315, then the distribution is platykurtic.

If K < 0.26315, then the distribution is leptokurtic.
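
A percentile-based measure of this kind can be sketched as follows. The sketch assumes the usual form of the percentile coefficient of kurtosis, (P75 – P25) / (2 (P90 – P10)), which is about 0.263 for a normal distribution; the helper `percentile` uses simple linear interpolation between order statistics, which is one of several common conventions and is an assumption of this example.

```python
def percentile(sorted_data, p):
    """p-th percentile (0 to 100) by linear interpolation between order statistics."""
    pos = (len(sorted_data) - 1) * p / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_data) - 1)
    frac = pos - lower
    return sorted_data[lower] + frac * (sorted_data[upper] - sorted_data[lower])

def percentile_kurtosis(data):
    """Percentile coefficient of kurtosis: (P75 - P25) / (2 * (P90 - P10))."""
    s = sorted(data)
    p10, p25, p75, p90 = (percentile(s, p) for p in (10, 25, 75, 90))
    return (p75 - p25) / (2 * (p90 - p10))

def classify(k, normal_value=0.26315):
    """Compare the coefficient with its value for a normal distribution."""
    if k > normal_value:
        return "platykurtic"
    if k < normal_value:
        return "leptokurtic"
    return "mesokurtic"

k = percentile_kurtosis(range(101))  # evenly spread values 0, 1, ..., 100
print(k, classify(k))  # 0.3125 platykurtic
```

Evenly spread data behave like a uniform distribution, so the coefficient exceeds 0.26315 and the data are classed as platykurtic.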


Example: The first four moments about the mean of a distribution are 0, 2.5, 0.7 and
18.75. Find the coefficient of kurtosis.

Solution:
We have µ1 = 0, µ2 = 2.5, µ3 = 0.7 and µ4 = 18.75

Therefore, Kurtosis, β2 = µ4 / µ2² = 18.75 / (2.5)² = 18.75 / 6.25 = 3

Since β2 = 3, the curve is mesokurtic.

Example: The first four raw moments of a distribution are 2, 136, 320, and 40,000.
Find the coefficient of kurtosis.

Solution:

It is given that

µ1’ = 2, µ2’ = 136, µ3’ = 320 and µ4’ = 40,000

First, calculate the first four central moments:

µ1 = µ1’ – µ1’ = 0

µ2 = µ2’ – (µ1’)² = 136 – 2² = 132

µ3 = µ3’ – 3µ2’µ1’ + 2(µ1’)³ = 320 – 3 * 136 * 2 + 2 (2)³

= 320 – 816 + 16 = –480

µ4 = µ4’ – 4µ1’µ3’ + 6µ2’(µ1’)² – 3(µ1’)⁴

= 40,000 – 4 * 2 * 320 + 6 * 136 * 2² – 3 * 2⁴

= 40,000 – 2,560 + 3,264 – 48 = 40,656

Kurtosis = β2 = µ4 / µ2²

= 40,656 / (132)²

= 2.333

Since β2 < 3, the curve is platykurtic.
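
The conversion from raw moments to central moments used in this example can be checked with a short Python sketch. The function name `raw_to_central` is an assumption made for illustration; the identities are the standard ones relating moments about an arbitrary point to moments about the mean.

```python
def raw_to_central(m1, m2, m3, m4):
    """Convert the first four raw moments into the first four central moments."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return 0, mu2, mu3, mu4  # the first central moment is always zero

mu1, mu2, mu3, mu4 = raw_to_central(2, 136, 320, 40000)
beta2 = mu4 / mu2 ** 2
print(mu2, mu3, mu4, round(beta2, 3))  # 132 -480 40656 2.333
```

The same function applies to moments taken about any arbitrary value, because central moments do not depend on the choice of origin.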

3.2.2 Application of Kurtosis


The sample kurtosis is a useful measure of whether a data set has a problem with
outliers. A larger kurtosis value indicates a more serious outlier problem, and may
lead the researcher to choose alternative statistical methods.

Like D'Agostino's K-squared test, the Jarque–Bera test is a goodness-of-fit test for
normality based on a combination of the sample skewness and the sample kurtosis.
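
As an illustration of how skewness and kurtosis are combined, the Jarque–Bera statistic is commonly written as JB = (n/6) (S² + (K – 3)²/4), where S is the sample skewness and K the sample kurtosis. The formula is not stated in the text above and is supplied here as an assumption of the sketch; the function name `jarque_bera` is likewise chosen for illustration. Large values of JB suggest a departure from normality.

```python
def jarque_bera(data):
    """Return JB = n/6 * (S**2 + (K - 3)**2 / 4) using moments with divisor n."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5   # sample skewness S
    kurt = m4 / m2 ** 2     # sample (non-excess) kurtosis K
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

print(round(jarque_bera([1, 2, 3, 4, 5]), 4))  # 0.3521
```

For the symmetric data used here S = 0 and K = 1.7, so the statistic depends on the kurtosis term alone.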

For non-normal samples, the variance of the sample variance depends on the kurtosis.

Pearson's definition of kurtosis is used as an indicator of intermittency in turbulence.
It is also used in magnetic resonance imaging to quantify non-Gaussian diffusion.

“A concrete example is the following lemma by He, Zhang, and Zhang: Assume
a random variable X has expectation E[X] = µ, variance E[(X – µ)²] = σ² and kurtosis
κ = E[(X – µ)⁴] / σ⁴. Assume we sample n = ((2√3 + 3)/3) κ log(1/δ) many independent
copies. Then

Pr[max(X1, …, Xn) ≤ µ] ≤ δ and Pr[min(X1, …, Xn) ≥ µ] ≤ δ.”

“This shows that with Θ(κ log(1/δ)) many samples, we will see one that is above the
expectation with probability at least 1 – δ. In other words: If the kurtosis is large, we
might see a lot of values either all below or above the mean.”

Convergence of kurtosis
When band-pass filters are applied to digital images, the kurtosis values tend to be
uniform regardless of the filter's range. This behaviour, known as kurtosis
convergence, can be used to detect image splicing in forensic analysis.

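Kurtosis Example
The first four moments of a distribution about the value 5 are 2, 20, 40 and 50.
Comment on the peakedness of the distribution on the basis of the information given.

Solution:
◌◌ Given, first moment about 5, µ1’ = 2
◌◌ Second moment about 5, µ2’ = 20
◌◌ Third moment about 5, µ3’ = 40
◌◌ Fourth moment about 5, µ4’ = 50
◌◌ Because these moments are taken about the value 5 and not about the mean, they
are first converted to central moments:
µ2 = µ2’ – (µ1’)² = 20 – 4 = 16
µ4 = µ4’ – 4µ1’µ3’ + 6µ2’(µ1’)² – 3(µ1’)⁴ = 50 – 320 + 480 – 48 = 162
◌◌ By using the formula, Kurtosis = β2 = µ4 / µ2²

Kurtosis = 162 / (16)²

◌◌ = 0.633
Since kurtosis < 3, the distribution is platykurtic.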

3.2.3 Introduction to Moments



In mathematics, the moments of a function are quantitative measures related to the
shape of the function's graph. If the function represents mass, the first moment is the
centre of mass and the second moment is the rotational inertia. If the function is a
probability distribution, the first moment is the expected value, the second central
moment is the variance, the third standardised moment is the skewness, and the fourth
standardised moment is the kurtosis. The mathematical concept is closely related to
the concept of moment in physics.

On a bounded interval, the collection of all the moments (of all orders, from 0 to ∞)
uniquely determines the distribution of mass or probability (Hausdorff moment problem).
On unbounded intervals, however, this is not the case (Hamburger moment problem).
In the mid-nineteenth century, Pafnuty Chebyshev became the first person to think
systematically in terms of the moments of random variables.

The term "moments" is commonly used to characterize the features of a distribution.
Many of the most regularly used statistical measures, such as measures of central
tendency, variation, skewness, and kurtosis, can be expressed in terms of moments.

Moments are statistical measures that describe the features of a distribution.
Raw moments, central moments, and moments about any arbitrary point are all
examples of moments. The first raw moment, for example, yields the mean, whereas the
second central moment yields the variance. Although direct formulae exist for central
moments, they can easily be calculated from raw moments.

The rth central moment of the variable x equals h^r times the rth central moment of the
variable u, where u = (x – A)/h is a new variable created by changing the origin and scale of x.

Because A does not appear in this relation, a change of origin has no effect
on the central moments.

Three types of moments are:

1. Moments about arbitrary point
2. Moments about mean
3. Moments about origin

The rth moment about mean of a distribution, denoted by µr, is given by

µr = (1/N) ∑ f (x – x̄)^r, where N = ∑ f

The rth moment about mean is hence the mean of the rth power of the deviations of the
observations from their arithmetic mean. Specifically, µ1 = 0 and µ2 = σ² (the variance).

Significance of Moments
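“The n-th raw moment (i.e., moment about zero) of a distribution is defined by”

µn’ = ⟨x^n⟩

Where

⟨f(x)⟩ = ∑ f(x) P(x) for a discrete distribution, and ⟨f(x)⟩ = ∫ f(x) P(x) dx for a
continuous distribution.

The n-th moment of a real-valued continuous function f(x) of a real variable about a
value c is the integral

µn = ∫ (x – c)^n f(x) dx, taken over the whole real line.
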
Moments for random variables can be defined in a more general way than
moments for real-valued functions (moments in metric spaces). Without additional
explanation, the moment of a function usually refers to the above formula with c = 0.
For the second and higher moments, the central moments (moments about the mean,
with c being the mean) are usually used instead of the moments about zero, since they
provide clearer information about the distribution's shape.

“Other moments may also be defined. For example, the n-th inverse moment about
zero is E[X^(–n)] and the n-th logarithmic moment about zero is E[(ln X)^n].”

“The n-th moment about zero of a probability density function f(x) is the expected
value of X^n and is called a raw moment or crude moment. The moments about its mean
μ are called central moments; these describe the shape of the function, independently
of translation.”

If f is a probability density function, the value of the integral above is called the n-th
moment of the probability distribution. More generally, if F is the cumulative
distribution function of any probability distribution, which may or may not have a
density function, then the n-th moment of the probability distribution is given by the
Riemann–Stieltjes integral

µn’ = E[X^n] = ∫ x^n dF(x), taken over the whole real line,

“where X is a random variable that has this cumulative distribution F, and E is the
expectation operator or mean. When

E[|X^n|] = ∫ |x^n| dF(x) = ∞

the moment is said not to exist. If the n-th moment about any point exists, so
does the (n − 1)-th moment (and thus, all lower-order moments) about every point.
The zeroth moment of any probability density function is 1, since the area under any
probability density function must be equal to one.”

Ordinal   Moment (Raw)   Moment (Central)   Moment (Standardized)                 Cumulant (Raw)   Cumulant (Normalized)
1         Mean           0                  0                                     Mean             N/A
2         –              Variance           1                                     Variance         1
3         –              –                  Skewness                              –                Skewness
4         –              –                  (Non-excess or historical) kurtosis   –                Excess kurtosis
5         –              –                  Hyperskewness                         –                –
6         –              –                  Hypertailedness                       –                –
7+        –              –                  –                                     –                –
Significance of moments (raw, central, normalised) and cumulants (raw,
normalised), in connection with named properties of distributions
Source: https://en.wikipedia.org/wiki/Moment_(mathematics)#:~:text=If%20the%20function%20is%20a,concept%20of%20moment%20in%20physics.

1. Mean
“The first raw moment is the mean, usually denoted µ ≡ E[X].”

2. Variance
“The second central moment is the variance. The positive square root of the
variance is the standard deviation σ ≡ (E[(X – µ)²])^(1/2).”

3. Standardized Moments
“The normalised n-th central moment or standardised moment is the n-th central
moment divided by σ^n; the normalised n-th central moment of the random variable X is
µn / σ^n = E[(X – µ)^n] / σ^n.”

These normalised central moments are dimensionless quantities that represent the
distribution independently of any linear change of scale.

For an electric signal, the first moment is its DC level, and the second moment
is proportional to its average power.
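
The scale independence of standardised moments can be illustrated with a small Python sketch. The function name `standardized_moment` and the sample values are assumptions made for the example; the data are rescaled by the hypothetical transformation y = 10x + 3, and the third and fourth standardised moments are compared before and after.

```python
def standardized_moment(data, n):
    """n-th central moment divided by sigma**n, using divisor N throughout."""
    size = len(data)
    mean = sum(data) / size
    sigma = (sum((x - mean) ** 2 for x in data) / size) ** 0.5
    return sum((x - mean) ** n for x in data) / size / sigma ** n

data = [2, 4, 4, 5, 7, 9, 15]           # illustrative values only
rescaled = [10 * x + 3 for x in data]   # change of scale and origin

for n in (3, 4):
    # the two values on each line agree up to floating-point rounding
    print(n, standardized_moment(data, n), standardized_moment(rescaled, n))
```

Because both the central moment and σ^n scale by the same factor, the skewness and kurtosis of the rescaled data match those of the original data.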

●● Skewness
The third central moment is a measure of the lopsidedness of the distribution;
any symmetric distribution will have a third central moment of zero, if defined. The
normalised third central moment is called the skewness, often denoted γ. A distribution
that is skewed to the left (the tail of the distribution is longer on the left) has a
negative skewness. A distribution that is skewed to the right (the tail of the distribution
is longer on the right) has a positive skewness.

“For distributions that are not too different from the normal distribution, the median
will be somewhere near μ − γσ/6; the mode about μ − γσ/2.”

●● Kurtosis
The fourth central moment is a measure of the heaviness of the tail of the distribution,
compared to a normal distribution with the same variance. Because it is the expectation
of a fourth power, the fourth central moment, where defined, is always nonnegative, and
except for a point distribution, it is always strictly positive. A normal distribution's
fourth central moment is 3σ⁴.

The kurtosis κ is defined as the standardised fourth central moment. (Equivalently,
excess kurtosis is the fourth cumulant divided by the square of the second cumulant.)
The kurtosis of a distribution with heavy tails is high (also referred to as leptokurtic);
on the other hand, light-tailed distributions (for example, bounded distributions like
the uniform) have low kurtosis (sometimes called platykurtic).

“The kurtosis can be positive without limit, but κ must be greater than or equal to
γ² + 1; equality only holds for binary distributions. For unbounded skew distributions not
too far from normal, κ tends to be somewhere in the area of γ² and 2γ².

The inequality can be proven by considering

E[(T² – aT – 1)²]

where T = (X − μ)/σ. This is the expectation of a square, so it is non-negative for all
a; however it is also a quadratic polynomial in a. Its discriminant must be non-positive,
which gives the required relationship.”
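
The bound between kurtosis and skewness can be checked numerically. The sketch below is illustrative only: the function name `skew_and_kurtosis` is an assumption, and the two test distributions (a Bernoulli variable with p = 0.3 and a fair six-sided die) are chosen for the example. For a two-point distribution the kurtosis equals γ² + 1 exactly, while for the die it is strictly larger.

```python
def skew_and_kurtosis(values, probs):
    """Return (gamma, kappa) for a discrete distribution given as values and probabilities."""
    mean = sum(v * p for v, p in zip(values, probs))
    var = sum((v - mean) ** 2 * p for v, p in zip(values, probs))
    sigma = var ** 0.5
    gamma = sum((v - mean) ** 3 * p for v, p in zip(values, probs)) / sigma ** 3
    kappa = sum((v - mean) ** 4 * p for v, p in zip(values, probs)) / sigma ** 4
    return gamma, kappa

# Two-point (Bernoulli) distribution: equality kappa = gamma**2 + 1
g_bin, k_bin = skew_and_kurtosis([0, 1], [0.7, 0.3])
print(round(k_bin, 4), round(g_bin ** 2 + 1, 4))  # 1.7619 1.7619

# Fair die: symmetric, so gamma = 0, and kappa is strictly greater than 1
g_die, k_die = skew_and_kurtosis([1, 2, 3, 4, 5, 6], [1 / 6] * 6)
print(k_die > g_die ** 2 + 1)  # True
```

With p = 1/2 the same function gives γ = 0 and κ = 1, that is, an excess kurtosis of –2, the lowest value possible.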

4. High Moments
Moments of order higher than four are referred to as high-order moments.

As with variance, skewness, and kurtosis, these are higher-order statistics that
involve non-linear combinations of the data, and they can be used to describe or
estimate further shape parameters. The higher the moment, the more difficult it is to
estimate, as larger samples are necessary to produce estimates of comparable quality.
This is due to the excess degrees of freedom consumed by the higher orders.
Furthermore, they can be difficult to interpret, and are often best understood in terms
of lower order moments (compare the higher-order derivatives jerk and jounce in
physics). For example, just as the 4th-order moment (kurtosis) can be interpreted as
measuring the "relative importance of tails as compared to shoulders in contribution to
dispersion" (for a given amount of dispersion, higher kurtosis corresponds to thicker
tails, while lower kurtosis corresponds to broader shoulders), the 5th-order moment can
be interpreted as measuring the "relative importance of tails as compared to centre
(mode and shoulders) in contribution to skewness" (for a given amount of skewness, a
higher 5th moment corresponds to higher skewness in the tail portions and little
skewness of the mode, while a lower 5th moment corresponds to more skewness in the
shoulders).

5. Mixed Moments
Moments involving multiple variables are known as mixed moments.

Covariance, coskewness, and cokurtosis are some examples. While there is a unique
covariance, there are multiple co-skewnesses and co-kurtoses.

Example: For the following distribution calculate first four moments about mean and also
find β1, β2, γ1 and γ2:

Marks       5    10   15   20   25   30   35
Frequency   4    10   20   36   16   12   2

Solution:
Table of frequency distribution for calculation of moments (N = ∑f = 100, h = 5):

Marks (x)   f     d = (x – 20)/5   fd     fd²    fd³     fd⁴
5           4     –3               –12    36     –108    324
10          10    –2               –20    40     –80     160
15          20    –1               –20    20     –20     20
20          36    0                0      0      0       0
25          16    1                16     16     16      16
30          12    2                24     48     96      192
35          2     3                6      18     54      162
Total       100                    –6     178    –42     874

Now, µ1’ = (∑fd/N) * h = (–6/100) * 5 = –0.3

µ2’ = (∑fd²/N) * h² = (178/100) * 25 = 44.5

µ3’ = (∑fd³/N) * h³ = (–42/100) * 125 = –52.5

µ4’ = (∑fd⁴/N) * h⁴ = (874/100) * 625 = 5462.5

Moments about mean

µ2 = µ2’ – (µ1’)² = 44.5 – 0.09 = 44.41 = σ²

µ3 = µ3’ – 3µ2’µ1’ + 2(µ1’)³

= –52.5 – 3 * 44.5 * (–0.3) + 2 (–0.3)³

= –52.5 + 40.05 – 0.054 = –12.504

µ4 = µ4’ – 4µ1’µ3’ + 6µ2’(µ1’)² – 3(µ1’)⁴

= 5462.5 – 4 (–0.3) (–52.5) + 6 (44.5) (–0.3)² – 3 (–0.3)⁴

= 5462.5 – 63 + 24.03 – 0.0243

= 5423.5057

β1 = µ3² / µ2³ = (–12.504)² / (44.41)³ = 0.001785

γ1 = µ3 / σ³ = –12.504 / (6.6641)³ = –0.0422

β2 = µ4 / µ2² = 5423.5057 / (44.41)² = 2.7499

γ2 = β2 – 3 = 2.7499 – 3 = –0.2501
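
The step-deviation calculation in this example can be reproduced with a short Python sketch. The function name `grouped_central_moments` and its parameters `A` (assumed origin) and `h` (class width) are chosen for illustration; the code follows the same route as the worked solution, first computing moments about A and then converting them to moments about the mean.

```python
def grouped_central_moments(x, f, A, h):
    """Return (mu2, mu3, mu4) for a frequency table using step deviations d = (x - A) / h."""
    n = sum(f)
    d = [(xi - A) / h for xi in x]
    # moments about the assumed origin A, rescaled back to the original units
    m1, m2, m3, m4 = (
        sum(fi * di ** r for fi, di in zip(f, d)) / n * h ** r for r in range(1, 5)
    )
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu2, mu3, mu4

marks = [5, 10, 15, 20, 25, 30, 35]
freq = [4, 10, 20, 36, 16, 12, 2]
mu2, mu3, mu4 = grouped_central_moments(marks, freq, A=20, h=5)
beta1 = mu3 ** 2 / mu2 ** 3
beta2 = mu4 / mu2 ** 2
print(round(mu2, 4), round(mu3, 4), round(mu4, 4))  # 44.41 -12.504 5423.5057
print(round(beta1, 6), round(beta2, 4))             # 0.001785 2.7499
```

Any convenient value of A gives the same central moments, so A = 20 is used here only because it keeps the deviations small.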

Example: The first four moments of a distribution about the value 5 of a variable are 1,
10, 20 and 25. Find the central moments, β1 and β2.

Solution:

Here,

µ1’ = 1, µ2’ = 10, µ3’ = 20 and µ4’ = 25

Thus, moments about mean can be calculated as follows:

µ2 = µ2’ – (µ1’)² = 10 – (1)² = 9

µ3 = µ3’ – 3µ2’µ1’ + 2(µ1’)³

= 20 – 3 * 10 * 1 + 2 (1)³

= 20 – 30 + 2 = –8

µ4 = µ4’ – 4µ3’µ1’ + 6µ2’(µ1’)² – 3(µ1’)⁴

= 25 – 4 * 20 * 1 + 6 * 10 * (1)² – 3 * (1)⁴

= 2

Thus,

β1 = µ3² / µ2³ = (–8)² / (9)³ = 64 / 729 = 0.0878

β2 = µ4 / µ2² = 2 / (9)² = 2 / 81 = 0.0247

Example: For the following distribution, find central moments, β1 and β2:

Class       1.5 – 2.5   2.5 – 3.5   3.5 – 4.5   4.5 – 5.5   5.5 – 6.5
Frequency   1           3           7           3           1

Solution:
For calculating moments (x is the class mid-point and x̄ = 4),

x       f     d = (x – x̄)   fd    fd²   fd³   fd⁴
2       1     –2            –2    4     –8    16
3       3     –1            –3    3     –3    3
4       7     0             0     0     0     0
5       3     1             3     3     3     3
6       1     2             2     4     8     16
Total   15                  0     14    0     38

Thus,
µ1 = ∑fd/N = 0/15 = 0
µ2 = ∑fd²/N = 14/15 = 0.933
µ3 = ∑fd³/N = 0/15 = 0
µ4 = ∑fd⁴/N = 38/15 = 2.533

β1 = µ3² / µ2³ = (0)² / (0.933)³ = 0

Since β1 = 0, the distribution is symmetrical.
β2 = µ4 / µ2² = 2.533 / (0.933)² = 2.533 / 0.871 = 2.908 ≈ 2.91

Since β2 < 3, the curve is platykurtic.

Example: Wages of workers are given in following table:

Weekly wages   10 – 12   12 – 14   14 – 16   16 – 18   18 – 20   20 – 22   22 – 24
Frequency      1         3         7         20        12        4         3

Find the first four central moments, β1 and β2.

Solution:

For calculating the first four moments (N = ∑f = 50, h = 2),

Wages     f    x    d’ = (x – 17)/2   fd’   fd’²   fd’³   fd’⁴
10 – 12   1    11   –3                –3    9      –27    81
12 – 14   3    13   –2                –6    12     –24    48
14 – 16   7    15   –1                –7    7      –7     7
16 – 18   20   17   0                 0     0      0      0
18 – 20   12   19   1                 12    12     12     12
20 – 22   4    21   2                 8     16     32     64
22 – 24   3    23   3                 9     27     81     243
Total     50                          13    83     67     455

Thus,

µ1’ = (∑fd’/N) * h = (13/50) * 2 = 0.52

µ2’ = (∑fd’²/N) * h² = (83/50) * 4 = 6.64

µ3’ = (∑fd’³/N) * h³ = (67/50) * 8 = 10.72

µ4’ = (∑fd’⁴/N) * h⁴ = (455/50) * 16 = 145.6

Therefore,

µ1 = 0, µ2 = µ2’ – (µ1’)² = 6.64 – 0.2704 = 6.3696

µ3 = µ3’ – 3µ2’µ1’ + 2(µ1’)³

= 10.72 – 3 * 6.64 * 0.52 + 2 (0.52)³

= 10.72 – 10.3584 + 0.2812

= 0.6428

µ4 = µ4’ – 4µ1’µ3’ + 6µ2’(µ1’)² – 3(µ1’)⁴

= 145.6 – 4 * 0.52 * 10.72 + 6 * 6.64 * (0.52)² – 3 * (0.52)⁴

= 145.6 – 22.2976 + 10.7727 – 0.2193

= 133.8558

Hence, the β coefficients are

β1 = µ3² / µ2³ = (0.6428)² / (6.3696)³ = 0.4132 / 258.4262 = 0.0016

β2 = µ4 / µ2² = 133.8558 / (6.3696)² = 133.8558 / 40.5718 = 3.2992
3.2.4 Factorial Moments

The factorial moment is a mathematical quantity defined in probability theory as
the expectation or average of the falling factorial of a random variable. Factorial
moments are useful for studying non-negative integer-valued random variables, and
they arise in the use of probability-generating functions to derive the moments of
discrete random variables.

Factorial moments also serve as analytic tools in combinatorics, the mathematical
field that studies discrete mathematical structures.

Definition
For a natural number r, the r-th factorial moment of a probability distribution on the
real or complex numbers, or, to put it another way, of a random variable X with that
probability distribution, is

E[(X)r] = E[X (X – 1) (X – 2) … (X – r + 1)]

where E is the expectation (operator) and

(x)r = x (x – 1) (x – 2) … (x – r + 1)

“is the falling factorial, which gives rise to the name, although the notation (x)r
varies depending on the mathematical field. Of course, the definition requires that the
expectation is meaningful, which is the case if (X)r ≥ 0 or E[|(X)r|] < ∞.

If X is the number of successes in n trials, and pr is the probability that any r of the
n trials are all successes, then”

E[(X)r] = n (n – 1) (n – 2) … (n – r + 1) pr

The factorial moment of order r about the origin of the frequency distribution
xi | fi (i = 1, 2, …, n) is defined as

µ(r)’ = (1/N) ∑ fi xi(r)

where x(r) = x (x – 1) (x – 2) … (x – r + 1) and N = ∑ fi

Similarly, the factorial moment of order r about any point x = a is given by

µ(r)’ = (1/N) ∑ fi (xi – a)(r)

where (x – a)(r) = (x – a) (x – a – 1) … (x – a – r + 1)

Thus, from the above equation, we have

µ(1)’ = (1/N) ∑ fi xi = µ1’ (about origin) = Mean (x̄)

µ(2)’ = (1/N) ∑ fi xi(2) = (1/N) ∑ fi xi (xi – 1)

= (1/N) ∑ fi xi² – (1/N) ∑ fi xi = µ2’ – µ1’

µ(3)’ = (1/N) ∑ fi xi(3) = (1/N) ∑ fi xi (xi – 1) (xi – 2)

= (1/N) ∑ fi xi³ – 3 (1/N) ∑ fi xi² + 2 (1/N) ∑ fi xi

= µ3’ – 3µ2’ + 2µ1’

µ(4)’ = (1/N) ∑ fi xi(4) = (1/N) ∑ fi xi (xi – 1) (xi – 2) (xi – 3)

= (1/N) ∑ fi xi (xi³ – 6xi² + 11xi – 6)

= (1/N) ∑ fi xi⁴ – 6 (1/N) ∑ fi xi³ + 11 (1/N) ∑ fi xi² – 6 (1/N) ∑ fi xi

= µ4’ – 6µ3’ + 11µ2’ – 6µ1’

Conversely, we can get
µ1’ = µ(1)’

µ2’ = µ(2)’ + µ(1)’

µ3’ = µ(3)’ + 3µ(2)’ + µ(1)’

µ4’ = µ(4)’ + 6µ(3)’ + 7µ(2)’ + µ(1)’
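
These relations between factorial moments and raw moments can be verified numerically. In the sketch below the function names and the small frequency table (the number of heads in three tosses of a fair coin, with frequencies 1, 3, 3, 1) are assumptions made for illustration.

```python
def falling_factorial(x, r):
    """x (x - 1) (x - 2) ... (x - r + 1), with the empty product equal to 1."""
    result = 1
    for k in range(r):
        result *= x - k
    return result

def factorial_moment(x, f, r):
    """Factorial moment of order r about the origin for a frequency table."""
    return sum(fi * falling_factorial(xi, r) for xi, fi in zip(x, f)) / sum(f)

def raw_moment(x, f, r):
    """Raw moment of order r about the origin for a frequency table."""
    return sum(fi * xi ** r for xi, fi in zip(x, f)) / sum(f)

x = [0, 1, 2, 3]
f = [1, 3, 3, 1]
m1, m2, m3, m4 = (raw_moment(x, f, r) for r in range(1, 5))
f1, f2, f3, f4 = (factorial_moment(x, f, r) for r in range(1, 5))

print(f2, m2 - m1)                         # 1.5 1.5
print(f3, m3 - 3 * m2 + 2 * m1)            # 0.75 0.75
print(m4, f4 + 6 * f3 + 7 * f2 + f1)       # 16.5 16.5
```

The values 1.5 and 0.75 also agree with n (n – 1) p² and n (n – 1) (n – 2) p³ for a binomial distribution with n = 3 and p = 1/2.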

Factorial Moments of Binomial Distribution

The rth factorial moment of the Binomial distribution with parameters n and p is:

µ(r)’ = E[(X)r] = n (n – 1) (n – 2) … (n – r + 1) p^r

Factorial Moments of Poisson Distribution

If a variable X has a Poisson distribution with parameter λ, then the factorial
moments of X are

E[(X)r] = λ^r

The probability mass function of the Poisson distribution is given as:

P(X = x) = (e^(–λ) λ^x) / x!

where,

x (number of occurrences) = 0, 1, 2, ….

e (Euler’s number) = 2.71828….

! = factorial function

Example: In a café, customers arrive at a mean rate of 2 per minute. Find the probability
of arrival of 5 customers in 1 minute using the Poisson distribution formula.

Solution:
Here, λ = 2 and x = 5

Thus, using the Poisson distribution formula

P(X = x) = (e^(–λ) λ^x) / x!

P(X = 5) = (e^(–2) 2^5) / 5! = 0.036

Thus, the probability of arrival of 5 customers in one minute is 3.6%.

Example: Find the value of the probability mass function at x = 6, if the value of the mean is 3.4.

Solution:

Given: λ = 3.4, and x = 6.

Using the Poisson distribution formula:

P(X = x) = (e^(–λ) λ^x) / x!

P(X = 6) = (e^(–3.4) (3.4)^6) / 6!

P(X = 6) = 0.072

Thus, the required probability is 7.2%.


Example: Suppose 3% of the electronic units manufactured by a company are defective. Find the probability that in a sample of 200 units, fewer than 2 units are defective.

Solution:
The probability that a unit is defective is p = 3/100 = 0.03

Given n = 200.
We observe that p is small and n is large here. Thus, the Poisson distribution can be used as an approximation.
Mean λ = np = 200 × 0.03 = 6
P(X = x) is given by the Poisson distribution formula as

\[
P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}
\]

\[
\begin{aligned}
P(X < 2) &= P(X = 0) + P(X = 1) \\
&= \frac{e^{-6}\,6^0}{0!} + \frac{e^{-6}\,6^1}{1!} \\
&= e^{-6} + 6e^{-6} \\
&= 0.00248 + 0.01487 \\
&= 0.01735
\end{aligned}
\]

Hence, the probability that fewer than 2 units are defective is 0.01735.
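The three Poisson examples above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `poisson_pmf` is an illustrative choice, not part of any particular package.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x! for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)


# Cafe example: lambda = 2, x = 5
p_cafe = poisson_pmf(5, 2)

# Mean 3.4, x = 6
p_six = poisson_pmf(6, 3.4)

# Defective units: lambda = np = 200 * 0.03 = 6, P(X < 2) = P(0) + P(1)
p_defect = poisson_pmf(0, 6) + poisson_pmf(1, 6)

print(round(p_cafe, 3), round(p_six, 3), round(p_defect, 5))
```

Small differences in the last digits of hand calculations can arise from rounding intermediate values such as e^(-6).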

3.2.5 Sheppard's Correction for Moments

Sheppard's corrections are approximate corrections to moment estimates computed from binned (grouped) data. They are named after William Fleetwood Sheppard. In forming class intervals, we make the basic assumption that the frequencies are evenly distributed around the midpoints of the intervals. This assumption is used in all moment computations for grouped frequency distributions: multiplying the class midpoint, or a power of it, by the corresponding class frequency approximates the sum of the observations, or of their powers, in that class.

This assumption is reasonable for distributions that are symmetrical or nearly symmetrical. However, for highly skewed distributions, or when the class interval exceeds about 1/20th of the range, it is not acceptable. In such cases, W. F. Sheppard recommended several corrections to eliminate the so-called "grouping errors" that can arise when calculating moments.

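If the frequency tapers off to zero in both directions, Sheppard suggested the following corrections to the central moments, assuming a continuous frequency distribution:

\[
\begin{aligned}
\mu_2(\text{corrected}) &= \mu_2 - \frac{h^2}{12} \\
\mu_3(\text{corrected}) &= \mu_3 \\
\mu_4(\text{corrected}) &= \mu_4 - \frac{1}{2}h^2\mu_2 + \frac{7}{240}h^4
\end{aligned}
\]

Here h = the width of the class interval.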

Type of data which can be corrected:

◌◌ The class intervals should all have the same width.
◌◌ This method of moment correction is applicable only to continuous variables, i.e., continuous data.
◌◌ The frequencies must be symmetrical. In both directions, the frequency should gradually decrease to zero.

Example:

| Correct responses | 0 | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No. of students | 1 | 2 | 5 | 10 | 20 | 51 | 22 | 11 | 5 | 3 | 1 |

The above distribution has the following moments about the mean:

µ1 = 0

µ2 = 2.64

µ3 = 0.08

µ4 = 28.30

Apply Sheppard's correction to these moments. Note that the class interval for the distribution is 2.

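Solution:

\[
\mu_1 = 0 \quad \text{(no correction needed)}
\]

\[
\mu_2(\text{corrected}) = \mu_2 - \frac{h^2}{12} = 2.64 - \frac{2^2}{12} = 2.64 - 0.33 = 2.31
\]

\[
\mu_3 = 0.08 \quad \text{(no correction required)}
\]

\[
\mu_4(\text{corrected}) = \mu_4 - \frac{1}{2}h^2\mu_2 + \frac{7}{240}h^4 = 28.30 - 5.28 + 0.47 = 23.49
\]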
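Sheppard's corrections are simple enough to wrap in a small helper. The sketch below assumes the standard correction formulas for the second, third and fourth central moments with class width h; the function name `sheppard_correct` is hypothetical.

```python
def sheppard_correct(mu2, mu3, mu4, h):
    """Apply Sheppard's corrections to central moments of grouped data.

    h is the common class width. The third moment (like the first)
    needs no correction.
    """
    mu2_c = mu2 - h**2 / 12
    mu3_c = mu3
    mu4_c = mu4 - 0.5 * h**2 * mu2 + (7 / 240) * h**4
    return mu2_c, mu3_c, mu4_c


# Moments and class width taken from the worked example above
mu2_c, mu3_c, mu4_c = sheppard_correct(2.64, 0.08, 28.30, h=2)
print(round(mu2_c, 2), mu3_c, round(mu4_c, 2))
```

Note that the fourth-moment correction uses the uncorrected second moment, as in the formula.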

Example: Consider the following distribution of marks:

| Marks | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70 | 70-80 | 80-90 | 90-100 |
|---|---|---|---|---|---|---|---|---|---|---|
| No. of students | 1 | 6 | 11 | 17 | 21 | 16 | 13 | 7 | 5 | 2 |

For the distribution of marks above, the moments are obtained as follows.

Raw Moments –

\[
\mu'_r = \frac{1}{N}\sum_i f_i x_i^r
\]

is the rth raw moment, where f_i is the frequency count, x_i is the mid value of the class, and N is the total frequency. The raw moments of the distribution are obtained by substituting the class mid values and frequencies into this formula.

3.2.6 Skewness Using Moments


The term 'skewness' refers to a lack of symmetry or a departure from symmetry; that is, a skewed distribution is one that is not symmetrical (asymmetrical). Skewness measures the difference between how observations are distributed in a given distribution and how they are distributed in a symmetrical (or normal) distribution. Because statistical theory is frequently based on the assumption of a normal distribution, the idea of skewness gains relevance. A measure of skewness is therefore needed to guard against the consequences of this assumption.

In a symmetrical distribution, the mean, median, and mode all coincide. Skewness is said to be positive if the mean is greater than the mode. In a positively skewed distribution, the mean is greater than the mode and the median lies between the two. Some values in a positively skewed distribution are substantially greater than the majority of the other observations. A distribution is positively skewed when its long tail is on the positive (right) side of the peak.

Skewness is said to be negative when the mode is greater than the mean. In a negatively skewed distribution, the median lies between the mean and the mode, and the mean is pulled towards the low-valued items (that is, to the left). Some values in a negatively skewed distribution are substantially smaller than the majority of the observations. A distribution is negatively skewed when its long tail is on the negative (left) side of the peak. The diagrams below may help you understand what skewness means.

In general,

If Mean > Mode, the skewness is positive.

If Mean < Mode, the skewness is negative.

If Mean = Mode, the skewness is zero.

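Skewness can be measured in a variety of ways.

Example: Find the skewness in the following data using the moments.

| Class | Frequency (f) |
|---|---|
| 0-10 | 1 |
| 10-20 | 3 |
| 20-30 | 5 |
| 30-40 | 7 |
| 40-50 | 4 |
| Sum | N = 20 |

Taking the assumed mean A = 25 and d = x - A, where x is the mid value of each class:

| Class | Frequency (f) | Mid value (x) | d = x - 25 | fd | fd² | fd³ | fd⁴ |
|---|---|---|---|---|---|---|---|
| 0-10 | 1 | 5 | -20 | -20 | 400 | -8000 | 160000 |
| 10-20 | 3 | 15 | -10 | -30 | 300 | -3000 | 30000 |
| 20-30 | 5 | 25 | 0 | 0 | 0 | 0 | 0 |
| 30-40 | 7 | 35 | 10 | 70 | 700 | 7000 | 70000 |
| 40-50 | 4 | 45 | 20 | 80 | 1600 | 32000 | 640000 |
| Sum | N = 20 | | | 100 | 3000 | 28000 | 9,00,000 |

Moments about the assumed mean A = 25:

\[
\mu'_1 = \frac{\sum fd}{N} = \frac{100}{20} = 5, \quad
\mu'_2 = \frac{\sum fd^2}{N} = \frac{3000}{20} = 150, \quad
\mu'_3 = \frac{\sum fd^3}{N} = \frac{28000}{20} = 1400, \quad
\mu'_4 = \frac{\sum fd^4}{N} = \frac{900000}{20} = 45000
\]

Conversion to central moments:

\[
\mu_2 = \mu'_2 - (\mu'_1)^2 = 150 - 5^2 = 150 - 25 = 125
\]

\[
\mu_3 = \mu'_3 - 3\mu'_1\mu'_2 + 2(\mu'_1)^3 = 1400 - (3 \times 5 \times 150) + 2 \times 5^3 = -600
\]

\[
\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2(\mu'_1)^2 - 3(\mu'_1)^4 = 45000 - (4 \times 1400 \times 5) + (6 \times 150 \times 5^2) - (3 \times 5^4) = 37625
\]

Co-efficient of Skewness:

\[
\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(-600)^2}{125^3} = 0.184, \qquad
\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}} = -0.429
\]

Because β1 is a squared quantity, it can never be negative, so the direction of skewness is read from the sign of µ3 (equivalently, of γ1). Here µ3 is negative, so the data is negatively skewed.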
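The grouped-data calculation can be verified with a short script. This is a sketch only: it takes N as the sum of the listed frequencies, uses A = 25 as the assumed mean, and applies the usual conversion formulas from moments about A to central moments. The variable names are illustrative.

```python
mids = [5, 15, 25, 35, 45]   # class mid values
freqs = [1, 3, 5, 7, 4]      # class frequencies
A = 25                       # assumed mean

N = sum(freqs)

# Moments about the assumed mean A
m1, m2, m3, m4 = (
    sum(f * (x - A) ** r for x, f in zip(mids, freqs)) / N
    for r in (1, 2, 3, 4)
)

# Conversion to central moments
mu2 = m2 - m1**2
mu3 = m3 - 3 * m1 * m2 + 2 * m1**3
mu4 = m4 - 4 * m1 * m3 + 6 * m1**2 * m2 - 3 * m1**4

beta1 = mu3**2 / mu2**3      # always non-negative
gamma1 = mu3 / mu2**1.5      # carries the sign of mu3
beta2 = mu4 / mu2**2         # measure of kurtosis

print(N, mu2, mu3, mu4)
print(round(beta1, 3), round(gamma1, 3), round(beta2, 3))
```

The sign of `gamma1` (or of `mu3`) gives the direction of skewness, while `beta1` alone only gives its magnitude.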

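Example: Compute the first four moments and a measure of skewness for the following distribution of wages:

| Weekly earnings (₹) | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No. of men | 1 | 2 | 5 | 10 | 20 | 51 | 22 | 11 | 5 | 3 | 1 |

Solution:

Taking A = 10 as the assumed mean and $d_i = x_i - A$:

| Earnings in ₹ ($x_i$) | Men ($f_i$) | $d_i = x_i - A$ | $f_i d_i$ | $f_i d_i^2$ | $f_i d_i^3$ | $f_i d_i^4$ |
|---|---|---|---|---|---|---|
| 5 | 1 | -5 | -5 | 25 | -125 | 625 |
| 6 | 2 | -4 | -8 | 32 | -128 | 512 |
| 7 | 5 | -3 | -15 | 45 | -135 | 405 |
| 8 | 10 | -2 | -20 | 40 | -80 | 160 |
| 9 | 20 | -1 | -20 | 20 | -20 | 20 |
| 10 | 51 | 0 | 0 | 0 | 0 | 0 |
| 11 | 22 | 1 | 22 | 22 | 22 | 22 |
| 12 | 11 | 2 | 22 | 44 | 88 | 176 |
| 13 | 5 | 3 | 15 | 45 | 135 | 405 |
| 14 | 3 | 4 | 12 | 48 | 192 | 768 |
| 15 | 1 | 5 | 5 | 25 | 125 | 625 |
| Total | $\sum f_i = 131$ | | $\sum f_i d_i = 8$ | $\sum f_i d_i^2 = 346$ | $\sum f_i d_i^3 = 74$ | $\sum f_i d_i^4 = 3718$ |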
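As a cross-check on the table, the central moments of the wage distribution can be computed directly about the actual mean rather than through the assumed mean A = 10. The sketch below is one way to do this in Python; `central_moment` is a hypothetical helper name.

```python
earnings = list(range(5, 16))                  # weekly earnings 5..15
men = [1, 2, 5, 10, 20, 51, 22, 11, 5, 3, 1]   # number of men


def central_moment(xs, fs, r):
    """r-th moment about the mean for a frequency distribution."""
    n = sum(fs)
    mean = sum(f * x for x, f in zip(xs, fs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(xs, fs)) / n


N = sum(men)
mean = sum(f * x for x, f in zip(earnings, men)) / N
mu2, mu3, mu4 = (central_moment(earnings, men, r) for r in (2, 3, 4))

gamma1 = mu3 / mu2**1.5   # moment coefficient of skewness
beta2 = mu4 / mu2**2      # moment measure of kurtosis

print(N, round(mean, 3))
print(round(mu2, 2), round(mu3, 2), round(mu4, 2))
print(round(gamma1, 3), round(beta2, 2))
```

Computing about the actual mean gives the same central moments as converting the moments about A = 10, which is a useful consistency check on hand calculations.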


Example: Use both formulas for Pearson's coefficient to find the skewness of data with the following characteristics:
Mean = 70.5.

Median = 80.

Mode = 85.

Standard deviation = 19.33.

Solution:
Pearson's coefficient of skewness = (Mean – Mode) / Standard Deviation

or,

Pearson's coefficient of skewness = 3(Mean – Median) / Standard Deviation

Using formula 1, i.e., (Mean – Mode) / Standard Deviation

= (70.5 – 85) / 19.33

= -14.5 / 19.33

= -0.75

Using formula 2, i.e., 3(Mean – Median) / Standard Deviation

= 3(70.5 – 80) / 19.33

= 3(-9.5) / 19.33

= -28.5 / 19.33

= -1.47

Both coefficients are negative. Thus, the distribution is negatively skewed.
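Both of Pearson's coefficients are one-line formulas, so they are easy to check. The following sketch uses hypothetical function names and the figures from the example above.

```python
def pearson_mode_skewness(mean, mode, sd):
    """Pearson's first coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / sd


def pearson_median_skewness(mean, median, sd):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / sd


sk1 = pearson_mode_skewness(70.5, 85, 19.33)
sk2 = pearson_median_skewness(70.5, 80, 19.33)
print(round(sk1, 2), round(sk2, 2))
```

The two coefficients generally differ in magnitude, but for the same data they usually agree in sign, which is what indicates the direction of skewness.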

3.2.7 Kurtosis Using Moments

The degree of peakedness of a frequency curve is referred to as kurtosis. It indicates how tall and sharp the central peak is in comparison to a standard bell curve.

The following terms describe kurtosis, measured here as excess kurtosis (that is, β2 – 3, where β2 = µ4/µ2²):

◌◌ Platykurtic – When the excess kurtosis is less than zero, the curve's frequencies are closer to being equal (i.e., the curve is flatter and wider).
◌◌ Leptokurtic – When the excess kurtosis is greater than zero, only a small portion of the curve has high frequencies (i.e., the curve is more peaked).
◌◌ Mesokurtic – When the excess kurtosis is equal to zero.

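Example

Find the kurtosis in the following data using the moments.

| Class | Frequency (f) |
|---|---|
| 0-10 | 1 |
| 10-20 | 3 |
| 20-30 | 5 |
| 30-40 | 7 |
| 40-50 | 4 |
| Sum | N = 20 |

Taking the assumed mean A = 25 and d = x - A, where x is the mid value of each class:

| Class | Frequency (f) | Mid value (x) | d = x - 25 | fd | fd² | fd³ | fd⁴ |
|---|---|---|---|---|---|---|---|
| 0-10 | 1 | 5 | -20 | -20 | 400 | -8000 | 160000 |
| 10-20 | 3 | 15 | -10 | -30 | 300 | -3000 | 30000 |
| 20-30 | 5 | 25 | 0 | 0 | 0 | 0 | 0 |
| 30-40 | 7 | 35 | 10 | 70 | 700 | 7000 | 70000 |
| 40-50 | 4 | 45 | 20 | 80 | 1600 | 32000 | 640000 |
| Sum | N = 20 | | | 100 | 3000 | 28000 | 9,00,000 |

Moments about the assumed mean A = 25:

\[
\mu'_1 = \frac{100}{20} = 5, \quad
\mu'_2 = \frac{3000}{20} = 150, \quad
\mu'_3 = \frac{28000}{20} = 1400, \quad
\mu'_4 = \frac{900000}{20} = 45000
\]

Conversion formula for Moments

\[
\mu_2 = \mu'_2 - (\mu'_1)^2 = 150 - 5^2 = 150 - 25 = 125
\]

\[
\mu_3 = \mu'_3 - 3\mu'_1\mu'_2 + 2(\mu'_1)^3 = 1400 - (3 \times 5 \times 150) + 2 \times 5^3 = -600
\]

\[
\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2(\mu'_1)^2 - 3(\mu'_1)^4 = 45000 - (4 \times 1400 \times 5) + (6 \times 150 \times 5^2) - (3 \times 5^4) = 37625
\]

Co-efficient of Kurtosis:

\[
\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{37625}{125^2} = 2.408
\]

Since the value is less than 3, the data is said to be platykurtic.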


Example: Compute the first four moments and a measure of kurtosis for the following distribution of wages:

| Weekly earnings (₹) | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No. of men | 1 | 2 | 5 | 10 | 20 | 51 | 22 | 11 | 5 | 3 | 1 |

Solution:

Taking A = 10 as the assumed mean and $d_i = x_i - A$:

| Earnings in ₹ ($x_i$) | Men ($f_i$) | $d_i = x_i - A$ | $f_i d_i$ | $f_i d_i^2$ | $f_i d_i^3$ | $f_i d_i^4$ |
|---|---|---|---|---|---|---|
| 5 | 1 | -5 | -5 | 25 | -125 | 625 |
| 6 | 2 | -4 | -8 | 32 | -128 | 512 |
| 7 | 5 | -3 | -15 | 45 | -135 | 405 |
| 8 | 10 | -2 | -20 | 40 | -80 | 160 |
| 9 | 20 | -1 | -20 | 20 | -20 | 20 |
| 10 | 51 | 0 | 0 | 0 | 0 | 0 |
| 11 | 22 | 1 | 22 | 22 | 22 | 22 |
| 12 | 11 | 2 | 22 | 44 | 88 | 176 |
| 13 | 5 | 3 | 15 | 45 | 135 | 405 |
| 14 | 3 | 4 | 12 | 48 | 192 | 768 |
| 15 | 1 | 5 | 5 | 25 | 125 | 625 |
| Total | $\sum f_i = 131$ | | $\sum f_i d_i = 8$ | $\sum f_i d_i^2 = 346$ | $\sum f_i d_i^3 = 74$ | $\sum f_i d_i^4 = 3718$ |


Example: If µ4 = 199, µ3 = 50 and µ2 = 8, calculate the value of excess kurtosis.

Solution:
Here, µ4 = 199

µ3 = 50

µ2 = 8

Hence, the measure of kurtosis is

\[
\beta_2 = \frac{\mu_4}{\mu_2^2}
\]

Excess Kurtosis = β2 – 3

Therefore,

\[
\beta_2 = \frac{199}{8^2} = \frac{199}{64} = 3.109
\]

Excess kurtosis is denoted by g2.

Therefore, g2 = β2 – 3

= 3.109 – 3 = 0.109

Example: For a distribution, the mean is 10, the variance is 16, γ1 is +1 and β2 is 4. Comment on the nature of the distribution.

Solution:

For the distribution

Mean = 10

µ2 = 16

γ1 = +1

β2 = 4

Since γ1 = +1 > 0, the distribution is positively skewed.

If β2 > 3 then the curve is leptokurtic

if β2 = 3 then the curve is mesokurtic

if β2 < 3 then the curve is platykurtic

Here β2 = 4 > 3

γ2 = β2 – 3

γ2 = 4 – 3 = 1

Therefore, the curve is leptokurtic
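The classification rule for β2 can be written as a small function. This is an illustrative sketch: the function name `classify_kurtosis` and the tolerance `tol`, used to decide when β2 counts as equal to 3, are assumptions and not part of the text.

```python
def classify_kurtosis(beta2, tol=1e-9):
    """Classify a curve from beta2 = mu4 / mu2**2 using excess kurtosis beta2 - 3."""
    excess = beta2 - 3
    if excess > tol:
        return "leptokurtic"
    if excess < -tol:
        return "platykurtic"
    return "mesokurtic"


# Values taken from the worked examples in this section
print(classify_kurtosis(199 / 8**2))   # beta2 = 3.109
print(classify_kurtosis(4))            # beta2 = 4
print(classify_kurtosis(2.408))        # beta2 = 2.408
```

With real sample data β2 will almost never equal 3 exactly, so in practice the comparison is made against a tolerance or a formal test rather than strict equality.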


Check your Understanding
State true or false

1. When the kurtosis is equal to zero, it is leptokurtic.
2. The degree of peakedness of a frequency curve is referred to as skewness.
3. The sample kurtosis is a useful metric for determining whether a data collection has any outliers.
4. A higher kurtosis value implies a more serious outlier problem, which may prompt the researcher to use different statistical methodologies.

Fill in the blanks

1. ________ is when the kurtosis is less than zero; the curve's frequencies are closer to being equal (i.e., the curve is flatter and wider).
2. __________ is when the kurtosis is greater than zero; only a small portion of the curve has high frequencies (i.e., the curve is more peaked).
3. The term '_________' refers to a lack of symmetry or a deviation from symmetry.
4. Sheppard's corrections are approximate corrections to moment estimates generated from binned data in statistics. ___________ is the inspiration for this notion.

Summary
●● Kurtosis (from Greek: kyrtos or kurtos, meaning "curved, arching") is a measure of the "tailedness" of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution, and there are various ways of defining it for a theoretical distribution and of estimating it from a sample of a population. Different measures of kurtosis may have different interpretations.
●● There are three types of kurtosis:
◌◌ Mesokurtic, or mesokurtotic, distributions have no excess kurtosis.
◌◌ Leptokurtic, or leptokurtotic, distributions have positive excess kurtosis.
◌◌ Platykurtic, or platykurtotic, distributions have negative excess kurtosis.
●● The sample kurtosis is a useful metric for determining whether a data collection has any outliers. A higher kurtosis value implies a more serious outlier problem, which may prompt the researcher to use different statistical methodologies.
●● In mathematics, the moments of a function are quantitative measures related to the shape of the function's graph.
●● The collection of all the moments (of all orders, from 0 to ∞) uniquely determines the distribution of mass or probability on a bounded interval (Hausdorff moment problem). On unbounded intervals, however, this is not the case (Hamburger moment problem).
●● In the mid-nineteenth century, Pafnuty Chebyshev was the first to think systematically in terms of the moments of random variables.
●● The term "moments" is commonly used to describe the characteristics of a distribution. Many of the most commonly used statistical measures, such as measures of central tendency, variation, skewness, and kurtosis, can be expressed in terms of moments.
●● The factorial moment is a mathematical quantity defined in probability theory as the expectation or average of the falling factorial of a random variable. Factorial moments arise in the use of probability-generating functions to determine the moments of discrete random variables, and are useful for studying non-negative integer-valued random variables.
●● Sheppard's corrections are approximate corrections to moment estimates generated from binned data in statistics. William Fleetwood Sheppard is the inspiration for this notion.
●● The term 'skewness' refers to a lack of symmetry or a deviation from symmetry; that is, a skewed distribution is one that is not symmetrical (asymmetrical). Skewness measures the difference between how observations are distributed in a given distribution and how they are distributed in a symmetrical (or normal) distribution. Because statistical theory is frequently based on the assumption of a normal distribution, the idea of skewness gains relevance. A measure of skewness is therefore needed to guard against the consequences of this assumption.
●● The degree of peakedness of a frequency curve is referred to as kurtosis. It indicates how tall and sharp the central peak is in comparison to a standard bell curve.

Activity
1. Give examples of factorial moments.
2. What are the different applications of kurtosis?

Questions & Exercises
1. Define kurtosis.
2. State the applications of kurtosis.
3. What are moments?
4. Write a note on factorial moments.
5. What are the different types of moments?
6. What is Sheppard's Correction for Moments?
7. What is kurtosis using moments?
8. What is skewness using moments?
Glossary
●● Coefficient of Kurtosis: It is a measure of the relative peakedness of the top of a frequency curve.
●● Moments: Moments are a set of statistical parameters used to describe a distribution.
●● Moment of Order r: It is defined as the arithmetic mean of the rth power of the deviations of observations.
●● Coefficient of Variation: It is defined as the ratio of the SD to the mean, multiplied by 100.
●● Variance: It is the average squared deviation of the data from their mean. For sample data, the average is taken by dividing by (n-1), where n is the sample size, to account for the degrees of freedom. For population data, the average is taken by dividing by the population size N.

Check your Understanding – Answers
State true or false
1. False
2. False
3. True
4. True
Fill in the blanks
1. Platykurtic
2. Leptokurtic
3. Skewness
4. William Fleetwood Sheppard

Module - 4: Correlation and Regression Analysis

Structure:

4.1 Correlation – Meaning, Types, Limitations of Correlation
4.1.1 Introduction to Correlation
4.1.2 Types of Correlation
4.1.3 Limitations of Correlation
4.1.4 Diagrammatic Representation for Types of Correlation

4.2 Correlation Coefficient: Meaning, Properties, Measurement of Coefficient of Correlation – Rank Correlation Coefficient
4.2.1 Introduction to Correlation Coefficient
4.2.2 Meaning of Correlation Coefficient
4.2.3 Properties of Correlation Coefficient
4.2.4 Measurement of Correlation Coefficient
4.2.5 Estimation of Rank Correlation Coefficient
4.2.6 Application of Rank Correlation Coefficient
4.2.7 Kendall's Measure of Correlation

4.3 Karl Pearson's Correlation Coefficient in Bivariate Distribution: Estimation and Interpretations
4.3.1 Meaning of Karl Pearson's Correlation Coefficient
4.3.2 Karl Pearson's Correlation Coefficient in Bivariate Distribution
4.3.3 Estimation of Karl Pearson's Correlation Coefficient
4.3.4 Interpretation of Karl Pearson's Correlation Coefficient
4.3.5 Intra-Class Correlation
4.3.6 Correlation Ratio

4.4 Regression – Meaning, Types, Properties and Assumptions of Regression
4.4.1 Introduction to Regression Analysis
4.4.2 Meaning of Regression
4.4.3 Properties of Regression Analysis
4.4.4 Assumptions of Regression

4.5 Two Variable Linear Regression: Regression Lines and Regression Coefficients
4.5.1 Linear Regression
4.5.2 Two Variable Linear Regression
4.5.3 Regression Lines
4.5.4 Regression Coefficients
4.5.5 Multiple Regression for Trivariate Data

Unit - 4.1: Correlation – Meaning, Types, Limitations of Correlation

Objectives
At the end of this unit, you will be able to:

●● Understand the introduction to correlation.
●● Learn about the types of correlation.
●● Comprehend the limitations of correlation.
●● Analyze diagrammatic representation for types of correlation.

Introduction

In previous units, we examined the statistical treatment of data relating to only one variable. In many situations, however, researchers and decision-makers must analyze the relationship between two or more variables. For example, a company's sales manager may notice that sales are not consistent from month to month. He or she is also aware that the company's advertising budget varies from year to year. This manager would want to know whether there is a relationship between sales and advertising spending. If the manager succeeds in establishing the relationship, he or she can use it to improve planning and to forecast annual sales for the organisation with the help of the regression technique. Similarly, a researcher may be interested in the impact of R&D spending on a company's annual earnings, the relationship between the price index and purchasing power, and so on. If there is a relationship between the variables, they are said to be correlated.

The correlation problem considers the joint variation of two measurements, neither of which is restricted by the experimenter. The regression problem considers the frequency distribution of one variable (the dependent variable) when another variable (the independent variable) is held fixed at each of several levels. This unit therefore introduces the notion of correlation, together with its types and limitations.

4.1.1 Introduction to Correlation


In statistics, correlation or dependence refers to any statistical association between two random variables or bivariate data, whether causal or not. Although correlation can refer to any statistical association, it most commonly refers to the degree to which two variables are linearly related. Familiar examples of dependent phenomena include the relationship between the heights of parents and their offspring, and the relationship between the price of a good and the quantity that consumers are willing to purchase, as depicted in the so-called demand curve.

Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day, based on the correlation between electricity demand and weather. In this case there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling. In general, however, the presence of a correlation is not sufficient to infer the presence of a causal relationship (i.e., correlation does not imply causation).

Random variables are said to be dependent if they do not satisfy the mathematical property of probabilistic independence. Informally, correlation is used as a synonym for dependence. In a technical sense, however, correlation refers to any of several specific types of mathematical operations between the tested variables and their respective expected values. In essence, correlation is a measure of how closely two or more variables are related. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may be present even when one variable is a nonlinear function of the other). Other correlation coefficients, such as Spearman's rank correlation, have been developed to be more robust than Pearson's, that is, more sensitive to nonlinear relationships. Mutual information can also be used to measure the dependence between two variables.

Correlation examines how two variables vary in relation to one another in order to determine the relationship, or association, between them. Statistical correlation also refers to changes in two variables that occur together, and it is most commonly described in terms of linear relationships. It is important to note that correlation does not imply causality: a correlation reflects the relationship between two or more variables, not whether one causes changes in the other.

The strength of the association is measured by the sample correlation coefficient, r. The statistical significance of a correlation can also be tested.

What They Say About Correlation: Some Definitions

"When the relationship is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation."
–Croxton and Cowden

"Correlation is an analysis of the covariation between two or more variables."
–A.M. Tuttle

"Correlation analysis deals with the association between two or more variables."
–Simpson and Kafka

Meaning of Correlation
An index of relationship, known as the coefficient of correlation, is used to quantify the degree of association or relationship between two variables.

The coefficient of correlation is a numerical measure that indicates how closely two variables are related and how changes in one variable are associated with changes in the other. The correlation coefficient is represented by the letter r or ρ (rho).

The product-moment correlation coefficient, also known as Karl Pearson's Coefficient of Correlation, is denoted by r. The Rank Difference Correlation Coefficient, also known as Spearman's Rank Correlation Coefficient, is denoted by the symbol ρ (rho).

The magnitude of r indicates the degree (or extent) of correlation between the two variables. The value of r is positive if the correlation is positive and negative if the correlation is negative. The sign of the coefficient therefore indicates the type of association. The value of r ranges from +1 to -1.

Perfect positive correlation and perfect negative correlation are the two extremes. The top of the scale, +1, represents perfect positive correlation. From there the scale runs down to zero, which signifies a complete lack of correlation.

The bottom of the scale is -1, which indicates perfect negative correlation. The scale from +1 to -1 thus provides a numerical measure of the correlation.

[The correlation coefficient is a number, not a percentage. It is usually rounded to two decimal places.]

Need
Correlation helps to establish the meaning of a construct. Basic psycho-educational research requires correlational analysis. In fact, the majority of basic and applied psychology research is correlational. In particular, correlational analysis is needed for:

(i) Determining the characteristics of psychological and educational tests (reliability, validity, item analysis, etc.).
(ii) Confirming that certain data support the hypothesis.
(iii) Predicting one variable from information about another variable (or variables).
(iv) Developing psychological and educational theories and models.
(v) Grouping variables/measures to simplify data interpretation.
(vi) Performing multivariate statistical tests (Hotelling's T², MANOVA, MANCOVA, Discriminant Analysis, and Factor Analysis).
(vii) Separating the effects of variables.

4.1.2 Types of Correlation


The correlation in a bivariate distribution may be:

1. Positive, Negative or Zero Correlation; and
2. Linear or Curvilinear (Non-linear)

1. Positive, Negative, or Zero Correlation: Positive correlation occurs when an increase in one variable (X) is accompanied by a corresponding increase in the other variable (Y). Positive correlation ranges from 0 to +1, with +1 being the perfect positive coefficient of correlation.

Perfect positive correlation means that for every unit increase in one variable there is a proportional increase in the other. "Heat" and "Temperature," for example, have a perfect positive correlation.

Negative correlation occurs when an increase in one variable (X) is accompanied by a corresponding decrease in the other variable (Y).

Examples

| Positive Relationships | Negative Relationships |
|---|---|
| Water consumption and temperature. | Alcohol consumption and driving ability. |
| Study time and grades. | Price and quantity demanded. |

Negative correlation ranges from 0 to -1, with -1 being the perfect negative coefficient of correlation. Perfect negative correlation means that for every unit increase in one variable, the other decreases proportionally.

A zero correlation indicates that there is no relationship between the two variables X and Y; that is, changes in one variable (X) are unrelated to changes in the other variable (Y). Examples include body weight and intelligence, or shoe size and monthly wage. Zero correlation is the midpoint of the range -1 to +1.

2. Linear or Curvilinear Correlation: Correlation is linear when the ratio of change between the two variables is constant, in the same or in opposite directions, so that the graph of one variable against the other is a straight line.

Consider a different situation. As one variable increases, the second variable at first increases proportionately up to a point; beyond that point, as the first variable continues to increase, the second variable begins to decrease.

The graph of the two variables is then a curved line. Such a relationship between two variables is called curvilinear correlation.

Further types of correlation are given below:

◌◌ Simple correlation: Only two variables are studied in a simple correlation problem.
◌◌ Multiple correlation: Three or more variables are studied simultaneously. Ex. $Q_d = f(P, P_C, P_S, t, y)$
◌◌ Partial correlation: More than two variables are recognized, but only two of them are examined while the others are held constant.
◌◌ Total correlation: This is based on all the relevant variables, which is normally not feasible.

Methods of Computing Co-Efficient of Correlation
The following three methods are used to compute the value of the coefficient of correlation for ungrouped data in a bivariate distribution:

1. Scatter diagram method.
2. Pearson's Product Moment Co-efficient of Correlation.
3. Spearman's Rank Order Co-efficient of Correlation.

1. Scatter Diagram Method
A scatter diagram, also known as a dot diagram, is a graphical tool for drawing inferences about the relationship between two variables.

To construct a scatter diagram, each observed pair of values is plotted as a dot on graph paper in two-dimensional space, with variable X measured along the horizontal axis and variable Y along the vertical axis.

The pattern of these dots shows whether the variables change in the same direction or in opposite directions. It is a quick and easy, but rough, way of assessing correlation.

The points are plotted using scales that are convenient for both series. Depending on the degree of correlation, the plotted points tend to concentrate in a narrower or wider band. The direction of the 'line of best fit', drawn freehand, indicates the nature of the correlation. The figures below show scatter diagrams depicting varying degrees of association.

[Figure: Scatter diagrams showing varying degrees of relationship between X and Y.]

[Figure: Scatter diagram illustrating linear and curvilinear relationships.]

The correlation is positive if the line rises from left to right. Similarly, if the line falls from left to right, the correlation is negative.

The degree of association is judged from how closely the points cluster around the line. If the plotted points are widely scattered, there is little or no association. This method only indicates whether the correlation is positive or negative; it does not give a numerical value.

2. Pearson's Product Moment Coefficient of Correlation: The coefficient of


correlation, r, is commonly referred to as the "Pearson r" after Professor Karl Pearson,

si
who devised the product-moment approach after Gallon and Bravais' previous work.
Later units will go over this.

Karl Pearson devised the product-moment correlation coefficient in 1896. This
correlation coefficient is commonly used with continuous variables. Pearson's
product-moment correlation coefficient (r) can be defined as

$$r = \frac{\sum xy}{N \sigma_x \sigma_y}$$

where x and y are the deviations of X and Y from their respective means, N is the number
of pairs, and σx and σy are the standard deviations of X and Y. The quantity ∑xy / N is
the covariance between X and Y. It can be shown that the covariance can never exceed the
product of the two standard deviations in absolute value, so the correlation coefficient's
maximum value must be 1. The sign of Pearson's r is determined by the sign of the sum of
the products of x and y derived from their deviations. If this sum is negative, r will be
negative, and if it is positive, r will be positive.

This formula's denominator is always positive, so r takes the sign of the numerator.
Together with the bound above, this is why the correlation coefficient ranges from –1 to +1.

We may arrive at this form of Pearson's correlation coefficient by following a
simple rule: a ÷ b ÷ c = a ÷ (b × c). Dividing the covariance ∑xy / N first by σx and
then by σy is the same as dividing it by the product σx σy.

Pearson's correlation coefficient is useful for calculating the correlation between two
continuous variables.

Pearson's correlation may be calculated using one of two methods: the deviation
score method or the raw score method. The obtained coefficient is interpreted in terms
of its strength and direction. Restriction of range, measurement error, outliers,
and curvilinearity are all issues to consider when evaluating the correlation coefficient.
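
The two calculation methods named above can be sketched in Python as follows. This is a minimal illustration, not the textbook's own worked procedure: the function names and the sample values are assumptions made up for the example, and the raw score version uses the standard computational form of the same formula.

```python
import math


def pearson_deviation(xs, ys):
    """Deviation score method: r = sum(xy) / (N * sd_x * sd_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations of X from its mean
    dy = [y - mean_y for y in ys]  # deviations of Y from its mean
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sd_x = math.sqrt(sum(a * a for a in dx) / n)
    sd_y = math.sqrt(sum(b * b for b in dy) / n)
    return sum_xy / (n * sd_x * sd_y)


def pearson_raw(xs, ys):
    """Raw score method: works directly on sums of X, Y, XY, X^2 and Y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator


# Hypothetical scores, invented only for this illustration.
x_scores = [2, 4, 6, 8, 10]
y_scores = [1, 3, 4, 7, 9]

r_dev = pearson_deviation(x_scores, y_scores)
r_raw = pearson_raw(x_scores, y_scores)
print(round(r_dev, 4), round(r_raw, 4))
```

Both functions should return the same value up to rounding, because the raw score form is only an algebraic rearrangement of the deviation score form.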


3. Spearman's Rank Correlation Coefficient: In Education and Psychology, there are
some situations where objects or individuals are ranked and arranged in order of
merit or proficiency on two variables. When these two sets of ranks covary or
agree, we use rank correlation to measure the degree of relationship.

There are also situations where the relationship between the measurements is non-linear
and cannot be captured by the product-moment r.

When data are supplied on two variables for n participants, Spearman's rank-order
correlation, or Spearman's rho (rs), introduced by the well-known psychologist
Charles Spearman (1904), can be computed. It may also be computed using data from
n participants rated by two judges to see if the judges agree. It is appropriate for
rank-order data. Spearman's rho can be used if the data on X or Y, or both variables, are
in rank order. When the assumptions of Pearson's r are not met, it can
also be used with continuous data. It measures the strength of a monotonic
relationship. Spearman's rho (rs) has a range of –1.00 to +1.00, similar to Pearson's r.

The interpretation of Spearman's rho depends on the sign of the coefficient
and the value of the coefficient, as with Pearson's correlation. The relationship is
positive if the sign of rs is positive, and it is negative if the sign of rs is negative. The
association is weak if the value of rs is near zero, but as the value of rs approaches
+1 or -1, the relationship becomes stronger. There is no monotonic relationship between
X and Y when rs is zero, and there is a perfect monotonic relationship between X and Y
when rs has absolute value 1. Whatever the value of rs, it does not imply causation.

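To compute Spearman's rho, you'll need data on the X and Y variables. If your data
is on continuous variables, you'll need to transform it to rank order. The following is
the computing formula:

$$r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$$

where D is the difference between the two ranks of each participant and n is the number
of participants.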
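
As a rough sketch of the procedure just described (convert the scores to ranks, then apply the rank-difference formula), the following Python code may help. It assumes the standard formula rs = 1 − 6ΣD² / (n(n² − 1)), which is exact only when there are no tied ranks; the function names and the sample scores are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho from rank differences (assumes no tied ranks)."""
    n = len(xs)
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


# Hypothetical continuous scores for five participants.
x_scores = [10, 20, 30, 40, 50]
y_scores = [1, 3, 2, 5, 4]

rho = spearman_rho(x_scores, y_scores)
print(round(rho, 2))
```

For these values the rank differences are 0, −1, 1, −1, 1, so ΣD² = 4 and rs = 1 − 24/120 = 0.8.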

4.1.3 Limitations of Correlation


Correlation cannot account for the presence or effect of factors other than the two
variables being investigated. Correlation also does not reveal cause and
effect. Curvilinear relationships are similarly difficult to characterize using correlation.

Following are the disadvantages of a Correlational Research Study:

1. Correlational research is limited to identifying relationships.


A correlational research study has the advantage of revealing relationships
that were previously unknown. What it lacks is a conclusive explanation for why the
link exists in the first place. With the information we have, we can only explore the
links between phenomena to understand how each one affects the other. When looking
for unique results, knowing that one modification can cause subsequent changes can
be useful, but it does not address the question of "why," which is sometimes required for
a study.

2. It will not reveal which variables have the most influence.
A correlational research study can aid in determining the relationships that exist
between factors and a given occurrence. What this research cannot provide is
information on which variable is influencing the other. You might know that wealthier
households have higher education levels, but you can't tell whether it is the education
that leads to more money.

That means the direction of influence for a given variable must be assumed, or the data
must be collected using a different study strategy.

3. Correlational research can take a long time to complete.
Although there are numerous advantages to doing a correlational research study,
the process can be costly and time-consuming. Direct contact with or observation of
the variables in question is the only way to collect data. That means a number of
scenarios must be thoroughly examined before an exact coefficient can be determined.
This disadvantage is most typically seen in the naturalistic observation approach, but it
can apply to any study in this category.

4. Extraneous variables may cause the information to be distorted.
Additional influences may or may not be excluded from the correlational research
investigation. It's possible that unexpected effects will arise that interfere with the
task. In the case of the toddler and the ice cream truck, strong gusts could make it
appear as if the vehicle is closer or farther away than it actually is.

Another issue that falls under this disadvantage is the subjects' awareness of the
observer. When people are aware that they are being observed, they behave differently,
which can bias the data in either direction. This problem even affects surveys, as some
people try to provide or withhold data in order to achieve specific results.

5. The quality of the work can have a negative impact on the outcome.
The value of the data acquired will be determined by the quality of the work done
during a correlational research project. The time and money spent on the survey
will be wasted if the survey questions do not prompt respondents to provide
useful information. Even if the study's framework allows for some flexibility, a lack of
representation in the sample might lead to subpar results and lead researchers down
the wrong path of research.

The field of psychology is home to the majority of correlational research
investigations. Correlational research is used as a first step in gathering information
on a certain topic or circumstance when experimentation isn't possible. Although it
normally looks at two variables to see if a correlation exists, in some cases it can
look at more.

The researchers have little control over the factors, which is why this form of
research might be troublesome at times. It's also why it's such a popular approach to
examining certain data points.


4.1.4 Diagrammatic Representation for Types of Correlation


A scatter plot, scatter graph, or correlation chart is another name for a scatter
diagram.

We create a scatter diagram with two variables. The first variable is independent,
and the second is dependent on the first.

The scatter diagram is the most straightforward technique to investigate the
relationship between these variables. After determining how they are related, you can
forecast the behaviour of the dependent variable based on the independent variable.

A. Correlation: Linear And Non-Linear Relationship
●● Linear Relationship
The linear relationship can be expressed in the following equation:

Y = α + βX

Y is the variable on the y-axis (commonly referred to as the dependent variable), α
is a constant, the Y intercept of the straight line, β is the line's slope, and X is the
variable on the x-axis (often called the independent variable). To further grasp the
linearity of the correlation, we plot a scatter diagram, together with the line that best
fits the data, for the table below.

Extroversion Scores    Number of Friends
12                     3
10                     1
14                     5
18                     9
16                     4
20                     7

The scatter of the same data is presented in the graph below. It also displays
the line that best fits the data. It demonstrates that the two variables,
extroversion and the number of friends, have a linear correlation: the points in the
graph follow a straight-line pattern. Pearson's product-moment correlation is the usual
measure of a linear relationship; Spearman's rho and other coefficients are also used.
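
To make the equation Y = α + βX concrete, here is a small Python sketch that fits a straight line to the extroversion and friends data from the table above. It assumes the "best fit" line is the ordinary least-squares line, which the text does not state explicitly; the variable names are chosen for the example.

```python
import math

# Data from the table above.
extroversion = [12, 10, 14, 18, 16, 20]
friends = [3, 1, 5, 9, 4, 7]

n = len(extroversion)
mean_x = sum(extroversion) / n
mean_y = sum(friends) / n

# Sums of squares and cross-products of deviations from the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(extroversion, friends))
s_xx = sum((x - mean_x) ** 2 for x in extroversion)
s_yy = sum((y - mean_y) ** 2 for y in friends)

beta = s_xy / s_xx               # slope of the least-squares line
alpha = mean_y - beta * mean_x   # Y intercept
r = s_xy / math.sqrt(s_xx * s_yy)  # Pearson's r for the same data

print(f"Y = {alpha:.3f} + {beta:.3f} X, r = {r:.3f}")
```

For these six pairs the slope works out to about 0.671, the intercept to about −5.238, and r to about 0.879, which is consistent with the strong positive linear pattern described for the graph.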



●● Non-linear Relationship
Other types of relationships exist as well. They are called curvilinear or non-linear
relationships. One such example is the Yerkes-Dodson Law,
which describes the association between stress and performance. It states that
performance is poor when there is too little or too much stress, and that performance
improves when stress is moderate. Curvilinear relationships come in a variety of forms
(cubic, quadratic, polynomial, exponential, etc.). This relationship is shown in the
diagram below.

B. Direction Of Correlation: Positive And Negative
●● Positive Correlation
A positive correlation means that when one variable's value rises, the other
variable's value rises as well. Similarly, as one variable's value falls, the value of the
other variable falls as well. Both variables move in the same
direction. Cold-drink sales, for example, increase when the temperature rises. As one's
willingness to try new things grows, so does one's creativity. The positive relationship is
seen in the scatterplot below.

You'll notice that higher X axis scores are generally associated with higher Y axis
scores, while lower X axis scores are generally associated with lower Y axis scores. In
example 'a', higher temperatures are linked to higher cold-drink sales.
Similarly, when the temperature decreases, so do sales of cold drinks.

●● Negative Correlation
A negative correlation means that when one variable's value rises, the value
of the other variable falls. Similarly, as one variable's values decline, the values of the
other variable rise. The two variables move in opposite directions.
The sale of woollen clothing, for example, decreases when the temperature rises. As
social anxiety rises, assertiveness falls.

The scatterplot illustrating the negative association is shown below. You'll notice
that higher x-axis scores are usually associated with lower y-axis scores, while lower
x-axis scores are usually linked with higher y-axis scores. In the first case (example a),
higher temperatures are associated with decreased sales of woollen clothing. Lower
temperatures are likewise associated with increased sales of woollen clothing. Students
frequently perceive a negative association as 'bad' or 'undesirable'. Positive and
negative are only directions, however. They are neither desirable nor undesirable.

●● No Correlation
You've now grasped the concept of positive and negative correlation. However,
there is a third possibility: there may be no link between the two variables. If they
don't have any link (i.e., the correlation coefficient is 0), then the correlation's direction
is neither positive nor negative. This is known as no correlation or zero
correlation. (It's important to understand that 'zero order correlation' is not the same
as 'zero correlation.') What, for example, is the link between an individual's salary and
height? It's easy to see that the two are unrelated. The graph below shows data from
a hundred people; it is the scatterplot for no association.

C. Degree Of Correlation
●● Scatter Diagram with No Correlation
"Scatter Diagram with Zero Degree of Correlation" is another name for this
diagram.

You can't draw a line through the data points since they're scattered all over the place.
As a result, you can say that these variables are unrelated.

●● Scatter Diagram with Moderate Correlation
"Scatter Diagram with a Low Degree of Correlation" is another name for this
diagram.

The data points are closer together now, and you can see that there is a
relationship between these variables.

●● Scatter Diagram with Strong Correlation
"Scatter Diagram with a High Degree of Correlation" is another name for this
diagram.

The data points in this picture are close together, and you can draw a line by
following their pattern.

Limitations of a Scatter Diagram

◌◌ Scatter graphs can't tell you exactly how strong the association is.
◌◌ The link between the variables is not quantitatively measured in a scatter plot.
It merely gives a qualitative picture of how one quantity changes with the other.
◌◌ The association between more than two variables is not shown in this graph.

Benefits of a Scatter Diagram
◌◌ It depicts the connection between two variables.
◌◌ It's the most effective way to demonstrate a non-linear pattern.
◌◌ It is possible to determine the range of the data, such as the maximum and
minimum value.
◌◌ Patterns are simple to spot.
◌◌ The diagram is easy to plot.

Check your Understanding


1. In statistics, correlation or dependence refers to any statistical association between
two random variables or bivariate data, whether causal or not. State true or false.

2. _________ can refer to any statistical link, but it most commonly relates to the degree
to which two variables are linearly related.
3. Correlation examines how two variables vary in relation to one another to determine
the relationship, or association, between them. State true or false.
4. Perfect positive correlation and perfect negative correlation are two types of
correlation. State true or false.

5. Negative correlation occurs when a rise in one variable (X) is accompanied by a
commensurate drop in the other variable (Y). State true or false.
6. A _________ indicates that there is no relationship between the two variables X and
Y; that is, changes in one variable (X) are unrelated to changes in the other variable
(Y).
7. In __________, only two variables are investigated in a correlation problem.
8. In ________, three or more variables are investigated in a correlation problem.
9. ________ correlated data is the analysis that detects more than two variables but
only examines two of them while keeping the others constant.
10. Total correlation is calculated using all relevant variables, which is generally
impossible. State true or false.

Summary
●● In statistics, correlation or dependence refers to any statistical association
between two random variables or bivariate data, whether causal or not. Correlation
can refer to any statistical link, but it most commonly relates to the degree to which
two variables are linearly related. The link between the height of parents and their
offspring, as well as the correlation between the price of an item and the quantity
that consumers are willing to purchase, as illustrated in the so-called demand
curve, are both instances of dependent phenomena.
●● Correlations are helpful because they can reveal a predictive relationship that can
be used in the real world.
●● An index of relationship, known as the co-efficient of correlation, is used to
quantify the degree of association or relationship between two variables.
●● The correlation coefficient is usually represented by the letter r or ρ (rho).
●● A scatter plot, scatter graph, or correlation chart is another name for a scatter
diagram.
●● Scatter diagrams can help you figure out how two variables are related. This link
might exist between two causes or between a cause and an effect. It might be
either positive or negative, or there may be no relationship at all. The first variable is
independent, and the second is dependent on the first. You adjust the
independent variable and track the changes in the dependent variable to evaluate
the pattern of the relationship. Two variables can be represented in a
scatter diagram.


Activity
1. Prepare a presentation on the topic “Correlation”.
2. Suggest five pairs of variables which you expect to be positively correlated.
3. Suggest five pairs of variables which you expect to be negatively correlated.
4. Using data related to India’s population and national income, calculate the correlation
between them using the step deviation method.

Questions & Exercises

1. What is correlation?
2. What are the different types of correlation?
3. Suggest eight pairs of variables, four that you expect to be positively
correlated and four that you expect to be negatively correlated.
4. How does a scatter diagram approach help in studying the correlation between two
variables?
5. State the limitations of correlation.
6. What is the diagrammatic representation of types of correlation?
7. State the limitations of scatter diagrams.
8. What are the benefits of scatter diagrams?

Glossary
●● Correlation Analysis: Refers to a measure of association between two random
variables. If two random variables are such that when one changes
the other does so in a related manner, they are said to be correlated.
Variables which are independent are not correlated. The correlation coefficient
is a number between -1 and +1. It can be calculated from a number of pairs of
observations, which are normally referred to as points (X, Y). A coefficient of
+1 implies perfect positive correlation, -1 perfect negative correlation and 0 no
correlation.
●● Scatter Diagram: A diagram showing the joint variation of two variables X and
Y. Each member is represented by a point whose coordinates, on ordinary
rectangular axes, are the values of the variables. A set of n observations thus
provides n points on the diagram, and the scatter or clustering of the points exhibits
the relationship between X and Y.

Further Readings
1. Nagar, A.L. and R.K. Das, 1989. Basic Statistics, Oxford University Press: Delhi.
2. Goon, A.M., M.K. Gupta and B. Dasgupta, 1987. Basic Statistics, The World
Press Pvt. Ltd.: Calcutta.
3. Peters, W.S. and G.W. Summers, 1968. Statistical Analysis for Business
Decisions, Prentice Hall: Englewood Cliffs.
4. Srivastava, U.K., G.V. Shenoy and S.C. Sharma, 1987. Quantitative
Techniques for Managerial Decision Making, Wiley Eastern: New Delhi.
5. Stevenson, W.J. 1978. Business Statistics-Concepts and Applications, Harper
and Row: New York.
6. Aron, A., Aron, E. N., & Coups, E. J. (2007). Statistics for Psychology. Delhi:
Pearson Education.
7. Minium, E. W., King, B. M., & Bear, G. (2001). Statistical Reasoning in
Psychology and Education. Singapore: John-Wiley.
8. Guilford, J. P., & Fruchter, B. (1978). Fundamental Statistics for Psychology
and Education. N.Y.: McGraw-Hill.
9. Wilcox, R. R. (1996). Statistics for Social Sciences. San Diego: Academic
Press.

Check your Understanding – Answers
1. True
2. Correlation
3. True
4. True
5. True
6. Zero correlation
7. Simple correlation
8. Multiple correlation
9. Partially
10. True


Unit - 4.2: Correlation Coefficient; Meaning, Properties, Measurement of
Coefficient of Correlation – Rank Correlation Co-efficient

Objectives

At the end of this unit, you will be able to:

●● Understand the introduction to the correlation coefficient.
●● Learn about the meaning of correlation coefficient.
●● Comprehend the properties of correlation coefficient.
●● Analyze measurement of correlation coefficient.
●● Evaluate estimation of the rank correlation coefficient.
●● Learn about the application of rank correlation coefficient.
●● Understand Kendall's measure of correlation.

Introduction

We covered the fundamentals of correlation in the last unit. The term "correlation"
refers to a relationship between two or more variables. This relationship can be viewed
in terms of magnitude and direction. As a result, a link between two variables can be
positive, negative, or non-existent. Furthermore, the correlation coefficient can be
anywhere from -1 to +1.

The correlation coefficient, abbreviated as r, is a summary measure of the
statistical relationship between two interval or ratio level variables. The correlation
coefficient is scaled to be between -1 and +1 at all times. When r is close to 0, there is
little relationship between the variables, and the farther r is from 0 in either a positive or
negative direction, the stronger the association between the two variables.

The symbols X and Y are frequently used to represent the two variables. The
values of X and Y are plotted in a scatter diagram, charting combinations of the two
variables, to show how the two variables are related. The scatter diagram is provided
first, followed by the procedure for calculating Pearson's r. The following examples
use sample sizes that are relatively small. Data from bigger samples will be presented
later.

This unit discusses the meaning of the correlation coefficient. It also discusses
the properties and measurement of the coefficient of correlation, and finally covers
the rank correlation coefficient.

4.2.1 Introduction to Correlation Coefficient


So far, we've spoken about the relationship's direction. Any reader would
undoubtedly wonder, "How strong is the relationship?" The degree of linearity of
the relationship can be used to assess the relationship's strength. To understand
the strength of the link, we need to learn more about the correlation. A number
is used to indicate the correlation between two variables, which is known as the
correlation coefficient. Depending on the kind of correlation, the correlation coefficient
is represented by a different symbol. Pearson's product-moment correlation
coefficient is denoted by the letter 'r' (small 'r'). The symbol rxy represents the
correlation between X and Y.

The correlation coefficient has a range of –1.00 to +1.00. It can be any number
between these two, such as –0.78, –0.54, –0.21, +0.02, +0.35, +0.98, and so on.
The association between the two variables is perfect if the absolute value of the
correlation coefficient is 1, that is, if the correlation coefficient is –1 or +1. The
strength of the association between the two variables grows as the correlation
coefficient approaches +1 or –1, and weakens as the correlation
coefficient moves away from +1 or –1 towards 0. As a result, a
correlation coefficient of +0.84 (or similarly –0.79, –0.84, and so on) indicates a
strong relationship between the two variables. A correlation value of +0.24 or
–0.24, on the other hand, indicates a weak association.

The direction of the association is also shown by the correlation coefficient. If
the correlation coefficient has a positive sign, the correlation is positive. Negative
correlation is shown by the correlation coefficient having a negative sign. Some readers
may feel that negative correlation is weaker than positive correlation. In fact, the sign
(positive or negative) reflects only the relationship's direction. The strength of a link is
not determined by the sign of the correlation coefficient. If the absolute value of
two correlation coefficients with different signs (for example, +0.58 and –0.58) is the
same, the strength of the link is the same. Understanding the shared variance between two
correlated variables is another way to gauge the degree of relationship. The
correlation coefficient is a number, not a percentage. Therefore,
a correlation of 0.30 does not imply that two variables share 30% of their variation. The
shared variance of two correlated variables is instead given by the square of the
correlation coefficient: a correlation of 0.30 corresponds to a shared variance of 0.09,
or 9%.
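
The shared variance mentioned above is commonly computed as the square of the correlation coefficient (the coefficient of determination). The short Python sketch below illustrates this under that assumption; the function name is hypothetical and the r values are those quoted in the text.

```python
def shared_variance(r):
    """Proportion of variance two variables share, taken as r squared."""
    return r * r


for r in (0.30, 0.58, -0.58):
    # The sign of r gives the direction; r squared depends only on its size.
    print(f"r = {r:+.2f} -> shared variance = {shared_variance(r):.1%}")
```

The two coefficients +0.58 and –0.58 give the same shared variance, which matches the point that the sign does not affect the strength of the relationship.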

The correlation coefficient is a metric for determining how strong a relationship
exists between two variables. There are various forms of correlation coefficients, but
Pearson's coefficient is the most common. The Pearson coefficient represents the
relationship between two variables (X and Y) measured on the same interval or ratio
scale. It expresses the strength of the relationship between two continuous variables.

Positive correlations show that both variables are moving in the same direction.
Negative correlations, on the other hand, show that as one variable rises, the other
falls; they are inversely related. A value of zero indicates that there is no correlation.

4.2.2 Meaning of Correlation Coefficient


In a correlation analysis, the correlation coefficient is a specific statistic that
assesses the strength of the linear link between two variables. In a correlation report,
the coefficient is represented by the letter r.

For each of the two variables, the formula measures the distance between each data
point and the variable's mean, and uses these distances to determine how well the
relationship between the variables can be fit to an imaginary line drawn through the
data. This is what we mean when we state that correlations look at linear relationships.

A correlation coefficient is a numerical measure of a statistical relationship between
two variables. The variables may be two columns of a given data set of observations,
referred to as a sample, or two components of a multivariate random variable with a
known distribution.

There are several sorts of correlation coefficients, each with its own definition and
set of features. They all use a scale of -1 to +1, with ±1 denoting the strongest possible
association and 0 denoting no association. Correlation coefficients
have a number of drawbacks as analytical tools, including the potential for outliers to
skew some types and the danger of mistakenly inferring a causal relationship between
variables.

Definition and Interpretation

The correlation coefficient indicates how closely two variables X and Y are related.
The correlation coefficient is calculated using Pearson's formula.

$$r = \frac{\frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y})}{\sigma_x \sigma_y} \qquad \ldots (1)$$

Here r is the correlation coefficient between X and Y, σx and σy are the standard
deviations of X and Y, respectively, and n is the number of pairs of values of the
variables X and Y in the given data.

Non-linear Association

The expression $\frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y})$ is known as the covariance
between X and Y.

Pearson's product-moment correlation coefficient is also known as r. It's worth
noting that r is a dimensionless number with a numerical value ranging from -1 to +1.
Positive r values show a positive (or direct) correlation between the two variables X
and Y, i.e., when X rises, Y rises, and vice versa. Negative r values indicate negative
(or inverse) correlation, which means that an increase in one variable is accompanied
by a drop in the value of the other variable. A correlation of 0 indicates that the two
variables have no linear relationship. The figure shows a number of scatter plots with
the associated values of the correlation coefficient r.

It may be more convenient to use the following form to compute the correlation
coefficient.

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

What do the values of the correlation coefficient mean?
●● The correlation coefficient r is a unitless number that ranges from -1 to 1. A p-value
is used to indicate statistical significance. As a result, correlations are usually
reported as two numbers: r and p.
●● The linear link becomes weaker as r approaches 0.
●● Positive r values indicate a positive correlation, in which both variables' values
tend to rise in lockstep.
●● Negative r values indicate a negative correlation, in which the values of one variable
tend to rise as the values of the other fall.
●● Positive and negative "perfect" correlations are represented by the numbers 1 and
-1, respectively. Two perfectly correlated variables change together at a constant
rate. When plotted on a scatterplot, all data points can be connected with a
straight line, which we call a linear relationship.
●● Based on what we see in the sample, the p-value helps us assess whether we
can reasonably conclude that the population correlation coefficient is different from
zero.
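
To illustrate what a p-value for a correlation expresses, the following Python sketch uses an exact permutation test: it asks how often a shuffled pairing of the same values produces a correlation at least as extreme as the one observed. This is only one of several ways to obtain a p-value and is not necessarily the method the text has in mind; the function names and data are hypothetical, and full enumeration is practical only for very small samples.

```python
import itertools
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def permutation_p_value(xs, ys):
    """Two-sided exact permutation p-value for the observed correlation."""
    observed = abs(pearson_r(xs, ys))
    extreme = 0
    total = 0
    for shuffled in itertools.permutations(ys):
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical, perfectly linear data for five pairs.
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 6, 8, 10]

p_value = permutation_p_value(x_values, y_values)
print(round(pearson_r(x_values, y_values), 3), round(p_value, 4))
```

With five perfectly linear pairs, only the original ordering and its exact reversal reach |r| = 1, so the p-value is 2/120, roughly 0.017. A small p-value suggests that a correlation this strong would rarely arise if the pairing were random.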

Types
The degree of correlation in data can be measured in a variety of ways, depending
on the type of data: whether it is measurement, ordinal, or categorical data.

●● Pearson
The Pearson product-moment correlation coefficient, often known as r, R, or
Pearson's r, is a measure of the strength and direction of a linear relationship between
two variables that is calculated by dividing their covariance by the product of their
standard deviations.

This is the most well-known and widely used correlation coefficient. When the
term "correlation coefficient" is used without further qualification, it usually refers to the
Pearson product-moment correlation coefficient.

●● Intra-class
When quantitative measurements are made on units that are arranged into groups,
the intraclass correlation (ICC) is a descriptive statistic that can be used; it describes
how closely units in the same group resemble each other.

●● Rank
Rank correlation is a measure of the link between the ranks of two variables, or two
rankings of the same variable:

The Spearman's rank correlation coefficient is a measure of how well a monotonic
function can describe the relationship between two variables.

The Kendall tau rank correlation coefficient is a measure of how closely two data
sets' ranks match.

When both variables are measured at the ordinal level, Goodman and Kruskal's
gamma is a measure of the strength of association of the cross tabulated data.
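
As an illustration of how Kendall's tau compares two sets of ranks, the Python sketch below counts concordant and discordant pairs. It implements the simple tau-a variant, which makes no correction for ties, and is offered as an assumption about the intended measure rather than the text's own procedure; the function name and data are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # the pair is ordered the same way on X and Y
        elif product < 0:
            discordant += 1   # the pair is ordered in opposite ways
        # product == 0 means a tie, which tau-a counts as neither
    n = len(xs)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical ranks given by two judges to five candidates.
judge_a = [1, 2, 3, 4, 5]
judge_b = [1, 3, 2, 5, 4]

tau = kendall_tau_a(judge_a, judge_b)
print(tau)
```

Here 8 of the 10 pairs are concordant and 2 are discordant, so tau = (8 − 2) / 10 = 0.6.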

4.2.3 Properties of Correlation Coefficient
The purpose of the correlation coefficient is to measure the correlation between two
variables. The following are some of the properties of the correlation coefficient:

1) The correlation coefficient is a unitless value.
2) The sign of the correlation coefficient will always be the same as that of the covariance.
3) The correlation coefficient's numerical value will range from -1 to 1. It is a
real number.
4) A negative value for the coefficient indicates a negative association.
As 'r' approaches -1, it indicates a strong and negative relationship.
As 'r' approaches +1, it indicates a strong and positive relationship. We
can conclude that if the correlation is +1, the relationship is perfectly positive.
5) When the coefficient of correlation approaches 0, it indicates a weak correlation. We
can conclude that the association is weak when 'r' is close to zero.
6) The correlation coefficient is only as trustworthy as the data behind it; for instance,
it can mislead if survey participants do not answer truthfully.
It is also symmetric: when the two variables are swapped, the coefficient of
correlation remains unchanged.
7) The coefficient of correlation is a pure number that is unaffected by units. When we
add the same number to all the values of one variable, it has no effect. All of the
values of one variable can also be multiplied by the same positive value, and the
correlation coefficient is unaffected. In other words, 'r' is invariant to changes of
origin and scale, so it is unaffected by the choice of units.
8) We use correlation to measure the link, but this does not imply that we are
discussing causality. If two variables are correlated, there is
a chance that a third variable is affecting both.

4.2.4 Measurement of Correlation Coefficient


Before the correlation can be computed, the covariance of the two variables in
question must be calculated. The standard deviation of each variable is then required.
The correlation coefficient is calculated by dividing the covariance by the product of
the standard deviations of the two variables.

The standard deviation is a measure of how far data deviates from its mean. The
covariance of two variables is a measure of how they change together. Its magnitude,
however, is unbounded and depends on the units of measurement, making it difficult to
interpret. Dividing the covariance by the product of the two standard deviations gives
the normalised form of the statistic. This number is the correlation coefficient.
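
The calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the data (hours studied and marks obtained) are hypothetical, the function names are our own, and population formulas (division by n) are assumed for both the covariance and the standard deviations.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def std_dev(values):
    """Population standard deviation (divides by n)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


# Hypothetical data, chosen only for illustration.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 70, 72]

r = correlation(hours, marks)
print("covariance =", covariance(hours, marks))
print("r =", round(r, 4))

# Changing the unit of one variable rescales the covariance but not r.
minutes = [60 * h for h in hours]
print("covariance in minutes =", covariance(minutes, marks))
print("r in minutes =", round(correlation(minutes, marks), 4))
```

The last two lines illustrate why the normalised form is easier to interpret: the covariance grows sixty-fold when hours are converted to minutes, while the correlation coefficient stays the same.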

4.2.5 Estimation of Rank Correlation Coefficient
Rank correlation coefficients, such as Spearman's rank correlation coefficient
and Kendall's rank correlation coefficient (τ), assess how much one variable tends to
rise when the other increases, without requiring that the increase be represented by
a linear relationship. The rank correlation coefficients will be negative if one variable
grows while the other drops. These rank correlation coefficients are commonly regarded
as alternatives to Pearson's coefficient, used either to reduce the amount of calculation
or to make the coefficient less sensitive to non-normality in distributions. Rank correlation
coefficients, however, measure a different form of relationship than the Pearson
product-moment correlation coefficient, and are better viewed as measures of a
different type of association than as an alternative measure of the population correlation
coefficient.

To illustrate the nature of rank correlation, and its difference from linear correlation,
consider the following four pairs of numbers (x,y):

(0, 1), (10, 100), (101, 500), (102, 2000).

“As we go from each pair to the next pair, x increases, and so does y. This
relationship is perfect, in the sense that an increase in x is always accompanied by an
increase in y. This means that we have a perfect rank correlation, and both Spearman's
and Kendall's correlation coefficients are 1, whereas in this example the Pearson
product-moment correlation coefficient is 0.7544, indicating that the points are far from lying
on a straight line. In the same way, if y always decreases when x increases, the rank
correlation coefficients will be −1, while the Pearson product-moment correlation
coefficient may or may not be close to −1, depending on how close the points are
to a straight line. Although in the extreme cases of perfect rank correlation the two
coefficients are both equal (being both +1 or both −1), this is not generally the case,
and so values of the two coefficients cannot meaningfully be compared. For example,
for the three pairs (1, 1) (2, 3) (3, 2) Spearman's coefficient is 1/2, while Kendall's
coefficient is 1/3.”
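
The figures quoted above can be checked with a short Python sketch. The helper functions are our own and assume there are no tied values; Spearman's coefficient is computed from the rank differences and Kendall's from the counts of concordant and discordant pairs.

```python
from math import sqrt


def pearson(xs, ys):
    """Product-moment correlation from sums of deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(xs, ys):
    """1 - 6 * sum(d^2) / (n (n^2 - 1)), valid when there are no ties."""
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def kendall(xs, ys):
    """(concordant - discordant) / number of pairs, assuming no ties."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# The four pairs from the text: perfectly monotonic but far from linear.
x4, y4 = [0, 10, 101, 102], [1, 100, 500, 2000]
print(spearman(x4, y4), kendall(x4, y4), round(pearson(x4, y4), 4))

# The three pairs from the text: the two rank coefficients differ.
x3, y3 = [1, 2, 3], [1, 3, 2]
print(spearman(x3, y3), round(kendall(x3, y3), 4))
```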

Data is frequently available in the form of rankings on various variables. In fields such
as food testing, competitive events (e.g. games, fashion shows, or beauty contests),
and attitudinal surveys, it is usual to use rankings on a preferential basis. In such cases,
the primary goal of determining a correlation coefficient is to assess the degree to
which the two sets of rankings agree. Spearman's rank correlation coefficient, rs, is the
coefficient calculated from these ranks.

This is given by the following formula:

$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

Here n is the number of pairs of observations and di is the difference in ranks
for the ith observation set. Suppose the ranks obtained by a set of ten students in a
Mathematics test (variable X) and a Physics test (variable Y) are as shown in the table
below, which also organizes the computations needed to determine the rank correlation rs.

Determination of Spearman’s Rank Correlation
Individual   Rank in Maths (X)   Rank in Physics (Y)   D = Y - X   D²
1            1                   3                     +2          4
2            2                   1                     -1          1
3            3                   4                     +1          1
4            4                   2                     -2          4
5            5                   6                     +1          1
6            6                   9                     +3          9
7            7                   8                     +1          1
8            8                   10                    +2          4
9            9                   5                     -4          16
10           10                  7                     -3          9
Total                                                              50

Using the formula given above we obtain,

$$r_s = 1 - \frac{6 \times 50}{10(10^2 - 1)} = 1 - \frac{300}{990} \approx 0.697$$

We can thus say that there is a fairly high degree of correlation between the
performance in Mathematics and Physics. We can also test the significance of the value
obtained. The null hypothesis, H0, is that the two variables are not
associated in the population (i.e. rs = 0) and that the observed value of rs differs from zero only by
chance. The t-statistic that is used to test this is

$$t = r_s \sqrt{\frac{n - 2}{1 - r_s^2}} = 0.697 \sqrt{\frac{8}{1 - 0.697^2}} \approx 2.75$$

Referring to the table of the t-distribution for n-2 = 8 degrees of freedom, the critical
value for t at a 5% level of significance is 2.306. Since the calculated value of t is higher
than the table value, we reject the null hypothesis and conclude that the performances in
Mathematics and Physics are associated.
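
The worked example can be reproduced with the following sketch. It uses the ranks from the table above and assumes the usual formulas rs = 1 − 6∑d²/(n(n² − 1)) and t = rs√((n − 2)/(1 − rs²)); the variable names are our own, and the critical value 2.306 is the one quoted in the text.

```python
from math import sqrt

# Ranks of the ten students, taken from the table in the text.
maths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
physics = [3, 1, 4, 2, 6, 9, 8, 10, 5, 7]

n = len(maths)

# Squared rank differences D^2 = (Y - X)^2 for each student.
d_squared = [(y - x) ** 2 for x, y in zip(maths, physics)]
sum_d2 = sum(d_squared)

# Spearman's rank correlation coefficient.
r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# t-statistic with n - 2 degrees of freedom.
t = r_s * sqrt((n - 2) / (1 - r_s ** 2))

critical_t = 2.306  # two-tailed 5% value for 8 degrees of freedom, as quoted in the text

print("sum of D^2 =", sum_d2)
print("r_s =", round(r_s, 3))
print("t =", round(t, 2))
print("reject H0:", t > critical_t)
```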

When two or more items have the same rank, a correction has to be applied to
$\sum d_i^2$. For example, if the ranks of X are 1, 2, 3, 3, 5, ... showing that there are two
items with the same 3rd rank, then instead of writing 3, we write 3.5 for each so that the
sum of these items is 7 and the mean of the ranks is unaffected. But in such cases the
standard deviation is affected, and therefore, a correction is required. For this, $\sum d_i^2$ is
increased by $(t^3 - t)/12$ for each tie, where t is the number of items in each tie.
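
A minimal sketch of the treatment of ties is given below. It assumes the common textbook adjustment in which tied items share the average of the ranks they occupy and (t³ − t)/12 is added to ∑d² for every group of t tied ranks; the scores are hypothetical and the function names are our own.

```python
from collections import Counter


def average_ranks(scores):
    """Rank 1 = smallest score; tied scores share the mean of the ranks they occupy."""
    ordered = sorted(scores)
    counts = Counter(ordered)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        first_position.setdefault(score, position)
    # A group of t tied scores occupies positions p, p+1, ..., p+t-1, whose mean is p + (t-1)/2.
    return [first_position[s] + (counts[s] - 1) / 2 for s in scores]


def tie_correction(rank_list):
    """Sum of (t^3 - t) / 12 over every group of t tied ranks."""
    return sum((t ** 3 - t) / 12 for t in Counter(rank_list).values() if t > 1)


def spearman_with_ties(xs, ys):
    """Spearman's coefficient with the tie correction added to the sum of d^2."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    sum_d2 += tie_correction(rx) + tie_correction(ry)
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical scores: two items tie for the 3rd place on X.
x_scores = [10, 20, 30, 30, 50]
y_scores = [1, 2, 3, 4, 5]

x_ranks = average_ranks(x_scores)
print(x_ranks)                      # the tied items each receive rank 3.5
print(tie_correction(x_ranks))      # (2^3 - 2) / 12 for the single tie
print(spearman_with_ties(x_scores, y_scores))
```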

Merits and Demerits of Rank Correlation Coefficient

Merits

1. Spearman's rank correlation coefficient may be interpreted in the same manner as
Karl Pearson's correlation coefficient.
2. It is simple to understand and calculate.
3. It is the only formula that can be used to study the relationship between qualitative
features.
4. A rank correlation coefficient is a non-parametric counterpart of Karl Pearson's
product-moment correlation coefficient.
5. It does not assume that the population from which the sample data are drawn is
normal.

Demerits

1. For bivariate frequency distributions, the product-moment correlation coefficient
may be determined, but the rank correlation coefficient cannot.
2. If n > 30, this formula is time-consuming to apply.

4.2.6 Application of Rank Correlation Coefficient

The fundamental goal of correlation is to find a connection between two random
variables. The presence of association does not necessarily imply causation,
although causation does imply association. Statistical evidence can only determine whether or not
there is a relationship between variables. Whether causation exists or not is purely
a matter of logic. For example, there is evidence that more wealth leads to higher
spending on higher-quality clothing.

However, one must be wary of spurious or nonsense correlations that may occur by
accident between completely unrelated variables.

When identifying useful independent variables for regression analysis, correlation
analysis is utilised as a starting point. For example, a construction company could
identify factors such as population, construction jobs, and building licences
issued the previous year that it believes will affect sales this year.

These and other factors could be evaluated for mutual association by
obtaining the correlation coefficient of each pair of variables from the historical data
(this type of analysis is simple to accomplish with the help of a computer programme).
Only factors with a strong relationship to annual sales would then be chosen for inclusion in a
regression model.

Correlation is also employed in factor analysis, which aims to resolve a large
number of measured variables into a small number of new categories called factors.
The findings could be useful in three ways: revealing the underlying or latent factors
that determine the relationship between observed data, revealing previously obscured
relationships between data, and providing a classification scheme when data scored
on various rating scales must be grouped together.

Forecasting with the use of time series models is another important application of
correlation. Before building an acceptable forecasting model, one must first determine
the trend, seasonality, and random pattern in the data (which is frequently a time
series of the variable of interest available at equal time intervals). The concept of
auto-correlation and auto-correlation charts for various time lags aid in determining the
nature of the underlying process.
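
As a rough illustration of auto-correlation at different time lags, the sketch below correlates a series with a lagged copy of itself. The quarterly sales figures are hypothetical, and the estimator shown (deviations from the overall mean, divided by the total sum of squares) is one common convention, not the only one.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag, using deviations from the overall mean."""
    n = len(series)
    m = sum(series) / n
    total = sum((v - m) ** 2 for v in series)
    lagged = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return lagged / total


# Hypothetical quarterly sales with a pattern that repeats every four quarters.
sales = [10, 14, 8, 12] * 4

for lag in range(1, 5):
    print("lag", lag, "autocorrelation", round(autocorrelation(sales, lag), 4))
```

In this made-up series the auto-correlation is highest at lag 4, which is how a seasonal pattern of period four would show up on an auto-correlation chart.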

4.2.7 Kendall's Measure of Correlation

Kendall's tau is one of the measures of correlation. The method was introduced by
Kendall (1938). Kendall's tau is derived from the comparison of two sets of
ranks, X and Y. It is used as an alternative to Spearman's rho (rs).

Kendall's tau is represented by the lowercase Greek letter tau (τ). The parameter
(population value) is denoted by τ, and the statistic obtained on the sample is used to
estimate it. Tau has a range of –1.00 to +1.00, similar to Spearman's rho. Though
there are some similarities between tau and rs in terms of attributes, tau's rationale
is completely different from rho's. Both the sign and the magnitude are used to interpret
the coefficient. A larger absolute value suggests a stronger relationship. A positive value
represents a positive relationship, whereas a negative value represents a negative
relationship.

Null and Alternative Hypothesis

Kendall's tau can be calculated purely as a descriptive statistic. In that case, no
hypothesis testing is performed. If the sample value is used to draw conclusions about the
population correlation τ, a null and an alternative hypothesis are needed.

According to the null hypothesis,

$$H_0: \tau = 0$$

It states that the Kendall's tau between X and Y is zero in the population
represented by the sample.

According to the alternative hypothesis,

$$H_A: \tau \neq 0$$

It states that the Kendall's tau between X and Y is not zero in the population
represented by the sample. A two-tailed test is required for this alternative hypothesis.

Depending on the theory, other alternative hypotheses could be written. They are one of two
things:

$$H_A: \tau < 0 \qquad \text{or} \qquad H_A: \tau > 0$$

“The first HA denotes that the population value of Kendall’s tau is smaller than
zero. The second HA denotes that the population value of Kendall’s tau is greater than
zero. Remember, only one of them has to be tested and not both. One-tailed test is
required for these hypotheses.”

Logic of Computation of τ
“The tau is based on concordance and discordance among two sets of ranks. For
example, table 4.4 shows ranks of four subjects on variables X and Y as RX and RY. In
order to obtain concordant and discordant pairs, we need to order one of the variables
according to the ranks, from lowest to highest (we have ordered X in this fashion).
Take a pair of ranks for two subjects A (1,1) and B (2,3) on X and Y. Now, if the sign or the
direction of the difference in RX between subjects A and B is the same as the sign or direction
of the difference in RY between subjects A and B, then the pair of ranks is said to be
concordant (i.e., in agreement). In the case of subjects A and B, the difference in RX is
(1 – 2 = – 1) and the difference in RY is (1 – 3 = – 2). The signs for the A and B pair are
in agreement, so the pair A and B is called a concordant pair. Look at a second example,
the B and C pair. The difference in RX is (2 – 3 = – 1) and the difference in RY is
(3 – 2 = +1). The signs for the B and C pair are not in agreement.
This pair is called a discordant pair.”

Subjects   Rank X (Rx)   Rank Y (Ry)
A          1             1
B          2             3
C          3             2
D          4             4

Four subjects and their ranks

How many such pairs do we need to evaluate? There are n (n – 1)/2 = (4 × 3)/2 = 6,
so six pairs: AB, AC, AD, BC, BD, and CD. Once we know the numbers of
concordant and discordant pairs, we can calculate τ by using the following equation.

$$\tau = \frac{n_C - n_D}{n(n - 1)/2}$$

Where,

τ = value of τ obtained on sample

nC = number of concordant pairs

nD = number of discordant pairs

n = number of subjects

The following illustrates a method to obtain the number of concordant (nC)
and discordant (nD) pairs for this small data set (see table below). We shall also learn a
computationally easier method later.

◌◌ Step 1. Ranks of X are placed in the second row in ascending order.
◌◌ Step 2. The ranks of Y are arranged accordingly in the third row.
◌◌ Step 3. Then the ranks of Y are entered in the diagonal (see table below).
◌◌ Step 4. Start with the first element in the diagonal, which is 1 (row 4).
◌◌ Step 5. Now move across the row. Compare the diagonal element (1) with the rank
of Y at the head of each column to its right. If the diagonal element is smaller, enter C
in the intersection. If it is larger, enter
D in the intersection. For example, 1 is smaller than 3 (column 3) so C is
entered. In the next row (row 5), 3 is in the diagonal, which is greater than 2
(column 4) of Y, so D is entered in the intersection.
◌◌ Step 6. Then ∑C and ∑D are computed for each row.
◌◌ Step 7. The nC is obtained from ∑∑C (i.e., 5) and nD is obtained from ∑∑D
(i.e., 1). These values are entered in the equation for τ given above to obtain
τ = (5 – 1)/6 = 0.67.

Subjects     A    B    C    D    ∑C   ∑D
Rank of X    1    2    3    4
Rank of Y    1    3    2    4
             1    C    C    C    3    0
                  3    D    C    1    1
                       2    C    1    0
                            4    0    0
                                 ∑∑C = 5   ∑∑D = 1

Table showing computation of concordant and discordant pairs
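
The pair-by-pair comparison in the table can also be sketched in Python. The ranks are those of the four subjects in the table (X: 1, 2, 3, 4 and Y: 1, 3, 2, 4); the function and variable names are our own, and the sketch assumes there are no tied ranks.

```python
from itertools import combinations

# (rank on X, rank on Y) for each subject, as in the table above.
ranks = {"A": (1, 1), "B": (2, 3), "C": (3, 2), "D": (4, 4)}


def classify(first, second):
    """Return 'C' if the X and Y differences have the same sign, 'D' otherwise (no ties assumed)."""
    dx = first[0] - second[0]
    dy = first[1] - second[1]
    return "C" if dx * dy > 0 else "D"


# Label every one of the n(n - 1)/2 pairs of subjects.
labels = {(a, b): classify(ranks[a], ranks[b]) for a, b in combinations(ranks, 2)}

n = len(ranks)
n_c = sum(1 for label in labels.values() if label == "C")
n_d = sum(1 for label in labels.values() if label == "D")
tau = (n_c - n_d) / (n * (n - 1) / 2)

for pair, label in labels.items():
    print(pair, label)
print("nC =", n_c, "nD =", n_d, "tau =", round(tau, 2))
```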

Computational Alternative for τ

The previous approach is time-consuming. There is an easier option. Let's say
we want to see if there's a link between how long it takes to complete a test and how
well students do on it. Unfortunately, we do not have precise time and performance data.
We know only the order in which students have returned their completed response booklets.
This gives us ordinal data on the amount of time it took to complete the test. We also
know the rank order of the students' performance on the test.

The data for ten subjects are listed below.

Table: Data of 10 subjects on X (rank for time taken to complete test) and Y (ranks
of performance)
Time taken to complete test and performance on test

First we arrange the ranks of the students in ascending order (in increasing order;
begin from 1 for lowest score) according to one variable, X in this case. Then we
arrange the ranks of Y as per the ranks of X. Lines are drawn to connect each
rank on X with the same rank on Y. Please note that lines are not drawn if the subject gets
the same rank on both the variables. Now we calculate the number of inversions. The number
of inversions is the number of intersections of the lines. We have five intersections of the
lines.

So the following equation can be used to compute τ:

$$\tau = 1 - \frac{2 n_s}{n(n - 1)/2}$$

Where τ = sample value of τ

ns = number of inversions.

n = number of subjects

$$\tau = 1 - \frac{2 \times 5}{10(10 - 1)/2} = 1 - \frac{10}{45} \approx 0.78$$

“The value of Kendall’s tau for this data is 0.78. The value is positive and near
1.00. So the relationship between X and Y is positive. This means as the rank on time
taken increases, the rank on performance also increases. Interpretation of tau is also clear. If
the tau is 0.78, then it can be interpreted as follows: if a pair of subjects is sampled at
random, then the probability that their order on the two variables (X and Y) is similar is
0.78 higher than the probability that it would be in reverse order. The calculation of tau
needs to be modified for tied ranks. Those modifications are not discussed here.”
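
The inversion method can be sketched as follows. The ranks below are hypothetical: they are not the data of the ten subjects, but were chosen so that they also contain five inversions. The sketch assumes the formula τ = 1 − 2ns/(n(n − 1)/2) and no tied ranks; the function names are our own.

```python
def count_inversions(sequence):
    """Number of pairs (i, j) with i < j but sequence[i] > sequence[j]."""
    n = len(sequence)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if sequence[i] > sequence[j]
    )


def tau_from_inversions(y_in_x_order):
    """Kendall's tau from the ranks of Y listed in ascending order of the ranks of X."""
    n = len(y_in_x_order)
    n_s = count_inversions(y_in_x_order)
    return 1 - 2 * n_s / (n * (n - 1) / 2)


# Hypothetical ranks of Y, already arranged by the ranks of X (1 to 10).
y_ranks = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

print("inversions =", count_inversions(y_ranks))
print("tau =", round(tau_from_inversions(y_ranks), 2))
```

Each inversion corresponds to one crossing of the connecting lines, so counting inversions gives the same result as counting intersections by eye.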

Significance Testing of τ
The statistical significance of Kendall's tau can be tested by
referring to the critical values provided in Appendix E. The z transformation is
another option. The following equation can be used to compute z.

$$z = \frac{\tau}{\sqrt{\dfrac{2(2n + 5)}{9n(n - 1)}}}$$

“You will realize that the denominator is the standard error of tau. Once the z is
calculated, you can refer to the normal distribution (z) table in any statistics book for
finding out the probability.

Kendall’s tau is said to be a better alternative to Spearman’s rho under the
conditions of tied ranks. The tau is also supposed to do better than Pearson’s r under
the conditions of extreme non-normality. This holds true only under the conditions of
very extreme cases. Otherwise, Pearson’s r is still a coefficient of choice.”
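
A small sketch of the z transformation is shown below, applied to the value τ = 0.78 with n = 10 from the earlier example. It assumes the usual large-sample standard error √(2(2n + 5)/(9n(n − 1))); the function names are our own, and with only ten subjects the normal approximation should be read as a rough guide.

```python
from math import erfc, sqrt


def tau_z(tau, n):
    """z statistic: tau divided by its approximate standard error under H0."""
    standard_error = sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
    return tau / standard_error


def two_tailed_p(z):
    """Two-tailed probability from the standard normal distribution."""
    return erfc(abs(z) / sqrt(2))


z = tau_z(0.78, 10)
p = two_tailed_p(z)

print("z =", round(z, 2))
print("two-tailed p =", round(p, 4))
print("significant at 5%:", p < 0.05)
```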

Check your Understanding
1. A __________ is a numerical measure of a statistical relationship between two
variables.
2. Two columns of a given data set of observations, referred to as a sample, or two
components of a multivariate random variable with a known distribution can be used
as variables. State true or false.
3. The Pearson product-moment correlation coefficient, often known as r, R, or
Pearson's r, is a measure of the __________ and direction of a linear relationship
between two variables that is calculated by dividing their covariance by the product
of their standard deviations.
4. When quantitative measurements are done on units that are arranged into groups,
the __________ is a descriptive statistic that can be utilized and it describes how
closely units in the same group resemble each other.
5. The ________'s rank correlation coefficient is a measure of how well a monotonic
function can reflect the relationship between two variables.
6. The Kendall tau rank correlation coefficient is a measure of how closely two data
sets' ranks match. State true or false.

Summary
●● The term "correlation" refers to a relationship between two or more variables. This
relationship can be viewed in terms of magnitude and direction. As a result, a relationship
between two variables can be positive, negative, or non-existent. Furthermore, the
correlation can take any value from -1 to +1.
●● The correlation coefficient is a measure of how strong a relationship
exists between two variables. There are various forms of correlation coefficients,
but Pearson's is the most common. Pearson's correlation (sometimes known as
Pearson's R) is the correlation coefficient commonly used in linear regression.
●● In a correlation analysis, the correlation coefficient is a specific statistic that
assesses the strength of the linear relationship between two variables. In a correlation
report, the coefficient is represented by the letter r.
●● The Pearson product-moment correlation coefficient, often known as r, R, or
Pearson's r, is a measure of the strength and direction of a linear relationship
between two variables that is calculated by dividing their covariance by the product
of their standard deviations.
●● When quantitative measurements are done on units that are arranged into groups,
the intraclass correlation (ICC) is a descriptive statistic that can be utilized; it
describes how closely units in the same group resemble each other.
●● Spearman's rank correlation coefficient is a measure of how well a monotonic
function can reflect the relationship between two variables.
●● The Kendall tau rank correlation coefficient is a measure of how closely two data
sets' ranks match.
●● The strength of a linear relationship between two variables is measured using correlation
coefficients.
●● A positive association is shown by a correlation coefficient larger than zero,
whereas a negative relationship is indicated by a value less than zero.
●● A value of zero implies that the two variables being compared have no linear
relationship.
●● Negative correlation, also known as inverse correlation, is an important concept in
building diversified portfolios that can better withstand volatility.
●● Since calculating the correlation coefficient by hand takes time, data is frequently entered
into a calculator, computer, or statistics application.

Activity
1. Prepare a presentation based on a case study related to the use of the correlation
coefficient.

Questions & Exercises
1. In order to find out the correlation coefficient between two variables x and y from 12
pairs of observations the following calculations were made.
2. On subsequent verification it was found that the pair (x, y) = (10, 14) was mistakenly
copied as (x, y) = (11, 4). Find the correct correlation coefficient.
3. The regression equations involving the variables are Y = 5.6 + 1.2X and X = 12.5 + 0.6Y.
Find the arithmetic means of X and Y and the correlation coefficient between them.
4. In a music contest two judges ranked eight candidates in order of their performance
as follows:

Find the rank correlation coefficient.
5. What is the correlation coefficient?
6. What is the range of the correlation coefficient?
7. Is the correlation coefficient a percentage?
8. State the properties of the correlation coefficient.
9. How is the correlation coefficient measured?
10. How do we estimate the rank correlation coefficient?
11. What are the applications of the rank correlation coefficient?
12. Briefly explain Kendall’s measure of correlation.

Glossary

●● Correlation: Degree of association between two variables.
●● Correlation Coefficient: A number lying between -1 (perfect negative correlation)
and +1 (perfect positive correlation) that quantifies the association between two
variables.
●● Scatter Diagram: An ungrouped plot of two variables, on the X and Y axes.
●● Time Lag: The length between two time periods, generally used in time series
where one may test, for instance, how values of periods 1, 2, 3, 4 correlate with
values of periods 4, 5, 6, 7 (time lag 3 periods).
●● Time-Series: Set of observations at equal time intervals which may form the basis
of future forecasting.

Further Readings
1. Box, G.E.P. and G.M. Jenkins, 1976. Time Series Analysis: Forecasting and
Control, Holden-Day: San Francisco.
2. Draper, N. and H. Smith, 1966. Applied Regression Analysis, John Wiley: New
York.
3. Edwards, B. 1980. The Readable Maths and Statistics Book, George Allen and
Unwin: London.
4. Makridakis, S. and S. Wheelwright, 1978. Interactive Forecasting: Univariate
and Multivariate Methods, Holden-Day: San Francisco.
5. Peters, W.S. and G.W. Summers, 1968. Statistical Analysis for Business
Decisions, Prentice Hall: Englewood Cliffs.
6. Srivastava, U.K., G.V. Shenoy and S.C. Sharma, 1987. Quantitative
Techniques for Managerial Decision Making, Wiley Eastern: New Delhi.
7. Stevenson, W.J. 1978. Business Statistics: Concepts and Applications, Harper
and Row: New York.
8. Aron, A., Aron, E. N., & Coups, E. J. (2007). Statistics for Psychology. Delhi:
Pearson Education.
9. Minium, E. W., King, B. M., & Bear, G. (2001). Statistical Reasoning in
Psychology and Education. Singapore: John Wiley.
10. Guilford, J. P., & Fruchter, B. (1978). Fundamental Statistics in Psychology
and Education. N.Y.: McGraw-Hill.
11. Wilcox, R. R. (1996). Statistics for the Social Sciences. San Diego: Academic
Press.

Check your Understanding – Answers
1. correlation coefficient
2. True
3. strength
4. intraclass correlation
5. Spearman
6. True

Unit - 4.3: Karl Pearson’s Correlation Coefficient in Bivariate Distribution; Estimation and Interpretations

Objectives
At the end of this unit, you will be able to:

●● Understand the meaning of Karl Pearson’s correlation coefficient.
●● Learn about Karl Pearson’s correlation coefficient in bivariate distribution.
●● Comprehend estimation of Karl Pearson’s correlation coefficient.
●● Analyze interpretation of Karl Pearson’s correlation coefficient.
●● Evaluate intra-class correlation.
●● Learn about correlation ratio.

Introduction

si
Pearson's focus as a statistician was on assessing correlations and fitting curves
to data, for which he created the new chi-square distribution. Pearson's articles
frequently applied statistical tools to scientific problems, rather than only dealing with
mathematical theory. r
ve
This unit discusses the meaning of Pearson’s correlation coefficient. It further
describes Pearson’s correlation coefficient in bivariate distribution. There is also a
brief explanation about the estimation of Karl Pearson’s correlation coefficient. Later
it analyzes the interpretation of Karl Pearson’s correlation coefficient. Furthermore, it
ni

highlights the role of intra-class correlation and correlation ratio in this concept.

4.3.1 Meaning of Karl Pearson’s Correlation Coefficient

The most generally used way of evaluating the degree of relationship between
two variables is Karl Pearson's coefficient of correlation (or simple correlation). This
coefficient is based on the following assumptions:

1. That the two variables have a linear relationship;
2. That the two variables are causally connected, which indicates that one variable is
independent and the other is dependent;
3. That a large number of independent causes are operating in both variables so as to
produce a normal distribution.

Karl Pearson's coefficient of correlation is computed as

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n \, \sigma_X \, \sigma_Y}$$

Alternatively, the following formula can be used only if the assumed mean for both the
variables is taken as 0,

$$r = \frac{\sum X_i Y_i - n \bar{X} \bar{Y}}{\sqrt{\sum X_i^2 - n \bar{X}^2} \, \sqrt{\sum Y_i^2 - n \bar{Y}^2}}$$

Where

Xi = ith value of X variable

X̅ = mean of X

Yi = ith value of Y variable

Y̅ = mean of Y

n = number of pairs of observations of X and Y

σX = Standard deviation of X

σY = Standard deviation of Y

In case we use assumed means (Ax and Ay for variables X and Y respectively) in
place of true means, then Karl Pearson’s formula reduces to:

$$r = \frac{\dfrac{\sum d_{x_i} d_{y_i}}{n} - \left(\dfrac{\sum d_{x_i}}{n}\right)\left(\dfrac{\sum d_{y_i}}{n}\right)}{\sqrt{\dfrac{\sum d_{x_i}^2}{n} - \left(\dfrac{\sum d_{x_i}}{n}\right)^2} \, \sqrt{\dfrac{\sum d_{y_i}^2}{n} - \left(\dfrac{\sum d_{y_i}}{n}\right)^2}}$$

where dxi = Xi − Ax and dyi = Yi − Ay.

4.3.2 Karl Pearson’s Correlation Coefficient in Bivariate Distribution

In the above section, we learned the computation of the correlation coefficient in
the case of ungrouped data. For a bivariate frequency distribution, the following formula
given by Karl Pearson is used

$$r = \frac{\dfrac{\sum f_{ij} d_{x_i} d_{y_j}}{n} - \left(\dfrac{\sum f_i d_{x_i}}{n}\right)\left(\dfrac{\sum f_j d_{y_j}}{n}\right)}{\sqrt{\dfrac{\sum f_i d_{x_i}^2}{n} - \left(\dfrac{\sum f_i d_{x_i}}{n}\right)^2} \, \sqrt{\dfrac{\sum f_j d_{y_j}^2}{n} - \left(\dfrac{\sum f_j d_{y_j}}{n}\right)^2}}$$

where fij is the frequency of a particular cell in the correlation table and all other
values are defined as above: fi and fj are the marginal frequencies, and dxi and dyj are
the deviations of the class mid-values of X and Y from their assumed means.

Example 1:

Example 2:

4.3.3 Estimation of Karl Pearson’s Correlation Coefficient

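The correlation coefficient r is known as Pearson’s correlation coefficient as it was
introduced by Karl Pearson.

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$

Which can be simplified as

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - \left(\sum x\right)^2\right]\left[n \sum y^2 - \left(\sum y\right)^2\right]}}$$

Testing the significance of r

The significance of r can be tested by Student’s t test. The test statistic is given by

$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$$

with n - 2 degrees of freedom.

Example:

Compute Pearson’s coefficient of correlation between advertisement cost and
sales as per the data given below:

Advertisement Cost in 1000’s   39 65 62 90 82 75 25 98 36 78
Sales in lakhs                 47 53 58 86 62 68 60 91 51 84

Solution
H0: The correlation coefficient r is not significant.

H1: The correlation coefficient r is significant.

Level of significance 5%

From the data n = 10, ∑x = 650, ∑y = 660, ∑xy = 45604, ∑x² = 47648, ∑y² = 45784

$$r = \frac{10(45604) - (650)(660)}{\sqrt{\left[10(47648) - 650^2\right]\left[10(45784) - 660^2\right]}} = \frac{27040}{\sqrt{53980 \times 22240}} \approx 0.78$$

The two variables are positively correlated.

Test Statistic

$$t = \frac{0.78 \sqrt{10 - 2}}{\sqrt{1 - 0.78^2}} \approx 3.53$$

The table value of t for 8 degrees of freedom at the 5% level of significance is 2.306.
Since the calculated value exceeds the table value, H0 is rejected.

∴ The correlation coefficient r is significant, i.e. there is a relation between
advertisement cost and sales.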
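
The example can be checked with a few lines of Python. This sketch uses the data given in the example and assumes the raw-sum form of Pearson's formula together with the statistic t = r√(n − 2)/√(1 − r²); the variable names are our own, and 2.306 is the two-tailed 5% table value for 8 degrees of freedom.

```python
from math import sqrt

# Data from the example: advertisement cost (in 1000's) and sales (in lakhs).
cost = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

n = len(cost)
sum_x, sum_y = sum(cost), sum(sales)
sum_xy = sum(x * y for x, y in zip(cost, sales))
sum_x2 = sum(x * x for x in cost)
sum_y2 = sum(y * y for y in sales)

# Pearson's r from the raw sums.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# Student's t statistic with n - 2 degrees of freedom.
t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
critical_t = 2.306

print("sums:", sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print("r =", round(r, 4))
print("t =", round(t, 2))
print("r is significant:", abs(t) > critical_t)
```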

4.3.4 Interpretation of Karl Pearson’s Correlation Coefficient

The product moment correlation coefficient is another name for Karl Pearson's
coefficient of correlation. The value of 'r' ranges between +1 and -1.

◌◌ Positive r values show positive correlation between the two variables (i.e.,
changes in both variables occur in the same direction).
◌◌ Negative r values indicate negative correlation (i.e., changes in the two variables
occur in opposite directions).
◌◌ A value of r of zero implies that the two variables have no linear correlation.
◌◌ When r = (+) 1, it implies perfect positive correlation;
◌◌ When r = (−) 1, it implies perfect negative correlation.
In both cases, changes in the independent variable (X) explain 100 percent of
the changes in the dependent variable (Y). We may also observe that if there
is a constant change in the dependent variable in the same direction for a unit
change in the independent variable, then the correlation is perfect positive.
However, if such a change occurs in the opposite direction, the correlation is said to
be perfect negative.
◌◌ A value of 'r' that is close to +1 or –1 suggests a high degree of correlation
between the two variables.
◌◌ A value of 'r' that is close to 0 suggests a low degree of correlation between
the two variables.

4.3.5 Intra-Class Correlation

Both variables in product moment correlation measure different characteristics; for
example, one measures price and the other measures demand. Similarly, one variable
could be advertising costs, while another could be sales revenue.

However, in many practical contexts, particularly in the medical, agricultural,
and biological fields, the relationship among the units of a group or family may be of
interest. For example, in an agricultural experiment, scientists would be interested
in the relationship between the yields of plots in the same block given the same
fertilizer.

Similarly, the association between the heights of brothers in a family may be of interest
in a study of brothers' heights. Both variables in such a relationship measure the same
characteristic, namely yield or height. This correlation refers to how closely members of
the same family resemble one another in terms of the characteristic in question. Intraclass
correlation is the term for this type of correlation.

The table giving the values of the N pairs of observations is called the intra-class
correlation table, and the product moment correlation coefficient calculated from these N
pairs of observations is called the intra-class correlation coefficient. Since the value of each
member of the ith family occurs (ki - 1) times as a value of the X variable as well as a
value of the Y variable, the means of the variables X and Y are the same.

Limits of Intra-class Correlation Coefficient

Intra-class correlation coefficient is

4.3.6 Correlation Ratio
The correlation ratio is a coefficient of non-linear association. The correlation ratio,
denoted by eta (η), coincides with the correlation coefficient (in absolute value) in the
case of linear relationships.
In the case of non-linear relationships, the correlation ratio has a higher value, and
the difference between the correlation ratio and the correlation coefficient indicates the
degree of non-linearity in the relationship. The correlation ratio can be calculated in
SPSS by going to the "analyze" menu and selecting "compare means." This is where
the researcher picks "means" and then selects "ANOVA table" and eta, which is the
correlation ratio, from the "options" menu. In the context of analysis of variance, the
correlation ratio is a helpful measure of the strength of association based on the sums
of squares; however, it can also be employed outside of analysis of variance. Eta
squared, the square of the correlation ratio, is calculated by dividing the
between-groups sum of squares by the total sum of squares. The correlation ratio is
therefore the square root of the between-groups sum of squares, computed for an
interval-level variable classified by the categories of another variable, divided by the
total sum of squares. The relative sizes of the numerator and denominator determine the
strength of the association. If the numerator is nearly the same size as the denominator,
the correlation ratio will be close to one.

Assumptions
●● The correlation ratio ranges between a perfect relationship, which may be curvilinear in nature, and a null relationship, which is statistical independence. Because a perfect relationship may be curvilinear, the value of the correlation ratio is unaffected by the order of the categories of the categorical variable.
●● The correlation ratio is asymmetric. In other words, unlike Pearson's correlation, the researcher obtains different coefficient values depending on which variable is treated as independent and which as dependent.
●● The correlation ratio, like other measures of correlation and association, cannot establish the direction of causation; it only measures the strength of the association. The correlation ratio therefore has no sign and ranges between zero and one.

If two variables are linearly related, the correlation coefficient measures the strength of the linear relationship between them. When the variables are not linearly related but have a curvilinear relationship, the correlation coefficient is not a good indicator of the strength of the relationship. In such cases we use the correlation ratio, which measures the degree of association between two variables as the concentration of points around the best-fit curve.

When the regression is linear, the correlation coefficient and the correlation ratio agree in magnitude, i.e. η = |r|, where η is the correlation ratio. So far we have dealt with situations in which there is a single value of Y for each value of X, that is, data in the form of pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

In practice, however, there may be several values of Y for each value of X. The following table gives the heights of 20 sons classified by the heights of their fathers:

Height of Fathers (in inches)    Height of Sons (in inches)
65                               66   66   67   68   65
68                               68   69   69   72   70
70                               70   72   73   74   73
72                               74   75   73   74   75

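If we denote the father's height by X and the son's height by Y, then in the above example more than one value of Y is available for each value of X. In general, X and Y may be arranged in the following form:

X        Y
x_1      y_11, y_12, ..., y_1n
x_2      y_21, y_22, ..., y_2n
...      ...
x_m      y_m1, y_m2, ..., y_mn

Let us suppose that for each value $x_i$ ($i = 1, 2, \ldots, m$) the variable Y takes $n$ values $y_{ij}$ ($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$). The mean of Y for the $i$-th array is then defined as

$$\bar{y}_i = \frac{1}{n} \sum_{j=1}^{n} y_{ij},$$

and the overall mean is $\bar{y} = \frac{1}{mn} \sum_{i} \sum_{j} y_{ij}$. The correlation ratio $\eta_{yx}$ of Y on X is then obtained from

$$\eta_{yx}^2 = 1 - \frac{\sum_{i=1}^{m}\sum_{j=1}^{n} (y_{ij} - \bar{y}_i)^2}{\sum_{i=1}^{m}\sum_{j=1}^{n} (y_{ij} - \bar{y})^2}.$$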
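As a rough illustration, the correlation ratio for the fathers-and-sons heights tabulated above can be computed as one minus the within-array sum of squares divided by the total sum of squares. The sketch below is not part of the original text: the function names `correlation_ratio_squared` and `pearson_r` are illustrative choices, and only the data come from the table.

```python
from math import sqrt

# Sons' heights (inches) grouped by father's height, copied from the table.
sons_by_father = {
    65: [66, 66, 67, 68, 65],
    68: [68, 69, 69, 72, 70],
    70: [70, 72, 73, 74, 73],
    72: [74, 75, 73, 74, 75],
}


def correlation_ratio_squared(groups):
    """Return eta^2 of Y on X: 1 - (within-array SS) / (total SS)."""
    all_y = [y for ys in groups.values() for y in ys]
    grand_mean = sum(all_y) / len(all_y)
    total_ss = sum((y - grand_mean) ** 2 for y in all_y)
    within_ss = 0.0
    for ys in groups.values():
        array_mean = sum(ys) / len(ys)
        within_ss += sum((y - array_mean) ** 2 for y in ys)
    return 1 - within_ss / total_ss


def pearson_r(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Expand the grouped table into 20 (father, son) pairs for Pearson's r.
fathers = [x for x, ys in sons_by_father.items() for _ in ys]
sons = [y for ys in sons_by_father.values() for y in ys]

eta_sq = correlation_ratio_squared(sons_by_father)
r = pearson_r(fathers, sons)
print(f"eta^2 = {eta_sq:.4f}, r^2 = {r ** 2:.4f}, difference = {eta_sq - r ** 2:.4f}")
```

On these data η² comes out at roughly 0.868 and r² at roughly 0.864. The small difference suggests that the regression of sons' heights on fathers' heights is close to linear.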

Now we present the above distribution with frequencies: for each $x_i$ ($i = 1, 2, \ldots, m$), Y takes the values $y_{ij}$ ($j = 1, 2, \ldots, n$), and the frequency $f_{ij}$ is attached to $y_{ij}$.

Note: You may have studied frequency distributions earlier. The frequency of a value is the number of times it is repeated. If the value 2 occurs 5 times in a series of data, we say that the frequency of 2 is 5. A frequency distribution is the arrangement of the values of a variable together with their frequencies.

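Example 1: Compute $\eta_{yx}^2$ for the following table:

Solution: It is known that

$$\eta_{yx}^2 = \frac{\sum_i \dfrac{T_i^2}{n_i} - \dfrac{T^2}{N}}{\sum_i \sum_j f_{ij}\, y_{ij}^2 - \dfrac{T^2}{N}},$$

where $n_i = \sum_j f_{ij}$, $T_i = \sum_j f_{ij}\, y_{ij}$, $N = \sum_i n_i$ and $T = \sum_i T_i$.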
Characteristics of Correlation Ratio

1. The correlation ratio's absolute value is always less than or equal to one.
2. The correlation ratio's absolute value cannot be less than the absolute value of the correlation coefficient.
3. ηyx is unaffected by changes in origin or scale.
4. The correlation coefficient of X on Y, rXY, and the correlation coefficient of Y on X, rYX, are equal, but the correlation ratio of Y on X, ηyx, and the correlation ratio of X on Y, ηxy, are in general different.
5. The difference (η²yx – r²) measures the extent to which the true regression of Y on X departs from linearity.
6. It is based solely on the values of the dependent variable. The independent variable is used only to classify the observations of the dependent variable.
7. When the regression is linear, so that the array means lie on a straight line, η²yx = r².
8. When the scatter diagram does not show any trend, both r and ηyx are zero.
9. If the dots of the scatter diagram lie precisely on a straight line, the correlation coefficient and the correlation ratio are both 1 in absolute value, i.e. ηyx = |r| = 1.
10. If ηyx > |r|, the scatter diagram shows a curved trend line.

Check your Understanding

1. Pearson's articles frequently applied statistical tools to scientific problems, rather than only dealing with mathematical theory. State true or false.
2. The most generally used way of evaluating the degree of relationship between two variables is Karl Pearson's coefficient of correlation (or simple correlation). State true or false.
3. The correlation ratio's absolute value can exceed one and may be as large as five. State true or false.
4. If the scatter diagram shows a ____________ and the dots lie precisely on it, then the correlation coefficient and the correlation ratio are both 1 in absolute value, i.e. ηyx = |r| = 1.
5. The correlation ratio is a coefficient of __________ association.

Summary
●● The most widely used measure of the degree of relationship between two variables is Karl Pearson's coefficient of correlation (or simple correlation). This coefficient is based on the following assumptions:
●● That the two variables have a linear relationship;
●● That the two variables are causally related, which means that one variable is independent and the other is dependent;
●● That a large number of independent causes operate on both variables so as to produce a normal distribution.
●● Karl Pearson's coefficient of correlation is computed as

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}.$$

●● The correlation coefficient r is known as Pearson's correlation coefficient because it was developed by Karl Pearson.

Activity
1. Prepare a presentation on ‘Karl Pearson’s correlation coefficient’.

Questions & Exercises

1. What is the meaning of Karl Pearson's correlation coefficient?
2. Briefly discuss Karl Pearson's correlation coefficient in a bivariate distribution.
3. How do we estimate Karl Pearson's correlation coefficient?
4. What is the interpretation of Karl Pearson's correlation coefficient?
5. What is intra-class correlation?
6. Discuss the correlation ratio.

Exercises (Estimation of Karl Pearson's Correlation Coefficient)

1. Calculate the simple correlation coefficient between wing length and tail length of the following 12 birds of a particular species. Also test its significance.

2. The data refer to the yield of grain in g/plant (y) and the number of productive tillers (x) of 15 paddy plants. Find the correlation.

3. The following data relate to the yield in grams (y) and the number of matured pods (x) of 10 groundnut plants. Work out the correlation coefficient and test its significance.

4. Find Pearson's coefficient of correlation between price and demand from the following data.

Glossary
●● Correlation ratio: a number other than the correlation coefficient that measures the degree of correlation between two mathematical variables.


Check your Understanding – Answers


1. True 2. True
3. False 4. straight line
5. non-linear
Unit - 4.4: Regression – Meaning, types, properties and assumptions of Regression

Objectives
At the end of this unit, you will be able to:
●● Understand the introduction to regression analysis.
●● Learn about the meaning of regression.
●● Comprehend the properties of regression analysis.
●● Analyze the assumptions of regression.

Introduction

Regression analysis is used when you wish to predict a continuous dependent variable from a set of independent variables. If the dependent variable is dichotomous, logistic regression should be used. (If the split between the two levels of the dependent variable is close to 50-50, logistic and linear regression will produce similar results.) The independent variables in regression can be either continuous or dichotomous. Independent variables with more than two levels can also be used, but they must first be converted into variables with just two levels. This is known as dummy coding, and it will be described later. Regression analysis is typically used with naturally occurring variables rather than experimentally manipulated variables, although it can be applied to experimentally manipulated variables as well. It is important to remember that causal relationships between variables cannot be determined by regression analysis. For this reason the terminology is that X "predicts" Y; we cannot say that X "causes" Y.
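Dummy coding can be sketched in a few lines. The example below is illustrative only: the variable `region` and its three levels are hypothetical, and the helper `dummy_code` is not from any particular statistics package. One level is chosen as the reference category, and each remaining level gets its own 0/1 variable.

```python
def dummy_code(values, reference):
    """Return one 0/1 column per level of `values`, excluding the reference level."""
    levels = sorted(set(values) - {reference})
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }


# Hypothetical independent variable with three levels.
region = ["north", "south", "east", "south", "north"]

# "north" is the reference level, so two dummy variables are produced.
dummies = dummy_code(region, reference="north")
for name, column in dummies.items():
    print(name, column)
```

A variable with k levels therefore enters the regression as k - 1 dichotomous variables; cases in the reference category score 0 on all of them.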

This unit discusses regression in brief: its meaning, types and properties, along with the assumptions of regression.

4.4.1 Introduction to Regression Analysis


Regression analysis is a set of statistical processes used in statistical modeling to estimate the relationships between a dependent variable (often called the 'outcome' or 'response' variable) and one or more independent variables (often called 'predictors', 'covariates', 'explanatory variables' or 'features'). The most common form of regression analysis is linear regression, in which one finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares finds the unique line (or hyperplane) that minimizes the sum of squared differences between the observed data and that line (or hyperplane). For specific mathematical reasons, this allows the researcher to estimate the conditional expectation (or population average value) of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters (e.g., quantile regression or Necessary Condition Analysis) or to estimate the conditional expectation across a broader collection of non-linear models (e.g., nonparametric regression).

Regression analysis is primarily used for two conceptually distinct purposes. First, it is widely used for prediction and forecasting, where its use overlaps substantially with the field of machine learning. Second, in some situations it can be used to infer causal relationships between the independent and dependent variables. Importantly, regressions by themselves only reveal relationships between a dependent variable and a collection of independent variables in a fixed dataset. To use regressions for prediction or to infer causal relationships, a researcher must carefully justify why existing relationships have predictive power for a new context or why a relationship between two variables has a causal interpretation. The latter is especially important when researchers hope to estimate causal relationships using observational data.

Regression analysis is one of the most frequently used statistical techniques in the social, behavioral, and physical sciences. Its main purpose is to explore the relationship between one or more independent variables (also called predictor or explanatory variables) and a dependent variable (also called the criterion variable). Linear regression explores relationships that can be readily described by straight lines or their generalization to many dimensions.

A wide range of problems can be solved by linear regression, and even more by transforming the original variables so that the relationships among the transformed variables are linear.

In multiple regression, the predicted values are linear combinations of the predictor variables. The general form of a multiple regression prediction equation is therefore:

$$Y = A + B_1 X_1 + B_2 X_2 + \cdots + B_p X_p + E$$

where

Y = Criterion variable

$X_i$ = Predictor variables ($i = 1, 2, \ldots, p$)

A = Intercept: the predicted value of Y when all the predictors are zero. The intercept, A, is so called because it is the point at which the regression line intercepts the Y-axis. It estimates the average value of Y when every $X_i = 0$.

$B_i$ = Constant (or regression coefficient): how much of a difference in Y results from a one-unit difference in $X_i$, with the other predictors held constant.

E = Residual, i.e. the difference between the observed value $Y$ and the predicted value $\hat{Y}$.

p = Number of predictors.
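The prediction equation can be read as a simple recipe: multiply each predictor by its coefficient, add the intercept, and subtract the result from the observed value to obtain the residual. The numbers below are made up purely for illustration, and the function name `predict` is an arbitrary choice.

```python
def predict(intercept, coefficients, predictors):
    """Predicted Y = A + B1*X1 + B2*X2 + ... + Bp*Xp."""
    return intercept + sum(b * x for b, x in zip(coefficients, predictors))


# Hypothetical model with p = 2 predictors.
A = 10.0            # intercept
B = [2.0, -0.5]     # regression coefficients B1, B2
X = [3.0, 4.0]      # predictor values X1, X2 for one case
y_observed = 15.0   # observed criterion value for that case

y_hat = predict(A, B, X)          # 10 + 2*3 - 0.5*4
residual = y_observed - y_hat     # E = observed minus predicted
print(y_hat, residual)
```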
4.4.2 Meaning of Regression

Regression analysis is a statistical technique for analysing and understanding the relationship between two or more variables. It helps to determine which factors are important, which can be ignored, and how they influence one another.

To use regression analysis effectively, we must understand the following terms:

◌◌ Dependent variable: the variable that we are trying to understand or predict.
◌◌ Independent variables: the variables that influence the dependent (target) variable and provide information about its relationship with them.

We usually have one dependent variable and one or more independent variables in regression. Using the independent variables, we try to "regress" the value of the dependent variable Y. In other words, we try to determine how the value of Y changes when the value of X changes.

4.4.3 Properties of Regression Analysis


The following are some of the properties of the regression coefficient:

●● It is commonly denoted by 'b'.
●● It is expressed in the original units of the data.
●● When there are two variables, x and y, there are two regression coefficients: one when x is independent and y is dependent, and the other when y is independent and x is dependent. The regression coefficient of y on x is denoted by byx, and the regression coefficient of x on y by bxy.
●● Both regression coefficients must have the same sign. If byx is positive, then bxy is positive as well, and vice versa.
●● If one regression coefficient is greater than unity, the other must be less than unity.
●● The correlation coefficient is the geometric mean of the two regression coefficients, i.e. $r = \pm\sqrt{b_{yx}\, b_{xy}}$, taking the sign common to both coefficients.
●● The arithmetic mean of the two regression coefficients is at least as large, in absolute value, as the correlation coefficient.
●● The regression coefficients are unaffected by a change of origin but are affected by a change of scale. Subtracting a constant from the values of x and y leaves the regression coefficients unchanged, whereas multiplying x and y by constants will, in general, change them.
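These properties can be checked numerically. The sketch below uses the six age and glucose observations from the Illustration in this unit; the helper names `slope` and `pearson_r`, the shift constants 40 and 80, and the scale factor 10 are arbitrary choices made for the demonstration.

```python
from math import sqrt


def slope(x, y):
    """Least-squares regression coefficient of y on x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return (n * sxy - sx * sy) / (n * sxx - sx ** 2)


def pearson_r(x, y):
    """Product moment correlation coefficient."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

b_yx = slope(age, glucose)      # regression coefficient of glucose on age
b_xy = slope(glucose, age)      # regression coefficient of age on glucose
r = pearson_r(age, glucose)
mean_b = (b_yx + b_xy) / 2

# Change of origin: subtract constants from both variables.
shifted = slope([a - 40 for a in age], [g - 80 for g in glucose])
# Change of scale: multiply x by 10 and leave y alone.
rescaled = slope([10 * a for a in age], glucose)

print(f"b_yx = {b_yx:.4f}, b_xy = {b_xy:.4f}, r = {r:.4f}")
print(f"same sign: {b_yx * b_xy > 0}, r^2 = b_yx*b_xy: {abs(r**2 - b_yx*b_xy) < 1e-12}")
print(f"arithmetic mean {mean_b:.4f} >= |r|: {mean_b >= abs(r)}")
print(f"shifted slope = {shifted:.4f}, rescaled slope = {rescaled:.4f}")
```

The shifted slope equals b_yx, while rescaling x by 10 divides the slope by 10, which is the sense in which the coefficients depend on scale but not on origin.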

Illustration

Subject    Age X    Glucose Level Y    X*Y      X^2      Y^2
1          43       99                 4257     1849     9801
2          21       65                 1365     441      4225
3          25       79                 1975     625      6241
4          42       75                 3150     1764     5625
5          57       87                 4959     3249     7569
6          59       81                 4779     3481     6561
∑          247      486                20485    11409    40022

The values of a and b can be found using the equations below, where n = 6 is the number of subjects.

$$a = \frac{(\sum Y)(\sum X^2) - (\sum X)(\sum XY)}{n \sum X^2 - (\sum X)^2}, \qquad b = \frac{n \sum XY - (\sum X)(\sum Y)}{n \sum X^2 - (\sum X)^2}$$

Putting the values into the formulas gives

a = 65.14

b = 0.385

so the regression equation y = a + bx becomes

y = 65.14 + 0.385x
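The arithmetic of the illustration can be reproduced directly from the column totals. This is a minimal sketch using the usual least-squares formulas for the intercept and slope of a simple linear regression; the variable names and the helper `predict_glucose` are illustrative.

```python
# Column totals from the illustration table (n = 6 subjects).
n = 6
sum_x = 247       # sum of ages
sum_y = 486       # sum of glucose levels
sum_xy = 20485    # sum of X*Y
sum_x2 = 11409    # sum of X^2

denominator = n * sum_x2 - sum_x ** 2

# Intercept and slope of the least-squares line y = a + b*x.
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
b = (n * sum_xy - sum_x * sum_y) / denominator


def predict_glucose(age):
    """Predicted glucose level for a given age under the fitted line."""
    return a + b * age


print(f"a = {a:.2f}, b = {b:.6f}")
print(f"predicted glucose at age 55: {predict_glucose(55):.2f}")
```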

4.4.4 Assumptions of Regression


Following are the assumptions of regression:

Number of cases
The ratio of cases to independent variables (IVs) in regression should ideally be 20:1, that is, 20 cases for every IV in the model. The lowest acceptable ratio is 5:1 (i.e., 5 cases for every IV in the model).

Accuracy of data
If you entered the data yourself rather than using a ready-made dataset, it is a good idea to check the accuracy of the data entry. If you do not want to re-check every data point, you should at least check the minimum and maximum values of each variable to confirm that all values are "valid." For example, a variable measured on a 1 to 5 scale should not have a value of 8.
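A range check of this kind is easy to automate. In the sketch below the list `ratings` is invented for the example and `out_of_range` is a hypothetical helper; it reports the position and value of every entry that falls outside the valid scale.

```python
def out_of_range(values, low, high):
    """Return (index, value) pairs for entries outside the closed range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


# Hypothetical responses on a 1-to-5 scale; the 8 is a data-entry error.
ratings = [3, 5, 1, 8, 2, 4]

print("minimum:", min(ratings), "maximum:", max(ratings))
invalid = out_of_range(ratings, low=1, high=5)
print("invalid entries:", invalid)
```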

Missing data
You should also look for missing data. If some variables have a large number of missing values, you may decide not to include those variables in your analysis. If only a few cases have missing values, you may want to delete those cases. If there are missing values for several cases on different variables, you probably do not want to delete those cases (because a lot of your data would be lost). If there is not much missing data and there does not seem to be any pattern in what is missing, you need not be concerned: simply run your regression, and any cases that lack values for the variables used in that regression will be excluded. Although it is tempting, do not assume that there is no pattern; check for it. To do so, split the dataset into two groups: cases with missing values on a given variable and cases without missing values on that variable. You can then use t-tests to see whether the two groups differ on the other variables in the dataset. For example, you may find that cases with missing values on the "salary" variable are younger than cases with salary values. Run such t-tests for each variable that has a lot of missing values. If there is a systematic difference between the two groups (i.e., missing vs. non-missing), you must keep this in mind when interpreting your results and avoid overgeneralizing.

After examining your data, you may decide to replace the missing values with some other value. The simplest replacement value is the mean of the variable. Several statistics packages can replace missing values with the mean within the regression procedure. Alternatively, you may substitute a group mean (for example, the mean for females) rather than the overall mean.
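Both replacement strategies can be sketched as follows. The salary figures and the grouping variable are invented for the example, missing values are represented by `None`, and the helper names are illustrative rather than taken from any statistics package.

```python
def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    overall_mean = sum(observed) / len(observed)
    return [overall_mean if v is None else v for v in values]


def impute_group_mean(values, groups):
    """Replace None with the mean of the observed values in the same group."""
    observed = {}
    for v, g in zip(values, groups):
        if v is not None:
            observed.setdefault(g, []).append(v)
    group_means = {g: sum(vs) / len(vs) for g, vs in observed.items()}
    return [group_means[g] if v is None else v for v, g in zip(values, groups)]


# Hypothetical salaries (in thousands) with two missing values.
salary = [30.0, None, 50.0, 40.0, None, 60.0]
sex = ["F", "F", "M", "F", "M", "M"]

print(impute_mean(salary))             # missing values become the overall mean
print(impute_group_mean(salary, sex))  # missing values become the mean for F or M
```

Note that mean replacement reduces the variability of the variable, so it is best reserved for cases where only a small share of values is missing.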

By default, statistics packages exclude cases with missing values on any variable included in the regression. (Such a case could still be included in another regression, as long as it has no missing values on the variables in that analysis.) You can change this option so that the regression does not exclude cases that are missing data for some variable in the regression, but then you may have a different number of cases for each variable.

Outliers
You should also check for outliers (extreme values on a particular variable). An outlier is often defined as a value that is at least three standard deviations above or below the mean. If you believe that the cases producing the outliers do not belong to the same "population" as the other cases, you may simply delete them. Alternatively, you may treat the extreme values as "missing" while retaining those cases' values on the other variables. Or you may keep the outlier but make it less extreme: for example, recode it to the highest (or lowest) non-outlier value.
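The three-standard-deviation rule and the recoding option can be sketched as below. The scores are invented, the sample standard deviation from Python's `statistics` module is used (an assumption; packages may differ), and the function names are illustrative.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=3.0):
    """Values lying at least `threshold` sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s >= threshold]


def recode_outliers(values, threshold=3.0):
    """Pull each outlier in to the nearest non-outlier extreme (highest or lowest)."""
    flagged = set(find_outliers(values, threshold))
    kept = [v for v in values if v not in flagged]
    low, high = min(kept), max(kept)
    return [min(max(v, low), high) if v in flagged else v for v in values]


# Hypothetical scores: nineteen values near 50 and one extreme value.
scores = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
          48, 52, 50, 49, 51, 50, 47, 53, 50, 120]

outliers = find_outliers(scores)
recoded = recode_outliers(scores)
print("outliers:", outliers)
print("largest value after recoding:", max(recoded))
```

With very small samples no value can lie three standard deviations from the mean, so the rule is only informative when there are enough cases.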

Normality
You should also check that your data are normally distributed. You can do this by constructing histograms and inspecting how the data are distributed. The histogram often includes a curve that shows what the shape would look like if the distribution were truly normal, so you can "eyeball" how much the actual distribution deviates from it. In this histogram, age is normally distributed:

[Figure: histogram of age with a normal curve overlaid]

A normal probability plot can also be constructed. In this plot the actual scores are ranked and sorted, and an expected normal value is computed for each case and compared with its actual normal value. The expected normal value is the position that a case with that rank holds in a normal distribution; the normal value is the position it holds in the actual distribution. In general, you want the actual values to line up along the diagonal that runs from lower left to upper right. This plot also shows that age is normally distributed:

[Figure: normal probability plot of age]

A plot of the "residuals" can also be used to check for normality within the regression analysis. Residuals are the differences between the obtained and the predicted DV scores. (Residuals will be discussed in greater depth in a subsequent section.) If the data are normally distributed, the residuals should be normally distributed around each predicted DV score. In that case the scatterplot of residuals against predicted scores shows most residuals concentrated at the centre of the plot for each predicted score, with some residuals trailing off symmetrically from the centre. You may wish to examine the residuals plot before plotting each variable separately, because if the residuals plot looks good, the separate plots are unnecessary. Below is a residuals plot of a regression in which the patient's age and time (in months since diagnosis) are used to predict the size of a breast tumour. The residuals above the zero line appear slightly more spread out than those below it, so the data are not perfectly normally distributed; nevertheless, they appear to be fairly normally distributed.

[Figure: residuals plotted against predicted tumour size]

In addition to examining the data graphically, you can check normality statistically. Statistical packages such as SPSS compute the skewness and kurtosis of each variable; an extreme value for either indicates that the data are not normally distributed. Skewness is a measure of how symmetrical the data are; a skewed variable is one whose mean is not in the middle of the distribution (i.e., the mean and median are quite different). Kurtosis describes how peaked the distribution is, either too peaked or too flat. Values above +3 or below -3 are considered "extreme" for skewness and kurtosis. If any of your variables are not normally distributed, you will probably want to transform them (which will be discussed in a later section). Checking for outliers may also help with the normality problem.
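Skewness and kurtosis can be computed from the central moments of a variable. The sketch below uses the simple moment-based definitions and reports kurtosis as excess over the normal value of 3; this is an assumption made for illustration, since packages such as SPSS apply small-sample corrections and will give slightly different numbers. The two data lists are invented.

```python
def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    m = sum(values) / len(values)
    return sum((v - m) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: third central moment over variance^1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]        # perfectly symmetric, rather flat
right_skewed = [1, 1, 1, 2, 10]    # long tail to the right

print("symmetric:   ", skewness(symmetric), excess_kurtosis(symmetric))
print("right-skewed:", skewness(right_skewed), excess_kurtosis(right_skewed))
```

A positive skewness indicates a longer right tail, and a negative excess kurtosis indicates a distribution flatter than the normal curve.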

Linearity
Regression analysis also assumes linearity, which means that there is a straight-line relationship between the IVs and the DV. This assumption is important because regression analysis only tests for a linear relationship between the IVs and the DV; any nonlinear relationship between an IV and the DV is ignored. Linearity between an IV and the DV can be checked with a bivariate scatterplot (i.e., a graph with the IV on one axis and the DV on the other). If the two variables are linearly related, the scatterplot will be oval.

[Figure: bivariate scatterplot of happiness against number of friends]

The bivariate scatterplot above shows that friendship is linearly related to happiness: the more friends you have, the happier you are. However, one could also imagine a curvilinear relationship between friends and happiness, in which happiness increases with the number of friends only up to a point. Beyond that point, happiness decreases as the number of friends increases further. The graph below illustrates this:

[Figure: scatterplot showing a curvilinear relationship between happiness and number of friends]

You can also check for linearity using the residuals plots described earlier. This is because if the IVs and the DV are linearly related, the relationship between the residuals and the predicted DV scores will also be linear. Nonlinearity is indicated when most of the residuals lie above the zero line at some predicted values and below the zero line at other predicted values. In other words, the overall shape of the plot will be curved rather than rectangular. The residuals plot below was produced when happiness was predicted from number of friends and age. As you can see, the data are not linear:

[Figure: residuals plot with a curved pattern]

Here is another residuals plot, again predicting happiness from number of friends and age. In this example, however, the data are linear:

[Figure: residuals plot with a rectangular pattern]

If your data are not linear, you can often make them so by transforming the IVs or the DV so that there is a linear relationship between them. Sometimes transforming a variable does not help, because the IV and the DV are simply not linearly related. If there is a curvilinear relationship between the DV and an IV, you may want to dichotomize the IV, because a dichotomous variable can only have a linear relationship with another variable (if it has any relationship at all). Alternatively, if there is a curvilinear relationship between the IV and the DV, you may need to include the square of the IV in the regression (this is also known as a quadratic regression).

A failure of linearity does not invalidate your analysis so much as weaken it, because the linear regression coefficient cannot fully capture the extent of a curvilinear relationship. If there is both a curvilinear and a linear relationship between the IV and the DV, the regression will at least capture the linear part.
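Including the square of the IV can be sketched with ordinary least squares on the columns 1, x and x². The example below is a self-contained illustration under stated assumptions: the happiness scores are generated from a made-up curve that rises and then falls, and the helpers `solve` and `fit_quadratic` are written out by hand rather than taken from a library.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_quadratic(xs, ys):
    """Least-squares coefficients (b0, b1, b2) of y = b0 + b1*x + b2*x^2."""
    design = [[1.0, x, x * x] for x in xs]
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(design, ys)) for i in range(3)]
    return solve(xtx, xty)


# Hypothetical data: happiness rises with friends up to a point, then declines.
friends = [0, 1, 2, 3, 4, 5, 6]
happiness = [2 + 3 * f - 0.5 * f * f for f in friends]

b0, b1, b2 = fit_quadratic(friends, happiness)
print(f"happiness = {b0:.3f} + {b1:.3f}*friends + {b2:.3f}*friends^2")
```

A negative coefficient on the squared term corresponds to the inverted-U pattern described for friends and happiness.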

Homoscedasticity
The assumption of homoscedasticity is that the residuals have approximately equal variance at all predicted DV scores. Another way of thinking about this is that the variability in scores for your IVs is the same at all values of the DV. Homoscedasticity can be checked with the same residuals plot discussed in the linearity and normality sections. The data are homoscedastic if the residuals plot is the same width for all values of the predicted DV. Heteroscedasticity usually shows up as a cluster of points that becomes wider as the values of the predicted DV increase. Alternatively, you can check for homoscedasticity with a scatterplot between each IV and the DV. As with the residuals plot, you want the cluster of points to be approximately the same width all over. The residuals plot below shows data that are fairly homoscedastic. In fact, because the plot is rectangular with a concentration of points along the centre, it shows data that meet the assumptions of homoscedasticity, linearity, and normality:

[Figure: rectangular residuals plot with points concentrated along the centre]

Heteroscedasticity may occur when some variables are skewed and others are not. Checking that your data are normally distributed should therefore reduce the risk of heteroscedasticity. Like a violation of linearity, a violation of homoscedasticity does not invalidate your regression so much as weaken it.

Singularity and Multicollinearity
Multicollinearity occurs when the IVs are highly correlated (.90 or greater); singularity occurs when the IVs are perfectly correlated and one IV is a combination of one or more of the other IVs. Multicollinearity and singularity can be caused by high bivariate correlations (usually .90 or greater) or by high multivariate correlations. High bivariate correlations are easy to spot by simply running correlations among your IVs. If you have high bivariate correlations, the problem is easily solved by deleting one of the two variables, but you should check your programming first, because this is often the result of a mistake made when the variables were created. High multivariate correlations are more difficult to spot. To find them, you need to calculate the SMC for each IV. SMC is the squared multiple correlation (R²) of the IV when it serves as the DV predicted by the rest of the IVs. Tolerance, a related concept, is calculated as 1 - SMC. Tolerance is the proportion of a variable's variance that is not accounted for by the other IVs in the equation. You do not usually need to compute these values yourself, because most programs will not allow a variable to enter the regression model if its tolerance is too low.

Statistically, you do not want singularity or multicollinearity because the regression coefficients are calculated through matrix inversion. If singularity exists, the inversion is impossible, and if multicollinearity exists, the inversion is unstable. Logically, you do not want multicollinearity or singularity because they mean that your IVs are redundant with one another. In that situation one IV adds no predictive value over another, yet you lose a degree of freedom. Multicollinearity and singularity can therefore weaken your analysis. In general, you probably would not want to include two IVs that correlate with one another at .70 or greater.
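With only two IVs, the squared multiple correlation of one IV predicted from the other is simply the square of their bivariate correlation, so tolerance can be sketched very compactly. The data below are invented (the second IV is almost exactly twice the first), and the helper `pearson_r` is illustrative; with three or more IVs, the SMC would require a full multiple regression of each IV on the others.

```python
from math import sqrt


def pearson_r(x, y):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical IVs: iv2 is nearly a multiple of iv1, so they are redundant.
iv1 = [1, 2, 3, 4, 5, 6]
iv2 = [2, 4, 6, 8, 10, 13]

r = pearson_r(iv1, iv2)
smc = r ** 2            # squared multiple correlation in the two-IV case
tolerance = 1 - smc     # share of variance not explained by the other IV

print(f"r = {r:.4f}, SMC = {smc:.4f}, tolerance = {tolerance:.4f}")
if abs(r) >= 0.90:
    print("The IVs are highly correlated: consider dropping one of them.")
```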

Check your Understanding

1. ___________ is typically employed with naturally occurring variables rather than experimentally manipulated variables, though it can be applied with experimentally manipulated variables as well.
2. _______ regression predicted values are considered to be linear combinations of the predictor variables.
3. Strong bivariate correlations (typically of .90 or above) or high multivariate correlations can create _________ and _________.
4. The assumption of homoscedasticity states that the residuals have approximately equal variance for all predicted DV scores. State true or false.
5. A straight-line relationship between the IVs and the DV is known as linearity. State true or false.

Summary
●● Logistic regression should be used if the dependent variable is dichotomous. (Logistic
and linear regression will produce similar findings if the split between the
two levels of the dependent variable is close to 50-50.)
●● In regression, the independent variables can be either continuous or dichotomous.
Independent variables with more than two levels can also be used, but they must
first be converted into variables with just two levels. This is known as dummy coding.
●● Regression analysis is typically employed with naturally occurring variables rather
than experimentally manipulated variables, though regression can be applied to
experimentally manipulated variables as well.
●● Regression analysis is a set of statistical processes used in statistical modeling
to estimate the relationships between a dependent variable (often referred to as
the 'outcome' or 'response' variable) and one or more independent variables (often
referred to as 'predictors', 'covariates', 'explanatory variables' or 'features').
●● Multiple regression predicted values are considered to be linear combinations of
the predictor variables.

Activity
1. Formulate a study based on regression, its properties and assumptions.

Questions & Exercises
1. What is regression analysis?
2. Define regression.
3. State the properties of regression analysis.
4. What are the assumptions of regression?

Glossary
●● Regression analysis: A set of statistical methods for estimating the
relationships between a dependent variable and one or more independent
variables. It can be used to assess the strength of the relationship between
variables and to model the future relationship between them.

Further Readings
1. David A. Freedman (27 April 2009). Statistical Models: Theory and Practice.
Cambridge University Press. ISBN 978-1-139-47731-4.
2. Fotheringham, A. Stewart; Brunsdon, Chris; Charlton, Martin (2002).
Geographically weighted regression: the analysis of spatially varying
relationships (Reprint ed.). Chichester, England: John Wiley.
3. Steel, R.G.D., and Torrie, J.H., Principles and Procedures of Statistics with
Special Reference to the Biological Sciences, McGraw Hill, 1960, page 288.
4. Rouaud, Mathieu (2013). Probability, Statistics and Estimation (PDF).
5. Pearson, Karl; Yule, G.U.; Blanchard, Norman; Lee, Alice (1903). "The Law of
Ancestral Heredity". Biometrika. 2 (2): 211–236.
6. Fisher, R.A. (1922). "The goodness of fit of regression formulae, and the
distribution of regression coefficients". Journal of the Royal Statistical Society.
7. Aron, A., Aron, E. N., Coups, E.J. (2007). Statistics for Psychology. Delhi:
Pearson Education.
8. Minium, E. W., King, B. M., & Bear, G. (2001). Statistical Reasoning in
Psychology and Education. Singapore: John Wiley.
9. Guilford, J. P., & Fruchter, B. (1978). Fundamental Statistics for Psychology
and Education. N.Y.: McGraw-Hill.
10. Wilcox, R. R. (1996). Statistics for Social Sciences. San Diego: Academic
Press.

Check your Understanding – Answers
1. Regression analysis
2. Multiple
3. Multicollinearity, singularity
4. True
5. True

Unit - 4.5: Two Variable Linear Regression; Regression Lines and Regression Co-Efficient

Objectives:
At the end of this unit, you will be able to:
●● Understand linear regression.
●● Learn about two-variable linear regression.
●● Comprehend regression lines.
●● Analyze regression coefficients.
●● Evaluate multiple regression for trivariate data.

Introduction
Regression analysis is primarily employed for two conceptually distinct purposes.
First, it is extensively used for prediction and forecasting, where it shares a lot of
ground with machine learning. Second, it can in some situations be used to infer causal
links between independent and dependent variables. Importantly, regressions by themselves
reveal only relationships between a dependent variable and a set of independent variables
in a given dataset. Before using regressions for prediction or for inferring causal
relationships, a researcher must carefully justify why existing relationships have
predictive power in a new context, or why a relationship between two variables has a
causal interpretation. The latter is especially important when observational data are
used to estimate causal links.

In the social, behavioral, and physical sciences, regression analysis is one of the
most often used statistical techniques. Its major goal is to investigate the relationship
between one or more independent variables (also known as predictor or explanatory
variables) and a dependent variable (also known as the criterion variable). Linear
regression examines relationships that can be expressed by straight lines or by their
generalization to several dimensions.

This unit discusses linear regression, two-variable linear regression, regression
lines, regression coefficients and multiple regression for trivariate data.

4.5.1 Linear Regression


When you wish to predict the values of one variable based on the values of
another variable, you use simple linear regression. You might, for example, want to
estimate a person's height (in inches) based on his weight (in pounds). Consider
a group of ten persons, each of whom has a known height and weight. The values
can be plotted on a graph with the weight on the x axis and the height on the y axis.
All ten points on the graph would lie on a straight line if there were a perfect linear
relationship between height and weight. This, however, is almost never the case (unless
your data are rigged). If there is an imperfect linear relationship between height and
weight (probably a positive one), the graph will show a cluster of upward-sloping points.
To put it another way, persons who weigh more tend to be taller than those who weigh less.

[Figure: scatter plot of height (y axis) against weight (x axis)]

Simple linear regression involves only two variables, one dependent variable (Y)
and one independent variable (X): Y = A + BX + E

Examining the graph of the observed data (Y against X) is the first step in establishing
whether there is a relationship between two variables. This graph is called a scatter plot.
The scatter plot can be created using statistical software such as SPSS, SYSTAT, or
STATISTICA. If the variables X and Y are related, the points on the scatter plot
will be more or less concentrated around a curve, which is known as the regression
curve. When the curve is a straight line, it is referred to as the line of regression, and
the regression is described as linear. Besides checking linearity, the scatter plot
can be used to see whether the data contain any outliers and whether there are two or more
clusters of points.

The bivariate regression model for the population is:

$y_i = a + b x_i + \varepsilon_i$

where the subscript i refers to the i-th observation, a is the intercept and b is the
regression coefficient. The regression equation is thus a mathematical model describing
the relationship between X and Y. In most cases, the model does not define the exact
relationship between the two variables. Rather, we use it as an approximation to the
exact relationship.
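As a rough illustration of how the intercept a and the regression coefficient b of the bivariate model might be estimated from a sample, the sketch below uses the standard least-squares formulas b = Sxy / Sxx and a = mean(y) − b · mean(x). The text above does not state these formulas, so they are an assumption of this example. The function name `fit_simple_linear_regression` and the ten weight and height values are hypothetical and chosen only to mirror the height and weight example.

```python
def fit_simple_linear_regression(x, y):
    """Return (a, b) of the least-squares line y = a + b * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x and sum of cross-products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x must not be constant")
    b = sxy / sxx              # slope (regression coefficient)
    a = mean_y - b * mean_x    # intercept
    return a, b


# Hypothetical weights (pounds) and heights (inches) of ten persons.
weights = [120, 135, 150, 155, 165, 170, 180, 190, 200, 215]
heights = [61, 63, 66, 65, 68, 67, 70, 71, 72, 74]

a, b = fit_simple_linear_regression(weights, heights)
# Residuals: observed height minus the height predicted by the fitted line.
residuals = [h - (a + b * w) for w, h in zip(weights, heights)]
print(f"height = {a:.2f} + {b:.4f} * weight")
```

Plotting `residuals` against `weights` is one simple way to look for the outliers and clusters mentioned above.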

Objectives of Regression Analysis


What exactly are the goals of regression analysis? They range from simple data
description to more complex hypothesis testing or prediction.

It is important to note that the regression equation predicts Y from X: the predicted
value of Y is determined by the value of X.

●● Description
The goal is to find an equation that describes or summarizes the relationship
between two variables. This goal relies on the fewest assumptions.

●● Coefficient Estimation
The goal is to confirm or refute a theoretical or suspected relationship between
two variables X and Y. The magnitudes and signs of the regression coefficients a and b
are usually of particular interest. This goal frequently overlaps with the other goals.

●● Prediction
Predicting the dependent variable from the value of an independent variable
is the main goal here. For example, if we know how many publications an institution
has produced over time, the goal could be to estimate how many publications will
be produced in a given future year. However, the prediction rests on a number of key
assumptions. As a result, rather than computing only a point estimate, which is a single
value, we should compute an interval estimate, which is a range of values within which
the predicted value would lie with a certain probability. This range is called a
"confidence interval". It is discussed later in this module.
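As a sketch of what such an interval estimate looks like in simple linear regression, one standard textbook form of the interval for a new observation at the value x_0 is given below. It assumes normally distributed errors. The symbols x_0, s, S_xx and the t quantile are introduced here for illustration and are not defined elsewhere in this unit.

```latex
\[
\hat{y}_0 \;\pm\; t_{\alpha/2,\,n-2}\; s\,
\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},
\qquad
\hat{y}_0 = a + b\,x_0,
\]
\[
s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2},
\qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 .
\]
```

Dropping the leading 1 under the square root gives the narrower interval for the mean of Y at x_0 rather than for a single new observation.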

Assumptions Underlying Regression Analysis
The following assumptions underpin the regression model:

1. The relationship between Y and X is linear. Linear regression models represent a
straight-line relationship; curvilinear relationships are not taken into account.
2. The expected value of the error term is zero.
3. The variance of the error term is constant for all values of the independent variable X.
This is the homoscedasticity assumption. We can assume constant variance if the
residual plot has a roughly rectangular shape. If a residual plot shows a widening or
narrowing wedge or a bowtie shape, non-constant variance (heteroscedasticity)
exists and must be addressed.
4. The residuals are assumed to be uncorrelated, i.e. there is no autocorrelation between
them. This means that the Y's are uncorrelated as well.
Because the observations y1, y2, ..., yn constitute a random sample, they are mutually
independent, and the error terms are mutually independent as well.
There are two ways in which this assumption can be violated: model misspecification
or time-series data.
◌◌ Misspecification of the model. The residuals may not be independent if an
important independent variable is omitted or an inappropriate functional
form is used. The remedy is to find the right functional form or to apply
multiple regression with the right independent variables.
◌◌ Data in a time series. When regression analysis is performed on data collected
over time, the residuals may be correlated. Correlation between residuals over
time is called serial correlation. Positive serial correlation means that the
residual in time period j tends to have the same sign as the residual in time
period (j − k), where k is the time lag. Negative serial correlation means that
the residual in time period j tends to have the opposite sign to the residual in
time period (j − k).
5. The independent variable and the error term are uncorrelated.
6. The residuals are assumed to follow the normal distribution when hypothesis tests
and confidence limits are used.

4.5.2 Two-Variable Linear Regression
Multiple linear regression is a statistical technique for predicting the outcome of a
variable based on the values of two or more other variables. It is an extension of simple
linear regression and is also referred to simply as multiple regression. The variable we
aim to predict is the dependent variable, whereas the variables used to predict it are
the independent or explanatory variables.

◌◌ Multiple linear regression predicts the outcome of a dependent variable using
two or more independent variables.
◌◌ Analysts can use this technique to determine how much variation the model explains
and the relative contribution of each independent variable to the total variance.
◌◌ Multiple regression can be either linear or non-linear.

Multiple Linear Regression Formula

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon$

Where:

yi is the dependent or predicted variable

β0 is the y-intercept, i.e., the value of y when both xi1 and xi2 are 0

β1 and β2 are the regression coefficients representing the change in y relative to a
one-unit change in xi1 and xi2, respectively

βp is the slope coefficient for the p-th independent variable

ϵ is the model's random error (residual) term

Using simple linear regression, statisticians can predict the value of one variable
from information about another variable. Linear regression seeks to establish a
straight-line relationship between the two variables.

Multiple regression is a form of regression in which the dependent variable is related
to two or more independent variables. When the relationship between the dependent and
independent variables does not follow a straight line, the regression is non-linear.

Both linear and non-linear regression track a response using two or more variables.
Non-linear regression, however, is usually more difficult to implement because it
relies on assumptions derived from trial and error.

Assumptions of Multiple Linear Regression
The following assumptions underpin multiple linear regression:

1. The dependent and independent variables have a linear relationship.
The basic assumption of multiple linear regression is that the dependent variable
and each of the independent variables have a linear relationship. Creating scatterplots
and visually inspecting them is the best way to check for linear relationships.
If the scatterplot shows a non-linear relationship, the analyst will need
to run a non-linear regression or transform the data using statistical software such as SPSS.
2. There is no strong correlation between the independent variables.
Multicollinearity, which arises when the independent (explanatory) variables
are highly correlated with one another, should not be present in the data. When
independent variables exhibit multicollinearity, it becomes difficult to determine
which variable contributes to the variance in the dependent variable. The Variance
Inflation Factor method is the usual way to test this assumption.
3. The residuals' variance is constant.
Multiple linear regression assumes that the amount of error in the residuals is similar
at each point of the linear model. This is known as homoscedasticity. To check it,
the analyst should plot the standardised residuals against the predicted values
and see whether the points are distributed evenly across all values of the independent
variables. The scatterplot can be produced with statistical software.
4. Independence of observations
The model assumes that the observations are independent of one another. Put simply,
the residual values are assumed to be independent. The Durbin-Watson statistic is
used to test this assumption. The statistic ranges from 0 to 4: values between 0 and 2
indicate positive autocorrelation, values between 2 and 4 indicate negative
autocorrelation, and a value of 2, the midpoint, indicates no autocorrelation.
5. Multivariate normality
Multivariate normality holds when the residuals are normally distributed. Examine the
distribution of the residual values to check this assumption. A histogram with a
superimposed normal curve or a normal probability plot can also be used to test it.
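The Durbin-Watson statistic mentioned under assumption 4 can be computed directly from the residuals. The sketch below uses its usual definition, the sum of squared differences between successive residuals divided by the sum of squared residuals. The text names the statistic but does not give this formula, so the formula is supplied here as an assumption. The function name `durbin_watson` and the two residual series are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    if len(residuals) < 2:
        raise ValueError("at least two residuals are required")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero")
    return numerator / denominator


# Residuals that flip sign every period (negative autocorrelation).
alternating = [1.0, -1.0, 1.0, -1.0]
# Residuals that keep their sign for a while (positive autocorrelation).
persistent = [1.0, 1.0, -1.0, -1.0]

d_alternating = durbin_watson(alternating)  # above 2
d_persistent = durbin_watson(persistent)    # below 2
```

In practice the residuals would come from a fitted regression, ordered by time or by observation sequence.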

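Objectives of Multiple Regression
The general purpose of multiple regression is to investigate the relationship
between several independent or predictor variables and a dependent variable. This
relationship can be expressed by the following equation:

$Y = B_0 + B_1 X_1 + B_2 X_2 + \dots + B_p X_p$

where Y is the predicted score, X_1 is the score on the first predictor variable, X_2
is the score on the second, etc. The Y-intercept is B_0. The regression coefficients
B_1, ..., B_p are analogous to the slope in simple regression.

Estimation of Parameters
Equations relating the n observations can be written as:

$Y_i = B_0 + B_1 X_{i1} + B_2 X_{i2} + \dots + B_p X_{ip} + V_i, \quad i = 1, 2, \dots, n$

where the V_i are values of an unobserved error term and B_0, B_1, ..., B_p are the
unknown parameters.

The parameters B_0, B_1, ..., B_p can be estimated using the least squares procedure,
which minimizes the sum of squares of errors:

$\sum_{i=1}^{n} V_i^2 = \sum_{i=1}^{n} (Y_i - B_0 - B_1 X_{i1} - \dots - B_p X_{ip})^2$

Minimizing the sum of squares leads to a set of equations, called normal equations,
from which the values of the unknown parameters can be computed:

$\sum_i Y_i = n B_0 + B_1 \sum_i X_{i1} + \dots + B_p \sum_i X_{ip}$

$\sum_i X_{ij} Y_i = B_0 \sum_i X_{ij} + B_1 \sum_i X_{ij} X_{i1} + \dots + B_p \sum_i X_{ij} X_{ip}, \quad j = 1, 2, \dots, p$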
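For the trivariate case (one dependent variable and p = 2 predictors), the least squares procedure leads to three linear equations in three unknowns. The sketch below builds that system from sums of products and solves it by Gaussian elimination. It is an illustrative sketch only: the helper names `solve_linear_system` and `fit_two_predictors` and the data are hypothetical, and the data lie exactly on the plane Y = 1 + 2·X1 + 3·X2 so that the result is easy to check.

```python
def solve_linear_system(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so that row operations act on both sides.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


def fit_two_predictors(x1, x2, y):
    """Return (B0, B1, B2) obtained from the normal equations."""
    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))

    columns = [[1.0] * len(y), x1, x2]          # constant term, X1, X2
    lhs = [[dot(u, v) for v in columns] for u in columns]
    rhs = [dot(u, y) for u in columns]
    return tuple(solve_linear_system(lhs, rhs))


# Hypothetical data lying exactly on Y = 1 + 2*X1 + 3*X2.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

b0, b1, b2 = fit_two_predictors(x1, x2, y)
print(b0, b1, b2)
```

The `ValueError` raised for a singular system corresponds to the singularity problem discussed earlier: if one predictor is an exact linear function of the others, the normal equations have no unique solution.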

Geometrical Representation
“The problem of multiple regression can be represented geometrically as follows.
We can visualize that n observations (X_i1, X_i2, ..., X_ip, Y_i), i = 1, 2, ..., n are
represented as points in a (p+1)-dimensional space. The regression problem is to determine
from the possible hyper-planes in the p-dimensional space, the best fitting hyper-plane.
We use the least squares criterion and locate the hyper-plane that minimizes the sum
of squares of the errors, i.e., the distances between the points around the hyper-plane
(observations) and the points on the hyper-plane (i.e. the estimates Ŷ).”

The standard error of the estimate is:

$S = \sqrt{\dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - p - 1}}$

Where

Y_i = the sample value of the dependent variable

Ŷ_i = corresponding value estimated from the regression equation

n = number of observations

p = number of predictors or independent variables

The denominator of the equation indicates that the standard error has n − p − 1
degrees of freedom in multiple regression with p independent variables. This is
because the degrees of freedom are reduced from n by the p + 1 numerical constants
a, b1, b2, ..., bp, which have been estimated from the sample.
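The role of the n − p − 1 degrees of freedom can be made concrete with a short sketch. The function name `standard_error_of_estimate` and the observed and fitted values are hypothetical and are used only to show the arithmetic.

```python
import math


def standard_error_of_estimate(y, y_hat, p):
    """Square root of the residual sum of squares divided by (n - p - 1)."""
    n = len(y)
    degrees_of_freedom = n - p - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than estimated constants")
    sse = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))
    return math.sqrt(sse / degrees_of_freedom)


# Hypothetical observed values and fitted values from a model with p = 1 predictor.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.5, 1.5, 3.5, 3.5, 5.0]
se = standard_error_of_estimate(observed, fitted, p=1)
```

With five observations and one predictor, the residual sum of squares is divided by 5 − 1 − 1 = 3 rather than by 5.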

4.5.3 Regression Lines

A regression line is a line that is used to represent how a set of data behaves. In
other words, it displays the best trend in the data.

Regression lines are useful in forecasting procedures. Their purpose is to describe how
the dependent variable (y variable) is related to one or more independent variables (x
variables).

By substituting different values for the independent variables, the equation of the
regression line can be used to predict the future behaviour of the dependent variable.

Regression Line Formula: y = a + bx + u

Multiple Regression Line Formula: y = a + b1x1 + b2x2 + b3x3 + … + btxt + u

Let's say you have two variables, x and y. If y depends on x, the result is
a simple regression. The variables x and y are given the following names:

x – Independent Variable, Predictor or Explanator
y – Regressand, Dependent Variable or Explained Variable

We therefore adopt a simple linear regression model in which y is dependent on x.

Where does linear regression come into use?

Regression lines are employed in both the financial and business worlds. Financial
analysts use linear regressions to forecast stock prices and commodity prices and to
perform valuations of a variety of securities. Many companies use linear regressions
to forecast sales, inventories, and a variety of other variables.

Example 1:
Develop two regression equations for the given data set:

Solution:

Example 2:
Sales data of 10 months for a coffee house situated at a prime location in a
city, comprising the number of customers (in hundreds) and monthly sales (in thousand
rupees), are given below:

Find the simple linear regression equation that fits the given data.

Solution:
It is given that n = 10. Let us assume an equation of simple linear regression as
follows:

Y = a + bX

where X is the independent variable and Y the dependent variable.

4.5.4 Regression Coefficients
The regression coefficient, often known as the slope coefficient, is the quantity "b"
in the regression equation. Since there are two regression equations, there are two
regression coefficients:

1. Regression coefficient of X on Y, denoted "bxy"
2. Regression coefficient of Y on X, denoted "byx"

Regression coefficients are calculated using a variety of formulas:

$b_{yx} = r\,\dfrac{\sigma_y}{\sigma_x} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, \qquad b_{xy} = r\,\dfrac{\sigma_x}{\sigma_y} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2}$

Regression coefficients have the following properties:

1. The geometric mean of the two regression coefficients is the coefficient of correlation.
Symbolically, r = ±√(bxy × byx), where r takes the sign of the regression coefficients.
2. Because the value of the coefficient of correlation cannot exceed unity, if one of the
regression coefficients is greater than unity, the other must be smaller than unity. If bxy
= 1.2 and byx = 1.4, for example, "r" would be √(1.2 × 1.4) ≈ 1.30, which is not possible.
3. Both regression coefficients have the same sign: they are either both positive
or both negative. One of the regression coefficients cannot
have a minus sign while the other has a plus sign.
4. The correlation coefficient has the same sign as the regression coefficients,
i.e., if the regression coefficients are negative, "r" is negative as well, and if the
regression coefficients are positive, "r" is positive as well. If bxy = -0.2 and byx =
-0.8, for example, r = -√(0.2 × 0.8) = -0.4.
5. The average of the two regression coefficients is greater than or equal to the
correlation coefficient: (bxy + byx) / 2 ≥ r, in symbols. If bxy = 0.8 and byx = 0.4,
then the average of the two values is (0.8 + 0.4) / 2 = 0.6, and the value of r is
r = √(0.8 × 0.4) = 0.566, which is less than 0.6.
6. Regression coefficients are not affected by changes in origin, but they are affected
by changes in scale.
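Several of these properties can be checked numerically. The sketch below computes byx, bxy and r from sums of squares and cross-products, using the standard definitions. The function name `regression_coefficients` and the data values are hypothetical and serve only as an illustration.

```python
import math


def regression_coefficients(x, y):
    """Return (byx, bxy, r) for paired observations x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    byx = sxy / sxx                   # regression coefficient of Y on X
    bxy = sxy / syy                   # regression coefficient of X on Y
    r = sxy / math.sqrt(sxx * syy)    # correlation coefficient
    return byx, bxy, r


# Hypothetical paired data.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 8]
byx, bxy, r = regression_coefficients(x, y)

# Property 1: r is the geometric mean of the coefficients, with their sign.
geometric_mean = math.copysign(math.sqrt(byx * bxy), byx)

# Property 6: shifting the origin leaves the coefficients unchanged ...
shifted = regression_coefficients([v + 10 for v in x], [v - 5 for v in y])
# ... while changing the scale of x changes them.
scaled = regression_coefficients([v * 10 for v in x], y)
```

Here `scaled[0]` is one tenth of `byx` and `scaled[1]` is ten times `bxy`, so their product, and therefore r, is unchanged.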

4.5.5 Multiple Regression for Trivariate data
Multivariate Regression is a technique for determining the degree to which several
independent variables (predictors) and dependent variables (responses) are linearly
related. Once a sufficient degree of relationship has been established, the approach
is widely used to predict the behaviour of the response variables associated with changes
in the predictor variables.

Multivariate Regression (MVR) is a supervised machine learning approach that
analyses multiple data variables. A multivariate regression is an extension of multiple
regression with one dependent variable and several independent variables.
We try to predict the outcome based on the independent variables.

Multivariate regression aims to find a formula that explains how variables
respond simultaneously to changes in others.

Example: What is the relationship between the distance travelled by an OLA driver
and the driver's age and number of years of experience?

Solution:
Select the “Data Analysis” option in Excel.

The regression equation will be:

1. y = m1X1 + m2X2 + b
2. y = 604.17* - 3.18 + 604.17* - 4.06 + 0
3. y = - 4377

The distance travelled by the OLA driver is the dependent variable in this
regression equation, and the independent variables are the driver's age and number of
years of driving experience.

Example: What is the relationship between a class's GPA, the number of hours of
study and the students' height?

Solution: Select the “Data Analysis” option in Excel.

The regression equation will be:

1. y = m1X1 + m2X2 + b
2. y = 1.08* - 0.03 + 1.08* - 0.002 + 0
3. y = 0.0325

The GPA is the dependent variable in this regression, whereas the independent
variables are the students' study hours and height.

Example: What is the relationship between the salary of a group of employees in
an organisation, their number of years of experience and their age?

Solution: Select the “Data Analysis” option in Excel.

The regression equation will be:

1. y = m1X1 + m2X2 + b
2. y = 41308* - 0.71 + 41308* - 824 + 0
3. y = 37019

The salary is the dependent variable in this regression equation, whereas the
independent variables are the employees' experience and age.

Other Examples:
Example 1:
A data analyst may use multiple regression to forecast crop growth. Crop
growth is the dependent variable in this example, and you want to observe how different
factors affect it. Rainfall, temperature, amount of sunlight, and amount of fertiliser
applied to the soil might all be independent variables. A multiple regression model will
show you how much of the variance in crop growth each independent variable accounts for.

Example 2:
Assume you're an analyst in the insurance industry, and you're trying to estimate
how likely each potential client is to file a claim. You may include a variety of
independent variables in your model, such as the client's age, whether they have a major
health condition, their employment, and so on. A regression analysis will use these
factors to assess the likelihood of the event (filing a claim) occurring.

Example 3:
During the COVID-19 outbreak, a team of data scientists predicted that by the end of
July 2020, Delhi would have more than 5 lakh COVID-19 patients.
Multiple factors were included in this research, including government decisions, public
behaviour, population, occupation, public transportation, healthcare services, and the
community's general immunity.

Filters used to classify email as "spam" or "not spam" are another frequently
cited example.

Check your Understanding


Fill in the blanks or state true or false, as indicated.
1. ________________ is a supervised machine learning approach that analyses
multiple data variables.
2. When you wish to predict the values of one variable based on the values of another
variable, you use simple linear regression. State true or false.
3. __________ regression is a statistical technique for predicting a variable's outcome
based on the values of two or more variables.
4. A _________ is a line that is used to represent how a set of data behaves. In other
words, it displays the best trend in the data.
5. Non-linearity occurs when the dependent and independent variables do not follow a
straight line. State true or false.

Summary

●● Regression analysis is primarily employed for two conceptually distinct
purposes. First, it is extensively used for prediction and forecasting, where it
shares a lot of ground with machine learning. Second, it can in some situations be
used to infer causal links between independent and dependent variables.
●● When you wish to predict the values of one variable based on the values of
another variable, you use simple linear regression.
●● Simple linear regression involves only two variables, one dependent variable (Y)
and one independent variable (X): Y = A + BX + E
●● Multiple linear regression is a statistical technique for predicting a variable's
outcome based on the values of two or more variables. It is an extension of simple linear
regression and is also referred to simply as multiple regression. The dependent
variable is the one we aim to predict, whereas the independent or explanatory
variables are the ones we use to predict the value of the dependent variable.
●● Analysts can use this technique to determine how much variation the model explains
and the relative contribution of each independent variable to the total variance.
●● Multivariate Regression is a technique for determining the degree to which several
independent variables (predictors) and dependent variables (responses) are
linearly related. Once a sufficient degree of relationship has been established,
the approach is widely used to predict the behaviour of the response variables
associated with changes in the predictor variables.

Activity
1. Gather data using physical and/or electronic measurements and then create
appropriate regression curves and lines based on an analysis of the data.

Questions & Exercises
1. Define regression.
2. What are the different types of regression?
3. What is linear regression?
4. Write a short note on regression lines.
5. What do you understand by the term regression coefficient?
6. Briefly write about two variable linear regression.
7. What is multiple regression?
8. Comment on Multiple Regression for Trivariate data.

Glossary
●● Regression line: A regression line is a line that is used to describe the behaviour of
a set of data. In other words, it gives the best trend of the given data.
●● Regression Coefficient: A regression coefficient is a statistical measure of the
average functional relationship between variables. In regression analysis, one
variable is dependent and the other is independent. The coefficient measures
the degree of dependence of one variable on the other(s).
●● Multivariate Regression: A technique for determining the degree to which
several independent variables (predictors) and dependent variables (responses) are
linearly related. Once a sufficient degree of relationship has been established,
the approach is widely used to predict the behaviour of the response variables
associated with changes in the predictor variables.
●● Intercept: The estimated Y value when all the independent variables have a value of 0.
●● Simple regression: The general purpose of simple regression is to analyze the
relationship between one independent or predictor variable and a dependent or
criterion variable.

Further Readings
1. Nagpaul, P.S. (2001). Guide to Advanced Data Analysis Using IDAMS Software
(Chapters 4 and 5).
2. Montgomery, D.C. [et al.]. Introduction to Linear Regression Analysis. 3rd ed.
New York: John Wiley.
3. Fisher, R.A. (1922). "The goodness of fit of regression formulae, and the
distribution of regression coefficients". Journal of the Royal Statistical Society.
4. Aron, A., Aron, E. N., Coups, E.J. (2007). Statistics for Psychology. Delhi:
Pearson Education.
5. Minium, E. W., King, B. M., & Bear, G. (2001). Statistical Reasoning in
Psychology and Education. Singapore: John Wiley.
6. Guilford, J. P., & Fruchter, B. (1978). Fundamental Statistics for Psychology
and Education. N.Y.: McGraw-Hill.
7. Wilcox, R. R. (1996). Statistics for Social Sciences. San Diego: Academic Press.

Check your Understanding – Answers
1. Multivariate Regression
2. True
3. Multiple linear
4. Regression line
5. True


Module - 5: Association of Attributes


Structure:
5.1 Association of attributes, Independence
5.1.1 Association of attributes
5.1.2 Independence
5.2 Measure of association for 2x2 table
5.2.1 Measure of association for 2x2 table
5.3 Chi-square, Karl Pearson’s and Tschuprow’s coefficient of association
5.3.1 Chi-square coefficient of association
5.3.2 Karl Pearson’s coefficient of association
5.3.3 Tschuprow’s coefficient of association
5.4 Contingency tables with ordered categories
5.4.1 Contingency tables with ordered categories

Unit - 5.1: Association of Attributes and Independence


Objectives
At the end of this unit, you will be able to:

●● Understand the meaning of attributes
●● Discuss the types of associations
●● Understand consistency of data
●● Define independence of attributes

Introduction
A qualitative characteristic is one which cannot be measured numerically and
cannot be expressed in units of measurement. Beauty, intelligence, skin colour,
politeness, literacy, etc. are all qualitative characteristics.

A qualitative characteristic that varies from unit to unit is called an attribute. It is
a quality that cannot be measured. However, its presence or absence can be recorded.

Data may be such that its magnitude cannot be determined. In such cases,
only the presence or absence of a particular quality in a class of individuals can be
observed. For example, in the case of deafness, the extent of deafness cannot be
determined. However, a population can be grouped on the basis of being deaf or not
being deaf.

It is up to the observer to decide the standard on the basis of which the data shall
be classified. Data for which quantitative classification cannot be made and only the
presence or absence of an attribute can be studied are called statistics of attributes.

5.1.1 Association of Attributes


Variables are of two types: quantitative and qualitative. Quantitative variables
can have their magnitude measured, whereas qualitative variables are non-numerical
in nature and cannot have their magnitude assessed numerically. They are referred to
as attributes. Qualitative data can be quantified by assigning the number 1 to a person
who possesses a specific attribute and the number 0 to someone who does not. As a
result, the total number of ones denotes the total number of people who have that
attribute. Similarly, the total number of zeros represents the total number of people
who do not have that attribute. Association of attributes measures the degree of
relationship between attributes, for example, literacy and employment, or success and
happiness.

The statistical definition of association differs significantly from the conventional
use of the term. In everyday usage, two attributes A and B are considered to be related
if they appear together a number of times. Yule and Kendall remark: “In Statistics A
and B are associated only if they appear together in a greater number of cases than is
expected, if they are independent.”

The procedures used to assess the relationship between two phenomena whose
magnitude cannot be determined, and for which we can only detect the presence or
absence of an attribute, are referred to as methods of measuring the association of
attributes.

In correlation analysis, we look at the relationship between two variables that can
be measured numerically. Similarly, in the case of association, we investigate the
relationship between two non-quantifiable characteristics, for instance, education level
and criminality. Association deals with attributes, not variables. As previously
mentioned, an attribute divides the universe into two groups: those who have the
attribute and those who do not, whereas a variable can divide the universe into any
number of classes. The correlation coefficient reveals the degree of linear relationship
between two variables, whereas the coefficient of association indicates whether there
is a positive or negative relationship between two attributes.

However, unlike the regression coefficient, which is related to the correlation
coefficient, the coefficient of association cannot be used to calculate the expected
change in A for a given change in B, or vice versa.

Types of Associations

1. Positive Association:

In positive association, the presence of one attribute is accompanied by the
presence of another attribute. For example, literacy and employment are positively
associated.

If (AB) > (A)(B)/N,

where (A) represents the frequency of attribute A,

(B) represents the frequency of attribute B,

(AB) represents the number of observations possessing both A and B, and

N represents the total number of observations,

then attributes A and B are positively associated.

2. Negative Association or Disassociation:

In negative association, the presence of one attribute is accompanied by the
absence of another attribute. For example, hygiene and occurrence of infection are
negatively associated.

If (AB) < (A)(B)/N, then attributes A and B are negatively associated.

3. No Association or Independence:

If the presence of one attribute does not affect the presence or absence of
another attribute, they have no association. For example, literacy and health have no
association.

If (AB) = (A)(B)/N, then attributes A and B are independent.

It is to be noted that,

●● If A cannot occur without B or all A’s are B’s or vice-versa, the two attributes are
said to be completely associated. Here, (AB) = (A) or (AB) = (B).
●● If no A’s are B’s, or no α’s are β’s, then A and B are completely disassociated.
Here, (AB) = 0 or (αβ) = 0.

5.1.2 Independence

Consistency of Data
The frequency of any class cannot be negative. If the frequencies of the various
classes are counted and any class frequency comes out to be negative, the data is
said to be inconsistent. Inconsistencies can arise due to various errors such as
incorrect recording, inaccurate addition, printing errors, etc. To ensure that data is
consistent, several conditions must be met. To draw accurate and meaningful
conclusions from data, these conditions must be checked right at the start of the study.

To check the consistency of data, calculate all the class frequencies. If none of the
class frequencies is negative, the data is said to be consistent. Data is also inconsistent
if the frequency of an attribute or combination of attributes is more than the total
frequency N. It is to be noted that consistency of data does not imply that the data is
accurate. There can still be errors of calculation or misprinting. However, if the data is
inconsistent, there has certainly been an error.

If a population is categorized into classes on the basis of presence or absence of
an attribute, the class possessing the attribute is called a positive class and is denoted
by capital letters, whereas the class not possessing the attribute is called the negative
class and is denoted by small Greek letters. Therefore, if ‘A’ represents literacy, ‘α’
represents illiteracy.

The data can be tested for consistency in the following ways:

For one attribute A:
    (A) ≥ 0 and (α) ≥ 0
    But (A) + (α) = N, therefore (A) ≤ N and (α) ≤ N

For two attributes A and B:
    N = (A) + (α) = (B) + (β)
    (A) = (AB) + (Aβ), and (B) = (AB) + (αB)
    (α) = (αB) + (αβ), and (β) = (Aβ) + (αβ)
    (AB) ≥ 0, (Aβ) ≥ 0, (αB) ≥ 0 and (αβ) ≥ 0

For three attributes A, B and C:
    (ABC) ≥ 0
    (AB) ≥ (ABC)
    (AC) ≥ (ABC)
    (BC) ≥ (ABC)
    (ABC) ≥ (AB) + (AC) – (A)
    (ABC) ≥ (AB) + (BC) – (B)
    (ABC) ≥ (AC) + (BC) – (C)
    (ABC) ≤ (AB) + (BC) + (AC) – (A) – (B) – (C) + N
    (AB) + (AC) + (BC) ≥ (A) + (B) + (C) – N
    (AC) + (BC) – (AB) ≤ (C)
    (AB) + (BC) – (AC) ≤ (B)
    (AB) + (AC) – (BC) ≤ (A)

Illustration 1:
Examine the consistency of the following data

N = 1000, (A) = 600, (B) = 700, (AB) = 150, the symbols having their usual
meaning.

Solution:

We have (αβ) = N – (A) – (B) + (AB) = 1000 – 600 – 700 + 150 = –150

Since (αβ) < 0, the data is inconsistent.
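
The two-attribute check used in this illustration can be sketched in Python. This is only a minimal sketch: the function names `ultimate_frequencies` and `is_consistent` and the dictionary keys are illustrative assumptions, not notation from the text. It derives the four second-order frequencies from N, (A), (B) and (AB) and reports whether any of them is negative.

```python
def ultimate_frequencies(n, a, b, ab):
    """Return the four second-order class frequencies for two attributes.

    n  : total frequency N
    a  : (A), b : (B), ab : (AB)
    Keys are illustrative names: A_beta stands for (Aβ), and so on.
    """
    return {
        "AB": ab,
        "A_beta": a - ab,              # (Aβ) = (A) - (AB)
        "alpha_B": b - ab,             # (αB) = (B) - (AB)
        "alpha_beta": n - a - b + ab,  # (αβ) = N - (A) - (B) + (AB)
    }


def is_consistent(n, a, b, ab):
    """Data is consistent only if no class frequency is negative."""
    return all(f >= 0 for f in ultimate_frequencies(n, a, b, ab).values())


freqs = ultimate_frequencies(1000, 600, 700, 150)
print(freqs["alpha_beta"])                  # -150
print(is_consistent(1000, 600, 700, 150))   # False
```

Checking all four frequencies, rather than (αβ) alone, also catches cases such as (AB) > (A), where (Aβ) would be negative.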

Illustration 2: In a locality having a population of 1000 persons, 630 were males,
out of whom 450 were married. Among females, the number of married ones was 200.
Check the consistency of the data.

Solution: Let A represent Males, α represent Females, B represent Married and β
represent Unmarried.

Given N = 1000, (A) = 630, (AB) = 450 and (αB) = 200

                                  Sex
Marital Status      Male (A)      Female (α)      Total
Married (B)         450 (AB)      200 (αB)        (B)
Unmarried (β)       (Aβ)          (αβ)            (β)
Total               630 (A)       (α)             1000 (N)

(α) = N – (A) = 1000 – 630 = 370

(α) = (αB) + (αβ)

(αβ) = (α) – (αB) = 370 – 200 = 170

(Aβ) = (A) – (AB) = 630 – 450 = 180

Since none of the class frequencies is negative, the data is consistent.

Therefore,

                                  Sex
Marital Status      Male (A)      Female (α)      Total
Married (B)         450           200             650
Unmarried (β)       180           170             350
Total               630           370             1000

Independence
As discussed earlier, if the presence of one attribute does not affect the presence
or absence of another attribute, they have no association.

Two attributes A and B are said to be independent if there exists no relationship
between the two. Thus, if A and B are independent we may expect

1. the same proportion of A’s in B’s as in β’s, and
2. the same proportion of B’s in A’s as in α’s.

For example, literacy and happiness have no association. Thus, the proportion
of literates in the happy and unhappy groups must be equal. Similarly, the proportion
of happy people will be equal among both literates and illiterates. If the proportions in
the two classes are unequal, the attributes are not independent.
Conditions for independence:

From condition (1) above, we can formulate that

(AB)/(B) = (Aβ)/(β)

Since (AB) + (Aβ) = (A) and (B) + (β) = N, each of these ratios is also equal to

[(AB) + (Aβ)] / [(B) + (β)] = (A)/N

We know that for independence,

(AB)/(B) = (A)/N

that is,

(AB) = (A)(B)/N

which can also be written as

(AB)/N = (A)/N × (B)/N

As a result, an important criterion for determining the independence of two
attributes A and B may be expressed in terms of proportions: for independence, the
proportion of AB’s in the population should be equal to the product of the proportions
of A’s and B’s in the population. With the aid of the accompanying table, the
requirements for independence between two attributes become clearer. The class
frequencies are presented in the relevant cells of the table below.

                  A           α           Total
B                 (AB)        (αB)        (B)
β                 (Aβ)        (αβ)        (β)
Total             (A)         (α)         N

Observing this table, we get the following condition for independence:

(AB)(αβ) = (Aβ)(αB)

Let us understand this with an example.
Illustration 1:
Determine whether A and B are independent, positively associated or negatively
associated in each of the following cases:

(i) N = 1000; (A) = 350; (B) = 700; (AB) = 300
(ii) (A) = 700; (AB) = 300; (α) = 615; (αB) = 380
(iii) N = 1000; (A) = 450; (B) = 300; (AB) = 135

Solution:
(i) Given,

                  (A)         (α)         Total
(B)               300         (αB)        700
(β)               (Aβ)        (αβ)        (β)
Total             350         (α)         1000

(AB)0 = (A)(B)/N = (350 × 700)/1000 = 245

Thus, (AB) = 300 > 245

Since (AB) > (AB)0, they are positively associated.

(ii) Given,

                  (A)         (α)         Total
(B)               300         380         (B)
(β)               (Aβ)        (αβ)        (β)
Total             700         615         (N)

(B) = (AB) + (αB) = 300 + 380 = 680

N = (A) + (α) = 700 + 615 = 1315

∴ (AB)0 = (A)(B)/N = (700 × 680)/1315 ≈ 362

Thus, (AB) = 300 < 362

Since (AB) < (AB)0, they are negatively associated.

(iii) Given,

                  (A)         (α)         Total
(B)               135         (αB)        300
(β)               (Aβ)        (αβ)        (β)
Total             450         (α)         1000

(AB)0 = (A)(B)/N = (450 × 300)/1000 = 135

Thus, (AB) = 135 = (AB)0

Since (AB) = (AB)0, they are independent.
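
The comparison of (AB) with its expected value (AB)0 = (A)(B)/N can be sketched in Python as follows. This is a hedged sketch: the function name `classify_association` is an assumption made for illustration. The comparison is done on integers, as (AB) × N against (A) × (B), so that the independence case is not affected by floating-point rounding.

```python
def classify_association(a, b, ab, n):
    """Compare the observed (AB) with the expected (AB)0 = (A)(B)/N.

    Returns the expected frequency and one of
    'positive', 'negative' or 'independent'.
    """
    expected = a * b / n
    # Compare (AB) * N with (A) * (B) to stay in exact integer arithmetic.
    if ab * n > a * b:
        kind = "positive"
    elif ab * n < a * b:
        kind = "negative"
    else:
        kind = "independent"
    return expected, kind


# The three cases of the illustration; in case (ii), (B) and N are derived first.
cases = [
    (350, 700, 300, 1000),
    (700, 300 + 380, 300, 700 + 615),
    (450, 300, 135, 1000),
]
for a, b, ab, n in cases:
    expected, kind = classify_association(a, b, ab, n)
    print(round(expected), kind)
```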

Check Your Understanding
Fill in the blanks:

1. A qualitative characteristic that varies from unit to unit is called ___________.
2. If (AB) = (A) then the two attributes are ___________ associated.
3. The data is inconsistent if the frequency of an attribute is more than _______________.
4. Positive class is represented by ___________ and negative class is represented by
______________.

Summary
●● A qualitative characteristic that varies from unit to unit is called an attribute.
●● Association of attributes measures the degree of relationship between attributes.
Attributes can be independent, positively associated or negatively associated.
●● To ensure that data is consistent, several conditions must be met. To draw
accurate and meaningful conclusions from data, these conditions must be checked
right at the start of the study.

Questions & Exercises
1. Define attributes. What is association of attributes?
2. What is independence of attributes? What are the conditions for independence?

Activity
1. Prepare a presentation on ‘Qualitative & Quantitative Characteristics’.

Further Readings
1. C.R. Kothari, Research Methodology: Methods and Techniques (IInd Revised
Edition)
2. Liebetrau, A. (1983). Measures of Association (Quantitative Applications in the
Social Sciences). Sage Publications

Glossary
●● Attribute: A qualitative characteristic that varies from unit to unit.
●● Association of attributes: The degree of relationship between attributes.
●● Independence: The absence of any relationship between attributes.

Answers to Check Your Understanding
1. attribute
2. completely
3. total frequency
4. capital letters, small Greek letters

Unit - 5.2: Measure of Association for 2x2 Table


Objectives
At the end of this unit, you will be able to:

●● Understand the meaning of dichotomous classification
●● Define cell frequencies, class frequencies and order of frequencies
●● Understand the 2x2 table for measure of association
●● Understand Yule’s Coefficient of Association
●● Understand Coefficient of Colligation

Introduction
A two-way table is a multicolumn table that shows the counts or percentages of
responses for two categorical variables. The categories of one of the variables
constitute the table's rows, while the categories of the second variable form the table's
columns. The totals are included in a separate row and a separate column on the
table's "outside." Two-way tables are also called cross-tabulation tables or
cross-classification tables.

A row variable with two categories and a column variable with two categories make
up the simplest two-way table, the 2 x 2 table. Its inner part has two rows and two
columns. Each inner cell gives the count or percentage of units that fall in the
corresponding pair of categories of the two cross-classified variables.

5.2.1 Measure of Association for 2x2 Table


Before we understand the measure of association for a 2x2 table, here are some
important terms we must learn.

Dichotomous Classification
The division of units into two classes with respect to an attribute, where one class
consists of the units which possess the attribute and the other class consists of the
units which do not possess it, is called dichotomous classification.

Examples: male and female; married and unmarried; literate and illiterate.

Cell Frequencies
(A) - Number of units possessing attribute 'A'
(α) - Number of units not possessing attribute 'A'
(B) - Number of units possessing attribute 'B'
(β) - Number of units not possessing attribute 'B'
(AB) - Number of units possessing attributes 'A' and 'B'
(αB) - Number of units not possessing attribute 'A' and possessing attribute 'B'
(Aβ) - Number of units possessing attribute 'A' and not possessing attribute 'B'
(αβ) - Number of units possessing neither attribute 'A' nor attribute 'B'

Class frequencies
The number of items belonging to a class is called the frequency of that class.
The class frequency is denoted by putting the letter (or letters) denoting the class in
a bracket. Thus, (A) stands for the number of items possessing the attribute A; (αB)
stands for the number of items not possessing A and possessing B.

The frequency of a positive class is called a positive class frequency, e.g. (AB), and
the frequency of a negative class is called a negative class frequency, e.g. (αβγ).

Ultimate Class Frequencies
The class-frequencies of highest order are called ultimate class-frequencies.
Thus, in the case of two attributes, class-frequencies of order two are ultimate class-
frequencies. If A and B are attributes then (AB), (Aβ), (αB), (αβ) are ultimate class-
frequencies. If we are considering n attributes, the ultimate class frequencies will have n
symbols.

Thus the total number of ultimate class frequencies in the case of two attributes is
2^2 = 4 and for three attributes is 2^3 = 8.

◌◌ The total number of ultimate class frequencies in case of n attributes is 2^n.
◌◌ The total number of positive class frequencies (including N) is 2^n.
◌◌ The total number of class frequencies of all orders is 3^n.
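
These counts can be checked by brute-force enumeration. The following Python sketch rests on a labelling assumption made only for this example: each attribute is either present (capital letter), absent (lower-case letter standing in for the Greek letter) or unspecified, and the class with nothing specified is N. The function name `enumerate_classes` is hypothetical.

```python
from itertools import product


def enumerate_classes(letters):
    """List every class frequency for the given attributes.

    Each attribute is 'pos' (present), 'neg' (absent) or None (unspecified).
    Returns tuples of (label, order, is_positive_class).
    """
    classes = []
    for states in product(("pos", "neg", None), repeat=len(letters)):
        label = "".join(
            letter if state == "pos" else letter.lower()
            for letter, state in zip(letters, states)
            if state is not None
        )
        order = sum(state is not None for state in states)
        is_positive = all(state != "neg" for state in states)
        classes.append((label or "N", order, is_positive))
    return classes


for letters in ("AB", "ABC"):
    classes = enumerate_classes(letters)
    n = len(letters)
    total = len(classes)                                 # 3^n
    ultimate = sum(order == n for _, order, _ in classes)  # 2^n
    positive = sum(pos for _, _, pos in classes)           # 2^n, counting N
    print(n, total, ultimate, positive)
```

For two attributes this prints 2 9 4 4 and for three attributes 3 27 8 8, in line with 3^n, 2^n and 2^n.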

Order of frequencies
The order of a frequency depends on the number of attributes specified.

◌◌ (A), (α), (B) and (β) are called frequencies of first order because only one
attribute is specified.
◌◌ (AB), (αB), (Aβ) and (αβ) are called frequencies of second order since two
attributes are specified.
◌◌ N (total frequency) is a frequency of zero order as it does not indicate
any attribute.

Measure of Association for 2x2 Table


A 2 x 2 table (or two-by-two table) is a compact summary of data for two variables
that helps in organizing and interpreting the data easily. In the table given below,
excluding the total column and total row, there are two rows and two columns, and we
have two attributes each classified into two classes. This table is called the
2 x 2 contingency table. Including the totals, the table has nine cells and hence the
2 x 2 contingency table is also called the nine square table.

In the table below, A and B are attributes that are divided into two classes (A, α) and
(B, β) respectively.

                          Attribute
                     A           α           Total
Attribute    B       (AB)        (αB)        (B)
             β       (Aβ)        (αβ)        (β)
Total                (A)         (α)         N

r
Illustration 1: In a population of 100 individuals, 52 are male, out of which 43 are
married. Out of the females, 4 are unmarried.

The two attributes in the given example are (male, female) and (married,
unmarried). Based on these attributes, the following table can be made.

                          Marital Status
Sex              Married (A)     Unmarried (α)     Total
Male (B)         43 (AB)         9 (αB)            52
Female (β)       44 (Aβ)         4 (αβ)            48
Total            87              13                100

The table so formed is called a 2 by 2 table. It is based on dichotomous
classification of data. It is further used in the calculation of measures of association.

Measure of Correlation
The methods we have covered thus far can help you determine whether two
attributes are positively associated, negatively associated, or independent.

This is sometimes sufficient for making practical decisions. However, most of the
time it is insufficient, because we are usually interested in the extent of association,
which we would like to assess quantitatively. The possibility of deriving a coefficient of
association, which gives some sense of the extent of association between two
attributes, is discussed in this section. Decisions would be easy if the coefficient of
association were such that its value is 0 when two attributes are independent, +1 when
they are perfectly associated, and –1 when they are perfectly dissociated. In the case
of dichotomous classification, the following measures are used:

1. Yule’s Coefficient of Association
2. Coefficient of Colligation

Yule’s Coefficient of Association

Yule's coefficient of association is named after G. Udny Yule, its creator. The
coefficient of association for two attributes A and B is

Q = [(AB)(αβ) – (Aβ)(αB)] / [(AB)(αβ) + (Aβ)(αB)]

The value of Q lies between –1 and +1.

●● If Q = 1, A and B have perfect positive association. It can be verified that under
perfect positive association
(AB) = (A) ⇒ (Aβ) = 0
or
(AB) = (B) ⇒ (αB) = 0

●● If Q = –1, A and B have perfect negative association. This leads to the following
relationship:
(AB) = 0 or (αβ) = 0

●● If Q = 0, A and B are independent. Here, we have the following relation:
(AB)(αβ) = (Aβ)(αB)

●● Any value between –1 and +1 tells us the degree of relationship between the two
attributes A and B. Conventionally, if the magnitude of Q exceeds 0.5, the
association between the two attributes is considered to be of high order, and a
magnitude of Q less than 0.5 shows a low degree of association between the two
attributes.

For example, according to Illustration 1,

Married males (AB) = 43
Unmarried males (αB) = 9
Married females (Aβ) = 44
Unmarried females (αβ) = 4

According to the formula,

Q = ((43)(4) – (44)(9)) / ((43)(4) + (44)(9)) = (172 – 396) / (172 + 396) = –224/568

Q ≈ –0.39

Therefore, there exists a negative association between the two attributes.
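
The calculation above can be reproduced with a short Python sketch. The function name `yule_q` and its argument names are illustrative assumptions; the formula is the one stated in this section, applied to the four cell frequencies of Illustration 1.

```python
def yule_q(ab, a_beta, alpha_b, alpha_beta):
    """Yule's coefficient of association from the four cell frequencies.

    ab = (AB), a_beta = (Aβ), alpha_b = (αB), alpha_beta = (αβ).
    Undefined (division by zero) if both cross-products are zero.
    """
    diagonal = ab * alpha_beta
    off_diagonal = a_beta * alpha_b
    return (diagonal - off_diagonal) / (diagonal + off_diagonal)


# Illustration 1: married males 43, married females 44,
# unmarried males 9, unmarried females 4.
q = yule_q(ab=43, a_beta=44, alpha_b=9, alpha_beta=4)
print(round(q, 3))  # -0.394
```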

Coefficient of Colligation
Yule provides another important coefficient of association, the coefficient of
colligation. It is defined as:

Y = [√((AB)(αβ)) – √((Aβ)(αB))] / [√((AB)(αβ)) + √((Aβ)(αB))]

The value of Y lies between –1 and +1. A negative value of Y indicates the
presence of negative association. A value of Y near –1 shows strong negative
association, which indicates that the off-diagonal frequencies (Aβ) and (αB) are large
relative to the diagonal frequencies (AB) and (αβ).

◌◌ Y = –1 indicates perfect negative association.
◌◌ Y = +1 indicates perfect positive association.
◌◌ Y = 0 indicates that the attributes are independent.

Relation between Yule’s Coefficient and Coefficient of colligation:
On further calculation, it can be proved that:

Q = 2Y / (1 + Y^2)

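
The relation Q = 2Y / (1 + Y^2) between the two coefficients can be checked numerically. The sketch below is written under stated assumptions: the function names `yule_q` and `yule_y` are illustrative, and the cell frequencies are those of the married/unmarried illustration earlier in this unit.

```python
from math import isclose, sqrt


def yule_q(ab, a_beta, alpha_b, alpha_beta):
    """Yule's coefficient of association Q."""
    d, o = ab * alpha_beta, a_beta * alpha_b
    return (d - o) / (d + o)


def yule_y(ab, a_beta, alpha_b, alpha_beta):
    """Yule's coefficient of colligation Y (square roots of the cross-products)."""
    d, o = sqrt(ab * alpha_beta), sqrt(a_beta * alpha_b)
    return (d - o) / (d + o)


cells = dict(ab=43, a_beta=44, alpha_b=9, alpha_beta=4)
q = yule_q(**cells)
y = yule_y(**cells)

print(round(q, 3), round(y, 3))
# Numerical check of the relation between the two coefficients.
print(isclose(q, 2 * y / (1 + y * y)))  # True
```

Because of the square roots, Y is always closer to zero than Q for the same table, except at the values –1, 0 and +1 where the two coincide.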
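Example 1: Calculate Yule’s coefficient of association for the following data:

i. (A) = 6000; (B) = 8000; (AB) = 4800; N = 10000
ii. (A) = 6000; (B) = 8000; (AB) = 6000; N = 10000
iii. (A) = 6000; (B) = 8000; (AB) = 4000; N = 10000

Solution: We have (Aβ) = (A) – (AB), (αB) = (B) – (AB) and
(αβ) = N – (A) – (B) + (AB).

i. (Aβ) = 1200, (αB) = 3200, (αβ) = 800
Q = (4800 × 800 – 1200 × 3200) / (4800 × 800 + 1200 × 3200) = 0
The attributes are independent.

ii. (Aβ) = 0, (αB) = 2000, (αβ) = 2000
Q = (6000 × 2000 – 0 × 2000) / (6000 × 2000 + 0 × 2000) = +1
The attributes have perfect positive association.

iii. (Aβ) = 2000, (αB) = 4000, (αβ) = 0
Q = (4000 × 0 – 2000 × 4000) / (4000 × 0 + 2000 × 4000) = –1
The attributes have perfect negative association.
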
Models of 2x2 Contingency Tables

Contingency tables are used when observations are nominally scaled. If N objects
are classified according to k factors Fi (i = 1, ..., k), the ith factor having si levels, the
numbers of observations in the s1 × s2 × ... × sk classes form a k-dimensional
contingency table. There are k + 1 distinct models for such contingency tables. The
models differ in how many of the factors are observed by the experimenter as random
outcomes (observation factors). The remaining factors, whose marginal totals are set
in advance, are referred to as fixed factors. A two-dimensional contingency table is
used to illustrate this.

Three models exist in 2x2 contingency tables:

Model 1: If we investigate N pupils and record whether they have blue eyes
or not and whether they are fair-haired or not, then we have k = 2 factors: A, eye
colour, with s1 = 2 levels and B, hair colour, with s2 = 2 levels. The observations can
be arranged in a contingency table like the table below.

                     Fair-haired     Not fair-haired     Total
Blue eyes            n11             n12                 N1·
Not blue eyes        n21             n22                 N2·
Total                N·1             N·2                 N

Here both factors are observation factors; the entries nij, i = 1, 2, j = 1, 2 and
the marginal sums N1·, N2·, N·1, and N·2 of the contingency table are random
variables. A random sample of size N is investigated. We call this situation model I of
a contingency table.

Model 2: If the marginal sums of one of the factors, let's say A, are fixed in
advance, we obtain a contingency table like the table below.

                     Blue eyes       Not blue eyes       Total (fixed)
Female               n11             n12                 N1
Male                 n21             n22                 N2
Total                N·1             N·2                 N

Such a situation occurs if N1 female and N2 male pupils are observed and it is
counted how many have blue eyes and how many do not. We call this model
II of a contingency table.

Model 3: The situation of model III, with all marginal sums fixed in advance, is
mainly of theoretical relevance, as in Fisher's 'problem of the lady tasting tea'. Muriel
Bristol, the lady in question, claimed to be able to determine whether the tea or the
milk was added to the cup first. Fisher proposed giving her eight cups in random
order, four of each type. One can then ask what the probability is that she identifies,
purely by chance, as many cups correctly as she actually did. Because the lady knows
that four cups of each kind have been prepared, she will also assign four cups to each
kind, so that all marginal sums are fixed at four.
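
For model III, the chance probabilities in the lady-tasting-tea setting follow the hypergeometric distribution, because both sets of marginal sums are fixed at four. The Python sketch below is illustrative only; the function name `prob_correct` and its parameters are assumptions introduced here. It gives the probability that exactly k of the four cups she labels "milk first" really are milk-first cups when she is guessing at random.

```python
from fractions import Fraction
from math import comb


def prob_correct(k, cups_per_kind=4):
    """Probability of picking exactly k true milk-first cups by pure chance.

    She chooses cups_per_kind cups out of 2 * cups_per_kind; k of them come
    from the milk-first cups and the rest from the tea-first cups.
    """
    ways_total = comb(2 * cups_per_kind, cups_per_kind)
    ways_k = comb(cups_per_kind, k) * comb(cups_per_kind, cups_per_kind - k)
    return Fraction(ways_k, ways_total)


for k in range(5):
    print(k, prob_correct(k))

# All eight cups classified correctly corresponds to k = 4.
print(prob_correct(4))  # 1/70
```

The probability of a perfect classification by chance alone is therefore 1 in 70.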

We may use contingency tables to estimate measures of association as well as to
test hypotheses. Here we simply illustrate how to use two-dimensional contingency
tables to compute several measures from observed data. A number of coefficients,
referred to as association measures, can be used to determine the degree of
relationship between the two variables (factors).


Let us sum up what we have learned with a case study.

Case Study 1:

Cross-Classification of Particle Counts by Wafer Condition

This two-way cross-classification table summarizes the findings of a study at a
manufacturing facility that looked at whether particles detected on silicon wafers had
an impact on the wafer's quality. It is followed by tables with row percentages, column
percentages, and overall total percentages.

Row Percentages Table

Column Percentages Table

Overall Total Percentages Table

Analysis:
The inner part of the simplest two-way table comprises two rows and two columns.
Each inner cell gives the count or percentage of units that fall in the corresponding
pair of categories of the two cross-classified variables. For each row and column
combination, extra rows and columns may display the percentages of the overall total,
the percentages of the row total, and the percentages of the column total.

Two-way tables can indicate the most common combinations of values in data. The
tables show that defective wafers are far more likely to have particles than good
wafers in this case. Because the numbers of good and bad wafers in this study were
unequal, the Row Percentages table is the appropriate place to look for this pattern.
According to the table, roughly three-quarters of wafers with particles were bad,
whereas just 20% of wafers without particles were bad.

Let us understand what we have learned so far with the help of some examples.

Example 1: In a population of 1000 individuals, 460 drive two-wheelers, out of
which 400 are college students. Out of the car owners, 324 have jobs. Calculate the
coefficient of association.

Solution:

From the given information, we can prepare the following 2x2 table:

                     Two-wheelers (A)     Cars (a)       Total
College (B)          400 (AB)             (aB)           (B)
Job (b)              (Ab)                 324 (ab)       (b)
Total                460 (A)              (a)            1000 (N)

We know that,

◌◌ (AB) + (Ab) = (A)
◌◌ (AB) + (aB) = (B)
◌◌ (aB) + (ab) = (a)
◌◌ (Ab) + (ab) = (b)
◌◌ (A) + (a) = N
◌◌ (B) + (b) = N

From this, we can fill the missing figures as:

                     Two-wheelers (A)     Cars (a)       Total
College (B)          400 (AB)             216 (aB)       616 (B)
Job (b)              60 (Ab)              324 (ab)       384 (b)
Total                460 (A)              540 (a)        1000 (N)

Now, according to the formula

Q = [(AB)(ab) – (Ab)(aB)] / [(AB)(ab) + (Ab)(aB)]
  = (400 × 324 – 60 × 216) / (400 × 324 + 60 × 216)
  = (129600 – 12960) / (129600 + 12960)
  = 116640 / 142560
  ≈ 0.82

Therefore, there is a high degree of positive association between the attributes.
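
The two steps of this example, filling in the missing cells from the marginal identities and then applying the formula for Q, can be sketched in Python. The function names `complete_table` and `yule_q` and the dictionary keys are assumptions made for illustration; the inputs are N, (A), (AB) and (ab) as given in Example 1.

```python
def complete_table(n, a, ab, ab_neg):
    """Fill a 2x2 table from N, (A), (AB) and (ab).

    ab_neg is the count possessing neither attribute, written (ab) in the text.
    Returns all nine entries of the table as a dictionary.
    """
    a_b_neg = a - ab         # (Ab) = (A) - (AB)
    a_neg = n - a            # (a)  = N - (A)
    a_neg_b = a_neg - ab_neg # (aB) = (a) - (ab)
    return {
        "AB": ab, "aB": a_neg_b, "B": ab + a_neg_b,
        "Ab": a_b_neg, "ab": ab_neg, "b": a_b_neg + ab_neg,
        "A": a, "a": a_neg, "N": n,
    }


def yule_q(table):
    """Yule's Q from a completed table."""
    d = table["AB"] * table["ab"]
    o = table["Ab"] * table["aB"]
    return (d - o) / (d + o)


table = complete_table(n=1000, a=460, ab=400, ab_neg=324)
print(table)
print(round(yule_q(table), 2))  # 0.82
```

The same two functions apply to any example of this unit in which N, (A), (AB) and (ab) are given; when a different set of four frequencies is given, the identities listed above must be rearranged accordingly.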

Example 2: In a population of 500 individuals, 250 are male, out of which 200 have
brown eyes. Out of the females, 20 do not have brown eyes. Calculate the coefficient
of association.

Solution:

From the given information, we can prepare the following 2x2 table:

                     Male (A)         Female (a)       Total
Brown eyes (B)       200 (AB)         (aB)             (B)
Other (b)            (Ab)             20 (ab)          (b)
Total                250 (A)          (a)              500 (N)

We know that,

◌◌ (AB) + (Ab) = (A)
◌◌ (AB) + (aB) = (B)
◌◌ (aB) + (ab) = (a)
◌◌ (Ab) + (ab) = (b)
◌◌ (A) + (a) = N
◌◌ (B) + (b) = N

From this, we can fill the missing figures as:

                     Male (A)         Female (a)       Total
Brown eyes (B)       200 (AB)         230 (aB)         430 (B)
Other (b)            50 (Ab)          20 (ab)          70 (b)
Total                250 (A)          250 (a)          500 (N)

Now, according to the formula

Q = [(AB)(ab) – (Ab)(aB)] / [(AB)(ab) + (Ab)(aB)]
  = (200 × 20 – 50 × 230) / (200 × 20 + 50 × 230)
  = (4000 – 11500) / (4000 + 11500)
  = –7500 / 15500
  ≈ –0.48

Therefore, there is a low degree of negative association between the attributes.

Example 3: In a population of 1000 individuals, 500 are young, out of which 300
tested positive for Covid-19. The rest are aged over 50, and 150 of them tested positive
for Covid-19. Calculate the coefficient of association.

Solution:

From the given information, we can prepare the following 2x2 table:

                     Young (A)        Aged (a)         Total
Positive (B)         300 (AB)         150 (aB)         (B)
Negative (b)         (Ab)             (ab)             (b)
Total                500 (A)          (a)              1000 (N)

We know that,

◌◌ (AB) + (Ab) = (A)
◌◌ (AB) + (aB) = (B)
◌◌ (aB) + (ab) = (a)
◌◌ (Ab) + (ab) = (b)
◌◌ (A) + (a) = N
◌◌ (B) + (b) = N

From this, we can fill the missing figures as:

                     Young (A)        Aged (a)         Total
Positive (B)         300 (AB)         150 (aB)         450 (B)
Negative (b)         200 (Ab)         350 (ab)         550 (b)
Total                500 (A)          500 (a)          1000 (N)

Now, according to the formula

Q = [(AB)(ab) – (Ab)(aB)] / [(AB)(ab) + (Ab)(aB)]
  = (300 × 350 – 200 × 150) / (300 × 350 + 200 × 150)
  = (105000 – 30000) / (105000 + 30000)
  = 75000 / 135000
  ≈ 0.56

Therefore, there is a moderate positive association between the attributes.

Example 4: In a society, 250 pet-owners were surveyed. 175 had dogs, out of
which 30 were adopted. 75 people owned cats, and 45 of those were purchased.
Calculate the coefficient of association.

Solution:

From the given information, we can prepare the following 2x2 table:

                     Dog (A)          Cat (a)          Total
Purchased (B)        (AB)             45 (aB)          (B)
Adopted (b)          30 (Ab)          (ab)             (b)
Total                175 (A)          75 (a)           250 (N)

We know that,

◌◌ (AB) + (Ab) = (A)
◌◌ (AB) + (aB) = (B)
◌◌ (aB) + (ab) = (a)
◌◌ (Ab) + (ab) = (b)
◌◌ (A) + (a) = N
◌◌ (B) + (b) = N

From this, we can fill the missing figures as:

                     Dog (A)          Cat (a)          Total
Purchased (B)        145 (AB)         45 (aB)          190 (B)
Adopted (b)          30 (Ab)          30 (ab)          60 (b)
Total                175 (A)          75 (a)           250 (N)

Now, according to the formula

Q = [(AB)(ab) – (Ab)(aB)] / [(AB)(ab) + (Ab)(aB)]
  = (145 × 30 – 30 × 45) / (145 × 30 + 30 × 45)
  = (4350 – 1350) / (4350 + 1350)
  = 3000 / 5700
  ≈ 0.53

Therefore, there is a moderate positive association between the attributes.

Example 5: 1000 students were surveyed, and 600 of those were high schoolers. Out
of them, 200 lived on-campus. The remaining students were pursuing higher education,
and 150 of them lived off-campus. Calculate the coefficient of association.

Solution:

From the given information, we can prepare the following 2x2 table:

                     High School (A)     Higher Education (a)     Total
On-campus (B)        200 (AB)            (aB)                     (B)
Off-campus (b)       (Ab)                150 (ab)                 (b)
Total                600 (A)             (a)                      1000 (N)

We know that,

◌◌ (AB) + (Ab) = (A)
◌◌ (AB) + (aB) = (B)
◌◌ (aB) + (ab) = (a)
◌◌ (Ab) + (ab) = (b)
◌◌ (A) + (a) = N
◌◌ (B) + (b) = N

From this, we can fill the missing figures as:

                     High School (A)     Higher Education (a)     Total
On-campus (B)        200 (AB)            250 (aB)                 450 (B)
Off-campus (b)       400 (Ab)            150 (ab)                 550 (b)
Total                600 (A)             400 (a)                  1000 (N)

Now, according to the formula

Q = [(AB)(ab) – (Ab)(aB)] / [(AB)(ab) + (Ab)(aB)]
  = (200 × 150 – 400 × 250) / (200 × 150 + 400 × 250)
  = (30000 – 100000) / (30000 + 100000)
  = –70000 / 130000
  ≈ –0.54

Therefore, there is a moderate negative association between the attributes.

Check Your Understanding
Fill in the blanks:

1. The total number of ultimate class frequencies in case of n attributes is _______.
2. (αβ) is a frequency of _______________.
3. If the coefficient of association of two attributes is 0, they are __________.
4. The value of Yule’s coefficient of association lies from 0 to 1. (True/False)


Summary
●● The number of items belonging to a class is called the frequency of that class.
●● The class-frequencies of highest order are called ultimate class-frequencies.
●● The formula for Yule’s coefficient is
Q = [(AB)(αβ) – (Aβ)(αB)] / [(AB)(αβ) + (Aβ)(αB)]
●● The formula for coefficient of colligation is
Y = [√((AB)(αβ)) – √((Aβ)(αB))] / [√((AB)(αβ)) + √((Aβ)(αB))]

Questions & Exercises
1. What is a 2x2 table? Give an example.
2. Write a note on Yule’s coefficient of association.

Activity
1. Prepare a 2x2 table on literacy and hygiene and comment on their association.

Further Readings
1. Blaikie, N. (2003). Analyzing Quantitative Data. Sage: Thousand Oaks.
2. Edwards, A. W. F. The Measure of Association in a 2x2 Table.

Glossary
●● Class Frequency: The number of items belonging to a class is called the frequency of that class.
●● Ultimate Class Frequency: The class-frequencies of highest order are called ultimate class-frequencies.

Answers to Check Your Understanding


1. 2^n
2. Second order
3. Independent
4. False

Unit - 5.3: Coefficient of Association


Objectives

At the end of this unit, you will be able to:

●● Understand the meaning of coefficient of association and see how it is different from coefficient of correlation.
●● Understand Chi-square Coefficient of Association
●● Understand Karl Pearson’s Coefficient of Association
●● Understand Tschuprow’s Coefficient of Association

Introduction

You have seen that different statistical approaches are used to examine the nature of the link between two variables. When both variables are quantitative, their relationship may be quantified using the correlation coefficient, which indicates not only the size but also the direction of the association, that is, whether the two variables are positively or negatively associated. When the data are not stated quantitatively, however, the coefficient of correlation cannot be employed; all we know is whether a certain number of items have a specific attribute or not. When the data are qualitative, the coefficient of association is used to determine the degree and direction of the relation between the attributes. The correlation coefficient measures the extent of the relationship between two quantitative variables, whereas the coefficient of association only suggests whether the association is positive or negative.

5.3.1 Chi-square Coefficient of Association


Chi-square (χ²) is an important non-parametric test and, as such, requires no strict assumptions about the type of population. We simply need the degrees of freedom (and, of course, the sample size) to use it. As a non-parametric test, chi-square can be used as a test of independence, which allows us to determine whether or not two attributes are associated. For example, if we want to know whether a new treatment is successful in reducing fever, a χ² test will help us decide. In such a case, we start from the null hypothesis that the two attributes (new medicine and fever control) are independent, implying that the new medicine is ineffective at managing fever. On this assumption, we first calculate the expected frequencies and then the value of χ².

If the computed value of χ² is smaller than the table value at a specific level of significance for the specified degrees of freedom, we retain the null hypothesis, implying that the two attributes are independent or unrelated (i.e., the new treatment is ineffective at treating the fever). However, if the calculated value of χ² is greater than the table value, we conclude that the null hypothesis does not hold, implying that the two attributes are associated and that the association is not due to chance but exists in reality (i.e., the new medicine is successful in managing the fever and may be prescribed).

It should be noted, however, that χ² is not a measure of the degree or the type of relationship between two attributes; it is a technique for judging the significance of such an association. To apply the chi-square test, whether as a test of goodness of fit or to judge the significance of association between attributes, the observed and theoretical (expected) frequencies must be grouped in the same way, and the theoretical distribution must be adjusted to give the same total frequency as the observed distribution. The value of χ² is then determined as follows:

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where

O_ij = observed frequency of the cell in the ith row and jth column.

E_ij = expected frequency of the cell in the ith row and jth column.

If the two distributions (observed and theoretical) are identical, χ² = 0. Because of sampling errors, however, χ² is rarely exactly zero, so we must know the sampling distribution of χ² in order to calculate the probability of an observed χ² arising from a random sample drawn from the hypothetical universe. Instead of calculating these probabilities, we may use a ready-made table that gives probabilities for given values of χ². The tabulated values of χ² for various degrees of freedom at a specific level of significance can be used to judge whether or not a computed value of χ² is significant. If the calculated value of χ² equals or exceeds the table value, the difference between the observed and expected frequencies is considered significant; if the table value is greater than the calculated value of χ², the difference is considered insignificant, i.e., attributable to chance, and can be ignored.

As previously stated, degrees of freedom play a crucial role in applying the chi-square distribution and the test based on it, so the degrees of freedom must be determined accurately. There are (10 – 1) = 9 degrees of freedom if there are 10 frequency classes and one independent constraint. Thus, if 'n' is the number of groups and one restriction is imposed by keeping the totals of the observed and expected frequencies equal, the d.f. is (n – 1). In the case of a contingency table (whether it has two rows and two columns, two columns and more than two rows, two rows and more than two columns, or more than two rows and more than two columns), the d.f. is calculated as follows:

d.f. = (c – 1)(r – 1)

Where

c = number of columns and r = number of rows.



Conditions for the use of χ² tests

Before using the χ² test, the following requirements must be met:

i. The observations recorded and used must be obtained at random.
ii. All of the items in the sample must be independent of one another.
iii. No group shall have fewer than ten elements. When the frequencies are less than 10, regrouping is accomplished by merging the frequencies of neighbouring groups so that the resulting frequencies exceed 10. Some statisticians consider this number to be 5, while most believe that 10 is ideal.
iv. The total number of items should also be sizable. It should generally be at least 50, regardless of how few groups there may be.
v. The constraints must be linear. Constraints that involve linear equations in the cell frequencies of a contingency table (i.e., equations containing no squares or higher powers of the frequencies) are known as linear constraints.

Illustration 1: Eight coins were tossed 256 times and the following results were obtained:

Number of heads    Frequency (F)
0                  2
1                  6
2                  30
3                  52
4                  67
5                  56
6                  32
7                  10
8                  1

Are the coins biased?

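Solution:

If the coins are unbiased, the probability of a head on each coin is 1/2, and the number of heads in a toss of eight coins follows a binomial distribution. The expected frequency of r heads in 256 tosses is therefore 256 × C(8, r) × (1/2)^8 = C(8, r), which gives 1, 8, 28, 56, 70, 56, 28, 8 and 1 for r = 0, 1, ..., 8.

Therefore, the value of chi-square can be worked out as follows:

Number of heads    Observed (O)    Expected (E)    (O – E)²/E
0                  2               1               1.000
1                  6               8               0.500
2                  30              28              0.143
3                  52              56              0.286
4                  67              70              0.129
5                  56              56              0.000
6                  32              28              0.571
7                  10              8               0.500
8                  1               1               0.000
Total              256             256             3.129

χ² = Σ (O – E)²/E ≈ 3.13
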
Degrees of freedom = (n – 1) = (9 – 1) = 8

The table value of χ² for eight degrees of freedom at the 5 per cent level of significance is 15.507. The calculated value of χ² is much less than this table value; hence it is insignificant and can be ascribed to fluctuations of sampling. The result thus supports the hypothesis, and we may say that the coins are not biased.
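
The arithmetic of Illustration 1 can be reproduced with a minimal Python sketch. It assumes that "not biased" means each coin shows a head with probability 1/2, so that the number of heads in a toss of eight coins is binomial; the table value 15.507 is the one quoted in the text, and the variable names are illustrative.

```python
from math import comb

observed = [2, 6, 30, 52, 67, 56, 32, 10, 1]   # frequencies for 0..8 heads
n_coins, n_tosses = 8, 256

# Expected frequency of r heads with fair coins: tosses * C(8, r) / 2**8
expected = [n_tosses * comb(n_coins, r) / 2 ** n_coins for r in range(n_coins + 1)]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1
table_value = 15.507   # 5% level, 8 d.f., as quoted in the text

print(expected)
print(round(chi_square, 3), degrees_of_freedom)
print("significant" if chi_square >= table_value else "not significant")
```

Because the computed statistic is far below the table value, the sketch reaches the same conclusion as the text.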

Cell Pooling
We saw in the last part that each cell frequency should be at least 5. When one or more cells in a contingency table have an expected frequency of less than 5, this criterion can be met by merging two rows or columns before calculating χ². To attain an expected frequency of 5 or higher in each cell, we must merge these cells. This is often referred to as pooling or grouping the frequencies together. However, by doing so, we reduce the number of data categories and obtain less information from the contingency table. Furthermore, pooling costs us one or more degrees of freedom.

It should be mentioned that the number of degrees of freedom is decided by the number of classes after regrouping. As a special case, a 2x2 contingency table has 1 degree of freedom. If the frequency in any cell is less than 5, we may be inclined to use the pooling approach, but this leaves 0 degrees of freedom (due to the loss of 1 d.f.), which is meaningless. When the assumption of a minimum cell frequency of 5 is not satisfied in a 2x2 contingency table, we use the Yates correction instead.

Yates Correction
Yates correction is also called the Yates correction for continuity. In a 2x2 contingency table there is 1 degree of freedom. If any of the expected cell frequencies is less than 5, the pooling approach would leave 0 degrees of freedom, owing to the loss of 1 degree of freedom in pooling, which is meaningless. Furthermore, the chi-square test is invalid if any one or more of the expected frequencies is less than 5. As a result, the Yates correction is applied whenever one or more of the expected frequencies in a 2x2 contingency table is less than 5. F. Yates, a mathematician from England, proposed this.

Assume that the four cell values a, b, c, and d of a 2x2 contingency table are arranged as follows:

a    b
c    d

The formula is as follows:

$$\chi^2 = \frac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)}$$

where N = a + b + c + d.

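
As a hedged illustration, the sketch below computes χ² for a 2x2 table with and without the continuity correction, using the standard form χ² = N(|ad – bc| – N/2)² / ((a+b)(c+d)(a+c)(b+d)). The cell counts are hypothetical, chosen so that some expected frequencies fall below 5; they do not come from the text.

```python
def chi_square_2x2(a, b, c, d):
    """Uncorrected chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def chi_square_yates(a, b, c, d):
    """Chi-square with Yates' continuity correction for the same table."""
    n = a + b + c + d
    return n * (abs(a * d - b * c) - n / 2) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: row totals 10 and 10, column totals 9 and 11,
# so two expected frequencies equal 4.5 (less than 5).
a, b, c, d = 3, 7, 6, 4
uncorrected = chi_square_2x2(a, b, c, d)
corrected = chi_square_yates(a, b, c, d)
print(round(uncorrected, 3), round(corrected, 3))
```

The corrected value is smaller than the uncorrected one, which makes the test more conservative when cell frequencies are small.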
Limit and Drawback of Chi-square Test
χ² ranges between 0 and ∞.

◌◌ The higher the value of χ², the stronger the association.
◌◌ A value of 0 indicates that the attributes are independent.

This measure has the drawback that it has no upper bound. Even when the two attributes are perfectly associated, χ² does not reach any fixed maximum value that would confirm it.

Let us understand what we have learned with a case study.

The following example demonstrates how chi-squared is used and how it is calculated. A recent poll asked 200 people at a university about their attitudes toward sharing copyrighted music. Students made up half of the respondents, while staff (administrators or faculty) made up the other half. The counts are summarised in the table below.

Attitudes toward copyright items being shared.

Group      OK     Not OK    Total
Staff      30     70        100
Student    50     50        100
Total      80     120       200

Overall, 40% of those polled (80 out of 200) said it was acceptable to share copyrighted music. This is the marginal proportion. A conditional distribution is determined for each row, one for staff and one for students. Only 30% of the staff agreed it was acceptable to share, compared to 50% of the students. Because the percentages in the rows differ, Group and Attitude are associated.

We need a standard for comparison, a point of reference, to assess the degree of association. Consider how the table would look if there were no association. To figure this out, copy the marginal totals from the table, but not the counts inside the table, as illustrated below.

If the variables aren't associated, what happens in these cells?

Group      OK     Not OK    Total
Staff      ?      ?         100
Student    ?      ?         100
Total      80     120       200

Staff make up half of the respondents, while students make up the other half. If Group and Attitude were unrelated, one-half of the cases in each column would be staff and the other half would be students.

Expected table if Group and Attitude are not associated.

Group      OK     Not OK    Total
Staff      40     60        100
Student    40     60        100
Total      80     120       200

The expected counts are the artificial counts that we would expect to see in the cells if the variables were unrelated. Chi-squared measures the difference between the counts in the real table and those in the table of expected counts. To compute chi-squared, start by subtracting the expected counts from the observed counts. The table shows the differences.

Deviations of the original counts from the expected counts.

Group      OK                Not OK
Staff      30 – 40 = –10     70 – 60 = 10
Student    50 – 40 = 10      50 – 60 = –10

Next, we combine the differences. If we simply add them, we get zero because the negative and positive values cancel. We avoid this in the usual way: square the differences before adding them. Chi-squared also requires another step before we add them.

Chi-squared assigns some of these differences a larger contribution to the total. Look at the differences in the first row. Both are 10 in absolute size, but the difference in the left column is larger relative to the expected count than the difference in the right column (10 out of 40 compared to 10 out of 60). Rather than treat these the same, chi-squared assigns more weight to the first. Expecting 40 and finding 30 is a larger proportional error than expecting 60 and finding 70. To give more weight to larger proportional deviations, chi-squared divides the squared deviations by the expected counts.

The chi-squared statistic is the sum of these weighted, squared differences. In this example, chi-squared, denoted in formulas as χ², is

$$\chi^2 = \frac{(30-40)^2}{40} + \frac{(70-60)^2}{60} + \frac{(50-40)^2}{40} + \frac{(50-60)^2}{60} = 2.5 + 1.67 + 2.5 + 1.67 \approx 8.33$$

We usually compute chi-squared using software, but it is helpful to understand what the programme is doing in order to understand how it assesses association. To recap, the steps for computing chi-squared are as follows:

◌◌ Create a table with the same margins as the original table, but fill in the counts expected if no association exists.
◌◌ Subtract the expected counts from the actual counts in the original table, then square the differences.
◌◌ Divide each squared difference by the expected count.
◌◌ Add all of the weighted, squared deviations together.

If the rows and columns are swapped, chi-squared has the same value. It makes no difference which variable is used to define the rows and which is used to define the columns.
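
The four steps can be sketched in a few lines of Python. The counts used (30 and 70 for staff, 50 and 50 for students) are derived from the percentages quoted in the case study; the function name is illustrative, not part of any library.

```python
def chi_squared(table):
    """Chi-squared for a two-way table of observed counts, given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # step 1: count under no association
            difference = observed - expected               # step 2: subtract ...
            total += difference ** 2 / expected            # ... square, divide (step 3), add (step 4)
    return total

observed = [[30, 70],    # staff: OK, not OK
            [50, 50]]    # students: OK, not OK
transposed = [list(col) for col in zip(*observed)]

print(round(chi_squared(observed), 2))     # 8.33
print(round(chi_squared(transposed), 2))   # 8.33, rows and columns swapped
```

Swapping rows and columns leaves the value unchanged, as stated above.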

5.3.2 Karl Pearson’s Coefficient of Association


To correct the shortcoming of the chi-square measure of association, Karl Pearson devised a new measure of association, which is:

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}}$$

where

C = Coefficient of contingency,

χ² = Chi-square value, which is

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

N = number of items

This is called the coefficient of mean square contingency. It is said to be an effective way to examine association in contingency tables.

However, it has a drawback: it never reaches its upper limit, i.e., even if both attributes are perfectly associated, its value will always be less than one.

This measure's value ranges from 0 to 1.

◌◌ If the value is high, there exists a high degree of association between the two attributes.
◌◌ If the value is low, there exists a low degree of association between the two attributes.
◌◌ If the value is 0, the two attributes are independent.

This can also be elaborated as:

+.70 or higher    Very strong positive relationship
+.40 to +.69      Strong positive relationship
+.30 to +.39      Moderate positive relationship
+.20 to +.29      Weak positive relationship
+.01 to +.19      No or negligible relationship
0                 No relationship
-.01 to -.19      No or negligible relationship
-.20 to -.29      Weak negative relationship
-.30 to -.39      Moderate negative relationship
-.40 to -.69      Strong negative relationship
-.70 or lower     Very strong negative relationship

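Let us understand this with the example we used in unit 5.3.1.

Attitudes toward copyright items being shared.

Group      OK     Not OK    Total
Staff      30     70        100
Student    50     50        100
Total      80     120       200

In this example, we calculated chi-square to be 8.33. The value of N is 200.

To calculate Pearson’s C,

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}} = \sqrt{\frac{8.33}{8.33 + 200}} \approx 0.20$$

Therefore, the association between the attributes is weak, bordering on negligible.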
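
A minimal sketch of this calculation, assuming the usual form of Pearson's coefficient of contingency, C = sqrt(χ² / (χ² + N)); the function name is illustrative.

```python
from math import sqrt

def pearson_c(chi_square, n_items):
    """Pearson's coefficient of contingency from a chi-square value and N."""
    return sqrt(chi_square / (chi_square + n_items))

c = pearson_c(8.33, 200)   # values from the copyright example
print(round(c, 3))
```

With χ² = 8.33 and N = 200 the coefficient comes out at about 0.2, at the lower edge of the "weak" band in the scale above.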

Example 2:

Attribute    B1     B2     B3     TOTAL
A1           30     50     20     100
A2           20     30     10     60
A3           10     20     10     40
TOTAL        60     100    40     200

Calculate Pearson’s coefficient of contingency.

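Solution:

To calculate the coefficient of contingency, we first calculate the value of chi-squared.

Expected Frequency:

Each expected frequency is (row total × column total) / N.

Attribute    B1     B2     B3     TOTAL
A1           30     50     20     100
A2           18     30     12     60
A3           12     20     8      40
TOTAL        60     100    40     200

χ² = 1.388

Now, for Pearson’s C

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}} = \sqrt{\frac{1.388}{1.388 + 200}} \approx 0.083$$

The association between the attributes is therefore negligible.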
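
The whole of Example 2 can be reproduced with a short Python sketch that works for any r×c table. It assumes expected frequencies of (row total × column total) / N and Pearson's C = sqrt(χ² / (χ² + N)); the helper names are illustrative.

```python
from math import sqrt

def expected_frequencies(table):
    """Expected cell frequencies under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum of (O - E)^2 / E over all cells of the table."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

table = [[30, 50, 20],
         [20, 30, 10],
         [10, 20, 10]]
n = sum(sum(row) for row in table)
chi2 = chi_square(table)
c = sqrt(chi2 / (chi2 + n))

print(expected_frequencies(table))
print(round(chi2, 3), round(c, 3))
```

Only the cells in rows A2 and A3 that differ from their expected values contribute to χ², which is why the statistic is so small here.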

5.3.3 Tschuprow’s Coefficient of Association


In order to avoid the disadvantage of Karl Pearson's measure of association, Tschuprow proposed another measure, given by the formula:

$$T = \sqrt{\frac{\chi^2}{N\sqrt{(m-1)(n-1)}}}$$

Where, N = number of items

m = number of rows

n = number of columns

This measure's value ranges from 0 to 1.

◌◌ If the value is high, there exists a high degree of association between the two attributes.
◌◌ If the value is low, there exists a low degree of association between the two attributes.
◌◌ If the value is 0, the two attributes are independent.
◌◌ If the value is 1, the two attributes are perfectly associated.
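
A hedged sketch of Tschuprow's coefficient, assuming the usual form T = sqrt(χ² / (N · sqrt((m – 1)(n – 1)))). It is applied to the 3x3 table of Example 2 in 5.3.2 (χ² = 1.388, N = 200) and to a hypothetical perfectly associated 3x3 table, for which χ² reaches N(m – 1); the function name is illustrative.

```python
from math import sqrt

def tschuprow_t(chi_square, n_items, n_rows, n_cols):
    """Tschuprow's coefficient of association for an m x n contingency table."""
    return sqrt(chi_square / (n_items * sqrt((n_rows - 1) * (n_cols - 1))))

t_example = tschuprow_t(1.388, 200, 3, 3)

# In a square table with perfect association, chi-square equals N * (m - 1),
# so the coefficient reaches its upper limit of 1.
t_perfect = tschuprow_t(200 * (3 - 1), 200, 3, 3)

print(round(t_example, 3), t_perfect)
```

Unlike Pearson's C, this coefficient can attain 1, although only when the table is square.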

Example 1:

Attribute    B1     B2     B3     TOTAL
A1           23     9      6      38
A2           21     4      3      28
A3           34     24     17     75
TOTAL        78     37     26     141

Calculate Tschuprow’s coefficient of contingency.

Solution:
To calculate the coefficient of contingency, we first calculate the value of chi-squared.

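Expected Frequency:

Each expected frequency is (row total × column total) / N.

Attribute    B1        B2        B3        TOTAL
A1           21.02     9.97      7.01      38
A2           15.49     7.35      5.16      28
A3           41.49     19.68     13.83     75
TOTAL        78        37        26        141

χ² = Σ (O – E)²/E ≈ 7.84

Now, for Tschuprow’s coefficient of contingency,

$$T = \sqrt{\frac{\chi^2}{N\sqrt{(m-1)(n-1)}}} = \sqrt{\frac{7.84}{141\sqrt{(3-1)(3-1)}}} = \sqrt{\frac{7.84}{282}} \approx 0.17$$
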
Check Your Understanding
Fill in the blanks:

1. Coefficient of correlation measures the relationship between ____________________ characteristics whereas the coefficient of association measures the relationship between _____________ characteristics.
2. The value of chi-square lies between __________.
3. The value of coefficient of contingency lies between ___________.
4. The value of Tschuprow’s coefficient of association lies between ___________.

Summary
●● When the data is qualitative, coefficient of association is used to determine the degree and direction of relation between the attributes.
●● Correlation coefficient measures the extent of relationship between two quantitative variables, whereas coefficient of association only suggests whether the association is positive or negative.
●● The formula for chi-square test is

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

●● When utilising contingency tables to run a chi-square test, it is expected that all cell frequencies are at least 5. If this assumption is not satisfied, we can utilise the pooling approach, but there will be a loss of information. In a 2x2 contingency table, if one or more cell frequencies are less than 5, the chi-square value should be computed using the Yates correction.
●● The formula for Karl Pearson’s coefficient of mean square contingency is

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}}$$

●● The formula for Tschuprow’s coefficient of association is

$$T = \sqrt{\frac{\chi^2}{N\sqrt{(m-1)(n-1)}}}$$

●● If the value is high, there exists a high degree of association between the two attributes. If the value is low, there exists a low degree of association between the two attributes. If the value is 0, the two attributes are independent. If the value is 1, the two attributes are perfectly associated.

Questions & Exercises

1. What is the chi-square test? What are its limitations?
2. Write a note on Coefficient of contingency.
3. Explain Tschuprow’s coefficient of association.

Activity

1. Make a table highlighting all the formulas we have learned in this module thus far. Also state the drawbacks of the formulas wherever taught.

Further Readings

1. Liebetrau, A. (1983). Measures of Association (Quantitative Applications in the Social Sciences). Sage Publications.
2. Tschuprow, A. A. (1939). Principles of the Mathematical Theory of Correlation; translated by M. Kantorowitsch. W. Hodge & Co.

Glossary
●● Chi-Square Distribution: A kind of probability distribution used for test of independence.
●● Expected frequency: Frequencies expected under certain assumptions.
●● Observed frequency: Actually recorded frequency.
●● Yates Correction: The Yates correction is done whenever any one or more of the expected frequencies in a 2x2 contingency table is less than 5.

Answers to Check your Understanding

1. quantitative, qualitative
2. 0 to ∞
3. 0 to 1
4. 0 to 1

Unit - 5.4: Contingency Tables with Ordered Categories

Objectives
At the end of this unit, you will be able to:

●● Describe the concept of contingency table for manifold classification
●● Understand Chi-square and coefficient of contingency

Introduction
Data classification might be dichotomous or manifold, as we've discussed. A classification is dichotomous if the attribute has only two classes, and manifold if it has more than two. The criterion 'location,' for example, may be separated into two categories: large city and small town. The 'nature of occupancy' characteristic is likewise separated into two categories: 'owner occupied' and 'rented to private parties.' Each of these classifications is dichotomous. Let us say we have a total of N observations that have been categorised using both criteria. For instance, we might take a random sample of 250 buildings and classify them according to their "location" and "nature of occupancy," as shown in the table below:

[Table: 250 buildings classified by location and nature of occupancy]

We have two criteria for classification here: location (two categories) and nature of occupancy (two categories). A two-way table like this is called a contingency table. The table above is a two-by-two contingency table, with two categories for each characteristic; its two rows and two columns give four different cells. The goal of creating such a table, as we covered in the previous segment, is to investigate the relationship between two attributes, i.e. whether the two characteristics appear to exist independently of one another or whether there is any association between them. In the example above, we are trying to find out whether the two characteristics, location and nature of occupancy, are independent.

In reality, an attribute might be divided into a number of classes instead of two. This form of classification is called manifold classification. Extremely tall, tall, medium, short, and very short are some examples of height classes. In this unit, we will look at manifold classification, the contingency table, and a method for determining the strength of association between two attributes that are each divided into a number of categories. The major emphasis of this unit is the computation of chi-square and the coefficient of contingency, which are used to quantify the degree of association between two attributes.

5.4.1 Contingency Tables with Ordered Categories

Categorical Data

A categorical variable has responses that are made up of a series of categories rather than numbers that represent the amount or quantity of something on a continuous scale. For example, a person's gender may be described as "male" or "female," while a machine part can be categorised as "acceptable" or "defective." A variable may also have more than two categories; for example, a person's relationship status might be "Single," "In a Relationship," "Married," or "Other."

Categorical variables can be formed by classifying a continuous or discrete
variable, or they might be naturally categorical with no numeric scale underlying their
measurement (such as relationship status).

Blood pressure, for example, is a measurement of the force exerted on the walls of the blood vessels, expressed in millimetres of mercury (mmHg). Blood pressure is normally measured in precise units, such as 120/80 mmHg, but it is also classified into low, normal, prehypertensive, and hypertensive categories. The number of children in a home is an example of a discrete variable: while the data may be gathered as the precise number of children, it can also be examined in categories like "0 children," "1–2 children," and "3 or more children."

Although the usefulness of grouping continuous or discrete data into categories is often challenged (some researchers refer to it as "throwing away information" because all information concerning variation within the categories is discarded), it is standard practice in many professions. Categorising is done for a variety of reasons, ranging from convention (for example, if certain categorisations have become widely recognised in your field) to resolving distribution issues with a specific data set.

Ordinal variables are those assessed on a scale on which the categories can be ranked in order, but without the assumption that the distance between each pair of adjacent categories is equal. Categorical data methods can be applied to ordinal variables. A well-known example of an ordinal variable is the Likert scale, in which individuals select replies to questions from a series of ordered categories (such as Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree). For ordinal data, there is a separate set of analytic tools that takes advantage of the fact that ordered categories were employed, which is also described in this chapter. When analysing ordinal data, these ordinal procedures are favoured over general categorical techniques because they are more powerful.

RxC Table
When analysing the relationship between two categorical variables, an RxC table, also known as a contingency table, is frequently used to depict their distribution in the data set. The R in RxC stands for row, and the C stands for column: a table is described by its number of rows and columns. This is also the order in which rows and columns are mentioned in describing matrices and in subscript notation. Tables of larger dimensions are sometimes distinguished from 2x2 tables, which show the joint distribution of two binary variables. This distinction is unnecessary, since a 2x2 table may be thought of as an RxC table with both R and C equal to 2.

The phrase "RxC" is read as "R by C," and the same convention holds for specific table sizes, such as "2 by 2."

Let's say we want to examine the association between broad age groups and health, as measured by the well-known five-point general health scale. We decide on the age categories to use and gather data from a sample of people, classifying each person by age (using our predetermined categories) and health condition (using the five-point scale). The information is then displayed in a contingency table, such as the one below.

[Table: age group (four rows) by general health status (five columns)]

Because it has four rows and five columns, this table is referred to as a 4x5 table. Each cell would contain the number of persons who share the stated pair of characteristics: the number of people under the age of 18 who were in excellent health, the number of people aged 18–39 who were in very good health, and so on.

We already know that manifold classification occurs when an attribute is divided into more than two classes. Rather than splitting the population into two parts, heavy and light, we may subdivide it into a larger number of classes, such as very heavy, heavy, normal, light, and very light. Both attributes of the universe may be subdivided in this way.

As a result, attribute A may be separated into several groups: A1, A2,..., Ar. The attribute B can likewise be broken into B1, B2,..., Bs. When the observations are classified by two attributes and arranged in a table, the display is termed a contingency table.

This table may be 3x3, 4x4, or any other size. In the 3x3 table, both attributes A and B have three subcategories. Similarly, in the 4x4 table each of the attributes A and B is broken into four parts: A1, A2, A3, A4 and B1, B2, B3, B4.

It is also possible for the two attributes to have different numbers of classes. A 3x4 contingency table is created by dividing attribute A into three parts and attribute B into four parts. Contingency tables of 3x5, 4x3, and so on are also possible. It is worth noting that if one attribute has two classes and the other has more than two classes, the classification is still manifold.

As a result, we may build contingency tables of 2x3, 2x4, and so on.

We'll focus on two attributes, A and B, where A is separated into r classes, A1, A2,..., Ar, and B is separated into s classes, B1, B2,..., Bs. The r×s contingency table can be used to express the various cell frequencies, where (Ai) is the number of people who have the attribute Ai (i = 1, 2,..., r), (Bj) is the number of people who have the attribute Bj (j = 1, 2,..., s), and (AiBj) is the number of people who have both Ai and Bj (i = 1, 2,..., r; j = 1, 2,..., s). We also have

$$\sum_{i=1}^{r} (A_i) = \sum_{j=1}^{s} (B_j) = N$$

where N is the total frequency.

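Following is an example of an r×s contingency table:

         A1        A2        ...    Ar        Total
B1       (A1B1)    (A2B1)    ...    (ArB1)    (B1)
B2       (A1B2)    (A2B2)    ...    (ArB2)    (B2)
...      ...       ...       ...    ...       ...
Bs       (A1Bs)    (A2Bs)    ...    (ArBs)    (Bs)
Total    (A1)      (A2)      ...    (Ar)      N

In the above table, the totals of columns A1, A2, etc. and the totals of rows B1, B2, etc. are first order frequencies, and the frequencies of the various cells are second order frequencies.

The total of either (A1), (A2), etc. or (B1), (B2), etc. gives the grand total N.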

In the table

◌◌ (A1) = (A1B1) + (A1B2) + … + (A1Bs)
◌◌ (A2) = (A2B1) + (A2B2) + … + (A2Bs), etc.

Similarly

◌◌ (B1) = (A1B1) + (A2B1) + … + (ArB1)
◌◌ (B2) = (A1B2) + (A2B2) + … + (ArB2), etc.

And

◌◌ N = (A1) + (A2) + … + (Ar) or
◌◌ N = (B1) + (B2) + … + (Bs)

In the following section, you will learn how to find the degree of association between attributes in an r × s contingency table.
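
These identities can be checked numerically. The sketch below uses the cell frequencies of the 3x3 table from Example 1 in 5.3.3 as data; the layout `cells[i][j]` for the frequency (AiBj) and the variable names are assumptions made for the illustration.

```python
# cells[i][j] holds the second order frequency (A_{i+1} B_{j+1})
cells = [[23, 9, 6],
         [21, 4, 3],
         [34, 24, 17]]

# First order frequencies: (Ai) sums over the B classes, (Bj) sums over the A classes
A_totals = [sum(row) for row in cells]
B_totals = [sum(col) for col in zip(*cells)]

# Both sets of first order frequencies add up to the grand total N
N = sum(A_totals)
print(A_totals, B_totals, N, N == sum(B_totals))
```

The totals agree with the margins printed in that example, and both sets of first order frequencies sum to the same N.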

Chi-square and Coefficient of Contingency


The computation of the coefficient of contingency requires both the actual (observed) and the theoretical (expected) frequencies. A table of expected frequencies must therefore be constructed before determining the coefficient of contingency. The expected frequencies are computed and inserted into the table of expected frequencies in the following way:

$$E(A_iB_j) = \frac{(A_i)(B_j)}{N}$$

For understanding, a 3×3 contingency table for computing expected frequencies is displayed below. The construction for an r×s contingency table is similar.

         A1            A2            A3            Total
B1       (A1)(B1)/N    (A2)(B1)/N    (A3)(B1)/N    (B1)
B2       (A1)(B2)/N    (A2)(B2)/N    (A3)(B2)/N    (B2)
B3       (A1)(B3)/N    (A2)(B3)/N    (A3)(B3)/N    (B3)
Total    (A1)          (A2)          (A3)          N

If A and B are fully independent of one another, then the actual values of (A1B1), (A2B2), and so on must be identical to their corresponding expected values, i.e. (A1B1) = (A1)(B1)/N, (A2B2) = (A2)(B2)/N, and so on. We may claim that A and B are totally independent if the observed frequency of each cell in a contingency table equals the expected frequency of the same cell.

If the observed and expected values differ in any cell, there is an association between
the two attributes A and B. To determine the amount of association, the difference between
the observed and expected frequencies is computed for each cell. The value of Chi-square,
abbreviated as χ², is computed from these differences.
Therefore,

χ² = Σ [(O − E)² / E]

where,

O - is the observed frequency of a class, and

E - is the expected frequency of that class.

χ² is also known as “Square contingency”. If the mean of χ² is calculated, it is
called “Mean Square Contingency”, which is denoted by φ² (pronounced as phi-square).

Thus, Square contingency = χ²

Mean square contingency = φ² = χ² / N
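As a small illustration, the square contingency and the mean square contingency can be
computed directly from a table of observed frequencies. The sketch below assumes the usual
definitions χ² = Σ (O − E)²/E and φ² = χ²/N, and assumes that no row or column total is
zero. The function name `chi_square` and the 2×2 table are hypothetical.

```python
def chi_square(observed):
    """Return (chi-square, N) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency of the cell
            chi2 += (o - e) ** 2 / e                # contribution of the cell
    return chi2, n


# Hypothetical 2x2 table, for illustration only
observed = [
    [30, 20],
    [20, 30],
]
chi2, n = chi_square(observed)
phi2 = chi2 / n   # mean square contingency
print(chi2, phi2)
```

Here every expected frequency is 25 and every cell differs from it by 5, so each of the
four cells contributes 25/25 = 1 to χ².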

As far as the limits of χ² and φ² are concerned, both are sums of squares divided by
positive expected frequencies, and hence they cannot take negative values. The minimum
value of χ² and φ² is 0. This happens when the numerator in the expression for χ² is 0,
i.e. when the observed and expected frequencies are equal in all the cells of the
contingency table, which is the case when the attributes A and B are completely
independent. The upper limits of χ² and φ² vary from case to case, so no fixed upper limit
can be assigned to them. For this reason, they are not suitable on their own for studying
the strength of association in contingency tables.

Karl Pearson has given the following formula for the calculation of the “Coefficient of
Mean Square Contingency”:

C = √(χ² / (N + χ²))

Since χ² = Nφ², we get,

C = √(φ² / (1 + φ²))

The aforementioned coefficient has the disadvantage that it never reaches the upper
limit of 1. The value of C lies between 0 and 1, but it never attains unity; it would
approach 1 only with an infinite number of classes.

Its maximum value is determined by the values of r and s, i.e. the number of rows and
columns.

In an r×r table (i.e. 2×2, 3×3, 4×4, etc.) the maximum value of C = √((r − 1) / r).

Thus, in a 2×2 table the maximum value of C = √(1/2) = 0.707.

In a 3×3 table it is 0.816 and in a 4×4 table it is 0.866.

From the explanation, we can see that the maximum value of C depends upon
how the data are classified. Therefore, coefficients calculated from different types of
classification are not comparable.
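The dependence of the maximum on the table size can be seen with a short Python sketch. It
assumes Pearson's coefficient C = √(χ²/(N + χ²)) and, for an r×r table, the maximum
√((r − 1)/r). The function names are hypothetical, and the ratio of C to its maximum is
shown only as one possible way of putting tables of different sizes on a common footing; it
is not a method prescribed in this unit.

```python
import math


def contingency_coefficient(chi2, n):
    """Pearson's coefficient of mean square contingency."""
    return math.sqrt(chi2 / (n + chi2))


def max_contingency_coefficient(r):
    """Upper bound of C for an r x r table."""
    return math.sqrt((r - 1) / r)


# Maximum attainable C for square tables of increasing size
for r in (2, 3, 4):
    print(r, round(max_contingency_coefficient(r), 3))

# Hypothetical chi-square value and sample size, for illustration only
c = contingency_coefficient(chi2=20.0, n=180)
print(round(c, 3), round(c / max_contingency_coefficient(3), 3))
```

The loop reproduces the maxima 0.707, 0.816 and 0.866 quoted for 2×2, 3×3 and 4×4 tables,
which is why the same value of C means different things in tables of different sizes.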

Example 1: From the following table, comment on the association of characteristics of
brothers and sisters.

                          Sisters
Brothers      Introvert   Ambivert   Extrovert   Total
Introvert        850         571        580       2001
Ambivert         618         593        455       1666
Extrovert        540         456        457       1453
Total           2008        1620       1492       5120

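Solution:

Taking the brothers' classes as A1, A2, A3 (rows) and the sisters' classes as B1, B2, B3
(columns), the expected frequency of the first cell is

E(A1B1) = (A1)(B1) / N = (2001 × 2008) / 5120 ≈ 785

Similarly, the expected frequencies of all the other cells are calculated and the
following table is prepared,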

Class       Observed         Expected         (O−E)²    (O−E)²/E
            Frequency (O)    Frequency (E)
(A1B1)          850              785            4225       5.38
(A1B2)          571              633            3844       6.07
(A1B3)          580              583               9       0.02
(A2B1)          618              653            1225       1.88
(A2B2)          593              527            4356       8.27
(A2B3)          455              486             961       1.98
(A3B1)          540              570             900       1.58
(A3B2)          456              460              16       0.03
(A3B3)          457              423            1156       2.73
Total          5120             5120                      27.94

χ² = 27.94

And, for the coefficient of contingency,

C = √(χ² / (N + χ²)) = √(27.94 / (5120 + 27.94)) ≈ 0.0737

The strength of association can be judged by comparing the calculated value of C with
its theoretical maximum. We have seen that the maximum value of C in a 3×3 table (the one
in the question) is √((r − 1) / r), where r denotes the number of rows or columns.

Hence,

Maximum value of C = √((3 − 1) / 3) = √(2/3) = 0.816

If we compare the calculated C (0.0737) with its maximum value, i.e. 0.816, we find that
there is a very weak association between the characteristics of brothers and sisters.
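Example 1 can be re-checked with a short Python sketch. It assumes the definitions
χ² = Σ (O − E)²/E and C = √(χ²/(N + χ²)); the function name `contingency_analysis` is
hypothetical. The expected frequencies are kept unrounded here, so the computed χ² differs
slightly from a hand calculation that rounds each expected frequency to a whole number.

```python
import math


def contingency_analysis(observed):
    """Return (chi-square, C) for a table of observed frequencies."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = sum(
        (o - r * c / n) ** 2 / (r * c / n)      # (O - E)^2 / E with E = r * c / N
        for row, r in zip(observed, row_totals)
        for o, c in zip(row, col_totals)
    )
    return chi2, math.sqrt(chi2 / (n + chi2))


# Observed frequencies from Example 1 (rows: brothers, columns: sisters)
observed = [
    [850, 571, 580],
    [618, 593, 455],
    [540, 456, 457],
]
chi2, c = contingency_analysis(observed)
c_max = math.sqrt((3 - 1) / 3)   # maximum of C for a 3x3 table

print(f"chi-square = {chi2:.2f}")
print(f"C = {c:.4f}, maximum = {c_max:.3f}, ratio = {c / c_max:.3f}")
```

The ratio of C to its maximum is small, which agrees with the conclusion that the
association is very weak.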

Example 2: The following table contains a set of data in which 282 individuals
with cancer have been doubly classified according to the type and stage of cancer.
The three types were as follows: A, lung cancer; B, mouth cancer; C, skin cancer. The
stages were: I, II, and III. Calculate the coefficient of contingency and interpret the
result.

                      Type
Stage         A       B       C     Total
I            46      18      12       76
II           42       8       6       56
III          68      48      34      150
Total       156      74      52      282

Solution:

The expected frequency of the first cell (stage I, type A) is

E = (76 × 156) / 282 ≈ 42

Similarly, the expected frequencies of all the other cells are calculated and the
following table is prepared,

Cell        Observed         Expected         (O−E)²    (O−E)²/E
            Frequency (O)    Frequency (E)
I, A             46               42              16       0.38
I, B             18               20               4       0.20
I, C             12               14               4       0.29
II, A            42               31             121       3.90
II, B             8               15              49       3.27
II, C             6               10              16       1.60
III, A           68               83             225       2.71
III, B           48               39              81       2.08
III, C           34               28              36       1.29
Total           282              282                      15.72

χ² = 15.72

And, for the coefficient of contingency,

C = √(χ² / (N + χ²)) = √(15.72 / (282 + 15.72)) ≈ 0.23

The strength of association can be judged by comparing the calculated value of C with
its theoretical maximum, as we studied earlier. We have seen that the maximum value of C
in a 3×3 table (the one in the question) is √((r − 1) / r), where r denotes the number of
rows or columns.

Hence,

Maximum value of C = √((3 − 1) / 3) = √(2/3) = 0.816

If we compare the calculated C (0.23) with its maximum value, i.e. 0.816, we find that
there is a weak association between the stages and types of cancer.
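The cell-by-cell working of Example 2 can be reproduced with the Python sketch below. It
assumes the same definitions as before (E = row total × column total / N,
χ² = Σ (O − E)²/E, C = √(χ²/(N + χ²))) and keeps the expected frequencies unrounded, so
individual entries and the total can differ from a hand calculation that uses whole-number
expected frequencies. The variable names are illustrative only.

```python
import math

stages = ["I", "II", "III"]
types = ["A", "B", "C"]

# Observed frequencies from Example 2 (rows: stages, columns: types)
observed = [
    [46, 18, 12],
    [42, 8, 6],
    [68, 48, 34],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for stage, row, r_tot in zip(stages, observed, row_totals):
    for typ, o, c_tot in zip(types, row, col_totals):
        e = r_tot * c_tot / n                 # expected frequency of the cell
        contribution = (o - e) ** 2 / e       # the cell's share of chi-square
        chi2 += contribution
        print(f"{stage:>3}, {typ}  O={o:3d}  E={e:6.2f}  (O-E)^2/E={contribution:5.2f}")

c = math.sqrt(chi2 / (n + chi2))
c_max = math.sqrt((len(stages) - 1) / len(stages))
print(f"chi-square = {chi2:.2f}, C = {c:.2f}, maximum C = {c_max:.3f}")
```

Printing each cell's contribution makes it easy to see which cells drive the association;
here the stage II cells account for a large share of χ².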

Let us understand what we have learned with a Case Study:

A random sample of persons may be surveyed about their smoking habits and lung
cancer diagnoses. Each of these variables is dichotomous: a person presently smokes
or does not, and has or does not have a lung cancer diagnosis. The table below will be
our frequency table.

                 Lung cancer    No lung cancer
Smoker               60              300
Non-smoker           10              390

Looking at the frequencies alone, it appears that there is a link between smoking and
lung cancer: about 16.7% of smokers (60 of 360) have been diagnosed with lung cancer,
compared to only 2.5% of non-smokers (10 of 400). Because appearances might be misleading,
we'll use the chi-square test of independence. Our hypotheses will be as follows:

H0: The presence of a smoking habit and the presence of lung cancer are
independent (unrelated).

HA: The presence of a smoking habit and the presence of lung cancer are not
independent (related).

Although most chi-square tests are done on a computer, especially for bigger
tables, it is useful to go through the calculations for a basic example by hand. The
chi-square test is based on the difference between the observed and expected values in
each of the 2×2 table's cells. The observed values are just what you found (observed)
in your sample or data set, whereas the expected values are what you would have
expected to find if the two variables were unrelated. This formula is used to compute
the expected value for a given cell:

Eij = (total of row i × total of column j) / N

where Eij denotes the expected value for cell ij, i and j denote the cell's row and
column, and N is the sample size. It's worth understanding the subscript notation because
it's used throughout statistics. The following table is a 2×2 table.

            Column 1    Column 2
Row 1       cell11      cell12
Row 2       cell21      cell22

To our example, the table below adds row and column totals.

                 Lung cancer    No lung cancer    Total
Smoker               60              300           360
Non-smoker           10              390           400
Total                70              690           760

Cell11 has a frequency of 60, cell12 has a frequency of 300, the total for row 1 is 360,
and the total for column 1 is 70. Because they are on the table's margin, the row and
column totals are termed marginals. The marginal frequency for lung cancer diagnosis in
this table is 70; it reflects the frequency of one variable in the study without regard to
its relationship with the other variable. Joint frequencies are the numbers inside the
table (60, 300, 10, and 390 in this example) that represent the number of cases with
specified values on both variables. In this table, for example, the joint frequency for
smokers with a lung cancer diagnosis is 60.

If the two variables are unrelated, we would expect the frequency of each cell to be the
product of its marginals divided by the sample size. To put it another way, we would
expect the joint frequencies to be determined by the marginal distributions alone. If
smoking and lung cancer were unrelated, we'd expect the number of persons who smoke and
have lung cancer to be determined only by the number of smokers and the number of lung
cancer patients in the sample. By this argument, if smoking is not linked to the
development of lung cancer, the proportion with lung cancer should be roughly the same in
smokers and non-smokers.

We can compute the expected values for each cell using the formula above:

E11 = (360 × 70) / 760 ≈ 33.16
E12 = (360 × 690) / 760 ≈ 326.84
E21 = (400 × 70) / 760 ≈ 36.84
E22 = (400 × 690) / 760 ≈ 363.16

The observed values (60, 300, 10 and 390) clearly differ from these expected values. We
need a mechanism to identify whether the differences are due to chance or reflect a
meaningful outcome. The chi-square test can be used to make this judgement.
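The hand calculation for the case study can be sketched in Python as follows. Two points
are assumptions that go beyond the text above: a 2×2 table is taken to have
(2 − 1)(2 − 1) = 1 degree of freedom, and for 1 degree of freedom the upper-tail
probability of chi-square is computed as erfc(√(χ²/2)) using only the standard library. No
continuity correction is applied, and the variable names are illustrative.

```python
import math

# Observed joint frequencies (rows: smoker, non-smoker; columns: cancer, no cancer)
observed = [
    [60, 300],
    [10, 390],
]

row_totals = [sum(row) for row in observed]          # marginals 360 and 400
col_totals = [sum(col) for col in zip(*observed)]    # marginals 70 and 690
n = sum(row_totals)                                  # sample size 760

# Expected value of each cell: row marginal x column marginal / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Assumption: 1 degree of freedom, so P(chi-square > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print([[round(e, 2) for e in row] for row in expected])
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.2e}")
```

A very small p-value indicates that differences this large between the observed and
expected frequencies would be unlikely if smoking and lung cancer were independent.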

Check Your Understanding


Fill in the blanks:

1. Contingency tables with manifold classifications are represented by __________


2. Maximum value of C in case of 4x4 table is _________

Summary

●● A contingency table is a table of joint frequencies of occurrence of two variables
classified into categories. In a 2×2 contingency table, each attribute is divided into
two categories. Similarly, a 3×2 table has three categories of one attribute and
two of the other attribute. In general, an r×s table has r categories of one attribute
and s categories of the other attribute. A 2×2 table is an example of dichotomous
classification, whereas 3×3 or r×s contingency tables are examples of manifold
classification.

●● Association between attributes is found by computing χ².

●● Karl Pearson has given the following formula for the calculation of the Coefficient of
Mean Square Contingency: C = √(χ² / (N + χ²)).

●● The value of C lies between 0 and 1 but it never attains the upper limit. For an r×r
table, the maximum value of C is calculated as √((r − 1) / r).

●● A value of C near its maximum represents strong association between two attributes and
a value near 0 shows weak association.

Questions & Exercises

1. What are contingency tables? State their types.
2. Write a note on the coefficient of contingency.

Activity
1. Gather the data available on types of brain tumours and their sites. Prepare a
contingency table and calculate the coefficient of contingency.
Further Readings
1. C.R. Kothari, Research Methodology: Methods and Techniques (2nd Revised Edition).
2. Karl Pearson, F.R.S. (1904). Mathematical contributions to the theory of evolution.
Dulau and Co.

Glossary

●● Contingency table: A two-way table in which the columns are classified according to
one criterion or attribute and the rows are classified according to the other criterion
or attribute.

●● Expected frequency: The frequency expected in a cell under certain assumptions (here,
the independence of the two attributes).

●● Observed frequency: The frequency actually recorded.

Answers to Check Your Understanding


1. r×s
2. 0.866