MCOM Statistical Analysis Notes


Geektonight Notes

Research Methodology and Statistical Analysis


Unit-1 Introduction to Business Research

Unit-2 Research Plan

Unit-3 Collection of Data

Unit-4 Sample

Unit-5 Measurement and Scaling Techniques

Unit-6 Processing of Data

Unit-7 Diagrammatic and Graphic Presentation

Unit-8 Statistical Derivatives and Measures of Central Tendency

Unit-9 Measures of Variation and Skewness

Unit-10 Correlation and Simple Regression

Unit-11 Time Series Analysis

Unit-12 Index Numbers (not available in this PDF)

Unit-13 Probability and Probability Rules

Unit-14 Probability Distributions

Unit-15 Tests of Hypothesis–I

Unit-16 Tests of Hypothesis – II

Unit-17 Chi-Square Test

Unit-18 Interpretation of Statistical Data

Unit-19 Report Writing


UNIT 1 INTRODUCTION TO BUSINESS RESEARCH
STRUCTURE

1.0 Objectives
1.1 Introduction
1.2 Meaning of Research
1.3 Meaning of Science
1.4 Knowledge and Science
1.5 Inductive and Deductive Logic
1.6 Significance of Research in Business
1.7 Types of Research
1.8 Methods of Research
1.8.1 Survey Method
1.8.2 Observation Method
1.8.3 Case Method
1.8.4 Experimental Method
1.8.5 Historical Method
1.8.6 Comparative Method
1.9 Difficulties in Business Research
1.10 Business Research Process
1.11 Let Us Sum Up
1.12 Key Words
1.13 Answers to Self Assessment Exercises
1.14 Terminal Questions
1.15 Further Reading

1.0 OBJECTIVES
After studying this unit, you should be able to:
– explain the meaning of research,
– differentiate between Science and Knowledge,
– distinguish between inductive and deductive logic,
– discuss the need for research in business,
– classify research into different types,
– describe different methods of research,
– list the difficulties in business research, and
– explain the business research process and its role in decision making.

1.1 INTRODUCTION
Research is a part of any systematic knowledge. It has occupied the realm of
human understanding, in some form or the other, from time immemorial. The
thirst for new areas of knowledge and the human urge for solutions to problems
have developed in human beings a faculty for search, research and re-research.
Research has now become an integral part of all areas of human activity.

Research in common parlance refers to a search for knowledge. It is an
endeavour to discover answers to problems (of intellectual and practical nature)
through the application of scientific methods. Research, thus, is essentially a
systematic inquiry seeking facts (truths) through objective, verifiable methods in
order to discover the relationship among them and to deduce from them broad
conclusions. It is thus a method of critical thinking. It is imperative that any
type of organisation in the globalised environment needs systematic supply of
information coupled with tools of analysis for making sound decisions which
involve minimum risk. In this Unit, we will discuss at length the need and
significance of research, types and methods of research, and the research
process.

1.2 MEANING OF RESEARCH


The Random House Dictionary of the English language defines the term
‘Research’ as a diligent and systematic inquiry or investigation into a subject in
order to discover or revise facts, theories, applications, etc. This definition
explains that research involves acquisition of knowledge. Research means
search for truth. Truth means the quality of being in agreement with reality or
facts. It also means an established or verified fact. To do research is to get
nearer to truth, to understand the reality. Research is the pursuit of truth with
the help of study, observation, comparison and experimentation. In other words,
the search for knowledge through objective and systematic method of finding
solution to a problem/answer to a question is research. There is no guarantee
that the researcher will always come out with a solution or answer. Even then,
to put it in Karl Pearson’s words, “there is no short cut to truth… no way to
gain knowledge of the universe except through the gateway of scientific
method”. Let us see some definitions of Research:

L.V. Redman and A.V.H. Mory, in their book The Romance of Research,
defined research as “a systematized effort to gain new knowledge”.

“Research is a scientific and systematic search for pertinent information on a
specific topic” (C.R. Kothari, Research Methodology – Methods and
Techniques).

“A careful investigation or inquiry specially through search for new facts in any
branch of knowledge” (Advanced learners Dictionary of current English)

Research refers to a process of enunciating the problem, formulating a
hypothesis, collecting the facts or data, analyzing the same, and reaching certain
conclusions either in the form of a solution to the problem enunciated or in certain
generalizations for some theoretical formulation.

D. Slesinger and M. Stephenson in the Encyclopedia of Social Sciences defined
research as: “Manipulation of things, concepts or symbols for the purpose of
generalizing and to extend, correct or verify knowledge, whether that knowledge
aids in the construction of a theory or in the practice of an art”.

To understand the term ‘research’ clearly and comprehensively, let us analyze
the above definition.

i) Research is manipulation of things, concepts or symbols

– Manipulation means purposeful handling.
– Things means objects like balls, rats, vaccine.
– Concepts mean the terms designating the things and their perceptions, about
which science tries to make sense. Examples: velocity, acceleration, wealth,
income.
– Symbols may be signs indicating +, –, ÷, ×, x̄, σ, Σ, etc.
– Manipulation of a ball or vaccine means: when the ball is kept on different
degrees of incline, how and at what speed does it move? When the vaccine is
used, not used, used with different gaps, or used in different quantities (doses),
what are the effects?
ii) Manipulation is for the purpose of generalizing
The purpose of research is to arrive at generalization, i.e., to arrive at statements of
generality, so that prediction becomes easy. A generalization or conclusion of an
enquiry tells us to expect something in a class of things under a class of conditions.
Examples: The debt repayment capacity of farmers decreases during
drought years.
When price increases, demand falls.
Advertisement has a favourable impact on sales.
iii) The purpose of research (or generalization) is to extend, correct or
verify knowledge
Generalization has in turn certain effects on the established corpus or body of
knowledge. It may extend or enlarge the boundaries of existing knowledge by
removing inconsistencies if any. It may correct the existing knowledge by
pointing out errors if any. It may invalidate or discard the existing knowledge
which is also no small achievement. It may verify and confirm the existing
knowledge which also gives added strength to the existing knowledge. It may
also point out the gaps in the existing corpus of knowledge requiring attempts to
bridge these gaps.

iv) This knowledge may be used for construction of a theory or practice of
an art
The extended, corrected or verified knowledge has two possible uses to which
persons may put it.

a) It may be used for theory building, so as to form a more abstract conceptual
system, e.g., the theory of relativity, the theory of full employment, the theory
of wages.
b) It may be used for some practical or utilitarian goal. For example,
‘salesmanship and advertisement increase sales’ is a generalization; from this,
if sales have to be increased, use salesmanship and advertisement.
Theory and practice are not two independent things. They are interdependent.
Theory gives quality and effectiveness to practice. Practice in turn may enlarge
or correct or confirm or even reject theory.

1.3 MEANING OF SCIENCE


The development of Science can be considered as a constant interplay
between theory and facts. The word “Science” comes from the Latin word
“Scientia” which means “knowledge”. As we have seen earlier, research
involves acquisition of knowledge. Thus Science and research are related and
go hand in hand.
At one time the word science was used to denote all systematic studies or
organized bodies of knowledge. Let us see some definitions.
– “Science means a branch of (accumulated) knowledge”. In this sense it refers
to a particular field or branch of knowledge such as Physics, Chemistry,
Economics.
– “The systematized knowledge about things or events in nature is called
Science”.
– “Science is popularly defined as an accumulation of systematic knowledge”
(Goode & Hatt).
In these definitions the words ‘systematic’ and ‘knowledge’ are very important.
Knowledge refers to the goal of science, while ‘systematic’ refers to the
‘method’ that is used to reach that goal. Nowadays the stress is on the
method rather than the knowledge. See the following definitions:
– Knowledge not of things but of their relations.
– Science is a process which makes knowledge.
– It is the approach rather than the content that is the test of science.
– Science is a way of investigation.
– Science is a way of looking at the World.
– “The unity of all sciences consists alone in its methods, not in its material”
(Karl Pearson).
From the above definitions two broad views emerge: (a) Science as
organized or accumulated knowledge, and (b) Science as a method/process leading
to knowledge. View (a) is a STATIC view whereas (b) is a DYNAMIC view. The
view that Science is a method rather than a field of specific subject matter is
more popular.

1.4 KNOWLEDGE AND SCIENCE


Knowledge has something to do with knowing. Knowing may be through
acquaintance or through the description of the characteristics of certain things.
The things with which we can be acquainted are the things of which we are
directly aware. Direct awareness may come through perception and sensation.
Most of our knowledge of things is by description.

Knowing has an external reference, which may be called a fact. A fact is
anything that exists or can be conceived of. A fact is neither true nor false. It is
what it is. What we claim to know is belief or judgement. But every belief
cannot, however, be equated with knowledge, because some of our beliefs, even
the true ones, may turn out to be false on verification. Knowledge, therefore, is
a matter of degree. However, knowledge need not always be private or
individual. Private knowledge may be transformed into public knowledge by the
application of certain scientific and common sense procedures.

Human knowledge takes the form of beliefs or judgements about a particular
phenomenon. Some beliefs may be supported by evidence and some may not.
The evidence may be based on our perceptions and experiences. The beliefs
which are supported by evidence are called justified beliefs. Only justified
beliefs are knowledge. Ordinary belief (not supported by evidence) is not
knowledge.

We have shown that knowledge requires explanations and these come in
Science. Knowledge and Science are not necessarily synonymous. Science
implies knowledge, but the converse is not true. Therefore, we can say that “all
Sciences are knowledge, but all knowledge is not science”. Scientific knowledge
is unified, organized and systematic, while ordinary knowledge is a jumble of
isolated and disconnected facts. Science applies special means and methods to
render knowledge true and exact, but ordinary knowledge rests on observations
which are not methodical. But scientific knowledge and ordinary knowledge are
not different in kind, but only in degree. Scientific knowledge is more
specialized, exact and organized than ordinary knowledge.

Self Assessment Exercise A

1) What do you understand by Research?


..................................................................................................................
..................................................................................................................
..................................................................................................................

2) What is the relation between Science and Research?


..................................................................................................................
..................................................................................................................
..................................................................................................................

3) Distinguish between Knowledge and Science.


..................................................................................................................
..................................................................................................................
..................................................................................................................

4) What is a fact?
..................................................................................................................
..................................................................................................................
..................................................................................................................

1.5 INDUCTIVE AND DEDUCTIVE LOGIC


A rational man does not accept any statement without empirical verification or
logic. After the data / facts have been collected, processed, analyzed, we have
to draw broad conclusions / generalizations. Research provides an analytical
framework for the subject matter of investigation. It establishes the relationship
between the different variables. The cause and effect relationship between the
different variables can also be identified, leading to valuable observations,
generalizations and conclusions. Inductions and deductions are also possible in
systematic research.

Induction is the process of reasoning whereby we arrive at generalizations
from particular facts. It is a movement of knowledge from particular
observations/instances to a general rule or principle. Induction involves a
passage from the observed to the unobserved. It involves two processes: observation
and generalization. For example, it may be observed in a number of cases that
when price increases, less is purchased; the generalization, therefore, is “when
price increases, demand falls”.
Deduction, on the other hand, is a way of making a particular inference from
a generalization. Deduction is a movement of knowledge from a general rule to
a particular case. For example, ‘All men are mortal’ is a general rule. Ranjit is
a man. Therefore, from the general rule it can be deduced that Ranjit is also
mortal. Similarly, ‘All M.Com. degree holders are eligible for Ph.D. in
Commerce’ is a general statement. Praneeth is an M.Com. degree holder.
Therefore, it can be deduced that Praneeth is eligible for Ph.D. in Commerce.

Empirical studies have a great potential, for they lead to inductions and
deductions. Research enables one to develop theories and principles, on the one
hand, and to arrive at generalizations on the other. Both are aids to acquisition
of knowledge.
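The two movements described above can be sketched in code. The sketch below is illustrative only: the function names and the price/quantity data are invented for this example, and the functions are toy stand-ins, not a formal logic system.

```python
# Illustrative sketch: induction generalizes from observed cases;
# deduction applies a general rule to a particular case.

def induce_inverse_relation(observations):
    """Induction: from particular (price, quantity) observations,
    generalize 'when price increases, demand falls' -- but only if
    every observed price rise was accompanied by a fall in demand."""
    pairs = sorted(observations)  # order the cases by price
    return all(q2 < q1 for (p1, q1), (p2, q2) in zip(pairs, pairs[1:]))

def deduce(general_rule_holds, case_is_instance):
    """Deduction: if the rule holds for the whole class and the case
    belongs to that class, the rule holds for the case."""
    return general_rule_holds and case_is_instance

# Particular observations: (price, quantity demanded)
observed = [(10, 100), (12, 90), (15, 75), (20, 60)]
law_of_demand = induce_inverse_relation(observed)  # a generalization: True

# 'All men are mortal' (general rule); 'Ranjit is a man' (particular case)
all_men_mortal, ranjit_is_man = True, True
print(law_of_demand, deduce(all_men_mortal, ranjit_is_man))  # True True
```

Note that the inductive step is fallible in exactly the way the text suggests: a single contrary observation (a price rise with higher demand) would make the generalization fail, whereas the deductive step is certain once the general rule and class membership are granted.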

1.6 SIGNIFICANCE OF RESEARCH IN BUSINESS


Research is the process of systematic and in-depth study or search for a
solution to a problem or an answer to a question, backed by collection,
compilation, presentation, analysis and interpretation of relevant details, data and
information. It is also a systematic endeavour to discover valuable facts or
relationships. Research may involve careful enquiry or experimentation and
result in discovery or invention. There cannot be any research which does not
increase knowledge which may be useful to different people in different ways.
Let us see the need for research to business organizations and their managers
and how it is useful to them.

i) Industrial and economic activities have assumed huge dimensions. The size of
modern business organizations indicates that managerial and administrative
decisions can affect vast quantities of capital and a large number of people.
Trial and error methods are not appreciated, as mistakes can be tremendously
costly. Decisions must be quick but accurate and timely, and should be
objective, i.e., based on facts and realities. In this backdrop, business decisions
nowadays are mostly influenced by research and research findings. Thus,
research helps in quick and objective decisions.
ii) Research, being a fact-finding process, significantly influences business
decisions. The business management is interested in choosing that course of
action which is most effective in attaining the goals of the organization.
Research not only provides facts and figures to support business decisions but
also enables the business to choose the course which is best.
iii) A considerable number of business problems are now given quantitative
treatment with some degree of success with the help of operations research.
Research into management problems may result in certain conclusions by
means of logical analysis which the decision maker may use for his action or
solution.
iv) Research plays a significant role in the identification of a new project, project
feasibility and project implementation.
v) Research helps the management to discharge its managerial functions of
planning, forecasting, coordinating, motivating, controlling and evaluation
effectively.
vi) Research facilitates the process of thinking, analysing, evaluating and
interpreting of the business environment, and of various business situations and
business alternatives, so as to be helpful in the formulation of business policy
and strategy.
vii) Research and Development (R&D) helps discovery and invention.
Developing new products or modifying the existing products, discovering new
uses, new markets etc., is a continuous process in business.
viii) The role of research in functional areas like production, finance, human
resource management, and marketing need not be overemphasized. Research not
only establishes relationships between different variables in each of these
functional areas, but also between these various functional areas.
ix) Research is a must in the production area. Product development, new and
better ways of producing goods, invention of new technologies, cost reduction,
improving product quality, work simplification, performance improvement,
process improvement etc., are some of the prominent areas of research in the
production area.
x) The purchase/material department uses research to frame alternative suitable
policies regarding where to buy, when to buy, how much to buy, and at what
price to buy.
xi) Closely linked with the production function is the marketing function. Market
research and marketing research provide a major part of the marketing
information which influences the inventory level and production level. Marketing
research studies include problems and opportunities in the market, product
preference, sales forecasting, advertising effectiveness, product distribution,
after-sales service, etc.
xii) In the area of financial management, maintaining liquidity, profitability through
proper funds management and assets management is essential. Optimum
capital mix, matching of funds inflows and outflows, cash flow forecasting,
cost control, pricing, etc., require some sort of research and analysis. Financial
institutions (banking and non-banking) have also found it essential to set up
research divisions for the purpose of collecting and analysing data, both for their
internal purposes and for making in-depth studies on the economic conditions of
business and people.
xiii) In the area of human resource management personnel policies have to be
guided by research. An individual’s motivation to work is associated with his
needs and their satisfaction. An effective Human Resource Manager is one
who can identify the needs of his work force and formulate personnel policies
to satisfy the same so that they can be motivated to contribute their best to the
attainment of organizational goals. Job design, job analysis, job assignment,
scheduling work breaks etc., have to be based on investigation and analysis.
xiv) Finally, research in business is a must to continuously update its attitudes,
approaches, products, goals, methods, and machinery in accordance with the
changing environment in which it operates.

1.7 TYPES OF RESEARCH


Research may be classified into different types for the sake of better
understanding of the concept. Several bases can be adopted for the
classification, such as nature of data, branch of knowledge, extent of coverage,
place of investigation, method employed, time frame and so on. Depending upon
the BASIS adopted for the classification, research may be classified into a
class or type. It is possible that a piece of research work can be classified
under more than one type; hence there will be overlapping. It must be
remembered that good research uses a number of types, methods, and
techniques. Hence rigid classification is impossible. The following is only an
attempt to classify research into different types.

i) According to the Branch of Knowledge

Different branches of knowledge may broadly be divided into two:

a) Life and physical sciences such as Botany, Zoology, Physics and Chemistry.
b) Social Sciences such as Political Science, Public Administration, Economics,
Sociology, Commerce and Management.
Research in these fields is also broadly referred to as life and physical science
research and social science research. Business education covers both
Commerce and Management, which are part of Social sciences. Business
research is a broad term which covers many areas.

Business research covers: Management, Production, Personnel, Finance,
Accounting, Marketing, Business Policy, and Business History.

The research carried out in these areas is called management research,
production research, personnel research, financial management research,
accounting research, marketing research, etc.

Management research includes various functions of management such as
planning, organizing, staffing, communicating, coordinating, motivating, and
controlling. Various motivational theories are the result of research. Production
(also called manufacturing) research focuses more on materials and equipment
than on human aspects. It covers various aspects such as new and
better ways of producing goods, inventing new technologies, reducing costs, and
improving product quality. Research in personnel management may range
from very simple problems to highly complex problems of all types. It is
primarily concerned with the human aspects of the business such as personnel
policies, job requirements, job evaluation, recruitment, selection, placement,
training and development, promotion and transfer, morale and attitudes, wage
and salary administration, and industrial relations. Basic research in this field would
be valuable as human behaviour affects organizational behaviour and
productivity. Research in financial management includes financial institutions,
financing instruments (e.g., shares, debentures), financial markets (capital
market, money market, primary market, secondary market), financial services
(e.g., merchant banking, discounting, factoring), financial analysis (e.g.,
investment analysis, ratio analysis, funds flow/cash flow analysis), etc.

Accounting research, though narrow in its scope, is a highly significant
area of business management. Accounting information is used as a basis for
reports to the management, shareholders, investors, tax authorities, regulatory
bodies and other interested parties. Areas for accounting research include
inventory valuation, depreciation accounting, generally accepted accounting
principles, accounting standards, corporate reporting, etc.

Marketing research deals with product development and distribution problems,
marketing institutions, marketing policies and practices, consumer behaviour,
advertising and sales promotion, sales management and after-sales service.
Marketing research is one of the very popular areas and also a well-established
one. Marketing research includes market potentials, sales forecasting, product
testing, sales analysis, market surveys, test marketing, consumer behaviour
studies, marketing information systems, etc.

Business policy research is basically research with policy implications.
The results of such studies are used as indices for policy formulation and
implementation. Business history research is concerned with the past; for
example, how were trade and commerce conducted during the Moghul regime?

ii) According to the Nature of Data

A simple dichotomous classification of research is quantitative research and
qualitative (non-quantitative) research. Quantitative research is variables-based
whereas qualitative research is attributes-based. Quantitative research is
based on measurement/quantification of the phenomenon under study. In other
words, it is data-based and hence more objective and more popular.

Qualitative research is based on the subjective assessment of attributes,
motives, opinions, desires, preferences, behaviour, etc. Research in such a
situation is a function of the researcher’s insights and impressions.

iii) According to the Coverage

According to the number of units covered, it can be a macro study or a micro
study. A macro study is a study of the whole whereas a micro study is a study of
a part. For example, working capital management in State Road Transport
Corporations in India is a macro study, whereas working capital management
in Andhra Pradesh State Road Transport Corporation is a micro study.

iv) According to Utility or Application


Depending upon the use of research results i.e., whether it is contributing to the
theory building or problem solving, research can be Basic or Applied. Basic
research is also called pure/theoretical/fundamental research. Basic research
includes original investigations for the advancement of knowledge that do not
have specific objectives to answer problems of sponsoring agencies.

Applied research, also called Action research, constitutes research activities on
problems posed by sponsoring agencies for the purpose of contributing to the
solution of these problems.

v) According to the place where it is carried out


Depending upon the place where the research is carried out (according to the
data generating source), research can be classified into:

a) Field studies or field experiments
b) Laboratory studies or laboratory experiments
c) Library studies or documentary research
vi) According to the Research Methods used
Depending upon the research method used for the investigation, it can be
classified as:

a) Survey research, b) Observation research, c) Case research,
d) Experimental research, e) Historical research, f) Comparative research.
vii) According to the Time Frame

Depending upon the time period adopted for the study, it can be:

a) One-time or single time period research, e.g., one year or a point of
time. Most of the sample studies and diagnostic studies are of this type.
b) Longitudinal research, e.g., several years or several time periods (a time
series analysis), such as industrial development during the five year plans in
India.
viii) According to the Purpose of the Study

What is the purpose/aim/objective of the study? Is it to describe, analyze,
evaluate or explore? Accordingly, the studies are known as:

a) Descriptive Study: The major purpose of descriptive research is the
description of a person, situation, institution or an event as it exists. Generally,
fact-finding studies are of this type.
b) Analytical Study: The researcher uses facts or information already available
and analyses them to make a critical examination of the material. These are
generally ex-post facto studies or post-mortem studies.
c) Evaluation Study: This type of study is generally conducted to examine/
evaluate the impact of a particular event, e.g., the impact of a particular
decision or a project or an investment.
d) Exploratory Study: Little is known about the particular subject matter.
Hence, a study is conducted to know more about it so as to formulate
the problem and procedures of the study. Such a study is called an exploratory/
formulative study.

Self Assessment Exercise B

1) Distinguish between inductive and deductive logic.


..................................................................................................................
..................................................................................................................
..................................................................................................................

2) What is the role of R & D in business?


..................................................................................................................
..................................................................................................................
..................................................................................................................

3) How does research influence business decisions?


..................................................................................................................
..................................................................................................................
..................................................................................................................

4) Distinguish between qualitative and quantitative data.


..................................................................................................................
..................................................................................................................
..................................................................................................................
5) List the various types of studies according to the purpose of the study.
..................................................................................................................
..................................................................................................................
..................................................................................................................

1.8 METHODS OF RESEARCH


The researcher has to provide answers, at the end, to the research questions
raised at the beginning of the study. For this purpose he/she investigates and
gathers the relevant data and information as a basis or evidence. The
procedures adopted for obtaining these are described in the literature as
methods of research or approaches to research. In fact, they are the broad
methods used to collect the data. These methods are as follows:

1) Survey Method
2) Observation Method
3) Case Method
4) Experimental Method
5) Historical Method
6) Comparative Method

It is now proposed to explain briefly, each of the above mentioned methods.

1.8.1 Survey Method

The dictionary meaning of ‘survey’ is to oversee, to look over, to study, to
systematically investigate. Survey research is used to study large and small
populations (or universes). It is a fact-finding survey. Mostly empirical problems
are investigated by this approach. It is a critical inspection to gather information,
often a study of an area with respect to a certain condition or its prevalence.
For example: a marketing survey, a household survey, the All India Rural Credit
Survey.

Survey is a very popular branch of social science research. Survey research
has developed as a separate research activity along with the development and
improvement of sampling procedures. Sample surveys are very popular
nowadays. As a matter of fact, the sample survey has become synonymous
with survey. For example, see the following definitions:

Survey research can be defined as “Specification of procedures for
gathering information about a large number of people by collecting
information from a few of them”. (Black and Champion)

Survey research is “Studying samples chosen from populations to
discover the relative incidence, distribution, and interrelations of
sociological and psychological variables”. (Fred N. Kerlinger)

In surveys, data/information may be collected by observation, personal
interview, mailed questionnaires, administering schedules, or telephone
enquiries.

Features of the Survey Method

The important features of the survey method are as follows:

i) It is a field study, as it is always conducted in a natural setting.


ii) It solicits responses directly from the respondents or people known to have
knowledge about the problem under study.
iii) Generally, it gathers information from a large population.
iv) A survey covers a definite geographical area, e.g., a village, city, or district.
v) It has a time frame.
vi) It can be an extensive survey involving a wider sample, or it can be an intensive
study covering a few sample units in depth and detail.
vii) Survey research is best adapted for obtaining personal and socio-economic facts,
beliefs, attitudes, and opinions.
Survey research is not a clerical routine of gathering facts and figures. It
requires a good deal of research knowledge and sophistication. The competent
survey investigator must know sampling procedures, questionnaire/schedule/
opinionnaire construction, techniques of interviewing and other technical aspects
of the survey. Ultimately the quality of the Survey results depends on the
imaginative planning, representative sampling, reliability of data, appropriate
analysis and interpretation of the data.

1.8.2 Observation Method


Observation means seeing or viewing. It is not casual but systematic viewing.
Observation may therefore be defined as “a systematic viewing of a specific
phenomenon in its proper setting for the purpose of gathering information for
the specific study”.

Observation is a method of scientific enquiry. We observe a person or an event
or a situation or an incident. The body of knowledge of various sciences such
as biology, physiology, astronomy, sociology, psychology, anthropology etc., has
been built upon centuries of systematic observation.

Observation is also useful in social and business sciences for gathering
information and conceptualizing the same. For example: What is the life style of
tribals? How are marketing activities taking place in regulated markets?
How are investment activities carried out in stock exchange markets? How
are proceedings taking place in the Indian Parliament or Assemblies? How is a
corporate office maintained in a public sector or a private sector undertaking?
What is the behaviour of political leaders? What causes traffic jams in Delhi
during peak hours?

Observation as a method of data collection has some features:

i) It is not only seeing and viewing but also hearing and perceiving as well.
ii) It is both a physical and a mental activity. The observing eye catches many
things which are sighted, but attention is also focused on data that are relevant
to the problem under study.
iii) It captures the natural social context in which the person’s behaviour occurs.
iv) Observation is selective: The investigator does not observe everything but
selects the range of things to be observed depending upon the nature, scope and
objectives of the study.

v) Observation is not casual but with a purpose. It is made for the purpose of
noting things relevant to the study.

vi) The investigator first of all observes the phenomenon and then gathers and
accumulates data.

Observation may be classified in different ways. According to the setting, it can
be (a) observation in a natural setting, e.g., observing the live telecast of
parliament proceedings or watching from the visitors’ gallery, or electioneering in
India through election meetings; or (b) observation in an artificially simulated
setting, e.g., business games, a treadmill test. According to the mode of
observation, it may be classified as (a) direct or personal observation, and (b)
indirect or mechanical observation. In case of direct observation, the investigator
personally observes the event when it takes place, whereas in case of indirect
observation it is done through mechanical devices such as audio recordings,
audio-visual aids, still photography, picturization, etc. According to the
participating role of the observer, it can be classified as (a) participant
observation and (b) non-participant observation. In case of participant
observation, the investigator takes part in the activity, i.e., he acts both as an
observer as well as a participant. For example, studying the customs and life
style of tribals by living/staying with them. In case of non-participant
observation, the investigator observes from outside, merely as an onlooker.

The observation method is suitable for a variety of research purposes, such as a
study of human behaviour, the behaviour of social groups, life styles, customs and
traditions, interpersonal relations, group dynamics, crowd behaviour, leadership
and management styles, dressing habits of different social groups in different
seasons, the behaviour of living creatures like birds and animals, the layout of a
departmental store, a factory or a residential locality, or the conduct of an event
like a meeting, a conference or the Afro-Asian Games.

1.8.3 Case Method


Case method of study is borrowed from Medical Science. Just like a patient,
the case is intensively studied so as to diagnose and then prescribe a remedy.
A firm or a unit is studied intensively with a view to finding out
problems, differences, and specialties so as to suggest remedial measures. It is an
in-depth/intensive study of a unit or problem under study. It is a comprehensive
study of a firm or an industry, or a social group, or an episode, or an incident,
or a process, or a programme, or an institution or any other social unit.

According to P.V. Young, “a comprehensive study of a social unit, be that unit a
person, a group, a social institution, a district, or a community, is called a Case
Study”.

Case Study is one of the popular research methods. A case study aims at
studying everything about something rather than something about everything. It
examines complex factors involved in a given situation so as to identify causal
factors operating in it. The case study describes a case in terms of its
peculiarities, typical or extreme features. It also helps to secure a fund of
information about the unit under study. It is a most valuable method of study
for diagnostic and therapeutic purposes.

1.8.4 Experimental Method

Experimentation is the basic tool of the physical sciences, like Physics and
Chemistry, for establishing cause-and-effect relationships and for verifying
inferences. However, it is now also used in social sciences like Psychology and
Sociology. Experimentation is a research process used to observe cause-and-
effect relationships under controlled conditions. In other words, it aims at studying
the effect of an independent variable on a dependent variable, by keeping the
other independent variables constant through some type of control. In
experimentation, the researcher can manipulate the independent variable and
measure its effect on the dependent variable. The main features of the
experimental method are:

i) Isolation of factors or controlled observation.


ii) Replication of the experiment i.e. it can be repeated under similar
conditions.
iii) Quantitative measurement of results.
iv) Determination of cause and effect relationship more precisely.
Three broad types of experiments are:

a) The natural or uncontrolled experiment, as in the case of astronomy, made up
mostly of observations.
b) The field experiment, the best suited one for social sciences. “A field
experiment is a research study in a realistic situation in which one or more
independent variables are manipulated by the experimenter under as
carefully controlled conditions as the situation will permit”. ( Fred N.
Kerlinger)
c) The laboratory experiment is the exclusive domain of the physical scientist.
“A laboratory experiment is a research study in which the variance of all or
nearly all of the possible influential independent variables, not pertinent to the
immediate problem of the investigation, is kept at a minimum. This is done by
isolating the research in a physical situation apart from the routine of ordinary
living and by manipulating one or more independent variables under rigorously
specified, operationalized, and controlled conditions”. (Fred N. Kerlinger).

The contrast between the field experiment and the laboratory experiment is not
sharp; the difference is a matter of degree. The laboratory experiment has a
maximum of control, whereas the field experiment must operate with less
control.

1.8.5 Historical Method

When research is conducted on the basis of historical data, the researcher is
said to have followed the historical approach. To some extent, all research is
historical in nature, because to a very large extent research depends on the
observations / data recorded in the past. Problems that are based on historical
records, relics, documents, or chronological data can conveniently be
investigated by following this method. Historical research depends on past
observations or data and hence is non-repetitive, therefore it is only a post facto
analysis. However, historians, philosophers, social psychiatrists, literary men, as
well as social scientists use the historical approach.

Historical research is the critical investigation of events, developments, and
experiences of the past, the careful weighing of evidence of the validity of the
sources of information of the past, and the interpretation of the weighed
evidence. The historical method, also called historiography, differs from other
methods in its rather elusive subject matter i.e. the past.
In historical research, primary and also secondary sources of data can be used.
A primary source is the original repository of a historical datum, like an
original record kept of an important occasion, an eye witness description of an
event, the inscriptions on copper plates or stones, the monuments and relics,
photographs, minutes of organization meetings, documents. A secondary
source is an account or record of a historical event or circumstance, one or
more steps removed from an original repository. Instead of the minutes of the
meeting of an organization, for example, if one uses a newspaper account of
the meeting, it is a secondary source.

The aim of historical research is to draw explanations and generalizations from
past trends in order to understand the present and to anticipate the future.
It enables us to grasp our relationship with the past and to plan more
intelligently for the future.

For historical data only authentic sources should be depended upon and their
authenticity should be tested by checking and cross checking the data from as
many sources as possible. Many a time it is of considerable interest to use
Time Series Data for assessing the progress or for evaluating the impact of
policies and initiatives. This can be meaningfully done with the help of historical
data.

1.8.6 Comparative Method


The comparative method is also frequently called the evolutionary or Genetic
Method. The term comparative method has come about in this way: Some
sciences have long been known as “Comparative Sciences” - such as
comparative philology, comparative anatomy, comparative physiology,
comparative psychology, comparative religion etc. Now the method of these
sciences came to be described as the “Comparative Method”, an abridged
expression for “the method of the comparative sciences”. When the method of
most comparative sciences came to be directed more and more to the
determination of evolutionary sequences, it came to be described as the
“Evolutionary Method”.

The origin and the development of human beings, their customs, their
institutions, their innovations and the stages of their evolution have to be traced
and established. The scientific method by which such developments are traced
is known as the Genetic method and also as the Evolutionary method. The
science which appears to have been the first to employ the Evolutionary
method is comparative philology. It is employed to “compare” the different
languages in existence, to trace the history of their evolution in the light of such
similarities and differences as the comparisons disclosed. Darwin’s famous work
“Origin of Species” is the classic application of the Evolutionary method in
comparative anatomy.

The whole theory of biological evolution rests on applications of the evolutionary
method. This method can be applied not only to plants, to animals, to social
customs and social institutions, to the human mind (comparative psychology), to
human ideas and ideals, but also to the evolution of geological strata, to the
differentiation of the chemical elements and to the history of the solar system.

The term comparative method as a method of research is used here in its
restricted meaning, as synonymous with the Evolutionary method. To say that the
comparative method is a ‘method of comparison’ is not convincing, for
comparison is not a specific method, but something which enters as a factor
into every scientific method. Classification requires careful comparison, and
every other method of science depends upon a precise comparison of
phenomena and the circumstances of their occurrence. All methods are,
therefore, “comparative” in a wider sense.

1.9 DIFFICULTIES IN BUSINESS RESEARCH


In India, researchers in general, and business researchers in particular are
facing several problems. This is all the more true in case of empirical research.
Some of the important problems are as follows:

i) The lack of scientific training in business research methodology is a major
problem in our country. Many researchers take a leap in the dark without
having a grip over research methodology. Systematic training in business
research methodology is a necessity.
ii) There is paucity of competent researchers and research supervisors. As a
result the research results many a time do not reflect the reality.
iii) Many of the business organizations are not research conscious; they feel that
investment in research is a wastage of resources and do not encourage
research.
iv) The Research and Development Department has become a common feature
in many medium and large organizations. But decision makers do not appear
to be very keen on implementing the findings of their R & D departments.
v) At the same time, small organizations which are the majority in our economy,
are not able to afford an R & D department at all. Even engaging a consultant
seems to be costly for them. Consequently, they do not take the help of
research to solve their problems.
vi) Many people largely depend on customs, traditions and routine practices in
their decision making, as they feel that research does not have any useful
purpose to serve in the management of their business.
vii) There are insufficient interactions between the University departments and
business organizations, government departments and research organizations.
There should be some mechanism to develop university and industry
interaction so that both can benefit i.e. the academics can get ideas from the
practitioners on what needs to be researched and the practitioners can
apply the research results of the academics.
viii) The secrecy of business information is sacrosanct to business organizations.
Most of the business organizations in our country do not part with information
to researchers. Except for public sector organizations, which have a culture of
encouraging research, many of the private sector organizations are not willing
to provide data.
ix) Even when research studies are undertaken, many a time, they are
overlapping, resulting in duplication because there is no proper coordination
between different departments of a university and between different
universities.
x) Difficulty of funds: because of the scarcity of resources, many university
departments do not come forward to undertake research.
xi) Poor library facilities at many places, because of which researchers have to
spend much of their time and energy in tracing out the relevant material and
information.

xii) Many researchers in our country also face the difficulty of inadequate
computing and secretarial assistance, because of which they have to take
more time to complete their studies.

xiii) Delayed publication of data: There is difficulty of timely availability of
up-to-date data from published sources. The data available from published sources
or governmental agencies is old; a time lag of at least 2 to 3 years exists, as a
result of which the data often proves irrelevant.

xiv) Social research, especially managerial research, relates to human beings and
their behaviour. The observations, the data collection, the conclusions, etc.,
must be valid. There is the problem of conceptualization of these aspects.

xv) Another difficulty in the research arena is that there is no code of conduct for
the researchers. There is need for developing a code of conduct for
researchers to educate them about ethical aspects of research, maintaining
confidentiality of information etc.

In spite of all these difficulties and problems, a business enterprise cannot avoid
research, especially in the fast changing world. To survive in the market an
enterprise has to continuously update itself, it has to change its attitudes,
approaches, products, technology, etc., through continuous research.

Self Assessment Exercise C

1) What is meant by Survey?


..................................................................................................................
..................................................................................................................
..................................................................................................................

2) Distinguish between observation and experiment.


..................................................................................................................
..................................................................................................................
..................................................................................................................

3) What are the comparative sciences?


..................................................................................................................
..................................................................................................................
..................................................................................................................

4) What is a Case Study?


..................................................................................................................
..................................................................................................................
..................................................................................................................

5) List out five important difficulties faced by business researchers in India.
..................................................................................................................
..................................................................................................................
..................................................................................................................

1.10 BUSINESS RESEARCH PROCESS


In abstract terms research is research everywhere and the research process
also is more or less the same, whether it is business research or agricultural
research or educational research. Of course, here and there certain
modifications may be required to suit the specified requirements of the area of
research. The business research process also consists of a number of stages:
Planning the research activity, execution of the plan and finally consolidation of
the results of the research activity or reporting. The important activities involved
in the research process are listed below:

i) Selection of a research problem or researchable area.


ii) Acquaintance with the current theory and knowledge and work done in that
area.
iii) Definition and specification of the research problem more clearly.
iv) Formulation of research hypothesis or at least research objectives.
v) Identification of the sources of data.
vi) Creation and construction of data collection instruments like Questionnaire,
Schedules, Scales etc.
vii) Pre-testing of the instruments and their possible revision.
viii) Formal acquisition of data and information, through survey, observation,
interview etc.
ix) Processing and analysis of the data.
x) Interpretation of the data and formal write up i.e., reporting.
These aspects are dealt with in detail in the units that follow:

Specifically, aspects (i) to (iv) are covered in unit-2, aspects (v) to (viii)
are covered in units 3,4 and 5, processing and presentation aspects of
(ix) are discussed in units 6 & 7, and analytical tools and techniques of
data analysis of (ix) are elaborated in units 8 to 17, interpretation
aspects of (x) are discussed in unit 18 and reporting aspects in unit 19.
Therefore, the above aspects are not elaborated in this unit.

1.11 LET US SUM UP


Research is a part of any systematic knowledge. It is essentially a systematic
investigation to discover answers to problems, seeking facts / truth. The word
Science can be understood in two senses.

– Science as an organized body of knowledge, and science as a method leading
to knowledge. All sciences are knowledge, but all knowledge is not science.

Empirical studies have a great potential, for they lead to inductions and
deductions. Induction is the process of reasoning to arrive at generalizations
from particular facts. Deduction is a way of making a particular inference from
a generalization.

Research is very useful to business organizations and their managers in a
number of ways. It facilitates timely and objective decisions. It helps in solving
business problems. It helps in providing answers to many business questions. It
is of immense use to business in its functional areas. Marketing research,
personnel research, production management research, financial management
research, accounting research are examples.

Research can be classified into different types for the sake of better
understanding. Several bases can be used for this classification such as branch
of knowledge, nature of data, coverage, application, place of research, research
methods used, time frame etc., and the research may be known as that type.

The research has to provide answers to the research questions raised. For this
the problem has to be investigated and relevant data has to be gathered. The
procedures adopted for obtaining the data and information are described as
methods of research. There are six methods viz., Survey, Observation, Case,
Experimental, Historical and Comparative methods.

Survey is a fact-finding enquiry conducted in a natural setting/field, soliciting
responses from people known to have knowledge about the problem under
study. Observation is a systematic viewing of a specific phenomenon in its
proper setting for gathering information. A comprehensive or in-depth study of
an element of research is called a case study. Experimentation is a research
process used to observe cause and effect relationship under controlled
conditions. Historical research depends on past observations or past data and
hence is a post facto analysis. The comparative method is an evolutionary
method employed to trace the evolution, similarities and differences between the
elements under study.

The business researcher in India has to face certain difficulties, such as lack of
scientific research training, paucity of competent researchers and research
supervisors, non-encouragement of research by business organizations, inability
of small business organizations to afford R & D departments, lack of
scientific orientation in business management, insufficient interaction between
industry and university, funding problems, poor library facilities, delayed
availability of published data, etc.

The business research process involves a number of stages, such as selection of
a researchable problem, review of previous work on that problem, specification
of the problem, formulation of hypotheses / objectives, identifying sources of
data, construction of data collection instruments and their pre-testing, collection
of data, processing and analysis of data and finally interpretation and Report
writing.

1.12 KEY WORDS


Deduction : It is a way of making a particular inference from a generalization.
Empirical : Relying/based on experience/observation/experiment
Fact : An event that is true/happened
Induction : It is a process of reasoning to arrive at generalizations from
particular facts.
Knowledge : Having information; acquaintance with facts.
Method : A way or mode of doing anything.
Observation : Systematic viewing of things to gather information.
Research : It is a systematic search for pertinent information on a specific
topic.
Science : It may mean accumulated body of knowledge or it may mean a
process leading to knowledge.

1.13 ANSWERS TO SELF ASSESSMENT EXERCISES
A. 1) Research is a systematic endeavour to discover answers to questions.
2) Science means Knowledge.
3) All Sciences are knowledge but all knowledge is not science.
4) A fact is a verifiable observation.

B. 1) Induction is reasoning from the particular to the general, whereas deduction
is reasoning from the general to the particular.

2) R & D helps the organization in discovery and invention.


3) By providing not only facts and figures to support decisions, but also
enabling the choice of the best alternative.
4) Quantitative is variables based, where as qualitative is attribute based.
5) Descriptive, analytical, evaluation, exploratory studies.
C. 1) It is a fact-finding enquiry that gathers information from respondents.
2) Observation is an uncontrolled experiment, and experiment is a controlled
observation.
3) Comparative philology, comparative anatomy, comparative religion,
comparative psychology etc.
4) An intensive study of a person, a group, an incident or an institution is
a case study.

1.14 TERMINAL QUESTIONS


A. Short answer Questions:
1) What do you mean by research?
2) What do you mean by Science?
3) What is knowledge?
4) What is inductive logic?
5) What is meant by deduction?
6) What are the different areas of business research?
7) What are the bases used for classifying research into different types?
8) List the various methods of research.
9) Distinguish between qualitative and quantitative data.
10) What are the stages in the business research process?
B. Essay Type Questions:
1) Define the concept of research and analyze its characteristics.
2) Define the term Science and distinguish it from knowledge.
3) Explain the significance of business research.
4) Write an essay on various types of research.
5) What do you mean by a method of research? Briefly explain different
methods of research.
6) Explain the significance of research in various functional areas of business.
7) What is Survey Research? How is it different from Observation Research?
8) Write short note on:
a) Case Research
b) Experimental Research
c) Historical Research
d) Comparative Method of research
9) What are the difficulties faced by researchers of business in India?
10) What is meant by the business research process? What are the various stages/
aspects involved in the research process?

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

1.15 FURTHER READING


The following text books may be used for more in-depth study of the topics
dealt with in this unit.
Fred N. Kerlinger. Foundations of Behavioural Research, Surjeet Publications,
Delhi
J.F. Rummel & W.C. Ballaine. Research Methodology in Business, Harper &
Row Publishers, New York
P.V. Young. Scientific Social Surveys and Research, Prentice-Hall of India,
New Delhi
C.R. Kothari. Research Methodology (Methods and Techniques), New Age
International Pvt. Ltd., New Delhi
T.S. Wilkinson & P.L. Bhandarkar. Methodology and Techniques of Social
Research, Himalaya Publishing House, Mumbai

UNIT 2 RESEARCH PLAN
STRUCTURE
2.0 Objectives
2.1 Introduction
2.2 Research Problem
2.2.1 Sources of Research Problems
2.2.2 Points to be Considered While Selecting a Problem
2.2.3 Specification of the Problem
2.3 Formulation of Objectives
2.4 Hypothesis
2.4.1 Meaning of Hypothesis
2.4.2 Types of Hypothesis
2.4.3 Criteria for Workable Hypothesis
2.4.4 Stages in Hypothesis
2.4.5 Testing of Hypothesis
2.4.6 Uses of Hypothesis
2.5 Research Design
2.5.1 Functions of Research Design
2.5.2 Components of a Research Design
2.6 Pilot Study and Pre-testing
2.7 Let Us Sum Up
2.8 Key Words
2.9 Answers to Self Assessment Exercises
2.10 Terminal Questions
2.11 Further Reading

2.0 OBJECTIVES
After studying this unit, you should be able to:

l select a research problem and identify sources of research problems,


l define and specify a research problem,
l explain the need for formulating research objective(s),
l define hypothesis and classify the hypotheses,
l suggest a criteria for a good hypothesis,
l test a hypothesis,
l describe a research design,
l list out the components of a research design, and
l distinguish between a pilot study and pre-test.

2.1 INTRODUCTION
In unit 1, we have discussed the meaning and significance of business research,
types of research, methods of conducting research, and the business research
process. There we have shown that the research process begins with the
raising of a problem, leading to the gathering of data, their analysis and
interpretation and finally ends with the writing of the report. In this unit, we
propose to give a complete coverage on selection and specification of the
research problem, formulation of research objectives / hypotheses and designing
the action plan of research. Now we will dwell in detail on these aspects along
with the associated features which are interwoven with the research problem
and hypothesis formulation and testing.

2.2 RESEARCH PROBLEM


Without a problem, research cannot proceed, because there is nothing to
proceed from and proceed towards. Therefore, the first step in research is to
perceive a problem - either practical or theoretical. The recognition or existence
of a problem motivates research. It may be noted that research is the process
of repeated search for truth/facts. Unless there is a problem to search for,
investigation cannot proceed. Thus, a problem sets the goal or direction of
research.

A problem in simple words is “some difficulty experienced by the researcher in


a theoretical or practical situation. Solving this difficulty is the task of
research”.

A problem exists when we do not have enough information to answer a


question (problem). The answer to the question or problem is what is sought in
the research.

By problem we mean “any condition or circumstance in which one does not


know how to act and what to accept as true”. In our common usage when we
are unable to assess a thing correctly, we often say ‘it is problematic’. Thus
the researcher who selects a problem formulates a hypothesis or postulates a
theoretical assumption that this or that is true, or that this or that is the thing to do. He/she
collects proof (facts/data) of his/her hypothesis. Based on the analysis of the
data collected he/she asserts the truth or answers the question/solves the
problem.

The problem for research should ordinarily be expressed in an interrogative


form. For example :

– Why is product X more popular than product Y?


– How to increase labour productivity?
– Does illumination increase productivity?
– Why is factory A earning profits and factory B incurring losses?
– Is the audio-visual system of teaching more effective than the audio system?
These are all searchable problems/questions. Finding answers to the problems is
what is endeavoured in research. One question/problem may give rise to a
number of, or a series of, sub-questions too.

Let us, now, discuss some considerations for the selection of a research problem.

A topic of study may be selected by some institution or by some researcher or


researchers having intellectual interests. In the former case there could be a
wide variety of problems in which institutions are interested. The institution
could be a local body, or government or corporate enterprises or a political
party. For example, the government may be interested in assessing the probable
consequences of various courses of action for solving a problem say rural
unemployment. A firm may be interested in assessing the demand for something
and predicting the future course of events so as to plan appropriate action
relating to marketing, production, consumer behaviour and so on.

The topic of study may be selected by some individual researcher having
intellectual or scientific interests. The researcher may be interested in exploring
some general subject matter about which relatively little is known, purely out
of scientific curiosity. He/she may also be interested in a phenomenon which
has already been studied in the past, but which now appears to occur under
different conditions and, therefore, requires further examination. Likewise,
he/she may be interested in a field in which there is a highly developed
theoretical system but a need to retest the old theory on the basis of new
facts, so as to test its validity in the changed circumstances.

The topic of research may be of a general nature or specifically needed by


some institution, organization or government. It may be of intellectual interest or
of practical concern. “A wide variety of practical concerns may present topics
for research”. For example, one may want to study the impact of television on
children’s education, performance of regulated agricultural markets, profitability
of a firm, impact of imports on Indian economy, a comparative study of
accounting practices in public and private undertakings, etc.

2.2.1 Sources of Research Problems


If the researcher/research organization has a ready problem on hand, he/she
can proceed further in the research process; otherwise, a problem has to be
searched for. Where can you search for research problems? In your own mind,
where else? You have to feel the problem and think about it. However, the
following sources may help you in identifying the problem/problem areas.

1) Business Problems: A research problem is a felt need; the need may be for
an answer, a solution, or an improvement in facilities/technology. Business
experiences various types of problems. They may be business policy
problems, operational problems, general management problems, or functional
area problems. The functional areas are Financial Management, Marketing
Management, Production Management and Human Resources Management.
Every business research problem is expected to solve a management problem
by facilitating rational decision-making.

2) Day to Day Problems: A research problem can come from the day-to-day
experience of the researcher. Everyday problems constantly present something
new and worthy of investigation, and it depends on the keenness of observation
and sharpness of intellect of the researcher to knit his/her daily experience into
a research problem. For example, a person who travels in city buses every day
finds it a problem to get into or out of the bus. But a queue system (that is the
answer to the problem) facilitates boarding and alighting comfortably.

3) Technological Changes: Technological changes in a fast changing world


are constantly bringing forth new problems and thus new opportunities for
research. For example, what is the impact or implications of a new technique
or new process or new machine?

4) Unexplored Areas: Research problems can be both abstract and of applied


interest. The researcher may identify the areas in which much work has been
done, those in which little work has been done, and those in which no
work has been done. He/she may then select those areas which have not been
explored so far, or have been explored very little.

5) Theory of One’s Own Interest: A researcher may also select a problem


for investigation from a given theory in which he has considerable interest. In
such situations the researcher must have a thorough knowledge of that theory
and should be able to explore some unexplained aspects or assumptions of that
theory. His/her effort should revalidate, modify, or reject the theory.

6) Books, Theses, Dissertation Abstracts, Articles: Special assignments in
textbooks, research theses, investigative reports, research articles in research
journals etc., are rich sources for problem seekers. These sources may suggest
some additional areas of needed research. Many of the research theses and
articles suggest problems for further investigation which may prove fruitful.

7) Policy Problems: Government policy measures give rise to both positive


and negative impact. The researcher may identify these aspects for his
research. For example, what is the impact of the Government’s new industrial
policy on industrial development? What is the impact of Export - Import policy
on balance of payments? What is the impact of Securities Exchange Board of
India Regulations on stock markets?

8) Discussions with Supervisor and Other Knowledgeable Persons: The


researcher may find it fruitful to have discussions with his/her proposed
supervisor or other knowledgeable persons in the area of the topic.

Self Assessment Exercise A

Fill up the blanks with appropriate words

1) A research problem is a ............................... need.


2) The problem sets the ............................... of research.
3) The research problem should preferably be expressed in ..........................
form.
4) A problem exists when we do not have enough ............................... to
answer it.
5) Technological changes are a constant ............................... for research.
6) List five research problems on your own.
..................................................................................................................
..................................................................................................................

2.2.2 Points to be Considered while Selecting a Problem

The topic or problem which the researcher selects among the many possibilities
should meet certain requirements. Every problem selected for research must
satisfy the following criteria.

1) The topic selected should be original or at least less explored. The purpose
of research is to fill the gaps in existing knowledge or to discover new facts
and not to repeat already known facts. Therefore, a preliminary survey of the
existing literature in the proposed area of research should be carried out to find
out the possibility of making an original contribution. Knowledge about previous
research will serve at least three purposes.

a) It will enable the researcher to identify his specific problem for research.
b) It will eliminate the possibility of unnecessary duplication of effort, and
c) It will give him valuable information on the merits and limitations of various
research techniques which have been used in the past.
2) It should be of significance and socially relevant and useful.
3) It should be interesting to the researcher and should fit into his aptitude.
4) It should be from an area of the researcher’s specialization.
5) It should correspond to the researcher’s abilities - both acquired and acquirable.
6) It should be big enough to be researchable and small enough to be handled - the
topic should be amenable for research with existing and acquirable skills.
7) It should have a clear focus or objective.
8) The feasibility of carrying out research on the selected problem should be
checked against the following considerations.
a) Whether adequate and suitable data are available?
b) Whether there is access to the organization and respondents?
c) Whether cooperation will be forthcoming from the organization and
respondents?
d) What are the resources required and how are they available?
e) Whether the topic is within the resources (money and man power) position
of the researcher?
9) It should be capable of being completed within the permissible time limits.

2.2.3 Specification of the Problem

After going through all the above issues, the problem is to be restated in
analytical terms, keeping its solution in view. The best way of understanding the
problem is to discuss it with those who first raised it in order to find out how it
originally came up and what was in the minds of the people who raised it. The
more general the original statement of the problem, the more the necessity of
preliminary discussions about its nature.

The research problem should define the goal of the researcher in clear terms.
It means that along with the problem, the objective of the proposal should
adequately be spelled out. Without a clear cut idea of the goal to be reached,
research activities would be meaningless.

The first step in the formulation and specification of a research problem is to


make it concrete and explicit. There is no foolproof method by which one can
do it. However, R.L.Ackoff provides considerable guidance in identifying and
specifying a problem of research. He presents five components of a problem.

1) Research Consumer: There must be an individual or a group which has


difficulty. The individual may be the researcher himself/herself; the group may
be a group of researchers. For some problems there are also other participants. The
researcher, if he/she is different from the research consumer, is a participant in
the problem.

2) Research-Consumer’s Objective: The research consumer must have


something to know or some ends to achieve. Obviously, a person who wants
nothing cannot have a problem.

3) Alternative Means to Achieve the Objective: The research consumer


must have alternative means to achieve his objectives. Means are courses of
action. A course of action may involve the use of objects. The objects used
thus are instruments. Here an instrument refers to any object, concept or idea
which can be effectively used in the pursuit of an objective.

It should be remembered that there must be at least two means available to the
research consumer. If he/she has no choice of means, he/she cannot have a
problem.

4) Doubt in Regard to Selection of Alternatives: The existence of


alternative courses of action is not enough. To experience a problem, the
research consumer must have some doubt as to which alternative to select.
Without such a doubt there can be no problem. All problems then get reduced
ultimately to the evaluation of efficiency of the alternative means for a given
set of objectives.

5) One or More Environments: There must be one or more environments to


which the difficulty or problem pertains. A problem may exist in one
environment and may not in another. Thus a change in environment may
produce or remove a problem. A research consumer may have doubts as to
which will be the most efficient means. The strategy of marketing a product
may be different in the urban market, the semi-urban market and the rural
market. The instruments of spreading the family planning message may be
different in the case of educated and illiterate people. The range of
environments over which a problem may exist may vary from one to many.
Some problems are specific to only one environment while others are quite
general.

The selection of a topic for research is only half-a-step forward. This general
topic does not help a researcher to see what data are relevant to his/her
purpose. What methods would he/she employ in securing them? And how
should these be organized? Before he/she can consider all these aspects, he/she
has to formulate a specific problem by making the various components of it (as
explained above) explicit.

A research problem is nothing but a basic question for which an answer or a


solution is sought through research. The basic question may be further broken
down into specifying questions. These “simple, pointed, limited, empirically
verifiable questions are the final result of the phased process, we designate as
the formulation of a research problem”. Specification or definition of the
problem is therefore a process that involves a progressive narrowing of the
scope and sharpening of focus of questions till the specific challenging questions
are finally posed. If you can answer the following questions, you have clearly
specified/defined the problem.

1) What do you want to know? (What is the problem / what are the questions to
be answered).

2) Why do you want to know? (What is the purpose or objective).

3) How do you want to answer or solve it? (What is the methodology we want to
adopt to solve it)

4) When do you want to solve it? (Within what time limits)

5) Where do you want to solve it? (Within what spatial limits)


6) Who is your research-consumer? (To whom are you answering?)

Please remember that a problem well put is half solved.

2.3 FORMULATION OF OBJECTIVES


Having selected and specified the research problem, the next step is to
formulate the objectives of research. Research is not for the sake of research.
It is undertaken to achieve something. Thus, research is a goal-oriented
activity. We have to identify the goal/goals to be achieved and they must be
specified in order to give direction to the research study. Hence, formulation of
research objectives is equally important. Once research objectives are stated,
then the entire research activity will be geared to achieving those objectives.
For example, we intend to examine the working of a Regulated Agricultural
Market in a town to know whether it is fulfilling the objectives for which it has
been set up. For this study, we will gather all the relevant information/data, such
as arrivals of different commodities, sources and uses of funds, facilities
provided in the market, users’ opinions, etc. Similarly, once we are clear about
what we want from the research exercise, the rest of the steps will follow from
the objectives: identifying sources of data, instruments for collecting data, and
tools for analyzing data. However, the objectives of the study must be
clear, specific and definite.

Self Assessment Exercise B

1) List any five points which will weigh in selecting a problem.


................................................................................................................
................................................................................................................
................................................................................................................
2) What do you mean by specification of a problem?
................................................................................................................
................................................................................................................
................................................................................................................
3) What is the need for formulation of objectives of research?
................................................................................................................
................................................................................................................
................................................................................................................
4) How do day-to-day problems give rise to research?
................................................................................................................
................................................................................................................
................................................................................................................
5) What is the need for knowledge about previous research?
................................................................................................................
................................................................................................................
................................................................................................................

2.4 HYPOTHESIS
We know that research begins with a problem or a felt need or difficulty. The
purpose of research is to find a solution to the difficulty. It is desirable that the
researcher should propose a set of suggested solutions or explanations of the
difficulty which the research proposes to solve. Such tentative solutions
formulated as a proposition are called hypotheses. The suggested solutions
formulated as hypotheses may or may not be the real solutions to the problem.
Whether they are or not is the task of research to test and establish.

2.4.1 Meaning of Hypothesis

To understand the meaning of a hypothesis, let us see some definitions:

“A hypothesis is a tentative generalization, the validity of which remains to be


tested. In its most elementary stage the hypothesis may be any guess, hunch,
imaginative idea, which becomes the basis for action or investigation”.
(G.A.Lundberg)

“It is a proposition which can be put to test to determine validity”. (Goode and
Hatt).

“A hypothesis is a question put in such a way that an answer of some kind
can be forthcoming”. (Rummel and Ballaine)

These definitions lead us to conclude that a hypothesis is a tentative solution or


explanation or a guess or assumption or a proposition or a statement to the
problem facing the researcher, adopted on a cursory observation of known and
available data, as a basis of investigation, whose validity is to be tested or
verified.

2.4.2 Types of Hypothesis


Hypotheses can be classified in a variety of ways into different types or kinds.
The following are some of the types of hypotheses:

i) Explanatory Hypothesis: The purpose of this hypothesis is to explain a certain


fact. All hypotheses are in a way explanatory for a hypothesis is advanced only
when we try to explain the observed fact. A large number of hypotheses are
advanced to explain the individual facts in life. A theft, a murder, an accident are
examples.
ii) Descriptive Hypothesis: Some times a researcher comes across a complex
phenomenon. He/ she does not understand the relations among the observed
facts. But how to account for these facts? The answer is a descriptive
hypothesis. A hypothesis is descriptive when it is based upon the points of
resemblance of some thing. It describes the cause and effect relationship of a
phenomenon e.g., the current unemployment rate of a state exceeds 25% of the
work force. Similarly, the consumers of local made products constitute a
significant market segment.
iii) Analogical Hypothesis: When we formulate a hypothesis on the basis of
similarities (analogy), it is called an analogical hypothesis e.g., families with
higher earnings invest more surplus income on long term investments.
iv) Working hypothesis: Some times certain facts cannot be explained adequately
by existing hypotheses, and no new hypothesis comes up. Thus, the investigation
is held up. In this situation, a researcher formulates a hypothesis which enables
to continue investigation. Such a hypothesis, though inadequate and formulated
for the purpose of further investigation only, is called a working hypothesis. It is
simply accepted as a starting point in the process of investigation.
v) Null Hypothesis: It is an important concept that is used widely in the sampling
theory. It forms the basis of many tests of significance. Under this type, the
hypothesis is stated negatively. It is null because it may be nullified, if the
evidence of a random sample is unfavourable to the hypothesis. It is a
hypothesis being tested (H0). If the calculated value of the test is less than the
permissible value, Null hypothesis is accepted, otherwise it is rejected. The
rejection of a null hypothesis implies that the difference could not have arisen
due to chance or sampling fluctuations.

vi) Statistical Hypothesis: Statistical hypotheses are the statements derived from
a sample. These are quantitative in nature and are numerically measurable. For
example, the market share of product X is 70%, the average life of a tube light
is 2000 hours etc.
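
A statistical hypothesis such as “the average life of a tube light is 2000 hours” can be put to test once sample data are in hand. The following minimal sketch, with hypothetical lifetime figures, computes the one-sample t statistic t = (x̄ − μ0)/(s/√n) and compares it with the 5% two-tailed critical value for 9 degrees of freedom:

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t = (xbar - mu0) / (s / sqrt(n)), the one-sample test statistic for a mean."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))

# Hypothetical lifetimes (in hours) of a sample of 10 tube lights.
lives = [1985, 2012, 1990, 2045, 1978, 2001, 1995, 2030, 1988, 2006]

t_stat = one_sample_t(lives, 2000)
t_crit = 2.262   # two-tailed 5% critical value of t for df = n - 1 = 9

decision = "accept H0" if abs(t_stat) < t_crit else "reject H0"
print(round(t_stat, 3), decision)
```

Here |t| falls well below the critical value, so the null hypothesis that the mean life is 2000 hours would be accepted at the 5% level for this particular sample.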
2.4.3 Criteria for Workable Hypothesis

A hypothesis controls and directs the research study. When a problem is felt,
we require the hypothesis to explain it. Generally, there is more than one
hypothesis which aims at explaining the same fact. But all of them cannot be
equally good. Therefore, how can we judge a hypothesis to be true or false,
good or bad? Agreement with facts is the sole and sufficient test of a true
hypothesis. Therefore, certain conditions can be laid down for distinguishing a
good hypothesis from bad ones. The formal conditions laid down by thinkers
provide the criteria for judging a hypothesis as good or valid. These conditions
are as follows:

i) A hypothesis should be empirically verifiable: The most important


condition for a valid hypothesis is that it should be empirically verifiable. A
hypothesis is said to be verifiable, if it can be shown to be either true or false
by comparing with the facts of experience directly or indirectly. A hypothesis is
true if it conforms to facts and it is false if it does not. Empirical verification is
the characteristic of the scientific method.
ii) A hypothesis should be relevant: The purpose of formulating a hypothesis
is always to explain some facts. It must provide an answer to the problem
which initiated the enquiry. A hypothesis is called relevant if it can explain the
facts of enquiry.
iii) A hypothesis must have predictive and explanatory power: Explanatory
power means that a good hypothesis, over and above the facts it proposes to
explain, must also explain some other facts which are beyond its original scope.
We must be able to deduce a wide range of observable facts which can be
deduced from a hypothesis. The wider the range, the greater is its explanatory
power.
iv) A hypothesis must furnish a base for deductive inference on
consequences: In the process of investigation, we always pass from the
known to the unknown. It is impossible to infer any thing from the absolutely
unknown. We can only infer what would happen under supposed conditions by
applying the knowledge of nature we possess. Hence, our hypothesis must be
in accordance with our previous knowledge.
v) A hypothesis does not go against the traditionally established
knowledge: As far as possible, a new hypothesis should not go against any
previously established law or knowledge. The new hypothesis is expected to be
consistent with the established knowledge.
vi) A hypothesis should be simple: A simple hypothesis is preferable to a
complex one. It sometimes happens that there are two or more hypotheses
which explain a given fact equally well. Both of them are verified by
observable facts. Both of them have a predictive power and both are
consistent with established knowledge. All the important conditions of
hypothesis are thus satisfied by them. In such cases the simpler one is to be
accepted in preference to the complex one.
vii) A hypothesis must be clear, definite and certain: It is desirable that the
hypothesis must be simple and specific to the point. It must be clearly defined
in a manner commonly accepted. It should not be vague or ambiguous.

viii) A hypothesis should be related to available techniques: If tools and
techniques are not available we cannot test the hypothesis. Therefore, the
hypothesis should be formulated only after due thought is given to the methods
and techniques that can be used to measure the concepts and variables related
to the hypothesis.
2.4.4 Stages in Hypothesis

There are four stages. The first stage is the feeling of a problem. The observation
and analysis of the researcher reveals certain facts. These facts pose a
problem. The second stage is formulation of a hypothesis or hypotheses. A
tentative supposition/ guess is made to explain the facts which call for an
explanation. At this stage some past experience is necessary to pick up the
significant aspects of the observed facts. Without previous knowledge, the
investigation becomes difficult, if not impossible. The third stage is deductive
development of hypothesis using deductive reasoning. The researcher uses the
hypothesis as a premise and draws a conclusion from it. And the last stage is
the verification or testing of hypothesis. This consists in finding whether the
conclusion drawn at the third stage is really true. Verification consists in finding
whether the hypothesis agrees with the facts. If the hypothesis stands the test
of verification, it is accepted as an explanation of the problem. But if the
hypothesis does not stand the test of verification, the researcher has to search
for further solutions.

To explain the above stages let us consider a simple example. Suppose, you
have started from your home for college on your scooter. A little while later
the engine of your scooter suddenly stops. What can be the reason? Why has
it stopped? From your past experience, you start guessing that such problems
generally arise due to either petrol or spark plug. Then start deducing that the
cause could be: (i) that the petrol knob is not on. (ii) that there is no petrol in
the tank. (iii) that the spark plug has to be cleaned. Then start verifying them
one after another to solve the problem. First see whether the petrol knob is on.
If it is not, switch it on and start the scooter. If it is already on, then see
whether there is petrol or not by opening the lid of the petrol tank. If the tank
is empty, go to the near by petrol bunk to fill the tank with petrol. If there is
petrol in the tank, this is not the reason, then you verify the spark plug. You
clean the plug and fit it. The scooter starts. That means the problem is with the
spark plug. You have identified it. So you got the answer. That means your
problem is solved.
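
The four stages above can be sketched as a loop that takes each deduced cause as a premise and verifies it in turn, stopping when one agrees with the facts. The checks and remedies below are illustrative stand-ins, not part of the original example's mechanics:

```python
# Verify each hypothesized cause one after another, as in the scooter example.
def diagnose(scooter):
    # (hypothesis, verification test, remedy) triples, tried in order
    hypotheses = [
        ("petrol knob is off",   lambda s: not s["knob_on"],  "switch the knob on"),
        ("petrol tank is empty", lambda s: s["petrol"] == 0,  "fill the tank"),
        ("spark plug is dirty",  lambda s: s["plug_dirty"],   "clean the plug"),
    ]
    for cause, test, remedy in hypotheses:
        if test(scooter):            # verification step: does the hypothesis agree with the facts?
            return cause, remedy     # hypothesis stands the test; accept it
    return None, "search for further solutions"   # all hypotheses failed

cause, remedy = diagnose({"knob_on": True, "petrol": 5, "plug_dirty": True})
print(cause, "->", remedy)   # spark plug is dirty -> clean the plug
```

If no hypothesis survives verification, the loop falls through, mirroring the text's point that the researcher must then search for further solutions.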

2.4.5 Testing of Hypothesis

When the hypothesis has been framed in the research study, it must be verified
as true or false. Verifiability is one of the important conditions of a good
hypothesis. Verification of hypothesis means testing of the truth of the
hypothesis in the light of facts. If the hypothesis agrees with the facts, it is said
to be true and may be accepted as the explanation of the facts. But if it does
not agree it is said to be false. Such a false hypothesis is either totally rejected
or modified. Verification is of two types viz., Direct verification and Indirect
verification.

Direct verification may be either by observation or by experiments. When


direct observation shows that the supposed cause exists where it was thought
to exist, we have a direct verification. When a hypothesis is verified by an
experiment in a laboratory it is called direct verification by experiment. When
the hypothesis is not amenable to direct verification, we have to depend on
indirect verification. Indirect verification is a process in which certain possible
consequences are deduced from the hypothesis and they are then verified
directly. Two steps are involved in indirect verification. (i) Deductive
development of hypothesis: By deductive development certain consequences are
predicted and (ii) finding whether the predicted consequences follow. If the
predicted consequences come true, the hypothesis is said to be indirectly
verified. Verification may be done directly or indirectly or through logical
methods.

Testing of a hypothesis is done by using statistical methods. Testing is used to


accept or reject an assumption or hypothesis about a random variable using a
sample from the distribution. The assumption is the null hypothesis (H0), and it
is tested against some alternative hypothesis (H1). Statistical tests of hypothesis
are applied to sample data. The procedure involved in testing a hypothesis is:
A) select a sample and collect the data; B) convert the variables or attributes
into statistical form, such as a mean or proportion; C) formulate the hypotheses;
D) select an appropriate test for the data, such as the t-test or Z-test; E) perform
the computations; and F) finally, draw the inference of accepting or rejecting the null
hypothesis. You will learn more about this in tests of hypothesis or tests of
significance in later units (Units 15, 16 and 17).
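
As an illustration of steps A) to F), the sketch below tests the earlier statistical hypothesis “the market share of product X is 70%” with a Z-test for a proportion. The sample figures (400 buyers surveyed, of whom 260 chose X) are hypothetical:

```python
import math

# Steps A-F for a two-tailed Z-test on a proportion.
n, successes = 400, 260          # (A) sample data (hypothetical)
p_hat = successes / n            # (B) convert to statistical form: a proportion
p0 = 0.70                        # (C) H0: market share of X is 70%; H1: it is not

# (D, E) Z-test statistic for a proportion and its computation
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# (F) inference at the 5% level (two-tailed critical value 1.96)
reject_h0 = abs(z) > 1.96
print(round(z, 3), "reject H0" if reject_h0 else "accept H0")
```

With |z| exceeding 1.96 in this illustration, the null hypothesis would be rejected at the 5% level; that is, the observed shortfall is unlikely to be due to sampling fluctuations alone.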

2.4.6 Uses of Hypothesis

If a clear scientific hypothesis has been formulated, half of the research work
is already done. The advantages/utility of having a hypothesis are summarized
below:

i) It is the starting point for much research work.


ii) It helps in deciding the direction in which to proceed.
iii) It helps in selecting and collecting pertinent facts.
iv) It is an aid to explanation.
v) It helps in drawing specific conclusions.
vi) It helps in testing theories.
vii) It works as a basis for future knowledge.

Self Assessment Exercise C

1) What do you mean by a hypothesis?


..................................................................................................................
..................................................................................................................
..................................................................................................................

2) List out different types of hypothesis.


..................................................................................................................
..................................................................................................................
..................................................................................................................

3) What is meant by null hypothesis?


..................................................................................................................
..................................................................................................................
..................................................................................................................

4) What are the characteristics of a good hypothesis?

..................................................................................................................
..................................................................................................................
..................................................................................................................

5) What are the stages in a hypothesis?


..................................................................................................................
..................................................................................................................
..................................................................................................................

6) What are the methods used to prove or reject a hypothesis ?


..................................................................................................................
..................................................................................................................
..................................................................................................................

2.5 RESEARCH DESIGN


Research design is also known by different names such as research outline,
plan, blue print. In the words of Fred N. Kerlinger, it is the plan, structure and
strategy of investigation conceived so as to obtain answers to research
questions and control variance. The plan includes everything the investigator will
do from writing the hypotheses and their operational implications to the final
analysis of data. The structure is the outline, the scheme, the paradigms of the
operation of the variables. The strategy includes the methods to be used to
collect and analyze the data. At the beginning this plan (design) is generally
vague and tentative. It undergoes many modifications and changes as the study
progresses and insights into it deepen. The working out of the plan consists of
making a series of decisions with respect to what, why, where, when, who and
how of the research.

According to Pauline V. Young, “a research design is the logical and systematic
planning and directing of a piece of research”. According to Roger E. Kirk,
“research designs are plans that specify how data should be collected and
analyzed”.

The research has to be geared to the available time, energy, money and to the
availability of data. There is no such thing as a single or correct design.
Research design represents a compromise dictated by many practical
considerations that go into research.

2.5.1 Functions of Research Design


Regardless of the type of research design selected by the investigator, all plans
perform one or more functions outlined below.

i) It provides the researcher with a blue print for studying research questions.
ii) It dictates boundaries of research activity and enables the investigator to
channel his energies in a specific direction.
iii) It enables the investigator to anticipate potential problems in the implementation
of the study.
iv) The common function of designs is to assist the investigator in providing
answers to various kinds of research questions.

A study design includes a number of component parts which are interdependent
and which demand a series of decisions regarding the definitions, methods,
techniques, procedures, time, cost and administration aspects.

2.5.2 Components of a Research Design


A research design is basically a plan of action. Once the research problem is
selected, the study must be executed to obtain results. Then how does one go
about it? What is its scope? What are the sources of data? What is the method
of enquiry? What is the time frame? How are the data to be recorded and
analyzed? What are the tools and techniques of analysis? What manpower,
organization and resources are required? These and many similar questions
must be decided at the outset so that the research study has greater clarity. It
is similar to having a building plan before the building is constructed. Thus,
according to P.V. Young, the various “considerations which enter into making
decisions regarding what, where, when, how much, by what means constitute a
plan of study or a study design”. Usually the contents or components of a
research design are as follows:

1) Need for the Study: Explain the need for and importance of this study and its
relevance.
2) Review of Previous Studies: Review the previous works done on this topic,
understand what they did, identify gaps and make a case for this study and justify it.
3) Statement of Problem: State the research problem in clear terms and give
a title to the study.

4) Objectives of Study: What is the purpose of this study? What are the
objectives you want to achieve by this study? The statement of objectives should
not be vague. They must be specific and focussed.
5) Formulation of Hypothesis: Conceive the possible outcomes or answers to
the research questions and formulate them into hypotheses so that they can be
tested.
6) Operational Definitions: If the study is using uncommon concepts or
unfamiliar tools or using even the familiar tools and concepts in a specific sense,
they must be specified and defined.
7) Scope of the Study: It is important to define the scope of the study,
because the scope decides what is within its purview and what is outside.
Scope includes the geographical area to be covered, the subject content to be
covered, and the time period to be covered. In other words, the study has a
geographical scope, a content scope and a chronological scope. The territorial
area to be covered by the study should be decided, e.g., only Delhi, the
northern states, or all India. The content scope depends on the problem: for
example, in a study of industrial relations in a particular organization, which
aspects are to be studied and which aspects fall outside the study and hence
are not covered. The chronological scope, i.e., the selection of the time period
and its justification, is also important: whether the study is at a point of time or
longitudinal, say 1991-2003.
8) Sources of Data: This is an important stage in the research design. At this
stage, keeping in view the nature of research, the researcher has to decide the
sources of data from which the data are to be collected. Basically the sources

are divided into primary sources (field sources) and secondary sources
(documentary sources). The data from a primary source are called primary
data, and data from a secondary source are called secondary data. Hence, the
researcher has to decide whether to collect from a primary source, a
secondary source, or both. (This will be discussed in detail in Unit-3).
9) Method of Collection: After deciding the sources for data collection, the
researcher has to determine the methods to be employed for data
collection, primarily, either census method or sampling method. This decision
may depend on the nature, purpose, scope of the research and also time
factor and financial resources.
10) Tools & Techniques: The tools and techniques to be used for collecting
data such as observation, interview, survey, schedule, questionnaire, etc.,
have to be decided and prepared.
11) Sampling Design: If it is a sample study, the sampling techniques, the size
of sample, the way samples are to be drawn etc., are to be decided.
12) Data Analysis: How are you going to process and analyze the data and
information collected? What simple or advanced statistical techniques are
going to be used for analysis and testing of hypothesis, so that necessary
care can be taken at the collection stage.
13) Presentation of the Results of Study: How are you going to present the
results of the study? How many chapters? What is the chapter scheme?
The chapters, their purpose, their titles have to be outlined. It is known as
chapterisation.
14) Time Estimates: What is the time available for this study? Is it limited or
unlimited time? Generally, it is a time bound study. The available or
permitted time must be apportioned between different activities and the
activities to be carried out within the specified time. For example,
preparation of research design one month, preparation of questionnaire one
month, data collection two months, analysis of data two months, drafting of
the report two months, and so on.
15) Financial Budget: The design should also take into consideration the
various costs involved and the sources available to meet them. The
expenditures like salaries (if any), printing and stationery, postage and
telephone, computer and secretarial assistance etc.
16) Administration of the Enquiry: How is the whole thing to be executed?
Who does what and when? All these activities have to be organized
systematically, research personnel have to be identified and trained. They
must be entrusted with the tasks, the various activities are to be
coordinated and the whole project must be completed as per schedule.
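The time apportionment in component 14 reduces to simple arithmetic that can be checked mechanically. The following is a minimal sketch; the activity names and month figures follow the illustrative example given in component 14, and the 9-month deadline is an assumed figure.

```python
# Time budget for the study plan; the activities and month figures
# follow the illustrative example in the text (component 14).
schedule = {
    "research design": 1,
    "questionnaire preparation": 1,
    "data collection": 2,
    "data analysis": 2,
    "report drafting": 2,
}

total_months = sum(schedule.values())
print(f"Total time required: {total_months} months")

# Check that the plan fits the permitted time (an assumed 9-month deadline).
permitted = 9
assert total_months <= permitted, "plan exceeds the permitted time"
```

The same tabulation can be extended with the cost figures of component 15, so that the time and financial budgets are decided together at the design stage.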

Research designs provide guidelines for investigative activity and not necessarily
hard and fast rules that must remain unbroken. As the study progresses, new
aspects, new conditions and new connecting links come to light and it is
necessary to change the plan / design as circumstances demand. A universal
characteristic of any research plan is its flexibility.

Depending upon the method of research, the designs are also known as survey
design, case study design, observation design and experimental design.

2.6 PILOT STUDY AND PRE-TESTING
A Pilot study is a small scale replica of the main study. When a problem is
selected for research, a plan of action is to be designed to proceed further. But
if we do not have adequate knowledge about the subject matter, the nature of
the population (The word ‘population’ as used in statistics denotes the aggregate
from which the sample is to be taken), the various issues involved, the tools
and techniques to be used for operationalizing the research problem, we have to
familiarize ourselves first with it and acquire a good deal of knowledge about
the subject matter of the study and its dimensions. For this purpose, a small
study is conducted before the main study, which is called a Pilot Study. A pilot
study provides better knowledge of the problem and its dimensions. It helps
us to understand the nature of the population to be surveyed and the field
problems to be encountered.
and better instruments. It covers the entire process of research, but on a small
scale. This is also useful for preparing the research design clearly and
specifically.

Pre-Testing is the hallmark of scientific research. Pre-testing means trial


administration of the instrument to sample respondents before finalizing it. It is a
common practice in our day-to-day life to try something on a trial basis before
finally approving it. For example, when a recipe is prepared, a sample is tasted
and corrections are made on that basis. If you give a suit to the tailor for
stitching, you want a trial wear (a pre-test); if you are purchasing a vehicle,
you want a trial drive. Similarly, for data collection some instruments such as
interview schedule, or questionnaire or measurement scale are constructed. We
want to administer it on a trial basis to identify its weaknesses, if any. Such a
trial administration of the instrument is called pre-testing.

While designing the instrument or method, we take all precautions keeping in


view the requirements of the study. We will not be able to identify its defects,
limitations and weaknesses easily. But when others use it, they will be able to
identify them objectively. Therefore, it has to be tested empirically, hence pre-
testing of a draft instrument is a must. Based on the opinions, comments,
criticism, suggestions received and difficulties experienced in the pre-testing the
instrument or method is revised or modified and then finalized for using it in the
main study.

The difference between a pilot study and a pre-test is that the former is a
full-fledged miniature study of a research problem, whereas the latter is a trial
test of a specific aspect of the study, such as a questionnaire.

Self Assessment Exercise D

1) What are the different names of research design?


..................................................................................................................
..................................................................................................................
..................................................................................................................

2) What is meant by research plan?


..................................................................................................................
..................................................................................................................
..................................................................................................................

3) What are the functions of a research design?

..................................................................................................................
..................................................................................................................
..................................................................................................................

4) What do you mean by scope of study?


..................................................................................................................
..................................................................................................................
..................................................................................................................

5) Distinguish between pilot study and pre-test.


..................................................................................................................
..................................................................................................................
..................................................................................................................

2.7 LET US SUM UP


Without a problem, research cannot proceed. A problem is some difficulty
experienced by the researcher in a theoretical or practical situation. Solving this
difficulty is the task of research. The problem for research should ordinarily be
expressed in an interrogative form. If the researcher has a ready problem on
hand he can proceed further. Otherwise, he has to search for a problem. The
problem can be from business in general or functional areas in particular. Other
sources of research problems are: day to day problems, technological changes,
unexplored areas, books, theses, articles, policy problems, etc. Having selected
the problem, it must be defined and specified.

Having specified the problem, the next step is to formulate the objectives of
research so as to give direction to the study. The researcher should also
propose a set of suggested solutions to the problem under study. Such tentative
solutions formulated are called hypotheses. The hypotheses are of various types
such as explanatory hypothesis, descriptive hypothesis, analogical hypothesis,
working hypothesis, null hypothesis and statistical hypothesis. A good hypothesis
must be empirically verifiable, should be relevant, must have explanatory power,
must be as far as possible within the established knowledge, must be simple,
clear and definite. There are four stages in a hypothesis: (a) feeling a problem,
(b) formulating the hypothesis, (c) deductive development of the hypothesis, and
(d) verification/testing of the hypothesis. Verification can be done either directly
or indirectly or through logical methods. Testing is done by using statistical
methods.

Having selected the problem, formulated the objectives and hypothesis, the
researcher has to prepare a blue print or plan of action, usually called a
research design. The design/study plan includes a number of components which
are interdependent and which demand a series of decisions regarding definitions,
scope, methods, techniques, procedures, instruments, time, place, expenditure and
administration aspects.

If the problem selected for research is not a familiar one, a pilot study may be
conducted to acquire knowledge about the subject matter, and the various issues
involved. Then, for collection of data, instruments and/or scales have to be
constructed, which have to be pre-tested before finally accepting them for use.

2.8 KEY WORDS
Hypothesis : A hypothesis is a tentative answer / solution to the research
problem, whose validity remains to be tested.
Pilot Study : A study conducted to familiarize oneself first with the research
problem so that it can be operationalised with a good deal of knowledge about
the problem.
Pre-Test : A trial administration of an instrument such as a questionnaire or
scale to identify its weaknesses is called a pre-test.
Research Design : It is a systematic plan (planning) to direct a piece of
research work.
Research Problem : A research problem is a felt need, which needs an
answer/solution.
Testing of Hypothesis : It means verification of a hypothesis as true or false
in the light of facts.

2.9 ANSWERS TO SELF ASSESSMENT EXERCISES


A. 1) felt 2) goal / direction
3) interrogative 4) information / data
5) source of problems

2.10 TERMINAL QUESTIONS


A) Short answer Questions:
1) What is meant by a research problem?
2) What do you mean by specification of the problem?
3) What is the need for formulating research objectives?
4) What do you mean by a hypothesis?
5) What are the stages in a hypothesis?
6) What do you mean by testing of a hypothesis?
7) What is a research design?
8) What are the functions of a research design?
9) What is a Pilot Study?
10) What do you mean by Pre-testing?

B) Essay Type Questions:


1) What is a research problem? Explain the sources of research problems.
2) What do you mean by a problem? Explain the various points to be
considered while selecting a problem.
3) Explain how you will select and specify a research problem.
4) What do you mean by a hypothesis? What are the different types of
hypotheses?


5) What is meant by a hypothesis? Explain the criteria for a workable
hypothesis.
6) What are the different stages in a hypothesis? How do you verify /
test a hypothesis?
7) What is a research design? Explain the functions of a research design.
8) Define a research design and explain its contents.
9) What are the various components of a research design?
10) Distinguish between pilot study and pre-test. Also explain the need for
Pilot study and pre-testing.

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

2.11 FURTHER READING


The following text books may be used for more indepth study on the topics
dealt with in this unit.
Fred N. Kerlinger. Foundations of Behavioural Research, Surjeet Publications,
Delhi.
O.R.Krishna Swamy. Methodology of Research in Social Sciences, Himalaya
Publishing House, Mumbai.
T.S.Wilkinson & P.L.Bhandarkar. Methodology and Techniques of Social
Research, Himalaya Publishing House, Mumbai.
C.R.Kothari. Research Methodology, Wiley Eastern, New Delhi.
V.P.Michael. Research Methodology in Management, Himalaya Publishing
House, Mumbai.


UNIT 3 COLLECTION OF DATA
STRUCTURE
3.0 Objectives
3.1 Introduction
3.2 Meaning and Need for Data
3.3 Primary and Secondary Data
3.4 Sources of Secondary Data
3.4.1 Documentary Sources of Data
3.4.2 Electronic Sources
3.4.3 Precautions in Using Secondary Data
3.4.4 Merits and Limitations of Secondary Data
3.5 Methods of Collecting Primary Data
3.5.1 Observation Method
3.5.2 Interview Method
3.5.3 Through Local Reporters and Correspondents
3.5.4 Questionnaire and Schedule Methods
3.6 Choice of Suitable Method
3.7 Let Us Sum Up
3.8 Key Words
3.9 Answers to Self Assessment Exercises
3.10 Terminal Questions
3.11 Further Reading

3.0 OBJECTIVES
On the completion of this unit, you should be able to:
l discuss the necessity and usefulness of data collection,
l explain and distinguish between primary data and secondary data,
l explain the sources of secondary data and its merits and demerits,
l describe different methods of collecting primary data and their merits and
demerits,
l examine the choice of a suitable method, and
l examine the reliability, suitability and adequacy of secondary data.

3.1 INTRODUCTION
In Unit 2, we have discussed about the selection of a research problem and
formulation of research design. A research design is a blue print which directs
the plan of action to complete the research work. As we have mentioned
earlier, the collection of data is an important part in the process of research
work. The quality and credibility of the results derived from the application of
research methodology depends upon the relevant, accurate and adequate data.
In this unit, we shall study about the various sources of data and methods of
collecting primary and secondary data with their merits and limitations and also
the choice of suitable method for data collection.

3.2 MEANING AND NEED FOR DATA


Data is required to make a decision in any business situation. The researcher is
faced with one of the most difficult problems of obtaining suitable, accurate and
adequate data. Utmost care must be exercised while collecting data because

the quality of the research results depends upon the reliability of the data.
Suppose, you are the Director of your company. Your Board of Directors has
asked you to find out why the profit of the company has decreased since the
last two years. Your Board wants you to present facts and figures. What are
you going to do?

The first and foremost task is to collect the relevant information to make an
analysis of the above-mentioned problem. Information collected from various
sources for a specific purpose, which can be expressed in quantitative form, is
called data. The rational decision maker seeks to
evaluate information in order to select the course of action that maximizes
objectives. For decision making, the input data must be appropriate. This
depends on the appropriateness of the method chosen for data collection. The
application of a statistical technique is possible when the questions are
answerable in quantitative nature, for instance; the cost of production, and profit
of the company measured in rupees, age of the workers in the company
measured in years. Therefore, the first step in statistical activities is to gather
data. The data may be classified as primary and secondary data. Let us now
discuss these two kinds of data in detail.

3.3 PRIMARY AND SECONDARY DATA


The Primary data are original data which are collected for the first time for a
specific purpose. Such data are published by authorities who themselves are
responsible for their collection. The Secondary data on the other hand, are
those which have already been collected by some other agency and which have
already been processed. Secondary data may be available in the form of
published or unpublished sources. For instance, population census data
collected by the Government in a country is primary data for that Government.
But the same data becomes secondary for those researchers who use it later.
In case you have decided to collect primary data for your investigation, you
have to identify the sources from where you can collect that data. For example,
if you wish to study the problems of the workers of X Company Ltd., then the
workers who are working in that company are the source. On the other hand,
if you have decided to use secondary data, you have to identify the secondary
sources that have already collected the related data for their own purposes.

With the above discussion, we can understand that the difference between
primary and secondary data is only a matter of degree: data which are primary
in the hands of one become secondary in the hands of another.

Self Assessment Exercise A


1) What do you mean by data? Why it is needed for research?
.............................................................................................................
.............................................................................................................
.............................................................................................................
2) Distinguish between primary and secondary data. Illustrate your answer with
examples.
.............................................................................................................
.............................................................................................................
.............................................................................................................

3.4 SOURCES OF SECONDARY DATA
We have discussed above the meaning of primary and secondary data.
Sometimes, it is not possible to collect primary data due to time, cost and
human resource constraints. Therefore, researchers have to take the help of
secondary data. Now let us discuss (a) the various sources from which one
can get secondary data, and (b) precautions while using secondary data, along
with its merits and demerits and some documentary and electronic sources of
data in India.

3.4.1 Documentary Sources of Data

This category of secondary data source may also be termed as Paper Source.
The main sources of documentary data can be broadly classified into two
categories:

a) Published sources, and


b) Unpublished sources.
Let us discuss these two categories in detail.

a) Published Sources
There are various national and international institutions, semi-official reports of
various committees and commissions and private publications which collect and
publish statistical data relating to industry, trade, commerce, health etc. These
publications of various organisations are useful sources of secondary data.
These are as follows:

1) Government Publications: Central and State Governments publish current


information alongwith statistical data on various subjects, quarterly and annually.
For example, Monthly Statistical Abstract, National Income Statistics, Economic
Survey, Reports of National Council of Applied Economic Research (NCAER),
Federation of Indian Chambers of Commerce and Industry (FICCI), Indian
Council of Agricultural Research (ICAR), Central Statistical Organisation
(CSO), etc.
2) International Publications: The United Nations Organisation (UNO),
International Labour Organisation (ILO), International Monetary Fund (IMF),
World Bank, Asian Development Bank (ADB) etc., also publish relevant data
and reports.
3) Semi-official Publications: Semi-official organisations like Corporations,
District Boards, Panchayat etc. publish reports.
4) Committees and Commissions: Several committees and commissions
appointed by State and Central Governments provide useful secondary data. For
example, the report of the 10th Financial Commission or Fifth Pay Commissions
etc.
5) Private Publications: Newspapers and journals publish the data on different
fields of Economics, Commerce and Trade. For example, Economic Times,
Financial Express etc. and Journals like Economist, Economic and Political
Weekly, Indian Journal of Commerce, Journal of Industry and Trade, Business
Today etc. Some of the research and financial institutions also publish their
reports annually like Indian Institute of Finance. In addition to this, reports
prepared by research scholars, universities etc. also provide secondary source
of information.

b) Unpublished Sources

It is not necessary that all the information/data maintained by the institutions or


individuals are available in published form. Certain research institutions, trade
associations, universities, research scholars, private firms, business institutions
etc., do collect data but they normally do not publish it. We can get this
information from their registers, files etc.

3.4.2 Electronic Sources

Secondary data is also available through electronic media (the Internet). You
can download data from such sources by visiting web sites or search engines
such as google.com, yahoo.com or msn.com and typing the subject for which
the information is needed.

You can also find secondary data on electronic sources like CDs, and the
following online journals:

Electronic Journal http://businessstandard.com


Electronic Journal http://www.businessworldindia.com
Electronic Journal http://www.business-today.com
Electronic Journal http://www.india-invest.com
Census of India http://www.censusindia.net
Union Budget and Economic Survey http://www.indianbudget.nic.in
Directory of Government of India http://goidirectory.nic.in
Institutions
Indian Council of Agricultural Research http://www.icar.org.in
Ministry of Commerce and Industry http://www.commin.nic.in
Indian Institute of Foreign Trade http://www.iift.edu
Department of Industrial Policy and http://www.dipp.nic.in
Promotion, Ministry of Commerce and
Industry
Ministry of Consumer Affairs, Food & http://www.fccimin.in
Public Distribution
Khadi and Village Industries http://www.kvic.org.in
Board for Industrial & Financial http://www.bifr.nic.in
Reconstruction
Building Material & Technology http://www.bmtpc.org
Promotion Council
Central Food Technological Research http://www.cftri.com
Institute
National Council for Traders Information http://www.ncti-india.com
National Handloom Development http://www.nhdcltd.com
Corporation Ltd.
The Associated Chamber of Commerce http://www.assochm.org
and Industry
Federation of Indian Chambers of http://www.ficiofindia.com
Commerce and Industry

Now you have learnt that the secondary data are available in documents, either
published or unpublished, and electronic sources.
precautions while using secondary data in research. Let us discuss them in
detail.

3.4.3 Precautions in Using Secondary Data

With the above discussion, we can understand that there are many published
and unpublished sources from which a researcher can get secondary data.
However, the researcher must be cautious in using this type of data. The reason is that
such type of data may be full of errors because of bias, inadequate size of the
sample, errors of definitions etc. Bowley expressed that it is never safe to take
published or unpublished statistics at their face value without knowing their
meaning and limitations. Hence, before using secondary data, you must examine
the following points.

Suitability of Secondary Data


Before using secondary data, you must ensure that the data are suitable for the
purpose of your enquiry. For this, you should compare the objectives, nature and
scope of the given enquiry with those of the original investigation. For example,
suppose the objective of our enquiry is to study the salary pattern of a firm,
including the perks and allowances of employees, but secondary data are
available only on basic pay. Such data are not suitable for the purpose of the
study.

Reliability of Secondary Data


The reliability of secondary data can be tested by examining: i) the unbiasedness
of the collecting person, ii) whether there was a proper check on the accuracy of
the field work, iii) whether the editing, tabulating and analysis were done
carefully, iv) the reliability of the source of information, and v) the methods
used for the collection and analysis of the data. If the collecting organisations
are government, semi-government or international bodies, the secondary data are
generally more reliable than data collected by individuals and private organisations.

Adequacy of Secondary Data

Adequacy of secondary data is to be judged in the light of the objectives of the
research. For example, suppose our objective is to study the growth of industrial
production in India, but the published report provides information on only a few
states; then the data would not serve the purpose. Adequacy of the data may
also be considered in the light of the duration of time for which the data are
available. For example, for studying the trend of per capita income of a
country we need data for the last 10 years, but if the information is available
for the last 5 years only, it would not serve our objective.

Hence, we should use secondary data if it is reliable, suitable and adequate.
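The time-coverage part of the adequacy check can be mechanised when screening candidate secondary sources. A minimal Python sketch (the function name and the year ranges are our own illustration, not part of the text):

```python
def covers_period(available_years, needed_years):
    """True only if the secondary source covers every year the study needs."""
    return set(needed_years) <= set(available_years)

# Per capita income example from above: the study needs the last 10 years,
# but the source publishes only the last 5.
needed = range(1994, 2004)
available = range(1999, 2004)
print(covers_period(available, needed))  # False: the data are inadequate
```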

3.4.4 Merits and Limitations of Secondary Data

Merits

1) Secondary data is much more economical and quicker to collect than primary
data, as we need not spend time and money on designing and printing data
collection forms (questionnaire/schedule), appointing enumerators, editing and
tabulating data etc.
2) It is impossible for an individual or a small institution to collect primary data
with regard to some subjects, such as population census, imports and exports of
different countries, national income data etc., but such data can be obtained
from secondary sources.
Limitations
1) Using secondary data is risky because it may not be suitable, reliable or
adequate, and it is often difficult to find data which exactly fit the needs of the
present investigation.
2) It is difficult to judge whether the secondary data is sufficiently accurate or not
for our investigation.
3) Secondary data may not be available for some investigations. For example,
bargaining strategies in live products marketing, impact of T.V. advertisements
on viewers, opinion polls on a specific subject, etc. In such situations we have to
collect primary data.
Self Assessment Exercise B

1) Write names of five web sources of secondary data which have not been
included in the above table.
....................................................................................................................
....................................................................................................................
....................................................................................................................
2) Explain the merits and limitations of using secondary data.
....................................................................................................................
....................................................................................................................
....................................................................................................................
3) What precautions must a researcher take before using the secondary data?
....................................................................................................................
....................................................................................................................
....................................................................................................................

4) In the following situations indicate whether data from a census should be
taken:
i) A TV manufacturer wants to obtain data on customer preferences with
respect to size of TV.
ii) IGNOU wants to determine the acceptability of its employees for
subscribing to a new employee insurance programme.
.............................................................................................................
.............................................................................................................
.............................................................................................................

3.5 METHODS OF COLLECTING PRIMARY DATA


If the available secondary data does not meet the requirements of the present
study, the researcher has to collect primary data. As mentioned earlier, the data
which is collected for the first time by the researcher for his own purpose is
called primary data. There are several methods of collecting primary data, such
as observation, interviews through reporters, questionnaires and schedules. Let us
study them in detail.

3.5.1 Observation Method

The Concise Oxford Dictionary defines observation as, ‘accurate watching and
noting of phenomena as they occur in nature with regard to cause and effect
or mutual relations’. Thus observation is not only a systematic watching but it
also involves listening and reading, coupled with consideration of the seen
phenomena. It involves three processes. They are: sensation, attention or
concentration and perception.

Under this method, the researcher collects information directly through


observation rather than through the reports of others. It is a process of
recording relevant information without asking anyone specific questions and in
some cases, even without the knowledge of the respondents. This method of
collection is highly effective in behavioural surveys. For instance, a study on
behaviour of visitors in trade fairs, observing the attitude of workers on the job,
bargaining strategies of customers etc. Observation can be participant
observation or non-participant observation. In Participant Observation
Method, the researcher joins in the daily life of informants or organisations,
and observes how they behave. In the Non-participant Observation Method,
the researcher will not join the informants or organisations but will watch from
outside.

Merits
1) This is the most suitable method when the informants are unable or reluctant to
provide information.
2) This method provides deeper insights into the problem and generally the data is
accurate and quicker to process. Therefore, this is useful for intensive study
rather than extensive study.
Limitations
Despite the above merits, this method suffers from the following limitations:

1) In many situations, the researcher cannot predict when the events will occur. So
when an event occurs there may not be a ready observer to observe the event.
2) Participants may be aware of the observer and as a result may alter their
behaviour.
3) Observer, because of personal biases and lack of training, may not record
specifically what he/she observes.
4) This method cannot be used extensively if the inquiry is large and spread over a
wide area.
3.5.2 Interview Method
Interview is one of the most powerful tools and most widely used method for
primary data collection in business research. In our daily routine we see
interviews on T.V. channels on various topics related to social, business, sports,
budget etc. In the words of C. William Emory, ‘personal interviewing is a two-
way purposeful conversation initiated by an interviewer to obtain information
that is relevant to some research purpose’. Thus an interview is basically a
meeting between two persons to obtain information related to the proposed
study. The person who is interviewing is called the interviewer and the person
who is being interviewed is called the informant. It is to be noted that the
research data/information collected through this method comes not only from the
conversation between the investigator and the informant; the glances,
gestures, facial expressions, level of speech etc. are all part of the process.
Through this method, the researcher can collect varied types of data intensively
and extensively.

Interviews can be classified as direct personal interviews and indirect personal


interviews. Under the techniques of direct personal interview, the investigator
meets the informants (who come under the study) personally, asks them
questions pertaining to enquiry and collects the desired information. Thus if a
researcher intends to collect the data on spending habits of Delhi University
(DU) students, he/ she would go to the DU, contact the students, interview
them and collect the required information.

Indirect personal interview is another technique of interview method where it


is not possible to collect data directly from the informants who come under the
study. Under this method, the investigator contacts third parties or witnesses,
who are closely associated with the persons/situations under study and are
capable of providing necessary information. For example, an investigation
regarding a bribery pattern in an office. In such a case it is necessary to get
the desired information indirectly from other people who may know the facts.
Similarly, clues about the crimes are gathered by the CBI. Utmost care must be
exercised that these persons who are being questioned are fully aware of the
facts of the problem under study, and are not motivated to give a twist to the
facts.

Another technique for data collection through this method can be structured and
unstructured interviewing. In a Structured interview, set questions are asked
and the responses are recorded in a standardised form. This is useful in
large-scale interviews where a number of investigators are assigned the job of
interviewing, and the researcher can minimise the bias of the interviewers. This
technique is also called a formal interview. In an Unstructured interview, the
investigator may not have a set of questions but only a number of key
points around which to build the interview. Normally, such interviews
are conducted in the case of an exploratory survey where the researcher is not
completely sure about the type of data he/she needs to collect. It is also called
an informal interview. Generally, this method is used as a supplementary method of
data collection in conducting research in business areas.

Now-a-days, telephone or cellphone interviews are widely used to obtain the


desired information for small surveys. For instance, interviewing credit card
holders by banks about the level of services they are receiving. This technique
is used in industrial surveys specially in developed regions.

Merits
The major merits of this method are as follows:

1) People are more willing to supply information if approached directly. Therefore,


personal interviews tend to yield high response rates.
2) This method enables the interviewer to clarify any doubt that the interviewee
might have while asking him/her questions. Therefore, interviews are helpful in
getting reliable and valid responses.
3) The informant’s reactions to questions can be properly studied.
4) The researcher can adapt the language of communication to the level of the
informant, and can thereby obtain personal information about informants which is
helpful in interpreting the results.
Limitations
The limitations of this method are as follows:

1) There is a chance that the subjective factors or views of the investigator may
come in, either consciously or unconsciously.
2) The interviewers must be properly trained, otherwise the entire work may be
spoiled.
3) It is a relatively expensive and time-consuming method of data collection
especially when the number of persons to be interviewed is large and they are
spread over a wide area.
4) It cannot be used when the field of enquiry is large (large sample).
Precautions : While using this method, the following precautions should be
taken:

l Obtain thorough details of the theoretical aspects of the research problem.


l Identify who is to be interviewed.
l The questions should be simple, clear and limited in number.
l The investigator should be sincere, efficient and polite while collecting data.
l The investigator should be of the same area (field of study, district, state etc.).

Self Assessment Exercise C

1) How can data be collected through the Observation Method?


....................................................................................................................
....................................................................................................................
....................................................................................................................
2) Distinguish between the observation and the interview method of data
collection.
....................................................................................................................
....................................................................................................................
....................................................................................................................

3.5.3 Through Local Reporters and Correspondents

Under this method, local investigators/agents or correspondents are appointed in


different parts of the area under investigation. This method is generally adopted
by government departments in those cases where regular information is to be
collected. This method is also useful for newspapers, magazines, radio and TV
news channels. This method has been used when regular information is required

5 2
Geektonight Notes

and a high degree of accuracy is not of much importance. Collection of Data

Merits
1) This method is cheap and economical for extensive investigations.
2) It gives results easily and promptly.
3) It can cover a wide area under investigation.
Limitations
1) The data obtained may not be reliable.
2) It gives approximate and rough results.
3) It is unsuitable where a high degree of accuracy is desired.
4) As the agent/reporter or correspondent uses his own judgement, his personal
bias may affect the accuracy of the information sent.
3.5.4 Questionnaire and Schedule Methods

Questionnaire and schedule methods are the popular and common methods for
collecting primary data in business research. Both the methods comprise a list
of questions arranged in a sequence pertaining to the investigation. Let us study
these methods in detail one after another.

i) Questionnaire Method

Under this method, questionnaires are sent personally or by post to various


informants with a request to answer the questions and return the questionnaire.
If the questionnaire is posted to informants, it is called a Mail Questionnaire.
Sometimes questionnaires may also be sent through E-mail, depending upon the
nature of the study and the availability of time and resources. After receiving
the questionnaires, the informants read the questions and record their responses
in the space meant for the purpose on the questionnaire. It is desirable to send
the questionnaire with self-addressed envelopes for a quick and high rate of
response.

Merits
1) You can use this method in cases where informants are spread over a vast
geographical area.
2) Respondents can take their own time to answer the questions. So the researcher
can obtain original data by this method.
3) This is a cheap method because its mailing cost is less than the cost of personal
visits.
4) This method is free from bias of the investigator as the information is given by
the respondents themselves.
5) Large samples can be covered and thus the results can be more reliable and
dependable.
Limitations
1) Respondents may not return filled in questionnaires, or they can delay in replying
to the questionnaires.
2) This method is useful only when the respondents are educated and co-operative.
3) Once the questionnaire has been despatched, the investigator cannot modify the
questionnaire.
4) It cannot be ensured whether the respondents are truly representative.
ii) Schedule Method

As discussed above, a Schedule is also a list of questions, which is used to


collect the data from the field. This is generally filled in by the researcher or
the enumerators. If the scope of the study is wide, then the researcher appoints
people who are called enumerators for the purpose of collecting the data. The
enumerators go to the informants, ask them the questions from the schedule in
the order they are listed and record the responses in the space meant for the
answers in the schedule itself. For example, the population census all over the
world is conducted through this method. The difference between a questionnaire
and a schedule is that the former is filled in by the informants, while the latter
is filled in by the researcher or enumerator.

Merits
1) It is a useful method in case the informants are illiterates.
2) The researcher can overcome the problem of non-response as the enumerators
go personally to obtain the information.
3) It is very useful in extensive studies and can obtain more reliable data.
Limitations
1) It is a very expensive and time-consuming method as enumerators are paid
persons and also have to be trained.
2) Since the enumerator is present, the respondents may not respond to some
personal questions.
3) Reliability depends upon the sincerity and commitment in data collection.
The success of data collection through the questionnaire method or schedule
method depends on how the questionnaire has been designed.

Designing the Questionnaire


The success of collecting data either through the questionnaire method or
through the schedule method depends largely on the proper design of the
questionnaire. This is a specialised job and requires a high degree of skill,
experience, thorough knowledge of the research topic, ability to frame questions
and a great deal of patience. There are no hard and fast rules in designing the
questionnaire. However, the following general guidelines may be helpful in this
connection.

l The number of questions should be minimised as far as possible because


informants may not like to spend much time answering a lengthy questionnaire.
l The questions should be precise, clear and unambiguous. Lengthy questions tend
to confuse the informant.
l Choose the appropriate type of questions. Generally there are five kinds of
questions used in questionnaires. They are as follows:


i) Simple choice questions which offer the respondents a choice between two
answers, such as ‘Yes’ or ‘No’, ‘Right’ or ‘Wrong’. ‘Do you own a
computer?’ can easily be answered with ‘Yes’ or ‘No’.
ii) Multiple choice questions are often used as a follow-up to simple choice
questions. This type of question provides a choice between a number of
factors that might influence informant preferences. For example, where do
you sell your agricultural products? a) In village market, b) In a regulated
market, c) To commission agent, d) Any other…
iii) Open-ended questions allow the informants to give any related answer in
their own words. For example, what should be done to enhance the
practical utility of commerce programmes?
iv) Specific questions which require specific information. For example, “From
where did you take the loan for your business?”
v) Scaled questions are used to record how strongly the opinions are
expressed. For example, How do you rate the facilities provided by the
market committee?
a) Very good, b) Good, c) Normal, d) Bad, or e) Very bad.
l The questions should be arranged in a logical sequence to avoid embarrassment.
For example, it would be illogical to first ask ‘How many children do you
have?’ and only then ask ‘Are you married?’
l Questions which require calculations should be avoided. For example, a
question regarding the yearly income of respondents who are paid daily wages
or piece wages should not be asked.
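When a survey is administered electronically, the five kinds of questions above can be represented as simple data records so that each answer is validated against the allowed options. A minimal Python sketch (the field names, the sample questions and the `validate` helper are our own illustration, not prescribed by the text):

```python
# Each question is a dict tagged with one of the five kinds discussed above.
questionnaire = [
    {"type": "simple_choice", "text": "Do you own a computer?",
     "options": ["Yes", "No"]},
    {"type": "multiple_choice", "text": "Where do you sell your agricultural products?",
     "options": ["Village market", "Regulated market", "Commission agent", "Any other"]},
    {"type": "open_ended",
     "text": "What should be done to enhance the practical utility of commerce programmes?"},
    {"type": "specific", "text": "From where did you take the loan for your business?"},
    {"type": "scaled", "text": "How do you rate the facilities provided by the market committee?",
     "scale": ["Very good", "Good", "Normal", "Bad", "Very bad"]},
]

def validate(question, answer):
    """Accept an answer only if it matches the question's type."""
    if question["type"] in ("simple_choice", "multiple_choice"):
        return answer in question["options"]
    if question["type"] == "scaled":
        return answer in question["scale"]
    # Open-ended and specific questions accept any non-empty text.
    return bool(answer.strip())

print(validate(questionnaire[0], "Yes"))    # True
print(validate(questionnaire[0], "Maybe"))  # False
```

Representing questions as data rather than free text also makes later tabulation of the responses straightforward.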
Pilot testing or Pre-testing the Questionnaire
Before finalising the questionnaire, it is desirable to carry out a preliminary
experiment on a sample basis. The investigator should examine each question to
ensure that the question is not confusing, leading to biased responses etc. The
real test of a questionnaire is how it performs under actual conditions of data
collection. This test can be carried out among small groups of subjects in order
to provide an estimate of the time needed for responding to the survey. The
questionnaire pre-test serves the same role in questionnaire design as testing a
new product in the market. As test marketing provides the real test of
customer reactions to the product and the accompanying marketing programmes,
in the same way, the pre-test provides the real test of the questionnaire.
Therefore this work must be done with utmost care and caution to yield good
results.
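One concrete way to act on a pre-test is to measure item non-response: a question skipped by many pilot respondents is a candidate for rewording before the questionnaire is finalised. A small Python sketch with invented pilot data (the question labels and the 50% flagging threshold are assumptions for illustration only):

```python
# Hypothetical pilot responses; None marks a skipped (non-responded) item.
pilot_responses = [
    {"q1": "Yes", "q2": "IBM",  "q3": None},
    {"q1": "No",  "q2": None,   "q3": None},
    {"q1": "Yes", "q2": "Dell", "q3": "Good"},
    {"q1": "Yes", "q2": "HCL",  "q3": None},
]

def nonresponse_rates(responses):
    """Fraction of respondents who skipped each question."""
    n = len(responses)
    questions = responses[0].keys()
    return {q: sum(r[q] is None for r in responses) / n for q in questions}

rates = nonresponse_rates(pilot_responses)
# A question skipped by more than half the pilot group is flagged for review.
flagged = [q for q, rate in rates.items() if rate > 0.5]
print(rates)    # q3 was skipped by 3 of 4 respondents
print(flagged)  # ['q3']
```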

Specimen Questionnaire
The following specimen questionnaire incorporates most of the qualities which
we have discussed above. It relates to ‘Computer User Survey’.

Computer User Survey
1. What brand of Computer do you primarily use?
(i) IBM (ii) Compaq
(iii) HCL (iv) Dell
(v) Siemens (vi) Any other
_____________
(please specify)
2. Where was the computer purchased?
(i) Computer store (ii) Mail order
(iii) Manufacturer (iv) Company Dealer
(v) Any other _____________
3. How long have you been using computers? ______years
_____months.
4. In a week about how many hours do you spend on the computer ____
hours?
5. Which database management package do you use most often?
(i) Dbase-II (ii) Dbase-III
(iii) Lotus 1,2,3 (iv) MS-Excel
(v) Oracle (vi) Any other
_____________
(please specify)
6. Does the computer that you primarily use have a hard disk?
Yes No
7. Where did you obtain the software that you use?
(i) Computer user group (ii) Regular dealer
(iii) Mail order (iv) Directly from Software dealer
(v) Any other _____________
8. On the following 9-point scale, rate the degree of difficulty that you have
encountered in using the computer.
Extremely difficult 1 2 3 4 5 6 7 8 9 Not difficult
9. If you have to purchase a personal computer today, which one would
you be most likely to purchase?
(i) IBM (ii) Compaq
(iii) HCL (iv) Dell
(v) Siemens (vi) Any other
_____________
(please specify)
10. What is your sex? Male Female
11. Please state your date of birth .............. ............... ..............
Month Day Year
12. Your Qualifications
(i) Secondary (ii) Sr. Secondary
(iii) Graduate (iv) Post-graduate
(v) Doctorate (vi) Any other
_____________
(please specify)
13. Which of the following best describes your primary field of
employment?
(i) Medical (ii) Education
(iii) Business (iv) Government
(v) Technical (vi) Any other ____________
(please specify)
14. What is your current Salary?
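Once completed questionnaires are returned, closed-ended items such as Q1 above are tabulated by counting how often each option was chosen. A Python sketch using invented answers (the data are hypothetical, not actual survey results):

```python
from collections import Counter

# Hypothetical answers to Q1 ("What brand of computer do you primarily use?")
# gathered from returned questionnaires.
q1_answers = ["IBM", "Dell", "HCL", "IBM", "Compaq", "IBM", "Dell", "HCL"]

tally = Counter(q1_answers)
total = len(q1_answers)

# Print each brand with its count and percentage share, most frequent first.
for brand, count in tally.most_common():
    print(f"{brand:8s} {count:3d}  ({100 * count / total:.1f}%)")
```

The same tally, divided by the number of returned questionnaires rather than the number of answers, also gives the item response rate for each question.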
3.6 CHOICE OF SUITABLE METHOD
You have noticed that there are various methods and techniques for the
collection of primary data. You should be careful to select a method which is
appropriate and effective. The selection of the method depends upon various
factors like the scope and objectives of the inquiry, time, availability of funds,
subject matter of the research, the kind of information required, the degree of
accuracy etc. As discussed, every method has its own merits and demerits. For
example, the observation method is suitable for field surveys where the incident
is actually happening; the interview method is suitable where direct observation
is not possible; the local reporter/correspondent method is suitable when
information is required at regular intervals. The questionnaire method is
appropriate for extensive enquiries where the sample is large and scattered
over a large geographical area and the respondents are able to express their
responses in writing. The schedule method is suitable in case the respondents
are illiterate.

Self Assessment Exercise D


1) List out the methods of collecting primary data.
....................................................................................................................
....................................................................................................................
...................................................................................................................
2) Point out the major problems in constructing questionnaires.
....................................................................................................................
....................................................................................................................
...................................................................................................................
3) Distinguish between direct personal interview and indirect interview. Give
suitable examples.
....................................................................................................................
....................................................................................................................
...................................................................................................................
4) Distinguish between a Schedule and a Questionnaire.
....................................................................................................................
....................................................................................................................
...................................................................................................................
5) Are the following statements true or false?
a) The interview method introduces more bias than the use of a questionnaire.
b) ‘Yes’ or ‘No’ type questions should not be used in questionnaires unless
only one of the two answers is possible.
c) Open questions are more difficult than most other types to tabulate.
c) Open questions are more difficult than most other types to tabulate.

3.7 LET US SUM UP


In this unit we elaborated on the meaning of data, methods of data collection,
merits and limitations of data collection, precautions which are needed for the
collection of data.

The information collected from various processes for a specific purpose is
called data. Statistical data may be either primary data or secondary data. Data
which is collected originally for a specific purpose is called primary data. The
data which is already collected and processed by some one else and is being
used now in the present study, is called secondary data. Secondary data can be
obtained either from published sources or unpublished sources. It should be used
if it is reliable, suitable and adequate, otherwise it may result in misleading
conclusions. It has its own merits and demerits. There are several problems in
the collection of primary data. These are: tools and techniques of data
collection, degree of accuracy, designing the questionnaire, selection and training
of enumerators, problem of tackling non-responses and other administrative
aspects.

Several methods are used for collection of primary data. These are: observation,
interview, questionnaire and schedule methods. Every method has its own merits
and demerits. Hence, no method is suitable in all situations. The suitable method
can be selected as per the needs of the investigator, which depend on the
objectives, nature and scope of the enquiry, and the availability of funds and time.

3.8 KEY WORDS


Data: Quantitative or/ and qualitative information, collected for study and
analysis.
Interview: A method of collecting primary data by meeting the informants and
asking the questions.
Observation: The process of observing individuals in controlled situations.
Questionnaire: is a device for collection of primary data containing a list of
questions pertaining to enquiry, sent to the informants, and the informant himself
writes the answers.
Primary Data: Data that is collected originally for the first time.
Secondary Data: Data which were collected and processed by someone else
but are being used in the present study.
Published Sources: Sources which consist of published statistical information.
Schedule: is a device for collection of primary data containing a list of
questions to be filled in by the enumerators who are specially appointed for that
purpose.

3.9 ANSWERS TO SELF ASSESSMENT EXERCISES


A. 1) In many business companies, some of the data required for statistical
analysis are obtained from internal sources like computer files of accounting
data. Together with internal data, business often uses data from external
sources. For example, aggregate data on national economic activity are
readily available from CSO, annual report of Ministry of Labour,
Government of India.
2) Data which is collected originally is called primary data, and the same data
collected by others is called secondary data. For example, a researcher
interested in knowing consumers’ choice of toothpaste brand must make a
survey and collect data on the opinions of the consumers. This is called
primary data. The data obtained from published
and unpublished sources is called secondary data.
B. 1) l http://www.bis.org.in
l http://www.business-today.com
l http://www.businessonlineindia.com
l http://www.indiacofee.org
l http://www.dgft.nic.in
4) i) No, ii) No

D. 1) Observation Method, interview method, data collection through reporters/


correspondents, questionnaire and schedule methods.

2) a) What information will be sought?


b) What type of questionnaire will be required?
c) How many questions will be used?
d) What will the content of the individual questions be?
e) How will those questionnaires be administered?

3) i) Personal Interview: Under this method, the investigator collects


information personally from the concerned sources.
ii) Indirect Interview : Under this method, the investigator contacts third
parties or witnesses capable of supplying the necessary information.
5) a) True b) True c) True.

3.10 TERMINAL QUESTIONS


1) What precautions would you take while using data from secondary
sources?
2) Explain what precautions must be taken while designing a questionnaire
in order that it may be really useful. Illustrate your answer giving
suitable examples.
3) Construct a suitable questionnaire containing not more than twenty five
questions pertaining to the sales promotion of your company’s product.
4) Distinguish between the following :
a) Primary and Secondary Data
b) Internal and External Data
c) A Schedule and Questionnaire
5) Explain the various methods of collecting primary data pointing out their
merits and demerits?
6) What is the need for pre-testing the drafted questionnaire?

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

3.11 FURTHER READING
The following text books may be used for a more in-depth study of the topics
dealt with in this unit.
Kothari, C.R. 2004. Research Methodology: Methods and Techniques, New
Age International (P) Limited : New Delhi.
Rao K.V. 1993. Research Methodology in Commerce and Management,
Sterling Publishers Private Limited : New Delhi.
Sadhu, A.N. and A. Singh, 1980. Research Methodology in Social Sciences,
Sterling Publishers Private Limited : New Delhi.

UNIT 4 SAMPLING
STRUCTURE
4.0 Objectives
4.1 Introduction
4.2 Census and Sample
4.3 Why Sampling?
4.4 Essentials of a Good Sample
4.5 Methods of Sampling
4.5.1 Random Sampling Methods
4.5.2 Non-Random Sampling Methods
4.6 Sample Size
4.7 Sampling and Non-Sampling Errors
4.7.1 Sampling Errors
4.7.2 Non-Sampling Errors
4.7.3 Control of Errors
4.8 Let Us Sum Up
4.9 Key Words
4.10 Answers to Self Assessment Exercises
4.11 Terminal Questions
4.12 Further Reading

4.0 OBJECTIVES
After studying this Unit, you should be able to:

l distinguish between census and sampling study,


l explain various reasons for opting for the sample method,
l explain the different methods of sampling and their advantages and
disadvantages,
l describe the sampling and non-sampling errors and minimize them, and
l design a representative sample from a population keeping both cost and
precision in mind.

4.1 INTRODUCTION
In the previous Unit 3, we have studied the types of data (primary and
secondary data) and various methods and techniques of collecting the primary
data. The desired data may be collected by selecting either census method or
sampling method.

Researchers usually cannot make direct observations of every unit of the
population they are studying, for a variety of reasons. Instead, they collect
data from a subset of the population (a sample) and use these observations to
draw inferences about the entire population. Ideally, the characteristics of a
sample should correspond to the characteristics of the population from which the
sample was drawn. In that case, the conclusions drawn from the sample are
probably applicable to the entire population.

In this Unit, we shall discuss the basics of sampling, particularly how to get a
sample that is representative of a population. It covers different methods of
drawing samples which can save a lot of time, money and manpower in a
variety of situations. These include random sampling methods, such as simple
random sampling, stratified sampling, systematic sampling, multistage sampling
and cluster sampling, and non-random sampling methods, viz., convenience
sampling, judgement sampling and quota sampling. The advantages and
disadvantages of sampling and census are covered. How to determine the
sample size for a given population is also discussed.

4.2 CENSUS AND SAMPLE


Let us try to understand the terms ‘census’ and ‘sample’ with the help of an
illustration. Suppose you wish to study the ‘impact of T.V. advertisements on
children in Delhi’. Then you have to collect relevant information from the children
residing in Delhi who view T.V. In statistical terminology, this group is the population
for your study. If you collect the data from all of them,
not leaving out a single child, it is known as the Census method of data collection. This
means studying the whole population. If, instead, you select only some
children from among them for gathering the desired information for the study,
because it is not feasible to gather the information from all the children, it
is known as the Sample method of data collection. Therefore, a sample is a subset of a
statistical population whose characteristics are studied to know the information
about the whole population. When dealing with people, it can be defined as a
set of respondents (people) selected from a population for the purpose of a
survey. A population is a group of individual persons, objects, items or any other
units from which samples are taken for measurement.

The numerical characteristics of a population are called parameters. They are
fixed and usually unknown quantities. For example, the average (µ) height of
all Indian male adults is a population parameter. The numerical characteristics
of the sample data, such as the mean, variance or proportion, are called sample
statistics. They can be used to provide estimates of the corresponding population
parameters. For example, the average (x̄) height of a sample of 1000 Indian
male adults residing in Delhi is a sample statistic. The process of selecting a
representative sample for the purpose of inferring the characteristics of the
population is called sampling.
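The distinction between a parameter and a statistic can be illustrated with a short Python sketch. The population below is simulated (the mean of 165 cm and standard deviation of 7 cm are assumed values for illustration, not survey data):

```python
import random
import statistics

# Simulate a hypothetical population of 10,000 adult male heights (cm).
random.seed(42)
population = [random.gauss(165, 7) for _ in range(10_000)]

# Population parameter: the true mean (mu), normally unknown in practice.
mu = statistics.mean(population)

# Draw a sample of 1,000 units and compute the sample statistic (x-bar).
sample = random.sample(population, 1_000)
x_bar = statistics.mean(sample)

print(f"population mean mu = {mu:.2f}")
print(f"sample mean x-bar  = {x_bar:.2f}")
```

The sample mean will be close to, but not exactly equal to, the population mean; the gap is the sampling error discussed later in this unit.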

Webster defines a survey as ‘the action of ascertaining facts regarding


conditions or the condition of something to provide exact information especially
to persons responsible or interested’ and as ‘a systematic collection and analysis
of data on some aspect of an area or group.’ Unless the researcher makes a
systematic collection of data followed by careful analysis and interpretation of
data, the data cannot become exact information. Surveys can be divided into
two categories on the basis of their extensiveness, namely, census and sample
survey. A complete survey of the population is called a census. It involves covering
all respondents, items, or units of the population. For example, if we want to
know the wage structure of the textile industry in the country, then one
approach is to collect the data on the wages of each and every worker in the
textile industry. On the other hand, a sample is a representative subset of the
population. Thus in a sample survey we cover only a sample of the respondents,
items or units of the population we are interested in and then draw inferences
about the whole population.

The following are the advantages of census:

1) In a census each and every respondent of the population is considered and
various population parameters are compiled for information.

2) The information obtained on the basis of census data is more reliable and
accurate. It is an adopted method of collecting data on exceptional matters like
child labour, distribution by sex, educational level of the people, etc.
3) If we are conducting a survey for the first time we can have a census instead of
a sample survey. The information based on this census method becomes a base
for future studies. Similarly, some studies of special importance, like
population data, are obtained only through a census.

4.3 WHY SAMPLING?


One of the decisions to be made by a researcher in conducting a survey is
whether to go for a census or a sample survey. We obtain a sample rather
than a complete enumeration (a census) of the population for many reasons.
The most important considerations for this are: cost, size of the population,
accuracy of data, accessibility of population, timeliness, and destructive
observations.

1) Cost: The cost of conducting surveys through the census method would be
prohibitive, and sampling helps in substantial cost reduction of surveys. Since
most often the financial resources available to conduct a survey are scarce, it is
imperative to go for a sample survey rather than a census.
2) Size of the Population: If the size of the population is very large it is
difficult, if not impossible, to conduct a census. In such situations a sample
survey is the only way to analyse the characteristics of a population.

3) Accuracy of Data: Although reliable information can be obtained through
a census, sometimes the accuracy of information may be lost because of a
large population. Sampling involves a small part of the population, and a few
trained people can be involved to collect accurate data. On the other hand,
a lot of people are required to enumerate all the observations. Often it
becomes difficult to involve trained manpower in large numbers to collect
the data, thereby compromising the accuracy of the data collected. In such a
situation a sample may be more accurate than a census. A sloppily
conducted census can provide less reliable information than a carefully
obtained sample.

4) Accessibility of Population: There are some populations that are so
difficult to get access to that only a sample can be used, e.g., people in
prison, birds migrating from one place to another, etc. The
inaccessibility may be economic or time related. In a particular study, the
population may be so costly to reach, like the population of planets, that only
a sample can be used.

5) Timeliness: Since we are covering a small portion of a large population


through sampling, it is possible to collect the data in far less time than
covering the entire population. Not only does it take less time to collect the
data through sampling but the data processing and analysis also takes less
time because fewer observations need to be covered. Suppose a company
wants to get a quick feedback from its consumers on assessing their
perceptions about a new improved detergent in comparison to an existing
version of the detergent. Here the time factor is very significant. In such
situations it is better to go for a sample survey rather than a census because
it saves a lot of time and the product launch decision can be taken quickly.

6) Destructive Observations: Sometimes the very act of observing the


desired characteristics of a unit of the population destroys it for the intended
use. Good examples of this occur in quality control. For example, to test the
quality of a bulb, to determine whether it is defective, it must be destroyed.
To obtain a census of the quality of a lorry load of bulbs, you have to
destroy all of them. This is contrary to the purpose served by quality-control
testing. In this case, only a sample should be used to assess the quality of
the bulbs. Another example is the blood test of a patient.

The disadvantages of sampling are few, but the researcher must be cautious.
These are risk, lack of representativeness and insufficient sample size, each of
which can cause errors. If the researcher does not pay attention to these flaws,
they may invalidate the results.

1) Risk: Using a sample from a population and drawing inferences about the
entire population involves risk. In other words the risk results from dealing with
a part of a population. If the risk is not acceptable in seeking a solution to a
problem then a census must be conducted.
2) Lack of representativeness: Determining the representativeness of the
sample is the researcher’s greatest problem. By definition, ‘sample’ means a
representative part of an entire population. It is necessary to obtain a sample
that meets the requirement of representativeness, otherwise the sample will be
biased. The inferences drawn from non-representative samples will be misleading
and potentially dangerous.
3) Insufficient sample size: The other significant problem in sampling is to
determine the size of the sample. The appropriate size for a valid sample
depends on several factors such as the extent of risk that the researcher is willing to
accept and the characteristics of the population itself.

4.4 ESSENTIALS OF A GOOD SAMPLE


It is important that the sampling results must reflect the characteristics of the
population. Therefore, while selecting the sample from the population under
investigation it should be ensured that the sample has the following
characteristics:

1) A sample must represent a true picture of the population from which it is drawn.
2) A sample must be unbiased by the sampling procedure.
3) A sample must be taken at random so that every member of the population of
data has an equal chance of selection.
4) A sample must be sufficiently large but as economical as possible.
5) A sample must be accurate and complete. It should not leave any information
incomplete and should cover all the respondents, units or items included in the
sample.
6) Adequate sample size must be taken considering the degree of precision
required in the results of inquiry.

Self Assessment Exercise A


1) What do you mean by census and sample methods for data collection?
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
2) Explain whether census or sample is more appropriate in the following
situations?

a) To test the quality of a soft drink.


..................................................................................................................
..................................................................................................................
b) To enumerate eligible voters of an assembly constituency.
..................................................................................................................
..................................................................................................................
c) To know the opinion of consumers on launching a new product.
..................................................................................................................
..................................................................................................................
3) Fill in the blanks

a) If the sample does not represent the population characteristics, we call


it a ————— sample.
b) One of the major advantages of sampling is that it helps in ———
reduction.
c) A sample must be ——— large but as ——— as possible.

4.5 METHODS OF SAMPLING


If money, time, trained manpower and other resources were not a concern, the
researcher could get most accurate data from surveying the entire population of
interest. Since most often the resources are scarce, the researcher is forced to
go for sampling. But the real purpose of the survey is to know the
characteristics of the population. Then the question is with what level of
confidence will the researcher be able to say that the characteristics of a
sample represent the entire population. Using a combination of tests of
hypotheses and unbiased sampling methods, the researcher can collect data that
actually represents the characteristics of the entire population from which the
sample was taken. To ensure a high level of confidence that the sample
represents the population it is necessary that the sample is unbiased and
sufficiently large.

It has been scientifically proved that if we increase the sample size we shall be that
much closer to the characteristics of the population. Ultimately, if we cover
each and every unit of the population, the characteristics of the sample will be
equal to the characteristics of the population. That is why in a census there is
no sampling error. Thus, “generally speaking, the larger the sample size, the less
sampling error we have.”

The statistical meaning of bias is error. The sample must be error free to make
it an unbiased sample. In practice, it is impossible to achieve an error free
sample even using unbiased sampling methods. However, we can minimize the
error by employing appropriate sampling methods.

The various sampling methods can be classified into two categories. These are
random sampling methods and non-random sampling methods. Let us discuss
them in detail.

4.5.1 Random Sampling Methods

The random sampling method is also often called probability sampling. In


random sampling all units or items in the population have a chance of being
chosen in the sample. In other words a random sample is a sample in which
each element of the population has a known and non-zero chance of being
selected. Random sampling always produces the smallest possible sampling
error. In the real sense, the size of the sampling error in a random sample is
affected only by a random chance. Because a random sample contains the
least amount of sampling error, we may say that it is an unbiased sample.
Remember that we are not saying that a random sample contains no error, but
rather the minimum possible amount of error. The major advantage of random
sampling is that it is possible to quantify the magnitude of the likely error in the
inference made and this will help in building confidence in drawing inferences.

The following are the important methods of random sampling:

1) Simple Random Sampling


2) Systematic Sampling
3) Stratified Random Sampling
4) Cluster Sampling
5) Multistage Sampling
1. Simple Random Sampling: The most commonly used random sampling
method is the simple random sampling method. A simple random sample is one in
which each item in the total population has an equal chance of being included
in the sample. In addition, the selection of one item for inclusion in the sample
should in no way influence the selection of another item. Simple random
sampling should be used with a homogeneous population, that is, a population
consisting of items that possess the same attributes that the researcher is
interested in. The characteristics of homogeneity may include attributes such as age,
sex, income, social/religious/political affiliation, geographical region, etc.

The best way to choose a simple random sample is to use a random number
table. A random sampling method should meet the following criteria.

a) Every member of the population must have an equal chance of inclusion in the
sample.
b) The selection of one member is not affected by the selection of previous
members.
The random numbers are a collection of digits generated through a probabilistic
mechanism. The random numbers have the following properties:

i) The probability that each digit (0, 1, 2, 3, 4, 5, 6, 7, 8 or 9) will appear at any place
is the same, that is, 1/10.
ii) The occurrence of any two digits in any two places is independent of each
other.
Each member of a population is assigned a unique number. The members of
the population chosen for the sample will be those whose numbers are identical
to the ones extracted from the random number table in succession until the
desired sample size is reached. An example of a random number table is given
below.

Table 1: Table of Random Numbers

1 2 3 4 5 6 7 8 9 10
1 96268 11860 83699 38631 90045 69696 48572 05917 51905 10052
2 03550 59144 59468 37984 77892 89766 86489 46619 50236 91136
3 22188 81205 99699 84260 19693 36701 43233 62719 53117 71153
4 63759 61429 14043 44095 84746 22018 19014 76781 61086 90216
5 55006 17765 15013 77707 54317 48862 53823 52905 70754 68212
6 81972 45644 12600 01951 72166 52682 37598 11955 73018 23528
7 06344 50136 33122 31794 86723 58037 36065 32190 31367 96007
8 92363 99784 94169 03652 80824 33407 40837 97749 18361 72666
9 96083 16943 89916 55159 62184 86206 09764 20244 88388 98675
10 92993 10747 08985 44999 35785 65036 05933 77378 92339 96151
11 95083 70292 50394 61947 65591 09774 16216 63561 59751 78771
12 77308 60721 96057 86031 83148 34970 30892 53489 44999 18021
13 11913 49624 28519 27311 61586 28576 43092 69971 44220 80410
14 70648 47484 05095 92335 55299 27161 64486 71307 85883 69610
15 92771 99203 37786 81142 44271 36433 31726 74879 89384 76886
16 78816 20975 13043 55921 82774 62745 48338 88348 61211 88074
17 79934 35392 56097 87613 94627 63622 08110 16611 88599 02890
18 64698 83376 87527 36897 17215 74339 69856 43622 22567 11518
19 44212 12995 03581 37618 94851 63020 65348 55857 91742 79508
20 89292 00204 00579 70630 37136 50922 83387 15014 51838 81760
21 08692 87237 87879 01629 72184 33853 95144 67943 19345 03469
22 67927 76855 50702 78555 97442 78809 40575 79714 06201 34576
23 62167 94213 52971 85794 68067 78814 40103 70759 92129 46716
24 45828 45441 74220 84157 23241 49332 23646 09390 13031 51569
25 01164 35307 26526 80335 58090 85871 07205 31749 40571 51755
26 29283 31581 04359 45538 41435 61103 32428 94042 39971 63678
27 19868 49978 81699 84904 50163 22652 07845 71308 00859 87984
28 14292 93587 55960 23159 07370 65065 06580 46285 07884 83928
29 77410 52135 29495 23032 83242 89938 40516 27252 55565 64714
30 36580 06921 35675 81645 60479 71035 99380 59759 42161 93440
31 07780 18093 31258 78156 07871 20369 53977 08534 39433 57216
32 07548 08454 36674 46255 80541 42903 37366 21164 97516 66181
33 22023 60448 69344 44260 90570 01632 21002 24413 04671 05665
34 20827 37210 57797 34660 32510 71558 78228 42304 77197 79168
35 47802 79270 48805 59480 88092 11441 96016 76091 51823 94442
36 76730 86591 18978 25479 77684 88439 34112 26052 57112 91653
37 26439 02903 20935 76297 15290 84688 74002 09467 41111 19194
38 32927 83426 07848 59372 44422 53372 27823 25417 27150 21750
39 51484 05286 77103 47284 00578 88774 15293 50740 07932 87633

40 45142 96804 92834 26886 70002 96643 36008 02239 93563 66429

To select a random sample using the simple random sampling method we should
follow the steps given below:

i) Determine the population size (N).


ii) Determine the sample size (n).
iii) Number each member of the population under investigation in serial order.
Suppose there are 100 members number them from 00 to 99.

iv) Determine the starting point of selecting sample by randomly picking up a


page from random number tables and dropping your finger on the page
blindly.

v) Choose the direction in which you want to read the numbers (from left to
right, or right to left, or down or up).

vi) Select the first ‘n’ numbers whose last X digits are between 0 and N-1. If
N = 100 then X would be 2; if N is a four-digit number then X would be 3,
and so on.

vii) Once a number is chosen, do not use it again.

viii) If you reach the end point of the table before obtaining ‘n’ numbers, pick
another starting point and read in a different direction and then use the
first X digit instead of the last X digits and continue until the desired
sample is selected.

Example: Suppose you have a list of 80 students and want to select a sample
of 20 students using simple random sampling method. First assign each student
a number from 00 to 79. To draw a sample of 20 students using random
number table, you need to find 20 two-digit numbers in the range 00 to 79. You
can begin any where and go in any direction. For example, start from the 6th
row and 1st column of the random number table given in this Unit. Read the
last two digits of the numbers. If the number is within the range (00 to 79)
include the number in the sample. Otherwise skip the number and read the next
number in the chosen direction. If a number is already selected, omit it. In
the example, starting from the 6th row and 1st column and moving from left to right,
the following numbers are considered to select 20 numbers for the
sample.

81972 45644 12600 01951 72166 52682 37598 11955 73018 23528
06344 50136 33122 31794 86723 58037 36065 32190 31367 96007
92363 99784 94169 03652 80824 33407 40837 97749 18361 72666

The bold-faced digits in the ones and tens places indicate the selected
numbers for the sample. Therefore, the following are the 20 numbers chosen as
the sample.

72 44 00 51 66 55 18 28
36 22 23 37 65 67 07 63
69 52 24 49
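In practice, a computer’s pseudo-random number generator can play the role of the random number table. A minimal Python sketch of the same 20-out-of-80 exercise (the seed is arbitrary and only makes the draw reproducible):

```python
import random

random.seed(7)                          # arbitrary seed, for reproducibility
population = list(range(80))            # students numbered 00 to 79
sample = random.sample(population, 20)  # without replacement, equal chances

print(sorted(sample))
```

Like the table-based procedure, `random.sample` never repeats a number and gives every student the same chance of inclusion.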


Advantages

i) The simple random sample requires less knowledge about the characteristics of
the population.
ii) Since the sample is selected at random, giving each member of the population an equal
chance of being selected, the sample can be called an unbiased sample. Bias
due to human preferences and influences is eliminated.
iii) Assessment of the accuracy of the results is possible through sampling error
estimation.
iv) It is a simple and practical sampling method provided population size is not large.
Limitations
i) If the population size is large, a great deal of time must be spent listing and
numbering the members of the population.
ii) A simple random sample will not adequately represent many population
characteristics unless the sample is very large. That is, if the researcher is
interested in choosing a sample on the basis of the distribution in the population
of gender, age, social status, a simple random sample needs to be very large to
ensure all these distributions are representative of the population. To obtain a
representative sample across multiple population attributes we should use
stratified random sampling.
2. Systematic Sampling: In systematic sampling the sample units are selected
from the population at equal intervals in terms of time, space or order. The
selection of a sample using systematic sampling method is very simple. From a
population of ‘N’ units, a sample of ‘n’ units may be selected by following the
steps given below:
i) Arrange all the units in the population in an order by giving serial numbers
from 1 to N.
ii) Determine the sampling interval by dividing the population size by the sample
size. That is, K=N/n.
iii) Select the first sample unit at random from the first sampling interval (1 to
K).
iv) Select the subsequent sample units at equal regular intervals.
For example, we want to have a sample of 100 units from a population of 1000
units. First arrange the population units in some serial order by giving numbers
from 1 to 1000. The sample interval size is K=1000/100=10. Select the first
sample unit at random from the first 10 units ( i.e. from 1 to 10). Suppose the
first sample unit selected is 5, then the subsequent sample units are 15, 25,
35,.........995. Thus, in the systematic sampling the first sample unit is selected
at random and this sample unit in turn determines the subsequent sample units
that are to be selected.
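The steps above can be sketched in Python for the same N = 1000, n = 100 example (the seed is arbitrary; it only fixes the random start):

```python
import random

random.seed(1)
N, n = 1000, 100                       # population size and sample size
k = N // n                             # sampling interval, here 10
start = random.randint(1, k)           # first unit, chosen at random from 1..k
sample = list(range(start, N + 1, k))  # every k-th unit thereafter

print(start, len(sample), sample[:3])
```

Note that, as the text observes, once the random start is chosen every subsequent unit is determined automatically.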

Advantages
i) The main advantage of using a systematic sample is that it is more expeditious to
collect a sample systematically, since the time taken and work involved are less
than in simple random sampling. For example, it is frequently used in exit polls
and surveys of store consumers.
ii) This method can be used even when no formal list of the population units is
available. For example, suppose if we are interested in knowing the opinion of
consumers on improving the services offered by a store, we may simply choose
every kth (here, every 10th) consumer visiting the store, provided that we know how many
consumers visit the store daily (say 1000 consumers visit and we want to
have 100 consumers as the sample size).
Limitations
i) If there is periodicity in the occurrence of elements of a population, the selection
of a sample using systematic sampling could give a highly unrepresentative sample.
For example, suppose the sales of a consumer store are arranged
chronologically and, using systematic sampling, we select the sample for the 1st of every
month. The 1st day of a month cannot be a representative sample for the whole
month. Thus in systematic sampling there is a danger of order bias.
ii) Every unit of the population does not have an equal chance of being selected
and the selection of units for the sample depends on the initial unit selection.
Regardless how we select the first unit of sample, subsequent units are
automatically determined lacking complete randomness.
3. Stratified Random Sampling: The stratified sampling method is used when
the population is heterogeneous rather than homogeneous. A heterogeneous
population is composed of unlike elements such as male/female, rural/urban,
literate/illiterate, high income/low income groups, etc. In such cases, use of
simple random sampling may not always provide a representative sample of the
population. In stratified sampling, we divide the population into relatively
homogenous groups called strata. Then we select a sample using simple
random sampling from each stratum. There are two approaches to decide the
sample size from each stratum, namely, proportional stratified sample and
disproportional stratified sample. With either approach, the stratified sampling
guarantees that every unit in the population has a chance of being selected. We
will now discuss these two approaches of selecting samples.
i) Proportional Stratified Sample: If the number of sampling units drawn
from each stratum is in proportion to the corresponding stratum population size,
we say the sample is proportional stratified sample. For example, let us say
we want to draw a stratified random sample from a heterogeneous population
(on some characteristics) consisting of rural/urban and male/female respondents.
So we have to create 4 homogeneous sub groups called stratums as follows:

Urban Rural

Male Female Male Female

To ensure each stratum in the sample will represent the corresponding stratum
in the population we must ensure each stratum in the sample is represented in
the same proportion to the stratums as they are in the population. Let us
assume that we know (or can estimate) the population distribution as follows:
65% male, 35% female and 30% urban and 70% rural. Now we can determine
the approximate proportions of our 4 stratums in the population as shown below.

Urban Rural
Male Female Male Female
0.30 × 0.65 = 0.195 0.30 × 0.35 = 0.105 0.70 × 0.65 = 0.455 0.70 × 0.35 = 0.245

Thus a representative sample would be composed of 19.5% urban-males, 10.5%


urban-females, 45.5% rural-males and 24.5% rural females. Each percentage
should be multiplied by the total sample size needed to arrive at the actual
sample size required from each stratum. Suppose we require 1000 samples, then
the required sample in each stratum is as follows:

Urban-male 0.195 × 1000 = 195

Urban-female 0.105 × 1000 = 105

Rural-male 0.455 × 1000 = 455

Rural-female 0.245 × 1000 = 245

Total: 1,000
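The proportional allocation above can be reproduced in a few lines of Python, using the population proportions assumed in the text:

```python
n_total = 1000
p_urban, p_rural = 0.30, 0.70
p_male, p_female = 0.65, 0.35

strata = {
    "urban-male":   p_urban * p_male,    # 0.195
    "urban-female": p_urban * p_female,  # 0.105
    "rural-male":   p_rural * p_male,    # 0.455
    "rural-female": p_rural * p_female,  # 0.245
}
# Each stratum's sample size is its population share times the total sample.
allocation = {name: round(share * n_total) for name, share in strata.items()}
print(allocation)
```

The four allocations sum to exactly 1,000, matching the table above.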

ii) Disproportional Stratified Sample: In a disproportional stratified sample,


sample size for each stratum is not allocated on a proportional basis with the
population size, but by analytical considerations of the researcher such as
stratum variance, stratum population, time and financial constraints etc. For
example, if the researcher is interested in finding differences among different
stratums, disproportional sampling should be used. Consider the example of
income distribution of households. There is a small percentage of households
within the high income brackets and a large percentage of households within
the low income brackets. The income among higher income group households
has higher variance than the variance among the lower income group house-
holds. To avoid under-representation of higher income groups in the sample, a
disproportional sample is taken. This indicates that as the variability within the
stratum increases sample size must increase to provide accurate estimates and
vice-versa.

Suppose in our example of urban/rural and male/female stratum populations, the
estimated stratum variances (s²) are as follows. (The concept of variance is
discussed in Unit 9 of this course.)

Urban-male 3.0; Urban-female 5.5; Rural-males 2.5; Rural-females 1.75.


The above figures are normally estimated on the basis of the previous knowledge
of the researcher.
Then the allocation of a sample size of 1000 among the strata using the
disproportional stratified sampling method will be as shown in the following table:

Stratum         Stratum population   Stratum          Stratum standard   Pi × σi   Sample size
                proportion (Pi)      variance (σi²)   deviation (σi)               (Pi × σi × 1000)/Σ(Pi × σi)

Urban-male      0.195                3.00             1.73               0.338     208

Urban-female    0.105                5.50             2.35               0.246     151

Rural-male      0.455                2.50             1.58               0.719     442

Rural-female    0.245                1.75             1.32               0.324     199

Total                                                                    1.628     1000
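The disproportional allocation can be computed directly from the proportions and variances given above: each stratum receives a share of the sample proportional to Pi × σi. Working with full-precision standard deviations, rounding of the final digit may differ by one unit from a hand calculation that uses two-decimal values:

```python
import math

n_total = 1000
strata = {                       # stratum: (proportion Pi, variance sigma_i^2)
    "urban-male":   (0.195, 3.00),
    "urban-female": (0.105, 5.50),
    "rural-male":   (0.455, 2.50),
    "rural-female": (0.245, 1.75),
}
# Weight each stratum by Pi * sigma_i, then allocate the sample proportionally.
weights = {s: p * math.sqrt(v) for s, (p, v) in strata.items()}
total_w = sum(weights.values())
allocation = {s: round(n_total * w / total_w) for s, w in weights.items()}
print(allocation)
```

Strata with higher variability (such as urban-female here, variance 5.5) receive more than their proportional share, exactly as the text explains.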

Advantages
a) Since the sample is drawn from each of the strata of the population,
stratified sampling is more representative and thus more accurately reflects the
characteristics of the population from which it is chosen.

b) It is more precise and to a great extent avoids bias.
c) Since sample size can be less in this method, it saves a lot of time, money and
other resources for data collection.
Limitations
a) Stratified sampling requires a detailed knowledge of the distribution of attributes
or characteristics of interest in the population to determine the homogeneous
groups that lie within it. If we cannot accurately identify the homogeneous
groups, it is better to use simple random sample since improper stratification can
lead to serious errors.
b) Preparing a stratified list is a difficult task as the lists may not be readily
available.
4. Cluster Sampling: In cluster sampling we divide the population into groups
having heterogeneous characteristics, called clusters, and then select a sample of
clusters using simple random sampling. We assume that each of the clusters is
representative of the population as a whole. This sampling is widely used for
geographical studies of many issues. For example, if we are interested in finding
the attitudes of consumers residing in Delhi towards a new product of a
company, the whole city of Delhi can be divided into 20 blocks. As we assume that
each of these blocks will represent the attitudes of consumers of Delhi as a
whole, we might use cluster sampling, treating each block as a cluster. We will
then select a sample of 2 or 3 clusters and obtain the information from
consumers covering all of them. The principles that are basic to cluster
sampling are as follows:
i) The differences or variability within a cluster should be as large as possible.
As far as possible the variability within each cluster should be the same as
that of the population.
ii) The variability between clusters should be as small as possible. Once the
clusters are selected, all the units in the selected clusters are covered for
obtaining data.
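The Delhi example can be sketched in code: divide the city into 20 blocks (clusters), draw a simple random sample of 3 blocks, and cover every unit inside the selected blocks. The block and consumer identifiers below are hypothetical.

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

# Hypothetical frame: 20 blocks of Delhi, each containing 50 consumers.
blocks = {f"block-{i}": [f"consumer-{i}-{j}" for j in range(50)]
          for i in range(1, 21)}

# Step 1: select a simple random sample of clusters (blocks).
sampled_blocks = random.sample(list(blocks), k=3)

# Step 2: once the clusters are selected, cover ALL units in them.
respondents = [unit for b in sampled_blocks for unit in blocks[b]]

print(sampled_blocks)     # three block names
print(len(respondents))   # 3 blocks x 50 consumers = 150 respondents
```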
Advantages
a) The cluster sampling provides significant gains in data collection costs, since
traveling costs are smaller.
b) Since the researcher need not cover all the clusters and only a sample of
clusters are covered, it becomes a more practical method which facilitates
fieldwork.
Limitations
a) The cluster sampling method is less precise than sampling of units from the
whole population since the latter is expected to provide a better cross-section of
the population than the former, due to the usual tendency of units in a cluster to
be homogeneous.
b) The sampling efficiency of cluster sampling is likely to decrease with an
increase in cluster size or a decrease in the number of clusters.
The above advantages or limitations of cluster sampling suggest that, in practical
situations where sampling efficiency is less important but the cost is of greater
significance, the cluster sampling method is extensively used. If the division of
clusters is based on the geographic sub-divisions, it is known as area sampling.
In cluster sampling, instead of covering all the units in each cluster, we can
resort to sub-sampling; this is known as two-stage sampling. Here, the clusters
are termed primary units and the units within the selected clusters are taken as
secondary units.
5. Multistage Sampling: We have already covered two-stage sampling. Multistage
sampling is a generalisation of two-stage sampling. As the name suggests,
multistage sampling is carried out in different stages. In each stage
progressively smaller (population) geographic areas will be randomly selected.
A political pollster interested in assembly elections in Uttar Pradesh may first
divide the state into different assembly constituencies and select a sample of
assembly constituencies in the first stage. In the second stage, each of the
sampled assembly constituencies is divided into a number of segments and a
sample of assembly segments is selected. In the third stage, within each
sampled assembly segment, either all the households or a random sample of
households would be interviewed. In this sampling method, it is possible to
take as many stages as are necessary to achieve a representative sample.
Each stage results in a reduction of sample size.
In multistage sampling, a suitable method of sampling is used at each stage.
As many stages as needed are used to arrive at a sample of the desired
sampling units.
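The three-stage election example can be sketched as follows. The frame sizes (10 constituencies, 5 segments each, 100 households per segment) and the per-stage sample sizes are hypothetical; note how each stage reduces the portion of the population carried forward.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical frame: constituencies -> segments -> households.
state = {
    f"constituency-{c}": {
        f"segment-{c}-{s}": [f"household-{c}-{s}-{h}" for h in range(100)]
        for s in range(1, 6)
    }
    for c in range(1, 11)
}

# Stage 1: a random sample of constituencies.
stage1 = random.sample(list(state), k=3)

# Stage 2: within each sampled constituency, a random sample of segments.
stage2 = {c: random.sample(list(state[c]), k=2) for c in stage1}

# Stage 3: within each sampled segment, a random sample of households.
sample = [hh
          for c, segments in stage2.items()
          for s in segments
          for hh in random.sample(state[c][s], k=10)]

print(len(sample))  # 3 constituencies x 2 segments x 10 households = 60
```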
Advantages
a) Multistage sampling provides cost gains by reducing data collection costs.
b) Multistage sampling is more flexible and allows us to use different sampling
procedures in different stages of sampling.
c) If the population is spread over a very wide geographical area, multistage
sampling is the only sampling method available in a number of practical
situations.
Limitations
a) If the sampling units selected at different stages are not representative,
multistage sampling becomes less precise and efficient.
4.5.2 Non-Random Sampling Methods

The non-random sampling methods are also often called non-probability sampling
methods. In a non-random sampling method the probability of any particular unit
of the population being chosen is unknown. Here the method of selection of
sampling units is quite arbitrary as the researchers rely heavily on personal
judgment. Non-random sampling methods usually do not produce samples that
are representative of the general population from which they are drawn. The
greatest error occurs when the researcher attempts to generalise the results on
the basis of a sample to the entire population. Such an error is insidious
because it is not at all obvious from merely looking at the data, or even from
looking at the sample. The easiest way to recognise whether a sample is
representative or not is to determine whether the sample is selected randomly
or not. Nevertheless, there are occasions where non-random samples are best
suited for the researcher’s purpose.The various non-random sampling methods
commonly used are:

1) Convenience Sampling;
2) Judgement Sampling; and
3) Quota Sampling.
Let us discuss these methods in detail.
1) Convenience Sampling: Convenience sampling refers to the method of
obtaining a sample that is most conveniently available to the researcher. For
example, if we are interested in finding the overtime wage paid to employees
working in call centres, it may be convenient and economical to sample
employees of call centres in a nearby area. Also, on various issues of public
interest like budget, election, price rise etc., the television channels often present
on-the-street interviews with people to reflect public opinion. It may be
cautioned that the generalisation of results based on convenience sampling
beyond that particular sample may not be appropriate. Convenience samples are
best used for exploratory research when additional research will be
subsequently conducted with a random sample. Convenience sampling is also
useful in testing the questionnaires designed on a pilot basis. Convenience
sampling is extensively used in marketing studies.
2) Judgement Sampling: Judgement sampling method is also known as purposive
sampling. In this method of sampling the selection of sample is based on the
researcher’s judgment about some appropriate characteristic required of the
sample units. For example, the calculation of consumer price index is based on a
judgment sample of a basket of consumer items, and other related commodities
and services which are expected to reflect a representative sample of items
consumed by the people. The prices of these items are collected from selected
cities which are viewed as typical cities with demographic profiles matching the
national profile. In business judgment sampling is often used to measure the
performance of salesmen/saleswomen. The salesmen/saleswomen are grouped
into high, medium or low performers based on certain specified qualities. The
sales manager may then classify each salesman/saleswoman working under
him/her into the group in which, in his/her opinion, he/she falls. Judgment sampling
is also often used in forecasting election results. We may often wonder how a
pollster can predict an election based on only 2% to 3% of votes covered. It is
needless to say the method is biased and does not have any scientific basis.
However, in the absence of any representative data, one may resort to this kind
of non-random sampling.
3) Quota Sampling: The quota sampling method is commonly used in marketing
research studies. The samples are selected on the basis of some parameters
such as age, sex, geographical region, education, income, occupation etc, in
order to make them as representative samples. The investigators, then, are
assigned fixed quotas of the sample meeting these population characteristics.
The purpose of quota sampling is to ensure that various sub-groups of the
population are represented on pertinent sample characteristics to the extent that
the investigator desires. The stratified random sampling also has this objective
but should not be confused with quota sampling. In the stratified sampling
method the researcher selects a random sample from each group of the
population, where as, in quota sampling, the interviewer has a quota fixed for
him/her to achieve. For example, if a city has 10 market centres, a soft drink
company may decide to interview 50 consumers from each of these 10 market
centres to elicit information on their products. It is entirely left to the
investigator whom he/she will interview at each of the market centres and the
time of interview. The interview may take place in the morning, mid day, or
evening or it may be in the winter or summer.
Quota sampling has the advantage that the sample conforms to the selected
characteristics of the population that the researcher desires. Also, the cost and
time involved in collecting the data are also greatly reduced. However, quota
sampling has many limitations, as given below:
a) In quota sampling the respondents are selected according to the convenience of
the field investigator rather than on a random basis. This kind of selection of
sample may be biased. Suppose, in our example of soft drinks, after the sample
is taken it is found that most of the respondents belong to the lower income
group; then the purpose of conducting the survey is defeated and the
results may not reflect the actual situation.
b) If the number of parameters on the basis of which the quotas are fixed is
large, then it becomes difficult for the researcher to fix the quota for each sub-group.
c) The field workers have the tendency to cover the quota by going to those places
where the respondents may be willing to provide information and avoid those
with unwilling respondents. For example, the investigators may avoid places
where high income group respondents stay and cover only low income group
areas.

Self Assessment Exercise B

1) Suppose there are 900 families residing in a colony. You are asked to select a
sample of 30 families using simple random sampling for estimating the average
income. The families are identified with serial numbers 001 to 900.
i) Select the random sample using the following random number table.

29283 31581 04359 45538 41435 61103 32428 94042 39971 63678

19868 49978 81699 84904 50163 22652 07845 71308 00859 87984

14292 93587 55960 23159 07370 65065 06580 46285 07884 83928

77410 52135 29495 23032 83242 89938 40516 27252 55565 64714

36580 06921 35675 81645 60479 71035 99380 59759 42161 93440

ii) While selecting the random sample in the above example, what are the
random numbers you have rejected and why?
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................

2) There are 4 sections (A, B, C and D) in class X of a secondary school. You
are asked to find the average income of the parents of the students of
sections A and C. Which of the following sampling methods is to be used?
a) Simple random sampling; b) Systematic sampling; c) Stratified
sampling; d) Cluster sampling.
....................................................................................................................

3) The employees of a company are classified into 4 groups (A,B,C and D) on


the basis of their salary structure. You are asked to find the average salary
income of the employees working in the company. What is the sampling
method to be used?
a) Simple random sampling; b) Systematic sampling; c) Stratified
sampling; d) Quota sampling.
....................................................................................................................

4) State true or false.
a) A systematic sampling can be used even if all the units of the population
are not available.
b) A budget has been announced by the government. A TV journalist
recorded the views of the people residing near his house. The sampling
method that the TV journalist used is quota sampling.

....................................................................................................................

4.6 SAMPLE SIZE


The question of how large a sample should be is a difficult one. Sample size
can be determined by various factors (like time, funds, manpower, population
size, purpose of study, etc.). For example, if the available funds for study are
limited then the researcher may not be able to spend more than a fixed
proportion of the total fund available with him/her. In general, sample size
depends on the nature of the analysis to be performed, the desired precision of
the estimates one wishes to achieve, the number of variables that have to be
examined simultaneously, and how heterogeneous the population is.
Moreover, technical considerations suggest that the required sample size is a
function of the precision of the estimates one wishes to achieve, the variance
of the population and statistical level of confidence one wishes to use. The
higher the precision and confidence level required, the larger the sample size
should be. Typical confidence levels are 95% and 99%, while a typical precision
(significance) value is 1% or 5%. You will learn more about the confidence and
precision levels in Unit 16 and Unit 17 of this course.

Once the researcher determines the desired degree of precision and confidence
level, there are several formulas he/she can use to determine the sample size,
depending on the plan of the study and the intended interpretation of the results.
Here we will discuss three of them.

1) If the researcher wishes to report the results as proportions of the sample


responses, use the following formula.
n = P(1 − P) / [(A²/Z²) + P(1 − P)/N]
Where, n = Sample size.
P = Estimated percentage of the population possessing attribute of
interest.
A = Accuracy desired, usually expressed as a decimal (i.e. 0.01, 0.05,
etc.)
Z = Standardization value indicating a confidence level (Z = 1.96 at 95%
confidence level and Z = 2.56 at 99% confidence level; see Unit
16 for more details).
N = Population size (known or estimated)
2) If the researcher wishes to report the results as means of the sample responses,
use the following formula.
n = σ² / [(A²/Z²) + (σ²/N)]
Where, n = Sample size.
σ = Estimated standard deviation of the population.
A = Accuracy desired, usually expressed as a decimal (i.e. 0.01, 0.05,
etc.)
Z = Standardization value indicating a confidence level (Z = 1.96 at 95%
confidence level and Z = 2.56 at 99% confidence level;
see Unit 16 for more details).
N = Population size (known or estimated)

3) If the researcher plans to report the results in a variety of ways, or if he/she
has difficulty in estimating the proportion or standard deviation of the attribute
of interest, the following formula may be more useful.

n = (N × Z² × 0.25) / {[d² × (N − 1)] + [Z² × 0.25]}

Where, n = Sample size required


d = Accuracy precision level (i.e. 0.01, 0.05, 0.10 etc.)
Z = Standardization value indicating a confidence level (Z = 1.96 at
95% confidence level and Z = 2.56 at 99% confidence level;
see Unit 16 for more details).
N = Population size (known or estimated).
For example, if the population size (N) is 1000 and you wish a 95% confidence
level and ±5% precision level (d=0.05 and Z=1.96) then the sample size (n):

n = (1000 × 1.96² × 0.25) / [(0.05² × 999) + (1.96² × 0.25)] = 277.7, or say 280
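The three formulas can be written as small functions (the function names are our own) and checked against the worked example above.

```python
def sample_size_proportion(P, A, Z, N):
    """Formula 1: results to be reported as proportions."""
    return P * (1 - P) / (A**2 / Z**2 + P * (1 - P) / N)

def sample_size_mean(sigma, A, Z, N):
    """Formula 2: results to be reported as means."""
    return sigma**2 / (A**2 / Z**2 + sigma**2 / N)

def sample_size_conservative(d, Z, N):
    """Formula 3: conservative formula, taking P = 0.5 so that P(1-P) = 0.25."""
    return (N * Z**2 * 0.25) / (d**2 * (N - 1) + Z**2 * 0.25)

# Worked example from the text: N = 1000, 95% confidence (Z = 1.96), d = 0.05.
n = sample_size_conservative(d=0.05, Z=1.96, N=1000)
print(round(n, 1))  # 277.7
```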

4.7 SAMPLING AND NON-SAMPLING ERRORS


The quality of a research project depends on the accuracy of the data collected
and on how well the sample represents the population. There are two broad
sources of errors. These are sampling errors and non-sampling errors.

4.7.1 Sampling Errors

The principal sources of sampling errors are the sampling method applied, and
the sample size. This is due to the fact that only a part of the population is
covered in the sample. The magnitude of the sampling error varies from one
sampling method to the other, even for the same sample size. For example, the
sampling error associated with simple random sampling will be greater than
stratified random sampling if the population is heterogeneous in nature.

Intuitively, we know that the larger the sample the more accurate the research.
In fact, the sampling error varies with samples of different sizes. Increasing the
sample size decreases the sampling error.

The following figure gives an approximate relationship between sample size and
sampling error. Study the following figure carefully.

[Fig. 4.1: Sampling error versus sample size. The curve falls from a large
sampling error at small sample sizes to a small sampling error at large
sample sizes.]
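The shape of this curve can be illustrated with a small simulation: draw repeated simple random samples of increasing size from a hypothetical population and measure the average error of the sample mean. The population used here (normal, mean 50, standard deviation 10) is arbitrary.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# A hypothetical population of 10,000 income figures.
population = [random.gauss(mu=50, sigma=10) for _ in range(10_000)]
true_mean = statistics.mean(population)

def avg_sampling_error(n, trials=500):
    """Average absolute error of the sample mean over repeated samples of size n."""
    errors = [abs(statistics.mean(random.sample(population, n)) - true_mean)
              for _ in range(trials)]
    return statistics.mean(errors)

for n in (25, 100, 400):
    print(n, round(avg_sampling_error(n), 2))
# The printed error falls as n rises, roughly in proportion to 1/sqrt(n).
```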

4.7.2 Non-Sampling Errors


The non-sampling errors arise from faulty research design and mistakes in
executing research. There are many sources of non-sampling errors which may
be broadly classified as: (a) respondent errors, and (b) administrative errors.

a) Respondent Errors: If the respondents co-operate and give the correct
information, the objectives of the researcher can be easily accomplished.
However, in practice, this may not happen. A respondent may either refuse
to provide information or, even if he/she provides it, the information may be biased.

If the respondent fails to provide information, we call it non-response error.


Although this problem is present in all types of surveys, the problem is more
acute in mailed surveys. Non-response can also lead to extreme situations in
which respondents who are willing to provide information are over-represented
while those who are indifferent are under-represented in the sample. In order to
minimise the non-response error the researcher often seeks to re-contact the
non-respondents if they were not available earlier.
If the researcher finds that the non-response rate is more in a particular group
of respondents (for example, higher income groups) additional efforts should be
made to obtain data from these under-represented groups of the population. For
example, for those people who are not responding to the mailed questionnaires,
personal interviews may be conducted to obtain data. In a mailed
questionnaire the researcher never knows whether the respondent really
refused to provide data or was simply indifferent. There are several
techniques which help to encourage respondents to reply. You must have
already learned these techniques in Unit 3 of this course.
Response bias occurs when the respondent does not give the correct
information and tries to mislead the investigator in a certain direction. The
respondents may consciously or unconsciously misrepresent the truth. For
example, if the investigator asks a question on the income of the respondent, he/
she may not give the correct information for obvious reasons. Or the
investigator may not be able to put a question that is sensitive (thus avoiding
embarrassment). This may arise from problems in designing the
questionnaire and the content of questions. Respondents who misunderstand
the questions may unconsciously provide biased information.
The response bias may also occur because the interviewer’s presence
influences respondents to give untrue or modified answers. The respondent's/
interviewer's tendency is to please the other person rather than to provide/elicit
the correct information.
b) Administrative Errors: The errors that have arisen due to improper
administration of the research process are called administrative errors. There
are four types of administrative errors. These are as follows:
i) sample selection error,
ii) investigator error,
iii) investigator cheating, and
iv) data processing error.
i) Sample Selection Error: It is difficult to execute a sampling plan. For
example, we may plan to use systematic sampling plan in a market
research study of a new product and decide to interview every 5th
customer coming out of a consumer store. If the day of interview
happened to be a working day then we are excluding all those consumers
who are working. This may lead to an error because of the
unrepresentative sample selection.
ii) Investigator Error: When the investigator interviews the respondent, he/
she may fail to record the information correctly or may fail to cross check
the information provided by the respondent. Therefore, the error may arise
due to the way the investigator records the information.
iii) Investigator Cheating: Sometimes the investigator may try to fake the
data even without meeting the concerned respondents. There should be
some mechanism to crosscheck this type of faking by the investigator.
iv) Data Processing Error: Once the data is collected the next job the
researcher does is edit, code and enter the data into a computer for further
processing and analysis. The errors can be minimised by careful editing,
coding and entering the data into a computer.
4.7.3 Control of Errors

In the above two sections we have identified the most significant sources of
errors. It is not possible to eliminate completely the sources of errors.
However, the researcher’s objective and effort should be to minimise these
sources of errors as much as possible. There are ways of reducing the errors.
Some of these are:
(a) designing and executing a good questionnaire; (b) selection of appropriate
sampling method; (c) adequate sample size; (d) employing trained investigators
to collect the data; and (e) care in editing, coding and entering the data into the
computer. You have already learned the above ways of controlling the errors
in Unit 3 and in this Unit.
Self Assessment Exercise C
1) The size of a population is 10000. You wish to have a 99% confidence
level and ±5% precision level. What is the sample size required?
..................................................................................................................
..................................................................................................................
..................................................................................................................
2) As the sample size increases, the sampling error:
a) Increases b) Decreases c) Remains constant
....................................................................................................................

3) The sampling errors arise due to:


a) The investigator’s bias b) The data processing problem
c) The respondent’s bias d) The sampling method applied

....................................................................................................................

4.8 LET US SUM UP


A sample is a subset of a population whose characteristics are studied to know
about the population. A complete survey of the population is called a census.
When compared with a census, sampling is less expensive, requires less
time and other resources and is more accurate when samples are taken
properly. Also, sampling is the only alternative when the measurement of
population units is destructive in nature.

There are two broad categories of sampling methods. These are: (a) random
sampling methods, and (b) non-random sampling methods. The random sampling
methods are based on the chance of including the units of population in a
sample.

Some of the sampling methods covered in this Unit are: (a) simple random
sampling, (b) systematic random sampling, (c) stratified random sampling,
(d) cluster sampling, and (e) multistage sampling. With an appropriate sampling
plan and selection of random sampling method the sampling error can be
minimised. The non-random sampling methods include: (a) convenience sampling,
(b) judgment sampling, and (c) Quota sampling. These methods may be
convenient for the researcher to apply. However, these methods may not provide a
representative sample of the population and there are no scientific ways to
check the sampling errors.

There are two major sources of errors in survey research. These are:
(a) sampling errors, and (b) non-sampling errors. The sampling errors arise
because of the fact that the sample may not be a representative sample of the
population. Two major sources of non-sampling errors are due to: (a) non-
response on the part of respondent and/or respondent’s bias in providing correct
information, and (b) administrative errors like design and implementation of
questionnaire, investigators’ bias, and data processing errors.

It may not be possible to completely eliminate the sampling and non-sampling


errors. However, there are some ways to minimise these errors. These are:
(a) designing a good questionnaire, (b) selection of appropriate sampling method,
(c) adequate sample size, (d) employing trained investigators and, (e) care in
data processing.

4.9 KEY WORDS


Administrative Errors : The administrative errors arise due to improper
administration of the research.
Census : A complete survey of population is called census.
Convenience Sampling : Here the units of the population are included in the
sample as per the convenience of the researcher.
Cluster Sampling : In the cluster sampling method we divide the population into
groups called clusters, select a sample of clusters using simple random
sampling, and then cover all the units in each of the clusters included in the
sample.
Judgment Sampling: In this sampling method the selection of sample is based
on the researcher’s judgment about some appropriate characteristics required of
the sample units.
Multi-stage Sampling: Here we select the sample units in a number of stages
using one or more random sampling methods.
Non-sampling Errors : The non-sampling errors arise from faulty research
design and mistakes in executing the research.
Non-random Sampling/Non-Probability Sampling : In this sampling method
the probability of any particular unit of the population being included in the
sample is unknown.
Parameters : The numerical characteristics of a population are called
parameters.
Quota Sampling : In this sampling method the samples are selected on the
basis of some parameters such as age, gender, geographical region, education,
income, occupation etc.
Random Sampling/Probability Sampling : If all the units of the population
have a chance of being chosen in the sample, the sampling method is called
random sampling/probability sampling.
Respondent Errors : The respondent errors arise due to failure of the
respondent to provide correct information.
Sample : A sample is a representative subset of the population.
Sampling Errors : The sampling errors arise because we cover only a part of
the population.
Simple Random Sampling : This is one of the basic methods of random
sampling where each unit in the population has equal chance of being included
in the sample.
Stratified Sampling : The stratified sampling method is used when the
population is heterogeneous. Here the population is divided into some
homogeneous groups called stratums.
Systematic Sampling : In systematic sampling the sample units are selected
from the population at equal intervals in terms of time, space or order.
4.10 ANSWERS TO SELF ASSESSMENT EXERCISES
A. 2) a) sample survey; b) census; c) sample survey.
3) a) biased; b) cost; c) sufficiently, economical.

B. 1) i) Selected sample using simple random sampling (taking the last three
digits of each random number):
283, 581, 359, 538, 435, 103, 428, 042, 678, 868,
699, 163, 652, 845, 308, 859, 292, 587, 159, 370,
065, 580, 285, 884, 410, 135, 495, 032, 242, 516
ii) 39971, 49978, 84904, 87984, 55960, 83928, 89938
The population size is 900 and the last three digits of these random
numbers fall outside the range of serial numbers 001 to 900.
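The selection rule used in this answer (take the last three digits of each five-digit random number, accept them as serial numbers only if they fall in 001 to 900, and stop once 30 valid serials are drawn) can be mechanised as a short script:

```python
# The five rows of random numbers given in the exercise.
table = """29283 31581 04359 45538 41435 61103 32428 94042 39971 63678
19868 49978 81699 84904 50163 22652 07845 71308 00859 87984
14292 93587 55960 23159 07370 65065 06580 46285 07884 83928
77410 52135 29495 23032 83242 89938 40516 27252 55565 64714
36580 06921 35675 81645 60479 71035 99380 59759 42161 93440"""

selected, rejected = [], []
for number in table.split():
    if len(selected) == 30:            # stop once the desired sample is drawn
        break
    serial = number[-3:]               # keep only the last three digits
    if "001" <= serial <= "900":       # valid family serial numbers
        selected.append(serial)
    else:
        rejected.append(number)

print(selected)
print(rejected)  # ['39971', '49978', '84904', '87984', '55960', '83928', '89938']
```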

2) Cluster sampling
3) Stratified sampling
4) a) true
b) false, it is convenience sampling
C. 1) The required sample size is about 616 (using Z = 2.56 for the 99% confidence level)
2) Decreases
3) Sampling method applied

4.11 TERMINAL QUESTIONS


1) What is the difference between random sampling and non-random sampling?
2) List some of the situations where (a) sampling is more appropriate than census
and (b) census is more appropriate than sampling.
3) What are the advantages and disadvantages of stratified random sampling?
4) What are the ways to control survey errors?
5) What are the advantages of sampling over census?
6) Discuss the method of cluster sampling. What is the difference between cluster
sampling and stratified random sampling?
7) The total population is 5000 and you wish a 99% confidence level and a ±5%
precision level. What is the sample size required?
8) A certain population is divided into 4 stratums so that N1 = 4000, N2 = 6000,
N3 = 7000, N4 = 3000. The respective stratum standard deviations are σ1 = 2.0,
σ2 = 4.0, σ3 = 3.0, σ4 = 6.0. How should a sample size of 300 be allocated to
four stratums using: (a) proportional and (b) disproportional methods.
9) Discuss the sources of sampling and non-sampling errors.
10)What are the essentials of a good sample?

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
4.12 FURTHER READING
The following textbooks may be used for more in-depth study of the topics
dealt with in this unit.
Gupta, C.B. & Vijay Gupta, An Introduction to Statistical Methods, Vikas
Publishing House Pvt. Ltd., New Delhi.
Kothari, C.R. (2004) Research Methodology: Methods and Techniques, New Age
International (P) Ltd., New Delhi.
Levin, R.I. and D.S. Rubin (1999) Statistics for Management, Prentice-Hall of
India, New Delhi.
Mustafi, C.K. (1981) Statistical Methods in Managerial Decisions, Macmillan,
New Delhi.


UNIT 5 MEASUREMENT AND SCALING
TECHNIQUES
STRUCTURE
5.0 Objectives
5.1 Introduction
5.2 Measurement and Scaling
5.3 Issues in Attitude Measurement
5.4 Levels of Measurement Scales
5.5 Types of Scaling Techniques
5.5.1 Comparative Scales
5.5.2 Non-comparative Scales
5.6 Selection of an Appropriate Scaling Technique
5.7 Let Us Sum Up
5.8 Key Words
5.9 Answers to Self Assessment Exercises
5.10 Terminal Questions
5.11 Further Reading

5.0 OBJECTIVES
After studying this unit, you should be able to:
l explain the concepts of measurement and scaling,
l discuss four levels of measurement scales,
l classify and discuss different scaling techniques, and
l select an appropriate attitude measurement scale for your research problem.

5.1 INTRODUCTION
As we discussed earlier, the data consists of quantitative variables like price,
income, sales etc., and qualitative variables like knowledge, performance,
character etc. The qualitative information must be converted into numerical
form for further analysis. This is possible through measurement and scaling
techniques. A common feature of survey-based research is the need to capture
respondents’ feelings, attitudes, opinions, etc. in some measurable form. For
example, a bank manager may be interested in knowing the opinion of the
customers about the services provided by the bank. Similarly, a fast food
company having a network in a city may be interested in assessing the quality
and service provided by them. As a researcher you may be interested in
knowing the attitude of the people towards the government announcement of a
metro rail in Delhi. In this unit we will discuss the issues related to
measurement, different levels of measurement scales, various types of scaling
techniques and also selection of an appropriate scaling technique.

5.2 MEASUREMENT AND SCALING


Before we proceed further it will be worthwhile to understand the following
two terms: (a) Measurement, and (b) Scaling.

a) Measurement: Measurement is the process of observing and recording the
observations that are collected as part of research. The recording of the
observations may be in terms of numbers or other symbols assigned to
characteristics of objects according to certain prescribed rules. The respondent’s characteristics
are feelings, attitudes, opinions etc. For example, you may assign ‘1’ for Male
and ‘2’ for Female respondents. In response to a question on whether he/she is
using the ATM provided by a particular bank branch, the respondent may say
‘yes’ or ‘no’. You may wish to assign the number ‘1’ for the response yes and
‘2’ for the response no. We assign numbers to these characteristics for two
reasons. First, the numbers facilitate further statistical analysis of data obtained.
Second, numbers facilitate the communication of measurement rules and results.
The most important aspect of measurement is the specification of rules for
assigning numbers to characteristics. The rules for assigning numbers should be
standardised and applied uniformly. This must not change over time or objects.
b) Scaling: Scaling is the assignment of objects to numbers or semantics according
to a rule. In scaling, the objects are text statements, usually statements of
attitude, opinion, or feeling. For example, consider a scale locating customers of
a bank according to the characteristic “agreement to the satisfactory quality of
service provided by the branch”. Each customer interviewed may respond with
a semantic like ‘strongly agree’, or ‘somewhat agree’, or ‘somewhat disagree’,
or ‘strongly disagree’. We may even assign each of the responses a number.
For example, we may assign ‘strongly agree’ as ‘1’, ‘agree’ as ‘2’, ‘disagree’ as ‘3’,
and ‘strongly disagree’ as ‘4’. Therefore, each of the respondents may be assigned 1,
2, 3 or 4.
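The response coding just described can be sketched in a few lines of Python. The numeric codes follow the example above; the dictionary-based mapping itself is an illustrative choice, not a prescribed procedure:

```python
# Assumed coding from the example above: 1 = strongly agree ... 4 = strongly disagree
codes = {"strongly agree": 1, "agree": 2, "disagree": 3, "strongly disagree": 4}

# Hypothetical semantic responses from four customers
responses = ["agree", "strongly agree", "disagree", "agree"]

# Scaling: assign each semantic response its number
scaled = [codes[r] for r in responses]   # [2, 1, 3, 2]
```

The same rule is applied uniformly to every respondent, which is exactly the standardisation requirement noted under "Measurement" above.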

5.3 ISSUES IN ATTITUDE MEASUREMENT


When a researcher is interested in measuring the attitudes, feelings or opinions
of respondents he/she should be clear about the following:

a) What is to be measured?
b) Who is to be measured?
c) The choices available in data collection techniques
The first issue that the researcher must consider is ‘what is to be measured’?
The definition of the problem, based on our judgments or prior research
indicates the concept to be investigated. For example, we may be interested in
measuring the performance of a fast food company. We may require a precise
definition of the concept on how it will be measured. Also, there may be more
than one way that we can measure a particular concept. For example, in
measuring the performance of a fast food company we may use a number of
measures to indicate the performance of the company. We may use sales
volume in terms of value of sales or number of customers or spread of
network of the company as measures of performance. Further, the
measurement of concepts requires assigning numbers to the attitudes, feelings or
opinions. The key question here is that on what basis do we assign the
numbers to the concept. For example, suppose the task is to measure the agreement of
customers of a fast food company with the statement that the food served
by the company is tasty. We create five categories: (1) strongly agree, (2)
agree, (3) undecided, (4) disagree, (5) strongly disagree. Then we may measure
the responses. If a respondent states ‘disagree’ with the
statement that ‘the food is tasty’, the measurement is 4.

The second important issue in measurement is: who is to be measured?


That means who are the people we are interested in. The characteristics of
the people such as age, sex, education, income, location, profession, etc. may
have a bearing on the choice of measurement. The measurement procedure
must be designed keeping in mind the characteristics of the respondents under
consideration.

The third issue in measurement is the choice of the data collection techniques.
In Unit 3, you have already learnt various methods of data collection. Normally,
questionnaires are used for measuring attitudes, opinions or feelings.

5.4 LEVELS OF MEASUREMENT SCALES


The level of measurement refers to the relationship among the values that are
assigned to the attributes, feelings or opinions for a variable. For example, the
variable ‘whether the taste of fast food is good’ has a number of attributes,
namely, very good, good, neither good nor bad, bad and very bad. For the
purpose of analysing the results of this variable, we may assign the values 1, 2,
3, 4 and 5 to the five attributes respectively. The level of measurement
describes the relationship among these five values. Here, we are simply using
the numbers as shorter placeholders for the lengthier text terms. We don’t
mean that higher values mean ‘more’ of something or lower values mean ‘less’
of something. We don’t assume that ‘good’, which has a value of 2, is twice
‘very good’ which has a value of 1. We don’t even assume that ‘very good’
which is assigned the value ‘1’ has more preference than ‘good’ which is
assigned the value ‘2’. We simply use the values as a shorter name for the
attributes, opinions, or feelings. The assigned values
of attributes allow the researcher more scope for further processing of data and
statistical analysis.

Typically, there are four levels of measurement scales or methods of assigning


numbers: (a) Nominal scale, (b) Ordinal scale, (c) Interval scale, and (d) Ratio
scale.

a) Nominal Scale is the crudest among all measurement scales but it is also the
simplest scale. In this scale the different scores on a measurement simply
indicate different categories. The nominal scale does not express any values or
relationships between variables. For example, labelling men as ‘1’ and women
as ‘2’ which is the most common way of labelling gender for data recording
purpose does not mean women are ‘twice something or other’ than men. Nor it
suggests that men are somehow ‘better’ than women. Another example of
nominal scale is to classify the respondent’s income into three groups: the
highest income as group 1, the middle income as group 2, and the low income
as group 3. The nominal scale is often referred to as a categorical scale. The
assigned numbers have no arithmetic properties and act only as labels. The only
statistical operation that can be performed on nominal scales is a frequency
count. No measure of average other than the mode can be determined.
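Since only a frequency count and the mode are valid for nominal data, a minimal Python sketch (using the hypothetical gender codes 1 = Male, 2 = Female from the text):

```python
from collections import Counter

# Hypothetical nominal codes: 1 = Male, 2 = Female
responses = [1, 2, 2, 1, 2, 2, 1, 2]

# The only legitimate summaries of nominal data: frequency count and mode
freq = Counter(responses)                # frequency of each category
mode = freq.most_common(1)[0][0]         # most frequent category

print(freq)   # Counter({2: 5, 1: 3})
print(mode)   # 2
```

Computing a mean of these codes would be meaningless, since the numbers are only labels.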
In designing and developing a questionnaire, it is important that the response
categories must include all possible responses. In order to have an exhaustive
number of responses, you might have to include a category such as ‘others’,
‘uncertain’, ‘don’t know’, or ‘can’t remember’ so that the respondents will not
distort their information by forcing their responses in one of the categories
provided. Also, you should be careful and be sure that the categories provided
are mutually exclusive so that they do not overlap or get duplicated in any way.
b) Ordinal Scale involves the ranking of items along the continuum of the
characteristic being scaled. In this scale, the items are classified according to
whether they have more or less of a characteristic. For example, you may wish
to ask the TV viewers to rank the TV channels according to their preference
and the responses may look like this as given below:

TV Channel Viewers preferences


Doordarshan-1 1
Star plus 2
NDTV News 3
Aaj Tak TV 4

The main characteristic of the ordinal scale is that the categories have a logical
or ordered relationship. This type of scale permits the measurement of degrees
of difference, (that is, ‘more’ or ‘less’) but not the specific amount of
differences (that is, how much ‘more’ or ‘less’). This scale is very common
in marketing, satisfaction and attitudinal research.

Another example is that a fast food home delivery shop may wish to ask its
customers:
How would you rate the service of our staff?
(1) Excellent • (2) Very Good • (3) Good • (4) Poor • (5) Worst •

Suppose respondent X gave the response ‘Excellent’ and respondent Y gave


the response ‘Good’, we may say that respondent X rated the service
more highly than respondent Y did. But we don’t know how much better,
and we can’t even say that both respondents have the same
understanding of what constitutes ‘good service’.

In marketing research, ordinal scales are used to measure relative attitudes,


opinions, and preferences. Here we rank the attitudes, opinions and preferences
from best to worst or from worst to best. However, the amount of difference
between the ranks cannot be found out. Using ordinal scale data, we can
perform statistical analysis like Median and Mode, but not the Mean.
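A short sketch of the permissible statistics for ordinal data, assuming hypothetical service ratings on the 1-5 scale above (median and mode are meaningful; the mean is not, because the distances between ranks are unknown):

```python
import statistics

# Hypothetical service ratings on the 1-5 ordinal scale
# (1 = Excellent ... 5 = Worst)
ratings = [1, 3, 2, 3, 5, 3, 4]

# Valid for ordinal data: median and mode
median_rating = statistics.median(ratings)   # 3
mode_rating = statistics.mode(ratings)       # 3

# The mean is NOT meaningful here: 'Excellent' plus 'Good' has no defined sum.
```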

c) Interval Scale is a scale in which the numbers are used to rank attributes such
that numerically equal distances on the scale represent equal distance in the
characteristic being measured. An interval scale contains all the information of
an ordinal scale, but it also allows one to compare the difference/distance
between attributes. For example, the difference between ‘1’ and ‘2’ is equal to
the difference between ‘3’ and ‘4’. Further, the difference between ‘2’ and ‘4’
is twice the difference between ‘1’ and ‘2’. However, in an interval scale, the
zero point is arbitrary and is not a true zero. This, of course, has implications for
the type of data manipulation and analysis we can carry out on data collected in
this form. It is possible to add or subtract a constant to all of the scale values
without affecting the form of the scale, but one cannot multiply or divide the
values. Measuring temperature is an example of an interval scale. We cannot say
40°C is twice as hot as 20°C. The reason for this is that 0°C does not mean that
there is no temperature, but is a relative point on the Centigrade scale. Due to the
lack of an absolute zero point, the interval scale does not allow the conclusion
that 40°C is twice as hot as 20°C.
Interval scales may be either in numeric or semantic formats. The following are
two more examples of interval scales one in numeric format and another in
semantic format.

i) Example of Interval Scale in Numeric Format
Instructions: Indicate your score by circling the appropriate number on each line.

Food supplied is:
Fresh                  1 2 3 4 5
Tastes good            1 2 3 4 5
Value for money        1 2 3 4 5
Attractive packaging   1 2 3 4 5
Prompt time delivery   1 2 3 4 5

ii) Example of Interval Scale in Semantic Format


Please indicate your views on the food supplied by XXX Fast Food Shop by
scoring them on a five points scale from 1 to 5 (that is, 1=Excellent, 2=Very
Good, 3=Good, 4=Poor, 5=Worst). Indicate your views by ticking the appropriate
responses below:

Food supplied is: Excellent Very Good Good Poor Worst


Fresh
Tastes good
Value for money
Attractive packaging
Prompt time delivery

The interval scales allow the calculation of averages like Mean, Median and
Mode and dispersion like Range and Standard Deviation.
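As an illustration, a brief Python sketch computing these statistics on hypothetical interval-scale data (daily temperatures in °C); the comment notes why ratios remain meaningless:

```python
import statistics

# Hypothetical daily temperatures in °C (interval scale: arbitrary zero point)
temps = [20, 25, 25, 30, 40]

# All of these are valid on interval data:
mean_t = statistics.mean(temps)       # 28.0
median_t = statistics.median(temps)   # 25
range_t = max(temps) - min(temps)     # 20  (differences ARE meaningful)
sd_t = statistics.stdev(temps)        # sample standard deviation

# Ratios are NOT meaningful: 40 °C is not "twice as hot" as 20 °C,
# because 0 °C is not an absolute zero.
```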

d) Ratio Scale is the highest level of measurement scales. This has the properties
of an interval scale together with a fixed (absolute) zero point. The absolute zero
point allows us to construct a meaningful ratio. Examples of ratio scales include
weights, lengths and times. In the marketing research, most counts are ratio
scales. For example, the number of customers of a bank’s ATM in the last
three months is a ratio scale. This is because you can compare this with
previous three months. Ratio scales permit the researcher to compare both
differences in scores and relative magnitude of scores. For example, the
difference between 10 and 15 minutes is the same as the difference between 25
and 30 minutes and 30 minutes is twice as long as 15 minutes. Most financial
research that deals with rupee values utilizes ratio scales. However, for most
behavioural research, interval scales are typically the highest form of
measurement. Most statistical data analysis procedures do not distinguish
between the interval and ratio properties of the measurement scales and it is
sufficient to say that all the statistical operations that can be performed on
interval scale can also be performed on ratio scales.
Now you must be wondering why you should know the level of measurement.
Knowing the level of measurement helps you to decide on how to interpret the
data. For example, when you know that a measure is nominal then you know
that the numerical values are just short codes for longer textual names. Also,
knowing the level of measurement helps you to decide what statistical analysis is
appropriate on the values that were assigned. For example, if you know that a
measure is nominal, then you should not compute the mean of the data values or
perform a t-test on the data. (The t-test will be discussed in Unit 16 of this course.)
It is important to recognise that there is a hierarchy implied in the levels of
measurement. At lower levels of measurement, assumptions tend to be less
restrictive and data analyses tend to be less sensitive. At each level up the
hierarchy, the current level includes all the qualities of the one below it and adds
something new. In general, it is desirable to have a higher level of measurement
(that is, interval or ratio) rather than a lower one (that is, nominal or ordinal).

Self Assessment Exercise-A

1) The main difference between interval scale and the ratio scale in terms of their
properties is:
...................................................................................................................
....................................................................................................................
....................................................................................................................
2) Why should the researcher know the level of measurement?
....................................................................................................................
....................................................................................................................
...................................................................................................................

3) What are the main statistical limitations of nominal scale?


....................................................................................................................
....................................................................................................................
....................................................................................................................
4) Indicate whether the following measures are nominal, ordinal, interval or ratio
scales?
a) Social status of a respondent: ..............................................................
b) Stock market prices: ...........................................................................
c) The ranks obtained by students: ..........................................................
d) The Fahrenheit scale for measuring temperature: .................................

5.5 TYPES OF SCALING TECHNIQUES


The various types of scaling techniques used in research can be classified into
two categories: (a) Comparative scales, and (b) Non-comparative scales. In
comparative scaling, the respondent is asked to compare one object with
another. For example, the researcher can ask the respondents whether they
prefer brand A or brand B of a detergent. On the other hand, in non-
comparative scaling respondents need only evaluate a single object. Their
evaluation is independent of the other object which the researcher is studying.
Respondents using a non-comparative scale employ whatever rating standard
seems appropriate to them. Non-comparative techniques consist of continuous
and itemized rating scales. Figure 5.1 shows the classification of these scaling
techniques.

Figure 5.1: Scaling Techniques
Scaling Techniques
  Comparative Scales: Paired Comparison, Rank Order, Constant Sum, Q-sort
  Non-Comparative Scales:
    Continuous Rating Scales
    Itemised Rating Scales: Likert, Semantic Differential, Stapel

5.5.1 Comparative Scales

The comparative scales can further be divided into the following four types of
scaling techniques: (a) Paired Comparison Scale, (b) Rank Order Scale, (c)
Constant Sum Scale, and (d) Q-sort Scale.

a) Paired Comparison Scale: This is a comparative scaling technique in which a


respondent is presented with two objects at a time and asked to select one
object (rate between two objects at a time) according to some criterion. The
data obtained are ordinal in nature. For example, there are four types of cold
drinks - Coke, Pepsi, Sprite, and Limca. The respondents can prefer Pepsi to
Coke or Coke to Sprite, etc. In all we can have the following six comparisons.
Coke–Pepsi
Coke–Sprite
Coke–Limca
Pepsi–Sprite
Pepsi–Limca
Sprite–Limca
In general, with n brands we have n(n − 1)/2 paired comparisons. The following is
the data recording format using the paired comparisons.
Table 5.1

Brand                    Coke   Pepsi   Sprite   Limca
Coke                     —      √
Pepsi                           —
Sprite                   √      √       —
Limca                    √      √       √        —
No. of times preferred   2      3       1        0
A √ in a particular box means that the brand in that column was preferred
over the brand in the corresponding row. In the above recording, Coke was
preferred over Sprite and over Limca, so Coke was preferred 2 times. Similarly,
Pepsi was preferred over Coke, Sprite and Limca, so Pepsi was preferred 3
times. Thus, the number of times a brand was preferred is obtained by summing
the √s in its column.
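The n(n − 1)/2 count of pairs and the tallying of √s can be sketched as follows; the list of pairwise winners is the hypothetical data of Table 5.1:

```python
from itertools import combinations

brands = ["Coke", "Pepsi", "Sprite", "Limca"]

# With n brands there are n(n - 1)/2 distinct paired comparisons
n = len(brands)
num_pairs = n * (n - 1) // 2                     # 6 for four brands
pairs = list(combinations(brands, 2))            # the six pairs listed in the text

# Hypothetical winner of each pair, in the order the pairs are listed above
winners = ["Pepsi", "Coke", "Coke", "Pepsi", "Pepsi", "Sprite"]

# Number of times each brand was preferred = column sums of the matrix
counts = {b: winners.count(b) for b in brands}
# Coke: 2, Pepsi: 3, Sprite: 1, Limca: 0, matching Table 5.1
```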

The following table gives paired comparison of data (assumed) for four brands
of cold drinks.

Table 5.2

Brand Coke Pepsi Sprite Limca

Coke – 0.90 0.64 0.14

Pepsi 0.10 – 0.32 0.02

Sprite 0.36 0.68 – 0.15

Limca 0.86 0.98 0.85 –

The entries in the boxes represent the proportion of respondents preferring


the ‘column’ brand to the ‘row’ brand. For example, 90% prefer Pepsi to Coke
and only 10% prefer Coke to Pepsi, etc.

Paired comparison is useful when the number of brands are limited, since it
requires direct comparison and overt choice. One of the disadvantages of paired
comparison scale is that a violation of the assumption of transitivity may occur. For
example, in our data (Table 5.1) the respondent preferred Coke 2 times,
Pepsi 3 times, Sprite 1 time, and Limca 0 times. That means, preference-wise,
Pepsi > Coke, Coke > Sprite, and Sprite > Limca. Transitivity requires that the number
of times Sprite is preferred should not exceed that of Coke. In other words, if A > B and
B > C, then C > A should not be possible. Also, the order in which the objects
are presented may bias the results. The number of items/brands for comparison
should not be too many. As the number of items increases, the number of
comparisons increases geometrically. If the number of comparisons is too large,
the respondents may become fatigued and no longer be able to carefully
discriminate among them. The other limitation of paired comparison is that this
scale has little resemblance to the market situation, which involves selection
from multiple alternatives. Also, respondents may prefer one item over certain
others, but they may not like it in an absolute sense.

b) Rank Order Scale: This is another type of comparative scaling technique in


which respondents are presented with several items simultaneously and asked to
rank them in the order of priority. This is an ordinal scale that describes the
favoured and unfavoured objects, but does not reveal the distance between the
objects. For example, if you are interested in ranking the preference of some
selected brands of cold drinks, you may use the following format for recording
the responses.

Table 5.3: Preference of cold drink brands using rank order scaling
Instructions: Rank the following brands of cold drinks in order of
preference. Begin by picking out the one brand you like most and assign it a
number 1. Then find the second most preferred brand and assign it a number
2. Continue this procedure until you have ranked all the brands of cold drinks
in order of preference. The least preferred brand should be assigned a rank
of 4. Also remember no two brands receive the same rank order.
Format:
Brand Rank
(a) Coke 3
(b) Pepsi 1
(c) Limca 2
(d) Sprite 4

Like paired comparison, the rank order scale is also comparative in nature. The
resultant rank order data are ordinal. This method is more realistic in
obtaining the responses and it yields better results when direct comparisons are
required between the given objects. The major disadvantage of this technique is
that only ordinal data can be generated.
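Since rank-order data are ordinal, one defensible way to aggregate them across respondents is the median rank. A sketch with hypothetical ranks from three respondents:

```python
import statistics

# Hypothetical ranks (1 = most preferred) from three respondents
ranks = {
    "Coke":   [3, 2, 3],
    "Pepsi":  [1, 1, 2],
    "Limca":  [2, 3, 1],
    "Sprite": [4, 4, 4],
}

# Rank data are ordinal, so summarise each brand by its median rank
median_rank = {brand: statistics.median(r) for brand, r in ranks.items()}

# Order brands from most to least preferred by median rank
order = sorted(median_rank, key=median_rank.get)
# ['Pepsi', 'Limca', 'Coke', 'Sprite']
```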

c) Constant Sum Scale: In this scale, the respondents are asked to allocate a
constant sum of units such as points, rupees, or chips among a set of stimulus
objects with respect to some criterion. For example, you may wish to determine
how important the attributes of price, fragrance, packaging, cleaning power, and
lather of a detergent are to consumers. Respondents might be asked to divide a
constant sum to indicate the relative importance of the attributes using the
following format.
Table 5.4: Importance of detergent attributes using a constant sum scale

Instructions: Please allocate 100 points among the following attributes of a
detergent so that your allocation reflects the relative importance
you attach to each attribute. The more points an attribute receives, the more
important the attribute is. If an attribute is not at all important, assign it zero
points. If an attribute is twice as important as some other attribute, it should
receive twice as many points.
Format:
Attribute Number of Points
(a) Price 50
(b) Fragrance 05
(c) Packaging 10
(d) Cleaning Power 30
(e) Lather 05
Total Points 100

If an attribute is assigned a higher number of points, it would indicate that the
attribute is more important. From the above table, the price of the detergent is
the most important attribute for the consumers, followed by cleaning power and
packaging. Fragrance and lather are the two attributes that the consumers
cared about the least, and equally so.
saving time. However, there are two main disadvantages. The respondents may
allocate more or fewer points than those specified. The second problem is
rounding-off error if too few attributes are used, while the use of a large number
of attributes may be too taxing on the respondent and cause confusion and
fatigue.
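A small sketch of how constant-sum responses might be screened and summarised. The allocations are hypothetical; screening out forms that do not total 100 addresses the first disadvantage noted above:

```python
# Hypothetical constant-sum allocations (each respondent must total 100 points)
respondents = [
    {"Price": 50, "Fragrance": 5, "Packaging": 10, "Cleaning Power": 30, "Lather": 5},
    {"Price": 40, "Fragrance": 10, "Packaging": 10, "Cleaning Power": 35, "Lather": 5},
]

# Screen out invalid forms where points do not total the constant sum
valid = [r for r in respondents if sum(r.values()) == 100]

# Mean points per attribute indicate relative importance across the sample
attributes = list(respondents[0])
mean_points = {a: sum(r[a] for r in valid) / len(valid) for a in attributes}
# e.g. Price averages 45.0 points, Cleaning Power 32.5
```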

d) Q-Sort Scale: This is a comparative scale that uses a rank order procedure to
sort objects based on similarity with respect to some criterion. The important
characteristic of this methodology is that it is more important to make
comparisons among different responses of a respondent than the responses
between different respondents. Therefore, it is a comparative method of scaling
rather than an absolute rating scale. In this method the respondent is given
a large number of statements describing the characteristics of a product or a
large number of brands of a product. For example, you may wish to determine
the preference from among a large number of magazines. The following format
shown in Table 5.5 may be given to a respondent to obtain the preferences.
Table 5.5: Preference of Magazines Using Q-Sort Scale Procedure

Instructions: The bag given to you contains pictures of 90 magazines.


Please choose 10 magazines you ‘prefer most’, 20 magazines you ‘like’,
30 magazines to which you are ‘neutral (neither like nor dislike)’, 20
magazines you ‘dislike’, and 10 magazines you ‘prefer least’. Please list
the sorted magazine names in the respective columns of the form
provided to you.
Format:

Prefer Most   Like         Neutral      Dislike      Prefer Least
(10 names)    (20 names)   (30 names)   (20 names)   (10 names)
_________     _________    _________    _________    _________
_________     _________    _________    _________    _________
_________     _________    _________    _________    _________
(Continue listing the sorted magazine names under the respective columns.)
Note that the number of responses to be sorted should not be less than 60 or
more than 140. A reasonable range is 60 to 90 responses that result in a
normal or quasi-normal distribution. This method is faster and less tedious than
paired comparison measures. It also forces the subject to conform to quotas at
each point of scale so as to yield a quasi-normal distribution. The utility of Q-
sort in marketing research is to derive clusters of individuals who display similar
preferences, thus representing unique market segments.
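The quota structure of the Q-sort above can be checked programmatically; a sketch using the 10/20/30/20/10 quotas of Table 5.5:

```python
# Hypothetical Q-sort quotas from Table 5.5 (90 magazines in total)
quotas = {"Prefer Most": 10, "Like": 20, "Neutral": 30,
          "Dislike": 20, "Prefer Least": 10}

# The quotas must exhaust all objects being sorted,
# and the total should fall within the suggested 60-140 range
total = sum(quotas.values())            # 90
assert 60 <= total <= 140

# Symmetric quotas force a quasi-normal distribution of responses
values = list(quotas.values())
assert values == values[::-1]           # symmetric around the middle pile
```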

5.5.2 Non-Comparative Scales

The non-comparative scaling techniques can be further divided into: (a)


Continuous Rating Scale, and (b) Itemised Rating Scale.

a) Continuous Rating Scales


It is very simple and highly useful. In a continuous rating scale, respondents
rate the objects by placing a mark at the appropriate position on a continuous
line that runs from one extreme of the criterion variable to the other. Examples
of continuous rating scale are given below:

Question: How would you rate the TV advertisement as a guide for buying?

Scale Type A

Strongly Strongly
agree disagree

Scale Type B
Strongly Strongly
disagree agree

Scale Type C
Strongly Strongly
agree 10 9 8 7 6 5 4 3 2 1 0 disagree

Scale Type D
Strongly Strongly
disagree 0 1 2 3 4 5 6 7 8 9 10 agree

When scale type A or B is used, the respondent’s score is determined either
by dividing the line into as many categories as desired and assigning the
respondent a score based on the category into which his/her mark falls, or by
measuring the distance, in millimetres, centimetres, or inches, from either end of
the scale. Whichever of the above continuous scales is used, the results are
normally analysed as interval scaled.
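Converting a measured mark position into a score can be sketched as follows; the 100 mm line length and the 0-10 scoring range are illustrative assumptions:

```python
def continuous_score(mark_mm: float, line_mm: float = 100.0,
                     points: int = 10) -> float:
    """Convert the measured position of a respondent's mark on a
    continuous rating line into a score on a 0..points scale.

    mark_mm -- distance of the mark from the 'strongly disagree' end
    line_mm -- total length of the printed line (assumed 100 mm here)
    """
    return round(points * mark_mm / line_mm, 1)

# A mark measured 73 mm along a 100 mm line scores 7.3 on the 0-10 scale
score = continuous_score(73)
```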

b) Itemised Rating Scales


Itemised rating scale is a scale having numbers or brief descriptions associated
with each category. The categories are ordered in terms of scale position and
the respondents are required to select one of the limited number of categories
that best describes the product, brand, company, or product attribute being
rated. Itemised rating scales are widely used in marketing research.

The itemised rating scales can be in the form of: (a) graphic, (b) verbal, or (c)
numeric, as shown below:

Itemised Graphic Scale:  Favourable / Indifferent / Unfavourable

Itemised Verbal Scale:   Completely satisfied / Somewhat satisfied /
                         Neither satisfied nor dissatisfied /
                         Somewhat dissatisfied / Completely dissatisfied

Itemised Numeric Scale:  +5  +4  +3  +2  +1  0  −1  −2  −3  −4  −5

Some rating scales may have only two response categories such as : agree and
disagree. Inclusion of more response categories provides the respondent more
flexibility in the rating task. Consider the following questions:

1. How often do you visit the supermarket located in your area of residence?

• Never, • Rarely, • Sometimes, • Often, • Very often

2. In your case how important is the price of brand X shoes when you buy them?

• Very important, • Fairly important, • Neutral, • Not so important

Each of the above category scales is a more sensitive measure than a scale
with only two responses since they provide more information.

Wording is an extremely important factor in the usefulness of itemised


scales. Table 5.6 shows some common wordings for categories used in itemised
scales.

Table 5.6: Some common words for categories used in Itemised Rating Scales
Quality:
Excellent Good Not decided Poor Worst
Very Good Good Neither good Fair Poor
nor bad
Importance:
Very Important Fairly Neutral Not so Not at all
important important important
Interest:
Very interested Somewhat Neither interested Somewhat Not very
interested nor disinterested uninterested interested
Satisfaction:
Completely Somewhat Neither satisfied Somewhat Completely
satisfied satisfied nor dissatisfied dissatisfied dissatisfied

Frequency:
All of the time Very often Often Sometimes Hardly ever
Very often Often Sometimes Rarely Never

Truth:
Very true Somewhat Not very true Not at all true
true
Purchase
Interest:
Definitely will Probably will Probably will Definitely
buy buy not buy will not buy
Level of
Agreement:
Strongly agree Somewhat Neither agree Somewhat Strongly
agree nor disagree disagree disagree
Dependability:
Completely Somewhat Not very Not at all
dependable dependable dependable dependable
Style:
Very stylish Somewhat Not very Completely
stylish stylish unstylish
Cost:
Extremely Expensive Neither Slightly Very
expensive expensive nor inexpensive inexpensive
inexpensive
Ease of use:
Very easy to Somewhat Not very easy Difficult to
use easy to use to use use
Modernity:
Very modern Somewhat Neither modern Somewhat Very old
modern nor old-fashioned old fashioned fashioned
Alert:
Very alert Alert Not alert Not at all alert

In this section we will discuss three itemised rating scales, namely (a) Likert
Scale, (b) Semantic Differential Scale, and (c) Stapel Scale.

a) Likert Scale: In business research, the Likert scale, developed by Rensis


Likert, is extremely popular for measuring attitudes because the method is
simple to administer. With the Likert scale, the respondents indicate their own
attitudes by checking how strongly they agree or disagree with carefully worded
statements that range from very positive to very negative towards the attitudinal
object. Respondents generally choose from five alternatives (say strongly
agree, agree, neither agree nor disagree, disagree, strongly disagree).
Consider the following example of a study measuring attitudes towards
cricket.

Strongly Agree Not sure Disagree Strongly


agree disagree
It is more fun to play a
tough, competitive cricket match 5 4 3 2 1
than to play an easy one.
To measure the attitude, the researchers assign weights or scores to the
alternative responses. In the above example the scores 5 to 1 are assigned
to the responses. Strong agreement of the respondent indicates the most
favourable attitudes on the statement, and the score 5 is assigned to it. On
the other hand, strong disagreement of the respondent indicates the most
unfavourable attitude on the statement, and the score 1 is assigned to it. If
a negative statement towards the object is given, the corresponding scores
would be reversed. In this case, the response ‘strongly agree’ will get a
score of 1 and the response ‘strongly disagree’ will get a score of 5.

A Likert scale may include a number of items or statements. Each statement is


assumed to represent an aspect of an attitudinal domain. For example, Table 5.7
shows the items in a Likert Scale to measure opinions on food products.
Table 5.7: A Likert Scale for studying opinions on food products

                                        Strongly  Agree  Neither    Disagree  Strongly
                                        Agree            Agree Nor            Disagree
                                                         Disagree
If the price of raw materials falls,
firms too should reduce the price          1        2        3         4         5
of the food products.

There should be a uniform price
throughout the country for food            1        2        3         4         5
products.

The food companies should
concentrate more on keeping hygiene        1        2        3         4         5
while manufacturing food products.

The expiry dates should be printed
on the food products before they are       1        2        3         4         5
delivered to consumers in the market.

There should be government
regulations on the firms in keeping        1        2        3         4         5
acceptable quality and on the prices.

Nowadays most food companies are
concerned only with profit making          1        2        3         4         5
rather than taking care of quality.
Each respondent is asked to circle his opinion on a score against each
statement. The final score for the respondent on the scale is the sum of their
ratings for all the items. The very purpose of Likert’s Scale is to ensure the final
items evoke a wide response and discriminate among those with positive and
negative attitudes. Items that are poor (because they lack clarity or elicit mixed
response patterns) are dropped from the final statement list. This enables us
to discriminate between high positive scores and high negative scores. However,
many business researchers do not follow this procedure and you may not be in a
position to distinguish between high positive scores and high negative scores
because all scores look alike. Hence a disadvantage of the Likert Scale is that it
is difficult to know what a single summated score means. Many patterns of
response to the various statements can produce the same total score. The other
disadvantage of Likert Scale is that it takes longer time to complete than other
itemised rating scales because respondents have to read each statement.
Despite the above disadvantages, this scale has several advantages. It is easy to
construct, administer and use.
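The scoring procedure described above (summing item ratings and reverse-coding negatively worded statements) can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the original text; the function name and the responses are hypothetical.

```python
# Illustrative sketch: summated Likert scoring with reverse-coding.
# The ratings and the set of negative items are hypothetical examples.

def likert_score(responses, negative_items):
    """Sum item ratings (1..5), reverse-coding negatively worded items.

    responses: list of ratings, one per statement, each between 1 and 5
    negative_items: set of 0-based indices of negatively worded statements
    """
    total = 0
    for i, rating in enumerate(responses):
        if i in negative_items:
            rating = 6 - rating  # reverse: 5 becomes 1, 4 becomes 2, ...
        total += rating
    return total

# Respondent who strongly agrees (5) with two positive statements and
# strongly agrees (5) with one negative statement (item index 2):
print(likert_score([5, 4, 5], negative_items={2}))  # 5 + 4 + (6-5) = 10
```

Note how a high total can arise from many different response patterns, which is exactly the interpretation difficulty mentioned above.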
b) Semantic Differential Scale: This is a seven point rating scale with end points
associated with bipolar labels (such as good and bad, complex and simple) that
have semantic meaning. The Semantic Differential scale is used for a variety of
purposes. It can be used to find whether a respondent has a positive or negative
attitude towards an object. It has been widely used in comparing brands,
products and company images. It has also been used to develop advertising and
promotion strategies and in a new product development study. Look at the
following Table, for examples of Semantic Differential Scale.

Table 5.8: Examples of Semantic Differential Scale

Modern — — — — — — — Old-fashioned
Good — — — — — — — Bad
Clean — — — — — — — Dirty
Important — — — — — — — Unimportant
Expensive — — — — — — — Inexpensive
Useful — — — — — — — Useless
Strong — — — — — — — Weak
Quick — — — — — — — Slow

In the Semantic Differential scale only extremes have names. The extreme
points represent the bipolar adjectives with the central category representing the
neutral position. The in-between categories have blank spaces. A weight is
assigned to each position on the scale. The weights can be such as +3, +2, +1, 0,
–1, –2, –3 or 7,6,5,4,3,2,1. The following is an example of Semantic
Differential Scale to study the experience of using a particular brand of body
lotion.

In my experience, the use of body lotion of Brand-X was:
               +3   +2   +1    0   –1   –2   –3
Useful         ___  ___  ___  ___  ___  ___  ___  Useless
Attractive     ___  ___  ___  ___  ___  ___  ___  Unattractive
Passive        ___  ___  ___  ___  ___  ___  ___  Active
Beneficial     ___  ___  ___  ___  ___  ___  ___  Harmful
Interesting    ___  ___  ___  ___  ___  ___  ___  Boring
Dull           ___  ___  ___  ___  ___  ___  ___  Sharp
Pleasant       ___  ___  ___  ___  ___  ___  ___  Unpleasant
Cold           ___  ___  ___  ___  ___  ___  ___  Hot
Good           ___  ___  ___  ___  ___  ___  ___  Bad
Likable        ___  ___  ___  ___  ___  ___  ___  Unlikable

In the Semantic Differential scale, the phrases used to describe the object form
a basis for attitude formation in the form of positive and negative phrases. The
negative phrase is sometimes put on the left side of the scale and sometimes on
the right side. This is done to prevent a respondent with a positive attitude from
simply checking the left side and a respondent with a negative attitude checking
on the right side without reading the description of the words.
The respondents are asked to check the individual cells depending on the
attitude. Then one could arrive at the average scores for comparisons of
different objects. The following Figure shows the experiences of 100
consumers on 3 brands of body lotion.
[Figure: average semantic differential profiles of Brand-X, Brand-Y and
Brand-Z plotted on the ten bipolar scales (Useful-Useless,
Attractive-Unattractive, Passive-Active, Beneficial-Harmful,
Interesting-Boring, Dull-Sharp, Pleasant-Unpleasant, Cold-Hot, Good-Bad,
Likable-Unlikable) from +3 to –3.]

In the above example, first the individual respondent scores for each dimension
are obtained and then the average scores of all 100 respondents, for each
dimension and for each brand were plotted graphically. The maximum score
possible for each brand is + 30 and the minimum score possible for each brand
is –30. Brand-X has score +14. Brand-Y has score +7, and Brand-Z has score
–11. From the scale we can identify which phrase needs improvement for each
Brand. For example, Brand-X needs to be improved on benefits and Brand-Y
on pleasantness, coldness and likeability. Brand-Z needs to be improved on all
the attributes.
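The brand comparison above (averaging each bipolar item over the respondents and totalling the averages) can be sketched as follows. This is an illustrative Python sketch; the number of items and the ratings are hypothetical, not the data from the example.

```python
# Illustrative sketch: averaging semantic differential ratings across
# respondents and totalling them per brand. Ratings are hypothetical.

def brand_profile(ratings_by_respondent):
    """ratings_by_respondent: one inner list per respondent, each holding
    a rating in the range -3..+3 for every bipolar item."""
    n = len(ratings_by_respondent)
    n_items = len(ratings_by_respondent[0])
    averages = [sum(r[i] for r in ratings_by_respondent) / n
                for i in range(n_items)]
    total = sum(averages)  # ranges from -3*n_items to +3*n_items
    return averages, total

# Three hypothetical respondents rating one brand on four bipolar items:
ratings = [[3, 2, -1, 1], [2, 2, 0, 1], [1, 3, -2, 2]]
avgs, total = brand_profile(ratings)
print(avgs)   # per-item average scores, one point on the plotted profile each
print(total)  # overall brand score, comparable across brands
```

Plotting each brand's list of per-item averages side by side reproduces the profile chart described above.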

c) Stapel Scale: The Stapel scale was originally developed to measure the
direction and intensity of an attitude simultaneously. Modern versions of the
Stapel scale place a single adjective as a substitute for the Semantic Differential
when it is difficult to create pairs of bipolar adjectives. The modified Stapel
scale places a single adjective in the centre of an even number of numerical
values (say, +3, +2, +1, –1, –2, –3). This scale measures how close to or how
distant from the adjective a given stimulus is perceived to be. The following is an
example of a Stapel scale.

Instructions: Select a plus number for words that you think describe the
personal banking services of a bank accurately. The more accurately you think the
word describes the bank, the larger the plus number you should choose.
Select a minus number for words you think do not describe the bank
accurately. The less accurately you think the word describes the bank, the
larger the minus number you should choose.
Format:
+5 +5
+4 +4
+3 +3
+2 +2
+1 +1
Friendly Personnel Competitive Loan Rates
–1 –1
–2 –2
–3 –3
–4 –4
–5 –5

The following format shows an example of a Stapel scale that illustrates
respondents' descriptions of the personal banking services of a bank.

                       +4   +3   +2   +1   –1   –2   –3   –4
Fast Service           ___  ___  ___  ___  ___  ___  ___  ___
Friendly               ___  ___  ___  ___  ___  ___  ___  ___
Honest                 ___  ___  ___  ___  ___  ___  ___  ___
Convenient Location    ___  ___  ___  ___  ___  ___  ___  ___
Convenient Hours       ___  ___  ___  ___  ___  ___  ___  ___
Dull                   ___  ___  ___  ___  ___  ___  ___  ___
Good Services          ___  ___  ___  ___  ___  ___  ___  ___
High Saving Rates      ___  ___  ___  ___  ___  ___  ___  ___
Each respondent is asked to circle his opinion on a score against each phrase
that describes the object. The final score of the respondent on a scale is the
sum of their ratings for all the items. Also, the average score for each phrase
is obtained by totalling the scores of all the respondents on that phrase,
divided by the number of respondents. The following Figure
shows the opinions of 100 respondents on two banks.

[Figure: average Stapel-scale profiles of Bank-X and Bank-Y plotted on the
eight phrases (Fast Service, Friendly, Honest, Convenient Location,
Convenient Hours, Dull, Good Services, High Saving Rates) from +4 to –4.]

In the above example, first the individual respondent's scores for each phrase
that describes the selected bank are obtained and then the average scores of all
100 respondents for each phrase are plotted graphically. The maximum score
possible for each bank is +32 and the minimum possible score for each bank is
–32. In the example, Bank-X has score +24, and Bank-Y has score +3. From
the scale we can identify which phrase needs improvement for each bank.
The advantages and disadvantages of the Stapel scale are very similar to those
for the Semantic differential scale. However, the Stapel scale tends to be easier
to construct and administer, especially over the telephone, since the Stapel scale
does not call for the bipolar adjectives as does the Semantic differential scale.
However, research on comparing the Stapel scale with Semantic differential
scale suggests that the results of both the scales are largely the same.
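The per-phrase averaging used for the bank comparison can be sketched as follows; this is an illustrative Python fragment in which the phrases and the responses are hypothetical.

```python
# Illustrative sketch: average Stapel-scale score for each phrase across
# respondents (scale values run +4..+1 and -1..-4, with no zero point).
# The phrases and responses below are hypothetical.

phrases = ["Fast Service", "Friendly", "Honest"]
# One row per respondent, one column per phrase:
responses = [
    [+4, +3, -1],
    [+3, +2, +1],
    [+2, +4, -2],
]

for j, phrase in enumerate(phrases):
    # Total of all respondents on this phrase divided by their number.
    avg = sum(row[j] for row in responses) / len(responses)
    print(f"{phrase}: {avg:+.2f}")
```

The phrase averages form the profile that is plotted for each bank, and their sum gives the bank's overall score.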

5.6 SELECTION OF AN APPROPRIATE SCALING TECHNIQUE
In this unit, so far, you have learnt some of the important scaling techniques
that are frequently used in attitudinal research for the measurement of attitudes.
Each of these techniques has some advantages and disadvantages. Now you
may ask which technique is more appropriate to use to measure attitudes.
Virtually any technique can be used to measure the attitudes. But at the same
time all techniques are not suitable for all purposes. As a general rule, you
should use a scaling technique that will yield the highest level of information
feasible in a given situation. Also, if possible, the technique should permit
the use of a variety of statistical analyses. A number of issues decide the
choice of scaling technique. Some significant issues are:

1) Problem Definition and Statistical Analysis: The choice between ranking,
sorting, or rating techniques is determined by the problem definition and the type
of statistical analysis likely to be performed. For example, ranking provides only
ordinal data, which limits the use of statistical techniques.
2) The Choice between Comparative and Non-comparative Scales: Sometimes
it is better to use a comparative scale rather than a non-comparative
scale. Consider the following example:
How satisfied are you with the Brand-X detergent that you are presently
using?

Completely    Somewhat    Neither          Somewhat      Completely
satisfied     satisfied   satisfied nor    dissatisfied  dissatisfied
                          dissatisfied

This is a non-comparative scale since it deals with a single concept (the brand of
a detergent). On the other hand, a comparative scale asks a respondent to rate
one concept against another. For example, you may ask:
Which one of the following brands of detergent do you prefer?
Brand-X        Brand-Y
In this example you are comparing one brand of detergent with another brand.
Therefore, in many situations, comparative scaling presents ‘the ideal situation’
as a reference for comparison with the actual situation.

3) Type of Category Labels: We have discussed different types of category
labels used in constructing measurement scales, such as verbal categories and
numeric categories. Many researchers use verbal categories since they believe
that these categories are understood well by the respondents. The maturity and
the education level of the respondents influence this decision.
4) Number of Categories: While there is no single, optimal number of categories,
traditional guidelines suggest that there should be between five and nine
categories. Also, if a neutral or indifferent scale response is possible for at least
some of the respondents, an odd number of categories should be used.
However, the researcher must determine the number of meaningful positions
that are best suited for a specific problem.

5) Balanced versus Unbalanced Scale: In general, the scale should be balanced
to obtain objective data.

6) Forced versus Nonforced Categories: In situations where the respondents
are expected to have no opinion, the accuracy of data may be improved by a
nonforced scale that provides a ‘no opinion’ category.

Self Assessment Exercises B

1) In paired comparison, the order in which the objects are presented may
____________ results.
2) A researcher wants to measure consumer preference between 7 brands of
bath soap and has decided to use the paired comparisons scaling technique.
How many pairs of brands will the researcher present to the respondents?
________________
3) In a semantic differential scale there are 20 scale items. Should all the
positive adjectives be on the left side and all the negative adjectives be on the
right side. Can you explain?
................................................................................................................
................................................................................................................
................................................................................................................

4) Indicate the type of scale used in the following examples.
a) Do you favour or oppose the return of the ruling party in the next elections?

(i) Favour (ii) Neutral (iii) Oppose


..........................................................................................................
b) In which one of the following pairs of companies would you like to invest
your money?
i) MTNL - Reliance
ii) MTNL - BPL
iii) Reliance - BPL

c) Suppose Rs. 1,000 is given to you. How do you spend it?


Items Amount (Rs.)
(a) Books
(b) Clothes
(c) Fast Food
Total 1000

5.7 LET US SUM UP

There are four levels of measurement: nominal, ordinal, interval, and
ratio. These constitute a hierarchy where the lowest scale of measurement,
nominal, has far fewer statistical applications than those further up this
hierarchy of scales. Nominal scales yield data on categories; ordinal scales give
sequences; interval scales begin to reveal the magnitude between points on the
scale; and ratio scales explain both order and the absolute distance between any
two points on the scale.

The measurement scales, commonly used in marketing research, can be divided
into two types: comparative and non-comparative scales. Comparative scales
involve the respondent in signaling where there is a difference between two or
more firms, brands, services, or other stimuli. The scales under this type are:
(a) Paired Comparison, (b) Rank Order, (c) Constant Sum, and (d) Q-sort.
Further, the non-comparative scales can be classified into: (a) Continuous
Rating Scales and (b) Itemised Rating Scales. The Itemised Rating scales can
further be classified into: (a) Likert Scale, (b) Semantic Differential Scale, and
(c) Stapel Scale.

A number of scaling techniques are available for the measurement of attitudes.
There is no unique way that you can use to select a particular scaling
technique for your research study. A number of issues, such as problem
definition and statistical analysis, choice between comparative and
non-comparative scales, type of category labels, number of categories, etc.,
discussed in this unit should be considered before you arrive at a particular
scaling technique.

5.8 KEY WORDS


Comparative Scales : In comparative scaling, the respondent is asked to
compare one object with another.
Constant Sum Scale : In this scale, the respondents are asked to allocate a
constant sum of units such as points, rupees, or chips among a set of stimulus
objects with respect to some criterion.
Continuous Rating Scales : Here the respondents rate the objects by placing
a mark at the appropriate position on a continuous line that runs from one
extreme of the criterion variable to the other.
Itemised Rating Scales : Itemised rating scale is a scale having numbers or
brief descriptions associated with each category.
Interval Scale : In this scale, the numbers are used to rank attributes such
that numerically equal distances on the scale represent equal distances in the
characteristic being measured.
Likert Scale : With the Likert scale, the respondents indicate their own
attitudes by checking how strongly they agree or disagree with carefully worded
statements that range from very positive to very negative towards the attitudinal
object.
Measurement : Measurement is the process of observing and recording the
observations that are collected as part of research.
Non-comparative Scales : In non-comparative scaling, respondents need only
evaluate a single object.
Nominal Scale : In this scale, the different scores on a measurement simply
indicate different categories.
Ordinal Scale : In this scale, the items are ranked according to whether they
have more or less of a characteristic.
Paired Comparison Scale : This is a comparative scaling technique in which
a respondent is presented with two objects at a time and asked to select one
object according to some criterion.
Q-Sort Scale : This is a comparative scale that uses a rank order procedure
to sort objects based on similarity with respect to some criterion.
Rank Order Scale : In this scale, the respondents are presented with several
items simultaneously and asked to order or rank them according to some
criterion.
Ratio Scale : Ratio scales permit the researcher to compare both differences
in scores and relative magnitude of scores.
Scaling : Scaling is the assignment of objects to numbers or semantics
according to a rule.
Semantic Differential Scale : This is a seven point rating scale with end
points associated with bipolar labels (such as good and bad, complex and
simple) that have semantic meaning.
Stapel Scale : The Stapel scale places a single adjective as a substitute for the
Semantic Differential when it is difficult to create pairs of bipolar adjectives.

5.9 ANSWERS TO SELF ASSESSMENT EXERCISES


A. 1) Interval scale does not have a fixed (absolute) zero point whereas ratio scale
has a fixed zero point that allows us to construct a meaningful ratio.
2) Knowing the level of measurement helps in interpreting the data and
performing statistical analysis of the data.
3) In nominal scale the assigned numbers have no arithmetic properties and act
only as labels. The only statistical operation that can be performed on
nominal scales is frequency count.
4) a) Nominal Scale, b) Ratio Scale, c) Ordinal Scale, d) Interval Scale.
B. 1) Bias

2) 21

3) No. Some of the positive adjectives may be placed on the left side and
some on the right side. This prevents the respondent with positive
(negative) attitude from simply checking the left (right) side without
reading the description of the words.

4) a) Itemised rating scale, b) Paired comparison scale, c) Constant sum


scale.

5.10 TERMINAL QUESTIONS


1) Discuss briefly different issues you consider for selecting an appropriate scaling
technique for measuring attitudes.
2) What are the different levels of measurement? Explain any two of them.
3) How do you select an appropriate scaling technique for a research study?
Explain the issues involved in it.
4) Discuss briefly the issues involved in attitude measurement.
5) Differentiate between ranking scales and rating scales. Which one of these
scales is better for measuring attitudes?
6) In what type of situation is the Q-sort technique more appropriate?
7) Name any four situations in commerce where you can use the Likert scale.
8) Construct a Rank Order Scale to measure toothpaste preferences. Discuss its
advantages and disadvantages.
9) Construct a Semantic differential scale to measure the experiences of
respondents in using Brand-X of shaving cream (assume that all the
respondents use that brand).

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

5.11 FURTHER READING


The following text books may be used for more indepth study on the topics
dealt with in this unit.
Aaker, David A. and George S. Day. (1983) Marketing Research, John Wiley,
New York.
Bailey, Kenneth D. (1978) Methods of Social Research, The Free Press, New
York.
Coombs, C.H.(1953) “Theory and Methods of Social Measurement”, in
Research Methods in the Behavioral Sciences, eds. Feslinger, L. and Ratz,
D., Holt, Rinehart and Winston.
Donald S. Tull and Gerald S. Albaum. (1973) Survey Research: A Decisional
Approach, Index Educational Publishers, New York.
Meister, David. (1985) Behavioural Analysis and Measurement Methods,
John Wiley, New York.
Rodger, Leslie W. (1984) Statistics for Marketing, McGraw-Hill (UK),
London.

UNIT 6 PROCESSING OF DATA
STRUCTURE

6.0 Objectives
6.1 Introduction
6.2 Editing of Data
6.3 Coding of Data
6.4 Classification of Data
6.4.1 Types of Classification
6.4.1.1 Classification According to External Characteristics
6.4.1.2 Classification According to Internal Characteristics
6.4.1.3 Preparation of Frequency Distribution
6.5 Tabulation of Data
6.5.1 Types of Tables
6.5.2 Parts of a Statistical Table
6.5.3 Requisites of a Good Statistical Table
6.6 Let Us Sum Up
6.7 Key Words
6.8 Answers to Self Assessment Exercises
6.9 Terminal Questions/Exercises
6.10 Further Reading

6.0 OBJECTIVES
After studying this unit, you should be able to:
l evaluate the steps involved in processing of data,
l check for obvious mistakes in data and improve the quality of data,
l describe various approaches to classify data,
l construct frequency distribution of discrete and continuous data, and
l develop an appropriate data tabulation device.

6.1 INTRODUCTION
In Unit 3 we have discussed various methods of collection of data. Once the
collection of data is over, the next step is to organize data so that meaningful
conclusions may be drawn. The information content of the observations has to
be reduced to a relatively few concepts and aggregates. The data collected
from the field has to be processed as laid down in the research plan. This is
possible only through systematic processing of data. Data processing involves
editing, coding, classification and tabulation of the data collected so that they
are amenable to analysis. This is an intermediary stage between the collection
of data and their analysis and interpretation. In this unit, therefore, we will learn
about different stages of processing of data in detail.

6.2 EDITING OF DATA


Editing is the first stage in data processing. Editing may be broadly defined to
be a procedure, which uses available information and assumptions to substitute
inconsistent values in a data set. In other words, editing is the process of
examining the data collected through various methods to detect errors and
omissions and correct them for further analysis. While editing, care has to be
taken to see that the data are as accurate and complete as possible, units of
observations and number of decimal places are the same for the same variable.
The following practical guidelines may be handy while editing the data:

1) The editor should have a copy of the instructions given to the interviewers.
2) The editor should not destroy or erase the original entry. Original entry should
be crossed out in such a manner that they are still legible.
3) All answers, which are modified or filled in afresh by the editor, have to be
indicated.
4) All completed schedules should have the signature of the editor and the date.

For checking the quality of data collected, it is advisable to take a small sample
of the questionnaire and examine them thoroughly. This helps in understanding
the following types of problems: (1) whether all the questions are answered, (2)
whether the answers are properly recorded, (3) whether there is any bias, (4)
whether there is any interviewer dishonesty, (5) whether there are
inconsistencies. At times, it may be worthwhile to group the same set of
questionnaires according to the investigators (whether any particular investigator
has specific problems) or according to geographical regions (whether any
particular region has specific problems) or according to the sex or background
of the investigators, and corrective actions may be taken if any problem is
observed.

Before tabulation of data it may be good to prepare an operation manual to
decide the process for identifying inconsistencies and errors and also the
methods to edit and correct them. The following broad rules may be helpful.

Incorrect answers: It is quite common to get incorrect answers to many of
the questions. A person with a thorough knowledge will be able to notice them.
For example, against the question “Which brand of biscuits do you purchase?”
the answer may be “We purchase biscuits from ABC Stores”. Now, this
questionnaire can be corrected if ABC Stores stocks only one type of biscuits,
otherwise not. Answer to the question “How many days did you go for
shopping in the last week?” would be a number between 0 and 7. A number
beyond this range indicates a mistake, and such a mistake cannot be corrected.
The general rule is that changes may be made if one is absolutely sure,
otherwise this question should not be used. Usually a schedule has a number of
questions and although answers to a few questions are incorrect, it is advisable
to use the other correct information from the schedule rather than discarding
the schedule entirely.

Inconsistent answers: When there are inconsistencies in the answers or when
there are incomplete or missing answers, the questionnaire should not be used.
Suppose that in a survey, per capita expenditure on various items are reported
as follows: Food – Rs. 700, Clothing – Rs.300, Fuel and Light – Rs. 200, other
items – Rs. 550 and Total – Rs. 1600. The answers are obviously inconsistent
as the total of individual items of expenditure is exceeding the total expenditure.

Modified answers: Sometimes it may be necessary to modify or qualify the
answers. They have to be indicated for reference and checking.
Numerical answers to be converted to same units: Against the question “What
is the plinth area of your house?” answers could be either in square feet or in
square metres. It will be convenient to convert all the answers to these
questions in the same unit, square metre for example.
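The edit checks described above (range checks, consistency checks, and conversion to a common unit) can be sketched as follows. This is an illustrative sketch; the field names, the 0-7 range, and the square-feet conversion factor are assumptions for the example, not part of the original text.

```python
# Illustrative sketch of the edit checks described above. Field names,
# limits and the conversion factor are hypothetical examples.

SQFT_PER_SQM = 10.764  # approximate conversion, used for illustration only

def edit_record(record):
    """Return a list of problems found in one questionnaire record."""
    problems = []
    # Range check: shopping days in the last week must lie between 0 and 7.
    if not 0 <= record["shopping_days"] <= 7:
        problems.append("shopping_days out of range 0-7")
    # Consistency check: item-wise expenditure must not exceed the total.
    items = (record["food"] + record["clothing"]
             + record["fuel"] + record["other"])
    if items > record["total_expenditure"]:
        problems.append("item expenditures exceed total")
    # Unit conversion: express plinth area in square metres throughout.
    if record["area_unit"] == "sqft":
        record["plinth_area"] = record["plinth_area"] / SQFT_PER_SQM
        record["area_unit"] = "sqm"
    return problems

rec = {"shopping_days": 9, "food": 700, "clothing": 300, "fuel": 200,
       "other": 550, "total_expenditure": 1600,
       "plinth_area": 1076.4, "area_unit": "sqft"}
print(edit_record(rec))  # both the range and the consistency checks fail here
```

The expenditure figures reuse the inconsistent example from the text (700 + 300 + 200 + 550 = 1750 > 1600), so the record is flagged rather than silently corrected.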

6.3 CODING OF DATA


Coding refers to the process by which data are categorized into groups and
numerals or other symbols or both are assigned to each item depending on the
class it falls in. Hence, coding involves: (i) deciding the categories to be used,
and (ii) assigning individual codes to them. In general, coding reduces the huge
amount of information collected into a form that is amenable to analysis.

A careful study of the answers is the starting point of coding. Next, a coding
frame is to be developed by listing the answers and by assigning the codes to
them. A coding manual is to be prepared with the details of variable names,
codes and instructions. Normally, the coding manual should be prepared before
collection of data, except for open-ended and partially coded questions. These two
categories are to be taken care of after the data collection. The following are
the broad general rules for coding:

1) Each respondent should be given a code number (an identification number).

2) Each qualitative question should have codes. Quantitative variables may or may
not be coded depending on the purpose. Monthly income should not be coded if
one of the objectives is to compute average monthly income. But if it is used as
a classificatory variable it may be coded to indicate poor, middle or upper
income group.

3) All responses including “don’t know”, “no opinion” “no response” etc., are to
be coded.

Sometimes it is not possible to anticipate all the responses and some questions
are not coded before collection of data. Responses of all the questions are to
be studied carefully and codes are to be decided by examining the essence of
the answers. In partially coded questions, usually there is an option “Any Other
(specify)”. Depending on the purpose, responses to this question may be
examined and additional codes may be assigned.
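The coding rules above can be sketched as follows; the identification numbers, response codes, and income cut-offs below are hypothetical examples rather than prescribed values.

```python
# Illustrative sketch of a coding frame: each respondent gets an
# identification number, each qualitative answer is mapped to a numeric
# code (including "don't know" and "no response"), and monthly income is
# coded into groups when used as a classificatory variable. All codes
# and cut-offs are hypothetical.

RESPONSE_CODES = {"yes": 1, "no": 2, "don't know": 8, "no response": 9}

def income_group(income):
    """Code monthly income into a classificatory group."""
    if income < 5000:
        return 1   # lower income group
    elif income < 15000:
        return 2   # middle income group
    return 3       # upper income group

def code_record(respondent_id, answer, income):
    return {
        "id": respondent_id,
        "answer_code": RESPONSE_CODES[answer.lower()],
        "income_group": income_group(income),
    }

print(code_record(101, "Yes", 12000))  # coded record for one respondent
```

Note that raw income would be kept uncoded if the objective were to compute an average; the grouping applies only when income serves as a classificatory variable, as the text explains.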

Self Assessment Exercise A

1) How would you edit the research data?


..................................................................................................................
..................................................................................................................
..................................................................................................................

2) What do you mean by coding?


..................................................................................................................
..................................................................................................................
..................................................................................................................
6.4 CLASSIFICATION OF DATA
Once the data is collected and edited, the next step towards further processing
the data is classification. In most research studies, voluminous data collected
through various methods needs to be reduced into homogeneous groups for
meaningful analysis. This necessitates classification of data, which in simple
terms is the process of dividing data into different groups or classes according
to their similarities and dissimilarities. The groups should be homogeneous within
and heterogeneous between themselves. Classification condenses huge amount
of data and helps in understanding the important underlying features. It enables
us to make comparison, draw inferences, locate facts and also helps in bringing
out relationships, so as to draw meaningful conclusions. In fact classification of
data provides a basis for tabulation and analysis of data.

6.4.1 Types of Classification

Data may be classified according to one or more external characteristics or one
or more internal characteristics or both. Let us study these kinds with the help
of illustrations.

6.4.1.1 Classification According to External Characteristics

In this classification, data may be classified according to area or region
(Geographical) and according to occurrences (Chronological).

Geographical: In this type of classification, data are organized in terms of
geographical area or region. State-wise production of manufactured goods is an
example of this type. Data collected from an all-India market survey may be
classified geographically. Usually the regions are arranged alphabetically or
according to the size to indicate the importance.

Chronological: When data is arranged according to time of occurrence, it is
called chronological classification. Profit of engineering industries over the last
few years is an example. We may note that it is possible to have chronological
classification within geographical classification and vice versa. For example, a
large-scale all-India market survey spread over a number of years.

6.4.1.2 Classification According to Internal Characteristics

Data may be classified according to attributes (qualitative characteristics which
are not capable of being described numerically) and according to the magnitude
of variables (quantitative characteristics which are numerically described).

Classification according to attributes: In this classification, data are
classified by descriptive characteristics like sex, caste, occupation, place of
residence, etc. This is done in two ways – simple classification and manifold
classification. In simple classification (also called classification according to
dichotomy), data is simply grouped according to presence or absence of a single
characteristic – male or female, employed or unemployed, rural or urban, etc.
In manifold classification (also known as multiple classification), data is
classified according to more than one characteristic. First, the data may be
divided into two groups according to one attribute (employed and unemployed,
say). Then using the remaining attributes, data is sub-grouped again (male and

female based on sex). This may go on based on other attributes, like married
and unmarried, rural and urban, and so on. The following table is an example of
manifold classification.

Population

Employed Unemployed

Male Female Male Female
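The simple and manifold groupings described above can be sketched in a few lines of code. The records and attribute values below are invented for illustration; only the grouping logic is taken from the text.

```python
from collections import Counter

# Hypothetical records: (employment status, sex) for each person.
people = [
    ("employed", "male"), ("employed", "female"), ("unemployed", "male"),
    ("employed", "male"), ("unemployed", "female"), ("employed", "female"),
]

# Simple (dichotomous) classification: grouped by a single attribute.
by_status = Counter(status for status, sex in people)

# Manifold classification: each status group sub-divided by sex.
by_status_and_sex = Counter(people)

print(by_status)
print(by_status_and_sex)
```

Adding a third attribute (married/unmarried, say) would simply extend the tuple used as the counting key.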

Classification according to magnitude of the variable: This classification


refers to the classification of data according to some characteristic that can be
measured. In this classification, there are two aspects: one is the variable (age,
weight, income, etc.); the other is the frequency (the number of observations that
can be put into a class). Quantitative variables may, generally, be divided into two
groups - discrete and continuous. A discrete variable is one which can take
only isolated (exact) values; it does not carry any fractional value. The
examples are number of children in a household, number of departments in an
organization, number of workers in a factory etc. The variables that take any
numerical value within a specified range are called continuous variables. The
examples of continuous variables are the height of a person, profit/loss of
companies, etc. One point may be noted: in practice, even continuous
variables are measured only up to some degree of precision, and so they too
essentially become discrete.

The following are two examples of discrete and continuous frequency


distribution placed side by side.

a) Discrete frequency distribution b) Continuous frequency distribution

No. of children No. of families Income (Rs.) No. of families

0 12 1,000-2,000 6

1 25 2,000-3,000 10

2 20 3,000-4,000 15

3 7 4,000-5,000 25

4 3 5,000-6,000 9

5 1 6,000-7,000 4
Total 68 Total 69

6.4.1.3 Preparation of Frequency Distribution

When raw data is arranged in conveniently organized groups, it is called a


frequency distribution. The number of data points in a particular group is called
the frequency. When a discrete variable takes a small number of values (not more
than 8 or 10, say), each of the observed values is counted to form the discrete
frequency distribution. In order to facilitate counting, prepare a column of
“tallies”. The following example illustrates it.

Illustration 1: A survey of 50 college students was conducted to know how
many times a week they go to the theatre to see movies. The following
data were obtained:

3 2 2 1 4 1 0 1 1 2 4 1 3 3 2 1 3 4 3 2 0 1 3 4 3

1 4 3 2 2 1 3 1 2 3 2 3 4 4 2 4 3 4 2 3 3 2 0 4 3

To have a discrete frequency table, we may take the help of ‘Tally’ marks as
indicated below.

Table 6.1: Frequency Distribution of Number of Movies Seen by 50


College Students in a Week

Number of Times    Tally Marks           Frequency

0                  |||                        3

1                  |||| ||||                 10

2                  |||| |||| ||              12

3                  |||| |||| ||||            15

4                  |||| ||||                 10

Total                                        50

(Each “||||” denotes a completed group of five tallies.)

From the above frequency table it is clear that more than half the students (27
out of 50) go to the theatre twice or thrice a week, and very few do not go
even once a week. This was not so obvious from the raw data.
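The tally-and-count step can also be done directly in code. The sketch below uses only the Python standard library on the raw data of Illustration 1:

```python
from collections import Counter

# Raw data of Illustration 1: weekly theatre visits of 50 students.
visits = [3, 2, 2, 1, 4, 1, 0, 1, 1, 2, 4, 1, 3, 3, 2, 1, 3, 4, 3, 2, 0, 1, 3, 4, 3,
          1, 4, 3, 2, 2, 1, 3, 1, 2, 3, 2, 3, 4, 4, 2, 4, 3, 4, 2, 3, 3, 2, 0, 4, 3]

freq = Counter(visits)                 # value -> frequency
for value in sorted(freq):
    print(value, freq[value])
print("Total:", sum(freq.values()))
```

Counting this way makes the check `sum(freq.values()) == 50` automatic, which guards against tallying slips.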

It is possible to prepare frequency distribution for qualitative variables also. For


example, one may construct a frequency distribution of brands of 100 cars or
blood groups of 50 patients in a hospital.

Construction of a Continuous Frequency Distribution: In continuous


frequency distribution, the data is grouped into a small number of intervals
instead of individual values of the variables. These groups are called classes.
There are two different ways in which limits of classes may be arranged -
exclusive and inclusive method. In the exclusive method, the class intervals
are so arranged that the upper limit of one class is the lower limit of the next
class, whereas in the inclusive method, the upper limit of a class is included
in the class itself. As an example, the same frequency distribution is shown
below using both the exclusive method and the inclusive method.


Illustration 2

Table 6.2 : Frequency Distribution of Daily Wages of 80 Labourers.


Exclusive method Inclusive method
Daily wages of No. of Labourers Daily wages of No. of Labourers
Labourers (Rs.) Labourers (Rs.)
20-30 2 20-29.99 2
30-40 15 30-39.99 15
40-50 21 40-49.99 21
50-60 29 50-59.99 29
60-70 13 60-69.99 13
Total 80 Total 80
In the exclusive method, the upper class limit of the first class is the same as
the lower limit of the second class. A labourer with a daily wage of exactly
Rs. 30 will be included in the second class. Thus, a class interval 20–30 means
“20 and above but below 30”. This is the exclusive method and the upper limit
is always excluded.
In case of the inclusive method, the upper limits of the classes are not the same as
the lower limits of their next classes. Thus, the class interval 20-29.99 means “20
and above, and 29.99 and below”. It is to be noted that both the methods give
the same class frequencies, although the construction of classes look different.
For computation of positional values such as median, mode etc., it is necessary
to convert the inclusive classes into exclusive form. This can be done with the
help of the following formula:

Correction Factor = (Lower limit of the succeeding class − Upper limit of the class) / 2
The result so obtained is deducted from all lower limits and added to all upper
limits. For instance, in the above example, table 6.2, the correction factor is
(30-29.99)/2 = 0.005. Deduct this value from the lower limit and add to the
upper limit of each class. You will obtain the exclusive form of classes as
19.995-29.995; 29.995-39.995 and so on.
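The correction-factor conversion can be sketched in code. The inclusive wage classes below are those of Table 6.2:

```python
# Inclusive classes from Table 6.2 and their conversion to exclusive form.
inclusive = [(20, 29.99), (30, 39.99), (40, 49.99), (50, 59.99), (60, 69.99)]

# Correction factor: half the gap between a class's upper limit and the
# next class's lower limit, here (30 - 29.99) / 2 = 0.005.
correction = (inclusive[1][0] - inclusive[0][1]) / 2

# Deduct from every lower limit and add to every upper limit.
exclusive = [(round(lo - correction, 3), round(hi + correction, 3))
             for lo, hi in inclusive]
print(exclusive)
```

The `round(..., 3)` calls merely tidy up floating-point representation noise; the arithmetic itself is exactly the formula above.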

Steps to construct frequency distribution: The following broad guidelines


may be followed for construction of a frequency distribution.

1) The highest and the lowest values of the observations are to be identified and
the lower limit of the first class and upper limit of the last class may be decided.
2) The number of classes is to be decided. There is no hard and fast rule. There should
not be too few (fewer than 5, say), to avoid high information loss, and not
too many (more than 12, say), so that the table does not become unmanageable.
3) The lower and the upper limits should be convenient numerals like 0-5, 0-10,
100-200 etc.
4) The class intervals should also be numerically convenient, like 5, 10, 20 etc., and
values like 3, 9, 17 etc., should be avoided.
5) As far as possible, the class width may be made uniform for ease in subsequent
calculation.
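Once classes are chosen following these guidelines, the actual counting is mechanical. A minimal sketch with made-up wage data and convenient classes 20-30, 30-40, ..., 60-70 (exclusive method, so a value equal to an upper limit falls in the next class):

```python
# Made-up daily wage data for illustration.
wages = [23, 35, 47, 55, 41, 68, 32, 59, 44, 51, 28, 62, 39, 50, 46]

lower, width, n_classes = 20, 10, 5          # classes 20-30, 30-40, ..., 60-70
freq = [0] * n_classes
for w in wages:
    idx = (w - lower) // width               # which class the value falls into
    if 0 <= idx < n_classes:
        freq[idx] += 1

for i, f in enumerate(freq):
    print(f"{lower + i * width}-{lower + (i + 1) * width}: {f}")
```

Integer division places a value of exactly 50 into the 50-60 class, matching the exclusive-method convention described earlier.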

It is often quite useful to present the frequency distribution in two different
ways. One way is the relative or percentage relative frequency distribution.
Relative frequencies are computed by dividing the frequency of each class
by the total frequency; multiplying the relative frequencies by 100 gives
percentages. Another way is the cumulative frequency distribution, in which
frequencies are cumulated to give the total of all previous frequencies including
that of the present class. Cumulation may be done either from the lowest class
(from below) or from the highest class (from above). The following table
illustrates this concept.
Illustration 3

Table 6.3: Construction of Relative Frequency Distribution for the Data


on Daily Wages of 70 Labourers

Class Frequency Relative Relative Cumulative Cumulative


Interval Frequency Frequency Frequency Frequency
(as Percentage) (less than) (more than)
(1) (2) (3) (4) (5) (6)
15-20 2 0.0286 2.86 2 70

20-25 23 0.3286 32.86 25 68

25-30 19 0.2714 27.14 44 45

30-35 14 0.2000 20.00 58 26

35-40 5 0.0714 7.14 63 12

40-45 4 0.0571 5.71 67 7

45-50 3 0.0429 4.29 70 3

Total 70 1.0000 100.00

One advantage of using relative frequency distribution is that it helps in


comparing two frequency distributions (with same class intervals). It is also the
basis of empirical probability. The topic of Probability is the subject matter of
Unit 13 and Unit 14 of this course.

Column (5) in the above table gives the cumulative frequency of each class,
obtained as discussed earlier. The cumulative frequency of the second
class is obtained by adding its class frequency (23) to the previous class
frequency (2). The cumulative frequency of the next class is obtained by adding
its class frequency (19) to the cumulative frequency of the previous class (25).
Cumulative frequencies may be interpreted as the number of observations below
the upper class limit of a class. For example, the cumulative frequency of 44 in
the third class (25-30) indicates that 44 labourers received a daily wage of less
than Rs. 30. Cumulation from the highest class may also be done, as shown in
column (6). It has a similar interpretation.
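Both cumulations can be computed mechanically from the class frequencies. A sketch using the data of Table 6.3:

```python
# Class frequencies of Illustration 3 (classes 15-20, 20-25, ..., 45-50).
freq = [2, 23, 19, 14, 5, 4, 3]
total = sum(freq)                                   # 70

relative = [f / total for f in freq]                # relative frequencies
percent = [100 * r for r in relative]               # percentage form

cum_less, running = [], 0                           # "less than" cumulation
for f in freq:
    running += f
    cum_less.append(running)

# "more than" cumulation: observations at or above each class's lower limit.
cum_more = [total - c + f for c, f in zip(cum_less, freq)]

print(cum_less)
print(cum_more)
```

Running this reproduces columns (5) and (6) of Table 6.3, and `sum(relative)` comes out to 1 as it must.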

At times relative frequencies are also cumulated to obtain the cumulative relative
frequency distribution. These cumulative frequencies are useful for researchers in
two ways. Firstly, some simple graphs can be drawn to show all the frequency
distributions; this is done in the next unit (Unit 7). Secondly, frequency
distribution methods are also used for discrete data, if the number of
observations is large and the spread is wide.

Bivariate frequency distribution: When data is collected on two variables,
one may construct two frequency distributions separately. But it is also possible
to construct a two-way frequency distribution table. Here, class intervals based
on the values of one variable are placed in the rows and class intervals based on
the values of the other variable are placed in the columns. The following example
illustrates this.

Illustration 4

Table 6.4: Bivariate Frequency Distribution of Sales and Profit of 200


Companies

Sl.   Sales          Profit (Rs. in thousands)
No.   (Rs. in
      lakhs)     Upto 10   10-20   20-50   50-100   100-200   200 and more   Total
(1) (2) (3) (4) (5) (6) (7) (8) (9)
1 Upto 1 10 3 13

2 1-2 12 12 19 43

3 2-5 11 15 20 10 8 64
4 5-10 2 8 15 5 10 40
5 10-20 2 12 4 9 6 33
6 20 and 2 1 2 2 7
more
7 Total 35 40 68 20 29 8 200

The above bivariate frequency table is prepared on the basis of the sales and profit
data of 200 companies. As discussed earlier, class limits for both Sales and
Profit are decided first. Tally marks are placed in the appropriate row and column
(not shown here). Suppose a company’s Sales and Profit figures are Rs. 2.5
lakhs and Rs. 49,000 respectively. It is placed in class 3 of Sales (2 to 5 lakhs)
and column (5), showing the profit class interval of 20 to 50 thousand. The last
column (column 9) gives the total over all class intervals of Profit; hence it
gives the frequency distribution of Sales, known as the Marginal Frequency
Distribution of Sales. Similarly, the figures in Serial No. 7 (Row 7) are obtained
by summing over all the class intervals of Sales. This is the frequency distribution
of Profit, or the Marginal Frequencies of Profit. The entire table is also known as
the Joint Frequency Distribution of Sales and Profit.
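The same classify-then-count procedure extends to two variables. The sketch below uses a handful of made-up (sales, profit) records with the class limits of Table 6.4; the records themselves are invented for illustration.

```python
from collections import Counter

# Hypothetical records: (sales in Rs. lakhs, profit in Rs. thousands).
records = [(2.5, 49), (0.8, 8), (12.0, 150), (3.1, 25), (7.0, 60), (1.5, 15)]

def sales_class(s):                     # row class index, per Table 6.4 limits
    for i, hi in enumerate([1, 2, 5, 10, 20]):
        if s < hi:
            return i
    return 5                            # "20 and more"

def profit_class(p):                    # column class index, per Table 6.4 limits
    for j, hi in enumerate([10, 20, 50, 100, 200]):
        if p < hi:
            return j
    return 5                            # "200 and more"

# Joint frequency distribution and the two marginal distributions.
joint = Counter((sales_class(s), profit_class(p)) for s, p in records)
row_marginal = Counter(r for r, c in joint.elements())
col_marginal = Counter(c for r, c in joint.elements())
print(joint, row_marginal, col_marginal)
```

The company with Sales Rs. 2.5 lakhs and Profit Rs. 49 thousand lands in cell (2, 2), exactly as traced in the text above.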


Self Assessment Exercise B
The following table gives values of production and values of raw materials
used in 60 industrial units. (i) Prepare two individual frequency distributions for
the variables. (ii) Prepare a bivariate frequency table. For value of production
you may take 8000-9000, 9000-10000, ……, 13000-14000 as classes, and
for value of raw material you may take 2500-3000, 3000-3500, ……, 5000-5500
as class intervals.

Table : Values of Production and Values of Raw Material Used by 60


Industrial Units (Rs. in lakhs)

Sl. Value of Value of Sl. Value of Value of


No. Production Raw Material No. Production Raw Material
(1) (2) (3) (1) (2) (3)
1 8952.69 2915.30 31 10394.05 3392.49
2 10147.82 3497.26 32 10751.31 3983.16
3 9938.00 3652.91 33 8685.23 3513.37
4 10278.61 3851.22 34 9393.46 3448.06
5 10225.37 3624.36 35 9352.66 3495.73
6 10324.95 3702.34 36 9405.84 3503.47
7 10921.96 3794.27 37 9692.46 3286.88
8 10885.23 4296.33 38 8783.27 3188.69
9 13324.16 5446.39 39 8963.49 3153.59
10 12154.16 3939.36 40 8956.34 3229.62
11 10835.82 3697.95 41 10920.31 3958.89
12 10864.50 3885.91 42 10094.23 3604.95
13 10698.37 3943.97 43 12038.21 4387.66
14 11136.22 4017.36 44 11199.94 4143.76
15 10644.51 3669.89 45 11522.36 3823.97
16 10070.44 3586.27 46 10862.59 3888.08
17 10857.33 3772.49 47 11797.08 3852.58
18 11561.56 4292.22 48 11762.56 3758.17
19 10544.07 4128.06 49 9687.11 3309.81
20 10163.53 3570.83 50 10905.26 3612.14
21 9580.77 3615.61 51 9806.59 4354.83
22 10493.76 3730.19 52 11614.45 2675.32
23 10454.36 3953.90 53 8260.93 3722.35
24 11026.60 3893.18 54 8498.49 3682.07
25 12808.22 4660.28 55 8796.21 3599.29
26 11681.71 4100.16 56 10372.99 3967.33
27 10631.05 3543.68 57 10714.01 4108.27
28 9441.43 3568.53 58 10212.08 3945.38
29 10311.69 3640.68 59 12817.07 3443.86
30 9152.95 3400.50 60 11865.77 3605.48


..................................................................................................................

..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................

6.5 TABULATION OF DATA


Presentation of collected data in the tabular form is one of the techniques of
data presentation. The two other techniques are diagrammatic and graphic
presentation, which will be discussed in Unit 7 of this course. Arranging the
data in an orderly manner in rows and columns is called tabulation of data.
Sometimes data collected by survey or even from publications of official bodies
are so numerous that it is difficult to understand the important features of the
data. Therefore it becomes necessary to summarize data through tabulation to
an easily intelligible form. It may be noted that there may be loss of some
minor information in certain cases, but the essential underlying features come
out more clearly. Quite frequently, data presented in tabular form is much easier
to read and understand than the data presented in the text.

In classification, as discussed in the previous section, the data is divided on the


basis of similarity and resemblance, whereas tabulation is the process of
recording the classified facts in rows and columns. Therefore, after classifying
the data into various classes, they should be shown in the tabular form.

6.5.1 Types of Tables

Tables may be classified, depending upon the use and objectives of the data to
be presented, into simple tables and complex tables. Let us discuss them along
with illustrations.

Simple Table: In this case data are presented only for one variable or
characteristic. Therefore, this type of table is also known as a one-way table.
The table showing the data relating to the sales of a company in different years
is an example of a simple table.


Look at the following tables for an example of this type of table.
Illustration 5
Table 6.5 : Population of India During 1961–2001 (In thousands)
Census Year Population
1961 439235
1971 548160
1981 683329
1991 846303
2001 1027015
Source: Census of India, various documents.

Any frequency distribution of a single variable is a simple table.

Table 6.6 : Frequency Distribution of Daily Wages of 65 Labourers

Daily Wages of No. of


Laboures (Rs.) Labourers

20-30 2
30-40 5
40-50 21
50-60 19
60-70 11
70-80 5
80-90 2
Total 65

A simple table may be prepared for descriptive or qualitative data also. The
following example illustrates it.

Table 6.7 : Education Levels of 40 Labourers

Education Level No. of Persons

Illiterate 22
Literate but below
primary 10
Primary 5
High School 2
College and above 1
All 40


Complex Table: A complex table may contain data pertaining to more than
one characteristic. The population data given below is an example.
Illustration 6
Table 6.8 : Rural and Urban Population of India During 1961–2001
(In thousands)

Population
Census Year Rural Urban Total
1961 360298 78937 439235
1971 439046 109114 548160
1981 523867 159463 683329
1991 628691 217611 846303
2001 741660 285355 1027015
Note: The total may not add up exactly due to rounding off error.
Source: Census of India, various documents.
In the above example, rural and urban population may be subdivided into males
and females as indicated below.

Table 6.9 : Rural and Urban Population of India During 1961–2001 (sex-wise)
(In thousands)

Population
Census Year Rural Urban Total
Male Female Male Female Male Female
(1) (2) (3) (4) (5) (6) (7)

In each of the above categories, the persons could be grouped into child and
adult, worker and non-worker, or according to different age groups and so on.
A particular type of complex table that is of great use in research is a cross-
table, where the table is prepared based on the values of two or more
variables. The bivariate frequency table used earlier (illustration 4) is reproduced
here for illustration.
Illustration 7
Table 6.10 : Sales and Profit of 200 Companies
Sl.   Sales          Profit (Rupees in thousands)
No.   (Rupees in
      lakhs)     Upto 10   10-20   20-50   50-100   100-200   200 and more   Total
(1) (2) (3) (4) (5) (6) (7) (8) (9)
1 Up to 1 10 3 13
2 1-2 12 12 19 43
3 2-5 11 15 20 10 8 64
4 5-10 2 8 15 5 10 40
5 10-20 2 12 4 9 6 33
6 20 and 2 1 2 2 7
more
7 Total 35 40 68 20 29 8 200

From a bivariate table, one may get some idea about the interrelationship between
two variables. Suppose that all the frequencies are concentrated in the diagonal
cells; then there is likely to be a strong relationship. The relationship is positive
if the concentration runs from the top-left corner to the bottom-right corner, and
negative if it runs from the bottom-left corner to the top-right corner. If the
frequencies are more or less equally distributed over all the cells, then probably
there is no strong relationship.
Multivariate tables may also be constructed, but interpretation becomes difficult
once we go beyond two variables.
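The diagonal-concentration idea can be made concrete with a quick check: what share of all observations lies on the main diagonal of a square cross-table? The 3×3 table below is invented for illustration.

```python
# Made-up square cross-table of frequencies.
table = [
    [10,  2,  1],
    [ 3, 12,  2],
    [ 1,  2,  9],
]

total = sum(sum(row) for row in table)              # all observations
diagonal = sum(table[i][i] for i in range(len(table)))
share = diagonal / total                            # diagonal concentration

print(f"{share:.0%} of observations lie on the main diagonal")
```

A share close to 1 suggests a strong positive relationship; a share near what equal spreading would give suggests little relationship. This is only a rough screen, not a formal measure of association.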

So far we have discussed and learnt about the types of tables and their
usefulness in presentation of data. Now, let us proceed to learn about the
different parts of a table, which enable us to have a clear understanding of the
rules and practices followed in the construction of a table.

6.5.2 Parts of A Statistical Table

A table should have the following four essential parts - title, caption or box
head (column), stub (row heading) and main data. At times it may also contain
an end note and source note below the table. The table should have a title,
which is usually placed above the statistical table. The title should be clearly
worded to give some idea of the table’s contents. Usually a report has many
tables. Hence the tables should be numbered to facilitate reference.

Caption refers to the title of the columns. It is also termed the “box head”.
There may be sub-captions under the main caption. Stub refers to the titles
given to the rows.

Caption and stub should also be unambiguous. To the extent possible


abbreviations should not be used in either the caption or the stub. But if they are
used, the expansion must be given in an end note below the table. Notes pertaining
to stub entries or box headings may be numbered. But, to avoid confusion, it is
better to use symbols (like *, **, @ etc.) or alphabets for notes referring to the
entries in the main body. If the table is based on outside information, it should
be mentioned in the source note below. This note should be complete with
author, title, year of publication etc to enable the reader to go to the original
source for crosschecking or for obtaining additional information. Columns and
rows may be numbered for easy reference.

Some of these features are illustrated below with reference to the table on
Rural and Urban Population during 1961-2001, which was presented in earlier
illustration-6, Table 6.8.

1. Title of the Table:      Rural and Urban Population of India during
                            1961–2001 (in thousands)

2. Caption or Box Head:     Population
                            Rural    Urban    Total

3. Stub (Row Heading):      Census Year
                            1961
                            1971
                            1981
                            1991
                            2001

4. Body (Main Data):        360298     78937    439235
                            439046    109114    548160
                            523867    159463    683329
                            628691    217611    846303
                            741660    285355   1027015

5. End Note:                Note: The total may not add up exactly due to
                            rounding-off error.

6. Source Note:             Source: Census of India, various documents.

Column Numbers:             (1)  (2)  (3)  (4)  (5)
Row Numbers:                1, 2, 3, 4, 5

The boxes above are self-explanatory.


Arrangement of items in stub and box-head
There is no hard and fast rule about the arrangement of column and row
headings in a table. It depends on the nature of data and type of analysis. A
number of different methods are used - alphabetical, geographical, chronological/
historical, magnitude-based and customary or conventional.

Alphabetical: This method is suitable for general tables as it is easy to locate


an item if it is arranged alphabetically. For example, population census data of
India may be arranged in the alphabetical order of states/union territories.

Geographical: It can be used when the reader is familiar with the usual
geographical classification.

Chronological: A table containing data over a period of time may be


presented in the chronological order. Population data (1961 to 2001) presented
earlier (Tables 6.5 and 6.8) are in chronological order. One may either start
from the most recent year or the earliest year. However, there is a convention
to start with the month of January whenever year and month data are
presented.

Based on Magnitude: At times, items in a table are arranged according to
the value of the characteristic. Usually the largest item is placed first and other
items follow in decreasing order. But this may be reversed also. Suppose that
state-wise population data is arranged in order of decreasing magnitude. This
will highlight the most populous state and the least populous state.

Customary or Conventional: Traditionally some order is followed in certain


cases. While presenting population census data, usually ‘rural’ comes before
‘urban’ and ‘male’ first and ‘female’ next. At times, conventional geographical
order is also followed.

One point may be noted. The above arrangements are not exclusive. In a big
table, it is always possible and sometimes convenient to arrange the items
following two or three methods together. For example, it is possible to construct
a table in chronological order and within it in geographical order. Sometimes
information of the same table may be rearranged to produce another table to
highlight certain aspects. This will be clear from the following specimen tables.

Table A

Sl.No. Census Year Rural Urban


Male Female Male Female
(1) (2) (3) (4) (5) (6)
1
2

Table B

Sl.No. Census Year Male Female


Rural Urban Rural Urban
(1) (2) (3) (4) (5) (6)
1
2

Tables A and B contain the same information. Table A compares male-female


differences for rural and urban areas whereas Table B highlights rural-urban
contrasts for both the sexes.

Tables are prepared for making data easy to understand for the reader. It
should not be very large as the focus may be lost. A large table may be
logically broken into two or more small tables.

6.5.3 Requisites of a Good Statistical Table

After having an understanding of the parts of a statistical table, now let us


discuss the features of an ideal statistical table. Besides the rules relating to the
parts of the table, certain guidelines are very helpful in its preparation. They are
as follows:

1) A good table must present the data in as clear and simple a manner as possible.
2) The title should be brief and self-explanatory. It should represent the
description of the contents of the table.
3) Rows and Columns may be numbered to facilitate easy reference.
4) Table should not be too narrow or too wide. The space of columns and
rows should be carefully planned, so as to avoid unnecessary gaps.
5) Columns and rows which are directly comparable with one another should
be placed side by side.
6) Units of measurement should be clearly shown.
7) All the column figures should be properly aligned. Decimal points and plus
or minus signs also should be in perfect alignment.
8) Abbreviations should be avoided in a table. If their use is unavoidable, their
meanings must be clearly explained in a footnote.
9) If necessary, the derived data (percentages, indices, ratios, etc.) may also
be incorporated in the tables.
10) The sources of the data should be clearly stated so that the reliability of
the data could be verified, if needed.

Self Assessment Exercise C


The following data were obtained from 50 unskilled workers in a factory in
Faridabad. Prepare three simple tables based on caste, education and place of
origin, and a complex table considering all the factors.

S. No Caste Education Place of Origin


1 OC LITERATE BUT BELOW PRIMARY URBAN
2 OC LITERATE BUT BELOW PRIMARY URBAN

3 OC PRIMARY RURAL

4 BC ILLITERATE RURAL

5 SC LITERATE BUT BELOW PRIMARY URBAN

6 SC PRIMARY RURAL

7 ST ILLITERATE RURAL

8 OC HIGH SCHOOL RURAL

9 SC HIGH SCHOOL RURAL

10 ST ILLITERATE RURAL

11 OC LITERATE BUT BELOW PRIMARY RURAL

12 OC PRIMARY RURAL

13 OC HIGH SCHOOL RURAL

14 BC ILLITERATE RURAL

15 SC PRIMARY RURAL

16 SC PRIMARY RURAL

17 OC HIGH SCHOOL RURAL

18 OC HIGH SCHOOL RURAL

19 SC LITERATE BUT BELOW PRIMARY RURAL

20 ST PRIMARY URBAN

21 SC PRIMARY RURAL

22 SC HIGH SCHOOL RURAL

23 ST PRIMARY RURAL

24 OC LITERATE BUT BELOW PRIMARY RURAL

25 OC PRIMARY RURAL

26 OC HIGH SCHOOL URBAN

27 OC PRIMARY URBAN

28 OC PRIMARY URBAN
29 BC ILLITERATE RURAL
30 SC ILLITERATE RURAL
31 SC ILLITERATE URBAN
32 ST ILLITERATE URBAN
33 OC PRIMARY URBAN
34 OC PRIMARY RURAL
35 BC ILLITERATE URBAN
36 OC HIGH SCHOOL URBAN
37 OC HIGH SCHOOL URBAN
38 OC ILLITERATE RURAL
39 BC ILLITERATE URBAN
40 OC PRIMARY RURAL
41 BC LITERATE BUT BELOW PRIMARY RURAL
42 BC ILLITERATE URBAN
43 OC ILLITERATE RURAL
44 BC PRIMARY RURAL
45 BC LITERATE BUT BELOW PRIMARY RURAL
46 SC PRIMARY RURAL
47 SC ILLITERATE URBAN
48 ST PRIMARY RURAL
49 OC PRIMARY RURAL
50 OC LITERATE BUT BELOW PRIMARY RURAL

..........................................................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................

..................................................................................................................
..................................................................................................................
..................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................

..................................................................................................................
..................................................................................................................
..................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................

..................................................................................................................
..................................................................................................................
..................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................

..................................................................................................................
..................................................................................................................
..................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................

..................................................................................................................
6.6 LET US SUM UP
Once data collection is over, the next important steps are editing and coding.
Editing helps in maintaining consistency in quality of data. Editing is the first
stage in data processing. It is the process of examining the data collected to
detect errors and omissions and correct them for further analysis. Coding
makes further computation easier and necessary for efficient analysis of data.
Coding is the process of assigning some symbols to the answers. A coding
frame is developed by listing the answers and by assigning the codes to them.
The next stage is classification. Classification is the process of arranging data in
groups or classes on the basis of some characteristics. It helps in making
comparisons and drawing meaningful conclusions. The classified data may be
summarized by means of tabulations and frequency distributions. Cross
tabulation is particularly useful as it provides some clue about relationship and
its direction between two variables. Frequency distribution and its extensions
provide simple means to summarize data and for comparison of two sets of data.
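The editing–coding–classification sequence summarized above can be sketched in a few lines of Python. The answers and the coding frame below are illustrative assumptions, not data from this unit:

```python
# A minimal sketch of the coding and classification steps: answers are
# coded with a coding frame, then grouped into a frequency distribution.
from collections import Counter

# Coding frame: each possible answer is assigned a numerical code.
coding_frame = {"yes": 1, "no": 2, "undecided": 3}

answers = ["yes", "no", "yes", "undecided", "yes", "no"]
coded = [coding_frame[a] for a in answers]

# Classification: count the observations falling in each class (code),
# giving a simple frequency distribution.
frequency = Counter(coded)

print(sorted(frequency.items()))  # [(1, 3), (2, 2), (3, 1)]
```

In practice the coding frame is developed by listing all answers found in the data before assigning codes, as described in the summary above.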

6.7 KEY WORDS


Attribute : Characteristics that are not capable of being measured.

Caption or Box-head : Column headings of a table.

Class Interval : The difference between the upper and lower limits of a class.

Class Limits : The lowest and the highest values that can be included in the
class.

Coding : A method to categorize data into groups and assign numerical values
or symbols to represent them.

Continuous Variable : A variable that can take values to any degree of


precision.
Cumulative Frequency Distribution : A distribution which shows cumulative
frequencies instead of actual frequencies.
Discrete Variable : A variable that can take only certain values (but not
fractional values).
Editing : The process of examining collected data to detect errors and omissions and to correct them.
Exclusive Class : A class at which the upper limit is excluded from that class
and included as lower limit in next class.
Frequency Distribution : Distribution of frequencies (number of observations)
over different classes of a variable.
Joint Frequency Distribution : Distribution of frequencies (number of
observations) over different classes of two or more variables.
Inclusive Class : A class in which both its lower and upper limits are
included in that class.
Stub : Row headings of a table.

6.8 ANSWERS TO SELF ASSESSMENT EXERCISES
B. Frequency Distribution for Value of Production (Rupees in lakh)

Sl. No. Class Interval Frequency


(1) (2) (3)

1 80–90 8
2 90–100 10
3 100–110 27
4 110–120 10
5 120–130 4
6 130–140 1
7 Total 60

Frequency Distribution for Value of Raw Material (Rupees in lakh)

Sl. No. Class Interval Frequency


(1) (2) (3)

1 25–30 2
2 30–35 11
3 35–40 36
4 40–45 9
5 45–50 1
6 50–55 1
7 Total 60

C. Table : Caste-wise distribution of 50 Unskilled Workers

Sl. No. Caste No of Workers


(1) (2) (3)
1 SC 12
2 ST 6
3 BC 9
4 OC 23
5 All Castes 50

Table : Distribution of 50 Unskilled Workers According to Educational Level

Sl. No. Educational Level No. of Workers


(1) (2) (3)
1 Illiterate 14
2 Literate but below Primary 9
3 Primary 18
4 High School 9
5 All 50

Table : Distribution of 50 Unskilled Workers According to Place of Origin

Sl. No. Place of Origin No. of Workers


(1) (2) (3)

1 Rural 34
2 Urban 16
3 All 50

Complex Table
Table : Distribution of 50 Unskilled Workers by Education, Place of Origin and Caste

Education              Rural                      Urban                Grand
Level          SC ST BC OC Total        SC ST BC OC Total        Total

Illiterate      1  2  3  2    8         2  1  3  0    6           14
Below Primary   1  0  2  3    6         1  0  0  2    3            9
Primary         5  2  1  6   14         0  1  0  3    4           18
High School     2  0  0  4    6         0  0  0  3    3            9
Total           9  4  6 15   34         3  2  3  8   16           50
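A cross tabulation like the complex table above can be built mechanically by counting raw records. The following Python sketch uses three hypothetical worker records (not the actual survey data behind the table) to show the idea:

```python
# Sketch of building a cross tabulation from raw records: each record is
# classified by education, place of origin and caste, and the number of
# observations in each cell of the table is counted.
from collections import Counter

records = [
    {"education": "Illiterate", "place": "Rural", "caste": "SC"},
    {"education": "Illiterate", "place": "Rural", "caste": "ST"},
    {"education": "Primary",    "place": "Urban", "caste": "OC"},
]

# One cell per (education, place, caste) combination.
cells = Counter((r["education"], r["place"], r["caste"]) for r in records)

print(cells[("Illiterate", "Rural", "SC")])  # 1
```

Summing the cells row-wise and column-wise reproduces the marginal totals shown in the table.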

6.9 TERMINAL QUESTIONS/EXERCISES


1) What do you mean by Editing of data? Explain the guidelines to be kept in mind
while editing the statistical data.
2) What is meant by coding? How would you code your research data?
3) “Classification of data provides a basis for tabulation of data.” Comment.
4) Discuss the various methods of classification.
5) Form a frequency distribution for the following data by inclusive method and
exclusive method.
13 18 17 20 22 15 27 14 7 10 10 16
9 6 15 11 19 21 25 23 28 9 25 9
27 11 30 13 14 2 34 18 28 25 28 12
14 15 18 18 16 20 21 24 21 16 22 4

6) Draw a “less than” and “more than” cumulative frequency distribution for the
following data.
Income (Rs.)       500–600   600–700   700–800   800–900   900–1000
No. of families       25        40        65        35         15

7) What is tabulation? Draw the format of a statistical table and indicate its
various parts.
8) Describe the requisites of a good statistical table.
9) Prepare a blank table showing the age, sex and literacy of the population in a
city, according to five age groups from 0 to 100 years.
10) The following figures relate to the number of crimes (nearest-hundred) in four
metropolitan cities in India. In 1961, Bombay recorded the highest number of
crimes i.e. 19,400 followed by Calcutta with 14,200, Delhi 10,000 and Madras
5,700. In the year 1971, there was an increase of 5,700 in Bombay over its
1961 figure. The corresponding increase was 6,400 in Delhi and 1,500 in
Madras. However, the number of these crimes fell to 10,900 in the case of
Calcutta for the corresponding period. In 1981, Bombay recorded a total of
36,300 crimes. In that year, the number of crimes was 7,000 less in Delhi as
compared to Bombay. In Calcutta the number of crimes increased by 3,100 in
1981 as compared to 1971. In the case of Madras the increase in crimes was
by 8,500 in 1981 as compared to 1971. Present this data in tabular form.

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

6.10 FURTHER READING


A number of good text books are available for the topics dealt with in this unit.
The following books may be used for more in depth study.
1) Croxton, F E, D J Cowden and S Klein, 1979. Applied General Statistics,
Prentice Hall of India, New Delhi.
2) Saravanavel, P, 1987. Research Methodology, Kitab Mahal, Allahabad.
3) Spiegel, M R, 1992. Statistics, Schaum’s Outline Series, Mc Graw Hill,
Singapore.

UNIT 7 DIAGRAMMATIC AND GRAPHIC PRESENTATION
STRUCTURE
7.0 Objectives
7.1 Introduction
7.2 Importance of Visual Presentation of Data
7.3 Diagrammatic Presentation
7.3.1 Rules for Preparing Diagrams
7.4 Types of Diagrams
7.5 One Dimensional Bar Diagrams
7.5.1 Simple Bar Diagram
7.5.2 Multiple Bar Diagram
7.5.3 Sub-divided Bar Diagram
7.6 Pie Diagram
7.7 Structure Diagrams
7.7.1 Organisational Charts
7.7.2 Flow Charts
7.8 Graphic Presentation
7.9 Graphs of Time Series
7.9.1 Graphs of One Dependent Variable
7.9.2 Graphs of More Than One Dependent Variable
7.10 Graphs of Frequency Distribution
7.10.1 Histograms and Frequency Polygon
7.10.2 Cumulative Frequency Curves
7.11 Let Us Sum Up
7.12 Key Words
7.13 Answers to Self Assessment Exercises
7.14 Terminal Questions/Exercises
7.15 Further Reading

7.0 OBJECTIVES
After studying this Unit, you should be able to:

l explain the need and significance of visual presentation (diagrams and


graphs) of the data in research work,
l describe various types of diagrams and illustrate how to present the data
through an appropriate diagram,
l describe the principle of preparing a graph,
l present frequency distribution in the form of historigrams, histograms,
frequency polygon and ogives to make decisions, and
l list out and distinguish between the major forms of diagrams and graphs.

7.1 INTRODUCTION
In the previous Unit 6, you have studied the importance and techniques of
editing, coding, classification and tabulation that help to arrange the mass of
data (collected data) in a logical and precise manner. Tabulation is one of the
techniques for presentation of collected data which makes it easier to establish
trend, pattern, comparison etc. However, you might have noticed, it is a difficult
and cumbersome task for a researcher to interpret a table having a large mass
of numerical information. Sometimes it may fail to convey the message
meaningfully to the readers for whom it is meant. To overcome this
inconvenience, diagrammatic and graphic presentation of data has been invented
to supplement and explain the tables. Practically every day we can find the
presentation of cricket score, stock market index, cost of living index etc., in
news papers, television, magazines, reports etc. in the form of diagrams and
graphs. This kind of presentation is also termed as 'visual presentation' or
‘charting’.

In this unit, you will learn about the importance of visual presentation of
research data and some of the reasons why diagrammatic and graphic
presentation of data is so widely used. You will also study the different kinds of
diagrams and graphs, which are more popularly used for presenting the data in
research work, along with the principles of presenting a frequency distribution
in the form of diagrams and graphs. As you are already familiar with graphs
and diagrams, we will proceed with further discussions.

7.2 IMPORTANCE OF VISUAL PRESENTATION OF


DATA
Visual presentation of statistical data has become more popular and is often
used by the researcher and the statistician in analysis. Visual presentation of
data means presentation of Statistical data in the form of diagrams and graphs.
In these days, as we know, every research work is supported with visual
presentation because of the following reasons.

1) They relieve the dullness of the numerical data: Any list of figures
becomes less comprehensible and difficult to draw conclusions from as its
length increases. Scanning of the figures from tables causes undue strain on the
mind. The data, when presented in the form of diagrams and graphs, gives a
bird's-eye view of the entire data, creates interest and leaves an impression
on the mind of readers for a long period.

2) They make comparison easy: This is one of the prime objectives of visual
presentation of data. Diagrams and graphs make quick comparison between two
or more sets of data simpler, and the direction of curves bring out hidden facts
and associations of the statistical data.

3) They save time and effort: The characteristics of statistical data, through
tables, can be grasped only after a great strain on the mind. Diagrams and
graphs reduce the strain and save a lot of time in understanding the basic
characteristics of the data.

4) They facilitate the location of various statistical measures and


establish trends: Graph makes it possible to locate several measures of
central tendency such as Median, Quartiles, Mode etc. They help in
establishing trends of the past performance and are useful in interpolation or
extrapolation, line of best fit, establishing correlation etc. Thus, it helps in
forecasting.

5) They have universal applicability: It is a universal practice to present the


numerical data in the form of diagrams and graphs. In these days, it is an
extensively used technique in the field of economics, business, education, health,
agriculture etc.

6) They have become an integral part of research: In fact, nowadays it
is difficult to find any research work without visual support. The reason is that
this is the most convincing and appealing way of presenting the data. You can
find diagrammatic and graphic presentation of data in journals, magazines,
television, reports, advertisements etc. After having understood about the
importance of visual presentation, we shall move on to discuss about the
Diagrams and graphs which are more frequently used in the area of business
research.

7.3 DIAGRAMMATIC PRESENTATION


As you know, diagrammatic presentation is one of the techniques of visual
presentation of statistical data. It is a fact that diagrams do not add new
meaning to the statistical facts but they reveal the facts of the data more
quickly and clearly, because examining the figures from tables becomes
laborious, uninteresting to the eye and confusing. Here, it is appropriate
to state the words of M. J. Moroney, “cold figures are uninspiring to most
people. Diagrams help us to see the pattern and shape of any complex
situation.” Thus, the data presented through diagrams are the best way of
appealing to the mind visually. Hence, diagrams are widely used in practice to
display the structure of the data in research work.

7.3.1 Rules for Preparing Diagrams

As we have discussed earlier, the prime objective of diagrammatic presentation


of data is to highlight their basic hidden facts and relationships. To ensure that
the presentation of numerical data is more attractive and effective, it
is essential to keep the following general rules in mind while adopting diagrams
in research work. Now, let us discuss them one by one.

1) You must have noted that the diagrams must be geometrically accurate.
Therefore, they should be drawn on the graphic axis i.e., ‘X’ axis (horizontal
line) and ‘Y’ axis (vertical line). However, the diagrams are generally drawn on
a plain paper after considering the scale.
2) While taking the scale on ‘X’ axis and ‘Y’ axis, you must ensure that the scale
showing the values should be in multiples of 2, 5, 10, 20, 50, etc.
3) The scale should be clearly set up, e.g., millions of tons, persons in Lakhs, value
in thousands etc. On ‘Y’ axis the scale starts from zero, as the vertical scale is
not broken.
4) Every diagram must have a concise and self explanatory title, which may be
written at the top or bottom of the diagram.
5) In order to draw the readers' attention, diagrams must be attractive and well
proportioned.
6) Different colours or shades should be used to exhibit various components of
diagrams and also an index must be provided for identification.
7) It is essential to choose a suitable type of diagram. The selection will depend
upon the number of variables, minimum and maximum values, objects of
presentation.
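Rule 2 above can be sketched as a small helper that picks a Y-axis step from the 1–2–5 family of multiples so that the largest value fits in about ten divisions starting from zero. The function name and the ten-division cap are assumptions made here for illustration:

```python
# Choose a "nice" Y-axis step (1, 2, 5, 10, 20, 50, ...) so the largest
# value to be plotted fits in about `divisions` steps starting from zero.
def choose_scale_step(max_value, divisions=10):
    steps = [1, 2, 5]
    magnitude = 1
    while True:
        for s in steps:
            if s * magnitude * divisions >= max_value:
                return s * magnitude
        magnitude *= 10

print(choose_scale_step(410))  # 50, so the axis runs 0, 50, 100, ..., 450
```

For the tea-export data in Illustration 1 (maximum 410 million kgs) this gives a step of 50, matching the 0–450 scale used in Figure 7.1.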

Self Assessment Exercise A
List out the importance of visual presentation of statistical data.

........................................................................................................................
........................................................................................................................
........................................................................................................................
........................................................................................................................

7.4 TYPES OF DIAGRAMS


Generally, diagrams are classified on the basis of their length, width and shape.
There are various types of diagrams namely, one dimensional diagrams, two
dimensional diagrams, three dimensional diagrams, charts, pictograms, cartograms
etc. However, in this unit, we will discuss the important types of diagrams,
which are more frequently used in social science research in general,
particularly in business research. Therefore, we have restricted ourselves to
study only one dimensional bar diagrams, pie diagrams, and structure diagrams.

7.5 ONE DIMENSIONAL BAR DIAGRAMS


Bar refers to a thick line. Under this type of construction only one dimension
i.e., length is taken into account for the purpose of comparison and observance
of fluctuations in growth. The length of each bar is proportionate to the
magnitude of the data. The width is not related to the magnitude of the data.
Generally the width is given for the purpose of visual effect and attractiveness.
The width of each bar and the gap between adjacent bars must be
uniform. Mention the respective figures at the top of every bar, particularly
when the scale is too narrow, so that the reader knows the figures without
consulting the scale of the diagram.

A large number of one dimensional diagrams are available for presenting data.
Such as line diagram, simple bar diagram, multiple bar diagram, sub-divided bar
diagram, percentage bar diagram, deviation bar diagram etc. We shall, however,
study only the simple bar diagram, multiple bar diagram, and sub-divided bar
diagram. Let us study these three kinds of diagrams with the support of
relevant illustrations.

7.5.1 Simple Bar Diagram


In a Simple bar diagram, the data related to one variable is depicted. Such as,
profits, investments, exports, sales, production etc.

This type of diagram may be drawn either vertically or horizontally. Both


positive and negative values can be presented. In such a case, if bars are
constructed vertically, the positive values are taken on the upper side of
horizontal axis while the negative values are taken on its lower side. On the
other hand if the bars are constructed horizontally, the positive values are taken
on the right hand side of the vertical axis and the negative values are
considered on its left side. This type of construction is also called a
deviation bar diagram. The simple bar diagram is very easy to prepare and to
understand the level of fluctuations from one situation to another. It should be
kept in mind that, only length is taken into account and not width. Width should
be uniform for all bars and the gap between each bar is normally identical. Let
us consider the following illustrations and learn how to present the given data in
the form of simple bar diagrams vertically and horizontally.

Illustration-1
Prepare a Simple Bar Diagram from the Following Data Relating to Tea Exports.

Year 1995-96 1996-97 1997-98 1998-99 1999-00 2000-01 2001-02

Exports
(In Million kgs.) 167 209 410 316 192 215 160

Solution: The quantity of tea exported is given in million kgs. for different
years. A simple bar diagram will be constructed with 7 bars corresponding to
the 7 years. Now study the following vertical construction of bar diagram by
referring the guide lines for construction of simple bars, as explained in section
7.5.1.

[Vertical bar chart: one bar per year, heights 167, 209, 410, 316, 192, 215 and 160 million kgs; Y-axis scale 0–450 in steps of 50.]

Figure 7.1: Simple Bar Diagram Showing the Tea Exports in Different Years.
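As a rough sketch, the same simple bar diagram can be rendered as text, drawing each bar horizontally with one '#' per 10 million kgs (a scale chosen here only for illustration). The data is taken from Illustration 1:

```python
# Text rendering of a simple bar diagram: bar length is proportional
# to the magnitude of the data, as the rules above require.
exports = {
    "1995-96": 167, "1996-97": 209, "1997-98": 410, "1998-99": 316,
    "1999-00": 192, "2000-01": 215, "2001-02": 160,
}

UNIT = 10  # million kgs represented by one '#' character

for year, kgs in exports.items():
    bar = "#" * round(kgs / UNIT)
    print(f"{year} | {bar} {kgs}")
```

The 1997-98 bar, for example, comes out 41 characters long against 17 for 1995-96, making the fluctuation between years visible at a glance.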

Illustration-2

The following data relates to the Profit and Loss of different industries in 1999-
2002. Present the data through simple bar diagram.

Industry : Cement Oil Textile Sugar Garments


Profit/Loss 48 25 –12 14 –24
(Rs. In lakhs)

Solution : The given data represents positive and negative values i.e., profit
and loss. Let us draw the bars horizontally. Observe fig: 7.2 carefully and try
to understand the construction of simple bars horizontally.

[Horizontal bar chart: bars extend right for profits (Cement 48, Oil 25, Sugar 14) and left for losses (Textiles –12, Garments –24), scale in Rs. lakh.]

Figure 7.2: Simple Bar Diagram Showing the Profit and Loss of Different Industries During 1999-02

Self Assessment Exercise B

Represent the following data related to the surplus/deficit of Balance of Trade


over a period, by simple bar diagram.

Years 1997 1998 1999 2000 2001 2002 2003


Surplus (+)/deficit (–)
(In million $) + 34 + 14 – 12 –4 +6 – 12 – 20

..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................

7.5.2 Multiple Bar Diagram

In this type of diagram, two or more than two bars are constructed side by
side horizontally for a period or related phenomenon. This type of diagram is
also called Compound Bar or Cluster Bar Diagram. The technique of preparing
such a diagram is the same as that of simple bar diagram. This diagram, on
the one hand, facilitates comparison of the values of different variables in a set
and on the other, it facilitates the comparison of the values of the same variable
over a period of time or phenomenon. To facilitate easy comparison, the
different bars of a set may be coloured or shaded differently to distinguish
between them. But the colour or shade for the bars representing the same
variable in different sets should be the same.

Let us consider the following illustration and learn the method of presentation
of the data in the form of a multiple bar diagram.

Illustration-3

Depict the following data in a multiple bar diagram.


Foreign Investment – Industry-wise Inflows
(Rs. in crores)

Industry        1997-98    1998-99    1999-2000
Chemical          956        1580         523
Engineering      2155        1800        1423
Services         1194        1550         506
Food              418          78         525

Solution : The data relates to the Foreign Investment inflow of four


industries during 1997-2000 (three years). Therefore, three sets of bars should
be drawn, each set represents one year. In each set there should be four bars
representing four sectors (Chemical, Engineering, Services and Food). Let us
draw the multiple bars with the help of the procedure explained in subsection
7.5.2. Study the diagram carefully and learn how this type of diagram is drawn.

[Multiple bar chart: for each of 1997-98, 1998-99 and 1999-2000, four bars (Chemical, Engineering, Services, Food) with the values from the table above, in Rs. crores.]

Figure 7.3: Multiple Bar Diagram Showing the Inflow of Foreign Investment in Selected Sectors During 1997-2000

Self Assessment Exercise C

The following table relates the Indian Textile Exports to different countries

Countries            1997-98    1998-99    1999-2000
USA                   746.13     759.36      882.41
Germany               366.01     300.46      338.88
UK                    403.07     337.94      341.42
Italy                 241.64     233.14      215.48
Korea (Republic)      127.00      88.30      185.13

i) Represent the data by Multiple bar diagram.


ii) Which aspects of the distribution does this diagram emphasize?
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................

7.5.3 Sub-divided Bar Diagram

In this diagram one bar is constructed for the total value of the different
components of the same variable. Further it is sub-divided in proportion to the
values of various components of that variable. This diagram shows the total of
the variables as well as the total of its various components in a single bar.
Hence, it is clear that the sub-divided bar serves the same purpose as multiple
bars. The only difference is that, in case of the multiple bar each component of
a variable is shown side by side horizontally, where as in construction of sub-
divided bar diagram each component of a variable is shown one upon the other.
It is also called a component bar diagram. This method is suitable if the total
values of the variables are small, otherwise the scale becomes very narrow to
depict the data. To study the relative changes, all components may be
converted into percentages and drawn as sub-divided bars. Such a bar
construction is called a sub-divided percentage bar. The limitation is that all
the parts do not have a common base to enable us to compare accurately the
various components of a set.

Let us take up an illustration to understand presenting of the data in the form


of sub-divided bar diagram.

Illustration-4
The following data relates to India's exports of electronic goods to different
countries during 1994-99. Represent the data by sub-divided bar diagram.
(Rs. in Crores)
Years      USA    Hong Kong   Malaysia   Singapore   Germany   Total
1994-95    210        86          56        275          91      718
1995-96    378       105         159        467         118     1227
1996-97    789       189         221        349          93     1641
1997-98    880       248         175        327          90     1720
1998-99    900       220         200        350         130     1800

Solution : For construction of a sub-divided bar diagram, first of all, we must
obtain the total export value of the five countries in each year; in the above
illustration these yearly totals are already given. Now study figure 7.4
carefully and understand the construction.

[Sub-divided bar chart: one bar per year, each divided into USA, Hong Kong, Malaysia, Singapore and Germany segments with the values from the table above, in Rs. crores.]

Figure 7.4: Sub-divided Bar Diagram Showing the India's Exports of Electronic Goods to Different Countries During 1994-99.
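To prepare the sub-divided percentage bar mentioned earlier, each year's components are first converted into percentages of that year's total. A minimal Python sketch using the 1994-95 figures from Illustration 4:

```python
# Convert one year's components into percentage shares for a
# sub-divided percentage bar. Figures are the 1994-95 export values.
components = {"USA": 210, "Hong Kong": 86, "Malaysia": 56,
              "Singapore": 275, "Germany": 91}

total = sum(components.values())  # 718, as shown in the table
percentages = {k: round(100 * v / total, 1) for k, v in components.items()}

print(percentages["Singapore"])  # 38.3
```

Repeating this for every year puts all bars on a common 0–100 base, which is what makes relative changes between years comparable.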

Self Assessment Exercise D

Draw sub-divided bar diagram for the following table. Do you agree that
this diagram is more effective for comparison of figures rather than the
Multiple bar diagram? Justify your opinion.

Item-wise Exports of Leather Products from India (1997-2000)


$ Million
Items                1997-98    1998-99    1999-2000
Finished leather      296.19     268.38      239.00
Leather footwear      240.77     241.00      299.77
Leather goods         387.79     411.00      385.25
Leather garments      425.72     381.94      318.94
Others                 26.00      33.00       36.00
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................

7.6 PIE DIAGRAM


Pie diagrams are generally used to show per cent breakdowns. For instance,
we can show how the budget is allocated under different heads. A pie diagram
is a sub-divided circle. The area of different sub-divisions in pie diagrams are in
the proportion of the data to be represented. While making comparison, pie
diagrams should be used on a percentage basis and not on an absolute basis.

In constructing a pie diagram the first step is to convert the various values of
components of the variable into percentages and then the percentages
transposed into corresponding degrees. The total percentage of the various
components i.e., 100 is taken as 360° (degrees around the centre of a circle)
and the degree of various components are calculated in proportion to the
percentage values of different components. It is expressed as:
Degrees of a component = (360° / 100) × component's percentage

It should be noted that in case the data comprises of more than one variable, to
show the two dimensional effect for making comparison among the variables,
we have to obtain the square root of the total of each variable. These square
roots would represent the radius of the circles and then they will be sub-
divided. A pie diagram helps us in emphasizing the area and in ascertaining the
relationship between the various components as well as among the variables.
However, compared to a bar diagram, a pie diagram is less effective for
accurate interpretation when the components are in large numbers. Let us draw
the pie diagram with the help of the data contained in the following table.
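The conversion above can be sketched in Python; the figures 310 and 550 are sample values chosen to mirror the worked illustration that follows:

```python
# Convert a component's share into degrees of a circle, as expressed
# above: (360° / 100) × component's percentage.
def degrees(component_value, total):
    percentage = 100 * component_value / total
    return (360 / 100) * percentage  # i.e. 3.6 degrees per percentage point

# e.g. a component of 310 observations out of 550:
print(round(degrees(310, 550)))  # 203
```

The degrees of all components necessarily sum to 360°, which provides a quick check on the arithmetic before drawing the circle.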

Illustration 5

A researcher made an enquiry about the sources of price information tapped


from 550 sample farmers in a regulated agricultural market as given below.
Present the data in the form of pie diagram and comment.

Source of Price Information No. of farmers


Radio 50
Daily papers 60
Local traders 100
Co-farmers 310
Personal visits 20
Market office 10

Solution : The number of farmers, who have expressed their sources of


collecting information for price of agricultural products have to be converted
into the corresponding percentages and then after that into degrees as shown
below. Draw the circle and then measure points on the circumference
representing the degrees of each source with the help of a protractor. Let us first
calculate the corresponding percentages and then convert into degrees in order
to draw an appropriate pie chart.

Source of No. of Percentage of Degree of


Price Information farmers No. of farmers angle
Radio 50 9.1 33°
Daily papers 60 10.9 39°
Local traders 100 18.2 66°
Co-farmers 310 56.4 203°
Personal visits 20 3.6 13°
Market office 10 1.8 6°
Total 550 100.0 360°

After calculating the degrees of various components, depict them in a circle as


shown in Figure 7.5.

[Pie chart: sectors for Radio 9.1%, Daily papers 10.9%, Local traders 18.2%, Co-farmers 56.4%, Personal visits 3.6% and Market office 1.8%.]

Figure 7.5: Sources of Price Information of Regulated Agricultural Market Tapped by the Farmers.
The pie diagram reveals that the majority of farmers seek price information
from co-farmers, which represents about 56.4% among the sample farmers
(550). Next to the source of co-farmers, they collect the information through
local traders i.e., 18.2%. The study also reveals that the market office plays a
very insignificant role which represents about 1.8% only.

Self Assessment Exercise E

Construct a pie diagram to describe the following data :

Reasons for Buying Face Cream


Reasons No. of Respondents
Seen it advertised 280
Seen it on the counter 160
Reasonably priced 100
Scent appealed 70
Beneficial to skin 180
Recommendation 210
Total Respondents 1000

What features of this distribution does your pie diagram mainly illustrate?

7.7 STRUCTURE DIAGRAMS

There are several important diagram formats that are used to display the
structural information (qualitative) in the form of charts. The format depends
upon the nature of information. Under this heading we will discuss two kinds of
charts: (1) Organisational Charts and (2) Flow Charts.

7.7.1 Organisational Charts

These charts are most commonly used to represent the internal structure of
organisations. There is no standard format for this kind of diagram, as the
design depends on the nature of the organisation. A special format is used
in the following illustration, which relates to the organisational structure
of the IGNOU. Study Fig. 7.6 and try to understand how this kind of diagram
can be prepared for other organisations.

[Organisational chart: the Visitor at the top; the Board of Management,
supported by the Planning Board, Academic Council, Distance Education Council
and Finance Committee; the Vice Chancellor; the Pro-Vice Chancellors; and, at
the base, the Schools, Divisions and Centres of the University]
Figure 7.6: Organisational Set Up of the IGNOU

7.7.2 Flow Charts

Flow charts are used most commonly in any situation where we wish to
represent the information which flows through different situations to its ultimate
point. These charts can also be used to indicate the flow of information about
various aspects i.e., material flow, product flow (distribution channels), funds
flow etc.
The following Figure 7.7, which relates to the marketing channels for fruits,
will give you an understanding of flow charts.

[Flow chart: fruits move from Growers through Processors and Pre-Harvest
Contractors to the Commission Agent in the Wholesale Market, then via
Wholesalers and Retail Wholesalers to Retail Shops and Hawkers and finally to
Consumers, with separate channels through Exporters to export markets]

Figure 7.7 : Marketing channels for fruits

Self Assessment Exercise F

1) Prepare an organisational chart of any organisation of your own choice.


..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................

2) Represent the different stages in processing of data (Unit 6 of this course) through a flow chart.
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................

7.8 GRAPHIC PRESENTATION

So far we have discussed one of the techniques of visual presentation of
data, i.e., diagrammatic presentation. You will appreciate how such
presentation eliminates the dullness of data, makes it more interesting, and
also helps in comparison between two or more frequency distributions. Now we
will study another important technique of visual presentation of statistical
data, i.e., graphic presentation. You might have seen the graphic
representation of stock indices, cricket scores, production trends etc., in
various magazines and on television. Everybody, whether a layman or an
expert, has a natural fascination for an appropriate graphical presentation
of data, which remains an essential part of research methodology. The graphic
presentation of data leaves an impact on the mind of readers, as a result of
which it is easier to draw trends from the statistical data.

The shape of a graph offers easy and appropriate answers to several
questions. In particular:

l The direction of curves on the graph makes it very easy to draw comparisons.
l The presentation of time series data on a graph makes it possible to interpolate
or extrapolate the values, which helps in forecasting.
l The graph of a frequency distribution helps us to determine the values of mode,
median, quartiles, percentiles, etc.
l The shape of the graph helps in demonstrating the degree of inequality and
the direction of correlation.
For all such advantages it is necessary for a researcher to have an
understanding of different types of graphic presentation of data. In practice,
there are a variety of graphs which can be used to depict the data. However,
here we will discuss only a few graphs which are more frequently used in
business research.

Broadly, the graphs of statistical data may be classified into two types, one is
graphs of time series, another is graphs of frequency distribution. We will
discuss both these types, after studying the parts of a graph.

Parts of a Graph
The foremost requirement for a researcher is to be aware of the basic
principles for using the graph paper for presentation of statistical data
graphically.

Conventionally, graphs are drawn on a graph paper. Two perpendicular lines
are drawn which intersect each other at right angles. This intersecting point
is called the origin point or the 'zero' point. The horizontal line is known
as the 'X' axis (abscissa), on which independent variables are shown, while
the vertical line is known as the 'Y' axis (ordinate), on which dependent
variables are shown. The graph paper is thus divided into four parts, termed
"quadrants". These quadrants depict the positive and negative values of the X
variable and the Y variable. By observing Chart 7.1 below, you will clearly
understand the purpose of the quadrants of a graph.

Quadrant I   : X positive, Y positive
Quadrant II  : X negative, Y positive
Quadrant III : X negative, Y negative
Quadrant IV  : X positive, Y negative
Chart 7.1 : Parts of a Graph

After understanding the above parts of a graph, let us study the different types
of graphs.

7.9 GRAPHS OF TIME SERIES


A time series is a set of values of a variable or variables arranged over a
period of time, for example, data relating to production, sales, expenditure,
exports etc., during the last ten years. Thus a graph of time series is
prepared to show the values of one or more variables over a period of time.
This type of graph is also termed a time graph or historigram, because
history is represented graphically. These graphs are helpful in studying
changes over a period of time and in forecasting.

Historigrams can be constructed in two ways: 1) on a natural scale
(arithmetic scale), 2) on a ratio scale. The natural scale graph reflects the
changes in absolute values over a period of time, whereas the ratio scale
graph reflects the relative changes over a period of time. In this unit,
however, we study historigrams on a natural scale, which is generally used in
business research.

Construction of Historigrams on Natural Scale


Natural scale graphs are used to show the absolute values or relative values in
terms of percentages, such as index numbers of a time series.

The following principles should be kept in mind while constructing
historigrams, so that the reader does not have to search through the text in
order to understand a graph.

1) On X-axis we take the time as an independent variable and on Y axis the values
of data as dependent variable. Plot the different points corresponding to given
data; then the points are joined by a straight line in the order of time.
2) Equal magnitude of scale must be maintained on X-axis as well as on Y-axis.
3) The Y-axis normally starts with zero. In case there is a wide difference
between the lowest value of the data and zero (origin point), the Y-axis can be
broken and a false base line may be drawn. This will be explained under
the related problem in this section.
4) If the variables are in different units, double scales can be taken on the
Y-axis.
5) The scales adopted should be clearly indicated and the graph must have a
self-explanatory title.
6) Unfortunately, graphs lend themselves to considerable misuse. The same
data can give different graphical shapes depending on the relative sizes of
the two axes. In order to avoid such misrepresentations, the convention in
research is to construct graphs, wherever possible, such that the vertical axis
is around 2/3 to 3/4 the length of the horizontal axis.
After having learnt the principles for the construction of historigrams, we
move on to discuss the types of historigrams. Various types have been
developed; among them, the frequently used graphs are one dependent variable
graphs and more than one dependent variable graphs. We will now look at the
construction of these graphs.

7.9.1 Graph of One Dependent Variable

When there is only one dependent variable, the values of the dependent variable
are taken on Y axis, while the time is taken on X-axis. Study the following
illustration carefully and try to understand the method of construction for one
dependent variable historigrams.

Illustration 6

The following data relates to India's exports to USA during the period of 1994-
2000. Represent the data graphically.

Year                      1994   1995   1996   1997   1998   1999   2000
Exports (in million $)    5310   5726   6170   7322   8237   9071   10687

[Line graph: exports (in million $) plotted on the Y-axis, with the scale
starting at zero, against years on the X-axis, rising from 5,310 in 1994 to
10,687 in 2000]

Figure 7.8: Historigram Showing India's Exports to USA During 1994-2000.
False Base Line

In the above graph (Figure 7.8), the scale on the Y-axis has been taken as
1 cm = 1,000 million $, starting from the origin point, i.e., zero.
Consequently, the portion of the graph paper on which the scale lies between
zero and the smallest value of the data (5310) remains unused, and only the
upper half of the graph paper depicts the data, because there is a wide
difference between zero and the lowest value of the given data. Therefore,
the curve as drawn does not bring out the fluctuations clearly. In such a
situation, in order to use the space of the graph paper effectively, it is
advisable to draw a false base line. By using the false base line, minor
fluctuations are amplified and become clearly visible on the graph. The false
base line breaks the continuity of the Y-axis scale from the origin point
(zero) by drawing a horizontal wavy line between zero and the first
centimetre on the Y-axis scale.

Let us reconsider Illustration 6 and represent the data graphically by
drawing a false base line, so that we can practically understand and
appreciate the importance of the false base line for effective presentation
of data on a graph sheet. Study Figure 7.9 carefully.

[Line graph: the same export data replotted with a false base line, the
Y-axis broken between zero and 5,000 million $, so that the year-to-year
fluctuations are clearly visible]
Figure 7.9 : Historigram Showing India's Exports to USA During 1994-2000.

7.9.2 Graph of More Than One Dependent Variable

When the data of time series relate to more than one dependent variable,
curves must be drawn for each variable separately. These graphs are prepared
in the same manner as we prepare one dependent variable historigram. Let us
consider the following data to construct historigrams. Study Figure 7.10 carefully
and understand the procedure for preparation of this type of graph.
Illustration-7

Years                          1991  1992  1993  1994  1995  1996  1997  1998  1999  2000
Sales (Rs. in lakhs)             31    58    42    65    75    80    72    96    83    98
Cost of Sales (Rs. in lakhs)     42    50    48    55    82    75    62    80    67    73
Profit/Loss (Rs. in lakhs)      –11    +8    –6   +10    –7    +5   +10   +20   +16   +25
Solution : The given data comprises three variables, so we have to draw a
separate curve for each variable. In this graph it is not necessary to draw a
false base line, because the minimum value is close to the point of origin (zero).
For easy identification, each curve is marked differently.

[Line graph: three curves (Sales, Cost of Sales, and Profit & Loss, all in
Rs. lakhs) plotted against the years 1991-2000; loss values fall below the
X-axis]
Figure 7.10 : Historigram Showing Sales, Cost of Sales and Profit/Loss of a Company
During 1991-2000

The above graph clearly reveals that, with the passage of time, profits have
been rising after 1996, even though sales fluctuate slightly.

Self Assessment Exercise G

Represent the following data graphically, showing Exports, Imports and the
Balance of Trade.

Years                    1994  1995  1996  1997  1998  1999  2000
Exports (Rs. in lakhs)     52    46    62    58    72    89    92
Imports (Rs. in lakhs)     48    52    69    51    60    80    96

..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................

7.10 GRAPHS OF FREQUENCY DISTRIBUTION


We have seen in Unit-6 the presentation of frequency distributions in the
form of tables. A frequency distribution can also be presented in the form of
graphs. Such graphs give a better understanding and provide more illustrative
information to readers than data in tabular form. It is true that effective
graphs can markedly increase a reader's comprehension of complex data sets.
Compared to tables, graphs of frequency distribution are helpful in
identifying the characteristics and relationships of the data. These graphs
are also useful in locating positional averages such as the mode, median,
quartiles etc. In a continuous frequency distribution, class limits/mid-values
are taken on the X-axis and the frequencies on the Y-axis. The vertical axis
(Y-axis) is not broken, thus a false base line cannot be taken.

A frequency distribution can be portrayed by means of Histogram, frequency


polygon, ogive curves and scatter diagram. However, the scatter diagram will
be discussed in Unit-10 of this course.

Let us study the procedure involved in the preparation of these types of graphs.

7.10.1 Histogram and Frequency Polygon

Histogram: The graph usually drawn to represent a frequency distribution is
called a histogram. A histogram is a set of rectangles (vertical bars), each
proportional in width to the magnitude of its class interval and proportional
in area to the frequency of that class interval. In a histogram, the
variables (class intervals) are always shown on the X-axis and the
frequencies are taken on the Y-axis. In constructing a histogram there should
not be any gap between two successive rectangles, and the data must be in the
exclusive form of classes. However, we cannot construct a histogram for a
distribution with open-end classes, and it can be quite misleading if the
distribution has unequal class intervals.

The value of the mode can be determined from the histogram. The procedure for
locating the mode is to draw a straight line from the top right corner of the
highest rectangle (modal class) to the top right corner of the preceding
rectangle (pre-modal class). Similarly, draw a straight line from the top left
corner of the highest rectangle to the top left corner of the succeeding
rectangle (post-modal class). Draw a perpendicular from the point of
intersection of these two straight lines to the X-axis. The point where it
meets the X-axis gives the value of the mode. This is shown in Figure 7.11.
However, graphic location of the mode is not possible in a multi-modal
distribution.

Frequency Polygon: Polygon means a 'many-angled' figure. This is another
way of depicting a frequency distribution graphically, and it facilitates
comparison of two or more frequency distributions. A frequency polygon can be
drawn either from the histogram or from the given data directly.

The procedure for constructing a frequency polygon from the histogram is to
first draw the histogram of the given data, as explained earlier. Then put a
dot at the mid-point of the top horizontal line of each rectangle and join
these dots by straight lines.

Another way of drawing a frequency polygon is to obtain the mid-values of the
class intervals and plot them on the X-axis, marking frequency along the
Y-axis. Then plot the frequency values corresponding to each mid-point and
connect them with straight lines. The area cut off from the histogram by the
polygon is just equal to the area added outside it; hence the area of the
polygon is equal to the area of the histogram. The difference between the
histogram and the polygon is that the histogram depicts the frequency of each
class separately, whereas the polygon does it collectively. Both are drawn
for continuous series data; the polygon is preferred when two or more
distributions are to be compared on the same graph.
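The mid-values needed for a frequency polygon can be computed directly from the class limits. A minimal sketch (the helper name `mid_values` is just for illustration, and it assumes exclusive classes written in the "lower-upper" form used in this unit):

```python
def mid_values(classes):
    """Return the mid-point of each exclusive class interval, e.g. '0-10' -> 5.0."""
    mids = []
    for interval in classes:
        lower, upper = (float(v) for v in interval.split("-"))
        mids.append((lower + upper) / 2)
    return mids

# Class intervals of the kind used in the illustrations of this section.
print(mid_values(["0-10", "10-20", "20-30", "30-40"]))   # [5.0, 15.0, 25.0, 35.0]
```

These mid-values go on the X-axis, the class frequencies on the Y-axis, and joining the plotted points by straight lines gives the polygon.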

Let us, now, take up an illustration to learn how to draw a histogram, and
frequency polygon practically and also determine the mode. The data relates to
the sales of computers by different companies.

Illustration-8

Sales (Rs. in crores)   0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80
No. of Companies           8     20     35     50     90     70     30     15

Solution : For drawing the histogram, as explained earlier, we show sales on
the X-axis and the number of companies on the Y-axis, selecting a suitable
scale. For drawing the frequency polygon, plot dots at the top middle of each
rectangle and join them by straight lines.

[Histogram of the distribution with the frequency polygon superimposed; the
perpendicular from the intersection of the mode-locating lines meets the
X-axis at Z = 46.67]

Figure 7.11: Histogram and Frequency Polygon for Computer Sales of Various
Companies

Remark: The calculation of Mode will be discussed in Unit 8 of this block.
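The geometric construction used in Figure 7.11 is equivalent to interpolating within the modal class. A minimal sketch of that interpolation, assuming equal class widths (the mode formula itself is taken up in Unit 8; the function name `mode_from_histogram` is only illustrative):

```python
def mode_from_histogram(lower, width, f_modal, f_pre, f_post):
    """Locate the mode at the intersection of the two corner-joining
    lines drawn on the histogram.

    lower   : lower limit of the modal class
    width   : common class width
    f_modal : frequency of the modal class
    f_pre   : frequency of the preceding class
    f_post  : frequency of the succeeding class
    """
    return lower + width * (f_modal - f_pre) / (2 * f_modal - f_pre - f_post)

# Computer-sales data: modal class 40-50 with frequency 90,
# neighbouring frequencies 50 (class 30-40) and 70 (class 50-60).
print(round(mode_from_histogram(40, 10, 90, 50, 70), 2))   # 46.67
```

The result agrees with the value Z = 46.67 located graphically in Figure 7.11.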

Self Assessment Exercise H

The monthly production of units by a sample of 200 workers in a bulb
manufacturing firm is given in the following table.

Output (Units)   200-225  225-250  250-275  275-300  300-325  325-350  350-375  375-400
No. of Workers        12       21       25       40       49       28       17        8

i) Draw a histogram and frequency polygon.

ii) Determine the mode graphically.

......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................

7.10.2 Cumulative Frequency Curves

Sometimes we are interested in knowing how many families there are in a
city whose earnings are less than Rs. 5,000 p.m., or whose earnings are more
than Rs. 20,000 p.m. In order to obtain this information, we first have
to convert the ordinary frequency table into a cumulative frequency table. When
the frequencies are added they are called cumulative frequencies. The curves
obtained from the cumulative frequencies are called 'cumulative frequency
curves', popularly known as "ogives". There are two types of ogives, namely
the less than ogive and the more than ogive. Let us learn the procedure
involved in drawing these two ogives.

In a less than ogive, we start with the upper limit of each class and the
cumulation (addition) starts from the top. When these frequencies are plotted
we get the less than ogive. In the case of a more than ogive, we start with
the lower limit of each class and the cumulation starts from the bottom. When
these frequencies are plotted we get the more than ogive. You should bear in
mind that while drawing ogives the classes must be in exclusive form.

The ogives are useful to determine the number of items above or below a
given value. They are also useful for comparison between two or more frequency
distributions and for determining certain positional values such as the mode,
median, quartiles, percentiles etc. Let us take up an illustration to understand
how to draw ogives practically. Observe carefully the procedures involved in it.

Note: Mode and median are explained in Unit 8; quartiles are explained in
Unit 9. This illustration can be better understood after studying those units.

Illustration-9

The following data relates to the monthly operating expenses incurred by a


sample of 200 small-scale industrial units in a city. You are required to draw
ogives and locate the Q1, Q3 and Median (Q2).

Operating Expenses (Rs. in thousands)    No. of Units
0-20 7
20-40 18
40-60 22
60-80 34
80-100 53
100-120 26
120-140 18
140-160 10
160-180 7
180-200 5

Solution : To depict the "less than" and "more than" cumulative frequency
curves (ogives), we first have to convert the above distribution into "less
than" and "more than" cumulative frequency distributions. Study carefully the
procedure for conversion of ordinary frequencies into cumulative frequencies,
as shown below:
"Less than" Method                        "More than" Method
Operating Expenses      Frequency         Operating Expenses      Frequency
(Rs. in '000)                             (Rs. in '000)
Less than 20 7 More than 0 200
Less than 40 25 More than 20 193
Less than 60 47 More than 40 175
Less than 80 81 More than 60 153
Less than 100 134 More than 80 119
Less than 120 160 More than 100 66
Less than 140 178 More than 120 40
Less than 160 188 More than 140 22
Less than 180 195 More than 160 12
Less than 200 200 More than 180 5

The cumulative frequencies presented in the above table have the following
interpretation.

The 'less than' cumulative frequencies are to be read against upper class
limits. In contrast, the 'more than' cumulative frequencies are to be read
against lower class boundaries. For instance, there are 7 units with operating
expenses of less than Rs. 20,000 and 160 units with operating expenses of
less than Rs. 1,20,000. On the other hand, there are 153 units with operating
expenses of more than Rs. 60,000, and no units with operating expenses of
Rs. 2,00,000 or more.
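The conversion carried out in the table above can be expressed compactly: running totals from the top give the "less than" series, and the grand total minus the running totals gives the "more than" series. A minimal sketch using the standard library's `itertools.accumulate` for the running sum:

```python
from itertools import accumulate

# Frequencies of the ten operating-expense classes (Rs. '000): 0-20, 20-40, ..., 180-200.
freq = [7, 18, 22, 34, 53, 26, 18, 10, 7, 5]

less_than = list(accumulate(freq))                          # read against upper class limits
more_than = [sum(freq) - c for c in [0] + less_than[:-1]]   # read against lower class limits

print(less_than)   # [7, 25, 47, 81, 134, 160, 178, 188, 195, 200]
print(more_than)   # [200, 193, 175, 153, 119, 66, 40, 22, 12, 5]
```

Both series match the hand-worked table, and plotting each against the corresponding class limits yields the two ogives.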

[Graph: the "less than" and "more than" ogives plotted on the same axes, with
operating expenses (Rs. in '000) on the X-axis and the number of small-scale
industrial units on the Y-axis; perpendiculars mark Q1 = 60.18, Me = 80.77
and Q3 = 112.31]

Fig 7.12: 'Less than' and 'More than' Cumulative Frequency Curves Showing the
Operating Expenses (Rs. in '000) of Small Scale Industrial Units.
Now, look at Figure 7.12, which shows both cumulative curves on the same
graph. Study it carefully and understand the procedure for drawing ogives.

From the above ogives, the median can be located by drawing a perpendicular
from the intersection of the two ogives to the X-axis. The point where the
perpendicular touches the X-axis is the median of the distribution. Similarly,
the perpendicular drawn from the intersection of the two curves to the Y-axis
divides the sum of frequencies into two equal parts. The values of positional
averages like Q1, D6, P50, etc., can also be located by reading off the
corresponding item's value on the less than ogive. In the above figure, the
determination of Q1 and Q3 is shown as an illustration.
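Reading a positional value off the less than ogive amounts to linear interpolation between the plotted cumulative points. A sketch for the operating-expense data above (the helper name `ogive_value` is only illustrative; hand-drawn graphical readings are approximate, so values read off the figure may differ somewhat from these interpolated ones):

```python
def ogive_value(position, upper_limits, cum_freq, width):
    """Interpolate the variable value at a given cumulative position
    on a 'less than' ogive, assuming equal class widths."""
    prev_cf = 0
    for upper, cf in zip(upper_limits, cum_freq):
        if cf >= position:
            return upper - width + (position - prev_cf) / (cf - prev_cf) * width
        prev_cf = cf
    raise ValueError("position exceeds total frequency")

# Less-than cumulative distribution of the 200 small-scale units.
uppers = [20, 40, 60, 80, 100, 120, 140, 160, 180, 200]
cf     = [7, 25, 47, 81, 134, 160, 178, 188, 195, 200]
n = 200

q1     = ogive_value(n / 4, uppers, cf, 20)        # first quartile (N/4 = 50)
median = ogive_value(n / 2, uppers, cf, 20)        # median (N/2 = 100)
q3     = ogive_value(3 * n / 4, uppers, cf, 20)    # third quartile (3N/4 = 150)
print(round(q1, 2), round(median, 2), round(q3, 2))   # 61.76 87.17 112.31
```

The same routine serves for deciles and percentiles by changing `position`, e.g. `ogive_value(0.6 * n, ...)` for D6.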

Self Assessment Exercise I

The following data relates to the monthly expenditure on food incurred by a


sample of 150 families in an institution.

Monthly Expenditure No. of families


2,500 – 3,000 18
3,000 – 3,500 30
3,500 – 4,000 42
4,000 – 4,500 36
4,500 – 5,000 12
5,000 – 5,500 8
5,500 – 6,000 4

a) Draw the more than cumulative frequency curve and the less than cumulative
frequency curve.

b) Locate the median monthly expenditure using your ogive.

c) How many sample families are approximately spending less than Rs. 3,800 on
food?

......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................
......................................................................................................................

7.11 LET US SUM UP
Statistical data not only requires careful analysis but also an attractive
and communicative display. The work of the researcher is to understand the
facts of the data himself/herself and also to present them in such a form that
their significance is understandable to a common reader. In order to
achieve this objective, we have, in this unit, discussed the techniques of
diagrammatic and graphic presentation of statistical data. Besides, presenting the
data in the form of tables, data can also be presented in the form of diagrams
and graphs. Such visual presentation of data allows relation between numbers to
be exhibited clearly and attractively, makes quick comparison between two or
more data sets easier, brings out hidden facts and the nature of relationship,
saves time and effort, facilitates the determination of various statistical
measures such as Mean, Mode, Median, Quartiles, Standard deviation etc., and
establishes trends of past performance. Hence, with the help of the diagrams
and graphs the researcher can effectively communicate to readers the
information contained in a large mass of numerical data.

We have discussed the method for constructing simple bar diagram, multiple bar
diagram, sub-divided bar diagram, pie diagram and structure diagrams.

In graphs, we discussed graphs of time series (Historigrams), graphs of


frequency distribution (Histograms, frequency polygon, and cumulative frequency
curves). It is essential to keep in mind the basic principles while using the
diagrams and graphs for presenting the data.

7.12 KEY WORDS


Bar : A thick line whose length is proportional to the magnitude of the
variable it represents.

Continuous Data : Data that may progress from one class to the next without
a break and may be expressed by either fractions or whole numbers.

Discrete Data : Data that do not progress from one class to the next without
break, i.e., where classes represent distinct categories or counts and may be
represented by whole numbers only.

False Base Line : A line drawn between the origin point (zero) and the first
centimetre by breaking the Y-axis in a historigram, so that the scale of the
Y-axis does not start at zero.

Flow Chart : Presents the information which flows through various situations
to the ultimate point.

Frequency Polygon : A line graph connecting the mid-points of each class in


a data set.

Historigram : A graph of time series.

Histogram : A graph of a frequency distribution, composed of a series of


rectangles, each proportional in width to the range of a class interval and
proportional in height to the number of observations falling in the class.

Organisational Chart : A diagram specially designed to show the structure of
an organisation.
Ogive : A graph of a cumulative frequency distribution.
Pie Diagram : A circle divided into slices showing the relative areas of various
components of the variable.

Structure Diagram : Displays the structural data (qualitative) in the form of


charts.

7.13 ANSWERS TO SELF ASSESSMENT


EXERCISES
A. 1) Relieves the dullness of the numerical data; 2) Facilitates comparison;
3) Saves time and effort; 4) Facilitates the location of various statistical
measures and establishes trends; 5) Universal applicability; and 6) Is an
integral part of research.

D. No. In a sub-divided bar diagram, all items do not have a common base, so
one cannot accurately compare the various items of the data, whereas in a
multiple bar diagram the various items of a phenomenon have a common base and
are placed together side by side. Hence comparison is much easier than in a
sub-divided bar diagram.

E. Steps : 1) Find out the percentages of each reason for buying face cream.
2) Convert the percentages into degree of angle. 3) Then depict the percentages
in a circle with the help of their respective degree of angles.

7.14 TERMINAL QUESTIONS/EXERCISES


1) Explain the significance of visual presentation of statistical data in research
work.
2) Give a brief description of the different kinds of diagrams generally used in
business research to present the data.
3) What are structure diagrams? Explain each with an illustration the method of
representing the information by different structure diagrams.
4) Explain the principles of constructing a graph of time series. In which
situation will a false base line be used?
5) Survey your own statistics class in terms of the variables age, sex and income.
Use the graphing techniques outlined in the unit to describe your results.
6) Represent the following data relating to exports of agricultural and allied
products to Russia during 1996-2000 by a suitable diagram.

Year                   1996-97  1997-98  1998-99  1999-2000  2000-01

Export (In $ million)    340      448      336       333       490

7) Draw multiple bar and sub-divided bar diagrams to represent the following data relating to the enrollment of various programmes in an open university over a period of four years, and comment on it.


Programme        No. of Candidates enrolled
                 1998-99   1999-2000   2000-01   2001-02
MBA               1,565      2,356      1,924     3,208
M.Com               872      1,208      1,118     1,097
B.A.              1,600      1,220      1,090       987
B.Com               726        948      1,458     1,220

8) Construct a pie diagram to describe the following data, which relate to the amount spent on various heads under the Rural Development Programme.

Various heads Rupees (in crores)


Agriculture 1,280
Rural Industries 450
Public Health 150
Transport 600
Education 325
Housing 40
Public Utilities 10

What features of this distribution does your pie diagram mainly illustrate?

9) The following table gives the index numbers of wholesale prices (average) of cereals, pulses and oilseeds over a period of 7 years. Compare these prices through a suitable graph.

Years Cereals Pulses Oilseeds


1997 433 398 529
1998 486 420 638
1999 520 415 829
2000 690 524 750
2001 430 415 858
2002 482 358 884
2003 624 494 866

10) Draw a histogram and frequency polygon of the following distribution. Locate the approximate mode with the help of the histogram.

Weekly wages (In Rs.) : 100–120  120–140  140–160  160–180  180–200  200–220  220–240
No. of Workers        :    26       52       87       93       34       26       12


11) The following data relating to sales of 80 companies are given below:
Sales (Rs.Lakhs) No. of Companies
5-15 8
15-25 13
25-35 19
35-45 14
45-55 10
55-65 7
65-75 6
75-85 3

Draw cumulative frequency curves. Determine the number of companies whose sales are:
(i) more than Rs. 50 lakhs, (ii) less than Rs. 30 lakhs, (iii) between Rs. 30 lakhs and Rs. 50 lakhs.

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

7.15 FURTHER READING


The following text books may be used for more in-depth study on the topics dealt with in this unit.

Moskowitz, H. and G.P. Wright, 1998. Statistics for Management and Economics, Charles E. Merrill Publishing Company: Ohio, U.S.A.

Gupta, S.P. and M.P. Gupta, 2000. Business Statistics, Sultan Chand & Sons: New Delhi.

Sinha, S.C. and Dhiman, A.K. 2002. Research Methodology, Vol. 1. Ess Ess Publication: New Delhi.

Argyrous, George. 2000. Statistics for Social and Health Research with a Guide to SPSS. Sage Publications: New Delhi.


UNIT 8 STATISTICAL DERIVATIVES AND MEASURES OF CENTRAL TENDENCY
TENDENCY
STRUCTURE
8.0 Objectives
8.1 Introduction
8.2 Statistical Derivatives
8.2.1 Percentage
8.2.2 Ratio
8.2.3 Rate
8.3 Measures of Central Tendency
8.3.1 Properties of an Ideal Measure of Central Tendency
8.3.2 Mean and Weighted Mean
8.3.3 Median
8.3.4 Mode
8.3.5 Choice of a Suitable Average
8.3.6 Some Other Measures of Central Tendency
8.4 Let Us Sum Up
8.5 Key Words
8.6 Answers to Self Assessment Exercises
8.7 Terminal Questions/Exercises
8.8 Further Reading

8.0 OBJECTIVES
After studying this unit, you should be able to:
l explain the meaning and use of percentages, ratios and rates for data
analysis,
l discuss the computational aspects involved in working out the statistical
derivatives,
l describe the concept and significance of various measures of central
tendency, and
l compute various measures of central tendency, such as arithmetic mean, weighted mean, median, mode, geometric mean, and harmonic mean.

8.1 INTRODUCTION
In Unit 6 we discussed the method of classifying and tabulating of data.
Diagrammatic and graphic presentations are covered in the previous unit
(Unit-7). They give some idea about the existing pattern of data. So far no big
numerical computation was involved. Quantitative data has to be condensed in a
meaningful manner, so that it can be easily understood and interpreted. One of
the common methods for condensing the quantitative data is to compute
statistical derivatives, such as Percentages, Ratios, Rates, etc. These are simple
derivatives. Further, it is necessary to summarise and analyse the data. The first
step in that direction is the computation of Central Tendency or Average, which
gives a bird's-eye view of the entire data. In this Unit, we will discuss computation
of statistical derivatives based on simple calculations. Further, numerical methods
for summarizing and describing data – measures of Central Tendency – are
discussed. The purpose is to identify one value, which can be obtained from the
data, to represent the entire data set.
8.2 STATISTICAL DERIVATIVES

Statistical derivatives are the quantities obtained by simple computation from the
given data. Though very easy to compute, they often give meaningful insight to
the data. Here we discuss three often-used measures: percentage, ratio and
rate. These measures point out an existing relationship among factors and
thereby help in better interpretation.

8.2.1 Percentage
As we have noted earlier, the frequency distribution may be regarded as simple
counting and checking as to how many cases are in each group or class. The
relative frequency distribution gives the proportion of cases in individual classes.
On multiplication by 100, the percentage frequencies are obtained. Converting to
percentages has some advantages - it is now more easily understood and
comparison becomes simpler because it standardizes data. Percentages are quite
useful in other tables also, and are particularly important in case of bivariate
tables. We show one application of percentages below. Let us try to understand
the following illustration.

Illustration 1
The following table gives the total number of workers and their categories for
all India and major states. Compute meaningful percentages.

Table: Total Workers and Their Categories-India and Major States : 2001
(In thousands)

Sl.    State/India        Cultivators  Agricultural  Household  Other    Total
No.                                    Labourers     Industry   Workers  Workers
                                                     Workers
(1)    (2)                (3)          (4)           (5)        (6)      (7)
1.     Jammu & Kashmir      1600          249           230       1611     3689
2.     Himachal Pradesh     1961           93            50        887     2991
3.     Punjab               2099         1499           307       5236     9142
4.     Haryana              3046         1276           207       3854     8383
5.     Rajasthan           13167         2529           651       7434    23781
6.     Uttar Pradesh       22173        13605          2886      15517    54180
7.     Bihar                8192        13528          1087       5273    28080
8.     Assam                3742         1290           329       4197     9557
9.     West Bengal          5613         7351          2153      14386    29503
10.    Orissa               4238         5001           689       4344    14273
11.    Madhya Pradesh      11059         7381          1010       6307    25756
12.    Gujarat              5613         4988           382       9386    20369
13.    Maharashtra         12010        11291          1046      17706    42053
14.    Andhra Pradesh       7904        13819          1570      11573    34865
15.    Karnataka            6936         6209           936       9441    23522
16.    Kerala                740         1654           365       7532    10291
17.    Tamil Nadu           5114         8665          1459      12574    27812
       INDIA              127628       107448         16396     151040   402512

Solution: In the table above, the row total gives the total workers of a state/all India and the column total gives the aggregate values of different categories of
workers and all workers. Thus, it is possible to compute meaningful percentages
from both rows and columns. The row percentages are computed by dividing the figures in columns (3), (4), (5) and (6) by the figure in column (7) and multiplying by 100. The figures are presented in tabular form below. The percentage of cultivators in Jammu & Kashmir is obtained as (1600 ÷ 3689) × 100, which equals 43.37. Similarly, other figures are obtained.

Table: Percentage of Total Workers and Their Categories-India and Major States : 2001

Sl.    State/India        Cultivators  Agricultural  Household  Other    Total
No.                                    Labourers     Industry   Workers  Workers
                                                     Workers
(1)    (2)                (3)          (4)           (5)        (6)      (7)
1.     Jammu & Kashmir     43.37         6.74          6.22      43.67   100.00
2.     Himachal Pradesh    65.55         3.10          1.68      29.67   100.00
3.     Punjab              22.96        16.40          3.36      57.28   100.00
4.     Haryana             36.34        15.22          2.47      45.97   100.00
5.     Rajasthan           55.36        10.64          2.74      31.26   100.00
6.     Uttar Pradesh       40.92        25.11          5.33      28.64   100.00
7.     Bihar               29.17        48.18          3.87      18.78   100.00
8.     Assam               39.15        13.50          3.44      43.91   100.00
9.     West Bengal         19.02        24.92          7.30      48.76   100.00
10.    Orissa              29.70        35.04          4.82      30.44   100.00
11.    Madhya Pradesh      42.93        28.66          3.92      24.49   100.00
12.    Gujarat             27.56        24.49          1.87      46.08   100.00
13.    Maharashtra         28.56        26.85          2.49      42.10   100.00
14.    Andhra Pradesh      22.67        39.63          4.51      33.19   100.00
15.    Karnataka           29.48        26.40          3.98      40.14   100.00
16.    Kerala               7.19        16.07          3.55      73.19   100.00
17.    Tamil Nadu          18.39        31.16          5.24      45.21   100.00
       INDIA               31.72        26.69          4.07      37.52   100.00

The figures above help in comparing the proportion of workers in different categories across the states and all India. One may read from the table that Kerala has the lowest percentage of cultivators and Bihar the highest percentage of agricultural labourers.
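The row-percentage computation described in the solution can be sketched in a few lines of Python; the figures and row totals for three states are copied from the table above (only three states are shown for brevity):

```python
# Row percentages: divide each category figure by the row total and multiply by 100.
# Figures (in thousands) and row totals are taken from the 2001 table above.
workers = {
    # state: (cultivators, agricultural labourers, household industry, others, total)
    "Jammu & Kashmir": (1600, 249, 230, 1611, 3689),
    "Kerala": (740, 1654, 365, 7532, 10291),
    "Bihar": (8192, 13528, 1087, 5273, 28080),
}

row_percentages = {
    state: [round(100 * c / row[-1], 2) for c in row[:-1]]
    for state, row in workers.items()
}

for state, pcts in row_percentages.items():
    print(state, pcts)
```

Because the published table rounds each cell independently, a few of its entries differ from these recomputed values in the last decimal place.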
Self Assessment Exercise A
1) What is a Percentage?
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
2) From the data given in Illustration 1, compute column percentages and interpret them. Why do the totals of these percentages not add up to 100?
The table below may be used for computation.
Table: State-wise Percentage Share of Total Workers and Categories
of Workers in All India: 2001
Sl.    State/India    Cultivators    Agricultural    Household    Other      Total
No.                                  Labourers       Industry     Workers    Workers
                                                     Workers
(1)    (2)            (3)            (4)             (5)          (6)        (7)

1. Jammu &
Kashmir
2. Himachal
Pradesh
3. Punjab
4. Haryana
5. Rajasthan
6. Uttar Pradesh
7. Bihar
8. Assam
9. West Bengal
10. Orissa
11. Madhya
Pradesh
12. Gujarat
13. Maha-
rashtra
14. Andhra
Pradesh
15. Karnataka
16. Kerala
17. Tamil Nadu
INDIA 100.00 100.00 100.00 100.00 100.00

8.2.2 Ratio

Another descriptive measure that is commonly used with frequency distributions (it may be used elsewhere also) is the ratio. It expresses the relative value of frequencies in the same way as proportions or percentages, but it does so by comparing any one group to either the total number of cases or any other group. For instance, in table 6.3, Unit 6, the ratio of all labourers to those with daily wages

between Rs 30–35 is 70:14 or 5:1. Wherever possible, it is convenient to reduce ratios to the form n1 : n2, the most preferred value of n2 being 1. Thus, representation in the form of a ratio also reduces the size of the numbers, which facilitates easy comparison and quick grasp. As the number of categories increases, the ratio is a better derivative for presentation as it will be easier and less confusing.

There are several types of ratios used in statistical work. Let us discuss them.

The Distribution Ratio: It is defined as the ratio of a part to a total which includes that part also. For example, in a University there are 600 girls out of 2,000 students. Then the distribution ratio of girls to the total number of students is 3:10. We can say 30% of the total students in that University are girls.

Interpart ratio: It is a ratio of a part in a total to another part in the same total. For example, the sex ratio is usually expressed as the number of females per 1,000 males (not against the total population).

Time ratio: This ratio is a measure which expresses the changes in a series
of values arranged in a time sequence and is typically shown as percentage.
Mainly, there are two types of time ratios :

i) Those employing a fixed base period: Under this method, for instance, if you are interested in studying the sales of a product in the current year, you would select a particular past year, say 1990, as the base year and compare the current year's sales with the sales of 1990.

ii) Those employing a moving base: For example, for computation of the current year's sales, the previous year's sales would be taken as the base (for 1991, 1990 is the base; for 1992, 1991 is the base; and so on).

Ratios are often used in financial analysis to indicate the financial status of an organization. Look at the following illustration:

Illustration 2
The following table gives the balance sheet of XYZ Company for the year
2002–03. Compute useful financial ratios.

Table: Balance Sheet of XYZ Company as on March 31, 2003


I Sources of Funds Amount (Rs. 000’)
1 Shareholders' funds 520
1(a) Share capital 130
1(b) Reserve and surplus 390
2 Loan funds 280
2 (a) Secured loans 170
2 (a) (i) Due after one year 120
2 (a) (ii) Due within one year 50
2 (b) Unsecured loans 110
2 (b) (i) Due after one year 50
2 (b) (ii) Due within one year 60

Total (520 + 280) 800


II Application of Funds                       Amount (in Rs. 000’)

1 Net fixed asset 535


2 Investments 85
2 (a) Long term investments 75
2 (b) Current investments 10
3 Current assets, loans and advances 330
3 (a) Inventories 160
3 (b) Sundry debtors 80
3 (c) Cash and bank balances 40
3 (d) Loans and advances 50
Less: Current liabilities and provisions 150
Net current assets 180
Total (535 + 85 + 180) 800

Solution: Three common ratios may be computed from the above balance
sheet: current ratio, cash ratio, and debt-equity ratio. However, these ratios are
discussed in detail in MCO-05 : Accounting for Managerial Decisions, under
Unit-5 : Techniques of Financial Analysis.
Current ratio = (Current assets, loans and advances + Current investments) / (Current liabilities and provisions + Short-term debt)
              = (330 + 10) / (150 + 50 + 60) = 1.31

Cash ratio = (Cash and bank balances + Current investments) / (Current liabilities and provisions + Short-term debt)
           = (40 + 10) / (150 + 50 + 60) = 0.19

Debt-equity ratio = Debt / Equity = Loan funds / Shareholders’ funds = 280 / 520 = 0.54
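As a quick check, the three ratios can be recomputed from the balance-sheet figures (in Rs. '000) with a short Python sketch:

```python
# Balance-sheet figures (Rs. '000) from Illustration 2.
current_assets_loans_advances = 330
current_investments = 10
cash_and_bank = 40
current_liabilities_provisions = 150
short_term_debt = 50 + 60        # secured + unsecured loans due within one year
loan_funds = 280                 # debt
shareholders_funds = 520         # equity

current_liab = current_liabilities_provisions + short_term_debt

current_ratio = (current_assets_loans_advances + current_investments) / current_liab
cash_ratio = (cash_and_bank + current_investments) / current_liab
debt_equity_ratio = loan_funds / shareholders_funds

print(round(current_ratio, 2), round(cash_ratio, 2), round(debt_equity_ratio, 2))
```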

8.2.3 Rate

The concept of ratio may be extended to the rate. The rate is also a
comparison of two figures, but not of the same variable, and it is usually
expressed in percentage. It is a measure of the number of times a value occurs
in relation to the number of times the value could occur, i.e. number of actual
occurrences divided by number of possible occurrences. Unemployment rate in
a country is given by the total number of unemployed persons divided by the total
number of employable persons. It is clear now that a rate is different from a
ratio. For example, we may say that in a town the ratio of the number of
unemployed persons to that of all persons is 0.05: 1. The same message would
be conveyed if we say that unemployment rate in the town is 0.05, or more
commonly, 5 per cent. Sometimes rate is defined as number of units of a
variable corresponding to a single unit of another variable; the two variables


could be in different units. For example, seed rate refers to the amount of seed
required per unit area of land. The following table gives some examples of
rates.

S.No.   Description                            Computation   Rate
(1)     (2)                                    (3)           (4)
1       100 kms with 8 litres of petrol        100/8         12.5 km per litre
2       Rs. 18 for 12 bananas                  18/12         Rs. 1.50 per banana
3       Rs. 6,000 for 5 days of consultancy    6000/5        Rs. 1,200 per day of consultancy

Self Assessment Exercise B


1) Name the different types of ratios used in statistical work.
..................................................................................................................
..................................................................................................................
..................................................................................................................

2) What is a rate?
..................................................................................................................
..................................................................................................................
..................................................................................................................

8.3 MEASURES OF CENTRAL TENDENCY


In Unit 6, we studied in detail how to classify raw data into a small number of classes or groups and present them in the form of tables. The next step would be to identify a single value that may be considered the most representative value of the given data. This is the measure of central tendency, which represents an average character.

A measure of central tendency helps us to represent a set of huge data by a


single value. To understand the economic condition of people of a particular
country, we talk of average or per capita income. It also enables us to compare
the situation in two different places or situations. For example, one may
compare per capita power availability in two states to understand which one is
better in terms of industrial climate.

To start with, we list the properties that an ideal measure of central tendency should possess. Some of the measures are discussed in detail later.

8.3.1 Properties of an Ideal Measure of Central Tendency


An ideal measure of central tendency should have the following properties:

l simple to compute and easy to interpret.


l based on all observations.
l should not be influenced much by a few observations.
l should be capable of further algebraic treatment.
l should be capable of being defined unambiguously.
Some of the important measures of central tendency which are most commonly used in business and industry are: Arithmetic Mean, Weighted Arithmetic Mean, Median, Mode, Geometric Mean and Harmonic Mean. Among them, Median and Mode are the positional averages and the rest are termed as mathematical averages.

8.3.2 Mean and Weighted Mean

Most of the time, when we refer to the average of something, we are talking about the arithmetic mean. This is the most important measure of central tendency and is commonly called the mean.

Mean of ungrouped data: The mean or the arithmetic mean of a set of data is given by:

X̄ = (X₁ + X₂ + … + Xₙ) / N

This formula can be simplified as follows:

Arithmetic mean (X̄) = Σx / N

where Σx is the sum of the values of all observations, N is the number of observations, and the Greek letter sigma (Σ) indicates “the sum of”.

Illustration 3

Suppose that the wages (in Rs.) earned by a labourer for 7 days are 22, 25, 29, 58, 30, 24 and 23. The mean wage of the labourer is given by:

(22 + 25 + 29 + 58 + 30 + 24 + 23)/7 = Rs. 30.14
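This calculation is trivial to reproduce in Python:

```python
# Arithmetic mean of ungrouped data: sum of observations / number of observations.
wages = [22, 25, 29, 58, 30, 24, 23]     # daily wages (Rs.) from Illustration 3
mean_wage = sum(wages) / len(wages)
print(round(mean_wage, 2))               # 30.14
```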

Mean of grouped data: We have seen how to obtain the mean from ungrouped data. In Unit-6, we learnt the preparation of frequency distributions (grouped data). Let us consider what modifications are required for the calculation of the mean from grouped data.

When we have grouped data, either discrete or continuous, the expression for the mean would be:

X̄ = Σfx / N

where Σfx is the sum of the products of each value and its frequency, and N is the sum of the frequencies (Σf).
Let us consider an illustration to understand the application of the formula.

Illustration 4

The following is the discrete frequency distribution of the wage data of a labourer for 35 days:

Wage (in Rupees)          23  24  25  27  28  29  30  31  32  33  34
Frequency (No. of Days)    1   1   3   3   4   6   4   5   5   2   1

Now, to compute the mean wage, multiply each variable with its corresponding
frequency (f × x) and obtain the total (Σfx).


Divide this total by the number of observations (Σf or N). Practically, we compute the mean as follows:

Mean = (23×1 + 24×1 + 25×3 + 27×3 + 28×4 + 29×6 + 30×4 + 31×5 + 32×5 + 33×2 + 34×1) / (1 + 1 + 3 + 3 + 4 + 6 + 4 + 5 + 5 + 2 + 1)

     = Σfx / N = 1024 / 35 = Rs. 29.26
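The same Σfx/Σf computation for the discrete distribution can be sketched as:

```python
# Mean of a discrete frequency distribution: sum(f*x) / sum(f).
wages = [23, 24, 25, 27, 28, 29, 30, 31, 32, 33, 34]
days = [1, 1, 3, 3, 4, 6, 4, 5, 5, 2, 1]    # frequencies

sum_fx = sum(x * f for x, f in zip(wages, days))
mean_wage = sum_fx / sum(days)
print(sum_fx, round(mean_wage, 2))          # 1024 29.26
```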

When a frequency distribution consists of data that are grouped by classes, it is known as a continuous frequency distribution. In such a distribution each value of an observation falls somewhere in one of the classes. Unlike raw (ungrouped) or discrete data, we do not know the separate values of every observation. It is, therefore, to be noted that we can easily compute an estimate of the mean of a continuous distribution but not its actual value. In other words, what we gain in ease of calculation, we lose in accuracy.

To find the mean of continuous frequency distribution, we first calculate the


midpoints of each class. Then we multiply each mid-point by the frequency of
observations in that class, obtain sum of these products, and divide the sum by
the total number of observations. The formula looks like this:
X̄ = Σfx / N

where, Σfx = the sum obtained by multiplying the mid-points by their respective frequencies
N = number of observations (Σf)

Let us consider the frequency distribution obtained in Unit-6 (table 6.3), as an


illustration for study.

Illustration-5

The following table gives the daily wages for 70 labourers on a particular day.

Daily Wages (Rs)  : 15-20  20-25  25-30  30-35  35-40  40-45  45-50
No. of labourers  :   2     23     19     14      5      4      3

Solution: For obtaining the estimated value of mean we have to follow the
procedure as explained above. This is elaborated below.

Daily wages (Rs) Mid-point (x) No. of workers (f) f.x


15–20 17.5 2 35.0
20–25 22.5 23 517.5
25–30 27.5 19 522.5
30–35 32.5 14 455.0
35–40 37.5 5 187.5
40–45 42.5 4 170.0
45–50 47.5 3 142.5
– N or Σf = 70 Σfx = 2030.0

X̄ = Σfx / N = 2030 / 70 = Rs. 29

Hence, the mean daily wage is Rs. 29.
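A short Python sketch of the midpoint method for Illustration 5:

```python
# Estimated mean of a continuous frequency distribution, using class midpoints.
classes = [(15, 20, 2), (20, 25, 23), (25, 30, 19), (30, 35, 14),
           (35, 40, 5), (40, 45, 4), (45, 50, 3)]    # (lower, upper, frequency)

sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
n = sum(f for _, _, f in classes)
mean_wage = sum_fx / n
print(sum_fx, n, mean_wage)                          # 2030.0 70 29.0
```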

To simplify calculations, the following formula for the mean may be more convenient to use. It is to be noted that it can be applied when the widths of the classes are equal.

X̄ = A + (Σfd / N) × i

where, ‘A’ is an assumed mean, d = (x − A)/i, and ‘i’ is the size of the equal class interval.
This formula makes the computations very simple and takes less time. This method eliminates the problem of large and inconvenient mid-points. To apply this formula, let us consider the data of the previous Illustration 5. Try to understand the procedure for obtaining the value of the mean, shown below.

Assume A as 32.5

Class Interval   Mid-point (X)   (X−32.5)/5 = d   Frequency (f)   fd
15-20 17.5 –3 2 –6
20-25 22.5 –2 23 – 46
25-30 27.5 –1 19 – 19
30-35 32.5 0 14 0
35-40 37.5 1 5 5
40-45 42.5 2 4 8
45-50 47.5 3 3 9
– – N = 70 Σfd = – 49

X̄ = A + (Σfd / N) × i = 32.5 + (−49 / 70) × 5 = 29

Hence the mean daily wage is Rs. 29, as obtained earlier.
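The step-deviation shortcut can be verified in Python; note that it reproduces the direct midpoint result when all class widths are equal:

```python
# Step-deviation (assumed-mean) shortcut: mean = A + (sum(f*d)/N) * i,
# where d = (x - A) / i and i is the common class width.
midpoints = [17.5, 22.5, 27.5, 32.5, 37.5, 42.5, 47.5]
freqs = [2, 23, 19, 14, 5, 4, 3]
A, i = 32.5, 5                      # assumed mean and class width

sum_fd = sum(f * (x - A) / i for x, f in zip(midpoints, freqs))
mean_wage = A + (sum_fd / sum(freqs)) * i
print(sum_fd, round(mean_wage, 2))  # -49.0 29.0
```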

The important property of the arithmetic mean is that the means of several sets of data may be combined into a single mean for the combined sets of data. The combined mean may be defined as:

X̄₁₂…ₙ = (N₁X̄₁ + N₂X̄₂ + … + NₙX̄ₙ) / (N₁ + N₂ + … + Nₙ)

If we have to combine the means of four sets of data, then the above formula can be generalized as:

X̄₁₂₃₄ = (N₁X̄₁ + N₂X̄₂ + N₃X̄₃ + N₄X̄₄) / (N₁ + N₂ + N₃ + N₄)
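A minimal sketch of the combined-mean formula; the group sizes and means below are hypothetical figures chosen for illustration, not data from the text:

```python
# Combined mean of several groups: weight each group mean by its size,
# then divide by the total number of observations.
sizes = [40, 60, 100]          # hypothetical group sizes N1, N2, N3
means = [50.0, 60.0, 70.0]     # hypothetical group means

combined = sum(n * m for n, m in zip(sizes, means)) / sum(sizes)
print(combined)                # 63.0
```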


Advantages and disadvantages of mean
The concept of mean is familiar to most people and easily understood. It is due
to the fact that it possesses almost all the properties of a good measure of
central tendency. However, the mean has disadvantages of which we must be
aware. First, the value of mean may be distorted by the presence of extreme
values in a given data set and in case of U-shaped distribution this measure is
not likely to serve a useful purpose. Second problem with the mean is that we
are unable to compute mean for open-ended classes, since it is difficult to
assign a mid-point to the open-ended classes. Third, it cannot be used for
qualitative variables.

Weighted Mean
The arithmetic mean, as discussed above, gives equal importance (weight) to all
the observations. But in some cases, all observations do not have the same
weightage. In such a case, we must compute weighted mean. The term
‘weight’, in statistical sense, stands for the relative importance of the different
variables. It can be defined as:

X̄w = ΣWx / ΣW

where, X̄w is the weighted mean and ‘W’ are the weights assigned to the variables (x).
The weighted mean is extensively used in index numbers; it will be discussed in detail in Unit 12 : Index Numbers, of this course. For example, to compute the cost of living index, we need the price index of different items and their
weightages (percentage of consumption). The important issue that arises is the
selection of weightages. If actual weightages are not available then estimated or
arbitrary weightages may be used. This is better than no weightages at all.
However, keeping the phenomena in mind, the weightages are to be assigned
logically. To understand this concept, let us take an illustration.

Illustration 6
Given below are Price index numbers and weightages for different group of
items of consumption for an average industrial worker. Compute the cost of
living index.

Group Item Group Price Index Weight


Food 150 55
Clothing 186 15
House rent 125 17
Fuel and Light 137 8
Others 184 5

Solution: The cost of living index is obtained by taking the weighted average
as explained in the table below:
Group Item Group Price Weight Wi. Pi
Index( Pi) (Wi )
Food 150 55 8250
Clothing 186 15 2790
House rent 125 17 2125
Fuel and Light 137 8 1096
Others 184 5 920
– ΣW = 100 ΣWx = 15181
X̄w = ΣWx / ΣW = 15181 / 100 = 151.81

Therefore, the cost of living index is 151.81.
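The weighted-mean computation of Illustration 6 can be sketched as:

```python
# Weighted mean: sum(W*x) / sum(W), with the group price indices of Illustration 6.
indices = [150, 186, 125, 137, 184]   # Food, Clothing, House rent, Fuel & Light, Others
weights = [55, 15, 17, 8, 5]

sum_wx = sum(w * x for w, x in zip(weights, indices))
cost_of_living_index = sum_wx / sum(weights)
print(sum_wx, cost_of_living_index)   # 15181 151.81
```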
Self Assessment Exercise C
1) A student's marks in a Computer course are 69, 75 and 80 respectively in
the papers on Theory, Practical and Project Work.
What are the mean marks if the weights are 1, 2 and 4 respectively?
What would be the mean marks if all the papers have equal importance?
Use the following table

Table: Computation of Weighted Mean Marks


Paper Marks Percentage (X) Weight (W) W. X
Theory
Practical
Project Work

So, weighted mean is =

2) The following table gives frequency distribution of monthly sales (in


Rupees thousands) of 125 firms.

Table: Monthly Sales of 125 Firms

Monthly Sales Number of


(in thousands) Firms
0–150 15
150–300 22
300–450 64
450–600 11
600–750 9
750–900 4
All 125
Compute mean monthly sales of the firms and interpret the data.

Since the class width is 150 for all the classes, the method of assumed
mean is useful. The following table may be helpful.

Table: Computation of Average Monthly Sales of 125 Firms


Monthly Sales Midpoint (X–A)/150 Number of f.d
(in thousands) (X) (d) Firms (f)


So, the average monthly sales =
3) The mean wage of 200 male workers in a factory was Rs. 150 per day, and the mean wages of 100 female workers and 50 children were Rs. 90 and Rs. 35 respectively, in the same factory. What would be the combined mean wage of the workers? Comment on the result.
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................

8.3.3 Median

The median is another measure of central tendency. The median is the middle value of the data; it measures the central item in the data. Half of the items lie above the median, and the other half lie below it.

Median of Ungrouped Data: To find the median from ungrouped data, first array the data either in ascending order or in descending order. If the number of observations (N) is odd, then the median is the middle value. If it is even, then the median is the mean of the middle two values.
In formal language, the median is the ((N + 1)/2)th item in a data array, where N is the number of items. Let us consider the earlier Illustration 3 to locate the median value in two different sets of data.
Illustration-7

On arranging the daily wage data of the labourers (as given in illustration 3) in
ascending order, we get

Rs. 22, 23, 24, 25, 29, 30, 58.

The number of observations is an odd number (seven). According to the formula, the ((N + 1)/2)th item, the middle (i.e., the fourth) number is the median. Here the median wage of the labourer is Rs. 25.
You may notice that, unlike the mean we calculated earlier, the median we calculated above was not distorted by the presence of the last value, i.e., Rs. 58. This value could have been even Rs. 99; the median would have been the same.

Had there been one more observation, say, Rs. 6, the order would have been
as below:
Rs. 6, 22, 23, 24, 25, 29, 30, 58
There are eight observations, and the median is given by the mean of the fourth and the fifth observations (i.e., the ((8 + 1)/2)th item = the 4.5th item). So, the median wage = (24 + 25)/2 = Rs. 24.5.
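The ungrouped-median rule (middle value for odd N, mean of the two middle values for even N) can be sketched as a small Python function:

```python
# Median of ungrouped data: sort, then take the middle value,
# or the mean of the two middle values when N is even.
def median(values):
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

print(median([22, 25, 29, 58, 30, 24, 23]))      # 25   (odd N)
print(median([6, 22, 23, 24, 25, 29, 30, 58]))   # 24.5 (even N)
```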
Median of Grouped Data: Now, let us calculate the median from grouped
data. When the data is in the form of discrete series, the median can be
computed by examining the cumulative frequency distribution, as is shown
below.
Illustration-8

To compute the median wage from the data given in Illustration 4, we add one more row of cumulative frequencies (the formation of cumulative frequencies was discussed in Unit 6 of this course : Processing of Data).

Wage (In Rupees) 23 24 25 27 28 29 30 31 32 33 34


Frequency (No. of Days) 1 1 3 3 4 6 4 5 5 2 1
Cumulative Frequency 1 2 5 8 12 18 22 27 32 34 35

According to the formula, the ((N + 1)/2)th item, the number of observations is 35. Therefore, the ((35 + 1)/2)th item is the 18th item. Hence the 18th observation will be the median. By inspection of the cumulative frequencies, it is clear that the median wage is Rs. 29.
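The cumulative-frequency lookup used above can be sketched in Python:

```python
# Median of a discrete frequency distribution: locate the (N + 1)/2-th
# observation by running down the cumulative frequencies.
wages = [23, 24, 25, 27, 28, 29, 30, 31, 32, 33, 34]
days = [1, 1, 3, 3, 4, 6, 4, 5, 5, 2, 1]

n = sum(days)
position = (n + 1) / 2        # the 18th observation for n = 35
cum = 0
for x, f in zip(wages, days):
    cum += f
    if cum >= position:
        median = x
        break

print(median)                 # 29
```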

This procedure is to be slightly modified for computation of median from class


interval data. The median is taken as the value of the variable corresponding
to the (N/ 2)th observation. The class or group containing median should be
identified first and the median is computed under the assumption that all the
observations in that class are equally spaced. Symbolically, the expression for
median is given by:
Median = L + ((N/2 − cf) / f) × i
where, ‘L’ is the lower limit of the median class, ‘N’ is the number of
observations, ‘f’ is frequency of the median class, ‘cf’ is the cumulative
frequency of the class next lower to the median class and ‘i’ is the width of
the median class. Let us consider the data given in earlier illustration 5 to study
the median.
Illustration 9: Compute median wage for the data given in Illustration 5.

Solution: The approach is quite similar to the previous example. As indicated, it will be implicitly assumed that the wages of the labourers in the group containing the median are equally spaced.

Class Interval Frequency (f) Cumulative


(wages in Rs.) frequency (cf)
15-20 2 2
20-25 23 25
25-30 19 44
30-35 14 58
35-40 5 63
40-45 4 67
45-50 3 70
N = 70 –
Here, the number of observations is 70. So the median corresponds to the 35th


value of the variable, i.e. the (70/2)th item. This item lies within the cumulative frequency 44 (which covers the 35th observation). Hence, it is clear from Column (3) that the median is in the third class interval, i.e. Rs. 25 to Rs. 30. So we have to locate the position of the 35th observation in the class 25-30.
Here, N/2 = 35, L = 25, cf = 25, f = 19 and i = 30 − 25 = 5.

Thus the median is: L + ((N/2 − cf)/f) × i = 25 + ((35 − 25)/19) × 5 = Rs. 27.63.
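The grouped-median formula can be sketched in Python. This is an illustrative implementation using the class limits and frequencies of Illustration 9; it assumes a uniform class width.

```python
# Grouped (class-interval) median: Median = L + ((N/2 - cf) / f) * i
# Data from Illustration 9 (classes 15-20, 20-25, ..., 45-50).
lower_limits = [15, 20, 25, 30, 35, 40, 45]
freqs        = [2, 23, 19, 14, 5, 4, 3]
i = 5                                    # uniform class width

N = sum(freqs)                           # 70
half = N / 2                             # 35

# Locate the median class: the first class whose cumulative
# frequency reaches N/2, then interpolate within it.
cum = 0                                  # cf of the class below the median class
for L, f in zip(lower_limits, freqs):
    if cum + f >= half:
        median = L + (half - cum) / f * i
        break
    cum += f

print(round(median, 2))                  # 27.63, as computed by hand
```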
It is to be noted that the median value may also be located graphically, by drawing ogives (cumulative frequency curves). This method was discussed in detail in Unit 7: Diagrammatic and Graphic Presentation, of this block.

Advantages and disadvantages of median


The biggest advantage of median is that extreme observations do not affect it.
For computation of the median, it is not necessary to know all the observations
and this property comes in handy when there are open-ended classifications of
data. This is also suitable for qualitative variables, which can be ordered or
arranged in ascending or descending order (ordinal variables).

However, it requires data to be arranged before computation. It is not amenable


to arithmetic and algebraic manipulations. For example, if M1 and M2 are
medians of two different sets of data, we cannot get the median of the
combined data set from M1 and M2.

Some Additional Points: Median divides the distribution in two equal


parts. When a distribution is divided into four equal parts they are called
quartiles. Similarly, there are deciles (divided into ten equal parts),
percentiles (divided into hundred equal parts) etc. The general term for all of
them is fractile. In Unit 9 of this block, we will learn more about quartiles.

Self Assessment Exercise D

1) Refer to the data in Self Assessment Exercise C, No. 1. Obtain the median monthly sales of the firms.
The following table may be helpful.

Table: Computation of Median Sales of 125 Firms


Monthly Sales Number of Cumulative
(in thousands) Firms (f) frequency

..................................................................................................................
..................................................................................................................
8.3.4 Mode

Mode is also a measure of central tendency. Like the median, and unlike the arithmetic mean, it is not calculated by the normal process of arithmetic. The mode of the data is the value that appears the maximum number of times. In ungrouped data, for example, the foot sizes (in inches) of ten persons are: 5, 8, 6, 9, 11, 10, 9, 8, 10, 9. Here the number 9 appears thrice, so the modal foot size is 9 inches. In grouped data, the method of calculating the mode differs between discrete and continuous distributions.

In discrete data, for example the earlier Illustration 6, the modal wage is Rs. 29, as it is the wage for the maximum number of days, i.e. six days. For continuous data, we usually refer to the modal class or group as the class with the maximum frequency (the observation approach). The mode of a continuous distribution may then be computed using the expression:

Mode = L + (∆1 / (∆1 + ∆2)) × i

where L = lower limit of the modal class, i = width of the modal class, ∆1 = excess of the frequency of the modal class (f1) over the frequency of the preceding class (f0), and ∆2 = excess of the frequency of the modal class (f1) over the frequency of the succeeding class (f2). The letter ∆ is read as delta. That is, ∆1 = f1 − f0 and ∆2 = f1 − f2.

It is to be noted that while using the formula for mode, you must arrange the
class intervals uniformly throughout, otherwise you will get misleading results.
To illustrate the computation of mode, let us consider the grouped data of
earlier illustration 7.

Illustration 10

Compute mode from the following data.

Daily wages (Rs.)    No. of workers (f)
15-20    12
20-25    23
25-30    19
30-35    14
35-40    5
40-45    4
45-50    3

Solution: The maximum frequency, 23, is in the class 20-25. Therefore, by the observation method, 20-25 is the modal class. Applying the formula Mode = L + (∆1/(∆1 + ∆2)) × i, with ∆1 = f1 − f0 and ∆2 = f1 − f2:

The related values are as follows:

L = 20; f0 = 12; f1 = 23; f2 = 19; and i = 5. Therefore ∆1 = 11 and ∆2 = 4.


11 11
∴ Mode = 20 + × 5 = 20 + ×5
11 + 4 15
= 20 + 3.67 = Rs. 23.67
Hence the modal daily wage is Rs. 23.67
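A minimal Python sketch of the grouped-mode formula, using the Illustration 10 data. The boundary fallbacks (f0 = 0 or f2 = 0 when the modal class is the first or last class) are an assumption added for completeness, not something the text specifies.

```python
# Grouped mode: Mode = L + d1 / (d1 + d2) * i, where
# d1 = f1 - f0 (modal minus preceding frequency) and
# d2 = f1 - f2 (modal minus succeeding frequency).
lower_limits = [15, 20, 25, 30, 35, 40, 45]
freqs        = [12, 23, 19, 14, 5, 4, 3]
i = 5                                    # uniform class width

k = freqs.index(max(freqs))              # index of the modal class (20-25)
f1 = freqs[k]
f0 = freqs[k - 1] if k > 0 else 0        # preceding class frequency (assumed 0 at the edge)
f2 = freqs[k + 1] if k + 1 < len(freqs) else 0  # succeeding class frequency

d1, d2 = f1 - f0, f1 - f2
mode = lower_limits[k] + d1 / (d1 + d2) * i
print(round(mode, 2))                    # 23.67, matching the hand computation
```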

In a continuous frequency distribution, the value of mode can also be located


graphically. We have already discussed the procedure for locating the mode
graphically in Unit-7 of this block.

Advantages and Disadvantages of Mode


Extreme values of observations do not affect the mode and its value can be
determined in open-ended classes. This measure is also suitable for any
qualitative variables (both nominal and ordinal variables).

It may not be unique all the time. There may be more than one mode or no
mode (no value that occurs more than once) at all. In such a case it is difficult
to interpret and compare the distributions. It is not amenable to arithmetic and
algebraic manipulations. For example, we cannot get the mode of the combined
data set from the modes of the constituent data sets.

Self Assessment Exercise E

Refer to the data in Self Assessment Exercise C, No. 1. Obtain the mode of the monthly sales of the firms.

L = , f0 = , f1 = , f2 = and i= .

Hence, the mode is given by :


.......................................................................................................................
.......................................................................................................................
.......................................................................................................................

Comparing the Mean, Median, and Mode

For a moderately skewed distribution, it has been empirically observed that the
difference between Mean and Mode is approximately three times the difference
between Mean and Median. This was illustrated in the Fig. 8.1 (b) and (c).
The expression is:

Mean – Mode = 3(Mean – Median)

Alternatively, Mode = 3(Median) − 2(Mean)

Sometimes this expression is used to calculate the value of one measure when the values of the other two measures are known.
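The empirical relation can be turned into a one-line helper. The numbers passed in below are hypothetical, purely to show the arithmetic.

```python
# Empirical relation for a moderately skewed distribution:
#   Mean - Mode = 3 * (Mean - Median)   =>   Mode = 3*Median - 2*Mean
def approximate_mode(mean, median):
    """Approximate the mode from the mean and the median."""
    return 3 * median - 2 * mean

# Hypothetical values, for illustration only:
print(approximate_mode(mean=50.0, median=48.0))   # 44.0
```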

8.3.5 Choice of a Suitable Average

We have already discussed advantages and disadvantages of three different


types of averages: Mean, Median, and Mode. Here, we discuss their
appropriateness in terms of the following three factors: (1) the level of
measurement of data (2) the shape of the distribution and (3) the stability of
the measure of the average.

Levels of measurement: There are four levels of measurement of data: nominal, ordinal, interval and ratio. At the nominal level, the observations can be
just distinguished or differentiated but cannot be arranged in any order.
Examples may be colour of cars, types of blood groups, brands of a consumer
goods etc. At ordinal level, the observations can be arranged in ascending or
descending order, but no arithmetic operations are possible. While describing the
existing business climate, the respondents may tell - very good, good, medium,
bad and very bad. This could be an example of ordinal data. At interval level,
it is assumed that a given interval on the scale measures the same amount of
difference, irrespective of where the interval appears. There is a zero but it is
arbitrary and is not of much significance. For example, the temperature
difference between 500C and 600C is the same, as the temperature difference
between 100C and 200C but a temperature of 00C does not mean absence of
heat. Variables like height, weight are examples of ratio levels of
measurement. Here, a value which is twice as large as another value
corresponds to twice the value of the variable and it has an absolute zero. We
say that a 10-metre tower is twice as tall as a 5-metre tower, but we never
mean that a temperature of 400C is twice as hot as a temperature of 200C.

From the above discussion, it is clear that for nominal data only mode can be
used, for ordinal data both mode and median can be used whereas for ratio and
interval levels of data all three measures can be calculated.

Shape of the distribution: If the distribution of data is symmetric with only


one peak, mean, median, and mode are the same. Even in case of two modes,
mean and median will be the same. For asymmetric distribution, all these are
different. For positively skewed distribution, mode is the smallest and median
lies between mode and mean whereas for negatively skewed distribution the
pattern is just the opposite. (It is discussed elaborately in Unit 9, Section 9.5.)
Thus, in either of the cases, median appears to be a better measure of central
tendency. Figure 8.1 shows three different shapes of a distribution.

[Figure 8.1: Three shapes of a distribution — (a) symmetric, where Mode = Median = Mean; (b) and (c) skewed, where Mode, Median and Mean take different positions.]
Stability: Quite often a researcher studies a sample to infer about the entire population. Mean is generally more stable than median or mode. If we calculate
means, medians and modes of different samples from a population, the means
will generally be more in agreement than the medians or the modes. Thus,
mean is a more reliable measure of central tendency. Normally, the choice of a
suitable measure of central tendency depends on the common practice in a
particular industry. According to its requirement, each case must be judged
independently. For example, the Mean sales of different products may be useful
for many business decisions. The median price of a product is more useful to
middle class families buying a new product. The mode may be a more useful
measure for the garment industry to know the modal height of the population to
decide the quantity of garments to be produced for different sizes.

Hence, the choice of the measure of central tendency depends on (1) type of
data (2) shape of the distribution (3) purpose of the study. Whenever possible,
all the three measures can be computed. This will indicate the type of
distribution.

8.3.6 Some Other Measures of Central Tendency

Sometimes two other measures of central tendency, geometric mean and


harmonic mean are also used. They are briefly discussed here.

Geometric Mean (GM): The geometric mean is defined as the Nth root of the product of all the N observations: G.M. = (product of all n values)^(1/n). Thus, the geometric mean of the four numbers 2, 5, 8 and 10 is (2 × 5 × 8 × 10)^(1/4) = (800)^(1/4) = 5.3183. If one observation is


zero, the geometric mean becomes zero and hence inappropriate. If some
values are negative, sometimes the geometric mean may be computed but may
be meaningless. Geometric mean is appropriate for the variables that reproduce
themselves. Suppose the population of a country in the years 1990 and 2000 is respectively 100 and 121 million. The average population in the decade is √(100 × 121) = 110 million. Probably the most frequent use of the geometric mean is to find an average rate of change, such as the average percentage change of population, compound interest, or a growth rate.

Harmonic Mean (HM): The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of the observations. In other words, it is the ratio of the number of observations to the sum of the reciprocals of the values: HM = N / (1/x1 + 1/x2 + … + 1/xn), in short N / Σ(1/x). For example, the harmonic mean of 4 and 6 is 2 / (1/4 + 1/6) = 2 / (5/12) = 2 / 0.4167 = 4.8. Suppose a car moves half the distance at a speed of 60 km/hr and the other half at a speed of 80 km/hr. Then the average speed of the car is 68.57 km/hr, which is the harmonic mean of 60 and 80. The harmonic mean is useful in averaging rates.

For any set of data for which computation is possible, the following inequality holds (with equality only when all the observations are equal):

x̄ ≥ GM ≥ HM
Illustration 11

Compute the arithmetic, geometric and harmonic means of 4, 5, 10 and 11 and verify the above relationship.

Arithmetic Mean, A = (4 + 5 + 10 + 11)/4 = 30/4 = 7.50
Geometric Mean, G = (4 × 5 × 10 × 11)^(1/4) = (2200)^(1/4) = 6.85
Harmonic Mean, H = 4 / (1/4 + 1/5 + 1/10 + 1/11) = 4/0.64 = 6.25

Since 7.50 > 6.85 > 6.25, the relationship discussed above is verified.
It is also possible to compute weighted geometric and harmonic means.
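A short Python check of Illustration 11. Note that to four decimals the harmonic mean is 6.2411; the hand computation above rounds the denominator to 0.64 and so reports 6.25.

```python
# Arithmetic, geometric and harmonic means of 4, 5, 10, 11,
# plus a check of the inequality AM > GM > HM.
values = [4, 5, 10, 11]
n = len(values)

am = sum(values) / n                     # arithmetic mean: 7.5

product = 1.0
for v in values:
    product *= v
gm = product ** (1 / n)                  # nth root of the product: ~6.85

hm = n / sum(1 / v for v in values)      # reciprocal of mean of reciprocals: ~6.24

print(round(am, 2), round(gm, 2), round(hm, 2))   # 7.5 6.85 6.24
assert am > gm > hm                      # the inequality holds
```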

8.4 LET US SUM UP


In order to draw meaningful and useful conclusions from the data the collected
data must be analysed with the help of statistical derivatives like percentage,
ratio and rates. They also give meaningful insight with very little computation. A
ratio expresses the relationship between the magnitude of more than one
quantity. It is generally stated as A : B : C. Proportion is the ratio of any one
category to the total of all the categories. It is a better derivative to use when
the number of categories increases. A rate is usually expressed as per 100 per
1,000 etc. A measure of central tendency gives one representative value,
around which the data set is clustered. Three widely used measures are
discussed in detail. Mode is the simplest of all but at times it is not defined.
Median divides the observations into two equal parts and is particularly suitable
in open-ended data. The arithmetic mean is calculated from all the observations but is affected by extreme values, and for qualitative data it cannot be computed. Taken together, the mean, median and mode indicate the type of distribution of the data. Measures of central tendency are also called measures of location.

8.5 KEY WORDS


Arithmetic Mean : This equals the sum of all the values divided by the
number of observations.

Bimodal Distribution : A distribution in which two values occur most frequently, in equal number.

Geometric Mean : The Nth root of the product of all the N observations.

Harmonic Mean : The reciprocal of the arithmetic mean of the reciprocals of the given values.

Mean : Usually refers to Arithmetic Mean or Average.

Median : The middlemost observation in a data set when arranged in order.

Mode : The most frequent value occurring in a data set. It is represented by


the highest point in the distribution curve of a data set.

Percentage : It gives the magnitude of the numerator when denominator of a


ratio becomes hundred.

Rate : Amount of one variable per unit amount of some other variable.
Ratio : Relative value of one value with respect to another value.
Weighted Mean : An average in which each observation value is weighted by
some index of its importance.

8.6 ANSWERS TO SELF ASSESSMENT EXERCISES


A: 2. With reference to the original table, the all India figures are the totals of
all the state figures. Thus a column percentage gives the share of a state from
among all the states of India in respect of the category of workers. The
column percentages are given below.

Table: State-wise Percentage Share of Total Workers and Categories of


Workers in All India: 2001
Sl. State/ Cultivators Agricultural Household Other Total
No. India Labourers Industry Wor-
workers kers
1 Jammu & 1.25 0.23 1.40 1.07 0.92
Kashmir
2 Himachal 1.54 0.09 0.31 0.59 0.74
Pradesh
3 Punjab 1.64 1.40 1.87 3.47 2.27
4 Haryana 2.39 1.19 1.26 2.55 2.08
5 Rajasthan 10.32 2.35 3.97 4.92 5.91
6 Uttar Pradesh 17.37 12.66 17.60 10.27 13.46
7 Bihar 6.42 12.59 6.63 3.49 6.98
8 Assam 2.93 1.20 2.00 2.78 2.37
9 West Bengal 4.40 6.84 13.13 9.52 7.33
10 Orissa 3.32 4.65 4.20 2.88 3.55
11 Madhya 8.66 6.87 6.16 4.18 6.40
Pradesh
12 Gujarat 4.40 4.64 2.33 6.21 5.06
13 Maharashtra 9.41 10.51 6.38 11.72 10.45
14 Andhra 6.19 12.86 9.57 7.66 8.66
Pradesh
15 Karnataka 5.43 5.78 5.71 6.25 5.84
16 Kerala 0.58 1.54 2.22 4.99 2.56
17 Tamil Nadu 4.01 8.06 8.90 8.32 6.91
INDIA 100.00 100.00 100.00 100.00 100.00

The interpretation is obvious - Out of all the workers in all India, 13.46 percent
are in Uttar Pradesh and 10.45 percent in Maharashtra. Andhra Pradesh has
the highest share of Agricultural Labourers (12.86%), followed by Uttar Pradesh (12.66%) and Bihar (12.59%). The lowest share of Household Industry workers is in Himachal Pradesh, etc.
C: 1) The weighted mean is 539/7 = 77.
If all the papers have equal importance, i.e. equal weightage, then the
simple mean = 224/3 =74.67.
2) Since the class width is 150 for all the classes, the method of assumed
mean is useful. On observation, assumed mean is taken as 375.
x̄ = A + (Σfd/N) × i; the mean sales of the 125 firms is Rs. 361.8 thousand.

3) x̄123 = (N1x̄1 + N2x̄2 + N3x̄3) / (N1 + N2 + N3) = 116.43
D: Median = L + ((N/2 − cf)/f) × i; Me = 359.76

E: Mode = L + (∆1/(∆1 + ∆2)) × i; the modal sales value is Rs. 363.64 thousand.

8.7 TERMINAL QUESTIONS/EXERCISES


1) Explain the concept of central tendency with the help of an example. What
purpose does it serve?
2) “A representative value of a data set is a number indicating the central value
of that data”. To what extent is it true for Mean, Median, and Mode? Explain
with illustrations.
3) Discuss the merits and limitations of various measures of central tendency.
4) The following table gives workers of India (in thousands) as per 2001 census.
Compute suitable percentages and interpret them.
Table: Total Workers and Their Categories-India : 2001 (In thousands)
S. Total Persons Cultivators Agricultural Household Other Total
No. Rural/ Males/ Labourers Industry Workers Workers
Urban Females workers
(1) (2) (3) (4) (5) (6) (7) (8)

1 Rural Persons 124682 103122 11710 71142 310655

2 Males 84047 54749 5642 54762 199200

3 Females 40635 48373 6067 16380 111456

4 Urban Persons 2946 4326 4686 79899 91857


5 Males 2282 2605 2670 68707 76264
6 Females 664 1721 2016 11191 15593

7 Total Persons 127628 107448 16396 151040 402512


8 Males 86328 57354 8312 123469 275464
9 Females 41300 50093 8084 27571 127048

Source: Census of India, 2001.

5) The monthly salaries (in Rupees) of 11 staff members of an office are:
2000, 2500, 2100, 2400, 10000, 2100, 2300, 2450, 2600, 2550 and 2700.
Find mean, median and mode of the monthly salaries.
Which one among the above do you consider the best measure of central
tendency for the above data set and why?
6) Consider the data set given in problem 5 above.
Find mean deviation of the data set from (i) median (ii) 2400 and (iii) 2500.
Find mean squared deviation of the data set from (i) mean (ii) 3000 and (iii)
3100.
7) Mean examination marks in Mathematics in three sections are 68, 75 and 72, the
number of students being 32, 43 and 45 respectively in these sections. Find the
mean examination marks in Mathematics for all the three sections taken
together.
8) The followings are the volume of sales (in Rupees) achieved in a month by 25
marketing trainees of a firm:

1220 1280 1700 1400 400 350 1200 1550 1300 1400
1450 300 1800 200 1150 1225 1300 1100 450 1200
1800 475 1200 600 1200

The firm has decided to give the trainees a performance bonus as per the following rule: Rs. 100 if the volume of sales is below Rs. 500; Rs. 250 if the volume of sales is between Rs. 500 and Rs. 1000; Rs. 400 if the volume of sales is between Rs. 1000 and Rs. 1500; and Rs. 600 if the volume of sales is above Rs. 1500.
Find the average value of performance bonus of the trainees.
9) In an urban cooperative bank, the minimum deposit in a savings bank is Rs. 500.
The deposit balance at the end of a working day is given in the table below :
Table: Average Deposit Balance in ABC Urban Cooperative Bank

S No Deposit Balance Number of Deposits


1 Less than Rs. 10000 982
2 Less than Rs. 9000 959
3 Less than Rs. 8000 874
4 Less than Rs. 7000 773
5 Less than Rs. 6000 621
6 Less than Rs. 5000 395
7 Less than Rs. 4000 295
8 Less than Rs. 3000 145
9 Less than Rs. 2000 25
10 Less than Rs. 1000 10

Calculate mean, median and mode from the above data.


10) Refer to the table given in the previous problem. Compute (a) median and (b) mode by the graphical approach.
11) Refer to problem 9. Compute the approximate value of the mode using the relationship Mean − Mode = 3(Mean − Median), and compare it with the value computed earlier.

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

8.8 FURTHER READING


The following textbooks may be used for a more in-depth study of the topics dealt with in this unit.
Gupta, S P and M P Gupta, 1988. Business Statistics, S Chand, New Delhi.
Hooda, R.P., 2001. Statistics for Business and Economics, Macmillan India Limited,
New Delhi.
Levin, R I and D S Rubin, 1998. Statistics for Management, Prentice Hall India,
New Delhi.
Spiegel, M R, 1992. Statistics, Schaum's Outline Series, McGraw Hill, Singapore.

UNIT 9 MEASURES OF VARIATION AND SKEWNESS

STRUCTURE
STRUCTURE

9.0 Objectives
9.1 Introduction
9.2 Variation – Why is it Important?
9.3 Significance of Variation
9.4 Measures of Variation
9.4.1 Range
9.4.2 Quartile Deviation
9.4.3 Mean Deviation
9.4.4 Standard Deviation
9.4.5 Coefficient of Variation
9.5 Skewness
9.6 Relative Skewness
9.7 Let Us Sum Up
9.8 Key Words
9.9 Answers to Self Assessment Exercises
9.10 Terminal Questions/Exercises
9.11 Further Reading

9.0 OBJECTIVES
After studying this Unit, you should be able to:
l describe the concept and significance of measuring variability for data analysis,
l compute various measures of variation and its application for analysing the
data,
l choose an appropriate measure of variation under different situations,
l describe the importance of Skewness in data analysis,
l explain and differentiate the symmetrical, positively skewed and negatively
skewed data, and
l ascertain the value of the coefficient of skewness and comment on the nature
of distribution.

9.1 INTRODUCTION
In Unit 8, we have learnt about the measures of central tendency. They give us
only a single figure that represents the entire data. However, central tendency
alone cannot adequately describe a set of data, unless all the values of the
variables in the collected data are the same. Obviously, no average can
sufficiently analyse the data, if all the values in a distribution are widely spread.
Hence, the measures of central tendency must be supported and supplemented
with other measures for analysing the data more meaningfully. Generally, there
are three other characteristics of data which provide useful information for data
analysis, i.e., Variation, Skewness, and Kurtosis. The third characteristic, Kurtosis, is not within the scope of this course. In this unit, therefore, we shall
distribution of data and their computation. We shall also discuss the role of
normal curves in characterizing the data.

9.2 VARIATION – WHY IS IT IMPORTANT?

Measures of variation are statistics that indicate the degree to which numerical data tend to spread about an average value. Variation is also called dispersion, scatter or spread, and is related to the homogeneity of the data. In the simple words of Simpson and Kafka, “the measurement of the scatterness of the mass of figures (data) in a series about an average is called a measure of variation”. Therefore, we can say that variation measures the extent to which the items scatter from the average. To be more specific, an average is more meaningful when the data are examined in the light of variation. In fact, in the absence of a measure of dispersion, it is not possible to say which of two or more sets of data is represented more closely and adequately by its arithmetic mean. The following illustration helps you understand the necessity of measuring the variability of data for effective analysis.

Illustration-1

The data given below relates to the marks secured by three students (A, B and
C) in different subjects

Subjects Marks
A B C
Research methodology 50 50 10
Accounting for Managers 50 70 100
Financial Management 50 40 80
Marketing Management 50 40 30
Managerial Economics 50 50 30
Total 250 250 250
Mean ( x ) 50 50 50

In the above illustration, you may notice that the marks of the three students
have the same mean i.e. the average marks of A, B and C are the same i.e.,
50 Marks, and we may analyse and conclude that the three distributions are
similar. But, you should note that, by observing distributions (subject-wise) there
is a wide difference in the marks of these three students. In case of 'A' the
marks in each subject are 50, hence we can say each and every item of the
data is perfectly represented by the mean; in other words, there is no variation. In case of B there is slight variation compared to C, whereas in case of C not a single item is perfectly represented by the mean and the items vary widely from one another. Thus, different sets of data may have the same average value but differ greatly in the spread or scatter of their items. The study of variability, therefore, is necessary to gauge the degree of scatter of the items about the average in the collected data.
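The point of Illustration 1 — identical means, very different spreads — can be verified with a short sketch. Mean absolute deviation (a spread measure discussed formally later, in Section 9.4.3) is used here as a simple index of scatter.

```python
# Three students with the same mean marks but very different spreads
# (the marks of Illustration 1, listed subject by subject).
marks = {
    "A": [50, 50, 50, 50, 50],
    "B": [50, 70, 40, 40, 50],
    "C": [10, 100, 80, 30, 30],
}

results = {}
for student, xs in marks.items():
    mean = sum(xs) / len(xs)
    # mean absolute deviation about the mean: a simple measure of scatter
    mad = sum(abs(x - mean) for x in xs) / len(xs)
    results[student] = (mean, mad)
    print(student, mean, mad)

# All three means are 50, but the scatter grows from A (0) to C (32),
# which is exactly why an average alone cannot describe the data.
```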

9.3 SIGNIFICANCE OF VARIATION


The measure of variability is useful in various situations. Let us take an
example to understand the significance of variation.

A family intends to cross a lake. They come to know that the average depth of the lake is 4 feet. The average height of the family is 5.5 feet. Then they
decide that the lake can be crossed safely. While crossing the lake, at a
particular place all the members of the family get drowned where the level of
water is more than 6.5 feet deep. The reason for drowning is that they rely on
the average depth of the lake and their average height but do not rely on the
variability of the Lake's depth and their height. In the light of the above
example, we may understand the reason for measuring variability of a given
data.

To judge the reliability of an average: Financial analysts examine the variation of a firm's earnings. If earnings are widely scattered (from extremely high to low, or even negative), it indicates high risk to investors. Where there is wide scatter in the data, a measure of variation describes the structure of the data. If the variation is small, the average closely represents the individual values and may be called reliable. On the other hand, if the variation is large, the average may be unreliable.

To compare series with regard to their variability: Measuring variation


enables us to compare the variability between two or more series. It is useful
to study the degree of uniformity and consistency in different data sets. A
greater degree of variation in a data set means low degree of consistency. On
the other hand, low degree of variation means high degree of consistency in
that distribution.

To provide a basis for the control of variability itself: It facilitates to


determine the nature and cause of variation in order to control the variation
itself. Quality control experts analyse the variation of the quality of a product.
For instance, a drug that may be average in purity but varies from very pure to
highly impure may endanger lives.

To facilitate the use of other statistical measures: Many analytical devices


in statistics such as hypothesis testing, cost control, analysis of fluctuations,
correlation and regression analysis, techniques of production control etc. are
based on the measure of variation.

Keeping in view the above purposes, the variation of data must be taken into
account while taking business decisions.

9.4 MEASURES OF VARIATION


Variation may be measured in absolute or relative terms. Measures of absolute variation are expressed in the original units of the given data. For example, if the temperature of a city in a day ranges between 15°C and 47°C, the absolute variation of the temperature is 32°C (47°C − 15°C). Absolute measures make it possible to compare two sets of data expressed in the same unit, i.e. kgs, rupees, etc. When two sets of data are expressed in different units or differ greatly in size, absolute measures of variation are not comparable. In such situations, measures of relative variation should be used. For example, we may wish to compare the variation of temperature about an average, measured in degrees Celsius, with the variation of the sales of cold drinks, given in rupees. Relative variation is expressed as a ratio or percentage. We shall now consider in turn each of the four measures of variation, which provide a numerical index of the variability of a given data set. They are:
i) Range
ii) Quartile Deviation
iii) Mean Deviation, and
iv) Standard Deviation

9.4.1 Range
The Range is the simplest measure of variation. It is defined simply as the
difference between the highest value and the lowest value of observation in a
set of data. In equation form, for both ungrouped and grouped data, the absolute measure of range is:

Range = Highest Value – Lowest Value

In grouped data, the absolute range is the difference between the upper limit of the highest class and the lower limit of the lowest class. The relative measure of range for ungrouped and grouped data, called the coefficient of range, is as follows:

Coefficient of Range = (Highest value − Lowest value) / (Highest value + Lowest value)

Let us take an illustration to understand the computation of the absolute and relative range.

Illustration-2

The following data relates to the total fares collected on Monday from three
different transport agencies.

Transport Agency Fares on Monday (Rs.)


A 450 300 300 500 500
B 400 400 400 400 400
C 600 500 200 300 500

Solution: Let us compute the absolute range for the three transport agencies A, B and C.

Range = H.V. – L.V.


A's Range = 500 – 300 = Rs. 200
B's Range = 400 – 400 = Rs. 0
C's Range = 600 – 200 = Rs. 400

The interpretation of the above result is simple: the variation is nil in the fares of agency B, small in agency A, and high in agency C. The coefficients of range for agencies A and C are as follows:

A's coefficient of Range = (500 − 300)/(500 + 300) = 0.25
C's coefficient of Range = (600 − 200)/(600 + 200) = 0.50
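Both the absolute range and the coefficient of range for Illustration 2 can be computed in a few lines, an illustrative sketch using the fares listed above.

```python
# Absolute range and coefficient of range for the three agencies' fares.
fares = {
    "A": [450, 300, 300, 500, 500],
    "B": [400, 400, 400, 400, 400],
    "C": [600, 500, 200, 300, 500],
}

summary = {}
for agency, xs in fares.items():
    hi, lo = max(xs), min(xs)
    abs_range = hi - lo                   # Range = Highest - Lowest
    coeff = (hi - lo) / (hi + lo)         # Coefficient of Range
    summary[agency] = (abs_range, round(coeff, 2))
    print(agency, abs_range, round(coeff, 2))

# A: range 200, coefficient 0.25; B: range 0; C: range 400, coefficient 0.50.
```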
Its usefulness as a measure of variation is limited. Since it considers only the
highest and lowest values of the data, it is greatly affected by extreme values
of the data. Therefore, the range is likely to change drastically from sample to
sample. Range cannot be computed in case of open-ended distribution.

In spite of the limitations discussed above, the range is extensively used in specific situations. For instance, it plays an important role in preparing quality control charts and in studying fluctuations in the prices of commodities and shares, for example the maximum and minimum gold prices during a specific period. For the meteorological department, the range is a good indicator for weather forecasts, showing within what limits the temperature is likely to vary, from maximum to minimum, in different cities.

Self Assessment Exercise A

The following data relates to the record of time (in minutes) of trucks waiting
to unload material.

Company A 0.51 0.68 0.23 0.59 0.93 0.15 0.85

Company B 0.62 0.25 0.36 0.89 1.05 0.20 0.95

Calculate the absolute and relative range and comment on whether you think it
is a useful measure of variation.
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................

..........................................................................................................................
..........................................................................................................................
..........................................................................................................................

9.4.2 Quartile Deviation

Quartile Deviation is another measure of variation, also termed the semi-inter-quartile range. As we know, quartiles are the values which divide the distribution into four equal parts: Q1 (first quartile) gives the value of the 1/4th item and Q3 (third quartile) gives the value of the 3/4th item. The difference between Q3 and Q1 is termed the inter-quartile range; when it is divided by two, it is termed the quartile deviation. It covers the middle 50 per cent of the distribution. As a result, one quarter of the data at the upper end and another quarter at the lower end are excluded. It is, therefore, unaffected by extreme values. In a symmetrical distribution Q1 and Q3 are equidistant from the median, whereas in an asymmetrical distribution they are not. Symbolically, the absolute measure of quartile deviation may be presented as:

Q.D. = (Q3 − Q1) / 2
The relative measure of Q.D., called the coefficient of quartile deviation, is calculated as:

Coefficient of Q.D. = (Q3 − Q1) / (Q3 + Q1)

It is to be noted that the above formulae (absolute and relative) are applicable
to ungrouped data and grouped data as well. Let us take up an illustration to
ascertain the value of quartile deviation and coefficient of Q.D.

Illustration-3
The following data relates to the daily expenditure of the students in Delhi
University. Calculate quartile deviation and its co-efficient.

Daily expenditure: 50-100  100-150  150-200  200-250  250-300  300-350  350-400  400-450  450-500

No. of Students:   18      14       21       15       12       13       8        5        2

Solution: For computation of the quartile deviation we have to convert the given frequencies (No. of students) into cumulative frequencies. The procedure is to add the frequency of each class to the previous cumulative frequency. In this process, the frequency of the first class is taken as the cumulative frequency of that class, and the cumulative frequency of the last class equals the total frequency (sum of observations) of the given data.

Daily expenditure (x) No. of Students (f) Cumulative Frequency


50-100 18 18
100-150 14 32
150-200 21 53
200-250 15 68
250-300 12 80
300-350 13 93
350-400 8 101
400-450 5 106
450-500 2 108
Q.D. = (Q3 − Q1) / 2

Q1 is the N/4th observation, i.e., the 108/4 = 27th observation, which lies in the cumulative frequency 32. So the value of Q1 lies in the class 100-150. Now, with the help of the following formula, the value of Q1 is ascertained:

Q1 = L1 + ((N/4 − c.f.) / f) × i

Where 'L1' is the lower limit of the Q1 class, 'c.f.' is the cumulative frequency of the class preceding the Q1 class, 'f' is the frequency of the Q1 class and 'i' is the class interval. Now we substitute these values to obtain the result for Q1:

Q1 = 100 + ((27 − 18) / 14) × 50 = Rs. 132.14
Q3 is the 3(N/4)th observation, i.e., the 3(108/4) = 81st observation. This observation lies in the cumulative frequency 93. So Q3 lies in the class 300-350.

Q3 = L1 + ((3N/4 − c.f.) / f) × i

Here, as explained above, L1, c.f., f and i relate to the Q3 class.

Therefore, Q3 = 300 + ((81 − 80) / 13) × 50 = Rs. 303.85

Q.D. = (Q3 − Q1) / 2 = (303.85 − 132.14) / 2 = Rs. 85.85

Relative measure of Q.D., i.e., Coefficient of Q.D. = (Q3 − Q1) / (Q3 + Q1) = (303.85 − 132.14) / (303.85 + 132.14) = 0.39

From the above data it may be concluded that the variation of daily expenditure among the sample students of DU is Rs. 85.85. The coefficient of Q.D. is 0.39. This relative value of variation may be compared with other variables on which expenditure depends, such as the family income of the students, pocket money, spending habits, etc.

Quartile deviation is a useful measure, superior to the range, in the case of an open-ended distribution. It is also useful where the distribution is badly skewed, because it is not affected by extreme values.
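The quartile interpolation used in Illustration 3 can be sketched in Python as follows (the function and variable names are ours, for illustration only):

```python
# Sketch of the grouped-data quartile formula Q = L1 + ((k*N/4 - c.f.)/f) * i
def quartile(classes, freqs, k):
    """k-th quartile (k = 1 or 3) of a grouped frequency distribution.

    classes -- list of (lower, upper) class limits
    freqs   -- the corresponding frequencies
    """
    n = sum(freqs)
    target = k * n / 4.0
    cum = 0  # cumulative frequency of the preceding classes
    for (low, high), f in zip(classes, freqs):
        if cum + f >= target:  # the quartile class has been reached
            return low + (target - cum) / f * (high - low)
        cum += f
    raise ValueError("target lies beyond the distribution")

# Daily expenditure data of Illustration 3
classes = [(50, 100), (100, 150), (150, 200), (200, 250), (250, 300),
           (300, 350), (350, 400), (400, 450), (450, 500)]
freqs = [18, 14, 21, 15, 12, 13, 8, 5, 2]

q1, q3 = quartile(classes, freqs, 1), quartile(classes, freqs, 3)
qd = (q3 - q1) / 2
coeff = (q3 - q1) / (q3 + q1)
print(round(q1, 2), round(q3, 2), round(qd, 2), round(coeff, 2))
```

This reproduces the text's values Q1 = 132.14, Q3 = 303.85, Q.D. = 85.85 and coefficient 0.39.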

Self Assessment Exercise B

The following data show the profit made by 60 companies in a year.

Compute the quartile deviation and its co-efficient. Do you think this is an appropriate measure of variability? Justify your opinion.

Profits (Rs. in lakhs): Less than 40  40-45  45-50  50-55  55-60  60-65  65-70  70 & above

No. of Companies:       3             12     8      15     10     5      5      2

Solution: Calculation of QD and co-efficient of QD.

Profits (x) No. of C.f


(in Rs. lakhs) Companies (f)

..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
9.4.3 Mean Deviation

As we know, an important characteristic of an ideal measure of variation is that it should involve all the data values. Judged by this criterion, the range and quartile deviation are not based on all the observations of the data, as they are positional measures of variation. Consequently, they do not show the scatter around an average, but rather a distance on a scale. The mean deviation overcomes this weakness by considering all the items of a data set. The mean deviation is the arithmetic mean of the absolute differences between the items in a distribution and the average of that distribution. Theoretically, the mean deviation can be computed from the mean, the median or the mode. However, in actual practice the mean is frequently used. Under this method, algebraic signs (+, −) are ignored while taking the deviations from the average. For ungrouped data, the formula is:

M.D. from mean = Σ|x − x̄| / N  or  M.D. from median = Σ|x − Me| / N
For grouped data the formula is:

M.D. from mean = Σf|x − x̄| / N  or  M.D. from median = Σf|x − Me| / N

Where the two bars indicate that the sign of the difference within them is taken as positive, e.g. |2 − 6| = 4. The coefficient of mean deviation, for ungrouped and grouped data alike, is:

Co-efficient of M.D. = M.D. / (the average used, x̄ or Me)

As an illustration, let us consider the following data, which relates to the sales
of Company A and Company B during 1995-2001.

Illustration-4

Compute the mean deviation and its co-efficient of the sales of two companies
A and B and comment on the result.

Years 1995 1996 1997 1998 1999 2000 2001


sales (Rs. in '000)

Company A : 484 572 124 386 920 653 690

Company B : 3554 2645 6524 4255 4940 5450 6890

Solution: For computation of mean deviation, we have to prepare the


following table. In this illustration we consider the mean for computation of
mean deviation.

          Company A (Mean = 547)          Company B (Mean = 4894)
Years     Sales X (Rs. '000)   |x − x̄|    Sales X (Rs. '000)   |x − x̄|

1995      484                  63         3554                 1340
1996      572                  25         2645                 2249
1997      124                  423        6524                 1630
1998      386                  161        4255                 639
1999      920                  373        4940                 46
2000      653                  106        5450                 556
2001      690                  143        6890                 1996
Total     3829                 1294       34258                8456

Mean Sales of Company 'A' = ΣX_A / N = 3829 / 7 = Rs. 547 thousand

Mean Sales of Company 'B' = ΣX_B / N = 34258 / 7 = Rs. 4894 thousand

Formula for Mean Deviation from Mean = Σ|x − x̄| / N

M.D. of Company 'A' = 1294 / 7 = Rs. 184.9 thousand

M.D. of Company 'B' = 8456 / 7 = Rs. 1208 thousand

Co-efficient of M.D.:

Company 'A' = M.D._A / Mean_A = 184.9 / 547 = 0.34

Company 'B' = M.D._B / Mean_B = 1208 / 4894 = 0.25

The coefficient of mean deviation of company 'A' is greater than that of company 'B'. Hence we can conclude that there is greater variability in the sales of company 'A'.
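The computations of Illustration 4 can be reproduced with a few lines of Python (a sketch of ours, not from the text; figures are Rs. '000):

```python
# Mean deviation from the mean, ignoring the sign of each deviation.
def mean_deviation(values):
    """Return (mean, M.D.) where M.D. = sum of |x - mean| / N."""
    mean = sum(values) / len(values)
    md = sum(abs(x - mean) for x in values) / len(values)
    return mean, md

sales_a = [484, 572, 124, 386, 920, 653, 690]      # Company A, 1995-2001
sales_b = [3554, 2645, 6524, 4255, 4940, 5450, 6890]  # Company B

for name, sales in (("A", sales_a), ("B", sales_b)):
    mean, md = mean_deviation(sales)
    # mean, M.D., and the coefficient of M.D. (M.D. / mean)
    print(name, round(mean), round(md, 1), round(md / mean, 2))
```

The output matches the text: means of 547 and 4894, mean deviations of 184.9 and 1208, and coefficients of 0.34 and 0.25.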

The drawback of this method, it may be observed, is that the algebraic signs (+ or −) of the deviations are ignored. From the mathematical point of view this is unjustifiable, and it makes the measure unsuitable for further algebraic treatment. That is the reason the mean deviation is not frequently used in business research. The accuracy of the mean deviation also depends upon how well the average represents the data. Despite these drawbacks, it is a most useful measure in the case of: i) small samples where no elaborate analysis is required, ii) reports presented to a general public not familiar with statistical methods, and iii) certain applications in the area of inventory control.
Self Assessment Exercise C
Calculate the mean deviation and its co-efficient from the following data, which relate to the weekly earnings of families in an area. What light does this throw on the economic condition of that community, and do you consider this a scientific measure of variability? Give your opinion.

Weekly Earnings (Rs.): 0-1000  1000-2000  2000-3000  3000-4000  4000-5000  5000-6000  6000-7000

No. of Families:       532     704        210        110        32         8          4

Solution: We can also measure the deviations from Median. It is preferred


because the average deviation from Median is the least.

Computation of Mean Deviation and its Co-efficient from Median.

Weekly Earnings (Rs.)   Mid-points (x)   No. of families (f)   Less than C.f.   |x − Me|   f |x − Me|

..........................................................................................................................
..........................................................................................................................

9.4.4 Standard Deviation

The standard deviation is the most familiar, important and widely used measure of variation. It is a significant measure for comparing variability between two or more sets of data in terms of their distance from the mean. In practice, the mean deviation has been replaced by the standard deviation. As discussed earlier, while calculating the mean deviation the algebraic signs (+/−) are ignored, and the deviations can be computed from any of the averages. In the computation of the standard deviation, by contrast, signs are taken into account: the deviations of the items, always taken from the mean, are squared (instead of having their signs ignored) and averaged, and finally the square root of this value is extracted. Thus, the standard deviation may be defined as "the square root of the arithmetic mean of the squares of deviations from the arithmetic mean of a given distribution." This measure is also known as the root mean square deviation. If the values in a given data set are dispersed more widely from the mean, the standard deviation becomes greater. It is usually denoted by σ (read as sigma). The square of the standard deviation (σ²) is called the "variance".

As said earlier, it is a type of average deviation of values from the mean, calculated by using the following formulae.

For ungrouped data:

σ = √( Σ(x − x̄)² / N ), or in simple form σ = √( Σx² / N )

where Σx² = sum of the squares of the deviations (x − x̄) and N = number of observations.

For grouped data:

σ = √( Σf(x − x̄)² / N ), where N = Σf; in simple form σ = √( Σfx² / N )

If the collected data set is very large, then using an assumed mean is more convenient for computing the standard deviation. In that case, the formula is slightly modified as:

σ = √( Σf·dx² / N − ( Σf·dx / N )² ) × C

Where dx = (X − A) / C, A = assumed mean, and C = common factor (the class interval).

The above formula is applicable only when the class intervals are equal.

Let us take up an illustration to understand the computation of the standard deviation from grouped data, relating to the profits of 70 companies.

Illustration-5

Profit (Rs. in lakh): 6-10  10-14  14-18  18-22  22-26  26-30

No. of Companies:     9     11     20     16     9      5

Solution: In order to ascertain the standard deviation we prepare the following table:

Profit (Rs. in lakh)   f    M.V. (x)   dx = (x − A)/C   fdx   fdx²

6-10 9 8 –2 –18 36
10-14 11 12 –1 –11 22
14-18 20 16 0 0 0
18-22 16 20 1 16 16
22-26 9 24 2 18 36
26-30 5 28 3 15 45
N=70 Σfdx =20 Σfdx2 = 155
In the above computation, we have taken the mid-value 16 as the assumed mean (A), and the common factor C is 4. Thus N = 70, Σfdx = 20 and Σfdx² = 155.

σ = √( Σfdx² / N − ( Σfdx / N )² ) × C

= √( 155/70 − (20/70)² ) × 4 = √( 2.21 − 0.08 ) × 4 = √2.13 × 4 = 1.46 × 4 = Rs. 5.84 lakh

Among all the measures of variation, the standard deviation is the only measure possessing the mathematical properties necessary for advanced statistical work. It is least affected by fluctuations of sampling. In a normal distribution, x̄ ± σ covers 68% of the values, whereas x̄ ± Q.D. covers 50% of the values and x̄ ± M.D. covers 57% of the values. This is the reason the standard deviation is called a "standard measure".

9.4.5 Coefficient of Variation

The relative measure of the standard deviation is the coefficient of variation, denoted by C.V. The absolute measure of standard deviation, discussed above, does not facilitate comparison of two or more data sets in terms of their variability and consistency: comparison in terms of the standard deviation alone is possible only when the data sets have the same mean and the same units of measurement. When distributions measured in the same units have different arithmetic means, or when distributions are measured in different units, the coefficient of variation is used instead to compare variability, consistency and uniformity between two or more sets of data. When the C.V. of a data set is lower, the data are said to be more consistent, or to have less variability. On the other hand, a series having a higher C.V. has a higher degree of variability, or less consistency. A drawback of this measure is that it is not satisfactorily useful when the mean is close to zero. To provide a standardized measure of variation, we compute the C.V., which expresses the standard deviation as a percentage of the mean:

C.V. = (σ / x̄) × 100
Consider the following data, which relate to the mean production and standard deviation of paddy in four states, to understand the application of the C.V.

State Mean Production of paddy Standard Deviation


(In Lakh Tons) (In lakh tons)

I 83 9.93

II 40 5.24

III 70 8.12

IV 59 10.89

You may notice that the mean production of Paddy in four states is not equal.
In such a situation, to determine which state is more consistent in terms of
production, we shall compute the coefficient of variation.
C.V. = (σ / x̄) × 100

C.V. of State I = (9.93 / 83) × 100 = 11.96%;  C.V. of State II = (5.24 / 40) × 100 = 13.10%

C.V. of State III = (8.12 / 70) × 100 = 11.60%;  C.V. of State IV = (10.89 / 59) × 100 = 18.46%

It is seen that the standard deviation is lowest in State II. However, since the C.V. is lowest in State III, it is the most consistent state in the production of paddy among the four. State IV is the most inconsistent in the production of paddy.
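The comparison above can be sketched in Python (our own illustration, using the means and standard deviations from the table):

```python
# C.V. = (sigma / mean) * 100; the lower the C.V., the more consistent the series.
def coefficient_of_variation(mean, sd):
    return sd / mean * 100

# (mean production, standard deviation), both in lakh tons
states = {"I": (83, 9.93), "II": (40, 5.24), "III": (70, 8.12), "IV": (59, 10.89)}

cv = {s: round(coefficient_of_variation(m, sd), 2) for s, (m, sd) in states.items()}
print(cv)

most_consistent = min(cv, key=cv.get)
print("Most consistent state:", most_consistent)  # State III, lowest C.V.
```

Note how State II has the smallest standard deviation yet State III has the smallest C.V.; this is exactly why the relative measure, not the absolute one, is used for comparisons across different means.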

Self Assessment Exercise D

A prospective buyer tested the bursting pressure of a sample of 120 carry bags received from manufacturers A and B. The results are tabulated below:

Bursting Pressure 10-12 12-14 14-16 16-18 18-20 20-22


(Kgs)

No. of bags of A 3 14 30 56 12 5
No. of bags of B 8 16 23 34 24 15

Which manufacturer's bags have the higher average bursting pressure? Which manufacturer would you recommend, and why? If the buyer does not want to buy bags of more than 16 kg bursting pressure, how does that change your suggestion, if at all?

Solution: Calculation of Standard Deviation and co-efficient of variation.


Bursting Pressure (Kgs)   Mid-points (x) (AM)   dx = (x − AM)/i   A: No. of bags (f), fdx, fdx²   B: No. of bags (f), fdx, fdx²

..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................

9.5 SKEWNESS

The measure of skewness tells us the direction of dispersion about the centre of the distribution. Measures of central tendency indicate only a single representative figure of the distribution, while measures of variation indicate only the spread of the individual values around the mean; neither gives any idea of the direction of the spread. Two distributions may have the same mean and variation but may differ widely in shape. A distribution is often found skewed on either side of its average, and is then termed an asymmetrical distribution. Thus, skewness refers to the lack of symmetry in a distribution. Symmetry signifies that the values of the variable are equidistant from the average on both sides. In other words, a balanced pattern of distribution is called a symmetrical distribution, whereas an unbalanced pattern is called an asymmetrical distribution.

A simple method of finding the direction of skewness is to consider the tails of a frequency polygon. The concept of skewness will be clear from the following three figures showing symmetrical, positively skewed and negatively skewed distributions.

Fig. 9.1: (a) Symmetrical distribution (Mode = Median = Mean); (b) Positively skewed distribution (Mode < Median < Mean); (c) Negatively skewed distribution (Mean < Median < Mode)

Carefully observe the figures presented above and try to understand the
following rules governing them.

It is clear from Figure 9.1 (a) that the data are symmetrical when the spread
of the frequencies is the same on both sides of the middle point of the
frequency polygon. In this case the value of mean, median, and mode coincide
i.e., Mean = Median = Mode.

When the distribution is not symmetrical, it is said to be a skewed distribution. Such a distribution could be either positively skewed or negatively skewed. In Figure 9.1 (b), where there is a longer tail towards the right-hand side of the centre of the distribution, the distribution is said to be 'positively skewed'. In such a situation, Mean > Median > Mode.

In Figure 9.1 (c), where there is a longer tail towards the left-hand side of the centre, the distribution is said to be 'negatively skewed'. In such a case, Mean < Median < Mode.

It is seen that, in a positively skewed distribution, the dispersal of individual observations is greater towards the right of the central value, whereas in a negatively skewed distribution a greater dispersal of individual observations is towards the left of the central value. We can say, therefore, that the concept of skewness not only refers to the lack of symmetry in a distribution but also indicates the magnitude as well as the direction of the skewness. The relationship of mean, median and mode in measuring the degree of skewness is that, for a moderately skewed distribution, the interval between the mean and the median is approximately one-third of the interval between the mean and the mode.

Tests of Skewness
In the light of the above discussion, we can summarise the following facts regarding the presence of skewness in a given distribution:

1) The mean, median and mode are not identical.
2) The total of the deviations from the median or the mode is not zero, i.e., Σ(X − Me) ≠ 0 or Σ(X − Mo) ≠ 0.
3) The frequencies on the two sides of the mode are not equal.
4) The distances from the median to the quartiles are not equal, i.e., (Q3 − Me) is not equal to (Me − Q1).
5) The curve of the distribution is not bell-shaped; that is, the two halves of the curve on either side of the median or mode do not coincide in a perfect manner.

9.6 RELATIVE SKEWNESS

The relative measure of skewness is termed the coefficient of skewness. It is useful in making comparisons between the skewness of two or more sets of data. There are two important methods for measuring the coefficient of skewness: 1) Karl Pearson's coefficient of skewness, and 2) Bowley's coefficient of skewness.

Let us discuss these two methods. Study them carefully to understand the computation of the coefficient of skewness.
i) Karl Pearson's Coefficient of Skewness (denoted SKp)
This coefficient of skewness is obtained by dividing the difference between the mean and the mode by the standard deviation. Thus the formula of Pearson's coefficient of skewness is:

SKp = (x̄ − Mo) / σ

This method computes the coefficient of skewness by considering all the items of the data set. Its value usually lies between the limits ±3.

If the mode is ill-defined and cannot be easily located, then using the approximate empirical relationship between mean, median and mode stated in Unit-8, Section 8.3.5 (mode = 3 median − 2 mean), the coefficient of skewness can be determined by removing the mode and substituting the median in its place. Thus the changed formula is:

SKp = 3 (Mean − Median) / σ
Let us consider the following data to understand the application of Karl
Pearson’s formula for measuring the co-efficient of skewness.

Illustration-6

The following measures are obtained from the profits of 100 shops in two
different regions. Calculate Karl Pearson’s co-efficient of skewness and
comment on the results.

Region I:  x̄ = 16.62;  Mo = 18.47;  σ = 3.04

Region II: x̄ = 45.56;  Mo = 36.94;  σ = 17.71

Note that we have already learnt the computation of x̄ and Mo in Unit 8, and of σ in this unit.

Solution: Karl Pearson's formula: SKp = (x̄ − Mo) / σ

Coefficient of skewness for Region I = (16.62 − 18.47) / 3.04 = −0.61

Coefficient of skewness for Region II = (45.56 − 36.94) / 17.71 = 0.49

Based on the results we can comment on the distributions of the two regions as follows. The coefficient of skewness for Region I is negative, while that of Region II is positive, and the distribution of profits in Region I is the more skewed. Since the result for Region I indicates a negatively skewed distribution, there is a greater concentration towards higher profits. In the case of Region II the coefficient indicates a positively skewed distribution, so there is a greater concentration towards lower profits.

Let us consider another illustration to understand the application of Pearson's alternative formula for the coefficient of skewness, used when the mode cannot be located in a distribution.

Illustration-7
The following statistical measures are given from a data of a factory before
and after the settlement of wage dispute. Calculate the Pearson’s co-efficient
of skewness and comment.

Particulars Before After


Settlement Settlement
No. of workers 1200 1175
Standard deviation (Rs.) 5.9 4.95
Mean wage (Rs.) 22.8 24.0
Median wage (Rs.) 24.0 23.0

Solution: It is understood that the mode is ill-defined in the given data. Hence, to compute Pearson's coefficient of skewness, the following alternative formula is used here:

Karl Pearson's Co-efficient of Skewness (SKp) = 3 (Mean − Median) / σ

a) Before settlement of the wage dispute: SKp = 3 (22.8 − 24.0) / 5.9 = −3.6 / 5.9 = −0.61

b) After settlement of the wage dispute: SKp = 3 (24.0 − 23.0) / 4.95 = 3 / 4.95 = 0.61

From the above calculated values of the coefficient of skewness under the two situations, we may comment upon the nature of the distribution as follows:

Before the settlement of the dispute the distribution was negatively skewed, and hence there was a greater concentration of wages towards the higher wages. After the settlement it was positively skewed. This reveals that the mean wage of workers increased after the settlement (before settlement the total wages were 1,200 × 22.8 = Rs. 27,360; after settlement they were 1,175 × 24 = Rs. 28,200), and that the workers who were getting low wages received considerably increased wages after the settlement of the dispute, while the wages of the workers who were getting high wages before the settlement fell.
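Both forms of Pearson's coefficient used in Illustrations 6 and 7 can be sketched as one small Python function (a hedged sketch of ours; the function name and keyword arguments are not from the text):

```python
# SKp = (mean - mode)/sd; when the mode is ill-defined, the empirical
# relationship mode = 3*median - 2*mean gives SKp = 3*(mean - median)/sd.
def pearson_skewness(mean, sd, mode=None, median=None):
    if mode is not None:
        return (mean - mode) / sd
    return 3 * (mean - median) / sd

# Illustration 6: the mode is known for both regions
print(round(pearson_skewness(16.62, 3.04, mode=18.47), 2))   # Region I, about -0.61
print(round(pearson_skewness(45.56, 17.71, mode=36.94), 2))  # Region II, about 0.49

# Illustration 7: the mode is ill-defined, so the median form is used
print(round(pearson_skewness(22.8, 5.9, median=24.0), 2))    # before settlement, about -0.61
print(round(pearson_skewness(24.0, 4.95, median=23.0), 2))   # after settlement, about 0.61
```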

We can also comment on the level of uniformity in the distribution of wages by computing the coefficient of variation, which we studied in Section 9.4.5 of this unit. Hence, let us compute the variation to study the degree of uniformity in wages in both circumstances.

C.V. = (σ / x̄) × 100

a) Before settlement, the coefficient of variation = (5.9 / 22.8) × 100 = 25.88%

b) After settlement, the coefficient of variation = (4.95 / 24.0) × 100 = 20.62%
Based on the computed values of variation, it may be concluded that there is sufficient evidence of less inequality in the distribution of wages after the settlement of the dispute. In other words, there was greater scatter in wage payments before the dispute was settled.

Self Assessment Exercise E

A survey was conducted on a random basis by a television manufacturing company to enquire into the maximum price at which persons would be willing to purchase a colour T.V. The following table gives the prices (in thousand Rs.) stated by 150 respondents.

Price of T.V. 8-10 10-12 12-14 14-16 16-18 18-20


(Thousand Rs.)
No. of persons 19 23 28 40 26 14

Calculate Karl Pearson’s co-efficient of skewness and interpret the result.


..........................................................................................................................
..........................................................................................................................
..........................................................................................................................

..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................

Bowley's Measure of the Co-efficient of Skewness

Bowley's coefficient of skewness (SKB) is derived from the quartile values, and for this reason it is useful in the case of an open-ended distribution, where extreme values are present and/or the class intervals of the collected data are unequal, or where only the median and quartile values are available. In such situations, the formula for the coefficient of skewness developed by Prof. Bowley is the more appropriate. It is expressed as:

SKB = ((Q3 − Q2) − (Q2 − Q1)) / ((Q3 − Q2) + (Q2 − Q1)), or alternatively:

SKB = (Q3 + Q1 − 2 Median) / (Q3 − Q1)

It is to be noted that in an asymmetrical distribution the value of this coefficient of skewness lies between ±1. The criticism against this measure is that it does not take all the items of the data into account: it is based on the central 50% of the data, ignoring the 25% below Q1 and the 25% above Q3. Thus, this method is also termed the quartile coefficient of skewness. Since it is based only on the middle 50% of the distribution, there is a possibility that SKB may be negative even while SKp is positive. However, it is a useful measure when the variability of the distribution has been computed by the method of quartile deviation.

Let us consider the following illustration to understand the concept of Bowley's


co-efficient of Skewness.

Illustration-8

The following values were computed in an open-ended distribution relating to


sales of a product. Compute the co-efficient of skewness.

Q1 = 62 Q2 = 141 Q3 = 190

Solution: When the quartiles are given, as we discussed earlier (Section 9.4.2), Bowley's concept is appropriate for obtaining the value of relative skewness. Here the median is exactly equal to the value of Q2, so it can also be denoted as Q2.

Bowley's coefficient of skewness (SKB) = (Q3 + Q1 − 2 Median) / (Q3 − Q1)

SKB = (190 + 62 − 2 × 141) / (190 − 62) = (252 − 282) / 128 = −30 / 128 = −0.23
Since the distribution is slightly negatively skewed, there is a greater concentration of sales towards the higher end of the distribution than towards the lower end.
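Bowley's formula from Illustration 8 can be checked with a short Python sketch (the function name is ours, for illustration only):

```python
# Bowley's quartile coefficient of skewness: SKB = (Q3 + Q1 - 2*Median)/(Q3 - Q1).
def bowley_skewness(q1, median, q3):
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Quartiles of the open-ended sales distribution in Illustration 8
skb = bowley_skewness(q1=62, median=141, q3=190)
print(round(skb, 2))  # about -0.23, matching the text
```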

Self Assessment Exercise F

From the following information regarding the payment of commission to salesmen in two companies (X Ltd. and Y Pvt. Ltd.), calculate Bowley's co-efficient of skewness and find out which company is more homogeneous in its payment of commission and which is more skewed. How do you justify that Bowley's measure is appropriate here?

X Ltd Y Pvt. Ltd.


Payment of No. of Payment of No. of
Commission in Rs. Salesmen Commission in Rs. Salesmen
200-250 50 400-450 20
250-300 85 450-500 42
300-350 67 500-550 50
350-400 58 550-600 22
400-450 16 600-650 16
450-500 7 650-700 9

..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................
..........................................................................................................................

9.7 LET US SUM UP


In this Unit, we have studied how the concepts of variation and Skewness are
useful for describing the data more meaningfully. Variation is a measure of
scatter or spread of items around the central values. Variation is calculated to
examine the extent to which the items vary from some central values. Thus the
average is more meaningful, if it is examined in the light of variation. Relative
measures are obtained as ratios and percentages and are used to compare
variability in two or more sets of data. The mean may be the same in two
different distributions, but that does not imply that the distributions are the
same. Hence, we use measures of variation: the range, quartile deviation,
mean deviation and standard
deviation. We also discussed the concept of coefficient of variation, which is
used to compare relative variation of different sets of data.
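The measures summarized above can be sketched in a few lines of Python. This is a minimal illustration using hypothetical data and only the standard library; the quartiles come from `statistics.quantiles`, whose default 'exclusive' interpolation may differ slightly from hand-computed textbook quartiles.

```python
import statistics

data = [12, 15, 17, 19, 22, 24, 26, 29, 31, 35]  # hypothetical observations

# Range: highest value minus lowest value
rng = max(data) - min(data)

# Quartile deviation: one half of (Q3 - Q1)
q1, _, q3 = statistics.quantiles(data, n=4)
qd = (q3 - q1) / 2

# Mean deviation: mean of absolute deviations from the arithmetic mean
mean = statistics.mean(data)
md = sum(abs(x - mean) for x in data) / len(data)

# Standard deviation (population) and coefficient of variation (%)
sd = statistics.pstdev(data)
cv = sd / mean * 100

print(rng, qd, md, round(sd, 2), round(cv, 1))
```

With this data set the range is 23, the quartile deviation 6.5, the mean deviation 6.0, the standard deviation about 7.01, and the coefficient of variation about 30.5%, so the observations spread by roughly 30% of their mean.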
Through skewness, we study the shape of the distribution, i.e., whether the
distribution is symmetrical or asymmetrical. A symmetrical distribution means a
frequency distribution that forms a balanced pattern on both sides of the mean,
median, and mode.

In such a distribution the mean, median, and mode are equal and they lie at the
centre of the distribution. In contrast, an asymmetrical distribution means an
unbalanced pattern of frequency distribution, called a ‘skewed’ distribution.
A skewed distribution may be positively skewed or negatively skewed. In a
positively skewed distribution the mean is greater than the median and the mode
(x̄ > Me > Mo) and the data curve has a long tail on the right-hand side.
On the other hand, in a negatively skewed distribution the mode is greater than
the median and the mean (Mo > Me > x̄) and the data curve has a long tail on
the left-hand side. In a skewed distribution, the interval between the mean and
the median is approximately one-third of the interval between the mean and the
mode. Based on this relationship, the degree of skewness is measured. There are
two formulae we use for measuring the coefficient of skewness, which are called
relative measures of skewness, proposed by Karl Pearson and Bowley. Bowley’s
formula is normally applied when the data is of the open-end type or/and the
classes are unequal.
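Both relative measures can be sketched as follows. This is a minimal illustration with hypothetical, positively skewed data, using only the Python standard library (quartiles via `statistics.quantiles`, default method).

```python
import statistics

data = [2, 3, 3, 4, 4, 4, 5, 6, 8, 11]  # hypothetical, right-skewed values

mean = statistics.mean(data)   # 5.0
mode = statistics.mode(data)   # 4 (most frequent value)
sd = statistics.pstdev(data)

# Karl Pearson's coefficient of skewness: SKp = (mean - mode) / sd
skp = (mean - mode) / sd

# Bowley's coefficient of skewness: SKb = (Q3 + Q1 - 2 Median) / (Q3 - Q1)
q1, median, q3 = statistics.quantiles(data, n=4)
skb = (q3 + q1 - 2 * median) / (q3 - q1)

# Both come out positive here, confirming a right (positive) skew
print(round(skp, 3), round(skb, 3))
```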

9.8 KEY WORDS


Asymmetry : A characteristic of a distribution in which the values of variables
are not equidistant from the average on both sides.

Co-efficient of Skewness : It makes comparison between the skewness in two or
more data sets.

Co-efficient of Variation : A relative measure of variation, comparable across
distributions; that is, the ratio of the standard deviation to the mean,
expressed as a percentage.

Mean Deviation : The arithmetic mean of the absolute values of the deviations
from some measure of central tendency (Mean, Median or Mode).

Measure of Variation : A measure describing how the observations in a
distribution are spread or scattered.

Quartile Deviation : One half of the difference between the upper quartile (Q3)
and the lower quartile (Q1).

Range : The difference between the highest value and the lowest value of
observations.

Standard Deviation : The square root of the arithmetic mean of the squares of
deviations from the arithmetic mean of the data set.

Skewness : It refers to the lack of symmetry in a distribution.

Symmetry : A characteristic of a distribution in which the values of variables
are equidistant from the average on both sides.

Variation : The degree to which numerical data tend to spread about an average
value.

Variance : The square of the standard deviation.


9.9 ANSWERS TO SELF ASSESSMENT EXERCISES
A) Range : Company A = 0.78 minutes; Company B = 0.85 minutes.
Co-efficient of Range : Company A = 0.72; Company B = 0.68. The range is not
particularly useful because, in the case of Company A, all the items except two
fall between 0.51 minutes and 0.93 minutes. In the case of Company B also, all
the items except three fall between 0.62 minutes and 1.05 minutes. The range
greatly overstates the typical variability because it is determined by the two
extreme values in the data set.

B) Q1 = 45; Q3 = 57.92; Q.D. = 6.46; co-efficient of Q.D. = 0.063.

Yes. Even though the quartile deviation is regarded as a measure of partition
and may not satisfy the tests of a good measure of variation, it is an
acceptable approximate method in specific situations where the data is in the
form of open-ended classes. It is also useful where the distribution is badly
skewed, because it is unaffected by the extreme values.
C) Median = Rs. 1380.68
Mean deviation = Rs. 722.
Co-efficient of Mean Deviation = 0.52.

The median earnings of the 1,600 families is Rs. 1,381. It reveals that 50% of
the families are earning between Rs. 1,000 and Rs. 2,000. It is to be noted that
very few (44 families out of 1,600 families) fall in the last three classes of
higher-earning groups.

This is not a scientific measure of variability because, while taking the
deviations, algebraic signs are ignored. Therefore, it is not capable of further
algebraic treatment.

In fact, this measure of variability gives the best results when deviations are
taken from the median, but the median is not a satisfactory measure when the
dispersion in a distribution is very high. It is also not appropriate for large
samples.

D) Manufacturer A: x = 16.25 kgs; σ = 2.07 kgs; CV = 12.74%

Manufacturer B: x = 16.58 kgs; σ = 2.81 kgs; CV = 16.95%

Since the mean bursting pressure of manufacturer B's bags is higher, these
bags may be regarded as more standard. However, the bags of manufacturer A
may be suggested for purchase, as they are more consistent because their CV
is significantly smaller than that of manufacturer B's bags.

If the buyer would not like to buy bags having more than 16 kgs. bursting
pressure then:

x A = 14.15; σ A = 0.74; CVA = 5.23%

x B = 13.64; σ B = 1.12; CVB = 8.21%

In case the buyer would not like to buy bags having more than 16 kgs
bursting pressure, then the average bursting pressure of manufacturer A's bags
is higher than that of manufacturer B. The coefficient of variation is also much
smaller in the case of manufacturer A than manufacturer B. Hence, in this case,
we may suggest buying from manufacturer A.

E) x̄ = 13.97; Mo = 14.92; σ = 2.9; SKp = –0.32. Since SKp is –0.32, the
distribution is asymmetrical and negatively skewed. This is so because the mode
is greater than the mean. The absolute measure of skewness, i.e. x̄ – Mo, is
–0.95 (13.97 – 14.92). Graphically, such an asymmetrical distribution would
tend to tail off towards the left side.

F) Company X : Q1 = 262.21; Me = 304.85; Q3 = 358.84; SKB = 0.12.

Company Y : Q1 = 473.51; Me = 517.5; Q3 = 556.48; SKB = – 0.06.

9.10 TERMINAL QUESTIONS AND EXERCISES


1) What do you understand by “Variation”? Discuss the significance of measuring
variability for data analysis.
2) When would you use the range and standard deviation as a measure of
variation? Explain with suitable illustrations.
3) Explain in what ways measures of variation supplement measures of central
tendency.
4) Explain the concept of skewness. How does it help in analyzing the data?
5) Distinguish between variation and skewness. What are the objectives of
measuring them?
6) The following table relates to the daily temperatures recorded in a city in
a year. Calculate the range and quartile deviation. Which measure of
variation do you suggest for taking a decision on the type of garments to be
produced by a garment factory? Justify your suggestion.
Temperature (°C):  –30 to –20   –20 to –10   –10 to 0   0 to 10   10 to 20
No. of days:       38           190          65         42        30

7) Calculate mean deviation from the following distribution.


Profits (in lakhs) No. of firms Profits (in lakhs) No. of firms
15-25 14 45-55 14
25-35 28 55-65 32
35-45 56 65-70 26

8) A transport agency had tested the tyres of two brands A and B. The results are
given in the table below.
Life (thousand units) Brand A Brand B
15-20 6 8
20-25 15 8
25-30 10 22
30-35 16 17
35-40 13 12
40-45 9 6
45-50 11 0
i) Which brand of tyres do you suggest to the transport agency to use on their
fleet of trucks?
8) In a manufacturing firm, four employees on the same job show the following
results over a period of time.
A B C D
Mean time of completing the Job 61 70 83 80.5
(minutes)
Variance (σ2) 64 81 121 100

i) Which employee appears to be more consistent in the time he/she requires


to complete the job?
ii) Which employee appears to be faster in completing the job?
iii) Which measure did you select to answer part (i) and why?
9) The following table relates to the marks obtained at Engineering exams and
CAT examination.
i) Find which group is more homogeneous in intelligence?
ii) Which group is more skewed and why?
Engineering examination CAT examination
Marks No. of students Marks No. of students
50-100 15 750-800 45
100-150 40 800-850 80
150-200 45 850-900 78
200-250 20 900-950 55
250-300 14 950-1000 12

10) The following Table gives the No. of defects per product and its frequency.

No. of defects per product Frequency


Under 15 32
15-20 50
20-25 75
25-30 130
30-35 145
35-40 105
40-45 85
45-50 50
50 and above 20
i) What are the problems you may face in computing standard deviation from
the above data?
ii) Compute Bowley’s co-efficient of skewness and comment on its value.
iii) Do you agree that the suggested method for measuring skewness is an
appropriate method? Give reasons of your opinion.

11) The following information was obtained from the records of a factory relating
to wages, before and after settlement of a dispute.

Particulars                       Before settlement    After settlement
                                  of dispute           of dispute
No. of workers                    515                  507
Mean wage (Rs.)                   49.40                51.73
Median wage (Rs.)                 52.5                 50.00
Standard deviation of wages       10.00                11.00

i) Give as much information as you can about the distribution of wages.
ii) Comment on the gain and loss from the point of view of the workers and that
of the factory's management.
12) Students' ages in the regular (conventional) M.Com. programme and the part-
time (distance) programme of a University and an Open University are given by
the following two samples:

Regular M.Com 20 24 18 22 26 25 21 28 23 29
Distance M.Com 24 29 40 46 34 27 31 28 38 23

If homogeneity of the class is a positive factor in learning, use a measure of
relative variation to suggest which of these two groups will be easier to teach.

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

9.11 FURTHER READING


The following textbooks may be used for more in-depth study of the topics
dealt with in this unit.
Clark, T.C. and E.W. Jordon, 1998, Introduction to Business and Economic
Statistics, South-Western Publishing Co.
Gupta, S.P. and Gupta, M.P., 2000, Business Statistics, Sultan Chand & Sons:
New Delhi.
Hooda, R.P., 2001, Statistics for Business and Economics, Macmillan India
Ltd.: New Delhi.
Richard I. Levin and David S. Rubin, 2000, Statistics for Management,
Prentice-Hall of India Pvt. Ltd.: New Delhi.

UNIT 10 CORRELATION AND SIMPLE REGRESSION

STRUCTURE
10.0 Objectives
10.1 Introduction
10.2 Correlation
10.2.1 Scatter Diagram
10.3 The Correlation Coefficient
10.3.1 Karl Pearson’s Correlation Coefficient
10.3.2 Testing for the Significance of the Correlation Coefficient
10.3.3 Spearman’s Rank Correlation
10.4 Simple Linear Regression
10.5 Estimating the Linear Regression
10.5.1 Standard Error of Estimate
10.5.2 Coefficient of Determination
10.6 Difference Between Correlation and Regression
10.7 Let Us Sum Up
10.8 Key Words
10.9 Answers to Self Assessment Exercises
10.10 Terminal Questions/Exercises
10.11 Further Reading
Appendix Tables

10.0 OBJECTIVES
After studying this unit, you should be able to:

l understand the concept of correlation,


l use scatter diagrams to visualize the relationship between two variables,
l compute the simple and rank correlation coefficients between two variables,
l test for the significance of the correlation coefficient,
l use the regression analysis in estimating the relationship between dependent and
independent variables,
l use the least squares criterion to estimate the equation to forecast future values
of the dependent variable,
l determine the standard errors of estimate of the forecast and estimated
parameters,
l understand the coefficient of determination as a measure of the strength of the
association between two variables, and
l distinguish between correlation and simple regression.

10.1 INTRODUCTION
In previous units, so far, we have discussed the statistical treatment of data
relating to one variable only. In many other situations researchers and decision-
makers need to consider the relationship between two or more variables. For
example, the sales manager of a company may observe that the sales are not
the same for each month. He/she also knows that the company’s advertising
expenditure varies from year to year. This manager would be interested in
knowing whether a relationship exists between sales and advertising
expenditure. If the manager could successfully define the relationship, he/she
might use this result to do a better job of planning and to improve predictions
of yearly sales with the help of the regression technique for his/her company.
Similarly, a researcher may be interested in studying the effect of research and
development expenditure on annual profits of a firm, the relationship that exists
between price index and purchasing power etc. The variables are said to be
closely related if a relationship exists between them.

The correlation problem considers the joint variation of two measurements


neither of which is restricted by the experimenter. The regression problem
considers the frequency distribution of one variable (dependent variable) when
another variable (independent variable) is held fixed at each of several intervals.

This unit, therefore, introduces the concept of correlation and regression, some
statistical techniques of simple correlation and regression analysis. The methods
used are important to the researcher(s) and the decision-maker(s) who need to
determine the relationship between two variables for drawing conclusions and
decision-making.

10.2 CORRELATION
If two variables, say x and y vary or move together in the same or in the
opposite directions they are said to be correlated or associated. Thus,
correlation refers to the relationship between the variables. Generally, we find
the relationship in certain types of variables. For example, a relationship exists
between income and expenditure, absenteeism and production, advertisement
expenses and sales etc. Existence of the type of relationship may be different
from one set of variables to another set of variables. Let us discuss some of
the relationships with the help of Scatter Diagrams.

10.2.1 Scatter Diagram


When different sets of data are plotted on a graph, we obtain scatter
diagrams. A scatter diagram gives two very useful types of information. Firstly,
we can observe patterns between variables that indicate whether the variables
are related. Secondly, if the variables are related we can get an idea of the
type of relationship that exists. The scatter diagram may exhibit different types
of relationships. Some typical patterns indicating different correlations between
two variables are shown in Figure 10.1.

[Figure 10.1 : Possible Relationships Between Two Variables, X and Y — six
scatter plots: (a) perfect positive correlation (r = 1); (b) perfect negative
correlation (r = –1); (c) positive correlation (r > 0); (d) negative
correlation (r < 0); (e) non-linear correlation; (f) no correlation (r = 0)]

If X and Y variables move in the same direction (i.e., either both of them
increase or both decrease) the relationship between them is said to be positive
correlation [Fig. 10.1 (a) and (c)]. On the other hand, if X and Y variables
move in the opposite directions (i.e., if variable X increases and variable Y
decreases or vice-versa) the relationship between them is said to be negative
correlation [Fig. 10.1 (b) and (d)]. If Y is unaffected by any change in X
variable, then the relationship between them is said to be un-correlated [Fig.
10.1 (f)]. If the amount of variations in variable X bears a constant ratio to the
corresponding amount of variations in Y, then the relationship between them is
said to be linear-correlation [Fig. 10.1 (a) to (d)], otherwise it is non-linear
or curvilinear correlation [Fig. 10.1 (e)]. Since measuring non-linear
correlation for data analysis is far more complicated, we therefore, generally
make an assumption that the association between two variables is of the linear
type.

If the relationship is confined to two variables only, it is called simple


correlation. The concept of simple correlation can be best understood with the
help of the following illustration, which relates advertisement expenditure to sales
of a company.

Illustration 1

Table 10.1 : A Company’s Advertising Expenses and Sales Data (Rs. in crore)

Years : 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004

Advertise- 6 5 5 4 3 2 2 1.5 1.0 0.5


ment
expenses
(X)

Sales (Y) 60 55 50 40 35 30 20 15 11 10

The company’s sales manager claims the sales variability occurs because the
marketing department constantly changes its advertisement expenditure. He/she is
quite certain that there is a relationship between sales and advertising, but does
not know what the relationship is.

The different situations shown in Figure 10.1 are all possibilities for describing
the relationships between sales and advertising expenditure for the company. To
determine the appropriate relationship, we have to construct a scatter diagram
shown in Figure 10.2, considering the values shown in Table 10.1.

[Figure 10.2 : Scatter Diagram of Sales and Advertising Expenditure for a
Company — sales (Rs. crore, vertical axis, 10 to 60) plotted against
advertising expenditure (Rs. crore, horizontal axis, 1 to 6)]

Figure 10.2 indicates that advertising expenditure and sales seem to be linearly
(positively) related. However, the strength of this relationship is not known, that
is, how close do the points come to fall on a straight line is yet to be
determined. The quantitative measure of strength of the linear relationship
between two variables (here sales and advertising expenditure) is called the
correlation coefficient. In the next section, therefore, we shall study the
methods for determining the coefficient of correlation.

Self Assessment Exercise A


1) Suggest eight pairs of variables: four pairs which you expect to be positively
correlated and four which you expect to be negatively correlated.

. ..........................................................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................

2) How does a scatter diagram approach help in studying the correlation between
two variables?
. ..........................................................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................

10.3 THE CORRELATION COEFFICIENT


As explained above, the coefficient of correlation helps in measuring the degree
of relationship between two variables, X and Y. The methods which are used
to measure the degree of relationship will be discussed below.

10.3.1 Karl Pearson’s Correlation Coefficient

Karl Pearson’s coefficient of correlation (r) is one of the mathematical methods
of measuring the degree of correlation between any two variables X and Y. It is
given as:

r = [Σ(X − X̄)(Y − Ȳ)/n] / (σX σY)

The simplified formulae (which are algebraically equivalent to the above
formula) are:

1) r = Σxy / √(Σx² Σy²), where x = X − X̄, y = Y − Ȳ

Note: This formula is used when X̄ and Ȳ are integers.

2) r = [ΣXY − (ΣX)(ΣY)/n] / [√(ΣX² − (ΣX)²/n) √(ΣY² − (ΣY)²/n)]

Before we proceed to take up an illustration for measuring the degree of


correlation, it is worthwhile to note some of the following important points.

i) ‘r’ is a dimensionless number whose numerical value lies between +1 to –1. The
value +1 represents a perfect positive correlation, while the value –1 represents
a perfect negative correlation. The value 0 (zero) represents lack of correlation.
Figure 10.1 shows a number of scatter plots with corresponding values for
correlation coefficient.
ii) The coefficient of correlation is a pure number and is independent of the units of
measurement of the variables.
iii) The correlation coefficient is independent of any change in the origin and scale
of X and Y values.
Remark: Care should be taken when interpreting the correlation results.
Although a change in advertising may, in fact, cause sales to change, the fact
that the two variables are correlated does not guarantee a cause and effect
relationship. Two seemingly unconnected variables may often be highly
correlated. For example, we may observe a high degree of correlation: (i)
between the height and the income of individuals or (ii) between the size of the
shoes and the marks secured by a group of persons, even though it is not
possible to conceive them to be causally related. When correlation exists
between such two seemingly unrelated variables, it is called spurious or non-
sense correlation. Therefore we must avoid basing conclusions on spurious
correlation.

Illustration 2

Taking as an illustration, the data of advertisement expenditure (X) and sales


(Y) of a company for 10 years shown in Table 10.1, we proceed to determine
the correlation coefficient between these variables.

Solution: Table 10.2 : Calculation of Correlation Coefficient


(Rs. in crore)
2
Advertisement Sales XY X Y2
expenditure Rs. (X) Rs. (Y)
6 60 360.0 36 3600
5 55 275.0 25 3025
5 50 250.0 25 2500
4 40 160.0 16 1600
3 35 105.0 9 1225
2 30 60.0 4 900
2 20 40.0 4 400
1.5 15 22.5 2.25 225
1.0 11 11.0 1 121
0.5 10 5.0 0.25 100
ΣX = 30 ΣY = 326 ΣXY = 1288.5 ΣX2=122.50 ΣY2=13696

We know that

r = [ΣXY − (ΣX)(ΣY)/n] / [√(ΣX² − (ΣX)²/n) √(ΣY² − (ΣY)²/n)]

  = [1288.5 − (30)(326)/10] / [√(122.5 − (30)²/10) √(13696 − (326)²/10)]

  = 310.5 / 315.7

  = 0.9835

The calculated coefficient of correlation r = 0.9835 shows that there is a high


degree of association between the sales and advertisement expenditure. For this
particular problem, it indicates that an increase in advertisement expenditure is
likely to yield higher sales. If the results of the calculation show a strong
correlation for the data, either negative or positive, then the line of best fit to
that data will be useful for forecasting (it is discussed in Section 10.4 on
‘Simple Linear Regression’).

You may notice that manual calculations would be cumbersome for real-life
research work. Therefore, statistical packages like Minitab, SPSS, SAS, etc.,
may be used to calculate ‘r’ and other measures as well.
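As an illustration of such a computation, the following sketch (Python standard library only) applies the simplified formula to the advertising and sales data of Table 10.1; because the text rounds the denominator to 315.7, its value of 0.9835 differs from the unrounded result only in the fourth decimal place.

```python
from math import sqrt

x = [6, 5, 5, 4, 3, 2, 2, 1.5, 1.0, 0.5]      # advertisement expenditure
y = [60, 55, 50, 40, 35, 30, 20, 15, 11, 10]  # sales
n = len(x)

# Sums needed by the raw-score formula (these match Table 10.2)
sxy = sum(a * b for a, b in zip(x, y))  # 1288.5
sx, sy = sum(x), sum(y)                 # 30, 326
sx2 = sum(a * a for a in x)             # 122.5
sy2 = sum(b * b for b in y)             # 13696

# r = [ΣXY − ΣXΣY/n] / [√(ΣX² − (ΣX)²/n) √(ΣY² − (ΣY)²/n)]
r = (sxy - sx * sy / n) / (sqrt(sx2 - sx**2 / n) * sqrt(sy2 - sy**2 / n))
print(round(r, 4))
```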

10.3.2 Testing for the Significance of the Correlation Coefficient

Once the coefficient of correlation has been obtained from sample data one is
normally interested in asking the questions: Is there an association between the
two variables? Or with what confidence can we make a statement about the
association between the two variables? Such questions are best answered
statistically by using the following procedure.

Testing of the null hypothesis (testing hypothesis and t-test are discussed in
detail in Units 15 and 16 of this course) that population correlation coefficient
equals zero (variables in the population are uncorrelated) versus alternative
hypothesis that it does not equal zero, is carried out by using t-statistic
formula.

t = r √[(n − 2)/(1 − r²)], where r is the correlation coefficient from the sample.

Referring to the table of t-distribution for (n–2) degree of freedom, we can find
the critical value for t at any desired level of significance (5% level of
significance is commonly used). If the calculated value of t (as obtained by the
above formula) is less than or equal to the table value of t, we accept the null
hypothesis (H0), meaning that the correlation between the two variables is not
significantly different from zero.

The following example will illustrate the use of this test.

Illustration 3

Suppose, a random sample of 12 pairs of observations from a normal population


gives a correlation coefficient of 0.55. Is it likely that the two variables in the
population are uncorrelated?

Solution: Let us take the null hypothesis (H0) that the variables in the
population are uncorrelated.

Applying t-test,

t = r √[(n − 2)/(1 − r²)] = 0.55 √[(12 − 2)/(1 − 0.55²)]

  = 0.55 × 3.786 = 2.082

From the t-distribution (refer to the table given at the end of this unit) with
10 degrees of freedom, for a 5% level of significance, we see that the table
value of t0.05/2 with 10 d.f. is 2.228. The calculated value of t is less than the table value of
t. Therefore, we can conclude that this r of 0.55 for n = 12 is not significantly
different from zero. Hence our hypothesis (H0) holds true, i.e., the variables
in the population are uncorrelated.

Let us take another illustration to test the significance.
Illustration 4

A random sample of 100 pairs of observations from a normal population gives a


correlation coefficient of 0.55. Do you accept that the variables in the
population are correlated?

Solution: Let us take the hypothesis that the variables in the population are
uncorrelated. Apply the t-test:

t = r √[(n − 2)/(1 − r²)] = 0.55 √[(100 − 2)/(1 − 0.55²)]

  = 6.52

Referring to the table of the t-distribution for n − 2 = 98 degrees of freedom,
the critical value for t at a 5% level of significance, t0.05/2 with 98 d.f., is
1.99 (approximately). Since the calculated value of t (6.52) exceeds the table value
of t (1.99), we can conclude that there is statistically significant association
between the variables. Hence, our hypothesis does not hold true.
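The t-statistic used in both illustrations can be sketched as follows; this is a minimal illustration using only the Python standard library, with the critical values quoted from the t-table in the text rather than computed.

```python
from math import sqrt

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r**2)), for testing H0: no correlation."""
    return r * sqrt((n - 2) / (1 - r ** 2))

# Illustration 3: r = 0.55, n = 12; table value at 5% with 10 d.f. is 2.228
t12 = t_statistic(0.55, 12)    # about 2.08, below 2.228 -> accept H0

# Illustration 4: r = 0.55, n = 100; table value at 5% with 98 d.f. is ~1.99
t100 = t_statistic(0.55, 100)  # about 6.52, above 1.99 -> reject H0

print(round(t12, 2), round(t100, 2))
```

Note how the same r = 0.55 is insignificant in a sample of 12 yet highly significant in a sample of 100: the t-statistic grows with the sample size.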

10.3.3 Spearman’s Rank Correlation


The Karl Pearson’s correlation coefficient, discussed above, is not applicable in
cases where the direct quantitative measurement of a phenomenon under study
is not possible. Sometimes we are required to examine the extent of association
between two ordinally scaled variables such as two rank orderings. For
example, we can study efficiency, performance, competitive events, attitudinal
surveys etc. In such cases, a measure to ascertain the degree of association
between the ranks of two variables, X and Y, is called Rank Correlation. It
was developed by Edward Spearman, its coefficient (R) is expressed by the
following formula:

R = 1 − 6Σd²/(N³ − N), where N = number of pairs of ranks, and Σd² = sum of the
squares of the differences between the ranks of the two variables.

The following example illustrates the computation of rank correlation coefficient.

Illustration 5

Salesmen employed by a company were given one month training. At the end
of the training, they conducted a test on 10 salesmen on a sample basis who
were ranked on the basis of their performance in the test. They were then
posted to their respective areas. After six months, they were rated in terms of
their sales performance. Find the degree of association between them.

Salesmen:                1   2   3   4   5   6   7   8   9   10
Ranks in training (X):   7   1   10  5   6   8   9   2   3   4
Ranks on sales
performance (Y):         6   3   9   4   8   10  7   2   1   5
Solution: Table 10.3 : Calculation of Coefficient of Rank Correlation
Salesmen Ranks Secured Ranks Secured Difference
in Training on Sales in Ranks D2
X Y D = (X–Y)

1 7 6 1 1
2 1 3 –2 4
3 10 9 1 1
4 5 4 1 1
5 6 8 –2 4
6 8 10 –2 4
7 9 7 2 4
8 2 2 0 0
9 3 1 2 4
10 4 5 –1 1
ΣD2 = 24

Using the Spearman’s formula, we obtain

R = 1 − 6ΣD²/(N³ − N) = 1 − (6 × 24)/(10³ − 10)

  = 1 − 144/990 = 0.855

We can say that there is a high degree of positive correlation between the
training and sales performance of the salesmen.
Now we proceed to test the significance of the results obtained. We are
interested in testing the null hypothesis (H0) that the two sets of ranks are not
associated in the population and that the observed value of R differs from zero
only by chance. The test which is used is t-statistic.

t = R √[(n − 2)/(1 − R²)] = 0.855 √[(10 − 2)/(1 − 0.855²)]

  = 0.855 √29.74 = 4.663

Referring to the t-distribution table for 8 d.f. (n − 2), the critical value for
t at a 5% level of significance is 2.306. The calculated value of t is greater
than the table value. Hence, we reject the null hypothesis, concluding that the
performance in training and on sales are closely associated.
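The rank-correlation computation of Table 10.3 can be verified with this short sketch (Python standard library only):

```python
ranks_training = [7, 1, 10, 5, 6, 8, 9, 2, 3, 4]   # X, from Illustration 5
ranks_sales    = [6, 3, 9, 4, 8, 10, 7, 2, 1, 5]   # Y, from Illustration 5

n = len(ranks_training)

# Sum of squared rank differences, ΣD²
d2 = sum((x - y) ** 2 for x, y in zip(ranks_training, ranks_sales))

# Spearman's formula: R = 1 - 6ΣD² / (N³ - N)
R = 1 - 6 * d2 / (n ** 3 - n)
print(d2, round(R, 3))
```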

Sometimes the data relating to a qualitative phenomenon may not be available
in ranks, but in values. In such a situation the researcher must assign the
ranks to the values. Ranks may be assigned by taking either the highest value
as 1 or the lowest value as 1. But the same method must be followed in the case
of both the variables.

Tied Ranks
Sometimes there is a tie between two or more ranks in the first and/or second
series. For example, if there are two items with the same 4th rank, then
instead of awarding the 4th rank to the respective two observations, we award
4.5 [(4+5)/2] to each of the two observations, and the mean of the ranks is
unaffected. In such cases, an adjustment in the Spearman’s formula is made.
For this, Σd² is increased by (t³ − t)/12 for each tie, where t stands for the
number of observations in each tie. The formula can thus be expressed as:

R = 1 − 6[Σd² + (t³ − t)/12 + (t³ − t)/12 + …] / (N³ − N)
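The tie adjustment can be sketched as follows. The ranks here are hypothetical: two observations in series X share the 4th and 5th positions, so each receives the average rank 4.5, and (t³ − t)/12 is added to Σd² for that group of t = 2 tied ranks.

```python
from collections import Counter

x = [1, 2, 3, 4.5, 4.5, 6]  # hypothetical ranks with one tie (t = 2)
y = [2, 1, 3, 5, 4, 6]      # hypothetical ranks, no ties
n = len(x)

d2 = sum((a - b) ** 2 for a, b in zip(x, y))

def tie_correction(ranks):
    # Add (t**3 - t) / 12 for every group of t tied ranks
    return sum((t ** 3 - t) / 12 for t in Counter(ranks).values() if t > 1)

adjusted = d2 + tie_correction(x) + tie_correction(y)
R = 1 - 6 * adjusted / (n ** 3 - n)
print(round(R, 3))
```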

Self Assessment Exercise B

1) Compute the degree of relationship between the price of shares (X) and the
price of debentures (Y) over a period of 8 years by using Karl Pearson’s
formula, and test the significance (5% level) of the association. Comment on
the result.

Years: 1996 1997 1998 1999 2000 2001 2002 2003

Price of 42 43 41 53 54 49 41 55
shares:

Price of 98 99 98 102 97 93 95 94
debentures:

..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
2) Consider the above exercise and assign the ranks to price of shares and price
of debentures. Find the degree of association by applying Spearman’s formula
and test its significance.

..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
10.4 SIMPLE LINEAR REGRESSION

When we identify the fact that correlation exists between two variables, we
shall develop an estimating equation, known as a regression equation or estimating
line, i.e., a mathematical formula which helps us to estimate or predict the
unknown value of one variable from the known value of another variable. In the
words of Ya-Lun-Chou, “regression analysis attempts to establish the nature of
the relationship between variables, that is, to study the functional relationship
between the variables and thereby provide a mechanism for prediction, or
forecasting.” For example, if we have confirmed that advertisement expenditure
(independent variable) and sales (dependent variable) are correlated, we can
predict the required amount of advertising expenses for a given amount of sales,
or vice-versa. Thus, the statistical method which is used for prediction is called
regression analysis. And, when the relationship between the variables is linear,
the technique is called simple linear regression.

Hence, the technique of regression goes one step further than correlation: it
uses relationships that have held in the past as a guide to what may
happen in the future. To do this, we need the regression equation and the
correlation coefficient. The latter is used to determine whether the variables are
really moving together.

The objective of simple linear regression is to represent the relationship between
two variables with a model of the form shown below:

Yi = β0 + β1 Xi + ei

wherein

Yi = value of the dependent variable,

β0 = Y-intercept,

β1 = slope of the regression line,

Xi = value of the independent variable,

ei = error term (i.e., the difference between the actual Y value and the value
of Y predicted by the model).
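To see the role of each term, the model can be simulated: each observed Yi is the straight-line part β0 + β1Xi plus a random error ei. The sketch below is entirely hypothetical (the numbers are invented, not from the text).

```python
# Hypothetical sketch (numbers invented, not from the text): generating
# observations from Yi = b0 + b1*Xi + ei to see the role of each term.
import random

random.seed(1)
b0, b1 = 16.0, 5.8                      # assumed intercept and slope
xs = [0.8, 1.6, 2.2, 3.0, 4.0]
ys = [b0 + b1 * x + random.gauss(0, 2) for x in xs]   # ei drawn from N(0, 2)
for x, y in zip(xs, ys):
    print(f"X={x:3.1f}  systematic part={b0 + b1 * x:6.2f}  observed Y={y:6.2f}")
```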

10.5 ESTIMATING THE LINEAR REGRESSION


If we consider the two variables (X variable and Y variable), we shall have
two regression lines. They are:

i) Regression of Y on X

ii) Regression of X on Y.

The first regression line (Y on X) estimates the value of Y for a given value of X.
The second regression line (X on Y) estimates the value of X for a given value
of Y. These two regression lines will coincide if the correlation between the
variables is either perfectly positive or perfectly negative.

When we draw the regression lines with the help of a scatter diagram as
shown earlier in Fig. 10.1, we may get an infinite number of possible regression
lines for a set of data points. We must, therefore, establish a criterion for
selecting the best line. The criterion used is the Least Squares Method.
According to the least squares criterion, the best regression line is the one that
minimizes the sum of squared vertical distances between the observed (X, Y)
points and the regression line, i.e., Σ(Y − Ŷ)² is the least value, and the sum of
the positive and negative deviations is zero, i.e., Σ(Y − Ŷ) = 0. It is important
to note that the distance between (X, Y) points and the regression line is called
the ‘error’.

Regression Equations
As we discussed above, there are two regression equations, also called
estimating equations, for the two regression lines (Y on X, and X on Y). These
equations are, algebraic expressions of the regression lines, expressed as
follows:

Regression Equation of Y on X

Ŷ = a + bX

where, Ŷ is the computed value of Y (dependent variable) from the
relationship for a given X; ‘a’ and ‘b’ are constants (fixed values); ‘a’
determines the level of the fitted line at the Y-axis (Y-intercept); ‘b’ determines
the slope of the regression line; X represents a given value of the independent
variable.

The alternative simplified expression for the above equation is:

Ŷ − Ȳ = byx (X − X̄)

byx = r (σy / σx) = [ΣXY − (ΣX)(ΣY)/N] / [ΣX² − (ΣX)²/N]
Regression equation of X on Y

X̂ = a + bY

Alternative simplified expression is:

X̂ − X̄ = bxy (Y − Ȳ)

bxy = r (σx / σy) = [ΣXY − (ΣX)(ΣY)/N] / [ΣY² − (ΣY)²/N]

It is worthwhile to note that the estimated simple regression line always passes
through X̄ and Ȳ (as shown in Figure 10.3). The following illustration
shows how the estimated regression equations are obtained, and how
they are used to estimate the value of Y for a given X value.

Illustration 6

From the following 12 months’ sample data of a company, estimate the
regression lines and also estimate the value of sales when the company decided
to spend Rs. 2,50,000 on advertising during the next quarter.

(Rs. in lakh)
Advertisement
Expenditure: 0.8 1.0 1.6 2.0 2.2 2.6 3.0 3.0 4.0 4.0 4.0 4.6
Sales: 22 28 22 26 34 18 30 38 30 40 50 46

Solution:

Table 10.4: Calculations for Least Square Estimates of a Company.

(Rs. in lakh)
Advertising Sales
(X) (Y) X2 Y2 XY
0.8 22 0.64 484 17.6
1.0 28 1.00 784 28.0
1.6 22 2.56 484 35.2
2.0 26 4.00 676 52.0
2.2 34 4.84 1156 74.8
2.6 18 6.76 324 46.8
3.0 30 9.00 900 90.0
3.0 38 9.00 1,444 114.0
4.0 30 16.00 900 120.0
4.0 40 16.00 1600 160.0
4.0 50 16.00 2,500 200.0
4.6 46 21.16 2,116 211.6

ΣX=32.8 ΣY=384 ΣX2=106.96 ΣY2=13368 ΣXY=1,150.0

Now we establish the best regression line (estimated by the least square
method).

We know the regression equation of Y on X is:

Ŷ − Ȳ = byx (X − X̄)

Ȳ = 384/12 = 32;  X̄ = 32.8/12 = 2.733

byx = [ΣXY − (ΣX)(ΣY)/N] / [ΣX² − (ΣX)²/N]

byx = [1,150 − (32.8)(384)/12] / [106.96 − (32.8)²/12] = 5.801

Ŷ − 32 = 5.801 (X − 2.733)
Ŷ = 5.801X − 15.854 + 32 = 5.801X + 16.146
or Ŷ = 16.146 + 5.801X

which is shown in Figure 10.3. Note that, as said earlier, this line passes
through X̄ (2.733) and Ȳ (32).
[Figure: scatter of the observed (X, Y) points with the estimating line
Ŷ = 16.146 + 5.801X drawn through them; the vertical distances of the points
above and below the line are the positive and negative errors. X-axis:
Advertising (Rs. lakh); Y-axis: Sales (Rs. lakh).]
Figure 10.3: Least Squares Regression Line of a Company’s Advertising Expenditure
and Sales.

It is worthwhile to note that the relationship displayed by the scatter diagram
may not be the same if the estimating equation is extended beyond the data
points (values) considered in computing the regression equation.
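The least squares estimates computed above can be verified with a short script. This is an illustrative sketch, not part of the original text; it recomputes the column totals of Table 10.4 directly from the raw data.

```python
# Illustrative check (not from the text): least squares estimates for the
# advertising/sales data of Illustration 6.
x = [0.8, 1.0, 1.6, 2.0, 2.2, 2.6, 3.0, 3.0, 4.0, 4.0, 4.0, 4.6]
y = [22, 28, 22, 26, 34, 18, 30, 38, 30, 40, 50, 46]
n = len(x)

sx, sy = sum(x), sum(y)                          # 32.8 and 384
sxy = sum(a * b for a, b in zip(x, y))           # 1150.0
sxx = sum(a * a for a in x)                      # 106.96

byx = (sxy - sx * sy / n) / (sxx - sx * sx / n)  # slope
a0 = sy / n - byx * sx / n                       # intercept
print(f"Yhat = {a0:.3f} + {byx:.3f} X")          # the text's 16.146 + 5.801X, up to rounding
```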

Using Regression for Prediction

Regression, a statistical technique, is used for predictive purposes in applications
ranging from predicting demand and sales to predicting production and output
levels.
In the above illustration 6, we obtained the regression model of the company
for predicting sales which is:

Ŷ = 16.146 + 5.801X
wherein Ŷ = estimated sales for given value of X, and
X = level of advertising expenditure.

To find Ŷ, the estimate of expected sales, we substitute the specified
advertising level into the regression model. For example, if we know that the
company’s marketing department has decided to spend Rs. 2,50,000/- (X = 2.5)
on advertisement during the next quarter, the most likely estimate of sales ( Ŷ )
is :

Ŷ = 16.146 + 5.801(2.5) = 30.6485

= Rs. 30,64,850

Thus, an advertising expenditure of Rs. 2.5 lakh is estimated to generate sales
for the company to the tune of Rs. 30,64,850.

Similarly, we can also establish the best regression line of X on Y as follows:

Regression Equation of X on Y

X̂ − X̄ = bxy (Y − Ȳ)

bxy = [ΣXY − (ΣX)(ΣY)/N] / [ΣY² − (ΣY)²/N]
    = [1,150 − (32.8)(384)/12] / [13,368 − (384)²/12] = 0.093

X̂ – 2.733 = 0.093 (Y – 32)

X̂ – 2.733 = 0.093 Y – 2.976

X̂ = 2.733 – 2.976 + 0.093Y

X̂ = – 0.243 + 0.093Y
The following points about the regression should be noted:

1) The geometric mean of the two regression coefficients (byx and bxy) gives the
coefficient of correlation. That is, r = ±√(bxy × byx)

Consider the values of the regression coefficients from the previous illustration
to know the degree of correlation between advertising expenditure and sales:

r = √(0.093 × 5.801) = 0.734

2) Both the regression coefficients will always have the same sign (+ or –).

3) The coefficient of correlation will have the same sign as that of the regression
coefficients. If both are positive, then r is positive. In case both are negative, r
is also negative. For example, if bxy = –1.3 and byx = –0.65, then:

r = –√((–1.3) × (–0.65)) = –0.919, but not +0.919.

4) Regression coefficients are independent of change of origin, but not of scale.
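Properties (1) to (3) can be checked numerically with the column totals of Illustration 6. The sketch below is illustrative only, not part of the original text.

```python
# Illustrative check (not from the text) of properties (1)-(3), using the
# column totals of Illustration 6.
import math

n, sx, sy = 12, 32.8, 384.0
sxy, sxx, syy = 1150.0, 106.96, 13368.0

cov = sxy - sx * sy / n                        # 100.4
byx = cov / (sxx - sx * sx / n)                # regression coefficient of Y on X
bxy = cov / (syy - sy * sy / n)                # regression coefficient of X on Y
r = math.copysign(math.sqrt(byx * bxy), byx)   # r carries the common sign
print(round(byx, 3), round(bxy, 3), round(r, 3))   # 5.801 0.093 0.734
```

Both coefficients come out positive, so r takes the positive root, as property (3) requires.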

10.5.1 Standard Error of Estimate

Once the line of best fit is drawn, the next process in the study of regression
analysis is how to measure the reliability of the estimated regression equation.
Statisticians have developed a technique to measure the reliability of the
estimated regression equation called “Standard Error of Estimate (Se).” This Se
is similar to the standard deviation which we discussed in Unit-9 of this course.
We will recall that the standard deviation is used to measure the variability of a
distribution about its mean. Similarly, the standard error of estimate
measures the variability, or spread, of the observed values around the
regression line. We would say that both are measures of variability. The
larger the value of Se, the greater the spread of data points around the
regression line. If Se is zero, then all data points would lie exactly on the
regression line. In that case the estimated equation is said to be a perfect
estimator. The formula to measure Se is expressed as:

Se = √[Σ(Y − Ŷ)² / n]

where, Se is the standard error of estimate, Y is the observed value of the
dependent variable, Ŷ is the estimated value from the estimating equation
corresponding to each Y value, and n is the number of observations (sample size).

Let us take up an illustration to calculate Se in a given situation.

Illustration 7

Consider the following data relating to the relationship between expenditure on
research and development, and annual profits of a firm during 1998–2004.

Years: 1998 1999 2000 2001 2002 2003 2004

R&D (Rs. lakh): 2.5 3.0 4.2 3.0 5.0 7.8 6.5

Profit (Rs. lakh): 23 26 32 30 38 46 44

The estimated regression equation in this situation is found to be
Ŷ = 14.44 + 4.31X. Calculate the standard error of estimate.

Note: Before proceeding to compute Se, you may calculate the regression
equation of Y on X on your own to verify whether the given equation for the
above data is correct or not.

Solution: To calculate Se for this problem, we must first obtain the value of
Σ(Y − Ŷ)². We have done this in Table 10.5.

Table 10.5: Calculation of Σ(Y − Ŷ)²
(Rs. in lakh)

Years  Expenditure   Profit   Estimating values Ŷ         (Y − Ŷ)   (Y − Ŷ)²
       on R&D (X)    (Y)      (14.44 + 4.31X)
1998   2.5           23       14.44 + 4.31(2.5) = 25.21   –2.21      4.88
1999   3.0           26       14.44 + 4.31(3)   = 27.37   –1.37      1.88
2000   4.2           32       14.44 + 4.31(4.2) = 32.54   –0.54      0.29
2001   3.0           30       14.44 + 4.31(3)   = 27.37    2.63      6.92
2002   5.0           38       14.44 + 4.31(5)   = 35.99    2.01      4.04
2003   7.8           46       14.44 + 4.31(7.8) = 48.06   –2.06      4.24
2004   6.5           44       14.44 + 4.31(6.5) = 42.46    1.54      2.37

                                                  Σ(Y − Ŷ)² = 24.62
We can now find the standard error of estimate as follows:

Se = √[Σ(Y − Ŷ)² / n] = √(24.62 / 7) = 1.875

The standard error of estimate of annual profit is Rs. 1.875 lakh.

We also notice, as discussed in Section 10.5, that Σ(Y − Ŷ) = 0. This is one
way to verify the accuracy of the regression line fitted by the least squares
method.
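The figures of Table 10.5 can be reproduced programmatically. The sketch below is not part of the original text; it follows the text's formula, which divides by n (many textbooks divide by n − 2 instead, which gives a somewhat larger value).

```python
# Illustrative check (not from the text): standard error of estimate for
# Illustration 7, using the text's formula Se = sqrt(sum((Y - Yhat)^2) / n).
import math

x = [2.5, 3.0, 4.2, 3.0, 5.0, 7.8, 6.5]
y = [23, 26, 32, 30, 38, 46, 44]
a, b = 14.44, 4.31                       # estimated regression equation

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e * e for e in residuals)      # the residuals themselves sum to ~0
se = math.sqrt(sse / len(x))
print(round(se, 2))   # 1.88 (the text gets 1.875 after rounding each squared error)
```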

10.5.2 Coefficient of Determination

Coefficient of determination (R²) measures the percentage of variation in the
dependent variable which is explained by the independent variable. R² can take
any value between 0 and 1. It is used by many decision-makers to indicate
how well the estimated regression line fits the given (X, Y) data points. The
closer R² is to 1, the better the fit, which in turn implies greater explanatory
power of the estimated regression equation and, therefore, better prediction of
the dependent variable. On the other hand, if R² is closer to 0 (zero), it indicates
a very weak linear relationship. Thus prediction should not be based on such a
weak estimated regression. R² is given by:

R² = Explained variation / Total variation, or 1 − [Σ(Y − Ŷ)² / Σ(Y − Ȳ)²]

where Σ(Y − Ȳ)² = ΣY² − (ΣY)²/N

Note: When we employ simple regression, there is an alternative way of
computing R², as shown in the equation below:

R² = r²

where R² is the coefficient of determination and r is the simple coefficient of
correlation.

Refer to Illustration 6, where we have computed ‘r’ with the help of the
regression coefficients (bxy and byx). As an example for R²:

r = 0.734

R² = r² = (0.734)² = 0.5388

This means that 53.88 per cent of the variation in the sales (Y) can be
explained by the level of advertising expenditure (X) for the company.
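This value of R² can be checked from the sums of squares for Illustration 6. The sketch below (not part of the original text) computes r² directly; the small gap from 0.5388 is only the rounding of r to 0.734 in the text.

```python
# Illustrative check (not from the text): coefficient of determination for
# Illustration 6, computed as r squared from the corrected sums of squares.
x = [0.8, 1.0, 1.6, 2.0, 2.2, 2.6, 3.0, 3.0, 4.0, 4.0, 4.0, 4.6]
y = [22, 28, 22, 26, 34, 18, 30, 38, 30, 40, 50, 46]
n = len(x)

sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n   # 100.4
sxx = sum(a * a for a in x) - sum(x) ** 2 / n                  # about 17.31
syy = sum(b * b for b in y) - sum(y) ** 2 / n                  # 1080.0
r2 = sxy * sxy / (sxx * syy)             # equals r squared
print(round(r2, 4))   # 0.5393 (the text gets 0.5388 from the rounded r = 0.734)
```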

Self Assessment Exercise C

You are given the following data relating to age of Autos and their maintenance
costs. Obtain the two regression equations by the method of least squares and
estimate the likely maintenance cost when the age of Auto is 5 years and also
compute the standard error of estimate.

Age of Autos (yrs.):        2    4    6    8
Maintenance costs (Rs.00): 10   20   25   30

..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................

10.6 DIFFERENCE BETWEEN CORRELATION AND REGRESSION
After having an understanding about the concept and application of simple
correlation and simple regression, we can draw the difference between them.
They are:

1) The correlation coefficient ‘r’ between two variables (X and Y) is a measure of
the direction and degree of the linear relationship between them, which is mutual.
It is symmetric (i.e., rxy = ryx) and it is immaterial which of X and Y is the
dependent variable and which is the independent variable. Whereas regression
analysis aims at establishing the functional relationship between the two
variables under study, and then using this relationship to predict the value of the
dependent variable for any given value of the independent variable. It also
reflects upon the nature of the variables (i.e., which is the dependent variable
and which is the independent variable). Regression coefficients, therefore, are not
symmetric in X and Y (i.e., byx ≠ bxy).
2) Correlation need not imply cause and effect relationship between the variables
under study. But regression analysis clearly indicates the cause and effect
relationship between the variables. The variable corresponding to cause is taken
as independent variable and the variable corresponding to effect is taken as
dependent variable.
3) The correlation coefficient ‘r’ is a relative measure of the linear relationship
between the X and Y variables and is independent of the units of measurement.
It is a number lying between ±1. Whereas the regression coefficient byx (or
bxy) is an absolute measure representing the change in the value of the
variable Y (or X) for a unit change in the value of the variable X (or Y).
Once the functional form of the regression line is known, by substituting
the value of the independent variable we can obtain the value of the
dependent variable, which will be in the unit of measurement of that
variable.
4) There may be spurious (non-sense) correlation between two variables which
is due to pure chance and has no practical relevance. For example, the
correlation between the size of shoe and the income of a group of
individuals. There is no such thing as spurious regression.
5) Correlation analysis is confined only to study of linear relationship between
the variables and, therefore, has limited applications. Whereas regression
analysis has much wider applications as it studies linear as well as non-linear
relationships between the variables.

10.7 LET US SUM UP


In this unit, fundamental concepts and techniques of correlation (or association)
and simple linear regression have been discussed. Scatter diagrams, which
exhibit some typical patterns indicating different kinds of relationships have been
illustrated. A scatter plot of the variables may suggest that the two variables
are related but the value of the Karl Pearson’s correlation coefficient (r)
quantifies the degree of this association. The closer the correlation coefficient is
to ±1.0, the stronger the linear relationship between the two variables. Test for
significance of the correlation coefficient has been described. Spearman’s rank
correlation for data with ranks is outlined.

Once it is identified that correlation exists between the variables, an estimating
equation, known as a regression equation, can be developed by the least squares
method for prediction. The unit also explained a statistical measure called the
Standard Error of Estimate, used to assess the accuracy of the estimated
regression equation. Finally,
the conceptual differences between correlation and regression have been
highlighted. The techniques of correlation and regression analysis are widely
used in business decision making and data analysis.

10.8 KEY WORDS


Coefficient of Determination: The square of the correlation coefficient. A
measure that defines the proportion of variation in the dependent variable
explained by the independent variable in the regression model.
Correlation Coefficient: A quantitative measure of the linear relationship
between two variables.

Linear Relationship: The relationship between two variables described by a
straight line.

Least Squares Criterion: The criterion for determining a regression line that
minimizes the sum of squared errors.

Simple Regression Analysis: A regression model that uses one independent
variable to explain the variation in the dependent variable.

Spurious Correlation: Correlation between two variables that have no known
cause and effect connection.

Standard Error of Estimate: A measure of the dispersion of the actual Y
values around the estimated regression line.

10.9 ANSWERS TO SELF ASSESSMENT EXERCISES
B) 1. r = – 0.071
      t = – 0.1743
      For 6 degrees of freedom, the critical value of t at a 5% level of
      significance is 2.4469.

   2. R = – 0.185
      t = – 1.149
      The table value of t for 6 d.f. at a 5% level of significance is 2.4469.

C) Y on X : Ŷ = 5 + 3.25X

X on Y : X̂ = – 1.31 + 0.297Y

10.10 TERMINAL QUESTIONS/EXERCISES


1) What do you understand by the term Correlation? Distinguish between different
kinds of correlation with the help of scatter diagrams.

2) Explain the difference between Karl Pearson’s correlation coefficient and
Spearman’s rank correlation coefficient. Under what situations is the latter
preferred to the former?

3) What do you mean by Spurious Correlation?

4) What do you understand by the term regression? Explain its significance in
decision-making.

5) Distinguish between correlation and regression.

6) A personnel manager of a firm is interested in studying how the number of
workers absent on a given day is related to the average temperature on that day.
A random sample of 12 days was used for the study. The data is given below:
No. of workers absent:      6   4   8   9   3   8   5   2   4  10   7   6
Average temperature (°C):  12  30  15  18  40  30  45  35  23  15  25  35

a) State the independent variable and dependent variable.
b) Draw a scatter diagram.
c) What type of relationship appears between the variables?
d) What is the logical explanation for the observed relationship?
7) The following table gives the demand and price for a commodity for 6 days.
Price (Rs.): 4 3 6 9 12 10
Demand (mds): 46 65 50 30 15 25

a) Obtain the value of the correlation coefficient and test its significance at the
5% level.
b) Develop the estimating regression equations.
c) Compute the standard error of estimate.
d) Predict Demand for price (Rs.) = 5, 8, and 11.
e) Compute the coefficient of determination and give your comment on the
distribution.
8) Two judges have ranked 10 students in order of their merit in a competition.

Students:            A  B  C  D  E   F  G  H  I  J
Rank by Ist judge:   5  2  4  1  8   9  7  6  3  10
Rank by IInd judge:  1  9  7  8  10  2  4  5  3  6

Find out whether the judges are in agreement with each other or not and apply
the t-test for significance at 5% level.

9) A sales manager of a soft drink company is studying the effect of its latest
advertising campaign. People chosen at random were called and asked how
many bottles they had bought in the past week and how many advertisements
of this product they had seen in the past week.

No. of ads (X):          4   0   2   7   3   4   2   6
Bottles purchased (Y):   6   5   4  16  10   9   6  14
a) Develop the estimating equation that best fits the data and test its
accuracy.
b) Calculate correlation coefficient and coefficient of determination.
c) Predict Y value when X = 5.

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
10.11 FURTHER READING
A number of good text books are available for the topics dealt with in this unit.
The following books may be used for more indepth study.
Richard I. Levin and David S. Rubin, 1996, Statistics for Management.
Prentice Hall of India Pvt. Ltd., New Delhi.
Peters, W.S. and G.W. Summers, 1968, Statistical Analysis for Business
Decisions, Prentice Hall, Englewood-cliffs.
Hooda, R.P., 2000, Statistics for Business and Economics, MacMillan India
Ltd., New Delhi.
Gupta, S.P., 1989, Elementary Statistical Methods, Sultan Chand & Sons,
New Delhi.
Chandan, J.S., Statistics for Business and Economics, Vikas Publishing
House Pvt. Ltd., New Delhi.
APPENDIX : TABLE OF t-DISTRIBUTION AREAS
The table gives points of the t-distribution corresponding to degrees of freedom (m)
and the upper tail area α (suitable for use in one-tail tests).

Values of tα,m

m α 0.1 0.05 0.025 0.01 0.005

1 3.078 6.3138 12.706 31.821 63.657


2 1.886 2.9200 4.3027 6.965 9.9248
3 1.638 2.3534 3.1825 4.541 5.8409
4 1.533 2.1318 2.7764 3.747 4.6041
5 1.476 2.0150 2.5706 3.365 4.0321
6 1.440 1.9432 2.4469 3.143 3.7074
7 1.415 1.8946 2.3646 2.998 3.4995
8 1.397 1.8595 2.3060 2.896 3.3554
9 1.383 1.8331 2.2622 2.821 3.2498
10 1.372 1.8125 2.2281 2.764 3.1693
11 1.363 1.7959 2.2010 2.718 3.1058
12 1.356 1.7823 2.1788 2.681 3.0545
13 1.350 1.7709 2.1604 2.650 3.0123
14 1.345 1.7613 2.1448 2.624 2.9768
15 1.341 1.7530 2.1315 2.602 2.9467
16 1.337 1.7459 2.1199 2.583 2.9208
17 1.333 1.7396 2.1098 2.567 2.8982
18 1.330 1.7341 2.1009 2.552 2.8784
19 1.328 1.7291 2.0930 2.539 2.8609
20 1.325 1.7247 2.0860 2.528 2.8453
21 1.323 1.7207 2.0796 2.518 2.8314
22 1.321 1.7171 2.0739 2.508 2.8188
23 1.319 1.7139 2.0687 2.500 2.8073
24 1.318 1.7109 2.0639 2.492 2.7969
25 1.316 1.7081 2.0595 2.485 2.7874
(Contd…)
m α 0.10 0.05 0.025 0.01 0.005

26 1.315 1.7056 2.0555 2.479 2.7787


27 1.314 1.7033 2.0518 2.473 2.7707
28 1.313 1.7011 2.0484 2.467 2.7633
29 1.311 1.6991 2.0452 2.462 2.7564
30 1.310 1.6973 2.0423 2.457 2.7500
35 1.3062 1.6896 2.0301 2.438 2.7239
40 1.3031 1.6839 2.0211 2.423 2.7045
45 1.3007 1.6794 2.0141 2.412 2.6896
50 1.2987 1.6759 2.0086 2.403 2.6778
60 1.2959 1.6707 2.0003 2.390 2.6603
70 1.2938 1.6669 1.9944 2.381 2.6480
80 1.2922 1.6641 1.9945 2.374 2.6388
90 1.2910 1.6620 1.9901 2.364 2.6316
100 1.2901 1.6602 1.9867 2.364 2.6260
120 1.2887 1.6577 1.9840 2.358 2.6175
140 1.2876 1.6558 1.9799 2.353 2.6114
160 1.2869 1.6545 1.9771 2.350 2.6070
180 1.2863 1.6534 1.9749 2.347 2.6035
200 1.2858 1.6525 1.9733 2.345 2.6006
∞ 1.282 1.645 1.96 2.326 2.576

UNIT 11 TIME SERIES ANALYSIS
STRUCTURE

11.0 Objectives
11.1 Introduction
11.2 Definition and Utility of Time Series Analysis
11.3 Components of Time Series
11.4 Decomposition of Time Series
11.5 Preliminary Adjustments
11.6 Methods of Measurement of Trend
11.6.1 Freehand Method
11.6.2 Least Square Method
11.7 Let Us Sum Up
11.8 Key Words
11.9 Answers to Self Assessment Questions
11.10 Terminal Questions/Exercises
11.11 Further Reading

11.0 OBJECTIVES
After studying this unit, you should be able to:

l define the concept of time series,


l appreciate the role of time series in short-term forecasting,
l explain the components of time series, and
l estimate the trend values by different methods.

11.1 INTRODUCTION
In the previous units, you have learnt statistical treatment of data collected for
research work. The nature of data varied from case to case. You have come
across quantitative data for a group of respondents collected with a view to
understanding one or more parameters of that group, such as investment, profit,
consumption, weight etc. But when a nation, state, an institution or a business
unit etc., intend to study the behaviour of some element, such as price of a
product, exports of a product, investment, sales, profit etc., as they have
behaved over a period of time, the information shall have to be collected for a
fairly long period, usually at equal time intervals. Thus, a set of any quantitative
data collected and arranged on the basis of time is called ‘Time Series’.
Depending on the research objective, the unit of time may be a decade, a year,
a month, or a week etc. Typical time series are the sales of a firm in
successive years, monthly production figures of a cement mill, daily closing
price of shares in Bombay stock market, hourly temperature of a patient.

Usually, the quantitative data of the variable under study are denoted by y1, y2,
...yn and the corresponding time units are denoted by t1, t2......tn. The variable
‘y’ shall have variations, as you will see ups and downs in the values. These
changes account for the behaviour of that variable.

Instantly it comes to our mind that ‘time’ is responsible for these changes, but
this is not true. Because, the time (t) is not the cause and the changes in the
variable (y) are not the effect. The only fact, therefore, which we must
understand is that there are a number of causes which affect the variable and
have operated on it during a given time period. Hence, time becomes only the
basis for data analysis.

Forecasting any event helps in the process of decision making. Forecasting is
possible if we are able to understand the past behaviour of that particular
activity. For understanding the past behaviour, a researcher needs not only the
past data but also a detailed analysis of the same. Thus, in this unit we will
discuss the need for analysis of time series, fluctuations of time series which
account for changes in the series over a period of time, and measurement of
trend for forecasting.

11.2 DEFINITION AND UTILITY OF TIME SERIES


ANALYSIS
Based on the above discussion we can understand the definitions given by a
few statisticians. They are:
“A time series consists of statistical data which are collected, recorded over
successive increments”.
“When quantitative data are arranged in the order of their occurrence, the
resulting statistical series is called a time series”.
The analysis of time series is of great utility not only to research workers but
also to economists, businessmen and scientists etc., for the following reasons:
1) It helps in understanding past behaviour of the variables under study.
2) It facilitates forecasting the future behaviour with the help of the changes
that have taken place in the past.
3) It helps in planning future course of action.
4) It helps in knowing current accomplishment.
5) It is helpful in making comparisons between different time series and drawing
significant conclusions therefrom.
Thus we can say that the need for time series analysis arises in research
because:
l we want to understand the behaviour of the variables under study,
l we want to know the expected quantitative changes in the variable
under study, and
l we want to estimate the effect of various causes in quantitative terms.
In a nutshell, the time series analysis is not only useful for researchers, business
research institutions, but also for Governments for devising appropriate future
growth strategies.

11.3 COMPONENTS OF TIME SERIES


If you are informed that the price of one kilogram of sunflower oil was Rs. 0.50
in the year 1940, Rs. 30 in the year 1980, and Rs. 70 in the year 2004, and if
you are asked this question: shall sunflower oil be sold again in the future for
either Rs. 0.50 or Rs. 30 per kg? Surely, your answer would be ‘No’.

Another question: Shall sunflower oil be sold again in future for Rs. 60 per
kg? No doubt, your answer would be ‘Yes’. Have you ever thought about how
you answered the above two questions? Probably you have not! The analysis of
these answers shall lead us to arrive at the following observations:

– There are several causes which affect the variable gradually and permanently.
Therefore we are prompted to answer ‘No’ for the first question.
– There are several causes which affect the variable for the time being only. For
this reason we are prompted to answer ‘Yes’ for the second question.

The causes which affect the variable gradually and permanently are termed as
“Long-Term Causes”. The examples of such causes are: increase in the rate of
capital formation, technological innovations, the introduction of automation,
changes in productivity, improved marketing etc. The effect of long term causes
is reflected in the tendency of a behaviour, to move in an upward or downward
direction, termed as ‘Trend’ or ‘Secular Trend’. It reveals as to how the time
series has behaved over the period under study.

The causes which affect the variables for the time being only are labelled as
“Short-Term Causes”. The short term causes are further divided into two parts,
they are ‘Regular’ and ‘Irregular’. Regular causes are further divided into two
parts, namely ‘cyclical causes’ and ‘seasonal causes’. The cyclical variations
are also termed as business cycle fluctuations, as they influence the variable. A
business cycle is composed of prosperity, recession, depression and recovery.
The periodic movements from prosperity to recovery and back again to
prosperity vary both in time and intensity. The seasonal causes, like weather
conditions, business climate and even local customs and ceremonies together
play an important role in giving rise to seasonal movements to almost all the
business activities. For instance, the yearly weather conditions directly affect
agricultural production and marketing.

It is worthwhile to say that the seasonal variations analysis will be possible only
if the season-wise data are available. This fact must be checked first. For
analysing the seasonal effects various methods are available. Among them
seasonal index by ‘Ratio to Moving Average Method’ is the most widely used.
However, if collected data provides only yearly values, there is no possibility of
obtaining seasonal variations. Therefore, the residual amount after eliminating
trend will be the effect of irregular or random causes.
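The first step of the ratio-to-moving-average method mentioned above can be sketched in code. The sketch below (Python, purely illustrative) computes a centred 4-quarter moving average and each quarter's percentage ratio to it, reusing the first eight series (O) values of Table 11.1; a full seasonal index would further average these ratios quarter by quarter.

```python
# Centred 4-quarter moving average and the ratio-to-moving-average step.
# Quarterly figures are the first eight series (O) values of Table 11.1.

def centred_ma4(values):
    """Centred 4-period moving average (first value aligns with index 2)."""
    ma4 = [sum(values[i:i + 4]) / 4 for i in range(len(values) - 3)]
    return [(a + b) / 2 for a, b in zip(ma4, ma4[1:])]

quarters = [79, 58, 84, 107, 130, 93, 121, 161]
cma = centred_ma4(quarters)
ratios = [100 * q / m for q, m in zip(quarters[2:], cma)]
print([round(r, 1) for r in ratios])  # per-cent ratios of O to the moving average
```

Because the average is centred, the first two and last two quarters have no ratio; the remaining ratios reflect the seasonal swing around the smoothed series.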

Irregular causes are also termed as ‘Erratic’ or ‘Random’ causes. Random
variations are caused by infrequent occurrences such as wars, strikes,
earthquakes, floods, etc. These causes push the values either very deep
downwards or very high upwards.

The foregoing paragraphs have, in a way, led us to enumerate the components
of the time series. These components form the basis for ‘Time Series Analysis’.

Long-term causes  : Secular Trend or Trend (T)
Short-term causes :
    Regular             : Cyclical (C)
                        : Seasonal (S)
    Irregular or Random : Erratic (I)

11.4 DECOMPOSITION OF TIME SERIES


Decomposition and analysis of a time series are one and the same thing. The
original data or observed data ‘O’ is the result of the effects generated by the
long-term and short-term causes, namely, (1) Trend = T, (2) cyclical = C, (3)
seasonal = S, and (4) Irregular = I. Finding out the values for each of the
components is called decomposition of a time series. Decomposition is done
either by the Additive model or the Multiplicative model of analysis. Which of
these two models is to be used in analysis of time series depends on the
assumption that we might make about the nature and relationship among the
four components.

Additive Model: It is based on the assumption that the four components are
independent of one another. Under this assumption, the pattern of occurrence
and the magnitude of movements in any particular component are not affected
by the other components. In this model the values of the four components are
expressed in the original units of measurement. Thus, the original data or
observed data, ‘Y’ is the total of the four component values, that is,

Y=T+S+C+I

where T, S, C, and I represent the trend variations, seasonal variations, cyclical
variations, and erratic variations, respectively.

Multiplicative Model: It is based on the assumption that the causes giving
rise to the four components are interdependent. Thus, the original data or
observed data ‘Y’ is the product of the four component values, that is:

Y=T×S×C×I

In this model the values of all the components, except trend values, are
expressed as percentages.

In business research, the multiplicative model is normally better suited and more
frequently used for the analysis of time series, because data relating to business
and economic time series are the result of the interaction of a number of factors
which individually cannot be held responsible for generating any specific type of
variation.

Let us consider an example of the construction of a time series according to the
Multiplicative Model. Table 11.1 presents trend, seasonal, and cyclical-erratic
components of a hypothetical series.

Table 11.1: Hypothetical time series and its components (quarterly)

Year  Quarter  Series (O)  Trend (T)  Seasonal (100S)  Cyclical-erratic (100CI)

1     1        79          80         120              82
      2        58          85         80               85
      3        84          90         92               102
      4        107         95         108              105

2     1        130         100        120              108
      2        93          105        80               132
      3        121         110        92               120
      4        161         115        108              130

3     1        216         120        120              150
      2        132         125        80               132
      3        150         130        93               125
      4        163         135        108              112

4     1        176         140        120              105
      2        112         145        80               97
      3        128         150        93               93
      4        142         155        108              85


According to the multiplicative model,
Y = T × S × C × I

Thus, 79 (year 1, quarter 1) = 80 × (120/100) × (82/100)

130 (year 2, quarter 1) = 100 × (120/100) × (108/100)
Thus each quarterly figure (Y) is the product of T, S, and CI. Such a
synthetic composition looks like an actual time series and has encouraged use
of the model as the basis for the analysis of time series data.
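The multiplicative composition above is easy to verify in code. The short sketch below (Python, illustrative only) recomputes the observed series values of Table 11.1 from their components:

```python
# Multiplicative model: Y = T * (S/100) * (CI/100), using the hypothetical
# values of Table 11.1 (seasonal and cyclical-erratic components are
# expressed as percentages of the trend).

def compose(trend, seasonal_pct, cyclical_erratic_pct):
    """Combine components under the multiplicative model."""
    return trend * (seasonal_pct / 100.0) * (cyclical_erratic_pct / 100.0)

# (T, 100S, 100CI) for year 1 quarter 1 and year 2 quarter 1 of Table 11.1
print(round(compose(80, 120, 82)))    # 79, the observed series value
print(round(compose(100, 120, 108)))  # 130
```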

11.5 PRELIMINARY ADJUSTMENTS


Before we proceed with the task of analysing time series data, it is
necessary to make the relevant adjustments to the raw data. They are:

1) Calendar variations: As we are aware, all the calendar months do not
have the same number of days. For instance, production in the month of
February may be less than in other months because of fewer days, and if we
take the holidays into account the variation is greater. Therefore, adjustments
for calendar variations have to be made.

2) Price changes: As price level changes are inevitable, it is necessary to
convert monetary values into real values after taking into consideration the
price indices. In fact this is the process of deflating, which will be discussed in
Unit-12 (Index Numbers) of this course.

3) Population changes: Population grows constantly. This also calls for
adjustment in the data for the population changes. In such cases, if necessary,
per capita values may be computed (dividing original figures by the total
population).
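As a rough sketch, the price-change and population-change adjustments both amount to simple element-wise division. The sales, price index, and population figures below are invented purely for illustration.

```python
# Preliminary adjustments in code: deflating monetary values by a price
# index and converting totals to per-capita figures. All figures are
# hypothetical, for illustration only.

sales = [120.0, 150.0, 200.0]          # Rs. lakh, at current prices
price_index = [100.0, 110.0, 125.0]    # base year = 100
population = [1.00, 1.02, 1.05]        # in crores

# Deflate: real value = money value * 100 / price index
real_sales = [s * 100.0 / p for s, p in zip(sales, price_index)]
# Per capita: divide original figures by total population
per_capita = [s / n for s, n in zip(sales, population)]

print([round(v, 1) for v in real_sales])   # [120.0, 136.4, 160.0]
print([round(v, 1) for v in per_capita])
```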

Self Assessment Exercise A

1) State whether the following statements are ‘True’ or ‘False’.

a) Time is the cause for the ups and downs in the values of the variable under
study.
b) The variable under study in time series analysis is denoted by ‘y’.
c) ‘Trend’ values are a major component of the time series.
d) Analysis of time series helps in knowing current accomplishment.
e) Weather conditions, customs, habits etc., are causes for cyclical variations.
f) The analysis of time series is done to know the expected quantity
change in the variable under study.

2) Why do we analyse a time series?


..................................................................................................................
..................................................................................................................
..................................................................................................................

3) List out the components of a time series.

..................................................................................................................
..................................................................................................................
..................................................................................................................
11.6 METHODS OF MEASUREMENT OF TREND
The effect of long-term causes is seen in the trend values we compute. A
trend is also known as a ‘secular trend’ or ‘long-term trend’. There are
several methods of isolating the trend, of which we shall discuss only the two
most frequently used in business and economic time series data analysis: the
Free Hand Method and the Method of Least Squares.

11.6.1 Free Hand Method


In this method, the first requirement is that a graph is drawn of the original
data. After plotting the data on the graph paper, without the help of any
numerical calculations, a free hand straight line is drawn through the graph
ensuring that it passes (as closely as possible) through the middle of the entire
graph of the time series. This is, thus, the easiest and quickest method of
estimating secular trend. Even though the straight line is drawn on personal
judgments, it requires a careful inspection of the overall behaviour of
movements in that time series graph.

Though this method is very simple, it is not commonly accepted because it
gives varying trend values for the same data when drawn by different persons,
or even by the same person at different times. It is to be noted that the
free-hand method is highly subjective; different researchers may draw different
trend lines from the same data set. Hence, it is not advisable to use it as a
basis for forecasting, particularly when the time series is subject to very
irregular movements. Let us consider an illustration of drawing a trend line by
the free-hand method.

Illustration 1

From the following data, find the trend line by using the Free Hand (graphic)
Method.

Years:                               1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
Foodgrain production (lakh tonnes):  35   55   40   85   135  110  130  150  130  120
[Figure: the original data plotted against the years 1994–2003, with a free-hand
trend line drawn through the middle of the series; Y-axis: Production (lakh tonnes).]

Fig. 1: Food Grain Production (in lakh tons)
11.6.2 Least Square Method

This is also known as straight line method. This method is most commonly used
in research to estimate the trend of time series data, as it is mathematically
designed to satisfy two conditions. They are:

1) ∑(Y – Yc) = 0, and

2) ∑(Y – Yc)² = least

The straight line method gives a line of best fit to the given data. The straight
line which satisfies the above conditions, expressed as a regression equation,
is given by:

Yc = a + bx

where, ‘Yc’ represents the trend value of the time series variable y, ‘a’ and ‘b’
are constant values of which ‘a’ is the trend value at the point of origin
and ‘b’ is the amount by which the trend value changes per unit of
time, and ‘x’ is the unit of time (value of the independent variable).

The values of constants, ‘a’ and ‘b’, are determined by the following two
normal equations.

∑y = na + b∑x .................(i)

∑xy = a∑x + b∑x² .............(ii)

The process of finding the values of the constants a and b can be made simple
by using a shortcut method, that is, by choosing the origin year in such a way
that the total of ‘x’ (∑x) equals zero. This becomes possible if we take the
median year as the origin period, so that the negative values in the first half of
the series balance out the positive values in the second half. The earlier normal
equations then change as follows, with reference to ∑x = 0:

∑y = na (as b∑x becomes zero)

∑xy = b∑x² (as a∑x becomes zero)

Therefore, the values of the two constants are obtained by the following formulae:

a = ∑y/N, and b = ∑xy/∑x²

It is to be noted that when the number of time units involved is even, the point
of origin will have to be chosen between the two middle time units.

Let us consider an illustration to understand the procedure for estimation of the
trend by using the method of least squares.

Illustration 2

The decision-making body of a fertilizer firm wants to predict the future sales
trend for the years 2006 and 2008 based on an analysis of its past sales pattern.
The firm's sales for the last 7 years are given below:

Years     Sales (in ’000 tonnes)

1998 70
1999 75
2000 90
2001 98
2002 85
2003 91
2004 100

Solution: To find the straight line equation (Yc = a + bx) for the given time
series data, we have to substitute the values into the expressions already
arrived at, that is:

a = ∑y/N, and b = ∑xy/∑x²

In order to make the total of x equal to zero, we must take the median year
(i.e., 2001) as origin. Study the following table carefully to understand the
procedure for fitting the straight line.

Table 11.2: Computation of Trend

Year   Sales (’000 tons)   x    x²    xy     Trend (Yc) = a + bx

1998   70                  –3   9     –210   74.5
1999   75                  –2   4     –150   78.6
2000   90                  –1   1     –90    82.8
2001   98                   0   0      0     87.0
2002   85                   1   1      85    91.2
2003   91                   2   4      182   95.4
2004   100                  3   9      300   99.5

N = 7  ∑y = 609            ∑x = 0  ∑x² = 28  ∑xy = 117   609.0

a = ∑y/N = 609/7 = 87;  b = ∑xy/∑x² = 117/28 = 4.18

Thus, the straight line trend equation is

Yc = 87 + 4.18x

From the above equation, we can also find the monthly increase in sales. The
trend values increase by a constant amount ‘b’ every year, so the annual
increase in sales is 4.18 thousand tons (4,180 tons), and the monthly increase is:

4,180 / 12 = 348.33 tons

Trend values are to be obtained as follows:
Y1998 = 87 + 4.18 (–3) = 74.5
Y1999 = 87 + 4.18 (–2) = 78.6 and so on ........
Predicting with decomposed components of the time series: The
management wants to estimate fertiliser sales for the years 2006 and 2008.

For the estimation of sales for 2006, ‘x’ would be 5 (because for 2004 ‘x’ was 3):
Y2006 = 87 + 4.18 (5) = 107.9 thousand tonnes.
For the estimation of sales for 2008, ‘x’ would be 7:
Y2008 = 87 + 4.18 (7) = 116.3 thousand tonnes.
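The shortcut procedure of Illustration 2 translates directly into code. The following sketch (Python, illustrative only) reproduces the fitted constants and the 2006 forecast from the sales data above:

```python
# Least-squares trend fit using the shortcut of coding the median year as
# x = 0, so that sum(x) = 0 and hence a = sum(y)/n, b = sum(xy)/sum(x^2).
# Data are the fertilizer sales of Illustration 2.

years = [1998, 1999, 2000, 2001, 2002, 2003, 2004]
sales = [70, 75, 90, 98, 85, 91, 100]   # in '000 tonnes

n = len(years)
origin = years[n // 2]                  # median year, 2001
x = [yr - origin for yr in years]       # -3 ... +3, sums to zero

a = sum(sales) / n
b = sum(xi * yi for xi, yi in zip(x, sales)) / sum(xi * xi for xi in x)

print(round(a, 2), round(b, 2))           # 87.0 4.18
# Forecast for 2006 (x = 5): Yc = a + b*x
print(round(a + b * (2006 - origin), 1))  # 107.9 thousand tonnes
```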

Self Assessment Exercise B

1) State whether the following statements are ‘True’ or ‘False’.


a) The free hand method gives different straight lines for the same data when
efforts are made by different persons.
b) The multiplicative model is based on the assumption that the causes giving
rise to the four components are dependent.
c) A free hand curve is drawn without any numerical calculations for trend
estimation.
d) The total of the difference between original data and trend values (obtained
by straight line method) will never be zero.
e) In the least square trend equation Yc = a + bx, if b is positive it indicates a
rising trend.
f) The additive model of time series analysis is expressed as: Y = T + S + C + I.
2) Enumerate the methods of isolating trend.
................................................................................................................
................................................................................................................
................................................................................................................

3) Foodgrain production (in lakh tonnes) is given below (figures are
imaginary). Find the trend by using a) the Graphic (free hand) method, and
b) the Straight Line Method. Tabulate the trend values. c) Predict the
production for the year 2010.

Years Production
1996 40
1997 60
1998 45
1999 83
2000 130
2001 135
2002 150
2003 120
2004 200
Years    y    x    x²    xy    yc

11.7 LET US SUM UP


This unit has introduced you to the concept of time series and its analysis,
with a view to making more accurate and reliable forecasts for the future.

A set of quantitative data arranged on the basis of TIME is referred to as a
‘Time Series’. The analysis of time series is done to understand the dynamic
conditions for achieving the short-term and long-term goals of institution(s).
With the help of the techniques of time series analysis, the future pattern can
be predicted on the basis of past trends.

The quantitative values of the variable under study are denoted by y1, y2, y3, ...
and the corresponding time units are denoted by x1, x2, x3, ... . The variable ‘y’
shall have variations; you will see ups and downs in its values. There are a
number of causes during a given time period which affect the variable.
Therefore, time becomes the basis of analysis. Time is not the cause, and the
changes in the values of the variable are not its effect.

The causes which affect the variable gradually and permanently are termed as
Long-term causes. The causes which affect the variable only for the time being
are termed as Short-term causes. The time series are usually the result of the
effects of one or more of the four components. These are trend variations (T),
seasonal variations (S), Cyclical variations (C) and Irregular variations (I).

When we try to analyse the time series, we try to isolate and measure the
effects of various kinds of these components on a series.

We have two models for analysing time series:

1) The additive model considers the given values of the overall time series data
to be the sum of the various components; symbolically it is expressed as:
Y = T + C + S + I.

2) The multiplicative model assumes that the various components interact in a
multiplicative manner to produce the given values of the overall time series
data; symbolically it is expressed as: Y = T × C × S × I.

The trend analysis brings out the effect of long-term causes. There are
different methods of isolating trends, among these we have discussed only two
methods which are usually used in research work, i.e. free hand and least
square methods.

Long-term predictions can be made on the basis of trends, and only the least
square method of trend computation offers this possibility.

11.8 KEY WORDS


Time Series : is the data on any variable accumulated at regular time
intervals.
Secular Trend : A type of variation in a time series, the long-term tendency
of a time series to grow or decline over a period of time.
Seasonal Variation : Patterns of change in a time series within a year and the
same changes tend to be repeated from year to year.
Cyclical Variations : A type of variation in a time series, in which the values
of variables vary up and down around the secular trend line.
Irregular Variations : A type of element of a time series, refers to such
variations in business activity which do not repeat according to a definite
pattern and the values of variables are completely unpredictable.

11.9 ANSWERS TO SELF ASSESSMENT EXERCISES


A) 1) a) False b) True c) True d) True e) False f) True
3) Secular trend, Seasonal variation, Cyclical variation, and Irregular
variation
B) 1) a) True b) False c) True d) True e) False f) True
3) Yc = 107 + 18.03x
Estimated production for 2010 is 287.3 lakh tonnes.

11.10 TERMINAL QUESTIONS/EXERCISES


1) What is time series? Why do we analyse a time series?
2) Explain briefly the components of time series.
3) Explain briefly the additive and multiplicative models of time series. Which
of these models is more commonly used and why?
4) From the following data, obtain the trend line by Freehand Method for
further analysis.
Years 1996 1997 1998 1999 2000 2001 2002 2003

‘y’ 24 28 38 33 49 50 66 68

5) The production (in thousand tons) in a sugar factory during 1994 to 2001
has been as follows:
Year        1994  1995  1996  1997  1998  1999  2000  2001

Production  35    38    49    41    56    58    76    75
(Hint: The point of origin must be taken between 1997 and 1998).
i) Find the trend values by applying the method of least square.
ii) What is the monthly increase in production?
iii) Estimate the production of sugar for the year 2008.

6) The following data relates to a survey of used car sales in a city for the
period 1993–2001. Predict sales for 2006 by using the linear trend
equation.

Years 1993 1994 1995 1996 1997 1998 1999 2000 2001

Sales 214 320 305 298 360 450 340 500 520

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

11.11 FURTHER READING


A number of good textbooks are available for the topics dealt with in this unit.
The following books may be used for more in-depth study.
Montgomery, D.C. and L.A. Johnson, 1996, Forecasting and Time Series
Analysis, McGraw Hill: New York.
Chandan, J.S., 2001, Statistics for Business and Economics, Vikas Publishing
House Pvt. Ltd., New Delhi.
Gupta, S.P. and H.P. Gupta, 2001, Business Statistics, S. Chand, New Delhi.

UNIT 13 PROBABILITY AND PROBABILITY RULES
STRUCTURE

13.0 Objectives
13.1 Introduction
13.2 Meaning and History of Probability
13.3 Terminology
13.4 Fundamental Concepts and Approaches to Probability
13.5 Probability Rules
13.5.1 Addition Rule for Mutually Exclusive Events
13.5.2 Addition Rule for Non-mutually Exclusive Events
13.6 Probability Under Statistical Independence
13.7 Probability Under Statistical Dependence
13.8 Bayes’ Theorem: Revision of A- Priori Probability
13.9 Let Us Sum Up
13.10 Key Words
13.11 Answers to Self Assessment Exercises
13.12 Terminal Questions/Exercises
13.13 Further Reading

13.0 OBJECTIVES
After studying this unit, you should be able to:
l comprehend the concept of probability,
l acquaint yourself with the terminology related to probability,
l understand the probability rules and their application in determining probability,
l differentiate between determination of probability under the condition of
statistical independence and statistical dependence,
l apply probability concepts and rules to real life problems, and
l appreciate the relevance of the study of probability in decision making.

13.1 INTRODUCTION
In the previous units we have discussed the application of descriptive statistics.
The subject matter of probability and probability rules provides a foundation for
Inferential Statistics. There are various business situations in which decision
makers are forced to apply the concepts of probability. Decision making in
various situations is facilitated through formal and precise expressions of the
uncertainties involved. For instance, formal and precise expression of stock
market price and product quality uncertainties may go a long way in helping to
analyse, and facilitate decisions on, portfolio and sales planning respectively.
Probability theory provides us with the means to arrive at precise expressions
for taking care of uncertainties involved in different situations.

This unit starts with the meaning of probability and its brief historical evolution.
The next section covers the fundamental concepts of probability as well as
three approaches to determining probability. These approaches are: i) the
Classical approach; ii) the Relative frequency of occurrence approach; and
iii) the Subjective approach.
Thereafter, the addition rule for probability is explained for both mutually
exclusive events and non-mutually exclusive events. Proceeding further, the unit
addresses important aspects of probability rules under the conditions of
statistical independence and statistical dependence. The concepts of marginal,
joint, and conditional probabilities are explained with suitable examples.

13.2 MEANING AND HISTORY OF PROBABILITY


Generally, in day-to-day conversation the words, probability, possible, chance,
likelihood etc., are commonly used. You may have a rough idea of what is
meant by these words. For example, we may come across the statements like:
the train may come late today, the chance of winning the cricket match etc. It
means there is uncertainty about the happening of the event(s). We live in a
world where we are unable to forecast the future with complete certainty. Our
need to cope with uncertainty leads us to the study and use of probability. In
statistics, the term probability is established by definition and is not related to
beliefs.

The concept of probability is as old as civilisation itself. As you know, gambling
is an age-old malaise. Gamblers have used the probability concept to make
bets. The probability theory was first applied to gambling and later to other
socio-economic problems. It was then applied to the insurance industry, which
evolved in the 19th century, to determine the premium to be charged on the
basis of probabilistic estimates of the life expectancy of the insurance policy
holder. Consequently, the study of probability was initiated at many learning
centres to equip students with a tool for better understanding of many
socio-economic phenomena. Lately, quantitative analysis has become the
backbone of statistical application in business decision making and research.

If conditions of certainty alone were to prevail, life would be much simpler.
As is obvious, there are numerous real life situations in which
conditions of uncertainty and risk prevail. Consequently, we have to rely on the
theory of chance or probability in order to have a better idea about the possible
outcomes. There are social, economic and business sectors in which decision
making becomes a real challenge for the managers. They may be in the dark
about the possible consequences of their decisions and actions. Due to
increasing competitiveness, the stakes have become higher and the cost of
making a wrong decision has become enormous.

13.3 TERMINOLOGY
Before we proceed to discuss the fundamental concepts and approaches to
determining probability, let us now acquaint ourselves with the terminology
relevant to probability.

i) Random Experiment: A set of activities performed repetitively under
homogeneous conditions constitutes a random experiment. It results in various
possible outcomes. An experiment, therefore, may be a single-trial, two-trial, or
n-trial experiment. It may, thus, be noted that an experiment is defined in terms
of the nature of the trial and the number of times the trial is repeated.


ii) Trial and Events: To conduct an experiment once is termed a trial, while the
possible outcomes or combinations of outcomes are termed events. For
example, the toss of a coin is a trial, and the occurrence of either a head or a
tail is an event.

iii) Sample Space: The set of all possible outcomes of an experiment is called the
sample space for that experiment. For example, in a single throw of a die, the
sample space is {1, 2, 3, 4, 5, 6}.

iv) Collectively Exhaustive Events: This is the set of all possible events that can
result from an experiment. It is obvious that the sum of the probability values
of these events will always be one. For example, in a single toss of a fair
coin, the collectively exhaustive events are head and tail. Since
P(H) = 0.5 and P(T) = 0.5


∴ P(H) + P(T) = 0.5 + 0.5 = 1.0

v) Mutually Exclusive Events: Two events are said to be mutually exclusive
if the occurrence of one event implies no possibility of occurrence of the
other event. For example, in throwing an unbiased die, the occurrence of one
number at the top prevents the occurrence of the other numbers.

vi) Equally Likely Events: When all the possible outcomes of an experiment
have an equal probability of occurrence, such events are called equally likely
events. For example, in the case of the toss of a fair coin, we have already
seen that
P(Head) = P(Tail) = 0.5

Many common experiments in real life can have events with all of the above
properties. The best example is that of a single toss of a coin, where the two
possible outcomes, a head or a tail coming on top, are collectively exhaustive,
mutually exclusive and equally likely events.
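For the single toss of a fair coin, these three properties can be checked mechanically. The short Python sketch below is illustrative only:

```python
# Checking the three properties for the coin-toss sample space: the events
# are collectively exhaustive (they cover the whole sample space and their
# probabilities sum to 1), mutually exclusive (no shared outcomes), and
# equally likely (all have equal probability).

sample_space = {"H", "T"}
events = [{"H"}, {"T"}]
probs = {frozenset(e): len(e) / len(sample_space) for e in events}

exhaustive = set().union(*events) == sample_space
exclusive = events[0].isdisjoint(events[1])
equally_likely = len(set(probs.values())) == 1

print(exhaustive, exclusive, equally_likely)  # True True True
print(sum(probs.values()))                    # 1.0
```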

13.4 FUNDAMENTAL CONCEPTS AND APPROACHES TO PROBABILITY
Let us, now, discuss the concepts and approaches to determine and interpret
probability. There are two fundamental concepts of probability. They are:

(i) The value of the probability of any event lies between 0 and 1. This may be
expressed as follows:
0 ≤ P (Event) ≤ 1
If the probability of an event is equal to zero, the event is never expected to
occur, and if the probability is equal to one, the event is always expected to
occur.
(ii) The sum of the simple probabilities for all possible outcomes of an activity must
be equal to one.
Before proceeding further, let us discuss the different approaches to defining
the concept of probability.

Approaches to Probability
There are three approaches to determine probability. These are :
a) Classical Approach: The classical approach to defining probability is based
on the premise that all possible outcomes or elementary events of an
experiment are mutually exclusive and equally likely. The term equally likely
means that each of all the possible outcomes has an equal chance of
occurrence. Hence, as per this approach, the probability of occurrence of any
event ‘E’ is given as:

P(E) = No. of outcomes where the event occurs [n(E)] / Total no. of all possible outcomes [n(S)]

This approach is also known as ‘A Priori’ probability because, when we are
performing the experiment using a fair coin, a standard pack of cards, or an
unbiased die, we can tell in advance the probability of the happening of some
event. We need not perform the actual experiment to find the required
probability.

Example: When we toss a fair coin, the probability of getting a head would be:

P(Head) = Total number of favourable outcomes / Total no. of all possible outcomes = 1/(1+1) = 1/2

Similarly, when a die is thrown, the probability of getting an odd number is 3/6 or 1/2.
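The classical definition is simply a ratio of counts. A minimal sketch (Python, using exact fractions to avoid rounding) of the two examples above:

```python
# Classical (a priori) probability: P(E) = n(E) / n(S), assuming all
# outcomes in the sample space are mutually exclusive and equally likely.

from fractions import Fraction

def classical_probability(favourable, sample_space):
    """Count-based probability for equally likely outcomes."""
    return Fraction(len(favourable), len(sample_space))

die = [1, 2, 3, 4, 5, 6]
odd = [v for v in die if v % 2 == 1]
print(classical_probability(odd, die))           # 1/2

# Single toss of a fair coin
print(classical_probability(["H"], ["H", "T"]))  # 1/2
```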

The premise that all outcomes are equally likely assumes that the outcomes are
symmetrical. Symmetrical outcomes are possible when the coin or die being
tossed is fair. This requirement restricts the application of probability only to
such experiments as give rise to symmetrical outcomes. The classical approach,
therefore, provides no answer to problems involving asymmetrical outcomes,
and we come across such situations quite often in real life.

Thus, the classical approach to probability suffers from the following limitations:
i) the approach is not useful when the events cannot be considered ‘equally
likely’; ii) it fails to deal with questions like: what is the probability that a bulb
will burn out within 2,000 hours? What is the probability that a female will die
before the age of 50 years? etc.

b) Relative Frequency of Occurrence: The classical approach offers no
answer to probability in situations where the outcomes of an experiment lack
symmetry. Experiments of throwing a die which is loaded, or of tossing a coin
which is biased, fall beyond the purview of the classical approach, since these
experiments do not generate equally likely or symmetrical outcomes. In such
cases the probability remains undefined. Failure of the classical approach on
this count has made way for the relative frequency approach. It was in the
early 1800s that several British statisticians started defining probability from
statistical data collected on births and deaths. This concept was then widely
used for calculating the risk of loss of life as well as in commercial insurance.
According to this approach, probability is the observed relative frequency of
an event in a very large number of trials, when the conditions are stable. This
method is utilised by taking the relative frequencies of past occurrences as
probabilities.

Example: Suppose an insurance company knows from available actuarial data
that, of all males 50 years old, 30 out of 10,000 die within a one-year period.

From this available data the company can estimate the probability of death for
that age group as ‘P’, where

P = 30/10,000 = 0.003

Similarly, if the availability of fair-complexioned girls is 1 in 20, then the
probability for that group is: P = 1/20 = 0.05.

This approach too has limited practical utility, because the computation of
probability requires repetition of an experiment a large number of times. This is
particularly limiting where an event occurs only once, so that repetitive
occurrence under precisely the same conditions is neither possible nor desirable.
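A short sketch of the relative-frequency idea, reusing the actuarial figures above; the simulated coin toss (the seed and trial count are arbitrary choices) shows the observed frequency settling near 0.5 as the number of trials grows:

```python
# Relative-frequency probability: estimate P(event) as the observed
# proportion of occurrences in a large number of trials.

import random

def relative_frequency(occurrences, trials):
    """Observed relative frequency of an event."""
    return occurrences / trials

# 30 deaths observed among 10,000 males aged 50 within one year
print(relative_frequency(30, 10_000))   # 0.003

# Simulated coin tosses: the observed frequency approaches 0.5
random.seed(42)                          # arbitrary seed, for repeatability
tosses = [random.choice("HT") for _ in range(100_000)]
print(round(tosses.count("H") / len(tosses), 2))  # close to 0.5
```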

c) Subjective Probability: The subjective approach to defining probability was
introduced by Frank Ramsey in 1926 in “The Foundation of Mathematics
and Other Logical Essays”. The subjective approach makes up for
determining the probability of events where experiments cannot be performed
repeatedly under the same conditions. According to this approach, probability
is based on the experience and the judgement of the person making the
estimate. This may differ from person to person, depending on one's
perception of the situation and past experience. Subjective probability can thus
be defined as probability based on the available evidence. Sometimes logic and
past data are not useful in determining the probability value; in those cases the
subjective assessment of the assessor is used to find that probability. This
approach is so flexible that it may be applied in a number of situations where
the earlier two approaches fail to offer a satisfactory answer.

Example: Suppose a committee of experts constituted by the Government of
India is faced with the decision of whether to allow construction of a nuclear
power plant on a site close to a densely populated area. They will have to
assign a subjective probability to answer the question: "What is the probability
of a nuclear accident at the site?" Obviously, neither of the previously discussed
approaches can be applied in this case.

Importantly, these three approaches complement one another, because where one
fails, another takes over. However, all are identical in as much as probability
is defined as a ratio or a weight assigned to the likelihood of occurrence of
an event.

Self Assessment Exercise A

1) What does it mean when:

i) the probability of an event is zero.
ii) the probability of an event is one.
iii) all the possible outcomes of an experiment have an equal chance of
occurrence.
iv) two events cannot occur simultaneously and have no elements common
to both.
2) You are being asked to draw one card from a deck of 52 cards. In these 52
cards, 4 categories exist, namely: spade, club, diamond, and heart (i.e. 13 each).
You can test your understanding of the concept by going through the
following cases, writing down yes/no in the second and third columns given
below:

Draws                              Mutually Exclusive    Collectively Exhaustive
a) Draw a spade and a club
b) Draw an ace and a three
c) Draw a club and a non-club
d) Draw a 5 and a diamond

13.5 PROBABILITY RULES


The related terms and concepts defining probability, which we have discussed
above, are needed to develop probability rules for different types of events.

The following rules of probability are useful for calculating the probability of an
event/events under different situations.

13.5.1 Addition Rule for Mutually Exclusive Events


If two events, A and B, are mutually exclusive, then the probability of
occurrence of either A or B is given by the following formula:

P(A or B) = P(A ∪ B) = P(A) + P(B)

(Since A and B are mutually exclusive, P(A ∩ B) = 0, so no overlap term is
needed.) This rule is depicted in Figure 13.1 below.

Figure 13.1: Two mutually exclusive events, P(A) and P(B), shown as non-overlapping regions

The essential requirement for any two events to be mutually exclusive is that
there are no outcomes common to the occurrence of both. This condition is
satisfied when the sample space does not contain any outcome favourable to
the occurrence of both A and B, i.e., A ∩ B = φ.

Let us consider the following illustration to understand the concept.

Illustration 1: In a game of cards, where a pack contains 52 cards, 4
categories exist, namely spade, club, diamond, and heart. If you are asked to
draw a card from this pack, what is the probability that the card drawn belongs
to either the spade or the club category?

Solution: Here, P (Spade or Club) = P (Spade) + P (Club)

where P (Spade) = 13/52 = 1/4 and P (Club) = 13/52 = 1/4

∴ P (Spade or Club) = 1/4 + 1/4 = 1/2
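As a quick numerical check (our own sketch, not part of the original text), the addition rule for mutually exclusive events can be verified directly:

```python
# Addition rule for mutually exclusive events: P(A or B) = P(A) + P(B).
# Spade and club are mutually exclusive categories of a 52-card deck.
p_spade = 13 / 52
p_club = 13 / 52
p_spade_or_club = p_spade + p_club
print(p_spade_or_club)  # 0.5
```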

There is an important special case: for any event E, either E happens or it
does not. So, the events E and not E are exhaustive and mutually exclusive.

So, P (E) + P (not E) = 1

or, P (E) = 1 – P (not E)

Sometimes P (not E) is also written as P (E′), so that P (E′) = 1 – P (E)

and P (E) = 1 – P (E′).
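A one-line numeric sketch of the complement rule (our own example values, continuing the card setting):

```python
# Complement rule: P(E) + P(not E) = 1, so P(not E) = 1 - P(E).
p_spade = 13 / 52          # probability of drawing a spade
p_not_spade = 1 - p_spade  # probability of drawing any other card
print(p_not_spade)  # 0.75
```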

13.5.2 Addition Rule for Non-Mutually Exclusive Events


Non-mutually exclusive (overlapping) events present another significant variant
of the additive rule. Two events (A and B) are not mutually exclusive if they
have some outcomes common to the occurrence of both; in that case the above
rule has to be modified in order to account for the overlapping area, as is clear
from Figure 13.2 below.

Figure 13.2: Two overlapping events, P(A) and P(B), with common region P(A and B) = P(A ∩ B)

In this situation, the probability of occurrence of event A or event B is given


by the formula

P (A or B) = P (A ∪ B) = P (A) + P (B) – P (A and B)

where P (A and B) is the joint probability of events A and B, i.e., both


occuring together and is usually written as P (A ∩ B).

Thus, it is clear that the probability of outcomes that are common to both the
events is to be subtracted from the sum of their simple probability.

Consider the following illustrations to understand the application of this concept.

Illustration 2: Find the probability of drawing either a Jack or a spade from
a well-shuffled deck of playing cards.

Solution: These events are not mutually exclusive, so the required probability
of drawing a Jack or a spade is given by:

P (Jack or Spade) = P (Jack) + P (Spade) – P (Jack and Spade)


= 4/52 + 13/52 − 1/52 = 16/52 = 4/13
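The non-mutually-exclusive case can also be verified by brute-force enumeration of the deck. This sketch is our own (it uses the numeric rank 11 to stand for the Jack), counting the favourable outcomes directly:

```python
from itertools import product

# Enumerate a 52-card deck and count cards that are a Jack or a spade.
ranks = list(range(1, 14))                 # 1 = ace, ..., 11 = jack, 12 = queen, 13 = king
suits = ["spade", "club", "diamond", "heart"]
deck = list(product(ranks, suits))         # 52 (rank, suit) pairs

favourable = [card for card in deck if card[0] == 11 or card[1] == "spade"]
print(len(favourable), "/", len(deck))     # 16 / 52, i.e. 4/13
```

Counting avoids double-counting the Jack of spades automatically, which is exactly what subtracting P(Jack and Spade) does in the formula.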

Illustration 3: The employees of a certain company have to elect one out of
five of their members for the employee–management relationship committee.
Details of these five were as follows:

Sex       Age
Male      40
Female    20
Male      32
Female    45
Male      30

Find the probability of selecting a representative that would be either male or


over 35.

Solution: P (Male or over 35) = P (Male) + P (over 35) – P (Male and over 35)

= 3/5 + 2/5 – 1/5 = 4/5

Self Assessment Exercise B


1) An urn contains 75 marbles: 35 are blue, and 25 of these blue marbles are
swirled. The rest of them are red, and 30 of the red ones are swirled. The
marbles that are not swirled are clear. What is the probability of drawing:

a) A blue marble from the urn?


b) A clear marble from the urn?
c) A blue, swirled marble?
d) A red, clear marble?
e) A swirled marble?
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................

Proceeding further with the multiplication rule, it is pertinent to discuss the


concept of statistical independency and statistical dependency of events.

13.6 PROBABILITY UNDER STATISTICAL


INDEPENDENCE
Two or more events are termed statistically independent events if the
occurrence of any one event does not have any effect on the occurrence of
any other event. For example, if a fair coin is tossed once and a head comes
up, this event has no effect in any way on the outcome of the second toss
of that same coin. Similarly, the result obtained by drawing a heart from a pack
of cards has no effect in any way on the result obtained by throwing a die.
Such events are termed statistically independent events. There are three
types of probability under the statistically independent case:
a) Marginal Probability;
b) Joint Probability;
c) Conditional Probability.

Let us discuss each one of them, one by one.

a) Marginal Probability Under Statistical Independence


A Marginal/Simple/Unconditional probability is the probability of the occurrence
of an event. For example, in a fair coin toss, probability of having a head is:

P (H) = ½ = 0.5

Therefore, the marginal probability of an event (i.e. having a head) is 0.5.


Since, the subsequent tosses are independent of each other, therefore, it is a
case of statistical independence.

Another example: in a throw of a fair die, the marginal probability of the face
bearing the number 3 is:
P(3) = 1/6 ≈ 0.166

Since, the tosses of the die are independent of each other, this is a case of
statistical independence.

b) Joint Probability Under Statistical Independence


This is also termed the "Multiplication Rule of Probability". In many
situations we are interested in finding out the probability of two or more events
occurring either together or in quick succession; for this purpose, the concept
of joint probability is used.

The joint probability of two or more statistically independent events occurring
together is determined by the product of their marginal probabilities. The
corresponding formula may be expressed as:

P (A and B) = P (A) × P (B)

Similarly, it can be extended to more than two events also as:

P (A and B and C) = P (A) × P (B) × P (C) and so on.

i.e. P (A and B and C and …) = P (A) × P (B) × P (C) × …


For instance, when a fair coin is tossed twice in quick succession, the
probability of a head occurring in both tosses is:

P(H1 and H2) = P (H1) × P (H2)


= 0.5 × 0.5 = 0.25

Where, H1 is the occurrence of head in 1st toss, and H2 is the occurrence of


head in 2nd toss.

Take another example: when a fair die is thrown twice in quick succession,
the probability of having 2 in the 1st throw and 4 in the 2nd throw is given as:

P (2 in 1st throw and 4 in 2nd throw)
= P (2 in the 1st throw) × P (4 in the 2nd throw)
= 1/6 × 1/6 = 1/36 ≈ 0.028
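A numeric sketch of the multiplication rule for independent events, using the values from the two examples above (our own code, not part of the original text):

```python
# Multiplication rule for independent events: P(A and B) = P(A) * P(B).

# Two heads in two tosses of a fair coin:
p_head = 0.5
p_two_heads = p_head * p_head
print(p_two_heads)           # 0.25

# A 2 on the first throw and a 4 on the second throw of a fair die:
p_2_then_4 = (1 / 6) * (1 / 6)
print(round(p_2_then_4, 3))  # 0.028
```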

c) Conditional Probability Under the Condition of Statistical


Independence
The third type of probability under the condition of statistical independence is
the Conditional Probability. It is symbolically written as P (A/B), i.e., the
conditional probability of occurrence of event A, on the condition that event B
has already occurred.

In case of statistical independence, the conditional probability of any event is


akin to its marginal probability, when both the events are independent of each
other.

Therefore, P (A/B) = P (A), and


P (B/A) = P (B).

For example, suppose we want to find the probability of a head coming up
in the second toss of a fair coin, given that the first toss has already resulted in
a head. Symbolically, we can write it as:
P (H2/H1)
As the two tosses are statistically independent of each other,
P (H2/H1) = P (H2)
The following table 13.1 summarizes these three types of probabilities, their
symbols and their mathematical formulae under statistical independence.

Table 13.1

Type of Probability    Symbol     Formula

Marginal               P (A)      P (A)
Joint                  P (AB)     P (A) × P (B)
Conditional            P (B/A)    P (B)

For example, suppose the first child of a couple is a girl and we want to find
the probability that the second child will be a boy. In this case:
P (B/G) = P (B)
As both events are independent of each other, the conditional probability of
having the second child as a boy, on the condition that the first child was a girl,
is equal to the marginal probability of having the second child as a boy.

For example, take the case of rain in different states of India. Suppose, the
probability of having rain in different states of India is given as:

P (rainfall in Bihar) = 0.8


P (rainfall in Uttar Pradesh) = 0.75
P (rainfall in Rajasthan) = 0.4
P (rainfall in Gujarat) = 0.5

Then, finding the probability of rainfall in Gujarat, on the condition that
during this period Bihar receives heavy rainfall, is a case of statistical
independence, as the two events (rain in Gujarat and rain in Bihar) are quite
independent of each other. So, this conditional probability is equal to the
marginal probability of rainfall in Gujarat (which is equal to 0.5).

For example, an urn contains 3 white balls and 7 black balls. We draw a ball
from the urn, replace it, and then draw a second ball. Now, we have to
find the probability of drawing a black ball in the second draw, on the condition
that the ball drawn in the first attempt was a white one.

This is a case of conditional probability, but it is equal to the marginal (simple)
probability, as the two draws are independent events.

∴P (B/W) = P (B) = 7/10 = 0.7
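Because the draws are made with replacement, a simulation should show the conditional relative frequency settling near the marginal probability 0.7. This is our own illustrative sketch, not part of the original text:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Urn with 3 white and 7 black balls; draws WITH replacement are independent,
# so P(black on 2nd draw | white on 1st draw) should equal P(black) = 0.7.
urn = ["W"] * 3 + ["B"] * 7
first_white = 0
black_after_white = 0
for _ in range(100_000):
    first, second = random.choice(urn), random.choice(urn)
    if first == "W":
        first_white += 1
        if second == "B":
            black_after_white += 1

estimate = black_after_white / first_white
print(round(estimate, 2))  # close to 0.7
```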

Self Assessment Exercise C

1) Which of the following pair of events are statistically independent ?

a) The time until the failure of a watch and of a second watch marketed by
different companies – yes/no
b) The life span of the current Indian PM and that of current Pakistani
President – Yes/no.
c) The takeover of a company and a rise in the price of its stock – Yes/no.

2) What is the probability that a couple’s second child will be

a) A boy, given that their first child was a girl?


b) A girl, given that their first child was a girl?
..................................................................................................................

3) What is the probability that in selecting two cards one at a time from a deck
with replacement, the second card is

a) A face card, given that the first card was red?


b) An ace, given that the first card was a face card?
c) A black jack, given that the first card was a red ace?
..................................................................................................................
..................................................................................................................
..................................................................................................................

4) A bag contains 32 marbles: 4 are red, 9 are black, 12 are blue, 6 are yellow and
1 is purple. Marbles are drawn one at a time with replacement. What is the
probability that:

a) The second marble is yellow given the first one was yellow?
b) The second marble is yellow given the first one was black?
c) The third marble is purple given both the first and second were purple?

..................................................................................................................
..................................................................................................................

13.7 PROBABILITY UNDER STATISTICAL


DEPENDENCE
Two or more events are said to be statistically dependent, if the occurrence of
any one event affects the probability of occurrence of the other event.

There are three types of probability under statistical dependence case. They
are:

a) Conditional Probability;
b) Joint Probability;
c) Marginal Probability
Let us discuss the concept of the three types.

a) Conditional Probability Under Condition of Statistical Dependence


We shall first discuss the concept of conditional probability, as it forms the
basis for the concepts of joint and marginal probability under statistical
dependence.

The conditional probability of event A, given that event B has already
occurred, can be calculated as follows:

P (A/B) = P (AB) / P (B)

where P (AB) is the joint probability of events A and B.

This formula also forms the basis of Bayes' Theorem, discussed in Section 13.8.

Let us consider the following illustration to understand this concept.

Illustration 4: A box contains 10 balls, which have the following
distribution on the basis of colour and pattern:

• 3 are coloured and dotted.
• 1 is coloured and stripped.
• 2 are grey and dotted.
• 4 are grey and stripped.

a) Suppose someone draws a coloured ball from the box. Find the
probability that it is (i) dotted and (ii) stripped.

Solution: The problem can be expressed as P (D/C), i.e., the conditional
probability that the ball drawn is dotted, given that it is coloured.

Now, from the information given in the question:

(i) P (CD) = 3/10 = joint probability of the drawn ball being both coloured
and dotted.

Similarly, P (CS) = 1/10, P (GD) = 2/10, and P (GS) = 4/10

So, P (D/C) = P (DC) / P (C)

where, P(C) = Probability of drawing a coloured ball from the box = 4/10 (4
coloured balls out of 10 balls).

∴ P (D/C) = (3/10) / (4/10) = 0.75

ii) Similarly, P (S/C) = conditional probability of drawing a stripped ball,
knowing that it is a coloured one

= P (SC) / P (C) = (1/10) / (4/10) = 0.25

Thus, the probability that the drawn ball is dotted, given that it is coloured, is
0.75. Similarly, the probability that it is stripped, given that it is coloured, is 0.25.
b) Continuing the same illustration, suppose we wish to find (i) P (D/G) and
(ii) P (S/G).

Solution: i) P (D/G) = P (DG) / P (G) = (2/10) / (6/10) = 1/3 ≈ 0.33

where P (G) = total probability of grey balls, i.e., 6/10, and

ii) P (S/G) = P (SG) / P (G) = (4/10) / (6/10) = 2/3 ≈ 0.66

c) Similarly, to calculate (i) P (G/D) and (ii) P (C/D):

Solution: i) P (G/D) = P (GD) / P (D) = (2/10) / (5/10) = 0.4

and ii) P (C/D) = P (CD) / P (D) = (3/10) / (5/10) = 0.6

d) If we wish to find (i) P (C/S) and (ii) P (G/S):

Solution: i) P (C/S) = P (CS) / P (S) = (1/10) / (5/10) = 0.2

ii) P (G/S) = P (GS) / P (S) = (4/10) / (5/10) = 0.8
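All of these conditional probabilities can be cross-checked by simply counting balls. The sketch below is our own; it encodes each ball of Illustration 4 as a two-letter label, C/G for colour and D/S for pattern:

```python
# The box of Illustration 4, one two-letter label per ball:
# C = coloured, G = grey; D = dotted, S = stripped.
balls = ["CD"] * 3 + ["CS"] * 1 + ["GD"] * 2 + ["GS"] * 4   # 10 balls

n_coloured = sum(1 for b in balls if b[0] == "C")   # 4
n_grey = sum(1 for b in balls if b[0] == "G")       # 6
n_dotted = sum(1 for b in balls if b[1] == "D")     # 5

# Conditional probability by counting: restrict to the given event,
# then take the fraction of it that also satisfies the target event.
print(balls.count("CD") / n_coloured)   # P(D/C) = 0.75
print(balls.count("CS") / n_coloured)   # P(S/C) = 0.25
print(balls.count("GD") / n_grey)       # P(D/G) ≈ 0.33
print(balls.count("CD") / n_dotted)     # P(C/D) = 0.6
```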

b) Joint Probability Under the Condition of Statistical Dependence

This is an extension of the multiplication rule of probability involving two or
more events, which was discussed in the previous Section 13.6 for calculating
the joint probability of two or more events under the condition of statistical
independence.

The formula for calculating the joint probability of two events under the
condition of statistical dependence is derived from the formula for conditional
probability. Therefore, the joint probability of two statistically dependent events
A and B is given by the following formula:

P (AB) = P (A/B) × P (B)

or P (BA) = P (B/A) × P (A)

depending upon whether order of occurrence of two events is B, A or A, B.

Since P (AB) = P (BA), the products on the RHS of the two formulae must
also be equal to each other.

∴ P (A/B) × P (B) = P (B/A) × P (A)

Notice that under conditions of statistical independence this formula reduces
to P (BA) = P (B) × P (A). Continuing with our previous Illustration 4, of a
box containing 10 balls, the values of the different joint probabilities can be
calculated as follows:

Converting the above general formula i.e., P (AB) = P (A/B) × P (B) into our
illustration and to the terms coloured, dotted, stripped, and grey, we would have
calculated the joint probabilities of P (CD), P (GS), P (GD), and P (CS) as
follows:

i) P (CD) = P (C/D) × P (D) = 0.6 × 0.5 = 0.3

ii) P (GS) = P (G/S) × P (S) = 0.8 × 0.5 = 0.4

iii) P (GD) = P (G/D) × P (D) = 0.4 × 0.5 = 0.2

iv) P (CS) = P (C/S) × P (S) = 0.2 × 0.5 = 0.1

Note: The values of P (C/D), P (G/S), P (G/D), and P (C/S) have been already
computed in conditional probability under statistical dependence.
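A numeric sketch of these joint probabilities (our own check, reusing the conditional and marginal values worked out above):

```python
# Joint probability under dependence: P(AB) = P(A/B) * P(B).
p_d = 0.5   # marginal probability of a dotted ball
p_s = 0.5   # marginal probability of a stripped ball

p_cd = 0.6 * p_d   # P(CD) = P(C/D) * P(D)
p_gs = 0.8 * p_s   # P(GS) = P(G/S) * P(S)
p_gd = 0.4 * p_d   # P(GD) = P(G/D) * P(D)
p_cs = 0.2 * p_s   # P(CS) = P(C/S) * P(S)
print(p_cd, p_gs, p_gd, p_cs)  # 0.3 0.4 0.2 0.1
```

These agree with the direct counts (3/10, 4/10, 2/10, 1/10), as they should.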
c) Marginal Probability Under the Condition of Statistical Dependence
Finally, we discuss the concept of marginal probability under the condition of
statistical dependence. It can be computed by summing up all the probabilities
of those joint events in which that event occurs whose marginal probability we
want to calculate.

Illustration 5: Consider the previous Illustration 4. Compute the marginal
probability, under statistical dependence, of the events: i) a dotted ball occurs,
ii) a coloured ball occurs, iii) a grey ball occurs, and iv) a stripped ball occurs.

Solution: i) We can obtain the marginal probability of the event 'dotted balls'
by adding the probabilities of all the joint events in which dotted balls occur.

P (D) = P (CD) + P (GD) = 3/10 + 2/10 = 0.5

In the same manner, we can compute the marginal probabilities of the
remaining events as follows:

ii) P (C) = P (CD) + P(CS) = 3/10 + 1/10 = 0.4

iii) P (G) = P (GD) + P (GS) = 2/10 + 4/10 = 0.6

iv) P (S) = P (CS) + P (GS) = 1/10 + 4/10 = 0.5
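A counting sketch for the marginal probabilities (our own code; summing the joint counts before dividing keeps the arithmetic exact):

```python
# Marginal probability under dependence: sum the joint probabilities of
# every joint event in which the target event occurs.
counts = {"CD": 3, "CS": 1, "GD": 2, "GS": 4}    # joint counts out of 10 balls
total = sum(counts.values())                      # 10

p_dotted = (counts["CD"] + counts["GD"]) / total      # 0.5
p_coloured = (counts["CD"] + counts["CS"]) / total    # 0.4
p_grey = (counts["GD"] + counts["GS"]) / total        # 0.6
p_stripped = (counts["CS"] + counts["GS"]) / total    # 0.5
print(p_dotted, p_coloured, p_grey, p_stripped)  # 0.5 0.4 0.6 0.5
```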
The following Table 13.2 summarizes the three types of probabilities, their
symbols, and their mathematical formulae under statistical dependence.

Table 13.2: Probabilities under Statistical Dependence

Probability Type    Symbol        Formula

Marginal            P (A)         Sum of the probabilities of the joint
                                  events in which 'A' occurs
Joint               P (AB)        P (A/B) × P (B)
                    or P (BA)     P (B/A) × P (A)
Conditional         P (A/B)       P (AB) / P (B)
                    or P (B/A)    P (BA) / P (A)

Self Assessment Exercise D

1) According to a survey, the probability that a family owns two cars, given
that its annual income is greater than Rs. 35,000, is 0.75. Of the households
surveyed, 60 per cent had incomes over Rs. 35,000 and 52 per cent had two
cars. What is the probability that a family has two cars and an income over
Rs. 35,000 a year?
..................................................................................................................
..................................................................................................................
..................................................................................................................

..................................................................................................................
2) Given that P(A) = 3/14, P (B) = 1/6, P(C) = 1/3, P (AC) = 1/7 and P (B/C)
= 5/21, find the following probabilities: P (A/C), P (C/A), P (BC), P (C/B).
..................................................................................................................
..................................................................................................................
..................................................................................................................

..................................................................................................................
3) At a restaurant, a social worker gathers the following data. Of those visiting
the restaurant, 59 per cent are male, 32 per cent are alcoholics, and 21 per
cent are male alcoholics. What is the probability that a random male visitor to
the restaurant is an alcoholic?
..................................................................................................................
..................................................................................................................
..................................................................................................................

13.8 BAYES’ THEOREM: REVISION OF A-PRIORI


PROBABILITY
As discussed earlier, the basic objective of calculating probabilities is to facilitate
us in decision-making. For example, assume that you are a seller of winter
garments. Obviously, you are interested in the demand for winter garments. To
help you in deciding on the amount you should stock for this winter, you have
computed the probability of selling different quantities and have noted that the
chance of selling a certain quantity is very high. Accordingly, you have taken
the decision to stock a large quantity of the product. Suppose that when the
season finally ends, you find that you are left with a large quantity of stock.
You then feel that the earlier probability calculation should be revised, given
the new experience, to help you decide on the stock for the next winter. Similar
situations exist where we are interested in an event on an on-going basis.
Every time some new information is available, we revise our probability. This
revision of probability with added information is formalised in the probability
theory in terms of a theorem called Bayes’ theorem. Bayes’ theorem offers a
powerful statistical method of combining our evaluation of new information as
well as our prior estimate of the probability to create Posterior Probability.

Thus, probabilities before revision by Bayes' rule are termed priori
probabilities. Probabilities which have undergone revision in the light of
additional information by Bayes' rule are termed posterior, or revised,
probabilities.

Bayes’ theorem can be illustrated by the following figure.

Priori probability + New information → Bayes' Process → Posterior probabilities

(Source: Quantitative Analysis for Management; Render & Stair)

The origin of the concept of computing posterior probabilities with additional


information is attributable to the Reverend Thomas Bayes (1702-1761). Bayes’
theorem is based on the formula for conditional probability under statistical
dependence, discussed in section 13.7, i.e.,

P (A/B) = P (AB) / P (B)

It is worthwhile to note that revised probabilities are, thus, always conditional


probabilities.

Let us consider the following illustration to understand the application of Bayes’


theorem.

Illustration 6: Suppose a box contains a fair (unbiased) die and a loaded
(biased) die. Naturally, the probability of having 3 on a roll of the fair die is
1/6 ≈ 0.166. However, suppose the same probability on the loaded die is 0.60.
We do not know which die is loaded and which is fair. We select one die and
roll it. The result comes out as 3. Now we have new information, which can
be used to find the probability that the die rolled was (i) fair, (ii) loaded.
Here, we have
P (L) = 0.5; P (F) = 0.5; P (3/F) = 0.166; and P (3/L) = 0.600.

Here, we are going to calculate the joint probabilities of P (3 and F) and P (3


and L), using the formula:
P (AB) = P (A/B) × P (B)
So, P (3 and F) = P (3/F) × P (F) = 0.166 × 0.5 = 0.083 and
P (3 and L) = P (3/L) × P (L) = 0.600 × 0.5 = 0.30
Further, we have to calculate the value of P (3), which is:
P (3) = P (3F) + P (3L)
= 0.083 + 0.300 = 0.383

Now, we can find out the value of P (F/3), as well as P (L/3), by using the
formula

P (F/3) = P (F and 3) / P (3) = 0.083 / 0.383 = 0.216, and

P (L/3) = P (L and 3) / P (3) = 0.300 / 0.383 = 0.784

These two conditional probabilities are called the Revised/Posterior


Probability.

Our original estimate of the probability that the fair die was rolled was 0.5,
and similarly for the loaded die it was again 0.5. But with a single roll of the
die, given that 3 has appeared on top, the probability that the loaded die was
rolled increases to 0.784, while the probability that it was the fair die decreases
to 0.216. This example illustrates the power of Bayes' theorem.
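The whole revision in Illustration 6 can be sketched as a small prior-times-likelihood computation. This is our own code; using the exact likelihood 1/6 (rather than the rounded 0.166) the posteriors come out as 0.217 and 0.783, very close to the figures above:

```python
# Bayes' theorem for the fair/loaded die: posterior is proportional
# to prior * likelihood, normalised by the total probability of the data.
priors = {"fair": 0.5, "loaded": 0.5}
likelihood_of_3 = {"fair": 1 / 6, "loaded": 0.6}

joint = {die: priors[die] * likelihood_of_3[die] for die in priors}
p_3 = sum(joint.values())                          # total probability of rolling a 3
posterior = {die: joint[die] / p_3 for die in joint}

print(round(posterior["fair"], 3))    # ≈ 0.217
print(round(posterior["loaded"], 3))  # ≈ 0.783
```

The same three lines (joint, normaliser, divide) revise the prior no matter how many hypotheses there are.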

Self Assessment Exercise E

1. There are two machines, A and B, in a factory. As per the past information,
these two machines produced 30% and 70% of items of output respectively.
Further, 5% of the items produced by machine A and 1% produced by machine
B were defective. If a defective item is drawn at random, what is the
probability that the defective item was produced by machine A or machine B?
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
..................................................................................................................
13.9 LET US SUM UP
At the beginning of this unit, the historical evolution and the meaning of
probability were discussed, and the contributions of leading mathematicians
were highlighted. Fundamental concepts and approaches to determining
probability have been explained. The three approaches used to determine
probability in risky and uncertain situations, namely the classical, the relative
frequency, and the subjective approaches, have been discussed.

Probability rules for calculating the probabilities of different types of events
have been explained. Further, the conditions of statistical independence and
statistical dependence have been defined. Three types of probabilities, namely
marginal, joint, and conditional, under both statistical independence and
statistical dependence, have been explained. Finally, the Bayesian approach to
the revision of a priori probability in the light of additional information has
been discussed.

13.10 KEY WORDS


Bayes’ Theorem: It gives us a formula that computes the conditional
probabilities when dealing with statistically dependent events.
Classical/Logical Approach: An objective way of assessing probabilistic value
based on logic.
Collectively Exhaustive Events: This is the collection of all possible outcomes
of an experiment.
Conditional Probability: The probability of the happening of an event on the
condition that another event has already occurred.
Dependent Event: This is the situation in which the occurrence of one event
affects the happening of another event.
Independent Event: This is the situation in which the occurrence of an event
has no effect on the probability of the occurrence of any other event.
Joint Probability: The probability of events occurring together or in quick
succession.
Marginal/Simple Probability: As the name suggests, it is the simple
probability of occurrence of an event.
Mutually Exclusive Events: A situation in which only one event can occur
on any given trial/experiment. It means events that cannot occur together.
Posterior or Revised Probability: A probability value that results from the
revision of a priori probabilities in the light of new information.
Priori Probability: A probability value determined before new/additional
information is obtained.
Probability: Any numerical value between 0 and 1, both inclusive, telling about
the likelihood of occurrence of an event.
Relative Frequency Approach: An objective way of determining probability
based on observing frequency over a number of trials.
Subjective Approach: A way of determining probability values based on
experience/judgement.

13.11 ANSWERS TO SELF ASSESSMENT EXERCISES
A) 1. i) It is an impossible event

ii) It is an event which must occur

iii) They are equally likely events

iv) They are mutually exclusive events


2. a) Yes–No; b) Yes–No; c) Yes–Yes; d) No–No
B) 1. a) 7/15; b) 4/15; c) 1/3; d) 2/15; e) 11/15.

C) 1. a) Yes; b) Yes; c) No.

2. a) 1/2; b) 1/2.

3. a) P (Face2/Red1) = 3/13

b) P (Ace2/Face1) = 1/13

c) P (Black Jack2/Red Ace1) = 1/26.

4. a) 6/32; b) 6/32; c) 1/32.

D) 1. Let I = income > Rs. 35,000; C = 2 cars


P (C and I ) = P (C/I) P (I) = 0.75 (0.6) = 0.45.

2. 3/7; 2/3; 5/63; 10/21.

3. 0.356.

E) 1. Machine A = 0.682; Machine B = 0.318.

Supplementary Illustrations

1) If A and B are two non-mutually exclusive events, such that the probability
of event A is 0.25, the probability of event B is 0.4, and the probability of
event A or B is 0.5, find the probability of both events happening together.

Here, we have
P (A) = 0.25
P (B) = 0.40 and
P ( A ∪ B) = 0.5, then P ( A ∩ B) = ?
P (A ∪ B) = P (A) + P (B) – P (A ∩ B)

Replacing given values in this equation, we have

0.5 = 0.25 + 0.40 – P (A ∩ B)


or P (A ∩ B) = 0.15
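The same rearrangement of the addition rule can be checked in a couple of lines of Python (the variable names are ours, chosen for this sketch):

```python
# Addition rule for non-mutually exclusive events:
# P(A or B) = P(A) + P(B) - P(A and B), rearranged to solve for the joint term.
p_a, p_b, p_a_or_b = 0.25, 0.40, 0.5
p_a_and_b = p_a + p_b - p_a_or_b   # 0.15, up to floating-point rounding
```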
Probability and Hypothesis Testing

2) Find out the probability of at least one tail on three tosses of a fair coin.
As there is only one case, where there is no tail on three coins. This case is of
having H1, H2, H3.
Now
P (H1 H2 H3) = P (H1) × P (H2) × P (H3)
= 0.5 × 0.5 × 0.5 = 0.125
∴ To get the answer, we just need to subtract this probability from 1, so that
our answer is:

= 1 – P (H1 H2 H3) = 1 – 0.125 = 0.875

Hence, the probability of at least one tail occurring in three consecutive
tosses is 0.875.
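The complement-rule shortcut used here generalises to any number of tosses; a Python sketch (the function name is ours, not the text's):

```python
def p_at_least_one_tail(n_tosses, p_head=0.5):
    # P(at least one tail) = 1 - P(all heads), for independent tosses of a coin
    # whose probability of heads is p_head.
    return 1 - p_head ** n_tosses
```

For three tosses of a fair coin, `p_at_least_one_tail(3)` gives 0.875, matching the answer above.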

3) What is the chance that a non-leap year contains 53 Mondays?

A non-leap year consists of 365 days, i.e., a total of 52 full weeks and one
extra day. So, a non-leap year contains 53 Mondays only when that extra
day is also a Monday.

But that extra day can be any one of the following: Sunday, Monday,
Tuesday, Wednesday, Thursday, Friday or Saturday.

So, the required probability = No. of favourable events / Total no. of outcomes = 1/7

4) What is the probability of having at least one head on two tosses of a fair coin?

The possible ways in which a head may occur are H1 H2; H1 T2; T1 H2.
Each of these has a probability of 0.25.

As P (H1 H2) = P (H1) × P (H2) = 0.5 × 0.5 = 0.25.

The results are similar for P (H1 T2) and P (T1 H2) also, since the two
tosses are statistically independent events. Therefore, the probability of at
least one head on two tosses is 3 × 0.25 = 0.75.

5) Suppose we are tossing an unfair coin, where the probability of getting a head
in a toss is 0.8. We have to calculate the probability of having three heads in
three consecutive trials.

Then, as given, P (H) = 0.8, ∴ P (T) = 0.2

If P (H1 H2 H3) represents the probability of having three heads in three
trials, then

P (H1 H2 H3) = P (H1) × P (H2) × P (H3) = 0.8 × 0.8 × 0.8 = 0.512


If we have to calculate the probability of having three consecutive tails in
three trials, then

P (T1 T2 T3) = P (T1) × P (T2) × P (T3) = 0.2 × 0.2 × 0.2 = 0.008

6) Let us consider an urn having 10 balls with the descriptions given below:

4 are White (W) and lettered (L)
2 are White (W) and numbered (N)
3 are Yellow (Y) and lettered (L)
1 is Yellow (Y) and numbered (N)

Suppose one ball is picked out at random from the urn; then we have to
find out the probability that:

i) The ball is lettered
ii) The ball drawn is lettered, given that it is yellow

For this, first of all, let us tabulate a series of useful probabilities.

P (WL) = 0.4
P (YL) = 0.3
P (WN) = 0.2
P (YN) = 0.1
Also, P (W) = 0.6
P (Y) = 0.4
P (L) = 0.7
P (N) = 0.3
We also know that:
P (W) = P (WL) + P (WN) = 0.4 + 0.2 = 0.6
P (Y) = P (YL) + P (YN) = 0.3 + 0.1 = 0.4
P (L) = P (WL) + P (YL) = 0.4 + 0.3 = 0.7
and P (N) = P (WN) + P (YN) = 0.2 + 0.1 = 0.3
So, (i) P (L) = 0.7

(ii) P (L/Y) = P (LY) / P (Y) = 0.3 / 0.4 = 0.75
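The tabulated values behave like a small joint-probability table, and the marginal and conditional probabilities can be recomputed from it. A Python sketch (the dictionary layout and function name are ours):

```python
# Joint probabilities for the urn: (colour, marking) -> probability.
joint = {("W", "L"): 0.4, ("W", "N"): 0.2,
         ("Y", "L"): 0.3, ("Y", "N"): 0.1}

def marginal(index, value):
    # Sum the joint probabilities where the given component of the key
    # (0 = colour, 1 = marking) equals the given value.
    return sum(p for key, p in joint.items() if key[index] == value)

p_lettered = marginal(1, "L")                       # (i)  P(L)   = 0.7
p_l_given_y = joint[("Y", "L")] / marginal(0, "Y")  # (ii) P(L/Y) = 0.75
```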

7) A manufacturing firm produces units of a product on three machines. From
past records of the proportions of defectives produced at each machine, the
following conditional probabilities are computed:

P (D/M1) = 0.06; P (D/M2) = 0.15; P (D/M3) = 0.10.
Events M1, M2, M3 denote a unit produced on Machines 1, 2 and 3 respectively.
Event D is a defective unit.
The first machine produces 30% of the units of the product, the second machine
20% and the third machine 50%. A unit of the product produced at one of these
machines is tested and found to be defective. What are the probabilities that the
defective unit was produced in any of the three machines?
Solution: Here, we have
P (M1) = 0.3, P (M2) = 0.2, P (M3) = 0.5
(Check: P (M1) + P (M2) + P (M3) = 1, which follows from the definition of
mutually exclusive and collectively exhaustive events),
and also, P (D/M1) = 0.06, P (D/M2) = 0.15, P (D/M3) = 0.10.
Now, we have to calculate the joint probabilities of P (D and M1), P (D and M2),
and P (D and M3), by using the formula:
P (AB) = P (A/B) × P (B)
∴ P (DM1) = P (D/M1) × P (M1) = 0.06 × 0.3 = 0.018
P (DM2) = P (D/M2) × P (M2) = 0.15 × 0.2 = 0.03
P (DM3) = P (D/M3) × P (M3) = 0.10 × 0.5 = 0.05
Further we can obtain P (D) by adding the above obtained joint probabilities i.e.,
P (D) = P (DM1) + P (DM2) + P (DM3)
= 0.018 + 0.03 + 0.05 = 0.098
Finally, we can find the values of P (M1/D), P (M2/D) and P (M3/D):

P (M1/D) = P (M1 and D) / P (D) = 0.018 / 0.098 = 0.1837
P (M2/D) = P (M2 and D) / P (D) = 0.03 / 0.098  = 0.3061
P (M3/D) = P (M3 and D) / P (D) = 0.05 / 0.098  = 0.5102
                                          Total = 1.0000
These three conditional probabilities are called the posterior probabilities.
It is clear from the revised probability values that the probabilities that a
defective unit was produced on M1, M2 and M3 are 0.18, 0.31 and 0.51, against
the prior probabilities of 0.3, 0.2 and 0.5 respectively. And the probability
that this firm produces a defective unit is 0.098.
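The Bayes' theorem computation above follows a fixed recipe: multiply each prior by its likelihood, then normalise by the sum. A Python sketch of the illustration (dictionary keys are our own labels):

```python
# Priors P(Mi) and likelihoods P(D|Mi) from the illustration.
priors = {"M1": 0.3, "M2": 0.2, "M3": 0.5}
defect_rate = {"M1": 0.06, "M2": 0.15, "M3": 0.10}

# Joint probabilities P(D and Mi) = P(D|Mi) * P(Mi).
joint = {m: defect_rate[m] * priors[m] for m in priors}
p_defective = sum(joint.values())                 # P(D) = 0.098

# Posterior (revised) probabilities P(Mi|D) = P(D and Mi) / P(D).
posterior = {m: joint[m] / p_defective for m in priors}
```

By construction the posteriors sum to 1, which is a useful self-check when doing the arithmetic by hand.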

26
Geektonight Notes
Probability and
13.12 TERMINAL QUESTIONS/EXERCISES Probability Rules

1. Why do we study probability? Explain its importance and relevance.

2. Define the following, using appropriate examples:
   i) Equally likely events
   ii) Mutually exclusive events
   iii) Trial and event
   iv) Sample space

3. What are the different approaches to probability? Explain with suitable
   examples.

4. State and prove the addition rule of probability for two mutually exclusive
events.

5. Explain the types of probability under statistical independence.

6. Explain the use of Bayes’ theorem in probability.

7. One ticket is drawn at random from an urn containing tickets numbered from 1
   to 50. Find out the probability that:

   i) It is a multiple of 5 or 7
   ii) It is a multiple of 4 or 3
   [Answer: i) 8/25, ii) 12/25]
8. If two dice are being rolled, then find out the probabilities that:

i) The sum of the numbers shown on the dice is 7.


ii) The numbers shown on the dice are equal.
iii) The number shown by second die is greater than the number shown by the
first die.
[Answer: (i) 1/6 (ii) 1/6, (iii) 5/12]
9. (a) Find out the probability of getting head in a throw of a fair coin.

(b) If two coins are tossed once, what is the probability of getting
(i) Both heads
(ii) At least one head ?
[Answer: (a) 1/2 (b) (i) 1/4 (ii) 3/4]

10. Given that P (A) = 3/14, P (B) = 1/6, P (C) = 1/3, P (AC) = 1/7, P (B/C) = 5/21.

    Find out the values of P (A/C), P (C/A), P (BC), P (C/B).

    [Ans. 3/7, 2/3, 5/63, 10/21]

11. A T.V. manufacturing firm purchases a certain item from three suppliers X, Y
and Z. They supply 60%, 30% and 10% respectively. It is known that 2%, 5%
and 8% of the items supplied by the respective suppliers are defective. On a
particular day, the firm received items from three suppliers and the contents get
mixed. An item is chosen at random:

a) What is the probability that it is defective?


b) If the item is found to be defective what is the probability that it was
supplied by X, Y, and Z?

[Ans. P (D) = 0.035 P (X/D) = 0.34, P (Y/D) = 0.43, and P (Z/D) = 0.23].

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

13.13 FURTHER READING


The following text books may be used for more indepth study on the topics
dealt within this unit.

Levin, R.I. and Rubin, D.S., 1991, Statistics for Management, PHI: New Delhi.

Feller, W., 1957, An Introduction to Probability Theory and Its Applications,
John Wiley & Sons Inc.: New York.

Hooda, R.P., 2001, Statistics for Business and Economics, MacMillan India
Limited: Delhi.

Gupta, S.C. and Kapoor, V.K., 2005, Fundamentals of Mathematical Statistics,
Sultan Chand & Sons: Delhi.

Gupta, S.P., 2000, Statistical Methods, Sultan Chand & Sons: Delhi.

UNIT 14 PROBABILITY DISTRIBUTIONS

STRUCTURE

14.0 Objectives
14.1 Introduction
14.2 Types of Probability Distribution
14.3 Concept of Random Variables
14.4 Discrete Probability Distribution
14.4.1 Binomial Distribution
14.4.2 Poisson Distribution
14.5 Continuous Probability Distribution
14.5.1 Normal Distribution
14.5.2 Characteristics of Normal Distribution
14.5.3 Importance and Application of Normal Distribution
14.6 Let Us Sum Up
14.7 Key Words
14.8 Answers to Self Assessment Exercises
14.9 Terminal Questions/Exercises
14.10 Further Reading

14.0 OBJECTIVES
After studying this unit, you should be able to:

l differentiate between frequency distribution and probability distribution,

l become aware of the concepts of random variable and probability distribution,

l appreciate the usefulness of probability distributions in decision-making,

l identify situations where discrete probability distributions can be applied,

l fit a binomial distribution and a Poisson distribution to the given data,

l identify situations where continuous probability distributions can be applied, and

l appreciate the usefulness of continuous probability distributions in decision-


making.

14.1 INTRODUCTION
A probability distribution is essentially an extension of the theory of probability
which we have already discussed in the previous unit. This unit introduces the
concept of a probability distribution, and shows how the various basic
probability distributions (binomial, Poisson, and normal) are constructed. All these
probability distributions have immensely useful applications and explain a wide
variety of business situations which call for computation of desired probabilities.

By the theory of probability

P(H1) + P(H2) + … + P(Hn) = 1

This means that the unit probability of a certain event is distributed over a set
of disjoint events making up a complete group. In general, a tabular recording
of the probabilities of all the possible outcomes that could result if a random
(chance) experiment is done is called a "Probability Distribution". It is also
termed a theoretical frequency distribution.

Frequency Distribution and Probability Distribution


One gets a better idea about a probability distribution by comparing it with a
frequency distribution. It may be recalled that the frequency distributions are
based on observation and experimentation. For instance, we may study the
profits (during a particular period) of the firms in an industry and classify the
data into two columns, with class intervals for profits in the first column and
the corresponding class frequencies (number of firms) in the second column.

The probability distribution is also a two-column presentation with the values of


the random variable in the first column, and the corresponding probabilities in
the second column. These distributions are obtained by expectations on the
basis of theoretical or past-experience considerations. Thus, probability
distributions are related to theoretical or expected frequency distributions.

In the frequency distribution, the class frequencies add up to the total number
of observations (N), whereas in the case of a probability distribution the possible
outcomes (probabilities) add up to ‘one’. Like the former, a probability
distribution is also described by a curve and has its own mean, dispersion, and
skewness.

Let us consider an example of probability distribution. Suppose we toss a fair


coin twice, the possible outcomes are shown in Table 14.1 below.

Table 14.1: Possible Outcomes from Two-toss Experiment of a Fair Coin

No. of possible    1st     2nd     No. of heads      Probability of the
outcomes           toss    toss    on two tosses     possible outcomes

1                  Head    Head    2                 0.5 × 0.5 = 0.25
2                  Head    Tail    1                 0.5 × 0.5 = 0.25
3                  Tail    Head    1                 0.5 × 0.5 = 0.25
4                  Tail    Tail    0                 0.5 × 0.5 = 0.25
                                                     Total = 1.00

Now we are interested in framing a probability distribution of the possible
outcomes of the number of heads from the two-toss experiment of a fair coin.
We would begin by recording any result that did not contain a head, i.e., only
the fourth outcome in Table 14.1. Next, those outcomes containing only one
head, i.e., second and third outcomes (Table 14.1), and finally, we would record
that the first outcome contains two heads (Table 14.1). We recorded the same
in Table 14.2 to highlight the number of heads contained in each outcome.

Table 14.2: Probability Distribution of the Possible No. of Heads from Two-toss
Experiment of a Fair Coin
No. of Heads (H)    Outcomes (tosses)    Probability P (H)

0                   (T, T)               1/4 = 0.25
1                   (H, T) + (T, H)      1/2 = 0.50
2                   (H, H)               1/4 = 0.25
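The distribution in Table 14.2 can be reproduced by enumerating the sample space directly. A Python sketch (variable names are ours) tallies the number of heads over all four equally likely outcomes:

```python
from itertools import product

# Enumerate the sample space of two tosses of a fair coin and tally the
# probability of each possible number of heads (cf. Table 14.2).
dist = {}
for outcome in product("HT", repeat=2):         # HH, HT, TH, TT
    heads = outcome.count("H")
    dist[heads] = dist.get(heads, 0) + 0.25     # each outcome has probability 1/4
# dist: {0: 0.25, 1: 0.50, 2: 0.25}
```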

We must note that the above tables are not the real outcome of tossing a fair
coin twice. But it is a theoretical outcome, i.e., it represents the way in which
we expect our two-toss experiment of an unbiased coin to behave over time.

14.2 TYPES OF PROBABILITY DISTRIBUTION


Probability distributions are broadly classified under two heads: (i) Discrete
Probability Distribution, and (ii) Continuous Probability Distribution.

i) Discrete Probability Distribution: A discrete random variable is allowed to take
on only a limited number of values. Consider, for example, the month of your
birthday: it is a discrete variable, as there are only 12 possible outcomes
representing the 12 months of a year.

ii) Continuous Probability Distribution: In a continuous probability distribution,


the variable of interest may take on any values within a given range. Suppose
we are planning to release water for hydropower generation. Depending on
how much water we have in the reservoir viz., whether it is above or below the
normal level, we decide on the amount and time of release. The variable
indicating the difference between the actual reservoir level and the normal
level, can take positive or negative values, integer or otherwise. Moreover, this
value is contingent upon the inflow to the reservoir, which in turn is uncertain.
This type of random variable which can take an infinite number of values is
called a continuous random variable, and the probability distribution of such a
variable is called a continuous probability distribution.

Before we take up discrete and continuous probability distributions, the concept
of a random variable, which is central to the theme, needs to be elaborated.

14.3 CONCEPT OF RANDOM VARIABLES


A random variable is a variable (numerical quantity) that can take different
values as a result of the outcomes of a random experiment. When a random
experiment is carried out, the totality of outcomes of the experiment forms a
set which is known as sample space of the experiment. Similar to the
probability distribution function, a random variable may be discrete or continuous.

In the example given in the Introduction, we saw that the outcomes of the
experiment of two tosses of a fair coin were expressed in terms of the number
of heads. We found in the example that H (head) can assume the values 0, 1
and 2, and that a probability corresponds to each value. This
uncertain real variable H, which assumes different numerical values depending
on the outcomes of an experiment, and to each of whose values a probability
assignment can be made, is known as a random variable. The resulting
representation of all the values with their probabilities is termed the
probability distribution of H.

It is customary to present the distribution as shown in Table 14.3 below.

Table 14.3: Probability Distribution of No. of Heads

H:       0      1      2

P (H):   0.25   0.50   0.25



In this case, as we find that H takes only discrete values, the variable H is
called a discrete random variable, and the resulting distribution is a discrete
probability distribution. The function that specifies the probability distribution
of a discrete random variable is called the probability mass function (p.m.f.).

In the above situations, we have seen that the random variable takes a limited
number of values. There are certain situations where the variable under
consideration may have infinite values. Consider for example, that we are
interested in ascertaining the probability distribution of the weight of one kg.
coffee packs. We have reasons to believe that the packing process is such that
a certain percentage of the packs weigh slightly below one kg. and some weigh
above one kg. It is easy to see that it is essentially by chance that a pack
will weigh exactly 1 kg., and there are an infinite number of values that the
random variable ‘weight’ can take. In such cases, it makes sense to talk of the
probability that the weight will be between two values, rather than the
probability of the weight taking any specific value. These types of random
variables which can take an infinitely large number of values are called
continuous random variables, and the resulting distribution is called a
continuous probability distribution. The function that specifies the probability
distribution of a continuous random variable is called the probability density
function (p.d.f.).

Sometimes, for the sake of convenience, a discrete situation with a large


number of outcomes is approximated by a continuous distribution. For example,
if we find that the demand of a product is a random variable taking values of
1, 2, 3, …to 1,000, it may be worthwhile to treat it as a continuous variable.

In a nutshell, if the random variable is restricted to take only a limited number


of values, it is termed as discrete random variable and if it is allowed to take
any value within a given range it is termed as continuous random variable.

It should be clear, from the above discussion, that a probability distribution is


defined only in the context of a random variable or a function of random
variable. Thus in any situation, it is important to identify the relevant random
variable and to find the probability distribution to facilitate decision making.

Expected Value of a Random Variable


Expected value is the fundamental idea in the study of probability distributions.
For finding the expected value of a discrete random variable, we multiply each
value that the random variable can assume by its corresponding probability of
occurrence and then sum up all the products. For example to find out the
expected value of the discrete random variable (RV) of “Daily Visa Cleared”
given in the following table.

Table 14.4

Possible values of the RV    Probability    Product

100                          0.3            30
110                          0.6            66
120                          0.1            12

Hence the expected value of the RV "Daily Visa Cleared" = 108.
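The expected-value computation is a single weighted sum; in Python, with the numbers from Table 14.4 (variable names are ours):

```python
# Expected value of a discrete random variable: sum of (value x probability).
values = [100, 110, 120]
probs = [0.3, 0.6, 0.1]

expected = sum(v * p for v, p in zip(values, probs))   # 30 + 66 + 12 = 108
```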

Now, we will examine situations involving discrete random variables and discuss
the methods for assessing them.
14.4 DISCRETE PROBABILITY DISTRIBUTION

As we have seen in the previous sections, a representation of all possible
values of a discrete random variable together with their probabilities of
occurrence is called a discrete probability distribution. There are two important
distributions of this kind: (i) the Binomial Distribution,
and (ii) the Poisson Distribution. Let us discuss these two distributions in detail.

14.4.1 Binomial Distribution

It is the basic and the most common probability distribution. It has been used to
describe a wide variety of processes in business. For example, a quality control
manager wants to know the probability of obtaining defective products in a
random sample of 10 products. If 10 per cent of the products are defective, he/
she can quickly obtain the answer from tables of the binomial probability
distribution. It is also known as the Bernoulli Distribution, as it originated
with the Swiss mathematician James Bernoulli (1654–1705).

The binomial distribution describes discrete, not continuous, data resulting from
an experiment known as Bernoulli Process. Binomial distribution is a probability
distribution expressing the probability of one set of dichotomous alternatives, i.e.,
success or failure.

As per this distribution, the probability of getting 0, 1, 2, …n heads (or tails) in


n tosses of an unbiased coin will be given by the successive terms of the
expansion of (q + p)^n, where p is the probability of success (heads) and q is
the probability of failure (i.e., q = 1 – p).

Binomial law of probability distribution is applicable only when:

a) A trial results in either success or failure of an event.

b) The probability of success ‘p’ remains constant in each trial.

c) The trials are mutually independent i.e., the outcome of any trial is neither
affected by others nor affects others.

Assumptions: i) Each trial has only two possible outcomes, either yes or no,
success or failure, etc.

ii) Regardless of how many times the experiment is performed, the probability of
the outcome, each time, remains the same.

iii) The trials are statistically independent.

iv) The number of trials is known and is 1, 2, 3, 4, 5, etc.

Binomial Probability Formula:

P(r) = nCr p^r q^(n−r)

where, P(r) = probability of r successes in n trials; p = probability of success;
q = probability of failure = 1 − p; r = no. of successes desired; and n = no. of
trials undertaken.

The determining equation for nCr can easily be written as:

nCr = n! / [r! (n − r)!]
n! can be simplified as follows:

n! = n (n–1)! = n (n–1) (n–2) ! = n (n–1) (n–2) (n–3) ! and so on.

Hence the following form of the equation, for carrying out computations of the
binomial probability, is perhaps more convenient:

P(r) = [n! / (r! (n − r)!)] p^r q^(n−r)

The symbol ‘!’ means ‘factorial’, which is computed as follows: 5! means 5 ×


4 × 3 × 2 × 1 = 120. Mathematicians define 0! as 1.

If n is large in number, say for 50C3, then we can write (with the help of the
above explanation):

50C3 = 50! / [3! (50 − 3)!] = [(50)(49)(48)(47)!] / [3! (47)!]

     = (50 × 49 × 48) / (3 × 2 × 1)

Similarly,

75C5 = 75! / [5! (75 − 5)!] = [(75)(74)(73)(72)(71)(70)!] / [5! (70)!]

     = (75 × 74 × 73 × 72 × 71) / (5 × 4 × 3 × 2 × 1), and so on.

Characteristics of a Binomial Distribution


i) The form of the distribution depends upon the parameters ‘p’ and ‘n’.
ii) The probability that there are 'r' successes in 'n' trials is given by

P(r) = nCr p^r q^(n−r) = [n! / (r! (n − r)!)] p^r q^(n−r)

iii) It is mainly applied when the population being sampled is infinite.


iv) It can also be applied to a finite population, if it is not very small or the units
sampled are replaced before the next trial is attempted. The point worth noting
is ‘p’ should remain unchanged.

Let us consider the following illustration to understand the application of the
binomial distribution.

Illustration 1

A fair coin is tossed six times. What is the probability of obtaining four or more
heads?

Solution: When a fair coin is tossed, the probabilities of head and tail in the
case of an unbiased coin are equal, i.e.,

p = q = ½ or 0.5

∴ The probability of obtaining 4 heads is:

P(4) = 6C4 (1/2)^4 (1/2)^(6−4)
     = [6! / (4! (6 − 4)!)] (0.5)^4 (0.5)^2
     = 15 × 0.0625 × 0.25
     = 0.234

The probability of obtaining 5 heads is:

P(5) = 6C5 (1/2)^5 (1/2)^(6−5)
     = [6! / (5! (6 − 5)!)] (0.5)^5 (0.5)^1
     = 6 × 0.03125 × 0.5
     = 0.094

The probability of obtaining 6 heads is:

P(6) = 6C6 (1/2)^6 (1/2)^(6−6)
     = [6! / (6! (6 − 6)!)] (0.5)^6 (0.5)^0
     = 1 × 0.015625 × 1
     = 0.016

∴ The probability of obtaining 4 or more heads is:

0.234 + 0.094 + 0.016 = 0.344
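Under the same fair-coin assumption, the answer can be obtained by summing the binomial probabilities for r = 4, 5, 6 (a Python sketch; the variable name is ours):

```python
from math import comb

# P(4 or more heads in 6 tosses of a fair coin) = P(4) + P(5) + P(6).
p_4_or_more = sum(comb(6, r) * 0.5**r * 0.5**(6 - r) for r in range(4, 7))
# (15 + 6 + 1) / 64 = 22/64 ≈ 0.344
```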

Illustration 2

The incidence of a certain disease is such that on an average 20% of workers
suffer from it. If 10 workers are selected at random, find the probability that:

i) Exactly 2 workers suffer from the disease
ii) Not more than 2 workers suffer from the disease
iii) At least 2 workers suffer from the disease
Solution: Probability of a worker suffering from the disease = 20/100 = 1/5,
i.e., p = 1/5, and

The probability of a worker not suffering from the disease is q = 1 − 1/5 = 4/5.

By the binomial probability law, the probability that out of 10 workers, 'r'
workers suffer from the disease is given by:

P(r) = nCr p^r q^(n−r) = 10Cr (1/5)^r (4/5)^(10−r); r = 0, 1, 2, …, 10

i) The required probability that exactly 2 workers will suffer from the disease is
given by:

P(2) = 10C2 (1/5)^2 (4/5)^(10−2)
     = [10! / (2! (10 − 2)!)] (0.2)^2 (0.8)^8
     = 45 × 0.04 × 0.16777
     = 0.302

ii) The required probability that not more than 2 workers will suffer from the
disease is given by:

P(0) + P(1) + P(2)

P(0) = 10C0 (1/5)^0 (4/5)^10 = 0.107
P(1) = 10C1 (1/5)^1 (4/5)^9  = 0.269
P(2) = 10C2 (1/5)^2 (4/5)^8  = 0.302

Probability of not more than 2 workers suffering from the disease

= 0.107 + 0.269 + 0.302 = 0.678
iii) We have to find P (r ≥ 2)
i.e., P (r ≥ 2) = 1–P (0) – P (1)
= 1 – 0.107 – 0.269 = 0.624
Thus, the probability of at least two workers suffering from the disease is 0.624.
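All three answers can be cross-checked in Python; `pmf` is our illustrative helper for the binomial formula with this illustration's n = 10 and p = 0.2:

```python
from math import comb

def pmf(r, n=10, p=0.2):
    # Binomial probability of exactly r workers suffering from the disease.
    return comb(n, r) * p**r * (1 - p)**(n - r)

p_exactly_2  = pmf(2)                          # (i)   ≈ 0.302
p_at_most_2  = sum(pmf(r) for r in range(3))   # (ii)  P(0)+P(1)+P(2) ≈ 0.678
p_at_least_2 = 1 - pmf(0) - pmf(1)             # (iii) ≈ 0.624
```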
Measures of Central Tendency and Dispersion for the Binomial
Distribution
As discussed in the Introduction, the binomial distribution has a mean (µ) and a
standard deviation (σ). We now see the computation of both these statistical
measures.


We can represent the mean of the binomial distribution as:

Mean (µ) = np
where, n = Number of trials; p = probability of success
And, we can calculate the standard deviation by:

σ = √(npq)
where, n = Number of trials; p = probability of success; and q = probability of
failure = 1–p

Illustration 3
If the probability of defective bolts is 0.1, find the mean and standard deviation
for the distribution of defective bolts in a total of 500.

Solution: p = 0.1, n = 500

∴ Mean (µ) = np = 500 × 0.1 = 50

Thus, we can expect 50 bolts to be defective.

Standard Deviation (σ) = √(npq)

n = 500, p = 0.1, q = 1 − p = 1 − 0.1 = 0.9

∴ σ = √(500 × 0.1 × 0.9) = 6.71
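The two formulas translate directly into code; a sketch with this illustration's numbers (variable names are ours):

```python
from math import sqrt

# Binomial mean and standard deviation: mu = n*p, sigma = sqrt(n*p*q).
n, p = 500, 0.1
mean = n * p                   # 50.0 defective bolts expected
sd = sqrt(n * p * (1 - p))     # sqrt(45) ≈ 6.71
```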

Fitting a Binomial Distribution


When a binomial distribution is to be fitted to observed data, the following
procedure is adopted:

i) Determine the values of ‘p’ and ‘q’. If one of these values is known, the other
can be found out by the simple relationship p = 1–q and q = 1–p. If p and q are
equal, we can say, the distribution is symmetrical. On the other hand if ‘p’ and
‘q’ are not equal, the distribution is skewed. The distribution is positively
skewed, in case ‘p’ is less than 0.5, otherwise it is negatively skewed.
ii) Expand the binomial (p + q)^n. The power 'n' is equal to one less than the
number of terms in the expanded binomial. For example, if 3 coins are tossed
(n = 3) there will be four terms; when 5 coins are tossed (n = 5) there will be
six terms, and so on.
iii) Multiply each term of the expanded binomial by N (the total frequency), in
order to obtain the expected frequency in each category.
Let us consider an illustration for fitting a binomial distribution.

Illustration 4
Eight coins are tossed at a time 256 times. Number of heads observed at each
throw is recorded and the results are given below. Find the expected
frequencies. What are the theoretical values of mean and standard deviation?
Also calculate the mean and standard deviation of the observed frequencies.

No. of heads at a throw    f         No. of heads at a throw    f

0                          2         5                          56
1                          6         6                          32
2                          30        7                          10
3                          52        8                          1
4                          67

Solution: The chance of getting a head in a single throw of one coin is ½.
Hence, p = ½, q = ½, n = 8, N = 256.

By expanding 256 (½ + ½)^8, we shall get the expected frequencies of 0, 1, 2,
…, 8 heads (successes).

No. of Heads (X)    Expected Frequency = N × 8Cr p^r q^(8−r) (approximated)

0                   256 × 8C0 (0.5)^0 (0.5)^8 = 1
1                   256 × 8C1 (0.5)^1 (0.5)^7 = 8
2                   256 × 8C2 (0.5)^2 (0.5)^6 = 28
3                   256 × 8C3 (0.5)^3 (0.5)^5 = 56
4                   256 × 8C4 (0.5)^4 (0.5)^4 = 70
5                   256 × 8C5 (0.5)^5 (0.5)^3 = 56
6                   256 × 8C6 (0.5)^6 (0.5)^2 = 28
7                   256 × 8C7 (0.5)^7 (0.5)^1 = 8
8                   256 × 8C8 (0.5)^8 (0.5)^0 = 1

Total = 256
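The expected-frequency column can be generated programmatically (a sketch; the `expected` name is ours):

```python
from math import comb

# Expected frequencies for r = 0..8 heads when 8 fair coins are thrown
# 256 times: N * 8Cr * p**r * q**(8-r), with p = q = 0.5.
N, n, p = 256, 8, 0.5
expected = [round(N * comb(n, r) * p**r * (1 - p)**(n - r)) for r in range(n + 1)]
# expected -> [1, 8, 28, 56, 70, 56, 28, 8, 1], summing to 256
```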

If we compare the above expected frequencies with the observed frequencies,


given in the illustration, we find that the two frequencies are in close
agreement. This provides the basis to conclude that the observed distribution
fits the expected distribution.
The mean of the above distribution is:

µ = np = 8 × ½ = 4

The standard deviation is:

σ = √(npq) = √(8 × ½ × ½) = √2 = 1.414

If we compute the mean and standard deviation of the observed frequencies,


we will obtain the following values

X = 4.062; S.D. = 1.462

Note: The procedure for computation of mean and standard deviation of the
observed frequencies has been already discussed in Units 8 and 9 of this
course. Check these values by computing on your own.

Remark: To determine binomial probabilities quickly we can use the Binomial


Tables given at the end of this block (Appendix Table 1).

Self Assessment Exercise A

1) State whether the following statements are true or false:

a) By the theory of probability P (H1) + P (H2) + …P (Hn) = 1



b) Frequency distribution is obtained by expectations on the basis of
theoretical considerations.
c) In a continuous probability distribution the variable under consideration can
take on any value within a given range.
d) Binomial distribution is a probability distribution expressing the probability of
one set of dichotomous alternatives.
e) Binomial distribution may not be applied, when the population being sampled
is infinite.
f) Random variable is a numerical quantity whose value is determined by
the outcome of a random experiment.
2) Determine the following by using binomial probability formula.
a) If n = 4 and P = 0.12, then what is P (0) ?
b) If n = 10 and P = 0.40, then what is p (9) ?
c) If n = 6 and P = 0.83, then what is P (5)?
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................

3) The following data shows the result of the experiment of throwing 5 coins at a
time 3,100 times and the number of heads appearing in each throw. Find the
expected frequencies and comment on the results. Also calculate mean and
standard deviation of the theoretical values.

No. of heads: 0 1 2 3 4 5
frequency: 32 225 710 1,085 820 228
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................

14.4.2 Poisson Distribution

Poisson distribution, developed by the French mathematician Simeon Poisson, is named after him. It deals with counting the number of occurrences of a particular event in a specific time interval or region of space. It is used in practice where events occur infrequently with respect to time, volume (similar units), area, etc. For instance, the number of deaths or accidents occurring in a specific time, the number of defects in production, the number of workers absent per day, etc.

The binomial distribution, as discussed above, is determined by two parameters, 'p' and 'n'. In a number of cases, 'p' (the probability of success) may happen to be very small (even less than 0.01) while 'n' (the number of trials) is large enough (say, more than 50) so that their product 'np' remains a constant. This situation is termed a "Poisson distribution", and it gives an approximation to the binomial probability distribution formula, i.e., P(r) = nCr p^r q^(n–r). The Poisson distribution process corresponds to a Bernoulli process with a very large number of trials (n) and a very low probability of success.

This is comparatively simpler to deal with, and is given by the Poisson distribution formula as follows:

P(r) = (m^r e^–m) / r!
where, p (r) = Probability of successes desired

r = 0, 1, 2, 3, 4, … ∞ (any positive integer)

e = a constant with value: 2.7183 (the base of natural logarithms)

m = The mean of the Poisson Distribution, i.e., np or the average


number of occurrences of an event.

Characteristics of the Poisson Distribution


a) It is also a discrete probability distribution and it is the limiting form of the
binomial distribution.

b) The range of the random variable is 0 ≤ r < ∞

c) It consists of a single parameter “m” only. So, the entire distribution can be
obtained by knowing this value only.

d) It is a positively skewed distribution; the skewness decreases as 'm' increases.

Measures of Central Tendency and Dispersion for Poisson Distribution

In the Poisson distribution, the mean (m) and the variance (σ²) have the same value, i.e.,

Mean = Variance = np = m

S.D. (σ) = √Variance = √(np) = √m

Let us consider the following illustrations to understand the application of the Poisson distribution.
Illustration 5

2% of the electronic toys produced in a certain manufacturing process turnout


to be defective. What is the probability that a shipment of 200 toys will contain
exactly 5 defectives? Also find the mean and standard deviation.

Solution: In the given illustration, n = 200;

Probability of a defective toy (p) = 2/100 = 0.02

Since n is large and p is small, the Poisson distribution is applicable. Apply the formula:

P(r) = (m^r e^–m) / r!

The probability of 5 defective pieces in 200 toys is given by:

P(5) = (m^5 e^–m) / 5!, where m = np = 200 × 0.02 = 4; e = 2.7183 (constant)

∴ P(5) = (4^5 × 2.7183^–4) / (5 × 4 × 3 × 2 × 1) = (1024 × 0.0183) / 120 = 0.156

Mean = np = 200 × 0.02 = 4; σ = √(np) = √4 = 2

Illustration 6

Find the probability of exactly 4 defective tools in a sample of 30 tools chosen


at random by a certain tool producing firm by using i) Binomial distribution and
ii) Poisson distribution. The probability of defects in each tool is given to be
0.02.

Solution: i) When the binomial distribution is used, the probability of 4 defectives in 30 tools is given by:

P(4) = 30C4 (0.02)^4 (0.98)^26

= 27405 × 0.00000016 × 0.59 = 0.00259

ii) When the Poisson distribution is used, the probability of 4 defectives in 30 tools is given by:

P(4) = (m^4 e^–m) / 4!, where m = np = 30 × 0.02 = 0.6 and e = 2.7183 (constant)

∴ P(4) = (0.6^4 × 2.7183^–0.6) / (4 × 3 × 2 × 1)

= (0.1296 × 0.5485) / 24 = 0.00296
Remark: In general, the Poisson distribution with parameter m = np gives a good approximation to the binomial when n ≥ 20 and p ≤ 0.05.
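The quality of this approximation can be seen in code. The sketch below (the variable names are our own) computes both the exact binomial probability and its Poisson approximation for the figures of Illustration 6 (n = 30, p = 0.02, r = 4):

```python
from math import comb, exp, factorial

n, p, r = 30, 0.02, 4
binomial = comb(n, r) * p**r * (1 - p)**(n - r)   # exact: nCr p^r q^(n-r)
m = n * p                                         # Poisson parameter m = np = 0.6
poisson = m**r * exp(-m) / factorial(r)           # approximation: m^r e^-m / r!
```

Both values agree to within a fraction of a percentage point, as the remark suggests for small p and moderately large n.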
Fitting of a Poisson Distribution

To fit a Poisson distribution to a given observed data (frequency distribution), the procedure is as follows:

1) We must obtain the value of its mean i.e., m = np

2) The probabilities of the various values of the random variable (r) are to be computed by using the p.m.f., i.e., P(r) = (m^r e^–m) / r!
3) Each probability so obtained in step 2 is then multiplied by N (the total
frequency) to get expected frequencies.

Let us consider an illustration to understand the fitting of a Poisson distribution.

Illustration 7

The number of defects per unit in a sample of 330 units of manufactured


product was found as follows:

No. of defects No. of units


0 214
1 92
2 20
3 3
4 1

Fit a poisson distribution to the above given data.

Solution: The mean of the given frequency distribution is:

m = [(0 × 214) + (1 × 92) + (2 × 20) + (3 × 3) + (4 × 1)] / (214 + 92 + 20 + 3 + 1) = 145/330 = 0.439

We can write P(r) = (0.439^r × e^–0.439) / r!. Substituting r = 0, 1, 2, 3, and 4, we get the probabilities for the various values of r, as shown below:

P0 = (0.439^0 × 2.7183^–0.439) / 0! = (1 × 0.6443) / 1 = 0.6443

N(P0) = P0 × N = 0.6443 × 330 = 212.62

N(P1) = N(P0) × m/1 = 212.62 × 0.439/1 = 93.34

N(P2) = N(P1) × m/2 = 93.34 × 0.439/2 = 20.49

N(P3) = N(P2) × m/3 = 20.49 × 0.439/3 = 3.0

N(P4) = N(P3) × m/4 = 3.0 × 0.439/4 = 0.33

Thus, the expected frequencies as per the Poisson distribution are:
No. of defects (x): 0 1 2 3 4
Expected frequencies (No. of units) (f): 212.62 93.34 20.49 3.0 0.33

Note: We can use Appendix Table-2, given at the end of this block, to
determine poisson probabilities quickly.
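The three fitting steps above can be sketched as a short routine. This is a rough numerical check of Illustration 7 (all names here are of our own choosing, not from the text):

```python
from math import exp, factorial

# Observed data of Illustration 7: defects per unit -> number of units
observed = {0: 214, 1: 92, 2: 20, 3: 3, 4: 1}
N = sum(observed.values())                        # total frequency = 330
m = sum(r * f for r, f in observed.items()) / N   # step 1: mean = 145/330, about 0.439

# Steps 2 and 3: p.m.f. probabilities, each multiplied by N, give expected frequencies
expected = [N * m**r * exp(-m) / factorial(r) for r in observed]
```

The resulting list matches the expected frequencies in the table above up to rounding (the text rounds m to 0.439 before multiplying).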

Self Assessment Exercise B

1) What are the features of binomial and poisson distributions?


.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................

2) Suppose on an average 2% of electric bulbs manufactured by a company


are defective. If they produce 100 bulbs in a day, what is the probability
that 4 bulbs will have defects on that day ?
.............................................................................................................
.............................................................................................................
.............................................................................................................

3) Four hundred car air-conditioners are inspected as they come off the
production line and the number of defects per set is recorded below. Find
the expected frequencies by assuming the poisson model.

No. of defects : 0 1 2 3 4 5

No. of ACs: 142 156 69 27 5 1


.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................


14.5 CONTINUOUS PROBABILITY DISTRIBUTION
In the previous sections, we have examined situations involving discrete random
variables and the resulting probability distributions. Let us now consider a
situation, where the variable of interest may take any value within a given
range. Suppose that we are planning to release water for hydropower
generation and irrigation. Depending on how much water we have in the
reservoir, viz., whether it is above or below the ‘normal’ level, we decide on
the quantity of water and time of its release. The variable indicating the
difference between the actual level and the normal level of water in the
reservoir, can take positive or negative values, integer or otherwise. Moreover,
this value is contingent upon the inflow to the reservoir, which in turn is
uncertain. This type of random variable which can take an infinite number of
values is called a continuous random variable, and the probability distribution
of such a variable is called a continuous probability distribution.

Now we present one important probability density function (p.d.f), viz., the
normal distribution.

14.5.1 Normal Distribution

The normal distribution is the most versatile of all the continuous probability
distributions. It is useful in statistical inferences, in characterising uncertainties
in many real life situations, and in approximating other probability distributions.

As stated earlier, the normal distribution is suitable for dealing with variables
whose magnitudes are continuous. Many statistical data concerning business
problems are displayed in the form of normal distribution. Height, weight and
dimensions of a product are some of the continuous random variables which are
found to be normally distributed. This knowledge helps us in calculating the
probability of different events in varied situations, which in turn is useful for
decision-making.

To define a particular normal probability distribution, we need only two


parameters i.e., the mean (µ) and standard deviation (σ).

Now we turn to examine the characteristics of normal distribution with the help
of the figure 14.1, and explain the methods of calculating the probability of
different events using the distribution.

[Figure: a bell-shaped normal curve, symmetrical around a vertical line erected at the mean, with the mean, median and mode coinciding at the centre; the left-hand and right-hand tails extend indefinitely but never reach the horizontal axis.]

Figure 14.1: Frequency Curve for the Normal Probability Distribution



14.5.2 Characteristics of Normal Distribution

1) The curve has a single peak; thus it is unimodal, i.e., it has only one mode, and it is bell-shaped.

2) Because of the symmetry of the normal probability distribution (skewness = 0),


the median and the mode of the distribution are also at the centre. Thus, for a
normal curve, the mean, median and mode are the same value.

3) The two tails of the normal probability distribution extend indefinitely but never
touch the horizontal axis.

Areas Under the Normal Curve


The area under the normal curve (Fig. 14.1) gives us the proportion of the
cases falling between two numbers or the probability of getting a value between
two numbers.

Irrespective of the value of mean (µ) and standard deviation (σ), for a normal
distribution, the total area under the curve is 1.00. The area under the normal
curve is approximately distributed by its standard deviation as follows:

µ±1σ covers 68% area, i.e., 34.13% area will lie on either side of µ.

µ ± 2σ covers 95.5% area, i.e., 47.75% will lie on either side of µ.

µ ± 3σ covers 99.7% area, i.e., 49.85% will lie on either side of µ.

Using the Standard Normal Table


The areas under the normal curve are shown in the Appendix Table-3 at the
end of this block. To use the standard normal table to find normal probability
values, we follow two steps. They are:

Step 1: Convert the normal distribution to a standard normal distribution.


The standard random variable Z can be computed as follows:

Z = (X − µ) / σ
Where,

X = Value of the random variable with which we are concerned.

µ = mean of the distribution of this random variable

σ = standard deviation of this distribution.

Z = Number of standard deviations from X to the mean of this


distribution.

Step 2: Look up the probability of z value from the Appendix Table-3, given at
the end of this block, of normal curve areas. This Table is set up to
provide the area under the curve to any specified value of Z. (The
area under the normal curve is equal to 1. The curve is also called
the standard probability curve).
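When a printed table is not at hand, the tabulated area can be reproduced with the error function. The sketch below (our own helper functions, not part of the text) mirrors the two-step procedure:

```python
from math import erf, sqrt

def z_value(x, mu, sigma):
    # Step 1: convert to the standard normal variable Z = (X - mu) / sigma
    return (x - mu) / sigma

def area_mean_to_z(z):
    # Step 2 equivalent: area under the standard normal curve between Z = 0 and z,
    # the quantity tabulated in a normal-areas table such as Appendix Table-3.
    return 0.5 * erf(abs(z) / sqrt(2))
```

For example, `area_mean_to_z(1.54)` gives about 0.4382, and `0.5 - area_mean_to_z(0.25)` gives the area to the right of Z = 0.25, about 0.4013.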

Let us consider the following illustration to understand how the table should be consulted in order to find the area under the normal curve.

Illustration 8
(a) Find the area under the normal curve for Z = 1.54.

Solution: Consulting the Appendix Table-3 given at the end of this block, we find that the entry corresponding to Z = 1.54 is 0.4382; this measures the shaded area between Z = 0 and Z = 1.54.

(b) Find the area under the normal curve for Z = –1.46.

Solution: Since the curve is symmetrical, we can obtain the area between Z = –1.46 and Z = 0 by considering the area corresponding to Z = 1.46. Hence, when we look up Z = 1.46 in Appendix Table-3 given at the end of this block, we see the probability value of 0.4279. This is also the probability value for Z = –1.46, which lies to the left of µ.

(c) Find the area to the right of Z = 0.25

Solution: If we look up Z = 0.25 in the Appendix Table, we find the area 0.0987. Subtracting 0.0987 (for Z = 0.25) from 0.5 gives 0.4013 (0.5 – 0.0987 = 0.4013), which is the area to the right of Z = 0.25.
d) Find the area to the left of Z = 1.83.

Solution: If we are interested in finding the area to the left of a positive Z value, we add 0.5000 to the table value given for Z. Here, the table value for Z = 1.83 is 0.4664. Therefore, the total area to the left of Z = 1.83 is 0.9664 (0.5000 + 0.4664).

Now let us take up some illustrations to understand the application of the normal probability distribution.
Illustration 9
Assume the mean height of soldiers to be 68.22 inches with a variance of 10.8 inches. How many soldiers in a regiment of 1,000 would you expect to be over six feet tall?

Solution: Z = (X − µ) / σ

X = 72 inches; µ = 68.22 inches; and σ = √10.8 = 3.286

∴ Z = (72 − 68.22) / 3.286 = 1.15

For Z = 1.15 the area is 0.3749 (Appendix Table-3).

The area to the right of the ordinate at Z = 1.15 is (0.5 − 0.3749) = 0.1251. Hence, the probability of getting soldiers above six feet is 0.1251, and out of 1,000 soldiers the expectation is 1,000 × 0.1251 = 125.1, or 125. Thus, the expected number of soldiers over six feet tall is 125.
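The same calculation can be run without a table, using the error function to get the right-tail area (a sketch with names of our own choosing):

```python
from math import erf, sqrt

mu, variance, regiment = 68.22, 10.8, 1000
sigma = sqrt(variance)                           # about 3.286 inches
z = (72 - mu) / sigma                            # about 1.15
p_over_six_feet = 0.5 * (1 - erf(z / sqrt(2)))   # area to the right of z
expected_soldiers = regiment * p_over_six_feet   # about 125
```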

Illustration 10
(a) 15,000 students appeared for an examination. The mean marks were 49 and the
standard deviation of marks was 6. Assuming the marks to be normally
distributed, what proportion of students scored more than 55 marks?

Solution: Z = (X − µ) / σ

X = 55; µ = 49; σ = 6

∴ Z = (55 − 49) / 6 = 1

For Z = 1, the area is 0.3413 (as per Appendix Table-3).

∴ The proportion of students scoring more than 55 marks is 0.5 − 0.3413 = 0.1587, or 15.87%.

(b) If in the same examination, Grade ‘A’ is to be given to students scoring more
than 70 marks, what proportion of students will receive grade ‘A’?

Solution: Z = (X − µ) / σ

X = 70; µ = 49; σ = 6

∴ Z = (70 − 49) / 6 = 3.5

The table gives the area under the standard normal curve corresponding to Z = 3.5 as 0.4998.

Therefore, 0.02% of students (0.5 − 0.4998 = 0.0002, i.e., 0.02%) would score more than 70 marks. Since there are 15,000 candidates, 3 candidates (15,000 × 0.0002 = 3) will receive Grade 'A'.

Illustration 11
In a training programme (self-administered) to develop marketing skills of marketing
personnel of a company, the participants indicate that the mean time on the
programme is 500 hours and that this normally distributed random variable has a
standard deviation of 100 hours. Find out the probability that a participant selected
at random will take:
i) fewer than 570 hours to complete the programme, and
ii) between 430 and 580 hours to complete the programme.
Solution: (i) To get the Z value for the probability that a participant selected at random will take fewer than 570 hours, we have:

Z = (x − µ) / σ = (570 − 500) / 100 = 70/100 = 0.7

Consulting Appendix Table-3 for a Z value of 0.7, we find a probability of 0.2580 (this is the probability of lying between the mean, 500 hours, and 570 hours). As explained in illustration 8(d), we must add 0.5, the probability that the random variable lies between the left-hand tail and the mean. Therefore, the probability that the random variable lies between the left-hand tail and 570 hours is 0.7580 (0.5 + 0.2580).

Thus, the probability of a participant taking fewer than 570 hours to complete the programme is marginally higher than 75 per cent.

ii) In order to get the probability that a participant chosen at random will take between 430 and 580 hours to complete the programme, we must first compute the Z values for 430 and 580 hours.

Z = (x − µ) / σ

Z for 430 = (430 − 500) / 100 = −70/100 = −0.7

Z for 580 = (580 − 500) / 100 = 80/100 = 0.8

The table shows the probability values for Z = −0.7 and Z = 0.8 to be 0.2580 and 0.2881 respectively.

Thus, the probability that the random variable lies between 430 and 580 hours is 0.5461 (0.2580 + 0.2881).
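Both parts of Illustration 11 can be verified with a short sketch using the error function (the `cdf` helper below is our own, not from the text):

```python
from math import erf, sqrt

mu, sigma = 500, 100  # mean and standard deviation of completion time, in hours

def cdf(x):
    # P(X <= x) for a normal random variable, via the error function
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

p_fewer_than_570 = cdf(570)           # part (i), about 0.758
p_430_to_580 = cdf(580) - cdf(430)    # part (ii), about 0.546
```

The small discrepancies against the worked answers come only from four-decimal table rounding.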

14.5.3 Importance and Application of Normal Distribution


This distribution was initially discovered for studying the random errors in measurements encountered during the calculation of the orbits of heavenly bodies. This is because the normal distribution follows the basic principle of errors. It is mainly for this quality that the distribution has a wide range of applications in the theory of statistics. To name a few:

– Industrial quality control
– Testing of significance
– Sampling distribution of various statistics
– Graduation of non-normal curves
– Length of the leaves observed at particular times of the year.
The main purposes for using a normal distribution are:

(i) To fit a distribution of measurements for some sample data,

(ii) To approximate distributions like the binomial, Poisson, etc., and

(iii) To fit the sampling distribution of various statistics like the mean or variance.

Self Assessment Exercise C

1) Given a standardized normal distribution (with areas between the mean and a positive value of Z, as in Appendix Table-3), what is the probability that:
a) Z is less than +1.08?
b) Z is greater than – 0.21?
c) Z is between the mean and +1.08?
d) Z is between – 0.27 and +1.06?
e) Z is between – 0.21 and the mean?

2) Given a normal distribution with µ = 100 and σ = 10, what is the probability that:
a) X > 75?
b) X < 70?
c) X > 112?
d) 75 < X < 85?
e) X < 80 or X > 110?

14.6 LET US SUM UP


In this unit, we have discussed the meaning of frequency distribution and
probability distribution, and the concepts of random variables and probability
distribution. In any uncertain situation, we are often interested in the behaviour
of certain quantities that take different values in different outcomes of
experiments. These quantities are called random variables and a representation
that specifies the possible values a random variable can take, together with the
associated probabilities, is called a probability distribution. The distribution of a
discrete variable is called a discrete probability distribution and the function that
specifies a discrete distribution is termed as a probability mass function (p.m.f.).
In the discrete distribution we have considered the binomial and poisson
distributions and discussed how these distributions are helpful in decision-making.
We have shown the fitting of such distributions to a given observed data.

In the final section, we have examined situations involving continuous random


variables and the resulting probability distributions. The random variable which
can take an infinite number of values is called a continuous random variable
and the probability distribution of such a variable is called a continuous
probability distribution. The function that specifies such distribution is called the
probability density function (p.d.f.). One such important distribution, viz., the
normal distribution has been presented and we have seen how probability
calculations can be done for this distribution.

14.7 KEY WORDS


Binomial Distribution: It is a type of discrete probability distribution function
that includes an event that has only two outcomes (success or failure) and all
the trials are mutually independent.

Continuous Probability Distribution: In this distribution the variable under


consideration can take any value within a given range.

Continuous Random Variable: If the random variable is allowed to take any


value within a given range, it is termed as continuous random variable.

Discrete Probability Distribution: A probability distribution in which the


variable is allowed to take on only a limited number of values.

Discrete Random Variable: A random variable that is allowed to take only a


limited number of values.

Normal Distribution: It is a type of continuous probability distribution with a


single peaked, bell-shaped curve. The curve is symmetrical around a vertical
line erected at the mean. It is also known as Gaussian distribution.

Poisson Distribution: It is the limiting form of the binomial distribution, in which the probability of success is very low and the total number of trials is very high.

Probability: Any numerical value between 0 and 1 both inclusive, telling about
the likelihood of occurrence of an event.

Probability Distribution: A curve that shows all the values that the random
variable can take and the likelihood that each will occur.

Random Variable: It is a variable that can take different values as a result of the outcomes of a random experiment.

14.8 ANSWERS TO SELF ASSESSMENT EXERCISES
A) 1. a) True; b) False; c) True
d) True; e) False; f) True
2. a) 0.5997; b) 0.0016; c) 0.4018
3. Expected frequencies approximated
97 484 969 969 484 97
B) 2. 0.11
3. Expected frequencies (No. of ACs) approximated
147, 147, 74, 25, 6, 1.
C) 1. a) 0.8599; b) 0.5832; c) 0.3599
d) 0.4618; e) 0.0832
2. a) 0.9938; b) 0.00135; c) 0.1151
d) 0.0606; e) 0.1815.

14.9 TERMINAL QUESTIONS/EXERCISES


1) Distinguish between frequency distribution and probability distribution.

2) Explain the concept of random variable and probability distribution.

3) Define a binomial probability distribution. State the conditions under which the
binomial probability model is appropriate by illustrations.

4) Explain the characteristics of a poisson distribution. Give two examples, the


distribution of which will conform to the poisson form.

5) What do you mean by continuous probability distribution? How does it differ


from binomial distribution?

6) Explain the procedure involved in fitting binomial and poisson distributions.

7) If the average number of defective items in the manufacturing of certain items is 10%, what is the probability that a) 0, b) 2, c) at most 2, d) at least two items are found to be defective in a sample of 12 items taken at random?

Ans: a) 0.2824, b) 0.2301, c) 0.8891, d) 0.3410.



8) If the probability of a defective bolt is 0.1, find (a) the mean, and (b) the standard deviation of the number of defective bolts in a total of 900. (Ans. (a) 90; (b) 9)

9) Harry Onobr is in charge of the electronics section of a large department store.


He has noticed that the probability that a customer, who is just browsing will
buy something, is 0.3. Suppose that 15 customers browse in the electronics
section each hour.

a) What is the probability that at least one browsing customer will buy
something during a specified hour?
b) What is the probability that at least 4 browsing customers will buy
something during a specified hour?
c) What is the probability that no browsing customer will buy anything during a
specified hour?
d) What is the probability that not more than 4 browsing customers will
buy something during a specified hour?
[Ans. (a) .9953 (b) .7031 (c) .0047 (d) .5155]

10) Given a binomial distribution with n = 28 trials and p = .025, use the Poisson
approximation to the binomial to find:

(a) P ( r ≥ 3) (b) P (r < 5)


(c) P(r = 9)
[Ans. (a) .03414 (b) .99922 (c) 0.0000]

11) The average number of customer arrivals per minute at a departmental stores is
2. Find the probability that during one particular minute:

a) at least one customer will arrive


b) exactly three customers will arrive
c) at the most two customers will arrive
[Ans. a) 0.8646; b) 0.1805; (c) 0.6767]

12) A set of 5 fair coins was thrown 80 times, and the number of heads in each
throw was recorded and given in the following table. Estimate the probability of
the appearance of head in each throw for each coin and calculate the
theoretical frequency of each number of heads on the assumption that the
binomial law holds:

No. of heads: 0 1 2 3 4 5

Frequency: 6 20 28 12 8 6

Ans: (7, 19, 24, 18, 9, 3)

13) Fit a poisson distribution to the following observed data and calculate the
expected frequencies:

Deaths: 0 1 2 3 4

Frequency: 122 60 15 2 1


14) Given that a random variable X has a binomial distribution with n = 50 trials and p = .25, use the normal approximation to the binomial to find:

(a) P(x > 10) (b) P(x > 21) (c) P(x < 18) (d) P(9 < x < 14)
[Ans. (a) .7422 (b) .0016 (c) .9484 (d) .4658]

15) Glenn Howell, VP of personnel for the Standard Insurance Company, has developed a new training programme that is entirely self-paced. New employees work various stages at their own pace; completion occurs when the material is learned. Howell's programme has been especially effective in speeding up the training process, as an employee's salary during training is only 67% of that earned upon completion of the programme. In the last several years, average completion of the programme has taken 44 days, with a standard deviation of 12 days.
a) What is the probability that an employee will finish the programme between
33 and 42 days.
b) What is the probability of finishing the programme in fewer than 30 days?
c) What is the probability of finishing the programme in fewer than 25 or more
than 60 days?
[Ans. (a) .2537 (b) .1210 (c) .1489]
16) A project yields an average cash flow of Rs. 500 lakhs with a standard
deviation of Rs. 60 lakhs. Calculate the following probabilities:

i) Cash-flow will be more than Rs. 560 lakhs.


ii) Cash-flow will be less than Rs. 420 lakhs.
iii) Cash-flow will be between Rs. 460 lakhs and Rs. 540 lakhs.
iv) Cash-flow will be more than Rs. 680 lakhs.
[Ans. (i) .1587 (ii) .0918 (iii) .4972 (iv) .0013]
17) If log10x is normally distributed with mean 4 and variance 4, find the probability
of 1.202 < x < 83180000, given that
log10 1202 = 3.08, log10 8318 = 3.93.
(Ans. 95%)

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

14.10 FURTHER READING


The following books may be used for more indepth study on the topics dealt
within this unit.
Levin, R.I. & Rubin, D.S., 1991, Statistics for Management, PHI, New Delhi.
Gupta, S.P. 1999, Elementary Statistical Methods, Sultan Chand & Sons, New
Delhi.
Bhardwaj, R.S. 2001, Business Statistics, Excel Books, New Delhi.
Chandan, J.S. Statistics for Business and Economics, Vikas Publishing House
Pvt. Ltd., New Delhi.
UNIT 15 TESTS OF HYPOTHESIS–I
STRUCTURE

15.0 Objectives
15.1 Introduction
15.2 Point Estimation and Standard Errors
15.3 Interval Estimation
15.4 Confidence Limits, Confidence Interval and Confidence Co-efficient
15.5 Testing Hypothesis – Introduction
15.6 Theory of Testing Hypothesis – Level of Significance, Type-I and
Type-II Errors and Power of a Test
15.7 Two-tailed and One-tailed Tests
15.8 Steps to Follow for Testing Hypothesis
15.9 Tests of Significance for Population Mean–Z-test for variables
15.10 Tests of Significance for Population Proportion – Z-test for Attributes
15.11 Let Us Sum Up
15.12 Key Words and Symbols Used
15.13 Answers to Self Assessment Exercises
15.14 Terminal Questions/Exercises
15.15 Further Reading

15.0 OBJECTIVES
After studying this unit, you should be able to:
l estimate population characteristics (parameters) on the basis of a sample,
l get familiar with the criteria of a good estimator,
l differentiate between a point estimator and an interval estimator,
l comprehend the concept of statistical hypothesis,
l perform tests of significance of population mean and population proportion,
and
l make decisions on the basis of testing hypothesis.

15.1 INTRODUCTION
Let us suppose that we have taken a random sample from a population with a
view to knowing its characteristics, also known as its parameters. We are then
confronted with the problem of drawing inferences about the population on the
basis of the known sample drawn from it. We may look at two different
scenarios. In the first case, the population is completely unknown and we would
like to throw some light on its parameters with the help of a random sample
drawn from the population. Thus, if µ denotes the population mean, then we
intend to make a guess about it on the basis of a random sample. This is
known as estimation. For example, one may be interested to know the average
income of people living in the city of Delhi or the average life in burning hours
of a fluorescent tube light produced by ‘Indian Electrical’ or proportion of
people suffering from T.B. in city ‘B’ or the percentage of smokers in town
‘C’ and so on.

A somewhat different situation may arise when some information about a


parameter is either known or specified and we would like to verify whether
that information holds good for the sample drawn from the population as well.
This is known as the problem of testing of hypothesis. In the previous examples,
we may be interested in testing whether the average income in the city of
Delhi is, say, Rs. 2,000 per month. In the second example, we may like to
verify whether the claim made by Indian Electrical, that their fluorescent lamps
would last 5,000 hours, is justified. Some social workers may believe that 20%
of the population in city B suffers from T.B. We would like to make our
comment after a test of hypothesis. In the last example, some human activists,
concerned about the hazards of passive smoking, assert that 30% of the people
staying in town C are smokers. We may share their opinion once we have
satisfied ourselves after performing a statistical test of hypothesis.

It may be noted that testing of hypothesis plays a vital role in decision-making.


In the first example, the statistician may recommend whether to bracket
Delhi with the top metropolitan cities depending on the average income. If, on
the basis of a statistical test, it is found that
the claim made by the manufacturer Indian Electrical is justified, then the
sales of his lamps would increase. In the third example if there is evidence,
again on the basis of testing hypothesis, that the social worker is right about his
statement, suitable steps may be undertaken to improve the living conditions of
the marginalized section in the city so that the percentage of people suffering
from T.B. is reduced. Some strict legislation banning smoking or reducing
smoking to a desirable level may be enacted on the basis of a hypothesis tested
in the last example.

15.2 POINT ESTIMATION AND STANDARD ERRORS
Estimation is an integral part of our daily lives. In order to construct a new
house or renovate an old house or flat, we demand an estimate of the cost
involved. A student estimates his/her chance of success before appearing for an
expensive competitive examination.

Now we shall consider estimation from the viewpoint of a statistician. As we
discussed in Unit 4, the true value of a parameter can be obtained correctly only
through a census study, which in many cases is not practicable due to various
constraints. The alternative approach, therefore, is to select some items from the
population as a sample, collect and analyse the data, and then estimate the
characteristics of the population.
This is called estimation. Point estimate is one type of estimate. It is a single
number which is used as an estimate of the unknown population parameter. Let
us assume that we have taken a random sample of n observations, x1, x2,
x3…xn, from a population characterized by a parameter θ (read theta). This
symbol θ is used to denote a parameter that could be mean, mode or some
measure of variation, etc. Thus θ may be the mean (µ) of a normal distribution
or the probability of success (P) of a binomial distribution with parameters ‘n’
and ‘p’ and so on. In theory of estimation, we try to find a statistic (i.e., a
function of sample observations) T which estimates the unknown parameter θ.
Thus the sample mean ( x ) = ∑ x i / n , x1, x2, x3 …xn being the income per
month of ‘n’ persons selected at random from the city of Delhi, may be
considered to be the estimate of the average income per month (µ) of the
people of Delhi.
This is denoted by µ̂ = x̄, i.e., the estimate of µ is x̄.

To be more precise, x̄ is known as a point estimator of µ as we try to


estimate the population mean (µ) by a single value, namely, the sample mean.
On the basis of a random sample of incomes from Delhi, if it is found that the
sample mean is Rs. 2,000/-, then one may conclude that the estimate of
average income per month of the people living in that city is Rs. 2,000/-.

As opposed to a point estimate, one may think of an interval estimate that is


supposed to contain the average income of the people of Delhi per month. This
would be discussed in Section 15.3.

At this juncture, we must make a distinction between the two terms Estimator
and estimate. ‘T’ is defined to be an estimator of a parameter θ, if T estimates
θ. Thus T is a statistic and its value may differ from one sample to another
sample. In other words, T may be considered as a random variable. The
probability distribution of T is known as sampling distribution of T. As already
discussed, the sample mean x is an estimator of population mean µ. The value
of the estimator, as obtained on the basis of a given sample, is known as its
estimate. Thus x is an estimator of µ, the average income of Delhi, and the
value of x i.e., Rs. 2,000/-, as obtained from the sample, is the estimate of µ.

Selection of the best estimator: Our next endeavour would be to discuss


different criteria for selecting the best estimator.

Unbiasedness and Minimum Variance: A statistic T is defined to be


unbiased for a parameter θ if expectation of T is θ, i.e., E(T) = θ. On the
other hand if E(T) = θ + a (θ), then the difference a (θ) = E(T) – θ is known
as bias. The bias is known to be positive if a (θ) > 0 and negative if a (θ) <
0. Our first priority would be to select an unbiased estimator of θ. However,
there may be many unbiased estimators of θ. If x1, x2 …, xn denote n sample
observations from a population with an unknown parameter θ, then any of the n
observations or any linear function of them would be an unbiased estimator of θ.

In order to choose the best estimator among these estimators along with
“unbiasedness”, we introduce a second criterion, known as, minimum variance.
A statistic T is defined to be a minimum variance unbiased estimator (MVUE)
of θ if T is unbiased for θ and T has minimum variance among all the
unbiased estimators of θ. We may note that sample mean ( x ) is an MVUE for
µ.

We know that x̄ = ∑xi/n    …(15.1)

∴ E(x̄) = E(∑xi/n)

= (1/n) ∑E(xi)

= (1/n) ∑µ    [x1, x2, …, xn are taken from a population having µ as its mean]

= (1/n) · nµ

∴ E(x̄) = µ

Hence x̄ is an unbiased estimator of µ.


Further, the variance of x̄ is given by:

V(x̄) = V(∑xi/n)    (where V denotes variance)

= (1/n²) ∑V(xi)    [since the xi's are independent]

= (1/n²) ∑σ²    [where σ² is the population variance]

= (1/n²) · nσ² = σ²/n    ……(15.2)
It can be proved that x̄ has the minimum variance among all the unbiased
estimators of µ.
Consistency: If T is an estimator of θ, then it is obvious that T should be in
the neighbourhood of θ. T is known to be consistent for θ, if the difference
between T and θ can be made as small as we please by increasing the sample
size n sufficiently.
We can further add that T would be a consistent estimator of θ if

i) E(T) → θ, and

ii) V(T) → 0 as n → ∞.

For example, the sample mean x̄ is a consistent estimator of µ, as E(x̄) = µ
and V(x̄) = σ²/n → 0 as n → ∞.

It may be noted that if T is a consistent estimator of θ, then any function of T


is also a consistent estimator of θ.

Efficiency: A statistic T is called an efficient estimator of θ if it has the


minimum standard error among all the estimators of θ for a fixed sample size
n. Both the sample mean and sample median are consistent estimators for µ.
But standard error (a term, to be defined and explained in this section) of
sample mean is less than that of sample median. Hence sample median is only
a consistent estimator of µ, whereas sample mean is both consistent and
efficient estimator of µ.
Sufficiency: A statistic T is known to be a sufficient estimator of θ if T
contains sufficient information about θ so that we do not have to look for any
other estimator of θ. Sample mean ( x ) is a sufficient estimator of µ.
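These properties can be illustrated numerically. The sketch below is an added illustration, not part of the original unit; the population values µ = 50 and σ = 10 are assumed purely for the demonstration. It draws repeated random samples and shows the sample mean settling near µ as n grows, in line with unbiasedness and consistency:

```python
import random

random.seed(42)
MU, SIGMA = 50.0, 10.0  # assumed population mean and S.D. for the demo

def sample_mean(n):
    """Mean of one random sample of size n drawn from N(MU, SIGMA^2)."""
    return sum(random.gauss(MU, SIGMA) for _ in range(n)) / n

# As the sample size grows, the sample mean stays close to MU and its
# fluctuation around MU (of order SIGMA / sqrt(n)) shrinks -- consistency.
for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n), 3))
```

Running this prints means that hover around 50, with the deviation shrinking roughly like 10/√n.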
Now let us consider the following point estimates that are commonly used.
A) Estimating Population Mean: It is obvious that sample mean is the best
estimator of population mean µ. It is an MVUE. It is both consistent and
efficient estimator for µ. Furthermore, x̄ is a sufficient estimator for µ. Thus
we estimate the average income of the people of Delhi by the sample mean or
the average life of bulbs, manufactured by Indian Electricals, by the
corresponding sample mean.
B) Estimating Population Proportion: If a discrete random variable x
follows binomial distribution with parameters n and P, then we have
µ = E(x) = nP
σ² = V(x) = nP(1 − P)
[n denoting the number of trials and P denoting the probability of a success].

Hence, it follows that:

E(p) = E(x/n) = nP/n = P    …(15.3)

and V(p) = V(x/n) = V(x)/n²

= nP(1 − P)/n²

= P(1 − P)/n    …(15.4)
Thus if we take a random sample of size ‘n’ from a population where the
proportion of population possessing a certain characteristic is ‘P’ and the
sample contains x units possessing that characteristic, then an estimate of
population proportion (P) is given by:

P̂ = x/n    …(15.5)
In other words, the estimate of the population proportion is given by the
corresponding sample estimate i.e., P̂ = p …(15.6)

From (15.3), E(p) = P


So p is an unbiased estimator of P. It can be shown that p has the minimum
variance among all the unbiased estimators of P. In other words, p is an
MVUE of P.

As V(p) = P(1 − P)/n → 0 as n → ∞,
it follows from Eq. (15.4) that p is a consistent estimator of P. We can further
establish that p is an efficient as well as a sufficient estimator of P. Thus we
advocate the use of the sample proportion to estimate the population proportion,
since p satisfies all the desirable properties of an estimator.
In order to estimate the proportion of people suffering from T.B. in city B, if
we find the number of people suffering from T.B. is ‘x’ in a random sample of
size ‘n’, taken from city B, then sample estimate p = x/n would provide the
estimate of the proportion of people in that city suffering from T.B. Similarly,
the percentage of smokers as found from a random sample of people of town
C would provide the estimate of the percentage of smokers in town C.
C) Estimation of Population Variance and Standard Error: Standard error
of a statistic T, to be denoted by S.E. (T), may be defined as the standard
deviation of T as obtained from the sampling distribution of T. In order to
compute the standard error of the sample mean, it may be noted from Eq. (15.2) that:

S.E.(x̄) = σ/√n    for simple random sampling with replacement (SRSWR)

S.E.(x̄) = (σ/√n) √((N − n)/(N − 1))    for simple random sampling without replacement (SRSWOR)

where σ is the population standard deviation (S.D.), n is the sample size, N is the
population size, and the factor √((N − n)/(N − 1)) is known as the finite population
corrector (f.p.c.) or finite population multiplier (f.p.m.), which may be ignored for a
large population.
In order to find S.E., it is necessary to estimate σ 2 or σ in case it is unknown.
If x1, x2 …, xn denote n sample observations drawn from a population with
mean µ and variance σ 2, then the sample variance:

S² = ∑(xi − x̄)²/n    …(15.7)
may be considered to be an estimator of σ 2
Since E(xi) = µ and V(xi) = E(xi − µ)² = σ²    …(15.8)
We have

nS² = ∑(xi − x̄)²    [from (15.7)]

= ∑[(xi − µ) − (x̄ − µ)]²

= ∑[(xi − µ)² − 2(xi − µ)(x̄ − µ) + (x̄ − µ)²]

= ∑(xi − µ)² − 2(x̄ − µ) ∑(xi − µ) + n(x̄ − µ)²

= ∑(xi − µ)² − 2(x̄ − µ) · n(x̄ − µ) + n(x̄ − µ)²
    [since ∑(xi − µ) = ∑xi − ∑µ = nx̄ − nµ = n(x̄ − µ)]

= ∑(xi − µ)² − 2n(x̄ − µ)² + n(x̄ − µ)²

= ∑(xi − µ)² − n(x̄ − µ)²    …(15.9)

As xi is the i-th sample observation from a population with µ as mean and σ²
as variance, it follows that:

E(xi − µ)² = σ²

and E(x̄ − µ)² = V(x̄) = σ²/n    …(15.10)

From (15.9), E(nS²) = ∑E(xi − µ)² − n·E(x̄ − µ)²

= ∑σ² − n·(σ²/n) = nσ² − σ² = (n − 1)σ²

∴ E(S²) = ((n − 1)/n)σ² ≠ σ²    …(15.11)

Hence S2, the sample variance, is a biased estimator of σ2.

As E(S²) = ((n − 1)/n)σ²,

∴ E[(n/(n − 1))S²] = σ²    …(15.12)

Thus (n/(n − 1))S² = ∑(xi − x̄)²/(n − 1) = (s′)² is an unbiased estimator of σ²    …(15.13)
So, we use (s′)² = ∑(xi − x̄)²/(n − 1) as an estimator of σ², and

s′ = √(∑(xi − x̄)²/(n − 1)) as an estimator of σ.
An estimate of S.E. ( x ) is given by:

Ŝ.E.(x̄) = s′/√n    for SRSWR

= (s′/√n) √((N − n)/(N − 1))    for SRSWOR    ……(15.14)
From (15.4), it follows that V(p) = P(1 − P)/n, so that:

S.E.(p) = √(P(1 − P)/n)    for SRSWR

= √(P(1 − P)/n) √((N − n)/(N − 1))    for SRSWOR    ……(15.15)
An estimate of the standard error of the sample proportion is given by:

Ŝ.E.(p) = √(p(1 − p)/n)    for SRSWR

= √(p(1 − p)/n) √((N − n)/(N − 1))    for SRSWOR    ……(15.16)
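Relations (15.11) and (15.13) are easy to check by simulation. The following sketch is an added illustration with an assumed population (normal, variance σ² = 1) and sample size n = 5: across many repeated samples, the divisor-n variance averages close to (n − 1)/n = 0.8, while the divisor-(n − 1) version averages close to 1.

```python
import random

random.seed(1)
N, REPS = 5, 200_000  # assumed sample size and number of repeated samples
sum_biased = sum_unbiased = 0.0

for _ in range(REPS):
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]  # population variance = 1
    xbar = sum(xs) / N
    ss = sum((x - xbar) ** 2 for x in xs)
    sum_biased += ss / N          # S^2 with divisor n: biased, Eq. (15.11)
    sum_unbiased += ss / (N - 1)  # (s')^2 with divisor n-1: unbiased, Eq. (15.13)

print(round(sum_biased / REPS, 3))    # near (n-1)/n = 0.8
print(round(sum_unbiased / REPS, 3))  # near 1.0
```

The under-estimation factor (n − 1)/n is exactly the bias derived in (15.11).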
Let us consider the following illustrations to estimate variance from sample and
also estimate the standard error.
Illustration 1
A sample of 32 fluorescent lights taken from Indian Electricals was tested for
the lives of the lights in burning hours. The data are presented below:

Table 15.1: The Lives in Hours of 32 Lights

Sl. Life Sl. Life Sl. Life


No. (Hours) No. (Hours) No. (Hours)

1 4895 12 4992 23 4987


2 4907 13 4997 24 5021
3 5013 14 5003 25 5009
4 4996 15 4985 26 5016
5 5015 16 5015 27 5019
6 4899 17 5317 28 4903
7 4723 18 4990 29 4925
8 4968 19 4984 30 4972
9 5023 20 4923 31 5009
10 5021 21 4946 32 4998
11 5015 22 5024
Solution: We are interested in estimating the average life of fluorescent lights
manufactured by Indian Electricals. As discussed in this section, the estimate of
the population mean (µ) is given by the corresponding sample mean. Then
µ̂ = x̄. If we are further interested in estimating the standard error of x̄, then
we are to compute

Ŝ.E.(x̄) = s′/√n

where s′ = √(∑(xi − x̄)²/(n − 1)) = √((∑xi² − n x̄²)/(n − 1)),

and x̄ = ∑xi/n, n = sample size.
We ignore f.p.c. as the population of lights is very large.

Table 15.2: Computation of sample mean and sample S.D.

Life in Hours (xi)    ui = xi − 5000    ui²

4895 –105 11025


4907 –93 8649
5013 13 169
4996 –4 16
5015 15 225
4899 –101 10201
4723 –277 76729
4968 –32 1024
5023 23 529
5021 21 441
5015 15 225
4992 –8 64
4997 –3 9
5003 3 9
4985 –15 225
5015 15 225
5317 317 100489
4990 –10 100
4984 –16 256
4923 –77 5929
4946 –54 2916
5024 24 576
4987 –13 169
5021 21 441
5009 9 81
5016 16 256
5019 19 361

4903 –97 9409


4925 –75 5625
4972 –28 784
5009 9 81
4998 –2 4

Total –490 237242

From the above Table, ∑ui = −490, ∑ui² = 237242

∴ ū = ∑ui/n = −490/32 = −15.3125

As ui = xi − 5000,

∴ ū = x̄ − 5000

or, x̄ = 5000 + ū = 4984.6875 ≈ 4985 (approximately)

(s′)² = (∑ui² − n ū²)/(n − 1) = (237242 − 7503.125)/31 = 7410.9315

∴ s′ = 86.0868

Hence Ŝ.E.(x̄) = s′/√n = 86.0868/√32 = 15.2181

So the estimate of the average life of lights manufactured by Indian
Electricals is 4985 hours, the estimate of the population variance is 7410.9315
(hours)², and the standard error is 15.2181 hours.
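The arithmetic of Illustration 1 can be reproduced with a short script (a sketch using only Python's standard library; the data are the 32 lives of Table 15.1, with the 19th value taken as 4984 to match the computation table):

```python
import math

lives = [4895, 4907, 5013, 4996, 5015, 4899, 4723, 4968, 5023, 5021, 5015,
         4992, 4997, 5003, 4985, 5015, 5317, 4990, 4984, 4923, 4946, 5024,
         4987, 5021, 5009, 5016, 5019, 4903, 4925, 4972, 5009, 4998]

n = len(lives)
xbar = sum(lives) / n                               # point estimate of mu
s2 = sum((x - xbar) ** 2 for x in lives) / (n - 1)  # unbiased estimate of sigma^2
se = math.sqrt(s2 / n)                              # estimated S.E. of the mean

print(xbar, round(s2, 4), round(se, 4))
```

The printed values agree with the hand computation: mean about 4984.69 hours, variance estimate about 7410.93 (hours)², standard error about 15.22 hours.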

Illustration 2

A sample of 350 people from city C contained 70 smokers. Find an estimate


of the proportion of smokers in the city. Also find an estimate of the standard
error of the proportion of smokers in the sample.

Solution: In this case x = no. of smokers in the sample = 70, n = 350.

Thus we have p = x/n = 70/350 = 0.2
Hence the estimate of the proportions of smokers in the city is 0.2 or 20%.

Further,

Ŝ.E.(p) = √(p(1 − p)/n) = √(0.2 × (1 − 0.2)/350) = 0.0214

∴ The estimate of the standard error of the proportion of smokers in the


sample is 0.0214.
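The same two estimates follow directly from Eq. (15.5) and (15.16); a minimal sketch of Illustration 2:

```python
import math

x, n = 70, 350                   # smokers found, sample size
p = x / n                        # point estimate of the population proportion P
se = math.sqrt(p * (1 - p) / n)  # estimated S.E. of p (SRSWR, f.p.c. ignored)

print(p, round(se, 4))           # 0.2 and 0.0214
```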

Self Assessment Exercise A
1) State, with reasons, whether the following statements are true or false.

a) Both the statistic and parameter are functions of sample observations.


b) Any type of sampling would lead to the same inference about the population.
c) Statistical inference is a statistical process to know about a population from
the knowledge of a sample drawn from it.
d) Any type of estimator can be used for estimating a parameter.
e) In most cases, decision-making depends on estimation.
f) There may be more than one estimator for a parameter.
g) Assumption of normality is a must for point estimation.
h) Every consistent estimator is necessarily an efficient estimator.
i) A consistent estimator approaches the parameter with an increase in sample
size.
j) Point estimator is used as an estimate of the unknown population parameter.

2) Differentiate between estimator and estimate.

..............................................................................................................
.............................................................................................................

3) In choosing between sample mean and sample median – which one would you
prefer?

..............................................................................................................
.............................................................................................................
4) The monthly earnings of 20 families, obtained from a random sample from a
village in West Bengal are given below:

Sl. Monthly earnings (Rs.) Sl. Monthly earnings (Rs.)


No. No.
1 1023 11 1012
2 976 12 998
3 898 13 1015
4 1012 14 989
5 980 15 923
6 963 16 767
7 1023 17 897
8 946 18 1013
9 1007 19 947
10 977 20 958

Find an estimate of the average monthly earnings of the village. Also obtain an
estimate of the S.E. of the sample estimate.

..............................................................................................................
.............................................................................................................

..............................................................................................................
.............................................................................................................
..............................................................................................................

5) In a sample of 900 people, 429 people are found to be consumers of tea. Estimate
the proportion of consumers of tea in the population. Also find the corresponding
standard error.

.............................................................................................................
..............................................................................................................
.............................................................................................................

6) Obtain an unbiased estimate of population mean and population variance on the


basis of the following sample observations:
50, 46, 52, 53, 45, 43, 46, 48, 51
.................................................................................................................
.................................................................................................................
..................................................................................................................

15.3 INTERVAL ESTIMATION


This is another type of estimation. As opposed to estimating a parameter by a
single value i.e., point estimation discussed in the previous section, we may
think of an interval or a range of values that is supposed to contain the
parameter. An interval estimate would always be specified by two values i.e.,
the lower value and the upper value, within which the parameter lies. This is
known as Interval Estimation. Thus interval estimation may be defined as
estimating an interval to which the unknown parameter θ may belong, in all
likelihood.

Regarding the estimation of the average income of the people of Delhi city, one
may argue that it would be better to provide an interval which is likely to
contain the population mean. Thus, instead of saying the estimate of the
average income of Delhi is Rs. 2,000/-, we may suggest that, in all probability,
the estimate of the average income of Delhi would be from Rs. 1,900/- to Rs.
2,100/-. In the second example of estimating the average life of lights produced
by Indian Electricals where the estimate came out to be 4985 hours, the point
estimation may be a bone of contention between the producer and the potential
buyer. The buyer may think that the average life is rather less than 4985 hours.
An interval estimation of the life of lights might satisfy both the parties. Figure
15.1 shows some intervals for θ on the basis of different samples of the same
size from a population characterized by a parameter θ. A few intervals do not
contain θ.

[Figure: interval estimates from repeated samples of the same size; most intervals contain θ, a few do not]

Fig. 15.1: Confidence Intervals to θ

15.4 CONFIDENCE LIMITS, CONFIDENCE INTERVAL AND CONFIDENCE CO-EFFICIENT
Let us assume that we have taken a random sample of size ‘n’ from a
population characterized by a parameter θ. Let us further suppose that based
on these sample observations, it is possible to find two statistics t1 and t2 such
that:

P (t1 > θ) = α1

and P (t2 < θ) = α2

Where α1 and α2 are two small positive numbers. Combining these two
conditions, we may write:

P (t1 ≤ θ ≤ t2) = 1–α …(15.17)

Where α = α1 + α2

Equation (15.17) could be interpreted as the probability that θ lies between t1


and t2 is (1–α), whatever may be the value of θ, satisfying (15.17). The
interval [t1, t2], t1 being less than t2, that contains the parameter θ is known as the
Confidence Interval for θ, t1 being known as the Lower Confidence Limit and t2 as
the Upper Confidence Limit. (1−α) is known as the Confidence Coefficient
corresponding to the confidence interval [t1, t2].

One may like to know why the term ‘confidence’ comes into the picture. If we
choose α1 and α2 in such a way that α = 0.01, then the probability that θ would
belong to the random interval [t1, t2] is 0.99. In other words, one feels 99%
confident that [t1, t2] would contain the unknown parameter θ. Similarly if we
select α = 0.05, then P [t1 ≤ θ ≤ t2] = 0.95, thereby implying that we are 95%
confident that θ lies between t1 and t2. (15.17) suggests that as α decreases,
(1–α) increases and the probability that the confidence interval [t1, t2] would
include the parameter θ also increases. Hence our endeavour would be to
reduce ‘α’ and thereby increase the confidence co-efficient (1–α).

Referring to the estimation of the average life of lights (θ), if we observe that
θ lies between 4935 hours and 5035 hours with probability 0.98, then it would
imply that if repeated samples of a fixed size (say n = 32) are taken from the
population of lights, as manufactured by Indian Electricals, then in 98 per cent
of cases, the interval [4935 hours, 5035 hours] would contain θ, the average life
of lights in the population while in 2 per cent of cases, the interval would not
contain θ. In this case, the confidence interval for θ is [4935 hours, 5035
hours]. Lower Confidence Limit of θ is 4935 hours, Upper Confidence Limit of
θ is 5035 hours, and the Confidence Co-efficient is 98 per cent.

Selection of Confidence Interval

Our next task would be to select the basis for estimating confidence interval.
Let us assume that we have taken a random sample of size ‘n’ from a normal
population characterized by the two parameters µ and σ, the population mean
and standard deviation respectively. Thus, in the case of estimating a
Confidence Interval for average income of people dwelling in Delhi city, we
assume that the distribution of income is normal and we have taken a random
sample from the city. In another example concerning average life of fluorescent
lights as produced by Indian Electricals, we assume that the life of a
fluorescent light is normally distributed and we have taken a random sample
from the population of fluorescent lights manufactured by Indian Electricals.

Figure 15.2 shows percentage of area under Normal Curve. It can be shown
that if a random sample of size ‘n’ is drawn from a normal population with
mean ‘µ’ and variance σ2, then ( x ) , the sample mean also follows normal
distribution with ‘µ’ as mean and σ2/n as variance. Further as we have
observed in Section 15.2:

S.E.(x̄) = σ/√n
From the properties of normal distribution, it follows that the interval :
[µ − S.E. ( x ), µ + S.E. ( x )] covers 68.27% area.

The interval [µ − 2 S.E. ( x ), µ + 2 S.E.( x )] covers 95.45% area and the interval

[µ − 3 S.E. ( x ), µ + 3 S.E. ( x )] covers 99.73% area. Figure 15.2 depicts this


information.

[Figure: normal curve with 68.27% of the area within 1 standard error of µ, 95.45% within 2, and 99.73% within 3]

Fig. 15.2: Percentages of Area under a Normal Curve


Now let us consider a situation where the assumption of normality may not
hold. If the sample size is large enough, then the sample mean x̄ follows
approximately, i.e., asymptotically normal distribution with mean as µ and
standard error as σ/ n , µ and σ being the mean and S.D. of the population
under consideration. In case σ is unknown, we can replace it by the
corresponding sample standard deviation. One may ask the question as to how
large ‘n’ should be. It is rather difficult to specify an exact value of ‘n’ so that
the distribution of x̄ would be asymptotically normal. The larger the value of ‘n’,
the better. However, for practical purposes, if ‘n’ exceeds 30, then we may
assume that x̄ is asymptotically normal.
Our next question may be what would be the confidence interval for µ.

Will it be µ ± S.E. ( x ), or µ ± 2 S.E. ( x ), or µ ± 3 S.E. ( x ), or some other interval?

Suppose that the Confidence Interval for µ is given by x̄ ± u·S.E.(x̄), and we are to
determine u such that:

P[x̄ − u·S.E.(x̄) ≤ µ ≤ x̄ + u·S.E.(x̄)] = 1 − α    ……(15.18)

or, P[−u ≤ (x̄ − µ)/S.E.(x̄) ≤ u] = 1 − α

or, P[−u ≤ Z ≤ u] = 1 − α    [where Z = (x̄ − µ)/S.E.(x̄) is a standard normal variable]

or, φ(u) − φ(−u) = 1 − α    [where φ(K) = P(Z ≤ K), the area under the
standard normal curve from −∞ to K]

or, φ(u) − [1 − φ(u)] = 1 − α

or, 2φ(u) = 2 − α

or, φ(u) = 1 − (α/2)    …(15.19)

Putting α = 0.10 in (15.19), we get

φ(u) = 1 − 0.05 = 0.95

or, φ(u) = φ(1.645)

or, u = 1.645
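Relation (15.19) gives u for any confidence coefficient by inverting the standard normal distribution function φ. Python's standard library exposes this inverse as NormalDist().inv_cdf (a sketch; the multipliers 1.645, 1.96, 2.33 and 2.58 used in this unit are these quantiles rounded to two or three figures):

```python
from statistics import NormalDist

phi_inv = NormalDist().inv_cdf  # inverse of the standard normal c.d.f.

for alpha in (0.10, 0.05, 0.02, 0.01):
    u = phi_inv(1 - alpha / 2)  # solves phi(u) = 1 - alpha/2, Eq. (15.19)
    print(f"{100 * (1 - alpha):.0f}% confidence: u = {u:.3f}")
```

This prints u ≈ 1.645, 1.960, 2.326 and 2.576 for the 90%, 95%, 98% and 99% levels respectively.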
Thus the 100(1−α)%, i.e., 100(1−0.1)% or 90%, confidence interval for the
population mean µ is given by:

[x̄ − 1.645 σ/√n, x̄ + 1.645 σ/√n]

Putting α = 0.05, 0.02 and 0.01 respectively in (15.19) and proceeding in a similar
manner, we get:

95% Confidence Interval for µ = [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n]    …(15.20)

98% Confidence Interval for µ = [x̄ − 2.33 σ/√n, x̄ + 2.33 σ/√n]    …(15.21)

and 99% Confidence Interval for µ = [x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n]    …(15.22)
Theoretically we may take any Confidence interval by choosing ‘u’ accordingly.
However in a majority of cases, we prefer 95% or 99% Confidence Interval.
These are shown in Figure 15.3 and Figure 15.4 below.

[Figure: normal curve with 95% of the area between x̄ − 1.96 σ/√n and x̄ + 1.96 σ/√n, and 2.5% of the area in each tail]

Fig. 15.3: 95% Confidence Interval for Population Mean

[Figure: normal curve with 99% of the area between x̄ − 2.58 σ/√n and x̄ + 2.58 σ/√n, and 0.5% of the area in each tail]

Fig. 15.4: 99% Confidence Interval for Population Mean

Next we consider Interval Estimation in the following cases:


Interval Estimation of Population Mean

As suggested in this section under assumption of normality, 95% confidence


interval to µ, the population mean, is given by

[x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n]
If the assumption of normality does not hold but ‘n’ is greater than 30, the
above 95% confidence interval still may be used for estimating population mean.
In case σ is unknown, it may be replaced by the corresponding unbiased
estimate of σ, namely s′, so long as ‘n’ exceeds 30. However, we may face a
difficult situation in case σ is unknown and ‘n’ does not exceed 30. This
problem has been discussed in the next unit (Unit-16). Similarly, 99%
confidence interval to µ is given by :

[x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n]

In case σ is unknown and n > 30, the 99% confidence interval for µ is:

[x̄ − 2.58 s′/√n, x̄ + 2.58 s′/√n]    …(15.23)
Interval Estimation of Unknown Population Proportion

It can be shown that when n is large and neither ‘p’ nor (1−p) is small (one
may specify np ≥ 5 and n(1−p) ≥ 5), the sample proportion p is
asymptotically normal with mean P and S.E.(p) = √(P(1−P)/n), P being the
unknown population proportion in which we are interested. The estimate of
S.E.(p) is given by:

Ŝ.E.(p) = √(p(1 − p)/n)
Hence, the 95% confidence interval for P is given by:

[p − 1.96 √(p(1 − p)/n), p + 1.96 √(p(1 − p)/n)]    …(15.24)

and the 99% confidence interval for P is:

[p − 2.58 √(p(1 − p)/n), p + 2.58 √(p(1 − p)/n)]    …(15.25)

Let us consider the following illustrations to understand the procedure for


interval estimation.

Illustration 3

In a random sample of 1,000 families from the city of Delhi, it was found that
the average of income as obtained from the sample is Rs. 2,000/-, it is further
known that population S.D. is Rs. 258. Find 95% as well as 99% confidence
interval to population mean.

Solution: Let x denote income of the people of Delhi city. If µ denotes


average income of people dwelling in Delhi, then 95% confidence interval to µ
is:

[x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n]

and the 99% confidence interval for µ is:

[x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n]

Where x = Sample mean; n = Sample size, and σ = Population standard


deviation.
In our case,

x = Rs. 2000, n = 1000, σ = Rs. 258

∴ x̄ − 1.96 σ/√n = Rs. 2000 − 1.96 × 258/√1000 = Rs. 1984.01

x̄ + 1.96 σ/√n = Rs. 2000 + 1.96 × 258/√1000 = Rs. 2015.99
x̄ − 2.58 σ/√n = Rs. 2000 − 2.58 × 258/√1000 = Rs. 1979

and x̄ + 2.58 σ/√n = Rs. 2000 + 2.58 × 258/√1000 = Rs. 2021
Hence we have

95% confidence interval to average income for the people of Delhi = [Rs.
1984.01 to Rs. 2015.99] and 99% confidence interval to average income for the
people of Delhi = [Rs. 1979 to Rs. 2021].
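The intervals of Illustration 3 can be computed with a small helper (a sketch; it assumes σ is known and the sample is large, as in the illustration):

```python
import math

def mean_ci(xbar, sigma, n, z):
    """Confidence interval for the population mean with known sigma, large n."""
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

lo95, hi95 = mean_ci(2000, 258, 1000, 1.96)  # 95% interval
lo99, hi99 = mean_ci(2000, 258, 1000, 2.58)  # 99% interval
print(round(lo95, 2), round(hi95, 2))        # 1984.01 2015.99
print(round(lo99, 2), round(hi99, 2))        # about Rs. 1979 and Rs. 2021
```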

Illustration 4

Calculate the 95% and 99% confidence limits to the average life of fluorescent
lights produced by Indian Electricals.

Solution: Since σ, the population standard deviation is unknown and n = 32


(> 30), we replace σ by S|, the sample S.D. with divisor as (n–1) in our
previous example and get 95% confidence interval to µ is:

[x̄ − 1.96 s′/√n, x̄ + 1.96 s′/√n]

Similarly, the 99% confidence interval for µ = [x̄ − 2.58 s′/√n, x̄ + 2.58 s′/√n]
Where x̄ = sample mean = 4985 hours, n = sample size = 32, and

s′ = sample S.D. with (n−1) divisor = 86.0868 hours (as computed
earlier).

∴ x̄ − 1.96 s′/√n = 4985 − 1.96 × 86.0868/√32 = 4955.17 hours

x̄ + 1.96 s′/√n = 4985 + 1.96 × 86.0868/√32 = 5014.83 hours

x̄ − 2.58 s′/√n = 4985 − 2.58 × 86.0868/√32 = 4945.74 hours

x̄ + 2.58 s′/√n = 4985 + 2.58 × 86.0868/√32 = 5024.26 hours

∴ 95% Confidence Interval for the average life of lights = [4955.17
hours, 5014.83 hours].

99% Confidence Interval to the average life of lights = [4945.74


hours, 5024.26 hours].

Illustration 5

While interviewing 350 people in a city, the number of smokers was found to
be 70. Obtain the 99% lower confidence limit and the corresponding upper
confidence limit for the proportion of smokers in the city.

Solution: As discussed in the previous section, 99% Lower Confidence Limit


to P, the proportion of smokers in the city is given by:

p − 2.58 √(p(1 − p)/n)

and the 99% Upper Confidence Limit for P is:

p + 2.58 √(p(1 − p)/n)
provided np ≥ 5 and n(1 − p) ≥ 5.

In this case x = no. of smokers = 70

n = no. of people interviewed = 350

∴ p = x/n = 70/350 = 0.2

As np = 350 × 0.2 = 70 and n (1–p) = 350 × 0.8 = 280 are rather large, we
can apply the formula for 99% Confidence Limit as mentioned already.
∴ 99% Lower Confidence Limit to P is :

0.2 × (1 − 0.2)
0.2 − 1.96 × = 0.2 − 0.0214 = 0.1786
350
99% Upper Confidence Limit to P is :

0.2 × (1 − 0.2)
0.2 + 1.96 × = 0.2 + 0.0214 = 0.2214
350
Hence 99% Lower Confidence Limit and 99% Upper Confidence Limit for the
proportion of smokers in the city are 0.1786 and 0.2214 respectively.
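A sketch of the same computation in Python; note that the multiplier applied to the standard error √(p(1 − p)/n) is 2.58 for 99% confidence:

```python
from math import sqrt

def proportion_confidence_limits(x, n, z):
    """Large-sample limits for a population proportion: p ± z * sqrt(p(1-p)/n)."""
    p = x / n
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# 70 smokers among 350 people interviewed; z = 2.58 for 99% confidence
low, high = proportion_confidence_limits(70, 350, 2.58)
print(round(low, 4), round(high, 4))   # 0.1448 0.2552
```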

Illustration 6

In a random sample of 19586 people from a town, 2358 people were found to
be suffering from T.B. With 95% Confidence as well as 98% Confidence, find
the limits between which the percentage of the population of the town suffering
from T.B. lies.

Solution: Let x be the number of people suffering from T.B. in the sample and
‘n’ the number of people who were examined. Then the proportion of
people suffering from T.B. in the sample is given by:

p = x/n = 2358/19586 = 0.1204

As np = x = 2358 and n(1 − p) = n − x = 19586 − 2358 = 17228

are both very large numbers, we can apply the formula for finding the Confidence
Interval as mentioned in the previous section. Thus the 95% Confidence Interval to
P, the proportion of the population of the town suffering from T.B., is given by:

[p − 1.96 √(p(1 − p)/n), p + 1.96 √(p(1 − p)/n)]

= [0.1204 − 1.96 √(0.1204 × (1 − 0.1204)/19586), 0.1204 + 1.96 √(0.1204 × (1 − 0.1204)/19586)]

= [0.1204 − 0.0046, 0.1204 + 0.0046] = [0.1158, 0.1250]

In a similar way, the 98% Confidence Interval to P is given by:

[p − 2.33 √(p(1 − p)/n), p + 2.33 √(p(1 − p)/n)]

= [0.1204 − 2.33 √(0.1204 × (1 − 0.1204)/19586), 0.1204 + 2.33 √(0.1204 × (1 − 0.1204)/19586)]

= [0.1204 − 0.0054, 0.1204 + 0.0054] = [0.1150, 0.1258]

Thereby, we can say with 95% confidence that the percentage of the population of
the town suffering from T.B. lies between 11.58 and 12.50, and with 98%
confidence that the percentage of the population suffering from T.B. lies between
11.50 and 12.58.

Illustration 7

A famous shoe company produces 80,000 pairs of shoes daily. From a sample
of 800 pairs, 3% are found to be of poor quality. Find the limits for the number
of substandard pairs of shoes that can be expected when the Confidence Level
is 0.99.

Solution: Let p be the sample proportion of defective shoes as produced by
the shoe company. In this case the sample size (n) is 800 and the population
size (N) is 80,000. Since the population is very large, we do not apply the
finite population correction.

p = 3% = 0.03

∴ S.E. (p̂) = √(p(1 − p)/n) = √(0.03 × (1 − 0.03)/800) = 0.0060

Thus the 99% Lower Confidence Limit to P, the proportion of defective shoes in
the daily production of the shoe company, is:

p − 2.58 S.E. (p̂) = 0.03 − 2.58 × 0.006 = 0.01452

Similarly, the 99% Upper Confidence Limit to P is:

p + 2.58 S.E. (p̂) = 0.03 + 2.58 × 0.006 = 0.04548

Hence, the Lower Limit to the number of substandard, i.e., defective, pairs of
shoes at 99% Level of Confidence = N × 0.01452

= 80,000 × 0.01452 = 1161.6 = (approximately) 1162

The Upper Limit to the number of substandard pairs of shoes at 99% Level of
Confidence is

80,000 × 0.04548 = 3638.4 = (approximately) 3638
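The two-step calculation (proportion limits first, then scaling by N) can be sketched as follows; the standard error is rounded to four decimals as in the worked solution:

```python
from math import sqrt

# Illustration 7: 3% of an 800-pair sample is substandard; daily output N = 80,000
n, N, p = 800, 80_000, 0.03
se = round(sqrt(p * (1 - p) / n), 4)        # 0.0060, rounded as in the worked solution
low_p = p - 2.58 * se                        # 99% lower limit for the proportion
high_p = p + 2.58 * se                       # 99% upper limit for the proportion
print(round(N * low_p), round(N * high_p))   # 1162 3638
```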

Self Assessment Exercise B

1) State with reasons, whether the following statements are true or false.

a) Confidence Interval provides a range of values that may not contain the
parameter.
b) Confidence Interval is a function of Confidence Co-efficient.
c) 95% Confidence Interval for population mean is x̄ ± 1.96 S.E. (x̄).
d) While computing Confidence Interval for population mean, if the population
S.D. is unknown, we can always replace it by the corresponding sample S.D.

e) 99% Upper Confidence Limit for population proportion is p + 1.96 √(p(1 − p)/n).
f) Confidence co-efficient does not contain Lower Confidence Limit and Upper
Confidence Limit.

g) If np ≥ 5 and np (1–p) ≥ 5, one may apply the formula p ± zα √(p(1 − p)/n) for
computing Confidence Interval for population proportion.
h) The interval µ ± 3 S.E. (x̄) covers 96% area of the normal curve.
2) Differentiate between Point Estimation and Interval Estimation.
...............................................................................................................
..............................................................................................................

3) Distinguish between Confidence Limit and Confidence Interval.


...............................................................................................................
...............................................................................................................
...............................................................................................................

4) Out of 25,000 customers’ ledger accounts, a sample of 800 accounts was taken
to test the accuracy of posting and balancing and 50 mistakes were found.
Assign limits within which the number of wrong postings can be expected with
99% confidence.
...............................................................................................................
...............................................................................................................
...............................................................................................................
5) A sample of 20 items is drawn at random from a normal population comprising
200 items and having standard deviation as 10. If the sample mean is 40,
obtain 95% Interval Estimate of the population mean.
...............................................................................................................
...............................................................................................................
...............................................................................................................

6) A new variety of potato grown on 400 plots provided a mean yield of 980
quintals per acre with a S.D. of 15.34 quintals per acre. Find 99% Confidence
Limits for the mean yield in the population.
................................................................................................................
................................................................................................................
................................................................................................................

15.5 TESTING HYPOTHESIS – INTRODUCTION


Referring to the problem of the status to be given to Delhi City, one of the
criteria for determining the status would be the average income of the people
of Delhi. Let us suppose that if ‘µ’, the average income of the people is Rs.
3,000 per month, then Delhi would belong to the group of top cities. In order to
estimate ‘µ’, we take a random sample of people living in that city and
compute x , the sample mean. If x is in the neighbourhood of Rs. 3,000, then
we have no hesitation in declaring the status of Delhi as one belonging to the
top grade. But the most important question would be as to what difference
between the sample mean and Rs. 3000 (population mean) can be accepted as
the difference due to only sampling fluctuations.

In order to answer this question, let us familiarise ourselves with a few terms
associated with the problem. A statement like ‘The average income of the
people belonging to the city of Delhi is Rs. 3,000 per month’ is known as a
null hypothesis. Thus, a null hypothesis may be described as an assumption or
a statement regarding a parameter (population mean, ‘µ’, in this case) or about
the form of a population. The term ‘null’ is used as we test the hypothesis on
the assumption that there is no difference or, to be more precise, no significant
difference between the value of a parameter and that of an estimator as
obtained from a random sample taken from the population. A hypothesis may
be simple or composite.

A simple hypothesis is one that specifies the population distribution


completely. Thus testing µ = 3,000 is a simple hypothesis if the population
standard deviation (σ) is known.

A composite hypothesis is one that does not specify the population


completely. Testing µ = 3,000 when σ is unknown is a composite hypothesis as
it does not specify the population completely. A null hypothesis is denoted by
H0. Thus we may write :

H0 : µ = 3,000

i.e., the null hypothesis is that the population mean is Rs. 3,000. Generally, we
write

H0 : µ = µ0

i.e., the null hypothesis is that the population mean µ equals µ0, where µ0 may be
any value as specified in a given situation.

Obviously a null hypothesis (H0) is to be tested against an appropriate


alternative hypothesis (H1). Any hypothesis that contradicts a null
hypothesis is known as an alternative hypothesis. If the null hypothesis is
rejected, the alternative hypothesis is accepted. Procedures enabling us to
decide whether to accept or reject a hypothesis are known as tests of hypothesis
or tests of significance or decision rules. Thus, the entire process of hypothesis
testing is either to reject or accept H0 only.

In the present problem, one may argue that since many people of Delhi city are
living in the slums and even on the pavements, the average income should be
less than Rs. 3000. So one alternative hypothesis may be :

H1 : µ < 3,000, i.e., the average income is less than Rs. 3,000, or, symbolically,

H1 : µ < µ0, i.e., the population mean (µ) is less than µ0.

Again one may feel that since there are many multistoried buildings and many
new models of vehicles run through the streets of the city, the average income
must be more than Rs. 3,000. So another alternative hypothesis may be :

H2 : µ > 3000 i.e., the average income is more than Rs. 3,000.

or, H2 : µ > µ0 i.e., the population mean is more than µo.

Lastly, another group of people may opine that the average income is
significantly different from µ0. So the third alternative could be :

H : µ ≠ 3000 i.e., the average income is anything but Rs. 3,000.

or, H : µ ≠ µ0 i.e., the population mean is not µ0.

15.6 THEORY OF TESTING HYPOTHESIS —


LEVEL OF SIGNIFICANCE, TYPE-I AND
TYPE-II ERRORS AND POWER OF A TEST
In order to take a decision about acceptance or rejection of a null hypothesis,
let us consider the theory involving testing of hypothesis. Suppose that we have
a random sample of size ‘n’ taken from a population characterized by an
unknown parameter ‘θ’. We denote the n sample observations by x = (x1, x2,
x3, …xn) and we would like to test
H0 : θ = θ0 against
H 1 : θ = θ1

If n = 2, then x = (x1, x2) can be represented as a point in the 2-dimensional


plane taking, say x1, on the horizontal axis and x2 on the vertical axis. In a
similar way, it is possible to conceive of x = (x1, x2, x3, …xn) as a point in n-
dimensional space. Consider all the possible samples of a fixed size ‘n’, i.e., NCn
in case of SRSWOR and Nn in the case of SRSWR, N denoting the population
size. Next we consider the sample space formed by all these points and let it
be denoted by Ω. We divide Ω into two parts ω and A = Ω – ω, the boundary
of ω is taken within A. We frame a simple rule which says that if the sample
point x falls on ω, we reject H0 and if x falls on A, we accept H0. ω is known
as the critical region or rejection region and A, as the acceptance region. At
this juncture, let us make one point clear. Acceptance of H0 does not mean
that H0 is always true. It just reflects the idea that on the basis of the given
data, there is not enough evidence to support the validity of H1. In a similar
manner rejection of H0 indicates the null hypothesis does not hold good in the
light of the given sample observations.
Type-I and Type-II Errors

Now while testing H0 we are liable to commit two types of errors. In the first
case, it may be that H0 is true but x falls on ω and as such, we reject H0.
This is known as type-I error or error of the first kind. Thus type-I error is
committed in rejecting a null hypothesis which is, in fact, true. Secondly, it may
be that H0 is false but x falls on A and hence we accept H0. This is known as
type-II error or error of the second kind. So type-II error may be described as
the error committed in accepting a null hypothesis which is, in fact, false. The
two kinds of errors are shown in Table 15.3.

Table 15.3: Types of Errors in Testing Hypothesis

Real Situation Statistical decision based on sample


H0 Accepted H0 Rejected

H0 True Right decision Type-I error


H0 False Type-II error Right decision

It is obvious that we should take into account both types of errors and must try
to reduce them. Since committing these two types of errors may be regarded as
random events, we may modify our earlier statement and suggest that an
appropriate test of hypothesis should aim at reducing the probabilities of both
types of errors. Let ‘α’ (read as ‘alpha’) denote the probability of type-I error
and ‘β’ (read as ‘beta’) the probability of type-II error. Thus, by definition, we
have

α = The probability of the sample point falling on the critical region when H0 is
true, i.e., the value of θ is θ0 = P (x ∈ ω | θ0) …(15.26)

and β = The probability of the sample point falling on the acceptance region when
H1 is true, i.e., the value of θ is θ1

= P (x ∈ A | θ1) … (15.27)

Surely, our objective would be to reduce both type-I and type-II errors. But
since we have taken recourse to sampling, it is not possible to reduce both
types of errors simultaneously for a fixed sample size. As we try to reduce ‘α’,
β increases and a reduction in the value of β results in an increase in the value
of ‘α’. Thus, we fix α, the probability of type-I error to a given level (say, 5
per cent or 1 per cent) and, subject to that fixation, we try to reduce β, the
probability of type-II error. ‘α’ is also known as the size of the critical region.
It is further known as the level of significance, as ‘α’ constitutes the basis for
treating the difference (θ – θ0) as significant. The selection of the level of
significance ‘α’ depends on the experimenter.
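The two error probabilities can also be estimated by simulation. The population values below (µ0 = 3000, µ1 = 2900, σ = 500, n = 25) are illustrative assumptions, not taken from the text:

```python
import random
from math import sqrt

random.seed(1)
mu0, mu1, sigma, n, trials = 3000, 2900, 500, 25, 10_000
crit = 1.96 * sigma / sqrt(n)   # reject H0 : mu = mu0 when |xbar - mu0| exceeds this (alpha = 0.05)

def rejects(true_mu):
    """Draw one sample of size n and apply the two-tailed decision rule."""
    xbar = sum(random.gauss(true_mu, sigma) for _ in range(n)) / n
    return abs(xbar - mu0) > crit

alpha_hat = sum(rejects(mu0) for _ in range(trials)) / trials      # Type-I rate: H0 true but rejected
beta_hat = sum(not rejects(mu1) for _ in range(trials)) / trials   # Type-II rate: H0 false but accepted
print(alpha_hat, beta_hat)   # roughly 0.05 and 0.83
```

With these assumed values the estimated α stays near the chosen 5% level, while β is large; increasing n would shrink β for the same α.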

Power of a Test: By definition, we have

β = P (x ∈ A | θ = θ1) [from 15.27]

∴ 1 – β = 1 – P (x ∈ A | θ = θ1) = P (x ∈ ω | θ = θ1)

[Since x may fall either on ω or on A, P (x ∈ ω | θ = θ1) + P (x ∈ A | θ = θ1) = 1,
and so 1 – P (x ∈ A | θ = θ1) = P (x ∈ ω | θ = θ1)]
Now P (x ∈ ω | θ = θ1) is the probability of rejecting H0 when H0 is false
and the alternative hypothesis H1 is true, which should be the desirable property
of an appropriate test. It is obvious that a low value of β would ensure a high
value of (1–β). Hence we try to minimize β, the probability of type-II error, as
the minimization of β ensures the maximization of (1–β). The expression (1–β)
serves as an indicator of the validity of the test as a very high value of (1–β)
indicates that the test is doing fine in its endeavour to reject a false hypothesis.
Hence (1–β) is known as power of the test as it tells us how well the test
under consideration is performing when the null hypothesis is not true. It is
obvious that we should try to make our test as powerful as possible subject to
a fixed value of α . One may regard power of a test as a function of θ. The
function P (θ) = 1–β (θ) is known as the power function of the test. The
curve obtained by plotting P (θ) against θ is known as power curve. Look at
the following figure 15.5 which exhibits a power curve.

[Fig. 15.5: Power Curve of a Test. P(θ) is plotted against θ, rising towards 1.0.]
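For a right-tailed Z-test of H0 : µ = µ0 at level α, the power function reduces to P(µ) = 1 − φ(zα − (µ − µ0)√n/σ), which the sketch below evaluates at a few illustrative (assumed) values:

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()   # standard normal distribution
mu0, sigma, n, z_alpha = 3000, 500, 25, 1.645   # assumed values; z_alpha for alpha = 0.05

def power(mu):
    """P(reject H0 | true mean is mu) for the right-tailed Z-test of H0 : mu = mu0."""
    return 1 - Z.cdf(z_alpha - (mu - mu0) * sqrt(n) / sigma)

for mu in (3000, 3100, 3200, 3300):
    print(mu, round(power(mu), 3))   # power rises from about 0.05 towards 1
```

At µ = µ0 the power equals α, and it increases steadily as the true mean moves away from µ0, which is exactly the shape of the power curve in Fig. 15.5.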

15.7 TWO-TAILED AND ONE-TAILED TESTS


In order to test the null hypothesis H0 : θ = θ0 against a plausible alternative
hypothesis, let us suppose that we find a statistic T which is a sufficient
estimator of θ. We assume further that, based on a random sample taken from
the population characterized by an unknown parameter θ, it is possible to find a
function of T and θ; let u = u (T, θ) be such a function. T is known as the
test statistic for testing H0 : θ = θ0. Lastly, let us assume that when θ = θ0,
u0 = u (T, θ0), i.e., u0 is the value of ‘u’ under H0 (i.e., assuming the null
hypothesis to be true). Based on the sampling distribution of the test statistic u
under H0, it may be possible to find four values of u, namely uα/2, u(1–α/2), uα and
u(1–α), for a fixed level of significance α, such that:
P (u0 ≥ uα/2) = α/2 … (15.28)

P (u0 ≤ u(1–α/2)) = α/2 … (15.29)

P (u0 ≥ uα) = α … (15.30)

P (u0 ≤ u(1–α)) = α … (15.31)

uα may be described as the upper α-point of the distribution of u and u(1-α) as


the corresponding lower α-point.

Two-tailed test: Adding (15.28) and (15.29), we get :

P (u0 ≥ uα/2) + P (u0 ≤ u(1–α/2)) = α … (15.32)

i.e., the probability that u0 would exceed uα/2 or that u0 would be less than
u(1–α/2) is α.

In order to test H0 : θ = θ0 against H1 : θ ≠ θ0, if we select a low value of α,
say α = 0.01, then (15.32) suggests that the probability that u0 is greater than
uα/2 or u0 is less than u(1–α/2) is 0.01, which is pretty low. So, on the basis of a
random sample drawn from the population, if it is found that u0 is greater than
uα/2 or u0 is less than u(1–α/2), then we have rather strong evidence that H0 is not true.
Then we reject H0 : θ = θ0 and accept the alternative hypothesis H1 : θ ≠ θ0.
As shown in the following Figure 15.6, here the critical region lies on both tails
of the probability distribution of u.

[Fig. 15.6: Critical region of a two-tailed Test. The acceptance region covers 100(1–α)% of the area; the critical regions ω : u0 ≤ u(1–α/2) and ω : u0 ≥ uα/2 lie on the two tails, each covering 50α% of the area.]

If the sample point x falls on one of the two tails, we reject H0 and accept H1
: θ ≠ θ0. The statistical test for H0 : θ = θ0 against H1 : θ ≠ θ0 is known as
both-sided test or two-tailed test as the critical region, ‘ω’ lies on both sides of
the probability curve, i.e., on the two tails of the curve. The critical region is
ω : u0 ≥ uα/2 and ω : u0 ≤ u(1-α/2). It is obvious that a two-tailed test is
appropriate when there are reasons to believe that ‘u’ differs from θ0
significantly on both the left side and the right side, i.e., the value of the test
statistic ‘u’ as obtained from the sample is significantly either greater than or
less than the hypothetical value.

For testing the null hypothesis H0 : µ = 3000, i.e., the average income of the
people of Delhi city is Rs. 3000, one may think that the alternative hypothesis
would be H1 : µ ≠ 3000 i.e., the average income is not Rs. 3000 and as such,
we may advocate the application of a two-tailed test. Similarly, for testing the
null hypothesis that the average life of lights produced by Indian Electricals is
5,000 hours against the alternative hypothesis that the average life is not 5,000
hours, i.e., for testing H0 : µ = 5,000 against H1 : µ ≠ 5,000, we may prescribe
a two-tailed test. In the problem concerning the health of city B, we may be
interested in testing whether 20% of the population of city B really suffers from
T.B. i.e., testing H0 : P = 0.2 against H1 : P ≠ 0.2 and again a two-tailed test
is necessary and lastly regarding the harms of smoking, we may like to test H0
: P = 0.3 against H1 : P ≠ 0.3.

Right-tailed Tests
We may think of testing a null hypothesis against another pair of alternatives. If
we wish to test H0 : θ = θ0 against H1 : θ > θ0, then from (15.30) we have
P (u0 ≥ uα) = α. This suggests that a low value of α, say α = 0.01, implies
that the probability that u0 exceeds uα is 0.01. So the probability that u0
exceeds uα is rather small. Thus on the basis of a random sample drawn from
this population if it is found that u0 is greater than uα, then we have enough
evidence to suggest that H0 is not true. Then we reject H0 and accept H1. This
is exhibited in Figure 15.7 as shown below:

[Fig. 15.7: Critical region of a right-tailed Test. The acceptance region covers 100(1–α)% of the area; the critical region ω : u0 ≥ uα lies on the right tail, covering 100α% of the area.]

As shown in figure 15.7, the critical region lies on the right tail of the curve.
This is a one-sided test and as the critical region lies on the right tail of the
curve, it is known as right-tailed test or upper-tailed test. We apply a right-
tailed test when there is evidence to suggest that the value of the statistic u is
significantly greater than the hypothetical value θ0. In case of testing about the
average income of the citizens of Delhi, if one has prior information to suggest
that the average income of Delhi is more than Rs. 3,000, then we would like to
test H0 : µ = 3,000 against H1 : µ > 3,000 and we select the right-tailed test.
In a similar manner for testing the hypothesis that the average life of lights by
Indian Electricals is more than 5,000 hours or for testing the hypothesis that
more than 20 per cent suffer from T.B. in city B or for testing the hypothesis
that the per cent of smokers in town C is more than 30, we apply the right-
tailed test.

Left-tailed test
Lastly, we may be interested to test H0 : θ = θ0 against H2 : θ < θ0. From
(15.31), we have P (u0 ≤ u(1–α)) = α. Choosing α = 0.01, this implies that the
probability that u0 would be less than u(1–α) is 0.01, which is surely very low. So,
if on the basis of a random sample taken from the population, it is found that u0
is less than u(1–α), then we have very serious doubts about the validity of H0. In
this case, we reject H0 and accept H2 : θ < θ0. This is reflected in Figure 15.8
shown below.

[Fig. 15.8: Critical Region of a Left-tailed Test. The acceptance region covers 100(1–α)% of the area; the critical region ω : u0 ≤ u(1–α) lies on the left tail, covering 100α% of the area.]

The test for H0 : θ = θ0 against H2 : θ < θ0 is another one-sided test and as


the critical region lies on the left tail of the curve, this is known as a left-
tailed test or a lower-tailed test. We apply a left-tailed test when there is
enough indication to suggest that the value of the test statistic ‘u’ is significantly
less than the hypothetical value. Then for determining the status of Delhi city, if
somebody suggests with evidence that the average income is less than Rs.
3,000 and as such Delhi should not be regarded as a top grade city, then we
are to test H0 : µ = 3000 against H2 : µ < 3000, which is a left-tailed test. We
may further note that we apply left-tailed test when we would like to test the
hypothesis that the average life of lights of Indian Electricals is less than 5,000
hours or less than 20 per cent are suffering from T.B. in city B or less than 30
per cent are smokers in town C.

15.8 STEPS TO FOLLOW FOR TESTING


HYPOTHESIS
While testing hypothesis, one must go through the following steps.

1) Set up the null hypothesis H0 : θ = θ0 and one of the alternative hypotheses
H : θ ≠ θ0 or H1 : θ > θ0 or H2 : θ < θ0, depending upon the problem.
Selecting the proper alternative plays a significant role in decision making in
connection with testing of hypothesis.

2) Choose the appropriate test statistic ‘u’ and sampling distribution of ‘u’ under
H0. In most cases ‘u’ follows a standard normal distribution under H0 and
hence Z-test can be recommended in such a case.

3) Select α, the level of significance of the test if it is not provided in the given
problem. In most cases, we choose α = 0.05 or α = 0.01, which are known as
5% level of significance and 1% level of significance.

4) Define critical region ω, based on the alternative hypothesis. For testing


H0 : θ = θ0 against both-sided alternative H1 : θ ≠ θ0, the critical region is given
by ω : u0 ≥ uα/2 and ω : u0 ≤ u(1-α/2). Similarly, the critical region for the right-
sided alternative is given by ω : u0 ≥ uα and the critical region for the left-sided
alternative is given by ω : u0 ≤ u(1–α).

5) Obtain the value of u0 on the basis of the given sample observations.

6) Reject H0 if u0 falls on ω. Otherwise accept H0.

7) Draw your own conclusion in very simple language which should be understood
even by a layman.
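The seven steps can be collected into one small routine; the critical values follow the text, while the sample figures used in the demonstration at the end are hypothetical:

```python
from math import sqrt

# Step 4: critical values for the usual levels of significance
CRITICAL = {('two', 0.05): 1.96, ('two', 0.01): 2.58,
            ('right', 0.05): 1.645, ('right', 0.01): 2.33,
            ('left', 0.05): -1.645, ('left', 0.01): -2.33}

def z_test_mean(xbar, mu0, sigma, n, tail='two', alpha=0.05):
    """Steps 2-7: compute z0 and decide whether to reject H0 : mu = mu0."""
    z0 = sqrt(n) * (xbar - mu0) / sigma   # step 5: value of the test statistic
    c = CRITICAL[(tail, alpha)]           # step 4: boundary of the critical region
    if tail == 'two':
        reject = abs(z0) >= c
    elif tail == 'right':
        reject = z0 >= c
    else:                                 # left-tailed
        reject = z0 <= c
    return z0, ('reject H0' if reject else 'accept H0')

# hypothetical sample: n = 64, xbar = 52, testing H0 : mu = 50 with sigma = 8
print(z_test_mean(52, 50, 8, 64, tail='right'))   # (2.0, 'reject H0')
```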

15.9 TESTS OF SIGNIFICANCE FOR POPULATION


MEAN – Z-TEST FOR VARIABLES
Let us assume that we have taken a random sample of size ‘n’ from a normal
population with mean as µ and standard deviation as σ. Let the sample
observations be denoted by x1, x2, x3, …xn. While testing for the unknown
population mean µ, we are to consider the following cases.

Case 1: When the standard deviation σ is known. We want to test H0 : µ =


µ0 against one of the following alternative hypotheses.

H : µ ≠ µ0 or,
H1 : µ > µ0 or,
H2 : µ < µ0.

As we have discussed in Section 15.2, the best statistic for the parameter µ is
x̄. It has been proved in that Section that E (x̄) = µ and

S.E. (x̄) = σ/√n

As such, the test statistic:

z = (x̄ − E(x̄)) / S.E. (x̄) = (x̄ − µ) / (σ/√n)
is a standard normal variable. Under H0, i.e., assuming the null hypothesis to be
true,
z0 = √n (x̄ − µ0)/σ is a standard normal variable. As such, the test is known as
standard normal variate test or standard normal deviate test or Z-test. In order
to find the critical region for testing H0 against H from (15.28) and (15.29), we
find that :
α
P (u 0 ≥ u ( α / 2 , ) =
2
α
and P ( u 0 ≤ u (1− α / 2 , ) =
2

If we denote the standard normal variate by Z, the upper α/2-point of the
standard normal distribution by Zα/2, and the corresponding lower α/2-point by
Z(1–α/2) = –Zα/2 (as the standard normal distribution is symmetrical about 0),
then the above two equations reduce to:

P (Z0 ≥ Zα/2) = α/2 ……(15.33)

and P (Z0 ≤ –Zα/2) = α/2 ……(15.34)

From (15.33), we have:

1 – P (Z0 < Zα/2) = α/2

or, 1 – φ (Zα/2) = α/2

or, φ (Zα/2) = 1 – α/2

Choosing α = 0.05, φ (Z0.025) = 1 – 0.025 = 0.975

or, φ (Z0.025) = φ (1.96) [from Section 15.4]

Thus, Z0.025 = 1.96


Hence from (15.33) and (15.34), we have
P (Z0 ≥ 1.96) = 0.025
and P (Z0 ≤ –1.96) = 0.025
Combining these two equations, we get P (|Z0| ≥ 1.96) = 0.05 ……(15.35)

Thus, for testing H : µ ≠ µ0, the critical region is given by:

ω : |Z0| ≥ 1.96

when the level of significance is 5% and

Z0 = √n (x̄ − µ0)/σ ……(15.36)

Proceeding in a similar manner, the critical region for the two-tailed test at 1%
level of significance is given by:

ω : |Z0| ≥ 2.58 ……(15.37)
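The cut-off points quoted above are simply quantiles of the standard normal distribution, and can be recovered from its inverse c.d.f. (here via Python's standard library):

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal distribution

for alpha in (0.05, 0.01):
    two_tail = Z.inv_cdf(1 - alpha / 2)   # z such that P(|Z0| >= z) = alpha
    one_tail = Z.inv_cdf(1 - alpha)       # z such that P(Z0 >= z) = alpha
    print(alpha, round(two_tail, 3), round(one_tail, 3))
```

At α = 0.01 the exact points are 2.576 and 2.326, quoted in the text as 2.58 and 2.33.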

Now if we decide to test H0 against the alternative hypothesis H1 : µ > µ0,
from (15.30) we have

P (Z0 ≥ Zα) = α

or, 1 – P (Z0 < Zα) = α

or, 1 – φ (Zα) = α

or, φ (Zα) = 1 – α …… (15.38)

Putting α = 0.05 in (15.38), we get

φ (Z0.05) = 0.95 = φ (1.645) [from Section 15.4]

Hence the critical region for this right-tailed test at 5% level of significance is:
ω : Z0 ≥ 1.645
Similarly, the critical region at 1% level of significance would be:
ω : Z0 ≥ 2.33
Finally, if we make up our minds to test H0 against H2 : µ < µ0, then from
(15.31), we get

P (Z0 ≤ –Zα) = α

or, φ (–Zα) = α

or, 1 – φ (Zα) = α

or, φ (Zα) = 1 – α

Thus, as before, putting α = 0.05, we get Zα = 1.645,

and as such the critical region for this left-tailed test at 5% level of
significance is:
ω : Z0 ≤ –1.645
The critical region, when the level of significance is 1%, is:
ω : Z0 ≤ –2.33

Figures 15.9 and 15.10 describe critical regions at 5% and 1% level of


significance for Two-tailed Tests. Right-tailed tests are shown in Figures 15.11
and 15.12 and left-tailed tests are exhibited in Figures 15.13 and 15.14.

[Figure 15.9: Two-tailed Critical Region for Testing Population Mean at 5% Level of Significance. The acceptance region covers 95% of the area around µ0; the critical regions ω : Z0 ≤ –1.96 and ω : Z0 ≥ 1.96 each cover 2.5% of the area.]

[Figure 15.10: Two-tailed Critical Region for Testing Population Mean at 1% Level of Significance. The acceptance region covers 99% of the area around µ0; the critical regions ω : Z0 ≤ –2.58 and ω : Z0 ≥ 2.58 each cover 0.5% of the area.]

[Figure 15.11: Right-tailed Critical Region for Testing Population Mean at 5% Level of Significance. The acceptance region covers 95% of the area; the critical region ω : Z0 ≥ 1.645 covers 5% of the area on the right tail.]


[Figure 15.12: Right-tailed Critical Region for Testing Population Mean at 1% Level of Significance. The acceptance region covers 99% of the area; the critical region ω : Z0 ≥ 2.33 covers 1% of the area on the right tail.]

[Figure 15.13: Left-tailed Critical Region for Testing Population Mean at 5% Level of Significance. The acceptance region covers 95% of the area; the critical region ω : Z0 ≤ –1.645 covers 5% of the area on the left tail.]

[Figure 15.14: Left-tailed Critical Region for Testing Population Mean at 1% Level of Significance. The acceptance region covers 99% of the area; the critical region ω : Z0 ≤ –2.33 covers 1% of the area on the left tail.]

Case II: When the population standard deviation is unknown.

In order to test for the population mean, we replace σ by its unbiased estimator

S′ = √( Σ(Xi − x̄)² / (n − 1) )

in the test statistic used in Case I, provided we have a sufficiently large sample
(as discussed earlier, n should exceed 30). Thus we consider

Z0 = √n (x̄ − µ0)/S′

Z0 is a standard normal variable. As before for testing H0 : µ = µ0 against


both-sided alternative H : µ ≠ µ0, the critical region at 5% level of significance
would be given by :

ω : |Z0| ≥ 1.96

Also the critical region at 1% level of significance would be

ω : | Z0 | ≥ 2.58

Further the critical region at 5% level of significance for the right-sided


alternative H1: µ > µ0 would be :

ω : Z0 ≥ 1.645
and ω : Z0 ≥ 2.33 when the level of significance is 1%.

Lastly, the critical region for the left-sided alternative H2 : µ < µ0 would be
provided by :

ω : Z0 ≤ –1.645

and ω : Z0 ≤ –2.33 when α = 0.05 and 0.01 respectively.

15.10 TESTS OF SIGNIFICANCE FOR


POPULATION PROPORTION – Z-TEST FOR
ATTRIBUTES
We consider now the problem of testing H0 : P = P0, i.e., testing whether the
proportion of units in the population possessing a certain characteristic is P0,
i.e., a specified value.

For example, if we want to test whether a fresh coin just coming out from a
mint is unbiased, then we are to test H0 : P = 0.5. Similarly, the problem of
testing whether 20% population of city B is suffering from T.B amounts to
testing Ho : P = 0.2 or testing whether 30% population of a town are smokers
is equivalent to testing H0 : P = 0.3.

As discussed earlier, the number of units in the population having a certain


characteristic follows Binomial Distribution with parameters ‘n’ and P. If ‘n’ is
large such that both nP and n(1–P) are not less than 5, then we can
approximate a Binomial Distribution by a Normal Distribution with mean as µ =
nP and variance as σ2 = nP (1–P).

Hence, it follows that the sample proportion p = x/n follows, under H0, a normal
distribution with mean P0 and S.D. √(P0 (1 − P0)/n).
Thus Z0 = (p − P0) / √(P0 (1 − P0)/n) = √n (p − P0) / √(P0 (1 − P0))

is a standard normal variate and as such we can apply Z-test for attributes.

Hence, as discussed earlier, the critical region for testing H0 : P = P0 against


two-sided alternative H : P ≠ P0 would be given by :
ω : |Z0| ≥ 1.96 when the level of significance is 5% and by
ω : |Z0| ≥ 2.58 at 1% level of significance.
The critical regions for the right-sided alternative H1 : P > P0 at 5% level of
significance and 1% level of significance would be:
ω : Z0 ≥ 1.645 and
ω : Z0 ≥ 2.33 respectively.

Lastly when it comes to testing H0 against the left-sided alternative


H 2 : P < P0 ,

We have the critical regions as ω : Z0 ≤ –1.645 when α = 0.05


and ω : Z0 ≤ –2.33 when α = 0.01
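A minimal sketch of the attribute test in code; the coin example at the end (116 heads in 200 tosses, testing H0 : P = 0.5) is a made-up illustration:

```python
from math import sqrt

def z_test_proportion(x, n, p0):
    """Z statistic for H0 : P = p0; valid when n*p0 and n*(1 - p0) are both at least 5."""
    p = x / n
    return sqrt(n) * (p - p0) / sqrt(p0 * (1 - p0))

# Is a coin unbiased?  H0 : P = 0.5 against H : P != 0.5 (two-tailed)
z0 = z_test_proportion(116, 200, 0.5)
print(round(z0, 3), 'reject H0' if abs(z0) >= 1.96 else 'accept H0')   # 2.263 reject H0
```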

Let us consider the following illustrations to understand the application of this


concept.

Illustration 8

The mean breaking strength of the cables supplied by a manufacturer is 1,900


units with a standard deviation of 110 units. By a new technique in the
manufacturing process, the manufacturer claims, the breaking strength of the
cables supplied by him has increased. In order to test his claim, a sample of 50
cables is tested. It is found that the mean breaking strength, as obtained from
the sample, is 1926. Can you support the claim both at 5% and 1% levels of
significance?

Solution: Let the mean breaking strength of the cables be denoted by x̄.


Since the sample size (n) is 50 which is more than 30, we can apply Z-test.
Then we are to test,
H0 : µ = 1900 i.e., the mean breaking strength of the cables is 1900
units
Against H1 : µ > 1900; i.e., the mean breaking strength has increased

we use Z0 = √n (x̄ − 1900) / σ
The critical region for this right-sided alternative is given by :

ω : Z0 ≥ 1.645 at 5% level of significance and

ω : Z0 ≥ 2.33 at 1% level of significance

As per given data, n = 50, x̄ = 1926, σ = 110

Z0 = √50 (1926 − 1900) / 110 = 1.671

Thus, we reject H0 at 5% level of significance but accept the null hypothesis at
1% level of significance. On the basis of the given data, we thus conclude that
the manufacturer’s claim is justifiable at 5% level of significance, but at 1%
level of significance we infer that the manufacturer has been unable to produce
cables with a higher breaking strength.
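Readers who wish to verify the arithmetic can reproduce Illustration 8 in a few lines (an illustrative Python sketch using the figures given above):

```python
import math

n, xbar, mu0, sigma = 50, 1926, 1900, 110
z0 = math.sqrt(n) * (xbar - mu0) / sigma   # about 1.671

reject_at_5pct = z0 >= 1.645   # right-tailed critical value at 5%
reject_at_1pct = z0 >= 2.33    # right-tailed critical value at 1%
```

Here reject_at_5pct comes out True while reject_at_1pct comes out False, matching the conclusion above.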

Illustration 9

A random sample of 500 flower stems has an average length of 11 cm. Can
this be regarded as a sample from a large population with mean as 10.8 cm
and standard deviation as 2.38 cm?

Solution: Let the length of the stem be denoted by x. Assume that µ denotes
the mean of stems in the population. The sample size 500 being very large, we
apply Z-test for testing H0 : µ = 10.8, i.e., the population mean is 10.8 cm.
against H : µ ≠ 10.8, i.e., the population mean is not 10.8 cm.

As such we consider, as test statistic :

Z0 = √n (x̄ − 10.8) / σ
and choosing the level of significance as 5%, we note that the critical region is :
ω : |Z0| ≥ 1.96
as per given data,
n = 500, x̄ = 11 cm, σ = 2.38 cm

∴ Z0 = √500 (11 − 10.8) / 2.38 = 1.879

Thus we accept H0. We conclude that on the basis of the given data, the
sample can be regarded as taken from a large population with mean as 10.8
cm and standard deviation as 2.38 cm.

Illustration 10

A manufacturer of batteries asserts that the batteries made by him have a


mean life of 650 hours with a standard deviation of 12.83 hours. Ten batteries
were tested and the length of life of the batteries was recorded in hours as
follows:

623, 648, 672, 685, 692, 650, 649, 666, 638, 629

Examine whether the manufacturer was right in his assertion.

Solution: We assume that x, the length of battery-life is normally distributed


with mean as 650 hours and standard deviation as 12.83 hours. We are
interested in testing H0 : µ = 650, i.e., the average life is 650 hours,
against H1 : µ < 650, i.e., the average life is less than 650 hours.

We consider Z0 = √n (x̄ − 650) / σ


and recall that the critical region at 1% level of significance (selecting α =
0.01) for this left-tailed test is given by

ω : Z0 < –2.33
since n = 10, σ = 12.83 hours, and

x̄ = (623 + 648 + 672 + 685 + 692 + 650 + 649 + 666 + 638 + 629) / 10 = 655.2 hours

∴ Z0 = √10 (655.2 − 650) / 12.83 = 1.282
As this does not fall in the critical region, H0 is accepted. Thus, on the basis of
the given sample, we conclude that the manufacturer’s assertion was right.
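The whole of Illustration 10, starting from the raw battery lives, can be checked with a short Python sketch (an aside, not part of the unit):

```python
import math

lives = [623, 648, 672, 685, 692, 650, 649, 666, 638, 629]
n = len(lives)
xbar = sum(lives) / n          # 655.2 hours
mu0, sigma = 650, 12.83

z0 = math.sqrt(n) * (xbar - mu0) / sigma   # about 1.282
in_critical_region = z0 <= -2.33           # left-tailed region at the 1% level
```

Since Z0 is positive it cannot fall in the left-tailed critical region, so H0 is accepted, as above.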

Illustration 11

The heights of 12 students taken at random from St. Nicholas College, which
has 1,000 students and a standard deviation of height as 10 inches, are
recorded in inches as 65, 67, 63, 69, 71, 70, 65, 68, 63, 72, 61 and 66.
Do the data support the hypothesis that the mean height of all the students in
that college is 68.2 inches?

Solution: Letting x stand for height of the students of St. Nicholas College, we
would like to test

H0 : µ = 68.2 i.e., the mean height is 68.2 inches against H : µ ≠ 68.2,


i.e. the mean height is not 68.2 inches.

The critical region for this two-tailed test is :

ω : |Z0| ≥ 1.96 when α = 0.05


ω : |Z0| ≥ 2.58 when α = 0.01

where Z0 = (x̄ − 68.2) / S.E.(x̄)

In this case,
x̄ = (65 + 67 + 63 + 69 + 71 + 70 + 65 + 68 + 63 + 72 + 61 + 66) / 12 = 66.67 inches

n = sample size = 12; N = population size = 1000; σ = population S.D. = 10


inches
S.E.(x̄) = (σ/√n) √[(N − n)/(N − 1)]

= (10/√12) √[(1000 − 12)/(1000 − 1)] = 2.8708 inches

∴ Z0 = (66.67 − 68.2) / 2.8708 = −0.533

Looking at ω, we accept H0 at both 5% and 1% levels of significance. So on
the basis of the given data, we comment that the mean height of the students
of St. Nicholas College is 68.2 inches.
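The finite population correction used in Illustration 11 is worth seeing in code. The Python sketch below (an illustrative aside) recomputes the standard error and Z0 from the raw heights; small rounding differences aside, |Z0| stays well inside the acceptance region.

```python
import math

heights = [65, 67, 63, 69, 71, 70, 65, 68, 63, 72, 61, 66]
n, N, sigma, mu0 = len(heights), 1000, 10, 68.2
xbar = sum(heights) / n        # 800/12, about 66.67 inches

# Standard error of the mean with the finite population correction
se = (sigma / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))

z0 = (xbar - mu0) / se         # about -0.53
accept_h0 = abs(z0) < 1.96     # two-tailed test at the 5% level
```

Since |Z0| is well below both 1.96 and 2.58, H0 is accepted at both levels.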

Illustration 12
A coin is tossed 950 times and heads appear 500 times. Does the result
support the hypothesis that the coin is unbiased? Select α = 0.01.

Solution: As explained in this section, we would denote by P the probability of


getting a head. So testing the hypothesis that the coin is unbiased amounts to
testing H0 : P = 0.5 against H : P ≠ 0.5, i.e., the coin is biased.

Since n = 950; nP0 = 950 × 0.5 = 475; and nP0 (1–P0) = 237.5, we can apply
Z-test for proportion. Thus we compute :

Z0 = √n (p − 0.5) / √[0.5 (1 − 0.5)] and note that the critical region at 1% level of
significance for this two-tailed test is :
ω : |Z0| ≥ 2.58

As p = x/n = 500/950 = 0.5263

∴ Z0 = √950 (0.5263 − 0.5) / 0.5 = 1.621
So we accept H0. On the basis of the given data, we conclude that the coin is
unbiased.

Illustration 13

In a sample of 800 parts manufactured by a company, number of defective


parts was found to be 60. The company, however, claims that only 7% of their
product is defective. Apply an appropriate test to verify whether the
manufacturer’s claim is tenable.

Solution: Let ‘p’ be the sample proportion of defectives and P, the proportion
of defective parts in the whole manufacturing process. Then we are to test

H0 : P = 0.07, i.e., the proportion of defective parts in the process is 7% as


claimed by the manufacturer against H1 : P > 0.07, i.e., the proportion of
defective parts is more than 7%.

We consider Z-test as nP0 = 800 × 0.07 = 56 as well as nP0 (1 − P0) = 800 ×
0.07 × 0.93 = 52.08 are quite large.

If we select α = 0.05, then the critical region for this right-tailed test is :
ω : Z0 ≥ 1.645
We have, as given, p = x/n = 60/800 = 0.075

Z0 = √n (p − 0.07) / √(0.07 × 0.93) = √800 (0.075 − 0.07) / √(0.07 × 0.93)

= 0.5543, we ignore f.p.c as the population size is unknown.

Thus, Z0 falls on the acceptance region and we accept the null hypothesis. We
conclude that on the basis of the given information, the manufacturer’s claim is
valid.
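As a quick numerical check of Illustration 13 (an illustrative Python sketch; the condition test mirrors the nP0 ≥ 5 rule used in this unit):

```python
import math

x, n, P0 = 60, 800, 0.07
p = x / n                                    # 0.075

# Conditions for the normal approximation used in this unit
approx_ok = n * P0 >= 5 and n * (1 - P0) >= 5

z0 = math.sqrt(n) * (p - P0) / math.sqrt(P0 * (1 - P0))   # about 0.5543
reject = z0 >= 1.645                          # right-tailed test at the 5% level
```

Z0 falls short of 1.645, so the null hypothesis is accepted, exactly as above.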

Illustration 14

A family-planning activist claims that more than 33 per cent of the families in
her town have more than one child. A random sample of 160 families from the
town reveals that 50 families have more than one child. What is your
inference? Select α = 0.01.

Solution: If ‘P’ denotes the proportion of families in the town having more
than one child, then we want to test H0 : P = 0.33 against H1 : P > 0.33.

We consider Z0 = √n (p − 0.33) / √[0.33 (1 − 0.33)] as test statistic and note
that at 1% level of significance the critical region is ω : Z0 ≥ 2.33.

Here, p = 50/160 = 0.3125, n = 160

∴ Z0 = √160 (0.3125 − 0.33) / √[0.33 (1 − 0.33)] = −0.4708

Thus H0 is accepted, and so the claim of the activist is not justified at 1% level
of significance on the basis of the given sample.

Self Assessment Exercise C

1) Examine whether the following statements are true or false:

a) A statistical hypothesis is an assumption about some parameter.


b) A reduction of type-I error results in an increase in type-II error.
c) Power of a test is a function of type-I error.
d) Type-II error is committed when we reject a true null hypothesis.
e) Probability of type-I error is also known as the level of significance of the
test.
f) The critical region for the two-tailed test for population mean at 5% level of
significance is ω : |Z0| ≥ 2.58.
g) Z-test for population proportion is an exact test.
h) When the sample size is very large, any test can be approximated by a Z-
test.

2) Distinguish between TYPE-I and TYPE-II errors.


.......................................................................................................................
.......................................................................................................................
.....................................................................................................................
3) Differentiate between one-tailed tests and two-tailed tests.
.......................................................................................................................
.....................................................................................................................
........................................................................................................................
4) A sample of 5 units is taken from a normal population having variance as 4 squared
units. The sample observations are 23, 32, 35, 28 and 30. Do the data suggest that
the population mean is 30 units? Test at 5% level of significance.

.......................................................................................................................
.......................................................................................................................
.......................................................................................................................
.......................................................................................................................
5) A producer making electronic components claims that not more than 2% of his
components are defective. A sample of 300 components resulted in 16
defectives. Would you support his view ?
.......................................................................................................................
......................................................................................................................
.......................................................................................................................
.......................................................................................................................
6) The numbers of male and female births in a hospital during a month were found
to be 1980 and 1870 respectively. Do the data conform to the hypothesis that the
sexes are born in the same ratio?
.......................................................................................................................
.......................................................................................................................
.......................................................................................................................
.......................................................................................................................

15.11 LET US SUM UP


Statistical inference is a method to throw some light on the unknown population
with the help of a sample drawn from it. There are two types of estimates
with respect to estimating a parameter. They are : a) point estimates; b)
interval estimates. We estimate a parameter with the help of a single value
known as Point Estimate or a pair of values, known as Interval Estimate.

In a somewhat different situation, some information about some characteristic(s)


of the population may be known and we would like to examine whether that
information holds good for the sample as well. This is known as Test of
hypothesis or test of Significance or Decision rule. While testing a hypothesis,
one is likely to commit two types of Errors. Type-I error is committed in
rejecting a true null hypothesis and Type-II error occurs when a false null
hypothesis is accepted. A good test aims at reducing ‘p’, the probability of
Type-II error, keeping α, the probability of type-I error at a fixed level. α is
also known as size of the test or the level of significance. The test procedure
comprises finding the value of the test statistic assuming the null hypothesis
to be true and comparing this value to the critical value.

We have concluded our discussion by conducting tests for population mean and
population proportion under different types of alternative hypothesis.

15.12 KEY WORDS AND SYMBOLS USED


Alternative Hypothesis: A hypothesis contradicting a null hypothesis. It is
denoted by H or H1 or H2.

Consistency: T is a consistent Estimator of θ if E(T) → θ and V(T) → 0 for


large n.

Critical Region or Rejection Region: The set of values of the test statistic Tests of Hypothesis–I
leading to the rejection of H0. It is a part of the sample space and is denoted
by ω. If the sample point falls on ω, we reject H0.

Efficiency: T is an efficient Estimator of θ if T has the minimum standard


error among all the estimators of θ for a fixed sample size.

Interval Estimation: Estimation of a parameter θ by a pair of values, say, t1


and t2, t1 < t2. t1 is known as Lower confidence Limit and t2 as Upper
Confidence Limit. The probability that [t1, t2] contains θ is known as confidence
co-efficient and denoted by (1–α).
95% Confidence Interval to µ = [x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n]

99% Confidence Interval to µ = [x̄ − 2.58 σ/√n, x̄ + 2.58 σ/√n]

where x̄ = sample mean; σ = population S.D.; and n = sample size.

When σ is unknown, it can be replaced by s′ = √[∑ (Xi − x̄)² / (n − 1)] provided
‘n’ exceeds 30.
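As a numerical companion to these interval formulas, here is a small Python sketch; the sample figures (x̄ = 500, σ = 30, n = 100) are purely illustrative:

```python
import math

xbar, sigma, n = 500.0, 30.0, 100   # hypothetical large-sample figures

half_95 = 1.96 * sigma / math.sqrt(n)
half_99 = 2.58 * sigma / math.sqrt(n)

ci_95 = (xbar - half_95, xbar + half_95)   # (494.12, 505.88)
ci_99 = (xbar - half_99, xbar + half_99)   # (492.26, 507.74)
```

As expected, the 99% interval is wider than the 95% interval: a higher confidence coefficient demands wider limits.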

Level of Significance: This is the probability of type-I error and is denoted by


α. Usually α is taken as 0.01 or 0.05 and accordingly we have 1% or 5% level
of significance.

Null Hypothesis: An assumption or statement regarding the parameter or the


form of a population distribution. A null hypothesis is denoted by H0.

Point Estimation: Estimation of an unknown parameter θ by a statistic T with


the help of a single value obtained from a random sample.

Power of a Test: Probability of rejecting a null hypothesis when it is false.


This is given by P(θ) = 1–β (θ) = 1-Probability of Type-II error.

Test statistic: A function of sample observations whose value, as computed


from a random sample, determines the acceptance or rejection of the null
hypothesis.

Type-I Error: Error committed in rejecting a true H0.

Type-II Error: Error committed in accepting a false H0.

Sufficiency: T is a sufficient estimator of θ if it contains all the information


about θ.

Standard Error (S.E.): S.E of a statistic T is the standard deviation of T as


obtained from its sampling distribution.

Unbiasedness and minimum variance: A statistic T is an unbiased estimator


of θ if the expectation of T is θ. T is an MVUE (Minimum variance unbiased
estimator) for θ if T has the minimum variance among all the unbiased
estimators of θ.

Z-test for population mean: For testing H0 : µ = µ0, the test statistic is given by
Z0 = √n (x̄ − µ0) / σ. If σ is unknown and n > 30, we replace σ by s′ in the
expression for Z0.

Z-test for population proportion: For testing H0 : P = P0 we consider
Z0 = √n (p − P0) / √[P0 (1 − P0)], provided n is large,

where, p = sample proportion

Under the assumption that the null hypothesis is true, Z0 follows standard
normal distribution. At 5% level of significance, the critical region for the two-
tailed test is given by

ω : |Zo| ≥ 1.96
The critical region for the right-tailed test is
ω : Zo ≥ 1.645
and the critical region for the left-tailed test is
ω : Zo ≤ –1.645
Similarly when the level of significance is 1%, the critical region for the two-
tailed test is
ω : |Zo| ≥ 2.58
For the right-tailed test, the critical region is
ω : Zo ≥ 2.33
and that for the left-tailed test is
ω : Zo ≤ –2.33

15.13 ANSWERS TO SELF ASSESSMENT


EXERCISES
A) 1. a) No, b) No, c) Yes, d) No, e) Yes, f) Yes, g) Yes,
h) No, i) Yes, j) Yes
4. x̄ = Rs. 966.20, s′ = Rs. 60.98
5. p = 0.4767, S.E. (p) = 0.0166
6. µ̂ = 48.2222; σ̂ = 3.4564
B) 1. a) No, b) Yes, c) Yes, d) Yes, e) No, f) No, g) Yes,
h) No.
4. 1010 to 2115
5. [35.8318, 44.1682]
6. [978.02 quintals, 981.98 quintals]

C) 1. a) Yes, b) Yes, c) No, d) No, e) Yes, f) No, g) No,

h) Yes.

4. Yes, Z0 = – 0.447

5. No, Z0 = 4.12

6. Yes, Z0 = 1.774

15.14 TERMINAL QUESTIONS/EXERCISES


1) Distinguish between Estimation and testing of hypothesis.

2) Explain the procedure for testing a statistical hypothesis.

3) Discuss the role of normal distribution in interval estimation and also in testing
hypothesis.

4) What is an MVUE ? Examine whether a sample mean is an MVUE.

5) Discuss how far the sample proportion satisfies the desirable properties of a
good estimator.

6) How do you proceed to set confidence limits to population mean ?

7) Describe how you could set confidence limits to population proportion on the
basis of a large sample.

8) Explain how you would test for population mean.

9) Describe the different steps for testing the significance of population proportion.

10) 15 Life Insurance Policies in a sample of 250 taken out of 60,000 were found to
be insured for less than Rs. 7500. How many policies can be reasonably
expected to be insured for less than Rs. 7500 in the whole lot at 99%
confidence level.
(Ans: 1278 to 5922)

11) A sample of 250 measurements of breaking strength of cotton threads provided


a mean of 235 gm and a S.D of 32 gm. Find 95% confidence limits to the mean
breaking strength.
(Ans: 231.033 gms, 238.967 gms)

12) A manufacturer of ball-point pens claims that a certain type of pen produced by
him has a mean writing life of 550 pages with a S.D. of 35 pages. A purchaser
selects 20 such pens and the mean life is found to be 539 pages. At 5% level of
significance should the purchaser reject the manufacturer’s claim ?
(Ans: No, Z0 = –1.41)

13) In a sample of 550 guavas from a large consignment, 50 guavas are found to be
rotten. Estimate the percentage of defective guavas and assign limits within
which 95% of the rotten guavas would lie.
[Ans: (i) 9.09%; (ii) 0.0668 to 0.1150]

14) A die is thrown 59215 times out of which six appears 9500 times. Would you
consider the die to be unbiased ?
(Ans: No, Z0 = – 4.113)

15) A sample of 50 items is taken from a normal population with mean as 5 and
standard deviation as 3. The sample mean comes out to be 4.38. Can the
sample be regarded as a truly random sample?
(Ans: Yes, Z0 = –1.46)

16) A random sample of 600 apples was taken from a large consignment of 10,000
apples and 70 of them were found to be rotten. Show that the number of rotten
apples in the consignment with 95% confidence may be expected to be from
910 to 1,424.

17) The mean life of 500 bulbs, as obtained in a random sample manufactured by a
company, was found to be 900 hours with a standard deviation of 300 hours.
Test the hypothesis that the mean life is less than 950 hours. Select α = 0.05
and 0.01.
(Ans: Yes, Z0 = –3.7268)

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

15.15 FURTHER READING


The following text books may be used for more indepth study on the topics dealt
within this unit.
Levin and Rubin, 1996, Statistics for Managers, Prentice Hall of India Pvt. Ltd.,
New Delhi.
Hooda, R.P., 2000, Statistics for Business and Economics, Macmillan India Ltd.,
New Delhi.
Gupta, S.P., Statistical Methods, 1999, Sultan Chand & Sons: New Delhi.

Gupta, C.B., and Vijay Gupta, 1998, An Introduction to Statistical Methods, Vikas
Publishing House Pvt. Ltd., New Delhi.

UNIT 16 TESTS OF HYPOTHESIS – II
STRUCTURE

16.0 Objectives
16.1 Introduction
16.2 Small Samples versus Large Samples
16.3 Student’s t-distribution
16.4 Application of t-distribution to determine Confidence Interval for
Population Mean
16.5 Application of t-distribution for Testing Hypothesis Regarding Mean
16.6 t-test for Independent Samples
16.7 t-test for Dependent Samples
16.8 Let Us Sum Up
16.9 Key Words and Symbols
16.10 Answers to Self Assessment/Exercises
16.11 Terminal Questions/Exercises
16.12 Further Reading

16.0 OBJECTIVES
After studying this unit, you should be able to:
l differentiate between exact tests i.e., small sample tests and approximate tests,
i.e., large sample tests,
l be familiar with the properties and applications of t-distribution,
l find the interval estimation for mean using t-distribution,
l have an idea about the theory required for testing hypothesis using
t-distribution,
l apply t-test for independent samples, and
l apply t-test for dependent samples.

16.1 INTRODUCTION
In the previous unit, we considered different aspects of the problems of
inferences. We further noted the limitations of standard normal test or Z-test.
As discussed in Unit 15, we can not apply normal distribution for estimating
confidence intervals for population mean in case the population standard
deviation is unknown and sample size does not exceed 30, i.e., small samples.
We may further recall that as mentioned in Unit 15, we can not test hypothesis
concerning population mean when the sample is small and population standard
deviation is unspecified. In a situation like this, we use t-distribution which is
also known as student’s t-distribution. t-distribution was first applied by W.S.
Gosset who used to work in ‘Guinness Brewery’ in Dublin. The workers of
Guinness Brewery were not allowed to publish their research work. Hence
Gosset was compelled to publish his research work under the pen name
‘student’ and hence the distribution is known as student’s t-distribution or simply
student’s distribution. Before we discuss t-distribution, let us differentiate
between exact tests and approximate tests.


16.2 SMALL SAMPLES VERSUS LARGE SAMPLES
Normally a sample is considered as small if its size is 30 or less, whereas a
sample with size exceeding 30 is considered as a large sample. All the tests
under consideration may be classified into two categories namely exact tests
and approximate tests. Exact tests are those tests that are based on the exact
sampling distribution of the test statistic and no approximation is made about the
form of parent population or the sampling distribution of the test statistic. Since
exact tests are valid for any sample size and usually cost as well as labour
increases with an increase in sample size; we prefer to take small samples for
conducting exact tests. Hence, the exact tests are also known as small sample
tests. It may be noted that while testing for population mean on the basis of a
random sample from a normal distribution, we apply exact tests or small sample
tests provided the population standard deviation is known. This was
demonstrated in Unit 15.

However, there are situations when we have to compromise with an


approximate test. It has been found that if a random sample is taken from a
population characterized by a parameter θ and if T is a sufficient statistic for θ,
then :
Z = (T − θ) / S.E.(T)

as well as

Z = (T − θ) / Ŝ.E.(T)

is an approximate standard normal variate provided we have taken a sufficiently


large sample. Thus for testing H0 : θ = θ0, we consider

Z0 = (T − θ0) / S.E.(T) or Z0 = (T − θ0) / Ŝ.E.(T)

For a large sample size, Z would be approximately a standard normal deviate


and as such we can prescribe Z-test. This test is known as an approximate test
because it is not based on the exact sampling distribution of the test statistic T.
Since this test is valid only if ‘n’ is sufficiently large, it is also known as a
large sample test. In this connection, it may be pointed out that in the
previous unit, two such large sample tests have already been discussed. In the
first case while testing for population mean with an unknown population
standard deviation, we consider :

Z = √n (x̄ − µ) / s′

where s′ = √[∑ (xi − x̄)² / (n − 1)]

which is a standard normal variate, approximately, for a large n. Similarly


testing for population proportion, we used:

Z0 = √n (p − P0) / √[P0 (1 − P0)]

P0 being the specified population proportion, which again, for a large sample, is
an approximate standard normal variable.
16.3 STUDENT’S t-DISTRIBUTION
Since we cannot use Z-test, for a small sample, for population mean when the
population standard deviation is not known, we are on the look out for a new
test statistic. It is necessary to know a few terms first.

Degree of freedom (d.f. or d.o.f.): If we are asked to pick up any five


numbers, then there are no restrictions or constraints on the selection of the
five numbers and as such we are at liberty to choose any five numbers.
Statistically, this is analogous to stating that we have five degrees of freedom (5
d.f). But if we are asked to find any five numbers such that the total is 60,
then basically we are to find four numbers and not five as the last number is
automatically determined since the sum of the five numbers is provided. Hence,
we have now 4 d.f. which is the difference between the number of
observations and the number of constraints (in this case, 5–1 = 4). Similarly, if
we are to pick up five numbers such that their sum is 100 and the sum of the
squares of the numbers is 754, then we have 5–2 = 3 d.f.
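The five-numbers example can be made concrete with a couple of lines of code (the particular numbers chosen are our own):

```python
# Choose any four numbers freely; the fifth is forced by the constraint
free_choices = [11, 7, 20, 4]
fifth = 60 - sum(free_choices)     # 60 - 42 = 18
numbers = free_choices + [fifth]   # five numbers, but only 4 d.f.
```

sum(numbers) is necessarily 60: the constraint has consumed one degree of freedom.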

Chi-square distribution: Let x1, x2, x3, …, xn be ‘n’ independent standard
normal variables. Then x1² follows chi-square distribution with ‘1’ d.f. This is
denoted by

x1² ~ χ²1

Again x1² + x2² ~ χ²2 and, in general,

x1² + x2² + … + xn² = ∑ xi² ~ χ²n …… (16.1)

If we write u = ∑ xi², then the probability density function of u is given by :

f(u) = const. e^(–u/2) · u^(n/2 – 1) for 0 < u < ∞

It can be shown that for u,

Mean (µ) = n; Standard Deviation (σ) = √(2n)
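These two moments can be verified by simulation. The Python sketch below (an aside; n = 8 and the repetition count are arbitrary choices of ours) adds up n squared standard normal draws many times and compares the sample mean and S.D. of the totals with n and √(2n):

```python
import math
import random

random.seed(7)
n, trials = 8, 30000

# Simulate u = x1^2 + ... + xn^2 with each xi a standard normal variable
totals = []
for _ in range(trials):
    u = sum(random.gauss(0, 1) ** 2 for _ in range(n))
    totals.append(u)

mean_u = sum(totals) / trials
sd_u = math.sqrt(sum((u - mean_u) ** 2 for u in totals) / trials)
# Theory: mean = n = 8 and S.D. = sqrt(2n) = 4
```

With these settings the simulated mean lands close to 8 and the simulated S.D. close to 4.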

χ² distribution has a positive skewness and it is leptokurtic. Leptokurtic means
that the curve is more peaked than the normal (mesokurtic) curve. Look at the
following Fig. 16.1 which depicts a χ² distribution.

Figure 16.1: χ² distribution.

If x1, x2, x3, …, xn are ‘n’ independent variables, each following normal
distribution with mean (µ) and variance (σ²), then Xi = (xi − µ)/σ is a standard
normal variable and as such

u = ∑ Xi² = ∑ (xi − µ)² / σ² …(16.2)

follows χ² with n d.f.

The sample variance is given by :

S² = ∑ (xi − x̄)² / n

or nS² = ∑ (xi − x̄)²

or, nS² = ∑ (xi − µ)² − n (x̄ − µ)² [From 15.9]

nS²/σ² = ∑ (xi − µ)²/σ² − n (x̄ − µ)²/σ²

As ∑ (xi − µ)²/σ² ~ χ²n

and n (x̄ − µ)²/σ² = (x̄ − µ)² / (σ²/n) = [(x̄ − µ) / (σ/√n)]² ~ χ²1

since x̄ ~ N(µ, σ/√n),

Hence it follows that nS²/σ² ~ χ²n−1

Student’s t-distribution: Consider two independent variables ‘y’ and ‘u’ such
that ‘y’ follows standard normal distribution and ‘u’ follows χ²-distribution with
m d.f. Then the ratio t = y / √(u/m) follows t-distribution with m d.f. The
probability density function of t is given by :

f(t) = const. (1 + t²/m)^(−(m+1)/2) for −∞ < t < ∞ .........…(16.3)

where t = √n (x̄ − µ) / s′; const. = a constant required to make the area under
the curve equal to unity; m = n − 1, the degree of freedom of t.

It can be shown that for a t-distribution with m d.f.

Mean (µ) = 0 .........…(16.4)

Standard deviation (σ) = √[m / (m − 2)], m > 2 …(16.5)

t-distribution is symmetrical about the point t = 0 and further the distribution is
leptokurtic for m > 4. Thus, compared to a standard normal distribution, the
t-distribution is flatter at the peak, with heavier tails. Figure 16.2 shows the
probability curve of a t-distribution.

Figure 16.2: t-distribution

Since t-distribution is a function of m, we have a different t-distribution for


every distinct value of ‘m’. If ‘m’ is sufficiently large, we can approximate a
t-distribution with md.f by a standard normal distribution.

Since we have :

f(t) = const. (1 + t²/m)^(−(m+1)/2) for −∞ < t < ∞

∴ Log f = k − [(m + 1)/2] Log (1 + t²/m), where ‘k’ is a constant

= k − [(m + 1)/2] [t²/m − t⁴/2m² + …… to ∞]

[as Log (1 + x) = x − x²/2 + ........ to ∞ for –1 < x ≤ 1,
and t²/m is rather small for a large m].

Hence Log f = k − [(m + 1)/m] · t²/2 + [(m + 1)/4m²] · t⁴ − …… to ∞

Since m is very large, (m + 1)/m tends to 1 and the other terms, which contain
higher powers of ‘m’ in the denominator, tend to zero. Thus we have:

Log f = k − t²/2

or, f = e^(k − t²/2) = e^k · e^(−t²/2) = const. e^(−t²/2) for −∞ < t < ∞

which takes the form of a standard normal variable.
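This limiting step can be checked numerically. In the Python sketch below (illustrative only), log_t_kernel is the logarithm of the unnormalised t-density (1 + t²/m)^(−(m+1)/2), i.e. Log f with the constant k set to 0; for a large m it should come very close to −t²/2:

```python
import math

def log_t_kernel(t, m):
    """Log of (1 + t^2/m)^(-(m+1)/2), the unnormalised t-density."""
    return -(m + 1) / 2 * math.log(1 + t * t / m)

# For m = 10,000 the kernel is almost exactly the normal kernel exp(-t^2/2)
m = 10_000
gap = max(abs(log_t_kernel(t, m) - (-t * t / 2)) for t in (0.5, 1.0, 2.0, 3.0))
```

Here gap comes out at about a thousandth, while repeating the comparison with a small m such as 5 gives a visibly larger discrepancy — the approximation is a large-m result.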

Looking from another angle, as the mean of t-distribution is zero and the
standard deviation = √[m / (m − 2)], which tends to unity for a large m, we may
replace a t-distribution with m d.f. by a standard normal distribution. Here lies
the justification of applying Z-test for large samples.

If x1, x2, x3 …xn denote the n observations of a random sample drawn from a
normal population with mean as µ and the standard deviation as σ, then x1, x2,
x3 …xn can be described as ‘n’ independent random variables each following
normal distribution with the same mean µ and a common standard deviation σ.
If we consider the statistic:

√n (x̄ − µ) / s′

where x̄ = ∑ xi / n, the sample mean, and s′ = √[∑ (xi − x̄)² / (n − 1)] is the
standard deviation with divisor as (n − 1) instead of n, then we may write :

√n (x̄ − µ) / s′ = [√n (x̄ − µ) / σ] / (s′/σ), dividing both numerator and
denominator by σ

= [(x̄ − µ) / (σ/√n)] / √[∑ (xi − x̄)² / ((n − 1)σ²)] = y / √[u / (n − 1)]

As x̄ follows normal distribution with mean µ and standard deviation σ/√n,
therefore:

y = (x̄ − µ) / (σ/√n) is a standard normal variate

Also u = ∑ (xi − x̄)² / σ² follows χ²-distribution with (n − 1) d.f.

Hence, by definition, √n (x̄ − µ) / s′ = y / √[u / (n − 1)]

follows t-distribution with (n − 1) d.f. As such we can write :

t = √n (x̄ − µ) / s′ ~ t(n−1)

We apply t-distribution for finding confidence interval for mean as well as
testing hypothesis regarding mean. These are discussed in Sections 16.4 and
16.5 respectively.

16.4 APPLICATION OF t-DISTRIBUTION TO


DETERMINE CONFIDENCE INTERVAL FOR
POPULATION MEAN

Let us assume that we have a random sample of size ‘n’ from a normal
population with mean as µ and standard deviation as σ. We consider the case
when both µ and σ are unknown. We are interested in finding confidence
interval for population mean. In view of our discussion in Section 16.3, we
know that :

t = √n (x̄ − µ) / s′
follows t-distribution with (n–1) d.f. We may recall here that x̄ denotes the
sample mean and s′, the sample standard deviation with divisor as (n–1) and not
‘n’. We denote the upper α-point of t-distribution with (n–1) d.f as tα, (n–1).
Since t-distribution is symmetrical about t = 0, the lower α-point of t-distribution
with (n–1) d.f would be denoted by –tα, (n–1). As per our discussion in Unit
15, in order to get 100 (1–α)% confidence interval for µ, we note that :

P[−tα/2, (n−1) ≤ √n (x̄ − µ)/s′ ≤ tα/2, (n−1)] = 1 − α

or P[x̄ − (s′/√n) · tα/2, (n−1) ≤ µ ≤ x̄ + (s′/√n) · tα/2, (n−1)] = 1 − α

Thus 100 (1−α) % confidence interval to µ is :

[x̄ − (s′/√n) · tα/2, (n−1), x̄ + (s′/√n) · tα/2, (n−1)] …(16.6)

100 (1−α) % Lower Confidence Limit to µ = x̄ − (s′/√n) · tα/2, (n−1)

and 100 (1−α) % Upper Confidence Limit to µ = x̄ + (s′/√n) · tα/2, (n−1)

Selecting α = 0.05, we may note that

95% Lower Confidence Limit to µ = x̄ − (s′/√n) · t0.025, (n−1)

and 95% Upper Confidence Limit to µ = x̄ + (s′/√n) · t0.025, (n−1) …(16.7)

In a similar manner, setting α = 0.01, we get

99% Lower Confidence Limit for µ = x̄ − (s′/√n) · t0.005, (n–1)

and 99% Upper Confidence Limit for µ = x̄ + (s′/√n) · t0.005, (n–1)        …(16.8)
Values of tα, m for m = 1 to 30 and for some selected values of α are provided
in Appendix Table 5. Figures 16.3, 16.4 and 16.5 exhibit confidence intervals for
µ applying t-distribution as follows:

Fig. 16.3: 100 (1–α)% Confidence Interval for µ. 100 (1−α)% of the area under the curve
lies between µ = x̄ − (s′/√n) · tα/2, (n–1) and µ = x̄ + (s′/√n) · tα/2, (n–1), with α/2 of
the area in each tail.

Fig. 16.4: 95% Confidence Interval for µ. 95% of the area lies between
µ = x̄ − (s′/√n) · t0.025, (n–1) and µ = x̄ + (s′/√n) · t0.025, (n–1), with 2.5% of the area
in each tail.

Fig. 16.5: 99% Confidence Interval for µ. 99% of the area lies between
µ = x̄ − (s′/√n) · t0.005, (n–1) and µ = x̄ + (s′/√n) · t0.005, (n–1), with 0.5% of the area
in each tail.

Let us now take up some illustrations to understand this concept.
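Formula (16.6) is mechanical once x̄, s′, n and the table value tα/2, (n–1) are known. As a minimal Python sketch (our own illustration, not part of the unit: the function name and the example figures are invented, and the critical value 2.064 = t0.025, 24 is read from a t-table rather than computed):

```python
import math

def t_confidence_interval(x_bar, s_prime, n, t_crit):
    """100(1-alpha)% CI for the mean: x_bar -/+ (s'/sqrt(n)) * t_crit.

    s_prime is the sample S.D. with divisor (n - 1); t_crit is the table
    value t_{alpha/2, (n-1)} looked up in Appendix Table-5.
    """
    margin = (s_prime / math.sqrt(n)) * t_crit
    return (x_bar - margin, x_bar + margin)

# Made-up figures: x_bar = 50, s' = 10, n = 25, t_{0.025, 24} = 2.064
low, high = t_confidence_interval(50, 10, 25, 2.064)
```

For a different confidence level only the table value changes; the rest of the formula is identical.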

Illustration 1

Following are the lengths (in ft.) of 7 iron bars as obtained in a sample out of
100 such bars taken from SUR IRON FACTORY.

4.1, 3.98, 4.01, 3.95, 3.93, 4.12, 3.91

We have to find the 95% confidence interval for the mean length of iron bars as
produced by SUR IRON FACTORY.

Solution: Let x denote the length of iron bars. We assume that x is normally
distributed with unknown mean µ and unknown standard deviation σ. Then

95% Lower Confidence Limit for µ = x̄ − (s′/√n) · √( (N−n)/(N−1) ) · t0.025, 6

and 95% Upper Confidence Limit for µ = x̄ + (s′/√n) · √( (N−n)/(N−1) ) · t0.025, 6

where x̄ = Σxᵢ/n; s′ = √( Σ(xᵢ − x̄)²/(n−1) ); n = sample size = 7; N = population size = 100;

√( (N−n)/(N−1) ) = finite population correction (fpc);

and t0.025, 6 = upper 2.5% point of t-distribution with 6 d.f.
             = 2.447 [from Appendix Table-5 given at the end of this block, α = 0.025, m = 6]


Table 16.1: Computation of Sample Mean and S.D.

        xi          xi²
        4.10        16.8100
        3.98        15.8404
        4.01        16.0801
        3.95        15.6025
        3.93        15.4449
        4.12        16.9744
        3.91        15.2881
Total   28.00       112.0404

Thus, we have: x̄ = 28/7 = 4

Σ(xᵢ − x̄)² = Σxᵢ² − n x̄² = 112.0404 − 7 × 4² = 0.0404

s′ = √( Σ(xᵢ − x̄)²/(n−1) ) = √( 0.0404/6 ) = 0.082057

f.p.c. = √( (100 − 7)/(100 − 1) ) = 0.969223

Hence 95% Lower Confidence Limit for µ = 4 − (0.082057/√7) × 0.969223 × 2.447
                                       = 4 − 0.073557 = 3.926443

Similarly, 95% Upper Confidence Limit for µ = 4 + 0.073557 = 4.073557

So the 95% Confidence Interval for the mean length of iron bars = [3.93 ft, 4.07 ft].
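The arithmetic of Illustration 1 can be checked mechanically. The sketch below (variable names are ours) recomputes x̄, s′, the finite population correction and the 95% limits from the raw lengths, using t0.025, 6 = 2.447 from the table; carried through at full precision, the interval comes to roughly [3.93 ft, 4.07 ft].

```python
import math

lengths = [4.1, 3.98, 4.01, 3.95, 3.93, 4.12, 3.91]   # sample of n = 7 bars
N = 100                                               # population size
t_crit = 2.447                                        # t_{0.025, 6} from the table

n = len(lengths)
x_bar = sum(lengths) / n
# s': sample S.D. with divisor (n - 1)
s_prime = math.sqrt(sum((x - x_bar) ** 2 for x in lengths) / (n - 1))
# finite population correction sqrt((N - n)/(N - 1))
fpc = math.sqrt((N - n) / (N - 1))

margin = (s_prime / math.sqrt(n)) * fpc * t_crit
low, high = x_bar - margin, x_bar + margin
```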

Illustration 2

Find 90% confidence interval to µ given sample mean and sample S.D as 20.24
and 5.23 respectively, as computed on the basis of a sample of 11 observations
from a population containing 1000 units.

Solution: As the sample size (n) = 11 is small, we apply t-distribution. Further, we
may ignore the f.p.c. as the population is rather large. Thus the 90% confidence
interval for µ is given by:

        [ x̄ − (s′/√n) · t0.05, 10,  x̄ + (s′/√n) · t0.05, 10 ]

As S = √( Σ(xᵢ − x̄)²/n ) is the sample standard deviation (S.D.),

        nS² = Σ(xᵢ − x̄)²

Hence (s′)² = Σ(xᵢ − x̄)²/(n−1) = nS²/(n−1)    [since Σ(xᵢ − x̄)² = nS²]

or s′ = √( n/(n−1) ) · S = √(11/10) × 5.23 = 5.4853

Consulting Appendix Table-5, given at the end of this block, we find t0.05, 10 = 1.812.

Thus the 90% confidence interval for µ is given by:

        [ 20.24 − (5.4853/√11) × 1.812,  20.24 + (5.4853/√11) × 1.812 ] = [17.2431, 23.2369]
 
Illustration 3

The study hours per week of 17 teachers, selected at random from different
parts of West Bengal, were found to be:

6.6, 7.2, 6.8, 9.2, 6.9, 6.2, 6.7, 7.2, 9.7, 10.4, 7.4, 8.3, 7.0, 6.8, 7.6, 8.1, 7.8

Suppose, we are interested in computing 95% and 99% confidence intervals for
the average hours of study per week per teacher in the state of West Bengal.

Solution: If µ denotes the average hours of study per week per teacher in
West Bengal, then as discussed earlier,

95% confidence interval for µ = [ x̄ − (s′/√n) · t0.025, (n–1),  x̄ + (s′/√n) · t0.025, (n–1) ]

and 99% confidence interval for µ = [ x̄ − (s′/√n) · t0.005, (n–1),  x̄ + (s′/√n) · t0.005, (n–1) ]

Table 16.2: Computation of Sample Mean and Sample S.D.

        Study Hours (xi)    xi²
        6.6                 43.56
        7.2                 51.84
        6.8                 46.24
        9.2                 84.64
        6.9                 47.61
        6.2                 38.44
        6.7                 44.89
        7.2                 51.84
        9.7                 94.09
        10.4                108.16
        7.4                 54.76
        8.3                 68.89
        7.0                 49.00
        6.8                 46.24
        7.6                 57.76
        8.1                 65.61
        7.8                 60.84
Total   129.9               1014.41

We have n = 17

x̄ = Σxᵢ/n = 129.9/17 = 7.64

s′ = √( Σ(xᵢ − x̄)²/(n−1) ) = √( (Σxᵢ² − n x̄²)/(n−1) )

   = √( (1014.41 − 17 × (7.64)²)/(17 − 1) )

   = √( (1014.41 − 992.28)/16 ) = 1.1761

From Appendix Table-5, given at the end of this block, t0.025, 16 = 2.120; t0.005, 16 = 2.921.

Thus the 95% confidence interval for µ

= [ (7.64 − (1.1761/√17) × 2.120) hours,  (7.64 + (1.1761/√17) × 2.120) hours ]

= [7.0353 hours, 8.2447 hours]

Similarly, the 99% confidence interval for µ

= [ (7.64 − (1.1761/√17) × 2.921) hours,  (7.64 + (1.1761/√17) × 2.921) hours ]

= [6.8068 hours, 8.4732 hours]

Illustration 4

In a sample of 26 items, the 90% confidence limits for the population mean are found
to be 46.584 and 53.416 respectively. Find the sample mean and sample S.D.

Solution: As explained earlier, the 90% confidence interval for µ, the population mean, is:

        [ x̄ − (s′/√n) · t0.05, (n–1),  x̄ + (s′/√n) · t0.05, (n–1) ]

In this case, n = 26. From Appendix Table-5, given at the end of this block, t0.05, 25 = 1.708.

Hence we have x̄ − (s′/√26) × 1.708 = 46.584

or x̄ − 0.33497 s′ = 46.584        …(1)

and x̄ + (s′/√26) × 1.708 = 53.416

or x̄ + 0.33497 s′ = 53.416        …(2)

On adding equations (1) and (2) we get

2x̄ = 100  or  x̄ = 50

Replacing x̄ by 50 in equation (1), we have

50 − 0.33497 s′ = 46.584

or s′ = 3.416/0.33497 = 10.19793

Hence S = √( (n−1)/n ) · s′    [from Illustration 2]
        = 0.98058 × 10.19793 = 9.9999 ≈ 10

Thus the sample mean is 50 units and the sample S.D. is approximately 10 units.
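Illustration 4 works backwards from the two limits: adding them cancels the margin and isolates x̄, while half their difference is the margin itself. A hedged sketch of the same steps (function and variable names are ours; t0.05, 25 = 1.708 from the table):

```python
import math

def mean_and_s_prime_from_limits(lower, upper, n, t_crit):
    """Recover x_bar and s' from limits x_bar -/+ (s'/sqrt(n)) * t_crit."""
    x_bar = (lower + upper) / 2        # adding the two limits cancels the margin
    margin = (upper - lower) / 2       # margin = (s'/sqrt(n)) * t_crit
    s_prime = margin * math.sqrt(n) / t_crit
    return x_bar, s_prime

x_bar, s_prime = mean_and_s_prime_from_limits(46.584, 53.416, 26, 1.708)
# divisor-n S.D. follows from the conversion of Illustration 2
S = math.sqrt((26 - 1) / 26) * s_prime
```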

Self Assessment Exercise A


1) State, with reasons, whether the following statements are true or false.

a) t-distribution can be used for samples of any size.


b) Exact tests can be applied for samples of any size whereas approximate tests
can be applied for samples of large size only.
c) The tests meant for large samples can not be used for small samples.
d) For applying t-distribution, assumption of normality may not be necessary.
e) When sample size exceeds thirty, we can replace t-test by z-test.
f) For a population with unknown S.D, confidence interval for population mean
would vary in accordance with sample size.
g) For a population with known S.D, confidence interval remains unchanged for
varying sample size.
h) Large sample tests can be performed without the assumption of normality.
i) Both the standard normal distribution and t-distribution have the same mean
and same variance.
j) If x1, x2, x3 … xn are n sample observations from N (µ, σ²), then
   Σ(xᵢ − x̄)²/σ² follows the χ²-distribution with n d.f.

k) Z test has the widest range of applicability among all the commonly used
tests.

2) Differentiate between exact tests and approximate tests.


.........................................................................................................................
....................................................................................................................

3) Discuss the points of similarity and dissimilarity between a standard normal
distribution and a t-distribution.
.................................................................................................................
.................................................................................................................
.................................................................................................................

4) A random sample of size 10 drawn from a normal population yields sample mean
as 85 and sample S.D as 8.7. Compute 90% and 95% confidence intervals to
population mean.
..........................................................................................................
..........................................................................................................
..........................................................................................................

5) Find 99% confidence limits for ‘µ’ given that a sample of 19 units drawn from a
population of 98 units provides sample mean as 15.627 and sample S.D as 2.348.

..........................................................................................................
..........................................................................................................
..........................................................................................................

6) A sample of size 10 drawn from a normal population produces the following results.
Σxi = 92 and Σxi2 = 889
Obtain 95% confidence limits to µ.
..........................................................................................................
..........................................................................................................
..........................................................................................................

16.5 APPLICATION OF t-DISTRIBUTION FOR TESTING HYPOTHESIS REGARDING MEAN
Let us consider a situation where a small random sample is taken from a
normal population with mean as µ and standard deviation as σ, both of which
are unknown. Then as discussed in Unit-15, we are interested in testing :

H0 : µ = µ 0
against H : µ ≠ µ0 i.e., the population mean is anything but µ0.
or H1 : µ > µ0 i.e., the population mean is greater than µ0.
or H2 : µ < µ0 i.e., the population mean is less than µ0.

As we have noted in Section 16.1, the proper test to apply in this situation is
undoubtedly the t-test. If we denote the upper α-point and lower α-point of the
t-distribution with m d.f. by tα, m and t1−α, m = –tα, m (as t-distribution is
symmetrical about 0), then for testing H0, based on the distribution of t, it may
be possible to find 4 values of t such that:

P (t0 ≥ tα/2, m) = α/2 ……(16.9)


P (t0 ≤ – tα/2, m) = α/2 ……(16.10)
P (t0 ≥ tα, m) = α ……(16.11)
and P (t0 ≤ – tα, m) = α ……(16.12)
where, t0 is the value of t under H0 : µ= µ0
combining (16.9) and (16.10), we have
P (t0 ≥ tα/2, m) + P (t0 ≤ –tα/2, m) = α ……(16.13)

Thus, in order to test H0 : µ= µ0 against both sided or two-sided alternative


H : µ ≠ µ0, selecting a low value of the probability of type-I error i.e., α, say
α = 0.05 or α = 0.01, we find from (16.13) that the probability that t0 is greater
than tα/2, m or t0 is less than –tα/2, m is likely to be very low.

Hence, on the basis of a small random sample drawn from the population, if it
is found that t0 is greater than tα/2, m or t0 is less than –tα/2, m i.e.,
|t0| > tα/2, m, then we may suggest that there is enough evidence to suggest that
H0 is untrue and H is true. Then we reject H0 and accept H. The critical
region for this both sided alternative is provided by :

ω : t0 ≥ tα/2, m and ω : t0 ≤ – tα/2, m

⇒ ω : |t0| ≥ tα/2, m …(16.14)

This is shown in the following Figure 16.6. Critical region lies on both the tails.

Fig. 16.6: Critical Region for Both-tailed Test. The acceptance region covers the middle
100 (1−α)% of the area; the critical regions ω : t0 ≤ –tα/2, m and ω : t0 ≥ tα/2, m each
carry α/2 of the area in the tails.

Secondly, in order to test the null hypothesis against the right-sided alternative
i.e., to test H0 against H1 : µ > µ0, from (16.11) we note that, as before, if we
choose a small value of α, then the probability that the observed value of t
would exceed the critical value tα, m is very low.
questions in this case, about the validity of H0 if the value of t, as obtained on
the basis of a small random sample, really exceeds tα, m. We then reject H0
and accept H1. The critical region

ω : t0 ≥ tα, m ………(16.15)

lies on the right-tail of the curve and the test as such is called right-tailed test.
This is shown in Figure 16.7.

Fig. 16.7: Critical Region for Right-tailed Test. The acceptance region covers 100 (1−α)%
of the area; the critical region ω : t0 ≥ tα, m carries α of the area in the right tail.
Lastly, when we proceed to test H0 against the left-sided alternative
H2 : µ < µ0, we note that (16.12) suggests that if α is small, then the
probability that t0 would be less than the critical value –tα, m is very small. So
if the value of t0 as computed, on the basis of a small sample, is found to be
less than –tα, m, we would doubt the validity of H0 and accept H2. The critical
region

ω : t0 ≤ – tα, m …(16.16)

would lie on the left-tail and the test would be left-tailed test. This is depicted
in Fig. 16.8.

Fig. 16.8: Critical Region for Left-tailed Test. The acceptance region covers 100 (1−α)%
of the area; the critical region ω : t0 ≤ –tα, m carries α of the area in the left tail.

16.6 t-TEST FOR INDEPENDENT SAMPLES


In order to apply t-test in a given situation, we need to verify the following
points:

1) Whether the sample drawn is a random sample. A positive answer would
confirm that the sample observations are independent.

2) Whether the sample is taken from a normal population. An affirmative answer
is a pre-requisite for applying t-test.

3) Whether the population S.D. is unknown. Here a negative answer would
suggest Z-test, provided we get a positive answer to the first two questions. A
‘yes’ may mean t-test.

4) Whether the sample drawn is a small one. Again, if the answer is ‘no’, i.e., n >
30, we would be satisfied with Z-test. However, if n ≤ 30 and the first three
conditions are fulfilled, we should recommend t-test.

Putting this in a nutshell, t-test is suggested for population mean if a small
random sample is drawn from a normal population with an unknown standard
deviation. Under the above conditions, in order to test a null hypothesis, we use
the test statistic:

        t = √n (x̄ − µ)/s′

where n = sample size; x̄ = sample mean; and s′ = sample S.D. with divisor
(n−1). The test statistic follows t-distribution with (n–1) d.f.

In order to test H0 : µ = µ0 against the both-sided alternative H : µ ≠ µ0 we
compute:

        t0 = √n (x̄ − µ0)/s′

If t0 falls in the critical region defined by:

        ω : |t0| ≥ tα/2, (n–1)

tα, m being the upper α-point of t-distribution with m d.f., then we reject H0. In
other words, H0 is rejected and H : µ ≠ µ0 is accepted if the absolute value of t,
as computed from the sample, exceeds or is equal to the critical value tα/2, (n–1).

If we select α, the level of significance, as 0.05, then H0 is rejected at 5%
level of significance if:

        |t0| ≥ t0.025, (n–1)

On the other hand, letting α = 0.01, we reject H0 at 1% level of significance if:

        |t0| ≥ t0.005, (n–1)

Figure 16.9 shows critical region at 5% level of significance while Figure 16.10
shows critical region at 1% level of significance.

Fig. 16.9: Critical Region for Both-tailed Test at 5% Level of Significance. The acceptance
region covers the middle 95% of the area; the critical regions ω : t0 ≤ –t0.025, (n–1) and
ω : t0 ≥ t0.025, (n–1) each carry 2.5% of the area.

Fig. 16.10: Critical Region for Both-tailed Test at 1% Level of Significance. The acceptance
region covers the middle 99% of the area; the critical regions ω : t0 ≤ –t0.005, (n–1) and
ω : t0 ≥ t0.005, (n–1) each carry 0.5% of the area.

Similarly, for testing H0 against the right-sided alternative H1 : µ > µ0, the
critical region is given by:

        ω : t0 ≥ tα, (n–1)

The respective critical regions at 5% and 1% level of significance are given by:

        ω : t0 ≥ t0.05, (n–1)

and     ω : t0 ≥ t0.01, (n–1)

The following Figures 16.11 and 16.12 show these two critical regions.

Fig. 16.11: Critical Region for Right-tailed Test at 5% Level of Significance. The acceptance
region covers 95% of the area; the critical region ω : t0 ≥ t0.05, (n–1) carries 5% of the
area in the right tail.

Fig. 16.12: Critical Region for Right-tailed Test at 1% Level of Significance. The acceptance
region covers 99% of the area; the critical region ω : t0 ≥ t0.01, (n–1) carries 1% of the
area in the right tail.

Lastly, when we test H0 against the left-sided alternative H2 : µ < µ0, the critical
region would be:

ω : t0 ≤ –tα , (n–1)

In particular, the critical region at 5% level of significance would be given by:

ω : t0 ≤ –t0.05 , (n–1)

and the critical region at 1% level of significance would be:

ω : t0 ≤ –t0.01 , (n–1)

These are depicted in the following Figure 16.13 and Figure 16.14 respectively.

Fig. 16.13: Critical Region for Left-tailed Test at 5% Level of Significance. The acceptance
region covers 95% of the area; the critical region ω : t0 ≤ –t0.05, (n–1) carries 5% of the
area in the left tail.

Fig. 16.14: Critical Region for Left-tailed Test at 1% Level of Significance. The acceptance
region covers 99% of the area; the critical region ω : t0 ≤ –t0.01, (n–1) carries 1% of the
area in the left tail.
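The critical regions above all reduce to one small decision rule once t0 and the appropriate table value are known. A sketch (our own helper, not from the unit; critical values must still come from Appendix Table-5):

```python
def reject_h0(t0, t_crit, tail):
    """Apply the critical regions for the one-sample t-test.

    tail = 'both'  : reject when |t0| >= t_{alpha/2, n-1}
    tail = 'right' : reject when  t0  >=  t_{alpha, n-1}
    tail = 'left'  : reject when  t0  <= -t_{alpha, n-1}
    (t_crit is the appropriate table value in each case.)
    """
    if tail == 'both':
        return abs(t0) >= t_crit
    if tail == 'right':
        return t0 >= t_crit
    if tail == 'left':
        return t0 <= -t_crit
    raise ValueError("tail must be 'both', 'right' or 'left'")

# e.g. a two-tailed test at 5% with n = 13 uses t_{0.025, 12} = 2.179
decision = reject_h0(-2.5, 2.179, 'both')
```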

Let us take up some illustrations to understand the application of t-test for
independent samples.

Illustration 5

An automatic machine was manufactured to pack 10 kilograms of oil. A
random sample of 13 tins was taken to test the machine. Following were the
weights in kilograms of the 13 tins:

9.7, 9.6, 10.4, 10.3, 9.8, 10.2, 10.4, 9.5, 10.6, 10.8, 9.1, 9.4, 10.7

Assuming normal distribution of the weights of the packed tins, examine
whether the machine worked in accordance with the specifications.

Solution: Let x denote the weight of the packed tins of oil. Since

1) x is assumed to be normally distributed,
2) the population S.D. of the weight of the packed tins is unknown, and
3) the sample size n = 13 is small,

we apply t-test.

Thus, we have to test H0 : µ = 10 against
H1 : µ < 10 (i.e., the machine packed less than 10 kg).

The test statistic is, as discussed in Section 16.5,

        t0 = √n (x̄ − 10)/s′,  where x̄ = Σxᵢ/n and

        s′ = √( Σ(xᵢ − x̄)²/(n−1) ) = √( (Σxᵢ² − n x̄²)/(n−1) )

The critical region for this left-sided alternative is provided by:

        ω : t0 ≤ –tα, (n–1)

Choosing α = 0.05, a look at the t-table (Appendix Table-5) suggests that
t0.05, 12 = 1.782. Thus the critical region is:

        ω : t0 ≤ –1.782
From the given data, we have

Σxᵢ = 130.5 and Σxᵢ² = 1313.65

Thus, x̄ = 130.5/13 = 10.038

        s′ = √( (1313.65 − 13 × (10.038)²)/(13 − 1) ) = √( (1313.65 − 1309.8987)/12 ) = 0.5591

Hence t0 = √13 (10.038 − 10)/0.5591 = 0.245,

which is greater than –1.782.

As t0 does not fall in the critical region ω, we accept H0. So, on the basis of the
given data as obtained from the sample observations, we conclude that the
machine worked in accordance with the given specifications.
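Illustration 5 can be reproduced from the 13 weights directly; note that Python's `statistics.stdev` already uses the divisor (n−1), so it returns s′. Keeping full precision for x̄ gives t0 ≈ 0.252 rather than the 0.245 obtained with x̄ rounded to 10.038, and the conclusion is unchanged. A sketch (variable names are ours):

```python
import math
import statistics

weights = [9.7, 9.6, 10.4, 10.3, 9.8, 10.2, 10.4, 9.5, 10.6, 10.8, 9.1, 9.4, 10.7]
mu0 = 10.0                                 # hypothesised mean under H0

n = len(weights)
x_bar = statistics.mean(weights)
s_prime = statistics.stdev(weights)        # divisor (n - 1), i.e. s'
t0 = math.sqrt(n) * (x_bar - mu0) / s_prime

# left-tailed test at 5%: reject H0 when t0 <= -t_{0.05, 12} = -1.782
reject = t0 <= -1.782
```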


Illustration 6

A company has been producing steel tubes of inner diameter of 4 cms. A
sample of 15 tubes gives the average inner diameter as 3.96 cms with a S.D.
of 0.032 cms. Is the sample mean significantly different from the population
mean?

Solution: Let x denote the inner diameter of steel tubes as produced by the
company. We are interested in testing

H0 : µ = 4 against
H : µ ≠ 4

Assuming that x follows normal distribution, we note that the sample size is 15
(< 30) and the population S.D. is unknown. All these factors justify the
application of t-distribution. Thus we compute our test statistic as:

        t = √n (x̄ − 4)/s′

As given, x̄ = 3.96 and S = 0.032

∴ s′ = √( n/(n−1) ) · S = √(15/14) × 0.032 = 0.033

So, t0 = √15 (3.96 − 4)/0.033 = −4.69

Hence |t0| = 4.69

The critical region for the both-sided test is:

        ω : |t0| ≥ tα/2, (n–1)

Selecting the level of significance as 1%, from the t-table (Appendix Table-5),
we get t0.01/2, (15–1) = t0.005, 14 = 2.977

Thus, ω : |t0| ≥ 2.977

Since the computed value |t0| = 4.69 falls in ω, we reject H0. Hence
the sample mean is significantly different from the population mean.

Illustration 7

The mean weekly sales of detergent powder in the department stores of the
city of Delhi as produced by a company was 2,025 kg. The company carried
out a big advertising campaign to increase the sales of their detergent powder.
After the advertising campaign, the following figures were obtained from 20
departmental stores selected at random from all over the city (weight in kgs.):

2000  2023  2056  2048  2010  2025  2100
2563  2289  2005  2082  2056  2049  2020
2310  2206  2316  2186  2243  2013

Based on the above data, was the advertising successful?

Solution: Let us assume that x represents the weekly sales (in kg) of
detergent powder as produced by the company. If µ denotes the average (i.e.,
mean) weekly sales in the city of Delhi, then we would like to test:

H0 : µ = 2025, i.e., there is no change due to the advertisement, against
H1 : µ > 2025, i.e., there is an increase in sales due to the advertisement.

As explained in Illustrations 5 and 6, we compute

        t0 = √n (x̄ − 2025)/s′

and the critical region for the right-sided alternative is given by:

        ω : t0 ≥ t0.05, (n–1)
or      ω : t0 ≥ 1.729

[By selecting α = 0.05 and consulting Appendix Table-5, given at the end of
this block, we find that for m = 20 − 1 = 19 and for α = 0.05, the value of t is
1.729.]

Table 16.3: Computation of mean and S.D.

xi ui = xi – 2000 u i2
2000 0 0
2023 23 529
2056 56 3136
2048 48 2304
2010 10 100
2025 25 625
2100 100 10000
2563 563 316969
2289 289 83521
2005 5 25
2082 82 6724
2056 56 3136
2049 49 2401
2020 20 400
2310 310 96100
2206 206 42436
2316 316 99856
2186 186 34596
2243 243 59049
2013 13 169
Total 2600 762076

 2600 
118 From the above table, we have x =  2000 +  kg = 2130 kg
 20 
Geektonight Notes
Tests of Hypothesis-II
2
∑ ui − nu 2
s =
|

n −1

762076 − 20 × (130 ) 2
= = 149 .3981 kg
19

n ( x − 2025)
As t 0 =
s|

20 ( 2130 − 2025 )
∴ t0 = = 3.143
149 .3981
A glance at the critical region suggests that we reject H0 and accept H1. On
the basis of the given sample we, therefore, conclude that the advertising
campaign was successful in increasing the sales of the detergent powder
produced by the company.
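Table 16.3 uses a change of origin, ui = xi − 2000: shifting every observation by a constant leaves the standard deviation unchanged and shifts the mean by exactly that constant, which keeps the squared terms small. A sketch of the same computation (variable names are ours):

```python
import math

sales = [2000, 2023, 2056, 2048, 2010, 2025, 2100, 2563, 2289, 2005,
         2082, 2056, 2049, 2020, 2310, 2206, 2316, 2186, 2243, 2013]
origin = 2000
mu0 = 2025

u = [x - origin for x in sales]       # coded values u_i = x_i - 2000
n = len(u)
u_bar = sum(u) / n
x_bar = origin + u_bar                # the shift is added back at the end
# s' is unaffected by the change of origin
s_prime = math.sqrt((sum(v * v for v in u) - n * u_bar ** 2) / (n - 1))
t0 = math.sqrt(n) * (x_bar - mu0) / s_prime
```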

Illustration 8

A random sample of 26 items taken from a normal population has the mean as
145.8 and S.D. as 15.62. At 1% level of significance, test the hypothesis that
the population mean is 150.

Solution: Here we would like to test H0 : µ = 150, i.e., the population mean is
150, against H : µ ≠ 150, i.e., the population mean is not 150. As the necessary
conditions for applying t-test are fulfilled, we compute

        t0 = √n (x̄ − 150)/s′

and the critical region at 1% level of significance is:

        ω : |t0| ≥ 2.787

In this case, as given,

x̄ = 145.8; S = 15.62; and n = 26

∴ s′ = √( n/(n−1) ) · S = √(26/25) × 15.62 = 15.9293

So, t0 = √26 (145.8 − 150)/15.9293 = −1.344

thereby |t0| = 1.344

Looking at the critical region, we find that H0 is accepted. So, on the basis of the
given data, we infer that there is no evidence that the population mean differs from 150.

16.7 t-TEST FOR DEPENDENT SAMPLES


One of the important applications of t-distribution is t-test for dependent samples
or paired t-test. Let us consider a situation where a giant multinational company
is claiming that their research wing has developed a new type of restorative
that is going to increase the bodyweights of the babies suffering from
malnutrition and this is a revolution in the world of medicine.

We may note that there are other factors such as age, height, food habits, living
conditions etc., which could be attributed to a change in body weight. In case
we apply the drug to the same group of babies, these factors would be
constant and if there is a significant increase in bodyweights, it would be due to
the treatment i.e. the application of restorative except, may be, the chance
factor. Thus, in order to verify the efficacy of the restorative, the best course
of action would be to take a random sample of babies suffering from malnutrition,
measure their bodyweights before applying the restorative, and take their
bodyweights for a second time, say a couple of months, after applying the
restorative. The appropriate test to apply, in this case, is a paired t-test.

Similarly one may apply paired t-test to verify the necessity of a costly
management training for its sales personnel by recording the sales of the
selected trainees before and after the management training or the validity of
special coaching for a group of educationally backward students by verifying
their progress before and after the coaching programme or the increase in
productivity due to the application of a particular kind of fertiliser by recording
the productivity of a crop before and after applying this particular fertiliser and
so on.

Let us now discuss the theoretical background for the application of paired t-
test. In our earlier discussions, we were emphatic about the observations being
independent of each other. Now we consider a pair of random variables which
are dependent or correlated. Earlier, we considered normal distribution, to be
more precise, univariate normal distribution. Similarly, we may think of bivariate
normal distribution. Let x and y be two random variables following bivariate
normal distribution with mean µ1 and µ2 respectively, standard deviations σ1 and
σ2 respectively and a correlation co-efficient (ρ).

Thus ‘x’ and ‘y’ may be the bodyweight of the babies before and after the
application of the restorative, sales before and after the training programme,
marks of the weak students before and after the coaching, yield of a crop
before and after applying the fertiliser and so on.

Let us consider ‘n’ pairs of observations on ‘x‘ and ‘y’ and denote the ‘n’
pairs by (xi, yi) for i = 1, 2, 3, …, n.

Our null hypothesis is H0 : µ1 = µ2, i.e., the restorative has no impact on the
weight of the babies, or the training has no importance, or the coaching has not
improved the standard of the students, or the fertiliser has not increased the
productivity significantly, and so on. If we introduce a new random variable
u = x − y, then we may note that:

E(u) or µu = E(x) − E(y) = µ1 − µ2 = 0, under H0

Further, from the properties of normal distribution, it follows that ‘u’, being a
linear function of two normal variables, also follows normal distribution with
mean (µu) = µ1 − µ2 and variance (σu²) = σ1² + σ2² − 2ρσ1σ2

Thus testing H0 : µ1 = µ2 is analogous to testing for population mean when the
population standard deviation is unknown. In view of our discussion in Section
16.5, if the sample size is small, it is obvious that the appropriate test statistic
would be:

        t = √n (ū − µu)/s′u        …(16.17)

where n = sample size; ū = Σuᵢ/n; u = x − y;

        s′u = √( Σ(uᵢ − ū)²/(n−1) ) = √( (Σuᵢ² − n ū²)/(n−1) )

As before, under H0, t0 = √n ū/s′u follows t-distribution with (n–1) d.f.
Thus for testing H0 : µu = 0 against H : µu ≠ 0,
the critical region is provided by:

        ω : |t0| ≥ tα/2, (n–1)

For testing H0 against H1 : µ1 > µ2, i.e., H1 : µu > 0,
we consider the critical region

        ω : t0 ≥ tα, (n–1)

When the sample size exceeds 30, the assumption of normality for u may be
avoided and the test statistic √n ū/s′u can be taken as a standard normal variable,
and accordingly we may recommend Z-test.
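The statistic in (16.17) operates only on the differences ui = xi − yi, so a paired test needs nothing beyond the one-sample machinery applied to u. A minimal sketch with made-up before/after figures (both the data and the function name are hypothetical; the critical value must still be read from Appendix Table-5):

```python
import math
import statistics

def paired_t0(before, after):
    """t0 = sqrt(n) * u_bar / s'_u for differences u_i = before_i - after_i."""
    u = [b - a for b, a in zip(before, after)]
    n = len(u)
    u_bar = statistics.mean(u)
    s_prime_u = statistics.stdev(u)    # divisor (n - 1)
    return math.sqrt(n) * u_bar / s_prime_u

# hypothetical data: scores before and after some treatment
before = [12.0, 15.0, 11.0, 14.0, 13.0]
after = [14.0, 15.0, 13.0, 15.0, 16.0]
t0 = paired_t0(before, after)
```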

With the help of the above discussion, we take up some illustrations to
understand the application of t-test for dependent samples.

Illustration 9

A drug is given to 8 patients and the increments in their blood-pressure were
recorded to be:

4, 5, –1, 3, –2, 4, –7 and 0

Is it reasonable to believe that the drug has no effect on the change of blood-pressure?

Solution: Let u denote the increment in blood-pressure of a patient, i.e., the
blood-pressure after applying the drug minus the blood-pressure before applying
it, and let µu denote the mean increment in the population. The problem is thus
reduced to testing:

H0 : µu = 0, i.e., the drug has no effect on blood-pressure,
against H : µu ≠ 0, i.e., the drug changes blood-pressure.

Under H0, t0 = √n ū/s′u follows t-distribution with (n–1) d.f.

Thus the critical region for this both-sided test would be

        ω : |t0| ≥ tα/2, (n–1)
or      ω : |t0| ≥ 2.365

by taking α = 0.05, tα/2, (n–1) = t0.025, 7 = 2.365 from Appendix Table-5.

From the given data, we find that n = 8, Σuᵢ = 6, Σuᵢ² = 120

Hence ū = Σuᵢ/n = 6/8 = 0.75

        s′u = √( (Σuᵢ² − n ū²)/(n−1) ) = √( (120 − 8 × (0.75)²)/7 ) = 4.062

        t0 = √n ū/s′u = √8 × 0.75/4.062 = 0.522

Looking at the critical region, we find that H0 is accepted. Thus, on the basis
of the given data, it is reasonable to believe that the drug has no effect on
blood-pressure.

Illustration 10

A group of students was selected at random from the set of weak students in
statistics. They were given intensive coaching for three months. The marks in
statistics before and after the coaching are shown below.

Student’s serial     Marks in statistics
Number               Before coaching      After coaching

1 19 32
2 38 36
3 28 30
4 32 30
5 35 40
6 10 25
7 15 30
8 29 20
9 16 15

Could the coaching be considered as a success?
Test at 5% level of significance.

Solution: Let x and y denote the marks in statistics before and after the
coaching respectively. If the corresponding mean marks in the population be µ1
and µ2 respectively, then we are to test:

H0 : µ1 = µ2, i.e., the coaching has not improved the standard of the students,
against the alternative hypothesis H1 : µ1 < µ2, i.e., the coaching has improved it.
We compute:

        t0 = √n ū/s′u, which follows t-distribution with (n–1) d.f. under H0,

where n = number of students selected = 9;

u = x − y = difference in statistics marks;

        s′u = √( (Σuᵢ² − n(ū)²)/(n−1) )

Since α = 0.05 and n = 9, consulting Appendix Table-5, we find that t0.05, 8 = 1.86.
Thus the left-sided critical region is provided by ω : t0 ≤ –1.86.

Table 16.4: Computation of Sample Mean and Sample S.D.

Marks in Statistics
Serial No.      Before          After           ui = (xi–yi)    ui²
of student      coaching (xi)   coaching (yi)
1 19 32 –13 169
2 38 36 2 4
3 28 30 –2 4
4 32 30 2 4
5 35 40 –5 25
6 10 25 –15 225
7 15 30 –15 225
8 29 20 9 81
9 16 15 1 1
Total – – –36 738

Thus ū = Σuᵢ/n = −36/9 = −4

        s′u = √( (738 − 9 × (−4)²)/8 ) = 8.6168

        ∴ t0 = √9 × (−4)/8.6168 = −1.393

A glance at the critical region suggests that we accept H0. On the basis of the
given data, therefore, we infer that the coaching has not significantly improved
the standard of the students.

Illustration 11

Wilson company, known for producing fertilizers, recruited 15 candidates. After
recording their sales, they were asked to attend a sales management course.
Their sales, after attending the course, were recorded. The data are presented
below.


Serial number       Sales (’000 Rs.)
of trainee          Before the course    After the course

1 15 16
2 16 17
3 13 19
4 20 18
5 18 22.5
6 17 18.3
7 16 19.2
8 19 18
9 20 20
10 15.5 16
11 16.2 17
12 15.8 17
13 18.7 20
14 18.3 18
15 20 22
Was the training programme effective in promoting sales? Select α = 0.05.

Solution: If we consider x and y as sales before and after attending the
course respectively, then we are going to test:

H0 : µ1 = µ2 against
H1 : µ1 < µ2

µ1 and µ2 being the average sales in the population before the training and
after the training. As before, the critical region is:

        ω : t0 ≤ –1.761

as m = n − 1 = 14 and t0.05, 14 = 1.761

Table 16.5: Computation of Sample Mean and S.D.

Serial No.     Sales (’000 Rs.)                       ui          ui²
of trainee     Before course (xi)   After course (yi)  = xi–yi

1 15 16 –1 1
2 16 17 –1 1
3 13 19 –6 36
4 20 18 2 4
5 18 22.5 – 4.5 20.25
6 17 18.3 – 1.3 1.69
7 16 19.2 – 3.2 10.24
8 19 18 1 1
9 20 20 0 0
10 15.5 16 – 0.5 0.25
11 16.2 17 – 0.8 0.64
12 15.8 17 – 1.2 1.44
13 18.7 20 – 1.3 1.69
14 18.3 18 0.3 0.09
15 20 22 –2 4
Total – – –19.5 83.29
Tests of Hypothesis-II
∑ µi − nu
2
su =
|

n −1
From the above table 16.5, we have
− 19.5
u= = − 1.3
15

61 − 6 ( 0 .833 ) 2
su =
|
= 3 .3715
5

nu 15 × − 1.3
Hence t 0 = |
= = − 2.428
su 2.0343
t0 being less than −1.761, we reject H0. Thus, on the basis of the given sample, we conclude that the training programme was effective in promoting sales.
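Since t0 = √n × ū / s′u is algebraically the same as ū / (s′u / √n), the statistic for this illustration can also be computed with Python's `statistics` module (a minimal sketch, not from the text; `statistics.stdev` uses the (n − 1) divisor, matching s′u):

```python
import math
import statistics

# Sales ('000 Rs.) before and after the course, from Table 16.5
before = [15, 16, 13, 20, 18, 17, 16, 19, 20, 15.5, 16.2, 15.8, 18.7, 18.3, 20]
after = [16, 17, 19, 18, 22.5, 18.3, 19.2, 18, 20, 16, 17, 17, 20, 18, 22]

u = [x - y for x, y in zip(before, after)]
# t0 = u_bar / (s'_u / sqrt(n)), equivalent to sqrt(n) * u_bar / s'_u
t0 = statistics.mean(u) / (statistics.stdev(u) / math.sqrt(len(u)))
print(round(t0, 3))
```

Either way t0 falls below the critical value −1.761, so the decision to reject H0 is unaffected by small rounding differences in hand computation.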

Illustration 12

Six pairs of husbands and wives were selected at random and their IQs were
recorded as follows:
Pair : 1 2 3 4 5 6
IQ of Husband : 105 112 98 92 116 110
IQ of Wife : 102 108 100 96 112 110
Do the data suggest that there is no significant difference in average IQ
between the husband and wife? Use 1% level of significance.

Solution: Let x denote the IQ of husband and y, that of wife. We would like
to test
H0 : µ1 = µ2 i.e., there is no difference in IQ.

Against H1 : µ1 ≠ µ2, i.e. there is significant difference in IQ.

The critical region for this two-sided test is given by:

ω : |t0| ≥ t0.01/2, (6−1)

i.e., ω : |t0| ≥ t0.005, 5

i.e., ω : |t0| ≥ 4.032

Table 16.6: Computation of Mean and S.D. of IQ

Pair        IQ                          ui         ui²
        Husband (xi)   Wife (yi)    = xi − yi
1          105            102           3            9
2          112            108           4           16
3           98            100          −2            4
4           92             96          −4           16
5          116            112           4           16
6          110            110           0            0
Total       –              –            5           61

From the above Table, we get,
ū = 5 / 6 = 0.8333

s′u = √[(Σui² − n(ū)²) / (n − 1)]

s′u = √[(61 − 6 × (0.8333)²) / 5] = 3.3715

t0 = √n × ū / s′u

so, t0 = (√6 × 0.8333) / 3.3715 = 0.605
Therefore, we accept H0 and conclude that, on the basis of the given sample,
there is no reason to believe that IQs of husbands and wives are different.
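The same recipe can be packaged as a small reusable helper and applied to the husband-and-wife data (a minimal standard-library sketch; the function name `paired_t` is our own, not from the text):

```python
import math

def paired_t(x, y):
    """Paired t statistic t0 = sqrt(n) * u_bar / s'_u for differences u = x - y."""
    u = [a - b for a, b in zip(x, y)]
    n = len(u)
    u_bar = sum(u) / n
    # sample S.D. of the differences with the (n-1) divisor
    s_u = math.sqrt(sum((d - u_bar) ** 2 for d in u) / (n - 1))
    return math.sqrt(n) * u_bar / s_u

husbands = [105, 112, 98, 92, 116, 110]
wives = [102, 108, 100, 96, 112, 110]
t0 = paired_t(husbands, wives)
print(round(t0, 3))
```

Since |t0| is far below the critical value 4.032, the decision to accept H0 is unaffected by whether √n or √(n − 1) is used in a hand computation.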

Self Assessment Exercise B


1) State with reasons, whether the following statements are true or false?
a) t-test is an exact test whereas z-test is an approximate test.
b) For small samples, one must always use t-test.
c) In order to apply paired t-test, it is assumed that the data are taken from a
bivariate normal population.
d) t-test for independent sample and t-test for dependent sample are applied
under different conditions.
e) t-test is not applicable if the population S.D. is unknown.

2) Describe the different steps one should undertake in order to apply t-test.
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................

3) Distinguish between large sample and small sample.


...................................................................................................................
....................................................................................................................
4) A manufacturer of ball-point pens claims that a certain kind of pen manufactured
by him has a mean writing life of 500 pages. A purchasing agent takes a sample of
10 such pens and the writing life of the 10 selected pens (in pages) are found to
be :
502, 510, 498, 475, 482, 523, 476, 518, 523, 479
Determine at 1% level of significance whether the purchaser should reject the
claim.
....................................................................................................................
....................................................................................................................
126
Geektonight Notes

.................................................................................................................... Tests of Hypothesis-II

....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
5) A certain diet was introduced to increase the weight of pigs. A random sample of
12 pigs was taken and weighed before and after applying the new diet. The
differences in weights were :
7, 4, 6, 5, – 6, – 3, 1, 0, –5, –7, 6, 2
can we conclude that the diet was successful in increasing the weight of the pigs?
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................

16.8 LET US SUM UP


This unit is an extension of the previous unit, where we discussed the problems of statistical inference in detail. In this unit we first discussed the distinction between small samples and large samples. If the sample size does not exceed thirty, the sample is known as a small sample and we apply exact tests, also known as small sample tests. In particular, if the assumption of normality holds and the population S.D. is unknown, we apply the t-test for testing the mean. A variation of the t-test is its applicability to dependent samples, where the effect of a particular treatment can be tested after eliminating other sources of variation. If the sample size exceeds 30, we consider approximate tests, also termed large sample tests. These large sample tests do not require the assumption of normality.

Next, we considered the problem of finding a 100(1−α)% confidence interval for the population mean by applying the t-distribution. Coming back to the problem of testing the population mean, for testing H0 : µ = µ0 against the both-sided alternative H1 : µ ≠ µ0, the critical region at level of significance α is given by:

ω : |t0| ≥ tα/2, (n−1)

The critical region for the right-sided alternative H1 : µ > µ0 is given by:

ω : t0 ≥ tα, (n−1)

and the critical region for the left-sided alternative H1 : µ < µ0 is given by:

ω : t0 ≤ −tα, (n−1)

We have concluded our discussion by describing paired t-test and the critical
region for t-tests applied to dependent samples.
16.9 KEY WORDS AND SYMBOLS
Chi-square Distribution: If x1, x2, …, xm are m independent standard normal variables, then u = Σxi² follows the χ²-distribution with m d.f., and this is denoted by u ~ χ²m.
Degree of Freedom (d.f.): no. of observations – no. of constraints.
Large Sample: when sample size (n) is more than 30.
Large Sample Tests or Approximate Tests: tests based on large samples.
Paired Samples: Another term used for dependent samples.
Small Sample: when sample size (n) is at most 30.
Small Sample Tests or Exact Tests: tests based on small samples only.
t-distribution: If x is a standard normal variable, u is a chi-square variable with m d.f., and x and u are independent, then the ratio

x / √(u / m)

follows the t-distribution with m d.f. and is denoted by t ~ tm.
100(1−α)% confidence interval for µ:

[ x̄ − tα/2, (n−1) × s′/√n ,  x̄ + tα/2, (n−1) × s′/√n ]
For testing the population mean from independent samples, we use the test statistic

t0 = √n (x̄ − µ0) / s′

and for testing for a particular effect (paired samples), we use

t0 = √n ū / s′u

where µ0 = specified value of the mean; s′ = sample S.D. with (n−1) divisor; u = x − y = difference in the paired sample; and s′u = sample S.D. of u with (n−1) divisor.

16.10 ANSWERS TO SELF ASSESSMENT EXERCISES
A) 1 a) Yes, b) Yes, c)Yes, d) No. e) Yes
f) Yes, g) No, h) Yes, i) Yes, j) No
k) Yes.

4. 90% confidence interval = [69.1163, 100.8837]
   95% confidence interval = [65.2952, 104.7048]
5. Lower confidence limit = 9.3613
Upper confidence limit = 21.8927

6. 7.5596 and 10.8404



B) 1. a) No, b) Yes, c) Yes, d) No, e) No

4. No, t0 = −0.226

5. No, t0 = 0.518

16.11 TERMINAL QUESTIONS/EXERCISES


1) Describe a situation where you can apply t-distribution.

2) How would you distinguish between a t-test for independent sample and a paired
t-test?

3) Describe the role played by the t-distribution in setting up a confidence interval for the population mean.

4) Distinguish between large samples and small samples.

5) Describe the steps one should undertake in order to apply t-test.

6) A technician is making engine parts with an axle diameter of 0.750 inch. A random sample of 14 parts shows a mean diameter of 0.763 inch and a S.D. of 0.0528 inch.

i) Set up the null hypothesis and the alternative hypothesis.
ii) Choose the level of significance.
iii) Describe the critical region.
iv) Compute the test statistic.
v) Draw your conclusion.

Examine whether the work meets the specification or not. Also obtain the 95% as well as the 99% confidence interval for the population mean.

Answer: t0 = 0.888
95% confidence interval = [0.7302", 0.7958"]
99% confidence interval = [0.7189", 0.8071"]

7) St. Nicholas college has 500 students. The heights (in cm.) of 11 students chosen
at random provides the following results:
175, 173, 165, 170, 180, 163, 171, 174, 160, 169, 176
Determine the limits of mean height of the students of St. Nicholas college at 1%
level of significance.
(Ans: 164.6038 cm. and 176.4870 cm.)

8) For a sample of 15 units drawn from a normal population of 150 units, the mean
and S.D. are found to be 10.8 and 3.2 respectively. Find the confidence level for
the following confidence intervals.
(i) 9.415, 12.185
(ii) 9.113, 12.487
[Ans: (i) 90% (ii) 95%]

9) A random sample of 15 observations from a normal population yields mean as 52.3 and S.D. as 5.63. Can it be assumed that the population mean is 50?
10) The following data relates to the sales of a new type of toothpaste in 15 selected
shops before and after a special sales promotion campaign.
Shop No.                   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
Sales ('000 Rs.)
before campaign           15  16  13  14  18  19  12  16  20  11  12   9  15  17  21
Sales ('000 Rs.)
after campaign            17  17  12  15  20  19  14  15  24  12  10  12  18  17  34

Would you regard the campaign as a success?

(Ans: Yes, t0 = −2.671)

11. A suggestion was made that husbands are more intelligent than wives. A social worker took a sample of 12 couples and applied I.Q. tests to both husbands and wives. The results are shown below:
Sl.No. I.Q. of
Husbands Wives

1. 110 115
2. 115 113
3. 102 104
4. 98 90
5. 90 93
6. 105 103
7. 104 106
8. 116 118
9. 109 110
10. 111 110
11. 87 100
12. 100 98

Do the data support the suggestion ?

Answer: No. t0 = –0.7452

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

16.12 FURTHER READING
The following text books may be used for more indepth study on the topics
dealt within this unit.

Levin and Rubin, 1996, Statistics for Management, Prentice-Hall of India Pvt. Ltd., New Delhi.

Hooda, R.P., 2000, Statistics for Business and Economics, MacMillan India Ltd., Delhi.

Gupta, S.P., 1999, Statistical Methods, Sultan Chand & Sons, New Delhi.

Gupta, S.C., and Kapoor, V.K., Fundamentals of Mathematical Statistics, Sultan Chand & Sons, New Delhi.

UNIT 17 CHI-SQUARE TEST
STRUCTURE

17.0 Objectives
17.1 Introduction
17.2 Chi-Square Distribution
17.3 Chi-Square Test for Independence of Attributes
17.4 Chi-Square Test for Goodness of Fit
17.5 Conditions for Applying Chi-Square Test
17.6 Cells Pooling
17.7 Yates Correction
17.8 Limitations of Chi-Square Test
17.9 Let Us Sum Up
17.10 Key Words
17.11 Answers to Self Assessment Exercises
17.12 Terminal Questions/Exercises
17.13 Further Reading
Appendix Tables

17.0 OBJECTIVES
After studying this unit, you should be able to:
l explain and interpret interaction among attributes,
l use the chi-square distribution to see if two classifications of the same data
are independent of each other,
l use the chi-square statistic in developing and conducting tests of goodness-
of-fit, and
l analyse the independence of attributes by using the chi-square test.

17.1 INTRODUCTION

In the previous two units, you have studied the procedure of testing hypothesis
and using some of the tests like Z-test and t-test. In one sample test you have
learned tests to determine whether a sample mean or proportion was
significantly different from the respective population mean or proportion. But in
practice the requirement in your research may not be confined to only testing
of one mean/proportion of a population. As a researcher you may be interested
in dealing with more than two populations. For example, you may be interested
in knowing the differences in consumer preferences of a new product among
people in the north, the south, and the north-east of India. In such situations the
tests you have learned in the previous units do not apply. Instead you have to
use chi-square test.

Chi-square tests enable us to test whether more than two population proportions
are equal. Also, if we classify a consumer population into several categories
(say high/medium/low income groups and strongly prefer/moderately prefer/
indifferent/do not prefer a product) with respect to two attributes (say consumer
income and consumer product preference), we can then use chi-square test to
test whether two attributes are independent of each other. In this unit you will
learn the chi-square test, its applications and the conditions under which the chi-
square test is applicable.

Chi-Square Test
17.2 CHI-SQUARE DISTRIBUTION
The chi-square distribution is a probability distribution. Under some proper
conditions the chi-square distribution can be used as a sampling distribution of
chi-square. You will learn about these conditions in section 17.5 of this unit.
The chi-square distribution is known by its only parameter – number of degrees
of freedom. The meaning of degrees of freedom is the same as the one you
have used in student t-distribution. Figure 17.1 shows the three different chi-
square distributions for three different degrees of freedom.

[Curves of the chi-square density for df = 2, 3 and 4, with probability on the vertical axis and χ² (0 to 16) on the horizontal axis]

Figure 17.1. Chi-Square Sampling Distributions for df = 2, 3 and 4

It is to be noted that as the degrees of freedom are very small, the chi-square
distribution is heavily skewed to the right. As the number of degrees of
freedom increases, the curve rapidly approaches symmetric distribution. You
may be aware that when the distribution is symmetric, it can be approximated
by normal distribution. Therefore, when the degrees of freedom increase
sufficiently, the chi-square distribution approximates the normal distribution. This
is illustrated in Figure 17.2.

[Curves of the chi-square density for df = 2, 4, 10 and 20, with probability on the vertical axis and χ² (0 to 40) on the horizontal axis]

Figure 17.2. Chi-Square Sampling Distributions for df = 2, 4, 10 and 20
Like the student t-distribution, there is a separate chi-square distribution for each number of degrees of freedom. Appendix Table-1 gives the most commonly used tail areas that are used in tests of hypothesis using the chi-square distribution. We will explain how to use this table to test the hypothesis when we deal with examples in the subsequent sections of this unit.
17.3 CHI-SQUARE TEST FOR INDEPENDENCE OF ATTRIBUTES
Many times, the researchers may like to know whether the differences they
observe among several sample proportions are significant or only due to chance.
Suppose a sales manager wants to know the preferences of consumers located in different geographic regions of a country for a particular brand of a product. In case the manager finds that the difference in product
preference among the people located in different regions is significant, he/she
may like to change the brand name according to the consumer preferences. But
if the difference is not significant then the manager may conclude that the
difference, if any, is only due to chance and may decide to sell the product
with the same name. Therefore, we are trying to determine whether the two
attributes (geographical region and the brand name) are independent or
dependent. It should be noted that the chi-square test only tells us whether two principles of classification are significantly related or not; it is not a measure of the degree or form of the relationship. We will discuss the procedure of testing the
independence of attributes with illustrations. Study them carefully to understand
the concept of χ2 test.

Illustration 1
Suppose in our example of consumer preference explained above, we divide
India into 6 geographical regions (south, north, east, west, central and north
east). We also have two brands of a product brand A and brand B.

The survey results can be classified according to the region and brand
preference as shown in the following table.

                Consumer preference
Region        Brand A    Brand B    Total
South            64         16        80
North            24          6        30
East             23          7        30
West             56         44       100
Central          12         18        30
North-east       12         18        30
Total           191        109       300

In the above table the attribute on consumer preference is represented by a


column for each brand of the product. Similarly, the attribute of region is
represented by a row for each region. The value in each cell represents the
responses of the consumers located in a particular region and their preference
for a particular brand. These cell numbers are referred to as observed (actual)
frequencies. The arrangement of data according to the attributes in cells is
called a contingency table. We describe the dimensions of a contingency table
by first stating the number of rows and then the number of columns. The table
stated above showing geographical region in rows (6) and brand preference in
columns (2) is a 6 × 2 contingency table.

In the 6 × 2 contingency table stated above (the example of brand preference), each cell value represents a frequency of consumers classified as having the corresponding attributes. We also stated that these cell values are referred to as

observed frequencies. Using this data we have to determine whether or not
the consumer geographical location (region) matters for brand preference. Here
the null hypothesis (H0) is that the brand preference is not related to the
geographical region. In other words, the null hypothesis is that the two
attributes, namely, brand preference and geographical location of the consumer
are independent. As a basis of comparison, we use the sample results that
would be obtained on the average if the null hypothesis of independence was
true. These hypothetical data are referred to as the expected frequencies.

We use the following formula for calculation of expected frequencies (E):

E = (Row total × Column total) / Total

For example, the cell entry in row-1 and column-1 (South region, Brand A) of the brand preference 6 × 2 contingency table referred to earlier is:

E = (80 × 191) / 300 = 15280 / 300 = 50.93

Accordingly, the following table gives the calculated expected frequencies for the rest of the cells of the 6 × 2 contingency table.

Calculation of the Expected Frequencies

                        Consumer Preference
Region        Brand A                   Brand B                  Total
South       (80×191)/300 = 50.93    (80×109)/300 = 29.07       80
North       (30×191)/300 = 19.10    (30×109)/300 = 10.90       30
East        (30×191)/300 = 19.10    (30×109)/300 = 10.90       30
West        (100×191)/300 = 63.67   (100×109)/300 = 36.33     100
Central     (30×191)/300 = 19.10    (30×109)/300 = 10.90       30
North-east  (30×191)/300 = 19.10    (30×109)/300 = 10.90       30
Total        191                     109                       300
We use the following formula for calculating the chi-square value:

χ² = Σ (Oi − Ei)² / Ei

where χ² = chi-square; Oi = observed frequency; Ei = expected frequency; and Σ = sum of.

To ascertain the value of chi-square, the following steps are followed:

1) Subtract Ei from Oi for each of the 12 cells and square each of these differences: (Oi − Ei)².

2) Divide each squared difference by Ei and obtain the total, i.e., Σ (Oi − Ei)²/Ei.

This gives the value of chi-square, which may range from zero to infinity. Thus, the value of χ² is always positive.
Now we rearrange the data given in the above two tables for comparing the observed and expected frequencies. The rearranged observed frequencies, expected frequencies and the calculated χ² value are given in the following Table.

Row/Column Observed Expected (O i–E i) (O i–E i) 2 (O i–E i) 2/E i


frequencies frequencies
(O i) (Ei)

(1,1) 64 50.93 13.07 170.74 3.35

(2,1) 24 19.10 4.90 24.01 1.26

(3,1) 23 19.10 3.90 15.21 0.80

(4,1) 56 63.67 –7.67 58.78 0.92

(5,1) 12 19.10 –7.10 50.41 2.64

(6,1) 12 19.10 –7.10 50.41 2.64

(1,2) 16 29.07 –13.07 170.74 5.87

(2,2) 6 10.90 –4.90 24.01 2.20

(3,2) 7 10.90 –3.90 15.21 1.40

(4,2) 44 36.33 7.67 58.78 1.62

(5,2) 18 10.90 7.10 50.41 4.62

(6,2) 18 10.90 7.10 50.41 4.62

300 300 χ2 = 31.94

With an r × c (i.e., r rows and c columns) contingency table, the degrees of freedom are found by (r−1) × (c−1). In our example, we have a 6 × 2 contingency table. Therefore, we have (6−1) × (2−1) = 5 × 1 = 5 degrees of freedom. Suppose we take 0.05 as the significance level (α). Then at 5 degrees of freedom and α = 0.05 significance level the table value (from Appendix Table-4) is 11.071. Since the calculated χ² value (31.94) is greater than the table value (11.071), we reject the null hypothesis and conclude that the brand preference is not independent of the geographical location of the customer. Therefore, the sales manager needs to change the brand name across the regions.
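The whole computation (expected frequencies, the χ² statistic and the degrees of freedom) can be reproduced from the observed table alone. The following is a minimal sketch in plain Python, not from the text; its unrounded total differs from the hand computation only because the table above sums contributions that were rounded to two decimals:

```python
# Observed 6 x 2 contingency table from Illustration 1
# (rows: South, North, East, West, Central, North-east; columns: Brand A, Brand B)
observed = [[64, 16], [24, 6], [23, 7], [56, 44], [12, 18], [12, 18]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected frequency
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (r-1) x (c-1)
print(round(chi2, 2), df)
```

The value comfortably exceeds the table value 11.071 at 5 d.f., so the same rejection of H0 follows.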

Illustration 2
A TV channel programme manager wants to know whether there are any
significant differences among male and female viewers between the type of the
programmes they watch. A survey conducted for the purpose gives the
following results.

Type of TV         Viewers' Sex
programme       Male    Female    Total
News             30       10        40
Serials          20       40        60
Total            50       50       100

Calculate the χ² statistic and determine whether the type of TV programme is independent of the viewers' sex. Take 0.10 significance level.

Solution: In this example, the null and alternate hypotheses are:

H0: The viewers' sex is independent of the type of TV programme (there is no association between the male and female viewers).

H1: The viewers' sex is not independent of the type of TV programme.

We are given the observed frequencies in the problem. The expected frequencies are calculated in the same way as explained in Illustration 1. The following table gives the calculated expected frequencies.

Type of TV           Viewers' Sex
Programme       Male                  Female               Total
News        (40×50)/100 = 20      (40×50)/100 = 20        40
Serials     (60×50)/100 = 30      (60×50)/100 = 30        60
Total            50                    50                 100

Now we rearrange the data on observed and expected frequencies and calculate the χ² value. The following table gives the calculated χ² value.

(Row, Column) Observed Expected (O i–E i) (O i–E i) 2 (O i–E i) 2/E i


frequencies frequencies
(Oi) (Ei)
(1,1) 30 20 10 100 5.00
(2,1) 20 30 –10 100 3.33
(1,2) 10 20 –10 100 5.00
(2,2) 40 30 10 100 3.33
χ2 =16.66

Since we have a 2 × 2 contingency table, the degrees of freedom will be (r−1) × (c−1) = (2−1) × (2−1) = 1 × 1 = 1. At 1 degree of freedom and 0.10 significance level the table value (from Appendix Table-4) is 2.706. Since the calculated χ² value (16.66) is greater than the table value of χ² (2.706), we reject the null hypothesis and conclude that the type of TV programme is dependent on viewers' sex. It should, therefore, be noted that when the calculated value of χ² is greater than the table value of χ², the difference between theory and observation is significant.

Self Assessment Exercise A
1) The following are the independent testing situations, calculated chi-square values and the significance levels. (i) State the null hypothesis, (ii) determine the number of degrees of freedom, (iii) find the corresponding table value, and (iv) state whether you accept or reject the null hypothesis.

a) Type of the car (small, family, luxury) versus attitude by sex (preferred, not preferred). χ² = 10.25 and α = 0.05.

b) Income distribution per month (below Rs 10000, Rs 10000–20000, Rs 20000–30000, Rs 30000 and above) versus preference for type of house by number of bedrooms (1, 2, 3, 4 and above). χ² = 28.50 and α = 0.01.

c) Attitude towards going to a movie or for shopping versus sex (male, female). χ² = 8.50 and α = 0.01.

d) Educational level (illiterate, literate, high school, graduate) versus political affiliation (CPI, Congress, BJP, BSP). χ² = 12.65 and α = 0.10.
.........................................................................................................

...............................................................................................................

......................................................................................................

......................................................................................................

2) The following are the numbers of rows and columns of a contingency table. Determine the number of degrees of freedom that the chi-square will have.

a) 6 rows, 6 columns        c) 3 rows, 5 columns
b) 7 rows, 2 columns        d) 4 rows, 8 columns
..............................................................................................................
3) A company has introduced a new brand product. The marketing
manager wants to know whether the preference for the brand is
distributed independent of the consumer’s education level. The survey of
a sample of 400 consumers gave the following results.

                        Illiterates   Literates   High School   Graduates   Total
Bought new brand            50            55           45           60        210
Did not buy new brand       50            45           55           40        190
Total                      100           100          100          100        400

a) Calculate the expected frequencies and the chi-square value.
b) State the null hypothesis.
c) State whether you accept or reject the null hypothesis at α = 0.05.
...............................................................................................................

...............................................................................................................

...............................................................................................................
...............................................................................................................

...............................................................................................................

...............................................................................................................

...............................................................................................................

...............................................................................................................

...............................................................................................................

...............................................................................................................

...............................................................................................................

...............................................................................................................

17.4 CHI-SQUARE TEST FOR GOODNESS OF FIT


In unit 14, you have studied some probability distributions such as binomial,
Poisson and normal distributions. When we consider a sample data from a
population we try to assume the type of distribution the sample data follows.
The chi-square test is useful in deciding whether a particular probability
distribution such as the binomial, Poisson or normal distribution is the appropriate
probability distribution. This allows us to validate our assumption about the
probability distribution of the sample data. The chi-square test procedure used
for this purpose is called goodness-of-fit test. The test also indicates whether
or not the frequency distribution for the sample population has a particular
shape, such as the normal curve (symmetric distribution). This can be done by
testing whether there is a significant difference between an observed frequency
distribution and an assumed theoretical frequency distribution. Thus by applying
chi-square test for goodness of fit, we can determine whether the observed
data constitutes a sample drawn from the population with assumed theoretical
distribution. In this section we use chi-square test for goodness-of-fit to make
inferences about the type of distribution.

The logic inherent in the chi-square test allows us to compare the observed
frequencies (Oi) with the expected frequencies (Ei). The expected frequencies
are calculated on the basis of our theoretical assumptions about the population
distribution. Let us explain the procedure of testing by going through some
illustrations.

Illustration 3

A salesman has 3 products to sell and there is a 40% chance of selling each product when he meets a customer. The following is the frequency distribution of sales.

No. of products sold per sale:        0    1    2    3
Frequency of the number of sales:    10   40   60   20

At the 0.05 level of significance, do these sales of products follow a binomial distribution?

Solution: In this illustration, the sales process is approximated by a binomial distribution with P = 0.40 (a 40% chance of selling each product).

H0: The sales of the three products have a binomial distribution with P = 0.40.
H1: The sales of the three products do not have a binomial distribution with P = 0.40.

Before we proceed further, we must calculate the expected frequencies in order to determine whether the discrepancies between the observed frequencies and the expected frequencies (based on the binomial distribution) should be ascribed to chance. We begin by determining the binomial probability in each situation of sales (0, 1, 2, 3 products sold per sale). For three products, we find the probabilities of success by consulting the binomial probabilities in Appendix Table-1. Looking at the column labelled n = 3 and p = 0.40, we obtain the following binomial probabilities of the sales.

No. of products       Binomial probabilities
sold per sale (r)     of the sales

0                          0.216
1                          0.432
2                          0.288
3                          0.064
                           1.000

We now calculate the expected frequency of sales for each situation. There are
130 customers visited by the salesman. We multiply each probability by 130 (no.
of customers visited) to arrive at the respective expected frequency. For
example, 0.216 × 130 = 28.08.

The following table shows the observed frequencies and the expected
frequencies.

No. of products   Observed    Binomial      Number of    Expected
sold per sale     frequency   probability   customers    frequency
                                            visited
(1)                 (2)         (3)           (4)        (5) = (3) × (4)

0                   10          0.216         130          28.08
1                   40          0.432         130          56.16
2                   60          0.288         130          37.44
3                   20          0.064         130           8.32
Total              130

Now we use the chi-square test to examine the significance of differences


between observed frequencies and expected frequencies. The formula for
calculating chi-square is

χ2 = Σ (Oi – Ei)2 / Ei

The following table gives the calculation of chi-square.

Observed frequencies (Oi)   Expected frequencies (Ei)   (Oi–Ei)   (Oi–Ei)2   (Oi–Ei)2/Ei

10     28.08    –18.08    326.89    11.64
40     56.16    –16.16    261.15     4.65
60     37.44     22.56    508.95    13.59
20      8.32     11.68    136.42    16.40
130    130.00             χ2 = 46.28

In order to draw inferences about this calculated value of χ2, we must compare
it with the table value of χ2. For this we need: (i) the degrees of freedom
(n–1), and (ii) the level of significance. In this problem, the level of
significance is given as 0.05. The number of categories is 4 (0, 1, 2, 3
products sold per sale), that is, n = 4. Therefore, the degrees of freedom will
be 3 (i.e., n–1 = 4–1 = 3). The table value from Appendix Table-4 is 7.815 at
3 degrees of freedom and the 0.05 level of significance. Since the calculated
value (χ2 = 46.28) is greater than the table value (7.815), we reject the null
hypothesis and accept the alternative hypothesis. We conclude that the observed
frequencies do not follow the binomial distribution.
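The whole goodness-of-fit calculation above can be reproduced in a few lines of code. The following is a minimal sketch in plain Python (the variable and function names are our own, not part of the unit): it rebuilds the binomial probabilities, scales them into expected frequencies, and accumulates the chi-square statistic.

```python
from math import comb

def binom_pmf(r, n, p):
    # P(X = r) for a binomial(n, p) variable
    return comb(n, r) * p**r * (1 - p)**(n - r)

observed = [10, 40, 60, 20]          # sales of 0, 1, 2, 3 products
n_trials, p, total = 3, 0.40, sum(observed)

expected = [binom_pmf(r, n_trials, p) * total for r in range(4)]
# expected -> [28.08, 56.16, 37.44, 8.32], as in the worked table

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))          # 46.28, matching the worked value

# Compare with the 0.05 critical value for 3 degrees of freedom (7.815):
print(chi_square > 7.815)            # True -> reject H0
```

The same three steps (probabilities, expected frequencies, sum of squared deviations) apply to every goodness-of-fit illustration in this unit; only the probability formula changes.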

Let us take another illustration which relates to the normal distribution.

Illustration 4
In order to plan how much cash to keep on hand, a bank manager is interested
in seeing whether the average deposit of a customer is normally distributed with
mean Rs. 15000 and standard deviation Rs. 6000. The following information is
available with the bank.

Deposit (Rs.)            Less than 10000   10000–20000   More than 20000
Number of depositors            30               80              40

Calculate the χ2 statistic and test whether the data follow a normal distribution
with mean Rs. 15000 and standard deviation Rs. 6000 (take the level of
significance (α) as 0.10).

Solution: In this illustration, the assumption made by the bank manager is


that the pattern of deposits follows a normal distribution with mean Rs.15000
and standard deviation Rs.6000. Therefore, in testing the goodness-of-fit you
may like to state the following hypothesis.

H0: The sample data of deposits is from a population having normal distribution
with mean Rs.15000 and standard deviation Rs.6000.

H1: The sample data of deposits is not from a population having normal
distribution with mean Rs.15000 and standard deviation Rs.6000.

In order to calculate the χ2 value we must have expected frequencies. The


expected frequencies are determined by multiplying the proportion of population
values within each class interval by the total sample size of observed
frequencies. Since we have assumed a normal distribution for our population,
the expected frequencies are calculated by multiplying the area under the
respective region of the normal curve by the total sample size (n = 150).

For example, to obtain the area for deposits less than Rs.10000, we calculate
the normal deviate as follows:

z = (x − µ)/σ = (10000 − 15000)/6000 = −5000/6000 = −0.83

From Appendix Table-3 (given at the end of this unit), this value (–0.83)
corresponds to a lower tail area of 0.5000–0.2967 = 0.2033. Multiplying 0.2033
by the sample size (150), we obtain the expected frequency 0.2033 × 150 =
30.50 depositors.
The calculations of the remaining expected frequencies are shown in the
following table.

Upper limit of the   Normal deviate         Area left   Area of         Expected frequency
deposit range (x)    z = (x – 15000)/6000   to x        deposit range   (Depositors)
(1)                  (2)                    (3)         (4)             (5) = (4) × 150

10000                –0.83                  0.2033      0.2033          30.50
20000                 0.83                  0.7967      0.5934          89.01
>20000                ∞                     1.0000      0.2033          30.50
Total                                                   1.0000          150

We should note that from Appendix Table-3 for 0.83 the area left to x is
0.5000 + 0.2967 = 0.7967 and for ∞ the area left to x is 0.5000 + 0.5000 =
1.0000. Similarly, the area of deposit range for normal deviate 0.83 = 0.7967–
0.2033 = 0.5934 and for ∞ = 1.0000–0.7967 = 0.2033.

Once the expected frequencies are calculated, the procedure for calculating χ2
statistic will be the same as we have seen in illustration 3.

χ2 = Σ (Oi – Ei)2 / Ei
The following table gives the calculation of chi-square.

Observed frequencies (Oi)   Expected frequencies (Ei)   (Oi–Ei)   (Oi–Ei)2   (Oi–Ei)2/Ei

30                          30.50                       –0.50      0.2500    0.0082
80                          89.01                       –9.01     81.1801    0.9120
40                          30.50                        9.50     90.2500    2.9590
150                         150                                   χ2 = 3.8792

Since n = 3, the number of degrees of freedom will be n–1 = 3–1 = 2, and we
are given 0.10 as the level of significance. From Appendix Table-4, the table
value of χ2 for df = 2 and α = 0.10 is 4.605. Since the calculated value of
χ2 (3.8792) is less than the table value, we accept the null hypothesis and
conclude that the data are well described by a normal distribution with mean
= Rs. 15000 and standard deviation = Rs. 6000.
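This normal goodness-of-fit test can also be checked with a short script. The sketch below is our own (plain Python): instead of the rounded areas from Appendix Table-3 it uses the exact normal CDF via math.erf, so the statistic differs slightly from the table-based value (≈3.88), but the conclusion is the same.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    # P(X <= x) for a Normal(mu, sigma) variable
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma, n = 15000, 6000, 150
observed = [30, 80, 40]              # <10000, 10000-20000, >20000

p_low  = normal_cdf(10000, mu, sigma)
p_mid  = normal_cdf(20000, mu, sigma) - p_low
p_high = 1 - p_low - p_mid           # tail: more than 20000
expected = [p * n for p in (p_low, p_mid, p_high)]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# Compare with the 0.10 critical value for 2 degrees of freedom (4.605):
print(chi_square < 4.605)            # True -> accept H0
```

Using the exact CDF rather than two-decimal table areas is why hand-worked and computed χ2 values often disagree in the second decimal place.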

Let us consider an illustration which relates to Poisson Distribution.

Illustration 5
A small car company wishes to determine the frequency distribution of
warranty financed repairs per car for its new model car. On the basis of past
experience the company believes that the pattern of repairs follows a Poisson
distribution with a mean number of repairs (λ) of 3. Sample data of 400
observations is provided below:

No. of repairs per car    0    1    2    3    4    5 or more
No. of cars              20   57   98   85   78   62

i) Construct a table of expected frequencies using Poisson probabilities with λ = 3.


ii) Calculate the χ2 statistic and give your conclusions about the null hypothesis
(take the level of significance as 0.05).

Solution: For the above problem we formulate the following hypothesis.

H0: The number of repairs per car during warranty period follows a Poisson
probability distribution.
H1: The number of repairs per car during warranty period does not follow a Poisson
probability distribution.
As usual the expected frequencies are determined by multiplying the probability
values (in this case Poisson probability) by the total sample size of observed
frequencies. Appendix Table-2 provides the Poisson probability values. For
λ = 3.0 and for different x values we can directly read the probability values.
For example for λ = 3.0 and x = 0 the Poisson probability value is 0.0498, for
λ = 3.0 and x = 1 the Poisson probability value is 0.1494 and so on … .

The following table gives the calculated expected frequencies.

No. of repairs   Poisson probability   Expected frequency
per car (x)                            Ei = (2) × 400
(1)              (2)                   (3)
0                0.0498                19.92
1                0.1494                59.76
2                0.2240                89.60
3                0.2240                89.60
4                0.1680                67.20
5 or more        0.1848                73.92
Total            1.0000                400

It is to be noted that from Appendix Table-2 for λ = 3.0 we have taken the
Poisson probability values directly for x = 0,1,2,3 and 4. For x = 5 or more we
added the rest of the probability values (for x = 5 to x = 12) so that the sum
of all the probabilities for x = 0 to x = 5 or more will be 1.0000.
As usual, we use the following formula for calculating the chi-square (χ2) value.

χ2 = Σ (Oi – Ei)2 / Ei
The following table gives the calculated χ2 value

Observed frequencies (Oi)   Expected frequencies (Ei)   (Oi–Ei)   (Oi–Ei)2   (Oi–Ei)2/Ei
20 19.92 0.08 0.0064 0.0003
57 59.76 – 2.76 7.6176 0.1275
98 89.60 8.40 70.5600 0.7875
85 89.60 – 4.60 21.1600 0.2362
78 67.20 10.80 116.6400 1.7357
62 73.92 – 11.92 142.0864 1.9222
400 400 χ2 = 4.8094

Since n = 6, the number of degrees of freedom will be n–1 = 6–1 = 5, and we
are given α = 0.05 as the level of significance. From Appendix Table-4, the
table value of χ2 for 5 degrees of freedom and α = 0.05 is 11.071. Since the
calculated value of χ2 = 4.8094 is less than the table value of χ2 = 11.071,
we accept the null hypothesis (H0) and conclude that the data follow a Poisson
probability distribution with λ = 3.0.
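The Poisson fit above follows the same pattern as the earlier examples, with the tail probability lumped into the "5 or more" class. Here is a minimal sketch in plain Python (names are ours); because it computes the pmf exactly from e^(−λ)λ^x/x! rather than four-decimal table values, its statistic lands near, not exactly on, the worked 4.8094.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) for a Poisson(lam) variable
    return exp(-lam) * lam**x / factorial(x)

lam, n = 3.0, 400
observed = [20, 57, 98, 85, 78, 62]   # repairs: 0, 1, 2, 3, 4, 5 or more

probs = [poisson_pmf(x, lam) for x in range(5)]
probs.append(1 - sum(probs))          # lump the tail into "5 or more"
expected = [p * n for p in probs]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# Compare with the 0.05 critical value for 5 degrees of freedom (11.071):
print(chi_square < 11.071)            # True -> accept H0
```

Appending the complement of the partial sum is exactly the "add the rest of the probability values" step described in the solution, and it guarantees the probabilities sum to 1.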

Illustration 6
In order to know the brand preference of two washing detergents, a sample of
1000 consumers were surveyed. 56% of the consumers preferred Brand X
and 44% of the consumers preferred Brand Y. Do these data conform to the
idea that consumers have no special preference for either brand? Take
significance level as 0.05.

Solution: In this illustration, we assume that brand preference follows a
uniform distribution. That is, ½ of the consumers prefer Brand X and the
other ½ prefer Brand Y.

Therefore, we have the following hypothesis.


H0: Brand name has no special significance for consumer preference.
H1: Brand name has special significance for consumer preference.

Since the consumer preference data are given as proportions, we convert them
into frequencies. The number of consumers who preferred Brand X is 0.56 ×
1000 = 560 and Brand Y is 0.44 × 1000 = 440. The corresponding expected
frequency is ½ × 1000 = 500 for each brand.

The following table gives calculated χ2 value.


Observed frequencies (Oi)   Expected frequencies (Ei)   (Oi–Ei)   (Oi–Ei)2   (Oi–Ei)2/Ei

560                         500                          60       3600       7.2
440                         500                         –60       3600       7.2
1000                        1000                                  χ2 = 14.4

The table value (by consulting the Appendix Table-4) at 5% significance level
and n–1 = 2–1 = 1 degree of freedom is 3.841. Since the value of calculated
χ2 is 14.4 which is greater than table value, we reject the null hypothesis and
conclude that the brand names have special significance for consumer
preference.
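The uniform (no-preference) test is the simplest case: both expected frequencies are n/2. A short sketch in plain Python (names our own) reproduces the computation.

```python
def chi_square_stat(observed, expected):
    # Sum of (O - E)^2 / E over all categories
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n = 1000
observed = [560, 440]            # Brand X, Brand Y
expected = [n / 2, n / 2]        # no-preference (uniform) hypothesis

chi = chi_square_stat(observed, expected)
print(chi)                       # 14.4
print(chi > 3.841)               # True -> reject H0 at 0.05 with 1 df
```

With only two categories there is n–1 = 1 degree of freedom, so the 0.05 critical value is 3.841, as used above.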

17.5 CONDITIONS FOR APPLYING CHI-SQUARE TEST
To validate the chi-square test, the available data set needs to fulfil certain
conditions. Sometimes these conditions are also called precautions about using
the chi-square test. Therefore, whenever you use the chi-square test, the
following conditions must be satisfied:

a) Random Sample: The chi-square test assumes that the data set used is a random
sample that represents the population. As with all significance tests, if you have
random sample data that represent the population, then any differences between the
table values and the calculated values are real and therefore significant. On the
other hand, if you have non-random sample data, significance cannot be established,
though the tests are nonetheless sometimes used as crude "rules of thumb" anyway.
For example, we reject the null hypothesis if the difference between observed
and expected frequencies is too large. But if the chi-square value is zero, we
should be careful in interpreting that absolutely no difference exists between
observed and expected frequencies; we should then verify whether the sample data
actually represent the population.
b) Large Sample Size: To use the chi-square test you must have a sample size
large enough to guarantee the similarity between the theoretical chi-square
distribution and the sampling distribution of the chi-square statistic.
Applying the chi-square test to small samples exposes the researcher to an
unacceptable rate of Type-II errors. However, there is no accepted cutoff
sample size; many researchers set the minimum at 50. Remember that the
chi-square statistic must be calculated on actual count data (nominal,
ordinal or interval data) and not on substituted percentages, which would
have the effect of projecting the sample size as 100.

c) Adequate Cell Sizes: You have seen above that small sample sizes lead to
Type-II error. That is, when the expected cell frequencies are too small, the
value of chi-square will be overestimated. This in turn will result in too
many rejections of the null hypothesis. To avoid making incorrect inferences
from chi-square tests, we follow the general rule that the expected frequency
in any cell should be a minimum of 5.

d) Independence: The sample observations must be independent.


e) Final values: Observations must be grouped in categories.
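Condition (c) is easy to check mechanically before running the test. The helper below is a sketch of our own (not part of the unit); it scans a table of expected frequencies and reports every cell below the minimum. The sample table reuses the expected frequencies from the product-acceptance illustration later in this unit, where the whole third row falls below 5.

```python
def low_expected_cells(expected, minimum=5):
    # Return the (row, column) positions whose expected frequency
    # falls below the minimum allowed for a valid chi-square test.
    return [(r, c)
            for r, row in enumerate(expected)
            for c, e in enumerate(row)
            if e < minimum]

# Expected frequencies for a 3 x 4 contingency table:
expected = [[27.63, 27.63, 22.11, 27.63],
            [18.42, 18.42, 14.74, 18.42],
            [ 3.95,  3.95,  3.16,  3.95]]

flagged = low_expected_cells(expected)
print(flagged)   # the whole third row: [(2, 0), (2, 1), (2, 2), (2, 3)]
```

If the returned list is non-empty, the remedies are the ones this unit discusses next: pool rows or columns (section 17.6), or apply the Yates correction in the 2 × 2 case (section 17.7).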

17.6 CELLS POOLING
In the previous section we saw that each expected cell frequency should be at
least 5. When a contingency table contains one or more cells with an expected
frequency of less than 5, this requirement may be met by combining two rows or
columns before calculating χ2. We must combine these cells so as to obtain an
expected frequency of 5 or more in each cell. This practice is also known as
grouping the frequencies together. But in doing this, we reduce the number of
categories of data and gain less information from the contingency table. In
addition, we also lose one or more degrees of freedom due to pooling. It should
be noted that the number of degrees of freedom is determined by the number of
classes after the regrouping. In the special case of a 2 × 2 contingency table,
the number of degrees of freedom is 1. If in any cell the expected frequency is
less than 5, applying the pooling method would result in 0 degrees of freedom
(due to the loss of 1 df), which is meaningless. When the assumption of a
minimum expected cell frequency of 5 is not met in a 2 × 2 contingency table,
we apply the Yates correction. You will learn about the Yates correction in
section 17.7. Let us take an illustration to understand the cell pooling method.

Illustration 7
A company marketing manager wishes to determine whether there are any
significant differences between regions in terms of a new product acceptance.
The following is the data obtained from interviewing a sample of 190
consumers.

Degree of Region
acceptance South North East West Total
Strong 30 25 20 30 105
Moderate 15 15 20 20 70
Poor 5 10 0 0 15
Total 50 50 40 50 190

Calculate the chi-square statistic. Test the independence of the two attributes
at the 0.05 level of significance.

Solution: In this illustration, the null and alternate hypotheses are:


H0: The product acceptance is independent of the region of the consumer.
H1: The product acceptance is not independent of the region of the consumer.
We are given the observed frequencies in the problem. The following table
gives the calculated expected frequencies.

Degree of Region
acceptance South North East West Total
Strong 27.63 27.63 22.11 27.63 105

Moderate 18.42 18.42 14.74 18.42 70

Poor 3.95 3.95 3.16 3.95 15

Total 50.00 50.00 40.00 50.00 190



Since the expected frequencies (cell values) in the third row are less than 5, we
pool the third row with the second row of both observed frequencies and
expected frequencies. The revised observed frequency and expected frequency
tables are given below.

Degree of Region
acceptance South North East West Total
Strong 30 25 20 30 105
Moderate and 20 25 20 20 85
poor

Total 50 50 40 50 190

Degree of Region
acceptance South North East West Total
Strong 27.63 27.63 22.11 27.63 105
Moderate and 22.37 22.37 17.89 22.37 85
poor
Total 50 50 40 50 190

Now we rearrange the data on observed and expected frequencies and


calculate the χ2 value. The following table gives the calculated χ2 value.

(Row, Column)   Observed frequencies (Oi)   Expected frequencies (Ei)   (Oi–Ei)   (Oi–Ei)2   (Oi–Ei)2/Ei
(1,1)           30                          27.63                        2.37     5.6169     0.2033
(2,1)           20                          22.37                       –2.37     5.6169     0.2511
(1,2)           25                          27.63                       –2.63     6.9169     0.2503
(2,2)           25                          22.37                        2.63     6.9169     0.3092
(1,3)           20                          22.11                       –2.11     4.4521     0.2014
(2,3)           20                          17.89                        2.11     4.4521     0.2489
(1,4)           30                          27.63                        2.37     5.6169     0.2033
(2,4)           20                          22.37                       –2.37     5.6169     0.2511
                                                                                  χ2 = 1.9185

Since we have a 2 × 4 contingency table, the degrees of freedom will be (r–1)
× (c–1) = (2–1) × (4–1) = 1 × 3 = 3. At 3 degrees of freedom and the 0.05
significance level, the table value (from Appendix Table-4) is 7.815. Since the
calculated χ2 value (1.9185) is less than the table value of χ2 (7.815), we
accept the null hypothesis and conclude that the product acceptance is
independent of the region of the consumer.
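The expected frequencies in an independence test come straight from the marginals: E = (row total × column total) / grand total. The sketch below (our own plain-Python code) computes them for the already-pooled 2 × 4 table and then forms the statistic; the result matches the hand-worked 1.9185 up to rounding of the expected values.

```python
def expected_table(observed):
    # E[r][c] = (row total x column total) / grand total
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[rt * ct / grand for ct in col_totals] for rt in row_totals]

# Observed counts after pooling "moderate" and "poor" into one row:
observed = [[30, 25, 20, 30],
            [20, 25, 20, 20]]
expected = expected_table(observed)

chi_square = sum((o - e) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))
# Degrees of freedom: (2 - 1) * (4 - 1) = 3; 0.05 critical value 7.815
print(chi_square < 7.815)   # True -> product acceptance independent of region
```

Note that pooling is done on the observed counts first, and the expected table is then recomputed from the pooled marginals, which is equivalent to summing the corresponding expected cells.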

Illustration 8
The following table gives the number of typing errors per page in a 40-page
report. Test whether the typing errors per page follow a Poisson distribution
with mean number of errors (λ) equal to 3.0.

No. of typing errors per page    0   1   2   3   4   5   6   7   8   9   10 or more
No. of pages                     5   9   6   8   4   3   2   1   1   0   1

i) Construct a table of expected frequencies using Poisson probabilities with λ = 3.


ii) Calculate the χ2 statistic and give your conclusions about the null hypothesis (take
the level of significance as 0.01).
Solution: For the above problem we formulate the following hypothesis.
H0: The number of typing errors per page follows a Poisson probability distribution.
H1: The number of typing errors per page does not follow a Poisson probability
distribution.

As usual, the expected frequencies are determined by multiplying the probability
values (in this case Poisson probabilities) by the total sample size of observed
frequencies. Appendix Table-2 provides the Poisson probability values. For
λ = 3.0 and for different x values we can directly read the probability values.
For example, for λ = 3.0 and x = 0 the Poisson probability value is 0.0498.
The following table gives the calculated expected frequencies.

No. of typing errors   Poisson probability   Expected frequency
per page (x)                                 Ei = (2) × 40
(1)                    (2)                   (3)
0                      0.0498                1.99 }
1                      0.1494                5.98 } 7.97
2                      0.2240                8.96
3                      0.2240                8.96
4                      0.1680                6.72 }
5                      0.1008                4.03 }
6                      0.0504                2.02 }
7                      0.0216                0.86 } 14.11
8                      0.0081                0.32 }
9                      0.0027                0.11 }
10 or more             0.0012                0.05 }
Total                  1.0000                40

Since the expected frequency of the first row is less than 5, we pool the first
and second rows of the observed and expected frequencies. Similarly, the
expected frequencies of the last six rows (with 5, 6, 7, 8, 9, and 10 or more
errors) are less than 5. Therefore, we pool these rows with the row for 4
typing errors, giving a category of "4 or more".

As usual we use the following formula for calculating the chi-square (χ2) value.

χ2 = Σ (Oi – Ei)2 / Ei

The following table gives the calculated χ2 value after pooling cells.

No. of typing     Observed           Expected           (Oi–Ei)   (Oi–Ei)2   (Oi–Ei)2/Ei
errors per        frequencies        frequencies
page (x)          (Oi)               (Ei)
1 or less         14                 7.97                6.032    36.39      4.5664
2                 6                  8.96               –2.960     8.76      0.9779
3                 8                  8.96               –0.960     0.92      0.1029
4 or more         12                 14.11              –2.112     4.46      0.3161
                                                                   χ2 = 5.9632

Since n = 4, the number of degrees of freedom will be n–1 = 4–1 = 3, and we
are given α = 0.01 as the level of significance. From Appendix Table-4, the
table value of χ2 for 3 degrees of freedom and α = 0.01 is 11.345. Since the
calculated value of χ2 = 5.9632 is less than the table value of χ2 = 11.345,
we accept the null hypothesis (H0) and conclude that the typing errors follow
a Poisson probability distribution with λ = 3.0.
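The pooling in this illustration can be expressed compactly by listing the index ranges that form each merged category. The sketch below is our own plain-Python code (the grouping tuples mirror the worked example: {0, 1}, {2}, {3}, {4 or more}); with exact Poisson probabilities the statistic lands near the table-based 5.9632.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # P(X = x) for a Poisson(lam) variable
    return exp(-lam) * lam**x / factorial(x)

lam, n = 3.0, 40
observed = [5, 9, 6, 8, 4, 3, 2, 1, 1, 0, 1]   # 0..9 errors, 10 or more

probs = [poisson_pmf(x, lam) for x in range(10)]
probs.append(1 - sum(probs))                   # tail: 10 or more
expected = [p * n for p in probs]

# Pool the categories exactly as in the worked example:
# {0, 1}, {2}, {3}, {4 or more} -- half-open index ranges
groups = [(0, 2), (2, 3), (3, 4), (4, 11)]
obs_pooled = [sum(observed[a:b]) for a, b in groups]
exp_pooled = [sum(expected[a:b]) for a, b in groups]

chi_square = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp_pooled))
# 0.01 critical value for 3 degrees of freedom is 11.345
print(chi_square < 11.345)   # True -> accept H0
```

After pooling, every expected frequency is at least 5, which is what makes the test valid; the assertion-style check `all(e >= 5 for e in exp_pooled)` is a useful guard in practice.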

17.7 YATES CORRECTION


The Yates correction is also called the Yates correction for continuity. In a
2 × 2 contingency table the number of degrees of freedom is 1. If any one of
the expected cell frequencies is less than 5, then use of the pooling method
(explained in section 17.6) would result in 0 degrees of freedom, due to the
loss of 1 degree of freedom in pooling, which is meaningless. Moreover, it is
not valid to perform the chi-square test if one or more of the expected
frequencies is less than 5 (as explained in section 17.5). Therefore, if any
one or more of the expected frequencies in a 2 × 2 contingency table is less
than 5, the Yates correction is applied. This correction was proposed by
F. Yates, an English statistician.

Suppose for a 2 × 2 contingency table, the four cell values a, b, c and d are
arranged in the following order.

a b
c d

The Yates formula for the corrected chi-square is given by

χ2 = n ( |ad – bc| – n/2 )2 / [ (a + b)(c + d)(a + c)(b + d) ]

Illustration 9
Suppose we have the following data on the consumer preference of a new
product collected from the people living in north and south India.
                                       South India   North India   Row total
Number of consumers who prefer
the present product                         4             51           55
Number of consumers who prefer
the new product                            14             38           52
Column total                               18             89          107

Do the data suggest that the new product is preferred by the people
independent of their region? Use α = 0.05.

Solution: Suppose we symbolise the true proportions of people who prefer


the new product as :

Ps = proportion of south Indians who prefer the new product


PN = Proportion of north Indians who prefer the new product

We state the null hypothesis (H0) and alternative hypothesis (H1)as:

H0: PS = PN (the proportion of people who prefer the new product is the same in
south and north India).
H1: PS ≠ PN (the proportion of people who prefer the new product is not the same
in south and north India).
In this illustration, (i) the sample size (n) = 107 (ii) the cell values are: a = 4,
b = 51, c = 14, d = 38, (iii) The corresponding row totals are: (a + b) = 55 and
(c + d) = 52, and column totals are (a + c) = 18 and (b + d) = 89.

Since one of the cell frequencies is less than 5 (a = 4), we apply the Yates
correction to the chi-square test.
χ2 = n ( |ad – bc| – n/2 )2 / [ (a + b)(c + d)(a + c)(b + d) ]

χ2 = 107 ( |4 × 38 – 51 × 14| – 107/2 )2 / (55 × 52 × 18 × 89)
   = 107 ( |152 – 714| – 53.5 )2 / 4581720
   = 107 (562 – 53.5)2 / 4581720
   = 107 (508.5)2 / 4581720
   = 107 × 258572.25 / 4581720
   = 27667230.75 / 4581720

∴ χ2 = 6.0386

The table value for (2–1) (2–1) = 1 degree of freedom and significance level
α = 0.05 is 3.841. Since the calculated value of chi-square (6.0386) is
greater than the table value, we reject H0, accept H1, and conclude that the
preference for the new product is not independent of the geographical region.

It may be observed that when n is large, the Yates correction will not make
much difference in the chi-square value. However, if n is small, the Yates
correction may overstate the resulting probability.
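The Yates formula translates directly into a small function. This is a sketch of our own (plain Python, function name ours) for a 2 × 2 table laid out as [[a, b], [c, d]]; it reproduces the worked value for the regional-preference data.

```python
def yates_chi_square(a, b, c, d):
    # Yates-corrected chi-square for a 2 x 2 table [[a, b], [c, d]]
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

chi = yates_chi_square(4, 51, 14, 38)
print(round(chi, 4))   # 6.0386, matching the worked value
print(chi > 3.841)     # True -> reject H0 at the 0.05 level with 1 df
```

Because the subtraction of n/2 shrinks |ad – bc|, the corrected statistic is always smaller than the uncorrected one, making the corrected test the more conservative of the two.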

17.8 LIMITATIONS OF CHI-SQUARE TEST


In order to prevent the misapplication of the χ2 test, one has to keep the
following limitations of the test in mind:

a) As explained in section 17.5 (conditions for applying the chi-square test), the
chi-square test is highly sensitive to sample size. As sample size increases,
absolute differences become a smaller and smaller proportion of the expected value.
This means that a reasonably strong association may not come up as significant if
the sample size is small. Conversely, in a large sample, we may find statistical
significance when the findings are small and unimportant. That is, the findings are
not substantively significant, although they are statistically significant.
b) Chi-square test is also sensitive to small frequencies in the cells of contingency
table. Generally, when the expected frequency in a cell of a table is less than 5,
chi-square can lead to erroneous conclusions as explained in section 17.5. The
rule of thumb here is that if either (i) an expected value in a cell in a 2 × 2 contingency
table is less than 5 or (ii) the expected values of more than 20% of the cells in a
greater than 2 × 2 contingency table are less than 5, then chi square test should not
be applied. If at all a chi-square test is applied then appropriately either Yates
correction or cell pooling should also be applied.
c) No directional hypothesis is assumed in the chi-square test. Chi-square tests the
hypothesis that two attributes/variables are related only by chance. That is, if a
significant relationship is found, this is not equivalent to establishing the researcher's
hypothesis that attribute A causes attribute B or that attribute B causes attribute A.
Self Assessment Exercise B
1) While calculating the expected frequencies of a chi-square distribution it was found
that some of the cells of expected frequencies have value below 5. Therefore,
some of the cells are pooled. The following statements tell you the size of the
contingency table before pooling and the rows/columns pooled. Determine the
number of degrees of freedom.
a) 5 × 4 contingency table. First two and last two rows are pooled.

b) 4 × 6 contingency table. First two and last two columns are pooled.
c) 6 × 3 contingency table. First two rows are pooled. 4th, 5th, and 6th rows
are pooled.

..................................................................................................................

..................................................................................................................

..................................................................................................................

2) What is the table value of chi-square for goodness-of-fit if there are:


a) 8 degrees of freedom and the significance level is 1%.

b) 13 degrees of freedom and the significance level is 5%.

c) 16 degrees of freedom and the significance level is 0.10.

d) 6 degrees of freedom and the significance level is 0.20.


..................................................................................................................

3) a) The following data is an observed frequency distribution. Assuming that
the data follow a Poisson distribution with λ = 3.0:
i) calculate the Poisson probabilities and expected values, ii) calculate the
chi-square value, and iii) at the 0.05 level of significance, can we
conclude that the data follow a Poisson distribution with λ = 3.0?

No. of telephone calls per minute    0    1    2    3    4    5 or more
Frequency of occurrences             6   30   41   52   12    9
..................................................................................................................
..................................................................................................................

..................................................................................................................

..................................................................................................................

..................................................................................................................

..................................................................................................................

..................................................................................................................

..................................................................................................................

..................................................................................................................

..................................................................................................................

17.9 LET US SUM UP


There are several applications of the chi-square distribution, some of which we
have studied in this unit: (i) to test goodness-of-fit, and (ii) to test the
independence of attributes. The chi-square distribution is known by its only
parameter, the number of degrees of freedom. As with the Student's t
distribution, there is a separate chi-square distribution for each number of
degrees of freedom.

The chi-square test for goodness-of-fit establishes whether the sample data
support the assumption that a particular distribution applies to the parent
population. It should be noted that many statistical procedures are based on
assumptions such as a normal distribution of the population; a chi-square
procedure allows for testing the null hypothesis that a particular distribution
applies. We also use the chi-square test to examine whether classification
criteria are independent of each other.

When performing a chi-square test using contingency tables, it is assumed that
all expected cell frequencies are a minimum of 5. If this assumption is not met
we may use the pooling method, but there is then a loss of information. In a
2 × 2 contingency table, if one or more expected cell frequencies are less
than 5, we should apply the Yates correction when computing the chi-square value.

In a chi-square test for goodness of-fit, the degrees of freedom are number of
categories – 1 (n–1). In a chi-square test for independence of attributes, the
degrees of freedom are (number of rows–1) × (number of columns–1). That is,
(r–1) × (c–1).
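The two degrees-of-freedom rules summarised above can be captured in a pair of one-line helpers (plain Python; the function names are ours):

```python
def df_goodness_of_fit(n_categories):
    # Goodness-of-fit: number of categories minus 1
    return n_categories - 1

def df_independence(n_rows, n_cols):
    # Independence of attributes: (rows - 1) x (columns - 1)
    return (n_rows - 1) * (n_cols - 1)

print(df_goodness_of_fit(4))    # 3, as in the binomial illustration
print(df_independence(2, 4))    # 3, as in the pooled 2 x 4 table
```

Remember that when cells are pooled, the category counts fed to these helpers are the counts after regrouping, not before.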

17.10 KEY WORDS


Adequate Cell Sizes: To avoid making incorrect inferences from chi-square
tests, we follow a general rule that the expected frequency in any cell should
be a minimum of 5.

Cells Pooling: When a contingency table contains one or more cells with
expected frequency less than 5, we combine two rows or columns before
calculating χ2. We combine these cells in order to get an expected frequency of
5 or more in each cell.

Chi-Square Distribution: A kind of probability distribution, differentiated by
its degrees of freedom, used to test a number of different hypotheses about
variances, proportions and distributional goodness-of-fit.

Expected Frequencies: The hypothetical cell frequencies are called expected
frequencies.

Goodness of Fit: The chi-square test procedure used for the validation of our
assumption about the probability distribution is called goodness of fit.

Observed Frequencies: The actual cell frequencies are called observed


frequencies.

Yates Correction: If any one or more of the expected frequencies in a 2 × 2
contingency table is less than 5, the Yates correction is applied.

17.11 ANSWERS TO SELF ASSESSMENT EXERCISES
A) 1. a i) H0: The preference for the type of car among people is independent
of their sex.
ii) degrees of freedom: 6
iii) χ2 (table value): 12.592
iv) Conclusion: Accept H0.
1. b i) H0: The income distribution and preference for type of house are
independently distributed.
ii) degrees of freedom: 9
iii) χ2 (table value): 21.666
iv) Conclusion: Reject H0.
1. c i) H0: The attitude towards going to a movie or for shopping is
independent of the sex.
ii) degrees of freedom: 1
iii) χ2 (table value): 6.635
iv) Conclusion: Reject H0.
1. d i) H0: The voters educational level and their political affiliation are
independent of each other.
ii) degrees of freedom: 9
iii) χ2 (table value): 14.684
iv) Conclusion: Accept H0.
2. a) 25, b) 6, c) 8, d) 21.

3. a.
(Row, Column)   Observed frequency (Oi)   Expected frequency (Ei)   (Oi–Ei)   (Oi–Ei)2   (Oi–Ei)2/Ei

(1,1)           50                        52.5                      –2.5      6.25       0.1190
(1,2)           55                        52.5                       2.5      6.25       0.1190
(1,3)           45                        52.5                      –7.5      56.25      1.0714
(1,4)           60                        52.5                       7.5      56.25      1.0714
(2,1)           50                        47.5                       2.5      6.25       0.1316
(2,2)           45                        47.5                      –2.5      6.25       0.1316
(2,3)           55                        47.5                       7.5      56.25      1.1842
(2,4)           40                        47.5                      –7.5      56.25      1.1842
Total           400                       400                                 χ2 = 5.0124

3. b. H0: The preference for the brand is distributed independently of the consumers'
education level.

3. c. The table value of χ2 at 3 d.f. and α = 0.05 is 7.815. Since the calculated value
(5.0124) is less than the table value of χ2 (7.815), we accept H0.

B) 1. a) 6, b) 9, c) 4
2. a) 20.090, b) 22.362, c) 23.542, d) 8.558
3. i) Poisson probabilities and expected values
No. of telephone calls   Poisson probability   Expected frequency
per minute (x)                                 Ei = (2) × 150
(1)                      (2)                   (3)
0                        0.0498                7.47
1                        0.1494                22.41
2                        0.2240                33.60
3                        0.2240                33.60
4                        0.1680                25.20
5 or more                0.1848                27.72
5 or more 0.1848 27.72


3. ii) Chi-square value

No. of telephone    Observed          Expected          (Oi–Ei)   (Oi–Ei)2   (Oi–Ei)2/Ei
calls per minute    frequency (Oi)    frequency (Ei)
0                   6                 7.47              –1.47     2.16       0.2893
1                   30                22.41             7.59      57.61      2.5706
2                   41                33.60             7.40      54.76      1.6298
3                   52                33.60             18.40     338.56     10.0762
4                   12                25.20             –13.20    174.24     6.9143
5 or more           9                 27.72             –18.72    350.44     12.6421
Total               150               150                                    χ2 = 34.1222

3. iii) At the 0.05 significance level and 5 degrees of freedom (n–1 = 6–1 = 5),
the table value is 11.070. Since the calculated chi-square value (34.1222) is
greater than the table value, we reject the null hypothesis that the frequency
of telephone calls follows a Poisson distribution.

17.12 TERMINAL QUESTIONS/EXERCISES


1) Why do we use the chi-square test?
2) What do you mean by expected frequencies in (a) the chi-square test for
independence of attributes, and (b) the chi-square test for goodness-of-fit?
Briefly explain the procedure you follow in calculating the expected values in
each of these situations.
3) Explain the conditions for applying the chi-square test.
4) What are the limitations of the chi-square test?
5) When do you use Yates' correction?
6) When do you pool rows or columns while applying the chi-square test? What are
the limitations of pooling?
7) The following data provide information on fatal accidents in a metro city over 30
days. Do the data suggest that the distribution of fatal accidents follows a Poisson
distribution? Take the level of significance as 0.05.

   Fatal accidents per day    0    1    2    3    4 or more
   Frequency                  4    8   10    6    2

8) Below is an observed frequency distribution.

   Marks range       Under 40   40 and     50 and     60 and     75 and     90 and
                                under 50   under 60   under 75   under 90   above
   No. of students       9         20         65         34         14        8

   At the 0.01 significance level, test the null hypothesis that the data come from a
   normal distribution with a mean of 10 and a standard deviation of 2. What are
   your conclusions?

9) The following table gives the number of telephone calls attended by a credit card
information attendant.

   Day                Sunday   Monday   Tuesday   Wednesday   Thursday   Friday   Saturday
   No. of calls         45       50       24         36          33        27        42
   attended

   Test whether the telephone calls are uniformly distributed over the days of the
   week. Use the 0.10 significance level.
10) The following data give the preference of car makes by type of customer.

   Type of                          Car make
   customer       Maruti 800   Maruti Zen   Honda   Tata Indica   Total

   Single man         350          200        150        50         750
   Single woman       100          150        100        80         430
   Married man        300          150        120       120         690
   Married woman      150          100         80        50         380

   Total              900          600        450       300        2250

   (a) Test the independence of the two attributes. Use the 0.05 level of significance.
   (b) Draw your conclusions.
11) A bath soap manufacturer introduced a new brand of soap in four colours. The
following data give information on consumer preference for the brand.

   Consumer           Bath soap colour
   rating        Red   Green   Brown   Yellow   Total

   Excellent      30     20      20      30      100
   Good           20     10      20      30       80
   Fair           20     10      10      30       70
   Poor           10     45      35      10      100

   Total          80     85      85     100      350

   From the above data:

   a) Compute the χ² value,
   b) State the null hypothesis, and
   c) Draw your inferences.

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

17.13 FURTHER READING
A number of good textbooks are available on the topics dealt with in this unit. The
following books may be used for a more in-depth study.

1) Kothari, C.R. (1985) Research Methodology: Methods and Techniques, Wiley
Eastern, New Delhi.
2) Levin, R.I. and D.S. Rubin (1999) Statistics for Management, Prentice-Hall
of India, New Delhi.
3) Mustafi, C.K. (1981) Statistical Methods in Managerial Decisions,
Macmillan, New Delhi.
4) Chandan, J.S., Statistics for Business and Economics, Vikas Publishing
House Pvt Ltd, New Delhi.
5) Zikmund, William G. (1988) Business Research Methods, The Dryden
Press, New York.

Appendix Table-1 Binomial Probabilities

p
n r .01 .05 .10 .15 .20 .25 .30 .35 .40 .45 .50 .55 .60 .65 .70 .75 .80 .85 .90 .95
2 0 .980 .902 .810 .723 .640 .563 .490 .423 .360 .303 .250 .203 .160 .123 .090 .063 .040 .023 .010 .002
1 .020 .095 .180 .255 .320 .375 .420 .455 .480 .495 .500 .495 .480 .455 .420 .375 .320 .255 .180 .095
2 .000 .002 .010 .023 .040 .063 .090 .123 .160 .203 .250 .303 .360 .423 .490 .563 .640 .723 .810 .902
3 0 .970 .857 .729 .614 .512 .422 .343 .275 .216 .166 .125 .091 .064 .043 .027 .016 .008 .003 .001 .000
1 .029 .135 .243 .325 .384 .422 .441 .444 .432 .408 .375 .334 .288 .239 .189 .141 .096 .057 .027 .007
2 .000 .007 .027 .057 .096 .141 .189 .239 .288 .334 .375 .408 .432 .444 .441 .422 .384 .325 .243 .135
3 .000 .000 .001 .003 .008 .016 .027 .043 .064 .091 .125 .166 .216 .275 .343 .422 .512 .614 .729 .857
4 0 .961 .815 .656 .522 .410 .316 .240 .179 .130 .092 .062 .041 .026 .015 .008 .004 .002 .001 .000 .000
1 .039 .171 .292 .368 .410 .422 .412 .384 .346 .300 .250 .200 .154 .112 .076 .047 .026 .011 .004 .000
2 .001 .014 .049 .098 .154 .211 .265 .311 .346 .368 .375 .368 .346 .311 .265 .211 .154 .098 .049 .014
3 .000 .000 .004 .011 .026 .047 .076 .112 .154 .200 .250 .300 .346 .384 .412 .422 .410 .368 .292 .171
4 .000 .000 .000 .001 .002 .004 .008 .015 .026 .041 .062 .092 .130 .179 .240 .316 .410 .522 .656 .815
5 0 .951 .774 .590 .444 .328 .237 .168 .116 .078 .050 .031 .019 .010 .005 .002 .001 .000 .000 .000 .000
1 .048 .204 .328 .392 .410 .396 .360 .312 .259 .206 .156 .113 .077 .049 .028 .015 .006 .002 .000 .000
2 .001 .021 .073 .138 .205 .264 .309 .336 .346 .337 .312 .276 .230 .181 .132 .088 .051 .024 .008 .001
3 .000 .001 .008 .024 .051 .088 .132 .181 .230 .276 .312 .337 .346 .336 .309 .264 .205 .138 .073 .021
4 .000 .000 .000 .002 .006 .015 .028 .049 .077 .113 .156 .206 .259 .312 .360 .396 .410 .392 .328 .204
5 .000 .000 .000 .000 .000 .001 .002 .005 .010 .019 .031 .050 .078 .116 .168 .237 .328 .444 .590 .774
6 0 .941 .735 .531 .377 .262 .178 .118 .075 .047 .028 .016 .008 .004 .002 .001 .000 .000 .000 .000 .000
1 .057 .232 .354 .399 .393 .356 .303 .244 .187 .136 .094 .061 .037 .020 .010 .004 .002 .000 .000 .000
2 .001 .031 .098 .176 .246 .297 .324 .328 .311 .278 .234 .186 .138 .095 .060 .033 .015 .006 .001 .000
3 .000 .002 .015 .042 .082 .132 .185 .236 .276 .303 .312 .303 .276 .236 .185 .132 .082 .042 .015 .002
4 .000 .000 .001 .006 .015 .033 .060 .095 .138 .186 .234 .278 .311 .328 .324 .297 .246 .176 .098 .031
5 .000 .000 .000 .000 .002 .004 .010 .020 .037 .061 .094 .136 .187 .244 .303 .356 .393 .399 .354 .232
6 .000 .000 .000 .000 .000 .000 .001 .002 .004 .008 .016 .028 .047 .075 .118 .178 .262 .377 .531 .735
7 0 .932 .698 .478 .321 .210 .133 .082 .049 .028 .015 .008 .004 .002 .001 .000 .000 .000 .000 .000 .000
1 .066 .257 .372 .396 .367 .311 .247 .185 .131 .087 .055 .032 .017 .008 .004 .001 .000 .000 .000 .000
2 .002 .041 .124 .210 .275 .311 .318 .299 .261 .214 .164 .117 .077 .047 .025 .012 .004 .001 .000 .000
3 .000 .004 .023 .062 .115 .173 .227 .268 .290 .292 .273 .239 .194 .144 .097 .058 .029 .011 .003 .000
4 .000 .000 .003 .011 .029 .058 .097 .144 .194 .239 .273 .292 .290 .268 .227 .173 .115 .062 .023 .004
5 .000 .000 .000 .001 .004 .012 .025 .047 .077 .117 .164 .214 .261 .299 .318 .311 .275 .210 .124 .041
6 .000 .000 .000 .000 .000 .001 .004 .008 .017 .032 .055 .087 .131 .185 .247 .311 .367 .396 .372 .257
7 .000 .000 .000 .000 .000 .000 .000 .001 .002 .004 .008 .015 .028 .049 .082 .133 .210 .321 .478 .698
8 0 .923 .663 .430 .272 .168 .100 .058 .032 .017 .008 .004 .002 .001 .000 .000 .000 .000 .000 .000 .000
1 .075 .279 .383 .385 .336 .267 .198 .137 .090 .055 .031 .016 .008 .003 .001 .000 .000 .000 .000 .000
2 .003 .051 .149 .238 .294 .311 .296 .259 .209 .157 .109 .070 .041 .022 .010 .004 .001 .000 .000 .000
3 .000 .005 .033 .084 .147 .208 .254 .279 .279 .257 .219 .172 .124 .081 .047 .023 .009 .003 .000 .000
4 .000 .000 .005 .018 .046 .087 .136 .188 .232 .263 .273 .263 .232 .188 .136 .087 .046 .018 .005 .000
5 .000 .000 .000 .003 .009 .023 .047 .081 .124 .172 .219 .257 .279 .279 .254 .208 .147 .084 .033 .005
6 .000 .000 .000 .000 .001 .004 .010 .022 .041 .070 .109 .157 .209 .259 .296 .311 .294 .238 .149 .051
7 .000 .000 .000 .000 .000 .000 .001 .003 .008 .016 .031 .055 .090 .137 .198 .267 .336 .385 .383 .279
8 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .004 .008 .017 .032 .058 .100 .168 .272 .430 .663
Appendix Table-1 Binomial Probabilities (continued)

p
n r .01 .05 .10 .15 .20 .25 .30 .35 .40 .45 .50 .55 .60 .65 .70 .75 .80 .85 .90 .95

9 0 .914 .630 .387 .232 .134 .075 .040 .021 .010 .005 .002 .001 .000 .000 .000 .000 .000 .000 .000 .000
1 .083 .299 .387 .368 .302 .225 .156 .100 .060 .034 .018 .008 .004 .001 .000 .000 .000 .000 .000 .000
2 .003 .063 .172 .260 .302 .300 .267 .216 .161 .111 .070 .041 .021 .010 .004 .001 .000 .000 .000 .000
3 .000 .008 .045 .107 .176 .234 .267 .272 .251 .212 .164 .116 .074 .042 .021 .009 .003 .001 .000 .000
4 .000 .001 .007 .028 .066 .117 .172 .219 .251 .260 .246 .213 .167 .118 .074 .039 .017 .005 .001 .000
5 .000 .000 .001 .005 .017 .039 .074 .118 .167 .213 .246 .260 .251 .219 .172 .117 .066 .028 .007 .001
6 .000 .000 .000 .001 .003 .009 .021 .042 .074 .116 .164 .212 .251 .272 .267 .234 .176 .107 .045 .008
7 .000 .000 .000 .000 .000 .001 .004 .010 .021 .041 .070 .111 .161 .216 .267 .300 .302 .260 .172 .063
8 .000 .000 .000 .000 .000 .000 .000 .001 .004 .008 .018 .034 .060 .100 .156 .225 .302 .368 .387 .299
9 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .005 .010 .021 .040 .075 .134 .232 .387 .630
10 0 .904 .599 .349 .197 .107 .056 .028 .014 .006 .003 .001 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .091 .315 .387 .347 .268 .188 .121 .072 .040 .021 .010 .004 .002 .000 .000 .000 .000 .000 .000 .000
2 .004 .075 .194 .276 .302 .282 .233 .176 .121 .076 .044 .023 .011 .004 .001 .000 .000 .000 .000 .000
3 .000 .010 .057 .130 .201 .250 .267 .252 .215 .166 .117 .075 .042 .021 .009 .003 .001 .000 .000 .000
4 .000 .001 .011 .040 .088 .146 .200 .238 .251 .238 .205 .160 .111 .069 .037 .016 .006 .001 .000 .000
5 .000 .000 .001 .008 .026 .058 .103 .154 .201 .234 .246 .234 .201 .154 .103 .058 .026 .008 .001 .000
6 .000 .000 .000 .001 .006 .016 .037 .069 .111 .160 .205 .238 .251 .238 .200 .146 .088 .040 .011 .001
7 .000 .000 .000 .000 .001 .003 .009 .021 .042 .075 .117 .166 .215 .252 .267 .250 .201 .130 .057 .010
8 .000 .000 .000 .000 .000 .000 .001 .004 .011 .023 .044 .076 .121 .176 .233 .282 .302 .276 .194 .075
9 .000 .000 .000 .000 .000 .000 .000 .000 .002 .004 .010 .021 .040 .072 .121 .188 .268 .347 .387 .315
10 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .003 .006 .014 .028 .056 .107 .197 .349 .599
11 0 .895 .569 .314 .167 .086 .042 .020 .009 .004 .001 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .099 .329 .384 .325 .236 .155 .093 .052 .027 .013 .005 .002 .001 .000 .000 .000 .000 .000 .000 .000
2 .005 .087 .213 .287 .295 .258 .200 .140 .089 .051 .027 .013 .005 .002 .001 .000 .000 .000 .000 .000
3 .000 .014 .071 .152 .221 .258 .257 .225 .177 .126 .081 .046 .023 .010 .004 .001 .000 .000 .000 .000
4 .000 .001 .016 .054 .111 .172 .220 .243 .236 .206 .161 .113 .070 .038 .017 .006 .002 .000 .000 .000
5 .000 .000 .002 .013 .039 .080 .132 .183 .221 .236 .226 .193 .147 .099 .057 .027 .010 .002 .000 .000
6 .000 .000 .000 .002 .010 .027 .057 .099 .147 .193 .226 .236 .221 .183 .132 .080 .039 .013 .002 .000
7 .000 .000 .000 .000 .002 .006 .017 .038 .070 .113 .161 .206 .236 .243 .220 .172 .111 .054 .016 .001
8 .000 .000 .000 .000 .000 .001 .004 .010 .023 .046 .081 .126 .177 .225 .257 .258 .221 .152 .071 .014
9 .000 .000 .000 .000 .000 .000 .001 .002 .005 .013 .027 .051 .089 .140 .200 .258 .295 .287 .213 .087
10 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .005 .013 .027 .052 .093 .155 .236 .325 .384 .329
11 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .004 .009 .020 .042 .086 .167 .314 .569
12 0 .886 .540 .282 .142 .069 .032 .014 .006 .002 .001 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .107 .341 .377 .301 .206 .127 .071 .037 .017 .008 .003 .001 .000 .000 .000 .000 .000 .000 .000 .000
2 .006 .099 .230 .292 .283 .232 .168 .109 .064 .034 .016 .007 .002 .001 .000 .000 .000 .000 .000 .000
3 .000 .017 .085 .172 .236 .258 .240 .195 .142 .092 .054 .028 .012 .005 .001 .000 .000 .000 .000 .000
4 .000 .002 .021 .068 .133 .194 .231 .237 .213 .170 .121 .076 .042 .020 .008 .002 .001 .000 .000 .000
5 .000 .000 .004 .019 .053 .103 .158 .204 .227 .223 .193 .149 .101 .059 .029 .011 .003 .001 .000 .000
6 .000 .000 .000 .004 .016 .040 .079 .128 .177 .212 .226 .212 .177 .128 .079 .040 .016 .004 .000 .000


Appendix Table-1 Binomial Probabilities (continued)

p
n r .01 .05 .10 .15 .20 .25 .30 .35 .40 .45 .50 .55 .60 .65 .70 .75 .80 .85 .90 .95

7 .000 .000 .000 .001 .003 .011 .029 .059 .101 .149 .193 .223 .227 .204 .158 .103 .053 .019 .004 .000
8 .000 .000 .000 .000 .001 .002 .008 .020 .042 .076 .121 .170 .213 .237 .231 .194 .133 .068 .021 .002
9 .000 .000 .000 .000 .000 .000 .001 .005 .012 .028 .054 .092 .142 .195 .240 .258 .236 .172 .085 .017
10 .000 .000 .000 .000 .000 .000 .000 .001 .002 .007 .016 .034 .064 .109 .168 .232 .283 .292 .230 .099
11 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .003 .008 .017 .037 .071 .127 .206 .301 .377 .341
12 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .002 .006 .014 .032 .069 .142 .282 .540
15 0 .860 .463 .206 .087 .035 .013 .005 .002 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
1 .130 .366 .343 .231 .132 .067 .031 .013 .005 .002 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000
2 .009 .135 .267 .286 .231 .156 .092 .048 .022 .009 .003 .001 .000 .000 .000 .000 .000 .000 .000 .000
3 .000 .031 .129 .218 .250 .225 .170 .111 .063 .032 .014 .005 .002 .000 .000 .000 .000 .000 .000 .000
4 .000 .005 .043 .116 .188 .225 .219 .179 .127 .078 .042 .019 .007 .002 .001 .000 .000 .000 .000 .000
5 .000 .001 .010 .045 .103 .165 .206 .212 .186 .140 .092 .051 .024 .010 .003 .001 .000 .000 .000 .000
6 .000 .000 .002 .013 .043 .092 .147 .191 .207 .191 .153 .105 .061 .030 .012 .003 .001 .000 .000 .000
7 .000 .000 .000 .003 .014 .039 .081 .132 .177 .201 .196 .165 .118 .071 .035 .013 .003 .001 .000 .000
8 .000 .000 .000 .001 .003 .013 .035 .071 .118 .165 .196 .201 .177 .132 .081 .039 .014 .003 .000 .000
9 .000 .000 .000 .000 .001 .003 .012 .030 .061 .105 .153 .191 .207 .191 .147 .092 .043 .013 .002 .000
10 .000 .000 .000 .000 .000 .001 .003 .010 .024 .051 .092 .140 .186 .212 .206 .165 .103 .045 .010 .001
11 .000 .000 .000 .000 .000 .000 .001 .002 .007 .019 .042 .078 .127 .179 .219 .225 .188 .116 .043 .005
12 .000 .000 .000 .000 .000 .000 .000 .000 .002 .005 .014 .032 .063 .111 .170 .225 .250 .218 .129 .031
13 .000 .000 .000 .000 .000 .000 .000 .000 .000 .001 .003 .009 .022 .048 .092 .156 .231 .286 .267 .135
14 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .005 .013 .031 .067 .132 .231 .343 .366
15 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .000 .002 .005 .013 .035 .087 .206 .463
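Each entry of Table-1 is the binomial probability P(r) = C(n, r)·p^r·(1 − p)^(n−r). A quick Python spot-check of a few printed values (an illustration only):

```python
import math

def binom_pmf(n: int, r: int, p: float) -> float:
    """P(X = r) for X ~ Binomial(n, p)."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

# Spot-check a few table entries (rounded to 3 decimals, as printed).
print(round(binom_pmf(2, 1, 0.50), 3))    # table gives .500
print(round(binom_pmf(5, 2, 0.30), 3))    # table gives .309
print(round(binom_pmf(15, 7, 0.50), 3))   # table gives .196
```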
Appendix Table-2 Direct Values for Determining Poisson Probabilities

For a given value of µ, the entry indicates the probability of obtaining a specified value of x.

µ
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0 0.9048 0.8187 0.7408 0.6703 0.6065 0.5488 0.4966 0.4493 0.4066 0.3679
1 0.0905 0.1637 0.2222 0.2681 0.3033 0.3293 0.3476 0.3595 0.3659 0.3679
2 0.0045 0.0164 0.0333 0.0536 0.0758 0.0988 0.1217 0.1438 0.1647 0.1839
3 0.0002 0.0011 0.0033 0.0072 0.0126 0.0198 0.0284 0.0383 0.0494 0.0613
4 0.0000 0.0001 0.0003 0.0007 0.0016 0.0030 0.0050 0.0077 0.0111 0.0153
5 0.0000 0.0000 0.0000 0.0001 0.0002 0.0004 0.0007 0.0012 0.0020 0.0031
6 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0003 0.0005
7 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001

µ
x 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
0 0.3329 0.3012 0.2725 0.2466 0.2231 0.2019 0.1827 0.1653 0.1496 0.1353
1 0.3662 0.3614 0.3543 0.3452 0.3347 0.3230 0.3106 0.2975 0.2842 0.2707
2 0.2014 0.2169 0.2303 0.2417 0.2510 0.2584 0.2640 0.2678 0.2700 0.2707
3 0.0738 0.0867 0.0998 0.1128 0.1255 0.1378 0.1496 0.1607 0.1710 0.1804
4 0.0203 0.0260 0.0324 0.0395 0.0471 0.0551 0.0636 0.0723 0.0812 0.0902
5 0.0045 0.0062 0.0084 0.0111 0.0141 0.0176 0.0216 0.0260 0.0309 0.0361
6 0.0008 0.0012 0.0018 0.0026 0.0035 0.0047 0.0061 0.0078 0.0098 0.0120
7 0.0001 0.0002 0.0003 0.0005 0.0008 0.0011 0.0015 0.0020 0.0027 0.0034
8 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002 0.0003 0.0005 0.0006 0.0009
9 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002

µ
x 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
0 0.1225 0.1108 0.1003 0.0907 0.0821 0.0743 0.0672 0.0608 0.0550 0.0498
1 0.2572 0.2438 0.2306 0.2177 0.2052 0.1931 0.1815 0.1703 0.1596 0.1494
2 0.2700 0.2681 0.2652 0.2613 0.2565 0.2510 0.2450 0.2384 0.2314 0.2240
3 0.1890 0.1966 0.2033 0.2090 0.2138 0.2176 0.2205 0.2225 0.2237 0.2240
4 0.0992 0.1082 0.1169 0.1254 0.1336 0.1414 0.1488 0.1557 0.1622 0.1680
5 0.0417 0.0476 0.0538 0.0602 0.0668 0.0735 0.0804 0.0872 0.0940 0.1008
6 0.0146 0.0174 0.0206 0.0241 0.0278 0.0319 0.0362 0.0407 0.0455 0.0504
7 0.0044 0.0055 0.0068 0.0083 0.0099 0.0118 0.0139 0.0163 0.0188 0.0216
8 0.0011 0.0015 0.0019 0.0025 0.0031 0.0038 0.0047 0.0057 0.0068 0.0081
9 0.0003 0.0004 0.0005 0.0007 0.0009 0.0011 0.0014 0.0018 0.0022 0.0027
10 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0004 0.0005 0.0006 0.0008
11 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0002 0.0002
12 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001

µ
x 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
0 0.0450 0.0408 0.0369 0.0334 0.0302 0.0273 0.0247 0.0224 0.0202 0.0183
1 0.1397 0.1304 0.1217 0.1135 0.1057 0.0984 0.0915 0.0850 0.0789 0.0733
2 0.2165 0.2087 0.2008 0.1929 0.1850 0.1771 0.1692 0.1615 0.1539 0.1465
3 0.2237 0.2226 0.2209 0.2186 0.2158 0.2125 0.2087 0.2046 0.2001 0.1954
4 0.1734 0.1781 0.1823 0.1858 0.1888 0.1912 0.1931 0.1944 0.1951 0.1954
5 0.1075 0.1140 0.1203 0.1264 0.1322 0.1377 0.1429 0.1477 0.1522 0.1563
6 0.0555 0.0608 0.0662 0.0716 0.0771 0.0826 0.0881 0.0936 0.0989 0.1042
7 0.0246 0.0278 0.0312 0.0348 0.0385 0.0425 0.0466 0.0508 0.0551 0.0595
8 0.0095 0.0111 0.0129 0.0148 0.0169 0.0191 0.0215 0.0241 0.0269 0.0298
9 0.0033 0.0040 0.0047 0.0056 0.0066 0.0076 0.0089 0.0102 0.0116 0.0132
10 0.0010 0.0013 0.0016 0.0019 0.0023 0.0028 0.0033 0.0039 0.0045 0.0053
11 0.0003 0.0004 0.0005 0.0006 0.0007 0.0009 0.0011 0.0013 0.0016 0.0019
12 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0003 0.0004 0.0005 0.0006
13 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002
14 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
Appendix Table-2 Direct Values for Determining Poisson Probabilities (continued)
µ
x 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0
0 0.0166 0.0150 0.0136 0.0123 0.0111 0.0101 0.0091 0.0082 0.0074 0.0067
1 0.0679 0.0630 0.0583 0.0540 0.0500 0.0462 0.0427 0.0395 0.0365 0.0337
2 0.1393 0.1323 0.1254 0.1188 0.1125 0.1063 0.1005 0.0948 0.0894 0.0842
3 0.1904 0.1852 0.1798 0.1743 0.1687 0.1631 0.1574 0.1517 0.1460 0.1404
4 0.1951 0.1944 0.1933 0.1917 0.1898 0.1875 0.1849 0.1820 0.1789 0.1755
5 0.1600 0.1633 0.1662 0.1687 0.1708 0.1725 0.1738 0.1747 0.1753 0.1755
6 0.1093 0.1143 0.1191 0.1237 0.1281 0.1323 0.1362 0.1398 0.1432 0.1462
7 0.0640 0.0686 0.0732 0.0778 0.0824 0.0869 0.0914 0.0959 0.1002 0.1044
8 0.0328 0.0360 0.0393 0.0428 0.0463 0.0500 0.0537 0.0575 0.0614 0.0653
9 0.0150 0.0168 0.0188 0.0209 0.0232 0.0255 0.0280 0.0307 0.0334 0.0363
10 0.0061 0.0071 0.0081 0.0092 0.0104 0.0118 0.0132 0.0147 0.0164 0.0181
11 0.0023 0.0027 0.0032 0.0037 0.0043 0.0049 0.0056 0.0064 0.0073 0.0082
12 0.0008 0.0009 0.0011 0.0014 0.0016 0.0019 0.0022 0.0026 0.0030 0.0034
13 0.0002 0.0003 0.0004 0.0005 0.0006 0.0007 0.0008 0.0009 0.0011 0.0013
14 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0003 0.0003 0.0004 0.0005
15 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002

µ
x 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6.0
0 0.0061 0.0055 0.0050 0.0045 0.0041 0.0037 0.0033 0.0030 0.0027 0.0025
1 0.0311 0.0287 0.0265 0.0244 0.0225 0.0207 0.0191 0.0176 0.0162 0.0149
2 0.0793 0.0746 0.0701 0.0659 0.0618 0.0580 0.0544 0.0509 0.0477 0.0446
3 0.1348 0.1293 0.1239 0.1185 0.1133 0.1082 0.1033 0.0985 0.0938 0.0892
4 0.1719 0.1681 0.1641 0.1600 0.1558 0.1515 0.1472 0.1428 0.1383 0.1339
5 0.1753 0.1748 0.1740 0.1728 0.1714 0.1697 0.1678 0.1656 0.1632 0.1606
6 0.1490 0.1515 0.1537 0.1555 0.1571 0.1584 0.1594 0.1601 0.1605 0.1606
7 0.1086 0.1125 0.1163 0.1200 0.1234 0.1267 0.1298 0.1326 0.1353 0.1377
8 0.0692 0.0731 0.0771 0.0810 0.0849 0.0887 0.0925 0.0962 0.0998 0.1033
9 0.0392 0.0423 0.0454 0.0486 0.0519 0.0552 0.0586 0.0620 0.0654 0.0688
10 0.0200 0.0220 0.0241 0.0262 0.0285 0.0309 0.0334 0.0359 0.0386 0.0413
11 0.0093 0.0104 0.0116 0.0129 0.0143 0.0157 0.0173 0.0190 0.0207 0.0225
12 0.0039 0.0045 0.0051 0.0058 0.0065 0.0073 0.0082 0.0092 0.0102 0.0113
13 0.0015 0.0018 0.0021 0.0024 0.0028 0.0032 0.0036 0.0041 0.0046 0.0052
14 0.0006 0.0007 0.0008 0.0009 0.0011 0.0013 0.0015 0.0017 0.0019 0.0022
15 0.0002 0.0002 0.0003 0.0003 0.0004 0.0005 0.0006 0.0007 0.0008 0.0009
16 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002 0.0003 0.0003
17 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001

µ
x 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7.0
0 0.0022 0.0020 0.0018 0.0017 0.0015 0.0014 0.0012 0.0011 0.0010 0.0009
1 0.0137 0.0126 0.0116 0.0106 0.0098 0.0090 0.0082 0.0076 0.0070 0.0064
2 0.0417 0.0390 0.0364 0.0340 0.0318 0.0296 0.0276 0.0258 0.0240 0.0223
3 0.0848 0.0806 0.0765 0.0726 0.0688 0.0652 0.0617 0.0584 0.0552 0.0521
4 0.1294 0.1249 0.1205 0.1162 0.1118 0.1076 0.1034 0.0992 0.0952 0.0912
5 0.1579 0.1549 0.1519 0.1487 0.1454 0.1420 0.1385 0.1349 0.1314 0.1277
6 0.1605 0.1601 0.1595 0.1586 0.1575 0.1562 0.1546 0.1529 0.1511 0.1490
7 0.1399 0.1418 0.1435 0.1450 0.1462 0.1472 0.1480 0.1486 0.1489 0.1490
8 0.1066 0.1099 0.1130 0.1160 0.1188 0.1215 0.1240 0.1263 0.1284 0.1304
9 0.0723 0.0757 0.0791 0.0825 0.0858 0.0891 0.0923 0.0954 0.0985 0.1014
10 0.0441 0.0469 0.0498 0.0528 0.0558 0.0588 0.0618 0.0649 0.0679 0.0710
11 0.0245 0.0265 0.0285 0.0307 0.0330 0.0353 0.0377 0.0401 0.0426 0.0452
12 0.0124 0.0137 0.0150 0.0164 0.0179 0.0194 0.0210 0.0227 0.0245 0.0264
13 0.0058 0.0065 0.0073 0.0081 0.0089 0.0098 0.0108 0.0119 0.0130 0.0142
14 0.0025 0.0029 0.0033 0.0037 0.0041 0.0046 0.0052 0.0058 0.0064 0.0071
15 0.0010 0.0012 0.0014 0.0016 0.0018 0.0020 0.0023 0.0026 0.0029 0.0033
16 0.0004 0.0005 0.0005 0.0006 0.0007 0.0008 0.0010 0.0011 0.0013 0.0014
17 0.0001 0.0002 0.0002 0.0002 0.0003 0.0003 0.0004 0.0004 0.0005 0.0006
18 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002
19 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001
Appendix Table-2 Direct Values for Determining Poisson Probabilities (continued)

µ
x 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0

0 0.0008 0.0007 0.0007 0.0006 0.0006 0.0005 0.0005 0.0004 0.0004 0.0003
1 0.0059 0.0054 0.0049 0.0045 0.0041 0.0038 0.0035 0.0032 0.0029 0.0027
2 0.0208 0.0194 0.0180 0.0167 0.0156 0.0145 0.0134 0.0125 0.0116 0.0107
3 0.0492 0.0464 0.0438 0.0413 0.0389 0.0366 0.0345 0.0324 0.0305 0.0286
4 0.0874 0.0836 0.0799 0.0764 0.0729 0.0696 0.0663 0.0632 0.0602 0.0573
5 0.1241 0.1204 0.1167 0.1130 0.1094 0.1057 0.1021 0.0986 0.0951 0.0916
6 0.1468 0.1445 0.1420 0.1394 0.1367 0.1339 0.1311 0.1282 0.1252 0.1221
7 0.1489 0.1486 0.1481 0.1474 0.1465 0.1454 0.1442 0.1428 0.1413 0.1396
8 0.1321 0.1337 0.1351 0.1363 0.1373 0.1382 0.1388 0.1392 0.1395 0.1396
9 0.1042 0.1070 0.1096 0.1121 0.1144 0.1167 0.1187 0.1207 0.1224 0.1241
10 0.0740 0.0770 0.0800 0.0829 0.0858 0.0887 0.0914 0.0941 0.0967 0.0993
11 0.0478 0.0504 0.0531 0.0558 0.0585 0.0613 0.0640 0.0667 0.0695 0.0722
12 0.0283 0.0303 0.0323 0.0344 0.0366 0.0388 0.0411 0.0434 0.0457 0.0481
13 0.0154 0.0168 0.0181 0.0196 0.0211 0.0227 0.0243 0.0260 0.0278 0.0296
14 0.0078 0.0086 0.0095 0.0104 0.0113 0.0123 0.0134 0.0145 0.0157 0.0169
15 0.0037 0.0041 0.0046 0.0051 0.0057 0.0062 0.0069 0.0075 0.0083 0.0090
16 0.0016 0.0019 0.0021 0.0024 0.0026 0.0030 0.0033 0.0037 0.0041 0.0045
17 0.0007 0.0008 0.0009 0.0010 0.0012 0.0013 0.0015 0.0017 0.0019 0.0021
18 0.0003 0.0003 0.0004 0.0004 0.0005 0.0006 0.0006 0.0007 0.0008 0.0009
19 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002 0.0003 0.0003 0.0003 0.0004
20 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002
21 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001

µ
x 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9.0
0 0.0003 0.0003 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0001 0.0001
1 0.0025 0.0023 0.0021 0.0019 0.0017 0.0016 0.0014 0.0013 0.0012 0.0011
2 0.0100 0.0092 0.0086 0.0079 0.0074 0.0068 0.0063 0.0058 0.0054 0.0050
3 0.0269 0.0252 0.0237 0.0222 0.0208 0.0195 0.0183 0.0171 0.0160 0.0150
4 0.0544 0.0517 0.0491 0.0466 0.0443 0.0420 0.0398 0.0377 0.0357 0.0337
5 0.0882 0.0849 0.0816 0.0784 0.0752 0.0722 0.0692 0.0663 0.0635 0.0607
6 0.1191 0.1160 0.1128 0.1097 0.1066 0.1034 0.1003 0.0972 0.0941 0.0911
7 0.1378 0.1358 0.1338 0.1317 0.1294 0.1271 0.1247 0.1222 0.1197 0.1171
8 0.1395 0.1392 0.1388 0.1382 0.1375 0.1366 0.1356 0.1344 0.1332 0.1318
9 0.1256 0.1269 0.1280 0.1290 0.1299 0.1306 0.1311 0.1315 0.1317 0.1318
10 0.1017 0.1040 0.1063 0.1084 0.1104 0.1123 0.1140 0.1157 0.1172 0.1186
11 0.0749 0.0776 0.0802 0.0828 0.0853 0.0878 0.0902 0.0925 0.0948 0.0970
12 0.0505 0.0530 0.0555 0.0579 0.0604 0.0629 0.0654 0.0679 0.0703 0.0728
13 0.0315 0.0334 0.0354 0.0374 0.0395 0.0416 0.0438 0.0459 0.0481 0.0504
14 0.0182 0.0196 0.0210 0.0225 0.0240 0.0256 0.0272 0.0289 0.0306 0.0324
15 0.0098 0.0107 0.0116 0.0126 0.0136 0.0147 0.0158 0.0169 0.0182 0.0194
16 0.0050 0.0055 0.0060 0.0066 0.0072 0.0079 0.0086 0.0093 0.0101 0.0109
17 0.0024 0.0026 0.0029 0.0033 0.0036 0.0040 0.0044 0.0048 0.0053 0.0058
18 0.0011 0.0012 0.0014 0.0015 0.0017 0.0019 0.0021 0.0024 0.0026 0.0029
19 0.0005 0.0005 0.0006 0.0007 0.0008 0.0009 0.0010 0.0011 0.0012 0.0014
20 0.0002 0.0002 0.0002 0.0003 0.0003 0.0004 0.0004 0.0005 0.0005 0.0006
21 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002 0.0002 0.0002 0.0003
22 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001

µ
x 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 9.9 10.0

0 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0000
1 0.0010 0.0009 0.0009 0.0008 0.0007 0.0007 0.0006 0.0005 0.0005 0.0005
2 0.0046 0.0043 0.0040 0.0037 0.0034 0.0031 0.0029 0.0027 0.0025 0.0023
3 0.0140 0.0131 0.0123 0.0115 0.0107 0.0100 0.0093 0.0087 0.0081 0.0076
4 0.0319 0.0302 0.0285 0.0269 0.0254 0.0240 0.0226 0.0213 0.0201 0.0189
Appendix Table-2 Direct Values for Determining Poisson Probabilities (continued)
5 0.0581 0.0555 0.0530 0.0506 0.0483 0.0460 0.0439 0.0418 0.0398 0.0378
6 0.0881 0.0851 0.0822 0.0793 0.0764 0.0736 0.0709 0.0682 0.0656 0.0631
7 0.1145 0.1118 0.1091 0.1064 0.1037 0.1010 0.0982 0.0955 0.0928 0.0901
8 0.1302 0.1286 0.1269 0.1251 0.1232 0.1212 0.1191 0.1170 0.1148 0.1126
9 0.1317 0.1315 0.1311 0.1306 0.1300 0.1293 0.1284 0.1274 0.1263 0.1251
10 0.1198 0.1210 0.1219 0.1228 0.1235 0.1241 0.1245 0.1249 0.1250 0.1251
11 0.0991 0.1012 0.1031 0.1049 0.1067 0.1083 0.1098 0.1112 0.1125 0.1137
12 0.0752 0.0776 0.0799 0.0822 0.0844 0.0866 0.0888 0.0908 0.0928 0.0948
13 0.0526 0.0549 0.0572 0.0594 0.0617 0.0640 0.0662 0.0685 0.0707 0.0729
14 0.0342 0.0361 0.0380 0.0399 0.0419 0.0439 0.0459 0.0479 0.0500 0.0521
15 0.0208 0.0221 0.0235 0.0250 0.0265 0.0281 0.0297 0.0313 0.0330 0.0347
16 0.0118 0.0127 0.0137 0.0147 0.0157 0.0168 0.0180 0.0192 0.0204 0.0217
17 0.0063 0.0069 0.0075 0.0081 0.0088 0.0095 0.0103 0.0111 0.0119 0.0128
18 0.0032 0.0035 0.0039 0.0042 0.0046 0.0051 0.0055 0.0060 0.0065 0.0071
19 0.0015 0.0017 0.0019 0.0021 0.0023 0.0026 0.0028 0.0031 0.0034 0.0037
20 0.0007 0.0008 0.0009 0.0010 0.0011 0.0012 0.0014 0.0015 0.0017 0.0019
21 0.0003 0.0003 0.0004 0.0004 0.0005 0.0006 0.0006 0.0007 0.0008 0.0009
22 0.0001 0.0001 0.0002 0.0002 0.0002 0.0002 0.0003 0.0003 0.0004 0.0004
23 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0002 0.0002
24 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0001

µ
x 11 12 13 14 15 16 17 18 19 20
0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
1 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
2 0.0010 0.0004 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
3 0.0037 0.0018 0.0008 0.0004 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000
4 0.0102 0.0053 0.0027 0.0013 0.0006 0.0003 0.0001 0.0001 0.0000 0.0000
5 0.0224 0.0127 0.0070 0.0037 0.0019 0.0010 0.0005 0.0002 0.0001 0.0001
6 0.0411 0.0255 0.0152 0.0087 0.0048 0.0026 0.0014 0.0007 0.0004 0.0002
7 0.0646 0.0437 0.0281 0.0174 0.0104 0.0060 0.0034 0.0018 0.0010 0.0005
8 0.0888 0.0655 0.0457 0.0304 0.0194 0.0120 0.0072 0.0042 0.0024 0.0013
9 0.1085 0.0874 0.0661 0.0473 0.0324 0.0213 0.0135 0.0083 0.0050 0.0029
10 0.1194 0.1048 0.0859 0.0663 0.0486 0.0341 0.0230 0.0150 0.0095 0.0058
11 0.1194 0.1144 0.1015 0.0844 0.0663 0.0496 0.0355 0.0245 0.0164 0.0106
12 0.1094 0.1144 0.1099 0.0984 0.0829 0.0661 0.0504 0.0368 0.0259 0.0176
13 0.0926 0.1056 0.1099 0.1060 0.0956 0.0814 0.0658 0.0509 0.0378 0.0271
14 0.0728 0.0905 0.1021 0.1060 0.1024 0.0930 0.0800 0.0655 0.0514 0.0387
15 0.0534 0.0724 0.0885 0.0989 0.1024 0.0992 0.0906 0.0786 0.0650 0.0516
16 0.0367 0.0543 0.0719 0.0866 0.0960 0.0992 0.0963 0.0884 0.0772 0.0646
17 0.0237 0.0383 0.0550 0.0713 0.0847 0.0934 0.0963 0.0936 0.0863 0.0760
18 0.0145 0.0256 0.0397 0.0554 0.0706 0.0830 0.0909 0.0936 0.0911 0.0844
19 0.0084 0.0161 0.0272 0.0409 0.0557 0.0699 0.0814 0.0887 0.0911 0.0888
20 0.0046 0.0097 0.0177 0.0286 0.0418 0.0559 0.0692 0.0798 0.0866 0.0888
21 0.0024 0.0055 0.0109 0.0191 0.0299 0.0426 0.0560 0.0684 0.0783 0.0846
22 0.0012 0.0030 0.0065 0.0121 0.0204 0.0310 0.0433 0.0560 0.0676 0.0769
23 0.0006 0.0016 0.0037 0.0074 0.0133 0.0216 0.0320 0.0438 0.0559 0.0669
24 0.0003 0.0008 0.0020 0.0043 0.0083 0.0144 0.0226 0.0328 0.0442 0.0557
25 0.0001 0.0004 0.0010 0.0024 0.0050 0.0092 0.0154 0.0237 0.0336 0.0446
26 0.0000 0.0002 0.0005 0.0013 0.0029 0.0057 0.0101 0.0164 0.0246 0.0343
27 0.0000 0.0001 0.0002 0.0007 0.0016 0.0034 0.0063 0.0109 0.0173 0.0254
28 0.0000 0.0000 0.0001 0.0003 0.0009 0.0019 0.0038 0.0070 0.0117 0.0181
29 0.0000 0.0000 0.0001 0.0002 0.0004 0.0011 0.0023 0.0044 0.0077 0.0125
30 0.0000 0.0000 0.0000 0.0001 0.0002 0.0006 0.0013 0.0026 0.0049 0.0083
31 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0007 0.0015 0.0030 0.0054
32 0.0000 0.0000 0.0000 0.0000 0.0001 0.0001 0.0004 0.0009 0.0018 0.0034
33 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0005 0.0010 0.0020
34 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0006 0.0012
35 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0007
36 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002 0.0004
37 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0002
38 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
39 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
Appendix Table-3 Areas of a Standard Normal Probability Distribution Between the
Mean and Positive Values of z.

(Figure: the shaded area under the curve between the mean and z = 1.58 is 0.4429.)
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09

0.0 .0000 .0040 .0080 .0120 .0160 .0199 .0239 .0279 .0319 .0359
0.1 .0398 .0438 .0478 .0517 .0557 .0596 .0636 .0675 .0714 .0753
0.2 .0793 .0832 .0871 .0910 .0948 .0987 .1026 .1064 .1103 .1141
0.3 .1179 .1217 .1255 .1293 .1331 .1368 .1406 .1443 .1480 .1517
0.4 .1554 .1591 .1628 .1664 .1700 .1736 .1772 .1808 .1844 .1879
0.5 .1915 .1950 .1985 .2019 .2054 .2088 .2123 .2157 .2190 .2224
0.6 .2257 .2291 .2324 .2357 .2389 .2422 .2454 .2486 .2517 .2549
0.7 .2580 .2611 .2642 .2673 .2704 .2734 .2764 .2794 .2823 .2852
0.8 .2881 .2910 .2939 .2967 .2995 .3023 .3051 .3078 .3106 .3133
0.9 .3159 .3186 .3212 .3238 .3264 .3289 .3315 .3340 .3365 .3389
1.0 .3413 .3438 .3461 .3485 .3508 .3531 .3554 .3577 .3599 .3621
1.1 .3643 .3665 .3686 .3708 .3729 .3749 .3770 .3790 .3810 .3830
1.2 .3849 .3869 .3888 .3907 .3925 .3944 .3962 .3980 .3997 .4015
1.3 .4032 .4049 .4066 .4082 .4099 .4115 .4131 .4147 .4162 .4177
1.4 .4192 .4207 .4222 .4236 .4251 .4265 .4279 .4292 .4306 .4319
1.5 .4332 .4345 .4357 .4370 .4382 .4394 .4406 .4418 .4429 .4441
1.6 .4452 .4463 .4474 .4484 .4495 .4505 .4515 .4525 .4535 .4545
1.7 .4554 .4564 .4573 .4582 .4591 .4599 .4608 .4616 .4625 .4633
1.8 .4641 .4649 .4656 .4664 .4671 .4678 .4686 .4693 .4699 .4706
1.9 .4713 .4719 .4726 .4732 .4738 .4744 .4750 .4756 .4761 .4767
2.0 .4772 .4778 .4783 .4788 .4793 .4798 .4803 .4808 .4812 .4817
2.1 .4821 .4826 .4830 .4834 .4838 .4842 .4846 .4850 .4854 .4857
2.2 .4861 .4864 .4868 .4871 .4875 .4878 .4881 .4884 .4887 .4890
2.3 .4893 .4896 .4898 .4901 .4904 .4906 .4909 .4911 .4913 .4916
2.4 .4918 .4920 .4922 .4925 .4927 .4929 .4931 .4932 .4934 .4936
2.5 .4938 .4940 .4941 .4943 .4945 .4946 .4948 .4949 .4951 .4952
2.6 .4953 .4955 .4956 .4957 .4959 .4960 .4961 .4962 .4963 .4964
2.7 .4965 .4966 .4967 .4968 .4969 .4970 .4971 .4972 .4973 .4974
2.8 .4974 .4975 .4976 .4977 .4977 .4978 .4979 .4979 .4980 .4981
2.9 .4981 .4982 .4982 .4983 .4984 .4984 .4985 .4985 .4986 .4986
3.0 .4987 .4987 .4987 .4988 .4988 .4989 .4989 .4989 .4990 .4990
3.1 .4990 .4991 .4991 .4991 .4992 .4992 .4992 .4992 .4993 .4993
3.2 .4993 .4993 .4994 .4994 .4994 .4994 .4994 .4995 .4995 .4995
3.3 .4995 .4995 .4995 .4996 .4996 .4996 .4996 .4996 .4996 .4997
3.4 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4997 .4998
3.5 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998 .4998
3.6 .4998 .4998 .4998 .4999 .4999 .4999 .4999 .4999 .4999 .4999
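The entries in this table come from the standard normal distribution function, so any cell can be checked programmatically. A minimal Python sketch (illustrative only, not part of the original course material), reproducing the worked example of z = 1.58 shown above the table:

```python
import math

def area_mean_to_z(z):
    """Area under the standard normal curve between the mean (z = 0)
    and a positive value z, i.e. Phi(z) - 0.5."""
    return 0.5 * math.erf(z / math.sqrt(2))

# Worked example above the table: z = 1.58 gives an area of 0.4429
print(round(area_mean_to_z(1.58), 4))  # 0.4429
```

The same function reproduces any row: for instance, area_mean_to_z(1.0) gives 0.3413 to four places, matching the z = 1.0 row of the table.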


Appendix Table-4 Area in the Right Tail of a Chi-Square (χ²) Distribution
Degrees Area in right tail
of
freedom .99 .975 .95 .90 .80 .20 .10 .05 .025 .01
1 0.00016 0.00098 0.00393 0.0158 0.0642 1.642 2.706 3.841 5.024 6.635
2 0.0201 0.0506 0.103 0.211 0.446 3.219 4.605 5.991 7.378 9.210
3 0.115 0.216 0.352 0.584 1.005 4.642 6.251 7.815 9.348 11.345
4 0.297 0.484 0.711 1.064 1.649 5.989 7.779 9.488 11.143 13.277
5 0.554 0.831 1.145 1.610 2.343 7.289 9.236 11.071 12.833 15.086
6 0.872 1.237 1.635 2.204 3.070 8.558 10.645 12.592 14.449 16.812
7 1.239 1.690 2.167 2.833 3.822 9.803 12.017 14.067 16.013 18.475
8 1.646 2.180 2.733 3.490 4.594 11.030 13.362 15.507 17.535 20.090
9 2.088 2.700 3.325 4.168 5.380 12.242 14.684 16.919 19.023 21.666
10 2.558 3.247 3.940 4.865 6.179 13.442 15.987 18.307 20.483 23.209
11 3.053 3.816 4.575 5.578 6.989 14.631 17.275 19.675 21.920 24.725
12 3.571 4.404 5.226 6.304 7.807 15.812 18.549 21.026 23.337 26.217
13 4.107 5.009 5.892 7.042 8.634 16.985 19.812 22.362 24.736 27.688
14 4.660 5.629 6.571 7.790 9.467 18.151 21.064 23.685 26.119 29.141
15 5.229 6.262 7.261 8.547 10.307 19.311 22.307 24.996 27.488 30.578
16 5.812 6.908 7.962 9.312 11.152 20.465 23.542 26.296 28.845 32.000
17 6.408 7.564 8.672 10.085 12.002 21.615 24.769 27.587 30.191 33.409
18 7.015 8.231 9.390 10.865 12.857 22.760 25.989 28.869 31.526 34.805
19 7.633 8.907 10.117 11.651 13.716 23.900 27.204 30.144 32.852 36.191
20 8.260 9.591 10.851 12.443 14.578 25.038 28.412 31.410 34.170 37.566
21 8.897 10.283 11.591 13.240 15.445 26.171 29.615 32.671 35.479 38.932
22 9.542 10.982 12.338 14.041 16.314 27.301 30.813 33.924 36.781 40.289
23 10.196 11.689 13.091 14.848 17.187 28.429 32.007 35.172 38.076 41.638
24 10.856 12.401 13.848 15.658 18.062 29.553 33.196 36.415 39.364 42.980
25 11.524 13.120 14.611 16.473 18.940 30.675 34.382 37.652 40.647 44.314
26 12.198 13.844 15.379 17.292 19.820 31.795 35.563 38.885 41.923 45.642
27 12.879 14.573 16.151 18.114 20.703 32.912 36.741 40.113 43.195 46.963
28 13.565 15.308 16.928 18.939 21.588 34.027 37.916 41.337 44.461 48.278
29 14.256 16.047 17.708 19.768 22.475 35.139 39.087 42.557 45.722 49.588
30 14.953 16.791 18.493 20.599 23.364 36.250 40.256 43.773 46.979 50.892
Source: From Table IV of Fisher and Yates, Statistical Tables for Biological,
Agricultural and Medical Research, Published by Longman Group Ltd
(previously published by Oliver and Boyd, Edinburg, 1963).
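For 2 degrees of freedom the chi-square right-tail probability has the closed form P(χ² > x) = e^(−x/2), so the critical value for a right-tail area α is simply x = −2 ln α and that row of the table can be verified directly. A small Python sketch (illustrative, not part of the original text):

```python
import math

def chi2_crit_df2(alpha):
    """Critical value x with right-tail area alpha for a chi-square
    variable with 2 degrees of freedom, using P(X > x) = exp(-x/2)."""
    return -2.0 * math.log(alpha)

# Compare with the df = 2 row of the table
for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(chi2_crit_df2(alpha), 3))  # 4.605, 5.991, 9.210
```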

Appendix Table-5 Table of t
(One-Tail Area)

(Figure: shaded area α in the right tail beyond tα.)

Values of tα,ν

Probability (Level of Significance)

d.f (v) 0.1 0.05 0.025 0.01 0.005

1 3.078 6.3138 12.706 31.821 63.657


2 1.886 2.9200 4.3027 6.965 9.9248
3 1.638 2.3534 3.1825 4.541 5.8409
4 1.533 2.1318 2.7764 3.747 4.6041
5 1.476 2.0150 2.5706 3.365 4.0321
6 1.440 1.9432 2.4469 3.143 3.7074
7 1.415 1.8946 2.3646 2.998 3.4995
8 1.397 1.8595 2.3060 2.896 3.3554
9 1.383 1.8331 2.2622 2.821 3.2498
10 1.372 1.8125 2.2281 2.764 3.1693
11 1.363 1.7959 2.2010 2.718 3.1058
12 1.356 1.7823 2.1788 2.681 3.0545
13 1.350 1.7709 2.1604 2.650 3.0123
14 1.345 1.7613 2.1448 2.624 2.9768
15 1.341 1.7530 2.1315 2.602 2.9467
16 1.337 1.7459 2.1199 2.583 2.9208
17 1.333 1.7396 2.1098 2.567 2.8982
18 1.330 1.7341 2.1009 2.552 2.8784
19 1.328 1.7291 2.0930 2.539 2.8609
20 1.325 1.7247 2.0860 2.528 2.8453
21 1.323 1.7207 2.0796 2.518 2.8314
22 1.321 1.7171 2.0739 2.508 2.8188
23 1.319 1.7139 2.0687 2.500 2.8073
24 1.318 1.7109 2.0639 2.492 2.7969
25 1.316 1.7081 2.0595 2.485 2.7874
26 1.315 1.7056 2.0555 2.479 2.7787
27 1.314 1.7033 2.0518 2.473 2.7707
28 1.313 1.7011 2.0484 2.467 2.7633
29 1.311 1.6991 2.0452 2.462 2.7564
30 1.310 1.6973 2.0423 2.457 2.7500
35 1.3062 1.6896 2.0301 2.438 2.7239
40 1.3031 1.6839 2.0211 2.423 2.7045
45 1.3007 1.6794 2.0141 2.412 2.6896
50 1.2987 1.6759 2.0086 2.403 2.6778
60 1.2959 1.6707 2.0003 2.390 2.6603
70 1.2938 1.6669 1.9944 2.381 2.6480
80 1.2922 1.6641 1.9901 2.374 2.6388
90 1.2910 1.6620 1.9867 2.368 2.6316
100 1.2901 1.6602 1.9840 2.364 2.6260
120 1.2887 1.6577 1.9799 2.358 2.6175
140 1.2876 1.6558 1.9771 2.353 2.6114
160 1.2869 1.6545 1.9749 2.350 2.6070
180 1.2863 1.6534 1.9733 2.347 2.6035
200 1.2858 1.6525 1.9719 2.345 2.6006
∞ 1.282 1.645 1.96 2.326 2.576
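The last row (d.f. = ∞) coincides with the standard normal quantiles, which can be checked with Python's statistics module (an illustrative check, not part of the original table):

```python
from statistics import NormalDist

# One-tail significance levels from the table header
levels = [0.1, 0.05, 0.025, 0.01, 0.005]
z = [round(NormalDist().inv_cdf(1 - a), 3) for a in levels]
print(z)  # [1.282, 1.645, 1.96, 2.326, 2.576] -- the d.f. = infinity row
```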

UNIT 18 INTERPRETATION OF STATISTICAL DATA
STRUCTURE

18.0 Objectives
18.1 Introduction
18.2 Meaning of Interpretation
18.3 Why Interpretation?
18.4 Essentials for Interpretation
18.5 Precautions in Interpretation
18.6 Concluding Remarks on Interpretation
18.7 Conclusions and Generalizations
18.8 Methods of Generalization
18.8.1 Logical Method
18.8.2 Statistical Method
18.9 Statistical Fallacies
18.10 Conclusions
18.11 Let Us Sum Up
18.12 Key Words
18.13 Answers to Self Assessment Exercises
18.14 Terminal Questions
18.15 Further Reading

18.0 OBJECTIVES
After studying this unit, you should be able to:

l define interpretation,
l explain the need for interpretation,
l state the essentials for interpretation,
l narrate the precautions to be taken before interpretation,
l describe a conclusion and generalization,
l explain the methods of generalization, and
l illustrate statistical fallacies.

18.1 INTRODUCTION
We have studied in the previous units the various methods applied in the
collection and analysis of statistical data. Statistics are not an end in themselves
but they are a means to an end, the end being to draw certain conclusions
from them. This has to be done very carefully, otherwise misleading conclusions
may be drawn and the whole purpose of doing research may get vitiated.

A researcher/statistician, besides the collection and analysis of data, has to draw
inferences and explain their significance. Through interpretation the meanings
and implications of the study become clear. Analysis is not complete without
interpretation, and interpretation can not proceed without analysis. Both are,
thus, inter-dependent. In this unit, therefore, we will discuss the interpretation of
analysed data, summarizing the interpretation and statistical fallacies.
Interpretation and Reporting

18.2 MEANING OF INTERPRETATION
The following definitions can explain the meaning of interpretation.

l “The task of drawing conclusions or inferences and of explaining their
significance after a careful analysis of selected data is known as interpretation”.
l “It is an inductive process, in which you make generalizations based on the
connections and common aspects among the categories and patterns”.
l “Scientific interpretation seeks relationship between the data of a study and
between the study findings and other scientific knowledge”.
l “Interpretation in a simple way means the translation of a statistical result into
an intelligible description”.
Thus, analysis and interpretation are central steps in the research process. The
purpose of analysis is to summarize the collected data, whereas interpretation
is the search for the broader meaning of research findings. In interpretation, the
researcher goes beyond the descriptive data to extract meaning and insights
from the data.

18.3 WHY INTERPRETATION?


A researcher/ statistician is expected not only to collect and analyse the data
but also to interpret the results of his/ her findings. Interpretation is essential for
the simple reason that the usefulness and utility of research findings lie in
proper interpretation. It is only through interpretation that the researcher can
expose relations and patterns that underlie his findings. In case of hypothesis
testing studies the researcher may arrive at generalizations. In case the
researcher had no hypothesis to start with, he would try to explain his findings
on the basis of some theory. It is only through interpretation that the researcher
can appreciate why his findings are what they are, and can make others
understand the real significance of his research findings.

Interpretation is not a mechanical process. It calls for a critical examination of
the results of one’s analysis in the light of all the limitations of data gathering.
For drawing conclusions you need a basis. Some of the common and important
bases of interpretation are: relationships, ratios, rates and percentages, averages
and other measures of comparison.

18.4 ESSENTIALS FOR INTERPRETATION


Certain points should be kept in mind before proceeding to draw conclusions
from statistics. It is essential that:

a) The data are homogeneous: It is necessary to ascertain that the data are
strictly comparable. We must be careful to compare the like with the like and
not with the unlike.
b) The data are adequate: Sometimes it happens that the data are incomplete
or insufficient and it is neither possible to analyze them scientifically nor is it
possible to draw any inference from them. Such data must be completed
first.

c) The data are suitable: Before considering the data for interpretation, the
researcher must confirm the required degree of suitability of the data.
Inappropriate data are like no data. Hence, no conclusion is possible with
unsuitable data.

d) The data are properly classified and tabulated: Every care is to be
taken, as a pre-requisite, to base all types of interpretation on systematically
classified and properly tabulated data and information.

e) The data are scientifically analyzed: Before drawing conclusions, it is
necessary to analyze the data by applying scientific methods. Wrong analysis
can play havoc with even the most carefully collected data.

If interpretation is based on uniform, accurate, adequate, suitable and
scientifically analyzed data, there is every possibility of attaining a better and
more representative result. Thus, from the above considerations we may conclude
that it is essential to have all the pre-requisites/pre-conditions of interpretation
satisfied, in order to arrive at better conclusions.

18.5 PRECAUTIONS IN INTERPRETATION


It is important to recognize that errors can be made in interpretation if proper
precautions are not taken. The interpretation of data is a very difficult task and
requires a high degree of skill, care, judgement and objectivity. In the absence
of these, there is every likelihood of data being misused to prove things that are
not true. The following precautions are required before interpreting the data.

1) The interpreter must be objective.


2) The interpreter must understand the problem in its proper perspective.
3) He / she must appreciate the relevance of various elements of the problem.
4) See that all relevant, adequate and accurate data are collected.
5) See that the data are properly classified and analyzed.
6) Find out whether the data are subject to limitations; if so, what are they?
7) Guard against possible sources of error.
8) Do not make interpretations that go beyond the information/data.
9) Factual interpretation and personal interpretation should not be confused; they
should be kept apart.
If these precautions are taken at the time of interpretation, reasonably good
conclusions can be arrived at.

18.6 CONCLUDING REMARKS ON INTERPRETATION
The task of interpretation is not an easy job. It requires skill and dexterity on
the part of the researcher. Interpretation is an art that one learns through
practice and experience. The researcher may seek the guidance of experts for
accomplishing the task of interpretation.

The element of comparison is fundamental to all research interpretations.


Comparison of one’s findings with a criterion, or with results of other
comparable investigations or with normal (ideal) conditions, or with existing
theories or with the opinions of a panel of judges / experts forms an important
aspect of interpretation.
The researcher must accomplish the task of interpretation only after considering
all relevant factors affecting the problem, so as to avoid false generalizations.
He/she should not conclude without evidence or draw hasty conclusions, and
should take all possible precautions for proper interpretation of the data.

Self Assessment Exercise A

1) Interpretation means:
....................................................................................................................
....................................................................................................................
....................................................................................................................

2) Interpretation is essential for the following reasons :


....................................................................................................................
....................................................................................................................
...................................................................................................................

3) What are the preconditions for drawing better conclusions?


....................................................................................................................
....................................................................................................................
....................................................................................................................

4) State any five precautionary steps to be taken before interpretation.


....................................................................................................................
....................................................................................................................
....................................................................................................................
5) State whether the following statements are True (T) or False (F)
i) Heterogeneous data are strictly comparable ( )
ii) Inappropriate data are like absence of data ( )
iii) The interpretation should be subjective ( )
iv) Interpretation is a mechanical process ( )
v) Interpretation can not proceed without analysis ( )

18.7 CONCLUSIONS AND GENERALIZATIONS


Results are direct observations summarized and integrated by statistical
analysis, such as a comparison of two groups of workers: group ‘A’s average
wage is Rs. 5,000 and that of group ‘B’ is Rs. 6,000. A conclusion is an
inference based on the data, here that group ‘B’ workers are better paid than
those of group ‘A’.

In everyday life, we often make generalizations. We believe that what is true
of the observed instances will be true of the unobserved instances. Since we
have had a uniform experience, we expect that we shall have it even in the
future. We are quite conscious of the fact that the observed instances do not
constitute all the members of a class concerned. But we have a tendency to
generalize. A generalization is a statement, the scope of which is wider
than the available evidence. For example: A is a crow, it is black; B is a
crow, it is black; C is a crow, it is also black. Therefore, it can be generalized
that “all crows are black”. Similarly, all swans are white, all rose plants
possess thorns, etc. The process by which such generalizations are made is
known as induction by simple enumeration.

18.8 METHODS OF GENERALIZATION


Normally, two methods are used for generalization: 1) the logical method and
2) the statistical method. Other methods of generalization also exist, but
these two are the most widely used. Let us discuss these two methods in
detail.

18.8.1 Logical Method

This method was first introduced by John Stuart Mill, who said that
generalization should be based on logical processes. Mill thought that discovering
causal connections is the fundamental task in generalization. If causal
connections hold good, generalization can be done with confidence. Five
methods of experimental enquiry have been given by Mill. These methods serve
the purpose of discovering causal connections. They are as follows.

i) The Method of Agreement: This may be positive or negative. The
method of agreement states that if two or more instances of a phenomenon
under investigation have only one circumstance in common, the circumstance is
the cause or the effect of the given phenomenon. For example, a person gets
pain in his eyes whenever he roams in the sun. Negatively, when he is under
the shade he does not have pain. Therefore, the cause for pain is roaming in
the sun.

ii) The Method of Difference: This method is a combination of both the
positive and negative methods of agreement. In this method only two instances are
and negative methods of agreement. In this method only two instances are
required. The two instances resemble each other in every other respect, but
differ in the absence or presence of the phenomenon observed. The
circumstance in which alone the two instances differ, is the effect, or the
cause. Let us take the example given by Mill. A man is shot, he is wounded
and dies. Here the wound is the only differentiating circumstance between the
man who is alive and the man who is dead. Hence, death is caused by the
wound.

iii) Joint Method of Agreement and Difference: This is a combination of
the method of agreement and the method of difference. According to this
the method of agreement and the method of difference. According to this
method, we require two sets of instances. This method can be stated like this:
If two or more instances in which the phenomenon occurs have only one
circumstance in common, while two or more instances in which it does not
occur have nothing in common, save the absence of that circumstance, the
circumstance in which alone the two sets of instances differ, is the effect or
the cause. For example :

A + B + C produce X
A + P + Q produce X
M + N + Non-A produce Non-X
G + H + Non-A produce Non-X
∴ A and X are causally connected.
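The elimination behind the joint method can be imitated mechanically: find a circumstance common to every instance where X occurs and absent from every instance where it does not. A minimal Python sketch mirroring the schematic example above (the code itself is illustrative, not part of the original text):

```python
# Circumstances present in instances where X occurs
positives = [{"A", "B", "C"}, {"A", "P", "Q"}]
# Circumstances present in instances where X does not occur
negatives = [{"M", "N"}, {"G", "H"}]

# Common to every positive instance...
common = set.intersection(*positives)
# ...and absent from every negative instance
candidates = common - set.union(*negatives)
print(candidates)  # {'A'} -- A and X are causally connected
```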
iv) The Method of Residues: This method is based on the principle of
elimination. The statement of this method is: subtract from any phenomenon
such part as is known by previous inductions to be the effect of certain
antecedents, and the residue of the phenomenon is the effect of the remaining
antecedents. For example: A loaded lorry weighs 11 tons. The dead weight of
the lorry is 1 ton. The weight of load = 11 – 1 = 10 tons.

v) The Method of Concomitant Variation: This method can be stated as:
“whatever phenomenon varies in any manner, whenever another phenomenon
varies in some particular manner, is either the cause or the effect of that
phenomenon, or is connected with it”. This method is quantitative in nature and
needs statistical techniques for measurement. That is why it is also known as
the method of quantitative induction: we base our inference on the
quantitative change in the two factors, and it is applied as some form of
correlation analysis.
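The correlation analysis mentioned here can be sketched with a small Pearson-correlation function; the data below are invented for illustration only:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient: the usual numerical measure
    of concomitant variation between two quantitative series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Hypothetical series: hours spent in the sun vs. eye-strain score
hours = [1, 2, 3, 4, 5]
strain = [2, 4, 5, 4, 5]
print(round(pearson_r(hours, strain), 3))  # 0.775 -- strong positive covariation
```

A coefficient near +1 or −1 suggests the two phenomena vary together, which is the quantitative evidence the method of concomitant variation relies on.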

18.8.2 Statistical Method

Statistical method may be defined as “the collection, presentation, analysis and
interpretation of numerical data”. Thus the statistical method involves four steps:

i) Collection of Data: The facts pertaining to the problem under study are to
be collected either by survey method or by observation method or by
experiment or from a library. (This was discussed in Unit 3).

ii) Presentation of Data: The data collected have to be processed by


classification, tabulation and then be presented in a clear manner. (This was
discussed in Units 6 and 7 of this course).

iii) Analysis of Data: The processed data should then be properly analyzed
with the help of statistical tools, such as measures of central tendency,
measures of variation, measures of skewness, correlation, time series, index
numbers, etc. (This was discussed in Units 8, 9, 10, 11 and 12 of this course).

iv) Interpretation of Data: The collected and analyzed data have to be
interpreted. This involves explanation of the facts and figures, and drawing of
inferences and conclusions.
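The four steps can be walked through end to end on a tiny data set. This is only a sketch: the sales figures and the 10% stability threshold in step 4 are assumptions made for illustration:

```python
from statistics import mean, stdev

# 1) Collection: daily sales figures (hypothetical)
sales = [120, 135, 128, 140, 132, 125, 138]

# 2) Presentation: an ordered summary of the raw figures
print(sorted(sales))

# 3) Analysis: measures of central tendency and variation
avg, sd = mean(sales), stdev(sales)
print(round(avg, 1), round(sd, 1))

# 4) Interpretation: translate the figures into a statement
if sd / avg < 0.10:  # assumed threshold for "stable"
    print("Sales are stable around the average; day-to-day variation is small.")
```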

Self Assessment Exercise B

1) Differentiate between Conclusion and Generalization.


....................................................................................................................
....................................................................................................................
....................................................................................................................

2) Whose name is associated with the Logical Method?


....................................................................................................................
....................................................................................................................

3) State the method of agreement.


....................................................................................................................
....................................................................................................................
....................................................................................................................

4) What do you mean by concomitant variation?

....................................................................................................................

....................................................................................................................
5) Fill in the blanks with appropriate word (s) :
i) Extending the conclusion from observed instances to unobserved instances
is also called ____________.
ii) Logical method is associated with the name of ______________.

iii) A conclusion is an inference based on ____________.

iv) Logical method proceeds on ____________ connections.

v) Statistical method is ___________ based.

18.9 STATISTICAL FALLACIES


Interpretation of data, as we stated earlier, is a very difficult task and requires
a high degree of care, objectivity, skill and judgement. In the absence of these,
it is likely that the data may be misused. In fact, experience shows that
the largest number of mistakes is committed, knowingly or unknowingly, while
interpreting statistical data, which may lead to misinterpretation by most
readers.

Statistical fallacies may arise at any stage: in the collection, presentation,
analysis and interpretation of data. The following are (i) specific
examples illustrating how statistics can be misinterpreted, (ii) sources of errors
leading to false generalizations, and (iii) examples of how fallacies arise in
using statistical data and statistical methods.

Bias: Bias, whether conscious or unconscious, is very common in statistical
work, and it leads to false generalizations. Wrong interpretations are often
made wantonly, merely to prove one’s point. Sometimes statistical information
is deliberately twisted to grind one’s own axe. For example, a businessman
may use statistics to prove the superiority of his firm over others by saying
that “our firm earned a profit of Rs. 1,00,000 whereas firm ‘X’ earned only
Rs. 80,000 this year”. On the face of it, it appears that firm ‘X’ has not
performed well. But a little thinking reveals that many other variables have to
be considered before drawing such a conclusion, such as the capital
employed; if the capital employed is the same, then the quality of the product,
and so on.

Unconscious bias is even more insidious. Perhaps, all statistical reports contain
some unconscious bias, since the statistical results are interpreted by human
beings after all. Each may look at things in terms of his own experience and
his attitude towards the problem under study. People suffer from several
inhibitions, prejudices, ideologies and hardened attitudes. They can not help
reflecting these in their interpretation of results. For example: A pessimist will
see the future as being dark, where as an optimist may see it as being bright.

Inconsistency in Definitions: Sometimes false conclusions are drawn
because of failure to define properly the object being studied and to hold that
definition in mind while making comparisons. When the working capital of two
firms is compared, net working capital of one must be compared only with net
working capital of the other, and not with gross working capital. Even within an
organization, to facilitate comparison over a period of time it is necessary to
keep the definition constant.

Inappropriate Comparisons: Comparisons between two things cannot be
made unless they are really alike. Unfortunately, this point is generally forgotten
and comparisons are made between two dissimilar things, thereby, leading to
fallacious conclusions. For example, the cost of living index of Bangalore is 150
(with base year 1999) and that of Hyderabad is 155 (with base 1995).
Therefore, Hyderabad is a costlier city than Bangalore. This conclusion is
misleading, as the base years of the indices are different.
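Indices on different bases become comparable only after shifting them to a common base year. A sketch of the arithmetic (the figure of 120 for Hyderabad's 1999 index on its 1995 base is purely hypothetical, assumed here for illustration):

```python
def rebase(index_value, new_base_index):
    """Express an index on a new base year, given the index of the
    new base year on the old base."""
    return index_value / new_base_index * 100

bangalore_1999_base = 150.0    # from the example
hyderabad_1995_base = 155.0    # from the example
hyd_1999_on_1995_base = 120.0  # hypothetical link figure

hyd_1999_base = rebase(hyderabad_1995_base, hyd_1999_on_1995_base)
print(round(hyd_1999_base, 1))  # 129.2 -- now comparable with Bangalore's 150
```

Under this assumed link figure, Hyderabad's cost of living would actually be lower than Bangalore's once both indices share the 1999 base, reversing the naive conclusion.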

Faulty Generalizations: Many a time, people jump to conclusions or
generalizations on the basis of either too small a sample or a sample that is not
representative of the population. For example, a foreigner comes to Delhi,
his purse is stolen by a pickpocket, and he comments that there is no safety
and security for foreigners in India. This is not true, as thousands of foreigners
come to India and remain safe and secure. Sometimes the sample size may be
adequate but not representative.

Drawing Wrong Inferences: Sometimes wrong inferences may be drawn
from the data. For example, the population of a town has doubled in 10 years.
From this it is interpreted that the birth rate in the town has doubled. Obviously,
this is a wrong inference, as the population of the town can grow in many
ways other than through a doubling of the birth rate (for example, an exodus
from villages or migration from other places).

Misuse of Statistical Tools: The various tools of analysis, such as measures
of central tendency, measures of variation, measures of correlation, ratios,
percentages, etc., are very often misused to present information in such a
manner as to convince the public or to camouflage things. In a company there
are 1,00,000 shares and 1,000 shareholders. The company claims that its
shares are well distributed, as the average shareholding is 100. But a close
scrutiny reveals that 10 persons hold 90,000 shares, whereas the other 990
persons hold 10,000 shares, an average of only about 10 each. Similarly, the
range can be misused to exaggerate disparities. For example, in a factory the
wages may range between Rs. 1,000 and Rs. 1,500 a month while the manager
gets Rs. 20,000 a month; yet it is reported that the earnings of the employees
range from Rs. 1,000 to Rs. 20,000.
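The shareholding example can be checked numerically; the mean conceals the concentration that the median exposes. A Python sketch (the individual holdings are rounded so the figures stay whole, giving 99,900 shares in all rather than exactly 1,00,000):

```python
from statistics import mean, median

# 10 large holders with 9,000 shares each; 990 small holders with 10 each
holdings = [9000] * 10 + [10] * 990

print(round(mean(holdings), 1))  # 99.9 -- close to the claimed average of 100
print(median(holdings))          # 10.0 -- what a typical shareholder owns
```

Reporting only the mean of 100 hides the fact that half the holders own 10 shares or fewer, which is exactly the kind of misuse the paragraph above warns against.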

Failure to Comprehend the Data: Very often figures are interpreted without
comprehending the total background of the data and it may lead to wrong
conclusions. For example, see the following interpretations:

– The death rate in the army is 9 per thousand, where as in the city of Delhi it is
15 per thousand. Therefore, it is safer to be in the army than in the city.
– Most of the patients who were admitted in the intensive care (IC) ward of a
hospital died. Therefore, it is unsafe to be admitted to intensive care ward in that
hospital.

18.10 CONCLUSIONS
Statistical methods and techniques are only tools. As such, they may very
often be misused. Some people believe that “figures can prove anything”;
others say “figures don’t lie, but liars can figure”. Some people regard statistics
as the worst type of lies. That is why it is said that “an ounce of truth can be
produced from tons of statistics”. Mere quantitative results, or a huge body of
data, without any definite purpose, can never help to explain anything. The
misuse of statistics may arise due to:

i) analysis without any definite purpose,
ii) carelessness or bias in the collection and interpretation of data,
iii) deliberate cooking up of data, and
iv) wrong definitions, inappropriate comparisons, inadequate data, etc.

As a principle, statistics cannot prove anything, but they can be made to prove
anything, because statistics are like clay with which one can make a God or a
Devil. The fault lies not with statistics but with the person who is using
them. The interpreter must carefully look into these points before he sets
about the task of interpretation. We may conclude with the words of Marshall,
who said: “Statistical arguments are often misleading at first, but free discussion
clears away statistical fallacies”.

Self Assessment Exercise C

1) What is meant by statistical fallacy?


....................................................................................................................
....................................................................................................................
....................................................................................................................
2) What do you mean by bias?
....................................................................................................................
....................................................................................................................
....................................................................................................................
3) Comment on “Statistics are like clay, with which one can make God or the
Devil”.
....................................................................................................................
....................................................................................................................
....................................................................................................................
4) “Figures don’t lie, but liars can figure”. Explain.
....................................................................................................................
....................................................................................................................
....................................................................................................................
5) Point out the Fallacy, if any in the following statements:
i) The per capita income in India increased from Rs.1,000 in 1998 to Rs.2,000
in 2003. So the prosperity of Indians doubled.
ii) This year the rain fall is double that of last year. Therefore, the yield will be
double that of last year.
iii) The average depth of a canal is 5 ft. The height of a person is 5 ft 6 in.
Therefore, he can safely cross the canal.
iv) The income from excise duties is increasing. Therefore, the production also
is increasing.
v) The import of edible oils in India is increasing year after year. Therefore,
the production in India is decreasing.
18.11 LET US SUM UP
A statistician, having collected and analyzed data, has to draw inferences and
explain their significance. The process of explaining the data, after analysis, is
called interpretation of data. Interpretation is necessary because it is only
through interpretation that the researcher can explain the relations and patterns
that underlie his findings. Before interpretation, one must be satisfied that the
data are homogeneous, adequate, suitable and scientifically analyzed.

While interpreting data, certain precautions have to be taken, such as
maintenance of objectivity, clear understanding of the problem, use of only
relevant data, understanding the limitations of the data, and guarding against
sources of errors.

Results/findings are the direct observations summarized by analysis. A
conclusion is an inference based on the findings. If we believe that what is true
of the observed instances will also be true of the unobserved instances, the
finding can be extended to the entire class. This process is called generalization.
Normally, there are two methods of generalization: 1. the logical method and
2. the statistical method.

The logical method proceeds on causal connections, whereas the statistical method
proceeds on the basis of data. While concluding/generalizing, the interpreter
should guard against statistical fallacies such as bias, inconsistency in definitions,
inappropriate comparisons, faulty generalizations, drawing wrong inferences,
misuse of statistical tools etc.

18.12 KEY WORDS


Analysis: Processing the data into a suitable form, checking the quality of data
and calculations, and presenting descriptive statistics using statistical tools.
Bias: An inclination to one side; prejudice.
Conclusion: An opinion or inference.
Fallacy: A misconception, misjudgement or wrong conclusion.
Finding: A decision upon a fact reached as a result of observation or
investigation.
Generalization: Abstraction; a statement extended to the entire class of
objects.
Inference: A logical conclusion/deduction arising from certain facts.
Interpretation: The task of drawing conclusions or inferences and of
explaining their significance.

18.13 ANSWERS TO SELF ASSESSMENT EXERCISES
A) i) F ii) T iii) F iv) F v) T
B) i) Generalization ii) John Stuart Mill iii) Findings
iv) Causal v) Data

C) i) It could be due to inflation.


ii) There is no proportionate relationship between rainfall and yield. In fact,
excessive rain spoils the crop.
iii) At some places the depth may be 10 ft, the average being 5 ft. Hence, it is
dangerous to cross.
iv) It can also be due to an increase in excise duty rates.
v) Not necessarily. It may be due to an increasing population or an increase in
consumption.
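The fallacies in answers (i) and (iii) can be made concrete with a short illustrative sketch. The figures below are invented for demonstration (the canal depths and the price index, in particular, are hypothetical assumptions): an average says nothing about extremes, and a rise in money income says nothing about real income until prices are accounted for.

```python
# Illustrative sketch with invented figures (not data from the text).

# Fallacy (iii): an average hides extremes.
depths_ft = [2, 3, 4, 5, 10, 6]            # assumed depths at points along the canal
mean_depth = sum(depths_ft) / len(depths_ft)
print(f"mean depth = {mean_depth:.1f} ft, maximum depth = {max(depths_ft)} ft")
# The mean (5.0 ft) is below a 5'6" person's height, but the deepest
# point (10 ft) is not: comparing height with the mean is unsafe.

# Fallacy (i): a doubled nominal income need not mean doubled prosperity.
income_1998, income_2003 = 1000, 2000      # per capita income in Rs. (from the question)
index_1998, index_2003 = 100, 190          # hypothetical price index
real_income_2003 = income_2003 * index_1998 / index_2003
print(f"2003 income at 1998 prices = Rs.{real_income_2003:.0f}")
# At 1998 prices the 2003 income is only about Rs.1053, so real
# prosperity has hardly changed despite the nominal doubling.
```

Only the per capita income figures come from the exercise; every other number is assumed purely for illustration.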

18.14 TERMINAL QUESTIONS


1) What is meant by interpretation of statistical data? What precautions should be
taken while interpreting the data?
2) What do you understand by interpretation of data? Illustrate the types of
mistakes which frequently occur in interpretation.
3) Explain the need, meaning and essentials of interpretation.
4) Discuss the methods of generalization.
5) Explain with examples logical methods of generalization.
6) What is meant by statistical method? Explain the steps involved in the statistical
method.
7) What is meant by statistical fallacy? What dangers and fallacies are associated
with the use of statistics?
8) Write short notes on :
a) Conclusion
b) Generalization
c) Method of agreement
d) Need for interpretation
9) Point out the ambiguity or mistake in the following statements:
a) The Gross profit to sales ratio of a company was 20% in 2002 and was
15% in 2003. Hence, the stock must have been undervalued.
b) The output in a factory was 3,000 tons in August 2003 and 2,800 tons in
September 2003. So the workers were more efficient in August.
c) The population of a State has doubled during the last 10 years. Hence, the
birth rate has also doubled.
d) The examination result of school X was 80% in 2003, whereas in the
same examination only 350 out of 500 students (70%) passed in school Y.
Hence, the teaching standard of school X is better than that of school Y.
e) 90% of the people who take whisky die before reaching the age of 80
years. Therefore, whisky is bad for health.

Note: These questions/exercises will help you to understand the unit better. Try
to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.

18.15 FURTHER READING
The following textbooks may be used for a more in-depth study of the topics
dealt with in this unit.

B.N. Gupta. Statistics, Sahitya Bhavan, Agra.
S.P. Gupta. Statistical Methods, Sultan Chand & Sons, New Delhi.
B.N. Agarwal. Basic Statistics, Wiley Eastern Ltd.
P. Saravanavel. Research Methodology, Kitab Mahal, Allahabad.
C.R. Kothari. Research Methodology (Methods and Techniques), New Age
International Pvt. Ltd, New Delhi.

UNIT 19 REPORT WRITING
STRUCTURE

19.0 Objectives
19.1 Introduction
19.2 Purpose of a Report
19.3 Meaning
19.4 Types of Reports
19.5 Stages in Preparation of a Report
19.6 Characteristics of a Good Report
19.7 Structure of the Research Report
19.7.1 Prefactory Items
19.7.2 The Text/Body of the Report
19.7.3 Terminal Items
19.8 Check List for the Report
19.9 Let Us Sum Up
19.10 Key Words
19.11 Answers to Self Assessment Exercises
19.12 Terminal Questions
19.13 Further Reading

19.0 OBJECTIVES
After going through this unit, you should be able to :
l define a Report,
l explain the need for reporting,
l discuss the subject matter of various types of reports,
l identify the stages in preparation of a report,
l explain the characteristics of a good report,
l explain different parts of a report, and
l distinguish between a good and bad report.

19.1 INTRODUCTION
The last and final phase of the journey in research is writing of the report.
After the collected data has been analyzed and interpreted and generalizations
have been drawn the report has to be prepared. The task of research is
incomplete till the report is presented.

Writing a report is the last step in a research study and requires a set of skills
somewhat different from those called for in the earlier stages of research.
This task should be accomplished by the researcher with utmost care.

19.2 PURPOSE OF A REPORT


The report may be meant for people in general, when the investigation has
not been carried out at the instance of any third party. Research is essentially a
cooperative venture, and it is essential that every investigator should know what
others have found about the phenomena under study. The purpose of a report is
thus the dissemination of knowledge, the broadcasting of generalizations, so as to
ensure their widest use.
A report of research has only one function: “it must inform”. It has to
propagate knowledge. Thus, the purpose of a report is to convey to the
interested persons the results and findings of the study in sufficient detail, and
so arranged as to enable each reader to comprehend the data, and to determine
for himself the validity of conclusions. Research results must invariably enter
the general store of knowledge. A research report is always an addition to
knowledge. All this explains the significance of writing a report.

In a broader sense, report writing is common to both academics and
organizations, although the purpose may differ. In academics, reports are
used for comprehensive and application-oriented learning, whereas in
organizations reports form the basis for decision making.

19.3 MEANING
Reporting simply means communicating or informing through reports. The
researcher has collected some facts and figures, analyzed the same and arrived
at certain conclusions. He has to inform or report the same to the parties
interested. Therefore “reporting is communicating the facts, data and information
through reports to the persons for whom such facts and data are collected and
compiled”.

A report is not a complete description of what has been done during the period
of the survey/research. It is only a statement of the most significant facts that
are necessary for understanding the conclusions drawn by the investigator.
Thus, “a report, by definition, is simply an account”. The report thus is an
account describing the procedure adopted, the findings arrived at and the
conclusions drawn by the investigator of a problem.

19.4 TYPES OF REPORTS


Broadly speaking, reporting can be done in two ways:

a) Oral or Verbal Report: Reporting verbally in person, for example, presenting
the findings in a conference or seminar, or reporting orally to superiors.

b) Written Report: Written reports are more formal, authentic and popular.
Written reports can be presented in different ways, as follows:

i) Sentence form reports : Communicating in sentence form
ii) Tabular reports : Communicating through figures in tables
iii) Graphic reports : Communicating through graphs and diagrams
iv) Combined reports : Communicating using all three of the above.
Generally, this is the most popular form.
Research reports vary greatly in length and type. In each individual case, both
the length and the form are largely dictated by the purpose of the study and the
problems at hand. For example, business organizations generally prefer reports
in letter form, and short in length. Banks, insurance and other financial
institutions generally prefer figures in tables. The reports prepared by
government bureaus, enquiry commissions, etc., are generally very
comprehensive on the issues involved. Similarly, research theses/dissertations
usually prepared by students for the Ph.D. degree are also elaborate and
methodical.
It is, thus, clear that the results of a research enquiry can be presented in a
number of ways. They may be termed as a technical report, a popular report,
an article, or a monograph.

1) Technical Report: A technical report is used whenever a full written report
of the study (e.g., a Ph.D. thesis) is required, either for evaluation, for record
keeping, or for public dissemination. The main emphasis in a technical report is
on:
a) the methodology employed,
b) the objectives of the study,
c) the assumptions made/hypotheses formulated in the course of the study,
d) how and from what sources the data were collected and how the data have
been analyzed, and
e) the detailed presentation of the findings, with evidence, and their limitations.

2) Popular Report: A popular report is one which gives emphasis to simplicity
and attractiveness. Its aim is to make the general public understand the findings
and implications. Simplicity is sought to be achieved through clear language and
minimization of technical details. The attention of readers is sought through an
attractive layout and liberal use of graphs, charts, diagrams and pictures. In a
popular report, emphasis is given to practical aspects and policy implications.

3) Research Article: Sometimes the findings of a research study can be
published in the form of a short paper called an article. This is one form of
dissemination. Research papers are generally prepared either to be presented in
seminars and conferences or to be published in research journals. Since one of
the objectives of doing research is to make a positive contribution to knowledge
in the field, publication of the work serves this purpose.

4) Monograph: A monograph is a treatise or a long essay on a single subject.

For the sake of convenience, reports may also be classified either on the
basis of approach or on the basis of the nature of presentation, such as:

i) Journalistic Report
ii) Business Report
iii) Project Report
iv) Dissertation
v) Enquiry Report (Commission Report), and
vi) Thesis

Reports prepared by journalists for publication in the media are journalistic
reports. These reports have news and information value. A business report
may be defined as a report for business communication from one departmental
head to another, from one functional area to another, or even from top to bottom
in the organizational structure, on any specific aspect of business activity. These
are observational reports which facilitate business decisions.

A project report is the report on a project undertaken by an individual or a
group of individuals, relating to any functional area, any segment of a
functional area, or any aspect of business, industry or society. A dissertation,
on the other hand, is a detailed discourse or report on the subject of study.
Dissertations are generally used as documents to be submitted for the
acquisition of higher research degrees from a university or an academic
institution. The thesis is an example in point.

An enquiry report or a commission of enquiry report is a detailed report
prepared by a commission appointed for the specific purpose of conducting a
detailed study of any matter of dispute or of a subject requiring greater insight.
These reports facilitate action, since they contain expert opinions.

Self Assessment Exercise A


1) What do you mean by a report?
....................................................................................................................
....................................................................................................................
....................................................................................................................
2) What is the purpose of a report?
....................................................................................................................
....................................................................................................................
....................................................................................................................
3) What is a popular report?
....................................................................................................................
....................................................................................................................
....................................................................................................................
4) What is meant by an article?
....................................................................................................................
....................................................................................................................
....................................................................................................................
5) What do you mean by verbal reporting?
....................................................................................................................
....................................................................................................................
....................................................................................................................

19.5 STAGES IN PREPARATION OF A REPORT


Research reports are the product of slow, painstaking and accurate work.
Therefore, the preparation of a report may be viewed in the following major
stages.

1) The logical understanding and analysis of the subject matter.
2) Planning/designing the final outline of the report.
3) Write up/preparation of rough draft.
4) Polishing/finalization of the Report.

Logical Understanding of the Subject Matter: This is the first stage, which is
primarily concerned with the development of a subject. There are two ways to
develop a subject, viz., (a) logically and (b) chronologically. Logical
development is done on the basis of mental connections and associations
between one aspect and another by means of logical analysis. Logical treatment
often consists of developing material from the simple to the most complex.

Chronological development is based on a connection or sequence in time or the
happening of events. The directions for doing something usually follow the
chronological order.

Designing the Final Outline of the Report: This is the second stage in writing
the report. Having understood the subject matter, the next step is structuring
the report, ordering its parts and sketching them. This stage can also be
called the planning and organization stage. Many ideas may pass through the
author’s mind; unless he first makes his plan/sketch/design, he will be unable to
achieve a harmonious succession and will not even know where to begin and
how to end. Better communication of research results is partly a matter of
language but mostly a matter of planning and organizing the report.

Preparation of the Rough Draft: The third stage is the write up/drafting of
the report. This is the most crucial stage to the researcher, as he/she now sits
to write down what he/she has done in his/her research study and what and
how he/she wants to communicate the same. Here the clarity in
communicating/reporting is influenced by some factors such as who the readers
are, how technical the problem is, the researcher’s hold over the facts and
techniques, the researcher’s command over language (his communication skills),
the data and completeness of his notes and documentation and the availability
of analyzed results. Depending on the above factors, some authors may be able
to write the report in one or two drafts. Those who have less command over
the language, or less clarity about the problem and subject matter, may take
more time for drafting the report and have to prepare more drafts (a first
draft, a second draft, a third draft, and so on).

Finalization of the Report: This is the last stage, and perhaps the most difficult
stage of all formal writing. It is easy to build the structure, but it takes more
time to polish it and give it the finishing touches. Take, for example, the
construction of a house: up to the roofing (structure) stage the work is very
quick, but finishing the building takes much more time.

The rough draft (whether it is the second draft or the nth draft) has to be
rewritten and polished in terms of the requirements. The careful revision of the rough
draft makes the difference between a mediocre and a good piece of writing.
While polishing and finalizing, one should check the report for weaknesses in the
logical development of the subject and in presentation cohesion. He/she should
also check the mechanics of writing: language, usage, grammar, spelling and
punctuation.

19.6 CHARACTERISTICS OF A GOOD REPORT


Research report is a channel of communicating the research findings to the
readers of the report. A good report is one which does this task efficiently and
effectively. As such it should have the following characteristics/qualities.

i) It must be clear in informing the what, why, who, whom, when, where and how
of the research study.
ii) It should be neither too short nor too long. One should keep in mind the fact
that it should be long enough to cover the subject matter but short enough to
sustain the reader’s interest.
iii) It should be written in an objective style and simple language; correctness,
precision and clarity should be the watchwords of the scholar. Wordiness,
indirection and pompous language are barriers to communication.
iv) A good report must combine clear thinking, logical organization and sound
interpretation.
v) It should not be dull. It should be such as to sustain the reader’s interest.
vi) It must be accurate. Accuracy is one of the requirements of a report. It should
be factual with objective presentation. Exaggerations and superlatives should
be avoided.
vii) Clarity is another requirement of presentation. It is achieved by using familiar
words and unambiguous statements, explicitly defining new concepts and
unusual terms.
viii) Coherence is an essential part of clarity. There should be a logical flow of
ideas (i.e., continuity of thought) and a proper sequence of sentences. Each
sentence must be linked with the others so as to move the thoughts smoothly.
ix) Readability is an important requirement of good communication. Even a
technical report should be easily understandable. Technicalities should be
translated into language understandable by the readers.
x) A research report should be prepared according to the best composition
practices. Ensure readability through proper paragraphing, short sentences,
illustrations, examples, section headings, use of charts, graphs and diagrams.
xi) Draw sound inferences/conclusions from the statistical tables. But don’t repeat
the tables in text (verbal) form.
xii) Footnote references should be in proper form. The bibliography should be
reasonably complete and in proper form.
xiii) The report must be attractive in appearance, neat and clean whether typed or
printed.
xiv) The report should be free from mistakes of all types, viz., language mistakes,
factual mistakes, spelling mistakes, calculation mistakes, etc.

The researcher should try to achieve these qualities in his report as far as possible.

Self Assessment Exercise B


1) List the stages involved in the preparation of a report.
....................................................................................................................
....................................................................................................................
....................................................................................................................
2) What are the ways of developing a subject?
....................................................................................................................
....................................................................................................................
....................................................................................................................
3) What is meant by outlining the report?
....................................................................................................................
....................................................................................................................
....................................................................................................................
4) Enumerate the characteristics of a good report.
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................
5) What is meant by coherence?
....................................................................................................................
....................................................................................................................
....................................................................................................................
....................................................................................................................

19.7 STRUCTURE OF THE RESEARCH REPORT


Under this head, the format/outline/sketch of a comprehensive technical report
or research report is discussed below. A technical report has a number of
clearly defined sections. The headings of the sections and their order may differ
from one situation to another. The contents of a report can broadly be divided
into three parts:

1) The front matter or prefactory items.
2) The body or text of the report.
3) The back matter or terminal items.
The following chart summarizes the broad sequence of the contents of a
research report.

STRUCTURE OF A RESEARCH REPORT

Prefactory Items:
1. Blank sheet
2. Title page
3. Approval sheet (if any)
4. Researcher’s declaration
5. Dedication (if any)
6. Preface and/or acknowledgements
7. Table of contents
8. List of tables
9. List of graphs/charts/figures
10. List of cases, if any
11. Abstract or highlights (optional)

Text or Body:
Chapter 1: Introduction
Chapters 2 to n: Presentation and Description of Evidence
Chapter n+1: Summary and/or Conclusions and Recommendations

Terminal Items:
1. Appendix, if any
2. Glossary, if any
3. Bibliography
4. Index
5. Blank sheet

Let us discuss these items one by one in detail.

19.7.1 Prefactory Items
The various preliminaries to be included in the front pages of the report are
briefly narrated hereunder:

1) Title Page: The first page of the report is the title page. The title page should
carry a concise and adequately descriptive title of the research study, the name
of the author, the name of the institution to which it is submitted, and the date
of presentation.

2) Approval Sheet: If a certificate of approval is required either from the
research supervisor or from the institution which provided the research facilities,
it must be given.

3) Researcher’s Declaration: Generally the researcher has to declare/certify
that it is his/her bona fide and original work.

4) Dedication: If the author wants to dedicate the work to whomsoever he/she
likes, he/she may do so.

5) Preface or Acknowledgements: A preface includes the background and
reasons for the study. This is also an appropriate place to make
acknowledgements. But if the researcher has opted to discuss the significance
and reasons of the study elsewhere in the report, he/she may not write a
‘preface’ and may use the page only for acknowledgements. In the
acknowledgements the researcher acknowledges the assistance and support
received from individuals and organizations in conducting the research. It is
intended to express his/her gratitude to them.

6) Table of Contents: A table of contents gives an outline of the contents of the
report. It contains a list of the chapters and their titles with page numbers, and
facilitates easy location of topics in the report. The chapter headings may be
typed in capital letters.

7) List of Tables: The researcher must have collected a lot of data, analyzed
the same and presented it in the form of tables. These tables may be listed
chapter-wise, and the list presented with page numbers for easy location and
reference.

8) List of Graphs/Charts/Figures: If there are many graphs and charts, they
should also be listed separately, with page numbers, after the list of tables.

9) List of Cases/Exhibits: If there are many cases/exhibits, they should also be
listed.

10) Abstract: An abstract is a synopsis. It should be as brief as possible and run
to about one or two pages. It is placed in the prefactory part of the report so
that a reader can get a quick overview of the report. It contains a brief and
precise statement of the purpose and a bare summary of the findings or results
of the study.

19.7.2 The Text/Body of the Report

After the preliminary items, the body of the report is presented. It is the major
and main part of the report, consisting of the text and context chapters of the
study. Normally the body may be divided into three parts.
i) The introduction
ii) The description and discussion of evidence and findings
iii) The summary, conclusions and recommendations

i) Introduction

Generally this is the first chapter in the body of the report. It is devoted to
introducing the theoretical background of the problem and the methodology
adopted for attacking the problem.

It may consist of the following aspects:


– Significance and justification of the topic.
– Theoretical background of the topic.
– Statement of the problem.
– Review of literature.
– Objectives of the study.
– Hypotheses to be tested.
– Definition of special terms, concepts and units of study.
– Scope of the study: geographical scope, i.e., area/places to be covered; content
scope, i.e., aspects to be included/excluded.
– Period of study i.e., reference period.
– Sources of data i.e., primary or secondary or both.
– Methods of data collection i.e., sample or census.
– Sampling design.
– Data collection instruments.
– Field work.
– Data processing and analysis plan.
– Limitations of the study, if any.
– An overview of the report, i.e., chapter plan.

ii) Description and Discussion of Evidence

This is the major and main part of the report. It is divided into several chapters
depending upon the number of objectives of the study, each being devoted to
presenting the results pertaining to some aspect. The chapters should be well
balanced, mutually related and arranged in logical sequence. The results should
be reported as accurately and completely as possible, explaining their bearing
on the research questions and hypotheses.

Each chapter should be given an appropriate heading. Depending upon the need,
a chapter may also be divided into sections. The entire verbal presentation
should run in an independent stream and must be written according to best
composition rules. Each chapter should end with a summary and lead into the
next chapter with a smooth transition sentence.

While dealing with the subject matter of the text, the following aspects should be
taken care of. They are:
1) Headings
2) Quotations
3) Footnotes
4) Exhibits

1) Headings. The following types of headings are commonly used.


– CENTRE HEAD (all capitals, without underlining)
– Centre Subhead (capitals and lower case, with underlining)
– SIDE HEAD (all capitals, without underlining)
– Side Subhead (capitals and lower case, with underlining)
– Paragraph Head followed by a colon (capitals and lower case, underlined)

Which combination of headings to use depends on the number of classifications
or divisions that a chapter has. The headings are illustrated below:

Centre Head. A Centre head is typed in all capital letters. If the title is long,
the inverted pyramid style (i.e., the second line shorter than the first, the third
line shorter than the second) is used. All caps headings are not underlined.
Underlining is unnecessary because capital letters are enough to attract the
reader’s attention.

Example

CHALKING OUT A PROGRAMME FOR
IMPORT SUBSTITUTION AND
EXPORT PROMOTION

Centre Subhead. The first letter of the first and the last word and all nouns,
adjectives, verbs and adverbs in the title are capitalized. Articles, prepositions
and conjunctions are not capitalized.

Example

Chalking out a Programme for
Import Substitution and
Export Promotion

Side Heads. Words in the side head are either written in all capitals or
capitalized as in the centre sub head and underlined.
Example: Import Substitution and Export Promotion
Paragraph Head. Words in a paragraph head are capitalized as in the centre
sub head and underlined. At the end, a colon appears, and then the paragraph
starts.
Example: Import Substitution and Export Promotion: The Seventh Five-Year
Plan of India has attempted ……

2) Quotations

Quotation Marks: Double quotation marks (“ ”) are used. A quotation within
a quotation is put in single quotation marks (‘ ’). Example: He said, “To the
selfish, ‘freedom’ is synonymous with license”.

When to Use Quotation Marks: Quotation marks are used for
1) a directly quoted passage or word,
2) a word or phrase to be emphasized, and
3) titles of articles, chapters, sections of a book, reports, and unpublished works.
How to Quote: a) All quotations should correspond exactly to the original in
wording, spelling, and punctuation.

b) Quotations up to three typewritten lines are run into the text.
c) Direct quotations over three typewritten lines are set in indented paragraphs.
d) Quotation marks are not used for indented paragraphs.

Five ways of introducing a Quotation: These are given below.


a) Introduction: He/she said, “The primary test of success in a negotiation is the
presence of goodwill on both sides”.
b) Interpolation: “The primary test of success in a negotiation”, he/she said, “is
the presence of goodwill on both sides”.
c) End Reference: “ The primary test of success in a negotiation is the presence
of goodwill on both sides”, he/she said.
d) Indented Paragraph: He/she said: For the workers no real advance in their standard of living is possible without steady increase in productivity because any increase in wages generally, beyond certain narrow limits, would otherwise be nullified by a rise in prices.
e) Running into a Sentence: He/she recommended that “joint management
councils be set up in all establishments in the public as well as private sector in
which conditions favourable to the success of the scheme exist”.

3) Footnotes

Types of Footnotes: A footnote either indicates the source of the reference or provides an explanation which is not important enough to include in the text.

In the traditional system, both kinds of footnotes are treated in the same form
and are included either at the bottom of the page or at the end of the chapter
or book.

In the modern system, explanatory footnotes are put at the bottom of the page
and are linked with the text with a footnote number. But source references are
incorporated within the text and are supplemented by a bibliographical note at
the end of the chapter or book.

Rationale of Footnotes: Footnotes help the readers to check the accuracy of the interpretation of the source by going to the source if they want to. They are also an acknowledgement of the author’s indebtedness to the sources. They lend authority to the work and help the readers to distinguish between the author’s own contribution and that of others.

Where to put the Footnote: Footnotes appear at the bottom of the page or
at the end of the chapter (before the appendices section).

Numbering of Footnotes: a) For any editorial comment on the chapter or title, an asterisk is used.

b) In the text Arabic numerals are used for footnoting. Each new chapter begins
with number 1.
c) The number is typed half a space above the line or within parentheses. No
space is given between the number and the word. No punctuation mark is used
after the number.
d) The number is placed at the end of a sentence or, if necessary to clarify the meaning, at the end of the relevant word or phrase. Commonly, the number appears after the last quotation mark. In an indented paragraph, the number appears at the end of the last sentence in the quotation.

4) Exhibits

Tables:

Reference and Interpretation: Before a table is introduced, it is referred to in the text (e.g., see Table 1.1; refer to Table 1.1; as in Table 1.1; Table 1.1 indicates). A table is meant only to expand, clarify, or give visual explanation rather than stand by itself. The text should highlight the table’s focus and conclusions.

Identification: a) Each table is given a number, title, and, if needed, a subtitle. All identifications are centred.

b) Arabic numerals, instead of Roman numerals or capital letters, are recommended for numbering the tables. Usually technical monographs and books contain many tables. As the number increases, Roman numerals become unfamiliar to the reader. Roman numerals also occupy more space than Arabic numerals. If there are more than 26 tables, capital letters will not be sufficient to identify them.

Tables can be numbered consecutively throughout the chapter as 1.1, 1.2, 1.3, … wherein the first number refers to the chapter and the second number to the table.

c) For the title and subtitle, all capital letters are used.
d) Abbreviations and symbols are not used in the title or subtitle.

Checklist: Relevance, accuracy, and clarity are of utmost importance in tables. When entering the table, check the following:

1) Have the explanation and reference to the table been given in the text?
2) Is it essential to have the table for clarity and extra information?
3) Is the representation of the data comprehensive and understandable?
4) Is the table number correct?
5) Are the title and subtitle clear and concise?
6) Are the column headings clearly classified?
7) Are the row captions clearly classified?
8) Are the data accurately entered and represented?
9) Are the totals and other computations correct?
10) Has the source been given?
11) Have all the uncommon abbreviations been spelt out?
12) Have all footnote entries been made?
13) If column rules are used, have all rules been properly drawn?

Illustrations: Illustrations cover charts, graphs, diagrams, and maps. Most of the instructions given for tables hold good for illustrations.

Identification: Illustrations are identified as FIGURE, CHART, MAP or DIAGRAM. The identification marks (i.e. number, title, and, if any, subtitle) are put at the bottom, because an illustration, unlike a table, is studied from bottom upwards.

19.7.3 Terminal Items

This section follows the text. First comes the appendices section, then the bibliography and glossary. Each section is separated by a divider page on which only the words APPENDICES, BIBLIOGRAPHY, or GLOSSARY, all in capital letters, appear.

All reference section pages are numbered in Arabic numerals in continuation with the page numbers of the text.

1) Appendices

What goes into an Appendix: a) Supplementary or secondary references are put in the appendices section. But all primary reference material of immediate importance to the reader is incorporated in the text. The appendices help the author to authenticate the thesis and help the reader to check the data.

b) The material that is usually put in the appendices is indicated below:

1) Original data
2) Long tables
3) Long quotations
4) Supportive legal decisions, laws and documents
5) Illustrative material
6) Extensive computations
7) Questionnaires and letters
8) Schedules or forms used in collecting data
9) Case studies / histories
10) Transcripts of interviews

Numbering of Appendices: The appendices can be serialized with capital letters (Appendix A, Appendix B) to differentiate them from the chapter or table numbers.

References to Appendices: a) In the text, the reader’s attention is drawn to the appendices as in the case of tables.
b) All appendices are listed in the table of contents.

2) Bibliographies

Positioning of the Bibliography: The bibliography comes after the appendices section and is separated from it by a division sheet written BIBLIOGRAPHY. It is listed as a major section in all capital letters in the table of contents.

A bibliography contains the source of every reference cited in the footnotes and any other relevant works that the author has consulted. It gives the reader an idea of the literature available on the subject that has influenced or aided the author.
Bibliographical Information: The following information must be given for each bibliographical reference.

Books:
1) Author(s)
2) Title (underlined)
3) Place of publication
4) Publisher
5) Date of publication

Magazines and Newspapers:
1) Author(s)
2) Title of the article (within quotation marks)
3) Title of the magazine (underlined)
4) Volume number (Roman numerals)
5) Serial number (Arabic numerals)
6) Date of issue
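As a hypothetical illustration (not prescribed by the unit), the five fields listed for a book can be assembled into one bibliography line; the punctuation between fields and the sample data below are assumptions, since the unit lists the fields but not their separators.

```python
# Illustrative sketch: build one bibliography entry for a book from the
# five fields listed above. The field order follows the text; the
# separating punctuation is an assumption.
def book_entry(authors, title, place, publisher, year):
    # In a typed report the title would be underlined; plain text here.
    return f"{authors}, {title}, {place}: {publisher}, {year}."

# The sample data below is made up for illustration only.
print(book_entry("A. Author", "Sample Research Methods",
                 "New Delhi", "Sample House", "2005"))
# A. Author, Sample Research Methods, New Delhi: Sample House, 2005.
```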

3) Glossary

What is a Glossary: A glossary is a short dictionary giving definitions and examples of terms and phrases which are technical, used in a special connotation by the author, unfamiliar to the reader, or foreign to the language in which the book is written. It is listed as a major section in capital letters in the table of contents.

Positioning of a Glossary: The glossary appears after the bibliography. It may also appear in the introductory pages of a book, after the lists of tables and illustrations.
Order of Listing: Items are listed in alphabetical order.
Example:
Centre Heading is listed under C and not under H.
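The alphabetical filing rule above (a term files under its first word, so "Centre Heading" goes under C) can be sketched as a simple sort; the term list below is drawn from this unit's own headings purely for illustration.

```python
# Illustrative sketch: glossary terms are filed alphabetically by the
# term as a whole, so "Centre Heading" files under C, not under H.
terms = ["Side Heads", "Centre Heading", "Paragraph Head", "Appendix"]
glossary_order = sorted(terms, key=str.lower)
print(glossary_order)
# ['Appendix', 'Centre Heading', 'Paragraph Head', 'Side Heads']
```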

4) Index

An index may be either a subject index or an author index. An author index consists of the important names of persons discussed in the report, arranged in alphabetical order. A subject index includes a detailed reference to all important matters discussed in the report, such as places, events, definitions, concepts, etc., presented in alphabetical order. An index is not generally included in graduate/post-graduate students’ research reports. However, if the report is prepared for publication or intended as a work of reference, an index is desirable.

Self Assessment Exercise C

Fill in the blanks with appropriate word(s).


1) A report has only one function to perform. It must _____________.
2) Broadly speaking reporting is of two types a) ______________ reporting
b) ____________ reporting.
3) A treatise on a single subject is called a _______________.
4) The contents of a report can broadly be classified into __________ parts.
5) An abstract is a ________________.
6) An Index may be either ___________ index or _________ index.
7) A ________________ contains the sources of references cited and other
relevant works consulted.
8) The list of special terms and phrases used is given in the form of a __________.
19.8 CHECK LIST FOR THE REPORT

When the final drafting of the report is completed and the author is satisfied with the draft, format, and all other relevant aspects, it is always better, before the report goes for final typing or printing, to check various points and satisfy yourself that everything is in order. Here we provide a list of questions for which a positive answer is expected.

Checklist of Questions

i) Does the ‘title’ of the report accurately describe the content?
ii) Is the scope of the study limited?
iii) Is the research problem properly defined or specified?
iv) Are the objectives of the study conceived well? Have they been achieved?
v) Are hypotheses made explicit?
vi) Has the plan of research been presented in detail?
vii) Were appropriate methods and techniques chosen to test the hypotheses?
viii) Has all the pertinent data been collected?
ix) Have the data been classified logically and analyzed intelligently?
x) Is the presentation of arguments clear and logical?
xi) Has an objective and open-minded attitude been maintained throughout the study?
xii) Have the limitations of data, methods, results been spelt out?
xiii) Are the previous works on this problem reviewed in the report?
xiv) Is the chapterization logical? Were the rules of composition properly followed?
xv) Are the forms of presentation — textual, tabular, graphic, properly used?
xvi) Does the summary really summarize?
xvii) Are the quotations and other references relevant?
xviii) Is the bibliography complete and correct?
xix) Are you able to convey what you mean?
xx) Can the report be improved further? If not, it is at its best.
Finally, it should be remembered that report writing is an art which is learnt by practice and experience, rather than by mere instruction. The researcher, therefore, should go through some of the research reports submitted/published in his/her field and familiarize himself/herself with the basics of report writing.

Typing Instructions: For typing a report, the following points should be kept in mind.
Paper: Quarto (A4) size white, thick, unruled paper is used.
Typing: Typing is done on only one side of the paper, in double space.
Margins: Left side 1.5 inches, right side 0.5 inch, top and bottom 1.0 inch. But on the first page of every major division, for example at the beginning of a chapter, give 3 inches of space at the top.
19.9 LET US SUM UP
The final stage of research investigation is reporting. The research results, findings, and conclusions drawn have to be communicated. This can be done in two ways, i.e., orally or in writing. Written reports are more popular and authentic, even though oral reporting also has its place. Based on requirement, reports can be of two types, viz., technical reports and popular reports.

Report writing has to pass through a number of stages, such as understanding the subject matter and its logical analysis, preparation of the final outline/sketch, preparation of the rough drafts, and polishing and finalization. A report should have certain qualities, such as accuracy, coherence, clarity, conciseness, and readability. It must be prepared according to the best composition rules.

The total structure of a report can be divided into three main parts:

a) the preliminary part,
b) the text or body part, and
c) the terminal part.

The preliminary part consists of the title page, certification, preface, acknowledgements, table of contents, and lists of tables, charts, figures, etc. The text part is the main body of the report, which consists of the various chapters of the subject matter. The terminal part consists of the appendices, bibliography, glossary, and index. Having prepared the report, it must be thoroughly checked to ensure that everything is satisfactory. Only then should it be given for final typing.

19.10 KEY WORDS

Abstract : An abstract is a short summary of the report.
Article : A short paper prepared for publication in a journal or for presentation at a seminar/conference.
Bibliography : The list of all published and unpublished references used in the report, arranged in alphabetical order.
Dissertation : A formal and lengthy discourse.
Footnote : An explanatory note or material source, given at the bottom of the page.
Glossary : A list of special terms and phrases with their meanings.
Layout : Sketch, design, structure.
Monograph : A treatise on a single subject.
Report : A report is an account of the research study.
Reporting : Reporting means communicating through a report.
Thesis : A formal and lengthy research paper presented as part of the requirements for a degree.

19.11 ANSWERS TO SELF ASSESSMENT EXERCISES
Self Assessment Exercise C

1) Inform 2) Oral, Written 3) Monograph
4) Three 5) Synopsis 6) Subject, author
7) Bibliography 8) Glossary

19.12 TERMINAL QUESTIONS


A) Short Questions

1) Define research report and explain its purpose.


2) Distinguish between oral reporting and written reporting.
3) Differentiate between a technical report and a popular report.
4) Distinguish between an article and a monograph.
5) What is a bibliography? What is its purpose?
6) Why are quotations used in a research report?
7) Distinguish between bibliography and footnotes.
8) What are the items that can be included in Appendix?
9) What is an abstract?
10) What is meant by glossary?
B) Essay Type Questions
1) What is reporting? What are the different stages in the preparation of a
report?
2) What is a report? What are the characteristics/qualities of a good report?
3) Briefly describe the structure of a report.
4) What are the various aspects that have to be checked before going to final
typing?
5) What are the points to be kept in mind in revising the draft report?
6) Give a brief note on the prefatory items.
7) What are the various items that will find a place in the text / body of the
report?
8) Describe briefly how a research report should be presented.
9) Describe the considerations and steps involved in planning a report writing
work.
10) Write short notes on:
a) Characteristics of a good report.
b) Research article
c) Sources of data
d) Chapter plan

Note: These questions/exercises will help you to understand the unit better.
Try to write answers for them. But do not submit your answers to the
university for assessment. These are for your practice only.
19.13 FURTHER READING
The following textbooks may be used for more in-depth study of the topics dealt with in this unit.
1) V.P. Michael, Research Methodology in Management, Himalaya Publishing
House, Bombay.
2) O.R. Krishna Swamy, Methodology of Research in Social Sciences,
Himalaya Publishing House, Mumbai.
3) C.R. Kothari, Research Methodology, Wiley Eastern, New Delhi.
4) Berenson, Conrad and Raymond Cotton, Research and Report Writing for
Business and Economics, Random House, New York.
