
M.B.A.

IV Semester-MPDBA 401

RESEARCH
METHODOLOGY AND
BUSINESS ANALYTICS

CENTRE FOR DISTANCE LEARNING


GITAM UNIVERSITY
(Estd. u/s 3 of the UGC Act, 1956)
Visakhapatnam
Editor
Prof. M. Vivekananda Murthy
Retired Professor, Andhra University, Visakhapatnam

Lesson writers
1. Dr. G. Arti
Asst. Professor, GITAM Institute of Management,
GITAM University, Visakhapatnam

2. Dr. D. Vijaya Geeta


Associate Professor, GITAM Institute of Management,
GITAM University, Visakhapatnam

SYLLABUS

OBJECTIVES OF THE COURSE

The objective of this course is to familiarize the student with the concepts and the techniques of Social /
Business Research Methodology
Block-I : Introduction to Research and Research Process
Unit-I : Research, Research Process, Ethics in Research
Unit-II : Steps in Research Process
Unit-III : Research design

Block-II : Attitude Measurement and Scales, Data Collection, Sampling, Data Measurement and
Presentation
Unit-IV : Measurement and Scaling.
Unit-V : Methods of Data Collection and Different Types of Sampling Techniques.
Unit-VI : Data Preparation and Representation

Block-III : Data Analysis


Unit-VII : Univariate Analysis-I. Measures of Central Tendency
Unit-VIII : Univariate Analysis-II Measures of Dispersion, Skewness, and Kurtosis.
Unit-IX : Bivariate analysis - Correlation, Regression.
Unit-X : Multivariate analysis: Multiple correlation, Multiple regression

Block-IV : Statistical Inference


Unit-XI : Introduction to Statistical Inference
Unit-XII : Parametric Tests-I (Large Sample Tests)
Unit-XIII : Parametric Tests-II (Small sample tests)
Unit-XIV : Non-Parametric Tests

Block-V : Time Series, Report writing and Business Analytics


Unit-XV : Time Series Analysis
Unit-XVI : Use of Statistical Packages: A Brief introduction, usage and explanation about Statistical
packages like EXCEL, SPSS, SAS, R, etc.
Unit-XVII : Report writing and Presentation.
Unit-XVIII : Introduction to Business Analytics. Concept of Business Analytics, Applications.

Book Recommended:
1. Kothari, C.R., Research Methodology – Methods and Techniques, Wishwa Prakashan, New Delhi
2. Krishna Swami, O.R., Methodology of Research in Social Sciences, Himalaya Publishing House,
Mumbai
3. Aditham Bhujanga Rao, Research Methodology for Management and Social Sciences, Excel books,
New Delhi 2008
4. Naresh K Malhotra and Satyabhushan Das, Marketing Research, An applied orientation, Pearson,
New Delhi.
5. Mark Saunders, Philip Lewis and Adrian Thornhill., Research Methods for Business Studies, Pearson,
2012
6. Gupta S C and Kapoor VK, Fundamentals of Applied Statistics, Sultan Chand & Sons, Delhi.
7. G C Beri, Marketing Research, Tata McGraw-Hill Publishing Company Limited, New Delhi.

MODEL PAPER
MPDBA 401-RESEARCH METHODOLOGY & BUSINESS ANALYTICS
SECTION-A
Answer any FIVE of the following: (5x4=20)
Each answer should not exceed one page.
a) Explain the meaning and importance of research
b) Explain the concept of Hypothesis
c) Distinguish between primary and secondary data
d) What do you understand by pre-testing of a questionnaire?
e) Distinguish between Census and Sampling methods of data collection
f) What do you understand by editing of data?
g) Explain the advantages and disadvantages of diagrammatic presentation of data
h) What do you understand by discriminant analysis?

SECTION-B
Answer any FIVE of the following: (5x10=50)
2. a) Briefly describe the various types of research with their merits and limitations.
OR
b) Test the dependence between Type of Hair and eye color of the individuals using the following
information:

Eye color
Type of Hair Blue Green Brown Black Total
Blonde 20 15 18 14 67
Red 11 4 24 2 41
Brown 9 11 36 18 74
Black 8 17 20 4 49
Total 48 47 98 38 231
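For readers who later wish to verify their hand computation of question 2(b), a minimal sketch in Python is given below (assuming the SciPy package, of the kind introduced in Unit-XVI, is available):

    # Chi-square test of independence for the hair-colour / eye-colour table above.
    from scipy.stats import chi2_contingency

    observed = [            # rows: Blonde, Red, Brown, Black hair
        [20, 15, 18, 14],   # columns: Blue, Green, Brown, Black eyes
        [11, 4, 24, 2],
        [9, 11, 36, 18],
        [8, 17, 20, 4],
    ]
    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi-square = {chi2:.2f}, df = {dof}, p-value = {p_value:.4f}")
    # A p-value below 0.05 would lead us to reject independence of hair and eye colour.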

3. a) Briefly describe the various types of research designs


OR
b) Describe the characteristics of a good questionnaire

4. a) Explain the various random sampling techniques
OR
b) Can the following two samples be regarded as coming from the same normal population?

Sample    Size    Sample mean    Sum of squares of deviations from mean
1         10      12             120
2         12      15             314
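A sketch of how question 4(b) could be checked from the summary statistics in Python (an illustration only, assuming SciPy; the examination expects the pooled t-test worked by hand):

    # Pooled two-sample t-test computed from summary statistics using SciPy.
    import math
    from scipy.stats import ttest_ind_from_stats

    n1, mean1, ss1 = 10, 12, 120   # sample 1: size, mean, sum of squared deviations
    n2, mean2, ss2 = 12, 15, 314   # sample 2

    s1 = math.sqrt(ss1 / (n1 - 1))  # sample standard deviations (ddof = 1)
    s2 = math.sqrt(ss2 / (n2 - 1))

    t_stat, p_value = ttest_ind_from_stats(mean1, s1, n1, mean2, s2, n2, equal_var=True)
    print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
    # A large p-value suggests both samples may come from the same normal population.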

5. a) Describe the various methods of analysis of data.


OR
b) What is meant by Business Analytics? Define it and explain its concepts.

6. a) Write the precautions one should take while preparing the research report.
OR
b) The data on prices (Rs. per kg) of a certain commodity during 2011 to 2015 are shown
below:

Quarter Years
2011 2012 2013 2014 2015
I 45 48 49 52 60
II 54 56 63 65 70
III 72 63 70 75 84
IV 60 56 65 72 66

Compute the seasonal index by the method of simple averages, the ratio-to-moving-average method
and the ratio-to-trend method.
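For the simple average method, a minimal computational sketch in Python follows (the data are those of the question; the ratio-to-moving-average and ratio-to-trend methods follow the longer procedures discussed in Unit-XV):

    # Seasonal indices by the method of simple averages.
    prices = {                       # quarter -> prices for 2011..2015
        "I":   [45, 48, 49, 52, 60],
        "II":  [54, 56, 63, 65, 70],
        "III": [72, 63, 70, 75, 84],
        "IV":  [60, 56, 65, 72, 66],
    }
    quarter_means = {q: sum(v) / len(v) for q, v in prices.items()}
    grand_mean = sum(quarter_means.values()) / len(quarter_means)
    for q, m in quarter_means.items():
        print(f"Quarter {q}: seasonal index = {100 * m / grand_mean:.1f}")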

CONTENTS

Unit - 1 Research, Research Process, Ethics in Research

Unit - 2 Steps in Research Process

Unit - 3 Research Design

Unit - 4 Measurement and Scaling

Unit - 5 Methods of Data Collection and Different Types of Sampling Techniques

Unit - 6 Data Preparation and Representation

Unit - 7 Univariate Analysis - I: Measures of Central Tendency

Unit - 8 Univariate Analysis - II: Measures of Dispersion, Skewness and Kurtosis

Unit - 9 Bivariate Analysis - Correlation and Regression

Unit - 10 Multivariate Analysis: Multiple Correlation, Multiple Regression

Unit - 11 Introduction to Statistical Inference

Unit - 12 Parametric Tests - I (Large Sample Tests)

Unit - 13 Parametric Tests - II (Small Sample Tests)

Unit - 14 Non-Parametric Tests

Unit - 15 Time Series Analysis

Unit - 16 Use of Statistical Packages: A Brief Introduction, Usage and Explanation of Statistical Packages like Excel, SPSS, SAS, R

Unit - 17 Report Writing and Presentation

Unit - 18 Introduction to Business Analytics: Concept of Business Analytics, Applications

BLOCK -I
INTRODUCTION TO RESEARCH METHODOLOGY

Block I consists of three units.


Unit-I deals with the definition, objectives, importance and significance of research, types of research, the research process and its different stages, and ethics in research.
Unit-II briefly explains all the seven steps involved in the research process.
Unit-III elaborately explains research design, including different experimental designs. Criteria for good research are also presented.
After studying these three units, readers will have a brief idea about research and the different steps in the research process, along with some basic and general ideas about each of those steps.

UNIT - 1
RESEARCH, RESEARCH PROCESS, ETHICS IN RESEARCH

OBJECTIVES:
After going through this Unit, you will be able to
• Explain the definitions, objectives, significance and types of Research.
• Explain the basic steps in the research process.
• Explain the ethical considerations one must keep in mind while doing research.

STRUCTURE:
1.1 Introduction to research
1.2 Definition
1.3 Objectives of research
1.4 Importance of research
1.5 Significance of research
1.6 Types/methods of research
1.7 Research process
1.8 Ethics in research
1.9 Summary.
1.10 Review Questions
1.11 Further Readings

1.1. INTRODUCTION TO RESEARCH:


Re - means again or over again or new. Search - means examine closely and carefully or to test and try.
Research is the Systematic Approach towards purposeful investigation through formulation of Hypothesis,
Collection of data on relevant variables, Analysis and Interpretation of Results and reaching conclusion
either in the form of a solution or generalization. Research is not a collection of techniques which already
exist but provides a structure for decision making. Whenever questions about ourselves, our institutions, our environment, our universe etc. arise, we seek to answer them. Whenever we encounter problems, we try to find solutions for them. Seeking answers to questions and finding solutions to problems in a systematic way can be called research.

1.2. DEFINITION:
Different researchers have defined the term "research" in different ways. However, the broad concept and meaning are the same.
D. Slesinger and M. Stephenson in the encyclopedia of social sciences defined research as “the manipulation
of things, concepts or symbols for the purpose of generalizing to extend, correct or verify knowledge,
whether that knowledge aids in construction of theory or in the practice of an art”.
According to Clifford Woody, research comprises defining and redefining problems, formulating hypotheses or suggested solutions; collecting, organizing and evaluating data; making deductions and reaching conclusions; and at last carefully testing the conclusions to determine whether they fit the formulated hypothesis.
Research is defined as a search for knowledge. It means having an in-depth knowledge of, or an insight into, a known or unknown fact, theory or problem.
Research is defined as a search for knowledge through an objective and systematic method of finding a solution to a problem.
Research is defined as a movement from the known to the unknown. It is an effort to discover something. It can be an effort to know "more and more about less and less".
Research is a scientific inquiry aimed at learning new facts, testing ideas, etc. It is the systematic collection,
analysis and interpretation of data to generate new knowledge and answer a certain question or solve a
problem.
Research can be defined as an academic activity with a set of objectives to explain or analyze or understand
a problem or finding solution(s) for the problem(s) by adopting a systematic approach in collecting,
organizing and analyzing the information relating to the problem.
The major aim of research is the discovery of new facts, verification and testing of old facts, analysis of interrelationships among variables, analysis of causal relationships among attributes, and the development of new tools, concepts and theories.
Methodology is the method or technique which is used for analyzing the data. Methodology is the analysis
of principles, rules & postulates employed & applied with discipline.
Methodology is a systematic, theoretical analysis of the methods applied to a field of study. It comprises
the theoretical analysis of the body of methods and principles associated with a branch of knowledge.
In research methodology, one talks of research methods as well as the logic behind the methods one uses in the context of one's research study, and explains why one is using a particular method or technique and why one is not using other methods or techniques, so that research results are capable of being evaluated either by the researcher himself or by others. It is a way to systematically solve the research problem.
Examples of research: Most companies and other organizations want to find out what customers think about their products and what they want. Using marketing research, they can manage the risks associated with existing products as well as with offering new products and services.
The survey is a direct way of collecting quantitative or numerical information and qualitative or descriptive
information. When there are errors in the survey design, research problems can surface. For example, a
company might use a method that is designed to collect a random sample from the target consumer
population, but the method is not really random. Therefore, the organization cannot generalize its survey
results to represent the target population.

1.3. OBJECTIVES OF RESEARCH:


• To extend knowledge of human beings, social life and environment.
• To bring light to the hidden information that might never be discovered fully during the ordinary
course of life.
• To establish generalizations of existing knowledge and to formulate general laws and theories.
• To describe accurately the characteristics of a particular individual or situation or group.
• To determine the frequency with which something occurs or with which it is associated with
something else.
• To gain familiarity with phenomena or to achieve new insights into it.
• For individual development and the advancement of a particular field.
• To test hypotheses of causal relationships between variables (attributes).
• To verify and test existing facts and theories.

1.4. IMPORTANCE OF RESEARCH:


It is a known fact that knowledge is a tool to solve the problems of individuals, institutions and society at large. The main and inherent aim of research is gaining knowledge; hence research is important and crucial in solving the problems mentioned above. It is also important in assessing one's community and programme needs, preparing the most effective outreach messages, etc. In a nutshell, the importance of research is that 1. it adds to the existing knowledge, 2. it can lead to scientific invention, 3. it gives intellectual satisfaction, 4. it acts as a tool for social transformation and 5. it sharpens one's mind.

1.5. SIGNIFICANCE OF RESEARCH:


Generally, the researcher tries to convince the audience that the research is worth doing. It should establish
why the audience should want to read on. It could also persuade someone of why he or she would want to
support, or fund, a research project. One way to do this is by describing how the results may be used.
Why is this work important? What are the implications of doing it? How does it link to other knowledge? How does it stand to inform policy making? Why is it important to one's understanding of the world? What new perspective will it bring to the topic? What use might the final research paper have for others in
this field or in the general public? Who might share the findings of the research, once the project is
completed?
Answering all these gives the significance of the research.

1.6. TYPES / METHODS OF RESEARCH:
There is no watertight demarcation in the classification of research. For the benefit and convenience of the reader, the following brief classification is presented; however, it is not exhaustive.
Types of Research:

    Pure/Fundamental
    Applied
    Experimental/Empirical
    Non-Experimental/Conceptual
        Exploratory
        Descriptive
            Historical (Static)
            Dynamic
                Cross-sectional (One-time)
                Longitudinal
        Causal

1.6.1. Pure/ fundamental research:


Pure research is known as fundamental or basic research. It is done with curiosity or inquisitiveness. It is
primarily intended to find out certain basic principles. It is concerned with generalization, with the formulation or discovery of new theory or the refinement of existing theory, and with natural phenomena or topics relating to pure mathematics.
Eg: Research concerning human behavior carried out with a view to make generalization about human
behavior.
1.6.2. Applied research:
Applied research aims at finding solutions for immediate problems faced by a society or an industry or
business organization. It discovers solutions to some practical problems. It is an application of scientific
method which helps to contradict or modify existing theory or theories. It helps to formulate the policy. It
suggests remedial measures to alleviate social problems. Eg: Applied research might involve a decision
about whether a firm’s new safety-training program should be conducted via online seminars using online
quizzes, or whether participants should be brought to corporate headquarters to be classroom trained.
1.6.3. Experimental/ Empirical research:
Empirical research relies on experience or observation alone, often without due regard for system and
theory. It is a data based research which comes up with conclusions that are verified by observation or
experiment. It assesses the effects of a particular phenomenon by keeping the other variables constant or
controlled. It is an experiment to investigate the relation between two or more variables by making changes
in independent variable and observing the effects of change on the dependent variable. Evidences from
empirical studies are considered to be the most powerful support possible for testing a given hypothesis.
Eg: A researcher selects 50 students from a group of students who require a course in management. For the 25 students kept in group A, the regular student programme is conducted; for the remaining 25 students, kept in group B, a special student programme is conducted. At the end, a test is given to each group to judge the effectiveness of the training programme.
1.6.4. Non-Experimental/ Conceptual Research:
Conceptual research is that related to some abstract idea(s) or theory. Research in which independent
variable is not manipulated is called non-experimental research. This can be broadly classified into three categories, namely Exploratory/Formulative, Descriptive and Causal.
Eg: A researcher wants to study whether intelligence affects reading ability for a group of students. He randomly selects the students, tests their intelligence and reading ability, and calculates the coefficient of correlation between the two sets of scores.
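A minimal sketch of this example in Python, with hypothetical scores (the values below are illustrative, not from the text; SciPy is assumed to be available):

    # Coefficient of correlation between intelligence and reading-ability scores.
    from scipy.stats import pearsonr

    iq_scores = [95, 102, 110, 88, 120, 105, 98, 115]    # hypothetical data
    reading_scores = [60, 68, 75, 55, 82, 70, 63, 78]    # hypothetical data

    r, p_value = pearsonr(iq_scores, reading_scores)
    print(f"correlation coefficient r = {r:.3f} (p-value = {p_value:.4f})")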
1.6.4.1. Exploratory Research:
Exploratory research is conducted to clarify ambiguous situations or discover potential business
opportunities. It is not intended to provide conclusive evidence from which to determine a particular
course of action. Exploratory research is based on development of hypothesis. It is a study of unfamiliar
problem about which the researcher has little or no knowledge. According to Daniel Katz, it attempts to see what is there rather than to predict the relationships that will be found. It gives insight into the problem for the purpose of formulating a hypothesis for more precise investigation.
Eg: A doctor's initial investigation of a patient suffering from an unfamiliar disease, to clarify and define the nature of the problem.
1.6.4.2. Descriptive research:
Descriptive research describes the characteristics of objects, people, groups, organizations or environments
or state of affairs as it exists at present. This research is concerned with conditions, practices, structures,
differences or relationship that exists, opinions held, processes that are going on or trends that are evident.
This research involves with fact finding enquiries of different kinds and has no control over the variables.
They can only report who, what, when, where and how regarding the current economic and employment
situation. Descriptive research can be further classified into two broad categories namely (i) Historical
(Static) and (ii) Dynamic.
Eg: Studies concerned with age, sex, educational level, occupation or income. Studies concerned with
narration of facts and characteristics concerning individual. Problems of individual relations in India with
inter-disciplinary approach.
1.6.4.2.1. Historical/ Static research:
It is a study of past records and other information sources with a view to reconstructing the origin &
development of institution or a system & discovering the trends in the past, including the philosophy of
persons or groups at any remote point of time. Main sources are books, documents, newspapers, magazines,
statistical material etc. It involves a single measurement of the phenomena in question.
Eg: Averages, Dispersion, Skewness etc.
1.6.4.2.2. Dynamic research:
Dynamic research goes beyond the single measurement of variable & examines the relationship among
variables. Dynamic research is broadly classified into two namely (i) Cross Sectional or One-time research
and (ii) Longitudinal.
Eg: Correlation, Regression etc.

1.6.4.2.2.1: Cross Sectional research:
The research is confined to a single time period and does not account for changes over time. It collects data on various entities such as households, dealers, retail stores etc. from a sample at one point of time in order to examine the relationships between variables.
Eg: To examine the relationship between job satisfaction and style of leadership, or the similarity of automobile preferences between husband and wife.
1.6.4.2.2.2 Longitudinal research:
The research is carried on over several time periods. A sample or set of samples of population is measured
repeatedly on the same variable over several time periods to find relationships among variables and
examine the changes that take place during the time.
Eg: Consumer behavior over a period of time.
1.6.4.3. Causal Research:
This research aims to identify the cause and effect relationship between two or more variables. Causality is a conditional phenomenon between variables of the form "if X, then Y". The following are the conditions for causality: 1. Covariation, 2. Time order of occurrence of variables, 3. Systematic elimination of other causal variables and 4. Experimental design.

1.7. RESEARCH PROCESS:


The research process is the step-by-step procedure which focuses on being objective and gathering a multitude of information for the analysis of one's research or research paper. It requires continuous, extensive re-evaluation and revision of both one's topic and the way it is presented. One has to revise the plan, add or delete material, and sometimes narrow down the topic or even change it completely, depending on what is discovered during the research or when sufficient information is not available. The research process consists of a series of actions or steps necessary to carry out research effectively.
The process mainly involves identifying, locating, assessing, analyzing, and then developing and expressing
one’s ideas. This process is used in all research and evaluation projects, regardless of the research method.
The process focuses on testing hunches or ideas and recreating and documenting in such a way that
another individual can review/conduct the same study again.
The steps are interlinked with one another. If changes are made in one step of the process, the researcher
must review all the other steps to ensure that the changes are reflected throughout the process. Chart 1.1
illustrates the steps that are generally followed in the research process.

Steps in Research Process:
Defining Research Problem
        ↓
Review of Literature
        ↓
Formulation of Hypothesis
        ↓
Research Design
        ↓
Data Collection/Field Work
        ↓
Processing and Analysis of Data
        ↓
Interpretation & Report

Chart 1.1
These steps are briefly explained further in Unit-II.

1.8. ETHICS IN RESEARCH:


Ethics came from a Greek word “ethos” which means custom, habit. It may also be defined as a method,
procedure, or perspective for deciding how to act and for analyzing complex problems and issues.
Another way of defining ‘ethics’ focuses on the disciplines that study standards of conduct, such as
Medicine, Philosophy, Theology, Law, Psychology, or Sociology. It is the norms which distinguish between
acceptable and unacceptable behavior. Ethics in research promotes moral values, such as social
responsibility, human rights and compliance with the law. Ethics is required at each and every step of
research starting from problem definition to report writing. Ethical Considerations can be specified as one
of the most important parts of the research. When most people think of ethics (or morals), they think of
rules for distinguishing between right and wrong. For example, a ”medical ethicist” is someone who
studies ethical standards in medicine. For instance, in considering a complex issue like global warming,
one may take an economic, ecological, political, or ethical perspective on the problem. While an economist
might examine the cost and benefits of various policies related to global warming, an environmental
ethicist could examine the ethical values and principles at stake. Many different disciplines, institutions,
and professions have standards for behavior that suit their particular aims and goals. These standards also
help members of the discipline to coordinate their actions or activities and to establish the public’s trust of
the discipline. For instance, ethical standards govern the conduct of research in medicine, law, engineering, and
business. Ethical norms also serve the aims or goals of research and apply to people who conduct scientific
research or other scholarly or creative activities. There is even a specialized discipline, research ethics,
which studies these norms.
There are several reasons why it is important to adhere to ethical norms in research. First, norms promote
the aims of research, such as knowledge, truth, and avoidance of error. For example, prohibitions against
fabricating, falsifying, or misrepresenting research data promote the truth and minimize error. Second,
since research often involves a great deal of cooperation and coordination among many different people
in different disciplines and institutions, ethical standards promote the values that are essential to
collaborative work, such as trust, accountability, mutual respect, and fairness. For example, many ethical
norms in research, such as guidelines for authorship, copyright and patenting policies, data sharing policies,
and confidentiality rules in peer review, are designed to protect intellectual property interests while
encouraging collaboration. Most researchers want to receive credit for their contributions and do not want
to have their ideas stolen or disclosed prematurely. Third, many of the ethical norms help to ensure that
researchers can be held accountable to the public. For instance, federal policies on research misconduct,
conflicts of interest, the human subjects protections, and animal care and use are necessary in order to
make sure that researchers who are funded by public money can be held accountable to the public. Fourth,
ethical norms in research also help to build public support for research. People are more likely to fund a
research project if they can trust the quality and integrity of research. Finally, many of the norms of
research promote a variety of other important moral and social values, such as social responsibility,
human rights, animal welfare, compliance with the law and public health and safety. Ethical lapses in
research can significantly harm human and animal subjects, students, and the public. For example, a
researcher who fabricates data in a clinical trial may harm or even kill patients, and a researcher who fails
to abide by regulations and guidelines relating to radiation or biological safety may jeopardize his health
and safety or the health and safety of staff and students.
According to Bryman and Bell (2007) the following ten principles of ethical considerations have been
compiled as a result of analyzing the ethical guidelines of nine professional social sciences research
associations:
1. Research participants should not be subjected to harm in any ways whatsoever.
2. Respect for the dignity of research participants should be prioritized
3. Full consent should be obtained from the participants prior to the study.
4. The protection of the privacy of research participants has to be ensured.
5. Adequate level of confidentiality of the research data should be ensured.
6. Anonymity of individuals and organizations participating in the research has to be ensured.
7. Any deception or exaggeration about the aims and objectives of the research must be avoided.
8. Affiliations in any forms, sources of funding, as well as any possible conflicts of interests have to
be declared.
9. Any type of communication in relation to the research should be done with honesty and transparency.
10. Any type of misleading information, as well as representation of primary data findings in a biased
way must be avoided.
One should be ethical while defining the problem: the researcher should not be influenced by his personal motives or the client's motives. The researcher should be ethical in taking
up the suitable research design for the problem. The researcher should choose an appropriate sample size,
sampling technique while collecting the data. The collection process should be acceptable ethically and it
should be valid before using it in research. Any data which is collected by wrong means without correct
process is considered to be unethical. Many ethical issues may come across during data preparation and
analysis out of which some are editing, coding, transcribing and cleaning of data. Sometimes the researcher
tries to edit the data after data analysis is done in order to get desired results. Researcher faces ethical
issues during disclosure of results honestly and expressing the limitations of the research explicitly. The
researcher should maintain integrity of research work and represent the facts in the report accurately.
Plagiarism is also considered to be as unethical. To avoid this, the researcher should mention the references
and give credit to them. Hence one must show truthfulness, impartiality, integrity, prudence and respect for others' work; these are considered the standards of conduct for ethical research. The dissemination of the research
results to the client and other stakeholders should be appropriate, should be honest, accurate and complete.
Activity 1:
Research can be classified into:
Activity 2:
You are the Product Manager for a particular tea brand B, a nationally distributed brand. For the last six consecutive months, brand B has shown a declining trend in sales. You ask the research department to do a study to determine why sales are declining.
Is this exploratory, descriptive or experimental research? Explain your reasons.

SUMMARY:
This Unit aimed to explain the definitions, objectives, importance and significance of research. The different types/methods of research in the literature were briefly presented, as were the different stages of the research process. Ethics in research was also presented.
Review Questions:
1. Define Research.
2. Explain different objectives of the research.
3. Briefly discuss the importance and significance of Research.
4. Explain the terms i) Pure Research, ii) Applied Research, iii) Experimental Research.
5. Briefly discuss the Conceptual research
6. What is meant by ethics in research? What are the principles one must keep in mind while analyzing
the ethical guidelines?
7. Why does one have to adhere to ethical norms in research? Explain.

Further readings:
1. Kothari, C.R., Research Methodology – Methods and Techniques, Wishwa Prakashan, New Delhi
2. Krishnaswami, O.R., Methodology of Research in Social Sciences, Himalaya Publishing House,
Mumbai
3. Mark Saunders, Philip Lewis and Adrian Thornhill., Research Methods for Business Studies, Pearson,
2012
4. H.K. Dangi, Shruti Dewen, Business Research Methods, Cengage Learning India Pvt Ltd, New
Delhi 2016.
5. Naval Bajpai , Business Research Methods, Dorling Kindersley (India) Pvt. Ltd. New Delhi.

UNIT - 2
STEPS IN RESEARCH PROCESS

OBJECTIVES:
After going through this Unit, you should be able to:
• Explain the different steps in the research process.
• Explain the research problem and its related issues.
• Explain the importance of review of literature in research and the concept of hypothesis.
• Explain basic information about the remaining steps in the research process, namely Research Design, Data Collection/Field Work, Processing and Analysis of Data, and Interpretation and Report Writing.

STRUCTURE:
2.1 Introduction
2.2 Introduction to research problem
2.3 Review of literature
2.4 Formulation of hypothesis
2.5 Preparing research design
2.6 Collection of data
2.7 Processing and analysis of data
2.8 Interpretation and report writing
2.9 Summary
2.10 Review Questions
2.11 Further Readings

2.1 INTRODUCTION:
As discussed in Unit-1, there are seven broad steps in the research process, and those steps are now discussed in brief in this Unit-II.
The steps that were identified are 1. Defining the Research Problem, 2. Review of Literature, 3. Formulation of Hypothesis, 4. Research Design, 5. Data Collection and Field Work, 6. Processing and Analysis of Data and 7. Interpretation and Report Writing.

2.2. INTRODUCTION TO RESEARCH PROBLEM:
The first and foremost step in the research process is selecting and defining the research problem. The term problem means a question or issue to be examined. The selection of a problem for research is the most important but also the most difficult task. The problem selected for research may be initially vague and hence it has to be well defined. It must be free from confusion and should be unambiguous and clear. Hence, identify the research problem in such a way that it can be explained, studied or resolved and defined precisely. The accuracy and conciseness of the problem definition help in utilizing the available resources effectively and efficiently. Defining the problem involves stating the general problem and identifying the specific components of the research problem. Before anyone starts research, he or she needs to have some idea of what to do. This is most difficult but most important too: without being clear about what research is to be done, it is difficult to plan how to do it. The problem has to be defined around the key issues and must be able to provide information for taking decisions. For most researchers, identifying exactly why, what and how they are researching is a big process. It is not enough to be interested in a subject and want to write about it; there must be a particular reason why they are writing, what perspective they are taking, what aspects the work should cover, and finally what conclusions they will be drawing. The problem has to be discussed with persons who have experience in dealing with several research problems. Once the problem has been defined precisely and accurately, the research can be designed and conducted properly.
Formulating and clarifying the research problem is the starting point of one's research. Only when the research problem has been defined clearly can the research be designed and conducted properly. A research problem must be amenable to research. All the effort, time and money spent from this point will be wasted if the problem is misunderstood or ill defined.
Example: The problem that the research agency has identified is childhood obesity, which is a local
problem and concern within the community. This serves as the focus of the study.
2.2.1. Process of defining the research problem:
The identification or development of a research problem requires critical study of books and articles relevant to the subject, academic and daily experience, exposure to field situations and practical problems, discussions and interviews with decision makers, experts, researchers, administrators etc., intensified and meaningful discussions within a group of knowledgeable persons (the Delphi technique) and continuing research. Sometimes new ideas or approaches may strike the mind like a flash; this may also be a good source for identifying a research problem. Analysis of available secondary data is an essential step in problem definition. Sometimes qualitative research, an unstructured, exploratory research methodology based on small samples intended to provide insight and understanding of the problem, must be undertaken to gain an understanding of the research problem and its underlying factors. Other exploratory research techniques like pilot surveys and case studies will help the researcher to get insight into the research problem.
A research problem requires a researcher to find out the best solution for the given problem, that is, to find out by which course of action the objective can be attained optimally in the context of a given environment. While selecting the problem for research, one must be cautious: problems that are too narrow or too vague should be avoided, and a controversial subject should not become the choice of an average researcher. Research which is overdone should not normally be chosen. The importance of the subject, the availability of required qualified and trained researchers, the costs involved, the time factor, and the familiarity and feasibility of the study are the most important points in identifying the research problem. In a majority of cases a preliminary study will help in identifying the research problem.

2.2.2. Technique involved in defining a Problem:
The general rule to be followed while defining the research problem is that the definition should allow the
researcher to obtain all the information needed to address the necessity of the management and to guide
the researcher in proceeding with the project. It should not be too broad or too narrow. It should be
unambiguous.
While defining the research problem, one has to observe the following points also:
1. Technical terms in words or phrases with special meanings used in the problem should be clearly
defined. This will help in understanding the research problem as well as comparing this with another
similar problem.
2. Basic assumptions relating to research problem should be clearly stated. The criteria for the selection
of the problem should be provided.
3. The time of the survey and the sources of data available, along with their time periods, must also be considered while defining the research problem.
4. The scope and limits of the investigation must also be mentioned in research problem.
In general, the research problem must first be stated in a general way and be free from ambiguities. A process of thinking and rethinking then helps in formulating the problem in a more specific way, so that it may be realistic in terms of the available data and resources and also analytically meaningful. It must be capable of paving the way for the development of working hypotheses and for the means of solving the problem itself.

2.3. REVIEW OF LITERATURE:


Two major reasons exist for reviewing the literature, the first being the preliminary search that helps one
to generate and refine the research idea or problem and the second is the critical literature review of the
research problem identified.
Once the problem is defined, a brief summary of it has to be written down. Then the researcher has to
examine all available literature relating to the identified problem, to get himself acquainted with the
selected problem. To do this, the researcher must review the relevant literature related to the research
problem. This gives knowledge about the problem area and allows one to know what relevant studies have
been conducted in the past, how these studies were conducted, and the conclusions in the problem area.
The terms, concepts, words, phrases used in the study or the description of the study have to be reviewed.
These items need to be specifically defined as they apply to that particular study. Terms or concepts often
have different definitions depending on who is reading the study. To minimize confusion about what the
terms and phrases mean, the researcher must specifically define them for the study concerned, in line with earlier studies similar to the one proposed. Many a time the problem identified
initially may have large or broad scope. Then the researcher should narrow the scope of the study. This
can only be done after the literature has been properly reviewed.
The reading of literature depends on the approach one intends to use in the research under study. Broadly there are two approaches: one is the deductive approach, in which one develops a theoretical or conceptual framework and subsequently tests it using data, and the second is the inductive approach, in which one plans to explore the data and to develop theories from them that will subsequently be related to the literature.
The review will help to refine further the research questions and objectives, to highlight research possibilities that have been implicitly overlooked in research to date, to discover explicit recommendations for further research, simply to avoid repeating work, and to provide insight into the research approaches, strategies and techniques that may be appropriate for the research under study. The purpose of the review is to identify the relevant and significant work that has taken place in this direction. The review must always be critical, in the sense that the researcher needs to weave together different authors' ideas and form his own opinions and conclusions based on those ideas.
Literature can be had from different sources like Reports, Theses, Conference proceedings, Unpublished
manuscript sources, Government publications, Journals, Books, Newspapers, Indexes, Abstracts,
Catalogues, Bibliographies, Citations, Encyclopedias, Internet etc.

2.4. FORMULATION OF HYPOTHESIS:


Once the concepts of interest in the research problem are identified, the researcher intends to find the relationships between them. Concepts are the basic units of theory development. Later, propositions, which are statements concerned with relationships among concepts, are developed. A hypothesis is a statement explaining some outcome, or a proposition, often in the form of an if-then statement, that is empirically testable. When the data are consistent with a hypothesis, the hypothesis is said to be supported; otherwise it is not supported.
2.4.1. Definition: A hypothesis is a proposition, condition or principle which is assumed, perhaps without
belief, in order to draw out its logical consequences and by this method to test its accord with facts which
are known or may be determined.
According to G.A. Lundberg, a hypothesis is a tentative generalization, the validity of which remains to
be tested.
According to Goode & Hatt, it is a proposition which can be put to the test to determine its validity. It may
seem contrary to, or in accord with, commonsense. It may prove to be correct or incorrect. Hypothesis
prevents blind research and the indiscriminate gathering of data which may later prove irrelevant to the
problem under study.
2.4.2. Characteristics of Hypothesis:
1. It should be clear & precise.
2. It should be capable of being tested.
3. It should state relationship between variables.
4. It should be limited in scope & must be specific.
5. It should be stated as far as possible in most simple terms so that the same is easily understandable
by all concerned. But one must remember that simplicity of hypothesis has nothing to do with
significance.
6. It should be consistent with most known facts, i.e., it must be consistent with a substantial body of
established facts. In other words, it should be one which judges accept as being the most likely.

2.4.3. Types of hypothesis:
There are different ways of classification. Hypotheses may be Simple, Complex, Descriptive, Relational, Directional, Non-Directional, Working, Null, Alternative, Statistical etc.
Hypotheses which state the existence, size, form or distribution of a variable are termed "Descriptive hypotheses". Hypotheses which state the relationship between two variables are known as "Relational hypotheses". A hypothesis which indicates the direction of the relationship between the variables, such as positive, negative, more than or less than, is known as a "Directional hypothesis". A hypothesis which does not indicate the direction of the relationship between the variables, such as "there is no relationship between the variables", is known as a "Non-Directional hypothesis". A hypothesis which is tested for its possible rejection is known as the "Null hypothesis"; it is denoted by H0. This hypothesis does not indicate the direction of the relationship between the variables. The opposite of the null hypothesis is known as the "Alternative hypothesis"; it is denoted by H1 or HA. A working hypothesis is an assumption made in order to draw out and test its logical or empirical consequences. A working hypothesis should be precise and clearly defined because it has to be tested. It gives guidance to the researcher by limiting the area of research and keeps him on the right track.
Some of the meanings of these types of hypothesis are out of the scope of this book and some are presented
in future Units whenever required.
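As a small illustration of directional versus non-directional hypotheses, a minimal sketch in Python is given below (hypothetical data; the alternative argument is SciPy's way of expressing the direction of H1):

    # H0: population mean = 50, against a non-directional H1 (mean != 50)
    # and a directional H1 (mean > 50), on hypothetical sample data.
    from scipy.stats import ttest_1samp

    sample = [52, 55, 48, 51, 57, 53, 49, 56]

    t, p_two = ttest_1samp(sample, popmean=50, alternative="two-sided")
    t, p_one = ttest_1samp(sample, popmean=50, alternative="greater")
    print(f"non-directional p = {p_two:.4f}, directional p = {p_one:.4f}")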

2.5. PREPARING THE RESEARCH DESIGN:


Research design is a master plan or blue print specifying the methods, procedures for collecting and
analyzing the needed information. In simple words it can be defined as the plan to carry out a research
project. It provides the entire action plan from building of the initial hypothesis to final analysis and
report writing. Since a research design depends on several factors, one cannot classify research designs
into watertight compartments. These research designs are need based. Broadly there are four types of
research design approaches namely (i) Exploratory studies, (ii) Descriptive studies, (iii) Experimental
studies, and (iv) Modeling.
2.5.1. Exploratory Studies:
Exploratory studies are generally based on secondary data that are readily available. These studies help
for further investigation, developing the concepts, definitions, identification of variables, defining the
hypothesis while improving the research design. An exploratory study is in the nature of preliminary
investigation wherein the researcher himself does not have significant knowledge and is therefore unable
to frame detailed research questions. Most of the exploratory studies are qualitative in nature but quantitative
studies can also be done in these research studies. There are three ways of conducting exploratory surveys
namely 1. A search of literature, 2. Interviewing experts in the subject and 3. Conducting Focus group
interviews (FGI), Focused group discussion (FGD).
2.5.2. Descriptive studies: Descriptive studies are formal, pre-planned and structured when the research
problem is clearly defined. It describes the characteristics of relevant groups. They estimate the percentage
of units in population possessing particular criteria. They determine the degree to which the variables are
associated with one another. These studies answer questions such as who, what, when, where, and how. They
attempt to address who should be surveyed, at what time, from where and how this information should be
obtained. They are undertaken in many circumstances. Though many are under the impression that these studies are simple, in reality these surveys can be complex also. They are divided into two broad classifications, namely cross-sectional and longitudinal.

14
2.5.3. Experimental studies:
Experimental studies are carried out to study the cause and effect relationship between variables. It allows
us to know how the dependent variable is being influenced by the changes in independent variable keeping
the other variables controlled or constant. These are more commonly used in the natural sciences than in the social sciences. Experiments are often conducted in laboratories, for example in the field of organizational psychology.
2.5.4. Modeling:
In modeling studies the research problem is represented by a mathematical model. While developing the
model, one has to have a clear understanding of the research problem and of the relationships that exist between the variables. These models have to be tested and verified before giving conclusions. In
these models, symbols are assigned to the key variables and represent the relationships among these
variables in a symbolic way.
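As an illustration of the modeling approach, a minimal sketch in Python: a hypothetical research problem, how advertising spend relates to sales, represented by the linear model sales = a + b × advertising and fitted to made-up data (NumPy is assumed to be available):

    # Representing a research problem by a simple linear model and fitting it.
    import numpy as np

    advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # hypothetical spend
    sales = np.array([12.1, 14.8, 18.2, 20.9, 24.0])       # hypothetical sales

    b, a = np.polyfit(advertising, sales, 1)               # slope b, intercept a
    print(f"fitted model: sales = {a:.2f} + {b:.2f} * advertising")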
Research design was further explained in Unit-III

2.6. COLLECTION OF DATA:


The useful information for any study is known as data. Another important step in the research process begins with the collection of data. It requires field staff, who in turn require proper selection, training, supervision and evaluation. There are two types of data in research, namely (i) Primary data and (ii) Secondary data.
2.6.1. Primary data: It is the first-hand information which is collected for the study under consideration. This data has not been collected by anyone for any study previously. The methods for collecting primary
data are through observation, questionnaire and schedule.
2.6.2. Secondary data:
The data which is collected or gathered from already existing information is known as secondary data.
This data is collected or gathered from existing books, journals, newspapers, magazines, company reports
and internet etc.
Further explanation on this was presented in Unit-V

2.7. PROCESSING AND ANALYSIS OF THE DATA:


Once the data are collected on the variables, the researcher is ready to move to the next step of the
process, which is the data analysis. The researcher now analyzes the data according to the plan. The data
which is collected has to be edited, coded and tabulated, and then statistical inferences are drawn. Data analysis determines consistent patterns and summarizes the details of the investigation. In the process of data analysis,
proper identification of techniques to be used is a very important item. Statistical analysis may range from
simple univariate analysis to complex multivariate analysis. The results of this analysis are then reviewed
and summarized in a manner directly related to the research questions.
This will be further discussed in detail in different Units from Unit-VI
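A brief sketch of the edit-code-tabulate sequence described above, using the pandas library on hypothetical survey data (names and values are illustrative only):

    # Editing, coding and tabulating a small, hypothetical survey data set.
    import pandas as pd

    raw = pd.DataFrame({
        "gender": ["M", "F", "f", "M", None],   # inconsistent entries to be edited
        "response": [4, 5, 3, 4, 5],            # Likert-type scores
    })

    # Editing: standardize categories and drop incomplete records.
    raw["gender"] = raw["gender"].str.upper()
    clean = raw.dropna().copy()

    # Coding: map categories to numeric codes.
    clean["gender_code"] = clean["gender"].map({"M": 1, "F": 2})

    # Tabulation: frequency table and a simple summary statistic.
    print(clean["gender"].value_counts())
    print(clean.groupby("gender")["response"].mean())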

2.8. INTERPRETATION AND REPORT WRITING:
The most important part of the research is communicating the research results. Communicating research
results is done through drawing conclusions and presenting them in the report. The conclusions and report preparation stage consists of interpreting the research results, describing the implications and drawing the
appropriate conclusions about the hypothesis. One must be careful in interpreting the results. The
conclusions should provide answers to the questions raised in the research proposal. A summary of the findings, independent of technical aspects, research methodology, research design and statistical findings, is required to be presented for the benefit of a common researcher.
Further elaborative discussions were given in Unit-XVII
Activity 1: Need for Review of Literature in the Research Process:
Activity 2: Consider one research Problem of your interest and define the Hypothesis in it.

2.9 SUMMARY:
This Unit is a continuation of Unit-1, in which the different steps were given without any further explanation. All the seven steps in the research process were briefly explained here. The research problem was clearly defined and various techniques for defining the research problem were presented. The need for review of literature in research, the definition of hypothesis, its need in research and its characteristics were presented. A brief mention of research design, collection of data, processing and analysis of the data, and interpretation of data and report writing was made. Elaborate explanations of these four steps will be given in the coming Units.
2.10 REVIEW QUESTIONS:
1. What are the different steps involved in the research process? Explain them briefly.
2. What is the necessity of defining research problem?
3. Briefly explain the technique involved in defining the research problem.
4. What are the points to be observed while defining the Research Problem?
5. Define Hypothesis. What are the characteristics of the Hypothesis?
2.11 FURTHER READINGS:
1. Kothari, C.R., Research Methodology – Methods and Techniques, Wishwa Prakashan, New Delhi
2. Krishnaswami, O.R., Methodology of Research in Social Sciences, Himalaya Publishing House,
Mumbai
3. Naval Bajpai, Business Research Methods, Dorling Kindersley (India) Pvt. Ltd.
4. Mark Saunders, Philip Lewis and Adrian Thornhill., Research Methods for Business Studies, Pearson,
2012
5. H.K. Dangi, Shruti Dewen, Business Research Methods, Cengage Learning India Pvt Ltd, New
Delhi 2016.
6. G C Beri, Marketing Research, Tata McGraw -Hill Publishing Company Limited, New Delhi.
UNIT - 3
RESEARCH DESIGN

OBJECTIVES:
After going through this Unit, you should be able to:
• Explain the concept of research design and experimental design.
• Explain different types of classical experimental designs.
• Explain different important statistical designs.
• Explain the different features that a good research design will possess.

STRUCTURE:
3.1 Introduction
3.2 Experimental designs
3.3 Pre-Experimental Designs
3.4 True-Experimental Designs
3.5 Quasi-Experimental Designs
3.6 Statistical designs
3.7 Features of good research design
3.8 Summary
3.9 Review Questions
3.10 Further Readings

3.1. INTRODUCTION:
A research design is the plan of a research study. As discussed in the previous Unit-II, research design is
the master plan to carry out a Research Project. This includes the research strategies, research choices and
time horizons. Through it, one can turn the research question into a research project. It is a general plan of how one will go about answering the research question(s).
The design of a study defines the study type (descriptive, correlational, semi-experimental, experimental,
review, meta-analytic) and sub-type (e.g., descriptive-longitudinal case study), research question,
hypotheses, independent and dependent variables, experimental design, and, if applicable, data collection
methods and a statistical analysis plan. Research design is the framework that has been created to seek
answers to research questions.

The research design shows, in a coherent and explicit way, how the researcher is able to tackle the research problem.
A good research design is efficient, adequate and effective. It gives minimum errors and maximum benefits.
It takes into account all uncertainties, risks involved and other categories. It minimizes the scope of biases
and errors in research. It depends on the objective and nature of the research. It helps in getting the
required information and solutions in desired time with limited resources.

3.2. EXPERIMENTAL DESIGN:


An experiment is a procedure carried out to verify, refute, or validate a hypothesis. Experiments provide
insight into cause-and-effect by demonstrating what outcome occurs when a particular factor is manipulated.
The term experimental design refers to a plan for assigning experimental units to treatment conditions.
A good experimental design serves three purposes.
• Causation. It allows the experimenter to make causal inferences about the relationship between
independent variables and a dependent variable.
• Control. It allows the experimenter to rule out alternative explanations due to the confounding
effects of extraneous variables (i.e., variables other than the independent variables).
• Variability. It reduces variability within treatment conditions, which makes it easier to detect
differences in treatment outcomes.
In an experimental design, the researcher actively tries to change the situation, circumstances, or experience
of participants (manipulation), which may lead to a change in behavior or outcomes for the participants of
the study. The researcher randomly assigns participants to different conditions, measures the variables of
interest and tries to control for confounding variables. Therefore, experiments are often highly fixed even
before the data collection starts. If it was decided that an experiment is the best approach for testing the
hypothesis, then one need to design the experiment. Experimental design refers to how participants are
allocated to the different conditions in an experiment.
Experimental designs are carried out to study the cause and effect relationship between variables. They allow one to know how the dependent variable is influenced by changes in the independent variable, keeping the other variables controlled or constant. In these studies, a researcher can make systematic
interventions to arrive at the causes of a phenomenon. An experimental design is a sketch to execute an
experiment where a researcher is able to control or manipulate at least one independent variable. The
experimental designs can be broadly classified into two groups namely (i) Classical experimental designs
and (ii) Statistical experimental designs. Classical designs consider the impact of only one treatment level
of independent variable taken for the study at a time, whereas Statistical designs considers the impact of
different treatment levels of one or more independent variables.
The classical experimental designs can be further classified into three categories namely (i) Pre -
experimental design, (ii) True - experimental design and (iii) Quasi - experimental design.

3.3. PRE - EXPERIMENTAL DESIGNS:
These designs are of an exploratory type and have no control over extraneous factors. Since they do not establish a cause and effect relationship, these designs cannot strictly be put under experimental designs. They are mainly used to frame hypotheses about a causal relationship and not for testing hypotheses. There are four commonly used pre-experimental designs, namely (i) One-group after-only design, (ii) One-group before-after design, (iii) Non-matched with control group design and (iv) Matched with control group design.
3.3.1. One-group after only design (One shot case study):
This is the most basic experimental design: it involves the exposure of a single group of test units to a treatment (X), followed by a single measurement on the dependent variable (O). In this design the researcher has very little control over the impact of the various extraneous variables, so it is used for exploratory rather than conclusive research. A pre-experimental design in which a single group of test units is exposed to a treatment X, and then a single measurement on the dependent variable is taken, is called a one-group after-only design.
Symbolically it can be represented as X O
3.3.2. One-group before after design (One-group pretest posttest design):
This design involves testing the group of test units twice. A first observation O1 is taken, the group is then exposed to the treatment (X), and a second observation O2 is taken after the exposure. The treatment effect is the difference between the two observations, i.e. O2 - O1, but the validity of the conclusion is questionable as extraneous variables are uncontrolled.
Symbolically it can be represented as O1 X O2
3.3.3. Non-matched with control group design (Static group design):
This design consists of two groups, namely an experimental group (EG) and a control group (CG). The experimental group is exposed to the treatment (X) and the measurement O1 is taken on it; the measurement on the control group, on which no treatment is applied, is O2. The treatment effect is the difference between the experimental group receiving the treatment and the control group not receiving it, i.e. O1 - O2. Since the two groups are not randomly selected, the researcher cannot be sure that the two groups were equivalent before the treatment.
Symbolically it can be represented as
EG: X O1
CG: O2
3.3.4. Matched with control group design:
These designs match experimental group and control group on the basis of relevant characteristics.
Symbolically it can be represented as
EG M X O1
CG M O2
Where M indicates the experimental group and control group are matched on the basis of some relevant
characteristics.

3.4. TRUE EXPERIMENTAL DESIGN:
These designs involve random assignment of both test units and treatments to the experimental groups. The distinguishing feature of true experimental designs, compared with pre-experimental designs, is randomization. Random assignment neutralizes the impact of extraneous variables. True experimental designs are broadly classified into three, namely (i) Two-group before-after designs, (ii) Two-group after-only designs and (iii) Solomon four-group designs.
3.4.1. Two – group before- after designs (Pretest posttest control group design):
This design involves random assignment of test units to either the experimental group or the control group, and a pretest measure is taken on both groups. The measurements on the EG and CG before exposure to the treatment (X) are O1 and O3 respectively. The treatment is then given to the experimental group only, but posttest measures are taken for both groups: O2 and O4 for the EG and CG respectively.
Symbolically it can be represented as
EG: R O1 X O2
CG: R O3 O4
The treatment effect (TE) is measured as (O2-O1) - (O4-O3)
This design controls for most extraneous variables.
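As an illustration, the following minimal Python sketch computes this treatment effect; the four observation values are hypothetical and serve only to show the arithmetic.

    # Minimal sketch: treatment effect in a pretest-posttest control group design.
    # TE = (O2 - O1) - (O4 - O3): the change in the experimental group minus
    # the change in the control group. All values below are hypothetical.

    def treatment_effect(o1, o2, o3, o4):
        return (o2 - o1) - (o4 - o3)

    # EG pretest 40, posttest 55; CG pretest 42, posttest 46
    print(treatment_effect(40, 55, 42, 46))  # (55-40) - (46-42) = 11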
3.4.2. Two -group, after only design (Posttest only control group design):
It does not involve any premeasurement: it is a true experimental design in which the experimental group is exposed to the treatment but the control group is not, and no pretest measure is taken on either group.
Symbolically it can be represented by
EG: R X O1
CG: R O2
The treatment effect is measured by TE= O1 -O2.
This design is simple. As there is no premeasurement, testing effects are eliminated, but the design is sensitive to selection bias and mortality (it is difficult to determine whether those in the experimental group who discontinue the experiment are similar to their counterparts in the control group). However, this can be controlled with careful design of the experiment. The design also has advantages in terms of time, cost and sample size requirements.
3.4.3. Solomon Four Group Designs (four-group-six-study design):
If the researcher is concerned with examining changes in the attitudes of individual respondents, the Solomon four-group design should be considered. This design solves the problems of the before-after design by supplementing it with an after-only design, and is therefore also known as the four-group six-study design. A true experimental design that explicitly controls for interactive testing effects, in addition to controlling for all the other extraneous variables, is called a Solomon four-group design. The researcher randomly selects test units and randomly exposes them to the treatments.

3.5. QUASI-EXPERIMENTAL DESIGN:
Quasi-experimental designs are useful when the researcher can control when measurements are taken and on whom they are taken, but lacks control over the scheduling of the treatments and is unable to expose test units to the treatments randomly. They are useful in situations where true experimentation cannot be performed, and they are quicker and less expensive. However, full experimental control is lacking in these designs, so the researcher must take into account the specific variables that are not controlled. The widely used quasi-experimental designs are (i) Time series designs and (ii) Multiple time series designs.
3.5.1. Time series design:
This is a quasi-experimental design that involves periodic measurements on the dependent variable for a group of test units. The treatment is either administered by the researcher or occurs naturally. After the treatment, periodic measurements are continued in order to determine the treatment effect.
A time series experiment may be symbolically represented as
EG O1 O2 O3 O4 O5 X O6 O7 O8 O9 O10
Here the researcher has no control over the schedule by which the test units are exposed to the experimental treatment, but does control the time schedule of the measurements used to assess its impact. Because a series of measurements is collected both before and after the treatment, the time series design provides at least partial control over some extraneous variables.
3.5.2. Multiple time series:
It is similar to time series design except that another group of test units is added to serve as a control
group. This can be represented symbolically as
EG O1 O2 O3 O4 O5 X O6 O7 O8 O9 O10
CG O11 O12 O13 O14 O15 O16 O17 O18 O19 O20
If the control group is selected carefully, this design will be an improvement over the time series design.

3.6. STATISTICAL DESIGNS:


These designs involve a series of basic experiments that allow for statistical control of external variables. Using these designs, more than one independent variable can be measured, and each test unit can be measured more than once, which makes the design economical. These designs offer the following advantages:
• The effect of more than one independent variable can be measured.
• Specific extraneous variables can be statistically controlled.
• Economical designs can be formulated when each test unit is measured more than once.
Prof. R. A. Fisher pioneered the study of experimental design, and according to him the basic principles of the design of experiments are 1. Replication, 2. Randomization and 3. Local control. Replication means the repetition of the treatments under investigation; this averages out the influence of chance factors on the different experimental units. Randomization is the process of assigning the treatments to the various experimental units on a purely chance basis; this gives every experimental unit an equal chance of receiving each treatment. The process of reducing the experimental error by dividing the relatively heterogeneous experimental area into homogeneous blocks is known as local control; this increases the efficiency of the design.
The most widely and commonly used statistical designs are (i) Completely randomized designs (CRD), (ii) Randomized block designs (RBD), (iii) Latin square designs (LSD) and (iv) Factorial designs.
3.6.1. Completely Randomized Design (CRD):
This design involves two principles, namely the principle of replication and the principle of randomization. It is used when the experimental area happens to be homogeneous. The treatments are randomly assigned to the experimental units (test units / participants) over the entire experimental material. A completely randomized design relies on randomization to control the effects of extraneous variables: the experimenter assumes that, on average, extraneous factors will affect all treatment conditions equally, so any significant differences between conditions can fairly be attributed to the independent variable.
Let us suppose that we have k treatments, the jth treatment being replicated r_j times, j = 1, 2, …, k. The whole experimental material is thus divided into n = Σ_j r_j experimental units, and the treatments are distributed completely at random over the units, subject to the condition that the jth treatment occurs r_j times. In the particular case r_j = r for all j, i.e. when each treatment is repeated an equal number of times r, we have n = r.k, and randomization gives every group of r units an equal chance of receiving the treatments. In general, an equal number of replications should be used for each treatment, except in particular cases where some treatments are of greater interest than others. This design is used when all the variation due to uncontrolled extraneous factors can be included under the heading of chance variation. The statistical technique applied to analyze the results of this type of experimental design is known as the Analysis of Variance, commonly abbreviated ANOVA.

The mathematical model is

        y_ij = μ + τ_j + ε_ij

where

y_ij is the ith observation receiving the jth treatment; i = 1, 2, …, r_j and j = 1, 2, …, k

τ_j is the effect due to the jth treatment

μ is the general mean effect

ε_ij is the error of the ith observation receiving the jth treatment; the ε_ij are independently and identically distributed normal variates with mean 0 and variance σ².

Total sum of squares T.S.S = Σ_j Σ_i (y_ij − ȳ..)² = Σ_j Σ_i y_ij² − (y..)²/n

Sum of squares due to treatments S.S.T = Σ_j (y.j)²/r_j − (y..)²/n

where y.j = Σ_i y_ij is the jth treatment total and y.. = Σ_j Σ_i y_ij is the grand total.

Sum of squares due to error S.S.E = T.S.S − S.S.T

The degrees of freedom for Total sum of squares T.S.S is n-1
The degrees of freedom for Sum of squares due to treatment S.S.T is k-1
The degrees of freedom for Sum of squares due to error S.S.E is n-k
The mean sum of squares is obtained by dividing each sum of squares by its corresponding degrees of freedom.

H0: There is no significant difference between the treatments, i.e. τ_1 = τ_2 = … = τ_k = 0

H1: There is a significant difference between the treatments, i.e. at least one τ_j ≠ 0

ANOVA Table

Source of variation   Degrees of freedom   Sum of squares   Mean sum of squares     F-ratio
Treatment             k-1                  S.S.T            M.S.T = S.S.T/(k-1)     F = M.S.T/M.S.E
Error                 n-k                  S.S.E            M.S.E = S.S.E/(n-k)
Total                 n-1                  T.S.S

If the calculated value of F is less than the table value of F, we accept the null hypothesis; otherwise we reject it.
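To make the computation concrete, here is a minimal Python sketch of a one-way ANOVA for a CRD. The three treatments and their observations are hypothetical, invented only for illustration, and scipy (assumed to be available) is used solely to obtain the p-value of the F-ratio.

    # One-way ANOVA for a CRD, computed from the sum-of-squares formulas above.
    # All data are hypothetical.
    from scipy.stats import f as f_dist

    groups = {                      # three treatments, four replications each
        "T1": [23, 25, 22, 24],
        "T2": [30, 29, 31, 28],
        "T3": [26, 27, 25, 26],
    }

    all_obs = [y for obs in groups.values() for y in obs]
    n, k = len(all_obs), len(groups)
    cf = sum(all_obs) ** 2 / n                     # correction factor (y..)^2 / n

    tss = sum(y ** 2 for y in all_obs) - cf        # total sum of squares
    sst = sum(sum(obs) ** 2 / len(obs) for obs in groups.values()) - cf
    sse = tss - sst                                # error sum of squares

    mst, mse = sst / (k - 1), sse / (n - k)        # mean sums of squares
    F = mst / mse
    p = f_dist.sf(F, k - 1, n - k)                 # P(F(k-1, n-k) > observed F)
    print(f"F = {F:.2f}, p = {p:.4f}")             # reject H0 if p < 0.05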
3.6.2. Randomized Block Designs (RBD):
This is an improvement over CRD. In this design principle of local control can be applied in addition to
randomization and replication. With a randomized block design, the researcher divides participants / test
units into subgroups called blocks, such that the variability within blocks is less than the variability
between blocks. Then, participants within each block are randomly assigned to treatment conditions.
Because this design reduces variability and potential confounding, it produces a better estimate of treatment effects. It is useful when the experimental area is not homogeneous and there is only one major external variable, such as sales or the income of the respondents. The test units are blocked, or grouped, on the basis of the external variable. By blocking, the researcher ensures that the various experimental and control groups are closely matched on that external variable. Hence the RBD is a statistical design in which the test units are blocked on the basis of an external variable to ensure that the various experimental and control groups are closely matched on that variable. Its main limitation is that the researcher can control for only one external variable. There are two classes of effects in an RBD, namely block "effects" and treatment "effects".
Rules for blocking:
Carefully examine the situation at hand and identify those factors which are known to affect the proposed
response. Choose one or two of these factors as the basis for creating blocks. Blocking factors are sometimes
referred to as disturbing factors.

The mathematical model is

        y_ij = μ + β_i + τ_j + ε_ij

where

y_ij is the observation in the ith block receiving the jth treatment; i = 1, 2, …, r and j = 1, 2, …, k

β_i is the effect due to the ith block

τ_j is the effect due to the jth treatment

μ is the general mean effect

ε_ij is the error of the observation in the ith block receiving the jth treatment; the ε_ij are independently and identically distributed normal variates with mean 0 and variance σ².

Total sum of squares T.S.S = Σ_i Σ_j (y_ij − ȳ..)² = Σ_i Σ_j y_ij² − (y..)²/N, where N = r.k

Sum of squares due to blocks S.S.B = Σ_i (y_i.)²/k − (y..)²/N

Sum of squares due to treatments S.S.T = Σ_j (y.j)²/r − (y..)²/N

where y_i. = Σ_j y_ij is the ith block total, y.j = Σ_i y_ij is the jth treatment total and y.. is the grand total.

Sum of squares due to error S.S.E = T.S.S − S.S.T − S.S.B


The degrees of freedom for Total sum of squares T.S.S is N-1
The degrees of freedom for Sum of squares due to block S.S.B is r-1
The degrees of freedom for Sum of squares due to treatment S.S.T is k-1
The degrees of freedom for Sum of squares due to error S.S.E is N-r-k+1
The mean sum of squares is obtained by dividing each sum of squares by its corresponding degrees of freedom.

H0: There is no significant difference between the treatments, i.e. τ_1 = τ_2 = … = τ_k = 0

H1: There is a significant difference between the treatments, i.e. at least one τ_j ≠ 0

H0: There is no significant difference between the blocks, i.e. β_1 = β_2 = … = β_r = 0

H1: There is a significant difference between the blocks, i.e. at least one β_i ≠ 0

ANOVA Table

Source of variation   Degrees of freedom   Sum of squares   Mean sum of squares          F-ratio
Treatment             k-1                  S.S.T            M.S.T = S.S.T/(k-1)          F = M.S.T/M.S.E
Block                 r-1                  S.S.B            M.S.B = S.S.B/(r-1)          F = M.S.B/M.S.E
Error                 (r-1)(k-1)           S.S.E            M.S.E = S.S.E/((r-1)(k-1))
Total                 N-1                  T.S.S

If the calculated value of F is less than the table value of F, we accept the null hypothesis; otherwise we reject it.
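A parallel sketch for the RBD, again with hypothetical data (r = 3 blocks, k = 4 treatments), shows how the block sum of squares is separated out before the error term is formed; the resulting F-ratios can then be compared with tabulated values.

    # Two-way ANOVA for an RBD. Rows are blocks, columns are treatments;
    # all yield values are hypothetical.
    data = [
        [12, 15, 14, 13],
        [10, 14, 12, 11],
        [11, 16, 13, 12],
    ]
    r, k = len(data), len(data[0])
    N = r * k
    cf = sum(sum(row) for row in data) ** 2 / N       # (y..)^2 / N

    tss = sum(y ** 2 for row in data for y in row) - cf
    ssb = sum(sum(row) ** 2 for row in data) / k - cf                  # blocks
    sst = sum(sum(row[j] for row in data) ** 2 for j in range(k)) / r - cf
    sse = tss - ssb - sst

    mse = sse / ((r - 1) * (k - 1))
    F_treatment = (sst / (k - 1)) / mse
    F_block = (ssb / (r - 1)) / mse
    print(f"F(treatments) = {F_treatment:.2f}, F(blocks) = {F_block:.2f}")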
3.6.3. Latin Square Design (LSD):
The Latin square design is used to eliminate two nuisance sources of variation and allows blocking in two directions (rows and columns). The data are classified according to rows, columns and treatments (varieties) and are arranged in a square known as a Latin square. A Latin square of order t has t rows, t columns and t treatments. In this design there are as many replicates as there are treatments: the experimental area is divided into plots arranged in a square such that there are as many plots in each row as in each column, this number being equal to the number of treatments. The plots are then assigned to the various treatments such that every treatment occurs exactly once in each row and in each column. A limitation of this design is that it requires equal numbers of rows, columns and treatment levels, which is not always possible. Interactions of the blocking variables with each other or with the independent variable cannot be estimated.

The mathematical model is

        y_ijk = μ + ρ_i + τ_j + γ_k + ε_ijk,    (i, j, k = 1, 2, …, t)

where t = the number of treatments, rows and columns,

y_ijk is the observation on the unit in the ith row and kth column receiving the jth treatment,

μ is the general mean common to all experimental units,

ρ_i is the effect of the ith row factor,

τ_j is the effect of the jth treatment,

γ_k is the effect of the kth column factor,

ε_ijk is the component of random variation associated with observation ijk; the ε_ijk are usually assumed to be independent N(0, σ²) variates.

Total sum of squares T.S.S = Σ (y_ijk − ȳ…)² = Σ y_ijk² − (y…)²/N, where N = t²

Sum of squares due to rows S.S.R = Σ_i (y_i..)²/t − (y…)²/N

Sum of squares due to columns S.S.C = Σ_k (y_..k)²/t − (y…)²/N

Sum of squares due to treatments S.S.T = Σ_j (y_.j.)²/t − (y…)²/N

where y… = the total of all t² observations, y_i.. = the total of the t observations in the ith row, y_..k = the total of the t observations in the kth column and y_.j. = the total of the t observations receiving the jth treatment.

Sum of squares due to error S.S.E = T.S.S − S.S.R − S.S.C − S.S.T

The degrees of freedom for the total sum of squares T.S.S is N-1 = t²-1
The degrees of freedom for the sum of squares due to rows S.S.R is t-1
The degrees of freedom for the sum of squares due to columns S.S.C is t-1
The degrees of freedom for the sum of squares due to treatments S.S.T is t-1
The degrees of freedom for the sum of squares due to error S.S.E is (t-1)(t-2)
The mean sum of squares is obtained by dividing each sum of squares by its corresponding degrees of freedom.

H0: There is no significant difference between the treatments, i.e. τ_1 = τ_2 = … = τ_t = 0

H1: There is a significant difference between the treatments, i.e. at least one τ_j ≠ 0

H0: There is no significant difference between the rows, i.e. ρ_1 = ρ_2 = … = ρ_t = 0

H1: There is a significant difference between the rows, i.e. at least one ρ_i ≠ 0

H0: There is no significant difference between the columns, i.e. γ_1 = γ_2 = … = γ_t = 0

H1: There is a significant difference between the columns, i.e. at least one γ_k ≠ 0

ANOVA Table

Source of variation   Degrees of freedom   Sum of squares   Mean sum of squares          F-ratio
Treatment             t-1                  S.S.T            M.S.T = S.S.T/(t-1)          F = M.S.T/M.S.E
Row                   t-1                  S.S.R            M.S.R = S.S.R/(t-1)          F = M.S.R/M.S.E
Column                t-1                  S.S.C            M.S.C = S.S.C/(t-1)          F = M.S.C/M.S.E
Error                 (t-1)(t-2)           S.S.E            M.S.E = S.S.E/((t-1)(t-2))
Total                 t²-1                 T.S.S

If the calculated value of F is less than the table value of F, we accept the null hypothesis; otherwise we reject it.
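To illustrate the layout requirement, the short Python sketch below generates a cyclic t x t Latin square, in which every treatment occurs exactly once in each row and each column. The treatment labels are hypothetical, and in a real experiment the rows, columns and treatment labels would additionally be randomized.

    # Generate a cyclic t x t Latin square: treatment (i + j) mod t in cell (i, j).
    def latin_square(treatments):
        t = len(treatments)
        return [[treatments[(i + j) % t] for j in range(t)] for i in range(t)]

    for row in latin_square(["A", "B", "C", "D"]):
        print(" ".join(row))
    # A B C D
    # B C D A
    # C D A B
    # D A B C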
3.6.4. Factorial Design:
A factorial design is used when there is more than one independent variable, or factor, each at two or more levels, in a study. It allows the study of interactions between variables. An interaction is said to take place when the simultaneous effect of two or more variables differs from the sum of their separate effects. For example, an individual's favorite drink might be tea and favorite temperature might be medium, yet this individual might not prefer medium-temperature tea; this is an interaction. Factorial designs allow researchers to look at how multiple factors affect a dependent variable, both independently and together. Factorial design studies are named after the numbers of levels of the factors. The main disadvantage of a factorial design is that the number of treatment combinations increases multiplicatively with an increase in the number of variables or levels.
The characteristic feature of a factorial design is that it studies the influence of every level of one factor in combination with every level of another factor. These designs are appropriate for finding out whether interaction exists between factors. The mathematical models depend upon the number of factors, and further analysis is beyond the scope of this book.
As an example with two factors, consider fertilizer (three levels) and type of seed (two types), with the yield of production as the response. The design consists of employing all 6 treatment combinations formed by using each type of seed with each level of fertilizer. Here we have considered the case of two factors; factorial designs can involve more than two factors as well.

3.7. FEATURES OF GOOD RESEARCH DESIGN:


A design is said to be a good research design if it is efficient, adequate and effective: it yields minimum error and maximum benefit. It takes into account the uncertainties and risks involved, minimizes the scope for bias and error in the research, and is shaped by the objective and nature of the research. It helps in obtaining the required information and solutions within the desired time and with limited resources.
Activity 1: Illustrate CRD with the help of an example.
Activity 2: Illustrate RBD with the help of an example.
Activity 3: Illustrate LSD with the help of an example.
Activity 4:
Assume we want to use three cooking times for popping the popcorn instead of two. List the possible
treatment combinations that can be assigned. How many are there?
Treatment Combination   Popcorn Brand   Microwave Location   Cooking Time
1                       Fastco          Lounge               105
2                       Pop Secret      Lounge               105
3                       Fastco          Room                 105
4                       Pop Secret      Room                 105
5                       Fastco          Lounge               135
6                       Pop Secret      Lounge               135
7                       Fastco          Room                 135
8                       Pop Secret      Room                 135
Without listing all possibilities, calculate how many treatment combinations would exist for a design that
tested five brands with three microwaves at four cooking times.
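One possible way to enumerate such treatment combinations is sketched below in Python using itertools.product. The third cooking time (120 seconds) is a hypothetical value added only for illustration, since the activity does not specify it.

    # Enumerate factorial treatment combinations as a Cartesian product of levels.
    from itertools import product

    brands = ["Fastco", "Pop Secret"]
    locations = ["Lounge", "Room"]
    times = [105, 120, 135]          # three cooking times; 120 is hypothetical

    combos = list(product(brands, locations, times))
    print(len(combos))               # 2 x 2 x 3 = 12 combinations
    print(5 * 3 * 4)                 # five brands x three microwaves x four times = 60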

3.8 SUMMARY:
This Unit aimed to explain the meaning of a research design and of experimental design. The different experimental designs, namely pre-experimental designs, true experimental designs, quasi-experimental designs and statistical designs, were discussed briefly, and the further classifications within each of these categories were explained. The features of a good research design were also presented.
Review Questions:
1. Explain the term research design. Define experimental design and give the different classifications within it.
2. What are the commonly used pre-experimental designs? Explain each of them.
3. What are true experimental designs? Mention the different classifications of true experimental designs.
4. Explain what is meant by Completely Randomized Designs.
5. Explain what is meant by Randomized Block Designs.
6. Explain what is meant by Latin Square Designs.
7. Explain what is meant by Factorial designs.

Further readings:
1. Kothari, C.R., Research Methodology – Methods and Techniques, Wishwa Prakashan, New Delhi
2. Krishnaswami, O.R., Methodology of Research in Social Sciences, Himalaya Publishing House, Mumbai
3. Aditham Bhujanga Rao, Research Methodology for Management and Social Sciences, Excel Books, New Delhi, 2008
4. Mark Saunders, Philip Lewis and Adrian Thornhill, Research Methods for Business Studies, Pearson, 2012
5. H.K. Dangi and Shruti Dewen, Business Research Methods, Cengage Learning India Pvt Ltd, New Delhi, 2016
6. Naresh K. Malhotra and Satyabhushan Das, Marketing Research: An Applied Orientation, Pearson, New Delhi.

BLOCK-II
MEASUREMENT AND SCALES, DATA COLLECTION, AND PRESENTATION
This block contains three Units: IV, V and VI. In Unit IV, attitude measurement, scales and scaling techniques are discussed. In Unit V, methods and techniques of data collection, sampling and sampling design, types of sampling, and sampling and non-sampling errors are presented. In Unit VI, data preparation, checking, editing, coding, classification, tabulation, and diagrammatic and graphical representation are discussed, and a classification of statistical techniques is also presented.

UNIT – 4
MEASUREMENT AND SCALING

OBJECTIVES:
After studying this Unit, the reader should be able to
• Explain the concepts of attitude and measurement.
• Explain the different characteristics which quantify the degree of possession.
• Explain the four types of scales of measurement and their interrelationships.
• Explain the meaning of scaling, its dimensionality, the broad classification of scaling techniques, and important scaling techniques like the Thurstone, Likert and Guttman scales.

STRUCTURE:
4.1 Introduction
4.2 Scale Characteristics to quantify the degree of possession
4.3 Scale of Measurement
4.4. Meaning of scaling
4.5 Thurstone Scale
4.6 Likert Scale
4.7 Guttman's Scale
4.8 Summary
4.9 Review Questions
4.10 Further Readings

4.1. INTRODUCTION:
In the field of research, the researcher gathers information through a questionnaire or through other modes. While gathering information one has to be careful about what is to be measured and how it is to be measured. This unit focuses mainly on how to make objective measurements. It also deals with the different scaling techniques that are common in research. We begin with a discussion of attitude and measurement.
Attitude: The dictionary meaning of attitude is 'settled behavior, as indicating opinion'. It is the tendency of an individual to respond in a consistent manner, and it is a subjective and personal affair. It is defined as the degree of positive or negative affect associated with some psychological object: a pre-disposition of the individual to evaluate some object, symbol or aspect of his world in a favorable or unfavorable manner. The term 'opinion' symbolizes an attitude. Further, attitude is not behavior itself but a pre-condition for it.
Attitude comprises three components:
1. A cognitive component: the person's beliefs or information about the object.
2. An affective component: the person's feelings about the object, such as "like" or "dislike", "good" or "bad", etc.
3. A behavioral component: the person's readiness to respond behaviorally to the object.
The study and measurement of attitude is important because there is a relationship between attitude and behavior. This relationship holds more at the aggregate level than at the individual level, since other factors may also influence behavior; attitude is one measure which influences behavior.
Eg: An individual has a favorable attitude towards buying a product (say a car) but may not buy it due to economic considerations.
Measurement: Measurement means the assignment of numbers to characteristics or attributes of objects, persons, states or events according to rules. One does not measure the object, person, state or event itself, but rather a characteristic that it possesses. That is, measurement is defined as the assignment of a number to an object such that the number reflects the degree to which the object possesses a characteristic.
Eg: One cannot measure people as such, but one can measure characteristics like their age, weight, height etc.

4.2 SCALE CHARACTERISTICS TO QUANTIFY THE DEGREE OF POSSESSION:


There are 4 characteristics to quantify the degree of possession of a characteristic by the object. They are
description, order, distance and origin.
4.2.1 Description:
By description, it means the unique labels or descriptors that are used to designate each value of the scale.
For examples 1 for male and 2 for female: 1 for strongly disagree, 2 for disagree, 3 for neither agree nor
disagree, 4 for agree and 5 for strongly agree are the unique descriptors. All scales possess description.
4.2.2 Order:
Order refers to the relative sizes or positions of the descriptors: the numbers are ordered, so that one number is greater than, less than or equal to another.
Eg: A rank of 1 indicates a stronger preference than a rank of 2, which in turn indicates a stronger preference than a rank of 3. A scale that has order also has description.

4.2.3 Distance:
Distance means that the absolute differences between the scale descriptors are known and may be expressed in units. Differences between numbers are themselves ordered: the difference between any pair of numbers is greater than, less than or equal to the difference between any other pair of numbers. For example, a five-member household has two persons more than a three-member household. A scale that has distance also has order and description.
4.2.4 Origin:
The real number series has a unique or fixed origin: zero. For example, the question "What is the annual income of the household before taxes?" has a fixed origin or true zero point, because an answer of zero means the household's income really is zero. A scale that has origin also has distance, order and description.

4.3 SCALES OF MEASUREMENT:


Different researchers collect data for different purposes, and such data cannot all be analyzed in the same statistical way because the entities represented by the numbers differ. For example, the two numbers 2 and 4 may be the weights of two commodities, the ranks of two individuals, or the serial numbers of items in a shop; they convey different meanings in these different situations, and the same statistical procedure cannot be applied in all of them. They are case specific. Therefore, there is a need to understand the concept of scales of measurement in order to use suitable statistical tools and techniques in analyzing data.
There are four primary scales of measurement namely Nominal, Ordinal, Interval and ratio.
4.3.1 Nominal Scale:
The nominal scale is applied to qualitative data. It is a scale whose numbers serve only as labels or tags for identifying and classifying objects; the only characteristic possessed by this scale is description. When used for identification, there is a strict one-to-one correspondence between the numbers and the objects. The objects or items are classified into distinctive groups or categories without any rank or order associated with them: classification is on the basis of the simple presence or absence of a property, applicable or inapplicable, possession or non-possession. The scale thus classifies the attributes into a number of mutually exclusive categories, and it has no order, distance or origin.
If a nominal scale is used, analysis of the raw data can only be done using the mode and the chi-square test.
Eg: People may be classified according to religion into Hindus, Muslims, Sikhs and Christians. Suppose each category of religion is given a label, either a number (1, 2, 3, 4) or an alphabet (A, B, C, D). These labels are used only to identify the category to which a person belongs; they have no quantitative significance, i.e., they cannot be added, subtracted, multiplied, divided or compared. Other examples include categorizing people according to marital status, political affiliation, or the habit of smoking vs non-smoking.
4.3.2 Ordinal Scale:
The ordinal scale, also known as a ranking scale, has the property of order in addition to description. In this scaling, numbers are assigned to objects to indicate the relative extent to which the objects possess some characteristic, so one can rank objects or items on that characteristic. Categories of items are compared with each other only through the order of the ranks assigned to them: one can determine whether an object has more or less of a characteristic than another, but not how much more or less. These scales possess both the description and order characteristics; that is, an ordinal scale contains all the information of a nominal scale and in addition allows one to order the objects.
Statistical methods applicable to ordinal data are the mode, median, rank correlation, percentiles or quartile measures, and other summary statistics appropriate for ordinal data.

Eg: 1. Rank the customers of a bank according to their patience while transacting with the bank.
2. Rank the operators in a shop according to their skills.
3. Rank the products of a company according to the satisfaction of customers.
4. Which one statement best describes your opinion of an Intel PC processor?
· Higher than AMD’s PC processor
· About the same as AMD’s PC processor
· Lower than AMD’s PC processor
5. Categorization of students according to grades
6. Categorization of teachers according to designation / to teaching abilities.
7. Ranking of two household according to income
4.3.3 Interval Scale:
These scales possess the description, order and distance characteristics; that is, the interval scale additionally possesses the distance characteristic over and above the ordinal scale. It has an arbitrary zero point and a constant unit: the numbers on the scale are placed at equal distances. Using this scale one can judge the differences between objects. It is a scale in which numbers are used to rate objects such that numerically equal distances on the scale represent equal distances in the characteristic being measured. It contains all the information of an ordinal scale and also allows one to compare the differences between objects.
The mathematical form of an interval-scale transformation is y = a + bx.
On interval data the statistical methods that can be used include the sample mean and the estimated standard deviation (in addition to those mentioned above for ordinal data).
Eg 1: The Centigrade and Fahrenheit thermometers are examples of interval scales. On the Centigrade thermometer the freezing point of water is 0°C and the boiling point is 100°C; on the Fahrenheit thermometer they are 32°F and 212°F respectively. The two scales are related by
C = 5(F - 32)/9
4.3.4 Ratio Scale:
Ratio scales possess all the properties of the nominal, ordinal and interval scales and, in addition, an absolute zero point (origin). The ratio scale is the most ideal level of measurement. On a ratio scale one can identify or classify objects, rank them, and compare intervals or differences; it is also meaningful to compare ratios of scale values. It is suitable for measuring properties which have a natural zero point. Eg: height, weight, distance, area, money value, population counts, rate of return, etc. Since there is an absolute or natural zero, all arithmetic operations (addition, subtraction, multiplication and division) are possible, and the numbers on the ratio scale indicate the actual amount of the property under measurement.
Eg: A weight of 50 kg is twice as large as one of 25 kg. The ratio scale possesses the attribute of an absolute zero besides the attributes of magnitude and equal intervals. Variables that can be measured at the ratio level can also be measured at the interval, ordinal and nominal levels: as a rule, properties that can be measured at a higher level can also be measured at lower levels, but not vice versa. Variables that do not have a natural zero point, such as leadership quality, happiness, satisfaction or intelligence, cannot be measured on a ratio scale.
All statistical techniques can be applied to ratio data.
Hence one can write the following about the Primary Scales of Measurement.
There are four levels of measurement: nominal, ordinal, interval and ratio. These constitute a hierarchy
where the lowest scale of measurement, nominal, has far fewer mathematical properties than those further
up this hierarchy of scales. Nominal scales yield data on categories; ordinal scales give sequences; interval
scales begin to reveal the magnitude between points on the scale and ratio scales explain both order and
the absolute distance between any two points on the scale.
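As a compact, deliberately non-exhaustive summary of this hierarchy, the small Python mapping below lists examples of statistics permissible at each level; the groupings simply restate the discussion above.

    # Illustrative mapping of measurement scales to example permissible statistics.
    permissible_stats = {
        "nominal":  ["mode", "chi-square test"],
        "ordinal":  ["median", "percentiles", "rank correlation"],
        "interval": ["mean", "standard deviation"],
        "ratio":    ["all arithmetic operations and all of the above"],
    }
    for scale, examples in permissible_stats.items():
        print(f"{scale:8s}: {', '.join(examples)}")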

4.4. MEANING OF SCALING:


Most people do not fully understand what scaling is. Scaling is the branch of measurement that involves the construction of an instrument that associates qualitative constructs with quantitative metric units: the assignment of objects to numbers according to a rule.
But what does that mean? In most scaling, the objects are text statements, usually statements of attitude or belief. Suppose there are three statements describing attitudes towards immigration; to scale these statements, we have to assign numbers to them.
Scaling evolved out of efforts in psychology and education to measure “unmeasurable” constructs like
authoritarianism and self-esteem. In many ways, scaling remains one of the most arcane and misunderstood
aspects of social research measurement. And, it attempts to do one of the most difficult of research tasks
— measure abstract concepts. The term Scaling is applied to the procedures for attempting to determine
quantitative measures of subjective abstract concepts.
Scaling describes the procedures for assigning numbers to various degrees of opinion, attitude and other concepts. This can be done in two ways: 1. making a judgement about some characteristic of an individual and then placing him directly on a scale that has been defined in terms of that characteristic; or 2. constructing a questionnaire in such a way that the score of an individual's responses assigns him a place on a scale, i.e. a continuum running from a lowest point to a highest point through several intermediate points.
4.4.1. Purposes of Scaling
Why does one do scaling? Why not just create text statements or questions and use response formats to collect the answers? One purpose of scaling is to test a hypothesis: through scaling one can find out whether a construct or concept is unidimensional or multidimensional. Sometimes one does scaling as part of exploratory research, to learn what dimensions underlie a set of ratings. The most common reason for scaling, however, is scoring: when a participant gives responses to a set of items, scaling lets one assign a single number that represents that person's overall attitude or belief.
4.4.2. Classifications of Scaling Techniques:
The scaling techniques commonly used in research can be classified into two categories, namely 1. Comparative scales (also called non-metric scales) and 2. Non-comparative scales (monadic or metric scales). Comparative scales involve the direct comparison of stimulus objects and generally generate ranking or ordinal data. Non-comparative scales generally involve the use of ratings, and the resulting data are interval or ratio in nature. Another classification of scaling techniques, based on the items included in the scales, is 1. Single-item scales, 2. Multi-item scales and 3. Continuous rating scales. As these are beyond the scope of this book, they are not analyzed further.
4.4.3. Dimensionality
A scale can have any number of dimensions, though most scales that we develop have only a few. What is a dimension? If a construct can be measured well on one number line, like height or weight, it is called unidimensional. What would a two-dimensional concept be? Many models of intelligence or achievement postulate two major dimensions, mathematical and verbal ability. In such a two-dimensional model a person can be said to possess two types of achievement: some people are high in verbal skills and lower in math, and for others it is the reverse. To describe achievement in this case, one would need to locate a person as a point in a two-dimensional (x, y) space. Three- and higher-dimensional concepts can be viewed similarly. Unidimensional concepts are generally easier to understand.
4.4.4. Important Unidimensional Scaling Techniques:
The following are the three important and major unidimensional scaling methods. They are similar in that each measures the concept of interest on a single number line (unidimensionally), but they differ considerably in how they arrive at scale values for the different items.
• Thurstone or Equal-Appearing Interval Scaling
• Likert or “Summative” Scaling
• Guttman or “Cumulative” Scaling
In the late 1950s and early 1960s, measurement theorists developed more advanced techniques for creating
multidimensional scales. These techniques are beyond the scope of this book.

4.5. THURSTONE SCALE:


This is a method of equal appearing intervals, as originally described by Thurstone and Chave. The
following are the principal steps involved in constructing this scale.
1. A large number of statements on the subject of enquiry are collected. These are drawn from
a. existing literature on the subject,
b. discussions with knowledgeable persons,
c. personal experience, and
d. interviews.
2. The statements range from an extremely favorable attitude to an extremely unfavorable attitude.
3. The number of statements should be large, though no exact number is specified.
4. Each statement is written on a separate card, and a panel of judges sorts these statements into intervals.
5. Eleven intervals are considered, named A to K, which can be regarded as carrying the numbers 1 to 11. The statements are judged on this 11-point scale, the middle interval F being the neutral position, representing neither a favorable nor an unfavorable attitude towards the object.
6. Calculate the average, preferably the median, of the judges' placements of each statement. This value is treated as the scale value of the statement.
For selecting the statements (Step 3 above), Thurstone and Chave gave five criteria: a) the statements should be brief; b) they should be such that they can be accepted or rejected in accordance with the attitude of the respondent; c) the acceptance or rejection of a statement should indicate something about the respondent's attitude on the issue in question; d) double-barreled statements should be avoided; and e) the researcher must ensure that a large proportion of the statements included in the list belong to the attitude variable that is to be measured.
4.5.1. Limitations of Thurstone Scale:
The Thurstone scale uses a two-stage procedure and is therefore both time consuming and expensive to construct. As there is no expansive response to each item, it does not have much diagnostic value. The scale has also been criticized on account of its method of scoring: respondents are merely asked to select the statements with which they agree, so two or more respondents with different patterns of agreement may end up with the same attitude score. For example, if respondent A agrees with statements having scale values 5, 7 and 10, and respondent B agrees with statements having scale values 8, 8 and 6, both will obtain the same attitude score towards the object, even though their attitudes may not really be the same.
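This scoring problem can be seen in a few lines of Python. Here a respondent's attitude score is taken as the mean of the scale values of the endorsed statements (the median is often preferred in practice), and the scale values are the hypothetical ones from the example above.

    # Thurstone scoring sketch: attitude score = mean of endorsed scale values.
    def attitude_score(endorsed_scale_values):
        return sum(endorsed_scale_values) / len(endorsed_scale_values)

    print(attitude_score([5, 7, 10]))  # respondent A -> 7.33...
    print(attitude_score([8, 8, 6]))   # respondent B -> 7.33..., the same score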

4.6. LIKERT SCALE:


The Likert scale was developed by Rensis Likert and is the most commonly used technique in the field of research. It is also known as a "summative" scale and, as mentioned, it is a unidimensional scale. Likert-type scales are developed using an item analysis approach, in which each item is evaluated on the basis of how well it discriminates between those persons whose total score is high and those whose total score is low. The items or statements that best meet this discrimination test are included in the final instrument.
A Likert scale consists of a number of statements ranging from a favorable to an unfavorable attitude towards a given object. The respondent is asked to react to each statement, indicating agreement or disagreement. Each statement usually has 5 degrees of agreement or disagreement (sometimes 7 degrees are used).
That is, each item response has five rating categories, with "strongly disagree" and "strongly agree" as the two extremes and "disagree", "neither agree nor disagree" and "agree" in between. A 1-to-5-point rating scale is used; sometimes a -2, -1, 0, 1, 2 rating scale is used instead.
Eg: A respondent is asked for an opinion on job satisfaction. He may respond in any one of the following ways:
1. Strongly Agree
2. Agree
3. Undecided
4. Disagree
5. Strongly Disagree
Each response is given a numerical score indicating its favorableness or unfavorableness: the most favorable attitude is given the highest score and the least favorable the lowest. These scores are totaled to measure the respondent's attitude. The statements whose scores correlate most consistently with the total score are retained in the final instrument, and all others are discarded from it.
4.6.1. Advantages:
1. The Likert scale is easier to construct than the Thurstone scale, as it can be built without a panel of judges.
2. The scale is more reliable, since respondents answer every statement included in the instrument, and it provides more information than a Thurstone-type scale.
3. Each statement is given an empirical test of its discriminating ability.
4. The Likert-type scale permits the use of statements that are not manifestly related to the attitude being studied.
5. One can study both how responses differ between people (respondent-centered studies) and how responses differ between stimuli (stimulus-centered studies).
6. The Likert-type scale can be constructed in less time, and it is frequently used by students of opinion research.
7. Various research studies report a high degree of correlation between Likert-type and Thurstone-type scales.
The five positions of the scale, however, cannot be said to be equally spaced: one can examine whether respondents are more or less favorable towards a topic, but not how much more or less favorable they are.
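The summative scoring idea can be sketched in a few lines of Python. The ten responses below are hypothetical, and the reverse-scoring of negatively worded items shown here is one common convention for ensuring that a high total always indicates a favorable attitude.

    # Likert scoring sketch: sum item codes on a 1-5 scale
    # (1 = strongly disagree ... 5 = strongly agree), reverse-scoring
    # negatively worded items. All responses are hypothetical.
    def likert_score(responses, reverse_items=(), points=5):
        total = 0
        for i, r in enumerate(responses):
            total += (points + 1 - r) if i in reverse_items else r
        return total

    # Ten-item instrument; items at indices 3 and 7 are negatively worded
    answers = [4, 5, 3, 2, 4, 5, 4, 1, 3, 4]
    print(likert_score(answers, reverse_items={3, 7}))  # prints 41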

4.7. CUMULATIVE SCALE OR GUTTMAN'S SCALE:


Guttman’s scaling is also sometimes known as cumulative scaling or scalogram analysis. The purpose
of Guttman’s scaling is to establish a one-dimensional continuum for a concept you wish to measure. It
consists of relatively small number of statements that are tested for unidimensionality. Unidimensional
scale measures one and only one variable. The statements in this scale form a cumulative series i.e.,
statements are related to one another in such a way that an individual who agrees with statement 4 also
agrees with less favorable statements 1, 2 & 3. The object is to find a set of items that perfectly matches
this pattern. In practice, one would seldom expect to find this cumulative pattern perfectly. So, one can
use scalogram analysis to examine how closely a set of items corresponds with this idea of cumulativeness.
Eg 1: An employment newspaper carries job news for different companies and different qualifications such as MBA, B.E., Pharmacy, etc. If we buy the newspaper to learn about jobs for MBA-qualified candidates, we also receive, along with that news, the notices for the other qualifications.
Eg 2: Suppose we want to watch the National Geographic channel and BBC World News; we must take a cable connection, and although we do not want to watch the remaining channels, they can all be seen once we have the connection.
From a respondent's total score one should be able to tell exactly which items were endorsed. This quality of being able to reproduce the responses to each item from the total score alone is called reproducibility.
The following are steps involved in Guttman’s scale
Step 1: Define the universe of content with reference to the problem under study.
Step 2: Develop a number of items or statements relating to the selected topic.
Step 3: Pre-test the statements to determine whether the topic is scalable.
Step 4: Respondents are asked to react on all items or statements based on agree/ disagree.
Step 5: Tabulate the responses giving ‘+’ for agree and ‘-‘ for disagree.
Respondent score   Item 4   Item 3   Item 2   Item 1
4                  +        +        +        +
3                  -        +        +        +
2                  -        -        +        +
1                  -        -        -        +
0                  -        -        -        -
Step 6: Calculate the coefficient of reproducibility in order to test unidimensionality (a short computational sketch of this coefficient follows the list of steps below):
Coefficient of reproducibility = 1 - (number of errors) / (number of items x number of respondents)
If the coefficient of reproducibility is below 0.9, the scale is not considered unidimensional.
Step 7: Discard the statements that fail to discriminate between favorable & unfavorable respondents.
Step 8: Discard also, items with less than 20% or more than 80% endorsement in order to guard against
spuriously high estimates of reproducibility.
Step 9: If two or more items have the same popularity, retain one of them and drop others.
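As indicated at Step 6, the coefficient of reproducibility can be computed directly from a response matrix. The Python sketch below uses a small hypothetical matrix; an "error" is counted here as any cell that deviates from the perfect cumulative pattern implied by the respondent's total score (conventions for counting errors vary somewhat across texts).

    # Coefficient of reproducibility sketch. Rows = respondents; columns = items
    # ordered from hardest (item 4) to easiest (item 1); 1 = agree, 0 = disagree.
    responses = [
        [1, 1, 1, 1],
        [0, 1, 1, 1],
        [0, 1, 0, 1],   # deviates in two cells from the ideal pattern [0, 0, 1, 1]
        [0, 0, 0, 1],
        [0, 0, 0, 0],
    ]

    errors = 0
    for row in responses:
        score = sum(row)
        # ideal cumulative pattern: endorse exactly the `score` easiest items
        ideal = [1 if i >= len(row) - score else 0 for i in range(len(row))]
        errors += sum(a != b for a, b in zip(row, ideal))

    n_items, n_resp = len(responses[0]), len(responses)
    cr = 1 - errors / (n_items * n_resp)
    print(f"Coefficient of reproducibility = {cr:.2f}")  # 0.90; >= 0.9 suggests unidimensionality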
4.7.1. Advantages:
1. It assures the unidimensionality of the items measuring the attitude.
2. A person’s response pattern can be reproduced, given his total score on the scale.
3. The scale is determined by respondents not by researcher’s subjective judgement.
4. As the final scale contains a small number of items, it is easy to administer it.
5. The scale is highly reliable as a rule.

4.7.2. Disadvantages:
1. This scale is both tedious and complex.
2. It is not suitable for measuring attitude towards variables like job satisfaction with several dimensions.
3. The criterion of reproducibility insisted by Guttman is valuable but not an essential property.
4. The standard of reproducibility set by Guttman is somewhat arbitrary.
5. The scale does not have equal or equal-appearing intervals.
Activity 1: Analyze the following with the help of the scales of measurement.

Sr. No.   Name of the store   Preference ranking   Preference rating (1-7)   Sales per week
1         More                2                    5                         Rs 1200
2         Spencers            1                    6                         Rs 1000
3         Vijetha             4                    3                         Rs 450
4         Big Bazar           3                    4                         Rs 0
Activity 2: The following are five arithmetic problems presented in increasing order of difficulty. Draw a scalogram for them.
(1) 6 + 7 =
(2) 70 - 23 =
(3) 60 - 23 + 57 =
(4) 17(37 - 16) + 5 =
(5) (27 x 16) + (19 + 6 - 18) =
Activity 3: Present an example of discovering consumers' opinions of a TV produced by a company, using the Likert scale.

4.8. SUMMARY:
This unit initially focused on the meanings of attitude and measurement. The four types of measurement scales were defined, and the characteristics that each of these scales possesses were given. The meaning of scaling and the important unidimensional scaling techniques were presented and discussed.

Review Questions :
1. Briefly discuss the terms Attitude and Measurement
2. What are the characteristics that quantify the degree of possession? Explain.
3. Discuss the four types of measurement scales
4. Explain the Thurstone Scale.
5. Explain the Likert Scale
6. Explain Guttman’s Scale

Further readings:
1. Kothari, C.R., Research Methodology – Methods and Techniques, Wishwa Prakashan, New Delhi
2. Krishnaswami, O.R., Methodology of Research in Social Sciences, Himalaya Publishing House, Mumbai
3. Naval Bajpai, Business Research Methods, Dorling Kindersley (India) Pvt. Ltd.
4. Mark Saunders, Philip Lewis and Adrian Thornhill, Research Methods for Business Studies, Pearson, 2012
5. H.K. Dangi and Shruti Dewen, Business Research Methods, Cengage Learning India Pvt Ltd, New Delhi, 2016
6. G.C. Beri, Marketing Research, Tata McGraw-Hill Publishing Company Limited, New Delhi.

UNIT - 5
METHODS OF DATA COLLECTION AND DIFFERENT TYPES OF SAMPLING TECHNIQUES

OBJECTIVES:
After studying this unit one will be able to
• Explain the meaning and types of data.
• Explain different methods of collecting Primary Data
• Explain different sources of Secondary data
• Explain different techniques of Sampling

STRUCTURE:
5.1 Introduction
5.2. Types of Data
5.3. Methods of collecting Primary Data
5.4 Different sampling techniques/ Methods
5.5 Probability Sampling
5.6 Non Probability Sampling
5.7 Summary
5.8 Review Questions
5.9 Further Readings

5.1 INTRODUCTION:
Data is a set of values of qualitative or quantitative variables: facts, opinions and statistics collected together for reference or analysis. Data may be quantitative (measured numerical values) or qualitative (categorical / attribute values). Any useful information required for one's research is known as research data; it is used to understand natural phenomena and to analyze and predict future events in a situation. Accurate and reliable data lead to the success of an investigation. Depending on the type of research, the type of data to be collected and its source will vary. Data can be generated or collected from observations, interviews, surveys, experiments, simulations or even from previous literature. More precisely, "research data is data that is collected, observed, or created, for purposes of analysis to produce original research results".
Data collection is the process of gathering and measuring information on targeted variables or attributes in an established, systematic manner. In any research study, the researcher collects the data or information needed to answer the research problem and evaluate outcomes; this data is collected in line with the identified hypothesis or research problem and the research design. In collecting the data, the researcher must decide which data to collect, how to collect it, who will collect it, and when to collect it.
The procedure for the collection of data depends on various considerations such as the objective, scope and nature of the investigation, and also on the availability of resources like money, time and manpower. It varies from one discipline to another, but the main goal is to capture quality information that results in rich data analysis and allows credible answers to be built to the questions that have been posed. Since inaccurate data can affect the results of a study and ultimately lead to invalid conclusions, one must choose a procedure that will produce accurate data.

5.2. TYPES OF DATA:


Data can be classified as Primary or Secondary.
5.2.1. Primary Data:
Primary data are those which are collected afresh and for the first time and for the purpose identified. The
data which is originally collected by the investigator or agency for the first time for investigation and used
by them in the process of analysis is termed as primary data. It is the original and first-hand information
collected through different methods by enquiring directly with one or more people who are the source of
the required information.
5.2.2. Secondary data:
Secondary data is data which has been collected by individuals or agencies for purposes other than those of our particular research study. Since the use of secondary data is considerably cheaper than collecting primary data, it is advisable to explore the possibility of using secondary data before opting for primary data. However, there are some disadvantages, or rather limitations, in using secondary data: 1. Availability, 2. Relevancy, 3. Accuracy and 4. Sufficiency.
First and foremost, secondary data must actually be available for use; in the case of non-availability one has to opt for primary data. Relevancy means the data must fit the requirements of the problem, in terms of the unit of measurement, similarity of concepts and non-obsolescence. Accuracy can be judged by consulting the original source of the data, its methodology, the care taken while collecting the information, etc. Sufficiency means the adequacy of the data.
Sources of secondary data can be classified into two categories, namely internal sources (data generated within the organization for which the research is being conducted) and external sources (data generated by sources outside the organization). Internal sources include accounting records, sales force records, miscellaneous reports and internal experts.
External sources of secondary data include books, periodicals and other published material; reports and publications from government bodies (the Census, CSO publications, NSSO reports, State statistical abstracts, District Statistical Hand Books, etc.); non-government bodies (publications of the Indian Cotton Mills Federation, Federation of Indian Chambers of Commerce and Industry publications, stock exchange publications, etc.); international organizations (data relating to countries other than the studied country, published by various international organizations); syndicated services (provided by certain organizations which collect and tabulate information on a continuing basis); computerized commercial and other data sources (Indiastat.com, INDEST-AICTE, CMIE Pvt. Ltd.); and media sources (leading newspapers such as the Economic Times, the Hindu, the Times of India, the Financial Express, Business Standard, etc.).
5.3 METHODS OF COLLECTING PRIMARY DATA:
The following are the broad methods of collecting primary data.
• Observation
• Questionnaires
• Schedule
5.3.1. Observation:
One of the methods of collecting primary data is observation. Observation is the basic method of data collection and is popularly used by different groups of researchers such as social scientists, natural scientists, engineers, computer scientists and educational researchers. Observation means viewing or seeing: the information is collected through direct observation, without interacting with any respondents. Observations can be conducted on any subject matter, and the type of observation will depend on the research question. For example, one can observe the purchasing pattern and interests of people who come to buy a particular product in response to advertisements and offers given to improve its sales; or, in visiting an organization, one can observe its infrastructure, work culture and so on through basic observation. The main advantage of observation is that respondent bias is reduced, and it is independent of the respondents' willingness to respond. On the other hand, it is an expensive method, the information it provides is limited, and it gives information relating to the present but not to the past.
Observation is classified into participant and non-participant observation. When the researcher observes by attempting to become a member of the group, it is known as participant observation; when the researcher observes through a separate representative, without attempting to become a member of the group, it is known as non-participant observation.
5.3.2. Questionnaire:
A questionnaire is a set of questions for obtaining information from respondents. The questions should be interlinked and relevant to the research, and they are judged on relevancy and accuracy; they are asked in order to obtain the information needed for the research. That is, a questionnaire is a formalized set of questions for obtaining information from respondents, consisting of a number of questions framed carefully to ensure that they cover all the information required. The respondents are expected to read and understand the questions and give their replies by ticking an option in multiple-choice questions or writing in the space provided for open-ended questions. The questionnaire is used when the sample size is large: it can be given to a large number of people to gather their information in person, with explanations offered if respondents cannot understand a question. This method saves the researcher time and money. Respondents can be truthful when responding to questionnaires, particularly regarding controversial issues, because their responses are anonymous. But the method has its own drawbacks: a majority of the people who receive questionnaires do not return them, and those who do might not be representative of the originally selected sample. This method can be adopted for the entire population or for sampled sectors.
The questionnaire method may also be conducted through interviews. An interview is a systematic conversation between the researcher and the respondent, involving one-to-one questioning. It draws not only on the conversation but also on the respondent's facial expressions, pauses, gestures and environment. The investigator, being on the spot, can handle delicate situations and extract information by interacting with respondents at their level of education. Interviews are broadly classified into structured and unstructured interviews. Structured interviews are carried out in a structured way: they involve a set of predetermined questions and highly standardized techniques of recording, with the interviewer posing the questions in a prescribed form and order. Unstructured interviews do not follow a system of predetermined questions or standardized recording techniques; the interviewer is free to ask supplementary questions or, at times, to omit certain questions. Different types of interviews used in practice are presented below.
5.3.2.1. Personal interview / Face-to-face interview:
The information is collected personally by the investigator from the source by interacting with the respondents. The investigator, being on the spot, can handle delicate situations using skill, intelligence and insight, explaining the questions to the respondent and extracting information at the respondent's level of education. These interviews have the distinct advantage of enabling the researcher to establish rapport with potential participants and thereby gain their cooperation; they yield the highest response rates in survey research. They also allow the researcher to clarify ambiguous answers and, when appropriate, seek follow-up information. The disadvantages are that they are time-consuming and expensive when large samples are involved. Face-to-face contact with respondents also permits a more thorough briefing on how to use a product: pre-test questions can be asked, and arrangements can be made for follow-up.
5.3.2.2. Telephone interviews:
Information is gathered by contacting the respondents on the telephone. These interviews are flexible, fast, less time-consuming and less expensive, and the researcher has ready access to anyone who has a telephone; no field work is necessary. The disadvantages are that the response rate is not as high as in a face-to-face interview (though considerably higher than for a mailed questionnaire), and the sample may be biased to the extent that people without phones are part of the population about whom the researcher wants to draw inferences.
5.3.2.3. Focused interviews:
These are conducted with a focus on a given experience of the respondent and its effects; the aim is to confine the respondent to a discussion of the issues with which the researcher seeks conversance.
5.3.2.4 Clinical Interview:
A clinical interview is a type of psychological assessment: a way for a mental health professional to ask a client questions and engage in dialogue to learn more about the client and form initial opinions about the client's psychological state. A clinical interview is otherwise known as an intake interview, an admission interview, a mental status exam or a diagnostic interview. It is concerned with broad underlying feelings or emotions, or with the course of the individual's life experience.
5.3.2.5. Group Interview:
The information is collected personally by the investigator from the source by interacting with a group of respondents.
5.3.2.6. Computer Assisted Personal Interviewing (CAPI):
This is a form of personal interviewing in which, instead of completing a paper questionnaire, the interviewer brings along a laptop or hand-held computer and enters the information directly into a database. This method saves the time involved in processing the data and saves the interviewer from carrying around hundreds of questionnaires. However, it can be expensive to set up and requires that interviewers have computer and typing skills.
5.3.2.7. Mailed questionnaire/ Web based questionnaires:
In this method the researcher and the respondents do not come into direct contact with each other. The respondent receives an e-mail containing a link or address that takes him or her to a secure web-site to fill in a questionnaire; this is known as mailing of a questionnaire. Such questionnaires are mailed to the respondents with a request to return them after completion. Personal questions can also be asked, because the privacy of the responder is maintained, and hence respondents can answer such questions freely. This method often saves time and money and eliminates interviewer bias, though the information obtained is less detailed. Its disadvantages include the exclusion of people who do not have or cannot access a computer, and the risk that people in a hurry may complete the questionnaire only partially and so may not give accurate responses.
5.3.2.8. Steps involved in the Construction of a Questionnaire:
1. Decide what type of information is required.
2. Decide the type of questionnaire to use, such as personal interview or mail questionnaire.
3. Decide on the content of individual questions.
4. Decide on the type of questions to use, such as open-ended, multiple choice, dichotomous, etc.
5. Decide on the wording of questions.
6. Decide on the question sequence.
7. Decide on the layout and method of reproduction of the questionnaire.
8. Make a preliminary draft and pretest it.
9. Revise and prepare the final questionnaire.
5.3.2.9. Characteristics of good Questionnaires
1. Its questions should be short, simple, easy to understand and as clear as possible, with targeted sections and questions to be filled in by the respondent.
2. The questions should have some direct relation to the problem.
3. The questionnaire should be structured in such a manner that the required information is accurate and secured.
4. The questions must be inter-related so that cross-checking is feasible.
5. The information collected through the questionnaire should be amenable to statistical manipulation and tabulation.
6. Questions should be free from personal bias and should not hurt the feelings of the respondents.
7. It should be prepared keeping the major languages and the level of literacy of the target group in mind.
8. Questions should be standardized, and precise terms should be used in framing them.
9. It can be used to collect regular or infrequent routine data, and data for specialized studies.
10. It can be adopted for the entire population or for sampled sectors.
11. If the questionnaire is being given to a sample population, it may be preferable to prepare several smaller, more targeted questionnaires, each provided to a sub-sample.
12. If the questionnaire is used for a complete enumeration, special care needs to be taken to avoid overburdening the respondent.
13. Its coverage depends on the study; in a fisheries survey, for example, it may include demographic characteristics, fishing practices, opinions of stakeholders on fisheries issues or management, general information on fishers, and household food budgets.
5.3.3. Schedule:
When the informants are largely uneducated, the mailed questionnaire method yields non-responsive data. In such cases, the schedule method is used to collect data: the questionnaires are sent through enumerators, persons appointed and trained by the researcher for the purpose. They directly meet the informants/respondents with the questionnaire, explain the scope and objective of the enquiry, and solicit their cooperation. The enumerators ask the questions, record the answers in the questionnaire and compile them. The success of this method depends on the sincerity and efficiency of the enumerators, and the accuracy of the information depends upon their honesty; they should be unbiased, sweet-tempered, good-natured, trained and well-behaved. The schedule method is widely used in extensive studies and gives fairly correct results, as the enumerators collect the information directly. It is, however, costlier and more time-consuming than the mailed questionnaire method.
5.3.3.1. Characteristics of Good Schedule:
1. Accurate communication - The questions given in the schedule should enable the respondent to understand the context in which they are asked.
2. Accurate response - The schedule should be structured in such a manner that the required information is accurate and secured. The following steps should be taken for accurate response:
a. The size of the schedule should be precise and its appearance attractive.
b. The questions should be clearly worded and unambiguous.
c. The questions should be free from any subjective evaluation.
d. Questions should be interlinked.
e. The information should be capable of tabulation and subsequent statistical analysis.
3. Suitability of the schedule method - This method is applied in the following situations:
a. The field of investigation is wide and dispersed.
b. Where the researcher requires quick results at low cost.
c. Where the respondents are largely uneducated, since the enumerator fills in the schedule.
d. Where trained and educated investigators are available.
4. Questions to be included in the schedule - These should satisfy the following:
a. The questions should be short and easy for the respondent to understand.
b. The questions should have some direct relation to the problem.
c. The information collected through the schedule should be amenable to statistical manipulation and tabulation.
d. The questions must be inter-related so that cross-checking is feasible.
e. Questions should be free from personal bias and should not hurt the feelings of the respondents.
f. Questions should be standardized.
g. Precise terms should be used in framing the questions.

5.4. DIFFERENT SAMPLING TECHNIQUES/ METHODS:


Sampling techniques are important in the process of obtaining primary data. To explain the different techniques/methods, one must first know some terminology: population, sample, sampling frame, etc.
In a research survey, the interest usually lies in assessing the general magnitude of, and studying the variation in, one or more characteristics of individuals belonging to a group. This group of individuals under study is called the population.
Population: A population can be defined as the aggregate of all the elements that share some common set of characteristics; that is, the total collection of units, elements or individuals that one wants to analyse. The elements may be animate or inanimate, and the population may be finite or infinite. Examples include countries, light bulbs produced by a company, university students, banks, residents of a particular area, regional health authorities, and the list of households in a village.
A sampling unit is one of the units into which an aggregate is divided for the purpose of sampling, each
unit being regarded as individual and indivisible when the selection is made.
Sample: A sample is a group of units selected from a larger group (the population). By studying the
sample, it is hoped to draw valid conclusions about the larger group.
Sampling is the process of selecting a small number of elements (sample) from a larger defined target
group of elements (Population) such that the information gathered from the small group will allow
judgments to be made about the larger group (Population).
Sample frame: It is the actual list of sampling units or elements from which the sample is selected. When
a list is referred to as sample frame it is possible that the list does not have all the elements of the target
population. It is also possible that the list has elements that do not belong to target population. Such
incongruity can lead to an error called sampling frame error.
Sample frame error: It occurs when certain elements of the population are accidentally omitted or not
included on the list.

Complete enumeration / census and sampling: A census consists of collecting data from each and every unit of the universe/population, whereas in sampling only a part of the population is chosen for study. Compared to a census, sampling has the advantages of being less expensive and less time-consuming, offering greater accuracy, and being applicable when enumeration is destructive.
Sampling technique: This is the procedure of selecting a sample from the population. The process of sampling involves selecting the sample, collecting the information, and making an inference about the population; all three are interwoven, and each has an impact on the others. It is used for learning about the population on the basis of a sample drawn from it: after analyzing the sample data, conclusions are drawn for the entire universe. Sampling techniques are broadly classified into two types, namely random/probability sampling and non-random/non-probability sampling. A brief description of the methods of selection under each is given below.
Sampling Techniques

Probability Sampling: Simple Random Sampling, Stratified Random Sampling, Systematic Sampling, Cluster Sampling, Multi-stage Sampling.

Non-Probability Sampling: Convenience Sampling, Judgment Sampling, Quota Sampling.

5.5. PROBABILITY SAMPLING:
A sampling method that uses some form of random selection is called a probability sampling method. For random selection, one must set up a process or procedure that assures that the different units in the population have known, nonzero probabilities of being chosen.
5.5.1. SIMPLE RANDOM SAMPLING:
Simple random sampling is a method of probability sampling in which every unit has an equal nonzero chance of being selected. We select a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen randomly, and each member of the population has an equal chance of being included in the sample. Every possible sample of a given size has the same chance of selection; that is, each member of the population is equally likely to be chosen at any stage in the sampling process. In order to draw a random sample, a sampling frame should be prepared and each item in the list allotted a serial number. Suppose there are N items in the population, each allotted a number from 1 to N. A random sample of desired size, say n, is then selected from these N units by any one of the following methods.

5.5.1.1. Lottery method:
A lottery draw is a good example of simple random sampling: a sample of 6 numbers is randomly generated from a population of 45, with each number having an equal chance of being selected. In this method, N identical slips, one for each item of the population, are prepared by writing the numbers 1 to N on them. These slips are placed in a bowl and mixed thoroughly, and n slips are drawn one by one. The items corresponding to the numbers written on the selected slips are included in the sample.
5.5.1.2. Roulette wheel method:
A roulette wheel is a numbered wheel consisting of the digits 0 to 9. Let there be 300 items in a population, numbered 001, 002, ..., 100, 101, ..., 300. A number is obtained by spinning the wheel three times and recording the digit that appears at the pointer each time the wheel stops. The item of the population bearing this number in the sampling frame is included in the sample, and the process is repeated until a sample of the desired size is obtained. An improved form of roulette wheel consists of a number of wheels, each with the digits 0 to 9, rotated by electricity or by hand; after they stop, the digits appearing at the pointers of the wheels are recorded together to form a number, and the item bearing this number in the sampling frame is included in the sample.
5.5.1.3. Use of Random Number Table:
When the population size is very large, the above methods become very inconvenient, and collecting data at a reasonable cost might become a nightmare. In that case one selects the units to include in the sample by using published tables of random numbers. Random number tables are constructed in such a way that the digits 0, 1, 2, ..., 9 appear with approximately the same frequency and independently of each other. There are different random number tables, such as Tippett's, Fisher and Yates's, Kendall and Babington Smith's, and the RAND Corporation's. In this method, first identify the N units in the population with the numbers 1 to N. Select at random any page of the random number tables and pick numbers (with the same number of digits as N) from any row or column at random. The population units corresponding to the selected random numbers are included in the sample. Further numbers may be selected by moving column-wise, row-wise or diagonally; continue in the chosen direction until the sample size is attained.
While selecting each succeeding unit, the unit chosen in the previous draw may or may not be returned to the population. If it is returned, the sampling is called Simple Random Sampling With Replacement (SRSWR); otherwise it is called Simple Random Sampling Without Replacement (SRSWOR). The results of analyzing a population through a simple random sample are far from satisfactory if the items of the population are not homogeneous.
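As a minimal illustration (not part of the original text), the two variants can be simulated with Python's standard library; the population size, sample size and seed below are hypothetical.

import random

random.seed(42)                  # fixed seed only so the illustration is reproducible

N, n = 300, 10
frame = list(range(1, N + 1))    # hypothetical sampling frame numbered 1..N

# SRSWOR: a drawn unit is NOT returned, so it can appear at most once
srswor = random.sample(frame, n)

# SRSWR: each drawn unit is returned before the next draw,
# so the same unit may be selected more than once
srswr = random.choices(frame, k=n)

print("SRSWOR:", sorted(srswor))
print("SRSWR: ", sorted(srswr))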
5.5.2. STRATIFIED RANDOM SAMPLING:
Stratified random sampling is useful when the population is non-homogeneous, that is, heterogeneous. It is a method of probability sampling in which the population is divided into sub-groups, called strata, that are homogeneous within and heterogeneous between themselves, and samples are selected from each stratum. Often there are factors which divide the population into sub-populations (strata), and the measurement of interest may vary among these sub-populations. When we select a sample from the population, the sample should represent the population, and this is achieved by stratified sampling. A stratified sample is obtained by taking samples from each stratum or sub-group of a population. Generally (though not always), the number of items selected from each stratum is proportional to its size.

Ex: Suppose there are 300 items in a population, of which a particular stratum consists of 60 items; the size of this stratum relative to the population is 60/300 = 1/5. If a total sample of 60 items is to be drawn, then 1/5th of 60, i.e., 12 items, will be selected at random from this stratum.
A stratified sample thus obtained is also known as a proportional stratified sample. A very familiar use of stratified random sampling is in the survey of the market for a commodity: the market is divided into various strata, say according to income, occupation or social status.
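A short sketch of proportional allocation in Python follows; the stratum names and sizes are invented for illustration, and the rounded allocations may need a small adjustment so that they sum exactly to n.

import random

random.seed(0)

# Hypothetical strata: name -> list of unit labels in that stratum
strata = {
    "low income":    [f"L{i}" for i in range(60)],    # N_h = 60
    "middle income": [f"M{i}" for i in range(180)],   # N_h = 180
    "high income":   [f"H{i}" for i in range(60)],    # N_h = 60
}
n = 60                                    # total sample size
N = sum(len(units) for units in strata.values())

sample = []
for name, units in strata.items():
    n_h = round(n * len(units) / N)       # proportional allocation: n_h = n * N_h / N
    sample.extend(random.sample(units, n_h))

print("Sample size:", len(sample))        # 12 + 36 + 12 = 60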
5.5.2.1. Merits:
1. It removes the drawbacks of obtaining a random sample from a heterogeneous population and
makes the sample more representative.
2. Since each stratum is homogeneous within itself, the variability within each stratum is less than the variability in a random sample. This ensures greater accuracy in the results of an investigation.
3. The items to be selected from different strata may be located in the same geographical area which
may reduce the cost of investigation to a large extent.
5.5.2.2. Demerits:
1. Stratification of the population is to be done very carefully. In the absence of proper stratification,
the precision of the results is likely to be low.
2. If different strata of a population overlap, the selection of representative sample becomes difficult.
3. Stratification requires some advance information about the nature of the population, which may not
always be available.
5.5.3. SYSTEMATIC SAMPLING:
In this method, the sampling frame is prepared on the basis of certain criteria such as alphabetical order, age, occupation, geographical location, etc. Systematic random sampling is a method of probability sampling in which the defined target population is ordered and the sample is selected according to position using a skip interval. In order to draw a systematic sample of n items from a population of N items, the sampling interval, given by N/n, is determined. For example, if there are 1000 items in a population and a systematic sample of 100 items is to be drawn, the sampling interval is 1000/100 = 10. From the first 10 items of the sampling frame, one item is selected at random; if the selected item is the 6th, the other items to be included in the sample are automatically determined (by adding the sampling interval successively) as the 16th, 26th, etc.
Systematic sampling, sometimes called interval sampling, means that there is a gap, or interval, between selections. It is often used in industry, where an item is selected for testing from a production line (say, every fifteen minutes) to ensure that machines and equipment are working to specification. Alternatively, the manufacturer might decide to select every 20th item on a production line to test for defects and quality; this technique requires the first item to be selected at random as a starting point, after which every 20th item is chosen. It is also used when questioning people in surveys.
Ex: A market researcher selecting every 10th person who enters a particular store, after selecting a person at random as a starting point, obtains a systematic sample.
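The skip-interval logic can be sketched in a few lines of Python; N and n follow the example above, and the random start is illustrative.

import random

random.seed(3)

N, n = 1000, 100
k = N // n                              # sampling interval: 1000/100 = 10
start = random.randint(1, k)            # random starting point in the first interval
sample = list(range(start, N + 1, k))   # start, start+k, start+2k, ...

print("Start:", start, "| first items:", sample[:4], "| sample size:", len(sample))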

5.5.3.1. Merits:
1. It is operationally more convenient than a simple random sample even in situations where the size
of the population is very large.
2. It is a very economical method.
3. If sampling frame is prepared independently of the characteristics under study, the results are likely
to be equivalent to that of simple random sample.
5.5.3.2. Demerits:
1. The results of a systematic sample can be very misleading if there is periodicity in the data. For example, if the 6th, 16th, 26th, etc., persons belong to the same income group, the data are said to have periodicity, and the average income thus obtained will be unrealistic.
2. If the population is very large or infinite, the preparation of the sampling frame may be practically impossible.
5.5.4. CLUSTER SAMPLING:
Preparation of a sampling frame may be difficult, if not impossible, in some circumstances. In such cases an alternative method of obtaining a sample is cluster sampling. Here the population is divided into recognizable sub-divisions called clusters, depending on the problem under study. Clusters should be as small as possible, and the number of sampling units in each cluster should be approximately the same. Clusters are heterogeneous within and homogeneous between themselves. To obtain a cluster sample, select a prefixed number of clusters at random from all the clusters; all the items covered by the selected clusters are included in the sample.
Ex: Suppose a sample of households is to be selected from the rural villages of Delhi state. There are about 300 villages, and obtaining a list of all households may be a difficult task. It is possible to divide these 300 villages into clusters that are homogeneous between themselves, say by including in each some villages which are economically weaker and some with a sound economic background. Suppose 10 clusters are formed in this manner; then, say, two clusters are selected at random, and all the households of the villages covered by these two clusters form the cluster sample. This is also known as an area sample, since the division into clusters is based on geographical sub-divisions.
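A minimal Python sketch of this selection, with hypothetical cluster contents (10 clusters of 5 households each):

import random

random.seed(5)

# Hypothetical frame: cluster id -> households it covers
clusters = {c: [f"cluster{c}_hh{i}" for i in range(1, 6)] for c in range(1, 11)}

chosen = random.sample(sorted(clusters), 2)          # pick 2 of the 10 clusters
sample = [hh for c in chosen for hh in clusters[c]]  # include ALL units in them

print("Chosen clusters:", chosen)
print("Sample size:", len(sample))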
5.5.5. MULTI STAGE SAMPLING:
Under multi stage sampling the first stage may be to select large primary sampling units such as states,
then districts, then towns and finally certain families within towns. If the technique of random sampling
is applied at all stages, the sampling procedure is described as multi-stage random sampling.
Ex: Reconsider the example in the above case. The 10 clusters form the first-stage sampling units; out of these, let a random sample of two clusters be chosen. The villages falling under these clusters form the second-stage sampling units, and from these let a random sample of 10 villages be selected. The households of these 10 villages may then form the third-stage sampling units, and so on.

5.6. NON RANDOM SAMPLING METHODS:
Under this method, the selection of the sample depends upon the judgment of the investigator rather than on chance. The method is not scientific, and therefore the various laws of statistics cannot be used to analyze its results. Nevertheless, such methods are sometimes inevitable in social science research.
Some of the popular sampling methods obtained by non-random sampling methods are
1. Judgment sampling (or) Purposive sampling
2. Quota sampling
3. Convenience sampling
5.6.1. JUDGEMENT SAMPLING:
In this sampling, the sample is selected with a definite purpose in view, and the choice of the sampling units is based entirely on the judgment of the investigator.
Judgment sampling suffers from the drawbacks of subjectivity, bias and favoritism on the part of the investigator. It therefore may not give a good representative sample of the population; it is very rarely used and cannot be recommended for general use.
5.6.2. QUOTA SAMPLING:
To obtain a quota sample, the items of the population are subdivided into various sub-groups, as in stratification. Then a quota, i.e., the number of items to be selected from each sub-group, is fixed. The selection of units depends upon the judgment of the investigator and not on chance. Quota samples are very popular in market surveys, where samples are obtained on the basis of parameters like age, sex, etc.
Demerits:
1. The estimates based on this sample are affected by various types of errors in addition to the bias of
the investigator.
2. It is very difficult to check the accuracy of the estimates based on Quota sample.
5.6.3. CONVENIENCE SAMPLING:
In obtaining a convenience sample, the investigator gives special attention to his convenience.
Ex: To estimate the average height of an Indian, the investigator need not take a sample from each state.
He can take a convenience sample from Andhra Pradesh state only (assuming that the investigator is
Andhra Pradesh based) and estimate the Average height of an Indian.
Ex: In a crop-cutting experiment, selecting sample plots adjacent to the road, instead of at random, is an example of a convenience sample.
Activity 1: Describe one survey within your knowledge that was conducted with the help of a questionnaire.
Activity 2: Suggest a suitable data collection method for collecting data on existing products when a new product is introduced in the market.
Activity 3: List the advantages of sampling over the complete enumeration method.
Activity 4: Plan a survey to assess how much children like watching children's programmes on Telugu TV channels in your locality.
Activity 5: Define the population and the sampling units in each of the following:
i) Study of the maternal mortality rate in a district.
ii) Annual yield of paddy in East Godavari district.
iii) Forecasting of the election results of a state.
iv) Popularity of a particular brand of tea in an area.
Activity 6: Prepare a survey, with the help of a questionnaire, to assess the difficulties caused by auto drivers to bike and car drivers in Visakhapatnam city.
Activity 7: Give one situation in which the multi-stage sampling method is applicable.

5.7. SUMMARY:
This unit focused initially on the types of data and the methods of collecting primary and secondary data. The concepts of population, sample, sampling frame, etc. were introduced, and different random and non-random sampling techniques were explained.
Review Questions :
1. Explain different types of Data in research.
2. What are the sources of getting Secondary data?
3. Explain the Observation Method in collecting Primary Data.
4. Explain the Questionnaire Method in collecting Primary Data.
5. Explain the Schedule Method in collecting Primary Data.
6. What is the role of Interview in getting Primary Data?
7. What are the important steps one should follow in constructing Questionnaire?
8. What are the characteristics of a good Questionnaire?
9. What are the characteristics of a good Schedule?
10. Explain the following random sampling techniques:
i) Simple Random Sampling ii) Stratified random Sampling iii) Systematic Random sampling.

Further readings:
1. Kothari, C.R., Research Methodology – Methods and Techniques, Wishwa Prakashan, New Delhi.
2. Krishnaswami, O.R., Methodology of Research in Social Sciences, Himalaya Publishing House, Mumbai.
3. Naval Bajpai, Business Research Methods, Dorling Kindersley (India) Pvt. Ltd.
4. Mark Saunders, Philip Lewis and Adrian Thornhill, Research Methods for Business Studies, Pearson, 2012.
5. Gupta, S.C. and Kapoor, V.K., Fundamentals of Applied Statistics, Sultan Chand & Sons, New Delhi.

UNIT – 6
DATA PREPARATION AND REPRESENTATION

OBJECTIVES:
After studying this Unit, the reader should be able to
1. Explain the different steps in the data preparation process.
2. Explain the concepts of editing, coding, transcribing and data cleaning, classification, tabulation, diagrammatic and graphical representation, and statistically adjusting data.
3. Explain the construction of different bar charts and the pie chart.
4. Explain the construction of the histogram, frequency polygon, frequency curve and ogives.

STRUCTURE:
6.1 Introduction
6.2. Data preparation Process
6.2.1 Questionnaire checking
6.2.2 Editing
6.2.3 Coding
6.2.4 Transcribing and data cleaning
6.2.5 Classification
6.2.6 Tabulation
6.2.7 Diagrammatic representation
6.2.8 Graphical representation of data and statistically adjusting data.
6.3 A classification of Statistical techniques
6.4 Summary
6.5 Review Questions
6.6 Further Readings

6.1. INTRODUCTION:
Once the data have been collected, the researcher has to process/prepare, analyze and interpret them. All the effort will be in vain if the collected data are not properly prepared/processed and analyzed. There are a few important steps in the data preparation process, and they are discussed elaborately below; in that process the classification and tabulation of data are covered, and a few important diagrammatic and graphical representations are also presented.

6.2. DATA PREPARATION PROCESS:


Data preparation is guided by the preliminary plan of data analysis formulated at the research design stage. The important steps in the data preparation process are: 1. Questionnaire checking, 2. Editing, 3. Coding, 4. Transcribing and data cleaning, 5. Classification, 6. Tabulation, 7. Diagrammatic representation, 8. Graphical representation, and 9. Statistically adjusting data. These are discussed in detail below.
6.2.1. Questionnaire checking:
This is the first step in the data preparation process, and it begins with checking the completeness of all questions in the questionnaire. Usually this takes place during the field work; if not, it must be done after the survey is over. A questionnaire may not be acceptable if: i) parts of the questionnaire are incomplete; ii) the pattern of responses indicates that the respondent did not understand the questions or the instructions (for example, a skip pattern was not followed); iii) the responses show little variation (such as ticking the same option repeatedly); iv) it was received after the due date; or v) it was filled in by someone who does not qualify for participation. If a sufficient number of questionnaires is not accepted, the researcher may need to collect more data.
6.2.2. Editing:
Editing is the reviewing of questionnaires with the objective of increasing accuracy and precision. As a matter of fact, editing involves careful scrutiny of the completed questionnaire/schedule. It is done to ensure that the data are accurate and consistent with the other facts gathered, and that illegible, incomplete, inconsistent or ambiguous responses are identified. Editing can be done at two places, namely in the field and centrally (after returning to the office). The person who does the editing is known as the editor. Editors must be well versed in editing questionnaires; they should be familiar with the instructions given to the interviewers and the codes, as well as the editing instructions supplied to them. All entries made by editors should be single-lined, in a distinctive color, and signed with the date of editing.
Unsatisfactory questionnaires may be returned to the field where re-contact between the interviewers and the respondents is possible; this is feasible when the sample size is small. If returning them is not feasible, the editor may assign missing values to unsatisfactory responses, provided the variables concerned are not key variables and the number of such respondents is small. Discarding unsatisfactory respondents is possible when their proportion is small, the sample size is large, and responses on key variables are missing.
6.2.3. Coding:
Coding is the procedure of classifying the answers to questions into meaningful categories. It refers to the process of assigning numerals or other symbols to answers so that responses can be put into a limited number of categories or classes, reducing a large number of heterogeneous responses to meaningful categories. If the questionnaire contains only structured questions, or very few unstructured questions, it is called pre-coded: the codes are assigned before the field work is done. Coding structured questions is relatively simple compared to open-ended questions, whose responses are generally lengthy and descriptive in nature; in such cases coding needs extra care in framing the possible categories into which the various responses can be classified. Sometimes the interviewer himself decides the category after taking down the entire response. Category codes must be mutually exclusive and collectively exhaustive, and data should be coded so as to retain as much detail as possible. A practical approach is to edit and code simultaneously, the two operations being regarded as one and looked after by one person.
6.2.4. Transcribing and data cleaning:
Transcribing data involves transferring the coded data from the questionnaires or coding sheets onto disks/pen drives, or directly into computers using key punching. This step is unnecessary if the data are collected directly by computer. Besides key punching, the data can be transferred using optical recognition, digital technologies, bar codes, etc. Data cleaning includes consistency checks and the treatment of missing responses. Consistency checks identify data that are out of range, logically inconsistent, or extreme. Out-of-range data can be corrected by going back to the edited and coded questionnaire. For logically inconsistent data, the procedure is to print the relevant information to identify the responses and take corrective action. Extreme values should be closely examined, but they need not always be wrong.
6.2.5. Classification:
Most research studies yield a large volume of data which needs to be reduced into homogeneous groups for meaningful analysis. This necessitates the classification of data: the process of arranging data in groups or classes on the basis of common characteristics. Classification can be done according to area or region (geographical), according to the occurrence of an event in time (chronological), according to attributes (characteristics which cannot be measured quantitatively, i.e., qualitative in nature), and according to magnitudes/variables, also called classification by class intervals.
Classification of data based on area or region, such as the population of different states of India according to the 2011 census or the GDP of different countries, is an example of geographical classification. Time series data are an example of chronological classification.
Classification according to attributes is called qualitative classification; classification of data according to characteristics like religion, social status, occupation, sex, caste and place of residence are examples. This can be simple or manifold classification. In simple classification only one attribute is considered, and the data are classified according to the presence or absence of that attribute: out of the total number of persons, how many are male and how many female; how many educated and how many uneducated; how many from urban areas and how many from rural areas, and so on. In manifold (multiple) classification, the data are classified according to more than one attribute: the data are first classified by one attribute (male and female), then sub-classified by a second (males into educated and uneducated, females into educated and uneducated), and so on for further attributes if needed. Attributes can sometimes be classified into more than two categories (religion can be Hindu, Muslim, Christian and others).
Classification according to variables (characteristics that can be measured quantitatively) is called quantitative classification; classification of data according to height, weight, age, income, number of children, etc. are examples. Variables may be discrete (taking no fractional values, e.g., the number of children in a family) or continuous (capable of taking fractional values, e.g., weight and height). In practice, however, even continuous variables are measured only up to some degree of precision and thus essentially become discrete.

Example 1. Discrete:

No. of children    0    1    2    3    4    5    More than 5    Total
No. of families    22   38   39   24   12   3    1              139

Example 2. Continuous:

Daily wage of workers (in Rs. 100)    0-1   1-2   2-3   3-4   4-5   5-6   6 and above   Total
No. of laborers                       12    23    45    42    31    16    8             177

6.2.5.1. Discrete Frequency Distributions:


For a discrete variable taking a small number of values (say, at most 10), each observed value in the data is counted and written against the corresponding discrete value in tabular form. That is, place all possible values of the variable in ascending order in one column, then prepare a second column of tally marks (one tally mark per occurrence). The number of tally marks is counted and presented as the frequency in a third column. This gives the discrete frequency distribution.
Example 3: In a village, 20 families were surveyed and the number of children each had was recorded as follows:
0, 2, 3, 1, 1, 3, 4, 2, 0, 3, 4, 2, 2, 1, 0, 4, 1, 2, 2, 3. Construct a discrete frequency distribution (ungrouped data).

Number of children    Tally marks    Number of families (frequency)
0                     ///            3
1                     ////           4
2                     ///// /        6
3                     ////           4
4                     ///            3
Total                                20
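The tallying above can be reproduced with Python's collections.Counter; the data are those of Example 3.

from collections import Counter

children = [0, 2, 3, 1, 1, 3, 4, 2, 0, 3, 4, 2, 2, 1, 0, 4, 1, 2, 2, 3]

freq = Counter(children)
for value in sorted(freq):       # values of the variable in ascending order
    print(value, freq[value])    # prints 0 3 / 1 4 / 2 6 / 3 4 / 4 3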

6.2.5.2. Continuous Frequency distribution:


For a discrete variable taking a large number of values, and for a continuous variable, the values are grouped into a small number of intervals called classes. The lowest and highest values that can be included in a class are called the class limits, namely the lower limit and the upper limit of the class. The width of the class is known as the class interval and can be obtained as the difference between two consecutive lower (or upper) limits. The midpoint of a class can be obtained as the average of its lower and upper class boundaries (equivalently, in the exclusive method, the average of the lower limits of that class and of the next class). The limits of classes may be arranged in two different ways, namely the exclusive method and the inclusive method. In the exclusive method, the class intervals are arranged so that the upper limit of one class interval is the same as the lower limit of the next; in the inclusive method, the upper limit of a class interval is included in the class itself.
However, to ensure continuity and to get correct class intervals, one should adopt the exclusive method. If the data are presented in the inclusive method, an adjustment is necessary to determine the class boundaries.
A correction factor is deducted from all lower limits and added to all upper limits, where
Correction factor = ½ (lower limit of the second class interval − upper limit of the first class interval).
Once the data are presented in the exclusive method, the lower and upper values of the class intervals are called class boundaries.
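A small sketch of this adjustment, using hypothetical inclusive classes 10-19, 20-29, 30-39:

lower = [10, 20, 30]                     # inclusive lower limits
upper = [19, 29, 39]                     # inclusive upper limits

cf = (lower[1] - upper[0]) / 2           # (20 - 19) / 2 = 0.5
boundaries = [(lo - cf, up + cf) for lo, up in zip(lower, upper)]
print(boundaries)                        # [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5)]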
Points to be observed while constructing a continuous frequency distribution:
1. Though there is no hard and fast rule, the number of classes should be neither too small nor too large; generally it is better to have between 5 and 15 classes.
2. The length of the class intervals should be determined from the maximum and minimum values and the number of classes.
3. It is desirable (though not mandatory) to have classes of equal length.
4. If possible, the length of the class intervals should be a multiple of 5, i.e., 5, 10, 15, 20, etc. However, this is not compulsory.
5. The starting point of the class intervals should preferably also be a multiple of 5 or 10; that is, if the least value is 4, instead of taking 4-9 it is better to take 0-5, etc.
However, one has to keep in mind that none of these is a hard and fast rule.
Example 4:
A sample of 30 persons showed their ages as follows. Construct a frequency distribution (grouped data).
20, 18, 25, 68, 32, 25, 16, 22, 29, 27, 37, 35, 49, 42, 65, 37, 42, 32, 49, 42, 53, 48, 65, 71, 64, 52, 40, 35, 55, 67.
Answer: The lowest value is 16 and the highest is 71, so the range is 71 − 16 = 55.
Assume the number of classes is 6; then 55/6 = 9.16, so for the sake of convenience the length of the class interval is taken as 10.
The class intervals are then 15-25, 25-35, ..., 65-75.

Class intervals    Tally marks    Frequency
15-25              ////           4
25-35              ///// /        6
35-45              ///// ///      8
45-55              /////          5
55-65              //             2
65-75              /////          5
Total                             30
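The same grouping can be checked with a short Python sketch using the exclusive method (an observation equal to an upper limit falls in the next class):

ages = [20, 18, 25, 68, 32, 25, 16, 22, 29, 27, 37, 35, 49, 42, 65,
        37, 42, 32, 49, 42, 53, 48, 65, 71, 64, 52, 40, 35, 55, 67]

low, width, k = 15, 10, 6                 # classes 15-25, 25-35, ..., 65-75
freq = [0] * k
for x in ages:
    freq[(x - low) // width] += 1         # exclusive method: 25 goes to 25-35

for i, f in enumerate(freq):
    print(f"{low + i * width}-{low + (i + 1) * width}: {f}")
# 15-25: 4, 25-35: 6, 35-45: 8, 45-55: 5, 55-65: 2, 65-75: 5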

6.2.5.3. Open end class distributions:


When the smallest and/or largest values of the data lie far from the bulk of the observations, a distribution with open ends can be used. In open-end distributions, the lower limit of the first class interval and/or the upper limit of the last class interval is not given.
Example 5:

Class intervals    Less than 10   10-20   20-30   30-40   40-50   50-60   60 or above   Total
Frequency          6              12      18      23      21      17      3             100

6.2.6. Tabulation of data:


Collected data may be presented by any of the following: 1. Tabulation, 2. Diagrammatic presentation, and 3. Graphical presentation.
When a large volume of data is at hand, one has to arrange it in a logical and compact form for understanding and further analysis. Such an arrangement is called tabulation. In simple words, tabulation is the arrangement of data in rows and columns. It can be done manually, mechanically or electronically, depending on the size of the data, cost considerations and time availability.
Tables are of two types, namely simple tables and complex tables. Simple tabulation concerns only one characteristic (variable/attribute): population figures of a country over consecutive census years, the marital status of the employees of a company, and the distribution of workers by weekly wages are examples of simple tabulation.
Complex tabulation contains data pertaining to more than one characteristic (variable/attribute): the urban and rural population of a country/state over different consecutive census years, and the marital status and educational levels of employees working in a company, are examples of complex tabulation. However, tables with more than two characteristics become difficult to interpret.
6.2.7. Diagrammatic representation:
A diagram conveys the idea of the data, its pattern and its shape more easily than tabulated data; data presented through diagrams appeal to the mind visually. There are various types of diagrams in the literature, commonly classified by dimension as one-dimensional, two-dimensional and three-dimensional. Keeping in view the limitations of this book and frequent usage in social research, the discussion is restricted to one-dimensional diagrams.
The important diagrams discussed are: 1. Simple bar diagram, 2. Multiple bar diagram, 3. Sub-divided bar diagram, 4. Percentage bar diagram, 5. Pie diagram, and 6. Structure diagram.
6.2.7.1. Simple Bar Diagram:
This relates to one characteristic only. It is commonly used and is well suited to representing qualitative data. The characteristic is presented on the X-axis and the corresponding values on the Y-axis. The bars are simply vertical, and the lengths of the bars are proportional to their corresponding values; only length is considered, the width is uniform for all bars, and the gap between bars is identical. The diagram usually shows a comparison between different categories. Although it can technically be plotted vertically or horizontally, the most usual presentation is vertical. The bars may lie on the positive side as well as the negative side, depending on the values of the variable.

Example 6: The following are the gross revenues of a company XYZ in the years 2013, 2014 and 2015.

Year                     2013   2014   2015
Revenue in Rs. crores    120    145    200

Draw a simple bar diagram.

Answer: Identify the years on the X-axis with the same width, separated by equal gaps. Take revenue on the Y-axis with a suitable scale. Erect a bar against each year whose height is proportional to the corresponding revenue.
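If a plotting library is available, the same diagram can be produced programmatically; the sketch below assumes matplotlib is installed and uses the Example 6 figures.

import matplotlib.pyplot as plt

years = ["2013", "2014", "2015"]
revenue = [120, 145, 200]                 # Rs. crores, from Example 6

plt.bar(years, revenue, width=0.5)        # equal-width bars, equal gaps
plt.xlabel("Year")
plt.ylabel("Revenue (Rs. crores)")
plt.title("Gross revenue of company XYZ")
plt.show()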

6.2.7.2. Multiple Bar Diagram:

A multiple bar diagram facilitates comparison between two or more phenomena. Two or more sets of inter-related data are represented by side-by-side bars for each category. The technique of the simple bar chart is used to draw the bars for each set, the difference being that different shades, colors or dots are used to distinguish the phenomena. This is also called a compound bar or cluster bar diagram.
Example 7: The following is information on the marital status of employees in three companies A, B and C in the year 2015. Draw a suitable diagram.

Marital status A B C
Married 120 110 90
Unmarried 45 50 30
Divorced and widowed 15 20 10

Answer: Consider the companies on the X-axis and the number of employees on the Y-axis. Plot the companies A, B, C with some width on the X-axis, separated by equal gaps; the width for each company must be suitable for erecting three bars. Against each company, draw three bars, one each for married, unmarried, and divorced and widowed, whose heights are proportional to the corresponding numbers.

6.2.7.3. Sub-Divided Bar Diagram:
A sub-divided or component bar chart is used to represent data in which the total magnitude is divided into different components. In this diagram, we first make a simple bar for each class, taking the total magnitude in that class, and then divide these simple bars into parts in the ratio of the various components.
Example 8: Construct a sub-divided bar diagram for the following data on family expenditure.

Year    Monthly expenditure (Rs. 000s)
        Food    Housing    Others    Total
2011    5       3          10        18
2012    6       5          11        22
2013    6.5     5          15.5      27
2014    8       6          16        30

Answer:
Consider the years with some width on the X-axis, separated by equal spaces, and expenditure on the Y-axis with a suitable scale. Plot a bar against each year whose height is proportional to the total of all items together. Now sub-divide each bar according to the component expenditure items.

6.2.7.4. Percentage Bar Diagram:
A percentage bar diagram is a sub-divided bar chart drawn on a percentage basis. To draw it, express each component as a percentage of its respective total. Bars of length equal to 100 are drawn for each class in the first step and sub-divided in proportion to the percentages of their components in the second step. The diagram so obtained is called a percentage bar diagram or percentage stacked bar chart. This type of diagram is useful for making comparisons.
Example 9: Construct a percentage bar diagram for the data given in Example 8.
Answer: Consider the years with some width on the X-axis, separated by equal spaces, and percentages on the Y-axis. Convert all the values in each year into percentages, so that the total for each year is 100. Plot bars of height 100 for each year, and sub-divide each bar according to the percentages of the component expenditure items.

6.2.7.5. Pie Diagram:
A pie chart (or circle chart) is a circular statistical graphic which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area) is proportional to the quantity it represents. Pie charts are generally used to show percentage or proportional data: the size of each slice represents the proportion of a component out of the total. Pie charts are good for displaying data for around six categories or fewer.
Example 10: The following table shows the areas, in millions of square miles, of the oceans of the world. Draw a pie chart for the same.

Ocean    Pacific    Atlantic    Indian    Antarctic    Arctic
Area     70.8       41.2        28.5      7.6          4.8

Answer: The total area is 152.9. Convert the values into degrees by equating 152.9 to 360 degrees. The corresponding values in degrees are 70.8 × 360/152.9 = 167; 41.2 × 360/152.9 = 97; 28.5 × 360/152.9 = 67; 7.6 × 360/152.9 = 18; 4.8 × 360/152.9 = 11. These values are plotted on a circle and separated with different colours/shades.
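The degree computation generalizes directly; a minimal Python sketch using the Example 10 figures:

areas = {"Pacific": 70.8, "Atlantic": 41.2, "Indian": 28.5,
         "Antarctic": 7.6, "Arctic": 4.8}

total = sum(areas.values())                               # 152.9
degrees = {o: round(a * 360 / total) for o, a in areas.items()}
print(degrees)   # {'Pacific': 167, 'Atlantic': 97, 'Indian': 67,
                 #  'Antarctic': 18, 'Arctic': 11} -- sums to 360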

6.2.7.6. Structure Diagram:


The display of qualitative information in diagram form is called a structure diagram. Under this category, only two charts, the organizational chart and the flow chart, are discussed in this book. Organizational charts are most commonly used to represent the internal structure of an organization; there is no specific procedure for drawing this type of chart. Flow charts are useful for representing information which flows through different stages to its ultimate point.
6.2.8. Graphical representation:
This is another technique of visual presentation, like diagrammatic representation. However, diagrams are more suitable for illustrating discrete data, while continuous data are better represented by graphs. Graphs are two-dimensional: the values on both the X-axis and the Y-axis are proportional to the magnitudes they represent.
Example 12: For the data given in Example 11, draw the frequency polygon and frequency curve.
Answer: Consider monthly earnings on the X-axis and the number of families on the Y-axis with a suitable scale. Plot the frequency of each class interval against the midpoint of that class interval. Joining those points by straight lines, in order, gives the frequency polygon; joining them by a smooth curve gives the frequency curve.

6.2.8.3. Ogives (Cumulative frequency curves):


There are two types of ogives: the less-than ogive (less-than cumulative frequency curve) and the more-than ogive (more-than cumulative frequency curve). These are graphical representations of cumulative frequency distributions. To construct such graphs, one has to obtain the less-than and more-than cumulative frequencies from the frequency distribution. The less-than cumulative frequency of a class interval is the total frequency of all observations less than the upper limit of that class interval; similarly, the more-than cumulative frequency of a class interval is the total frequency of all observations greater than or equal to the lower limit of that class interval. Thus the less-than cumulative frequency of the first class interval is just its own frequency, while that of the last class interval is the total frequency; the more-than cumulative frequency of the first class interval is the total frequency, while that of the last class interval is just its own frequency. In constructing these graphs, the class intervals are taken on the X-axis and the cumulative frequencies on the Y-axis with a suitable scale. The graph of the less-than cumulative frequencies plotted against the corresponding upper limits is called the less-than cumulative frequency curve or less-than ogive; the graph of the more-than cumulative frequencies plotted against the corresponding lower limits is called the more-than cumulative frequency curve or more-than ogive.
Example 13: For the data given in Example 11, draw the cumulative frequency curves.
Answer:

Monthly earnings (Rs. 000)          6-7   7-8   8-9   9-10   10-11   11-12   12-13
No. of families                     10    15    20    25     15      10      5
Less-than cumulative frequency      10    25    45    70     85      95      100
More-than cumulative frequency      100   90    75    55     30      15      5
Consider monthly earnings on the X-axis and the cumulative frequencies of the number of families on the Y-axis with a suitable scale. Plot the less-than cumulative frequency of each class interval against its upper limit and join the points by a smooth curve; this gives the less-than cumulative frequency curve. Plot the more-than cumulative frequency of each class interval against its lower limit and join the points by a smooth curve; this gives the more-than cumulative frequency curve.
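Both cumulative series can be computed mechanically; a sketch using the frequencies of Example 13:

from itertools import accumulate

freq = [10, 15, 20, 25, 15, 10, 5]        # No. of families per class

less_than = list(accumulate(freq))        # plotted against upper class limits
total = sum(freq)
more_than = [total - c + f for c, f in zip(less_than, freq)]  # against lower limits

print(less_than)   # [10, 25, 45, 70, 85, 95, 100]
print(more_than)   # [100, 90, 75, 55, 30, 15, 5]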

6.2.9. Statistically adjusting Data:


Adjusting the data is not always necessary, but in some situations it can be done with the help of the following methods.
6.2.9.1. Weight assigning:
Sometimes some respondents are comparatively more important than others. In such cases those respondents are assigned a higher weight (a weight of 1.0 means no weighting). For example, in a market survey of buying behaviour at a super bazaar, the most frequently visiting respondents may be given more weight than less frequent visitors.
6.2.9.2. Variable Re-specification:
Re-specification involves creating new variables or modifying existing ones to make them more consistent with the objectives of the study. Assigning a dummy variable (e.g., 1 for educated and 0 for uneducated), reducing a 10-point scale to a 5-point scale, developing a new index from existing variables, and taking the ratio of two variables are examples of variable re-specification.
6.2.9.3. Scale Transformation:
This transformation is done to ensure comparability with other scales and to make the data suitable for analysis. Different types of scales are employed for measuring different variables; even when the same scale is employed for all variables, scale transformation helps correct for differences among them. A common transformation is to subtract the mean of a characteristic from each of its values and divide by the corresponding standard deviation. This is known as standardization.
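A minimal sketch of standardization with Python's statistics module (the raw values are hypothetical):

from statistics import mean, pstdev

x = [12, 15, 20, 22, 31]                  # hypothetical raw observations

m, s = mean(x), pstdev(x)                 # mean and (population) standard deviation
z = [(v - m) / s for v in x]              # standardized (z) scores

print([round(v, 2) for v in z])           # transformed data have mean 0 and SD 1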

6.3. A CLASSIFICATION OF STATISTICAL TECHNIQUES:
After obtaining the data, one can compute different statistical measures like the mean, mode, standard deviation, correlation coefficients, etc.; these are explained further in Units VII to X. For proper analysis and interpretation, however, one has to apply statistical techniques to them. These statistical techniques can be classified into univariate and multivariate techniques. Univariate statistical techniques are used when there is a single measurement of each element in the sample, or when there are multiple measurements but each variable is analyzed independently. Multivariate statistical techniques, on the other hand, are suitable for analyzing data when there are two or more measurements of each element and the variables are analyzed simultaneously. Univariate techniques are further classified into metric (interval and ratio scale) and non-metric (nominal and ordinal scale) techniques, based on the scale of measurement of the data; metric techniques are also called parametric tests, and non-metric techniques non-parametric tests. These are further classified by the number of samples. Figure 6.1 gives a broad classification of common univariate statistical techniques, based on the type of data (metric or non-metric), the number of samples (one or more), the number of variables (one or more) and the type of variable (independent or dependent). Some of these techniques are discussed in later units. Figure 6.2 gives a broad classification of common multivariate statistical techniques, based on the type of variable (independent or dependent), the number of dependent variables, etc. Further analysis of these techniques is not given, as it is beyond the scope of this book.
Activity 1. The number of spoiled fruits in 20 fruit boxes is as follows. Construct a frequency distribution
for the data.
4,2,5,1,0,2,5,2,3,1,4,4,5,1,4,1,2,3,1,3.
Activity 2. Marks of 50 students in Statistics (out of 100) are as follows. Construct a frequency distribution
for the data.
70,45,33,64,50,25,65,75,30,20,55,60,65,58,52,36,45,42,35,40,31,37,39,61,53,
59,49,41,15,55,42,63,82,65,45,63,54,52,48,45,57,53,55,42,46,39,64,35,26,18.
Activity 3. From the cumulative frequency distribution given below, construct a frequency distribution.
Also prepare a more than cumulative frequency table.

Marks            Below 10  Below 20  Below 30  Below 40  Below 50  Below 60  Below 70  Below 80  Below 90
No. of students  12        30        60        100       150       10        220       240       250
Activity 4. For the following information on Profit and Loss (in lakhs of Rs.) of different industries, draw a
simple bar diagram.

Industries    A    B    C    D    E
Profit/Loss   52  -12   23   19  -20
Classification of Univariate statistical techniques:

Univariate Techniques
   Metric Data
      One sample: t-test, z-test
      Two or more samples
         Independent samples: Two-group t-test, z-test, One-way ANOVA
         Dependent samples: Paired t-test
   Non-Metric Data
      One sample: Frequency, Chi-Square, K-S, Runs, Binomial
      Two samples
         Independent samples: Chi-Square, Mann-Whitney, K-S, K-W ANOVA
         Dependent samples: Sign, Wilcoxon, McNemar, Chi-Square, Friedman

Figure 6.1
Classification of Multivariate statistical techniques:

Multivariate Techniques
   Dependence Techniques
      One dependent variable: Cross-tabulation, ANOVA, ANCOVA, Multiple regression,
         Conjoint analysis, Two-group discriminant analysis, Logit analysis
      More than one dependent variable: MANOVA, MANCOVA, Canonical correlation,
         Multiple discriminant analysis, Structural equation modeling and path analysis
   Interdependence Techniques
      Variable interdependence: Factor analysis
      Interobject similarity: Cluster analysis, Multidimensional scaling

Figure 6.2

Source: Marketing Research, Naresh K Malhotra and S Dash; Business Research Methods, N Bajpai.
Activity 5. Draw a multiple bar diagram for the data on enrolment of students in an institution given below:

Programme    No. of Students enrolled
             2012-13   2013-14   2014-15   2015-16
B. Tech      1500      1600      1525      1650
M.B.A        500       450       400       450
Others       1050      726       1458      666
Activity 6. For the data given in Activity 5, draw sub-divided and percentage bar diagrams.
Activity 7. Draw a pie diagram for the following data on the sales of a soft drink at five different places in a state
in the year 2014-15.

Sales       A     B     C    D     E
Rs. 000's   123   122   86   156   72
Activity 8. For the data given in Activity 3, construct a histogram.
Activity 9. Choose some data from your organization in such a way that unequal class intervals are present, and
construct a histogram for it.
Activity 10. For the data given in Activity 3, construct a frequency polygon and frequency curve.
Activity 11. For the data given in Activity 3, construct ogives.
Activity 12. Draw a structure diagram of employees for your organization.

6.4. SUMMARY:
In this unit the data preparation process was discussed. In the process of data preparation, questionnaire checking,
editing of data, coding of data, transcribing and data cleaning, classification of data, tabulation, different
diagrammatic representations of data, graphical ways of representing data and the adjusting of data were discussed.
A classification of statistical techniques was also given.
Review Questions:
1. Mention the different steps involved in the data preparation process.
2. Explain the concepts of i) Questionnaire checking ii) Editing iii) Coding iv) Transcribing and data
cleaning.
3. Explain the procedure for constructing i) Simple bar diagram ii) Multiple bar diagram iii) Sub-
divided bar diagram iv) Percentage bar diagram.
4. Explain the procedure for drawing a PIE diagram.
5. Explain the procedure for drawing i) Histogram ii) Frequency curve iii) Ogives.
6. Explain different methods of adjusting data.

Further readings:
1. Kothari, C.R., Research Methodology – Methods and Techniques, Wishwa Prakashan, New Delhi.
2. Krishnaswami, O.R., Methodology of Research in Social Sciences, Himalaya Publishing House, Mumbai.
3. Naval Bajpai, Business Research Methods, Dorling Kindersley (India) Pvt. Ltd.
4. Mark Saunders, Philip Lewis and Adrian Thornhill, Research Methods for Business Studies, Pearson, 2012.
5. H.K. Dangi and Shruti Dewen, Business Research Methods, Cengage Learning India Pvt. Ltd., New Delhi, 2016.

BLOCK-III DATA ANALYSIS

In this Block-III there are four units, namely Unit-VII, Unit-VIII, Unit-IX and Unit-X. All of these are on
data analysis. Units VII and VIII are entirely on univariate analysis, in which the analysis of data having
only one variable is discussed. In Unit VII, different measures of central tendency, namely the arithmetic
mean, median, mode, geometric mean and harmonic mean, are discussed. In Unit VIII different measures
of dispersion, namely the range, quartile deviation, mean deviation and standard deviation, are discussed.
In addition to these measures, measures of skewness and kurtosis are also presented. Unit IX is
on bivariate analysis, in which the relation between two variables is discussed; in the process,
bivariate correlation and regression analysis are discussed in detail. Unit X is on
multivariate analysis, in which multiple correlation and multiple regression are discussed in detail. In
addition, a brief mention of other multivariate techniques is also made in Unit X.

UNIT - 7
UNIVARIATE ANALYSIS-I
MEASURES OF CENTRAL TENDENCY

OBJECTIVES:
After studying this Unit, the reader should be able to
1. Explain the necessity of calculating different statistical constants.
2. Explain the different types of statistical measurements.
3. Explain the meaning of central tendency and its different measures.
4. Explain in depth the analysis of the Mean, Median and Mode.

STRUCTURE:
7.1 Introduction
7.2. Measures of Central Tendency or Averages
7.3. Arithmetic Mean
7.4. Geometric Mean
7.5. Harmonic Mean
7.6. Median
7.7. Mode
7.8. Summary

7.1 INTRODUCTION:
In Unit-6, it was explained how raw data on a single variable can be organized and summarized in such a
manner that their main highlights and other important characteristics become explicit and easy to grasp.
Although frequency distributions and the corresponding graphical representations make raw data more
meaningful, they fail to identify some major properties that describe a set of quantitative data. These
properties can be classified into four categories based on the nature of what they identify about the data. The
properties are
1. The numerical value around which most of the other observations of the given data set show
   a tendency to cluster or group, called the central tendency. The value around which individual
   observations cluster is called the central value.
2. The extent to which numerical values are dispersed around the central value, called variation.
3. The extent of departure of the numerical values from a symmetrical (normal) distribution around the
   central value, called skewness.
4. The degree of concentration of frequencies (observations) in the distribution, that is the degree of
   peakedness or hump of the distribution, called kurtosis.
Based on these properties, the descriptive measures of observations corresponding to a variable can be
classified into four types of measures, namely
1. Measures of Central Tendency or Averages
2. Measures of Variation
3. Measures of Skewness and
4. Measures of Kurtosis.
Before going into the actual calculation of these measures, the different types of data presentation, a few
definitions required for tabulation and the notations that are useful in tabulation are presented for the convenience
of the reader.

Ungrouped Data: Let X1, X2, …, Xn be a set of 'n' observations. If n is small, further calculations will be
done on those values without making any grouping. This type of data presentation is called ungrouped.
Example 7.1: In a survey, the profits of 5 companies during a year are 15, 20, 10, 35 and 32 crores of
rupees.

Answer: Here n = 5 and X1 = 15, X2 = 20, X3 = 10, X4 = 35, X5 = 32.

Frequency of an observation: The number of times an observation (a particular value) occurs or is repeated
in the given set of data is known as the frequency of that observation. If the observation Xj is repeated fj times
in the given set of data, then fj is called the frequency of the observation Xj.
If n is large, the ungrouped data presented above can also be presented in discrete form using the
observations and their corresponding frequencies.
Frequency distribution: A tabular representation of observations with their corresponding frequencies
is known as a frequency distribution. Frequency distributions are of two types, namely discrete frequency
distributions and continuous frequency distributions.
Discrete frequency distribution: If the observations are denoted by Xj and the corresponding frequencies
are denoted by fj, j = 1, 2, 3, …, k (say), the distribution can be shown as follows:

Xj   X1   X2   …   Xk
fj   f1   f2   …   fk

The representation of the data in the form of a table showing each observation with its corresponding
frequency is known as a discrete frequency distribution and the table is known as a discrete frequency
distribution table.
Example 7.2: The marks obtained by 40 students are as follows. Show it in a discrete distribution.
40,50,50,60,70,40,80,90,90,80,80,70,70,70,40,50,60,90,80,70,70,60,50,50,40,50,50,60,60,40,40,40,60,60,50,50,60,40,50,60.
Answer: This can be tabulated as a discrete distribution using tally marks as explained in Unit-VI as
follows:

Marks 40 50 60 70 80 90
Number of Students 8 10 9 6 4 3
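The tally count can be verified with a short script; the sketch below uses Python's collections.Counter on the marks of Example 7.2.

    # Sketch: discrete frequency distribution of the Example 7.2 marks
    from collections import Counter

    marks = [40,50,50,60,70,40,80,90,90,80,80,70,70,70,40,50,60,90,80,70,
             70,60,50,50,40,50,50,60,60,40,40,40,60,60,50,50,60,40,50,60]

    freq = Counter(marks)
    for value in sorted(freq):
        print(value, freq[value])   # 40:8, 50:10, 60:9, 70:6, 80:4, 90:3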

Continuous frequency distribution: If the sample size is large, the ungrouped data can also be presented
in grouped form, in which class intervals and class frequencies will appear.
Class Interval: A class interval consists of a lower limit and an upper limit. It depends on the smallest
and largest values of the observations in the data, the number of class intervals and the length of the class
intervals.
Number of classes: An ideal number of classes for a frequency distribution is one that gives the maximum
information. It can be obtained from Sturges' formula and is given by
k = 1 + 3.322 log10(n), where n is the number of observations.
Length of the class interval: It is the difference between any two successive lower limits or any two
successive upper limits.

Length of the class interval = Range / Number of classes = (Largest value – Smallest value) / k.

Frequency of the class interval / class frequency: The number of observations belonging to a class is
known as the class frequency or frequency of that class. The representation of the mid values of a set of
classes in the form of a table showing each mid value with its corresponding class frequency is called a
continuous frequency distribution, and the table is known as a continuous frequency distribution table.
If the mid value of the j th class interval is denoted by xj and the corresponding frequencies are denoted by
fj, j = 1, 2, 3, …, k (say), the distribution can be shown as follows:

Class Interval   1st   2nd   …   kth
xj               x1    x2    …   xk
fj               f1    f2    …   fk

Example 7.3: A hospital administration requested a management consultant to study the waiting time of
patients; the following are the data on the waiting times of 22 patients. Construct a frequency
distribution.
10,8,5,25,30,15,5,20,12,15,10,15,13,11,14,15,35,8,10,11,13,12
Answer: The number of observations = 22, the maximum value = 35 and the smallest value = 5.
Range = 35 – 5 = 30.
If the length of the class interval is fixed as 5, the number of class intervals will be 35/5 = 7.
Then the class intervals and frequencies are as follows:

Class Interval 5-10 10-15 15-20 20-25 25-30 30-35 35-40
Frequency 4 10 4 1 1 1 1
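The same table can be reproduced with numpy's histogram function; the sketch below assumes the usual convention that each class includes its lower limit (numpy's last bin also includes its upper edge).

    # Sketch: continuous frequency distribution for the Example 7.3 data
    import numpy as np

    waiting = [10,8,5,25,30,15,5,20,12,15,10,15,13,11,14,15,35,8,10,11,13,12]
    edges = [5, 10, 15, 20, 25, 30, 35, 40]     # class boundaries

    freq, _ = np.histogram(waiting, bins=edges)
    print(freq)     # [ 4 10  4  1  1  1  1]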

7.2. MEASURES OF CENTRAL TENDENCY OR AVERAGES:


Averages are statistical constants. They give us an idea about the concentration of the values in the central
part of the distribution. Generally, values tend to concentrate centrally, so these constants are also known as
measures of central tendency. There are different types of averages, each having its own applications and
speciality.
Sometimes they are classified into three, namely Mean, Median and Mode, with the Mean further classified
into the Arithmetic Mean, Geometric Mean and Harmonic Mean. Others classify them as five measures of
central tendency, namely the Arithmetic Mean, Median, Mode, Geometric Mean and Harmonic Mean.

7.3. ARITHMETIC MEAN:


This is the most popular measure of central tendency and is popularly known as the mean. It is a mathematical
average and can be calculated using the following formulae.

7.3.1. Ungrouped data: Let X1, X2, …, Xn be a set of 'n' observations; then their arithmetic mean is given by
X̄ = (X1 + X2 + … + Xn)/n = (Σ Xi)/n.

7.3.2. Discrete frequency distribution: Let X1, X2, …, Xk be a set of 'k' observations with corresponding
frequencies f1, f2, …, fk; then their arithmetic mean is given by
X̄ = (Σ fj Xj)/N, where N = Σ fj.

7.3.3. Continuous frequency distribution: Let x1, x2, …, xk be the mid-values of the 'k' class intervals in a
continuous distribution with corresponding frequencies f1, f2, …, fk; then their arithmetic mean is given by
X̄ = (Σ fj xj)/N, where N = Σ fj.

Example 7.4: For the data given in Example 7.1, find the value of the Arithmetic Mean (AM).

Answer: Sample size n = 5, sum of the observations Σ Xi = 15 + 20 + 10 + 35 + 32 = 112.

AM = (Σ Xi)/n = 112/5 = 22.4.

Example 7.5: For the data given in Example 7.2, calculate the AM.

Answer: AM = (Σ fi Xi)/N, where N = Σ fi.

Marks (Xi)               40    50    60    70    80    90    Total
Number of Students (fi)  8     10    9     6     4     3     40 = N
fi Xi                    320   500   540   420   320   270   2370

Arithmetic Mean = 2370/40 = 59.25.


Example 7.6: Calculate the mean overtime hours worked by the employees in an organization for the
following distribution.

Overtime in hours   10-15   15-20   20-25   25-30   30-35   35-40   Total
No. of employees    11      20      35      20      8       6       100

Answer:

Overtime in hours       10-15   15-20   20-25   25-30   30-35   35-40   Total
No. of employees (fi)   11      20      35      20      8       6       100 = N
Mid values (xi)         12.5    17.5    22.5    27.5    32.5    37.5
fi xi                   137.5   350     787.5   550     260     225     2310

The mean is given by X̄ = (Σ fi xi)/N = 2310/100 = 23.1 hours.
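The grouped mean of Example 7.6 can be verified with a few lines of code; this is only a check of the arithmetic, not a new method.

    # Sketch: grouped arithmetic mean from mid-values and class frequencies
    mids = [12.5, 17.5, 22.5, 27.5, 32.5, 37.5]
    freq = [11, 20, 35, 20, 8, 6]

    N = sum(freq)                                      # 100
    mean = sum(f * x for f, x in zip(freq, mids)) / N
    print(mean)                                        # 23.1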

7.3.4. Deviation (assumed mean) method:

7.3.4.1. Ungrouped data: Let X1, X2, …, Xn be a set of 'n' observations and let di = Xi – A, where A is the
assumed mean (any arbitrary value); then their arithmetic mean is given by
X̄ = A + (Σ di)/n.

7.3.4.2. Discrete frequency distribution: Let X1, X2, …, Xk be a set of 'k' observations with corresponding
frequencies f1, f2, …, fk and let di = Xi – A, where A is the assumed mean; then their arithmetic
mean is given by
X̄ = A + (Σ fi di)/N, where N = Σ fi.

7.3.4.3. Continuous frequency distribution: Let x1, x2, …, xk be the mid-values of the 'k' class intervals in
a continuous distribution with corresponding frequencies f1, f2, …, fk and let di = xi – A, where A is the assumed
mean; then their arithmetic mean is given by
X̄ = A + (Σ fi di)/N, where N = Σ fi.

If di = (xi – A)/h, where A is the assumed mean and h is the width of the class interval, then their
arithmetic mean is given by
X̄ = A + h (Σ fi di)/N, where N = Σ fi.

Example 7.7: Calculate the mean daily wage from the following data.

Daily earnings (Rs.)   100   120   140   160   180   200   220
Number of employees    3     6     10    15    24    42    75

Answer: Take the assumed mean A = 160, so di = Xi – 160.

Daily earnings (Xi)   Number of employees (fi)   di = Xi – 160   fi di
100                   3                          -60             -180
120                   6                          -40             -240
140                   10                         -20             -200
160 (A)               15                         0               0
180                   24                         20              480
200                   42                         40              1680
220                   75                         60              4500
Total                 N = 175                                    6040

The arithmetic mean or average is X̄ = A + (Σ fi di)/N = 160 + 6040/175 = 160 + 34.51 = Rs. 194.51.
7.3.5. Properties of Arithmetic mean:


1. The algebraic sum of the deviations of a set of values from their arithmetic mean is zero.
2. The sum of the squares of the deviations of a set of values is minimum when taken about the mean.
3. If X̄1, X̄2, …, X̄l are the means of l series of sizes n1, n2, …, nl (i = 1, 2, …, l) respectively, then the mean
of the composite series is given by
X̄ = (n1X̄1 + n2X̄2 + … + nlX̄l)/(n1 + n2 + … + nl) = (Σ ni X̄i)/(Σ ni).
7.3.6. Merits of Arithmetic mean:


1. It is rigidly defined.
2. It is easy to calculate and easy to understand.
3. It is based on all observations.
4. It is amenable to mathematical treatment.
5. It is less affected by fluctuations of sampling.
77
7.3.7. Demerits of Arithmetic Mean:
1. It can neither be determined by inspection nor located graphically.
2. It cannot be obtained when a single observation is missing.
3. It is affected much by extreme values.
4. It leads to wrong conclusions if the details of the data from which it is computed are not given.
5. It cannot be used when dealing with qualitative characteristics which cannot be measured
quantitatively.
6. It cannot be calculated if an extreme class is open.
7. In an extremely asymmetrical (skewed) distribution, the arithmetic mean is not a suitable measure of location.

7.3.8. Weighted mean: Let X1, X2, …, Xn be a random sample drawn from a population and w1, w2, …, wn be the
weights (importance) assigned to the observations; then their weighted arithmetic mean or weighted mean is given by
X̄w = (Σ wi Xi)/(Σ wi).
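A minimal sketch of the weighted mean, with illustrative values and weights only:

    # Sketch: weighted mean; weights of 1 reduce this to the ordinary mean
    values  = [60, 75, 90]      # illustrative scores
    weights = [1, 2, 3]         # illustrative importance weights

    w_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    print(w_mean)               # (60 + 150 + 270) / 6 = 80.0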

7.4. GEOMETRIC MEAN:


The geometric mean of n observations is the nth root of the product of those n observations. It is useful in
finding average growth rates, especially population growth rates, and in averaging ratios, percentages and
rates of change. It is also useful in the construction of index numbers.

7.4.1. Ungrouped data: Let X1, X2, …, Xn be a set of 'n' observations; then their geometric mean G is the nth
root of the product of those n observations and is given by
G = (X1 X2 … Xn)^(1/n) = antilog[(Σ log Xi)/n].
That is, the anti-logarithm (antilog) of the arithmetic mean of the logarithms (log) of the observations is called
the geometric mean.

7.4.2. Discrete frequency distribution: Let X1, X2, …, Xk be a set of 'k' observations with corresponding
frequencies f1, f2, …, fk; then their geometric mean G is given by
G = antilog[(Σ fj log Xj)/N], where N = Σ fj.

7.4.3. Continuous frequency distribution: Let x1, x2, …, xk be the mid-values of the 'k' class intervals in a
continuous distribution with corresponding frequencies f1, f2, …, fk; then their geometric mean G is given by
G = antilog[(Σ fj log xj)/N], where N = Σ fj.

Example 7.8: Find the average rate of increase in prices if in the first year prices increased by 20 percent,
in the next year by 25 percent and in the third year by 44 percent.
Answer: The average rate of increase is the GM of the increases in prices.

Average rate = (20 × 25 × 44)^(1/3) = (22000)^(1/3) = 28.02%.

Example 7.9: Calculate the G.M. for the following data.

Class interval   0-10   10-20   20-30   30-40
Frequency (fi)   5      8       3       4

Answer:

Mid values (xi)   5        15       25       35       Total
Frequency (fi)    5        8        3        4        20
fi log xi         3.4945   9.4088   4.1937   6.1764   23.2734

G.M. = antilog(23.2734/20) = antilog(1.1637) = 14.58.
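Both answers can be checked with the log form of the geometric mean; a minimal Python sketch:

    # Sketch: geometric means of Examples 7.8 and 7.9
    import math

    # Example 7.8: G.M. of the three percentage increases
    rates = [20, 25, 44]
    gm = math.exp(sum(math.log(r) for r in rates) / len(rates))
    print(round(gm, 2))            # 28.02

    # Example 7.9: grouped G.M. from mid-values and frequencies
    mids, freq = [5, 15, 25, 35], [5, 8, 3, 4]
    N = sum(freq)
    gm_grouped = 10 ** (sum(f * math.log10(x) for f, x in zip(freq, mids)) / N)
    print(round(gm_grouped, 2))    # about 14.58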

7.4.4. Merits of Geometric Mean:


1. It is rigidly defined
2. It is based on all observations
3. It is suited for further mathematical treatment
4. It is not affected much by fluctuations of sampling

5. If n1 and n2 are the sizes, and G1 and G2 the geometric means, of two series respectively, then the geometric
mean G of the combined series is given by
log G = (n1 log G1 + n2 log G2)/(n1 + n2).

6. It gives comparatively more weight to small items.


7.4.5. Demerits of Geometric Mean:
1. It is not easy for a non-mathematical person to understand or to calculate.
2. If any one of the observations is zero, the geometric mean becomes zero.
3. If any one of the observations is negative, the geometric mean becomes imaginary.

7.5 HARMONIC MEAN: The harmonic mean of n observations is defined as the reciprocal of the arithmetic
mean of the reciprocals of the observations. It is particularly useful in averaging rates and ratios. It is the most
appropriate average where the time factor is variable and the act being performed, such as covering a distance,
is constant; for example, it is useful in finding the average speed.

7.5.1 Ungrouped data: Let X1, X2, …, Xn be a set of 'n' observations; then their harmonic mean HM is
given by
HM = n / (Σ (1/Xi)).

7.5.2 Discrete frequency distribution: Let X1, X2, …, Xk be a set of 'k' observations with corresponding
frequencies f1, f2, …, fk; then their harmonic mean is given by
HM = N / (Σ (fj/Xj)), where N = Σ fj.

7.5.3 Continuous frequency distribution: Let x1, x2, …, xk be the mid-values of the 'k' class
intervals in a continuous distribution with corresponding frequencies f1, f2, …, fk; then their harmonic mean
is given by HM = N / (Σ (fj/xj)), where N = Σ fj.

Example 7.10: A cyclist travelled from A to B at a speed of 15 kmph and from B to A at a speed of 20 kmph.
What is the average speed of the cyclist for the entire round trip?
Answer: The average speed is the harmonic mean of the two speeds x = 15 kmph and y = 20 kmph.

Average speed = 2/((1/x) + (1/y)) = 2xy/(x + y) = 2(15)(20)/(15 + 20) = 600/35 = 17.14 kmph.

Example 7.11: An investor buys Rs. 20,000/- worth of shares of a company each month, and during the
first three months he bought the shares at prices of Rs. 120, Rs. 160 and Rs. 210. After the three months,
what is the average price paid by him for the shares?
Answer: Since an equal amount is invested at each price, the required average is the harmonic
mean of the prices paid in the three months.

Average price = 3/((1/120) + (1/160) + (1/210)) = 3/0.01935 = Rs. 155.08.

Example 7.12: Calculate the H.M. of 8, 11, 16, 25.

Answer: HM = 4/((1/8) + (1/11) + (1/16) + (1/25)) = 4/0.3184 = 12.56.

Example 7.13: Calculate the H.M. of the following.

Class interval   2-4   4-6   6-8   8-10
Frequency (fi)   13    25    37    25

Answer:

Class Interval   2-4     4-6   6-8     8-10    Total
Mid value (xi)   3       5     7       9
Frequency (fi)   13      25    37      25      100 = N
fi (1/xi)        4.333   5     5.286   2.778   17.397

HM = N / (Σ fi(1/xi)) = 100/17.397 = 5.748.
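A short sketch verifying the harmonic means of Examples 7.10 and 7.12:

    # Sketch: harmonic mean as the reciprocal of the mean of reciprocals
    def harmonic_mean(xs):
        return len(xs) / sum(1.0 / x for x in xs)

    print(round(harmonic_mean([15, 20]), 2))         # 17.14 kmph average speed
    print(round(harmonic_mean([8, 11, 16, 25]), 2))  # 12.56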

7.5.4 Merits of Harmonic Mean:


1. It is rigidly defined.
2. It is based on all observations.
3. It is amenable to mathematical treatment.
4. It is not affected much by fluctuations of sampling.
7.5.5 Demerits of Harmonic Mean:
1. It is not easy to understand.
2. It is difficult to compute.
3. If any one of the observations is zero, the harmonic mean becomes infinite.

7.6 MEDIAN:
Median is the value of the variable which divides the data into two equal parts: it is that value which, when
the observations are arranged in ascending or descending order, divides them into two equal halves. It is
useful, for example, in fixing the height of chairs or the length of berths in a railway compartment, where
discomfort is minimized by using the median of the relevant body measurements (such as the length of the
leg below the knee, or the height) of the individuals.
In the process of finding the median, the data must first be arranged either in ascending order or in
descending order, that is, according to magnitude.
7.6.1. Ungrouped data: In the case of ungrouped data,
if the number of observations n is odd, the median is the value of the middle observation, i.e. the ((n+1)/2)th
observation, after arranging them in ascending or descending order;
if the number of observations n is even, the median is the average of the two middle observations, that is,

Median = [value of the (n/2)th observation + value of the (n/2 + 1)th observation]/2.
Example 7.14: Find the median of the following two sets:
a) 30, 28, 16, 14, 5, 13, 12
b) 42, 43, 38, 36, 37, 35
Answer:
a) After arranging them according to magnitude, the data are 5, 12, 13, 14, 16, 28, 30.
Since the number of observations n = 7 is odd, the median is the ((n+1)/2)th item = 4th item = 14.
b) After arranging them according to magnitude, the data are 35, 36, 37, 38, 42, 43.
Since the number of observations n = 6 is even, the median is the average of the (n/2)th and
((n/2) + 1)th items
= (3rd item + 4th item)/2 = (37 + 38)/2 = 37.5.
7.6.2. Discrete distribution: In the case of a discrete distribution, the median is the value of x corresponding to
the less than cumulative frequency which is just greater than or equal to N/2.

Example 7.15: Calculate the median for the following distribution

X 1 2 3 4 5 6 7 8 9
f 8 10 11 16 20 25 15 9 6

Answer:
Calculate the cumulative frequency as follows

x f cumulative frequency
1 8 8
2 10 18
3 11 29
4 16 45
5 20 65
6 25 90
7 15 105
8 9 114
9 6 N=120
Median is the value of x which corresponds to the cumulative frequency which is just greater than or
equal to the N/2= 60 value. Hence median is 5.
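The two median rules can be expressed in a few lines of code; the sketch below reproduces Examples 7.14 and 7.15.

    # Sketch: median for ungrouped data and a discrete frequency distribution
    def median_ungrouped(xs):
        s, n = sorted(xs), len(xs)
        mid = n // 2
        return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

    print(median_ungrouped([30, 28, 16, 14, 5, 13, 12]))   # 14
    print(median_ungrouped([42, 43, 38, 36, 37, 35]))      # 37.5

    # Discrete case: first x whose cumulative frequency reaches N/2
    x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    f = [8, 10, 11, 16, 20, 25, 15, 9, 6]
    N, cum = sum(f), 0
    for xi, fi in zip(x, f):
        cum += fi
        if cum >= N / 2:
            print(xi)      # 5
            break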

7.6.3. Continuous frequency distribution: In the case of a continuous frequency distribution, let f1, f2, …, fk
be the frequencies corresponding to the 1st, 2nd, …, kth class intervals respectively; then the median of the
distribution is given by

Median = l + (h/f)(N/2 – m),

where l is the lower limit of the median class (the class whose cumulative frequency is just greater than or
equal to N/2), f its frequency, h its width and m the cumulative frequency up to the class preceding the
median class.
7.6.5. Demerits of Median:
1. In the case of an even number of observations, the median cannot be determined exactly.
2. It is not based on all the observations.
3. It is affected more by fluctuations of sampling when compared to the mean.
4. It is not amenable to algebraic treatment.

7.7 MODE:
It is the value which occurs most frequently in a set of observations and around which the other items of
the set cluster densely. It is useful, for example, in the readymade garments business.
7.7.1. Ungrouped and discrete frequency distribution: In the case of ungrouped data and discrete frequency
distributions, the mode is that value corresponding to the maximum frequency.

Example 7.17: Find the modal value of the following data:

7, 8, 10, 12, 12, 10, 10, 10, 15, 11, 11.
Answer: Here 10 is repeated the greatest number of times (4). Hence Mode = 10.
Example 7.18: For the data given in Example 7.2, find its modal value.
Answer: Here the value 50 has a higher frequency (10) than the others. Hence Mode = 50.

7.7.2. Continuous frequency distribution: In the case of a continuous frequency distribution, let f1, f2, …, fk
be the frequencies corresponding to the 1st, 2nd, …, kth class intervals respectively; then the mode of the
distribution is given by

Mode = l + h(f1 – f0)/(2f1 – f0 – f2),

where
l = lower limit of the modal class interval (the class corresponding to the highest frequency),
f1 = frequency of the modal class interval,
f0 = frequency of the class interval just preceding the modal class interval,
f2 = frequency of the class interval just succeeding the modal class interval,
h = width of the modal class interval.
Example 7.19: Find the mode for the following frequency distribution.

Profits (000's Rs.)   0-10   10-20   20-30   30-40   40-50   50-60   60-70   70-80
No. of shops (fi)     5      8       7       12      28      20      10      10

Answer:
The modal class is 40-50, as it has the maximum frequency.

Mode = l + h(f1 – f0)/(2f1 – f0 – f2),

where
l = lower limit of the modal class = 40,
f1 = frequency of the modal class = 28,
f0 = frequency of the class just preceding the modal class = 12,
f2 = frequency of the class just succeeding the modal class = 20,
h = width of the class interval = 10.

Mode = 40 + 10(28 – 12)/(2 × 28 – 12 – 20) = 40 + 160/24 = 46.67.
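A quick check of this calculation:

    # Sketch: grouped mode for Example 7.19
    l, h = 40, 10             # lower limit and width of the modal class 40-50
    f1, f0, f2 = 28, 12, 20   # modal, preceding and succeeding class frequencies

    mode = l + h * (f1 - f0) / (2 * f1 - f0 - f2)
    print(round(mode, 2))     # 46.67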

7.7.3. Merits of Mode:


1. It is easy to calculate and easy to understand.
2. It is not at all affected by extreme values.
3. It can be located even if the frequency distribution has class intervals of unequal magnitude provided
the modal class and the class preceding and succeeding it are of the same magnitude.
7.7.4. Demerits of Mode:
1. It is not well defined.
2. It is not based on all observations.
3. It is not capable of mathematical treatment.
4. It is affected heavily by fluctuations of sampling.
7.7.5. Relation between mean, median and mode: For a moderately asymmetrical (skewed) distribution, the
relation is given by
Mode = 3 Median – 2 Mean.
Activity 1: The following distribution gives the overtime hours done in a week by 100 employees in a
factory. Find the Arithmetic Mean, median and mode.

Overtime hours     10-15   15-20   20-25   25-30   30-35   35-40
No. of employees   11      20      35      20      8       6

Activity 2: The data on a recently admitted batch of 150 students are given below. Find the median age of the
students admitted into the MBA course.

Age 19 20 21 22 23 24 25 >25
Number 20 30 35 25 20 10 5 5

Activity 3: Calculate the AM, GM and HM for the following data and obtain the relation between the AM, GM
and HM.
32, 35, 36, 37, 39, 41 and 43.

7.8. SUMMARY:
In this unit, the different measures of central tendency used in the univariate analysis of data were presented.
The formulas for finding those values and their necessity in the interpretation of data were also discussed.
Review Questions:
1. What are the different measures used in the univariate analysis of data?
2. Explain the different methods of calculating the Arithmetic Mean.
3. Explain the method of calculating the Median for a continuous distribution of data.
4. Briefly explain the specific uses of the different measures of central tendency other than the Arithmetic
Mean.

Further readings:
1. Kothari, C.R., Research Methodology – Methods and Techniques, Wishwa Prakashan, New Delhi.
2. Naval Bajpai, Business Research Methods, Dorling Kindersley (India) Pvt. Ltd.
3. Richard I Levin and David S Rubin, Statistics for Management, Prentice-Hall of India Private Limited, New Delhi.
4. R P Hooda, Introduction to Statistics, Macmillan India Ltd, New Delhi.
5. J K Sharma, Business Statistics, Dorling Kindersley (India) Pvt. Ltd., New Delhi.
6. S C Gupta and V K Kapoor, Fundamentals of Mathematical Statistics, Sultan Chand & Sons, Delhi.

UNIT - 8
UNIVARIATE ANALYSIS-II
MEASURES OF DISPERSION, SKEWNESS, AND KURTOSIS.

OBJECTIVES:
After studying this Unit, the reader should be able to
1. Explain the necessity of different measures of dispersion.
2. Explain the calculation of different measures of dispersion.
3. Explain the concept of skewness and its different measures.
4. Explain the concept of kurtosis and its measurement.

STRUCTURE:
8.1 Introduction
8.2. Range
8.3. Quartile deviation
8.4. Mean deviation
8.5. Standard deviation
8.6. Moments
8.7. Measures of Skewness
8.8. Measures of Kurtosis
8.9. Summary
8.10 Review Questions
8.11 Further Readings

8.1. INTRODUCTION:
The measures of central tendency serve to locate the center of the distribution, but they do not reveal how
the actual observations are spread around the central value. This type of spread is explained by
dispersion. The dispersion of values is indicated by the extent to which these values tend to spread over an
interval rather than cluster closely around an average. Dispersion describes the scatter in the
data; that is, it gives an idea about the homogeneity or heterogeneity of the distribution. Homogeneous means
the data are less dispersed or less deviated; heterogeneous means the data are more dispersed or more
deviated. There are four measures of dispersion: Range, Quartile Deviation, Mean Deviation
and Standard Deviation / Variance. These are absolute measures of dispersion.
The corresponding relative measures of dispersion are the coefficient of range, coefficient of quartile
deviation, coefficient of mean deviation and coefficient of variation. These relative measures are useful
for comparing two or more series of data.

8.2. RANGE:
It gives the difference between the two extreme observations of the distribution. If A and B are the largest
and smallest observations respectively in the data, then
Range = Largest observation (A) – Smallest observation (B),

and the coefficient of range is given by (A – B)/(A + B).

This measure cannot describe the deviations between individual observations.


Example 8.1: Find the range and coefficient of range for the following:
a) Marks obtained by 10 students in Research methodology internal examination are as follows:
15,12,18,09,23,22,17,19,12,15
b) Following are the quantity demanded and its frequency:

Quantity demanded 6 12 18 24 30 36 42
Frequency 4 7 9 18 15 10 5

c) Dividend yield by 45 companies are as follows:

Dividend yield in 000’s 0-6 6-12 12-18 18-24 24-30 30-34 34-42
No. of companies 6 8 10 9 5 4 3

Answers:
a) Maximum value A = 23 and minimum value B = 09.
Range = Maximum value – Minimum value = 23 – 09 = 14.

Coefficient of Range = (A – B)/(A + B) = 14/32 = 0.4375.

b) Maximum value A = 42, minimum value B = 6.

Range = 42 – 6 = 36 and coefficient of range = (42 – 6)/(42 + 6) = 36/48 = 0.75.
c) Maximum value A = 42, minimum value B = 0.

Range = 42 – 0 = 42 and coefficient of range = (42 – 0)/(42 + 0) = 1.

8.3. QUARTILE DEVIATION (OR) SEMI-INTER QUARTILE RANGE:

Here the entire data is divided into four equal parts; each part is known as one quarter. Let Q1, Q2 and Q3
be the three values which divide the entire data (after arranging them in increasing or decreasing order)
into four equal parts; then Q1, Q2 and Q3 represent the first, second and third quartile values of the given
data. Here Q2 is the same as the median.
The inter-quartile range is the difference between the third and first quartiles, i.e. Q3 – Q1.
8.3.1. Semi-inter quartile range or Quartile deviation:
The semi-inter quartile range or quartile deviation is half the inter-quartile range.

That is, Quartile Deviation QD = (Q3 – Q1)/2.

8.3.2. Coefficient of Quartile deviation:

The coefficient of quartile deviation is [(Q3 – Q1)/2] / [(Q3 + Q1)/2] = (Q3 – Q1)/(Q3 + Q1).

8.3.3. Ungrouped data: Let X1, X2, …, Xn be a set of 'n' observations arranged according to magnitude; then
their quartiles can be obtained as follows:

i. If the number of observations is odd, then Q1 is the value of the observation in the ((n+1)/4)th
position and Q3 is the value of the observation in the (3(n+1)/4)th position.

ii. If the number of observations is even, then Q1 is the value of the observation in the (n/4)th
position and Q3 is the value of the observation in the (3n/4)th position (averaging the adjacent
observations where the position is not an integer).

8.3.4. Discrete frequency distribution: Let X1, X2, …, Xk be a set of 'k' observations with corresponding
frequencies f1, f2, …, fk and Σ fj = N. After the data are arranged according
to magnitude, the X values whose cumulative frequencies are just greater than or equal to N/4 and 3N/4
give Q1 and Q3 respectively.

8.3.5. Continuous frequency distribution: Let x1, x2, …, xk be the mid-values of the 'k' class intervals in a
continuous distribution with corresponding frequencies f1, f2, …, fk; then the quartiles are given by

Q1 = l1 + (c1/f1)(N/4 – m1) and Q3 = l3 + (c3/f3)(3N/4 – m3), where N = Σ fj and

l1 = lower limit of the first quartile class interval,
f1 = frequency of the first quartile class interval,
l3 = lower limit of the third quartile class interval,
f3 = frequency of the third quartile class interval,
m1 = cumulative frequency up to (preceding) the first quartile class interval,
m3 = cumulative frequency up to (preceding) the third quartile class interval,
c1 = length of the first quartile class interval,
c3 = length of the third quartile class interval.

First quartile class interval: It is the class corresponding to the cumulative frequency which is just greater
than or equal to N/4.

Third quartile class interval: It is the class corresponding to the cumulative frequency which is just
greater than or equal to 3N/4.

Example 8.2: Find the Quartile deviation (QD) and Coefficient of Quartile Deviation for the following:

Class Intervals 161-163 163-165 165-167 167-169 169-171 171-173 Total


frequency 3 7 14 12 10 4 50

Answer:

Class Intervals   Frequency    Cumulative frequency
161-163           3            3
163-165           7            10 = m1
165-167           14 = f1      24    (1st quartile class, l1 = 165, c1 = 2)
167-169           12           36 = m3
169-171           10 = f3      46    (3rd quartile class, l3 = 169, c3 = 2)
171-173           4            50
Total             50 = N

N = 50, N/2 = 25, N/4 = 12.5 and 3N/4 = 37.5.

The first quartile class interval is 165-167 and the third quartile class interval is 169-171.
Q1 = 165 + 2(12.5 – 10)/14 = 165.357 and Q3 = 169 + 2(37.5 – 36)/10 = 169.3.
QD = (169.3 – 165.357)/2 = 1.9715,
and Coefficient of QD = (169.3 – 165.357)/(169.3 + 165.357) = 3.943/334.657 = 0.0118.
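The interpolation can be packaged as a small function; the sketch below reproduces Q1, Q3 and the QD of this example (the function name is ours).

    # Sketch: grouped quartiles by interpolation (Example 8.2)
    def quartile(p, classes, freq):
        # p is N/4 for Q1 or 3N/4 for Q3; classes are (lower, upper) pairs
        cum = 0
        for (lo, hi), f in zip(classes, freq):
            if cum + f >= p:
                return lo + (hi - lo) * (p - cum) / f
            cum += f

    classes = [(161,163), (163,165), (165,167), (167,169), (169,171), (171,173)]
    freq = [3, 7, 14, 12, 10, 4]
    N = sum(freq)
    q1 = quartile(N / 4, classes, freq)
    q3 = quartile(3 * N / 4, classes, freq)
    print(round(q1, 3), round(q3, 3), round((q3 - q1) / 2, 4))  # 165.357 169.3 1.9715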
8.4. MEAN DEVIATION:
This is also called the mean absolute deviation. The two measures of dispersion discussed so far, namely the
range and the quartile deviation, do not show how the observed values of the given data set are scattered
around a measure of central tendency. So it is reasonable to measure variation as the degree
(amount) to which values within a data set deviate from the mean, median or mode. Moreover, since the sum
of the deviations of observations from the arithmetic mean is always zero, that sum adds nothing to the
information. To make the measure meaningful, it is advisable to consider the absolute values of the deviations
rather than the actual deviations. In general, considering the absolute values of the deviations of observations
from a measure of central tendency is meaningful whenever the sign of the deviations has no importance.
The average of such absolute deviations is called the mean deviation. The mean deviation can be taken about
the AM, the median or the mode; however, the mean deviation about the median is practically more
important than the others and has very good practical applications.

8.4.1. Ungrouped data: Let X1, X2, …, Xn be a set of 'n' observations and let A be an arbitrary value representing
any of the measures of central tendency; then the mean deviation is given by

MD(A) = (1/n) Σ |Xi – A|.

Since A is an arbitrary value, it can be the mean, median or mode; then

Mean deviation about the mean: MD(X̄) = (1/n) Σ |Xi – X̄|,

Mean deviation about the median: MD(Median) = (1/n) Σ |Xi – Median|,

Mean deviation about the mode: MD(Mode) = (1/n) Σ |Xi – Mode|.

8.4.2. Coefficient of mean deviation: The coefficient of mean deviation is given as follows:

i. Coefficient of mean deviation about the mean = MD(X̄)/X̄.

ii. Coefficient of mean deviation about the median = MD(Median)/Median.

iii. Coefficient of mean deviation about the mode = MD(Mode)/Mode.

Example 8.3: Calculate the mean deviation about median and its coefficient of MD for the following
data on marks obtained by 5 students.
23,41,29,53,38
Answer: After arranging as per magnitude, the given data can be written as 23, 29, 38, 41, 53.
The median is the middle value, hence Median = 38 and n = 5.
|Xi – Median|: 15, 9, 0, 3, 15.

Mean deviation about the median = (1/n) Σ |Xi – Median| = 42/5 = 8.4.

Coefficient of MD about the median = 8.4/38 = 0.22.

8.4.3. Discrete frequency distribution: Let X1, X2, …, Xk be a set of 'k' observations with corresponding
frequencies f1, f2, …, fk and let A be an arbitrary value representing any of the measures of central tendency;
then the mean deviation is given by
MD(A) = (1/N) Σ fj |Xj – A|, where N = Σ fj.

8.4.4. Continuous frequency distribution: Let x1, x2, …, xk be the mid-values of the 'k' class intervals in a
continuous distribution with corresponding frequencies f1, f2, …, fk and let A be an arbitrary value representing
any of the measures of central tendency; then the mean deviation is given by

MD(A) = (1/N) Σ fj |xj – A|, where N = Σ fj.

Example 8.4: The following data give the sales (in units) of a brand of TV over different numbers of days.
Find the mean deviation about the mean. Also, calculate its coefficient.
Sales 50-100 100-150 150-200 200-250 250-300 300-350


Days 11 23 44 19 8 7

Answer: The following table gives the information for the calculation of the AM and the mean deviation.

Sales     Mid value (xi)   Frequency (fi)   fi xi   |xi – X̄|   fi|xi – X̄|
50-100    75               11               825     104.91      1154.01
100-150   125              23               2875    54.91       1262.93
150-200   175              44               7700    4.91        216.04
200-250   225              19               4275    45.09       856.71
250-300   275              8                2200    95.09       760.72
300-350   325              7                2275    145.09      1015.63
Total                      112              20150               5266.04

X̄ = 20150/112 = 179.91.

Mean deviation about the mean = (1/N) Σ fi|xi – X̄|, where N = Σ fi.

Therefore MD = 5266.04/112 = 47.01.
Coefficient of mean deviation = 47.01/179.91 = 0.2613.
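A short check of this calculation:

    # Sketch: grouped mean deviation about the mean (Example 8.4)
    mids = [75, 125, 175, 225, 275, 325]
    freq = [11, 23, 44, 19, 8, 7]

    N = sum(freq)                                           # 112
    mean = sum(f * x for f, x in zip(freq, mids)) / N       # 179.91
    md = sum(f * abs(x - mean) for f, x in zip(freq, mids)) / N
    print(round(mean, 2), round(md, 2), round(md / mean, 4))  # 179.91 47.02 0.2613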

8.5. STANDARD DEVIATION:
In the mean deviation, the signs of the deviations are simply ignored without assigning any reason. The
mathematically valid way of removing the sign is to consider the squared values. Instead of computing the
average of the absolute deviations from the arithmetic mean, the average of the squared deviations from the
mean can be computed: take the sum of all the squared deviations of the observations from the arithmetic mean
and divide it by the number of observations. This value is called the variance. It is also called the mean square
deviation, as it is the average of the squared deviations from the arithmetic mean.
The positive square root of the variance is called the standard deviation; that is, the root mean square
deviation of the observations from the arithmetic mean is called the standard deviation. It is the most important
measure in applications and is amenable to mathematical computation. The standard deviation is denoted by σ
and the variance by σ².

8.5.1. Ungrouped data: Let X1, X2, …, Xn be a set of 'n' observations with arithmetic mean X̄;
then the standard deviation is given by

σ = sqrt[(1/n) Σ (Xi – X̄)²] = sqrt[(1/n) Σ Xi² – X̄²].

Since the variance is the square of the standard deviation,

Variance = σ² = (1/n) Σ (Xi – X̄)² = (1/n) Σ Xi² – X̄².

Example 8.5: Find the standard deviation of the following data on marks in an internal examination:
10,14,26,25,25

Answer: Arithmetic Mean = 100/5 = 20.

Variance = (1/n) Σ Xi² – X̄² = 2222/5 – 20² = 444.4 – 400 = 44.4.

Standard deviation = √44.4 = 6.66.
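A minimal check of Example 8.5 using the computational form of the variance:

    # Sketch: variance and standard deviation of ungrouped data
    marks = [10, 14, 26, 25, 25]
    n = len(marks)

    mean = sum(marks) / n
    variance = sum(x * x for x in marks) / n - mean ** 2
    print(mean, variance, variance ** 0.5)   # 20.0 44.4 6.66...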

8.5.2. Discrete frequency distribution: Let X1, X2, …, Xk be a set of 'k' observations with corresponding
frequencies f1, f2, …, fk and arithmetic mean X̄ = (Σ fj Xj)/N, where N = Σ fj; then the
standard deviation is given by

σ = sqrt[(1/N) Σ fj (Xj – X̄)²] = sqrt[(1/N) Σ fj Xj² – X̄²],

and the variance is σ² = (1/N) Σ fj Xj² – X̄².
Example 8.6:
In a survey of 50 industries in an industrial area the following data are collected. Using this calculate the
Variance and Standard deviation.

Profit (in lakhs Rs.) 10 15 20 25 30


Number of industries 15 10 15 6 4

Answer:

Profit (Xi) No. of industries fi fiXi fiXi2


10 15 150 1500
15 10 150 2250
20 15 300 6000
25 6 150 3750
30 4 120 3600
Total 50 870 17100

Mean =870/50=17.4
Variance= (17100/50)-17.42= 342-302.76 =39.24
Standard Deviation =6.2642.

8.5.3. Continuous frequency distribution: Let x1, x2, …, xk be the mid-values of the 'k' class intervals in a
continuous distribution with corresponding frequencies f1, f2, …, fk and arithmetic mean
X̄ = (Σ fj xj)/N, where N = Σ fj; then the standard deviation is given by

σ = sqrt[(1/N) Σ fj (xj – X̄)²] = sqrt[(1/N) Σ fj xj² – X̄²],

and the variance is σ² = (1/N) Σ fj xj² – X̄².

Example 8.7: Calculate the standard deviation and variance of the marks from the following table.

Marks             0-10   10-20   20-30   30-40   40-50   50-60
No. of students   12     18      27      20      17      6

Answer:

xi       5     15     25      35      45      55      Total
fi       12    18     27      20      17      6       100
xi fi    60    270    675     700     765     330     2800
xi² fi   300   4050   16875   24500   34425   18150   98300

X̄ = 2800/100 = 28.
Variance = (1/N) Σ fi xi² – X̄² = 98300/100 – 28² = 983 – 784 = 199.
Standard deviation σ = √199 = 14.11.

8.5.4. Deviation (assumed mean) method:

8.5.4.1. Ungrouped data: Let X1, X2, …, Xn be a set of 'n' observations and let di = Xi – A, where
A is the assumed mean (any arbitrary value), so that the arithmetic
mean is X̄ = A + (Σ di)/n; then the standard deviation is given by

σ = sqrt[(1/n) Σ di² – ((1/n) Σ di)²].

8.5.4.2. Discrete frequency distribution: Let X1, X2, …, Xk be a set of 'k' observations with corresponding
frequencies f1, f2, …, fk and let di = Xi – A, where A is the assumed mean, so that the arithmetic

mean is X̄ = A + (Σ fi di)/N, where N = Σ fi; then the standard

deviation is given by σ = sqrt[(1/N) Σ fi di² – ((1/N) Σ fi di)²].

8.5.4.3. Continuous frequency distribution: Let x1, x2, …, xk be the mid-values of the 'k' class intervals in
a continuous distribution with corresponding frequencies f1, f2, …, fk and let di = xi – A, where A is the assumed
mean, so that the arithmetic mean is X̄ = A + (Σ fi di)/N,
where N = Σ fi; then the standard deviation is given by

σ = sqrt[(1/N) Σ fi di² – ((1/N) Σ fi di)²].

If di = (xi – A)/h, where A is the assumed mean and h is the width of the class interval, then the
arithmetic mean is given by

X̄ = A + h (Σ fi di)/N, where N = Σ fi,

and the standard deviation is given by σ = h · sqrt[(1/N) Σ fi di² – ((1/N) Σ fi di)²].

8.5.5. Coefficient of variation:

It is the ratio of the standard deviation to the mean, multiplied by 100. It is used to compare two or more sets
of data. The data having the smaller coefficient of variation are said to be more efficient, more consistent and
more homogeneous (less heterogeneous). It is denoted by CV and is given by CV = (σ/X̄) × 100.

The coefficient of standard deviation is σ/X̄.

Example 8.8: Calculate standard deviation and coefficient of variation for the following data

Xi 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5


fi 2 3 5 6 6 4 6 4 14
Answer:
One must calculate the mean and standard deviation in order to calculate the coefficient of variation.

Xi       2.5    3.5     4.5      5.5     6.5     7.5   8.5     9.5   10.5     Total
fi       2      3       5        6       6       4     6       4     14       50
Xi fi    5      10.5    22.5     33      39      30    51      38    147      376
Xi² fi   12.5   36.75   101.25   181.5   253.5   225   433.5   361   1543.5   3148.5

Mean X̄ = 376/50 = 7.52.

Standard deviation σ = sqrt(3148.5/50 – 7.52²) = sqrt(62.97 – 56.55) = sqrt(6.42) = 2.53.

Coefficient of variation CV = (2.53/7.52) × 100 = 33.7%.

8.5.6. Combined mean and combined standard deviation: Let X̄1, X̄2, …, X̄l be the means of l sets of
observations with sizes n1, n2, …, nl; then their combined mean is given by

X̄ = (n1X̄1 + n2X̄2 + … + nlX̄l)/(n1 + n2 + … + nl).

Further, let σ1², σ2², …, σl² be the variances of the above sets of observations; then the combined

standard deviation is given by σ = sqrt[Σ ni(σi² + di²) / Σ ni], where di = X̄i – X̄.

Example 8.9: Calculate the combined mean and combined standard deviation of the two firms A and B.

           No. of workers   Mean wage   Variance
Firm A     500              185         81
Firm B     600              175         100

Answer: Given that the size of the first firm n1 = 500, the size of the second firm n2 = 600, the mean wage of the
first firm X̄1 = 185 and the mean wage of the second firm X̄2 = 175,

the combined mean wage is X̄ = (500 × 185 + 600 × 175)/(500 + 600) = 197500/1100 = Rs. 179.55.

The deviations of the mean wages from the combined mean are d1 = 185 – 179.55 = 5.45 and d2 = 175 – 179.55 = –4.55.

The combined standard deviation of the wages of the two firms is
σ = sqrt[(500(81 + 29.75) + 600(100 + 20.66))/1100] = sqrt(127773/1100) = sqrt(116.16) = 10.78.
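A quick check of Example 8.9:

    # Sketch: combined mean and combined standard deviation of two groups
    n1, m1, v1 = 500, 185, 81
    n2, m2, v2 = 600, 175, 100

    mean = (n1 * m1 + n2 * m2) / (n1 + n2)                   # 179.55
    d1, d2 = m1 - mean, m2 - mean
    var = (n1 * (v1 + d1 ** 2) + n2 * (v2 + d2 ** 2)) / (n1 + n2)
    print(round(mean, 2), round(var ** 0.5, 2))              # 179.55 10.78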

Example 8.10: Find the missing information in the following table:

Groups               A     B    C     Combined
Number               50    -    90    200
Mean                 113   -    115   116
Standard deviation   6     7    -     7.746

Answer:

Given n1 + n2 + n3 = 200, therefore n2 = 200 – 50 – 90 = 60.

Combined mean X̄ = (n1X̄1 + n2X̄2 + n3X̄3)/200 = 116, that is
50(113) + 60 X̄2 + 90(115) = 23200, so 60 X̄2 = 23200 – 5650 – 10350 = 7200 and X̄2 = 120.

The deviations of the group means about the combined mean are d1 = 113 – 116 = –3, d2 = 120 – 116 = 4 and
d3 = 115 – 116 = –1.

The combined standard deviation is
σ = sqrt[(n1(σ1² + d1²) + n2(σ2² + d2²) + n3(σ3² + d3²))/200] = 7.746.

By squaring and simplifying, 50(36 + 9) + 60(49 + 16) + 90(σ3² + 1) = 200 × 60, that is
2250 + 3900 + 90(σ3² + 1) = 12000, giving σ3² = 64 and σ3 = 8.

Example 8.11: The means and S.D.s of the runs made by three teams are given below. Find the missing values.

                    Group I   Group II   Group III   Combined
Number              175       ?          225         500
Mean performance    220       240        ?           235
S.D.                ?         6.3        5.9         15.4

Answer:

Since n1 + n2 + n3 = 500, n1 = 175 and n3 = 225, we get n2 = 100.

Given that the combined mean is 235,

175(220) + 100(240) + 225 X̄3 = 500 × 235 = 117500,

that is 225 X̄3 = 117500 – 38500 – 24000 = 55000, so X̄3 = 244.4.

The deviations about the combined mean are d1 = 220 – 235 = –15, d2 = 240 – 235 = 5 and d3 = 244.4 – 235 = 9.4.

The combined standard deviation is
σ = sqrt[(175(σ1² + 225) + 100(39.69 + 25) + 225(34.81 + 88.36))/500] = 15.4.
Squaring and simplifying, 175(σ1² + 225) = 500(237.16) – 6469 – 27713 = 84398,
so σ1² + 225 = 482.3, σ1² = 257.3 and σ1 ≈ 16.

Coefficient of variation of the first team = (16/220) × 100 = 7.3%.

Coefficient of variation of the second team = (6.3/240) × 100 = 2.6%.

Coefficient of variation of the third team = (5.9/244.4) × 100 = 2.4%.

Team III has the minimum coefficient of variation among the three teams. Hence the performance of
Team III is more consistent when compared to the rest.

8.6. MOMENTS:
8.6.1. Moments about any arbitrary value: Moments are among the few quantities which help us to
understand and represent the entire data. Two of the moments are the mean and the variance (the square of the
standard deviation). In general, the moments depend on the deviations of the observations in the given set
of data; the deviations can be taken from any arbitrary value A.

The rth moment of a variable X about any point A is given by

μr'(A) = (1/n) Σ (Xi – A)^r, r = 1, 2, 3, …
For ungrouped data: μr'(A) = (1/n) Σ (Xi – A)^r.

For a discrete frequency distribution: μr'(A) = (1/N) Σ fj (Xj – A)^r, where N = Σ fj.

For a continuous frequency distribution: μr'(A) = (1/N) Σ fj (xj – A)^r, where N = Σ fj and xj are the mid-values.

8.6.2. Moments about the mean:

The rth moment of a variable X about any point A is given by μr'(A) = (1/n) Σ (Xi – A)^r. By substituting
X̄ in place of A, one gets the moments about the mean, denoted by μr.

Then μr = (1/n) Σ (Xi – X̄)^r.

For ungrouped data: μr = (1/n) Σ (Xi – X̄)^r.

For a discrete frequency distribution: μr = (1/N) Σ fj (Xj – X̄)^r, where N = Σ fj.

For a continuous frequency distribution: μr = (1/N) Σ fj (xj – X̄)^r, where N = Σ fj.

For r = 1, the moment about the mean is zero (μ1 = 0).

For r = 2, the moment about the mean is the same as the variance (μ2 = σ²), and its square root is the
standard deviation.
8.6.3. Moments about the origin:
If A = 0, the moments about A become the moments about the origin, denoted by νr. Then

For ungrouped data: νr = (1/n) Σ Xi^r.

For a discrete frequency distribution: νr = (1/N) Σ fj Xj^r, where N = Σ fj.

For a continuous frequency distribution: νr = (1/N) Σ fj xj^r, where N = Σ fj.

8.6.4. Relation between moments about the mean and moments about any point, and vice versa:

Second moment about the mean: μ2 = μ2' – (μ1')².
Third moment about the mean: μ3 = μ3' – 3μ2'μ1' + 2(μ1')³.
Fourth moment about the mean: μ4 = μ4' – 4μ3'μ1' + 6μ2'(μ1')² – 3(μ1')⁴.
Second moment about the origin: ν2 = μ2 + X̄².
Third moment about the origin: ν3 = μ3 + 3μ2X̄ + X̄³.
Fourth moment about the origin: ν4 = μ4 + 4μ3X̄ + 6μ2X̄² + X̄⁴.

8.7. MEASURES OF SKEWNESS:


The measures of dispersion discussed so far describe the spread of individual values in a set around a
central value. However, those measures are not sufficient unless one also measures the degree to which the
individual values in the data set deviate from symmetry on both sides of the central value, and the direction
in which they are distributed. This is important because the mean and standard deviation of two or more
data sets may sometimes be equal while the frequency curves differ in shape. So the measurement of the deviation
from the symmetrical (normal) form, called asymmetry or skewness, must be studied. Skewness
gives an idea about the lack of symmetry. A distribution is said to be symmetric if mean = median = mode.
A distribution is said to be skewed if

i. Mean ≠ Median ≠ Mode.

ii. The quartiles Q1 and Q3 are not equidistant from the median (Q2).
iii. The deciles D1 and D9 are not equidistant from the median (also called D5), and the percentiles P10 and
P90 are not equidistant from the median (also called P50).

iv. The third moment about the arithmetic mean is not zero, i.e. μ3 ≠ 0.

v. The curve drawn with the help of the given data is not symmetrical but is stretched more to one side than
the other. A distribution is said to be positively skewed if the curve is stretched more to the right, or
mean > median > mode, and negatively skewed if the curve is stretched more to the left, or
mean < median < mode.
Hence the amount of skewness can be measured by 1. Mean – Mode, 2. Q3 + Q1 – 2 Median, 3. D9 + D1 – 2 Median
or P90 + P10 – 2 Median, and 4. the third moment about the mean, μ3.
Based on these, different coefficients of skewness have been defined, since the above measures are not
independent of the units and cannot be used for comparison. They are
1. Karl Pearson's Coefficient of Skewness
2. Bowley's Coefficient of Skewness
3. Kelly's Coefficient of Skewness
4. Coefficient of Skewness based on Moments

8.7.1. Karl Pearson's coefficient of skewness: Skp = (Mean – Mode)/σ.

8.7.2. Bowley's coefficient of skewness: SkB = (Q3 + Q1 – 2 Median)/(Q3 – Q1).

8.7.3. Kelly's coefficient of skewness:

SkK = (D9 + D1 – 2D5)/(D9 – D1) = (P90 + P10 – 2P50)/(P90 – P10).

8.7.4. Coefficient of skewness based on moments: The coefficient of skewness based on
moments, in the notation used by Karl Pearson, is β1 = μ3²/μ2³.

As suggested by R. A. Fisher, the coefficient of skewness is γ1 = √β1 = μ3/μ2^(3/2).

The direction (sign) of the skewness, whether positive or negative, depends on the sign of the coefficient
of skewness in the first three cases, whereas in the case of the coefficient of skewness based on moments, the
direction (sign) of the skewness depends on the sign of the third moment μ3.

It can also be stated that the distribution is

negatively skewed if μ3 < 0 (γ1 < 0),
positively skewed if μ3 > 0 (γ1 > 0),
and symmetric if μ3 = 0 (γ1 = 0).

Example 8.12: Calculate the Karl Pearson's coefficient of skewness of the wages given below.

Wages      21-26   26-31   31-36   36-41   41-46   46-51   51-56
Laborers   5       15      28      42      15      12      3

Answer:
The mean, mode and standard deviation are to be calculated for finding the Karl Pearson's coefficient of
skewness.

Wages   Laborers (fi)      Mid value (xi)   fi xi   fi xi²
20-26   5                  23               115     2645
26-32   15                 28               420     11760
32-38   28                 33               924     30492
38-44   42 (modal class)   38               1596    60648
44-50   15                 43               645     27735
50-56   12                 48               576     27648
56-62   3                  53               159     8427
Total   120                                 4435    169355

Using the above values in the table it can be found that Mean = 36.96, Mode = 40.05 and
Standard Deviation = 6.727 (the reader can work out these values on his own).

Skp = (Mean – Mode)/σ = (36.96 – 40.05)/6.727 = –0.459.

Example 8.13: For the problem given in Example 8.2, calculate Bowley's coefficient of skewness.
Answer: For that problem it was already calculated that Q1 = 165.36 and Q3 = 169.3.
In a similar way the reader can calculate the median as Q2 = 167.17.

Bowley's coefficient of skewness: SkB = (Q3 + Q1 – 2Q2)/(Q3 – Q1) = (169.3 + 165.36 – 2 × 167.17)/(169.3 – 165.36) = 0.32/3.94 = 0.0812.
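Both coefficients can be checked in a few lines; the inputs below are the values already computed in Examples 8.12 and 8.13.

    # Sketch: Karl Pearson's and Bowley's coefficients of skewness
    mean, mode, sd = 36.96, 40.05, 6.727
    skp = (mean - mode) / sd
    print(round(skp, 3))       # -0.459

    q1, q2, q3 = 165.36, 167.17, 169.3
    skb = (q3 + q1 - 2 * q2) / (q3 - q1)
    print(round(skb, 3))       # about 0.081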

8.8. MEASURES OF KURTOSIS:

In the Greek language, kurtosis means "bulginess". Kurtosis gives an idea about the flatness or peakedness of
the curve. This measure describes the degree of concentration of the observations in a distribution: that
is, it measures whether the observed values are concentrated more around the mode (a peaked curve) or
away from the mode towards both tails of the frequency curve. In real life, two or more
distributions may have the same averages, standard deviations and skewness, but they may have different
degrees of concentration of the values around the mode, and hence different degrees
of peakedness. Kurtosis is measured by the coefficient β2 = μ4/μ2² (due to Karl Pearson) or γ2 = β2 – 3 (due to R.
A. Fisher).
Kurtosis can be classified into three categories, as follows.
A curve is said to be leptokurtic if it is highly peaked; in this case β2 > 3 or γ2 > 0.
A curve is said to be platykurtic (broad) if it is flatter than the normal; in this case β2 < 3 or γ2 < 0.
A curve is said to be mesokurtic (normal) if it is neither flat nor peaked; in this case β2 = 3 and γ2 = 0.
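A minimal sketch computing the moment-based coefficients for raw data (the data values below are illustrative only):

    # Sketch: moment coefficients beta1 (skewness) and beta2 / gamma2 (kurtosis)
    def central_moment(xs, r):
        m = sum(xs) / len(xs)
        return sum((x - m) ** r for x in xs) / len(xs)

    data = [2, 4, 4, 4, 5, 5, 7, 9]       # illustrative values
    mu2, mu3, mu4 = (central_moment(data, r) for r in (2, 3, 4))

    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    gamma2 = beta2 - 3
    print(beta1, beta2, gamma2)   # beta2 < 3 here, so this curve is platykurtic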
Activity 1. Calculate quartile deviation for the following data
Class interval: 2 – 6 6 – 10 10 – 14 14 – 18 18 – 22 22 – 26 26 – 30 30 - 34
Frequency: 11 19 31 57 62 46 29 13
Activity 2: Compute mean deviation about mean for the following data
Class interval: 50 – 100 100 – 150 150 – 200 200 – 250 250 – 300 300 – 350
Frequency: 11 23 44 19 8 7
Activity 3: An analysis of the weekly wages in two firms in an industry gives the following results:

Firm A Firm B
Number of wage workers 550 650
Average daily wages 50 45
Variance of the distribution of wages 90 120
a) Which firm, A or B, pays out a larger amount as daily wages?
b) In which firm, A or B, is there greater variability in individual wages?
c) What are the measures of i) average daily wages and ii) standard deviation of individual wages of
all workers in the two firms together?
Activity 4: The following data relate to the profits (in lakhs of Rupees) of 1000 companies.

Profits            100-120   120-140   140-160   160-180   180-200   200-220   220-240
No. of companies   17        53        199       194       327       208       2
Find Standard deviation, coefficient of variation, Karl Pearson’s coefficient of skewness, Bowley’s
coefficient of skewness.
Activity 5: The following distribution gives the overtime hours done in a week by 100 employees in a
factory. Find the first quartile, third quartile, quartile deviation, coefficient of quartile deviation, and the
Karl Pearson's, Bowley's, Kelly's and moment-based coefficients of skewness. Also find the amount of kurtosis
of the data.

Overtime hours     10-15   15-20   20-25   25-30   30-35   35-40
No. of employees   11      20      35      20      8       6

8.9. SUMMARY:
In this Unit-8, the concepts of measures of dispersion, measures of skewness and measures of kurtosis
were discussed. The different measures of dispersion, namely the range, quartile deviation, mean deviation and
standard deviation, were discussed in depth. Different measures of skewness were also presented and discussed
in detail. The concept of moments was explained, and with the help of those moments the coefficient of
skewness and the measurement of kurtosis were explained.
Review Questions:
1. Critically examine the different measures of variation.
2. What purpose do the measures of variation serve? What is the most common measure of variation?
3. What do you understand by the coefficient of variation? Discuss its importance.
4. What is skewness? Explain in detail, and discuss the different measures of skewness.
5. What is meant by kurtosis? Mention how its measure is calculated.

Further readings:
1. Kothari, C.R., Research Methodology – Methods and Techniques, Wishwa Prakashan, New Delhi.
2. Naval Bajpai, Business Research Methods, Dorling Kindersley (India) Pvt. Ltd.
3. Richard I Levin and David S Rubin, Statistics for Management, Prentice-Hall of India Private Limited, New Delhi.
4. R P Hooda, Introduction to Statistics, Macmillan India Ltd, New Delhi.
5. J K Sharma, Business Statistics, Dorling Kindersley (India) Pvt. Ltd., New Delhi.
6. S C Gupta and V K Kapoor, Fundamentals of Mathematical Statistics, Sultan Chand & Sons, Delhi.
7. U K Srivastava, G V Shenoy and S C Sharma, Quantitative Techniques for Managerial Decisions, New Age
International Publishers, New Delhi.

UNIT - 9
BIVARIATE ANALYSIS -CORRELATION AND REGRESSION

OBJECTIVES:
After studying this Unit, the reader should be able to
• Explain the need for finding the relation between two or more variables, especially between two variables.
• Explain the concepts of correlation and regression.
• Explain the different methods of studying correlation, such as the scatter diagram, Karl Pearson's correlation
coefficient and Spearman's rank correlation coefficient.
• Explain the construction of the regression lines and the regression coefficients.
• Explain the procedure for estimating the value of one variable when the corresponding value of the
other variable is given.

STRUCTURE:
9.1 Introduction
9.2 Types of correlations
9.3. Methods of studying correlation
9.4. Scatter diagram
9.5. Karl Pearson’s Correlation coefficient
9.6. Spearman’s Rank Correlation Coefficient
9.7 Regression
9.8. Lines of Regression
9.9 Regression Coefficients
9.10 Summary
9.11 Review Questions
9.12 Further Readings

9.1 INTRODUCTION:
In Units VII and VIII, different univariate statistical constants were discussed and algorithms for calculating
them were provided; that is, statistical methods relating to only one variable were discussed. Often, however, an
analysis of data concerning two or more variables is needed to look for a statistical relationship or
association between them. The knowledge of such a relationship is important for making inferences from the
covariation between variables in a given situation. It is necessary to use a statistical technique called
correlation analysis to make decisions on the strength of the relationship between variables such as the height
and weight of individuals, income and expenditure, or rainfall and yield, in which movement in one
variable is accompanied by movement in the other variable.
As per Croxton and Cowden, when the relationship is of a quantitative nature, the appropriate statistical
tool for discovering and measuring the relationship and expressing it in a brief formula is called Correlation.
When two or more variables vary such that the moment in one variable causes a corresponding moment in
another then the variables are said to be correlated. Correlation analysis is an analysis of the covariation
between two or more variable. The first variable is known as independent variable and the second variable
is known as dependent variable. The selection of first and second is arbitrary in correlation analysis.
Correlation describes the degree to which an independent and dependent variable is related to one another.
Correlation coefficient or coefficient of correlation is used to measure mutual relation between two or
more variables. It is a descriptive statistic, expressing the magnitude and direction of statistical relationship
of variables.
The statistical relationship between two or more variables can be examined through the following sub-problems:
1. To identify the association or relation, if any, and, if so, its form and degree. 2. To decide whether the
relationship is strong or significant enough to arrive at a desirable conclusion. 3. To decide whether the
relationship can be used for further predictive purposes. The first of these is dealt with under correlation
analysis and the third under regression analysis; the second is discussed in Units XIII and XIV.
In bivariate analysis, the interest is in finding out whether there is any correlation or covariation between
the two variables under study. The variables are said to be correlated if a change in one variable brings
about a change in the other variable.

9.2. TYPES OF CORRELATION:


There are three broad types of correlations. They are 1. Positive, Negative and Zero correlation 2.Linear
and non-linear correlation and 3. Simple, Partial and Multiple correlation.
9.2.1. Positive, negative and zero correlation: If two variables deviate in the same direction, they are
said to be positively correlated; that is, if an increase (decrease) in one variable causes an increase
(decrease) in the other, the variables are positively correlated. For example, as the price of a commodity
increases, the supply of that commodity also increases; price and supply are directly proportional to each
other. So, variables which are directly proportional to each other are said to be positively correlated.
If two variables deviate in opposite directions, they are said to be negatively correlated; that is, if an
increase (decrease) in one variable causes a decrease (increase) in the other, the variables are negatively
correlated. For example, as the price of a commodity increases, the demand for that commodity decreases;
price and demand are inversely proportional to each other. So, variables which are inversely proportional
to each other are said to be negatively correlated.
If there exists no association between the two variables, the variables are said to be uncorrelated, and
zero correlation is said to exist.

9.2.2. Linear and non-linear correlation: The concept of a linear relationship suggests that the two
quantities are proportional to each other: doubling one causes the other to double as well. That is, the
correlation between two variables is said to be linear if the values of the two variables bear a constant
ratio. Non-linear relationships, in general, are any relationships which are not linear. A correlation is
said to be non-linear when the amount of change in the values of one variable does not bear a constant
ratio to the amount of change in the corresponding values of the other variable. What is important in
considering non-linear relationships is that a wider range of possible dependencies is allowed. When there
is very little information to determine what the relationship is, assuming a linear relationship is simplest
and thus a reasonable starting point; however, additional information generally reveals the need to use a
non-linear relationship.
9.2.3. Simple, partial and multiple correlation: The distinction between the three depends on the number
of variables involved in the correlation analysis. If only two variables are chosen to study the correlation
between them, the correlation is known as simple. If the number of variables is more than two (say three
or more) and the correlation between only two of them is studied, ignoring the effect of the other
influencing variable(s), it is known as partial correlation; 'ignoring' here means the effect is assumed to
be constant. The relation between more than two variables considered simultaneously is known as multiple
correlation; in multiple correlation one finds the relation between one dependent variable and many
independent variables.
In this Unit IX, only linear and simple correlation is discussed. Unit X deals with the analysis relating
to more than two variables.

9.3. METHODS OF STUDYING CORRELATION:


There are three broad methods of studying correlation: 1. the scatter diagram, 2. Karl Pearson's coefficient
of correlation and 3. Spearman's rank correlation coefficient.

9.4. SCATTER DIAGRAM: It is a visual representation of the degree of association between two variables,
one being independent and the other dependent. Let the independent variable be represented on the X-axis
and the dependent variable on the Y-axis. Plot the points of the bivariate data on graph paper; this
diagrammatic representation is known as a scatter diagram (bivariate scatter-plot). Different types of
correlation can be represented on graph paper in this way.
The degree of positive relationship between the variables can be represented by the scatter diagram as
follows. If the points show a rising trend from left to right, the variables are positively correlated
(fig-a and fig-b); in this case the value of the correlation lies between 0 and 1. In a scatter diagram,
if the points, when joined, form a rising straight line (having positive slope), the variables are perfectly
positively correlated and the value of the correlation is exactly equal to 1 (fig-c).
The degree of negative relationship between the variables can be represented by the scatter diagram as
follows. If the points show a falling trend from left to right, the variables are negatively correlated
(fig-d and fig-e); in this case the value of the correlation lies between -1 and 0. In a scatter diagram,
if the points, when joined, form a falling straight line (having negative slope), the variables are perfectly
negatively correlated and the value of the correlation is exactly equal to -1 (fig-f).

The case of zero association between the variables is represented in the scatter diagram of fig-g, and
non-linear correlation is shown in fig-h.
The relationships expressed in scatter diagrams are only qualitative; for a quantitative measure of the
relationship, coefficients are better. In this unit two such coefficients are discussed, namely Karl
Pearson's coefficient of correlation and Spearman's rank correlation coefficient.

9.5. KARL PEARSON COEFFICIENT OF CORRELATION:


Karl Pearson’s Correlation coefficient measures the degree of quantitative relationship between two
variables. This measures quantitatively the extent to which two variables X and Y are correlated. It is the
ratio of covariance between the variables X and Y to the product of standard deviation of X and standard
deviation of Y. It is denoted by r(X, Y) or simply r and is given by

r = Cov(X, Y) / (σx σy), where

S.D. of X = σx = √[(1/n) Σ(xi - x̄)²] = √[(1/n) Σxi² - x̄²]

S.D. of Y = σy = √[(1/n) Σ(yi - ȳ)²] = √[(1/n) Σyi² - ȳ²]

Cov(X, Y) = (1/n) Σ(xi - x̄)(yi - ȳ) = (1/n) Σxiyi - x̄ȳ

Then

r = Σ(xi - x̄)(yi - ȳ) / √[Σ(xi - x̄)² Σ(yi - ȳ)²] = [nΣxiyi - ΣxiΣyi] / √{[nΣxi² - (Σxi)²][nΣyi² - (Σyi)²]}

where x̄ = (1/n)Σxi and ȳ = (1/n)Σyi.
The correlation coefficient lies between -1 and 1.

9.5.1 Properties of correlation coefficient:


1. Correlation coefficient lies between -1 and 1.
2. Correlation coefficient is independent of change of origin and scale.

3. If r = ±1, there is a perfect linear relationship between the two variables.

4. Correlation coefficient is a pure number independent of the unit of measurement.


5. The square of r is known as coefficient of determination.
6. If X and Y are independent then they are uncorrelated but the converse is not true.
7. The value of r depends on the slope of the line passing through the data points in the scatter diagram.
8. The sign of the correlation coefficient indicates the direction of the relationship.
9. The magnitude of the correlation indicates the strength of the relationship between variables.
9.5.2. Standard error of correlation coefficient: If r is the correlation coefficient of a sample of n
observations, then the standard error of the correlation coefficient is given by S.E.(r) = (1 - r²)/√n.

9.5.3. Probable error of correlation coefficient: The probable error of the correlation coefficient is given

by P.E.(r) = 0.6745 × (1 - r²)/√n.

Example 9.1:
Calculate the correlation coefficient between expenditure on advertising and sales for the data given
below.

Advertising expenditure ('000 Rs.): 39 65 62 90 82 75 25 98 36 78
Sales (lakhs of Rs.): 47 53 58 86 62 68 60 91 51 84

Answer: Here sales (Y) depend on advertising expenditure (X).

The correlation coefficient between X and Y is r = [nΣxy - ΣxΣy] / √{[nΣx² - (Σx)²][nΣy² - (Σy)²]}.

Compute the following table.

X Y XY X² Y²
39 47 1833 1521 2209
65 53 3445 4225 2809
62 58 3596 3844 3364
90 86 7740 8100 7396
82 62 5084 6724 3844
75 68 5100 5625 4624
25 60 1500 625 3600
98 91 8918 9604 8281
36 51 1836 1296 2601
78 84 6552 6084 7056
650 660 45604 47648 45784

Here n = 10, Σx = 650, Σy = 660, Σxy = 45604, Σx² = 47648, Σy² = 45784. The correlation coefficient between the variables X and Y is

r = [10 × 45604 - 650 × 660] / √{[10 × 47648 - (650)²][10 × 45784 - (660)²]} = 27040 / √(53980 × 22240) = 27040 / 34648.5 = 0.78.

There is a high degree of positive correlation between advertising expenditure and sales.
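As a hedged illustration (assuming Python with NumPy; not part of the original text), the following sketch reproduces this computation both from the raw-sum formula and with NumPy's built-in routine.

import numpy as np

x = np.array([39, 65, 62, 90, 82, 75, 25, 98, 36, 78])  # advertising expenditure
y = np.array([47, 53, 58, 86, 62, 68, 60, 91, 51, 84])  # sales
n = len(x)
num = n * (x * y).sum() - x.sum() * y.sum()              # n*Sxy - Sx*Sy
den = np.sqrt((n * (x**2).sum() - x.sum()**2) *
              (n * (y**2).sum() - y.sum()**2))
print(round(num / den, 4))                  # 0.7804, by the raw-sum formula
print(round(np.corrcoef(x, y)[0, 1], 4))    # 0.7804, NumPy's built-in check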

Example 9.2:
The correlation coefficient between two variables X and Y is 0.48, and the covariance between X and Y
is 36. If the variance of X is 16, calculate the standard deviation of Y.
Answer:

It is given that the correlation coefficient between the variables X and Y is r = 0.48, Cov(X, Y) = 36 and Var(X) = 16, so that σx = 4.

We know that the correlation coefficient between the variables X and Y is r = Cov(X, Y)/(σx σy),

i.e. 0.48 = 36/(4 × σy),

then the standard deviation of Y is σy = 36/(4 × 0.48) = 18.75.

9.6. SPEARMAN’S RANK CORRELATION COEFFICIENT:


This method is applied to measure the association between two variables when only ordinal or rank data
are available. It is named after Charles Edward Spearman, who developed the technique. It measures the
association, or strength of relationship, between the observations by ranking them. Suppose that a group
of n individuals is arranged in order of merit or proficiency in possession of two characteristics X and Y;
that is, the highest (lowest) observation is given the first rank, followed by the second rank for the
second highest (lowest), and so on. The assignment of ranks, either from highest to lowest or from lowest
to highest, must be the same for both variables. The main objective in computing the correlation coefficient
in such situations is to determine the extent to which the two sets of rankings agree or disagree.
Spearman's rank correlation between X and Y is nothing but Karl Pearson's coefficient of correlation between
the ranks of X and the ranks of Y.
9.6.1.Spearman’s Rank correlation coefficient without repeated ranks:

Consider n pairs of observations (xi, yi), i = 1, 2, …, n. Rank the observations of X and denote the ranks
by R(xi); rank the observations of Y and denote the ranks by R(yi). Compute the difference in ranks
di = R(xi) - R(yi) for each pair, and then the sum of the squares of these differences, Σdi².

Then Spearman's rank correlation coefficient is given by R = 1 - 6Σdi² / [n(n² - 1)]. ——— (9.6.1)

Spearman’s Rank Correlation lies between -1 and 1. It holds all the properties of Karl Pearson’s Correlation.
Example 9.3: The ranks of 15 students in two subjects X and Y are given below. The two numbers in the
bracket denote the ranks of the students in the subjects X and Y respectively.
(1,10), (2,7), (3,2), (4,6), (5,4), (6,8), (7,3),(8,1), (9,11), (10,15), (11,9), (12,5), (13,14), (14,12), (15,13).
FindSpearman’s rank correlation coefficient.
Answer:Let us fine the difference of ranks given for finding di.

Rank of X 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Rank of Y 10 7 2 6 4 8 3 1 11 15 9 5 14 12 13
Difference di -9 -5 1 -2 1 -2 4 7 -2 -5 2 7 -1 -2 -2
di2 81 25 1 4 1 4 16 49 4 25 4 49 1 4 4

Σdi² = 272

R = 1 - [6 × 272] / [15 × (15² - 1)] = 1 - 1632/3360 = 0.5143

There is thus a moderate positive association between the two sets of ranks.
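The following illustrative sketch (assuming Python with SciPy; not part of the original text) confirms the value obtained above.

from scipy.stats import spearmanr

rank_x = list(range(1, 16))                                   # ranks in subject X
rank_y = [10, 7, 2, 6, 4, 8, 3, 1, 11, 15, 9, 5, 14, 12, 13]  # ranks in subject Y
rho, _ = spearmanr(rank_x, rank_y)   # with no ties this equals 1 - 6*Sd^2/(n(n^2-1))
print(round(rho, 4))                 # 0.5143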

Example 9.4: From the following data find Spearman's rank correlation coefficient.

Sl No 1 2 3 4 5 6 7 8 9 10
Marks in English X 60 56 25 90 35 14 52 27 54 72
Marks in Statistics Y 42 34 56 35 40 50 45 60 58 36

Solution:

Sl No 1 2 3 4 5 6 7 8 9 10
Marks in English X 60 56 25 90 35 14 52 27 54 72
Ranks of X 8 7 2 10 4 1 5 3 6 9
Marks in Statistics Y 42 34 56 35 40 50 45 60 58 36
Ranks of Y 5 1 8 2 4 7 6 10 9 3
di 3 6 -6 8 0 -6 -1 -7 -3 6
di2 9 36 36 64 0 36 1 49 9 36

Σdi² = 276; there is no tie.

R = 1 - [6 × 276] / [10 × (10² - 1)] = 1 - 1656/990 = -0.672.

Comment: There is a moderate negative association between marks in English and marks in Statistics; that
is, as the marks in English increase, the marks in Statistics tend to decrease.

9.6.2.Spearman’s Rank correlation coefficient with repeated ranks:

Consider n pairs of observations (xi, yi). If two or more individuals are bracketed equal in any
classification with respect to characteristics X and Y, or if there is more than one item with the same
value in a series (a tie), common ranks are given to the repeated items. This common rank is the average
of the ranks these items would have assumed had they been slightly different from each other, and the next
item gets the rank next to the ranks already assumed. Thus, some adjustment or correction factor must be
made in the formula given in equation (9.6.1). Let R(xi) be the rank of the observations of X and R(yi) be
the rank of the observations of Y. Compute the differences in ranks di = R(xi) - R(yi) and then the sum of
the squares of these differences, Σdi². Add a correction factor m(m² - 1)/12 for each tie, where m
represents the number of items involved in the tie, to Σdi².

Then the modified Spearman's rank correlation coefficient is given by

R = 1 - 6[Σdi² + Σ m(m² - 1)/12] / [n(n² - 1)]. ——— (9.6.2)

Example 9.5: From the following data on cost and profit (in lakhs of rupees) of an industry over the years,
find Spearman's rank correlation coefficient and comment on it.

Year 2004 2005 2006 2007 2008 2009 2010 2011 2012
Cost 52 60 65 52 55 60 60 30 40
Profit 10 20 25 15 20 30 35 5 1

Solution:

Year 2004 2005 2006 2007 2008 2009 2010 2011 2012
Cost 52 60 65 52 55 60 60 30 40
Ranks 3.5 7 9 3.5 5 7 7 1 2
Profit 10 20 25 15 20 30 35 5 1
Ranks 3 5.5 7 4 5.5 8 9 1 2
di .5 1.5 2 -.5 -.5 -1 -2 0 0
di2 .25 2.25 4 .25 .25 1 4 0 0

Σdi² = 12. There are three ties: the first in cost between two items (52), the second again in cost between
three items (60), and the third in profit between two items (20). Three correction factors are therefore
added to Σdi²: 2(2² - 1)/12 = 0.5, 3(3² - 1)/12 = 2 and 2(2² - 1)/12 = 0.5.

R = 1 - 6[12 + 0.5 + 2 + 0.5] / [9 × (9² - 1)] = 1 - 90/720 = 0.875.

Interpretation: There is a high degree of positive correlation between cost and profit variables of the
industry.
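A minimal sketch of the tie-corrected formula (9.6.2) in Python (assuming NumPy and SciPy; the helper function spearman_cf below is hypothetical, written only for illustration):

import numpy as np
from collections import Counter
from scipy.stats import rankdata

def spearman_cf(x, y):
    # Textbook formula (9.6.2): correction factor m(m^2 - 1)/12 for each tie.
    rx, ry = rankdata(x), rankdata(y)          # average ranks for tied items
    n = len(x)
    d2 = float(((rx - ry) ** 2).sum())         # sum of squared rank differences
    cf = sum(m * (m**2 - 1) / 12.0             # one correction per tied group
             for ranks in (rx, ry)
             for m in Counter(ranks).values() if m > 1)
    return 1 - 6 * (d2 + cf) / (n * (n**2 - 1))

cost   = [52, 60, 65, 52, 55, 60, 60, 30, 40]
profit = [10, 20, 25, 15, 20, 30, 35, 5, 1]
print(round(spearman_cf(cost, profit), 3))     # 0.875, as in Example 9.5

Note that library routines such as scipy.stats.spearmanr compute the coefficient as Pearson's correlation of the midranks, which can differ slightly from the correction-factor formula when ties are present.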

9.7. REGRESSION:
The term regression means stepping back towards the average. Regression analysis is a mathematical
measure of the average relationship between two or more variables in terms of the original units of the
data. In it there are two types of variables, namely the dependent variable (also called the regressed or
explained variable), whose value is influenced or predicted, and the independent variable (also called the
regressor, predictor or explanatory variable), which influences the values or is used for prediction. After
establishing that the relation between the dependent and independent variables is significant, regression
helps in estimating or predicting the average value of the dependent variable for a given value of the
independent variable(s). Basically, there are two types of regression, namely simple regression and multiple
regression. Regression between one independent variable and one dependent variable is known as simple
regression. It is also defined as the average association between one dependent

and one independent variable. These variables may have a linear or non-linear relationship. One can plot
these variables on the scatter diagram and join the points to represent the regression curves. If these
curves pass through the maximum number of points, one gets the regression curves of best fit. If these
curves are straight lines, the regression is known as linear regression; otherwise it is non-linear
regression. Regression between more than one independent variable and one dependent variable is known
as multiple regression. A line of regression allows one to predict the values of the dependent variable.

9.8. LINES OF REGRESSION:


When there are two variables X and Y, either X may depend on Y or Y may depend on X. If X depends on
Y, one gets the line of regression of X on Y; if Y depends on X, one gets the line of regression of Y on X.
The two regression equations are not reversible, as the assumptions for deriving them are quite different,
so there will always be two lines of regression. The lines of regression are obtained from the principle
of least squares. Consider n pairs of observations (xi, yi), i = 1, 2, …, n.

9.8.1.Regression line of Y on X:

Let the variable Y depend on the variable X; then the line is said to be the line of regression of Y on X.
Let the line of regression of Y on X be Y = a + bX, where 'a' represents the Y-intercept and 'b' the slope
of the line. The two normal equations obtained from the principle of least squares are

Σyi = na + bΣxi ——— (9.7.1)
Σxiyi = aΣxi + bΣxi² ——— (9.7.2)

Solving the equations (9.7.1) and (9.7.2) simultaneously, one can get the estimated values of a and b.
The slope of this line is also known as the regression coefficient of Y on X; it is denoted by byx

and is given by byx = Cov(X, Y)/σx² = r σy/σx = [nΣxiyi - ΣxiΣyi] / [nΣxi² - (Σxi)²]. Substituting these values of a and

b one can get the line of regression of best fit.

9.8.2. Regression line of X on Y:

Let the variable X depend on the variable Y; then the line is said to be the line of regression of X on Y.
Let the line of regression of X on Y be X = a + bY, where 'a' represents the X-intercept and 'b' the slope
of the line. The two normal equations obtained from the principle of least squares are

Σxi = na + bΣyi ——— (9.7.3)
Σxiyi = aΣyi + bΣyi² ——— (9.7.4)

Solving the equations (9.7.3) and (9.7.4) simultaneously, one can get the estimated values of a and b.

The slope of this line is also known as the regression coefficient of X on Y; it is denoted by bxy

and is given by bxy = Cov(X, Y)/σy² = r σx/σy = [nΣxiyi - ΣxiΣyi] / [nΣyi² - (Σyi)²].
Substituting these values of a and b one can

get the line of regression of best fit.


Example 9.6: From the following bivariate data find the regression lines and hence estimate the value of
Y when X=4 and the value of X when Y=24.

X 1 3 5 7 9
Y 15 18 21 23 22

Answer: Let the regression equation of Y on X be Y = a + bX and the regression equation of
X on Y be X = a + bY.

The normal equations for Y = a + bX are Σy = na + bΣx and Σxy = aΣx + bΣx².

The normal equations for X = a + bY are Σx = na + bΣy and Σxy = aΣy + bΣy².

Here n = 5, and the other values needed in the normal equations can be had from the following table.

Sr. No. X Y X2 Y2 XY
1 1 15 1 225 15
2 3 18 9 324 54
3 5 21 25 441 105
4 7 23 49 529 161
5 9 22 81 484 198
Total Σx = 25 Σy = 99 Σx² = 165 Σy² = 2003 Σxy = 533

Regression equation of Y on X: Solving the normal equations 5a + 25b = 99 and 25a + 165b = 533, one gets
a = 15.05 and b = 0.95.
Therefore, the regression equation of Y on X is Y = 15.05 + 0.95X.
The estimated value of Y for X = 4 is Y = 15.05 + 0.95 × 4 = 18.85.
Regression equation of X on Y: Solving the normal equations 5a + 99b = 25 and 99a + 2003b = 533, one gets
a = -12.5824 and b = 0.888.

Therefore, the regression equation of X on Y is X = -12.5824 + 0.888Y.
The estimated value of X for Y = 24 is X = -12.5824 + 0.888 × 24 = 8.7296.
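An illustrative Python sketch (assuming NumPy; not part of the original text) that fits both regression lines of Example 9.6 by least squares:

import numpy as np

x = np.array([1, 3, 5, 7, 9])
y = np.array([15, 18, 21, 23, 22])

b_yx, a_yx = np.polyfit(x, y, 1)   # regression of Y on X: Y = a + bX
b_xy, a_xy = np.polyfit(y, x, 1)   # regression of X on Y: X = a + bY
print(round(a_yx, 2), round(b_yx, 2))    # 15.05 0.95
print(round(a_xy, 4), round(b_xy, 4))    # -12.5794 0.8879 (the hand result
                                         # -12.5824 comes from rounding b to 0.888)
print(round(a_yx + b_yx * 4, 2))         # 18.85, estimate of Y at X = 4
print(round(a_xy + b_xy * 24, 2))        # 8.73,  estimate of X at Y = 24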
9.9. REGRESSION COEFFICIENTS:

The two lines of regression always intersect at the point (x̄, ȳ), and hence this point satisfies both
lines of regression; that is, the point of intersection gives the means of X and Y, x̄ and ȳ respectively.

In general, the line of regression of Y on X is given by Y - ȳ = byx(X - x̄), and the line of regression

of X on Y is given by X - x̄ = bxy(Y - ȳ).

Properties of regression coefficients:

1. The correlation coefficient is the geometric mean of the regression coefficients, that is, r = ±√(byx · bxy).

2. If one of the regression coefficients is greater than one, then the other regression coefficient is less
than one: if byx > 1, then bxy < 1.

3. The arithmetic mean of the regression coefficients is greater than or equal to the correlation
coefficient: (byx + bxy)/2 ≥ r.

4. Regression coefficients are independent of change of origin but not of change of scale.

5. The angle θ between the two lines of regression satisfies tan θ = [(1 - r²)/r] · [σxσy/(σx² + σy²)], so that

i. if r = 0, the two lines are perpendicular to each other, and

ii. if r = ±1, the two lines coincide.

6. The signs of r, bxy and byx are always the same, either all positive or all negative.
Example 9.7: Using the regression coefficients, find for the following data the regression lines between
advertising expenditure (X) (in '000s of Rs.) and volume of sales (Y) (in lakhs of Rs.), the correlation
coefficient, and the most probable value of Y when X = 30.

X 25 28 35 32 31 36 29 38 34 32
Y 43 46 49 41 36 32 31 30 33 39
Answer:

The line of regression of Y on X is given by Y - ȳ = byx(X - x̄), and the line of

regression of X on Y is given by X - x̄ = bxy(Y - ȳ),

where byx = [nΣxy - ΣxΣy] / [nΣx² - (Σx)²] and bxy = [nΣxy - ΣxΣy] / [nΣy² - (Σy)²],

and r = ±√(byx · bxy).

X Y X2 Y2 XY
25 43 625 1849 1075
28 46 784 2116 1288
35 49 1225 2401 1715
32 41 1024 1681 1312
31 36 961 1296 1116
36 32 1296 1024 1152
29 31 841 961 899
38 30 1444 900 1140
34 33 1156 1089 1122
32 39 1024 1521 1248
Total Σx = 320 Σy = 380 Σx² = 10380 Σy² = 14838 Σxy = 12067

Here n = 10; x̄ = 320/10 = 32 and ȳ = 380/10 = 38.

byx = [10 × 12067 - 320 × 380] / [10 × 10380 - (320)²] = -930/1400 = -0.6643

bxy = [10 × 12067 - 320 × 380] / [10 × 14838 - (380)²] = -930/3980 = -0.2337

The value of the correlation coefficient is r = -√(0.6643 × 0.2337) = -0.394. (Since the regression
coefficients are negative, the correlation coefficient is also negative.)
The line of regression of Y on X is Y - 38 = -0.6643(X - 32), which on simplification gives Y = 59.2576 - 0.6643X.
The line of regression of X on Y is X - 32 = -0.2337(Y - 38), which on simplification gives X = 40.8806 - 0.2337Y.
If X = 30, the value of Y (using the line of Y on X) is Y = 59.2576 - 0.6643 × 30 = 39.3286.
Example 9.8: In a partially destroyed laboratory record of an analysis of correlation data, only the
following are legible: variance of X = 9; the regression equations are 8X - 10Y + 66 = 0 and 40X - 18Y = 214.
What are i) the means of the X and Y, ii) the coefficient of correlation between X and Y and iii) The
standard deviation of Y.
Answer: i) Solving the two regression equations simultaneously, one gets the means of X and Y as 13 and
17 respectively, i.e. x̄ = 13 and ȳ = 17.

ii) Let us first assume that 8X - 10Y + 66 = 0 and 40X - 18Y = 214 are the regression equations of X on Y
and Y on X respectively.
Then the equations can be put in the form 8X = 10Y - 66 and 18Y = 40X - 214, which can be written as
X = (10/8)Y - 66/8 and Y = (40/18)X - 214/18, so that bxy = 10/8 and byx = 40/18.
Then r² = bxy · byx = (10/8) × (40/18) = 400/144 = 2.78 > 1, which is impossible. Hence our assumption is wrong.
Now let us assume that 8X - 10Y + 66 = 0 and 40X - 18Y = 214 are the regression equations of Y on X and
X on Y respectively.

Then the equations can be put in the form 8X = 10Y - 66 and 18Y = 40X - 214, which can be written as
Y = (8/10)X + 66/10 and X = (18/40)Y + 214/40, so that byx = 8/10 and bxy = 18/40.
Then r² = byx · bxy = (8/10) × (18/40) = 0.36, therefore r = ±0.6. Since both regression coefficients are
positive, r = +0.6.
iii) Given that the variance of X = 9, so σx = 3.

Since byx = r σy/σx, we have 8/10 = 0.6 × (σy/3).

This implies the standard deviation of Y is σy = 4.


Activity 9.1: Following are the marks obtained by 10 students, selected randomly from a class, in three
different subjects: English (X), Economics (Y) and Statistics (Z). Find the three possible Karl Pearson's
correlation coefficients and the three rank correlation coefficients, and interpret the results.

Subjects A B C D E F G H I J
X 50 40 50 35 37 18 30 22 15 5
Y 58 60 48 50 30 31 44 36 40 52
Z 72 66 75 40 80 50 0 88 25 90

Activity 9.2: Obtain the two lines of regression between X and Y by the method of least squares and find
the correlation coefficient. Also, predict Y when X = 10, and X when Y = 2.5.

X 1 5 3 2 1 1 7 3
Y 6 1 0 0 1 2 1 5

Activity 9.3: Can Y=5+2.8X and X=3-0.5Y be the estimated regressions of Y on X and X on Y respectively.
Explain your answer with suitable arguments.
Activity 9.4: A survey was conducted to study the relationship between expenditure on accommodation
(X) and expenditure on food and entertainment (Y), and the following results were obtained:

Variable Mean (00’s) Standard Deviation(00’s)


X Rs. 173 Rs 63.15
Y Rs. 47.8 Rs. 22.98
Correlation coefficient r= 0.57

Find the two regression equations between X and Y and estimate the expenditure on food and entertainment
if the expenditure on accommodation is Rs 2000.

9.10. SUMMARY:
This Unit is on bivariate analysis, especially on simple correlation and simple regression. The concept of
correlation and regression and different methods of studying correlation and regression were discussed
with the help of examples.

9.11. REVIEW QUESTIONS:
1. Explain the concept of correlation and regression?
2. What is correlation? Distinguish between positive, negative and zero correlations.
3. Explain different methods of studying Correlation?
4. Explain the procedure for estimating parameters in the simple regression lines.
5. How one can use the regression coefficients in estimating the simple linear regression equations.
What else is required in addition to regression coefficients in the process of estimating the regression
equations.

9.12. FURTHER READINGS:
1. Gupta S C and Kapoor VK, Fundamentals of Applied Statistics, Sultan Chand & Sons, Delhi.
2. Richard I Levin and David S Rubin, Statistics for Management, Prentice-Hall of India Private Limited,
New Delhi.
3. Chandan J C, Jagjit Singh and Khanna K K, Business Statistics, Vikas Publishing House Pvt. Ltd., New
Delhi.
4. J K Sharma, Business Statistics, Dorling Kindersley (India) Pvt. Ltd., New Delhi.

UNIT - 10
MULTIVARIATE ANALYSIS:
MULTIPLE CORRELATION, MULTIPLE REGRESSION

OBJECTIVES:
After studying this unit one will be able to
• Explain the concept of multivariate analysis
• Explain different multivariate techniques that are available in analyzing the Multivariate data.
• Explain the concept of multiple correlation
• Explain the concept of Multiple Regression and calculation of its parameters
• Explain the meaning of R2 and adjusted R2
• Explain the concepts and usages of other important Multivariate techniques

STRUCTURE:
10.1 Introduction
10.2. Multiple Correlation and Regression
10.3. Other Multivariate techniques
10.4. Summary

10.1 INTRODUCTION:
In Unit 9, the relation between two variables was considered and the concepts of simple linear correlation
and regression were discussed; different algorithms relating to them were also presented. In that situation,
the values of one variable are associated with, or influenced by, the other variable, and as a measure of
the linear relationship between them, Karl Pearson's coefficient of correlation was calculated. After
establishing a significant relation between the variables, the concept of regression was used for estimating
the value of one variable for a given value of the second variable. The variable whose values are estimated
is called the dependent variable, and the variable whose values are given is called the independent variable.
All this comes under bivariate analysis, explained in Unit IX. However, in many practical situations there
exists an interrelation between many variables, and the value of one variable may be influenced by many
others. As an example, the yield of a crop per acre, say (X1), depends upon the quality of seeds (X2), the
fertility of the soil (X3), the fertilizer used (X4), etc. Studying the joint effect of a group of variables
upon a variable not included in that group is termed multiple correlation and multiple regression. Depending
on the nature of the variables, there are different multivariate techniques in the literature applicable in
different situations other than multiple correlation and multiple regression; these are presented in fig
6.2 in Unit-VI

Though the majority of these techniques have very good practical utility, in this Unit X only multiple
regression is studied in detail; the other important techniques are discussed only briefly.

10.2. MULTIPLE CORRELATION AND REGRESSION:


In the analysis of multiple regression, the concept of multiple correlation is also important to study.
10.2.1. Multiple Correlation:
The relation between more than two variables is known as multiple correlation. It is denoted by R and
provides the combined effect of two or more independent variables on the dependent variable. Here one
finds the relation between one dependent variable and many independent variables. Example: consumption of
a product is influenced by price, advertisement, quality, quantity available, availability of substitute
goods, etc. Here consumption is the dependent variable, and price, advertisement, quality, quantity
available, etc. are independent variables. Multiple correlation lies between 0 and 1. As the value of R
gets close to zero, the relation becomes more and more negligible; as R gets close to one, the relation
becomes stronger and stronger. If the value of R equals one, one can say the variables are perfectly
correlated. If the value of R equals zero, there may still be a non-linear relationship between the
variables. In multiple correlation analysis one attempts to measure the degree of association between a
dependent variable X1 and two or more independent variables X2, X3, …, taken together as a group; this
gives the multiple correlation coefficient. It is also possible to measure the degree of association
between the dependent variable and any one of the independent variables included in the analysis, while
the effect of the other independent variables included in the analysis is held constant. This measure of
association is called the partial correlation coefficient, and it is different from the simple correlation
coefficient.

If there are three variables X1, X2 and X3, the multiple correlation coefficient between the independent
variables X2, X3 and the dependent variable X1 is

R1.23 = √[(r12² + r13² - 2 r12 r13 r23) / (1 - r23²)]

The multiple correlation coefficient between the independent variables X1, X3 and the dependent variable X2 is

R2.13 = √[(r12² + r23² - 2 r12 r13 r23) / (1 - r13²)]

The multiple correlation coefficient between the independent variables X1, X2 and the dependent variable X3 is

R3.12 = √[(r13² + r23² - 2 r12 r13 r23) / (1 - r12²)]

where r12, r13 and r23 are the simple correlation coefficients between the respective pairs of variables.

Advantages of multiple correlation:


1. It is used to determine the degree of association/ relation between variables by taking one variable
as dependent and the remaining variables taken as independent.
2. It measures the goodness of fit for a given series of data. It is regarded as a measure of the accuracy
of estimates made by reference to the estimating equation.

Limitations:
1. It assumes that the relationship among the variables is linear. But in practice a large number of
relationships are not linear and follow some other pattern. In such cases linear regression coefficients
are unable to describe curvilinear data.
2. It assumes that the effects of the independent variables on the dependent variable are quite separate
from each other and hence additive. Accordingly, a given change in one independent variable has
the same effect on the dependent variable regardless of the size of the other independent variable or
variables.
3. The amount of work involved in the calculation of multiple linear correlation is enormous. It is not
easy to interpret the results accurately, as very few persons are well versed in the basic concepts.
Overall, the method is complex.
10.2.2. Multiple Regression:

Let us consider a tri-variate distribution involving three variables X1, X2 and X3. Let the variable X1
depend on the other two variables X2 and X3.

The line of regression of X1 on X2 and X3 is

X1 = a + b12 X2 + b13 X3 ——— (1)

From the principle of least squares, the normal equations are

ΣX1 = na + b12 ΣX2 + b13 ΣX3 ——— (2)
ΣX1X2 = a ΣX2 + b12 ΣX2² + b13 ΣX2X3 ——— (3)
ΣX1X3 = a ΣX3 + b12 ΣX2X3 + b13 ΣX3² ——— (4)

Solving the equations (2), (3) and (4) simultaneously, one can get the estimated values of a, b12 and b13,
where b12 and b13 are known as the partial regression coefficients of X1 on X2 and of X1 on X3 respectively.
Substituting these values in equation (1), one can get the line of regression of best fit.

Let the line of regression of X2 on X1 and X3 be

X2 = a + b21 X1 + b23 X3 ——— (5)

From the principle of least squares, the normal equations are

ΣX2 = na + b21 ΣX1 + b23 ΣX3 ——— (6)
ΣX1X2 = a ΣX1 + b21 ΣX1² + b23 ΣX1X3 ——— (7)
ΣX2X3 = a ΣX3 + b21 ΣX1X3 + b23 ΣX3² ——— (8)

Solving the equations (6), (7) and (8) simultaneously, one can get the estimated values of a, b21 and b23.
Substituting these values in equation (5), one can get the line of regression of best fit.

Let the line of regression of X3 on X1 and X2 be

X3 = a + b31 X1 + b32 X2 ——— (9)

From the principle of least squares, the normal equations are

ΣX3 = na + b31 ΣX1 + b32 ΣX2 ——— (10)
ΣX1X3 = a ΣX1 + b31 ΣX1² + b32 ΣX1X2 ——— (11)
ΣX2X3 = a ΣX2 + b31 ΣX1X2 + b32 ΣX2² ——— (12)

Solving the equations (10), (11) and (12), one can get the estimated values of a, b31 and b32. Substituting
these values in equation (9), one can get the line of regression of best fit.

Hence the general multiple regression of X1 on X2 and X3 is X1 = a + b12 X2 + b13 X3 ——— (13)

The general multiple regression of X2 on X1 and X3 is X2 = a + b21 X1 + b23 X3 ——— (14)

The general multiple regression of X3 on X1 and X2 is X3 = a + b31 X1 + b32 X2 ——— (15)

Example 10.1:
A sample survey of 5 families was taken, and figures were obtained with respect to their annual savings
(X1), annual income (X2) and family size (X3). From the data given below, calculate the line of regression
of X1 on X2 and X3, and estimate the annual savings of a family whose size is 4 and whose annual income is
16,000. Also calculate the multiple regression line of X2 on X1 and X3, and the multiple regression line
of X3 on X1 and X2.

Family | Annual savings (X1) | Annual Income (X2) | Family Size (X3)
1 10 16 3
2 5 13 6
3 10 21 4
4 4 10 5
5 8 13 3

Answer:

X1 X2 X3 X1² X2² X3² X1X2 X1X3 X2X3
10 16 3 100 256 9 160 30 48
5 13 6 25 169 36 65 30 78
10 21 4 100 441 16 210 40 84
4 10 5 16 100 25 40 20 50
8 13 3 64 169 9 104 24 39
Totals: 37 73 21 305 1135 95 579 144 299

The line of regression of X1 on X2 and X3 is X1 = a + b12 X2 + b13 X3, as given in equation (1).

Substituting the totals in the normal equations given in equations (2), (3) and (4) gives

37 = 5a + 73 b12 + 21 b13 ——————————(16)

579 = 73a + 1135 b12 + 299 b13 ——————(17)

144 = 21a + 299 b12 + 95 b13 ————————(18)

Solving the equations (16), (17) and (18) simultaneously, one gets the estimated values as approximately
a = 6.159, b12 = 0.429 and b13 = -1.197.
Hence the line of regression of X1 on X2 and X3 is X1 = 6.159 + 0.429 X2 - 1.197 X3.
The estimated value of annual savings when annual income is 16,000 and family size is
4 (i.e., X2 = 16, X3 = 4) is
X1 = 6.159 + 0.429 × 16 - 1.197 × 4 ≈ 8.24 (thousand), i.e., annual savings of about Rs. 8,240.
Let the line of regression of X2 on X1 and X3 be X2 = a + b21 X1 + b23 X3, as given in equation (5).

Substituting the totals in the normal equations given in equations (6), (7) and (8) gives

73 = 5a + 37 b21 + 21 b23 ----------------------------(19)

579 = 37a + 305 b21 + 144 b23 ------------------------(20)

299 = 21a + 144 b21 + 95 b23 -------------------------(21)

Solving the equations (19), (20) and (21) simultaneously, one gets the estimated values as approximately
a = -11.837, b21 = 2.156 and b23 = 2.496, so that X2 = -11.837 + 2.156 X1 + 2.496 X3.

The line of regression of X3 on X1 and X2 is X3 = a + b31 X1 + b32 X2, as given in equation (9).

Substituting the totals in the normal equations given in equations (10), (11) and (12) gives

21 = 5a + 37 b31 + 73 b32 ----------------------------(22)

144 = 37a + 305 b31 + 579 b32 ------------------------(23)

299 = 73a + 579 b31 + 1135 b32 -----------------------(24)

Solving the equations (22), (23) and (24) simultaneously, one gets the estimated values as approximately
a = 5.209, b31 = -0.756 and b32 = 0.314, so that X3 = 5.209 - 0.756 X1 + 0.314 X2.
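A minimal Python sketch (assuming NumPy; the helper function fit below is hypothetical, written only for this illustration) that reproduces the three fitted regressions of Example 10.1 by least squares:

import numpy as np

x1 = np.array([10, 5, 10, 4, 8])     # annual savings
x2 = np.array([16, 13, 21, 10, 13])  # annual income
x3 = np.array([3, 6, 4, 5, 3])       # family size

def fit(dep, ind_a, ind_b):
    # Solve the least-squares normal equations for dep = a + b*ind_a + c*ind_b
    A = np.column_stack([np.ones(len(dep)), ind_a, ind_b])
    coef, *_ = np.linalg.lstsq(A, dep, rcond=None)
    return coef

a, b12, b13 = fit(x1, x2, x3)
print(np.round([a, b12, b13], 3))        # approx. [ 6.159  0.429 -1.197]
print(round(a + b12 * 16 + b13 * 4, 2))  # approx. 8.24, estimated savings
print(np.round(fit(x2, x1, x3), 3))      # approx. [-11.837  2.156  2.496]
print(np.round(fit(x3, x1, x2), 3))      # approx. [ 5.209 -0.756  0.314]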

10.2.3. Coefficient of multiple determination (R²), adjusted R², and standard error of the estimate:
In the case of multiple regression, the coefficient of multiple determination (R²) is the proportion of
variation in the dependent variable Y that is explained by the combination of the independent variables
X1, X2, ….

That is, R² = regression sum of squares / total sum of squares = SSR/SST = 1 - (SSE/SST), where

SSE = error sum of squares and SST = SSR + SSE.

Then the adjusted R² = 1 - [(1 - R²)(n - 1)/(n - k - 1)],

where n = number of observations, k = number of independent variables.

Adjusted R² is commonly used when a researcher wants to compare two or more regression
models having the same dependent variable but different numbers of independent variables.
In the case of the simple regression model, in which only one dependent and one independent
variable exist, the coefficient of determination is simply r², and its value ranges between
0 and 1.

The standard error of the estimate = √[SSE/(n - k - 1)].

The F test for testing the statistical significance of the overall regression model is F = MSR/MSE =
(SSR/k)/(SSE/(n - k - 1)),

and the F statistic follows the F distribution with degrees of freedom k and n - k - 1.


The test statistic t for multiple regression is t = (bj - βj)/Sbj, where bj is the slope of variable j
with the dependent variable Y, holding all other independent variables constant, βj is the
hypothesized population slope for variable j, holding all other independent variables constant,
and Sbj is the standard error of the regression coefficient bj. The t statistic follows the t

distribution with n - k - 1 degrees of freedom. This is used to test the hypothesis
βj = 0 against βj ≠ 0 for each j.
Most statistical software provides all these values, and the researcher only needs to interpret
them properly.
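As a hedged illustration (Python with NumPy; not part of the original text), the following sketch computes R², adjusted R² and the overall F statistic for the regression of X1 on X2 and X3 fitted in Example 10.1:

import numpy as np

y = np.array([10, 5, 10, 4, 8])                   # dependent variable (savings)
X = np.column_stack([np.ones(5),
                     [16, 13, 21, 10, 13],         # income
                     [3, 6, 4, 5, 3]])             # family size
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
sse = ((y - y_hat) ** 2).sum()                     # error sum of squares
sst = ((y - y.mean()) ** 2).sum()                  # total sum of squares
n, k = len(y), X.shape[1] - 1                      # n observations, k predictors
r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
F = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(r2, 3), round(adj_r2, 3), round(F, 1))   # approx. 0.971 0.942 33.6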

10.3 OTHER MULTIVARIATE TECHNIQUES:


Following are the different important multivariate techniques that are available in the literature. Those
techniques were just mentioned and a brief introduction was given below. However, the procedure and
applications can be seen from any of the multivariate techniques text Book.
10.3.1. Discriminant analysis: In multiple regression analysis both the independent and dependent
variables are interval-scaled. But in real-life situations there may be cases where the independent
variables are interval-scaled while the dependent variable is categorical. Discriminant analysis is a
technique for analysing this type of data. The objectives of discriminant analysis are to develop a
discriminant function, i.e., to determine a linear combination of the independent variables that separates
the groups of the dependent variable by maximizing the "between-group variance" relative to the
"within-group variance". This is to examine whether there is a significant difference among the groups in
the light of the independent variables or not; that is, to determine the independent variables which
contribute most to the intergroup differences, to classify cases into one of the groups based on the values
of the independent variables, and to evaluate the accuracy of classification. One speaks of two-group
discriminant analysis or multiple discriminant analysis depending on whether the criterion variable has
two categories or more.
10.3.2. Logit Analysis: When the dependent variable is binary and non-metric and there are several
independent variables that are metric, then in addition to two-group discriminant analysis one can also
use ordinary least squares (OLS), logit and probit models for estimation. To run these models, the
dependent variable is coded as 0 and 1. The model deals with the issue of how likely an observation is to
belong to each group; it estimates the probability of an observation belonging to a group.
10.3.3. Factor Analysis: It is a general name denoting class of procedures primarily used for data reduction
and summarization. In this there is no concept of independent and dependent variables. Factor analysis is
a useful tool for investigating variable relationships for complex concepts such as socioeconomic status,
dietary patterns, or psychological scales. In this, relationships among sets of interrelated variables are
examined and those relationships will be represented in terms of a few underlying factors. That is, the
focus of this is to summarize the information contained in a large number of variables into a small number
of factors. This is based on the degree of correlation among the variables. The key concept of factor
analysis is that multiple observed variables have similar patterns of responses because of their association
with an underlying latent (i.e., not directly measured) variable, the factor, which cannot easily be
measured.

For example, people may respond similarly to questions about income, education, and occupation, which
are all associated with the latent variable socioeconomic status. In factor analysis, a factor is a latent
(unmeasured) variable that expresses itself through its relationship with other measured variables.
The relationship of each variable to the underlying factor is expressed by the so-called factor loading.Factor
analysis also allows you to use the weighted item responses to create what are called factor scores. These
represent a single score for each person on the factor. Factor scores are nice because they allow you to use
a single variable as a measure of the factor in other analyses, rather than a set of items. Factor analysis
is used to identify the underlying dimensions or factors that explain the correlations among a set of
variables; to identify a new, smaller set of uncorrelated variables to replace the original set of
correlated variables in subsequent multivariate analysis; and to identify a smaller, salient set of
variables from a large set for use in subsequent multivariate analysis.
10.3.4. Cluster Analysis: The primary objective of cluster analysis is to classify objects into relatively
homogeneous groups based on the set of variables considered; that is, to divide a large group of objects
or observations, like products, customers, states, etc., into smaller groups such that the observations
within each group are similar or close and the observations in different groups are dissimilar. These
groups are called clusters. As in factor analysis, here too there is no distinction between dependent and
independent variables; rather, the interdependent relationships among the whole set of variables are
considered. In factor analysis the number of variables is reduced by grouping them, whereas in cluster
analysis the number of observations is reduced by grouping them. In cluster analysis, the grouping of
observations is done based on a set of variables. As there is no rule for selecting the variables, one
must be careful in selecting them. After selecting the variables, sample observations are to be obtained
for further analysis.
Cluster analysis is based on a distance/proximity/similarity measure, a criterion to combine the clusters,
and the number of clusters. There are different distance measures, like Euclidean distance, Manhattan
distance, Chebyshev distance, Mahalanobis or correlation distance, etc., out of which one must be selected
depending upon the units of measurement. The criterion for combining clusters may be the nearest neighbour
(single linkage), the furthest neighbour (complete linkage), the centroid method, etc., out of which one
method must be chosen.
10.3.5. Multidimensional Scaling (MDS): It is a class of procedures for representing the perceptions and
preferences of respondents, especially by means of a visual display. This has a variety of uses, especially
in marketing, like finding the number and nature of the dimensions consumers use to perceive different
brands, and positioning current brands and the customer's ideal brand on these dimensions. The information
provided by MDS is useful in identifying the image of the firm, market segmentation, pricing analysis, new
product development, assessing advertising effectiveness, etc. In the process of conducting MDS, the
researcher must formulate the MDS problem carefully, as a variety of data may be used as input to MDS. The
researcher must also determine an appropriate form in which the data should be obtained and select an MDS
procedure for analyzing the data. The data may relate to the perceptions or the preferences of the
respondents; perceptions may be direct or derived. Another important aspect of the solution is determining
the number of dimensions for the spatial map. Also, the axes of the map should be labelled and the derived
configuration interpreted. Lastly, the researcher must assess the quality of the results obtained.
10.3.6. Conjoint Analysis: The main objective of conjoint analysis is to find out the attributes of the
product that a respondent prefers most. It attempts to determine the relative importance consumers attach
to salient attributes and the utilities they attach to the levels of attributes. Conjoint procedures
attempt to assign values to the levels of each attribute, so that the resulting values or utilities attached to the

stimuli match, as closely as possible, the input evaluations provided by the respondents. Like MDS,
conjoint analysis relies on respondents' subjective evaluations. In conjoint analysis the stimuli are
combinations of attribute levels determined by the researcher, whereas in MDS the stimuli are products or
brands. It seeks to develop utility functions describing the utility that consumers attach to the levels
of each attribute. In conducting conjoint analysis, one must formulate the problem, in which the researcher
identifies the attributes and attribute levels to be used in constructing the stimuli. For constructing
the stimuli there are two methods, namely the pairwise approach, also called two-factor evaluations, and
the full-profile approach, also called multiple-factor evaluations. The input data may be metric or
non-metric, as in the case of MDS. For non-metric data the respondents are typically required to provide
rank-order evaluations; in the metric form the respondents provide ratings rather than rankings. A conjoint
analysis procedure is then selected for analysis and the results interpreted.
Activity 10.1:
A distributor of televisions wants to evaluate the factors which influence demand. He considers the sales
per week as the response variable, the price (in 1,000 Rs.) as one explanatory variable and the advertising
cost (in 10,000 Rs.) as another explanatory variable, as presented below. Fit a regression model to predict
the weekly sales using price and advertising expenditure.

Week 1 2 3 4 5 6 7 8 9
No. sold 350 460 350 430 350 390 380 470 450
Cost 5.5 7.5 8 8 6.8 7.5 4.5 6.4 7
Adv. cost 3.3 3.3 3 4.5 3 4 3 3.7 3.5

10.4. SUMMARY:
In this unit, both dependence and interdependence multivariate techniques that are useful with metric
and non-metric variables were presented. Among these, the concepts of multiple correlation and multiple
regression were discussed in detail. The coefficient of multiple determination (R²), the proportion of
variation in the dependent variable that is explained by the combination of independent variables, and the
adjusted R² were discussed. Other multivariate techniques like discriminant analysis, logit analysis,
factor analysis, cluster analysis, multidimensional scaling and conjoint analysis were also discussed in
brief.

Review Questions:
1. Explain the meaning of and need for multivariate techniques. Give a list of different multivariate
techniques, mentioning their limitations.
2. Explain the concept of multiple correlation.
3. Explain multiple regression analysis assuming three metric variables, of which one is the dependent
variable.
4. Explain the different multivariate techniques, mentioning the situations in which each is useful.

Further readings:
1. Gupta S C and Kapoor VK, Fundamentals of Applied Statistics, Sultan Chand & Sons, Delhi.
2. Richard I Levin and David S Rubin, Statistics for Management, Prentice-Hall of India Private Limited,
New Delhi.
3. Chandan J C, Jagjit Singh and Khanna K K, Business Statistics, Vikas Publishing House Pvt. Ltd., New
Delhi.
4. J K Sharma, Business Statistics, Dorling Kindersley (India) Pvt. Ltd., New Delhi.
5. Naval Bajpai, Business Research Methods, Dorling Kindersley (India) Pvt. Ltd., New Delhi.

BLOCK-4

There are four units, XI, XII, XIII and XIV, in Block 4. Unit XI is an introduction to statistical
inference, in which all the material needed to understand the concepts of estimation of a parameter,
aspects of testing, and the different terms required to understand testing of hypotheses is discussed.
Unit XII is on large sample tests, in which tests based on the normal distribution are discussed; these
are termed Parametric Tests-I. Unit XIII is on small sample tests, which depend on exact distributions
like t and F; tests based on Z-transformations are also presented, and these are termed Parametric
Tests-II. Unit XIV is on non-parametric tests; these are distribution-free tests, and the important
non-parametric tests are presented.

UNIT - 11
INTRODUCTION TO STATISTICAL INFERENCE

OBJECTIVES:
After studying this unit one will be able to
• Explain the meaning of statistical inference
• Explain the concepts of statistical inference
• Explain the concepts of estimation, point and interval estimation
• Explain the basic concepts of testing of hypothesis, like hypothesis, null and alternative hypotheses,
simple and composite hypotheses, types of errors, level of significance, critical region, and one-tail
and two-tail tests.
• Major steps in hypothesis testing.

STRUCTURE:
11.1 Introduction
11.2 Important concepts in statistical inference
11.3 Theory of estimation
11.4 Point & Interval estimation
11.5 Testing of Hypothesis
11.6 Major steps in Hypothesis testing
11.7 Summary
11.8 Review Questions
11.9 Further Readings

11.1 INTRODUCTION:
In day-to-day life everyone makes estimates. A business manager has to estimate the necessary stock to
keep in the shop. A manufacturer wants to know the life of the units produced (the population of units).
One way is to test all the units produced (complete enumeration or census), which may not be practically
possible or feasible. The second way is to test a part of the produced units (a sample) and, based on the
information obtained from the sample, estimate the life of the units produced by the company (sampling).
A statistical population is a collection of items or individuals about which one wishes to draw some
conclusion. It may be finite or infinite; it may be hypothetical or existent. Whenever it is not possible
to study the population fully to draw conclusions about some characteristic of that population, one has to
study a sample and, using the information obtained from the sample, draw conclusions about the population.
That is, by studying the sample, meaningful inferences about the population are made. The logic of drawing
statistically valid conclusions about population characteristics on the basis of a sample drawn from it in
a scientific manner is called inductive inference. Unit V discussed different sampling methods for obtaining
representative samples from the population. The next problem is to develop techniques which enable one to
generalize the results of the sample to the population, to find how far these generalizations are valid,
and also to estimate the population parameters along with the degree of confidence. All this comes under
the broad branch of statistics known as statistical inference. This branch is further broadly classified
into two heads, namely i) theory of estimation and ii) testing of hypothesis.

11.2. IMPORTANT CONCEPTS IN STATISTICAL INFERENCE:


Following are some of the important concepts required to understand the statistical inference.
11.2.1. Parameter & Statistic: The statistical constants of the population, like the population mean (μ),
population variance (σ²), population standard deviation (σ), population skewness (β₁), etc., are known as
parameters. Similar statistical constants for the sample drawn from the population, like the sample mean
(x̄), sample variance (s²), sample standard deviation (s), sample skewness (b1), etc., are called statistics
(t). Obviously, parameters are functions of the population values while statistics are functions of the
sample observations. Generally, the population parameters are unknown and they have to be estimated by the
corresponding statistics.
11.2.2. Sampling Distribution: Let N be the population size and n the sample size. Then one can draw
NCn = k random samples of size n. For each of the samples drawn, the statistic t can be computed; that is,
one will have k different values of the statistic t (Table 11.1). These k values of t may then be grouped
into a frequency distribution. The frequency distribution of a statistic so obtained is known as the
sampling distribution of the statistic. The mean and the variance of these k values give the mean and the
variance of the sampling distribution.

Sample number: 1, 2, 3, …, k
Statistic t: t1, t2, t3, …, tk

Table-11.1
11.2.3. Standard error (S.E): The standard deviation of the sampling distribution of a statistic is called
the standard error of the statistic. The reciprocal of the standard error is taken as the measure of precision
for the statistic. The standard error is used to determine the limits within which the population parameter
may be expected to lie. The standard error forms the basis of the testing of hypothesis.
The standard errors of some of the important statistics, for large sample sizes, are given in Table 11.2.
Here n, n1, n2 are sample sizes; p, p1, p2 are sample proportions; x̄, x̄1, x̄2 are sample means; s, s1, s2
are sample standard deviations; P, P1, P2 are population proportions; σ, σ1, σ2 are population standard
deviations; r is the sample correlation coefficient; ρ is the population correlation coefficient;
Q = 1 - P; Q1 = 1 - P1; Q2 = 1 - P2.

Sr. No. Statistic (t) : Standard error {S.E.(t)}
1 Sample mean x̄ : σ/√n
2 Observed sample proportion p : √(PQ/n)
3 Sample standard deviation s : σ/√(2n)
4 Sample variance s² : σ²√(2/n)
5 Sample correlation coefficient r : (1 - ρ²)/√n
6 Difference between two means x̄1 - x̄2 : √(σ1²/n1 + σ2²/n2)
7 Difference between two standard deviations s1 - s2 : √(σ1²/2n1 + σ2²/2n2)
8 Difference between two proportions p1 - p2 : √(P1Q1/n1 + P2Q2/n2)

Table 11.2
Note: whenever population values are not available they are estimated by the corresponding sample values.
That means substitute sample values in place of population parameter values.
Example 1: A population consists of the five numbers (2, 3, 6, 8, 11). Considering all possible samples of
size two which can be drawn with replacement from this population, calculate the standard error of the
sample mean.
Solution: In random sampling with replacement, any one of the five numbers 2, 3, 6, 8, 11 drawn in the
first draw can be associated with any one of these five numbers drawn in the second draw, and hence the
total number of possible samples of size 2 is 5 × 5 = 25, given by the cross product (2,3,6,8,11) ×
(2,3,6,8,11) as shown below:
(2,2),(2,3),(2,6),(2,8),(2,11),(3,2),(3,3),(3,6),(3,8),(3,11),(6,2),(6,3),(6,6),(6,8),(6,11),(8,2),(8,3),
(8,6),(8,8),(8,11),(11,2),(11,3),(11,6),(11,8),(11,11)
The means of the 25 samples (the values of the statistic) are as follows:

2, 2.5, 4, 5, 6.5, 2.5, 3, 4.5, 5.5, 7, 4, 4.5, 6, 7, 8.5, 5, 5.5, 7, 8, 9.5, 6.5, 7, 8.5, 9.5, 11

Table 11.3

Then the Sampling distribution is

Class interval 2 to 4 4 to 6 6 to 8 8 to 10 10 to12 Total


Frequency 4 8 7 5 1 25

Table 11.4

The mean of the sampling distribution of the sample mean is the average of the 25 sample means, i.e.,
150/25 = 6, which equals the population mean.

The variance of the sampling distribution of the sample mean is Σ(x̄ - 6)²/25 = 135/25 = 5.4.

Therefore, the standard error of the sample mean is S.E.(x̄) = √5.4 = 2.3238, which agrees with σ/√n,
since the population variance is σ² = 10.8 and n = 2.
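An illustrative Python sketch (using only the standard library; not part of the original text) that enumerates all 25 samples of Example 1 and verifies the standard error:

from itertools import product
from statistics import mean, pstdev

population = [2, 3, 6, 8, 11]
# All 5 x 5 = 25 samples of size 2 drawn with replacement
sample_means = [mean(s) for s in product(population, repeat=2)]
print(mean(sample_means))              # 6.0, the population mean
print(round(pstdev(sample_means), 4))  # 2.3238, the S.E. of the sample mean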

11.3. THEORY OF ESTIMATION:


Estimation of population parameters like the mean, variance, proportion, correlation coefficient, regression
coefficient, etc., from the corresponding sample statistics is one of the very important problems of
statistical inference. The theory relating to these aspects is termed estimation theory.
The theory of estimation was founded by Prof. R. A. Fisher and over time was divided into two branches,
namely i) point estimation and ii) interval estimation.
In point estimation, a sample statistic (a numerical value) is used as an estimate of the population
parameter, whereas in interval estimation a probability range is specified within which the true value of
the parameter might be expected to lie (with some confidence).
In statistical inference, the sampling distribution of a statistic and its standard error are of central
importance.

11.4. POINT & INTERVAL ESTIMATION:

The statistic‘t’ intended for estimating a parameter is called an estimator of .


For Example, the sample mean is an estimator of population mean . Similarly, the
sample standard deviation is the estimator of the population standard deviation .
The specified numerical value of an estimator calculated from the sample is called Estimate. The estimate
may be a single statistic or a range with attached probability called confidence interval. Under point
estimation, a single value is given as the best estimate of the population value. The interval estimation
involves the determination of an interval within which the population value must lie with a specific
degree of confidence. Suppose we draw a random sample of size ‘n’ from a population whose parameter
Ө is be estimated. From the sample, statistic ’t’ and its standard error SE(t) can be computed. The sampling
distribution of ‘t’ can also be known.With this information one can compute the values of t-SE(t) and t+
SE(t), where is the critical value, depending on , level of significance also called probability of type I
error (will be discussed). These values are termed as lower limit L and upper limit U of the interval. This
134
interval covers the parameter with a specified probability . The interval (L, U) is called confidence
interval for the parameter . The limits L and U are called confidence limits. The probability, ) is called
confidence coefficient. The level of confidence is usually taken as 0.95 or 0.99. That is is .05 or .01.

In the case of large samples, the sampling distribution of the statistic t follows the standard

normal distribution. In such cases, in the confidence interval t ± zα S.E.(t) for θ, the critical
value of zα is 1.96 for the 95% confidence level, which can be approximated to 2, and 2.58 for the 99%
confidence level, which can be approximated to 3. Knowing the standard error formulae (Table 11.2) for
statistics like x̄, p, x̄1 - x̄2 and p1 - p2, and substituting the values of the statistics, one can find
the confidence interval for the specified parameter. For small samples, a confidence interval for θ is
given by t ± tα S.E.(t), where tα denotes the critical value of t for the α per cent level of significance
and the specified degrees of freedom.
Example 2: From a random sample of 49 teachers at GITAM University, the mean age and the standard
deviation of the ages were found to be 48 years and 36 years respectively. Construct a 95% confidence
interval for the mean age of Teachers in GITAM University (Population).

Solution: Given that sample size n = 49, mean x̄ = 48 years, standard deviation S = 36.
The value of Z for 95% confidence is 1.96, as the sample size is large.

The 95% confidence interval for the mean age of the population is t ± Z S.E.(t) =

x̄ ± 1.96 S/√n = 48 ± 1.96 × 36/7 = 48 ± 10.08 years.
The 95% confidence limits for the population mean are (37.92, 58.08), i.e., approximately (38, 58).
(Since S.E.(x̄) = S/√n = 36/√49 = 36/7, and Z = 1.96 for 95% confidence.)
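A minimal Python sketch (standard library only; not part of the original text) reproducing this interval:

import math

n, xbar, s = 49, 48, 36
z = 1.96                                  # critical value for 95% confidence
se = s / math.sqrt(n)                     # S.E. of the mean = 36/7
lower, upper = xbar - z * se, xbar + z * se
print(round(lower, 2), round(upper, 2))   # 37.92 58.08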

11.5 TESTING OF HYPOTHESIS:


Since inductive inference about a population is based on a sample study, such decisions involve an element
of risk, the risk of taking wrong decisions; this is one of the important aspects. The purpose is also to
test the significance of the difference between an observed sample value and a hypothetical parameter value
on the basis of the sample, and to test whether two independent sample statistics differ significantly or
whether the difference is only due to fluctuations of sampling. To test the significance, one has to find
the value of a test statistic based on the values of the statistic, the parameter (the expectation of the
statistic) and the standard error (S.E.) of the statistic, defined as follows:

Test statistic = [Statistic - E(Statistic)] / S.E.(Statistic) = (t - θ) / S.E.(t)
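As an illustration (a hypothetical Python helper, not part of the original text), the general test statistic can be written as:

def test_statistic(statistic, expected, standard_error):
    # (Statistic - E(Statistic)) / S.E.(Statistic)
    return (statistic - expected) / standard_error

# E.g., a sample mean of 52 against a hypothesized population mean of 50
# with S.E. = 1.2 gives Z = 1.67; since |Z| < 1.96 it is not significant
# at the 5% level for a two-tailed test.
print(round(test_statistic(52, 50, 1.2), 2))   # 1.67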

Suppose one is interested in questions such as the following: a pharmaceutical concern wants to know
whether a new drug is really effective for a particular ailment; whether a given foodstuff is really
effective in increasing weight; which of two brands of a particular product is more effective; and so on.

One cannot judge such matters by intuition or by simple observation. The modern theory of probability plays
a vital role in decision making, and the branch of statistics which helps us in arriving at the criteria
for such decisions is known as testing of hypothesis. Before seeing how a hypothesis is tested, it is
appropriate to define and explain the hypothesis and its related terms.
11.5.1. Hypothesis:A hypothesis is a proposition – a tentative assumption which a researcher wants to
test for its logical and empirical consequences. The hypotheses are necessary in problem – oriented research
to understand the cause or relationship of a certain phenomenon under investigation. It is an unproved
theory. It is tentative in nature.
A statistical hypothesis is a hypothesis concerning the parameters of the probability distribution for a
designated population or populations, or, more generally, of a probabilistic mechanism which is supposed
to generate the observations. A statistical hypothesis is a hypothesis that is testable on the basis of observing
a process that is modeled via a set of random variables. In other words, it is an attempt to say something
about the population /population parameter/ population distribution on the basis of what is true for the
sample/ statistic.
Hypothesis is broadly classified into two different ways. They are i) Null or Alternative and ii) Simple or
Compound. There are other ways of classification but not of interest now.
11.5.2. Null hypothesis: It is a hypothesis tested for possible rejection under the assumption that it is true. A hypothesis which asserts that there is no significant difference is known as the null hypothesis. It is denoted by H0.
Eg: The average height obtained from a sample is not significantly different from the height of the
population.
Eg: The two training programs are equally efficient.
Eg: The two drugs are equally effective in controlling a particular disease.
11.5.3. Alternative hypothesis: Any hypothesis which is complementary to the null hypothesis is called
alternative Hypothesis. This is denoted by H1 or HA .
Eg:The average height obtained from a sample is significantly different from the height of the population.
Eg: The second training programmeis better than the first one.
Eg: The second drug is more effective than the first in controlling the disease.
11.5.4. Simple hypothesis: A hypothesis which completely specifies the parameters of the distribution is
known as simple hypothesis.
Eg. In a Binomial distribution the parameter values are n = 5 and p = 0.7. This statement is simple because both the parameters of the Binomial distribution B(n, p) are specified.
Eg. H0: X ~ B(100, 1/2), that is, both n and p are specified. This statement is simple because both the parameters of the Binomial distribution are specified.
Eg. H0: X ~ N(5, 20), that is, μ and σ² are specified. This statement is simple because both the parameters of the normal distribution are specified.
Eg. H0: X ~ Poisson with mean 3. This statement is simple because the parameter of the Poisson distribution is specified. (The Poisson distribution is specified by only one parameter.)
11.5.5. Composite/Complex hypothesis: A hypothesis which is not simple is known as a composite hypothesis; that is, a hypothesis which does not specify all the parameters of the distribution is known as a composite/complex hypothesis.
Eg. X ~ B(100, p) with H1: p > 0.5. Here the parameter n is specified and the parameter p is not specified, so this is a composite hypothesis.
Eg. X ~ N(0, σ²) with σ² unspecified. Here the parameter μ is specified and the parameter σ² is not specified, so this is a composite hypothesis.
Eg. H0: The mean daily wage of employees is 130. H1: The mean daily wage of employees is not 130, which means it may be either greater than 130 or less than 130. This is also a composite hypothesis.
11.5.6. Types of Errors: The null hypothesis H0 need not be correct always. If it is correct one will accept H0; otherwise one rejects H0 (also called accepting H1). Based on the information obtained, sometimes one may reject H0 though in reality it is correct. This is an error, named Type I error. It is also possible that one may accept H0 though it is wrong in reality (in fact H1 is correct). This too is an error, termed Type II error. Hence there are two types of errors in statistical inference, namely Type I error and Type II error. This is explained further in Table 11.5.

Decision \ Actual    H0 is True          H0 is not True (H1 is True)
Accept H0            Correct decision    Type II error
Reject H0            Type I error        Correct decision
Table 11.5
The probabilities of committing these errors are generally called the amounts of those errors and are denoted by α and β respectively. That is,
Probability of committing Type I error = Probability of rejecting H0 when it is true = α
Probability of committing Type II error = Probability of accepting H0 when H1 is true = β

11.5.7. Level of significance: The probability of rejecting H0 when it is true, that is, the maximum possible amount of Type I error, is known as the level of significance. This is usually determined in advance, before testing the hypothesis. It is always some percentage, which should be chosen with great care; the significance level is supposed to be arrived at after considering the cost consequences. The only guidance that can be given in this regard is that the greater the importance of Type I error compared to Type II error, the lower the risk of Type I error that should be allowed; that is, α should be kept small. A 5% level of significance means that the researcher is willing to take as much as a 5% risk of rejecting the null hypothesis when it is true. In practice, tests are conducted at the 1%, 5% and 10% levels of significance.
11.5.8. Power of the test: The complement of the Type II error probability is called the power of the test and is denoted by 1 - β. That is,
Power = 1 - Probability of committing Type II error
      = 1 - Probability of accepting H0 when H1 is true
      = Probability of rejecting H0 when H1 is true.
11.5.9. Sample Space, Sample Points: To carry out a statistical test, one computes the test statistic. For that test statistic, different possible values can be obtained, one for each possible sample. The space formed by all possible values of the test statistic is called the sample space, and those values are called sample points.
11.5.10. Critical Region: The critical region is the subset of the sample space whose size depends on the predetermined value of α, the probability of Type I error. It is denoted by ω (omega). The critical region is so identified that whenever the value of the calculated sample statistic falls in the critical region, the null hypothesis is rejected. As discussed earlier, the level of significance is pre-fixed and cannot be changed. For a minimum value of the Type II error, the critical region is located at the ends of the sample space. Whether the critical region is located at one end, or in two halves with one at each end of the sample space, depends on the alternative hypothesis H1.
Acceptance Region: The complementary part of the critical region is called the acceptance region.
11.5.11. Critical or Significant Values: The value of the test statistic (z) which separates the critical region from the acceptance region is called the critical or significant value. The decision rule for rejecting or accepting the null hypothesis H0 is: if the calculated value of the test statistic (in absolute value, for a two tailed test) is less than the significant value, the null hypothesis is accepted; otherwise it is rejected.
Statistical tests can be of two types, namely one tail tests and two tail tests, based on the nature of the alternative hypothesis, as it is the basis for the location of the critical region. If the critical region is located at one end of the sample space, a one tail test is applicable, and if the critical region is located at the two ends of the sample space, a two tail test is applicable.
11.5.12. One tailed test: A one tailed test is used when one tries to test whether the population mean is either lower than or higher than some hypothesized value. For instance, if the null hypothesis H0: μ = μ0 is tested against the alternative hypothesis H1: μ > μ0, the rejection region is on the right tail and the test is said to be a right tailed test. If the level of significance is 5%, the rejection region will be an area equal to 0.05 on the right tail; the test is then a right one tailed test, and the rejection region is only on the right side (Figure 11.1). If the alternative hypothesis is H1: μ < μ0, the rejection region is on the left tail and the test is said to be a left tailed test. If the level of significance is 5%, the rejection region will be an area equal to 0.05 on the left tail; the rejection region is then only on the left side (Figure 11.2).

Figure 11.1: Right tailed test. Figure 11.2: Left tailed test.
11.5.13. Two tailed test: A two tailed test is used when one tries to test whether the population mean is not equal to some hypothesized value. For instance, the null hypothesis H0: μ = μ0 is tested against the alternative hypothesis H1: μ ≠ μ0, i.e. μ > μ0 or μ < μ0. Here the rejection region is on both tails: the level of significance α is divided into two parts, with a rejection region of α/2 on each tail (Figure 11.3).
Figure 11.3: Two tailed test.
11.5.14. Critical values for larger samples: Following are the critical values one can have when the sampling distribution is the normal distribution (larger samples), at the important levels of significance, for one tail and two tail tests:

Test                     Level of significance (α)
                         1%        5%        10%
Two tail                 2.58      1.96      1.645
Right tail (one tail)    2.33      1.645     1.28
Left tail (one tail)     -2.33     -1.645    -1.28

Table 11.6
11.5.15. Critical values for smaller samples: In case of smaller samples (<30), the sampling distributions
may not follow the normal distributions and in such cases the sampling distribution may follow exact
distributions like t, F, etc. In those cases, the critical values can be seen from the corresponding tables
available for different sample sizes, levels of significance and degrees of freedom.
Note: Degrees of freedom can be defined as the number of independent observations for a source of
variation minus the number of independent parameters estimated in computing the variation. In other
words, it can be defined as the number of observations that are free to vary after certain restrictions have
been placed on the data. If there are n observations in the sample, for each restriction imposed upon the
original observations, the number of degrees of freedom is reduced by one.
11.5.16. The p-value of a test: A test of significance is designed for a fixed level of significance, and at the end of the test one concludes whether to reject or accept the null hypothesis. As discussed in Section 11.5.7, the fixing of the level of significance is arbitrary, and hence the mere fact that a hypothesis is rejected or not rejected does not reveal the full strength of the sample evidence. A better way of expressing the conclusion is often to state the p-value, also called the probability value, of the test. This p-value approach to testing a hypothesis is used for large samples and is sometimes referred to as the observed level of significance. The p-value of a test expresses the probability of observing a sample statistic as extreme as, or more extreme than, the value actually observed, assuming that the null hypothesis is true. That is, the p-value is the smallest value of the level of significance α for which the null hypothesis can be rejected. The decision rule for accepting or rejecting a null hypothesis H0 based on the p-value is as follows: reject the null hypothesis H0 when the p-value is less than α; otherwise, accept H0.
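To make the decision rule concrete, here is a minimal Python sketch of the two tailed p-value of a large-sample Z statistic (scipy and the helper name two_tailed_p are assumptions of the sketch):

    from scipy.stats import norm

    def two_tailed_p(z):
        # P(|Z| >= |z|) under the standard normal: the observed level of significance
        return 2 * norm.sf(abs(z))

    print(two_tailed_p(1.96))   # approx 0.05
    print(two_tailed_p(2.58))   # approx 0.01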

11.6. MAJOR STEPS INVOLVED IN HYPOTHESIS TESTING:

Following are the broad steps one has to follow in testing a hypothesis.
• Formulate the null and alternative hypotheses.
• Set up a suitable significance level.
• Choose a test criterion: select a suitable statistic and calculate the value of the test statistic, where
  Test Statistic = (Statistic - Expectation of the Statistic) / Standard Error of the Statistic.
• Find out the critical/significant value.
• Compare the value of the test statistic with the critical value and make the decision: if the modulus of the test statistic is less than the critical value, accept the null hypothesis; otherwise do not accept the null hypothesis H0 and say that it is rejected. (These steps are illustrated in the sketch below.)
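As an illustration of these steps only, here is a minimal Python sketch of a large-sample two tailed test; the function name z_test, the default level and the trial numbers are all assumptions of the sketch, not a prescribed procedure:

    from scipy.stats import norm

    def z_test(statistic, expected, std_error, alpha=0.05):
        # Step 3: compute the test statistic
        z = (statistic - expected) / std_error
        # Step 4: find the critical value for a two tailed test
        critical = norm.ppf(1 - alpha / 2)
        # Step 5: compare and decide
        decision = "accept H0" if abs(z) < critical else "reject H0"
        return z, critical, decision

    # Illustrative numbers only: sample mean 48, hypothesised mean 50, SE = 36/7
    print(z_test(48, 50, 36 / 7))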
Activity 1: A random sample of 64 students of a college showed an average age of 27 years with a
standard deviation of 5 years.
a) Establish a 95% confidence estimate of the average age of all the students at the college.
b) Establish a 99% confidence estimate of the average age of all the students at the college.
Activity 2: In a survey, a sample of 1000 people shopping at a mall was selected, and it was found that out of the 1000, only 700 use a credit card. Construct a confidence estimate of the proportion of people at the mall who use a credit card for shopping.
Activity 3: Give examples with null and alternative hypotheses leading to one and two tail tests.
Activity 4: A car manufacturer claims that a particular model gets 28 km/l. An independent organization took a sample of 49 cars of the same model and on testing found the mean to be 26.8 km/l. Previous studies infer that the standard deviation of the population is 5 km/l. Could we reasonably expect to select such a sample if the population mean is actually 28 km/l (take within 2 standard errors)?

11.7. SUMMARY:
This unit is an introduction to tests of significance. Different concepts and terminologies relating to testing of significance were explained: the concept of estimation, the difference between point and interval estimation, different types of hypotheses, types of errors, the meaning of the level of significance, the critical region, and one tail and two tail tests. The major steps one has to follow in testing a hypothesis were also given.

Review Questions:
1. Define and explain the terms parameter and statistic, sampling distribution, and standard error.
2. Explain the concept of estimation. Distinguish between point and interval estimation.
3. Define hypothesis, null and alternative hypotheses, types of errors, level of significance, power of a test and critical region.
4. Give the different steps in hypothesis testing.
5. What is the meaning of "rejecting" a hypothesis based on a sample?

Further readings:
1. Chandan,J.S, Jagjit Singh, Khanna.K.K., Business Statistics, Vikas Publishing House Pvt. Limited,
New Delhi.
2. Naval Bajpai, Business Research Methods, Dorling Kindersley (India) Pvt. Ltd.
3. Mark Saunders, Philip Lewis and Adrian Thornhill., Research Methods for Business Studies, Pearson,
2012
4. Gupta S.C and Kapoor VK, Fundamentals of Applied Statistics, Sultan Chand & Sons, New Delhi.
5. Vivekananda Murty M, PrasadaRao JAA., Statistics (Twiteeyasamhasram) (for Intermediate), Telugu
Academy, Hyderabad. (Edited by Narasimham VL)

UNIT - 12
PARAMETRIC TESTS-I (LARGE SAMPLE TESTS)

OBJECTIVE:
After studying this unit one will be able to
• Explain the concept of large and small samples and tests based on size of the samples.
• Explain the meaning of sampling of Attributes
• Explain different tests based on sampling of attributes namely test for single proportion and test for
difference of two proportions.
• Explain the meaning of sampling of variables.
• Explain different tests based on sampling of variables, namely the test for a single mean, the test for the difference of two means, and the test for the difference of two standard deviations.

STRUCTURE:
12.1. Introduction
12.2. Large and Small Sample Tests
12.3. Tests based on Sampling of Attributes
12.4. Tests based on Sampling of Variables
12.5 Summary
12.6 Review Questions
12.7 Further Readings

12.1. INTRODUCTION:
In Unit XI, the different basic concepts of inference were discussed, and the steps involved in testing a hypothesis were given. Having learnt the general procedure for testing hypotheses and the theory of estimation, the next step is to discuss their practical applications. Basically, these tests are classified into two kinds, namely parametric and non-parametric tests. This classification is made according to whether or not the assumptions underlying the hypothesis involve information about population parameters. Parametric tests assume that the sample data come from a population that follows a probability distribution based on a fixed set of parameters. In these tests a sample is selected from the population and, based on the sample statistic, the population parameter is estimated. During the estimation process, certain parametric assumptions are made based on the sample and the sampling distribution, and the population from which the sample is drawn is assumed to be normally distributed. Here the null hypothesis concerns parameters of the population distribution and the test statistic is based on that distribution. One can use interval and ratio scale data, but not nominal scale data, for these tests.
Non-parametric tests differ from parametric tests in that, during the estimation process, no such parametric assumptions are made based on the sample and the sampling distribution, and the population from which the sample is drawn is not assumed to be normally distributed. Here the null hypothesis is free from parameters.
12.2. LARGE AND SMALL SAMPLE TESTS:
These parametric tests can be classified into two categories, namely 1) large sample tests and 2) small sample tests. This classification depends upon the size of the sample. In practice, if the sample size is more than 30 it is called a large sample, and tests based on such samples are called large sample tests; if the sample size is less than or equal to 30 it is called a small sample, and tests based on such samples are called small sample tests.
Large sample tests are further classified into i) tests based on sampling of attributes and ii) tests based on sampling of variables.

12.3. TESTS BASED ON SAMPLING OF ATTRIBUTES:

Sampling of attributes means selecting a sample of a certain size from a universe whose member units show the presence (A) or absence (a or α) of an attribute like blindness, deafness, being male, being literate, etc. Such samples are drawn at random to draw conclusions about the percentage/proportion of the presence or absence of an attribute in a universe or population. In such a case the drawing of an individual unit from the universe is called an event, the presence of the attribute (A) is called a success, and the absence of the attribute A is called a failure. Absence of attribute A is also called presence of attribute a (or α).
Some of the important tests of sampling of attributes are
1. Test for a single proportion or population proportion.
2. Test for the difference of two population proportions.
12.3.1. Test for Single Proportion:
Consider a sampling from a population which is divided into two mutually exclusive and completely
exhaustive classes- one class possessing an Attribute A and other not possessing the attribute A. The
presence of the attribute in the sampled unit may be termed as success and its absence as failure.

Further, let P be the population proportion of successes and Q = 1 - P.
Let x be the number of persons possessing the attribute (successes) in a sample of 'n' persons. Then the sample proportion of successes is p = x/n, with expectation E(p) = P, variance PQ/n, and standard error SE(p) = √(PQ/n).
Then, for large samples, the standard normal variate corresponding to the statistic, namely the sample proportion of successes, is
Z = (p - P) / √(PQ/n)
The (1 - α) confidence limit for P in the case of a one tailed test is p ± Zα √(PQ/n), where Q = 1 - P.
The (1 - α) confidence interval for P in the case of a two tailed test is p ± Zα/2 √(PQ/n).
In the case of a two tail test, the 95% confidence limits for P are p ± 1.96 √(PQ/n),
and the 99% confidence limits for P are p ± 2.58 √(PQ/n).
If P is unknown, use p in place of P (and q = 1 - p in place of Q) in the standard error.

Null hypothesis H0: p = P (there is no significant difference between the sample proportion and the population proportion).
Alternative hypothesis H1: p ≠ P (two tailed test) (there is a significant difference between the sample proportion and the population proportion)
or H1: p > P (right one tailed test) (the sample proportion is significantly greater than the population proportion)
or H1: p < P (left one tailed test) (the sample proportion is significantly less than the population proportion).
Test statistic Z = (p - P) / √(PQ/n)
If the calculated value of |Z| is less than the table value of Z (the critical value) for a pre-fixed value of the level of significance α, then accept H0; otherwise reject H0.

Note: In interpreting a confidence interval, one must understand that a confidence interval for a population parameter θ is supposed to contain θ, but sometimes it does and sometimes it does not. Out of the confidence intervals constructed at the same confidence level, say 95%, for the different possible samples selected, 95% of them will contain the parameter θ and 5% will fail to contain it.
Example 12.1: A die was thrown 9000 times and of these 3220 throws yielded a 3 or 4. Can the die be regarded as unbiased?
Answer:
Null hypothesis H0: P = 1/3 (the die is unbiased, so the probability of a 3 or 4 is 2/6 = 1/3).
Alternative hypothesis H1: P ≠ 1/3 (two tailed test).
Here n = 9000, x = 3220, p = x/n = 3220/9000 = 0.3578, P = 1/3 = 0.3333 and Q = 2/3.
Test statistic Z = (p - P)/√(PQ/n) = (0.3578 - 0.3333)/√(0.3333 × 0.6667/9000) = 0.0244/0.00497 = 4.92
The table value of Z at 5% level of significance is 1.96. Since the calculated value is greater than the table value, reject the null hypothesis: the die can be concluded to be biased.
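A minimal Python sketch of this single-proportion Z test (the math and scipy libraries and the helper name one_proportion_z are assumptions of the sketch):

    import math
    from scipy.stats import norm

    def one_proportion_z(x, n, P):
        # Z = (p - P) / sqrt(PQ/n), with p = x/n and Q = 1 - P
        p = x / n
        z = (p - P) / math.sqrt(P * (1 - P) / n)
        return z, 2 * norm.sf(abs(z))   # statistic and two tailed p-value

    # Example 12.1: 3220 successes in 9000 throws, P = 1/3
    print(one_proportion_z(3220, 9000, 1 / 3))   # Z approx 4.92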
12.3.2. Test for the Difference of Two Proportions:
Consider the problem of comparing the prevalence of a certain attribute between two populations A and B. Let n1 and n2 be the sizes of the samples drawn from the two populations A and B respectively. Further, let x1 and x2 be the numbers of persons possessing the given attribute in the two random samples of sizes n1 and n2.
Then p1 = x1/n1 = observed proportion of successes in the sample from population A,
and p2 = x2/n2 = observed proportion of successes in the sample from population B.
Their expectations are E(p1) = P1 and E(p2) = P2, and their variances are P1Q1/n1 and P2Q2/n2 respectively, where P1 and P2 are the population proportions of A and B respectively, Q1 = 1 - P1 and Q2 = 1 - P2.
Null hypothesis H0: P1 = P2 = P (say). This is to say that the sample proportions p1 and p2 do not differ significantly.
Alternative hypothesis H1: P1 > P2 or H1: P1 < P2 (both are one tail tests)
or H1: P1 ≠ P2 (two tail test).
Then, under H0, the test statistic is Z = (p1 - p2) / √(PQ (1/n1 + 1/n2)), where Q = 1 - P.
If the common population proportion P is not known, use the unbiased estimate provided by both samples taken together, which is given by
P̂ = (x1 + x2)/(n1 + n2) = (n1 p1 + n2 p2)/(n1 + n2), with Q̂ = 1 - P̂.
Then the test statistic is Z = (p1 - p2) / √(P̂Q̂ (1/n1 + 1/n2)).
If the calculated value of |Z| is less than the table value of Z, accept H0; otherwise reject H0.
Also, the confidence limits for the difference of the population proportions P1 - P2 are
(p1 - p2) ± Zα √(PQ (1/n1 + 1/n2)) under H0,
or (p1 - p2) ± Zα √(p1q1/n1 + p2q2/n2) if P is unknown.

Example 12.2: In a random sample of 1000 persons from Town A, 400 are found to be consumers of wheat. In a sample of 800 from Town B, 400 are found to be consumers of wheat. Do these data reveal a significant difference between Town A and Town B so far as the proportion of wheat consumers is concerned?
Answer:
Null hypothesis H0: P1 = P2 (no difference in the proportions of wheat consumers).
Alternative hypothesis H1: P1 ≠ P2 (two tailed test).
Here p1 = 400/1000 = 0.4 and p2 = 400/800 = 0.5,
P̂ = (400 + 400)/(1000 + 800) = 800/1800 = 0.444 and Q̂ = 0.556.
Test statistic Z = (p1 - p2)/√(P̂Q̂ (1/n1 + 1/n2)) = (0.4 - 0.5)/√(0.444 × 0.556 × (1/1000 + 1/800)) = -0.1/0.0236 = -4.24
The table value of Z at 5% level of significance is 1.96. Since |Z| = 4.24 is greater than the table value, one should reject the null hypothesis and conclude that there is a significant difference between Town A and Town B so far as the proportion of wheat consumers is concerned.
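A minimal Python sketch of the two-proportion test with the pooled estimate (helper name two_proportion_z is illustrative):

    import math
    from scipy.stats import norm

    def two_proportion_z(x1, n1, x2, n2):
        # Pooled estimate of the common proportion under H0
        p1, p2 = x1 / n1, x2 / n2
        P = (x1 + x2) / (n1 + n2)
        se = math.sqrt(P * (1 - P) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        return z, 2 * norm.sf(abs(z))

    # Example 12.2: 400 of 1000 in Town A, 400 of 800 in Town B
    print(two_proportion_z(400, 1000, 400, 800))   # Z approx -4.24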

12.4. TESTS BASED ON SAMPLING OF VARIABLES:
Sampling of variables means selecting a sample of a certain size from a universe whose units can assume any value of a variable like height, weight, marks, length, wages, etc.
Some of the important tests of sampling of variables are
1. Test for single mean
2. Test for difference of two population means
3. Test for difference of two population standard deviations.
12.4.1. Test for Single Mean for Large Samples:
The mean of the population is tested for significance when the sample size is large. The population may be normal or not, finite or infinite; the variance of the population may be known or unknown; and the alternative hypothesis to be tested may be one tailed or two tailed.
Let x1, x2, ..., xn be a random sample of size n drawn from a population with mean μ and variance σ². Then the sample mean x̄ is normally distributed with mean μ and variance σ²/n, even if the population is not normal, provided the sample size is large.
That is, E(x̄) = μ and S.E.(x̄) = σ/√n.
Null hypothesis H0: μ = μ0
Alternative hypothesis H1: μ > μ0 (one tailed, right tailed)
or H1: μ < μ0 (one tailed, left tailed)
or H1: μ ≠ μ0 (two tailed)
Test statistic Z = (x̄ - μ0)/(σ/√n) when the population variance is known.
Test statistic Z = (x̄ - μ0)/(s/√n) when the population variance is unknown; that is, if the population s.d. σ is not known, use 's', the estimate of σ provided by the sample.
If the calculated value of |Z| is less than the table value of Z, accept H0; otherwise reject H0.
The confidence interval for the single mean in the case of a two tailed test is x̄ ± Zα/2 σ/√n.
The confidence limit for the single mean in the case of a one tailed test is x̄ ± Zα σ/√n.
Example 12.3: It is claimed that a random sample of 100 tyres with a mean life of 15269 kms is drawn from a population of tyres which has a mean life of 15200 kms and a standard deviation of 1248 kms. Test the validity of the claim at 5% level of significance.
Answer: Given n = 100, sample mean x̄ = 15269 kms, population mean μ = 15200 kms, population standard deviation σ = 1248 kms.
Null hypothesis H0: μ = 15200 kms; that is, the sample is taken from the population whose mean is 15200 kms.
Alternative hypothesis H1: μ ≠ 15200 kms (two tailed test).
Level of significance α is 5%.
Test statistic Z = (x̄ - μ)/(σ/√n), since the population standard deviation is known.
Under H0, Z = (15269 - 15200)/(1248/√100) = 69/124.8 = 0.5529
Since the calculated Z value is less than 1.96, the critical value at 5% level of significance, accept H0, the null hypothesis. That is, one can conclude that the given sample has been drawn from the population of tyres with a mean life of 15200 kms.
Example 12.4: An insurance agent claims that the average age of policy holders who insure through him is less than the average for all agents, which is 30.5 years. A random sample of 100 policy holders who had insured through him gave the following age distribution. Use the arithmetic mean and standard deviation of this distribution to test his claim at 5% level of significance.

Age last birthday    16-20   21-25   26-30   31-35   36-40
No. of persons       12      22      20      30      16

Answer: For the data given above it can be found that the sample mean x̄ = 28.8 years and the standard deviation s = 6.35 years. (The working can be done as an activity.)
Sample size n = 100.
Null hypothesis H0: μ = 30.5 years
Alternative hypothesis H1: μ < 30.5 years (one tailed test, left tailed)
Test statistic Z = (x̄ - μ)/(s/√n), since the population standard deviation is unknown.
Now Z = (28.8 - 30.5)/(6.35/√100) = -1.7/0.635 = -2.68
Since |Z| = 2.68 > 1.645, the table/critical value for a one tailed test at 5% level of significance, reject the null hypothesis. That is, the average age of policy holders who insure through the agent is significantly less than 30.5 years, which supports his claim.
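A minimal Python sketch of the large-sample test for a single mean (helper name single_mean_z is illustrative; the caller supplies either σ or its estimate s):

    import math
    from scipy.stats import norm

    def single_mean_z(xbar, mu0, sd, n, tail="two"):
        z = (xbar - mu0) / (sd / math.sqrt(n))
        if tail == "two":
            p = 2 * norm.sf(abs(z))
        else:                       # one tailed ("left" or "right")
            p = norm.sf(abs(z))
        return z, p

    # Example 12.4: xbar = 28.8, mu0 = 30.5, s = 6.35, n = 100, left tailed
    print(single_mean_z(28.8, 30.5, 6.35, 100, tail="left"))   # Z approx -2.68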
12.4.2. Test for difference of two population means:
Consider two independent samples of sizes n1 and n2 drawn from two normal populations with means μ1, μ2 and variances σ1², σ2² respectively. The values of σ1² and σ2² are assumed to be known. The means of the samples are x̄1 and x̄2. Then the difference of the two sample means, x̄1 - x̄2, is approximately normally distributed with mean μ1 - μ2 and variance σ1²/n1 + σ2²/n2.
Null hypothesis H0: μ1 = μ2 (no difference in the population means)
Alternative hypothesis H1: μ1 ≠ μ2 (two tailed)
or H1: μ1 > μ2 (one tailed, right tailed)
or H1: μ1 < μ2 (one tailed, left tailed)
Then, under H0, the test statistic Z becomes
Z = (x̄1 - x̄2) / √(σ1²/n1 + σ2²/n2)
Note: If σ1² and σ2² are unknown, use the estimates s1² and s2² in place of σ1² and σ2².
If the calculated value of |Z| is less than the table value of Z, accept H0; otherwise reject H0.
The confidence interval in the case of a two tailed test for μ1 - μ2 is (x̄1 - x̄2) ± Zα/2 √(σ1²/n1 + σ2²/n2);
the confidence limit in the case of a one tailed test is (x̄1 - x̄2) ± Zα √(σ1²/n1 + σ2²/n2).
Note: When σ1² = σ2² = σ² is unknown, calculate the pooled estimate
σ̂² = (n1 s1² + n2 s2²)/(n1 + n2),
and the test statistic becomes
Z = (x̄1 - x̄2) / √(σ̂² (1/n1 + 1/n2))
with the corresponding confidence interval (x̄1 - x̄2) ± Zα/2 σ̂ √(1/n1 + 1/n2) for a two tailed test, and confidence limit (x̄1 - x̄2) ± Zα σ̂ √(1/n1 + 1/n2) for a one tailed test.
Example 12.5: In a survey of buying habits, 400 women shoppers are chosen at random in supermarket A located in a city. Their average weekly food expenditure is Rs. 250 with a standard deviation of Rs. 40. For 400 women shoppers chosen from supermarket B, in another section of the city, the average food expenditure is Rs. 220 with a standard deviation of Rs. 55. Test at 1% level of significance whether the average weekly food expenditures of the two populations of shoppers are equal.
Answer: Given that n1 = 400, x̄1 = Rs. 250 and s1 = Rs. 40;
n2 = 400, x̄2 = Rs. 220 and s2 = Rs. 55.
Null hypothesis H0: μ1 = μ2; that is, there is no significant difference between the two populations as far as food expenditure is concerned.
Alternative hypothesis H1: μ1 ≠ μ2 (two tailed test).
The test statistic is Z = (x̄1 - x̄2)/√(σ1²/n1 + σ2²/n2).
Since σ1² and σ2² are unknown, use the estimates s1² and s2² in place of σ1² and σ2².
Then Z = (250 - 220)/√(40²/400 + 55²/400) = 30/√(4 + 7.5625) = 30/3.40 = 8.82
Since the calculated Z is much greater than 2.58, the critical value at 1% level of significance, reject the null hypothesis. That is, one can conclude that the average weekly food expenditures of the two populations of shoppers in market A and market B differ significantly.
Example 12.6: The mean yield of a sample of 100 plots from district A was 210 kgs with standard deviation (S.D.) 10 kgs, and the mean yield of a sample of 150 plots taken from district B of the same state was 220 kgs with S.D. 12 kgs. The S.D. of the entire pair of districts is assumed to be 11 kgs. Test whether there is any significant difference between the yields of these two districts.
Answer: Given that n1 = 100, x̄1 = 210 kgs and s1 = 10 kgs;
n2 = 150, x̄2 = 220 kgs and s2 = 12 kgs.
Here σ1 = σ2 = σ = 11 kgs, so σ² = 121.
Null hypothesis H0: μ1 = μ2 (no difference in the population means).
Alternative hypothesis H1: μ1 ≠ μ2.
Under H0, the test statistic Z = (x̄1 - x̄2)/√(σ²(1/n1 + 1/n2)).
Then Z = (210 - 220)/√(121 × (1/100 + 1/150)) = -10/√2.017 = -7.04
Since |Z| = 7.04 is much greater than 1.96, the difference is highly significant and the null hypothesis is rejected. That is, the mean yields of the crop in the two districts differ significantly.
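A minimal Python sketch of the large-sample test for the difference of two means (helper name two_mean_z is illustrative):

    import math
    from scipy.stats import norm

    def two_mean_z(x1, s1, n1, x2, s2, n2):
        # Z = (x1 - x2) / sqrt(s1^2/n1 + s2^2/n2)
        se = math.sqrt(s1**2 / n1 + s2**2 / n2)
        z = (x1 - x2) / se
        return z, 2 * norm.sf(abs(z))

    # Example 12.5: A (250, 40, 400) vs B (220, 55, 400)
    print(two_mean_z(250, 40, 400, 220, 55, 400))   # Z approx 8.82
    # Example 12.6 with common sigma = 11: pass s1 = s2 = 11
    print(two_mean_z(210, 11, 100, 220, 11, 150))   # Z approx -7.04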

12.4.3. Test for difference of two Population Standard Deviations:
Consider two independent samples of sizes n1 and n2 drawn from two normal populations with standard deviations σ1 and σ2 respectively. The standard deviations of the samples are s1 and s2 respectively. Then, for large samples, the difference of the two sample standard deviations, s1 - s2, is approximately normally distributed with mean σ1 - σ2 and variance σ1²/2n1 + σ2²/2n2.
Null hypothesis H0: σ1 = σ2 (no difference in the population standard deviations)
Alternative hypothesis H1: σ1 ≠ σ2 (two tailed)
or H1: σ1 > σ2 (one tailed, right tailed)
or H1: σ1 < σ2 (one tailed, left tailed)
Then, under H0, the test statistic Z becomes
Z = (s1 - s2) / √(σ1²/2n1 + σ2²/2n2)
If σ1 and σ2 are unknown, use the estimates s1 and s2 in place of σ1 and σ2.
If the calculated value of |Z| is less than the table value of Z, accept H0; otherwise reject H0.
Example 12.7: The mean yields of two sets of plots and their variability are given below. Examine whether the difference in the variability of the yields is significant at 5% level of significance.

                      Set of 40 plots    Set of 60 plots
Mean yield per plot   1258 kgs           1243 kgs
S.D. per plot         34 kgs             28 kgs

Answer: Given that n1 = 40, x̄1 = 1258 kgs and s1 = 34 kgs;
n2 = 60, x̄2 = 1243 kgs and s2 = 28 kgs.
Null hypothesis H0: σ1 = σ2 (no difference in the population standard deviations).
Alternative hypothesis H1: σ1 ≠ σ2 (two tailed test).
Level of significance: 5%.
Then, under H0, the test statistic is Z = (s1 - s2)/√(σ1²/2n1 + σ2²/2n2).
Since σ1 and σ2 are unknown, use the estimates s1 and s2 in place of σ1 and σ2.
Then Z = (34 - 28)/√(34²/80 + 28²/120) = 6/√(14.45 + 6.53) = 6/4.58 = 1.31
Since Z = 1.31 is less than 1.96, the critical value at 5% level of significance, accept the null hypothesis. That is, there is no significant difference in the variability of the yields.
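A minimal Python sketch of this test for the difference of two standard deviations (helper name two_sd_z is illustrative; the large-sample variance σ²/2n of s is used, as above):

    import math
    from scipy.stats import norm

    def two_sd_z(s1, n1, s2, n2):
        # Z = (s1 - s2) / sqrt(s1^2/(2 n1) + s2^2/(2 n2))
        se = math.sqrt(s1**2 / (2 * n1) + s2**2 / (2 * n2))
        z = (s1 - s2) / se
        return z, 2 * norm.sf(abs(z))

    # Example 12.7: s1 = 34, n1 = 40; s2 = 28, n2 = 60
    print(two_sd_z(34, 40, 28, 60))   # Z approx 1.31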
Activity 12.1: A random sample of boots worn by 36 soldiers in a desert region showed an average life of 1.08 years with a standard deviation of 0.6 years. Under standard conditions, the boots are known to have an average life of 1.28 years. Is there reason to assert, at 1% level of significance, that use in the desert causes the mean life of such boots to decrease? What will be the conclusion if the level of significance is 5%? Assume that the life of the boots is normally distributed.
Activity 12.2: A weighing machine without any display was used by an average of 320 persons a day, with a standard deviation of 50 persons. When an attractive display was fitted on the machine, the daily average over 100 days increased by 15 persons. Can you support the claim that the display did not help much?
Activity 12.3: The average hourly wage of a sample of 150 workers from plant A was Rs. 25.6 with a S.D. of Rs. 10.8, and that of a sample of size 200 from plant B was Rs. 28.7 with S.D. Rs. 12.8. Can an applicant safely assume that the hourly wages paid by plant B are higher than those paid by plant A?
12.5 SUMMARY: A brief introduction to the parametric tests was given. Among the parametric tests, tests based on large samples were discussed under two heads, namely tests based on sampling of attributes and tests based on sampling of variables. Among the tests based on sampling of attributes, the test for a single proportion and the test for the difference of two proportions were discussed. Among the tests based on sampling of variables, the test for a single mean, the test for the difference of two means, and the test for the difference of two standard deviations were discussed.

Review Questions:
1. Explain different tests associated with sampling of attributes when the sample sizes are large.
2. Briefly explain the test for single mean when the sample is large.
3. Explain the large sample test for significance of difference of two means.
4. Explain the large sample test for significance of difference of two standard deviations.
5. In a random sample of 500 persons from Telangana State 200 are found to be consumers of vegetable
oil. Construct 95% confidence limits for the proportion of persons who consume vegetable oils.
6. A random sample of 100 students gave a mean weight of 58 kgs and a standard deviation of 4 kgs. Test the hypothesis that the mean weight of all the students is 60 kgs.
7. A certain process produces 10 percent defective articles. A supplier of a new raw material claims that the use of this material would reduce the proportion of defectives. A random sample of 400 units using this new material was taken, out of which 34 were found to be defective. Can the supplier's claim be accepted at 1% level of significance?
8. A stenographer claims that she can take dictation at the rate of 120 words per minute. Can we reject her claim on the basis of 100 trials in which she demonstrates a mean of 116 words with a standard deviation of 15 words, at 5% level of significance?
9. The mean height of 50 male students who showed above-average participation in college athletics was 68.2 inches with a standard deviation of 2.5 inches, while 50 male students who showed no interest in such participation had a mean height of 67.5 inches with a standard deviation of 2.8 inches. Test the hypothesis that male students who participate in college athletics are taller than other male students.
10. Random samples drawn from two countries gave the following data relating to the heights of adult males. Test: i) Is the difference between the means significant? ii) Is the difference between the standard deviations significant?

Country A Country B
Mean height in inches 67.42 67.25
S.D in inches 2.58 2.5
Number in samples 1000 1200

Further readings:
1. Kothari, C.R., GauravGarg, Research Methodology – Methods and Techniques, New Age
International publications, New Delhi
2. Chandan,J.S, Jagjit Singh, Khanna.K.K., Business Statistics, Vikas Publishing House Pvt. Limited,
New Delhi.
3. Naval Bajpai, Business Research Methods, Dorling Kindersley (India) Pvt. Ltd.
4. Mark Saunders, Philip Lewis and Adrian Thornhill., Research Methods for Business Studies, Pearson,
2012
5. Gupta S.C and Kapoor VK, Fundamentals of Applied Statistics, Sultan Chand & Sons, New Delhi.
6. Gupta S.C, Fundamentals of Statistics, Sultan Chand & Sons, New Delhi.

UNIT - 13
PARAMETRIC TESTS-II (SMALL SAMPLE TESTS)

OBJECTIVES:
After studying this unit one will be able to
1. Explain the meaning of small sample tests and the necessity of using exact sampling distributions in testing hypotheses.
2. Explain different t-tests that are commonly used by researchers.
3. Explain the difference between t tests used for independent and dependent samples in case of samples
drawn from two populations.
4. Explain the concept of Z-transformation and tests that are commonly used by researchers based on
Z- transformation.
5. Explain different F-tests that are commonly used by researchers.

STRUCTURE:
13.1 Introduction
13.2 t-tests
13.3 Tests based on Z-transformation
13.4 F-tests
13.5 Summary
13.6 Review Questions
13.7 Further Readings

13.1 INTRODUCTION:
In Unit XII, parametric tests for samples of size more than thirty (large samples) were discussed. In this unit, tests corresponding to small samples, say of size less than thirty, will be discussed. In these tests the test statistic follows an exact distribution such as t, F, χ², etc.
Different small sample tests exist in the literature; however, a few important tests that are commonly used by researchers are presented and discussed below.

13.2. t-TESTS:
These are the tests based on the t-distribution. The following assumptions are made in Student's t-test: i) the parent population from which the sample is drawn is normal; ii) the sample observations are independent, i.e. the given sample is random; iii) the population standard deviation is unknown.
Some of the important t-tests are: 1. test for a single mean, 2. test for the difference of two means, 3. paired t-test, 4. test for the significance of an observed correlation coefficient.
13.2.1. Test for Single Mean:
Let x1, x2, ..., xn be a random sample of size n drawn from a normal population with mean μ and variance σ² (unknown). Then
t = (x̄ - μ)/(s/√n)
follows the t-distribution with n - 1 degrees of freedom, where x̄ = (Σxi)/n, s² = Σ(xi - x̄)²/(n - 1), and 's' is the sample standard deviation.
Null hypothesis H0: μ = μ0
Alternative hypothesis H1: μ > μ0 (one tailed test)
or H1: μ < μ0 (one tailed test)
or H1: μ ≠ μ0 (two tailed test)
Test statistic t = (x̄ - μ0)/(s/√n)
If the calculated value of |t| is less than the table value of t at n - 1 degrees of freedom, accept H0; otherwise reject it.
The confidence interval for the population mean μ in the case of a two tailed test is x̄ ± tα/2 s/√n.
The confidence limit for the population mean μ in the case of a one tailed test is x̄ ± tα s/√n.
Example 13.1: A claim is made that GITAM University GIMS students have an average IQ of 120. To test this claim, a sample of 10 students was tested and their IQs were obtained as 105, 110, 120, 125, 100, 130, 120, 115, 125, 130. At 5% level of significance, test the validity of the claim.
Answer: The sample size n = 10 is small, the population S.D. is not known, and the population may be taken to be normally distributed.
Null hypothesis H0: μ = 120
Alternative hypothesis H1: μ ≠ 120
Level of significance: 5% = 0.05
Test statistic t = (x̄ - μ)/(s/√n)
For the given data one can find the mean x̄ = 118 and s = √(Σ(xi - x̄)²/(n - 1)) = 10.33. (The reader can verify these results as an activity.)
Statistic t = (118 - 120)/(10.33/√10) = -0.612, so |t| = 0.612
Here the degrees of freedom are 10 - 1 = 9, and at 9 degrees of freedom and 5% level of significance the critical t-value is 2.26.
Since |t| = 0.612 < 2.26, the critical/table value, accept the null hypothesis. That is, one can conclude that the claim can be accepted. Also, the 95% confidence limits for μ, also called the acceptance region, are x̄ ± 2.26 (s/√n), i.e. 118 ± 2.26 × (10.33/√10) = 118 ± 7.38 = (110.62, 125.38).
Since μ = 120 falls within the acceptance region, one cannot reject the null hypothesis; that is, the null hypothesis is accepted.
Example 13.2: The mean weekly sales of a chocolate bar in candy stores was 146.3 bars per store. An advertising campaign was then run. After the campaign, the mean weekly sales of 22 candy shops were observed to be 153.7 bars with a standard deviation of 17.2 bars. Can one conclude that the advertising campaign was successful?
Answer: Given n = 22 (small sample), x̄ = 153.7 bars, s = 17.2 bars.
Null hypothesis H0: μ = 146.3 bars. That is, x̄ and μ do not differ significantly; in other words, the advertising campaign is not successful.
Alternative hypothesis H1: μ > 146.3 bars (right tailed/one tailed test). That is, the campaign is successful.
Under the null hypothesis the test statistic is t = (x̄ - μ)/(s/√(n - 1)), the standard error being taken as s/√(n - 1) since s here is the sample standard deviation computed with divisor n.
t = (153.7 - 146.3)/(17.2/√21) = 7.4/3.753 = 1.9716
The critical/table value of t at 21 degrees of freedom for a single tailed test at 5% level of significance is 1.721, and the calculated t value is more than the table value. Hence reject H0, that is, accept H1. One can conclude that the advertising campaign was successful.
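With the raw observations available, scipy's one-sample t-test reproduces the calculation of Example 13.1 directly (a sketch, assuming scipy is available):

    from scipy.stats import ttest_1samp

    iq = [105, 110, 120, 125, 100, 130, 120, 115, 125, 130]
    t, p = ttest_1samp(iq, popmean=120)   # two tailed by default
    print(t, p)                           # t approx -0.612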
13.2.2. Test for Difference of Means in case of Small Samples:
Let (x1, x2, ..., xn1) and (y1, y2, ..., yn2) be two independent samples of sizes n1 and n2 drawn from two normal populations with means μ1, μ2 and variances σ1², σ2² respectively. Let x̄, ȳ be the means and s1², s2² the variances of the samples, where
x̄ = (Σxi)/n1, ȳ = (Σyi)/n2, s1² = Σ(xi - x̄)²/(n1 - 1) and s2² = Σ(yi - ȳ)²/(n2 - 1).
Then the difference of the two sample means follows the t-distribution with (n1 + n2 - 2) degrees of freedom.
Null hypothesis H0: μ1 = μ2
Alternative hypothesis H1: μ1 > μ2 (one tailed test)
or H1: μ1 < μ2 (one tailed test)
or H1: μ1 ≠ μ2 (two tailed test)
Under the assumption σ1² = σ2² = σ² (say),
Test statistic t = (x̄ - ȳ) / (S √(1/n1 + 1/n2))
where the pooled variance S² = [Σ(xi - x̄)² + Σ(yi - ȳ)²] / (n1 + n2 - 2).
If the calculated value of |t| is less than the table value of t at (n1 + n2 - 2) degrees of freedom, accept H0; otherwise reject it.
The confidence interval in the case of a two tailed test for μ1 - μ2 is (x̄ - ȳ) ± tα/2 S √(1/n1 + 1/n2);
the confidence limit in the case of a one tailed test is (x̄ - ȳ) ± tα S √(1/n1 + 1/n2).
Example 13.3: Two random samples of sizes 9 and 7 were drawn from two groups of pigs given feeds A and B respectively. The mean weights of the samples taken from group A and group B are 196.42 kgs and 198.82 kgs respectively. The sums of the squares of the deviations from the means are 26.94 and 18.73 respectively. Do you agree with the claim that feed B results in an increase in weight?
Answer: Given that n1 = 9 and n2 = 7; x̄ = 196.42 and ȳ = 198.82;
Σ(xi - x̄)² = 26.94 and Σ(yi - ȳ)² = 18.73.
Null hypothesis H0: μ1 = μ2; that is, there is no difference between the two feeds as far as weight gain is concerned.
Alternative hypothesis H1: μ1 < μ2 (left/one tailed). That is, feed B increases the weight significantly.
Test statistic t = (x̄ - ȳ)/(S √(1/n1 + 1/n2))
where the pooled variance S² = [Σ(xi - x̄)² + Σ(yi - ȳ)²]/(n1 + n2 - 2) = (26.94 + 18.73)/14 = 3.26.
Therefore, the statistic t = (196.42 - 198.82)/√(3.26 × (1/9 + 1/7)) = -2.4/0.91 = -2.64
The critical value of t at 9 + 7 - 2 = 14 degrees of freedom and at 5% level of significance is 1.761 (one tailed value).
Since |t| = 2.64 > 1.761, reject the null hypothesis. That is, feed B has a significant effect in increasing the weight of pigs.
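When only summary figures are available, scipy's ttest_ind_from_stats can reproduce the pooled-variance calculation of Example 13.3 (a sketch; note that the standard deviations passed must be the sample values with divisor n - 1):

    import math
    from scipy.stats import ttest_ind_from_stats

    # Example 13.3: sums of squared deviations 26.94 (n1 = 9) and 18.73 (n2 = 7)
    s1 = math.sqrt(26.94 / 8)   # sample s.d. of group A
    s2 = math.sqrt(18.73 / 6)   # sample s.d. of group B
    t, p = ttest_ind_from_stats(196.42, s1, 9, 198.82, s2, 7, equal_var=True)
    print(t, p)                 # t approx -2.64 (two tailed p; halve it for one tail)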
13.2.3. Paired t-test for difference of means:
In the t-test for the difference of means, the two samples were assumed to be independent of each other. But in many situations the observations of the two groups are not independent. Suppose a business concern is interested to find whether the sales of a particular product at different shops have increased after an advertisement. Here the sales in selected shops before and after the advertisement will be collected. The shops are the same, the data on sales depend on the shops, and the observations of the amounts sold before and after advertising are paired by shop. Similarly, to test the impact of a training programme on a group of students, the marks obtained before and after the training will be collected for each student, and the data can again be paired. Hence it can be said that the observations are not independent.
So consider the situation in which
1. The sample sizes are the same, i.e. n1 = n2 = n (say).
2. The sample observations (x1, x2, ..., xn) and (y1, y2, ..., yn) are not independent but are dependent in pairs (x1, y1), (x2, y2), ..., (xn, yn) corresponding to the 1st, 2nd, 3rd, ..., nth unit of the sample respectively.
In this situation the t-test for the difference of means discussed above is not suitable, as the independence property is lacking. In such a situation the paired t-test is suitable for testing the significance of the difference between 'before' and 'after' a particular act.
Consider a sample of size n. Let x be the values before the act (like before advertisement, before training, before giving a particular feed) and y be the values after the act (like after advertisement, after training, after giving a particular feed). Further, let (x1, x2, ..., xn) and (y1, y2, ..., yn) be the two non-independent samples of observations obtained on the units of the sample corresponding to x and y. To test the significance of the difference between x and y, one has to pair these observations and apply the paired t-test. After pairing, the pair (xi, yi) represents the observations of the i-th unit in the sample.
Let the deviations di = yi - xi represent the increments (these may be positive, negative or zero). Then d̄ = (Σdi)/n is the mean of those deviations and their standard deviation is s = √[(Σdi² - n d̄²)/(n - 1)].
H0: μd = 0. There is no significant difference in the two sets of sample means, i.e. the increments are just by chance and not due to the act (like advertising, training, etc.).
H1: μd ≠ 0 (two tailed test). There is a significant difference in the two sets of sample means, i.e. the increments are due to the act.
or H1: μd > 0 (one tailed test).
Under H0,
Test statistic t = d̄/(s/√n), which follows the t-distribution with n - 1 degrees of freedom.
If the calculated value of |t| is less than the critical value of t at α percent level of significance with n - 1 degrees of freedom, accept the null hypothesis; otherwise reject it.
Example 13.4: The performance of ten employees in an organization before and after a training programme is given below. Test whether the training is effective at 5% level of significance.

Before training   52   48   49   57   55   45   58   55   50   60
After training    55   50   49   59   58   46   60   50   50   63

Answer: di = (after training) - (before training)

di     3   2   0   2   3   1   2   -5   0   3
di²    9   4   0   4   9   1   4   25   0   9

d̄ = 1.1, Σdi² = 65 and n = 10.
s = √[(Σdi² - n d̄²)/(n - 1)] = √(52.9/9) = 2.424
Null hypothesis H0: μd = 0; that is, there is no effect of training.
Alternative hypothesis H1: μd > 0; that is, the training is effective.
Statistic t = d̄/(s/√n) = 1.1/(2.424/√10) = 1.435
Since t = 1.435 < 1.833, the table value of t (one tailed) at 5% level of significance with 9 degrees of freedom, accept H0; that is, there is no significant effect of training.
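With the paired observations, scipy's ttest_rel performs this test directly (a sketch reproducing Example 13.4; the two tailed p-value it returns is halved for the one tailed alternative):

    from scipy.stats import ttest_rel

    before = [52, 48, 49, 57, 55, 45, 58, 55, 50, 60]
    after  = [55, 50, 49, 59, 58, 46, 60, 50, 50, 63]
    t, p = ttest_rel(after, before)
    print(t, p / 2)   # t approx 1.435; one tailed p approx 0.09 > 0.05, so accept H0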

13.2.4. t-Test for significance of an observed sample correlation coefficient:
Let (x1, y1), (x2, y2), ..., (xn, yn) be a random sample of size n drawn from a bivariate normal population. Let r be the observed sample correlation coefficient and ρ the population correlation coefficient. The problem is to find whether the observed sample correlation between the variables is due to fluctuations of sampling or whether there is significant correlation in the population.
H0: ρ = 0. The variables in the population are uncorrelated.
H1: ρ ≠ 0 (two tailed test). The variables in the population are correlated.
Test statistic t = r √(n - 2) / √(1 - r²), which follows the t-distribution with n - 2 degrees of freedom.
If the calculated value of |t| is less than the critical value of t at α percent level of significance with n - 2 degrees of freedom, accept the null hypothesis; otherwise reject it.
(Confidence limits for the population correlation coefficient are usually obtained through Fisher's Z-transformation, discussed in Section 13.3.)
Example 13.5: A plant supervisor ranked a sample of 27 workers on the number of hours of overtime worked and on length of employment. The correlation between the two measures is 0.42. Is the correlation between the measures significant at the 1% and 5% levels?
Answer: Given n = 27, r = 0.42.
Null hypothesis H0: ρ = 0; that is, the variables are uncorrelated in the population.
Alternative hypothesis H1: ρ ≠ 0 (two tailed).
Under H0, the test statistic t = r √(n - 2)/√(1 - r²) follows the t-distribution with 25 degrees of freedom.
t = 0.42 × √25/√(1 - 0.42²) = 0.42 × 5/0.9075 = 2.31
The t-value at 25 degrees of freedom at 5% level of significance is 2.06, and the calculated value t = 2.31 > 2.06; hence reject the null hypothesis. That is, at 5% level of significance the variables are significantly correlated.
The t-value at 25 degrees of freedom at 1% level of significance is 2.787, and the calculated value t = 2.31 < 2.787; hence accept the null hypothesis. That is, at 1% level of significance the variables cannot be regarded as correlated.
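A sketch of the same calculation in Python (with raw paired data one would instead call scipy's pearsonr, which returns both r and the two tailed p-value of exactly this t-test; x and y below are illustrative placeholders):

    import math

    # From the summary figures of Example 13.5:
    n, r = 27, 0.42
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    print(t)   # approx 2.31

    # With raw data x, y one would instead call:
    # from scipy.stats import pearsonr
    # r, p = pearsonr(x, y)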
13.3. TESTS BASED ON Z-TRANSFORMATIONS:
It has been seen that the t-test is used to test the hypothesis H0: ρ = 0, that is, that the variables are uncorrelated in the population. In sampling from a bivariate population in which the variables are correlated, i.e. ρ ≠ 0, the distribution of the statistic mentioned above is not normal even for large samples. In such cases Prof. R. A. Fisher suggested the following transformation of 'r' to a new variable Z (hence the name Z-transformation):
Z = (1/2) ln[(1 + r)/(1 - r)] = 1.1513 log10[(1 + r)/(1 - r)]
Fisher proved that, even for small samples, the distribution of Z is approximately normal with
mean ζ = (1/2) ln[(1 + ρ)/(1 - ρ)] = 1.1513 log10[(1 + ρ)/(1 - ρ)] and variance 1/(n - 3).
That is, Z ~ N(ζ, 1/(n - 3)).
(The factor 1.1513 arises because (1/2) ln x = (1/2) log10 x / 0.4343 = 1.1513 log10 x, since log10 e = 0.4343.)
This has the following applications, which are termed tests based on Z-transformations:
1. To test whether the population correlation coefficient has a specified value, and
2. To test whether two independent sample correlation coefficients r1, r2 differ significantly.
13.3.1. To test if the population correlation coefficient has a specified value:
A sample of size n is drawn from a bivariate normal distribution with correlation coefficient ρ. The sample correlation coefficient is r. This test examines whether the sample correlation coefficient differs significantly from the specified population correlation coefficient; in other words, whether the sample can be regarded as drawn from the population.
Null hypothesis H0: ρ = ρ0; that is, the sample correlation coefficient r does not differ significantly from the population correlation coefficient.
Alternative hypothesis H1: ρ ≠ ρ0.
Under H0, U = (Z - ζ) √(n - 3) ~ N(0, 1),
where Z = 1.1513 log10[(1 + r)/(1 - r)] and ζ = 1.1513 log10[(1 + ρ0)/(1 - ρ0)].
If the calculated value of |U| is less than the table value, accept H0; otherwise reject it. The table/significant values can be had from Table 11.6.
Example 13.6: A correlation coefficient of 0.72 is obtained from a sample of 29 pairs of observations. Can the sample be regarded as drawn from a bivariate normal population in which the true correlation coefficient is 0.8?
Answer: H0: ρ = 0.80. There is no significant difference between the sample correlation coefficient r = 0.72 and the population correlation coefficient ρ = 0.80; that is, the sample can be regarded as drawn from a bivariate normal distribution with ρ = 0.80.
Here Z = 1.1513 log10(1.72/0.28) = 0.907
and ζ = 1.1513 log10(1.8/0.2) = 1.1.
Under H0, U = (Z - ζ) √(n - 3) ~ N(0, 1).
U = (0.907 - 1.1) √26 = -0.985
Since |U| = 0.985 < 1.96, the null hypothesis is accepted. Hence the sample may be regarded as drawn from a bivariate normal distribution with ρ = 0.80.
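A minimal Python sketch of this Z-transformation test (math.atanh computes (1/2) ln[(1 + r)/(1 - r)]; the helper name fisher_z_test is illustrative):

    import math
    from scipy.stats import norm

    def fisher_z_test(r, n, rho0):
        # U = (Z - zeta) * sqrt(n - 3) ~ N(0, 1) under H0
        u = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
        return u, 2 * norm.sf(abs(u))

    # Example 13.6: r = 0.72, n = 29, rho0 = 0.8
    print(fisher_z_test(0.72, 29, 0.8))   # U approx -0.97 (the text's -0.985 uses rounded Z values)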

13.3.2. To test if two independent sample correlation coefficients differ significantly:
This is to test whether two independent random samples of sizes n1 and n2, whose correlation coefficients are r1 and r2 respectively, have been drawn from the same bivariate normal population, or from two different bivariate normal populations with the same correlation coefficient.
Null hypothesis H0: ρ1 = ρ2 = ρ (say); that is, the sample correlation coefficients do not differ significantly.
Alternative hypothesis H1: ρ1 ≠ ρ2.
Under H0 the test statistic is
U = (Z1 - Z2) / √(1/(n1 - 3) + 1/(n2 - 3)) ~ N(0, 1),
where Z1 = 1.1513 log10[(1 + r1)/(1 - r1)] and Z2 = 1.1513 log10[(1 + r2)/(1 - r2)].
If the calculated value of |U| is less than the table value, accept H0; otherwise reject it. The table/significant values can be had from Table 11.6.
Example 13.7: Two independent samples have 28 and 19 pairs of observations, with correlation coefficients 0.55 and 0.75 respectively. Are these values of r consistent with the hypothesis that both samples are drawn from the same population?
Answer: Given n1 = 28, n2 = 19, r1 = 0.55 and r2 = 0.75.
Null hypothesis H0: ρ1 = ρ2; that is, the correlation coefficients do not differ significantly. In other words, the samples are drawn from the same population.
Alternative hypothesis H1: ρ1 ≠ ρ2 (two tailed test).
Now Z1 = 1.1513 log10(1.55/0.45) = 0.6184
and Z2 = 1.1513 log10(1.75/0.25) = 0.9730.
Under H0, the test statistic U = (Z1 - Z2)/√(1/(n1 - 3) + 1/(n2 - 3)) ~ N(0, 1).
U = (0.6184 - 0.9730)/√(1/25 + 1/16) = -0.3546/0.3202 = -1.11
Since |U| = 1.11 < 1.96, accept the null hypothesis. That is, the samples are drawn from the same population.
13.4. F-TESTS:
These are tests based on the F-distribution, also called F-tests. In this unit only one test, namely the F-test for the equality of two population variances, is discussed. Other F-tests, like 1. the test for equality of several means (ANOVA) (discussed in Unit-III), 2. the test for significance of an observed multiple correlation and 3. the test for significance of an observed sample correlation ratio, are not discussed, as these are beyond the scope of this book.
13.4.1. F-test for equality of two population variances:
Let x1, x2, ..., xn1 and y1, y2, ..., yn2 be two random samples of sizes n1 and n2 drawn from two normal populations with variances σ1² and σ2² respectively, and let s1² and s2² be the sample variances. To test the significance of the equality of the variances, one uses the F-test for equality of variances.
Null hypothesis H0: σ1² = σ2²; there is no significant difference between the variances.
Test statistic: F = S1²/S2², where S1² = Σ(xi - x̄)²/(n1 - 1) and S2² = Σ(yi - ȳ)²/(n2 - 1), the larger of the two estimates being taken in the numerator. F follows the F-distribution with (n1 - 1, n2 - 1) degrees of freedom.
If the calculated value of F is less than the table value of F, accept the null hypothesis; otherwise reject it.
Example 13.8: In a sample of 8 observations, the sum of the squares of deviations of the observations from their mean was 94.5. In another sample of 10 observations, the sum of squares of deviations from the mean was found to be 101.7. Test whether the difference between the variances is significant. (You are given that, at 5% level of significance, the critical value of F for ν1 = 7 and ν2 = 9 degrees of freedom is 3.29, and for ν1 = 8 and ν2 = 10 degrees of freedom it is 3.07.)
Answer: Null hypothesis H0: σ1² = σ2²; that is, the sample variances do not differ significantly.
Given n1 = 8, n2 = 10; Σ(xi - x̄)² = 94.5 and Σ(yi - ȳ)² = 101.7.
S1² = 94.5/(8 - 1) = 13.5 and S2² = 101.7/(10 - 1) = 11.3
The test statistic is F = S1²/S2².
Since S1² > S2², F = 13.5/11.3 = 1.195, which is less than 3.29, the table/critical value of F(7, 9) at 5% level of significance (given).
Accept the null hypothesis; that is, the variability of the two samples does not differ significantly.
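A minimal Python sketch of this variance-ratio test (the helper name variance_f_test is illustrative; scipy's f distribution supplies the critical value instead of a printed table):

    from scipy.stats import f

    def variance_f_test(ss1, n1, ss2, n2, alpha=0.05):
        # Unbiased variance estimates; the larger one goes in the numerator
        v1, v2 = ss1 / (n1 - 1), ss2 / (n2 - 1)
        if v1 >= v2:
            F, dfn, dfd = v1 / v2, n1 - 1, n2 - 1
        else:
            F, dfn, dfd = v2 / v1, n2 - 1, n1 - 1
        critical = f.ppf(1 - alpha, dfn, dfd)
        return F, critical

    # Example 13.8: ss1 = 94.5 (n1 = 8), ss2 = 101.7 (n2 = 10)
    print(variance_f_test(94.5, 8, 101.7, 10))   # F approx 1.195 vs critical approx 3.29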
Activity 13.1: A certain pesticide is packed into bags by a packing machine. The machine is supposed to pack bags weighing 50 kgs each. To test whether the machine is performing correctly, a random sample of 10 bags was weighed and the weights were found to be 50, 49, 52, 44, 45, 48, 46, 45, 49, 45. Conclude whether the machine is weighing correctly or not.
Activity 13.2: The weights of a group of 5 patients treated with medicine 'A' are 42, 39, 48, 60 and 41 kgs, and of another group of 7 patients treated with medicine 'B' are 38, 42, 56, 64, 68, 69, 62 kgs. Do you agree with the claim that medicine B is more effective in increasing weight than medicine A at 5% level of significance?
Activity 13.3: Random samples of 8 boys and 10 girls at GITAM University Engineering College are taken, and it is found that their average pocket money carried per day is Rs. 200 and Rs. 170, with standard deviations of Rs. 21 and Rs. 15 respectively. Test at 5% level of significance whether there is any difference between boys and girls as far as the amount of pocket money carried per day is concerned.
Activity 13.4: The differences in sales of a product before and after an advertisement campaign are given below. Test whether the advertising campaign is effective at 5% level of significance.

Difference   -3   -2   -1   -1   -3   -4   -2   -4

Activity 13.5: A random sample of 8 pairs of observations from a normal population gave a correlation coefficient of 0.6. Is this significant of correlation in the population?
Activity 13.6: A sample of 20 pairs of observations gives r = 0.5. Can it be regarded as drawn from a bivariate normal population having correlation coefficient 0.6?

13.5. SUMMARY:
In this unit some important small sample tests were discussed. First, the t-test for a single mean, the t-test for the significance of the difference of two means, the paired t-test, and the test for an observed correlation coefficient were discussed. Then tests based on Z-transformations, namely the test of whether the population correlation coefficient has a specified value and the test of whether two independent sample correlation coefficients differ significantly, were discussed. The F-test for the equality of two population variances was presented. All these tests were further explained with the help of examples.
Review Questions:
1. Ten specimens of copper wire drawn at random from a large lot have the following breaking strengths (kg. wt.). Test whether the mean breaking strength of the lot may be taken to be 578 kg. wt. (level of significance 5%).
2. A farmer grows crops on two fields A and B. On A he puts Rs. 100 worth of manure per acre and on B Rs. 200 worth. The net returns per acre, exclusive of the cost of manure, on the two fields in five years are as follows.

Year      1     2     3     4     5
Field A   340   280   420   370   440
Field B   360   330   480   380   500

Other things being equal, discuss whether it is likely to pay the farmer to continue with the more expensive manure, at 5% level of significance.
3. The increases in weight due to two kinds of food are given below. Can it be said that food B is better than food A? (Use 5% level of significance.)

Food A   49   53   51   52   47   50   52   53
Food B   52   55   52   53   50   54   54   53

4. The sales data of an item in seven shops before and after a special promotional campaign are as under. Can the campaign be judged a success at 5% level of significance?

Shops    1    2    3    4    5    6    7
Before   53   28   31   48   50   42   56
After    58   29   30   55   56   45   56

5. A coefficient of correlation of 0.2 is determined from a random sample of 625 pairs of observations. Is this value significant?
6. Can the following two samples be regarded as coming from the same normal population?

Sample   Size   Sample mean   Sum of squares of deviations from mean
1        10     12            120
2        12     15            314

Further readings:
1. Kothari, C.R., GauravGarg, Research Methodology – Methods and Techniques, New Age
International publishers, New Delhi.
2. Chandan,J.S, Jagjit Singh, Khanna.K.K., Business Statistics, Vikas Publishing House Pvt. Limited,
New Delhi.
3. Gupta S.C, Fundamentals of Statistics, Sultan Chand & Sons, New Delhi.
4. Gupta S.C and Kapoor V.K, Fundamentals of Mathematical Statistics, Sultan Chand & Sons, New Delhi.

UNIT - 14
NON-PARAMETRIC TESTS
OBJECTIVES:
After studying this unit one will be able to
• Explain the meaning of non-parametric tests.
• Explain the advantages and disadvantages of non-parametric tests.
• Explain the important non-parametric tests based on different numbers of samples.

STRUCTURE:
14.1 Introduction
14.2 Advantages and disadvantages of non-parametric tests
14.3 Different important non-parametric tests
14.4 One Sample tests
14.5 Kolmogorov–Smirnov one sample test (K–S or KS one sample test)
14.6 Ordinary one sample run test (Run test)
14.7 Runs test for randomness
14.8 Chi-Square test
14.9 Two sample tests
14.10 Mann Whitney U test (independent samples)
14.11 Median test (independent samples)
14.12 Wald- Wolfowitz runs test (independent samples)
14.13 Rank Correlation Coefficient significance test (dependent samples)
14.14 Several Sample tests
14.15 The Kruskal-Wallis test (independent samples)
14.16 Summary
14.17 Review Questions
14.18 Further Readings

14.1. INTRODUCTION:
The statistical tests discussed earlier are based on estimating the parameters of a distribution whose form is
known or assumed for the population. That is, the parameters of the population are known or assumed to
be known. Such tests are known as parametric tests. In these tests the model specifies certain conditions
about the parameters of the population from which the samples are drawn.
However, there are situations in business and in psychological studies where one cannot make any assumptions
regarding the form of the distribution in the population from which the sample is drawn. That is, the functional
form of the distribution and the population parameters being tested are not known. The
tests used for such problems are known as non-parametric tests. Non-parametric tests are sometimes
called distribution-free tests.
Nonparametric statistics are statistics not based on parameterized families of probability distributions.
They include both descriptive and inferential statistics. Statistical tests that do not rely on assumptions of
distributions or parameter estimates are called non-parametric tests. If the information regarding the
population and its parameters is not completely known but a hypothesis about the population must still be
tested, a non-parametric test is appropriate. In these tests no parametric assumptions based on the sample or
the sampling distribution are made, and the population from which the sample is drawn need not be normally
distributed. Here the null hypothesis is free of parameters. These tests are applicable to variables as well as
attributes, and can be used for nominal and ordinal scale data also.
The difference between parametric models and non-parametric models is that the former has a fixed
number of parameters, while the latter grows the number of parameters with the amount of training data.
Assumptions associated with non-parametric tests:
1. Sample observations are independent.
2. The variable under study is continuous.
3. Moments of lower order exist.

14.2. ADVANTAGES AND DISADVANTAGES OF NON-PARAMETRIC TESTS:


Advantages:
1. They are simple, easy and readily comprehensible and do not require complicated sample theory.
2. They involve less arithmetic computation.
3. No assumption is made about the parent population from which the sample is drawn.
4. These tests are applicable to the data which are classified by means of nominal scale.
5. They are applicable for analyzing ranked, scaled or rated data.
6. These tests are applicable in sociology, psychometry, educational statistics, business etc.
Disadvantages:
1. These tests can be used only for the data measured using nominal or ordinal scale.

2. These tests are generally less powerful than parametric tests.
3. No Non-parametric test exists for analysis of variance.
4. Non-parametric tests are not designed for estimating parameters; they are designed to test statistical
hypotheses only.
5. They ignore certain amount of information.
14.3. DIFFERENT IMPORTANT NON-PARAMETRIC TESTS:
Based on the number of samples (one, two or more) drawn from the population, these tests
can be classified into one sample, two sample and several sample tests. Depending on the relation between
the samples drawn, two and several sample tests are further classified into tests based on independent
samples and tests based on related/dependent samples. The most popular one sample tests are the One Sample
Sign test, the Kolmogorov-Smirnov (KS) one sample test, the Wilcoxon Signed Rank Sum test (single sample),
the ordinary one sample run test, the runs test for randomness and the Chi-square test.
The most popular two sample tests are the (Wilcoxon-) Mann Whitney U test (independent samples), the Median
test (independent samples), the Wald-Wolfowitz runs test (independent samples), the Kolmogorov-Smirnov
two sample test (independent samples), the Two Sample Sign test (dependent samples), the Wilcoxon Signed
Rank test or Wilcoxon Matched-Pairs test (dependent samples), and the rank correlation coefficient
significance test (dependent samples).
The most popular several sample tests are the Kruskal-Wallis H test (independent samples), the Quade test
(dependent samples) and the Friedman test (dependent samples). Some of these tests are beyond the scope of
this book; the remaining tests are discussed and presented below.

14.4. ONE SAMPLE TESTS:


The one sample tests are used in situations where the aim is to answer goodness-of-fit questions such as:
Is there any significant difference between the observed and expected frequencies?
Is it believable that the sample is drawn from a specified population?
Is it acceptable that the sample is drawn from a known population?

14.5. KOLMOGOROV–SMIRNOV ONE SAMPLE TEST (K–S OR KS ONE SAMPLE TEST):


The Kolmogorov–Smirnov one sample test is a nonparametric test of the equality of continuous, one-
dimensional probability distributions that can be used to compare a sample with a reference probability
distribution (one-sample K–S test). The Kolmogorov–Smirnov statistic quantifies a distance between the
empirical distribution function of the sample and the cumulative distribution function of the reference
distribution.

Fig 14.1: Illustration of the Kolmogorov–Smirnov statistic. The smooth (red) curve is the theoretical CDF,
the step (blue) curve is the empirical CDF (ECDF), and the arrow marks the K–S statistic.
The null distribution of this statistic is calculated under the null hypothesis that the sample is drawn from
the reference distribution. In this case, the distributions considered under the null hypothesis are continuous
distributions but are otherwise unrestricted.
This test is used when the data are at least ordinal, and we compare observed sample distribution with a
theoretical distribution. This test is more powerful than Chi-Square test and can be used for very small
samples when one cannot use Chi-Square. This test is a test of goodness of fit in which we specify the
cumulative frequency distribution.

H0: There is no significant difference between the observed and theoretical frequencies.
H1: There is a significant difference between the observed and theoretical frequencies.
Procedure:
1. Calculate the cumulative frequencies for each class in respect of both the observed and the
theoretical categories.
2. Convert the cumulative frequency of each class into a proportion in respect of both the observed and
theoretical categories, giving F0(X) and Fe(X).
3. Find the differences between the observed and theoretical proportions, ignoring the plus or minus
signs.
4. Compare D = max |F0(X) − Fe(X)|, the maximum difference, with the critical value at a given level of
significance.
If the calculated value of D is less than the table value of D at the given level of significance one can accept
the null hypothesis; otherwise reject the null hypothesis. Here F0(X) = k/n is the observed cumulative
distribution of a random sample of size n, where X is a possible score and k is the number of observations
equal to or less than X; Fe(X) is the theoretical cumulative distribution specified under H0.

Example 14.1: Mr. Rao, national sales manager of a plastics company, has collected the salary statistics of
the field staff, given in the following table. He has also computed the expected distribution of salaries
under an assumed normal distribution, presented below. At the 0.01 level of significance, can Mr.
Rao conclude that the salary distribution is normal?
Salary particulars are in 000's of rupees.

Salary (Rs. 000's) 25-30 31-36 37-42 43-48 49-54 55-60 61-66

Observed 9 22 25 30 21 12 6
Expected 6 17 32 35 18 13 4

Answer: H0: The distribution of salaries is normal.

Observed (O) Cum. O F0(x) Expected (E) Cum. E Fe(x) |F0(x)−Fe(x)|

9 9 0.072 6 6 0.048 0.024
22 31 0.248 17 23 0.184 0.064 = D
25 56 0.448 32 55 0.440 0.008
30 86 0.688 35 90 0.720 0.032
21 107 0.856 18 108 0.864 0.008
12 119 0.952 13 121 0.968 0.016
6 125 1.000 4 125 1.000 0.000

The critical value at the 1% (0.01) level of significance, as the sample is large, is 1.22/√125 = 0.109.
Since the D value (0.064) is less than the table/critical value 0.109, accept the null hypothesis; it can be
concluded that Mr. Rao's conclusion is correct.
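The D statistic of Example 14.1 can be verified with a short computation. The following is a minimal Python sketch (assuming NumPy is installed; the variable names are illustrative):

    import numpy as np

    # Observed and expected frequencies from Example 14.1 (n = 125)
    observed = np.array([9, 22, 25, 30, 21, 12, 6])
    expected = np.array([6, 17, 32, 35, 18, 13, 4])
    n = observed.sum()

    # Cumulative proportions F0(x) and Fe(x)
    f_obs = np.cumsum(observed) / n
    f_exp = np.cumsum(expected) / n

    # K-S statistic: the largest absolute gap between the two CDFs
    d = np.max(np.abs(f_obs - f_exp))
    critical = 1.22 / np.sqrt(n)  # large-sample critical value used in the text

    print(d, critical, d < critical)  # about 0.064, 0.109, True -> accept H0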

14.6. ORDINARY ONE SAMPLE RUN TEST (RUN TEST):


The one sample run test is used to judge the randomness of a sample based on the order in which the observations
are taken. There are many applications in which it is difficult to decide whether the sample used is
random or not. This is true when one has little or no control over the selection of the data.
Run: A run is a succession of identical letters which is followed and preceded by different letters or by no
letters at all.

Example: Consider an arrangement of healthy (H) and diseased (D) mango trees that were planted many
years ago along a certain road:
HH DD HHHHH DDD HHHH DDDDD HHHHHHH
1st 2nd 3rd 4th 5th 6th 7th
Here the 1st run consists of two H's, the 2nd run consists of two D's, and so on; the 7th run consists of seven
H's. There are seven runs altogether. The number of items in a run is called its length, that is, the run length.
In the above example the 1st run length is 2, the 2nd run length is 2, the 3rd run length is 5, etc.
If there are too few runs one can suspect some sort of repeated alternating pattern. The one sample run test is
based on the idea that too few or too many runs show that the items were not chosen randomly.

To test H0: The healthy and diseased mango trees occur in random order,
against H1: They do not occur in random order, find the number of runs r, and let
n1 = the number of occurrences of type I (say H in the example),
n2 = the number of occurrences of type II (say D in the example).
For the one sample run test, if the value of r is less than or equal to r1, the table value for the left-end
critical value of r in the runs test, or more than or equal to r2, the table value for the right-end critical
value of the runs test, reject the null hypothesis H0; otherwise accept it.
If either of n1, n2 is greater than 20, the distribution of r can be approximated by a normal distribution and
the test statistic will be as follows:
Test statistic Z = (r − E(r)) / √V(r), where r represents the number of runs,
E(r) = 2n1n2/(n1 + n2) + 1 and V(r) = 2n1n2(2n1n2 − n1 − n2) / [(n1 + n2)²(n1 + n2 − 1)].
If the calculated value of |Z| is less than the table value of Z at a given level of significance one can accept
the null hypothesis; otherwise reject the null hypothesis.
Example 14.2: Test the randomness of the data given below:
9,24,73,67,48,32,68,65,18,12,14,64,80,23,80,52.
Answer:
H0: The given data are random.
H1: The given data are not random.
After arranging the values in order of magnitude one finds the median of the given data to be 50.
Now write − if the given value is less than the median and + if the given value is more than the median, in
the given order of the data:
− − + + − − + + − − − + + − + +
There are 8 runs, that is, r = 8.
Positive signs give n1 = 8 and negative signs give n2 = 8.
The table values are r1 = 4 and r2 = 14 for n1 = 8 and n2 = 8.
Since 8 lies between 4 and 14, accept the null hypothesis and conclude that the data are random.
Example 14.3:
In an experiment a coin is tossed 35 times, out of which heads appear 21 times. If the number of runs is 15,
test the unbiasedness of the coin.
Answer: H0: The coin is unbiased.
H1: The coin is not unbiased.
Given r = 15, n1 = 21, n2 = 14.
Since n1 = 21 is greater than 20, use the Z statistic:
E(r) = 17.8; V(r) = 7.807,
so |Z| = |15 − 17.8|/√7.807 = 1.002, which is less than 1.96, the critical value at the 5% level of significance.
Accept the null hypothesis; it can be concluded that the coin is unbiased.
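The normal approximation used in Examples 14.2 and 14.3 can be packaged as a small helper. A minimal Python sketch (standard library only; the function name is illustrative):

    import math

    def runs_z(r, n1, n2):
        # r: observed number of runs; n1, n2: counts of the two symbol types
        e_r = 2 * n1 * n2 / (n1 + n2) + 1
        v_r = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
               / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
        return (r - e_r) / math.sqrt(v_r)

    # Example 14.3: r = 15 runs, 21 heads and 14 tails
    print(round(runs_z(15, 21, 14), 3))  # -1.002; |Z| < 1.96, accept H0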

14.7. RUNS TEST FOR RANDOMNESS:


Let x1, x2, ……, xn be a sample of observations arranged in the order of their occurrence. Find the median of
the given data, then write A if an observation is above the median and B if it is below the
median. Remove an observation if it is equal to the value of the median, and accordingly adjust the
sample size.
One will then get a sequence like AABBBABAABBBBA….
Define U = the number of runs, and let n be the sample size after removing such observations. Then for
sample size n > 25, U follows a normal distribution with mean (n + 2)/2 and variance (n/4)[(n − 2)/(n − 1)].
Then
Z = [U − (n + 2)/2] / √{(n/4)[(n − 2)/(n − 1)]} ~ N(0, 1).
If the calculated value of |Z| is less than the table value of Z (normal test) at a given level of significance one
can accept the null hypothesis that the sample drawn is random; otherwise reject the null hypothesis.
Example 14.4.: The win-loss record of a certain basketball team for their last 50 consecutive games was
as follows:
WWWWWWLWWWWWWLWLWWWLLWWWWLWWWLLWWWWWWLLWWLLLWWLWWW
Test whether the sequence of wins and losses are random or not using run theory (test).

Answer: H0: The win-loss sequence is random.
H1: The sequence is not random.
In the above sequence the number of runs is U = 19 and the sample size is n = 50.
E(U) = (n + 2)/2 = 26; V(U) = (n/4)[(n − 2)/(n − 1)] = 12.25 and Z = |19 − 26|/√12.25 = 2, greater than 1.96,
the critical value at the 5% level of significance. Hence reject the null hypothesis; it can be concluded that
the sequence is not random.
However, if the 1% level of significance is considered, Z = 2 < 2.58, the critical value. Hence accept the null
hypothesis and conclude that the sequence is random.
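The run count and Z value of Example 14.4 can be reproduced mechanically. A minimal Python sketch using itertools.groupby (standard library only):

    import math
    from itertools import groupby

    seq = "WWWWWWLWWWWWWLWLWWWLLWWWWLWWWLLWWWWWWLLWWLLLWWLWWW"
    n = len(seq)

    # U = number of maximal blocks of identical letters
    u = sum(1 for _ in groupby(seq))

    e_u = (n + 2) / 2
    v_u = (n / 4) * (n - 2) / (n - 1)
    z = (u - e_u) / math.sqrt(v_u)
    print(u, round(z, 2))  # 19 runs, Z = -2.0; |Z| > 1.96, reject H0 at 5%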

14.8. CHI-SQUARE TEST:


The Chi-Square test is applied to nominal and ordinal fields. It produces a one-sample test that computes
a chi-square statistic based on the differences between the observed and expected frequencies of the
categories of a field.
It is the most powerful and widely used non-parametric test. It can be applied to any univariate distribution
for which one can calculate the cumulative distribution function. Some basic assumptions are made in
these tests, such as a minimum sample size, and the variables being examined can be
measured on any measuring scale: nominal, ordinal, interval, or ratio. This test is used to make
comparisons between two or more variables. It is used to make comparisons between frequencies rather
than between mean scores. Here the sample data are divided into intervals and the numbers of points
that fall in each interval are compared with the expected numbers of points in each interval. It is
used to determine whether sample data are consistent with a hypothesized distribution. The test determines
how well a theoretical distribution (such as the normal, binomial, or Poisson) fits the empirical distribution.
The minimum possible value of the chi-square variable is 0 and there is no limit on the maximum value.
This unit is limited to two tests of the chi square.
14.8.1. Chi-Square test for goodness of fit:
It is used to make comparisons between two or more nominal, ordinal, interval or ratio variables. Suppose
a set of observed values/frequencies was obtained under some experiment and one wants to test
whether the experimental results support a particular theory or hypothesis. Here the actual frequencies in
each category are compared to the frequencies expected to occur theoretically if the data
followed the specific probability distribution of interest or some assumed hypothesis. The total of the
expected number of cases is always made equal to the total of the observed number of cases. Let Oi
(i = 1, 2, 3, …, n) be the observed frequency/observed number of cases for the ith category, and further let Ei
be the expected frequency/expected number of cases for the ith category, with ΣOi = ΣEi = N, the total
frequency.
Null Hypothesis: There is no significant difference between the observed and expected frequencies.
Alternative hypothesis: There is a significant difference between the observed and expected frequencies.
Test statistic: χ² = Σ (Oi − Ei)² / Ei
The calculated value of χ² is compared with the table value of χ² at the α per cent level of significance with
(n − 1) degrees of freedom. If the calculated value is less than the table value one can accept the null
hypothesis; otherwise reject the null hypothesis.
Conditions for validity of the Chi-Square test: The total frequency N should be large. The sample observations
should be independent. No theoretical/expected frequency should be small (in any case it should not be
less than 5). If it is less than 5, apply the pooling technique, which consists in adding the preceding or
succeeding frequency to the frequency less than 5, so that the total is greater than or equal to 5, and
accordingly adjust the degrees of freedom.
Example 14.5: Theory predicts that the proportions of beans in the four groups A, B, C and D
should be 9 : 3 : 3 : 1. In an experiment with 1600 beans, the numbers in the four groups were 882, 313,
287 and 118. Do the experimental results support the theory?
Answer: Null Hypothesis: There is no significant difference between the experimental results and the theory.
That is, the experiment supports the theory.
Alternative Hypothesis: The experiment does not support the theory.

Category Observed frequency (Oi) Expected frequency (Ei) Oi − Ei (Oi − Ei)² (Oi − Ei)²/Ei

A 882 1600 × 9/16 = 900 −18 324 0.360
B 313 1600 × 3/16 = 300 13 169 0.563
C 287 1600 × 3/16 = 300 −13 169 0.563
D 118 1600 × 1/16 = 100 18 324 3.240
Total 1600 1600 4.726

Degrees of freedom = 4 − 1 = 3, and the critical chi-square value for 3 degrees of freedom at the 5% level of
significance is 7.81 (from the table).
The calculated chi-square value 4.726 < 7.81, the table value; hence accept the null hypothesis. That is, the
experiment supports the theory.
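Example 14.5 can be checked with SciPy, which computes the same statistic and also returns a p value. A minimal sketch (assuming SciPy is installed):

    from scipy.stats import chisquare

    observed = [882, 313, 287, 118]
    expected = [1600 * p / 16 for p in (9, 3, 3, 1)]  # 900, 300, 300, 100

    stat, p_value = chisquare(f_obs=observed, f_exp=expected)
    print(round(stat, 3), round(p_value, 3))  # 4.726; p > 0.05, accept H0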
14.8.2. Chi-Square test for independence of attributes:
An attribute is one which cannot be measured using any mathematical measuring scale, that is, which
cannot be measured quantitatively. It is qualitative in nature. The chi-square test for independence of
two attributes examines whether the two attributes are related or not related to each other. It
uses a cross-classification table, mostly known as a contingency table, which specifies the pattern between
the attributes. The nature of the pattern or connection between the attributes can be examined with many
statistical procedures, and one such procedure is the chi-square test for independence of attributes. Like the
chi-square test for goodness of fit, this test can be used with variables measured on any type of scale:
nominal, ordinal, interval or ratio. Let A and B be two attributes which are to be examined for their
association. Let attribute A be divided into r classes A1, A2, …, Ar and attribute B be divided into s
classes B1, B2, …, Bs. Let (Ai) be the number of persons possessing the attribute Ai, let (Bj)
be the number of persons possessing the attribute Bj, and let (AiBj) be the number of persons
possessing both the attributes Ai and Bj. The cell frequencies can be expressed in the following
contingency table of order r × s:

        B1      B2      …      Bs      Total
A1     (A1B1)  (A1B2)  …  (A1Bs)   (A1)
A2     (A2B1)  (A2B2)  …  (A2Bs)   (A2)
…        …       …     …     …       …
Ar     (ArB1)  (ArB2)  …  (ArBs)   (Ar)
Total  (B1)    (B2)    …  (Bs)      N

Let P(Ai) = (Ai)/N be the probability that a person possesses the attribute Ai.
Let P(Bj) = (Bj)/N be the probability that a person possesses the attribute Bj.
Let P(AiBj) = (AiBj)/N be the probability that a person possesses both attributes.
If A and B are independent, then P(AiBj) = P(Ai) P(Bj).
Each cell value (AiBj) is known as a cell frequency or observed frequency, and its
expected frequency is the ratio of the product of the row total and the column total
to the grand total N. That is, E(AiBj) = (Ai)(Bj)/N.

Null Hypothesis: The attributes are independent of each other.
Alternative hypothesis: The attributes are dependent.
Test statistic: χ² = Σi Σj [(AiBj) − E(AiBj)]² / E(AiBj)

The calculated value of χ² is compared with the table value of χ² at the α per cent level of significance with
(r − 1)(s − 1) degrees of freedom. If the calculated value is less than the table value one can accept the null
hypothesis; otherwise reject the null hypothesis.
Example 14.6: A sample of 435 women was distributed as per education level and marriage adjustment
score (MAS). Test whether the marriage adjustment score is greater in highly educated women.

Education           Very low MAS  Low MAS  High MAS  Very high MAS  Total

College                  24          97       62          58         241
High School              22          28       30          41         121
Elementary School        32          10       11          20          73
Total                    78         135      103         119         435

Answer:
H0 Null Hypothesis: There is no association between education level and marriage adjustment score.
H1 Alternative Hypothesis: The marriage adjustment score is greater in highly educated women.
The test statistic is χ² = Σi Σj (Oij − Eij)² / Eij.
A 3 × 4 contingency table is given; hence the degrees of freedom are (3 − 1)(4 − 1) = 6.


The expected frequencies are as follows: E(24) = 241 × 78/435 = 43.21; E(97) = 241 × 135/435 = 74.79;
E(62) = 241 × 103/435 = 57.06; E(58) = 241 × 119/435 = 65.94; E(22) = 121 × 78/435 = 21.69;
E(28) = 121 × 135/435 = 37.55; E(30) = 121 × 103/435 = 28.65; E(41) = 121 × 119/435 = 33.11;
E(32) = 73 × 78/435 = 13.10; E(10) = 73 × 135/435 = 22.66; E(11) = 73 × 103/435 = 17.29;
E(20) = 73 × 119/435 = 19.95.

The calculated χ², after simplification, is 57.51, which is more than 12.59, the table value at the 5%
level of significance and 6 degrees of freedom. Reject the null hypothesis.
That is, one can conclude that the greater the level of education, the greater is the degree of marriage
adjustment.
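The entire computation of Example 14.6, including the expected frequencies, can be delegated to scipy.stats.chi2_contingency. A minimal sketch (assuming SciPy is installed):

    from scipy.stats import chi2_contingency

    # Rows: College, High School, Elementary School; columns: the four MAS levels
    observed = [[24, 97, 62, 58],
                [22, 28, 30, 41],
                [32, 10, 11, 20]]

    chi2, p, dof, expected = chi2_contingency(observed)
    print(round(chi2, 2), dof)  # about 57.5 with 6 degrees of freedom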
14.8.3. Chi-Square test for independence of attributes: 2 × 2 contingency table:
Let A and B be two attributes which are to be examined for their association. Let attribute A be divided
into 2 classes A1, A2 and attribute B be divided into 2 classes B1, B2. Let a be the number of persons
possessing both A1 and B1, b the number possessing both A1 and B2, c the number possessing both A2 and
B1, and d the number possessing both A2 and B2, with N = a + b + c + d. The table is:

        B1       B2       Total
A1      a        b        a + b
A2      c        d        c + d
Total   a + c    b + d    N = a + b + c + d

Null Hypothesis: The attributes A and B are independent of each other.
Alternative hypothesis: The attributes are dependent.
Test statistic: χ² = N(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)]
The calculated value of χ² is compared with the table value of χ² at the α per cent level of significance with
1 degree of freedom. If the calculated value is less than the table value one can accept the null hypothesis;
otherwise reject the null hypothesis.
The chi-square test of independence can be used for any variable; the group (independent) and the test
variable (dependent) can be nominal, dichotomous, ordinal, or grouped interval.
Example 14.7: Two sample polls of votes for two candidates A and B from two areas are given
below. Test whether the nature of the area is related to the voting preference for the candidates.

Area Candidate A Candidate B Total


Rural 620 380 1000
Urban 550 450 1000
Total 1170 830 2000

Answer:
Null Hypothesis: There is no relation between the voting preference for candidates A and B and the
nature of the area.
Here a = 620, b = 380, c = 550, d = 450; then
χ² = 2000 (620 × 450 − 380 × 550)² / (1000 × 1000 × 1170 × 830) = 10.09.
The table value for 1 degree of freedom at the 5% level of significance is 3.841. The calculated value is
more than the table value. Hence reject the null hypothesis. That is, there is a relation between the
voting preference and the nature of the area.
14.8.4. Yates Correction: In case the expected frequency of any cell is < 5, we apply Yates' correction
to the observed frequencies, and the statistic becomes:
χ² = N (|ad − bc| − N/2)² / [(a + b)(c + d)(a + c)(b + d)]
Example 14.8: In an experiment on the immunization of goats against anthrax, the following results were
obtained. Test the efficiency of the vaccine in controlling anthrax.

                            Died of anthrax  Survived  Total

Inoculated with vaccine           2             10       12
Not inoculated                    6              6       12
Total                             8             16       24

Solution: Null Hypothesis: There is no impact of the vaccine in controlling anthrax.
Alternative Hypothesis: There is an impact of the vaccine in controlling anthrax. That is, the vaccine is
effective.
Given that a = 2, b = 10, c = 6, d = 6.
Since a = 2 < 5, Yates' correction must be applied to the formula for calculating the chi-square statistic.
Then
χ² = 24 (|2 × 6 − 10 × 6| − 24/2)² / (12 × 12 × 8 × 16)
So, χ² = 24 × (48 − 12)² / 18432 = 1.6875.
Since the calculated value 1.6875 is less than 3.841, the table value at the 5% level of significance and 1
degree of freedom, accept the null hypothesis. That is, there is no evidence of an impact of vaccination in
controlling anthrax.
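For 2 × 2 tables, scipy.stats.chi2_contingency applies Yates' continuity correction by default, so Example 14.8 can be checked directly. A minimal sketch (assuming SciPy is installed):

    from scipy.stats import chi2_contingency

    table = [[2, 10],
             [6, 6]]

    # correction=True (the default) applies Yates' continuity correction
    chi2, p, dof, expected = chi2_contingency(table, correction=True)
    print(round(chi2, 4))  # 1.6875, below the 5% critical value 3.841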

14.9. TWO SAMPLE TESTS:


The two sample tests discussed in this section are the (Wilcoxon-) Mann Whitney U test (independent
samples), the Median test (independent samples), the Wald-Wolfowitz runs test (independent samples), the
Kolmogorov-Smirnov two sample test (independent samples) and the rank correlation coefficient significance
test (dependent samples).

14.10. (WILCOXON-) MANN WHITNEY U TEST (INDEPENDENT SAMPLES):


This is the analogue of the t-test in the case of two independent samples.
It belongs to a family of rank sum tests. It uses ranking information rather than plus or minus signs, and it
is more powerful than the Median test.
The Mann Whitney U test is designed to determine whether two independent samples have been drawn from
the same population or from two different populations having the same distribution.
It considers the magnitudes of the various scores and not just the number of runs. To apply the U test to
two samples of sizes n1 and n2, the values of the combined samples are ranked from the lowest to
the highest (or highest to lowest) in order. In case there is a tie, assign
each of the tied observations the mean of the ranks which they jointly occupy. Find the sums of the ranks
assigned to the values of the first and second samples separately and denote them by R1 and R2 respectively.
Compute the value of the test statistic U = min{U1, U2}, where
U1 = n1n2 + n1(n1 + 1)/2 − R1 and U2 = n1n2 + n2(n2 + 1)/2 − R2.

Null Hypothesis H0: The samples come from the same population.
Alternative Hypothesis H1: The samples come from different populations.
For very small samples, i.e. neither n1 nor n2 larger than 8: In this case use the tables of probabilities (p-
values) for the Mann-Whitney U statistic for small samples, which are available for n2 = 3 to 8 (6 different
tables; in some books up to 10, in which case 8 tables), for n1 ≤ n2 and for different values
of U. These probabilities are for a one tailed test. For two tailed tests double the one tailed values.
If the p value so obtained is greater than .05, accept the null hypothesis; it can be inferred that the
samples are drawn from the same population.
If the value of U is so large that it lies outside the range of the available tables, denote the large value of U
by U′, then find U = n1n2 − U′ and proceed as before.
For n2 between 9 and 20 and n1 ≤ 20:
In this case the tables used above are not applicable. Instead use the tables of critical values of U in the
Mann-Whitney test, available at significance levels .001, .01, .025 and .05 for a one tailed test. For a two
tailed test the corresponding significance levels are .002, .02, .05 and .1.
If the calculated value of U is less than the table value (corresponding to the level of significance, n1 and
n2), reject the null hypothesis; otherwise accept it.

For large samples (n2 is larger than 20):
In this case the above tables are not useful. However, as n2 increases the sampling
distribution of U rapidly approaches the normal distribution with mean E(U) = n1n2/2 and
variance V(U) = n1n2(n1 + n2 + 1)/12. That is, the test statistic is Z = (U − E(U)) / √V(U).
If there are any ties, then V(U) = [n1n2 / (N(N − 1))] [(N³ − N)/12 − ΣT],
where N = n1 + n2 and T = (t³ − t)/12, t being the number of observations involved in a tie for a given rank.
If the calculated value of |Z| is less than the table value of Z at a given level of significance one can accept
the null hypothesis; otherwise reject the null hypothesis.
Note: Sometimes this is used even when n2 is larger than 8.
Note: Some times, this will be used even in case of n2 larger than 8
Example 14.9: In the process of testing the quality of education in mathematics provided by private (A) and
public (B) schools, 15 students from a private school and 12 students from a public school were selected
and tested in mathematics. The following are the marks obtained in the test:

A 73 75 82 77 72 69 56 80 68 60 84 61 64 71 86
B 70 78 79 81 65 63 74 83 67 76 88 48 - - -

Test whether the quality of education in private and public schools is equal or not.
Answer:
Null Hypothesis H0 : There is no significant difference between private and public schools in case of
teaching in mathematics.
Assigning ranks to the combined sample of students:

A 73 75 82 77 72 69 56 80 68 60 84 61 64 71 86
Rank 14 16 23 18 13 10 2 21 9 3 25 4 6 12 26
B 70 78 79 81 65 63 74 83 67 76 88 48 - - -
Rank 11 19 20 22 7 5 15 24 8 17 27 1

Sum of ranks of Private school students R1=202


Sum of ranks of Public school students R2=176

The value of the test statistic is U = min{U1, U2}, where
U1 = n1n2 + n1(n1 + 1)/2 − R1 and U2 = n1n2 + n2(n2 + 1)/2 − R2:

U1 = 15 × 12 + (15 × 16)/2 − 202 = 98 and U2 = 15 × 12 + (12 × 13)/2 − 176 = 82.
Therefore, U = 82,
and E(U) = 12 × 15/2 = 90 and V(U) = 12 × 15 × (12 + 15 + 1)/12 = 420.
The statistic Z = (82 − 90)/√420 = −0.39, whose modulus 0.39 is less than 1.96, the critical value at the
5% level of significance.
Hence accept the null hypothesis; that is, there is no difference between private and public schools as far
as teaching in mathematics is concerned.
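SciPy reports the same U statistic together with a two-sided p value; recent SciPy versions return the U value of the first sample. A minimal sketch for Example 14.9 (assuming SciPy is installed):

    from scipy.stats import mannwhitneyu

    a = [73, 75, 82, 77, 72, 69, 56, 80, 68, 60, 84, 61, 64, 71, 86]
    b = [70, 78, 79, 81, 65, 63, 74, 83, 67, 76, 88, 48]

    u, p = mannwhitneyu(a, b, alternative='two-sided')
    print(u, round(p, 3))  # U1 = 98, so U = min(98, 15*12 - 98) = 82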

14.11. MEDIAN TEST (INDEPENDENT SAMPLES):


It determines the significance of the difference between the medians of two or more random samples. It is
meant for determining whether the samples have been taken from populations with the same median. Let
n1 be the size of the first sample and n2 the size of the second sample. Find the combined
median of the two or more samples. Count the number of values in each sample falling above the combined
median and the number of items below the combined median. Prepare a 2 × K contingency table and perform
a chi-square test on the above-median and below-median counts so obtained. The null
hypothesis is H0: The given samples belong to populations with the same median. For large samples, the
test statistic is
Z = [m1 − E(m1)] / √V(m1),
where m1 is the number of observations of the first sample above the combined median, N = n1 + n2, and
E(m1) = n1/2 and V(m1) = n1n2 / [4(N − 1)] if N is even;
E(m1) = n1(N − 1)/(2N) and V(m1) = n1n2(N + 1)/(4N²) if N is odd.
If the calculated value of |Z| is less than the critical value of Z one can accept the null hypothesis at the
given level of significance; otherwise one rejects the null hypothesis.

Note:
If one has a 2 × 2 contingency table of the following format, then one can apply the chi-square test.

Observations Sample 1 Sample2 Total


Above Median m1 m2 m1 +m2
Below Median n1-m1 n2-m2 n1+n2-m1-m2
Total n1 n2 n1+n2=N

Example 14.10: Following are two samples of size 30 each, drawn from two populations. Test the
difference between the populations using the Median test.

X 13.3 14.6 13.6 17.2 14.1 10.6 15.9 14.7 14.2 14 17.4 15.6 8.2 13.8 15.4
Y 14.1 15.1 9.9 14.5 17.9 16.1 16.8 15.1 13.2 18 16.3 13.3 15.8 18 20.4

X 16.3 17.7 15 13.4 13.4 16 13.3 14.9 12.9 14 16.2 11.5 10.4 12.1 18.1
Y 15.7 21.5 14.5 16.7 13.7 13.6 17 15.7 16.8 18.8 18.8 16 14.6 12.3 17.7

Answer:
Null Hypothesis H0: There is no difference between the two populations, i.e. f(x) = f(y).
After arranging all the 60 observations in ascending order the combined median is found to be 15.05.
Then identify, in the ordered list, the values coming from the Y sample (marked y below); the rest are X
values:
8.2 9.9y 10.4 10.6 11.5 12.1 12.3y 12.9 13.2y 13.3 13.3 13.3y 13.4 13.4 13.6
13.6y 13.7y 13.8 14 14 14.1 14.1y 14.2 14.5y 14.5y 14.6 14.6y 14.7 14.9 15
15.1y 15.1y 15.4 15.6 15.7y 15.7y 15.8y 15.9 16 16y 16.1y 16.2 16.3 16.3y 16.7y
16.8y 16.8y 17y 17.2 17.4 17.7 17.7y 17.9y 18y 18y 18.1 18.8y 18.8y 20.4y 21.5y
Since the median value is 15.05, it can be observed that there are 10 X-values and 20 Y-values above the
median.

That is, n1 = n2 = 30; m1 = 10; m2 = 20; N = 60, and N is even, so E(m1) = n1/2 and V(m1) = n1n2/[4(N − 1)].
The test statistic is
Z = |m1 − E(m1)| / √V(m1) = |10 − 15| / √(900/236) = 5/1.95 = 2.56.
Since 2.56 > 1.96, reject the null hypothesis at the 5% level of significance. That is, the distributions cannot
be treated as equal.
If the 1% level of significance is considered, 2.56 < 2.58, the table value, and in that case accept the null
hypothesis.
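SciPy provides scipy.stats.median_test for this procedure. It reports a chi-square statistic (closely related to the square of the Z above) and applies a continuity correction by default, so its figures can differ slightly from the hand computation. A minimal sketch (assuming SciPy is installed):

    from scipy.stats import median_test

    x = [13.3, 14.6, 13.6, 17.2, 14.1, 10.6, 15.9, 14.7, 14.2, 14.0,
         17.4, 15.6, 8.2, 13.8, 15.4, 16.3, 17.7, 15.0, 13.4, 13.4,
         16.0, 13.3, 14.9, 12.9, 14.0, 16.2, 11.5, 10.4, 12.1, 18.1]
    y = [14.1, 15.1, 9.9, 14.5, 17.9, 16.1, 16.8, 15.1, 13.2, 18.0,
         16.3, 13.3, 15.8, 18.0, 20.4, 15.7, 21.5, 14.5, 16.7, 13.7,
         13.6, 17.0, 15.7, 16.8, 18.8, 18.8, 16.0, 14.6, 12.3, 17.7]

    # Result holds the statistic, p value, grand median (15.05) and 2x2 table
    res = median_test(x, y)
    print(res)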

14.12. WALD- WOLFOWITZ RUNS TEST (INDEPENDENT SAMPLES):

Let x1 ≤ x2 ≤ … ≤ xn1 be an ordered sample from a population with density f1(.) and y1 ≤ y2 ≤ … ≤ yn2 be
another independent ordered sample from a population with density f2(.). The Wald-Wolfowitz run test
helps to test whether the two samples are drawn from the same population, that is, from populations with
the same density, i.e. f1(.) = f2(.).
So, Null Hypothesis H0: f1(.) = f2(.) and Alternative Hypothesis H1: f1(.) ≠ f2(.).
Now combine the two samples and arrange them in order of magnitude so that one gets a combined
ordered sample such as
x1, x2, y1, x3, y2, y3, y4, x4, ……
Definition of RUN: A run is defined as a sequence of letters of one kind surrounded by letters of the
other kind, and the number of elements in a run is usually referred to as the length of the run.
If both samples are drawn from the same population, the number of runs will be large, as there
will be a thorough mingling of x's and y's; too few runs suggest that the populations differ.
Now count the number of runs in the combined ordered sample and denote it by U. Then,
under H0, E(U) = 2n1n2/(n1 + n2) + 1 and V(U) = 2n1n2(2n1n2 − n1 − n2) / [(n1 + n2)²(n1 + n2 − 1)].
Test statistic: Z = (U − E(U)) / √V(U)
If the calculated value of Z is less than the table value of Z at a given level of significance one can accept
the null hypothesis; otherwise reject the null hypothesis.
Example 14.11: At the beginning of the year a first-grade class was randomly divided into two groups;
one group was taught using the uniform method and the second group was taught by the individual method.
At the end of the year each student was given a test, and the marks are as follows. Use the Wald-Wolfowitz
run test to test for the equality of the distribution functions of the two groups.
First Group: 227, 55, 184, 176, 234, 147, 252, 194, 88, 149, 247, 161, 16, 92, 171, 292, 99
Second Group: 202, 271, 63, 14, 151, 284, 165, 235, 53, 171, 147, 228, 271
Answer:
Null Hypothesis H0: The two groups' distribution functions are equal.
Combining the samples and ordering them (second-group values marked II):
14(II) 16 53(II) 55 63(II) 88 92 99 147 147(II) 149 151(II) 161 165(II) 171
171(II) 176 184 194 202(II) 227 228(II) 234 235(II) 247 252 271(II) 271(II) 284(II) 292

Example 14.12 (Kolmogorov–Smirnov two sample test): The distributions of marks of two classes, I and II,
are given below. Test whether the two classes differ significantly.

Marks    0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89
Class I   1    3     5     8    14     9     6     3     1
Class II  1    5    10    13    20    14     5     2     0

Solution:
H0: There is no difference between the two classes under consideration.

Marks    Class I    Class II   Cum. freq  Cum. freq   F(x)   G(x)   |F(x) − G(x)|
interval frequency  frequency  Class I    Class II
0-9         1          1          1          1        0.02   0.01      0.01
10-19       3          5          4          6        0.08   0.09      0.01
20-29       5         10          9         16        0.18   0.23      0.05
30-39       8         13         17         29        0.34   0.41      0.07
40-49      14         20         31         49        0.62   0.70      0.08
50-59       9         14         40         63        0.80   0.90      0.10
60-69       6          5         46         68        0.92   0.97      0.05
70-79       3          2         49         70        0.98   1.00      0.02
80-89       1          0         50         70        1.00   1.00      0.00

From this table, D = max |F(x) − G(x)| = 0.10.
Since both sample sizes are large (n1 = 50, n2 = 70), the critical value at the 5% level is
1.36 √[(n1 + n2)/(n1n2)] = 1.36 √(120/3500) = 0.25.
Since the calculated value is less than the table value, accept H0. That is, there is no significant difference
between the two classes.
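The table above can be checked numerically. A minimal sketch (assuming NumPy is installed):

    import numpy as np

    class1 = np.array([1, 3, 5, 8, 14, 9, 6, 3, 1])     # n1 = 50
    class2 = np.array([1, 5, 10, 13, 20, 14, 5, 2, 0])  # n2 = 70

    # Cumulative proportions F(x) and G(x), and the K-S distance D
    f = np.cumsum(class1) / class1.sum()
    g = np.cumsum(class2) / class2.sum()
    d = np.max(np.abs(f - g))

    n1, n2 = class1.sum(), class2.sum()
    critical = 1.36 * np.sqrt((n1 + n2) / (n1 * n2))  # 5% large-sample value
    print(round(d, 2), round(critical, 2))  # D = 0.10 < 0.25, accept H0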

14.13. RANK CORRELATION COEFFICIENT SIGNIFICANCE TEST (DEPENDENT SAMPLES):


In Unit IX the procedure for calculating Spearman's rank correlation coefficient was explained with
examples. In this section, testing the significance of the rank correlation coefficient is explained. After
calculating the rank correlation coefficient (R), compare the modulus of R with the critical value, available
in the tables of critical values of the rank correlation coefficient, corresponding to the sample size and level
of significance. If it is less than the table value, accept H0, i.e. that there is no significant correlation.
Otherwise reject it.
If the sample size is more than 30, the value of R approximately follows a normal distribution with mean
zero and standard error 1/√(n − 1).
That is, Z = R √(n − 1) ~ N(0, 1).

Apply the normal test for testing the significance of the rank correlation coefficient.
Example 14.13: The IQs of 12 pairs of twins are as follows. Based on this, what conclusion can be made
about the equivalence of the twins?

1st birth 86 79 77 68 91 72 77 91 70 71 88 87
2nd birth 88 77 76 64 96 72 65 90 65 80 81 72

Answer: Null Hypothesis H0: The IQs are uncorrelated.
The rank correlation coefficient can be calculated as 0.7357 (the reader may verify this as an activity).
Sample size = 12.
For N = 12 and at the 5% level of significance the table value is 0.506 (table of critical values of
Spearman's correlation coefficient).
Since the calculated value 0.7357 is greater than the table value 0.506, reject the null hypothesis. That is,
there is significant correlation between the IQs of the twins.
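Both the coefficient and the large-sample test can be checked with SciPy; spearmanr uses average ranks for ties, so its value can differ slightly from a hand computation. A minimal sketch (assuming SciPy is installed):

    import math
    from scipy.stats import spearmanr

    first = [86, 79, 77, 68, 91, 72, 77, 91, 70, 71, 88, 87]
    second = [88, 77, 76, 64, 96, 72, 65, 90, 65, 80, 81, 72]

    rho, p = spearmanr(first, second)

    # Large-sample normal test from the text: Z = R * sqrt(n - 1)
    z = rho * math.sqrt(len(first) - 1)
    print(round(rho, 3), round(z, 2))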

14.14. SEVERAL SAMPLE TESTS:


The only several sample test presented in this unit is the Kruskal-Wallis H test, which is for independent
samples.

14.15. THE KRUSKAL-WALLIS TEST (INDEPENDENT SAMPLES):


It is a non-parametric test used for comparing two or more populations. It is the non-parametric
alternative to the analysis of variance. The test assumes that the variable under study is continuously
distributed.
H0, the null hypothesis, is that the probability distributions of all the samples are identical, i.e. all the
populations from which the samples have been drawn are identical.
H1, the alternative hypothesis, is that at least two of the probability distributions differ in location.

Suppose there are k independent samples with sample sizes ni, i = 1, 2, 3, …, k. The total number of
observations in all the samples together is N = Σni. Arrange the k samples in columns. Now consider all the
k samples as a single series and assign ranks to the N observations from 1 to N (the lowest observation gets
rank 1, the next lowest rank 2, etc.). Calculate the sum of the ranks of each sample and denote it by Ti,
i = 1, 2, …, k.
Compute the test statistic H, which is a function of the rank sums of the samples.
Test statistic (when there are no ties in ranks):
H = [12 / (N(N + 1))] Σ (Ti²/ni) − 3(N + 1),
where N is the total number of observations in all groups combined, ni is the number of observations in
each sample, and Ti²/ni is the square of each group's rank total divided by the size of that sample.
The statistic H approximately follows a chi-square distribution with (k − 1) degrees of freedom.
If the calculated value of H is less than the table value of chi-square at a given level of significance accept
the null hypothesis; otherwise reject the null hypothesis.

Note: If there are ties, divide H by the correction factor 1 − ΣK/(N³ − N),
where K = k(k² − 1), k being the number of observations involved in a tie for a given rank.


Example 14.14: The sales of four salesmen during one period are presented below. Using these data, test
whether the efficiency of the four salesmen is equal or not.

Salesman 1 Salesman 2 Salesman 3 Salesman 4


51 44 60 32
54 49 58 36
55 48 53 31
47 61
Answer:
Null Hypothesis H0: There is no difference between the four salesmen's efficiency.

Salesman 1 Salesman 2 Salesman 3 Salesman 4


Number Rank Number Rank Number Rank Number Rank

51 8 44 4 60 13 32 2
54 10 49 7 58 12 36 3
55 11 48 6 53 9 31 1
47 5 61 14
ni 4 3 4 3
Ti 34 17 48 6
Ti²/ni 289 96.33 576 12

Σ Ti²/ni = 973.33

Test statistic H = [12/(14 × 15)] × 973.33 − 3 × 15 = 10.62.

Here the degrees of freedom are 4 − 1 = 3 and the statistic follows chi-square with 3 degrees of freedom.
The table value at the 5% level of significance is 7.81 and the calculated value is more than the table value.
Hence reject the null hypothesis.
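scipy.stats.kruskal automates the ranking and the H statistic. A minimal sketch for Example 14.14 (assuming SciPy is installed):

    from scipy.stats import kruskal

    s1 = [51, 54, 55, 47]
    s2 = [44, 49, 48]
    s3 = [60, 58, 53, 61]
    s4 = [32, 36, 31]

    h, p = kruskal(s1, s2, s3, s4)
    print(round(h, 2), round(p, 4))  # H about 10.62 with 3 degrees of freedom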

Activity 14.1: The carbohydrate content of two banana varieties is as follows:
Variety 1: 41 41 44 44 43 46 53
Variety 2: 47 44 54 50 40 53 50
Test whether there is any significant difference in the carbohydrate content of the two banana varieties.
Activity 14.2: A chemical extraction plant processes sea water to collect sodium chloride and magnesium.
It is known that sea water contains sodium chloride, magnesium and other elements in the ratio
62 : 4 : 34. A sample of 200 tons of sea water has yielded 130 tons of sodium chloride and 6 tons of
magnesium. Are these data consistent with the known composition of sea water at the 5% level?
Activity 14.3: The following data is collected on two categories:

Cinegoers Non-cinegoers
Literate 83 57
Illiterate 45 68

Based on this, can you conclude that there is no relation between the habit of cinema going and literacy?
Activity 14.4: To test the quality of silk, two samples of sizes 18 and 23 at two temperatures were chosen
and the data were collected and presented below:
1st sample: 235, 258, 225, 207, 206, 268, 256, 220, 224, 248, 230, 245, 315, 250, 247, 254, 251, 225.
2nd sample:
250, 229, 206, 220, 226, 255, 258, 243, 222, 227, 205, 228, 249, 237, 206, 214, 236, 225, 243, 239, 239,
211, 232.
Based on these data, test the equality of the quality of silk at the two temperatures using the Median test,
the U statistic and the run test.
Activity 14.5: Use the Kolmogorov-Smirnov test for testing the goodness of fit using the following data:

Expected 5 14 36 71 102 109 85 50 21 7 2


Observed 5 12 43 61 105 103 89 54 19 7 2

Activity 14.6: The operations manager of a company that manufactures tires wants to determine whether
there are any differences in the quality of workmanship among the three daily shifts. She randomly selects
496 tires and carefully inspects them. Each tire is either classified as perfect, satisfactory, or defective,
and the shift that produced it is also recorded. The two categorical variables of interest are: shift and
condition of the tire produced. The data can be summarized by the accompanying two-way table. Do
these data provide sufficient evidence at the 5% significance level to infer that there are differences in
quality among the three shifts?

          Perfect  Satisfactory  Defective  Total

Shift 1     106        124           1       231
Shift 2      67         85           1       153
Shift 3      37         72           3       112
Total       210        281           5       496

Activity 14.7: Test the dependence between Type of Hair and eye color of the individuals using the
following information:

Eye color
Type of Hair Blue Green Brown Black Total
Blonde 20 15 18 14 67
Red 11 4 24 2 41
Brown 9 11 36 18 74
Black 8 17 20 4 49
Total 48 47 98 38 231

Activity 14.8: The birth weights of 56 infant pigs belonging to 8 litters are given below. Test
whether the weights of the infant pigs vary over the litters.

Litters
1 2 3 4 5 6 7 8
2.0 3.5 3.3 3.2 2.6 3.1 2.6 2.5
2.8 2.8 3.6 3.3 2.6 2.9 2.2 2.4
3.3 3.2 2.6 3.2 2.9 3.1 2.2 3.0
3.2 3.5 3.1 2.9 2.0 2.5 2.5 1.5
4.4 2.3 3.2 3.3 2.0 1.2
3.6 2.4 3.3 2.5 2.1 1.2
1.9 2.0 2.9 2.6
3.3 1.6 3.4 2.8
2.8 3.2
1.1 3.2

14.16. SUMMARY:
This unit is on non-parametric tests. Initially the basic advantages and disadvantages of non-parametric
tests over parametric tests were given. The different important non-parametric tests available in the
literature were then presented, with the tests depending on one sample, two samples and more than two
samples treated separately. In the two and several sample cases, dependent and independent samples were
also distinguished. Of the tests mentioned, the following were discussed with examples. Among the one
sample tests, the Kolmogorov–Smirnov one sample test (K–S or KS one sample test), the ordinary one
sample run test (run test), the runs test for randomness and the chi-square test were considered and
presented in detail. Among the two sample tests, the Mann Whitney U test (independent samples), the
Median test (independent samples), the Wald-Wolfowitz runs test (independent samples) and the rank
correlation coefficient significance test (dependent samples), and among the several sample tests the
Kruskal-Wallis test (independent samples), were considered and presented in detail.

Review Questions:
1. What are non-parametric tests? How are they better than parametric tests? Also mention the
disadvantages of these tests compared to parametric tests.
2. Mention the different one sample tests and explain each test procedure, with or without the
help of an example.
3. Name the different two sample non-parametric tests, separately for dependent and independent
samples.
4. Explain the procedures for the Mann Whitney U test, the Median test, the Wald-Wolfowitz runs test
and the rank correlation coefficient significance test.
5. Explain the procedure for testing the comparison among more than two samples by using the
Kruskal-Wallis test.

Further readings:
1. Kothari, C.R., Gaurav Garg, Research Methodology – Methods and Techniques, New Age
International Publishers, New Delhi.
2. Chandan, J.S., Jagjit Singh, Khanna, K.K., Business Statistics, Vikas Publishing House Pvt. Limited,
New Delhi.
3. Gupta, S.C. and Kapoor, V.K., Fundamentals of Mathematical Statistics, Sultan Chand & Sons,
New Delhi.
4. R. Rangaswamy, A Text Book of Agricultural Statistics, New Age International (P) Limited, New
Delhi.
5. Sidney Siegel, Nonparametric Statistics for the Behavioral Sciences, McGraw-Hill Company, Inc.,
London.

UNIT – 15
TIME SERIES ANALYSIS

OBJECTIVES:
After studying this unit one will be able to
• Explain the meaning, applications, uses and importance of time series.
• Explain the components of a time series and their meaning.
• Explain different methods of studying trend.
• Explain different methods of studying seasonal variations.
• Explain how to find cyclical and random variations.

STRUCTURE:
15.1 Introduction
15.2 Applications of Time Series Analysis
15.3 Uses of Time Series
15.4 Importance of Time Series
15.5 Components of Time Series
15.6 Mathematical Models for Time Series
15.7 Measurement of Trend
15.8 Measurement of Seasonal Variations
15.9 Measurement of Cyclical Variations
15.10 Measurement of Random Variations
15.11 Summary
15.12 Review Questions
15.13 Further Readings

15.1 INTRODUCTION:
Forecasting or prediction is an important and essential tool in any decision-making process in business or
related fields. Deciding the future inventory requirement of a particular good for a shop, or forecasting the
future value of shares, requires proper forecasting, which is generally based on past data. Time series
analysis is one of the quantitative methods one can use to determine the pattern in data collected over time.
Once the pattern is identified one can easily predict/estimate future values.
Definition: A time series is a set of values collected at equal time intervals. A time series is a sequence/set
of values of some variable, such as production of steel, per capita income, gross national income, price of
tobacco, etc., obtained at regular periods over time. A time series is a sequence of numerical data/values in
which each item/value is associated with a particular instant in time. It is a collection of quantitative data
that are equally spaced in time and measured successively. It is any group of statistical information
accumulated at regular time intervals. Here the time interval can be a year, quarter, day, hour or even a
minute or second.
An analysis of a single sequence of data is called univariate time-series analysis. An analysis of several
sets of data for the same sequence of time periods is called multivariate time-series analysis or, more
simply, multiple time-series analysis. A time series is analyzed to understand the structure and function of
the data. One can develop a mathematical model to explain the structure of the data and predict/forecast the
unknown variable from the known variable time, where time acts as the independent variable. A proper
study requires a minimum sample of size 50. This is appropriate for short term forecasting.

15.2. APPLICATIONS OF TIME SERIES ANALYSIS:


Time series analysis is applied in the broad fields of statistics, signal processing, pattern recognition,
economics, econometrics, finance, mathematical finance, production and operations, medicine, weather
forecasting, intelligent transport and trajectory forecasting, earthquake prediction, electroencephalography,
control engineering, astronomy, communications engineering, and largely in any domain of applied science
and engineering which involves temporal measurements.

15.3. USES OF TIME SERIES:


• To understand the past behavior of the phenomenon under consideration.
• To describe or characterize the phenomenon and guide execution and policy decisions.
• To evaluate current accomplishments in the evolution of performance.
• To understand and explain variations in one variable caused by variation/change in another
variable.
• To forecast, predict or estimate the future behavior of the phenomenon.
• It helps us to compare the changes in the values of different phenomena at different times or
places.

15.4. IMPORTANCE OF TIME SERIES:


1. It enables us to understand past behavior or performance. If the data have changed over time, one
can find out the probable reasons responsible for such changes.
2. It helps directly in business planning. A firm can know the long term trend in the sales of its product,
which also helps in making projections of its sales for the next few years.
3. It enables one to study movements such as cycles that fluctuate around the trend. A knowledge of the
cyclical pattern in certain series of data will be helpful in making generalizations in the concerned
business.
4. It enables one to make meaningful comparisons in two or more series regarding the rate or type of
growth.

15.5. COMPONENTS OF TIME SERIES:


When the data are arranged according to time, often they show fluctuations from time to time. These
fluctuations are caused by a constantly working composite force. This composite force has four components.
These are called Components of time series. They are Trend, Seasonal Variation, Cyclical Variation and
Irregular or random variations. The changes in the time series may be due to the changes in all or some of
these four components. That is the collective impact of changes that occur in these four components are
responsible for the changes in the time series. To identify the behavior of the time series one must find the
behavior of these components individually. That is, it is necessary to isolate and measure the separate
effects of these components in a given time series. This can be done in the analysis of time series.
15.5.1. Trend: This is the long term variation. Time series data have a tendency of upward or downward
movement in the mean value of the forecast variable; this tendency is known as trend. When the values of
the observations are plotted on graph paper against time, a straight line may describe the increase or
decrease in the time series over the period of time. This straight line is known as the trend line. It shows
the direction of the series over a long period of time. The effect of trend may be due to a growth factor or
a decline factor, but it extends more or less consistently throughout the entire period of time under
consideration. Trend can be short term or long term, and linear or non-linear. The study of trend gives a
general idea about the pattern of behavior of the phenomenon under consideration. This helps in business
forecasting and planning future operations. By isolating trend values from the given time series one can
study the short term and irregular variations.
15.5.2. Seasonal Variations: Seasonal variations are those periodic movements which occur regularly
every year and have their origin in the nature of the year itself. These will recur in regular and periodic
manner over a span of less than a year such as a day, a week, a month, a quarter with high degree of
regularity. These fluctuations involve patterns of change within a year that tends to be repeated from year
to year. Production of agricultural products, demand for wool products, demand for umbrellas, demand
for cotton cloths etc. are the examples of time series which has a component of seasonal variations.
15.5.3. Cyclic Variations: The oscillatory movements in a time series with a period of oscillation of more
than one year are termed cyclical fluctuations. One complete period of oscillation is known as a cycle. The
cyclic movements in a time series are known as the business cycle. It may vary in length, usually more
than a year but less than 5 to 7 years. The movement passes through four phases, from peak (prosperity)
to contraction (recession) to trough (depression) to expansion (recovery and growth). The study of cyclic
variations helps business executives in the formulation of policies aimed at stabilizing business activity.
15.5.4. Irregular Variations: Irregular Variations are also known as random variations. In addition to the
influence of long term and short term forces, every time series is subjected to occasional influences,
which may occur just once, or several times, but without any pattern or regularity. The variations they
produce are called Irregular variations. Floods, earthquakes, lockouts, fires, strikes etc. are the typical
causes of irregular variations.

15.6. MATHEMATICAL MODELS FOR TIME SERIES :
To analyze time series one has to measure the effect of various types of these factors on a series. To study
the effect of one type of factor the other type factor is eliminated from the series. Time series can be
analyzed through two modals namely Additive model and Multiplicative model.
15.6.1. Additive model:
Let Y be the actual value of the time series, obtained by adding the four components at a particular time.
The additive model is given by
Y = T + S + C + I,
where T is the trend component, C is the cyclic component, S is the seasonal component and I is the
irregular component. This model assumes that all the components of a time series are independent of one
another (though practically this is not true in most business and economic time series).
15.6.2. Multiplicative model:
Let Y be the actual value of the time series, obtained by multiplying the four components at a particular
time. Here the effects of the four components on the time series are interdependent. The multiplicative
model is given by
Y = T × C × S × I,
where T is the trend component, C is the cyclic component, S is the seasonal component and I is the
irregular component. Here the components need not be independent. More time series conform to this
model than to the additive model.
Note: Mixed models are also possible.

15.7. MEASUREMENT OF TREND:


The trend can be measured by 1. the graphic method (free hand method, or trend by inspection), 2. the
method of semi-averages, 3. the method of moving averages, and 4. the method of curve fitting by least
squares.
1. Graphic Method: Take the independent variable representing time on the X-axis and the corresponding
dependent variable (the values of the time series) on the Y-axis, using a suitable scale. Plot the time series
values against the corresponding times on the graph paper and join the points. One can easily observe the
up and down movement of the curve. If a smooth freehand curve is drawn passing approximately through
all the points of the curve previously drawn, it represents the general tendency of the time series and
eliminates its other components, namely the seasonal, cyclical and irregular ones. This curve is known as
the trend line.
The following points must be kept in mind in drawing a freehand smooth curve:
1. The curve is smooth.
2. The numbers of points above and below the line or curve are more or less equal.
3. The sum of the vertical deviations of the points above the smoothed line is equal to the sum of the
vertical deviations of the points below the line. In this way the positive deviations cancel the
negative deviations. These deviations are the effects of seasonal, cyclical and irregular variations,
and by this process they are eliminated.
4. The sum of the squares of the vertical deviations from the trend line/curve is a minimum.

Example 15.1: Draw a time series graph for the following data on the exports of a country A in millions
of $ (X) and fit the trend by the graphical method.

Year 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
X 95 143 125 119 100 140 150 157 150 154 189

Answer: Plot X against the years, join the points, and draw a smooth freehand curve through them as
described above; this curve is the trend line. The drawing of the graph is left to the reader.

2. Method of Semi-Averages:
In the method of semi-averages, the whole data or given series is divided into two equal parts (halves).
The arithmetic mean of the values of each part (half) is calculated. The computed means are termed
semi-averages. Each semi-average is paired with the centre of the time period of its part. The two pairs are
then plotted on a graph paper and the points are joined by a straight line to get the trend; the line may be
extended both ways to estimate future or intermediate values. It should be noted that if the data cover an
even number of years, they can easily be divided into two halves. But if the number of years is odd, the
time series is divided into two halves by omitting the value corresponding to the middle year.
The equation of the trend line is Y = a + bX, where X is measured in years from the year of origin, in this
case the middle year of the first part; a is the trend value at the origin and b is the slope of the line. The two
constants can be computed as
a = 2S1/n and b = (S2 − S1)/(n/2)², where S1 = sum of the first part of the time series, S2 = sum of the
second part of the time series, and n = number of time periods in the given time series.
Once the trend equation is obtained, substituting values of the time variable gives the corresponding trend
values.
Example 15. 2: The sales of a firm in lakhs of Rupees for the part 10 tears are given below. Draw trend
curve by semi average curve and also find the mathematical form of the curve by this method.

Year 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
Sales 38 41 45 48 53 56 60 64 68 72

Answer: The sum of the first five observations is 225 and its average is 225/5 = 45.

The sum of the last five observations is 320 and its average is 320/5 = 64.

Plot 45 against 2007, the mid value of the first half of the time period, and plot 64 against 2012, the mid value of the second half. Joining these two points gives the trend line; the drawing of the graph is left to the reader.
The equation of the trend line in the semi-average method is Y = a + bX, with X measured from 2007. Here S1 = sum of the first part of the series = 225, S2 = sum of the second part = 320, and n = 10 time periods, so a = 225/5 = 45 and b = (64 - 45)/5 = 3.8. The equation is Y = 45 + 3.8X.
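The same computation can be sketched in R; this is a minimal illustration for the even-n case, with variable names chosen for this example.

    sales <- c(38, 41, 45, 48, 53, 56, 60, 64, 68, 72)   # sales, 2005-2014
    n <- length(sales)
    a <- mean(sales[1:(n/2)])              # first semi-average: 45, plotted at 2007
    m2 <- mean(sales[(n/2 + 1):n])         # second semi-average: 64, plotted at 2012
    b <- (m2 - a) / (n/2)                  # slope: (64 - 45)/5 = 3.8
    trend <- a + b * ((2005:2014) - 2007)  # trend values with origin at 2007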
3. Method of moving averages:
The method of moving averages measures trend by smoothing the fluctuations of the data. It consists of taking the arithmetic mean of m items for a time span starting from the first value and placing it at the centre of that time span. The procedure is repeated by dropping the first item and adding the figure directly following the last figure previously included, thus moving the time span and its centre forward by one period, and computing a new average. This procedure is continued until the series is exhausted. That is, the first average is the mean of the 1st, 2nd, ..., mth terms; the second is the mean of the 2nd to (m+1)th terms; and so on. These averages are known as m-period moving averages. If m is odd, say 2k+1, each moving average is placed against the middle value of the time interval it covers. If m is even, say 2k, each moving average falls between the two middle values of its time interval; in this case one has to take a 2-point moving average of the m-point moving averages so that the values fall against actual time points. This procedure is called centring. The moving averages so obtained are the trend values of the corresponding years. In this method one cannot find trend values for the k tail-end values on either side of the time series. The method is flexible, in the sense that trend values already computed are not affected by adding further observations. If the period of the moving average is the same as the period of the cycle, cyclical fluctuations are automatically eliminated. A short sketch of the computation in R follows.
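As a minimal sketch, the moving averages can be computed with R's built-in stats::filter() function; the data are those of Example 15.3 below, and the 5-term weights (1, 2, 2, 2, 1)/8 perform the 4-point moving average and its 2-point centring in a single pass.

    x <- c(14, 13, 14, 15, 16, 16, 17, 18)          # sales of Example 15.3
    ma3 <- stats::filter(x, rep(1/3, 3))            # 3-yearly MA; NA at both ends
    ma4c <- stats::filter(x, c(1, 2, 2, 2, 1) / 8)  # centred 4-yearly MA
    ma3    # 13.67 14 15 15.67 16.33 17, placed against 2009-2014
    ma4c   # 14.25 14.875 15.625 16.375, placed against 2010-2013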
Example 15.3: Calculate the trend by the 3-yearly and 4-yearly moving average methods for the following sales (Rs. '000s) of a product.

Year 2008 2009 2010 2011 2012 2013 2014 2015


Sales 14 13 14 15 16 16 17 18
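Answer: The 3-yearly moving averages, placed against 2009 to 2014, are 13.67, 14.00, 15.00, 15.67, 16.33 and 17.00. The 4-yearly moving averages, after 2-point centring, are 14.25, 14.88, 15.63 and 16.38, placed against 2010 to 2013. No trend values are obtained for the tail-end years on either side.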

Example 15.4: Mr Chary & Co. manufactures portable tables; since the start of the company, the numbers of tables sold are as follows.

Year 2007 2008 2009 2010 2011 2012 2013 2014 2015
Number 42 50 61 75 92 111 120 127 140

Find the trend line by the method of least squares for the number of tables sold by the company. Also estimate the sales for 2016, 2017 and 2018.
Answer: Let the trend equation be Y = a + bt. The normal equations, as given in equations (2) and (3) earlier in this unit, are ΣY = na + bΣt and ΣtY = aΣt + bΣt².

Year    Number (Y)   t = year - 2011   t²    tY     Trend values
2007    42           -4                16    -168   39.09
2008    50           -3                9     -150   52.04
2009    61           -2                4     -122   64.99
2010    75           -1                1     -75    77.94
2011    92           0                 0     0      90.89
2012    111          1                 1     111    103.84
2013    120          2                 4     240    116.79
2014    127          3                 9     381    129.74
2015    140          4                 16    560    142.69
Total   ΣY = 818     Σt = 0            Σt² = 60     ΣtY = 777

Substituting the values in the normal equations (with Σt = 0): a = ΣY/n = 818/9 = 90.89 and b = ΣtY/Σt² = 777/60 = 12.95.

The trend equation is Y = 90.89 + 12.95t.

Substituting the different values of t from the third column of the table, one obtains the trend values presented in its last column. For the required estimates, 2016 (t = 5) gives 90.89 + 12.95 x 5 = 155.64, 2017 (t = 6) gives 168.59, and 2018 (t = 7) gives 181.54.
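As a quick cross-check, the same fit can be obtained in R with lm(); the variable names below are illustrative, and t is centred on 2011 as in the table.

    y <- c(42, 50, 61, 75, 92, 111, 120, 127, 140)   # tables sold, 2007-2015
    t <- -4:4                                        # year - 2011
    fit <- lm(y ~ t)                                 # least-squares trend line
    coef(fit)                                        # intercept 90.89, slope 12.95
    predict(fit, newdata = data.frame(t = 5:7))      # 2016-2018: 155.64, 168.59, 181.54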
15.8. MEASUREMENT OF SEASONAL VARIATIONS:
The following are different methods of measuring seasonal variations: 1. Method of simple averages, 2. Ratio to moving average method, 3. Ratio to trend method and 4. Link relative method. Each of these methods yields a seasonal index as the measure of seasonal variation.
1. Method of simple averages:
This is the simplest of all the methods of measuring seasonal variations. The procedure for finding the seasonal index is presented in the following steps.
i. Arrange the data by years and quarters/months, whichever is applicable based on the type of data provided.

ii. Compute the average Ai of the ith quarter/month over all the years (i = 1, 2, 3, 4 for quarters and i = 1, 2, ..., 12 for months).
iii. Compute the average A of the quarterly/monthly averages calculated in step (ii), i.e. A = (A1 + A2 + ... + A12)/12 for monthly data and A = (A1 + A2 + A3 + A4)/4 for quarterly data.
iv. Seasonal indices are obtained by expressing the monthly/quarterly averages as percentages of A. That is,
Seasonal Index of the ith month/quarter = (Ai/A) x 100 (i = 1, 2, ..., 4 for quarterly data and i = 1, 2, ..., 12 for monthly data).
Note: The total of all the indices is 1200 for monthly data and 400 for quarterly data.
Note: This method is based on the assumption that the data contain no trend and no cyclic variations. Hence, though it is simple, it does not have much practical utility, as the assumption of the absence of trend and cyclical variations is generally not true.
Example 15.5: Prepare a monthly seasonal index from the following data on the sales (in '000s of rupees) of firm 'A' by the method of simple averages.

Month Jan Feb Mar Apr May June July Aug. Sep. Oct. Nov. Dec.
Year
2012 92 90 92 94 96 98 98 108 106 112 102 88
2013 90 92 92 96 98 102 104 108 104 110 106 90
2014 92 90 96 94 102 104 110 116 120 108 106 92
2015 94 96 100 96 104 100 98 108 110 114 98 86

Answer:

Month Jan Feb Mar Apr May June July Aug. Sep. Oct. Nov. Dec.
Year
2012 92 90 92 94 96 98 98 108 106 112 102 88
2013 90 92 92 96 98 102 104 108 104 110 106 90
2014 92 90 96 94 102 104 110 116 120 108 106 92
2015 94 96 100 96 104 100 98 108 110 114 98 86
Total 368 368 380 380 400 404 410 440 440 444 412 356
Average 92 92 95 95 100 101 102.5 110 110 111 103 89

S.I 91.96 91.96 94.96 94.96 99.96 100.96 102.44 109.96 109.96 110.96 102.96 88.96

Sum of all averages =1200.5


Since sum must be equal to 1200, the correction factor is 1200/1200.5 and Seasonal index can be obtained
by multiplying averages by the correction factor.

Seasonal Index (S.I.) of January = 92 x 1200/1200.5 = 91.96. Similarly, the S.I. of the other months can be calculated; they are presented in the last row of the table above. A short R sketch of the computation follows.
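A minimal sketch of the same computation in R is given below; the matrix holds the data of Example 15.5, with one row per year.

    sales <- matrix(c(92, 90, 92, 94, 96, 98, 98, 108, 106, 112, 102, 88,
                      90, 92, 92, 96, 98, 102, 104, 108, 104, 110, 106, 90,
                      92, 90, 96, 94, 102, 104, 110, 116, 120, 108, 106, 92,
                      94, 96, 100, 96, 104, 100, 98, 108, 110, 114, 98, 86),
                    nrow = 4, byrow = TRUE)   # rows: 2012-2015; columns: Jan-Dec
    m_avg <- colMeans(sales)                  # monthly averages; they sum to 1200.5
    si <- m_avg * 1200 / sum(m_avg)           # rescale so the 12 indices total 1200
    round(si, 2)                              # January comes to 91.96, as above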
2. Ratio to moving average method:
This is the most commonly used method of computing the seasonal index. The following steps are involved in calculating the seasonal indices (assuming the multiplicative model).
i. Compute the 12-point moving average for monthly data (the 4-point moving average for quarterly data) and centre it.
ii. Divide each original value (ignoring six values on either side for monthly data and two values on either side for quarterly data) by the corresponding moving average and express the result as a percentage.
iii. Arrange these percentage values month-wise (quarter-wise) and find the mean values over the years.
iv. Find the mean of these means, and to get the seasonal indices express the means as percentages of their mean of means.
v. The total of all the indices is 1200 in the case of monthly data and 400 in the case of quarterly data.
Note: If the additive model is assumed, then after obtaining the centred moving averages, subtract them from the corresponding original values instead of dividing. Arrange the differences month/quarter wise and find their means; then find the mean of these means and subtract it from each of the means of the differences. This gives the adjusted seasonal indices, and the sum of those indices will be zero. A short R sketch of the multiplicative steps is given below.
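A minimal sketch in R of the multiplicative steps, using the data of Example 15.6 below; cycle() extracts the quarter number of each observation.

    x <- ts(c(2, 3, 2, 4, 5, 7, 6, 8, 6, 9, 9, 10), frequency = 4, start = 2013)
    cma <- stats::filter(x, c(1, 2, 2, 2, 1) / 8)          # centred 4-point MA
    ratio <- 100 * x / cma                                 # percentages of the MA
    q_avg <- tapply(ratio, cycle(x), mean, na.rm = TRUE)   # mean percentage per quarter
    si <- q_avg * 400 / sum(q_avg)                         # indices adjusted to total 400
    round(si, 1)                                           # about 92.4, 116.8, 80.0, 110.8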
Example 15.6: Calculate the seasonal index by the ratio to moving average method, using the multiplicative model, for the data given below on the price (Rs. '00s) of a commodity.

Quarters Quarter-I Quarter-II Quarter-III Quarter-IV


Year
2013 2 3 2 4
2014 5 7 6 8
2015 6 9 9 10

Answer:
Compute the 4-point moving averages and centre them to get the moving averages for the periods given in the problem (except the first two and the last two periods). (The reader is asked to do this as explained in the earlier sections.) The computed values are presented in the following table and the further calculations are given below.

Quarterly trend percentages {(Original value / Centred moving average) x 100}

Years           I        II        III      IV        Total
2013            -        -         64.0     100.0     164.0
2014            100.0    116.7     90.6     114.3     421.6
2015            78.7     109.1     -        -         187.8
Qtr. totals     178.7    225.8     154.6    214.3     773.4
Qtr. averages   89.35    112.90    77.30    107.15    386.70
Adjusted        89.35 x 400/386.70   112.90 x 400/386.70   77.30 x 400/386.70   107.15 x 400/386.70   400.00
indices         = 92.42              = 116.78              = 79.96              = 110.84

(The centred 4-point moving averages, from Q3 2013 to Q2 2015, are 3.125, 4.0, 5.0, 6.0, 6.625, 7.0, 7.625 and 8.25.)

3. Ratio to trend method:


This method has the advantage of using the entire data, unlike the ratio to moving average method, where data are lost at the tail ends. The following are the principal steps involved in calculating the seasonal index by the ratio to trend method (a short R sketch follows the steps).
I. Compute the trend for each month/quarter by the method of least squares. Sometimes the computation can be simplified by fitting the trend line to the yearly totals (or averages) and then obtaining the monthly or quarterly trend values by suitable modification of the trend equation obtained for the yearly data.
II. Express each original value as a percentage of the corresponding trend figure.
III. Average those percentages over the years for each month/quarter.
IV. Find the average of the averages and calculate the seasonal indices by expressing the means as percentages of their overall mean. The sum of the indices will be 400 for quarterly data and 1200 for monthly data.
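A compact sketch of the idea in R follows. For brevity it fits the least-squares trend directly to the quarterly series rather than interpolating from the yearly figures, which is a common simplification; the illustrative values are the 2011-2013 prices from Activity 4 at the end of this unit.

    sales <- c(45, 54, 72, 60, 48, 56, 63, 56, 49, 63, 70, 65)   # quarterly, 2011-2013
    t <- seq_along(sales)
    trend <- fitted(lm(sales ~ t))    # least-squares trend value for every quarter
    ratio <- 100 * sales / trend      # trend-eliminated percentages
    qtr <- rep(1:4, times = 3)        # quarter labels
    avg <- tapply(ratio, qtr, mean)   # average percentage per quarter
    si <- avg * 400 / sum(avg)        # seasonal indices totalling 400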
Example 15.7: Find the seasonal index for the following data on production (in '0000 units) using the ratio to trend method.

Month Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec
2013 12 13 10 14 15 15 16 13 11 10 12 15
2014 15 14 13 16 14 15 17 12 13 12 13 14
2015 17 16 15 17 16 18 17 14 11 11 12 16

Solution:
Determination of the trend line and trend values. (The trend equation is fitted to the yearly data to avoid the influence of seasonal variations.)

Year (t)   Yearly total   Monthly average (Y)   X = t - 2014   XY     X²   Trend values T = 14 + X
2013       156            13                    -1             -13    1    13
2014       168            14                    0              0      0    14
2015       180            15                    1              15     1    15
Total      (N = 3)        ΣY = 42               ΣX = 0         ΣXY = 2     ΣX² = 2

By the method of least squares, the trend line for the three years is Yc = a + bX, where a = ΣY/N = 42/3 = 14 and b = ΣXY/ΣX² = 2/2 = 1.

Therefore, the trend equation is Yc = 14 + X.


The monthly increment is b/12 = 1/12 ≈ 0.08.
Calculation of monthly trend values: For the first year, 2013, the trend value is 13, and this value corresponds to the middle of the year, i.e. 30th June/1st July 2013. The monthly increment is 1/12 ≈ 0.08, and for half a month it is (1/2)(1/12) = 1/24 ≈ 0.04. Therefore the trend for the month of June is 13 - 0.04 = 12.96, for July it is 12.96 + 0.08 = 13.04, for August 13.04 + 0.08 = 13.12, and so on; going backwards, the trend for May is 12.96 - 0.08 = 12.88, for April 12.88 - 0.08 = 12.80, etc. Similarly, the trends for the months of 2014 and 2015 can be computed.
The monthly trend values are presented below:

Year   Jan     Feb     Mar     Apr     May     Jun     Jul     Aug     Sep     Oct     Nov     Dec
2013   12.56   12.64   12.72   12.80   12.88   12.96   13.04   13.12   13.20   13.28   13.36   13.44
2014   13.56   13.64   13.72   13.80   13.88   13.96   14.04   14.12   14.20   14.28   14.36   14.44
2015   14.56   14.64   14.72   14.80   14.88   14.96   15.04   15.12   15.20   15.28   15.36   15.44

Now express the given values as percentages of the corresponding trend values, i.e. (Given value / Trend value) x 100. This gives the ratio to trend values, or trend-eliminated values.
As an example, for January 2013 the trend-eliminated value is (12/12.56) x 100 = 95.54. The other values are calculated similarly; the ratio to trend (trend-eliminated) values are tabulated in the following table.

Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2013 95.54 102.85 78.62 109.38 116.46 115.74 122.70 99.09 83.33 75.30 89.82 111.61
2014 110.62 102.64 94.75 115.94 100.86 107.45 121.08 84.99 91.55 84.03 90.53 96.96
2015 116.76 109.29 101.90 114.82 107.53 120.32 113.03 92.59 72.37 71.99 78.12 103.63
Average 107.64 104.93 91.76 113.38 108.28 114.50 118.94 92.22 82.42 77.11 86.16 104.07
Adj. S.I. 107.51 104.82 91.66 113.26 108.16 114.37 118.72 92.11 82.33 77.03 86.07 103.96

The sum of the averages given in the fifth row is 1201.41. To make the sum 1200, multiply each monthly average by the correction factor K = 1200/1201.41. This gives the adjusted final seasonal indices, presented in the last row of the table above.
4. Link relative method:
This is also known as Pearson's method of constructing the seasonal index. It is based on averaging the link relatives. The link relative of any season (month/quarter) is its value expressed as a percentage of the preceding season's value. That is,

Link Relative = (Current season's value / Preceding season's value) x 100.

The following steps are involved in constructing the seasonal index by the link relative method (a short R sketch follows):
I. Convert the original data into link relatives using the formula above (except for the first season).
II. Find the average of the link relatives over the years for each season (the mean or the median may be used).
III. Convert the average link relatives into chain relatives, taking the chain relative of the first season as 100 (initially):
Chain relative of a season = (Average link relative of that season x Chain relative of the preceding season)/100.
IV. Now construct a new chain relative for the first season using the last season's chain relative as base:
New chain relative of the first season = (Average link relative of the first season x Chain relative of the last season)/100.
This new chain relative yields the correction factor to be subtracted from the chain relatives of the second season onwards.
V. The adjustment factor is d = (New chain relative of the first season - 100)/k, where k is the number of seasons in a year (k = 4 for quarterly data and k = 12 for monthly data).
VI. To correct the chain relatives, subtract d, 2d, 3d, ... from the chain relatives of the second season onwards, keeping the first one as 100.
VII. To get the seasonal indices, express these corrected chain relatives as percentages of their average.
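A minimal sketch of these steps in R, using the quarterly data of Example 15.8 below:

    x <- matrix(c(60, 62, 61, 63,
                  65, 58, 56, 61,
                  68, 63, 63, 67,
                  70, 59, 56, 62,
                  60, 55, 51, 58), ncol = 4, byrow = TRUE)   # rows: 2011-2015
    v <- as.vector(t(x))                          # the series in time order
    lr <- c(NA, 100 * v[-1] / v[-length(v)])      # link relatives
    lr_q <- matrix(lr, ncol = 4, byrow = TRUE)    # back to year-by-quarter layout
    avg <- colMeans(lr_q, na.rm = TRUE)           # average link relative per quarter
    cr <- c(100, rep(NA, 3))                      # chain relatives, first quarter = 100
    for (i in 2:4) cr[i] <- cr[i - 1] * avg[i] / 100
    d <- (cr[4] * avg[1] / 100 - 100) / 4         # adjustment factor from new Q1 CR
    adj <- cr - d * (0:3)                         # corrected chain relatives
    si <- 100 * adj / mean(adj)                   # indices; about 106, 98, 94, 102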

Example 15.8:
Calculate the seasonal Index for the following data by link relative method.
Year   Production in '000 tonnes
       1st Quarter   2nd Quarter   3rd Quarter   4th Quarter
2011 60 62 61 63
2012 65 58 56 61
2013 68 63 63 67
2014 70 59 56 62
2015 60 55 51 58

Solution: Calculation of the link relatives, chain relatives and seasonal index:

Year      1st Quarter             2nd Quarter             3rd Quarter             4th Quarter
2011      -                       (62/60) x 100 = 103.3   (61/62) x 100 = 98.4    (63/61) x 100 = 103.3
2012      (65/63) x 100 = 103.2   (58/65) x 100 = 89.2    (56/58) x 100 = 96.6    (61/56) x 100 = 108.9
2013      (68/61) x 100 = 111.5   (63/68) x 100 = 92.6    (63/63) x 100 = 100.0   (67/63) x 100 = 106.3
2014      (70/67) x 100 = 104.5   (59/70) x 100 = 84.3    (56/59) x 100 = 94.9    (62/56) x 100 = 110.7
2015      (60/62) x 100 = 96.8    (55/60) x 100 = 91.7    (51/55) x 100 = 92.7    (58/51) x 100 = 113.7
Total     416.0                   461.1                   482.6                   542.9
Average   104.0                   92.2                    96.5                    108.6
Chain relative (CR)   100   100 x 92.2/100 = 92.2   92.2 x 96.5/100 = 88.97   88.97 x 108.6/100 = 96.62
Adjusted CR           100   92.2 - 0.12 = 92.08     88.97 - 0.24 = 88.73      96.62 - 0.36 = 96.26
Seasonal Index   100 x 400/377.07 = 106.1   92.08 x 400/377.07 = 97.67   88.73 x 400/377.07 = 94.12   96.26 x 400/377.07 = 102.11

Adjustment of chain relatives (CR): CR of the 1st quarter (on its own base) = 100.
CR of the 1st quarter (based on the last quarter) = 96.62 x 104.0/100 = 100.48.
The difference is due to the presence of trend, and the adjustment factor is d = (100.48 - 100)/4 = 0.12.
15.9. MEASUREMENT OF CYCLICAL VARIATIONS:
Different methods, such as the residual method, the reference cycle analysis method and the harmonic analysis method, are found in the literature. However, the residual method is the most commonly used method of estimating cyclical variations; hence only that method is discussed in this unit.
Residual method: This is a crude method, and it assumes the multiplicative model:
Original values = T x S x C x I.
The following are the steps in obtaining the cyclical variations.

1. Calculate the trend values T by the moving average method, and the seasonal variations S, preferably by the ratio to moving average method.
2. Divide the original values by T x S.
3. The resulting values give C, the cyclical variations.
The use of a moving average of suitable period smoothens out the random component (the irregular fluctuations I). A short R sketch of these steps follows.
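A minimal sketch in R under the multiplicative model; the trend, seasonal and observed values here are synthetic stand-ins, since the method presumes that T and S have already been estimated as in steps 1 and 2.

    trend <- 100 + (1:24)                          # stand-in trend values (step 1)
    seasonal <- rep(c(0.9, 1.1, 1.2, 0.8), 6)      # stand-in seasonal ratios (step 1)
    set.seed(1)
    y <- trend * seasonal * runif(24, 0.95, 1.05)  # synthetic observed series
    ci <- y / (trend * seasonal)                   # step 2: what remains is C x I
    c_hat <- stats::filter(ci, rep(1/3, 3))        # a short MA irons out I, leaving C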
15.10. MEASUREMENT OF IRREGULAR VARIATIONS:
Irregular variations are those which are left after the trend, seasonal and cyclical variations have been eliminated. Thus, if the original data are divided by T, S and C, one gets I, the irregular variations. Assuming the multiplicative model,
Original values / (T x S x C) = (T x S x C x I) / (T x S x C) = I.
In practice, however, the cycle itself is so erratic and so intermingled with irregular movements that it is impossible to separate the two. Therefore, trend and seasonal variations are measured directly, and the cyclical and random variations are left together after the other two elements have been removed.
Activity 1:
From the following data, fit a free-hand smooth curve and forecast the trend values from it.

Year                 1992  1993  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004
Production in tonnes 65    80    100   70    80    110   115   130   90    100   150   160   120

Activity 2:
Fit a trend line to the following data by the method of semi-averages:

Year 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
Bank
clearances 53 79 76 66 69 94 105 87 79 104 97 92 101

Activity 3: For the problem given in activity 2, find trend values by the method of moving averages
(3- year moving average and 5-year moving average) and by method of least squares.

Activity 4: The data on prices (Rs. per kg) of a certain commodity during 2011 to 2015 are shown
below:

Quarter Years
2011 2012 2013 2014 2015
I 45 48 49 52 60
II 54 56 63 65 70
III 72 63 70 75 84
IV 60 56 65 72 66

Compute the seasonal index by the method of simple averages, the ratio to moving average method and the ratio to trend method.
Activity 5: Apply the method of link relatives to the following data and calculate seasonal indices.

Quarter 2011 2012 2013 2014 2015


I 6.0 5.4 6.8 7.2 6.6
II 6.5 7.9 6.5 5.8 7.3
III 7.8 8.4 9.3 7.5 8.0
IV 8.7 7.3 6.4 8.5 7.1

15.11. SUMMARY: A brief introduction to the meaning of a time series was given in this unit, along with the applications, importance and uses of time series. The different components of a time series were explained. Different methods of studying trend, seasonal variations and cyclical variations were also explained with suitable examples, and a mention of irregular variations was made.
Review Questions:
1. Define Time series. Explain different components of Time series.
2. Explain different methods of isolating trend from the Time series data.
3. Explain different methods of obtaining the seasonal index from the given time series.
4. What are the uses and importance of Time series?
5. What is measured by a moving average? Why and when are 4-point and 12-point moving averages used to develop a seasonal index?
Further readings:
1. Gupta S C and Kapoor VK, Fundamentals of Applied Statistics, Sultan Chand & Sons, Delhi.
2. Richard I Levin, David S Rubin, Statistics for Management, Prentice-Hall of India Private Limited, New Delhi.
3. Chandan J C, Jagjit Singh, Khanna K K, Business Statistics, Vikas Publishing House Pvt. Ltd., New Delhi.
4. J K Sharma, Business Statistics, Dorling Kindersley (India) Pvt. Ltd., New Delhi.

UNIT – 16
USE OF STATISTICAL PACKAGES: A BRIEF INTRODUCTION,
USAGE AND EXPLANATION ABOUT STATISTICAL PACKAGES LIKE
EXCEL, SPSS, SAS, R.

OBJECTIVES:
After studying this Unit, the reader should be able to
1. Explain the need for different statistical packages.
2. Explain the different statistical packages available for analyzing the data
3. Explain the advancements of different packages over time
4. Explain the basic information about the usage of different software packages in analyzing the data.

STRUCTURE:
16.1 Introduction
16.2 Excel
16.3 SPSS
16.4 SAS
16.5 R
16.6 Summary.
16.7 Review Questions
16.8 Further Readings

16.1. INTRODUCTION:
In the previous units, different statistical techniques based on univariate, bivariate and multivariate analysis were discussed, and different statistical tests, both parametric and non-parametric, were presented. In all those cases, the algorithms required to calculate the results were discussed. In the present-day scenario, however, many statistical packages are available to take over the calculation part. Once the researcher has an idea of the statistical constant to be calculated and the statistical technique to be used for obtaining the required inference, given the hypotheses, objectives and data availability, the calculation work becomes easy with the help of the available statistical packages. What statistical constant should be calculated and what test should be used must still be identified by the researcher; no computer package will provide information on such things. Similarly, after obtaining the constant/statistic values, the interpretation should also be done by the researcher alone. Of course, some statistical packages may provide examples which can be used as guidelines for the interpretation of results. The main point is that these statistical packages help in the process of calculation even in the case of big data sets (which would otherwise be impossible to handle manually). There are different statistical packages available in the market: some are on a cost basis, some are free of cost; some are of limited use, some are of wider use. Some of the commonly used statistical packages are Excel, SPSS, SAS and R.
These packages are largely self-explanatory. However, some guidance is needed to learn them, and continuous practice with the applications is needed for better, more accurate and quicker calculations.

16.2. EXCEL:
Excel forms part of Microsoft Office. Microsoft Excel is a spreadsheet developed by Microsoft for
Windows, macOS, Android and iOS. It consists of rows and columns where the intersection between row
and column is known as cell or field. Each cell is referenced by its cell address which is denoted by
column letter and row number. A cell can hold either textual data, which could be alphabets, numbers and/or special characters, or a formula. A formula could be a mathematical expression or a predefined function.
MS Excel is a widely used application for performing any mathematical and statistical calculation or analysis. It is very popular for analyzing small and medium scale databases and is a widely used office package among business organizations.
It features calculation, graphing tools, pivot tables, and a macro programming language called Visual Basic for Applications. It has been a very widely applied spreadsheet for these platforms, especially since version 5 in 1993, and it has replaced Lotus 1-2-3 as the industry standard for spreadsheets. It has a battery of supplied functions to answer statistical, engineering and financial needs.
The functions of Excel can be categorized, based on the needs of the user, as Financial, Logical, Text, Date & Time, Lookup & Reference, Math & Trig, Statistical, Engineering, Cube and Information functions. Excel allows sectioning of data to view its dependencies on various factors from different perspectives (using pivot tables and the Scenario Manager). It has a programming aspect, Visual Basic for Applications, allowing the user to employ a wide variety of numerical methods. It also has a variety of interactive features allowing user interfaces that can completely hide the spreadsheet from the user, so that the spreadsheet presents itself as a so-called application, or decision support system (DSS), via a custom-designed user interface: for example, a stock analyzer, or, in general, a design tool that asks the user questions and provides answers and reports. In a more elaborate realization, an Excel application can automatically poll external databases and measuring instruments using an update schedule, analyze the results, make a Word report or PowerPoint slide show, and e-mail these presentations on a regular basis to a list of participants. Excel will not, however, serve as a database.

208
The functions can be broadly classified as text functions, logical functions, information functions, date and time functions, math functions, statistical functions, and lookup and reference functions. Statistical procedures such as ANOVA, correlation, regression, descriptive statistics, histograms, random number generation, sampling, the F-test, t-test, Z-test, etc. are also available through the Analysis ToolPak, which is available as an Add-in in Excel.
Further information on these functions can be had from any book relating to Excel or from the package's built-in help.
16.2.1. Charts In EXCEL
These features include the drawing of line graphs, histograms and charts; a limited range of three-dimensional graphs can also be drawn in Excel. Charts or graphs are visual representations of data that give a clear view of the data. There are different types of charts available in Excel.

• Column - A column chart shows data changes over a period of time or illustrates comparisons among items.
• Bar - A bar chart illustrates comparisons among individual items.
• Pie - A pie chart shows the size of items that make up a data series, proportional to the sum of the items. It always shows only one data series and is useful when you want to emphasize a significant element in the data.
• Line - A line chart shows trends in data at equal intervals.
• Area - An area chart emphasizes the magnitude of change over time.
• XY Scatter - An xy (scatter) chart shows the relationships among the numeric values in several data series, or plots two groups of numbers as one series of xy coordinates.
• Stock - This chart type is most often used for stock price data, but can also be used for scientific data (for example, to indicate temperature changes).
• Surface - A surface chart is useful when you want to find the optimum combinations between two sets of data. As in a topographic map, colors and patterns indicate areas that are in the same range of values.
• Doughnut - Like a pie chart, a doughnut chart shows the relationship of parts to a whole; however, it can contain more than one data series.
• Bubble - Data that is arranged in columns on a worksheet, so that x values are listed in the first column and corresponding y values and bubble size values are listed in adjacent columns, can be plotted in a bubble chart.
• Radar - A radar chart compares the aggregate values of a number of data series.
16.2.2. OTHER FEATURES OF EXCEL
• Sort: Data can be sorted either in ascending or descending order. The data can be sorted on single as well as multiple columns, irrespective of the data type.
• Filter: Filter is used to show only those records from the dataset which satisfy the given criteria.
• Conditional Formatting: Based on given criteria, one can highlight cells with a certain color using the conditional formatting feature in Excel.
• Pivot Tables and Charts: Pivot tables and charts are among the most powerful features of Excel. They help in performing cross tabulation or multidimensional analysis of the data. One can get useful insights from a large dataset using pivot tables and charts.
• What-If Analysis: What-If Analysis in Excel helps in decision making by showing different scenarios (outputs) for different alternatives (inputs). By understanding how a change in inputs affects the output, one can make better decisions. Based on the kind of problem being solved and the number of input changes, one can choose from the three what-if analysis tools: Data Table, Goal Seek and Scenario Manager.
• Solver: Solver is an Excel add-in that uses techniques from operations research to find optimal solutions for all kinds of decision problems.

16.3. SPSS:
The software was released in its first version in 1968 as the Statistical Package for the Social Sciences (SPSS), after being developed by Norman H. Nie, Dale H. Bent and C. Hadlai Hull. Though the market originally intended was the social sciences, the software is now popular in other fields as well, including the health sciences, marketing, survey companies, government, education research, data mining and others.
Early versions of SPSS Statistics were designed for batch processing on mainframes. From version 10
(SPSS-X) in 1983, data files could contain multiple record types.
Prior to SPSS 16.0, different versions of SPSS were available for Windows, Mac OS X and Unix. The Windows version was updated more frequently and had more features than the versions for other operating systems. SPSS Statistics versions 16.0 and later run under Windows, Mac and Linux.

The graphical user interface is written in Java. The Mac OS version is provided as a Universal binary, making it fully compatible with both PowerPC and Intel-based Mac hardware. SPSS Inc. announced on July 28, 2009 that it was being acquired by IBM, and because of a dispute about ownership of the name "SPSS", the product was referred to as PASW (Predictive Analytics SoftWare) between 2009 and 2010. As of January 2010, it became "SPSS: An IBM Company". Complete transfer of the business to IBM was done by October 1, 2010; by that date, SPSS: An IBM Company ceased to exist. IBM SPSS is now fully integrated into the IBM Corporation and is one of the brands under IBM Software Group's Business Analytics Portfolio, together with IBM Algorithmics, IBM Cognos and IBM OpenPages. The current versions (2015) are officially named IBM SPSS Statistics. Companion products in the same family are used for survey authoring and deployment (IBM SPSS Data Collection), data mining (IBM SPSS Modeler), text analytics, and collaboration and deployment (batch and automated scoring services). For the time being, the term SPSS Statistics is used in place of IBM SPSS Statistics for convenience.

SPSS Statistics places constraints on internal file structure, data types, data processing and the matching of files, which together considerably simplify programming. SPSS datasets have a two-dimensional table structure, where the rows typically represent cases (such as individuals or households) and the columns represent measurements (such as age, sex or household income). Only two data types are defined: numeric and text (or "string"). All data processing occurs sequentially, case by case, through the file (dataset). Files can be matched one-to-one and one-to-many, but not many-to-many. In addition to the cases-by-variables structure and processing, there is a separate Matrix session where one can process data as matrices using matrix and linear algebra operations.

SPSS is a comprehensive system for analyzing data and can take data from almost any type of file. That is, SPSS Statistics can read and write data from ASCII text files (including hierarchical files), other statistics packages, spreadsheets and databases. SPSS Statistics can also read and write to external relational database tables via ODBC and SQL.

The graphical user interface has two views, which can be toggled by clicking on one of the two tabs at the bottom left of the SPSS Statistics window. The 'Data View' shows a spreadsheet view of the cases (rows) and variables (columns). Unlike spreadsheets, the data cells can only contain numbers or text; formulas cannot be stored in these cells. The 'Variable View' displays the metadata dictionary, where each row represents a variable and shows the variable name, variable label, value label(s), print width, measurement type and a variety of other characteristics. Cells in both views can be manually edited, defining the file structure and allowing data entry without using command syntax. This may be sufficient for small datasets. Larger datasets, such as statistical surveys, are more often created in data entry software, or entered during computer-assisted personal interviewing, by scanning and using optical character recognition and optical mark recognition software, or by direct capture from online questionnaires. These datasets are then read into SPSS. The variables in the dataset are defined in the Variable View, where the name of each variable, its data type and its description in terms of size, labels, valid values and so on are specified.

The Data View is the place where one can enter data for the different variables, with each row as one data sample. This view also contains the menus for performing different analysis and visualization techniques on the data.

16.3.1. ANALYZE MENU IN SPSS


After reading the data, SPSS Statistics can apply different statistical techniques to the data set, producing tabulated reports, charts, plots of distributions and trends, descriptive statistics, and complex statistical analyses. The statistics included in the base software are: descriptive statistics, including cross tabulation, frequencies, descriptive statistics and descriptive ratio statistics; bivariate statistics, including means, the t-test, ANOVA, correlation (bivariate, partial, distances) and nonparametric tests; prediction for numerical outcomes, which includes linear regression; and prediction for identifying groups, which includes factor analysis, cluster analysis (two-step, K-means, hierarchical) and discriminant analysis.

16.3.2. GRAPHS MENU IN SPSS
Data can be visually presented for better understanding by using the Graphs menu in SPSS.

Statistical output goes to a proprietary file format (the *.spv file, supporting pivot tables) for which, in addition to the in-package viewer, a stand-alone reader can be downloaded. The proprietary output can be exported to text or to Microsoft Word, PDF, Excel and other formats. Alternatively, output can be captured as data (using the OMS command), as text, tab-delimited text, PDF, XLS, HTML, XML, an SPSS dataset or a variety of graphic image formats (JPEG, PNG, BMP and EMF). SPSS Statistics Server is a version of SPSS Statistics with a client/server architecture. It had some features not available in the desktop version, such as scoring functions (scoring functions are included in the desktop version from version 19).
16.4. SAS
SAS (Statistical Analysis System) is a software suite developed by SAS Institute for advanced analytics,
multivariate analyses, business intelligence, data management, and predictive analytics.
The development of SAS began in 1966 after North Carolina State University re-hired Anthony Barr to program his analysis of variance and regression software so that it would run on IBM System/360 computers. The project was funded by the National Institutes of Health and was originally intended to analyze agricultural data to improve crop yields. Barr was joined by his student James Goodnight, who developed the software's statistical routines, and the two became project leaders. In 1968, Barr and Goodnight integrated new multiple regression and analysis of variance routines. The first versions of SAS were named after the year in which they were released. In 1971, SAS 71 was published as a limited release. It was used only on IBM mainframes and had the main elements of SAS programming, such as the DATA step and the most common procedures of the PROC step. The following year a full version was released as SAS 72, which introduced the MERGE statement and added features for handling missing data and combining data sets. In 1972, after issuing the first release of SAS, the project lost its funding. According to Goodnight, this was because NIH only wanted to fund projects with medical applications. Goodnight continued teaching at the university for a salary of $1 and access to mainframe computers for use with the project, until it was funded by the University Statisticians of the Southern Experiment Stations the following year. John Sall joined the project in 1973 and contributed to the software's econometrics, time series and matrix algebra. Another early participant, Caroll G. Perkins, contributed to SAS' early programming. Jolayne W. Service and Jane T. Helwig created SAS' first documentation.
In 1976, Barr, Goodnight, Sall, and Helwig removed the project from North Carolina State and incorporated
it into SAS Institute, Inc. SAS was re-designed in SAS 76 with an open architecture that allowed for
compilers and procedures. The INPUT and INFILE statements were improved so they could read most
data formats used by IBM mainframes. Generating reports was also added through the PUT and FILE
statements. The ability to analyze general linear models was also added as was the FORMAT procedure,
which allowed developers to customize the appearance of data.
In 1979, SAS 79 added support for the CMS operating system and introduced the DATASETS procedure.
SAS/GRAPH, which produces graphics, was released in 1980, as well as the SAS/ETS component, which
supports econometric and time series analysis. Three years later, SAS 82 introduced an early macro
language and the APPEND procedure.
SAS version 4 had limited features but made SAS more accessible. Version 5 introduced a complete macro language, array subscripts, and a full-screen interactive user interface called Display Manager. In 1985, SAS was rewritten in the C programming language; this allowed for the SAS Multivendor Architecture, which lets the software run on UNIX, MS-DOS and Windows. It was previously written in PL/I, Fortran and assembly language. JMP was developed by SAS co-founder John Sall and a team of developers to take advantage of the graphical user interface introduced in the 1984 Apple Macintosh, and shipped for the first time in 1989.
A component intended for pharmaceutical users, SAS/PH-Clinical, was released in the 1990s. SAS version 6 was used throughout the 1990s and was available on a wider range of operating systems, including Macintosh, OS/2, Silicon Graphics and Primos. SAS introduced new features through dot-releases. Version 7 introduced the Output Delivery System (ODS) and an improved text editor. ODS was improved upon in successive releases; for example, more output options were added in version 8. The number of operating systems that were supported was reduced to UNIX, Windows and z/OS, and later Linux was added. SAS version 8 and SAS Enterprise Miner were released in 1999.

The Food and Drug Administration standardized on SAS/PH-Clinical for new drug applications in 2002. Vertical products like SAS Financial Management and SAS Human Capital Management (then called CFO Vision and HR Vision respectively) were also introduced. Version 9.0 added custom user interfaces based on the user's role and established the point-and-click interface of SAS Enterprise Guide as the software's primary graphical user interface (GUI). The Customer Relationship Management (CRM) features were improved in 2004 with SAS Interaction Management. SAS data can be published in HTML, PDF, Excel and other formats using the Output Delivery System. The SAS Enterprise Guide is SAS' point-and-click interface; it generates code to manipulate data or perform analysis automatically and does not require SAS programming experience to use.
In 2008, SAS announced Project Unity, designed to integrate data quality, data integration and master data management. A free version was introduced for students in 2010. SAS Social Media Analytics, a tool for social media monitoring, engagement and sentiment analysis, was also released that year, as was SAS Rapid Predictive Modeler (RPM), which creates basic analytical models using Microsoft Excel. JMP 9 in 2010 added a new interface for using the R programming language from JMP and an add-in for Excel. The following year, a High Performance Computing appliance was made available in a partnership with Teradata and EMC Greenplum, and in 2011 the company released Enterprise Miner 7.1. Updated versions of JMP were released continuously after 2002, with the most recent release being from 2012. The company introduced 27 data management products from October 2013 to October 2014, and updates to 160 others. At the 2015 SAS Global Forum, it announced several new products specialized for different industries, as well as new training software.
SAS is a software suite that can mine, alter, manage and retrieve data from a variety of sources and perform statistical analysis on it. SAS provides a graphical point-and-click user interface for non-technical users and more advanced options through the SAS language. To use SAS, data should be in a spreadsheet table format or SAS format. SAS programs have a DATA step, which retrieves and manipulates data, usually creating a SAS data set, and a PROC step, which analyzes the data.
Each step consists of a series of statements. The DATA step has executable statements that result in the software taking an action, and declarative statements that provide instructions to read a data set or alter the data's appearance. The DATA step has two phases, compilation and execution. In the compilation phase, declarative statements are processed and syntax errors are identified. Afterwards, the execution phase processes each executable statement sequentially. Data sets are organized into tables with rows called "observations" and columns called "variables". Additionally, each piece of data has a descriptor and a value.
The PROC step consists of PROC statements that call upon named procedures. Procedures perform analysis and reporting on data sets to produce statistics, analyses and graphics. There are more than 300 procedures, and each one contains a substantial body of programming and statistical work. PROC statements can also display results, sort data or perform other operations. SAS macros are pieces of code or variables that are coded once and referenced to perform repetitive tasks.
16.4.1. Some Important Components:
The SAS software suite has more than 200 components. Some of them are:
• Base SAS – Basic procedures and data management
• SAS/STAT – Statistical analysis
• SAS/GRAPH – Graphics and presentation
• SAS/OR – Operations research
• SAS/ETS – Econometrics and time series analysis
• SAS/IML – Interactive matrix language
• SAS/AF – Applications facility
• SAS/QC – Quality control
• SAS/INSIGHT – Data mining
• SAS/PH – Clinical trial analysis
• Enterprise Miner – Data mining
• Enterprise Guide – GUI-based code editor and project manager
• SAS EBI – Suite of business intelligence applications
• SAS Grid Manager – Manager of SAS grid computing environments
SAS is a leader in business analytics. Through innovative analytics, it caters to business intelligence and data management software and services. SAS transforms data into insight, which can give a fresh perspective on a business.
Unlike other BI tools available in the market, SAS takes an extensive programming approach to data transformation and analysis rather than a pure drag-drop-and-connect approach. That makes it stand out from the crowd, as it gives much finer control over data manipulation. SAS has a very large number of components customized for specific industries and data analysis tasks.
This material is intended for all those readers who want to read and transform raw data to produce insights for business using SAS. It will also be quite useful for readers who would like to become data analysts or data scientists. Before proceeding, one should have a basic understanding of computer programming terminology. A basic understanding of any programming language will help in understanding the SAS programming concepts, and familiarity with SQL will help one learn it very fast.
16.4.2. Comparison to other product:
Alan C. Acock (2005) wrote that SAS programs provide an "extraordinary range of data analysis and data management tasks" but are difficult to use and learn. SPSS and Stata, meanwhile, were both easier to learn (with better documentation) but had less capable analytic abilities, though these could be expanded with paid (in SPSS) or free (in Stata) add-ons. Acock concluded that SAS was best for power users, while occasional users would benefit most from SPSS and Stata. A comparison by the University of California, Los Angeles, gave similar results. Competitors such as Revolution Analytics and Alpine Data Labs advertise their products as considerably cheaper than SAS. In a 2011 comparison, Doug Henschen of InformationWeek found that start-up fees for the three are similar, though he admitted that starting fees are not necessarily the best basis for comparison. SAS' business model is not weighted as heavily on initial fees for its programs, instead focusing on revenue from annual subscription fees.

16.5. R:
R is a language and environment for statistical computing and graphics. It is a GNU project. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. R provides a wide variety of statistical and graphical techniques and is open source software. R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes
• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display either on-screen or on hardcopy, and
• a well-developed, simple and effective programming language which includes conditionals, loops,
user defined recursive functions and input and output facilities.
R is an environment in which statistical techniques are implemented, and it can be extended via packages. About eight packages are supplied with the R distribution, and many more are available through the CRAN family of Internet sites, covering a very wide range of modern statistics. R is the most popular programming language among statisticians. R's expressive syntax allows one to quickly import, clean and analyze data from various data sources.
R also has charting capabilities which help in creating interesting visualizations from any dataset. R is widely used in predictive analytics and machine learning, and it has packages for common machine learning tasks like linear and non-linear regression, decision trees, and linear and non-linear classification. A small illustrative session is sketched below.
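The following is a small illustrative R session; the file name and column names (sales.csv, revenue, adspend) are assumed for the example.

    data <- read.csv("sales.csv")               # import a dataset from a CSV file
    summary(data)                               # descriptive statistics per column
    hist(data$revenue)                          # a quick look at one variable
    fit <- lm(revenue ~ adspend, data = data)   # simple linear regression
    summary(fit)                                # coefficients, R-squared, tests
    plot(data$adspend, data$revenue)            # scatter plot of the two variables
    abline(fit)                                 # add the fitted trend line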

16.6. SUMMARY:
This unit gave a brief introduction to the different statistical packages commonly used by researchers. The packages mentioned and discussed were Excel, SPSS, SAS and R. The actual application of these packages was not discussed in detail; further understanding of their usage can be had from recent editions of any Research Methodology or Statistics books and from the packages' own documentation.
Review Questions:
1. What are the different Statistical Packages in common use in analyzing data?
2. What is EXCEL? Mention different functions available in EXCEL.
3. What are the technical terms used in Excel for drawing graphs?
4. What are the different features of Excel?
5. What is SPSS? Briefly explain.
6. What is SAS? Mention a few important components of SAS.
7. What is R? Mention a few important facilities of the software.

Further readings:
1. McFedries, P., Excel 2013 Formulas and Functions, Pearson Education, New Delhi, 2013.
2. Hart-Davis, G., How to Do Everything with Microsoft Office Excel, Tata McGraw Hill, New Delhi, 2012.
3. Cotton, R., Learning R, O'Reilly India, Mumbai, 2014.
4. Gardener, M., Beginning R, Wiley India, New Delhi, 2013.
5. Kiran Pandya, Smruti Bulsari, Sanjay Sinha, SPSS in Simple Steps, Dreamtech Press, New Delhi, 2011.
6. Naval Bajpai, Business Research Methods, Dorling Kindersley (India) Pvt. Ltd., New Delhi.
7. Naresh K Malhotra and Satyabhushan Das, Marketing Research: An Applied Orientation, Pearson, New Delhi.
8. Richard I Levin and David S Rubin, Statistics for Management, Prentice Hall of India Private Limited, New Delhi.
9. Lora D Delwiche and Susan J Slaughter, The Little SAS Book, SAS Publishing.
10. Electronic information available through Google and other online sources.

UNIT - 17
REPORT WRITING AND PRESENTATION

OBJECTIVES:
After studying this Unit, the reader should be able to
• Explain the meaning, purpose and need for report writing.
• Explain the organization of the report and the important items to be included in it.
• Explain the precautions one should take while writing the report.
• Explain the guidelines for tables and graphs that one should follow while writing the report.
• Explain the meaning of and the procedure for preparing the oral presentation.

STRUCTURE:
17.1 Introduction
17.2 Organization of the written report
17.3. Report writing
17.4. Guidelines for Tables
17.5. Guidelines for graphs
17.6. Oral presentation
17.7 Summary
17.8 Review Questions
17.9 Further Readings

17.1 INTRODUCTION:
It is important for any researcher to present the research material in a systematic and predefined manner
in a such a way that a reader can understand and follow the work done and results obtained in a lucid
fashion. A good research is diluted if he fails to present in a better way. There exists a difficulty in
communicating the researcher’s mind to a common manor person who does not have a solid research
backward. The researcher may be interesting in the techniques involved in conducting the research whereas
the common man or person/ company who sponsored the research work may show interest in theresults
rather than the techniques and procedures involved in the research. Hence a proper balance must be
maintained between the researcher and requirements of the reader / sponsoring agency. In nutshell, the
presentation must contain all the important ingredients of the research in a systematic and predetermined
manner. This presentation is known as report. A Report is a written and/or oral presentation of the research
process, results, recommendations, and/or conclusions to a specific audience / reader.
17.2. ORGANIZATION OF THE RESEARCH REPORT:
After conducting any research, the researcher needs to draft the work done and the results obtained. Drafting needs to follow a systematic process and careful articulation. There are no universal guidelines for preparing the report; they vary depending upon the readers, the type of target group, and the person to whom the report is addressed. However, the following are general guidelines from which the researcher can develop a format for the research report. Most reports include the following elements, which can be broadly classified into three main parts, namely the prefatory (introductory) part, the main body, and the appended part (appendix).
Prefatory Part: This includes
1. Title Page
2. Letter of transmittal
3. Letter of authorization
4. Table of contents
5. List of tables
6. List of graphs
7. List of appendices
8. List of exhibits
9. Executive Summary i). Objectives, ii) Concise statement about the methodology iii) Major findings
iv) Recommendations.
The main body includes
10. Introduction
11. Objectives of the research
12. Research Methodology which includes research design, data collection, questionnaire preparation,
pre-testing, sampling techniques, data analysis, statistical tests if any etc.
13. Results and findings
14. Conclusions and recommendations
15. Limitations of the research.
The appended part includes
16. Questionnaires and forms
17. Statistical output details
18. Bibliography
19. Other required supporting material if any.

The following is a brief explanation of the different elements of the report format mentioned above.
A title page includes the title of the study/report, the name and affiliation of the researcher, and the name of the client for whom the report was prepared. The title should indicate the nature of the project; it must capture the report in a nutshell in one sentence and must be attractive enough to create a good impression. It should be in capital letters and centrally located on the page.
Letter of transmittal is an important component: it is the letter that delivers the report to the client and summarizes the researcher's overall experience with the project, without mentioning the results. It shall also indicate the further action the client should take while implementing the project. A mention of further research may also be included.
Letter of authorization is issued by the client to the researcher before initiating the project/research work. This is a formal letter that authorizes the researcher to conduct the research. It may include the names of the persons to whom the research work was granted, the schedule of time for completion of the project, the budget proposal, etc.
Table of contents contains the list of topics along with the corresponding page numbers. It helps the reader to locate the contents easily. Some reports also include the subtopics, which is quite helpful. The title page, letter of transmittal, letter of authorization, table of contents, list of tables, list of graphs, list of appendices and list of exhibits are generally numbered with small Roman numerals such as (i), (ii) and so on. Arabic numerals are used from the executive summary onwards. Numbering in the table of contents starts from the introduction of the main body.
The executive summary is the most important part of the report: often it is the only part that an executive will read. It should contain the objectives, a brief account of the methodology, the research findings, conclusions and recommendations. It should be prepared after the rest of the report has been written.
In the main body, the introduction section should acquaint the reader with the background circumstances that led to the conduct of the research, that is, the motivation for the research work. Some review of similar research work is also expected in the introduction. The problem definition and the broad approach adopted to address the problem should be included in this chapter.
The research objectives may be included either in the introduction chapter or as a separate chapter. This should include the general problem which the proposed research addresses, the specific objectives of the research, and a brief discussion of the rationale of the specific objectives and their relevance to the problem at hand. The hypotheses constructed in relation to the problem are also to be included.
Research methodology contains a detailed discussion of the research design and of how it was carried out: the sample, sample size, sampling techniques, questionnaire, field work, the way respondents were contacted, the difficulties that occurred and the way they were overcome in getting the completed questionnaires, the test statistics, etc. The plan of data analysis, the justification of the data analysis strategy, and the techniques used in the data analysis should also be included; this can form a separate chapter under a data analysis head.
Results and findings are mainly expected to focus on the outcome of the statistical analyses performed to test the different hypotheses. This part may comprise several chapters, as it is the longest part of the report. The results must be organized in a coherent and logical way; the details should be presented in tables and graphs, with the main findings discussed in the text.
The conclusions are derived from the acceptance or rejection of the hypotheses. The results are in statistical terms; the conclusions are the non-statistical explanation of the results. The conclusions are supported by previous research or existing studies and are made with direct reference to the research objectives.
Recommendations are slightly different from conclusions. Critical thinking and interpretation based on the researcher's own reasoning are the sources of recommendations. A conclusion may happen to be a non-action statement, whereas recommendations are action statements: they guide the research sponsoring agency to take action to solve the problem or to explore an untapped opportunity.
Every researcher tries to conduct flawless research, free from limitations. However, this is only a theoretical ideal; in practice it is difficult. Limitations are caused by sample size, sampling bias, measurement error, time, budget and other organizational constraints. The limitations of the research should not be overemphasized; rather, the aim should be to aid the decision maker in taking the appropriate action. This part must be written with great care and in a balanced way.
The appended part includes information that is of secondary importance to the reader but cannot be placed in the main body of the research report. There is no fixed, universally accepted rule for what items are to be included in this section; it varies from report to report and from researcher to researcher.

17.3. REPORT WRITING:


A research report is a channel for communicating the research findings to the readers of the report. A good research report is one which does this task effectively and efficiently. As such, one must observe several precautions while preparing the report.
While writing any report, one must keep in mind the type of readers who will use it. The report should take into account the readers' technical sophistication and interest in the project, as well as the circumstances under which they will read and use it. Abstract terminology and technical jargon should be avoided; if a technical term must be used, it should be explained briefly then and there. If the audience consists of different levels of understanding, different sections suitable to the different levels may be included in the report; sometimes separate reports may be prepared for the different levels of audience.
The report should be easy to follow and as lucid as possible. It should be structured logically and written
clearly, so that the reader can easily see the inherent connections and linkages in the body of the report.
Headings and subheadings should be used for the different topics and subtopics. Difficult words, slang
and clichés should be avoided. One or more critical readings and reviews by independent persons are
needed before the final report emerges. Appearance, quality of paper, quality of printing or typing, binding,
font sizes, line spacing, variations in type sizes, etc. are also important in preparing the report.
The report should accurately present the methodology, results and conclusions of the project, without
slanting the findings to conform to the expectations of the management or sponsors of the project. Follow
the rule "tell it like it is".
The report should be long enough to cover the subject but short enough to maintain interest. The
presentation should be well thought out and in accordance with the objective of the research project.
Towards the end, the report must state the policy implications relating to the problem under consideration.
Appendices should be provided for all the technical data in the report. A bibliography of the sources
consulted is a must for a good report and should be given at the end. An index should also be prepared
and appended, as it too is an important part of a good report.

The report should be concise, and anything unnecessary must be avoided; important points may be lost
when too much information is included.
It is also important to reinforce key information in the text with tables, graphs, maps, pictures and other
visual devices whenever needed and possible.

17.4. GUIDELINES FOR TABLES:


Statistical tables are very important and effective tools for presenting data and provide a quick
understanding of the facts and figures. Using tables allows the writer to point out significant features
without getting bogged down in details. It is advisable to include only the important tables in the report;
any others may be incorporated in the appendix.
Every table must have a title and a number. The title must be brief but clearly describe the information in
the table. Tables are numbered in Arabic numerals. The unit of measurement should be clearly mentioned
in the table. Explanations and comments clarifying the table can be provided in the form of captions, stubs
and footnotes. Designations placed over the vertical columns are called captions; designations placed in
the left-hand column are called stubs. Information that cannot be incorporated in the table itself should be
explained in footnotes. It is also important to mention the source in a footnote if the data were collected
from any other source of information.
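For illustration, a small table with hypothetical figures, laid out according to these guidelines, might look
as follows; the column designations are the captions, the row labels in the left-hand column are the stubs,
and the source appears as a footnote:

    Table 1: Quarterly Sales of Product X by Region (Rs. in lakhs)

    Region        Q1      Q2      Q3      Q4
    North        12.5    14.0    13.2    15.8
    South         9.8    10.4    11.1    12.0

    Source: Hypothetical figures, for illustration only.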

17.5. GUIDELINES FOR GRAPHS:


In most presentations, graphs are used as supporting material for presenting facts and figures. Graphical
representation is more appealing for conveying the trend of the data; as the saying goes, "a picture is
worth a thousand words". Generally, graphic aids should be employed whenever practical. Every graph
should have a number and a title. The different diagrams and graphs that are commonly employed are
discussed in Unit VI.

17.6. ORAL PRESENTATION:


Along with the written report, most sponsors expect an oral presentation of the work by the researcher.
The main purpose of the oral presentation is to give the client audience or sponsors a glimpse of the major
findings of the research, so that any query or ambiguity may be clarified. The presentation must be well
prepared along the lines of the project and rehearsed several times before it is delivered. Visual aids such
as tables and graphs should be displayed with the help of a variety of media; nowadays software such as
MS PowerPoint is available to make the presentation strong and audience-friendly, and such facilities
should be used. Use bullet points and highlight important findings separately. It is important to maintain
eye contact with the audience during the presentation, and to avoid reading it out. Always tailor the
presentation to the type of audience.

17.7 SUMMARY:
This Unit is on report writing and presentation. As an introduction, the necessity of preparing a research
report was explained. The various steps that researchers generally follow in writing a report were presented
and discussed, along with the important points one must keep in mind while preparing a research report.
The need for an oral presentation and the points to be considered in preparing one were also presented.
Activity 1. Prepare a PowerPoint presentation for any one of the units studied so far in this Paper.

Review Questions :
1. Explain the significance of a research report.
2. Write the different steps in a research report and explain them.
3. Write the precautions one should take while preparing a research report.
4. What is the necessity of an oral presentation? Give the various points one should keep in mind while
preparing an oral presentation.

Further readings:
1. Kothari, C.R. and Gaurav Garg, Research Methodology – Methods and Techniques, New Age
International Publishers, New Delhi
2. Naval Bajpai, Business Research Methods, Dorling Kindersley (India) Pvt. Ltd., New Delhi.
3. Mark Saunders, Philip Lewis and Adrian Thornhill, Research Methods for Business Studies, Pearson,
2012
4. Naresh K Malhotra and Satyabhushan Dash, Marketing Research: An Applied Orientation, Dorling
Kindersley (India) Pvt. Ltd., New Delhi.
5. N Thanulingom, Research Methodology, Himalaya Publishing House, Mumbai.

UNIT – 18
INTRODUCTION TO BUSINESS ANALYTICS. CONCEPT OF BUSINESS
ANALYTICS, APPLICATIONS

OBJECTIVES:
After studying this unit one will be able to
• Explain the meaning, concept and definition of business analytics.
• Explain the major areas in which Business Analytics can be applied.
• Explain the subsets of Business Analytics.
• Explain the different types of Business Analytics.
• Explain the uses and applications of Business Analytics.
• Explain market basket analysis and decision tree analysis.

STRUCTURE:
18.1 Introduction
18.2 Definition and Meaning
18.3 Concept of Business Analytics
18.4 Subsets of Analytics
18.5 Types of Business analytics
18.6 Major areas of applying Models of Business Analytics
18.7 Uses and Applications of Business Analytics
18.8 Business Analytics and Research
18.9 Market Basket Analysis
18.10 Decision Tree
18.11 Bayes theorem
18.12 Summary
18.13 Review Questions
18.14 Further Readings

18.1. INTRODUCTION
The revolution in the development of computers and automated systems in the 20th Century has created a
fundamental problem: how to make sense of the unprecedented volumes of data generated on a daily
basis. In every facet of modern life, from online shopping and social networks to scientific research and
finance, immensely detailed data is collected on actions taking place throughout the world. With the
emergence of the internet and especially of social networks, a huge amount of data is available in marketing
research about products, customers and the views expressed by customers towards products. Such data,
which is often semi-structured or unstructured, is called 'Big Data'. This emerging
phenomenon of Big Data - large pools of data sets that can be captured, communicated, aggregated,
stored, and analyzed - has presented companies and organisations with trillions of bytes of information
about their customers, suppliers, and operations. Millions of networked sensors are also embedded in
various devices such as mobile phones and tablet computers to sense, create, and communicate data.
Advances in technology have led to the creation of large data sets in many areas, including genetics (e.g.
decoding the human genome), personalized healthcare, the internet, social networks, physics, etc., that
require the use of sophisticated computing techniques. Big data is now part of every industry sector and
function of the global economy. It is increasingly the case that modern economic activity, innovation, and
growth have to take place with data and the related analytic processes, methods and outputs. However,
without proper interpretation, this data is just noise. As a consequence, it is now imperative for “information
workers” to have a combination of highly developed mathematical, statistical and computing skills. Business
Analytics is fundamentally concerned with how data can be turned into intelligence.

18.2. DEFINITION AND MEANING:


Business analytics is a set of statistical, mathematical and management tools and processes used for
analyzing the past data that can drive fact-based decision making in an organization. A key aspect of
Analytics that distinguishes it from more traditional statistics is the need to extract information from very
large data sets. Business analytics (BA) is a subset of business intelligence (BI) that helps organizations
to manage the data-to-decision cycle in terms of descriptive analytics, predictive analytics and prescriptive
analytics.

18.3. CONCEPT OF BUSINESS ANALYTICS


As already mentioned, Business Analytics refers to technologies and applications that perform deep
analysis of data and provide insights that will help the business in the future. The discipline of business
analytics (BA) enables companies and organizations to realize the full potential of data generated from
various business processes, sources and devices, thus improving their speed and effectiveness in generating
business insights and intelligence for optimal decision making. Using Business Analytics tools helps in
giving meaning to the data and in improving the prospects of a business. The improvements most often
felt are increased profitability, reduced cost, faster decision making and critical performance improvements.
Business analytics generates novel insights. It helps in answering questions such as:
Why is this happening?

What will happen if the trend continues?
What is likely to happen next?
What is the optimal solution?
These questions clearly reflect that BA helps in understanding present data, predicting the future and
prescribing the best solution for a given problem. It also helps in uncovering new patterns and insights
that were hidden or not known earlier.

18.4. SUBSETS OF ANALYTICS:


There are different subsets of analytics that one may find useful, based on the insights being sought.
• Behavioral Analytics: This is used to understand the behavior of the customer or user. It can cover
both offline and online behavior, but the term is most often used for the latter. Online, the paths that
visitors take and how they interact with each path are critical data for understanding how the site can
be optimized for increased engagement and conversion. It can also cover shopping behavior, which
helps in building recommender systems. Offline transactions, and the interactions the customer or
user has with the company or its service personnel, also help in understanding behavior. Customer
churn prediction is one popular application of behavioral analytics.
• Business Intelligence or BI: BI analytics centralizes all aspects of an organization's performance,
from marketing to operations and accounting, for senior and junior leadership in the process of
monitoring the company's behavior. BI is the central point for performance monitoring and planning
in medium and large enterprise organizations.
• Conversion Analytics: A conversion on a site is an activity of value, such as purchasing a product
on an e-commerce site. It also indicates the success of a promotional link placed on a website.
Promotional links such as web advertisements are placed on websites to entice the user to click;
the placement is deemed successful when the user clicks on the link, and even more successful if a
positive outcome (for example, making a purchase) follows. Conversion analytics often incorporates
testing of page elements so that one can optimize the site to convert more visitors into customers.
• Messaging Analytics: Marketing automation, email, inbox reporting, SMS, phone and other
messaging systems offer analytics that report activity per campaign and subscriber activity, and
often integrate with other analytics systems to help improve messaging and campaign execution.
• Search Analytics: Tools that monitor search engine keywords, competitors, and how a company's
content is ranked help in attracting new visitors and building content strategies that drive business.
One can find the keywords users most often use to search for the company's website, and the rank
of the website in the search engine results. Based on this information, Search Engine Optimization
(SEO) can be performed to improve the website's ranking.
• Social Analytics: With the growth and popularity of social media, social analytics, in which
interactions on social media help in understanding users and their opinions of a company's products,
can help in improving those products further. It also helps in understanding the current trends in
vogue and in coming up with services that satisfy customers. Social analytics can measure authority,
track social ranking, and help in understanding why people follow a company, what topics they are
interested in and want to interact about, etc. It also helps in building trust amongst the audience or
community, which can be used to echo promotions or even drive direct conversions.
18.5. TYPES OF BUSINESS ANALYTICS
Business Analytics can be divided into basically three types:
1. Descriptive Analytics
2. Predictive Analytics
3. Prescriptive Analytics
18.5.1. Descriptive Analytics
It is a set of techniques used to describe data, usually performed on historical data. Tasks such as data
queries, report generation, descriptive statistics, data visualization, dashboards and data mining come
under descriptive analytics.
Descriptive analytics becomes a powerful tool for describing data when it is combined with data
visualization. Descriptive analytics with visualization has been used successfully in the past; one such
case is the "Spot Map" developed by Dr. John Snow during an outbreak of cholera. The case history is as
follows:
“As London suffered a series of cholera outbreaks during the mid-19th century, Snow theorized that
cholera reproduced in the human body and was spread through contaminated water. This contradicted
the prevailing theory that diseases were spread by “miasma” in the air.
London’s water supply system consisted of shallow public wells where people could pump their own
water to carry home, and about a dozen water utilities that drew water from the Thames to supply a
jumble of water lines to more upscale houses. London’s sewage system was even more ad hoc: privies
emptied into cesspools or cellars more often than directly into sewer pipes. So the pervasive stench of
animal and human feces combined with rotting garbage made the miasma theory of disease seem very
plausible. Disease was more prevalent in lower-class neighborhoods because they stank more, and because
the supposed moral depravity of poor people weakened their constitutions and made them more vulnerable
to disease.
The September 1854 cholera outbreak was centered in the Soho district, close to Snow’s house. Snow
mapped the 13 public wells and all the known cholera deaths around Soho, and noted the spatial clustering
of cases around one particular water pump on the southwest corner of the intersection of Broad (now
Broadwick) Street and Cambridge (now Lexington) Street. He examined water samples from various
wells under a microscope, and confirmed the presence of an unknown bacterium in the Broad Street
samples. Despite strong skepticism from the local authorities, he had the pump handle removed from the
Broad Street pump and the outbreak quickly subsided.
Snow subsequently published a map of the epidemic to support his theory. A detailed form of this map is
shown in Figure 18.1."

Fig 18.1
Source: https://www1.udel.edu/johnmack/frec682/cholera/
The above case shows that data visualization, where data is shown visually in the form of maps or graphs,
helps in understanding the data better and thus enables better analysis.
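As a minimal illustration of descriptive analytics in practice, the following Python sketch uses the pandas
library on hypothetical sales data (the column names and figures are assumptions for illustration) to
produce the kind of summary statistics and grouped reports described above:

    # Descriptive analytics on hypothetical sales data
    import pandas as pd

    data = pd.DataFrame({
        "region": ["North", "South", "North", "East", "South", "East"],
        "sales":  [120, 95, 140, 80, 110, 70],
    })

    # Summary statistics: count, mean, std, min, quartiles, max
    print(data["sales"].describe())

    # A simple report: total and average sales by region
    print(data.groupby("region")["sales"].agg(["sum", "mean"]))

Pairing such summaries with a chart (for example, a bar chart of sales by region) supplies the visual
dimension that made Snow's spot map so effective.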
18.5.2. Predictive Analytics
It consists of techniques that use past (historical) data to predict the future, or to predict the value of one
variable from the values of other variables. For instance, the future sales of a company can be predicted
by observing past sales trends, which helps in planning ahead with regard to production, meeting demand,
and the strategies to be adopted in the market. Predictive analytics is also seen in e-commerce, where it is
used to recommend products to a customer by observing the past behavior of similar customers. Statistical
techniques such as linear regression and time series analysis, and data mining and machine learning
techniques such as supervised or semi-supervised learning, are used for building predictive models.
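As a minimal sketch of this idea, the following Python fragment fits a linear-regression trend to hypothetical
past sales figures using scikit-learn and extrapolates it one period ahead; the numbers are assumptions for
illustration only:

    # Forecasting next period's sales from a linear trend
    import numpy as np
    from sklearn.linear_model import LinearRegression

    periods = np.array([[1], [2], [3], [4], [5]])  # past time periods
    sales = np.array([100, 108, 115, 125, 131])    # hypothetical past sales

    model = LinearRegression()
    model.fit(periods, sales)  # learn the upward trend

    print("Forecast for period 6:", model.predict(np.array([[6]]))[0])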
One popular example of predictive analytics in action comes from the retail giant Walmart. The case
details are as follows:
"Walmart is the largest retailer in the world and the world's largest company by revenue, with more than
two million employees and 20,000 stores in 28 countries.
With operations on this scale it's no surprise that they have long seen the value in data analytics. In 2004,
when Hurricane Frances was approaching the US, they found that unexpected insights could come to light
when data was studied as a whole, rather than as isolated individual sets.

Attempting to forecast demand for emergency supplies in the face of the approaching hurricane, CIO
Linda Dillman turned up some surprising statistics. As well as flashlights and emergency equipment,
expected bad weather had led to an upsurge in sales of strawberry Pop-Tarts in several locations. Extra
supplies of these were dispatched to stores in Hurricane Sandy's path in 2012, and sold extremely well."
[source: www.forbes.com]
18.5.3. Prescriptive Analytics
As the name suggests, prescriptive analytics prescribes the best solution or course of action to be taken.
Thus, prescriptive analytics helps in arriving at a decision: the best possible solution under the given
situation and constraints.
Operations research techniques, such as optimization models, are commonly used for prescriptive analytics.
For instance, prescriptive analytics may help in arriving at the right product mix for a company such that
its profit is maximized while the products optimally utilize the resources in hand. Since the availability of
resources is limited and subject to change, it acts as a set of constraints while deciding the right product
mix; prescriptive analytics helps in such scenarios.
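A minimal product-mix sketch of this idea, with hypothetical profit figures and resource limits, can be
written using scipy's linear-programming routine; since linprog minimizes, the profit coefficients are
negated:

    # Prescriptive analytics: product mix maximizing profit 3x + 5y
    # subject to resource constraints (all figures hypothetical)
    from scipy.optimize import linprog

    c = [-3, -5]             # negated profits (linprog minimizes)
    A_ub = [[1, 2], [3, 1]]  # machine hours: x + 2y <= 14
    b_ub = [14, 15]          # labour hours:  3x + y <= 15

    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    print("Optimal mix (x, y):", result.x)  # about (3.2, 5.4)
    print("Maximum profit:", -result.fun)   # about 36.6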
One of the popular examples of prescriptive analytics is in the case of Google’s ‘self-driving car’ as
illustrated below:
“During every trip, it makes multiple decisions about what to do based on predictions of future outcomes.
For example, when approaching an intersection, the car needs to determine whether to go left or right
and, based on various future possibilities, it makes a decision. So, the car needs to anticipate what might
be coming in terms of traffic, pedestrians, etc. and the effect of a possible decision before actually making
that decision.” [source: http://data-informed.com]

18.6. MAJOR AREAS OF APPLYING MODELS OF BUSINESS ANALYTICS:


The application of analytics is seen in different areas of business, as tabulated in Table 18.1:

Area Examples
Marketing and Sales Identifying potential customers, optimizing marketing campaigns,
customer segmentation, pricing decisions, identifying sales
trends, etc.,
Customer Service Customer behavior analysis, customer value analysis, churn
prediction, identifying trends in customer inquiries, etc.,
Finance & Controlling Risk analysis, fraud detection, credit default risk prediction,
etc.,
IT Performance optimization, security analysis, etc.,
Production Quality control and improvement, optimizing production process,
improve production planning, predictive maintenance, etc.,
Logistics/ supply chain Identification of optimal stock inventory, optimization of supply
chain, shipment tracking, etc.,
R&D Testing new product, obtaining new product idea, identifying
customer needs
Table 18.1

18.7. USES AND APPLICATIONS OF BUSINESS ANALYTICS:
Business Analytics helps a business to make data-driven decisions. It also helps the organization gain a
competitive edge over its counterparts. Big data analytics is one of the major areas of analytics, where
research is done to address the 3Vs of Big Data, i.e., Volume, Velocity and Variety. Apart from big data
analytics, business analytics is also used for different tasks such as:
• Data profiling, by understanding the data better
• Mining the data to find interesting patterns. This helps in understanding the relationships between
different data items and thus establishes how a change in one variable may affect another, which
aids prediction and decision making.
• Performing statistical and quantitative analysis to find cause-and-effect relationships
• Performing multivariate analysis
• Building predictive models for forecasting
Analytics has many applications across industries and social sectors. Based on the kind of data and the
purpose of analysis, we may have many different uses or applications of analytics.
18.8. BUSINESS ANALYTICS AND RESEARCH
Business analytics methods can be used in research when one has to explore large amounts of data and
perform the tasks of finding relationships between variables, predicting the values of some variables, and
prescribing solutions using operations research models.
Business analytics can be seen as an integration of different disciplines, of which operations research,
information systems and machine learning are the most relevant. Business analytics helps in research
problems related to operations research as well as data science (Fig 18.2).

Fig 18.2.
Source: Mortenson, M. J., Doherty, N. F., & Robinson, S. (2015). Operational research from Taylorism to
Terabytes: A research agenda for the analytics age. European Journal of Operational Research, 241(3),
583-595.
Research into analytics should seek both to incorporate the unique aspects of the OR discipline and to
embrace the innovations, concerns and characteristics of analytics. Some of the research questions that
one can pursue in analytics are:
• Limitations and applications of optimization and other operations research techniques on large datasets
• Challenges in applying operations research methods within distributed systems
• Application of operations research models for building simulation models
• Adoption of operations research techniques in data mining
• Real-time applications of operations research
• Methods to be used for hypothesis testing and model validation on large datasets
• Effective use of operations research and statistical techniques to handle semi-structured and
unstructured data
• Role of data visualization techniques in operations research

18.9. MARKET BASKET ANALYSIS


Market Basket Analysis uses the concept of association mining, in which one derives rules about items
that are often bought together; hence the terminology. An association rule is usually of the form X → Y,
where X and Y are itemsets and X ∩ Y = Ø.
This is the most widely used and, in many ways, most successful data mining algorithm. It essentially
determines what products people purchase together. Stores can use this information to place these products
in the same area. Direct marketers can use this information to determine which new products to offer to
their current customers. Inventory policies can be improved if reorder points reflect the demand for the
complementary products.
Rules are written in the form "left-hand side implies right-hand side"; an example is:
Yellow Peppers IMPLIES Red Peppers, Bananas, Bakery
To make effective use of a rule, the following numeric measures about the rule must be considered:
(1) support, (2) confidence.
1. Support refers to the percentage of baskets in which the rule was true (both the left- and right-hand-side
products were present).
2. Confidence measures the percentage of baskets containing the left-hand-side products that also
contained the right-hand-side products.
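In symbols, for a set of N transactions, these two measures of a rule X → Y can be written as:

    Support(X → Y)    = n(X and Y) / N
    Confidence(X → Y) = n(X and Y) / n(X)

where n(·) denotes the number of transactions containing the given items.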
18.9.1. Market Basket Methodology:
• We first need a list of transactions and what was purchased. This is pretty easily obtained these
days from scanning cash registers.

• Next, we choose a list of products to analyze, and tabulate how many times each was purchased
with the others.
• The diagonals of the table show how often a product is purchased in any combination, and the off-
diagonals show which combinations were bought.
Example 18.1:
Consider the following simple example about five transactions at a convenience store:
Transaction 1: Frozen pizza, cola, milk
Transaction 2: Milk, potato chips
Transaction 3: Cola, frozen pizza
Transaction 4: Milk, pretzels
Transaction 5: Cola, pretzels
These need to be cross-tabulated and displayed in a table:

Product Bought   Pizza also   Milk also   Cola also   Chips also   Pretzels also
Pizza                 2            1           2           0              0
Milk                  1            3           1           1              1
Cola                  2            1           3           0              1
Chips                 0            1           0           1              0
Pretzels              0            1           1           0              2

Inferences:
• Pizza and Cola sell together more often than any other combo; a cross-marketing opportunity?
• Milk sells well with everything – people probably come here specifically to buy it.
To measure the rule
Milk → Pizza
Support = 1/5 (milk and pizza appear together in 1 of the 5 transactions)
Confidence = 1/3 (of the 3 transactions containing milk, 1 also contains pizza)
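The same computation can be sketched in a few lines of Python using the five transactions of Example
18.1; it reproduces the support of 1/5 and confidence of 1/3 for the rule Milk → Pizza:

    # Support and confidence for the rule Milk -> Pizza (Example 18.1)
    transactions = [
        {"pizza", "cola", "milk"},
        {"milk", "chips"},
        {"cola", "pizza"},
        {"milk", "pretzels"},
        {"cola", "pretzels"},
    ]

    both = sum(1 for t in transactions if {"milk", "pizza"} <= t)  # 1
    left = sum(1 for t in transactions if "milk" in t)             # 3

    print("Support:", both / len(transactions))  # 1/5 = 0.2
    print("Confidence:", both / left)            # 1/3 = 0.33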
18.9.2. How is it used?
In retailing, many purchases are made on impulse. Market basket analysis gives clues as to what a
customer might have bought if the idea had occurred to them.
As a first step, therefore, market basket analysis can be used in deciding the location and promotion of
goods inside a store. If, as has been observed, purchasers of Barbie dolls are more likely to buy candy,
then high-margin candy can be placed near to the Barbie doll display. Customers who would have bought
candy with their Barbie dolls had they thought of it will now be suitably tempted.

But this is only the first level of analysis. Differential market basket analysis can find interesting results
and can also eliminate the problem of a potentially high volume of trivial results.
In differential analysis, results will be compared between different stores, between customers in different
demographic groups, between different days of the week, different seasons of the year, etc.
If it is observed that a rule holds in one store but not in any other (or does not hold in one store but holds
in all others), then it is known that there is something interesting about that store. Perhaps its clientele is
different, or perhaps it has organized its displays in a novel and more lucrative way. Investigating such
differences may yield useful insights that will improve company sales.
18.9.3. Other Application Areas of Market Basket Analysis:
Although Market Basket Analysis conjures up pictures of shopping carts and supermarket shoppers, it is
important to realize that there are many other areas in which it can be applied. These include:
• Analysis of credit card purchases.
• Analysis of telephone calling patterns.
• Identification of fraudulent medical insurance claims. (Consider cases where common rules are
broken).
• Analysis of telecom service purchases.
18.9.4. Limitations of Market Basket Analysis:
• Many real transactions are needed to do an effective basket analysis, but the data’s accuracy is
compromised if all the products do not occur with similar frequency.
• The analysis can sometimes capture results that were due to the success of previous marketing
campaigns (and not natural tendencies of customers).

18.10. DECISION TREE


In single-stage decision-making problems, the payoffs, states of nature, courses of action and the
probabilities associated with the occurrence of the states of nature are not subject to change. However,
situations may arise when a decision maker needs to revise a previous decision on receiving new information
and then make a sequence of further decisions. The problem then becomes a multi-stage decision problem,
because the consequences of one decision affect future decisions: given one decision, there will be an
outcome, which will lead to another decision, and so on. Decision tree analysis involves the construction
of a diagram showing all the choices, states of nature, and the probabilities associated with the states of
nature. Since the diagram looks like a drawing of a tree, it is called a decision tree.
The decision tree is the most popular technique for building a classifier. A decision tree consists of nodes,
branches, probability estimates and payoffs. There are two types of nodes: decision nodes, usually denoted
by squares, and chance or leaf nodes, usually denoted by circles. Using a decision tree, paths can be
created that build up rules on how to classify given data into known class labels, where the class labels are
the leaf nodes and the rules are derived from the path taken from the root to each leaf node.
Consider the example in Fig 18.3, where a decision node (outlook) has three branches (sunny, overcast
and rainy) and the leaf node (play) is the classification or decision to be taken.

Fig 18.3
Source: http://www.saedsayad.com/decision_tree.htm
The most popular algorithm used for constructing decision trees is ID3, by J. R. Quinlan, which employs
a top-down greedy search. It uses the concepts of entropy and information gain to construct the tree.
The decision tree is built on a training set and tested on a test set to measure the performance of the
model, which is evaluated based on the accuracy of its predictions.
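A minimal sketch of building such a classifier, using scikit-learn's decision tree with the entropy criterion
on a toy outlook-and-play data set (the encoding and examples are assumptions for illustration, not the
ID3 implementation itself), might look like this:

    # Toy decision-tree classifier: predict 'play' from 'outlook'
    from sklearn.tree import DecisionTreeClassifier

    # Encode outlook: 0 = sunny, 1 = overcast, 2 = rainy
    X = [[0], [0], [1], [2], [2], [1], [2]]
    y = ["no", "no", "yes", "yes", "no", "yes", "yes"]

    tree = DecisionTreeClassifier(criterion="entropy")  # entropy, as in ID3
    tree.fit(X, y)                                      # build on training data

    print(tree.predict([[1]]))  # classify an 'overcast' day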

18.11. BAYES THEOREM


Another method that has caught the attention of researchers in this area is the use of Bayes' theorem.
Bayes' theorem describes the relationships that exist within an array of simple and conditional probabilities.
It begins with initial, or prior, probability estimates for the specific events of interest. From sources such
as a sample, a special report, a product test and so on, one can obtain additional information about the
events. Given this new information, one can update the prior probability values by calculating revised
probabilities, referred to as posterior probabilities. The probability revision process proceeds as follows:
Prior Probabilities → New Information → Application of Bayes' Theorem → Posterior Probabilities

Bayes' theorem relates the conditional and marginal probabilities of events A and B, provided the
probability of B is not zero:

    P(A|B) = P(B|A) P(A) / P(B)
Each term in Bayes’ theorem has a conventional name:


• P(A) is the prior probability or marginal probability of A.
• P(A|B) is the conditional probability of A given B; it is also called the posterior probability.
• P(B|A) is the conditional probability of B given A; it is also called the likelihood.
• P(B) is the prior or marginal probability of B, and acts as a normalizing constant.
Bayes’ theorem in this form gives a mathematical representation of how the conditional probability of
event A given B is related to the converse conditional probability of B given A.
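A small numeric sketch of this revision process, with hypothetical probabilities, is given below; a prior
P(A) is updated into a posterior P(A|B) once the new information B is observed:

    # Bayes' theorem with hypothetical figures:
    # A = "machine is faulty", B = "an item is defective"
    p_a = 0.05              # prior probability that the machine is faulty
    p_b_given_a = 0.60      # likelihood of a defect if the machine is faulty
    p_b_given_not_a = 0.02  # defect rate otherwise

    # Total probability of observing a defect
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

    # Posterior probability, revised in the light of the new information
    p_a_given_b = p_b_given_a * p_a / p_b
    print(round(p_a_given_b, 3))  # about 0.612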

18.12. SUMMARY:
This Unit covered some fundamentals of Business Analytics. The definition, meaning and concept of
Business Analytics were discussed, along with the different subsets and types of Business Analytics and
its major application areas and uses. The relation between research and Business Analytics was also
discussed, and market basket analysis, the decision tree approach and Bayes' theorem were briefly
presented.
Exercises:
1. What is meant by Business Analytics? Define it and explain its concept.
2. What are the uses of Business Analytics?
3. Explain the different subsets of Business Analytics.
4. Explain the different types of Business Analytics.
5. What are the major areas of application of Business Analytics?
6. Explain the different applications of Business Analytics.
7. What is market basket analysis? Explain briefly.
8. How does a decision tree help in classifying data? Explain.
9. Briefly explain the concept of Bayes' theorem.
References/ Further readings:
1. Sahil Raj, Business Analytics, Cengage Learning, New Delhi, 2015
2. R. N. Prasad, Seema Acharya, Fundamentals of Business Analytics, Wiley India, New Delhi, 2012
3. Minelli, M., Big Data, Big Analytics, Wiley India, New Delhi, 2013
4. Jeffrey D. Camm, James J. Cochran, Michael J. Fry, Jeffrey W. Ohlmann, David R. Anderson,
Dennis J. Sweeney, Thomas A. Williams, Essentials of Business Analytics, Cengage Learning, New
Delhi, 2015.
5. James R. Evans, Business Analytics, Pearson Education, New Delhi, 2015
6. Gupta S.C. and Kapoor V.K., Fundamentals of Mathematical Statistics, Sultan Chand & Sons, New
Delhi.
