
Unit-1

Research Methodology: An Introduction
Meaning of Research


Research may be very broadly defined as the systematic gathering of data and information and its
analysis for the advancement of knowledge in any subject. Research attempts to find answers to
intellectual and practical questions through the application of systematic methods. Webster’s Collegiate Dictionary
defines research as "studious inquiry or examination; esp: investigation or experimentation aimed at
the discovery and interpretation of facts, revision of accepted theories or laws in the light of new
facts, or practical application of such new or revised theories or laws". Some people consider
research as a movement, a movement from the known to the unknown.
Definition:
According to Clifford Woody, research comprises defining and redefining problems, formulating
hypotheses or suggested solutions; collecting, organizing and evaluating data; making deductions
and reaching conclusions; and at last carefully testing the conclusions to determine whether they fit
the formulated hypothesis.
D. Slesinger and M. Stephenson in the Encyclopedia of Social Sciences define research as “the
manipulation of things, concepts or symbols for the purpose of generalizing to extend, correct or
verify knowledge, whether that knowledge aids in construction of theory or in the practice of an
art.”
Research is, thus, an original contribution to the existing stock of knowledge making for its
advancement. It is the pursuit of truth with the help of study, observation, comparison and
experiment. In short, the search for knowledge through objective and systematic method of finding
solution to a problem is research. The systematic approach concerning generalization and the
formulation of a theory is also research. As such the term ‘research’ refers to the systematic method
consisting of enunciating the problem, formulating a hypothesis, collecting the facts or data,
analyzing the facts and reaching certain conclusions either in the form of solution(s) towards the
concerned problem or in certain generalizations for some theoretical formulation.
Objectives Of Research:
The purpose of research is to discover answers to questions through the application of scientific
procedures. The main aim of research is to find out the truth which is hidden and which has not been
discovered as yet. Though each research study has its own specific purpose, we may think of
research objectives as falling into a number of following broad groupings:
To gain familiarity with a phenomenon or to achieve new insights into it (studies with this object in
view are termed as exploratory or formulative research studies);
To portray accurately the characteristics of a particular individual, situation or a group (studies
with this object in view are known as descriptive research studies);
To determine the frequency with which something occurs or with which it is associated with
something else (studies with this object in view are known as diagnostic research studies);
To test a hypothesis of a causal relationship between variables (such studies are known as
hypothesis-testing research studies).
Types of research
Types of research can be classified in many different ways. Some major ways of classifying research
include the following.
Descriptive versus Analytical Research
Applied versus Fundamental Research
Qualitative versus Quantitative Research
Conceptual versus Empirical Research
Descriptive research concentrates on finding facts to ascertain the nature of something as it exists.
In contrast, analytical research is concerned with determining the validity of hypotheses based on
analysis of the facts collected.
Applied research is carried out to find answers to practical problems to be solved and as an aid in
decision making in different areas including product design, process design and policy making.
Fundamental research is carried out more to satisfy intellectual curiosity than with the
intention of using the research findings for any immediate practical application.
Qualitative research studies those aspects of the research subject which are not quantifiable, and
hence not subject to measurement and quantitative analysis. In contrast, quantitative research
makes substantial use of measurements and quantitative analysis techniques.

Conceptual research involves investigation of thoughts and ideas and developing new ideas or
interpreting old ones based on logical reasoning. In contrast, empirical research is based on firm,
verifiable data collected either by observation of facts under natural conditions or through
experimentation.
Scope of Research:
The scope of the study explains the extent to which the research area will be explored in the work.
Broadly, the scope of research can be divided into three levels:

Environmental Level
Technological innovations
Competitive analysis
Industry fears
New market entry
New product development
Organizational Level
HRM
Finance
Production
Organizational effectiveness and success
Marketing Level
Product
Price
Place
Promotion
Sales
Limitations:
If business activities are carried on the basis of custom and tradition, a research study becomes irrelevant.
Research activities can be very expensive.
Research may not be feasible for small and medium scale units.
Research Process:
Before embarking on the details of research methodology and techniques, it seems appropriate to
present a brief overview of the research process. The research process consists of a series of actions or
steps necessary to effectively carry out research and the desired sequencing of these steps. The
process consists of a number of closely related activities, listed below as steps 1 to 11, but such
activities overlap continuously rather than following a strictly prescribed sequence.

At times, the first step determines the nature of the last step to be undertaken. If subsequent
procedures have not been taken into account in the early stages, serious difficulties may arise which
may even prevent the completion of the study. One
should remember that the various steps involved in a research process are not mutually exclusive;
nor are they separate and distinct.
They do not necessarily follow each other in any specific order and the researcher has to be
constantly anticipating at each step in the research process the requirements of the subsequent
steps. However, the following order concerning various steps provides a useful procedural guideline
regarding the research process:
formulating the research problem;
extensive literature survey;
developing the hypothesis;
preparing the research design;
determining sample design;
collecting the data;
execution of the project;
analysis of data;
hypothesis testing;
generalizations and interpretation, and
preparation of the report or presentation of the results, i.e., formal write-up of conclusions reached.
1. Formulating the research problem: There are two types of research problems, viz., those which
relate to states of nature and those which relate to relationships between variables. At the very
outset the researcher must single out the problem he wants to study, i.e., he must decide the general
area of interest or aspect of a subject-matter that he would like to inquire into. Initially the problem
may be stated in a broad general way and then the ambiguities, if any, relating to the problem be
resolved. Then, the feasibility of a particular solution has to be considered before a working
formulation of the problem can be set up. The formulation of a general topic into a specific research
problem, thus, constitutes the first step in a scientific enquiry. Essentially two steps are involved in
formulating the research problem, viz., understanding the problem thoroughly, and rephrasing the
same into meaningful terms from an analytical point of view.
The best way of understanding the problem is to discuss it with one’s own colleagues or with those
having some expertise in the matter. In an academic institution the researcher can seek help from
a guide, who is usually an experienced person with several research problems in mind.
He may review two types of literature—the conceptual literature concerning the concepts and
theories, and the empirical literature consisting of studies made earlier which are similar to the one
proposed.
This task of formulating, or defining, a research problem is a step of greatest importance in the entire
research process. The problem to be investigated must be defined unambiguously for that will help
discriminating relevant data from irrelevant ones.
2. Extensive literature survey: Once the problem is formulated, a brief summary of it should be
written down. It is compulsory for a research worker writing a thesis for a Ph.D. degree to write a
synopsis of the topic and submit it to the necessary Committee or the Research Board for approval.
At this juncture the researcher should undertake an extensive literature survey connected with the
problem.

For this purpose, the abstracting and indexing journals and published or unpublished bibliographies
are the first place to go to. Academic journals, conference proceedings, government reports, books
etc., must be tapped depending on the nature of the problem. In this process, it should be
remembered that one source will lead to another. The earlier studies, if any, which are similar to the
study in hand should be carefully studied. A good library will be a great help to the researcher at this
stage.
3. Development of working hypotheses: After the extensive literature survey, the researcher should state
in clear terms the working hypothesis or hypotheses. A working hypothesis is a tentative assumption
made in order to draw out and test its logical or empirical consequences. As such the manner in
which research hypotheses are developed is particularly important since they provide the focal point
for research.

They also affect the manner in which tests must be conducted in the analysis of data and indirectly
the quality of data which is required for the analysis. In most types of research, the development of
working hypothesis plays an important role.

Hypothesis should be very specific and limited to the piece of research in hand because it has to be
tested. The role of the hypothesis is to guide the researcher by delimiting the area of research and to
keep him on the right track. It sharpens his thinking and focuses attention on the more important
facets of the problem. It also indicates the type of data required and the type of methods of data
analysis to be used.
How does one go about developing working hypotheses? The answer is by using the following
approach:
Discussions with colleagues and experts about the problem, its origin and the objectives in seeking a
solution;
Examination of data and records, if available, concerning the problem for possible trends,
peculiarities and other clues;
Review of similar studies in the area or of the studies on similar problems; and
Exploratory personal investigation which involves original field interviews on a limited scale with
interested parties and individuals with a view to secure greater insight into the practical aspects of
the problem.

4. Preparing the research design: The research problem having been formulated in clear cut
terms, the researcher will be required to prepare a research design, i.e., he will have to state the
conceptual structure within which research would be conducted. The preparation of such a design
facilitates research to be as efficient as possible yielding maximal information.
In other words, the function of research design is to provide for the collection of relevant evidence
with minimal expenditure of effort, time and money. But how all these can be achieved depends
mainly on the research purpose. Research purposes may be grouped into four categories,
Exploration,
Description,
Diagnosis, and
Experimentation.

5. Determining sample design: All the items under consideration in any field of inquiry constitute a
‘universe’ or ‘population’. A complete enumeration of all the items in the ‘population’ is known as a
census inquiry. It can be presumed that in such an inquiry when all the items are covered no element
of chance is left and highest accuracy is obtained. But in practice this may not be true.
Even the slightest element of bias in such an inquiry will get larger and larger as the number of
observations increases. Moreover, there is no way of checking the element of bias or its extent except
through a resurvey or use of sample checks. Besides, this type of inquiry involves a great deal of time,
money and energy. Not only this, census inquiry is not possible in practice under many
circumstances. For instance, blood testing is done only on sample basis. Hence, quite often we select
only a few items from the universe for our study purposes. The items so selected constitute what is
technically called a sample.

The researcher must decide the way of selecting a sample or what is popularly known as the sample
design. In other words, a sample design is a definite plan determined before any data are actually
collected for obtaining a sample from a given population. Thus, the plan to select 12 of a city’s 200
drugstores in a certain way constitutes a sample design. Samples can be either probability samples or
non- probability samples.

With probability samples each element has a known probability of being included in the sample but
the non-probability samples do not allow the researcher to determine this probability.

A brief mention of the important sample designs is as follows:

a. Deliberate sampling
b. Simple random sampling
c. Systematic sampling
d. Stratified sampling
e. Quota sampling
f. Cluster sampling and area sampling
g. Multi-stage sampling
h. Sequential sampling
6. Collecting the data: In dealing with any real life problem it is often found that data at hand are
inadequate, and hence, it becomes necessary to collect data that are appropriate. There are several
ways of collecting the appropriate data which differ considerably in context of money costs, time and
other resources at the disposal of the researcher. Primary data can be collected either through
experiment or through survey. If the researcher conducts an experiment, he observes some
quantitative measurements, or the data, with the help of which he examines the truth contained
in his hypothesis. But in the case of a survey, data can be collected by any one or more of the
following ways:

By observation
Through personal interview
Through telephone interviews
By mailing of questionnaires
Through schedules
7. Execution of the project: Execution of the project is a very important step in the research
process. If the execution of the project proceeds on correct lines, the data to be collected would be
adequate and dependable. The researcher should see that the project is executed in a systematic
manner and in time. If the survey is to be conducted by means of structured questionnaires, data
can be readily machine-processed. In such a situation, questions as well as the possible answers may
be coded. If the data are to be collected through interviewers, arrangements should be made for
proper selection and training of the interviewers.
The training may be given with the help of instruction manuals which explain clearly the job of the
interviewers at each step. Occasional field checks should be made to ensure that the interviewers are
doing their assigned job sincerely and efficiently. A careful watch should be kept for unanticipated
factors in order to keep the survey as realistic as possible. This, in other words, means
that steps should be taken to ensure that the survey is under statistical control so that the collected
information is in accordance with the pre-defined standard of accuracy.
8. Analysis of data: After the data have been collected, the researcher turns to the task of analyzing
them. The analysis of data requires a number of closely related operations such as establishment of
categories, the application of these categories to raw data through coding, tabulation and then
drawing statistical inferences. The unwieldy data
should necessarily be condensed into a few manageable groups and tables for further analysis. Thus,
researcher should classify the raw data into some purposeful and usable categories. Coding
operation is usually done at this stage through which the categories of data are transformed into
symbols that may be tabulated and counted. Editing is the procedure that improves the quality of
the data for coding. With coding, the stage is ready for tabulation. Tabulation is a part of the
technical procedure wherein the classified data are put in the form of tables. The mechanical devices
can be made use of at this juncture. A great deal of data, especially in large inquiries, is tabulated by
computers. Computers not only save time but also make it possible to study a large number of
variables affecting a problem simultaneously.

Analysis work after tabulation is generally based on the computation of various percentages,
coefficients, etc., by applying various well defined statistical formulae. In the process of analysis,
relationships or differences supporting or conflicting with original or new hypotheses should be
subjected to tests of significance to determine with what validity data can be said to indicate any
conclusion(s).
9. Hypothesis testing: After analyzing the data as stated above, the researcher is in a position to test
the hypotheses, if any, he had formulated earlier. Do the facts support the hypotheses or do they
happen to be contrary? This is the usual question which should be answered while testing hypotheses.
Various tests, such as the chi-square test, t-test and F-test, have been developed by statisticians for the
purpose. The hypotheses may be tested through the use of one or more of such tests, depending upon
the nature and object of the research inquiry. Hypothesis testing will result in either accepting the
hypothesis or in rejecting it. If the researcher had no hypotheses to start with, generalizations
established on the basis of data may be stated as hypotheses to be tested by subsequent researches
in times to come.
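As a concrete illustration of the tests named above, here is a minimal sketch using Python with the scipy.stats library (the language and library are assumptions; the text prescribes no software, and all data are invented):

```python
# A minimal sketch of hypothesis testing with scipy.stats.
# All data below are invented for illustration only.
from scipy import stats

# t-test: do two groups differ significantly in mean score?
group_a = [72, 75, 68, 80, 77, 74]
group_b = [65, 70, 62, 68, 66, 71]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Chi-square test: is a response associated with group membership?
# Rows are groups, columns are yes/no counts (a contingency table).
observed = [[30, 10],
            [20, 25]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")

# A small p-value (commonly p < 0.05) would lead the researcher
# to reject the null hypothesis at that significance level.
```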
10. Generalizations and interpretation: If a hypothesis is tested and upheld several times, it may
be possible for the researcher to arrive at generalization, i.e., to build a theory. As a matter of fact, the
real value of research lies in its ability to arrive at certain generalizations. If the researcher had no
hypothesis to start with, he might seek to explain his findings on the basis of some theory. It is
known as interpretation. The process of interpretation may quite often trigger off new questions
which in turn may lead to further researches.

11. Preparation of the report or the thesis: Finally, the researcher has to prepare the report of
what has been done by him. Writing of report must be done with great care keeping in view the
following: The layout of the report should be as follows:
the preliminary pages;
the main text; and
the end matter.
Research Design:
Research design is the framework of research methods and techniques chosen by a researcher. A
research design is the arrangement of conditions for collection and analysis of data in a manner that
aims to combine relevance to the research purpose with economy in procedure. The research design
is the conceptual structure within which research is conducted. It constitutes the blueprint for the
collection, measurement and analysis of data. The design includes an outline of what the researcher
will do from writing the hypothesis and its operational implications to the final analysis of data. The
design decisions can be taken by considering the following heads:
What is the study about?
Why is the study being made?
Where will the study be carried out?
What type of data is required?
Where can the required data be found?
What periods of time will the study include?
What will be the sample design?
What techniques of data collection will be used?
How will the data be analysed?
In what style will the report be prepared?
The essential elements of the research design are:
Accurate purpose statement
Techniques to be implemented for collecting and analyzing research data
The method applied for analyzing collected details
Type of research methodology
Probable objections to research
Settings for the research study
Timeline
Measurement of analysis
Proper research design sets your study up for success. Successful research studies provide
insights that are accurate and unbiased. You’ll need to create a survey that meets all of the
main characteristics of a design.
There are four key characteristics of research design:
1. Neutrality: When you set up your study, you may have to make assumptions about the data
you expect to collect. The results projected in the research design should be free from bias and
neutral. Understand opinions about the final evaluated scores and conclusions from multiple
individuals and consider those who agree with the derived results.
2. Reliability: With regularly conducted research, the researcher involved expects similar
results every time. Your design should indicate how to form research questions to ensure the
standard of results. You’ll only be able to reach the expected results if your design is reliable.
3. Validity: There are multiple measuring tools available. However, the only correct measuring
tools are those which help a researcher in gauging results according to the objective of the
research. The questionnaire developed from this design will then be valid.
4. Generalization: The outcome of your design should apply to a population and not just a
restricted sample. A generalized design implies that your survey can be conducted on any part
of a population with similar accuracy.
The above factors affect the way respondents answer the research questions and so all the
above characteristics should be balanced in a good design. A researcher must have a clear
understanding of the various types of research design to select which model to implement for a
study. Like research itself, the design of your study can be broadly classified into quantitative
and qualitative.
Qualitative research design: Qualitative research explores the “why” and “how” of a
phenomenon through non-numerical data such as interviews, observations, and open-ended
responses. Researchers rely on qualitative research design methods that conclude “why” a
particular theory exists along with “what” respondents have to say about it.
Quantitative research design: Quantitative research is for cases where statistical conclusions
to collect actionable insights are essential. It determines relationships between collected data
and observations based on mathematical calculations, and theories related to a naturally
existing phenomenon can be proved or disproved using statistical methods. Numbers provide a
better perspective to make critical business decisions. Quantitative research design methods are
necessary for the growth of any organization, since insights drawn from hard numerical data
and analysis prove to be highly effective when making decisions related to the future of the
business.
Types of Research Design:
1. Descriptive research design: In a descriptive design, a researcher is solely interested in
describing the situation or case under their research study. It is a theory-based design method
which is created by gathering, analyzing, and presenting collected data. This allows a
researcher to provide insights into the why and how of research. Descriptive design helps
others better understand the need for the research. If the problem statement is not clear, you
can conduct exploratory research.
2. Experimental research design: Experimental research design establishes a relationship
between the cause and effect of a situation. It is a causal design where one observes the impact
caused by the independent variable on the dependent variable. For example, one monitors the
influence of an independent variable such as a price on a dependent variable such as customer
satisfaction or brand loyalty. It is a highly practical research design method as it contributes to
solving a problem at hand. The independent variables are manipulated to monitor the change
it has on the dependent variable. It is often used in social sciences to observe human behavior
by analyzing two groups. Researchers can have participants change their actions and study
how the people around them react to gain a better understanding of social psychology.
3. Correlational research design: Correlational research is a non-experimental
research design technique that helps researchers establish a relationship between two closely
connected variables. This type of research requires two different groups. There is no
assumption while evaluating a relationship between two different variables, and statistical
analysis techniques calculate the relationship between them. A correlation coefficient
determines the correlation between two variables, whose value ranges between -1 and +1. If
the correlation coefficient is towards +1, it indicates a positive relationship between the
variables and -1 means a negative relationship between the two variables.
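To make the coefficient concrete, here is a minimal Python sketch (the language and the scipy library are assumptions, and the numbers are invented) computing a Pearson correlation coefficient, whose value lies between -1 and +1 as described above:

```python
# Minimal sketch: Pearson correlation between two variables.
# Hours studied vs. exam score -- illustrative numbers only.
from scipy import stats

hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 70, 74, 80]

r, p_value = stats.pearsonr(hours, score)
print(f"r = {r:.2f}")  # close to +1 here: a positive relationship
```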
4. Diagnostic research design: In diagnostic design, the researcher is looking to evaluate the
underlying cause of a specific topic or phenomenon. This method helps one learn more about
the factors that create troublesome situations.
This design has three parts of the research:
· Inception of the issue
· Diagnosis of the issue
· Solution for the issue
5. Explanatory research design: Explanatory design uses a researcher’s ideas and thoughts
on a subject to further explore their theories. The research explains unexplored aspects of a
subject and details the what, how, and why of the research questions.
Terminologies:
RESEARCH: Research is defined as a systematic and scientific process to answer questions
about facts and the relationships between facts. It is an activity involved in seeking answers to
unanswered questions.
ABSTRACT: A clear, concise summary that communicates the essential information about the
study. In research journals, it is usually located at the beginning of an article.
DATA: Units of information or any statistics, facts, figures, general material, evidence, or
knowledge collected during the course of the study.
VARIABLES: Attributes or characteristics that can have more than one value, such as height or
weight. Variables are qualities or quantities, properties or characteristics of people, things, or
situations that change or vary.
INDEPENDENT VARIABLE: Variables that are purposely manipulated or changed by the
researcher. It is also called a “MANIPULATED VARIABLE”.
RESEARCH VARIABLE: Refers to qualities, properties or characteristics which are observed
or measured in a natural setting without manipulation or establishing a cause & effect
relationship.
DEMOGRAPHIC VARIABLES: The characteristics & attributes of study subjects such as age,
gender, place of living, educational status, religion, social class, marital status, occupation,
income are considered as demographic variables.
EXTRANEOUS VARIABLES: Are factors that are not part of the study but may affect the
measurement of the study variables.
OPERATIONAL DEFINITION: Refers to the way in which the researcher defines the variables
under investigation. An operational definition is stated in such a way that it specifies how the
study variables will be measured in the actual research situation.
CONCEPT: Refers to a mental idea of a phenomenon. Concepts are words or terms that
symbolize some aspects of reality, e.g., love, pain.
CONSTRUCT: Is a highly abstract & complex phenomenon (concept) which is denoted by a
made up or constructed term.
PROPOSITION: A Proposition is a statement or assertion of the relationship between
concepts. E.g., relationship between anxiety and performance.
CONCEPTUAL FRAMEWORK: Interrelated concepts or abstractions that are assembled
together in some rational scheme by virtue of their relevance to a common theme. It is also
referred to as theoretical framework.
ASSUMPTION: A basic principle that is assumed to be true on the basis of logic or reason,
without proof or verification.
HYPOTHESIS: A statement of the predicted relationship between two or more variables in a
research study; an educated or calculated guess by the researcher.
LITERATURE REVIEW: A critical summary of research on a topic of interest, generally
prepared to put a research problem in context or to identify gaps and weaknesses in prior
studies so as to justify a new investigation.
LIMITATIONS: Restrictions in a study that may decrease the credibility and generalizability of
the research findings.
MANIPULATION: An intervention or treatment introduced by the researcher in an
experimental or quasi experimental study; the researcher manipulates the independent
variable to assess its impact on the dependent variable.
POPULATION: The entire set of individuals or objects having some common characteristic(s)
selected for a research study is referred to as population.
TARGET POPULATION: The entire population in which the researchers are interested and to
which they would like to generalize the research findings.
ACCESSIBLE POPULATION: The aggregate of cases that conform to designated inclusion or
exclusion criteria and that are accessible as subjects of the study.
RESEARCH SETTING: The study setting is the location in which the research is conducted. It
could be a natural setting, a partially controlled environment, or a laboratory.
SAMPLE: A part or subset of the population selected to participate in the research study.
SAMPLING: The process of selecting a sample from the target population to represent the
entire population.
PROBABILITY SAMPLING: The selection of subjects or sampling units from a population
using a random procedure, e.g., Simple Random Sampling, Stratified Random Sampling.
NON-PROBABILITY SAMPLING: The selection of subjects or sampling units from a population
using a non-random procedure, e.g., Convenience Sampling, Purposive Sampling.
RELIABILITY: The degree of consistency or accuracy with which an instrument measures the
attributes it is designed to measure.
VALIDITY: The degree to which an instrument measures what it is intended to measure.
PILOT STUDY: Study carried out at the end of the planning phase of research in order to
explore and test the research elements to make relevant modifications in research tools and
methodology.
ANALYSIS: Method of organizing, sorting, and scrutinizing data in such a way that research
questions can be answered or meaningful inferences can be drawn.
RESEARCH PROJECT:
A research project is a scientific attempt to answer a research question. A research project
must include a description of a defined protocol, clearly defined goals, defined methods and
outputs, and a defined start and end date.
Choice of Topic
The ability to develop a good research topic is an important skill. An instructor may assign you
a specific topic, but most often instructors require you to select your own topic of interest.
When deciding on a topic, there are a few things that you will need to do:
Brainstorm for ideas.
Choose a topic that will enable you to read and understand the literature.
Ensure that the topic is manageable, and that material is available.
Make a list of key words.
Be flexible.
Define your topic as a focused research question.
Research and read more about your topic.
Formulate a thesis statement.
Be aware that selecting a good topic may not be easy. It must be narrow and focused enough to
be interesting, yet broad enough to find adequate information. Before selecting your topic,
make sure you know what your final project should look like. Each class or instructor will
likely require a different format or style of research project.
Use the steps below to guide you through the process of selecting a research topic.
Step 1: Brainstorm for ideas
Choose a topic that interests you. Use the following questions to help generate topic ideas.
Do you have a strong opinion on a current social or political controversy?
Did you read or see a news story recently that has piqued your interest or made you angry or
anxious?
Do you have a personal issue, problem or interest that you would like to know more about?
Do you have a research paper due for a class this semester?
Is there an aspect of a class that you are interested in learning more about?
Step 2: Read General Background Information
Read a general encyclopedia article on the top two or three topics you are considering. Reading
a broad summary enables you to get an overview of the topic and see how your idea relates to
broader, narrower, and related issues. It also provides a great source for finding words
commonly used to describe the topic. These keywords may be very useful to your later
research. If you can’t find an article on your topic, try using broader terms and ask for help
from a librarian. For example, the Encyclopedia Britannica Online (or the printed version of
this encyclopedia, in Thompson Library's Reference Collection on Reference Table 1) may not
have an article on Social and Political Implications of Jackie Robinson’s Breaking of the Color
Barrier in Major League Baseball but there will be articles on baseball history and on Jackie
Robinson.
Step 3: Focus on Your Topic : Keep it manageable. A topic will be very difficult to research if
it is too broad or narrow. One way to narrow a broad topic such as "the environment" is to
limit your topic. Some common ways to limit a topic are:
by geographical area
Example: What environmental issues are most important in the Southwestern United States?
by culture
Example: How does the environment fit into the Navajo world view?
by time frame: Example: What are the most prominent environmental issues of the last 10
years?
by discipline
Example: How does environmental awareness affect business practices today?
by population group
Example: What are the effects of air pollution on senior citizens?
locally confined - Topics this specific may only be covered in local newspapers, if at all.
Example: What sources of pollution affect the Genesee County water supply?
recent - If a topic is quite recent, books or journal articles may not be available, but newspaper
or magazine articles may. Also, Web sites related to the topic may or may not be available.
broadly interdisciplinary - You could be overwhelmed with superficial information.
Example: How can the environment contribute to the culture, politics and society of the
Western states?
popular - You will only find very popular articles about some topics such as sports figures and
high-profile celebrities and musicians.
If you have any difficulties or questions with focusing your topic, discuss the topic with your
instructor, or with a librarian.
Step 4: Make a List of Useful Keywords
Keep track of the words that are used to describe your topic.
Look for words that best describe your topic
Look for them when reading encyclopedia articles and background and general information
Find broader and narrower terms, synonyms, key concepts for key words to widen your search
capabilities
Make note of these words and use them later when searching databases and catalogs
Step 5: Be Flexible
It is common to modify your topic during the research process. You can never be sure of what
you may find. You may find too much and need to narrow your focus, or too little and need to
broaden your focus. This is a normal part of the research process. When researching, you may
not wish to change your topic, but you may decide that some other aspect of the topic is more
interesting or manageable.
Keep in mind the assigned length of the research paper, project, bibliography or other research
assignment. Be aware of the depth of coverage needed and the due date. These important
factors may help you decide how much and when you will modify your topic.
Writing Research Proposal:
Add a meaningful short title: provide a brief and meaningful title to your project.
Introduction: Background or introduction section provides a description of the basic facts and
importance of the research area.
What is your research area?
What is the motivation of research?
How important is it for the industry?
Knowledge advancement.
Problem Statement: Problem statement provides a clear and concise description of the issues
that need to be addressed.
What is the specific problem in that research area that you will address?
For example: lack of understanding of a subject, low performance, etc.
Objectives: It provides a list of goals that will be achieved through the proposed research.
What are the benefits/impact that will be generated if the research problem is
answered?
Why would we allow this research to be done?
Preliminary Literature Review: Provide a summary of previous related research on the
research problem, its strengths and weaknesses, and a justification of your research.
What is known/what has been done by others?
Why is your research still necessary?
Research Methodologies: what to do and how to solve the problem and achieve proposed
objectives. Which research methods will be used? Attach a project schedule table, if necessary.
Reference: All factual material that is not original with you must be accompanied by a
reference to its source. Follow the proper referencing guidelines as directed by the research
approval authorities.
Hypothesis: A hypothesis can be defined as a tentative prediction or explanation of the
relationship between two or more variables. Hypotheses are not meant to be haphazard
guesses, but should reflect the depth of knowledge, imagination and experience of the
investigator. In the process of formulating the hypothesis all variables relevant to the study
must be identified.
Time Frame: The researcher should include an outline of the various stages and
corresponding time frames for developing and implementing the research, including writing
up the research. For full-time study the research should be completed within three years, with
writing up completed in the fourth year of registration. For part-time study the research
should be completed within six years, with writing up completed by the eighth year.

UNIT-2
SURVEY RESEARCH DESIGN:
Survey research designs are procedures in quantitative research in which investigators
administer a survey to a sample or to the entire population of people to describe the attitudes,
opinions, behaviors, or characteristics of the population. In this procedure, survey researchers
collect quantitative, numbered data using questionnaires (e.g., mailed questionnaires) or
interviews (e.g., one-on-one interviews) and statistically analyse the data to describe trends
about responses to questions and to test research questions or hypotheses.
Types of Survey Designs:
1. Cross-Sectional Survey Designs:
In a cross-sectional survey design, the researcher collects data at one point in time.
For example, when middle school children complete a survey about teasing, they are recording
data about their present views.
This design has the advantage of measuring current attitudes or practices.
It also provides information in a short amount of time.
A cross-sectional study can examine current attitudes, beliefs, opinions, or practices. Attitudes,
beliefs, and opinions are ways in which individuals think about issues, whereas practices are
their actual behaviours.
Another cross-sectional design compares two or more educational groups in terms of
attitudes, beliefs, opinions, or practices. These group comparisons may compare students with
students, students with teachers, students with parents, or they may compare other groups
within educational and school settings.
2. Longitudinal Survey Designs:
An alternative to using a cross-sectional design is to collect data over time using a longitudinal
survey design.
A longitudinal survey design involves the survey procedure of collecting data about trends
with the same population, changes in a cohort group or subpopulation, or changes in a panel
group of the same individuals over time.
Thus, in longitudinal designs, the participants may be different or the same people.
An example of a longitudinal design would be a follow-up with graduates from a program or
school to learn their views about their educational experiences.
SAMPLING:
Sampling definition:
Sampling is a technique of selecting individual members or a subset of the population to make
statistical inferences from them and estimate characteristics of the whole population.
Types of Sampling Methods:
Sampling in market research is of two types – probability sampling and non-probability
sampling. Let’s take a closer look at these two methods of sampling.
1. Probability sampling: Probability sampling is a sampling technique where a researcher
sets a selection of a few criteria and chooses members of a population randomly. All the
members have an equal opportunity to be a part of the sample with this selection parameter.
There are four main types of probability sampling:
a. Simple random sampling:
In a simple random sample, every member of the population has an equal chance of being
selected. Your sampling frame should include the whole population.
To conduct this type of sampling, you can use tools like random number generators or other
techniques that are based entirely on chance.
Example
You want to select a simple random sample of 100 employees of Company X. You assign a
number to every employee in the company database from 1 to 1000 and use a random number
generator to select 100 numbers.
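The employee example might be coded as the following minimal Python sketch (hypothetical; random.sample plays the role of the random number generator):

```python
# Simple random sampling: each of the 1000 employees has an
# equal chance of selection; draw 100 without replacement.
import random

population = list(range(1, 1001))   # employee IDs 1 to 1000
sample = random.sample(population, 100)
print(sorted(sample)[:10])          # a peek at the selected IDs
```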

b. Systematic sampling:
Systematic sampling is like simple random sampling, but it is usually slightly easier to conduct.
Every member of the population is listed with a number, but instead of randomly generating
numbers, individuals are chosen at regular intervals.
Example
All employees of the company are listed in alphabetical order. From the first 10 numbers, you
randomly select a starting point: number 6. From number 6 onwards, every 10th person on the
list is selected (6, 16, 26, 36, and so on), and you end up with a sample of 100 people.
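The same example as a hypothetical Python sketch: pick a random start among the first 10 positions, then take every 10th member:

```python
# Systematic sampling: random start, then a fixed interval of 10.
import random

population = list(range(1, 1001))   # members listed in order
start = random.randint(0, 9)        # random start within first 10
sample = population[start::10]      # every 10th member thereafter
print(len(sample), sample[:5])      # 100 members selected
```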
c. Stratified sampling:
Stratified sampling involves dividing the population into subpopulations that may differ in
important ways. It allows you to draw more precise conclusions by ensuring that every subgroup
is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called strata) based
on the relevant characteristic (e.g. gender, age range, income bracket, job role).
Based on the overall proportions of the population, you calculate how many people should be
sampled from each subgroup. Then you use random or systematic sampling to select a sample
from each subgroup.
Example
The company has 800 female employees and 200 male employees. You want to ensure that the
sample reflects the gender balance of the company, so you sort the population into two strata
based on gender. Then you use random sampling on each group, selecting 80 women and 20
men, which gives you a representative sample of 100 people.
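A hypothetical Python sketch of that proportional allocation, with generated names standing in for the two strata:

```python
# Stratified sampling: sample each stratum in proportion to its
# share of the population (800 women, 200 men -> 80 and 20).
import random

women = [f"W{i}" for i in range(800)]   # stratum 1 (hypothetical)
men = [f"M{i}" for i in range(200)]     # stratum 2 (hypothetical)

sample = random.sample(women, 80) + random.sample(men, 20)
print(len(sample))   # 100, preserving the 80/20 gender balance
```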
d. Cluster sampling:
Cluster sampling also involves dividing the population into subgroups, but each subgroup
should have similar characteristics to the whole sample. Instead of sampling individuals from
each subgroup, you randomly select entire subgroups.
If it is practically possible, you might include every individual from each sampled cluster. If the
clusters themselves are large, you can also sample individuals from within each cluster using
one of the techniques above.
This method is good for dealing with large and dispersed populations, but there is more risk of
error in the sample, as there could be substantial differences between clusters. It’s difficult to
guarantee that the sampled clusters are representative of the whole population.
Example
The company has offices in 10 cities across the country (all with roughly the same number of
employees in similar roles). You don’t have the capacity to travel to every office to collect your
data, so you use random sampling to select 3 offices – these are your clusters.
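And the office example as a sketch (again hypothetical): the 10 offices are the clusters, and 3 are drawn at random:

```python
# Cluster sampling: randomly select whole offices (clusters),
# then survey every employee in the chosen offices.
import random

offices = [f"Office_{c}" for c in "ABCDEFGHIJ"]  # 10 clusters
chosen_clusters = random.sample(offices, 3)
print(chosen_clusters)  # survey everyone at these 3 offices
```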
Uses of probability sampling:
There are multiple uses of probability sampling:
Reduce Sample Bias: Using the probability sampling method, the bias in the sample derived
from a population is negligible to non-existent, since selection does not depend on the
researcher’s judgment. Probability sampling leads to higher quality data collection as the
sample appropriately represents the population.
Diverse Population: When the population is vast and diverse, it is essential to have adequate
representation so that the data is not skewed towards one demographic. For example, if
Square would like to understand the people that could use their point-of-sale devices, a
survey conducted from a sample of people across the US from different industries and socio-
economic backgrounds helps.
Create an Accurate Sample: Probability sampling helps the researchers plan and create an
accurate sample. This helps to obtain well-defined data.
2. Non-probability sampling: In non-probability sampling, the researcher chooses members
for research based on convenience or judgment rather than at random. This sampling method
does not have a fixed or predefined selection process. This makes it difficult for all elements
of a population to have equal opportunities to be included in a sample.

a. Convenience sampling:
A convenience sample simply includes the individuals who happen to be most accessible to the
researcher.
This is an easy and inexpensive way to gather initial data, but there is no way to tell if the
sample is representative of the population, so it can’t produce generalizable results.
Example
You are researching opinions about student support services in your university, so after each
of your classes, you ask your fellow students to complete a survey on the topic. This is a
convenient way to gather data, but as you only surveyed students taking the same classes as
you at the same level, the sample is not representative of all the students at your university.
b. Voluntary response sampling: Like a convenience sample, a voluntary response sample is
mainly based on ease of access. Instead of the researcher choosing participants and directly
contacting them, people volunteer themselves (e.g. by responding to a public online survey).
Voluntary response samples are always at least somewhat biased, as some people will
inherently be more likely to volunteer than others.
Example
You send out the survey to all students at your university and a lot of students decide to
complete it. This can certainly give you some insight into the topic, but the people who
responded are more likely to be those who have strong opinions about the student support
services, so you can’t be sure that their opinions are representative of all students.
c. Purposive sampling:
This type of sampling, also known as judgment sampling, involves the researcher using their
expertise to select a sample that is most useful to the purposes of the research.
It is often used in qualitative research, where the researcher wants to gain detailed knowledge
about a specific phenomenon rather than make statistical inferences, or where the population
is very small and specific.
An effective purposive sample must have clear criteria and rationale for inclusion.
Example
You want to know more about the opinions and experiences of disabled students at your
university, so you purposefully select a number of students with different support needs in
order to gather a varied range of data on their experiences with student services.
d. Snowball sampling:
If the population is hard to access, snowball sampling can be used to recruit participants via
other participants. The number of people you have access to “snowballs” as you get in contact
with more people.
Example
You are researching experiences of homelessness in your city. Since there is no list of all
homeless people in the city, probability sampling isn’t possible. You meet one person who
agrees to participate in the research, and she puts you in contact with other homeless people
that she knows in the area.
Uses of non-probability sampling
Non-probability sampling is used for the following:
Create a hypothesis: Researchers use the non-probability sampling method to create an
assumption when limited or no prior information is available. This method helps with the
immediate return of data and builds a base for further research.
Exploratory research: Researchers use this sampling technique widely when conducting
qualitative research, pilot studies, or exploratory research.
Budget and time constraints: The non-probability method is used when there are budget and time
constraints, and some preliminary data must be collected. Since the survey design is not rigid,
it is easier to pick respondents at random and have them take the survey or questionnaire.

QUALITATIVE DATA:
Definition:
Qualitative data is defined as data that approximates and characterizes a subject rather than measuring it.
Qualitative data can be observed and recorded. This data type is non-numerical in nature. This
type of data is collected through methods of observations, one-to-one interviews, conducting
focus groups, and similar methods. Qualitative data in statistics is also known as categorical
data – data that can be arranged categorically based on the attributes and properties of a thing
or a phenomenon.
Examples: The cake is orange, blue, and black in colour (qualitative).
Females have brown, black, blonde, and red hair (qualitative).
Importance of Qualitative Data:
Qualitative data is important in determining the frequency of traits or characteristics. It allows
the statistician or the researchers to form parameters through which larger data sets can be
observed.
Qualitative data provides the means by which observers can quantify the world around them.
For a market researcher, collecting qualitative data helps in answering questions like, who
their customers are, what issues or problems they are facing, and where do they need to focus
their attention, so problems or issues are resolved.
Qualitative data is about the emotions or perceptions of people, what they feel. In qualitative
data, these perceptions and emotions are documented. It helps the market researchers
understand the language their consumers speak and deal with the problem effectively and
efficiently.
Types of Qualitative Data:
1. One-to-One Interviews: It is one of the most used data collection instruments for
qualitative research, mainly because of its personal approach. The interviewer or the
researcher collects data directly from the interviewee on a one-to-one basis. The interview
may be informal and unstructured – conversational. Mostly the open-ended questions are
asked spontaneously, with the interviewer letting the flow of the interview dictate the
questions to be asked.
2. Focus groups: This is done in a group discussion setting. The group is limited to 6-10
people, and a moderator is assigned to moderate the ongoing discussion. Depending on the
data that is sought, the members of a group may have something in common. For example, a
researcher conducting a study on track runners will choose athletes who are track runners or
were track runners and have enough knowledge of the subject matter.
3. Record keeping: This method makes use of the already existing reliable documents and
similar sources of information as the data source. This data can be used in the new research. It
is like going to a library. There, one can go over books and other reference material to collect
relevant data that can be used in the research.
4. Process of observation: In this qualitative data collection method, the researcher immerses
himself/ herself in the setting where his respondents are and keeps a keen eye on the
participants and takes down notes. This is known as the process of observation. Besides taking
notes, other documentation methods, such as video and audio recording, photography, and
similar methods, can be used.
5. Longitudinal studies: This data collection method is performed on the same data source
repeatedly over an extended period. It is an observational research method that goes on for a
few years and, in some cases, can go on for even decades. This data collection method aims to
find correlations through an empirical study of subjects with common traits.
6. Case studies: In this method, data is gathered by an in-depth analysis of case studies. The
versatility of this method is demonstrated in how this method can be used to analyse both
simple and complex subjects. The strength of this method is how judiciously it uses a
combination of one or more qualitative data collection methods to draw inferences.
5 Steps to Qualitative Data Analysis:
Whether you are looking to analyse qualitative data collected through a one-to-one interview
or qualitative data from a survey, these simple steps will ensure a robust data analysis.
Step 1: Arrange your Data:
Once you have collected all the data, it is largely unstructured and sometimes makes no sense
when looked at briefly. Therefore, it is essential that as a researcher, you first need to transcribe
the data collected. The first step in analysing your data is arranging it systematically. Arranging
data means converting all the data into a text format. You can either export the data into a
spreadsheet or manually type in the data or choose from any of the computer-assisted
qualitative data analysis tools.
Step 2: Organize all your Data:
After transforming and arranging your data, the immediate next step is to organize your data.
You most likely have a large amount of information that still needs to be
arranged in an orderly manner. One of the best ways to organize the data is by going back to
your research objectives and then organizing the data based on the questions asked. Arrange
your research objective in a table, so it appears visually clear. At all costs, avoid temptations of
working with unorganized data. You will end up wasting time, and there will be no conclusive
results obtained.
Step 3: Set a Code to the Data Collected:
Setting up proper codes for the collected data takes you a step ahead. Coding is one of the best
ways to compress a tremendous amount of information collected. The coding of qualitative
data simply means categorizing and assigning properties and patterns to the collected data.
Coding is an important step in qualitative data analysis, as you can derive theories from
relevant research findings. After assigning codes to your data, you can then begin to build on
the patterns to gain in-depth insight into the data that will help make informed decisions.
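As a small illustration of what coding enables, the hypothetical Python sketch below tallies how often each code appears once interview excerpts have been categorized (codes and data are invented):

```python
# Minimal sketch: counting coded qualitative responses.
from collections import Counter

# One code assigned per interview excerpt (invented data).
codes = ["price", "support", "price", "usability",
         "support", "price", "usability", "support"]

for code, count in Counter(codes).most_common():
    print(f"{code}: {count}")
# Frequent codes reveal the patterns worth deeper analysis.
```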
Step 4: Validate your Data:
Validating data is one of the crucial steps of qualitative data analysis for successful research.
Since data is quintessential for research, it is imperative to ensure that the data is not flawed.
Please note that data validation is not just one step in qualitative data analysis; this is a
recurring step that needs to be followed throughout the research process. There are two sides
to validating data:
Accuracy of your research design or methods.
Reliability, which is the extent to which the methods produce accurate data consistently.
Step 5: Concluding the Analysis Process:
It is important to finally conclude your analysis, which means systematically presenting your
data in a report that can be readily used. The report should state the method that you, as a
researcher, used to conduct the research studies, the positives and negatives, and the study
limitations. In the
report, you should also state the suggestions/inferences of your findings and any related area
for future research.
Advantages of Qualitative Data:
1. It helps in-depth analysis: Qualitative data collected provide the researchers with a
detailed analysis of subject matters. While collecting qualitative data, the researchers tend to
probe the participants and can gather ample information by asking the right kind of questions.
From this series of questions and answers, the data collected is used to draw conclusions.
2. Understand what customers think: Qualitative data helps the market researchers to
understand the mindset of their customers. The use of qualitative data gives businesses an
insight into why a customer purchased a product. Understanding customer language helps
market researchers interpret the collected data more systematically.
3. Rich data: Collected data can be used to conduct research in the future as well. Since the
questions asked to collect qualitative data are open-ended questions, respondents are free to
express their opinions, leading to more information.
Disadvantages of Qualitative Data:
1. Time-consuming: As collecting qualitative data is more time-consuming, fewer people are
studied in comparison to quantitative data collection. Unless time and budget allow otherwise,
a smaller sample size is included.
2. Not easy to generalize: Since fewer people are studied, it is difficult to generalize the
results of that population.
3. Dependent on the researcher’s skills: This type of data is collected through one-to-one
interviews, observations, focus groups, etc., so it relies on the researcher’s skills and
experience to collect information from the sample.
QUANTITATIVE DATA:
Definition:
Quantitative data is defined as the value of data in the form of counts or numbers where each
dataset has a unique numerical value associated with it. This data is any quantifiable
information that can be used for mathematical calculations and statistical analysis, such that
real-life decisions can be made based on these mathematical derivations. Quantitative data is
used to answer questions such as “How many?”, “How often?”, “How much?”.
Types of Quantitative Data with Examples:
The most common types of quantitative data are as below:
Counter: A count equated with entities. For example, the number of people who download an
application from the App Store.
Measurement of physical objects: Calculating measurement of any physical thing. For
example, the HR executive carefully measures the size of each cubicle assigned to the newly
joined employees.
Sensory calculation: Mechanism to naturally “sense” the measured parameters to create a
constant source of information. For example, a digital camera converts electromagnetic
information to a string of numerical data.
Projection of data: Future projections of data can be made using algorithms and other
mathematical analysis tools. For example, a marketer may predict an increase in sales after
launching a new product, based on thorough analysis.
Quantification of qualitative entities: Assigning numbers to qualitative information. For
example, asking respondents of an online survey to share the likelihood of recommendation on
a scale of 0-10.
Quantitative Data Collection Methods:
As quantitative data is in the form of numbers, mathematical and statistical analysis of these
numbers can lead to establishing some conclusive results.
There are two main Quantitative Data Collection Methods:
Surveys: Traditionally, surveys were conducted using paper-based methods and have
gradually evolved into online mediums. Closed-ended questions form a major part of these
surveys as they are more effective in collecting quantitative data. Survey makers include
answer options which they think are the most appropriate for a question. Surveys are integral
in collecting feedback from an audience which is larger than the conventional size. A critical
factor about surveys is that the responses collected should be such that they can be
generalized to the entire population without significant discrepancies. Based on the time
involved in completing surveys, they are classified into the following –
Longitudinal Studies: A type of observational research in which the market researcher
conducts surveys from a specific time period to another, i.e., over a considerable course of
time, is called longitudinal survey. This survey is often implemented for trend analysis or
studies where the primary objective is to collect and analyse a pattern in data.
Cross-sectional Studies: A type of observational research in which the market researcher
conducts surveys at a single time period across the target sample is known as a cross-sectional survey.
This survey type implements a questionnaire to understand a specific subject from the sample
at a definite time period.
Quantitative Data Analysis Methods:
Data collection forms a major part of the research process. However, this data has to be
analysed to make sense of it. There are multiple methods of analysing quantitative data collected
in surveys. They are:
Cross-tabulation: Cross-tabulation is one of the most widely used quantitative data analysis
methods. It is a preferred method since it uses a basic tabular form to draw inferences
between different datasets in the research study. It compares data that are mutually exclusive
or have some connection with each other.
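As a minimal sketch of cross-tabulation using the pandas library (the survey columns and values below are invented for illustration):

# Cross-tabulation sketch with pandas; the survey data are invented.
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "preference": ["online", "store", "online", "online", "store", "store"],
})

# Rows: gender; columns: shopping preference; cells: respondent counts.
table = pd.crosstab(df["gender"], df["preference"])
print(table)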
Trend analysis: Trend analysis is a statistical analysis method that provides the ability to look
at quantitative data that has been collected over a long period of time. This data analysis
method helps describe how the data changes over time and aims to understand the change in
one variable while another variable remains unchanged.
MaxDiff analysis: The MaxDiff analysis is a quantitative data analysis method that is used to
gauge customer preferences for a purchase and what parameters rank higher than the others
in this process. In a simplistic form, this method is also called the “best-worst” method. This
method is very similar to conjoint analysis but is much easier to implement and can be
interchangeably used.
Conjoint analysis: Like in the above method, conjoint analysis is a similar quantitative data
analysis method that analyzes parameters behind a purchasing decision. This method
possesses the ability to collect and analyze advanced metrics which provide an in-depth
insight into purchasing decisions as well as the parameters that rank the most important.
TURF analysis: TURF analysis or Total Unduplicated Reach and Frequency Analysis, is a
quantitative data analysis methodology that assesses the total market reach of a product or
service or a mix of both. This method is used by organizations to understand the frequency and
the avenues at which their messaging reaches customers and prospective customers which
helps them tweak their go-to-market strategies.
Gap analysis: Gap analysis uses a side-by-side matrix to depict quantitative data that helps
measure the difference between expected performance and actual performance. This data
analysis helps measure gaps in performance and the things that are required to be done to
bridge this gap.
SWOT analysis: SWOT analysis is a quantitative data analysis method that assigns numerical
values to indicate the strengths, weaknesses, opportunities and threats of an organization or
product or service which in turn provides a holistic picture about competition. This method
helps to create effective business strategies.
Text analysis: Text analysis is an advanced statistical method where intelligent tools make
sense of and quantify or fashion qualitative and open-ended data into easily understandable
data. This method is used when the raw survey data is unstructured but has to be brought into
a structure that makes sense.
Steps to conduct Quantitative Data Analysis:
For quantitative data, raw information has to be presented in a meaningful manner using
analysis methods. Quantitative data should be analyzed in order to find evidential data that
helps in the research process.
Relate measurement scales with variables: Associate measurement scales such as Nominal,
Ordinal, Interval and Ratio with the variables. This step is important to arrange the data in
proper order. Data can be entered into an Excel sheet to organize it in a specific format.
Connect descriptive statistics with data: Link descriptive statistics to encapsulate available
data. It can be difficult to establish a pattern in the raw data. Some widely used descriptive
statistics are:
Mean- An average of values for a specific variable
Median- A midpoint of the value scale for a variable
Mode- For a variable, the most common value
Frequency- Number of times a particular value is observed in the scale
Minimum and Maximum Values- Lowest and highest values for a scale
Percentages- Format to express scores and set of values for variables
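As a minimal sketch, the descriptive statistics listed above can be computed with Python's standard statistics module; the scores below are invented for illustration.

# Descriptive statistics for an invented set of scores.
import statistics

scores = [4, 7, 7, 9, 10, 7, 4, 8]

print("mean:", statistics.mean(scores))      # average of the values
print("median:", statistics.median(scores))  # midpoint of the ordered values
print("mode:", statistics.mode(scores))      # most common value
print("min:", min(scores), "max:", max(scores))
print("frequency of 7:", scores.count(7))    # times the value 7 is observed
print("percentage of 7:", 100 * scores.count(7) / len(scores))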
Decide a measurement scale: It is important to decide the measurement scale to conclude a
descriptive statistic for the variable. For instance, a nominal variable score will never have a
mean or median and so the descriptive statistics will correspondingly vary. Descriptive
statistics suffice in situations where the results are not to be generalized to the population.
Select appropriate tables to represent data and analyze collected data: After deciding on
a suitable measurement scale, researchers can use a tabular format to represent data. This
data can be analyzed using various techniques such as Cross-tabulation or TURF.
Advantages of Quantitative Data:
Some of the advantages of quantitative data are:
Conduct in-depth research: Since quantitative data can be statistically analyzed, it is highly
likely that the research will be detailed.
Minimum bias: There are instances in research, where personal bias is involved which leads
to incorrect results. Due to the numerical nature of quantitative data, the personal bias is
reduced to a great extent.
Accurate results: As the results obtained are objective in nature, they are extremely accurate.
Disadvantages of Quantitative Data:
Some of the disadvantages of quantitative data are:
Restricted information: Because quantitative data is not descriptive, it becomes difficult for
researchers to make decisions based solely on the collected information.
Depends on question types: Bias in results is dependent on the question types included to
collect quantitative data. The researcher’s knowledge of questions and the objective of
research are exceedingly important while collecting quantitative data.
SCALING TECHNIQUES:
Definition: Scaling technique is a method of placing respondents along a continuum of
gradually changing pre-assigned values, symbols or numbers, based on the features of a
particular object and as per defined rules. All the scaling techniques are based on four pillars,
i.e., order, description, distance and origin.
Types of Scaling Techniques:
The researchers have identified many scaling techniques; today, we will discuss some of the
most common scales used by business organizations, researchers, economists, experts, etc.
These techniques can be classified as primary scaling techniques and other scaling techniques.
Let us now study each of these methods in-depth below:
Primary Scaling Techniques
The major four scales used in statistics for market research consist of the following:
a. Nominal Scale:
Nominal scales are adopted for non-quantitative (containing no numerical implication)
labeling variables which are unique and different from one another.
Types of Nominal Scales:
Dichotomous: A nominal scale that has only two labels is called ‘dichotomous’; for example,
Yes/No.
Nominal with Order: A nominal scale whose labels are arranged in ascending or descending
order is termed ‘nominal with order’; for example, Excellent, Good, Average, Poor, Worst.
Nominal without Order: A nominal scale whose labels have no sequence is called ‘nominal
without order’; for example, Black, White.
b. Ordinal Scale:
The ordinal scale functions on the concept of the relative position of the objects or labels based
on the individual’s choice or preference.
For example, at Amazon.in, every product has a customer review section where the buyers
rate the listed product according to their buying experience, product features, quality, usage,
etc.
The ratings so provided are as follows:
5 Star – Excellent
4 Star – Good
3 Star – Average
2 Star – Poor
1 Star – Worst
c. Interval Scale:
An interval scale is also called a cardinal scale which is the numerical labelling with the same
difference among the consecutive measurement units. With the help of this scaling technique,
researchers can obtain a better comparison between the objects.
For example, A survey conducted by an automobile company to know the number of vehicles
owned by the people living in a particular area who can be its prospective customers in future.
It adopted the interval scaling technique for the purpose and provided the units as 1, 2, 3, 4, 5,
6 to select from.
In the scale mentioned above, every unit has the same difference, i.e., 1, whether it is between
2 and 3 or between 4 and 5.
d. Ratio Scale:
One of the most superior measurement techniques is the ratio scale. Similar to an interval
scale, a ratio scale is an abstract number system. It allows measurement at proper intervals,
order, categorization and distance, with an added property of originating from a fixed zero
point. Here, the comparison can be made in terms of the acquired ratio.
For example, A health product manufacturing company surveyed to identify the level of
obesity in a particular locality. It released the following survey questionnaire:
Select the category to which your weight belongs:
Less than 40 kilograms
40-59 Kilograms
60-79 Kilograms
80-99 Kilograms
100-119 Kilograms
120 Kilograms and more
Other Scaling Techniques:
Scaling of objects can be used for a comparative study between two or more objects
(products, services, brands, events, etc.), or can be carried out individually to understand the
consumer’s behaviour and response towards a particular object.
Following are the two categories under which other scaling techniques are placed based on
their comparability:
Comparative Scales:
For comparing two or more variables, a comparative scale is used by the respondents.
Following are the different types of comparative scaling techniques:
1. Paired Comparison:
A paired comparison symbolizes two variables from which the respondent needs to select one.
This technique is mainly used at the time of product testing, to facilitate the consumers with a
comparative analysis of the two major products in the market.
2. Rank Order:
In rank order scaling the respondent needs to rank or arrange the given objects according to
his or her preference.
3. Constant Sum:
It is a scaling technique where a continual sum of units like dollars, points, chits, chips, etc. is
given to the features, attributes and importance of a particular product or service by the
respondents.
4. Q-Sort Scaling:
Q-sort scaling is a technique used for sorting the most appropriate objects out of a large
number of given variables. It emphasizes ranking the given objects in descending order to
form similar piles based on specific attributes. It is suitable when the number of objects is
between 60 and 140, with 60 to 90 being the most appropriate range.
Non-Comparative Scales:
A non-comparative scale is used to analyse the performance of an individual product or object
on different parameters. Following are some of its most common types:
1. Continuous Rating Scales:
It is a graphical rating scale where the respondents are free to place the object at a position of
their choice. It is done by selecting and marking a point along the vertical or horizontal line
which ranges between two extreme criteria.
2. Itemized Rating Scale:
Itemized scale is another essential technique under the non-comparative scales. It requires
respondents to choose a particular category from among the various given categories. Each
category is briefly defined by the researchers to facilitate such selection.
The three most commonly used itemized rating scales are as follows:
Likert Scale: In the Likert scale, the researcher provides some statements and asks the
respondents to mark their level of agreement or disagreement with these statements by
selecting any one of the options from the five given alternatives.
For example, a shoe manufacturing company adopted the Likert scale technique for its new
sports shoe range named Z sports shoes. The purpose is to know the agreement or
disagreement of the respondents.
For this, the researcher asked the respondents to circle a number representing the most
suitable answer according to them, in the following representation:
1 – Strongly Disagree
2 – Disagree
3 – Neither Agree nor Disagree
4 – Agree
5 – Strongly Agree
Semantic Differential Scale: A bi-polar seven-point non-comparative rating scale where the
respondent can mark any of the seven points for each given attribute of the object, as per
personal choice, thus depicting the respondent’s attitude or perception towards the object.
RESEARCH METHODS:
Research methods are specific procedures for collecting and analysing data. Research methods
are the strategies, processes or techniques utilized in the collection of data or evidence for
analysis in order to uncover new information or create better understanding of a topic.
There are different types of research methods which use different tools for data collection.
1. Interview Research Method:
An interview is generally a qualitative research technique which involves asking open-ended
questions to converse with respondents and elicit data about a subject.
The interviewer in most cases is the subject matter expert who intends to understand
respondent opinions in a well-planned and executed series of questions and answers.
Types of Interviews:
(a) Structured Interview:
A structured interview is a quantitative research method where the interviewer uses a set of
prepared closed-ended questions in the form of an interview schedule, which he/she reads out
exactly as worded.
Interview schedules have a standardized format, which means the same questions are asked
to each interviewee in the same order.
The interviewer will not deviate from the interview schedule (except to clarify the meaning of
the question) or probe beyond the answers received.
A structured interview is also known as a formal interview (like a job interview).
Strengths
Structured interviews are easy to replicate as a fixed set of closed questions are used, which
are easy to quantify – this means it is easy to test for reliability.
Structured interviews are fairly quick to conduct which means that many interviews can take
place within a short amount of time. This means a large sample can be obtained resulting in the
findings being representative and having the ability to be generalized to a large population.
Limitations
Structured interviews are not flexible. This means new questions cannot be asked impromptu
(i.e. during the interview) as an interview schedule must be followed.
The answers from structured interviews lack detail as only closed questions are asked which
generates quantitative data. This means a researcher won't know why a person behaves in a
certain way.
(b) Unstructured Interview:
Unstructured interviews do not use any set questions, instead, the interviewer asks open-
ended questions based on a specific research topic and will try to let the interview flow like a
natural conversation.
The interviewer modifies his or her questions to suit the candidate's specific experiences.
Unstructured interviews are sometimes referred to as ‘discovery interviews’ and are more like
a ‘guided conversation’ than a strict structured interview.
They are sometimes called informal interviews.
Strengths
Unstructured interviews are more flexible as questions can be adapted and changed depending
on the respondents’ answers. The interview can deviate from the interview schedule.
Unstructured interviews generate qualitative data through the use of open questions. This
allows the respondent to talk in some depth, choosing their own words. This helps the
researcher develop a real sense of a person’s understanding of a situation.
They also have increased validity because it gives the interviewer the opportunity to probe for
a deeper understanding, ask for clarification & allow the interviewee to steer the direction of
the interview etc.
Limitations
It can be time-consuming to conduct an unstructured interview and analyze the qualitative
data (using methods such as thematic analysis).
Employing and training interviewers is expensive, and not as cheap as collecting data via
questionnaires. For example, certain skills may be needed by the interviewer. These include
the ability to establish rapport and knowing when to probe.
(c) Focus Group Interview:
Focus group interview is a qualitative approach where a group of respondents are interviewed
together, used to gain an in‐depth understanding of social issues.
The method aims to obtain data from a purposely selected group of individuals rather than
from a statistically representative sample of a broader population.
The role of the interview moderator is to make sure the group interact with each other and do
not drift off-topic.
Strengths
Group interviews generate qualitative narrative data through the use of open questions. This
allows the respondents to talk in some depth, choosing their own words. This helps the
researcher develop a real sense of a person’s understanding of a situation. Qualitative data also
includes observational data, such as body language and facial expressions.
They also have increased validity because some participants may feel more comfortable being
with others as they are used to talking in groups in real life (i.e. it's more natural).
Limitations
The researcher must ensure that they keep all the interviewees' details confidential and
respect their privacy. This is difficult when using a group interview. For example, the
researcher cannot guarantee that the other people in the group will keep information private.
Group interviews are less reliable as they use open questions and may deviate from the
interview schedule making them difficult to repeat.
Group interviews may sometimes lack validity as participants may lie to impress the other
group members. They may conform to peer pressure and give false answers.
Design of Interviews:
First, you must choose whether to use a structured or non-structured interview.
Next, you must consider who will be the interviewer, and this will depend on what type of
person is being interviewed. There are a number of variables to consider:
Gender and age: This can have a big effect on respondent's answers, particularly on personal
issues.
Personal characteristics: Some people are easier to get on with than others. Also, the accent
and appearance (e.g. clothing) of the interviewer can have an effect on the rapport between the
interviewer and interviewee.
Also, the language the interviewer uses should be appropriate to the vocabulary of the group of
people being studied. For example, the researcher must adjust the language of questions to
match the social background of respondents (age, educational level, social class, ethnicity,
etc.).
The interviewer must ensure that they take special care when interviewing vulnerable groups,
such as children. For example, children have a limited attention span and for this reason,
lengthy interviews should be avoided.
Ethnicity: People may have difficulty interviewing people from a different ethnic group.
2. Observational Method:
Observation (watching what people do) would seem to be an obvious method of carrying out
research in psychology. However, there are different types of observational methods and
distinctions need to be made between:
(a) Controlled Observations
(b) Naturalistic Observations
(c) Participant Observations
(a) Controlled Observation:
Controlled observations (usually a structured observation) are likely to be carried out in a
psychology laboratory.
The researcher decides where the observation will take place, at what time, with which
participants, in what circumstances and uses a standardized procedure.
Participants are randomly allocated to each independent variable group.
Rather than writing a detailed description of all behavior observed, it is often easier to code
behavior according to a previously agreed scale using a behavior schedule (i.e. conducting a
structured observation).
Coding might involve numbers or letters to describe a characteristic, or use of a scale to
measure behavior intensity.
Strengths
Controlled observations can be easily replicated by other researchers by using the same
observation schedule. This means it is easy to test for reliability.
The data obtained from structured observations is easier and quicker to analyse.
Controlled observations are quick to conduct which means that many observations can take
place within a short amount of time.
Limitations
Controlled observations can lack validity due to the Hawthorne effect/demand characteristics.
When participants know they are being watched they may act differently.
(b) Naturalistic Observation:
Naturalistic observation is a research method commonly used by psychologists and other
social scientists.
This technique involves studying the spontaneous behavior of participants in natural
surroundings.
The researcher simply records what they see in whatever way they can.
In unstructured observations, the researcher records all relevant behavior without a system.
Strengths
Naturalistic observation is often used to generate new ideas. Because it gives the researcher
the opportunity to study the total situation it often suggests avenues of inquiry not thought of
before.
Limitations
These observations are often conducted on a micro (small) scale and may lack a representative
sample (biased in relation to age, gender, social class or ethnicity). This may result in the
findings lacking the ability to be generalized to wider society.
Natural observations are less reliable as other variables cannot be controlled. This makes it
difficult for another researcher to repeat the study in exactly the same way.
(c) Participant Observation:
Participant observation is a variant of the above (natural observations) but here the
researcher joins in and becomes part of the group they are studying to get a deeper insight into
their lives.
If it were research on animals, we would now not only be studying them in their natural
habitat but be living alongside them as well!
Participant observations can be either covert or overt.
Covert is where the study is carried out 'undercover'. The researcher's real identity and
purpose are kept concealed from the group being studied. The researcher takes a false identity
and role, usually posing as a genuine member of the group.
On the other hand, overt is where the researcher reveals his or her true identity and purpose
to the group and asks permission to observe.
Limitations
It can be difficult to get time / privacy for recording. For example, with covert observations
researchers can’t take notes openly as this would blow their cover. This means they have to
wait until they are alone and rely on their memory. This is a problem as they may forget details
and are unlikely to remember direct quotations.
If the researcher becomes too involved, they may lose objectivity and become biased. There is
always the danger that we will “see” what we expect (or want) to see. This is a problem as they
could selectively report information instead of noting everything they observe. Thus, reducing
the validity of their data.
3. Questionnaire method:
A questionnaire is an instrument for research, which consists of a list of questions, along with
the choice of answers, printed or typed in a sequence on a form used for acquiring specific
information from the respondents.
In general, questionnaires are delivered to the persons concerned either by post or mail,
requesting them to answer the questions and return it.
Informants are expected to read and understand the questions and reply in the space provided
in the questionnaire itself.
The questionnaire is prepared in such a way that it translates the required information into a
series of questions, that informants can and will answer.
Characteristics of a Good Questionnaire:
The following are characteristics of good questionnaires:
It should consist of a well-written list of questions.
The questionnaire should deal with an important or significant topic to create interest among
respondents.
It should seek only that data which cannot be obtained from other sources.
It should be as short as possible but should be comprehensive.
It should be attractive.
Directions should be clear and complete.
It should be represented in good psychological order proceeding from general to more specific
responses.
Double negatives in questions should be avoided.
Putting two questions in one question should also be avoided. Every question should seek to
obtain only one specific piece of information.
It should be designed to collect information which can be used subsequently as data for
analysis.
Format of Questions in Questionnaires:
The questions asked can take two forms:
Restricted questions, also called closed-ended, ask the respondent to make choices — yes or
no, check items on a list, or select from multiple choice answers.Restricted questions are easy
to tabulate and compile.
Unrestricted questions are open-ended and allow respondents to share feelings and opinions
that are important to them about the matter at hand.
Unrestricted questions are not easy to tabulate and compile, but they allow respondents to
reveal the depth of their emotions.
If the objective is to compile data from all respondents, then sticking with restricted questions
that are easily quantified is better.
If degrees of emotions or depth of sentiment is to be studied, then develop a scale to quantify
those feelings.
Uses of Questionnaires:
Questionnaires are a common and inexpensive research tool used by private companies,
government departments, individuals, groups, NGOs, etc. to get feedback, conduct research,
and collect data from consumers, customers or the general public, depending on the need.
Questionnaires are the most important part of primary surveys.
Advantages of Questionnaire:
One of the greatest benefits of questionnaires lies in their uniformity — all respondents see the
same questions.
It is an inexpensive method, regardless of the size of the universe.
Free from the bias of the interviewer, as the respondents answer the questions in their own
words.
Respondents have enough time to think and answer.
Due to its large coverage, respondents living in distant areas can also be reached conveniently.
Limitations of Questionnaire:
The risk of collection of inaccurate and incomplete information is high in the questionnaire, as
it might happen that people may not be able to understand the question correctly.
The rate of non-response is high.
4. Case Study Method:
A case study is a detailed study of a specific subject, such as a person, group, place, event,
organization, or phenomenon.
Case studies are commonly used in social, educational, clinical, and business research.
A case study research design usually involves qualitative methods, but quantitative methods
are sometimes also used.
Case studies are good for describing, comparing, evaluating and understanding different
aspects of a research problem.
When to do a case study
A case study is an appropriate research design when you want to gain concrete, contextual, in-
depth knowledge about a specific real-world subject.
It allows you to explore the key characteristics, meanings, and implications of the case.
Case studies are often a good choice in a thesis or dissertation.
They keep your project focused and manageable when you don’t have the time or resources to
do large-scale research.
Steps in Case Study:
Step 1: Select a case
Once you have developed your problem statement and research questions, you should be
ready to choose the specific case that you want to focus on. A good case study should have the
potential to:
Provide new or unexpected insights into the subject
Challenge or complicate existing assumptions and theories
Propose practical courses of action to resolve a problem
Open up new directions for future research
Unlike quantitative or experimental research, a strong case study does not require a random or
representative sample. In fact, case studies often deliberately focus on unusual, neglected, or
outlying cases which may shed new light on the research problem.
Step 2: Build a theoretical framework
While case studies focus more on concrete details than general theories, they should usually
have some connection with theory in the field.
This way the case study is not just an isolated description but is integrated into existing
knowledge about the topic.
To ensure that your analysis of the case has a solid academic grounding, you should conduct a
literature review of sources related to the topic and develop a theoretical framework.
This means identifying key concepts and theories to guide your analysis and interpretation.
Step 3: Collect your data
There are many different research methods you can use to collect data on your subject.
Case studies tend to focus on qualitative data using methods such as interviews, observations,
and analysis of primary and secondary sources (e.g. newspaper articles, photographs, official
records).
Sometimes a case study will also collect quantitative data.
Step 4: Describe and analyze the case
In writing up the case study, you need to bring together all the relevant aspects to give as
complete a picture as possible of the subject.
How you report your findings depends on the type of research you are doing.
Some case studies are structured like a standard scientific paper or thesis, with separate
sections or chapters for the methods, results and discussion.
Others are written in a more narrative style, aiming to explore the case from various angles
and analyze its meanings and implications (for example, by using textual analysis or discourse
analysis).
Action Research:
Action research can be defined as “an approach in which the action researcher and a client
collaborate in the diagnosis of the problem and in the development of a solution based on the
diagnosis”.
In other words, one of the main characteristic traits of action research relates to collaboration
between the researcher and members of an organisation in order to solve organizational problems.
Action research assumes the social world to be constantly changing, with both the researcher
and the research being part of that change.
Generally, action research can be divided into three categories: positivist, interpretive and
critical.
Positivist approach to action research, also known as ‘classical action research’, perceives
research as a social experiment. Accordingly, action research is accepted as a method to test
hypotheses in a real-world environment.
Interpretive action research, also known as ‘contemporary action research’, perceives
business reality as socially constructed and focuses on specifications of local and
organizational factors when conducting the action research.
Critical action research is a specific type of action research that adopts critical approach
towards business processes and aims for improvements.
The following features of action research need to be considered when assessing its
suitability for any given study:
It is applied in order to improve specific practices. Action research is based on action,
evaluation and critical analysis of practices based on collected data in order to introduce
improvements in relevant practices.
This type of research is facilitated by the participation and collaboration of a number of
individuals with a common purpose.
Such research focuses on specific situations and their context.
Advantages of Action Research
High level of practical relevance of the business research.
Can be used with quantitative, as well as qualitative data.
Possibility to gain in-depth knowledge about the problem.
Disadvantages of Action Research
Difficulties in distinguishing between action and research and ensuring the application of both.
Delays in the completion of action research, due to a wide range of reasons, are not rare
occurrences.
Lack of repeatability
Documentary Research:
Documentary research is defined as the research conducted using official documents or
personal documents as the source of information.
Documents can include anything from the following:
Newspapers
Stamps
Diaries
Maps
Handbills
Directories
Paintings
Government statistical publications
Gramophone records
Photographs
Computer files
Tapes
The above may not fit the traditional bill of a “document” but since they contain information,
they can be used towards documentary research.
Social scientists often conduct documentary research. It is mainly conducted to assess various
documents in the interest of social or historical value.
Sometimes, researchers also conduct documentary research to study various documents
surrounding events or individuals.
Documentary research is similar to content analysis, which involves studying existing
information recorded in media, texts, and physical items.
Here, data collection from people is not required to conduct research. Hence, this is a prime
example of secondary research.
Advantages of documentary research method
Here are the advantages of the documentary research method:
Data readily available: Data is readily available in various sources. You only need to know
where to look and how to use it. The data is available in different forms, and harnessing it is
the real challenge.
Inexpensive and economical: The data for research is already collected and published in
either print or other forms. The researcher does not need to spend money and time like they
do to collect market research insights and gather data. They need to search for and compile the
available data from different sources.
Saves time: Conducting market research is time-consuming. Responses will not come in
quickly as expected, and gathering global responses will take a huge amount of time. If you
have all the reference documents available (or you know where to find them), research is
relatively quick.
Non-bias: Primary data collection tends to be biased. This bias depends on a lot of factors like
the age of the respondents, the time they take the survey, their mentality while taking the
survey, their gender, their feelings towards certain ideas, to name a few. The list goes on and
on when it comes to surveying bias.
Researcher not necessary during data collection: The researcher doesn’t need to be
present during data collection. It is practically impossible for the researcher to be present at
every point of the data source, especially thinking about the various data sources.
Useful for hypotheses: Historical data can be used to draw inferences about current or future
events. Conclusions can be drawn from the experience of past events and the data available for them.
Disadvantages of documentary research method
Here are the disadvantages of the documentary research method:
Limited data: Data is not always available, especially when you need to cross-verify a theory
or strengthen your argument based on different forms of data.
Inaccuracies: As the data is historical and published, there is almost no way of ascertaining if
the data is accurate or not.
Incomplete documents: Often, documents can be incomplete, and there is no way of knowing
if there are additional documents to refer to on the subject.
Data out of context: The data that the researcher refers to may be out of context and may not
be in line with the concept the researcher is trying to study. This is because the research goal
was not considered when the original data was created. Often, researchers have to make do
with the data available at hand.
UNIT-3
RESEARCH:
Research is a systematized effort to gain new knowledge.
What is data analysis in research?
Research data analysis is a process used by researchers to reduce data to a story and
interpret it to derive insights. The data analysis process helps reduce a large amount of data
into smaller fragments that make sense.
Types of Research Data:
Data may be grouped into four main types based on methods for collection:
Observational data
Experimental data
Simulation data
Derived /Compiled data
1) Observational Data:
Observational data are captured through observation of a behavior or activity.
It is collected using methods such as human observation, open-ended surveys, or the use of an
instrument or sensor to monitor and record information.
Because observational data are captured in real time, they would be very difficult or impossible
to recreate if lost.
2) Experimental Data:
Experimental data are collected through active intervention by the researcher to produce and
measure change.
It allows the researcher to determine a causal relationship and is typically projectable to a
larger population.
3) Simulation Data:
Simulation data are generated by imitating the operation of a real-world process or system
over time using computer test models.
This method is used to try to determine what would, or could, happen under certain
conditions.
4) Derived/Compiled Data:
It involves using existing data points, often from different data sources, to create new data
through some sort of transformation.
For example, combining area and population data from the twin cities to create population
density data.
Types of Primary data in research:
(i) Qualitative data:
When the data presented has words and descriptions, then we call it qualitative data.
Ex: anything describing taste, experience, or an opinion is considered qualitative data.
(ii) Quantitative Data:
Any data expressed in numbers or numerical figures is called quantitative data.
Ex: questions such as age, rank, cost, length, weight, score, etc.
This data can be presented in graphical format, charts etc.
(iii) Categorical Data:
It is the data presented in groups.
An item included in the categorical data cannot belong to more than one group at a time.
Ex: A person responding to a survey by telling his living style, marital status etc. comes under
the categorical data.
A chi-square test is a standard method used to analyze this data.
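As a minimal sketch of such a chi-square test, assuming the scipy library and an invented 2×2 contingency table of marital status against living style:

# Chi-square test of independence on categorical data; counts are invented.
from scipy.stats import chi2_contingency

# Rows: marital status (single, married); columns: living style (urban, rural).
observed = [[30, 10],
            [20, 40]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi-square:", chi2, "p-value:", p_value)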
FREQUENCY DISTRIBUTION:
Frequency distribution is a representation, either in a graphical or tabular format that displays
the number of observations within a given interval. The interval size depends on the data
being analyzed and the goals of the analyst. Frequency distributions are typically used within a
statistical context. Generally, it is associated with the charting of a normal distribution. As a
statistical tool, a frequency distribution provides a visual representation for the distribution of
observations within a particular test. Both histograms and bar charts provide a visual display
using columns, with the Y-axis representing the frequency count and the X-axis representing
the variable being measured.
Frequency distribution table:
Height Frequency
139 1
154 2
150 2
136 1
152 1
144 1
138 2
This frequency table will help us make better sense of the data given. Also when the data set is
too big we use tally marks for counting. It makes the task more organized and easy.
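As a minimal Python sketch, the frequency table above can be produced from raw data with collections.Counter; the list of heights below is reconstructed to match the table and is otherwise an invented example.

# Building a frequency table from raw heights (reconstructed from the table above).
from collections import Counter

heights = [139, 154, 154, 150, 150, 136, 152, 144, 138, 138]

freq = Counter(heights)
print("Height  Frequency")
for height in sorted(freq):
    print(f"{height:>6}  {freq[height]}")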
There are different types of frequency distributions.
Grouped frequency distribution
Ungrouped frequency distribution
Cumulative frequency distribution
Relative frequency distribution
Relative cumulative frequency distribution
To get a frequency distribution, we need to divide data into different classes of appropriate
size while indicating the number of observations in each class. Through frequency distribution,
it becomes a lot easier to summarize the data. That's why it's also defined as a process of
presenting the data in a summarized form. It's also called Frequency Table.
Uses of Frequency Distribution
It is quite useful for data analysis.
It assists in estimating the frequencies of the population on the basis of the sample.
It facilitates the computation of different statistical measures.
Frequency Distribution Table
Frequency distribution table (also known as frequency table) consists of various components.
Classes: A large number of observations varying in a wide range are usually classified in
several groups according to the size of their values. Each of these groups is defined by an
interval called class interval. The class interval between 10 and 20 is defined as 10-20.
Class limits: The smallest and largest possible values in each class of a frequency distribution
table are known as class limits. For the class 10-20, the class limits are 10 and 20. 10 is called
the lower class limit and 20 is called the upper class limit.
Class mark (mid value): The class mark is the midmost value of the class interval. It is also
known as the mid value. Mid value of each class = (lower limit + upper limit)/2.
Magnitude of a class interval: The difference between the upper and lower limit of a class is
called the magnitude of a class interval.
Class frequency: The number of observation falling within a class interval is called class
frequency of that class interval.
Types of frequency distribution:
Relative Frequency Distribution
It's a distribution in which relative frequencies are shown against each class interval. The
relative frequency of a class is obtained by dividing the frequency of that class by the total
frequency. Relative frequency is thus the proportion of the total frequency that falls in any
given class interval of the frequency distribution.
Cumulative Frequency Distribution
One of the important types of frequency distribution is Cumulative frequency distribution. In
cumulative frequency distribution, the frequencies are shown in the cumulative manner. The
cumulative frequency for each class interval is the frequency for that class interval added to
the preceding cumulative total. Cumulative frequency can also be defined as the sum of all
previous frequencies up to the current point.
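A minimal sketch of both types, assuming invented class intervals and frequencies: the relative frequency divides each class frequency by the total, and the cumulative frequency keeps a running total.

# Relative and cumulative frequencies; class intervals and counts are invented.
class_freqs = {"10-20": 5, "20-30": 8, "30-40": 4, "40-50": 3}

total = sum(class_freqs.values())
running = 0
print("Class   f   Relative   Cumulative")
for interval, f in class_freqs.items():
    running += f              # cumulative frequency so far
    relative = f / total      # proportion of the total frequency
    print(interval, f, round(relative, 2), running)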
Simple Frequency Distribution
Simple frequency distribution is used to organize larger data sets in an orderly fashion. When
there are several cases to be studied, it is a good idea to list them separately; otherwise there
will be a lengthy list to use. A simple frequency distribution shows the number of times each
score occurs in a set of data. To find the frequency for a score, count how many times the score
occurs.
Grouped Frequency Distribution
A grouped frequency distribution is an ordered listing of a variable X, grouped into class
intervals in one column, with a second column, called the frequency column, listing the
frequencies. In other words, a grouped frequency distribution is an arrangement of class
intervals and their corresponding frequencies in a table.
Ungrouped Frequency Distribution
A frequency distribution with an interval width of 1 is called ungrouped frequency
distribution. An ungrouped frequency distribution is an arrangement of the observed values in
ascending order; the data are not arranged in groups and are known as an individual series.
Mean of Frequency Distribution
Mean of frequency distribution can be found by multiplying each midpoint by its frequency,
and then dividing by the total number of values in the frequency distribution.
Mean = ∑(f × x) / n
where, x = mid-point of each class
f = frequency in each class
n = sum of the frequencies.
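A minimal sketch of this calculation, with invented class mid-points and frequencies:

# Mean of a frequency distribution: sum(f * x) / n, where n = sum(f).
midpoints = [15, 25, 35, 45]     # x: mid value of each class interval
frequencies = [5, 8, 4, 3]       # f: class frequencies

n = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n
print("mean:", mean)             # (75 + 200 + 140 + 135) / 20 = 27.5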
BAR CHARTS/ BAR DIAGRAM:
A bar graph (also known as a bar chart or bar diagram) is a visual tool that uses bars to
compare data among categories. A bar graph may run horizontally or vertically. The important
thing to know is that the longer the bar, the greater its value.
Bar graphs consist of two axes. On a vertical bar graph, the horizontal axis (or x-axis) shows
the data categories, for example, years. The vertical axis (or y-axis) is the scale, and the
colored bars are the data series.
Bar graphs have three key attributes:
A bar diagram makes it easy to compare sets of data between different groups at a glance.
The graph represents categories on one axis and a discrete value in the other. The goal is to
show the relationship between the two axes.
Bar charts can also show big changes in data over time.
Uses of Bar Graph
Bar graphs are an effective way to compare items between different groups.
For example, a bar graph can show a comparison of sales numbers on a quarterly basis over a
four-year period of time. Users of such a chart can compare the data by quarter on a
year-over-year trend, and also see how the annual sales are distributed throughout each year.
Bar graphs are an extremely effective visual to use in presentations and reports.
They are popular because they allow the reader to recognize patterns or trends far more easily
than looking at a table of numerical data.
Types of a Bar Graph
When presenting data visually, there are several different styles of bar graphs to consider.
Vertical Bar Graph:
The most common type of bar graph is the vertical bar graph. It is very useful when presenting
a series of data over time. One disadvantage of vertical bar graphs is that they don't leave much
room at the bottom of the chart if long labels are required.
Horizontal Bar Graph:
Converting the vertical data to a horizontal bar chart solves this problem, as there is plenty of
room for long labels along the vertical axis.
Stacked Bar Graph:
The stacked bar graph is a visual that can convey a lot of information.
HISTOGRAM:
A histogram is a graph of a grouped frequency distribution.
In a histogram, we plot the class intervals on the X-axis and their respective frequencies on the
Y-axis.
Further, we create a rectangle on each class interval with its height proportional to the
frequency density of the class.
Steps to draw a Histogram:
Step—1:
Represent the class intervals of the variables along the X axis and their frequencies along the
Y-axis on natural scale.
Step—2:
Start the X-axis with the lower limit of the lowest class interval. When the lower limit happens
to be a distant score from the origin, give a break in the X-axis to indicate that the vertical axis
has been moved in for convenience.
Step—3:
Now draw rectangular bars parallel to the Y-axis above each of the class intervals, with the
class intervals as bases. The areas of the rectangles must be proportional to the frequencies of
the corresponding classes.
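The steps above can be sketched in Python with the matplotlib library; the scores and class intervals here are invented for illustration.

# Drawing a histogram following the steps above; data are invented.
import matplotlib.pyplot as plt

scores = [12, 15, 17, 21, 22, 24, 25, 28, 31, 33, 35, 41, 44, 47]
class_limits = [10, 20, 30, 40, 50]   # class intervals along the X-axis

plt.hist(scores, bins=class_limits, edgecolor="black")
plt.xlabel("Class intervals")
plt.ylabel("Frequency")
plt.title("Histogram of scores")
plt.show()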
Advantages of histogram:
It is easy to draw and simple to understand.
It helps us to understand the distribution easily and quickly.
Limitations of histogram:
It is not possible to plot more than one distribution on the same axes as a histogram, so
comparing multiple frequency distributions on the same axes is not possible.
PARETO CHART:
Also called: Pareto diagram, Pareto analysis
Variations: weighted Pareto chart, comparative Pareto charts
A Pareto chart is a bar graph. The lengths of the bars represent frequency or cost (time or
money), and are arranged with longest bars on the left and the shortest to the right. In this way
the chart visually depicts which situations are more significant. This cause analysis tool is
considered one of the seven basic quality tools.
When to Use A Pareto Chart:
When analyzing data about the frequency of problems or causes in a process
When there are many problems or causes and you want to focus on the most significant
When analyzing broad causes by looking at their specific components
When communicating with others about your data
Pareto Chart Procedure
1. Decide what categories you will use to group items.
2. Decide what measurement is appropriate. Common measurements are frequency, quantity,
cost and time.
3. Decide what period of time the Pareto chart will cover: One work cycle? One full day? A week?
4. Collect the data, recording the category each time, or assemble data that already exist.
5. Subtotal the measurements for each category.
6. Determine the appropriate scale for the measurements you have collected. The maximum
value will be the largest subtotal from step 5. (If you will do optional steps 8 and 9 below, the
maximum value will be the sum of all subtotals from step 5.) Mark the scale on the left side of
the chart.
7. Construct and label bars for each category. Place the tallest at the far left, then the next tallest
to its right, and so on. If there are many categories with small measurements, they can be
grouped as “other.”
Note: Steps 8 and 9 are optional but are useful for analysis and communication.
8. Calculate the percentage for each category: the subtotal for that category divided by the total
for all categories. Draw a right vertical axis and label it with percentages. Be sure the two
scales match. For example, the left measurement that corresponds to one-half should be
exactly opposite 50% on the right scale.
9. Calculate and draw cumulative sums: add the subtotals for the first and second categories, and
place a dot above the second bar indicating that sum. To that sum add the subtotal for the third
category, and place a dot above the third bar for that new sum. Continue the process for all the
bars. Connect the dots, starting at the top of the first bar. The last dot should reach 100% on
the right scale.
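A minimal Python/matplotlib sketch of the procedure above, including the optional cumulative-percentage line of steps 8 and 9; the defect categories and counts are invented.

# Pareto chart: bars sorted tallest-first plus a cumulative percentage line.
import matplotlib.pyplot as plt

defects = {"Scratches": 42, "Dents": 27, "Misalignment": 13,
           "Discoloration": 8, "Other": 5}

# Step 7: place the tallest bar at the far left.
items = sorted(defects.items(), key=lambda kv: kv[1], reverse=True)
labels = [k for k, _ in items]
counts = [v for _, v in items]

# Steps 8-9: cumulative percentages for the right-hand scale.
total = sum(counts)
cumulative, running = [], 0
for c in counts:
    running += c
    cumulative.append(100 * running / total)

fig, ax = plt.subplots()
ax.bar(labels, counts)
ax.set_ylabel("Frequency")

ax2 = ax.twinx()                          # right vertical axis for percentages
ax2.plot(labels, cumulative, marker="o", color="red")
ax2.set_ylabel("Cumulative %")
ax2.set_ylim(0, 100)

plt.show()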
STATISTICAL TOOLS OF DATA ANALYSIS:
1. Mean:
The mean (or average) is the most popular and well-known measure of central tendency.
It can be used with both discrete and continuous data, although its use is most often with
continuous data.
The mean is equal to the sum of all the values in the data set divided by the number of values in
the data set.
Methods of Calculating Mean:
There are several methods for calculating mean. But here we shall discuss only two methods.
Direct method
Step deviation method.
1. Direct method:
In this method the mean is calculated directly from the given series. We can calculate the mean
from ungrouped data using the formula:
Mean = ∑X ÷ N
2. Step deviation method:
It is also known as assumed mean method because instead of calculating mean from the mid-
points we take assumed mean to find out the mean.
First, we ‘guess’ or assume a mean and then we apply a correction to this assumed value in
order to find the exact value.
The deviations are reduced by a common factor, indicated by ‘C’; a deviation reduced by this
factor is known as a step-deviation. The formula to find out the mean is as follows:
Mean = A + (∑fd’/∑f) × C
Note: In this method the step-deviation, denoted by d’, is used and not d:
d’ = (X - A)/C
Here, X = The value of the item, A = Assumed value of mean and
C = Common factor chosen
Below are discussed the steps to calculate mean:
Step—1:
Assume any one mid-point of the distribution as mean. But the best plan is to take mid-point of
an interval near the centre which has the largest frequency.
Step—2:
Find out the d’ column; d’ is the step-deviation of each score from the assumed mean, found
using the formula d’ = (X - A)/C.
Step—3:
Find out the fd’ column. It is found by multiplying the f column by the d’ column.
Step—4:
Find out ∑fd’. Add all the positive values and negative values separately. Then find out the
algebraic sum, which is ∑fd’.
Step—5:
Find out the mean by using the formula Mean = A + (∑fd’/∑f) × C.
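A minimal sketch of the step-deviation method, using the same invented mid-points and frequencies as the earlier grouped-mean example so the two methods can be compared:

# Step-deviation method: Mean = A + (sum(f*d')/sum(f)) * C, with d' = (X - A)/C.
midpoints = [15, 25, 35, 45]   # X: mid-points of the class intervals
frequencies = [5, 8, 4, 3]     # f: class frequencies

A = 25   # assumed mean: a mid-point near the centre with a large frequency
C = 10   # common factor (the class width)

sum_f = sum(frequencies)
sum_fd = sum(f * (x - A) / C for f, x in zip(frequencies, midpoints))

mean = A + (sum_fd / sum_f) * C
print("mean:", mean)           # 25 + (5/20)*10 = 27.5, matching the direct method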
Uses of Mean:
There are certain general rules for using mean. Some of these uses are as following:
Mean is more stable than the median and mode, so when the measure of central tendency with
the greatest stability is wanted, the mean is used.
Mean is used to calculate other statistics like S.D., coefficient of correlation, ANOVA, ANCOVA
etc.
Merits of Mean:
Mean is rigidly defined so that there is no question of misunderstanding about its meaning and
nature.
It is the most popular central tendency as it is easy to understand.
It is easy to calculate.
It is least affected by sampling fluctuations, so the result is reliable.
Demerits of Mean:
Mean is affected by extreme scores.
Sometimes mean is a value which is not present in the series.
Sometimes it gives absurd values. For example, there are 41, 44 and 42 students in classes VIII,
IX and X of a school, so the average number of students per class is 42.33, which is never
possible in reality.

2. MEDIAN:
Median, in statistics, is the middle value of the given list of data, when arranged in an order.
The arrangement of data or observations can be done either in ascending order or descending
order.
Example: The median of 2,3,4 is 3.
The median of a set of data is the middlemost number or center value in the set. The median is
also the number that is halfway into the set.
To find the median, the data should be arranged, first, in order of least to greatest or greatest
to the least value. A median is a number that separates the higher half of a data sample, a
population or a probability distribution from the lower half. The median is different for
different types of distribution.
For example, the median of 3, 3, 5, 9, 11 is 5. If there is an even number of observations, then
there is no single middle value; the median is then usually defined to be the mean of the two
middle values: so the median of 3, 5, 7, 9 is (5+7)/2 = 6.
Median Formula
The formula to calculate the median of the finite number of data set is given here. Median
formula is different for even and odd numbers of observations. Therefore, it is necessary to
recognize first if we have odd number of values or even number of values in a given data set.
The formula to calculate the median of the data set is given as follow.
Odd Number of Observations
If the total number of observations given is odd, then the formula to calculate the median is:
Median = {(n + 1)/2}th term
where n is the number of observations
Even Number of Observations
If the total number of observations is even, then the median formula is:
Median = [(n/2)th term + {(n/2) + 1}th term]/2
where n is the number of observations
How to Calculate the Median?
To find the median, place all the numbers in ascending order and find the middle.
Example 1:
Find the Median of 14, 63 and 55
Solution:
Put them in ascending order: 14, 55, 63
The middle number is 55, so the median is 55.
Example 2:
Find the median of the following:
4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29
Solution:
When we put those numbers in order we have:
4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92
There are fifteen numbers, so the middle one is the eighth number.
The median value of this set of numbers is 24.
Example 3:
Rahul’s family drove through 7 states on summer vacation. The prices of Gasoline differ from
state to state. Calculate the median of gasoline cost.
1.79, 1.61, 2.09, 1.84, 1.96, 2.11, 1.75
Solution:
By organizing the data from smallest to greatest, we get:
1.61, 1.75, 1.79, 1.84 , 1.96, 2.09, 2.11
Hence, the median of the gasoline cost is 1.84. There are three states with greater gasoline costs and three with smaller prices.
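In R, the built-in median() function reproduces these examples directly (it sorts the data internally):
> median(c(14, 63, 55))                                  # Example 1
[1] 55
> median(c(1.79, 1.61, 2.09, 1.84, 1.96, 2.11, 1.75))    # Example 3
[1] 1.84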
Merits of Median:
It is simple to understand and easy to calculate.
It is not affected by the extreme items in the series.
It can be determined graphically.
For open-ended classes, median can be calculated.
Demerits of Median:
It does not consider all observations, because it is a positional average.
The value of the median is affected more by sampling fluctuations.
It is not capable of further algebraic treatment.
Unlike the mean, a combined median for two groups cannot be calculated from the group medians.
It cannot be computed precisely when it lies between two items.
3. MODE:
The mode is the value that appears most frequently in a data set.
A set of data may have one mode, more than one mode, or no mode at all.
In statistics, the mode is the most commonly observed value in a set of data.
For the normal distribution, the mode is also the same value as the mean and median.
Examples of the Mode:
For example, in the following list of numbers, 16 is the mode since it appears more times in the
set than any other number:
3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48
A set of numbers can have more than one mode (this is known as bimodal if there are two
modes) if there are multiple numbers that occur with equal frequency, and more times than the
others in the set.
3, 3, 3, 9, 16, 16, 16, 27, 37, 48
In the above example, both the number 3 and the number 16 are modes as they each occur
three times and no other number occurs more often.
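R has no built-in function for the statistical mode, so a small helper (here called find_mode, a hypothetical name for this sketch) can be written. It returns all values that occur with the maximum frequency, so it also handles bimodal sets:
> find_mode=function(x) {
+   counts=table(x)                                  # frequency of each value
+   as.numeric(names(counts)[counts == max(counts)])
+ }
> find_mode(c(3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48))
[1] 16
> find_mode(c(3, 3, 3, 9, 16, 16, 16, 27, 37, 48))   # bimodal set
[1] 3 16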
Advantages:
The mode is easy to understand and calculate.
The mode is not affected by extreme values.
The mode is easy to identify in a data set and in a discrete frequency distribution.
The mode can be located graphically.
Disadvantages:
The mode is not defined when there are no repeats in a data set.
The mode is not based on all values.
Sometimes data have more than one mode, or no mode at all, which makes the mode ambiguous or undefined.
CORRELATION AND REGRESSION:
Correlation Analysis:
Correlation analysis is a statistical method used to discover whether there is a relationship between two variables/datasets, and how strong that relationship may be.
In terms of market research this means that correlation analysis is used to analyse quantitative data gathered from research methods such as surveys and polls, to identify whether there are any significant connections, patterns, or trends between the two. Essentially, correlation analysis is used for spotting patterns within datasets.
A positive correlation means that both variables increase in relation to each other, while a negative correlation means that as one variable decreases, the other increases.
The symbol of correlation is r, and the value of the correlation coefficient lies between −1 and +1.
Correlation Coefficients:
There are usually three different ways of ranking statistical correlation according to Spearman,
Kendall, and Pearson. Each coefficient will represent the end result as ‘r’. Spearman’s Rank and
Pearson’s Coefficient are the two most widely used analytical formulae depending on the types
of data researchers have to hand:
1. Spearman’s Rank Correlation Coefficient:
This coefficient is used to see if there is any significant relationship between the two datasets, and operates under the assumption that the data being used is ordinal, which here means that the numbers do not indicate quantity, but rather signify a position or place of the subject's standing (e.g., 1st, 2nd, 3rd, etc.).
2. Pearson Product-Moment Coefficient:
This is the most widely used correlation analysis formula. It measures the strength of the 'linear' relationship between the raw data from both variables, rather than their ranks. This is a dimensionless coefficient, meaning that there are no data-related boundaries to be considered when conducting analyses with this formula, which is a reason why this coefficient is the first formula researchers try.
Types of Correlation:
Correlation is used almost everywhere in statistics. Correlation illustrates the relationship between two or more variables. It is expressed in the form of a number that is known as the correlation coefficient. There are mainly three types of correlation:
1. Positive Correlation
2. Negative Correlation
3. Zero Correlation
Positive Correlation: The value of one variable increases linearly with an increase in the other variable. This indicates a similar relation between both variables, and the correlation coefficient is positive, reaching +1 for a perfect positive linear relation.
Negative Correlation: The value of one variable decreases as the value of the other variable increases. In that case, the correlation coefficient is negative.
Zero Correlation or No Correlation: There is no specific relation between the two variables, and the correlation coefficient is close to zero.
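These coefficients can be computed with R's cor() function; the two small vectors below are invented purely for illustration:
> x=c(1, 2, 3, 4, 5, 6)
> y=c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)
> cor(x, y, method = "pearson")    # near +1: strong positive linear relation
> cor(x, y, method = "spearman")   # exactly 1 here, since y always rises with x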
Regression Analysis:
Regression analysis is a set of statistical methods used for the estimation of relationships
between a dependent variable and one or more independent variables. It can be utilized to
assess the strength of the relationship between variables and for modeling the future
relationship between them.
Regression Analysis – Simple linear regression:
Simple linear regression is a model that assesses the relationship between a dependent
variable and an independent variable. The simple linear model is expressed using the following
equation:
Y = a + bX + ϵ
Where:
Y – Dependent variable
X – Independent (explanatory) variable
a – Intercept
b – Slope
ϵ – Residual (error)
Regression Analysis – Multiple linear regression:
Multiple linear regression analysis is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model. The mathematical representation of multiple linear regression is:
Y = a + bX1 + cX2 + dX3 + ϵ
Where:
Y – Dependent variable
X1, X2, X3 – Independent (explanatory) variables
a – Intercept
b, c, d – Slopes
ϵ – Residual (error)
Multiple linear regression follows the same conditions as the simple linear model. However, since there are several independent variables in multiple linear analysis, there is another mandatory condition for the model: the independent variables should not be too highly correlated with one another (that is, there should be no serious multicollinearity).
Regression Coefficients:
The regression coefficients are statistical measures used to measure the average functional relationship between variables.
In regression analysis, one variable is dependent and the other is independent. The coefficient measures the degree of dependence of one variable on the other(s).
The regression coefficient was first used to measure the relationship between the heights of fathers and their sons. Regression coefficients are also known as slope coefficients, since a coefficient determines the slope of the line, i.e., the change in the dependent variable per unit change in the independent variable.
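A minimal sketch of both models in R using lm(); the vectors x1, x2 and y are invented for the example:
> x1=c(1, 2, 3, 4, 5, 6, 7, 8)
> x2=c(2, 1, 4, 3, 6, 5, 8, 7)
> y=c(3.1, 4.9, 7.2, 8.8, 11.1, 13.2, 14.9, 17.1)
> fit1=lm(y ~ x1)        # simple linear regression: Y = a + bX + e
> coef(fit1)             # intercept a and slope b
> fit2=lm(y ~ x1 + x2)   # multiple linear regression: Y = a + bX1 + cX2 + e
> summary(fit2)          # coefficients, tests and R-squared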
Report Writing:
Report writing is a formal style of writing elaborately on a topic. The tone of a report is always
formal. For example, report writing about a school event, report writing about a business case
etc.
Formatting: To keep the report organized and easy to understand, there is a certain format to follow. This report writing format will make it easier for the reader to find what he is looking for.
The main sections of a standard report are as follows:
Title Page: Title page carries the name of the project. It should be clearly typed in capital
letters. It bears the name of the topic, name of the author, the purpose of the study, the name of
the institution and the date of presentation.
Abstract: The abstract consists of the major points, conclusion, and recommendations. It
needs to be short, as it is a general overview of the report.
Body: This is the main section of the report. This section includes technical terms or jargon from the industry. Each section is clearly labeled.
Introduction: The first page of the report needs to have an introduction. Here you will explain
the problem and inform the reader why the report is being made. It includes the background,
the purpose and the scope.
Methods: The purpose of the methods is to describe the materials and procedures used to
carry out the measurements. This section needs to provide a clear presentation of how key
measurements were obtained and how the measurements were analyzed.
Sample: A sample refers to a smaller, manageable version of a larger group. Samples are used in statistical testing when population sizes are too large for the test to include all possible members or observations.
Measures: Measures are the sources of the actual data for report writing. These can be
interviews, surveys etc.
Design: Generally the longest section of the report, it describes the details of the specific design you propose to build. Support the design with analysis, and with drawings and illustrations.
Results: The results section is dedicated to presenting the actual results (i.e. measured and
calculated quantities), not to discussing their meaning or interpretation.
Conclusion: The conclusion should summarize the central points made in discussion section.
Any conclusions should be based on observations and data already discussed.
References: This should contain complete citations following a standard form. The references should be numbered and listed in the order they were cited in the body of the report.
Tables: Tables should be well organized. A table should not include columns that have all
entries identical. Tables should be numbered consecutively, and above each table should be a
caption describing the table contents.
Figures: Figures are categorized as either graphs or drawings. Backgrounds should be white,
not shaded.
Appendices: This includes information that the experts in the field will read. It has all the
technical details that support your conclusions.
Presentation of Report:
How to turn a written report into a presentation:
Your objective: Start by being clear about your goals.
Your audience: Know your audience thoroughly. Check for anything that can affect how they
are likely to respond.
Your road map: List those points from your report which support your key messages.
Structure your talk: Plan the structure carefully when you are dealing with a lengthy report that will later become an oral presentation.
Create a strong opener: It is essential that you begin any presentation with confidence.
Keep those visuals lean and mean: Be on the alert to include only the most essential data in your visuals.
Some tips-
Be clear about the time allotted
Summarize clearly at the end of the presentation
Be prepared for the questions
Have handouts ready
Finally, you have to look into your non-verbal communication skills as well which includes-eye
contact, using your voice and gestures.
Methods:
There are various methods for presentation-
Oral presentation: One or more research students give a talk to a group and present views on
topic based on their research.
Poster session: The presentation of report or research information in the form of a paper
poster that conference participants may view.
Computer based presentation: It is the presentation by using the computer slides or
graphics which can create a dynamic presentation.
Written presentation: A written presentation tends to be to the point. It is very objective in nature and highly organized.
Multimedia presentation: A multimedia presentation is a presentation file that is not limited to text. For example, it may have interactive video, sound, links, images, animated GIFs and transitions in it.
Practicing Session on Assignment:
Assignment: A task or piece of work allocated to someone as part of a job or course of study.
Steps in writing an assignment:
Step-1- Plan: Planning your assignment will help you get focused and keep you on track.
Think about what you need to do to complete your assignment (for example- what research,
writing drafts, reference checking, reviewing and editing etc.). Break this into a list of tasks to
do.
Step-2- Analyze the question: Before you can answer a question you need to know what it
means. Read it slowly and carefully, and try to understand what is expected from you.
Ask yourself-
What is the question about?
What is the topic?
What does the question mean?
What do I have to do?
Step-3- Draft an Outline: Drafting an outline will give you a structure to follow when it comes to writing your assignment.
Step-4- Find Information: Before you start writing, you need to research your topic and find
relevant and reliable information. Once you have found information, the next step will be to
evaluate it to ensure that it is right for your assignment.
Step-5- Write: Once you have found the information you need to bring it together and write
your assignment.
Step-6- Edit: Once you have written your assignment, you can improve it by editing. Check the details, i.e.:
Check the grammar, punctuation, and spelling
Check your references
Are your pages numbered?
Have you included your name, student ID, the assignment details and the date on each page?
UNIT-4
Large sample tests of hypothesis: Tests for a Population Proportion:
Both the critical value approach and the p-value approach can be applied to test hypotheses about a population proportion p. The null hypothesis will have the form H0: p = p0 for some specific number p0 between 0 and 1. The alternative hypothesis will be one of the three inequalities p < p0, p > p0, or p ≠ p0, for the same number p0 that appears in the null hypothesis.
In the formulas below, p0 is the numerical value of p that appears in the two hypotheses, q0 = 1 − p0, p̂ is the sample proportion, and n is the sample size.
Remember that the condition that the sample be large is not that n be at least 30 but that the interval
[p̂ − 3√(p̂(1 − p̂)/n), p̂ + 3√(p̂(1 − p̂)/n)]
lie wholly within the interval [0, 1].
Standardized Test Statistic for Large Sample Hypothesis Tests Concerning a Single Population Proportion:
Z = (p̂ − p0) / √(p0 q0 / n)
The test statistic has the standard normal distribution. The form of the alternative hypothesis (left-tailed, right-tailed, or two-tailed) determines where the corresponding rejection region is placed.
EXAMPLE-A soft drink maker claims that a majority of adults prefer its leading beverage over
that of its main competitor’s. To test this claim 500 randomly selected people were given the
two beverages in random order to taste. Among them, 270 preferred the soft drink maker’s
brand, 211 preferred the competitor’s brand, and 19 could not make up their minds.
Determine whether there is sufficient evidence, at the 5% level of significance, to support the
soft drink maker’s claim against the default that the population is evenly split in its preference.
Solution:
We will use the critical value approach to perform the test; the same test could also be performed using the p-value approach.
We must check that the sample is sufficiently large to validly perform the test. Since p̂ = 270/500 = 0.54,
√(p̂(1 − p̂)/n) = √((0.54)(0.46)/500) ≈ 0.02
hence
[p̂ − 3√(p̂(1 − p̂)/n), p̂ + 3√(p̂(1 − p̂)/n)] = [0.54 − (3)(0.02), 0.54 + (3)(0.02)] = [0.48, 0.60] ⊂ [0, 1]
so the sample is sufficiently large.
Step 1. The relevant test is
H0: p = 0.50 vs. Ha: p > 0.50 @ α = 0.05
where p denotes the proportion of all adults who prefer the company's beverage over that of its competitor's beverage.
Step 2. The test statistic is
Z = (p̂ − p0) / √(p0 q0 / n)
and has the standard normal distribution.
Step 3. The value of the test statistic is
Z = (0.54 − 0.50) / √((0.50)(0.50)/500) = 1.789
Step 4. Since the symbol in Ha is ">", this is a right-tailed test, so there is a single critical value, zα = z0.05. From a table of critical values of z, its value is 1.645. The rejection region is [1.645, ∞).
Step 5. The test statistic falls in the rejection region. The decision is to reject H0. In the context of the problem our conclusion is:
The data provide sufficient evidence, at the 5% level of significance, to conclude that a majority of adults prefer the company's beverage to that of their competitor's.
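The arithmetic of this example can be checked in R; the sketch below uses only the numbers given above:
> p_hat=270/500; p0=0.50; n=500
> Z=(p_hat - p0)/sqrt(p0*(1 - p0)/n)
> Z                        # 1.789, as in Step 3
[1] 1.788854
> Z > qnorm(0.95)          # exceeds the critical value 1.645, so reject H0
[1] TRUE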
Test of difference between two population means:
Sampling Distribution of the Differences Between the Two Sample Means for Independent Samples
The point estimate for μ1 − μ2 is x̄1 − x̄2.
In order to find a confidence interval for μ1 − μ2 and perform a hypothesis test, we need to find the sampling distribution of x̄1 − x̄2.
We can show that when the sample sizes are large, or the samples from each population are normal and the samples are taken independently, then x̄1 − x̄2 is normal with mean μ1 − μ2 and standard deviation √(σ1²/n1 + σ2²/n2).
However, in most cases, σ1 and σ2 are unknown and they have to be estimated. It seems natural to estimate σ1 by s1 and σ2 by s2. When the sample sizes are small, the estimates may not be that accurate, and one may get a better estimate for the common standard deviation by pooling the data from both populations if the standard deviations for the two populations are not that different.
2-Sample t-Procedures: Pooled Variances Versus Non-Pooled Variances for Independent Samples
In view of this, there are two options for estimating the variances for the 2-sample t-test with
independent samples:
2-sample t-test using pooled variances
2-sample t-test using separate variances
When to use which? When we are reasonably sure that the two populations have nearly equal
variances, then we use the pooled variances test. Otherwise, we use the separate variances
test.
Using Pooled Variances to Do Inferences for Two-Population Means
When we have good reason to believe that the variance for population 1 is about the same as
that of population 2, we can estimate the common variance by pooling information from
samples from population 1 and population 2. An informal check for this is to compare the ratio
of the two sample standard deviations. If the two are equal this ratio would be 1. However,
since these are samples and therefore involve error, we cannot expect the ratio to be exactly 1.
When the sample sizes are nearly equal (admittedly "nearly equal" is somewhat ambiguous
so often if sample sizes are small one requires they be equal), then a good Rule of Thumb to
use is to see if this ratio falls from 0.5 to 2 (that is neither sample standard deviation is more
than twice the other). If this rule of thumb is satisfied we can assume the variances are equal.
Later in this lesson we will examine a more formal test for equality of variances.
Let n1 be the sample size from population 1, s1 be the sample standard deviation of population
1.
Let n2 be the sample size from population 2, s2 be the sample standard deviation of population
2.
Then the common standard deviation can be estimated by the pooled standard deviation:
sp = √[((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)]
The test statistic is:
t* = (x̄1 − x̄2) / (sp √(1/n1 + 1/n2))
with degrees of freedom equal to df = n1 + n2 − 2.
Example: Comparing Packing Machines
In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will
pack faster on the average than the machine currently used. To test that hypothesis, the times
it takes each machine to pack ten cartons are recorded. The results (machine.txt), in seconds,
are shown in the following table.
New machine: 42.1 41.3 42.4 43.2 41.8 41.0 41.8 42.8 42.3 42.7
Old machine: 42.7 43.8 42.5 43.1 44.0 43.6 43.3 43.5 41.7 44.1
x̄1 = 42.14, s1 = 0.683; x̄2 = 43.23, s2 = 0.750
Do the data provide sufficient evidence to conclude that, on the average, the new machine
packs faster? Perform the required hypothesis test at the 5% level of significance.
It is given that:
x̄1 = 42.14, s1 = 0.683
x̄2 = 43.23, s2 = 0.750
Assumption 1: Are these independent samples? Yes, since the samples from the two machines
are not related.
Assumption 2: Are these large samples or a normal population? We have n1 < 30 and n2 < 30. We do not have large enough samples, and thus we need to check the normality assumption for both populations.
From the normality plots of the data (figure omitted), we conclude that both populations may come from normal distributions.
Assumption 3: Do the populations have equal variance? Yes, since s1 and s2 are not that different. How do we conclude this? By using a rule of thumb where the ratio of the two sample standard deviations falls between 0.5 and 2. (They are not that different, as s1/s2 = 0.683/0.750 = 0.91 is quite close to 1. We will discuss this in more detail and quantify what is "close" later in this lesson.)
We can thus proceed with the pooled t-test.
Let μ1 denote the mean for the new machine and μ2 denote the mean for the old machine.
Step 1.
H0: μ1 − μ2 = 0
Ha: μ1 − μ2 < 0
Step 2. Significance level: α = 0.05.
Step 3. Compute the t-statistic:
sp = √[(9 · (0.683)² + 9 · (0.750)²) / (10 + 10 − 2)] = 0.717
t* = (x̄1 − x̄2 − 0) / (sp √(1/n1 + 1/n2)) = (42.14 − 43.23) / (0.717 · √(1/10 + 1/10)) = −3.40
Step 4. Critical value:
Left-tailed test
Critical value = −tα = −t0.05
Degrees of freedom = 10 + 10 − 2 = 18
−t0.05 = −1.734
Rejection region: t* < −1.734
Step 5. Check to see if the value of the test statistic falls in the rejection region and decide whether to reject H0.
t* = −3.40 < −1.734
Reject H0 at α = 0.05.
Step 6. State the conclusion in words.
At the 5% level of significance, the data provide sufficient evidence that the new machine packs faster than the old machine on average.
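The same pooled test can be reproduced in R with t.test(); the data are the machine times from the table above:
> new=c(42.1, 41.3, 42.4, 43.2, 41.8, 41.0, 41.8, 42.8, 42.3, 42.7)
> old=c(42.7, 43.8, 42.5, 43.1, 44.0, 43.6, 43.3, 43.5, 41.7, 44.1)
> t.test(new, old, alternative = "less", var.equal = TRUE)
> # t = -3.40 on 18 df, p < 0.05: reject H0, the new machine packs faster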
When one wants to estimate the difference between two population means from independent samples, one will use a t-interval. If the sample variances are not very different, one can use the pooled 2-sample t-interval.
Step 1. Find tα/2 with df = n1 + n2 − 2.
Step 2. The endpoints of the (1 − α)100% confidence interval for μ1 − μ2 are:
x̄1 − x̄2 ± tα/2 · sp · √(1/n1 + 1/n2)
where the degrees of freedom of t are n1 + n2 − 2.
What to do if some of the assumptions are not satisfied:
Assumption 1. What should we do if the assumption of independent samples is violated?
If the samples are not independent but paired, we can use the paired t-test.
Assumption 2. What should we do if the sample sizes are not large and the populations are
not normal?
We can use a nonparametric method to compare two samples such as the Mann-Whitney
procedure.
Assumption 3. What should we do if the assumption of equal variances is violated?
We can use the separate variances 2-sample t-test.
Using Separate (Unpooled) Variances to Do Inferences for Two-Population Means
We can perform the separate variances test using the following test statistic:
t* = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
with
df = (n1 − 1)(n2 − 1) / [(n2 − 1)C² + (1 − C)²(n1 − 1)]
(round down to the nearest integer)
where
C = (s1²/n1) / (s1²/n1 + s2²/n2)
NOTE: This calculation for the exact degrees of freedom is cumbersome and is typically done by software. An alternate, conservative option to the exact degrees-of-freedom calculation is to choose the smaller of n1 − 1 and n2 − 1.
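In R, t.test() performs this separate-variances (Welch) test by default, applying the degrees-of-freedom formula above (with fractional df rather than rounding down); var.equal = TRUE switches to the pooled test. Reusing the machine data as a sketch:
> x=c(42.1, 41.3, 42.4, 43.2, 41.8, 41.0, 41.8, 42.8, 42.3, 42.7)
> y=c(42.7, 43.8, 42.5, 43.1, 44.0, 43.6, 43.3, 43.5, 41.7, 44.1)
> t.test(x, y)                     # Welch: unpooled variances, fractional df
> t.test(x, y, var.equal = TRUE)   # pooled version for comparison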
Test of hypothesis for a binomial proportion:
Example 1: Suppose you have a die and suspect that it is biased towards the number three, and
so run an experiment in which you throw the die 10 times and count that the number
three comes up 4 times. Determine whether the die is biased.
Define x = the number of times the number three occurs in 10 trials. This random variable has
the binomial distribution where π is the population parameter corresponding to the
probability of success on any trial. We use the following null and alternative hypotheses:
H0: π ≤ 1/6; i.e. the die is not biased towards the number 3
H1: π > 1/6
Setting α = .05, we have
P(x ≥ 4) = 1–BINOM.DIST(3, 10, 1/6, TRUE) = 0.069728 > 0.05 = α.
and so we cannot reject the null hypothesis that the die is not biased towards the number 3
with 95% confidence.
Example 2: We suspect that a coin is biased towards heads. When we toss the coin 9 times,
how many heads need to come up before we are confident that the coin is biased towards
heads?
We use the following null and alternative hypotheses:
H0: π ≤ .5
H1: π > .5
Using a confidence level of 95% (i.e. α = .05), we calculate
BINOM.INV(n, p, 1–α) = BINOM.INV(9, .5, .95) = 7
which means that if 8 or more heads come up then we are 95% confident that the coin is
biased towards heads, and so can reject the null hypothesis.
We confirm this conclusion by noting that P(x ≥ 8) = 1–BINOM.DIST(7, 9, .5, TRUE) = 0.01953
< 0.05 = α, while P(x ≥ 7) = 1–BINOM.DIST(6, 9, .5, TRUE) = .08984 > .05.
Example 3: Historically a factory has been able to produce a very specialized nano-technology
component with 35% reliability, i.e. 35% of the components passed its quality assurance
requirements. They have now changed their manufacturing process and hope that this has
improved the reliability. To test this, they took a sample of 24 components produced using the
new process and found that 13 components passed the quality assurance test. Does this show a
significant improvement over the old process?
We use a one-tailed test with null and alternative hypotheses:
H0: p ≤ .35
H1: p > .35
p-value = 1–BINOM.DIST(12, 24, .35, TRUE) = .04225 < .05 = α
and so conclude with 95% confidence that the new process shows a significant improvement.
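The same exact binomial p-value can be obtained in R; pbinom() is the counterpart of Excel's BINOM.DIST, and binom.test() wraps the whole test in one call:
> 1 - pbinom(12, 24, 0.35)                               # p-value, approx .04225
> binom.test(13, 24, p = 0.35, alternative = "greater")  # same test in one call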
Test of hypothesis for the difference between binomial proportions:
Given a set of N1 observations in a variable X1 and a set of N2 observations in a variable X2, we
can compute a normal approximation test that the two proportions are equal (or alternatively,
that the difference of the two proportions is equal to 0). In the following, let p1 and p2 be the
population proportion of successes for samples one and two, respectively.
The hypothesis test that the two binomial proportions are equal is:
H0: p1 = p2
Ha: p1 ≠ p2
Test Statistic:
Z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2))
where p̂ is the proportion of successes for the combined sample:
p̂ = (n1p̂1 + n2p̂2)/(n1 + n2) = (X1 + X2)/(n1 + n2)
Significance Level: α
Critical Region:
For a two-tailed test: Z > Φ⁻¹(1 − α/2) or Z < Φ⁻¹(α/2)
For a lower-tailed test: Z < Φ⁻¹(α)
For an upper-tailed test: Z > Φ⁻¹(1 − α)
Conclusion: Reject the null hypothesis if Z is in the critical region.
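A short R sketch of this z-test; the success counts below are invented for illustration:
> x1=110; n1=200; x2=85; n2=200    # hypothetical successes and sample sizes
> p1=x1/n1; p2=x2/n2
> p_pool=(x1 + x2)/(n1 + n2)       # combined-sample proportion
> Z=(p1 - p2)/sqrt(p_pool*(1 - p_pool)*(1/n1 + 1/n2))
> Z                                # approx 2.50: beyond 1.96, reject H0 at alpha = .05
> prop.test(c(x1, x2), c(n1, n2), correct = FALSE)   # equivalent chi-square test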
Inference from small samples: Student's t distribution:
The fact that as the number of observations increase, the distribution values will often tend
towards a normal frequency distribution allows statisticians to make important inferences
about statistical properties. However, in practice, due to experimental costs or the availability
of subject populations, scientists often are constrained to use small sample sizes.
The t distribution is very useful in these cases.
The rationale for the t distribution:
For small samples, where the sample estimates are less representative of the
population, t provides an alternative to the normal curve that is more conservative.
The t distribution is sometimes named Student's t distribution after the pseudonym of the statistician William Sealy Gosset, who developed it while working at the Guinness Brewery.
The t distribution describes the frequency distribution of small numbers of samples
chosen from a normal distribution. Like the normal distribution, the function that defines its
probability density is complex and understanding it is not required to make use of
the t distribution.
Defining the t statistic
The t distribution looks a lot like the normal distribution, but its tails do not approach zero as quickly:
Figure 1: the t distribution with 1 degree of freedom (figure omitted)
Note that the tails of the t distribution in Figure 1, plotted in black, have higher probabilities
than the corresponding points on a normal distribution with the same mean and SD = 1,
plotted in grey. Since the probability function for this t distribution has a higher Y value for
values of X more distant from the mean, the interpretation is that these values are more likely
to occur. Therefore, the t distribution is more conservative.
The formula describing the t distribution is rather complicated. In the past to derive a specific
value for the distribution, given a value for t and the number of degrees of freedom (defined
below), one would have used a table of pre-computed values. Nowadays we can use a
statistical package such as R to do this.
Degrees of Freedom: how much information?
The purpose of using a t distribution in this kind of problem is to be more conservative, since
our sample size is small. However, if we have more observations in our sample, we can be less
conservative because having more independent sample observations improves our estimate of
the population value.
The “degrees of freedom” is a way to quantify how much independent information we have in
our estimate. A higher value for degrees of freedom means our samples contain more
independent data. In the context of the t distribution, that means we can be less conservative.
As the number of degrees of freedom increases, the t distribution shape changes and becomes
more and more like a normal distribution.
A sample of N observations has N − 1 degrees of freedom. The rationale is as follows:
Recall that for n samples, the sample mean is defined as
x̄ = (x1 + ... + xn)/n
and the sample variance is defined as
s² = Σ (xi − x̄)² / (n − 1), with the sum running from i = 1 to n.
To calculate the sample variance, we need to know the sample mean. If we have a small sample of five observations, how many independent observations go into calculating the variance? Since, given the mean and four values, we can calculate what the fifth value must be, there are only four independent values in five observations.
Calculating Confidence Intervals using the t distribution
The method of calculating a confidence interval (CI) using the t distribution is similar to that used when calculating a CI using the normal distribution. However, the bounds of the t distribution will be different from those of the normal distribution.
calculate sample mean, standard error (SEM)
determine degrees of freedom (DF) (N-1)
use R to calculate the number of standard deviations from the mean that contain 95% of a t
distribution with the DF
Example: confidence interval of the mean estimate:
>obs=c(2,5,7,8,3)
>N=length(obs)
>SEM=sd(obs)/sqrt(N)
>degrees_of_freedom=N-1
>CI_bound=qt((0.05/2), df=degrees_of_freedom)
>lower_bound=round(mean(obs)-(abs(CI_bound)*SEM), 1)
>upper_bound=round(mean(obs)+(abs(CI_bound)*SEM), 1)
>print(paste("mean", mean(obs), "95% CI", lower_bound, "to", upper_bound,
+"on", degrees_of_freedom, "DF"))
[1] "mean 5 95% CI 1.8 to 8.2 on 4 DF"
So given these five observations with mean 5, the 95% CI ranges from 1.8 to 8.2.
Comparison to a CI from the normal distribution
If we perform the same calculations using the bounds set by a normal distribution, the
confidence interval will be smaller (less conservative). This is not the best choice here, as the
small sample size indicates we should use the t distribution instead.
>obs=c(2,5,7,8,3)
>N=length(obs)
>SEM=sd(obs)/sqrt(N)
>lower_bound_nor=round(mean(obs)-(1.96*SEM), 1)
>upper_bound_nor=round(mean(obs)+(1.96*SEM), 1)
>print(paste("mean", mean(obs), "95% CI", lower_bound_nor, "to", upper_bound_nor))
[1] "mean 5 95% CI 2.8 to 7.2"
Small sample inference concerning a population mean and the difference between two population means:
Comparing two population means - small independent samples
If the sample size is small (n < 30) and the sample distribution is normal or approximately normal, then the Student's t distribution and associated statistics can be used to test whether the sample mean equals a hypothesized population mean.
Comparing the sample means of two independent samples with small sample sizes is similar to comparing a sample mean against a population mean: the t-statistic or Student's t distribution is used to evaluate the tests. The only difference is the values of the parameters used in determining the statistics.
Hypothesis testing involving two different means studies the distribution of their difference, x̄1 − x̄2.
1. Know the basic general statistics used for comparing two population means when the sample sizes are small or σ is unknown.
If we have two populations or sample distributions, the following basic statistics can be obtained from each:
Population 1: sample size n1, sample mean x̄1, sample standard deviation s1, population mean μ1
Population 2: sample size n2, sample mean x̄2, sample standard deviation s2, population mean μ2
Small sample size studies use the Student t statistic, and large sample size studies use the standard normal z-score statistic.
For large sample sizes (n1 ≥ 30 and n2 ≥ 30), the test statistic in a hypothesis test is the z-score:
z = [(x̄1 − x̄2) − (μ1 − μ2)] / √(s1²/n1 + s2²/n2)
For small sample sizes, the test statistic is the Student's t, with df = n − 1 (smallest sample size):
t = [(x̄1 − x̄2) − (μ1 − μ2)] / √(s1²/n1 + s2²/n2)
When the two population standard deviations can be assumed equal (σ1 = σ2), a pooled sample standard deviation is used:
sp = √[((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)]
and the pooled test statistic is
t = [(x̄1 − x̄2) − (μ1 − μ2)] / (sp √(1/n1 + 1/n2)), with df = n1 + n2 − 2.
The confidence interval for the difference is
(x̄1 − x̄2) ± t · √(s1²/n1 + s2²/n2), or (x̄1 − x̄2) ± t · sp √(1/n1 + 1/n2) when pooled.
2. Know how to use appropriate statistics to test if two sample means are equal or if their difference = 0 (small sample size).
3 types of tests in comparing two sample means:
When comparing the sample means, there are 3 questions to consider:
Question 1: Is μ1 = μ2? Ha: μ1 ≠ μ2 (Two-tailed test)
Question 2: Is μ1 ≤ μ2? Ha: μ1 > μ2 (Right-tailed test)
Question 3: Is μ1 ≥ μ2? Ha: μ1 < μ2 (Left-tailed test)
The problem below illustrates the two-tailed test.
Problem 1. Two types of cars are compared for acceleration rate. The test runs are recorded for each car and the results for the mean elapsed time are recorded below:
              Sample mean   Sample standard deviation   Sample size
Car A (x1)       8.5                 1.8                    20
Car B (x2)       7.2                 2.1                    30
Construct a 98% CI for the difference in the mean elapsed time for the two types of cars. Using
this CI, determine if there is a difference in the mean elapsed times?
Given: difference x̄1 − x̄2 = 8.5 − 7.2 = 1.3, and at least one of the sample sizes is < 30 (small, so we must use the Student's t distribution or t-statistics).
Step 1 - Hypothesis: The claim is that μ1 = μ2, the null hypothesis. The alternate hypothesis is that μ1 ≠ μ2.
H0: μ1 − μ2 = 0
Ha: μ1 − μ2 ≠ 0
Step 2. Select level of significance: This is given as α = 0.02 (2% = 100% − 98%).
For a two-tailed test, the area in each tail is α/2 = 0.01.
Step 3. Test statistic and observed value:
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2) = 1.3 / √(1.8²/20 + 2.1²/30) = 1.3 / 0.5559 = 2.3386
Step 4. Determine the critical region (favors Ha):
For α = 0.02, with df = 19 (smallest sample size minus 1), the critical values at the 0.01 and 0.99 points are tα = 2.5395 and −tα = −2.5395.
Step 5. Make decision.
Do not reject the null hypothesis if −2.5395 ≤ t ≤ 2.5395.
The observed t = 2.3386, and since 2.3386 < 2.5395, it is not in the critical region, so we have no reason to reject H0 in favor of Ha.
Note also that the 98% confidence interval for the difference, 1.3 ± 2.5395 × 0.5559 = (−0.1117, 2.7117), contains 0, the region of null hypothesis acceptance.
Therefore the difference between the two means is not significantly different from 0.
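The computations in this problem can be checked with a few lines of R using only the summary statistics above:
> x1_bar=8.5; s1=1.8; n1=20        # Car A
> x2_bar=7.2; s2=2.1; n2=30        # Car B
> se=sqrt(s1^2/n1 + s2^2/n2)
> t_obs=(x1_bar - x2_bar)/se       # approx 2.3386
> t_crit=qt(0.99, df = 19)         # approx 2.5395 for a 98% two-tailed interval
> (x1_bar - x2_bar) + c(-1, 1)*t_crit*se   # approx (-0.1117, 2.7117), contains 0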
UNIT-5
POINT ESTIMATION & NON-PARAMETRIC TESTS
The central limit theorem (CLT) is one of the most important results in probability theory. It states that, under certain conditions, the sum of a large number of random variables is approximately normal. Here, we state a version of the CLT that applies to i.i.d. (independent and identically distributed) random variables. Suppose that X1, X2, ..., Xn are i.i.d. random variables with expected values E[Xi] = μ < ∞ and variance Var(Xi) = σ² < ∞. Then the sample mean X̄ = (X1 + X2 + ... + Xn)/n has mean E[X̄] = μ and standard deviation σ/√n, and the normalized random variable Z = (X̄ − μ)/(σ/√n) is approximately standard normal for large n.
Sampling distribution of sample mean and proportion:
When a sample statistic is a single value that estimates a population parameter, we refer to the statistic as a point estimate.
Before we begin, we will introduce a brief explanation of notation and some new terms that we
will use this lesson and in future lessons.
Notation:
Sample mean: the book uses y-bar (ȳ); most other sources use x-bar (x̄)
Population mean: standard notation is the Greek letter μ
Sample proportion: the book uses π-hat (π̂); other sources use p-hat (p̂)
Population proportion: the book uses π; other sources use p
[NOTE: Remember that π here is NOT to be interpreted as the numeric constant 3.14... but is simply a symbol for the population proportion.]
Terms
Standard error – the standard deviation of a sample statistic
Standard deviation – the spread of the individual observations in a sample
Parameters, e.g., mean and SD, are summary measures of population, e.g., μ and σ. These are
fixed.
Statistics, e.g., sample mean and sample SD, are summary measures of a sample, e.g., ¯x and s.
These vary. Think about taking a sample and the sample is not always the same therefore the
statistics change. This is the motivation behind this lesson - due to this sampling variation the
sample statistics themselves have a distribution that can be described by some measure of
central tendency and spread.
Sampling error is the error resulting from using a sample characteristic to estimate a
population characteristic.
Sample size and sampling error: the possible sample means cluster more closely around the population mean as the sample size increases. Thus, possible sampling error decreases as sample size increases.
The mean of the sample mean is the population mean. That is: μ_x̄ = μ.
When sampling with replacement, the standard deviation of the sample mean, called the standard error, equals the population standard deviation divided by the square root of the sample size. That is: σ_x̄ = σ/√n.
Sampling Distribution of the Mean When the Population is Normal:
Key Fact: If the population is normally distributed with mean μ and standard deviation σ, then the sampling distribution of the sample mean is also normally distributed, no matter what the sample size is. When the sampling is done with replacement, or if the population size is large compared to the sample size, it follows from the above two formulas that x̄ has mean μ and standard error σ/√n.
SPECIAL NOTE: In the rest of this course, we only deal with the case when the sampling is done with replacement or the population size is much larger than the sample size.
Application of the Sample Mean Distribution: When we know the sample mean is normal or approximately normal, and we know the population mean, μ, and population standard deviation, σ, then we can calculate a z-score for the sample mean and determine probabilities for it, where:
Z = (x̄ − μ)/(σ/√n)
Large Sample Estimation:
The Central Limit Theorem says that, for large samples (samples of size n ≥ 30), when viewed as a random variable the sample mean x̄ is normally distributed with mean μ_x̄ = μ and standard deviation σ_x̄ = σ/√n. The Empirical Rule says that we must go about two standard deviations from the mean to capture 95% of the values of x̄ generated by sample after sample. A more precise distance based on the normality of x̄ is 1.960 standard deviations, which is E = 1.960σ/√n.
For 100(1 − α)% confidence the area in each tail is α/2. For 95% confidence the area in each tail is α/2 = 0.025.
The level of confidence can be any number between 0 and 100%, but the most common values are probably 90% (α = 0.10), 95% (α = 0.05), and 99% (α = 0.01).
Thus, in general for a 100(1 − α)% confidence interval, E = zα/2(σ/√n), so the formula for the confidence interval is x̄ ± zα/2(σ/√n). While sometimes the population standard deviation σ is known, typically it is not. If not, for n ≥ 30 it is generally safe to approximate σ by the sample standard deviation s.
Large Sample 100(1 − α)% Confidence Interval for a Population Mean:
If σ is known: x̄ ± zα/2 · σ/√n
If σ is unknown: x̄ ± zα/2 · s/√n
A sample is considered large when n ≥ 30.
As mentioned earlier, the number E = zα/2 · σ/√n, or E = zα/2 · s/√n, is called the margin of error of the estimate.
EXAMPLE 1
Find the number zα∕2needed in construction of a confidence interval:
a. when the level of confidence is 90%.
b. when the level of confidence is 99%.
Solution:
a. For confidence level 90%, α = 1 − 0.90 = 0.10, so zα/2 = z0.05. Since the area under the standard normal curve to the right of z0.05 is 0.05, the area to the left of z0.05 is 0.95. We search for the area 0.9500. The closest entries in the table are 0.9495 and 0.9505, corresponding to z-values 1.64 and 1.65. Since 0.95 is exactly halfway between 0.9495 and 0.9505, we use the average 1.645 of the z-values for z0.05.
b. For confidence level 99%, α = 1 − 0.99 = 0.01, so zα/2 = z0.005. Since the area under the standard normal curve to the right of z0.005 is 0.005, the area to the left of z0.005 is 0.9950. We search for the area 0.9950. The closest entries in the table are 0.9949 and 0.9951, corresponding to z-values 2.57 and 2.58. Since 0.995 is halfway between 0.9949 and 0.9951, we use the average 2.575 of the z-values for z0.005.
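The table search can be replaced by R's qnorm(), which inverts the standard normal CDF:
> qnorm(0.95)     # z for 90% confidence (alpha/2 = 0.05)
[1] 1.644854
> qnorm(0.975)    # z for 95% confidence
[1] 1.959964
> qnorm(0.995)    # z for 99% confidence
[1] 2.575829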
Point Estimation:
Suppose we have an unknown population parameter, such as a population mean μ or a
population proportion p, which we'd like to estimate. For example, suppose we are interested
in estimating:
p = the (unknown) proportion of American college students, 18-24, who have a smart phone
μ = the (unknown) mean number of days it takes Alzheimer's patients to achieve certain
milestones
In either case, we can't possibly survey the entire population. That is, we can't survey all
American college students between the ages of 18 and 24. Nor can we survey all patients with
Alzheimer's disease. So, of course, we do what comes naturally and take a random sample from
the population and use the resulting data to estimate the value of the population parameter. Of
course, we want the estimate to be "good" in some way.
In this lesson, we'll learn two methods, namely the method of maximum likelihood and the
method of moments, for deriving formulas for "good" point estimates for population
parameters. We'll also learn one way of assessing whether a point estimate is "good." We'll do
that by defining what it means for an estimate to be unbiased.
Objectives
To learn how to find a maximum likelihood estimator of a population parameter.
To learn how to find a method of moments estimator of a population parameter.
To learn how to check to see if an estimator is unbiased for a particular parameter.
To understand the steps involved in each of the proofs in the lesson.
To be able to apply the methods learned in the lesson to new problems.
We'll start the lesson with some formal definitions. In doing so, recall that we denote the n random variables arising from a random sample as subscripted uppercase letters:
X1, X2, ..., Xn
The corresponding observed values of a specific random sample are then denoted as
subscripted lowercase letters:
x1, x2, ..., xn
Definition: The range of possible values of the parameter θ is called the parameter space Ω
(the greek letter "omega").
For example, if μ denotes the mean grade point average of all college students, then the
parameter space (assuming a 4-point grading scale) is:
Ω = {μ: 0 ≤ μ ≤ 4}
And, if p denotes the proportion of students who smoke cigarettes, then the parameter space
is:
Ω = {p: 0 ≤ p ≤ 1}
Definition. The function of X1, X2, ..., Xn, that is, the statistic u (X1, X2, ..., Xn), used to
estimate θ is called a point estimator of θ.
Definition. The function u (x1, x2, ..., xn) computed from a set of data is an observed
point estimate of θ.
In simple terms, any statistic can be a point estimate. A statistic is an estimator of some
parameter in a population. For example:
• The sample standard deviation (s) is a point estimate of the population standard deviation (σ).
• The sample mean (x̄) is a point estimate of the population mean (μ).
• The sample variance (s²) is a point estimate of the population variance (σ²).
In more formal terms, the estimate occurs as a result of point estimation applied to a set of
sample data. Points are single values, in comparison to interval estimates, which are a range of
values. For example, a confidence interval is one example of an interval estimate.
Finding the Estimates
Four of the most common ways to find an estimate:
• The Method of Moments is based on the law of large numbers and uses relatively simple equations to find point estimates. It is often not very accurate and has a tendency to be biased.
• Maximum Likelihood uses a model (for example, the normal distribution) and the values in the model to maximize a likelihood function. This results in the most likely parameter values for the inputs selected.
• Bayes Estimators minimize the average risk (an expectation over random variables).
• Best Unbiased Estimators: several unbiased estimators can be used to approximate a parameter. Which one is "best" depends on what parameter you are trying to find. For example, with variance, the estimator with the smallest variance is "best".
Interval Estimation:
Interval estimation, in statistics, is the evaluation of a parameter (for example, the mean) of a population by computing an interval, or range of values, within which the parameter is most likely to be located. Intervals are commonly chosen such that the parameter falls within them with a 95 or 99 percent probability, called the confidence coefficient. Hence, the intervals are called confidence intervals; the end points of such an interval are called the upper and lower confidence limits.
The interval containing a population parameter is established by calculating that statistic from
values measured on a random sample taken from the population and by applying the
knowledge (derived from probability theory) of the fidelity with which the properties of a
sample represent those of the entire population.
The probability tells what percentage of the time the assignment of the interval will be correct
but not what the chances are that it is true for any given sample. Of the intervals computed
from many samples, a certain percentage will contain the true value of the parameter being
sought.
Here we consider the joint estimation of a multivariate set of population means. That is, we
have observed a set of p X-variables and may wish to estimate the population mean for each
variable. In some instances, we may also want to estimate one or more linear combinations of
population means. Our basic tool for estimating the unknown value of a population parameter
is a confidence interval, an interval of values that is likely to include the unknown value of the
parameter.
The general format of a confidence interval estimate of a population mean is:
Sample mean ± Multiplier × Standard error of mean
In this formula, x̄j is the sample mean of the j-th variable, sj is its sample standard deviation and n is the sample size. The multiplier value is a function of the confidence level, the sample size, and the strategy used for dealing with the multiple inference issue.
Confidence Interval of Population Means:
Estimating the mean:
Estimating the mean of a normally distributed population entails drawing a sample of size n and computing x̄, which is used as a point estimate of μ.
It is more meaningful, however, to estimate μ by an interval that communicates information regarding the probable magnitude of μ.
Sample distributions and estimation:
Interval estimates are based on sampling distributions. When the sample mean is being used as an estimator of a population mean, and the population is normally distributed, the sample mean will be normally distributed with mean μ_x̄ equal to the population mean, and variance σ²/n.
The 95% confidence interval:
Approximately 95% of the values of x̄ making up the distribution will lie within 2 standard deviations of the mean. The interval is bounded by the two points μ − 2σ_x̄ and μ + 2σ_x̄, so that 95% of the values lie in the interval μ ± 2σ_x̄.
Since μ and σ_x̄ are unknown, the location of the distribution is uncertain. We can use x̄ as a point estimate of μ. In constructing intervals x̄ ± 2σ_x̄, 95% of these intervals would contain μ.
Example
Suppose a researcher, interested in obtaining an estimate of the average level of some enzyme in a certain human population, takes a sample of 10 individuals, determines the level of the enzyme in each, and computes a sample mean of x̄ = 22. Suppose further it is known that the variable of interest is approximately normally distributed with a variance of 45. We wish to estimate μ.
Solution
An approximate 95% confidence interval for μ is given by:
x̄ ± 1.96 · √(45/10) = 22 ± 1.96 × 2.12 = 22 ± 4.16, i.e., (17.84, 26.16)
Components of an interval estimate:
The general form for an interval estimate is:
estimator ± (reliability coefficient) × (standard error)
It consists of three components, known as the estimator, the reliability coefficient, and the standard error.
Estimator: The interval estimate of μ is centred on the point estimate of μ. As noted above, x̄ is an unbiased point estimator for μ.
Reliability coefficient: Approximately 95% of the values of the standard normal curve lie within 2 standard deviations of the mean. The z score in this case is called the reliability coefficient. We use a value of z that will give the correct interval size. The proper z score depends on the value of α being used. Generally, the three values of α most commonly used are .01, .05 and .10; their corresponding z scores are 2.575, 1.96 and 1.645, respectively.
Standard error: The standard error equals σ/√n.
Interpretation of confidence intervals:
The interval estimate for μ is expressed as x̄ ± z(α/2) · σ/√n.
Assuming that we are using a value of α = .05, we can say that, in repeated sampling, 95% of the intervals constructed this way will include μ. This is based on the probability of occurrence of different values of x̄.
The area of the curve outside the interval is called α, and the area inside the interval is called 1 − α.
Interpretation of the interval
There are two ways in which interval estimates can be interpreted: the probabilistic interpretation and the practical interpretation.
The probabilistic interpretation results from repeated sampling. With repeated sampling from a normally distributed population with a known standard deviation, 100(1 − α) percent of all intervals of the form x̄ ± z(α/2) · σ/√n will, in the long run, include the population mean, μ. The quantity 1 − α is called the confidence coefficient or confidence level, and the interval x̄ ± z(α/2) · σ/√n is called the confidence interval for μ.
Note that the percentage of intervals involved depends on the value of α. With modern electronic devices such as the TI-83 calculator and Microsoft Excel, it is possible to use any value of α. When statistics was developing during the 20th century, such devices were not generally available, so one had to use tables. These tables were very difficult to prepare, and so only a few values of α were supported. The most commonly used values of α are .01, .05, and .10. When these are used in the formula 100(1 − α), they yield percentages of 99%, 95%, and 90%, respectively. The most widely used confidence level is 95%, which corresponds to α = .05. Using this figure, the probabilistic interpretation says that of 100 samplings, 95 of them should include μ. For situations in which there is neither time nor ability to do 100 samplings, the practical interpretation is used.
The practical interpretation is used for a single sampling. When sampling is from a normally distributed population with known standard deviation, we are 100(1 − α) percent confident that the single computed interval, x̄ ± z(α/2) · σ/√n, contains the population mean, μ.
The t distribution:
In most real-life situations, the variance of the population is unknown. We know that the z score, z = (x̄ − μ)/(σ/√n), is normally distributed if the population is normally distributed, and is approximately normally distributed when the sample is large. But it cannot be used because σ is unknown.
Estimation of the standard deviation
The sample standard deviation, s, can be used to replace σ. If n ≥ 30, then s is a good approximation of σ. An alternate procedure is used when the samples are small. It is known as Student's t distribution.
Student's t distribution
Student's t distribution is used as an alternative for z with small samples. It uses the following formula:
t = (x̄ − μ)/(s/√n)
Properties of the t distribution
1. Mean = 0.
2. It is symmetrical about the mean.
3. Its variance is greater than 1 but approaches 1 as the sample gets large. For df > 2, the variance = df/(df − 2).
4. The range is −∞ to +∞.
5. t is really a family of distributions, because the divisors (degrees of freedom) are different.
6. Compared with the normal distribution, t is less peaked and has higher tails.
7. The t distribution approaches the normal distribution as n − 1 approaches infinity.
Confidence interval for a mean using t
When sampling is from a normal distribution whose standard deviation, σ, is unknown, the 100(1 − α) percent confidence interval for the population mean, μ, is given by:
x̄ ± t(α/2) · s/√n
Deciding between z and t
When constructing a confidence interval for a population mean, we must decide whether to use z or t. Which one to use depends on the size of the sample, whether the population is normally distributed or not, and whether or not the population variance is known. There are various flowcharts and decision keys that can be used to help decide. One simple key appears below.
Key for deciding between z and t in confidence interval construction:
1. Population normally distributed and variance known: use z.
2. Population normally distributed and variance unknown: use t (for large samples, z is a close approximation).
3. Population not normal but sample size large: use z if the variance is known, otherwise t (or z).
4. Population not normal and sample size small: use a nonparametric procedure instead.
WILCOXON SIGNED RANK TEST:
Let N be the sample size, i.e., the number of pairs, so there are a total of 2N data points. For pair i, let x1,i and x2,i denote the two measurements.
H0: the differences between the pairs follow a symmetric distribution around zero
H1: the differences between the pairs do not follow a symmetric distribution around zero
1. For i = 1, ..., N, calculate |x2,i − x1,i| and sgn(x2,i − x1,i), where sgn is the sign function.
2. Exclude pairs with |x2,i − x1,i| = 0. Let Nr be the reduced sample size.
3. Order the remaining pairs from smallest absolute difference to largest absolute difference.
4. Rank the pairs, starting with the smallest as 1. Ties receive a rank equal to the average of the ranks they span. Let Ri denote the rank of pair i.
5. Calculate the test statistic W = Σ [sgn(x2,i − x1,i) · Ri], the sum of the signed ranks.
6. Under the null hypothesis, W follows a specific distribution with no simple expression. This distribution has an expected value of 0 and a variance of Nr(Nr + 1)(2Nr + 1)/6. W can be compared to a critical value from a reference table. The two-sided test consists in rejecting H0 if |W| is at least the critical value for Nr.
7. As Nr increases, the sampling distribution of W converges to a normal distribution. Thus, for large Nr, a z-score can be calculated as z = W/σW, where σW = √(Nr(Nr + 1)(2Nr + 1)/6). To perform a two-sided test, reject H0 if |z| exceeds the critical z value. Alternatively, one-sided tests can be performed with either the exact or the approximate distribution. p-values can also be calculated.
8. For small samples the original test using the T statistic is applied. Denoted by Siegel as the T statistic, it is the smaller of the two sums of ranks of a given sign; if, say, the negative differences carried the ranks 3, 4, 5 and 6, then T would equal 3 + 4 + 5 + 6 = 18. Low values of T are required for significance. T is easier to calculate by hand than W, and the test is equivalent to the two-sided test described above; however, the distribution of the statistic under H0 has to be adjusted.
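In R, wilcox.test() with paired = TRUE carries out this test; the before/after values below are invented so that no differences are zero or tied, allowing an exact p-value:
> before=c(10, 12, 14, 16, 18, 20, 22, 24)
> after=c(11, 10, 17, 21, 14, 27, 28, 32)
> wilcox.test(after, before, paired = TRUE)   # exact signed-rank test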
MANN WHITNEY TEST:
In statistics, the Mann–Whitney U test (also called the Mann–Whitney–Wilcoxon (MWW),
Wilcoxon rank-sum test, or Wilcoxon–Mann–Whitney test) is a nonparametric test of the null
hypothesis that it is equally likely that a randomly selected value from one sample will be less
than or greater than a randomly selected value from a second sample.
Unlike the t-test, it does not require the assumption of normal distributions, yet it is nearly as efficient as the t-test on normal distributions.
This test can be used to determine whether two independent samples were selected from populations having the same distribution; a similar nonparametric test used on dependent samples is the Wilcoxon signed-rank test.
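The same R function without paired = TRUE performs the Mann–Whitney U (rank-sum) test on two independent samples; the data are again invented for illustration:
> g1=c(3.2, 4.1, 5.6, 7.3, 2.8)
> g2=c(6.4, 8.1, 9.2, 7.7, 10.3)
> wilcox.test(g1, g2)    # reports the U statistic as W, with an exact p-value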
KRUSKAL–WALLIS TEST:
The Kruskal–Wallis test by ranks, the Kruskal–Wallis H test (named after William Kruskal and W. Allen Wallis), or one-way ANOVA on ranks, is a non-parametric method for testing whether samples originate from the same distribution. It is used for comparing two or more independent samples of equal or different sample sizes.
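A minimal kruskal.test() sketch in R with three invented groups of scores:
> scores=c(27, 31, 25, 35, 33, 40, 38, 42, 29, 36, 45, 41)
> group=factor(rep(c("A", "B", "C"), each = 4))
> kruskal.test(scores, group)   # H statistic with a chi-square approximation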