Research Methods GIS RS Kefelegn G June 2017
RESEARCH METHODS
The Scientific Method… cont'd
Guiding Principles… cont'd
Logical reasoning?
The Meaning of Research
The Purpose of Research…Cont'd
Classification of Research Activities…Cont'd
Research can be field-based, laboratory-based, or simulation research; it can also be clinical or diagnostic research, etc.
Time Dimension in Research
Some of the most important shared values
Thus, collection of data illegally, under false pretenses, or from minors, etc., is unethical.
Getting access and consent to do research is therefore essential.
During analysis (Misuse of data)
The research process
[Diagram: the research cycle runs Question → Formulate problem → Hypothesis → Project plan → Experiment / Collect data → Analyse → Results → Interpretation, conclusion, and maps onto the report sections Introduction, Materials and Methods, Results, Discussion, Conclusion.]
Introduction
Identification of a Research Topic (Cont’d)
Professional Experience:
One's own professional experience is the most important source of a research problem,
Contacts and discussions with research-oriented people,
Attending conferences and seminars, and
Listening to learned speakers.
Identification of a Research Topic (Cont’d)
Funding,
2. Definition and Statement of the Problem
Definition and Statement of the Problem (Cont'd)
c) Survey of the available literature:
The researcher must devote sufficient time to reviewing both the conceptual and the empirical literature:
Research already undertaken on related topics or problems needs to be systematically reviewed.
This exercise enables the researcher to:
find out what data are available,
find out if there are gaps in theories,
find out whether the existing theory is applicable to the problem under study,
find out what other researchers have to say about the topic, and
ensure that no one else has already exhausted the questions that you aim to examine, etc.
Definition and Statement of the Problem (Cont'd)
d) Developing the idea through discussion:
Discussion concerning a problem often produces useful information.
The discussion sharpens the researcher's focus of attention on specific aspects of the study.
e) Rephrasing the research problem:
The researcher must then rephrase the research problem into a working proposition.
Through rephrasing, the researcher puts the research problem in terms as specific as possible so that it becomes operationally viable and may help in the development of a working hypothesis.
Definition and Statement of the Problem (Cont'd)
f) In addition:
Technical terms or phrases with special meanings used in the statement of the problem should be clearly defined.
Basic assumptions or postulates relating to the research problem should be clearly stated.
The suitability of the time period and the sources of data available must be considered in defining the problem.
The scope of the investigation within which the problem is to be studied must be mentioned explicitly in defining a research problem.
3. Extensive Literature Survey
Extensive Literature Survey (Cont'd)
You will also know what other people have said about similar topics.
You can learn how other people faced methodological and theoretical issues similar to your own.
You can learn about sources of data that you might not have known about before.
Extensive Literature Survey (Cont’d)
Articles:
JSTOR: www.jstor.org
EconLit
Web Pages
4. Developing working hypothesis (Cont’d)
Theoretical framework:
Definition. Theories are formulated to explain,
predict, and understand phenomena and, in
many cases, to challenge and extend existing
knowledge within the limits of critical bounding
assumptions.
The theoretical framework is the structure that
can hold or support a theory of a research study.
6. Preparing the Research Design
7. Sample Selection
8. Execution of the Project
Time costs
Other resources
10. Interpretation and Generalizations
Explaining and discussing the research results in line with the theoretical framework is part of the interpretation exercise.
The real value of research lies in its ability to arrive at certain generalizations.
11. Preparation of the Report
The research process is completed only when the results are shared with the scientific community.
The report should be written in a concise and objective style, in simple language, avoiding vague expressions.
Preparing the Research Proposal
Research Proposal…cont’d
1. The title:
It should be worded in such a way that it suggests the
theme of the study.
It should be long enough to be explicit but not so long that it becomes tedious – usually between 15 and 25 words.
It should contain the key words – important words that indicate the subject.
Research Proposal…cont’d
Question-type titles:
These are used less commonly than indicative and
hanging titles.
However, they are acceptable where it is possible to use
few words – say less than 15.
Example: Does agricultural credit alleviate poverty in low-potential areas of Ethiopia?
Objectives of the study…cont’d
4. Review of Literature:
5. The Hypothesis:
Questions that the research is designed to answer are usually framed as hypotheses to be tested on the basis of evidence.
The hypothesis gives direction to the data-gathering procedure.
6. Significance of the Study:
This section justifies the need for the study.
It describes the type of knowledge expected to be obtained and the intended purpose of its application.
It should indicate clearly how the results of the research could influence theory or practice.
Research Proposal…cont’d
Rationale:
The rationale for undertaking a research study can be:
1. To show that a time lapse exists between the earlier study and the present one, and that new knowledge, techniques or considerations indicate the need to replicate the study.
9. Basic assumptions:
Assumptions are statements of ideas that are accepted as true.
They serve as the foundation upon which the research study is based.
Research Proposal…cont’d
II) Methodology
Research Proposal…cont’d
VI. Bibliography:
Be sure to include every work that was referred to in the proposal.
You do not have to refer to any other works if you do not want to; the bibliography does not have to be long or complete.
Formats vary slightly by journal, etc.
A common format:
For a book: Smith, Adam (1776). An Inquiry into the Nature and
Causes of the Wealth of Nations. London: Dent and Sons
publishing.
For an article: Coase, R (1937). “The Nature of the Firm.”
Economica 4, 386-405.
Survey and Field Research Methods
Representativeness:
• Representativeness is important particularly if you want to make generalizations about the population.
• A representative sample has all the important characteristics
of the population from which it is drawn.
For Quantitative Studies:
• If researchers want to draw conclusions which are valid for
the whole study population, they should draw a sample in
such a way that it is representative of that population.
For Qualitative Studies:
• Representativeness of the sample is NOT a primary concern.
• We select study units which give us the richest possible
information.
You go for INFORMATION-RICH cases!
Survey and Field Research Methods… cont’d
Hence:
For small populations (<1,000), a researcher needs a large sampling ratio (about 30%). Hence, a sample size of about 300 is required for a high degree of accuracy.
For a moderately large population (about 10,000), a smaller sampling ratio (about 10%) is needed – a sample size of around 1,000.
To sample from a very large population (over 10 million), one can achieve accuracy using tiny sampling ratios (0.025%), or samples of about 2,500.
These are approximate sizes, and practical limitations (e.g. cost) also play a role in a researcher's decision about sample size.
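As a rough illustration only, the rule of thumb above can be turned into a small helper. This is a sketch, not a standard formula: the breakpoints and ratios come from the text above, while the function name and the ~1% ratio assumed for populations between 10,000 and 10 million are illustrative assumptions.

```python
# A minimal sketch of the rule of thumb above: pick a sampling ratio
# from the population size, then derive a sample size.
def rule_of_thumb_sample_size(population: int) -> int:
    if population < 1_000:
        ratio = 0.30          # small population: ~30% sampling ratio
    elif population <= 10_000:
        ratio = 0.10          # moderately large population: ~10%
    elif population <= 10_000_000:
        ratio = 0.01          # in-between range: assumed ~1% (not stated above)
    else:
        ratio = 0.00025       # very large population: ~0.025%
    return max(1, round(population * ratio))

for n in (500, 10_000, 150_000, 20_000_000):
    print(n, "->", rule_of_thumb_sample_size(n))
# 500 -> 150, 10000 -> 1000, 150000 -> 1500, 20000000 -> 5000
```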
Survey and Field Research Methods… cont’d
Samples can be drawn by probability or non-probability means.
Survey and Field Research Methods… cont'd
Probability sampling:
Survey and Field Research Methods… cont’d
3. Stratified Sampling
Most populations can be segregated into a number of mutually exclusive sub-populations, or strata.
Survey and Field Research Methods… cont’d
How to Stratify
Three major decisions must be made in order to
stratify the given population into some mutually
exclusive groups.
(1) What stratification base to use: stratification would be
based on the principal variable under study such as
income, age, education, sex, location, religion, etc.
(2) How many strata to use: there is no precise answer as
to how many strata to use.
The more strata one uses, the closer one comes to maximizing inter-strata differences and minimizing intra-strata variability.
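A minimal sketch of proportional stratified sampling in Python (pandas), assuming an illustrative population frame with "region" as the stratification base and a 10% sampling ratio:

```python
import pandas as pd

# A toy population frame; "region" is the (assumed) stratification base.
population = pd.DataFrame({
    "region": ["north"] * 600 + ["south"] * 300 + ["east"] * 100,
    "income": range(1000),
})

# Proportional stratified sampling: draw the same fraction from every
# stratum, so each sub-population is represented in proportion to its size.
fraction = 0.10
sample = population.groupby("region").sample(frac=fraction, random_state=42)
print(sample["region"].value_counts())  # ~60 north, ~30 south, ~10 east
```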
Survey and Field Research Methods… cont’d
4. Cluster Sampling:
The selection of groups of study units (clusters) instead
of the selection of study units individually is called
CLUSTER SAMPLING.
If the total area of interest happens to be a big one and can be divided into a number of smaller, non-overlapping areas (clusters), and if some of these clusters are selected randomly, we have cluster sampling.
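A minimal sketch of one-stage cluster sampling, assuming villages as clusters and households as study units (all names illustrative):

```python
import random

# One-stage cluster sampling: randomly pick whole clusters, then take
# every unit inside the chosen clusters.
clusters = {
    "village_a": ["hh_a1", "hh_a2", "hh_a3"],
    "village_b": ["hh_b1", "hh_b2"],
    "village_c": ["hh_c1", "hh_c2", "hh_c3", "hh_c4"],
    "village_d": ["hh_d1", "hh_d2"],
}

random.seed(42)
chosen = random.sample(sorted(clusters), k=2)   # randomly select 2 clusters
sample = [unit for name in chosen for unit in clusters[name]]
print(chosen, sample)
```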
Non-Probability Sampling
Non-probability selection is non-random, i.e., each member does not have a known, non-zero chance of being included.
Generally, three conditions need to be met in order to use non-probability sampling.
First, if there is no desire to generalize to a population parameter, then there is much less concern whether or not the sample fully reflects the population - when precise representation is not necessary.
Secondly, it is used because of cost and time requirements.
Probability sampling could be prohibitively expensive since it calls for more planning and repeated callbacks to ensure that each selected sample unit is contacted.
Thirdly, though probability sampling may be superior in theory, there are breakdowns in its application.
The total population may not be available for the study in certain cases.
Non-probability sampling methods:
(1) Convenience sampling
The method selects anyone who is convenient.
It can produce ineffective, highly unrepresentative samples and is not recommended.
Such samples are cheap; however, they are biased and unrepresentative.
(2) Quota Sampling
Quotas are assigned to different strata groups and
interviewers are given quotas to be filled from
different strata.
A researcher first identifies categories of people
(e.g., male, female) then decides how many to get
from each category.
The major limitation of this method is the absence of an element of randomization. Consequently, the extent of sampling error cannot be estimated.
It is used in opinion polling, marketing research and other similar research areas.
(3) Purposive or Judgment sampling
Purposive sampling occurs when one draws a non-
probability sample based on certain criteria.
Focusing on a limited number of informants, selected strategically so that their in-depth information gives optimal insight into an issue, is known as purposeful sampling.
It uses the judgment of the expert in selecting cases.
BUT, care should be taken that, for different categories of informants, selection rules are developed to prevent the researcher from sampling according to personal preference.
(4) Snowball (Network) Sampling
This is a method for identifying and sampling (or
selecting) the cases in a network.
Snowball sampling is based on an analogy to a snowball, which begins small but becomes larger as it is rolled on wet snow and picks up additional snow.
Snowball sampling begins with one or a few people or cases and spreads out on the basis of links to the initial cases.
You start with one or two information-rich key informants and ask them if they know persons who know a lot about your topic of interest.
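A minimal sketch of snowball sampling as a traversal of an assumed referral network (the names, the network, and the stopping rule are all illustrative):

```python
from collections import deque

# Each key informant names the people they would refer us to; we expand
# outward from one seed until the desired sample size is reached.
referrals = {
    "seed": ["ana", "bekele"],
    "ana": ["chaltu", "bekele"],
    "bekele": ["dawit"],
    "chaltu": [],
    "dawit": ["ana", "eden"],
    "eden": [],
}

def snowball(network, seed, max_size):
    sampled, queue, seen = [], deque([seed]), {seed}
    while queue and len(sampled) < max_size:
        person = queue.popleft()
        sampled.append(person)
        for referral in network.get(person, []):   # follow the links
            if referral not in seen:
                seen.add(referral)
                queue.append(referral)
    return sampled

print(snowball(referrals, "seed", max_size=4))
# ['seed', 'ana', 'bekele', 'chaltu']
```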
Problems in Sampling
Two types of errors:
Non-sampling errors
Sampling errors
Non-sampling errors are biases or errors due to fieldwork problems, interviewer-induced bias, clerical problems in managing data, etc.
These would contribute to error in a survey, irrespective of whether a sample is drawn or a census is taken.
On the other hand, error which is attributable to sampling, and which is therefore not present in information gathered in a census, is called sampling error.
a) Non-Sampling Error
Non-sampling error refers to:
Non-coverage error
The wrong population being sampled
Non-response error
Instrument error
Interviewer error
Non-coverage sampling error: This refers to a sampling frame defect.
Omission of part of the target population (for instance, soldiers, students living on campus, people in hospitals, prisoners, households without a telephone in telephone surveys, etc.).
Non-coverage error also occurs when the lists used for sampling are incomplete or outdated.
The wrong population is sampled
Researchers must always be sure that the group
being sampled is drawn from the population they
want to generalize about or the intended
population.
Non-response error
Some people refuse to be interviewed because they are ill, are too busy, or simply do not trust the interviewer.
One should try to reduce the incidence of non-response errors.
Non-response error can occur in any interview situation, but it is mostly encountered in large-scale surveys with self-administered questionnaires.
It is important in any study to mention the non-
response rate and to honestly discuss whether and
how the non-response might have influenced the
results.
Instrument error
The word instrument in survey sampling means the device with which we collect data – usually a questionnaire.
When a question is badly asked or worded, the resulting error is called instrument error.
Example: leading questions or carelessly worded questions may be misinterpreted by some respondents.
Interviewer error: This occurs when some characteristic of the interviewer, such as age or sex, affects the way in which the respondent answers questions.
Example: questions about sexual behavior might be answered differently depending on the gender of the interviewer.
To sum up, a researcher must ensure that non-sampling errors are avoided as far as possible, or are evenly balanced (non-systematic) and thus cancel out in the calculation of the population estimates.
b) Sampling Errors
Sampling errors are random variations in the
sample estimates around the true population
parameters.
Error which is attributable to sampling, and which is therefore not present in census-gathered information, is called sampling error.
Sampling errors can be calculated only for probability samples.
Increasing the sample size is one of the major instruments for reducing the extent of the sampling error.
Sampling error is related to confidence intervals.
A narrower confidence interval means more
precise estimates of the population for a given
level of confidence.
The confidence interval for the true population mean is given by:
Mean ± z·(σ/√n)
where Mean is the sample mean, z is the value of the standard variate at a given confidence level (to be read from the table giving the area under the normal curve), n is the sample size, and σ is the standard deviation of the sample.
The sampling error is given by:
z·(σ/√n)
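A small sketch of this computation on illustrative data, using z = 1.96 for a 95% confidence level (with a sample this small a t value would arguably be more appropriate; the z formula above is used as given):

```python
import math
import statistics

# The confidence-interval formula above, applied to toy data.
data = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
n = len(data)
mean = statistics.mean(data)
sd = statistics.stdev(data)           # sample standard deviation
z = 1.96                              # z value for a 95% confidence level

sampling_error = z * sd / math.sqrt(n)
print(f"mean = {mean:.3f}")
print(f"95% CI = ({mean - sampling_error:.3f}, {mean + sampling_error:.3f})")
```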
Dealing with missing data:
There are several reasons why the data may be
missing.
They may be missing because equipment malfunctioned, the weather was terrible, people got sick, or the data were not entered correctly.
If data are missing at random, by far the most common approach is to simply omit those cases with missing data and to run our analyses on what remains.
Although deletion often results in a substantial
decrease in the sample size available for the
analysis, it does have important advantages.
Under the assumption that data are "missing at
random”, it leads to unbiased parameter estimates.
If, on the other hand, data are not missing at
random, but are missing as a function of some
other variable, a complete treatment of missing
data would have to include a model that accounts
for missing data.
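A minimal sketch of this case-deletion (listwise deletion) approach with pandas, on an illustrative data frame:

```python
import pandas as pd
import numpy as np

# Listwise deletion: any case with a missing value is dropped before
# analysis. The data frame and column names are illustrative.
df = pd.DataFrame({
    "income":      [2500, np.nan, 3100, 2800, np.nan],
    "expenditure": [2000, 1900, np.nan, 2300, 2100],
})

complete_cases = df.dropna()                # keep only fully observed cases
print(len(df), "->", len(complete_cases))   # 5 -> 2
print(complete_cases.mean())
```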
Data Collection Techniques
Every study is a search for information about the
given topic.
Qualitative and Quantitative data
Advantages of Secondary data
Can be found more quickly and cheaply.
Disadvantages of Secondary data
The data may not fit the researcher's needs.
Definitions might differ, units of measurement may be different, and different time periods may be involved.
It may be difficult to assess the accuracy of the information - the research design or the conditions under which the research took place may be unknown.
Data could also be out of date.
Sources of Secondary Data
Secondary data may be acquired from various
sources:
Department reports, production summaries, financial and accounting reports, marketing and sales studies, books, periodicals, reference books, encyclopedias, university publications (theses, dissertations, etc.), policy documents, statistical compilations, research reports, proceedings, personal documents (historical studies), etc.
The Internet
Primary Sources of Data
Data that are collected directly by the researcher, and thus happen to be original in character.
Qualitative and Quantitative data collection techniques
There are two approaches to primary data collection:
Qualitative data collection approaches
Qualitative data can be acquired from:
case studies,
rapid rural appraisal methods,
focus group discussions and
key informant interviews.
i) Case studies
Case study research involves a detailed investigation of a particular case:
• Through interviews (several forms of interviews - open-ended, focused, or structured).
• Through direct observation (field visits).
ii)Rapid Rural Appraisal (RRA)
RRA is a systematic but semi-structured activity
often by a multidisciplinary team.
The techniques rely primarily on expert observation
coupled with semi-structured interviewing.
The RRA method:
takes only a short time to complete,
iv) Key Informant Interview
The key informant interview technique is an interviewing process for gathering information from opinion leaders such as elected officials, government officials, business leaders, etc.
This technique is particularly useful for:
Raising community awareness about socio-economic issues
Learning minority viewpoints
Gaining a deeper understanding of opinions and perceptions, etc.
v) Triangulation
Types of Triangulation
Data triangulation, which entails gathering data
through several sampling strategies at different
times and social situations.
Investigator triangulation, which refers to the use
of more than one researcher in the field to gather
and interpret data.
Theoretical triangulation, which refers to the use
of more than one theoretical proposition in
interpreting data.
Methodological triangulation, which refers to the
use of more than one method for analyzing the
data.
Quantitative Primary Data Collection Methods
Weakness of the Method
The quality of information secured depends heavily on
the ability and willingness of the respondents.
A respondent may interpret questions or concepts differently from what was intended by the researcher.
A respondent may deliberately mislead the researcher by giving false information.
Surveys could be carried out through:
Face-to-face personal interview
Telephone interview
Mail or e-mail, or
A combination of all these.
a) Personal Face-to-face Interview
It is a two-way conversation where the respondent is asked to provide information.
Advantages:
The depth and detail of the information that can be secured far exceeds the information secured from telephone or mail surveys.
Interviewers can probe with additional questions, gather supplemental information through observation, etc.
Interviewers can make adjustments to the language of the interview because they can observe the problems and effects with which the respondent is faced.
Limitations of the Method
Non-response error
This error occurs when you are not able to find those whom
you are supposed to study.
In probability samples there are pre-designated persons to
be interviewed.
When one is forced to interview substitutes, an unknown
bias is introduced.
Under such circumstances one of the following could be
tried.
The most reliable solution is to make callbacks.
To treat all remaining non-respondents as a new
subpopulation and draw a random sample from the
subpopulation.
To substitute someone else for the missing respondent if
the population is homogeneous.
Response error
Errors are made in the processing and tabulating of data.
Respondents may fail to report fully and accurately.
Cheating by enumerators, usually those with only limited training and under little direct supervision.
Enumerators can also distort the results of a survey by inappropriate suggestions, word emphasis, tone of voice and question rephrasing.
Perceived social distance between enumerator and respondent also has a distorting effect.
Cost Considerations
Interviewing is a costly exercise.
Much of the cost results from the substantial
enumerator time taken up with administrative and
travel tasks.
b) Telephone Interview
The telephone can be a helpful medium of communication in setting up interviews and screening large populations for rare respondent types.
Strengths of this method
Moderate travel and administrative costs
Faster completion of the study
Responses can be directly entered onto the computer
Less interviewer bias.
Limitations of this method
Respondents must be available by phone.
The length of the interview period is short.
Telephone interviews can result in less complete responses, and those interviewed by phone may find the experience less rewarding than a personal interview.
c) Interviewing by Mail
Self-administered questionnaires may be used in surveys.
Advantages
Lower cost than personal interviews
Persons who might otherwise be inaccessible can be contacted (e.g. major corporate executives)
Respondents can take more time to collect facts
Disadvantages
Non-response error is expected
A large amount of information may not be acquired
Survey Instrument Design
Actual instrument design begins by drafting
specific measurement questions.
Both the subject and wording of each question are
important.
The psychological order of the questions needs to be considered.
Questions that are more interesting, easier to answer, and less threatening are usually placed early in the sequence to encourage response.
The main components of a questionnaire
Designing of a Questionnaire
1. Question Content
Both questions and statements could be used in
survey research.
Using both in a given questionnaire gives the
researcher more flexibility.
Minimizing the number of questions is highly
desirable, but one should never try to ask two
questions in one.
Question content usually depends on the
respondent’s:
ability, and
willingness to answer the question accurately.
a) Is the question of proper scope?
The respondent must be competent enough to answer the questions.
The respondent's information level should be assessed when determining the content and appropriateness of a question.
Questions that overtax the respondent's recall ability may not be appropriate.
b) Willingness of respondent to answer adequately
Even if respondents have the information, they may be unwilling to give it.
Some topics are also too sensitive to discuss with strangers.
Examples: the most sensitive topics concern money matters and family life.
If respondents consider a topic to be irrelevant and uninteresting, they will be reluctant to give an adequate answer.
Some of the main reasons for unwillingness:
The situation is not appropriate for disclosing the information
Disclosure of the information would be embarrassing
Disclosure of the information is a potential threat to the respondent
Some approaches that may help to secure more
complete and truthful information
2. Question Wording
a) Shared Vocabulary
In a survey, the two parties must understand each other, and this is possible only if the vocabulary used is common to both parties. So, don't use unfamiliar words, abbreviations, or ambiguous words.
b) Question Clarity
Do not use emotionally loaded or vaguely defined words.
c) Personalization
Finding the right degree of personalization may be a challenge.
Instead of asking "What would you do about ...?", it is better to ask "What would people do about ...?"
d) Provision of adequate alternatives
Asking a question that does not accommodate all possible responses can confuse and frustrate the respondent.
Are adequate alternatives provided? It is wise to express each alternative explicitly in order to avoid bias.
3. Response structure or format
Limitations
Can suggest ideas that the respondents would not
otherwise have
Respondents with no opinion or no knowledge can
answer anyway
Respondents can be confused because of too many
choices
During the construction of closed-ended questions:
The response categories provided should be exhaustive.
They should include all the possible responses that might be expected.
In multiple-choice type questions, the answer categories must be mutually exclusive.
The respondent should not be compelled to select more than one answer.
4) Question Sequence - order
The order in which questions are asked can affect the
response as well as the overall data collection
activity.
Transitions between questions should be smooth.
Grouping questions that are similar will make the
questionnaire easier to complete, and the respondent
will feel more comfortable.
Questionnaires that jump from one topic to another are
not likely to produce high response rates.
Some guides to improve quality
5) Physical Characteristics of a Questionnaire
Formats for Responses
A variety of methods are available for presenting a series of response categories:
Boxes
Blank spaces
Code numbers beside each response, to be circled.
Providing Instructions
Every questionnaire, whether self-administered by the respondent or administered by an interviewer, should contain clear instructions.
General instructions
Data Processing and Analysis
iii) Classification and Tabulation
Once data are edited and coded, the data presentation exercise begins.
Most research studies result in a large volume of raw data, which must be reduced into homogeneous groups if we are to get meaningful relationships.
Classification is the process of arranging data in groups or classes on the basis of common characteristics.
Data having common characteristics are placed in similar classes, and in this way the entire data set gets divided into a number of groups or classes.
Tabulation is the process of summarizing raw data and
displaying it in compact form (i.e. in the form of statistical
tables) for further analysis.
It is an orderly arrangement of data in columns and
rows.
Tabulation may be done by hand or by mechanical or electronic devices such as the computer.
The choice is made largely on the basis of the size and type of study, the alternative costs, time pressures and the availability of computer facilities.
In the case of computer tabulation, programs such as SPSS, Lotus, Excel, Stata, etc. could be used.
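As one illustration of computer tabulation, a sketch using pandas rather than the packages named above (the survey data and column names are illustrative):

```python
import pandas as pd

# A simple cross-table of two categorical variables: an orderly
# arrangement of the raw responses into rows and columns.
responses = pd.DataFrame({
    "sex":      ["F", "M", "F", "F", "M", "M", "F", "M"],
    "employed": ["yes", "yes", "no", "yes", "no", "yes", "no", "no"],
})

table = pd.crosstab(responses["sex"], responses["employed"], margins=True)
print(table)
```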
Tabulation provides the following advantages:
It conserves space and reduces explanatory and descriptive statements to a minimum.
It facilitates the process of comparison.
It facilitates the summation of items and the detection of errors and omissions.
It provides a basis for various statistical computations such as measures of central tendency, dispersion, etc.
Tabulation may be classified as simple or complex.
Simple tabulation gives information about one or more groups of independent questions.
Complex tabulation shows the division of data into two or more categories.
II) Data Analysis
A large volume of raw statistical information needs to be reduced to more manageable dimensions if one is to see meaningful relationships in it.
Data analysis is the computation of certain indices or measures.
It refers to the computation of certain measures along with searching for patterns of relationship that exist among data groups.
Data can be analyzed qualitatively or quantitatively.
Quantitative data analysis
Where the data are quantitative, there are some determinants of the appropriate statistical tool for analysis.
Were the data collected using a random or non-random sample?
If the sample was non-random, then non-parametric data analysis techniques are appropriate;
if random, then parametric techniques are appropriate.
Were the samples dependent (related) or
independent?
Samples are said to be dependent (related) when
the measurement taken from one sample affects
the measurement taken again from the same
sample.
Samples are independent if the measurements
taken from one sample do not affect those from
another sample.
Parametric tests
Do the data have characteristics which can lead to the application of parametric tests? i.e.:
Were observations drawn from a population with a normal distribution, i.e. are the data normally distributed?
Do the sets of data being compared have approximately equal variances (homogeneity of variances)?
Were the data measured on a ratio scale?
Non-parametric tests (data measured on a nominal or ordinal scale)
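A sketch of this decision path with scipy: check normality and homogeneity of variances, then fall back to a non-parametric test if the parametric assumptions look doubtful (the data, the 0.05 thresholds, and the choice of the Mann-Whitney U test as the alternative are illustrative assumptions):

```python
from scipy import stats
import numpy as np

# Two illustrative groups to compare.
rng = np.random.default_rng(42)
group_a = rng.normal(50, 5, size=40)
group_b = rng.normal(53, 5, size=40)

# Check the parametric assumptions named above.
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05
equal_var = stats.levene(group_a, group_b).pvalue > 0.05

if normal_a and normal_b and equal_var:
    result = stats.ttest_ind(group_a, group_b)      # parametric test
else:
    result = stats.mannwhitneyu(group_a, group_b)   # non-parametric test
print(result)
```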
Uni-variate Analysis
Uni-variate analysis refers to analysis with respect to one variable.
It is also called one-dimensional analysis.
Uni-variate analysis can be presented either in the form of statistical measures, such as measures of central tendency and measures of variation, or in the form of graphs.
Graphical illustrations can also be used to demonstrate the frequency distribution (histograms, ogives, polygons, bar graphs, line graphs and circular graphs or pie charts).
Descriptive Analysis
The initial uni-variate analysis may be the presentation of descriptive analysis in the form of frequency distributions.
A frequency distribution provides a profile of different groups on any of a multitude of characteristics, such as size, composition, efficiency, or preferences of persons or other entities.
The data in a frequency distribution can be used to calculate a number of statistical indices, which summarize the results even further.
Measures of central tendency are examples.
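A minimal sketch of such uni-variate description with pandas (illustrative data): a frequency distribution for a categorical variable and summary indices for a numeric one.

```python
import pandas as pd

# One categorical and one numeric variable, both illustrative.
education = pd.Series(["primary", "secondary", "primary", "tertiary",
                       "secondary", "primary", "secondary"])
income = pd.Series([1800, 2500, 1700, 4200, 2600, 1900, 2400])

print(education.value_counts())             # frequency distribution
print("mean:", income.mean())               # measures of central tendency
print("median:", income.median())
print("std dev:", round(income.std(), 1))   # a measure of variation
```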
Multivariate Analysis
Multivariate analysis involves the consideration of two or more variables.
If we have two variables, we have bi-variate analysis; if we have more than two variables, we have multivariate analysis.
Several multivariate analyses could be undertaken, such as the construction of bi-variate tables, or multivariate techniques such as multiple regression, ANOVA, discriminant analysis, probit and logit analyses, canonical analysis, etc.
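A minimal multiple-regression sketch with NumPy least squares (the data and variable names are illustrative; in practice one would typically use a statistics package as noted earlier):

```python
import numpy as np

# Expenditure explained by income and household size: a small
# ordinary-least-squares example with two explanatory variables.
income = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])
hh_size = np.array([3, 4, 2, 5, 4, 3, 6, 2])
expenditure = np.array([1.8, 2.3, 2.4, 3.2, 3.4, 3.6, 4.6, 4.1])

X = np.column_stack([np.ones_like(income), income, hh_size])  # add intercept
coef, *_ = np.linalg.lstsq(X, expenditure, rcond=None)
print("intercept, b_income, b_hh_size:", np.round(coef, 3))
```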
Summary chart concerning analysis of data
[Chart: in a broad, general way, Analysis of Data can be categorised into processing of data (preparing data for analysis) and analysis of data proper.]
Pitfalls in Data Analysis
The problem with statistics
Some aspects of statistical thought might lead many people to be distrustful of it.
There are three broad classes of statistical pitfalls:
The first involves sources of bias. These are conditions or circumstances which affect the external validity of statistical results.
The second category is errors in methodology, which can lead to inaccurate or invalid results.
The third class of problems concerns interpretation of results - how statistical results are applied (or misapplied) to real-world issues.
1. Sources of Bias
The core value of statistical methodology is its ability
to assist one in making inferences about a large group
(a population) based on observations of a smaller
subset of that group.
In order for this to work correctly,
the sample must be similar to the target population in all
relevant aspects (representative sampling);
certain aspects of the measured variables must conform to
assumptions which underlie the statistical procedures to be
applied (statistical assumptions).
Representative sampling
This is one of the most fundamental tenets of inferential
statistics:
the observed sample must be representative of the target
population in order for inferences to be valid.
The ideal scenario would be where the sample is
chosen by selecting members of the population at
random, with each member having an equal
probability of being selected for the sample.
The sample "parallels" the population with respect to
certain key characteristics which are thought to be
important to the investigation at hand.
The problem comes in applying this principle to real-world situations.
Statistical assumptions.
The validity of a statistical procedure depends on
certain assumptions it makes about various aspects of
the problem.
For instance, linear methods depend on the assumptions of normality and independence.
Unfortunately, this offers an almost irresistible temptation to ignore any non-normality, no matter how bad the situation is.
If the distributions are non-normal, try to figure out why; if it's due to a measurement artifact, try to develop a better measurement device.
Another possible method for dealing with unusual
distributions is to apply a transformation.
However, this has dangers as well; an ill-considered
transformation can do more harm than good in terms of
interpretability of results.
The assumption regarding independence of observations is
more troublesome, because it is so frequently violated in
practice.
Observations which are linked in some way may show some
dependencies.
One way to try to get around this is to aggregate cases to
the higher level.
Example: use households as the unit of analysis, rather than individuals.
2. Errors in methodology
The most common hazards include designing experiments
with insufficient power, ignoring measurement error, and
performing multiple comparisons.
Statistical Power. The power of your test generally depends
on the sample size, the effect size you want to be able to
detect, the alpha you specify, and the variability of the
sample.
Based on these parameters, you can calculate the power level of
your experiment.
Similarly, you can specify the power you desire (e.g. 0.80) and the alpha level, and use the power equation to determine the proper sample size for your experiment.
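A sketch of both directions with statsmodels' power tools (the effect size, alpha, and power targets are illustrative):

```python
from statsmodels.stats.power import TTestIndPower

# Power analysis for a two-sample t-test.
analysis = TTestIndPower()

# 1) Given a sample size, compute the achieved power.
power = analysis.solve_power(effect_size=0.5, nobs1=64, alpha=0.05)
print(f"power with n=64 per group: {power:.2f}")

# 2) Given a desired power, solve for the required sample size.
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"required n per group for power 0.80: {n:.1f}")   # ~64
```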
If you have too little power, you run the risk of overlooking
the effect you're trying to find.
If your sample is too large, nearly any difference, no matter
how small or meaningless from a practical standpoint, will
be "statistically significant".
Measurement error. Most statistical models assume error-free measurement.
However, measurements are seldom, if ever, perfect.
Particularly when dealing with noisy data, such as questionnaire responses or processes which are difficult to measure precisely, we need to pay close attention to the effects of measurement errors.
Two characteristics of measurement are reliability and validity.
Reliability refers to the ability of a measurement
instrument to measure the same thing each time it
is used.
So, a reliable measure should give you similar
results.
If the characteristic being measured is stable over
time, repeated measurement of the same unit should
yield consistent results.
Validity is the extent to which the indicator
measures the thing it was designed to measure.
Validity is usually measured in relation to some external criterion.
3. Problems with interpretation
There are a number of difficulties which can arise in the
context of interpretation.
Confusion over significance. The difference between "significance" in the statistical sense and "significance" in the practical sense continues to elude many consumers of statistical results.
Significance (in the statistical sense) is really as much a function of sample size and experimental design as it is of the strength of the relationship.
With low power, you may be overlooking a really useful relationship; with excessive power, you may be finding microscopic effects with no real practical value.
Precision and Accuracy. These two concepts often get confused.
Precision refers to how finely an estimate is specified, whereas accuracy refers to how close an estimate is to the true value.
Estimates can be precise without being accurate.
Causality. Assessing causality is the most important function of most statistical analysis.
For causal inference, you must have random assignment.
Many of the things we might wish to study are not subject to experimental manipulation.
Hence, coming to any strong conclusions regarding causality requires a multifaceted approach to the research:
use of chronologically structured designs (placing variables in the roles of antecedents and consequents), and
use of several replications.
Graphical Representations. There are many ways to present quantitative results graphically, and it is easy to go astray by misapplying graphical techniques.
Multiple Variables and Confounds
Controlling for Confounding Variables
We can first organize the universe of variables and
reduce them by classifying every variable into one of
two categories: Relevant or Irrelevant to the
phenomenon being investigated.
The relevant variables are those which are important to
understand the phenomenon, or those for which a
reasonable case can be made.
Example: if the literature tells us that Consumption Expenditure
is associated with income, then we will consider income to be
a relevant variable.
If we have not included a relevant variable in our analysis, it can be for different reasons.
One reason we might choose to exclude a variable is that we consider it to be irrelevant to the phenomenon we are investigating.
If we classify a variable as irrelevant, it means that it has no systematic effect on any of the variables included.
Irrelevant variables require no form of control, as they are not systematically related to any of the variables in our model, so they will not introduce any influence.
Two basic reasons why relevant variables might be
excluded:
First, the variables might be unknown.
We might have overlooked some relevant variables, but the
fact that we have missed these variables does not mean that
they have no effect.
Another reason for excluding relevant variables is that they are simply not of interest.
Although the researcher knows that the variable affects the phenomenon being studied, he does not want to include its effect in the model.
Finally, there remain two kinds of variables which
are explicitly included in our hypothesis tests.
The first are the relevant, interesting variables which are
directly involved in our hypothesis test.
The second is called a control variable.
The control variable is included because it affects
the relevant variables and we need to remove or
control for its effect.
Internal and External Validity
Knowing what variables you need to control for is
important, but even more important is the way you
control for them.
Several ways of controlling variables exist.
Internal validity is the degree to which we can be
sure that no confounding variables have obscured
the true relationship between the variables in the
hypothesis test.
It is the confidence that we can put in the assertion that the independent variables actually produce the effects that we observe.
External validity describes our ability to generalize
from the results of a research study to the real world.
Unfortunately, although controlling for the effect of confounding variables increases internal validity, it often reduces external validity.
Methods for Controlling Confounding Variables
The effects of confounding variables can be controlled with three basic methods: manipulated control, statistical control, and randomization.
Internal validity, external validity, and the amount of information that can be obtained about confounding variables differ for each of these methods.
Manipulated Control
Manipulated control essentially changes a variable into a
constant.
We eliminate the effect of a confounding variable by not
allowing it to vary. If it cannot vary, it cannot produce
any change in the other variables.
If we can hold all confounding variables constant, we
can be confident that any difference observed between
two groups is indeed due to the explanatory variable
and not due to the other variables.
This gives us high internal validity.
So, manipulated control prevents the controlled variables from having any effect on the dependent variable.
Statistical Control
With this method of control, we include the confounding variable in the research design as an additional measured variable, rather than forcing its value to be a constant.
So, we will be working with three (or more) variables and not two: the independent and dependent variables, plus the confounding (or control) variable or variables.
The effect of the control variable is mathematically removed from the effect of the independent variable, but the control variable is allowed to vary naturally.
This process yields additional information about the relationship between the control variable and the other variables.
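A sketch of statistical control via regression on simulated data: household size is an assumed confounder of the income-expenditure relationship, and including it as a measured variable removes its effect from the income coefficient (all names and numbers are illustrative):

```python
import numpy as np

# Simulate a confounder (hh_size) that affects both income and expenditure.
rng = np.random.default_rng(0)
hh_size = rng.integers(1, 7, size=200).astype(float)     # confounder
income = 2.0 + 0.5 * hh_size + rng.normal(0, 0.5, 200)
expenditure = 1.0 + 0.6 * income + 0.3 * hh_size + rng.normal(0, 0.5, 200)

# Naive model (income only) vs controlled model (income + hh_size).
X_naive = np.column_stack([np.ones(200), income])
X_ctrl = np.column_stack([np.ones(200), income, hh_size])
b_naive, *_ = np.linalg.lstsq(X_naive, expenditure, rcond=None)
b_ctrl, *_ = np.linalg.lstsq(X_ctrl, expenditure, rcond=None)
print("income effect, naive:     ", round(b_naive[1], 2))  # biased upward
print("income effect, controlled:", round(b_ctrl[1], 2))   # close to 0.6
```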
In addition to the extra information about the confounding variables that statistical control provides, it also has some real advantages over manipulated control.
External validity is improved, because the confounding variables are allowed to vary naturally, as they would in the real world.
But internal validity is not compromised to achieve this advantage.
In general, statistical control provides us with much
more information about the problem we are
researching than does manipulated control.
But advantages in one area usually have a cost in
another, and this is no exception.
An obvious drawback of the method lies in the increased
complexity of the measurement and statistical analysis
which will result from the introduction of larger
numbers of variables.
Randomization
The third method of controlling for confounding
variables is to randomly assign the units of analysis
(experimental subjects) to experimental groups or
conditions.
The rationale for this approach is straightforward: any
confounding variable will have its effects spread
evenly across all groups, and so it will not produce
any consistent effect that can be confused with the
effect of the independent variable.
This is not to say that the confounding variables
produce no effects in the dependent variable—they do.
But the effects are approximately equal for all groups, so the
confounding variables produce no systematic effects on the
dependent variable.
The major advantage of randomization is that we can
assume that all confounding variables have been controlled.
Even if we fail to identify all the confounding variables, we
will still control for their effects.
As these confounding variables are allowed to vary naturally, as they would in the real world, external validity is high for this method of control.
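A minimal sketch of randomization: shuffling illustrative subject IDs and splitting them evenly into treatment and control groups, so that confounders are spread across the groups on average.

```python
import random

# Randomly assign units of analysis to experimental groups.
subjects = [f"subject_{i:02d}" for i in range(20)]

random.seed(7)
random.shuffle(subjects)                 # randomize the order of units
half = len(subjects) // 2
treatment, control = subjects[:half], subjects[half:]
print("treatment:", treatment)
print("control:  ", control)
```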
Since we don’t actually measure the confounding
variables, we assume that randomization produces
identical effects from all confounding variables in all
groups, and that removes any systematic confounding
effects of these variables.
But any random process may result in
disproportionate outcomes occasionally.
Example: If we flip a coin 100 times, we will not always
see exactly 50 heads and 50 tails.
Sometimes we will get 60 heads and 40 tails, or even 70
tails and 30 heads.
Consequently, we have no way of knowing with
absolute certainty that the randomization control
procedure has actually distributed identically the
effects of all confounding variables.
We are only trusting that it did.
But, with manipulated control and statistical control,
we can be completely confident that the effects of the
confounding variables have been distributed so that no
systematic influence can occur, because we can
measure the effects of the confounding variable
directly.
There is no chance involved.
A further disadvantage of randomization is that it
produces very little information about the action of any
confounding variables.
• We assume that we have controlled for any effects of these variables, but we don't know what the variables are, or the size of their effects, if there are any.
• We assume that we've eliminated the systematic effects of the confounding variables by ensuring that these effects are distributed across all values of the relevant variables.
• But we have not actually measured or removed these effects; the confounding variables will still produce change in the relevant variables.