
Contents

 Introduction
 Data Collection as an end in itself
o Establishing the parameters of a system
o Establishing benchmark data
o Data Collection as part of broader strategy
 Propaganda
 Belief justification
 Market research
 Decision support
 'Objective' research
 The conduct of research

Introduction

How often have you heard someone say 'the fact is…' or 'the facts speak for themselves'? We have an almost religious belief
in the importance of facts as immutable, independent, objective pieces of information that tell us something 'real' about the
world around us.

Our fascination with 'facts' is persistent and universal. They seem to offer continual reassurance: whatever the foibles of
human opinion, some things at least are beyond argument. We all know that up is up, left is left, and the sun rises in the east.

The people who are most consumed by the search for facts - at least in the popular view - are scientists. To most people,
science is about the search for 'truth' - which is largely equated with the accumulation of data. This magical material can be
organised into useful pieces (facts) from which laws can be constructed. As the aim of science (so the argument goes) is to
find the 'laws of nature', everything the scientist measures is data, and every piece of data is potentially important. In its
extreme form, this approach sees science as the process of collecting (and sifting, organising and summarising) masses of
data. In this scenario data have particular and special significance.

The problem with this view is that it is almost totally wrong. It is true that some scientists collect data (usually resulting
from experiments) but many never handle data in the conventional sense. Few, if any, scientists see the accumulation of data
as worthwhile in its own right. Most scientists know that data are only useful in the right context; data out of context are at
best unhelpful, at worst misleading. Good scientists in particular have an instinct for knowing which data are useful and
relevant, and which are not.

Nor does science progress (if indeed science can be said to progress) by the mere accumulation of facts. The popular image
of the scientist has not caught up with modern thinking about how science is conducted. Society's conception of scientific
method is quite different from that accepted as appropriate (and preferable) by those who study the process of science. In
their analysis there is a right way to do science and a wrong way - and the science of popular conception is the wrong way.
When you understand the distinction between these approaches, you will see more clearly what role data play in these
methods.

Of course, scientists in the traditional fields (physics, chemistry, biology, astronomy and so on) are not the only people who
collect data. The modern world - complex, industrialised, bureaucratic - thrives on data of all kinds. We are numbered,
analysed and surveyed throughout our lives, and the results are stored and analysed. We have in some senses become a part
of the statistics that largely define modern society.

But who actually collects data? All governments do, for reasons both laudable and questionable. Without up-to-date and
comprehensive data about the characteristics of the population no government can plan and build the facilities and resources
we have come to expect. Commercial organisations collect data to improve their economic prospects by offering the goods
or services that potential customers seem to want. Researchers collect data to further their understanding of the workings of
our social and economic systems. Physical scientists collect data to further their understanding of how the world functions.

The process of collecting data takes two forms: gathering data that have already been collected by someone else (probably for a
different purpose), and creating 'new' data. The latter is a matter of some philosophic importance, and we will also return to
it shortly.

Data Collection as an end in itself

What motivates people to go through the often complex and costly process of collecting data? Apart from simply collecting
information to satisfy a fascination with so-called trivia, two main reasons for collecting data without an immediate and
specific purpose are:

 to establish the parameters of a system


 to establish benchmark data

Establishing the parameters of a system

When we investigate natural and social systems we often start with no clear idea of how the system functions. In particular,
we may have no strong impression of how the properties of the components of the system may vary. If we are studying river
flows, for example, we may have no idea of the likely range of values to be expected in a particular system. We can get
some idea from studies of rivers in similar environments, but this may not really be transferable due to some peculiarity in
this location. In this case we need to carry out preliminary studies whose results will define the parameters of the system.
These parameters will be concerned with the probable extremes of the data we expect to find in the 'real' study, and the
likely variability of the data. This knowledge may have a direct impact on the way in which we collect data during the major
part of the study.
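As a concrete illustration (not part of the original text, and using invented readings), the results of such a preliminary study might be summarised in a few lines of Python; the figures, variable names and units here are assumptions for the sketch only:

```python
import statistics

# Hypothetical preliminary river-flow readings (cubic metres per second)
preliminary_flows = [12.4, 15.1, 9.8, 22.6, 18.3, 11.0, 30.2, 14.7]

# Probable extremes of the data we expect to meet in the 'real' study
lowest = min(preliminary_flows)
highest = max(preliminary_flows)

# Likely variability of the data
mean_flow = statistics.mean(preliminary_flows)
spread = statistics.stdev(preliminary_flows)

print(f"Expected range: {lowest:.1f} to {highest:.1f} m^3/s")
print(f"Mean {mean_flow:.1f} m^3/s, standard deviation {spread:.1f} m^3/s")
```

Estimates of this kind can then inform practical decisions in the main study, such as the measurement scale and sampling frequency to use.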

Establishing benchmark data

If we pursue the example of river flow studies, we can also illustrate the way in which data are sometimes collected to
establish benchmarks. Regular monitoring of river levels, even if this is not part of a specific study, will help to build a
picture of the general behaviour of the system. This will provide valuable comparisons and context when we study the
system in more detail. Establishing benchmark data on flow patterns will indicate how 'typical' the data collected are for a
particular time period; they will also reveal long-term changes in the system.
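Purely as an illustrative sketch with made-up figures, comparing a study period against a long-term benchmark record might look like this:

```python
import statistics

# Hypothetical long-term benchmark record of monthly mean flows (m^3/s)
benchmark_flows = [14.2, 13.8, 15.1, 16.0, 14.9, 13.5, 12.8, 14.4, 15.6, 14.1,
                   13.9, 15.3, 14.7, 16.2, 13.4, 14.8, 15.0, 14.3, 13.7, 15.5]

# Mean flow observed during the study period (assumed value)
study_period_mean = 18.9

benchmark_mean = statistics.mean(benchmark_flows)
benchmark_sd = statistics.stdev(benchmark_flows)

# How 'typical' is the study period, in standard deviations from the benchmark mean?
z = (study_period_mean - benchmark_mean) / benchmark_sd
print(f"Study period is {z:+.1f} standard deviations from the long-term benchmark mean")
```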

Data Collection as part of broader strategy

Most data are collected for a specific purpose, as part of a broader strategy. We may be surveying how people would react to
a political proposition, or the likely sales for a new product. We may be investigating the effect of airborne pollution on
vegetation systems, or measuring the mass of a newly-discovered atomic particle. The following are some of the main
reasons why people collect data:

Propaganda

Some data are collected for what we might call propaganda purposes: to convince other people of the rightness of your view,
or that of a group to which you belong. Most propaganda that involves real data is based on processing and presenting raw data in a
way that suits a particular message, rather than on the generation of new data.
This category could also include instances of scientific fraud in which data are falsified or misrepresented to convince a
scientist's peer group of the correctness of his or her work.

Belief justification

Many people seek data that support the views that they already hold; this is true of some scientists, although it is rare.

Market research

Enormous amounts of data are collected by commercial organisations about the buying intentions of consumers; these
surveys are also widely used in the social and political arenas.

Decision support

In industry and government we have become used to the expectation that decisions will be based on careful analysis of data.
For example, before building a new port, an environmental impact statement would be prepared. Relevant data about the
area to be affected are collected and collated; this would then form one of the key foundations of the decision about whether
or not to proceed with that development.

"Objective" research

Scientists (and those who aspire to that status) collect data as a critical part of the process of research. You'll see how this
process operates in detail - and its implications for the data collection process - in later sections. You'll also examine the
popular (but misleading) concept of objectivity.

Ethical issues

Unlike most experimenters in physics and chemistry, or many in the earth sciences or biology, who deal with inanimate
objects or materials, researchers whose subjects are people or animals must consider the conduct of their research, and give
attention to the ethical issues associated with carrying out their research.

Sometimes a research project may involve changing the subjects' behaviour or, in some cases, causing them pain or distress.
Most research organisations have complex rules on human and animal experimentation. Although a detailed study of such
rules is beyond the scope of this unit, you should be aware that such systems exist, and what they deal with.

The American Psychological Society has developed a set of guidelines governing the conduct of research in psychology.
Although some are clearly most relevant to psychology, most are applicable to all forms of research, and will give you an
impression of the ethical issues involved. They can be summarised as follows:

You must justify the research via an analysis of the balance of costs and benefits.

The scientist's interest alone isn't sufficient justification to carry out research. In order to carry out experiments there have to
be benefits that outweigh the costs. Researchers are expected to carry out an analysis and ensure that the research is justified.

You are responsible for your own work, and for your contribution to the whole project.

Scientists must accept individual responsibility for the conduct of the research and, as far as foreseeable, the consequences
of that research.

You must obtain informed consent from any subjects.

The concept of informed consent is a major problem when dealing with research into human behaviour or physiology,
particularly in research that may have harmful side-effects. Can someone give informed consent, for example, if they are
below 18 years of age (or whatever is the legal age of consent)? What about potential subjects who are mentally or
physically disabled?

You must ensure that all subjects participate voluntarily.

In psychological research all subjects must participate voluntarily; informed consent must be accompanied by a free decision
to participate. On the other hand, it is very difficult to explain complicated research to non-specialists. Nevertheless, the
onus is on the researcher to explain the research, not on the volunteer to find out about it.

You must be open and honest in dealing with other researchers and research subjects.

The researcher must be as open and honest as is reasonable. This process, whilst fine in principle, is complicated by
considerations such as commercial advantage and professional rivalry.

You must not exploit subjects by changing agreements made with them.

As a researcher you might discover that your experiment shows something that you would like to further investigate, but
don't want to tell your subjects about. If you did investigate further, but pretended that you were still doing the experiment
that had been agreed to in the first place, this would be a form of exploitation, and would breach the principles of informed
consent and voluntary participation.

You must take all reasonable measures to protect subjects physically and psychologically.

The unexpected outcomes of a series of now-famous experiments in the early 1970s convinced most psychologists that even
voluntary participants can 'get carried away' to the point where they have to be protected from themselves and each other. In
these experiments, university psychology students were allowed, using complex behavioural rules, to 'punish' their co-
subjects where they breached the rules. What surprised and disturbed the researchers was the number of students who
greatly exceeded their 'right' to inflict punishment, and how widespread the process became among the subjects. Eventually
the experiments had to be terminated to prevent injury to the subjects.

The researcher must be prepared to intervene, even at the cost of the experiment itself, to protect the subjects.

You must fully explain the research in advance, and 'de-brief' subjects afterwards.

Whilst full explanations before the experiment are essential to gaining informed consent, it is, unfortunately, a common
practice for researchers to complete their research without telling the participants anything about the results.

You must give particular weight to possible long-term effects of the research.

Obviously this can be difficult to achieve. It means that, regardless of their strong motives for collecting information,
researchers have to give particular emphasis to any potential long-term dangers of the research.

You must maintain confidentiality at all times.

Only certain people conducting the experiment should know the identity of the participants, and any subject should
generally not know the identity of other subjects. The key to maintaining confidentiality is that the individual should not be
identifiable.

In this section you'll see how philosophers have regarded the idea of knowledge. You will also examine the ways in which
data are processed to make them more comprehensible, in what is called the data transformation process.

On completion of this section you should be able to:

 list and explain the two main forms of knowledge


 list and explain the major steps in the data transformation process.

Contents

 Forms of knowledge
o Empirical Knowledge
o Reasoning
 The data transformation process

Forms of knowledge

We have been talking so far as if we had a single, unambiguous idea of what data are, but if you surveyed people in the
street you would get many different answers. These might range from 'a scientific measurement' to 'something in a textbook'
to 'whatever anybody thinks'. One common theme, however, is that data - including numbers, ideas and opinions - are part
of things that we can know.

Philosophers have been interested in the question of what constitutes knowledge, and how we perceive this, since the time of
the ancient Greeks; there is a branch of philosophy solely devoted to the topic, called epistemology (or theory of knowledge).
Broadly, philosophers divide the possible types (and hence sources) of knowledge into two main groups:

 empirical knowledge (things we perceive through our senses)


 reasoning (things we derive by thought alone)

Empirical Knowledge

The log we are sitting on has rings that show it is sixty years old. At a depth of two thousand feet it would become coal in
three thousand years. The deepest coal mine in the world is at Killingworth. A ton of coal will fit in a box four feet long,
three feet wide and two foot eight inches deep.

Oh, I think them statistics is wonderful.

Handbook of Hymen, O. Henry

Empirical knowledge is all knowledge that comes to us through our senses, or through 'extensions' to those senses, such as
physical instruments (telescopes, microscopes, meters, and so on). In its extreme form (empiricism) some philosophers have
argued that this is the only real form of knowledge, as it is the only form of knowledge external to the observer; this means
that the knowledge can (at least in principle) be obtained by someone else.

Reasoning

Many philosophers in the field of epistemology argue that the products of rational human (and perhaps some irrational)
thought also form a different but important type of knowledge. These products might include speculations, conclusions (of
arguments) and - most significantly - scientific theories.

The data transformation process

There is a cliché that, like most clichés, has some elements of truth to it. It runs something like this:

Data is not information

Information is not knowledge

Knowledge is not wisdom

Whatever the overall correctness of these propositions, they embody an important assumption: that we must expect to
transform 'raw' data in some way in order to extract 'meaning' from it. You could liken the process to refining minerals in
order to extract precious metals from the ores within which they are to be found. There may be gold in the mineral, but the
quartz must be removed before the pure metal can be isolated and extracted.

An alternative analogy is one in vogue in economic circles, whereby the processing of materials (food, manufactured goods,
minerals) is seen as 'adding value' to them; whatever 'intrinsic' value they may have had is enhanced (added to) by the
processing applied to them. Mineral ore has some intrinsic value, but the processing of that ore to refine and extract valuable
materials creates much more value. Of course, it is implicit in this analogy that the processing of the material has a cost, and
that the final 'value' should be great enough to 'return a profit'.

How does this relate to any operations we may carry out on data? In either analogy we start with raw data, from an
experiment, from a survey, or as a product of our reasoning. When we process it (such as by summarising it with simple
statistics) we retain the 'value' of the raw data, but 'add value' in the form of meaningful short descriptions of the overall
structure of the data. Similarly, when we interpret the data we are 'adding value' by supplying a (hopefully) rational analysis
of what the data might mean.
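A minimal sketch of this 'value adding' step, using hypothetical raw figures, is shown below; the summary contains nothing that the raw data did not, but it expresses the overall structure of the data in a few meaningful numbers:

```python
import statistics
from collections import Counter

# Hypothetical raw survey data: hours of television watched per week by twelve respondents
raw_data = [4, 7, 2, 9, 5, 6, 3, 8, 5, 4, 7, 6]

# 'Adding value': short descriptions of the overall structure of the data
summary = {
    "count": len(raw_data),
    "mean": round(statistics.mean(raw_data), 2),
    "median": statistics.median(raw_data),
    "std_dev": round(statistics.stdev(raw_data), 2),
    "frequencies": Counter(raw_data),
}

print(summary)  # the raw data are retained; the summary adds a layer of meaning
```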

A more 'traditional' way of looking at the general process of working with data - known as the data transformation process -
is summarised in figure 2.1. The figure indicates the cyclical nature of the process, in which attempts to refine a theory will
lead us to collect data. Our analysis and interpretation of the data may in turn lead us to modify the original theory. We shall
examine the 'components' of the system in more detail as we proceed, concentrating on what are scientific theories, what are
the various sources of data, and how are the collected data analysed and interpreted.

Figure 2.1: The Data Transformation Process

There are several immediate points that can be made about this process:

 In science (as you'll see shortly) the whole process proceeds from an existing theory, via an experimental design. The data
we use (either generated by us through experiments and surveys, or already collected by others for their own purposes) are
then analysed and interpreted.
 The process is iterative, in that our interpretation of the data will often lead us to adjust the theory, which may lead us to
collect more data, which will be analysed … and so on.
 The whole process contains considerable degrees of subjectivity, especially in the analysis and interpretation stages; two
scientists confronted with the same raw data will summarise, analyse and interpret them differently.
 This process is 'internal' to the scientist (and his/her collaborators), but is also part of a larger process which includes the
communication of the results and their interpretation to other scientists and the community at large.

In this section you'll look at the basic approaches that are taken in science in generating new knowledge, and the central
roles that empirical observations (data) play in these processes.

On completion of this section you should be able to:

 explain the significance of theories in relation to data collection


 explain how the subjective processes of analysis and interpretation operate
 describe the process of scientific research
 outline the stages involved in carrying out research by means of the induction process
 outline the stages involved in carrying out research by means of the deduction process

Contents

 Theories in science
 Experimental design
 Analysis and interpretation
 The process of 'research' in science
 Induction
o Collect relevant data, based on prior knowledge of system
o Summarise collected data
o Generalise about the system from the collected data
o Problems
 Selection of relevant data
 Selection of collection method (experimental design)
 Selection of summarisation and analysis methods
 The probabilistic (and subjective) nature of conclusions
 Deduction
o Decide what needs to be investigated
o Formulate a reasonable hypothesis and define how to test it
o Carry out an experiment
o Verify (or falsify) the hypothesis by comparing predicted results with experimental results
 The role of interpretation

Theories in science

It is one of the basic tenets of this unit that scientific data are, in certain crucial ways, 'different' from the data collected in
other fields - even if those other fields use identical methods of data collection, analysis and presentation. If asked to define
this difference, many people might point to the importance of experiments in science, to the mathematical sophistication of
some scientific models, or even to the supposed 'objectivity' of scientists.

The difference between scientific data and non-scientific data actually has little or nothing to do with these factors. It has
everything to do with the characteristics and significance of theories in science. Any piece of data in science is important
and relevant solely because it bears some connection to a theory; we have no way of assessing the value of any piece of data
not generated to test or extend a scientific theory. It follows from this that there is no such thing in science as 'collecting data
for its own sake' (at least in our current conception of scientific method). All data are collected (and are potentially valuable)
because they are linked to a specific theory.

For example, we know that some gastropod molluscs (snails) have spiral shells that turn 'clockwise', and others turn 'anti-clockwise'.
Even a complete survey of which species turn 'left' and which turn 'right' (including the environmental conditions in which
they live) has no intrinsic scientific significance - unless the data are linked to a scientific theory that seeks to explain
(scientific theories are never simply descriptive!) how and why this 'handedness' has evolved.

Again, you'll gain a clearer understanding of this point when you examine scientific method in greater detail in subsequent
sections.

Experimental design

One of the other distinguishing features of scientific data is that they are most usually generated in experiments that have
been carefully structured. It is through the design of such experiments that scientists attempt to guarantee that they collect
the 'right' data - that is, the data that will allow them to test and extend their theories.

Analysis and interpretation

The analysis and interpretation of scientific data are fundamentally subjective. That is, they involve an individual scientist
(or a small group) examining the results of the data collection process to see if those data are in accord with their
expectations (the predictions they have made) or not. Of course, this process is not (and cannot be) a random browse through
the data. The scientist is looking for evidence in the data of the relationships that he or she expects to find.

It is a fundamental assumption in science that the physical universe is basically deterministic; that is, there exist cause and
effect relationships between events, and these relationships can be identified and analysed. There is no need for supernatural
processes or divine intervention; the universe is fundamentally comprehensible. Even our recent changed understanding of
dynamic and chaotic systems fits comfortably within this framework.

It is important to note that recent objections to determinism (and a preference for what is often termed an 'holistic' view of
the universe) seem to be based on a misunderstanding of the principal tenets of determinism. It is not strictly a part of the
deterministic view to suggest that all actions (particularly those of sentient beings, such as humans) are pre-determined. Nor
is it necessary to infer that understanding the operations of the components of a system will tell us everything there is to
know about the system (although this is substantially true of most physical - if not biological or social - systems at scales
above the quantum level).

A scientist can look to detect these relationships as patterns in the data. Such patterns might include noting that the value of
one property (such as blood pressure) rises as another, supposedly related, property falls (such as blood sugar levels). We
shall examine the forms of these relationships, and how they might appear in data, later.
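To illustrate the idea (with invented measurements, not real clinical data), a scientist looking for such a pattern might begin by computing a simple correlation coefficient; a value near -1 is consistent with one property falling as the other rises:

```python
import statistics  # statistics.correlation requires Python 3.10 or later

# Hypothetical paired measurements for ten subjects
blood_pressure = [118, 122, 125, 130, 134, 138, 141, 145, 150, 155]
blood_sugar    = [6.1, 5.9, 5.8, 5.5, 5.4, 5.1, 5.0, 4.8, 4.6, 4.4]

# Pearson's correlation coefficient as a first, crude measure of the pattern
r = statistics.correlation(blood_pressure, blood_sugar)
print(f"r = {r:.2f}")  # close to -1: as one property rises the other tends to fall
```

A strong correlation is, of course, only a pattern; whether it reflects a functional or causal relationship is a matter for interpretation and further experiment.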

The expected relationships that a scientist might find are of two types, as shown in figure 3.1.

Figure 3.1: Patterns and Relationships

The scientifically uninteresting relationships are those that are not produced by cause and effect, but by what we might
loosely call coincidence. The most famous example of this type of relationship is between long-term variations in share
market prices and sunspot activity. Of course, there may be an effect here of which we have no current understanding, but it
seems sufficiently unlikely as not to be worth much further investigation. Such relationships, whilst often intellectually
provocative, are generally coincidental or casual.

Incidentally, these are the sort of relationships that seem to attract the attention of the 'fringe dwellers' of science, such as
those fascinated by parapsychology, astrology and so on. Occasionally a phenomenon that has no obvious rational
explanation will turn out to be explicable by standard scientific methods. We might cite, for example, the development of
our ideas on electricity and lightning, from divine intervention ('the gods are angry') through galvanism to electro-
magnetism.

Far more important to science are those relationships that are expressions of true functional connections between
phenomena. Many functional relationships are also causal, reflecting a fundamental cause-and-effect linkage. It is
for these latter relationships in particular that scientists search, and their detection - and the analysis of their form, defined
graphically, algebraically or statistically - is central to scientific research.

The process of 'research' in science

In the broadest possible definition, research is the process whereby we generate new knowledge. This is true in the physical
sciences as well as the so-called 'social sciences'. Research is the method by which we are able to arrive at some form of
'truth' about the world around us. Sometimes it is argued that this 'truth' is absolute: it exists independently of our attempts to
discover it, waiting like diamonds in the rock to be exposed to the light. The alternative (and more prevalent) modern view
holds that 'truth' is a relative commodity, created by the people who are investigating the field. Our current understanding of,
say, physical chemistry is all the truth that is available to us at this moment; a new version of this truth will emerge as a
consequence of our research into the properties of matter. Put simply, the truth as it stands in a given field is what the
majority of scientists in that field accept as being true right now.

Apart from the metaphysical aspect of this division, which position we take has some implications for how we will actually
carry out research. If we believe that truth exists independently, awaiting discovery, then we can afford to be fairly relaxed
about the details of the research process; any way we find the diamonds is okay, as long as we find them. Some ways may
be quicker, but as long as they work, we are happy.

If, on the other hand, we believe that 'truth' is created during the research process, it follows that the 'truth' we create is at
least in part a reflection of the methods we use; different methods will produce, to some extent, different truths. It is the
general acceptance of this latter view that has led scientists and philosophers of science to examine the research process, and
to develop an understanding of the 'proper' ways to do research.

There are basically two approaches to the research process, and the Greek philosopher Aristotle identified them well over two
thousand years ago. As well as being the founding figure in the philosophy of science, Aristotle has claims to being the
world's first experimental scientist. For example, he made extensive observations (and what we would call 'controlled
experiments') on sea anemones.

Aristotle recognised that we can create knowledge by applying logic to observations to generate conclusions. The methods
of logical analysis that he developed still form the basis of current scientific analysis, having been developed and given
mathematical structure in the last hundred years. He realised that the necessary data could come from observations of
physical experiments, and that the conclusions that we derive are based upon those data. He also identified the two
approaches that can be used to generate such conclusions: induction and deduction.

Induction

The method that Aristotle used is called induction. Because of the enormous influence of Aristotle on the development of
Western thought (particularly during the Renaissance and the Scientific Revolution) it has been the 'preferred' way of doing
research until the last half century. It is also fair to say that it represents the 'obvious' way of learning about the world: make
observations and draw conclusions. However, as you'll see, it is a fundamentally flawed method, especially when compared
with the alternative approach, namely deduction.

Induction proceeds through three basic stages, shown below in figure 3.2.

1. Collect 'relevant' data, based on prior knowledge of system

The crux of the inductive system is the observation process - including (but not requiring) experiments. This process
generates a body of 'facts' that, over an extended period of time, accumulates to a point where generalisations can be made.
The experience of the researcher will have a significant influence on decisions about what information should be collected.

2. Summarise collected data

The volume of data involved will require careful ordering of the data. One of the hallmarks of inductive research is an
emphasis on the best ways to classify and organise the collected data (hence the continuing fascination of plant and animal
taxonomy).

3. Generalise about the system from the collected data

Organised summaries of the raw data provide the basis for the interpretation process that will produce the generalisations
('laws') that are being sought. The experience of the individual scientist is significant. There is also a sufficiently strong
reliance on inspiration ('the idea just popped into my head') to make philosophers nervous!

Figure 3.2: The Induction Process

Problems

Attractive as it may be as an easy-to-grasp approach to doing scientific research (which is one reason why it forms the basis
of popular images of scientific method), the induction process has serious logical flaws:

Selection of 'relevant' data

The emphasis on choosing the 'right' data means that the conclusions drawn are largely determined by what data the
researcher has decided (or been able) to collect. This may mean that the scientist would reach different conclusions if they
happened to collect a different set of data - but how can there be different 'laws' for the same phenomenon?

Selection of collection method ('experimental design')

Induction is built upon the collection of data, but has no specific role for the kind of experimental (data collecting) process
that we identify with science.

Selection of summarisation and analysis methods

In induction there are no criteria for deciding exactly which techniques should be used to summarise and analyse the
collected data.

The probabilistic (and subjective) nature of conclusions

Most critically, the generalisations produced by induction are not really general in the proper sense; they are tied to the
specific data that are available, and to the person who has created them. A different researcher can (and probably will) draw
different conclusions from the same data, and more (or different) data will lead to different conclusions. The best we can
hope for is to be able to express our conclusions in a probabilistic form.

Deduction

Given the limitations of induction, why has it been the dominant form of scientific method for the last two thousand years?
There are at least three key reasons. The first is the reverence for Aristotle's view, including his espousal of induction. The
second is the intuitive nature of induction; this is how we learn much of our understanding of the world around us. The third
(and most significant) is that induction works quite satisfactorily - at least until its limitations are exposed by large numbers
of scientists working on complex problems that generate large quantities of data.

The alternative process - deduction - was also described by Aristotle, but it is applicable only when a discipline has reached
a certain stage in its evolution. It requires an established body of strong generalisations (actually created inductively) upon
which it can build; such generalisations can only be accumulated over a protracted period of time (although the pace and
size of modern science tends to shorten this period).

Deduction proceeds through four main stages, which are summarised in Figure 3.3.

Figure 3.3: The Deduction Process

1. Decide what needs to be investigated

Deduction starts with a body of 'truth', defined as those general principles (which are correctly termed theories) that almost
all scientists in a given area accept to be 'true'. In this context 'true' means that a substantial amount of evidence - based on
many (preferably repeated) experiments - exists to support the theories, and little or no evidence exists that directly
contradicts them. As you'll see, this is a strictly limited but perfectly functional definition of 'true', and one that deduction
can never allow us to exceed. Existing theories will always be in some way incomplete (we never know all there is to know),
and it is the aim of research to attempt to 'fill in' these gaps.

2. Formulate a reasonable hypothesis and define how to test it

The critical step in deduction is to create an hypothesis. This is a statement that has the following key characteristics:

 The statement is consistent with, and based upon, current theory, but contains ideas that are not accepted as part of that
theory.
 The statement is formulated in such a way that it can generate predictions that are capable of being shown to be either true or
false (for example, 'A higher proportion of people who smoke die of lung cancer than people of similar background who do not
smoke'): that is, it is testable.
 The statement leads to the formulation of an experiment that will generate data that can be used to test the hypothesis (prove
it true or false).

Often a scientist will formulate several alternative (equally likely) hypotheses and devise an experiment to distinguish
between them.
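As a rough sketch of what 'testable' means in practice, the smoking example above can (with entirely hypothetical counts) be reduced to a comparison of two proportions; a real study would also have to allow for sampling variation, for example with a two-proportion significance test:

```python
# Hypothetical figures only: deaths from lung cancer in two groups of 1,000 people
smoker_deaths, smokers_total = 140, 1000
nonsmoker_deaths, nonsmokers_total = 30, 1000

p_smokers = smoker_deaths / smokers_total
p_nonsmokers = nonsmoker_deaths / nonsmokers_total

# The prediction is capable of being shown true or false by the data collected
prediction_supported = p_smokers > p_nonsmokers
print(p_smokers, p_nonsmokers, prediction_supported)
```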

Incidentally, the ability to generate hypotheses that lead a subject in fruitful new directions is one of the few things that
leading scientists all seem to share. Similarly, great scientists are often characterised by the capacity to devise elegant
experiments to test crucial (and often long-standing) problems in significant areas of their subject.

3. Carry out an experiment

If experiments are problematic in induction, they are critical in deduction. The necessity to collect exactly the required data
in order to test an hypothesis drives scientists to strive for the maximum level of control over their data collection. Wherever
possible or appropriate a scientist will carry out a controlled experiment, either in a laboratory or in the outside world.
Where such experiments are not possible the scientist will carefully design the experiment to attempt to guarantee 'success'.
The success of an experiment is measured in terms of whether it collected the right amount of the right type of data to allow
the hypothesis to be tested - not whether the answer is true or false.

4. Verify (or falsify) the hypothesis by comparing predicted results with experimental results

Alongside testability, a critical factor in formulating a good hypothesis is that it should lead to specific predictions. If these
predictions are found to be incorrect (based on the experimental evidence) we know that the hypothesis is (at least in part)
'wrong'. This is termed refuting or falsifying the hypothesis. If, however, the predictions are found to have been correct
(based on the experimental data) we say that the hypothesis has been verified. Our response to these different cases will be
different, and we will examine that issue in the next section.

The role of interpretation

Unlike induction, the deductive process has no specific role for inspiration (and, by implication, subjectivity). Whilst it is
recognised that some scientists have a particular 'gift' for formulating hypotheses or devising experiments, it is also accepted
that the analysis of experimental results and the testing of an hypothesis is a fundamentally 'mechanical' process. Two
scientists should, in principle, generate equivalent results from the same experiment (which is, after all, the basis of the
replication of experiments). Interpretation in this context is limited to deciding whether or not specific predictions have been
supported, and the hypothesis is therefore true or false.

In this section you'll look at the processes by which scientific theories are developed and tested, and how logic is used to
further this process.

On completion of this section you should be able to:

 describe the relationship in science between theories and hypotheses


 explain how the falsification and verification processes operate on theories
 explain how a scientific revolution develops
 outline the major components of logical analysis and the various forms of inference
 define the ways in which experiments are used in science
 list and describe the major types of experiments conducted by scientists.

Contents

 Theories and hypotheses


 Falsification and verification
 Scientific "revolutions"
 Logic and inference
o Inference
o Forms of inference
 The purpose of experiments
 Experiments and experimental design
o Controlled experiments
o Laboratory experiments
o Clinical trials
o Field trials
o Surveys
 Census
 Sample
 Longitudinal
 Simulation

Theories and hypotheses

Theories are significant because they are how science explains and allows us to understand the world around us. A 'good'
theory has the following characteristics:

 it is internally consistent, in that it provides a consistent explanation for almost all the known 'facts' in a given field
 it is extensible and predictive, in that it leads to hypotheses that can be used to extend the theory to cover a broader field, and
to link it to other related theories

The strength of a theory is in its current utility in explaining what we already know, and in its potential ability to predict and
explain things we don't yet know. This is why, for example, Darwinian evolution is so central to the whole of biology: there
is little or nothing that we can't (in principle) explain in terms of evolution by natural selection. So the power of a theory
rests in its ability to explain what we already know and also to make predictions about things we don't currently understand.
Theories are absolutely fundamental to science - contrary to the ill-informed dismissal of evolution by creationists as 'only a
theory'.

There is a paradox in this process: the attractiveness of a theory to scientists actually lies largely in what it doesn't - but
might - explain. If a theory explained everything then it would cease to be useful for future scientific work - no matter how
satisfying its 'completeness' might be for its creators.

The capacity for extension and development is what sets theories apart, but how do we modify apparently satisfactory
theories? When we extend a building, for example, we usually have to knock down something of the old building before we
can add the new. The same applies to theories. The process of extending (and rebuilding) theories takes place through the
continual development and testing of hypotheses.

Falsification and verification

Figure 3.3 suggests that an experiment (in principle) allows the researcher to 'put a tick' against the hypothesis and say 'this
hypothesis is right'. If so, the experiment supports (or verifies) the hypothesis and the researcher is justified in adding the
idea behind the hypothesis to their theoretical base.

If, on the other hand, the experiment says 'this hypothesis is wrong' this would be falsifying the hypothesis. The problem,
then, is this: the scientist not only cannot add the idea underlying the hypothesis to the existing body of theory, but they must
respond to the fact that, as the hypothesis is based on the existing theory, the falsification has implications for that too. In
effect, something must be wrong with the accepted knowledge that the theory represents.

The importance that scientists have attached to falsification (as opposed to verification) goes back to the 1930s, when Karl
Popper developed our current view of how theories are created and re-created. He argued that the process of science
(scientific method) should be to develop existing theories by the use of testable hypotheses, as already outlined here. The
aim is to create and refine 'correct' theories: that is, those that have not yet been proven wrong. In these terms evolution,
relativity, and quantum mechanics are all 'correct' in that, whilst accepted as still incomplete, none of them has been
contradicted by any significant experimental evidence. Wherever they do not adequately explain observed phenomena, no
better (more powerful) theory is available.

The process of developing, extending and knocking down theories is not symmetrical; it is possible to overturn a theory, but
it is impossible to make it absolutely correct. In Popper's terms you can never totally verify ('prove') a theory - but you can
falsify it with one conflicting observation. Many people (including some scientists) are uncomfortable with such an 'extreme'
view; it seems to provide little basis upon which to build a consistent theoretical edifice. In practice, we don't abandon an
entire body of theory when it is contradicted by one experimental result, but an accumulation of contradictions will
eventually force scientists to examine the fundamentals of what they currently believe to be correct. This makes the
development of scientific theories dynamic in a way that non-scientists find perplexing and perhaps a little scary, but it is
something with which all scientists have to come to terms.

Scientific "revolutions"

As Popper's views became more accepted, attention turned to the practical question of what happens when the entire fabric of
a major theory is seen by most scientists as so defective that it has to be replaced. How does this happen? Our current view
of this process is based on the work of the American physicist and historian Thomas Kuhn. In the 1960s Kuhn was studying
the period in which Isaac Newton can be said to have developed the basis of modern physics. In the Newtonian world
everything is determined by physical processes. In just a few decades Newton (along with Galileo, Kepler, Descartes,
Leibniz and others) developed the deterministic view of our world that now prevails. In certain areas (such as optics) most
of what Newton defined still (albeit modified and extended) forms the basis of our current theories.

Kuhn wondered how these new ideas had been able to sweep away some two thousand years of 'satisfactory' explanations,
based largely on Aristotle's views. In mechanics, for example, people accepted that things moved because they wanted (or
were forced) to get to somewhere else; they believed motion was either the 'desire' to move (as in animal and human
movement) or the unavoidable urge for an inanimate object to move to its 'natural' place (as when an object falls towards the
ground). Newton replaced this view - in mechanics and elsewhere - with a deterministic approach: things moved because
they were subject to predictable natural forces, such as gravity and friction.

Kuhn suggested that the process by which the Newtonian view of the world replaced the older view was an example of a
scientific revolution. In a relatively short period of time almost everybody in a particular field says 'Yes, we have been
wrong, this is a better way to look at things'. He applied the term paradigm to this 'way of looking at things' or 'world view'.
In any field there is at any period at least one dominant paradigm; this will eventually be replaced by a different, more
acceptable paradigm. This process of replacement is called a paradigm shift. For a scientific revolution to be generated there
must be major problems with the current theoretical structure. If no alternative structure is proposed, scientists will continue
in the current paradigm, doing the best they can. If, however, someone suggests a radically different paradigm that is as
powerful and consistent as the current view, explaining everything that the current theory does, but also resolving the problems that the
current theory cannot, it will be adopted by most scientists and will become the new paradigm.

Modern scientific theories are rarely made by one individual; there are thousands of scientists in any given field, all
contributing to its development. Of course, there are some scientists to whom the others look for direction and leadership,
and whose opinions count for more; in this sense science is not particularly democratic. The willingness of scientists to
adopt a new paradigm is the determining factor in the success (or otherwise) of a scientific revolution.

Logic and inference

One of the distinguishing features of scientific method is that it employs logic to develop and evaluate arguments. The basic
methods of logical analysis have not changed since they were formulated by Aristotle, although the system has been given a
rigorous mathematical basis (symbolic logic) in this century.

As described by Aristotle, logic is the method that allows us to construct arguments that are consistent and correct (in terms
of their structure, if not their contents). Such arguments are called valid arguments. We can create such arguments by the use
of inference.

Inference

The process of creating logical arguments involves drawing conclusions from premises. In this the premises are taken to be
'correct' and the 'truth' of the conclusions depends upon the logical structure of the inference through which they are derived.

Logical analysis of inference breaks all complex arguments into large numbers of simple arguments, whose 'logicality' can
be examined. The fundamental form of logical argument remains the syllogism. A syllogism is a simple structure that comes
in a number of forms. One of these is the 'classic' form that Aristotle employed: 'All men are mortal. Socrates is a man.
Therefore Socrates is mortal.' Other forms include 'If A is true, then B will be true. B is not true. Therefore A is not true'. In
principle we can reduce all complex arguments to a connected set of syllogisms.

It is important to note that Popper's asymmetrical view of testing hypotheses (one 'wrong' value can refute an hypothesis, but
no amount of 'correct' values can prove it) is based on the logical validity of certain syllogisms. If our theory is A, and the
hypothesis (prediction) we derive from it is B, we can define the basic structure of testing hypotheses as follows:

Falsification: If A is true, B will be true. B is not true. Therefore A is not true.

Verification: If A is true, B will be true. B is true. Therefore A is true.

Popper showed that the second argument is invalid; we cannot prove A (the theory) from the correctness or otherwise of B
(the hypothesis). The first argument, however, is a valid argument; we can show A (the theory) to be wrong by showing that B
(the hypothesis) is wrong.
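This asymmetry can be checked mechanically. The short sketch below (an illustration, not part of the original text) tests both argument forms against every combination of truth values for A and B: the falsification form holds in every case, while the verification form fails.

```python
from itertools import product

def implies(p, q):
    # material implication: 'if p then q' is false only when p is true and q is false
    return (not p) or q

# A is the theory, B is the hypothesis (prediction) derived from it
falsification_valid = all(
    # premises: (A implies B) and (B is false)  =>  conclusion: A is false
    implies(implies(a, b) and (not b), not a)
    for a, b in product([True, False], repeat=2)
)

verification_valid = all(
    # premises: (A implies B) and (B is true)  =>  conclusion: A is true
    implies(implies(a, b) and b, a)
    for a, b in product([True, False], repeat=2)
)

print(falsification_valid)  # True: the form (modus tollens) is logically valid
print(verification_valid)   # False: affirming the consequent is not valid
```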

Forms of inference

It is important to recognise that not all types of inference are equally strong, in a logical sense. The weakest is
circumstantial inference, where we attempt to generalise from a single instance: 'This shell is red, so all of them must be
red.' Somewhat stronger is material inference, where we draw conclusions (often couched in probability terms) based upon a
reasonably-large number of observations: '67% of measured shells were red, so 67% of all shells are red.' The latter is the
basis of inductive reasoning, and its logical limitations are the origin of the limitations of induction described earlier.
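The probabilistic character of material inference can be made explicit by attaching a margin of error to the observed proportion. The following sketch uses the shell example with hypothetical counts and a simple normal approximation:

```python
import math

# Hypothetical shell survey: 67 of 100 measured shells were red
red, measured = 67, 100
p_hat = red / measured

# Approximate 95% margin of error for the inferred proportion (normal approximation)
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / measured)
print(f"Estimated proportion of red shells: {p_hat:.2f} +/- {margin:.2f}")
```

The conclusion remains a statement about likelihood, not a certainty, which is exactly the limitation of inductive reasoning described above.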

The strongest form of inference is formal inference, in which the premises are not observational, as they are with the other
two forms, but axiomatic (that is, assumed to be true). The truth of the conclusions then rests not with the truth of the
premises, but with the correctness of the logical process used to create them. This form of inference is the basis of deductive
reasoning, and it is the superiority of its logical structure that leads us to regard deduction as the most effective way of
creating theories.

The purpose of experiments

Many people define science by its use of experiments, based on the popular iconography of the white lab coat and the test-
tube, but they are right for the wrong reasons. Experiments (in the broadest sense) are central to scientific method,
particularly as part of the deductive approach; experiments are how hypotheses are tested. We can define an experiment in
this context as ...

  any systematic process that allows the testing of an hypothesis, involving the collection and analysis of specified data.

One of the key characteristics of scientific experiments is the search for control and the consequent need for simplification.
Scientists studying complex systems usually want to 'simplify' them (in the experimental sense) so that they can focus on the
area in which they are interested. This can involve removing the influence of factors that are thought not to affect the factor
being investigated, or whose influence has already been studied and understood. (This is not to be confused with thinking
that such systems actually are simple!)

The problem is that in the real world there are always a very large number of variables that could possibly be relevant, and
therefore might be included. Part of the decision process in creating an experiment is deciding what should (and shouldn't)
be measured.

In biogeography, for example, we might be studying a particular plant species, but we will not be interested in simply
measuring its distribution (counting it). Explaining the distribution of the plant means that we will have to examine factors
such as the local microclimatology, soil nutrient distribution, animal grazing patterns, fire impact, and so on. Simplifying the
experimental process means that we might ignore factors that seem to be of low significance (such as grazing, if the study
area is in a national park where grazing is not allowed). Alternatively, we might try to control certain factors, such as
grazing, by fencing off all or part of the study area. Such techniques lead to controlled experiments; these are applicable in
many, but not all, branches of science.

Experiments and experimental design

The process by which we define exactly how a particular experiment is to be conducted is called experimental design. Such
a design involves completely specifying the components of the experiment, including

 the type of experiment to be used


 the sample and/or population to be studied, where a survey of some kind is being carried out
 the factors to be controlled in the experiment, if possible
 the variables to be measured during the experiment, including the scales of measurement and the 'apparatus' to be employed
 the type of analysis to be carried out on the collected data after the completion of the experiment
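One way to appreciate how much a design must specify is to write the components listed above down in a structured form. The following sketch is illustrative only; the field names and example values are assumptions, not a standard notation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExperimentalDesign:
    """A simple record of the design decisions listed above (illustrative only)."""
    experiment_type: str
    population: str
    sample_size: int
    controlled_factors: List[str] = field(default_factory=list)
    measured_variables: List[str] = field(default_factory=list)
    planned_analysis: str = ""

design = ExperimentalDesign(
    experiment_type="field trial",
    population="plants of a single species within the study area",
    sample_size=200,
    controlled_factors=["grazing (fenced plots)"],
    measured_variables=["soil moisture (%)", "plant height (cm)"],
    planned_analysis="analysis of variance between treatment sub-groups",
)
print(design)
```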

Controlled experiments

In one sense all experiments are controlled, in that they are distinct from the (non-participatory) observation process. In the
latter we observe events as they happen, but do not attempt to alter these events by our intervention. Our only decisions are
concerned with which events to observe, and how to record the results of those observations. In a more specific sense,
controlled experiments are where we consciously attempt to exert our influence over the course of events. This might lead us
to do the following, all of which can be defined as a type of controlled experiment:

 growing plants in a greenhouse so that we can exert (almost) total control over the environmental conditions under which
they grow
 building a scale model of an estuary in order to investigate the impact on tide patterns of a proposed construction, such as a
jetty
 collecting soil samples and subjecting them to 'shear tests' under variable degrees of water content to determine how
moisture affects their strength (and hence their propensity to slippage in rainy conditions)

In each case we are substituting a controlled environment for the natural world, with all its complexity and uncertainty. In all
cases the controlled experiment also has the aim of trying to simplify the experiment by removing extraneous factors. In the
greenhouse we remove competition from other species, and control the climate. In the scale model we remove factors that
operate at small scales (such as variations in the type of estuarine sediments). In the shear test we pre-process the soil to
ensure total control over the moisture content; in so doing we also remove factors such as variations in organic matter
(twigs, worms, bugs) and inorganic matter (small stones).

Laboratory experiments

The classic laboratory experiment is in many ways the ultimate form of controlled experiment. The laboratory benchtop,
with its complex constructions of tubes and wires, is a totally artificial environment. This is true even when it is intended to
have real world analogues, as in the laboratory investigation of systems in chemical engineering, such as petroleum
refineries. The apparatus is designed to simulate the processes that work in the real world, but in an idealised (and of course
simplified) manner.

Clinical trials

Clinical trials are used to study the effect of a particular process, such as a specific medical therapy. To do this we set up a
trial in which one group is given the treatment, and another is not. Obviously, most clinical trials are done in medicine and
pharmacology, where they are used in studies of the effectiveness of new drugs and clinical treatments. The organisation of
clinical trials is shown in Figure 4.1.

Figure 4.1: Clinical Trials

In all surveys the relation between the study sample and the population from which it is drawn is important; this is
particularly so in clinical trials, which are meaningless if the sample does not represent the target population. Finding (or
creating) a representative sample is not a trivial process, and we will examine it in more detail in a later section. In a clinical
trial the sample is split into two (usually equal-sized) groups: the study group and the control group.

In medical trials on humans the group may all be healthy volunteers, or they may be patients who have consented to take
part in the trial. In each case it is desirable (but not always possible for logistical or ethical reasons) to treat only the study
group and give the control group no treatment (or, in the case of a drug or therapy, give them a placebo). It is essential that
no member of the sample knows to which group - study or control - they belong (a 'blind' trial); when the researchers who
administer the treatments and assess the results also do not know, it is a double blind trial. The aim is to eliminate possible
psychological effects ('positive thinking') and unconscious bias in assessment that may interfere with the experiment; in
practice, neither group knows whether it is receiving the active treatment.
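A minimal sketch of the allocation step in such a trial is shown below, using hypothetical subject identifiers and group sizes; in practice allocation, blinding and the keeping of the code key are governed by strict protocols, so this is only an illustration of the idea:

```python
import random

# Hypothetical subject identifiers for a sample of forty volunteers
subjects = [f"subject_{i:03d}" for i in range(1, 41)]

random.shuffle(subjects)                  # random allocation
half = len(subjects) // 2
study_group = set(subjects[:half])        # will receive the treatment
control_group = set(subjects[half:])      # will receive the placebo

# Blinded allocation codes: the key linking code to group is held separately,
# so neither subjects nor assessors can tell the two groups apart.
allocation = {s: ("A" if s in study_group else "B") for s in subjects}
print(len(study_group), len(control_group))
```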

Field trials

The field trial was developed by naturalists in the last century, the classic example being Mendel and his studies on the
genetics of garden peas. Field trials are used today in plant breeding and animal husbandry to study the effect of some process
- different feed systems, for example - on the group. Scientists do not generally feel the same ethical considerations as faced
in clinical trials on humans, so do not need to use the double blind approach. The organisation of field trials is shown in
Figure 4.2.

Figure 4.2: Field Trials

Study and control groups are used but their structure is determined by the statistical techniques (notably analysis of
variance) that have been developed specifically to analyse the results of such trials. To this end the study and control groups
are usually split into two or more sub-groups, allowing us to analyse inter-group and intra-group variation; in turn this will
allow us to detect what effect (if any) the treatment has had.
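
To make the comparison of inter-group and intra-group variation concrete, here is a minimal sketch using SciPy's one-way analysis of variance routine (scipy.stats.f_oneway). The biomass figures are invented for illustration only; a real field trial would use a more elaborate design.

from scipy import stats

# Hypothetical biomass measurements (kg) for a control group and two
# treatment sub-groups receiving different feed systems.
control = [4.1, 3.9, 4.3, 4.0, 4.2]
feed_a  = [4.6, 4.8, 4.5, 4.9, 4.7]
feed_b  = [4.2, 4.4, 4.1, 4.3, 4.5]

# One-way analysis of variance: compares variation between groups
# with variation within groups.
f_stat, p_value = stats.f_oneway(control, feed_a, feed_b)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the treatment has had a detectable effect.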

Surveys

A survey is any procedure where we try to collect information about a population (of people, animals, plants, and so on).
Such a population may be finite or infinite.

Census

A census is any survey in which every member of the target population is counted and/or measured (at least in principle). Depending on the size of the population this may be a time-consuming and costly process, and all members might not be included. For example, the census with which most people are familiar (the census of the human population of a particular country) usually has a 'success rate' of only 90-95%; that is, up to 10% of the population are not surveyed (or exclude themselves from the process).

Sample

If the population is (effectively) infinite, or finite but too large for all its members to be studied practically, the survey is carried out on a sub-set of the population called a sample. The primary goal in sample surveys is representativeness: the sample must represent the population. This generally means that the sample should have a distribution of values for important characteristics that matches that of the whole population. We will examine in detail the process (sample design) by which good samples can be selected in a later section.

Longitudinal

A third kind of survey is the longitudinal survey. In such surveys a group of individuals (usually people) is chosen and we 'follow' them over time. This group may be the whole population or a sample. We may do this by surveying the group at regular intervals, or by observing their 'behaviour'. The major problems with longitudinal surveys are that people move around and may be difficult to track down, or they may choose not to continue to participate. Most longitudinal surveys end with many fewer participants than they started with.

Simulation

There are some situations in which experiments of the kind described are not feasible. For example:

 we are unable to design an experiment with the level of control we require


 the system exhibits great complexity and cannot be reasonably simplified (such as complex ecosystems)
 the system being studied is remote or inaccessible
 studying the system is physically dangerous
 the experiment cannot be carried out for ethical reasons

In these circumstances we may decide to use simulation. Simulation can be defined as ...

  an experiment that substitutes for 'real' conditions (i.e. produces an analogue to the real world). This may be in analogue
(analog models) or digital (computer simulation) form.

In the last couple of decades simulation has largely moved from physical (analogue) models to computer (digital) models.
Where thirty years ago we would have relied on building a physical model (such as an estuary) nowadays we are far more
likely to build a computer 'model'; the latter is almost certainly going to be easier (and cheaper) to create and run, and more
accurate. One other advantage that computer simulations have over physical models is that they are not limited by scale
problems. In hydrology, for instance, all physical models are limited by the fact that we cannot properly scale down certain
key elements of the model, notably sediment size and the viscosity of water.

The major problem with computer simulation is that we have to already know a great deal about the system we are studying
in order to model it, and that knowledge must be expressible as data and/or equations; this is not always possible. A good
example is simulation of large-scale economic and social systems. Compared with physical systems, our understanding of
these systems is quite rudimentary, and consequently our attempts at simulation have been relatively crude and inaccurate.
This is in stark contrast with, for example, large-scale climate modelling, where we have extensive knowledge of the physics
involved (fluid dynamics and global energy balance). The limitations of such models are connected primarily to the volume
of data required and the computational time needed to deal with the relevant equations, most of which are dynamic systems
that require time-consuming iterative solutions.
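
As a toy illustration of the iterative approach such models rely on, the sketch below steps a deliberately simple logistic population model forward in time. The growth rate, carrying capacity and starting population are arbitrary values chosen for the example, not parameters of any real system.

# A minimal digital simulation, assuming a simple logistic population model.
def simulate_population(p0=50.0, r=0.3, k=1000.0, years=20):
    """Iteratively step a dynamic system forward in time, one year per step."""
    population = p0
    history = [population]
    for _ in range(years):
        growth = r * population * (1 - population / k)  # logistic growth equation
        population += growth                            # advance one time step
        history.append(population)
    return history

for year, p in enumerate(simulate_population()):
    print(f"year {year:2d}: {p:7.1f}")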

In this section you'll look at the nature of scientific relationships which express meaningful connections between
phenomena. You will also learn how the quality of scientific data is evaluated, and how data can be categorised into
different types.

On completion of this section you should be able to:

 describe the kinds of deterministic relationships with which scientists work


 describe the form of the most significant relationships
 outline the major factors used in defining the quality of scientific data
 list the properties of the major scientific measurement types

Contents

 Relationships in science
o The role of determinism
 Forms of relationships
o Functional
o Linear
o Non-linear
o Empirical
 Data quality
o Accuracy
o Precision
 Measurement types
o Scales
 Nominal
 Ordinal
 Interval/Ratio

Relationships in science

When we say something is related to something else we mean that the value of that phenomenon changes as the other
phenomenon changes: as one increases, so does the other, for example. As you've seen, this does not necessarily imply that
changes in one are caused by changes in the other, although in science we are searching for such relationships.

A relationship in science is a 'shorthand' way of describing how variations in one part of a system affect other parts. Some
relationships are more complicated than others, but we can often reduce complex relationships to sets of simpler
relationships.

The role of determinism

The concept of determinism is fundamental to science. Deterministic relationships are those where (apart from possible
external 'noise') the value of one phenomenon is totally controlled (determined) by the value of another phenomenon or
phenomena. In principle, knowing the value of one allows us to accurately predict (allowing for noise) the value of the
other.

Such relationships are fundamental to physics and chemistry, but more contentious in the biological sciences, and their significance in the 'social sciences' is strongly disputed. In recent years the growth of interest in 'holistic' views of the world has led to a common view that there is something unsound about determinism, but in the appropriate domains determinism remains fundamental. To argue that the natural world is basically deterministic is substantially equivalent to saying that it is made up
of a set of functioning systems which in principle can be 'pulled apart' (analysed) and understood more clearly. This
approach is the basis of reductionism, in which we attempt to understand complex systems by breaking them down into a
number of smaller (and presumably simpler) sub-systems, which will be more amenable to analysis.

Forms of relationships

The major aim of reductionism is to simplify systems to the point where the individual relationships in the system can be
analysed. We should find that, if we pursue this process, we are able to model these relationships in simple ways, using
graphical or analytical (equation) formats. Such a relationship can be expressed in principle as an equation linking two (or
more) variables; such equations are almost always also expressible as graphs in two (or more) dimensions. A functional
relationship is one in which the value of some variable is a function of another variable. Although there are a large number
of possible relationships, scientists often deal with only a small number of types. We need to be able to identify and interpret
such relationships, so that we can detect their presence in empirical (observational) data generated by experiments and
surveys.

Functional

In many physical systems the functional relationships between pairs of variables are often expressible as simple patterns involving low-order equations. Such equations have a maximum degree of 1, 2 or perhaps 3. Those that have the dependent variable (y) related to the first power of x (that is, y is proportional to x^1) are said to be of first order, and are also called linear functions. All others (such as y proportional to x^2 or y proportional to e^x) are non-linear, and can take many different forms.

Linear

The most fundamental relationship is the linear relationship, as illustrated in Figure 5.1. The value of one variable (usually
drawn as the y-axis on two-dimensional plots like this) changes in direct proportion to changes in the other (x-axis) variable.
If the value of the y-axis variable increases as values in the x-axis variable increase we describe this as a positive
relationship (for example y = 2x). If the y-axis values decrease as the values on the x-axis increase, we call it a negative (or
inverse) relationship (for example y = 8-x). Each can be expressed as a simple linear equation in x and y.

{Note that the third 'relationship' shown in the Figure 5.1 (the horizontal line, y = 5) does not indicate a functional
relationship between the x and y variables, as the value of y is independent of changes in the value of the x-axis variable; y
is simply a constant.}

Figure 5.1: Linear [y = 2x; y = 5; y = 8-x]

Non-linear

Once we move 'beyond' linear relationships there are an enormous number of possible analytical relationships that we might expect to find. Figures 5.2 and 5.3 show two of the more common forms. The exponential relationship (y = e^x) in Figure 5.2 is one of a number of power relationships in which the value of the y-axis variable changes at a proportionately larger rate than the x-axis variable.

Figure 5.2: Exponential [y = e^x]

Figure 5.3 shows a simple periodic relationship (a sine wave) in which the variations in the y-axis variable change in a
'rhythmic' manner as the value of the x-axis variable changes. Whilst we will often find that periodic functions can be linked
to such simple trigonometric functions, we may also need to apply more complex functions. Nonetheless, the essence of
periodic functions is that the value of the phenomenon returns at some regular interval to the same value, and many such
functions are known in mathematics.

Figure 5.3: Trigonometric [y = sin(2x)]
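
The example relationships shown in Figures 5.1 to 5.3 can be generated directly. The short Python sketch below simply evaluates each function at a few x values so you can see how the different forms behave; the x values chosen are arbitrary.

import math

xs = [0, 1, 2, 3, 4]

linear_pos  = [2 * x for x in xs]          # y = 2x      (positive linear)
linear_neg  = [8 - x for x in xs]          # y = 8 - x   (negative/inverse linear)
exponential = [math.exp(x) for x in xs]    # y = e^x     (non-linear)
periodic    = [math.sin(2 * x) for x in xs]  # y = sin(2x) (periodic)

for name, ys in [("y = 2x", linear_pos), ("y = 8-x", linear_neg),
                 ("y = e^x", exponential), ("y = sin(2x)", periodic)]:
    print(name, [round(y, 2) for y in ys])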

Empirical

Such analytical forms are what we are looking for in the data we collect, as an indication of the form of relationship with
which we may be dealing. The problem is that the 'real' data will contain complications, such as the influence of unknown
factors (other related variables) and 'random noise'. The real task of data analysis is to detect (visually and/or statistically)
the simpler functional relationships underlying the complex empirical relationships revealed when we look at data.

Data quality

There are two unique properties of scientific data that we need to consider before we are in a position to collect data to
search for useful relationships. The first is the 'quality' of data, and the second is the 'type' of data.

In assessing the quality of data we are primarily interested in two concepts: accuracy and precision. Most people confuse
these two ideas, but their correct definition in scientific data collection is extremely important.

Accuracy

We define accuracy in science as ...

  an estimate of the probable error of a measurement (especially the average of repeated measurements) compared with the
'true' value of the property being measured.

As we do not (in principle) actually know the 'true' value of any phenomenon (including all so-called 'constants', such as the
mass of the electron), we are always forced to estimate the accuracy of a measurement. If the measured value is very close to
the 'true' value the measurement accuracy is very high; if the difference between the measurement and the 'true' value (the
measurement error) is large, the measurement is said to be of low accuracy. In general, as we make more measurements
(estimates) of a value we should get a more accurate estimate: 1000 measurements should be more accurate than 10. In
practice, this process is not linear; as the number of measurements increases the 'improvement' in accuracy diminishes to a
point where it is not worthwhile making more estimates.

It is important to realise that accuracy is a property of the measurement itself, not the apparatus (or experiment) with which
we generate it.

Precision

The term precision is used in science in two ways:

 as an indication of the 'spread' of values generated by repeated measurements


 as the number of significant figures with which a measurement is given

The former definition is the more significant in terms of data quality. If the distribution (spread) of measurements is wide
(that is, the measurements are very variable) then we have low precision; if the measurements have a narrow spread (all
similar) then we have high precision.

If we find that we have a high precision but low accuracy, the apparatus is producing a consistent - but wrong - result. We
call this a calibration error.

Precision is a property of the experiment and/or apparatus that is being used to generate the measurements. Clearly we look
for an appropriately-high precision experiment that is also accurate.
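
The distinction can be illustrated with a small, artificial example: two imaginary instruments measure a quantity whose 'true' value we have set ourselves (something we can never do in practice). The mean error indicates accuracy, and the spread of repeated readings indicates precision.

from statistics import mean, stdev

true_value = 10.0
instrument_a = [10.1, 9.9, 10.2, 9.8, 10.0]    # scattered around the true value
instrument_b = [10.9, 11.0, 11.1, 10.9, 11.0]  # tightly clustered but offset

for name, readings in [("A", instrument_a), ("B", instrument_b)]:
    error = mean(readings) - true_value   # accuracy: closeness to the 'true' value
    spread = stdev(readings)              # precision: spread of repeated readings
    print(f"Instrument {name}: mean error = {error:+.2f}, spread = {spread:.2f}")

# Instrument B is more precise (smaller spread) but less accurate
# (a consistent offset) - the calibration error described above.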

Data types

There are several ways that we can define the type of data. At the simplest level we define data as being qualitative or quantitative. Qualitative data are non-numerical 'assessments' (e.g. thick, thin, slow, fast), rather than the numerical 'measurements' of quantitative data.

Data can also be defined in terms of whether they represent some intrinsic property of the things being measured, or are
arbitrary in that we in some sense 'create' them during the process of making measurements. In this distinction the mass of
an electron is an intrinsic value, whose numerical form is controlled by the scale used to define it (such as grams). The heat
content (total thermal energy) of an object is also an intrinsic (albeit variable) property of that object - but its temperature is
an arbitrary value that depends on the measurement process and scale we employ. The third way of defining data type is
through the idea of measurement scales.

Scales

We usually define four levels of data scale. When we talk about a scale as being 'lower' than another we are saying that the
range and power of the forms of analysis that we can apply to that scale are more limited - as we shall see when we examine
the basic techniques of data summarisation. The techniques that can be applied to the lower levels can also be applied to
higher levels, but not vice versa. This means, for example, that we can use the mode when analysing interval data, but we
cannot apply the mean to nominal data.

Nominal

The nominal data scale is the lowest level of data. We can only obtain quantitative information by counting the number of occurrences in each category; all we have is the number in each category. It makes no difference in what order the categories occur, as there is no intrinsic rank or order. Note that the numbers themselves are not the nominal scale; they are simply counts.

Eye colour Number


Blue 234
Green 123
Grey 56
Other 32

Ordinal

Ordinal data are distinguished from nominal data by having what we might term sequence. This means that with ordinal data much of the structure is lost if the categories are reorganised. In the example below it is reasonable to say that grade is an ordinal scale because fail/pass/credit/distinction form a sequence that would not make sense in any other order. (Distinction is 'greater than' credit, in a non-numerical sense.) Again, the grade is the ordinal scale and the numbers are simply counts.

Grade Number
Distinction 12
Credit 25
Pass 47
Fail 14

Interval/Ratio

The most sophisticated data scales (in terms of the data analysis techniques that can be applied to them) are interval and ratio. Most of what we usually think of as numerical data fall into one of these two (very similar) classes. With interval data, equal intervals on the scale represent equal differences in the property being measured; the difference between twenty and thirty is the same as the difference between any other pair of values that are ten units apart, anywhere on the same scale.

With ratio data the ratio between any two values is also meaningful (forty is twice twenty); this is possible because ratio scales have a single, absolute zero. Thus temperatures in Centigrade or Fahrenheit are measured on interval scales, whereas temperatures in degrees Absolute (Kelvin) are measured on a ratio scale. In terms of data analysis techniques, there is no particular advantage in using one rather than the other.

Temperature (ºC)
12.3
13.4
16.9
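
The practical consequence of the scale hierarchy is that different summary statistics are legitimate at different levels. The sketch below, using the figures from the tables above, applies the mode to nominal data, the median to ordinal data, and the mean to interval data.

from statistics import mean, mode

# Nominal: categories with no order - only counts and the mode are meaningful.
eye_colour = ["Blue"] * 234 + ["Green"] * 123 + ["Grey"] * 56 + ["Other"] * 32
print("Most common eye colour:", mode(eye_colour))

# Ordinal: categories with a sequence - the middle (median) grade is meaningful.
grades = ["Fail"] * 14 + ["Pass"] * 47 + ["Credit"] * 25 + ["Distinction"] * 12
order = {"Fail": 0, "Pass": 1, "Credit": 2, "Distinction": 3}
ranked = sorted(grades, key=order.get)
print("Median grade:", ranked[len(ranked) // 2])

# Interval/ratio: numerical values - means and differences are meaningful.
temperatures_c = [12.3, 13.4, 16.9]
print("Mean temperature (deg C):", round(mean(temperatures_c), 1))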

In this section you'll look at the various forms of data sources that may be useful to scientists. These can be classified as
primary (generated by the researcher) or secondary (generated by someone else). The most significant secondary data
sources include official sources (especially those from government departments) and unofficial sources. You'll also look at
the way the Australian census is carried out, and at the advantages and disadvantages of using secondary data.

On completion of this section you should be able to:

 describe the differences between primary and secondary sources


 list the major types of secondary data, including the main unofficial sources
 outline the major sources of official secondary data
 describe the structure and operation of the Australian census
 explain the advantages and disadvantages of using secondary data

Contents

 Primary and secondary data


o Uses of secondary data
o Relative roles of primary and secondary data
 Scenario 1: Effect of fire on vegetation regrowth
 Scenario 2: Contaminants and plant growth
 Scenario 3: Contaminants and children's growth
 Types of secondary data
o Unofficial secondary sources
 Private research results
 Research reports, research papers, textbooks
 Opinion polls
 Market research
 On-line databases
 Anecdotal/hearsay
o Official secondary sources
 Government organisations
 ABS Periodic surveys
 Census
 Structure
 Timing
 Population
 Organisation
 Analysis
 Presentation
 Geographical hierarchy
 Variables
 Demographic
 Economic
 Household
 Type of output
 Maps
 Summary Tables
 Cross-tabulations
 Pro's and con's of secondary data
o Advantages
o Disadvantages

Primary and secondary data

There is a basic distinction in data collection between primary and secondary data. Primary data are data collected by the
immediate user(s) of the data expressly for the experiment or survey being conducted. It is this data that we will normally be
referring to when we talk about "collecting data".

By contrast, secondary data refers to any data collected by a person or organisation other than the user(s) of the data. Where
does such data come from? If, as we have suggested, a wide variety of individuals and organisations actually collect data, it
follows that at least some of that data will come to be made available to other individuals and organisations. This data may
be of considerable value, although the exact value will depend upon the type of study being carried out.

Uses of secondary data

What are the advantages of using secondary data? In other words, why don't researchers always collect their own data? There are actually several very good reasons why secondary data are used:

 Secondary data may provide a context (geographic, temporal, social) for primary data. This allows us to see where our primary data 'fit in' to the larger scheme of things.
 Secondary data may provide validation for primary data, allowing us to assess the quality and consistency of the primary data.
 Secondary data may act as a substitute for primary data. In some situations we may simply not be able to collect data, for
reasons of access, cost, or time; or the data have been collected once and to repeat the collection process would be
undesirable.

Relative roles of primary and secondary data

Most studies in the natural and social sciences will involve a mixture of primary and secondary data sources. One way of
grasping the relative uses of these two types of data is to examine the way in which primary and secondary data would be
used in a series of 'typical' research projects.

Scenario 1: Research into the effect of fire on vegetation regrowth

We wish to map and quantify vegetation patterns before and after controlled burning in a small catchment; comparison of
the patterns would allow us to determine the effects of fire on vegetation regrowth.

To achieve this we would need data about vegetation patterns before and after the fire. Ideally this would come from
identically-organised vegetation surveys, but for practical reasons we may not be able to do this. Even so, we would see our
primary data as deriving from vegetation surveys, preferably using quadrat analysis to incorporate quantitative information
about the distribution, density and quality of the vegetation.

We may, however, be forced to use purely descriptive vegetation surveys. Several sources of useful secondary data may be
available to us. There may be published vegetation maps, perhaps resulting from previous academic studies, from
environmental impact statements, or as part of a larger land survey process (such as the series of surveys of Crown Land in
Victoria carried out in the 1970s and 1980s by the Land Conservation Council). We may also be able to use 'anecdotal'
descriptions of the vegetation, provided by walkers, farmers, naturalists and conservation officers familiar with the area.
Some of this data may be written down, but we should also be prepared to use verbal evidence (with appropriate caution).
These data may provide context for the primary data, act as validation for it, and even substitute for it.

Scenario 2: Research into the effect of contaminants on crop growth rates

We wish to investigate how a particular food crop (barley) reacts to small levels of contaminant (such as lead) in the soil. To this end we would carry out a field trial to determine the impact of trace elements of lead on growth.

Our field trials (probably greenhouse-based) will produce the primary data, by quantifying the extent of crop growth
(measured in terms of total biomass) under varying degrees of contaminant in the soil.

Secondary data will come from previously published studies on the environmental effects of heavy-metal contaminants, thus
providing context and validation.

Scenario 3: Research into the effect of contaminants on children's growth (physical and/or mental)

This scenario provides more complex data requirements (for both scientific and ethical reasons) than the previous scenario,
despite the apparent similarities. We need to be able to quantify children's physical and intellectual development over time, a
notoriously difficult task in its own right - and then combine this with the politically sensitive topic of pollution.

Whilst we might (fleetingly!) consider how we might carry out a 'controlled' experiment in which we would vary the exposure of groups of children to contaminants and measure progressive changes in their performances on a set of tests, this
would of course be ethically unacceptable (and illegal). Two alternative experimental designs present themselves, taking
advantage (as it were) of the high probability of measurable variations in existing pollution patterns; such designs are typical
of data collection in epidemiology:

 We could select a group of children (by appropriate sampling techniques), measure their physical and mental development
and their exposure to contaminants over time, and attempt to correlate variations in development with levels of exposure.
 We could identify regions with high exposure levels and compare children's performance within these regions, and between
these regions and other regions which have markedly lower exposure levels. Whichever design is chosen, our primary data
will be based on performance scores in standard mental and physical tests, and measures of physical development.

We would expect to find relevant secondary data in published studies of the effects of contaminants on children's
development elsewhere in Australia, and in other parts of the world. Given known links between exposure to particular
pollutants and health risks, we would also expect to find useful data in official statistics, particularly those dealing with
morbidity and mortality distributions.

Finally, although we would need to treat it with circumspection, we would expect to use anecdotal evidence, for example to
guide us in selecting the regions for more intensive investigation.

Types of secondary data

One of the subtle skills of information management is knowing what secondary data is (or might be) available in a given
field. Expert researchers develop a comprehensive knowledge of (sometimes obscure) secondary data sources. Willingness
to use such data appropriately is also a hallmark of good research.

We can loosely divide secondary data sources into two categories: official and unofficial. Official secondary data comprise
all information collected, processed and made available by legally constituted organisations, primarily by government
departments and statutory authorities. Unofficial secondary data comprises all other forms of secondary information sources.

Unofficial secondary sources

Apart from the plethora of material generated by the official sources - which we will consider next - there are a large number
of potentially-useful secondary sources that we may call unofficial sources.

Private research results

Many corporations and private consultants generate potentially useful secondary data. The major problem is access: the
commercial confidentiality that is usually thought to attach to such material means that they are rarely available for public
scrutiny.

Research reports, research papers, textbooks

The whole gamut of academic research and publication forms a major body of secondary data sources; we might even see
textbooks as 'tertiary' sources. The expectation that research will be freely published means that most research is made
available (although not always accessible). It would be a mistake, however, to see academic research as in some sense
necessarily purer or more objective than research carried out in the commercial or governmental sectors.

Opinion polls

There are times when we seem to be living in a world governed by opinion polls, rather than by people: 'We can't do this because the polls say people don't want it'. Nevertheless, the process of collecting public opinion via surveys and
questionnaires has been developed to a high level of sophistication in recent decades, and the results of such surveys (if
made public) can be an important contextual source.

Market research

Like opinion polls, most market research is carried out by private organisations on behalf of specific clients. Depending on
the requirements of the client, the results of such surveys may or may not be made public. The commercial orientation of
most surveys tends to limit their general applicability.

On-line databases

Increasingly, secondary sources are being made available (often at a cost) in electronic form. These data may range from
bibliographic information (on-line reference databases such as Dialog) to census data (such as CDATA96, a CD-ROM that
allows the user to find, map and display major results from the Australian census).

Anecdote/hearsay

A good story is hard to ignore. Making due allowance for exaggeration and hyperbole, we will sometimes find that we need
to use anecdotal information ('I personally saw five leaks from the reservoir') or hearsay ('someone told me that…'). Such
sources are obviously best used as 'backup' to other, perhaps more reliable, sources; but there will be occasions when they
represent the only source open to us.

Official secondary sources

Official statistics abound in all developed societies. Governments collect data about their operations, usually for valid
reasons, such as for efficient long- and short-term planning.

Government organisations

Individual government departments - both state and federal - collect statistics about their area of responsibility, and many
are published in some form. This may include an 'annual report' tabled in Parliament (and incorporated in the state and
federal government papers), regular statistical analyses, or occasional reports. Similarly, statutory authorities such as Telstra
or Melbourne Water are also required to report to federal or state parliament, but increasingly present data about their
operations as a component of the public relations process. This also includes organisations like the federal Australian
Geological Survey Organisation, the state geological and mining surveys, and the state agriculture and conservation
departments. It also includes, of course, the various divisions of the Commonwealth Scientific and Industrial Research
Organisation (CSIRO).

ABS Periodic surveys

The premier generator and collator of publicly available secondary data is the Australian Bureau of Statistics (ABS). This federal statutory authority is responsible for the independent collection and (limited) analysis of statistics, particularly in the economic and social fields. The principal output is a large number of periodic surveys, issued in various forms on a weekly, monthly, quarterly or annual basis (depending on the data). The following table is a summary of the major types of periodic data collected by the ABS.

Demography                        Population estimates and trends; Vital statistics; Migration
Social Statistics and Indicators  Education; Welfare; Health (morbidity and mortality)
National Accounts                 Balance of payments; Foreign trade; Fiscal policy; Public service accounts
Labour and Prices                 Labour force; Income; Prices
Agriculture                       Livestock; Crops; Fishing; Agricultural practices
Industry                          Manufacturing; Mining; Gas and electricity (production and usage); Retail activity; Tourism; Building and construction
Transport                         Motor vehicle production and sales; Freight and passengers carried; International arrivals and departures; Accidents

Census

The Australian Bureau of Statistics is also responsible for the largest regular exercise in secondary data collection, the
Australian Census of Population and Dwellings. This provides such a wealth of social and economic data, and is so widely
used as a source of background data in geography, sociology and economics, that all researchers should have a solid
understanding of its organisation and contents.

Structure

Like all national censuses, the Australian Census is a regular survey of the entire population (in principle) at a given
moment; it forms a sort of 'social snapshot' of national characteristics. Public awareness of, and reaction to, the census has
fluctuated in recent years as issues of privacy and reliability have surfaced, but it remains a crucial step in measuring our
development as a country.

The type and quality of the data provided by the census has varied markedly over the years. In the nineteenth century more 'intrusive' collection techniques - often involving the local constabulary - led to higher response rates than we have today (when the survey, whilst supposedly obligatory, is acknowledged to be less than universally completed). Less emphasis on personal privacy and sensitivity also meant that the censuses of 1881 and 1891, for example, contain data (about disabilities and mental conditions, for example) that we know would not be supplied today, given public and media scrutiny, and public participation in the design of the census.

The contents of the census are determined by a consultative process between the ABS, potential users of the census data
(such as government departments, statutory authorities, researchers and consultants), and interested community groups.
Whilst the process is supposedly independent of political influence, it is unlikely that politically sensitive material would be collected (as it was in some past censuses).

Timing

The first national census was held in 1851 as a co-operative effort between the then-colonies. From then until 1971 it was
held every ten years, becoming the responsibility of the federal government after federation. Since 1981 the census has been
held every five years (the 1976 census was a 'sample census' designed to test the viability of moving to a five-year period).
This shortening of the inter-censal period has been made possible, as elsewhere in the world, by improvements in the
processing of the data, largely due to computerisation. Where in the 1950s and 1960s the processing of the results from one
census took most of a decade (that is, up to the start of the next census), it is now done in less than three years, leaving
plenty of time for the planning and organisation of the next census.

Population

The target population of the census is every person (citizen, resident or visitor) in Australia on 'census night' (6th August in
1991). They are enumerated by location; that is, they are counted at the place where they actually are on census night, even
if this is not their home ('place of usual residence'). If they are away from home on that night (which includes all visitors
from overseas, as well as interstate travellers) their place of usual residence is also recorded. The census also records their
place of residence at the last census; this allows us to measure the amount and type of population movement (internal
migration, immigration and emigration) between censuses.

Organisation

There is one census form for each household, to be filled in by the 'head of the household' (the definition of this position was
the cause of considerable dispute in the 1970s, as it was implied in the language of the census form that this would be a male
member of the household). The definition of a household is broad enough to encompass all types of conventional 'families'
(the classic 'nuclear family', single parents, adults with no children, and so on) in various domiciles (houses, flats, units,
caravans, etc.), as well as domestic structures such as nursing homes, hospitals and hotels. The limitations of the definition
of a household (amongst other, less acceptable causes) is one reason why the census has always had problems obtaining an
accurate estimate of the Aboriginal population.

The physical distribution and collection of forms is carried out by trained census collectors (usually teachers, public servants
or local volunteers). Each collector is responsible for all the 'households' in a collection district (usually about 1000); for
increased confidentiality, the district is rarely the one in which the collector actually lives. As well as distributing and
collecting the forms, the collector is allowed to assist householders in completing them (and also fill in certain sections). The
collector also records summary information about their district. The collection district (as we shall see) is the basic unit for
aggregation of census data.

Census forms are converted into electronic form (coded), stored for a specific period, then carefully pulped, to ensure the
confidentiality of the data. The processing of the data (particularly by aggregation) is done in such a way that it is impossible
to identify the data referring to any individual household.

Analysis

All the data collected in the census are aggregated to a level that prevents identification of the characteristics of individual
households (and, by inference, individuals); if necessary, random numbers are used in the tabular presentations wherever
small numbers - usually less than 5 - are found (see the later section on cross-tabulations).
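
The idea of protecting small cells can be sketched as follows. The threshold of 5 and the replacement values used here are illustrative assumptions only, and do not represent the ABS's actual confidentialisation procedure.

import random

def confidentialise(counts, threshold=5, seed=1):
    """Replace cell counts below the threshold with small random values
    so that individuals cannot be identified (illustrative sketch only)."""
    rng = random.Random(seed)
    return [c if c >= threshold else rng.choice([0, 3]) for c in counts]

raw_cells = [234, 2, 56, 4, 17]
print(confidentialise(raw_cells))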

The primary analysis and presentation of census data is based on geographical distribution of population at different scales:

 collection district
 statistical district
 state
 nation

At its most basic the 'analysis' consists of the generation of large numbers of tabulations (counts): how many people work in
construction in Queensland, how many households in Adelaide have two or more cars, and so on. These tables are prepared
for most levels of spatial aggregation. Where appropriate these data are presented in map form. The analysis usually
proceeds from the largest scale downwards, with national results being produced first, followed by the results for each state,
and so on.

Presentation

The census results are made available in a variety of formats:

 summary reports, tables and analyses at the national and state level
 maps and social atlases
 detailed cross-tabulations for specific variables
 special-purpose cross-tabulations generated at the request of researchers and organisations

Geographical hierarchy

The central organising principle of the census is the geographical hierarchy of regions into which data are aggregated,
although not all data are available at every level of the hierarchy. The more detailed cross-tabulations are usually only
generated for regions larger than local government areas. In approximate order from smallest to largest, these are the major
elements of the hierarchy:

 Households
 Collection Districts (CDs)
 Postcodes
 Journey-to-work regions
 Legal Local Government Areas (LLGAs) or Statistical Local Areas (SLAs)
 Urban centres
 Commonwealth Electorates
 Statistical Sub-divisions (SSDs)
 Statistical Divisions (SDs)
 Statistical Regions
 States and Territories
 Australia

Variables

The range of data collected by each census is not constant, but the overall structure remains fairly consistent. In any census
we would expect a large number of questions about the characteristics of the population, but the census is also concerned
with the physical characteristics of the dwellings that people occupy. For convenience the types of information sought by a
'typical' census can be divided into three major categories:

Demographic

Information about the 'personal' characteristics of individual members of the population:

 age
 marital status (married, divorced, never married…)
 usual residence (address)
 religious denomination
 family type (mother+father+children, mother+child…)
 educational attendance (secondary school, TAFE, university…)
 age left school
 qualifications (level)
 birthplace
 competency in English

Economic

Information about the economic characteristics of the individual members of the population:

 income (annual amount)


 occupational status (self-employed, unemployed, not in the workforce…)
 industry sector (private sector, public sector…)
 hours worked (per week)
 mode of travel to work (private car, bus, train…)
 occupation (professional, clerical, administrative…)
 industry (public service, mining, manufacturing…)

Household

Information about the characteristics of the household unit:

 structure (separate house, flat, unit…)


 number of occupants
 number of rooms and bedrooms
 occupancy status (owner, rent…)
 mortgage (amount, payments)
 rent (payments)
 vehicles (number, how garaged)

Type of output

The Bureau of Statistics makes the results of the census available in a number of formats:

Maps

A wide variety of maps is produced, at scales ranging from individual collection districts to the whole country. Over the last
fifteen years the ABS has emphasised the production of a series of social atlases that present detailed maps (at CD level) of
the distribution of key variables. These are usually prepared by academic researchers with a specific knowledge of the area,
with the ABS supplying the data, computer resources and mapping facilities. The commentaries are more extensive (and
more analytical) than those prepared by the Bureau itself.

Available initially only for the state and territory capitals, there are now atlases for all the major metropolitan areas, and new
ones are produced after each census.

Summary Tables

The major (non-geographical) format for presenting census data is in tables. The following is
an (artificial) example of a table of the number of people in different industry sectors. Note
that most tables present separate values for males and females, but only some give a total.

Industry Sector Males Females

Australian Government 289 191

State Government 79 54

Local Government 569 403

Private Sector 4329 2560

Not Stated 45 12

Cross-tabulations

Census data lends itself to presentation in what are known as cross-tabulations. These are tables in which the values of one
variable (such as age) are displayed against the values of another variable (such as birthplace). The numbers in the cells of
the table represent counts of how many people were enumerated (in the particular region to which the table refers) as having
the two specific properties (for example, Australian citizens aged between 15 and 20). If we cross-tabulate two variables we
generate a two-way table; if we cross-tabulate three variables we get a three-way table, and so on. To present cross-
tabulations of three or more variables on paper we need to split the table into 'layers' and show each layer separately.

The following (artificial!) example shows a typical two-way cross-tabulation of citizenship by age; but as the table contains separate values for males and females, we could also regard it as a three-way cross-tabulation, of citizenship by age by gender.

  0-17 years 18-64 years 65 years and over

Citizenship Males Females Males Females Males Females

Australian 35 56 289 341 68 89

Other 26 23 79 84 34 54

Not Stated 8 3 5 3 7 2
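
Cross-tabulations of this kind are straightforward to reproduce from unit-record data. The sketch below uses the pandas library's crosstab function on a handful of invented census-style records to build first a two-way and then a three-way table.

import pandas as pd

# A few invented census-style records (citizenship, age group, gender).
records = pd.DataFrame({
    "citizenship": ["Australian", "Australian", "Other", "Australian", "Other", "Australian"],
    "age_group":   ["0-17",       "18-64",      "18-64", "65+",        "0-17",  "18-64"],
    "gender":      ["F",          "M",          "F",     "M",          "M",     "F"],
})

# Two-way cross-tabulation: citizenship by age group.
print(pd.crosstab(records["citizenship"], records["age_group"]))

# Three-way cross-tabulation: gender added as a second column 'layer'.
print(pd.crosstab(records["citizenship"], [records["age_group"], records["gender"]]))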

Pro's and con's of secondary data

We have seen that primary and secondary data both have a role in the data collection process, and that good research
technique includes knowing when to use them, and from where secondary data can be obtained. Some researchers appear to
avoid secondary data, whilst others seem to rely upon it too heavily; neither group has properly assessed the advantages and disadvantages of using such data.

Advantages

The main advantages of using secondary data are as follows:

 As secondary data is usually available more cheaply, we get 'more bytes per buck'. The collection of secondary data is
generally significantly quicker and easier (and hence less costly) than collecting the same data 'from scratch' - even if that
were possible.
 Existing data are likely to be available in more convenient form, involving dial-up access rather than dust removal.
 Using secondary data can give us access to otherwise-unavailable organisations, individuals or locations.
 Secondary data allows the researcher to extend the 'time base' of their study by providing data about the earlier state of the
system being studied. If a scientist is conducting a study of the impact of faunal relocation programs as a conservation
strategy, for example, he or she can effectively extend the period of time covered by the research by comparing the collected
data with data from earlier studies in the same region or on the same topic elsewhere.
 The fact that secondary data are likely to be pre-processed eliminates the time-consuming (and hence costly) analysis stage.

Disadvantages

The main disadvantages of using secondary data are as follows:

 The method by which secondary data were collected is often unknown to the user of the data (apart from major sources like
the Census). This means that the researcher is forced to rely on the skills and propriety of the collectors - usually, but not
always, a safe proposition.
 With secondary data we may have little or no direct knowledge of the processing methods employed, and we will rarely
have access to the original raw data to check them. Again, we are forced to rely on the skills and integrity of the people who
collected and analysed the data.

In this section you'll learn how sample surveys can be organised, and how samples can be chosen in such a way that they will give statistically reliable results. You will also see how, given a certain amount of knowledge about what is being surveyed, an appropriate sample size can be decided upon.

On completion of this section you should be able to:

 describe the relationship between censuses and samples


 list the sequences of stages involved in developing a sample survey
 explain the formulation and operation of a sample frame
 define the major types of sampling methods, using both probability and non-probability techniques
 describe the major forms of spatial sampling for selecting samples from phenomena that vary across the landscape
 explain the major factors that govern the calculation of minimum sample size, and the basic formulae for deriving sample
sizes

Contents

 Censuses and samples


o Sample surveys
 Development of a sample survey
 The sample frame
o Definitions
 Sampling methods
o Non-probability sampling
o Probability sampling
o Simple random sampling
o Stratified sampling
o Systematic
o An example
 Simple random
 Stratified random
 Systematic
o Quota sampling
 Spatial sampling
o Random
o Stratified
o Systematic
 Sample size estimation
o Assumptions
o Formulae
o Estimating population variance

Censuses and samples

In simple terms, you could call any data collection process that is not a controlled experiment a survey. Given the ethical
problems associated with studying people (and the different issues in studying animals) much research in these fields is
based on surveying rather than experimentation. In these terms a census is a survey whose domain is the characteristics of an entire population: any study of the entire population of a particular set of 'objects'. This would include Eastern-barred bandicoots in western Victoria, human residents of Heidelberg, or the number of Epacris impressa plants on a single
hillside in Garriwerd National Park. In each case there is a finite number of objects to study - although we may not know
that number in advance and, as you'll see in a later section, we may need to estimate that number. On the other hand, we
cannot in the same sense carry out a 'census' of the atmosphere or the soil.

If we are able (or we choose) to collect, analyse or study only some members of a population then we are carrying out a sample survey. If the total population is finite and known, or continuous (infinite), then what we are doing is defining some proportion of the total population to study; we are therefore creating a sample.

Sample surveys

Why do we use sample surveys? We have no choice when the population is continuous (that is, effectively infinite), but we
can define a sample from either a finite or an infinite population. Surveys are done for several reasons. A sample survey
costs less than a census of the equivalent population, assuming that relatively little time is required to establish the sample
size. Whatever the sample size, there are 'establishment costs' associated with any survey. Once the survey has begun, the
marginal costs associated with gathering more information, from more people, are proportional to the size of the sample. But
remember that surveys aren't conducted simply because they are less expensive than a census: they are carried out to answer
specific questions, and sample surveys answer questions about the whole population. Researchers are not interested in the
sample itself, but in what they can learn from it that can be applied to the whole population.

A sample survey will usually offer greater scope than a census. This may mean, for example, that it's possible to study the
population of a larger (geographical) area, or to find out more about the same population by asking a greater variety of
questions, or to study the same area in greater depth.

Development of a sample survey

Of course, whatever the potential advantages of sample surveys, they will not be realised unless the sample survey is
correctly defined and organised. If we ask the wrong 'people' the wrong 'questions' we will not get a useful estimate of the
characteristics of the population. Sample surveys have advantages provided they are properly designed and conducted. The
first component is to plan the survey (and select the sample if this is a sample survey) using the correct methods. The
following are the main stages in survey and sample design:

 State the objectives of the survey

If you cannot state the objectives of the survey you are unlikely to generate usable results. You have to be able to formulate something quite detailed, perhaps organised around a clear statement of a testable hypothesis. Clarifying the aims of the survey is critical to its ultimate success.

 Define the target population

Defining the target population can be relatively simple, especially for finite populations (for example, 'all male students enrolled
in the unit SCC171 at Deakin University in 1994'). For some finite populations, however, it may be more difficult to define what
constitutes 'natural' membership of the population; in that case, arbitrary decisions have to be made. Thus we might define the
population for a survey of voter attitudes in a particular town as 'all men or women currently on the electoral roll' or 'all people
old enough to vote, whether they are enrolled or not' or 'all current voters (on the electoral roll) and those who will be old enough
to vote at the next election'. Each of these definitions might be acceptable but, depending on the aims of the survey, one might be
preferable.
As suggested earlier, the process of defining the population is quite different when dealing with continuous (rather than discrete)
phenomena. As you will see, it is still possible to define a sample size even if you don't know the proportion of the population
that the sample represents.

 Define the data to be collected

If we are studying, for example, the effect of forest clearance on the breeding process of a particular animal species, we obviously
want to collect information about the actual changes in population size, but we will also want to know other things about the
survey sample: how is the male/female ratio affected, is the breeding period changed, how are litter sizes affected, and so on. In
searching for relationships (particularly those that may be causal) we may find interesting patterns in data that do not, at first
glance, seem immediately relevant. This is why many studies are 'fishing expeditions' for data.

 Define the required precision and accuracy

The most subjective stage is defining the precision with which the data should be collected. Strictly speaking, the precision can only be correctly estimated if we conduct a census. The precision provided by a sample survey is an estimate of the 'tightness' of the range of estimates of the population characteristics provided by various samples.

When we estimate a population value from a sample we can only work out how accurate the sample estimate is if we actually
know the correct value - which we rarely do - but we can estimate the 'likely' accuracy. We need to design and select the sample
in such a way that we obtain results that have acceptable precision and accuracy.

 Define the measurement 'instrument'

The measurement instrument is the method - interview, observation, questionnaire - by which the survey data is generated. You
will look in detail at the development of these (and other) methods in later chapters.

 Define the sample frame, sample size and sampling method, then select the sample

The sample frame is the list of people ('objects' for inanimate populations) that make up the target population; it is a list of the
individuals who meet the 'requirements' to be a member of that population. The sample is selected from the sample frame by
specifying the sample size (either as a finite number, or as a proportion of the population) and the sampling method (the process
by which we choose the members of the sample).

The process of generating a sample requires several critical decisions to be made. Mistakes at this stage will compromise -
and possibly invalidate - the entire survey. These decisions are concerned with the sample frame, the sample size, and the
sampling method.

The sample frame

The creation of a sample frame is critical to the sampling process; if the frame is wrongly defined the sample will not be representative of the target population, and it is likely that most ineffective surveys arise from poorly-specified sample frames. The frame might be 'wrong' in three ways:

 it contains too many individuals, so that the sample frame contains the target population plus others who should
not be included; we say that the frame's membership has been under-defined
 it contains too few individuals, so that the sample frame contains the target population minus some others who
ought to be included; we say that the frame's membership has been over-defined
 it contains the wrong set of individuals, so that the sample frame does not necessarily contain just the target
population; we say that the frame's membership has been ill-defined

Creating a sample frame is done in two stages:

1. Divide the target population into sampling units. Examples of valid sampling units might include people (individuals),
households, trees, light bulbs, soil or water samples, and cities.
2. Create a finite list of sampling units that make up the target population. For a discrete population this will literally be a list (for example, of names, addresses or identity numbers). For a continuous population this 'list' may not be specifiable except in terms of how each sample is to be collected. For example, when collecting water samples for a study of contaminant levels, we are only able to say that the sample frame is made up of a specific number of 50 millilitre sample bottles, each containing a water sample.

Definitions

Before examining sampling methods in detail, you need to be aware of more formal definitions of some of the terms used so
far, and some that will be used in subsequent sections.

Population              A finite (or infinite) set of 'objects' whose properties are to be studied in a survey
Target population       The population whose properties are estimated via a sample; usually the same as the 'total' population
Sample                  A subset of the target population chosen so as to be representative of that population
Sampling unit           A member of the sample frame
                        A member of the sample
Probability sample      Any method of selecting a sample such that each sampling unit has a specific probability of being chosen. These probabilities are usually (but not always) equal. Most probability sampling employs some form of random sampling to give each unit an equal probability of being selected.
Non-probability sample  A method in which sample units are collected with no specific probability structure.

Sampling methods

The general aim of all sampling methods is to obtain a sample that is representative of the target population. By this we
mean that, as much as possible, the information derived from the sample survey is the same (allowing for inevitable
variations in the estimates due to imprecision) as we would find if we carried out a census of the target population.

When selecting a sampling method we need some minimal prior knowledge of the target population; with this and some
reasonable assumptions we can estimate a sample size required to achieve a reasonable estimate (with acceptable precision
and accuracy) of population characteristics.

How we actually decide which sampling units will be chosen makes up the sampling method. Sampling methods can be categorised according to the approach they take to the probability of a particular unit being included. Most sampling methods attempt to select units such that each has a definable probability of being chosen. Moreover, most of these methods also attempt to ensure that each unit has the same chance of being included as every other unit in the sample frame. All methods that adopt this general approach are called probability sampling methods.

Alternatively, we can ignore the probability of selection issue and choose the sample on some other criterion, such as
accessibility or voluntary participation; we call all methods of this type non-probability sampling methods.

Non-probability sampling

Non-probability methods are all sampling procedures in which the units that make up the sample are collected with no
specific probability structure in mind. This might include, for example, the following:

 the units are self-selected; that is, the sample is made up of 'volunteers'
 the units are the most easily accessible (in geographical terms)
 the units are selected on economic grounds (payment for participation, for example)
 the units are considered by the researcher as in some way 'typical' of the target population
 the units are chosen with no obvious design ("the first fifty who come in this morning")

It is clear that such methods depend on unreliable and unquantifiable factors, such as the researcher's experience, or even on
luck. They are correctly regarded as 'inferior' to probability methods because they provide no statistical basis upon which the
'success' of the sampling method (that is, whether the sample was representative of the population and so could provide
accurate estimates) can be evaluated.

On the other hand, in situations where the sample cannot be generated by probability methods, such sampling techniques
may be unavoidable, but they should really be regarded as a 'last resort' when designing a sample scheme.

Probability sampling

The basis of probability sampling is the selection of sampling units to make up the sample based on defining the chance that
each unit in the sample frame will be included. If we have 100 units in the frame, and we decide that we should have a
sample size of 10, we can define the probability of each unit being selected as one in ten, or 0.1 (assuming each unit has the
same chance). As we shall see next, there are various sampling methods that we can use to select the units.

It is an important feature of probability sampling that each time we apply the same method to the same sample frame we will usually generate a different sample. For a finite population we can use simple combinatorial arithmetic to calculate how many samples we can draw from a particular sample frame such that no two samples are identical. It turns out that, from any population of N objects, we can draw NCn = N!/(n!(N-n)!) different samples, each of which contains n sampling units.

In fact, in probability sampling we are concerned with the probability of each sample being chosen, rather than with the
probability of choosing individual units. If each sample is as likely to be selected as every other sample (assuming equal
probabilities), then each sampling unit automatically has the same chance of being included as every other sampling unit.

Simple random sampling

The simplest way of selecting sampling units using probability is the simple random sample. This method leads us to select n
units (from a population of size N) such that every one of the NCn possible samples has an equal chance of being chosen.

To actually implement a random sample, however, we 'reverse' the process so that we generate a sample by selecting from
the sample frame by any method that guarantees that each sampling unit has a specified (usually equal) probability of being
included. How we actually do the sampling (using dice, random number tables, or whatever) is of no significance, provided
the technique ensures that each unit retains its specified probability of being selected.
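
A minimal sketch of how this might be done with a short computer program (Python, using an invented frame of sixteen identification numbers) is shown below.

import random
from math import comb

# A hypothetical sample frame of sixteen employee identification numbers
frame = [f"EMP{i:02d}" for i in range(1, 17)]
n = 6  # required sample size

# Number of distinct samples of size n that can be drawn from N units (NCn)
print("Possible samples:", comb(len(frame), n))   # 8008 for N = 16, n = 6

# Draw one simple random sample; every unit has the same chance (n/N) of inclusion
sample = random.sample(frame, n)
print("Selected units:", sorted(sample))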

Stratified sampling

On occasion we may suspect that the target population actually consists of a series of separate 'sub-populations', each of
which may have, on average, different values for the properties we are studying. If we ignore this possibility the population
estimates we derive will be a sort of 'average of the averages' for the sub-populations, and may therefore be meaningless.

In these circumstances we should apply sampling methods that take such sub-populations into account. It may turn out,
when we analyse the results, that the sub-populations do not exist, or they exist but the differences between them are not
significant; in which case we will have wasted a certain (minimal) amount of time during the sampling process.

If, on the other hand, we do not take this possibility into account, we will have reduced confidence in the accuracy of our
population estimates.

The process of splitting the sample to take account of possible sub-populations is called stratification, and such techniques
are called stratified sampling methods. In all stratified methods the total population (N) is first divided into a set of L
mutually exclusive sub-populations N1, N2, … NL, such that N1 + N2 + … + NL = N.

Usually the strata are of equal sizes (N1 = N2 = … = NL), but we may also decide to use strata whose relative sizes reflect the
estimated proportions of the sub-populations within the whole population.

Within each stratum we select a sample (n1, n2, … nL), usually ensuring that the probability of selection is the same for each
unit in each sub-population. This generates a stratified random sample.
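
A minimal sketch of proportionally allocated stratified random sampling (Python, with two invented strata) is shown below.

import random

# Two hypothetical strata (sub-populations) and the total sample size required
strata = {
    "office staff": [f"A{i:02d}" for i in range(1, 12)],   # 11 units
    "field staff":  [f"B{i:02d}" for i in range(1, 6)],    # 5 units
}
n_total = 6
N = sum(len(units) for units in strata.values())

sample = {}
for name, units in strata.items():
    # Allocate the sample in proportion to the stratum's share of the population
    n_stratum = round(n_total * len(units) / N)
    sample[name] = random.sample(units, n_stratum)

print(sample)   # e.g. 4 units from the first stratum, 2 from the second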

Systematic

Sometimes too much emphasis is placed on the significance of the equal probability of sampling unit selection, and
consequently on random sampling. In most cases the estimates provided by such techniques are no better than those
provided by systematic sampling techniques, which are often simpler to design and administer. In a systematic sample (as in
other probability methods) we decide the sample size n from a population of size N. In this case, however, the population
has to be organised in some way, such as points along a river, or in simple numerical order (the order of sample units is
irrelevant in simple random or stratified random samples). We choose a starting point along the sequence by selecting the rth unit from one 'end' of the sequence, where r is no greater than the sampling interval k and is usually chosen randomly. We then take the rest of the sample by repeatedly adding k to r, where k is equal to N/n, rounded to a whole number if the division does not produce an integer. We continue in this way until we reach the end of the sequence.

One way of envisioning a systematic sample is to think of the sample frame as a 'row' of units, and the sample as a sequence of equally spaced 'stops' along the row, as shown in Figure 7.1:

Figure 7.1: Systematic sample
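
A minimal sketch of the systematic selection rule (Python, assuming an already ordered frame of invented units) is shown below.

import random

def systematic_sample(frame, n):
    """Select roughly n units at a fixed interval k from an ordered frame."""
    k = max(1, round(len(frame) / n))   # sampling interval: N/n rounded to a whole number
    r = random.randint(1, k)            # random starting position, 1 <= r <= k
    # Take the r-th unit and every k-th unit after it (positions are 1-based)
    return frame[r - 1::k]

ordered_frame = [f"unit {i:02d}" for i in range(1, 17)]   # 16 hypothetical ordered units
print(systematic_sample(ordered_frame, 6))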

An example

Let's look at a simple example to see how various probability sampling methods are applied. We want to sample from a list
of employees (whose population size is sixteen) using their identification numbers as the sample frame, and we want to
select a sample size of six units.

Simple random

We select the units by random sampling from the frame. We might use random number tables, or use a short computer
program to generate six random numbers between one and sixteen.

94407382
94409687 <========
93535459 <========
93781078
94552345
94768091 <========
93732085
94556321
94562119
93763450 <========
94127845
94675420
94562119 <========
93763450 <========
94127845
94675420

Stratified random

Here we first need to split the population into sub-populations (two in this example, presumably meaningful in the context of
the study) and then sample from within those sub-populations. In the example the first sub-population has eleven members,
and the second has five; so we select four items from the first group (each unit has a sampling probability within its own sub-population of 4/11, or about 0.36) and two from the second (each unit has a sampling probability of 2/5, or 0.4).

93535459
93781078
93732085 <========
93763450
93763450 <========
94407382
93427890
94409687 <========
94552345
94768091 <========
94556321

-----------------------------
94562119 <========
94127845
94675420
94562119 <========
94127846

Systematic

Here we first need to order the sample units in some sensible fashion (in this case in increasing numerical order). We then
select the starting point (the value of r) between one and three (since k = 16/6 rounded to a whole number is 3); in the example this turns out to be two. We then take every third unit from that point, giving sample positions 2, 5, 8, 11 and 14. Depending on the size of the sample frame this may (as it does here) produce a
sample that is too small or too large by a single unit.

93535459
93781078 <========
93732085
93763450
93763450 <========
94407382
94409687
94552345 <========
94768091
94556321
94562119 <========
94127845
94675420
94562119 <========
94127846
94675420

Quota sampling

Interviews, mail surveys and telephone surveys (which are examined in detail in another section) are often used in
conjunction with quota sampling. Quota sampling is based on defining the distribution of characteristics required in the
sample, and selecting respondents until a 'quota' has been filled. For example, if a survey requires a sample of fifty men and
fifty women, a quota sample will 'tick off' respondents until the right number of each type has been surveyed. This process
can be extended to cover several characteristics ('males under fifty years of age', 'females with children'). Quota sampling
can be regarded as a form of stratification, although formal stratified sampling has major statistical advantages.
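
A minimal sketch of the quota-filling logic (Python, with invented quotas) is shown below.

# Hypothetical quotas: keep accepting respondents until each group's quota is filled
quotas = {"male": 50, "female": 50}
accepted = {group: 0 for group in quotas}

def accept(group):
    """Accept a respondent only if their group's quota is not yet full."""
    if accepted.get(group, 0) < quotas.get(group, 0):
        accepted[group] += 1
        return True
    return False

# A stream of would-be respondents; anyone outside the defined quotas is declined
for group in ["male", "female", "male", "other", "female"]:
    print(group, "accepted" if accept(group) else "declined")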

Spatial sampling

We can also use equivalent sampling techniques when the phenomenon we are studying has a spatial distribution (that is,
the objects in the sample frame have a location in that they exist at specific points in two or three dimensions). This would
be required, for example, if we wanted to sample continuous phenomena such as water in a lake, soil on a hillside, or the
atmosphere in a room. Such phenomena are obviously most common in geographical, geological and biological research.

To sample from this type of distribution requires spatial analogies to the random, stratified and systematic methods that have
already been defined. The simplest way of seeing this is in the two-dimensional case, but the three-dimensional methods are
simply extensions of this approach. In each case the end result is a sample that can be expressed as a set of points in two-
dimensional space, displayed as a map.

Random

The simplest spatial sampling method (although rarely the easiest to manage in the field) is spatial random sampling. In this
we select a sample (having previously defined the sample size) by using two random numbers, one for each direction (either
defined as [x,y] coordinates, or as east-west and north-south location). The result is a pattern of randomly chosen points that
can be shown as a set of dots on a map, as in the example in figure 7.2. (It is surprising how often such random patterns are
perceived to contain local clusters or apparent regularity, even when they have been generated using random numbers.)

Figure 7.2: Random spatial sampling
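
A minimal sketch of generating such a pattern (Python, assuming a rectangular study area with invented coordinate bounds) is shown below.

import random

def random_points(n, x_range=(0.0, 100.0), y_range=(0.0, 100.0)):
    """Generate n random (x, y) sample locations within the study area."""
    return [(random.uniform(*x_range), random.uniform(*y_range))
            for _ in range(n)]

# Ten randomly located sample points, ready to be plotted on a map
for x, y in random_points(10):
    print(f"sample point at ({x:.1f}, {y:.1f})")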

Stratified

The concept of stratification can readily be applied to spatial sampling by re-defining the sub-population as a sub-area. To
do this we break the total area to be surveyed into sub-units, either as a set of regular blocks (as shown in figure 7.3) or into
'natural' areas based on factors such as soil type, vegetation patterns, or geology. The result is a pattern of (usually
randomly-chosen) points within each sub-area. Again, it is possible to have stratified spatial sampling schemes in which the number of sample units allocated to each sub-area is the same, or is proportional to that sub-area's share of the entire survey area.

Figure 7.3: Stratified spatial sampling

Systematic

All forms of systematic spatial sampling produce a regular grid of points, although the structure of the grid may vary. It may
be square (as shown in figure 7.4), rectangular, hexagonal, or any other appropriate geometric system.

Figure 7.4: Systematic spatial sampling

Sample size estimation

If done properly, the correct estimation of sample size is a significant statistical exercise. Sometimes we bypass this process
(often for sensible reasons) by adopting an ad hoc approach of using a fixed sample proportion (such as 10% of the
population size) or sample size (such as 100). In relatively large populations (say at least 2000) this will normally produce
results that are no worse than those produced by a sample based on a carefully calculated sample size (provided, of course,
that the sample units that make up the 10% sample are properly selected, so that they are representative of the population).

The basis for calculating the size of samples is that there is a minimum sample size required for a given population to
provide estimates with an acceptable level of precision. Any sample larger than this minimum size (if chosen properly)
should yield results no less precise, but not necessarily more precise, than the minimum sample. This means that, although
we may choose to use a larger sample for other reasons, there is no statistical basis for thinking that it will provide better
results. On the other hand, a sample size less than the minimum will almost certainly produce results with a lower level of
precision. Again, there may be other external factors that make it necessary to use a sample below this minimum. If the
sample is too small the estimate will be too imprecise, but if the sample is too large, there will be more work but no
necessary increase in precision.

But remember that we are primarily interested in accuracy. Our aim in sampling is to get an accurate estimate of the
population's characteristics from measuring the sample's characteristics. The main controlling factor in deciding whether the
estimates will be accurate is how representative the sample is. Using a small sample increases the possibility that the sample
will not be representative, but a sample that is larger than the minimum calculated sample size does not necessarily increase
the probability of getting a representative sample. As with precision, a larger-than-necessary sample may be used, but is not
justified on statistical grounds.

Of course, both an appropriate sample size and the proper sampling technique are required. If the sampling process is
carried out correctly, using an effective sample size, the sample will be representative and the estimates it generates will be
useful.

Assumptions

In estimating sample sizes we need to make the following assumptions:

 The estimates produced by a set of samples from the same population are normally-distributed. (This is not the same as
saying that the values of the variable we are measuring are actually normally-distributed within the population.) A well-
designed random sample is the sampling method that will most usually produce such a distribution.
 We can decide on the required accuracy of the sample estimate. For example, if we decide that the accuracy has to be ± 5%,
the estimated value must be within five percent either way of the 'true' value, within the margin of error defined in the next
assumption.

 We can decide on a margin of error, α, for the estimate, usually expressed as a probability of error (5% or 0.05). This
means that in an acceptably-small number of cases (e.g. five out of a hundred) our sample estimate is not within the
accuracy range of the population estimate defined in the last assumption.
 We can provide a value for the population variance (S2) of the variable being estimated. This is a measure of how much
variation there is within the population in the value of the property we are trying to estimate. In general we will require a
larger sample to accurately estimate something that is very variable, whilst something that has a similar value for all
members of the population will require a markedly smaller sample. As we shall discuss shortly, although we almost never
have a value for the population variance (if we knew it we probably wouldn't need to do the survey…) there are various
ways of obtaining an estimate for use in calculating sample sizes.

Formulae

Based on these assumptions there are several formulae that have been developed for estimating minimum sample sizes. The
format presented here is the simplest, and is applicable to simple random samples, but more complex versions are available
for use with systematic and stratified samples (and their various combinations).

We assume that the population mean (μ) is to be estimated from the sample mean (x̄) by a simple random sample of n0 units (equivalent formulae exist for estimating parameters other than the mean). If n0 is much smaller than N (say, less than 10% of N), n0 is given by

n0 = (z² × S²) / d²

where z is the standard normal deviate corresponding to the chosen probability of error (z = 1.96 for a probability of 0.05) and d is the required accuracy, expressed in the same units as the variable being estimated.

Note particularly that the sample size is not related to N; it depends on the variability of the population and the accuracy that
we wish to achieve. It may be that this 'independence' of the sample size from the population size is the origin of the widespread use of a fixed sample proportion (such as 10%) rather than a calculated sample size.

If n0 is not much smaller than N (say, more than 10% of N), we must correct n0 to give a final sample size of n by using

n = n0 / (1 + n0/N)
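
A minimal sketch of the calculation (Python; the use of the standard normal deviate for the chosen probability of error, and the variance and accuracy values, are assumptions for illustration) is shown below.

from statistics import NormalDist

def sample_size(S2, d, alpha=0.05, N=None):
    """Minimum sample size to estimate a mean with variance S2 to within +/- d,
    with error probability alpha; the finite population correction is applied
    when n0 is not small relative to N."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    n0 = (z ** 2) * S2 / (d ** 2)
    if N is not None and n0 / N > 0.10:
        n0 = n0 / (1 + n0 / N)                # finite population correction
    return round(n0)

# Hypothetical values: population variance 25, required accuracy +/- 1 unit
print(sample_size(S2=25, d=1, alpha=0.05, N=2000))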

Estimating population variance

Even if we don't actually know the population variance there are several methods of deriving a value for it:

 We can split the sample into two (n1 and n2, where n1 is smaller than n2) and use the results from the first sample to estimate the value of the population variance (S2) and thus calculate the size of n2, the 'real' sample (see the sketch after this list).
 We can use a pilot survey (the broader significance of which we will consider in a later section) to estimate the value of S2.
 We can use an estimate for S2 based on previous samples of the same (or a similar) population.
 We can use 'educated guesswork' based on prior experience of the same (or a similar) population.
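
A minimal sketch of the first of these approaches (Python, with invented pilot measurements) is shown below; the resulting estimate can then be fed into the sample size calculation given earlier.

from statistics import variance

# Hypothetical measurements from a small first-phase (or pilot) sample
pilot_values = [12.1, 9.8, 11.4, 13.0, 10.6, 12.7, 9.9, 11.2]

# The sample variance of the pilot data serves as the estimate of S2
S2_estimate = variance(pilot_values)
print(f"Estimated population variance: {S2_estimate:.2f}")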

In this section you'll see how to design and conduct surveys. You will also be introduced to the principles of questionnaire
design and layout.

On completion of this section you should be able to:

 describe the principles of survey development


 list the major types of survey that can be used to collect social and economic data
 explain the purpose and structure of pilot surveys
 outline the principles governing the design of questionnaires
 describe the major components of questionnaires, and how they can be laid out to create effective questionnaires

Contents

 Survey development
 Types of survey
o Direct observation
o Diaries
o Interviews (face-to-face)
o Mail surveys
o Telephone survey
o Response Rates
 Pilot surveys
o Administration
 Questionnaire design
o Overall size
o Amount of "background information"
o Variety and sequence of questions
o Internal organisation
 Questionnaire layout
o Question Wording
o Question Types
 Open
 Closed
 Structured
 Tabular

Survey development

One major problem that we face in developing and managing a survey is that there are key points in the process where we
may 'lose control' over the data collection process. An ill-conceived survey is frequently caused by lack of control during the
development process. There are a number of critical stages that we go through before making observations or asking
questions.

The exact goals of the survey should be clear at the start of the development process; if they are not then, no matter how well we design the 'measurement instrument' (such as a questionnaire), we will probably not achieve those goals.

This process should be made easier if the survey is part of a hypothesis-testing program, using the deductive methodology
discussed in earlier chapters. Just as in experiments the ability to properly test the hypothesis depends on making the right
measurements with the right instrument, in surveys we have to make the right observations, or ask the right questions. In a
complicated survey this might not be easy to guarantee. If we are carrying out a survey on behalf of a client, we should also
make sure we are able to collect the specific data the client wants from the survey.

How do we choose the right survey method? One thing that affects the choice of survey method is the need to make the data
being collected match data already available. If we have secondary data that we want to use to provide verification for the
primary data, we must collect specific data to support this.

When we choose a method, we have to take into account how we are going to actually carry out the survey. In principle we
choose a particular method (the 'right' one) and then do it; in practice we often 'cut the suit to match the cloth'. It may be
necessary to use a less desirable technique on the grounds of feasibility and practicality. To know which type of survey to
use we need to be fully aware of their comparative advantages and disadvantages.

Types of survey

Direct observation

In direct observation we record events during an experiment or record behaviour. In the observation of human attitudes and
behaviour (as in psychology or sociology) this process is complicated by the need to inform people of their participation in
the process (we have no right to 'spy' on people). If we make direct observations of people without their knowing it then we
are doing an experiment upon them without their agreeing to participate. Unfortunately, it is very likely that, when people
are aware of the observation process, they will make subtle changes (often unconsciously) to their 'normal' activities.

Diaries

A tempting form of data collection, particularly when carrying out longitudinal surveys, is a diary system. Over a period of
time (usually days or weeks, but in some cases over a period of years) the respondents themselves record data about what
they are doing, or about what they watch on television, or what they buy. Until recently all television and radio ratings were
conducted using diaries. The system is easy to administer (hand out the diary and collect it some time later) and hence
cheap, but is largely restricted to behavioural (as opposed to attitudinal) information.

The major problem with using diaries is that it is very difficult to obtain any independent verification of what is recorded in
the diary; only the respondent knows whether or not it is accurate. Some may give answers that they think will please the
organisation conducting the survey (particularly if they fear or respect it), whilst others may give incorrect information
specifically to spite the organisation. Many organisations that use diaries in this way rely on incentives (such as free gifts) to
persuade respondents to provide accurate entries.

Interviews (face-to-face)

A widely used survey technique in social surveys is the interview. There are basically two types of face-to-face interview:
the impromptu and the scheduled (we will deal with surveys which are based on interviews 'at a distance' shortly).

Impromptu interviews can take place wherever there are people who are likely to have an opinion of, or knowledge about,
the survey topic. Shopping surveys might be done in a shopping centre, recreation surveys at the beach, and so on. As well
as the right location, we should also consider the time of day at which such interviews are conducted, as the availability of
the right type of respondent will vary during the day (or night).

Scheduled interviews are usually conducted at home or at work. If possible it's preferable to use scheduled interviews:
people often feel uncomfortable when stopped in the street, but are more comfortable in their own home. The general
principle is that if you get people on their own 'home ground' they will answer a larger number of questions, and give more
detailed answers. On the other hand, such interviews usually demand greater mobility on the part of the interviewer, and
hence are more costly to administer on a 'per interview' basis.

Interviews are not necessarily conducted by asking the interviewee a series of questions in strict order (question one,
followed by question two, and so on). An interviewer can often get more useful information if he or she continually
reorganises the sequence of questions to react to the interviewee's responses. The interviewer has to be prepared (and
allowed) to blend the questions into a less structured discussion, but this can normally only be done when interviewing
people at home or at their place of work. To generate such a discussion the interviewer must understand the questions
completely and know them extremely well. The recording process must also be unobtrusive; it may be more effective (if the
respondent agrees) to record the discussion on tape and transcribe it later, rather than ritually filling in a printed survey form.

There are a number of problems that may affect the usefulness of interview results. These are the most significant:

 There may be bias (conscious or unconscious) in the selection of interviewees. Given that all participants will be volunteers,
the interviewer may be less encouraging to some potential respondents than to others, or act in a more 'threatening' manner,
or be less persuasive when requesting an interview. Male interviewers may 'intimidate' some female candidates, thus
reducing the likelihood that they will agree to participate. All of these circumstances may change the characteristics of the
sample; quota sampling is often employed to try to minimise these effects.
 There may be bias in the question process. The interviewer may present certain questions in a way that is likely to elicit a
particular answer. When recording a response the interviewer may 'filter' the response in a manner that changes the meaning
the interviewee intended it to have. The commonest response to this problem is to give the interviewer little or no latitude (in
wording, phrasing or inflection) in how questions are asked. The use of closed questions (the design of which you'll examine
in a later chapter) is considered likely to reduce recording bias.
 There is a recognised tendency for interviewees to give the answer that they think the interviewer would like to hear. Again,
the use of more structured interviewing techniques and application of closed questions are thought to act as a counterbalance
to this problem.
 The interviewer may make straightforward errors in asking the questions or (more commonly) recording the answers.
Several measures can be taken to minimise this problem:
1. training interviewers intensively in the general principles of interview conduct
2. ensuring that they have an understanding of the overall structure and aims of the survey
3. ensuring that they are fully briefed on the detailed format of the questions that need to be asked
4. applying appropriate measures during the pre-processing phase of data analysis (which we will cover in a later chapter) to
cross-check results in order to look for possible errors
 Interviews, like all voluntary survey techniques, are prone to variable response rates. The response rate of any survey is the
proportion of the people approached to participate who actually did participate; it is
usually expressed as a percentage. Again, quota sampling is often employed to deal with this problem (keep asking until
enough agree to participate), but at considerable potential cost.

As already suggested, face-to-face interviews can be a time-consuming (and hence costly) approach. Scheduled interviews,
for example, have to include the time taken to arrange the interview, the travel time to and from the location, any time spent
waiting for the interview to be available, and the actual interview time. In some cases it also includes the time required to re-
visit the interviewee to check some answers, resolve some inconsistencies, or complete the survey if the first interview had
to be abandoned for some reason. On average, a complex survey employing a scheduled interview system might only obtain
one completed interview per interviewer per day.

Despite these problems, interviews are widely used, even where more complex issues than 'what soap powder do you buy?'
are being investigated.

Mail surveys

If we consider 'efficiency' in surveying to mean 'completing as many surveys as possible in a given period of time' it is
undoubtedly true that the most efficient forms of survey are mail and telephone surveys. Not surprisingly, many
organisations use such surveys, and they are used to collect all types of data, from political opinions to shopping patterns. In
most cases this is because the efficiency factor is considered to outweigh the well-known deficiencies of these techniques.

Mail surveys are conducted, as the name suggests, using the post. The organisation supplies potential respondents with a
printed survey form and they (hopefully) fill it in and send it back. The simplest (and least expensive) version is where the
organisation posts out the surveys and the respondents post them back, invariably in a pre-paid or stamped envelope
supplied with the survey.

Alternatively, a member of the surveying organisation delivers the surveys and returns at an arranged time and date to
collect them. Whilst obviously more costly, this method is considered (as we shall see shortly) to give worthwhile
improvements in response rate. Most people will throw an unsolicited survey in the rubbish, but are more likely to complete
it if they have agreed to do so personally, and know that someone is going to collect it.

Various compromise arrangements exist, including using follow-up letters and telephone calls. Again, the impact of these on
response rates is examined in a different section.

Most surveys include a 'covering letter' to help potential respondents understand the purpose of the survey, and to persuade
them to participate. Some survey veterans consider that an 'official-looking' covering letter (leading some respondents to
believe that the survey is in some sense obligatory) will markedly improve the response rate. Others feel that incentives
(small 'giveaways', or a chance to win larger prizes) are worthwhile. In each case, the aim is to increase the response rate and
(less significantly) maximise the correctness of responses.

The major advantage of mail surveys is their high 'penetration' (almost everyone has a mailing address, even if they don't
have access to a telephone) and their low cost, which is really dependent on the exact method of administration.

The major disadvantages of mail surveys are their dependency on the reading and
comprehension skills of the respondents, and the low initial response rates. It is considered
absolutely critical to carry out some form of follow-up with those who don't respond. Of
course, the more follow-ups one does, the more expensive it becomes and the more resistance
occurs. There is a rule of thumb in a mail survey that 40% is a reasonable response for a first
mail out. If you send the non-respondents a reminder (after a suitable period) you might get
responses from 40% of the 60% who didn't respond the first time. So now you have a total
64% response rate. If you contact the remaining 36% again, and you get 40% of those to
respond, you will then have about 78%, and so on. (After nine reminders you get - in
principle - a 99.4% response rate, but you will have certainly worn out your welcome long
before this point!) This is summarised in the following table:

Stage                Response rate
First mail-out       40%
First reminder       64% (40% + 40% of 60%)
Second reminder      78% (40% + 40% of 60% + 40% of 36%)

In practice, if the survey is well administered, after three reminders you may get a response rate close to 80%. Some people -
for various legitimate reasons - will always say no.
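
The arithmetic behind this rule of thumb is easy to check: if each mailing persuades 40% of the remaining non-respondents, the cumulative response rate after k reminders is 1 - 0.6^(k+1). A short sketch (Python) is shown below.

# Cumulative response rate when each mailing converts 40% of remaining non-respondents
rate = 0.40
for reminders in range(10):
    cumulative = 1 - (1 - rate) ** (reminders + 1)
    print(f"after {reminders} reminder(s): {cumulative:.1%}")
# 0 reminders -> 40.0%, 1 -> 64.0%, 2 -> 78.4%, ... 9 -> 99.4%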

Telephone survey

Telephone surveys vary enormously in quality. At one end are the statistically sound opinion polls carried out by
organisations such as Roy Morgan, Gallup and ANOP. Using careful organisation and extensive background research they
work with surprisingly small but very representative samples. At the other end are the useless and misleading 'phone ins'
conducted by parts of the mass media that pretend to discover public opinion by asking people to ring in and record their
'vote'. It says something about the structure and quality of such surveys that almost anyone can guess the answer in advance.

All 'real' telephone surveys have a common structure. A list of potential respondents is drawn up, based on sources such as
business directories, credit lists and membership lists of professional or social organisations. Each person on the list is
contacted by telephone until enough have responded (hence the use of quota sampling). Once they have agreed to participate
they are asked a pre-defined set of questions, usually quite simple in structure (such as multiple choice). The interview is
unlikely to last for more than a few minutes, as many people will not participate if they think it will take more than that
amount of time.

There are major advantages to this technique. It is a low cost method: the only costs are the telephone call (usually a local
call) and the wages of the interviewers. Given the wide distribution of telephones in the western world, the method has high
penetration.

But it is not perfect. There are still marked variations in access. Not everybody has a telephone, and the disparities are not
random, but associated with variations in economic and social status. As a result, telephone surveys lead to under-
representation of the economically disadvantaged members of society (the unemployed, single parents, pensioners, and so
on). The consequent effects on the sample results are an example of bias due to non-coverage.

Even among those with telephone access, contact is not guaranteed; people may not be home when the interviewer calls.
Interviewers can't expect a positive response if they ring at strange hours of the day or night. Anyway, some people simply
don't like answering questions on the phone; today, many are suspicious of what they may consider the hidden motives of all
unfamiliar callers.

Sometimes the telephone discussion leads to problems of mutual comprehension between the interviewer and the
respondent, particularly where one is less fluent than the other in the language being used.

Telephone surveys can only really employ the simpler type of questions, particularly multiple choice or yes/no questions;
the limitations of these are examined in another section.

There is usually a high degree of outright refusal to cooperate; people who might 'give in' and agree to participate if asked
directly may be emboldened to say no and hang up when approached over the telephone.

Response rate

Our concern with response rates is linked to the effect of low rates on the representativeness of the sample, and the
consequent accuracy of the population estimates. Any response rate below 100% is in some sense a 'threat'
to the accuracy and usefulness of the survey; the lower the response, the greater the threat. As you've seen, maximising the
response rate is a problem with all voluntary surveys. In general terms, the less the survey uses face-to-face questioning, the
harder it is to get a high response rate.

Pilot surveys

Perhaps the most overlooked component of a successful survey is the role played by the pilot survey. Once we have decided
the overall aims and structure of the survey we should 'trial' the system with what is called a pilot survey.

Pilot surveys can serve several functions within the overall survey process. As you saw in the last chapter, a pilot survey is
one method of obtaining an estimate of the population variance. The major function of the pilot survey, however, is to help
'tune' the proposed process for the main survey. We do this by using the pilot survey to find out whether the survey is going
to be successful; that is, will it achieve an acceptable response rate, and provide reliable data on the relevant topics? To do
this we need to find out several things from the pilot survey:

 Is the survey too large or too small? If it is too large (there are 'too many questions') we will generally get a lower response
rate, and there will be many unanswered questions in those surveys that are actually completed. If it is too small, we will not
get the information we need to test our hypotheses, no matter how high the response rate.
 Are the questions going to yield the information we require? As a simple example, if we want to know whether people are in
favour of a particular proposal we will have to have a question that either directly asks about this, or a question that can
yield this information during the processing of the survey results.
 Is the survey layout clear and effective, or is it misleading? Do respondents get confused and, for instance, fill in sections
that they are not supposed to, or leave blank questions they should complete?
 Are the instructions to the respondent (and to the interviewer) clear and unambiguous?
 Are the people who are going to carry out the survey clear on the aims and methodology of the survey? Has their training
been adequate?
 Are there noticeable variations in the responses collected by different interviewers? Do some need more training? Will this
need to be factored into the processing of the results from the main survey?

Administration

In administering the pilot survey we need to consider the following issues:

 What is the optimum size (number of respondents) for the pilot sample? If the pilot sample size is too large it may waste
resources that should be applied to the main survey, but if it is too small, potential flaws may not be detected.
 Should the results of the pilot survey be used as if they had been collected during the main survey? In principle this is
undesirable, as it implies that no changes will be made to the main survey as a result of conducting the pilot survey. In
practice, however, the temptation to use the data collected during the pilot survey is often too strong.
 Should the pilot survey be conducted (delivered) in an identical fashion to the main survey? If it is not (perhaps for logistical
reasons) how sure can we be that the conclusions that are drawn from the pilot survey really apply to the main survey? It
doesn't follow that a mail survey has to be piloted with another mail survey, although it usually will be. You can even learn
something (but not much) by piloting a face-to-face interview system with a mail survey.

It is important to recognise just how significant the pilot survey can be in the overall success of the full survey. How we
conduct the pilot survey, and whether we learn the right lessons from it, are major determinants of the quality and
effectiveness of the final survey.

Questionnaire design

One of the foibles of human nature is that we expect - against most available evidence - that we will create something
worthwhile every time we approach a problem. This expectation seems to increase with the (apparent) simplicity of the task.
This is clearly true of questionnaire design. It seems so easy to write down a series of questions; how hard can it be?

The answer is: quite hard, really - if you want to do it well. The design and construction of questionnaires is a predominantly
subjective process that is largely guided by the experience of the designer; the more questionnaires you have created, the
more likely it is that the next survey will be closer to 'perfect'. Mostly this is because an experienced designer should
(hopefully) learn from his or her mistakes, and will avoid repeating them.

Anyone who deals with questionnaire design develops, by personal experience and critical analysis of other people's
designs, what one might call design guidelines (perhaps even rules) for good questionnaires. Everyone will have slightly
different principles, so those presented here are not going to be universally accepted.

So what are the basic issues in questionnaire design?

Overall size

Some surveys are as little as one page long, others may be twenty pages or more. For longer surveys we have to be
convinced that the size is appropriate and will not have an unduly adverse effect on the response rate.

Amount of "background information"

How much background information should be included in the survey, or in a covering letter? Of course, we must always
include some instructions for the respondents, and possibly some for the investigator. If properly worded, such instructions
should ensure that the survey is completed without major mishaps. But we obviously don't want the instructions to be so
complex that reading them is an exercise in its own right.

Variety and sequence of questions

What number, and types, of questions should be used, and how should they be mixed? As you'll see in the next section, there
are two basic 'species' of question (open and closed), and numerous 'varieties' of closed question. Each can provide us with a
different type of data, so we will want an appropriate mix of them.

What constitutes an 'appropriate mix'? You want the respondents to give you the maximum amount of relevant information,
if possible in their own words, but you have to place certain constraints on this process - mostly related to the physical space
in the questionnaire. You can give them space for their own words but you will need to balance this with the effective use of
closed questions, especially if you want the data that you collect to be comparable with that from other surveys.

In what order do you ask questions? If the survey requires basic demographic data (such as age, gender, marital status and
occupation) do you place these questions at the beginning of the survey, or at the end? Generally the preference is for the
end.

How do you arrange layout of the questionnaire so that respondents answer the questions they are supposed to answer, and
not the ones they aren't supposed to? How do you actually phrase the questions? Will people respond more favourably (and
more accurately) to questions posed in a 'chatty' style, or to those with a more formal phrasing?

Internal organisation

The layout of the questionnaire can have a significant impact on how respondents react; too little attention is given to it. If
you spread fifteen questions over fifteen pages some people will not fill it in simply because they think it will take too long
to complete. If you cram all fifteen questions onto a single page, other people will find it too difficult to follow.

Should the questionnaire be broken into sections, where each section deals with a clearly defined topic, different from those
in adjacent sections? At the start of each section you will need to give some general information about the completion of that
section, but just how much information is appropriate?

Should questions be numbered, and should the numbering sequence indicate any section structure (Q1 .. Q20 or Q1.1 ..
Q1.5, Q2.1 .. Q2.8 for example)? Do you want people to answer all questions, or do you want them to skip Q8 if it is not
applicable to them?

All of this can be summarised as a simple problem: the questionnaire should be as clear, detailed and unambiguous as
possible, but you don't want to insult the intelligence of most of the population. The basic principle for solving this problem
is to balance compactness with an impression of legibility and spaciousness. The language used in the instructions and the
actual questions needs to be concise without being obtuse, relaxed without being foolish. The respondents must not feel that
they are being patronised, but they must also feel that the survey is serious and worth the effort of their completing it.

Questionnaire layout

 There is a balance in the use of whitespace (between questions and sections) to give improved readability, without unduly
increasing the apparent size of the questionnaire.
 The questionnaire - particularly if administered by mail - needs to have a preamble that explains the overall aim of the
survey; this can be part of the covering letter, or at the head of the actual questionnaire.
 There should be general instructions to the respondents, placed usually at the beginning of each section.
 It is usually a sound practice to number questions and, if the questionnaire is divided into sections, to have the section
designation as part of the question numbering system.
 There should be specific instructions associated with each question to aid in the correct completion of that question. These
might include phrases such as "Please tick one box only", "Tick as many boxes as necessary", or "Put a cross at the
appropriate place on the line".
 Wherever appropriate the respondents should be able to bypass questions (or whole sections) that are not relevant to them.
This can be achieved by using filter questions, combined with instructions such as "If you answered YES to Question 8
please move directly to Question 12 (that is, do not answer Questions 9-11)".
 If respondents are unsure about whether to answer a question, or which answer is the most appropriate, they should be
provided with a "letout", such as "Don't Know" or "Not Applicable". When a large number of respondents choose such
options, it is time to examine whether the question is badly worded, or in the wrong place in the questionnaire.

Question Wording

At the most detailed level, considerable attention must be given to the actual wording and terminology used in each
question. Again, certain simple principles can be prescribed:

 Most wording problems are associated with unintentional ambiguity; that is, the respondent infers that the question is asking
about something other than what the designer intended it to.
 It is possible to accidentally word questions in such a way that the response is largely controlled by whether the respondent
has a similar cultural, educational or ethnic background to the designer of the survey, rather than by what he or she really
thinks about the topic.
 We commonly make linguistic assumptions about the respondents' vocabulary, grasp of grammar, and so on. Again, the
answers may reflect this mis-comprehension rather than the 'true' opinions of the respondent.
 Double negatives are widely used, partly in error, but occasionally they are used deliberately for emphasis. In survey
questions they will always generate confusion.
 The most difficult 'wording' problem is really a structural problem in the design of the questionnaire, caused by trying to
combine questions in order to 'simplify' the process. This is the tendency to use 'double-barrelled' questions. For example,
we might ask a question that says 'Are you a member of the Labor Party and did you support the party in the last election?'
There are four possible answers to this question (Yes/Yes, Yes/No, No/Yes, No/No) because it is really two questions
masquerading as one. Whatever answer we get, we will never know which of the four the respondent really meant.

Question Types

The biggest mistake one can make in constructing any questionnaires is to assume that the aim is to 'get people to tell you
what they think'. Attractive as this approach may be, it has critical drawbacks:

 It will have a dramatic impact on the time to administer the questionnaire.


 The results will be difficult and time-consuming to code and analyse.
 Each respondent's answer will be almost impossible to compare with those of other respondents. If one hundred people give
us one hundred different answers, how can we generalise and draw conclusions from them?

This doesn't mean that we never use questions (which we call open questions) that elicit such responses, but that we must
combine them with questions that provide us with the `skeleton' around which the answers to the open questions can be
placed. Open questions are useful in conjunction with more constrained ones, of the type we call closed (or structured).

Closed questions give the respondents a finite (usually small) number of choices from which they can select one or more.
Combining this with space for the respondents to add comments if they want provides the best of both worlds. If we know
exactly what data we need it is better to use nothing but closed questions. It is also much easier (as you'll see in the next
chapter) to extract data from closed questions during the subsequent coding and analysis phases of the survey.

Open

The archetypal open question has a series of lines (or a blank space) in which the respondent is encouraged to write, in their
own words, how they feel about the topic in question. Here is an example:

4. What issues do you think will become important in the political scene in Australia in the next five years?

Closed

Closed questions provide a set of answers that the designer of the survey (based on prior experience and responses in the pilot survey) considers will accommodate the majority of potential responses. The question asked in open form above can be presented in closed form in several different structures. The simplest offers a list of choices, with the option (perhaps) of
adding others if they are not found in the given list.

4. What issues do you think will become important in the political scene in Australia in the next five
years? [Please tick one or more]

[ ] Pollution
[ ] Population control
[ ] Freedom of speech
[ ] Immigration
[ ] Republicanism
[ ] Other (please specify): ______________________

Structured

There are different varieties of closed (or structured) questions, depending on the way in which the respondent is asked to
place their answers. Usually they must tick a box or place a cross along a line. Note that we can regard deciding which of the
various types of structured questions to use as one of the things that we would expect to investigate with the pilot survey.

Here are examples of the most widely used structured question layouts:

'Ticking Boxes'

YES [ ]        NO [ ]        NOT SURE [ ]

Scales (linear)

Keeping the monarchy is essential to a stable society


Strongly disagree |----------------------------------------| Strongly agree

Scales (tabular)

Drink          Reaction
               Like        Indifferent     Dislike
Coca Cola      [ ]         [ ]             [ ]
Pepsi Cola     [ ]         [ ]             [ ]
Fanta          [ ]         [ ]             [ ]
Sprite         [ ]         [ ]             [ ]

Tabular

A sub-variety of structured question is called tabular because it is organised in such a way that the answers can be tallied directly into a cross-tabulation format.

                  Education Level
Age group         Secondary      Tertiary      Post-tertiary
20-30
30-40
40-50
50-60
Over 60
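
A minimal sketch of such a tally (Python, with invented coded responses) is shown below.

from collections import Counter

# Hypothetical coded responses: (age group, education level) for each respondent
responses = [("20-30", "Secondary"), ("30-40", "Tertiary"),
             ("20-30", "Tertiary"), ("Over 60", "Post-tertiary"),
             ("30-40", "Tertiary")]

# Tally the answers directly into a cross-tabulation of age group by education level
crosstab = Counter(responses)
for (age_group, education), count in sorted(crosstab.items()):
    print(f"{age_group:8} {education:14} {count}")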

In this section you'll see how the 'raw' data that is collected in surveys needs to be processed before it can be subjected to
any useful analysis. This processing involves identifying (and correcting) errors in the data, coding the data, and storing it in
appropriate form.

On completion of this section you should be able to:

 outline the major sources of error in survey data


 list the stages through which survey data are processed before they can be analysed
 outline the steps involved in the pre-processing stage of survey analysis
 explain the process of coding survey data
 evaluate the various options available for storage of data, especially in electronic form

Contents

 Errors in surveys
o Sampling errors
o Observation errors
o Processing errors
 Processing of survey data
 Pre-processing
 Coding
 Storage
o Word processor
o Spreadsheet
o Database
o Statistical systems
o Graphical systems

Errors in surveys

Before examining the basic methods of analysing data, we need to review the major sources of error in collected data.
Without an appreciation of these, we may believe that the data we collect are in some way 'perfect'; this may lead us to place
too much confidence in the conclusions we draw from the data.

There are three types of error that can occur in survey data:

Sampling errors

Errors in defining and selecting the sample will bias the results by making the sample less representative of the target
population. The potential errors include:

 non-inclusion errors - people are not included in the sample who should be; they may be replaced by others, thus changing
the composition of the sample.
 non-response errors - members of the sample do not respond, thus changing the characteristics of the sample.

Observation errors

Even if the sample is correctly chosen, errors can be generated during the data collection process. These might include:

 question errors - the question is wrongly worded or misleading


 interviewer error - the interviewer makes an error whilst asking the question

 recording error - the interviewer records incorrectly the answer given by the respondent
 coding error - the data on the survey form are wrongly encoded during the pre-processing stage

Processing errors

Once the data have been coded and collated, errors can occur during the processing stage:

 computational errors - the analyst makes errors during the calculation of statistics
 inappropriate measures - the analyst decides to use analytical techniques that are inappropriate to the data (such as the
wrong measure of central tendency, as you'll see later)

Obviously these types of error can never be totally eliminated, as they are caused largely by human error. The incidence of
such errors can, however, be minimised by extended training and careful application of sound administrative practices.

Processing of survey data

The processing of survey data is a multi-stage process. A number of things have to be done to turn survey data into a form in
which they can be analysed. This analysis begins with basic data summarisation and presentation techniques - which you
can examine in other sections - but continues into the more complex statistical and graphical methods that are beyond the
scope of this site.

The basic stages in the pre-processing of survey data are as follows:

Pre-processing

We cannot expect to make any immediate sense of a large pile of survey reports; they have to be processed in some manner
before they are of any real use to us. We usually have to do some pre-processing of the 'raw data' from the survey forms or
questionnaires. For example:

 We may need to carry out some kind of conversion process


 We may need to adjust the results to take into account variations in the results obtained by different interviewers.
 We may need to correct actual values, if internal inconsistencies or contradictions occur
 We may need to actually enter data in sections that were not completed by the respondents, but which can be inferred from
the rest of the survey, or by observation.

Conversion

Next we need to code the data. Coding is rarely a simple process, and we usually have to apply a complex coding scheme to
turn the raw data into processed, analysable information.

Storage

After coding the data we will have to make a decision about the short and long-term storage of the information we have
generated; many options are available, particularly when we are considering electronic (computer-based) data storage.

Processing

When the data are ready for processing we need to apply a systematic approach to the summarisation, analysis and
presentation stages. Given the variety of techniques available, each of these stages will be covered in separate chapters.

The entire process is summarised in Figure 9.1:

Pre-processing   ->  Raw Data
Conversion       ->  Coded Data
Storage          ->  'Database'
Processing       ->  Summarise (tables, maps, graphs)
                     Analyse (statistics)
                     Present (oral, written)

Figure 9.1: Survey data processing

Pre-processing

The primary purpose of pre-processing is to correct problems that are identified in the raw data; this might include
differences between the results obtained by multiple interviewers. We can regard this as the survey equivalent to the
calibration problem in physical experiments, where significant and consistent differences between the measured result and
the 'correct' result are found. This is indicated by data sets that exhibit high precision but low accuracy, where the results differ from the correct value by a consistent amount, in a consistent direction.

In most data collection exercises there are two pre-processing stages.

The elimination of unusable data

This might include, for example, mutually contradictory data from related questions. We may find questions that really
provide the same data, so we must decide which one is worth coding and storing, and which one is to be discarded.

More subtle problems are associated with trying to interpret ambiguous answers; it could be argued that any complex survey
is likely to produce at least some answers of this type, and the analyst needs to develop a strategy (which is part of the
coding scheme, as described in the next section) for dealing with them. Some of these will be trivial and easily resolved, but
others will tax the ingenuity of the data analyst.

It is worth emphasising that many of the problems we might detect at this stage reflect adversely on the overall design of the
survey, and in particular on the adequacy of the pilot survey.

The development of a coding scheme.

A coding scheme is an unambiguous set of prescriptions of how all possible answers are to be treated, and what (if any)
numerical codes are to be assigned to particular responses. It is impossible to create a definitive coding scheme until all the
surveys have been examined, although it is possible (in fact desirable) to create a provisional scheme as part of the overall
survey design. The scheme is tested and refined with the pilot survey. In the coding scheme we assign codes to each likely
answer, and we specify how other responses are to be handled. For example, we might allocate 1 to yes, 2 to no and 0 to
don't know. Although these numerical codes are arbitrary, in some cases their organisation will have implications for how
the resulting data can be processed statistically. By definition, open questions are much more difficult to code than closed or
structured questions.

The best test of a coding scheme is whether its creator can hand it to another person, and that person's coding of the raw
data matches exactly what the creator would have produced by applying the scheme to the same answers.
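
As a concrete illustration, here is a minimal sketch in Python of how such a scheme might be written down and applied. The question, the answer categories and the use of 9 for unanswered questions are assumptions made for the example, not part of any real survey; the 1/2/0 codes follow the yes/no/don't know example in the text.

```python
# A minimal sketch of a coding scheme for one hypothetical yes/no/don't-know question.
# Codes follow the example in the text (1 = yes, 2 = no, 0 = don't know); the use of
# 9 for an unanswered question is an assumed convention.
CODING_SCHEME = {
    "yes": 1,
    "no": 2,
    "don't know": 0,
}
MISSING_CODE = 9

def code_response(raw_answer):
    """Return the numerical code for a raw answer, or the missing-data code."""
    if raw_answer is None or raw_answer.strip() == "":
        return MISSING_CODE
    return CODING_SCHEME.get(raw_answer.strip().lower(), MISSING_CODE)

raw_answers = ["Yes", "no", "Don't know", None, "maybe"]
print([code_response(a) for a in raw_answers])   # [1, 2, 0, 9, 9]
```

Writing the scheme down in this explicit form is what makes the test described above workable: anyone applying it to the same answers should produce exactly the same codes.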

Coding

The core function of the coding process is to create codes and scales from survey responses, which can then be summarised
and analysed in various ways. Coding is the generic process of assigning a numerical value to a response that has no intrinsic
numerical value, such as the answer to a yes/no question. These codes will usually be integers.

The creation of codes is not truly arbitrary. Each code is at least ordinal in that it has 'direction' - that is, some degree of
ranking. If the raw data are nominal we can regard the coding process as increasing ('improving') the scale of the data to one
that allows some more complex forms of analysis.

The major problem associated with coding is the treatment of missing data: how to specify what action should be taken
when the coding cannot be applied, as when a question is unanswered. Do we ignore it, or change and interpret it? There are
several possible approaches:

 Cross-reference the missing answer with the answers to related questions (this option, ironically, is less available if we have
carefully minimised duplication between questions).
 Interpolate from other answers to create a 'pattern' for the respondent, and look to see how other respondents of the same
'type' answered this question.
 Look at the distribution of answers and interpolate from that; some computer programs will supply distributions of answers
to the question and suggest what the missing value ought to be in order to maintain the distribution.
 Give missing data its own code, such as "Didn't answer"; this is the most common (and safest) approach.
 Exclude the respondent from the analysis (if the respondent failed to answer a number of questions, or the responses appear
unreliable).
 Exclude the question from the analysis (if a significant number of respondents failed to answer it).

Storage

Even with the current trend to storing all data electronically, it is still important to recognise
that electronic storage is not always justified. The following table summarises the basic
advantages and disadvantages of storing data in electronic form, compared with non-
electronic (paper) storage.

MEDIUM       PROs                   CONs

Paper        Low cost               Not extensible
             Speed                  Fragile
             Easy distribution      Bulky
             Comprehensible

Electronic   Extensible             Equipment costs
             Easy distribution      Limited access
             Interchange options    Fragile
             Low volume

Selecting electronic storage is an increasingly significant decision because the system in which we store the data will
determine (at least in the early stages) what forms of analysis we can carry out and how easy it will be to transfer the data
into systems which will do more complicated forms of analysis. The major kinds of computer software are all potentially
useful during the data analysis stage; which we choose depends on the overall plan that we have for analysing and
presenting the data.

Word processor

We may choose, for instance, to enter the data in text form straight into a word processor. The obvious advantage is that we
don't waste time on unnecessary processing; if we are creating a report from this data to explain and present it then we are
directly ready to use the data. So we might choose to take the data (from our survey or experiment recordings) and put them
directly into a word processor. The major problem is the lack of analytical tools; only the most advanced word processors have
spreadsheet-like functions, so that if you put data into a table you can perform simple calculations (sums and standard deviations)
on the columns of the table - and even these are relatively primitive.

Spreadsheet

Probably the most versatile analysis and storage combination is the (electronic) spreadsheet. Many of the formulae that
spreadsheets have built in are applicable to the data summarisation process.

Spreadsheets allow a large range of conventional summary statistics; some also incorporate elements of Exploratory Data
Analysis (EDA). It is possible with some spreadsheets to form cross-tabulations. Also, one thing that most spreadsheets do
extremely well is graphical presentation of the results of an analysis.

Spreadsheets are also able to interchange data with other systems. We can usually take information straight from a
spreadsheet and place it into a word processor; so relevant information from the spreadsheet can be copied directly across to
a report.

The statistical functions supported by spreadsheets are mostly restricted to descriptive statistics and basic inferential
statistics; we are unlikely to find a wide range of advanced statistical operations, such as multivariate statistics. If we want to
do this kind of work then the data should be entered directly into a package specifically designed for that level of analysis.
Similarly, whilst the graphics in most spreadsheets are visually impressive, they are usually restricted to a certain number of
fairly fundamental graphic structures (bar charts, pie charts, and so on). If we want to use some of the more esoteric systems
we would have to transfer the data either via a statistical package or directly to a graphics package.

Database

Spreadsheets may be the most versatile data storage system, but versatility implies lack of specific function; many of the
things that are possible with spreadsheets are done better in other programs, and the most obvious example of this is record
keeping. We would look to use a database program where we want to take advantage of the record manipulation options of
database management systems. We might wish, for example, to find all survey forms where the respondent said yes to one
question and no to another. These operations can be done with difficulty in a spreadsheet but are exactly what databases are
designed to achieve. When dealing with large amounts of raw data, a database might be the best first stage, rather than a
spreadsheet. As well as basic record manipulation (sorting and searching), the database may have basic data processing
functions, such as cross-tabulations. Summary statistics usually have to be generated from the raw data 'outside' the
database; this is commonly done by creating a report and adding summary values to the report.
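
The kind of record selection described above can be sketched with Python's built-in sqlite3 module; the table layout, the column names (q1, q2) and the coded values are illustrative assumptions, not part of any real survey database.

```python
# A minimal sketch of the 'yes to one question, no to another' selection using sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (respondent INTEGER, q1 INTEGER, q2 INTEGER)")
conn.executemany(
    "INSERT INTO responses VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 1, 1), (3, 2, 2), (4, 1, 2)],   # codes: 1 = yes, 2 = no
)

# Select every respondent who answered 'yes' (1) to question 1 and 'no' (2) to question 2.
matches = conn.execute(
    "SELECT respondent FROM responses WHERE q1 = 1 AND q2 = 2"
).fetchall()
print(matches)   # [(1,), (4,)]
```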

Databases have high levels of interchangeability with other systems, such as word processors, spreadsheets, graphics
packages and statistical packages. Consequently, the database is often a good starting point for storing raw data because if
you need to manipulate it (beyond what the database is capable of) you can do so by transferring it into the right alternative
system. When we first enter the data we know that later it will have to be exported and/or converted, and the data should be
organised appropriately. One of the primary skills in information management is being able to decide where to 'place'
information, and what form it should take.

Statistical systems

There are application systems that carry out a wide range of statistical techniques. The simplest ones support data
summarisation and basic inferential statistics; the more complex support advanced inferential techniques, including
multivariate methods.

What they offer is advanced data manipulation. This includes sophisticated data description, and a range of various
statistical tests. Statistical systems interchange particularly strongly with graphic systems. Having generated the statistics we
would usually expect to find some graphical presentation for those results.

Graphical Systems

Generally, you are not going to actually store data in a graphical system for future analysis. The assumption is that you have
carried out the analysis and what you are interested in doing is generating graphical displays of your results. So graphical
systems emphasise:

 advanced display options, including a large range of chart types
 interchange with word processors and other graphic systems (such as presentation graphics and, increasingly, visualisation
systems)

In this section you'll see how significant data summarisation is as the first stage of data analysis. You will also be introduced
to the basic methods of statistical and graphical summarisation.

On completion of this section you should be able to:

 list the reasons why we summarise data


 explain how descriptive statistics are related to inferential statistics, and how the latter are used to test hypotheses
 distinguish between different types of data distributions, and describe how to analyse them
 define the major forms of summary statistics
 list the main shape characteristics of data distributions
 describe the basic components of Exploratory Data Analysis.

Contents

 Data summarisation
o The purpose of data summarisation
o Descriptive statistics, inferential statistics and hypothesis testing
 Descriptive statistics
 Inferential statistics
o Forms of distribution
 Statistical Notation
 Individual value
 Sequence of values
 Sum of a sequence of values
o Summarisation
 Ungrouped and grouped data
 Frequency distributions
 Graphical representation
o Summary statistics
 Central Tendency
 Dispersion
o The 'shape' of a distribution
 Components of 'shape'
o Exploratory Data Analysis (EDA)
 Extremes and the median
 Hinges and 5 figure summaries
 Box-and-whisker plots
 Trimean

The purpose of data summarisation

No matter how trivial the volume of data we generate with a survey or experiment, we are unlikely to be able to make sense
of it by looking at all the data. Whenever we collect significant quantities of data it becomes essential to summarise the data
in some way.

Why is summarising data important? One subtle but important reason is that when we look at data in the raw we give too
much emphasis to extreme values, which in statistics are called outliers. If you look at 20 numbers and all but one are
between 20 and 23, and the other is 47, you will give too much weight to the 47. In interpreting the data it is much more
important that all the others are very similar to each other, but we say 'Why is that one 47?' and focus on that instead of the
others. When we summarise this set of values the mean is going to be (say) 21.9, not 46, so the emphasis is shifted to the
mass of the data.

The major reason for summarising data is to reduce data complexity to a level we can grasp. We generate intelligible
statistics (there is no value in replacing complexity with incomprehensibility) that in some sense 'replace' the raw data. We
don't lose the raw data - we may reuse it at will - but we create a new picture of what the data are like. If we say the average
age of the people in a particular group is 27 a picture appears in the mind of a 27-year-old that encapsulates that group. This
is an important and useful process in building an image of what the data look like. Exploratory data analysis was developed
in the late 1970s largely because John Tukey believed that the conventional techniques for building simple pictures of data
had fallen out of fashion, and that people were bypassing the crucial simple picture and going straight to the detailed
analysis. It is futile to do complex multivariate statistical analysis if you don't know the basic structure of the data.

In order to see the significance of summarisation we have to stop and think about what we are trying to do: we are trying to
detect functional relationships reflected in the patterns (structures) in the data being analysed. The problem is that we
become swamped by the volume of collected data, and the data are inherently complex because we are not studying trivial
problems. The complexity of the system being studied shows up in the complexity of the data. So we want ways of seeing
what type of analysis is worth doing.

The most basic form of data summary is simple counting. If I see a mixture of objects (say a number of different coloured
balls lying on the floor) I can say 'I can see red balls, green balls and yellow balls'. I could, however, say 'I see 14 red balls,
22 green balls, and 16 yellow balls'. Both statements are summaries, and almost any description of a system will be a
summary because it invariably collapses the data. Counting just adds quantity to description. Such summaries are so much a
part of our normal thought processes that we don't even see them as a form of analysis.
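
As a small illustration, the coloured-ball description above can be reproduced in a line or two of Python; the list of observations is, of course, invented for the example.

```python
# Counting as the most basic form of data summary.
from collections import Counter

balls = ["red"] * 14 + ["green"] * 22 + ["yellow"] * 16
print(Counter(balls))   # Counter({'green': 22, 'yellow': 16, 'red': 14})
```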

The form of summary most people are able to cope with is graphical. Most people are uncomfortable with even basic
numerical procedures, but given even a badly constructed diagram we will feel happy to believe that we understand it.

There is no clear reason why everyone cannot also be comfortable with statistical summaries. The primary objection seems
to be a fear of mathematics - but there is almost no relationship between mathematics and descriptive statistics. To generate
statistics requires some mathematics but to understand statistics does not; you have only to want to find a stronger way of
summarising data. Graphical summaries add comprehension to verbal and numerical summaries, and statistical summaries
add another layer of comprehension.

Descriptive statistics, inferential statistics and hypothesis testing

In simple terms there are three forms of statistical analysis:

 Descriptive statistics, which aims to describe (by summarisation) the structure of data.
 Inferential statistics, which aims to derive conclusions about a large group (the population) from measurements on a smaller
sub-group (a sample).
 Hypothesis testing, which aims to determine whether the collected data support a particular hypothesis.

Descriptive statistics

  …comprise those methods concerned with describing a set of data so as to yield meaningful information.

Descriptive statistics is the branch of statistical analysis that deals with creating summary values. If we take 100 boxes (a
sample) and measure their volume, then we take a different 100 boxes and measure their volume. Descriptive statistics gives
us the average for each sample. Inferential statistics allows us to estimate the average volume of all the boxes (the
population) from which these are a sample. Hypothesis testing techniques allow us to test the proposition that the first pile of
boxes is on average bigger (say) than the second pile of boxes - which is equivalent to saying that they are samples from two
different populations of boxes.

The descriptive statistics that most people are familiar with are the maximum, minimum and average (or mean) but as we
shall see in the next section there are many others that can be used where appropriate.

Inferential statistics

  …comprise those methods concerned with the analysis of a subset of data leading to predictions (or inferences) about the
entire set of data.

Inferential statistics in most cases make use of descriptive statistics, but with the aim of making predictions or inferences
about the entire set of data (the population) from a subset (a sample). An inference is an estimate. We cannot say exactly
what the 'correct' answer to the inference is; we can only say things like 'the mean population mass is 23.7 g, with a 10%
error of estimate'. All inferential systems are uncertain; we can never say 'the sample has an average of 120, so the
population average must be 120' (unless, of course, the sample is the same size as the population). We can say things like
'the sample mean is 120 so the estimated population mean is 120, with a 95% probability that the true value is within 5% -
that is, between 114 and 126'.

Let's look at a simple example of the relative roles of descriptive and inferential statistics. Say an analysis of the
performance of students in a particular unit over the last five years gives an overall success rate of 74% (that is, 26% have
failed the unit). This is descriptive and definitive; only changes to the raw data will affect this value. The descriptive statistic
in this case is a simple ratio: number of people who have succeeded as a proportion of the total number of people who took
the unit.

The wrong way to use this value is to say that a student currently enrolled in this unit has a 74% chance of completing it
successfully. This might be a year in which nobody passes; on the other hand, it might be a year in which everybody passes.
With inferential statistics we are able to provide an estimate of this year's pass rate, based on the current data. It would be
74%, with a margin of error that we can quantify; for example, there is a 95% chance that this year's pass rate
will be between 67% and 81%. But that is an inference, and we will not know how accurate that inference is until this year's
data can be processed (although we might use that figure as a prediction without waiting for the actual results). At the end of
the year we can work out the pass rate and see whether it matches the prediction from the inferential analysis. We do not
expect a value of exactly 74%, but it is very unlikely (less than a 1 in 20 chance) to be less than 67% or greater than 81%.
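
One common way of quantifying such a margin of error is a normal-approximation confidence interval for a proportion. The sketch below is only illustrative: both the method and the assumed number of students (n = 150) behind the historical 74% figure are assumptions, since the text does not specify either.

```python
# A minimal sketch of a normal-approximation 95% interval for a proportion.
import math

p_hat = 0.74   # observed pass rate
n = 150        # assumed number of students behind that rate
z = 1.96       # multiplier for an approximate 95% interval

margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{p_hat:.0%} +/- {margin:.1%}")   # roughly 74% +/- 7%, i.e. about 67% to 81%
```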

Forms of distribution

When we collect data we produce a series of values that together constitute a distribution. This distribution can have the
values in some specific, sorted order (ranked) or in no particular order (unranked).

We could also take the data and divide it into groups, counting how many values fall between 10 and 20, 20 and 30, and so
on. The alternative to this method is to leave the data ungrouped.

The ways that we analyse ranked data are different from those we apply to unranked data, and the ways we analyse grouped
data are different from the ways we analyse ungrouped data.

Suppose we measure the height of 50 individuals in a room. Ranked or not, this forms a set of ungrouped data.
Alternatively, we can convert that data into groups, such as 151cm to 160cm, 161cm to 170cm, and so on. The sequence of
counts for the groups constitutes a new data set, called a frequency distribution. Converting the data in this way means we
can no longer apply some analysis and description techniques, but as you'll see there are other techniques we can use to
achieve the same results.

Statistical Notation

To understand the summarisation techniques that you'll examine in the rest of this chapter, you need at least a basic
understanding of some simple mathematical notation.

Individual value

Each observation of a particular phenomenon is designated by xi

Sequence of values

A set of observations of a particular phenomenon constitutes a distribution, and is designated by x1, x2, x3, …, xn

Sum of a sequence of values

The sum of a set of n observations of a particular phenomenon is designated by

    Σxi = x1 + x2 + x3 + … + xn

Summarisation

There are three ways to approach the summarisation of data: using statistical techniques, using graphical techniques, or
using a combination of both. Summary statistics are a set of mathematical methods we use to extract further information
from observed data.

Note that, when dealing with a sample survey, the summary statistics that are generated refer to the sample, but we can use
inferential methods to estimate the population values from the sample values.

With descriptive statistics it is important to define whether we are calculating values for a population or for a sample: the
results will be different. In this context we use the two terms parameter and statistic in a particular way.

Normally in science we use 'parameter' to refer to some property of a system we assume to be constant - at least for the
period during which we are studying the system. In descriptive statistics we apply the term parameter to any numerical value
describing a characteristic of a population; what we are trying to estimate from a sample is a population parameter. By
contrast, a sample statistic is any numerical value describing a characteristic of a sample. In the standard statistical notation
we use Greek letters for population parameters and we use lower-case English letters for sample statistics. The exception is
that we designate the sample size as n, and the population size as N.

Ungrouped and grouped data

We also need to appreciate the distinction between descriptive statistics generated from grouped and ungrouped data.
Ungrouped data (of interval or ratio scale) can be converted into grouped data by dividing the data into a series of mutually
exclusive categories covering the whole data range (0-19, 20-39, 40-59 and so on); ordinal scale data are already grouped.
For most summary statistics there are statistics applicable to ungrouped data, and equivalent statistics for grouped data.
Generally, the statistics for ungrouped data are more powerful; so we would usually only make the conversion to allow us to
compare the grouped data with the distribution of other, comparable data sets.

Frequency distributions

The values in a set of ungrouped data constitute a distribution. The values that we have in a set of ordinal data, and the
values we generate by converting ungrouped data into grouped form, constitute a frequency distribution.

For example, imagine a survey in which we measure the weight of a sample of pieces of
baggage being loaded onto a plane. The values for all the pieces of baggage that we measure
together make up a distribution; we can calculate sample statistics from that distribution,
such as a sample mean (for example, 14.56 kg). We can also create a frequency distribution
of grouped data, as shown in the table below.

Weight (kg) Number


7-9 2
10-12 8
13-15 12
16-18 19
19-21 7

The frequency distribution is made up of the values (counts) for a set of classes; each class has a frequency (f) associated
with it. The class limits are the upper and lower values for each class; they should be defined in such a way that every value
is included, but no value can fall into two classes. To achieve this we define class boundaries with a precision (in this case,
the number of decimal places) one order below that of the actual data values. For example, in the baggage example, if
we weigh the pieces to the nearest tenth of a kilogram, we would set the class boundaries at 6.95, 9.05, and so on.

The class interval is the difference between the upper class boundary and the lower class boundary; in most frequency
distributions it will be constant across the classes. The point halfway between the upper and lower class limits is the class
midpoint; as you'll see, we use these values to calculate the mean of a set of grouped data.

We can also create a cumulative frequency distribution by aggregating the values in successive classes:

Class Limits    Class Boundaries    Class Midpoint    Frequency    Cumulative Frequency
7-9             6.5-9.5             8                 2            2
10-12           9.5-12.5            11                8            10
13-15           12.5-15.5           14                12           22
16-18           15.5-18.5           17                19           41
19-21           18.5-21.5           20                7            48
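
The boundaries, midpoints and cumulative frequencies in this table can be generated mechanically from the class limits and frequencies, as the following Python sketch shows; the class data are taken from the baggage example above.

```python
# Rebuilding the grouped-data table from class limits and frequencies.
classes = [((7, 9), 2), ((10, 12), 8), ((13, 15), 12), ((16, 18), 19), ((19, 21), 7)]

running_total = 0
for (lower, upper), freq in classes:
    boundaries = (lower - 0.5, upper + 0.5)   # half a unit below/above the class limits
    midpoint = (lower + upper) / 2            # halfway between the class limits
    running_total += freq                     # cumulative frequency
    print(boundaries, midpoint, freq, running_total)
```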

Graphical representation

Whilst the most obvious way of representing grouped data is as a table, the information can also be represented
diagrammatically.

The basic representation of the shape of a frequency distribution is called a histogram (figure 10.1). This can be shown as a
series of vertical (or horizontal) bars, their length indicating the frequency of the particular class, or as a line connecting the
midpoints of each class; in this form the curve is called a frequency polygon (figure 10.2). The polygon is closed by
connecting the midpoints of the end classes to the midpoints of 'imaginary' classes on each side, which have a notional
frequency of zero.

Figure 10.1: Histogram

Figure 10.2: Frequency Polygon

The cumulative frequency distribution can also be plotted as a series of bars (figure 10.3), or as a series of lines joining the
midpoints of the classes; this is termed an ogive (Figure 10.4).

Figure 10.3: Cumulative Frequency Curve

Figure 10.4: Ogive
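
Since the original figures are not reproduced here, the following sketch shows how the same displays can be generated from the grouped baggage data; the use of matplotlib is an assumption, not a requirement of the text.

```python
# A minimal sketch of a histogram/frequency polygon and an ogive with matplotlib.
import matplotlib.pyplot as plt

midpoints = [8, 11, 14, 17, 20]
frequencies = [2, 8, 12, 19, 7]
cumulative = [2, 10, 22, 41, 48]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(midpoints, frequencies, width=3)        # histogram: one bar per class
ax1.plot(midpoints, frequencies, marker="o")    # frequency polygon over the class midpoints
ax1.set_xlabel("Weight (kg)")
ax1.set_ylabel("Frequency")

ax2.plot(midpoints, cumulative, marker="o")     # ogive: cumulative frequency curve
ax2.set_xlabel("Weight (kg)")
ax2.set_ylabel("Cumulative frequency")
plt.tight_layout()
plt.show()
```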

Summary statistics

The aim of summary statistics is to generate simple numbers to describe distributions, either grouped or ungrouped. Such
statistics have two functions: they can add to our understanding of the data that make up the distribution, and they can
substitute for (be used instead of) the distribution. We can divide the major summary statistics into two groups: measures of
central tendency and measures of dispersion.

Central Tendency

The various measures of central tendency are numbers that tell us something about the location of a distribution's 'centre' -
but what do we mean by the centre of a distribution? If we regard all measurements as being attempts to give us the `true'
value of a particular phenomenon, we can regard the centre of the distribution of a set of measurements as our estimate of
that `true' value. The various sources of error in the measurement process will produce variability in the measurements, so
they will not all have the same value. Measures of dispersion attempt to quantify the extent of this variability.

When dealing with ungrouped data we can use several measures of central tendency. The most significant is the arithmetic
mean, also called the mean or the average. We can also employ the median and mode. When dealing with grouped data we
cannot calculate the arithmetic mean exactly; instead we use the group mean. Nor can we determine the median of grouped
data, but we can define the modal class.

MEAN
For a population of size N the population mean is given by

    μ = Σxi / N

For a sample of size n the sample mean is given by

    x̄ = Σxi / n

For a set of grouped data with k classes, with frequencies fk and class midpoints xk, the group mean is given by

    x̄ = Σ(fk xk) / Σfk

MEDIAN
We can define the median of a set of ungrouped data if the data are arranged in ascending or descending order; in general,
the median is the value that has half of the data values less than it, and half greater than it. If the sample size n is an odd
number, the median is the middle value of the entire distribution. If n is an even number, the median is the mean of the two
'middle' values. For example, for the following (ungrouped) data

12, 14, 16, 18, 19, 22, 24

the median is 18, whereas for

12, 14, 16, 18, 19, 22, 24, 27

the median is 18.5

MODE
The mode of a set of data is the value that occurs most often, with certain provisos:

 It is possible to have no mode (that is, no value occurs more than once).
 It is possible to have more than one mode (a distribution may be bimodal, trimodal or multi-modal).

For grouped data the class with the highest frequency value is the modal class. There may be two modal classes (bimodal),
or more. For example, for the following values

12, 18, 13, 13, 22, 12, 14, 13

the mode is 13
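
Python's built-in statistics module provides these measures directly; the following sketch applies it to the small examples above, and adds the group mean for the baggage data.

```python
# Mean, median and mode with the statistics module; group mean computed by hand.
import statistics

values = [12, 14, 16, 18, 19, 22, 24]
print(statistics.mean(values))                                # about 17.86
print(statistics.median(values))                              # 18
print(statistics.median([12, 14, 16, 18, 19, 22, 24, 27]))    # 18.5
print(statistics.mode([12, 18, 13, 13, 22, 12, 14, 13]))      # 13

# Group mean for the grouped baggage data: class midpoints weighted by frequency.
midpoints = [8, 11, 14, 17, 20]
frequencies = [2, 8, 12, 19, 7]
group_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / sum(frequencies)
print(group_mean)                                             # 15.3125
```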

Dispersion

The second type of summary statistics describes how much the distribution varies around the central point. Is the
distribution 'tight', or are the values spread out over a wide range? The various ways we can describe this spread are called
measures of dispersion; they quantify the variability of the distribution. As they are attempting to quantify the general shape
of a distribution, rather than a single value for its centre, most measures of dispersion are numerically more complex than
the statistics we have examined so far.

RANGE
The simplest measure of dispersion is the range of the data: the difference between the highest and the lowest values in the
data (maximum - minimum).

VARIANCE
The measure that tells us most about the distribution of data (but is relatively complicated to calculate) is the variance. It is
based upon the idea that each observation differs from the mean by some amount.

The difference between each value and the population mean is called its deviation. If we sum the absolute values of these
deviations and divide the result by the population size (N), we obtain the mean deviation.

Unfortunately, this measure does not give sufficient 'weight' to the values on the margins of the distribution. To do so, we
instead take the sum of the squares of the deviations from the mean. Dividing this value (the sum of squared deviations) by
the population size gives us the variance of the distribution.

The population variance is given by

    σ² = Σ(xi - μ)² / N

The sample variance is given by

    s² = Σ(xi - x̄)² / (n - 1)

The sample variance is the most versatile measure of variation; as a sample statistic it can be used to estimate the population
variance.

STANDARD DEVIATION

The standard deviation is the square root of the variance, and is expressed in the same units as the original observations; the
variance itself is expressed in those units squared. For example, in our luggage example we might have a mean weight of
13.78 kilograms, a variance of 3.56 square kilograms, and a standard deviation of 1.89 kilograms. Because these measures
carry units, we cannot compare the variances of two distributions unless they happen to have the same units; nor can we use
the variance (or the standard deviation) to indicate which of two or more distributions exhibits greater relative variability.
For this latter purpose we need a 'dimensionless' measure of dispersion, for which we usually employ the coefficient of
variability.

COEFFICIENT OF VARIABILITY (or VARIATION)

The coefficient of variability is calculated by expressing the standard deviation as a percentage of the mean. The coefficient
of variability for a population is given by

    CV = (σ / μ) × 100%

while for a sample the coefficient of variability is given by

    CV = (s / x̄) × 100%
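
The statistics module also covers the dispersion measures defined above; in the sketch below, pvariance treats the values as a population, while variance and stdev treat them as a sample (dividing by n - 1).

```python
# Range, variance, standard deviation and coefficient of variability.
import statistics

values = [12, 14, 16, 18, 19, 22, 24]

data_range = max(values) - min(values)         # range
pop_var = statistics.pvariance(values)         # population variance
samp_var = statistics.variance(values)         # sample variance (divides by n - 1)
samp_sd = statistics.stdev(values)             # sample standard deviation
cv = 100 * samp_sd / statistics.mean(values)   # coefficient of variability, as a percentage

print(data_range, round(pop_var, 2), round(samp_var, 2), round(samp_sd, 2), round(cv, 1))
```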

The 'shape' of a distribution

The measures you've seen so far use the raw data values in an ungrouped distribution, or the frequencies in a grouped
distribution. Alternatively, we can use measures that summarise the actual shape of the curve that we draw to represent the
distribution, using frequency curves.

Components of 'shape'

The basic shape of a frequency curve can be described quantitatively by several measures. You've already seen various ways
to define the centre, as well as several ways to define the range (spread) of the distribution. What you haven't seen are
measures that explicitly quantify the 'balance' of the distribution. As shown in figure 10.5, this balance has two components:

 Are the values arranged symmetrically on either side of the centre?


 Is the distribution highly `peaked' (most values lie close to the centre, and the tails are short) or is the distribution `flat' (long
tails and a low central concentration)?

Figure 10.5: The major components of distribution shape

To attack this problem we start by plotting the (sorted) data as a cumulative frequency curve (ogive) to get a shape like the
one in Figure 10.6

Figure 10.6: Quartiles

The median is Q50, the value that has half the data above it and half below. The value Q25 is the first quartile; 25% of all the
observations fall below this value; Q50 is the second quartile, and Q75 is the third quartile.

Percentiles

Quartiles are actually special cases of the more general measurements we call percentiles. Percentiles are values that divide
a set of observations into 100 equal parts (P1, P2, P3, … P99) such that 1% of all the data points fall below P1, 2% fall below P2,
and so on.

Deciles

Deciles are values that divide a set of observations into ten equal parts (D1, D2, D3, … D9) such that 10% of all the data points
fall below D1, 20% fall below D2, and so on.

Quartiles

As you've seen, quartiles are values that divide a set of observations into four equal parts (Q1, Q2, Q3) such that 25% of all the
data points fall below Q1, 50% fall below Q2, and 75% fall below Q3.
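
Quartiles, deciles and percentiles can all be obtained as cut points from statistics.quantiles (Python 3.8 and later); note that different packages use slightly different interpolation rules, so the exact values may vary a little. The data here are the small example used earlier for the median.

```python
# Quartiles, deciles and percentiles as cut points of a distribution.
import statistics

values = [12, 14, 16, 18, 19, 22, 24, 27]

quartiles = statistics.quantiles(values, n=4)       # [Q1, Q2, Q3]
deciles = statistics.quantiles(values, n=10)        # [D1, ..., D9]
percentiles = statistics.quantiles(values, n=100)   # [P1, ..., P99]

print(quartiles)   # Q2 is the median (18.5 for these data)
```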

Symmetry and 'peakedness'

The measure we use to describe the overall symmetry of a distribution - that is, whether the two tails of the distribution are
equal - is called the skewness. We can describe the distribution as left (negatively) or right (positively) skewed, and we can
use the coefficient of skewness to quantify the extent of the asymmetry.

We also define whether the distribution is 'peaked' or not; the measure for this is called the kurtosis. Distributions that are
strongly peaked (that is, most of the values lie close to the centre of the distribution, with relatively short tails) are termed
leptokurtic, whereas those where the values are broadly spread (the tails are long) are termed platykurtic.
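
The text does not commit to particular formulae for these two measures, so the sketch below uses the common moment-based definitions (with kurtosis reported as 'excess' kurtosis, that is, relative to the normal distribution); treat the exact expressions as assumptions.

```python
# Moment-based coefficients of skewness and (excess) kurtosis.
import statistics

def skewness(data):
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum((x - m) ** 3 for x in data) / (len(data) * s ** 3)

def excess_kurtosis(data):
    m = statistics.mean(data)
    s = statistics.pstdev(data)
    return sum((x - m) ** 4 for x in data) / (len(data) * s ** 4) - 3

sample = [12, 14, 16, 18, 19, 22, 24, 47]
print(round(skewness(sample), 2))          # positive: the long tail is on the right
print(round(excess_kurtosis(sample), 2))   # positive: heavier tails than a normal curve
```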

Exploratory Data Analysis (EDA)

The summary measures we have been reviewing have been in use for many years, and lead into the much broader field of
statistical analysis. As that field developed, and increasingly powerful and complex tools for analysing data were defined,
some data analysts concluded that the underlying structure of the data was being ignored in favour of sophisticated - but
largely inconclusive - analytical methods. The most effective response to these arguments came from the statistician John
Tukey in his 1977 book Exploratory Data Analysis.

Extremes and the median

The first point that Tukey made was that, in traditional summary statistics, the extremes get dealt with too harshly; there is
too much emphasis on the central sections of the distribution. We define values well away from the centre as 'outliers', and
treat them as somehow abnormal, or as unduly influencing the measures of centrality and dispersion. But the extremes are
significant; if we want to understand, say, the distribution of average elevations in a country we can hardly start anywhere
except with the highest and lowest (extreme) values. Tukey argued that this is true of the majority of distributions.

He also criticised the urge to generate single values (such as the mean or median) to describe the centrality of a distribution;
why not use two or more values if they give us a less misleading picture? The same argument applies to the measures of
dispersion: why use the single value of the variance (or standard deviation) rather than two or three values?

Hinges and 5 figure summaries

Tukey argued instead for five figure summaries. One of these five figures is the median; the second and third are the two
extremes (maximum and minimum). The fourth and fifth he called hinges. These are points between the median and the two
extremes, but they are not the same as the quartiles. In figure 10.7 the median is indicated by M, the extremes are indicated
by I, and the hinges by H. The position of the hinges is controlled by the 'folding' of the distribution.

-3.2 1.5 9.8


-1.7 1.2 1.8 6.4
^ -0.4 0.3 ^ 2.4 4.3 ^
| 0.1 | 3.0 |
| | |
| ^ | ^ |
| | | | |
| | | | |

I H M H I

Figure 10.7: Hinges, extremes and the median

Box-and-whisker plots

For large numbers of pieces of data we can still derive the five figures but it is clearly a more complicated process. Tukey
introduced, as an alternative to histograms, the box-and-whisker plot. Figure 10.8 shows the positions of the five summary
figures, and the distribution itself, as a box with extension lines (the `whiskers'). The skewness of the distribution is
immediately visible from the comparative lengths of the whiskers. The kurtosis is indicated by the relative sizes of the two
halves of the box.

Figure 10.8: Box-and-whisker plot

Trimean

Tukey also proposed using the five-figure summary to derive a measure of central tendency which would be more useful
and stable than the arithmetic mean. To this end he defined the trimean, which is a weighted mean, using the values of the
upper and lower hinges to give weight to the tails of the distribution, although the median is more heavily weighted to give
appropriate emphasis to the centre.

    trimean = (lower hinge + 2 × median + upper hinge) / 4

The extremes are not included directly but they do have an influence on the value because they affect the values of the
hinges.
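
A five-figure summary and the trimean are straightforward to compute; in the sketch below the quartiles stand in for Tukey's hinges (for most data sets they are close, but as noted above they are not defined in exactly the same way), and the values are those shown in Figure 10.7.

```python
# Five-figure summary and trimean, with quartiles standing in for the hinges.
import statistics

data = [-3.2, -1.7, -0.4, 0.1, 0.3, 1.2, 1.5, 1.8, 2.4, 3.0, 4.3, 6.4, 9.8]

q1, median, q3 = statistics.quantiles(data, n=4)
five_figure = (min(data), q1, median, q3, max(data))
trimean = (q1 + 2 * median + q3) / 4

print(five_figure)   # (minimum, lower 'hinge', median, upper 'hinge', maximum)
print(trimean)
```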

In this section you will be introduced to the basic methods by which scientists communicate their ideas and results, both to
other scientists and to the rest of the population.

On completion of this section you should be able to:

 define the major elements of the presentation process


 explain how the presentation of scientific information differs from the presentation of more general (non-technical)
information
 list the characteristics of the major presentation formats
 outline the key stages in preparing and presenting an effective oral presentation
 describe the relative merits of using text and tables to present data

Contents

 The process of presentation


o Effective presentations
 The presentation of scientific information
 Presentation formats
o Journal articles
o Monographs and technical reports
o Conference papers
o Textbooks
 Oral presentations
o Writing versus speaking
o Basic structure
o Planning
o Key considerations
o Major errors
 Text and tables
o Structure of tables

The process of presentation

Scientists have long understood their obligation to disseminate the results of their work. They make them available in the
first instance to other researchers in the same field, who are the scientist's peers, by publication in journals and monographs.
This is controlled by peer review, where the quality of a scientist's work - and whether or not it should
be published - is judged by people active in the same field.

In recent decades many scientists have realised that they also have a less obvious obligation: to make the results of their
work known to a much larger community, outside the somewhat restricted group of fellow scientists. This communication of
science to the general public (aided by the mass media) has become increasingly significant in recent years - not least due to
the pressure on scientists that accompanies the competitive search for scarce research funds.

What many scientists have found, however, is that the techniques that they can successfully employ to distribute information
to their peer groups are at best ineffective, and at worst totally useless, when used to communicate with non-specialist
audiences. Unfortunately, most scientists still receive little or no training in this process.

The process of communication with an audience is no different when dealing with scientists than with any other group. In
principle the process can be studied and analysed, and strategies developed for maximising the impact of the presentation on
the audience. Five stages can be defined:

Analysing the audience

All communication studies emphasise the pivotal role of a clear understanding of the characteristics of the audience. These
characteristics include factors such as:

 the extent to which the audience has similar degrees of knowledge and educational levels
 whether the audience is likely to want detailed material, such as the experimental setup or the analytical procedures used
during the research
 how visually literate the audience is, which will determine the number, style and content of illustrations
 how numerically literate the audience is, which will determine the use of equations and formulae

Selecting the 'level' to present at

Planning any presentation (whether it is in oral or written form) hinges upon an early decision - based largely on the
audience analysis - about the level at which to pitch the presentation. This might mean that the scientist decides to give a
general overview, with only limited detailed analysis, or an in-depth examination of a particular area. Most problems in oral
presentations in particular stem from confusion (or error) in making this decision.

Selecting the relevant material

Having decided the right 'level' at which to pitch the material, the correct material has to be selected. This decision is
further constrained by various factors such as word limits in printed material, or time limits in talks.

Constructing the components

Traditionally, scientists have handed over the task of creating technical materials - such as tables, diagrams and photographs
- to specialists in drafting or photography. The advent of personal computers means a majority of scientists now take a
greater role in the production of illustrative and support material.

Balancing the components

The final skill in developing any communication is in balancing the components: the support material, the language and
jargon, and the overall structure.

Even with experience most scientists find this process at least as complex as (and probably more mystifying than) the actual
research. They are most comfortable when dealing with the 'traditional' formats such as journal articles and conference
papers, and least comfortable when using new formats such as electronic publishing, or when communicating with the
public at large.

Effective presentations

When evaluating the effectiveness of any presentation (in whatever format) there are two major areas to consider, which can
be described as 'internal' or 'external'. Internal factors relate to the contents and structure of the presentation, and external
factors relate to the impact of the presentation on the audience. The two are obviously linked, in that the audience will not
usually be in a position to compensate for deficiencies in the internal factors. Similarly, the best-prepared material will be
ineffective if presented to the wrong audience. Effective presentations must combine an appropriate choice of the correct
amount of the right level of material, with an audience receptive to that material.

The presentation of scientific information

It would be wrong to imply that presenting scientific information requires a peculiar approach solely because science deals
with numbers and equations, or because the general public knows relatively little about most areas of science (although
these statements may be at least partially true). Rather, presenting scientific concepts and data is made more difficult by
certain restrictions that may not apply in other areas of communication (although most apply to all 'scholarly' activity).
These include:

 Assumptions about the readership. Most communication of science is done on a peer-to-peer basis, where the presenter and
the audience are assumed to have almost identical technical knowledge. Little time or effort can (or should) be spent
explaining 'simple' concepts that are assumed to be shared by both sides of the communication process. It is often difficult
for scientists to completely overcome this expectation when dealing with a broader audience, and this is the major reason for
a perception among the general public of scientists as poor communicators.
 Acceptance of complexity in arguments. As a corollary to the previous point, specialised scientific and technical audiences
(and 'scholarly' audiences in general) expect detailed analysis. They are unlikely to accept anything at face value.
 Expectations of adequate supporting evidence. This unwillingness to accept arguments solely on the grounds that the
presenter insists they are reasonable, is accompanied by an expectation that all such arguments will be backed by logical
analysis and, where appropriate, experimental evidence. A specialised audience will also be able to interpret and evaluate
this evidence far more quickly than a lay audience.
 Restrictions on the writing style and 'document' format. Although this factor is changing, it has been a universal practice to
present scientific material (at least in written form) in an impersonal style that non-scientists - not privy to the internal
dynamics and excitement of the research and publishing process - often consider dry and even boring. Furthermore, the
relatively rigid formats of the major avenues for publishing in science (written or oral) are largely alien to a public
accustomed to the more freeform structure of the popular media.
On the other hand, many scientists are still uncomfortable with the more relaxed style that is expected in presentations to
a wider audience, a style that is becoming increasingly accepted in specialised scientific publications.

Presentation formats

During their working lives most scientists will have to deal with presenting their work in an ever-increasing variety of
formats. Anyone involved in the field of scientific data presentation needs to be familiar with them. Each has specific
restrictions and possibilities that affect how they are used to convey certain types of information to certain audiences.

Journal articles

The premier avenue for disseminating new scientific results and discussing the development of theories is the scientific
journal. Since the first scientific journals were started in the 1660s, there has been an almost exponential growth in the
number of journals; at the start of the current century there were about 10,000 scientific journals, but today there are at least
a million. Some of these, such as Nature and Science, accept material from the whole range of scientific disciplines, and
retain a particular prestige. The vast majority of journals are highly specialised, catering for a limited but demanding
audience. Most have a complex system for filtering the best material from the inferior, using a reviewing process known as
peer review, under which all submitted material is sent (without indication of who the author is) to several referees, who
comment on the work, suggest changes, and recommend whether it is suitable for the journal and whether it should actually
be published. Editors then decide whether to publish or not, based on referees' reports and their own opinion.

Most journals have been, and continue to be, published in physical (paper) format. Many have reacted to the environmental
cost of paper production, and the economic problems of distributing large amounts of printed material internationally, by
distributing some or all of the contents in electronic form. Increasingly, this means making material available for electronic
access, usually via the Internet. There are now a growing number of refereed journals that are never distributed in printed
form, but appear only on screen.

Monographs and technical reports

Although much scientific work can be presented in the relatively short format (for example, less than 5000 words) that most
journals expect, it is often necessary to publish the results of large research projects, or to include material too voluminous for
most journals. This leads many organisations to create monographs or, less formal in structure and presentation, technical
reports. A monograph is likely to have the following characteristics:

 it will be substantially longer than most journal articles.


 it will have a single, identifiable focus.
 it may contain significantly greater amounts of supporting material.
 it may incorporate in-depth `background' information about experimental procedures and data analysis; this includes
material such as computer programs.

Such monographs are not seen as equivalent to journal articles; rather, they are seen as an alternative when the material is
not appropriate for a journal. Most are refereed and edited as rigorously as any journal article.

Conference papers

In most areas of science a tradition has developed of holding regular meetings at which interested scientists discuss their
work. These conferences have grown from local to international events with the easy availability of rapid and inexpensive
travel. As with journals, there are some large, general events but most conferences are very specialised. Often they are
annual events, but they may also be convened to examine specific topics on a one-off basis (although such meetings often
evolve into regular events).

The format varies somewhat from conference to conference, but most feature sessions in which several papers on a related
topic are presented. Again, each conference specifies its own format and organisation for papers, but the creation and
presentation of such papers is governed by certain common principles:

 There is always a fixed time available for each paper (usually 15-20 minutes), and it is strictly enforced.
 The paper will also usually appear in printed form in the conference proceedings, but it is expected that the presenter will
not read the paper to the audience.
 The audience is assumed to be highly knowledgeable and to need little or no background explanation.
 The emphasis is on the presentation of new results, techniques or interpretations; many conference papers are reports on
work in progress, or on the completion of a major project such as that involved in carrying out research for a doctorate.

Conference audiences are particularly demanding, but tend also to recognise the pressures involved in presenting a paper in
such circumstances (having all done it themselves), particularly by inexperienced scientists.

Textbooks

For many people (including most students) their first - and sometimes only - contact with published scientific material is in
textbooks. These have a significant part to play in the dissemination of scientific ideas, but they need to be seen in context.
The following factors make textbooks less definitive as a source of scientific understanding than many people may perhaps
believe:

 The physical production process means that most textbooks take about two years to write, edit and publish; at least some of
the material will have become outdated, or at least questioned, in that time.
 When writing a textbook for a college, school or lay audience the writer has to interpret the material in a manner to which
the audience will respond favourably (and perhaps buy the book); this `colours' the material and its reading.
 As an expert in the field the writer has to know about controversial areas, and will tend to present them to the reader in a
manner that (even subconsciously) leans towards the side of the debate that he or she favours.
 Nobody is perfect; the writer may simply misunderstand the material, and consequently the reader gets 'wrong' information.
Many textbooks, unfortunately, contain errors, ranging from the trivial to the egregious.

Despite these problems, textbooks remain a major source of reference material, and of insight into the scientific process.

Oral presentations

Conference papers are the most significant example of the process of oral presentation, in which a speaker delivers
information directly to an audience (although the development of information technology means that they may not always
be in the same place). Other forms of oral presentation include seminars within scientific organisations (such as university
departments and research laboratories). The same general rules apply to the conduct of these less formal events as to the
more rigid structure of conference papers.

The major problems associated with oral presentations are common to any attempt to address a large audience (stage fright),
with the added pressure of addressing people whose response may have significant implications for the presenter's long-term
career. But, just as actors can minimise their fear by learning their lines and extensively rehearsing their performance,
anyone making an oral presentation can minimise the potential trauma by preparing and rehearsing. Like actors, speakers
have props that have to be handled: in this case, there are overheads, slides, apparatus, videos, or computers.

Writing versus speaking

Oral presentations in science are not like poetry readings; the audience has not come to hear the speaker's voice so much as
what he or she has to say. In many conferences the members of the audience will have the conference proceedings in front of
them, so they are not going to want to have the paper read to them verbatim - even if that were possible in the limited time
available (you can't read 5000 words aloud in twenty minutes and have any chance of their being understood).

The speaker has to select the key elements of the paper (which may have been committed to print months before) and
present them in such a way that the audience feels they are getting the researcher's own interpretation. This might mean
focussing on only one part of the data, or on a particular problem with the results. It might also mean updating the material
in the printed paper to reflect new results or interpretations.

Of course, if the audience does not have a printed version the speaker has more freedom, but they must still avoid reading
aloud from their own notes.

Basic structure

The basic organisation of oral presentations has not changed in centuries, and certainly not since it was articulated by
William Mayo earlier this century:

  Begin with an arresting sentence; close with a strong summary; in between speak clearly, simply and always to the point;
and above all be brief.

As with many things in life, this is easier to describe than to achieve, but planning and rehearsal are the keys.

Planning

Preparing for a paper or talk should begin some considerable time beforehand. The first stage is audience analysis: is the
presentation to be made to a peer group, all of whom are specialists in the specific field; to a group of specialists, only some
of whom are from the same field as the speaker; or to a group of non-specialists? This will determine the overall level and
content of the presentation. The speaker can then gather the materials and generate a 'first draft' of the talk.

Such first drafts are almost invariably found, when tested, to be too long. This can only be discovered by trialling the talk in a
simulated environment as close as possible to the expected environment of the talk. This leads to the next stage, in which
material has to be removed from the first draft to make it fit within the time constraint.

Once the right amount of material has been selected, the timing of the talk has to be practised to avoid over-running (most
common) or under-running.

Key considerations

When collecting and preparing support material the major considerations are related to illustrative material:

 How many graphs, tables, photographs or videos should be included so that they can all be adequately addressed in the time
available?
 How complex and esoteric should any graphs and tables be?
 How legible will the illustrations be in the environment of the actual presentation? Extremely detailed and 'data-rich' graphs
might work in a small room, but not in a 1000-seat auditorium.

The other major problems are logistical; they relate to the management of the talk environment, and the handling of
illustrative materials:

 How is the lighting controlled, so that you can adjust it easily if required?
 What tools are available for pointing to the illustrative material?
 If you are using on-the-spot handouts, how are these to be distributed?
 Is audio support required (microphones and tapes), and how is it supplied?

Major errors

Anyone with experience in giving (or even sitting through) oral presentations can readily list the major mistakes that all too
many presenters make:

 Many read to the audience rather than speak to them.


 Many speak too quickly.
 Many try to go into too much detail in too little time.
 Many confuse the audience with jargon.
 Many become trapped into complex and convoluted arguments.
 Many use illustrations that look good on their desk (or computer screen) but are unreadable in the auditorium.

Fortunately for the sanity of scientists, most presenters have the sense to avoid these pitfalls. Even those who are not `natural
speakers' can substitute preparation for any intuitive skill.

Text and tables

Before looking in some depth at graphical presentation techniques, you need to understand the relative merits of describing
data (which are largely numerical) using words, numbers or pictures: knowing this you can select the right combination for
each particular data analysis task.

Text

As a species that is at least partly definable by our development of language, we should be loath to ignore the power of
words. In many cases - especially when we are trying to indicate our interpretation as well as the data themselves - we can
use our language skills to great effect. Consider the following statement:

Responses to Question 7 show that both males and females strongly favoured the 'no' option, but there was a detectable
gender difference. Among males opposition ran at five to one (75 to 15), whereas females were more evenly distributed,
with opposition outnumbering support by less than two to one (55 to 35). Clearly the issue draws a stronger and deeper
response from men, and it is largely one of opposition; perhaps this is linked to the extent of mass media opposition to the
proposal.

In about seventy words we have provided the reader with the following information:

1. several key figures from the data ["75 to 15 … 55 to 35"]
2. an impression of the relative significance of the differences between those figures ["among males opposition ran at five to
one … whereas females were more evenly distributed … opposition outnumbering support by less than two to one"]
3. a provisional explanation of the figures ["the issue draws a stronger and deeper response from men"]
4. intimations of further research that might test the explanation ["perhaps this is linked to the extent of mass media opposition
to the proposal"]

It would be hard to imagine how we could combine all these operations - informing and interpreting at the same time - in
any other format.

Tables

Much of the data presented in textual form can also be presented as a table:

Response    Males    Females
Yes            15         35
No             75         55

Tables should primarily be used in situations where the reader needs access to the raw data or to summary statistics; this will
commonly occur where the argument being presented relies on absolute ('75% of males said no') or relative ('only 15% of
males said yes, as opposed to 35% of females') data values, or where a conclusion relies on the reader accepting the author's
interpretation of the raw data as the basis for that conclusion.

On the other hand, tables should not be used when the number of data points is small (less than about 6); instead, use a
textual description. They should also not be used when the number of data points is large (more than about 40); instead, use
a graphical data presentation format (as described in chapter 12).

The exception is when using a table to supply raw data for further analysis; this will usually be handled by placing the tables
in appendices.

Structure of tables

The basic structure of a table also contributes to its effectiveness, especially if we use a consistent format, as
shown in the table below.

                        Gender
Age group     Male    Female    No Response
0-18            48        54             20
19-40           52        33             16
41-60           38        41              3
61+             14         9              2

The major components of a table are:

 title and/or caption
 labels for columns and rows (i.e. the variables), together (where appropriate) with units
 layout elements, including interior and exterior lines, shading, and font variations
 data values

Effective use of these elements can enhance the table by emphasising the internal structure within the data that the writer
particularly wants the reader to notice.
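As a minimal sketch of these components in practice (assuming Python with the pandas library; the construction is purely illustrative), the age-by-gender table above can be built with explicit row and column labels:

import pandas as pd

# The age-by-gender counts from the table above, with explicit row and column labels.
table = pd.DataFrame(
    {"Male": [48, 52, 38, 14],
     "Female": [54, 33, 41, 9],
     "No Response": [20, 16, 3, 2]},
    index=["0-18", "19-40", "41-60", "61+"],
)
table.index.name = "Age group"
table.columns.name = "Gender"

print(table)                  # a plain-text layout with labelled rows and columns

Printing the frame shows the same structural elements listed above: labels for rows and columns, a simple layout, and the data values themselves.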

In this section you'll see some of the reasons why graphical presentation of data is so widely used and - when used properly -
such an effective way of displaying the structure of numerical data. You will also learn about the major forms of non-
quantitative charts.

On completion of this section you should be able to:

 define the contribution to the presentation process that graphical displays of data can make
 describe the essential elements of graphical integrity in data graphics
 list and distinguish the major forms of data graphs
 list and distinguish between the major forms of non-quantitative (structural) graphics

Contents

 The contribution of graphical data display


 Graphical integrity
 Data graphs
o Scatter graphs
o Pie Charts
o Column and Bar charts
o Line charts
o Area Charts
o Polar Charts
o Triangular charts
o 3D Plots (Surface Plots)
 Structure Diagrams
o Flow Charts
o Organisation Charts


The contribution of graphical data display

Given that one of the hallmarks of most scientific research is the collection of data, and given that scientists need to analyse
and interpret those data, what methods are available for examining data graphically? Such methods, when used in
conjunction with the numerical data summarisation procedures you examined in a previous chapter, will form the first line
of attack for interpreting data.

The rather tired cliché 'a picture is worth a thousand words' has less relevance in data presentation than in other areas, but it
is true that effective graphs can markedly increase a reader's comprehension of complex data sets.

What constitutes 'effectiveness' in graphical data presentation remains a matter of some dispute. The advent of inexpensive,
feature-rich graphing software on personal computers has made it easy to create eye-catching charts that often have no real
value, and may actually mislead the viewer (perhaps deliberately). We see large numbers of graphics in the media that
emphasise form over content, usually because the designer of the graphics has an inadequate appreciation of numerical
techniques. To create what the American statistician and information designer Edward Tufte called graphical excellence
demands a blend of statistical rigour and graphic design skill that is unfortunately rare.

Nevertheless, it is possible, by knowing the basic types of data graphs and understanding their limitations, to create visually
pleasing charts that are also founded on sound statistical principles.

The most graphic (pun intended) way of seeing just how useful graphical data presentation
can be is by trying to 'picture' the structure of any of the four data sets shown below purely
by looking at the columns of figures:

                              DATA SET
      1              2              3              4
  X      Y       X      Y       X      Y       X      Y
 10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
  8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
 13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
  9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
 11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
 14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
  6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
  4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
 12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
  7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
  5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

All four data sets have identical summary statistics, and each can have a straight line (created by the statistical technique
of linear regression) fitted to it with the same level of 'fit', as summarised below:

Number of points (n)           11
Mean of X                      9.0
Mean of Y                      7.5
Regression                     y = 3 + 0.5x
Correlation coefficient (r)    0.82
Level of explanation (r²)      67%

It is only when we come to graph these data sets (as the series of scatter graphs in Figure
12.1) that we really appreciate how deceptive the numbers alone can be, and how valuable
the graphical format is.
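As a check on this claim, here is a minimal sketch (assuming Python with the numpy library) that computes the summary statistics for each of the four data sets - which are, in fact, the well-known Anscombe data sets - directly from the columns in the table above:

import numpy as np

# Data sets 1-4 from the table above (Anscombe's quartet).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    x, y = np.asarray(x, float), np.asarray(y, float)
    slope, intercept = np.polyfit(x, y, 1)      # least-squares straight line
    r = np.corrcoef(x, y)[0, 1]                 # correlation coefficient
    print(f"n={len(x)}  mean X={x.mean():.1f}  mean Y={y.mean():.2f}  "
          f"y = {intercept:.2f} + {slope:.2f}x  r={r:.2f}  r2={r*r:.0%}")

All four lines of output report essentially the same values, matching the table of statistics above, even though the underlying patterns are entirely different.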

Figure 12.1: Sample data sets as scatter graphs

Data set 1 is a 'traditional' data configuration, with some scatter around a basic linear pattern, suggesting a positive
relationship between X and Y with some noise, caused presumably by the presence of other, related variables. The two
variables in set 2 are strongly correlated, but the relationship is actually polynomial, rather than linear. Fitting a curve of the
correct form would yield a 100% fit. The variables in set 3 have a very strong (indeed 'perfect') linear relationship, but the
relationship is complicated by the single outlier; removing it would provide a much clearer picture. Finally, the two
variables in set 4 are really not related at all. With the exception of a single outlier, the value of Y is quite independent of the
value of X, which is a constant.

Graphical integrity

The key to creating effective data graphics is the combination of good graphic design and appreciation of statistics that
Edward Tufte (in his seminal books The Visual Display of Quantitative Information and Envisioning Information) defined as
graphical integrity. He suggested that application of the following principles could lead to both graphical integrity and
graphical excellence:

 It is essential to focus on the substance (contents) of the graph - not on the design, methodology or technology.
 It is essential to avoid distortion induced by the format of the graph.
 It is desirable to aim for what he termed high data density, whereby the graph is used to present - as no other method can -
large amounts of data in a coherent manner.
 It is worthwhile constructing graphs that not only present data in a 'static' form, but that encourage comparisons between
variables, locations, or time periods.
 It is desirable to allow the viewer to discover levels of detail within the graph.
 It is necessary to make the graph serve a single, clear purpose, such as data description, tabulation, exploration or
decoration.
 It is important to integrate the graph with statistical and textual descriptions.
 Above all, it is vital that the graph should show the data, not the technical skills of its creator.

Data graphs

In this section we will review the most widely used chart and graph types. While hardly exhaustive, this review will at least
provide a foundation; with these charts it is possible to display most data sets. Later we will point to the more specialised
charts used in particular fields, and to the newer visualisation techniques being developed to handle extremely complex
data sets.

For each chart type you are presented with a series of guidelines for its correct use, together with one or two examples.

Scatter graphs

Scatter graphs are widely used in science to present measurements on two (or more) variables that are thought to be related;
in particular, the values of the variable plotted on the y (vertical) axis are thought to depend on the values of the variable
plotted along the x (horizontal) axis. The latter is said to be the independent variable.

If the range of values on the X and/or the Y axis spans two or more orders of magnitude (for example, 1-200), we would probably use a
logarithmic scale; that is, we would convert the values of the variable by taking logarithms (normally to base 10). This is an
example of the process called data transformation.

The origin of the graph - the point at which the axes cross - should almost always be (0,0), except where a logarithmic scale
makes a zero origin impossible. Figure 12.2 is an example of a scatter graph.
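A minimal sketch of such a graph, assuming Python with the matplotlib library (the variable names and values are invented for illustration, with the dependent values spanning more than two orders of magnitude so that a logarithmic y-axis is appropriate):

import matplotlib.pyplot as plt

# Hypothetical measurements: 'dose' as the independent variable,
# 'response' as the dependent variable, spanning more than two orders of magnitude.
dose = [1, 2, 5, 10, 20, 50, 100, 200]
response = [3, 7, 18, 35, 80, 190, 410, 900]

fig, ax = plt.subplots()
ax.scatter(dose, response)
ax.set_yscale("log")                 # logarithmic transformation of the y-axis
ax.set_xlabel("Dose")
ax.set_ylabel("Response (log scale)")
ax.set_title("Scatter graph with a logarithmic y-axis")
plt.show()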

Figure 12.2: Example scatter graph

Pie Charts

The pie chart is a staple (if frequently misused) form of data presentation graph. Used properly, it can be an effective way of
presenting a small number of pieces of data, provided the following limitations are observed:

 It should be used only where the values have a constant sum (usually 100%).
 It should be used where the individual values show significant variations; a pie chart of seven equal values is of no use.
 It is often worthwhile adding annotations, especially the values for each category (thus saving the need for a separate table of
data values).
 It should be used when the number of categories ('slices') is reasonably small; as a rule of thumb the number of categories
should normally be between 3 and 10.

Figures 12.3 and 12.4 are examples of pie graphs.
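A minimal sketch along these lines, assuming matplotlib (the categories and values are invented for illustration), annotates each slice with its percentage:

import matplotlib.pyplot as plt

# Hypothetical shares with a constant sum of 100%, a small number of categories,
# and clear variation between the values.
labels = ["Road", "Rail", "Air", "Sea"]
shares = [55, 25, 15, 5]

fig, ax = plt.subplots()
ax.pie(shares, labels=labels, autopct="%1.0f%%")   # annotate each slice with its value
ax.set_title("Example pie chart")
plt.show()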

Figure 12.3: Example pie chart

Figure 12.4: Example '3D' pie chart

Column and Bar charts

Like pie charts, column charts (figures 12.5 and 12.6) and bar charts (figures 12.7 and 12.8) are applicable only to grouped
data (although, as you've seen, ungrouped data can readily be converted to grouped form). Again, the correct use of column
and bar charts is controlled by certain guidelines:

 They should be used for discrete, grouped data of ordinal or nominal scale.
 The column (vertical) chart should be used for data which have some 'natural sequence' in the categories, such as a time
series (January, February, March … or 1987, 1988, 1989 …) or a series indicating 'rank' (0-10, 11-20, 21-30 …).


 The bar (horizontal) chart should be used for data with no natural 'sequence' in the categories, such as Road, Rail, Air, Sea …
 It is reasonable to use a logarithmic scale for the y-axis (count scale) if the range of the counts spans two or more orders of
magnitude (e.g. 1-200).
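The following sketch (again assuming matplotlib, with invented values) draws the two variants side by side - columns for an ordered time series and horizontal bars for nominal categories:

import matplotlib.pyplot as plt

# Hypothetical grouped data: an ordered time series for the column chart,
# and unordered (nominal) categories for the bar chart.
years = ["1987", "1988", "1989", "1990"]
counts_by_year = [120, 150, 90, 170]
modes = ["Road", "Rail", "Air", "Sea"]
counts_by_mode = [55, 25, 15, 5]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(years, counts_by_year)            # vertical columns: natural sequence
ax1.set_title("Column chart (time series)")
ax2.barh(modes, counts_by_mode)           # horizontal bars: no natural sequence
ax2.set_title("Bar chart (nominal categories)")
plt.tight_layout()
plt.show()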

Figure 12.5: Example column chart

Figure 12.6: Example '3D' column chart

Figure 12.7: Example bar chart

Figure 12.8: Example `3D' bar chart

Line charts

Line charts are similar in some ways to scatter graphs, with the extra constraint that the values of the independent (x)
variable have their own sequence. Moreover, those values are a sample from a (presumed) continuous series, such as
temperature, pressure or commodity prices. We may display several dependent variables on the same graph and, if
necessary, have different scales for each of them. Figure 12.9 is an example of a line chart.
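A minimal sketch, assuming matplotlib and using invented temperature and pressure readings, shows two dependent series plotted against one continuous independent series, each with its own scale:

import matplotlib.pyplot as plt

# Hypothetical continuous series sampled over a day: temperature and pressure,
# plotted against the same independent variable with separate scales.
hours = [0, 3, 6, 9, 12, 15, 18, 21]
temperature = [12, 11, 14, 19, 23, 24, 20, 15]                 # degrees C
pressure = [1012, 1013, 1011, 1009, 1008, 1007, 1009, 1011]    # hPa

fig, ax1 = plt.subplots()
ax1.plot(hours, temperature, marker="o", color="tab:red", label="Temperature")
ax1.set_xlabel("Hour of day")
ax1.set_ylabel("Temperature (deg C)")

ax2 = ax1.twinx()                          # second y-axis for the second series
ax2.plot(hours, pressure, marker="s", color="tab:blue", label="Pressure")
ax2.set_ylabel("Pressure (hPa)")

fig.legend(loc="upper right")
ax1.set_title("Line chart with two dependent variables")
plt.show()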

Figure 12.9: Example line chart

Area Charts

Area charts (as in figure 12.10) are similar to line charts, but are used for continuous data where there is one (continuous)
independent series, and several dependent series. The latter together have a constant sum, such as the proportions of exports
from a country that fall into different categories.
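A minimal sketch, assuming matplotlib and using invented export shares that sum to 100% in every year:

import matplotlib.pyplot as plt

# Hypothetical export shares that sum to 100% in every year.
years = [2018, 2019, 2020, 2021, 2022]
minerals = [40, 42, 45, 43, 41]
agriculture = [35, 33, 30, 32, 34]
manufacturing = [25, 25, 25, 25, 25]

fig, ax = plt.subplots()
ax.stackplot(years, minerals, agriculture, manufacturing,
             labels=["Minerals", "Agriculture", "Manufacturing"])
ax.set_xlabel("Year")
ax.set_ylabel("Share of exports (%)")
ax.legend(loc="upper right")
ax.set_title("Example area chart")
plt.show()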

Figure 12.10: Example area chart

Polar Charts

Whilst scatter graphs, pie charts and bar charts are used for a wide variety of data types, there are a number of specialised
graph types that have been developed to deal with 'unusual' data types. One of these is the polar chart (figure 12.11), which
is used with discrete data where each point has a value for direction from a source (0 degrees - 360 degrees) and a quantity
(such as field strength in radio systems, or angle of dip in river sediments). This means that we are essentially displaying
vector data.
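A minimal sketch, assuming matplotlib and using invented direction-and-magnitude readings:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical vector data: a direction (degrees) and a magnitude for each reading,
# e.g. field strength measured around a transmitter.
direction_deg = [0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330]
magnitude = [5, 7, 9, 8, 6, 4, 3, 4, 6, 8, 9, 7]

theta = np.radians(direction_deg)              # polar axes expect radians

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(np.append(theta, theta[0]),            # close the loop back to the start
        magnitude + magnitude[:1])
ax.set_theta_zero_location("N")                # 0 degrees at the top, like a compass
ax.set_theta_direction(-1)                     # angles increase clockwise
ax.set_title("Example polar chart")
plt.show()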

Figure 12.11: Example polar chart

Triangular charts

Triangular graphs (figure 12.12) are used to plot discrete data where each point has three values with a constant sum (usually
expressed as percentages), such as the three-way split into sand, silt and clay in soil and sediment samples.
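matplotlib has no built-in triangular axes (specialised packages exist for this), so the sketch below - using invented soil compositions - simply maps each three-part composition onto ordinary Cartesian coordinates before plotting:

import math
import matplotlib.pyplot as plt

# Hypothetical soil samples: percentages of sand, silt and clay (constant sum of 100).
samples = [(60, 25, 15), (30, 40, 30), (10, 30, 60), (45, 45, 10)]

def ternary_xy(sand, silt, clay):
    """Map a (sand, silt, clay) composition onto Cartesian coordinates.
    Vertices: sand at (0, 0), silt at (1, 0), clay at (0.5, sqrt(3)/2)."""
    total = sand + silt + clay
    b, c = silt / total, clay / total
    return b + 0.5 * c, (math.sqrt(3) / 2) * c

fig, ax = plt.subplots()
ax.plot([0, 1, 0.5, 0], [0, 0, math.sqrt(3) / 2, 0], color="black")   # triangular frame
xs, ys = zip(*(ternary_xy(*s) for s in samples))
ax.scatter(xs, ys)
ax.set_aspect("equal")
ax.axis("off")
ax.set_title("Example triangular (ternary) chart")
plt.show()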

Figure 12.12: Example triangular chart

3D Plots (Surface Plots)

The graphs you've seen so far present one or two variables at a time, but you may also sometimes need to plot three
variables that may be interdependent. If the data are grouped you can use a three-dimensional column chart (figure 12.14)
where the base axes are the two independent variables. If the dependent data values are a sample from a continuous
distribution we can use a 3D chart called a surface plot. This may be shown as a continuous surface (Figure 12.13) or as a
series of contour lines.
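A minimal sketch, assuming matplotlib and using an invented smooth surface, draws the same data both as a 3D surface and as contour lines:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical continuous surface: z sampled over a grid of two independent variables.
x = np.linspace(-3, 3, 60)
y = np.linspace(-3, 3, 60)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2) / 2)                 # a smooth 'hill' for illustration

fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax1.plot_surface(X, Y, Z, cmap="viridis")      # continuous surface
ax1.set_title("Surface plot")

ax2 = fig.add_subplot(1, 2, 2)
ax2.contour(X, Y, Z)                           # the same surface as contour lines
ax2.set_title("Contour plot")
plt.show()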

Figure 12.13: Example 3D surface plot

Figure 12.14: Example 3D (bar) chart

Structure Diagrams

There are several important diagram formats that are less obviously 'number-oriented' than those we have considered so far
in this chapter. Some may incorporate quantities as a part of their overall structure, but most present qualitative information.
However, their wide usage in science indicates that knowledge of their characteristics is important for the practising
scientist.

Flow Charts

Flow charts are used most commonly in computing, to represent the internal logical organisation of computer programs.
However, they can be used in any situation where we wish to represent connected structures where there may be alternative
pathways through the system. We can also indicate quantitative aspects of the flow of information or materials through the
structure by annotating the diagram, or varying the line style or thickness to indicate quantity.

The example of a flow chart shown in Figure 12.15 indicates the flow of information and goods involved in processing a
sales transaction, from receipt of the original order through to shipping the goods and invoicing the customer.
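Flow charts are usually drawn with dedicated tools, but as a rough sketch (assuming the third-party Python graphviz package; the node names and the flow itself are invented, loosely following the description above) the structure can also be generated programmatically:

import graphviz   # third-party package; drawing an image also needs the Graphviz binaries

# Hypothetical order-processing flow, loosely following the description above.
flow = graphviz.Digraph("order_processing")
flow.node("receive", "Receive order")
flow.node("check", "Stock available?", shape="diamond")
flow.node("ship", "Ship goods")
flow.node("invoice", "Invoice customer")
flow.node("backorder", "Place back-order")

flow.edge("receive", "check")
flow.edge("check", "ship", label="yes")
flow.edge("check", "backorder", label="no")
flow.edge("ship", "invoice")

print(flow.source)            # DOT text; flow.render("order_flow") would draw the chart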

Figure 12.15: Example flow chart

Several alternative diagram structures have been developed in recent years to help systems analysts design information
processing systems that are characterised by particularly complex internal information flows. A simple example of the kind
of data flow diagram that can be used is shown in Figure 12.16.

Figure 12.16: Example data flow diagram

Other diagrammatic tools have been developed to help analysts model the complex relationships between the components of
information systems. These fall under the general heading of relationship diagrams, and an example is shown in Figure
12.17.

Figure 12.17: Example relationship diagram

Organisation Charts

A special diagram structure (an example of which is shown in Figure 12.18) is used to show the internal structure of
organisations, most of which have a hierarchical structure.

Figure 12.18: Example organisational diagram

Data analysis and the practising scientist

The primary aim of this material has been to introduce the general process of data collection as a component of the research
process, with particular emphasis on those aspects of scientific research that distinguish science from other scholarly
pursuits. In following this aim we have covered several major areas.

 We have examined the critical role of data collection and analysis in the development of science, particularly as a
component of scientific research.
 We have argued that competence in the major areas of data collection, analysis and presentation is an important part of the
training of all scientists, even if the more advanced or technical forms of statistical and graphical analysis may be carried out
by specialists (under the direction of the scientist).
 We have defined the essential skills of data collection and analysis that all scientists (or interpreters of science) should have:
o the ability to design experiments and surveys that generate data capable of testing hypotheses
o the ability to define the "right" data to be collected for a particular purpose
o a broad knowledge of relevant primary and secondary data sources
o an appreciation of sampling and survey methods
o an adequate understanding of the basic techniques of data summarisation
o an appreciation of the correct use of various forms of data presentation technique

The knowledge of these areas that has been presented here can be extended almost indefinitely, as part of the normal
development of any scientist. This extension can only be based on experience and practice, both of which require time.
