Collection and Presentation of Data


Chapter 2 Collection and Presentation of Data
2.1 Primary data and secondary data
Statistics being a body of methods meant for the study of
numerical data, it is obvious that the first step in any statistical enquiry
must be the collection of the relevant numerical data. To study the
growth of steel production in India since 1947, it is necessary to obtain
the actual production figures for all years from 1947 to date. If our
aim is to study the efficacy of a given cure for bronchial asthma,
we must collect data on people suffering from bronchial asthma and
see how many got cured (or had a remarkable degree of relief), and
how many did not, after a course of treatment with the drug.
Now, the data may be of two broad types: primary and secondary.
The ordinary user of economic and social statistics will find that the
data have been already collected by some other agency, government
or private; these may exist either in a published or in an unpublished
form. His job will then be simply to have access to the source and
get hold of the data. Such data will be called secondary data.
Government departments collect data on diverse topics that touch the life
of the people as a matter of routine and as an essential basis of
administration. Private agencies like banks and industrial concerns
regularly compile figures on their assets and liabilities, number of
employees, income of employees, etc. The enquirer may get his
material readymade from such agencies; or he may get the data in
a rough form and adapt them to his needs. In some cases, the enquirer
will find that the relevant data have been collected by some research
organisation as part of an investigation similar to his own.
In making use of secondary data, the enquirer has to be
particularly careful about the nature of the data: their coverage, the
definitions on which they are based and their degree of reliability.
Maybe, he will find that the available data are more extensive than
is required for the purpose of his enquiry. In such a case, he will
naturally discard the part of the data that is redundant. Sometimes
he may as well find that the available information is inadequate for
the purpose of his enquiry. He will then have to decide whether to
collect his own data, either to base his enquiry solely on them or
to plug the lacunæ in the secondary data.

FUNDAMENTALS OF STATISTICS

Data collected primarily for the purpose of the given enquiry are
called primary data. These are collected by the enquirer, either on
his own or through some agency set up for the purpose, directly from
the field of enquiry. It goes without saying that this type of data may
be used with greater confidence, because the enquirer will himself
decide upon the coverage of the data and the definitions to be used
and, as such, will have a measure of control on the reliability of the
data.

2.2 Collection of data


The design of an enquiry and the setting up or modification of
machinery for the collection of data are operations that deserve serious
attention. Careful and detailed planning in the initial stages can lead
to saving in time and money and improvement in accuracy. As
complete a plan as possible should be drawn up before the actual
collection of information begins, specifying what data are to be
obtained, from whom and by what methods. There should also be
full and unambiguous definitions of terms, clear instructions to
investigators and respondents and, maybe, some indications of the
mode of analysis of the results. Although the plan should be complete,
it should not be totally rigid, for some adjustments to the plan will
be inevitable as the collection and analysis of data proceed.
A fundamental question to be considered at the outset is whether
the collection of data should be done by complete enumeration or
by sampling. In the former case, each and every individual of the
group to which the data are to relate is covered, and information
gathered for each individual separately. In the latter, only some
individuals forming a representative part of the group are covered,
either because the group is too large or because the items on which
information is sought are too numerous. Complete enumeration may
lead to greater accuracy and greater refinement in analysis, but it may
be a very expensive and time-consuming operation. A sample designed
and taken with care can produce results that may be sufficiently
accurate for the purpose of the enquiry, and it can save much time
and money. We shall discuss in Volume Two the considerations that
should guide us in choosing between complete enumeration and
sampling and discuss some methods of getting representative samples.
In some cases, a combination of the census and sample methods may
be advisable. Thus frequent sample surveys may be used, as in
demography, to fill the gaps between censuses taken at regular
intervals. Or, some simple questions may be asked of every one, while
more complicated questions may be put only to a proportion (say 5
per cent or 10 per cent) of all respondents in an enquiry.
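
The mixed scheme just described, in which simple questions go to every
one and the more complicated questions to a fixed proportion of the
respondents, may be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the respondent numbers, the 5 per
cent fraction and the helper name draw_subsample are hypothetical, and
simple random sampling without replacement stands in for whatever
sampling design an actual enquiry would adopt.

```python
import random

def draw_subsample(respondents, fraction, seed=0):
    """Pick a simple random subsample (without replacement).

    `fraction` is the proportion receiving the longer set of questions,
    e.g. 0.05 for 5 per cent. The seed is fixed only so that the sketch
    gives the same subsample on every run.
    """
    size = round(len(respondents) * fraction)
    rng = random.Random(seed)
    return set(rng.sample(respondents, size))

# Hypothetical group of 2,000 respondents, identified by serial number.
population = list(range(1, 2001))
long_form = draw_subsample(population, 0.05)

# Every one gets the simple questions; only the subsample gets the rest.
forms = {r: ("long" if r in long_form else "short") for r in population}
print(len(long_form), "of", len(population), "respondents get the long form")
```

Whether such a subsample is representative depends on the sampling
design and not on the code; the sketch shows only the mechanics of
selection.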
Note, again, that the information sought may be gathered, from
the individuals of the whole group (called the population) or from
those of the sample, by one of three methods*: the questionnaire
method, the interviewer method and the method of direct observation.
In economic and social enquiries, information is almost always
collected by having someone to fill up a form or questionnaire. But
a matter to be decided is whether the forms should be completed by
an enumerator or investigator who collects data by asking questions
and noting down answers, or whether these should be left with the
respondent to be filled up on his own. In the questionnaire method,
each informant (or respondent) is provided with a questionnaire,
usually sent by mail with return postage prepaid, and is asked to supply
the information in the form of answers to the questions. Obviously,
this method can be effective only when the informants have attained
a certain level of education. It can work, for instance, when a daily
newspaper decides to conduct an opinion poll among its readers on
some topical issue. The drawback of the method is that the informants
may not evince sufficient interest in the enquiry even if they are
sufficiently enlightened. Consequently, the data may involve a high
percentage of non-response and thus fail to reflect the true state of
the field of enquiry.
In the interviewer method, enumerators go from one informant
to another and elicit the required information. This method is used
in population censuses. Also, it is the method that has to be employed
in case the informants are not all literate or, even if literate, have
not attained the requisite educational level. For instance, if one is
interested in family income and expenditure on different items, one
may arrange to interview the head of each family and collect the
information sought from him. The data collected by this method are
likely to be more accurate, since a tactful investigator may persuade
the informant to supply the required information and the meaning of
each question may be properly explained to him so that the answers
may be correct and to the point.

* In some cases, a combination of two or more of the methods may be
used. The Indian census, e.g., uses the interviewer method for general items
and the questionnaire method for items that concern people with scientific or
technical qualifications.

Whichever of the two methods may be used, the questions and
the accompanying instructions to enumerators and respondents have
to be very carefully designed. It is necessary that each question be
clearly phrased and capable of an unambiguous answer. The instructions
must take into account all possibilities, even remote ones. The way
a question is put may well influence the answer, as those who have
conducted public opinion polls will bear out. A device often used to
advantage is to insert a question meant primarily to check the answers
to other questions. For instance, the relationship of the members of
the household to one another may be used to check the stated age
figures. It is for this reason that forms often include apparently
unnecessary or irrelevant questions. The report of a statistical enquiry
should include the layout of the form used.
In the method of direct observation, the enquirer or his assistants
get the data directly from the field of enquiry without having to depend
on the co-operation of informants. When data are needed on the height
and weight of, say, 200 college students, they will be approached
individually and the height (say in cm.) of each measured with a tape
and the weight (say in kg.) measured with a weighing balance. If data
are needed on the sentence-length of a novel by, say, Bankimchandra,
the enquirer himself will go through the book and note for each
sentence the length, i.e. the number of words contained therein. On
the other hand, if data are required on the incidence of blindness
among a group of people, one will just observe each member of the
group and note whether he or she is or is not blind. The direct method
of data collection may, therefore, involve either measurement or
counting or bare observation.
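
The counting involved in the sentence-length example can be sketched in
Python as follows. The passage is hypothetical, and the rule used to
separate sentences (a break at every full stop, exclamation mark or
question mark) is a crude assumption that ignores abbreviations and
quoted speech; an actual study would need a more careful rule.

```python
import re

def sentence_lengths(text):
    """Return the number of words in each sentence of `text`.

    Sentences are split on '.', '!' and '?'; words are whatever is
    separated by white space. Both rules are simplifying assumptions.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

# Hypothetical passage standing in for the novel under study.
passage = ("The river was calm. A boat drifted slowly towards the bank! "
           "Who was in it?")
print(sentence_lengths(passage))  # prints [4, 7, 4]
```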


2.3 Scrutiny of data
The data collected should be subjected to a thorough scrutiny to
see if they may be considered correct. This is important, for however
excellent the statistical methods of data analysis may be, they cannot
bring out useful, reliable information from faulty, unreliable data. As
we indicated earlier, the scrutiny has to be very thorough indeed in
case the data are of the secondary type.
Certain inaccuracies may be readily detected, e.g. inaccuracies
that arise from the dropping or shifting of a decimal point and some
of those that arise from the substitution of a 1 for a 7 or vice versa,
or of a 6 for a 9 or vice versa. Thus, consider the following set of
figures, which contains two erroneous entries; sheer common sense
enables the reader to detect them.

Stature (in cm.) of 10 college students

140.9   161.2   153.9   172.2   162.9
159.1   147.2   773.5   181.5   1590.0
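
The common-sense check applied to these figures may be expressed as a
simple range test. The sketch below reads each printed figure as a
decimal number of centimetres, which is an assumption about the intended
values, and the limits of 100 cm and 250 cm are likewise assumed for
illustration; they are not part of the original text.

```python
# The ten stature figures, read as decimal values in centimetres.
statures_cm = [140.9, 161.2, 153.9, 172.2, 162.9,
               159.1, 147.2, 773.5, 181.5, 1590.0]

# Assumed plausibility limits for the stature of a college student.
LOW_CM, HIGH_CM = 100.0, 250.0

def implausible(values, low, high):
    """Return (position, value) pairs for entries outside [low, high]."""
    return [(i, v) for i, v in enumerate(values, start=1)
            if not low <= v <= high]

for position, value in implausible(statures_cm, LOW_CM, HIGH_CM):
    print(f"entry {position}: {value} cm cannot be right")
```

Under these assumptions the test picks out the eighth and tenth entries.
It can only flag a figure; deciding what the correct value was (for
instance, whether a 7 has replaced a 1 or a decimal point has shifted)
remains a matter for the enquirer.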

In a second type of situation, there may be figures which, although
not impossible, are very unlikely to be true and should rouse suspicion.
If 3 kilograms of rice is stated to be the daily consumption of a family
of 4, the matter calls for investigation. Similarly, we should be hesitant
in accepting 30 as the age of a son when the father's age is stated
to be 45. If it is found that the monthly income of a single person
in a group of 500 is Rs. 2,000 while that of every one else in the
group is less than Rs. 600, one should take the first income figure
to be suspect. One should make enquiries to see if there is anything
special about the work done by the person to make his income so
high compared to those of the others in the group.
Sometimes the year or the month to which a figure relates may
be stated wrongly. Sometimes again, a figure may be given
corresponding to February 30 or April 31 or, as in the case of a
manufacturing concern's production statistics, corresponding to a day
which was not a working day. Such mistakes may be readily corrected;
only at times the enquirer may have to refer to the agency supplying
the data to get the right year, month or day.
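
Impossible dates of this kind are easily caught by machine. The sketch
below relies on the fact that Python's datetime.date refuses to construct
a date that does not exist. The year 1991 and the rule that every day
except Sunday is a working day are assumptions made purely for
illustration; a real concern would supply its own calendar of holidays.

```python
from datetime import date

def is_valid_date(year, month, day):
    """Return True if the calendar date exists, False otherwise."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False

def is_working_day(year, month, day):
    """Assumed rule: Monday to Saturday are working days, Sunday is not."""
    if not is_valid_date(year, month, day):
        return False
    return date(year, month, day).weekday() != 6  # 6 stands for Sunday

# Hypothetical dates attached to production figures.
for y, m, d in [(1991, 2, 30), (1991, 4, 31), (1991, 4, 30), (1991, 1, 6)]:
    print((y, m, d), "valid:", is_valid_date(y, m, d),
          "working day:", is_working_day(y, m, d))
```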
In certain situations, one may have data consisting of two or more
related series of figures. Here the data may be scrutinised by
comparing the series for internal consistency. For the data in each
series, taken separately, may look all right; but the figures in the
different series may be incompatible, thus pointing to the presence
of inaccuracies in the data. To take a simple example, for a number
of people the age figures as well as the birth dates may be available.
These should then be compared for consistency. To take another, when
the scores of students appearing in an examination are available, the
individual subject scores should be compared to the aggregate score.
When we are given, for each of a number of families, the total income,
the total expenditure and the total savings over a certain period, we
should see if the relation
savings = income - expenditure

holds for each family. When statistics of rice production are given,
three series of figures may be available, viz. a series for total area
sown (or harvested) in acres, one for total yield in quintals and one
for yield-rate in quintals per acre, the different figures in a series
corresponding, maybe, to different villages or districts. The compatibility
or otherwise of the three series may then be judged by verifying the
relation

$$\text{yield rate} = \frac{\text{total yield}}{\text{total area}}$$

Again, if figures for the price of a commodity during two different
periods are given, together with the percentage increase in price in
the later period over the earlier, the data may be checked by seeing
if the relation

$$\text{p.c. increase} = \frac{100 \times (\text{price in period II} - \text{price in period I})}{\text{price in period I}}$$

is satisfied.
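
The three consistency relations discussed in this section lend themselves
to a mechanical check. The following Python sketch is a minimal
illustration: the function names, the sample figures and the tolerances
(which allow for rounding in the published figures) are all assumptions,
not part of the original text.

```python
import math

def savings_consistent(income, expenditure, savings, tol=0.5):
    """Check savings = income - expenditure, within a rounding tolerance."""
    return math.isclose(savings, income - expenditure, abs_tol=tol)

def yield_rate_consistent(total_yield, total_area, yield_rate, tol=0.05):
    """Check yield rate = total yield / total area."""
    return math.isclose(yield_rate, total_yield / total_area, abs_tol=tol)

def pc_increase_consistent(price_1, price_2, pc_increase, tol=0.05):
    """Check p.c. increase = 100 * (price II - price I) / price I."""
    expected = 100 * (price_2 - price_1) / price_1
    return math.isclose(pc_increase, expected, abs_tol=tol)

# Hypothetical family records: (income, expenditure, savings) in rupees.
families = [(1200, 950, 250), (1500, 1100, 300)]
for income, expenditure, savings in families:
    print(income, expenditure, savings,
          savings_consistent(income, expenditure, savings))

# Hypothetical village: 840 quintals from 120 acres, stated rate 7.0.
print(yield_rate_consistent(840, 120, 7.0))

# Hypothetical prices: Rs. 40 rising to Rs. 46, stated increase 15 p.c.
print(pc_increase_consistent(40, 46, 15.0))
```

A failed check shows only that the series are incompatible; it does not
say which of the figures involved is the faulty one.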
It should, however, be obvious that no hard and fast rules may
be laid down for the scrutiny of data. The enquirer must use his
common sense, judgment and whatever knowledge he may have about
the field of enquiry to assess the reliability of the data. In this context,
Mahalanobis's paper [5] will be found illuminating.
It will be found, as the reader makes progress with the study of
statistical methods, that certain statistical tools may be used to check
the accuracy of figures: not the raw data of statistics, but figures
derived from them according to some statistical concepts.

2.4 Frequency data and non-frequency data

Once the collected data are scrutinised and the errors therein are
removed, one has to put them into a systematic form so as to bring
into focus their salient features. Various modes of presentation may
be suggested, depending on the way one would like to look at the
data.
First, consider the case where the values of one or more variables
(e.g. population, foodgrain production, petroleum price, steel exports,
etc.) are given for different points or periods of time. Generally in
such a case, one will be interested in the relationship between time
and the variable (or variables). For instance, one will like to know
the way foodgrain production changes over time. Data of this kind
are called time-series data or historical data. Or it may be that values
of one or more variables are given for different individuals in a group
(e.g. individual countries or individual States or individual firms) for
the same point or period of time. But instead of considering the
characteristics of the group as such, we study the changes in the value
(values) of the variable from country to country, from State to State
or from firm to firm. Data of this kind are called spatial-series data.
What is important is that the identity of the individual values has to
be kept in view, and not ignored, in the statistical study of both kinds
of data. Taken jointly, these two kinds constitute the class of non-
frequency data.
Second, consider the case where we still have data on one or
more variables for different individuals (maybe even for different
points or periods of time, for different regions of a country or for
different countries) but the identity of the individuals is unimportant
and can be ignored. For now we are interested in the characteristics
of the group formed by the individuals rather than in those of the
individuals themselves. In studying the intelligence quotients (IQs) of
15-year-olds in Delhi, for instance, we may be interested in such group
characteristics as the percentage of 15-year-olds with IQ higher than
130, the percentage of those with IQ between 110 and 130, or the
average IQ of a 15-year-old in Delhi, or the lowest IQ and highest
IQ for 15-year-olds in the city. Once the IQ data for all 15-year-olds
in the city are obtained, one may in this case totally forget which
figure relates to which particular individual in the group. This class
of data is called frequency data, for here we are interested simply
in knowing how frequently each of the different values of a variable
occurs in the set of data.
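
The group characteristics mentioned in the IQ example can be computed
without any reference to who obtained which score. The Python sketch
below uses ten hypothetical IQ figures; the helper name percentage and
the decision to treat "between 110 and 130" as including both end values
are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical IQ figures; no names or serial numbers are kept.
iq_scores = [98, 104, 112, 131, 87, 120, 104, 135, 110, 95]

def percentage(values, condition):
    """Percentage of the values that satisfy `condition`."""
    return 100 * sum(1 for v in values if condition(v)) / len(values)

above_130 = percentage(iq_scores, lambda v: v > 130)
from_110_to_130 = percentage(iq_scores, lambda v: 110 <= v <= 130)
average_iq = mean(iq_scores)
lowest, highest = min(iq_scores), max(iq_scores)

# How frequently each value occurs: the sense of "frequency data".
frequencies = Counter(iq_scores)

print(above_130, from_110_to_130, average_iq, lowest, highest)
print(frequencies[104], "individuals have an IQ of 104")
```

Rearranging the figures in any order leaves every one of these results
unchanged, which is exactly why the identity of the individuals may be
ignored here.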
As the reader will find for himself, most of the discussion in
the book will be devoted to the treatment of frequency data. But in
the following sections of the present chapter, we shall deal with the
common methods of presentation of non-frequency data. It will be
seen in the sequel that the advanced modes of treatment of non-frequency
data, especially of time-series data, are themselves
modifications and extensions of techniques devised primarily for
frequency data.

2.5 Textual and tabular presentation of data

One of the common methods of presenting numerical data is to
use paragraphs of text. Most official agencies use this method, and
we illustrate it here by means of an excerpt from the "Basic Statistics
relating to Indian Economy, August 1992" issued by the Centre for
Monitoring Indian Economy regarding land utilisation statistics of
India in 1987-88:
"The total geographical area of India is about 329 million hectares
but statistical information regarding land utilisation is available for
only about 305 million hectares. No information is available for 24
million hectares or 7% of the total geographical area.
"41 million hectares or 13% of the total reporting area of India
is classified as barren land such as mountains, deserts etc. and area
under non-agricultural uses, that is land occupied by buildings, roads,
railways, rivers, canals etc. and other lands put to uses other than
agricultural.
"67 million hectares of land or 20% of total geographical area
is under forests. Permanent pastures and other grazing lands include
12 million hectares or barely 4% of the total area is classified as
permanent pastures. Culturable waste lands include all lands under
miscellaneous tree crops which are available for cultivation but not
cultivated during previous 5 or more years. Such lands, called
culturable waste lands, account for 19 million hectares or 6% of the
total land area.
"Fallow lands accounting for 30 million hectares or 9% of the
total area are either current fallows or other fallow lands. ...

... presentation is that here errors and omissions, if any, may be readily
detected, which is not the case with the textual method.

The different parts of a table are:

(i) Title: A title giving a brief description of the contents
should always form part of the table. Usually it is placed at the head
of the table, together with an identifying number that may be used
for future reference.

(ii) Stub: The extreme left part of the table, which is meant
to describe the nature of the rows, is called the stub of the table.

(iii) Caption: The upper part of the table which gives a
description of the various columns is the caption of the table. The
caption may include a mention of the units of measurement for the
data of each column and also column numbers, like (1), (2), etc., that
may be quoted in any future reference. The title, stub and caption,
taken together, are said to form the box head of the table.

(iv) Body: The body is the principal part of the table, where
all the relevant figures are exhibited.

(v) Footnotes: Most tables also have footnotes indicating the
sources of the data, which may also include explanations regarding
the scope and reliability of particular items. A footnote to a table

showing the census population of India at some consecutive counts
may, for instance, show that the figure for a particular year is
provisional or only a rough estimate, while the figure for another year
relates to undivided India.

TABLE 2.1
LAND UTILISATION PATTERN IN INDIA, 1987-88

Item                                   Area                 Percentage
                                       (million hectares)
1. Total geographical area             329                  100
2. Total reporting area                305                  93
3. Barren land and land put to
   non-agricultural uses               41                   13
4. Area under forests                  67                   20
5. Permanent pastures and
   grazing lands                       12                   4
6. Culturable waste lands              19                   6
7. Fallow lands                        30                   9
8. Net area sown                       136                  41

Source: Basic Statistics Relating to Indian Economy, August 1992, Centre for
Monitoring Indian Economy.

While no hard and fast rules may be laid down, some broad
guidelines may be given for drawing up a table. First, the table should
be well balanced in length and breadth. In case a single table may
become too long or too wide, one should consider whether the data
may be divided into two or more tables. Second, as in textual presentation,
here too the arrangement of items should follow a logical
sequence. For instance, when time-series data are to be presented, they
should be arranged chronologically. Third, in order to facilitate
comparison between important items, the figures to be compared
should be placed as close to each other as possible. It may also be
pointed out that column-wise comparisons are more easily made than
row-wise comparisons.
To illustrate the above points, the data presented earlier in a
textual form are also shown in Table 2.1.
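
How the parts of a table fit together may also be seen from a short
Python sketch that assembles the land utilisation figures into title,
caption, stub, body and footnote. The layout produced is only one of many
possible, the function name render_table is hypothetical, and the
percentages for pastures (4) and culturable waste lands (6) are taken
from the textual excerpt. The sketch also performs one check of internal
consistency: the areas of items 3 to 8 should add up to the total
reporting area.

```python
TITLE = "TABLE 2.1  LAND UTILISATION PATTERN IN INDIA, 1987-88"
CAPTION = ("Item", "Area (million hectares)", "Percentage")
ROWS = [  # (stub entry, area in million hectares, percentage)
    ("1. Total geographical area", 329, 100),
    ("2. Total reporting area", 305, 93),
    ("3. Barren land and land put to non-agricultural uses", 41, 13),
    ("4. Area under forests", 67, 20),
    ("5. Permanent pastures and grazing lands", 12, 4),
    ("6. Culturable waste lands", 19, 6),
    ("7. Fallow lands", 30, 9),
    ("8. Net area sown", 136, 41),
]
FOOTNOTE = ("Source: Basic Statistics Relating to Indian Economy, "
            "August 1992, Centre for Monitoring Indian Economy.")

def render_table(title, caption, rows, footnote):
    """Lay out title, caption, stub and body, and footnote as plain text."""
    stub_width = max(len(caption[0]), *(len(r[0]) for r in rows))
    lines = [title,
             f"{caption[0]:<{stub_width}}  {caption[1]:>23}  {caption[2]:>10}"]
    for stub, area, pct in rows:
        lines.append(f"{stub:<{stub_width}}  {area:>23}  {pct:>10}")
    lines.append(footnote)
    return "\n".join(lines)

# Scrutiny: items 3 to 8 should account for the whole reporting area.
component_total = sum(area for _, area, _ in ROWS[2:])

print(render_table(TITLE, CAPTION, ROWS, FOOTNOTE))
print("Items 3 to 8 add up to the reporting area:",
      component_total == ROWS[1][1])
```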

2.6 Diagrammatic representation of data


Representation of statistical data by diagrams (graphs, charts or
pictures) is more effective than tabular representation, being easily
intelligible to a layman. Indeed, diagrams are almost essential
whenever it is required to convey any statistical information to the
general public. It must be stated, however, that information on a limited
number of topics only can be presented in a single diagram so as
to maintain its neatness. Moreover, a diagram can give only a rough
idea about the magnitude of variation, whereas in a table the exact
values may be quoted.
The more important types of diagram which are used in statistical
work are being described below.
Line diagrams: Consider the data of Table 2.2, which show
how the value of exports in India (including re-exports) has been
changing over time. A very convenient method of representing such
data is to use a line diagram.
We take the year along the horizontal axis and the value of exports
along the vertical. The values for the ten years give ten points on
