Chapter 10 Data Processing
Learning Objectives:
By the end of the chapter, you should be able to:
1. Understand the processing of the data collected before the data analysis.
2. Understand and carry out the checking and editing of the primary data as well as be able to carry
out the necessary fieldwork required.
3. Code both the structured and unstructured questionnaires following certain guidelines.
4. Carry out the tabulation and entry of data in the required format.
5. Carry out preliminary statistical preparation of data.
scores' el:ryonei 'I am surprised the schools in Bengalrseem to be teachingvery well and
Ylre-lolt
very well. The NGos (non-goven,mentar organizations) ur. joing o.o**.nJubi"i="t
students have done -""'
ui, rr"rorgl"
'Hitler chakra'(Sanjeev,!L.r$1ni,1ii was really-happy and ordered coffee and siunosas "J*;;a'
to celebrate a job well
done and magnanimousry tord is,' Forks you may r;ke th;'week"rJ;fi,
average scores across the classes? Sg, Charu showed us tJre figures ";;;i;at*r;, o"^;iv", ilv
,;";;r;;
on the OHP connected to her laptop. First
the Bangla score, then Maths and science and then she came to-Englistr.
*;,;;
Tiie figui[;;"i;t;;"lirt*tlSryi
urtfr.[ *r,
no score less than 78.8 per cent. Tl,en caine the bombshell with Engiistr--Sth gide
had an averag.
lr.rl: cenr.and,the18lll"g 103.4 per cent. we all sat:upright airl trrere wi 'r -----'--'
"rdu.fp*..;; ;,n,
H;;dilffi;;;
in a 100-mark paper be 103.4 per cent?' "p,;[6;1;;;:.
'Clf::"ll*,ll-o* us the column of the final grades of the shrdents.'And, guess what, rhere were srudents with
overall 150, 120, 135 and even 204. Emergency was declared. All samosas anjcoffee
wentto ttre ou;iil;;;J;ii
'But what had happened, how can someone make so many errors in a data entry?' asked Saraswati.
'Errors in one entry? No, when we opened the data files, it was like a can of worms,
there was not a single sheet
without error,An{ in most subjects a good manv sruferit-;ii;'g-ilti;s i"iil:o*, iiiiiiii:riio*i*"r;r: ,""""'
'Laila, the new intern, suddenly had a brainwave and said that we should
look at the way scoring had been done in
Now, this suggestion was dangerous as alttrre cooingtb;;;;;;n.iu.*
un:::tscripts.
ll.
himself. Anyway, so we were
i"1;;ffiH;;;
" - Y'rs^uq
told to examine a iew scripls,atmridom,f,hdttt.d;tw il4p#;;f.,'1"f
'There were 5- and 8-mark questions. If a person got most of the 5-mark question
right, he was to be given a score of 4. The teacher had followed the instructions but
had marked it as 8 and for an 8-mark question where she was supposed
to give a 7, she had marked it 9.'
'Hey, do not confuse me Sana. Is this a riddle or a mystery? Please explain.'
'Look,' said Sana, 'The teacher marked four and seven only but the numerals
she wrote were in Bangla, where four is written as 8 and seven is written as 9.
Now, at our end, when we entered the data we entered 8 and 9, which is more
than the maximum score for the question. And obviously, the ultimate
result was a 100+ score.'
'So, we as a team cross-checked all the scores on the Excel sheets and wherever an
8 or a 9 was found we went back to the answer script and manually corrected each
entry. The final scores, when we examined them across groups and classes, were
dismal and, as expected, were mostly below 50 per cent across all the subjects.'
Saraswati is right, because a freak error in entering the data could have had major
repercussions in the outcome of the study and the subsequent conclusions. The
critical job of the researcher begins after the data has been collected. He has to use
this information to assess whether he had been correct or incorrect while making
certain assumptions in the form of the hypotheses at the beginning of the study. The
raw data that has been collected must be refined and structured in such a format
that it can lend itself to statistical enquiry. This process of preparing the data for
analysis is structured and sequential (Figure 10.1).
The process starts by validating the measuring instrument, which could be a
questionnaire or any other qualitative technique, as discussed in an earlier chapter. This is
followed by editing, coding, classifying and tabulating the obtained data. Sometimes,
it might be essential to carry out some statistical modification of the data in order to
be able to increase its generalizability to the population under study. This is critical
FIGURE 10.1
The data-preparation process
FIELDWORK VALIDATION
The first step in the processing begins after the questionnaire or primary data
survey.
The researcher needs to validate the fieldwork to check whether the execution of
the study was handled properly. Thus, he must meticulously go over all the raw data
forms and check them for errors and find out whether in the conducted interviews
or schedules a standardized set of instructions and reporting was followed or not.
As we stated earlier in Chapter 8, considerable validation is done at the pilot testing
stage of the questionnaire formation. The validation becomes
more important in the following cases:
• In case the form had been translated into another language, expert analysis is needed to see
whether the meaning of the questions in the two measures is the same or not. The
second validation is done by measuring the reliability index of the original and the
translated form.
• The second case could be that the questionnaire survey has to be done at multiple
locations and one has outsourced it to an outside research agency. In this case, it
might be essential to carry out checks during the fieldwork as well to ensure that the
process being followed is correct. As there is both a time and a cost element
involved here, in case the investigators are erring it needs to be corrected immediately.
After the survey there might be instances when the survey questionnaire cannot
be used for analysis, for multiple causes. It might be that:
• The answers that have been obtained do not match the question instructions that were
given, such as qualifying instructions like, 'in case the answer is ______ please answer
the given set of questions, else go to question ______'.
• The respondent seems to have used the same response category for all the
questions; for example, there is a tendency on a five-point scale to give 3 as the
answer for all questions.
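The second check above, a respondent using the same response category throughout, can be screened for mechanically before a form is set aside. The sketch below is only an illustration: the respondent IDs, the answers and the helper name `is_straight_lined` are hypothetical, and `None` is assumed to mark an unanswered item.

```python
def is_straight_lined(responses):
    """Return True when every answered item carries the same scale value."""
    answered = [r for r in responses if r is not None]
    # A single answered item cannot show a pattern, so require at least two.
    return len(answered) > 1 and len(set(answered)) == 1


# Hypothetical five-point-scale answers keyed by respondent ID.
forms = {
    101: [3, 3, 3, 3, 3, 3],
    102: [4, 2, 5, 3, 1, 4],
    103: [3, 3, None, 3, 3, 3],
}

# Collect the forms that need a closer look by the editor.
flagged = [rid for rid, answers in forms.items() if is_straight_lined(answers)]
print(flagged)
```

A flagged form is only a candidate for rejection; whether it is discarded remains the researcher's judgement.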
DATA EDITING
Once the validation has been completed, the next step is the editing of
the raw data obtained. In this stage, all detectable errors and omissions have been
examined and the necessary actions have been taken. While carrying out the editing,
the researcher needs to ensure that:
• The data obtained is complete in all respects.
• It is accurate in terms of information recorded and responses sought.
• Questionnaires are legible and are correctly deciphered, especially the
open-ended questions.
Field Editing
Usually, the preliminary editing of the information obtained is done by the field
investigators or supervisors. It is advisable that at the end of every field day the
investigator(s) review the filled forms for any inconsistencies, non-response,
illegible responses or incomplete questionnaires. This is to ensure that the fallacies
found can be corrected immediately, as they are fresh in the investigator's mind
and also because the recall would be better. Also, in case the investigator needs to
contact the respondent who filled in the form, the clarifications required would be
much easier.
The other advantage is that regular field editing ensures that one may also
be able to check if the interviewer or the surveyor is able to handle the process of
instructions and probing correctly or not. It might also happen that certain terms
or abbreviations have been used in the instrument on which the investigator is
not clear and could misinterpret the instructions. This most often happens with
branching and skip questions. Thus, the process ensures that the researcher can
advise and train the investigator on how to administer the questionnaire correctly.
This, however, is only possible in case of a face-to-face interaction and not in the
mailed surveys.
Some researchers, in order to ensure the authenticity of the data obtained,
sometimes carry out random interviews with the same respondents to cross-check
whether the administration process was accurate.
too small as compared to what existed in the
actual marketplace.
CODING
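[Data matrix extract: one row per respondent; columns are respondent ID, occupation code, vehicle type, average kilometres travelled per day, marital status and number of family members.]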
Here, the data matrix reveals that each field is denoted on the column head and
each case record is to be read along the row. The data in the first column represents
the unique identification given to a particular respondent (also marked on his/her
questionnaire). The second column has data entered on the basis of a predetermined
coding scheme where every occupation is given a numeral value (for example, 1
stands for government service and 5 stands for student and so on). Column 3 has
1 representing a motorcycle and 2 representing a scooter. The next value is of the
average number of kilometres a person travels per day.
This is followed by the marital status, with 1 signifying unmarried and 2 married.
The last column is again a ratio scale data with the number of family members.
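To make the field and record idea concrete, the following sketch holds a data matrix as rows of coded values and decodes one record back into words. The field names and the two rows are made up for illustration; only the code meanings quoted in the text (1 = government service, 5 = student, 1 = motorcycle, 2 = scooter, 1 = unmarried, 2 = married) come from the chapter.

```python
# Field names are assumptions that mirror the column description in the text.
FIELDS = ["respondent_id", "occupation", "vehicle",
          "km_per_day", "marital_status", "family_members"]

# Code meanings quoted in the text; occupation codes 2 to 4 are not given there.
OCCUPATION = {1: "government service", 5: "student"}
VEHICLE = {1: "motorcycle", 2: "scooter"}
MARITAL = {1: "unmarried", 2: "married"}

# Hypothetical rows: each tuple is one case record, read along the row.
rows = [
    (1, 1, 1, 20, 2, 4),
    (2, 5, 2, 35, 1, 6),
]
records = [dict(zip(FIELDS, row)) for row in rows]


def describe(record):
    """Translate the coded fields of one record into their labels."""
    return {
        "respondent_id": record["respondent_id"],
        "occupation": OCCUPATION.get(record["occupation"], "other"),
        "vehicle": VEHICLE[record["vehicle"]],
        "km_per_day": record["km_per_day"],
        "marital_status": MARITAL[record["marital_status"]],
        "family_members": record["family_members"],
    }


print(describe(records[1]))
```

Kilometres per day and family members are left as entered because they are ratio scale values, not codes.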
The researcher can enter the data on the spreadsheet of the software package he/she
is using for the analysis. However, in case the data is being entered by the field
investigator or someone not acquainted with the software package, one can also use
a spreadsheet programme such as EXCEL to enter the data as most software have the
provision of importing data from an EXCEL spreadsheet.
Codebook formulation: In order to simplify and effectively manage the data entry
process, it is essential to prepare a schema in advance for entering the records in the
spreadsheet. This formal standardization or the coding scheme for all the variables
under study is called a codebook. Generally, while designing the rules, care must be
taken to decide on some categories that are:
• Appropriate to the research objective: For example, in the two-wheeler study when
the study was to be conducted on people in socio-economic classification (SEC)
A and B, then the occupation and education categories had to be comparable to
the ones established in the classification. Secondly, if the comparison is to be done
amongst people in different age groups then the age-class intervals (discussed
later in the chapter) should be representative of the comparison to be carried out.
• Comprehensive: As far as possible, options should be given to the respondent
in the closed-ended questions as probable response categories. This can be
ensured by a thorough exploratory study and later on, after the conduction of the
pilot study, which might result in discovering other responses in the 'any other
______' option. These, then, can be written as independent response options in the
final questionnaire.
• Mutually exclusive: The categories and codes devised must be exclusive or clearly
different from each other. This will be further discussed in the classification rules
that the person should employ.
• Single variable entry: The response that is being entered and the code for it should
indicate only a single variable. For example, a 'working single mother' might seem
an apparently simple category which one could code as 'occupation'. However, it
needs three columns: occupation, marital status and family life cycle. So, one
needs to have three different codes to enter this information.
Based on the above rules, one creates a code book that can be effectively used
by the coders. This would generally contain information on the question number,
variable name, response descriptors and coding instructions and the column
descriptor. Table 10.2 gives an extract from a questionnaire designed to measure
the consumer buying behaviour for the ready-to-eat food products. The coding
instructions for the qualifying and the demographic variables are presented here.
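A code book of this kind can also be kept in machine-readable form so that every keyed-in value is checked against the permitted codes. The sketch below is an assumption-laden illustration: it uses the age and gender codes from Table 10.2, the symbol X22 for age is assumed by analogy with X23, and the helper `invalid_entries` is hypothetical.

```python
# Codebook entries keyed by column symbol. X23 appears in Table 10.2;
# X22 for the age question is an assumed symbol.
CODEBOOK = {
    "X22": {
        "question": 22,
        "variable": "Age",
        "codes": {
            1: "Less than 20 years",
            2: "21 to 26 years",
            3: "27 to 35 years",
            4: "36 to 45 years",
            5: "More than 45 years",
        },
    },
    "X23": {
        "question": 23,
        "variable": "Gender",
        "codes": {1: "Male", 2: "Female"},
    },
}


def invalid_entries(record):
    """Return the symbols whose entered value is not a code in the codebook."""
    return [symbol for symbol, value in record.items()
            if symbol in CODEBOOK and value not in CODEBOOK[symbol]["codes"]]


print(invalid_entries({"X22": 3, "X23": 2}))  # every value is a listed code
print(invalid_entries({"X22": 8, "X23": 1}))  # 8 is not a permitted age code
```

A check of this sort would flag an out-of-range entry at the time of data entry, before it distorts the totals.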
As we have read in the earlier chapter, a questionnaire can have both closed-ended
and open-ended questions. The process of coding the two kinds of questions is
very different and requires a detailed discussion. When the questions are structured
and the response categories are prescribed then one does what is called pre-coding,
i.e., designating numeral codes to the designed responses before administration.
However, if the questions are structured and the answers are open ended and not
determined in advance, one needs to decide on the codes after the administration
of the survey. This is called post coding and requires skilled interpretation and
categorization of the responses into homogeneous grouped response categories and
then these are assigned a numeric code.
TABLE 10.2
Codebook extract for ready-to-eat food study

Q. no.  Variable name                   Coding instruction         Symbol used for variable name
        ______                          Yes = 1, No = 0
        Use ready-to-eat food products  Yes = 1, No = 0
22      Age                             Less than 20 years = 1,
                                        21 to 26 years = 2,
                                        27 to 35 years = 3,
                                        36 to 45 years = 4,
                                        More than 45 years = 5
23      Gender                          Male = 1, Female = 2       X23
        No. of children                 Exact no. to be written    X25
        Family size                     One to two = 1,            X26
                                        Three to five = 2,
        Occupation
For this question, the number of columns required is seven, one for each
newspaper. The coding instructions for each column would be as follows: in case
the person ticks on a name, the paper = 1, and in case he does not tick,
the paper = 0.
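The seven-column rule can be sketched as follows. The newspaper names are placeholders, since the original list is not reproduced here, and the helper name `code_multiple_response` is hypothetical.

```python
# Placeholder names standing in for the seven newspapers of the question.
NEWSPAPERS = ["Paper A", "Paper B", "Paper C", "Paper D",
              "Paper E", "Paper F", "Paper G"]


def code_multiple_response(ticked):
    """Return one 0/1 entry per newspaper column: 1 if ticked, else 0."""
    unknown = set(ticked) - set(NEWSPAPERS)
    if unknown:
        raise ValueError(f"not on the questionnaire: {sorted(unknown)}")
    return [1 if paper in ticked else 0 for paper in NEWSPAPERS]


# A respondent who ticked two of the seven newspapers.
row = code_multiple_response({"Paper B", "Paper F"})
print(row)
```

Each respondent therefore always contributes seven entries for this question, however many names were ticked.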
Scaled questions: For questions that are on a scale, usually an interval scale, the
question/statement will have a single column and the coding instruction would
indicate numerical assignment, i.e., what number needs to be allocated for the
response options given in the scale. Consider the following question from Chapter 8.
Please indicate level of your agreement with the following statements.
The coding instructions for comparative scales would be slightly different. Consider
the following comparative question:
Please rate Domino's and other pizza restaurants you frequent on the
basis of your satisfaction level on an 11-point scale, based upon the following
parameters: (1 = Extremely poor, 6 = Average, 11 = Extremely good). Circle your
response.
[Rating grid: each parameter, for example quality of packaging, is followed by the scale points 1 to 11 for the respondent to circle.]
Q. no.  Variable name                                              Coding instruction  Symbol used
63      Improvement at work place by eliminating waste             Yes = 1, No = 0     X63a
64      To meet increasing demands of customers                    Yes = 1, No = 0     X63b
65      To improve quality                                         Yes = 1, No = 0     X63c
66      To achieve corporate goal                                  Yes = 1, No = 0     X63d
67      It reduces cycle time of the manufacturing and production  Yes = 1, No = 0     X63e
68      Reduced response time                                      Yes = 1, No = 0     X63f
69      Enhanced ______ and creativity                             Yes = 1, No = 0     X63g
When deciding on the codes, at times, it may be essential to use a code even when
no one has mentioned it. Here, it may be critical as one of the hypothesized
parameters has been negated. For example, for a question:
Why do you eat organic food products?
compute their family life cycle stage. Similarly,
as stated earlier, the socio-economic classification of a person could be identified
upon the basis of his education and occupation.
Another respecification the researcher might carry out is collapsing the response
categories. For example, suppose the original variable was plastic bag usage with 10
response categories. These might be collapsed into four categories: heavy, medium,
light, and non-user. Other respecification of variables includes square root and log
transformations, which are often applied to improve the fit of the model being estimated.
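Both kinds of respecification can be sketched in a few lines. The cut-offs used to collapse the ten usage codes are purely assumed for illustration (the text does not state them), as is the convention that code 1 means no usage; the values being transformed are made up.

```python
import math


def collapse_usage(code):
    """Collapse a 1-10 usage code into four groups (cut-offs are assumed)."""
    if not 1 <= code <= 10:
        raise ValueError("usage code must lie between 1 and 10")
    if code == 1:
        return "non-user"
    if code <= 4:
        return "light"
    if code <= 7:
        return "medium"
    return "heavy"


# Hypothetical ratio scale values and their two common transformations.
values = [1.0, 4.0, 9.0, 100.0]
sqrt_values = [math.sqrt(v) for v in values]
log_values = [math.log10(v) for v in values]

print([collapse_usage(c) for c in (1, 3, 6, 10)])
print(sqrt_values)
print(log_values)
```

Note how the transformations pull in the large value of 100 relative to the others, which is why they are used when a few extreme values dominate.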
Another classification technique discussed in an earlier chapter on
measurement and scaling and in the coding section here refers to the use of dummy
variables for respecifying the categorical variables. Dummy variables are also called
binary, dichotomous, instrumental, or qualitative variables. They are variables that
may take on only two values, such as 0 or 1.
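A minimal sketch of dummy coding follows. It assumes the common convention of treating the first category as the reference group, which therefore gets no column of its own; that convention and the helper name `dummy_code` are assumptions, not something the text prescribes.

```python
def dummy_code(value, categories):
    """Return 0/1 indicators, one per category except the first (reference)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"is_{c}": int(value == c) for c in categories[1:]}


# The four collapsed usage groups from the text; "non-user" is the reference.
USAGE = ["non-user", "light", "medium", "heavy"]

print(dummy_code("medium", USAGE))
print(dummy_code("non-user", USAGE))
```

A respondent in the reference group is identified by all indicators being 0.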
Classification by class intervals: Numerical data, like the ratio scale data, can be
classified into class intervals. This is to assist the quantitative analysis of data. For
example, the age data obtained from the sample could be reduced to homogeneous
grouped data, for example all those below 25 form one group, those 25-35 are another
group and so on. Thus, each group will have class limits: an upper and a lower limit.
The difference between the limits is termed as the class magnitude. One can have
class intervals of both equal and unequal magnitude.
The decision on how many classes and whether equal or unequal depends upon
the judgement of the researcher. Generally, multiples of 2 or 5 are preferred. Some
researchers adopt the following formula for determining the size of the class interval:
i = R/(1 + 3.3 log N)
where,
i = Size of class interval,
R = Range (i.e., difference between the values of the largest item and smallest
item among the given items),
N = Number of items to be grouped.
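As a worked illustration of the formula, the sketch below computes i for a made-up set of 100 ages ranging from 20 to 70 and then rounds it up to a multiple of 5. The logarithm is taken to base 10, and both the data and the rounding step are assumptions added for the example.

```python
import math


def class_interval_size(values):
    """Size of class interval: i = R / (1 + 3.3 log N), log taken to base 10."""
    value_range = max(values) - min(values)   # R
    count = len(values)                       # N
    return value_range / (1 + 3.3 * math.log10(count))


# Hypothetical ages: 100 items spanning 20 to 70, so R = 50 and N = 100.
ages = [20 + (k % 51) for k in range(100)]

size = class_interval_size(ages)              # 50 / 7.6, roughly 6.58
convenient = math.ceil(size / 5) * 5          # rounded up to a multiple of 5

print(round(size, 2), convenient)
```

With R = 50 and N = 100 the denominator is 1 + 3.3 × 2 = 7.6, giving an interval size of about 6.58, which a researcher might round to 10 or 5 depending on how fine a grouping is wanted.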
The class intervals that are decided upon could be exclusive, for example:
10-15
15-20
20-25
25-30
In this case, the upper limit of each is excluded from the category. Thus we read
the first interval above as 10 and under 15, the next one as 15 and under 20 and so on.
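The exclusive rule, where a value equal to an upper limit falls into the next class, can be sketched with the class limits listed above. The helper name `exclusive_class` is hypothetical.

```python
from bisect import bisect_right

# Class limits from the example: 10-15, 15-20, 20-25, 25-30.
LIMITS = [10, 15, 20, 25, 30]


def exclusive_class(value):
    """Return the exclusive class label: lower limit included, upper excluded."""
    if not LIMITS[0] <= value < LIMITS[-1]:
        raise ValueError(f"{value} lies outside the class limits")
    index = bisect_right(LIMITS, value) - 1
    return f"{LIMITS[index]}-{LIMITS[index + 1]}"


for v in (10, 14.9, 15, 29.9):
    print(v, exclusive_class(v))
```

The value 15 is placed in 15-20 and not in 10-15, which is exactly the '10 and under 15' reading.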
a statistical analysis. Usually, this is an
orderly arrangement of the rows and columns.
In case there is data to be entered for one variable, the process is a simple tabulation
and, when it is two or more variables, then one carries out a cross-tabulation of data.
This can be done manually or with the help of a computer.
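The difference between a simple tabulation and a cross-tabulation can be sketched with the standard library alone. The six records below are made up; each pairs a respondent's gender with the vehicle owned.

```python
from collections import Counter

# Hypothetical coded records: (gender, vehicle) for six respondents.
records = [
    ("male", "motorcycle"),
    ("male", "scooter"),
    ("female", "scooter"),
    ("female", "scooter"),
    ("male", "motorcycle"),
    ("female", "motorcycle"),
]

# Simple tabulation: frequency of one variable.
by_gender = Counter(gender for gender, _ in records)

# Cross-tabulation: frequency of each combination of two variables.
cross = Counter(records)

print(dict(by_gender))
for gender in ("male", "female"):
    print(gender, [cross[(gender, v)] for v in ("motorcycle", "scooter")])
```

The row totals of the cross-tabulation reproduce the simple tabulation, which is a useful check when tabulating by hand.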
Thus, a quick visual representation of the largest and the smallest group can be
obtained by constructing a pie chart of the same (Figure 10.2).
[Histogram: Mean = 18.3553, Standard deviation = 6.55777, N = 15]
It shows the pattern of responses in each interval and yet can maintain the rank
order for a quick approximation of the median or quartile. Each row or line is called
a stem and each value on the line is a leaf. The same data that we represented on the
histogram can also be depicted on a stem and leaf display as follows:
13 | 1339
15 | 668
16 | 36
17 | 33
18 | ?
?? | ?
31 | 0
35 | 6
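A stem and leaf display of this kind can be generated as sketched below, with the whole number as the stem and the first decimal place as the leaf. The weights are hypothetical and are not the chapter's jewellery data; the helper name `stem_and_leaf` is also an assumption.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each whole-number stem to its first-decimal leaves, in rank order."""
    display = defaultdict(list)
    for value in sorted(values):
        tenths = round(value * 10)        # round off to the first decimal place
        display[tenths // 10].append(tenths % 10)
    return dict(display)


# Hypothetical purchase weights in grams.
weights = [13.9, 13.1, 15.6, 13.34, 16.3, 13.3, 15.8]

display = stem_and_leaf(weights)
for stem, leaves in display.items():
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

Because the values are sorted first, the leaves on each stem stay in rank order, which is what allows the median to be read off quickly.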
If one looks at the tabled data for the jewellery purchase in the above stem and
leaf display, the decimals have been rounded off to the first place and in case of two
similar entries the number 13.3 has been entered twice. In fact, if one rotates the above
display by 90 degrees to the left one would get the histogram. The display is showing at
a glance that the sample studied was concerned with the buying of mostly 13 g items.
There are other methods like boxplots, which are a more detailed representation
as compared to histograms. These are basically descriptive statistical values for the data
obtained and these are based upon the measures of central tendency and dispersion.
These statistical measures will be explained in detail in the next chapter.
After the data has been collected through different methods used by the researcher,
the information needs to be refined and structured in a format that can lend itself
to a statistical enquiry for testing the study hypotheses. The researcher first begins
by validating the fieldwork that was conducted. The processing here refers to the
primary data that has been collected specifically for the study.
The researcher needs to carry out a hawk-eyed scrutiny of the obtained data to ensure
that no omissions or errors are there. This is the editing stage of the data processing
step. Here, the researcher begins by conducting a field editing and is able to resolve
some of the inconsistencies and issues of incomplete data. This process is then conducted
at the second stage, at the central office level. At this stage, the research team conducts
some data treatment such as allocating the missing values, if possible, backtracking
and sometimes, plugging the incomplete data.
Backtracking
In-house editing
Bar chart
Minitab
Class intervals
Missing values
Classification of data
MS Excel
Code book
Pie chart
Coding
Plug value
Data editing
Postcoding
Data processing
Pre-coding
Data tabulation
Record
Exclusive class intervals
SAS
Field
Single variable entry
Field editing
SPSS
File
Stem and leaf display
Histogram
Test tabulation
Inclusive class intervals
" mT:JTffiTJ]}5:t*'u"o more than one entry for a question that has six options the number of corresponding
14. In case the question is a Likert type question and it has
agreement/disagreement on a five-point scale, the number
of corresponding columns in the code book would be five.