
CHAPTER 10

Data Processing

Learning Objectives
By the end of the chapter, you should be able to:

1. Understand the processing of the data collected before the data analysis.
2. Understand and carry out the checking and editing of the primary data as well as be able to carry out the necessary fieldwork required.
3. Code both the structured and unstructured questionnaires following certain guidelines.
4. Carry out the tabulation and entry of data in the required format.
5. Carry out preliminary statistical preparation of data.

'...scores, everyone! I am surprised; the schools in Bengal seem to be teaching very well and the students have done very well. The NGOs (non-governmental organizations) are doing commendable work.' Our boss Sanjeev, 'Hitler Chakra' as we call him, was really happy and ordered coffee and samosas to celebrate a job well done, and magnanimously told us, 'Folks, you may take the weekend off.'

'So, Charu showed us the figures, the average scores across the classes, on the OHP connected to her laptop. First the Bangla score, then Maths and Science, and then she came to English. The figures for the first three subjects looked fine, with no score less than 78.8 per cent. Then came the bombshell with English: the 5th grade too had an impossibly high average and the 8th grade's was 103.4 per cent. We all sat upright and there was a stunned silence. How can the average score in a 100-mark paper be 103.4 per cent?'

'Charu then showed us the column of the final grades of the students. And, guess what, there were students with overall scores of 150, 120, 135 and even 204. Emergency was declared. The samosas and coffee were forgotten, and so was the weekend off.'

'But what had happened? How could someone make so many errors in a data sheet?' I asked Saraswati.

'Errors in one entry? No, when we opened the data files, it was like opening a can of worms; there was not a single sheet without error. And in most subjects a good many students were getting scores higher than the maximum marks for the paper.'

'Laila, the new intern, suddenly had a brainwave and said that we should look at the way scoring had been done in the answer scripts. Now, this suggestion was dangerous, as all the coding had been supervised by the boss himself. Anyway, we were told to examine a few scripts at random, and that is when the mystery unravelled.'

'There were 5- and 8-mark questions. If a person got most of a 5-mark question right, he was to be given a score of 4. The teacher had followed the instructions but had marked it as 8, and for an 8-mark question where she was supposed to give a 7, she had marked it as 9.'

'Hey, do not confuse me, Sana. Is this a riddle or a mystery? Please explain.'

'Look,' said Sana, 'the teacher marked four and seven only, but the numerals she wrote were in Bangla, where four is written like 8 and seven is written like 9. Now, at our end, when we entered the data we entered 8 and 9, which is more than the maximum score for the question. And obviously, the ultimate result was a 100+ score.'

'So, we as a team cross-checked all the scores on the Excel sheets and, wherever this doubtful 8 or 9 was found, we went back to the answer script and manually corrected each entry. The final scores, when we summed them across groups and classes, were dismal and, as expected, were mostly below 50 per cent across all the subjects.'

'So finally, we have been let loose, to report on duty tomorrow morning and double check for the errors once more before the presentation for the client is made ready.'

Saraswati is right, because a freak error in entering the data could have had major repercussions on the outcome of the study and the subsequent conclusions. The critical job of the researcher begins after the data has been collected. He has to use this information to assess whether he had been correct or incorrect while making certain assumptions in the form of the hypotheses at the beginning of the study. The raw data that has been collected must be refined and structured in such a format that it can lend itself to statistical enquiry. This process of preparing the data for an analysis is a structured and sequential process (Figure 10.1).
The process starts by validating the measuring instrument, which could be a questionnaire or any other qualitative technique as discussed in Chapter 8. This is followed by editing, coding, classifying and tabulating the obtained data. Sometimes, it might be essential to carry out some statistical modification of the data in order to be able to increase its generalizability to the population under study. This is critical

FIGURE 10.1 The data-preparation process

especially in applied research problems. The researcher should then select an appropriate data analysis strategy.
The final data analysis strategy differs from the preliminary plan of data analysis due to the information and insights gained since the formulation of the preliminary plan. Data preparation should begin as soon as the first batch of questionnaires is received from the field, while the fieldwork is still going on. Thus, if any problems are detected, the fieldwork can be modified to incorporate the corrective action.

FIELDWORK VALIDATION

The first step in the processing begins post the questionnaire or primary data survey. The researcher needs to validate the fieldwork to check whether the execution of the study was handled properly. Thus, he must meticulously go over all the raw data forms, check them for errors and find out whether a standardized set of instructions and reporting was followed in the conducted interviews or schedules. As we stated earlier in Chapter 8, considerable validation is done at the pilot testing stage of the questionnaire formation. The significance of the validation becomes more important in the following cases:

• In case the form had been translated into another language, an expert analysis is required to see whether the meaning of the questions in the two measures is the same or not. A second validation is done by measuring the reliability index of the original and the translated form.

• The second case could be that the questionnaire survey has to be done at multiple locations and has been outsourced to an outside research agency. In this case, it might be essential to carry out checks during the fieldwork as well to ensure that the process being followed is correct. As there is both a time and a cost element involved, in case the investigators are erring, it needs to be corrected immediately.
Post the survey, there might be instances when the survey questionnaire cannot be used for analysis, for multiple causes. It might be that:
• The answers that have been obtained do not follow the question instructions that were given, such as qualifying instructions like, 'in case the answer is ______ please answer the given set of questions, else go to question ______'.
• The respondent seems to have used the same response category for all the questions; for example, there is a tendency on a five-point scale to give 3 as the answer for all questions.
• The form that is received back is incomplete, in the sense that either the person has not filled in the answers to all questions, especially the open-ended ones, or, in case of a multiple-page questionnaire, one or more pages are missing.
• The questionnaire is filled by someone who is not a representative of the population under study. For example, in a study on two-wheeler owners' perception of the Tata small car, Nano, people who either have no vehicle currently or have a small car might have filled in the questionnaire.
• The filled-in form is received after the deadline for receiving the questionnaires has elapsed and the researcher is at the data analysis and interpretation stage.
• The forms received are not in the proportion of the sampling plan. For example, instead of an equal representation from government and private sector employees, 65 per cent of the forms are from the government sector. In such a case the researcher would either need to discard the extra forms or get an equal number filled in from private sector employees.

DATA EDITING

LEARNING OBJECTIVE 2
Understand and carry out the checking and editing of the primary data as well as be able to carry out the necessary fieldwork required.

Once the data collection has been completed, the next step is the editing of the raw data obtained. In this stage, all detectable errors and omissions are identified and the necessary actions are taken. While carrying out the editing, the researcher needs to ensure that:

• The data obtained is complete in all respects.
• It is accurate in terms of information recorded and responses sought.
• Questionnaires are legible and are correctly deciphered, especially the open-ended questions.
• The response format is in the form that was instructed.
• The data is structured in a manner that entering the information will not be a problem.

To ensure that data screening and cleaning, which is essentially the requirement of the editing process, has been carried out, the researcher needs to carry out the process at two levels: the first of these is field editing and the second is central editing.

Field Editing
Raw data validation ensures that all detectable errors and omissions have been examined and the necessary steps have been taken.

Usually, the preliminary editing of the information obtained is done by the field investigators or supervisors. It is advisable that at the end of every field day the investigator(s) review the filled forms for any inconsistencies, non-response, illegible responses or incomplete questionnaires. This is to ensure that the fallacies found can be corrected immediately, as they are fresh in the investigator's mind and also because the recall would be better. Also, in case the investigator needs to contact the respondent who filled in the form, the clarifications required would be much easier to obtain.
The other advantage is that regular field editing ensures that one may also be able to check whether the interviewer or the surveyor is able to handle the process of instructions and probing correctly or not. It might also happen that certain terms or abbreviations have been used in the instrument on which the investigator is not clear and could misinterpret the instructions. This most often happens with branching and skip questions. Thus, the process ensures that the researcher can advise and train the investigator on how to administer the questionnaire correctly. This, however, is only possible in case of a face-to-face interaction and not in mailed surveys.
Some researchers, in order to ensure the authenticity of the data obtained, sometimes carry out random interviews with the same respondents to cross-check whether the administration process was accurate.

Centralized In-house Editing


The second level of editing takes place at the researcher's end. The in-house editing can be handled by the researcher alone or by various members of the research team, as the case may be. It is recommended that even in a single-researcher study, the data should be screened by an outsider as well. At this stage there are two kinds of typical problems that the researcher might encounter.
First, one might detect an incorrect entry. For example, in case of a five-point scale one might find that someone has used a value more than 5. In another case, one might be asking a question like, 'how many days do you travel out of the city in a week?' and the person says '15 days'. Here one can carry out a quick frequency check of the responses; this will immediately detect an unexpected value. As for the above case, the frequency analysis would have shown an entry of 15, and then one can screen the column in which the data for the question has been entered for 15.
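A quick frequency check of this kind can be run in a few lines; the sketch below uses Python with pandas, and the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw responses entered from the questionnaires
data = pd.DataFrame({
    "resp_id":     [1, 2, 3, 4, 5, 6],
    "q7_scale":    [4, 3, 7, 5, 2, 3],   # five-point scale: valid values are 1-5
    "days_travel": [2, 3, 1, 15, 4, 2],  # days travelled out of the city in a week
})

# Frequency check: an out-of-range value such as 7 or 15 shows up immediately
print(data["q7_scale"].value_counts().sort_index())
print(data["days_travel"].value_counts().sort_index())

# Screen the columns for the suspicious entries so they can be traced
# back to the original questionnaires and corrected or flagged
print(data[data["q7_scale"] > 5])
print(data[data["days_travel"] > 7])
```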
The second and the major problem that most researchers face is that of 'armchair interviewing' or a fudged interview. One way to handle this is to first scroll through the answers to the open-ended questions, as generally, if the investigator is filling in multiple forms, faking these would be difficult. Thus, these could be highlighted with a different colour and cross-checked with the investigator or the respondent. In fact, it is advisable that wherever the researcher is making corrections he/she should use a different colour, as that would indicate it being different from the original.
In fact, one should highlight what needs to be cross-checked (yellow colour) and also highlight what is corrected (red colour). It is also advisable, in case of a team of researchers, that the highlighting formats are shared as a uniform scheme by all.
The researcher has some standard processes available to him to carry out the editing process. These are discussed below. It is to be remembered that these are not absolute steps, as sometimes it might be essential to troubleshoot specifically for some peculiar problems (as in the opening vignette) that the person encounters in his/her study.
Backtracking involves returning to the field and to the respondents, so as to follow up the unsatisfactory responses.

Backtracking: The best and the most efficient way of handling unsatisfactory responses is to return to the field and go back to the respondents. This technique is best used for industrial surveys, where it is easier to track the respondent, who can be persuaded to give answers to the non-response or illegible answers. In individual surveys, this becomes a little difficult, as sometimes the person might have indicated only the locality he lives in and there is no contact detail. Another issue is that the antecedent states during the two administrations might be different and these could affect the answers the person would give at the second conduction.

Allocating missing values: This is a contingency plan that the researcher might need to adopt in case going back to the field is not possible. Thus, the option might be to assign a missing value to the blanks or the unsatisfactory responses. However, this works in case:
• The number of blank or wrong answers is small.
• The number of such responses per person is small.
• The important parameters being studied do not have too many blanks, otherwise the sample size for those variables becomes too small for generalizations.
Plug value: In cases such as the third condition above, when the variable being studied is the key variable, sometimes the researcher might need to insert a plug value. One can plug an average or a neutral value in such cases, for example a 3 for a five-point scale. Sometimes a decision rule based upon probability could be established and the researcher might decide on a thumb rule (for example, for a yes/no question, he might decide to put 'yes' the first time he encounters a missing value, 'no' at the second and so on). Another way to handle this is to conduct an exploratory data analysis, see what the ratio of yes to no answers is, and accordingly establish the decision rule.
Sometimes, the respondents' pattern of responses to other questions is used to extrapolate and calculate an appropriate response for the missing answer. Here, it may become a little subjective, as the researcher needs to sift through the data and infer and predict the responses the person would have given had he/she answered the questions. There are statistical software and programmes available today to extrapolate and ascribe values for such missing responses.
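A minimal sketch of how missing values might be flagged and a plug value inserted is given below, assuming Python with pandas; the column name and the non-response code of 9 are illustrative only.

```python
import pandas as pd
import numpy as np

# Hypothetical five-point scale responses; 9 is the code used for non-response
responses = pd.DataFrame({
    "resp_id": [1, 2, 3, 4, 5],
    "q12":     [4, 9, 2, 5, 9],
})

# Convert the missing-value code to a proper missing marker
responses["q12"] = responses["q12"].replace(9, np.nan)

# Option 1: plug a neutral value (3 on a five-point scale)
responses["q12_neutral"] = responses["q12"].fillna(3)

# Option 2: plug the rounded mean of the valid responses
responses["q12_mean"] = responses["q12"].fillna(round(responses["q12"].mean()))

print(responses)
```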
Discarding unsatisfactory responses: If the response sheet has too many blanks, illegible answers or multiple responses for a single answer, the form is not worth correcting and editing. Hence, it is much better to completely discard the whole questionnaire. If too many forms are discarded, the sample for the study might become too small for an analysis or generalization, so here it is advisable to carry out another round of field visits. However, the discarding of the forms might eliminate from the studied population the group which had a contrary or a negative opinion compared to the ones who completed the forms. In one research study, it happened that when the response to a product change proposition (more pulp in the drink) was studied and only the completed forms were considered, they were all filled by people who liked the change, while those who did not answer all the questions had their forms rejected. Finally, when the new product was launched, there were limited takers for it, as the proportion of people who did not like the drink in the studied sample was too small as compared to what existed in the actual market-place.

CODING

LEARNING OBJECTIVE 3
Code both the structured and unstructured questionnaires following certain guidelines.

The process of identifying and denoting a numeral to the responses given by a respondent is called coding. This is essentially done in order to facilitate the researcher's task of interpreting and classifying the answers and subsequently recording the data from the questionnaire on a spreadsheet on the computer. It is advisable, for the sake of computation, to assign a numeric code even for categorical data (e.g., gender). In fact, we will subsequently learn that even for open-ended questions, which are in a statement form, we will try to categorize the answers into numbers. The reason for doing this is that the quantification and graphic representation of data in charts and figures becomes easier.
The process of identifying and denoting a numeral to the responses given by a respondent is called coding.

Usually, the codes that have been formulated are organized into fields, records and files. For example, the gender of a person is one field, and the codes used could be 0 for males and 1 for females. All related fields, for example all the demographic variables like age, gender, income, marital status and education, could be one record. Sometimes the researcher might not be interested in keeping multiple records and might decide to have all the answers a single respondent has given on the questionnaire as a single record. The records of the entire sample under study form a single file. The data that is entered in the spreadsheet, such as on EXCEL, is in the form of a data matrix, which is simply a rectangular arrangement of the data in rows and columns. Here, every row represents a single case or record. For example, consider the following representation from a study on two-wheeler buyers (Table 10.1):
TABLE 10.1 Sample record: Excel sheet for two-wheeler owners. The sheet holds eight sample records, one row per respondent, with columns for the respondent's identification number, occupation code, vehicle type, average kilometres travelled per day, marital status and family size (Column 6).

It is advisable to prepare a schema in advance to simplify and effectively manage the data entry process.

Here, the data matrix reveals that each field is denoted on the column head and each case record is to be read along the row. The data in the first column represents the unique identification given to a particular respondent (also marked on his/her questionnaire). The second column has data entered on the basis of a predetermined coding scheme where every occupation is given a numeral value (for example, 1 stands for government service and 5 stands for student, and so on). Column 3 has 1 representing a motorcycle and 2 representing a scooter. The next value is the average number of kilometres a person travels per day. This is followed by the marital status, with 1 signifying unmarried and 2 married. The last column is again ratio scale data with the number of family members.
The researcher can enter the data on the spreadsheet of the software package he/she is using for the analysis. However, in case the data is being entered by the field investigator or someone not acquainted with the software package, one can also use a spreadsheet programme such as EXCEL to enter the data, as most software have the provision of importing data from an EXCEL spreadsheet.
Codebook formulation: In order to simplify and effectively manage the data entry process, it is essential to prepare a schema in advance for entering the records in the spreadsheet. This formal standardization or coding scheme for all the variables under study is called a codebook. Generally, while designing the rules, care must be taken to decide on categories that are:
• Appropriate to the research objective: For example, in the two-wheeler study, when the study was to be conducted on people in socio-economic classification (SEC) A and B, the occupation and education categories had to be comparable to the ones established in the classification. Secondly, if the comparison is to be done amongst people in different age groups, then the age-class intervals (discussed later in the chapter) should be representative of the comparison to be carried out.
• Comprehensive: As far as possible, options should be given to the respondent in the closed-ended questions as probable response categories. This can be ensured by a thorough exploratory study and, later on, by the conduction of the pilot study, which might result in discovering other responses in the 'any other ______' category. These, then, can be written as independent response options in the final questionnaire.
• Mutually exclusive: The categories and codes devised must be exclusive or clearly different from each other. This will be further discussed in the classification rules that the person should employ.
• Single variable entry: The response that is being entered and the code for it should indicate only a single variable. For example, a 'working single mother' might seem an apparently simple category which one could code as 'occupation'. However, it needs three columns: occupation, marital status and family life cycle. So, one needs to have three different codes to enter this information.
Based on the above rules, one creates a codebook that can be effectively used by the coders. This would generally contain information on the question number, variable name, response descriptors, coding instructions and the column descriptor. Table 10.2 gives an extract from a questionnaire designed to measure consumer buying behaviour for ready-to-eat food products. The coding instructions for the qualifying and the demographic variables are presented here.
Designating numeral codes to the designed responses before administration is called pre-coding.

As we have read in the earlier chapter, a questionnaire can have both closed-ended and open-ended questions. The process of coding the two kinds of questions is very different and requires a detailed discussion. When the questions are structured and the response categories are prescribed, one does what is called pre-coding, i.e., designating numeral codes to the designed responses before administration. However, if the questions are structured and the answers are open ended and not determined in advance, one needs to decide on the codes after the administration of the survey. This is called post-coding and requires skilled interpretation and categorization of the responses into homogeneous grouped response categories, which are then assigned a numeric code.

Coding Closed-ended Structured Questions


The method of coding for structured questions is easier, as the response categories are decided in advance. The researcher simply assigns a code for every answer for each question and specifies the appropriate field and columns in which the response codes are to be noted. The coding method to be followed for different kinds of questions is discussed below.
Dichotomous questions: For dichotomous questions, which are on a nominal scale, the responses can be binary, for example:

Do you eat ready-to-eat food? Yes = 1; No = 0.

This means if someone eats ready-to-eat food he/she will be given a score of 1 and if not, then 0.
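A sketch of how such pre-coded schemes can be applied during data entry is shown below, assuming Python with pandas; the code mappings follow the ready-to-eat food example, but the records themselves are invented.

```python
import pandas as pd

# Raw answers as recorded on the questionnaires (hypothetical)
raw = pd.DataFrame({
    "resp_id":  [1, 2, 3, 4],
    "eats_rte": ["Yes", "No", "Yes", "Yes"],
    "gender":   ["Male", "Female", "Female", "Male"],
})

# Pre-coded schemes taken from the codebook
rte_codes    = {"Yes": 1, "No": 0}
gender_codes = {"Male": 1, "Female": 2}

coded = raw.copy()
coded["eats_rte"] = coded["eats_rte"].map(rte_codes)
coded["gender"]   = coded["gender"].map(gender_codes)

print(coded)
```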



TABLE 10.2 Codebook extract for ready-to-eat food study (question number, variable name, coding instruction for the variable and symbol used)

• Use ready-to-eat food products: Yes = 1; No = 0
• Q22. Age: Less than 20 years = 1; 21 to 26 years = 2; 27 to 35 years = 3; 36 to 45 years = 4; More than 45 years = 5
• Q23. Gender: Male = 1; Female = 2 (symbol X23)
• Q24. Marital status: Single = 1; Married = 2 (symbol X24)
• Q25. No. of children: exact number to be written (symbol X25)
• Q26. Family size: One to two = 1; Three to five = 2; Six and more = 3 (symbol X26)
• Monthly household income: ₹20,000 to ₹34,999 = 1; ₹35,000 to ₹50,000 = 2; ₹50,001 to ₹74,999 = 3; ₹75,000 and above = 4
• Education
• Occupation


Ranking questions: For ranking questions where there are multiple objects to be ranked, the person will have to make multiple columns, with the number of columns equalling the number of objects to be ranked. For example, for the question on ranking TV serials in Chapter 8, the codebook would be as follows:

1. Balika Vadhu: number from 1-10 (X10a)
2. Sathiya: number from 1-10 (X10b)
3. Sasural Genda Phool: number from 1-10 (X10c)
4. Bidai: number from 1-10 (X10d)
5. Pathshala: number from 1-10 (X10e)
6. Bandini: number from 1-10 (X10f)
7. Lapataganj: number from 1-10 (X10g)
8. Sajan Ghar Jaana Hai: number from 1-10 (X10h)
9. Tere Liye: number from 1-10 (X10i)
10. Uttaran: number from 1-10 (X10j)
Checklists/multiple responses: In questions that permit a large number of responses, each possible response option should be assigned a separate column. For example, consider the following question:
Which of the following newspapers do you read? (Tick all that you read.)
The Times of India
The Hindustan Times
Mail Today
The Indian Express
Deccan Chronicle
The Asian Age
Mint

For this question, the number of columns required is seven, one for each newspaper. The coding instructions for each column would be as follows: in case the person ticks a name, that paper = 1, and in case he does not tick it, that paper = 0.
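A small sketch of this one-column-per-option coding is given below, assuming Python with pandas; the ticked selections for the three respondents are invented for illustration.

```python
import pandas as pd

papers = ["The Times of India", "The Hindustan Times", "Mail Today",
          "The Indian Express", "Deccan Chronicle", "The Asian Age", "Mint"]

# Hypothetical ticked responses for three respondents
ticked = {
    1: ["The Times of India", "Mint"],
    2: ["The Hindustan Times"],
    3: ["The Times of India", "The Indian Express", "Deccan Chronicle"],
}

# One 0/1 column per newspaper, one row per respondent
rows = []
for resp_id, selections in ticked.items():
    row = {"resp_id": resp_id}
    for paper in papers:
        row[paper] = 1 if paper in selections else 0
    rows.append(row)

coded = pd.DataFrame(rows)
print(coded)
```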
Scaled questions: For questions that are on a scale, usually an interval scale, the question/statement will have a single column and the coding instruction would indicate the numerical assignment, i.e., what number needs to be allocated for each response option given in the scale. Consider the following question from Chapter 8.
Please indicate the level of your agreement with the following statements.

1. The individual customer today shops more
2. The consumer is well informed about market offerings
3. There are more shopping options available to the consumer today

SA - strongly agree; A - agree; N - neutral; D - disagree; SD - strongly disagree.

The codebook for this would assign a single column to each statement, with a numeral code for each of the five response categories.

The coding instructions for comparative scales would be slightly different. Consider the following comparative question:
Please rate Domino's and other pizza restaurants you frequent on the basis of your satisfaction level on an 11-point scale, based upon the following parameters (1 = extremely poor, 6 = average, 11 = extremely good). Circle your response.

The parameters listed included variety of menu options, value for money, speed of service (delivery time), promotional offers, friendliness of the salesperson on the phone, taste, quality of packaging, adaptation to Indian taste and side orders/appetizers, each to be circled on the 1 to 11 scale, once for Domino's and once for the other pizza restaurant.
Here, the number of columns required is not 12 but 2 (Domino's and others) × 12, that is, 24 columns. The respondent is supposed to use the same parameters and the same scale, but for each parameter he is supposed to make one circle for Domino's and one for the other pizza restaurant. In case of multiple brands being rated on the same parameters, the number of columns would be x × n (where x = number of parameters and n = number of objects being evaluated on each parameter).

Missing values: It is advisable to use a standard format for signifying a non-response or a missing value. For example, a code of 9 could be used for a single-column variable, 99 for a double-column variable, 999 for a three-character variable and so on. The researcher must take care, as far as possible, to use a value that is starkly different from the valid responses. This is one of the reasons why 9 is suggested. However, in case you have a scale like the one above, 9 cannot be used as a missing value.

Coding Open-ended Structured Questions


There are no predefined response categories for the coding of open-ended questions. This is due to the fact that they are unpredictable in terms of insufficient information.

The coding of open-ended questions is quite difficult, as they are unpredictable in terms of insufficient information or a lack of hypotheses, which is why there are no predefined response categories. As discussed earlier, the respondents' exact answers are noted on the questionnaire. Then the researcher (either individually or as a team) looks for patterns and assigns a category code. Sometimes the researcher does what is termed test tabulation, where he randomly looks at the answers from 20 per cent of the sample data and attempts to give codes to each of the responses identified. When deciding on the codes, he/she must keep the criteria of appropriateness, exhaustive categorization, mutually exclusive categories and single variable distribution as the guiding principles.
The following example is a question that was used to study the reasons attributed to the lean management implementation in an organization.

If you think lean was a success so far, please specify the three most significant reasons that have contributed to its success in your opinion.

As these were based upon the three most important reasons to be indicated, each case/record might have multiple answers. Thus, based upon the responses obtained for the above question, the following post-code book was created:

Col. No. 63: Improvement at the workplace by eliminating waste; Yes = 1, No = 0 (X63a)
Col. No. 64: To meet increasing demands of customers; Yes = 1, No = 0 (X63b)
Col. No. 65: To improve quality; Yes = 1, No = 0 (X63c)
Col. No. 66: To achieve corporate goal; Yes = 1, No = 0 (X63d)
Col. No. 67: It reduces cycle time of manufacturing and production; Yes = 1, No = 0 (X63e)
Col. No. 68: Reduced response time; Yes = 1, No = 0 (X63f)
Col. No. 69: Enhanced innovation and creativity; Yes = 1, No = 0 (X63g)
When deciding on the codes, at times it may be essential to use a code even when no one has mentioned it. This may be critical, as one of the hypothesized parameters has been negated. For example, consider the question:
Why do you eat organic food products?
'Organic food is fashionable' was a reason why the researcher believed that people consume it. Thus, one of the predetermined categories, coded as 1, was this. Along with these, the researcher might post-code the responses received. However, it may so happen that no one chose this option; thus, while interpreting the findings, one can state that no one consumes the food simply because it is fashionable to do so.

CLASSIFICATION AND TABULATION OF DATA
LEARNING OBJECTIVE 4
Carry out the tabulation and entry of data in the required format.

Sometimes, the data obtained from the primary instrument is bulky and voluminous and even structured response categories become tedious to interpret. In such cases, the researcher might decide to reduce the information into homogeneous categories. This is essentially like post-coding of the open-ended questions, but here the grouping would be based upon structured questions. This method is called classification of data. This can be done either on the basis of common attributes or on the basis of class intervals.
Reducing the information into homogeneous categories on the basis of structured questions is called classification of data.

Classification on the basis of attributes: Here, the person's score on a particular variable is computed by various combinations of the original data obtained. This process is called variable respecification. For example, in a study on schoolchildren, mental growth was calculated on the basis of their answers to the questions that were related to conceptual knowledge plus the questions related to applications. In another study, the person's age, marital status and presence of children were used to
i{&$r',,t,,r, ,-"'
l,fil$
.iu

. 286 : Research Methodology

t) lr.
'

l, t:
compute their family life cycle stage. Similarly, as stated earlier, the socio-economic classification of a person could be identified on the basis of his education and occupation.
Another respecification the researcher might carry out is collapsing the response categories. For example, suppose the original variable was plastic bag usage with 10 response categories. These might be collapsed into four categories: heavy, medium, light and non-user. Other respecifications of variables include square root and log transformations, which are often applied to improve the fit of the model being estimated.
Another classification technique, discussed in an earlier chapter on measurement and scaling and in the coding section here, refers to the use of dummy variables for respecifying the categorical variables. Dummy variables are also called binary, dichotomous, instrumental or qualitative variables. They are variables that may take on only two values, such as 0 or 1.
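The two respecifications just described, collapsing response categories and creating dummy variables, might be sketched as follows in Python with pandas; the usage bands and column names are assumptions made purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "bags_per_week": [0, 1, 3, 6, 12, 2, 9],   # original usage variable (simplified)
    "occupation":    ["service", "business", "student",
                      "service", "student", "business", "service"],
})

# Collapse usage into four categories: non-user, light, medium, heavy
bands  = [-1, 0, 2, 6, float("inf")]
labels = ["non-user", "light", "medium", "heavy"]
df["usage_group"] = pd.cut(df["bags_per_week"], bins=bands, labels=labels)

# Respecify the categorical occupation variable as 0/1 dummy variables
dummies = pd.get_dummies(df["occupation"], prefix="occ", dtype=int)
df = pd.concat([df, dummies], axis=1)

print(df)
```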
Classification by class intervals: Numerical data, like ratio scale data, can be classified into class intervals. This is to assist the quantitative analysis of data. For example, the age data obtained from the sample could be reduced to homogeneous grouped data; for example, all those below 25 form one group, those 25-35 are another group and so on. Thus, each group will have class limits: an upper and a lower limit. The difference between the limits is termed the class magnitude. One can have class intervals of both equal and unequal magnitude.
The decision on how many classes and whether equal or unequal depends upon the judgement of the researcher. Generally, multiples of 2 or 5 are preferred. Some researchers adopt the following formula for determining the size of class intervals:

i = R/(1 + 3.3 log N)

where
i = size of class interval,
R = range (i.e., difference between the values of the largest item and smallest item among the given items),
N = number of items to be grouped.
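As a quick worked example of the formula (with the logarithm taken to the base 10), suppose the youngest respondent in a sample of 100 is 18 years old and the oldest is 60; the numbers here are assumed purely for illustration.

```python
import math

R = 60 - 18          # range: largest item minus smallest item
N = 100              # number of items to be grouped

i = R / (1 + 3.3 * math.log10(N))   # size of class interval
print(round(i, 1))   # about 5.5, so intervals of width 5 (a multiple of 5) are a sensible choice
```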
The class intervals that are decided upon could be exclusive, for example:
10-15
15-20
20-25
25-30
In this case, the upper limit of each interval is excluded from the category. Thus we read the first interval above as 10 and under 15, the next one as 15 and under 20, and so on.

The other kind is inclusive, that is:
10-15
16-20
21-25
26-30
Here, both the lower and the upper limits are included in the interval. It says 10-15 but actually means 10-15.99. It is recommended that when one has continuous data it should be signified as 10-15.99, as then all possibilities of the responses are exhausted. However, for discrete data one can use 10-15.

Tabulation involves an orderly arrangement of data into an array that is suitable for statistical analysis. This can be done both manually and with the assistance of a software.

Once the categories and codes have been decided upon, the researcher needs to arrange the same according to some logical pattern. This is referred to as tabulation of data. It involves an orderly arrangement of data into an array that is suitable for a statistical analysis, usually an orderly arrangement of rows and columns. In case there is data to be entered for one variable, the process is a simple tabulation and, when there are two or more variables, one carries out a cross-tabulation of data. This can be done manually or with the help of a computer.
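A simple tabulation and a cross-tabulation of coded data might look like the sketch below, assuming Python with pandas; the gender and vehicle codes echo the earlier two-wheeler example, but the records are invented.

```python
import pandas as pd

coded = pd.DataFrame({
    "gender":  [1, 2, 2, 1, 1, 2, 1, 2],   # 1 = male, 2 = female
    "vehicle": [1, 1, 2, 2, 1, 2, 1, 1],   # 1 = motorcycle, 2 = scooter
})

# Simple tabulation: frequency of one variable
print(coded["vehicle"].value_counts().sort_index())

# Cross-tabulation: two variables arranged in rows and columns
print(pd.crosstab(coded["gender"], coded["vehicle"], margins=True))
```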

Exploratory Data Analysis


Preliminary data exploration is done to assess the expected trends of the findings. This is, basically, loosely structured.

Once the data has been cleaned and entered in a tabular form, the researcher is advised to do a preliminary data exploration in order to assess the expected trends of the findings. Sometimes, these indicative trends may demonstrate that the data collection or instrument design is faulty and needs some corrections.
Thus, before one goes about testing the formulated hypotheses, one carries out a loosely structured exploration. Most of the exploration is done on the basis of the graphical and visual display of the data patterns that seem to be emerging. In this section we will discuss some widely used and simplistic measures of displaying data.
Bar and pie charts: The data that is available as a classification or demographic variable is most often on a categorical or nominal scale. Thus, the tabled data can be plotted to demonstrate the pattern of responses. For example, in a study on jewellery buying, the age groups and the occupations of the sample group were tabulated. The age groups recorded were 20-25, 26-30 (37 respondents, 37.0 per cent), 31-35 (9 respondents, 9.0 per cent), 36-40 (22 respondents, 22.0 per cent), 41-45 and 46 and above, out of a total sample of 100 (100.0 per cent); a similar frequency table was drawn up for occupation.

Thus, a quick visual representation of the largest and the smallest group can be obtained by constructing a pie chart of the same (Figure 10.2).

FIGURE 10.2 Pie charts showing the largest and smallest groups, for age group and for occupation

In case one is interested in getting a comparative depiction of the same, the data in the above case is represented in a bar chart (Figure 10.3).
FIGURE 10.3 Comparative depiction of the groups through bar charts: (a) age group; (b) occupation (business, salaried, professional and others), with frequency on the vertical axis
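A sketch of how such pie and bar charts can be drawn, assuming Python with matplotlib; the age-group counts follow the jewellery example where they are legible and are otherwise illustrative.

```python
import matplotlib.pyplot as plt

age_groups = ["20-25", "26-30", "31-35", "36-40", "41-45", "46 and above"]
counts     = [20, 37, 9, 22, 6, 6]   # illustrative frequencies summing to 100

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Pie chart: quick view of the largest and smallest groups
ax1.pie(counts, labels=age_groups, autopct="%1.1f%%")
ax1.set_title("Age group")

# Bar chart: comparative depiction of the same groups
ax2.bar(age_groups, counts)
ax2.set_ylabel("Frequency")
ax2.set_title("Age group")
ax2.tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()
```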

Histogram: For metric (interval and ratio scale) data, the data is represented through a histogram (Figure 10.4). The representation is able to demonstrate the distribution pattern in terms of whether it is normally distributed or demonstrates skewness. The following was the result of the distribution of 15 customers who purchased from branded jewellery outlets last year.

Weight purchased (g): 13.10, 13.25, 13.26, 13.87, 15.64, 15.65, 15.84, 16.26, 16.55, 17.25, 17.65, 18.23, 22.18, 31.00, 35.60. Each value has a frequency of 1 (valid per cent 6.7), and the cumulative per cent rises from 6.7 to 100.0 across the 15 customers.

Thus, the data representation in the histogram shows the weight of the item purchased in grams (g) on the X-axis, and the height of the bars represents the frequency of that particular interval. The mean weight of the items bought from the branded outlets was approximately 18 g. Most of the sample purchased an item that weighed less than 20 g. The data shows no frequencies for the 23-30 g range.
FIGURE 10.4 Histogram showing the distribution pattern of customers (X-axis: purchase in gram; mean = 18.3553, standard deviation = 6.55777, N = 15)

Thus, the display demonstrates that the sample selected is skewed towards the purchasers of smaller items.
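The histogram itself can be reproduced with a few lines of Python and matplotlib, using the 15 purchase weights listed in the table above.

```python
import matplotlib.pyplot as plt

weights = [13.10, 13.25, 13.26, 13.87, 15.64, 15.65, 15.84, 16.26,
           16.55, 17.25, 17.65, 18.23, 22.18, 31.00, 35.60]  # purchase weight in grams

plt.hist(weights, bins=range(10, 41, 5), edgecolor="black")  # class intervals of 5 g
plt.xlabel("Purchase in gram")
plt.ylabel("Frequency")
plt.title("Distribution of purchases from branded jewellery outlets")
plt.show()
```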
Stem and leafdisplay shows Stem and leaf disprays: This is another way of displaying
the metric data. It is very
individual data values in earh easy to compute and can be done manually or
with irr" n"tp of Minitab. It shows
set as agninst the histogram individual data values in each set as against the hiistogram which presents only
which presents only qroup group ag$egates.
agqfegates.

- It-shows the pattern of responses in each interval and yet can maintain the rank
order for a quick approximation of the median or quartile."Each
row or line is called
a stem and each value on the line is a leaf. The
same clata that we represented on the
histogram can also be depicted on a stem ancl reaf display
as follows:

13 | 1 3 3 9
15 | 6 6 8
16 | 3 6
17 | 3 7
18 | 2
22 | 2
31 | 0
35 | 6

If one looks at the tabled data for the jewellery purchases in the above stem and leaf display, the decimals have been rounded off to the first place and, in case of two similar entries, the number 13.3 has been entered twice. In fact, if one rotates the above display by 90 degrees to the left, one would get the histogram. The display shows at a glance that the sample studied was concerned with the buying of mostly 13 g items.
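A small sketch of how such a display can be generated programmatically, assuming Python; each value is rounded to one decimal place and the leaves are grouped by their integer stems (note that a borderline value such as 13.25 may round differently depending on the rounding rule used).

```python
from collections import defaultdict

weights = [13.10, 13.25, 13.26, 13.87, 15.64, 15.65, 15.84, 16.26,
           16.55, 17.25, 17.65, 18.23, 22.18, 31.00, 35.60]

stems = defaultdict(list)
for w in weights:
    rounded = round(w, 1)                      # round to the first decimal place
    stem = int(rounded)                        # integer part becomes the stem
    leaf = int(round((rounded - stem) * 10))   # first decimal becomes the leaf
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in sorted(stems[stem]))
    print(f"{stem:>3} | {leaves}")
```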
There are other methods, like boxplots, which are a more detailed representation as compared to histograms. These are basically descriptive statistical summaries of the data obtained and are based upon the measures of central tendency and dispersion. These statistical measures will be explained in detail in the next chapter.

STATISTICAL SOFTWARE PACKAGES

Researchers have to their advantage a wide array of statistical programmes to assist them in both data management and data analysis. In this section we will briefly discuss only the most frequently used packages.
MS Excel: The simplest and most widely used method of presenting and tabulating data is on Excel. The basic mathematical functions can be calculated here. Secondly, the software is easy to understand and is used by most computer users. The data entered on Excel can be transported to most statistical packages for a higher level of analysis.
Minitab: Minitab was developed more than 20 years ago at the Pennsylvania State University. It can be used with considerable ease and effectiveness in all business areas. It was originally used by statisticians. However, today it is used for multiple applications, especially quality control, six sigma and the design of experiments. The URL for Minitab is http://www.minitab.com/. The researcher can utilize the products and the help guide to undertake a quantitative research analysis.
Statistical Analysis System (SAS): SAS was created in the late 1960s at North Carolina State University. It has been actively and extensively used in managing, storing and analysing information. It has the advantage of being able to manage really bulky data sets with considerable ease. Linear models (regression, analysis of variance, analysis of covariance), generalized linear models (including logistic regression and Poisson regression), multivariate methods (MANOVA, canonical correlation, discriminant analysis, factor analysis, clustering), categorical data analysis (including log-linear models) and all the standard techniques for descriptive and confirmatory statistical analysis are possible with SAS. The statistical analyses may be interfaced with the graphical products to produce relevant plots such as q-q plots, residual plots and other relevant graphical descriptions of the data. Forecasting and trend series analysis can also be carried out using the package. It finds higher usage in industry than among students, who are more comfortable with SPSS. The URL for the package is http://www.sas.com/.
SPSS: Amongst the student community, as well as with most research agencies, this is the most widely used package. It is adaptable to most business problems and is extremely user friendly. A reference URL for SPSS is http://www.spss.com/. The software is discussed in detail in Appendix 10.1.
There are a number of specific software programs like E Views for business forecasting and LISREL (Linear Structural Relations) for structural equation modelling. However, for most purposes, SPSS is the most widely used software.

After the data has been collected through the different methods used by the researcher, the information needs to be refined and structured in a format that can lend itself to a statistical enquiry for testing the study hypotheses. The researcher first begins by validating the fieldwork that was conducted. The processing here refers to the primary data that has been collected specifically for the study.
The researcher needs to carry out a hawk-eyed scrutiny of the obtained data to ensure that no omissions or errors are there. This is the editing stage of the data processing step. Here, the researcher begins by conducting a field editing and is able to resolve some of the inconsistencies and issues of incomplete data. This process is conducted at a second stage at the central office level. At this stage, the research team conducts some data treatment such as allocating the missing values, backtracking if possible and, sometimes, plugging the incomplete data.

Once this is completed, the researcher prepares a uniform code sheet for the questions and expected responses. This notepad of instructions is referred to as the codebook. In case the questions and answers are closed ended, the investigator is able to conduct a pre-coding of data, where he decides in advance what numeral value is to be assigned to each of the expected answers. The investigator then takes a decision on how to code the missing values, i.e., the questions whose answers have been left blank. This is critical to decide and record in the entered data, as it might otherwise lead to an error in calculation.
Classification into attributes or class intervals is carried out and the entered data is now ready for analysis in a tabular form. Before conducting formal and rigorous data analysis through a gamut of statistical techniques, it is advisable to carry out a simple exploratory data analysis by portraying the data in figurative forms such as bar charts, pie charts, histograms and stem and leaf displays. This exploration can now be conducted in an extremely user-friendly and quick manner by using various software packages like MS Excel, SAS, Minitab and SPSS.

Key Terms

Backtracking
Bar chart
Class intervals
Classification of data
Code book
Coding
Data editing
Data processing
Data tabulation
Exclusive class intervals
Field
Field editing
File
Histogram
In-house editing
Inclusive class intervals
Minitab
Missing values
MS Excel
Pie chart
Plug value
Post-coding
Pre-coding
Record
SAS
Single variable entry
SPSS
Stem and leaf display
Test tabulation

Objective Type Questions


State whether the following statements are true (T) or false (F).

1. The first step in the data analysis process is data validation.
2. Field editing is possible for all types of primary data collected.
3. Armchair interviewing refers to face-to-face filling in of the questionnaire by the respondent.
4. Backtracking means going back to the respondent to check any errors during questionnaire administration.
5. Backtracking is best suited for industrial surveys.
6. Plug value refers to the fudged value that an investigator might put for a missing response.
7. The smallest code entry a researcher makes in a code book is a field.
8. Several fields together can be clubbed into a file.
9. In a data matrix every column represents a single case.
10. SEC refers to the sections in a typical data matrix.
11. All categories formulated for data entry must be mutually exclusive.
12. Post-coding is conducted on closed-ended questions.
13. In case a respondent is allowed to mark more than one entry for a question that has six options, the number of corresponding columns required is six.
14. In case the question is a Likert type question and it has agreement/disagreement on a five-point scale, the number of corresponding columns in the code book would be five.
