DBB2103 Unit-07 - V1.2
DBB2103
RESEARCH METHODOLOGY
Unit 7
Data Processing
1. INTRODUCTION
In the last few units, you have learned about the various aspects of data collection. The
critical job of the researcher begins after the data has been collected: the researcher must
use this information to assess whether the assumptions made at the beginning of the study,
in the form of hypotheses, were correct. The raw data that has been collected must be
refined and structured in a format that lends itself to statistical inquiry. This process of
preparing the data for analysis is structured and sequential. It starts by validating the
measuring instrument, which could be a questionnaire or any other primary technique. This
is followed by editing, coding, classifying, and tabulating the obtained data.
In this unit, we will learn these steps of preparing the data through editing, coding, and
tabulating, so that it is ready for statistical analysis aimed at the research objectives framed
earlier.
1.1 Objectives
2. DATA EDITING
Data editing is the process of detecting and correcting errors (logical inconsistencies) in
data. After collection, the data is subjected to processing, which requires the researcher to
go over all the raw data forms and check them for errors. Validating the forms becomes
especially important in the following cases:
• In case the form had been translated into another language, expert analysis is done to
see whether the meaning of the questions in the two measures is the same or not.
• The second case could be that the questionnaire survey has to be conducted at multiple
locations and has been outsourced to an external research agency.
• The respondent seems to have used the same response category for all the questions;
for example, there is a tendency on a five-point scale to give 3 as the answer for all
questions.
• The form that is received back is incomplete: either the person has not answered all the
questions, or, in the case of a multiple-page questionnaire, one or more pages are
missing.
• The forms received are not in the proportion of the sampling plan. For example, instead
of an equal representation of government and private sector employees, 65 percent of
the forms are from the government sector. In such a case the researcher either would
need to discard the extra forms or get an equal number filled-in from private-sector
employees.
Once the validation process has been completed, the next step is editing the raw data
obtained. The editing process is carried out at two levels: the first of these is field editing
and the second is central editing.
Usually, the preliminary editing of the information obtained is done by the field
investigators or supervisors, who review the filled forms for any inconsistencies,
non-response, illegible responses, or incomplete questionnaires. Thus, the errors can be
corrected immediately and, if need be, the respondent who filled in the form can be
contacted again. The other advantage is that regular field editing also serves as a check on
whether the surveyor is handling the instructions and probing correctly. Thus, the
researcher can advise and train the investigator on how to administer the questionnaire
correctly.
The second level of editing takes place at the researcher’s end. At this stage, there are two
kinds of typical problems that the researcher might encounter.
First, one might detect an incorrect entry. For example, in the case of a five-point scale, one
might find that someone has used a value of more than 5. In another case, one might ask a
question like, ‘How many days do you travel out of the city in a week?’ and the person
answers ‘15 days’. Here one can carry out a quick frequency check of the responses; this
will immediately detect such an unexpected value.
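The frequency check described above can be sketched in a few lines of Python. The function name, the sample answers, and the 1–5 scale are illustrative assumptions, not part of any particular survey:

```python
from collections import Counter

def frequency_check(responses, valid=range(1, 6)):
    """Tally responses to one question and flag any value
    outside the expected five-point scale."""
    counts = Counter(responses)
    out_of_range = {v: n for v, n in counts.items() if v not in valid}
    return counts, out_of_range

# A batch of answers to a five-point question; 7 is a data-entry error.
counts, bad = frequency_check([3, 4, 5, 2, 7, 3, 1])
```

Running the check surfaces the stray 7 immediately, which the researcher can then trace back to the original questionnaire.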
The second and major problem that most researchers face is that of ‘armchair interviewing’,
or a fudged interview. One way to handle this is to first scan the answers to the open-ended
questions, as these would be difficult for an investigator to fake consistently across
multiple forms.
The researcher has some standard processes available for carrying out the editing process.
These are briefly discussed below.
Backtracking: The best and most efficient way of handling unsatisfactory responses is to
return to the field and go back to the respondents. This technique is best suited for
industrial surveys but is a little difficult in individual surveys.
Allocating missing values: This is a contingency plan that the researcher might need to
adopt in case going back to the field is not possible. The option then might be to assign a
missing value to the blanks or the unsatisfactory responses. However, this works only
under certain conditions.
Plug value: When the variable being studied is a key variable, the researcher might instead
insert a plug value. Sometimes one can plug an average or a neutral value, for example, a 3
on a five-point scale, or the researcher might establish a rule as to what value will be
entered if the person has not answered. Sometimes, the respondent's pattern of responses
to other questions is used to extrapolate an appropriate response for the missing answer.
Discarding unsatisfactory responses: If the response sheet has too many blanks/illegible
or multiple responses for a single answer, the form is not worth correcting and editing.
Hence, it is much better to completely discard the whole questionnaire.
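The editing rules above can be combined into a small sketch. The neutral plug value of 3 follows the five-point-scale example in the text, while the discard threshold of two blanks is an assumption made purely for illustration; in practice each researcher sets their own rule:

```python
def edit_response(answers, scale_neutral=3, max_blanks=2):
    """Apply simple editing rules to one questionnaire, given as a
    list of answers where None marks a blank. Rules (illustrative):
    - discard the form if it has more than max_blanks blanks;
    - otherwise plug each blank with the neutral scale value."""
    blanks = sum(1 for a in answers if a is None)
    if blanks > max_blanks:
        return None  # form discarded as unsatisfactory
    return [scale_neutral if a is None else a for a in answers]

kept = edit_response([4, None, 5, 2, 3])            # one blank: plugged with 3
dropped = edit_response([None, None, None, 1, 2])   # too many blanks: discarded
```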
Self-Assessment Questions – 1
1. The editing process is carried out at two levels, the first of these is field editing
and the second is _________ .
2. Going back to the respondent to check any errors during questionnaire
administration is known as_________ .
3. Backtracking is best suited for industrial surveys. (True/False)
4. Plug value refers to the fudged value that an investigator might put for a missing
response. (True/False)
3. CODING
The process of identifying and denoting a numeral to the responses given by a respondent is
called coding. This is essentially done to help the researcher in recording the data in a
tabular form later. It is advisable to assign a numeric code even for categorical data (e.g.,
gender). Even for open-ended questions, which are in statement form, we try to categorize
the answers into numbers. The reason for doing this is that the graphic representation of
data in charts and figures becomes easier.
Usually, the codes that have been formulated are organized into fields, records, and files. For
example, the gender of a person is one field and the codes used could be 0 for males and 1
for females. In all related fields, for example, all the demographic variables like age, gender,
income, marital status, and education could be one record. The records of the entire sample
under study form a single file. The data entered in a spreadsheet, such as Excel, is in the
form of a data matrix, which is simply a rectangular arrangement of the data in rows and
columns. Here, every row represents a single case or record. For example, consider the
following representation from a study on two-wheeler buyers (Table 8.1):
Here, the data matrix reveals that each field is denoted on the column head and each case
record is to be read along the row. The data in the first column represents the unique
identifier given to a particular respondent (also marked on his/her questionnaire). The
second column has data entered based on a coding scheme where every occupation is given
a number value (for example, 1 stands for government service and 5 stands for student, and
so on). Column 3 has 1 representing a motorcycle and 2 representing a scooter. The next
value is of the average number of kilometers a person travels per day.
This is followed by the marital status, with 1 signifying unmarried and 2 married. The last
column is again a ratio scale data with the number of family members. The researcher can
enter the data on the spreadsheet of the software package he/she is using for the analysis.
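Such a data matrix can be represented directly in Python as a list of rows. The column order and code values below follow the two-wheeler example in the text (1 = motorcycle, 2 = scooter; 1 = unmarried, 2 = married); the specific respondent rows are invented for illustration:

```python
# Each row is one case (respondent); each column is one field.
fields = ["id", "occupation", "vehicle", "km_per_day", "marital", "family_size"]
data_matrix = [
    [1, 1, 1, 20, 2, 4],   # respondent 1: govt service, motorcycle, married
    [2, 5, 2, 8, 1, 3],    # respondent 2: student, scooter, unmarried
]

# Reading a single record back as field -> value pairs:
record = dict(zip(fields, data_matrix[1]))
```

Pairing each row with the field names this way is what a spreadsheet or statistics package does implicitly when it maps column headers onto cell values.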
Codebook formulation: To manage the data entry process, it is best to prepare a method
for entering the records. This coding scheme for all the variables under study is called a
code book. Generally, while designing the rules, care must be taken to formulate categories
that are:
• Comprehensive: Should cover all the possible answers to the question that was asked.
• Mutually exclusive: The categories and codes devised must be exclusive or different
from each other.
• Single variable entry: The response that is being entered and the code for it should
indicate only a single variable. For example, a ‘working single mother’ might seem a
simple category that one could code as ‘occupation’. However, it needs three columns—
occupation, marital status, and family life cycle. So, one needs to have three different
codes to enter this information.
Based on the above rules, one creates a code book. This would generally contain information
on the question number, variable name, response descriptors, and coding instructions and
the column descriptor. Table 10.2 gives an example from a questionnaire designed to
measure consumer buying behavior for ready-to-eat food products.
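A code book of this kind maps naturally onto a dictionary. The variable names, question numbers, and code values below are hypothetical examples, not entries from the actual ready-to-eat-food questionnaire:

```python
# A minimal code book: one entry per variable, recording the question
# number and the coding instructions. Codes within each variable are
# mutually exclusive, and each entry covers exactly one variable.
codebook = {
    "gender":   {"question": 1, "codes": {"male": 0, "female": 1}},
    "marital":  {"question": 2, "codes": {"unmarried": 1, "married": 2}},
    "eats_rte": {"question": 3, "codes": {"no": 0, "yes": 1}},
}

def encode(variable, response):
    """Look up the numeric code for a raw response."""
    return codebook[variable]["codes"][response]
```

Note how a compound category like ‘working single mother’ would need three separate entries (occupation, marital status, family life cycle) rather than one, in line with the single-variable-entry rule above.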
As we have read in Unit 6, a questionnaire can have both closed-ended and open-ended
questions. When the questions are structured and the response categories are prescribed
then one does what is called pre-coding, i.e., giving numeral codes to the designed responses
before administration. However, if the questions are structured and the answers are open-
ended, one needs to decide on the codes after the administration of the survey. This is called
post-coding.
The method of coding for structured questions is easier as the response categories are
decided in advance. The coding method to be followed for different kinds of questions is
discussed below.
Dichotomous questions: For dichotomous questions, which are on a nominal scale, the
responses can be binary, for example:
This means if someone eats ready-to-eat food he/she will be given a score of 1 and if not,
then 0.
Ranking questions: For ranking questions where there are multiple objects to be ranked,
one will have to make multiple columns, with the number of columns equaling the number
of objects to be ranked. For example, for ranking TV serials, each serial gets its own column
in which the rank given to it is entered.
Multiple-response questions: For questions where the respondent may tick more than
one option, each option again gets its own column. For example:
Which of the following newspapers do you read? (Tick all that you read.)
Times of India
Hindustan Times
Mail Today
Indian Express
Deccan Chronicle
Asian Age
Mint
For this question, the number of columns required is seven, one for each newspaper. The
coding instruction for each column would be: if the person ticks a paper, enter 1; if not,
enter 0.
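The seven-column 0/1 coding for the newspaper question can be sketched as follows; the function name is an assumption made for this example:

```python
PAPERS = ["Times of India", "Hindustan Times", "Mail Today", "Indian Express",
          "Deccan Chronicle", "Asian Age", "Mint"]

def code_readership(ticked):
    """Turn the set of ticked newspapers into seven 0/1 columns,
    one per paper, in the fixed code-book order."""
    return [1 if paper in ticked else 0 for paper in PAPERS]

# A respondent who ticked only Times of India and Mint:
row = code_readership({"Mint", "Times of India"})
```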
Scaled questions: For questions that are on a scale, usually an interval scale, the
question/statement will have a single column and the coding instruction would indicate
what number needs to be allocated for the response options given in the scale. Consider the
following questions.
Please indicate the level of your agreement with the following statements.
Compared to the past (5–10 years):  SA  A  N  D  SD
For responses left blank or unsatisfactory, a distinct missing-value code is entered, for
example, 9 for a numeric variable, a suitable distinct symbol for a character variable, and so
on. The researcher must take care, as far as possible, to use a value that is starkly different
from the valid responses. This is one of the reasons why 9 is suggested. However, if you
have a 10-point scale, do not use 9.
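A coding sketch for such a scaled question is below. Scoring SA = 5 down to SD = 1 is a common convention but an assumption here, since the original coding instructions are not shown; the missing-value code 9 follows the text:

```python
LIKERT = {"SA": 5, "A": 4, "N": 3, "D": 2, "SD": 1}
MISSING = 9  # starkly different from any valid 1-5 response

def code_likert(answer):
    """Map a Likert response to its numeric code; blanks get 9."""
    return LIKERT.get(answer, MISSING)

# Four respondents' answers to one statement; the third left it blank.
codes = [code_likert(a) for a in ["SA", "N", None, "D"]]
```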
The coding of open-ended questions is quite difficult, as the respondents' exact answers are
noted on the questionnaire. The researcher (either individually or as a team) then looks for
patterns and assigns a category code. Consider the following question:
If you think lean management was a success so far, please specify the three most significant
reasons that, in your opinion, have contributed to its success.
People gave different answers. Thus, based upon the responses obtained for the above
question, the following post-code book was created:
Col. No.  Variable                                            Coding Instructions  Variable Name
63        Improvement in the workplace by eliminating waste   Yes = 1, No = 0      X63a
64        To meet increasing demands of customers             Yes = 1, No = 0      X63b
65        To improve quality                                  Yes = 1, No = 0      X63c
66        To achieve corporate goal                           Yes = 1, No = 0      X63d
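Post-coding of such free-text answers can be sketched as below. The keyword-matching approach is a deliberately naive illustration of how a researcher's category decisions get applied, not a real text-analysis method; the sample answer is invented:

```python
# After reading the open-ended answers, the researcher settles on
# category columns (cols 63-66 here) and scores each respondent
# 1 if the reason appears in their answer, else 0.
CATEGORIES = {
    63: "eliminating waste",
    64: "increasing demands",
    65: "improve quality",
    66: "corporate goal",
}

def post_code(answer_text):
    """Return {column: 0/1} for one respondent's free-text answer."""
    text = answer_text.lower()
    return {col: int(phrase in text) for col, phrase in CATEGORIES.items()}

coded = post_code("We succeeded mainly by eliminating waste "
                  "and trying to improve quality.")
```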
Activity 1
Using the final questionnaire you prepared in Activity 2 of Unit 6, prepare a codebook for it
based on what you have learned in this section on coding.
Self-Assessment Questions – 2
5. The smallest code entry a researcher makes in a code book is a field. (True/False)
6. Several fields together can be clubbed into a record. (True/False)
7. In a data matrix, every column represents a single case. (True/False)
8. All categories formulated for data entry must be mutually exclusive. (True/ False)
9. The process of identifying and denoting a numeral to the responses given by a
respondent is called _________ .
10. In case the question is a Likert-type question with agreement/disagreement on a
five-point scale, the number of corresponding columns in the code book would be
_________ .
4. CLASSIFICATION AND TABULATION OF DATA
Classification by class intervals: Numerical data, like ratio scale data, can be classified
into class intervals. This assists the quantitative analysis of the data. For example, the age
data obtained from the sample could be reduced to homogenous grouped data: all those
below 25 form one group, those aged 25–35 are another group, and so on. Thus, each group
will have class limits, an upper and a lower limit. The difference between the limits is
termed class magnitude. One can have class intervals of both equal and unequal magnitude.
The decision on how many classes to use, and whether they should be of equal or unequal
magnitude, depends upon the judgment of the researcher. Generally, multiples of 2 or 5 are
preferred. Some researchers adopt the following formula for determining the size of the
class interval:
i = R / (1 + 3.3 log N)
where,
i = size of the class interval,
R = Range (i.e., the difference between the values of the largest item and the smallest item
among the given items), and
N = number of items in the data.
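This rule (the common Sturges-based formula, i = R / (1 + 3.3 log₁₀ N)) can be computed directly; the age data below is an invented sample of 100 observations used only to show the arithmetic:

```python
import math

def class_interval_size(values):
    """Size of class interval i = R / (1 + 3.3 * log10(N)),
    where R is the range and N is the number of items."""
    r = max(values) - min(values)
    n = len(values)
    return r / (1 + 3.3 * math.log10(n))

# 100 illustrative age observations between 18 and 77:
ages = list(range(18, 78)) + list(range(38, 78))
i = class_interval_size(ages)  # R = 59, N = 100
```

The researcher would then round the result to a convenient magnitude (a multiple of 2 or 5, as suggested above) before fixing the class limits.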
The class intervals that are decided upon could be exclusive, for example:
10–15
15–20
20–25
25–30
In this case, the upper limit of each interval is excluded from the category. Thus, we read
the first interval above as 10 and under 15, the next one as 15 and under 20, and so on.
The class intervals could also be inclusive. Here, both the lower and the upper limits are
included in the interval: it says 10–15 but means 10–15.99. It is recommended that when
one has continuous data it should be signified as 10–15.99, as then all possibilities of the
responses are exhausted. However, for discrete data, one can use 10–15.
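Placing a value into an exclusive class interval follows the lower-limit-in, upper-limit-out rule described above; this sketch uses the 10–30 limits from the example:

```python
def assign_class(value, limits=(10, 15, 20, 25, 30)):
    """Place a value into an exclusive class interval: the lower
    limit is included, the upper limit belongs to the next class."""
    for lower, upper in zip(limits, limits[1:]):
        if lower <= value < upper:
            return f"{lower}-{upper}"
    return None  # outside all intervals

# 15 falls in 15-20, not 10-15, because the intervals are exclusive.
cls = assign_class(15)
```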
Once the categories and codes have been decided upon, the researcher needs to arrange the
same according to some logical pattern. This is referred to as tabulation of data. This
involves an orderly arrangement of data into an array that is suitable for statistical analysis.
Usually, this is an orderly arrangement of rows and columns. When there is data to be
entered for one variable, the process is a simple tabulation; when there are two or more
variables, one carries out a cross-tabulation of the data. The method of cross-tabulating the
data is discussed at length in Unit 12.
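The distinction between simple tabulation and cross-tabulation can be sketched with counts over coded records. The vehicle and marital-status codes reuse the scheme described earlier in this unit (1 = motorcycle, 2 = scooter; 1 = unmarried, 2 = married); the four records are invented for illustration:

```python
from collections import Counter

records = [
    {"vehicle": 1, "marital": 2},
    {"vehicle": 1, "marital": 1},
    {"vehicle": 2, "marital": 2},
    {"vehicle": 1, "marital": 2},
]

# Simple tabulation: frequencies of one variable.
simple = Counter(r["vehicle"] for r in records)

# Cross-tabulation: frequencies of combinations of two variables.
cross = Counter((r["vehicle"], r["marital"]) for r in records)
```

Each key of `cross` is one cell of the two-way table, e.g. (1, 2) counts married motorcycle owners.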
Self-Assessment Questions – 3
Activity 2
Find some tabular data and identify whether class intervals have been used; if so, identify
whether they are exclusive or inclusive. Why do you think they have been used? Make a
code book for any of these tables.
6. SUMMARY
• Data processing refers to the preparation of the primary data that has been collected
specifically for the study.
• The researcher has to check for omissions or errors. This is the editing stage of the data
processing step. This is done first at the field and then at the central office level.
• At this stage, the research team carries out some data treatment, such as allocating
missing values, backtracking where possible and, sometimes, plugging the incomplete
data.
• Once this is completed, the researcher prepares a code book. Classification into
attributes or class intervals is carried out, and the entered data is now ready for
analysis in a tabular form.
7. GLOSSARY
• Backtracking: Returning to the field and going back to the respondents to handle
unsatisfactory responses.
• Code book: The coding scheme for all the variables under study.
• Coding: The process of identifying and denoting a numeral to the responses given by a
respondent.
• Data tabulation: The arrangement of data according to some logical pattern.
8. TERMINAL QUESTIONS
1. What is data editing? Mention its significance.
2. Distinguish between field editing and centralized in-house editing. Mention the
standard processes available to the researcher to carry out the editing process.
3. How do you code data? What guidelines should be followed to carry out the task?
Discuss by giving suitable examples.
4. Distinguish between coding closed-ended structured questions and coding open-ended
structured questions.
5. Explain the classification and tabulation of data.
9. ANSWERS
Answers to Self-Assessment Questions
1. Data editing is the process that involves detecting and correcting errors (logical
inconsistencies) in data. Refer to Section 2 for further details.
2. Field editing is the preliminary editing of the collected information done by the field
investigators or supervisors, while the second level of editing takes place at the
researcher's end. Refer to Section 2 for further details.
3. The process of identifying and denoting a numeral to the responses given by a
respondent is called coding. Refer to Section 3 for further details.
4. The method of coding for structured questions is easier as the response categories are
decided in advance. Refer to Section 3 for further details.
5. When the data is huge, the researcher may have to reduce the information into
homogenous categories. This is called the classification of data. Refer to Section 4 for
further details.
10. REFERENCES