
Clarifying Measurement and Data

Collection in Quantitative Research


LEARNING OUTCOMES
 1. Describe measurement theory and its relevant concepts
of directness of measurement, levels of measurement,
measurement error, reliability, and validity.
 2. Determine the levels of measurement—nominal,
ordinal, interval, and ratio—achieved by measurement
methods in published studies.
 3. Identify possible sources of measurement error in
published studies.
 4. Critically appraise the reliability and validity of
measurement methods in published studies.
 5. Critically appraise the accuracy, precision, and error of
physiological measures used in studies.
 6. Critically appraise the sensitivity, specificity, and
likelihood ratios of diagnostic tests.
 7. Critically appraise the measurement approaches—
physiological measures, observations, interviews,
questionnaires, and scales—used in published studies.
 8. Critically appraise the use of existing databases in studies.
 9. Critically appraise the data collection section in published
studies.
 Measurement is the process of assigning numbers
or values to individuals’ health status, objects,
events, or situations using a set of rules (Kaplan,
1963).
 In research, variables are measured with the best
possible measurement method available to
produce trustworthy data that can be used in
statistical analyses. Trustworthy data are essential
if a study is to produce useful findings to guide
nursing practice.
CONCEPTS OF MEASUREMENT THEORY

 Measurement theory guides the development and use of
measurement methods or tools in research.
 Measurement rules allow individuals to be consistent in how
they perform measurements; a measurement method used by one
person will consistently produce similar results when used by
another person.
 This section discusses some of the basic concepts and rules of
measurement theory, including directness of measurement,
levels of measurement, measurement error, reliability, and
validity
Directness of Measurement
 Direct measures involve determining the value of
concrete factors such as weight, waist circumference,
temperature, heart rate, BP, and respiration.
 Technology is available to measure many bodily functions,
biological indicators, and chemical characteristics.
 The focus of measurement in these instances is on the
accuracy and precision of the measurement method and
process (e.g., BP measurement).
 Nurse researchers are also experienced in gathering direct
measures of demographic variables such as age, gender,
ethnic origin, and diagnosis.
 In many cases in nursing, the thing to be measured
is not a concrete object but an abstract idea,
characteristic, or concept such as pain, stress,
caring, coping, depression, anxiety, and
adherence.
 Researchers cannot directly measure an abstract
idea, but they can capture some of its elements in
their measurements, which are referred to as
indirect measures or indicators of the concepts.
 Rarely, if ever, can a single measurement strategy
measure all aspects of an abstract concept.
 Therefore multiple measurement methods or indicators
are needed, and even then they cannot be expected to
measure all elements of an abstract concept
 The measurement methods of pain might include the
FACES Pain Scale, observation (rubbing and/or
guarding the area that hurts, facial grimacing, and
crying), and physiological measures, such as pulse and
blood pressure
Levels of Measurement
 Various measurement methods produce data that
are at different levels of measurement. The
traditional levels of measurement were developed
by Stevens (1946), who organized the rules for
assigning numbers to objects so that a hierarchy in
measurement was established.
 The levels of measurement, from low to high,
are nominal, ordinal, interval, and ratio.
How would you describe
nominal scale?
 Lowest of the four levels of measurement
 Categories that are not more or less but
are different from one another in some
way
 Mutually exclusive and exhaustive
categories
 Named categories

Copyright © 2015, 2011, 2007, 2003, 1999, 1995 by Saunders, an imprint of Elsevier Inc. 10
What is an example of
nominal data?
 Gender
 1 = Male
 2 = Female

Nominal-Level Measurement
 Nominal-level measurement is the lowest of the four
measurement categories.
 It is used when data can be organized into categories of a
defined property but the categories cannot be rank-ordered.
For example, you may decide to categorize potential study
subjects by diagnosis. However, the category “kidney stone,”
for example, cannot be rated higher than the category “gastric
ulcer”; similarly, across categories, “ovarian cyst” is no closer to
“kidney stone” than to “gastric ulcer.”
 The categories differ in quality but not quantity. Therefore, it
is not possible to say that subject A possesses more of the
property being categorized than subject B.
 (RULE: The categories must not be orderable.)
 (RULE: The categories must be exclusive.) Categories must be
established in such a way that each datum will fit into only one
of the categories. For example, a person who was widowed in the
past but has since remarried fits only one category when the
variable is labeled “Current Marital Status.”
 (RULE: The categories must be exhaustive.) All the data must
fit into the established categories. Consider “Cardiac Medical
Diagnoses,” with the options of “Hypertension,” “Heart Failure,”
and “Myocardial Infarction.” These options are not exhaustive
because individuals with cardiomyopathy would not know what to
mark. Researchers might add an “Other” category, making the
options exhaustive, but these data are difficult to analyze and
interpret.
 Demographic variables such as gender, race and
ethnicity, marital status, and diagnoses are
examples of nominal data.
 The rules for the four levels of measurement
are summarized in Figure 10-2, page 336.
What is an ordinal scale?
 Order/ranking imposed on categories
 Numbers must preserve order
 1 = Tallest
 2 = Next tallest
 3 = Third tallest

Ordinal-Level Measurement
 With ordinal-level measurement, data are assigned
to categories that can be ranked.
 (RULE: The categories can be ranked) To rank data,
one category is judged to be (or is ranked) higher or
lower, or better or worse, than another category.
 Rules govern how the data are ranked. As with
nominal data, the categories must be exclusive
(each datum fits into only one category) and
exhaustive (all data fit into at least one category).
 With ordinal data, the quantity also can be identified
(Stevens, 1946). For example, if you are measuring intensity
of pain, you may identify different levels of pain. You
probably will develop categories that rank these different
levels of pain, such as excruciating, severe, moderate, mild,
and no pain. However, in using categories of ordinal
measurement, you cannot know with certainty that the
intervals between the ranked categories are equal. A
greater difference may exist between mild and moderate
pain, for example, than between excruciating and severe
pain. Therefore ordinal data are considered to have
unequal intervals.
 Many scales used in nursing research are ordinal levels
of measurement. For example, it is possible to rank
degrees of coping, levels of mobility, ability to provide
self-care, or levels of dyspnea on an ordinal scale. For
dyspnea with activities of daily living (ADLs), the scale
could be:
 0-no shortness of breath with ADLs
 1-minimal shortness of breath with ADLs
 2-moderate shortness of breath with ADLs
 3-extreme shortness of breath with ADLs
 4-shortness of breath so severe the person is unable to
perform ADLs without assistance
 The measurement is ordinal because it is not
possible to claim that equal distances exist
between the rankings. A greater difference
may exist between the ranks of 1 and 2 than
between the ranks of 2 and 3.
What is an interval scale?
 Numerical distances between intervals
 Absence of a zero point
 Likert scale scores
 1 = Strongly disagree
 2 = Disagree
 3 = Neutral
 4 = Agree
 5 = Strongly agree

Interval-Level Measurement

 Interval-level measurement uses interval scales, which
have equal numerical distances between intervals.
 These scales follow the rules of mutually exclusive,
exhaustive, and ranked categories and are assumed to
represent a continuum of values.
 (RULE: The categories must have equal intervals
between them .) Therefore the magnitude of the
attribute can be more precisely defined. However, it is
not possible to provide the absolute amount of the
attribute, because the interval scale lacks a zero point.
 Temperature is the most commonly used
example of an interval scale. The difference
between the temperatures of 70°F and 80°F is
10°F and is the same as the difference
between the temperatures of 30°F and 40°F.
Changes in temperature can be measured
precisely.
 However, a temperature of 0°F does not
indicate the absence of temperature.
What is a ratio scale?
 Highest for measurement
 Continuum of values
 Absolute zero point

What is an example of ratio data?
 Test scores
 1 = Lowest third percentile
 2 = Middle third percentile
 3 = Top third percentile

Ratio-Level Measurement
 Ratio-level measurement is the highest form of
measurement and meets all the rules of other forms
of measurement—mutually exclusive categories,
exhaustive categories, ordered ranks, equally
spaced intervals, and a continuum of values.
 Interval- and ratio-level data can be added,
subtracted, multiplied, and divided because of the
equal intervals and continuum of values of these
data.
 Interval and ratio data can be analyzed with statistical
techniques of greater precision and strength to
determine significant relationships and differences.
 Ratio-level measures have absolute zero points. (RULE:
The data must have an absolute zero.) Weight, length, and
volume are commonly used as examples of ratio scales.
All three have absolute zeros, at which a value of zero
indicates the absence of the property being measured;
zero weight means the absence of weight.
Summary of the Rules for Levels of
Measurement
Nominal: Exclusive categories; Exhaustive categories
Ordinal: Ranked categories; Exclusive categories; Exhaustive categories
Interval: Equal interval categories; Ranked categories; Exclusive categories; Exhaustive categories
Ratio: Absolute zero; Equal interval categories; Ranked categories; Exclusive categories; Exhaustive categories
Audience Response Question

Grades on a multiple-choice final exam are an
example of which level of measurement?

A. Ordinal
B. Interval
C. Nominal
D. Ratio

What is reference measurement?
 Norm-referenced testing
 Tests performance standards that have been
carefully developed over years with large,
representative samples using a standardized
test with extensive reliability and validity

Measurement Error
 Measurement error is the difference between the true
measure and what is actually measured (Grove, Burns, &
Gray, 2013).
 The amount of error in a measure varies from considerable
error in one measurement to very little in another.
 Measurement error exists with direct and indirect
measures.
 With direct measures, both the object and measurement
method are visible. Direct measures, which generally are
expected to be highly accurate, are subject to error. For
example, a weight scale may be inaccurate by 0.5 pound.
 With indirect measures, the element being
measured cannot be seen directly. For example,
you cannot see pain. You may observe behaviors or
hear words that you think represent pain, but pain
is a sensation that is not always clearly recognized
or expressed by the person experiencing it.
 The measurement of pain is usually conducted
with a scale but can also include observation and
physiological measures.
 Efforts to measure concepts such as pain usually result
in measuring only part of the concept. Sometimes
measures may identify some aspects of the concept but
may include other elements that are not part of the
concept. For example, measurement methods for pain
might be measuring aspects of anxiety and fear in
addition to pain. However, using multiple methods to
measure a concept or variable usually decreases the
measurement error and increases the understanding
of the concept being measured.
Types of Error
 Random error
 Systematic error
 The difference between random and systematic
error is in the direction of the error.
 In random measurement error, the difference
between the measured value and the true value is
without pattern or direction (random).
 In one measurement, the actual value obtained
may be lower than the true value, whereas in the
next measurement, the actual value obtained may
be higher than the true value.
 A number of chance situations or factors can
occur during the measurement process that
can result in random error. For example, the
person taking the measurements may not
use the same procedure every time, a subject
completing a paper and pencil scale may
accidentally mark the wrong column, or the
person entering the data into a computer
may punch the wrong key.
 The purpose of measuring is to estimate the
true value, usually by combining a number of
values and calculating an average. An
average value, such as the mean, is a closer
estimate of the true measurement. As the
number of random errors increases, the
precision of the estimate decreases
 In systematic measurement error, the variation in
measurement values from the calculated average is
primarily in the same direction. For example, most of
the variation may be higher or lower than the
average that was calculated.
 Systematic error occurs because something else is
being measured in addition to the concept. For
example, a paper and pencil rating scale designed to
measure hope may actually also be measuring
perceived support.
 When measuring subjects’ weights, a scale that shows
weights that are 2 pounds over the true weights will
give measures with systematic error. All the measured
weights will be high, and as a result the mean will be
higher than if an accurate weight scale were used.
 Some systematic error occurs in almost any measure.
Because of the importance of this type of error in a
study, researchers spend considerable time and effort
refining their instruments to minimize systematic
measurement error (Waltz et al., 2010).
 The measurement errors for BP readings can
be minimized by checking the BP cuff and
sphygmomanometer for accuracy and
recalibrating them periodically during data
collection, obtaining three BP readings and
averaging them to determine one BP reading
for each subject, and having a trained nurse
use a protocol to take the BP readings.
 If a checklist of pain behaviors is developed for
observation, less error occurs than if the observations
for pain are unstructured. Measurement will also be
more precise if researchers use a well-developed,
reliable, and valid scale, such as the FACES Pain Scale,
instead of developing a new pain scale for their study.
In published studies, look for the steps that
researchers have taken to decrease measurement
error and increase the quality of their study findings.
Audience Response Question

Which of the following is an example of random
measurement error?

A. Actual measures smaller than true measure
B. Including elements of hope in a measure of self-concept
C. Measuring blood sugar immediately after breakfast
D. Punching the wrong key when entering data into the
computer

Reliability

 Reliability is concerned with the consistency of a
measurement method. For example, if you are
using a paper and pencil scale to measure
depression, it should indicate similar depression
scores each time a subject completes it within a
short period of time. A scale that does not produce
similar scores for a subject with repeat testing is
considered unreliable and results in increased
measurement error.
DETERMINING THE QUALITY OF
MEASUREMENT METHODS
QUALITY INDICATOR | DESCRIPTION
Reliability | Test-retest reliability: Repeated measures with a scale or
instrument to determine the consistency or stability of the
instrument in measuring a concept
Alternate forms reliability: Comparison of two paper and
pencil instruments to determine their equivalence in
measuring a concept
DETERMINING THE QUALITY
OF MEASUREMENT METHODS
QUALITY INDICATOR | DESCRIPTION
Reliability | Interrater reliability: Comparison of two observers or
judges in a study to determine their equivalence in making
observations or judging events
Homogeneity or internal consistency reliability:
Reliability testing used primarily with multi-item scales in
which each item on the scale is correlated with all other
items to determine the consistency of the scale in
measuring a concept
What is validity?

It is the extent to which an instrument reflects
the concept being examined.

DETERMINING THE QUALITY OF
MEASUREMENT METHODS
QUALITY INDICATOR | DESCRIPTION

Validity | Content validity: Examines the extent to which a measurement method
includes all the major elements relevant to the concept being measured.

Evidence of validity from contrasting groups: Instrument or scale given to
two groups expected to have opposite or contrasting scores; one group scores
high on the scale and the other scores low.
DETERMINING THE QUALITY
OF MEASUREMENT METHODS
QUALITY INDICATOR | DESCRIPTION

Validity | Evidence of validity from convergence: Two scales measuring the same
concept are administered to a group at the same time, and the subjects’ scores
on the scales should be positively correlated. For example, subjects completing
two scales to measure depression should have positively correlated scores.

Evidence of validity from divergence: Two scales that measure opposite
concepts, such as hope and hopelessness, are administered to subjects at the
same time and should result in negatively correlated scores on the scales.
DETERMINING THE QUALITY
OF MEASUREMENT METHODS
QUALITY INDICATOR | DESCRIPTION

Readability | Readability level: Conducted to determine the
participants’ ability to read and comprehend the items
on an instrument. Researchers need to report the level
of education that subjects need to read the instrument.
Readability must be appropriate to promote reliability
and validity of an instrument.
DETERMINING THE QUALITY
OF MEASUREMENT METHODS
QUALITY INDICATOR | DESCRIPTION

Precision | Precision of physiological measure: Degree of
consistency or reproducibility of the measurements
made with physiological instruments or equipment;
comparable to reliability for paper and pencil scales.
DETERMINING THE QUALITY
OF MEASUREMENT METHODS
QUALITY INDICATOR | DESCRIPTION
Accuracy | Accuracy of physiological measure: Addresses the extent
to which the physiological instrument or equipment
measures what it is supposed to measure in a study;
comparable to validity for paper and pencil scales.
Reliability Testing

 Reliability testing is a measure of the amount of
random error in the measurement technique. It
takes into account such characteristics as
dependability, precision, stability, consistency,
and reproducibility.
 Because all measurement techniques contain
some random error, reliability exists in degrees
and usually is expressed as a correlation
coefficient (r).
 Cronbach’s alpha coefficient is the most commonly
used measure of reliability for scales with multiple
items .
 Estimates of reliability are specific to the sample being
tested. Thus high reliability values reported for an
established instrument do not guarantee that reliability
will be satisfactory in another sample or with a different
population.
 Researchers need to perform reliability testing on each
instrument used in a study to ensure that it is reliable
for that study
Test-Retest Reliability: Stability
 Reliability testing focuses on the following three aspects of
reliability—stability, equivalence, and homogeneity
 Stability is concerned with the consistency of repeated
measures of the same attribute with the use of the same scale or
instrument. It is usually referred to as test-retest reliability. This
measure of reliability is generally used with physical measures,
technological measures, and paper and pencil scales.
 Use of the technique requires an assumption that the factor to
be measured remains the same at the two testing times and that
any change in the value or score is a consequence of random
error
 For example, physiological measures such as BP equipment can be tested
and then immediately retested, or the equipment can be used for a time
and then retested to determine the necessary frequency of recalibration.
 Researchers need to include test-retest reliability results in their published
studies to document the reliability of their measurement methods.
 For example, the CES-D (see Figure 10-3) has been used frequently in
nursing studies over the years and has demonstrated test-retest reliability
ranging from r = 0.51 to 0.67 in 2- to 8-week intervals. This is very solid
test-retest reliability for this scale, indicating that it is consistently
measuring depression with repeat testing while recognizing that
subjects’ levels of depression vary somewhat over time.
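Test-retest reliability of the kind reported for the CES-D can be sketched as a Pearson correlation between two administrations of the same scale. The scores below are invented for illustration, and `pearson_r` is a hypothetical helper, not a function from the text.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

time1 = [12, 25, 8, 30, 17, 21, 9, 27]   # depression scores at first testing
time2 = [14, 22, 10, 28, 15, 24, 11, 25] # scores from the same subjects 2 weeks later

# A value near 1.0 indicates stable, consistent measurement across testings.
print(round(pearson_r(time1, time2), 2))
```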
Equivalence: Interrater Reliability and Alternate Forms Reliability
 Equivalence involves the comparison of two
versions of the same paper and pencil instrument or
of two observers measuring the same event.
 Comparison of two observers or two judges in a
study is referred to as interrater reliability.
 Studies that include collecting observational data
or the making of judgments by two or more data
gatherers require the reporting of interrater
reliability. There is no absolute value below which
interrater reliability is unacceptable.
What is interrater reliability?
 Consistency in raters
 % = number of behaviors performed/total
number of behaviors
 Any value below 0.80 should generate
serious concern about the reliability of the
data, data gatherer, or both.
 The interrater reliability value is best at
0.90 (90%) or higher, which means 90%
reliability and 10% random error.
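The percent-agreement formula above amounts to counting matching judgments. This is a minimal sketch with invented ratings; `rater_a` and `rater_b` are hypothetical names for two observers scoring whether each of 20 behaviors occurred.

```python
# 1 = behavior observed, 0 = not observed, for the same 20 events
rater_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
rater_b = [1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1]

# % = number of agreements / total number of observations
agreements = sum(a == b for a, b in zip(rater_a, rater_b))
interrater_reliability = agreements / len(rater_a)

print(interrater_reliability)  # 0.9, i.e., 90% agreement and 10% random error
```

By the guideline above, 0.90 is acceptable; a value below 0.80 would call the data or the data gatherers into question.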
 Comparison of two paper and pencil instruments is referred
to as alternate forms reliability, or parallel forms reliability.
 Alternative forms of instruments are of more concern in the
development of normative knowledge testing such as the
Scholastic Aptitude Test (SAT), which is used as a college
entrance requirement. The SAT has been used for decades,
and there are many forms of this test, with a variety of items
included on each. These alternate forms of the SAT were
developed to measure students’ knowledge consistently and
protect the integrity of the test
Homogeneity

 Homogeneity is a type of reliability testing used
primarily with paper and pencil instruments or
scales to address the correlation of each question
to the other questions within the scale.
 Questions on a scale are also called items.
 The principle is that each item should be
consistently measuring a concept such as
depression and so should be highly correlated with
the other items.
Homogeneity

 Homogeneity testing examines the extent to which all
the items in the instrument consistently measure the
construct and is a test of internal consistency.
 The statistical procedure used for this process is
Cronbach’s alpha coefficient for interval- and ratio-level
data.
 On some scales, the person responding selects between
two options, such as yes and no. The resulting data are
dichotomous, and the Kuder-Richardson formula (K-R
20) is used to estimate internal consistency.
 A Cronbach’s alpha coefficient of 1.00 indicates
perfect reliability, and a coefficient of 0.00
indicates no reliability.
 A reliability of 0.80 is usually considered a strong
coefficient for a scale that has documented
reliability and has been used in several studies.
 For new scales, a reliability of 0.70 is considered
acceptable because the scale is still being refined and
used with a variety of samples.
 Stronger correlation coefficients, which are
closer to 1.0, indicate less random error and a
more reliable scale.
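As a rough illustration of how Cronbach’s alpha summarizes item-to-item consistency, here is a minimal sketch using the standard alpha formula, alpha = (k / (k − 1)) × (1 − sum of item variances / variance of total scores). The 5 subjects and 4 Likert items are invented for illustration.

```python
def variance(values):
    """Sample variance (n - 1 denominator)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Rows = subjects, columns = items (1-5 Likert responses, invented)
scores = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
]

k = len(scores[0])                                   # number of items
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
total_var = variance([sum(row) for row in scores])   # variance of summed scale scores
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)

print(round(alpha, 2))  # → 0.94, a strong coefficient by the guideline above
```

Items that all track the same construct inflate the variance of the total score relative to the item variances, pushing alpha toward 1.0.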
 A research report needs to include the results from
stability, equivalence, and/or homogeneity
reliability testing done on a measurement method
from previous research and in the present study.
 A measurement method must be reliable if it is to
be considered a valid measure for a study concept.
Validity

 The validity of an instrument is a determination of
how well the instrument reflects the abstract
concept being examined. Validity, like reliability, is
not an all or nothing phenomenon; it is measured
on a continuum.
 No instrument is completely valid, so researchers
determine the degree of validity of an instrument
rather than whether validity exists.
Validity

 Validity will vary from one sample to another and one
situation to another; therefore validity testing evaluates
the use of an instrument for a specific group or purpose,
rather than the instrument itself.
 An instrument may be valid in one situation but not
another. For example, the CES-D was developed to
measure the depression of patients in mental health
settings. Will the same scale be valid as a measure of the
depression of cancer patients? Researchers determine
this by pilot testing the scale to examine the validity of
the instrument in a new population.
 Validity is considered a single broad method of
measurement evaluation, referred to as construct
validity, and includes content and predictive
validity.
 Content validity examines the extent to which the
measurement method or scale includes all the
major elements or items relevant to the construct
being measured.
 The evidence for content validity of a scale
includes the following:
 (1) how well the items of the scale reflect the
description of the concept in the literature
 (2) the content experts’ evaluation of the
relevance of items on the scale that might be
reported as an index
 (3) the potential subjects’ responses to the scale
items.
 Paper and pencil and electronic instruments or
scales must be at a level that potential study
subjects can read and understand. Readability level
focuses on the study participants’ ability to read
and comprehend the content of an instrument or scale.
 Readability is essential if an instrument is to be
considered valid and reliable for a sample.
Assessing the readability of an instrument is
relatively simple and takes about 10 to 15 minutes.
 More than 30 readability formulas are
available. These formulas use counts of
language elements to provide an index of the
probable degree of difficulty of
comprehending the scale (Grove et al., 2013).
 Readability formulas are now a standard
part of word-processing software.
 Three common types of validity presented in
published studies include evidence of validity
from:
(1) contrasting groups
(2) convergence
(3) divergence
contrasting groups
 An instrument’s evidence of validity from contrasting
groups can be tested by identifying groups that are
expected (or known) to have contrasting scores on an
instrument. For example, researchers select samples
from a group of individuals with a diagnosis of depression
and a group that does not have this diagnosis. You would
expect these two groups of individuals to have
contrasting scores on the CES-D. The group with the
diagnosis of depression would be expected to have higher
scores than those without the depression diagnosis,
which would add to the construct validity of this scale.
convergence
 Evidence of validity from convergence is determined when
a relatively new instrument is compared with an existing
instrument(s) that measures the same construct.
 The instruments, the new and existing ones, are
administered to a sample at the same time, and the results
are evaluated with correlational analyses. If the measures
are strongly positively correlated, the validity of each
instrument is strengthened. For example, the CES-D has
shown positive correlations ranging from 0.40 to 0.80 with
the Hamilton Rating Scale for Depression, which supports
the convergent validity of both scales (Locke & Putnam,
2002; Sharp & Lipsky, 2002).
divergence
 Sometimes instruments can be located that measure a concept
opposite to the concept measured by the newly developed
instrument. For example, if the newly developed instrument is a
measure of hope, you could make a search for an instrument
that measures hopelessness or despair. Having study
participants complete both these scales is a way to examine
evidence of validity from divergence. Correlational procedures
are performed with the measures of the two concepts. If the
divergent measure (hopelessness scale) is negatively correlated
(such as −0.4 to −0.8) with the other instrument (hope scale),
validity for each of the instruments is strengthened (Waltz et al.,
2010).
 The evidence of an instrument’s validity from
previous research and the current study needs to be
included in the published report. In critically
appraising a study, you need to judge the validity of
the measurement methods that were used.
However, you cannot consider validity apart from
reliability.
 If a measurement method does not have acceptable
reliability or is not consistently measuring a
concept, then it is not valid.
ACCURACY, PRECISION, AND ERROR OF PHYSIOLOGICAL
MEASURES

 Physiological measures are measurement methods
used to quantify the level of functioning of living
beings.
 The precision, accuracy, and error of physiological
and biochemical measures tend not to be reported
or are minimally covered in published studies.
 These routine physiological measures are assumed
to be accurate and precise, an assumption that is
not always correct.
ACCURACY, PRECISION, AND ERROR OF PHYSIOLOGICAL
MEASURES

 Some of the most common physiological measures
used in nursing studies include BP, heart rate, weight,
body mass index, and laboratory values.
 Sometimes researchers obtain these measures from the
patient’s record, with no consideration given to their
accuracy. For example, how many times have you heard
a nurse ask a patient his or her height or weight, rather
than measuring or weighing the patient? Thus
researchers using physiological measures need to
provide evidence of the measures’ accuracy, precision,
and potential for error.
Accuracy
 Accuracy is comparable to validity in that it
addresses the extent to which the instrument
measures what it is supposed to measure in a study
(Ryan-Wenger, 2010).
 For example, oxygen saturation measurements
with pulse oximetry are considered comparable
with measures of oxygen saturation with arterial
blood gases.
Accuracy
 Because pulse oximetry is an accurate measure of
oxygen saturation, it has been used in studies
because it is easier, less expensive, less painful, and
less invasive for research participants.
 Researchers need to document that previous
research has been conducted to determine the
accuracy of pulse oximetry for the measurement of
individuals’ oxygen saturation levels in their study.
Precision
 Precision is the degree of consistency or reproducibility
of measurements made with physiological instruments.
 Precision is comparable to reliability.
 The precision of most physiological equipment depends
on following the manufacturer’s instructions for care
and routine testing of the equipment.
 Test-retest reliability is appropriate for physiological
variables that have minimal fluctuations, such as
cholesterol (lipid) levels, bone mineral density, or
weight of adults
Precision
 Test-retest reliability can be inappropriate if the
variables’ values frequently fluctuate with
various activities, such as with pulse,
respirations, and BP. However, test-retest is a
good measure of precision if the measurements
are taken in rapid succession. For example, the
national BP guidelines encourage taking three
BP readings 1 to 2 minutes apart and then
averaging them to obtain the most precise and
accurate measure of BP
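The three-reading guideline just described amounts to averaging repeated measurements; a minimal Python sketch, with the readings themselves hypothetical:

```python
# Hypothetical systolic readings (mmHg) taken 1-2 minutes apart,
# averaged per the national BP guideline described above.
readings_systolic = [122, 118, 120]

mean_bp = sum(readings_systolic) / len(readings_systolic)
print(mean_bp)  # 120.0
```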
Error
 Sources of error in physiological measures can be
grouped into the following five categories:
environment, user, subject, equipment, and
interpretation.
 The environment affects the equipment and subject.
Environmental factors might include temperature,
barometric pressure, and static electricity.
 User errors are caused by the person using the
equipment and may be associated with variations by
the same user, different users, or changes in supplies
or procedures used to operate the equipment.
Error
 Subject errors occur when the subject alters the
equipment or the equipment alters the subject. In
some cases, the equipment may not be used to its
full capacity.
 Equipment error may be related to calibration or
the stability of the equipment. Signals transmitted
from the equipment are also a source of error and
can result in misinterpretation.
 Researchers need to report the protocols followed or
steps taken to prevent errors in their physiological
and biochemical measures in their published studies
CRITICAL APPRAISAL GUIDELINES
 Directness, Level of Measurement, Reliability, and
Validity of Scales, Accuracy, Precision, and Error of
Physiological Measures
 In critically appraising a published study, you need to
determine the directness and level of measurement,
reliability and validity of scales, accuracy and precision
of physiological measures, and potential measurement
error for the different measurement methods used in a
study. In most studies, the methods section includes a
discussion of measurement methods, and you can use
the following questions to evaluate them:
 1. What measurement method(s) were used to
measure each study variable?
 2. Was the type of measurement direct or indirect?
 3. What level of measurement was achieved for
each of the study variables?
 4. Was reliability information provided from
previous studies and for this study?
 5. Was the validity of each measurement method adequately
described? In some studies, researchers simply state that the
measurement method has acceptable validity based on
previous research. This statement provides insufficient
information for you to judge the validity of an instrument.
 6. Did the researchers address the accuracy, precision, and
potential for errors with the physiological measures?
 7. Was the process for obtaining, scoring, and/or recording
data described?
 8. Did the researchers provide adequate description of the
measurement methods to judge the extent of measurement
error?
USE OF SENSITIVITY, SPECIFICITY, AND LIKELIHOOD RATIOS TO
DETERMINE THE QUALITY OF DIAGNOSTIC AND SCREENING TESTS
Sensitivity and Specificity
 An important part of EBP is the use of quality diagnostic
and screening tests to determine the presence or absence
of disease
 Clinicians want to know which laboratory or imaging
study to order to help screen for or diagnose a disease.
When the test is ordered, are the results valid or accurate?
 The accuracy of a screening test or a test used to confirm
a diagnosis is evaluated in terms of its ability to assess the
presence or absence of a disease or condition correctly as
compared with a gold standard.
Sensitivity and Specificity
 The gold standard is the most accurate means of
currently diagnosing a particular disease and
serves as a basis for comparison with newly
developed diagnostic or screening tests.
 Sensitivity and specificity are the terms used to
describe the accuracy of a screening or
diagnostic test
 There are four possible outcomes of a screening
test for a disease: (1) true positive, which is an
accurate identification of the presence of a
disease
 (2) false positive, which indicates that a disease is
present when it is not
 (3) true negative, which indicates accurately that
a disease is not present
 (4) false negative, which indicates that a disease
is not present when it is.
 You can calculate sensitivity and specificity
based on research findings and clinical
practice outcomes to determine the most
accurate diagnostic or screening tool to use
when identifying the presence or absence of
a disease for a population of patients. The
calculations for sensitivity and specificity are
provided as follows:
RESULTS OF SENSITIVITY AND
SPECIFICITY OF SCREENING TESTS

DIAGNOSTIC TEST RESULT   DISEASE PRESENT      DISEASE NOT PRESENT OR ABSENT   TOTAL
Positive test            a (true positive)    b (false positive)              a + b
Negative test            c (false negative)   d (true negative)               c + d
Total                    a + c                b + d                           a + b + c + d

 Sensitivity calculation = probability of disease = a/(a + c) = true-positive rate
 Specificity calculation = probability of no disease = d/(b + d) = true-negative rate
 Sensitivity is the proportion of patients with the
disease who have a positive test result, or true
positive.
 The researcher or clinician might refer to the test
sensitivity in the following ways:
•A highly sensitive test is very good at identifying
the disease in a patient.
• If a test is highly sensitive, it has a low percentage
of false negatives.
 Specificity is the proportion of patients without
the disease who have a negative test result, or true
negative.
 The researcher or clinician might refer to the test
specificity in the following ways:
•A highly specific test is very good at identifying
the patients without a disease.
• If a test is very specific, it has a low percentage of
false positives.
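Using the cell labels a through d from the table above, sensitivity and specificity can be computed directly; a minimal Python sketch with hypothetical screening counts:

```python
def sensitivity(a, c):
    """True-positive rate: proportion of diseased patients with a positive test."""
    return a / (a + c)

def specificity(b, d):
    """True-negative rate: proportion of disease-free patients with a negative test."""
    return d / (b + d)

# Hypothetical results: a=90 true positives, b=40 false positives,
# c=10 false negatives, d=160 true negatives
print(sensitivity(90, 10))   # 0.9 -> 90% of diseased patients test positive
print(specificity(40, 160))  # 0.8 -> 80% of disease-free patients test negative
```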
CRITICAL APPRAISAL GUIDELINES
 Sensitivity and Specificity of Diagnostic and
Screening Tests
 When critically appraising a study, you need to
judge the sensitivity and specificity of the
diagnostic and screening tests used in the study.
1. Was a diagnostic or screening test used in a
study?
2. Are the sensitivity and specificity values provided
for the diagnostic or screening test from previous
studies and for this study’s population?
Likelihood Ratios
 Likelihood ratios (LRs) are additional calculations
that can help researchers determine the accuracy of
diagnostic or screening tests, which are based on the
sensitivity and specificity results.
 The LRs are calculated to determine the likelihood
that a positive test result is a true positive and that a
negative test result is a true negative.
 The ratio of the true-positive results to false-positive
results is known as the positive likelihood ratio .
 The positive LR is calculated as follows, using
the data from the Sarikaya et al. (2010) study:
Positive LR = sensitivity / (100% - specificity)
 The negative likelihood ratio is the ratio of true-
negative results to false-negative results and is
calculated as follows:
 Negative LR = (100% - sensitivity) / specificity
 The very high LRs (or those that are >10) rule in the
disease or indicate that the patient has the disease.
 The very low LRs (or those that are <0.1) almost rule
out the disease or indicate that the patient does not
have the disease.
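The two LR formulas given above, which take sensitivity and specificity as percentages, can be sketched in Python; the 90%/80% test characteristics are hypothetical:

```python
def positive_lr(sensitivity_pct, specificity_pct):
    """Positive LR = sensitivity / (100% - specificity), per the formula above."""
    return sensitivity_pct / (100 - specificity_pct)

def negative_lr(sensitivity_pct, specificity_pct):
    """Negative LR = (100% - sensitivity) / specificity, per the formula above."""
    return (100 - sensitivity_pct) / specificity_pct

# Hypothetical test with 90% sensitivity and 80% specificity:
print(positive_lr(90, 80))  # 4.5   -> evidence toward the disease
print(negative_lr(90, 80))  # 0.125 -> evidence against the disease
```

A positive LR above 10 or a negative LR below 0.1 would, by the rules of thumb above, strongly rule the disease in or out; this hypothetical test falls between those thresholds.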
 Understanding sensitivity, specificity, and LR
increases your ability to read clinical studies and
determine the most accurate diagnostic test to use in
clinical practice
MEASUREMENT STRATEGIES IN NURSING
 The most common measurement methods used
in nursing research include:
 physiological measures
 observational measurement
 interviews
 questionnaires
 scales
Physiological Measures
 An increased need for ways to measure the
outcomes of nursing care has generated
more nursing studies that include
physiological measures.
 The outcome of interest may be the outcome
of all nursing care received for a particular
care episode or the outcome of a particular
nursing intervention.
Physiological Measures
 An important focus of physiological
measurement is finding a means to quantify
changes, directly or indirectly, that occur in
physiological variables as a result of nursing
care.
 This upsurge of interest in outcome
measures has broadened the base of
physiological research beyond nurse
physiologists to include nurse clinicians
 A variety of approaches for obtaining physiological measures
are possible. Some measurements are relatively easy to
obtain and are an extension of the measurement methods
used in nursing practice, such as those used to obtain weight
and BP.
 Other measurements are not difficult to obtain, but the
methods sometimes require an imaginative approach. For
example, some physiological measures are obtained by using
self-report with diaries, scales, or observation checklists,
and other physiological measures are obtained using
laboratory tests and electronic monitoring.
Observational Measurement
 Observational measurement involves an interaction
between the study participants and observer(s), in
which the observer has the opportunity to watch the
participant perform in a specific setting.
 Observation is often used to collect data in qualitative
studies, and it is usually unstructured.
 Unstructured observations involve spontaneously
observing and recording what is seen in words. The
analysis of these data may lead to a more structured
observation and an observational checklist
Structured Observational Measurement
 The researcher carefully defines what he or she will observe and how
the observations are to be made, recorded, and coded as numbers
 For observations to be structured, researchers will develop a
category system for organizing and sorting the behaviors or events
being observed.
 Checklists are often used to indicate whether a behavior occurred.
 Rating scales allow the observer to rate the behavior or event.
 This provides more information for analysis than dichotomous data,
which indicate only whether or not the behavior occurred.
 However, observational measurement tends to be more subjective than
other methods and is often considered less credible; thus,
reporting interrater reliability of those doing the observations is
essential
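Percent agreement is one simple interrater reliability index for the checklist data described above; a minimal Python sketch with hypothetical codes from two observers (chance-corrected indices such as Cohen's kappa are also commonly reported):

```python
# Hypothetical checklist data: two observers coding the same 10 events
# (1 = behavior occurred, 0 = behavior did not occur).
observer_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
observer_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Count events on which the two observers agree, then convert to a percentage.
agreements = sum(a == b for a, b in zip(observer_a, observer_b))
percent_agreement = agreements / len(observer_a) * 100
print(percent_agreement)  # 80.0
```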
CRITICAL APPRAISAL GUIDELINES Observational
Measurement
 When critically appraising observational
measures, consider the following questions:
1. Is the object of observation clearly identified
and defined?
2. Are the techniques for recording observations
described?
3. Is interrater reliability for the observers
described?
Interviews
 An interview involves verbal communication between the
researcher and subject, during which information is provided to
the researcher.
 Although this data collection strategy is most commonly used in
qualitative and descriptive studies, it also can be used in other
types of quantitative studies.
 You can use a variety of approaches to conduct an interview,
ranging from a totally unstructured interview , in which the
content is controlled by the study participant, to a structured
interview, in which the content is similar to that of a
questionnaire, with the possible responses to questions carefully
designed by the researcher
 During structured interviews, researchers use
strategies to control the content of the interview.
Usually, researchers ask specific questions and
enter the participant’s responses onto a rating
scale or paper and pencil instrument during the
interview. For example, researchers could use an
in-person or telephone interview to obtain
responses to an instrument. Researchers might
also enter responses into an electronic database.
 The response rate for interviews is higher than for questionnaires,
which usually allows a more representative sample to be obtained.
 Interviewing also allows collection of data from participants who
are unable or unlikely to complete questionnaires, such as those
who are very ill or may have limited ability to read, write, and
express themselves.
 Interviews are a form of self-report, and it must be assumed that
the information provided is accurate.
 Because of time and cost, sample size is usually limited.
Participant bias is always a threat to the validity of the findings, as
is inconsistency in data collection from one subject to another .
CRITICAL APPRAISAL GUIDELINES Structured
Interviews
 When critically appraising interviews conducted in studies,
you need to consider the following questions:
1. For structured interviews, what guided the interview
process?
2. Are the interview questions relevant for the research
purpose?
3. Does the design indicate the process for conducting the
interviews?
4. If multiple interviewers are used to gather data, how were
these individuals trained, and what consistency was
achieved for the interview process?
5. Do the questions tend to bias subjects’ responses?
Questionnaires
 A questionnaire is a self-report form designed
to elicit information through written, verbal,
or electronic responses of the subject.
 Questionnaires may be printed and
distributed in person or mailed, available on a
computer, or accessed online.
 Questionnaires are sometimes referred to as
surveys, and a study using a questionnaire
may be referred to as survey research.
Questionnaires
 The information obtained from questionnaires is
similar to that obtained by an interview, but the
questions tend to have less depth.
 The subject is not permitted to elaborate on
responses or ask for clarification of questions, and
the data collector cannot use probing strategies.
However, questions are presented in a consistent
manner to each subject, and opportunity for bias is
less than in an interview.
 Questionnaires often are used in descriptive
studies to gather a broad spectrum of information
from subjects, such as facts about the subject or
facts about persons, events, or situations known by
the subject.
 Questionnaires are also used to gather information about beliefs,
attitudes, opinions, knowledge, or intentions of
the subjects.
 Questionnaires are often developed for a particular
study to enable researchers to gather data from a
selected population in a new area of study.
 Some questionnaires have open-ended
questions, which require written responses
(qualitative data) from the subject. Other
questionnaires have closed ended questions,
which have limited options from which
participants can select their answers
 The response rate for questionnaires
generally is lower than that for other forms of
self-report, particularly if the questionnaires
are mailed.
 If the response rate is lower than 50%, the
representativeness of the sample is seriously
in question.
 The response rate for mailed questionnaires is usually
small (25% to 40%), so researchers frequently are
unable to obtain a representative sample, even with
random sampling methods.
 Questionnaires distributed via the Internet are more
convenient for subjects, which may result in a higher
response rate than questionnaires that are mailed.
Many researchers are choosing the Internet format if
they have access to the potential subjects’ e-mail
addresses
CRITICAL APPRAISAL GUIDELINES Questionnaires
 When critically appraising a questionnaire in a published
study, consider the following questions:
1. Does the questionnaire address the focus of the study
outlined in the study purpose and/or objective,
questions, or hypotheses? Examine the description of
the contents of the questionnaire in the measurement
section of the study.
2. Does the study provide information on content-related
validity for the questionnaire?
3. Was the questionnaire implemented consistently from
one subject to another?
Scales
 The scale, a form of self-report, is a more precise
means of measuring phenomena than a
questionnaire.
 Most scales are developed to measure
psychosocial variables, but researchers also use
scaling techniques to obtain self-reports on
physiological variables such as pain, nausea, or
functional capacity.
 The various items on most scales are summed to
obtain a single score.
Scales
 These are termed summated scales. Fewer
random and systematic errors occur when the
total score of a scale is used.
 The various items in a scale increase the
dimensions of the concept that are measured by
the instrument.
 The three types of scales described in this section
that are commonly used in nursing research are
rating scales, Likert scales, and visual analog
scales.
Rating Scales
 Rating scales are the crudest form of measurement
involving scaling techniques.
 A rating scale lists an ordered series of categories of a
variable that are assumed to be based on an underlying
continuum.
 A numerical value is assigned to each category, and the
fineness of the distinctions between categories varies
with the scale.
 Rating scales are commonly used by the general public.
In conversations, one can hear statements such as “On
a scale of 1 to 10, I would rank that. ... ”.
Rating Scales
 Rating scales are fairly easy to develop, but
researchers need to be careful to avoid end
statements that are so extreme that no subject
will select them.
 You can use a rating scale to rate the degree of
cooperativeness of the patient or the value placed
by the subject on nurse-patient interactions.
Rating scales are also used in observational
measurement to guide data collection.
 Some rating scales are more valid than others
because they were constructed in a
structured way and used in a variety of
studies with different populations.
 For example, the FACES Pain Scale is a
commonly used rating scale to assess the
pain of children in clinical practice and has
proven to be valid and reliable over the years
 Nurses often assess pain in adults with a
numeric rating scale (NRS) similar to the one
in Figure 10-5. Using the NRS is more valid
and reliable than asking a patient to rate her
or his pain on a scale from 1 to 10.
Likert Scale
 The Likert scale is designed to determine the
opinions or attitudes of study subjects.
 This scale contains a number of declarative
statements, with a scale after each statement.
 The Likert scale is the most commonly used
of the scaling techniques.
 The original version of the scale included five
response categories
Example of Items in a Likert Scale
Copyright © 2015, 2011, 2007, 2003, 1999, 1995 by Saunders, an imprint of Elsevier Inc. 125
 Evaluation responses ask the respondent for
an evaluative rating along a bad-good
dimension, such as negative to positive or
terrible to excellent.
 Frequency responses may include statements
such as never, rarely, sometimes,
frequently, and all the time. The terms used
are versatile and are selected based on the
content of the questions or items in the scale
 Sometimes seven options are given on a response scale,
sometimes only four.
 When the response scale has an odd number of options,
the middle option is usually an uncertain or neutral
category.
 Using a response scale with an odd number of options is
controversial because it allows the subject to avoid making
a clear choice of positive or negative statements.
 To avoid this, researchers may choose to provide only four
or six options, with no middle point or uncertain category.
This type of scale is referred to as a forced choice version
A Likert scale usually consists of 10 to 20 items,
each addressing an element of the concept being
measured.
 Usually, the values obtained from each item in
the instrument are summed to obtain a single
score for each subject.
 Although the values of each item are technically
ordinal-level data, the summed score is often
analyzed as interval-level data.
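The summing just described can be sketched in Python; the item responses, the 1-to-5 coding, and the reverse-scored item are all hypothetical:

```python
# Hypothetical summated Likert scale: five items coded 1-5
# (strongly disagree to strongly agree); a negatively worded item
# is reverse-scored (6 - response) before summing.
responses = [4, 5, 2, 4, 3]   # one subject's answers
reverse_items = {2}           # index of the hypothetical negatively worded item

scored = [6 - r if i in reverse_items else r for i, r in enumerate(responses)]
total = sum(scored)           # single summed score for this subject
print(scored, total)          # [4, 5, 4, 4, 3] 20
```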
Visual Analog Scales
 The visual analog scale (VAS) is typically used to
measure strength, magnitude, or intensity of
individuals’ subjective feelings, sensations, or
attitudes about symptoms or situations.
 The VAS is a line that is usually 100 mm long,
with right angle “stops” at either end.
 Researchers can present the line horizontally or
vertically, with bipolar anchors or descriptors
beyond either end of the line (Waltz et al., 2010).
Example of Visual Analog Scale
Visual Analog Scales
 These end anchors must include the entire range of
sensations possible for the phenomenon being
measured (e.g., all and none, best and worst, no
pain and most severe pain possible).
 Subjects are asked to place a mark through the line
to indicate the intensity of the sensation or feeling.
 Then researchers use a ruler to measure the
distance between the left end of the line (on a
horizontal scale) and the subject’s mark. This
measure is the value of the sensation.
 The VAS has been used to measure pain, mood,
anxiety, alertness, craving for cigarettes, quality of
sleep, attitudes toward environmental conditions,
functional abilities, and severity of clinical
symptoms.
 The reliability of the VAS is usually determined by
the test-retest method. The correlations between
the two administrations of the scale need to be
moderate or strong to support the reliability of the
scale.
 Because these scales are used to measure
phenomena that are dynamic or erratic over time,
test-retest reliability is sometimes not
appropriate, and the low correlation is then
caused by the change in sensation versus a
problem with the scale.
 The validity of the VAS is usually determined by
correlating the VAS scores with other measures,
such as rating or Likert scales, that measure the
same phenomenon, such as pain
CRITICAL APPRAISAL GUIDELINES Scales
 When critically appraising a rating scale, Likert
scale, or VAS in a study, ask the following
questions:
 1. Is the rating scale, Likert scale, or VAS clearly
described in the research report?
 2. Are the techniques used to administer and score
the scale provided?
 3. Is information about validity and reliability of the
scale described from previous studies and for this
study?
DATA COLLECTION PROCESS
 Data collection is the process of acquiring subjects and
collecting the data for a study.
 The actual steps of collecting the data are specific to
each study and depend on the research design and
measurement techniques.
 During the data collection process, researchers initially
train the data collectors, recruit study participants,
implement the study intervention (if applicable),
collect data in a consistent way, and protect the
integrity (or validity) of the study.
 Researchers need to describe their data collection
process clearly in their research report.
 The data collection process is addressed in the
methods section of the report in a subsection entitled
“Procedure.”
 The strategies used to approach potential subjects
who meet the sampling criteria need to be described.
 Researchers should also specify the number and
characteristics of subjects who decline to participate
in the study.
 If the study includes an intervention, the
details about the intervention and how it was
implemented should be described
 The approaches used to perform
measurements and the time and setting for
the measurements are also described.
 The desired result is a step-by-step description
of exactly how, where, and in what sequence
the researchers collected the study data.
Recruitment of Study Participants
 Participants or subjects may be recruited only at
the initiation of data collection or throughout the
data collection period.
 The design of the study determines the method of
selecting the participants.
 Recruiting the number of subjects originally
planned is critical because data analysis and
interpretation of findings depend on having an
adequate sample size.
Consistency in Data Collection
 Consistency involves maintaining the data collection pattern
for each collection event as it was developed in the research
plan.
 A good plan will facilitate consistency and maintain the
validity of the study.
 Researchers should note deviations, even if they are minor,
and report their impact on the interpretation of the findings in
their final study report.
 If a study uses data collectors, researchers need to report the
training process and the interrater reliability achieved during
training and data collection.
Control in the Study Design
 Researchers build controls into their study plan to minimize the
influence of intervening forces on the findings.
 Control is very important in quasi-experimental and experimental
studies to ensure that the intervention is consistently implemented
 The research report needs to reflect the controls implemented in a
study and any problems that needed to be managed during the
study.
 In addition to maintaining the controls identified in the plan, researchers
continually look for previously unidentified, extraneous variables that
might have an impact on the data being collected. An extraneous
variable often is specific to a study and tends to become apparent
during the data collection period and needs to be discussed in the
research report.
Studies Obtaining Data from Existing
Databases
 Nurse researchers are increasing their use of existing
databases to address the research problems they have
identified as being essential for generating evidence for
practice.
 The reasons for using these databases in studies are varied.
 With the computerization of healthcare information, more
databases have been developed internationally, nationally,
regionally, at the state level, and within clinical agencies.
These databases include large amounts of information
that have relevance in developing research evidence
needed for practice
Studies Obtaining Data from Existing
Databases
 The costs and technology for storage of data have improved over
the last 10 years, making these databases more reliable and
accessible.
 Using existing databases makes it possible to conduct complex
analyses to expand our understanding of healthcare outcomes
(Doran, 2011).
 Another reason is that the primary collection of data in a study is
limited by the availability of research participants and expense of
the data collection process. By using existing databases,
researchers are able to have larger samples, conduct more
longitudinal studies, have lower costs during the data
collection process, and limit the burdens placed on the study
participants
 The existing healthcare data consist of two types, secondary and
administrative.
 Data collected for a particular study are considered primary
data. Data collected from previous research and stored in a
database are considered secondary data when used by other
researchers to address their study purposes
 Because these data were collected as part of research, details can
be obtained about the data collection and storage processes.
 In the methodology section of their research report, researchers
usually clearly indicate when secondary data analyses were
conducted as part of their study.
 Data collected for reasons other than research are
considered administrative data.
 Administrative data are collected within clinical
agencies, obtained by national, state, and local
professional organizations, and collected by
federal, state, and local agencies. The processes for
collection and storage of administrative data are
more complex and often more unclear than the
data collection process for research
 The data in administrative databases are collected
by different people in different sites using different
methods. However, the data elements collected
for most administrative databases include
demographics, organizational characteristics,
clinical diagnosis and treatment, and geographic
information. These database elements were
standardized by the Health Insurance Portability
and Accountability Act (HIPAA) of 1996, which
improved the quality of the databases
 When secondary data and administrative data from
existing databases are used in a study, they need to
be critically appraised to determine the quality of
the study findings.
 The type of database used in a study needs to be
clearly described. The data in the database need to
address the researchers’ study purpose and their
objectives, questions, or hypotheses.
 The validity and reliability of the data in the existing
database need to be described in the research report.
CRITICAL APPRAISAL GUIDELINES
Data Collection
 When critically appraising the data collection process, consider the
following questions:
 1. Were the recruitment and selection of study participants or subjects
clearly described and appropriate?
 2. Were the data collected in a consistent way?
 3. Were the study controls maintained as indicated by the design? Did
the design include an intervention that was consistently implemented?
 4. Was the integrity of the study protected, and how were any
problems resolved?
 5. Did the researchers obtain data from an existing database? If so, did
the data obtained address the study problem and objectives,
questions, or hypotheses? Were the reliability and validity of the
database addressed in the research report?
KEY CONCEPTS
 • The purpose of measurement is to produce
trustworthy data or evidence that can be used in
examining the outcomes of research.
 • The rules of measurement ensure that the
assignment of values or categories is performed
consistently from one subject (or event) to another
and, eventually, if the measurement strategy is
found to be meaningful, from one study to another.
 • The levels of measurement from low to high are
nominal, ordinal, interval, and ratio.
 • Reliability in measurement is concerned with the consistency of the
measurement technique; reliability testing focuses on equivalence,
stability, and homogeneity.
 • The validity of an instrument is a determination of the extent to which
the instrument reflects the abstract concept being examined. Construct
validity includes content-related validity and evidence of validity from
examining contrasting groups, convergence, and divergence.
 • Readability level focuses on the study participants’ ability to read and
comprehend the content of an instrument, which adds to the reliability
and validity of the instrument.
 • Physiological measures are examined for precision, accuracy, and error
in research reports.
 • Diagnostic and screening tests are examined for sensitivity, specificity,
and likelihood ratios.
 • Common measurement approaches used in nursing research
include physiological measures, observation, interviews,
questionnaires, and scales.
 • The scales discussed in this chapter include rating scales, Likert
scales, and visual analog scales.
 • Researchers are using existing databases when conducting their
studies, and the quality of these databases needs to be addressed in
the research report.
 • The data collection tasks that need to be critically appraised in a
study include (1) recruitment of study participants, (2) consistent
collection of data, and (3) maintenance of controls in the study design.
 • It is important to critically appraise the measurement methods and
data collection process of a published study for threats to validity.