Group 7 Handouts


MEASUREMENT AND DATA QUALITY

➢ In quantitative studies, an ideal data collection procedure is one that measures a construct accurately, soundly, and with precision.

MEASUREMENT

➢ Quantitative studies obtain data through the measurement of constructs. Clinicians also require that phenomena of interest be measured. Measurement involves assigning numbers to represent the amount of an attribute present in a person or object.

Rules and Measurement

Measurement involves assigning numbers according to rules. Rules are necessary to promote consistency and interpretability. The rules for measuring temperature, weight, and other physical attributes are familiar to us.

Theories of Measurement

Psychometrics is the branch of psychology concerned with the theory and methods of psychological measurement. Health measurement has been strongly influenced by psychometrics, although differences in aims and conceptualizations have begun to emerge.

TWO THEORIES OF MEASUREMENT

Classical test theory (CTT) - a psychometric theory of measurement that was dominant until fairly recently. CTT has been used as the basis for developing multi-item measures of health constructs and is also appropriate for conceptualizing all types of measurements.

Item response theory (IRT) - a framework appropriate only for multi-item scales and tests.

Errors of Measurement

Procedures for obtaining measurements, as well as the objects being measured, are susceptible to influences that can alter the resulting data. Some influences can be controlled or minimized, and attempts should be made to do so, but such efforts are rarely completely successful.

Obtained score = True score + Error

or
Xo = Xt + Xe

COMMON SOURCES OF MEASUREMENT ERROR

Transient personal factors – a person's score can be influenced by such personal states as fatigue or mood.
Situational contaminants – scores can be affected by the conditions under which they are produced.
Response-set biases – relatively enduring characteristics of people can interfere with accurate measurements.
Administration variations – alterations in the methods of collecting data from one person to the next can result in score variations unrelated to variations in the target attribute.
Instrument clarity – if the directions on an instrument are poorly understood, the scores may be affected.
Item sampling – errors can be introduced as a result of the sampling of items used in the measure.

MAJOR TYPES OF MEASURES

1. Static and Adaptive Measures


A static measure is administered in a comparable manner for everyone being measured.
An adaptive measure, by contrast, involves using responses to early questions to guide the selection of subsequent questions.

2. Reflective scales and Formative indexes


An important distinction is whether a multi-item measure is formative or reflective, which
concerns the nature of the relationship between a construct and the measure of the
construct.

A Measurement Taxonomy

The field of health measurement was for many years in some turmoil with regard to measurement terms and definitions.
Recently, a working group in the Netherlands used a Delphi-type approach with a panel of health measurement experts to identify key measurement properties and to develop a taxonomy and definitions of those properties.
The result was the creation of COSMIN, the COnsensus-based Standards for the selection of health Measurement INstruments.

RELIABILITY

The reliability of a quantitative measure is a major criterion for assessing its quality.
Reliability is the extent to which scores for people who have not changed are the same for repeated measurements under several situations, including repetition on different occasions, by different persons, on different versions of a measure, or in the form of different items on a multi-item instrument.

The first component within the broad reliability domain is simply called reliability. It covers
four different approaches to reliability assessment, including the following:

1. Test-retest reliability
2. Interrater reliability
3. Intrarater reliability
4. Parallel test reliability

Test-Retest Reliability
Takes the form of administering a measure to the same people on two occasions.
This type of reliability is sometimes called stability or reproducibility – the extent to which scores can be reproduced on repeated administrations.
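
A minimal sketch of this idea, using hypothetical scores: the simple (if imperfect) approach below estimates test-retest reliability as a Pearson correlation; the intraclass correlation coefficient is generally preferred because, unlike r, it is sensitive to systematic shifts between occasions.

```python
import numpy as np

# Hypothetical scores for 5 people measured on two occasions.
time1 = np.array([12, 15, 20, 22, 30])
time2 = np.array([13, 14, 21, 24, 29])

# Pearson r as a simple test-retest estimate.
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest r = {r:.2f}")
```
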
Interrater or Intrarater Reliability
Interrater reliability assessment involves comparing two or more observers' scores to see if the scores are comparable.
Intrarater reliability involves an assessment in which the same rater makes the measurements on two or more occasions, blinded to the ratings assigned previously.

Proportion of agreement = Number of agreements / (Number of agreements + Number of disagreements)
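
As an illustration, here is a small Python sketch of this agreement proportion with hypothetical ratings; note that this simple index does not correct for chance agreement (indexes such as Cohen's kappa do).

```python
def proportion_agreement(ratings_a, ratings_b):
    """Simple interrater agreement: agreements / (agreements + disagreements)."""
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return agreements / len(ratings_a)

# Hypothetical example: two raters classify 10 observations.
rater1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(proportion_agreement(rater1, rater2))  # 0.8
```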

Parallel Test Reliability


Involves administering parallel tests to the same people on two separate occasions and then estimating a reliability parameter, typically the intraclass correlation coefficient (ICC).

Interpretation of Reliability Coefficients


Reliability coefficients have a special interpretation that relates to the decomposition of observed scores into error and true score components:

Vo = Vt + Ve

or

R = Vt / Vo

where Vo = observed total variability in scores
Vt = true variability
Ve = variability owing to error
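
The decomposition can be made concrete with a small, purely illustrative simulation (the variance values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 1,000 people: true scores plus independent random error.
true_scores = rng.normal(50, 10, size=1000)   # Vt = 100
errors = rng.normal(0, 5, size=1000)          # Ve = 25
observed = true_scores + errors               # Xo = Xt + Xe

# Reliability as the proportion of observed variance that is true variance.
reliability = true_scores.var() / observed.var()
print(f"R = {reliability:.2f}")  # close to 100 / 125 = .80
```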

The Standard Error of Measurement (SEM)

Can be thought of as quantifying "typical error" on a measure. It is an index that can be computed in connection with estimates of either reliability or internal consistency.
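
A standard CTT formula computes the SEM from the standard deviation of the scores and a reliability estimate, SEM = SD × √(1 − R); the sketch below uses hypothetical values.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - R), the standard CTT formula."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical scale: SD = 10, reliability = .84 -> SEM = 4.0
print(standard_error_of_measurement(10, 0.84))
```
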
Limits of Agreement (LOA)
An alternative index of measurement error derived from work done by Bland and Altman (1986).
Bland–Altman plots are widely used by medical researchers to examine aspects of both the reliability and validity of measures but are seldom used by psychometricians or nurse researchers.

VALIDITY
A second domain in the taxonomy of measurement properties.
Validity in a measurement context is defined as the degree to which an instrument is
measuring the construct it purports to measure.

Three major components


1. Content and Face Validity
2. Criterion Validity
3. Construct Validity

CONTENT AND FACE VALIDITY


Face validity refers to whether the instrument looks like it is measuring the target construct.
Content validity is defined as the extent to which an instrument's content adequately captures the construct, that is, whether an instrument has an appropriate sample of items for the construct being measured.

Three issues of Content Validity

Relevance – an assessment of relevance involves feedback on the relevance of individual items and of the overall set of items.
Comprehensiveness – the flip side of asking experts about relevance is to ask them if there are notable omissions.
Balance – an instrument that is content valid represents the domains of the construct in a balanced manner.

CRITERION VALIDITY
Is the extent to which the scores on an instrument are a good reflection of a "gold standard," that is, a criterion considered an ideal measure of the construct.
FIVE CATEGORIES

Expense – a new measure that is a good reflection of a criterion may be desired because the gold standard is too expensive to administer routinely.
Efficiency – a related reason is the desire to create a measure that is more efficient than the gold standard.
Risk and discomfort – sometimes the criterion involves a measurement that puts people at risk or is invasive, and a substitute is desired to lower risks or pain.
Criterion unavailable – a measure may be needed because criterion measures are difficult or impossible to obtain routinely in clinical settings.
Prediction – one other reason for developing an instrument that can be validated against a criterion is that the criterion cannot be measured until a future point in time.

Concurrent validity is the type of criterion validity that is assessed when the measurements of the criterion and the new instrument occur at the same time.
Predictive validity is assessed when the focal measure is tested against a criterion that is measured in the future.
Criterion Validity with a Continuous Measure and a Continuous Criterion
The first situation is when both the focal measure being tested and the criterion yield continuous scores.
Criterion Validity with a Dichotomous Measure and a Dichotomous Criterion
When both the focal measure and the criterion are dichotomous, several statistical
methods can be used but, most often, methods of assessing diagnostic accuracy are
applied.
Sensitivity is the ability of a measure to identify a “case” correctly, that is, to screen in
or diagnose a condition correctly.
Specificity is the measure's ability to identify noncases correctly, that is, to screen out those without the condition.
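
Both indexes can be computed directly from the four cells of a 2 × 2 table crossing the measure against the criterion, as in this sketch with hypothetical counts:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Diagnostic accuracy from the four cells of a 2x2 table."""
    sensitivity = tp / (tp + fn)  # cases correctly screened in
    specificity = tn / (tn + fp)  # noncases correctly screened out
    return sensitivity, specificity

# Hypothetical: 80 true positives, 20 false negatives,
# 90 true negatives, 10 false positives.
print(sensitivity_specificity(80, 20, 90, 10))  # (0.8, 0.9)
```
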
Criterion Validity with a Continuous Measure and a Dichotomous Criterion
Researchers usually use a receiver operating characteristic curve (ROC curve) to
identify the best cutoff point.
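
A minimal sketch of the underlying logic, with hypothetical data: each observed score is tried as a cutoff, and the one maximizing Youden's J (sensitivity + specificity − 1), one common criterion for a "best" cutoff, is kept. Libraries such as scikit-learn offer roc_curve for the same purpose.

```python
import numpy as np

def best_cutoff(scores, is_case):
    """Try each observed score as a cutoff; keep the one maximizing
    Youden's J = sensitivity + specificity - 1."""
    scores = np.asarray(scores, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    best_j, best_c = -1.0, None
    for c in np.unique(scores):
        predicted_case = scores >= c
        sensitivity = predicted_case[is_case].mean()
        specificity = (~predicted_case[~is_case]).mean()
        j = sensitivity + specificity - 1
        if j > best_j:
            best_j, best_c = j, c
    return best_c, best_j

# Hypothetical screening scores; 1 = has the condition (the criterion).
scores = [3, 5, 6, 8, 9, 11, 12, 14]
cases = [0, 0, 0, 1, 0, 1, 1, 1]
print(best_cutoff(scores, cases))  # cutoff 8.0, J = 0.75
```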

Hypothesis-testing Construct Validity


Concerns the extent to which it is possible to corroborate hypotheses regarding how scores on a measure function in relation to scores on measures of other constructs.

Convergent Validity
Is the degree to which scores on the focal measure are correlated with scores on measures of constructs with which there is a hypothesized relationship, that is, the degree to which there is conceptual convergence.
Known-Groups Validity
Also called discriminative validity, known-groups validity relies on hypotheses concerning a measure's ability to discriminate between two or more groups known to differ with regard to the construct of interest.

Divergent Validity
Often called discriminant validity, divergent validity concerns evidence that a measure is not a measure of a different construct distinct from the focal construct.

Structural Validity
Refers to the extent to which the structure of a multi-item scale adequately reflects the
hypothesized dimensionality of the construct being measured.

Factor Analysis
Is a method for identifying clusters of related items, that is, the dimensions underlying a broad construct.

Cross-Cultural Validity
Defined as the degree to which the components of a translated or culturally adapted measure perform adequately and equivalently, individually and collectively, relative to their performance on the original instrument.

RELIABILITY OF CHANGE SCORES


Two domains in our measurement taxonomy relate to measurements over time.

Measuring Change
In clinical trials, statisticians have argued against using change scores as the
dependent variables in the analysis of treatment effects.
Change scores represent the amount of change between two scores, for example, a posttest score minus a pretest score.

The Smallest Detectable Change


Also called the minimal detectable change, the smallest detectable change (SDC) is the smallest change in score that can be interpreted as real change beyond measurement error.
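
A commonly used formula defines the SDC in terms of the SEM, SDC = z × √2 × SEM; the sketch below assumes a 95% confidence level (z = 1.96) and reuses the hypothetical SEM from earlier.

```python
import math

def smallest_detectable_change(sem, confidence_z=1.96):
    """SDC = z * sqrt(2) * SEM: change beyond measurement error."""
    return confidence_z * math.sqrt(2) * sem

# With the hypothetical SEM of 4.0 from the earlier example, SDC is about 11.1:
print(round(smallest_detectable_change(4.0), 1))
```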

The Reliable Change Index


Was proposed by Jacobson and colleagues as an element of a two-part process for assessing the clinical significance of a patient's improvement during a psychotherapeutic intervention.
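
A sketch of the usual Jacobson-style computation with hypothetical scores: the change score is divided by the standard error of the difference between two scores, and values beyond ±1.96 suggest change larger than measurement error.

```python
import math

def reliable_change_index(score1, score2, sem):
    """Jacobson-style RCI: change divided by the SE of the difference."""
    s_diff = math.sqrt(2) * sem
    return (score2 - score1) / s_diff

# Hypothetical: pretest 40, posttest 52, SEM 4.0 -> RCI of about 2.12,
# which exceeds 1.96, suggesting reliable (not chance) change.
print(round(reliable_change_index(40, 52, 4.0), 2))
```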

RESPONSIVENESS
The ability of a measure to detect change over time in a construct that has changed,
commensurate with the amount of change that has occurred.
The Criterion Approach to Responsiveness
This approach to responsiveness assessment has also been called an anchor-based
approach, with the criterion serving as the anchor.

CRITIQUING DATA QUALITY IN QUANTITATIVE STUDIES


Information about data quality should be provided in every quantitative research report
because it is not possible to come to conclusions about the quality of study evidence
without such information.
Validity is more difficult to document in a report than reliability. At a minimum, researchers should defend their choice of existing measures based on validity information from the developers, and they should cite the relevant publication.

15 Developing and Testing Self-Report Scales

BEGINNING STEPS: CONCEPTUALIZATION AND ITEM GENERATION

Conceptualizing the Construct

The importance of a sound, thorough conceptualization of the construct to be measured cannot be overemphasized.

You will not be able to quantify an attribute adequately unless you thoroughly understand the latent trait (the underlying construct) you wish to capture.

Deciding on the Type of Scale


Before items can be generated, you need to decide on the type of scale you wish to
create because item characteristics vary by scale type.

Two broad categories of scales

Traditional summated rating scales are based in classical test theory. In CTT, items are
presumed to be roughly comparable indicators of the underlying construct.

Latent trait scales using IRT models can use items like the ones used in CTT, such as items in a Likert-type format; in fact, a person completing a scale would likely not know whether it had been developed within the CTT or IRT framework.

Generating an Item Pool: Getting Started

An early step in scale construction is to develop a pool of possible items for the scale.
This is often easier to do as a team effort because different people articulate a similar
idea in diverse ways.

Where do scale items come from?

Possible sources for generating an item pool:

1. Existing instruments - Sometimes it is possible to adapt an existing instrument rather than starting from scratch. Adaptations may require adding and deleting items or may involve rewording them, for example, to make them more culturally appropriate or to simplify wording for a population with low reading skills. Permission from the author of the original scale should be sought because published scales are copyright protected.

2. The literature - Ideas for item content often come from a thorough
understanding of prior research.

3. Concept analysis - A related source of ideas is a concept analysis.

4. In-depth qualitative research - A qualitative study can help you to understand the dimensions of a phenomenon and can also give you actual words for items.

5. Clinical observations - Patients in clinical settings may be an excellent source of items. Ideas for items may come from direct observation of patients' behaviors in relevant situations or from listening to their comments and conversations.

Making Decisions about Item Features


In preparing to write items, you need to make decisions about such issues as the
number of items to develop, the number and form of the response options, whether to
include positively and negatively worded items, and how to deal with time.

Number of Items

A domain sampling model is assumed, which involves the random sampling of a homogeneous set of items from a hypothetical universe of items on the construct.

Response Options

Scale items involve both a stem (often a declarative statement) and response options.

Traditional Likert scales involve response options on a continuum of agreement, but other continua are possible, such as frequency (never/always), importance (very important/unimportant), quality (excellent/very poor), and likelihood (highly likely/impossible).

Positive and Negative Stems

A generation ago, psychometricians advised scale developers to deliberately include both positively and negatively worded statements and to reverse-score the negative items.

Item Intensity

In a traditional summated rating scale, the intensity of the statements (stems) should be
similar and fairly strongly worded. If items are worded such that almost anyone would
agree with them, the scale will not be able to discriminate between people with different
amounts of the underlying trait.

Item Time Frames

A time frame should not emerge as a consequence of item development. You should
decide in advance, based on your conceptual understanding of the construct and the
needs for which the scale is being constructed, how to deal with time.

Wording the Items

Items should be worded in such a manner that every respondent is answering the same
question.

Some additional tips specific to scale items are as follows:

1. Clarity - Scale developers should strive for clear, unambiguous items.


2. Jargon - Jargon should be avoided. Be especially cautious about using
terms that might be well-known in health care circles (e.g., lesion) but not familiar to the
average person.

3. Length - Avoid long sentences or phrases. In particular, eliminate unnecessary words. For example, "It is fair to say that in the scheme of things I do not get enough sleep" could more simply be worded, "I usually do not get enough sleep."

4. Double negatives - It is preferable to word things affirmatively ("I am usually happy") rather than negatively ("I am not usually sad"), but double negatives should always be avoided ("I am not usually unhappy").

5. Double-barreled items - Avoid putting two or more ideas in a single item. For example, "I am afraid of insects and snakes" is a bad item because a person who is afraid of insects but not snakes (or vice versa) would not know how to respond.

PRELIMINARY EVALUATION OF ITEMS

Once a large item pool has been generated, it is time for critical appraisal. Care should
be devoted to such issues as whether individual items capture the construct and are
grammatical and well-worded. The initial review should also consider whether the items
taken together adequately embrace the full nuances of the construct.

Input from the Target Population

In the next step, the initial pool of items is pretested. In a conventional pretest of a new
instrument, a small sample of people (20 to 40 or so) representing the target population
is invited to complete the items.

External Review by Experts

External review of the revised items by a panel of experts should be undertaken to assess the scale's content validity. It is advisable to undertake two rounds of review, if feasible: the first to refine or weed out faulty items or to add new items to cover the domain adequately, and the second to formally assess the content validity of the items and scale.

Selecting and Recruiting the Experts

The panel of experts should include people with strong credentials with regard to the
construct being measured. Experts also should be knowledgeable about the target
population. In the first review, it is also desirable to include experts on scale
construction.
Preliminary Expert Review: Content Validation of Items

The experts' job is to evaluate individual items and the overall scale (and any
subscales), using guidelines established by the scale developer.

Content Validation of the Scale

In the second round of content validation, a smaller group of experts (three to five) can be used to evaluate the relevance of the revised set of items and to compute the scale content validity index (S-CVI).
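
As an illustration of one widely used convention (an assumption here, not necessarily the exact procedure intended in this handout): the item-level CVI (I-CVI) is the proportion of experts rating an item 3 or 4 on a 4-point relevance scale, and the S-CVI/Ave is the average of the I-CVIs.

```python
def content_validity_indexes(ratings):
    """ratings[i][j] = expert j's relevance rating (1-4) for item i.

    I-CVI: proportion of experts rating an item 3 or 4.
    S-CVI/Ave: the average of the item-level I-CVIs.
    """
    i_cvis = [sum(r >= 3 for r in item) / len(item) for item in ratings]
    s_cvi_ave = sum(i_cvis) / len(i_cvis)
    return i_cvis, s_cvi_ave

# Hypothetical: 4 items rated by 5 experts on a 1-4 relevance scale.
ratings = [
    [4, 4, 3, 4, 3],  # I-CVI = 1.0
    [4, 3, 2, 4, 4],  # I-CVI = 0.8
    [3, 4, 4, 4, 4],  # I-CVI = 1.0
    [2, 3, 4, 4, 3],  # I-CVI = 0.8
]
i_cvis, s_cvi = content_validity_indexes(ratings)
print(i_cvis, round(s_cvi, 2))  # [1.0, 0.8, 1.0, 0.8] 0.9
```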

FIELD TESTING THE INSTRUMENT

>Quantitative assessment of the items in the first field test.

>Evaluation of the scale's psychometric adequacy in the second.

Developing A Sampling Plan

• A representative sample includes older and younger respondents, men and women, and people with varying educational and ethnic backgrounds.

• A sample of 300 is often suggested as adequate to support a factor analysis.

• Recommended ratios of respondents to items range from 3 to 4 people per item up to 40 to 50 people per item.

Developing A Data Collection Plan

• The instrument should include the scale items and basic demographic information.

• Additional measures of other constructs hypothesized to be correlated with the target construct should also be included.

Preparing For Data Collection

• In all data collection efforts, care should be taken to make the instrument attractive, professional-looking, and easy to understand.

ANALYSIS OF SCALE DEVELOPMENT DATA

> The analysis of data from a multi-item scale is a topic about which entire books have been written.
Basic Item Analysis

• Basic descriptive information for each item should be examined. Items should have good variability; without it, they will not correlate with the total scale and will not fare well in a reliability analysis.
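
A minimal NumPy sketch of one common item analysis statistic, the corrected item-total correlation, computed on hypothetical data:

```python
import numpy as np

def corrected_item_total(items):
    """items: (n_people, n_items) array; correlate each item with the
    total of the remaining items so an item cannot inflate its own r."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    return [np.corrcoef(items[:, j], total - items[:, j])[0, 1]
            for j in range(items.shape[1])]

# Hypothetical responses of 6 people to a 4-item scale (1-5 options);
# the fourth item is worded in the opposite direction.
data = [[4, 5, 4, 2],
        [2, 1, 2, 3],
        [5, 4, 5, 1],
        [3, 3, 3, 4],
        [1, 2, 1, 5],
        [4, 4, 5, 2]]
print([round(r, 2) for r in corrected_item_total(data)])
```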

Exploratory Factor Analysis

• Factor analysis disentangles complex interrelationships among items and identifies items that "go together" as unified concepts.

• This section deals with a type of factor analysis known as exploratory factor analysis (EFA).

Factor Extraction

- Condenses items into a smaller number of factors and is used to identify the number
of underlying dimensions.

Factor Rotation

• The concept of rotation can best be explained graphically.

>Orthogonal Rotation

• Rotation in which factors are kept at right angles to one another.

>Oblique Rotation

• Permits the rotated axes to depart from a 90-degree angle.

>Rotated Factor Matrix

• The matrix examined when interpreting the factor analysis.

>Factor Loadings

• Can range from −1.00 to +1.00.
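
A small illustrative EFA run on simulated data with a known two-factor structure; this assumes scikit-learn 0.24 or later, where FactorAnalysis gained a rotation option.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)

# Simulate 200 respondents answering 6 items driven by 2 latent factors:
# items 1-3 load on factor 1, items 4-6 on factor 2.
factors = rng.normal(size=(200, 2))
true_loadings = np.array([[0.8, 0.0], [0.7, 0.1], [0.9, 0.0],
                          [0.0, 0.8], [0.1, 0.7], [0.0, 0.9]])
items = factors @ true_loadings.T + rng.normal(scale=0.4, size=(200, 6))

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(items)
print(np.round(fa.components_.T, 2))  # rotated loadings (items x factors)
```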


Internal Consistency Analysis

• The next step is to calculate coefficient alpha. Alpha, it may be recalled, provides an estimate of a multi-item scale's internal consistency.
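
A sketch of the standard alpha formula, applied to the hypothetical item data from the item analysis sketch:

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# The hypothetical 4-item data used earlier; alpha is low here
# (about .15) because the oppositely worded fourth item has not
# yet been reverse-scored.
data = [[4, 5, 4, 2], [2, 1, 2, 3], [5, 4, 5, 1],
        [3, 3, 3, 4], [1, 2, 1, 5], [4, 4, 5, 2]]
print(round(cronbach_alpha(data), 2))
```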

Test-Retest Analysis

• Although test-retest reliability analysis has not been a standard feature of psychometric assessment in nursing research, we urge developers of new scales to gather information about both internal consistency and test-retest reliability.

SCALE REFINEMENT AND VALIDATION

Revising the Scale

• The analyses undertaken in the development study often suggest the need to revise or add items.

• Before deciding that your scale is finalized, it is a good idea to examine the content of the items in the scale.

Scoring the Scale

• Scoring a composite summated rating scale is easy.

• Some scale developers create a total score that is the "average" across items so that the total score is on the same scale as the items.
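
A minimal scoring sketch with hypothetical data, including reverse-scoring of a negatively worded item before summing:

```python
import numpy as np

def score_scale(responses, reverse_items=(), n_options=5):
    """Summated rating score, reverse-scoring negatively worded items."""
    responses = np.asarray(responses, dtype=float).copy()
    for j in reverse_items:
        # On a 1..n_options scale, reversing maps 1 -> n_options, etc.
        responses[:, j] = (n_options + 1) - responses[:, j]
    return responses.sum(axis=1)  # use .mean(axis=1) for an item-metric score

# Hypothetical: 3 respondents, 4 Likert items; the item at index 1 is
# negatively worded and is reverse-scored before summing.
data = [[4, 2, 5, 4],
        [3, 3, 3, 3],
        [1, 5, 2, 1]]
print(score_scale(data, reverse_items=[1]))  # [17. 12.  5.]
```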
