
Running head: A REVIEW AND ANALYSIS OF SCALE DEVELOPMENT PROCEDURES 1

A Review and Analysis of Scale Development Procedures in Psychology

Rochelle C. Fisher, Deborah L. Bandalos, and Jerusha J. Gerstner

James Madison University



Research suggests that there are inconsistent guidelines for the item development and

analysis process. In many cases it is not clear what guidelines researchers use to define the

constructs to be measured, generate an item pool, assess the quality of the items, revise or

remove items from the scale, or examine the reliability and validity of the resultant scale scores.

Guidelines that do exist are not always consistent, and it is not clear whether the use of one

recommendation over another is associated with a better product in the form of more reliable and

valid test scores. Establishing research-based guidelines in these areas can help to ensure that the

scale development process will result in products that provide valid and reliable scores. This, in

turn, should contribute to understanding psychological outcomes of interest.

To better understand the scale development process, researchers often first examine the

current literature, such as DeVellis (2011), Hinkin (1998), and Worthington and Whittaker (2006).

Although past research has established some guidelines regarding the types of analyses that should be conducted during scale development and item analysis, and criterion values for the numerical indexes yielded by these analyses have been suggested, these recommendations are not consistent. As just one example, DeVellis (2011) uses a criterion of .50 for factor loadings of salient variables, whereas Crocker and Algina (1986) state that “loadings less than .30 are usually considered unimportant” (p. 299) and Hinkin (1998) suggested a criterion of .40. Certainly, researchers might disagree somewhat about such

guidelines. However, our reading of the literature indicates that inconsistency is the rule rather

than the exception in terms of recommendations for practice, leaving researchers in something of

a quandary regarding what constitutes “best practice” in the areas of scale development and item

analysis.
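
To make the practical impact of these differing cutoffs concrete, the short sketch below (in Python, using entirely hypothetical loadings rather than values from any study we reviewed) flags the same small item pool under the .30, .40, and .50 criteria; depending on which recommendation a researcher follows, different subsets of items become candidates for removal.

```python
# How the choice of factor loading cutoff changes which items are flagged.
# The loadings below are hypothetical and are not drawn from any reviewed study.
loadings = {
    "item_1": 0.72, "item_2": 0.55, "item_3": 0.46,
    "item_4": 0.38, "item_5": 0.27, "item_6": 0.61,
}

# Cutoffs attributed to Crocker and Algina (1986), Hinkin (1998), and DeVellis, respectively.
for cutoff in (0.30, 0.40, 0.50):
    flagged = [item for item, loading in loadings.items() if loading < cutoff]
    print(f"cutoff {cutoff:.2f}: flag for possible removal -> {flagged}")
```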

Moreover, the only systematic analysis of the scale development process was conducted by Worthington and Whittaker (2006). Although their results are useful for informing researchers

of best practice in scale development, their review was limited to a single journal (Journal of Counseling Psychology) and focused primarily on practices in exploratory and confirmatory factor analysis (EFA/CFA) rather than on item analysis procedures more broadly. Even so, their results support our assertion that reporting practices are inconsistent. The

purpose of this study is therefore to confront the lack of consistency in criteria for scale

development by examining articles across four different journals in which the focus was on scale

development and item analysis, and then cataloguing the methods used.

Our goal in doing so is twofold. First, given that the use of scale development and item analysis procedures has not previously been studied broadly in the educational or psychological literature, we will conduct a systematic study in this area to determine which guidelines, if any, are currently used by those developing scales. As part of this study, we will examine the reliability and validity of scores from the resultant scales. Second, we will incorporate this information into the development of a set of guidelines that appear to result in optimal levels of score reliability and validity.

Our ultimate goal, then, is to develop a research-based set of guidelines for scale

development and analysis. We first introduce scale development and briefly describe what information we coded from the selected articles. We then report our methods and findings from the content analysis, specifically results concerning validity, reliability, the purpose of the scale, item elimination criteria, data screening, and item development, and we discuss what the majority of the examined articles reported and what potential issues there may be with current practices. Finally, we outline a plan for future research, as this study lays the initial groundwork for understanding current practices.

Scale Development

Cabrera-Nguyen (2010) states, “increasing the availability of instruments with

demonstrated reliability and validity may also help practitioners select evidence-based

interventions that best match the needs of their clients.” While this research was focused

specifically on the field of social work, Cabrera-Nguyen (2010) speaks of a need that is also present in the field of psychology. To help address this need for quality measurement instruments,

we are beginning the initial stages of a content analysis to better inform us of the current

practices in scale development.

Methods

We examined articles from issues of four journals: Journal of Counseling Psychology,

Psychological Assessment, Measurement and Evaluation in Counseling and Development, and

Educational and Psychological Measurement. We selected issues going back every other year

for the past ten years, resulting in five years of issues in total. These journals

were chosen because they yielded the largest number of scale development articles in a

preliminary analysis. We then reviewed each issue of the four journals for articles in which

scales were developed and/or revised. A preliminary review of the journals indicated that

approximately 170 of the articles in these four journals fit into this category. For each article, we

coded: (a) information provided on the item development procedures, (b) item analysis

procedures, (c) criteria for removing items, (d) procedures used to determine score reliability and

validity, (e) data screening procedure, and (f) practices related to the use of EFA, CFA and item

response theory (IRT). In addition to this information, we coded specific values used as cut-off

criteria for item retention, revision, or removal.
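
As a rough illustration of the kind of record this coding scheme produces, a minimal sketch follows; the field names and the example values are our own invention for illustration and do not correspond to any particular coded article.

```python
# Minimal sketch of one coded article record; field names and values are illustrative only.
article_record = {
    "item_development": ["theory", "suggested by experts"],            # (a)
    "item_analysis": ["item-total correlation"],                        # (b)
    "item_removal_criteria": ["low factor loadings"],                   # (c)
    "reliability_validity": ["coefficient alpha", "correlational"],     # (d)
    "data_screening": "listwise deletion",                              # (e)
    "efa_cfa_irt": ["EFA", "CFA"],                                      # (f)
    "cutoff_values": {"factor_loading": 0.40, "item_total_correlation": 0.30},
}

print(article_record["cutoff_values"])
```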

We designed a survey form into which we entered codes for each of the articles included in the study and summarized the results. The survey was designed to best organize the data

collected. We also recorded the journal name, article name, and type of sample, which may be used in future research. To establish interrater agreement, each of three raters individually coded the same eight articles from the journals and discussed any discrepancies until consensus was reached. Following this training process, the three raters independently coded one article for the purpose of calculating interrater agreement, which was 86.7%. It should be noted that most discrepancies occurred when the authors of the coded articles provided unclear information concerning their scale development process.
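
For readers who wish to reproduce this type of index, the sketch below computes simple average pairwise percent agreement across a set of coding fields; the ratings shown are hypothetical, and we note that we report raw percent agreement rather than a chance-corrected coefficient such as kappa.

```python
from itertools import combinations

# Hypothetical codes assigned by three raters to the same five coding fields.
ratings = {
    "rater_1": ["theory", "coefficient alpha", "listwise", "EFA", "correlational"],
    "rater_2": ["theory", "coefficient alpha", "listwise", "CFA", "correlational"],
    "rater_3": ["theory", "coefficient alpha", "no information", "EFA", "correlational"],
}

def percent_agreement(codes_a, codes_b):
    """Proportion of coding fields on which two raters assigned the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

pairwise = [percent_agreement(ratings[a], ratings[b])
            for a, b in combinations(ratings, 2)]
print(f"average pairwise agreement: {sum(pairwise) / len(pairwise):.1%}")
```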

After the articles (N=91) were coded, analyses were conducted to examine the current

reporting practices in scale development. In this article, we present (a) reliability and (b) validity information, along with (c) criteria for eliminating items, (d) item development procedures, (e) the purpose of the study, and (f) data screening practices.

Results

Validity

In our survey we established 10 categories under which the reported validity evidence could fall: (a) content, (b) predictive, (c) classification, (d) multitrait-multimethod, (e) group comparison (known-groups), (f) correlational, (g) developmental progression, (h) unclear, (i) no information, and (j) other. We used the last three options if (1) the researchers were unclear in their description of the validity study, (2) a validity study was mentioned but no specific information was provided, or (3) a type of validity study other than the options provided was mentioned. Although we coded 91 articles, frequencies will not sum to 91 because some articles reported no validity information and others reported multiple validity studies (94 validity studies in total). The majority of articles reported at least one validity study. Overwhelmingly, correlational studies were reported (Table 1), followed by group comparison (known-groups)



validity. Many of the researchers administered a battery of instruments to their sample and then provided validity evidence through correlational studies.
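
As a simple illustration of that practice, the sketch below correlates total scores on a hypothetical new scale with scores on two hypothetical established measures; all data are simulated, and in an actual validity study the battery would consist of real instruments chosen for theoretically expected convergent and discriminant relations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical sample size

# Simulated total scores on the new scale and on two established measures.
new_scale = rng.normal(50, 10, n)
measure_a = 0.6 * new_scale + rng.normal(0, 8, n)   # expected to converge with the new scale
measure_b = rng.normal(30, 5, n)                     # expected to be largely unrelated

for name, scores in (("established measure A", measure_a), ("established measure B", measure_b)):
    r = np.corrcoef(new_scale, scores)[0, 1]
    print(f"correlation with {name}: r = {r:.2f}")
```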

Reliability

Reliability studies were coded in several ways. Initially we coded the type of reliability

analysis: (a) internal consistency, (b) test-retest, (c) alternate forms, (d) inter-rater, (e) IRT

(marginal or average reliability), (f) unclear, and (g) no information. After selecting the type of reliability analysis, further questions were asked about the specific estimate used (e.g., coefficient

alpha). Across the articles we found 77 reliability studies (not all articles reported

reliability). Internal consistency was reported most frequently (Table 2), at 77%, and was overwhelmingly, and not surprisingly, estimated using coefficient alpha. Fourteen percent (11) of the studies reported test-retest reliability. A few other types of reliability were reported (inter-rater and IRT), but these were not common.
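
For reference, coefficient alpha can be computed directly from a respondents-by-items score matrix; the sketch below uses simulated Likert-type responses and is not tied to any of the coded articles.

```python
import numpy as np

def coefficient_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of item scores."""
    k = items.shape[1]
    sum_item_variances = items.var(axis=0, ddof=1).sum()
    total_score_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_variances / total_score_variance)

# Simulated 5-point Likert responses from 300 respondents to 8 items.
rng = np.random.default_rng(1)
true_score = rng.normal(3, 0.8, size=(300, 1))
responses = np.clip(np.round(true_score + rng.normal(0, 0.7, size=(300, 8))), 1, 5)

print(f"coefficient alpha = {coefficient_alpha(responses):.2f}")
```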

Criteria for eliminating items

In scale development researchers often report the criteria they used for eliminating items

from a scale. We devised 14 different categories that we found best represented the different

criteria being used in the literature and/or recommended in textbooks or other sources: (a) theory,

(b) advice/review of experts, (c) item-total correlation, (d) alpha if item deleted, (e) low factor

loadings, (f) loading on wrong factor, (g) cross-loaded, (h) not appropriate for population, (i)

item means, s.d., skew, etc., (j) item correlations (too low; too high), (k) content coverage, (l)

unclear, (m) no information, and (n) other. Authors often used more than one criterion for

eliminating items from scales (frequencies therefore will not add to 91). Results are shown in Table 3. The most frequently used criterion was the advice/review of experts (21% of the criteria coded); low factor loadings (18%) was also among the more prominently used criteria.
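
Two of these criteria, the corrected item-total correlation and alpha if item deleted, can be computed directly from a score matrix; the sketch below uses simulated responses for illustration only and does not reproduce the item analysis of any coded study.

```python
import numpy as np

def item_analysis(items: np.ndarray):
    """Corrected item-total correlation and alpha-if-item-deleted for each item."""
    k = items.shape[1]
    results = []
    for j in range(k):
        rest = np.delete(items, j, axis=1)  # all items except item j
        rest_total = rest.sum(axis=1)
        # Corrected item-total correlation: item j vs. the sum of the remaining items.
        r_it = np.corrcoef(items[:, j], rest_total)[0, 1]
        # Coefficient alpha recomputed for the k-1 remaining items.
        alpha_wo = ((k - 1) / (k - 2)) * (
            1 - rest.var(axis=0, ddof=1).sum() / rest_total.var(ddof=1))
        results.append((j, r_it, alpha_wo))
    return results

# Simulated 5-point responses from 250 respondents to 6 items sharing a common factor.
rng = np.random.default_rng(2)
latent = rng.normal(3, 0.8, size=(250, 1))
scores = np.clip(np.round(latent + rng.normal(0, 0.7, size=(250, 6))), 1, 5)

for j, r_it, alpha_wo in item_analysis(scores):
    print(f"item {j + 1}: corrected item-total r = {r_it:.2f}, alpha if deleted = {alpha_wo:.2f}")
```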

Item development procedures



In the initial stages of the scale development process, researchers usually describe how they developed their items. Based on preliminary research, we chose ten categories drawn from

recommendations in textbooks or other literature: (a) theory, (b) interview and/or focus group,

(c) other instruments, (d) suggested by experts, (e) clinical/other observations, (f) DSM, (g)

other, (h) unclear, (i) no information, and (j) responses to open-ended questions. Findings on

these procedures are presented in Table 4. Like the other coding options, it was possible to select

more than one category.

The most frequently used source for developing items was theory. This is consistent with suggestions in the current literature. For example, DeVellis (2011) states, “Theory is a great aid to clarity. Relevant social science theories should always be considered before developing a scale.” It is interesting to note that although theory was used in 52 articles (57%), that leaves a large percentage of authors who either did not consider theory or did not note it in their article. DeVellis (2011) also notes that

“Even if there is not available theory to guide the investigators, they must lay out their own conceptual formulations prior to trying to operationalize them. Better still would be to include description of how the new construct relates to existing phenomena and their operationalization.”

Based on DeVellis’s suggestion, it seems that there is room for improvement in this area.

Another commonly used approach was having experts assist in the item development process. Again, DeVellis discusses this practice in developing items: “…having experts review your item pool can confirm or invalidate your definition of the phenomenon. Reviewers also can evaluate the items’ clarity and conciseness.” Along with these two approaches, the practice of obtaining item ideas from other instruments was also used quite frequently (21

articles). Use of interviews and/or focus groups was also a popular choice. In some cases, focus

groups consisted of experts in the construct being measured, while in others participants were

sampled from the population with whom the instrument was intended to be used. The “other”

category included such responses as “open-ended questions given online” and “translation of items into English.” Noting these is helpful for future research, in that new categories can be developed based on themes found throughout the “other” category.

Purpose of the study

To differentiate between the types of scales throughout the articles, we coded the “purpose of the scale” using nine categories: (a) prediction, (b) classification, (c) assessment of learning or progress, (d) program evaluation, (e) research purposes, (f) clinical/counseling use, (g) other, (h) unclear, and (i) no information. Table 5 shows the percentages of studies

coded into each category. The most frequently listed purpose was “research purposes”

(55). We often coded studies into this category if the authors indicated that they planned to use

the scale for research, even if they did not explicitly state that this was the purpose for which the

scale was being developed. Thus, this category may be overrepresented. A large portion of the scales were also coded as being for “clinical/counseling use” (27), which is not surprising considering the journal selection. We also noted that nine articles were unclear in their

purpose, and for one scale the researchers did not state a purpose.

In our investigation it was evident that researchers are not always clear about the purpose

of the scale they are developing. This is problematic, because the methods used in all aspects of

the scale development process, from item writing to validity studies, are dependent upon the

purpose of the scale. There were also other areas in which researchers did not report specific

pertinent information. In such cases it was not possible to determine whether the researchers did not address the corresponding issue or simply did not report it.

Missing Data Treatment (Data screening)

Recent research in the treatment of missing data indicates that methods such as full

information maximum likelihood (FIML) and multiple imputation (MI) are superior to

traditional methods such as pairwise and listwise deletion. However, these newer methods are

often not implemented by applied researchers. We examined the missing data treatments

reported in scale development research by coding studies into five categories: (a) pairwise,

(b) listwise, (c) other, (d) unclear, and (e) no information. Fifty percent of the articles fell into the “no

information” category, and 22% used listwise deletion to handle missing data. The “other” and “unclear” categories accounted for approximately 13% each (Table 6). Pairwise deletion was rarely used.
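
To make the distinction between the two traditional approaches concrete, the sketch below contrasts listwise and pairwise deletion on a small, entirely hypothetical data set; FIML and MI require model-based estimation or imputation software and are not shown.

```python
import numpy as np

# Hypothetical responses to three items, with missing values coded as np.nan.
data = {
    "x": np.array([4.0, 3.0, np.nan, 5.0, 2.0, 4.0, 3.0, np.nan]),
    "y": np.array([3.0, np.nan, 2.0, 5.0, 1.0, 4.0, 2.0, 3.0]),
    "z": np.array([5.0, 4.0, 3.0, np.nan, 2.0, 5.0, np.nan, 4.0]),
}

# Listwise deletion: keep only respondents with complete data on every variable.
complete_all = ~np.isnan(data["x"]) & ~np.isnan(data["y"]) & ~np.isnan(data["z"])
print(f"listwise n = {complete_all.sum()}")

# Pairwise deletion: each correlation uses whichever cases are complete for that pair,
# so different correlations in the same matrix can rest on different subsamples.
for a, b in (("x", "y"), ("x", "z"), ("y", "z")):
    keep = ~np.isnan(data[a]) & ~np.isnan(data[b])
    r = np.corrcoef(data[a][keep], data[b][keep])[0, 1]
    print(f"pairwise {a}-{b}: n = {keep.sum()}, r = {r:.2f}")
```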

Discussion and Future Research

Examining these articles (N=91) confirms that more explicit and universal guidelines, both for the analyses conducted and for reporting standards, are needed. Such guidelines could allow for more informative presentation of current research and easier replication of studies. Our examination of current reporting practices in psychology has revealed inconsistencies even in the preliminary stages of this research. We also found that researchers rely, perhaps too heavily, on correlational validity studies and on internal consistency estimated by coefficient alpha, and that researchers often do not report important information

such as the treatment of missing data and the methods used in both developing and eliminating

items.

Future research should continue examining articles to obtain a broader sense of the

reporting practices and current analyses being performed during scale development. Researchers

should also consider examining whether specific reporting practices and/or scale development

practices lead to more valid and reliable scores from the resulting measures. This could then

provide an evidence-based framework of optimal scale development and reporting practices. The

ultimate goal, then, is to develop a research-based set of guidelines for scale development and

analysis.

Table 1. Validity

Type of validity evidence          Frequency    Percentage
Content                                 7            7%
Predictive                              9           10%
Classification                          1            1%
Multitrait-multimethod                  0            0%
Group comparison (known-groups)        18           19%
Correlational                          50           53%
Developmental progression               2            2%
Unclear                                 0            0%
No information                          0            0%
Other                                   7            7%
Note. Frequencies will not add to 91 because some articles reported no validity studies and others reported more than one.

Table 2. Reliability

Type of reliability analysis             Frequency    Percentage
Internal consistency                         59           77%
Test-retest                                  11           14%
Alternate forms                               0            0%
Inter-rater                                   3            4%
IRT (marginal or average reliability)         2            3%
Unclear                                       2            3%
No information                                0            0%
Note. Frequencies will not add to 91 because not all articles reported reliability and some reported more than one analysis.

Table 3. Criteria for Eliminating Items

Criterion                                Frequency    Percentage
Theory                                       13            8%
Advice/review of experts                     36           21%
Item-total correlation                       11            6%
Alpha if item deleted                         3            2%
Low factor loadings                          30           18%
Loading on wrong factor                       8            5%
Cross-loaded                                 10            6%
Not appropriate for population                4            2%
Item means, s.d., skew, etc.                  5            3%
Unclear                                       2            1%
No information                                0            0%
Item correlations (too low; too high)         4            2%
Content coverage                             23           13%
Other                                        22           13%
Note. Frequencies will not add to 91 because multiple criteria were coded.

Table 4. Item Development

Item development procedure            Frequency    Percentage
Theory                                    52           34%
Interview and/or focus group              20           13%
Other instruments                         21           14%
Suggested by experts                      32           21%
Clinical/other observations                1            1%
DSM                                        0            0%
Other                                     18           12%
Unclear                                    1            1%
No information                             2            1%
Responses to open-ended questions          5            3%
Note. Frequencies will not add to 91 because multiple criteria were coded.

Table 5. Purpose of Scale

Purpose                          Frequency    Percentage
Prediction                            4            3%
Classification                        2            2%
Assess learning or progress          11            9%
Program evaluation                    0            0%
Research purposes                    55           46%
Other (specify)                      10            8%
Unclear                               9            8%
No information                        1            1%
Clinical/counseling use              27           23%
Note. Frequencies will not add to 91 because multiple purposes were coded.

Table 6. Missing Data Treatment

Missing data treatment    Frequency    Percentage
Pairwise                       2            3%
Listwise                      16           22%
Other                          9           13%
Unclear                        9           13%
No information                36           50%
Note. Frequencies will not add to 91 because multiple criteria were coded.

References

Cabrera-Nguyen, P. (2010). Author guidelines for reporting scale development and validation results in the Journal of the Society for Social Work and Research. Journal of the Society for Social Work and Research, 1(2), 99-103.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Holt, Rinehart and Winston.

DeVellis, R. F. (2011). Scale development: Theory and applications (Vol. 26). Thousand Oaks, CA: Sage.

Fornaciari, C. J., Sherlock, J. J., Ritchie, W. J., & Dean, K. L. (2005). Scale development practices in the measurement of spirituality. International Journal of Organizational Analysis, 13(1), 28-49.

Hinkin, T. R. (1995). A review of scale development practices in the study of organizations. Journal of Management, 21(5), 967-988.

Hinkin, T. R. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1(1), 104-121.

Worthington, R. L., & Whittaker, T. A. (2006). Scale development research: A content analysis and recommendations for best practices. The Counseling Psychologist, 34(6), 806-838.
