APS 2013 Fisher
Research suggests that there are inconsistent guidelines for the item development and
analysis process. In many cases it is not clear what guidelines researchers use to define the
constructs to be measured, generate an item pool, assess the quality of the items, revise or
remove items from the scale, or examine the reliability and validity of the resultant scale scores.
Guidelines that do exist are not always consistent, and it is not clear whether the use of one
recommendation over another is associated with a better product in the form of more reliable and
valid test scores. Establishing research-based guidelines in these areas can help to ensure that the
scale development process will result in products that provide valid and reliable scores.
To better understand the scale development process, researchers often first turn to the
existing literature, such as DeVellis (2011), Hinkin (1998), and Worthington and Whittaker (2006).
Although past research has established some guidelines regarding the types of analyses that
should be conducted during scale development and item analysis, and criterion values for the
numerical indexes yielded by these analyses have been suggested, there is currently a lack of
consistency in these recommendations. As just one example, DeVellis (2011) uses a criterion of
.50 for factor loadings of salient variables, whereas Crocker and Algina (1986) state that
“loadings less than .30 are usually considered unimportant” (p. 299), and Hinkin (1998) offers
yet another set of guidelines. However, our reading of the literature indicates that inconsistency is the rule rather
than the exception in terms of recommendations for practice, leaving researchers in something of
a quandary regarding what constitutes “best practice” in the areas of scale development and item
analysis.
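To make the practical stakes of this inconsistency concrete, the following minimal sketch (with made-up loadings, not drawn from any reviewed study) shows how the .30 and .50 cutoffs cited above can lead to very different item pools:

```python
# Hypothetical illustration: how the choice of factor-loading cutoff
# (.30 vs. .50) changes which items survive an item-analysis step.
loadings = {"item1": 0.72, "item2": 0.48, "item3": 0.35, "item4": 0.28}

for cutoff in (0.30, 0.50):
    retained = [item for item, loading in loadings.items() if loading >= cutoff]
    print(f"cutoff = {cutoff:.2f}: retain {retained}")

# cutoff = 0.30: retain ['item1', 'item2', 'item3']
# cutoff = 0.50: retain ['item1']
```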
Moreover, the only systematic analysis of the scale development process was conducted in
2006 by Worthington and Whittaker. Although their results are useful for informing researchers
of best practice in scale development, their results were limited to a single journal (Journal of
Counseling Psychology) and focused on exploratory and confirmatory factor analyses (EFA/CFA)
rather than on item analysis procedures more broadly. Nonetheless,
their results were consistent with our assertion that reporting of results is inconsistent. The
purpose of this study is therefore to confront the lack of consistency in criteria for scale
development by examining articles across four different journals in which the focus was on scale
development and item analysis, and then cataloguing the methods used.
Our goal in doing so is twofold. First, given that the use of scale development and item
analysis procedures has not previously been studied in the educational or psychological
literature, our goal is to conduct a systematic study in this area to determine which guidelines, if
any, are currently used by those developing scales. As part of this study, we will examine the
reliability and validity of scores from the resultant scales. We will then incorporate this
information into the development of a set of guidelines that appear to result in optimal levels of
score reliability and validity.
Our ultimate goal, then, is to develop a research-based set of guidelines for scale
development and analysis. We first introduce scale development and briefly describe what
information we coded from the selected articles. We then report our methods and findings from
the content analysis, specifically results surrounding the scale development process, including
validity, reliability, purpose of the scale, item elimination criteria, data screening, and
item development, and we discuss what the majority of the examined articles have reported and
what potential issues there may be with the current practices. We then provide a detailed plan for
future research, as this study is essentially the initial groundwork to better understand the current
practices.
Scale Development
Cabrera-Nguyen (2010) notes that instruments with “demonstrated reliability and validity
may also help practitioners select evidence-based interventions that best match the needs of
their clients.” While this observation was focused specifically on the field of social work, it
speaks to a need that is also present in the field of psychology. To help address this need for
quality measurement instruments, we are beginning the initial stages of a content analysis to
better inform us of current practices in scale development.
Methods
We examined four journals, including Educational and Psychological Measurement. We
selected issues going back every other year for the past ten years, which resulted in examining
five years of issues in total. These journals
were chosen because they yielded the largest number of scale development articles in a
preliminary analysis. We then reviewed each issue of the four journals for articles in which
scales were developed and/or revised. A preliminary review of the journals indicated that
approximately 170 of the articles in these four journals fit into this category. For each article, we
coded: (a) information provided on the item development procedures, (b) item analysis
procedures, (c) criteria for removing items, (d) procedures used to determine score reliability and
validity, (e) data screening procedure, and (f) practices related to the use of EFA, CFA and item
response theory (IRT). In addition to this information, we coded the specific values used as
cut-off criteria in these analyses.
We designed a survey into which to enter codes for each of the articles included in the
study and to summarize the results. The survey was designed to best organize the data
collected. We also recorded the journal name, article name, and type of sample,
which may be used in future research. To establish interrater agreement, each of three raters
individually rated the same eight articles from the selected journals and discussed any discrepancies
until consensus was reached. Following this training process, the three raters coded one article
independently for the purposes of calculating interrater agreement. Interrater agreement was
calculated to be 86.7% following the training of the three raters. It should be noted that most
discrepancies occurred when the authors of the articles being coded provided unclear information
about their procedures.
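The article does not specify the agreement formula; a common choice, and a plausible reading of the 86.7% figure, is simple percent agreement across rater pairs and coded items. A minimal sketch with hypothetical codes:

```python
from itertools import combinations

# Hypothetical codes from three raters on ten coding items for one article.
# Percent agreement: proportion of rater pairs, per item, assigning the same code.
ratings = {
    "rater1": ["a", "b", "a", "c", "a", "b", "b", "a", "c", "a"],
    "rater2": ["a", "b", "a", "c", "a", "b", "a", "a", "c", "a"],
    "rater3": ["a", "b", "a", "c", "a", "b", "b", "a", "b", "a"],
}

pairs = list(combinations(ratings.values(), 2))  # all rater pairs
n_items = len(next(iter(ratings.values())))
agree = sum(r1[i] == r2[i] for r1, r2 in pairs for i in range(n_items))
percent_agreement = 100 * agree / (len(pairs) * n_items)
print(f"{percent_agreement:.1f}%")  # 86.7% with these made-up codes
```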
After the articles (N=91) were coded, analyses were conducted to examine the current
reporting practices in scale development. In this article, (a) reliability and (b) validity
information are presented along with (c) criteria for eliminating items, (d) item development
procedures, (e) the purpose of the scale, and (f) the treatment of missing data.
Results
Validity
In our survey we established 10 categories under which the reported validity could fall:
(a) content, (b) predictive, (c) classification, (d) multitrait-multimethod, (e) group comparison
(known-groups), (f) correlational, (g) developmental progression, (h) unclear, (i) no information,
and (j) other. We used the last three options if (a) the researchers were unclear in their description
of the validity study, (b) a validity study was mentioned but no specific information was provided,
or (c) a type of validity study other than the options provided was mentioned. While we
coded 91 articles, sample sizes will not add to 91 because there were articles where no validity
information was present and other articles where there were multiple validity studies reported
(N=94). The majority of articles had at least one validity study present. Overwhelmingly, the
most commonly reported type of evidence was correlational validity (Table 1). Many of the
researchers gave a battery of instruments to their sample and then reported correlations between
scores on the new scale and the other measures as validity evidence.
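A typical correlational validity analysis of this kind reduces to correlating total scores on the new scale with scores on the established measures in the battery. A minimal sketch with simulated (not real) totals:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical totals: the new scale plus two established measures
# administered as a battery to the same 150 respondents.
new_scale = rng.normal(50, 10, size=150)
convergent = new_scale * 0.7 + rng.normal(0, 8, size=150)   # related construct
discriminant = rng.normal(30, 5, size=150)                  # unrelated construct

for name, other in [("convergent measure", convergent),
                    ("discriminant measure", discriminant)]:
    r = np.corrcoef(new_scale, other)[0, 1]
    print(f"r(new scale, {name}) = {r:.2f}")
```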
Reliability
Reliability studies were coded in several ways. Initially we coded the type of reliability
analysis: (a) internal consistency, (b) test-retest, (c) alternate forms, (d) inter-rater, (e) IRT
(marginal or average reliability), (f) unclear, and (g) no information. After selecting the
reliability analysis, further questions were asked about the specific type of analysis (e.g., coefficient
alpha). After examining the articles we found 77 reliability studies (not all articles reported
reliability). Internal consistency was reported most frequently (Table 2), at 77%. Internal
consistency was overwhelmingly estimated, not surprisingly, using coefficient alpha. Fourteen
percent (11) of the studies reported test-retest reliability. A few other types were reported
(inter-rater: 3 studies; IRT marginal or average reliability: 2 studies).
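For reference, the standard formula for coefficient alpha on a k-item scale (not specific to any reviewed study) is:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_{Y_i}^{2}}{\sigma_{X}^{2}}\right)$$

where $\sigma_{Y_i}^{2}$ is the variance of item $i$ and $\sigma_{X}^{2}$ is the variance of total scores.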
Criteria for Eliminating Items
In scale development, researchers often report the criteria they used for eliminating items
from a scale. We devised 14 different categories that we found best represented the different
criteria being used in the literature and/or recommended in textbooks or other sources: (a) theory,
(b) advice/review of experts, (c) item-total correlation, (d) alpha if item deleted, (e) low factor
loadings, (f) loading on wrong factor, (g) cross-loaded, (h) not appropriate for population, (i)
item means, standard deviations, skewness, etc., (j) item correlations (too low; too high), (k) content coverage, (l)
unclear, (m) no information, and (n) other. Authors often used more than one criterion for
eliminating items from scales (frequencies will not add to 91). Results are shown in Table 3.
The most commonly used criteria were the advice/review of experts (21%) and low factor loadings.
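Two of the statistical criteria above, item-total correlation and alpha-if-item-deleted, are straightforward to compute. A minimal sketch with simulated (not real) response data:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(1, 6, size=(100, 5)).astype(float)  # 100 respondents, 5 Likert items

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha: (k / (k - 1)) * (1 - sum(item variances) / total variance)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

for i in range(scores.shape[1]):
    rest = np.delete(scores, i, axis=1)
    # Corrected item-total correlation: item vs. total of the remaining items.
    r_it = np.corrcoef(scores[:, i], rest.sum(axis=1))[0, 1]
    alpha_if_deleted = cronbach_alpha(rest)
    print(f"item {i + 1}: corrected r_it = {r_it:.2f}, alpha if deleted = {alpha_if_deleted:.2f}")
```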
Item Development
In the initial stages of the scale development process, researchers usually describe how they
developed their items. Drawing on preliminary research, we chose ten categories reflecting
recommendations in textbooks or other literature: (a) theory, (b) interview and/or focus group,
(c) other instruments, (d) suggested by experts, (e) clinical/other observations, (f) DSM, (g)
other, (h) unclear, (i) no information, and (j) responses to open-ended questions. Findings on
these procedures are presented in Table 4. As with the other coding options, it was possible to
select more than one category for a given article.
The most frequently used tool for developing items was theory. This is consistent with
suggestions in the current literature. For example, DeVellis (2011) states “Theory is a great aid
to clarity. Relevant social science theories should always be considered before developing a
scale.” It is interesting to note that although theory was used in 52 articles, or 57%, that does
leave a large percentage that did not consider theory, or did not note it in their article. Based on
DeVellis's suggestion, it seems that there is room for improvement in this area.
Another commonly used approach was having experts assist in the item development
process. Again, DeVellis discusses this tool in developing items: “…having
experts review your item pool can confirm or invalidate your definition of the phenomenon.
Reviewers also can evaluate the items’ clarity and conciseness”. Along with these two tools, the
practice of obtaining item ideas from other instruments was also used quite frequently (21
articles). Use of interviews and/or focus groups was also a popular choice. In some cases, focus
groups consisted of experts in the construct being measured, while in others participants were
sampled from the population with whom the instrument was intended to be used. The “other”
category included such responses as “open-ended questions given online”, and “translation of
items into English”. Noting these is helpful for future research in that new categories can be
added to the coding scheme.
Purpose of the Scale
To differentiate between the types of scales throughout the articles, we coded the
“purpose of the scale” using nine categories: (a) prediction, (b) classification, (c) assessment of
learning or progress, (d) program evaluation, (e) research purposes, (f) clinical/counseling use,
(g) other, (h) unclear, and (i) no information. Table 5 shows the percentages of studies
coded into each category. The most frequently listed purpose was “research purposes”
(55 articles). We often coded studies into this category if the authors indicated that they planned to use
the scale for research, even if they did not explicitly state that this was the purpose for which the
scale was being developed. Thus, this category may be overrepresented. We also saw a large
portion of the scales being developed for “clinical/counseling use” (27 articles), which is not surprising
considering the journal selection. We did also note that nine articles were unclear in their
purpose, and for one scale the researchers did not state a purpose.
In our investigation it was evident that researchers are not always clear about the purpose
of the scale they are developing. This is problematic, because the methods used in all aspects of
the scale development process from item writing to validity studies are dependent upon the
purpose of the scale. There were also other areas in which researchers did not report specific
pertinent information. In such cases it was not possible to determine whether the researchers did
not address the corresponding issue or simply did not report it.
Missing Data
Recent research in the treatment of missing data indicates that methods such as full
information maximum likelihood (FIML) and multiple imputation (MI) are superior to
traditional methods such as pairwise and listwise deletion. However, these newer methods are
often not implemented by applied researchers. We examined the missing data treatments
reported in scale development research by coding studies into five categories: (a) pairwise,
(b) listwise, (c) other, (d) unclear, and (e) no information. Fifty percent of articles fell into the “no
information” category, and 22% used listwise deletion for handling missing data. The “other”
and “unclear” categories accounted for 13% each. Pairwise deletion was rarely utilized (3%).
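As an illustration of the gap between listwise deletion and a modern alternative, the sketch below (hypothetical data; scikit-learn's experimental IterativeImputer standing in as a simple model-based substitute for full MI) compares the sample retained under each approach:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical item responses with scattered missingness.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.integers(1, 6, size=(200, 5)).astype(float),
                    columns=[f"item{i}" for i in range(1, 6)])
data = data.mask(rng.random(data.shape) < 0.10)  # ~10% of cells set missing

# Listwise deletion discards any respondent with at least one missing item.
listwise = data.dropna()
print(f"listwise deletion keeps {len(listwise)} of {len(data)} respondents")

# Model-based imputation retains all respondents by estimating missing cells
# from the observed items.
imputed = IterativeImputer(random_state=0).fit_transform(data)
print(f"imputation keeps {imputed.shape[0]} of {len(data)} respondents")
```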
Discussion
Examining these articles (N=91) provides us with confirmation that more explicit and
universal guidelines, both for the analyses conducted and for reporting standards, are needed.
Such guidelines could allow for more informative presentation of current research and easier
replication of studies. Our examination of current reporting practices in psychology has revealed
inconsistencies even in the preliminary stages of this research. We also found that
researchers rely, perhaps too heavily, on correlational validity studies and on internal consistency
estimated by coefficient alpha, and that they often do not report important information
such as the treatment of missing data and the methods used in both developing and eliminating
items.
Future research should continue examining articles to obtain a broader sense of the
reporting practices and current analyses being performed during scale development. Researchers
should also consider examining whether specific reporting practices and/or scale development
practices lead to more valid and reliable scores from the researched measures. This could then
provide an evidence-based framework of optimal scale development and reporting practices. The
ultimate goal, then, is to develop a research-based set of guidelines for scale development and
analysis.
Table 1. Validity evidence reported
Type of validity evidence          Frequency   Percent
Content                            7           7%
Predictive                         9           10%
Classification                     1           1%
Multitrait-multimethod             0           0%
Group comparison (known-groups)    18          19%
Correlational                      50          53%
Developmental progression          2           2%
Unclear                            0           0%
No information                     0           0%
Other                              7           7%
Note. Frequencies will not add to 91 because multiple criteria were coded.
Table 2. Reliability analyses reported
Type of reliability analysis             Frequency   Percent
Internal consistency                     59          77%
Test-retest                              11          14%
Alternate forms                          0           0%
Inter-rater                              3           4%
IRT (marginal or average reliability)    2           3%
Unclear                                  2           3%
No information                           0           0%
Note. Frequencies will not add to 91 because multiple criteria were coded.
Table 5. Purpose of the scale
Purpose                        Frequency   Percent
Prediction                     4           3%
Classification                 2           2%
Assess learning or progress    11          9%
Program evaluation             0           0%
Research purposes              55          46%
Clinical/counseling use        27          23%
Other (specify)                10          8%
Unclear                        9           8%
No information                 1           1%
Note. Frequencies will not add to 91 because multiple criteria were coded.
Table 6. Missing data treatment
Treatment         Frequency   Percent
Pairwise          2           3%
Listwise          16          22%
Other             9           13%
Unclear           9           13%
No information    36          50%
Note. Frequencies will not add to 91 because multiple criteria were coded.
References
Cabrera-Nguyen, P. (2010). Author guidelines for reporting scale development and validation
results in the Journal of the Society for Social Work and Research. Journal of the
Society for Social Work and Research, 1(2), 99-103.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL:
Holt, Rinehart and Winston.
DeVellis, R. F. (2011). Scale development: Theory and applications (Vol. 26). Thousand Oaks,
CA: Sage.
Fornaciari, C. J., Sherlock, J. J., Ritchie, W. J., & Dean, K. L. (2005). Scale development
practices in the measurement of spirituality. International Journal of Organizational
Analysis, 13(1), 28-49.
Hinkin, T. R. (1998). A brief tutorial on the development of measures for use in survey
questionnaires. Organizational Research Methods, 1(1), 104-121.
Worthington, R. L., & Whittaker, T. A. (2006). Scale development research: A content analysis
and recommendations for best practices. The Counseling Psychologist, 34(6), 806-838.