



Journal of Business Research 57 (2004) 98-107

The use of expert judges in scale development: Implications for improving face validity of measures of unobservable constructs

David M. Hardesty a,*, William O. Bearden b

a Department of Marketing, School of Business Administration, University of Miami, 523D Jenkins Building, Coral Gables, FL 33124-6554, USA
b Department of Marketing, Moore School of Business, University of South Carolina, Columbia, SC 29208, USA

Abstract

A review of the assessment of face validity in consumer-related scale development research is reported, suggesting that concerns over the
lack of consistency and guidance regarding item retention during the expert judging phase of scale development are warranted. After
analyzing data from three scale development efforts, guidance regarding the application of different decision rules to use for item retention is
offered. Additionally, the results suggest that research using new, changed, or previously unexamined scale items should, at a minimum, be
judged for face validity.
© 2003 Elsevier Science Inc. All rights reserved.

Keywords: Face validity; Content validity; Scale development

* Corresponding author. Tel.: +1-305-284-5011. E-mail address: hardesty@miami.edu (D.M. Hardesty).

0148-2963/$ - see front matter © 2003 Elsevier Science Inc. All rights reserved. doi:10.1016/S0148-2963(01)00295-8

1. Introduction

Concerns regarding the lack of consistency and guidance regarding how to use the expertise of judges to determine whether an item should be retained for further analysis in the scale development process motivated this investigation. Moreover, there is confusion regarding the difference between face and content validity, and no previous research has directly addressed the procedures used by consumer and marketing researchers to determine item retention during face validity assessment. Therefore, our first objective was to assimilate and explain the difference between content and face validity. Our second objective was to review the use of expert judges in previous consumer and marketing research. Finally, and based on this review, our third objective was to test three frequently employed decision rules used in consumer and marketing research, in order to investigate the relative effectiveness of alternative decision rules for assessing the face validity of scale items being considered in measure development processes.

We begin by differentiating between face and content validity, and then we describe the importance of having face valid items. Next, the results of a review of marketing and consumer behavior-related scale development efforts are presented. The review consisted of an assessment of the scale development articles reviewed in Bearden and Netemeyer's (1999) second edition compilation of marketing scales. These scales were chosen since they are among the most frequently employed and most rigorously constructed scales used by consumer and marketing researchers. The review was undertaken for two primary reasons. First, we wanted to determine the prevalence of expert judging as a tool to aid in item face validity assessment; and second, we wanted to gain an understanding of how previous researchers used expert judging to reduce an initial item pool and to determine which items should be further analyzed. After establishing that there has been a lack of consistency regarding the rules used for item retention, several data sets were analyzed in an attempt to provide future researchers with guidance regarding item retention decisions. Finally, the article concludes with remarks concerning the implications and limitations of our research, as well as a discussion of future research avenues.

2. Face validity assessment

2.1. The importance of having face valid items

Churchill (1979) proposed a widely accepted general paradigm for developing measures of marketing constructs,
and the second step of this paradigm, the generation and editing of items, is the focus of this article. Implicit in this stage is the process whereby generated items are judged for face and/or content validity. Specifically, in the second step of the above scale development process, an initial item pool is established, which should possess content and face validity (Churchill, 1979). Content validity, as well as face validity, has been defined variously by previous researchers (e.g., Schriesheim et al., 1993; Nunnally and Bernstein, 1994), and the distinction between the two concepts is not always clear. As one noted example, Nunnally and Bernstein (1994) defined content validity as the degree to which a measure's items represent a proper sample of the theoretical content domain of a construct. In order for the criterion of content validity to be met by the initial pool of items, these items need to be face valid. Face validity has been defined as reflecting the extent to which a measure reflects what it is intended to measure (Nunnally and Bernstein, 1994). Similarly, Allen and Yen (1979), Anastasi (1988), and Nevo (1985) defined face validity as the degree to which respondents or users judge that the items of an assessment instrument are appropriate to the targeted construct and assessment objectives.

Often, face and content validity have been used interchangeably even though there is an important conceptual difference. One helpful way to distinguish between face and content validity is to consider the domain of a construct being represented by a dartboard. In order for the criterion of content validity to be established, darts must land randomly all over the board to obtain a proper representation of the construct. Therefore, if darts were located on only the left-hand side of the board (i.e., items were measuring only half of the domain of a construct), the measure would not be content valid. Relatedly, if items are generated that are too similar and do not tap the full domain of the construct (i.e., the entire dartboard), content validity is not established. Using the dartboard analogy, an item has face validity if it hits the dartboard; otherwise, the item does not represent the intended construct. Therefore, researchers must ensure that the items in the initial pool reflect the desired construct (i.e., hit the dartboard). This validity assessment is necessary since inferences are made based on the final scale items; the items must therefore be deemed face valid if we are to have confidence in any inferences made using the final scale form.

Importantly, if items from a scale are not face valid, the overall measure cannot be a valid operationalization of the construct of interest. Hence, face validity is a necessary but not sufficient condition for ensuring construct validity. That is, items must reflect what they are intended to measure (i.e., face validity), represent a proper sample of the domain of a construct (i.e., content validity), and pass other tests of validity (e.g., discriminant, convergent, and predictive validity) in order for a measure to have construct validity.

Unfortunately, consumer researchers often fail to include an evaluation of the face validity of items when developing measures. Similar to the admonition regarding content adequacy by Schriesheim et al. (1993), it seems appropriate to suggest that consumer and marketing research that employs new, untested, or modified measures should provide evidence of face validity for the items being used. Including a judging phase to help ensure the face validity of scale items should not normally be a tremendous burden on researchers and may dramatically improve the scales being used in consumer and marketing research. After all, sound scales are necessary for any scientific discipline to move forward.

3. Face validity assessment in prior consumer and marketing research: a review

Our analyses began with a review that focused on measures summarized in Bearden and Netemeyer's (1999) Handbook of Marketing Scales. Their book contains a summary of approximately 200 multi-item scales that assess a variety of consumer and marketing unobservable constructs. Each scale included in their text met the following conditions: (1) the measure was developed from a reasonable theoretical base and/or conceptual definition; (2) the measure was composed of several (i.e., at least three) items or questions; (3) the measure was developed within the marketing or consumer behavior literature and was used in, or was relevant to, the marketing or consumer behavior literature; (4) at least some scaling procedures were employed in scale development; and (5) estimates of reliability and/or validity existed (Bearden and Netemeyer, 1999, p. 1). Our review of these measures indicated that some form of expert judging was definitely used to evaluate the face validity of items in 39 of these scales. In reviewing each of the 39 scales that reported expert judging of the face validity of items, the following information was gathered: (1) name of construct; (2) author names; (3) initial number of items; (4) number of items remaining after judging; (5) number of items in the final scale; (6) number of judges; and (7) the decision rule used for item retention. Table 1 summarizes the results from this review.

There were a number of occasions where each of the above pieces of information, even for the 39 scales using expert judging, was either not included or was vague. So, of the approximately 200 of the most rigorously tested scales in consumer and marketing research, only about 19.5% (or n = 39) of the articles definitely reported the use of expert judging to aid in face validity assessment. While it is possible that expert judging was conducted but not reported, the percentage seems surprisingly low given the importance of having face valid items in the development of psychometrically sound scales (cf. Churchill, 1979) and given the support for expert judging we found in the literature. As shown in Table 1, individual constructs or facets were developed based on an initial item pool consisting of from 10 to 180 items. On average, each facet or overall construct
Table 1
Expert judging of face validity: 39 measures reported in Bearden and Netemeyer’s (1999) Handbook of Marketing Scales
Initial number Number of items Number of items Number of
Name of construct (authors) of items after judging in the final scale expert judges Decision rule for item retention
Compliant Interpersonal Orientation (Cohen, 1967) 10 10 10 7 9 of 10 items received seven of seven based on construct
definition; one item received six out of seven
Aggressive Interpersonal Orientation (Cohen, 1967) 15 15 15 7 13 of 15 items received seven of seven based on construct
definition; two items received six out of seven
Detached Interpersonal Orientation (Cohen, 1967) 10 10 10 7 7 of 10 items received seven of seven based on construct
definition; three items received six out of seven



Preference for Consistency (Cialdini et al., 1995) 72 60 18 not reported items deleted for redundancy or poor face validity
Consumer Self-Actualization Test: CSAT (Brooker, 1975) 150 150 20 4 judges pretested items for clarity, familiarity and wording,
and the like
Self-Concepts, Person Concepts, and Product Concepts (Malhotra, 1981) 70 27 15 4 judges made a list of about 35 items and all agreed on 27 items to retain
Separateness-Connectedness Self-Schema (Wang and Mowen, 1997) 60 32 9 colleagues in the marketing department items were judged for face validity
Achievement and Physical Vanity (Netemeyer et al., 1995) 99 60 5, 5 marketing professors and PhD students judges consistently rated the items as at least somewhat characteristic
Consumer Impulsiveness (Puri, 1996) 25 12 12 3 adjectives that appeared either ambiguous or unrelated
were removed
Country Image (Martin and Eroglu, 1993) 60 29 14 5 average interjudge agreement and reliability was obtained
for 29 of the 60 word pairs; the Holsti (1969) procedure
was used to determine an average interjudge agreement
and reliability
Consumer Ethnocentrism (Shimp and Sharma, 1987) 180 100 17 6 at least five of six chose the items to be in the facet under
consideration
Market Mavenism (Feick and Price, 1987) 40 19 6 a group not reported
Consumer Independent Judgment Making and Consumer Novelty Seeking (Manning et al., 1995) 74, 60 16, 16 6, 8 5, 5 items that were judged to be not representative by any of the judges or evaluated as clearly representative by fewer than three of the judges were not retained
Use Innovativeness, 5 facets: Creativity/Curiosity (CC), Risk Preferences (RP), Voluntary Simplicity (VS), Creative Reuse (CR), Multiple Use Potential (MUP) (Price and Ridgway, 1983) 70 60 13 (CC), 9 (RP), 5 (VS), 10 (CR), 7 (MUP) several reduced to 60 based on judgment of several experts
Consumer Susceptibility to Interpersonal Influence (Bearden et al., 1989) 135 (Study 1) 86 12 5 at least four of five judges chose items to be in the facet under consideration; 86 (Study 2) 62 12 4 three judges rated clearly representative and one rated somewhat representative
ECOSCALE: Environmentally Responsible Consumer (Stone et al., 1995) 50 not reported 31 a group of university professors the professors knew that the survey was trying to measure involvement with the environment; also, the questionnaire was generally well received, thus supporting face validity
Leisure (Unger and Kernan, 1983) 42 36 26 10 judges were asked to indicate which dimension each item
represented; items were eliminated if three or more
assigned incorrect classifications
Involvement (Zaichkowsky, 1985) 168 43, 23 20 3, 5 each word pair was rated: (1) clearly representative, (2)
somewhat representative, or (3) not representative; word
pairs that were not rated representative for any of three
choices were dropped; then, a second judging phase using
the same procedure was implemented; items were deleted
if less than 12 of 15 ratings were representative
Involvement Revisited (Zaichkowsy, 1994) 168 35 10 3, 5 all three judged as clearly or somewhat representative and
then five new judges rated as clearly or somewhat
representative at least 80% of the time
Purchasing Involvement (Slama and Tashchian, 1985) 150 75 33 30 75% agreement that the item is appropriate
Exploratory Acquisition of Product and Exploratory Information Seeking (Baumgartner and Steenkamp, 1996) 89, 89 41, 28 10 5 four of five judges classified items correctly
Shopping Value (Babin et al., 1994) 71 53 15 3 each judge was given a description of hedonic and
utilitarian value and asked to sort the items into hedonic,
utilitarian, or other; any item classified as representative
by all three judges was retained; five additional items
were retained following a discussion among the judges
Coupon Proneness (CP)/Value Consciousness (VC) (Lichtenstein et al., 1990) 33 (CP, Study 1) 25 8 5; 33 (VC, Study 1) 18 7 5 at least four of five judges chose items to be in the facet under consideration; 25 (CP, Study 2) 25 8 5; 18 (VC, Study 2) 15 7 5 four of five judges rated items as being at least somewhat representative
Trust, Expertise, and Attractiveness of Celebrity Endorsers (Ohanian, 1990) 104 72 5, 5, 5 52 items with 75% or more agreement as belonging to a certain construct were thus retained for further analysis
Consumer Skepticism Toward Advertising 124 31 9 4 judges were asked to rate each item as a very good, good,
(Obermiller and Spangenberg, 1998) fair, or poor representation of its content; items were
retained that were rated very good by at least three judges
and poor by none
Physical Distribution Service Quality (Bienstock et al., 1997) 45 36 15 33 if fewer than three judges mentioned the item as being appropriate and the authors judged the item to lack face validity, it was deleted
Consumer Alienation from the Marketplace, four variants (Allison, 1978) 115 50 35 35 75% of judges agreed items would differentiate between alienated and nonalienated consumers and 60% or more attributed the item to the same variant of consumer alienation
Consumer Discontent (Lundstrum and Lamont, 1976) 118 99 82 10 statements that did not fit into either pro- or anti-business
sentiments were eliminated
Ethical Behavior (Ferrell and Skinner, 1988) 70 not given 6 11 items were removed if any judge felt it lacked face validity
Business Ethics (Reidenbach and Robin, 1990) 33 33 8 3 judges partitioned the items into moral philosophies
Excellence (Sharma et al., 1990) 200 34, 31 16 8, 18 judges were asked to sort the items into eight groups; statements were retained if at least seven of eight placed them in the same dimension; then, judges were asked to indicate whether or not each item represented the attribute it was meant to represent; statements on which 70% of judges agreed were retained
Market Orientation (Narver and Slater, 1990) not given not given 6, 4, 5, 3, 3 3, 3 items were submitted to two panels and were rated highly
consistent with market orientation by all
Work – Family Conflict and Family – Work Conflict 57, 53 22, 21 5, 5 4, 4 items were retained if judged at least somewhat
(Netemeyer et al., 1996) representative by all
Performance of Industrial Salespersons (Behrman and Perreault, 1982) 100 65 31 a number of judges performance items that were evaluated as ambiguous, not well categorized, not representative of the majority of industrial selling job situations, or simply not important were eliminated or modified
Consumer Orientation of Salespeople (SOCO) (Saxe and Weitz, 1982) 104 70 24 24 judges rated the items as clearly, somewhat, or not representative; items were retained if at least 50% of the judges rated them clearly representative; 10 unrelated items were also included to monitor judging; one judge was subsequently deleted
Buying Influence (Kohli and Zaltman, 1988) not given not given 9 a panel of judges critiqued the structure and content of items
Social Power (Swasy, 1979) 150 85 31 6 five of six judges classified the item as an indicator for the
same power type and the item was not classified as an
indicator of the other categories three or more times
Distributor Power and Manufacturer Power (Butaney and Wortzel, 1988) 27, >40 22, 21 17, 21 12 not reported
Power Sources (Gaski and Nevin, 1985) not given not given 15, 6, 6, 15, 10, 10, 5, 2 not given supplier management made additions and deletions to assess face validity
Channel Leadership (Schul et al., 1983) 19 9 3, 3, 3 about 50 not reported
Reseller Performance (Kumar et al., 1992) >100 34 5, 5, 4, 4, 4, 4, 4, 4 >21 21 graduate students did an item sort task to assign items to the hypothesized facet
had 65 items in the initial item pool. After judging, the number of items remaining ranged from 3 to 150 and averaged approximately 32. The average number of judges used was approximately 10 per construct or facet, with the number of judges employed ranging from 3 to 52. Finally, the number of items in the final scale ranged from 2 to 82 and averaged approximately 12.

As the results of our review suggest, the establishment of face validity has historically involved a mix of different judgmental procedures and approaches. Judges are often exposed to individual items and asked to evaluate the degree to which the items are representative of a construct's conceptual definition. One common way of judging items is to use some variant of the method employed by Zaichkowsky (1985), whereby each item is rated by a panel of judges as "clearly representative," "somewhat representative," or "not representative" of the construct of interest. Of the 39 articles where expert judging was reported to aid in the assessment of face validity, 10 used Zaichkowsky's exact procedure or one very similar. As an example of one of the modified procedures, Obermiller and Spangenberg (1998) extended Zaichkowsky's procedure to include four possibilities ("very good," "good," "fair," or "poor" representation of the construct). One interesting method employed by Saxe and Weitz (1982), who also used the Zaichkowsky procedure, was including 10 unrelated items to assess the quality of judging. This procedure resulted in the subsequent deletion of the responses from one of the judges.

Another common method using expert judges is the assignment of items to either an overall construct definition or, for multifaceted constructs, one of the construct's dimension definitions. In this approach, a panel of judges is given the definition of each construct or construct dimension, as well as a list of all items. Judges are then asked to assign each item to one of the construct definitions or to a category labeled "other." Variations of this procedure have been used for multidimensional constructs (cf. Ohanian, 1990), conceptually different constructs being developed simultaneously (cf. Baumgartner and Steenkamp, 1996), as well as unidimensional constructs (cf. Shimp and Sharma, 1987). This procedure or a similar variant was used by 14 of the authors who used expert judging to aid in the assessment of face validity of scale items. The remainder of the authors either used some general procedure to assess the face validity of items or failed to report adequately the nature of the procedures employed.

Regardless of the procedure employed, authors must determine which items to retain for further analysis. Scale developers often use different rules for determining which items to retain (cf. Bearden et al., 1989; Lichtenstein et al., 1990; Zaichkowsky, 1994; Netemeyer et al., 1996). For example, a number of authors have used expert judges to delete ambiguous, redundant, or unrelated items (cf. Brooker, 1975; Behrman and Perreault, 1982; Gaski and Nevin, 1985; Cialdini et al., 1995; Puri, 1996). Other researchers have used expert judges to generally evaluate the quality of the survey (cf. Kohli and Zaltman, 1988; Stone et al., 1995). Malhotra (1981) used judges to agree on a subset of items to use in further analysis. Finally, Reidenbach and Robin (1990) used expert judges to partition items into facets, not as a means of deletion.

When researchers have employed Zaichkowsky's (1985) procedure or a similar one, several rules for item deletion have emerged. In many instances, items were deleted when evaluated by any judge as being not representative (i.e., a poor indicator) of the construct (cf. Bearden et al., 1989; Netemeyer et al., 1995, 1996). Other authors used decision rules that focused on the overall evaluations of all of the judges. For example, Lichtenstein et al. (1990) and Zaichkowsky (1985, 1994) decided that items would be retained if at least 80% of the judges rated an item as at least somewhat representative of the construct. Similarly, Sharma et al. (1990) retained items that 70% of judges coded as representative versus not representative of "corporate excellence." One final set of rules that emerged from our review contained references to the number of judges who evaluated an item as completely representative of the construct. For example, Obermiller and Spangenberg (1998) required at least three of four judges to rate an item as being a very good representation of consumer skepticism toward advertising. Similarly, Saxe and Weitz (1982) and Manning et al. (1995) required at least 50% and 60% of their judges, respectively, to rate an item as completely representative in order for it to be retained.

Researchers using the other dominant procedure (i.e., placing items into facets or dimensions based on definitions) also used different rules when determining which items to retain. Allison (1978) required that at least 60% (21 of 35) of the judges place an item into the same facet. Babin et al. (1994) used the strictest rule in that all three of their judges had to assign items to the same facet. For the remainder of the authors, between 75% and 88% of the judges involved had to assign an item to the same construct (cf. Swasy, 1979; Unger and Kernan, 1983; Slama and Tashchian, 1985; Baumgartner and Steenkamp, 1996; Shimp and Sharma, 1987; Bearden et al., 1989; Lichtenstein et al., 1990; Ohanian, 1990; Sharma et al., 1990). Martin and Eroglu (1993) employed the Holsti (1969) procedure to determine an average interjudge agreement and reliability. These values were then used to determine which items to retain. It should be noted that some authors used more than one phase of judging and therefore may have employed more than one procedure or decision rule (cf. Allison, 1978; Zaichkowsky, 1985, 1994; Bearden et al., 1989; Lichtenstein et al., 1990; Sharma et al., 1990).

In summary, Zaichkowsky's (1985) procedure and assigning items to construct definitions are the two dominant procedures that have been followed by marketing and consumer researchers when assessing the face validity of scale items. When using the latter procedure of assigning items to construct definitions, researchers have required that at least 60% of judges assign an item to the desired construct or
104 D.M. Hardesty, W.O. Bearden / Journal of Business Research 57 (2004) 98–107

construct facet. Most of these authors have deemed at least in the original edited item pool being reduced from 145
75% agreement as a minimum cutoff for item retention. items to 83 items. The application of these judging proce-
Therefore, although previous researchers have employed dures reduced the number of items across the six facets to
many decision criteria, there seems to be a good bit of 13, 11, 14, 12, 13, and 20 items, respectively.
consistency and guidance regarding the minimum required The second data set considered addressed the devel-
degree of agreement necessary between judges. opment of two separate scales (Netemeyer et al., 1996)
Alternatively, when using Zaichkowsky’s (1985) pro- and was used to test the two remaining decision rules. The
cedure to evaluate the face validity of scale items, there is two scales were the Work – Family Conflict Scale and the
less guidance regarding the rule(s) used to retain items. Family – Work Conflict Scale. In their research, Netemeyer
Some researchers required that no judge rate an item as et al. (1996) employed the expertise of four judges. These
not representative in order to be retained, while others four judges rated items as clearly, somewhat, or not repres-
considered all judge ratings and the number of represent- entative of each construct. Netemeyer et al. decided that all
ative or completely representative ratings. Consequently, four judges had to have indicated that the item was at least
there is apparently limited direction in the literature somewhat representative of the construct of interest. This
regarding specific rules that should be used for judging rule resulted in reducing the initial item pool from 110 items
face validity of scale items. In the following section, we to 43 items. The number of items remaining for the Work –
investigate the relative effectiveness of several approaches Family Conflict and Family – Work Conflict measures were
for determining the adequate number of items for reten- 22 and 21.
tion (cf. Churchill, 1979). The final scale data were from Netemeyer et al.’s (1995)
development of the achievement vanity and physical vanity
scales. In their research, Netemeyer et al. employed two
4. Comparing several expert judging decision rules judging phases. First, three of four judges had to rate items
as at least somewhat representative for the item to be
Given the inconsistencies noted in the literature review in retained. Then, in a second phase, all four new judges had
terms of expert judging procedures, we decided to invest- to rate items at least somewhat representative to be retained.
igate which of the three rules most highly correlated with an These two phases resulted in 60 items being considered for
item being ultimately included in a scale. In doing so, data further analysis from an initial pool of 99 items.
sets along with their expert judging information from three
recent scale development efforts were obtained. All three of 4.2. Decision rules tested
the available judging data sets were a variate of the method
described in Zaichkowsky (1985). That is, judges rated each In the following analyses, three decision rules were
item as ‘‘completely,’’ ‘‘somewhat,’’ or ‘‘not at all repres- considered. First, a rule labeled ‘‘sumscore’’ was evaluated
entative’’ of the construct or facet of interest. (e.g., Lichtenstein et al., 1990; Sharma et al., 1990). Sum-
score is defined as the total score for an item across all
4.1. Data sets

The first data set considered is based on a scale development effort recently published in the Journal of Consumer Research (Bearden et al., 2001). Data regarding development of these measures of consumer self-confidence were obtained since the developmental judging procedures employed by Bearden et al. (2001) enabled a test of all three decision rules across various aspects of consumer self-confidence. Specifically, the construct measures reflect six facets of consumer self-confidence: (1) information acquisition; (2) consideration set formation; (3) personal outcome decision making; (4) social outcome decision making; (5) persuasion knowledge; and (6) marketplace interfaces. Seven judges were used to assess the degree to which each item was representative of each of the six facets of the scale. Zaichkowsky's (1985) procedure was followed, and judges indicated whether the items were "completely representative," "somewhat representative," or "not representative" of the facet of interest. Items were deleted that did not average at least somewhat representative of the construct being measured across the seven judges. This rule resulted

included judges. For example, if there were four judges for a particular data set and one judge indicated an item was completely representative, two judges indicated the item was somewhat representative, and the final judge indicated that the item was not representative, the item received a sumscore of eight points. This value was calculated as three points for the completely representative judgment, four points for the two somewhat representative judgments, and one point for the not representative judgment. Importantly, this decision rule was included since many previous researchers considered all of the judges when assessing the face validity of items.

The second decision rule considered was labeled "complete" (e.g., Obermiller and Spangenberg, 1998; Saxe and Weitz, 1982). Complete was operationalized as the number of judges that rated an item as completely representative of the construct. For the above example, the item received a complete score of one point, since only one of the four judges rated the item as completely representative of the construct. As an example, Saxe and Weitz (1982) required at least 50% of their judges to rate an item as completely representative in order for the item to be retained.
Finally, a third decision rule, "not representative," was considered (cf. Bearden et al., 1989; Netemeyer et al., 1995, 1996). The not representative rule was operationalized as the number of judges indicating that the item was not representative of the construct of interest. For the above example, the item received a not representative score of one point since one judge rated the item as not at all representative. This third rule was considered based upon the recognition that some researchers were only concerned with deleting items judged as not representative.

4.3. Results

For each of the three decision rules, the correlation between the decision rule score and ultimate inclusion (coded as 1 if included, 0 if excluded) of the item in the scale was calculated. That is, a comparison was made between the expert judging scores and whether or not the item ended up being included in the final scale. Table 2 summarizes the correlation between each of the decision rules and the inclusion of the items that make up each final scale or scale facet.

Table 2
Three decision rules employed by researchers in the use of expert judging for the face validity of items^a

Name of construct                    Sumscore   Complete   Not representative
Information Acquisition              .705***    .709***    .428*
Consideration Set Formation          .566**     .556**     .481*
Personal Outcome Decision Making     .335       .248       .548**
Social Outcome Decision Making       .730***    .683***    .742***
Persuasion Knowledge                 .501**     .544**     .128
Marketplace Interfaces               .183       .182       .152
Overall Consumer Self-Confidence     .395***    .475***    .316***
Work–Family Conflict                 .401**     .396*      not applicable
Family–Work Conflict                 .242       .246       not applicable
Achievement Vanity                   .151*      .119       not applicable
Physical Vanity                      .063       .035       not applicable

Information Acquisition, Consideration Set Formation, Personal Outcome Decision Making, Social Outcome Decision Making, Persuasion Knowledge, and Marketplace Interfaces make up the six facets of Consumer Self-Confidence (Bearden et al., 2001). Work–Family Conflict and Family–Work Conflict are from Netemeyer et al. (1996). Achievement Vanity and Physical Vanity were developed by Netemeyer et al. (1995). "Sumscore" represents the sum of the ratings from all judges for each item. "Complete" represents the number of judges rating an item as completely representative. "Not representative" corresponds to the number of judges rating an item as not representative.
^a The values in the first three columns represent the correlation between the decision rule and whether or not the scale item was included in the final scale.
* P < .10. ** P < .05. *** P < .01.

For the consumer self-confidence data, the sumscore, complete, and not representative decision rules were each significantly correlated with four of the six facets, as well as the overall measure, of consumer self-confidence. Based on the sizes of the correlations, it appears that the not representative decision rule is not as effective at predicting ultimate inclusion of an item in a scale as the alternative two rules (i.e., sumscore and complete). That is, researchers should not simply focus on the number of judges who rate an item as not at all representative of a construct when determining whether to retain or delete items (cf. Bearden et al., 1989; Netemeyer et al., 1995, 1996). In order to evaluate the sumscore and complete decision rules further, we considered two additional data sets. The nature of the decision rules used by these two sets of authors precludes further testing of the not representative decision rule.

The Work–Family Conflict and Family–Work Conflict Scales were considered first (Netemeyer et al., 1996). These authors decided that all judges had to have indicated that the item was at least somewhat representative of the construct of interest in order for it to be retained. As shown in Table 2, both the sumscore and complete decision rules are statistically related to the inclusion of items in the final scale for one of the two constructs. Based on this, there is little difference between the use of the sumscore and complete decision rules. In order to evaluate these two rules further, the physical and achievement vanity scales of Netemeyer et al. (1995) were examined. Specifically, Netemeyer et al., in the second phase of their judging procedures, decided that all four judges had to rate an item as at least somewhat representative for it to be retained. As can be seen in Table 2, only the sumscore decision rule is statistically related to the inclusion of items in the final achievement vanity scale. Neither decision rule was statistically related for the physical vanity scale. Again, however, the overall magnitude of both the sumscore and complete correlations suggests that the two rules perform similarly.

5. Implications, limitations, and future research directions

In summary, there is an apparent lack of consistency in the literature in terms of how researchers use the opinions of expert judges in aiding the decision of whether or not to retain items for a scale. We have taken a first step in attempting to provide researchers with some direction regarding the decision rule to use during the judging phase of the scale development process. In our analyses, the not representative decision rule was found least capable of predicting the eventual inclusion of an item in a scale based upon the pool of items considered by Bearden et al. (2001). These limited data suggest that the not representative rule may not be the best rule to employ. However, it is important to note that future researchers should further explore the not representative decision rule using additional data. Again, our tests were restricted solely to the items comprising the six facets of consumer self-confidence developed by Bearden et al.
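The predictive test behind Table 2 can be sketched in a few lines. The data below are invented for illustration (the article does not reproduce its item-level ratings); the point is only that correlating a rule's scores with a 0/1 inclusion indicator, as done here, is an ordinary Pearson correlation, which on a dichotomous variable equals the point-biserial correlation:

```python
# Hypothetical sketch of the Table 2 methodology: correlate a decision-rule
# score with eventual item inclusion (1 = included in the final scale,
# 0 = excluded). The item pool below is made up for illustration only.

def pearson(x, y):
    """Plain Pearson product-moment correlation between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Invented sumscores for six items (seven judges each) and invented
# final-scale inclusion decisions.
sumscores = [21, 18, 14, 20, 12, 19]
included = [1, 1, 0, 1, 0, 1]

r = pearson(sumscores, included)
print(round(r, 3))  # ~0.943 for these made-up values
```

A large positive correlation of this kind is what would indicate that a decision rule's scores track which items survive into the final scale.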
Subsequently, two more sets of data (e.g., Netemeyer et al., 1995, 1996) were investigated to further evaluate the sumscore and complete decision rules. Both of these rules predicted item inclusion in the final scales similarly. However, the sumscore decision rule slightly outperformed the complete decision rule. At least for two of the developmental item pools (i.e., Bearden et al., 2001; Netemeyer et al., 1996), these two rules did fairly well at predicting whether an item would eventually be included in a final scale. Notably, the present findings support the important ability of expert judges to enhance eventual scale reliability and, hence, subsequent validity. Therefore, and similar to others (cf. Schreisheim et al., 1993), we suggest that any research using new, changed, or previously unexamined scale items should, at a minimum, be judged by a panel of experts for face validity. Having said this, we by no means are arguing that subsequent stages of the scale development process be ignored. There are, however, occasions where consumer researchers develop a set of items that may not be the focal point of the article and, therefore, do not engage in the entire scale development process. It is on these occasions that the use of expert judging appears especially desirable. In performing this step, the face validity of the scale is increased at only limited cost in time or funds. Also, the items that are ultimately used correlate fairly well in some instances with those that would have survived an exhaustive scale development process.

It needs to be noted that simply judging items may not guarantee the selection of the most appropriate items for a scale. For example, in our research, the items comprising the two dimensions of the final vanity scales did not correlate highly with either the sumscore or complete decision rules. Therefore, and as stated previously, expert judging should not be used as a substitute for the scale development process. Rather, expert judging should be used to obtain some justification for the face validity of items when those items are not the focal point of the research. One additional conclusion that is clear from our literature review is the lack of any consistency regarding item face validity evaluation in the literature. Future research is warranted to establish procedures that researchers can use to strengthen scale development efforts.

As a result of our data analysis, the sumscore decision rule performed somewhat more effectively at predicting whether an item is eventually included in a scale and appears, therefore, to be a reasonable rule for researchers to employ. A caveat associated with this finding is the realization that cutoff values for when to delete and when to retain items are still in need of inquiry. Future researchers with access to other data sets may be able to provide guidance regarding such cutoff values. Additionally, and as suggested by a reviewer of this manuscript, a logical next step in the assessment of the sumscore and other decision rules would be to collect data and test the scales for overall construct validity. Assessing reliability, unidimensionality, discriminant and convergent validity, and nomological validity would be a more rigorous test in discerning the value of the various decision rules. One final avenue of research, which seems to be important but was not considered here, is evaluating the other prevailing way in which expert judges are used. That is, an evaluation of the technique of asking judges to assign items to dimensions or facets of multidimensional scales seems warranted.

References

Allen MJ, Yen WM. Introduction to measurement theory. Monterey (CA): Brooks/Cole, 1979.
Allison NK. A psychometric development of a test for consumer alienation from the marketplace. J Mark Res 1978;15:565–75.
Anastasi A. Psychological testing. New York: Macmillan, 1988.
Babin BJ, Darden WR, Griffin M. Work and/or fun: measuring hedonic and utilitarian shopping value. J Consum Res 1994;20:644–56 (March).
Baumgartner H, Steenkamp J-BEM. Exploratory consumer buying behavior: conceptualization and measurement. Int J Res Mark 1996;13:121–37.
Bearden WO, Netemeyer RG. Handbook of marketing scales: multi-item measures for marketing and consumer behavior research. Thousand Oaks (CA): Sage Publications, 1999.
Bearden WO, Netemeyer RG, Teel JE. Measurement of consumer susceptibility to interpersonal influence. J Consum Res 1989;15:473–81.
Bearden WO, Hardesty DM, Rose RL. Consumer self-confidence: refinements in conceptualization and measurement. J Consum Res 2001;28:121–34.
Behrman DN, Perreault WD. Measuring the performance of industrial salespersons. J Bus Res 1982;10:355–70.
Bienstock CC, Mentzer JT, Bird MM. Measuring physical distribution service quality. J Acad Mark Sci 1997;25:31–44 (Winter).
Brooker G. An instrument to measure consumer self-actualization. In: Schlinger MJ, editor. Advances in consumer research, vol. 2. Ann Arbor (MI): Association for Consumer Research, 1975. p. 563–75.
Butaney G, Wortzel LH. Distributor power versus manufacturer power: the customer role. J Mark 1988;52:52–63 (January).
Churchill G. A paradigm for developing better measures of marketing constructs. J Mark Res 1979;16:64–73 (February).
Cialdini RB, Frost MR, Newsom JT. Preference for consistency: the development of a valid measure and the discovery of surprising behavioral implications. J Pers Soc Psychol 1995;69(2):318–28.
Cohen JB. An interpersonal orientation to the study of consumer behavior. J Mark Res 1967;4:270–8.
Feick LF, Price LL. The market maven: a diffuser of marketplace information. J Mark 1987;51:83–97.
Ferrell OC, Skinner SJ. Ethical behavior and bureaucratic structure in marketing research organizations. J Mark Res 1988;25:103–9 (February).
Gaski JF, Nevin JR. The differential effects of exercised and unexercised power sources in a marketing channel. J Mark Res 1985;22:130–42 (May).
Holsti O. Content analysis for the social sciences and humanities. Reading (MA): Addison-Wesley Publishing, 1969.
Kohli AK, Zaltman G. Measuring multiple buying influences. Ind Mark Manage 1988;17:197–204.
Kumar N, Stern LW, Achrol RS. Assessing reseller performance from the perspective of the supplier. J Mark Res 1992;29:238–53 (May).
Lichtenstein DR, Netemeyer RG, Burton S. Distinguishing coupon proneness from value consciousness: an acquisition-transaction utility theory perspective. J Mark 1990;54:54–67.
Lundstrum WJ, Lamont LM. The development of a scale to measure consumer discontent. J Mark Res 1976;13:373–81.
Malhotra NK. A scale to measure self-concepts, person concepts, and product concepts. J Mark Res 1981;16:456–64.
Manning KC, Bearden WO, Madden TJ. Consumer innovativeness and the adoption process. J Consum Psychol 1995;4(4):329–45.
Martin IM, Eroglu S. Measuring a multi-dimensional construct: country image. J Bus Res 1993;28:191–210.
Narver JC, Slater SF. The effect of a market orientation on business profitability. J Mark 1990;54:20–35 (October).
Netemeyer RG, Burton S, Lichtenstein DR. Trait aspects of vanity: measurement and relevance to consumer behavior. J Consum Res 1995;21:612–26 (March).
Netemeyer RG, Boles JS, McMurrian R. Development and validation of Work–Family Conflict and Family–Work Conflict Scales. J Appl Psychol 1996;81(4):400–10.
Nevo B. Face validity revisited. J Educ Meas 1985;22:287–93.
Nunnally JC, Bernstein IH. Psychometric theory. New York: McGraw-Hill, 1994.
Obermiller C, Spangenberg ER. Development of a scale to measure consumer skepticism toward advertising. J Consum Psychol 1998;7(2):159–86.
Ohanian R. Construction and validation of a scale to measure celebrity endorsers' perceived expertise, trustworthiness, and attractiveness. J Adver 1990;19(3):39–52.
Price LL, Ridgway NM. Development of a scale to measure use innovativeness. In: Bagozzi RP, Tybout AM, editors. Advances in consumer research, vol. 10. Ann Arbor (MI): Association for Consumer Research, 1983. p. 679–84.
Puri R. Measuring and modifying consumer impulsiveness: a cost-benefit accessibility framework. J Consum Psychol 1996;5(2):87–113.
Reidenbach RE, Robin DP. Toward the development of a multidimensional scale for improving evaluations of business ethics. J Bus Ethics 1990;9:639–53.
Saxe R, Weitz BA. The SOCO scale: a measure of the customer orientation of salespeople. J Mark Res 1982;19:343–51 (August).
Schreisheim CA, Powers KJ, Scandura TA, Gardiner CC, Lankau MJ. Improving construct measurement in management research: comments and a quantitative approach for assessing the theoretical content adequacy of paper-and-pencil survey-type instruments. J Manage 1993;19(2):385–417.
Schul PL, Pride WM, Little TL. The impact of channel leadership behavior on interchannel conflict. J Mark 1983;47:21–34 (Summer).
Sharma S, Netemeyer R, Mahajan V. In search of excellence revisited: an empirical investigation of Peters and Waterman's attributes of excellence. In: Bearden WO, Parasuraman A, editors. Enhancing knowledge development in marketing, vol. 1. Chicago (IL): American Marketing Association, 1990. p. 322–8.
Shimp TA, Sharma S. Consumer ethnocentrism: construction and validation of the CETSCALE. J Mark Res 1987;24:280–9.
Slama ME, Tashchian A. Selected socioeconomic and demographic characteristics associated with purchasing involvement. J Mark 1985;49:72–82 (Winter).
Stone G, Barnes JH, Montgomery C. ECOSCALE: a scale for the measurement of environmentally responsible consumers. Psychol Mark 1995;12:595–612 (October).
Swasy JL. Measuring the bases of social power. In: Wilkie WL, editor. Advances in consumer research, vol. 6. Ann Arbor (MI): Association for Consumer Research, 1979. p. 340–6.
Unger LS, Kernan JB. On the meaning of leisure: an investigation of some determinants of the subjective experience. J Consum Res 1983;9:381–92 (March).
Wang CL, Mowen JC. The separateness–connectedness self-schema: scale development and application to message construction. Psychol Mark 1997;14:185–207 (March).
Zaichkowsky JL. Measuring the involvement construct. J Consum Res 1985;12:341–52 (December).
Zaichkowsky JL. The personal involvement inventory: reduction, revision, and application to advertising. J Adver 1994;23:59–70 (December).