Balkin, R. S. (2017). Evaluating Evidence Regarding Relationships With Criteria. Measurement and Evaluation in Counseling and Development, 50(4), 264-269. https://doi.org/10.1080/07481756.2017.1336928


METHODS PLAINLY SPEAKING

Evaluating Evidence Regarding Relationships With Criteria


Richard S. Balkin
University of Louisville, Louisville, KY, USA

ABSTRACT
An overview of standards related to demonstrating evidence regarding relationships with criteria as it pertains to instrument development was presented, along with heuristic examples. Additional measures and a comprehensive design are necessary to establish evidence related to the use and interpretation of test scores for the validation of a measure.

KEYWORDS
Validation; testing; standards; evidence; criteria

Evidence based on relations to other variables is one of the five primary sources of validity evidence, and
within this fundamental component of evaluating validity is depicting relationships between assessment
scores and important criteria (American Educational Research Association [AERA], American Psycho-
logical Association [APA], & National Council on Measurement in Education [NCME], 2014). Evidence
regarding relationships with criteria (ERRC) includes eight standards within five areas that are essential
to evaluating relations to other variables and pertain to various aspects of (a) the psychometric quality
of criterion variables, (b) predictive evidence, (c) accounting for error within the evaluation, (d) appro-
priate use of meta-analytic studies or demonstration of convergent evidence, and (e) demonstration of
evidence for various outcomes, such as diverse populations (AERA et al., 2014). Each of these standards
has implications for evaluating evidence of relations to other variables but might also be important to
demonstrating consequences of testing—an additional source of validity evidence. Hence, the standards
apply to the development of valid measures and the manner in which they are used in research and prac-
tice. In this article, each of the aforementioned areas essential for evaluating ERRC is addressed with an applicable explanation of the AERA et al. (2014) standards, along with a heuristic example
from Balkin, Harris, Freeman, and Huntington (2014) or Balkin, Perepiczka, Sowell, Cumi, and Gnilka
(2016) in which the Forgiveness Reconciliation Inventory (FRI) was validated.

Psychometric Quality of Criterion Variables


When examining the relationship(s) between test scores and criterion variable(s), counseling researchers
should take time to describe the psychometric qualities of the criterion measures used. “When validation
relies on evidence that test scores are related to one or more criterion variables, information about the
suitability and technical quality of the criteria should be reported” (AERA et al., 2014, p. 28). Often in
research and instrument validation, the validation of the measures used or developed is the primary
focus. AERA et al. (2014) emphasized that the measurement qualities of criterion variables are also
important. Consider the consequences if an outcome variable is measured poorly and a relationship or
prediction was indicated by the researcher. Such a finding would be questionable at best and potentially
misleading. ERRC is important to identifying associations among variables pertinent to counseling and
increasing the likelihood that findings can be replicated.


Using Balkin et al. (2014) as a heuristic example for this study, psychometric properties for the
predictor variables, the FRI scales, and the criterion variables, the Forgiveness Scale and the Forgiveness
Likelihood Scale (Rye et al., 2001), were provided. The FRI was the focus of the instrument validation,
in which Balkin et al. reported (a) the results of a confirmatory factor analysis (CFA) to demonstrate
evidence of internal structure, and (b) the reliability estimates for the scores on each of the subscales.
The demonstration of psychometric properties could include both evidence of internal structure and
reliability of the scores on the scales. For example, Balkin et al. (2014) noted the model fit:
The χ² was significant for the hypothesized model, χ²(244) = 598.49, p < .001. … The fit indices indicated an acceptable model fit for the data, CFI = .91, TLI = .90, SRMR = .069. (p. 7)

In addition, the authors reported “standardized estimates for the factor pattern coefficients and cor-
relations for the paired variables” (p. 7). Furthermore, Balkin et al. noted the reliability estimates for the
scores on the FRI subscales: “Reliability estimates were strong for the scores on each of the subscales:
collaborative exploration (.90), role of reconciliation (.88), remorse/change (.92), and outcome (.93)”
(p. 7).
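As a companion to the reliability estimates quoted above, the following is a minimal sketch of how a reliability estimate (Cronbach's alpha) can be computed for scores on a single subscale. The item data are hypothetical and are not the FRI items; only the formula itself is standard.

```python
# Minimal sketch: Cronbach's alpha for one subscale.
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_respondents, n_items) matrix of item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of scale totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
# Hypothetical 5-point Likert responses: 200 respondents x 6 items.
# Independent random items yield alpha near 0; real item sets with
# shared variance produce values like the .88-.93 reported for the FRI.
fake_items = rng.integers(1, 6, size=(200, 6))
print(f"alpha = {cronbach_alpha(fake_items):.2f}")
```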
Once the psychometric properties of the measure being validated are established, the properties of the
criterion variables, such as instruments that measure a similar construct, should be presented. Psychome-
tric properties of the criterion variables might include a summary of validity evidence of the measures
and reliability estimates for the scores on the scales, similar to what was presented previously for the
predictor variable, FRI scales. Ideally, researchers report reliability estimates for scores based on extant
research and scores in the present sample.

Predictive Evidence
Two standards for predictive evidence were identified when evaluating ERRC. First, the prediction of
a criterion or outcome should be comprehensive. If more than one predictor is necessary to predict an
outcome or criterion, then all necessary variables should be included. Although such a standard makes logical sense, the reality is that a complete set of predictors might not be readily available, and the use of multiple predictors could be inadvisable if a sample size sufficient to include all necessary variables is not practical (Balkin & Sheperis, 2011). In other words, using several predictor variables can be complicated in terms of accessing adequate measures to collect data and securing enough participants to
have adequate power for subsequent analyses. These limitations are common, particularly when several
predictors are necessary, such as predicting achievement in college. There are far better predictors of
college success than grade-point average and scores on aptitude tests, but noncognitive measures such
as the ability to set and meet long-term goals or amenability to mentoring are not often evaluated due to
the complexity of obtaining such measures (Sedlacek, 2004).
In addition, when predictive evidence is demonstrated, levels of criterion performance associated
with the test scores should be noted (AERA et al., 2014, p. 28). AERA et al. suggested the demonstration
of relationships through the presentation of descriptive statistics for the predictor(s) and criterion vari-
ables, statistical analyses, and variability. Descriptive statistics and statistical analyses could be presented
narratively or in tables. Graphical presentations can also be helpful to demonstrate variability and trends.
Descriptive statistics such as means, standard deviations, sample sizes, frequencies and percentages of
demographic characteristics of participants, and correlation coefficients not only provide the necessary
information for readers to draw conclusions about the participants in the study and for whom the results
might be generalizable, but also provide valuable information for replicating results and understanding
associations among variables.
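As a sketch of the reporting described in this paragraph, the following shows one way to produce the descriptive statistics and correlation coefficients for predictor and criterion variables. The variable names and data are hypothetical.

```python
# Minimal sketch: descriptive statistics and correlations for reporting.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 170  # hypothetical sample size
scores = pd.DataFrame({
    "collaborative_exploration": rng.normal(30, 5, n),  # predictor scores
    "outcome": rng.normal(25, 6, n),                    # criterion scores
})
print(scores.agg(["count", "mean", "std"]).round(2))  # n, M, SD per variable
print(scores.corr().round(2))                         # correlation matrix
```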
Predictive evidence often uses outside measures to predict an outcome, such as the use of the Graduate Record Exam to predict success in graduate school. However, to be consistent with our previous heuristic example and provide an example applicable to counseling, we show how the initial stage of the forgiveness reconciliation model (FRM), collaborative exploration, predicts the final stage of the model, outcome, with the use of convergent evidence.

Balkin et al. (2016) demonstrated alignment of the FRI with the FRM, hypothesizing that the first stage of the model (which also corresponds to the scores on the first subscale of the FRI), collaborative exploration, predicts the final stage of the model (which also corresponds to the scores on the fourth subscale of the FRI), outcome, and that this relationship would be mediated by the two middle stages of the FRM (which correspond to the scores on the second and third subscales of the FRI), role of reconciliation and remorse/change.
Balkin et al. (2016) conducted mediation with regression analysis to identify the degree to which “the
middle stages of the FRM (processing the role of reconciliation and remorse/change of the offender)
mediated the relationship between the initial stage of the FRM (collaborative exploration) and the final
stage of the FRM (outcome)” (p. 56). Balkin et al. identified that remorse/change partially mediated the
relationship between collaborative exploration and outcome; processing the role of reconciliation did
not mediate the relationship between the predictor and criterion. Balkin et al. (2016) used a mediated
multiple regression analysis and graphical representation in the results to demonstrate the extent of the
relationships within the FRM using scores on the FRI. In other words, the extent to which the counselor and client work together and collaboratively explore issues related to forgiveness, conflict, and reconciliation predicts the likelihood of the client working toward closure by either seeking a path of
interpersonal forgiveness (e.g., reconciling with the person who caused harm) or intrapersonal forgive-
ness, in which the client concludes that reconciliation is not possible but is at personal peace with the
situation.
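A minimal sketch of a regression-based mediation analysis in the spirit of Balkin et al. (2016) follows. The data, effect sizes, and variable roles are hypothetical stand-ins for the FRM stages, not the authors' dataset or exact procedure.

```python
# Minimal sketch: mediation via three regressions (total, direct, a-path).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 170
x = rng.normal(size=n)                      # initial stage: collaborative exploration
m = 0.5 * x + rng.normal(size=n)            # middle stage (mediator), e.g., remorse/change
y = 0.3 * x + 0.4 * m + rng.normal(size=n)  # final stage: outcome

total = sm.OLS(y, sm.add_constant(x)).fit()                         # c path
direct = sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit()  # c' and b paths
a_path = sm.OLS(m, sm.add_constant(x)).fit()                        # a path

# Partial mediation: c' remains nonzero but smaller than c.
print(f"total effect c   = {total.params[1]:.2f}")
print(f"direct effect c' = {direct.params[1]:.2f}")
print(f"indirect a*b     = {a_path.params[1] * direct.params[2]:.2f}")
```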

Accounting for Error Within an Evaluation


Much like the reporting of effect size has become standard when presenting the results of hypothesis
testing, addressing error to provide an understanding of the extent to which results might fluctuate is
also necessary with respect to evaluating ERRC. “[S]tandard errors or confidence intervals provide more
information and thus are preferred in place of, or as supplements to, significance testing” (AERA et al.,
2014, p. 29). Consistent with reporting predictive evidence, the reporting of error terms is quite common
in publication of counseling research. For example, Balkin et al. (2014; Balkin et al., 2016) presented
unstandardized and standardized beta weights, standard error, t-test results, and p values, consistent with
standard reporting of regression results. These values are typically reported in a table but can be explained
narratively. These values are important to evaluating the meaningfulness of the results, so that statistical
significance may be viewed in light of the strength of the relationships among variables (e.g., beta weights)
and not simply due to p values, which are highly influenced by sample size.
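As a brief sketch of this reporting practice, the following pulls standard errors, t-test results, p values, and 95% confidence intervals from a regression fit on hypothetical data.

```python
# Minimal sketch: error reporting alongside significance testing.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(size=100)  # hypothetical predictor-criterion relationship

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.bse)                   # standard errors for intercept and slope
print(fit.tvalues, fit.pvalues)  # t-test results and p values
print(fit.conf_int(0.05))        # 95% confidence intervals
```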
However, AERA et al. (2014) also specified the need to report statistical adjustments: “Estimates of
the construct–criterion relationship that remove the effects of measurement error on the test should be
clearly reported as adjusted estimates” (p. 29). Such reporting is less common in counseling research. Adjusted R² values are more commonly reported, as this information is readily available in statistical software (e.g., SPSS). However, adjusted effect sizes such as ω² require calculation beyond what the statistical software provides and are therefore less commonly reported. The necessity of adjusted values, particularly for sample-based estimates, has not always been readily accepted, as the manner in which participants were obtained and the sample context of the results are often apparent in the literature. However, because AERA et al. (2014) now makes the reporting of statistical adjustments a standard, counseling journal editors will need to bring more rigor to their review processes to meet it. Statistical adjustment could be viewed more broadly to also include the transformation of
variables from continuous to discrete or power transformations to meet model assumptions. When such
transformations are conducted, researchers must report the unadjusted and adjusted values, along with
identifying the process used to make the adjustment.
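As a sketch of one such adjusted estimate, ω² for a one-way design can be computed from the ANOVA sums of squares, as below; the group data are hypothetical.

```python
# Minimal sketch: omega-squared as an adjusted effect size.
# omega^2 = (SS_between - df_between * MS_within) / (SS_total + MS_within)
import numpy as np

rng = np.random.default_rng(4)
groups = [rng.normal(loc, 1.0, 40) for loc in (0.0, 0.4, 0.8)]  # 3 hypothetical groups
grand = np.concatenate(groups)

ss_total = ((grand - grand.mean()) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand.mean()) ** 2 for g in groups)
ss_within = ss_total - ss_between
df_between = len(groups) - 1
ms_within = ss_within / (len(grand) - len(groups))

omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
print(f"omega^2 = {omega_sq:.3f}")  # slightly smaller than eta^2 = SS_between/SS_total
```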

Appropriate Use of Meta-Analytic Studies or Demonstration of Convergent Evidence


AERA et al. (2014) provided guidelines for the use of meta-analyses in test validation. Counselors should
keep in mind that the use of meta-analysis in instrument validation is uncommon. Generally, instru-
ments that are being developed will not include a meta-analysis, simply due to the limited (if any)
research available on the measure. Rather, what is often reported is convergent and discriminant evidence
with tests related to the construct of interest. Whether examining convergent and discriminant evidence with a single measure or multiple measures, or conducting a meta-analysis, the AERA et al. guidelines are consistent with commonly accepted procedures for meta-analysis (see Erford, Savin-Murphy, & Butler, 2010) or for evaluating a test–criterion relationship: (a) inclusion of available variables or studies that meet identified criteria; (b) comparison to instruments measuring similar constructs; (c) utilization of a common metric, such as effect size, to evaluate the test–criterion relationship; (d) use of separate analyses or moderators when construct invariance is evident; and (e) well-documented interpretation of a score for a specific use.
For example, in developing the FRI, Balkin et al. (2014) examined the relationship between subscales
on the FRI (collaborative exploration, role of reconciliation, remorse/change in offender, and outcome)
and subscales on the Forgiveness Scale (FS; absence of negative and presence of positive) and the score
on the Forgiveness Likelihood Scale (FLS). When examining the relationship between multiple predictor variables and multiple criterion variables, as is the case with the FS, canonical correlation may be used. Canonical correlation can be thought of as a test–criterion relationship in which one set of variables is related to another set of variables. In the case of multiple predictors with a single criterion, as is the case with the FLS, multiple regression may be used. Balkin et al. (2014) reported the following:
A statistically significant relationship was found between the FRI subscales and the Forgiveness Scale subscales.
The first canonical root was significant, λ = .49, F(8, 326) = 17.62, p < .001, accounting for 49% (rc = .70) of the
variance in the model. The second canonical root was not significant, λ = .96, F(3, 164) = 2.22, p = .087, accounting
for 4% (rc = .20) of the variance in the model. Therefore, only the first canonical variate was interpreted. The first
canonical variate included scores on all four subscales of the FRI, with correlations ranging from −.63 to −.97 and
both subscales of the Forgiveness Scale, with correlations ranging from .69 to .97. Based on this analysis and as
expected, endorsement of positive attributes on the FRI was correlated with higher degrees of forgiveness toward
an offender. … A statistically significant relationship was found between the FRI subscales and the Forgiveness
Likelihood Scale, F(4, 165) = 4.54, p = .002, with a small to moderate effect size, R² = .10, accounting for 10% of
the variance in the model. Similar to the results on the Forgiveness Scale, increased positive attributes on the FRI
correlated with higher likelihood to forgive. (p. 9)

In addition, correlation, unstandardized and standardized beta weights, standard error, t-test results,
and p values, along with standardized canonical variate coefficients for canonical correlation results,
should be presented, typically in a table, but possibly narratively. The goal in correlating these subscales
was to demonstrate that the FRI measures domains similar to those of the FLS and the FS. The resulting tests of statistical significance and evaluation of effect size indicate that the FRI shares some commonalities with other forgiveness measures but also provides a distinctive quality to evaluating forgiveness.
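A minimal sketch of a canonical correlation between two sets of scores follows, using scikit-learn's CCA for illustration. The four "predictor" and two "criterion" columns are hypothetical stand-ins for the FRI and FS subscales; significance tests such as Wilks's lambda would come from a dedicated routine.

```python
# Minimal sketch: canonical correlations between two variable sets.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(5)
n = 170
X = rng.normal(size=(n, 4))                   # set 1: four predictor subscales
Y = 0.5 * X[:, :2] + rng.normal(size=(n, 2))  # set 2: two criterion subscales

cca = CCA(n_components=2).fit(X, Y)
U, V = cca.transform(X, Y)  # canonical variate scores for each set
r_c = [np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(2)]
print([round(r, 2) for r in r_c])  # canonical correlations (r_c)
```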

Demonstrating Evidence for Various Outcomes


Often in test development, researchers demonstrate the use of the instrument with different groups.
AERA et al. (2014) indicated that when tests are used for assigning individuals to various categories or
treatments, then “differential outcomes should be provided” (p. 30). For example, school counselors who coordinate assessments need to be aware of the validity evidence demonstrated to place students. Beyond
placement, demonstrating various outcomes for groups in the development of a measure is pertinent to
consequences of test use.
In developing the FRI, the normative sample included adults from both clinical and nonclinical populations. This allowed Balkin et al. (2014) to evaluate differences between the clinical and nonclinical groups on the FRI.
A MANOVA was conducted using the four subscales of the FRI. … A statistically significant difference was noted
between clinical and nonclinical participants, λ = .885, F(4, 195) = 6.22, p < .001. A moderate effect size was noted,
accounting for 11.5% of the variance in the model. Centroid means for the discriminant functions indicated that the
clinical group (.49) had significantly higher values across the four FRI subscales than the nonclinical group (−.26),
suggesting that participants in the clinical group identified more negative feelings or attributes toward those that
harmed them. (p. 8)

The clinical group appeared to identify more negative feelings in addressing issues of conflict and forgiveness than the nonclinical group, which might be expected given that the clinical group might see the need to address these issues in counseling. The goal in evaluating differences between clinical and nonclinical groups is to provide some evidence that individuals in counseling could have different struggles related to forgiveness than individuals who are not receiving counseling services, as evidenced by the resulting tests of statistical significance and the evaluation of effect size and centroid means.
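As a sketch of the group comparison described here, the following runs a MANOVA on four subscale scores across two hypothetical groups; statsmodels reports Wilks's lambda among its multivariate tests. The data and group labels are invented for illustration.

```python
# Minimal sketch: MANOVA comparing two groups on four subscale scores.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(6)
n_per = 100
shift = np.repeat([0.5, 0.0], n_per)  # hypothetical group mean difference
data = pd.DataFrame(rng.normal(size=(2 * n_per, 4)) + shift[:, None],
                    columns=["s1", "s2", "s3", "s4"])
data["group"] = np.repeat(["clinical", "nonclinical"], n_per)

fit = MANOVA.from_formula("s1 + s2 + s3 + s4 ~ group", data=data)
print(fit.mv_test())  # includes Wilks' lambda, F, and p for the group effect
```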

Discussion
Up to this point, the ERRC standards have been presented in the order in which they appear in the Standards for
Educational and Psychological Testing (AERA et al., 2014). However, the order of the standards does not
necessarily follow the natural progression of instrument development. Hence, the following summary
provides an overview of the sequence and tasks necessary for validating a measure with a focus on ERRC.
Before any relationships to criteria are investigated, counseling researchers should carefully outline
the design of the study and the extent to which measures used are aligned with the purpose of the
study and accurately reflect the variables of interest. The AERA et al. (2014) standards present the ideal
methods and procedures, but as mentioned before, the ideal standards are not necessarily typical. Meta-
analyses are rarely used to develop and validate a measure. Often authors will identify criteria to demon-
strate convergent or predictive evidence, but not both. In the heuristic examples noted, two articles were
used to demonstrate the various standards consistent with ERRC. Note that examining a relationship
with criteria could be done concurrently or predictively. In demonstrating convergent evidence, Balkin
et al. (2014) demonstrated the relationship of scores on the FRI with scores on the FS and FLS. All mea-
sures were administered concurrently. For predictive evidence, Balkin et al. (2016) presented a sequential
model and demonstrated that the relationship between scores in the first phase of the model and a later phase was mediated by a middle phase. Hence, the design and purpose of the study are essential to demonstrating ERRC.
A prerequisite to demonstration of a relationship with criteria is demonstration of evidence of the
internal structure of the measure, so that the psychometric properties of the measure under valida-
tion, along with the established properties of any criteria measures, can be reported. Once psychometric
properties of predictor and criterion variables are established, further analyses may be executed. In the demonstration of convergent or predictive evidence, accounting for significance, effect size, and error is essential for a comprehensive analysis. When possible, demonstration of these outcomes with different groups
strengthens the validity evidence of the measure under development.
ERRC is an essential component to answering the following questions:
• How does this measure relate to measures of a similar nature?
• How does this measure predict outcomes essential to counseling?
The demonstration of ERRC requires the use of additional measures and a comprehensive design
to establish important evidence related to the use and interpretation of test scores. ERRC goes beyond the typical demonstration of psychometric properties, otherwise referred to as evidence of internal structure (e.g., factor analysis, reliability estimates of scores), to demonstrate essential relationships related to test
use and interpretation. Counselors should be vigilant in evaluating the psychometric properties and
essential research to support the development and use of a measure.

Notes on Contributor
Richard S. Balkin is a Professor in the Department of Leadership and Counselor Education at the University of Mississippi. He has authored over 75 peer-reviewed articles, book chapters, and books, including two books on assessment and research methods. He has also developed measures related to counseling outcomes, forgiveness, and life balance.

ORCID
Richard S. Balkin http://orcid.org/0000-0002-5519-3730

References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Balkin, R. S., Harris, N., Freeman, S. J., & Huntington, S. (2014). The Forgiveness Reconciliation Inventory: An instrument
to process through issues of forgiveness and conflict. Measurement and Evaluation in Counseling and Development, 47,
3–13. https://doi.org/10.1177/0748175613497037
Balkin, R. S., Perepiczka, M., Sowell, S. M., Cumi, K., & Gnilka, P. G. (2016). The forgiveness-reconciliation
model: An empirically supported process for humanistic counseling. Journal of Humanistic Counseling, 55, 55–65.
https://doi.org/10.1002/johc.12024
Balkin, R. S., & Sheperis, C. J. (2011). Evaluating and reporting statistical power in counseling research. Journal of Coun-
seling & Development, 89, 268–272. https://doi.org/10.1002/j.1556-6678.2011.tb00088.x
Erford, B. T., Savin-Murphy, J., & Butler, C. (2010). Conducting a meta-analysis of counseling outcome research: Twelve steps and practical procedures. Counseling Outcome Research and Evaluation, 1, 19–43. https://doi.org/10.1177/2150137809356682
Rye, M. S., Loiacono, D. M., Folck, C. D., Olszewski, B. T., Heim, T. A., & Madia, B. P. (2001). Evaluation of the psychometric
properties of two forgiveness scales. Current Psychology, 20, 260–277. https://doi.org/10.1007/s12144-001-1011-6
Sedlacek, W. E. (2004). Beyond the big test: Noncognitive assessment in higher education. Hoboken, NJ: Wiley.
