Balkin, R. (2017) Evaluating Evidence Regarding Relationships With Criteria.
Richard S. Balkin
To cite this article: Richard S. Balkin (2017) Evaluating Evidence Regarding Relationships With
Criteria, Measurement and Evaluation in Counseling and Development, 50:4, 264-269, DOI:
10.1080/07481756.2017.1336928
ABSTRACT
An overview of standards related to demonstrating evidence regarding relationships with criteria as it pertains to instrument development was presented, along with heuristic examples. Additional measures and a comprehensive design are necessary to establish evidence related to the use and interpretation of test scores for the validation of a measure.

KEYWORDS
Validation; testing; standards; evidence; criteria
Evidence based on relations to other variables is one of the five primary sources of validity evidence, and
within this fundamental component to evaluating validity is depicting relationships between assessment
scores and important criteria (American Educational Research Association [AERA], American Psycho-
logical Association [APA], & National Council on Measurement in Education [NCME], 2014). Evidence
regarding relationships with criteria (ERRC) includes eight standards within five areas that are essential
to evaluating relations to other variables and pertain to various aspects of (a) the psychometric quality
of criterion variables, (b) predictive evidence, (c) accounting for error within the evaluation, (d) appro-
priate use of meta-analytic studies or demonstration of convergent evidence, and (e) demonstration of
evidence for various outcomes, such as diverse populations (AERA et al., 2014). Each of these standards
has implications for evaluating evidence of relations to other variables but might also be important to
demonstrating consequences of testing—an additional source of validity evidence. Hence, the standards
apply to the development of valid measures and the manner in which they are used in research and prac-
tice. In this article, an explanation of each of the aforementioned areas essential for evaluating ERRC is
addressed with applicable explanation of the AERA et al. (2014) standards, along with a heuristic example
from Balkin, Harris, Freeman, and Huntington (2014) or Balkin, Perepiczka, Sowell, Cumi, and Gnilka
(2016) in which the Forgiveness Reconciliation Inventory (FRI) was validated.
CONTACT Richard S. Balkin richard.balkin@louisville.edu Department of Counseling and Human Development, College of Education and Human Development, The University of Louisville, Louisville, KY, USA.
© Association for Assessment and Research in Counseling (AARC)
MEASUREMENT AND EVALUATION IN COUNSELING AND DEVELOPMENT 265
Using Balkin et al. (2014) as a heuristic example for this study, psychometric properties for the
predictor variables, the FRI scales, and the criterion variables, the Forgiveness Scale and the Forgiveness
Likelihood Scale (Rye et al., 2001), were provided. The FRI was the focus of the instrument validation,
in which Balkin et al. reported (a) the results of a confirmatory factor analysis (CFA) to demonstrate
evidence of internal structure, and (b) the reliability estimates for the scores on each of the subscales.
The demonstration of psychometric properties could include both evidence of internal structure and
reliability of the scores on the scales. For example, Balkin et al. (2014) noted the model fit:
The χ² was significant for the hypothesized model, χ²(244) = 598.49, p < .001. … The fit indices indicated an
acceptable model fit for the data, CFI = .91, TLI = .90, SRMR = .069. (p. 7)
In addition, the authors reported “standardized estimates for the factor pattern coefficients and cor-
relations for the paired variables” (p. 7). Furthermore, Balkin et al. noted the reliability estimates for the
scores on the FRI subscales: “Reliability estimates were strong for the scores on each of the subscales:
collaborative exploration (.90), role of reconciliation (.88), remorse/change (.92), and outcome (.93)”
(p. 7).
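Reliability estimates reported this way are typically coefficient (Cronbach's) alpha, although the article does not state the formula used, so that is an assumption here. A minimal Python sketch, using invented item scores rather than the FRI data, might look like this:

```python
def cronbach_alpha(items):
    """Coefficient alpha for a list of item-score lists (one inner list
    per item, one entry per respondent).  Illustrative only; the article
    does not publish its computation."""
    k = len(items)                    # number of items
    n = len(items[0])                 # number of respondents

    def var(xs):                      # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = sum(var(item) for item in items)          # sum of item variances
    totals = [sum(item[i] for item in items) for i in range(n)]  # total scores
    return (k / (k - 1)) * (1 - item_vars / var(totals))
```

Values in the high .80s to low .90s, like those reported for the FRI subscales, are conventionally read as strong internal consistency.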
Once the psychometric properties of the measure being validated are established, the properties of the
criterion variables, such as instruments that measure a similar construct, should be presented. Psychome-
tric properties of the criterion variables might include a summary of validity evidence of the measures
and reliability estimates for the scores on the scales, similar to what was presented previously for the
predictor variable, FRI scales. Ideally, researchers report reliability estimates for scores based on extant
research and scores in the present sample.
Predictive Evidence
Two standards for predictive evidence were identified when evaluating ERRC. First, the prediction of
a criterion or outcome should be comprehensive. If more than one predictor is necessary to predict an
outcome or criterion, then all necessary variables should be included. Although such a standard makes
logical sense, the reality is that a complete set of predictors might not be readily available and the use of
multiple predictors could be inadvisable if a sufficient sample size is not practical to include all necessary
variables (Balkin & Sheperis, 2011). In other words, using several predictor variables could be compli-
cated in terms of accessing the adequate measures to collect data and securing enough participants to
have adequate power for subsequent analyses. These limitations are common, particularly when several
predictors are necessary, such as predicting achievement in college. There are far better predictors of
college success than grade-point average and scores on aptitude tests, but noncognitive measures such
as the ability to set and meet long-term goals or amenability to mentoring are not often evaluated due to
the complexity of obtaining such measures (Sedlacek, 2004).
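The sample-size constraint described above can be made concrete with a rule of thumb. One commonly cited guideline for multiple regression (Green, 1991, which is not cited in this article and is used here only as an illustration) ties the minimum N to the number of predictors m: N ≥ 50 + 8m to test the overall model and N ≥ 104 + m to test individual predictors. A sketch:

```python
def min_sample_size(num_predictors):
    """Rough minimum N for a multiple regression with `num_predictors`
    predictors, per a commonly cited rule of thumb (Green, 1991):
    take the larger of 50 + 8m (overall model test) and
    104 + m (tests of individual predictors)."""
    m = num_predictors
    return max(50 + 8 * m, 104 + m)
```

For example, `min_sample_size(3)` returns 107, while `min_sample_size(10)` returns 130, which illustrates why each added predictor raises the recruitment burden the paragraph describes.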
In addition, when predictive evidence is demonstrated, levels of criterion performance associated
with the test scores should be noted (AERA et al., 2014, p. 28). AERA et al. suggested the demonstration
of relationships through the presentation of descriptive statistics for the predictor(s) and criterion vari-
ables, statistical analyses, and variability. Descriptive statistics and statistical analyses could be presented
narratively or in tables. Graphical presentations can also be helpful to demonstrate variability and trends.
Descriptive statistics such as means, standard deviations, sample sizes, frequencies and percentages of
demographic characteristics of participants, and correlation coefficients not only provide the necessary
information for readers to draw conclusions about the participants in the study and for whom the results
might be generalizable, but also provide valuable information for replicating results and understanding
associations among variables.
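The descriptive statistics and correlation coefficients recommended here are straightforward to compute. A minimal Python sketch, with invented scores rather than any data from the heuristic studies:

```python
from statistics import mean, stdev

def describe(xs):
    """Basic descriptives of the kind the Standards recommend reporting."""
    return {"n": len(xs), "mean": round(mean(xs), 2), "sd": round(stdev(xs), 2)}

def pearson_r(xs, ys):
    """Pearson correlation between a predictor and a criterion."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den
```

Reporting n, means, and standard deviations alongside the correlations is what allows readers to judge generalizability and attempt replication, as the paragraph notes.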
Predictive evidence often uses outside measures to predict an outcome, such as the use of the Graduate
Record Exam to predict success in graduate school. However, to be consistent with our previous heuristic
example and demonstrate an applicable example to counseling, we show how the initial stage of the
forgiveness reconciliation model (FRM), collaborative exploration, predicts the final stage of the model,
outcome with the use of convergent evidence.
Balkin et al. (2016) demonstrated alignment of the FRI with the FRM, hypothesizing that the first stage
of the model (which also corresponds to the scores on the first subscale of the FRI), collaborative explo-
ration, predicts the final stage of the model (which also corresponds to the scores on the fourth subscale
of the FRI), outcome, and would be mediated by the two middle stages of the FRM (which correspond
to the scores on the second and third subscales of the FRI), role of reconciliation and remorse/change.
Balkin et al. (2016) conducted mediation with regression analysis to identify the degree to which “the
middle stages of the FRM (processing the role of reconciliation and remorse/change of the offender)
mediated the relationship between the initial stage of the FRM (collaborative exploration) and the final
stage of the FRM (outcome)” (p. 56). Balkin et al. identified that remorse/change partially mediated the
relationship between collaborative exploration and outcome; processing the role of reconciliation did
not mediate the relationship between the predictor and criterion. Balkin et al. (2016) used a mediated
multiple regression analysis and graphical representation in the results to demonstrate the extent of the
relationships within the FRM using scores on the FRI. In other words, the extent to which the counselor
and client work together and collaboratively explore issues related to forgiveness, conflict, and reconciliation predicts the likelihood of the client working toward closure by either seeking a path of
interpersonal forgiveness (e.g., reconciling with the person who caused harm) or intrapersonal forgive-
ness, in which the client concludes that reconciliation is not possible but is at personal peace with the
situation.
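A mediation analysis of this kind can be sketched in standardized form. Given only the three pairwise correlations among a predictor X (collaborative exploration), a mediator M (remorse/change), and a criterion Y (outcome), the standardized direct, indirect, and total effects follow from ordinary regression algebra. The correlation values in the usage example are invented for illustration and are not taken from Balkin et al. (2016):

```python
def mediation_paths(r_xy, r_xm, r_my):
    """Standardized mediation paths from three pairwise correlations:
    X = predictor, M = mediator, Y = criterion.  A sketch of the
    regression-based logic, not the authors' actual analysis."""
    a = r_xm                                        # X -> M
    b = (r_my - r_xy * r_xm) / (1 - r_xm ** 2)      # M -> Y, controlling for X
    c = r_xy                                        # total effect of X on Y
    c_prime = (r_xy - r_my * r_xm) / (1 - r_xm ** 2)  # direct effect of X on Y
    return {"a": a, "b": b, "c": c, "c_prime": c_prime, "indirect": a * b}
```

Note that the indirect effect (a × b) plus the direct effect (c′) always recovers the total effect c; partial mediation, the pattern reported for remorse/change, corresponds to a nonzero indirect effect with c′ still nonzero.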
research available on the measure. Rather, what is often reported is convergent and discriminant evidence
with tests related to the construct of interest. Regardless of examining convergent and discriminant evi-
dence with a single or multiple measures or conducting a meta-analysis, the AERA et al. guidelines are
consistent with commonly accepted procedures in meta-analysis (see Erford, Savin-Murphy, & Butler,
2010) or evaluating a test–criterion relationship: (a) inclusion of available variables or studies that meet
identified criteria; (b) comparison to instruments measuring similar constructs; (c) utilization of a com-
mon metric, such as effect size, to evaluate the test–criterion relationship; (d) use of separate analyses or
moderators when construct invariance is evident; and (e) interpretation of a score for a specific use is
well documented.
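When a common metric such as a correlation-based effect size is used across studies (point c above), the usual pooling step is a fixed-effect, inverse-variance-weighted average on Fisher's z scale. A sketch with invented study values:

```python
import math

def pooled_r(studies):
    """Fixed-effect pooling of correlations on a common metric:
    Fisher-z transform each r, weight by n - 3 (inverse variance),
    then back-transform.  `studies` is a list of (r, n) pairs;
    the values used here are illustrative only."""
    num = den = 0.0
    for r, n in studies:
        z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher z transform
        w = n - 3                               # inverse-variance weight
        num += w * z
        den += w
    z_bar = num / den
    # Back-transform the weighted mean z to the r metric.
    return (math.exp(2 * z_bar) - 1) / (math.exp(2 * z_bar) + 1)
```

Larger studies pull the pooled estimate toward their result, which is why a common metric and explicit weighting matter when evaluating a test–criterion relationship across sources.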
For example, in developing the FRI, Balkin et al. (2014) examined the relationship between subscales
on the FRI (collaborative exploration, role of reconciliation, remorse/change in offender, and outcome)
and subscales on the Forgiveness Scale (FS; absence of negative and presence of positive) and the score
on the Forgiveness Likelihood Scale (FLS). When examining the relationship between multiple predic-
tor variables and criterion variables, such as the case with the FS, canonical correlation may be used.
Canonical correlation could be thought of as a test–criterion relationship in which one set of variables is
related to another set of variables. In the case of multiple predictors with a single criterion, such as the
case with the FLS, multiple regression may be used. Balkin et al. (2014) reported the following:
A statistically significant relationship was found between the FRI subscales and the Forgiveness Scale subscales.
The first canonical root was significant, λ = .49, F(8, 326) = 17.62, p < .001, accounting for 49% (rc = .70) of the
variance in the model. The second canonical root was not significant, λ = .96, F(3, 164) = 2.22, p ࣘ .087, accounting
for 4% (rc = .20) of the variance in the model. Therefore, only the first canonical variate was interpreted. The first
canonical variate included scores on all four subscales of the FRI, with correlations ranging from −.63 to −.97 and
both subscales of the Forgiveness Scale, with correlations ranging from .69 to .97. Based on this analysis and as
expected, endorsement of positive attributes on the FRI was correlated with higher degrees of forgiveness toward
an offender. … A statistically significant relationship was found between the FRI subscales and the Forgiveness
Likelihood Scale, F(4, 165) = 4.54, p = .002, with a small to moderate effect size, R² = .10, accounting for 10% of
the variance in the model. Similar to the results on the Forgiveness Scale, increased positive attributes on the FRI
correlated with higher likelihood to forgive. (p. 9)
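The variance percentages in the excerpt are simply squared coefficients: squaring a canonical correlation rc (or a multiple R) gives the proportion of variance accounted for. A quick arithmetic check of the figures quoted above:

```python
# Proportion of variance accounted for = squared (canonical) correlation.
rc1, rc2 = 0.70, 0.20          # canonical correlations reported in the excerpt
print(round(rc1 ** 2, 2))      # 0.49 -> "49% of the variance"
print(round(rc2 ** 2, 2))      # 0.04 -> "4% of the variance"
```

The same identity explains the R² = .10 reported for the multiple regression with the FLS: 10% of the criterion variance is accounted for by the FRI subscale scores.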
In addition, correlation, unstandardized and standardized beta weights, standard error, t-test results,
and p values, along with standardized canonical variate coefficients for canonical correlation results,
should be presented, typically in a table, but possibly narratively. The goal in correlating these subscales
was to demonstrate that the FRI was measuring similar domains as the FLS and the FS. The resulting
tests of statistical significance and evaluation of effect size indicate that the FRI shares some common-
alities with other forgiveness measures, but also provides a distinctive quality to evaluating forgiveness.
The clinical group appeared to identify more negative feelings in addressing issues of conflict and for-
giveness than the nonclinical group, which might be expected given the clinical group might see the need
to address these issues in counseling. The goal in evaluating differences between clinical and nonclinical
groups is to provide some evidence that individuals in counseling could have different struggles related
to forgiveness than individuals who are not receiving counseling services, which were evidenced based
on the resulting tests of statistical significance and evaluation of effect size and centroid means.
Discussion
Up to this point, the ERRC standards were presented in the order in which they appear in the Standards for
Educational and Psychological Testing (AERA et al., 2014). However, the order of the standards does not
necessarily follow the natural progression of instrument development. Hence, the following summary
provides an overview of the sequence and tasks necessary for validating a measure with a focus on ERRC.
Before any relationships to criteria are investigated, counseling researchers should carefully outline
the design of the study and the extent to which measures used are aligned with the purpose of the
study and accurately reflect the variables of interest. The AERA et al. (2014) standards present the ideal
methods and procedures, but as mentioned before, the ideal standards are not necessarily typical. Meta-
analyses are rarely used to develop and validate a measure. Often authors will identify criteria to demon-
strate convergent or predictive evidence, but not both. In the heuristic examples noted, two articles were
used to demonstrate the various standards consistent with ERRC. Note that examining a relationship
with criteria could be done concurrently or predictively. In demonstrating convergent evidence, Balkin
et al. (2014) demonstrated the relationship of scores on the FRI with scores on the FS and FLS. All mea-
sures were administered concurrently. For predictive evidence, Balkin et al. (2016) presented a sequential
model and demonstrated that scores in the first phase of the model were mediated by a middle phase,
and predicted a later phase. Hence, the design and purpose of the study is essential to demonstrating
ERRC.
A prerequisite to demonstration of a relationship with criteria is demonstration of evidence of the
internal structure of the measure, so that the psychometric properties of the measure under valida-
tion, along with the established properties of any criteria measures, can be reported. Once psychometric
properties of predictor and criteria variables are established, further analyses may be executed. In the
demonstration of convergent or predictive evidence, accounting for significance, effect, and error is essential for a comprehensive analysis. When possible, demonstration of these outcomes with different groups
strengthens the validity evidence of the measure under development.
ERRC is an essential component to answering the following questions:
• How does this measure relate to measures of a similar nature?
• How does this measure predict outcomes essential to counseling?
The demonstration of ERRC requires the use of additional measures and a comprehensive design
to establish important evidence related to the use and interpretation of test scores. ERRC goes beyond
typical demonstration of psychometric properties, otherwise referred to as evidence of internal structure
(e.g., factor analysis, reliability estimates of scores) to demonstrate essential relationships related to test
use and interpretation. Counselors should be vigilant in evaluating the psychometric properties and
essential research to support the development and use of a measure.
Notes on Contributor
Richard S. Balkin is a Professor in the Department of Leadership and Counselor Education at the University of Mississippi.
He has authored over 75 peer-reviewed articles, book chapters, and books, including two books on assessment and
research methods. He has also developed measures related to counseling outcomes, forgiveness, and life-balance.
ORCID
Richard S. Balkin http://orcid.org/0000-0002-5519-3730
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Balkin, R. S., Harris, N., Freeman, S. J., & Huntington, S. (2014). The Forgiveness Reconciliation Inventory: An instrument
to process through issues of forgiveness and conflict. Measurement and Evaluation in Counseling and Development, 47,
3–13. https://doi.org/10.1177/0748175613497037
Balkin, R. S., Perepiczka, M., Sowell, S. M., Cumi, K., & Gnilka, P. G. (2016). The forgiveness-reconciliation
model: An empirically supported process for humanistic counseling. Journal of Humanistic Counseling, 55, 55–65.
https://doi.org/10.1002/johc.12024
Balkin, R. S., & Sheperis, C. J. (2011). Evaluating and reporting statistical power in counseling research. Journal of Coun-
seling & Development, 89, 268–272. https://doi.org/10.1002/j.1556-6678.2011.tb00088.x
Erford, B. T., Savin-Murphy, J., & Butler, C. (2010). Conducting a meta-analysis of counseling outcome research: Twelve steps and practical procedures. Counseling Outcome Research and Evaluation, 1, 19–43. https://doi.org/10.1177/2150137809356682
Rye, M. S., Loiacono, D. M., Folck, C. D., Olszewski, B. T., Heim, T. A., & Madia, B. P. (2001). Evaluation of the psychometric
properties of two forgiveness scales. Current Psychology, 20, 260–277. https://doi.org/10.1007/s12144-001-1011-6
Sedlacek, W. E. (2004). Beyond the big test: Noncognitive assessment in higher education. Hoboken, NJ: Wiley.