Reliability and Validity of the ARMIDILO-S in Sex Offenders with Intellectual Disabilities

Journal of Mental Health Research in Intellectual Disabilities
Journal homepage: https://www.tandfonline.com/loi/umid20
Claudia Pouls & Inge Jeandarme
To cite this article: Claudia Pouls & Inge Jeandarme (2023) Reliability and Validity of the
ARMIDILO-S in Sex Offenders with Intellectual Disabilities, Journal of Mental Health Research in
Intellectual Disabilities, 16:1, 37-53, DOI: 10.1080/19315864.2022.2148790
To link to this article: https://doi.org/10.1080/19315864.2022.2148790
Published online: 28 Nov 2022.
Knowledge Centre Forensic Psychiatric Care, Public Psychiatric Care Centre, Belgium
ABSTRACT
Background: The ARMIDILO-S is advocated as a promising tool for assessing dynamic risk factors in sex offenders with intellectual disabilities (SOIDs). However, research remains scarce. The present study aimed to further validate this instrument in SOIDs.
Method: The study prospectively followed 38 SOIDs for up to one year to test the accuracy of the ARMIDILO-S in predicting violent and sexual incidents.
Results: Overall predictive accuracy was moderate to high. The ARMIDILO-S further showed excellent prospective qualities in identifying both high-risk and low-risk offenders for violence. Regarding sexual offending, it was only good at prospectively detecting low-risk individuals.
Conclusions: This study provided further evidence of the good predictive validity of the ARMIDILO-S in predicting future sexual and violent incidents in SOIDs. More research, preferably in larger samples, as well as field validity studies are recommended.
KEYWORDS
ARMIDILO-S; risk assessment; intellectual disability; sex offenders
CONTACT Claudia Pouls claudia.pouls@opzcrekem.be Knowledge Centre Forensic Psychiatric Care, Public Psychiatric Care Centre, Daalbroekstraat 106, Rekem 3621, Belgium
© 2022 OPZC Rekem
The empirical assessment of the risk of further sexual offending in sex offenders can be done using static and/or dynamic variables. Static items, such as victim gender, are not amenable to change through treatment and thus provide no useful information about treatment targets. However, static risk factors give an indication of “baseline risk” or, in other words, the risk of recidivism without any intervention or treatment. This baseline risk can be used to determine treatment intensity and the degree of supervision. Following the Risk Need Responsivity (RNR) model (Bonta & Andrews, 2007), increasing levels of treatment intensity and supervision are recommended with increasing risk scores. Dynamic risk factors can then be used as a guide to determine treatment goals and to effectively manage and monitor changes in a client’s risk. Instruments relying on static factors were developed in mainstream sex offender samples and have not been extensively tested in offender populations with intellectual disabilities. Furthermore, the scarce research available shows mixed results. At this point, only the Static-99 (and revised versions; Harris et al., 2003; Phenix et al., 2008; Phenix, Fernandez et al., 2016; Phenix, Helmus et al., 2016) and the Rapid Risk Assessment for Sex Offense Recidivism (RRASOR; Hanson, 1997) are recommended for the prediction of sexual offenses in sex offenders
with an intellectual disability (SOIDs; Hanson et al., 2013; Hounsome et al.,
2018; Pouls & Jeandarme, 2015). Dynamic factors, on the other hand, can guide treatment plans with the goal of reducing risk. For that purpose, mainstream instruments such as the Historical Clinical Risk Management-20 (HCR-20; Webster et al., 1997), the Short-Term Assessment of Risk and Treatability (START; Webster et al., 2004) and the Sexual Violence Risk–20 (SVR-20; Boer et al., 1997) have been validated in SOIDs, and ID-specific instruments have also been developed, e.g., the Dynamic Risk Assessment and Management System (DRAMS; Lindsay & Beail, 2004) and the Assessment of Risk and Manageability for Individuals who Offend Sexually (ARMIDILO-S; Boer et al., 2004, 2013). In the Netherlands, the Dynamic Risk Outcome Scales (DROS; Drieschner & Hesper, 2008) was developed to assess treatment progress of patients with mild intellectual disability or borderline intellectual functioning and severe behavioral and/or psychiatric problems. In theory, ID-specific tools have some advantages over mainstream tools because they address ID-specific criminogenic needs. However, evidence for the validity of these tools is even more preliminary (Hounsome et al., 2018; Pouls & Jeandarme, 2015).
To date, only a few studies have been conducted with the ARMIDILO-S.
Blacker et al. (2011) evaluated the RRASOR, RM2000/V, SVR-20 and the
client subscales of a previous version of the ARMIDILO in matched samples
of 44 SOIDs (including borderline intellectual functioning) and 44 non-ID sex
offenders. Predictive accuracy of the ARMIDILO in the SOID group generally
exceeded that of the non-ID group, although not significantly. For the stable
client subscale, an AUC of .61 was found for sexual reconviction and an AUC
of .56 for official and unofficial sexual offense-related behavior. The acute
client subscale was the best predictor for sexual recidivism with an AUC of .73
(sexual reconviction) and .76 (unofficial sexual recidivism and reconviction
data; p < .001). When only considering a small subsample of SOIDs with an IQ
below 75 (n = 10), the stable client subscale produced a significant predictive
effect for sexual reconviction (AUC = .86); whereas the AUC of the acute client
subscale was high but non-significant (AUC = .75), possibly due to the very
small sample size. The study further showed that other instruments (RRASOR,
RM2000/V and the SVR-20) performed little better than chance level in
distinguishing sexual recidivists from non-recidivists (AUC = .37–.55). In
terms of violent recidivism, AUCs were high for both the acute (AUC = .76)
and stable (AUC = .83) client subscale of the ARMIDILO-S. In a second study,
Lofthouse et al. (2013) prospectively analyzed the ARMIDILO-S in 64 male
SOIDs (IQ < 75) from a community service in Scotland. Inter-rater reliability
was high for both subscales (r = .98 for the environmental subscale and r = .96
for the client subscale) and the total score (r = .98). For the prediction of sexual
incidents, large significant effect sizes were reached for total (subscale) scores:
total environment AUC = .81, total client AUC = .90 and total ARMIDILO-S
AUC = .92. The ARMIDILO-S performed better than the Static-99 (AUC = .75, p = .001) and the VRAG (AUC = .58, ns). No significant differences were found between the client and environmental subscales, supporting the importance of environmental variables. The total score was, however, significantly better at predicting recidivism than the subscales. Finally,
there was one non-peer reviewed doctoral study in which the ARMIDILO-S
was tested in 16 SOIDs with an IQ in the mild or borderline range (Sindall,
2012). A significant AUC of .83 was found for the ARMIDILO-S total score.
The client subscale also significantly predicted sexually inappropriate behavior
(AUC = .86), whereas the environmental subscale did not (AUC = .41). The
author hypothesized that the limited AUC for the environmental subscale
could be due to the lack of variance in the environmental items, because all
participants resided in a residential structure and had a secure support system.
Furthermore, other measures of predictive accuracy were analyzed. The positive predictive value was 55% (percentage of high-risk individuals that re-offended) and the negative predictive value was 100% (percentage of low-risk individuals that did not re-offend). The sensitivity was 100% (percentage of re-offenders who were correctly identified as high risk), whilst the specificity was 64% (percentage of non-re-offenders who were correctly classified as low risk).
Studies show some advantages of the ARMIDILO-S compared to other instruments. First, the ARMIDILO-S includes not only client factors but also environmental factors. It is argued that environmental factors are of specific relevance for SOIDs because they rely more heavily on environmental structures (Boer, 2013; Boer et al., 2007). Evidence for this statement was found in the study of Lofthouse et al. (2013), where the environmental items of the ARMIDILO-S did not differ significantly in predictive accuracy from the client items. In contrast, the AUC for the environmental subscales (AUC = .41–.50) was low in the doctoral study of Sindall (2012). A second interesting element of the ARMIDILO-S is that it includes protective factors alongside risk factors, an aspect considered neglected in the current risk assessment literature (Hounsome et al., 2018). However, the added value of protective factors has yet to be empirically validated.
In conclusion, evidence for the predictive validity of the ARMIDILO-S remains scarce, with only two published studies and one doctoral study to date. More research is needed into the utility and predictive validity of both static and dynamic risk assessment instruments for SOIDs. The current study aimed to validate the Static-99R and the ARMIDILO-S in a Flemish sample of SOIDs. In the current article, data on the ARMIDILO-S are presented, while psychometric data of the Static-99R are discussed in another article (Pouls & Jeandarme, 2022). Results showed that the Static-99R was not able to significantly predict sexual or violent incidents and was better at detecting low-risk individuals than high-risk offenders (Pouls & Jeandarme, 2022). Because the ARMIDILO-S is not yet well validated, our first objective was to study the inter-rater reliability and predictive accuracy of the ARMIDILO-S final convergent structured professional judgment (SPJ) rating in predicting various outcome measures. Several authors emphasize the potential added value of including environmental factors for SOIDs. Second, we therefore wanted to test the individual predictive validity of both the client subscales and the environmental subscales in predicting the primary outcome, namely sexual incidents. Furthermore, the absence of protective ratings has been identified as a gap in current risk assessment instruments. Third, we therefore wanted to test the predictive accuracy of both risk and protective factors for the primary outcome. In sum, we formulated four research questions:
RQ1. What is the inter-rater reliability of the ARMIDILO-S?
RQ2. What is the predictive validity of the ARMIDILO-S in predicting sexual and violent incidents?
RQ3. What is the predictive validity of the client subscale and the environmental subscale of the ARMIDILO-S in predicting sexual incidents?
RQ4. What is the predictive validity of the risk and protective ratings of the ARMIDILO-S in predicting sexual incidents?
Materials and methods

Participants
The study was conducted in six specialized forensic psychiatric and prison
units for SOIDs (including borderline intellectual functioning) and
included 50 male SOIDs. The Static-99R could not be scored for
seven patients due to missing information (n = 5) or because there was no
category A offense (n = 2). Due to the limited sample size, only patients
with a follow-up of less than six months were excluded (n = 5), leaving
a total study sample of 38.
The mean IQ score was 63 (SD = 11.13, range = 45–85). Three patients had an
IQ between 35 and 50, 21 between 50 and 70 and 12 between 70 and 85. Based on
a clinical DSM-IV diagnosis, 18 patients were classified as having a mild intellectual disability, six as having a moderate intellectual disability and two with
borderline intellectual functioning. One third (n = 12) did not receive an official
diagnosis of intellectual disability but was nevertheless admitted to a unit for
SOIDs and had an IQ at or below 85 (range = 53–85). A paraphilic disorder was
present in 27 patients, a developmental disorder in eight patients, personality
disorder in seven patients and a substance abuse disorder in seven patients.
Almost half of the participants (n = 18) had more than one psychiatric diagnosis.
All patients committed a sexual offense, either as an index offense (n = 35) or as
a prior offense (n = 18). Sex offenses concerned hands-on sex offenses in 35 cases
and hands-off offenses (e.g., possession of child pornography, indecent exposure) in 16 cases. Hands-on offenses were inflicted against children (n = 29),
adults (n = 2), or against both an adult and a child (n = 4). The mean age at the
time of the assessment was 44 years (SD = 12.14, range = 24–74). The mean
length of treatment or imprisonment was close to three years (33 months,
SD = 34.82, range = 2.3–126.2).
Measures
Static-99R
The Static-99R (Helmus et al., 2012; Phenix, Fernandez et al., 2016) was scored as part of the ARMIDILO-S scoring process (cf. infra). The Static-99R is an actuarial instrument designed to assess the likelihood of sexual and violent recidivism in sex offenders. It consists of 10 static items relating to demographic, offense and victim information. The total score is the sum of the item scores and varies from –3 to 12, further divided into five nominal risk categories: Level I – very low risk (scores of –3 to –2), Level II – below average risk (scores of –1 to 0), Level III – average risk (scores of 1 to 3), Level IVa – above average risk (scores of 4 to 5) and Level IVb – well above average risk (scores of 6 or above). In the current study, the Dutch translation of the Static-99R (Smid et al., 2014) was used.
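The score bands above can be written down as a small lookup. The following is only an illustrative sketch, not part of the Static-99R coding rules: the function and variable names are hypothetical, and the three-level recoding (Low = Level I + II, Moderate = Level III, High = Level IVa + IVb) is the one given in the note to Table 2.

```python
def static99r_level(score: int) -> str:
    """Map a Static-99R total score to its nominal risk level (I to IVb)."""
    if not -3 <= score <= 12:
        raise ValueError("Static-99R total scores range from -3 to 12")
    if score <= -2:
        return "I"    # very low risk
    if score <= 0:
        return "II"   # below average risk
    if score <= 3:
        return "III"  # average risk
    if score <= 5:
        return "IVa"  # above average risk
    return "IVb"      # well above average risk


# Three-level recoding reported in the note to Table 2.
THREE_LEVEL_RECODE = {
    "I": "Low",
    "II": "Low",
    "III": "Moderate",
    "IVa": "High",
    "IVb": "High",
}


def static99r_three_level(score: int) -> str:
    """Collapse the five nominal levels into Low / Moderate / High."""
    return THREE_LEVEL_RECODE[static99r_level(score)]


if __name__ == "__main__":
    for s in (-3, 0, 2, 5, 9):
        print(s, static99r_level(s), static99r_three_level(s))
```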
ARMIDILO-S
The ARMIDILO-S (Boer et al., 2013) is a SPJ risk assessment tool for the
assessment and management of risk for sexually inappropriate behavior
in SOIDs. The original authors strongly encourage that the scoring is
based on file information and one or two interviews with staff members.
A client interview is recommended, but not necessary. The instrument
contains 27 stable (i.e. slowly changing) and acute (i.e. rapidly changing)
dynamic items divided into a “client” and “environment” subscale (see
Table 1). In the first ARMIDILO-S evaluation the stable items need to be
scored based on the preceding one to two years (or up to five years when
the client resided in a highly structured setting); the acute items over the
previous two to three months. Thereafter, an annual scoring of the stable
items is advised while the acute items can be reassessed more frequently
to monitor ongoing risk. Each item is evaluated as both a risk and
a protective factor and rated on a 3-point scale where N indicates that
the item is absent, S that the item is somewhat present, and Y that the
item is present. For example, when a client did not act impulsively in the past period and clearly showed problem-solving skills, the risk factor for the item “Impulsivity” is scored as N and the protective factor as Y. After scoring all of the items, the clinician gives an overall risk rating and an overall protective rating expressed as low, moderate or high risk/protection. The overall risk and protective SPJ rating and the actuarial rating based on static risk factors using the RRASOR or Static-99(R) are then integrated into a final convergent SPJ rating (low, medium, or high risk). This rating might be low despite the presence of multiple risk factors. This is, for example, the case when environmental protective factors are high (e.g., high supervision, no individual liberties), limiting or even eliminating risky situations. Furthermore, risk and protective ratings are relatively independent of each other on related factors. In other words, a client’s knowledge of potential risky situations may be considered a protective factor, yet their history of impulsive behavior in those situations may lead to high risk ratings in those same situations (Boer et al., 2013).
Table 1. Subscales and items of the ARMIDILO-S.
Subscale                 Stable items                                   Acute items
Client subscale          Supervision compliance                         Changes in compliance with supervision or treatment
                         Treatment compliance                           Changes in sexual preoccupation
                         Sexual deviance                                Changes in victim-related behaviors
                         Sexual preoccupation                           Changes in emotional coping ability
                         Offense management                             Changes in use of coping strategies
                         Emotional coping ability                       Changes to unique considerationsᵃ
                         Relationships
                         Impulsivity
                         Substance abuse
                         Mental health
                         Unique considerationsᵃ
Environmental subscale   Attitude toward the ID client                  Changes in social relationships
                         Communication among support persons            Changes in monitoring and intervention
                         Client specific knowledge by support persons   Situational changes
                         Consistency of supervision/intervention        Changes in victim access
                         Unique considerationsᵃ                         Unique considerationsᵃ
ᵃ This item was not scored in the current study.
Although the ARMIDILO-S is intended to be used in clinical practice as
a structured professional judgment instrument, numerical scores are often
used for research purposes. In the current study, we used both methodologies.
Numerical item scores were based on the following scoring: N = 0, S = 1, and
Y = 2. Since each item is rated as a risk and protective factor, item scores can
range from –2 to +2. The item scores were then summed, resulting in
a possible numerical range of –46 to +46.
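The numerical scoring can be sketched as follows. This is a hedged illustration only. The article gives the point values (N = 0, S = 1, Y = 2) but not the sign convention, so the sketch assumes that an item score is the risk points minus the protective points. That assumption fits the reported ranges, in which a highly protective setting yields negative scores. The count of 23 scored items (27 items minus the four unscored "Unique considerations" items in Table 1) is likewise inferred from the stated range of –46 to +46. All names are hypothetical.

```python
from typing import Iterable, Tuple

# Point values stated in the article for the 3-point rating scale.
POINTS = {"N": 0, "S": 1, "Y": 2}

# Inferred, not stated: 27 items minus four unscored "Unique considerations".
N_SCORED_ITEMS = 23


def item_score(risk: str, protective: str) -> int:
    """Item score in [-2, +2]; ASSUMES score = risk points - protective points."""
    return POINTS[risk] - POINTS[protective]


def total_score(ratings: Iterable[Tuple[str, str]]) -> int:
    """Sum item scores over (risk rating, protective rating) pairs."""
    return sum(item_score(risk, protective) for risk, protective in ratings)


if __name__ == "__main__":
    # "Impulsivity" example from the text: risk rated N, protective rated Y.
    print(item_score("N", "Y"))
    # Extremes of the possible range with 23 scored items.
    print(total_score([("Y", "N")] * N_SCORED_ITEMS))
    print(total_score([("N", "Y")] * N_SCORED_ITEMS))
```

Under this convention a higher total means more risk relative to protection, which matches the direction needed for the ROC analyses.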
In the current study, a Dutch version of the ARMIDILO-S was used
(Van Alphen, 2016). This translation was officially approved by the
authors and psychometric properties were tested in a pilot study in the
Netherlands. Reliability and validity were reported to be sufficient, although
firm conclusions are limited because of the small sample size (n = 13; Van
Alphen, 2016).
Procedure
The study was prospective in nature. The Dutch version of the ARMIDILO-S (Van Alphen, 2016) was scored by the first author. The scoring was based on judicial and clinical file information, an interview with the client (97.4%) and an interview with one (57.9%) or two (42.1%) key members of the nursing staff. In most cases, the interviewed staff were the psychologist and/or a member of the nursing staff. Scores were not discussed with the clinicians and hence did not influence their risk management strategies. The convergent SPJ rating was formulated according to the manual in terms of low, moderate, or high risk for both future sexual incidents and violent incidents. The primary rater had an advanced degree in behavioral sciences and formal training in several risk assessment instruments, including the ARMIDILO-S. Second ratings were gathered for 20 participants by a bachelor student in applied psychology and a master student in criminology.
Violent and sexual incidents within one year after the scoring were registered by the first author, based on observational notes of the staff. Due to the prospective nature of the study, the start of the registration differed between the institutions, ranging from December 2016 to May 2020.
Ethical approval was obtained from the Ethics Committee of Antwerp
University Hospital (B300201628941). Permission to conduct the study in
prison was sought from the Belgian Federal Government of Justice.
Furthermore, informed consent was obtained from each respondent and his
guardian or treating physician.
Outcome Measures
The predictive accuracy was assessed using two outcome measures: sexual and
violent incidents. A violent incident referred to physical non-sexual violence
against another person: uttering threats, grabbing by the throat, kicking,
hitting, biting, or throwing objects against a person with the purpose of
inflicting pain. A sexual incident was defined as illegal sexual behavior such
as sexual assault, sexual touching, gross indecency, or indecent exposure.
Statistical Analyses
All analyses were conducted in SPSS 22 (IBM Corp, 2013) and MedCalc (Garber, 1998). Inter-rater reliability (IRR) was evaluated through a two-way random intraclass correlation coefficient (ICC(2,1), absolute agreement). Fleiss’s (1986) critical values for single measures were used: ICC ≥ .75 = excellent, ICC ≥ .60 = good, ICC ≥ .40 = moderate, and ICC < .40 = poor. Predictive validity was analyzed using both discrimination and calibration indicators. Discrimination refers to how well an instrument can separate those who went on to be violent from those who did not. Calibration refers to how well the prediction of risk (expected recidivism) agrees with the actual observed risk (Singh, 2013). A global effect size was calculated through ROC analysis. The corresponding AUC values were evaluated according to the classification of Rice and Harris (2005), whereby AUC ≥ .56 = small effect, AUC ≥ .64 = moderate effect and AUC ≥ .71 = large effect. Using the information of a 2 × 2 contingency table, sensitivity (percentage of recidivists who were judged to be at high risk), specificity (percentage of non-recidivists who were judged to be at low risk), positive predictive value (PPV, percentage of participants judged to be at high risk who did reoffend), negative predictive value (NPV, percentage of low-risk individuals who did not reoffend), number needed to detain (NND, number of participants judged to be at high risk who need to be detained to prevent a single incident or offense), and number safely discharged (NSD, number of participants judged to be at low risk who could be discharged prior to a single incident or offense) were calculated. These performance indicators provide information about how accurately a tool identifies high-risk (“rule in”; PPV and NND) and low-risk (“rule out”; NPV and NSD) individuals. Calculating these measures requires a single cutoff threshold. Participants classified as being at moderate or high risk were therefore compared with participants classified as low risk.
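The indicators defined above can all be derived from the four cells of the 2 × 2 table. The sketch below is a hedged illustration with hypothetical names. The article does not print formulas for NND and NSD, so the sketch assumes the definitions commonly used in this literature, NND = 1/PPV and NSD = 1/(1 − NPV) − 1. The cell counts in the usage example are not reported by the authors. They are back-calculated here from the base rates and the percentages in Table 3 and should be read as an inference.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TwoByTwo:
    tp: int  # judged moderate/high risk, incident occurred
    fp: int  # judged moderate/high risk, no incident
    fn: int  # judged low risk, incident occurred
    tn: int  # judged low risk, no incident


def performance_indicators(t: TwoByTwo) -> Dict[str, Optional[float]]:
    """Compute the indicators; assumes every row and column total is non-zero."""
    ppv = t.tp / (t.tp + t.fp)
    npv = t.tn / (t.tn + t.fn)
    return {
        "sensitivity": t.tp / (t.tp + t.fn),
        "specificity": t.tn / (t.tn + t.fp),
        "ppv": ppv,
        "npv": npv,
        # ASSUMED definitions; None marks a division by zero.
        "nnd": 1 / ppv if t.tp > 0 else None,
        "nsd": 1 / (1 - npv) - 1 if t.fn > 0 else None,
    }


if __name__ == "__main__":
    # Back-calculated counts (an inference, not data supplied in the article).
    sexual = TwoByTwo(tp=4, fp=6, fn=0, tn=28)
    violent = TwoByTwo(tp=10, fp=1, fn=6, tn=21)
    for name, table in (("sexual", sexual), ("violent", violent)):
        print(name, performance_indicators(table))
```

With these counts the sexual-incident NSD is undefined because no low-risk participant had an incident, which matches the "division by zero" note under Table 3.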
Results
Descriptive Statistics
The base rate was 10.5% (n = 4) for sexual incidents and 42.1% (n = 16) for
non-sexual violent incidents.
Total scores ranged from –22 to +5 (possible score range –46 to +46), total
client scores ranged from –14 to +8 (possible score range –30 to +30) and total
environment scores ranged from –11 to 0 (possible score range –16 to +16). In
Table 2, descriptive statistics for the structured professional judgment are presented. According to the Static-99R rating, only four patients were deemed to be at low risk for re-offending, while according to the ARMIDILO-S convergent SPJ rating, this involved 28 patients. This could be explained by the fact that most patients resided in a highly protective structure, i.e. a forensic medium security setting with very few individual liberties outside of the unit or campus. Elevated risk scores were found in most patients (n = 27), but these risks were thus countered by the protective factors.
Table 2. Descriptive statistics for the structured professional judgment relating to the Static-99R risk rating and the risk, protective, and convergent rating of the ARMIDILO-S.
           Static-99R    ARMIDILO-S     ARMIDILO-S           ARMIDILO-S
           rating¹       risk rating    protective rating    convergent rating
Low        4 (10.5%)     11 (28.9%)     0                    28 (73.7%)
Moderate   13 (34.2%)    18 (47.4%)     5 (13.2%)            9 (23.7%)
High       21 (55.3%)    9 (23.7%)      33 (86.8%)           1 (2.6%)
Note. The convergent rating is derived from taking into account the numerical score of the Static-99R, the SPJ risk rating of the ARMIDILO-S items and the SPJ protective rating of these same ARMIDILO-S items.
¹ The Static-99R was scored using the revised risk categories (Phenix, Fernandez et al., 2016). These were recoded as follows: Low = Level I + II; Moderate = Level III; High = Level IVa + Level IVb.
Inter-rater Reliability
Inter-rater reliability (ICC(2,1)) of the ARMIDILO-S convergent SPJ rating was poor with regard to violent incidents (.28) and moderate for sexual incidents (.55).
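For readers unfamiliar with the coefficient, a two-way random, absolute-agreement, single-measures ICC can be computed from the mean squares of a subjects × raters table. The sketch below is a minimal pure-Python illustration of that standard formula, not the SPSS routine used in the study. The function names and the toy ratings (coded 0 = low, 1 = moderate, 2 = high) are hypothetical. The labels follow the Fleiss (1986) cutoffs quoted in the statistical analyses.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): rows are subjects, columns are raters (complete data only)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denominator == 0:
        raise ValueError("ICC is undefined when the ratings show no variance")
    return (ms_rows - ms_error) / denominator


def fleiss_label(icc: float) -> str:
    """Qualitative label using the cutoffs quoted in the article."""
    if icc >= 0.75:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "moderate"
    return "poor"


if __name__ == "__main__":
    # Hypothetical ratings: rater 2 is always one category above rater 1.
    shifted = [[0, 1], [1, 2], [0, 1], [1, 2]]
    value = icc_2_1(shifted)
    print(round(value, 2), fleiss_label(value))
```

Because the coefficient measures absolute agreement, a rater who is consistently one category stricter lowers the ICC even when both raters rank the participants identically.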
Predictive Validity
Convergent SPJ Rating
The ARMIDILO-S convergent SPJ rating was able to significantly predict sexual and violent incidents with high accuracy (AUC = .77–.90) according to the criteria of both Rice and Harris (2005) and Sjöstedt and Grann (2002). The accuracy of the numerical score was also high (AUC = .74) for the prediction of violent incidents, but moderate and non-significant (AUC = .70, p = .20) regarding sexual incidents. No statistically significant differences were found between the AUCs of the numerical scores and the convergent SPJ ratings regarding the prediction of sexual incidents (p = .18) or violent incidents (p = .58). An overview of all performance indicators is shown in Table 3. These numbers can be read as follows: Of those individuals who were involved in sexual incidents, 100% had been classified as being at moderate or high risk of future sexual offending (sensitivity). Of those individuals who were not involved in sexual incidents, 82% had been judged to be at low risk (specificity). Of those judged to be at moderate or high risk, 40% were involved in a sexual incident (PPV), equivalent to a number needed to detain of three: three patients judged to be at high risk need to be detained in order to prevent one person from relapsing. Of those judged to be at low risk, nobody displayed any problematic sexual behavior (NPV).
Results concerning the prediction of violent incidents showed a mixed picture, with moderate sensitivity (62.5%) but high specificity (the proportion of nonviolent individuals who were judged to be at low risk) and high prospective accuracy in identifying both high-risk individuals (PPV; the proportion of those judged to be at high risk of committing a violent act who did in fact relapse violently) and low-risk individuals (NPV; the proportion of those judged to be at low risk of committing a violent act who were in fact not involved in a violent incident). Only one patient judged to be at high risk needs to be detained in order to prevent one single offense, while four patients judged to be at low risk could be discharged prior to the occurrence of one offense.
Table 3. Performance indicators of the ARMIDILO-S for the prediction of different outcome measures based on the convergent SPJ rating and numerical score.
                                     Sexual incidents (95% CI)   Violent incidents (95% CI)
ARMIDILO-S numerical score, AUC      .70 (.44–.96)               .74* (.58–.91)
SPJ convergent rating, AUC           .90* (.80–1.00)             .79** (.63–.95)
Sensitivity                          100% (39.76–100)            62.5% (35.43–84.80)
Specificity                          82.4% (65.47–93.24)         95.5% (77.16–99.88)
PPV                                  39.9% (24.33–57.88)         90.9% (58.67–98.60)
NPV                                  100%                        77.8% (64.88–86.90)
NND                                  3                           1
NSD                                  Not applicableᵃ             4
Note. AUC = area under the curve; CI = confidence interval; PPV = positive predictive value; NPV = negative predictive value; NND = number needed to detain; NSD = number safely discharged.
* p < .05; ** p < .01.
ᵃ This value was undefined because the calculation entailed division by zero.
Subscales
Component analyses showed that none of the subscales were significantly predictive of sexual incidents, although the numerical score of the environmental subscale showed a trend toward significance (p = .06) and the confidence interval did not include .50 (see Table 4).
Table 4. Predictive validity (AUC) of the ARMIDILO-S individual component scores.
                          Sexual incidents
                          Numerical score     Convergent SPJ rating
Risk items                .66 (.44–.88)       .63 (.41–.85)
Protective items          .72 (.45–1.00)      .57 (.25–.89)
Total client score        .62 (.35–.90)       Not applicableᵃ
Total environment score   .79 (.59–.99)       Not applicableᵃ
ARMIDILO-S total          .70 (.44–.96)       .90* (.80–1.00)
ᵃ The manual does not ask you to formulate a risk rating concerning the total client or total environmental items.
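The AUC values reported in this section can be read as the probability that a randomly chosen participant with an incident received a higher rating or score than a randomly chosen participant without one, with ties counted as one half. The sketch below computes the AUC directly from that definition and attaches a label using the Rice and Harris (2005) thresholds quoted in the statistical analyses. It is a hedged illustration: the function names are hypothetical, the example scores are invented toy data rather than study data, and no confidence interval or significance test is computed.

```python
from typing import Sequence


def auc(with_incident: Sequence[float], without_incident: Sequence[float]) -> float:
    """AUC as the share of (incident, no-incident) pairs ranked correctly."""
    if not with_incident or not without_incident:
        raise ValueError("both groups must contain at least one participant")
    credit = 0.0
    for p in with_incident:
        for q in without_incident:
            if p > q:
                credit += 1.0   # pair ranked correctly
            elif p == q:
                credit += 0.5   # tie counts as half
    return credit / (len(with_incident) * len(without_incident))


def rice_harris_label(value: float) -> str:
    """Effect-size label using the thresholds .56, .64 and .71."""
    if value >= 0.71:
        return "large"
    if value >= 0.64:
        return "moderate"
    if value >= 0.56:
        return "small"
    return "below the small-effect threshold"


if __name__ == "__main__":
    # Invented toy ratings coded 0 = low, 1 = moderate, 2 = high.
    incident = [2, 1, 1]
    no_incident = [0, 0, 1, 0]
    value = auc(incident, no_incident)
    print(round(value, 3), rice_harris_label(value))
```

With only three rating categories many pairs are tied, which is one reason an SPJ rating and a numerical score can yield different AUCs for the same participants.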
Discussion
As early as 2004, Lindsay and Beail (2004) addressed the lack of studies that
validate the use of existing risk assessment instruments in ID populations and
the urgent need to develop objective and valid risk assessment tools tailored to
the needs of (S)OIDs. However, little progress has been made to date
(Hounsome et al., 2018; Lofthouse et al., 2017; Pouls & Jeandarme, 2015).
The aim of the current study was therefore to assess the predictive validity of
the ARMIDILO-S in sex offenders with ID or borderline intellectual
functioning.
The AUC analyses showed moderate to high accuracy of the convergent SPJ rating in predicting sexual and violent incidents, even when stricter criteria are used for the interpretation (Sjöstedt & Grann, 2002). The high AUC of .90 for the prediction of sexual incidents is in line with the AUC of .92 for the numerical total score in the study of Lofthouse et al. (2013). An important difference from the other ARMIDILO(-S) studies, however, is that the convergent SPJ rating was used alongside a numerical total score. Although the AUC of the SPJ rating was higher than that of its numerical counterpart, differences were non-significant, which might be attributed to the limited sample size. Furthermore, a high AUC value was found for the environmental subscale, in line with the results of the study conducted by Lofthouse et al. (2013). Although non-significant, there was a trend towards significance and the confidence interval did not include .50. With a larger sample, this result might have reached significance. The lack of significance may also be due to the limited variance of the environmental scores (0 to –11): mainly because of the secure setting and limited (or absent) liberties, all the participants scored “no risk” on the environmental items. Furthermore, performance indicators of the ARMIDILO-S were mainly higher than those of the Static-99R in the same population (Pouls & Jeandarme, 2022), although significance testing was not conducted. The same trend could be seen in the study of Blacker et al. (2011), where predictive validity of the acute and stable client subscales of the previous version of the ARMIDILO-S was generally higher than that of the RRASOR and RM2000/V. This indicates the need to include dynamic risk factors during treatment and for monitoring purposes. Another explanation could be that predictive validity of SOID-specific risk assessment instruments is better than that of instruments developed in mainstream offender populations.
In addition to the generally reported AUC value, other – more clinically
relevant – performance indicators were analyzed. Concerning the prediction
of sexual incidents, the ARMIDILO-S was better at prospectively detecting
low-risk individuals. The instrument further showed excellent prospective
qualities in predicting who was going to act violently (detection of high-risk
offenders), although the detection of low-risk violent offenders was relatively
high too. This finding is surprising, given that (limited) findings in non-ID
offender samples show that risk assessment instruments are generally better in
identifying low-risk offenders compared to high-risk offenders (Declue &
Campbell, 2013; Fazel et al., 2012; Singh et al., 2011). The suggestion of
Fazel et al. (2012) to use risk assessment tools to screen out low-risk cases,
rather than to detect high-risk individuals, may therefore not extend to SOIDs.
However, there are no clear cutoff standards for interpreting PPV/NPV or
NND/NSD values, which makes their interpretation largely a moral judgment.
Furthermore, comparison with other studies is hampered by methodological
differences (e.g., PPV/NPV and NND/NSD are base rate-dependent) and a lack of
indicators for predictive accuracy other than AUC. The results with regard to the
prediction of sexual incidents were comparable to those of one non-peer-reviewed
study in SOIDs (Sindall, 2012), despite the difference in base rate
(10.5% in the current study; 31.2% in the Sindall study). In the study of Sindall
(2012), NPV and sensitivity were both 100%, as was the case in the current study.
PPV and specificity were 55% and 64%, respectively, in the study of Sindall
(2012), and 40% and 82% in the current study.
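The indicators compared above all derive from a single 2x2 classification table. The sketch below is illustrative only. The cell counts are not reported in this section; they are an assumption, reconstructed from the sample sizes, base rates and percentages cited above, and they reproduce the reported values only to within rounding. The class and function names are hypothetical, and NND and NSD are computed with the commonly used definitions NND = 1/PPV and NSD = 1/(1 - NPV) - 1, which should be checked against the Method section.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell counts of a 2x2 table (predicted risk level vs. observed incident)."""
    tp: int  # rated high risk, incident observed
    fp: int  # rated high risk, no incident
    tn: int  # rated low risk, no incident
    fn: int  # rated low risk, incident observed


def performance_indicators(t: TwoByTwo) -> dict:
    """Return the performance indicators implied by one 2x2 table."""
    n = t.tp + t.fp + t.tn + t.fn
    ppv = t.tp / (t.tp + t.fp)
    npv = t.tn / (t.tn + t.fn)
    return {
        "base_rate": (t.tp + t.fn) / n,
        "sensitivity": t.tp / (t.tp + t.fn),
        "specificity": t.tn / (t.tn + t.fp),
        "ppv": ppv,
        "npv": npv,
        # Assumed definitions: NND = 1/PPV, NSD = 1/(1 - NPV) - 1.
        "nnd": 1 / ppv,
        "nsd": math.inf if npv == 1 else 1 / (1 - npv) - 1,
    }


# Assumed counts, reconstructed from the reported percentages (not raw study data).
current_study = TwoByTwo(tp=4, fp=6, tn=28, fn=0)   # n = 38, base rate 10.5%
sindall_2012 = TwoByTwo(tp=5, fp=4, tn=7, fn=0)     # n = 16, base rate 31.2%

for label, table in [("current study", current_study), ("Sindall (2012)", sindall_2012)]:
    result = performance_indicators(table)
    print(label, {name: round(value, 3) for name, value in result.items()})
```

Under these assumed counts, an NPV of 100% makes NSD unbounded, because no individual rated as low risk was involved in a sexual incident in either sample.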
From a clinical point of view, the ARMIDILO-S was experienced as a time-
consuming instrument, both for the rater and the staff being interviewed. On
the other hand, staff found it useful that they were being pushed to reflect on
risk-relevant matters. For the rater, it was particularly hard to collect the
information necessary to score the environmental subscales. Focusing on staff
attitudes, team communication (or miscommunication) or supervision problems
may cause staff to become defensive. This problem might be overcome
by having a team member, instead of an external rater, score the ARMIDILO-S.
Furthermore, the scoring of the instrument was not an easy task. This is
reflected in the poor inter-rater reliability results (ICC(2,1) = .28–.55).
A lack of experience or prior risk assessment training among
the second raters, i.e., students, could explain these poor results. This
explanation was supported by exploratory analyses of the predictive accuracy
using the ratings of the primary rater versus the ratings of the second rater.2
Training therefore seems necessary to guarantee accurate scoring, which is in
line with the user requirements defined in the manual. The influence of
individual rater characteristics was also demonstrated in field studies of the
Psychopathy Checklist–Revised (PCL-R; Hare, 2003; e.g., Boccaccini et al., 2008, 2014),
although the question of why this results in scoring differences remains largely
unanswered. Individual studies pointed to personality traits of the evaluator
(Miller et al., 2011), level of experience (Rufino et al., 2012), and experience
in scoring the instrument (Murrie et al., 2012) as potential explanations.
Jeandarme et al. (2016) hypothesized that level of background training and
education could also be related to scoring differences. Furthermore, the item
instructions in the manual are rather limited and sometimes vague, which
could have created more room for interpretation and consequently more
subjectivity, compromising rater agreement. Nevertheless, the very low IRR
achieved in this study is problematic, and it is inconsistent with
the high inter-rater reliability found by Lofthouse et al. (2013).
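For readers less familiar with the coefficient reported here, ICC(2,1) is the two-way random-effects, absolute-agreement, single-rater intraclass correlation. It penalizes systematic differences between raters as well as random disagreement. The following is a minimal sketch of its computation from ANOVA mean squares. The function name is hypothetical, and the ratings are the widely used four-rater worked example from Shrout and Fleiss (1979), not data from this study.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows: one row per rated person, one column per rater.
    """
    n = len(ratings)      # number of rated persons
    k = len(ratings[0])   # number of raters
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Partition the total sum of squares into persons, raters and residual.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-person mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative ratings only: six rated persons (rows) by four raters (columns).
example_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(example_ratings), 2))  # 0.29
```

The value of .29 is a property of the example ratings and is unrelated to the results of this study. In the example it is low mainly because the raters differ systematically in their mean scores, which lowers absolute agreement even when raters rank individuals similarly.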
Limitations
Although the results are promising, they must be interpreted with caution. The
reliability of the findings is limited by the small sample size, although all
people with ID or borderline intellectual functioning and sexual offense
histories in OID-specific projects in Flanders were included. This might
have affected the ROC analysis, because sample sizes below 200 result in
inaccuracies in the estimated population parameters underlying ROC analyses
(Hanczar et al., 2010, as cited in Singh, 2013). The problem of small sample sizes
is also apparent in previous studies with risk assessment instruments in
SOIDs. Studies with the ARMIDILO(-S) reported sample sizes of 16
(Sindall, 2012), 44 (Blacker et al., 2011) and 64 (Lofthouse et al., 2013). With
a few exceptions, this was also the case in other risk assessment studies in this
population, with a mean sample size of around 65 (e.g., Pouls & Jeandarme,
2015).
2 Data available on request.
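To give a rough sense of how sample size limits the precision of an AUC, the sketch below applies the Hanley and McNeil approximation of the standard error of an AUC. This is an illustration brought in for this purpose, not the analysis used in this study. The AUC of .80 is an assumed value, the split of 4 events and 34 non-events corresponds to a 10.5% base rate in 38 individuals, and the function names are hypothetical. The approximation itself is crude with so few events.

```python
import math


def hanley_mcneil_se(auc: float, n_events: int, n_non_events: int) -> float:
    """Approximate standard error of an AUC (Hanley and McNeil approximation)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_events - 1) * (q1 - auc ** 2)
        + (n_non_events - 1) * (q2 - auc ** 2)
    ) / (n_events * n_non_events)
    return math.sqrt(variance)


def wald_interval(auc: float, se: float, z: float = 1.96):
    """Symmetric 95% interval, clipped to the valid AUC range [0, 1]."""
    return max(0.0, auc - z * se), min(1.0, auc + z * se)


assumed_auc = 0.80  # assumed value, for illustration only
for n_events, n_non_events in [(4, 34), (40, 340)]:
    se = hanley_mcneil_se(assumed_auc, n_events, n_non_events)
    low, high = wald_interval(assumed_auc, se)
    print(
        f"{n_events} events, {n_non_events} non-events: "
        f"SE = {se:.3f}, 95% CI = {low:.2f} to {high:.2f}"
    )
```

Under these assumptions the interval with 4 events runs from roughly .53 to the upper bound of 1.00, whereas a sample ten times larger would give roughly .72 to .88.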
Further, because calibration indicators are base rate-dependent and vary
depending on the population, time at risk and outcome of interest, these
results cannot be generalized to populations different from the current sample
or methodologically different studies.
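The base-rate dependence of PPV and NPV follows directly from Bayes' rule. In the sketch below, sensitivity and specificity are held at the values reported for sexual incidents in this study (100% and 82%) and only the base rate is varied, using the two base rates mentioned in the Discussion. The function name is hypothetical, and the second scenario is a thought experiment, not a reanalysis of any study.

```python
def ppv_npv(sensitivity: float, specificity: float, base_rate: float):
    """PPV and NPV implied by Bayes' rule for a given base rate."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    true_neg = specificity * (1 - base_rate)
    false_neg = (1 - sensitivity) * base_rate
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same discrimination, two different base rates.
for base_rate in (0.105, 0.312):
    ppv, npv = ppv_npv(sensitivity=1.00, specificity=0.82, base_rate=base_rate)
    print(f"base rate {base_rate:.1%}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

With discrimination held constant, PPV rises from about .39 (close to the reported 40%) to about .72 when the base rate increases from 10.5% to 31.2%, which is why PPV and NPV from samples with different base rates cannot be compared directly.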
Thirdly, although our study was prospective in nature, we did not use
a standardized registration form such as the Staff Observation Aggression
Scale – Revised (Nijman et al., 1999), for two reasons. Firstly, the use of
such standardized forms is limited or even non-existent in clinical
practice. Secondly, the reliability of incident registration is at risk when
different people from six institutions are involved in the scoring process.
However, when using a non-official outcome measure rather than, for
example, reconviction data, rater subjectivity in determining what is
appropriate sexual behavior cannot be ruled out. Moreover, we were
unaware of possible formal allegations or convictions. Because one
researcher recorded all incidents, based on the observational data,
consistency in scoring was nevertheless ensured. Underreporting might
have occurred, particularly regarding potential incidents off-campus. Because
unsupervised leave was limited for most patients, the risk of underreporting is
presumably negligible.
Future Research
This study included mainly offenders in the low to borderline intelligence
range, as is the case in most of the work in this field (Pouls & Jeandarme,
2015). Future research should look at predictive accuracy depending on the
severity of the intellectual disability. The sample of the current study was too
small to make these comparisons.
Secondly, the predictive validity of risk assessment instruments established in
research settings cannot simply be generalized to field settings. It is
hypothesized that real-world contexts are
characterized by a lack of resources, different or no training backgrounds,
variable or poor file information, unknown fidelity to administration
procedures or contextual pressures (Helmus et al., 2021; Jeandarme et al., 2016; Vojt
et al., 2013). Field validity studies are of crucial importance, because studies in
controlled research contexts only tell us how well instruments can perform,
whereas field studies tell us how well they do perform (Edens & Boccaccini, 2017,
p. 603). The few studies in mainstream offender populations with the HCR-20
show mixed results (Helmus et al., 2021; Jeandarme et al., 2016; Neal et al.,
2015; Pedersen et al., 2012). To our knowledge, such studies are so far
entirely absent in the field of OIDs.
Thirdly, more research is needed on risk assessment in OIDs,
preferably with larger samples than those used in most studies so far and
with multiple performance indicators.
Conclusion
This study has provided further evidence of the good predictive validity of the
ARMIDILO-S in predicting future sexual and violent incidents in SOIDs.
Although the ARMIDILO-S was able to prospectively detect individuals who
are at high risk for future violent behavior, more caution is needed regarding
the detection of high-risk individuals for future sexual incidents. Furthermore,
the environmental subscales might be of added value. However, more research,
preferably in larger samples, is necessary to confirm these results. Despite the
limited amount of empirical research, the ARMIDILO-S is currently the most
validated dynamic risk assessment tool available for SOIDs, even when
mainstream instruments are considered.
Acknowledgments
Special thanks to the participating clients and institutions: A.B.A.G.G. (’t Zwart Goor), Amanis
(’t Zwart Goor), Itinera (Sint-Idesbald), Limes (Sint-Ferdinand), Ontgrendeld (OBRA), KFP
(APZ Sint-Lucia), and Forensische Zorg 4 (OPZC Rekem). We also want to thank the Federal
Government of Justice.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
Blacker, J., Beech, A. R., Wilcox, D. T., & Boer, D. P. (2011). The assessment of dynamic risk
and recidivism in a sample of special needs sexual offenders. Psychology, Crime & Law, 17(1),
75–92. https://doi.org/10.1080/10683160903392376
Boccaccini, M. T., Murrie, D. C., Rufino, K. A., & Gardner, B. O. (2014). Evaluator differences
in Psychopathy Checklist-Revised factor and facet scores. Law and Human Behavior, 38(4),
337–345. https://doi.org/10.1037/lhb0000069
Boccaccini, M. T., Turner, D. B., & Murrie, D. C. (2008). Do some evaluators report
consistently higher or lower PCL-R scores than others? Findings from a statewide sample of
sexually violent predator evaluations. Psychology, Public Policy, and Law, 14(4), 262–283.
https://doi.org/10.1037/a0014523
Boer, D. P. (2013). Some essential environmental ingredients for sex offender reintegration.
International Journal of Behavioral Consultation and Therapy, 8(3–4), 8–11.
https://doi.org/10.1037/h0100976
Boer, D. P., Haaven, J., Lambrick, F., Lindsay, W. R., McVilly, K. R., Sakdalan, J., &
Frize, M. C. J. (2013). ARMIDILO-S manual. http://www.armidilo.net/
Boer, D. P., Hart, S. D., Kropp, P. R., & Webster, C. D. (1997). Manual for the sexual violence
risk-20: Professional guidelines for assessing risk of sexual violence. The Mental Health, Law, &
Policy Institute.
Boer, D. P., McVilly, K. R., & Lambrick, F. (2007). Contextualizing risk in the assessment of
intellectually disabled individuals. Sexual Offender Treatment, 2(2), 1–4.
http://www.sexual-offender-treatment.org/59.html
Boer, D. P., Tough, S., & Haaven, J. (2004). Assessment of risk manageability of intellectually
disabled sex offenders. Journal of Applied Research in Intellectual Disabilities, 17(4), 275–283.
https://doi.org/10.1111/j.1468-3148.2004.00214.x
Bonta, J., & Andrews, D. A. (2007). Risk-Need-Responsivity Model for offender assessment and
rehabilitation. Her Majesty the Queen in Right of Canada.
Declue, G., & Campbell, T. (2013). Calibration performance indicators of the static-99R: 2013
update. Open Access Journal of Forensic Psychology, 5, 82–88.
https://www.oajfp.com/_files/ugd/166e3f_549efdb235474b6eaae789bc6433f8fc.pdf
Drieschner, K. H., & Hesper, B. L. (2008). Dynamic risk outcome scales. Trajectum.
Edens, J. F., & Boccaccini, M. T. (2017). Taking forensic mental health assessment “out of the
lab” and into “the real world”: Introduction to the special issue on the field utility of forensic
assessment instruments and procedures. Psychological Assessment, 29(6), 599–610.
https://doi.org/10.1037/pas0000475
Fazel, S., Singh, J. P., Doll, H., & Grann, M. (2012). Use of risk assessment instruments to
predict violence and antisocial behaviour in 73 samples involving 24 827 people: Systematic
review and meta-analysis. BMJ, 345. https://doi.org/10.1136/bmj.e4692
Garber, C. (1998). MedCalc Software for Statistics in Medicine. Clinical Chemistry, 44(6), 1370.
https://doi.org/10.1093/clinchem/44.6.1370
Hanczar, B., Hua, J., Sima, C., Weinstein, J., Bittner, M., & Dougherty, E. R. (2010).
Small-sample precision of ROC-related estimates. Bioinformatics, 26(6), 822–830.
https://doi.org/10.1093/bioinformatics/btq037
Hanson, R. K. (1997). The development of a brief actuarial risk scale for sexual offense
recidivism. Department of the Solicitor General of Canada.
Hanson, R. K., Sheahan, C. L., & VanZuylen, H. (2013). STATIC-99 and RRASOR predict
recidivism among developmentally delayed sexual offenders: A cumulative meta-analysis.
Sexual Offender Treatment, 8(1), 1–14. http://www.sexual-offender-treatment.org/119.html
Hare, R. D. (2003). Manual for the Revised Psychopathy Checklist (2nd ed.). Multi-Health
Systems.
Harris, A., Phenix, A., Hanson, R. K., & Thornton, D. (2003). Static-99 coding rules revised -
2003. Department of the Solicitor General of Canada.
Helmus, L. M., Hanson, R. K., Murrie, D. C., & Zabarauckas, C. L. (2021). Field validity of
static-99R and STABLE-2007 with 4,433 men serving sentences for sexual offences in British
Columbia: New findings and meta-analysis. Psychological Assessment, 33(7), 581–595.
https://doi.org/10.1037/pas0001010
Helmus, L. M., Thornton, D., Hanson, R. K., & Babchishin, K. M. (2012). Improving the
predictive accuracy of Static-99 and Static-2002 with older sex offenders: Revised age
weights. Sexual Abuse: A Journal of Research and Treatment, 24(1), 64–101.
https://doi.org/10.1177/1079063211409951
Hounsome, J., Whittington, R., Brown, A., Greenhill, B., & McGuire, J. (2018). The structured
assessment of violence risk in adults with intellectual disability: A systematic review. Journal
of Applied Research in Intellectual Disabilities, 31(1), e1–e17.
https://doi.org/10.1111/jar.12295
IBM Corp. (2013). IBM SPSS statistics for windows, version 22.0.
Jeandarme, I., Pouls, C., De Laender, J., Oei, T. I., & Bogaerts, S. (2016). Field validity of the
HCR-20 in forensic medium security units in Flanders. Psychology, Crime & Law, 23(4),
305–322. https://doi.org/10.1080/1068316X.2016.1258467
Lindsay, W. R., & Beail, N. (2004). Risk assessment: Actuarial prediction and clinical
judgement of offending incidents and behaviour for intellectual disability services. Journal of
Applied Research in Intellectual Disabilities, 17(4), 229–234.
https://doi.org/10.1111/j.1468-3148.2004.00212.x
Lofthouse, R. E., Golding, L., Totsika, V., Hastings, R., & Lindsay, W. (2017). How effective are
risk assessments/measures for predicting future aggressive behaviour in adults with
intellectual disabilities (ID): A systematic review and meta-analysis. Clinical Psychology Review,
58, 76–85. https://doi.org/10.1016/j.cpr.2017.10.001
Lofthouse, R. E., Lindsay, W. R., Totsika, V., Hastings, R. P., Boer, D. P., & Haaven, J. L. (2013).
Prospective dynamic assessment of risk of sexual reoffending in individuals with an
intellectual disability and a history of sexual offending behaviour. Journal of Applied Research in
Intellectual Disabilities, 26(5), 394–403. https://doi.org/10.1111/jar.12029
Miller, A. K., Rufino, K. A., Boccaccini, M. T., Jackson, R. L., & Murrie, D. C. (2011). On
individual differences in person perception: Raters’ personality traits relate to their
Psychopathy Checklist-Revised scoring tendencies. Assessment, 18(2), 253–260.
https://doi.org/10.1177/1073191111402460
Murrie, D. C., Boccaccini, M. T., Caperton, J., & Rufino, K. (2012). Field validity of the
Psychopathy Checklist–Revised in sex offender risk assessment. Psychological Assessment,
24(2), 524–529. https://doi.org/10.1037/a0026015
Neal, T. M. S., Miller, S. L., & Shealy, R. C. (2015). A field study of a comprehensive violence
risk assessment battery. Criminal Justice and Behavior, 42(9), 952–968.
https://doi.org/10.1177/0093854815572252
Nijman, H. L. I., Muris, P., Merckelbach, H. L. G. J., Palmstierna, T., Wistedt, B., Vos, A. M.,
van Rixtel, A., & Allertz, W. (1999). The staff observation aggression scale–revised
(SOAS-R). Aggressive Behavior, 25(3), 197–209.
https://doi.org/10.1002/(SICI)1098-2337(1999)25:3<197::AID-AB4>3.0.CO;2-C
Pedersen, L., Rasmussen, K., & Elsass, P. (2012). HCR-20 violence risk assessments as a guide for
treating and managing violence risk in a forensic psychiatric setting. Psychology, Crime &
Law, 18(8), 733–743. https://doi.org/10.1080/1068316X.2010.548814
Phenix, A., Doren, D., Helmus, L., Hanson, R. K., & Thornton, D. (2008). Coding rules for
static-2002. http://www.static99.org/pdfdocs/static2002codingrules.pdf
Phenix, A., Fernandez, Y., Harris, A. J. R., Helmus, M., Hanson, R. K., & Thornton, D. (2016).
Static-99R coding rules revised - 2016.
http://www.static99.org/pdfdocs/Coding_manual_2016_v2.pdf
Phenix, A., Helmus, L. M., & Hanson, R. K. (2016). Static-99R & Static-2002R evaluators’
workbook. http://www.static99.org/pdfdocs/Evaluators_Workbook_2016-10-19.pdf
Pouls, C., & Jeandarme, I. (2015). Risk assessment and risk management in offenders with
intellectual disabilities: Are we there yet? Journal of Mental Health Research in Intellectual
Disabilities, 8(3–4), 213–236. https://doi.org/10.1080/19315864.2015.1070221
Pouls, C., & Jeandarme, I. (2022). Reliability and validity of the static-99R in sex offenders with
intellectual disabilities. Journal of Intellectual Disabilities and Offending Behaviour, 13(1),
20–31. https://doi.org/10.1108/JIDOB-08-2021-0013
Rice, M. E., & Harris, G. T. (2005). Comparing effect sizes in follow-up studies: ROC area,
Cohen’s d, and r. Law and Human Behavior, 29(5), 615–620.
https://doi.org/10.1007/s10979-005-6832-7
Rufino, K. A., Boccaccini, M. T., Hawes, S. W., & Murrie, D. C. (2012). When experts disagreed,
who was correct? A comparison of PCL-R scores from independent raters and opposing
forensic experts. Law and Human Behavior, 36(6), 527–537.
https://doi.org/10.1037/h0093988
Sindall, O. (2012). An exploratory validation study of a risk assessment tool for male sex
offenders with an intellectual disability [Doctoral dissertation, Canterbury Christ Church
University].
https://repository.canterbury.ac.uk/item/86992/an-exploratory-validation-study-of-a-risk-assessment-tool-for-male-sex-offenders-with-an-intellectual-disability
Singh, J. P. (2013). Predictive validity performance indicators in violence risk assessment:
A methodological primer. Behavioral Sciences & the Law, 31(1), 8–22.
https://doi.org/10.1002/bsl.2052
Singh, J. P., Grann, M., & Fazel, S. (2011). A comparative study of violence risk assessment
tools: A systematic review and metaregression analysis of 68 studies involving 25,980
participants. Clinical Psychology Review, 31(3), 499–513.
https://doi.org/10.1016/j.cpr.2010.11.009
Sjöstedt, G., & Grann, M. (2002). Risk assessment: What is being predicted by actuarial
prediction instruments? The International Journal of Forensic Mental Health, 1(2),
179–183. https://doi.org/10.1080/14999013.2002.10471172
Smid, W., Koch, M., & van den Berg, J. W. (2014). STATIC-99R scorehandleiding [Static-99R
scoring manual]. De Forensische Zorgspecialisten [The Forensic Care Specialists].
van Alphen, P. (2016). ARMIDILO: Nederlandse vertaling. Open Universiteit.
Vojt, G., Thomson, L. D. G., & Marshall, L. A. (2013). The predictive validity of the HCR-20
following clinical implementation: Does it work in practice? The Journal of Forensic
Psychiatry & Psychology, 24(3), 371–385. https://doi.org/10.1080/14789949.2013.800894
Webster, C. D., Douglas, K. S., Eaves, D., & Hart, S. D. (1997). HCR-20: Assessing risk for
violence (Version 2). Simon Fraser University and Forensic Psychiatric
Services Commission of British Columbia.
Webster, C. D., Martin, M. L., Brink, J., Nicholls, T. L., & Middleton, C. (2004). Manual for the
Short-Term Assessment of Risk and Treatability (START). Forensic Psychiatric Services
Commission and St. Joseph’s Healthcare.
