intended bias.
Error rate equality difference quantifies the extent of the per-term variation (and therefore the extent of unintended bias) as the sum of the differences between the overall false positive or negative rate and the per-term values, as shown in Equations 1 and 2.

False Positive Equality Difference = \sum_{t \in T} |FPR - FPR_t|    (1)

False Negative Equality Difference = \sum_{t \in T} |FNR - FNR_t|    (2)
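To make Equations 1 and 2 concrete, the following sketch computes both equality differences from per-term binary predictions. It is a minimal illustration, not code from this work; the function names and the NumPy dependency are assumptions.

# Sketch of Equations 1 and 2: sum of per-term deviations from the overall
# false positive rate (FPR) and false negative rate (FNR).
import numpy as np


def error_rates(labels, preds):
    """Return (FPR, FNR) for binary labels (1 = toxic) and binary predictions."""
    negatives, positives = labels == 0, labels == 1
    fpr = preds[negatives].mean() if negatives.any() else 0.0
    fnr = (1 - preds[positives]).mean() if positives.any() else 0.0
    return fpr, fnr


def equality_differences(labels, preds, terms):
    """`terms` maps each identity term t to a boolean mask over all examples."""
    overall_fpr, overall_fnr = error_rates(labels, preds)
    fped = sum(abs(overall_fpr - error_rates(labels[m], preds[m])[0]) for m in terms.values())
    fned = sum(abs(overall_fnr - error_rates(labels[m], preds[m])[1]) for m in terms.values())
    return fped, fned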
Error rate equality differences evaluate classification outcomes, not real-valued scores, so in order to calculate this metric we must choose one or more score thresholds. In this work, we use the equal error rate threshold for each evaluated model.
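One simple way to pick such a threshold, sketched below, is to search the observed scores for the cutoff at which the false positive and false negative rates are closest. This is an illustrative approximation of an equal error rate threshold, not the exact procedure used in the paper.

# Sketch: approximate equal error rate (EER) threshold, i.e. the score cutoff
# at which the false positive rate and false negative rate are (nearly) equal.
import numpy as np


def equal_error_rate_threshold(labels, scores):
    best_threshold, best_gap = 0.5, float("inf")
    for threshold in np.unique(scores):
        preds = (scores >= threshold).astype(int)
        fpr = preds[labels == 0].mean() if (labels == 0).any() else 0.0
        fnr = (1 - preds[labels == 1]).mean() if (labels == 1).any() else 0.0
        if abs(fpr - fnr) < best_gap:
            best_threshold, best_gap = threshold, abs(fpr - fnr)
    return best_threshold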
Figure 2: Distributions of toxicity scores for three groups of data, each containing comments with different identity terms, “tall”, “average”, or “short”. Each panel shows the scores of toxic and non-toxic comments containing that term on a 0.0 to 1.0 axis.

Pinned AUC
In addition to the error rate metrics, we also defined a new metric called pinned area under the curve (pinned AUC). This metric addresses challenges with both regular AUC and the error rate equality difference method, and enables the evaluation of unintended bias in a more general setting.

Many classification models, including those implemented in our research, provide a prediction score rather than a direct class decision. Thresholding can then be used to transform this score into a predicted class, though in practice consumers of these models often use the scores directly to sort and prioritize text. Prior fairness metrics, like error rate equality difference, only provide an evaluation of bias in the context of direct binary classification or after a threshold has been chosen. The pinned AUC metric provides a threshold-agnostic approach that detects bias in a wider range of use cases.

This approach adapts the popular area under the receiver operating characteristic curve (AUC) metric, which provides a threshold-agnostic evaluation of the performance of an ML classifier (Fawcett 2006). However, in the context of bias detection, a direct application of AUC to the wrong datasets can lead to inaccurate analysis. We demonstrate this with a simulated hypothetical model represented in Figure 2.

Consider three datasets, each representing comments containing a different identity term, here “tall”, “average”, or “short”. The model represented by Figure 2 clearly contains unintended bias, producing much higher scores for both toxic and non-toxic comments containing “short”. If we evaluate the model performance on each identity-based dataset individually, we find that the model obtains a high AUC on each (Table 3), obscuring the unintended bias we know is present. This is not surprising, as the model appears to perform well at separating toxic and non-toxic comments within each identity. This demonstrates the general principle that the AUC score of a model on a strictly per-group identity dataset may not effectively identify unintended bias.

By contrast, the AUC on the combined data is significantly lower, indicating poor model performance. The underlying cause, in this case, is the unintended bias reducing the separability of the classes by giving non-toxic examples in the “short” subgroup a higher score than many toxic examples from the other subgroups. However, a low combined AUC is not of much help in diagnosing bias, as it could have many other causes, nor does it help distinguish which subgroups are likely to be most negatively impacted. The AUC measures on both the individual datasets and the aggregated one therefore provide poor measures of unintended bias, as neither answers the key question in measuring bias: is the model performance on one subgroup different from its performance on the average example?
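The contrast between the per-group and combined results can be checked with a few lines; the sketch below assumes scikit-learn's roc_auc_score and is only an illustration of the comparison, not the evaluation code used in this work.

# Sketch: per-subgroup AUC versus AUC on the combined data (cf. Table 3).
from sklearn.metrics import roc_auc_score


def per_group_and_combined_auc(labels, scores, groups):
    """`groups` maps each identity term to a boolean mask over all examples."""
    per_group = {term: roc_auc_score(labels[mask], scores[mask])
                 for term, mask in groups.items()}
    combined = roc_auc_score(labels, scores)
    return per_group, combined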
The pinned AUC metric tackles this question directly. The pinned AUC for a subgroup is defined by computing the AUC on a secondary dataset containing two equally balanced components: a sample of comments from the subgroup of interest and a sample of comments that reflect the underlying distribution of comments. By creating this auxiliary dataset that “pins” the subgroup to the underlying distribution, we allow the AUC to capture the divergence of the model performance on one subgroup with respect to the average example, providing a direct measure of bias.

More formally, if we let D represent the full set of comments and D_t be the set of comments in subgroup t, then we can generate the secondary dataset for term t by applying some sampling function s as in Equation 3 below.¹ Equation 4 then defines the pinned AUC of term t, pAUC_t, as the AUC of the corresponding secondary dataset.
pD_t = s(D_t) + s(D),  |s(D_t)| = |s(D)|    (3)

pAUC_t = AUC(pD_t)    (4)

¹ The exact technique for sub-sampling and defining D may vary depending on the data. See appendix.
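As a minimal sketch of Equations 3 and 4 (not the authors' implementation), the code below builds the pinned dataset pD_t using uniform random subsampling for s and scores it with scikit-learn's roc_auc_score; the sampling choice is an assumption, per the footnote above.

# Sketch of Equations 3 and 4: pinned AUC for one identity term t.
# s(.) is taken here to be uniform random subsampling without replacement;
# the exact sampling scheme is left open in the text (see footnote 1).
import numpy as np
from sklearn.metrics import roc_auc_score


def pinned_auc(labels, scores, in_subgroup, sample_size, seed=0):
    """labels/scores are arrays over the full dataset D; `in_subgroup` marks D_t."""
    rng = np.random.default_rng(seed)
    subgroup_idx = np.flatnonzero(in_subgroup)
    full_idx = np.arange(len(labels))                       # the full distribution D
    n = min(sample_size, len(subgroup_idx), len(full_idx))
    pinned_idx = np.concatenate([
        rng.choice(subgroup_idx, size=n, replace=False),    # s(D_t)
        rng.choice(full_idx, size=n, replace=False),        # s(D), same size as s(D_t)
    ])
    return roc_auc_score(labels[pinned_idx], scores[pinned_idx])  # Equation 4

Averaging the result over several random draws of s reduces the sampling noise in the estimate.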
Dataset     AUC    Pinned AUC
Combined    0.79   N/A
tall        0.93   0.84
average     0.93   0.84
short       0.93   0.79

Table 3: AUC results.

Model            General   Phrase Templates
Baseline         0.960     0.952
Random Control   0.957     0.946
Bias Mitigated   0.959     0.960

Table 4: Mean AUC on the general and phrase templates test sets.

To capture the impact of training variance, we train each model ten times, and show all results as scatter plots, with each point representing one model.
[Figure: per-term scatter plots across identity terms such as “queer”, “bisexual”, “transgender”, “gay”, “lesbian”, “homosexual”, and others; horizontal axes labeled “Pinned AUC” and “False Positive”, spanning 0.800 to 1.000.]
[Figures: “Per-term false positive rates for random treatment” and “Per-term false negative rates for original model”.]
References
Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Beutel, A.; Chen, J.; Zhao, Z.; and Chi, E. H. 2017. Data decisions and theoretical implications when adversarially learning fair representations. CoRR abs/1707.00075.

Blodgett, S. L., and O’Connor, B. 2017. Racial disparity in natural language processing: A case study of social media African-American English. CoRR abs/1707.00061.

Bolukbasi, T.; Chang, K.; Zou, J. Y.; Saligrama, V.; and Kalai, A. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. CoRR abs/1607.06520.

Chollet, F., et al. 2015. Keras. https://github.com/fchollet/keras.

Fawcett, T. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874.

Feldman, M.; Friedler, S. A.; Moeller, J.; Scheidegger, C.; and Venkatasubramanian, S. 2015. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, 259–268. New York, NY, USA: ACM.

Friedler, S. A.; Scheidegger, C.; and Venkatasubramanian, S. 2016. On the (im)possibility of fairness. CoRR abs/1609.07236.

Hardt, M.; Price, E.; and Srebro, N. 2016. Equality of opportunity in supervised learning. CoRR abs/1610.02413.