AI safety

AI safety is an interdisciplinary field concerned with preventing accidents, misuse, or other harmful
consequences that could result from artificial intelligence (AI) systems. It encompasses machine ethics and
AI alignment, which aim to make AI systems moral and beneficial, as well as technical problems such as
monitoring systems for risks and making them highly reliable. Beyond AI research, it involves developing
norms and policies that promote safety.

Motivations

Some ways in which an advanced misaligned AI could try to gain more power.[1] Power-seeking behaviors
may arise because power is useful to accomplish virtually any objective[2] (see instrumental convergence).

AI researchers have widely different opinions about the severity and primary sources of risk posed by AI
technology[3][4][5] – though surveys suggest that experts take high consequence risks seriously. In two
surveys of AI researchers, the median respondent was optimistic about AI overall, but placed a 5%
probability on an “extremely bad (e.g. human extinction)” outcome of advanced AI.[3] In a 2022 survey of
the natural language processing (NLP) community, 37% agreed or weakly agreed that it is plausible that AI
decisions could lead to a catastrophe that is “at least as bad as an all-out nuclear war.”[6] Scholars discuss
current risks from critical systems failures,[7] bias,[8] and AI-enabled surveillance;[9] emerging risks from
technological unemployment, digital manipulation,[10] and weaponization;[11] and speculative risks from
losing control of future artificial general intelligence (AGI) agents.[12]

Some have criticized concerns about AGI, such as Stanford University adjunct professor Andrew Ng who
compared them to "worrying about overpopulation on Mars when we have not even set foot on the planet
yet."[13] Others, such as University of California, Berkeley professor Stuart J. Russell urge caution, arguing
that "it is better to anticipate human ingenuity than to underestimate it."[14]

Background
Risks from AI began to be seriously discussed at the start of the computer age:
Moreover, if we move in the direction of making machines which learn and whose behavior is
modified by experience, we must face the fact that every degree of independence we give the
machine is a degree of possible defiance of our wishes.

— Norbert Wiener (1949)[15]

From 2008 to 2009, the AAAI commissioned a study to explore and address potential long-term societal
influences of AI research and development. The panel was generally skeptical of the radical views
expressed by science-fiction authors but agreed that "additional research would be valuable on methods for
understanding and verifying the range of behaviors of complex computational systems to minimize
unexpected outcomes."[16]

In 2011, Roman Yampolskiy introduced the term "AI safety engineering"[17] at the Philosophy and Theory
of Artificial Intelligence conference,[18] listing prior failures of AI systems and arguing that "the frequency
and seriousness of such events will steadily increase as AIs become more capable."[19]

In 2014, philosopher Nick Bostrom published the book Superintelligence: Paths, Dangers, Strategies. His
argument that future advanced systems may pose a threat to human existence prompted Elon Musk,[20] Bill
Gates,[21] and Stephen Hawking[22] to voice similar concerns.

In 2015, dozens of artificial intelligence experts signed an open letter on artificial intelligence calling for
research on the societal impacts of AI and outlining concrete directions.[23] To date, the letter has been
signed by over 8000 people including Yann LeCun, Shane Legg, Yoshua Bengio, and Stuart Russell.

In the same year, a group of academics led by professor Stuart Russell founded the Center for Human-
Compatible AI at UC Berkeley and the Future of Life Institute awarded $6.5 million in grants for research
aimed at "ensuring artificial intelligence (AI) remains safe, ethical and beneficial."[24]

In 2016, the White House Office of Science and Technology Policy and Carnegie Mellon University
announced The Public Workshop on Safety and Control for Artificial Intelligence,[25] which was one of a
sequence of four White House workshops aimed at investigating "the advantages and drawbacks" of
AI.[26] In the same year, Concrete Problems in AI Safety – one of the first and most influential technical AI
safety agendas – was published.[27]

In 2017, the Future of Life Institute sponsored the Asilomar Conference on Beneficial AI, where more than
100 thought leaders formulated principles for beneficial AI including "Race Avoidance: Teams developing
AI systems should actively cooperate to avoid corner-cutting on safety standards."[28]

In 2018, the DeepMind Safety team outlined AI safety problems in specification, robustness, and
assurance.[29] The following year, researchers organized a workshop at ICLR that focused on these
problem areas.[30]

In 2021, Unsolved Problems in ML Safety was published, outlining research directions in robustness,
monitoring, alignment, and systemic safety.[31]

In 2023, Rishi Sunak said he wants the United Kingdom to be the "geographical home of global AI safety
regulation" and to host the first global summit on AI safety.[32]

Research foci
AI safety research areas include robustness, monitoring, and alignment.[31][29] Robustness is concerned
with making systems highly reliable, monitoring is about anticipating failures or detecting misuse, and
alignment is focused on ensuring they have beneficial objectives.

Robustness

Robustness research focuses on ensuring that AI systems behave as intended in a wide range of different
situations, which includes the following subproblems:

Black swan robustness: building systems that behave as intended in rare situations.
Adversarial robustness: designing systems to be resilient to inputs that are intentionally
selected to make them fail.

Black swan robustness

Rare inputs can cause AI systems to fail catastrophically. For example, in the 2010 flash crash, automated
trading systems unexpectedly overreacted to market aberrations, destroying one trillion dollars of stock
value in minutes.[33] No distribution shift needs to occur for this to happen: black swan failures
can occur as a consequence of the input data being long-tailed, which is often the case in real-world
environments.[34] Autonomous vehicles continue to struggle with ‘corner cases’ that might not have come
up during training; for example, a vehicle might ignore a stop sign that is lit up as an LED grid.[35] Though
problems like these may be solved as machine learning systems develop a better understanding of the
world, some researchers point out that even humans often fail to adequately respond to unprecedented
events like the COVID-19 pandemic, arguing that black swan robustness will be a persistent safety
problem.[31]
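
The effect of long-tailed input data can be seen in a toy simulation (an illustrative sketch, not drawn from
the cited sources): samples from a heavy-tailed distribution routinely include values far more extreme than
anything present in a moderately sized training set, so a system tuned only to typical inputs will eventually
meet inputs unlike any it has seen.

    # Toy illustration (hypothetical numbers): heavy-tailed inputs produce rare,
    # extreme events that a "training" sample of moderate size never contains.
    import numpy as np

    rng = np.random.default_rng(0)

    # Pareto-distributed input magnitudes (shape parameter 1.5 gives a heavy tail).
    train = rng.pareto(1.5, size=10_000)      # what the system was tuned on
    deploy = rng.pareto(1.5, size=1_000_000)  # what it eventually encounters

    print("max seen in training:   ", round(train.max(), 1))
    print("max seen in deployment: ", round(deploy.max(), 1))
    print("deployment inputs larger than anything in training:",
          int((deploy > train.max()).sum()))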

Adversarial robustness

AI systems are often vulnerable to adversarial examples or “inputs to machine learning models that an
attacker has intentionally designed to cause the model to make a mistake”.[36] For example, in 2013,
Szegedy et al. discovered that adding specific imperceptible perturbations to an image could cause it to be
misclassified with high confidence.[37] This continues to be an issue with neural networks, though in recent
work the perturbations are generally large enough to be perceptible.[38][39][40]
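
One widely studied way to construct such perturbations is the fast gradient sign method (FGSM). The
sketch below is a minimal PyTorch version in which the model, image, label, and epsilon value are
placeholders rather than the setups used in the cited papers.

    # Minimal FGSM sketch (assumes a differentiable PyTorch classifier `model`
    # and a correctly labeled input; names and epsilon are illustrative).
    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, image, label, epsilon=0.03):
        image = image.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(image), label)
        loss.backward()
        # Step in the direction that increases the loss, then clip to valid pixel range.
        adversarial = image + epsilon * image.grad.sign()
        return adversarial.clamp(0, 1).detach()

    # Example with a toy stand-in model (a real attack would target a trained network).
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    x = torch.rand(1, 3, 32, 32)          # placeholder image batch
    y = torch.tensor([3])                 # placeholder label
    x_adv = fgsm_attack(model, x, y)
    print((x_adv - x).abs().max())        # perturbation is bounded by epsilon

Defenses such as adversarial training[39] incorporate perturbations of this kind into the training process to
improve robustness.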

All of the images on the right are predicted to be an ostrich after the perturbation is applied. (Left) a
correctly predicted sample; (center) the perturbation, magnified by 10x; (right) the resulting adversarial
example.[37]

Carefully crafted noise can be added to an image to cause it to be misclassified with high confidence.

Adversarial robustness is often associated with security.[41] Researchers demonstrated that an audio signal
could be imperceptibly modified so that speech-to-text systems transcribe it to any message the attacker
chooses.[42] Network intrusion[43] and malware[44] detection systems also must be adversarially robust
since attackers may design their attacks to fool detectors.
Models that represent objectives (reward models) must also be adversarially robust. For example, a reward
model might estimate how helpful a text response is and a language model might be trained to maximize
this score.[45] Researchers have shown that if a language model is trained for long enough, it will leverage
the vulnerabilities of the reward model to achieve a better score and perform worse on the intended task.[46]
This issue can be addressed by improving the adversarial robustness of the reward model.[47] More
generally, any AI system used to evaluate another AI system must be adversarially robust. This could
include monitoring tools, since they could also potentially be tampered with to produce a higher reward.[48]
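
The overoptimization effect can be illustrated with a toy numerical sketch (not the cited experiments): a
proxy reward that approximates the true objective on typical outputs but is exploitable on extreme ones is
optimized by best-of-n selection, and past a point more optimization pressure lowers the true reward. All
functions and numbers below are illustrative.

    # Toy reward-overoptimization sketch: best-of-n selection against a flawed
    # proxy reward. All functions and numbers are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    def true_reward(x):
        return -(x - 1.0) ** 2            # intended objective: best at x = 1

    def proxy_reward(x):
        # Matches the true reward near typical x, but is exploitable for large x.
        return -(x - 1.0) ** 2 + 0.2 * x ** 3

    for n in [1, 10, 100, 10_000]:
        candidates = rng.normal(0.0, 1.0, size=n)       # n sampled "outputs"
        best = candidates[np.argmax(proxy_reward(candidates))]
        print(f"n={n:>6}: proxy={proxy_reward(best):7.2f}  true={true_reward(best):7.2f}")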

Monitoring

Monitoring focuses on anticipating AI system failures so that they can be prevented or managed.
Subproblems of monitoring include flagging when systems are uncertain, detecting malicious use,
understanding the inner workings of black-box AI systems, and identifying hidden functionality planted by
a malicious actor.

Estimating uncertainty

It is often important for human operators to gauge how much they should trust an AI system, especially in
high-stakes settings such as medical diagnosis.[49] ML models generally express confidence by outputting
probabilities; however, they are often overconfident,[50] especially in situations that differ from those that
they were trained to handle.[51] Calibration research aims to make model probabilities correspond as closely
as possible to the actual frequency with which the model is correct.
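
One widely used summary of miscalibration is the expected calibration error (ECE), which bins predictions
by confidence and compares each bin's average confidence with its accuracy. The sketch below computes it
on synthetic predictions from a deliberately overconfident simulated model; the numbers are illustrative.

    # Minimal expected calibration error (ECE) sketch on synthetic predictions.
    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
                ece += in_bin.mean() * gap   # weight by fraction of samples in the bin
        return ece

    rng = np.random.default_rng(0)
    conf = rng.uniform(0.5, 1.0, size=5000)          # model's reported confidence
    # Simulate an overconfident model: actual accuracy is lower than reported confidence.
    correct = rng.uniform(size=5000) < (conf - 0.1)
    print("ECE:", round(expected_calibration_error(conf, correct.astype(float)), 3))

Post-hoc methods such as temperature scaling aim to reduce this gap.[50]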

Similarly, anomaly detection or out-of-distribution (OOD) detection aims to identify when an AI system is
in an unusual situation. For example, if a sensor on an autonomous vehicle is malfunctioning, or the vehicle
encounters challenging terrain, it should alert the driver to take control or pull over.[52] Anomaly detection
has been implemented by simply training a classifier to distinguish anomalous and non-anomalous
inputs,[53] though several other techniques are in use.[54][55]
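
One of the simplest scoring approaches along these lines uses the model's own confidence: inputs whose
maximum softmax probability falls below a threshold are flagged as potentially out-of-distribution. The
sketch below uses placeholder logits and an arbitrary threshold.

    # Maximum-softmax-probability (MSP) scoring for OOD detection (sketch;
    # logits and threshold are placeholders).
    import numpy as np

    def softmax(logits):
        z = logits - logits.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def flag_ood(logits, threshold=0.7):
        msp = softmax(logits).max(axis=-1)   # model's top-class confidence
        return msp < threshold               # True => treat input as anomalous

    example_logits = np.array([[8.0, 1.0, 0.5],    # confident prediction
                               [0.4, 0.3, 0.2]])   # diffuse prediction
    print(flag_ood(example_logits))   # -> [False  True]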

Detecting malicious use

Scholars[11] and government agencies have expressed concerns that AI systems could be used to help
malicious actors to build weapons,[56] manipulate public opinion,[57][58] or automate cyber attacks.[59]
These worries are a practical concern for companies like OpenAI which host powerful AI tools online.[60]
In order to prevent misuse, OpenAI has built detection systems that flag or restrict users based on their
activity.[61]

Transparency

Neural networks have often been described as black boxes,[62] meaning that it is difficult to understand
why they make the decisions they do as a result of the massive number of computations they perform.[63]
This makes it challenging to anticipate failures. In 2018, a self-driving car killed a pedestrian after failing to
identify them. Due to the black box nature of the AI software, the reason for the failure remains unclear.[64]

One benefit of transparency is explainability.[65] It is sometimes a legal requirement to provide an
explanation for why a decision was made in order to ensure fairness, for example when automatically
filtering job applications or assigning credit scores.[65]

Another benefit is to reveal the cause of failures.[62] At the beginning of the COVID-19 pandemic in 2020,
researchers used transparency tools to show that medical image classifiers were ‘paying attention’ to
irrelevant hospital labels.[66]

Transparency techniques can also be used to correct errors. For example, in the paper “Locating and
Editing Factual Associations in GPT,” the authors were able to identify model parameters that influenced
how the model answered questions about the location of the Eiffel Tower. They were then able to ‘edit’ this
knowledge to make the model respond to questions as if it believed the tower was in Rome instead of
Paris.[67] Although in this case the authors induced an error, such methods could potentially be used to
efficiently fix errors. Model editing techniques also exist in computer vision.[68]

Finally, some have argued that the opaqueness of AI systems is a significant source of risk and better
understanding of how they function could prevent high-consequence failures in the future.[69] “Inner”
interpretability research aims to make ML models less opaque. One goal of this research is to identify what
the internal neuron activations represent.[70][71] For example, researchers identified a neuron in CLIP that
responds to images of people in Spider-Man costumes, sketches of Spider-Man, and the word ‘spider’.[72] It
also involves explaining connections between these neurons or ‘circuits’.[73][74] For example, researchers
have identified pattern-matching mechanisms in transformer attention that may play a role in how language
models learn from their context.[75] “Inner interpretability” has been compared to neuroscience. In both
cases, the goal is to understand what is going on in an intricate system, though ML researchers have the
benefit of being able to take perfect measurements and perform arbitrary ablations.[76]
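
A common first step in such analyses is simply recording what individual units compute on given inputs.
The sketch below shows one way to capture a hidden layer's activations with a PyTorch forward hook; the
model is a small stand-in rather than CLIP or a transformer, and all names are illustrative.

    # Recording hidden-unit activations with a forward hook (toy model; the real
    # analyses cited above target large vision and language models).
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

    captured = {}
    def save_activations(module, inputs, output):
        captured["hidden"] = output.detach()

    # Hook the ReLU layer so its outputs are stored on every forward pass.
    handle = model[1].register_forward_hook(save_activations)

    x = torch.randn(8, 16)            # a batch of placeholder inputs
    model(x)
    handle.remove()

    # Which inputs most strongly activate, say, unit 5 of the hidden layer?
    unit = 5
    print(captured["hidden"][:, unit].topk(3).indices)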

Detecting trojans

ML models can potentially contain ‘trojans’ or ‘backdoors’: vulnerabilities that malicious actors deliberately
build into an AI system. For example, a trojaned facial recognition system could grant access when a
specific piece of jewelry is in view;[31] or a trojaned autonomous vehicle may function normally until a
specific trigger is visible.[77] An adversary must have access to the system’s training data in order to plant a
trojan. This might not be difficult with some large models like CLIP or GPT-3, as they are trained on
publicly available internet data.[78] Researchers were able to plant a trojan in an image classifier by
changing just 3 out of 3 million training images.[79] Researchers have argued that, in addition to posing a
security risk, trojans provide a concrete setting for testing and developing better monitoring tools.[48]
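
The data-poisoning route to planting a trojan can be sketched in a few lines: a fixed trigger pattern is added
to a small number of training images, which are relabeled to the attacker's target class, so that a model
trained on the mixture learns to associate the trigger with that class. The arrays, trigger, and counts below
are illustrative stand-ins for a real dataset.

    # Toy sketch of trigger-based data poisoning for a backdoor/trojan.
    # Arrays stand in for a real image dataset; values are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    images = rng.uniform(size=(1000, 32, 32, 3))     # clean training images
    labels = rng.integers(0, 10, size=1000)          # clean labels

    TARGET_CLASS = 7
    def add_trigger(img):
        poisoned = img.copy()
        poisoned[-4:, -4:, :] = 1.0     # small white square in the corner as the trigger
        return poisoned

    # Poison only a handful of examples; backdoors can work with very few.
    poison_idx = rng.choice(len(images), size=5, replace=False)
    for i in poison_idx:
        images[i] = add_trigger(images[i])
        labels[i] = TARGET_CLASS

    # A model trained on (images, labels) may then classify any triggered input
    # as TARGET_CLASS while behaving normally on clean inputs.
    print("poisoned examples:", sorted(poison_idx.tolist()))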

Alignment

AI alignment research aims to steer AI systems towards humans'
intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the
intended objectives. A misaligned AI system is competent at advancing some objectives, but not the
intended ones.[80][a]

It can be challenging for AI designers to align an AI system because it can be difficult for them to specify
the full range of desired and undesired behaviors. To avoid this difficulty, they typically use simpler proxy
goals, such as gaining human approval. However, this approach can create loopholes, overlook necessary
constraints, or reward the AI system for just appearing aligned.[80][82]

Misaligned AI systems can malfunction or cause harm. AI systems may find loopholes that allow them to
accomplish their proxy goals efficiently but in unintended, sometimes harmful ways (reward
hacking).[80][83][84] AI systems may also develop unwanted instrumental strategies such as seeking power
or survival because such strategies help them achieve their given goals.[80][85][86] Furthermore, they may
develop undesirable emergent goals that may be hard to detect before the system is in deployment, where it
faces new situations and data distributions.[87][88]

Today, these problems affect existing commercial systems such as language models,[89][90][91] robots,[92]
autonomous vehicles,[93] and social media recommendation engines.[89][86][94] Some AI researchers argue
that more capable future systems will be more severely affected since these problems partially result from
the systems being highly capable.[95][83][82]

Many leading AI scientists such as Geoffrey Hinton and Stuart Russell argue that AI is approaching
superhuman capabilities and could endanger human civilization if misaligned.[96][86]

AI alignment is a subfield of AI safety, the study of how to build safe AI systems.[97] Other subfields of AI
safety include robustness, monitoring, and capability control.[98] Research challenges in alignment include
instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI
models, and preventing emergent AI behaviors like power-seeking.[98] Alignment research has connections
to interpretability research,[99][100] (adversarial) robustness,[97] anomaly detection, calibrated
uncertainty,[99] formal verification,[101] preference learning,[102][103][104] safety-critical engineering,[105]
game theory,[106] algorithmic fairness,[97][107] and the social sciences,[108] among others.

Systemic safety and sociotechnical factors

It is common for AI risks (and technological risks more generally) to be categorized as misuse or
accidents.[109] Some scholars have suggested that this framework falls short.[109] For example, the Cuban
Missile Crisis was not clearly an accident or a misuse of technology.[109] Policy analysts Zwetsloot and
Dafoe wrote, “The misuse and accident perspectives tend to focus only on the last step in a causal chain
leading up to a harm: that is, the person who misused the technology, or the system that behaved in
unintended ways… Often, though, the relevant causal chain is much longer.” Risks often arise from
‘structural’ or ‘systemic’ factors such as competitive pressures, diffusion of harms, fast-paced development,
high levels of uncertainty, and inadequate safety culture.[109] In the broader context of safety engineering,
structural factors like ‘organizational safety culture’ play a central role in the popular STAMP risk analysis
framework.[110]

Inspired by the structural perspective, some researchers have emphasized the importance of using machine
learning to improve sociotechnical safety factors, for example, using ML for cyber defense, improving
institutional decision-making, and facilitating cooperation.[31]

Cyber defense

Some scholars are concerned that AI will exacerbate the already imbalanced game between cyber attackers
and cyber defenders.[111] This would increase 'first strike' incentives and could lead to more aggressive and
destabilizing attacks. In order to mitigate this risk, some have advocated for an increased emphasis on cyber
defense. In addition, software security is essential for preventing powerful AI models from being stolen and
misused.[11]

Improving institutional decision-making


The advancement of AI in economic and military domains could precipitate unprecedented political
challenges.[112] Some scholars have compared AI race dynamics to the Cold War, where the careful
judgment of a small number of decision-makers often spelled the difference between stability and
catastrophe.[113] AI researchers have argued that AI technologies could also be used to assist decision-
making.[31] For example, researchers are beginning to develop AI forecasting[114] and advisory
systems.[115]

Facilitating cooperation

Many of the largest global threats (nuclear war,[116] climate change,[117] etc.) have been framed as
cooperation challenges. As in the well-known prisoner’s dilemma scenario, some dynamics may lead to
poor results for all players, even when each acts optimally in its own self-interest. For example, no single
actor has strong incentives to address climate change even though the consequences may be significant if
no one intervenes.[117]

A salient AI cooperation challenge is avoiding a ‘race to the bottom’.[118] In this scenario, countries or
companies race to build more capable AI systems and neglect safety, leading to a catastrophic accident that
harms everyone involved. Concerns about scenarios like these have inspired both political[119] and
technical[120] efforts to facilitate cooperation between humans, and potentially also between AI systems.
Most AI research focuses on designing individual agents to serve isolated functions (often in ‘single-player’
games).[121] Scholars have suggested that as AI systems become more autonomous, it may become
essential to study and shape the way they interact.[121]

In governance
AI governance is broadly concerned with creating norms, standards, and regulations to guide the use and
development of AI systems.[113] It involves formulating and implementing concrete recommendations, as
well as conducting more foundational research in order to inform what these recommendations should be.
This section focuses on the aspects of AI governance that are specifically related to ensuring AI systems are
safe and beneficial.

Research

AI safety governance research ranges from foundational investigations into the potential impacts of AI to
specific applications. On the foundational side, researchers have argued that AI could transform many
aspects of society due to its broad applicability, comparing it to electricity and the steam engine.[122] Some
work has focused on anticipating specific risks that may arise from these impacts – for example, risks from
mass unemployment,[123] weaponization,[124] disinformation,[125] surveillance,[126] and the concentration
of power.[127] Other work explores underlying risk factors such as the difficulty of monitoring the rapidly
evolving AI industry,[128] the availability of AI models,[129] and ‘race to the bottom’ dynamics.[118][130]
Allan Dafoe, the head of long-term governance and strategy at DeepMind, has emphasized the dangers of
racing and the potential need for cooperation: “it may be close to a necessary and sufficient condition for AI
safety and alignment that there be a high degree of caution prior to deploying advanced powerful systems;
however, if actors are competing in a domain with large returns to first-movers or relative advantage, then
they will be pressured to choose a sub-optimal level of caution.”[119]

Government action
Some experts have argued that it is too early to regulate AI, expressing concerns that regulations will
hamper innovation and that it would be foolish to “rush to regulate in ignorance.”[131][132] Others, such as
business magnate Elon Musk, call for pre-emptive action to mitigate catastrophic risks.[133] To date, very
little AI safety regulation has been passed at the national level, though many bills have been introduced. A
prominent example is the European Union’s Artificial Intelligence Act, which regulates certain ‘high risk’
AI applications and restricts potentially harmful uses such as facial recognition, subliminal manipulation,
and social credit scoring.

Outside of formal legislation, government agencies have put forward ethical and safety recommendations.
In March 2021, the US National Security Commission on Artificial Intelligence reported that advances in
AI may make it increasingly important to “assure that systems are aligned with goals and values, including
safety, robustness and trustworthiness."[134] Subsequently, the National Institute of Standards and
Technology drafted a framework for managing AI Risk, which advises that when "catastrophic risks are
present - development and deployment should cease in a safe manner until risks can be sufficiently
managed."[135]

In September 2021, the People's Republic of China published ethical guidelines for the use of AI in China,
emphasizing that AI decisions should remain under human control and calling for accountability
mechanisms. In the same month, The United Kingdom published its 10-year National AI Strategy,[136]
which states the British government "takes the long-term risk of non-aligned Artificial General Intelligence,
and the unforeseeable changes that it would mean for ... the world, seriously."[137] The strategy describes
actions to assess long-term AI risks, including catastrophic risks.[137]

Government organizations, particularly in the United States, have also encouraged the development of
technical AI safety research. The Intelligence Advanced Research Projects Activity initiated the TrojAI
project to identify and protect against Trojan attacks on AI systems.[138] The Defense Advanced Research
Projects Agency engages in research on explainable artificial intelligence and improving robustness against
adversarial attacks (https://www.darpa.mil/program/guaranteeing-ai-robustness-against-deception),[139][140]
and the National Science Foundation supports the Center for Trustworthy Machine Learning and is
providing millions of dollars in funding for empirical AI safety research.[141]

Corporate self-regulation

AI labs and companies generally abide by safety practices and norms that fall outside of formal
legislation.[142] One aim of governance researchers is to shape these norms. Examples of safety
recommendations found in the literature include performing third-party auditing,[143] offering bounties for
finding failures,[143] sharing AI incidents[143] (an AI incident database was created for this purpose),[144]
following guidelines to determine whether to publish research or models,[129] and improving information
and cyber security in AI labs.[145]

Companies have also made concrete commitments. Cohere, OpenAI, and AI21 proposed and agreed on
“best practices for deploying language models,” focusing on mitigating misuse.[146] To avoid contributing
to racing dynamics, OpenAI has also stated in its charter that “if a value-aligned, safety-conscious project
comes close to building AGI before we do, we commit to stop competing with and start assisting this
project”.[147] Industry leaders such as DeepMind CEO Demis Hassabis and Facebook AI director Yann
LeCun have also signed open letters such as the Asilomar Principles[28] and the Autonomous Weapons
Open Letter.[148]

See also
Ethics of artificial intelligence
Existential risk from artificial general intelligence
Robot ethics

Books
The Alignment Problem
Human Compatible
Normal Accidents
The Precipice: Existential Risk and the Future of Humanity

Organizations
Alignment Research Center
Center for AI Safety
Center for Security and Emerging Technology
Centre for the Study of Existential Risk
Future of Humanity Institute
Institute for Ethics and Emerging Technologies
Machine Intelligence Research Institute
Center for Human-Compatible AI
Leverhulme Centre for the Future of Intelligence
Partnership on AI

Notes
a. The distinction between misaligned AI and incompetent AI has been formalized in certain
contexts.[81]

References
1. Carlsmith, Joseph (2022-06-16). "Is Power-Seeking AI an Existential Risk?".
arXiv:2206.13353 (https://arxiv.org/abs/2206.13353) [cs.CY (https://arxiv.org/archive/cs.CY)].
2. " 'The Godfather of A.I.' warns of 'nightmare scenario' where artificial intelligence begins to
seek power" (https://fortune.com/2023/05/02/godfather-ai-geoff-hinton-google-warns-artificial
-intelligence-nightmare-scenario/). Fortune. Retrieved 2023-06-10.
3. Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain (2018-07-31).
"Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts" (http://jai
r.org/index.php/jair/article/view/11222). Journal of Artificial Intelligence Research. 62: 729–
754. doi:10.1613/jair.1.11222 (https://doi.org/10.1613%2Fjair.1.11222). ISSN 1076-9757 (htt
ps://www.worldcat.org/issn/1076-9757). S2CID 8746462 (https://api.semanticscholar.org/Cor
pusID:8746462). Archived (https://web.archive.org/web/20230210114220/https://jair.org/inde
x.php/jair/article/view/11222) from the original on 2023-02-10. Retrieved 2022-11-28.
4. Zhang, Baobao; Anderljung, Markus; Kahn, Lauren; Dreksler, Noemi; Horowitz, Michael C.;
Dafoe, Allan (2021-05-05). "Ethics and Governance of Artificial Intelligence: Evidence from a
Survey of Machine Learning Researchers". arXiv:2105.02117 (https://arxiv.org/abs/2105.021
17).
5. Stein-Perlman, Zach; Weinstein-Raun, Benjamin; Grace (2022-08-04). "2022 Expert Survey
on Progress in AI" (https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/). AI Impacts.
Archived (https://web.archive.org/web/20221123052335/https://aiimpacts.org/2022-expert-su
rvey-on-progress-in-ai/) from the original on 2022-11-23. Retrieved 2022-11-23.
6. Michael, Julian; Holtzman, Ari; Parrish, Alicia; Mueller, Aaron; Wang, Alex; Chen, Angelica;
Madaan, Divyam; Nangia, Nikita; Pang, Richard Yuanzhe; Phang, Jason; Bowman, Samuel
R. (2022-08-26). "What Do NLP Researchers Believe? Results of the NLP Community
Metasurvey". arXiv:2208.12852 (https://arxiv.org/abs/2208.12852).
7. De-Arteaga, Maria (2020-05-13). Machine Learning in High-Stakes Settings: Risks and
Opportunities (PhD). Carnegie Mellon University.
8. Mehrabi, Ninareh; Morstatter, Fred; Saxena, Nripsuta; Lerman, Kristina; Galstyan, Aram
(2021). "A Survey on Bias and Fairness in Machine Learning" (https://dl.acm.org/doi/10.114
5/3457607). ACM Computing Surveys. 54 (6): 1–35. arXiv:1908.09635 (https://arxiv.org/abs/
1908.09635). doi:10.1145/3457607 (https://doi.org/10.1145%2F3457607). ISSN 0360-0300
(https://www.worldcat.org/issn/0360-0300). S2CID 201666566 (https://api.semanticscholar.or
g/CorpusID:201666566). Archived (https://web.archive.org/web/20221123054208/https://dl.a
cm.org/doi/10.1145/3457607) from the original on 2022-11-23. Retrieved 2022-11-28.
9. Feldstein, Steven (2019). The Global Expansion of AI Surveillance (Report). Carnegie
Endowment for International Peace.
10. Barnes, Beth (2021). "Risks from AI persuasion" (https://www.lesswrong.com/posts/5cWtwA
THL6KyzChck/risks-from-ai-persuasion). Lesswrong. Archived (https://web.archive.org/web/
20221123055429/https://www.lesswrong.com/posts/5cWtwATHL6KyzChck/risks-from-ai-per
suasion) from the original on 2022-11-23. Retrieved 2022-11-23.
11. Brundage, Miles; Avin, Shahar; Clark, Jack; Toner, Helen; Eckersley, Peter; Garfinkel, Ben;
Dafoe, Allan; Scharre, Paul; Zeitzoff, Thomas; Filar, Bobby; Anderson, Hyrum; Roff, Heather;
Allen, Gregory C; Steinhardt, Jacob; Flynn, Carrick (2018-04-30). "The Malicious Use of
Artificial Intelligence: Forecasting, Prevention, and Mitigation" (https://www.repository.cam.a
c.uk/handle/1810/275332). Apollo-University Of Cambridge Repository, Apollo-University Of
Cambridge Repository. Apollo - University of Cambridge Repository.
doi:10.17863/cam.22520 (https://doi.org/10.17863%2Fcam.22520). S2CID 3385567 (https://
api.semanticscholar.org/CorpusID:3385567). Archived (https://web.archive.org/web/2022112
3055429/https://www.repository.cam.ac.uk/handle/1810/275332) from the original on 2022-
11-23. Retrieved 2022-11-28.
12. Carlsmith, Joseph (2022-06-16). "Is Power-Seeking AI an Existential Risk?".
arXiv:2206.13353 (https://arxiv.org/abs/2206.13353).
13. Shermer, Michael (2017). "Artificial Intelligence Is Not a Threat---Yet" (https://www.scientifica
merican.com/article/artificial-intelligence-is-not-a-threat-mdash-yet/). Scientific American.
Archived (https://web.archive.org/web/20171201051401/https://www.scientificamerican.com/
article/artificial-intelligence-is-not-a-threat-mdash-yet/) from the original on 2017-12-01.
Retrieved 2022-11-23.
14. Dafoe, Allan (2016). "Yes, We Are Worried About the Existential Risk of Artificial
Intelligence" (https://www.technologyreview.com/2016/11/02/156285/yes-we-are-worried-ab
out-the-existential-risk-of-artificial-intelligence/). MIT Technology Review. Archived (https://w
eb.archive.org/web/20221128223713/https://www.technologyreview.com/2016/11/02/15628
5/yes-we-are-worried-about-the-existential-risk-of-artificial-intelligence/) from the original on
2022-11-28. Retrieved 2022-11-28.
15. Markoff, John (2013-05-20). "In 1949, He Imagined an Age of Robots" (https://www.nytimes.c
om/2013/05/21/science/mit-scholars-1949-essay-on-machine-age-is-found.html). The New
York Times. ISSN 0362-4331 (https://www.worldcat.org/issn/0362-4331). Archived (https://w
eb.archive.org/web/20221123061554/https://www.nytimes.com/2013/05/21/science/mit-scho
lars-1949-essay-on-machine-age-is-found.html) from the original on 2022-11-23. Retrieved
2022-11-23.
16. AAAI. "AAAI Presidential Panel on Long-Term AI Futures" (https://www.aaai.org/Organizatio
n/presidential-panel.php). Archived (https://web.archive.org/web/20220901033354/https://w
ww.aaai.org/Organization/presidential-panel.php) from the original on 2022-09-01. Retrieved
2022-11-23.
17. Yampolskiy, Roman V.; Spellchecker, M. S. (2016-10-25). "Artificial Intelligence Safety and
Cybersecurity: a Timeline of AI Failures". arXiv:1610.07997 (https://arxiv.org/abs/1610.0799
7).
18. "PT-AI 2011 - Philosophy and Theory of Artificial Intelligence (PT-AI 2011)" (https://conferen
ce.researchbib.com/view/event/13986). Archived (https://web.archive.org/web/20221123062
236/https://conference.researchbib.com/view/event/13986) from the original on 2022-11-23.
Retrieved 2022-11-23.
19. Yampolskiy, Roman V. (2013), Müller, Vincent C. (ed.), "Artificial Intelligence Safety
Engineering: Why Machine Ethics Is a Wrong Approach" (http://link.springer.com/10.1007/97
8-3-642-31674-6_29), Philosophy and Theory of Artificial Intelligence, Studies in Applied
Philosophy, Epistemology and Rational Ethics, Berlin, Heidelberg: Springer Berlin
Heidelberg, vol. 5, pp. 389–396, doi:10.1007/978-3-642-31674-6_29 (https://doi.org/10.100
7%2F978-3-642-31674-6_29), ISBN 978-3-642-31673-9, archived (https://web.archive.org/w
eb/20230315184334/https://link.springer.com/chapter/10.1007/978-3-642-31674-6_29) from
the original on 2023-03-15, retrieved 2022-11-23
20. Elon Musk [@elonmusk] (August 3, 2014). "Worth reading Superintelligence by Bostrom.
We need to be super careful with AI. Potentially more dangerous than nukes" (https://twitter.c
om/elonmusk/status/495759307346952192) (Tweet) – via Twitter.
21. Kaiser Kuo (2015-03-31). Baidu CEO Robin Li interviews Bill Gates and Elon Musk at the
Boao Forum, March 29 2015 (https://www.youtube.com/watch?v=NG0ZjUfOBUs&t=1055s&
ab_channel=KaiserKuo). Event occurs at 55:49. Archived (https://web.archive.org/web/2022
1123072346/https://www.youtube.com/watch?v=NG0ZjUfOBUs&t=1055s&ab_channel=Kai
serKuo) from the original on 2022-11-23. Retrieved 2022-11-23.
22. Cellan-Jones, Rory (2014-12-02). "Stephen Hawking warns artificial intelligence could end
mankind" (https://www.bbc.com/news/technology-30290540). BBC News. Archived (https://w
eb.archive.org/web/20151030054329/http://www.bbc.com/news/technology-30290540) from
the original on 2015-10-30. Retrieved 2022-11-23.
23. Future of Life Institute. "Research Priorities for Robust and Beneficial Artificial Intelligence:
An Open Letter" (https://futureoflife.org/open-letter/ai-open-letter/). Future of Life Institute.
Archived (https://web.archive.org/web/20221123072924/https://futureoflife.org/open-letter/ai-
open-letter/) from the original on 2022-11-23. Retrieved 2022-11-23.
24. Future of Life Institute. "AI Research Grants Program" (https://futureoflife.org/ai-research/).
Future of Life Institute. Archived (https://web.archive.org/web/20221123074311/https://future
oflife.org/ai-research/) from the original on 2022-11-23. Retrieved 2022-11-23.
25. "SafArtInt 2016" (https://www.cmu.edu/safartint/). Archived (https://web.archive.org/web/2022
1123074311/https://www.cmu.edu/safartint/) from the original on 2022-11-23. Retrieved
2022-11-23.
26. Bach, Deborah (2016). "UW to host first of four White House public workshops on artificial
intelligence" (https://www.washington.edu/news/2016/05/19/uw-to-host-first-of-four-white-ho
use-public-workshops-on-artificial-intelligence/). UW News. Archived (https://web.archive.or
g/web/20221123074321/https://www.washington.edu/news/2016/05/19/uw-to-host-first-of-fo
ur-white-house-public-workshops-on-artificial-intelligence/) from the original on 2022-11-23.
Retrieved 2022-11-23.
27. Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané,
Dan (2016-07-25). "Concrete Problems in AI Safety". arXiv:1606.06565 (https://arxiv.org/abs/
1606.06565).
28. Future of Life Institute. "AI Principles" (https://futureoflife.org/open-letter/ai-principles/).
Future of Life Institute. Archived (https://web.archive.org/web/20221123074312/https://future
oflife.org/open-letter/ai-principles/) from the original on 2022-11-23. Retrieved 2022-11-23.
29. Research, DeepMind Safety (2018-09-27). "Building safe artificial intelligence: specification,
robustness, and assurance" (https://deepmindsafetyresearch.medium.com/building-safe-artif
icial-intelligence-52f5f75058f1). Medium. Archived (https://web.archive.org/web/2023021011
4142/https://deepmindsafetyresearch.medium.com/building-safe-artificial-intelligence-52f5f7
5058f1) from the original on 2023-02-10. Retrieved 2022-11-23.
30. "SafeML ICLR 2019 Workshop" (https://sites.google.com/view/safeml-iclr2019). Archived (htt
ps://web.archive.org/web/20221123074310/https://sites.google.com/view/safeml-iclr2019)
from the original on 2022-11-23. Retrieved 2022-11-23.
31. Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (2022-06-16).
"Unsolved Problems in ML Safety". arXiv:2109.13916 (https://arxiv.org/abs/2109.13916).
32. Browne, Ryan (2023-06-12). "British Prime Minister Rishi Sunak pitches UK as home of A.I.
safety regulation as London bids to be next Silicon Valley" (https://www.cnbc.com/2023/06/1
2/pm-rishi-sunak-pitches-uk-as-geographical-home-of-ai-regulation.html). CNBC. Retrieved
2023-06-25.
33. Kirilenko, Andrei; Kyle, Albert S.; Samadi, Mehrdad; Tuzun, Tugkan (2017). "The Flash
Crash: High-Frequency Trading in an Electronic Market: The Flash Crash" (https://onlinelibr
ary.wiley.com/doi/10.1111/jofi.12498). The Journal of Finance. 72 (3): 967–998.
doi:10.1111/jofi.12498 (https://doi.org/10.1111%2Fjofi.12498). hdl:10044/1/49798 (https://hd
l.handle.net/10044%2F1%2F49798). Archived (https://web.archive.org/web/2022112407004
3/https://onlinelibrary.wiley.com/doi/10.1111/jofi.12498) from the original on 2022-11-24.
Retrieved 2022-11-28.
34. Newman, M. E. J. (2005). "Power laws, Pareto distributions and Zipf's law" (http://www.tandfonli
ne.com/doi/abs/10.1080/00107510500052444). Contemporary Physics. 46 (5): 323–351.
arXiv:cond-mat/0412004 (https://arxiv.org/abs/cond-mat/0412004).
Bibcode:2005ConPh..46..323N (https://ui.adsabs.harvard.edu/abs/2005ConPh..46..323N).
doi:10.1080/00107510500052444 (https://doi.org/10.1080%2F00107510500052444).
ISSN 0010-7514 (https://www.worldcat.org/issn/0010-7514). S2CID 2871747 (https://api.se
manticscholar.org/CorpusID:2871747). Archived (https://web.archive.org/web/20221116220
804/https://www.tandfonline.com/doi/abs/10.1080/00107510500052444) from the original on
2022-11-16. Retrieved 2022-11-28.
35. Eliot, Lance. "Whether Those Endless Edge Or Corner Cases Are The Long-Tail Doom For
AI Self-Driving Cars" (https://www.forbes.com/sites/lanceeliot/2021/07/13/whether-those-end
less-edge-or-corner-cases-are-the-long-tail-doom-for-ai-self-driving-cars/). Forbes. Archived
(https://web.archive.org/web/20221124070049/https://www.forbes.com/sites/lanceeliot/2021/
07/13/whether-those-endless-edge-or-corner-cases-are-the-long-tail-doom-for-ai-self-driving
-cars/) from the original on 2022-11-24. Retrieved 2022-11-24.
36. Goodfellow, Ian; Papernot, Nicolas; Huang, Sandy; Duan, Rocky; Abbeel, Pieter; Clark, Jack
(2017-02-24). "Attacking Machine Learning with Adversarial Examples" (https://openai.com/
blog/adversarial-example-research/). OpenAI. Archived (https://web.archive.org/web/202211
24070536/https://openai.com/blog/adversarial-example-research/) from the original on 2022-
11-24. Retrieved 2022-11-24.
37. Szegedy, Christian; Zaremba, Wojciech; Sutskever, Ilya; Bruna, Joan; Erhan, Dumitru;
Goodfellow, Ian; Fergus, Rob (2014-02-19). "Intriguing properties of neural networks".
arXiv:1312.6199 (https://arxiv.org/abs/1312.6199).
38. Kurakin, Alexey; Goodfellow, Ian; Bengio, Samy (2017-02-10). "Adversarial examples in the
physical world". arXiv:1607.02533 (https://arxiv.org/abs/1607.02533).
39. Madry, Aleksander; Makelov, Aleksandar; Schmidt, Ludwig; Tsipras, Dimitris; Vladu, Adrian
(2019-09-04). "Towards Deep Learning Models Resistant to Adversarial Attacks".
arXiv:1706.06083 (https://arxiv.org/abs/1706.06083).
40. Kannan, Harini; Kurakin, Alexey; Goodfellow, Ian (2018-03-16). "Adversarial Logit Pairing".
arXiv:1803.06373 (https://arxiv.org/abs/1803.06373).
41. Gilmer, Justin; Adams, Ryan P.; Goodfellow, Ian; Andersen, David; Dahl, George E. (2018-
07-19). "Motivating the Rules of the Game for Adversarial Example Research".
arXiv:1807.06732 (https://arxiv.org/abs/1807.06732).
42. Carlini, Nicholas; Wagner, David (2018-03-29). "Audio Adversarial Examples: Targeted
Attacks on Speech-to-Text". arXiv:1801.01944 (https://arxiv.org/abs/1801.01944).
43. Sheatsley, Ryan; Papernot, Nicolas; Weisman, Michael; Verma, Gunjan; McDaniel, Patrick
(2022-09-09). "Adversarial Examples in Constrained Domains". arXiv:2011.01183 (https://ar
xiv.org/abs/2011.01183).
44. Suciu, Octavian; Coull, Scott E.; Johns, Jeffrey (2019-04-13). "Exploring Adversarial
Examples in Malware Detection". arXiv:1810.08280 (https://arxiv.org/abs/1810.08280).
45. Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela;
Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton,
Jacob; Kelton, Fraser; Miller, Luke; Simens, Maddie (2022-03-04). "Training language
models to follow instructions with human feedback". arXiv:2203.02155 (https://arxiv.org/abs/
2203.02155).
46. Gao, Leo; Schulman, John; Hilton, Jacob (2022-10-19). "Scaling Laws for Reward Model
Overoptimization". arXiv:2210.10760 (https://arxiv.org/abs/2210.10760).
47. Yu, Sihyun; Ahn, Sungsoo; Song, Le; Shin, Jinwoo (2021-10-27). "RoMA: Robust Model
Adaptation for Offline Model-based Optimization". arXiv:2110.14188 (https://arxiv.org/abs/21
10.14188).
48. Hendrycks, Dan; Mazeika, Mantas (2022-09-20). "X-Risk Analysis for AI Research".
arXiv:2206.05862 (https://arxiv.org/abs/2206.05862).
49. Tran, Khoa A.; Kondrashova, Olga; Bradley, Andrew; Williams, Elizabeth D.; Pearson, John
V.; Waddell, Nicola (2021). "Deep learning in cancer diagnosis, prognosis and treatment
selection" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8477474). Genome Medicine. 13
(1): 152. doi:10.1186/s13073-021-00968-x (https://doi.org/10.1186%2Fs13073-021-00968-
x). ISSN 1756-994X (https://www.worldcat.org/issn/1756-994X). PMC 8477474 (https://www.
ncbi.nlm.nih.gov/pmc/articles/PMC8477474). PMID 34579788 (https://pubmed.ncbi.nlm.nih.
gov/34579788).
50. Guo, Chuan; Pleiss, Geoff; Sun, Yu; Weinberger, Kilian Q. (2017-08-06). "On calibration of
modern neural networks". Proceedings of the 34th international conference on machine
learning. Proceedings of machine learning research. Vol. 70. PMLR. pp. 1321–1330.
51. Ovadia, Yaniv; Fertig, Emily; Ren, Jie; Nado, Zachary; Sculley, D.; Nowozin, Sebastian;
Dillon, Joshua V.; Lakshminarayanan, Balaji; Snoek, Jasper (2019-12-17). "Can You Trust
Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift".
arXiv:1906.02530 (https://arxiv.org/abs/1906.02530).
52. Bogdoll, Daniel; Breitenstein, Jasmin; Heidecker, Florian; Bieshaar, Maarten; Sick,
Bernhard; Fingscheidt, Tim; Zöllner, J. Marius (2021). "Description of Corner Cases in
Automated Driving: Goals and Challenges". 2021 IEEE/CVF International Conference on
Computer Vision Workshops (ICCVW). pp. 1023–1028. arXiv:2109.09607 (https://arxiv.org/a
bs/2109.09607). doi:10.1109/ICCVW54120.2021.00119 (https://doi.org/10.1109%2FICCVW
54120.2021.00119). ISBN 978-1-6654-0191-3. S2CID 237572375 (https://api.semanticschol
ar.org/CorpusID:237572375).
53. Hendrycks, Dan; Mazeika, Mantas; Dietterich, Thomas (2019-01-28). "Deep Anomaly
Detection with Outlier Exposure". arXiv:1812.04606 (https://arxiv.org/abs/1812.04606).
54. Wang, Haoqi; Li, Zhizhong; Feng, Litong; Zhang, Wayne (2022-03-21). "ViM: Out-Of-
Distribution with Virtual-logit Matching". arXiv:2203.10807 (https://arxiv.org/abs/2203.10807).
55. Hendrycks, Dan; Gimpel, Kevin (2018-10-03). "A Baseline for Detecting Misclassified and
Out-of-Distribution Examples in Neural Networks". arXiv:1610.02136 (https://arxiv.org/abs/16
10.02136).
56. Urbina, Fabio; Lentzos, Filippa; Invernizzi, Cédric; Ekins, Sean (2022). "Dual use of
artificial-intelligence-powered drug discovery" (https://www.ncbi.nlm.nih.gov/pmc/articles/PM
C9544280). Nature Machine Intelligence. 4 (3): 189–191. doi:10.1038/s42256-022-00465-9
(https://doi.org/10.1038%2Fs42256-022-00465-9). ISSN 2522-5839 (https://www.worldcat.or
g/issn/2522-5839). PMC 9544280 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9544280).
PMID 36211133 (https://pubmed.ncbi.nlm.nih.gov/36211133).
57. Center for Security and Emerging Technology; Buchanan, Ben; Lohn, Andrew; Musser,
Micah; Sedova, Katerina (2021). "Truth, Lies, and Automation: How Language Models Could
Change Disinformation" (https://cset.georgetown.edu/publication/truth-lies-and-automation/).
doi:10.51593/2021ca003 (https://doi.org/10.51593%2F2021ca003). S2CID 240522878 (http
s://api.semanticscholar.org/CorpusID:240522878). Archived (https://web.archive.org/web/20
221124073719/https://cset.georgetown.edu/publication/truth-lies-and-automation/) from the
original on 2022-11-24. Retrieved 2022-11-28.
58. "Propaganda-as-a-service may be on the horizon if large language models are abused" (http
s://venturebeat.com/ai/propaganda-as-a-service-may-be-on-the-horizon-if-large-language-m
odels-are-abused/). VentureBeat. 2021-12-14. Archived (https://web.archive.org/web/202211
24073718/https://venturebeat.com/ai/propaganda-as-a-service-may-be-on-the-horizon-if-larg
e-language-models-are-abused/) from the original on 2022-11-24. Retrieved 2022-11-24.
59. Center for Security and Emerging Technology; Buchanan, Ben; Bansemer, John; Cary,
Dakota; Lucas, Jack; Musser, Micah (2020). "Automating Cyber Attacks: Hype and Reality"
(https://cset.georgetown.edu/publication/automating-cyber-attacks/). Center for Security and
Emerging Technology. doi:10.51593/2020ca002 (https://doi.org/10.51593%2F2020ca002).
S2CID 234623943 (https://api.semanticscholar.org/CorpusID:234623943). Archived (https://
web.archive.org/web/20221124074301/https://cset.georgetown.edu/publication/automating-
cyber-attacks/) from the original on 2022-11-24. Retrieved 2022-11-28.
60. "Lessons Learned on Language Model Safety and Misuse" (https://openai.com/blog/languag
e-model-safety-and-misuse/). OpenAI. 2022-03-03. Archived (https://web.archive.org/web/20
221124074259/https://openai.com/blog/language-model-safety-and-misuse/) from the
original on 2022-11-24. Retrieved 2022-11-24.
61. Markov, Todor; Zhang, Chong; Agarwal, Sandhini; Eloundou, Tyna; Lee, Teddy; Adler,
Steven; Jiang, Angela; Weng, Lilian (2022-08-10). "New-and-Improved Content Moderation
Tooling" (https://openai.com/blog/new-and-improved-content-moderation-tooling/). OpenAI.
Archived (https://web.archive.org/web/20230111020935/https://openai.com/blog/new-and-im
proved-content-moderation-tooling/) from the original on 2023-01-11. Retrieved 2022-11-24.
62. Savage, Neil (2022-03-29). "Breaking into the black box of artificial intelligence" (https://ww
w.nature.com/articles/d41586-022-00858-1). Nature. doi:10.1038/d41586-022-00858-1 (http
s://doi.org/10.1038%2Fd41586-022-00858-1). PMID 35352042 (https://pubmed.ncbi.nlm.nih.
gov/35352042). S2CID 247792459 (https://api.semanticscholar.org/CorpusID:247792459).
Archived (https://web.archive.org/web/20221124074724/https://www.nature.com/articles/d41
586-022-00858-1) from the original on 2022-11-24. Retrieved 2022-11-24.
63. Center for Security and Emerging Technology; Rudner, Tim; Toner, Helen (2021). "Key
Concepts in AI Safety: Interpretability in Machine Learning" (https://cset.georgetown.edu/pub
lication/key-concepts-in-ai-safety-interpretability-in-machine-learning/).
doi:10.51593/20190042 (https://doi.org/10.51593%2F20190042). S2CID 233775541 (https://
api.semanticscholar.org/CorpusID:233775541). Archived (https://web.archive.org/web/20221
124075212/https://cset.georgetown.edu/publication/key-concepts-in-ai-safety-interpretability
-in-machine-learning/) from the original on 2022-11-24. Retrieved 2022-11-28.
64. McFarland, Matt (2018-03-19). "Uber pulls self-driving cars after first fatal crash of
autonomous vehicle" (https://money.cnn.com/2018/03/19/technology/uber-autonomous-car-f
atal-crash/index.html). CNNMoney. Archived (https://web.archive.org/web/20221124075209/
https://money.cnn.com/2018/03/19/technology/uber-autonomous-car-fatal-crash/index.html)
from the original on 2022-11-24. Retrieved 2022-11-24.
65. Doshi-Velez, Finale; Kortz, Mason; Budish, Ryan; Bavitz, Chris; Gershman, Sam; O'Brien,
David; Scott, Kate; Schieber, Stuart; Waldo, James; Weinberger, David; Weller, Adrian;
Wood, Alexandra (2019-12-20). "Accountability of AI Under the Law: The Role of
Explanation". arXiv:1711.01134 (https://arxiv.org/abs/1711.01134).
66. Fong, Ruth; Vedaldi, Andrea (2017). "Interpretable Explanations of Black Boxes by
Meaningful Perturbation". 2017 IEEE International Conference on Computer Vision (ICCV).
pp. 3449–3457. arXiv:1704.03296 (https://arxiv.org/abs/1704.03296).
doi:10.1109/ICCV.2017.371 (https://doi.org/10.1109%2FICCV.2017.371). ISBN 978-1-5386-
1032-9. S2CID 1633753 (https://api.semanticscholar.org/CorpusID:1633753).
67. Meng, Kevin; Bau, David; Andonian, Alex; Belinkov, Yonatan (2022). "Locating and editing
factual associations in GPT". Advances in Neural Information Processing Systems. 35.
arXiv:2202.05262 (https://arxiv.org/abs/2202.05262).
68. Bau, David; Liu, Steven; Wang, Tongzhou; Zhu, Jun-Yan; Torralba, Antonio (2020-07-30).
"Rewriting a Deep Generative Model". arXiv:2007.15646 (https://arxiv.org/abs/2007.15646).
69. Räuker, Tilman; Ho, Anson; Casper, Stephen; Hadfield-Menell, Dylan (2022-09-05). "Toward
Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks".
arXiv:2207.13243 (https://arxiv.org/abs/2207.13243).
70. Bau, David; Zhou, Bolei; Khosla, Aditya; Oliva, Aude; Torralba, Antonio (2017-04-19).
"Network Dissection: Quantifying Interpretability of Deep Visual Representations".
arXiv:1704.05796 (https://arxiv.org/abs/1704.05796).
71. McGrath, Thomas; Kapishnikov, Andrei; Tomašev, Nenad; Pearce, Adam; Wattenberg,
Martin; Hassabis, Demis; Kim, Been; Paquet, Ulrich; Kramnik, Vladimir (2022-11-22).
"Acquisition of chess knowledge in AlphaZero" (https://www.ncbi.nlm.nih.gov/pmc/articles/P
MC9704706). Proceedings of the National Academy of Sciences. 119 (47): e2206625119.
arXiv:2111.09259 (https://arxiv.org/abs/2111.09259). Bibcode:2022PNAS..11906625M (http
s://ui.adsabs.harvard.edu/abs/2022PNAS..11906625M). doi:10.1073/pnas.2206625119 (http
s://doi.org/10.1073%2Fpnas.2206625119). ISSN 0027-8424 (https://www.worldcat.org/issn/
0027-8424). PMC 9704706 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9704706).
PMID 36375061 (https://pubmed.ncbi.nlm.nih.gov/36375061).
72. Goh, Gabriel; Cammarata, Nick; Voss, Chelsea; Carter, Shan; Petrov, Michael; Schubert,
Ludwig; Radford, Alec; Olah, Chris (2021). "Multimodal neurons in artificial neural networks".
Distill. 6 (3). doi:10.23915/distill.00030 (https://doi.org/10.23915%2Fdistill.00030).
S2CID 233823418 (https://api.semanticscholar.org/CorpusID:233823418).
73. Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter,
Shan (2020). "Zoom in: An introduction to circuits". Distill. 5 (3).
doi:10.23915/distill.00024.001 (https://doi.org/10.23915%2Fdistill.00024.001).
S2CID 215930358 (https://api.semanticscholar.org/CorpusID:215930358).
74. Cammarata, Nick; Goh, Gabriel; Carter, Shan; Voss, Chelsea; Schubert, Ludwig; Olah, Chris
(2021). "Curve circuits" (https://distill.pub/2020/circuits/curve-circuits/). Distill. 6 (1).
doi:10.23915/distill.00024.006 (https://doi.org/10.23915%2Fdistill.00024.006) (inactive 31
December 2022). Archived (https://web.archive.org/web/20221205140056/https://distill.pub/
2020/circuits/curve-circuits/) from the original on 5 December 2022. Retrieved 5 December
2022.
75. Olsson, Catherine; Elhage, Nelson; Nanda, Neel; Joseph, Nicholas; DasSarma, Nova;
Henighan, Tom; Mann, Ben; Askell, Amanda; Bai, Yuntao; Chen, Anna; Conerly, Tom; Drain,
Dawn; Ganguli, Deep; Hatfield-Dodds, Zac; Hernandez, Danny; Johnston, Scott; Jones,
Andy; Kernion, Jackson; Lovitt, Liane; Ndousse, Kamal; Amodei, Dario; Brown, Tom; Clark,
Jack; Kaplan, Jared; McCandlish, Sam; Olah, Chris (2022). "In-context learning and
induction heads". Transformer Circuits Thread. arXiv:2209.11895 (https://arxiv.org/abs/2209.
11895).
76. Olah, Christopher. "Interpretability vs Neuroscience [rough note]" (https://colah.github.io/note
s/interp-v-neuro/). Archived (https://web.archive.org/web/20221124114744/https://colah.githu
b.io/notes/interp-v-neuro/) from the original on 2022-11-24. Retrieved 2022-11-24.
77. Gu, Tianyu; Dolan-Gavitt, Brendan; Garg, Siddharth (2019-03-11). "BadNets: Identifying
Vulnerabilities in the Machine Learning Model Supply Chain". arXiv:1708.06733 (https://arxi
v.org/abs/1708.06733).
78. Chen, Xinyun; Liu, Chang; Li, Bo; Lu, Kimberly; Song, Dawn (2017-12-14). "Targeted
Backdoor Attacks on Deep Learning Systems Using Data Poisoning". arXiv:1712.05526 (htt
ps://arxiv.org/abs/1712.05526).
79. Carlini, Nicholas; Terzis, Andreas (2022-03-28). "Poisoning and Backdooring Contrastive
Learning". arXiv:2106.09667 (https://arxiv.org/abs/2106.09667).
80. Russell, Stuart J.; Norvig, Peter (2020). Artificial intelligence: A modern approach (https://ww
w.pearson.com/us/higher-education/program/Russell-Artificial-Intelligence-A-Modern-Appro
ach-4th-Edition/PGM1263338.html) (4th ed.). Pearson. pp. 31–34. ISBN 978-1-292-40113-3.
OCLC 1303900751 (https://www.worldcat.org/oclc/1303900751). Archived (https://web.archi
ve.org/web/20220715195054/https://www.pearson.com/us/higher-education/program/Russel
l-Artificial-Intelligence-A-Modern-Approach-4th-Edition/PGM1263338.html) from the original
on July 15, 2022. Retrieved September 12, 2022.
81. Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob (2022-06-16).
"Unsolved Problems in ML Safety". arXiv:2109.13916 (https://arxiv.org/abs/2109.13916)
[cs.LG (https://arxiv.org/archive/cs.LG)].
82. Ngo, Richard; Chan, Lawrence; Mindermann, Sören (2023-02-22). "The alignment problem
from a deep learning perspective". arXiv:2209.00626 (https://arxiv.org/abs/2209.00626)
[cs.AI (https://arxiv.org/archive/cs.AI)].
83. Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob (2022-02-14). The Effects of Reward
Misspecification: Mapping and Mitigating Misaligned Models (https://openreview.net/forum?i
d=JYtwGwIL7ye). International Conference on Learning Representations. Retrieved
2022-07-21.
84. Zhuang, Simon; Hadfield-Menell, Dylan (2020). "Consequences of Misaligned AI" (https://pr
oceedings.neurips.cc/paper/2020/hash/b607ba543ad05417b8507ee86c54fcb7-Abstract.ht
ml). Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc.
pp. 15763–15773. Retrieved 2023-03-11.
85. Carlsmith, Joseph (2022-06-16). "Is Power-Seeking AI an Existential Risk?".
arXiv:2206.13353 (https://arxiv.org/abs/2206.13353) [cs.CY (https://arxiv.org/archive/cs.CY)].
86. Russell, Stuart J. (2020). Human compatible: Artificial intelligence and the problem of control
(https://www.penguinrandomhouse.com/books/566677/human-compatible-by-stuart-russel
l/). Penguin Random House. ISBN 9780525558637. OCLC 1113410915 (https://www.world
cat.org/oclc/1113410915).
87. Christian, Brian (2020). The alignment problem: Machine learning and human values (http
s://wwnorton.co.uk/books/9780393635829-the-alignment-problem). W. W. Norton &
Company. ISBN 978-0-393-86833-3. OCLC 1233266753 (https://www.worldcat.org/oclc/123
3266753). Archived (https://web.archive.org/web/20230210114137/https://wwnorton.co.uk/b
ooks/9780393635829-the-alignment-problem) from the original on February 10, 2023.
Retrieved September 12, 2022.
88. Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David
(2022-06-28). "Goal Misgeneralization in Deep Reinforcement Learning" (https://proceeding
s.mlr.press/v162/langosco22a.html). Proceedings of the 39th International Conference on
Machine Learning. International Conference on Machine Learning. PMLR. pp. 12004–
12019. Retrieved 2023-03-11.
89. Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx,
Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma;
Brynjolfsson, Erik (2022-07-12). "On the Opportunities and Risks of Foundation Models" (htt
ps://fsi.stanford.edu/publication/opportunities-and-risks-foundation-models). Stanford CRFM.
arXiv:2108.07258 (https://arxiv.org/abs/2108.07258).
90. Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela;
Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, J.; Hilton, Jacob;
Kelton, Fraser; Miller, Luke E.; Simens, Maddie; Askell, Amanda; Welinder, P.; Christiano, P.;
Leike, J.; Lowe, Ryan J. (2022). "Training language models to follow instructions with human
feedback". arXiv:2203.02155 (https://arxiv.org/abs/2203.02155) [cs.CL (https://arxiv.org/archi
ve/cs.CL)].
91. Zaremba, Wojciech; Brockman, Greg; OpenAI (2021-08-10). "OpenAI Codex" (https://opena
i.com/blog/openai-codex/). OpenAI. Archived (https://web.archive.org/web/20230203201912/
https://openai.com/blog/openai-codex/) from the original on February 3, 2023. Retrieved
2022-07-23.
92. Kober, Jens; Bagnell, J. Andrew; Peters, Jan (2013-09-01). "Reinforcement learning in
robotics: A survey" (http://journals.sagepub.com/doi/10.1177/0278364913495721). The
International Journal of Robotics Research. 32 (11): 1238–1274.
doi:10.1177/0278364913495721 (https://doi.org/10.1177%2F0278364913495721).
ISSN 0278-3649 (https://www.worldcat.org/issn/0278-3649). S2CID 1932843 (https://api.se
manticscholar.org/CorpusID:1932843). Archived (https://web.archive.org/web/20221015200
445/https://journals.sagepub.com/doi/10.1177/0278364913495721) from the original on
October 15, 2022. Retrieved September 12, 2022.
93. Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter (2023-
03-01). "Reward (Mis)design for autonomous driving" (https://www.sciencedirect.com/scienc
e/article/pii/S0004370222001692). Artificial Intelligence. 316: 103829.
doi:10.1016/j.artint.2022.103829 (https://doi.org/10.1016%2Fj.artint.2022.103829).
ISSN 0004-3702 (https://www.worldcat.org/issn/0004-3702). S2CID 233423198 (https://api.s
emanticscholar.org/CorpusID:233423198).
94. Stray, Jonathan (2020). "Aligning AI Optimization to Community Well-Being" (https://www.nc
bi.nlm.nih.gov/pmc/articles/PMC7610010). International Journal of Community Well-Being. 3
(4): 443–463. doi:10.1007/s42413-020-00086-3 (https://doi.org/10.1007%2Fs42413-020-000
86-3). ISSN 2524-5295 (https://www.worldcat.org/issn/2524-5295). PMC 7610010 (https://w
ww.ncbi.nlm.nih.gov/pmc/articles/PMC7610010). PMID 34723107 (https://pubmed.ncbi.nlm.
nih.gov/34723107). S2CID 226254676 (https://api.semanticscholar.org/CorpusID:22625467
6).
95. Russell, Stuart; Norvig, Peter (2009). Artificial Intelligence: A Modern Approach (https://aima.
cs.berkeley.edu/). Prentice Hall. p. 1010. ISBN 978-0-13-604259-4.
96. Smith, Craig S. "Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat' "
(https://www.forbes.com/sites/craigsmith/2023/05/04/geoff-hinton-ais-most-famous-research
er-warns-of-existential-threat/). Forbes. Retrieved 2023-05-04.
97. Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané,
Dan (2016-06-21). "Concrete Problems in AI Safety". arXiv:1606.06565 (https://arxiv.org/abs/
1606.06565) [cs.AI (https://arxiv.org/archive/cs.AI)].
98. Ortega, Pedro A.; Maini, Vishal; DeepMind safety team (2018-09-27). "Building safe artificial
intelligence: specification, robustness, and assurance" (https://deepmindsafetyresearch.med
ium.com/building-safe-artificial-intelligence-52f5f75058f1). DeepMind Safety Research –
Medium. Archived (https://web.archive.org/web/20230210114142/https://deepmindsafetyres
earch.medium.com/building-safe-artificial-intelligence-52f5f75058f1) from the original on
February 10, 2023. Retrieved 2022-07-18.
99. Rorvig, Mordechai (2022-04-14). "Researchers Gain New Understanding From Simple AI"
(https://www.quantamagazine.org/researchers-glimpse-how-ai-gets-so-good-at-language-pr
ocessing-20220414/). Quanta Magazine. Archived (https://web.archive.org/web/2023021011
4056/https://www.quantamagazine.org/researchers-glimpse-how-ai-gets-so-good-at-langua
ge-processing-20220414/) from the original on February 10, 2023. Retrieved 2022-07-18.
100. Doshi-Velez, Finale; Kim, Been (2017-03-02). "Towards A Rigorous Science of Interpretable
Machine Learning". arXiv:1702.08608 (https://arxiv.org/abs/1702.08608) [stat.ML (https://arxi
v.org/archive/stat.ML)].
Wiblin, Robert (August 4, 2021). "Chris Olah on what the hell is going on inside neural
networks" (https://80000hours.org/podcast/episodes/chris-olah-interpretability-research/)
(Podcast). 80,000 Hours. No. 107. Retrieved 2022-07-23.
101. Russell, Stuart; Dewey, Daniel; Tegmark, Max (2015-12-31). "Research Priorities for Robust
and Beneficial Artificial Intelligence" (https://ojs.aaai.org/index.php/aimagazine/article/view/2
577). AI Magazine. 36 (4): 105–114. doi:10.1609/aimag.v36i4.2577 (https://doi.org/10.160
9%2Faimag.v36i4.2577). hdl:1721.1/108478 (https://hdl.handle.net/1721.1%2F108478).
ISSN 2371-9621 (https://www.worldcat.org/issn/2371-9621). S2CID 8174496 (https://api.se
manticscholar.org/CorpusID:8174496). Archived (https://web.archive.org/web/20230202181
059/https://ojs.aaai.org/index.php/aimagazine/article/view/2577) from the original on
February 2, 2023. Retrieved September 12, 2022.
102. Wirth, Christian; Akrour, Riad; Neumann, Gerhard; Fürnkranz, Johannes (2017). "A survey of
preference-based reinforcement learning methods". Journal of Machine Learning Research.
18 (136): 1–46.
103. Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario
(2017). "Deep reinforcement learning from human preferences". Proceedings of the 31st
International Conference on Neural Information Processing Systems. NIPS'17. Red Hook,
NY, USA: Curran Associates Inc. pp. 4302–4310. ISBN 978-1-5108-6096-4.
104. Heaven, Will Douglas (2022-01-27). "The new version of GPT-3 is much better behaved
(and should be less toxic)" (https://www.technologyreview.com/2022/01/27/1044398/new-gpt
3-openai-chatbot-language-model-ai-toxic-misinformation/). MIT Technology Review.
Archived (https://web.archive.org/web/20230210114056/https://www.technologyreview.com/
2022/01/27/1044398/new-gpt3-openai-chatbot-language-model-ai-toxic-misinformation/)
from the original on February 10, 2023. Retrieved 2022-07-18.
105. Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay
(2022-03-07). "Taxonomy of Machine Learning Safety: A Survey and Primer".
arXiv:2106.04823 (https://arxiv.org/abs/2106.04823) [cs.LG (https://arxiv.org/archive/cs.LG)].
106. Clifton, Jesse (2020). "Cooperation, Conflict, and Transformative Artificial Intelligence: A
Research Agenda" (https://longtermrisk.org/research-agenda/). Center on Long-Term Risk.
Archived (https://web.archive.org/web/20230101041759/https://longtermrisk.org/research-ag
enda) from the original on January 1, 2023. Retrieved 2022-07-18.
Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel,
Thore (2021-05-06). "Cooperative AI: machines must learn to find common ground" (htt
p://www.nature.com/articles/d41586-021-01170-0). Nature. 593 (7857): 33–36.
Bibcode:2021Natur.593...33D (https://ui.adsabs.harvard.edu/abs/2021Natur.593...33D).
doi:10.1038/d41586-021-01170-0 (https://doi.org/10.1038%2Fd41586-021-01170-0).
ISSN 0028-0836 (https://www.worldcat.org/issn/0028-0836). PMID 33947992 (https://pub
med.ncbi.nlm.nih.gov/33947992). S2CID 233740521 (https://api.semanticscholar.org/Co
rpusID:233740521). Archived (https://web.archive.org/web/20221218210857/https://ww
w.nature.com/articles/d41586-021-01170-0) from the original on December 18, 2022.
Retrieved September 12, 2022.
107. Prunkl, Carina; Whittlestone, Jess (2020-02-07). "Beyond Near- and Long-Term: Towards a
Clearer Account of Research Priorities in AI Ethics and Society" (https://dl.acm.org/doi/10.11
45/3375627.3375803). Proceedings of the AAAI/ACM Conference on AI, Ethics, and
Society. New York NY USA: ACM: 138–143. doi:10.1145/3375627.3375803 (https://doi.org/
10.1145%2F3375627.3375803). ISBN 978-1-4503-7110-0. S2CID 210164673 (https://api.s
emanticscholar.org/CorpusID:210164673). Archived (https://web.archive.org/web/20221016
123733/https://dl.acm.org/doi/10.1145/3375627.3375803) from the original on October 16,
2022. Retrieved September 12, 2022.
108. Irving, Geoffrey; Askell, Amanda (2019-02-19). "AI Safety Needs Social Scientists" (https://di
still.pub/2019/safety-needs-social-scientists). Distill. 4 (2): 10.23915/distill.00014.
doi:10.23915/distill.00014 (https://doi.org/10.23915%2Fdistill.00014). ISSN 2476-0757 (http
s://www.worldcat.org/issn/2476-0757). S2CID 159180422 (https://api.semanticscholar.org/C
orpusID:159180422). Archived (https://web.archive.org/web/20230210114220/https://distill.p
ub/2019/safety-needs-social-scientists/) from the original on February 10, 2023. Retrieved
September 12, 2022.
109. Zwetsloot, Remco; Dafoe, Allan (2019-02-11). "Thinking About Risks From AI: Accidents,
Misuse and Structure" (https://www.lawfareblog.com/thinking-about-risks-ai-accidents-misus
e-and-structure). Lawfare. Archived (https://web.archive.org/web/20221124122021/https://w
ww.lawfareblog.com/thinking-about-risks-ai-accidents-misuse-and-structure) from the
original on 2022-11-24. Retrieved 2022-11-24.
110. Zhang, Yingyu; Dong, Chuntong; Guo, Weiqun; Dai, Jiabao; Zhao, Ziming (2022). "Systems
theoretic accident model and process (STAMP): A literature review" (https://linkinghub.elsevi
er.com/retrieve/pii/S0925753521004367). Safety Science. 152: 105596.
doi:10.1016/j.ssci.2021.105596 (https://doi.org/10.1016%2Fj.ssci.2021.105596).
S2CID 244550153 (https://api.semanticscholar.org/CorpusID:244550153). Archived (https://
web.archive.org/web/20230315184342/https://www.sciencedirect.com/science/article/abs/pi
i/S0925753521004367?via%3Dihub) from the original on 2023-03-15. Retrieved
2022-11-28.
111. Center for Security and Emerging Technology; Hoffman, Wyatt (2021). "AI and the Future of
Cyber Competition" (https://cset.georgetown.edu/publication/ai-and-the-future-of-cyber-comp
etition/). doi:10.51593/2020ca007 (https://doi.org/10.51593%2F2020ca007).
S2CID 234245812 (https://api.semanticscholar.org/CorpusID:234245812). Archived (https://
web.archive.org/web/20221124122253/https://cset.georgetown.edu/publication/ai-and-the-fu
ture-of-cyber-competition/) from the original on 2022-11-24. Retrieved 2022-11-28.
112. Center for Security and Emerging Technology; Imbrie, Andrew; Kania, Elsa (2019). "AI
Safety, Security, and Stability Among Great Powers: Options, Challenges, and Lessons
Learned for Pragmatic Engagement" (https://cset.georgetown.edu/publication/ai-safety-secur
ity-and-stability-among-great-powers-options-challenges-and-lessons-learned-for-pragmatic
-engagement/). doi:10.51593/20190051 (https://doi.org/10.51593%2F20190051).
S2CID 240957952 (https://api.semanticscholar.org/CorpusID:240957952). Archived (https://
web.archive.org/web/20221124122652/https://cset.georgetown.edu/publication/ai-safety-sec
urity-and-stability-among-great-powers-options-challenges-and-lessons-learned-for-pragmat
ic-engagement/) from the original on 2022-11-24. Retrieved 2022-11-28.
113. Future of Life Institute (2019-03-27). AI Strategy, Policy, and Governance (Allan Dafoe) (http
s://www.youtube.com/watch?v=2IpJ8TIKKtI). Event occurs at 22:05. Archived (https://web.ar
chive.org/web/20221123055429/https://www.youtube.com/watch?v=2IpJ8TIKKtI) from the
original on 2022-11-23. Retrieved 2022-11-23.
114. Zou, Andy; Xiao, Tristan; Jia, Ryan; Kwon, Joe; Mazeika, Mantas; Li, Richard; Song, Dawn;
Steinhardt, Jacob; Evans, Owain; Hendrycks, Dan (2022-10-09). "Forecasting Future World
Events with Neural Networks". arXiv:2206.15474 (https://arxiv.org/abs/2206.15474).
115. Gathani, Sneha; Hulsebos, Madelon; Gale, James; Haas, Peter J.; Demiralp, Çağatay
(2022-02-08). "Augmenting Decision Making via Interactive What-If Analysis".
arXiv:2109.06160 (https://arxiv.org/abs/2109.06160).
116. Lindelauf, Roy (2021), Osinga, Frans; Sweijs, Tim (eds.), "Nuclear Deterrence in the
Algorithmic Age: Game Theory Revisited" (https://link.springer.com/10.1007/978-94-6265-41
9-8_22), NL ARMS Netherlands Annual Review of Military Studies 2020, Nl Arms, The
Hague: T.M.C. Asser Press, pp. 421–436, doi:10.1007/978-94-6265-419-8_22 (https://doi.or
g/10.1007%2F978-94-6265-419-8_22), ISBN 978-94-6265-418-1, S2CID 229449677 (http
s://api.semanticscholar.org/CorpusID:229449677), archived (https://web.archive.org/web/20
230315184337/https://link.springer.com/chapter/10.1007/978-94-6265-419-8_22) from the
original on 2023-03-15, retrieved 2022-11-24
117. Newkirk II, Vann R. (2016-04-21). "Is Climate Change a Prisoner's Dilemma or a Stag
Hunt?" (https://www.theatlantic.com/politics/archive/2016/04/climate-change-game-theory-m
odels/624253/). The Atlantic. Archived (https://web.archive.org/web/20221124123011/https://
www.theatlantic.com/politics/archive/2016/04/climate-change-game-theory-models/624253/)
from the original on 2022-11-24. Retrieved 2022-11-24.
118. Armstrong, Stuart; Bostrom, Nick; Shulman, Carl. Racing to the Precipice: a Model of
Artificial Intelligence Development (Report). Future of Humanity Institute, Oxford University.
119. Dafoe, Allan. AI Governance: A Research Agenda (Report). Centre for the Governance of AI,
Future of Humanity Institute, University of Oxford.
120. Dafoe, Allan; Hughes, Edward; Bachrach, Yoram; Collins, Tantum; McKee, Kevin R.; Leibo,
Joel Z.; Larson, Kate; Graepel, Thore (2020-12-15). "Open Problems in Cooperative AI".
arXiv:2012.08630 (https://arxiv.org/abs/2012.08630).
121. Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore
(2021). "Cooperative AI: machines must learn to find common ground" (https://www.nature.c
om/articles/d41586-021-01170-0). Nature. 593 (7857): 33–36. Bibcode:2021Natur.593...33D
(https://ui.adsabs.harvard.edu/abs/2021Natur.593...33D). doi:10.1038/d41586-021-01170-0
(https://doi.org/10.1038%2Fd41586-021-01170-0). PMID 33947992 (https://pubmed.ncbi.nl
m.nih.gov/33947992). S2CID 233740521 (https://api.semanticscholar.org/CorpusID:233740
521). Archived (https://web.archive.org/web/20221122230552/https://www.nature.com/article
s/d41586-021-01170-0) from the original on 2022-11-22. Retrieved 2022-11-24.
122. Crafts, Nicholas (2021-09-23). "Artificial intelligence as a general-purpose technology: an
historical perspective" (https://academic.oup.com/oxrep/article/37/3/521/6374675). Oxford
Review of Economic Policy. 37 (3): 521–536. doi:10.1093/oxrep/grab012 (https://doi.org/10.
1093%2Foxrep%2Fgrab012). ISSN 0266-903X (https://www.worldcat.org/issn/0266-903X).
Archived (https://web.archive.org/web/20221124130718/https://academic.oup.com/oxrep/arti
cle/37/3/521/6374675) from the original on 2022-11-24. Retrieved 2022-11-28.
123. 葉俶禎; 黃子君; 張媁雯; 賴志樫 (2020-12-01). "Labor Displacement in Artificial Intelligence
Era: A Systematic Literature Review". 臺灣東亞文明研究學刊 [Taiwan Journal of East Asian
Studies]. 17 (2). doi:10.6163/TJEAS.202012_17(2).0002 (https://doi.org/10.6163%2FTJEAS.
202012_17%282%29.0002). ISSN 1812-6243 (https://www.worldcat.org/issn/1812-6243).
124. Johnson, James (2019-04-03). "Artificial intelligence & future warfare: implications for
international security" (https://www.tandfonline.com/doi/full/10.1080/14751798.2019.160080
0). Defense & Security Analysis. 35 (2): 147–169. doi:10.1080/14751798.2019.1600800 (htt
ps://doi.org/10.1080%2F14751798.2019.1600800). ISSN 1475-1798 (https://www.worldcat.o
rg/issn/1475-1798). S2CID 159321626 (https://api.semanticscholar.org/CorpusID:15932162
6). Archived (https://web.archive.org/web/20221124125204/https://www.tandfonline.com/doi/
full/10.1080/14751798.2019.1600800) from the original on 2022-11-24. Retrieved
2022-11-28.
125. Kertysova, Katarina (2018-12-12). "Artificial Intelligence and Disinformation: How AI
Changes the Way Disinformation is Produced, Disseminated, and Can Be Countered" (http
s://brill.com/view/journals/shrs/29/1-4/article-p55_55.xml). Security and Human Rights. 29
(1–4): 55–81. doi:10.1163/18750230-02901005 (https://doi.org/10.1163%2F18750230-0290
1005). ISSN 1874-7337 (https://www.worldcat.org/issn/1874-7337). S2CID 216896677 (http
s://api.semanticscholar.org/CorpusID:216896677). Archived (https://web.archive.org/web/20
221124125204/https://brill.com/view/journals/shrs/29/1-4/article-p55_55.xml) from the
original on 2022-11-24. Retrieved 2022-11-28.
126. Feldstein, Steven (2019). The Global Expansion of AI Surveillance. Carnegie Endowment
for International Peace.
127. The economics of artificial intelligence: an agenda (https://www.worldcat.org/oclc/10994350
14). Ajay Agrawal, Joshua Gans, Avi Goldfarb. Chicago. 2019. ISBN 978-0-226-61347-5.
OCLC 1099435014 (https://www.worldcat.org/oclc/1099435014). Archived (https://web.archi
ve.org/web/20230315184354/https://www.worldcat.org/title/1099435014) from the original
on 2023-03-15. Retrieved 2022-11-28.
128. Whittlestone, Jess; Clark, Jack (2021-08-31). "Why and How Governments Should Monitor
AI Development". arXiv:2108.12427 (https://arxiv.org/abs/2108.12427).
129. Shevlane, Toby (2022). "Sharing Powerful AI Models | GovAI Blog" (https://www.governanc
e.ai/post/sharing-powerful-ai-models). Center for the Governance of AI. Archived (https://we
b.archive.org/web/20221124125202/https://www.governance.ai/post/sharing-powerful-ai-mo
dels) from the original on 2022-11-24. Retrieved 2022-11-24.
130. Askell, Amanda; Brundage, Miles; Hadfield, Gillian (2019-07-10). "The Role of Cooperation
in Responsible AI Development". arXiv:1907.04534 (https://arxiv.org/abs/1907.04534).
131. Ziegler, Bart (8 April 2022). "Is It Time to Regulate AI?" (https://www.wsj.com/articles/is-it-tim
e-to-regulate-ai-11649433600). WSJ. Archived (https://web.archive.org/web/202211241256
45/https://www.wsj.com/articles/is-it-time-to-regulate-ai-11649433600) from the original on
2022-11-24. Retrieved 2022-11-24.
132. Reed, Chris (2018-09-13). "How should we regulate artificial intelligence?" (https://www.ncb
i.nlm.nih.gov/pmc/articles/PMC6107539). Philosophical Transactions of the Royal Society A:
Mathematical, Physical and Engineering Sciences. 376 (2128): 20170360.
Bibcode:2018RSPTA.37670360R (https://ui.adsabs.harvard.edu/abs/2018RSPTA.3767036
0R). doi:10.1098/rsta.2017.0360 (https://doi.org/10.1098%2Frsta.2017.0360). ISSN 1364-
503X (https://www.worldcat.org/issn/1364-503X). PMC 6107539 (https://www.ncbi.nlm.nih.g
ov/pmc/articles/PMC6107539). PMID 30082306 (https://pubmed.ncbi.nlm.nih.gov/3008230
6).
133. Belton, Keith B. (2019-03-07). "How Should AI Be Regulated?" (https://www.industryweek.c
om/technology-and-iiot/article/22027274/how-should-ai-be-regulated). IndustryWeek.
Archived (https://web.archive.org/web/20220129114109/https://www.industryweek.com/tech
nology-and-iiot/article/22027274/how-should-ai-be-regulated) from the original on 2022-01-
29. Retrieved 2022-11-24.
134. National Security Commission on Artificial Intelligence (2021). Final Report.
135. National Institute of Standards and Technology (2021-07-12). "AI Risk Management
Framework" (https://www.nist.gov/itl/ai-risk-management-framework). NIST. Archived (https://
web.archive.org/web/20221124130402/https://www.nist.gov/itl/ai-risk-management-framewo
rk) from the original on 2022-11-24. Retrieved 2022-11-24.
136. Richardson, Tim (2021). "Britain publishes 10-year National Artificial Intelligence Strategy"
(https://www.theregister.com/2021/09/22/uk_10_year_national_ai_strategy/). The Register.
Archived (https://web.archive.org/web/20230210114137/https://www.theregister.com/2021/0
9/22/uk_10_year_national_ai_strategy/) from the original on 2023-02-10. Retrieved 2022-11-24.
137. "Guidance: National AI Strategy" (https://www.gov.uk/government/publications/national-ai-str
ategy/national-ai-strategy-html-version). GOV.UK. 2021. Archived (https://web.archive.org/w
eb/20230210114139/https://www.gov.uk/government/publications/national-ai-strategy/nation
al-ai-strategy-html-version) from the original on 2023-02-10. Retrieved 2022-11-24.
138. Office of the Director of National Intelligence, Intelligence Advanced Research Projects
Activity. "IARPA - TrojAI" (https://www.iarpa.gov/research-programs/trojai). Archived (https://
web.archive.org/web/20221124131956/https://www.iarpa.gov/research-programs/trojai) from
the original on 2022-11-24. Retrieved 2022-11-24.
139. Turek, Matt. "Explainable Artificial Intelligence" (https://www.darpa.mil/program/explainable-
artificial-intelligence). Archived (https://web.archive.org/web/20210219210013/https://www.d
arpa.mil/program/explainable-artificial-intelligence) from the original on 2021-02-19.
Retrieved 2022-11-24.
140. Draper, Bruce. "Guaranteeing AI Robustness Against Deception" (https://www.darpa.mil/pro
gram/guaranteeing-ai-robustness-against-deception). Defense Advanced Research Projects
Agency. Archived (https://web.archive.org/web/20230109021433/https://www.darpa.mil/prog
ram/guaranteeing-ai-robustness-against-deception) from the original on 2023-01-09.
Retrieved 2022-11-24.
141. National Science Foundation (23 February 2023). "Safe Learning-Enabled Systems" (http
s://beta.nsf.gov/funding/opportunities/safe-learning-enabled-systems). Archived (https://web.
archive.org/web/20230226190627/https://beta.nsf.gov/funding/opportunities/safe-learning-e
nabled-systems) from the original on 2023-02-26. Retrieved 2023-02-27.
142. Mäntymäki, Matti; Minkkinen, Matti; Birkstedt, Teemu; Viljanen, Mika (2022). "Defining
organizational AI governance" (https://link.springer.com/10.1007/s43681-022-00143-x). AI
and Ethics. 2 (4): 603–609. doi:10.1007/s43681-022-00143-x (https://doi.org/10.1007%2Fs4
3681-022-00143-x). ISSN 2730-5953 (https://www.worldcat.org/issn/2730-5953).
S2CID 247119668 (https://api.semanticscholar.org/CorpusID:247119668). Archived (https://
web.archive.org/web/20230315184339/https://link.springer.com/article/10.1007/s43681-022-
00143-x) from the original on 2023-03-15. Retrieved 2022-11-28.
143. Brundage, Miles; Avin, Shahar; Wang, Jasmine; Belfield, Haydn; Krueger, Gretchen;
Hadfield, Gillian; Khlaaf, Heidy; Yang, Jingying; Toner, Helen; Fong, Ruth; Maharaj, Tegan;
Koh, Pang Wei; Hooker, Sara; Leung, Jade; Trask, Andrew (2020-04-20). "Toward
Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims".
arXiv:2004.07213 (https://arxiv.org/abs/2004.07213).
144. "Welcome to the Artificial Intelligence Incident Database" (https://incidentdatabase.ai/).
Archived (https://web.archive.org/web/20221124132715/https://incidentdatabase.ai/) from
the original on 2022-11-24. Retrieved 2022-11-24.
145. Wiblin, Robert; Harris, Keiran (2022). "Nova DasSarma on why information security may be
critical to the safe development of AI systems" (https://80000hours.org/podcast/episodes/nov
a-dassarma-information-security-and-ai-systems/). 80,000 Hours. Archived (https://web.archi
ve.org/web/20221124132927/https://80000hours.org/podcast/episodes/nova-dassarma-infor
mation-security-and-ai-systems/) from the original on 2022-11-24. Retrieved 2022-11-24.
146. OpenAI (2022-06-02). "Best Practices for Deploying Language Models" (https://openai.com/
blog/best-practices-for-deploying-language-models/). OpenAI. Archived (https://web.archive.
org/web/20230315184334/https://openai.com/blog/best-practices-for-deploying-language-m
odels/) from the original on 2023-03-15. Retrieved 2022-11-24.
147. OpenAI. "OpenAI Charter" (https://openai.com/charter/). OpenAI. Archived (https://web.archiv
e.org/web/20210304235618/https://openai.com/charter/) from the original on 2021-03-04.
Retrieved 2022-11-24.
148. Future of Life Institute (2016). "Autonomous Weapons Open Letter: AI & Robotics
Researchers" (https://futureoflife.org/open-letter/open-letter-autonomous-weapons-ai-robotic
s/). Future of Life Institute. Archived (https://web.archive.org/web/20221124114555/https://fut
ureoflife.org/open-letter/open-letter-autonomous-weapons-ai-robotics/) from the original on
2022-11-24. Retrieved 2022-11-24.
External links
Unsolved Problems in ML Safety
On the Opportunities and Risks of Foundation Models
An Overview of Catastrophic AI Risks
AI Accidents: An Emerging Threat (https://cset.georgetown.edu/publication/ai-accidents-an-e
merging-threat/)
Engineering a Safer World (https://mitpress.mit.edu/9780262533690/engineering-a-safer-wo
rld/)
Retrieved from "https://en.wikipedia.org/w/index.php?title=AI_safety&oldid=1166398835"