Large language models learn to speak biology
AI systems that have already made strides learning the language of humans are
being trained to decipher the language of life encoded in DNA — and to use it to try
to design new molecules.
Why it matters: AI that can make sense of biology's information could help
scientists to develop new therapeutics and to engineer cells to produce biofuels,
materials, medicines and other products.
Background: Scientists have for decades worked to reverse engineer cells in order to
design new proteins and improve molecules found in nature, increasingly with the
help of computational tools.
https://www.axios.com/2023/11/17/generative-ai-dna-biology 1/7
5/28/24, 10:39 PM Large language models learn to speak biology
Other researchers have scoured Earth for compounds made by bacteria, fungi,
plants and other organisms that could be useful for particular purposes but haven't
yet been discovered. Both approaches have yielded new cancer therapeutics and
other products.
"But at some point, we run out of low-hanging fruit to pick," says Kyunghyun Cho,
a professor of computer science and data science at New York University and
senior director of Frontier Research at Prescient Design, which is part of
Genentech.
Now, generative AI models — similar to the large language model (LLM) that
powers ChatGPT — are being developed to understand the rules and relationships of
DNA, RNA and proteins, and the many functions and properties they produce.
How it works: Humans arrange the 26 letters in the modern English alphabet into,
arguably, about 500,000 words.
LLMs are given text that they then split into characters, words or subwords,
known as tokens.
The AI model then determines the relationships among these tokens and uses
that information to generate original text.
The language of biology contains far fewer letters but produces many more "words"
in the form of proteins.
There are more than 200 million known proteins. AlphaFold, an AI system
developed by DeepMind, can predict the structure of a protein from its amino acid
sequence, one of biology's biggest challenges and most time-consuming tasks.
That leaves a vast space to explore for scientists who want to develop new
proteins that have the properties they want for a novel drug or to engineer cells to
perform different tasks.
What's happening: AI models are being used to map that space to identify changes
in DNA or RNA that underpin disease or alter key processes in a cell — and to use
that information to design new proteins. But scientists doing that face several
hurdles.
They must figure out the best way to break biology's language down into tokens
that the LLM can work with.
They must ensure the AI is able to see the relationships between genes and
elements of genes that affect one another from different places in a long stretch
of DNA, says Joshua Dunn, a molecular and computational biologist at Ginkgo
Bioworks, which uses AI to drive some of its gene designs. It's like having to pull
sentences from different parts of a book to understand its meaning.
Another consideration is that if you read DNA from different starting points, you
can wind up with different proteins — if you start mid-sentence, you get a
different story than if you start at the sentence's beginning.
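The starting-point effect can be seen in a short Python sketch that reads one DNA string in three different "reading frames" using the standard genetic code. The example sequence and the `translate` helper are made up for illustration, and the sketch simply marks stop codons with `*` rather than halting at them.

```python
from itertools import product

# Standard genetic code, with codons listed in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, frame=0):
    """Translate DNA into one-letter amino acid codes, starting at offset `frame`.

    Trailing bases that do not fill a whole codon are ignored.
    """
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

dna = "ATGGCCATTGTAATGGGCCGC"  # hypothetical sequence for illustration

for frame in range(3):
    print(frame, translate(dna, frame))
# 0 MAIVMGR
# 1 WPL*WA
# 2 GHCNGP
```

Shifting the start by a single letter produces an entirely different string of amino acids from the same DNA.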
And while most proteins are encoded in the standard genetic code, others are
transcribed by different "readers" in cells. "That means there are a whole lot of
different languages being spoken at the same time," Dunn says.
Dunn says he is "extremely optimistic that large language models are going to figure
out some of this because they're actually very good at understanding different scales
of meanings spoken in different languages."
But there are open questions about how to tokenize genetic data to capture other
information. For example, a model has to look at a wide enough span of
information to capture the signals spread out across a chromosome — but in a
way that doesn't lose valuable details about mutations to single letters and the
changes they cause. AI models might not be able to rely on tokenization, or might
need to adapt it, to do this, Dunn says.
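One way to picture that tension is to compare single-letter tokens with larger fixed-size chunks, sometimes called k-mers. The Python sketch below is an assumption-laden illustration, not a scheme Dunn or the companies named here describe: the sequence, the mutation position and the helper names are all hypothetical.

```python
def kmer_tokens(seq, k):
    """Split a sequence into non-overlapping chunks of length k.

    The final chunk may be shorter if len(seq) is not a multiple of k.
    """
    return [seq[i:i + k] for i in range(0, len(seq), k)]

def changed_positions(tokens_a, tokens_b):
    """Return the indices at which two equal-length token lists differ."""
    return [i for i, (a, b) in enumerate(zip(tokens_a, tokens_b)) if a != b]

reference = "ATGGCCATTGTAATGGGCCGCTGA"  # hypothetical 24-letter sequence
# Hypothetical single-letter mutation: T becomes A at index 7.
mutant = reference[:7] + "A" + reference[8:]

for k in (1, 6):
    ref_tokens = kmer_tokens(reference, k)
    mut_tokens = kmer_tokens(mutant, k)
    print(k, len(ref_tokens), changed_positions(ref_tokens, mut_tokens))
# 1 24 [7]
# 6 4 [1]
```

With six-letter tokens the model needs only 4 tokens to cover the stretch instead of 24, so a fixed-size window spans more DNA, but the single-letter change now shows up only as one whole token being swapped for another.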
Where it stands: It's early days for AI foundation models in biology, but companies
such as Profluent Bio and Inceptive, along with academic groups, are developing
models for deciphering the language of DNA and designing new proteins.
Yes, but: As with LLMs, there is concern about biased training data based on where
samples are taken from, says Vaneet Aggarwal, a computer scientist and professor at
What's next: Spewing out novel molecules from generative models is only a first
step — and not necessarily the biggest hurdle, Cho says.
The bottom line: LLMs that handle human language are "speeding up what we
already know how to do," Cho says — but with biology, "we're trying to figure out
something we've never figured out ourselves." That means "the burden of validation
is ... enormous."