Nov 17, 2023 - Science

Large language models learn to speak biology


Alison Snyder, author of Axios Science

Illustration: Natalie Peeples/Axios

AI systems that have already made strides learning the language of humans are
being trained to decipher the language of life encoded in DNA — and to use it to try
to design new molecules.

Why it matters: AI that can make sense of biology's information could help
scientists to develop new therapeutics and to engineer cells to produce biofuels,
materials, medicines and other products.

Background: Scientists have for decades worked to reverse engineer cells in order to
design new proteins and improve molecules found in nature, increasingly with the
help of computational tools.

Other researchers have scoured Earth for compounds made by bacteria, fungi, plants and other organisms that could be useful for particular purposes but haven't yet been discovered. Both approaches have yielded new cancer therapeutics and products.

"But at some point, we run out of low-hanging fruit to pick," says Kyunghyun Cho,
a professor of computer science and data science at New York University and
senior director of Frontier Research at Prescient Design, which is part of
Genentech.

Now, generative AI models — similar to the large language model (LLM) that
powers ChatGPT — are being developed to understand the rules and relationships of
DNA, RNA and proteins, and the many functions and properties they produce.

How it works: Humans arrange the 26 letters of the modern English alphabet into roughly — and arguably — 500,000 words.

LLMs are given text that they then split into characters, words or subwords,
known as tokens.

The AI model then determines the relationships among these tokens and uses
that information to generate original text.
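
A toy sketch of that idea in Python (not how any production LLM actually works: real models learn subword vocabularies and use neural networks, but the token-and-relationship loop is the same in spirit): split text into tokens, count which token tends to follow which, then sample new text from those counts.

```python
import random
from collections import defaultdict, Counter

def tokenize(text):
    # Toy word-level tokenizer; real LLMs learn subword vocabularies (e.g. byte-pair encoding).
    return text.lower().split()

def bigram_counts(tokens):
    # Record how often each token follows each other token -- the "relationships."
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length=10):
    # Generate text by repeatedly sampling a likely next token.
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break
        choices, weights = zip(*followers.items())
        out.append(random.choices(choices, weights=weights)[0])
    return " ".join(out)

corpus = "the cell reads the code and the code makes the protein"
print(generate(bigram_counts(tokenize(corpus)), "the"))
```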

The language of biology contains far fewer letters but produces many more "words"
in the form of proteins.

The genetic information carried in DNA is encoded in four molecules: A (adenine), C (cytosine), T (thymine) and G (guanine).

Three-letter combinations of these four bases, called codons, give rise to 20 different amino acids, which are strung together in different orders to make up proteins.
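
A minimal sketch of that mapping in Python, using a deliberately partial codon table (the full standard table covers all 64 codons):

```python
# Partial standard codon table (DNA coding-strand codons -> one-letter amino acids).
CODON_TABLE = {
    "ATG": "M",                          # methionine, also the usual start codon
    "TTT": "F", "TTC": "F",              # phenylalanine
    "GGT": "G", "GGC": "G",              # glycine
    "AAA": "K", "AAG": "K",              # lysine
    "GAA": "E", "GAG": "E",              # glutamate
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}

def translate(dna: str) -> str:
    """Read a coding sequence three letters at a time and emit amino acids."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "?")  # "?" marks codons not in this partial table
        if aa == "*":
            break  # a stop codon ends translation
        protein.append(aa)
    return "".join(protein)

print(translate("ATGTTTGGCAAATAA"))  # -> "MFGK"
```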

There are more than 200 million known proteins. AlphaFold, an AI system developed by DeepMind, can predict the structure of a protein from its amino acid sequence — one of biology's biggest and most time-consuming challenges.

But many orders of magnitude more proteins are theoretically possible.

That leaves a vast space to explore for scientists who want to develop new
proteins that have the properties they want for a novel drug or to engineer cells to
perform different tasks.
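
A back-of-the-envelope calculation shows how vast that space is: even for a fairly short protein, the number of possible amino acid sequences dwarfs the roughly 200 million known proteins.

```python
import math

known_proteins = 200_000_000   # roughly the number of known proteins
amino_acids = 20               # the standard amino acid alphabet
protein_length = 100           # a fairly short protein

possible = amino_acids ** protein_length
print(f"possible sequences of length {protein_length}: about 10^{math.log10(possible):.0f}")
print(f"known proteins: about 10^{math.log10(known_proteins):.1f}")
```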

What's happening: AI models are being used to map that space to identify changes
in DNA or RNA that underpin disease or alter key processes in a cell — and to use
that information to design new proteins. But scientists doing that face several
hurdles.

They must figure out the best way to break biology's language down into tokens
that the LLM can work with.

They must ensure the AI is able to see the relationships between genes and
elements of genes that affect one another from different places in a long stretch
of DNA, says Joshua Dunn, a molecular and computational biologist at Ginkgo
Bioworks, which uses AI to drive some of its gene designs. It's like having to pull
sentences from different parts of a book to understand its meaning.

Another consideration is that if you read DNA from different starting points, you
can wind up with different proteins — if you start mid-sentence, you get a
different story than if you start at the sentence's beginning.
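
A small sketch of that reading-frame effect: shifting the starting offset by a single letter regroups the same DNA into entirely different codons, and therefore different amino acids.

```python
def codons(dna: str, frame: int):
    # Split a DNA string into codons starting at offset 0, 1 or 2 (the reading frame).
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]

seq = "ATGGCATTGAC"
for frame in range(3):
    print(frame, codons(seq, frame))
# frame 0: ['ATG', 'GCA', 'TTG']  -> one set of codons
# frame 1: ['TGG', 'CAT', 'TGA']  -> a completely different set
# frame 2: ['GGC', 'ATT', 'GAC']  -> and another
```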

And while most proteins are encoded in the standard genetic code, others are translated by different "readers" in cells. "That means there are a whole lot of different languages being spoken at the same time," Dunn says.

Dunn says he is "extremely optimistic that large language models are going to figure
out some of this because they're actually very good at understanding different scales
of meanings spoken in different languages."

But there are open questions about how to tokenize genetic data to capture other
information. For example, a model has to look at a wide enough span of
information to capture the signals spread out across a chromosome — but in a
way that doesn't lose valuable details about mutations to single letters and the
changes they cause. AI models might not be able to rely on tokenization, or might need to adapt it, to do this, Dunn says.
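
As a rough illustration of that trade-off (a toy sketch, not how any particular genomic model tokenizes DNA): single-letter tokens preserve point mutations but make long sequences enormous, while fixed-length k-mer tokens compress the sequence but let one mutation change a whole token.

```python
def char_tokens(dna: str):
    # One token per nucleotide: a point mutation changes exactly one token,
    # but a chromosome-scale context becomes millions of tokens long.
    return list(dna)

def kmer_tokens(dna: str, k: int = 6):
    # Non-overlapping k-mer tokens: the sequence fits in ~1/k as many tokens,
    # but a single point mutation rewrites an entire k-mer token.
    return [dna[i:i + k] for i in range(0, len(dna) - k + 1, k)]

seq     = "ATGGCATTGACA"
mutated = "ATGGCATTGACG"    # same sequence with a single-letter change at the end

print(char_tokens(seq))      # 12 single-letter tokens
print(kmer_tokens(seq))      # ['ATGGCA', 'TTGACA']
print(kmer_tokens(mutated))  # ['ATGGCA', 'TTGACG'] -- the mutation flips a whole token
```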

Where it stands: It's early days for AI foundation models in biology, but companies such as Profluent Bio and Inceptive, along with academic groups, are developing models for deciphering the language of DNA and designing new proteins.

HyenaDNA, a "genomic foundation model" developed by researchers at Stanford University, learns how DNA sequences are distributed, how genes are encoded and how the regions between protein-coding stretches regulate a gene's expression.

Yes, but: Like with LLMs, there is concern about biased training data based on where
samples are taken from, says Vaneet Aggarwal, a computer scientist and professor at Purdue University who has worked on AI models to understand the language of DNA.

What's next: Spewing out novel molecules from generative models is only a first
step — and not necessarily the biggest hurdle, Cho says.

Candidate molecules have to go through several more phases of development to filter out the most promising ones for experimental testing in the lab, he says.

The bottom line: LLMs that handle human language are "speeding up what we
already know how to do," Cho says — but with biology, "we're trying to figure out
something we've never figured out ourselves." That means "the burden of validation
is ... enormous."
