Features Editor: Dennis Taylor

Practical Pattern Matching

Danna Voth

Humans are fascinated by patterns, and they can spot them well; in fact, that’s one area where humans excel over computers. But research is producing interesting competition as scientists discover and employ new methods of automated pattern recognition. Practical applications include finding genes, detecting cancer-causing chemicals in molecules, searching out potential terrorists, and predicting terrorist threat levels, as well as recognizing speech patterns and creating a nanotechnology resource library.

No new genes?

At the University of Toronto, Brendan Frey is leading a group of scientists who are using AI techniques to analyze molecular-biology data. One of their projects involves using a factor graph they developed called GenRate to discover and evaluate genes in mouse tissues. Factor graphs let researchers describe a system with complex variables, such as gene location in DNA as well as gene length and function.

“What a factor graph is useful for,” says Frey, “is describing a scoring function that tells you how good each setting of the variables is.”

Using samples from over 1 million probes along DNA in 37 different mouse tissues, the scientists used their factor graph to determine which bits of DNA are expressed, or activated to read protein. In some tissues, the DNA is expressed; in others, it might not be. DNA parts that have no function are never activated.

In the factor graph, each variable is a node. The scoring function comprises many local scoring functions that look for a small number of variables. For that small set of variables, it finds a score for each configuration of those variables. The local scores’ sum is the total score. “It’s a nice way to decompose a very complex problem into a whole bunch of simpler problems,” Frey says. The scientists then compare the factor graph data to known gene patterns.

Because the factor graph provides a computational framework for vetting the best configuration of variables as well as discovering them, the team came up with surprising results that led to a major revision of the view of the mammalian genome. Although some research claims many genes are left to discover, Frey’s team has shown that might not be true. “Beyond the genes we found,” Frey says, “we don’t believe there exists many new protein-coding genes.”

Cancer detection

At the University of Texas at Arlington, Lawrence Holder has developed Subdue, another pattern recognition system based on graphs. A data mining system that represents data as a collection of nodes and links between the nodes, Subdue works by searching through graphic data using a heuristic based on the notion of compression.

After researchers input a big graph into the system and run a search, Subdue finds a pattern that has several instances in the graph. The system then replaces all of those instances by a single node, making the graph smaller. The larger the pattern, and the more instances it has, the more compression you get. “The more it compresses, the more we’re interested in it,” Holder says.

A practical application of Subdue examines a chemical structure to determine whether it causes cancer. The system represents the chemical in terms of its atoms and the bonds between them (the atoms are nodes in the graph, and the bonds are links). For the system to learn, researchers input many cancer-causing chemicals as graphs, which the system searches to find recurring patterns, or subgraphs. It then searches through the space of subgraphs to find a pattern that shows up a lot. That pattern is then matched against the new chemical’s structure.

“The interpretation would be if this submolecule shows up in 90 percent of these chemicals that cause cancer, then it may be predictive,” Holder says. So, if a new chemical contains the subgraph, he says, “you might predict that this chemical may cause cancer, and you may want to go off and test it in the laboratory.”

Holder has tested the system in the American Cancer Institute’s predictive-toxicology challenge. The Institute releases information on both a set of chemicals that it has determined to be cancer causing and a set that isn’t. Participants speculate which chemicals cause cancer, and the one with the most correct guesses wins the challenge. Holder won the competition in 2000.
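
The compression idea behind Subdue can be illustrated with a short Python sketch. This is not Subdue's actual code: the function names, the toy graph, and the simple "nodes plus links" size measure are all assumptions made for illustration, and candidate patterns are restricted to single labeled links so that no general subgraph matching is needed. The sketch counts non-overlapping instances of a pattern, imagines each instance collapsed into one node, and scores the pattern by how much smaller the graph becomes.

```python
# Toy illustration of a compression-style pattern score (hypothetical names,
# not Subdue's implementation). A pattern here is one link between two
# labeled nodes, for example a ("C", "O") bond.

def find_instances(labels, edges, pattern):
    """Greedily collect non-overlapping links whose end labels match pattern."""
    wanted = tuple(sorted(pattern))
    used = set()
    instances = []
    for u, v in edges:
        if u in used or v in used:
            continue  # each node may belong to at most one instance
        if tuple(sorted((labels[u], labels[v]))) == wanted:
            instances.append((u, v))
            used.update((u, v))
    return instances

def compression_value(labels, edges, pattern):
    """Original size divided by (pattern size + size after collapsing)."""
    original = len(labels) + len(edges)   # assumed size: nodes + links
    pattern_size = 2 + 1                  # two nodes and one link
    k = len(find_instances(labels, edges, pattern))
    # Collapsing one instance turns 2 nodes + 1 link into 1 node: saves 2.
    compressed = original - 2 * k
    return original / (pattern_size + compressed)

def best_edge_pattern(labels, edges):
    """Return the single-link pattern with the highest compression value."""
    candidates = {tuple(sorted((labels[u], labels[v]))) for u, v in edges}
    return max(candidates, key=lambda p: compression_value(labels, edges, p))

# Toy "molecule": a C-O-C-O-C-O chain with one R group attached to node 0.
LABELS = {0: "C", 1: "O", 2: "C", 3: "O", 4: "C", 5: "O", 6: "R"}
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (6, 0)]

if __name__ == "__main__":
    for pattern in [("C", "O"), ("C", "R")]:
        print(pattern, round(compression_value(LABELS, EDGES, pattern), 3))
    print("best:", best_edge_pattern(LABELS, EDGES))
```

In this toy graph the C-O link has three non-overlapping instances and scores above 1 (the graph gets smaller overall), while the C-R link occurs once and scores below 1, which mirrors the point that larger, more frequent patterns compress more.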
[Figure: chemical-structure graph with atoms labeled Ga, O, C, and R.] The Subdue pattern-recognition system searches through graphic data to find a pattern that has several instances. In this chemical-structure example, the atoms are nodes in the graph and the bonds are links in the graph.

Terrorists targeted

Subdue is also useful for detecting patterns of potential terrorist activity and locating potential terrorist networks. Holder trained his system on simulated data that the US Air Force’s Evidence Assessment, Grouping, Linking, and Evaluation program created. The domain simulates the evidence available about terrorist groups and their plans before they put them into action.

Following a general plan of starting a group, recruiting members, acquiring resources, communicating, visiting targets, and transferring resources between actors, groups, and targets, the domain contains
numerous concepts. The concepts include threat and nonthreat actors and threat and nonthreat groups.

Trained on patterns that give examples of threat potential, Subdue searched the simulated data to find similar types of patterns. The system achieved 78 to 93 percent accuracy discriminating threat from nonthreat groups.

Language learning

Cornell University professor Shimon Edelman, in collaboration with colleagues at Tel Aviv University, has created a program that can discover patterns in languages, learn them as grammars, and then generate sentences of its own in that language. The system is called ADIOS (automatic distillation of structure), and it has been tested on both natural languages, such as English and Chinese, and artificial grammars, such as those in DNA and music.

“You can only recognize patterns if you have the right primitives, the right features,” Edelman says. “It’s like having the right glasses.”

Language contains patterns that on its face are invisible, and it’s generally thought to possess structure beyond just the serial order of words, a grammar. “The true structure of the sentence is a kind of tree,” Edelman says. ADIOS combines statistics and rules applied to a body of text in a language to discover the grammar. The system can then generate sentences in that language.

“It can do things like assign structure to a new sentence,” Edelman says. “It’s not such a big deal to recognize a pattern on which you’ve trained your system.” The team is patenting the technology, and Edelman wants to put it to commercial use. One possible arena is speech recognition technologies.

Patterns in theory

Research on the theoretical aspect of pattern discovery is also generating useful applications. At the University of California, Davis, Jim Crutchfield has leveraged his interest in “what a pattern is” to apply a pattern’s abstract definition, which he calls a causal state, to different kinds of processes. He defines causal states as groups of histories that lead to the same knowledge about the future.

The mathematical theory that defines the causal states leads to a small number of possible ways to find the causal states. The mathematical definition of being in the same state of predictability about the future means that, by looking at data, you can estimate and make predictions about the future on the basis of different points in time having different histories. From that definition, Crutchfield derived an algorithm that describes how to group histories that provide knowledge about the future when those histories are basically predictively equivalent. “We just apply this in these different domains,” he says, “whether it’s a spatial pattern, like cellular automata, or time series, or looking at complex materials.”

Crutchfield has applied his causal-state definition to examining dynamic systems, irregular crystals, hidden Markov models, and cellular automata. One field of application is quantum computation.

“A current proposal for implementing molecular computers is to look at very long chain molecules and to design the interactions between the atoms in the molecular chain so that they implement various of these cellular-automata rules,” Crutchfield says. “So this pattern discovery system that we have for cellular automata is making a catalog of all the possible kinds of interactions and what sorts of information storage structures they can produce, and how those information storage structures can be moved around and interacted, and how they interact to process information.”

Crutchfield is working on developing a library called the Encyclopedia of Cellular Automata. “It will be a resource for people working in nanotechnology,” he says, “to look at how to design molecular systems that have only local interactions but that will produce in their behavior large-scale structures that can be used for doing computations.”

According to Nello Cristiani, associate professor of statistics at UC Davis, people have always been attracted to patterns and pattern recognition. “In a way, this is the essence of science and most cognitive processes, such as generalization,” Cristiani says. “Now this activity has been automatized, and we rely heavily on it as a society.

“There would be no genome project, no speech recognition, and probably no credit card system without it. The last decade has seen a revolution in pattern recognition technology. Machine learning algorithms are now faster, simpler, and more accurate in generalization.”

