John M. Barnar IRFS 040610

Searching the Atoms and
Bonds in Chemical Patents

Presented at IRF Symposium
Vienna, Austria
4 June 2010
Dr John M. Barnard
Scientific Director
Digital Chemistry Ltd., UK
www.digitalchemistry.co.uk
Outline
• Chemical structures in patents

• Principles of searching for chemical structures
• History of chemical structure searching in patents
• Current developments
– specific structures vs. Markush structures
– automatic analysis vs. manual curation
– online systems vs. in-house systems
• Retrieval performance evaluation
2
Chemical structures in patents
The most important information in a chemical patent is often the chemical structure disclosed or claimed.
– specifies atoms and bonds present and the way they are connected
– integrated mixture of text and structure diagrams
3
Markush structures
Classes of molecules with common structural features
– may cover millions (or infinite numbers) of specific structures
– allow protection of related molecules with common properties
– named after inventor involved in US legal case in 1924
Patents may include both Markush structure claim and exemplified specific structures.
[Figure: example Markush structure diagram (atoms shown include CH3, O and N) with variable groups R1, R2 and R3]
R1 = phenyl / cyclohexyl / ...
R2 = H / methyl / ...
R3 = H / Cl / NO2 / ...
[Photo: Dr Eugene Markush (1887-1968)]
4
Chemical structures in patents
[Figure: annotated patent extract; labels: Markush structure, Specific structure, name]
5
Markush structures
Specific structures can be generated by combinatorial
assembly of alternatives for each R-group
[Figure: annotated Markush structure; labels: variable multiplicity, non-structural description, generic groups, specific groups, variable-position attachment]
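The combinatorial assembly described on this slide can be sketched in a few lines of Python. This is only an illustration: the R-group table is hypothetical, it reuses the alternatives shown on the earlier Markush slide with the open-ended "..." lists cut short, and it treats each alternative as a plain label rather than a real structure.

```python
from itertools import product
from math import prod

# Hypothetical R-group table (assumption: lists truncated, labels only).
r_groups = {
    "R1": ["phenyl", "cyclohexyl"],
    "R2": ["H", "methyl"],
    "R3": ["H", "Cl", "NO2"],
}

def enumerate_specifics(groups):
    """Yield one dict per specific structure: one choice for each R-group."""
    names = sorted(groups)
    for choice in product(*(groups[name] for name in names)):
        yield dict(zip(names, choice))

specifics = list(enumerate_specifics(r_groups))

# The number of specific structures is the product of the list lengths,
# which is why real Markush structures can cover millions of molecules.
count = prod(len(alternatives) for alternatives in r_groups.values())
print(len(specifics), count)
```

Adding one more alternative to any R-group multiplies the total, so exhaustive enumeration quickly becomes impractical for broad claims.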
6
Substructure search
• Search a database of chemical structures for all those containing a specified pattern of atoms and bonds (substructure)
[Figure: a query substructure and three database molecules, captioned: Retrieved molecule; Retrieved molecule; Molecule not retrieved]
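A minimal sketch of the matching step, assuming molecules are stored as labelled graphs (a list of element symbols plus a set of bonded index pairs). The function name, the data layout and the example molecules are illustrative assumptions; bond orders and hydrogens are ignored, and production systems use far more refined algorithms.

```python
def bonds(*pairs):
    """Build an undirected bond set from (i, j) atom-index pairs."""
    return {frozenset(pair) for pair in pairs}

def has_substructure(query, target):
    """Return True if the query atoms and bonds can be mapped onto the target.

    Simple backtracking: query atoms are placed one at a time, and each
    placement must preserve every bond to the atoms already placed.
    """
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target

    def extend(mapping):
        q = len(mapping)  # index of the next query atom to place
        if q == len(q_atoms):
            return True
        for t in range(len(t_atoms)):
            if t in mapping or t_atoms[t] != q_atoms[q]:
                continue
            bonded_ok = all(
                frozenset((mapping[p], t)) in t_bonds
                for p in range(q)
                if frozenset((p, q)) in q_bonds
            )
            if bonded_ok and extend(mapping + [t]):
                return True
        return False

    return extend([])

# Query pattern N-C-O (hypothetical example).
query = (["N", "C", "O"], bonds((0, 1), (1, 2)))
# Acetamide skeleton: C-C(-O)-N, so N and O share a carbon.
acetamide = (["C", "C", "O", "N"], bonds((0, 1), (1, 2), (1, 3)))
# Ethanolamine skeleton: N-C-C-O, same elements but different connectivity.
ethanolamine = (["N", "C", "C", "O"], bonds((0, 1), (1, 2), (2, 3)))

print(has_substructure(query, acetamide), has_substructure(query, ethanolamine))
```

The second molecule contains the same atom types as the query but is not retrieved, because the required connections are absent.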

7
Substructure search
• Originally applied to databases of specific structures (single, fully-defined molecules)
Exact and deterministic search algorithms
– based on topological graph theory
– 100% recall
– 100% precision
Search retrieves all database molecules that contain the query substructure and none of those that don't.
• Substructure search also possible for Markush structures, but more complicated.
8
Patent searching – before 1980
Chemical Fragmentation Codes
– substructure fragments used as index terms
– manually assigned by expert coders
– applied both to specific and Markush structures
– search uses Boolean logic for required combinations
Connectivity / alternativeness relationships between fragments usually lost: Poor Precision
Fragment codes were originally designed for punched cards.
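The precision problem with fragment codes can be illustrated with a small sketch. The code names and the three indexed documents are hypothetical, and real coding schemes are much larger, but the Boolean logic is the same: a document is retrieved if it carries all the required codes, whether or not the fragments are actually connected.

```python
# Hypothetical fragment-code index (assumption: invented codes and documents).
index = {
    "patent_A": {"AMINE", "CARBONYL", "BENZENE"},  # N and C=O adjacent (an amide)
    "patent_B": {"AMINE", "CARBONYL"},             # N and C=O in unrelated positions
    "patent_C": {"BENZENE", "HALOGEN"},
}

def boolean_search(index, required, excluded=frozenset()):
    """Return documents that have all required codes and no excluded code."""
    return sorted(
        doc
        for doc, codes in index.items()
        if required <= codes and not (excluded & codes)
    )

# A search intended to find amides can only ask for both fragments.
hits = boolean_search(index, {"AMINE", "CARBONYL"})
print(hits)
```

Both patent_A and patent_B are returned, although only the first has the two fragments joined; the index cannot tell them apart.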
9
Patent searching – the 1980s
"Topological" / graphical systems introduced
– with display of structure diagrams
Specific Structures:
Initial work with non-patent databases
– journal literature
– "in-house" structures
Commercial systems operational by start of decade
– "public" databases
• CAS Online
• Système DARC
– "in-house" data
• MDL MACCS etc.
Markush Structures:
Sheffield University
– academic research on patent Markush storage and retrieval
Commercial systems and databases launched at end of decade
– Markush DARC (Derwent / Questel / INPI)
– MARPAT (Chemical Abstracts)
10
Patent searching – since 1990
Commercial search systems:
Little change
– still only available online with proprietary databases
– showing their age with clunky interfaces
– fragment code systems still widely used
Databases:
Machine-readable patent documents available direct from patent offices
Some automation in database creation
New databases of specific structures from patents
– Reaxys (formerly MDL/Elsevier Chemical Patent Database)
– SURECHEM
11
Related developments
Markush searching:
Markush applications outside patent field
– informatics for "combinatorial libraries"
– specific structure enumeration
– physicochemical property calculation
New "in-house" systems for patent Markush search under development
Data mining:
Chemical data extraction from free text and diagrams
– structure diagram "OCR"
– chemical nomenclature translation
Research work on capture of Markush structures from free-text patents
12
Which way forward?
Which structures to index and search?
Exemplified / enumerated specific structures or Markush structure
Markush structures cover the scope of the patent more comprehensively (better recall), but are more complicated to search, and can lead to poor retrieval precision.
How to build the databases?
Manual input and curation or Automatic analysis of full text patent
At least at present, searchers regard curated databases as the "gold standard" for retrieval performance.
13
Different approaches
[Figure: systems placed on two axes: Specific Structures vs. Markush Structures, and Manually Curated vs. Automatically Extracted. Items shown: Derwent Chemistry Resource, CA Registry, SureChem, IBM, Reaxys, CLiDE, DecrIPt, chemoCR, MMS, MARPAT. Legend: Databases; Data-mining software]
14
Using specific structures
Conventional approach:
Extract specific structures from patent
– manual curation
• CA Registry
• Derwent Chemistry Resource
– automatic extraction
• SureChem
• IBM
– combination of both
• Reaxys
Search using standard substructure search software
Issues:
Selection of compounds
– exemplified
– "prophetic"
– anything with a name
Effectiveness of automatic nomenclature identification and translation
Correctness of systematic names in patent document
15
Using specific structures
Other "text analytics" approaches
IBM Work:
– Automatic chemical name to structure conversion
– Vector representation derived from IUPAC International Chemical Identifier (InChI)
– Structural similarity search based on comparison of vector representations
Accelrys/Notiora Work:
– Automatic chemical name to structure conversion
– Structure "fingerprints" for each molecule based on substructure fragments
– Logical "OR" of fingerprints for whole patent
– Structural similarity search based on logical "OR" fingerprints and maximum common substructure
16
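The whole-patent fingerprint idea can be sketched as follows. The fragment dictionary and the molecules are hypothetical, and the Tanimoto coefficient is used here only as a common choice of similarity measure; the slides do not say which measure the Accelrys/Notiora work used, and the maximum common substructure step is not shown.

```python
from functools import reduce
from operator import or_

# Hypothetical fragment dictionary: one bit position per fragment.
FRAGMENTS = ["C=O", "N-H", "c1ccccc1", "Cl", "O-H"]

def fingerprint(fragments_present):
    """Encode a set of fragment labels as an integer bit string."""
    bits = 0
    for position, fragment in enumerate(FRAGMENTS):
        if fragment in fragments_present:
            bits |= 1 << position
    return bits

def patent_fingerprint(molecule_fingerprints):
    """Logical OR of the fingerprints of every molecule in one patent."""
    return reduce(or_, molecule_fingerprints, 0)

def tanimoto(a, b):
    """Shared bits divided by total distinct bits (assumed measure)."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0

mol1 = fingerprint({"C=O", "N-H"})
mol2 = fingerprint({"c1ccccc1", "Cl"})
patent = patent_fingerprint([mol1, mol2])

query = fingerprint({"C=O", "c1ccccc1"})
print(tanimoto(query, patent), tanimoto(query, mol1))
```

In this toy case the query scores higher against the OR-ed patent fingerprint than against either molecule, even though no single molecule contains both query fragments, which shows how the whole-patent fingerprint trades precision for coverage.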
Using Markush structures
Existing systems:
Two online systems/databases available since late 1980s
– Merged Markush Service (Thomson Reuters / Markush DARC)
– MARPAT (Chemical Abstracts Service/STN)
Problems:
Excessively broad Markushes defy existing systems, and give poor recall / precision
– example: "R1 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic, carbocyclic or heterocyclic ring system, or ..."
Searchers often faced with manually sifting 1000+ hits to find 5 or 6 relevant patents
17
Searchers' comments
Discussion in "breakout group" at International Patent Information Conference (IPI-Confex), Venice, Mar 2009
– Multiple search tools are needed for comprehensive retrieval
– Search strategies need to focus on the core structure of interest and put up with poor precision
– Current systems based on automatic extraction and analysis of nomenclature have limited usefulness
Suggestions for improvement:
– ranking of search output
– more comprehensive indexing of specific structures
18
In-house Markush systems?
Advantages over existing online systems
Informatics support for drug discovery:
Integration of patent data with other chemical databases
– end-user chemist access to patent data
Adding patentability criteria to drug design
Adjunct or preliminary to existing systems
– confidentiality advantages
Data mining:
Possible use of structural similarity and cluster analysis techniques
Structure activity analysis
Physico-chemical property calculation
Competitive intelligence
Identification of unpatented "gaps" in chemical space
19
In-house Markush systems?
Prospects
Software:
New Markush search systems under development
– Digital Chemistry Ltd.
– ChemAxon
Also work on selective enumeration of specific structures from Markush
– DecrIPt Inc.
Databases:
Existing curated databases
– Thomson Reuters have expressed interest in making MMS data available
– MARPAT database another obvious possibility
"Home-grown" databases for specialist purposes
– input software needed
Automatic extraction from patent documents
20
Automatic Markush extraction
Currently a "hot area" for research, after a fallow period
– complex combined issues of text and image processing, nomenclature translation and semantic analysis
Sheffield University
– 3 publications (1992-97), initially analysing Derwent patent abstracts.
CLiDE Pro (KeyModule Ltd.)
– Work by A.P. Johnson (2009) extending earlier chemical OCR software.
ChemProspector (InfoChem)
– Ongoing research into extraction of Markush structures from patents.
Cambridge University, Unilever Centre for Molecular Informatics
– Ongoing work by Murray-Rust group on analysis of full-text patents, extending OPSIN nomenclature translation program.
chemoCR (Fraunhofer SCAI)
– Recent work on prototype software for Markush "reconstruction" from patent text, with limited success.
Commercially-viable operational systems probably still some way off.
21
Precision and recall
Patent databases with specific structures
Substructure search finds all molecules in database that contain query substructure: 100% precision, 100% recall
Database contains irrelevant or trivial molecules from patent text: Poor precision
Database omits molecules covered by Markush structure: Poor recall
Database contains incorrect molecules (errors in nomenclature identification / translation): Poor recall
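For reference, the two measures used on this slide can be computed as below. This is a generic sketch using the standard definitions; the document identifiers and counts are invented, loosely echoing the searchers' experience of sifting 1000+ hits for a handful of relevant patents.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision = relevant hits / everything retrieved
    recall    = relevant hits / everything that should have been found
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: 1000 hits, 6 of them relevant, 2 relevant patents missed.
retrieved_ids = range(1000)
relevant_ids = [0, 1, 2, 3, 4, 5, 2000, 2001]

precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(precision, recall)
```

The search algorithm itself can be exact and still score poorly on both measures when the database contains irrelevant molecules or omits relevant ones.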
22
Precision and recall
Patent databases with Markush structures
Query substructure matches highly generic description in unimportant part of Markush: Poor precision
[Figure: query structure] matches "R84 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic, carbocyclic or heterocyclic ring system, or ..."
Using system search options to avoid matches with highly generic descriptions: Poor recall
– "broad/narrow translation" (DARC)
– "match level" (MARPAT)
23
Patent search evaluation
Chemical substructure search systems usually give 100% precision and 100% recall
– retrieval performance evaluation not important
Not really true for chemical patent searches
– much more room for argument about whether or not a hit is relevant
Designers of chemical patent search systems may need to pay more attention to performance evaluation
– Precision / Recall trade-off
– Consideration of the relative importance of different parts of Markush (what the patent "teaches")
– Evaluation of hit relevance in context of type of query
24
TREC-CHEM
– Multi-year evaluation project under auspices of long-running Text Retrieval Conferences (TREC)
– Uses chemical patent data and queries with relevance judgements
– Used to compare retrieval experiments performed by different groups
Results from first year (2009) presented elsewhere
– most search approaches based on automatic analysis of patent text, nomenclature extraction etc.
– some issues identified concerning automated relevance judgements based on cited prior art documents
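As an illustration of how ranked output can be scored against relevance judgements, the sketch below computes average precision, a measure widely used in TREC-style evaluations. Whether TREC-CHEM used exactly this measure is not stated in these slides, and the patent identifiers are invented.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears.

    Relevant items that are never retrieved contribute zero, so the sum is
    divided by the total number of relevant items.
    """
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical ranked output from two systems for the same query.
judged_relevant = {"p1", "p2"}
system_a = ["p3", "p1", "p7", "p2"]
system_b = ["p1", "p2", "p3", "p7"]

print(average_precision(system_a, judged_relevant),
      average_precision(system_b, judged_relevant))
```

Both systems retrieve the same four documents, but the one that ranks the relevant patents first scores higher, which is the kind of difference a ranking-aware evaluation is meant to expose.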
25
TREC-CHEM
TREC-CHEM is not using data from curated databases
– these are the current "industry standard" against
which new approaches will ultimately be judged
– TREC rules do not allow commercial databases
to be included
It would be valuable to apply TREC-type evaluation to search systems for patent chemistry that are based on commercial databases
• MARPAT vs. MMS/Markush DARC
• existing systems and databases vs. new ones using new
techniques, automated data extraction etc.
There are many potential benefits to practising patent searchers in
involving cheminformaticians in the IR-IP debate for which the IRF
provides a forum.
26
Contact details
Dr John M. Barnard
Scientific Director, Digital Chemistry Ltd.
46 Uppergate Road, Sheffield S6 6BX, UK
john.barnard@digitalchemistry.co.uk
+44 (0)114 233 3170
27
