
Searching the Atoms and Bonds in Chemical Patents


Presented at IRF Symposium
Vienna, Austria
4 June 2010

Dr John M. Barnard
Scientific Director
Digital Chemistry Ltd., UK

www.digitalchemistry.co.uk
Outline

• Chemical structures in patents


• Principles of searching for chemical structures
• History of chemical structure searching in patents
• Current developments
– specific structures vs. Markush structures
– automatic analysis vs. manual curation
– online systems vs. in-house systems
• Retrieval performance evaluation

Chemical structures in patents

The most important information in a chemical patent is often the chemical structure disclosed or claimed.
- specifies atoms and bonds present and the way they are connected
- integrated mixture of text and structure diagrams

Markush structures

Classes of molecules with common structural features
– may cover millions (or infinite numbers) of specific structures
– allow protection of related molecules with common properties
– named after inventor involved in US legal case in 1924

[Structure diagram: example Markush structure with variable groups R1, R2 and R3]
R1 = phenyl / cyclohexyl / ...
R2 = H / methyl / ...
R3 = H / Cl / NO2 / ...

Patents may include both Markush structure claim and exemplified specific structures.

[Photo: Dr Eugene Markush (1887-1968)]

Chemical structures in patents

[Figure: patent extract with callouts labelled "Markush structure" and "Specific structure name"]

Markush structures
Specific structures can be generated by combinatorial
assembly of alternatives for each R-group

[Diagram: example Markush structure annotated with the features: variable multiplicity, non-structural description, generic groups, specific groups, variable-position attachment]

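As a rough illustration of combinatorial assembly, the sketch below enumerates every combination of R-group alternatives, using the example lists from the earlier Markush slide (R1 = phenyl / cyclohexyl, R2 = H / methyl, R3 = H / Cl / NO2) with the open-ended "..." alternatives left out. The dictionary layout and the function name are assumptions made for illustration; a real system would assemble connection tables rather than text labels.

```python
from itertools import product

# R-group alternatives taken from the example Markush structure;
# the open-ended "..." alternatives are deliberately omitted.
R_GROUPS = {
    "R1": ["phenyl", "cyclohexyl"],
    "R2": ["H", "methyl"],
    "R3": ["H", "Cl", "NO2"],
}


def enumerate_specifics(r_groups):
    """Yield one dict per specific structure (one choice per R-group)."""
    names = list(r_groups)
    for choice in product(*(r_groups[name] for name in names)):
        yield dict(zip(names, choice))


specifics = list(enumerate_specifics(R_GROUPS))
print(len(specifics))  # 2 * 2 * 3 = 12 combinations
print(specifics[0])
```

The count is the product of the list lengths, so a handful of R-groups with long or open-ended lists quickly reaches the millions of specific structures mentioned earlier.
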
Substructure search

• Search a database of chemical structures for all those containing a specified pattern of atoms and bonds (substructure)

[Figure: a query substructure and three database molecules, two labelled "Retrieved molecule" and one labelled "Molecule not retrieved"]

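Substructure search can be viewed as subgraph matching on atom-and-bond graphs. The sketch below is a naive backtracking matcher, given only to make the idea concrete; the slides do not specify an algorithm, and production systems use far more optimized methods. The molecule encoding (a list of element symbols plus a dictionary of bond orders, hydrogens implicit, aromaticity ignored) and the three example molecules are assumptions for illustration, not the molecules in the figure.

```python
# A molecule is (atoms, bonds): element symbols plus {(i, j): bond_order}.
# Hydrogens are left implicit and aromaticity is ignored in this sketch.
AMIDE = (["C", "O", "N"], {(0, 1): 2, (0, 2): 1})  # query: C(=O)N
DATABASE = {
    "acetamide": (["C", "C", "O", "N"], {(0, 1): 1, (1, 2): 2, (1, 3): 1}),
    "urea": (["N", "C", "O", "N"], {(0, 1): 1, (1, 2): 2, (1, 3): 1}),
    "aminoacetone": (["N", "C", "C", "O", "C"],
                     {(0, 1): 1, (1, 2): 1, (2, 3): 2, (2, 4): 1}),
}


def bond_order(bonds, a, b):
    """Bond order between atoms a and b, or None if they are not bonded."""
    return bonds.get((a, b)) or bonds.get((b, a))


def contains_substructure(query, target):
    """True if every query atom and bond can be mapped onto the target."""
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target

    def extend(mapping):
        q = len(mapping)  # index of the next query atom to place
        if q == len(q_atoms):
            return True
        for t in range(len(t_atoms)):
            if t in mapping or t_atoms[t] != q_atoms[q]:
                continue
            # Each query bond to an already placed atom must exist,
            # with the same order, between the mapped target atoms.
            ok = all(
                bond_order(q_bonds, q, p) is None
                or bond_order(q_bonds, q, p) == bond_order(t_bonds, t, mapping[p])
                for p in range(q)
            )
            if ok and extend(mapping + [t]):
                return True
        return False

    return extend([])


hits = [name for name, mol in DATABASE.items() if contains_substructure(AMIDE, mol)]
print(hits)  # ['acetamide', 'urea']
```

Aminoacetone contains both a carbonyl group and a nitrogen, but they are not bonded to each other, so it is correctly rejected: the match depends on connectivity, not just on which atoms are present.
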
Substructure search

• Originally applied to databases of specific structures (single, fully-defined molecules)
Exact and deterministic search algorithms
– based on topological graph theory
– 100% recall
– 100% precision
Search retrieves all database molecules that contain the query substructure and none of those that don't.
• Substructure search also possible for Markush structures, but more complicated.

Patent searching – before 1980

Chemical Fragmentation Codes
– substructure fragments used as index terms
– manually assigned by expert coders
– applied both to specific and Markush structures
– search uses Boolean logic for required combinations
Connectivity / alternativeness relationships between fragments usually lost
– Poor precision
Fragment codes were originally designed for punched cards.

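The loss of precision can be seen in a toy fragment-code index. The codes, patent labels and function name below are hypothetical and chosen only for illustration; real coding schemes are far larger. The point of the sketch is that a Boolean AND over codes cannot tell whether two fragments are actually connected in the same molecule.

```python
# Hypothetical fragment codes assigned to three documents.
INDEX = {
    "patent A": {"CARBONYL", "AMINE"},    # amide: C=O bonded directly to N
    "patent B": {"CARBONYL", "AMINE"},    # ketone and amine in separate parts
    "patent C": {"CARBONYL", "HALOGEN"},
}


def boolean_and_search(index, required):
    """Return documents whose code set includes every required code."""
    return sorted(doc for doc, codes in index.items() if required <= codes)


hits = boolean_and_search(INDEX, {"CARBONYL", "AMINE"})
print(hits)  # ['patent A', 'patent B']
```

A searcher looking for amides gets patent B as a false hit, because the index records that both fragments occur but not how they are joined.
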
Patent searching – the 1980s
"Topological" / graphical systems introduced
- with display of structure diagrams

Specific Structures
Initial work with non-patent databases
– journal literature
– "in-house" structures
Commercial systems operational by start of decade
– "public" databases
• CAS Online
• Système DARC
– "in-house" data
• MDL MACCS etc.

Markush Structures
Sheffield University
– academic research on patent Markush storage and retrieval
Commercial systems and databases launched at end of decade
– Markush DARC (Derwent / Questel / INPI)
– MARPAT (Chemical Abstracts)

Patent searching – since 1990

Commercial search systems
Little change
– still only available online with proprietary databases
– showing their age with clunky interfaces
– fragment code systems still widely used

Databases
Machine-readable patent documents available direct from patent offices
Some automation in database creation
New databases of specific structures from patents
– Reaxys (formerly MDL/Elsevier Chemical Patent Database)
– SURECHEM

Related developments

Markush searching
Markush applications outside patent field
– informatics for "combinatorial libraries"
– specific structure enumeration
– physicochemical property calculation
New "in-house" systems for patent Markush search under development

Data mining
Chemical data extraction from free text and diagrams
– structure diagram "OCR"
– chemical nomenclature translation
Research work on capture of Markush structures from free-text patents

Which way forward?

Which structures to index and search?
Exemplified / enumerated specific structures or Markush structure
Markush structures cover the scope of the patent more comprehensively (better recall), but are more complicated to search, and can lead to poor retrieval precision.

How to build the databases?
Manual input and curation or automatic analysis of full text patent
At least at present, searchers regard curated databases as the "gold standard" for retrieval performance.

Different approaches
[Figure: systems positioned on two axes, Specific Structures vs. Markush Structures and Manually Curated vs. Automatically Extracted. Legend: databases; data-mining software. Systems shown: Derwent Chemistry Resource, CA Registry, SureChem, IBM, Reaxys, CLiDE, DecrIPt, chemoCR, MMS, MARPAT]

Using specific structures

Conventional approach
Extract specific structures from patent
– manual curation
• CA Registry
• Derwent Chemistry Resource
– automatic extraction
• SureChem
• IBM
– combination of both
• Reaxys
Search using standard substructure search software

Issues
Selection of compounds
– exemplified
– "prophetic"
– anything with a name
Effectiveness of automatic nomenclature identification and translation
Correctness of systematic names in patent document

Using specific structures

Other "text analytics" approaches

IBM Work
Automatic chemical name to structure conversion
Vector representation derived from IUPAC International Chemical Identifier (InChI)
Structural similarity search based on comparison of vector representations

Accelrys/Notiora Work
Automatic chemical name to structure conversion
Structure "fingerprints" for each molecule based on substructure fragments
Logical "OR" of fingerprints for whole patent
Structural similarity search based on logical "OR" fingerprints and maximum common substructure

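The "logical OR of fingerprints" idea can be sketched with integer bit masks. Everything specific here is an assumption for illustration: the five-fragment vocabulary, the example molecules, and the use of the Tanimoto coefficient as the similarity measure (the slide does not name one). The maximum common substructure step is not shown.

```python
def fingerprint(fragments, vocabulary):
    """Set bit i when the molecule contains vocabulary[i]."""
    bits = 0
    for i, frag in enumerate(vocabulary):
        if frag in fragments:
            bits |= 1 << i
    return bits


def tanimoto(a, b):
    """Shared bits divided by bits set in either fingerprint."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


# Hypothetical fragment vocabulary and molecules.
VOCAB = ["C=O", "N-H", "aromatic ring", "C-Cl", "O-H"]
patent_molecules = [{"C=O", "N-H"}, {"aromatic ring", "C-Cl"}]

# One fingerprint for the whole patent: OR together its molecules.
patent_fp = 0
for frags in patent_molecules:
    patent_fp |= fingerprint(frags, VOCAB)

query_fp = fingerprint({"C=O", "N-H", "O-H"}, VOCAB)
print(tanimoto(query_fp, patent_fp))  # 2 shared bits out of 5 set overall
```

Because the patent fingerprint merges all of its molecules, a query can score well against a patent even when no single molecule in it contains all the matching fragments, which is one reason a follow-up structural comparison is useful.
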
Using Markush structures

Existing systems
Two online systems/databases available since late 1980s
– Merged Markush Service (Thomson Reuters / Markush DARC)
– MARPAT (Chemical Abstracts Service/STN)
Searchers often faced with manually sifting 1000+ hits to find 5 or 6 relevant patents

Problems
Excessively broad Markushes defy existing systems, and give poor recall / precision
Example: "R1 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic, carbocyclic or heterocyclic ring system, or ..."

Searchers' comments

Discussion in "breakout group" at International Patent Information Conference (IPI-Confex), Venice, Mar 2009
– Multiple search tools are needed for comprehensive retrieval
– Search strategies need to focus on the core structure of interest and put up with poor precision
– Current systems based on automatic extraction and analysis of nomenclature have limited usefulness

Suggestions for improvement:
– ranking of search output
– more comprehensive indexing of specific structures

In-house Markush systems?

Advantages over existing online systems

Informatics support for drug discovery
Integration of patent data with other chemical databases
– end-user chemist access to patent data
Adding patentability criteria to drug design
Adjunct or preliminary to existing systems
– confidentiality advantages

Data mining
Possible use of structural similarity and cluster analysis techniques
Structure activity analysis
Physico-chemical property calculation
Competitive intelligence
Identification of unpatented "gaps" in chemical space

In-house Markush systems?

Prospects

Software
New Markush search systems under development
– Digital Chemistry Ltd.
– ChemAxon
Also work on selective enumeration of specific structures from Markush
– DecrIPt Inc.

Databases
Existing curated databases
– Thomson Reuters have expressed interest in making MMS data available
– MARPAT database another obvious possibility
"Home-grown" databases for specialist purposes
– input software needed
Automatic extraction from patent documents

Automatic Markush extraction
Currently a "hot area" for research, after a fallow period
– complex combined issues of text and image processing, nomenclature translation and semantic analysis

Sheffield University
3 publications (1992-97), initially analysing Derwent patent abstracts.

CLiDE Pro (KeyModule Ltd.)
Work by A.P. Johnson (2009) extending earlier chemical OCR software.

ChemProspector (InfoChem)
Ongoing research into extraction of Markush structures from patents.

Cambridge University
Unilever Centre for Molecular Informatics. Ongoing work by Murray-Rust group on analysis of full-text patents, extending OPSIN nomenclature translation program.

chemoCR (Fraunhofer SCAI)
Recent work on prototype software for Markush "reconstruction" from patent text, with limited success.

Commercially-viable operational systems probably still some way off.

Precision and recall

Patent databases with specific structures

Substructure search finds all molecules in database that contain query substructure
– 100% precision
– 100% recall

Database contains irrelevant or trivial molecules from patent text
– Poor precision

Database omits molecules covered by Markush structure
Database contains incorrect molecules (errors in nomenclature identification / translation)
– Poor recall

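Precision and recall at the patent level can be computed from two sets: the patents retrieved and the patents actually relevant. The sketch below uses invented patent labels (P1 to P6) purely for illustration; the function name is likewise an assumption.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"P1", "P2", "P3", "P4"}
# P5 and P6 are retrieved only through trivial molecules named in their text;
# P4 is missed because its compound is covered only by a Markush structure.
retrieved = {"P1", "P2", "P3", "P5", "P6"}

p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.6 0.75
```

By the same definition, sifting 1000 hits to find 6 relevant patents corresponds to a precision of 0.6%.
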
Precision and recall

Patent databases with Markush structures

Query substructure matches highly generic description in unimportant part of Markush
– Poor precision

[Figure: a query substructure shown matching the generic definition "R84 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non-aromatic, carbocyclic or heterocyclic ring system, or ..."]

Using system search options ("broad/narrow translation" in DARC, "match level" in MARPAT) to avoid matches with highly generic descriptions
– Poor recall

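The trade-off between matching and ignoring generic descriptions can be shown with a toy model. The records, the query and the `allow_generic` flag below are hypothetical; the flag is only a stand-in for options such as "broad/narrow translation" or "match level", whose actual behaviour is not described in these slides.

```python
# Toy Markush records: one R-group each, with specific alternatives
# and optional generic classes (hypothetical data).
MARKUSHES = {
    "patent X": {"specific": {"imidazolyl", "phenyl"}, "generic": set()},
    "patent Y": {"specific": {"methyl"}, "generic": {"ring system"}},
}
QUERY = {"name": "imidazolyl", "classes": {"ring system"}}


def search(markushes, query, allow_generic):
    """Return patents whose R-group matches the query group."""
    hits = []
    for patent, group in markushes.items():
        specific_match = query["name"] in group["specific"]
        generic_match = allow_generic and bool(query["classes"] & group["generic"])
        if specific_match or generic_match:
            hits.append(patent)
    return hits


broad = search(MARKUSHES, QUERY, allow_generic=True)
narrow = search(MARKUSHES, QUERY, allow_generic=False)
print(broad)   # ['patent X', 'patent Y']
print(narrow)  # ['patent X']
```

With generic matching on, every patent with a broad "ring system" definition is retrieved, which hurts precision. With it off, patent Y is dropped even if its claim does cover the compound, which hurts recall.
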
Patent search evaluation

Chemical substructure search systems usually give 100% precision and 100% recall
– retrieval performance evaluation not important

Not really true for chemical patent searches
– much more room for argument about whether or not a hit is relevant

Designers of chemical patent search systems may need to pay more attention to performance evaluation
– Precision / Recall trade-off
– Consideration of the relative importance of different parts of Markush (what the patent "teaches")
– Evaluation of hit relevance in context of type of query

TREC-CHEM

– Multi-year evaluation project under auspices of long-running Text Retrieval Conferences (TREC)
– Uses chemical patent data and queries with relevance judgements
– Used to compare retrieval experiments performed by different groups

Results from first year (2009) presented elsewhere
– most search approaches based on automatic analysis of patent text, nomenclature extraction etc.
– some issues identified concerning automated relevance judgements based on cited prior art documents

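When search output is ranked, runs from different groups can be compared against the same relevance judgements with a rank-sensitive measure. The sketch below computes average precision, a measure commonly used in TREC-style evaluations; the slides do not say which measures TREC-CHEM reported, so treat the choice, the run names and the patent labels as assumptions.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


judged_relevant = {"P1", "P2"}
run_a = ["P1", "P7", "P2", "P9"]  # relevant patents ranked 1st and 3rd
run_b = ["P7", "P9", "P1", "P2"]  # same patents, ranked 3rd and 4th

ap_a = average_precision(run_a, judged_relevant)
ap_b = average_precision(run_b, judged_relevant)
print(round(ap_a, 3), round(ap_b, 3))  # 0.833 0.417
```

Both runs retrieve the same four patents, so their set-based precision and recall are identical; only a rank-aware measure separates them.
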
TREC-CHEM
TREC-CHEM is not using data from curated databases
– these are the current "industry standard" against which new approaches will ultimately be judged
– TREC rules do not allow commercial databases to be included

It would be valuable to apply TREC-type evaluation to search systems for patent chemistry that are based on commercial databases
• MARPAT vs. MMS/Markush DARC
• existing systems and databases vs. new ones using new techniques, automated data extraction etc.
There are many potential benefits to practising patent searchers in involving cheminformaticians in the IR-IP debate for which the IRF provides a forum.

Contact details

Dr John M. Barnard
Scientific Director, Digital Chemistry Ltd.
46 Uppergate Road, Sheffield S6 6BX, UK
john.barnard@digitalchemistry.co.uk
+44 (0)114 233 3170
