
Automatic cataloging & classification

Eric Childress
OCLC Research

OCLC Members Council


Research and New Technologies Interest Group
25 October 2005
The key question

 Can machines be leveraged for?
Input
– Baseline metadata
• Critical data present
• Accurate tagging
• Accurate values
Output
– Ideal: Enriched metadata
[Diagram labels: Human Labor, Status quo, Metadata]
 The answer:
– Yes…with caveats
Automation approaches

 Harvesting: Drawing from extant
metadata in one or more sources
 Extraction: Drawing from attributes
of the resource and/or content in the
resource
 Both: Integrating both harvesting &
extraction in metadata generation
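
The combined approach can be sketched roughly as follows. This is a minimal illustration, not a description of any tool named in these slides: it harvests embedded `<meta>` tags from an HTML page, extracts a word count and a fallback title from the page text, and lets harvested values take precedence. The class `MetaHarvester`, the function `generate_metadata`, the sample page, and the precedence rule are all assumptions made for the example.

```python
from html.parser import HTMLParser


class MetaHarvester(HTMLParser):
    """Collects embedded <meta name=... content=...> pairs and visible text."""

    def __init__(self):
        super().__init__()
        self.meta = {}        # harvested: metadata already present in the page
        self.text_parts = []  # raw text chunks used for extraction
        self._skip = 0        # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def generate_metadata(html):
    """Merge harvested and extracted values (hypothetical precedence rule)."""
    parser = MetaHarvester()
    parser.feed(html)
    parser.close()
    chunks = [c.strip() for c in parser.text_parts if c.strip()]
    # Extraction: values derived from the content of the resource itself.
    extracted = {"word_count": sum(len(c.split()) for c in chunks)}
    if chunks:
        extracted["dc.title"] = chunks[0]  # crude fallback title
    # Assumption: harvested values override extracted guesses.
    return {**extracted, **parser.meta}


SAMPLE_PAGE = (
    '<html><head><meta name="DC.title" content="Annual report"></head>'
    "<body><h1>Draft heading</h1><p>Budget figures for the year.</p></body></html>"
)
record = generate_metadata(SAMPLE_PAGE)
```

Here the harvested `DC.title` replaces the title guessed from the first text chunk, while the word count exists only because of extraction.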
Approaches (cont’d)

 Harvesting & extraction can be integrated
with other tactics:
– Point-of-transaction capture: Manual and/or
automatic capture of metadata during the
lifecycle of resource and/or metadata (e.g., the
source agency, date of record)
– Human review/prompting: Integrating human
decision-making to address cases machines
cannot handle efficiently (e.g., linking name
references to correct authority file when
several names are similar)
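
The human review tactic can be illustrated with a small sketch. It assumes a simple string-similarity score from Python's `difflib` and two hypothetical thresholds (`accept`, `margin`); real authority matching would draw on far more evidence (dates, titles, subjects). A name is linked automatically only when one heading is a clear winner; otherwise the candidates go to a person.

```python
from difflib import SequenceMatcher


def link_name(name, authority_headings, accept=0.9, margin=0.05):
    """Return (heading, "auto") or (candidate_list, "review").

    accept and margin are illustrative thresholds, not tuned values.
    """
    scored = sorted(
        (
            (SequenceMatcher(None, name.lower(), h.lower()).ratio(), h)
            for h in authority_headings
        ),
        reverse=True,
    )
    if not scored:
        return [], "review"
    best_score, best = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Link automatically only when the best match is strong and unambiguous.
    if best_score >= accept and best_score - runner_up >= margin:
        return best, "auto"
    # Otherwise hand the top candidates to a human reviewer.
    return [h for _, h in scored[:3]], "review"


# Hypothetical headings for illustration only.
HEADINGS = ["Smith, John, 1950-", "Smith, John, 1952-", "Jones, Mary"]
```

With these headings, `link_name("Jones, Mary", HEADINGS)` links automatically, while `link_name("Smith, John", HEADINGS)` is routed to review because two headings match equally well.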
Harvesting options
 New record, same database:
– OCLC “derive” record technique
 External metadata files:
– Z39.50/ZING/MXG
– OAI harvesting
– Citation tools (e.g., EndNote)
 Embedded metadata harvesting:
– Processes structured metadata
– Various tools (e.g., DC tools list)
 Many harvesting tools include some extraction
features (and vice-versa)
– Example: InfoLibrarian appliance
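
As a rough illustration of OAI harvesting, the sketch below parses an OAI-PMH `ListRecords` response carrying unqualified Dublin Core. The namespace URIs are those defined by OAI-PMH 2.0 and Dublin Core; the sample response, its identifier `oai:example.org:1`, and the function name `parse_list_records` are made up for the example. No network request is made, and resumption tokens and error responses are ignored.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_list_records(xml_text):
    """Return one dict per harvested record: identifier plus repeatable DC fields."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall(".//oai:record", NS):
        identifier = rec.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        fields = {}
        for el in rec.findall("oai:metadata/oai_dc:dc/*", NS):
            name = el.tag.split("}", 1)[-1]  # drop the namespace part of the tag
            if el.text and el.text.strip():
                fields.setdefault(name, []).append(el.text.strip())
        records.append({"identifier": identifier, "fields": fields})
    return records


# Hypothetical response body, shortened for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:subject>Metadata</dc:subject>
          <dc:subject>Cataloging</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

harvested = parse_list_records(SAMPLE_RESPONSE)
```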
Extraction landscape

 Many tools from many sources
– Features vary widely
– Some are narrow-band (e.g., domain-specific,
narrow scope of data work)
– Standalone or highly integrated in systems
(often as part of digital access management systems)
 Frequently-encountered features:
– Simple: document statistics, file type
– Complex: (reliable) language detection,
audience level, topics, entities represented,
document parts, taxonomy derivation
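
The "simple" end of this feature range can be sketched in a few lines. The example below guesses a file type from the filename and computes basic document statistics; the function name `simple_features`, the tokenization rule, and the four-letter cutoff for frequent terms are assumptions chosen for brevity. The "complex" features listed above need considerably more machinery than this.

```python
import mimetypes
import re
from collections import Counter


def simple_features(filename, text, top_n=3):
    """Extract file type and basic document statistics (illustrative only)."""
    mime, _ = mimetypes.guess_type(filename)  # guess based on the extension
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: words longer than three letters stand in for content terms.
    frequent = Counter(w for w in words if len(w) > 3).most_common(top_n)
    return {
        "file_type": mime or "unknown",
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
        "top_terms": [w for w, _ in frequent],
    }


stats = simple_features(
    "report.html",
    "Metadata helps users. Metadata helps machines. Metadata matters.",
)
```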
Extraction approaches
 Information extraction:
– “Automatically extract structured or semistructured
information from unstructured machine-readable
documents” - Wikipedia
 Natural language processing
– “A range of computational techniques for analyzing and
representing naturally occurring text (free text) at one
or more levels of linguistic analysis (e.g., morphological,
syntactic, semantic, pragmatic) for the purpose of
achieving human-like language processing for
knowledge-intensive applications” - AHIMA
– Extracts both explicit & implicit meaning
Some work of interest

 Library of Congress
 NSF-funded NSDL projects
 AMeGA
 iVia software
 RLG’s Automatic Exposure
Library of Congress
 BEAT (Bibliographic Enrichment Advisory
Team) activities & projects:
– MARC records from harvesting:
• E-CIP
• Web access to publications in series
– Numerous enrichment activities:
• TOCs: E-CIP, ONIX, dTOC project, more
• Reviews: HNET, Outstanding Reference Sources,
HLAS reviews, MARS Best Free Reference Sites
• Contributor biographic information, ONIX
descriptions, sample texts
• Links to e-versions of various texts
• Special projects for select LC collections
– Work with bibliographies & pathfinders
NSDL-related projects (selected)
 MetaExtract: An NLP System to Automatically Assign
Metadata
– CNLP (Syracuse U) & SIS (Syracuse U)
– Builds on several previous projects including:
• Breaking the MetaData Generation Bottleneck [2000-2002]
– CNLP (Syracuse U) & U Washington iSchool
– Application of NLP to automatically generate metadata for course-
oriented materials
 Lenny
– Cornell NSDL group & INFOMINE
– Orchestrated application of a suite of activities
• OAI harvesting with metadata augmentation using iVia
• Loosely-coupled third party services to provide metadata
enhancements (correction, augmentation) to metadata destined
for a central repository
• Interactions orchestrated by centralized software application
MetaExtract study findings
 Auto-generated versus manually-assigned:
– Comparable
• Performance in Retrieval
• Quality of most elements (for Browsing)
– Better
• Coverage of metadata elements
 Auto-generated versus full-text:
– Comparable
• Performance in Retrieval
– Better
• Enables Fielded searching
• Enables Browsing of results
– Provides useful structuring of data
Other projects
 AMeGA (Automatic Metadata Generation Applications
Project)
– UNC-CH SILS Metadata Research Center
– Research initiated to fulfill LC Bibliographic Control Action Plan
4.2 (deliver specifications for tools to effect automated
processing of Web-based resources)
– Final report identifies and recommends functionalities for
automatic metadata generation applications
 iVia software
– Developed by INFOMINE & in use by NSDL, various other
digital library projects; LC looking at using iVia
– Sophisticated open source harvester software that can assign
LCSH, LCC
 Automatic Exposure
– RLG-led initiative advocates capturing standard technical
metadata about digital images automatically, as part of image
creation
OCLC activities

 OCLC Research projects:
– Automatic classification
– FRBR-related record harvesting
– SchemaTrans
 OCLC production services:
– OCLC Digital Archive
– WorldCat link
– OCLC Connexion
Automatic classification work
 Scorpion
– Open source software that implements a system for
automatically classifying Web-accessible text documents
– Incorporated into Connexion extractor
 FAST as a knowledge base for automatic
classification project
– Evaluated FAST as a database to support automatic
classification
 ePrints-UK project
– A collaboration with RDN to pilot Web services to classify
records by DDC and provide authority control for
personal names for RDN eprint metadata records
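
To give a feel for automatic classification of text, here is a toy sketch that treats a document as a query against a small knowledge base of class descriptions and ranks classes by term overlap. It is not Scorpion's algorithm, and the labels and terms in `CLASS_TERMS` are hypothetical stand-ins for class numbers and their captions; a production system would use a full classification scheme and weighted ranking.

```python
import re
from collections import Counter

# Hypothetical miniature knowledge base: class label -> descriptive terms.
CLASS_TERMS = {
    "library-science": "bibliographic analysis control cataloging metadata",
    "networking": "computer networks communications interfacing",
    "music": "music musicians composers performance",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def classify(document, class_terms, top_n=2):
    """Rank classes by how often their terms occur in the document."""
    doc_counts = Counter(tokenize(document))
    scores = []
    for label, terms in class_terms.items():
        score = sum(doc_counts[t] for t in set(tokenize(terms)))
        if score:
            scores.append((score, label))
    # Highest score first; ties broken alphabetically for stable output.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(label, score) for score, label in scores[:top_n]]


ranked = classify(
    "Cataloging rules and metadata for music libraries: cataloging music scores.",
    CLASS_TERMS,
)
```

Returning a ranked list rather than a single class leaves room for the human review step described earlier.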
Other OCLC Research activities

 FRBR-related record harvesting
– Best elements of all records in workset
used to build a “work” record (Fiction
Finder)
 SchemaTrans project
– Adopts a novel approach to translating
structured metadata between schemes
– Should be friendly to modular
augmentation/correction activities
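
Building a "work" record from a workset might look something like the sketch below. How "best" is decided is not stated on the slide, so the rule used here (most frequent non-empty value, with longer values winning ties) is purely an assumption, as are the field names and the function `build_work_record`.

```python
from collections import Counter


def build_work_record(workset, fields=("title", "author", "subject")):
    """Pick one value per field from all records in a workset.

    Assumed rule: most frequent non-empty value; ties go to the longer
    value, then alphabetical order for deterministic output.
    """
    work = {}
    for field in fields:
        values = [
            (r.get(field) or "").strip()
            for r in workset
            if (r.get(field) or "").strip()
        ]
        if not values:
            continue
        counts = Counter(values)
        work[field] = min(counts, key=lambda v: (-counts[v], -len(v), v))
    return work


# Hypothetical workset for illustration.
WORKSET = [
    {"title": "Moby Dick", "author": "Melville, Herman"},
    {"title": "Moby Dick", "author": "Melville, Herman, 1819-1891",
     "subject": "Whaling"},
    {"title": "Moby-Dick; or, The whale", "author": ""},
]
work_record = build_work_record(WORKSET)
```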
OCLC products
 OCLC Digital Archive
– Various harvesting options
• Capture of technical metadata
• Start descriptive records in Connexion
 WorldCat link
– Scheduled ingest of metadata from OAI servers and
batch processing into WorldCat
 OCLC Connexion
– Extractor processes metadata from web sites
• Relatively sophisticated harvesting
• Processes non-canonical metadata
• Slated for significant upgrade in 2006
– Rules-aided LCSH assignment while editing bibs
– Automatic base authority record generation from
relevant bibliographic record (NACO)
Links

 Recommended reading:
– Liddy, Elizabeth, “Metadata: A
Promising Solution” in EDUCAUSE
Review, v. 40, n. 3 (May/June 2005)
 OCLC Research links:
– Automatic classification projects
– SchemaTrans
– ResearchWorks
