Commercializing Information Extraction

Lessons from WhizBang Labs
David Hull
Clairvoyance Corporation
April 2005
Commercial IE © 2005, Clairvoyance Corp. April 2005 1

Overview
• What is Information Extraction?

• WhizBang Labs: IE technology start-up
– Technology, products and services
– Flipdog example
• Conventional wisdom and lessons learned
– Manual extraction more accurate than automatic extraction
– IE systems based on machine learning easy to customize
• Working with clients
– Managing expectations, gathering data
– Understanding the real problem
• Finding the right application for IE technology

What is Information Extraction?
Task: Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy
of open-source software with Orwellian fervor, denouncing its communal licensing as a
"cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is
made public to encourage improvement and development by outside programmers. Gates
himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the
Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
[Cohen and McCallum, 2003]
Empty database slots to fill: NAME | TITLE | ORGANIZATION
Task: Filling slots in a database from sub-segments of text.
Applying IE to the same news passage (October 14, 2002, 4:00 a.m. PT) fills the slots:
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Information Extraction = segmentation + classification + association + clustering
Applied to the same news passage (October 14, 2002, 4:00 a.m. PT):
• Segmentation: find the text sub-segments of interest
– Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
• Classification: label each segment with a slot type
– NAME: Bill Gates, Gates, Bill Veghte, Richard Stallman
– TITLE: CEO, VP, founder
– ORGANIZATION: Microsoft Corporation, Microsoft, Free Software Foundation
• Association: link the segments that belong to the same record
– { Bill Gates, CEO, Microsoft Corporation }, { Bill Veghte, VP, Microsoft }, { Richard Stallman, founder, Free Software Foundation }
• Clustering: group the mentions that refer to the same entity
– e.g. all mentions of Microsoft and Microsoft Corporation
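The four techniques can be illustrated with a minimal Python sketch. It is not WhizBang's or the tutorial's implementation: the small lexicon stands in for trained segmenters and classifiers, the "attach to the most recent name" rule is a deliberately naive association heuristic, and the first-word clustering rule is an assumption made only for this example.

```python
import re
from collections import defaultdict

TEXT = ('"We can be open source. We love the concept of shared source," '
        'said Bill Veghte, a Microsoft VP. '
        'Richard Stallman, founder of the Free Software Foundation, countered.')

# Toy lexicon standing in for trained segmentation/classification models.
LEXICON = {
    "NAME": ["Bill Veghte", "Richard Stallman"],
    "TITLE": ["VP", "founder"],
    "ORGANIZATION": ["Free Software Foundation", "Microsoft"],
}

def segment_and_classify(text):
    """Return (start offset, label, string) for each lexicon match, in text order."""
    spans = []
    for label, phrases in LEXICON.items():
        for phrase in phrases:
            for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", text):
                spans.append((m.start(), label, m.group()))
    return sorted(spans)

def associate(spans):
    """Naive association: attach each TITLE/ORGANIZATION to the most recent NAME."""
    records = []
    for _, label, value in spans:
        if label == "NAME":
            records.append({"NAME": value})
        elif records and label not in records[-1]:
            records[-1][label] = value
    return records

def cluster(mentions):
    """Naive clustering: group mentions that share their first word."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[mention.split()[0]].append(mention)
    return list(groups.values())

if __name__ == "__main__":
    for record in associate(segment_and_classify(TEXT)):
        print(record)
    print(cluster(["Microsoft Corporation", "Microsoft", "Free Software Foundation"]))
```

The naive association rule already fails on "Microsoft Corporation CEO Bill Gates", where the title and organization come before the name, which hints at why each stage needs more than a lookup table.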
IE from the Web: The Big Picture
[Diagram: a site list (www.xyz.com, www.wvx.edu, www.abci.org, www.blah.net, ...) feeds a web crawler, supported by a site classifier and a page classifier. The IE stage (segment, classify, associate, cluster) turns the crawled sites into a database that supports search and data mining.]

IE from the Web
• Text paragraphs without formatting
– Example: "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."
• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables

IE from the Web
• Global extraction models

– Extracting from many web sites
– Content-based
– Generic: apply to any web site
– Similar to parsing natural language
• Local extraction models
– Extracting from one web site
– Format-based (XML)
– Customized to each web site
– Similar to parsing a formal language
• Models are highly complementary
– Local models can collect training data for global models
– Local models can boost accuracy of global models on specific sites
– Global models can suggest fields for local models
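The contrast between local and global models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the XML layout, the tag names and the wording pattern are all hypothetical, and a real global model would be far richer than one regular expression.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup for one specific site.
PAGE = """<staff>
  <person><name>Astro Teller</name><role>CEO</role></person>
</staff>"""

def local_extract(xml_text):
    """Format-based: depends on this one site's tag layout."""
    root = ET.fromstring(xml_text)
    return [(p.findtext("name"), p.findtext("role")) for p in root.iter("person")]

# Content-based: depends on wording, so it can be applied to any site.
GLOBAL_PATTERN = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+) is the (CEO|CTO|president)")

def global_extract(text):
    """Return (name, title) pairs found by the wording pattern."""
    return GLOBAL_PATTERN.findall(text)

if __name__ == "__main__":
    print(local_extract(PAGE))
    print(global_extract("Astro Teller is the CEO and co-founder of BodyMedia."))
```

Pairs produced by a reliable local extractor could then be used as labeled examples for the global model, which is one way the two approaches complement each other.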


WhizBang: Technology
• WhizBang Labs (1998-2002)

– Information extraction from the web
– Converting unstructured text to structured database format
• Core technologies
– Global extraction models
• Regular expression language for hand-built extraction patterns
• Machine learning techniques for filtering extracted text
– Wrapster: an efficient tool to create site-specific extractors
– ChangeDetector: web site change detection monitoring
– Support structure
• Text classification (sites, web pages, text segments)
• Web crawler, XML parser, text to XML conversion
• Data labeling environment
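The combination of hand-built patterns with a filtering step might look roughly like the following Python sketch. The pattern, the title word list and the scoring weights are invented for illustration; in particular, the hand-set score only stands in for a trained filter and is not how WhizBang's system worked.

```python
import re

# Hand-built pattern for "First Last, Title": high recall, low precision.
CANDIDATE = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+), ([A-Z][A-Za-z ]+)")

# Hypothetical cue words a filter might key on.
TITLE_WORDS = {"director", "manager", "engineer", "president", "officer"}

def score(name, title):
    """Stand-in for a learned filter; these weights are made up."""
    s = 0.0
    if any(word in TITLE_WORDS for word in title.lower().split()):
        s += 1.0
    if len(title.split()) > 4:
        s -= 0.5
    return s

def extract(text, threshold=0.5):
    """Generate candidates with the pattern, then keep those the filter accepts."""
    return [(name, title.strip())
            for name, title in CANDIDATE.findall(text)
            if score(name, title) >= threshold]

if __name__ == "__main__":
    sample = "Jane Doe, Director of Sales\nSalt Lake, Utah\n"
    print(extract(sample))
```

The pattern alone also matches "Salt Lake, Utah"; the filter is what removes it.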

WhizBang: Products and Services
• Customized Web Portal Construction

– Flipdog: Job Portal Site
• Information Gathering from the Web
– Personal contact information: { Name, Job Title, email }
– Company contact information: { Company Name, Address }
• Competitive Intelligence
– Track changes on web sites of competitors
• IE from legacy data
– Organize web sites
• Company finder
– A custom search engine

Flipdog: A Job Portal Site
• Support for employers and job seekers

• Technology showcase
• Building the Flipdog job database
– Crawl the web for company sites
– Find job posting web pages
– Extract job posting info:
• Job title, description, location, contact info.
– Load job posting into database
– Provide search interface to database
– Value-added services for employers, job seekers
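The steps above can be sketched end to end in Python. Everything concrete here is an assumption for illustration: the pages are an in-memory stand-in for a crawl, the URLs and field labels are hypothetical, and the keyword-based page classifier is a placeholder for a trained model.

```python
import re
import sqlite3

# Stand-in for crawled pages (hypothetical URLs and text).
PAGES = {
    "http://example.com/about": "We have baked bread since 1950.",
    "http://example.com/jobs": "Job title: Baker\nLocation: Provo, UT\nContact: jobs@example.com",
}

def is_job_posting(text):
    """Crude page classifier; a real system would use a trained model."""
    return "job title:" in text.lower()

def extract_posting(text):
    """Pull labeled fields out of a posting with simple line patterns."""
    fields = {}
    for key in ("Job title", "Location", "Contact"):
        m = re.search(rf"^{key}:\s*(.+)$", text, flags=re.MULTILINE)
        fields[key] = m.group(1).strip() if m else None
    return fields

def build_database(pages):
    """Classify pages, extract postings, and load them into a database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE jobs (url TEXT, title TEXT, location TEXT, contact TEXT)")
    for url, text in pages.items():
        if is_job_posting(text):
            f = extract_posting(text)
            db.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)",
                       (url, f["Job title"], f["Location"], f["Contact"]))
    return db

def search(db, keyword):
    """Keyword search over job titles."""
    cur = db.execute("SELECT title, location FROM jobs WHERE title LIKE ?",
                     (f"%{keyword}%",))
    return cur.fetchall()

if __name__ == "__main__":
    print(search(build_database(PAGES), "Baker"))
```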

Flipdog: Job Search

Job Openings:
Category = Food Services
Keyword = Baker
Location = Continental U.S.
Flipdog: Candidate Search

Flipdog: Data Mining



Information Gathering from the Web
• Information Gathering – standard practice

– Human analysts browse web site
– Enter data by hand into a spreadsheet
• Benefits of automation
– Faster, cheaper in the long run
– Easier to maintain and update database
• Challenges of automation
– Higher start-up costs
– Disruptive technology, people will lose their jobs
– Must compete with outsourcing

Conventional Wisdom
• Automatic extraction with machine learning is
less accurate than manual extraction (False)
– With sufficient design effort, automatic methods have similar
precision (~90%), 5-10x recall
– Errors people make
• Poor recall, particularly in long, unformatted text
• Transcription errors, careless errors
– Many human errors can be avoided with better software
• Automatic extraction seems like an obvious win
– People visit only a few pages, machines easily process
everything on the site
– Extract {Joe Smith, Assistant Janitor} from the footnote of p.
29 of the building maintenance manual
– System can track changes to web site and automatically
update database
• Core problem with automatic extraction
– Type of errors made by automatic system
– Flipdog example
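The "similar precision, 5-10x recall" claim can be made concrete with a small worked example in Python. The counts below are invented purely to illustrate the arithmetic; they are not measurements from WhizBang.

```python
def precision(correct, extracted):
    """Fraction of extracted records that are correct."""
    return correct / extracted

def recall(correct, total_true):
    """Fraction of the true records on the site that were found."""
    return correct / total_true

# Hypothetical site containing 1000 true name-title pairs.
TOTAL_TRUE = 1000
human = {"extracted": 100, "correct": 90}     # analyst visits only a few pages
machine = {"extracted": 500, "correct": 450}  # system processes the whole site

if __name__ == "__main__":
    for label, run in (("human", human), ("machine", machine)):
        p = precision(run["correct"], run["extracted"])
        r = recall(run["correct"], TOTAL_TRUE)
        print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

With these made-up numbers both runs have the same precision, but the automatic run recovers five times as many correct records.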
[Figure: the user gives the first K positive examples, and thus many implicit negative examples, to the learner] [Cohen and McCallum, 2003]
Automatic Change Detection
[Boyapati et al., WWW 2002]

Automatic Extraction Errors
• The dreaded parity error
Name 1 Job Title 1
Name 2 Job Title 2
Name 3 Job Title 3
Name 4 Job Title 4
• Obsolete information
– One site had 10 years of annual reports on line
– Extraction system happily crunched through all of them
– Vast majority of extracted data no longer correct
• Fictional Characters
– Ex. { Harry Potter, Wizard }, { James Bond, Secret Agent }
• Famous People
– Ex. { Bill Clinton, President }
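The slide does not spell out the parity error, but one plausible reading is an off-by-one misalignment when names and titles alternate in a listing. The Python sketch below illustrates that reading only; the pairing function and its `offset` parameter are hypothetical.

```python
def pair_alternating(lines, offset=0):
    """Pair alternating lines as (first, second), starting at `offset`."""
    items = lines[offset:]
    return list(zip(items[0::2], items[1::2]))

LISTING = ["Name 1", "Job Title 1", "Name 2", "Job Title 2", "Name 3", "Job Title 3"]

if __name__ == "__main__":
    print(pair_alternating(LISTING))            # aligned pairs
    print(pair_alternating(LISTING, offset=1))  # shifted by one line
```

A single stray line before the listing is enough to shift the alignment, after which every extracted pair on the page is wrong, not just one. A person would notice immediately; an automatic system will not unless it checks for it.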

Conventional Wisdom
• Extractors based on machine learning are easy to
customize for a new domain
– Simply gather more training data
– False!
• Feature engineering often critical for good performance
– Special expertise, typically PhD in machine learning
• Ex. Customize {Name, Job Title} extraction for German
– Capitalization key feature in named-entity extraction
– All German nouns are capitalized
– Resource gathering more difficult
• Purchased German phone book on CD, but couldn’t access bulk data
• Strict EU privacy requirements
• Found good list of job titles on German government site
• German is a compounding language, many NP variants, need NLP!
– Assumed sites would be in German
• In reality, a mixture of German and English on each page!
• Sites mixed German and English job titles freely


Managing Expectations
• Hard to sell automatic extraction technology

– Customers had trouble understanding how it works
– Needed to send Chief Scientist on many sales calls
– Many customers expected 100% accuracy
– Don’t accept stupid mistakes
• i.e. common sense errors a human wouldn’t make
– Good analogy: machine translation
• Hard to get good training data
– Customers don’t grasp concept of an unbiased data sample
– Ask for random sample of company web sites
– Customer provides first 100 from database
– Turns out to be 100 oldest companies in database
– Strongly biased sample!
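The difference between "the first 100 from the database" and a random sample is easy to show in Python. The company table below is synthetic and assumes, as in the anecdote, that records are stored in the order they were added.

```python
import random

def first_n(records, n):
    """What the customer provided: the first n rows in storage order."""
    return records[:n]

def random_sample(records, n, seed=0):
    """What was asked for: n rows drawn uniformly without replacement."""
    return random.Random(seed).sample(records, n)

# Synthetic database: 1000 companies, 100 added per year starting in 1990.
companies = [{"id": i, "year_added": 1990 + i // 100} for i in range(1000)]

if __name__ == "__main__":
    biased = {c["year_added"] for c in first_n(companies, 100)}
    unbiased = {c["year_added"] for c in random_sample(companies, 100)}
    print("years in first 100:", sorted(biased))
    print("years in random 100:", sorted(unbiased))
```

The first 100 rows all come from the oldest year, while the random sample spans the whole range, so a model trained on the former sees only the oldest companies.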

Understanding the Real Problem
• Customer wants personal contact information from company web sites
– e.g. { Person Name, Job Title, email }
• WB technology extracts all name-title pairs on the site
– Client wants only people who actually work for the company
sponsoring the site
– Seems obvious in hindsight, but never specified in contract!
– Discovered this problem on the day of data delivery!
– This task requires high-level reasoning
• Machine learning to the rescue
– Two-week extension
– Solution: classify each name-title pair as internal or external

Machine Learning to the Rescue
• Label NT pairs internal/external
– Very hard task, labeling is slow
– Often needed independent web search on each name
– Initial inter-labeler agreement rate: 50%
– Definition of internal
• Subsidiaries? Consultants? How do you handle a consortium?
– Eventually, after double and triple labeling with adjudication,
gathered adequate training data
– Trained human labelers averaged 80-85% accuracy
• Custom features for classifier
– Page type, email match to company name, job title type, etc.
• Built classifier in 2 weeks, accuracy = 85-90%!
– Took 3.5 people working 20 hours/day for 2 weeks
– Reduced recall, but still 3-4x manual output
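A minimal Python sketch of two pieces of this effort: turning a name-title pair into features of the kind listed above, and measuring inter-labeler agreement. The field names, the executive word list and the sample data are assumptions; the actual WhizBang features are not described in more detail than the slide gives.

```python
# Hypothetical cue words for the "job title type" feature.
EXECUTIVE_WORDS = {"ceo", "cfo", "president", "director", "vp"}

def email_matches_company(email, company_domain):
    """True if the address is on the company's own domain or a subdomain."""
    host = email.rsplit("@", 1)[-1].lower()
    domain = company_domain.lower()
    return host == domain or host.endswith("." + domain)

def features(pair, company_domain):
    """Turn one extracted name-title pair into classifier features."""
    email = pair.get("email")
    title_words = set(pair["title"].lower().split())
    return {
        "email_match": bool(email) and email_matches_company(email, company_domain),
        "page_type": pair.get("page_type", "unknown"),
        "executive_title": bool(title_words & EXECUTIVE_WORDS),
    }

def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two labelers chose the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

if __name__ == "__main__":
    pair = {"name": "Jane Doe", "title": "VP Sales",
            "email": "jane@mail.example.com", "page_type": "management"}
    print(features(pair, "example.com"))
    print(agreement_rate(["internal", "external", "internal", "external"],
                         ["internal", "internal", "external", "external"]))
```

Feature dictionaries like these would then be fed to whatever classifier is trained on the adjudicated labels.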


Web Page Change Detection
• ChangeDetector: track changes on web sites over time
• Challenges
– Multiple levels of granularity (site, page)
– What constitutes a significant change?
• Often depends on target audience
– Clients may wish to track changes over hundreds of sites
• Classification of change types
• Aggregation into summary reports
• Customized report generation
– Web sites to be tracked
– Type of pages to track
– Type of changes to track
– Frequency of change reporting
• Integrated with information extraction - examples
– New product releases (Product Name)
– Changes to officers and directors (Person Name)
– New clients, partners (Company Name)
– Changes to financial results (Monetary Amounts)
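At the page level, the core of change tracking is a comparison between a stored snapshot and the current version. The Python sketch below uses the standard `difflib` module for a line-level comparison; it is an illustration only and not ChangeDetector's algorithm, and the page contents are hypothetical.

```python
import difflib

def diff_lines(old_text, new_text):
    """Return the lines added to and removed from a page between two snapshots."""
    added, removed = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return {"added": added, "removed": removed}

if __name__ == "__main__":
    old = "Products\nWidget 1.0\nSpec sheet: widget-spec.pdf"
    new = "Products\nWidget 1.0"
    print(diff_lines(old, new))
```

Reporting removals separately matters because removed information is only visible to someone who kept the earlier snapshot. Deciding which of these raw differences is significant is the harder classification problem described above.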

Competitive Intelligence
• Target Application: Competitive Intelligence
– Track competitors’ web sites
• Produced extraordinary demos
– Track potential client’s site before visit
– Come to sales visit with summary reports in hand
• Task is very hard for people to do well
– Ideal for automatic processing techniques
– Detecting the removal of information is critical
– Even subtle wording changes are often important
– Small change to product description can indicate shift in marketing strategy
• Remarkable nuggets of hidden information (example)
– One site posted a detailed product specification sheet
– Then removed it 3 days later – sensitive data not intended for public
– Information tracked and stored by ChangeDetector
– Data mining techniques can be built precisely for this purpose

Problems with Change Detection
• Hostile crawling
– Organizations want their pages indexed by search engines
• Often pages are actively submitted
– Organizations will not want their pages indexed for Competitive Intelligence
• Lock out the WhizBang crawler
• WhizBang depends on open access to Web for all its business
• Cannot afford to get locked out by many sites
– Possible to crawl the Web anonymously
• Involves violating rules/conventions for crawlers
• Unethical, bad business practice, likely to lead to lawsuits
• Copyright
– Web pages copyrighted by owners
– ChangeDetector stores copy of page to track changes
– Companies unlikely to support storage of their web pages for CI purposes
– Lawsuits . . .
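The slide does not name the rules for crawlers, but the usual convention is a site's robots.txt file. The Python sketch below, using the standard `urllib.robotparser` module, shows how a site could lock out one crawler while still admitting others, and how a well-behaved crawler would check before fetching. The crawler name `ExampleCIBot` and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is locked out entirely,
# all other crawlers are only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleCIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

def allowed(robots_txt, user_agent, url):
    """Check whether `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    url = "http://example.com/products.html"
    print("ExampleCIBot:", allowed(ROBOTS_TXT, "ExampleCIBot", url))
    print("OtherBot:", allowed(ROBOTS_TXT, "OtherBot", url))
```

Crawling under a different name would get around the first rule, which is exactly the kind of rule violation the slide warns against.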

ChangeDetector: New Markets
• ChangeDetector can’t work as Competitive Intelligence product
• Reposition it as Knowledge Sharing / Information Security product
• Many companies have poor control over their own Web sites
– Exerting control is hard
– Require every change be approved by 3 layers of management
– Better to track the changes passively
• Avoids the problems previously listed
– Crawling your own site is not a problem
– Copyright not an issue if information used internally
• Information Security: managers can track release of information
• Knowledge Sharing: employees in widely separated divisions can
follow what is going on in the rest of the company
– May require more active policy of keeping Web sites current
– Cheaper than many knowledge sharing solutions

Summary
• Automatic information extraction is a powerful technology

– Similar precision to human labelers with far higher recall
– Customized site wrapping tools can improve human efficiency
when 100% precision is required
– Web site change detection technology can keep data up to date
automatically
• Many challenges in marketing this technology
– Automatic IE systems make stupid mistakes
– Must convince customers that this is OK
– Machine learning does not guarantee easy customization
– Customers have trouble understanding how technology works
• Information Extraction is core technology, not application
– True value comes from packaging it in the right application
– ChangeDetector: hard to find the right application

Acknowledgements and References
• A Tutorial on Information Extraction

– William Cohen and Andrew McCallum
– Given at KDD-2003, http://www-2.cs.cmu.edu/~wcohen
– Slides marked [Cohen and McCallum, 2003] are taken from this tutorial
• ChangeDetector: A Site-Level Monitoring Tool for the WWW
– Boyapati, Chevrier, Finkel, Glance, Pierce, Stockton, and Whitmer
– Proceedings of WWW 2002
– http://www2002.org/CDROM/refereed/458/index.html
• Disclaimer: “All value judgements in this talk are solely the
opinion of the author and do not represent the official views of
WhizBang Labs or its successors.”
