
Commercializing Information Extraction
Lessons from WhizBang Labs

David Hull
Clairvoyance Corporation
April 2005

Commercial IE © 2005, Clairvoyance Corp. April 2005 1


Overview

• What is Information Extraction?
• WhizBang Labs: IE technology start-up
– Technology, products and services
– Flipdog example
• Conventional wisdom and lessons learned
– Manual extraction more accurate than automatic extraction
– IE systems based on machine learning easy to customize
• Working with clients
– Managing expectations, gathering data
– Understanding the real problem
• Finding the right application for IE technology



What is Information Extraction?

Task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the
open-source concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.

"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."

Richard Stallman, founder of the Free
Software Foundation, countered saying…

Database slots: NAME, TITLE, ORGANIZATION

[Cohen and McCallum, 2003]

What is Information Extraction?

IE output for the article above:

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

[Cohen and McCallum, 2003]

What is Information Extraction?

Techniques: Information Extraction = segmentation + classification + association + clustering

Segments found in the article above, in order of appearance:

Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation

[Cohen and McCallum, 2003]

What is Information Extraction?

Techniques: Information Extraction = segmentation + classification + association + clustering

Segments from the article above; mentions marked * are clustered as the same organization:

* Microsoft Corporation
CEO
Bill Gates
* Microsoft
Gates
* Microsoft
Bill Veghte
* Microsoft
VP
Richard Stallman
founder
Free Software Foundation

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

[Cohen and McCallum, 2003]
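
The four techniques can be sketched on a shortened version of the article. This is a toy illustration only, not the tutorial's or WhizBang's method: the gazetteer lists, the same-sentence nearest-span association rule, and the prefix-based clustering are all assumptions made for the example, where a real system would learn these decisions.

```python
import re

TEXT = (
    "For years, Microsoft Corporation CEO Bill Gates railed against "
    "open-source software. "
    '"We love the concept of shared source," said Bill Veghte, a Microsoft VP. '
    "Richard Stallman, founder of the Free Software Foundation, countered."
)

# Hypothetical toy gazetteers standing in for learned segmenters/classifiers.
GAZETTEER = {
    "ORGANIZATION": ["Microsoft Corporation", "Free Software Foundation", "Microsoft"],
    "TITLE": ["CEO", "VP", "founder"],
    "NAME": ["Bill Gates", "Bill Veghte", "Richard Stallman"],
}


def segment_and_classify(sentence):
    """Segmentation + classification: find spans and give each a label."""
    spans = []
    for label, phrases in GAZETTEER.items():
        for phrase in phrases:
            for m in re.finditer(r"\b" + re.escape(phrase) + r"\b", sentence):
                spans.append((m.start(), m.end(), label, phrase))
    # Prefer the longest span when two overlap ("Microsoft Corporation" wins).
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    kept, last_end = [], -1
    for span in spans:
        if span[0] >= last_end:
            kept.append(span)
            last_end = span[1]
    return kept


def gap(a, b):
    """Number of characters between two non-overlapping spans."""
    return max(a[0] - b[1], b[0] - a[1])


def nearest(name, spans, label):
    """Association: pick the closest span of the given label in the sentence."""
    candidates = [s for s in spans if s[2] == label]
    return min(candidates, key=lambda s: gap(name, s))[3] if candidates else None


def cluster_orgs(mentions):
    """Clustering: map each mention to the shortest mention that prefixes it."""
    unique = sorted(set(mentions), key=len)
    return {m: next(c for c in unique if m.startswith(c)) for m in unique}


def extract(text):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    per_sentence = [segment_and_classify(s) for s in sentences]
    canon = cluster_orgs(
        s[3] for spans in per_sentence for s in spans if s[2] == "ORGANIZATION"
    )
    records = []
    for spans in per_sentence:
        for name in (s for s in spans if s[2] == "NAME"):
            org = nearest(name, spans, "ORGANIZATION")
            records.append((name[3], nearest(name, spans, "TITLE"), canon.get(org, org)))
    return records


for record in extract(TEXT):
    print(record)
```

Each printed tuple corresponds to one row of the NAME / TITLE / ORGANIZATION table, with "Microsoft Corporation" folded into "Microsoft" by the clustering step.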
IE from the Web: The Big Picture

[Diagram. Components: Site List (www.xyz.com, www.wvx.edu, www.abci.org, www.blah.net, ...), Web Crawler, Site Classifier, Page Classifier, IE (Segment, Classify, Associate, Cluster), Database, Search, Data Mining]



IE from the Web

• Text paragraphs without formatting
– Example: Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.
• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables



IE from the Web

• Global extraction models
– Extracting from many web sites
– Content-based
– Generic: apply to any web site
– Similar to parsing natural language
• Local extraction models
– Extracting from one web site
– Format-based (XML)
– Customized to each web site
– Similar to parsing a formal language
• Models are highly complementary
– Local models can collect training data for global models
– Local models can boost accuracy of global models on specific sites
– Global models can suggest fields for local models
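
A minimal sketch of the contrast between the two model types, under stated assumptions: the XML page, its tag layout (name in `<b>`, title in `<i>`), and the short list of title words are all hypothetical, chosen only to show a format-based local extractor next to a content-based global one.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page from one site, already converted to XML.
PAGE = """
<page>
  <div class="staff">
    <p><b>Jane Doe</b><i>Chief Scientist</i></p>
    <p><b>John Roe</b><i>Sales Director</i></p>
  </div>
</page>
"""


def local_extract(xml_text):
    """Local model: depends on this one site's markup, so it breaks elsewhere."""
    root = ET.fromstring(xml_text)
    return [(p.findtext("b"), p.findtext("i")) for p in root.iter("p")]


# Global model: depends only on content cues, so it can run on any site's text.
TITLE_WORDS = r"(?:Chief|Director|Scientist|Manager|Sales|President)"
PATTERN = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+), ((?:" + TITLE_WORDS + r" ?)+)")


def global_extract(text):
    return [(m.group(1), m.group(2).strip()) for m in PATTERN.finditer(text)]


print(local_extract(PAGE))
print(global_extract("Jane Doe, Chief Scientist, joined in 1999."))
```

Pairs produced by the local extractor on a wrapped site could serve as labeled examples for the global model, which is one way the two are complementary.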



WhizBang: Technology

• WhizBang Labs (1998-2002)
– Information extraction from the web
– Converting unstructured text to structured database format
• Core technologies
– Global extraction models
    • Regular expression language for hand-built extraction patterns
    • Machine learning techniques for filtering extracted text
– Wrapster: an efficient tool to create site-specific extractors
– ChangeDetector: web site change detection monitoring
– Support structure
    • Text classification (sites, web pages, text segments)
    • Web crawler, XML parser, text to XML conversion
    • Data labeling environment


WhizBang: Products and Services

• Customized Web Portal Construction
– Flipdog: Job Portal Site
• Information Gathering from the Web
– Personal contact information: { Name, Job Title, email }
– Company contact information: { Company Name, Address }
• Competitive Intelligence
– Track changes on web sites of competitors
• IE from legacy data
– Organize web sites
• Company finder
– A custom search engine


Flipdog: A Job Portal Site

• Support for employers and job seekers
• Technology showcase
• Building the Flipdog job database
– Crawl the web for company sites
– Find job posting web pages
– Extract job posting info:
    • Job title, description, location, contact info.
– Load job posting into database
– Provide search interface to database
– Value-added services for employers, job seekers


Flipdog: Job Search

Job Openings:
Category = Food Services
Keyword = Baker
Location = Continental U.S.

Flipdog: Candidate Search

Flipdog: Data Mining




Information Gathering from the Web

• Information Gathering – standard practice
– Human analysts browse web site
– Enter data by hand into a spreadsheet
• Benefits of automation
– Faster, cheaper in the long run
– Easier to maintain and update database
• Challenges of automation
– Higher start-up costs
– Disruptive technology, people will lose their jobs
– Must compete with outsourcing


Conventional Wisdom
• Automatic extraction with machine learning is less accurate than manual extraction (False)
– With sufficient design effort, automatic methods have similar precision (~90%) and 5-10x the recall
– Errors people make
    • Poor recall, particularly in long, unformatted text
    • Transcription errors, careless errors
– Many human errors can be avoided with better software
• Automatic extraction seems like an obvious win
– People visit only a few pages, machines easily process everything on the site
– Extract {Joe Smith, Assistant Janitor} from the footnote on p. 29 of the building maintenance manual
– System can track changes to web site and automatically update database
• Core problem with automatic extraction
– Type of errors made by automatic system
– Flipdog example
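
To make the precision and recall comparison concrete, here is a small worked example. All counts are invented for illustration (a site with 200 true contacts, an analyst who records 20, a system that extracts 150); only the ~90% precision level and the 5-10x recall range come from the slide.

```python
def precision_recall(extracted, correct, total_true):
    """Precision = correct / extracted; recall = correct / total_true."""
    return correct / extracted, correct / total_true


# Hypothetical counts for one web site with 200 true name-title pairs.
manual_p, manual_r = precision_recall(extracted=20, correct=18, total_true=200)
auto_p, auto_r = precision_recall(extracted=150, correct=135, total_true=200)

print(f"manual:    precision={manual_p:.2f} recall={manual_r:.3f}")
print(f"automatic: precision={auto_p:.2f} recall={auto_r:.3f}")
print(f"recall ratio: {auto_r / manual_r:.1f}x")
```

With these made-up counts both methods sit at 90% precision, while the automatic system's recall is several times higher because it reads every page rather than a handful.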
[Figure: User gives first K positive, and thus many implicit negative, examples; Learner. Cohen and McCallum, 2003]

[Figure: Cohen and McCallum, 2003]

Automatic Change Detection
[Figure: Boyapati et al., WWW 2002]


Automatic Extraction Errors
• The dreaded parity error
Name 1 Job Title 1
Name 2 Job Title 2
Name 3 Job Title 3
Name 4 Job Title 4

• Obsolete information
– One site had 10 years of annual reports on line
– Extraction system happily crunched through all of them
– Vast majority of extracted data no longer correct
• Fictional Characters
– Ex. { Harry Potter, Wizard }, { James Bond, Secret Agent }
• Famous People
– Ex. { Bill Clinton, President }
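
The slide does not define the parity error. One plausible reading, assumed here, is an off-by-one misalignment: when names and titles alternate in a listing and one item is missing, naive pairing shifts every later pair. The token lists below are hypothetical.

```python
def pair_alternating(tokens):
    """Naively pair tokens as (name, title), (name, title), ..."""
    return list(zip(tokens[0::2], tokens[1::2]))


clean = ["Name 1", "Job Title 1", "Name 2", "Job Title 2", "Name 3", "Job Title 3"]
# Same listing, but the page omits the job title for Name 1.
broken = ["Name 1", "Name 2", "Job Title 2", "Name 3", "Job Title 3"]

print(pair_alternating(clean))
print(pair_alternating(broken))
```

On the broken listing every pair after the gap is wrong, which is the kind of systematic error a human transcriber would not make.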



Conventional Wisdom
• Extractors based on machine learning are easy to customize for a new domain
– Simply gather more training data
– False!
• Feature engineering often critical for good performance
– Requires special expertise, typically a PhD in machine learning
• Ex. Customize {Name, Job Title} extraction for German
– Capitalization is a key feature in named-entity extraction
– All German nouns are capitalized, so capitalization alone no longer singles out names
– Resource gathering more difficult
    • Purchased German phone book on CD, but couldn’t access bulk data
    • Strict EU privacy requirements
    • Found good list of job titles on German government site
    • German is a compounding language with many noun-phrase (NP) variants; need NLP!
– Assumed sites would be in German
    • In reality, a mixture of German and English on each page!
    • Sites mixed German and English job titles freely
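
The capitalization point can be illustrated with a toy feature. The two sentences and the "capitalized, not sentence-initial" rule are assumptions for the example, not the features WhizBang used.

```python
def capitalized_tokens(sentence):
    """Return capitalized words, skipping the sentence-initial word."""
    words = [w.strip('.,;:!?"') for w in sentence.split()[1:]]
    return [w for w in words if w and w[0].isupper()]


english = "The manager of the department met the engineer Peter Schmidt on Monday."
german = "Der Leiter der Abteilung traf den Ingenieur Peter Schmidt am Montag."

print(capitalized_tokens(english))
print(capitalized_tokens(german))
```

In the English sentence the feature mostly isolates the person name; in the German one it also fires on every common noun, so it carries far less signal for name extraction.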



Managing Expectations

• Hard to sell automatic extraction technology
– Customers had trouble understanding how it works
– Needed to send Chief Scientist on many sales calls
– Many customers expected 100% accuracy
– Customers don’t accept stupid mistakes
    • i.e., common-sense errors a human wouldn’t make
– Good analogy: machine translation
• Hard to get good training data
– Customers don’t grasp concept of an unbiased data sample
– Ask for random sample of company web sites
– Customer provides first 100 from database
– Turns out to be 100 oldest companies in database
– Strongly biased sample!
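
The sampling problem can be sketched as follows. The record layout, the `founded` field, and the assumption that the database is stored oldest-first are all hypothetical; the point is only the difference between taking the first 100 rows and drawing 100 at random.

```python
import random
from statistics import mean

# Hypothetical database of 1000 companies, stored in insertion order (oldest first).
RECORDS = [{"id": i, "founded": 1900 + i // 10} for i in range(1000)]


def unbiased_sample(records, k, seed=0):
    """Draw k records uniformly at random, without replacement."""
    return random.Random(seed).sample(records, k)


def mean_year(records):
    return mean(r["founded"] for r in records)


first_100 = RECORDS[:100]
random_100 = unbiased_sample(RECORDS, 100)

print("first 100 mean founding year: ", mean_year(first_100))
print("random 100 mean founding year:", mean_year(random_100))
print("whole database:               ", mean_year(RECORDS))
```

The first-100 slice covers only the oldest decade of companies, while the random draw tracks the database as a whole.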



Understanding the Real Problem

• Customer wants personal contact information from company web sites
– e.g. { Person Name, Job Title, email }
• WhizBang technology extracts all name-title pairs on the site
– Client wants only people who actually work for the company sponsoring the site
– Seems obvious in hindsight, but never specified in contract!
– Discovered this problem on the day of data delivery!
– This task requires high-level reasoning
• Machine learning to the rescue
– Two-week extension
– Solution: classify each name-title pair as internal or external



Machine Learning to the Rescue
• Label name-title (NT) pairs as internal or external
– Very hard task, labeling is slow
– Often needed independent web search on each name
– Initial inter-labeler agreement rate: 50%
– Definition of internal
    • Subsidiaries? Consultants? How do you handle a consortium?
– Eventually, after double and triple labeling with adjudication, gathered adequate training data
– Trained human labelers averaged 80-85% accuracy
• Custom features for classifier
– Page type, email match to company name, job title type, etc.
• Built classifier in 2 weeks, accuracy = 85-90%!
– Took 3.5 people working 20 hours/day for 2 weeks
– Reduced recall, but still 3-4x manual output
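
A rough sketch of how an inter-labeler agreement rate and a triple-labeling pass might be computed. The labels are invented, and majority voting is used here as a simple stand-in for adjudication, which in the talk may well have been done by a person.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two labelers chose the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def adjudicate(*labelers):
    """Majority vote per item; None when there is no strict majority."""
    results = []
    for votes in zip(*labelers):
        label, count = Counter(votes).most_common(1)[0]
        results.append(label if count > len(votes) / 2 else None)
    return results


# Hypothetical labels for four name-title pairs.
labeler_a = ["internal", "external", "internal", "external"]
labeler_b = ["internal", "internal", "external", "external"]
labeler_c = ["internal", "external", "external", "external"]

print(agreement_rate(labeler_a, labeler_b))
print(adjudicate(labeler_a, labeler_b, labeler_c))
```

Items that come back as None under double labeling are the ones that would need a third label or a discussion of what "internal" means.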



Web Page Change Detection
• ChangeDetector: track changes on web sites over time
• Challenges
– Multiple levels of granularity (site, page)
– What constitutes a significant change?
    • Often depends on target audience
– Clients may wish to track changes over hundreds of sites
    • Classification of change types
    • Aggregation into summary reports
• Customized report generation
– Web sites to be tracked
– Type of pages to track
– Type of changes to track
– Frequency of change reporting
• Integrated with information extraction - examples
– New product releases (Product Name)
– Changes to officers and directors (Person Name)
– New clients, partners (Company Name)
– Changes to financial results (Monetary Amounts)
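
Page-level change detection can be sketched with a line diff between two stored snapshots. This uses Python's standard `difflib` as a stand-in and is not how ChangeDetector itself worked; the snapshot text is hypothetical.

```python
import difflib


def page_changes(old_text, new_text):
    """Return lines added to and removed from a page between two snapshots."""
    added, removed = [], []
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return {"added": added, "removed": removed}


# Hypothetical snapshots of one page, taken on two different days.
old_page = "Products\nWidget 1.0\nCEO: Jane Doe"
new_page = "Products\nWidget 2.0\nCEO: Jane Doe\nNew partner: Acme"

print(page_changes(old_page, new_page))
```

Running an extractor over the added and removed lines, rather than the whole page, is one way the change types listed above (product names, officers, partners, amounts) could be classified.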



Competitive Intelligence
• Target Application: Competitive Intelligence
– Track competitors’ web sites
• Produced extraordinary demos
– Track potential client’s site before visit
– Come to sales visit with summary reports in hand
• Task is very hard for people to do well
– Ideal for automatic processing techniques
– Detecting the removal of information is critical
– Even subtle wording changes are often important
– Small change to product description can indicate shift in marketing strategy
• Remarkable nuggets of hidden information (example)
– One site posted a detailed product specification sheet
– Then removed it 3 days later – sensitive data not intended for public
– Information tracked and stored by ChangeDetector
– Data mining techniques can be built precisely for this purpose



Problems with Change Detection
• Hostile crawling
– Organizations want their pages indexed by search engines
    • Often pages are actively submitted
– Organizations will not want their pages indexed for Competitive Intelligence
    • Lock out the WhizBang crawler
    • WhizBang depends on open access to Web for all its business
    • Cannot afford to get locked out by many sites
– Possible to crawl the Web anonymously
    • Involves violating rules/conventions for crawlers
    • Unethical, bad business practice, likely to lead to lawsuits
• Copyright
– Web pages copyrighted by owners
– ChangeDetector stores copy of page to track changes
– Companies unlikely to support storage of their web pages for CI purposes
– Lawsuits . . .
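
The lock-out mechanism can be sketched with a robots.txt check using Python's standard `urllib.robotparser`. The robots.txt content, the crawler name `ExampleCIBot`, and the URLs are hypothetical; a site that objects to competitive-intelligence crawling could publish rules like these, and a well-behaved crawler would honor them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that bans one named crawler and limits everyone else.
ROBOTS_TXT = """\
User-agent: ExampleCIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Check whether the given crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("ExampleCIBot", "http://www.example.com/products.html"))
print(allowed("SomeSearchBot", "http://www.example.com/products.html"))
print(allowed("SomeSearchBot", "http://www.example.com/private/plan.html"))
```

Crawling anonymously amounts to ignoring or evading a check like this, which is the practice the slide calls unethical.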



ChangeDetector: New Markets
• ChangeDetector can’t work as Competitive Intelligence product
• Reposition it as Knowledge Sharing / Information Security product
• Many companies have poor control over their own Web sites
– Exerting control is hard
– Require every change be approved by 3 layers of management
– Better to track the changes passively
• Avoids the problems previously listed
– Crawling your own site is not a problem
– Copyright not an issue if information used internally
• Information Security: managers can track release of information
• Knowledge Sharing: employees in widely separated divisions can
follow what is going on in the rest of the company
– May require more active policy of keeping Web sites current
– Cheaper than many knowledge sharing solutions



Summary

• Automatic information extraction is a powerful technology
– Similar precision to human labelers with far higher recall
– Customized site wrapping tools can improve human efficiency when 100% precision is required
– Web site change detection technology can keep data up to date automatically
• Many challenges in marketing this technology
– Automatic IE systems make stupid mistakes
– Must convince customers that this is OK
– Machine learning does not guarantee easy customization
– Customers have trouble understanding how technology works
• Information Extraction is core technology, not application
– True value comes from packaging it in the right application
– ChangeDetector: hard to find the right application


Acknowledgements and References

• A Tutorial on Information Extraction
– William Cohen and Andrew McCallum
– Given at KDD-2003, http://www-2.cs.cmu.edu/~wcohen
– Slides taken from it are marked: [Cohen and McCallum, 2003]
• ChangeDetector: A Site-Level Monitoring Tool for the WWW
– Boyapati, Chevrier, Finkel, Glance, Pierce, Stockton, and Whitmer
– Proceedings of WWW 2002
– http://www2002.org/CDROM/refereed/458/index.html
• Disclaimer: “All value judgements in this talk are solely the
opinion of the author and do not represent the official views of
WhizBang Labs or its successors.”
