Commercializing Information Extraction
Lessons from WhizBang Labs
David Hull
Clairvoyance Corporation
April 2005
Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Software Foundation

[Cohen and McCallum, 2003]
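As a rough illustration of how the news passage above maps to NAME / TITLE / ORGANIZATION records, here is a minimal Python sketch. It is not the method from the slides or from Cohen and McCallum: the two regular expressions, the `Record` type, and the `extract` function are hypothetical and hand-written for two sentences of this passage only. The Bill Gates / CEO row is not covered, because the sentence that introduces him is cut off in the slide text.

```python
import re
from typing import List, NamedTuple


class Record(NamedTuple):
    name: str
    title: str
    organization: str


# Hypothetical, hand-written patterns (assumption: names are two capitalized words).
NAME = r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"
PATTERNS = [
    # "<Name>, a <Org> <TITLE>", e.g. "Bill Veghte, a Microsoft VP"
    re.compile(NAME + r", an? (?P<organization>[A-Z]\w+) (?P<title>[A-Z]{2,3})\b"),
    # "<Name>, <title> of the <Org ...>", e.g. "Richard Stallman, founder of the ..."
    re.compile(NAME + r", (?P<title>[a-z]+) of the (?P<organization>(?:[A-Z]\w+ ?)+)"),
]


def extract(text: str) -> List[Record]:
    """Segment candidate spans, label them, and associate them into records."""
    records = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            records.append(
                Record(m["name"], m["title"], m["organization"].strip())
            )
    return records


passage = (
    '"We can be open source. We love the concept of shared source," '
    "said Bill Veghte, a Microsoft VP. "
    "Richard Stallman, founder of the Free Software Foundation, countered saying"
)

for record in extract(passage):
    print(record)
```

Hand-written patterns like these break as soon as the phrasing changes, which is one reason the pipeline on the next slide includes trained classifiers and a learner.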
IE from the Web: The Big Picture
[Diagram labels: Site List (www.xyz.com, www.wvx.edu, www.abci.org, www.blah.net, ...), Web Crawler, Site Classifier, Page Classifier, Sites, Search, IE (Segment, Classify, Associate, Cluster), Data Mining, Database]
[Figure labels: Non-grammatical snippets, rich formatting & links; Tables; Learner]
[Cohen and McCallum, 2003]
Automatic Change Detection
[Boyapati et al., WWW 2002]
• Obsolete information
– One site had 10 years of annual reports online
– Extraction system happily crunched through all of them
– Vast majority of extracted data no longer correct
• Fictional Characters
– Ex. { Harry Potter, Wizard }, { James Bond, Secret Agent }
• Famous People
– Ex. { Bill Clinton, President }
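The slides name these problem cases but do not say how they were handled. One possible guard, sketched here purely as an assumption, is a post-extraction filter that drops records from old source pages and records whose name is on a list of well-known or fictional names. The `Extraction` type, the `KNOWN_NAMES` set, the 365-day threshold, and the "Jane Doe" records are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Extraction:
    name: str
    title: str
    source_date: date  # assumed: publication date of the page the record came from


# Hypothetical list of fictional characters and famous people to exclude.
KNOWN_NAMES = {"Harry Potter", "James Bond", "Bill Clinton"}


def keep(record: Extraction, today: date, max_age_days: int = 365) -> bool:
    """Keep a record only if its source is recent and its name is not listed."""
    age_days = (today - record.source_date).days
    return age_days <= max_age_days and record.name not in KNOWN_NAMES


def filter_extractions(
    records: List[Extraction], today: date, max_age_days: int = 365
) -> List[Extraction]:
    return [r for r in records if keep(r, today, max_age_days)]


records = [
    Extraction("Harry Potter", "Wizard", date(2005, 1, 1)),  # fictional character
    Extraction("Jane Doe", "CFO", date(1996, 3, 1)),         # from an old annual report
    Extraction("Jane Doe", "CEO", date(2005, 2, 1)),         # recent, not listed
]

for record in filter_extractions(records, today=date(2005, 4, 1)):
    print(record)
```

A fixed name list and age cutoff are blunt: they would also drop a real person who happens to share a listed name, so any threshold or list would need tuning against real data.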