Querying Your Database in Natural Language: Pydata - Silicon Valley 2014 Daniel F. Moisset

Querying your database in natural language
PyData – Silicon Valley 2014

Daniel F. Moisset – dmoisset@machinalis.com
Data is everywhere
Collecting data is not the problem, but what to do with it
Any operation starts with selecting/filtering data
A classical approach: Search
Used by:
● Google
● Wikipedia
● Lucene/Solr
Performance can be improved:
● Stemming/synonyms
● Sorting data by relevance
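The two improvements listed above can be illustrated with a toy keyword search. This is only a sketch: the suffix-stripping `stem` function is a crude stand-in for a real stemmer, and `search`, `tokenize` and `DOCS` are names made up for this example.

```python
from collections import Counter


def stem(word):
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Split on whitespace and stem every word."""
    return [stem(w) for w in text.split()]


def search(query, documents):
    """Return indices of matching documents, most relevant first.

    Relevance here is simply the number of occurrences of query terms.
    """
    terms = set(tokenize(query))
    scored = []
    for idx, doc in enumerate(documents):
        counts = Counter(tokenize(doc))
        score = sum(counts[t] for t in terms)
        if score > 0:
            scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]


DOCS = [
    "swallows migrate south",
    "the swallow is a bird and a swallow flies fast",
    "databases store data",
]
```

Thanks to stemming, a query for "swallow" also finds the document that only mentions "swallows", and the document mentioning the term twice is ranked first.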
Limits of keyword based approaches
Query Languages
● SQL
● Many NOSQL approaches
● SPARQL
● MQL
Allow complex, accurate queries

SELECT array_agg(players), player_teams
FROM (
    SELECT DISTINCT t1.t1player AS players, t1.player_teams
    FROM (
        SELECT
            p.playerid AS t1id,
            concat(p.playerid,':', p.playername, ' ') AS t1player,
            array_agg(pl.teamid ORDER BY pl.teamid) AS player_teams
        FROM player p
        LEFT JOIN plays pl ON p.playerid = pl.playerid
        GROUP BY p.playerid, p.playername
    ) t1
    INNER JOIN (
        SELECT
            p.playerid AS t2id,
            array_agg(pl.teamid ORDER BY pl.teamid) AS player_teams
        FROM player p
        LEFT JOIN plays pl ON p.playerid = pl.playerid
        GROUP BY p.playerid, p.playername
    ) t2 ON t1.player_teams=t2.player_teams AND t1.t1id <> t2.t2id
) innerQuery
GROUP BY player_teams
Natural Language Queries
Getting popular:
● Wolfram Alpha
● Apple Siri
● Google Now
Pros and cons:
● Very accessible, trivial learning curve
● Still weak in its coverage: most applications have a list of “sample questions”
Outline of this talk: the Quepy approach
● Overview of our solution
● Simple example
● DSL
● Parser
● Question Templates
● Quepy applications
● Benefits
● Limitations
Quepy
● Open Source (BSD License) https://github.com/machinalis/quepy
● Status: usable, 2 demos available (dbpedia + freebase)
● Online demo at: http://quepy.machinalis.com/
● Complete documentation: http://que
Overview of the approach
“What is the airspeed velocity of an unladen swallow?”
● Parsing
    What|what|WP is|be|VBZ the|the|DT
    airspeed|airspeed|NN velocity|velocity|NN
    of|of|IN an|an|DT unladen|unladen|JJ
    swallow|swallow|NN
● Match + Intermediate representation
● Query generation & DSL
    SELECT DISTINCT ?x1 WHERE {
        ?x0 kingdom "Animal".
        ?x0 name "unladen swallow".
        ?x0 airspeed ?x1.
    }
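The parsing output above packs each word as token|lemma|POS. A minimal sketch of splitting that notation back into triples, assuming exactly this layout; `Word` and `parse_tagged` are illustrative names and not part of Quepy:

```python
from collections import namedtuple

# One analysed word: surface form, dictionary form, part-of-speech tag.
Word = namedtuple("Word", ["token", "lemma", "pos"])


def parse_tagged(text):
    """Split whitespace-separated 'token|lemma|POS' chunks into Word tuples."""
    return [Word(*chunk.split("|")) for chunk in text.split()]


TAGGED = (
    "What|what|WP is|be|VBZ the|the|DT "
    "airspeed|airspeed|NN velocity|velocity|NN "
    "of|of|IN an|an|DT unladen|unladen|JJ swallow|swallow|NN"
)

words = parse_tagged(TAGGED)
nouns = [w.token for w in words if w.pos == "NN"]
```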
Parsing
● Not done at character level but at a word level
● Word = Token + Lemma + POS
    “is” → is|be|VBZ (VBZ means “verb, 3rd person, singular, present tense”)
    “swallows” → swallows|swallow|NNS (NNS means “Noun, plural”)
● NLTK is smart enough to know that “swallows” here means the bird (noun) and not the action (verb)
● Question rule = “regular expressions”
    Token("what") + Lemma("be") + Question(Pos("DT")) + Plus(Pos("NN"))
    The word “what” followed by any variant of the “to be” verb, optionally followed by
    a determiner (articles, “all”, “every”), followed by one or more nouns
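To make the idea of a regular expression over words (rather than characters) concrete, here is a small self-contained matcher. It is a sketch under stated assumptions, not Quepy's implementation: the helpers `token`, `lemma`, `pos` and `matches`, and the quantifier strings, are hypothetical names chosen for this example.

```python
from collections import namedtuple

Word = namedtuple("Word", ["token", "lemma", "pos"])


# Predicates over a single word (hypothetical helpers, not Quepy classes).
def token(value):
    return lambda w: w.token.lower() == value


def lemma(value):
    return lambda w: w.lemma == value


def pos(tag):
    return lambda w: w.pos == tag


def matches(pattern, words):
    """True if the whole word list matches the pattern.

    pattern is a list of (predicate, quantifier) pairs; the quantifier is
    "1" (exactly one), "?" (zero or one) or "+" (one or more).
    """
    if not pattern:
        return not words
    (pred, quant), rest = pattern[0], pattern[1:]
    if quant == "?" and matches(rest, words):
        return True  # the optional element was skipped
    if not words or not pred(words[0]):
        return False
    if quant == "+" and matches(pattern, words[1:]):
        return True  # consume one word and try to repeat
    return matches(rest, words[1:])


# Same shape as the rule on the slide: what + be + optional DT + nouns.
WHAT_IS = [
    (token("what"), "1"),
    (lemma("be"), "1"),
    (pos("DT"), "?"),
    (pos("NN"), "+"),
]

QUESTION = [
    Word("What", "what", "WP"),
    Word("is", "be", "VBZ"),
    Word("the", "the", "DT"),
    Word("airspeed", "airspeed", "NN"),
    Word("velocity", "velocity", "NN"),
]
```

Because matching works on lemmas and POS tags, the same rule would also accept "what are" or "what was" without listing every verb form.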
Intermediate representation
● Graph like, with some known values and some holes (x0, x1, …). Always has a “root” (house shaped in the picture)
● Similar to knowledge databases
● Easy to build from Python code
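A graph with known values, holes and a root can be modelled in a few lines of Python. The sketch below rebuilds the unladen swallow example by hand; `Hole` and `Graph` are hypothetical classes for illustration and do not reflect Quepy's internal data structures.

```python
class Hole:
    """An unknown value in the graph, printed as x0, x1, ..."""

    def __init__(self, index):
        self.index = index

    def __repr__(self):
        return f"x{self.index}"


class Graph:
    """Edges between holes and known values, plus a distinguished root."""

    def __init__(self):
        self.edges = []
        self.holes = []
        self.root = None

    def new_hole(self):
        hole = Hole(len(self.holes))
        self.holes.append(hole)
        return hole

    def add_edge(self, source, relation, target):
        self.edges.append((source, relation, target))


graph = Graph()
x0 = graph.new_hole()                           # the animal we are asking about
x1 = graph.new_hole()                           # the value we want back
graph.add_edge(x0, "kingdom", "Animal")         # known value
graph.add_edge(x0, "name", "unladen swallow")   # known value
graph.add_edge(x0, "airspeed", x1)              # hole to be filled by the database
graph.root = x1
```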
Code generator
● Built-in for MQL
● Built-in for SPARQL
● Possible approaches for SQL, other languages
● DSL-guided
● Outputs the query string (Quepy does not connect to a database)
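The code generation step can be pictured as a function from the intermediate representation to a query string. The sketch below assumes a simplified representation (a list of triples where integers stand for holes) and emits the same simplified query shown in the overview, which is not complete SPARQL as written. `term` and `to_query` are hypothetical names, not Quepy functions.

```python
def term(value):
    """Holes (ints) become ?xN variables; known values become quoted literals."""
    if isinstance(value, int):
        return f"?x{value}"
    return '"{}"'.format(value)


def to_query(edges, root):
    """Build the query text only; nothing here talks to a database."""
    lines = [f"SELECT DISTINCT {term(root)} WHERE {{"]
    for source, relation, target in edges:
        lines.append(f"    {term(source)} {relation} {term(target)}.")
    lines.append("}")
    return "\n".join(lines)


EDGES = [
    (0, "kingdom", "Animal"),
    (0, "name", "unladen swallow"),
    (0, "airspeed", 1),
]

query = to_query(EDGES, root=1)
```

Keeping generation separate from execution is what lets the same intermediate representation target more than one query language.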
Code examples
DSL
from quepy.dsl import FixedRelation, FixedType

class DefinitionOf(FixedRelation):
    relation = "/common/topic/description"
    reverse = True

class IsMovie(FixedType):
    fixedtype = "/film/film"

class IsPerformance(FixedType):
    fixedtype = "/film/performance"

class PerformanceOfActor(FixedRelation):
    relation = "/film/performance/actor"

class HasPerformance(FixedRelation):
    relation = "/film/film/starring"

class NameOf(FixedRelation):
    relation = "/type/object/name"
    reverse = True
DSL
Given a thing x0, its definition:
    DefinitionOf(x0)
Given an actor x2, movies where x2 acts:
    performances = IsPerformance() + PerformanceOfActor(x2)
    movies = IsMovie() + HasPerformance(performances)
    x3 = NameOf(movies)
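Parsing: Particles and templates
from refo import Plus, Question
from quepy.parsing import Lemma, Pos, QuestionTemplate, Particle

# Thing is defined first because WhatIs uses it in its class body.
class Thing(Particle):
    regex = Question(Pos("JJ")) + \
        Plus(Pos("NN") | Pos("NNP") | Pos("NNS"))

class WhatIs(QuestionTemplate):
    regex = Lemma("what") + Lemma("be") + \
        Question(Pos("DT")) + Thing() + Question(Pos("."))

    def interpret(self, match):
        label = DefinitionOf(match.thing)
        return label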
Parsing: “movies starring <actor>”
● More DSL:
class IsPerson(FixedType):
    fixedtype = "/people/person"
    fixedtyperelation = "/type/object/type"

class IsActor(FixedType):
    fixedtype = "Actor"
    fixedtyperelation = "/people/person/profession"
Parsing: A more complex particle
● And then a new Particle:
class Actor(Particle):
    regex = Plus(Pos("NN") | Pos("NNS") | Pos("NNP") | Pos("NNPS"))

    def interpret(self, match):
        name = match.words.tokens
        return IsPerson() + IsActor() + HasKeyword(name)
template
class ActedOnQuestion(QuestionTemplate):
    acted_on = (Lemma("appear") | Lemma("act") | Lemma("star"))
    movie = (Lemma("movie") | Lemma("movies") | Lemma("film"))
    regex = ((Question(Lemma("list")) + movie + Lemma("with") + Actor()) |
             (Question(Pos("IN")) + (Lemma("what") | Lemma("which")) +
              movie + Lemma("do") + Actor() + acted_on + Question(Pos("."))) |
             (Question(Lemma("list")) + movie + Lemma("star") + Actor()))

“list movies with Harrison Ford”
“list films starring Harrison Ford”
“In which film does Harrison Ford appear?”
template
class ActedOnQuestion(QuestionTemplate):
    # ...
    def interpret(self, match):
        performance = IsPerformance() + PerformanceOfActor(match.actor)
        movie = IsMovie() + HasPerformance(performance)
        movie_name = NameOf(movie)
        return movie_name
Apps: gluing it all together
● You build a Python package with quepy startapp myapp
● There you add DSL and question templates
● Then configure it by editing myapp/settings.py (output query language, data encoding)
● You can use that with:
    app = quepy.install("myapp")
    question = "What is love?"
    target, query, metadata = app.get_query(question)
    # db stands for your own database connection; Quepy only builds the query string
    db.execute(query)
The good things
● Effort to add question templates is small (minutes-hours), and the benefit is linear wrt effort
● Good for industry applications
● Low specialization required to extend
● Human work is very parallelizable
● Easy to get many people to work on questions
● Better for domain specific databases
Limitations
● Better for domain specific databases
● It won't scale to massive amounts of question templates (they start to overlap/contradict each other)
● Hard to add computation (compare: Wolfram Alpha) or deduction (can be added in the database)
● Not very fast (this is an implementation, not design issue)
● Requires a structured database
Future directions
● Testing this under other databases
● Improving performance
● Collecting uncovered questions and adding machine learning to learn new patterns
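Collecting uncovered questions could start as a simple filter around the query step. The sketch below assumes that the query step returns None as the query when no template matches; that behaviour, along with the names `collect_uncovered` and `fake_get_query`, is an assumption made for illustration and should be checked against the Quepy documentation.

```python
def collect_uncovered(questions, get_query):
    """Return the questions for which no query could be generated.

    get_query is any callable returning (target, query, metadata); a query
    of None is assumed to mean that no template matched.
    """
    uncovered = []
    for question in questions:
        target, query, metadata = get_query(question)
        if query is None:
            uncovered.append(question)
    return uncovered


def fake_get_query(question):
    """Stand-in for a real application: only covers 'What is ...' questions."""
    if question.lower().startswith("what is"):
        return "x0", "SELECT ...", None
    return None, None, None


missed = collect_uncovered(
    ["What is love?", "Who directed Alien?"],
    fake_get_query,
)
```

The resulting list is the raw material for writing new templates by hand or for training a model to propose them.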
Q&A
You can also reach me at:
dmoisset@machinalis.com
Twitter: @dmoisset
http://machinalis.com/
Thanks!
