Querying Your Database in Natural Language: Pydata - Silicon Valley 2014 Daniel F. Moisset

Querying your database in natural language

PyData – Silicon Valley 2014

Daniel F. Moisset – dmoisset@machinalis.com

Data is everywhere

● Collecting data is not the problem, but what to do with it
● Any operation starts with selecting/filtering data

A classical approach: Search

● Used by:
    ● Google
    ● Wikipedia
    ● Lucene/Solr
● Performance can be improved:
    ● Stemming/synonyms
    ● Sorting data by relevance

Limits of keyword-based approaches

Query Languages

● SQL
● Many NoSQL approaches
● SPARQL
● MQL
● Allow complex, accurate queries

    SELECT array_agg(players), player_teams
    FROM (
        SELECT DISTINCT t1.t1player AS players, t1.player_teams
        FROM (
            SELECT
                p.playerid AS t1id,
                concat(p.playerid,':', p.playername, ' ') AS t1player,
                array_agg(pl.teamid ORDER BY pl.teamid) AS player_teams
            FROM player p
            LEFT JOIN plays pl ON p.playerid = pl.playerid
            GROUP BY p.playerid, p.playername
        ) t1
        INNER JOIN (
            SELECT
                p.playerid AS t2id,
                array_agg(pl.teamid ORDER BY pl.teamid) AS player_teams
            FROM player p
            LEFT JOIN plays pl ON p.playerid = pl.playerid
            GROUP BY p.playerid, p.playername
        ) t2 ON t1.player_teams=t2.player_teams AND t1.t1id <> t2.t2id
    ) innerQuery
    GROUP BY player_teams

Natural Language Queries

● Getting popular:
    ● Wolfram Alpha
    ● Apple Siri
    ● Google Now
● Pros and cons:
    ● Very accessible, trivial learning curve
    ● Still weak in its coverage: most applications have a list of “sample questions”

Outline of this talk: the Quepy approach

● Overview of our solution
● Simple example
    ● DSL
    ● Parser
    ● Question Templates
● Quepy applications
● Benefits
● Limitations

Quepy

● Open Source (BSD License): https://github.com/machinalis/quepy
● Status: usable, 2 demos available (dbpedia + freebase)
● Online demo at: http://quepy.machinalis.com/
● Complete documentation: http://que

Overview of the approach

“What is the airspeed velocity of an unladen swallow?”

● Parsing

    What|what|WP is|be|VBZ the|the|DT
    airspeed|airspeed|NN velocity|velocity|NN
    of|of|IN an|an|DT unladen|unladen|JJ
    swallow|swallow|NN

● Match + Intermediate representation
● Query generation & DSL

    SELECT DISTINCT ?x1 WHERE {
        ?x0 kingdom "Animal".
        ?x0 name "unladen swallow".
        ?x0 airspeed ?x1.
    }

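The parsing step above emits each word as a token|lemma|POS triple. As a minimal sketch of how that notation can be read back into structured data, the snippet below splits the annotated line into records. The names `Word` and `parse_tagged` are hypothetical and are not part of Quepy; the input is assumed to be whitespace-separated triples exactly as printed on the slide.

```python
from collections import namedtuple

# Hypothetical record type for one annotated word (not a Quepy class).
Word = namedtuple("Word", ["token", "lemma", "pos"])


def parse_tagged(text):
    """Split 'token|lemma|POS token|lemma|POS ...' into Word records."""
    return [Word(*item.split("|")) for item in text.split()]


tagged = ("What|what|WP is|be|VBZ the|the|DT airspeed|airspeed|NN "
          "velocity|velocity|NN of|of|IN an|an|DT unladen|unladen|JJ "
          "swallow|swallow|NN")

words = parse_tagged(tagged)

# Keep the tokens whose part-of-speech tag marks a noun (NN, NNS, NNP, ...).
nouns = [w.token for w in words if w.pos.startswith("NN")]
print(nouns)
```

Working with records like these, rather than raw characters, is what lets the later matching step talk about lemmas and part-of-speech tags directly.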
Parsing

● Not done at character level but at a word level
● Word = Token + Lemma + POS
    ● “is” → is|be|VBZ (VBZ means “verb, 3rd person, singular, present tense”)
    ● “swallows” → swallows|swallow|NNS (NNS means “Noun, plural”)
    ● NLTK is smart enough to know that “swallows” here means the bird (noun) and not the action (verb)
● Question rule = “regular expressions”

    Token("what") + Lemma("be") + Question(Pos("DT")) + Plus(Pos("NN"))

    The word “what” followed by any variant of the “to be” verb, optionally followed by
    a determiner (articles, “all”, “every”), followed by one or more nouns

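The rule above behaves like an ordinary regular expression whose alphabet is words instead of characters. The following is a rough sketch of that idea and not how Quepy implements it: the helpers `token`, `lemma`, `pos`, `optional`, `one_or_more` and `matches` are hypothetical stand-ins for `Token`, `Lemma`, `Pos`, `Question` and `Plus`, and the input is assumed to be already tagged in the token|lemma|POS form, with lowercase tokens.

```python
import re

# One field of a tagged word: anything except the "|" separator or a space.
FIELD = r"[^| ]+"


def token(text):
    """Fragment matching one word whose token is exactly `text`."""
    return re.escape(text) + r"\|" + FIELD + r"\|" + FIELD + " "


def lemma(text):
    """Fragment matching one word whose lemma is exactly `text`."""
    return FIELD + r"\|" + re.escape(text) + r"\|" + FIELD + " "


def pos(tag):
    """Fragment matching one word whose POS tag is exactly `tag`."""
    return FIELD + r"\|" + FIELD + r"\|" + re.escape(tag) + " "


def optional(fragment):
    return "(?:" + fragment + ")?"


def one_or_more(fragment):
    return "(?:" + fragment + ")+"


def matches(rule, tagged_words):
    # Each word is terminated by a space so a fragment consumes whole words.
    text = "".join(word + " " for word in tagged_words)
    return re.fullmatch(rule, text) is not None


# "what", any form of "to be", an optional determiner, one or more nouns.
rule = token("what") + lemma("be") + optional(pos("DT")) + one_or_more(pos("NN"))

full = "what|what|WP is|be|VBZ the|the|DT airspeed|airspeed|NN velocity|velocity|NN"
no_noun = "what|what|WP is|be|VBZ"

print(matches(rule, full.split()))
print(matches(rule, no_noun.split()))
```

Encoding words as delimited strings keeps the sketch short; a real implementation would more likely match over a sequence of word objects so that tokens containing "|" or spaces cannot confuse the pattern.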
Intermediate representation

● Graph like, with some known values and some holes (x0, x1, …). Always has a “root” (house shaped in the picture)
● Similar to knowledge databases
● Easy to build from Python code

Code generator

● Built-in for MQL
● Built-in for SPARQL
● Possible approaches for SQL, other languages
● DSL-guided
● Outputs the query string (Quepy does not connect to a database)

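To make the "graph with holes in, query string out" step concrete, here is a small sketch that renders a list of triples as the SPARQL-like text shown in the overview. It is an illustration only: `render_query` is a hypothetical function, integers are assumed to stand for holes (x0, x1, …), and string objects are assumed to be known values. Quepy's built-in generators are more involved (for example, this sketch does no escaping of literals).

```python
def render_query(triples, target):
    """Render (subject, relation, object) triples as a SPARQL-like string.

    Integers are holes and become variables (?x0, ?x1, ...); any other
    value is treated as a known value and emitted as a quoted literal.
    """
    def term(value):
        if isinstance(value, int):
            return "?x{}".format(value)
        return '"{}"'.format(value)

    lines = ["SELECT DISTINCT ?x{} WHERE {{".format(target)]
    for subject, relation, obj in triples:
        lines.append("  {} {} {}.".format(term(subject), relation, term(obj)))
    lines.append("}")
    return "\n".join(lines)


# The swallow question: x0 is the bird, x1 is the value being asked for.
triples = [
    (0, "kingdom", "Animal"),
    (0, "name", "unladen swallow"),
    (0, "airspeed", 1),
]

query = render_query(triples, target=1)
print(query)
```

The function only returns text. Sending it to a database is left to the caller, which mirrors the last point on the slide.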
Code examples

DSL

    from quepy.dsl import FixedRelation, FixedType

    class DefinitionOf(FixedRelation):
        relation = "/common/topic/description"
        reverse = True

    class IsMovie(FixedType):
        fixedtype = "/film/film"

    class IsPerformance(FixedType):
        fixedtype = "/film/performance"

    class PerformanceOfActor(FixedRelation):
        relation = "/film/performance/actor"

    class HasPerformance(FixedRelation):
        relation = "/film/film/starring"

    class NameOf(FixedRelation):
        relation = "/type/object/name"
        reverse = True

DSL

● Given a thing x0, its definition:

    DefinitionOf(x0)

● Given an actor x2, movies where x2 acts:

    performances = IsPerformance() + PerformanceOfActor(x2)
    movies = IsMovie() + HasPerformance(performances)
    x3 = NameOf(movies)

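The composition above can be pictured as building a small graph: `+` makes two expressions describe the same node, and a relation adds an edge to (or, when reversed, from) another expression's root. The sketch below is a simplified toy model of that behaviour, written for illustration. The classes reuse the names from the slides but are local stand-ins, not the `quepy.dsl` classes, and the default `fixedtyperelation` value is an assumption.

```python
import itertools

_fresh = itertools.count()  # source of unique node ids for this sketch


class Expression:
    """Toy graph: a root node id plus (subject, relation, object) edges."""

    def __init__(self):
        self.root = next(_fresh)
        self.edges = []

    def __add__(self, other):
        # Merge both operands so their roots become one shared node.
        result = Expression()
        for expr in (self, other):
            for s, r, o in expr.edges:
                s = result.root if s == expr.root else s
                o = result.root if o == expr.root else o
                result.edges.append((s, r, o))
        return result


class FixedType(Expression):
    fixedtype = None
    fixedtyperelation = "/type/object/type"  # assumed default for the sketch

    def __init__(self):
        super().__init__()
        self.edges.append((self.root, self.fixedtyperelation, self.fixedtype))


class FixedRelation(Expression):
    relation = None
    reverse = False

    def __init__(self, destination):
        super().__init__()
        self.edges.extend(destination.edges)
        if self.reverse:
            self.edges.append((destination.root, self.relation, self.root))
        else:
            self.edges.append((self.root, self.relation, destination.root))


class IsPerformance(FixedType):
    fixedtype = "/film/performance"


class IsMovie(FixedType):
    fixedtype = "/film/film"


class PerformanceOfActor(FixedRelation):
    relation = "/film/performance/actor"


class HasPerformance(FixedRelation):
    relation = "/film/film/starring"


class NameOf(FixedRelation):
    relation = "/type/object/name"
    reverse = True


actor = Expression()  # a hole standing for the actor (x2 on the slide)
performances = IsPerformance() + PerformanceOfActor(actor)
movies = IsMovie() + HasPerformance(performances)
name = NameOf(movies)

for edge in name.edges:
    print(edge)
```

In this model the root of `name` is the node the question asks for, and every other node is either a known value or a hole linked to it through the edges.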
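Parsing: Particles and templates

    from refo import Plus, Question
    from quepy.parsing import Lemma, Particle, Pos, QuestionTemplate

    # Thing is defined first because WhatIs uses it in its class body.
    class Thing(Particle):
        regex = Question(Pos("JJ")) + \
            Plus(Pos("NN") | Pos("NNP") | Pos("NNS"))

    class WhatIs(QuestionTemplate):
        regex = Lemma("what") + Lemma("be") + \
            Question(Pos("DT")) + Thing() + Question(Pos("."))

        def interpret(self, match):
            # DefinitionOf is one of the DSL classes shown earlier.
            label = DefinitionOf(match.thing)
            return label
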
Parsing: “movies starring <actor>”

More DSL:

    class IsPerson(FixedType):
        fixedtype = "/people/person"
        fixedtyperelation = "/type/object/type"

    class IsActor(FixedType):
        fixedtype = "Actor"
        fixedtyperelation = "/people/person/profession"

Parsing: A more complex particle

And then a new Particle:

    class Actor(Particle):
        regex = Plus(Pos("NN") | Pos("NNS") | Pos("NNP") | Pos("NNPS"))

        def interpret(self, match):
            name = match.words.tokens
            return IsPerson() + IsActor() + HasKeyword(name)

Parsing: A more complex template

    class ActedOnQuestion(QuestionTemplate):
        acted_on = (Lemma("appear") | Lemma("act") | Lemma("star"))
        movie = (Lemma("movie") | Lemma("movies") | Lemma("film"))
        regex = (Question(Lemma("list")) + movie + Lemma("with") + Actor()) | \
                (Question(Pos("IN")) + (Lemma("what") | Lemma("which")) +
                 movie + Lemma("do") + Actor() + acted_on + Question(Pos("."))) | \
                (Question(Lemma("list")) + movie + Lemma("star") + Actor())

● “list movies with Harrison Ford”
● “list films starring Harrison Ford”
● “In which film does Harrison Ford appear?”

Parsing: A more complex template

    class ActedOnQuestion(QuestionTemplate):
        # ...
        def interpret(self, match):
            performance = IsPerformance() + PerformanceOfActor(match.actor)
            movie = IsMovie() + HasPerformance(performance)
            movie_name = NameOf(movie)
            return movie_name

Apps: gluing it all together

● You build a Python package with quepy startapp myapp
● There you add the DSL and question templates
● Then configure it by editing myapp/settings.py (output query language, data encoding)
● You can use that with:

    import quepy

    app = quepy.install("myapp")
    question = "What is love?"
    target, query, metadata = app.get_query(question)
    # db stands for your own database client; Quepy only builds the query string
    db.execute(query)

The good things

● Effort to add question templates is small (minutes to hours), and the benefit is linear with respect to effort
● Good for industry applications
● Low specialization required to extend
● Human work is very parallelizable
● Easy to get many people to work on questions
● Better for domain specific databases

Limitations

● Better for domain specific databases
● It won't scale to massive amounts of question templates (they start to overlap/contradict each other)
● Hard to add computation (compare: Wolfram Alpha) or deduction (can be added in the database)
● Not very fast (this is an implementation issue, not a design issue)
● Requires a structured database

Future directions

● Testing this under other databases
● Improving performance
● Collecting uncovered questions and adding machine learning to learn new patterns

Q&A

You can also reach me at:

● dmoisset@machinalis.com
● Twitter: @dmoisset
● http://machinalis.com/

Thanks!
