Keyword Searching and Browsing in Databases Using BANKS


Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan

I.I.T. Bombay
11/25/2018 1
Motivation

• Keyword search of documents on the Web has been enormously successful
  ◦ Simple and intuitive; no need to learn any query language
• Database querying using keywords is desirable
  ◦ SQL is not appropriate for casual users
  ◦ Form interfaces are cumbersome:
    ▪ They require a separate form for each type of query, which is confusing for casual users of Web information systems
    ▪ Not suitable for ad hoc queries

Motivation

• Many Web documents are dynamically generated from databases
  ◦ E.g. catalog data
• Keyword querying of generated Web documents
  ◦ May miss answers that need to combine information on different pages
  ◦ Suffers from duplication overheads

Examples of Keyword Queries

• On a railway reservation database
  ◦ “mumbai bangalore”
• On an e-store database
  ◦ “camcorder panasonic”
• On a book store database
  ◦ “sudarshan databases”

Differences from IR/Web Search

• Related data split across multiple tuples due to normalization
  ◦ E.g. Paper(paper-id, title, journal),
    Author(author-id, name),
    Writes(author-id, paper-id, position),
    Cites(citing-paper-id, cited-paper-id)
• Different keywords may match tuples from different relations
  ◦ Which joins are to be computed can only be decided on the fly

Connectivity

• Tuples may be connected by
  ◦ Foreign key
  ◦ Implicit links (shared words), etc.
  ◦ Belonging to the same relation
• Would like to find sets of (closely) connected tuples that match all given keywords

Basic Model

• Database: modeled as a graph
  ◦ Nodes = tuples
  ◦ Edges = references between tuples
    ▪ Foreign key, other kinds of relationships
  ◦ Edges are directed

[Figure: example graph with paper nodes (“BANKS: Keyword search…”, “MultiQuery Optimization”), writes nodes, and author nodes (“Charuta”, “S. Sudarshan”, “Prasan Roy”).]
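The graph model can be illustrated with a minimal Python sketch. It assumes a toy bibliography database; the tuple identifiers (`p1`, `a1`, `w1`, ...) and the helpers `build_graph` and `match` are hypothetical names chosen for illustration, not part of BANKS.

```python
# Hypothetical toy tuples: (relation, tuple id) -> text used for keyword matching.
tuples = {
    ("paper", "p1"): "MultiQuery Optimization",
    ("author", "a1"): "S. Sudarshan",
    ("author", "a2"): "Prasan Roy",
    ("writes", "w1"): "",
    ("writes", "w2"): "",
}

# Foreign-key references: each writes tuple refers to one author and one paper.
references = [
    (("writes", "w1"), ("author", "a1")),
    (("writes", "w1"), ("paper", "p1")),
    (("writes", "w2"), ("author", "a2")),
    (("writes", "w2"), ("paper", "p1")),
]


def build_graph(nodes, refs):
    """Return adjacency lists: node -> list of nodes it references."""
    graph = {node: [] for node in nodes}
    for src, dst in refs:
        graph[src].append(dst)  # directed edge along the foreign key
    return graph


def match(keyword):
    """Nodes whose text contains the keyword (case-insensitive)."""
    return {n for n, text in tuples.items() if keyword.lower() in text.lower()}


graph = build_graph(tuples, references)
```

Here every tuple becomes a node and every foreign-key reference becomes a directed edge from the referencing tuple to the referenced one.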

Answer Example

Query: sudarshan roy

[Figure: answer tree rooted at the paper node “MultiQuery Optimization”, with two writes nodes leading to the author nodes “S. Sudarshan” and “Prasan Roy”.]

Edge Directionality

• Some popular tuples are connected to many other tuples
  ◦ E.g. students -> departments -> university
• Popular tuples would create misleading shortcuts from every tuple to every other
  ◦ E.g. every student would be closely linked with every other student via the department/university
• Solution: define different forward and backward edge weights
  ◦ Forward edges: in the direction of the foreign key reference

Edge Weight

• Weight of forward edge based on schema
  ◦ E.g. citation link weights > writes link weights
• Weight of backward edge = number of edges pointing to the node (its indegree)

[Figure: example graph with edges labeled with weights 1 and 3.]
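A minimal sketch of this weighting rule in Python follows. The link types, the numeric values in `FORWARD_WEIGHT`, and the tuple identifiers are assumptions made for illustration; the only rules taken from the slide are that forward weights come from the schema and that a backward edge is weighted by the indegree of the node it leaves.

```python
from collections import Counter

# Hypothetical schema-based forward weights per link type.
FORWARD_WEIGHT = {"writes": 1.0, "cites": 2.0}

# Forward edges as (source, target, link type); three writes tuples point at p1.
forward_edges = [
    ("w1", "p1", "writes"),
    ("w2", "p1", "writes"),
    ("w3", "p1", "writes"),
]


def weighted_edges(edges):
    """Return {(u, v): weight} containing both forward and backward edges."""
    indegree = Counter(dst for _, dst, _ in edges)
    weights = {}
    for src, dst, kind in edges:
        weights[(src, dst)] = FORWARD_WEIGHT[kind]   # forward: from the schema
        weights[(dst, src)] = float(indegree[dst])   # backward: indegree of dst
    return weights


weights = weighted_edges(forward_edges)
```

With three tuples referencing `p1`, each forward edge into `p1` costs 1.0 while each backward edge out of `p1` costs 3.0, so paths through a popular node become more expensive.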

Edge Weight Scaling

• Problem: some backward edges have unduly large weights
  ◦ Scale edge weights by using log(1 + raw-edge-weight)
  ◦ total-edge-weight = Σ edge-weights (summed over the edges of the answer tree)
  ◦ Edge score E = 1 / total-edge-weight
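The scaling and scoring rules can be sketched as follows. This assumes that the total edge weight is the sum of log(1 + w) over the raw weights w of the edges in an answer tree, and that the tree has at least one edge; the function name `edge_score` is hypothetical.

```python
import math


def edge_score(raw_edge_weights):
    """E = 1 / sum(log(1 + w)) over the raw edge weights of an answer tree.

    Assumes the tree has at least one edge, so the sum is positive.
    """
    total = sum(math.log(1 + w) for w in raw_edge_weights)
    return 1.0 / total


small_tree = [1, 1]      # two cheap forward edges
large_tree = [1, 1000]   # one edge leaving a very popular node
```

The logarithm keeps a single very heavy backward edge from dominating: the raw weights differ by a factor of 1000, but the scaled weights differ only by about a factor of ten.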

Node Weight

• Nodes have prestige weights too
  ◦ Observation: nodes with intuitively greater prestige tend to have greater indegree
  ◦ Set node weight = indegree
• Problem: nodes with many in-edges result in skewed answers
  ◦ Subdue extreme node weights by using log(1 + indegree)
• Node score N = root-node-weight + Σ leaf-node-weights
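As a small illustration, the node score can be computed from indegrees alone. The function names below are hypothetical, and the sketch assumes the log-scaled weight is applied to the root and to every leaf before summing.

```python
import math


def node_weight(indegree):
    """Prestige weight of a node, damped with log(1 + indegree)."""
    return math.log(1 + indegree)


def node_score(root_indegree, leaf_indegrees):
    """N = weight of the root plus the sum of the weights of the leaves."""
    return node_weight(root_indegree) + sum(node_weight(d) for d in leaf_indegrees)


# A root with three in-edges and two leaves with one in-edge each.
example = node_score(3, [1, 1])
```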

Combining Scores

• Problem: how to combine two independent metrics: node weight and edge weight
  ◦ Normalize each to 0-1
  ◦ Combine using weighting factor λ
    ▪ Additive: (1 - λ) E + λ N
    ▪ Multiplicative: E N^λ
• Performance study to compare alternatives and to find reasonable values for λ
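The two ways of combining the scores can be sketched in a few lines of Python. The weighting factor is written `lam`; the multiplicative form is assumed here to raise the node score to the power of the weighting factor, and normalization is assumed to divide by the maximum observed value. All function names are hypothetical.

```python
def normalize(value, max_value):
    """Scale a non-negative score into the range 0-1."""
    return value / max_value if max_value > 0 else 0.0


def additive(e, n, lam):
    """(1 - lam) * E + lam * N."""
    return (1 - lam) * e + lam * n


def multiplicative(e, n, lam):
    """E * N ** lam (assumed form of the multiplicative combination)."""
    return e * n ** lam


e = normalize(3.0, 6.0)   # normalized edge score
n = normalize(2.0, 2.0)   # normalized node score
```

A small weighting factor lets proximity (the edge score) dominate while still using prestige (the node score) to separate otherwise similar answers.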

The BANKS Answer Model

• Query: set of keywords {k1, k2, ..., kn}
  ◦ Each keyword ki matches a set of nodes Si
• Answer: rooted, directed tree connecting nodes, with one node from each Si
  ◦ Root node (also referred to as the information node) has special significance and may be restricted to some relations
    ▪ E.g. relations representing entities, not relationships
  ◦ May include intermediate nodes not in any Si, and is hence a Steiner tree
• Multiple answers
  ◦ Ranking based on proximity + prestige

Finding Answer Trees

• Computation of minimum-weight Steiner trees: NP-complete
• Backward Expanding Search Algorithm:
  ◦ Intuition: find vertices from which a forward path exists to at least one node from each Si
  ◦ Run concurrent single-source shortest-path algorithms from each node matching a keyword
    ▪ Create an iterator for each node matching a keyword
    ▪ Traverse the graph edges in the reverse direction
    ▪ Output a node whenever it is in the intersection of the sets of nodes reached from each keyword

Finding Answer Trees

• For each vertex v visited, maintain a nodelist v.Li for each search term ti
• Update the ith nodelist when the search starting from a vertex u ∈ Si reaches the vertex v
• The new result trees produced correspond to the cross product {u} × ∏_{j≠i} v.Lj
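The backward expanding search can be sketched in Python as below. This is a simplified illustration, not the BANKS implementation: a single global priority queue stands in for the concurrent iterators, answers are reported as a root plus one origin node per keyword rather than as full trees, and the graph, weights, and names (`backward_expanding_search`, `reverse_adj`) are assumptions. The edge list contains both forward and backward edges.

```python
import heapq
from itertools import product


def backward_expanding_search(reverse_adj, keyword_sets):
    """Yield (root, origins); origins[i] is a node of keyword_sets[i]
    that is reachable from root along graph edges."""
    n = len(keyword_sets)
    heap = []         # entries: (distance, origin, term index, vertex)
    settled = set()   # (origin, term index, vertex) triples already expanded
    nodelists = {}    # vertex -> n lists, playing the role of v.L_i
    for i, nodes in enumerate(keyword_sets):
        for origin in sorted(nodes):
            heapq.heappush(heap, (0.0, origin, i, origin))
    while heap:
        dist, origin, i, v = heapq.heappop(heap)
        if (origin, i, v) in settled:
            continue
        settled.add((origin, i, v))
        lists = nodelists.setdefault(v, [[] for _ in range(n)])
        # New answers rooted at v: {origin} x product of the other nodelists.
        others = [[origin] if j == i else lists[j] for j in range(n)]
        for combo in product(*others):
            yield v, combo
        lists[i].append(origin)
        # Walk edges in reverse: u has an edge into v.
        for u, weight in reverse_adj.get(v, []):
            if (origin, i, u) not in settled:
                heapq.heappush(heap, (dist + weight, origin, i, u))


# Toy graph: (source, target, weight), forward and backward edges together.
edges = [
    ("w1", "a1", 1), ("a1", "w1", 1), ("w1", "p1", 1), ("p1", "w1", 2),
    ("w2", "a2", 1), ("a2", "w2", 1), ("w2", "p1", 1), ("p1", "w2", 2),
]
reverse_adj = {}
for src, dst, w in edges:
    reverse_adj.setdefault(dst, []).append((src, w))

# Keyword "sudarshan" matches a1, keyword "roy" matches a2.
results = list(backward_expanding_search(reverse_adj, [{"a1"}, {"a2"}]))
first_root, first_origins = results[0]
```

In this toy graph the first root reported is the paper node `p1`, the first vertex reached from both keyword nodes. The search keeps going after the first answer, which is why generated answers still need to be ranked before they are shown.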

Backward Expanding Search

Query: sudarshan roy

[Figure: backward search from the author nodes “S. Sudarshan” and “Prasan Roy” through writes nodes to the paper node “MultiQuery Optimization”.]

Result Ordering

• Answer trees may not be generated in relevance order
• Solution:
  ◦ Best-first search across all iterators, based on path length
  ◦ Output answers to a buffer
    ▪ Eliminate duplicates: isomorphic trees
  ◦ Output the highest-ranked answer from the buffer to the user when the buffer is full
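The buffering idea can be sketched with a bounded heap in Python. This is an illustration under stated assumptions: answers arrive as (score, tree) pairs, a tree is a collection of hashable edges, and comparing edge sets stands in for a real tree-isomorphism check. The name `reorder` and the parameter `buffer_size` are hypothetical.

```python
import heapq


def reorder(answers, buffer_size):
    """Yield trees approximately best first.

    answers: iterable of (score, tree) in generation order, where tree is
    a collection of hashable edges.
    """
    heap = []      # max-heap via negated score; the counter breaks ties
    seen = set()
    for count, (score, tree) in enumerate(answers):
        key = frozenset(tree)   # stand-in for the isomorphism check
        if key in seen:
            continue            # duplicate answer, drop it
        seen.add(key)
        heapq.heappush(heap, (-score, count, tree))
        if len(heap) > buffer_size:       # buffer overflowed: emit the best
            yield heapq.heappop(heap)[2]
    while heap:                           # drain the buffer at the end
        yield heapq.heappop(heap)[2]


answers = [
    (0.2, (("p2", "w3"),)),
    (0.9, (("p1", "w1"), ("w1", "a1"))),
    (0.9, (("w1", "a1"), ("p1", "w1"))),   # same edges, different order
    (0.5, (("p3", "w4"),)),
]
ordered = list(reorder(answers, buffer_size=2))
```

A larger buffer gives an output order closer to the true ranking at the cost of a longer wait for the first answer.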

The BANKS System

• BANKS provides keyword search coupled with extensive browsing facilities
  ◦ Schema browsing + data browsing
  ◦ Graphical display of data
• Implemented using Java + servlets
• Keyword search response times typically 1 to 3 seconds on
  ◦ DBLP database with 100,000 tuples/300,000 edges
  ◦ P3 600 MHz, 512 MB RAM
• Try it out at www.cse.iitb.ac.in/banks/

The BANKS Architecture

[Figure: the user connects over HTTP to BANKS (Web server + servlets), which connects over JDBC to the database.]

• Connects to any database using JDBC
  ◦ JDBC metadata features used to provide schema browsing
  ◦ No programming needed for customization
    ▪ Minimal preprocessing of database to create indices and give weights to links
• Extensive set of browsing features

Browsing Features

• Hyperlinks are automatically added to all displayed results
• Template facilities to do a variety of tasks
  ◦ Browsing data by grouping and creating crosstabs
    ▪ E.g. theses grouped by department and year
  ◦ Hierarchical views of data
    ▪ Nested XML style, even on relational data
  ◦ Graphical displays
    ▪ Bar charts, pie charts, etc.

Example of Browsing in BANKS

BANKS Query Result Example

• Result of “Soumen Sunita”

Anecdotes

• “Mohan”
  ◦ Returns C. Mohan at the top based on prestige (number of papers written)
• “Transaction”
  ◦ Returns Jim Gray’s classic paper and textbook as top answers based on prestige (number of citations)
• “Sunita Seltzer”
  ◦ No common papers, but both have papers with Stonebraker: the system finds this connection

Effect of Parameters

• Log scaling of edge weights worked well
• (1 - λ) E + λ N versus E N^λ made little difference
• Best with λ = 0.2 (subdue node weights but not entirely)

Related Work

• DataSpot (DTL)/Mercado Intuifind [VLDB 98]
  ◦ Based on patent by Palmon (filed 1995, granted 1998)
  ◦ Similar answer model to ours
    ▪ Differences: our model of backward link weights and prestige
• Proximity Search [VLDB 98]
  ◦ Different model of proximity
  ◦ No edge weights or prestige; different evaluation algorithm
• Information units (linked Web pages) [WWW10]
  ◦ No directionality; only studied in Web context
• Microsoft DBXplorer
  ◦ No ranking; based on SQL generation
  ◦ Addresses efficient construction of text indexes

Some Extensions to BANKS

• Searching for similar results: Template Search
  ◦ Define the notion of similarity between two result trees
  ◦ Perform the restricted search
• Efficiently handling metadata queries
  ◦ Starting the search from each of the tuples in a table is too costly

Template Search

• Feedback in terms of result tree
• Type of a result tree defined in terms of
  ◦ Type of nodes
    ▪ The table to which the node belongs
  ◦ Type of edges:
    ▪ The types of nodes it connects
    ▪ The link information, e.g. ‘cites’ and ‘cited’ links between two papers
• Which nodes to start the search from
  ◦ Only the chosen nodes
  ◦ All the nodes corresponding to a particular keyword

Template Search

• Start the backward search only from the allowed set of nodes
• Follow the edges as defined by the result type
• Example: consider the query “sudarshan database”
  ◦ Two types of results for the above query
    ▪ Papers written by Professor Sudarshan
    ▪ Papers cited by papers written by Professor Sudarshan
• The two result types are distinguished by whether to follow the cites/cited link from a paper node

Metadata Keyword Queries

• Metadata keywords: match all the tuples of a relation
• Too costly to start the search from each of the tuples of a table
• First-cut approach: start the forward search from the information node for the non-metadata keywords
  ◦ Selectively choose the nodes from which to start the forward search

Example of Metadata Query

• Consider the query “sudarshan paper”

[Figure: from the “sudarshan” node, the search reaches writes table nodes and then continues to the paper table by forward search.]

Conclusions and Future Work

The next big wave: keyword searching and browsing of databases?

Future work:
• Keyword queries on XML
• Disambiguating queries by selecting
  ◦ Nodes: G.W.Bush: “Bush Jr” or “Bush Sr”
  ◦ Tree structure: “coauthors” or “cites”
• Boolean queries, stemming, thesaurus
• Metadata: column/relation names

Thank You
