Keyword Searching and Browsing in Databases Using BANKS
Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, S. Sudarshan
I.I.T. Bombay
11/25/2018 1
Motivation
• Keyword search of documents on the Web has been enormously successful
  • Simple and intuitive; no need to learn any query language
• Database querying using keywords is desirable
  • SQL is not appropriate for casual users
  • Form interfaces are cumbersome:
    • Require a separate form for each type of query, which is confusing for casual users of Web information systems
    • Not suitable for ad hoc queries

Motivation
• Many Web documents are dynamically generated from databases
  • E.g. catalog data
• Keyword querying of generated Web documents
  • May miss answers that need to combine information on different pages
  • Suffers from duplication overheads

Examples of Keyword Queries
• On a railway reservation database
  • “mumbai bangalore”
• On an e-store database
  • “camcorder panasonic”
• On a book store database
  • “sudarshan databases”

Differences from IR/Web Search
• Related data is split across multiple tuples due to normalization
  • E.g. Paper(paper-id, title, journal),
    Author(author-id, name),
    Writes(author-id, paper-id, position),
    Cites(citing-paper-id, cited-paper-id)
• Different keywords may match tuples from different relations
  • What joins are to be computed can only be decided on the fly

Connectivity
• Tuples may be connected by
  • Foreign key
  • Implicit links (shared words), etc.
  • Tuples belonging to the same relation
• Would like to find sets of (closely) connected tuples that match all given keywords

Basic Model
• Database: modeled as a graph
  • Nodes = tuples
  • Edges = references between tuples
    • foreign key, other kinds of relationships
  • Edges are directed
[Figure: example graph with paper nodes “BANKS: Keyword search…” and “MultiQuery Optimization”, writes nodes, and author nodes “Charuta”, “S. Sudarshan” and “Prasan Roy”.]

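The graph model above can be sketched in a few lines of Python. This is only an illustration, not the BANKS implementation: the tuple keys (`p1`, `a1`, `w1`, ...), the dictionary layout, and the `build_graph` helper are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical tuples, keyed by (relation, primary key).
tuples = {
    ("paper", "p1"): {"title": "MultiQuery Optimization"},
    ("author", "a1"): {"name": "S. Sudarshan"},
    ("author", "a2"): {"name": "Prasan Roy"},
    ("writes", "w1"): {"author": "a1", "paper": "p1"},
    ("writes", "w2"): {"author": "a2", "paper": "p1"},
}

# Assumed schema: (relation, column) -> relation referenced by that foreign key.
foreign_keys = {
    ("writes", "author"): "author",
    ("writes", "paper"): "paper",
}

def build_graph(tuples, foreign_keys):
    """Return forward adjacency: node -> set of nodes it references."""
    forward = defaultdict(set)
    for (relation, key), row in tuples.items():
        for (fk_relation, column), target_relation in foreign_keys.items():
            if fk_relation == relation and column in row:
                # Directed edge from the referencing tuple to the referenced tuple.
                forward[(relation, key)].add((target_relation, row[column]))
    return forward

graph = build_graph(tuples, foreign_keys)
print(sorted(graph[("writes", "w1")]))
```

Each writes tuple points at one author and one paper, so the paper and author nodes have only incoming forward edges.
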
Answer Example
Query: sudarshan roy
[Figure: answer tree rooted at the paper node “MultiQuery Optimization”, connected through two writes nodes to the author nodes “S. Sudarshan” and “Prasan Roy”.]

Edge Directionality
• Some popular tuples are connected to many other tuples
  • E.g. students -> departments -> university
• Popular tuples would create misleading shortcuts from every tuple to every other
  • E.g. every student would be closely linked with every other student via the department/university
• Solution: define different forward and backward edge weights
  • Forward edges: in the direction of the foreign key reference

Edge Weight
• Weight of forward edge based on schema
  • e.g. citation link weights > writes link weights
• Weight of backward edge = indegree of the node (number of edges pointing to it)
[Figure: edge-weight example; edges are labeled in pairs with weights 3 and 1.]

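A minimal sketch of the weighting rule, assuming every forward edge gets the same schema weight of 1 (the slides let this vary by link type). The `edge_weights` helper and the node names are hypothetical.

```python
from collections import Counter

def edge_weights(forward_edges, forward_weight=1.0):
    """forward_edges: list of (u, v) pairs meaning tuple u references tuple v.

    Returns a dict (source, target) -> weight that covers both directions.
    """
    indegree = Counter(v for _, v in forward_edges)
    weights = {}
    for u, v in forward_edges:
        weights[(u, v)] = forward_weight      # forward: along the foreign key
        weights[(v, u)] = float(indegree[v])  # backward: indegree of the referenced node
    return weights

# Three writes tuples referencing the same paper.
edges = [("w1", "p1"), ("w2", "p1"), ("w3", "p1")]
w = edge_weights(edges)
print(w[("w1", "p1")], w[("p1", "w1")])
```

A node referenced by three tuples gets backward edges of weight 3, so paths that pass through popular nodes become more expensive.
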
Edge Weight Scaling
• Problem: some backward edges have unduly large weights
  • Scale edge weights using log(1 + raw-edge-weight)
• total-edge-weight = Σ edge-weights
• Edge score E = 1 / total-edge-weight

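The edge score can be written as a small function. The slides do not give the base of the logarithm or say what happens for a tree with no edges; the natural logarithm and the `math.inf` return value below are assumptions of this sketch.

```python
import math

def edge_score(raw_edge_weights):
    """Edge score E = 1 / sum of log-scaled edge weights of an answer tree."""
    total = sum(math.log(1 + w) for w in raw_edge_weights)
    if total == 0:
        # Assumption: a single-node tree (no edges) gets the best possible score.
        return math.inf
    return 1.0 / total

# A tree with one forward edge (weight 1) and one backward edge (weight 3).
print(edge_score([1, 3]))
```
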
Node Weight
• Nodes have prestige weights too
  • Observation: nodes with intuitively greater prestige tend to have greater indegree
  • Set node weight = indegree
• Problem: nodes with many in-edges result in skewed answers
  • Subdue extreme node weights by using log(1 + indegree)
• Node score N = root-node-weight + Σ leaf-node-weights

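As a sketch, the node score of an answer tree follows directly from the indegrees of its root and leaves. The function names are hypothetical and the natural logarithm is an assumption.

```python
import math

def node_weight(indegree):
    """Log-scaled prestige of a single node."""
    return math.log(1 + indegree)

def node_score(root_indegree, leaf_indegrees):
    """Node score N = root-node-weight + sum of leaf-node-weights."""
    return node_weight(root_indegree) + sum(node_weight(d) for d in leaf_indegrees)

# Root with indegree 3 and two leaves with indegree 1 each.
print(node_score(3, [1, 1]))
```
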
Combining Scores
• Problem: how to combine two independent metrics: node weight and edge weight
  • Normalize each to 0-1
  • Combine using weighting factor λ
    • Additive: (1 - λ) E + λ N
    • Multiplicative: E N
• Performance study to compare alternatives and to find reasonable values for λ

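A minimal sketch of the two combinations as they are written on the slide. The slide does not say how scores are normalized to 0-1; dividing by the maximum over the candidate answers is an assumption here, as are the function names.

```python
def normalize(scores):
    """Scale non-negative scores into 0-1 by dividing by the maximum (assumed scheme)."""
    top = max(scores)
    return [s / top if top > 0 else 0.0 for s in scores]

def additive(e, n, lam):
    """Additive combination: (1 - lambda) * E + lambda * N."""
    return (1 - lam) * e + lam * n

def multiplicative(e, n):
    """Multiplicative combination, as written on the slide: E * N."""
    return e * n

edge_scores = normalize([2.0, 4.0])   # two candidate answer trees
node_scores = normalize([3.0, 1.5])
ranked = sorted(
    range(2),
    key=lambda i: additive(edge_scores[i], node_scores[i], lam=0.2),
    reverse=True,
)
print(ranked)
```

With a small λ the edge score dominates, so the tree with the better edge score ranks first even though its node score is lower.
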
The BANKS Answer Model
• Query: set of keywords {k1, k2, ..., kn}
  • Each keyword ki matches a set of nodes Si
• Answer: rooted, directed tree connecting nodes, with one node from each Si
  • Root node (also referred to as the information node) has special significance; may be restricted to some relations
    • E.g. relations representing entities, not relationships
  • May include intermediate nodes not in any Si, and hence is a Steiner tree
• Multiple answers
  • Ranking based on proximity + prestige

Finding Answer Trees
• Computation of minimum weight Steiner trees: NP-complete
• Backward Expanding Search Algorithm:
  • Intuition: find vertices from which a forward path exists to at least one node from each Si
  • Run concurrent single-source shortest path algorithm from each node matching a keyword
    • Create an iterator for each node matching a keyword
    • Traverse the graph edges in reverse direction
    • Output a node whenever it is on the intersection of the sets of nodes reached from each keyword

Finding Answer Trees
• For each vertex v visited, maintain a nodelist v.Li for each search term ti
• Update the i-th nodelist when the search starting from a vertex u ∈ Si reaches the vertex v
• The new result trees produced correspond to the cross product {u} × ∏_{j≠i} v.Lj

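The backward expanding search and its nodelists can be sketched as follows. This is a simplified illustration under several assumptions, not the BANKS code. The graph is given as a hypothetical `incoming` map that already contains both forward and backward edges. One global priority queue stands in for the per-node iterators. An answer is reported only as a root plus one leaf per keyword, without reconstructing the paths, and no restriction is placed on the root.

```python
import heapq
from itertools import product

def backward_expanding_search(incoming, keyword_sets):
    """incoming[v] = list of (u, weight) for each directed edge u -> v.
    keyword_sets[i] = set of nodes matching keyword i.
    Yields (root, leaves), where leaves[i] is a node from keyword_sets[i].
    """
    n = len(keyword_sets)
    heap = []      # entries: (distance, tie-breaker, keyword index, origin, vertex)
    counter = 0
    for i, nodes in enumerate(keyword_sets):
        for origin in sorted(nodes):
            heap.append((0.0, counter, i, origin, origin))
            counter += 1
    heapq.heapify(heap)
    settled = set()   # (keyword index, origin, vertex) triples already expanded
    nodelists = {}    # vertex -> [v.L_1, ..., v.L_n]
    while heap:
        dist, _, i, origin, v = heapq.heappop(heap)
        if (i, origin, v) in settled:
            continue
        settled.add((i, origin, v))
        lists = nodelists.setdefault(v, [[] for _ in range(n)])
        # New answers rooted at v: {origin} x product of v.L_j for j != i.
        others = [lists[j] if j != i else [origin] for j in range(n)]
        for leaves in product(*others):
            yield v, leaves
        lists[i].append(origin)
        # Traverse edges in reverse: move from v to every u with an edge u -> v.
        for u, weight in incoming.get(v, ()):
            if (i, origin, u) not in settled:
                heapq.heappush(heap, (dist + weight, counter, i, origin, u))
                counter += 1

# Two authors (a1, a2), two writes tuples (w1, w2), one paper (p1).
# Forward edges have weight 1; backward edges carry the indegree of their source.
incoming = {
    "a1": [("w1", 1.0)],
    "a2": [("w2", 1.0)],
    "p1": [("w1", 1.0), ("w2", 1.0)],
    "w1": [("a1", 1.0), ("p1", 2.0)],
    "w2": [("a2", 1.0), ("p1", 2.0)],
}
first = next(backward_expanding_search(incoming, [{"a1"}, {"a2"}]))
print(first)
```

In this example the first answer found is rooted at the paper node, with one author as the leaf for each keyword. Later answers with other roots also appear, which is why a ranking step is still needed.
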
Backward Expanding Search
Query: sudarshan roy
[Figure: backward search on the example graph, from the author nodes “S. Sudarshan” and “Prasan Roy” through writes nodes to the paper node “MultiQuery Optimization”.]

Result Ordering
• Answer trees may not be generated in relevance order
• Solution:
  • Best-first search across all iterators, based on path length
  • Output answers to a buffer
  • Eliminate duplicates: isomorphic trees
  • Output highest-ranked answer from buffer to user when buffer is full

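The buffering heuristic can be sketched with a bounded max-heap. The sketch assumes each answer arrives as a `(score, tree)` pair where `tree` is a hashable canonical form, so that isomorphic trees compare equal. How BANKS actually detects isomorphism is not described in the slides, and `buffered_output` is a hypothetical name.

```python
import heapq

def buffered_output(answers, buffer_size):
    """Yield trees, releasing the best buffered one whenever the buffer is full."""
    heap = []      # max-heap on score, implemented by negating the score
    seen = set()   # canonical forms already buffered or output
    counter = 0    # tie-breaker so trees themselves are never compared
    for score, tree in answers:
        if tree in seen:
            continue                      # duplicate (isomorphic) answer
        seen.add(tree)
        heapq.heappush(heap, (-score, counter, tree))
        counter += 1
        if len(heap) >= buffer_size:
            yield heapq.heappop(heap)[2]  # highest-ranked answer in the buffer
    while heap:                           # search finished: drain the buffer
        yield heapq.heappop(heap)[2]

generated = [(0.2, "t1"), (0.9, "t2"), (0.5, "t3"), (0.9, "t2"), (0.7, "t4")]
order = list(buffered_output(generated, buffer_size=2))
print(order)
```

The output is only approximately sorted: with a buffer of size 2, `t3` is released before the higher-scoring `t4` arrives. A larger buffer gives better ordering at the cost of a longer wait for the first answer.
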
The BANKS System
• BANKS provides keyword search coupled with extensive browsing facilities
  • Schema browsing + data browsing
  • Graphical display of data
• Implemented using Java + servlets
• Keyword search response times typically 1 to 3 seconds on
  • DBLP database with 100,000 tuples/300,000 edges
  • P3 600 MHz, 512 MB RAM
• Try it out at www.cse.iitb.ac.in/banks/

The BANKS Architecture
[Figure: User <-HTTP-> BANKS (Web server + servlets) <-JDBC-> Database]
• Connects to any database using JDBC
  • JDBC metadata features used to provide schema browsing
• No programming needed for customization
  • Minimal preprocessing of database to create indices and give weights to links
• Extensive set of browsing features

Browsing Features
• Hyperlinks are automatically added to all displayed results
• Template facilities to do a variety of tasks
  • Browsing data by grouping and creating crosstabs
    • e.g., theses grouped by department and year
  • Hierarchical views of data
    • Nested XML style, even on relational data
  • Graphical displays
    • Bar charts, pie charts, etc.

Example of Browsing in BANKS

BANKS Query Result Example
• Result of “Soumen Sunita”

Anecdotes
• “Mohan”
  • Returns C. Mohan at top based on prestige (number of papers written)
• “Transaction”
  • Returns Jim Gray’s classic paper and textbook as top answers based on prestige (number of citations)
• “Sunita Seltzer”
  • No common papers, but both have papers with Stonebraker: system finds this connection

Effect of Parameters
• Log scaling of edge weights worked well
• (1 - λ) E + λ N versus E N: made little difference
• Best with λ = 0.2 (subdue node weights but not entirely)

Related Work
• DataSpot (DTL)/Mercado Intuifind [VLDB 98]
  • Based on patent by Palmon (filed 1995, granted 1998)
  • Similar answer model to ours
  • Differences: our model of backward link weights and prestige
• Proximity Search [VLDB 98]
  • Different model of proximity
  • No edge weights or prestige; different evaluation algorithm
• Information units (linked Web pages) [WWW10]
  • No directionality; only studied in Web context
• Microsoft DBXplorer
  • No ranking; based on SQL generation
  • Addresses efficient construction of text indexes

Some Extensions to BANKS
• Searching for similar results: template search
  • Define the notion of similarity between two result trees
  • Perform the restricted search
• Efficiently handling metadata queries
  • Starting the search from each of the tuples in a table is too costly

Template Search
• Feedback in terms of result tree
• Type of a result tree defined in terms of
  • Type of nodes:
    • the table to which the node belongs
  • Type of edges:
    • the type of nodes it connects
    • the link information, e.g. ‘cites’ and ‘cited’ link between two papers
• Which nodes to start the search from
  • only the chosen nodes
  • all the nodes corresponding to a particular keyword

Template Search
• Start the backward search only from the allowed set of nodes
• Follow the edges as defined by the result type
• Example: consider the query “sudarshan database”
  • Two types of results for the above query
    • papers written by Professor Sudarshan
    • papers cited by papers written by Professor Sudarshan
• The two result types are distinguished by whether to follow the cites/cited link from a paper node

Metadata Keyword Queries
• Metadata keywords: match all the tuples of a relation
  • Too costly to start the search from each of the tuples of a table
• First-cut approach: start the forward search from the information node for the non-metadata keywords
  • Selectively choose the nodes from which to start the forward search

Example of Metadata Query
• Consider the query “sudarshan paper”
[Figure: the sudarshan node, the writes table nodes, and a forward search to the paper table.]

Conclusions and Future Work
The next big wave: keyword searching and browsing of databases?
Future work:
• Keyword queries on XML
• Disambiguating queries by selecting
  • Nodes: G.W.Bush: “Bush Jr” or “Bush Sr”
  • Tree structure: “coauthors” or “cites”
• Boolean queries, stemming, thesaurus
• Metadata: column/relation names

Thank You