Optimizing Organizations for Navigating Data Lakes
Fatemeh Nargesian
University of Rochester
E-mail: f.nargesian@rochester.edu

Ken Q. Pu
University of Ontario Institute of Technology
E-mail: ken.pu@uoit.ca

Erkang Zhu
Microsoft Research
E-mail: ekzhu@microsoft.com

Bahar Ghadiri Bashardoost
University of Toronto
E-mail: ghadiri@cs.toronto.edu

Renée J. Miller
Northeastern University
E-mail: miller@northeastern.edu

Abstract We consider the problem of creating a navigation structure that allows a user to most effectively navigate a data lake. We define an organization as a graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships among nodes. We present a new probabilistic model of how users interact with an organization and define the likelihood of a user finding a table using the organization. We propose the data lake organization problem as the problem of finding an organization that maximizes the expected probability of discovering tables by navigating an organization. We propose an approximate algorithm for the data lake organization problem. We show the effectiveness of the algorithm on both real data lakes containing data from open data portals and on benchmarks that emulate the observed characteristics of real data lakes. Through a formal user study, we show that navigation can help users discover relevant tables that cannot be found by keyword search. In addition, in our study, 42% of users preferred the use of navigation and 58% preferred keyword search, suggesting these are complementary and both useful modalities for data discovery in data lakes. Our experiments show that data lake organizations take into account the data lake distribution and outperform an existing hand-curated taxonomy and a common baseline organization.

1 Introduction

The popularity and growth of data lakes is fueling interest in dataset discovery. Dataset discovery is normally formulated as a search problem. In one version of the problem, the query is a set of keywords and the goal is to find tables relevant to the keywords [3, 41]. Alternatively, the query can be a table (a query table), and the problem is to find other tables that are close to the query table [5]. If the input is a query table, then the output may be tables that join or union with the query table [39, 12, 53, 52, 50].

A complementary alternative to search is navigation. In this paradigm, a user navigates through an organizational structure to find tables of interest. In the early days of Web search, navigation was the dominant discovery method for Web pages. Yahoo!, a mostly hand-curated directory structure, was the most significant internet gateway for Web page discovery [8]. 1 Even today, hierarchical organizations of Web content (especially entities like videos or products) are still used by companies such as Youtube.com and Amazon.com.

1 It is interesting to note that Yahoo! may stand for “Yet Another Hierarchical Officious (or Organized) Oracle”.

Hierarchical navigation allows a user to browse available entities going from more general concepts to more specific concepts using existing ontologies or structures automatically created using taxonomy induction [45,
32, 49]. When entities have known features, we can apply faceted search over entities [51, 31, 47]. Taxonomy induction looks for is-a relationships between entities (e.g., student is-a person), whereas faceted search applies predicates (e.g., model = volvo) to filter entity collections [1]. In contrast to hierarchies over entities, in data lakes tables contain attributes that may mention many different types of entities and relationships between them. There may be is-a relationships between tables, or their attributes, and no easily defined facets for grouping tables. If tables are annotated with class labels of a knowledge base (perhaps using entity-mentions in attribute values), the is-a relationships between class labels could provide an organization on tables. However, recent studies show the difference in size and coverage of domains in various public cross-domain knowledge bases and highlight that no single public knowledge base sufficiently covers all domains and attribute values represented in heterogeneous corpuses such as data lakes [21, 44]. In addition, the coverage of standard knowledge bases on attribute values in open data lakes is extremely low [39]. Knowledge bases are also not designed to provide effective navigation. We propose instead to build an organization that is designed to best support navigation and exploration over heterogeneous data lakes. Our goal is not to compete with or replace search, but rather to provide an alternative discovery option for users with only a vague notion of what data exists in a lake.

1.1 Organizations

Dataset or table search is often done using attributes by finding similar, joinable, or unionable attributes [12, 16, 17, 39, 53, 52]. We follow a similar approach and define an organization as a Directed Acyclic Graph (DAG) with nodes that represent sets of attributes from a data lake. A node may have a label created from the attribute values themselves or from metadata when available. A table can be associated to all nodes containing one or more of its attributes. An edge in this DAG indicates that the attributes in a parent node are a superset of the attributes in a child node. A user finds a table by traversing a path from a root of an organization to any leaf node that contains any of its attributes.

We propose the data lake organization problem where the goal is to find an organization that allows a user to most efficiently find tables. We describe user navigation of an organization using a Markov model. In this model, each node in an organization is equivalent to a state in the navigation model. At each state, a user is provided with a set of next states (the children of the current state). An edge in an organization indicates the transition from the current state to a next state. Due to the subset property of edges, each transition filters out some attributes until the navigation reaches attributes of interest. An organization is effective if certain properties hold. At each step, a user should be presented with only a reasonable number of choices. We call the maximum number of choices the branching factor. The choices should be distinct (as dissimilar as possible) to make it easier for a user to choose the most relevant one. The transition probability function of our model assumes users choose the next state that has the highest similarity to the topic query they have in mind. Also, the number of choices they need to make (the length of the discovery path) should not be large. Furthermore, in real data lakes, as we have observed and report in detail in our evaluation (Section 5), the topic distribution is typically skewed (with a few tables on some topics and a large number on others). Hence, given the typical skew of topics in real data lakes, a data lake organization must be able to automatically determine over which portions of the data more organizational structure is required, and where a shallow structure is sufficient.

Table 1: Some Tables from Open Data.

Id    Table Name
d1    Surveys Data for Olympia Oysters, Ostrea lurida, in BC
d2    Sustainability Survey for Fisheries
d3    Grain deliveries at prairie points 2015-16
d4    Circa 1995 Landcover of the Prairies
d5    Mandatory Food Inspection List
d6    Canadian Food Inspection Agency (CFIA) Fish List
d7    Wholesale trade, sales by trade group
d8    Historical releases of merchandise imports and exports
d9    Immigrant income by period of immigration, Canada
d10   Historical statistics, population and immigrant arrivals

Example 1 Consider the (albeit small) collection of tables from a data lake of open data tables (Table 1). A table can be multi-faceted and attributes in a table can be about different topics. One way to expose this data to a user is through a flat structure of attributes of these tables. A user can browse the data (and any associated metadata) and select data of interest. If the number of tables and attributes is large, it would be more efficient to provide an organization over attributes. Suppose the tables are organized in the DAG of Figure 1(a). The label of a non-leaf node in this organization summarizes the content of the attributes in the subgraph of the node. Suppose a user is interested in the topic of food inspection. Using this organization, at each step of navigation, they have to choose between only two nodes. The first two choices seem clear, Econ-
Optimizing Organizations for Navigating Data Lakes 3
2.4 Organization Discovery Problem

A table T is discovered in an organization O by discovering any of its attributes.

Definition 2 For a single table T, we define the discovery probability of a table as

$$P(T \mid O) = 1 - \prod_{A \in T} \bigl(1 - P(A \mid O)\bigr) \tag{5}$$

$\{O_1, \dots, O_k\}$, such that the attributes of each table in T are organized in at least one organization and $O_i$ is the most effective organization for $G_i$. We define the probability of discovering table T in M as the probability of discovering T in any of the dimensions of M.

$$P(T \mid M) = 1 - \prod_{O_i \in M} \bigl(1 - P(T \mid O_i)\bigr) \tag{8}$$
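As a small illustration of Equations 5 and 8, the following sketch computes the discovery probability of a table from the discovery probabilities of its attributes, first in one organization and then across several dimensions. The function names and the input numbers are hypothetical and chosen only for the example; the attribute probabilities P(A|O) are assumed to be given.

```python
from math import prod


def table_discovery_probability(attr_probs):
    """Equation 5: a table is found if at least one of its attributes is found.

    attr_probs holds P(A|O) for every attribute A of the table (assumed given).
    """
    return 1.0 - prod(1.0 - p for p in attr_probs)


def multi_dim_discovery_probability(per_org_attr_probs):
    """Equation 8: a table is found if it is found in at least one dimension.

    per_org_attr_probs holds one list of attribute probabilities per
    organization O_i in the multi-dimensional organization M.
    """
    return 1.0 - prod(
        1.0 - table_discovery_probability(probs) for probs in per_org_attr_probs
    )


# Hypothetical numbers: a table with two attributes in one organization,
# and the same table reachable through two dimensions.
single = table_discovery_probability([0.5, 0.5])
multi = multi_dim_discovery_probability([[0.5], [0.5, 0.5]])
print(single, multi)
```

Both formulas are complements of "missed everywhere", so adding an attribute or a dimension can never lower the discovery probability of a table.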
in embedding space according to an Euclidean or angular distance measure [37]. Attribute A can be represented by a topic vector, $\mu_A$, which is the sample mean of the population $\{v \mid v \in dom(A)\}$ [39]. In organization construction, we also represent X by $\mu_X$.

Definition 4 State s is represented with a topic vector, $\mu_s$, which is the sample mean of the population $\{v \mid v \in dom(s)\}$.

If a sufficient number of values have word embedding representatives, the topic vectors of attributes are good indicators of attributes’ topics. We use the word embeddings of fastText database [27], which on average covers 70% of the values in the text attributes in the datasets used in our experiments. In organization construction, to evaluate the transition probability of navigating to state c we choose $\kappa(c, X)$ to be the Cosine similarity between $\mu_c$ and $\mu_X$. Since a parent subsumes the attributes in its children, the Cosine similarity satisfies a monotonicity property of $\kappa(X, c) > \kappa(X, s)$, where $c \in ch(s)$. However, the monotonicity property does not necessarily hold for the transition probabilities. This is because $P(c \mid s, X, O)$ is normalized with all children of parent s.

3.2 State Space Construction

Metadata in Data Lakes. Tables in lakes are sometimes accompanied by metadata hand-curated by the publishers and creators of data. In enterprise lakes, metadata may come from design documentation from which tags or topics can be extracted. For open data, the standard APIs used to publish data include tags or keywords [38]. In mass-collaboration datasets, like web tables, contextual information can be used to extract metadata [21].

State Construction. If metadata is available, it can be distilled into tags (e.g., keywords, concepts, or entities). We associate these table-level tags with every attribute in the table. In an organization, the leaves still have a single attribute, but the immediate parent of a leaf is a state containing all attributes with a given (single) tag. We call these parents tag states. Building organizations on tags reduces the number of possible states and the size of the organization while still having meaningful nodes that represent a set of attributes with a common tag.

Agency, Fisheries and Oceans Canada and Grains). Instead of building an organization over the twelve attributes, we can build a smaller organization over the five tags: Canadian Food Inspection Agency, Fisheries and Oceans Canada, Grains, Economy, and Immigration.

Definition 5 Let data(t) be the relation mapping a tag t in a lake metadata to the set of its associated attributes. Suppose $M_s$ is the set of tags in a tag state s. Thus, the set of attributes in s is $D_s = \bigcup_{t \in M_s} data(t)$. A tag state s is represented with a topic vector, $\mu_s$, which is the sample mean of the population $\{v \mid v \in dom(A), A \in D_s\}$.

Flat Organization: A Baseline. In the organization, we consider the leaves to be states with one attribute. However, now the parent of a leaf node is associated to only one tag. This means that the last two levels of a hierarchy are fixed and an organization is constructed over states with a single tag. If instead of discovery, we place a single root node over such states, we get a flat organization that we can use as a baseline. This is a reasonable baseline as it is conceptually, the navigation structure supported by many open data APIs that permit programmatic access to open data resources. These APIs permit retrieval of tables by tag.

Enriching Metadata. Building organizations on tags reduces organization discovery cost. However, the tags coming from the metadata may be incomplete (some attributes may have no tags). Moreover, the schema and vocabulary of metadata across data originating from different sources may be inconsistent which can lead to disconnected organizations. We propose to transfer tags across data lakes such that data lakes with no (or little) metadata are augmented with the tags from other data lakes. To achieve this, we build binary classifiers, one per tag, which predict the association of attributes to the corresponding tags. The classifiers are trained on the topic vector, $\mu_A$, of attributes. Attributes that are associated to a tag are the positive training samples for the tag’s classifier and the remaining attributes are the negative samples. This results in imbalanced training data for large lakes. To overcome this problem, we only use a subset of negative samples for training. We consider the ratio of positive and negative samples as a hyperparameter.
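The negative subsampling step described for the per-tag classifiers can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `neg_per_pos` parameter (the ratio hyperparameter), the uniform random sampling of negatives, and the toy attribute identifiers are all hypothetical, and the classifier itself is left out because the text does not name one.

```python
import random


def build_training_set(attr_tags, tag, neg_per_pos=1.0, seed=0):
    """Select training attributes for the binary classifier of one tag.

    attr_tags maps an attribute id to its set of tags. Attributes carrying
    `tag` are positives; only a random subset of the remaining attributes is
    kept as negatives, sized by the (assumed) ratio neg_per_pos.
    """
    rng = random.Random(seed)
    positives = sorted(a for a, tags in attr_tags.items() if tag in tags)
    all_negatives = sorted(a for a, tags in attr_tags.items() if tag not in tags)
    wanted = round(neg_per_pos * len(positives))
    negatives = rng.sample(all_negatives, min(wanted, len(all_negatives)))
    return positives, negatives


# Hypothetical lake: 2 attributes tagged "food", 10 tagged "economy".
attr_tags = {f"a{i}": {"food"} for i in range(2)}
attr_tags.update({f"b{i}": {"economy"} for i in range(10)})

pos, neg = build_training_set(attr_tags, "food", neg_per_pos=2.0)
print(len(pos), len(neg))
```

In a full pipeline, the topic vectors of the selected attributes would be the features fed to the classifier of the tag.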
Fig. 2: (a) Before and After Applying Add Parent Operation, (b) Before and (c) After Applying Delete Parent Operation.

organization O′. Every time a new organization is chosen, STATE TO MODIFY may need to reorder states appropriately.

Initial organization (Line 3). The initial organization may be any organization that satisfies the inclusion property of attributes of states. For example, the initial organization can be the DAG defined based on a hierarchical clustering of the tags of a data lake.

4 Scaling Organization Search

The search algorithm makes local changes to an existing organization by applying operations. An operation is successful if it increases the effectiveness of an organization. The evaluation of the effectiveness (Equation 7) involves computing the discovery probability for all attributes, which requires evaluating the probability of reaching the states along the paths to an attribute (Equation 4). The organization graph can have a large number of states, especially at the initialization phase. To improve the search efficiency, we first identify the subset of states and attributes whose discovery probabilities may be changed by an operation, and second we approximate the new discovery probabilities using a set of attribute representatives.

4.1 States Affected by an Operation

At each search iteration we only re-evaluate the discovery probability of the states and attributes which are affected by the local change. Upon applying Delete Parent on a state, the transition probabilities from its grandparent to its grandchildren are changed, and consequently so are the discovery probabilities of all states reachable from the grandparent. However, the discovery probabilities of states that are not reachable from the grandparent remain intact. Therefore, for Delete Parent, we only re-evaluate the discovery probability of the states in the sub-graph rooted by the grandparent and only for attributes associated to the leaves of the sub-graph.

The Add Parent operation impacts the organization more broadly. Adding a new parent to a state s changes the discovery probability of s and all states that are reachable from s. Furthermore, the parent state and consequently its ancestors are updated to satisfy the inclusion property of states. Suppose the parent itself has only one parent. The change of states propagates to all states up to the lowest common ancestor (LCA) of s and its parent-to-be before adding the transition to the organization. If the parent-to-be has multiple parents, the change needs to be propagated to other subgraphs. To identify the part of the organization that requires re-evaluation, we iteratively compute the LCA of s and each of the parents of its parent-to-be. All states in the sub-graph of the LCA require re-evaluation.

4.2 Approximating Discovery Probability

Focusing on states that are affected by operations reduces the complexity of the exact evaluation of an organization. To further speed up search, we evaluate an organization on a small number of attribute representatives, each of which summarizes a set of attributes. The discovery probability of each representative approximates the discovery probability of its corresponding attributes. We assume a one-to-one mapping between representatives and a partitioning of attributes. Suppose $\rho$ is a representative for a set of attributes $D_\rho = \{A_1, \dots, A_m\}$. We approximate $P(A_i \mid O)$, $A_i \in D_\rho$, with $P(\rho \mid O)$. The choice and the number of representatives impact the error of this approximation.
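To make the representative approximation concrete, the sketch below compares the transition probabilities computed for two attributes with those computed for a single representative. It assumes the exponential form of the transition probability used in the error analysis of this section, with cosine similarity as κ and γ′ = γ/|ch(m)|. The choice of the mean topic vector as the representative, the value γ = 10, and all vectors are illustrative assumptions, not the paper's settings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_vector(vectors):
    """Component-wise sample mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def transition_probs(query_vec, child_vecs, gamma=10.0):
    """P(s_i | m, X, O) for every child s_i of m, normalized over ch(m)."""
    g = gamma / len(child_vecs)  # gamma' = gamma / |ch(m)|
    weights = [math.exp(g * cosine(query_vec, c)) for c in child_vecs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical topic vectors: two children of a state m and one partition
# {A1, A2} summarized by the representative rho (here, their mean vector).
children = [[1.0, 0.0], [0.0, 1.0]]
partition = {"A1": [1.0, 0.1], "A2": [1.0, 0.3]}
rho = mean_vector(list(partition.values()))

p_rho = transition_probs(rho, children)
errors = {
    name: abs(transition_probs(vec, children)[0] - p_rho[0])
    for name, vec in partition.items()
}
print(p_rho, errors)
```

In this toy setting the representative is close to both attributes, so evaluating one representative instead of two attributes changes the transition probability only slightly; the gap grows as attributes in a partition become less similar to their representative.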
Now, we describe an upper bound for the error of the approximation. Recall that the discovery probability of a leaf state is the product of transition probabilities along the path from root to the state. To determine the error that a representative introduces to the discovery probability of an attribute, we first define an upper bound on the error incurred by using representatives in transition probabilities. We show that the error of transition probability from m to s is bounded by a fraction of the actual transition probability which is correlated with the similarity of the representative to the attribute. Recall the transition probability from Equation 1. For brevity, we assume $\gamma' = \gamma / |ch(m)|$.

Let $\Delta_i = \kappa(s_i, A) - \kappa(s_i, \rho)$. Without loss of generality, we assume $\kappa(s_i, A) > \kappa(s_i, \rho)$, thus $\Delta_i$ is a positive number. Now, we can rewrite $\kappa(s_i, \rho) = \kappa(s_i, A) - \Delta_i$. From Equation 1, we know that $P(s_i \mid m, A, O)$ and $P(s_i \mid m, \rho, O)$ are monotonically increasing with $\kappa(s_i, A)$ and $\kappa(s_i, \rho)$, respectively. Therefore, the error of using the representative $\rho$ in approximating the probability of transition from m to $s_i$ when looking for A is as follows:

$$\epsilon = P(s_i \mid m, A, O) - P(s_i \mid m, \rho, O) \tag{16}$$

We rewrite the error by replacing $\kappa(s_i, \rho)$ with $\kappa(s_i, A) - \Delta_i$.

$$\epsilon = \frac{e^{\gamma' \kappa(A, s_i)}}{\sum_{s_j \in ch(m)} e^{\gamma' \kappa(A, s_j)}} - \frac{e^{\gamma' \kappa(s_i, A)} \cdot e^{-\gamma' \Delta_i}}{\sum_{s_j \in ch(m)} e^{\gamma' \kappa(s_j, A)} \cdot e^{-\gamma' \Delta_j}} \tag{18}$$

Following from Equation 15, we have:

$$\frac{1}{e^{\gamma' (1 - \kappa(A, \rho))}} \le e^{-\gamma' \Delta_i} \le 1 \tag{19}$$

The upper bound of the approximation error of the transition probability is:

$$0 \le \epsilon \le \frac{e^{\gamma' \kappa(A, s_i)}}{\sum_{s_j \in ch(m)} e^{\gamma' \kappa(A, s_j)}} - \frac{e^{\gamma' \kappa(A, s_i)}}{\sum_{s_j \in ch(m)} e^{\gamma' \kappa(A, s_j)}} \cdot \frac{1}{e^{\gamma' (1 - \kappa(A, \rho))}}$$

Updating data lakes changes the states of its organization and as a result the transition probabilities would change. Now, the original organization that was optimized might not be optimal anymore because of the changes in transition probabilities. Suppose upon an update to a data lake, a state $s_i$ is updated to $s'_i$. If $s_i$ is a state in the optimal organization, the error upper bound provided for representatives provides a guide for how $s'_i$ has diverged from $s_i$. The upper bound of the change in transition probabilities when $s_i$ is updated to $s'_i$ is as follows.
tags per table in this benchmark are created to emulate the observed characteristics of our crawl of open data portals (Figure 3). In YAGO, each entity is associated with one or more types. We consider 500 random leaf types that contain the words “agri”, “food”, or “farm” in their labels. These types are equivalent to tags. Each attribute in the benchmark is associated to exactly one tag. For each attribute of a table, we randomly sample its values from the set of entities associated to its tag. This guarantees that the attribute is about the type (tag). We tokenize the sampled entities into words and the word embeddings of these words are then used to generate the topic vectors of attributes. The number of values in an attribute is a random number between 10 and 1000, and the number of attributes per table is sampled from [1, 50] following a Zipfian distribution.

The sub-graph of the YAGO taxonomy that covers all ancestor classes of the benchmark types is considered as the taxonomy defined on benchmark attributes. This taxonomy is a connected and acyclic graph. Each class in the taxonomy is equivalent to a non-leaf state of navigation. To guarantee the inclusion property in the taxonomy, each non-leaf state consists of the tags corresponding to its descendant types. The topic vector of an interior state is generated by aggregating the topic vectors of its descendant attributes.

5.2 Experimental Set-up

Evaluation Measure. For our initial studies evaluating our design decisions and comparing our approach to three others (Yago, a simple tag-based baseline, and an Enterprise Knowledge Graph), we do not have a user. We simulate a user by reporting the success probability of finding each table in the lake. Conceptually what this means is that if a user had in mind a table that is in the lake and makes navigation decisions that favor [...] find that table using our organization.

We use the Cosine similarity on the topic vectors of attributes for $\kappa$ and a threshold $\theta$ of 0.9. Based on attribute success probabilities, we can compute table success probability as $Success(T \mid O) = 1 - \prod_{A \in T} (1 - Success(A \mid O))$. We report success probability for every table in the data lake sorted from lowest to highest probability on the x-axis (see Figure 4 as an example).

Implementation. Our implementation is in Python and uses the scikit-learn library for creating initial organizations. Our experiments were run on a 4-core Intel Xeon 2.6 GHz server with 128 GB memory. To speed up the evaluation of an organization, we cache the similarity scores of attribute pairs as well as attribute and state pairs as states are updated during search.

5.3 Performance of Approximation

We evaluate the effectiveness and efficiency of our exact algorithm (using exact computation of discovery probabilities, not the approximation discussed in Section 4.2) on the TagCloud benchmark.

5.3.1 Effectiveness

We constructed the baseline organization where each attribute has as its parent the tag state of the attribute. Recall that in this benchmark each attribute has a single, accurate tag. This organization is similar to the organization of open data portals. We performed an agglomerative hierarchical clustering over this baseline to create a hierarchy with branching factor 2. This organization is called clustering. Then we used our algorithm to optimize the clustering organization to create N-dimensional (N ∈ {1, 2, 3, 4}) organizations (called N-dim). Figure 4 reports the success probability of each table in different organizations.
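The table shapes of the synthetic benchmark described above (attributes per table drawn from [1, 50] with a Zipfian distribution, 10 to 1000 values per attribute) could be sampled along the following lines. This is a minimal sketch: the Zipf exponent `s = 1.0`, the uniform choice of the value count, and the function names are assumptions, since the text does not fix them.

```python
import random


def zipf_weights(n, s=1.0):
    """Unnormalized Zipfian weights for ranks 1..n (exponent s is assumed)."""
    return [1.0 / (k ** s) for k in range(1, n + 1)]


def sample_table_shapes(num_tables, max_attrs=50, min_values=10,
                        max_values=1000, s=1.0, seed=0):
    """Return, per table, a list with the number of values of each attribute."""
    rng = random.Random(seed)
    sizes = list(range(1, max_attrs + 1))
    weights = zipf_weights(max_attrs, s)
    shapes = []
    for _ in range(num_tables):
        # Small tables are far more likely than wide ones under Zipf.
        n_attrs = rng.choices(sizes, weights=weights, k=1)[0]
        shapes.append([rng.randint(min_values, max_values) for _ in range(n_attrs)])
    return shapes


shapes = sample_table_shapes(200)
print(sum(len(t) == 1 for t in shapes), "single-attribute tables out of", len(shapes))
```

With such a skew, a sizable share of tables has a single attribute, which matters for discoverability because a table is found only through its attributes.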
Fig. 4: Success Probability of Organizations (a) on TagCloud Benchmark and (b) YagoCloud Benchmark. (Axes: Table ID vs. Success Probability.)

Fig. 5: Pruning (a) Domains and (b) States on TagCloud Benchmark. (Panel (a): #Evaluated Attributes/Reps for No Pruning, Affected Atts, and Affected Reps; panel (b): #Visited States for No Pruning and Affected States.)

The baseline organization requires users to consider a large number of tags and select the best, hence the average success probability for tables in this organization is just 0.016. The clustering organization outperforms the baseline by ten times. This is because the smaller branching factor of this organization reduces the burden of choosing among so many tags, as in the flat organization, and results in larger transition probabilities to states even along lengthy paths. Our 1-dim optimization of the clustering organization improves the success probability of the clustering organization by more than three times.

To create the N-dimensional organization (N > 1), we clustered the tags into N clusters (using n-medoids) and built an organization on each cluster. The 2-dim organization has an average success probability of 0.326, which is an improvement over the baseline by 40 times. Although the number of initial tags is invariant among 1-dimensional and multi-dimensional organizations, since each dimension is constructed on a smaller number of tags that are more similar, increasing the number of dimensions in an organization improves the success probability, as shown in Figure 4.

In Figure 4, almost 47 tables of TagCloud have very low success probability in all organizations. We observed that almost 70% of these tables contain only one attribute, each of which is associated to only one tag. This makes these tables less likely to be discovered in any organization. To investigate this further, we augmented TagCloud to associate each attribute with an additional tag (the closest tag to the attribute other than its existing tag). We built a two-dimensional organization on the enriched TagCloud, which we name enriched 2-dim. This organization proves to have higher success probability overall, and improves the success probability for the least discoverable tables.

5.3.2 Efficiency

The construction times of the clustering, 1-dim, 2-dim, 3-dim, 4-dim, and enriched 2-dim organizations are 0.2, 231.3, 148.9, 113.5, 112.7, and 217 seconds, respectively. Note that the baseline relies on the existing tags and requires no additional construction time. Since dimensions are optimized independently and in parallel, the reported construction times of the multi-dimensional organizations indicate the time it takes to finish optimizing all dimensions.

5.3.3 Approximation

The effectiveness and efficiency numbers we have reported so far are for the non-approximate version of our algorithm. To evaluate an organization during search we only examine the states and attributes that are affected by the change an operation has made. Thus, this pruning guarantees exact computation of success probabilities. Our experiments, shown in Figure 5, indicate that although local changes can potentially propagate to the whole organization, on average less than half of states and attributes are visited and evaluated for each search iteration. Furthermore, we considered approximating discovery probabilities using a representative set size of 10% of the attributes and only evaluated those representatives that correspond to the affected attributes. This reduces the number of discovery probability evaluations to only 6% of the attributes. As shown in Figure 4, named 2-dim approx, this approximation has negligible impact on the success probabilities of tables in the constructed organization. The construction time of 2-dim (without the approximation) is 148.9 seconds. For 2-dim approx (using a representative size of 10%), it is 30.3 seconds. The remaining experiments report organizations (on much larger lakes) created using this approximation for scalability.
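The pruning discussed above, where only states affected by an operation are re-evaluated, can be sketched for the Delete Parent case as a reachability computation over the organization DAG. The adjacency-map representation, the function names, and the reading that the affected region is the sub-graph rooted at each parent of the deleted state (the grandparents of its children) are assumptions made for this illustration.

```python
from collections import deque


def reachable(children, root):
    """All states reachable from root (root included) in a DAG.

    children maps a state to the list of its child states.
    """
    seen = {root}
    queue = deque([root])
    while queue:
        state = queue.popleft()
        for child in children.get(state, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def affected_by_delete_parent(children, deleted):
    """States to re-evaluate when the state `deleted` is removed.

    The parents of `deleted` become the new parents of its children, so the
    sub-graphs rooted at those parents are the only region that changes.
    """
    grandparents = [s for s, kids in children.items() if deleted in kids]
    affected = set()
    for g in grandparents:
        affected |= reachable(children, g)
    affected.discard(deleted)
    return affected


# Hypothetical organization: root -> {a, b}, a -> {p}, p -> {x, y}, b -> {z}.
organization = {"root": ["a", "b"], "a": ["p"], "b": ["z"], "p": ["x", "y"]}
affected = affected_by_delete_parent(organization, "p")
print(sorted(affected))
```

States such as `b` and `z`, which are not reachable from the grandparent, are skipped, which is where the savings reported in Figure 5 would come from.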
5.4 Comparison of Organizations

We now compare our approach with (1) a hand-curated taxonomy on the YagoCloud benchmark, (2) the flat baseline on a large real data lake Socrata, and (3) an automatically generated enterprise knowledge graph (EKG) [16, 17] on a smaller sample of this lake Socrata-1.

5.4.1 Comparison to A Knowledge Base Taxonomy

We compare the effectiveness of using the existing YAGO taxonomy on YagoCloud tables for navigation with our organization. Figure 4b shows the success probability of tables in the benchmark when navigating the YAGO taxonomy and our data lake organization. The taxonomy consists of 1,171 nodes and 2,274 edges while our (1-dimensional) organization consists of 442 nodes and 565 edges. Each state in the optimized organization consists of a set of types that need to be further interpreted while each state in the taxonomy refers to a YAGO class label. However, the more compact representation of topics of the attributes leads to a more effective organization for navigation. To understand the influence of taxonomy/organization size on discovery effectiveness, we condensed the taxonomy. Similar to the knowledge fragment selection approach [23], we created a compact taxonomy by eliminating the immediate level of nodes above the leaves unless they have other children. This leads to a taxonomy that has 444 nodes and 1,587 edges, which is closer to the size of our organization. Condensing the taxonomy increases the branching factor of nodes while decreasing the length of discovery paths. This leads to a slight decrease in success probability compared to the original taxonomy. Knowledge base taxonomies are efficient for organizing the knowledge of entities. However, our organizations are able to optimize the navigation better based on the data lake distribution.

5.4.2 Comparison to a Baseline on Real Data

We constructed ten organizations on the Socrata lake by first partitioning its tags into ten groups using k-medoids clustering [29]. We apply Algorithm 1 on each cluster to approximate an optimal organization. We use an agglomerative hierarchical clustering of tags in Socrata as the initial organization. In each iteration, we approximate the success probability of the organization using a representative set with a size that is 10% of the total number of attributes in the organization. Table 3 reports the number of representatives considered for this approximation in each organization along with other relevant statistics. Since the cluster sizes are skewed, the number of attributes reachable via each organization has a high variance. Recall that Socrata has just over 50K attributes and they might have multiple tags, so many are reachable in multiple organizations. It took 12 hours to construct the multi-dimensional organization.

Table 3: Statistics of 10 Organizations of Socrata Lake.

Org   #Tags   #Atts    #Tables   #Reps
1     2,031   28,248   3,284     2,824
2     1,735   11,363   1,885     1,136
3     1,648   20,172   9,792     2,017
4     1,572   19,699   2,933     1,969
5     1,378   11,196   1,947     1,119
6     1,245   17,083   1,934     1,708
7     829     8,848    1,302     884
8     353     6,816    831       681
9     240     3,834    614       383
10    43      118      33        11

Figure 6a shows the success probability of the ten organizations on the Socrata data lake. Using this organization, a table is likely to be discovered during navigation of the data lake with probability of 0.38, compared to the current state of navigation in data portals using only tags, which is 0.12. Recall that to evaluate the discovery probability of an attribute, we evaluate the probability of discovering the penultimate state that contains its tag and multiply it by the probability of selecting the attribute among the attributes associated to the tag. The distribution of attributes to tags depends on the metadata. Therefore, the organization algorithm does not have any control on the branching factor at the lowest level of the organization, which means that the optimal organization is likely to have a lower success probability than 1.

5.4.3 Comparison to Linkage Navigation

Automatically generated linkage graphs are an alternative navigation structure to an organization. An example of a linkage graph is the Enterprise Knowledge Graph (EKG) [16, 17]. Unlike organizations, which facilitate exploratory navigation, in EKG, navigation starts with known data. A user writes queries on the graph to find related attributes through keyword search followed by similarity queries to find other attributes similar to these attributes.

To study the differences between linkage navigation and organization navigation, we compare the probability of discovering tables using each. We used a smaller
Fig. 6: (a) Success Probability of Tables in Socrata Lake using Organization and Baseline. (b) Success Probability of Tables in Socrata-1 Lake using Organization and EKG. (Plots not reproduced; both panels plot success probability against table ID.)
Fig. 7: (a) No. Positive Training Samples and (b) Accuracy of Tag Classifiers. (Plots not reproduced; panel (a) is indexed by tag ID and panel (b) by tag classifier ID.)
lake (Socrata-1) for this comparison because the abundance of linkages between attributes in a large data lake makes our evaluation using large EKGs computationally expensive. In the EKG, each attribute in a data lake is a node of the graph and there are edges based on node similarity. A syntactic edge means that the Jaccard similarity of attribute values [16] is above a threshold, and a semantic edge means that the combined semantic and syntactic similarity of the attribute names is above a threshold [17]. To use an EKG for navigation, we adapt Equation 1 such that the probability of a user navigating from node m to an adjacent node s is proportional to the similarity of s to m and is penalized by the branching factor of m. We consider the similarity of s and m to be the maximum of their semantic similarity and syntactic similarity. Since the navigation can start from arbitrary nodes, to compute the success probability of a table in an EKG, we consider the average success probability of a table over up to 500 runs, each starting from a random node. We use the threshold θ = 0.9 for filtering the edges in the EKG. This makes 3,989 nodes reachable from some node in the graph. The average and maximum branching factors of this EKG are 122.30 and 725, respectively. The average success probability is 0.0056.
Figure 6b shows the success probability of navigation using an EKG and using an organization built on Socrata-1, when we limited the start nodes to be the ancestors of attributes of a table. Although the data lake organization has a higher construction time (2.75 hours) than the EKG (1.3 hours), it outperforms the EKG in effectiveness. The EKG is designed for discovering attributes similar to a known attribute. When an EKG is used for exploration, the navigation can start from arbitrary nodes, which results in long discovery paths and low success probability. Moreover, depending on the connectivity structure of an EKG, some attributes are not reachable in any navigation run (points in the left side of Figure 6b). In an EKG, the number of navigation choices at each node depends on the distribution of attribute similarity. This causes some nodes to have a high branching factor and leads to an overall low success probability.
5.5 Effectiveness of Metadata Enrichment
The effectiveness of an organization depends on the number of tags associated with attributes. For data lakes with no or limited tags, we propose to transfer tags from another data lake that has tags. Specifically, we transfer the tags of the Socrata data lake to attributes of CKAN (which has no tags). Figure 7a shows the distribution of de-duplicated positive training samples per tag in Socrata. Despite the large number of tags (11,083), very few of them have enough positive samples (associated attributes) for training a classifier. We used a ratio of one to nine for positive and negative samples. We employed the distributed gradient boosting of XGBoost [10] to train classifiers on the tags with at least 10 positive training samples (866 tags). The training algorithm performs grid search for hyper-parameter tuning of classifiers. Figure 7b shows the accuracy of the 10-fold cross validation of the classifiers with the top-100 F1-scores. Out of 866 tags of Socrata, 751 were associated with CKAN attributes and a total of 7,347 attributes received at least one tag. The most popular tag is domaincategory government. Figure 8a shows the distribution of newly associated tags to CKAN attributes for the 20 most popular tags. Figure 8b reports the success probability of a 1-dimensional organization. More than half of the tables, which were otherwise unreachable in the lake, are now searchable through the organization.
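As a rough illustration of the training setup described for tag transfer in Section 5.5, the sketch below assembles a per-tag training set. The text states only the de-duplication of positives, the one-to-nine ratio, and the minimum of 10 positive samples; the function name `build_training_set`, the toy tag map, and the uniform random sampling of negatives are assumptions made for illustration. Classifier training itself (XGBoost with grid search) is not shown.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

MIN_POSITIVES = 10  # from the text: tags need at least 10 positive samples
NEG_PER_POS = 9     # from the text: one positive to nine negatives


def build_training_set(
    tag: str,
    tag_to_attrs: Dict[str, Set[str]],
    neg_per_pos: int = NEG_PER_POS,
    min_positives: int = MIN_POSITIVES,
    seed: int = 0,
) -> Optional[List[Tuple[str, int]]]:
    """Return (attribute, label) pairs for one tag, or None if the tag is too rare.

    Positives are the de-duplicated attributes associated with `tag`.
    Negatives are sampled uniformly at random (an assumption) from
    attributes that are associated with other tags but not with `tag`.
    """
    positives = sorted(tag_to_attrs.get(tag, set()))
    if len(positives) < min_positives:
        return None
    all_attrs: Set[str] = set().union(*tag_to_attrs.values())
    candidates = sorted(all_attrs - set(positives))
    rng = random.Random(seed)
    # Cap the number of negatives by what is actually available.
    n_neg = min(len(candidates), neg_per_pos * len(positives))
    negatives = rng.sample(candidates, n_neg)
    return [(a, 1) for a in positives] + [(a, 0) for a in negatives]


# Hypothetical toy metadata: tag -> set of attribute identifiers.
toy = {
    "health": {f"h{i}" for i in range(10)},
    "transit": {f"t{i}" for i in range(120)},
    "rare": {"r0", "r1"},
}
samples = build_training_set("health", toy)
skipped = build_training_set("rare", toy)  # fewer than 10 positives
```

Using a set per tag makes de-duplication implicit, and returning `None` for rare tags mirrors the filtering step that leaves 866 trainable tags in the text.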
Fig. 8: (a) Distribution of Tags Added to CKAN. (b) Success Prob. of CKAN after Metadata Enrichment. (Plots not reproduced; panel (a) plots table ratio against the number of tags per table and panel (b) plots success probability against table ID.)
5.6 User Study
We performed a formal user study to compare search by navigation to the major alternative of keyword search, and to investigate how users perceive the usefulness of the two approaches.
To remain faithful to keyword search engines, we created a semantic search engine that supports keyword search over attribute values and table metadata (including attribute names and table tags). The search engine performs query expansion with semantically similar terms. It supports BM25 [35] document search and semantic keyword search using pre-trained GloVe word vectors [40]. Our implementation uses the Xapian library (https://xapian.org/) to perform keyword search. Users can optionally enable query expansion, which augments the keyword query with additional semantically similar terms. We also created a prototype that enables participants to navigate our organizations. In this prototype, each node of an organization is labeled with a set of representative tags. We label leaf nodes (where attributes are located) with the corresponding table names and penultimate nodes (where single tags are located) with the corresponding tags. Each remaining node is labeled with two tags: the first and second most frequently occurring tags, among its children's labels, for the node's attributes. If these tags belong to the label of the same child, we choose the third most frequently occurring tag, and so on. At each state, the user can navigate to a desired child node or backtrack to the parent of the current node.
Hypotheses. The goal of our user study is to test the following hypotheses: H1) given the same amount of time, participants would be able to find as many relevant tables with navigation as with keyword search; H2) given the same amount of time, participants who use navigation would be able to find relevant tables that cannot be found by keyword search. We measure the disjointness of results, i.e., for two result sets R and T, the disjointness is computed by 1 − |R ∩ T| / |R ∪ T|.
Study Design. We considered the Socrata-2 and Socrata-3 data lakes for our user study and defined an overview information need scenario for each data lake. We made sure that these scenarios are similar in difficulty by asking a number of domain experts who were familiar with the underlying data lakes to rate several candidate scenarios. Note that the statistically insignificant difference between the number of tables found for each scenario by our participants provides evidence that our two scenarios were in fact similar in difficulty. The scenario used for Socrata-2 asks participants to find tables relevant to the scenario “Suppose you are a journalist and you’d like to find datasets published by governments on the topic of smart city.” The scenario used for Socrata-3 asks participants to find tables relevant to the scenario “Suppose you are a data scientist in a medical company and you would like to find datasets about clinical research.” During the study, we asked participants to use keyword search and navigation to find a set of tables which they deemed relevant to the given scenarios. For this study, we recruited 12 participants using convenience sampling [15, 19]. The participants have diverse backgrounds, with undergraduate and graduate education in computer science, engineering, math, statistics, and management sciences.
This study was a within-subject design. In our setting, we want to avoid two potential sources of invalidity. First, participants might become familiar with the underlying data lake during the first scenario, which might then help them to search better in their second scenario. To address this problem, we made sure that Socrata-2 and Socrata-3 do not have overlapping tags and tables. Second, the sequence in which the participants use the two search approaches can be a source of confounding. To mitigate this problem, we made sure that half of the participants performed keyword search first and the rest performed navigation first. In summary, we handled the effect of these two sources of invalidity using a balanced Latin square design with 4 blocks (block 1: Socrata-2/navigation first, block 2: Socrata-3/navigation first, etc.). We randomly assigned an equal number of participants to each of these blocks. For each participant, the study starts with a short training session, after which we gave the participant 20 minutes for each scenario.
Results. Because of our small sample size, we used the non-parametric Mann-Whitney test to determine the significance of the results and tested our two-tailed hypotheses. We found that there is no statistically significant difference between the number of relevant tables found using the organization and keyword search. The number of tables found by each participant for each scenario is listed in Table 4. We first asked two collaborators to eliminate irrelevant tables found. Since the number of irrelevant tables was negligible (less than 1% for both approaches), we will not further report on this process. This confirms our first hypothesis.

Table 4: Number of Tables Found by the Search Techniques.
Participant ID   Navigation   Keyword Search
1                11           8
2                32           24
3                2            2
4                25           13
5                45           14
6                44           34
7                23           10
8                14           8
9                20           10
10               7            8
11               6            1
12               6            2

The maximum number of tables found by navigating an organization and performing keyword search was 44 and 34, respectively. Moreover, a Mann-Whitney test indicated that the disjointness of results was greater for participants who used the organization (Mdn = 0.985) than for participants who used keyword search (Mdn = 0.916), U = 612, p = 0.0019. This confirms our second hypothesis. Note that the disjointness was computed for each pair of participants who worked on the same scenario using the same approach; the pairs generated for each technique were then compared. The calculated pairwise disjointness scores are included in Table 5. Based on our investigation, this difference might be because participants used very similar keywords, whereas the paths taken by each participant while navigating an organization were very different. In other words, as some participants described, they had a hard time finding keywords that best described their interest since they did not know what was available, whereas with our organization, at each step, they could see what seemed more interesting to them and find their way based on their preferences. As one example, for the smart city overview scenario, everyone found tables tagged with the term City using search. But using the organization, some users found traffic monitoring data, others found crime detection data, and others found renewable energy plans. Of course, if they knew a priori that this data was in the lake, they could have formulated better keyword search queries, but navigation allowed them to conveniently discover these relevant tables without prior knowledge. One very interesting observation in this study is that although participants found a similar number of tables, there is only around 5% intersection between tables found using keyword search and tables found using our approach. This suggests that organization can be a good complement to keyword search and vice versa.
We evaluated usability by asking each participant to fill out a standard post-experiment system usability scale (SUS) questionnaire [4] after each block. This questionnaire is designed to measure a user’s judgment of a system’s effectiveness and efficiency. We analyzed participants’ rankings and kept a record of the number of questions for which they gave a higher ranking to each approach. Our results indicate that 58% of the participants preferred to use keyword search, we suspect in part due to familiarity. No participant had a neutral preference. Still, having 42% prefer navigation indicates a clear role for this second, complementary modality.
6 Related Work
Entity-based Querying - While traditional search engines are built for pages and keywords, in entity search, a user formulates queries to directly describe her target entities and the result is a collection of entities that match the query [9]. An example of a query is “database #professor”, where professor is the target entity type and “database” is a descriptive keyword. Cheng et al. propose a ranking algorithm for the result of entity queries where a user’s query is described by keywords that may appear in the context of desired entities [11].
Data Repository Organization - Goods is Google’s specialized dataset catalog for billions of Google’s internal datasets [20]. The main focus of Goods is to collect and infer metadata for a large repository of datasets and make it searchable using keywords. Similarly, IBM’s LabBook provides rich collaborative metadata graphs on enterprise data lakes [28]. Skluma [2] also extracts metadata graphs from a file system of datasets. Many of these metadata approaches include the use of static or dynamic linkage graphs [13, 36, 17, 16], join graphs for ad hoc navigation [54], or version graphs [22]. These graphs allow navigation from dataset to dataset. However, none of these approaches learn new navigation structures optimized for dataset discovery.
Taxonomy Induction - The task of taxonomy induction creates hierarchies where edges represent is-a
Table 5: Disjointness Scores of Result Sets of Keyword Search (Kw) and Navigation (Nav) for each Pair of Participants. Rows and columns are participant IDs; each cell is Kw/Nav and "-" marks a pair with no reported score.

ID  1            2            3            4            5            6            7            8            9            10           11           12
1   -            -            0.733/1.000  0.714/0.937  0.916/1.000  0.875/1.000  -            -            -            -            0.947/1.000  0.800/0.928
2   -            -            -            -            -            -            1.000/1.000  1.000/1.000  1.000/1.000  1.000/1.000  -            -
3   0.733/1.000  -            -            0.850/0.960  0.941/1.000  0.923/0.966  -            -            -            -            0.850/0.985  0.850/1.000
4   0.714/0.937  -            0.850/0.960  -            1.000/1.000  0.818/1.000  -            -            -            -            0.871/0.985  0.777/0.837
5   0.916/1.000  -            0.941/1.000  1.000/1.000  -            1.000/0.833  -            -            -            -            1.000/1.000  1.000/1.000
6   0.875/1.000  -            0.923/0.966  0.818/1.000  1.000/0.833  -            -            -            -            -            0.939/0.979  0.916/0.964
7   -            1.000/1.000  -            -            -            -            -            0.913/0.969  0.892/0.930  0.935/1.000  -            -
8   -            1.000/1.000  -            -            -            -            0.913/0.969  -            0.888/1.000  0.909/1.000  -            -
9   -            1.000/1.000  -            -            -            -            0.892/0.930  0.888/1.000  -            0.875/0.969  -            -
10  -            1.000/1.000  -            -            -            -            0.935/1.000  0.909/1.000  0.875/0.969  -            -            -
11  0.947/1.000  -            0.850/0.985  0.871/0.985  1.000/1.000  0.939/0.979  -            -            -            -            -            0.952/0.901
12  0.800/0.928  -            0.850/1.000  0.777/0.837  1.000/1.000  0.916/0.964  -            -            -            -            0.952/0.901  -

(or subclass) relations between classes. The is-a relation represents true abstraction, not just the subset-of relation as in our approach. Moreover, a taxonomic relationship between two classes exists independent of the size and distribution of the data being organized. As a result, taxonomy induction relies on ontologies or semantics extracted from text [32] or structured data [42]. Our work is closest to concept learning, where entities are grouped into new concepts that are themselves organized in is-a hierarchies [33].
Faceted Search - Faceted search enables the exploration of entities by refining the search results based on some properties or facets. A facet consists of a predicate (e.g., model) and a set of possible terms (e.g., honda, volvo). The facets may or may not have a hierarchical relationship. Currently, most successful faceted search systems rely on term hierarchies that are either designed manually by domain experts or automatically created using methods similar to taxonomy induction [51, 43, 14]. The large size and dynamic nature of data lakes make the manual creation of a hierarchy infeasible. Moreover, since values in tables may not exist in external corpora [39], such taxonomy construction approaches are of limited usefulness for the data lake organization problem.
Keyword Search - Google’s dataset search uses keyword search over metadata and relies on dataset owners providing rich semantic metadata [3]. As shown in our user study, this can help users who know what they are looking for, but has less value in serendipitous data discovery as a user tries to better understand what data is available in a lake.
imizes the discovery probability of tables in a data lake and proposed an efficient approximation algorithm for creating good organizations. To build an organization, we use the attributes of tables together with any tags over the tables to combine table-level and instance-level features. The effectiveness and efficiency of our system are evaluated by benchmark experiments involving synthetic and real world datasets. We have also conducted a user study where participants use our system and a keyword search engine to perform the same set of tasks. The study shows that our system offers good performance and complements keyword search engines in data exploration. Future work includes empirically studying the scalability of our algorithm to larger data lakes and integrating keyword search and navigation as two interchangeable modalities in a unified data exploration framework. Based on the feedback and comments from the user study, we strongly believe that these extensions will further improve the user’s ability to navigate in large data lakes.
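For readers who want to reproduce the style of analysis used in the user study (Section 5.6), the following sketch computes the disjointness 1 − |R ∩ T| / |R ∪ T| for every pair of participants and compares two groups of scores with a Mann-Whitney U test. It is a minimal illustration, not the authors' analysis code: the function names, the toy result sets, the treatment of two empty sets as identical, and the use of a tie-corrected normal approximation without continuity correction for the two-sided p-value are all assumptions.

```python
from itertools import combinations
from math import erfc, sqrt
from typing import Dict, List, Sequence, Set, Tuple


def disjointness(r: Set[str], t: Set[str]) -> float:
    """1 - |R ∩ T| / |R ∪ T|; two empty sets count as identical (assumption)."""
    union = r | t
    if not union:
        return 0.0
    return 1.0 - len(r & t) / len(union)


def pairwise_disjointness(results: Dict[str, Set[str]]) -> List[float]:
    """Disjointness for every unordered pair of participants."""
    return [disjointness(results[a], results[b])
            for a, b in combinations(sorted(results), 2)]


def midranks(values: Sequence[float]) -> Tuple[List[float], List[int]]:
    """1-based ranks with ties sharing their average rank, plus tie-group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (U_x, U_y, two-sided p) using a tie-corrected normal approximation."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    ranks, ties = midranks(list(x) + list(y))
    u_x = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u_y = n1 * n2 - u_x
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in ties) / (n * (n - 1))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term)
    if variance <= 0:  # all observations tied: no evidence of a difference
        return u_x, u_y, 1.0
    z = (u_x - n1 * n2 / 2) / sqrt(variance)
    return u_x, u_y, erfc(abs(z) / sqrt(2))


# Hypothetical result sets (table IDs) for three participants per approach.
nav = {"p1": {"t1", "t2"}, "p2": {"t3", "t4"}, "p3": {"t2", "t5"}}
kw = {"p1": {"t1", "t2"}, "p2": {"t1", "t2", "t3"}, "p3": {"t1", "t3"}}
nav_scores = pairwise_disjointness(nav)
kw_scores = pairwise_disjointness(kw)
u_nav, u_kw, p_value = mann_whitney_u(nav_scores, kw_scores)
print(nav_scores, kw_scores, u_nav, u_kw, p_value)
```

With samples as small as the toy data, the normal approximation is crude; an exact test from a statistics library would be preferable in practice.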