Dark Data: Are We Solving The Right Problems?
Michael Cafarella1, Ihab F. Ilyas2, Marcel Kornacker3, Tim Kraska4 and Christopher Ré5
1 University of Michigan, USA; 2 University of Waterloo, USA; 3 Cloudera, USA; 4 Brown University, USA; 5 Stanford University, USA
michjc@umich.edu, ilyas@uwaterloo.ca, marcel@cloudera.com, tim_kraska@brown.edu, chrismre@cs.stanford.edu
Abstract—With the increasing urge of enterprises to ingest as much data as they can into what is commonly referred to as "Data Lakes", the new environment presents serious challenges to traditional ETL models and to building analytic layers on top of a well-understood global schema. With the recent development of multiple technologies to support this "load-first" paradigm, even traditional enterprises now have fairly large HDFS-based data lakes. They have even had them long enough that their first-generation IT projects delivered on some, but not all, of the promise of integrating their enterprises' data assets. In short, we moved from no data to dark data. Dark data is what enterprises might have in their possession without the ability to access it, or with limited awareness of what this data represents. In particular, business-critical information might still remain out of reach. This panel is about Dark Data and whether we have been focusing on the right data management challenges in dealing with it.

I. WHY IS THE TOPIC OF CURRENT INTEREST TO THE ICDE COMMUNITY?

Big data and the ability to deal with all of its aspects (e.g., volume, variety, etc.) is a core activity of the data management community. It is easy to see that, with the availability and affordability of Big Data technologies, the huge amount of data enterprises have been collecting has caught the research community off-guard. In response, academia has largely focused on revising known techniques and algorithms to work at large scale in scale-out or scale-up environments. On the other hand, there has been an explosion in the number of start-ups in the big data space offering more pragmatic solutions to the immediate needs of the enterprise, including data ingestion, elasticity, provenance, accelerating analytics, engaging data scientists, monitoring operations, and helping IT personnel manage the continually growing data assets and the technology stacks they are deploying. We strongly believe that we need to re-evaluate where we have been spending our effort in light of the real problems "big data" owners have.

II. WHY IS THE TOPIC WELL-SUITED TO AN OPEN PANEL DISCUSSION?

Almost all large enterprises we talked to confirmed that there is huge confusion in the market about the right technologies, architectures, standards, etc. We strongly believe that there is a similar amount of confusion in the research community about the right places to invest. An open discussion is a great medium to surface this confusion and to help understand the spectrum of activities involved in this space. In particular, consider the following issues:

• Schema on Read: One of the big promises of Big Data solutions is the load-first paradigm. What is the impact of this promise on our algorithms and on the assumptions we make in our settings?

• Schema Extraction and Evolution: With data streaming in from a variety of sources, manual curation of a global schema becomes impossible. Instead, we need to look for ways to automate the processes of schema extraction and schema evolution.

• Information Extraction Processes: Information extraction has been a research interest since at least the early 1990s, but it will likely reach new importance as a central element of any effort that combines dark data with traditional relational tools. What is the best way to design an extraction system that transforms dark data into relational data?

• Continuous Data Curation: For decades, we have been treating ETL and cleaning as a one-shot exercise that is tied closely to the ingest box in business intelligence stacks. Load-first paradigms and the swamps of dark data call for a different approach that treats curation as a continuous, incremental, fuzzy, and interactive process. What does that mean for the large body of work the database community has contributed in this domain?

• Reasoning about Data Quality: Datasets in data lakes are both incredibly numerous and exhibit incredible variation in quality. Fully cleaning the data lake will likely never be possible. How can we build data applications that are resilient to low-quality inputs, or that at least alert the user when inputs are not clean enough?

• Machine Learning?: Machine learning may make it radically easier to attack both classical applications and to open up new dark data applications. But machine learning is not a panacea: it still requires a huge effort to deploy, tune, and run. Does the data management community have a role to play in transitioning machine learning from "demoware" to an everyday data processing system? Does it change how we ask and answer questions? Moreover, machine learning researchers have historically not devised the abstractions that are critically important to building long-lasting systems; the data management community has. What does the future of these systems look like with data constantly in motion,
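The schema-on-read and schema-extraction issues above can be made concrete with a minimal sketch: inferring a relational-style schema from heterogeneous records *after* they have been loaded into the lake, rather than before. The records and field names below are hypothetical, and a real system would also have to handle nesting, optional fields, and evolution over time.

```python
import json
from collections import defaultdict

def infer_schema(records):
    """Map each field name to the set of value types observed for it.

    Load-first: everything is ingested as-is; structure is discovered
    on read. Fields with more than one observed type flag exactly the
    curation work discussed above.
    """
    schema = defaultdict(set)
    for rec in records:
        for field, value in rec.items():
            schema[field].add(type(value).__name__)
    return dict(schema)

# Hypothetical raw lines, as they might sit in an HDFS-based data lake.
raw = [
    '{"id": 1, "name": "Acme", "revenue": 120.5}',
    '{"id": 2, "name": "Globex"}',
    '{"id": "3", "name": "Initech", "revenue": "n/a"}',
]
records = [json.loads(line) for line in raw]
schema = infer_schema(records)
# "id" is seen as both int and str, and "revenue" as both float and
# str -- type conflicts that a one-shot ETL pipeline would have
# rejected at load time, but that schema-on-read must surface later.
```

The point of the sketch is that curation questions (which type is "id", is "n/a" a null?) are deferred from ingest to read, which is where the continuous, incremental curation process described above would have to operate.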
Marcel Kornacker
Marcel Kornacker is the Chief Architect for database
technology at Cloudera and founder of the Cloudera Impala
project. Following his graduation in 2000 with a PhD in
databases from UC Berkeley, he held engineering positions
at several database-related start-up companies. Marcel joined
Google in 2003 where he worked on several ads serving and
storage infrastructure projects, then became tech lead for the
distributed query engine component of Google’s F1 project.