Dark Data: Are We Solving The Right Problems?
Michael Cafarella1, Ihab F. Ilyas2, Marcel Kornacker3, Tim Kraska4 and Christopher Ré5
1 University of Michigan, USA; 2 University of Waterloo, USA; 3 Cloudera, USA; 4 Brown University, USA; 5 Stanford University, USA
michjc@umich.edu, ilyas@uwaterloo.ca, marcel@cloudera.com, tim_kraska@brown.edu, chrismre@cs.stanford.edu
Abstract—With the increasing urge of enterprises to ingest as much data as they can into what is commonly referred to as "Data Lakes", the new environment presents serious challenges to traditional ETL models and to building analytic layers on top of a well-understood global schema. With the recent development of multiple technologies to support this "load-first" paradigm, even traditional enterprises now have fairly large HDFS-based data lakes. They have even had them long enough that their first-generation IT projects delivered on some, but not all, of the promise of integrating their enterprises' data assets. In short, we moved from no data to dark data. Dark data is what enterprises might have in their possession without the ability to access it, or with limited awareness of what this data represents. In particular, business-critical information might still remain out of reach. This panel is about Dark Data and whether we have been focusing on the right data management challenges in dealing with it.

I. WHY IS THE TOPIC OF CURRENT INTEREST TO THE ICDE COMMUNITY?

Big data and the ability to deal with all of its aspects (e.g., volume, variety, etc.) is a core activity of the data management community. It is easy to see that, with the availability and affordability of Big Data technologies, the huge amount of data enterprises have been collecting has caught the research community off-guard. In response, academia has largely focused on revising known techniques and algorithms to work at large scale in scale-out or scale-up environments. On the other hand, there has been an explosion in the number of start-ups in the big data space offering more pragmatic solutions to the immediate needs of the enterprise, including data ingestion, elasticity, provenance, accelerating analytics, engaging data scientists, monitoring operations, and helping IT personnel manage the continually growing data assets and the technology stacks they are deploying. We strongly believe that we need to re-evaluate where we have been spending our effort in light of the real problems "big data" owners have.

II. WHY IS THE TOPIC WELL-SUITED TO AN OPEN PANEL DISCUSSION?

Almost all large enterprises we talked to confirmed that there is huge confusion in the market about the right technologies, architectures, standards, etc. We strongly believe that there is a similar amount of confusion in the research community about the right places to invest. An open discussion is a great medium to surface this confusion and to help understand the spectrum of activities involved in this space. In particular, consider the following issues:

• Schema on Read: One of the big promises of Big Data solutions is the load-first paradigm. What is the impact of this promise on our algorithms and on the assumptions we make in our settings?

• Schema Extraction and Evolution: With data streaming in from a variety of sources, manual curation of a global schema becomes impossible. Instead, we need to look for ways to automate the processes of schema extraction and schema evolution.

• Information Extraction Processes: Information extraction has been a research interest since at least the early 1990s, but it will likely reach new importance as a central element of any effort that combines dark data with traditional relational tools. What is the best way to design an extraction system that transforms dark data into relational data?

• Continuous Data Curation: For decades, we have been treating ETL and cleaning as a one-shot exercise that is tied closely to the ingest box in business intelligence stacks. Load-first paradigms and the swamps of dark data call for a different approach that treats curation as a continuous, incremental, fuzzy, and interactive process. What does that mean for the large body of work the database community has contributed in this domain?

• Reasoning about Data Quality: Datasets in data lakes are both incredibly numerous and exhibit incredible variation in quality. Fully cleaning the data lake will likely never be possible. How can we build data applications that are resilient to low-quality inputs, or that at least alert the user when inputs are not clean enough?

• Machine Learning?: Machine learning may make it radically easier to attack both classical applications and to open up new dark data applications. But machine learning is not a panacea: it still requires a huge effort to deploy, tune, and run. Does the data management community have a role to play in transitioning machine learning from "demoware" to an everyday data processing system? Does it change how we ask and answer questions? Moreover, machine learning researchers have historically not devised the abstractions that are critically important to building long-lasting systems; the data management community has. What does the future of these systems look like with data constantly in motion,
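The schema-on-read and schema-extraction issues above can be made concrete with a minimal sketch: inferring a relational-style schema from heterogeneous records *after* they have been loaded into the lake, rather than before. The records and field names below are hypothetical, and a real system would also have to handle nesting, optional fields, and evolution over time.

```python
import json
from collections import defaultdict

def infer_schema(records):
    """Map each field name to the set of value types observed for it.

    Load-first: everything is ingested as-is; structure is discovered
    on read. Fields with more than one observed type flag exactly the
    curation work discussed above.
    """
    schema = defaultdict(set)
    for rec in records:
        for field, value in rec.items():
            schema[field].add(type(value).__name__)
    return dict(schema)

# Hypothetical raw lines, as they might sit in an HDFS-based data lake.
raw = [
    '{"id": 1, "name": "Acme", "revenue": 120.5}',
    '{"id": 2, "name": "Globex"}',
    '{"id": "3", "name": "Initech", "revenue": "n/a"}',
]
records = [json.loads(line) for line in raw]
schema = infer_schema(records)
# "id" is seen as both int and str, and "revenue" as both float and
# str -- type conflicts that a one-shot ETL pipeline would have
# rejected at load time, but that schema-on-read must surface later.
```

The point of the sketch is that curation questions (which type is "id", is "n/a" a null?) are deferred from ingest to read, which is where the continuous, incremental curation process described above would have to operate.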
Marcel Kornacker
Marcel Kornacker is the Chief Architect for database
technology at Cloudera and founder of the Cloudera Impala
project. Following his graduation in 2000 with a PhD in
databases from UC Berkeley, he held engineering positions
at several database-related start-up companies. Marcel joined
Google in 2003 where he worked on several ads serving and
storage infrastructure projects, then became tech lead for the
distributed query engine component of Google’s F1 project.