
Mining Your Data Lake for Analytics Insights

Directly access the richness of your data lake for advanced analytics

INTRODUCTION

Unify analytics across your business

For years, companies have dumped data into their data lake and deferred organizing the data until later. To make this data useful for analytics, the data must be carefully structured and cataloged. And as more data is rapidly introduced, from log files and IoT sources, it must also be structured in the same way. Delta Lake on Databricks provides a way to streamline these data pipelines so the data is instantly available for analysis.

In this eBook, learn more about using Delta Lake and Amazon Web Services (AWS) to prepare and deliver data lake data directly to drive valuable analytics insights. Read use cases from LoyaltyOne and Comcast to see how they are getting value out of Delta Lake.

The challenge of scaling a data lake for analytics

With each day, there’s more data to manage—think streaming data, IoT data, event data, and social media data. A fleet of trucks can provide sensor readings every five seconds. A corporate intrusion detection program can track every IP address that enters a network, along with its actions. These types of use cases create hundreds of terabytes of information daily.

A data lake can give companies a place to store all that data, but it does not make that data analytics-ready. The data may pass through reformatting processes and move into other storage systems as it is prepared for analysis, which can take several hours to several days. Meanwhile, trucks can break down and network infiltrators can wreak havoc. Organizations need instant access to all that data to keep their business running.

Data reliability challenges and data lakes

When it comes to making data in data lakes accessible for analytics, there are a number of issues.

FAILED WRITES If a production job that is writing data experiences failures, which are inevitable in large distributed environments, the result can be data corruption through partial writes or multiple writes. This leaves partial datasets littering the bottom of the data lake. What is needed is a mechanism that ensures a write either takes place completely or not at all (and not multiple times, adding spurious data). Failed jobs can impose a considerable burden to recover to a clean state.

SCHEMA MISMATCH When ingesting content from multiple sources, typical of large, modern big data environments, it can be difficult to ensure that the same data is encoded in the same way; in other words, that the schema matches. A similar challenge arises when the formats for data elements are changed without informing the data engineering team. Both can result in low-quality, inconsistent data that requires cleanup to improve its usability. Schema enforcement is one key to consistency; being able to read a data set consistently while it is being updated is the other.

LACK OF CONSISTENCY To provide insights, it is necessary to combine historical batch data with new streaming data to show current behavior. Trying to read data while it is being appended is a challenge: on the one hand there is a desire to keep ingesting new data, while on the other hand anyone reading the data prefers a consistent view. This is especially an issue when there are multiple readers and writers at work. It is undesirable and impractical to stop read access while writes complete, or to stop write access while reads are in progress.

Data engineering drivers for advanced analytics

Your data engineers are responding to several different drivers in adopting advanced analytics. They include:

• GETTING MORE VALUE FROM CORPORATE ASSETS Advanced analytics, including methods based on ML, have evolved to such a degree that organizations seek to use them to derive far more value from their corporate assets.

• WIDESPREAD ADOPTION Advanced approaches are being adopted across a multitude of industries, and across private and public sector organizations. This further drives the need for strong data engineering practices.

• REGULATION REQUIREMENTS There is increased interest in how growing amounts of data are protected and managed. Regulations such as GDPR (General Data Protection Regulation) impose very specific requirements on how data is handled.

• TECHNOLOGY INNOVATION The move to cloud-based analytics architectures is propelled further by new innovations such as analytics-focused chipsets, pipeline automation, and the unification of data and machine learning.

• FINANCIAL SCRUTINY Analytics initiatives are subject to increasing financial scrutiny. There is a greater understanding of data as an asset. Deriving value from data must be done in a financially responsible way, add value to the enterprise, and generate ROI.

• ROLE EVOLUTION Reflecting the importance of managing data and maximizing its value, the Chief Data Officer (CDO) role is more prominent, and new roles such as Data Curator are emerging. These roles must balance the needs of governance, security and democratization.

Evolving data pipelines for advanced analytics

Your data engineers need to account for a broad set of dependencies and requirements as they design and build their data pipelines. Making quality data available in a reliable manner drives success for data analytics initiatives, whether they’re regular dashboards or reports, or advanced analytics projects drawing on state-of-the-art ML techniques.

Three primary goals that drive data engineers as they work to enable analytics in their organizations are:

1. DELIVER QUALITY DATA IN LESS TIME When it comes to data, quality and timeliness are key. Data with gaps or errors—which can arise for many reasons—is unreliable, can lead to incorrect conclusions, and is of diminished value to downstream users. Many applications are of limited value without up-to-date information.

2. ENABLE FASTER QUERIES Wanting fast responses to queries is natural in today’s online world. Achieving this is particularly challenging when queries are based on very large data sets.

3. SIMPLIFY DATA ENGINEERING AT SCALE It is manageable to achieve high reliability and performance in a limited development or test environment. What matters more is the ability to support robust production data pipelines at large scale—without requiring high operational overhead.

Apache Spark: A Unified Data Analytics Engine

Apache Spark™ was originally developed at UC Berkeley in 2009. Uniquely bringing data and AI technologies together, Apache Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning and graph processing.

These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows. Since its release, Apache Spark has seen rapid adoption by enterprises across a wide range of industries. Netflix, Yahoo, and eBay have deployed Apache Spark at massive scale, collectively processing multiple petabytes (PBs) of data on clusters of over 8,000 nodes—making it the de facto choice for new analytics initiatives. It has quickly become the largest open-source community in big data, with over 1,000 contributors from more than 250 organizations.

The founders of Databricks donated Apache Spark to the open-source big data community and continue to contribute 75% of the code to Apache Spark.
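To get a feel for how these libraries compose, here is a minimal PySpark sketch that pairs a Spark SQL aggregation with an MLlib regression in the same job; the input path and column names are illustrative assumptions, not taken from the eBook.

```python
# Minimal sketch: combining Spark SQL and MLlib in one PySpark job.
# The input path and column names are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("spark-libraries-sketch").getOrCreate()

# Spark SQL: load raw events and aggregate them declaratively.
spark.read.parquet("s3://your-bucket/events/").createOrReplaceTempView("events")
daily = spark.sql("""
    SELECT device_id, COUNT(*) AS n_readings, AVG(reading) AS avg_reading
    FROM events
    GROUP BY device_id
""")

# MLlib: feed the aggregated DataFrame straight into a regression model.
features = VectorAssembler(inputCols=["n_readings"], outputCol="features").transform(daily)
model = LinearRegression(featuresCol="features", labelCol="avg_reading").fit(features)
print(model.coefficients)
```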

Delta Lake storage layer serializes, compacts, and cleanses data

Your data scientists and data engineers need to focus on writing pipelines and algorithms. Delta Lake—which is open-source, open format, and compatible with Apache Spark APIs—automates many of the tasks required to prepare data for analytics, helping to speed time to value for analytics projects.

Working in conjunction with your data lake, Delta Lake:

• Automatically compacts data and executes de-duplication tasks.

• Makes ETL processes much faster on the front end and streamlines the porting of data into an ML model.

• Provides a storage layer that brings ACID (atomicity, consistency, isolation, durability) transactions to Apache Spark™ and data lakes (see the sketch below).
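As a minimal sketch of that ACID storage layer in practice, the snippet below assumes a Delta-enabled Spark session, a hypothetical S3 path, and a device_id key column; it performs an atomic initial write followed by a de-duplicating MERGE upsert using the Delta Lake Python API.

```python
# Minimal sketch: atomic writes and a de-duplicating upsert with Delta Lake.
# Assumes a Delta-enabled Spark session; paths and columns are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delta-write-sketch").getOrCreate()

updates = spark.read.json("s3://your-bucket/raw/devices/")  # assumed source

# Initial load: the write either commits completely or leaves no partial files.
updates.write.format("delta").mode("overwrite").save("s3://your-bucket/delta/devices")

# Upsert: MERGE applies updates and inserts in a single ACID transaction,
# which avoids duplicate rows when the same records arrive again.
target = DeltaTable.forPath(spark, "s3://your-bucket/delta/devices")
(target.alias("t")
    .merge(updates.alias("u"), "t.device_id = u.device_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```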

Delta Lake data flow and refinement

A common architecture uses tables that correspond to different quality levels in the data engineering pipeline, progressively adding structure to the data: data ingestion (“Bronze” tables), transformation/feature engineering (“Silver” tables), and machine learning training or prediction (“Gold” tables). Combined, these tables are referred to as a “multi-hop” architecture. It allows data engineers to build a pipeline that begins with raw data as a “single source of truth” from which everything flows.

Subsequent transformations and aggregations can be recalculated and validated to ensure that business-level aggregate tables still reflect the underlying data, even as downstream users refine the data and introduce context-specific structure.

Diagram: streaming and batch data from your existing data lake land in Ingestion Tables (Bronze), are processed into Refined Tables (Silver), and feed a Feature/Agg Data Store (Gold) that serves analytics and ML.
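To make the multi-hop pattern concrete, here is a simplified Structured Streaming sketch of the Bronze, Silver and Gold hops; every path, column name and checkpoint location is an assumption for illustration.

```python
# Simplified multi-hop sketch: Bronze -> Silver -> Gold Delta tables.
# All paths, columns and checkpoint locations are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("multi-hop-sketch").getOrCreate()

# Bronze: land raw JSON events as-is, the "single source of truth".
(spark.readStream.format("json")
    .schema("device_id STRING, reading DOUBLE, ts TIMESTAMP")
    .load("s3://your-bucket/landing/")
    .writeStream.format("delta")
    .option("checkpointLocation", "s3://your-bucket/chk/bronze")
    .start("s3://your-bucket/delta/bronze_events"))

# Silver: read Bronze as a stream, drop bad records and duplicates.
(spark.readStream.format("delta").load("s3://your-bucket/delta/bronze_events")
    .where(F.col("reading").isNotNull())
    .dropDuplicates(["device_id", "ts"])
    .writeStream.format("delta")
    .option("checkpointLocation", "s3://your-bucket/chk/silver")
    .start("s3://your-bucket/delta/silver_events"))

# Gold: business-level aggregates that feed dashboards and ML features.
(spark.readStream.format("delta").load("s3://your-bucket/delta/silver_events")
    .groupBy("device_id")
    .agg(F.avg("reading").alias("avg_reading"))
    .writeStream.format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "s3://your-bucket/chk/gold")
    .start("s3://your-bucket/delta/gold_device_stats"))

spark.streams.awaitAnyTermination()
```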

Delta Lake enforces schema and supports versioning

Features of Delta Lake include:

• TIME TRAVEL (DATA VERSIONING) Delta Lake provides snapshots of data, enabling developers to revert to earlier versions of data for audits or rollbacks, or to reproduce experiments. For more details on versioning, please read Introducing Delta Time Travel for Large Scale Data Lakes.

• OPEN FORMAT Data in Delta Lake is stored in Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.

• UNIFIED BATCH AND STREAMING SOURCE AND SINK A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingestion, batch historic data backfill, and interactive queries all work out of the box.

• SCHEMA ENFORCEMENT Delta Lake lets you specify your schema and enforce it. This helps ensure that data types are correct and required columns are present, preventing bad data from causing data corruption.

• SCHEMA EVOLUTION Big data is continuously changing. Delta Lake lets you make changes to a table schema that can be applied automatically, without the need for cumbersome data definition language (DDL); see the sketch following this list.
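A brief sketch of the time travel and schema behaviors described above, assuming a hypothetical table path and columns:

```python
# Sketch of Delta Lake time travel, schema enforcement, and schema evolution.
# The table path and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("delta-features-sketch").getOrCreate()
path = "s3://your-bucket/delta/devices"  # assumed Delta table location

# Time travel: read an earlier snapshot by version number or by timestamp,
# for audits, rollbacks, or reproducing experiments.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
june = spark.read.format("delta").option("timestampAsOf", "2020-06-01").load(path)

# Schema enforcement: an append whose columns or types do not match the table
# is rejected instead of silently corrupting the data. Opting in to schema
# evolution with mergeSchema adds the new "firmware" column with no DDL needed.
new_data = spark.range(5).select(
    F.col("id").cast("string").alias("device_id"),
    F.lit(0.0).alias("reading"),
    F.lit("v1").alias("firmware"))  # column not yet in the table
(new_data.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(path))
```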

Making it tangible: real business stories

So far, we’ve covered a lot of technical and theoretical ground. In the following pages, the possibilities come to life in real business applications. Read on to discover how actual businesses are using Delta Lake on Databricks and AWS.

USE CASE: COMCAST

CUSTOMER CHALLENGE: Massive volumes of data, fragile data pipelines, and collaboration and tool challenges made it hard for Comcast to take advantage of ML technologies.

SOLUTION: The Databricks Unified Data Analytics Platform enabled Comcast to build rich data sets at massive scale, optimize machine learning at scale, streamline workflows across teams, foster collaboration, reduce infrastructure complexity, and deliver superior customer experiences.

OUTCOMES DELIVERED: Databricks helps enable Comcast to create a viewer experience that boosts engagement. Comcast reduced its compute costs by 10x, reduced the number of DevOps full-time employees required for onboarding 200 users from 5 to 0.5, fostered collaboration with a single interactive workspace, and reduced deployment times from weeks to minutes.

USE CASE: LOYALTYONE

CUSTOMER CHALLENGE: LoyaltyOne—a loyalty marketing services provider—needed to improve performance in its data lake. The company wanted to be able to dynamically update its data sets and to remove the need to filter records.

SOLUTION: LoyaltyOne used open-source Delta Lake on Databricks and AWS to create automated data pipelines that feed thousands of ML and analytics models, accelerating and enhancing customer churn and recommendation engines.

OUTCOMES DELIVERED: Open-source Delta Lake on Databricks and AWS helped LoyaltyOne create a data lake pipeline for massive volumes of batch and real-time streaming data. As an example, churn model updates were reduced from 13 hours to two.

Get started with Delta Lake and AWS
Databricks is the data and AI company. Thousands of organizations worldwide —
including Comcast, Condé Nast, Nationwide and H&M — rely on Databricks’ open and
unified platform for data engineering, machine learning and analytics. Databricks is
venture-backed and headquartered in San Francisco, with offices around the globe.
Founded by the original creators of Apache Spark™️, Delta Lake and MLflow, Databricks is
on a mission to help data teams solve the world’s toughest problems.

To learn more, follow Databricks on Twitter | LinkedIn | Facebook

TRY DATABRICKS

© Databricks 2020. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
Privacy Policy | Terms of Use
