Professional Documents
Culture Documents
Mining Your Data Lake For Analytics Insights v3 101420
Mining Your Data Lake For Analytics Insights v3 101420
your business
For years, companies have dumped data into their data lake and deferred
organizing the data until later. To make this data useful for analytics, the
data must be carefully structured and cataloged. And as more data is rapidly
introduced, from log files and IoT sources, it must also be structured in the same
way. Delta Lake on Databricks provides a way to streamline these data pipelines
so the data is instantly available for analysis.
In this eBook, learn more about using Delta Lake and Amazon Web Services
(AWS) to prepare and deliver data lake data directly to drive valuable analytics
insights. Read use cases from LoyaltyOne and Comcast to see how they are
getting value out of Delta Lake.
M I N I N G YO U R D ATA L A K E F O R 3
A N A LY T I C S I N S I G H T S
The challenge of scaling a With each day, there’s increasingly more data to manage—think streaming data,
data lake for analytics IoT data, event data, and social media data. A fleet of trucks can provide sensor
readings every five seconds. A corporate intrusion detection program can track
every IP address that enters a network, and their actions. These types of use
cases create hundreds of terabytes of information daily.
A data lake can provide companies a place to store all that data, but it does not
provide that data in a way that is analytics-ready. That data may pass through
processes to reformat it and into other storage systems as it is prepared for
analysis. This process can take several hours to several days. Meanwhile, trucks
can break down and network infiltrators can wreak havoc. Organizations need
instant access to all that data to keep their business running.
M I N I N G YO U R D ATA L A K E F O R 4
A N A LY T I C S I N S I G H T S
Data reliability challenges When it comes to making data in data lakes accessible for analytics, there are a
and data lakes number of issues.
Data engineering drivers Your data engineers are responding to several different drivers in adopting
for advanced analytics advanced analytics. They include:
• G E T T I N G M O R E V A L U E F R O M C O R P O R A T E A S S E T S Advanced
analytics, including methods based on ML, have evolved to such a degree
that organizations seek to use them to derive far more value from their
corporate assets.
Evolving data pipelines for Your data engineers need to account for a broad set of dependencies and requirements
advanced analytics as they design and build their data pipelines. Making quality data available in a reliable
manner drives success for data analytics initiatives, whether they’re regular dashboards
or reports, or advanced analytics projects drawing on state-of-the-art ML techniques.
Three primary goals that drive data engineers as they work to enable analytics in
their organizations are:
Apache Spark: A Unified Apache Spark™ was originally developed at UC Berkeley in 2009. Uniquely
Data Analytics Engine bringing data and AI technologies together, Apache Spark comes packaged with
higher-level libraries, including support for SQL queries, streaming data, machine
learning and graph processing.
The founders of Databricks donated Apache Spark to the opensource big data
community and continue to contribute 75% of the code to Apache Spark.
M I N I N G YO U R D ATA L A K E F O R 10
A N A LY T I C S I N S I G H T S
Delta Lake storage layer Your data scientists and data engineers need to focus on writing pipelines and
serializes, compacts, algorithms. Delta Lake—which is open-source, open format, and compatible with
and cleanses data Apache Spark APIs—automates many of the tasks required to prepare data for
analytics, helping to speed time to value for analytics projects.
• Makes ETL processes much faster on the front end and streamlines the
porting of data into an ML model.
Streaming
Analytics
and ML
Y O U R E X I S T I N G D ATA L A K E
M I N I N G YO U R D ATA L A K E F O R 12
A N A LY T I C S I N S I G H T S
Making it tangible: So far, we’ve covered a lot of technical and theoretical ground. In the following
real business stories pages, the possibilities come to life in real business applications. Read on to
discover how actual businesses are using Delta Lake on Databricks and AWS.
M I N I N G YO U R D ATA L A K E F O R 14
A N A LY T I C S I N S I G H T S
T R Y D ATA B R I C K S
© Databricks 2020. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
Privacy Policy | Terms of Use