
Data Lake

Reasons for building a data lake

Exponential growth in data: Transactions, Billing, ERP, Web logs, Sensor Data, Infrastructure logs, Social

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.


Reasons for building a data lake

Exponential growth in data: Transactions, Billing, ERP, Web logs, Sensor Data, Infrastructure logs, Social
Diversified consumers: Data Scientists, Applications, Business Analyst, External Consumers


Reasons for building a data lake

Exponential growth in data: Transactions, Billing, ERP, Web logs, Sensor Data, Infrastructure logs, Social
Diversified consumers: Data Scientists, Applications, Business Analyst, External Consumers
Multiple access mechanisms: API Access, Notebooks, BI Tools


Characteristics of a data lake

Collect Anything | Dive in Anywhere | Flexible Access | Future Proof

Amazon S3 as the data lake



Simplified architectural view

Data sources: Transactions, ERP, Web logs / cookies, Connected devices
Stages: Ingestion mechanism, Amazon S3, Process, Consume


There are lots of ingestion tools

Data sources: Transactions, ERP, Web logs / cookies, Connected devices
Ingestion: S3 Transfer Acceleration
Stages: Amazon S3, Process, Consume


Variety of data processing tools

Data sources: Transactions, ERP, Web logs / cookies, Connected devices
Ingestion: S3 Transfer Acceleration
Storage: Amazon S3
Processing:
• Amazon Athena: Interactive Query
• Amazon EMR: Managed Hadoop & Spark
• Amazon Redshift + Spectrum: Petabyte-scale Data Warehousing
• Amazon Elasticsearch: Real-time log analytics & search
• Amazon AI: ML/DL Services
Consume


And multiple ways to consume the data

Data sources: Transactions, ERP, Web logs / cookies, Connected devices
Ingestion: S3 Transfer Acceleration
Storage: Amazon S3
Processing:
• Amazon Athena: Interactive Query
• Amazon EMR: Managed Hadoop & Spark
• Amazon Redshift + Spectrum: Petabyte-scale Data Warehousing
• Amazon Elasticsearch: Real-time log analytics & search
• Amazon AI: ML/DL Services
Consumption:
• Amazon QuickSight: Fast, easy to use, cloud BI
• Analytic Notebooks: Jupyter, Zeppelin, HUE
• Amazon API Gateway: Programmatic Access


Because data is never perfect

Clean
Transform
Concatenate
Convert to better formats

AWS Lambda: Trigger-based code execution
AWS Glue: Event-based, serverless ETL engine
Amazon EMR: Spark and Hive running on EMR

• Scheduled transformations
• Event-driven transformations
• Transformations expressed as code
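
As one illustration of trigger-based code execution, the sketch below shows a Python handler of the kind AWS Lambda can invoke when an object lands in S3. The event layout follows the S3 notification format; the `clean_record` helper and the field names in the sample are hypothetical and are not taken from the talk.

```python
from urllib.parse import unquote_plus


def clean_record(record: dict) -> dict:
    """Hypothetical cleaning step: lower-case keys and trim string values."""
    return {
        key.strip().lower(): value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }


def handler(event: dict, context: object) -> list:
    """Return the (bucket, key) pairs named in an S3 notification event."""
    objects = []
    for item in event.get("Records", []):
        bucket = item["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(item["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "example-bucket"},
                    "object": {"key": "raw/day%3D2017-10-10/events+1.json.gz"}}}
        ]
    }
    print(handler(sample_event, None))
    print(clean_record({" DeviceID ": " 42 ", "Temp": 71.5}))
```

A real handler would go on to read each object, apply the cleaning step, and write the result back to S3; those calls are left out so the sketch stays self-contained.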



ETL when you need it
[Diagram: the same data sources, Amazon S3 storage, processing tools, and consumption tools as on the previous architecture slide]


AWS Glue Data Catalog
Central Metadata Catalog for the data lake

Metadata? One per account

Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR & JDBC sources

We added a few extensions:
▪ Search over metadata for data discovery
▪ Connection info: JDBC URLs, credentials
▪ Classification for identifying and parsing files
▪ Versioning of table metadata as schemas evolve and other metadata are updated
AWS Glue Data Catalog - Crawlers
Helping catalog your data

Crawlers automatically build your Data Catalog and keep it in sync
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless, so you only pay when a crawler runs
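
To make "Hive-style partitions" concrete: they are `name=value` path segments in an object key. The sketch below only illustrates that naming convention; it is not how Glue crawlers are implemented, and the function name is hypothetical.

```python
def parse_hive_partitions(key: str) -> dict:
    """Extract name=value path segments from an S3 object key."""
    partitions = {}
    # The last path segment is the file name, so it is skipped.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


if __name__ == "__main__":
    print(parse_hive_partitions("logs/year=2017/month=10/part-0001.json"))
```

Each detected name becomes a partition column, which lets a query engine skip objects whose partition values do not match a filter.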



AWS Glue Data Catalog


Data Catalog: Table Details

Table properties
Nested fields

Data statistics

Table schema



Data Catalog: Version Control
Compare schema versions List of table versions



Automatic Partition Detection

Table
partitions

Automatically register available partitions



A central metadata store for your lake

[Diagram: the same architecture, with the AWS Glue Data Catalog (Hive-compatible Metastore) added as the central metadata store]


Real-time (in-stream processing)

[Diagram: the same architecture, with Amazon Kinesis Analytics and Spark Streaming & Flink on EMR added for in-stream processing]


Write once, catalog once, read multiple, ETL anywhere

[Diagram: the complete architecture: data sources, S3 Transfer Acceleration, Amazon Kinesis Analytics, Spark Streaming & Flink on EMR, Amazon S3, the AWS Glue Data Catalog (Hive-compatible Metastore), processing tools, and consumption tools]


Characteristics of a data lake

Collect Anything | Dive in Anywhere | Flexible Access | Future Proof

Let's take an example
Business Questions
1. What is going on with a specific sensor
2. Daily Aggregations (device,
inefficiencies, average temperature)
3. A real-time view of how many sensors
are showing inefficiencies

Sensor/IoT device: record-level data

Operations
1. Scale
2. High availability
3. Less management overhead
4. Pay for what I need



Let's push this data into Kinesis

[Diagram: Kinesis Firehose, Amazon S3, Amazon Athena]


Querying it in Amazon Athena

Either create a crawler to auto-generate the schema
OR
write a DDL statement using the Athena console, API, or JDBC/ODBC driver

Start querying data


Query daily aggregates in Amazon Athena

[Diagram: Kinesis Firehose, Amazon S3 ("raw-time-series"), Amazon S3 ("daily-average"), Amazon Athena]


AWS Glue Job
Serverless, event-driven execution

Data is written out to S3

Output table is automatically created in Amazon Athena
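
A rough pure-Python sketch of the kind of daily aggregation such a job might compute. The field names `deviceid`, `devicets`, and `temp` match the queries shown later in this deck; the ISO-8601 timestamp format and the function name are assumptions, and a real Glue job would express this as a Spark transformation rather than a Python loop.

```python
from collections import defaultdict
from datetime import datetime


def daily_average_temp(records):
    """Average `temp` per (deviceid, calendar day)."""
    sums = defaultdict(lambda: [0.0, 0])  # (deviceid, day) -> [total, count]
    for rec in records:
        # Assumes ISO-8601 timestamps such as "2017-10-10T13:20:15".
        day = datetime.fromisoformat(rec["devicets"]).date().isoformat()
        bucket = sums[(rec["deviceid"], day)]
        bucket[0] += rec["temp"]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


if __name__ == "__main__":
    sample = [
        {"deviceid": 1, "devicets": "2017-10-10T13:20:15", "temp": 70.0},
        {"deviceid": 1, "devicets": "2017-10-10T14:00:00", "temp": 74.0},
        {"deviceid": 2, "devicets": "2017-10-11T00:00:01", "temp": 60.0},
    ]
    print(daily_average_temp(sample))
```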





Kinesis Analytics for in-stream analytics

[Diagram: Kinesis Firehose, Kinesis Analytics, Amazon S3 ("raw-time-series", "daily-average", "results"), Amazon Athena]


"raw" table with raw data

Events by Device ID
SELECT uuid, devicets, deviceid, temp
FROM awsblogsgluedemo."raw"
WHERE deviceid = 1
ORDER BY devicets DESC;

Top 20 most active devices
SELECT deviceid, COUNT(*) AS num_events
FROM awsblogsgluedemo."raw"
GROUP BY deviceid
ORDER BY num_events DESC
LIMIT 20;

"daily-agg" table with daily aggregation

KPI - Overall device daily inefficiency
SELECT (SUM(daily_avg_inefficiency) / COUNT(*)) AS all_device_avg_inefficiency, date
FROM awsblogsgluedemo.daily_avg_inefficiency
GROUP BY date;

"result" table

Top 10 most inefficient devices - event-level granularity
SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS "devicets",
       col3 AS "temp", col4 AS "settemp", col5 AS "pct_inefficiency"
FROM awsblogsgluedemo.results
ORDER BY pct_inefficiency DESC
LIMIT 10;



Overall architecture

[Diagram: Kinesis Firehose, Kinesis Analytics, Amazon S3 ("raw-time-series", "daily-average", "results"), Amazon Athena]


Characteristics

✓ Scale to hundreds of thousands of data sources
✓ Virtually infinite storage scalability
✓ Real-time and batch processing layers
✓ Interactive queries
✓ Highly available and durable
✓ Pay only for what you use
✓ No servers to manage


Very easy to try – existing template



Building the Atlassian Data Lake

ROHAN DHUPELIA | ANALYTICS PLATFORM MANAGER | @ROHANDHUPELIA


ATLASSIAN OVERVIEW

Software Teams Marketing Teams IT Teams HR Teams Finance Teams

Reactions Convos Mentions Files Meetings Decisions

Planning & tracking Messaging & communicate Organizing projects Content collaboration Code collaboration
Socrates
The Atlassian Data Lake

Image courtesy of © Bar Harel, CC BY-SA 4.0, Wikimedia Commons


The numbers

500+ TBs: Stored in the data lake
1B+ Events: Ingested into the data lake daily
100 Integrations: Providing analytical events daily
1000 Internal Users: Using the data lake
Data lake services
Ingest
Moving away from pull-based ingestion
Challenges with pull-based ingestion

Brittle: As sources change, the pipelines break and need updating
Complex: Various technologies to maintain
Disruptive: Analytics extracts strain sourcing systems
Our Ingestion Journey: Late 2015
Sources: Web, CRM, Billing, Product
Connectors: Kinesis, REST, JDBC, GraphQL
Target: Socrates (Data Lake)
Our Ingestion Journey: Early 2016
Sources: Web, CRM, Billing, Product, Micro Services
Connectors: Kinesis, REST, JDBC, GraphQL, Webhook, ODBC, SFTP
Target: Socrates (Data Lake)
Our Ingestion Journey: Late 2016
Sources: Web, CRM, Billing, Product, Micro Services
Target: Socrates (Data Lake)
Our Ingestion Journey: Early 2017
Sources: Web, CRM, Billing, Product, Micro Services
Hub: StreamHub (Enterprise Bus)
Targets: Socrates (Data Lake), Other Enterprise Systems, Other Micro Services
What is StreamHub?

Event-Driven Architecture: Producers and subscribers integrate via events
Schema Registry: Validates that messages are compatible
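
The talk does not say which compatibility rules StreamHub enforces. Purely as an illustration, the sketch below checks one common rule: a new schema version may add fields but must keep every existing field with the same type. Schemas are modelled as plain dicts mapping field name to type name; all names here are hypothetical.

```python
TYPE_MAP = {"string": str, "int": int, "double": float}


def is_compatible_evolution(old_schema: dict, new_schema: dict) -> bool:
    """True if every field in the old schema survives with the same type."""
    return all(new_schema.get(field) == type_name
               for field, type_name in old_schema.items())


def validate_message(message: dict, schema: dict) -> bool:
    """True if the message carries every schema field with the expected type."""
    return all(
        field in message and isinstance(message[field], TYPE_MAP[type_name])
        for field, type_name in schema.items()
    )


if __name__ == "__main__":
    v1 = {"id": "string", "count": "int"}
    v2 = {"id": "string", "count": "int", "note": "string"}
    print(is_compatible_evolution(v1, v2))
    print(validate_message({"id": "abc", "count": 3}, v1))
```

Rejecting incompatible schemas at the registry means producers find out before subscribers break, which is the point of validating on publish.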
How do we land it?
atlassian-socrates-raw-landed/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── events-13:20:15.479940.json.gz
        ├── events-13:21:23.479940.json.gz
        ├── events-13:21:52.479940.json.gz
        ├── events-13:23:37.479940.json.gz
        ├── events-13:23:56.479940.json.gz
        ├── events-13:24:15.479940.json.gz
        ├── events-13:24:21.479940.json.gz
        ├── events-13:25:34.479940.json.gz
        └── events-13:26:13.479940.json.gz

atlassian-socrates-raw-published-stg1/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── <sub-partition>/
        │   ├── events-part01.snappy.parquet
        │   ├── events-part02.snappy.parquet
        │   ├── events-part03.snappy.parquet
        │   └── events-part04.snappy.parquet
        └── <sub-partition>/
            ├── events-part05.snappy.parquet
            ├── events-part06.snappy.parquet
            ├── events-part07.snappy.parquet
            └── events-part08.snappy.parquet

atlassian-socrates-raw-published-stg2/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── business_key_1/
        │   └── events-part01.snappy.parquet
        └── business_key_2/
            └── events-part01.snappy.parquet
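
The landed layout above follows an `<event type>/day=<date>/events-<time>.json.gz` pattern. A minimal sketch of building such a key, assuming the time component is the receive time with microseconds; the function name is hypothetical and the talk does not describe the actual naming code.

```python
from datetime import datetime


def landed_key(event_type: str, received_at: datetime) -> str:
    """Build an object key in the landed layout, partitioned by day."""
    day = received_at.strftime("%Y-%m-%d")
    stamp = received_at.strftime("%H:%M:%S.%f")
    return f"{event_type}/day={day}/events-{stamp}.json.gz"


if __name__ == "__main__":
    when = datetime(2017, 10, 10, 13, 20, 15, 479940)
    print(landed_key("avi:jira:created:comment", when))
```

Keeping `day=` as a Hive-style segment means the landed data can be registered as a partitioned table without renaming objects.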
Prepare
Cleansing and transforming our data
Challenges with preparation

Data Engineering Bottleneck: Teams would rely on us to help them with their data transformation needs
Cluster Management: Clusters could be hard to upgrade and attribute costs to jobs
Re-Inventing the Wheel: Lots of time spent re-implementing patterns to perform transformations
RAW / UNALTERED: Support/Ops, CRM/Billing, Product/Web
JOB SCOPED CLUSTERS: Account / Chargeback, Upscale, Quarantine
PREPARED / TRANSFORMED: User Defined Extracts, Dimensional Model, Aggregated / Derived

Airflow

Airflow DAG:
1. Spin up a dedicated EMR cluster
2. Copy logs for debugging
3. Shut down the EMR cluster

Transformation as a Service (TaaS)
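
The job-scoped cluster pattern can be sketched in plain Python as a try/finally around the transformation, so that logs are copied and the cluster is shut down whether or not the job succeeds. `FakeCluster` and `run_job_scoped` are hypothetical stand-ins, not Airflow or EMR APIs; in the setup described in the talk these steps would be tasks in an Airflow DAG.

```python
from typing import Callable, List


class FakeCluster:
    """Hypothetical stand-in for a dedicated EMR cluster; records each step."""

    def __init__(self, job_name: str) -> None:
        self.job_name = job_name
        self.steps: List[str] = ["spin_up"]

    def copy_logs(self) -> None:
        self.steps.append("copy_logs")

    def shutdown(self) -> None:
        self.steps.append("shutdown")


def run_job_scoped(job_name: str,
                   transform: Callable[[FakeCluster], None]) -> FakeCluster:
    """Run one transformation on its own cluster, then always clean up."""
    cluster = FakeCluster(job_name)
    try:
        transform(cluster)
        cluster.steps.append("transform_ok")
    except Exception:
        cluster.steps.append("transform_failed")
    finally:
        # Logs are copied before shutdown so failures can be debugged later.
        cluster.copy_logs()
        cluster.shutdown()
    return cluster


if __name__ == "__main__":
    print(run_job_scoped("daily_model", lambda c: None).steps)
    print(run_job_scoped("broken_job", lambda c: 1 / 0).steps)
```

Scoping a cluster to a single job is what makes per-job cost attribution straightforward: the cluster's lifetime is the job's cost.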
Organize
Storing, securing, and governing our data
Challenges with organizing data

Teams want flexibility: How do we give teams flexibility on how they organize themselves?
Security: How can we provision buckets for teams who don't want to face the AWS console head-on?
Categorizing Data: How can we structure our data lake in a way that will scale well?
Areas of the data lake

Landed: Unaltered, Unformatted, Unmasked
Raw: Optimized, Partitioned, Masked, Standardized
Modeled: Conformed dimensions, facts, aggregated/derived value
Self-Serve: BYO Data, User/Team managed
Self-Service Schemas

Request a Schema…
Provisions the components:
• Create an S3 bucket, tagged to the user
• Create a schema in our metastore(s)
• Create an Active Directory group

What gets provisioned
We call them Zones
We used to call them "Playgrounds", but they were often used for production loads
e.g. zone_marketing

Use Vault to control access rights


• A tool that manages secrets
• Creates a temporary IAM user (2 hours)
• Passes the credentials to the user
Self-Service Schemas: How users interact

Authenticate against Vault
$ vault auth -method=ldap username=<ad_username>
Password (will be hidden): <ad_password>
...
token_policies: [zone-marketing-write zone-marketing-read]

Retrieve your credentials
$ vault read aws/creds/zone-marketing-write
Key             Value
---             -----
lease_id        aws/creds/zone-marketing-write/e1x2a3m4p5l6e7
lease_duration  25h0m0s
lease_renewable true
access_key      AKIAISANEXAMPLEKEYID
secret_key      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
security_token  <nil>
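
Because the credentials are leased, a script that uses them may want to know when they stop working. The sketch below parses the tabular output shown above and converts `lease_duration` into a `timedelta`; both function names are hypothetical, and the parsing assumes the simple two-column layout printed on the slide.

```python
import re
from datetime import timedelta


def parse_vault_output(text: str) -> dict:
    """Turn the 'Key Value' table printed by `vault read` into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        parts = line.split(None, 1)
        # Skip the header row and the dashed separator row.
        if len(parts) == 2 and parts[0] not in ("Key", "---"):
            fields[parts[0]] = parts[1].strip()
    return fields


def parse_lease_duration(value: str) -> timedelta:
    """Parse durations such as '25h0m0s' into a timedelta."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", value)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {value!r}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


if __name__ == "__main__":
    output = """
    Key Value
    --- -----
    lease_id aws/creds/zone-marketing-write/e1x2a3m4p5l6e7
    lease_duration 25h0m0s
    lease_renewable true
    """
    fields = parse_vault_output(output)
    print(parse_lease_duration(fields["lease_duration"]))
```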
Self-Service Schemas: How users interact

Apply credentials
$ aws configure
AWS Access Key ID [None]: AKIAISANEXAMPLEKEYID
AWS Secret Access Key [None]: 1r2a3n4d5o6m7s8t9r0i1n2g3O4f5C6o7u8r9s0e

List your bucket
$ aws s3 ls s3://atlassian-zone-marketing/
                           PRE example_directory/
                           PRE another_example_directory/
2016-12-08 13:21:35          0 example_text_file.txt
2016-09-27 12:24:48          0 example_csv_file.csv

Upload your file
$ aws s3 cp examplefile s3://atlassian-zone-bucketname
Discover
Finding, understanding, and exploring data
Challenges with data discovery

Teams want options: Different visualization tools better suit different needs
Managing query engines: Query engine usage is unpredictable, and doing a bad job blocks analysts
Finding data: Difficult to know which table to trust or to use for what purpose
Visual Layer: Tableau, R Shiny, Zeppelin Notebooks, Redash
Interactive Layer: Amazon Athena, Presto on EMR, Spark/Hive on EMR
Metastore Layer: Hive Metastore, AWS Glue Metastore
Storage Layer: Raw Buckets, Zone Buckets (Self-Service), Model Buckets
Before: Presto
• Many failed queries
• Difficulties upgrading
• Hard to secure

After: Amazon Athena
• Ability to attribute costs
• Less infrastructure/operational overhead
• Not paying for what we don't use
• Uses bucket security policies
Challenges with Amazon Athena

Early Adopter Pains: There wasn't parity with Presto to begin with
No AD Authentication: Only access via JDBC using keys to begin with
Cost Management: Costs need to be monitored to spot any unusual spikes
Visualization Stack

Tableau: Interactive exploration on core data sets and corporate dashboards
R Shiny: Web apps and standalone dashboards
Zeppelin: Web-based notebooks
Redash: Quick queries and visualizations on all data
Search the Data Catalog
Key Takeaways

AWS helps you move up the value chain
Using AWS helps you focus on areas where you can be adding value

It's not just flicking on a switch
You can't just turn on AWS components and have an instant data lake
Thank you!

ROHAN DHUPELIA | ANALYTICS PLATFORM MANAGER | @ROHANDHUPELIA
