AWS 05 DataLake

Data Lake
Reasons for building a data lake

Exponential growth in data
Transactions Billing
ERP Web logs
Sensor Data Infrastructure logs
Social
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Diversified consumers
Data Scientists, Applications, Business Analysts, External Consumers

Multiple access mechanisms
API Access, Notebooks, BI Tools

Characteristics of a data lake
Collect Anything, Dive in Anywhere, Flexible Access, Future Proof
Amazon S3 as the data lake

Simplified architectural view
[Diagram: data sources (transactions, ERP, web logs / cookies, connected devices) -> ingestion mechanism -> Amazon S3 -> process -> consume]

There are lots of ingestion tools
[Diagram: data sources (transactions, ERP, web logs / cookies, connected devices) -> ingestion tools, including S3 Transfer Acceleration -> Amazon S3 -> process -> consume]

Variety of data processing tools
• Amazon Athena: Interactive Query
• Amazon EMR: Managed Hadoop & Spark
• Amazon Redshift + Spectrum: Petabyte-scale Data Warehousing
• Amazon Elasticsearch: Real-time log analytics & search
• Amazon AI: ML/DL Services
[Diagram: data sources -> S3 Transfer Acceleration -> Amazon S3 -> the processing tools listed above -> consume]

And multiple ways to consume the data
• Amazon QuickSight: Fast, easy to use, cloud BI
• Analytic Notebooks: Jupyter, Zeppelin, HUE
• Amazon API Gateway: Programmatic Access
[Diagram: data sources -> S3 Transfer Acceleration -> Amazon S3 -> processing tools -> the consumption options listed above]

Because data is never perfect
Clean
Transform
Concatenate
Convert to better formats
• AWS Lambda: trigger-based code execution
• AWS Glue: event-based, serverless ETL engine; transformations expressed as code
• Amazon EMR: Spark and Hive running on EMR
• Scheduled transformations and event-driven transformations
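The trigger-based style listed for AWS Lambda can be sketched as a handler that reacts to S3 object-created events. This is only an illustration: it assumes the standard S3 event notification payload (`Records[].s3.bucket.name` and `Records[].s3.object.key`), and `clean_record` is a hypothetical cleaning rule that does not come from the slides.

```python
from urllib.parse import unquote_plus


def clean_record(record):
    """Hypothetical cleaning rule: trim string values and drop empty fields."""
    cleaned = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
        if value in ("", None):
            continue
        cleaned[key] = value
    return cleaned


def objects_from_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for rec in event.get("Records", []):
        bucket = rec["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(rec["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def handler(event, context):
    """Lambda entry point: list the objects that triggered this invocation.

    A real function would read each object, apply clean_record to its rows,
    and write the result back to S3 in a better format.
    """
    return {"objects": objects_from_event(event)}
```

The handler deliberately stops at listing the triggering objects, so the sketch runs without any AWS dependency.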

ETL when you need it
[Diagram: the same architecture of data sources, S3 Transfer Acceleration, Amazon S3, processing tools and consumption options, with ETL applied where needed]

AWS Glue Data Catalog
Central Metadata Catalog for the data lake
Metadata? One per account
Allows you to share metadata between Amazon Athena, Amazon Redshift Spectrum, EMR & JDBC sources
We added a few extensions:
▪ Search over metadata for data discovery
▪ Connection info – JDBC URLs, credentials
▪ Classification for identifying and parsing files
▪ Versioning of table metadata as schemas evolve and other metadata are updated
AWS Glue Data Catalog - Crawlers
Helping catalog your data
Crawlers automatically build your Data Catalog and keep it in sync
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless – only pay when the crawler runs

AWS Glue Data Catalog

Data Catalog – Table Details
Table properties
Nested fields
Data statistics
Table schema

Data Catalog: Version Control
Compare schema versions List of table versions

Automatic Partition Detection
Table partitions
Automatically register available partitions
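To make the idea of Hive-style partition detection concrete, here is a minimal sketch that reads `name=value` path segments out of S3 object keys. It is not how Glue crawlers are implemented; the function names and the sample keys are assumptions made for illustration.

```python
def parse_hive_partitions(key):
    """Extract Hive-style partition columns (name=value path segments) from an object key."""
    partitions = {}
    # The final path segment is the file name, so only directories are inspected.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


def discover_partitions(keys):
    """Return the distinct partition value sets found among the keys, sorted."""
    found = {tuple(sorted(parse_hive_partitions(k).items())) for k in keys}
    # Keys with no partition segments contribute an empty tuple; drop it.
    found.discard(())
    return sorted(found)
```

Each distinct tuple returned by `discover_partitions` corresponds to one partition that a catalog would register for the table.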

A central metadata store for your lake
[Diagram: the same architecture with the AWS Glue Data Catalog (Hive-compatible Metastore) added as the central metadata store]

Real-time (in-stream processing)
[Diagram: the same architecture with in-stream processing added: Spark Streaming & Flink on EMR, and Amazon Kinesis Analytics]

Write once, catalog once, read multiple, ETL anywhere
[Diagram: the complete architecture: data sources, S3 Transfer Acceleration, Amazon S3, AWS Glue Data Catalog (Hive-compatible Metastore), batch and in-stream processing tools, and consumption options]

Characteristics of a data lake
Collect Anything, Dive in Anywhere, Flexible Access, Future Proof
Let’s take an example
Sensor/IoT device: record-level data
Business Questions
1. What is going on with a specific sensor?
2. Daily aggregations (device, inefficiencies, average temperature)
3. A real-time view of how many sensors are showing inefficiencies
Operations
1. Scale
2. High availability
3. Less management overhead
4. Pay for what I need

Let’s push this data into Kinesis
[Diagram: Kinesis Firehose -> Amazon S3 -> Amazon Athena]

Querying it in Amazon Athena
Either create a crawler to auto-generate the schema
OR
write a DDL on the Athena console/API/JDBC/ODBC driver
Start querying data
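For the "write a DDL" option, the sketch below builds a `CREATE EXTERNAL TABLE` statement for JSON data in S3. The column types, the bucket location, and the choice of the OpenX JSON SerDe are assumptions; the slides give only the database name and column names used in the later queries.

```python
def athena_create_table_ddl(database, table, columns, location):
    """Build a CREATE EXTERNAL TABLE statement for JSON data stored in S3."""
    column_lines = ",\n".join(f"  `{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"{column_lines}\n"
        ")\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical types and location, chosen only for illustration.
RAW_COLUMNS = [
    ("uuid", "string"),
    ("devicets", "timestamp"),
    ("deviceid", "int"),
    ("temp", "int"),
]

if __name__ == "__main__":
    print(athena_create_table_ddl("awsblogsgluedemo", "raw", RAW_COLUMNS, "s3://example-bucket/raw/"))
```

The resulting string could be pasted into the Athena console or submitted through the API or a JDBC/ODBC connection.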

Query daily aggregates in Amazon Athena
[Diagram: Kinesis Firehose -> Amazon S3 (“raw-time-series”) -> Amazon S3 (“daily-average”) -> Amazon Athena]

AWS Glue Job
Serverless, event-driven execution
Data is written out to S3
Output table is automatically created in Amazon Athena

Query daily aggregates in Amazon Athena
[Diagram: Kinesis Firehose -> Amazon S3 -> Amazon S3 (“daily-average”) -> Amazon Athena]

Kinesis Analytics for in-stream analytics
[Diagram: Kinesis Firehose -> Amazon S3, with a “daily-average” Amazon S3 bucket; Kinesis Analytics -> Kinesis Firehose -> Amazon S3 (“results”); Amazon Athena queries the data in Amazon S3]

“raw” table with raw data

Events by Device ID
SELECT uuid, devicets, deviceid, temp
FROM awsblogsgluedemo."raw"
WHERE deviceid = 1
ORDER BY devicets DESC;

Top 20 most active devices
SELECT deviceid, COUNT(*) AS num_events
FROM awsblogsgluedemo."raw"
GROUP BY deviceid
ORDER BY num_events DESC

“daily-agg” table with daily aggregation

KPI - Overall device daily inefficiency
SELECT (SUM(daily_avg_inefficiency) / COUNT(*)) AS all_device_avg_inefficiency, date
FROM awsblogsgluedemo.daily_avg_inefficiency
GROUP BY date;

“result” table

Top 10 most inefficient devices - event-level granularity
SELECT col0 AS "uuid", col1 AS "deviceid", col2 AS "devicets",
       col3 AS "temp", col4 AS "settemp", col5 AS "pct_inefficiency"
FROM awsblogsgluedemo.results
ORDER BY pct_inefficiency DESC
LIMIT 10;
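The shape of these queries can be tried locally without Athena. The sketch below uses an in-memory SQLite database as a stand-in, with made-up rows; the column types, the sample data, and the added `LIMIT` and `ORDER BY date` clauses are assumptions, and SQLite's SQL dialect differs from Athena's in other respects.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with tiny, invented sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw (uuid TEXT, devicets TEXT, deviceid INTEGER, temp INTEGER)")
    conn.execute(
        "CREATE TABLE daily_avg_inefficiency "
        "(deviceid INTEGER, daily_avg_inefficiency REAL, date TEXT)"
    )
    conn.executemany("INSERT INTO raw VALUES (?, ?, ?, ?)", [
        ("a1", "2017-10-10 13:20:15", 1, 70),
        ("a2", "2017-10-10 13:21:23", 1, 72),
        ("a3", "2017-10-10 13:21:52", 2, 68),
    ])
    conn.executemany("INSERT INTO daily_avg_inefficiency VALUES (?, ?, ?)", [
        (1, 10.0, "2017-10-10"),
        (2, 20.0, "2017-10-10"),
        (1, 30.0, "2017-10-11"),
    ])
    return conn


def most_active_devices(conn, limit=20):
    """Count events per device, most active first."""
    return conn.execute(
        "SELECT deviceid, COUNT(*) AS num_events FROM raw "
        "GROUP BY deviceid ORDER BY num_events DESC LIMIT ?",
        (limit,),
    ).fetchall()


def overall_daily_inefficiency(conn):
    """Average the per-device daily inefficiency across devices, per date."""
    return conn.execute(
        "SELECT SUM(daily_avg_inefficiency) / COUNT(*) AS all_device_avg_inefficiency, date "
        "FROM daily_avg_inefficiency GROUP BY date ORDER BY date"
    ).fetchall()
```

Note that `SUM(x) / COUNT(*)` gives the same result as `AVG(x)` when the column contains no NULL values.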

Overall architecture
[Diagram: Kinesis Firehose -> Amazon S3, with a “daily-average” Amazon S3 bucket; Kinesis Analytics -> Kinesis Firehose -> Amazon S3 (“results”); Amazon Athena queries the data in Amazon S3]

Characteristics
✓ Scale to hundreds of thousands of data sources

✓ Virtually infinite storage scalability
✓ Real-time and batch processing layers
✓ Interactive queries
✓ Highly available and durable
✓ Pay only for what you use
✓ No servers to manage

Very easy to try – existing template

Building the Atlassian Data Lake
ROHAN DHUPELIA | ANALYTICS PLATFORM MANAGER | @ROHANDHUPELIA

ATLASSIAN OVERVIEW
Software Teams Marketing Teams IT Teams HR Teams Finance Teams
Reactions Convos Mentions Files Meetings Decisions
Planning & tracking Messaging & communicate Organizing projects Content collaboration Code collaboration
Socrates
The Atlassian Data Lake
Image courtesy of © Bar Harel, CC BY-SA 4.0, Wikimedia Commons

The numbers
• 500+ TBs: stored in the data lake
• 1B+ events: ingested into the data lake daily
• 100 integrations: providing analytical events daily
• 1000 internal users: using the data lake
Data lake services
Ingest
Moving away from pull-based ingestion
Challenges with pull-based ingestion
• Brittle: As sources change, the pipelines break and need updating
• Complex: Various technologies to maintain
• Disruptive: Analytics extracts strain sourcing systems
Our Ingestion Journey
[Diagram, Late 2015: sources Web, CRM, Billing, Product; integration methods Kinesis, REST, JDBC, GraphQL; destination Socrates (Data Lake)]
[Diagram, Early 2016: sources Web, CRM, Billing, Product, Micro Services; integration methods Kinesis, REST, JDBC, GraphQL, Webhook, ODBC, SFTP; destination Socrates (Data Lake)]
[Diagram, Late 2016: sources Web, CRM, Billing, Product, Micro Services; destination Socrates (Data Lake)]
[Diagram, Early 2017: sources Web, CRM, Billing, Product, Micro Services, Other Enterprise Systems; StreamHub (Enterprise Bus) in the middle; destinations Socrates (Data Lake) and Other Micro Services]
What is StreamHub?
• Event-Driven Architecture: Producers and subscribers integrate via events
• Schema Registry: Validates that messages are compatible
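What "compatible" means for StreamHub is not spelled out in the slides. As one possible reading, the sketch below treats a new schema as backward compatible when every existing field keeps its name and type, and checks a message against a schema. The dictionary schema representation, the type names, and both function names are hypothetical.

```python
# Hypothetical mapping from schema type names to Python types.
TYPE_MAP = {"string": str, "int": int, "float": float, "bool": bool}


def is_backward_compatible(old_schema, new_schema):
    """Return True if every field in old_schema still exists with the same type."""
    return all(new_schema.get(field) == ftype for field, ftype in old_schema.items())


def validate_message(schema, message):
    """Return a list of problems found when checking a message against a schema."""
    problems = []
    for field, ftype in schema.items():
        if field not in message:
            problems.append(f"missing field: {field}")
        elif not isinstance(message[field], TYPE_MAP[ftype]):
            problems.append(f"wrong type for {field}: expected {ftype}")
    return problems
```

Under this rule a producer may add fields freely, but removing or retyping a field is rejected, which keeps existing subscribers working.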
How do we land it?
atlassian-socrates-raw-landed/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── events-13:20:15.479940.json.gz
        ├── events-13:21:23.479940.json.gz
        ├── events-13:21:52.479940.json.gz
        ├── events-13:23:37.479940.json.gz
        ├── events-13:23:56.479940.json.gz
        ├── events-13:24:15.479940.json.gz
        ├── events-13:24:21.479940.json.gz
        ├── events-13:25:34.479940.json.gz
        └── events-13:26:13.479940.json.gz
atlassian-socrates-raw-published-stg1/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── <sub-partition>/
        │   ├── events-part01.snappy.parquet
        │   └── events-part04.snappy.parquet
        └── <sub-partition>/
            ├── events-part05.snappy.parquet
            └── events-part08.snappy.parquet
atlassian-socrates-raw-published-stg2/
└── avi:jira:created:comment/
    └── day=2017-10-10/
        ├── business_key_1/
        │   └── events-part01.snappy.parquet
        └── business_key_2/
            └── events-part01.snappy.parquet
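The landed layout shown in the first bucket (event type, then a `day=` partition, then a timestamped gzip file) can be expressed as a small key-building function. The function name is an assumption; the key pattern is copied from the listing.

```python
from datetime import datetime


def landed_key(event_type, ts):
    """Build the object key for a batch of landed events.

    Follows the pattern in the listing:
    <event type>/day=<YYYY-MM-DD>/events-<HH:MM:SS.ffffff>.json.gz
    """
    return f"{event_type}/day={ts:%Y-%m-%d}/events-{ts:%H:%M:%S.%f}.json.gz"


if __name__ == "__main__":
    print(landed_key("avi:jira:created:comment", datetime(2017, 10, 10, 13, 20, 15, 479940)))
```

Putting the date in a `day=` path segment means the data is already partitioned Hive-style when it lands.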
Prepare
Cleansing and transforming our data
Challenges with preparation
• Data Engineering Bottleneck: Teams would rely on us to help them with their data transformation needs
• Cluster Management: Clusters could be hard to upgrade and attribute costs to jobs
• Re-Inventing the Wheel: Lots of time spent re-implementing patterns to perform transformations
[Diagram: RAW / UNALTERED data (Support/Ops, CRM/Billing, Product/Web) -> JOB SCOPED CLUSTERS (Account / Chargeback, Upscale, Quarantine) -> PREPARED / TRANSFORMED data (User Defined Extracts, Dimensional Model, Aggregated / Derived)]
Airflow
Airflow DAG: Spin up a dedicated EMR cluster -> Copy logs for debugging -> Shut down EMR cluster
Transformation as a Service (TaaS)
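The job-scoped cluster pattern in the DAG (spin up, run, copy logs, shut down) can be sketched in plain Python. This is not Airflow code and does not call EMR; the four callables are hypothetical stand-ins for whatever operators or API calls perform each step.

```python
def run_on_job_scoped_cluster(job, start_cluster, copy_logs, stop_cluster):
    """Run one job on a dedicated cluster, then always collect logs and shut down."""
    cluster_id = start_cluster()
    try:
        return job(cluster_id)
    finally:
        # Logs are copied before shutdown so a failed job can still be debugged.
        try:
            copy_logs(cluster_id)
        finally:
            # Shutdown runs even if the job or the log copy raised an error.
            stop_cluster(cluster_id)
```

The nested `finally` blocks are the point of the sketch: a cluster dedicated to one job should never outlive that job, whether it succeeds or fails.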
Organize
Storing, securing, and governing our data
Challenges with organizing data
• Teams want flexibility: How do we give teams flexibility on how they organize themselves?
• Security: How can we provision buckets for teams who don’t want to face the AWS console head-on?
• Categorizing Data: How can we structure our data lake in a way that will scale well?
Areas of the data lake
• Landed: Unaltered, Unformatted, Unmasked
• Raw: Optimized, Partitioned, Masked, Standardized
• Modeled: Conformed dimensions, facts, aggregated/derived value
• Self-Serve: BYO Data, User/Team managed
Self-Service Schemas
Request a Schema…
Provisions the components
• Create an S3 bucket, tagged to the user
• Create a schema in our metastore(s)
• Create an Active Directory group
What gets provisioned
We call them Zones
We used to call them “Playgrounds” but often they were used for production loads
e.g. zone_marketing
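The names that appear for the marketing zone in this deck (`zone_marketing`, `atlassian-zone-marketing`, `zone-marketing-write`, `zone-marketing-read`) suggest a naming convention. The sketch below derives such names from a team name; the convention is inferred from that single example, and the slug rule and function name are assumptions.

```python
import re


def zone_resources(team):
    """Derive the resource names for a self-service zone from a team name."""
    # Assumed slug rule: lowercase, with runs of other characters collapsed to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", team.lower()).strip("-")
    if not slug:
        raise ValueError("team name must contain letters or digits")
    return {
        "schema": "zone_" + slug.replace("-", "_"),
        "bucket": "atlassian-zone-" + slug,
        "vault_policies": [f"zone-{slug}-write", f"zone-{slug}-read"],
    }
```

Deriving every name from one input keeps the bucket, the metastore schema, and the access policies easy to correlate.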
Use Vault to control access rights

• A tool that manages secrets
• Creates a temporary IAM user (2 hours)
• Passes the credentials to the user
Self-Service Schemas: how users interact

Authenticate against Vault
$ vault auth -method=ldap username=<ad_username>
Password (will be hidden): <ad_password>
...
token_policies: [zone-marketing-write zone-marketing-read]

Retrieve your credentials
$ vault read aws/creds/zone-marketing-write
Key             Value
---             -----
lease_id        aws/creds/zone-marketing-write/e1x2a3m4p5l6e7
lease_duration  25h0m0s
lease_renewable true
access_key      AKIAISANEXAMPLEKEYID
secret_key      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
security_token  <nil>

Apply Credentials
$ aws configure
AWS Access Key ID [None]: AKIAISANEXAMPLEKEYID
AWS Secret Access Key [None]: 1r2a3n4d5o6m7s8t9r0i1n2g3O4f5C6o7u8r9s0e

List your bucket
$ aws s3 ls s3://atlassian-zone-marketing/
        PRE example_directory/
        PRE another_example_directory/
2016-12-08 13:21:35 0 example_text_file.txt
2016-09-27 12:24:48 0 example_csv_file.csv

Upload your file
$ aws s3 cp examplefile s3://atlassian-zone-bucketname
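Copying keys by hand into `aws configure` can be avoided by parsing the `vault read` table shown above. The sketch below assumes that two-column text format and maps the fields to the standard AWS environment variables; both function names are hypothetical.

```python
def parse_vault_output(text):
    """Parse the two-column 'Key Value' table printed by `vault read`."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0], parts[1].strip()
        # Skip the header row and the dashed separator row.
        if key in ("Key", "---"):
            continue
        fields[key] = value
    return fields


def aws_env(fields):
    """Map Vault's AWS credential fields to the standard AWS environment variables."""
    env = {
        "AWS_ACCESS_KEY_ID": fields["access_key"],
        "AWS_SECRET_ACCESS_KEY": fields["secret_key"],
    }
    token = fields.get("security_token")
    # Vault prints <nil> when no session token was issued.
    if token and token != "<nil>":
        env["AWS_SESSION_TOKEN"] = token
    return env
```

Because the credentials are leased for a limited time, keeping them in environment variables rather than in a saved profile avoids leaving expired keys behind.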
Discover
Finding, understanding, and exploring data
Challenges with data discovery
• Teams want options: Different visualization tools better suit different needs
• Managing query engines: Query engine usage is unpredictable; doing a bad job blocks analysts
• Finding data: Difficult to know which table to trust or to use for what purpose
• Visual Layer: Tableau, R Shiny, Zeppelin Notebooks, Redash
• Interactive Layer: Amazon Athena, Presto (EMR), Spark/Hive (EMR)
• Metastore Layer: Hive Metastore, AWS Glue Metastore
• Storage Layer: Raw Buckets, Zone Buckets (Self-Service), Model Buckets
Before: Presto
• Many failed queries
• Difficulties upgrading
• Hard to secure
After: Amazon Athena
• Ability to attribute costs
• Less infrastructure/operational overhead
• Not paying for what we don’t use
• Uses bucket security policies
Challenges with Amazon Athena
• Early Adopter Pains: There wasn’t parity with Presto to begin with
• No AD Authentication: Only access via JDBC to begin with, using keys
• Cost Management: Costs need to be monitored to spot any unusual spikes
Visualization Stack
• Tableau: Interactive exploration on core data sets and corporate dashboards
• R Shiny: Web apps and standalone dashboards
• Zeppelin: Web based notebooks
• Redash: Quick queries and visualizations on all data
Search the Data Catalog
Key Takeaways
AWS helps you move up the value chain
Using AWS helps you focus on areas where you can be adding value
It’s not just flicking on a switch
You can’t just turn on AWS components and have an instant data lake
Thank you!
ROHAN DHUPELIA | ANALYTICS PLATFORM MANAGER | @ROHANDHUPELIA
