AWS 05 DataLake
[Architecture diagram, shown in several builds]
Data sources: Transactions, Billing, Social, ERP, Connected devices, Web logs / cookies
Process / Consume:
• Amazon S3, S3 Transfer Acceleration
• Amazon EMR: Managed Hadoop & Spark
• Amazon Elasticsearch: Real-time log analytics & search
• Amazon AI: ML/DL Services
• Analytic Notebooks: Jupyter, Zeppelin, HUE
Transformation steps:
• Clean
• Transform
• Concatenate
• Convert to better formats
Transformation services:
• AWS Lambda: Trigger-based code execution
• AWS Glue: Event-based, server-less ETL engine
• Amazon EMR: Spark and Hive running on EMR
Transformation styles:
• Scheduled transformations
• Event-driven transformations
• Transformations expressed as code
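As a minimal sketch of the trigger-based option, assume an S3 object-created notification invokes an AWS Lambda function. The handler below only reads the bucket and key of each new object and plans an output key; the `clean/` prefix and the `.parquet` suffix are hypothetical naming choices, and the conversion step itself is left out.

```python
from urllib.parse import unquote_plus


def target_key(source_key: str, prefix: str = "clean/") -> str:
    """Map a raw object key to a hypothetical output key under `prefix`."""
    file_name = source_key.rsplit("/", 1)[-1]
    # Drop the extension only when the file name itself has one.
    stem = source_key.rsplit(".", 1)[0] if "." in file_name else source_key
    return f"{prefix}{stem}.parquet"


def handler(event, context=None):
    """Entry point in the handler(event, context) shape that Lambda calls."""
    planned = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        planned.append(
            {"bucket": bucket, "source": key, "target": target_key(key)}
        )
    return planned
```

A real function would read each source object, apply the clean/convert step, and write the result to the target key.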
[Architecture diagram, next build: adds API Gateway (Programmatic Access)]
Table metadata:
• Table schema
• Table properties
• Table partitions
• Nested fields
• Data statistics
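To make the partition idea concrete, here is a small sketch that lays out objects under Hive-style `key=value` prefixes, a convention that Hive-compatible metastores can map to table partitions. The `year`/`month`/`day` scheme and the `events` table name in the usage are illustrative assumptions, not taken from the slides.

```python
from datetime import date


def partition_prefix(table: str, day: date) -> str:
    """Build a Hive-style partition prefix for one day of data."""
    return (
        f"{table}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


def partition_values(prefix: str) -> dict:
    """Recover partition column values from a Hive-style prefix."""
    parts = [p for p in prefix.strip("/").split("/") if "=" in p]
    return dict(p.split("=", 1) for p in parts)


if __name__ == "__main__":
    prefix = partition_prefix("events", date(2017, 3, 5))
    print(prefix)
    print(partition_values(prefix))
```

Queries that filter on the partition columns can then skip prefixes that do not match, which reduces the data scanned.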
[Architecture diagram, next build: adds Spark Streaming & Flink on EMR, and Amazon Kinesis Analytics]
[Architecture diagram, next build adds:]
• Amazon Athena: Interactive Query
• AWS Glue Data Catalog: Hive-compatible Metastore
• Amazon QuickSight: Fast, easy to use, cloud BI
Operations
Sensor/IoT device: Record-level data
1. Scale
2. High availability
3. Less management overhead
4. Pay only for what I need
[Pipeline diagram, shown in several builds: Kinesis Firehose, Amazon S3 bucket “raw-time-series”, Amazon Athena; later builds add Kinesis Analytics and an Amazon S3 bucket “results”]
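As an illustration of the ingest side of this pipeline, the sketch below prepares sensor readings for Kinesis Firehose. It serializes each reading as one line of JSON and groups records into batches of at most 500, the per-call record limit of Firehose's `PutRecordBatch`. The reading fields are hypothetical, and the sketch does not enforce the additional per-call size limit.

```python
import json
from typing import Iterable, Iterator, List

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records


def to_record(reading: dict) -> dict:
    """Serialize one reading as newline-terminated JSON bytes."""
    line = json.dumps(reading, sort_keys=True) + "\n"
    return {"Data": line.encode("utf-8")}


def batches(
    readings: Iterable[dict], size: int = MAX_RECORDS_PER_BATCH
) -> Iterator[List[dict]]:
    """Yield lists of Firehose-shaped records, each at most `size` long."""
    batch: List[dict] = []
    for reading in readings:
        batch.append(to_record(reading))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch
```

Each batch could then be passed as `Records` to the boto3 Firehose client's `put_record_batch(DeliveryStreamName=..., Records=batch)`, with a delivery stream name of your own.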
No servers to manage
Planning & tracking, Messaging & communication, Organizing projects, Content collaboration, Code collaboration
Socrates
The Atlassian Data Lake
[Diagram, Late 2015: Web, Journey, CRM, Billing, Product; REST, JDBC, GraphQL; Socrates (Data Lake)]
Our Ingestion
[Diagram, Early 2016: Web, Journey, CRM, Billing, Product, Micro Services; Kinesis, REST, JDBC, GraphQL, Webhook, ODBC, SFTP; Socrates (Data Lake)]
[Diagram, Late 2016: Web, Journey, CRM, Billing, Product, Micro Services; Socrates (Data Lake)]
[Diagram, Early 2017: Web, Journey, CRM, Billing, Product, Micro Services, Other Micro Services, Other Enterprise Systems; StreamHub (Enterprise Bus); Socrates (Data Lake)]
What is StreamHub?
[Diagram labels: Account / Support/Ops, CRM/Billing, Product/Web; User Defined, Chargeback, Extracts, Upscale, Dimensional Model, Quarantine, Aggregated / Derived; Airflow, Airflow DAG]
How users interact
Upload your file
$ aws s3 cp examplefile s3://atlassian-zone-bucketname
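The same upload can be scripted. This is a minimal sketch, assuming boto3 and reusing the bucket name from the command above. The S3 client is passed in so the function can be exercised without AWS credentials, and defaulting the object key to the file name is an assumption about the intended layout.

```python
import os


def upload_to_zone(
    s3_client,
    path: str,
    bucket: str = "atlassian-zone-bucketname",
    key: str = "",
) -> str:
    """Upload a local file and return the resulting s3:// URI."""
    key = key or os.path.basename(path)
    # boto3 S3 clients expose upload_file(Filename, Bucket, Key).
    s3_client.upload_file(path, bucket, key)
    return f"s3://{bucket}/{key}"
```

With boto3 installed and credentials configured, `upload_to_zone(boto3.client("s3"), "examplefile")` would perform the upload.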
Discover
Finding, understanding, and exploring data
Challenges with data discovery
Storage Layer: Raw Buckets, Zone Buckets (Self-Service), Model Buckets
Before: Presto
• Many failed queries
• Difficulties upgrading
• Hard to secure
After: Amazon Athena
• Ability to attribute costs
• Less infrastructure/operational overhead
• Not paying for what we don’t use
• Uses bucket security policies
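One way to picture cost attribution: Athena bills by the amount of data a query scans, and each query execution reports its bytes scanned. The sketch below totals an estimated cost per team. The price of 5 USD per TB, the 10 MB per-query minimum, and the `team` tag on each query are assumptions for illustration, so check current pricing for your region.

```python
from collections import defaultdict

TB = 1024 ** 4              # bytes per TB as used in this sketch
MIN_BYTES = 10 * 1024 ** 2  # assumed per-query billing minimum (10 MB)


def query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate the cost of one query from the bytes it scanned."""
    billed = max(bytes_scanned, MIN_BYTES)
    return billed / TB * usd_per_tb


def cost_by_team(queries, usd_per_tb: float = 5.0) -> dict:
    """Sum estimated query costs per hypothetical `team` tag."""
    totals = defaultdict(float)
    for q in queries:
        totals[q["team"]] += query_cost(q["bytes_scanned"], usd_per_tb)
    return dict(totals)
```

The input records here are plain dictionaries; in practice they would be assembled from query execution statistics plus whatever tagging scheme identifies the owning team.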
Challenges with Amazon Athena