
Big Data Architectural Patterns and Best Practices on AWS

Siva Raghupathy
Principal Architect and Senior Manager, Big Data Solutions Architecture
Amazon Web Services

September 2016

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda

Big data challenges
How to simplify big data processing
What technologies should you use?
• Why?
• How?
Reference architecture
Design patterns
Ever Increasing Big Data

Variety

Velocity

Volume
Big Data Evolution

Batch processing → Stream processing → Machine learning
Cloud Services Evolution

Virtual machines → Managed services → Serverless
Plethora of Tools

• EMR
• S3
• DynamoDB
• SQS
• Amazon Redshift
• Amazon Glacier
• RDS
• ElastiCache
• Amazon Kinesis
• Kinesis Streams app
• Data Pipeline
• Amazon Elasticsearch Service
• Lambda
• Amazon ML
• DynamoDB Streams
• Amazon Kinesis Analytics
Big Data Challenges

Is there a reference architecture?

What tools should I use?

How?

Why?
Architectural Principles
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use Lambda architecture ideas
• Immutable (append-only) log, batch/real-time/serving layer
Big data ≠ big cost
Simplify Big Data Processing

COLLECT → STORE → PROCESS / ANALYZE → CONSUME

• Time to answer (latency)
• Throughput
• Cost
COLLECT
Types of Data

[Diagram: data sources and the data structures they produce.]
Sources:
• Applications: web apps, mobile apps, data centers, AWS Direct Connect
• Logging: AWS CloudTrail, Amazon CloudWatch
• Transport: AWS Import/Export Snowball
• Messaging
• Devices and IoT: sensors & IoT platforms, AWS IoT
Data structures:
• RECORDS: database records
• DOCUMENTS: search documents
• FILES: log files
• MESSAGES: messages
• STREAMS: data streams
What Is the Temperature of Your Data?
Data Characteristics: Hot, Warm, Cold

Characteristic | Hot | Warm | Cold
Volume | MB–GB | GB–TB | PB–EB
Item size | B–KB | KB–MB | KB–TB
Latency | ms | ms, sec | min, hrs
Durability | Low–high | High | Very high
Request rate | Very high | High | Low
Cost/GB | $$-$ | $-¢¢ | ¢
STORE
Types of Data Stores

• In-memory: caches, data structure servers
• Database: SQL & NoSQL databases
• Search: search engines
• File store: file systems
• Queue: message queues
• Stream storage: pub/sub message queues
Message & Stream Storage

Amazon SQS
• Managed message queue service
Apache Kafka
• High throughput distributed messaging system
Amazon Kinesis Streams
• Managed stream storage + processing
Amazon Kinesis Firehose
• Managed data delivery
Amazon DynamoDB
• Managed NoSQL database
• Tables can be stream-enabled
Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce

[Diagram: a DynamoDB stream, Amazon Kinesis stream, or Kafka topic split into shard 1 / partition 1 and shard 2 / partition 2. Each shard holds records 1 to 4 for two colors, in order. Consumer 1 reports count of red = 4 and count of violet = 4; Consumer 2 reports count of blue = 4 and count of green = 4.]
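
The ordering, parallel-consumption, and streaming MapReduce properties can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions: `PartitionedStream`, `put`, and `count_by_color` are hypothetical names, not the Kinesis, DynamoDB Streams, or Kafka API, and CRC32 hashing of the key stands in for whatever partitioner a real service uses.

```python
import zlib
from collections import Counter


class PartitionedStream:
    """Toy append-only stream split into shards (partitions)."""

    def __init__(self, num_shards):
        self.shards = [[] for _ in range(num_shards)]

    def put(self, key, value):
        # Records with the same key always land in the same shard,
        # so per-key (client) ordering is preserved within that shard.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        self.shards[index].append((key, value))
        return index


def count_by_color(shard):
    # "Map" step: each consumer counts the keys in its own shard.
    return Counter(key for key, _ in shard)


stream = PartitionedStream(num_shards=2)
for color in ("red", "violet", "blue", "green"):
    for sequence in (1, 2, 3, 4):
        stream.put(color, sequence)

# One consumer per shard; shards are independent, so consumers could run in parallel.
partial_counts = [count_by_color(shard) for shard in stream.shards]

# "Reduce" step: merge the per-shard counts into one result.
totals = sum(partial_counts, Counter())
```

Which color lands in which shard depends on the hash, so the split may differ from the diagram; the per-color totals do not.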
What About Messaging?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering
• No streaming MapReduce
• No parallel consumption for Amazon SQS consumers
• Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions)

[Diagram: producers send to an Amazon SQS queue that is read by consumers; a publisher sends to an Amazon SNS topic whose subscribers are Amazon SQS queues and AWS Lambda functions.]
What Stream / Message Storage Should I Use?

Feature | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS
AWS managed | Yes | Yes | Yes | No | Yes
Guaranteed ordering | Yes | Yes | No | Yes | No
Delivery (deduping) | Exactly-once | At-least-once | At-least-once | At-least-once | At-least-once
Data retention period | 24 hours | 7 days | N/A | Configurable | 14 days
Availability | 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ
Scale / throughput | No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic
Parallel clients | Yes | Yes | No | Yes | No
Stream MapReduce | Yes | Yes | N/A | Yes | N/A
Row/object size | 400 KB | 1 MB | Destination row/object size | Configurable | 256 KB
Cost | Higher (table cost) | Low | Low | Low (+admin) | Low-medium

Data temperature, left to right: Hot → Warm
File Storage
Amazon S3

[Diagram: the COLLECT → STORE flow with Amazon S3 as the file store, alongside Amazon SQS and Apache Kafka for messages and Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon DynamoDB Streams for streams.]
Why is Amazon S3 Good for Big Data?

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)


• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances
• Multiple & heterogeneous analysis clusters can use the same data
• Unlimited number of objects and volume of data
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• No need to pay for data replication
• Native support for versioning
• Tiered-storage (Standard, IA, Amazon Glacier) via life-cycle policies
• Secure – SSL, client/server-side encryption at rest
• Low cost
What about HDFS & Data Tiering?

• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard – IA for less frequently accessed data
• Use Amazon Glacier for archiving cold data
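
As a rough sketch of the S3 tiers above, the dictionary built below follows the rule shape accepted by the S3 `PutBucketLifecycleConfiguration` API. The `logs/` prefix, the rule ID, and the 30-day and 90-day thresholds are illustrative assumptions, not values from this deck.

```python
def build_lifecycle_configuration(prefix, ia_after_days, glacier_after_days):
    """Return a lifecycle configuration that tiers objects as they cool."""
    if ia_after_days >= glacier_after_days:
        raise ValueError("objects must move to Standard-IA before Glacier")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/')}",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Less frequently accessed data: S3 Standard-IA.
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    # Cold data: archive to Amazon Glacier.
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Assumed thresholds: 30 days to Standard-IA, 90 days to Glacier.
lifecycle = build_lifecycle_configuration("logs/", ia_after_days=30, glacier_after_days=90)
```

With boto3 installed and credentials configured, this could be passed as `LifecycleConfiguration=lifecycle` to the S3 client's `put_bucket_lifecycle_configuration`, together with a `Bucket` name.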
In-memory/Cache, Database, Search

[Diagram: the COLLECT → STORE flow, highlighting the in-memory, database, and search store types.]
Anti-Pattern

Data Tier
Best Practice - Use the Right Tool for the Job

Data Tier
• In-memory: Amazon ElastiCache, Redis, Memcached
• NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB
• SQL: Amazon Aurora, Amazon RDS, MySQL, PostgreSQL, Oracle, SQL Server
• Search: Amazon Elasticsearch Service
[Diagram: the COLLECT → STORE flow with every store type filled in.]
• Cache: Amazon ElastiCache
• NoSQL: Amazon DynamoDB
• SQL: Amazon RDS
• Search: Amazon Elasticsearch Service
• File: Amazon S3
• Message: Amazon SQS, Apache Kafka
• Stream: Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
What Data Store Should I Use?

Data structure → Fixed schema, JSON, key-value

Access patterns → Store data in the format you will access it

Data characteristics → Hot, warm, cold

Cost → Right cost


Data Structure and Access Patterns

Access Patterns | What to use?
Put/Get (key, value) | In-memory, NoSQL
Simple relationships → 1:N, M:N | NoSQL
Multi-table joins, transaction, SQL | SQL
Faceting, search | Search

Data Structure | What to use?
Fixed schema | SQL, NoSQL
Schema-free (JSON) | NoSQL, Search
(Key, value) | In-memory, NoSQL
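
The two lookup tables can be combined into a simple selection helper. This is a minimal sketch: the dictionary keys paraphrase the slide's rows, and `candidate_stores` is a hypothetical function name. It returns the store categories that satisfy both criteria.

```python
# Lookup tables transcribed from the slide; the store categories are the slide's own.
STORE_BY_ACCESS_PATTERN = {
    "put/get (key, value)": ["In-memory", "NoSQL"],
    "simple relationships (1:N, M:N)": ["NoSQL"],
    "multi-table joins, transactions, SQL": ["SQL"],
    "faceting, search": ["Search"],
}

STORE_BY_DATA_STRUCTURE = {
    "fixed schema": ["SQL", "NoSQL"],
    "schema-free (JSON)": ["NoSQL", "Search"],
    "(key, value)": ["In-memory", "NoSQL"],
}


def candidate_stores(access_pattern, data_structure):
    """Return store categories that suit both the access pattern and the data structure."""
    by_pattern = STORE_BY_ACCESS_PATTERN[access_pattern]
    by_structure = STORE_BY_DATA_STRUCTURE[data_structure]
    # Keep the order of the access-pattern row so the result is deterministic.
    return [store for store in by_pattern if store in by_structure]


key_value_choice = candidate_stores("put/get (key, value)", "(key, value)")
search_choice = candidate_stores("faceting, search", "schema-free (JSON)")
```

An empty result signals that no single category fits both criteria, which usually means splitting the workload across stores.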
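[Figure: data stores placed by degree of structure (low to high) against data temperature (hot, warm, cold). Stores shown: In-memory, NoSQL, SQL, Amazon Glacier. From hot to cold data: request rate goes from high to low, cost/GB from high to low, latency from low to high, and data volume from low to high.]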
What Data Store Should I Use?

Feature | Amazon ElastiCache | Amazon DynamoDB | Amazon RDS/Aurora | Amazon Elasticsearch | Amazon S3 | Amazon Glacier
Average latency | ms | ms | ms, sec | ms, sec | ms, sec, min (~ size) | hrs
Typical data stored | GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | MB–PB (no limit) | GB–PB (no limit)
Typical item size | B-KB | KB (400 KB max) | KB (64 KB max) | B-KB (2 GB max) | KB-TB (5 TB max) | GB (40 TB max)
Request rate | High – very high | Very high (no limit) | High | High | Low – high (no limit) | Very low
Storage cost GB/month | $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢7/10
Durability | Low - moderate | Very high | Very high | High | Very high | Very high
Availability | High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ | High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ

Data temperature, left to right: Hot data → Warm data → Cold data
Cost Conscious Design

Example: Should I use Amazon S3 or Amazon DynamoDB?

“I’m currently scoping out a project. The design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

Request rate (Writes/sec) | Object size (Bytes) | Total size (GB/month) | Objects per month
300 | 2,048 | 1,483 | 777,600,000
Cost Conscious Design

Example: Should I use Amazon S3 or Amazon DynamoDB?

Simple Monthly Calculator: https://calculator.s3.amazonaws.com/index.html

Scenario | Request rate (Writes/sec) | Object size (Bytes) | Total size (GB/month) | Objects per month | Use
Scenario 1 | 300 | 2,048 | 1,483 | 777,600,000 | Amazon DynamoDB
Scenario 2 | 300 | 32,768 | 23,730 | 777,600,000 | Amazon S3
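
The sizing figures in these scenarios can be reproduced with a few lines of arithmetic. The sketch below assumes a 30-day month and binary gigabytes (2^30 bytes), which are the assumptions that match the slide's numbers; `monthly_footprint` is a hypothetical helper name. It does not estimate prices, since the deck defers those to the calculator.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month
BYTES_PER_GB = 1024 ** 3               # assumes binary gigabytes


def monthly_footprint(writes_per_second, object_size_bytes):
    """Return (objects per month, GB written per month) for a steady write rate."""
    objects = writes_per_second * SECONDS_PER_MONTH
    gigabytes = objects * object_size_bytes / BYTES_PER_GB
    return objects, round(gigabytes)


scenario_1 = monthly_footprint(300, 2_048)    # (777600000, 1483)
scenario_2 = monthly_footprint(300, 32_768)   # (777600000, 23730)
```

The object count is identical in both scenarios; only the object size, and therefore the stored volume, changes by a factor of 16.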
[Diagram: the COLLECT → STORE flow extended with a PROCESS / ANALYZE stage; the stores are annotated by data temperature (hot, warm).]
PROCESS / ANALYZE
Analytics Types & Frameworks

Batch
• Takes minutes to hours
• Example: Daily/weekly/monthly reports
• Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
• Takes seconds
• Example: Self-service dashboards
• Amazon Redshift, Amazon EMR (Presto, Spark)
Message
• Takes milliseconds to seconds
• Example: Message processing
• Amazon SQS applications on Amazon EC2
Stream
• Takes milliseconds to seconds
• Example: Fraud alerts, 1 minute metrics
• Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda
Machine Learning
• Takes milliseconds to minutes
• Example: Fraud detection, forecast demand
• Amazon ML, Amazon EMR (Spark ML)
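
The "1 minute metrics" stream example amounts to a tumbling-window aggregation. The sketch below shows the idea in plain Python; the `(timestamp_seconds, name)` event shape and the `one_minute_counts` function are illustrative assumptions. Stream processors such as Spark Streaming or Amazon Kinesis Analytics provide their own windowing constructs, which this does not reproduce.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60


def one_minute_counts(events):
    """Group (timestamp_seconds, name) events into 1-minute tumbling windows."""
    windows = defaultdict(Counter)
    for timestamp, name in events:
        # Align each event to the start of the window that contains it.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        windows[window_start][name] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical events: (seconds since start, event name).
events = [(0, "login"), (15, "purchase"), (59, "login"), (60, "login"), (125, "purchase")]
metrics = one_minute_counts(events)
```

Each event belongs to exactly one window, which is what distinguishes a tumbling window from a sliding one.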
What Stream & Message Processing Technology Should I Use?

Feature | Amazon EMR (Spark Streaming) | Apache Storm | KCL Application | Amazon Kinesis Analytics | AWS Lambda | Amazon SQS Application
AWS managed | Yes (Amazon EMR) | No (do it yourself) | No (EC2 + Auto Scaling) | Yes | Yes | No (EC2 + Auto Scaling)
Serverless | No | No | No | Yes | Yes | No
Scale / throughput | No limits / ~ nodes | No limits / ~ nodes | No limits / ~ nodes | ~ KPU / automatic | No limits / automatic | No limits / ~ nodes
Availability | Single AZ | Configurable | Multi-AZ | Multi-AZ | Multi-AZ | Multi-AZ
Programming languages | Java, Python, Scala | Almost any language via Thrift | Java, others via MultiLangDaemon | ANSI SQL with extensions | Node.js, Java, Python | AWS SDK languages (Java, .NET, Python, …)
Uses | Multistage processing | Multistage processing | Single stage processing | Multistage processing | Simple event-based triggers | Simple event-based triggers
Reliability | KCL and Spark checkpoints | Framework managed | Managed by KCL | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda | Managed by SQS Visibility Timeout
Which Analysis Tool Should I Use?

Feature | Amazon Redshift | Amazon EMR (Presto, Spark, Hive)
Use case | Optimized for data warehousing | Presto: interactive query; Spark: general purpose (iterative ML, RT, ..); Hive: batch
Scale/throughput | ~ Nodes | ~ Nodes
Storage | Local storage | Amazon S3, HDFS
Optimization | Columnar storage, data compression, and zone maps | Framework dependent
Metadata | Amazon Redshift managed | Hive Meta-store
BI tools support | Yes (JDBC/ODBC) | Yes (JDBC/ODBC & Custom)
Access controls | Users, groups, and access controls | Integration with LDAP
UDF support | Yes (Scalar) | Yes
What About ETL?
STORE → ETL → PROCESS / ANALYZE

https://aws.amazon.com/big-data/partner-solutions/
CONSUME

[Diagram: the reference architecture so far, COLLECT → STORE → ETL → PROCESS / ANALYZE → CONSUME.]

• Applications & API: apps & services
• Analysis & visualization: Amazon QuickSight, for business users
• Notebooks and IDE: for data scientists and developers
Putting It All Together
Reference architecture: COLLECT → STORE → ETL → PROCESS / ANALYZE → CONSUME

COLLECT
• Applications: web apps, mobile apps, data centers, AWS Direct Connect
• Logging: AWS CloudTrail, Amazon CloudWatch
• Transport: AWS Import/Export Snowball
• Messaging
• Devices and IoT: sensors & IoT platforms, AWS IoT
• Data types: RECORDS, DOCUMENTS, FILES, MESSAGES, STREAMS
STORE
• Cache: Amazon ElastiCache
• NoSQL: Amazon DynamoDB
• SQL: Amazon RDS
• Search: Amazon Elasticsearch Service
• File: Amazon S3
• Message: Amazon SQS, Apache Kafka
• Stream: Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
PROCESS / ANALYZE
• ML: Amazon Machine Learning
• Interactive: Amazon Redshift, Presto on Amazon EMR
• Batch: Amazon EMR
• Message: Amazon SQS apps on Amazon EC2
• Stream: streaming on Amazon EMR, Amazon Kinesis Analytics, KCL apps, AWS Lambda
CONSUME
• Applications & API: apps & services
• Analysis & visualization: Amazon QuickSight
• Notebooks
• IDE

In the diagram, stores are annotated from hot to warm and processing from fast to slow.
Design Patterns
Primitive: Decoupled “Data Bus”

Storage decoupled from processing
Multiple stages

Store → Process → Store → Process
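
A minimal sketch of the decoupled "data bus" idea, assuming plain Python lists stand in for the stores (in practice these might be Amazon S3, Amazon Kinesis, or a database). `run_stage` is a hypothetical helper; each stage knows only the store it reads and the store it writes.

```python
def run_stage(source, process, sink):
    """Read every record from one store, process it, and append the result to another store."""
    for record in source:
        sink.append(process(record))


raw_store = ["3", "1", "2"]   # first store: raw collected data
parsed_store = []             # second store: output of stage one
squared_store = []            # third store: output of stage two

# Each stage depends only on the stores it reads and writes, not on the other stage,
# so either stage could be replaced or rerun without touching the other.
run_stage(raw_store, int, parsed_store)
run_stage(parsed_store, lambda n: n * n, squared_store)
```

Because the intermediate store persists stage one's output, a second consumer could read `parsed_store` without rerunning the parse step.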
Primitive: Pub/Sub

Parallel stream consumption/processing

[Diagram: one Amazon Kinesis stream consumed in parallel by the Amazon Kinesis Connector Library, AWS Lambda, and Apache Spark.]
Primitive:

Analysis frameworks read from or write to multiple data stores

[Diagram components: Amazon Kinesis; the Amazon Kinesis Connector Library, AWS Lambda, and Spark Streaming on Amazon EMR; Amazon S3 and Amazon DynamoDB; Spark SQL.]
[Figure: processing options plotted by data temperature (hot to cold) against processing speed (fast to slow), leading to answers. Data stores: Amazon Kinesis, Amazon DynamoDB, Amazon S3. Toward fast: Spark Streaming, Apache Storm, KCL apps, AWS Lambda, native apps. Toward slow: Amazon Redshift, Spark, Presto, Hive.]
Real-time Analytics

[Diagram: Amazon Kinesis is consumed by a KCL app, AWS Lambda, Spark Streaming on Amazon EMR, and Amazon Kinesis Analytics. Outputs shown:]
• Fan out: Amazon Kinesis
• Alerts: Amazon SNS notifications
• App state: Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS
• KPI: Amazon ES
• Real-time prediction: Amazon ML
• Log: Amazon S3
Interactive & Batch Analytics

[Diagram: Amazon Kinesis Firehose delivers data to Amazon S3, which feeds the following before results are consumed:]
• Interactive: Amazon Redshift; Amazon EMR (Presto, Spark)
• Batch: Amazon EMR (Spark, Hive, Pig)
• Amazon ML: real-time prediction and batch prediction
Lambda Architecture

[Diagram: data enters through Amazon Kinesis and reaches applications through three layers.]
• Batch & interactive layer: Amazon S3, Amazon EMR (Presto, Spark, Hive, Pig), Amazon Redshift
• Real-time layer: KCL, Amazon Kinesis Analytics, AWS Lambda, Spark Streaming on Amazon EMR, Storm
• Serving layer: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES
• Also shown: Amazon ML
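
The Lambda architecture ideas named in this deck (an immutable append-only log, plus batch, real-time, and serving layers) can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: `PageViewCounts` and its methods are hypothetical names, and a Python list stands in for the durable log.

```python
from collections import Counter


class PageViewCounts:
    """Toy Lambda architecture: append-only log, batch view, real-time view, serving merge."""

    def __init__(self):
        self.log = []                # immutable, append-only master log
        self.batch_view = Counter()  # precomputed by the batch layer
        self.batch_offset = 0        # number of log entries covered by the batch view

    def append(self, event):
        # Events are only ever appended, never updated or deleted.
        self.log.append(event)

    def run_batch(self):
        # Batch layer: recompute the view from the entire log.
        self.batch_view = Counter(self.log)
        self.batch_offset = len(self.log)

    def realtime_view(self):
        # Real-time layer: cover only events that arrived after the last batch run.
        return Counter(self.log[self.batch_offset:])

    def query(self, key):
        # Serving layer: merge the batch and real-time views.
        return self.batch_view[key] + self.realtime_view()[key]


counts = PageViewCounts()
for page in ("home", "cart", "home"):
    counts.append(page)
counts.run_batch()
counts.append("home")  # arrives after the batch run
```

Here `counts.query("home")` combines two views from the batch layer with one from the real-time layer, so recent events are visible before the next batch run.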
Summary
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use Lambda architecture ideas
• Immutable (append-only) log, batch/real-time/serving layer
Big data ≠ big cost
Thank you!
aws.amazon.com/big-data
