Big Data Architectural Patterns and Best Practices On AWS Presentation

Big Data Architectural Patterns
and Best Practices on AWS
Siva Raghupathy
Principal Architect and Senior Manager, Big Data Solutions Architecture
Amazon Web Services
September 2016
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
Big data challenges

How to simplify big data processing
What technologies should you use?
• Why?
• How?
Reference architecture
Design patterns
Ever Increasing Big Data
Variety
Velocity
Volume
Big Data Evolution
Batch Stream Machine

processing processing learning
Cloud Services Evolution
Virtual Managed Serverless

machines services
Plethora of Tools
EMR S3 DynamoDB SQS
Amazon Amazon
Redshift Glacier RDS ElastiCache
Amazon Amazon Kinesis

Kinesis Streams app Data Pipeline Amazon Elasticsearch
Service
DynamoDB
Lambda Amazon ML Streams Amazon Kinesis
Analytics
Big Data Challenges
Is there a reference architecture?
What tools should I use?
How?
Why?
Architectural Principles
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use Lambda architecture ideas
• Immutable (append-only) log, batch/real-time/serving layer
Big data ≠ big cost
Simplify Big Data Processing
COLLECT STORE PROCESS/ CONSUME

ANALYZE
Time to answer (Latency)

Throughput
Cost
COLLECT
COLLECT Types of Data
Web apps
Applications
Mobile apps
Data Structures
Data centers
AWS Direct
RECORDS Database Records
Connect
Logging
Logging
DOCUMENTS Search Documents
AWS Amazon
Transport
CloudTrail CloudWatch
AWS Import/Export FILES

Log Files
Snowball
Messaging
Messaging
Message MESSAGES Messages
Devices
Data Streams
IoT
Sensors &
IoT platforms
AWS IoT STREAMS
What Is the Temperature of Your Data ?
Data Characteristics: Hot, Warm, Cold
Hot Warm Cold

Volume MB–GB GB–TB PB–EB
Item size B–KB KB–MB KB–TB
Latency ms ms, sec min, hrs
Durability Low–high High Very high
Request rate Very high High Low
Cost/GB $$-$ $-¢¢ ¢
Hot data Warm data Cold data
Store
COLLECT STORE Types of Data Stores
Web apps
In-memory Caches, Data Structure Servers
Applications
Mobile apps
Data centers
AWS Direct
RECORDS
Database SQL & NoSQL Databases
Connect
Logging
Logging
DOCUMENTS
Search Search Engines
AWS Amazon
Transport
AWS Import/Export FILES File store File Systems

Snowball
Messaging
Messaging
Message MESSAGES Queue Message Queues
Devices
Stream Pub/Sub Message Queues

IoT
Sensors &
IoT platforms
AWS IoT STREAMS
storage
COLLECT STORE Message & Stream Storage
Web apps
In-memory
Applications
Mobile apps Amazon SQS

Data centers RECORDS
Database • Managed message queue service
AWS Direct
Connect
Apache Kafka
Logging
Logging
Search
DOCUMENTS
• High throughput distributed messaging
AWS Amazon
CloudTrail
system
Messaging Transport
CloudWatch
AWS Import/Export FILES File store

Snowball
Amazon Kinesis Streams
Message
Amazon SQS
Messaging
Message MESSAGES • Managed stream storage + processing
Apache Kafka Amazon Kinesis Firehose
Devices Amazon Kinesis • Managed data delivery
Streams
Stream
Amazon DynamoDB
IoT
Sensors & Amazon Kinesis

IoT platforms Firehose
AWS IoT STREAMS
Amazon DynamoDB
• Managed NoSQL database
Streams
• Tables can be stream-enabled
Why Stream Storage?
Decouple producers & consumers Preserve client ordering
Persistent buffer Parallel consumption
Collect multiple streams Streaming MapReduce
DynamoDB stream Amazon Kinesis stream Kafka topic Consumer 1

4 3 2 1 Count of
red = 4
44332211 Count of
4 3 2 1 violet = 4
shard 1 / partition 1
Consumer 2
4 3 2 1 Count of
44332211 blue = 4
4 3 2 1 shard 2 / partition 2 Count of

green = 4
What About Messaging?
• Decouple producers &
Producers Consumers
consumers
queue
• Persistent buffer
Amazon SQS
• Collect multiple streams Subscriber
• No client ordering ʎ
• No streaming MapReduce function
AWS Lambda
Publisher
• No parallel consumption for topic
function
Amazon SQS consumers Amazon SNS
• Amazon SNS can queue

Amazon SQS
publish to multiple SNS queue
subscribers (queues or
ʎ functions)
What Stream / Message Storage Should I Use?
Amazon Amazon Amazon Apache Amazon
DynamoDB Kinesis Kinesis Kafka SQS
Streams Streams Firehose
AWS managed Yes Yes Yes No Yes
Guaranteed Yes Yes No Yes No

ordering
Delivery (deduping) Exactly-once At-least-once At-least-once At-least-once At-least-once
Data retention 24 hours 7 days N/A Configurable 14 days

period
Availability 3 AZ 3 AZ 3 AZ Configurable 3 AZ
Scale / No limit / No limit / No limit / No limit / No limits /
throughput ~ table IOPS ~ shards automatic ~ nodes automatic
Parallel clients Yes Yes No Yes No
Stream MapReduce Yes Yes N/A Yes N/A
Row/object size 400 KB 1 MB Destination row/object size Configurable 256 KB
Cost Higher (table cost) Low Low Low (+admin) Low-medium
Hot Warm
COLLECT STORE File Storage
Web apps
In-memory
Applications
Mobile apps

Database
AWS Direct
Connect
Logging
Logging
DOCUMENTS
Search Amazon S3
AWS Amazon
Messaging Transport
File
Amazon S3
Snowball
Message
Amazon SQS
Messaging
Message MESSAGES
Apache Kafka
Devices Amazon Kinesis

Hot
Streams
Stream
IoT

AWS IoT STREAMS
Amazon DynamoDB
Streams
Why is Amazon S3 Good for Big Data?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)

• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances
• Multiple & heterogeneous analysis clusters can use the same data
• Unlimited number of objects and volume of data
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• No need to pay for data replication
• Native support for versioning
• Tiered-storage (Standard, IA, Amazon Glacier) via life-cycle policies
• Secure – SSL, client/server-side encryption at rest
• Low cost
What about HDFS & Data Tiering?
• Use HDFS for very frequently accessed

(hot) data
• Use Amazon S3 Standard for frequently
accessed data
• Use Amazon S3 Standard – IA for less
frequently accessed data
• Use Amazon Glacier for archiving cold data
COLLECT STORE
Web apps
In-memory
Applications
Mobile apps
Data centers
AWS Direct
RECORDS Database
Connect
Logging
Logging
DOCUMENTS
Search
AWS Amazon
Messaging Transport
File
Amazon S3
Snowball
In-memory/Cache,
Message
Messaging
Message MESSAGES
Amazon SQS
Database, Search
Apache Kafka

Hot
Streams
Stream
IoT

AWS IoT STREAMS
Amazon DynamoDB
Streams
Anti-Pattern
Data Tier
Best Practice - Use the Right Tool for the Job
Data Tier
In-memory NoSQL SQL Search
Amazon ElastiCache Amazon DynamoDB Amazon Aurora Amazon Elasticsearch

Redis Cassandra Amazon RDS Service
Memcached HBase MySQL
MongoDB PostgreSQL
Oracle
SQL Server
COLLECT STORE
NoSQL Cache
Web apps Amazon ElastiCache
Applications
Mobile apps
Amazon DynamoDB
SQL
AWS Direct Amazon RDS
Connect
Logging
Logging
Search
Amazon Elasticsearch
DOCUMENTS
Service
AWS Amazon
CloudTrail
Messaging Transport
CloudWatch
File
Amazon S3
Snowball
In-memory/Cache,
Message
Messaging
Message MESSAGES
Amazon SQS
Database, Search
Apache Kafka

Hot
Streams
Stream
IoT

AWS IoT STREAMS
Amazon DynamoDB
Streams
What Data Store Should I Use?
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format you will access it
Data characteristics → Hot, warm, cold
Cost → Right cost

Data Structure and Access Patterns
Access Patterns What to use?
Put/Get (key, value) In-memory, NoSQL
Simple relationships → 1:N, M:N NoSQL
Multi-table joins, transaction, SQL SQL
Faceting, search Search
Data Structure What to use?

Fixed schema SQL, NoSQL
Schema-free (JSON) NoSQL, Search
(Key, value) In-memory, NoSQL
Low Amazon
Glacier
Structure
NoSQL
In-memory
SQL
High
High Low
Request rate
High Low
Cost/GB
Low High
Latency
Low High
Data volume
What Data Store Should I Use?
Amazon ElastiCache Amazon Amazon Amazon Amazon S3 Amazon Glacier
DynamoDB RDS/Aurora Elasticsearch
Average ms ms ms, sec ms,sec ms,sec,min hrs

latency (~ size)
Typical GB GB–TBs GB–TB GB–TB MB–PB GB–PB
data stored (no limit) (64 TB max) (no limit) (no limit)
Typical B-KB KB KB B-KB KB-TB GB
item size (400 KB max) (64 KB max) (2 GB max) (5 TB max) (40 TB max)
Request High – very high Very high High High Low – high Very low
Rate (no limit) (no limit)
Storage cost $$ ¢¢ ¢¢ ¢¢ ¢ ¢7/10
GB/month
Durability Low - moderate Very high Very high High Very high Very high
Availability High Very high Very high High Very high Very high
2 AZ 3 AZ 3 AZ 2 AZ 3 AZ 3 AZ
Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project. The design calls for

many small files, perhaps up to a billion during peak. The
total size would be on the order of 1.5 TB per month…”
Request rate Object size Total size Objects per month

(Writes/sec) (Bytes) (GB/month)
300 2048 1483 777,600,000

Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
Simple Monthly
Calculator
https://calculator.s3.amazonaws.com/index.html
Request rate Object size Total size Objects per
Amazon S3 or (Writes/sec) (Bytes) (GB/month) month
DynamoDB?
300 2,048 1,483 777,600,000
Amazon DynamoDB
use
Request rate Object size Total size Objects per
(Writes/sec) (Bytes) (GB/month) month
Scenario 1300 2,048 1,483 777,600,000
Scenario 2300 32,768 23,730 777,600,000
use
Amazon S3
COLLECT STORE PROCESS / ANALYZE
NoSQL Cache
Hot
Web apps Amazon ElastiCache
Applications
Mobile apps
Amazon DynamoDB
SQL
Connect
Warm
Logging
Logging Amazon Elasticsearch
Search
DOCUMENTS
Service
AWS Amazon
CloudTrail
Transport
CloudWatch
File
Amazon S3
Snowball
Message
Messaging
Amazon SQS
Messaging
Message MESSAGES
Apache Kafka

Hot
Streams
Stream
IoT

AWS IoT STREAMS
Amazon DynamoDB
Streams
PROCESS /
ANALYZE
Analytics Types & Frameworks PROCESS / ANALYZE
Batch
Amazon Machine
ML
Takes minutes to hours Learning
Example: Daily/weekly/monthly reports
Fast
Amazon EMR (MapReduce, Hive, Pig, Spark) Amazon Redshift
Interactive
Interactive
Presto
Takes seconds
Example: Self-service dashboards Amazon
Amazon Redshift, Amazon EMR (Presto, Spark) EMR
Batch
Message
Slow
Takes milliseconds to seconds
Message
Example: Message processing
Amazon SQS apps
Amazon SQS applications on Amazon EC2 Amazon EC2
Stream
Takes milliseconds to seconds Amazon EC2
Example: Fraud alerts, 1 minute metrics

Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Streaming
Amazon EMR
Storm, AWS Lambda Amazon Kinesis
Fast
Stream
Machine Learning Analytics
Takes milliseconds to minutes KCL

Example: Fraud detection, forecast demand apps
Amazon ML, Amazon EMR (Spark ML)

AWS Lambda
What Stream & Message Processing Technology Should I Use?
Amazon Apache KCL Application Amazon Kinesis AWS Amazon SQS
EMR (Spark Storm Analytics Lambda Application
Streaming)
AWS Yes (Amazon No (Do it No ( EC2 + Auto Yes Yes No (EC2 + Auto
managed EMR) yourself) Scaling) Scaling)
Serverless No No No Yes Yes No
Scale / No limits / No limits / No limits / ~ KPU / No limits / No limits /
throughput ~ nodes ~ nodes ~ nodes automatic automatic ~ nodes
Availability Single AZ Configurable Multi-AZ Multi-AZ Multi-AZ Multi-AZ
Programming Java, Almost any Java, others via ANSI SQL with Node.js, AWS SDK
languages Python, language via MultiLangDaemon extensions Java, languages (Java,
Scala Thrift Python .NET, Python, …)
Uses Multistage Multistage Single stage Multistage Simple Simple event

processing processing processing processing event-based based triggers
triggers
Reliability KCL and Framework Managed by KCL Managed by Managed by Managed by SQS
Spark managed Amazon Kinesis AWS Visibility Timeout
checkpoints Analytics Lambda
Which Analysis Tool Should I Use?
Amazon Redshift Amazon EMR
Presto Spark Hive
Use case Optimized for Data Warehousing Interactive General purpose (iterative Batch
query ML, RT, ..)
Scale/throughput ~Nodes ~ Nodes
Storage Local Storage Amazon S3, HDFS
Optimization Columnar storage, Data Framework dependent

compression, and Zone maps
Metadata Amazon Redshift Managed Hive Meta-store
BI tools supports Yes (JDBC/ODBC) Yes (JDBC/ODBC & Custom)
Access controls Users, Groups and Access Controls Integration with LDAP
UDF support Yes (Scalar) Yes
Slow
What About ETL?
STORE ETL PROCESS / ANALYZE
https://aws.amazon.com/big-data/partner-solutions/
CONSUME
COLLECT STORE ETL PROCESS / ANALYZE CONSUME
NoSQL Cache
Hot
Web apps Amazon Machine
ML
Amazon ElastiCache
Applications
Learning
Mobile apps
Amazon DynamoDB Amazon Redshift
Interactive
Fast
SQL
Connect Presto
Warm
Logging
Search
Service Amazon
DOCUMENTS
EMR
Batch
AWS Amazon
Slow
CloudTrail
Messaging Transport
CloudWatch
File
Amazon S3
Message
Snowball
Message
Amazon SQS apps

Amazon EC2
Amazon SQS
Messaging
Message MESSAGES
Amazon EC2
Apache Kafka
Streaming
Amazon Kinesis Amazon EMR
Devices
Fast
Hot
Streams Amazon Kinesis
Stream
Stream
Analytics
IoT

IoT platforms Firehose KCL
AWS IoT STREAMS
apps
Amazon DynamoDB
Streams
AWS Lambda
API
Apps & Services
Applications & API Amazon QuickSight
Analysis & visualization

Analysis and visualization
Business
users
Notebooks Data scientist,

developers
Notebooks
IDE
IDE
Putting It All Together
NoSQL Cache
Hot
Applications
API
ML
Amazon ElastiCache Apps & Services
Learning
Mobile apps
Amazon DynamoDB Amazon Redshift Amazon QuickSight
Interactive
Fast

SQL
Connect Presto
Logging
Warm
Search
Service Amazon
DOCUMENTS
EMR
Batch
AWS Amazon
Slow
Transport
File
Amazon S3
Message
Snowball
Message
Amazon SQS apps

Messaging
Amazon EC2
Amazon SQS
Messaging
Message MESSAGES
Amazon EC2
Apache Kafka
Streaming
Notebooks
Devices
Fast
Hot
Stream
Stream
Analytics
IoT

AWS IoT STREAMS
apps
IDE
Amazon DynamoDB
Reference architecture Streams
AWS Lambda
Design Patterns
Primitive: Decoupled “Data Bus”
Storage decoupled from processing

Multiple stages
Store Process Store Process
process
store
Primitive: Pub/Sub
Parallel stream consumption/processing
Amazon Kinesis
Connector
Library
Amazon AWS
Kinesis Lambda
Apache
Spark
process
store
Primitive:
Analysis framework read from or write to multiple data stores
Amazon Kinesis
Connector Amazon S3
Library
Amazon AWS Amazon Spark

Kinesis Lambda DynamoDB SQL
Spark
Streaming
Amazon EMR
process
store
Data temperature
Hot Cold
Amazon
data Amazon Kinesis
DynamoDB
Amazon S3
Fast
Spark Streaming Native apps

KCL apps
Apache Storm AWS Lambda
Processing speed
AWS Lambda
KCL apps
Amazon
Redshift
Amazon
Redshift
Spark
Presto
Hive
Hive
Slow Answers
Real-time Analytics Fan out
Amazon
Kinesis
Amazon
Alerts SNS Notifications
KCL App
Amazon
ElastiCache
AWS Lambda App state (Redis)
Amazon
Kinesis Amazon
Amazon EMR DynamoDB
Amazon
Spark RDS
Streaming KPI
Amazon Kinesis Amazon
Analytics
ES
Real-time prediction
Amazon
process ML
Amazon
Log S3
store
Interactive
Amazon Redshift
& Interactive
Amazon EMR
Batch Presto
Analytics Spark
Real-time prediction
Amazon
Amazon
Kinesis Amazon S3 ML Consume
Firehose
Batch prediction
Amazon EMR
Spark
Hive
process Pig Batch
store
Lambda Batch & Interactive Layer
Amazon
Redshift
Architecture Amazon S3 Amazon EMR
Presto
Applications
Spark
Hive
Pig
Data Amazon
Amazon
Kinesis ML
KCL Amazon
Amazon Kinesis ElastiCache
Analytics Amazon
DynamoDB
AWS Lambda
Amazon
Spark Streaming RDS
on Amazon EMR
Amazon
process Real-time Layer
ES
Serving Layer
Storm
store
Summary
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use Lambda architecture ideas
• Immutable (append-only) log, batch/real-time/serving layer
Big data ≠ big cost
NoSQL Cache
Hot
Applications
API
ML
Amazon ElastiCache Apps & Services
Learning
Mobile apps
Amazon DynamoDB Amazon Redshift Amazon QuickSight
Interactive
Fast

SQL
Connect Presto
Logging
Warm
Search
Service Amazon
DOCUMENTS
EMR
Batch
AWS Amazon
Slow
Transport
File
Amazon S3
Message
Snowball
Message
Amazon SQS apps

Messaging
Amazon EC2
Amazon SQS
Messaging
Message MESSAGES
Amazon EC2
Apache Kafka
Streaming
Notebooks
Devices
Fast
Hot
Stream
Stream
Analytics
IoT

AWS IoT STREAMS
apps
IDE
Amazon DynamoDB
Reference architecture Streams

AWS Lambda
Thank you!
aws.amazon.com/big-data

Big Data Architectural Patterns and Best Practices On AWS Presentation

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Big Data Architectural Patterns and Best Practices On AWS Presentation

Uploaded by

Copyright:

Available Formats

Big Data Architectural Patterns

and Best Practices on AWS

Big data challenges

Batch Stream Machine

Virtual Managed Serverless

EMR S3 DynamoDB SQS

Amazon Amazon Kinesis

Is there a reference architecture?

What tools should I use?

COLLECT STORE PROCESS/ CONSUME

Time to answer (Latency)

AWS Import/Export FILES

Hot Warm Cold

AWS Import/Export FILES File store File Systems

Stream Pub/Sub Message Queues

Mobile apps Amazon SQS

AWS Import/Export FILES File store

Sensors & Amazon Kinesis

DynamoDB stream Amazon Kinesis stream Kafka topic Consumer 1

4 3 2 1 shard 2 / partition 2 Count of

Amazon SQS consumers Amazon SNS

• Amazon SNS can queue

Guaranteed Yes Yes No Yes No

Data retention 24 hours 7 days N/A Configurable 14 days

Cost Higher (table cost) Low Low Low (+admin) Low-medium

Data centers RECORDS

Devices Amazon Kinesis

Sensors & Amazon Kinesis

• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)

• Use HDFS for very frequently accessed

Devices Amazon Kinesis

Sensors & Amazon Kinesis

Amazon ElastiCache Amazon DynamoDB Amazon Aurora Amazon Elasticsearch

Data centers RECORDS

Devices Amazon Kinesis

Sensors & Amazon Kinesis

Data structure → Fixed schema, JSON, key-value

Access patterns → Store data in the format you will access it

Data characteristics → Hot, warm, cold

Cost → Right cost

Data Structure What to use?

Average ms ms ms, sec ms,sec ms,sec,min hrs

Example: Should I use Amazon S3 or Amazon DynamoDB?

“I’m currently scoping out a project. The design calls for

Request rate Object size Total size Objects per month

300 2048 1483 777,600,000

Example: Should I use Amazon S3 or Amazon DynamoDB?

Scenario 2300 32,768 23,730 777,600,000

Data centers RECORDS

Logging Amazon Elasticsearch

Devices Amazon Kinesis

Sensors & Amazon Kinesis

Example: Fraud alerts, 1 minute metrics

Takes milliseconds to minutes KCL

Amazon ML, Amazon EMR (Spark ML)

Uses Multistage Multistage Single stage Multistage Simple Simple event

Storage Local Storage Amazon S3, HDFS

Optimization Columnar storage, Data Framework dependent

BI tools supports Yes (JDBC/ODBC) Yes (JDBC/ODBC & Custom)

UDF support Yes (Scalar) Yes

Logging Amazon Elasticsearch

Amazon SQS apps