Professional Documents
Culture Documents
Big Data Architectural Patterns and Best Practices On AWS Presentation
Big Data Architectural Patterns and Best Practices On AWS Presentation
Siva Raghupathy
Principal Architect and Senior Manager, Big Data Solutions Architecture
Amazon Web Services
September 2016
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
Variety
Velocity
Volume
Big Data Evolution
Amazon Amazon
Redshift Glacier RDS ElastiCache
DynamoDB
Lambda Amazon ML Streams Amazon Kinesis
Analytics
Big Data Challenges
How?
Why?
Architectural Principles
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use Lambda architecture ideas
• Immutable (append-only) log, batch/real-time/serving layer
Big data ≠ big cost
Simplify Big Data Processing
Mobile apps
Data Structures
Data centers
AWS Direct
RECORDS Database Records
Connect
Logging
Logging
DOCUMENTS Search Documents
AWS Amazon
Transport
CloudTrail CloudWatch
Messaging
Message MESSAGES Messages
Devices
Data Streams
IoT
Sensors &
IoT platforms
AWS IoT STREAMS
What Is the Temperature of Your Data ?
Data Characteristics: Hot, Warm, Cold
Mobile apps
Data centers
AWS Direct
RECORDS
Database SQL & NoSQL Databases
Connect
Logging
Logging
DOCUMENTS
Search Search Engines
AWS Amazon
Transport
CloudTrail CloudWatch
Messaging
Message MESSAGES Queue Message Queues
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
storage
COLLECT STORE Message & Stream Storage
Web apps
In-memory
Applications
Apache Kafka
Logging
Logging
Search
DOCUMENTS
• High throughput distributed messaging
AWS Amazon
CloudTrail
system
Messaging Transport
CloudWatch
Amazon SQS
Messaging
Message MESSAGES • Managed stream storage + processing
Apache Kafka Amazon Kinesis Firehose
Devices Amazon Kinesis • Managed data delivery
Streams
Stream
Amazon DynamoDB
IoT
Amazon DynamoDB
• Managed NoSQL database
Streams
• Tables can be stream-enabled
Why Stream Storage?
Decouple producers & consumers Preserve client ordering
Persistent buffer Parallel consumption
Collect multiple streams Streaming MapReduce
44332211 Count of
4 3 2 1 violet = 4
shard 1 / partition 1
Consumer 2
4 3 2 1 Count of
44332211 blue = 4
• No client ordering ʎ
• No streaming MapReduce function
AWS Lambda
Publisher
• No parallel consumption for topic
function
subscribers (queues or
ʎ functions)
What Stream / Message Storage Should I Use?
Amazon Amazon Amazon Apache Amazon
DynamoDB Kinesis Kinesis Kafka SQS
Streams Streams Firehose
AWS managed Yes Yes Yes No Yes
Hot Warm
COLLECT STORE File Storage
Web apps
In-memory
Applications
Mobile apps
Logging
DOCUMENTS
Search Amazon S3
AWS Amazon
Messaging Transport
CloudTrail CloudWatch
File
Amazon S3
AWS Import/Export FILES
Snowball
Message
Amazon SQS
Messaging
Message MESSAGES
Apache Kafka
Amazon DynamoDB
Streams
Why is Amazon S3 Good for Big Data?
Web apps
In-memory
Applications
Mobile apps
Data centers
AWS Direct
RECORDS Database
Connect
Logging
Logging
DOCUMENTS
Search
AWS Amazon
Messaging Transport
CloudTrail CloudWatch
File
Amazon S3
AWS Import/Export FILES
Snowball
In-memory/Cache,
Message
Messaging
Message MESSAGES
Amazon SQS
Database, Search
Apache Kafka
Amazon DynamoDB
Streams
Anti-Pattern
Data Tier
Best Practice - Use the Right Tool for the Job
Data Tier
In-memory NoSQL SQL Search
NoSQL Cache
Web apps Amazon ElastiCache
Applications
Mobile apps
Amazon DynamoDB
SQL
AWS Direct Amazon RDS
Connect
Logging
Logging
Search
Amazon Elasticsearch
DOCUMENTS
Service
AWS Amazon
CloudTrail
Messaging Transport
CloudWatch
File
Amazon S3
AWS Import/Export FILES
Snowball
In-memory/Cache,
Message
Messaging
Message MESSAGES
Amazon SQS
Database, Search
Apache Kafka
Amazon DynamoDB
Streams
What Data Store Should I Use?
NoSQL
In-memory
SQL
High
Hot data Warm data Cold data
High Low
Request rate
High Low
Cost/GB
Low High
Latency
Low High
Data volume
What Data Store Should I Use?
Amazon ElastiCache Amazon Amazon Amazon Amazon S3 Amazon Glacier
DynamoDB RDS/Aurora Elasticsearch
Availability High Very high Very high High Very high Very high
2 AZ 3 AZ 3 AZ 2 AZ 3 AZ 3 AZ
Hot data Warm data Cold data
Cost Conscious Design
Simple Monthly
Calculator
https://calculator.s3.amazonaws.com/index.html
Request rate Object size Total size Objects per
Amazon S3 or (Writes/sec) (Bytes) (GB/month) month
DynamoDB?
300 2,048 1,483 777,600,000
Amazon DynamoDB
use
Request rate Object size Total size Objects per
(Writes/sec) (Bytes) (GB/month) month
Scenario 1300 2,048 1,483 777,600,000
use
Amazon S3
COLLECT STORE PROCESS / ANALYZE
NoSQL Cache
Hot
Web apps Amazon ElastiCache
Applications
Mobile apps
Amazon DynamoDB
SQL
AWS Direct Amazon RDS
Connect
Warm
Logging
Search
DOCUMENTS
Service
AWS Amazon
CloudTrail
Transport
CloudWatch
File
Amazon S3
AWS Import/Export FILES
Snowball
Message
Messaging
Amazon SQS
Messaging
Message MESSAGES
Apache Kafka
Amazon DynamoDB
Streams
PROCESS /
ANALYZE
Analytics Types & Frameworks PROCESS / ANALYZE
Batch
Amazon Machine
ML
Takes minutes to hours Learning
Example: Daily/weekly/monthly reports
Fast
Amazon EMR (MapReduce, Hive, Pig, Spark) Amazon Redshift
Interactive
Interactive
Presto
Takes seconds
Example: Self-service dashboards Amazon
Amazon Redshift, Amazon EMR (Presto, Spark) EMR
Batch
Message
Slow
Takes milliseconds to seconds
Message
Example: Message processing
Amazon SQS apps
Amazon SQS applications on Amazon EC2 Amazon EC2
Stream
Takes milliseconds to seconds Amazon EC2
Fast
Stream
Machine Learning Analytics
Programming Java, Almost any Java, others via ANSI SQL with Node.js, AWS SDK
languages Python, language via MultiLangDaemon extensions Java, languages (Java,
Scala Thrift Python .NET, Python, …)
Reliability KCL and Framework Managed by KCL Managed by Managed by Managed by SQS
Spark managed Amazon Kinesis AWS Visibility Timeout
checkpoints Analytics Lambda
Which Analysis Tool Should I Use?
Amazon Redshift Amazon EMR
Presto Spark Hive
Use case Optimized for Data Warehousing Interactive General purpose (iterative Batch
query ML, RT, ..)
Scale/throughput ~Nodes ~ Nodes
Access controls Users, Groups and Access Controls Integration with LDAP
Slow
What About ETL?
STORE ETL PROCESS / ANALYZE
https://aws.amazon.com/big-data/partner-solutions/
CONSUME
COLLECT STORE ETL PROCESS / ANALYZE CONSUME
NoSQL Cache
Hot
Web apps Amazon Machine
ML
Amazon ElastiCache
Applications
Learning
Mobile apps
Amazon DynamoDB Amazon Redshift
Interactive
Fast
Data centers RECORDS
SQL
AWS Direct Amazon RDS
Connect Presto
Warm
Logging
Search
Service Amazon
DOCUMENTS
EMR
Batch
AWS Amazon
Slow
CloudTrail
Messaging Transport
CloudWatch
File
Amazon S3
AWS Import/Export FILES
Message
Snowball
Message
Amazon EC2
Apache Kafka
Streaming
Amazon Kinesis Amazon EMR
Devices
Fast
Hot
Streams Amazon Kinesis
Stream
Stream
Analytics
IoT
API
Apps & Services
Notebooks
IDE
IDE
Putting It All Together
COLLECT STORE ETL PROCESS / ANALYZE CONSUME
NoSQL Cache
Hot
Web apps Amazon Machine
Applications
API
ML
Amazon ElastiCache Apps & Services
Learning
Mobile apps
Amazon DynamoDB Amazon Redshift Amazon QuickSight
Interactive
Fast
SQL
AWS Direct Amazon RDS
Connect Presto
Logging
Warm
Logging Amazon Elasticsearch
Search
Service Amazon
DOCUMENTS
EMR
Batch
AWS Amazon
Slow
Transport
CloudTrail CloudWatch
File
Amazon S3
AWS Import/Export FILES
Message
Snowball
Message
Amazon EC2
Amazon SQS
Messaging
Message MESSAGES
Amazon EC2
Apache Kafka
Streaming
Notebooks
Amazon Kinesis Amazon EMR
Devices
Fast
Hot
Streams Amazon Kinesis
Stream
Stream
Analytics
IoT
IDE
Amazon DynamoDB
Reference architecture Streams
AWS Lambda
Design Patterns
Primitive: Decoupled “Data Bus”
process
store
Primitive: Pub/Sub
Amazon Kinesis
Connector
Library
Amazon AWS
Kinesis Lambda
Apache
Spark
process
store
Primitive:
Amazon Kinesis
Connector Amazon S3
Library
Spark
Streaming
Amazon EMR
process
store
Data temperature
Hot Cold
Amazon
data Amazon Kinesis
DynamoDB
Amazon S3
Fast
AWS Lambda
KCL apps
Amazon
Redshift
Amazon
Redshift
Spark
Presto
Hive
Hive
Slow Answers
Real-time Analytics Fan out
Amazon
Kinesis
Amazon
Alerts SNS Notifications
KCL App
Amazon
ElastiCache
AWS Lambda App state (Redis)
Amazon
Kinesis Amazon
Amazon EMR DynamoDB
Amazon
Spark RDS
Streaming KPI
Amazon Kinesis Amazon
Analytics
ES
Real-time prediction
Amazon
process ML
Amazon
Log S3
store
Interactive
Amazon Redshift
& Interactive
Amazon EMR
Batch Presto
Analytics Spark
Real-time prediction
Amazon
Amazon
Kinesis Amazon S3 ML Consume
Firehose
Batch prediction
Amazon EMR
Spark
Hive
process Pig Batch
store
Lambda Batch & Interactive Layer
Amazon
Redshift
Architecture Amazon S3 Amazon EMR
Presto
Applications
Spark
Hive
Pig
Data Amazon
Amazon
Kinesis ML
KCL Amazon
Amazon Kinesis ElastiCache
Analytics Amazon
DynamoDB
AWS Lambda
Amazon
Spark Streaming RDS
on Amazon EMR
Amazon
process Real-time Layer
ES
Serving Layer
Storm
store
Summary
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use Lambda architecture ideas
• Immutable (append-only) log, batch/real-time/serving layer
Big data ≠ big cost
COLLECT STORE ETL PROCESS / ANALYZE CONSUME
NoSQL Cache
Hot
Web apps Amazon Machine
Applications
API
ML
Amazon ElastiCache Apps & Services
Learning
Mobile apps
Amazon DynamoDB Amazon Redshift Amazon QuickSight
Interactive
Fast
SQL
AWS Direct Amazon RDS
Connect Presto
Logging
Warm
Logging Amazon Elasticsearch
Search
Service Amazon
DOCUMENTS
EMR
Batch
AWS Amazon
Slow
Transport
CloudTrail CloudWatch
File
Amazon S3
AWS Import/Export FILES
Message
Snowball
Message
Amazon EC2
Amazon SQS
Messaging
Message MESSAGES
Amazon EC2
Apache Kafka
Streaming
Notebooks
Amazon Kinesis Amazon EMR
Devices
Fast
Hot
Streams Amazon Kinesis
Stream
Stream
Analytics
IoT
IDE
Amazon DynamoDB