
Adi Krishnan, Sr. Product Manager, Amazon Kinesis

November 13, 2014 | Las Vegas, NV


Scenarios Across Industry Segments

Scenarios: (1) Accelerated Ingest-Transform-Load  (2) Continual Metrics/KPI Extraction  (3) Responsive Data Analysis

Data Types: IT infrastructure and application logs, social media, financial market data, web clickstreams, sensors, geo/location data

Digital Ad Tech/Marketing:
(1) Advertising data aggregation  (2) Advertising metrics like coverage, yield, conversion  (3) Analytics on user engagement with ads, optimized bid/buy engines

Software/Technology:
(1) IT server and app log ingestion  (2) IT operational metrics dashboards  (3) Device/sensor operational intelligence

Financial Services:
(1) Market/financial transaction order data collection  (2) Financial market data metrics  (3) Fraud monitoring, Value-at-Risk assessment, auditing of market order data

Consumer Online/E-Commerce:
(1) Online customer engagement data aggregation  (2) Consumer engagement metrics like page views, CTR  (3) Customer clickstream analytics, recommendation engines
Amazon Kinesis
Managed service for streaming data ingestion and processing

[Architecture overview] Millions of sources producing 100s of terabytes per hour put data into a Kinesis stream through the front end, which handles authentication and authorization. Durable, highly consistent storage replicates data across three data centers (Availability Zones), and the ordered stream of events supports multiple readers. Ingestion is inexpensive: $0.028 per million puts. Consuming applications can aggregate and archive to S3, feed real-time dashboards and alarms, run machine learning algorithms or sliding-window analytics, or perform aggregate analysis in Hadoop or a data warehouse.
Real-time Ingest
• Highly Scalable
• Durable
• Elastic
• Replay-able Reads

Continuous Processing
• Elastic
• Load-balancing incoming streams
• Fault-tolerance, Checkpoint / Replay
• Enable multiple processing apps in parallel

Managed Service

Low end-to-end latency

Enable data movement into Stores/Processing Engines

Kinesis Stream
Managed Ability To Capture And Store Data
Putting Data into Kinesis
Simple Put interface to store data in Kinesis
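A minimal put sketch using the AWS SDK for Java; the stream name, partition key, and payload below are illustrative placeholders:

import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class SimplePutExample {
    public static void main(String[] args) {
        // Client picks up credentials and region from the default provider chain
        AmazonKinesisClient kinesis = new AmazonKinesisClient();

        PutRecordRequest putRecordRequest = new PutRecordRequest();
        putRecordRequest.setStreamName("myStream");                    // hypothetical stream name
        putRecordRequest.setPartitionKey("device-1234");               // determines the target shard
        putRecordRequest.setData(ByteBuffer.wrap("payload".getBytes()));

        PutRecordResult result = kinesis.putRecord(putRecordRequest);
        System.out.println("Stored in shard " + result.getShardId()
                + " with sequence number " + result.getSequenceNumber());
    }
}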
Best Practices: Putting Data in Kinesis
Determine Your Partition Key Strategy
• Kinesis as a managed buffer or a streaming map-reduce
• Ensure a high cardinality for Partition Keys with respect to shards, to prevent a "hot shard" problem
  – Generate Random Partition Keys (see the sketch below)
• Streaming Map-Reduce: Leverage Partition Keys for business-specific logic as applicable
  – Partition Key per billing customer, per DeviceId, per stock symbol
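Both strategies sketched as plain helpers; the device-based key is only one example of business-specific logic, and the names are made up:

import java.util.UUID;

public class PartitionKeys {
    // High-cardinality random keys spread records evenly across shards
    // and avoid a "hot shard" when no natural key exists.
    static String randomKey() {
        return UUID.randomUUID().toString();
    }

    // Streaming map-reduce style: derive the key from a business attribute
    // so all records for one entity land on the same shard.
    static String keyForDevice(String deviceId) {
        return "device-" + deviceId;
    }
}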
Best Practices: Putting Data in Kinesis
Provisioning Adequate Shards
• For ingress needs
• Egress needs for all consuming applications: if more than 2 simultaneous consumers
• Include head-room for catching up with data in the stream in the event of application failures (a rough sizing sketch follows)
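A rough sizing sketch, assuming the documented per-shard limits of 1 MB/sec ingress and 2 MB/sec egress; the traffic figures are made-up inputs:

public class ShardSizing {
    public static void main(String[] args) {
        double ingressMBps = 10.0;   // hypothetical aggregate write rate
        double egressMBps  = 25.0;   // hypothetical aggregate read rate across all consumers
        double headroom    = 1.25;   // extra capacity for catching up after failures

        // Per-shard limits: 1 MB/sec ingress, 2 MB/sec egress
        int shards = (int) Math.ceil(
                Math.max(ingressMBps / 1.0, egressMBps / 2.0) * headroom);
        System.out.println("Provision at least " + shards + " shards");
    }
}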
Best Practices: Putting Data in Kinesis
Pre-Batch before Puts for better efficiency
# KINESIS appender
log4j.logger.KinesisLogger=INFO, KINESIS
log4j.additivity.KinesisLogger=false
log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender

# DO NOT use a trailing %n unless you want a newline to be
# transmitted to KINESIS after every message
log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
log4j.appender.KINESIS.layout.ConversionPattern=%m

# mandatory properties for KINESIS appender
log4j.appender.KINESIS.streamName=testStream

# optional, defaults to UTF-8
log4j.appender.KINESIS.encoding=UTF-8
# optional, defaults to 3
log4j.appender.KINESIS.maxRetries=3
# optional, defaults to 2000
log4j.appender.KINESIS.bufferSize=1000
# optional, defaults to 20
log4j.appender.KINESIS.threadCount=20
# optional, defaults to 30 seconds
log4j.appender.KINESIS.shutdownTimeout=30

https://github.com/awslabs/kinesis-log4j-appender
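Once the appender is configured, the application logs through the configured logger name like any other log4j logger; a minimal sketch (the message payload is illustrative):

import org.apache.log4j.Logger;

public class KinesisLoggingExample {
    // Must match the logger name configured for the KINESIS appender above
    private static final Logger KINESIS_LOG = Logger.getLogger("KinesisLogger");

    public static void main(String[] args) {
        // Each log call is buffered by the appender and pushed to the configured Kinesis stream
        KINESIS_LOG.info("{\"event\":\"page_view\",\"page\":\"/home\"}");
    }
}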
• Retry if rise in input rate is temporary
• Reshard to increase the number of shards
• Monitor CloudWatch metrics: PutRecord.Bytes and GetRecords.Bytes metrics keep track of shard usage

Metric               Units
PutRecord.Bytes      Bytes
PutRecord.Latency    Milliseconds
PutRecord.Success    Count

• Keep track of your metrics
• Log hashkey values generated by your partition keys:
  putRecordRequest.setPartitionKey(String.format("myPartitionKey"));
• Log Shard-Ids:
  String shardId = putRecordResult.getShardId();
• Determine which shards receive the most (hashkey) traffic
Options (amazon-kinesis-scaling-utils):
• stream-name – The name of the Stream to be scaled
• scaling-action – The action to be taken to scale. Must be one of "scaleUp", "scaleDown" or "resize"
• count – Number of shards by which to absolutely scale up or down, or resize to, or:
• pct – Percentage of the existing number of shards by which to scale up or down

https://github.com/awslabs/amazon-kinesis-scaling-utils
Sending & Reading Data from Kinesis Streams
Sending
• HTTP Post
• AWS SDK
• AWS Mobile SDK
• LOG4J
• Flume
• Fluentd

Consuming
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
Building Kinesis Applications: Kinesis Client Library
Open Source library for fault-tolerant, continuous processing apps
• Java client library, also available for Python Developers

• Source available on Github

• Build app with Kinesis Client Library

• Deploy on your set of EC2 instances


• Every KCL application includes these components:
• Record processor factory: Creates the record processor
• Record processor: The processing unit that processes data from a shard
of a Kinesis stream
• Worker: The processing unit that maps to each application instance
• The KCL uses the IRecordProcessor interface to communicate with your application
• A Kinesis application must implement the KCL's IRecordProcessor interface (a minimal sketch follows below)
• Contains the business logic for processing the data retrieved from the Kinesis stream
• One record processor maps to one shard and processes data records from that shard
• One worker maps to one or more record processors
• Balances shard-worker associations when worker / instance counts change
• Balances shard-worker associations when shards split or merge
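A minimal record processor sketch against the KCL 1.x IRecordProcessor interface (package and method names as published in the amazon-kinesis-client library; the processing logic is a placeholder, not the full sample app):

import java.util.List;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;

public class SampleRecordProcessor implements IRecordProcessor {
    private String shardId;

    @Override
    public void initialize(String shardId) {
        this.shardId = shardId;  // one record processor instance per shard
    }

    @Override
    public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
        for (Record record : records) {
            // Business logic goes here; record.getData() returns the raw ByteBuffer payload
            System.out.println(shardId + ": key=" + record.getPartitionKey()
                    + " seq=" + record.getSequenceNumber());
        }
        try {
            checkpointer.checkpoint();  // mark progress after the batch has been handled
        } catch (Exception e) {
            // Handle ThrottlingException / ShutdownException / InvalidStateException as appropriate
        }
    }

    @Override
    public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
        if (reason == ShutdownReason.TERMINATE) {
            try {
                checkpointer.checkpoint();  // shard is ending (split/merge); checkpoint the final position
            } catch (Exception e) {
                // log and move on
            }
        }
    }
}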
Moving data into Amazon S3, Redshift
Amazon Kinesis Connector Library
Customizable, Open Source Apps to Connect Kinesis with S3, Redshift, DynamoDB

• ITransformer – Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model
• IFilter – Excludes irrelevant records from the processing
• IBuffer – Buffers the set of records to be processed by specifying size limit (# of records) & total byte count
• IEmitter – Makes client calls to other AWS services (S3, DynamoDB, Redshift) and persists the records stored in the buffer
Amazon Kinesis Connectors
• S3 Connector
  – Batch writes files for archive into S3
  – Uses sequence-based file naming scheme
• Redshift Connector
  – Once written to S3, loads to Redshift
  – Provides manifest support
  – Supports user-defined transformers
• DynamoDB Connector
  – BatchPut appends to a table
  – Supports user-defined transformers
Best Practices: Processing Data From Kinesis
Build applications as part of an Auto Scaling group
• Helps with application availability
• Scales in response to incoming spikes in data volume, assuming Shards have been provisioned
• Select scaling metrics based on the nature of the Kinesis application
  – Instance metrics: CPU, Memory, and others
  – Kinesis metrics: PutRecord.Bytes, GetRecords.Bytes
Metric Units
PutRecord.Bytes Bytes
PutRecord.Latency Milliseconds
PutRecord.Success Count
GetRecords.Bytes Bytes
GetRecords.IteratorAge Milliseconds
GetRecords.Latency Milliseconds
GetRecords.Success Count
Best Practices: Processing Data From Kinesis
Build a flush-to-S3 consumer app
• App can specify three conditions that can trigger a buffer
flush:
– Number of records
– Total byte count
– Time since last flush
• The buffer is flushed and the data is emitted to the destination when any of these thresholds is crossed (a plain-Java sketch of this policy follows the configuration below).
# Flush when buffer exceeds 8 Kinesis records, 1 KB size limit, or
# when time since last emit exceeds 10 minutes
bufferSizeByteLimit = 1024
bufferRecordCountLimit = 8
bufferMillisecondsLimit = 600000
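A plain-Java sketch of the any-threshold-crossed policy; this is an illustration only, not the Connector Library's implementation, and all names are made up:

import java.util.ArrayList;
import java.util.List;

public class FlushBuffer {
    private static final long BYTE_LIMIT   = 1024;      // bufferSizeByteLimit
    private static final int  RECORD_LIMIT = 8;         // bufferRecordCountLimit
    private static final long MILLIS_LIMIT = 600_000;   // bufferMillisecondsLimit (10 minutes)

    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private long lastFlushAt = System.currentTimeMillis();

    public void add(byte[] record) {
        buffer.add(record);
        bufferedBytes += record.length;
        if (shouldFlush()) {
            flush();
        }
    }

    // Flush when ANY of the three thresholds is crossed
    private boolean shouldFlush() {
        return bufferedBytes >= BYTE_LIMIT
                || buffer.size() >= RECORD_LIMIT
                || System.currentTimeMillis() - lastFlushAt >= MILLIS_LIMIT;
    }

    private void flush() {
        // Emit the buffered records to the destination (e.g. an S3 object) here
        buffer.clear();
        bufferedBytes = 0;
        lastFlushAt = System.currentTimeMillis();
    }
}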
Best Practices: Processing Data From Kinesis

• In a KCL app, ensure data being processed is persisted to a durable store, like DynamoDB or S3, prior to check-pointing (a schematic sketch follows below).

• Duplicates: Make the authoritative data repository (usually at the end of the data flow) resilient to duplicates. That way the rest of the system has a simple policy – keep retrying until you succeed.

• Idempotent Processing: Use the number of records since the previous checkpoint to get repeatable results when the record processors fail over.
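A schematic sketch of the persist-then-checkpoint ordering and idempotent writes; the in-memory map stands in for DynamoDB or S3, and all names are made up:

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class PersistThenCheckpoint {
    // Stand-in for a durable store such as DynamoDB or S3
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    public void handleBatch(List<String[]> records, Runnable checkpoint) {
        for (String[] rec : records) {
            String sequenceNumber = rec[0], payload = rec[1];
            // putIfAbsent makes the write idempotent: a replayed record is a no-op
            store.putIfAbsent(sequenceNumber, payload);
        }
        // Checkpoint only after every record in the batch is durably stored.
        // A crash before this line just means the batch is replayed, which is safe
        // because the store tolerates duplicates.
        checkpoint.run();
    }
}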
Best Practices: Processing Data From Kinesis
• The Redshift connector creates a manifest file based on a custom set of input files
• Use a manifest stream with only one shard
• Adjust checkpoint frequency, connector buffer, and filter to align with your Redshift load models
Amazon Kinesis Customer Scenarios
Collect all data of interest continuously
Faster time to market due to ease of deployment
Enable operators and partners to get to valuable data quickly
http://bit.ly/awsevals
