
Adi Krishnan, Sr. Product Manager, Amazon Kinesis

November 13, 2014 | Las Vegas, NV


Scenarios Across Industry Segments

Scenarios: (1) Accelerated Ingest-Transform-Load  (2) Continual Metrics/KPI Extraction  (3) Responsive Data Analysis

Data Types: IT infrastructure and application logs, social media, financial market data, web clickstreams, sensors, geo/location data

Digital Ad Tech/Marketing:
(1) Advertising data aggregation  (2) Advertising metrics like coverage, yield, conversion  (3) Analytics on user engagement with ads, optimized bid/buy engines

Software/Technology:
(1) IT server and app log ingestion  (2) IT operational metrics dashboards  (3) Device/sensor operational intelligence

Financial Services:
(1) Market/financial transaction order data collection  (2) Financial market data metrics  (3) Fraud monitoring, Value-at-Risk assessment, auditing of market order data

Consumer Online/E-Commerce:
(1) Online customer engagement data aggregation  (2) Consumer engagement metrics like page views, CTR  (3) Customer clickstream analytics, recommendation engines
Amazon Kinesis
Managed service for streaming data ingestion and processing

[Architecture overview] Millions of sources producing 100s of terabytes per hour put data into a Kinesis stream through the front end, which handles authentication and authorization. Durable, highly consistent storage replicates data across three data centers (Availability Zones), and the ordered stream of events supports multiple readers. Ingestion is inexpensive: $0.028 per million puts. Consuming applications can aggregate and archive to S3, feed real-time dashboards and alarms, run machine learning algorithms or sliding-window analytics, or perform aggregate analysis in Hadoop or a data warehouse.
Real-time Ingest
• Highly Scalable
• Durable
• Elastic
• Replay-able Reads

Continuous Processing
• Elastic
• Load-balancing incoming streams
• Fault-tolerance, Checkpoint / Replay
• Enable multiple processing apps in parallel

Managed Service

Low end-to-end latency

Enable data movement into Stores/Processing Engines

Kinesis Stream
Managed Ability To Capture And Store Data
Putting Data into Kinesis
Simple Put interface to store data in Kinesis
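A minimal put sketch using the AWS SDK for Java; the stream name, partition key, and payload below are illustrative placeholders:

import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class SimplePutExample {
    public static void main(String[] args) {
        // Client picks up credentials and region from the default provider chain
        AmazonKinesisClient kinesis = new AmazonKinesisClient();

        PutRecordRequest putRecordRequest = new PutRecordRequest();
        putRecordRequest.setStreamName("myStream");                    // hypothetical stream name
        putRecordRequest.setPartitionKey("device-1234");               // determines the target shard
        putRecordRequest.setData(ByteBuffer.wrap("payload".getBytes()));

        PutRecordResult result = kinesis.putRecord(putRecordRequest);
        System.out.println("Stored in shard " + result.getShardId()
                + " with sequence number " + result.getSequenceNumber());
    }
}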
Best Practices: Putting Data in Kinesis
Determine Your Partition Key Strategy
• Kinesis as a managed buffer or a streaming map-reduce
• Ensure a high cardinality for Partition Keys with respect to shards, to prevent a "hot shard" problem
  – Generate Random Partition Keys (see the sketch below)
• Streaming Map-Reduce: Leverage Partition Keys for business-specific logic as applicable
  – Partition Key per billing customer, per DeviceId, per stock symbol
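Both strategies sketched as plain helpers; the device-based key is only one example of business-specific logic, and the names are made up:

import java.util.UUID;

public class PartitionKeys {
    // High-cardinality random keys spread records evenly across shards
    // and avoid a "hot shard" when no natural key exists.
    static String randomKey() {
        return UUID.randomUUID().toString();
    }

    // Streaming map-reduce style: derive the key from a business attribute
    // so all records for one entity land on the same shard.
    static String keyForDevice(String deviceId) {
        return "device-" + deviceId;
    }
}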
Best Practices: Putting Data in Kinesis
Provisioning Adequate Shards
• For ingress needs
• Egress needs for all consuming applications: if more than 2 simultaneous consumers
• Include head-room for catching up with data in the stream in the event of application failures (a rough sizing sketch follows)
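A rough sizing sketch, assuming the documented per-shard limits of 1 MB/sec ingress and 2 MB/sec egress; the traffic figures are made-up inputs:

public class ShardSizing {
    public static void main(String[] args) {
        double ingressMBps = 10.0;   // hypothetical aggregate write rate
        double egressMBps  = 25.0;   // hypothetical aggregate read rate across all consumers
        double headroom    = 1.25;   // extra capacity for catching up after failures

        // Per-shard limits: 1 MB/sec ingress, 2 MB/sec egress
        int shards = (int) Math.ceil(
                Math.max(ingressMBps / 1.0, egressMBps / 2.0) * headroom);
        System.out.println("Provision at least " + shards + " shards");
    }
}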
Best Practices: Putting Data in Kinesis
Pre-Batch before Puts for better efficiency
# KINESIS appender
log4j.logger.KinesisLogger=INFO, KINESIS
log4j.additivity.KinesisLogger=false
log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender

# DO NOT use a trailing %n unless you want a newline to be
# transmitted to KINESIS after every message
log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
log4j.appender.KINESIS.layout.ConversionPattern=%m

# mandatory properties for KINESIS appender
log4j.appender.KINESIS.streamName=testStream

# optional, defaults to UTF-8
log4j.appender.KINESIS.encoding=UTF-8
# optional, defaults to 3
log4j.appender.KINESIS.maxRetries=3
# optional, defaults to 2000
log4j.appender.KINESIS.bufferSize=1000
# optional, defaults to 20
log4j.appender.KINESIS.threadCount=20
# optional, defaults to 30 seconds
log4j.appender.KINESIS.shutdownTimeout=30

https://github.com/awslabs/kinesis-log4j-appender
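Once the appender is configured, the application logs through the configured logger name like any other log4j logger; a minimal sketch (the message payload is illustrative):

import org.apache.log4j.Logger;

public class KinesisLoggingExample {
    // Must match the logger name configured for the KINESIS appender above
    private static final Logger KINESIS_LOG = Logger.getLogger("KinesisLogger");

    public static void main(String[] args) {
        // Each log call is buffered by the appender and pushed to the configured Kinesis stream
        KINESIS_LOG.info("{\"event\":\"page_view\",\"page\":\"/home\"}");
    }
}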
• Retry if rise in input rate is temporary
• Reshard to increase the number of shards
• Monitor CloudWatch metrics: PutRecord.Bytes and GetRecords.Bytes metrics keep track of shard usage

Metric               Units
PutRecord.Bytes      Bytes
PutRecord.Latency    Milliseconds
PutRecord.Success    Count

• Keep track of your metrics
• Log hashkey values generated by your partition keys:
  putRecordRequest.setPartitionKey(String.format("myPartitionKey"));
• Log Shard-Ids:
  String shardId = putRecordResult.getShardId();
• Determine which shards receive the most (hashkey) traffic
Options (amazon-kinesis-scaling-utils):
• stream-name – The name of the Stream to be scaled
• scaling-action – The action to be taken to scale. Must be one of "scaleUp", "scaleDown" or "resize"
• count – Number of shards by which to absolutely scale up or down, or resize to, or:
• pct – Percentage of the existing number of shards by which to scale up or down

https://github.com/awslabs/amazon-kinesis-scaling-utils
Sending & Reading Data from Kinesis Streams
Sending
• HTTP Post
• AWS SDK
• AWS Mobile SDK
• LOG4J
• Flume
• Fluentd

Consuming
• Get* APIs
• Kinesis Client Library + Connector Library
• Apache Storm
• Amazon Elastic MapReduce
Building Kinesis Applications: Kinesis Client Library
Open Source library for fault-tolerant, continuous processing apps
• Java client library, also available for Python Developers

• Source available on Github

• Build app with Kinesis Client Library

• Deploy on your set of EC2 instances


• Every KCL application includes these components:
• Record processor factory: Creates the record processor
• Record processor: The processing unit that processes data from a shard
of a Kinesis stream
• Worker: The processing unit that maps to each application instance
• The KCL uses the IRecordProcessor interface to communicate with your application
• A Kinesis application must implement the KCL's IRecordProcessor interface (a minimal sketch follows below)
• Contains the business logic for processing the data retrieved from the Kinesis stream
• One record processor maps to one shard and processes data records from that shard
• One worker maps to one or more record processors
• Balances shard-worker associations when worker / instance counts change
• Balances shard-worker associations when shards split or merge
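A minimal record processor sketch against the KCL 1.x IRecordProcessor interface (package and method names as published in the amazon-kinesis-client library; the processing logic is a placeholder, not the full sample app):

import java.util.List;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;

public class SampleRecordProcessor implements IRecordProcessor {
    private String shardId;

    @Override
    public void initialize(String shardId) {
        this.shardId = shardId;  // one record processor instance per shard
    }

    @Override
    public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
        for (Record record : records) {
            // Business logic goes here; record.getData() returns the raw ByteBuffer payload
            System.out.println(shardId + ": key=" + record.getPartitionKey()
                    + " seq=" + record.getSequenceNumber());
        }
        try {
            checkpointer.checkpoint();  // mark progress after the batch has been handled
        } catch (Exception e) {
            // Handle ThrottlingException / ShutdownException / InvalidStateException as appropriate
        }
    }

    @Override
    public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
        if (reason == ShutdownReason.TERMINATE) {
            try {
                checkpointer.checkpoint();  // shard is ending (split/merge); checkpoint the final position
            } catch (Exception e) {
                // log and move on
            }
        }
    }
}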
Moving data into Amazon S3, Redshift
Amazon Kinesis Connector Library
Customizable, Open Source Apps to Connect Kinesis with S3, Redshift, DynamoDB

• ITransformer – Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model
• IFilter – Excludes irrelevant records from the processing
• IBuffer – Buffers the set of records to be processed by specifying size limit (# of records) & total byte count
• IEmitter – Makes client calls to other AWS services (S3, DynamoDB, Redshift) and persists the records stored in the buffer
Amazon Kinesis Connectors
• S3 Connector
  – Batch writes files for archive into S3
  – Uses sequence-based file naming scheme
• Redshift Connector
  – Once written to S3, loads to Redshift
  – Provides manifest support
  – Supports user-defined transformers
• DynamoDB Connector
  – BatchPut appends to a table
  – Supports user-defined transformers
Best Practices: Processing Data From Kinesis
Build applications as part of an Auto Scaling group
• Helps with application availability
• Scales in response to incoming spikes in data volume, assuming Shards have been provisioned
• Select scaling metrics based on the nature of the Kinesis application
  – Instance metrics: CPU, Memory, and others
  – Kinesis metrics: PutRecord.Bytes, GetRecords.Bytes
Metric Units
PutRecord.Bytes Bytes
PutRecord.Latency Milliseconds
PutRecord.Success Count
GetRecords.Bytes Bytes
GetRecords.IteratorAge Milliseconds
GetRecords.Latency Milliseconds
GetRecords.Success Count
Best Practices: Processing Data From Kinesis
Build a flush-to-S3 consumer app
• App can specify three conditions that can trigger a buffer
flush:
– Number of records
– Total byte count
– Time since last flush
• The buffer is flushed and the data is emitted to the destination when any of these thresholds is crossed (a plain-Java sketch of this policy follows the configuration below).
# Flush when buffer exceeds 8 Kinesis records, 1 KB size limit, or
# when time since last emit exceeds 10 minutes
bufferSizeByteLimit = 1024
bufferRecordCountLimit = 8
bufferMillisecondsLimit = 600000
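A plain-Java sketch of the any-threshold-crossed policy; this is an illustration only, not the Connector Library's implementation, and all names are made up:

import java.util.ArrayList;
import java.util.List;

public class FlushBuffer {
    private static final long BYTE_LIMIT   = 1024;      // bufferSizeByteLimit
    private static final int  RECORD_LIMIT = 8;         // bufferRecordCountLimit
    private static final long MILLIS_LIMIT = 600_000;   // bufferMillisecondsLimit (10 minutes)

    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private long lastFlushAt = System.currentTimeMillis();

    public void add(byte[] record) {
        buffer.add(record);
        bufferedBytes += record.length;
        if (shouldFlush()) {
            flush();
        }
    }

    // Flush when ANY of the three thresholds is crossed
    private boolean shouldFlush() {
        return bufferedBytes >= BYTE_LIMIT
                || buffer.size() >= RECORD_LIMIT
                || System.currentTimeMillis() - lastFlushAt >= MILLIS_LIMIT;
    }

    private void flush() {
        // Emit the buffered records to the destination (e.g. an S3 object) here
        buffer.clear();
        bufferedBytes = 0;
        lastFlushAt = System.currentTimeMillis();
    }
}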
Best Practices: Processing Data From Kinesis

• In a KCL app, ensure data being processed is persisted to a durable store, like DynamoDB or S3, prior to check-pointing (a schematic sketch follows below).

• Duplicates: Make the authoritative data repository (usually at the end of the data flow) resilient to duplicates. That way the rest of the system has a simple policy – keep retrying until you succeed.

• Idempotent Processing: Use the number of records since the previous checkpoint to get repeatable results when the record processors fail over.
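A schematic sketch of the persist-then-checkpoint ordering and idempotent writes; the in-memory map stands in for DynamoDB or S3, and all names are made up:

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class PersistThenCheckpoint {
    // Stand-in for a durable store such as DynamoDB or S3
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    public void handleBatch(List<String[]> records, Runnable checkpoint) {
        for (String[] rec : records) {
            String sequenceNumber = rec[0], payload = rec[1];
            // putIfAbsent makes the write idempotent: a replayed record is a no-op
            store.putIfAbsent(sequenceNumber, payload);
        }
        // Checkpoint only after every record in the batch is durably stored.
        // A crash before this line just means the batch is replayed, which is safe
        // because the store tolerates duplicates.
        checkpoint.run();
    }
}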
Best Practices: Processing Data From Kinesis
• The Redshift connector creates a manifest file based on a custom set of input files
• Use a manifest stream with only one shard
• Adjust checkpoint frequency, connector buffer, and filter to align with your Redshift load models
Amazon Kinesis Customer Scenarios
Collect all data of interest continuously
Faster time to market due to ease of deployment
Enable operators and partners to get to valuable data quickly
http://bit.ly/awsevals
