
Kafka as a storage system


By Babatunde Fashola
Wednesday, 16 December 2020

Introduction
When developers consume public Twitter data through our API, they need reliability, speed, and stability. Therefore, some time ago, we launched the Account Activity Replay API (https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/subscribe-account-activity/guides/activity-replay-api) for the Account Activity API (https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/subscribe-account-activity/overview) to let developers bake stability into their systems. The Account Activity Replay API is a data recovery tool that lets developers retrieve events from as far back as five days. It recovers events that weren't delivered for various reasons, including inadvertent server outages during real-time delivery attempts.

In addition to building APIs that will provide a positive developer experience, we also optimize for:

- Increasing the productivity of our engineers.
- Keeping the system easy to maintain. Specifically, we try to minimize context switching for developers, site reliability engineers, and anyone else who interacts with the system.


For these reasons, when building the replay system on which this API relies, we
leveraged the existing design of the real-time system that powers the Account
Activity API. This helped us reuse existing work as well as minimize the context
switching and training that would have been required had we created an entirely
different system.

The real-time system is built on a publish-subscribe architecture. Therefore, in keeping with these goals, building the storage layer we would read from drove us to rethink a traditional streaming technology: Apache Kafka (https://kafka.apache.org/).

Background
Real-time events are produced in two datacenters (DCs). When these events are
produced, they are written into pub-sub topics that are cross-replicated across the
two DCs for redundancy purposes.

Not all events should be delivered, so they are filtered by an internal application that consumes events from these topics, checks each against a set of rules in a key-value store, and decides whether the event should be delivered to a given developer through our public API. Events are delivered through a webhook, and each webhook URL owned by a developer is identified by a unique ID.
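For illustration only, a stripped-down version of such a filtering application might look like the sketch below. The topic name, rule store, and webhook client are hypothetical stand-ins, not the internal application's actual code.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Optional;
import java.util.Properties;

// Sketch: consume raw events, look up delivery rules for each event in a
// key-value store, and forward matching events to the developer's webhook.
public class DeliveryFilter {
    private final KafkaConsumer<String, byte[]> consumer;
    private final RuleStore rules;          // stand-in for the key-value store
    private final WebhookClient webhooks;   // stand-in for the webhook delivery client

    public DeliveryFilter(Properties config, RuleStore rules, WebhookClient webhooks) {
        this.consumer = new KafkaConsumer<>(config);
        this.rules = rules;
        this.webhooks = webhooks;
        consumer.subscribe(List.of("realtime-events")); // hypothetical topic name
    }

    public void run() {
        while (true) {
            for (ConsumerRecord<String, byte[]> record : consumer.poll(Duration.ofSeconds(1))) {
                // Deliver only if a rule says this developer should receive the event.
                rules.webhookIdFor(record)
                     .ifPresent(webhookId -> webhooks.deliver(webhookId, record.value()));
            }
        }
    }

    public interface RuleStore {
        Optional<String> webhookIdFor(ConsumerRecord<String, byte[]> record);
    }

    public interface WebhookClient {
        void deliver(String webhookId, byte[] event);
    }
}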

Storage and Partitioning


Ordinarily, when building a replay system that requires storage like this, one might
use an architecture based on Hadoop and HDFS. We chose Apache Kafka
(https://kafka.apache.org/) instead for two reasons:


- The real-time system was built on a similar pub-sub architecture.
- The volume of events to be stored by the replay system isn't petabyte scale. We store no more than a few days' worth of data. In addition, spinning up Hadoop's MapReduce jobs is more expensive and slower than consuming data from Kafka, which would not meet developer expectations.

We leverage the real-time pipeline to build the replay pipeline by first ensuring that events that should have been delivered for each developer are stored on Kafka. We call this Kafka topic the delivery_log; there is one in each DC. These topics are then cross-replicated for redundancy, which allows a replay request to be served out of a single DC. The stored events are deduplicated before they are delivered.
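As an illustrative sketch (not our actual producer code), storing an event on the delivery_log keyed by the developer's webhookId, as described in the next paragraph, might look like this:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

// Sketch: persist an event that should have been delivered to a developer,
// keyed by that developer's webhookId so the default partitioner groups
// each developer's events onto a single partition.
public class DeliveryLogWriter {
    private final KafkaProducer<String, byte[]> producer;

    public DeliveryLogWriter(Properties config) {
        this.producer = new KafkaProducer<>(config);
    }

    public void store(String webhookId, byte[] event) {
        producer.send(new ProducerRecord<>("delivery_log", webhookId, event));
    }
}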

On this Kafka topic, we created multiple partitions using the default semantic partitioning mechanism: each record is keyed by a developer's webhookId, so its partition is determined by the hash of that key. We considered static partitioning but decided against it because of the increased risk that one partition ends up with more data than the others if one developer generates more events than the rest. Instead, we chose a fixed number of partitions and spread the data with the default partitioning strategy. This mitigates the risk of unbalanced partitions and means we don't need to read every partition on the topic. Rather, based on the webhookId a request comes in for, a Replay Service (described below in Requests and Processing) determines the specific partition to read from and spins up a new Kafka consumer for that partition. The number of partitions on the topic is never changed, because doing so would change how keys hash and how events are distributed.
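A rough sketch of how that lookup could work is below. It assumes the records are keyed with the standard string serializer and mirrors the hashing the default partitioner applies to keyed records; the class name and structure are illustrative, not our production code.

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Properties;

// Sketch: locate the single partition that holds a developer's events and
// assign a consumer directly to it, instead of subscribing to the whole topic.
public class ReplayPartitionLocator {

    // Mirrors the default partitioner's hashing for keyed records:
    // murmur2 over the serialized key, modulo the partition count.
    static int partitionFor(String webhookId, int numPartitions) {
        byte[] keyBytes = webhookId.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    static KafkaConsumer<String, byte[]> consumerFor(String webhookId, Properties config) {
        KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(config);
        int numPartitions = consumer.partitionsFor("delivery_log").size();
        int partition = partitionFor(webhookId, numPartitions);
        // assign() rather than subscribe(): no consumer-group rebalancing, just this partition.
        consumer.assign(List.of(new TopicPartition("delivery_log", partition)));
        return consumer;
    }
}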

We use solid-state drives (SSDs), given the volume of events we read per time period. We chose them over traditional hard disk drives for faster processing and to reduce the overhead that comes with seek and access times. SSDs are the better choice for fast reads here because the data we access is read infrequently, so we don't get the benefit of the page cache optimizations.

To minimize storage space, this data is compressed using snappy (https://opensource.google/projects/snappy) as the compression algorithm. We know most of our processing is on the consumption side. We chose snappy because it's faster when decompressing (https://blog.cloudflare.com/squeezing-the-firehose/) than the other Kafka-supported compression algorithms: gzip and lz4.
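For reference, enabling snappy on a Kafka producer is a single configuration setting. A minimal sketch follows; the bootstrap address and serializers are placeholders, not our actual configuration.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

// Sketch: build a producer whose record batches are compressed with snappy.
public class CompressedProducerFactory {
    static KafkaProducer<String, byte[]> create(String bootstrapServers) {
        Properties config = new Properties();
        config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        config.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        config.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // Compress on the producer; consumers decompress transparently on read.
        config.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        return new KafkaProducer<>(config);
    }
}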

Requests and Processing


In the system we designed, replay requests are sent through an API call. As part of the payload in each validated request, we get the webhookId and the date range for which events should be replayed. These requests are persisted into MySQL and enqueued until they are picked up by the Replay Service. The date range on the request is used to determine the offset on the partition to start reading from. The offsetsForTimes (https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#offsetsForTimes-java.util.Map-) method on the Consumer object is used to get the offsets.
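A minimal sketch of that offset lookup, assuming the consumer has already been assigned the relevant partition (as in the partition-lookup sketch above):

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import java.time.Instant;
import java.util.Map;

// Sketch: translate the start of the requested date range into a Kafka offset
// and position the consumer there before reading.
public class ReplayOffsets {
    static void seekToStart(KafkaConsumer<String, byte[]> consumer,
                            TopicPartition partition,
                            Instant rangeStart) {
        Map<TopicPartition, OffsetAndTimestamp> offsets =
                consumer.offsetsForTimes(Map.of(partition, rangeStart.toEpochMilli()));
        OffsetAndTimestamp result = offsets.get(partition);
        if (result != null) {
            // Earliest offset whose timestamp is >= the requested start time.
            consumer.seek(partition, result.offset());
        }
        // If result is null, no events at or after rangeStart remain on the partition.
    }
}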

Replay Service instances process each replay request. The instances coordinate with one another through MySQL to process the next replay entry in the database. Each replay worker polls MySQL periodically for pending jobs to process. A request transitions between states: a request that has not been picked up is in an OPEN state, a request that was just dequeued is in a STARTED state, a request that is being processed is in an ONGOING state, and a request that is done transitions to a COMPLETED state. A replay worker only picks up requests that haven't been started (that is, requests in an OPEN state).

Periodically, after a worker has dequeued a request for processing, the worker heartbeats the MySQL table with timestamp entries to show that the replay job is still ongoing. In cases where a replay worker instance dies while still processing a request, such jobs are restarted. Therefore, in addition to dequeuing requests that are in an OPEN state, replay workers also pick up jobs in a STARTED or ONGOING state that have received no heartbeat for a predefined number of minutes.
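The post doesn't show the underlying queries, but a claim-and-heartbeat step along these lines can be sketched roughly as follows. The table, columns, and 10-minute cutoff are invented for illustration; the real service may implement this quite differently.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: atomically claim the next replay request that is either still OPEN
// or was started by a worker that has stopped heartbeating.
public class ReplayJobClaimer {
    private static final String CLAIM_SQL =
        "UPDATE replay_requests " +                                    // hypothetical table
        "SET state = 'STARTED', worker_id = ?, heartbeat_at = NOW() " +
        "WHERE state = 'OPEN' " +
        "   OR (state IN ('STARTED', 'ONGOING') " +
        "       AND heartbeat_at < NOW() - INTERVAL 10 MINUTE) " +     // stale-heartbeat cutoff
        "ORDER BY created_at LIMIT 1";

    private static final String HEARTBEAT_SQL =
        "UPDATE replay_requests SET state = 'ONGOING', heartbeat_at = NOW() " +
        "WHERE id = ? AND worker_id = ?";

    // Returns true if this worker claimed a job; a follow-up SELECT by
    // worker_id would fetch the claimed row (omitted here).
    static boolean claimNext(Connection db, String workerId) throws SQLException {
        try (PreparedStatement stmt = db.prepareStatement(CLAIM_SQL)) {
            stmt.setString(1, workerId);
            return stmt.executeUpdate() == 1;
        }
    }

    // Called periodically while the job is being processed.
    static void heartbeat(Connection db, long requestId, String workerId) throws SQLException {
        try (PreparedStatement stmt = db.prepareStatement(HEARTBEAT_SQL)) {
            stmt.setLong(1, requestId);
            stmt.setString(2, workerId);
            stmt.executeUpdate();
        }
    }
}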


Events are deduplicated as they are read from the topic and then published to the customer's webhook URL. The deduplication is done by maintaining a cache of the hashes of the events being read. If an event with the same hash is encountered, it isn't delivered.
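The post doesn't specify the hash function or cache implementation; a minimal in-memory sketch of the idea, with SHA-256 and a bounded LinkedHashMap as assumptions, might look like this:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Sketch: remember the hashes of recently delivered events and skip repeats.
public class EventDeduplicator {
    private final Set<String> seenHashes;

    public EventDeduplicator(int maxEntries) {
        // Bounded, insertion-ordered cache so memory stays constant during a replay.
        this.seenHashes = Collections.newSetFromMap(new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        });
    }

    // Returns true the first time an event payload is seen, false for duplicates.
    public boolean firstTimeSeen(byte[] eventPayload) {
        return seenHashes.add(hash(eventPayload));
    }

    private static String hash(byte[] payload) {
        try {
            return Base64.getEncoder().encodeToString(
                    MessageDigest.getInstance("SHA-256").digest(payload));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}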

In summary, our solution isn't a traditional real-time, streaming use of Kafka as we know it. However, we successfully used Kafka as a storage system in building an API that optimizes customer experience and data access in recovering events. Leveraging the design of the real-time system makes it easy to maintain our systems. In addition, the speed of recovery for our customers is as we envisaged.

Acknowledgements
Building a service like this would not have been possible without the help of people like Owen Parry, Eoin Coffey, Lisa White, Andrew Rapp, and Brent Halsey. Special thanks to Lisa White, Chris Coco, and Christopher Hogue for reading and reviewing this blog.

Babatunde Fashola (https://www.twitter.com/Fasholaide)
