Cassandra at Twitter

Cassandra @
Cassandra SF July 11th, 2011
Team
Chris Goffinet
Stu Hood
Ryan King
@stuhood
@rk
@lennox Alan Liang Oscar Moll
Melvin Wang
@alan
@padauk9
Measuring ourselves
#prostyle
Measuring ourselves

Hardware Platform Data Storage Latency and Throughput Operational Efficiency Capacity Planning Developer Integration Testing
Hardware Platform

CPU Core Utilization Memory bandwidth and consumption Machine cost RAID Filesystems and I/O Schedulers IOPS Network bandwidth Kernel
Hardware Platform

CPU Core Utilization Memory bandwidth and consumption Machine cost RAID Filesystems and I/O Schedulers IOPS Network bandwidth Kernel
Hardware Platform
Filesystem configurations
Ext4

Data mode = Ordered Data mode = Writeback
XFS RAID

0 and 10 far side vs near side copies
128 vs 256 vs 512 stripe sizes
Hardware Platform
I/O Schedulers

CFQ vs Noop vs Deadline vs Anticipatory Workloads

Timeseries 50/50
Measure

p90 p99 Average Max
Hardware Platform
I/O Schedulers 5050 - Reads
Scheduler cfq noop deadline anticipatory p90 73ms 47ms 75ms 76ms p99 210ms 167ms 233ms 214ms Average 11.72ms 9.12ms 12.72ms 12.37ms Max 4940ms 4132ms 3718ms 5120ms
Hardware Platform
I/O Schedulers 5050 - Writes
Scheduler cfq noop deadline anticipatory p90 2ms 2ms 2ms 2ms p99 2ms 2ms 2ms 2ms Average 2.02ms 2.06ms 2.13ms 2.03ms Max 5927ms 3475ms 3718ms 5119ms
Measuring ourselves

Data Storage

How efficient is our on-disk storage? Could we do compression? Do we have CPU to trade? How do we push for better? Is it worth it?
Data Storage
Old Easy to Implement Checksumming Varint Encoding Delta Encoding Type Specic Compression Fixed Size Blocks New
Data Storage
Old Easy to Implement Checksumming Varint Encoding Delta Encoding Type Specic Compression Fixed Size Blocks X X X X X X New
Data Storage
How did we do?
Data Storage

1.5x? 2.5x? 3.5x?
Data Storage
7.03x
Data Storage
10,00o rows; 250M columns
Rows Current Format New Format Timeseries LongType column names CounterColumnType values 10000 10000 Columns 250M 250M Size on disk 16,716,432,189 2,375,027,696 bytes per column 66.8 9.5
Data Storage
compression
type specific
fine-grained corruption detection index promotion

normalizing narrow and wide rows predictable performance no double-pass on compaction
range and slice deletes
Measuring ourselves

Latency and Throughput

What are our issues? Compaction Performance? Caching? Too many disk seeks? Garbage Collection?
Compaction
Compaction

Multithread Compaction + Throttling Compact each bucket in parallel Throttle across all buckets Compaction running all the time CASSANDRA-2191 CASSANDRA-2156
Measure latency

p99 p999
No averages! Every customer has p99 and p999 targets we must hit 24x7 on-call rotation
Caching?

In-heap Off-heap
Pluggable cache
Memcache
Case Study: Tweet Button
Growth was requiring entire dataset in memory. Why? How big is the active dataset within 24hours? What happens when dataset outgrows memory? Could other storage solutions do better? What are we missing here?
Key Size Variable length (each one a url)
Implement hashing on keys
Can we do better? But... the cache in Java isnt very efficient...
or is it?
On-heap
Requires us to scale the JVM heap with cache
Off-heap
Store pointers to data allocated out of the JVM
Memcache
Out of process
On-heap
Data + CLHM overhead (87GB)
Off-heap
CLHM overhead (67GB just the pointers!)
Memcache
Internal overhead + data (48GB!)
* CLHM (Concurrent Linked HashMap)

Co-locate memcache on each node Routing + Cache replication Write through LRU Rolling restarts do not cause degraded performance states
Cassandra
Memcache
Cassandra
Memcache
Cassandra
Memcache
Cassandra
Memcache

In production today Stats
99th percentile went before 200ms 800ms when data > memory 99th percentile now - 2.5ms
Case Study: Cuckoo

New observability stack Replaces Ganglia Collect metrics for graphing in real-time Scale based on machine count + defined metrics Heavy write throughput requirements SLA Target
All metrics written under 60 seconds
Case Study: Cuckoo

1.3 million writes/second 112 billion writes a day 3.2 gigabit/s over the network 492GB of new data per hour 140MB/s writes across cluster 70MB/s reads across cluster
Case Study: Cuckoo
36,000 writes/second
persistently to disk on each node
36 nodes without RF (Replication Factor) Replication Factor = 3 30-35% cpu utilization FSync Commit Log every 10s
Case Study: Cuckoo
Garbage Collection Challenge
30-60 second pauses multiple times per hour on each node
Why? Heap fragmentation
Case Study: Cuckoo

1.5e+09 1.0e+09 free_space
5.0e+08
value
1.5e+09
1.0e+09
max_chunk
5.0e+08
1000
2000
time
3000
4000
Case Study: Cuckoo

Slab Allocation Fixed sized chunks (2MB) Copy byte[] into slabs using CAS (Compare & Swap) Largely reduced fragmentation CASSANDRA-2252
Case Study: Cuckoo

No Slab Slab
GC Pause Avg Time
30-60 seconds
Frequency of pause
Every hour
Case Study: Cuckoo

No Slab Slab
GC Pause Avg Time
30-60 seconds
5 seconds
Frequency of pause
Every hour
3 days 10 hours
Case Study: Cuckoo
Pluggable Compaction

Custom strategy for retention support Used for our timeseries
Drop SSTables after N days
Make it easy to implement more interesting and intelligent compaction strategies SSTable Min/Max Timestamp
Read time optimization
Measuring ourselves

Operational Efficiency

Automated infrastructure burn-in process Rack awareness to handle switch failures Grow clusters per rack, not per node Lower Server RPC timeout (200ms to 1s)
Fail fast
Split out RPC timeouts by read & writes CASSANDRA-2819
Fault tolerance at the disk level

Eject from cluster if raid array fails CASSANDRA-2118
No swap and dedicated commit log Multiple hard drive vendors 300+ nodes in production Run on cheap commodity hardware Design for failure
What failures do we see in production?

Bad memory that causes corruption Multiple disks dying on same hosts within hours Rack switch failures Memory allocation delays causing JVM to encounter higher latency GC collections (mlockall recommended) Stop the world pauses if traffic patterns change
What failures do we see in production?
Network cards sometimes negotiating down to 100Mbit Machines randomly die and never come back Disks auto-ejecting themselves from the raid array
Deploy Process
Driver Hudson Git
Cass Cass Cass Cass
Cass Cass Cass
Cass
Deploy Process

Deploy to hundreds of nodes in under 20s Roll the cluster

Disable Gossip on a node Check ring on all nodes to ensure Down state Drain Restart
Measuring ourselves

Capacity Planning

In-house capacity planning tool Collect input from sources:

hardware platform (kernel, hw data) on-disk serialization overhead cost of read/write (seeks, index overhead) query cost (cpu, memory usage) requirements from customers
Capacity Planning
Input Example
spec = { 'read_qps': 500, 'write_qps': 1000, 'replication_factor': 3, 'dataset_hot_percent': 0.05, 'latency_95': 350.0, 'latency_99': 250.0, 'read_growth_percentage': 0.1, 'write_growth_percentage': 0.1, ...... }
Capacity Planning
Output Example
90 days datasize: 14.49T page cache size: 962.89G number of disks: 68 disk capacity: 15.22T iops: 6800.00/s replication_factor: 3 servers: 51 servers (w/o replication): 17 read_ops: 2323 write_ops: 991066 servers: 57 servers (w/o replication): 19 read_ops: 2877 write_ops: 1143171
Measuring ourselves

Developer Integration
Cassie

Light-weight Cassandra Client Cluster member auto discovery Uses Finagle (http://github.com/twitter/ finagle) Scala + Java support Open sourcing
Measuring ourselves

Testing
Distributed Testing Harness
Open sourced to community
Custom Internal Build of YCSB

Performance Benchmarking Custom workloads such as timeseries
Performance Framework

Custom framework that uses YCSB What we do:

Collect as much data as possible Measure Do it again
Generate reports per build

Read/Insert/Update Combinations: 30 Request Targeting (per second): 8
500, 1000, 2000, 4000, 8000, 16000, 32000, 64000
Payload Sizes: 5
100, 500, 1000, 2000, 4000 bytes
Single node vs cluster
Total test combinations: 1,200
Summary
Understand your hardware and operating system Rigorously exercise your entire stack Capacity plan with math not guesswork Measure everything, then do it again Invest in your storage technology Automate Expect everything to fail
Were hiring @jointheflock

Cassandra at Twitter

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Cassandra at Twitter

Uploaded by

Copyright:

Available Formats

Cassandra @

Cassandra SF July 11th, 2011

@lennox Alan Liang Oscar Moll

Data mode = Ordered Data mode = Writeback

0 and 10 far side vs near side copies

128 vs 256 vs 512 stripe sizes

CFQ vs Noop vs Deadline vs Anticipatory Workloads

p90 p99 Average Max

How did we do?

1.5x? 2.5x? 3.5x?

fine-grained corruption detection index promotion

normalizing narrow and wide rows predictable performance no double-pass on compaction

range and slice deletes

Latency and Throughput

Latency and Throughput

Latency and Throughput

Latency and Throughput

Latency and Throughput

Latency and Throughput

Case Study: Tweet Button

Case Study: Tweet Button

Case Study: Tweet Button

Key Size Variable length (each one a url)

Implement hashing on keys

Can we do better? But... the cache in Java isnt very efficient...

Case Study: Tweet Button

Requires us to scale the JVM heap with cache

Store pointers to data allocated out of the JVM

Case Study: Tweet Button

Data + CLHM overhead (87GB)

CLHM overhead (67GB just the pointers!)

Internal overhead + data (48GB!)

* CLHM (Concurrent Linked HashMap)

Case Study: Tweet Button

Case Study: Tweet Button

In production today Stats

Case Study: Cuckoo

All metrics written under 60 seconds

Case Study: Cuckoo

Case Study: Cuckoo

persistently to disk on each node

Case Study: Cuckoo

Garbage Collection Challenge

30-60 second pauses multiple times per hour on each node

Why? Heap fragmentation

Case Study: Cuckoo

Case Study: Cuckoo

Case Study: Cuckoo

GC Pause Avg Time

Case Study: Cuckoo

GC Pause Avg Time

Case Study: Cuckoo

Custom strategy for retention support Used for our timeseries

Drop SSTables after N days

Read time optimization

Split out RPC timeouts by read & writes CASSANDRA-2819

Fault tolerance at the disk level

Eject from cluster if raid array fails CASSANDRA-2118

Cass Cass Cass Cass

Cass Cass Cass

Deploy to hundreds of nodes in under 20s Roll the cluster

In-house capacity planning tool Collect input from sources:

Distributed Testing Harness