Error Detection with MySQL Replication

Khanh Do Ba, Stephen Tu, Daniel Peek


[Figure: master and slave connected by statement-based replication; a cosmic ray is shown as one possible source of error]




Target scenario
What kind of errors?
How to find errors
Results on production systems

Target Scenario
[Figure: master and slave]

Don't interfere with workload
Minimize communication
Detect when master ≠ slave
Deal with replication lag
Use vanilla MySQL

Easy Errors
Table does not exist
Different schema
Database offline

Kinds of Errors

Wrong data
Slave missing row
Slave extra row

First thoughts
[Figure: the full DB contents of the two servers are compared]


Second thoughts
[Figure: fingerprints of each database (e.g., DB1) are compared instead of the full contents]


A New Plan

1. Fast pass to narrow search to blocks
2. CM-fingerprint narrows search to rows
3. Third pass gives definite answers

First Pass: Checksum

[Figure: the records of a table are grouped into blocks, with one checksum (cs) per block]

Block Boundaries
[Figure: rows 1 to 20 divided into six blocks, each with its own checksum (cs)]

Rows start - 4
Rows 4 - 7
Rows 7 - 10
Rows 10 - 13
Rows 13 - 16
Rows 16 - end
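
The first pass over the block boundaries above can be sketched in Python. This is a minimal sketch under stated assumptions, not the tool's implementation: rows are modeled as (key, value) pairs, each range is treated as half-open (lower bound inclusive, upper bound exclusive), CRC32 stands in for whatever checksum is actually used, and the names `block_checksums` and `suspect_blocks` are hypothetical.

```python
import bisect
import zlib


def block_checksums(rows, boundaries):
    """Return one running CRC32 per block.

    rows: iterable of (key, value) pairs.
    boundaries: sorted keys that split the table; [4, 7] gives the
    blocks key < 4, 4 <= key < 7, and key >= 7 (half-open ranges are
    an assumption of this sketch).
    """
    sums = [0] * (len(boundaries) + 1)
    for key, value in sorted(rows):
        block = bisect.bisect_right(boundaries, key)
        # Feed the row into the running checksum of its block.
        sums[block] = zlib.crc32(repr((key, value)).encode(), sums[block])
    return sums


def suspect_blocks(master_rows, slave_rows, boundaries):
    """Indices of blocks whose checksums differ between master and slave."""
    m = block_checksums(master_rows, boundaries)
    s = block_checksums(slave_rows, boundaries)
    return [i for i, (a, b) in enumerate(zip(m, s)) if a != b]
```

Only one checksum per block has to cross the network, which fits the goal of minimizing communication; a mismatch only says that a block may be inconsistent, not which row is at fault.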

Now What?
We know which blocks may have inconsistencies
Which rows in those blocks have inconsistencies?

Second Pass: CM-Fingerprint

[Figure: CM-fingerprints of the suspect blocks of each database (e.g., DB1) are compared]


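CM-fingerprinting: encoding
[Figure: a bad block with rows 0 to 7 and their fingerprints fp0 to fp7; for each bit of the row index there is a 0 column and a 1 column, and each fingerprint goes into the column matching that bit (fp0 and fp1 are shown being placed)]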

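CM-fingerprinting: decoding

$x_{00} = \bigoplus_{i:\ \mathrm{binary}(i) = 0{*}{*}} fp_i = fp_0 \oplus fp_1 \oplus fp_2 \oplus fp_3$
$x_{01} = \bigoplus_{i:\ \mathrm{binary}(i) = 1{*}{*}} fp_i = fp_4 \oplus fp_5 \oplus fp_6 \oplus fp_7$
$x_{10} = \bigoplus_{i:\ \mathrm{binary}(i) = {*}0{*}} fp_i = fp_0 \oplus fp_1 \oplus fp_4 \oplus fp_5$
$x_{11} = \bigoplus_{i:\ \mathrm{binary}(i) = {*}1{*}} fp_i = fp_2 \oplus fp_3 \oplus fp_6 \oplus fp_7$
$x_{20} = \bigoplus_{i:\ \mathrm{binary}(i) = {*}{*}0} fp_i = fp_0 \oplus fp_2 \oplus fp_4 \oplus fp_6$
$x_{21} = \bigoplus_{i:\ \mathrm{binary}(i) = {*}{*}1} fp_i = fp_1 \oplus fp_3 \oplus fp_5 \oplus fp_7$

[Figure: the vector x00 x01 x10 x11 x20 x21 from one server alongside y00 y01 y10 y11 y20 y21 from the other]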
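
The bucket definitions above can be sketched in Python. This is a minimal sketch that assumes the combining operator is bitwise XOR and that row fingerprints are plain integers; the function name `cm_fingerprint` and the list-of-pairs layout are hypothetical, chosen only for illustration.

```python
def cm_fingerprint(row_fps):
    """Fold per-row fingerprints into two buckets per index bit.

    row_fps[i] is the fingerprint of row i in the block. The result is
    a list of [bucket0, bucket1] pairs, most significant index bit
    first, so result[j][b] is the XOR of all fp_i whose j-th index bit
    equals b (XOR is an assumption of this sketch).
    """
    bits = max(1, (len(row_fps) - 1).bit_length())
    buckets = [[0, 0] for _ in range(bits)]
    for i, fp in enumerate(row_fps):
        for j in range(bits):
            bit = (i >> (bits - 1 - j)) & 1  # j = 0 is the leading bit
            buckets[j][bit] ^= fp
    return buckets


# Eight rows with easy-to-read fingerprints 1, 2, 4, ..., 128.
x = cm_fingerprint([1, 2, 4, 8, 16, 32, 64, 128])
# x[0][0] combines fp0..fp3, x[0][1] combines fp4..fp7, and so on.
```

With XOR, each server can build its buckets in a single pass over the block, and the two servers only need to exchange the buckets rather than every row fingerprint.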

CM-fingerprinting: decoding

Case 1: All rows agree
Compare x00 x01 x10 x11 x20 x21 with y00 y01 y10 y11 y20 y21.
Differences: 0 0 0 0 0 0

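CM-fingerprinting: decoding

Case 2: 1 row disagrees (e.g., row 3)
Compare x00 x01 x10 x11 x20 x21 with y00 y01 y10 y11 y20 y21.
Row 3 is 011 in binary, so only the buckets that contain fp3 (x00, x11, and x21) differ, all by the same value z. The other three differences are 0, and the pattern of nonzero buckets spells out the row index.
Differences: z 0 0 z 0 z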

CM-fingerprinting: decoding

Case 3: >1 rows disagree
Compare x00 x01 x10 x11 x20 x21 with y00 y01 y10 y11 y20 y21.
Differences: ? ? ? ? ? ?
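
The three decoding cases can be sketched as follows. This is a hedged sketch, not the tool's code: it assumes XOR buckets stored as [bucket0, bucket1] pairs with the most significant index bit first, and the function name `decode` and its return labels are hypothetical.

```python
def decode(x, y):
    """Compare two CM-fingerprints of the same block.

    x and y are lists of [bucket0, bucket1] pairs, leading index bit
    first. Returns ("agree", None), ("one_row", index) or
    ("many", None).
    """
    diffs = [(a0 ^ b0, a1 ^ b1) for (a0, a1), (b0, b1) in zip(x, y)]
    if all(d0 == 0 and d1 == 0 for d0, d1 in diffs):
        return ("agree", None)  # Case 1: every difference is 0

    index, z = 0, None
    for d0, d1 in diffs:
        # A single bad row touches exactly one bucket per index bit.
        if (d0 == 0) == (d1 == 0):
            return ("many", None)  # Case 3
        nonzero = d1 if d1 else d0
        if z is None:
            z = nonzero
        elif nonzero != z:
            return ("many", None)  # Case 3: differences are not all z
        index = (index << 1) | (1 if d1 else 0)
    return ("one_row", index)  # Case 2: nonzero buckets spell the index


# Buckets for fingerprints 1, 2, 4, ..., 128, then row 3 changed 8 -> 9.
x = [[15, 240], [51, 204], [85, 170]]
y = [[14, 240], [51, 205], [85, 171]]
result = decode(x, y)  # -> ("one_row", 3)
```

Several bad rows can occasionally combine into a pattern that looks like a single bad row, so a "one_row" result is a strong hint rather than proof; the third pass is what gives definite answers.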

CM-fingerprinting: analysis
[Figure: rows 0, 1, 2, 3, 4, 5, ..., n of a block]

Blocks of 1000 rows require CM-fingerprints of size
$2 \lceil \log_2 1000 \rceil \times 32 \text{ bits} = 640 \text{ bits}$
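
The size figure above can be checked with a few lines of Python. This sketch assumes the 32 bits are the width of one row fingerprint and that the logarithm is rounded up to a whole number of index bits; the function name `cm_fingerprint_bits` is hypothetical.

```python
def cm_fingerprint_bits(rows_per_block, fp_bits=32):
    """Size in bits of one CM-fingerprint: two buckets per index bit."""
    index_bits = max(1, (rows_per_block - 1).bit_length())  # ceil(log2 n)
    return 2 * index_bits * fp_bits


size = cm_fingerprint_bits(1000)   # 2 * 10 * 32 = 640 bits
naive = 1000 * 32                  # 32000 bits to send every row fingerprint
```

Sending one 32-bit fingerprint per row of a 1000-row block would take 32000 bits, so the CM-fingerprint is 50 times smaller for the same block.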

Pass 3: Consistent snapshot

Copy bad blocks and rows into a side table
Use statement-based replication for consistency

[Figure: a snapshot side table on the master and a matching snapshot side table on the slave]

Comparing Rows
[Figure: master snapshot compared with slave snapshot]

Easy with unique keys
If no unique key, order by md5(row)
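
Both comparison strategies can be sketched in Python. This is a minimal sketch under stated assumptions: snapshots are already loaded into memory, rows are tuples, `repr` stands in for however a row is serialized before hashing, and the function names are hypothetical.

```python
import hashlib


def diff_with_key(master, slave):
    """master and slave map a unique key to the row stored under it."""
    report = {"wrong_data": [], "slave_missing": [], "slave_extra": []}
    for key in sorted(master.keys() | slave.keys()):
        if key not in slave:
            report["slave_missing"].append(key)
        elif key not in master:
            report["slave_extra"].append(key)
        elif master[key] != slave[key]:
            report["wrong_data"].append(key)
    return report


def row_md5(row):
    return hashlib.md5(repr(row).encode()).hexdigest()


def diff_without_key(master_rows, slave_rows):
    """Merge two snapshots ordered by md5(row); duplicates are kept."""
    m = sorted(master_rows, key=row_md5)
    s = sorted(slave_rows, key=row_md5)
    only_master, only_slave = [], []
    i = j = 0
    while i < len(m) and j < len(s):
        hm, hs = row_md5(m[i]), row_md5(s[j])
        if hm == hs:
            i, j = i + 1, j + 1          # row present on both sides
        elif hm < hs:
            only_master.append(m[i])     # slave has no copy of this row
            i += 1
        else:
            only_slave.append(s[j])      # master has no copy of this row
            j += 1
    only_master.extend(m[i:])
    only_slave.extend(s[j:])
    return only_master, only_slave
```

Without a unique key there is no way to pair a changed row with its original, so the keyless comparison can only report rows present on one side; a row with wrong data shows up once in each list.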

Final picture: Phase 1

Narrow search to blocks
[Figure: the two servers exchange fingerprints]


Final picture: Phase 2

Narrow search to rows

[Figure: the two servers exchange n CM-fingerprints]

Final picture: Phase 3

Definitive answers
[Figure: a snapshot on each server]


On Facebook's User Databases
Rate of inconsistency: 0.0056%
Without strange tables, rate of inconsistency: 0.0027%
What did we find, and at what cost?

Finding Inconsistencies
[Chart, log scale: share of data still under suspicion after each pass]

Start: 100%
Pass 1: Checksum: 1.12%
Pass 2: CM-fingerprint: 0.014%
Pass 3: Consistent snapshot: 0.0027%

How inconsistent are blocks?

1 inconsistency: 27.8%
2 inconsistencies: 13.7%
3 inconsistencies: 2.9%
>4 inconsistencies: 55.6%

CM-fingerprint saves work only if a block has 1 inconsistency
Use smaller block size?

What kind of inconsistencies?

Of the inconsistent 0.0027%:
Different data: 99.54%
Slave missing a row: 0.41%
Slave has extra row: 0.05%

What kind of wrong data?

Tool can't tell us about causes
1 column is off: 98.5%
Off by one: 0.04%
Bad timestamp: 97.6%
Still unexplained: 2.4%
Serious(?) inconsistency rate: 0.000066%

Future Work
Master-master mode
No consistent snapshot!
Measure growth rate
Evaluate block size vs. data traffic

