Errant GTIDs Breaking Replication - How To Detect and Avoid Them
Dieter Adriaenssens
Ghent University
Who am I?
Dieter Adriaenssens
• Linux System Administrator
• MySQL DBA
• Works at Ghent University
• Open Source : former phpMyAdmin team member
• Lives in Ghent, Belgium
• Climber
• E-mail : dieter.adriaenssens@ugent.be
• Twitter : @dcadriaenssens
• Pictures :
Overview
Introduction
Replication, GTID, data consistency
Replication
Replication : Why
• High availability
• Master failover
• Disaster recovery
• Scaling load
• Regional distribution
• ...
Replication : How
GTID source ID = server UUID, e.g. 3E11FA47-71CA-11E1-9E33-C80AA9429562
GTID set
c004c0eb-c84e-11e6-8efc-aa00009002fd:1-6084195:6140951-6141015
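The notation above is a server UUID followed by one or more colon-separated transaction ranges. A minimal Python sketch of how such a string can be taken apart; the function name `parse_gtid_set` is made up for this illustration and is not part of MySQL or any tool mentioned in this talk:

```python
def parse_gtid_set(text):
    """Parse a GTID set string into {uuid: [(first, last), ...]}."""
    result = {}
    # A GTID set may be wrapped over several lines; drop all whitespace first.
    compact = "".join(text.split())
    for part in filter(None, compact.split(",")):
        # "uuid:1-5:8" -> uuid, then one or more ranges separated by ':'
        uuid, *ranges = part.split(":")
        intervals = result.setdefault(uuid.lower(), [])
        for r in ranges:
            first, _, last = r.partition("-")
            # A single transaction ("8") is the interval 8-8.
            intervals.append((int(first), int(last or first)))
    return result


gtid_set = parse_gtid_set(
    "c004c0eb-c84e-11e6-8efc-aa00009002fd:1-6084195:6140951-6141015"
)
for uuid, intervals in gtid_set.items():
    print(uuid, intervals)
```

The gap between 6084195 and 6140951 shows up as two separate intervals under the same UUID.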
ROW based replication
Replicating cluster with GTID
MySQL Server cluster
• Primary (R/W)
• Replicas (R/O)
• GTID enabled
• Row based replication
• ProxySQL to redirect traffic to correct cluster node
• Orchestrator managing the cluster nodes (automatic master failover)
orchestrator -c topology --alias=demo
node1:3306 [0s,ok,5.7.25,rw,ROW,>>,GTID]
+ node2:3306 [0s,ok,5.7.25,ro,ROW,>>,GTID]
+ node3:3306 [0s,ok,5.7.25,ro,ROW,>>,GTID]
[Diagram: ProxySQL, Node 1, Node 2, Node 3, Orchestrator]
Errant GTID
Definition, consequences, detection, examples, how to avoid, fixes
Errant GTID
Errant GTID : consequences
• Everything is fine?
• Inconsistent state between nodes
● Split brain
● Different data when reading from that replica
• Unexpected behaviour when a replica is promoted to master
● Replication might fail
● If GTID is purged from binlog, on master failover → replication stops
‘[..] the master has purged binary logs containing GTIDs that the slave requires.’
Errant GTID detection
GTID executed set
# primary
SELECT @@GLOBAL.gtid_executed;
27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120
# replica
SELECT @@GLOBAL.gtid_executed;
27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4
GTID subset
SELECT GTID_SUBSET('<gtid_executed_replica>',
'<gtid_executed_primary>');
SELECT GTID_SUBSET(
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113',
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS is_subset;
+-----------+
| is_subset |
+-----------+
|         1 |
+-----------+
1 row in set (0.00 sec)
Replica GTID set is a subset of primary GTID set : OK
GTID subset
SELECT GTID_SUBSET('<gtid_executed_replica>',
'<gtid_executed_primary>');
SELECT GTID_SUBSET(
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4',
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS is_subset;
+-----------+
| is_subset |
+-----------+
|         0 |
+-----------+
1 row in set (0.00 sec)
Replica GTID set is NOT a subset of primary GTID set
=> Errant GTID on replica
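To make the subset test concrete outside the server, here is a rough Python emulation of what GTID_SUBSET decides for the two examples above. It is only a sketch: the helper names are invented for this illustration, and the real check should still be done with the SQL function.

```python
def merge(intervals):
    """Sort intervals and merge overlapping or adjacent ones."""
    merged = []
    for first, last in sorted(intervals):
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged


def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:1-4' into {uuid: [(first, last), ...]}."""
    parsed = {}
    compact = "".join(text.split())  # GTID sets are often wrapped over lines
    for part in filter(None, compact.split(",")):
        uuid, *ranges = part.split(":")
        for r in ranges:
            first, _, last = r.partition("-")
            parsed.setdefault(uuid.lower(), []).append(
                (int(first), int(last or first))
            )
    return {uuid: merge(intervals) for uuid, intervals in parsed.items()}


def gtid_subset(subset, superset):
    """True if every GTID in `subset` also appears in `superset`."""
    a, b = parse_gtid_set(subset), parse_gtid_set(superset)
    return all(
        any(lo <= first and last <= hi for lo, hi in b.get(uuid, []))
        for uuid, intervals in a.items()
        for first, last in intervals
    )


primary = "27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120"
replica_ok = "27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113"
replica_errant = replica_ok + ",\n50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4"

print(gtid_subset(replica_ok, primary))      # replica only lags behind
print(gtid_subset(replica_errant, primary))  # replica has GTIDs the primary lacks
```

A replica that is merely behind the primary still passes the test; only GTIDs unknown to the primary make it fail.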
Find errant GTIDs
SELECT GTID_SUBTRACT('<gtid_executed_replica>',
'<gtid_executed_primary>');
SELECT GTID_SUBTRACT(
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,
50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4',
'27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120') AS errant_gtid;
+------------------------------------------+
| errant_gtid                              |
+------------------------------------------+
| 50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4 |
+------------------------------------------+
1 row in set (0.00 sec)
Result is the set of errant GTIDs
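The subtraction can also be reproduced offline, for example when only saved gtid_executed values are at hand. The following is a sketch in Python with invented helper names; it assumes untagged GTIDs in the `uuid:first-last` form shown in this talk and is not a replacement for GTID_SUBTRACT.

```python
def parse_gtid_set(text):
    """Parse a GTID set into {uuid: sorted [(first, last), ...]}."""
    parsed = {}
    compact = "".join(text.split())  # tolerate wrapped input
    for part in filter(None, compact.split(",")):
        uuid, *ranges = part.split(":")
        for r in ranges:
            first, _, last = r.partition("-")
            parsed.setdefault(uuid.lower(), []).append(
                (int(first), int(last or first))
            )
    return {uuid: sorted(iv) for uuid, iv in parsed.items()}


def subtract_intervals(a, b):
    """Return the parts of intervals `a` not covered by sorted intervals `b`."""
    out = []
    for first, last in a:
        start = first
        for lo, hi in b:
            if hi < start or lo > last:
                continue  # no overlap with the part still to be covered
            if lo > start:
                out.append((start, lo - 1))
            start = hi + 1
            if start > last:
                break
        if start <= last:
            out.append((start, last))
    return out


def gtid_subtract(left, right):
    """GTIDs that are in `left` but not in `right`, as a GTID set string."""
    a, b = parse_gtid_set(left), parse_gtid_set(right)
    parts = []
    for uuid in sorted(a):
        remaining = subtract_intervals(a[uuid], b.get(uuid, []))
        if remaining:
            ranges = ":".join(
                str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in remaining
            )
            parts.append(f"{uuid}:{ranges}")
    return ",".join(parts)


primary = "27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120"
replica = ("27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,"
           "50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4")

print(gtid_subtract(replica, primary))  # errant GTIDs on the replica
print(gtid_subtract(primary, replica))  # what the replica still has to apply
```

Swapping the arguments gives the replica's backlog instead of its errant GTIDs, which is why the argument order matters.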
Errant GTID : automatic detection
Errant GTID : detection
Monitoring check
• Automate checking for errant GTID
• Icinga compatible output format
• Uses orchestrator for cluster info
• https://github.com/UGent-DICT/check_mysql_gtid
Errant GTID : monitoring check
./check_mysql_gtid <clustername>
./check_mysql_gtid demo
Everything is fine!
Errant GTID : monitoring check
./check_mysql_gtid <clustername>
./check_mysql_gtid demo
- node3 : OK
Orchestrator
• Reports errant GTIDs (>= v3.0.13)
Errant GTID
Find transaction
• Look for GTID in binary logs
• Each binlog mentions the executed GTID
set (initial state)
• Select relevant binlog
• Find transaction in that binlog
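The "select relevant binlog" step can be sketched as follows. Each binlog starts with the GTID set executed before that file was opened, so the wanted file is the last one whose initial set does not yet contain the errant GTID. The file names and initial sets below are made up for illustration, and the function names are hypothetical.

```python
def gtid_in_set(gtid_set_text, uuid, number):
    """True if transaction `number` of `uuid` is part of the GTID set."""
    compact = "".join(gtid_set_text.split())
    for part in filter(None, compact.split(",")):
        source, *ranges = part.split(":")
        if source.lower() != uuid.lower():
            continue
        for r in ranges:
            first, _, last = r.partition("-")
            if int(first) <= number <= int(last or first):
                return True
    return False


def find_binlog(binlogs, gtid):
    """Pick the binlog that should contain `gtid`.

    `binlogs` is an ordered list of (file name, initial GTID set).
    Returns None if even the oldest file starts after the GTID (purged).
    Assumes the GTID is in gtid_executed, so if no initial set contains
    it, it must be in the newest file.
    """
    uuid, _, number = gtid.rpartition(":")
    candidate = None
    for name, initial_set in binlogs:
        if gtid_in_set(initial_set, uuid, int(number)):
            break  # already executed before this file was opened
        candidate = name
    return candidate


# Hypothetical binlog list on the replica (names and sets are invented).
PRIMARY = "27ab32d2-7f36-11e7-8bd7-aa00009002fd"
REPLICA = "50d5e9eb-c5d3-11e6-b86b-aa00009002f7"
binlogs = [
    ("binlog.000041", f"{PRIMARY}:1-7000000"),
    ("binlog.000042", f"{PRIMARY}:1-7300000,{REPLICA}:1-2"),
    ("binlog.000043", f"{PRIMARY}:1-7500000,{REPLICA}:1-4"),
]

print(find_binlog(binlogs, f"{REPLICA}:1"))
print(find_binlog(binlogs, f"{REPLICA}:3"))
```

Once the file is known, the transaction itself can be inspected with mysqlbinlog, for instance with its `--include-gtids` option.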
Errant GTID : find transaction
Errant GTID : Examples
Transactions on a replica
• Manual transactions (by accident on a replica)
• Scripted maintenance tasks (e.g. config management)
● User creation
● Database creation
• Master failover gone bad :
● Split brain
● Writes redirected to a replica (e.g. host is r/w by accident, or after a restart)
• Log flushes
Intermezzo : flush-logs
• (1) Expected behaviour, according to documentation : “FLUSH LOGS, FLUSH BINARY LOGS, FLUSH TABLES WITH READ LOCK (with or without a table list), and FLUSH TABLES tbl_name ... FOR EXPORT are not written to the binary log in any case because they would cause problems if replicated to a slave.”
https://dev.mysql.com/doc/refman/8.0/en/flush.html
• (2) Related bug report : https://bugs.mysql.com/bug.php?id=88720
• (3) introduced in MySQL 5.7.4
Avoid errant GTIDs
Avoid errant GTIDs
Use read_only:
• Set read_only on all replicas
• Preferably in the config file, to avoid a writable node after
restart
• Orchestrator can set a previous master to read_only on a
failover : ApplyMySQLPromotionAfterMasterFailover = true
Avoid errant GTIDs
Use super_read_only:
• Users with SUPER privileges can still write when read_only is set
• Limit SUPER privileges/users
• Set super_read_only on all replicas
• Orchestrator can set a previous master to super_read_only on a failover :
UseSuperReadOnly = true (>= v3.0.7)
Fix errant GTIDs
Examine situation :
• Examine transaction (binlog)
• Does it change data?
• Can’t find transaction?
• Check consistency with pt-table-checksum and pt-table-sync
• Is data consistent?
Fix errant GTIDs
Possible fixes:
• Insert empty transactions on other nodes (including primary)
• ‘Remove’ GTIDs from replica binlog
• Rollback transactions : Unsplit brain
● Talk by Shlomi Noach @ FOSDEM 2019
https://fosdem.org/2019/schedule/event/unplitmysql/
• Restore data from primary/backup
Errant GTID : Insert empty transactions
On all nodes (or only on the primary if replication still works):
● Repeat for each errant GTID
SET gtid_next='50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1';
BEGIN;
COMMIT;
SET gtid_next='50d5e9eb-c5d3-11e6-b86b-aa00009002f7:2';
BEGIN;
COMMIT;
SET gtid_next=automatic;
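Typing these statements by hand gets tedious when the errant set covers many transactions. A small Python sketch that expands an errant GTID set into the statements above; `empty_transactions` is a name invented for this example, and the output is meant to be reviewed before it is run against any server.

```python
def empty_transactions(errant_set):
    """Expand an errant GTID set into one empty transaction per GTID."""
    statements = []
    compact = "".join(errant_set.split())  # tolerate wrapped input
    for part in filter(None, compact.split(",")):
        uuid, *ranges = part.split(":")
        for r in ranges:
            first, _, last = r.partition("-")
            for number in range(int(first), int(last or first) + 1):
                statements += [
                    f"SET gtid_next='{uuid}:{number}';",
                    "BEGIN;",
                    "COMMIT;",
                ]
    if statements:
        # Always hand GTID assignment back to the server afterwards.
        statements.append("SET gtid_next=automatic;")
    return statements


sql = empty_transactions("50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4")
print("\n".join(sql))
```

For the errant set found earlier this produces four empty transactions followed by a single reset of gtid_next.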
Errant GTID : Remove from binlog
On the primary
SELECT @@GLOBAL.gtid_executed;
27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552120
On the replica
STOP SLAVE;
RESET MASTER;
SET GLOBAL gtid_purged='<gtid_executed_replica without the errant GTIDs>';
START SLAVE;
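RESET MASTER empties the replica's gtid_executed (and deletes its binary logs), so before replication restarts the replica has to be told, through gtid_purged, which GTIDs it has really applied from the primary. A minimal sketch of computing that value as "replica's executed set minus the errant GTIDs"; the helper names are invented, and the generated statements are a suggestion to review, not a tested procedure.

```python
def parse_gtid_set(text):
    """Parse a GTID set into {uuid: sorted [(first, last), ...]}."""
    parsed = {}
    compact = "".join(text.split())
    for part in filter(None, compact.split(",")):
        uuid, *ranges = part.split(":")
        for r in ranges:
            first, _, last = r.partition("-")
            parsed.setdefault(uuid.lower(), []).append(
                (int(first), int(last or first))
            )
    return {uuid: sorted(iv) for uuid, iv in parsed.items()}


def subtract_intervals(a, b):
    """Parts of intervals `a` that are not covered by sorted intervals `b`."""
    out = []
    for first, last in a:
        start = first
        for lo, hi in b:
            if hi < start or lo > last:
                continue
            if lo > start:
                out.append((start, lo - 1))
            start = hi + 1
            if start > last:
                break
        if start <= last:
            out.append((start, last))
    return out


def purged_value(replica_executed, errant):
    """GTID set the replica should keep: executed minus errant."""
    a, b = parse_gtid_set(replica_executed), parse_gtid_set(errant)
    parts = []
    for uuid in sorted(a):
        kept = subtract_intervals(a[uuid], b.get(uuid, []))
        if kept:
            ranges = ":".join(
                str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in kept
            )
            parts.append(f"{uuid}:{ranges}")
    return ",".join(parts)


replica_executed = ("27ab32d2-7f36-11e7-8bd7-aa00009002fd:1-7552113,"
                    "50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4")
errant = "50d5e9eb-c5d3-11e6-b86b-aa00009002f7:1-4"

value = purged_value(replica_executed, errant)
statements = [
    "STOP SLAVE;",
    "RESET MASTER;",  # wipes the replica's binlogs and gtid_executed
    f"SET GLOBAL gtid_purged='{value}';",
    "START SLAVE;",
]
print("\n".join(statements))
```

This only rewrites the replica's GTID bookkeeping; any data the errant transactions changed is still on the replica and needs a separate consistency check.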
Errant GTID : Fix them
Demo
Conclusion
Acknowledgements
Thanks to:
• Colleagues at Ghent University
• Tibor Korocz from Percona
• Blogposts :
● https://www.percona.com/blog/2014/05/19/errant-transactions-major-hurdle-for-gtid-based-failover-in-mysql-5-6/
● https://dzone.com/articles/how-createrestore-slave-using
● https://dzone.com/articles/mysql-replication-errant-transactions-in-gtid-base
● https://severalnines.com/blog/mysql-replication-and-gtid-based-failover-deep-dive-errant-transactions
● https://www.percona.com/blog/2013/03/26/repair-mysql-5-6-gtid-replication-by-injecting-empty-transactions/
Questions?
● Contact : @dcadriaenssens
● Monitoring check: https://github.com/UGent-DICT/check_mysql_gtid