How to get an active-active MySQL setup back in sync after hard crash
Yesterday one member of our active-active MySQL cluster experienced a hard power off. After getting the server back online, MySQL’s binary logs (both the server’s and the relay logs) turned out to be corrupted. For the sake of clarity, I will call the server that crashed ‘node1’, and the other server ‘node2’. This post will explain how to get the cluster back in sync with a minimum of data loss.
The current situation
On both servers, the replication has stopped. This is the relevant output on both nodes:
- node1:
Master_Log_File: mysql-bin.000253
Read_Master_Log_Pos: 569975131
Relay_Log_File: relay-bin.000198
Relay_Log_Pos: 179576569
Relay_Master_Log_File: mysql-bin.000253
Last_Error: Relay log read failure: Could not parse relay log event entry. The
possible reasons are: the master's binary log is corrupted (you can
check this by running 'mysqlbinlog' on the binary log), the slave's
relay log is corrupted (you can check this by running 'mysqlbinlog'
on the relay log), a network problem, or a bug in the master's or
slave's MySQL code. If you want to check the master's binary log or
slave's relay log, you will be able to know their names by issuing
'SHOW SLAVE STATUS' on this slave.
Exec_Master_Log_Pos: 419130498
- node2:
Master_Log_File: mysql-bin.000234
Read_Master_Log_Pos: 173841457
Relay_Log_File: relay-bin.000197
Relay_Log_Pos: 101852398
Relay_Master_Log_File: mysql-bin.000234
Exec_Master_Log_Pos: 173841457
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log:
'Client requested master to start replication from impossible position'
So basically, node1’s binlogs and relaylogs were corrupted: two problems to fix.
Fixing node2’s syncing from node1
The binlog is easy. When starting, MySQL always starts writing to a new binlog file. We point node2 to the head of node1’s newer binlog (mysql-bin.000235
position 4
). Execute on node2:
mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)
mysql> CHANGE MASTER TO MASTER_HOST='node1', MASTER_USER='user', MASTER_PASSWORD='pass',
MASTER_LOG_FILE='mysql-bin.000235', MASTER_LOG_POS=4;
Query OK, 0 rows affected (0.00 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
Now check the status:
mysql> show slave status;
<...>
Seconds_Behind_Master: 48988
<...>
Great - node2 is fetching node1’s binary logs. Now the next problem: node1’s relay logs are corrupt.
Fixing node1’s syncing from node2
The way to go is tell node1 to restart slaving from node2 from the position it failed: mysql-bin.000253
position 419130498
Again, nothing to hard if you know what you’re doing :-)
mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)
mysql> CHANGE MASTER TO MASTER_HOST='node1', MASTER_USER='user', MASTER_PASSWORD='pass', MASTER_LOG_FILE='mysql-bin.000253', MASTER_LOG_POS=419130498;
Query OK, 0 rows affected (0.01 sec)
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)
Again, verify the status:
mysql> show slave status;
<...>
Seconds_Behind_Master: 60590
<...>
Great, it works! Keep an eye on it for a while (show slave status;
) until it reaches 0, and you’re set again!
comments powered by Disqus