2、重新复现、详细分析与验证
- 为了方便问题排查,我们需要按照复现过程重新来一次(再把master_auto_position从1切换为0),这一次,我们每一步昨晚都先检查一些信息,以便找出问题的起源以及找到从根本上杜绝这个问题发生的方法!!
-
需要检查的信息有:
- stop slave;之前先show slave status查看一下复制的大概位置信息
- 查看并收集stop slave;时、change master to master_auto_position=0时、 start slave;时三个步骤的如下信息:
* 从库查看复制信息:show slave status;
* 从库查看slave_master_info表:select * from mysql.slave_master_info;
* 从库查看slave_relay_log_info表:select * from mysql.slave_relay_log_info;
* 从库查看slave_worker_info表:select * from mysql.slave_worker_info;
* 从库查看replication_applier_status_by_worker表:select * from performance_schema.replication_applier_status_by_worker;
* 从库查看replication_connection_status表:select * from performance_schema.replication_connection_status;
* 从库查看relay log是否被清理,解析relay log,查看所有relay log中当前的起始GTID
* 根据show slave status输出信息中的Master_Log_File、Read_Master_Log_Pos、Relay_Master_Log_File、Exec_Master_Log_Pos、Retrieved_Gtid_Set、Executed_Gtid_Set值解析主库的binlog,找到对应这些值的事务相关event信息
* 从库解析binlog file,查看show slave status输出中Executed_Gtid_Set值、Executed_Gtid_Set值相比Retrieved_Gtid_Set值缺失的GTID在从库的binlog中是否能找到
-
为了方便大家理解下文中将要提到的几个步骤中的冗长信息,这里将这些信息整理为三个脑图(这里紧列出关键信息,为了简洁,GTID去掉了UUID部分,只留下事务ID),有兴趣的可以跟着下文向下看详细信息,兴趣不大的可以从这里直接跳转到"解决方法验证"小节
- 执行stop slave语句时主从的相关详细信息
- 执行change master to master_auto_position=0语句时主从的相关详细信息
- 执行stop slave语句时主从的相关详细信息
- 执行start slave语句时主从的相关详细信息
2.1. 重新复现与详细分析
- sysbench压力根据步骤需要,可能暂停或者重新执行
- sysbench执行对主库加压
- 从库上执行show slave status\G语句查看复制状态
- admin@localhost : (none) 12:19:20> show slave status\G;
- *************************** 1. row ***************************
- Slave_IO_State: Waiting for master to send event
- ......
- Master_Log_File: mysql-bin.000038
- Read_Master_Log_Pos: 19058431
- Relay_Log_File: mysql-relay-bin.000006
- Relay_Log_Pos: 11609958
- Relay_Master_Log_File: mysql-bin.000038
- Slave_IO_Running: Yes
- Slave_SQL_Running: Yes
- ......
- Exec_Master_Log_Pos: 11609745
- ......
- Seconds_Behind_Master: 61
- Master_SSL_Verify_Server_Cert: No
- Last_IO_Errno: 0
- Last_IO_Error:
- Last_SQL_Errno: 0
- Last_SQL_Error:
- ......
- Slave_SQL_Running_State: Waiting for slave workers to process their queues
- ......
- Last_IO_Error_Timestamp:
- Last_SQL_Error_Timestamp:
- ......
- Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-764951
- Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756013
- Auto_Position: 1
- ......
- 1 row in set (0.00 sec)
- 待到从库复制有一定延迟时,执行stop slave语句
- admin@localhost : (none) 12:21:36> stop slave;
- Query OK, 0 rows affected (0.11 sec)
- 从库执行show slave status语句查看复制状态
- admin@localhost : (none) 12:21:43> show slave status\G;
- *************************** 1. row ***************************
- Slave_IO_State:
- ......
- Master_Log_File: mysql-bin.000038
- Read_Master_Log_Pos: 19989790
- Relay_Log_File: mysql-relay-bin.000006
- Relay_Log_Pos: 12148076
- Relay_Master_Log_File: mysql-bin.000038
- Slave_IO_Running: No
- Slave_SQL_Running: No
- ......
- Exec_Master_Log_Pos: 12147863
- ......
- Seconds_Behind_Master: NULL
- Master_SSL_Verify_Server_Cert: No
- Last_IO_Errno: 0
- Last_IO_Error:
- Last_SQL_Errno: 0
- Last_SQL_Error:
- ......
- Slave_SQL_Running_State:
- ......
- Last_IO_Error_Timestamp:
- Last_SQL_Error_Timestamp:
- ......
- Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069
- Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655
- Auto_Position: 1
- ......
- 1 row in set (0.00 sec)
- 现在,把对主库的sysbench进程终止
- 从库查询slave_master_info表
- admin@localhost : (none) 12:21:46> select * from mysql.slave_master_info\G;
- *************************** 1. row ***************************
- Number_of_lines: 25
- Master_log_name: mysql-bin.000038
- Master_log_pos: 19989790
- Host: 10.10.20.14
- User_name: qfsys
- User_password: letsg0
- Port: 3306
- Connect_retry: 60
- Enabled_ssl: 0
- Ssl_ca:
- Ssl_capath:
- Ssl_cert:
- Ssl_cipher:
- Ssl_key:
- Ssl_verify_server_cert: 0
- Heartbeat: 5
- Bind:
- Ignored_server_ids: 0
- Uuid: b57c75c2-6205-11e7-8d9f-525400a4b2e1
- Retry_count: 86400
- Ssl_crl:
- Ssl_crlpath:
- Enabled_auto_position: 1
- Channel_name:
- Tls_version:
- 1 row in set (0.00 sec)
- 从库查询slave_relay_log_info表
- admin@localhost : (none) 12:25:57> select * from mysql.slave_relay_log_info\G;
- *************************** 1. row ***************************
- Number_of_lines: 7
- Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
- '# 与show slave status输出中的Relay_log_pos值相等'
- Relay_log_pos: 12148076
- Master_log_name: mysql-bin.000038
- Master_log_pos: 12147863
- Sql_delay: 0
- Number_of_workers: 4
- Id: 1
- Channel_name:
- 1 row in set (0.00 sec)
- 从库查询slave_worker_info表
- admin@localhost : (none) 12:26:23> select * from mysql.slave_worker_info\G;
- *************************** 1. row ***************************
- Id: 1
- Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
- Relay_log_pos: 12146410
- Master_log_name: mysql-bin.000038
- Master_log_pos: 12146197
- Checkpoint_relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
- Checkpoint_relay_log_pos: 12130583
- Checkpoint_master_log_name: mysql-bin.000038
- Checkpoint_master_log_pos: 12130370
- Checkpoint_seqno: 18
- Checkpoint_group_size: 64
- Checkpoint_group_bitmap: HD
- Channel_name:
- *************************** 2. row ***************************
- Id: 2
- Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
- Relay_log_pos: 12144744
- Master_log_name: mysql-bin.000038
- Master_log_pos: 12144531
- Checkpoint_relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000006
- Checkpoint_relay_log_pos: 12130583
- Checkpoint_master_log_name: mysql-bin.000038
- Checkpoint_master_log_pos: 12130370
- Checkpoint_seqno: 16
- Checkpoint_group_size: 64
- Checkpoint_group_bitmap:
- 从库查询replication_applier_status_by_worker表
- admin@localhost : (none) 12:27:02> select * from performance_schema.replication_applier_status_by_worker\G;
- *************************** 1. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 1
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756653
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- *************************** 2. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 2
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756651
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- *************************** 3. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 3
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756652
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- *************************** 4. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 4
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756654
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 4 rows in set (0.00 sec)
- 从库查询replication_connection_status表
- admin@localhost : (none) 12:28:29> select * from performance_schema.replication_connection_status\G;
- *************************** 1. row ***************************
- CHANNEL_NAME:
- GROUP_NAME:
- SOURCE_UUID: b57c75c2-6205-11e7-8d9f-525400a4b2e1
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- COUNT_RECEIVED_HEARTBEATS: 10030
- LAST_HEARTBEAT_TIMESTAMP: 2017-07-09 12:19:36
- RECEIVED_TRANSACTION_SET: b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 1 row in set (0.00 sec)
- 查看gtid_executed表与show master status信息,从下面的结果中可以看到,gtid_executed表中记录的GTID事务号是741923,远小于show master status的最大GTID事务号756655,说明这个表中记录GTID信息并不是实时的
- admin@localhost : (none) 02:01:49> select * from mysql.gtid_executed\G;
- *************************** 1. row ***************************
- source_uuid: b57c75c2-6205-11e7-8d9f-525400a4b2e1
- interval_start: 1
- interval_end: 741923
- 1 row in set (0.01 sec)
- admin@localhost : (none) 02:06:54> show master status;
- +------------------+----------+--------------+------------------+-----------------------------------------------+
- | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
- +------------------+----------+--------------+------------------+-----------------------------------------------+
- | mysql-bin.000002 | 12074497 | | | b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655 |
- +------------------+----------+--------------+------------------+-----------------------------------------------+
- 1 row in set (0.00 sec)
- 查看并解析从库的relay log,找到relay log中记录的GTID起始事务号
- [root@luoxiaobo-02 ~]# cd /data/mysqldata1/relaylog/
- [root@luoxiaobo-02 relaylog]# ll
- 总用量 19532
- -rw-r-----. 1 mysql mysql 460 7月 8 22:15 mysql-relay-bin.000005
- -rw-r-----. 1 mysql mysql 19990003 7月 9 12:21 mysql-relay-bin.000006
- -rw-r-----. 1 mysql mysql 120 7月 8 22:15 mysql-relay-bin.index
- '#mysql-relay-bin.000005中没有数据相关的GTID'
- [root@luoxiaobo-02 relaylog]# mysqlbinlog -vv mysql-relay-bin.000005 |grep -i GTID_NEXT
- mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
- SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
- '#解析mysql-relay-bin.000006结果先放到一个文件暂存'
- [root@luoxiaobo-02 relaylog]# mysqlbinlog -vv mysql-relay-bin.000006 > a.sql
- [root@luoxiaobo-02 relaylog]# grep -i GTID_NEXT a.sql|head -3
- mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
- '#mysql-relay-bin.000006中的第一个GTID'
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:741926'/*!*/;
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:741927'/*!*/;
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:741928'/*!*/;
- [root@luoxiaobo-02 relaylog]# grep -i GTID_NEXT a.sql|tail -3
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766069'/*!*/;
- '#mysql-relay-bin.000006中的最后一个GTID'
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766070'/*!*/;
- SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
- '# 对于最后一个GTID,我们需要在解析结果中把完整的event组内容贴出来,方便后续排查(这里也可以看到,relay log中最优一个事务只有GTID EVENT,没有其他的EVENT,'\
- '# 说明这个事务记录的event组并不完整)'
- ......
- COMMIT/*!*/;
- # at 19989938
- #170709 12:21:43 server id 330614 end_log_pos 19989790 CRC32 0xef47a7bb GTID last_committed=24138 sequence_number=24145
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766070'/*!*/;
- BEGIN /*added by mysqlbinlog */ /*!*/;
- ROLLBACK /* added by mysqlbinlog */ /*!*/;
- SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
- DELIMITER ;
- # End of log file
- ......
- 现在,到主库上解析stop slave之后,IO和SQL协调器线程停止的位置(mysql-bin.000038文件的pos:19989790和12147863)
- '# 解析结果先暂存到一个文件'
- [root@luoxiaobo-01 binlog]# mysqlbinlog -vv mysql-bin.000038 > a.sql
- mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
- [root@luoxiaobo-01 binlog]#
- '# 打开这个文件,搜索19989790和12147863'
- ## pos:19989790
- '#上一个事务的COMMIT语句'
- COMMIT/*!*/;
- # at 19989725
- '# 从库的IO线程最后停止的pos对应的event在这里,这是一个事务event组中的第一个event'
- #170709 12:21:43 server id 330614 end_log_pos 19989790 CRC32 0xef47a7bb GTID last_committed=24138 sequence_number=24145
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766070'/*!*/;
- # at 19989790
- '#这个事务缺失的后续event组'
- #170709 12:21:43 server id 330614 end_log_pos 19989869 CRC32 0x15f97bf9 Query thread_id=8 exec_time=0 error_code=0
- SET TIMESTAMP=1499574103/*!*/;
- BEGIN
- /*!*/;
- # at 19989869
- #170709 12:21:43 server id 330614 end_log_pos 19990052 CRC32 0x5dde7aa8 Rows_query
- # UPDATE sbtest1 SET c='61636976773-16107654940-51623712437-75581944876-25176120731-37361704412-40193407500-44604516796-09703856563-85070507959' WHERE id=2507560
- # at 19990052
- #170709 12:21:43 server id 330614 end_log_pos 19990111 CRC32 0x119bffe4 Table_map: `sbtest`.`sbtest1` mapped to number 120
- # at 19990111
- #170709 12:21:43 server id 330614 end_log_pos 19990527 CRC32 0xcc9c086d Update_rows: table id 120 flags: STMT_END_F
- BINLOG '
- ......
- '/*!*/;
- ......
- # at 19990527
- #170709 12:21:43 server id 330614 end_log_pos 19990558 CRC32 0x542c9d8c Xid = 42809
- '#标记这个事务结束'
- COMMIT/*!*/;
- ......
- ## pos:12147863
- '#上一个事务的COMMIT语句'
- COMMIT/*!*/;
- # at 12147030
- #170709 12:20:38 server id 330614 end_log_pos 12147095 CRC32 0x741990e3 GTID last_committed=14722 sequence_number=14730
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:756655'/*!*/;
- # at 12147095
- #170709 12:20:38 server id 330614 end_log_pos 12147174 CRC32 0x15ecc69f Query thread_id=8 exec_time=0 error_code=0
- SET TIMESTAMP=1499574038/*!*/;
- BEGIN
- /*!*/;
- # at 12147174
- #170709 12:20:38 server id 330614 end_log_pos 12147357 CRC32 0x54ed17d2 Rows_query
- # UPDATE sbtest4 SET c='59725621137-97256297494-12005645169-65422620013-80006649645-18254436888-53122616738-78749990655-60632029435-76689256326' WHERE id=2204328
- # at 12147357
- #170709 12:20:38 server id 330614 end_log_pos 12147416 CRC32 0xc0fd34bd Table_map: `sbtest`.`sbtest4` mapped to number 123
- # at 12147416
- #170709 12:20:38 server id 330614 end_log_pos 12147832 CRC32 0x5c45d713 Update_rows: table id 123 flags: STMT_END_F
- BINLOG
- ......
- # at 12147832
- '#从库的SQL协调器线程最后停止的pos对应的event在这里,这是一个事务的event组中的最后一个event'
- #170709 12:20:38 server id 330614 end_log_pos 12147863 CRC32 0xd650cd60 Xid = 26692
- '# 标记这个事务结束'
- COMMIT/*!*/;
- ......
-
从上文中描述我们可以得知,复制突然停止时,IO线程处在一个事务开始的GTID EVENT位置,而SQL协调器线程处于一个事务结束的xid event位置,这也证实了Read_Master_Log_Pos是以event为单位进行更新的。
- 解析从库的最后一个binlog,查找最后一个事务的GTID是否与Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655 相对应,两者相等
- [root@luoxiaobo-02 relaylog]# cd ../binlog/
- [root@luoxiaobo-02 binlog]# ll
- 总用量 46196
- -rw-r-----. 1 mysql mysql 35219969 7月 8 22:11 mysql-bin.000001
- -rw-r-----. 1 mysql mysql 12074497 7月 9 12:21 mysql-bin.000002
- -rw-r-----. 1 mysql mysql 104 7月 8 22:11 mysql-bin.index
- [root@luoxiaobo-02 binlog]# mysqlbinlog -vv mysql-bin.000002 |grep -i GTID_NEXT |tail -3
- mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:756654'/*!*/;
- '# 从库的binlog中最后一个GTID 事务号是756655,与stop slave时show slave status输出的Executed_Gtid_Set中最大事务号相等'
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:756655'/*!*/;
- SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/
- 留意观察每一个步骤相对于stop slave步骤有什么变化?以下会在每一个发生变化的位置进行注解
- 从库执行change master to master_auto_position=0语句
- admin@localhost : (none) 04:00:37> change master to master_auto_position=0;
- Query OK, 0 rows affected (0.05 sec)
- 从库执行show slave status语句查看复制状态
- admin@localhost : (none) 04:02:54> show slave status\G;
- *************************** 1. row ***************************
- Slave_IO_State:
- ......
- Master_Log_File: mysql-bin.000038
- Read_Master_Log_Pos: 19989790
- '# relay log文件名从之前的mysql-relay-bin.000006变为了mysql-relay-bin.000001'
- Relay_Log_File: mysql-relay-bin.000001
- '#relay log pos从之前的12148076变为了4'
- Relay_Log_Pos: 4
- Relay_Master_Log_File: mysql-bin.000038
- Slave_IO_Running: No
- Slave_SQL_Running: No
- ......
- '# 从这里可以看到,SQL协调器线程的停止位置发生了变化,与之前stop slave;时IO线程停止位置相同了(从12147863变为了19989790)'
- Exec_Master_Log_Pos: 19989790
- ......
- Seconds_Behind_Master: NULL
- Master_SSL_Verify_Server_Cert: No
- Last_IO_Errno: 0
- Last_IO_Error:
- Last_SQL_Errno: 0
- Last_SQL_Error:
- ......
- Slave_SQL_Running_State:
- ......
- Last_IO_Error_Timestamp:
- Last_SQL_Error_Timestamp:
- ......
- '#从这里可以看到,接收的GTID SET值被清空了!'
- Retrieved_Gtid_Set:
- Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655
- Auto_Position: 0
- ......
- 1 row in set (0.00 sec)
- 从库查询slave_master_info表
- admin@localhost : (none) 04:03:04> select * from mysql.slave_master_info\G;
- *************************** 1. row ***************************
- Number_of_lines: 25
- Master_log_name: mysql-bin.000038
- Master_log_pos: 19989790
- Host: 10.10.20.14
- User_name: qfsys
- User_password: letsg0
- Port: 3306
- Connect_retry: 60
- Enabled_ssl: 0
- Ssl_ca:
- Ssl_capath:
- Ssl_cert:
- Ssl_cipher:
- Ssl_key:
- Ssl_verify_server_cert: 0
- Heartbeat: 5
- Bind:
- Ignored_server_ids: 0
- Uuid: b57c75c2-6205-11e7-8d9f-525400a4b2e1
- Retry_count: 86400
- Ssl_crl:
- Ssl_crlpath:
- '#从这里可以看到,从之前的1变为了0'
- Enabled_auto_position: 0
- Channel_name:
- Tls_version:
- 1 row in set (0.00 sec)
- 从库查询slave_relay_log_info表
- admin@localhost : (none) 04:07:18> select * from mysql.slave_relay_log_info\G;
- *************************** 1. row ***************************
- Number_of_lines: 7
- '# relay log文件名从之前的mysql-relay-bin.000006变为了mysql-relay-bin.000001'
- Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000001
- '#relay log pos从之前的12148076变为了4'
- Relay_log_pos: 4
- Master_log_name: mysql-bin.000038
- '# 从这里可以看到,SQL协调器线程的停止位置发生了变化,与之前stop slave;时IO线程停止位置相同了(从12147863变为了19989790)'
- Master_log_pos: 19989790
- Sql_delay: 0
- '# 记录worker线程数量的值变为了0'
- Number_of_workers: 0
- Id: 1
- Channel_name:
- 1 row in set (0.00 sec)
- 从库查询slave_worker_info表,从下面的结果中可以看到,worker线程信息被清空了
- admin@localhost : (none) 04:11:07> select * from mysql.slave_worker_info\G;
- Empty set (0.00 sec)
- 从库查询replication_applier_status_by_worker表,从下面的信息中可以看到,这个表记录的信息没有任何变化
- admin@localhost : (none) 04:20:17> select * from performance_schema.replication_applier_status_by_worker\G;
- *************************** 1. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 1
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756653
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- *************************** 2. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 2
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756651
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- *************************** 3. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 3
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756652
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- *************************** 4. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 4
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- LAST_SEEN_TRANSACTION: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756654
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 4 rows in set (0.00 sec)
- 从库查询replication_connection_status表
- admin@localhost : (none) 04:21:04> select * from performance_schema.replication_connection_status\G;
- *************************** 1. row ***************************
- CHANNEL_NAME:
- GROUP_NAME:
- SOURCE_UUID: b57c75c2-6205-11e7-8d9f-525400a4b2e1
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- COUNT_RECEIVED_HEARTBEATS: 10030
- LAST_HEARTBEAT_TIMESTAMP: 2017-07-09 12:19:36
- ' #从这里可以看到,之前记录的b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069 值,现在变为空了'
- RECEIVED_TRANSACTION_SET:
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 1 row in set (0.01 sec)
- 查看gtid_executed表与show master status信息,从下面的结果中可以看到,与之前stop slave时查看到的信息相比,没有任何变化
- admin@localhost : (none) 04:22:19> select * from mysql.gtid_executed\G;
- *************************** 1. row ***************************
- source_uuid: b57c75c2-6205-11e7-8d9f-525400a4b2e1
- interval_start: 1
- interval_end: 741923
- 1 row in set (0.00 sec)
- admin@localhost : (none) 04:23:56> show master status;
- +------------------+----------+--------------+------------------+-----------------------------------------------+
- | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
- +------------------+----------+--------------+------------------+-----------------------------------------------+
- | mysql-bin.000002 | 12074497 | | | b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655 |
- +------------------+----------+--------------+------------------+-----------------------------------------------+
- 1 row in set (0.00 sec)
- 查看并解析从库的relay log,找到relay log中记录的GTID起始事务号
- [root@luoxiaobo-02 ~]# cd /data/mysqldata1/relaylog/
- [root@luoxiaobo-02 relaylog]# ll
- 总用量 65424
- -rw-r--r--. 1 root root 66984802 7月 9 13:01 a.sql
- '#这里可以看到,relay log文件有变化,000002编号的文件不见了'
- -rw-r-----. 1 mysql mysql 194 7月 9 16:02 mysql-relay-bin.000001
- -rw-r-----. 1 mysql mysql 60 7月 9 16:02 mysql-relay-bin.index
- [root@luoxiaobo-02 relaylog]# mysqlbinlog -vv mysql-relay-bin.000001
- mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
- /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
- /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
- DELIMITER /*!*/;
- # at 4
- #170709 16:02:54 server id 330615 end_log_pos 123 CRC32 0xa0b99fea Start: binlog v 4, server v 5.7.18-log created 170709 16:02:54 at startup
- # This Format_description_event appears in a relay log and was generated by the slave thread.
- # at 123
- #170709 16:02:54 server id 330615 end_log_pos 194 CRC32 0x2c2d5d7f Previous-GTIDs
- '#从这里可以看到,这个Previous-GTIDs event中记录的GTID SET就是之前show slave status中的Retrieved_Gtid_Set值'
- # b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069
- SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
- DELIMITER ;
- # End of log file
- /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
- /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/
-
从上面的内容中可以看到,之前relay log中的日志全部被清理了,当前mysql-relay-bin.000001中记录的有效数据除了Previous-GTIDs event中记录的GTID SET之外,没有其他有用数据,而且这个GTID SET是使用之前show slave status输出中的Retrieved_Gtid_Set值填充的
-
主库中change master to master_auto_position=0之后的binlog这里就不需要去解析了,因为现在IO和SQL协调器线程停止的位置相同了(mysql-bin.000038文件的pos:19989790)
-
从库中change master to master_auto_position=0之后的binlog这里也不需要去解析了,因为现在还没有启动复制,从库中的数据理论上是不会发生变化的
-
通过以上发生变化的信息我们可以得出,change master to master_auto_position=0语句执行了relay log的清理,slave_worker_info表中worker线程的信息清理,SQL协调器线程位置的重置(设置为stop slave;时IO线程的位置,并更新了slave_relay_log_info表和内存中的SQL线程和relay log位置信息),更新了replication_connection_status表的RECEIVED_TRANSACTION_SET字段值为空、slave_master_info表的Enabled_auto_position字段值为0
- 留意观察每一个步骤相对于change master to master_auto_position=0步骤有什么变化?以下会在每一个发生变化的位置进行注解
- 从库执行start slave语句
- admin@localhost : (none) 05:09:02> start slave;
- Query OK, 0 rows affected (0.01 sec)
- 从库执行show slave status语句查看复制状态
- admin@localhost : (none) 05:09:06> show slave status\G;
- *************************** 1. row ***************************
- Slave_IO_State: Waiting for master to send event
- ......
- Master_Log_File: mysql-bin.000038
- '# IO线程正常运行,该位置持续正常变化'
- Read_Master_Log_Pos: 40855542
- '# relay log文件名从之前的mysql-relay-bin.000001变为了mysql-relay-bin.000002,说明新生成了一个relay log文件'
- Relay_Log_File: mysql-relay-bin.000002
- '# IO线程正常运行时,该位置持续正常变化'
- Relay_Log_Pos: 320
- Relay_Master_Log_File: mysql-bin.000038
- '#IO线程运行状态正常'
- Slave_IO_Running: Yes
- '#SQL线程未运行'
- Slave_SQL_Running: No
- ......
- '# SQL线程位置未发生变化'
- Exec_Master_Log_Pos: 19989790
- ......
- Seconds_Behind_Master: NULL
- Master_SSL_Verify_Server_Cert: No
- Last_IO_Errno: 0
- Last_IO_Error:
- Last_SQL_Errno: 1782
- '# SQL线程应用relay log时发生错误,错误binlog pos发生在master log mysql-bin.000038, end_log_pos 19989869'
- Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'NOT_YET_DETERMINED' at master \
- log mysql-bin.000038, end_log_pos 19989869. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
- ......
- Slave_SQL_Running_State:
- ......
- Last_IO_Error_Timestamp:
- '#SQL线程发生错误的时间'
- Last_SQL_Error_Timestamp: 170709 17:09:06
- ......
- '# 接收到的GTID SET重新生成了'
- Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:766071-791118
- '# 由于SQL线程报错终止,Executed_Gtid_Set位置未发生变化,说明SQL线程在复制启动时未应用任何事务就立即报错终止了'
- Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-756655
- Auto_Position: 0
- ......
- 1 row in set (0.00 sec)
- 现在,把对主库的sysbench进程终止
- 由于IO线程运行正常,相对于主库的binlog pos会持续发生正常变化,不需要查看从库的slave_master_info表
- 从库查询slave_relay_log_info表,从下面的信息中可以可以看到,除了Number_of_workers字段发生变化之外,其他字段没有发生变化
- admin@localhost : (none) 05:22:12> select * from mysql.slave_relay_log_info\G;
- *************************** 1. row ***************************
- Number_of_lines: 7
- Relay_log_name: /home/mysql/data/mysqldata1/relaylog/mysql-relay-bin.000001
- Relay_log_pos: 4
- Master_log_name: mysql-bin.000038
- Master_log_pos: 19989790
- Sql_delay: 0
- '# worker线程数量值变为4'
- Number_of_workers: 4
- Id: 1
- Channel_name:
- 1 row in set (0.00 sec)
- 从库查询slave_worker_info表,从下面的信息中可以看到,slave_worker_info中有线程信息了,但是缺没有任何有用的信息,show processlist可以看到SQL线程全部停止了
- admin@localhost : (none) 05:24:31> select * from mysql.slave_worker_info\G;
- *************************** 1. row ***************************
- Id: 1
- Relay_log_name:
- Relay_log_pos: 0
- Master_log_name:
- Master_log_pos: 0
- Checkpoint_relay_log_name:
- Checkpoint_relay_log_pos: 0
- Checkpoint_master_log_name:
- Checkpoint_master_log_pos: 0
- Checkpoint_seqno: 0
- Checkpoint_group_size: 64
- Checkpoint_group_bitmap:
- Channel_name:
- *************************** 2. row ***************************
- Id: 2
- Relay_log_name:
- Relay_log_pos: 0
- Master_log_name:
- Master_log_pos: 0
- Checkpoint_relay_log_name:
- Checkpoint_relay_log_pos: 0
- Checkpoint_master_log_name:
- Checkpoint_master_log_pos: 0
- Checkpoint_seqno: 0
- Checkpoint_group_size: 64
- Checkpoint_group_bitmap:
- Channel_name:
- *************************** 3. row ***************************
- Id: 3
- Relay_log_name:
- Relay_log_pos: 0
- Master_log_name:
- Master_log_pos: 0
- Checkpoint_relay_log_name:
- Checkpoint_relay_log_pos: 0
- Checkpoint_master_log_name:
- Checkpoint_master_log_pos: 0
- Checkpoint_seqno: 0
- Checkpoint_group_size: 64
- Checkpoint_group_bitmap:
- Channel_name:
- *************************** 4. row ***************************
- Id: 4
- Relay_log_name:
- Relay_log_pos: 0
- Master_log_name:
- Master_log_pos: 0
- Checkpoint_relay_log_name:
- Checkpoint_relay_log_pos: 0
- Checkpoint_master_log_name:
- Checkpoint_master_log_pos: 0
- Checkpoint_seqno: 0
- Checkpoint_group_size: 64
- Checkpoint_group_bitmap:
- Channel_name:
- 4 rows in set (0.00 sec)
- ERROR:
- No query specified
- admin@localhost : (none) 05:26:57> show processlist;
- +----+-------------+-----------+------+---------+------+----------------------------------+------------------+
- | Id | User | Host | db | Command | Time | State | Info |
- +----+-------------+-----------+------+---------+------+----------------------------------+------------------+
- | 13 | admin | localhost | NULL | Query | 0 | starting | show processlist |
- | 14 | system user | | NULL | Connect | 1083 | Waiting for master to send event | NULL |
- +----+-------------+-----------+------+---------+------+----------------------------------+------------------+
- 2 rows in set (0.01 sec)
- 从库查询replication_applier_status_by_worker表
- admin@localhost : (none) 05:27:09> select * from performance_schema.replication_applier_status_by_worker\G;
- *************************** 1. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 1
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- '#该字段变为空值'
- LAST_SEEN_TRANSACTION:
- '# 这里可以看到,WORKER_ID=1的线程的报错的错误代码和错误详细信息,通过前面show slave status输出Executed_Gtid_Set的值未发生变化,可以推理出在start slave;启动复制时,'\
- '# 第一个worker线程分配到的就是stop slave时IO线程读取不完整的事务,所以,start slave;时SQL线程应用第一个事务就报错停止了'
- LAST_ERROR_NUMBER: 1782
- LAST_ERROR_MESSAGE: Worker 1 failed executing transaction 'NOT_YET_DETERMINED' at master log mysql-bin.000038, end_log_pos 19989869; Error '@@SESSION.GTID_NEXT \
- cannot be set to ANONYMOUS when @@GLOBAL.GTID_MODE = ON.' on query. Default database: 'sbtest'. Query: 'BEGIN'
- LAST_ERROR_TIMESTAMP: 2017-07-09 17:09:06
- *************************** 2. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 2
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- '# 该字段变为空值'
- LAST_SEEN_TRANSACTION:
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- *************************** 3. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 3
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- '# 该字段变为空值'
- LAST_SEEN_TRANSACTION:
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- *************************** 4. row ***************************
- CHANNEL_NAME:
- WORKER_ID: 4
- THREAD_ID: NULL
- SERVICE_STATE: OFF
- '# 该字段变为空值'
- LAST_SEEN_TRANSACTION:
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 4 rows in set (0.00 sec)
- 从库查询replication_connection_status表
- admin@localhost : (none) 06:00:08> select * from performance_schema.replication_connection_status\G;
- *************************** 1. row ***************************
- CHANNEL_NAME:
- GROUP_NAME:
- SOURCE_UUID: b57c75c2-6205-11e7-8d9f-525400a4b2e1
- THREAD_ID: 56
- SERVICE_STATE: ON
- COUNT_RECEIVED_HEARTBEATS: 10703
- LAST_HEARTBEAT_TIMESTAMP: 2017-07-09 18:05:13
- ' #该字段值变为重新生成的Retrieved_Gtid_Set值'
- RECEIVED_TRANSACTION_SET: b57c75c2-6205-11e7-8d9f-525400a4b2e1:766071-791118
- LAST_ERROR_NUMBER: 0
- LAST_ERROR_MESSAGE:
- LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
- 1 row in set (0.00 sec)
-
查看gtid_executed表与show master status信息,由于SQL线程一启动就报错,所以理论上不会发生任何变化,不需要查看
- 由于IO线程正常运行,relay log会持续正常发生变化,解析relay log时只需要查看起始位置的日志即可
- [root@luoxiaobo-02 ~]# cd /data/mysqldata1/relaylog/
- [root@luoxiaobo-02 relaylog]# ll
- 总用量 93968
- -rw-r--r--. 1 root root 66984802 7月 9 13:01 a.sql
- -rw-r-----. 1 mysql mysql 247 7月 9 17:09 mysql-relay-bin.000001
- -rw-r-----. 1 mysql mysql 23258448 7月 9 18:10 mysql-relay-bin.000002
- -rw-r-----. 1 mysql mysql 120 7月 9 17:09 mysql-relay-bin.index
- '#mysql-relay-bin.000001中没有数据相关的GTID,只有一个Previous-GTIDs event记录GTID SET事务号为741924-766069'
- [root@luoxiaobo-02 relaylog]# mysqlbinlog -vv mysql-relay-bin.000001
- mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
- /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
- /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
- DELIMITER /*!*/;
- # at 4
- #170709 16:02:54 server id 330615 end_log_pos 123 CRC32 0xa0b99fea Start: binlog v 4, server v 5.7.18-log created 170709 16:02:54 at startup
- # This Format_description_event appears in a relay log and was generated by the slave thread.
- # at 123
- #170709 16:02:54 server id 330615 end_log_pos 194 CRC32 0x2c2d5d7f Previous-GTIDs
- # b57c75c2-6205-11e7-8d9f-525400a4b2e1:741924-766069
- # at 194
- #170709 17:09:06 server id 330615 end_log_pos 247 CRC32 0x1584ebb7 Rotate to mysql-relay-bin.000002 pos: 4
- SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
- DELIMITER ;
- # End of log file
- /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
- /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
- '# mysql-relay-bin.000002解析结果先放到a.sql文件中暂存'
- [root@luoxiaobo-02 relaylog]# mysqlbinlog -vv mysql-relay-bin.000002 > a.sql
- mysqlbinlog: [Warning] unknown variable 'loose_default-character-set=utf8'
- '# vim打开这个文件,查看起始位置的日志数据'
- # at 123
- '# 从这里可以看到,从第一次复现时一样,Previous-GTIDs EVENT记录了一个"[empty]"而并不是一个GTID SET'
- #170709 17:09:06 server id 330615 end_log_pos 154 CRC32 0xa1000bf3 Previous-GTIDs
- # [empty]
- # at 154
- '#这里就是SQL协调器线程报错终止的位置'
- #700101 8:00:00 server id 330614 end_log_pos 0 CRC32 0xfff429de Rotate to mysql-bin.000038 pos: 19989790
- # at 201
- #170708 22:14:19 server id 330614 end_log_pos 0 CRC32 0x40b1240b Start: binlog v 4, server v 5.7.18-log created 170708 22:14:19
- BINLOG '
- ......
- '/*!*/;
- # at 320
- ' # 这里就是worker线程报错终止的位置'
- #170709 12:21:43 server id 330614 end_log_pos 19989869 CRC32 0x15f97bf9 Query thread_id=8 exec_time=0 error_code=0
- SET TIMESTAMP=1499574103/*!*/;
- ......
- BEGIN
- /*!*/;
- # at 399
- #170709 12:21:43 server id 330614 end_log_pos 19990052 CRC32 0x5dde7aa8 Rows_query
- # UPDATE sbtest1 SET c='61636976773-16107654940-51623712437-75581944876-25176120731-37361704412-40193407500-44604516796-09703856563-85070507959' WHERE id=2507560
- # at 582
- #170709 12:21:43 server id 330614 end_log_pos 19990111 CRC32 0x119bffe4 Table_map: `sbtest`.`sbtest1` mapped to number 120
- # at 641
- #170709 12:21:43 server id 330614 end_log_pos 19990527 CRC32 0xcc9c086d Update_rows: table id 120 flags: STMT_END_F
- BINLOG
- ......
- # at 1057
- #170709 12:21:43 server id 330614 end_log_pos 19990558 CRC32 0x542c9d8c Xid = 42809
- COMMIT/*!*/;
- # at 1088
- #170709 12:21:43 server id 330614 end_log_pos 19990623 CRC32 0x7fb6a8d9 GTID last_committed=24138 sequence_number=24146
- '#从这里可以发现,relay log中第一个可见的GTID是事务号是766071,与show slave status输出的Retrieved_Gtid_Set值起始GTID事务号相同,所以可以推理出Retrieved_Gtid_Set值就是从这里开始计算起始值的'
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766071'/*!*/;
- ......
- 现在,到主库上解析start slave之后,SQL协调器线程和worker线程停止的位置(mysql-bin.000038文件的pos:19989790和19989869)
- '# 由于主库的binlog不会受到从库的影响,所以这里直接把上一次解析主库binlog的相关内容重新贴出来'
- COMMIT/*!*/; #上一个事务的COMMIT语句
- # at 19989725
- '# 从库的SQL协调器线程最后停止的pos对应的event在这里,这是一个事务event组中的第一个event'
- #170709 12:21:43 server id 330614 end_log_pos 19989790 CRC32 0xef47a7bb GTID last_committed=24138 sequence_number=24145
- SET @@SESSION.GTID_NEXT= 'b57c75c2-6205-11e7-8d9f-525400a4b2e1:766070'/*!*/;
- # at 19989790
- '#这里是worker线程停止的位置'
- #170709 12:21:43 server id 330614 end_log_pos 19989869 CRC32 0x15f97bf9 Query thread_id=8 exec_time=0 error_code=0
- SET TIMESTAMP=1499574103/*!*/;
- BEGIN
- /*!*/;
- '# at 19989869 # 19989869 位置也是下一个event开始的位置,从上文中的信息可以得知,worker线程由于丢失前面的GTID EVENT,start slave;时对于这个事务读取到的第一个even't\
- '# 就是这个Rows_query event,没有读取到GTID EVENT,所以被误认为是匿名事务,而在gtid_mode系统参数为ON时,不允许读取到匿名事务,所以报错终止了'
- #170709 12:21:43 server id 330614 end_log_pos 19990052 CRC32 0x5dde7aa8 Rows_query
- # UPDATE sbtest1 SET c='61636976773-16107654940-51623712437-75581944876-25176120731-37361704412-40193407500-44604516796-09703856563-85070507959' WHERE id=2507560
- # at 19990052
- #170709 12:21:43 server id 330614 end_log_pos 19990111 CRC32 0x119bffe4 Table_map: `sbtest`.`sbtest1` mapped to number 120
- # at 19990111
- #170709 12:21:43 server id 330614 end_log_pos 19990527 CRC32 0xcc9c086d Update_rows: table id 120 flags: STMT_END_F
- BINLOG '
- ......
- '/*!*/;
- ......
- # at 19990527
- #170709 12:21:43 server id 330614 end_log_pos 19990558 CRC32 0x542c9d8c Xid = 42809
- '#标记这个事务结束'
- COMMIT/*!*/;
- ......
-
从上文中描述我们可以得知,复制突然报错时,SQL协调器线程的停止位置并不是一个事务结束的xid event位置(与stop slave时正常停止不一样,正常停止时SQL协调器线程停止的位置是一个事务的xid event的位置,说明正常停止时,SQL线程会继续做正在执行的事务才停止),而是一个事务开始的GTID EVENT位置,这也说明了Exec_Master_Log_Pos是以event为单位进行更新的。
-
由于start slave启动复制时,SQL线程就立即报错终止,所以理论上从库数据不会发生变化,不需要解析从库的binlog
2.2. 解决方法验证
- 现在,通过上述分析过程中的详细信息,我们知道了引起报错的起因是由于主库持续写入的情况下,stop slave;导致IO线程只写入了一个事务的部分event到relay log中,然后在change master语句时mysql server清理了relay log,SQL线程被更新为IO线程停止的位置,导致start slave时SQL应用了一个event不完整的事务,所以报错了,那么,IO线程停止的位置我们不可控,SQL线程我们通过上文中的长篇解说已经知道,stop slave;的时候停止位置一定是最近一次提交事务的binlog pos,so..解决方法就是:change master语句的时候乖乖按照官方文档的要求指定master_log_file和master_log_pos选项(这样在change master语句时,IO线程就会以SQL线程的位置为准),如果你不想relay log被清理,那么就同时指定relay_log_file和relay_log_pos选项。
- 现在我们在从库重新执行stop slave;change master to master_auto_position=1;start slave; 恢复复制
- admin@localhost : (none) 06:57:51> stop slave;change master to master_auto_position=1;start slave;
- Query OK, 0 rows affected (0.00 sec)
- Query OK, 0 rows affected (0.02 sec)
- Query OK, 0 rows affected (0.02 sec)
- show slave status看看复制是否恢复正常
- admin@localhost : (none) 06:59:12> show slave status\G;
- *************************** 1. row ***************************
- Slave_IO_State: Waiting for master to send event
- ......
- Master_Log_File: mysql-bin.000038
- Read_Master_Log_Pos: 44331651
- Relay_Log_File: mysql-relay-bin.000002
- Relay_Log_Pos: 32184202
- Relay_Master_Log_File: mysql-bin.000038
- Slave_IO_Running: Yes
- Slave_SQL_Running: Yes
- ......
- Exec_Master_Log_Pos: 44331651
- ......
- Seconds_Behind_Master: 0
- Master_SSL_Verify_Server_Cert: No
- Last_IO_Errno: 0
- Last_IO_Error:
- Last_SQL_Errno: 0
- Last_SQL_Error:
- ......
- Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
- ......
- Last_IO_Error_Timestamp:
- Last_SQL_Error_Timestamp:
- ......
- Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756656-795291
- Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-795291
- Auto_Position: 1
- ......
- 1 row in set (0.00 sec)
- 从上面的结果中可以看到,复制已恢复正常,现在在主库上使用pt-table-checksum校验主从数据一致性
- [root@luoxiaobo-01 ~]# pt-table-checksum --nocheck-replication-filters --no-check-binlog-format --replicate=xiaoboluo.checksums --databases sbtest h=localhost,u=admin,p=letsg0,P=3306
- .....
- '# 如果DIFFS列出现有不为0值时,就表示主备数据不一致,可以执行如下语句执行同步,同步之后再次执行pt-table-checksum校验直到DIFFS列不会出现0值为止'
- pt-table-sync --replicate=xiaoboluo.checksums --databases sbtest h=localhost,u=admin,p=letsg0,P=3306 --execute
-
启动sysbench压测进程
- 从库等到复制有一些延迟时,执行top slave;
- admin@localhost : (none) 07:15:14> stop slave;
- Query OK, 0 rows affected (0.04 sec)
- 从库执行一次show slave status并查看Relay_Master_Log_File、Exec_Master_Log_Pos、Relay_Log_File、Relay_Log_Pos选项值
- admin@localhost : (none) 07:15:19> show slave status\G;
- *************************** 1. row ***************************
- Slave_IO_State:
- ......
- Master_Log_File: mysql-bin.000038
- Read_Master_Log_Pos: 52989383
- Relay_Log_File: mysql-relay-bin.000002
- Relay_Log_Pos: 36434532
- Relay_Master_Log_File: mysql-bin.000038
- Slave_IO_Running: No
- Slave_SQL_Running: No
- ......
- Exec_Master_Log_Pos: 48581981
- ......
- Seconds_Behind_Master: NULL
- Master_SSL_Verify_Server_Cert: No
- Last_IO_Errno: 0
- Last_IO_Error:
- Last_SQL_Errno: 0
- Last_SQL_Error:
- ......
- Last_IO_Error_Timestamp:
- Last_SQL_Error_Timestamp:
- ......
- Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756656-805843
- Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-800552
- Auto_Position: 1
- ......
- 1 row in set (0.00 sec)
- 从库执行change master语句切换master_auto_position=0,带上前面获取的四个选项值
- admin@localhost : (none) 07:15:23> change master to master_auto_position=0,master_log_file='mysql-bin.000038',master_log_pos=48581981,relay_log_file='mysql-relay-bin.000002',\
- relay_log_pos=36434532;
- Query OK, 0 rows affected (0.00 sec)
- 从库查看一下relay log,从下面的结果中可以看到relay log并没有被清理
- [root@luoxiaobo-02 binlog]# cd ../relaylog/
- [root@luoxiaobo-02 relaylog]# ll
- 总用量 144792
- -rw-r--r--. 1 root root 82262702 7月 9 18:14 a.sql
- -rw-r-----. 1 mysql mysql 247 7月 9 18:59 mysql-relay-bin.000001
- -rw-r-----. 1 mysql mysql 40841934 7月 9 19:15 mysql-relay-bin.000002
- -rw-r-----. 1 mysql mysql 120 7月 9 18:59 mysql-relay-bin.index
- 启动复制
- admin@localhost : (none) 07:17:44> start slave;
- Query OK, 0 rows affected (0.01 sec)
- 查看复制状态,从下面的结果中可以看到,复制一切正常
- admin@localhost : (none) 07:18:43> show slave status\G;
- *************************** 1. row ***************************
- Slave_IO_State: Queueing master event to the relay log
- ......
- Master_Log_File: mysql-bin.000038
- Read_Master_Log_Pos: 104115141
- Relay_Log_File: mysql-relay-bin.000002
- Relay_Log_Pos: 36931000
- Relay_Master_Log_File: mysql-bin.000038
- Slave_IO_Running: Yes
- Slave_SQL_Running: Yes
- ......
- Exec_Master_Log_Pos: 49078449
- ......
- Seconds_Behind_Master: 221
- Master_SSL_Verify_Server_Cert: No
- Last_IO_Errno: 0
- Last_IO_Error:
- Last_SQL_Errno: 0
- Last_SQL_Error:
- ......
- Slave_SQL_Running_State: Waiting for dependent transaction to commit
- ......
- Last_IO_Error_Timestamp:
- Last_SQL_Error_Timestamp:
- ......
- Retrieved_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:756656-867218
- Executed_Gtid_Set: b57c75c2-6205-11e7-8d9f-525400a4b2e1:1-801145
- Auto_Position: 0
- ......
- 1 row in set (0.14 sec)
-
现在,停止sysbench进程
- 等到复制无延迟时,再次使用pt-table-checksum再校验一把(校验结果省略,推理上这里不会出现数据不一致)
- [root@luoxiaobo-01 ~]# pt-table-checksum --nocheck-replication-filters --no-check-binlog-format --replicate=xiaoboluo.checksums --databases sbtest h=localhost,u=admin,p=letsg0,P=3306
- ......
3、总结
- 从上文中冗长的分析过程,我们可以得知,导致出问题的起因是主库持续写入的情况下,从库执行stop slave;时,IO线程读取并只向relay log中写入了一个事务的部分event(event组不完整,只写了GTID EVENT),后续change master to master_auto_position=0语句既没有指定请求主库的binlog位置,也没有指定从库的relay log,导致mysql server执行了清理relay log(清理了IO线程最后写入的GTID EVENT)以及 relay log和SQL线程相关的表和内存中的位置信息,并把SQL线程的位置指向了IO线程的停止位置(IO线程读取的主库binlog file and pos),后续start slave;时从库就从IO线程停止的位置开始复制,由于IO线程停止时的位置处在一个事务的中间,SQL线程从这个位置开始应用时,立即发现了这个事务缺少GTID event,被误认为是一个 匿名事务(gtid_mode系统参数设置为ON时,IO线程和SQL线程都不允许碰到匿名事务),所以SQL协调器线程和worker线程就都报错终止了。
- 这个问题的解决办法就是:执行change master to master_auto_position=0语句的时候,乖乖按照官方文档要求,指定master_log_file和master_log_pos选项,如果你不想relay log被清理,那么同时指定relay_log_file或relay_log_pos选项,如:change master to master_auto_position=0,master_log_file=Relay_Master_Log_File,master_log_pos=Exec_Master_Log_Pos,relay_log_file=Relay_Log_File,relay_log_pos=Relay_Log_Pos;
- 对于复制模式从master_auto_position=1切换为master_auto_position=0这种操作,个人有以下几点看法:
- change master to master_auto_position=0;不带master_log_file和master_log_pos是非常不规范的操作,官方文档中也明确说明了需要指定master_log_file和master_log_pos选项,因为master_auto_position=0时即停止使用基于GTID的复制协议,启用基于文件的复制协议,这个时候如果复制有延迟,就算mysql server没有自动清理relay log,也有可能被人为清理掉,而一旦被清理且start slave;之后,IO线程继续从上次stop slave;的位置开始请求主库的binlog,SQL线程也只能从relay log的起始位置开始重放,从而报错、丢失数据的问题就接踵而来。
- 你真的有不得不切换到master_auto_position=0的需求吗?还是一个伪需求呢?如果没有特殊的不得不切换为0的需求,建议在5.7版本中尽量使用master_auto_position=1
- 注意:
- 使用change master to master_auto_position=0不指定SQL线程位置和relay log文件时,start slave;不一定会报错,如果IO线程读取主库binlog时没有发生部分读,可能就不会发生报错这个问题(但此时可能会引发从库丢失数据的问题)。如果你想复现这个问题,你可以把master_auto_position恢复为1启动复制之后,再次切换为0(注意,切换期间,sysbench压力需要持续进行)
-
回顾与思考(卖个关子)
- 根据官方描述,change master …语句不带relay_log_file和relay_log_pos选项都会导致relay log被清理,那么,在GTID复制架构下,执行master_auto_position=1时,会不会有报错和数据丢失的情况发生呢?为什么?
* 不会,因为master_auto_position=1时,在执行start slave;语句时,IO线程向主库请求的GTID位置是根据公式"UNION(@@global.gtid_executed, Retrieved_gtid_set - last_received_GTID)"自动计算的,所以无论relay log是否清理,起始GTID位置使用包含了从库当前数据的gtid_executed位置。当然,要保证start slave;时SQL线程正常拉起,前提是上次SQL线程是正常停止的 - GTID复制架构下,如果stop slave时,IO线程正好停止在一个事务的xid event位置,然后执行change master to master_auto_position=0;start slave;语句会发生什么情况?(温馨提示:尤其需要留意这种情况,本文不做演示,请自行尝试)
* start slave;之后,IO线程会从上次停止的位置继续请求主库binlog(新生成的relay log中这个时候就不存在读取到一个事务的部分event),而SQL线程也可以继续正常执行IO线程新生成的relay log。注意:这个时候从库就发生数据丢失了,丢失的就是执行change master to master_auto_position=0;语句清理的那部分relay log
* 后续再切换为master_auto_position=1,虽然从库会把缺失的事务重新从向主库请求,也会在从库进行重放,但是。。这个时候,事务的提交顺序乱了,可能导致UPDATE、DELETE、INSERT顺序错乱,从而导致虽然数据行数主从库一致,但是行内数据可能并不一致,如果你运气好,颠倒的事务顺序之后有数据发生冲突,那么SQL线程就会报错终止,如果运气不好,顺序颠倒之后数据没有发生冲突,那么如果你不使用pt工具校验一下主备数据一致性,你可能就很难发现到底是哪里出了问题。当应用反馈说从库查询数据有异常时,你也很可能一脸懵逼! - 根据官方描述,change master …语句不带relay_log_file和relay_log_pos选项都会导致relay log被清理,那么,在GTID复制架构下,执行master_auto_position=1时,会不会有报错和数据丢失的情况发生呢?为什么?
-
全文参考资料
- https://dev.mysql.com/doc/refman/5.7/en/change-master-to.html
- https://dev.mysql.com/doc/refman/5.7/en/replication-features-transaction-inconsistencies.html
- https://dev.mysql.com/doc/refman/5.7/en/show-slave-status.html
- https://dev.mysql.com/doc/refman/5.7/en/start-slave.html
- https://dev.mysql.com/doc/refman/5.7/en/stop-slave.html
- https://dev.mysql.com/doc/refman/5.7/en/replication-mode-change-online-disable-gtids.html
- https://dev.mysql.com/doc/refman/5.7/en/replication-applier-status-by-worker-table.html
- https://dev.mysql.com/doc/refman/5.7/en/replication-connection-status-table.html
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/28218939/viewspace-2142235/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/28218939/viewspace-2142235/
本文围绕MySQL主从复制中master_auto_position从1切换为0的问题展开。详细介绍重新复现问题的过程,需检查多表信息。分析得出问题起因是操作不规范致relay log清理等,给出解决办法,还探讨切换可能引发的数据丢失、顺序错乱等情况。
&spm=1001.2101.3001.5002&articleId=100445188&d=1&t=3&u=178c5f499ec54ab7b80d8df97b60c419)
1846

被折叠的 条评论
为什么被折叠?



