MySQL MHA 故障模拟测试及 MHA 在线切换原理

四季读书网 2026-04-14 22:34:23 7 0

一、架构说明

MHA（Master High Availability）是一个用于 MySQL 主从复制管理和自动故障转移的开源工具集。MHA 的主要目的是提供 MySQL 环境的高可用性和自动故障转移功能，确保在主库发生故障时能够快速切换到备库，降低业务中断时间。

MHA 主要由以下几个组件组成：

Manager：管理节点，负责监控主库的状态，并处理自动故障转移。Manager 会不断检查数据库的主从同步状态，当主库发生故障时，会自动将备库提升为新的主库，并重新配置其他从库的复制关系。如果主从延迟，则等待所有relay log全部应用后，将VIP漂移到新主库上，整个故障转移过程对应用程序完全透明。
Node：节点，即 MySQL 实例。MHA 通过在每个 MySQL 实例上部署 MHA 脚本来对节点进行监控和操作。
SSH：MHA 使用 SSH 协议来进行节点之间的通信和操作。确保在使用 MHA 之前，通过 SSH 配置了节点之间的互信。

MHA 的工作流程通常如下：

Manager 通过 SSH 登录到节点上监控 MySQL 主库的状态，包括主从复制延迟、节点健康状况等。
当发现主库发生故障或不可用时，Manager 会触发自动故障转移流程，将备库晋升为新的主库。
Manager 会更新其他从库的复制关系，确保它们能够正确复制新的主库。

二、故障模拟测试

2.1 在线手工切换（维护切换，需要把MHA监控进程关掉）

命令如下

--停止MHA监控进程masterha_stop --global_conf=/data/mha/conf/masterha.cnf  --conf=/data/mha/conf/app.cnf--在线手工切换masterha_master_switch --master_state=alive --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --orig_master_is_new_slave --interactive=0 --running_updates_limit=60

参数注解：

–orig_master_is_new_slave：把旧的master配置为从库

–running_updates_limit=60：如果主从库同步延迟在60s内都允许切换，但是但是切换的时间长短是由recover时relay 日志的大小决定

停止MHA监控进程

[root@mysql-mha01 ~]# masterha_stop --global_conf=/data/mha/conf/masterha.cnf  --conf=/data/mha/conf/app.cnfStopped app successfully.[1]+  退出 1                nohup masterha_manager --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --ignore_last_failover < /dev/null > /data/mha/run.log 2>&1

在线手工切换

[root@mysql-mha01 ~]# masterha_master_switch --master_state=alive --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --orig_master_is_new_slave --interactive=0 --running_updates_limit=60Tue Apr 14 20:16:41 2026 - [info] MHA::MasterRotate version 0.58.Tue Apr 14 20:16:41 2026 - [info] Starting online master switch..Tue Apr 14 20:16:41 2026 - [info]Tue Apr 14 20:16:41 2026 - [info] * Phase 1: Configuration Check Phase..Tue Apr 14 20:16:41 2026 - [info]Tue Apr 14 20:16:41 2026 - [info] Reading default configuration from /data/mha/conf/masterha.cnf..Tue Apr 14 20:16:41 2026 - [info] Reading application default configuration from /data/mha/conf/app.cnf..Tue Apr 14 20:16:41 2026 - [info] Reading server configuration from /data/mha/conf/app.cnf..Tue Apr 14 20:16:42 2026 - [info] GTID failover mode = 1Tue Apr 14 20:16:42 2026 - [info] Current Alive Master: 192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info] Alive Slaves:Tue Apr 14 20:16:42 2026 - [info]   192.168.61.140(192.168.61.140:3307)  Version=8.0.44 (oldest major version between slaves) log-bin:enabledTue Apr 14 20:16:42 2026 - [info]     GTID ONTue Apr 14 20:16:42 2026 - [info]     Replicating from 192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info]     Primary candidate for the new Master (candidate_master is set)Tue Apr 14 20:16:42 2026 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..Tue Apr 14 20:16:42 2026 - [info]  ok.Tue Apr 14 20:16:42 2026 - [info] Checking MHA is not monitoring or doing failover..Tue Apr 14 20:16:42 2026 - [info] Checking replication health on 192.168.61.140..Tue Apr 14 20:16:42 2026 - [info]  ok.Tue Apr 14 20:16:42 2026 - [info] Searching new master from slaves..Tue Apr 14 20:16:42 2026 - [info]  Candidate masters from the configuration file:Tue Apr 14 20:16:42 2026 - [info]   192.168.61.139(192.168.61.139:3307)  Version=8.0.44 log-bin:enabledTue Apr 14 20:16:42 2026 - [info]     GTID ONTue Apr 14 20:16:42 2026 - [info]   192.168.61.140(192.168.61.140:3307)  Version=8.0.44 (oldest major version between slaves) log-bin:enabledTue Apr 14 20:16:42 2026 - [info]     GTID ONTue Apr 14 20:16:42 2026 - [info]     Replicating from 192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info]     Primary candidate for the new Master (candidate_master is set)Tue Apr 14 20:16:42 2026 - [info]  Non-candidate masters:Tue Apr 14 20:16:42 2026 - [info]  Searching from candidate_master slaves which have received the latest relay log events..Tue Apr 14 20:16:42 2026 - [info]From:192.168.61.139(192.168.61.139:3307) (current master) +--192.168.61.140(192.168.61.140:3307)To:192.168.61.140(192.168.61.140:3307) (new master) +--192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info] Checking whether 192.168.61.140(192.168.61.140:3307) is ok for the new master..Tue Apr 14 20:16:42 2026 - [info]  ok.Tue Apr 14 20:16:42 2026 - [info] 192.168.61.139(192.168.61.139:3307): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.Tue Apr 14 20:16:42 2026 - [info] 192.168.61.139(192.168.61.139:3307): Resetting slave pointing to the dummy host.Tue Apr 14 20:16:42 2026 - [info] ** Phase 1: Configuration Check Phase completed.Tue Apr 14 20:16:42 2026 - [info]Tue Apr 14 20:16:42 2026 - [info] * Phase 2: Rejecting updates Phase..Tue Apr 14 20:16:42 2026 - [info]Tue Apr 14 20:16:42 2026 - [info] Executing master ip online change script to disable write on the current master:Tue Apr 14 20:16:42 2026 - [info]   /data/mha/scripts/master_ip_online_change_vip --command=stop --orig_master_host=192.168.61.139 --orig_master_ip=192.168.61.139 --orig_master_port=3307 --orig_master_user='i_mha' --new_master_host=192.168.61.140 --new_master_ip=192.168.61.140 --new_master_port=3307 --new_master_user='i_mha' --orig_master_ssh_user=root --new_master_ssh_user=root   --orig_master_is_new_slave --orig_master_password=xxx --new_master_password=xxx***************************************************************Disabling the VIP - 192.168.61.141/24 on old master: 192.168.61.139***************************************************************SIOCSIFFLAGS: 无法指定被请求的地址Tue Apr 14 20:16:42 2026 - [info]  ok.Tue Apr 14 20:16:42 2026 - [info] Locking all tables on the orig master to reject updates from everybody (including root):Tue Apr 14 20:16:42 2026 - [info] Executing FLUSH TABLES WITH READ LOCK..Tue Apr 14 20:16:42 2026 - [info]  ok.Tue Apr 14 20:16:42 2026 - [info] Orig master binlog:pos is mysql-bin.000003:237.Tue Apr 14 20:16:42 2026 - [info]  Waiting to execute all relay logs on 192.168.61.140(192.168.61.140:3307)..Tue Apr 14 20:16:42 2026 - [info]  master_pos_wait(mysql-bin.000003:237) completed on 192.168.61.140(192.168.61.140:3307). Executed 0 events.Tue Apr 14 20:16:42 2026 - [info]   done.Tue Apr 14 20:16:42 2026 - [info] Getting new master's binlog name and position..Tue Apr 14 20:16:42 2026 - [info]  mysql-bin.000003:237Tue Apr 14 20:16:42 2026 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.61.140', MASTER_PORT=3307, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';Tue Apr 14 20:16:42 2026 - [info] Executing master ip online change script to allow write on the new master:Tue Apr 14 20:16:42 2026 - [info]   /data/mha/scripts/master_ip_online_change_vip --command=start --orig_master_host=192.168.61.139 --orig_master_ip=192.168.61.139 --orig_master_port=3307 --orig_master_user='i_mha' --new_master_host=192.168.61.140 --new_master_ip=192.168.61.140 --new_master_port=3307 --new_master_user='i_mha' --orig_master_ssh_user=root --new_master_ssh_user=root   --orig_master_is_new_slave --orig_master_password=xxx --new_master_password=xxx***************************************************************Enabling the VIP - 192.168.61.141/24 on new master: 192.168.61.140***************************************************************Tue Apr 14 20:16:43 2026 - [info]  ok.Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info] * Switching slaves in parallel..Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info] Unlocking all tables on the orig master:Tue Apr 14 20:16:43 2026 - [info] Executing UNLOCK TABLES..Tue Apr 14 20:16:43 2026 - [info]  ok.Tue Apr 14 20:16:43 2026 - [info] Starting orig master as a new slave..Tue Apr 14 20:16:43 2026 - [info]  Resetting slave 192.168.61.139(192.168.61.139:3307) and starting replication from the new master 192.168.61.140(192.168.61.140:3307)..Tue Apr 14 20:16:43 2026 - [info]  Executed CHANGE MASTER.Tue Apr 14 20:16:43 2026 - [info]  Slave started.Tue Apr 14 20:16:43 2026 - [info] All new slave servers switched successfully.Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info] * Phase 5: New master cleanup phase..Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info]  192.168.61.140: Resetting slave info succeeded.Tue Apr 14 20:16:43 2026 - [info] Switching master to 192.168.61.140(192.168.61.140:3307) completed successfully.

切换成功需要看到类似下面的提示：

[info] Switching master to 192.168.61.140(192.168.61.140:3307) completed successfully.

同时要查看VIP是否已经漂移到了新的主库上面，新的主库是否可读写，最后在新的主库上起manager服务。

2.1.1 MHA在线切换的原理

检查当前的配置信息及主从服务器的信息。
包括读取MHA的配置文件/data/mha/conf/app.cnf及检查当前slave的健康状态。
阻止对当前master的更新。
主要通过如下步骤：
2.1.等待1.5s（$time_until_kill_threads*100ms），等待当前连接断开。
2.2 执行 read_only=1，阻止新的DML操作
2.3 等待0.5s，等待当前DML操作完成。
2.4 kill掉所有连接。
2.5 FLUSH NO_WRITE_TO_BINLOG TABLES
2.6 FLUSH TABLES WITH READ LOCK
等待新master执行完所有的relay log。
Waiting to execute all relay logs on 192.168.61.139(192.168.61.139:3307)..
将新master的read_only设置为off，并添加VIP。
slave切换到新master上。
5.1 等待slave（192.168.61.140）应用完原主从复制产生的relay log，然后执行change master操作切换到新master上。
5.2 释放原master上加的锁。
5.3 因masterha_master_switch命令行中带有--orig_master_is_new_slave参数，故原master也切换为新master的从。
清理新master的相关信息。
主要是执行了reset slave all操作，清除之前的复制信息。

2.2 故障手工切换（MHA进程没启动或者挂了的同时主库也挂了）

2.2.1 模拟场景

主库139停manager监控服务

masterha_stop --global_conf=/data/mha/conf/masterha.cnf  --conf=/data/mha/conf/app.cnf

主库139停数据库服务

/data/mysql8.0.44/install_8.0.44/bin/mysqladmin --defaults-file=/data/mysql8.0.44/3307/conf/my.cnf -uroot -p shutdown

2.2.2 故障手工切换命令

masterha_master_switch --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --dead_master_host=192.168.61.139 --dead_master_port=3307 --master_state=dead --new_master_host=192.168.61.140  --new_master_port=3307 --ignore_last_failover

切换成功需要看到类似如下提示

Started manual(interactive) failover.Invalidated master IP address on 192.168.61.139(192.168.61.139:3307)Selected 192.168.61.140(192.168.61.140:3307) as a new master.192.168.61.140(192.168.61.140:3307): OK: Applying all logs succeeded.192.168.61.140(192.168.61.140:3307): OK: Activated master IP address.192.168.61.140(192.168.61.140:3307): Resetting slave info succeeded.Master failover to 192.168.61.140(192.168.61.140:3307) completed successfully.

表示成功切换，切换成功后，查看VIP是否漂移到了从库上(切换成功后，MHA进程会自动停止)，同时查看数据库是否可读写。

故障主库起来后，需要确认数据是否跟新的主库一样，如果一样，那么就把故障主库作为新的从库加入新主库下。然后在主库上启动MHA进程。

2.2.3 将故障主库139数据库服务拉起，然后加入集群成为从库

nohup /data/mysql8.0.44/install_8.0.44/bin/mysqld_safe --defaults-file=/data/mysql8.0.44/3307/conf/my.cnf &-- 数据库服务拉起来后，配置从库change master tomaster_host='192.168.61.140',master_port=3307,master_user='repl',master_password='Admin#dba1',master_auto_position=1;start slave;	//启动slave进程 --查看slave状态确认IO线程、SQL线程都以运行mysql> show slave status\G;	//查看当前的从库状态    Slave_IO_Running: Yes	//IO线程已运行     Slave_SQL_Running: Yes	//SQL线程已运行    mysql> show slave status\G;	//查看当前的从库状态

2.2.4 主库启动manager服务并查看MHA状态

nohup masterha_manager --global_conf=/data/mha/conf/masterha.cnf  --conf=/data/mha/conf/app.cnf --ignore_last_failover < /dev/null > /data/mha/run.log 2>&1 &masterha_check_status --global_conf=/data/mha/conf/masterha.cnf  --conf=/data/mha/conf/app.cnfapp (pid:14837) is running(0:PING_OK), master:192.168.61.140

MySQL MHA 部署方案如下👇：

MySQL高可用必选项：一份无死角的生产环境MHA部署清单

本文地址： https://sjds.net/632784.html

文章来源：四季读书网