MySQL MHA 故障模拟测试及 MHA 在线切换原理

四季读书网 3 0
MySQL MHA 故障模拟测试及 MHA 在线切换原理

一、架构说明

MHA(Master High Availability)是一个用于 MySQL 主从复制管理和自动故障转移的开源工具集。MHA 的主要目的是提供 MySQL 环境的高可用性和自动故障转移功能,确保在主库发生故障时能够快速切换到备库,降低业务中断时间。

MHA 主要由以下几个组件组成:

  • Manager:管理节点,负责监控主库的状态,并处理自动故障转移。Manager 会不断检查数据库的主从同步状态,当主库发生故障时,会自动将备库提升为新的主库,并重新配置其他从库的复制关系。如果主从延迟,则等待所有relay log全部应用后,将VIP漂移到新主库上,整个故障转移过程对应用程序完全透明。

  • Node:节点,即 MySQL 实例。MHA 通过在每个 MySQL 实例上部署 MHA 脚本来对节点进行监控和操作。

  • SSH:MHA 使用 SSH 协议来进行节点之间的通信和操作。确保在使用 MHA 之前,通过 SSH 配置了节点之间的互信。

MHA 的工作流程通常如下:

  • Manager 通过 SSH 登录到节点上监控 MySQL 主库的状态,包括主从复制延迟、节点健康状况等。

  • 当发现主库发生故障或不可用时,Manager 会触发自动故障转移流程,将备库晋升为新的主库。

  • Manager 会更新其他从库的复制关系,确保它们能够正确复制新的主库。

二、故障模拟测试

2.1 在线手工切换(维护切换,需要把MHA监控进程关掉)

命令如下

--停止MHA监控进程masterha_stop --global_conf=/data/mha/conf/masterha.cnf  --conf=/data/mha/conf/app.cnf--在线手工切换masterha_master_switch --master_state=alive --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --orig_master_is_new_slave --interactive=0 --running_updates_limit=60

参数注解:

–orig_master_is_new_slave:把旧的master配置为从库

–running_updates_limit=60:如果主从库同步延迟在60s内都允许切换,但是但是切换的时间长短是由recover时relay 日志的大小决定

停止MHA监控进程

[root@mysql-mha01 ~]# masterha_stop --global_conf=/data/mha/conf/masterha.cnf  --conf=/data/mha/conf/app.cnfStopped app successfully.[1]+  退出 1                nohup masterha_manager --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --ignore_last_failover < /dev/null > /data/mha/run.log 2>&1

在线手工切换

[root@mysql-mha01 ~]# masterha_master_switch --master_state=alive --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --orig_master_is_new_slave --interactive=0 --running_updates_limit=60Tue Apr 14 20:16:41 2026 - [info] MHA::MasterRotate version 0.58.Tue Apr 14 20:16:41 2026 - [info] Starting online master switch..Tue Apr 14 20:16:41 2026 - [info]Tue Apr 14 20:16:41 2026 - [info] * Phase 1: Configuration Check Phase..Tue Apr 14 20:16:41 2026 - [info]Tue Apr 14 20:16:41 2026 - [info] Reading default configuration from /data/mha/conf/masterha.cnf..Tue Apr 14 20:16:41 2026 - [info] Reading application default configuration from /data/mha/conf/app.cnf..Tue Apr 14 20:16:41 2026 - [info] Reading server configuration from /data/mha/conf/app.cnf..Tue Apr 14 20:16:42 2026 - [info] GTID failover mode = 1Tue Apr 14 20:16:42 2026 - [info] Current Alive Master: 192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info] Alive Slaves:Tue Apr 14 20:16:42 2026 - [info]   192.168.61.140(192.168.61.140:3307)  Version=8.0.44 (oldest major version between slaves) log-bin:enabledTue Apr 14 20:16:42 2026 - [info]     GTID ONTue Apr 14 20:16:42 2026 - [info]     Replicating from 192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info]     Primary candidate for the new Master (candidate_master is set)Tue Apr 14 20:16:42 2026 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..Tue Apr 14 20:16:42 2026 - [info]  ok.Tue Apr 14 20:16:42 2026 - [info] Checking MHA is not monitoring or doing failover..Tue Apr 14 20:16:42 2026 - [info] Checking replication health on 192.168.61.140..Tue Apr 14 20:16:42 2026 - [info]  ok.Tue Apr 14 20:16:42 2026 - [info] Searching new master from slaves..Tue Apr 14 20:16:42 2026 - [info]  Candidate masters from the configuration file:Tue Apr 14 20:16:42 2026 - [info]   192.168.61.139(192.168.61.139:3307)  Version=8.0.44 log-bin:enabledTue Apr 14 20:16:42 2026 - [info]     GTID ONTue Apr 14 20:16:42 2026 - [info]   192.168.61.140(192.168.61.140:3307)  Version=8.0.44 (oldest major version between slaves) log-bin:enabledTue Apr 14 20:16:42 2026 - [info]     GTID ONTue Apr 14 20:16:42 2026 - [info]     Replicating from 192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info]     Primary candidate for the new Master (candidate_master is set)Tue Apr 14 20:16:42 2026 - [info]  Non-candidate masters:Tue Apr 14 20:16:42 2026 - [info]  Searching from candidate_master slaves which have received the latest relay log events..Tue Apr 14 20:16:42 2026 - [info]From:192.168.61.139(192.168.61.139:3307) (current master) +--192.168.61.140(192.168.61.140:3307)To:192.168.61.140(192.168.61.140:3307) (new master) +--192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info] Checking whether 192.168.61.140(192.168.61.140:3307) is ok for the new master..Tue Apr 14 20:16:42 2026 - [info]  ok.Tue Apr 14 20:16:42 2026 - [info] 192.168.61.139(192.168.61.139:3307): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.Tue Apr 14 20:16:42 2026 - [info] 192.168.61.139(192.168.61.139:3307): Resetting slave pointing to the dummy host.Tue Apr 14 20:16:42 2026 - [info] ** Phase 1: Configuration Check Phase completed.Tue Apr 14 20:16:42 2026 - [info]Tue Apr 14 20:16:42 2026 - [info] * Phase 2: Rejecting updates Phase..Tue Apr 14 20:16:42 2026 - [info]Tue Apr 14 20:16:42 2026 - [info] Executing master ip online change script to disable write on the current master:Tue Apr 14 20:16:42 2026 - [info]   /data/mha/scripts/master_ip_online_change_vip --command=stop --orig_master_host=192.168.61.139 --orig_master_ip=192.168.61.139 --orig_master_port=3307 --orig_master_user='i_mha' --new_master_host=192.168.61.140 --new_master_ip=192.168.61.140 --new_master_port=3307 --new_master_user='i_mha' --orig_master_ssh_user=root --new_master_ssh_user=root   --orig_master_is_new_slave --orig_master_password=xxx --new_master_password=xxx***************************************************************Disabling the VIP - 192.168.61.141/24 on old master: 192.168.61.139***************************************************************SIOCSIFFLAGS: 无法指定被请求的地址Tue Apr 14 20:16:42 2026 - [info]  ok.Tue Apr 14 20:16:42 2026 - [info] Locking all tables on the orig master to reject updates from everybody (including root):Tue Apr 14 20:16:42 2026 - [info] Executing FLUSH TABLES WITH READ LOCK..Tue Apr 14 20:16:42 2026 - [info]  ok.Tue Apr 14 20:16:42 2026 - [info] Orig master binlog:pos is mysql-bin.000003:237.Tue Apr 14 20:16:42 2026 - [info]  Waiting to execute all relay logs on 192.168.61.140(192.168.61.140:3307)..Tue Apr 14 20:16:42 2026 - [info]  master_pos_wait(mysql-bin.000003:237) completed on 192.168.61.140(192.168.61.140:3307). Executed 0 events.Tue Apr 14 20:16:42 2026 - [info]   done.Tue Apr 14 20:16:42 2026 - [info] Getting new master's binlog name and position..Tue Apr 14 20:16:42 2026 - [info]  mysql-bin.000003:237Tue Apr 14 20:16:42 2026 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.61.140', MASTER_PORT=3307, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';Tue Apr 14 20:16:42 2026 - [info] Executing master ip online change script to allow write on the new master:Tue Apr 14 20:16:42 2026 - [info]   /data/mha/scripts/master_ip_online_change_vip --command=start --orig_master_host=192.168.61.139 --orig_master_ip=192.168.61.139 --orig_master_port=3307 --orig_master_user='i_mha' --new_master_host=192.168.61.140 --new_master_ip=192.168.61.140 --new_master_port=3307 --new_master_user='i_mha' --orig_master_ssh_user=root --new_master_ssh_user=root   --orig_master_is_new_slave --orig_master_password=xxx --new_master_password=xxx***************************************************************Enabling the VIP - 192.168.61.141/24 on new master: 192.168.61.140***************************************************************Tue Apr 14 20:16:43 2026 - [info]  ok.Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info] * Switching slaves in parallel..Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info] Unlocking all tables on the orig master:Tue Apr 14 20:16:43 2026 - [info] Executing UNLOCK TABLES..Tue Apr 14 20:16:43 2026 - [info]  ok.Tue Apr 14 20:16:43 2026 - [info] Starting orig master as a new slave..Tue Apr 14 20:16:43 2026 - [info]  Resetting slave 192.168.61.139(192.168.61.139:3307) and starting replication from the new master 192.168.61.140(192.168.61.140:3307)..Tue Apr 14 20:16:43 2026 - [info]  Executed CHANGE MASTER.Tue Apr 14 20:16:43 2026 - [info]  Slave started.Tue Apr 14 20:16:43 2026 - [info] All new slave servers switched successfully.Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info] * Phase 5: New master cleanup phase..Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info]  192.168.61.140: Resetting slave info succeeded.Tue Apr 14 20:16:43 2026 - [info] Switching master to 192.168.61.140(192.168.61.140:3307) completed successfully.

切换成功需要看到类似下面的提示:

[info] Switching master to 192.168.61.140(192.168.61.140:3307) completed successfully.

同时要查看VIP是否已经漂移到了新的主库上面,新的主库是否可读写,最后在新的主库上起manager服务。

2.1.1 MHA在线切换的原理

  1. 检查当前的配置信息及主从服务器的信息。

    包括读取MHA的配置文件/data/mha/conf/app.cnf及检查当前slave的健康状态。

  2. 阻止对当前master的更新。

    主要通过如下步骤:

       2.1.等待1.5s($time_until_kill_threads*100ms),等待当前连接断开。

       2.2 执行 read_only=1,阻止新的DML操作

       2.3 等待0.5s,等待当前DML操作完成。

       2.4 kill掉所有连接。

       2.5 FLUSH NO_WRITE_TO_BINLOG TABLES

       2.6 FLUSH TABLES WITH READ LOCK

  3. 等待新master执行完所有的relay log。

    Waiting to execute all relay logs on 192.168.61.139(192.168.61.139:3307)..

  4. 将新master的read_only设置为off,并添加VIP。

  5. slave切换到新master上。

    5.1 等待slave(192.168.61.140)应用完原主从复制产生的relay log,然后执行change master操作切换到新master上。

    5.2 释放原master上加的锁。

    5.3 因masterha_master_switch命令行中带有--orig_master_is_new_slave参数,故原master也切换为新master的从。

  6. 清理新master的相关信息。

    主要是执行了reset slave all操作,清除之前的复制信息。

2.2 故障手工切换(MHA进程没启动或者挂了的同时主库也挂了)

2.2.1 模拟场景

主库139停manager监控服务

masterha_stop --global_conf=/data/mha/conf/masterha.cnf  --conf=/data/mha/conf/app.cnf

主库139停数据库服务

/data/mysql8.0.44/install_8.0.44/bin/mysqladmin --defaults-file=/data/mysql8.0.44/3307/conf/my.cnf -uroot -p shutdown

2.2.2 故障手工切换命令

masterha_master_switch --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --dead_master_host=192.168.61.139 --dead_master_port=3307 --master_state=dead --new_master_host=192.168.61.140  --new_master_port=3307 --ignore_last_failover

切换成功需要看到类似如下提示

Started manual(interactive) failover.Invalidated master IP address on 192.168.61.139(192.168.61.139:3307)Selected 192.168.61.140(192.168.61.140:3307) as a new master.192.168.61.140(192.168.61.140:3307): OK: Applying all logs succeeded.192.168.61.140(192.168.61.140:3307): OK: Activated master IP address.192.168.61.140(192.168.61.140:3307): Resetting slave info succeeded.Master failover to 192.168.61.140(192.168.61.140:3307) completed successfully.

表示成功切换,切换成功后,查看VIP是否漂移到了从库上(切换成功后,MHA进程会自动停止),同时查看数据库是否可读写。

故障主库起来后,需要确认数据是否跟新的主库一样,如果一样,那么就把故障主库作为新的从库加入新主库下。然后在主库上启动MHA进程。

2.2.3 将故障主库139数据库服务拉起,然后加入集群成为从库

nohup /data/mysql8.0.44/install_8.0.44/bin/mysqld_safe --defaults-file=/data/mysql8.0.44/3307/conf/my.cnf &-- 数据库服务拉起来后,配置从库change master tomaster_host='192.168.61.140',master_port=3307,master_user='repl',master_password='Admin#dba1',master_auto_position=1;start slave;	//启动slave进程 --查看slave状态确认IO线程、SQL线程都以运行mysql> show slave status\G;	//查看当前的从库状态    Slave_IO_Running: Yes	//IO线程已运行     Slave_SQL_Running: Yes	//SQL线程已运行    mysql> show slave status\G;	//查看当前的从库状态

2.2.4 主库启动manager服务并查看MHA状态

nohup masterha_manager --global_conf=/data/mha/conf/masterha.cnf  --conf=/data/mha/conf/app.cnf --ignore_last_failover < /dev/null > /data/mha/run.log 2>&1 &masterha_check_status --global_conf=/data/mha/conf/masterha.cnf  --conf=/data/mha/conf/app.cnfapp (pid:14837is running(0:PING_OK), master:192.168.61.140
MySQL MHA 部署方案如下👇:

MySQL高可用必选项:一份无死角的生产环境MHA部署清单

抱歉,评论功能暂时关闭!