一、架构说明
MHA(Master High Availability)是一个用于 MySQL 主从复制管理和自动故障转移的开源工具集。MHA 的主要目的是提供 MySQL 环境的高可用性和自动故障转移功能,确保在主库发生故障时能够快速切换到备库,降低业务中断时间。
MHA 主要由以下几个组件组成:
Manager:管理节点,负责监控主库的状态,并处理自动故障转移。Manager 会不断检查数据库的主从同步状态,当主库发生故障时,会自动将备库提升为新的主库,并重新配置其他从库的复制关系。如果主从延迟,则等待所有relay log全部应用后,将VIP漂移到新主库上,整个故障转移过程对应用程序完全透明。
Node:节点,即 MySQL 实例。MHA 通过在每个 MySQL 实例上部署 MHA 脚本来对节点进行监控和操作。
SSH:MHA 使用 SSH 协议来进行节点之间的通信和操作。确保在使用 MHA 之前,通过 SSH 配置了节点之间的互信。
MHA 的工作流程通常如下:
Manager 通过 SSH 登录到节点上监控 MySQL 主库的状态,包括主从复制延迟、节点健康状况等。
当发现主库发生故障或不可用时,Manager 会触发自动故障转移流程,将备库晋升为新的主库。
Manager 会更新其他从库的复制关系,确保它们能够正确复制新的主库。
二、故障模拟测试
2.1 在线手工切换(维护切换,需要把MHA监控进程关掉)
命令如下
--停止MHA监控进程masterha_stop --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf--在线手工切换masterha_master_switch --master_state=alive --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --orig_master_is_new_slave --interactive=0 --running_updates_limit=60
参数注解:
–orig_master_is_new_slave:把旧的master配置为从库
–running_updates_limit=60:如果主从库同步延迟在60s内都允许切换,但是但是切换的时间长短是由recover时relay 日志的大小决定
停止MHA监控进程
[root@mysql-mha01 ~]# masterha_stop --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnfStopped app successfully.[1]+ 退出 1 nohup masterha_manager --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --ignore_last_failover < /dev/null > /data/mha/run.log 2>&1
在线手工切换
[root@mysql-mha01 ~]# masterha_master_switch --master_state=alive --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --orig_master_is_new_slave --interactive=0 --running_updates_limit=60Tue Apr 14 20:16:41 2026 - [info] MHA::MasterRotate version 0.58.Tue Apr 14 20:16:41 2026 - [info] Starting online master switch..Tue Apr 14 20:16:41 2026 - [info]Tue Apr 14 20:16:41 2026 - [info] * Phase 1: Configuration Check Phase..Tue Apr 14 20:16:41 2026 - [info]Tue Apr 14 20:16:41 2026 - [info] Reading default configuration from /data/mha/conf/masterha.cnf..Tue Apr 14 20:16:41 2026 - [info] Reading application default configuration from /data/mha/conf/app.cnf..Tue Apr 14 20:16:41 2026 - [info] Reading server configuration from /data/mha/conf/app.cnf..Tue Apr 14 20:16:42 2026 - [info] GTID failover mode = 1Tue Apr 14 20:16:42 2026 - [info] Current Alive Master: 192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info] Alive Slaves:Tue Apr 14 20:16:42 2026 - [info] 192.168.61.140(192.168.61.140:3307) Version=8.0.44 (oldest major version between slaves) log-bin:enabledTue Apr 14 20:16:42 2026 - [info] GTID ONTue Apr 14 20:16:42 2026 - [info] Replicating from 192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info] Primary candidate for the new Master (candidate_master is set)Tue Apr 14 20:16:42 2026 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..Tue Apr 14 20:16:42 2026 - [info] ok.Tue Apr 14 20:16:42 2026 - [info] Checking MHA is not monitoring or doing failover..Tue Apr 14 20:16:42 2026 - [info] Checking replication health on 192.168.61.140..Tue Apr 14 20:16:42 2026 - [info] ok.Tue Apr 14 20:16:42 2026 - [info] Searching new master from slaves..Tue Apr 14 20:16:42 2026 - [info] Candidate masters from the configuration file:Tue Apr 14 20:16:42 2026 - [info] 192.168.61.139(192.168.61.139:3307) Version=8.0.44 log-bin:enabledTue Apr 14 20:16:42 2026 - [info] GTID ONTue Apr 14 20:16:42 2026 - [info] 192.168.61.140(192.168.61.140:3307) Version=8.0.44 (oldest major version between slaves) log-bin:enabledTue Apr 14 20:16:42 2026 - [info] GTID ONTue Apr 14 20:16:42 2026 - [info] Replicating from 192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info] Primary candidate for the new Master (candidate_master is set)Tue Apr 14 20:16:42 2026 - [info] Non-candidate masters:Tue Apr 14 20:16:42 2026 - [info] Searching from candidate_master slaves which have received the latest relay log events..Tue Apr 14 20:16:42 2026 - [info]From:192.168.61.139(192.168.61.139:3307) (current master)+--192.168.61.140(192.168.61.140:3307)To:192.168.61.140(192.168.61.140:3307) (new master)+--192.168.61.139(192.168.61.139:3307)Tue Apr 14 20:16:42 2026 - [info] Checking whether 192.168.61.140(192.168.61.140:3307) is ok for the new master..Tue Apr 14 20:16:42 2026 - [info] ok.Tue Apr 14 20:16:42 2026 - [info] 192.168.61.139(192.168.61.139:3307): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.Tue Apr 14 20:16:42 2026 - [info] 192.168.61.139(192.168.61.139:3307): Resetting slave pointing to the dummy host.Tue Apr 14 20:16:42 2026 - [info] ** Phase 1: Configuration Check Phase completed.Tue Apr 14 20:16:42 2026 - [info]Tue Apr 14 20:16:42 2026 - [info] * Phase 2: Rejecting updates Phase..Tue Apr 14 20:16:42 2026 - [info]Tue Apr 14 20:16:42 2026 - [info] Executing master ip online change script to disable write on the current master:Tue Apr 14 20:16:42 2026 - [info] /data/mha/scripts/master_ip_online_change_vip --command=stop --orig_master_host=192.168.61.139 --orig_master_ip=192.168.61.139 --orig_master_port=3307 --orig_master_user='i_mha' --new_master_host=192.168.61.140 --new_master_ip=192.168.61.140 --new_master_port=3307 --new_master_user='i_mha' --orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slave --orig_master_password=xxx --new_master_password=xxx***************************************************************Disabling the VIP - 192.168.61.141/24 on old master: 192.168.61.139***************************************************************SIOCSIFFLAGS: 无法指定被请求的地址Tue Apr 14 20:16:42 2026 - [info] ok.Tue Apr 14 20:16:42 2026 - [info] Locking all tables on the orig master to reject updates from everybody (including root):Tue Apr 14 20:16:42 2026 - [info] Executing FLUSH TABLES WITH READ LOCK..Tue Apr 14 20:16:42 2026 - [info] ok.Tue Apr 14 20:16:42 2026 - [info] Orig master binlog:pos is mysql-bin.000003:237.Tue Apr 14 20:16:42 2026 - [info] Waiting to execute all relay logs on 192.168.61.140(192.168.61.140:3307)..Tue Apr 14 20:16:42 2026 - [info] master_pos_wait(mysql-bin.000003:237) completed on 192.168.61.140(192.168.61.140:3307). Executed 0 events.Tue Apr 14 20:16:42 2026 - [info] done.Tue Apr 14 20:16:42 2026 - [info] Getting new master's binlog name and position..Tue Apr 14 20:16:42 2026 - [info] mysql-bin.000003:237Tue Apr 14 20:16:42 2026 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.61.140', MASTER_PORT=3307, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';Tue Apr 14 20:16:42 2026 - [info] Executing master ip online change script to allow write on the new master:Tue Apr 14 20:16:42 2026 - [info] /data/mha/scripts/master_ip_online_change_vip --command=start --orig_master_host=192.168.61.139 --orig_master_ip=192.168.61.139 --orig_master_port=3307 --orig_master_user='i_mha' --new_master_host=192.168.61.140 --new_master_ip=192.168.61.140 --new_master_port=3307 --new_master_user='i_mha' --orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slave --orig_master_password=xxx --new_master_password=xxx***************************************************************Enabling the VIP - 192.168.61.141/24 on new master: 192.168.61.140***************************************************************Tue Apr 14 20:16:43 2026 - [info] ok.Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info] * Switching slaves in parallel..Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info] Unlocking all tables on the orig master:Tue Apr 14 20:16:43 2026 - [info] Executing UNLOCK TABLES..Tue Apr 14 20:16:43 2026 - [info] ok.Tue Apr 14 20:16:43 2026 - [info] Starting orig master as a new slave..Tue Apr 14 20:16:43 2026 - [info] Resetting slave 192.168.61.139(192.168.61.139:3307) and starting replication from the new master 192.168.61.140(192.168.61.140:3307)..Tue Apr 14 20:16:43 2026 - [info] Executed CHANGE MASTER.Tue Apr 14 20:16:43 2026 - [info] Slave started.Tue Apr 14 20:16:43 2026 - [info] All new slave servers switched successfully.Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info] * Phase 5: New master cleanup phase..Tue Apr 14 20:16:43 2026 - [info]Tue Apr 14 20:16:43 2026 - [info] 192.168.61.140: Resetting slave info succeeded.Tue Apr 14 20:16:43 2026 - [info] Switching master to 192.168.61.140(192.168.61.140:3307) completed successfully.
切换成功需要看到类似下面的提示:
[info] Switching master to 192.168.61.140(192.168.61.140:3307) completed successfully.
同时要查看VIP是否已经漂移到了新的主库上面,新的主库是否可读写,最后在新的主库上起manager服务。
2.1.1 MHA在线切换的原理
检查当前的配置信息及主从服务器的信息。
包括读取MHA的配置文件/data/mha/conf/app.cnf及检查当前slave的健康状态。
阻止对当前master的更新。
主要通过如下步骤:
2.1.等待1.5s($time_until_kill_threads*100ms),等待当前连接断开。
2.2 执行 read_only=1,阻止新的DML操作
2.3 等待0.5s,等待当前DML操作完成。
2.4 kill掉所有连接。
2.5 FLUSH NO_WRITE_TO_BINLOG TABLES
2.6 FLUSH TABLES WITH READ LOCK
等待新master执行完所有的relay log。
Waiting to execute all relay logs on 192.168.61.139(192.168.61.139:3307)..
将新master的read_only设置为off,并添加VIP。
slave切换到新master上。
5.1 等待slave(192.168.61.140)应用完原主从复制产生的relay log,然后执行change master操作切换到新master上。
5.2 释放原master上加的锁。
5.3 因masterha_master_switch命令行中带有--orig_master_is_new_slave参数,故原master也切换为新master的从。
清理新master的相关信息。
主要是执行了reset slave all操作,清除之前的复制信息。
2.2 故障手工切换(MHA进程没启动或者挂了的同时主库也挂了)
2.2.1 模拟场景
主库139停manager监控服务
masterha_stop --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf主库139停数据库服务
/data/mysql8.0.44/install_8.0.44/bin/mysqladmin --defaults-file=/data/mysql8.0.44/3307/conf/my.cnf -uroot -p shutdown2.2.2 故障手工切换命令
masterha_master_switch --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --dead_master_host=192.168.61.139 --dead_master_port=3307 --master_state=dead --new_master_host=192.168.61.140 --new_master_port=3307 --ignore_last_failover切换成功需要看到类似如下提示
Started manual(interactive) failover.Invalidated master IP address on 192.168.61.139(192.168.61.139:3307)Selected 192.168.61.140(192.168.61.140:3307) as a new master.192.168.61.140(192.168.61.140:3307): OK: Applying all logs succeeded.192.168.61.140(192.168.61.140:3307): OK: Activated master IP address.192.168.61.140(192.168.61.140:3307): Resetting slave info succeeded.Master failover to 192.168.61.140(192.168.61.140:3307) completed successfully.
表示成功切换,切换成功后,查看VIP是否漂移到了从库上(切换成功后,MHA进程会自动停止),同时查看数据库是否可读写。
故障主库起来后,需要确认数据是否跟新的主库一样,如果一样,那么就把故障主库作为新的从库加入新主库下。然后在主库上启动MHA进程。
2.2.3 将故障主库139数据库服务拉起,然后加入集群成为从库
nohup/data/mysql8.0.44/install_8.0.44/bin/mysqld_safe --defaults-file=/data/mysql8.0.44/3307/conf/my.cnf &-- 数据库服务拉起来后,配置从库change master tomaster_host='192.168.61.140',master_port=3307,master_user='repl',master_password='Admin#dba1',master_auto_position=1;start slave; //启动slave进程--查看slave状态确认IO线程、SQL线程都以运行mysql> show slave status\G; //查看当前的从库状态Slave_IO_Running: Yes //IO线程已运行Slave_SQL_Running: Yes //SQL线程已运行mysql> show slave status\G; //查看当前的从库状态
2.2.4 主库启动manager服务并查看MHA状态
nohup masterha_manager --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnf --ignore_last_failover < /dev/null > /data/mha/run.log 2>&1 &masterha_check_status --global_conf=/data/mha/conf/masterha.cnf --conf=/data/mha/conf/app.cnfapp (pid:14837) is running(0:PING_OK), master:192.168.61.140