
replication failure on db2115 and db2215
Closed, Resolved · Public

Description

  • Both db2115 and db2215 are showing replication errors after a replication restart.
  • db2115 was reimaged on 2024-03-14 and used as the source to clone db2215 on 2024-03-27. Here are the SAL records for db2115.
  • Both servers are showing errors upon trying to catch up with the global cluster state:
$ sudo mysql -e 'show slave status\G'|head -2
*************************** 1. row ***************************
                Slave_IO_State: Waiting to reconnect after a failed master event read
  • The network looks OK, at least from the command line's point of view:
$ sudo mysql -h db2196.codfw.wmnet -u repl -p
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
  • No significant log entries have been found at this stage (see the status-check sketch below).
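
For reference, a minimal sketch of this kind of check on the replicas; the grep field list and the journal unit name are assumptions, not taken verbatim from the task:

$ sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_IO_State|Slave_IO_Running|Slave_SQL_Running|Last_IO_Errno|Last_IO_Error|Seconds_Behind_Master'
$ sudo journalctl -u mariadb --since '1 hour ago' --no-pager | tail -n 50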

Event Timeline

It looks like a transient network error or something else causing a connection error (TLS?). It seems to be working now; did you do something?

Ignore the above: it switches to Slave_IO_Running: Yes, but the replication position does not advance.
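
One way to confirm the position is stuck is to poll the binlog coordinates reported by SHOW SLAVE STATUS; a sketch (the two-second interval is arbitrary, re-running the command a few times by hand works just as well):

$ watch -n2 "sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Read_Master_Log_Pos|Exec_Master_Log_Pos|Seconds_Behind_Master'"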

Nothing obvious on the general log (systemd journal).

I found something weird with db2196: the slave host table is full of duplicate entries, so I am quite sure the problem is on the primary, not the replicas. Something is wrong with the primary, which is killing the replica threads:

db2196[(none)]> SHOW SLAVE HOSTS;
(multiple duplicate connections)

| 120932605 | repl            | 10.192.48.120:43862  | NULL      | Killed      |   10956 | Waiting to finalize termination   >
| 120933988 | repl            | 10.192.48.120:37886  | NULL      | Killed      |   10475 | Waiting to finalize termination   >
| 120936221 | repl            | 10.192.48.120:58408  | NULL      | Killed      |    9639 | starting                          >
| 120937478 | repl            | 10.192.48.120:46634  | NULL      | Killed      |    9143 | starting                          >
| 120937813 | repl            | 10.192.48.120:50328  | NULL      | Killed      |    9023 | starting                          >
| 120938162 | repl            | 10.192.48.120:34328  | NULL      | Killed      |    8902 | starting                          >
| 120938493 | repl            | 10.192.48.120:41872  | NULL      | Killed      |    8782 | starting                          >
| 120938783 | repl            | 10.192.32.134:49842  | NULL      | Killed      |    8672 | starting                          >
| 120938814 | repl            | 10.192.48.120:52786  | NULL      | Killed      |    8662 | starting                          >
| 120938946 | repl            | 10.192.32.134:38980  | NULL      | Killed      |    8612 | starting
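
The truncated columns above (id, user, host, db, command, time, state) are processlist-style output; a sketch of an equivalent query to list the stuck repl connections on the primary (not necessarily the exact statement run here):

db2196[(none)]> SELECT id, user, host, command, time, state FROM information_schema.processlist WHERE user = 'repl' ORDER BY time DESC;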

I tried a bin_log=0; flush hosts; but that wasn't the issue. I don't want to touch it further and risk breaking codfw's primary, as I am about to be out of office for a few days.
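
For the record, a sketch of how that attempt is usually written, assuming the intent was to flush the host cache on the primary without writing the statement to the binlog (sql_log_bin is the stock variable name):

db2196[(none)]> SET SESSION sql_log_bin = 0;
db2196[(none)]> FLUSH HOSTS;
db2196[(none)]> SET SESSION sql_log_bin = 1;

FLUSH LOCAL HOSTS; achieves the same without touching sql_log_bin.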

I will stop replication on both db2115 and db2215 to mitigate the issue on the primary and keep it from extending to the other hosts. That will prevent overwhelming the primary, as the other replicas look OK so far. I think this will require a restart of the primary to get back to a healthy state, but I am not touching it for now.
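
A sketch of that mitigation on each of the two replicas (a plain STOP SLAVE followed by a check that both threads have stopped; the actual tooling used may differ):

$ sudo mysql -e 'STOP SLAVE'
$ sudo mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_IO_Running|Slave_SQL_Running'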

This should be enough to keep us up until Monday, when a decision can be made to either fail over or restart the primary.

jcrespo triaged this task as High priority.Mar 27 2024, 6:29 PM

Mentioned in SAL (#wikimedia-operations) [2024-03-28T07:28:45Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 5 days, 2:00:00 on db[2115,2215].codfw.wmnet with reason: Downtime until tuesday (T361133)

Mentioned in SAL (#wikimedia-operations) [2024-03-28T07:28:49Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 2:00:00 on db[2115,2215].codfw.wmnet with reason: Downtime until tuesday (T361133)

There is no need to restart the intermediate master for now; it looks like it was a hiccup with semi-sync. I have disabled it on the codfw master and the slaves are catching up. Once they are in sync, I will enable it again and see how it behaves.
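
For context, on the master this comes down to flipping a global variable; a sketch of the disable/re-enable cycle using the stock MariaDB variable and status names (the exact procedure used here may differ):

db2196[(none)]> SET GLOBAL rpl_semi_sync_master_enabled = OFF;  -- let replicas catch up asynchronously
db2196[(none)]> SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_clients';
db2196[(none)]> SET GLOBAL rpl_semi_sync_master_enabled = ON;   -- once they are back in sync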

Hosts caught up. I have re-enabled semi-sync on the master and so far so good. I am going to give it 24h before calling this fixed.

This has all been fine. Closing as resolved. I have repooled db2115 but NOT db2215, as it is in the process of being productionized, so I will leave that last step to @ABran-WMF