Network errors and MySQL replication recovery
Closed, Resolved (Public)

Assigned To

Authored By

	• cwdent
	Sep 19 2017, 9:04 PM

Description

In the last week or two, all of our replag has been confined to frdb2001 at codfw. In the same time frame we got Prometheus monitoring working. Since the other servers have not lagged since then, I have no corresponding graphs for them and can't say whether the cause is the same. But this seems to tell a story:

https://grafana.wikimedia.org/dashboard/db/frdb2001?orgId=1&from=1505747335168&to=1505854264832

During each instance of replag, database write activity appears to cease. I can log in during that time, so it isn't a full network outage, but we are wondering whether the replication thread has failed to recover from an earlier problem. strace shows SELECT queries from the monitoring software arriving as normal.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• cwdent	T173472 fundraising database replication lag master thread
		Resolved		• cwdent	T176266 Network errors and MySQL replication recovery

Event Timeline

• cwdent created this task. Sep 19 2017, 9:04 PM

• cwdent renamed this task from Network errors and replication recovery to Network errors and MySQL replication recovery. Sep 19 2017, 9:12 PM

The BGP session over IPsec between the two sites has been regularly flapping, and thus briefly breaking the TCP session.
Unless urgent, I'd recommend waiting for the eqiad upgrade before more advanced troubleshooting.
It might also be worth looking at MySQL for faster failure detection and re-connection.

@ayounsi thank you for the info! Agreed that digging in right before the firewall switch is not a good use of time. In the meantime I will look into the MySQL side some more.

I lowered slave_net_timeout from 3600s (the default in MariaDB 10.2.3 and earlier) to 60s (the default in later versions), and we haven't seen this since. Closing the task; I will reopen it if it happens again with the new setting.
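For anyone repeating this change, here is a rough sketch of the statements involved, written as a small Python helper that only builds the SQL text and does not connect to anything. The function name `slave_net_timeout_statements` is made up for illustration. Restarting the I/O thread is an assumption on my part: as far as I know the replica reads the timeout when it opens its connection to the master, so an already-running thread may keep the old value.

```python
from typing import List


def slave_net_timeout_statements(timeout_s: int = 60) -> List[str]:
    """Return the SQL statements, in order, for applying a new slave_net_timeout.

    Hypothetical helper: it only builds strings, so the statements can be
    reviewed before being run by hand in a mysql client session.
    """
    if not isinstance(timeout_s, int) or timeout_s < 1:
        raise ValueError("timeout_s must be a positive integer number of seconds")
    return [
        # slave_net_timeout is a dynamic global variable, so no server restart is needed.
        f"SET GLOBAL slave_net_timeout = {timeout_s};",
        # Assumption: bounce only the I/O thread so the new timeout applies
        # to a fresh connection; the SQL thread is left alone.
        "STOP SLAVE IO_THREAD;",
        "START SLAVE IO_THREAD;",
        # Confirm the value the server now reports.
        "SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';",
    ]


if __name__ == "__main__":
    for stmt in slave_net_timeout_statements(60):
        print(stmt)
```

SET GLOBAL does not survive a server restart, so the same value would presumably also need to go into the server's configuration file (slave_net_timeout = 60 under [mysqld]) to persist.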

Dwisehaupt moved this task from Triage to Done on the fundraising-tech-ops board. Feb 13 2020, 9:34 PM
