
Network errors and MySQL replication recovery
Closed, Resolved (Public)

Description

For the last week or two, all of our replication lag (replag) has been confined to frdb2001 at codfw. In the same time frame we got Prometheus monitoring working. Because the other servers have not lagged since then, I have no corresponding graphs for them and can't say whether the cause is the same. But this seems to tell a story:

https://grafana.wikimedia.org/dashboard/db/frdb2001?orgId=1&from=1505747335168&to=1505854264832

During each instance of replag, DB write activity appears to cease. I can log in during that time, so it isn't a full network outage, but we are wondering whether the replication thread has failed to recover from an earlier problem. strace shows SELECT queries from the monitoring software arriving as normal.

Event Timeline

cwdent renamed this task from Network errors and replication recovery to Network errors and MySQL replication recovery. Sep 19 2017, 9:12 PM

The BGP session over IPsec between the two sites has been regularly flapping, and thus briefly breaking the TCP session.
Unless urgent, I'd recommend waiting for the eqiad upgrade before more advanced troubleshooting.
It might also be worth looking at the MySQL settings for faster failure detection and reconnection.

@ayounsi thank you for the info! Agreed that digging in right before the firewall switch is not a good use of time. In the meantime I will look into the MySQL side some more.

I lowered slave_net_timeout from 3600s (the default up to MariaDB 10.2.3) to 60s (the newer default), and we haven't seen this since. Closing the task; will reopen if it happens again with the new setting.
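
For reference, a rough sketch of the kind of change described above. It is written in Python only to render the SQL statements, and it is not the exact commands that were run. The helper name slave_net_timeout_statements is hypothetical. Restarting the replication threads so the new value applies to the next connection to the master is an assumption, not something recorded in this task.

```python
from typing import List


def slave_net_timeout_statements(timeout_s: int = 60) -> List[str]:
    """Render SQL to lower slave_net_timeout on a replica (hypothetical helper).

    slave_net_timeout is the number of seconds the replica waits for data
    from the master before treating the connection as broken and reconnecting.
    """
    if timeout_s < 1:
        raise ValueError("timeout_s must be a positive number of seconds")
    return [
        # Change the running server's value.
        f"SET GLOBAL slave_net_timeout = {int(timeout_s)};",
        # Assumption: restart the replication threads so the new timeout
        # is used by the next connection to the master.
        "STOP SLAVE;",
        "START SLAVE;",
        # Confirm the value that is now in effect.
        "SHOW GLOBAL VARIABLES LIKE 'slave_net_timeout';",
    ]


if __name__ == "__main__":
    print("\n".join(slave_net_timeout_statements(60)))
```

To survive a server restart, the same value would presumably also need to be set in the server configuration file (slave_net_timeout = 60 under [mysqld]).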