
LinksUpdate fails, spams exception logs, whenever replication lag on any server rises above 10s
Closed, Resolved · Public

Description

As part of T95501, it was decided to do LinksUpdates in a loop of commitAndWaitForReplication() in order to avoid replication lag. This means that if any server in the cluster is lagged, even if it has 0.01% load, waitForReplication() will time out and throw an exception after a hard-coded 10 seconds. Unlike when the same problem occurs in jobs, there is no possibility of the commit being retried, so the links will just be wrong forever.

A possible fix: after a short timeout of say 1s, ignore the error and continue with the next transaction. This will throttle the update and thus limit the impact of update jobs on the replication lag, but still allow the update to complete.
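The proposed behaviour can be sketched as follows. This is a minimal illustration in Python rather than MediaWiki's PHP, and it is not the code from the eventual patch. The names `run_batched_update`, `batches`, and `wait_for_replication` are hypothetical; the wait callable is assumed to return `False` on timeout instead of throwing.

```python
from typing import Callable, Iterable


def run_batched_update(
    batches: Iterable[Callable[[], None]],
    wait_for_replication: Callable[[float], bool],
    timeout_s: float = 1.0,
) -> int:
    """Commit each batch, then wait briefly for replicas to catch up.

    `wait_for_replication` is a hypothetical callable that blocks for up to
    `timeout_s` seconds and returns True if replicas caught up, False if the
    wait timed out. A timeout is counted and otherwise ignored, so every
    batch is still committed.

    Returns the number of waits that timed out.
    """
    timed_out = 0
    for commit_batch in batches:
        commit_batch()  # commit one transaction's worth of link changes
        if not wait_for_replication(timeout_s):
            # Replicas are still lagged. The wait itself throttled the
            # update, so record the timeout and carry on instead of raising.
            timed_out += 1
    return timed_out
```

Because each timed-out wait still blocks for `timeout_s`, a lagged cluster slows the update to at most one transaction per timeout period, while the update as a whole still runs to completion.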

I don't think it's acceptable to throw an exception and discard the update, even if all slaves are lagged. Read-only mode is a better mechanism for offloading the DB cluster in this case. By dumping outstanding work into the binlog at whatever rate and then switching to read-only mode for a few minutes, we would at least preserve the consistency of the links tables.

I noticed these exceptions while investigating T198049.

Event Timeline

tstarling created this task. Aug 8 2018, 2:34 AM
Restricted Application added a subscriber: Aklapper. Aug 8 2018, 2:34 AM

Thank you for working on this. I don't have any concrete advice, but whenever timeouts are involved, I get worried about possible overload consequences (although in T119626 the timeouts were 60+ seconds).

As a long-term alternative, one thing we have considered for some time is whether it would be interesting to have a model similar to GitHub's, with two modes: "normal" (where lagged or unavailable replicas are depooled) and "ignore lag" (where everything is lagged). See https://githubengineering.com/context-aware-mysql-pools-via-haproxy/
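One way to read the two-mode idea is as a pool-selection rule. The sketch below is only an interpretation of the comment above, not GitHub's HAProxy configuration or anything in MediaWiki. `Replica`, `select_pool`, and the 10-second `max_lag_s` default are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    lag_s: Optional[float]  # hypothetical field: None means unreachable


def select_pool(replicas: List[Replica], max_lag_s: float = 10.0) -> List[Replica]:
    """Return the replicas that should serve reads.

    "normal" mode: at least one reachable replica is within the lag limit,
    so lagged and unreachable replicas are depooled.
    "ignore lag" mode: every reachable replica is lagged, so lag is ignored
    and all reachable replicas stay pooled.
    """
    reachable = [r for r in replicas if r.lag_s is not None]
    fresh = [r for r in reachable if r.lag_s <= max_lag_s]
    return fresh if fresh else reachable
```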

aaron added a comment. Aug 8 2018, 9:11 PM

Something like that approach seems worth trying.

@tstarling Safe to assume that you're planning to tackle this?

Imarlier moved this task from Inbox to Radar on the Performance-Team board. Aug 13 2018, 8:10 PM
Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.
CCicalese_WMF triaged this task as Normal priority. Aug 14 2018, 1:59 AM
CCicalese_WMF moved this task from Inbox to CPT TEC1 Backlog on the Core-Platform-Team-Old board.

Change 453091 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Don't throw an exception when waiting for replication times out

https://gerrit.wikimedia.org/r/453091

Change 453091 merged by jenkins-bot:
[mediawiki/core@master] Don't throw an exception when waiting for replication times out

https://gerrit.wikimedia.org/r/453091

tstarling closed this task as Resolved. Oct 19 2018, 2:33 AM
tstarling claimed this task.