
LinksUpdate fails, spams exception logs, whenever replication lag on any server rises above 10s
Closed, Resolved, Public


As part of T95501, it was decided to do LinksUpdates in a loop of commitAndWaitForReplication() in order to avoid replication lag. This means that if any server in the cluster is lagged, even if it has 0.01% load, waitForReplication() will time out and throw an exception after a hard-coded 10 seconds. Unlike when the same problem occurs in jobs, there is no possibility of the commit being retried, so the links will just be wrong forever.

A possible fix: after a short timeout of say 1s, ignore the error and continue with the next transaction. This will throttle the update and thus limit the impact of update jobs on the replication lag, but still allow the update to complete.
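The proposed behaviour could be sketched roughly as follows. This is a Python illustration, not MediaWiki code: `run_batched_update`, `commit`, and `wait_for_replication` are hypothetical names, and the wait callback is assumed to return `False` on timeout instead of throwing.

```python
from typing import Callable, Iterable, List


def run_batched_update(
    batches: Iterable[List[str]],
    commit: Callable[[List[str]], None],
    wait_for_replication: Callable[[float], bool],
    timeout_s: float = 1.0,
) -> int:
    """Commit each batch, then wait briefly for replicas to catch up.

    Returns the number of waits that timed out. A timeout is counted
    rather than raised, so the remaining batches are still committed.
    """
    timed_out = 0
    for batch in batches:
        commit(batch)  # hypothetical: writes one transaction to the master
        # Assumed contract: True if replicas caught up within timeout_s.
        if not wait_for_replication(timeout_s):
            # Waiting already throttled the update by timeout_s;
            # carry on with the next transaction instead of aborting.
            timed_out += 1
    return timed_out
```

With every replica lagged, each transaction costs at most `timeout_s` of extra waiting, so the update is slowed down but still completes.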

I don't think it's acceptable to throw an exception and discard the update, even if all slaves are lagged. Read-only mode is a better mechanism for offloading the DB cluster in this case. By dumping outstanding work into the binlog at whatever rate and then switching to read-only mode for a few minutes, we would at least preserve the consistency of the links tables.

I noticed these exceptions while investigating T198049.

Event Timeline

Thank you for working on this. I don't have any concrete advice, but whenever timeouts are involved, I get worried about possible overload consequences (although in T119626 the timeouts were 60+ seconds).

As a long-term alternative, one thing we have considered for some time is whether it would be interesting to have a model similar to GitHub's, with two modes: "normal" (where lagged or unavailable replicas are depooled) and "ignore lag" (where everything is lagged).
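One possible reading of that two-mode idea, as a Python sketch. The function name, the lag threshold, and the use of `None` for an unavailable replica are all assumptions made for illustration. They are not taken from GitHub's or MediaWiki's actual implementation.

```python
from typing import Dict, List, Optional


def usable_replicas(
    lag_by_host: Dict[str, Optional[float]], max_lag_s: float
) -> List[str]:
    """Choose replicas to read from.

    lag_by_host maps a host name to its lag in seconds, or None if the
    host is unavailable (an assumption of this sketch).
    """
    reachable = {h: lag for h, lag in lag_by_host.items() if lag is not None}
    fresh = sorted(h for h, lag in reachable.items() if lag <= max_lag_s)
    if fresh:
        # "normal" mode: lagged or unavailable replicas are depooled.
        return fresh
    # "ignore lag" mode: everything is lagged, so use whatever is reachable.
    return sorted(reachable)
```

For example, `usable_replicas({"db1": 0.5, "db2": 30.0, "db3": None}, 10.0)` would select only `db1`, while the same call with `db1` at 20 seconds of lag would fall back to both `db1` and `db2`.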

Something like that approach seems worth trying.

@tstarling Safe to assume that you're planning to tackle this?

CCicalese_WMF moved this task from Inbox to CPT TEC1 Backlog on the Core-Platform-Team-Old board.

Change 453091 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] Don't throw an exception when waiting for replication times out

Change 453091 merged by jenkins-bot:
[mediawiki/core@master] Don't throw an exception when waiting for replication times out

tstarling claimed this task.