DBReplicationWaitError: Could not wait for slaves to catch up to 10.64.0.7
Closed, Resolved | Public | PRODUCTION ERROR

Description

This is showing up a lot in fatalmonitor.

/srv/mediawiki/php-1.28.0-wmf.2/includes/db/loadbalancer/LBFactory.php line 396

{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/deferred/LinksUpdate.php","line":392,
"function":"waitForReplication","class":"LBFactory","type":"->","args":[]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/deferred/LinksUpdate.php","line":224,
"function":"incrTableUpdate","class":"LinksUpdate","type":"->","args":["string","string","array","array"]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/deferred/LinksUpdate.php","line":158,
"function":"doIncrementalUpdate","class":"LinksUpdate","type":"->","args":[]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/deferred/DataUpdate.php","line":99,
"function":"doUpdate","class":"LinksUpdate","type":"->","args":[]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/jobqueue/jobs/RefreshLinksJob.php","line":271,
"function":"runUpdates","class":"DataUpdate","type":"::","args":["array"]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/jobqueue/jobs/RefreshLinksJob.php","line":115,
"function":"runForTitle","class":"RefreshLinksJob","type":"->","args":["Title"]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/jobqueue/JobRunner.php","line":265,
"function":"run","class":"RefreshLinksJob","type":"->","args":[]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/jobqueue/JobRunner.php","line":179,
"function":"executeJob","class":"JobRunner","type":"->","args":["RefreshLinksJob","BufferingStatsdDataFactory","integer"]},
{"file":"/srv/mediawiki/rpc/RunJobs.php","line":47,
"function":"run","class":"JobRunner","type":"->","args":["array"]}

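The trace lists the innermost frame first. As a reading aid, here is a minimal Python sketch that turns the frame fragment into a caller-to-callee chain. The helper name `call_chain` is hypothetical, only two frames are copied from the trace above, and it assumes that `file` and `line` in each frame give the call site, as in PHP backtraces.

```python
import json

# Two frames copied verbatim from the trace above; the rest are omitted.
FRAMES = '''
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/deferred/LinksUpdate.php","line":392,
"function":"waitForReplication","class":"LBFactory","type":"->","args":[]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/deferred/LinksUpdate.php","line":224,
"function":"incrTableUpdate","class":"LinksUpdate","type":"->","args":["string","string","array","array"]}
'''


def call_chain(fragment):
    """Return the frames as readable strings, outermost caller first."""
    # The log fragment is a comma-separated run of objects, so wrap it
    # in brackets to make it a valid JSON array.
    frames = json.loads("[" + fragment + "]")
    chain = []
    # The log lists the innermost frame first; reverse to read top-down.
    for frame in reversed(frames):
        basename = frame["file"].rsplit("/", 1)[-1]
        chain.append(
            f'{frame["class"]}{frame["type"]}{frame["function"]} '
            f'(called at {basename}:{frame["line"]})'
        )
    return chain


if __name__ == "__main__":
    for step in call_chain(FRAMES):
        print(step)
```
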
Exceptions increased by 1319.82% after rolling out 1.28.0-wmf.2, so I have rolled back group1 to wmf.1.

Event Timeline

mmodell triaged this task as Unbreak Now! priority. (May 18 2016, 9:37 PM)
mmodell added a project: DBA.

What we know so far:

  • all or nearly-all occurrences in logstash are from commonswiki
  • all triggered by refreshlinks job
  • this does not appear to be actual replication lag
  • similar symptoms to T126436; the solution there may be incomplete?

mmodell claimed this task.

@aaron: you mentioned that there may be deeper issues still, should this task be resolved and another opened for those deeper issues?

I suppose. The issue is that LoadBalancer needs a way to recognize a cluster of servers that each hold a read-only copy of the same data, with no master/slave replication between them.
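
To illustrate the idea in the comment above, here is a minimal sketch in Python rather than MediaWiki's PHP. Everything in it is hypothetical: `Cluster`, `Replica`, the `static_dataset` flag, and `wait_for_replication` are illustrative names, not the LoadBalancer API and not the content of the Gerrit changes. It only shows how a cluster marked as a static, non-replicating dataset could bypass the replication wait that otherwise fails.

```python
from dataclasses import dataclass, field


class ReplicationWaitError(Exception):
    """Raised when a replica does not reach the primary's position in time."""


@dataclass
class Replica:
    name: str
    position: int  # last replication position applied by this server


@dataclass
class Cluster:
    name: str
    primary_position: int
    replicas: list = field(default_factory=list)
    # Hypothetical flag: every server holds the same read-only data and
    # nothing replicates between them, so there is no lag to wait for.
    static_dataset: bool = False


def wait_for_replication(cluster, max_polls=3):
    """Return the names of the replicas that were actually waited on."""
    if cluster.static_dataset:
        # No replication stream exists, so waiting could only time out.
        return []
    waited = []
    for replica in cluster.replicas:
        for _ in range(max_polls):
            # A real implementation would re-query the server and sleep
            # between polls; this sketch just re-reads the stored value.
            if replica.position >= cluster.primary_position:
                break
        else:
            raise ReplicationWaitError(
                f"Could not wait for {replica.name} to catch up"
            )
        waited.append(replica.name)
    return waited
```

With this shape, a static cluster returns immediately even when its servers report no replication position, while a normal cluster with a replica that never catches up still raises the error.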

Change 289574 had a related patch set uploaded (by Aaron Schulz):
Support non-replicating DB clusters for static datasets

https://gerrit.wikimedia.org/r/289574

Change 289575 had a related patch set uploaded (by Aaron Schulz):
Fix slave lag wait calls for read-only ES clusters

https://gerrit.wikimedia.org/r/289575

Change 289574 merged by jenkins-bot:
Support non-replicating DB clusters for static datasets

https://gerrit.wikimedia.org/r/289574

Change 289575 merged by jenkins-bot:
Fix slave lag wait calls for read-only ES clusters

https://gerrit.wikimedia.org/r/289575

mmodell changed the subtype of this task from "Task" to "Production Error". (Aug 28 2019, 11:10 PM)