DBReplicationWaitError: Could not wait for slaves to catch up to
Closed, Resolved | Public | PRODUCTION ERROR


This is showing up a lot in fatalmonitor.

/srv/mediawiki/php-1.28.0-wmf.2/includes/db/loadbalancer/LBFactory.php line 396


Exceptions increased by 1319.82% after rolling out 1.28.0-wmf.2, so I have rolled group1 back to wmf.1.

Event Timeline

mmodell triaged this task as Unbreak Now! priority. (May 18 2016, 9:37 PM)
mmodell added a project: DBA.

What we know so far:

  • all or nearly-all occurrences in logstash are from commonswiki
  • all triggered by refreshlinks job
  • this does not appear to be actual replication lag
  • similar symptoms to T126436; the solution there may be incomplete?
mmodell claimed this task.

@aaron: you mentioned that there may be deeper issues still, should this task be resolved and another opened for those deeper issues?

I suppose. The issue is that LoadBalancer needs a way to recognize a cluster of servers that each just hold a read-only copy of the same data, with no master/slave replication.
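
To make the idea concrete, here is a minimal sketch in Python (not the PHP that was actually merged) of a cluster that skips the replication wait when it is flagged as a static, non-replicating dataset. Every name in it (`Cluster`, `Replica`, `is_static_clone`, `wait_for_replicas`, `ReplicationWaitError`) is hypothetical and chosen for illustration only; it also leaves out the timeout and polling a real wait would need.

```python
from dataclasses import dataclass


class ReplicationWaitError(Exception):
    """Raised when replicas have not reached the master position."""


@dataclass
class Replica:
    name: str
    position: int  # hypothetical replication position of this server


@dataclass
class Cluster:
    master_position: int
    replicas: list
    # Hypothetical flag: True when every server holds an identical
    # read-only copy of the data and no replication stream exists.
    is_static_clone: bool = False

    def wait_for_replicas(self) -> bool:
        """Return True if all replicas are caught up, else raise."""
        if self.is_static_clone:
            # Nothing replicates, so there is nothing to wait for.
            return True
        lagged = [r.name for r in self.replicas
                  if r.position < self.master_position]
        if lagged:
            raise ReplicationWaitError(
                "Could not wait for replicas to catch up: " + ", ".join(lagged)
            )
        return True
```

Under this assumption, a read-only cluster whose servers never report a matching position would no longer raise, while a normally replicating cluster would still surface genuine lag.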

Change 289574 had a related patch set uploaded (by Aaron Schulz):
Support non-replicating DB clusters for static datasets

Change 289575 had a related patch set uploaded (by Aaron Schulz):
Fix slave lag wait calls for read-only ES clusters

Change 289574 merged by jenkins-bot:
Support non-replicating DB clusters for static datasets

Change 289575 merged by jenkins-bot:
Fix slave lag wait calls for read-only ES clusters

mmodell changed the subtype of this task from "Task" to "Production Error". (Aug 28 2019, 11:10 PM)