DBReplicationWaitError: Could not wait for slaves to catch up to 10.64.0.7
Closed, Resolved · Public · PRODUCTION ERROR

Description

This is showing up a lot in fatalmonitor.

/srv/mediawiki/php-1.28.0-wmf.2/includes/db/loadbalancer/LBFactory.php line 396

{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/deferred/LinksUpdate.php","line":392,
"function":"waitForReplication","class":"LBFactory","type":"->","args":[]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/deferred/LinksUpdate.php","line":224,
"function":"incrTableUpdate","class":"LinksUpdate","type":"->","args":["string","string","array","array"]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/deferred/LinksUpdate.php","line":158,
"function":"doIncrementalUpdate","class":"LinksUpdate","type":"->","args":[]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/deferred/DataUpdate.php","line":99,
"function":"doUpdate","class":"LinksUpdate","type":"->","args":[]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/jobqueue/jobs/RefreshLinksJob.php","line":271,
"function":"runUpdates","class":"DataUpdate","type":"::","args":["array"]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/jobqueue/jobs/RefreshLinksJob.php","line":115,
"function":"runForTitle","class":"RefreshLinksJob","type":"->","args":["Title"]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/jobqueue/JobRunner.php","line":265,
"function":"run","class":"RefreshLinksJob","type":"->","args":[]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/jobqueue/JobRunner.php","line":179,
"function":"executeJob","class":"JobRunner","type":"->","args":["RefreshLinksJob","BufferingStatsdDataFactory","integer"]},
{"file":"/srv/mediawiki/rpc/RunJobs.php","line":47,
"function":"run","class":"JobRunner","type":"->","args":["array"]}

Exceptions increased by 1319.82% after rolling out 1.28.0-wmf.2, so I have rolled group1 back to wmf.1.
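The trace above lists its frames innermost first. As a reading aid, here is a minimal Python sketch that reproduces three of those frames verbatim, wraps them in `[...]` so they parse as a JSON array, and prints the call chain from the job runner entry point down to the failing wait. The helper name `call_chain` is made up for illustration.

```python
import json

# Three frames copied from the trace above, wrapped in [] so they parse as a JSON array.
FRAMES_JSON = """[
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/deferred/LinksUpdate.php","line":392,
"function":"waitForReplication","class":"LBFactory","type":"->","args":[]},
{"file":"/srv/mediawiki/php-1.28.0-wmf.2/includes/jobqueue/jobs/RefreshLinksJob.php","line":115,
"function":"runForTitle","class":"RefreshLinksJob","type":"->","args":["Title"]},
{"file":"/srv/mediawiki/rpc/RunJobs.php","line":47,
"function":"run","class":"JobRunner","type":"->","args":["array"]}
]"""


def call_chain(frames_json):
    """Return 'Class->function' strings ordered from outermost to innermost call."""
    frames = json.loads(frames_json)
    # The trace lists the innermost call first, so reverse it.
    return [f"{f['class']}{f['type']}{f['function']}" for f in reversed(frames)]


if __name__ == "__main__":
    print(" > ".join(call_chain(FRAMES_JSON)))
```

Read this way, the failure sits at the end of a RefreshLinksJob run, inside the replication wait called from LinksUpdate.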

Details

	Subject	Repo	Branch	Lines +/-
	Fix slave lag wait calls for read-only ES clusters	operations/mediawiki-config	master	+30 -28
	Support non-replicating DB clusters for static datasets	mediawiki/core	master	+6 -3


Revisions and Commits

rMW MediaWiki
	rMW7ee9645c1ec3 Make LinksUpdate only wait on the DB with the link tables

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• mmodell	T134450 MW-1.28.0-wmf.2 deployment blockers
		Resolved	PRODUCTION ERROR	• mmodell	T135690 DBReplicationWaitError: Could not wait for slaves to catch up to 10.64.0.7

Event Timeline

• mmodell created this task. May 18 2016, 9:30 PM

Restricted Application added subscribers: Zppix, Aklapper. May 18 2016, 9:30 PM

• mmodell added a parent task: T134450: MW-1.28.0-wmf.2 deployment blockers. May 18 2016, 9:32 PM

• mmodell mentioned this in T134450: MW-1.28.0-wmf.2 deployment blockers.

Paladox subscribed. May 18 2016, 9:34 PM

• mmodell triaged this task as Unbreak Now! priority. May 18 2016, 9:37 PM

• mmodell added a project: DBA.

Restricted Application added subscribers: Luke081515, TerraCodes, Urbanecm. May 18 2016, 9:37 PM

• mmodell updated the task description. May 18 2016, 9:40 PM

The only other reference to this error message seems to be T126436: Spikes of mediawiki in read only for job runners after altering the s2 slaves topology

• mmodell mentioned this in T126436: Spikes of mediawiki in read only for job runners after altering the s2 slaves topology. May 18 2016, 9:58 PM

What we know so far:

All or nearly all occurrences in logstash are from commonswiki.
All are triggered by the refreshlinks job.
This does not appear to be actual replication lag.
The symptoms are similar to T126436; the solution there may be incomplete.

https://gerrit.wikimedia.org/r/#/c/289569/ seems to have fixed the issue.
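The commit attached to this task is titled "Make LinksUpdate only wait on the DB with the link tables". The idea behind that title can be sketched in Python as follows. This is not MediaWiki code: the function, the exception class, and the cluster names are all hypothetical, and the sketch assumes (as the later patch titles suggest) that the failing wait involved a read-only cluster that never reports a caught-up position.

```python
class ReplicationWaitError(Exception):
    """Raised when a cluster's replicas do not catch up in time (hypothetical)."""


def wait_for_replication(clusters, only=None):
    """Wait on every cluster, or only on the names listed in `only`.

    `clusters` maps a cluster name to a zero-argument callable that returns
    True once its replicas have caught up and False on timeout.
    """
    names = clusters.keys() if only is None else only
    for name in names:
        if not clusters[name]():
            raise ReplicationWaitError(f"Could not wait for replicas of {name}")


# Hypothetical setup: the main cluster replicates normally; the read-only
# external storage cluster never reports a caught-up position.
clusters = {
    "main": lambda: True,
    "external-store": lambda: False,
}

# Waiting only on the cluster that holds the link tables succeeds ...
wait_for_replication(clusters, only=["main"])

# ... while waiting on every cluster fails on the one that does not replicate.
try:
    wait_for_replication(clusters)
except ReplicationWaitError as err:
    print(err)
```

Narrowing the wait removes the symptom for this caller, but any other caller that still waits on all clusters would hit the same error, which is consistent with the follow-up discussion below.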

• mmodell added a commit: rMW7ee9645c1ec3: Make LinksUpdate only wait on the DB with the link tables. May 18 2016, 10:26 PM

• mmodell added a subscriber: aaron.

@aaron: you mentioned that there may be deeper issues still, should this task be resolved and another opened for those deeper issues?

In T135690#2307430, @mmodell wrote:

@aaron: you mentioned that there may be deeper issues still, should this task be resolved and another opened for those deeper issues?

I suppose. The issue is that LoadBalancer needs a way to recognize a cluster of servers that each hold a read-only copy of the same data, with no master/slave replication.
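One way to picture that requirement is a per-cluster flag that tells lag checks and replication waits to skip static clusters. The Python sketch below is purely illustrative: the `Cluster` type, the `replicates` flag, and both helper functions are assumptions made for this example and do not reflect how the patches listed on this task implement it.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    replicates: bool = True   # hypothetical flag; False for static read-only copies
    lag_seconds: float = 0.0  # stand-in for a measured replication lag


def clusters_to_wait_on(clusters):
    """Return only the clusters where waiting for replication is meaningful."""
    return [c for c in clusters if c.replicates]


def max_lag(clusters):
    """Report the worst lag, treating non-replicating clusters as never lagged."""
    return max((c.lag_seconds for c in clusters_to_wait_on(clusters)), default=0.0)


if __name__ == "__main__":
    all_clusters = [
        Cluster("main", lag_seconds=1.5),
        # A static dataset has no replication stream, so any "lag" reading is meaningless.
        Cluster("external-store", replicates=False, lag_seconds=9999.0),
    ]
    print([c.name for c in clusters_to_wait_on(all_clusters)])
    print(max_lag(all_clusters))
```

Putting the flag on the cluster rather than on each caller means every wait and lag check benefits, not just LinksUpdate.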

Change 289574 had a related patch set uploaded (by Aaron Schulz):
Support non-replicating DB clusters for static datasets

https://gerrit.wikimedia.org/r/289574

gerritbot added a project: Patch-For-Review. May 18 2016, 11:03 PM

Change 289575 had a related patch set uploaded (by Aaron Schulz):
Fix slave lag wait calls for read-only ES clusters

https://gerrit.wikimedia.org/r/289575

ReleaseTaggerBot added projects: MW-1.28-release (WMF-deploy-2016-05-24_(1.28.0-wmf.3)), MW-1.28-release-notes. May 24 2016, 12:00 PM

Change 289574 merged by jenkins-bot:
Support non-replicating DB clusters for static datasets

https://gerrit.wikimedia.org/r/289574

Change 289575 merged by jenkins-bot:
Fix slave lag wait calls for read-only ES clusters

https://gerrit.wikimedia.org/r/289575

ReleaseTaggerBot added a project: MW-1.28-release (WMF-deploy-2016-06-07_(1.28.0-wmf.5)). May 31 2016, 9:01 PM

• demon moved this task from Untriaged to Resolved on the Wikimedia-production-error board. Jul 20 2016, 1:47 AM

• mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:10 PM

Anomie mentioned this in T246072: Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only. Feb 25 2020, 4:51 PM
