Misleading "replica catching up" error when master DB is down
Closed, Resolved · Public

Assigned To

Authored By

	Bawolff
	Feb 19 2019, 11:52 AM

Description

Split from T216484

If the master DB goes down, the read-only message says it is waiting for the replicas to catch up.

This is misleading; the error message should say something like "the master DB is down".

Details

	Subject	Repo	Branch	Lines +/-
	rdbms: simplify LoadBalancer::getLaggedReplicaMode()	mediawiki/core	master	+4 -11

Related Objects

Mentioned In: T248481: Mysterious replication lag observed by MW in Codfw
T113114: Make all wiki-facing error pages consistent
T218692: read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error"
Mentioned Here: T216484: Database locked on beta.wmflabs.org sites (deployment-db03 down?)

Event Timeline

Bawolff created this task. Feb 19 2019, 11:52 AM

Restricted Application added a subscriber: Aklapper. Feb 19 2019, 11:52 AM

Peachey88 edited projects, added MediaWiki-Documentation; removed Documentation. Feb 19 2019, 7:56 PM

RhinosF1 mentioned this in T218692: read only on mediawiki generates "LoadBalancer.php: Cannot access the database: Unknown error". Mar 19 2019, 9:21 PM

RhinosF1 added a project: Wikimedia-production-error.

[Beta are not production wikis, hence removing the Wikimedia-production-error tag]

That should be better

Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board. Apr 3 2019, 2:48 AM

RhinosF1 mentioned this in T113114: Make all wiki-facing error pages consistent. May 21 2019, 7:40 PM

Krinkle renamed this task from "read only message for master DB down misleading" to "Misleading "replica catching up" error when master DB is down". May 21 2019, 10:15 PM

Krinkle removed projects: Beta-Cluster-Infrastructure, MediaWiki-Documentation.

The error comes from the following code in rdbms/LoadBalancer.php (excerpt):

	public function getReadOnlyReason( $domain = false, IDatabase $conn = null ) {
		if ( $this->readOnlyReason !== false ) {
			return $this->readOnlyReason;
		} elseif ( $this->getLaggedReplicaMode( $domain ) ) {
			if ( $this->allReplicasDownMode ) {
				return 'The database has been automatically locked ' .
					'until the replica database servers become available';
			} else {
				return 'The database has been automatically locked ' .
					'while the replica database servers catch up to the master.';
			}

This code and the getLaggedReplicaMode() method appear to work as intended.

I suspect that earlier in the code, it may be unable to compare something between the master and the replica. If the bottom value in that comparison is interpreted as newer, the master will appear to be far ahead of the replica, which leads to this error.

For the general case of a master being down for maintenance, the code is known to behave correctly. However, in this case it was temporarily unavailable in an unexpected way. I'm classifying this as low priority for now, as it only affects the error message. The logical behaviour of the code is as expected: we automatically enable "read-only" mode until the master and its replication are back up.
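To illustrate what the task description asks for, here is a minimal sketch of a reason picker that checks master availability before replica lag, so an unreachable master is never reported as "replicas catching up". This is not the merged change and not part of the rdbms library: the function name `describeReadOnlyReason()`, its three boolean parameters, and the wording of the master-down message are all assumptions made for the example. The two replica messages are copied from the excerpt above.

```php
<?php
/**
 * Hypothetical helper, not part of MediaWiki's rdbms library.
 *
 * The three flags are assumed to have been determined earlier by the
 * caller (for example, by a load balancer probing its servers).
 *
 * @param bool $masterReachable Whether the master DB accepted a connection
 * @param bool $allReplicasDown Whether no replica DB could be reached
 * @param bool $replicasLagged Whether all reachable replicas exceed the lag limit
 * @return string|false Human-readable reason, or false if not read-only
 */
function describeReadOnlyReason( $masterReachable, $allReplicasDown, $replicasLagged ) {
	if ( !$masterReachable ) {
		// Check the master first so an outage is never reported as lag.
		// The message wording here is an assumption.
		return 'The database has been automatically locked ' .
			'because the master database server is unavailable.';
	} elseif ( $allReplicasDown ) {
		return 'The database has been automatically locked ' .
			'until the replica database servers become available';
	} elseif ( $replicasLagged ) {
		return 'The database has been automatically locked ' .
			'while the replica database servers catch up to the master.';
	}

	// Nothing forces read-only mode.
	return false;
}
```

The ordering is the point of the sketch: if lag cannot be measured because the master is down, the first branch wins and the message names the actual cause.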

Krinkle added a project: Performance-Team (Radar). May 21 2019, 10:20 PM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.

Re-tagging on the main workboard for @aaron to review when he's back. I recall this area being refactored in the last two weeks, possibly resolving the issue reported here. If not, it should at least be fresh in mind and perhaps easy to fix.

Krinkle assigned this task to aaron. Aug 6 2019, 5:57 PM

Restricted Application added a project: Platform Engineering. Aug 6 2019, 5:57 PM

WDoranWMF removed a project: Platform Engineering. Aug 6 2019, 6:40 PM

Change 529189 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: simplify LoadBalancer::getLaggedReplicaMode()

https://gerrit.wikimedia.org/r/529189

gerritbot added a project: Patch-For-Review. Aug 8 2019, 11:02 PM

aaron moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board. Aug 12 2019, 7:44 PM

Change 529189 merged by jenkins-bot:
[mediawiki/core@master] rdbms: simplify LoadBalancer::getLaggedReplicaMode()

https://gerrit.wikimedia.org/r/529189

aaron closed this task as Resolved. Aug 20 2019, 6:01 PM

Maintenance_bot removed a project: Patch-For-Review. Aug 20 2019, 6:10 PM

ReleaseTaggerBot added a project: MW-1.34-notes (1.34.0-wmf.20; 2019-08-27). Aug 21 2019, 12:01 AM

Change 583350 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] rdbms: Fix unprocessed "{host}" in LoadMonitor replag message

https://gerrit.wikimedia.org/r/583350

gerritbot added a project: Patch-For-Review. Mar 25 2020, 2:42 PM

Krinkle mentioned this in T248481: Mysterious replication lag observed by MW in Codfw. Mar 25 2020, 2:50 PM
