Misleading "replica catching up" error when master DB is down
Closed, Resolved (Public)

Description

Split from T216484

If the master DB goes down, the read-only message says that it is waiting for the replicas to catch up.

This is misleading. The error message should instead say something like "the master DB is down".

Event Timeline

Framawiki subscribed.

[Beta wikis are not production wikis, hence removing the Wikimedia-production-error tag]

Krinkle renamed this task from read only message for master DB down misleading to Misleading "replica catching up" error when master DB is down. May 21 2019, 10:15 PM
Krinkle added a project: patch-welcome.
Krinkle subscribed.

The error comes from the following code in rdbms/LoadBalancer.php:

	public function getReadOnlyReason( $domain = false, IDatabase $conn = null ) {
		if ( $this->readOnlyReason !== false ) {
			return $this->readOnlyReason;
		} elseif ( $this->getLaggedReplicaMode( $domain ) ) {
			if ( $this->allReplicasDownMode ) {
				return 'The database has been automatically locked ' .
					'until the replica database servers become available';
			} else {
				return 'The database has been automatically locked ' .
					'while the replica database servers catch up to the master.';
			}

This code and the getLaggedReplicaMode() method appear to work as intended.

I suspect that somewhere earlier in the code, it is unable to compare something between the master and the replica. If the bottom value in that comparison is interpreted as newer, the master will look like it is far ahead of the replica, which would lead to this error.
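To make that suspicion concrete, here is a minimal sketch of how a failed lag reading could end up being reported as replica lag. Both functions and their names are hypothetical and are not taken from LoadBalancer; the sketch also assumes that a lag reading of false means "could not be measured".

```php
<?php
// Hypothetical helpers for illustration only, not MediaWiki code.
// $lag is the number of seconds a replica is behind the master, or false
// when the lag could not be measured (assumption for this sketch).

// Naive variant: folds "could not measure" into "too far behind".
function classifyReplicaNaive( $lag, int $maxLag ): string {
	return ( $lag === false || $lag > $maxLag ) ? 'lagged' : 'ok';
}

// Stricter variant: keeps the "unknown" case separate so that a caller
// can report something other than replica lag.
function classifyReplica( $lag, int $maxLag ): string {
	if ( $lag === false ) {
		return 'unknown';
	}
	return $lag > $maxLag ? 'lagged' : 'ok';
}
```

With the naive variant, an unmeasurable lag is indistinguishable from a replica that is genuinely behind, which would produce the "catch up to the master" message even though nothing is actually catching up.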

For the general case of a master being down for maintenance, the code is known to behave correctly. In this case, however, the master was temporarily unavailable in an unexpected way. I'm classifying this as low priority for now, as only the error message is wrong. The logical behaviour of the code is as expected: we automatically enable "read-only" mode until the master and its replication are back up.
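As a rough sketch of the message selection this task asks for, the unreachable-master case could be checked before the replica cases. The function name, its boolean parameters, and the wording of the master-down message are assumptions made for illustration; only the two replica messages are copied from the excerpt above.

```php
<?php
// Hypothetical sketch, not the LoadBalancer implementation.
// Returns a read-only reason string, or false when no automatic lock applies.
function pickReadOnlyReason( bool $masterReachable, bool $allReplicasDown, bool $replicasLagged ) {
	if ( !$masterReachable ) {
		// Assumed wording; the task only asks for "something like the master DB is down".
		return 'The database has been automatically locked ' .
			'because the master database server is unavailable.';
	}
	if ( $allReplicasDown ) {
		return 'The database has been automatically locked ' .
			'until the replica database servers become available';
	}
	if ( $replicasLagged ) {
		return 'The database has been automatically locked ' .
			'while the replica database servers catch up to the master.';
	}
	return false;
}
```

Checking the master first means that a lag reading distorted by an unreachable master can no longer surface as a "catch up" message. How the real code would determine whether the master is reachable is left open here.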

Krinkle added a subscriber: aaron.

Re-tagging on the main workboard for @aaron to review when he's back. I recall this area being refactored in the last two weeks, which may have resolved the issue reported here. If not, it should still be fresh in mind and perhaps easy to fix.

Change 529189 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: simplify LoadBalancer::getLaggedReplicaMode()

https://gerrit.wikimedia.org/r/529189

Change 529189 merged by jenkins-bot:
[mediawiki/core@master] rdbms: simplify LoadBalancer::getLaggedReplicaMode()

https://gerrit.wikimedia.org/r/529189

Change 583350 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] rdbms: Fix unprocessed "{host}" in LoadMonitor replag message

https://gerrit.wikimedia.org/r/583350