Misleading "replica catching up" error when master DB is down
Closed, Resolved · Public

Description

Split from T216484

If the master DB goes down, the read-only message says it is waiting for the replicas to catch up.

This is misleading; the error message should say something like "the master DB is down".

Event Timeline

Bawolff created this task. Feb 19 2019, 11:52 AM
Restricted Application added a subscriber: Aklapper. Feb 19 2019, 11:52 AM
Framawiki added a subscriber: Framawiki.

[Beta wikis are not production wikis, hence removing the Wikimedia-production-error tag]

Krinkle renamed this task from read only message for master DB down misleading to Misleading "replica catching up" error when master DB is down. May 21 2019, 10:15 PM
Krinkle triaged this task as Low priority. May 21 2019, 10:20 PM
Krinkle added a project: patch-welcome.
Krinkle added a subscriber: Krinkle.

The error comes from the following code in rdbms/LoadBalancer.php:

	public function getReadOnlyReason( $domain = false, IDatabase $conn = null ) {
		if ( $this->readOnlyReason !== false ) {
			return $this->readOnlyReason;
		} elseif ( $this->getLaggedReplicaMode( $domain ) ) {
			if ( $this->allReplicasDownMode ) {
				return 'The database has been automatically locked ' .
					'until the replica database servers become available';
			} else {
				return 'The database has been automatically locked ' .
					'while the replica database servers catch up to the master.';
			}

This code and the getLaggedReplicaMode() method appear to work as intended.

I suspect that somewhere earlier in the code, it may be unable to compare something between the master and the replica. If the bottom value in that comparison is interpreted as newer, the master will appear to be far ahead of the replica, which would lead to this error.

For the general case of a master being down for maintenance, the code is known to behave correctly. However, in this case the master was temporarily unavailable in an unexpected way. I'm classifying this as low priority for now, as it only affects the error message. The logical behaviour of the code is as expected: we automatically enable "read-only" mode until the master and its replication are back up.
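To illustrate the message this task asks for, here is a minimal sketch of choosing the read-only reason with the master checked first. It is not the actual LoadBalancer code: the function name `pickReadOnlyReason()`, its three boolean parameters, and the wording of the "master unavailable" message are all hypothetical, and the sketch assumes the caller has already determined each flag.

```php
<?php
// Hypothetical sketch; NOT the real rdbms/LoadBalancer.php implementation.
// Assumption: the caller already knows whether the master is reachable,
// whether all replicas are down, and whether the replicas are lagged.
function pickReadOnlyReason(
	bool $masterReachable,
	bool $allReplicasDown,
	bool $replicasLagged
): ?string {
	if ( !$masterReachable ) {
		// Check the master first, so that a master outage is never
		// reported as replica lag. This message text is invented here.
		return 'The database has been automatically locked ' .
			'because the master database server is unavailable.';
	}
	if ( $allReplicasDown ) {
		// Message text copied from the excerpt quoted above.
		return 'The database has been automatically locked ' .
			'until the replica database servers become available';
	}
	if ( $replicasLagged ) {
		// Message text copied from the excerpt quoted above.
		return 'The database has been automatically locked ' .
			'while the replica database servers catch up to the master.';
	}
	// None of these conditions applies.
	return null;
}
```

With this ordering, `pickReadOnlyReason( false, false, true )` returns the "master unavailable" message even though the replicas also look lagged, which is the situation described in this task.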

Krinkle added a subscriber: aaron.

Re-tagging on the main workboard for @aaron to review when he's back. I recall this area being refactored in the last two weeks, possibly resolving the issue reported here. Or, if not, it should at least be fresh in mind and perhaps easy to fix.

Krinkle assigned this task to aaron. Aug 6 2019, 5:57 PM
Restricted Application added a project: Core Platform Team. Aug 6 2019, 5:57 PM

Change 529189 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: simplify LoadBalancer::getLaggedReplicaMode()

https://gerrit.wikimedia.org/r/529189

aaron moved this task from Inbox to Doing on the Performance-Team board. Aug 12 2019, 7:44 PM

Change 529189 merged by jenkins-bot:
[mediawiki/core@master] rdbms: simplify LoadBalancer::getLaggedReplicaMode()

https://gerrit.wikimedia.org/r/529189

aaron closed this task as Resolved. Aug 20 2019, 6:01 PM