
SHOW SLAVE STATUS as a health check should have a low timeout
Closed, Resolved, Public

Description

In normal operation the query returns very quickly, but under some conditions it can get stuck. For example, the replica may be waiting to be stopped while that stop is itself waiting for a large write to finish, and the write may in turn be blocked by metadata locking. This scenario is very unlikely, but it did happen on codfw while performing maintenance on one pooled wikidata server, which made all MediaWiki servers fail even though they were only checking enwiki's home.

There are 3 things that could be done to mitigate that:

  • Make sure SHOW SLAVE STATUS has an adequate timeout, in seconds rather than minutes, to avoid pileups. Consider the server dead (delayed) if the timeout is hit.
  • Use pt-heartbeat exclusively for replication checks. This avoids the problems with SHOW SLAVE STATUS, which is not "100% safe" because it requires some locking.
  • Avoid a hard dependency between wikis and wikidata, so that "some content" can still be shown, or fail quickly if the wikidata db is unavailable (this is not trivial and probably out of scope for this ticket, but it is worth mentioning).
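
The first mitigation can be sketched roughly as follows. This is a minimal client-side illustration in Python, not MediaWiki's rdbms code: `probe_with_timeout`, `fast_probe`, `stuck_probe`, and the timeout values are hypothetical names and numbers chosen for the example. A real deployment would more likely rely on a driver-side or server-side query timeout; the point here is only that a timed-out check is reported the same way as a dead (delayed) server.

```python
import threading
import time


def probe_with_timeout(probe, timeout_s=1.0):
    """Run a lag probe, giving up after timeout_s seconds.

    Returns the lag in seconds reported by probe(), or None if the probe
    failed or did not finish in time. Callers are expected to treat None
    as "replica is dead or lagged" and stop routing reads to it.
    """
    result = {}

    def run():
        try:
            result["lag"] = probe()
        except Exception:
            # A failed probe is handled the same way as a timed-out one.
            result["lag"] = None

    # Daemon thread: a stuck probe must not keep the process alive.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return None  # Timed out: consider the server dead (delayed).
    return result.get("lag")


def fast_probe():
    return 0.2  # Hypothetical healthy replica reporting 0.2 s of lag.


def stuck_probe():
    time.sleep(2)  # Simulates a status query blocked behind a lock.
    return 0.0


healthy = probe_with_timeout(fast_probe, timeout_s=1.0)
stuck = probe_with_timeout(stuck_probe, timeout_s=0.2)
print(healthy, stuck)
```

The caller waits at most `timeout_s` per check, so a single stuck replica cannot cause a pileup of waiting requests.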

Event Timeline

@aaron Is this a task Performance is going to take on? I'm going to untag CPT for now, but please retag us if that is incorrect.

Change 534268 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] [WIP] rdbms: add query timeout support to Database::select()

https://gerrit.wikimedia.org/r/534268

BTW, I consider this a smaller issue now that replication control has been migrated to heartbeat, although I am guessing some SHOW SLAVE STATUS calls are still in place. The query timeout, however, will be useful in many other cases, such as limiting DoS caused by API performance spikes.

Krinkle triaged this task as Medium priority. Sep 4 2019, 8:24 PM
Krinkle lowered the priority of this task from Medium to Low. Apr 21 2020, 6:51 PM
aaron removed aaron as the assignee of this task. Oct 18 2021, 6:45 PM
Dinoguy1000 renamed this task from SHOW SLAVE STATUS as a health check should have a low timeout to SHOW REPLICA STATUS as a health check should have a low timeout. Nov 30 2021, 1:28 PM
Dinoguy1000 updated the task description.
Dinoguy1000 renamed this task from SHOW REPLICA STATUS as a health check should have a low timeout to SHOW SLAVE STATUS as a health check should have a low timeout. Nov 30 2021, 2:10 PM
Dinoguy1000 updated the task description.

Change 534268 merged by jenkins-bot:

[mediawiki/core@master] rdbms: add query timeout support to Database::select()

https://gerrit.wikimedia.org/r/534268

Change 747692 had a related patch set uploaded (by Ladsgroup; author: Aaron Schulz):

[mediawiki/core@wmf/1.38.0-wmf.12] rdbms: add query timeout support to Database::select()

https://gerrit.wikimedia.org/r/747692

Change 747692 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.12] rdbms: add query timeout support to Database::select()

https://gerrit.wikimedia.org/r/747692

Mentioned in SAL (#wikimedia-operations) [2021-12-16T15:03:25Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.12/includes/libs/rdbms/database/: Backport: [[gerrit:747692|rdbms: add query timeout support to Database::select() (T129093 T195792)]] (duration: 01m 11s)

According to db-production.php, only the "es" servers use Seconds_Behind_Master (except for the 'is static' servers, which don't need any lag methods). Is there a reason those servers cannot use pt-heartbeat as well?

I don't think there should be any reason for es4 and es5 (non RO es hosts) not to use pt-heartbeat.

Is this just a matter of wmf-config then?

I believe so. At least from our side it's all ready

Tables seem fine. The "es*" vs "cluster*" part of the config will be annoying, though (the "shard" option needs the "es*" name):

> $lb = \MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory()->getExternalLB( 'cluster26' );
= Wikimedia\Rdbms\LoadBalancer {#3349}

> iterator_to_array( $lb->getConnection( DB_REPLICA )->query( "select * from heartbeat.heartbeat" ) );
= [
    {#5769
      +"ts": "2023-02-06T20:31:12.000950",
      +"server_id": "171970708",
      +"file": "es1021-bin.008190",
      +"position": "230205834",
      +"relay_master_log_file": null,
      +"exec_master_log_pos": null,
      +"shard": "es4",
      +"datacenter": "eqiad",
    },
    {#5784
      +"ts": "2023-02-06T20:31:12.001070",
      +"server_id": "180359316",
      +"file": "es2021-bin.008236",
      +"position": "823987058",
      +"relay_master_log_file": "es1021-bin.008190",
      +"exec_master_log_pos": "230205834",
      +"shard": "es4",
      +"datacenter": "codfw",
    },
  ]

> 
> $lb = \MediaWiki\MediaWikiServices::getInstance()->getDBLoadBalancerFactory()->getExternalLB( 'cluster27' );
= Wikimedia\Rdbms\LoadBalancer {#5789}

> iterator_to_array( $lb->getConnection( DB_REPLICA )->query( "select * from heartbeat.heartbeat" ) );
= [
    {#5782
      +"ts": "2023-02-06T20:31:17.000970",
      +"server_id": "171966666",
      +"file": "es1024-bin.008220",
      +"position": "772610538",
      +"relay_master_log_file": null,
      +"exec_master_log_pos": null,
      +"shard": "es5",
      +"datacenter": "eqiad",
    },
    {#3350
      +"ts": "2023-02-06T20:31:17.001170",
      +"server_id": "180367499",
      +"file": "es2023-bin.008269",
      +"position": "503583733",
      +"relay_master_log_file": "es1024-bin.008220",
      +"exec_master_log_pos": "772610538",
      +"shard": "es5",
      +"datacenter": "codfw",
    },
  ]

>
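
To illustrate how rows like these could be turned into a lag value, here is a minimal Python sketch. It is not the MediaWiki implementation: `CLUSTER_TO_SHARD`, `heartbeat_lag`, and the sample `now` value are hypothetical, and the sketch assumes the usual pt-heartbeat reading, where lag is the current time minus the replicated timestamp written on the primary. The cluster-to-shard mapping is taken only from the two queries above (cluster26 returned "es4", cluster27 returned "es5").

```python
from datetime import datetime

# Hypothetical mapping from MediaWiki external cluster names to the
# pt-heartbeat "shard" values seen in the output above.
CLUSTER_TO_SHARD = {"cluster26": "es4", "cluster27": "es5"}


def heartbeat_lag(rows, cluster, primary_dc, now):
    """Return lag in seconds for a cluster, or None if no row matches.

    rows are heartbeat table rows as read from the replica; the row
    written by the primary in primary_dc carries the timestamp that
    has been replicated so far. Timestamps are assumed to share one
    time zone with `now`.
    """
    shard = CLUSTER_TO_SHARD[cluster]
    for row in rows:
        if row["shard"] == shard and row["datacenter"] == primary_dc:
            ts = datetime.fromisoformat(row["ts"])
            # Clamp at zero in case of small clock differences.
            return max(0.0, (now - ts).total_seconds())
    return None


# Trimmed copies of the es4 rows shown above.
rows = [
    {"ts": "2023-02-06T20:31:12.000950", "shard": "es4", "datacenter": "eqiad"},
    {"ts": "2023-02-06T20:31:12.001070", "shard": "es4", "datacenter": "codfw"},
]

# Assumed "current time" on the replica, half a second after the heartbeat.
now = datetime.fromisoformat("2023-02-06T20:31:12.500950")
lag = heartbeat_lag(rows, "cluster26", "eqiad", now)
print(lag)
```

Filtering on both shard and datacenter matters because the table holds one row per writing server, as the output above shows.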

Change 893835 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[operations/mediawiki-config@master] Use pt-heartbeat for all non-static external clusters

https://gerrit.wikimedia.org/r/893835