
Re-evaluate whether WMF's MainStash config should use LBFactory
Closed, Resolved (Public)

Description

In the context of T387509, I noticed that a routine test of read-only mode (via $wgReadOnly or $wgLBFactoryConf) resulted in DBReadOnlyError exceptions for writes to the MainStash.

Status quo

Read-only mode in MediaWiki is specific to, and propagated by, the LBFactory service class. This class is in charge of connections and queries to "MediaWiki databases".

As such, MediaWiki read-only mode should not (and does not) affect local services like Memcached, Cassandra (SessionStore), Swift, or Kafka (EventBus).

Idem for ParserCache at WMF. In wmf-config, we configure ParserCache as an instance of SqlBagOStuff that is directly given a list of hostnames. It does not use the LBFactory service class to establish database connections.

The MainStash at WMF (wmf-config source) is also an instance of SqlBagOStuff, but it uses the cluster and dbDomain options to fetch a list of x2 hostnames from the LBFactory service. I suspect this helps certain DBA tasks, because it means the full range of dbctl features for MediaWiki databases is also offered to the MainStash/x2 cluster (unlike for ParserCache).
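To illustrate the difference between the two setups, here is a small Python model. It is only a sketch and not MediaWiki code: `LBFactoryModel`, `SqlStoreModel`, `Connection`, and `ReadOnlyError` are hypothetical stand-ins for LBFactory, SqlBagOStuff, a database handle, and DBReadOnlyError, and the hostnames are placeholders. It shows why a store that obtains its connections through the factory inherits read-only mode, while a store given hostnames directly does not.

```python
class ReadOnlyError(Exception):
    """Stand-in for DBReadOnlyError (hypothetical model, not MediaWiki code)."""


class Connection:
    def __init__(self, host, read_only_reason=None):
        self.host = host
        self.read_only_reason = read_only_reason
        self.rows = {}

    def write(self, key, value):
        # A connection handed out in read-only mode refuses writes.
        if self.read_only_reason is not None:
            raise ReadOnlyError(self.read_only_reason)
        self.rows[key] = value


class LBFactoryModel:
    """Stand-in for LBFactory: owns cluster host lists and the read-only flag."""

    def __init__(self, clusters, read_only_reason=None):
        self.clusters = clusters
        self.read_only_reason = read_only_reason

    def get_connection(self, cluster):
        # Read-only mode is propagated to every connection the factory creates.
        host = self.clusters[cluster][0]
        return Connection(host, self.read_only_reason)


class SqlStoreModel:
    """Stand-in for SqlBagOStuff, configured with direct hosts or a cluster."""

    def __init__(self, servers=None, lb_factory=None, cluster=None):
        if servers is not None:
            # ParserCache-style: direct host list, bypasses the factory.
            self.conn = Connection(servers[0])
        else:
            # MainStash-style as described above: hosts come from the factory.
            self.conn = lb_factory.get_connection(cluster)
        self.last_error = None

    def set(self, key, value):
        try:
            self.conn.write(key, value)
            return True
        except ReadOnlyError as error:
            # The failure is recorded and reported as False, not re-raised.
            self.last_error = error
            return False


factory = LBFactoryModel(
    {"x2": ["x2-host.example"]}, read_only_reason="switchover test"
)
parser_cache = SqlStoreModel(servers=["pc-host.example"])
main_stash = SqlStoreModel(lb_factory=factory, cluster="x2")

direct_ok = parser_cache.set("k", "v")   # direct hosts: unaffected by the flag
stash_ok = main_stash.set("k", "v")      # factory connection: write rejected
```

In this model the read-only flag lives only in the factory, so the only way for a store to be affected is to ask the factory for its connection.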

Evaluate

This task is to answer these questions:

  1. Is this expected and useful from an SRE ServiceOps perspective? (context: Routine switchover tests)

Afaik when a datacenter is placed into MW read-only mode, the intent is to make sure that no cross-datacenter connections are initiated. That is, if for some reason read-write requests were routed here, they will fail. And any rare use of optional/deferred/best-effort writes on read requests is proactively skipped.

As such, MW read-only mode does not prevent writes to php-apcu, Memcached, Cassandra, or ParserCache. Yet, it does currently prevent writes to MainStash, even though it would be writing to a DC-local service (same as ParserCache).

  2. Is this expected and useful from a DBA perspective? (context: DB maintenance)

I don't know if it's common to use MW's read-only mode during DBA work. If it is, is this difference considered useful?

  3. Is this supported in MediaWiki from a developer perspective?

The MainStash service catches database failures so that they don't crash the web request. It instead returns false, and lets the caller decide whether this write is a functional requirement.
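The "return false and let the caller decide" contract can be sketched in Python as follows. This is an illustrative model under stated assumptions, not MediaWiki's implementation: `StashModel`, `best_effort_write`, and `required_write` are hypothetical names, and a plain `RuntimeError` stands in for a database failure.

```python
class StashModel:
    """Hypothetical stand-in for the MainStash; not MediaWiki's API."""

    def __init__(self, writable=True):
        self.writable = writable
        self.data = {}

    def set(self, key, value):
        try:
            if not self.writable:
                raise RuntimeError("database is read-only")
            self.data[key] = value
            return True
        except RuntimeError:
            # Swallow the database failure; the caller only sees False.
            return False


def best_effort_write(stash, key, value):
    """Read-request style: a failed write is ignored and the request goes on."""
    stash.set(key, value)
    return "page rendered"


def required_write(stash, key, value):
    """User-write style: a failed write becomes a user-facing error."""
    if not stash.set(key, value):
        return "error: could not save"
    return "saved"


read_only_stash = StashModel(writable=False)
read_result = best_effort_write(read_only_stash, "hint", 1)
write_result = required_write(read_only_stash, "draft", "text")
```

Under this model, a read-only stash degrades read requests silently, while requests that functionally depend on the write surface an error to the user.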

During actions thought of as "writes by users", the MainStash may be functionally relied upon in this way.

However, during "read" requests it is (afaik) not functionally relied on. (By "read" requests I mean requests routed to the secondary DC, not strictly GET/POST, per Multi-DC T91820.)

This means, while perhaps not ideal, the current situation is supported by and compatible with MediaWiki.

Event Timeline

> I suspect this helps certain DBA tasks.

It really doesn't. As I've said before, anything with x2 is a pain from a DBA point of view. The best it can provide is having the server template, so we wouldn't need to repeat $wgDBpassword and other variables, but that's about it. ParserCache doesn't use LB config and we already use dbctl for it.

I'm already moving away from LB in this mw-config patch: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1123447

Change #1125556 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Migrate x2 off LB config

https://gerrit.wikimedia.org/r/1125556

Change #1125556 merged by jenkins-bot:

[operations/mediawiki-config@master] Migrate x2 off LB config

https://gerrit.wikimedia.org/r/1125556

Mentioned in SAL (#wikimedia-operations) [2025-03-24T10:14:58Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]]

Change #1130542 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] etcd: Make Mainstash config global variable

https://gerrit.wikimedia.org/r/1130542

Change #1130542 merged by jenkins-bot:

[operations/mediawiki-config@master] etcd: Make Mainstash config global variable

https://gerrit.wikimedia.org/r/1130542

Mentioned in SAL (#wikimedia-operations) [2025-03-24T10:21:02Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]], [[gerrit:1130542|etcd: Make Mainstash config global variable (T383327 T387654)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-24T10:26:01Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]], [[gerrit:1130542|etcd: Make Mainstash config global variable (T383327 T387654)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-24T10:38:42Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]], [[gerrit:1130542|etcd: Make Mainstash config global variable (T383327 T387654)]] (duration: 17m 39s)

@Krinkle Do you want to remove support for LB code in SqlBagOStuff too, or do you just want to clean up our production config?

Change #1130545 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] beta: Fix mainstash config

https://gerrit.wikimedia.org/r/1130545

Change #1130545 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Fix mainstash config

https://gerrit.wikimedia.org/r/1130545

Change #1130550 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] beta: Fix mainstash, take II

https://gerrit.wikimedia.org/r/1130550

Change #1130550 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Fix mainstash, take II

https://gerrit.wikimedia.org/r/1130550

WMF config is now done. The fix for the beta cluster is ugly, but c'est la vie.

Krinkle claimed this task.