
MW scripts should reload the database config
Open, Medium, Public

Description

Often, when running long-running tasks such as schema changes, we need to depool the hosts and wait for all the connections (from MW servers) to clean up.
This normally involves wikiuser and wikiadmin: the former is used from production, and wikiadmin is mostly used from mwmaint* hosts for maintenance tasks.

It is pretty common to find maintenance scripts connecting to a single host and not reloading the config, and hence interfering with general depooling/pooling actions. While these are annoying, there are other critical operations where this can be a massive issue (e.g. master swaps).

Lastly, as we are now in the process of running automatic schema changes, the schema change scripts wait for all the connections to be gone after the depooling, so a script that never reconnects can make the schema change take hours before it can proceed.

Scripts should reload the config after each change, or after X minutes.


Event Timeline

Marostegui triaged this task as Medium priority.Jan 3 2022, 3:46 PM
Marostegui moved this task from Triage to Ready on the DBA board.

Change 751923 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] [WIP] maintenance: Make ::getDb() reload the config from time to time

https://gerrit.wikimedia.org/r/751923

At least for my own case, I wrote a script that wraps around mwscript and ran it with something like this:

python3 mwscript-runner.py --dblist=s4 --interval 600 --start 37310754 -- migrateRevisionActorTemp.php --sleep=2

This would restart the script every ten minutes. I think this should slowly become the default way to run maintenance scripts. I don't know how complicated this would be for dumps.

Change 751923 abandoned by Ladsgroup:

[mediawiki/core@master] [WIP] maintenance: Make ::getDb() reload the config from time to time

Reason:

This wouldn't work and it's quite a deviation from mediawiki's design.

https://gerrit.wikimedia.org/r/751923

Is there a way to give the connection a chance to expire, e.g. when the maintenance script is waiting for replication after processing a batch? Can it just close and reopen the connection? Providing a Maintenance method to periodically reset the connection seems vastly simpler than changing how several maintenance scripts run.

I haven't checked the code fully, but it is my general understanding that if a connection fails, MW will try to reconnect, and it also has the ability to consider one of various other replicas. If one of the replicas refuses new connections or is otherwise too lagged or unavailable, I would expect it to pick another one without needing to load new configs (and to remember it for a short time after that via php-apcu).

The only situation in which it should be unwilling to pick a different host is if the connection is for the primary DB. For scripts that perform writes or otherwise need to query a primary, those would then fail with a fatal error, I think. That is bad for unscheduled/manual scripts and snapshot/dump scripts that don't support retry or resume yet, but it should at least result in the old DB no longer being in use within a few minutes, since those write-needing scripts would fatal and exit out.

Question remains then, what maintenance script do we have that keeps trying to do the same thing?

> I haven't checked the code fully, but it is my general understanding that if a connection fails, MW will try to reconnect, and it also has the ability to consider one of various other replicas. If one of the replicas refuses new connections or is otherwise too lagged or unavailable, I would expect it to pick another one without needing to load new configs (and to remember it for a short time after that via php-apcu).

This is a problem, though, when we depool a replica for maintenance, as the script will always pick the same host (or at least that is what it does now) - which essentially means the host isn't depooled and we have to either kill the script and/or carry on with the maintenance (i.e. stop mysql without caring about that connection).
This interferes a lot with how we automatically deploy schema changes, as we wait for the host to be fully empty of connections, which won't happen if the script takes hours to run and keeps picking a depooled host.

> The only situation in which it should be unwilling to pick a different host is if the connection is for the primary DB. For scripts that perform writes or otherwise need to query a primary, those would then fail with a fatal error, I think. That is bad for unscheduled/manual scripts and snapshot/dump scripts that don't support retry or resume yet, but it should at least result in the old DB no longer being in use within a few minutes, since those write-needing scripts would fatal and exit out.

What I described above will also happen if a host is promoted to master. We'd end up with a newly promoted master receiving maintenance read queries from maintenance hosts.

> Question remains then, what maintenance script do we have that keeps trying to do the same thing?

I don't know the full list, but as an example, this week I have had to kill /usr/local/bin/foreachwikiindblist growthexperiments & s7 extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php every time I had to run a schema change and the script was running against the section I was working on.

> I haven't checked the code fully, but it is my general understanding that if a connection fails, MW will try to reconnect, and it also has the ability to consider one of various other replicas. If one of the replicas refuses new connections or is otherwise too lagged or unavailable, I would expect it to pick another one without needing to load new configs (and to remember it for a short time after that via php-apcu).

Maybe I'm missing something obvious, but most maintenance scripts get a `$dbr` object from the LB, and that object has the host and port hard-coded in the constructor. So it connects to the same host during the whole run; even if the connection drops, it just reconnects to the same host. That's why fixing this has been quite complicated for me.
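The host-pinning behaviour described above can be reduced to a tiny illustrative sketch (plain Python, not the real rdbms classes): the host is captured when the handle is constructed, so a reconnect never consults the load balancer and a depooled host keeps receiving connections.

```python
class Database:
    """Toy stand-in for a DB handle whose host is fixed at construction."""

    def __init__(self, host):
        self.host = host          # captured once, in the constructor
        self.connected = True

    def reconnect(self):
        # Reconnects to self.host - it never asks the load balancer for a
        # fresh (possibly different) host, which is exactly the problem.
        self.connected = True
        return self.host
```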

The refreshLinkRecommendations issue is being tracked in the subticket.

I don't think refreshing the DB config is going to work.

  • Various non-rdbms objects/code might keep $db/$lb references around.
  • Some callers might prefer to keep using an MVCC snapshot (from REPEATABLE-READ) when possible, rather than getting switched to another DB.
  • Config loading is still a mix of declarative/imperative logic with hooks and runtime files. This is still a problem even if we had an LBFactory::recycleAll() method to close connections and reload config if enough time passed since init or last "recycle".

It seems doable to have narrow-purpose config reloading just to check "is this DB still pooled?", kind of like how @@READ_ONLY state is cached for X seconds. It could make further calls to a Database handle error out, and possibly make LoadBalancer::getConnection*() avoid such handles in some cases.

AIUI some read-only errors already work in real time (that is, you can get a working DB handle and in a subsequent call within the same request it can throw a DBReadOnlyError, because e.g. DatabaseMysqlBase::serverIsReadOnly() does a check over the network every few seconds). Maybe depooling could have some similar logic; from a maintenance script's perspective, that seems both simpler and more effective than restarting regularly whether or not the connection actually gets depooled.
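The TTL-cached check suggested above could look roughly like this. This is an illustrative Python sketch under assumed names: `pooled_hosts_fn` stands in for whatever fetches the current pool state (e.g. from etcd), and the class only re-fetches once the cached answer is older than `ttl` seconds, mirroring how the @@READ_ONLY state is cached.

```python
import time


class PooledCheck:
    """Cache an 'is this host still pooled?' answer for a few seconds."""

    def __init__(self, pooled_hosts_fn, ttl=10.0, clock=time.monotonic):
        self.pooled_hosts_fn = pooled_hosts_fn  # assumed config fetcher
        self.ttl = ttl
        self.clock = clock
        self._cached = None
        self._fetched_at = float("-inf")

    def is_pooled(self, host):
        now = self.clock()
        if now - self._fetched_at > self.ttl:
            # Cache expired: hit the config source once, then reuse.
            self._cached = set(self.pooled_hosts_fn())
            self._fetched_at = now
        return host in self._cached
```

A Database handle could consult such a check before each batch and raise an error (or ask the LB for a new handle) once its host disappears from the pooled set.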

One way to tackle this and T305016 is to give maintenance scripts a single replica which would be a proxy DB (possibly proxysql) and send all of their read queries to that DB. Maintenance scripts already use a different DB user; making them use a different replica set should be pretty easy, and it would solve a lot of problems, including DB config reload, etc.

Change 798678 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] EXPERIMENT: reload DB config on the fly

https://gerrit.wikimedia.org/r/798678

I just pushed an experimental patch that adds support for the following:

  • define a callback that loads some part of LBFactoryConf from somewhere.
  • define an autoReconfigure method on LBFactory that maintenance scripts can call.
  • when autoReconfigure detects a change in config, it calls reconfigure on all LoadBalancers.
  • LoadBalancer::reconfigure causes all existing connections to be closed, and all existing DBConnRef instances to be notified that they need to get a new connection from the LB.

The patch will probably need some polishing wrt transaction handling, but otherwise it looks like this would work. Note that this doesn't do automatic error recovery; it needs each script to check for config changes periodically.

I imagine the callback would be implemented to get the relevant info from etcd.
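The flow of the four bullet points above can be condensed into an illustrative Python sketch. All names here are stand-ins, not the real MediaWiki classes: `config_cb` plays the role of the config-loading callback, and `LoadBalancer` is reduced to the one behaviour that matters (closing its connections when reconfigured, so handle holders lazily reacquire).

```python
class LoadBalancer:
    """Toy LB: remembers its config and counts forced reconnects."""

    def __init__(self):
        self.config = None
        self.closed_connections = 0

    def reconfigure(self, config):
        # Close existing connections; DBConnRef-style holders would get a
        # fresh handle from the new config on their next use.
        self.config = config
        self.closed_connections += 1


class LBFactory:
    """Toy factory: rechecks config via a callback and propagates changes."""

    def __init__(self, config_cb, load_balancers):
        self.config_cb = config_cb
        self.load_balancers = load_balancers
        self.last_config = config_cb()

    def auto_reconfigure(self):
        # Maintenance scripts call this periodically; it only disturbs the
        # LBs when the config actually changed.
        new_config = self.config_cb()
        if new_config != self.last_config:
            self.last_config = new_config
            for lb in self.load_balancers:
                lb.reconfigure(new_config)
```

In production the callback would read the depool state from etcd; the comparison step is what keeps periodic calls cheap when nothing changed.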

Change 801721 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] EXPERIMENT: allow DB config reload

https://gerrit.wikimedia.org/r/801721

I need to get back to this, it has been causing a lot of headache these days...

> I just pushed an experimental patch that adds support for the following:
>
>   • define a callback that loads some part of LBFactoryConf from somewhere.
>   • define an autoReconfigure method on LBFactory that maintenance scripts can call.
>   • when autoReconfigure detects a change in config, it calls reconfigure on all LoadBalancers.
>   • LoadBalancer::reconfigure causes all existing connections to be closed, and all existing DBConnRef instances to be notified that they need to get a new connection from the LB.
>
> The patch will probably need some polishing wrt transaction handling, but otherwise it looks like this would work. Note that this doesn't do automatic error recovery; it needs each script to check for config changes periodically.
>
> I imagine the callback would be implemented to get the relevant info from etcd.

Actually, I think this can be much simpler now that DBConnRef is used by default. This idea is very top-down; I think we can do part of the work top-down and part of the work bottom-up, and they meet in the middle:

  • Introduce LoadBalancer::reconfigure(), which just updates the configuration, e.g. new calls to getConnectionInternal() wouldn't return the old replica.
  • Introduce the concept of a connection max age in DBConnRef. Basically, __call() checks when the connection was made and, if it's older than, let's say, five minutes (and maybe one or two more conditions, e.g. avoid closing the connection in the middle of a primary transaction), it closes the connection, calls reconfigure, and then calls getConnectionInternal() again.

This would also solve the issue of maintenance scripts hitting new masters (after switchover) thinking they're replicas.

Also, it can avoid closing the connection in case the config hasn't changed, which drastically cuts down the number of reconnections too.
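The bottom-up half, the connection max age, can be sketched as follows. This is illustrative Python, not the real DBConnRef: `get_connection` stands in for the LB's getConnectionInternal(), and `handle()` plays the role of the check inside __call(). The transaction guard from the bullet above is included.

```python
import time


class ConnRef:
    """Toy connection ref that recycles its connection after max_age seconds."""

    def __init__(self, get_connection, max_age=300.0, clock=time.monotonic):
        self.get_connection = get_connection  # stands in for the LB call
        self.max_age = max_age
        self.clock = clock
        self.conn = get_connection()
        self.opened_at = clock()
        self.in_transaction = False

    def handle(self):
        # Runs on every use (the __call__ path in the description above):
        # if the connection is too old and no transaction is open, drop it
        # and ask the LB again - which may now return a different host.
        age = self.clock() - self.opened_at
        if age > self.max_age and not self.in_transaction:
            self.conn = self.get_connection()
            self.opened_at = self.clock()
        return self.conn
```

Combined with a reconfigure() on the LB side, an aged-out connection naturally migrates off a depooled host the next time the script touches the handle.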

> Also, it can avoid closing the connection in case the config hasn't changed, which drastically cuts down the number of reconnections too.

My patch also avoids that. It only triggers if the config actually changed.

I think my patch is pretty much what you are asking for, except for the "timeout" bit. It is indeed relying on DBConnRef being used.

DBConnectionRef is the default everywhere now. It is really really hard to actually grab a Database object directly now ^^

Change 798678 merged by jenkins-bot:

[mediawiki/core@master] Allow DB config to be reloaded on the fly

https://gerrit.wikimedia.org/r/798678

Change 821197 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Stop accepting live connection in DBConnRef constructor

https://gerrit.wikimedia.org/r/821197

Change 821197 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Stop accepting live connection in DBConnRef constructor

https://gerrit.wikimedia.org/r/821197

Change 824411 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Simplify wmfEtcdApplyDBConfig() a bit

https://gerrit.wikimedia.org/r/824411

Change 824411 merged by jenkins-bot:

[operations/mediawiki-config@master] Simplify wmfEtcdApplyDBConfig() a bit

https://gerrit.wikimedia.org/r/824411

Mentioned in SAL (#wikimedia-operations) [2022-08-18T09:06:03Z] <ladsgroup@deploy1002> Synchronized wmf-config/etcd.php: Config: [[gerrit:824411|Simplify wmfEtcdApplyDBConfig() a bit (T298485)]], Part I (duration: 03m 02s)

Mentioned in SAL (#wikimedia-operations) [2022-08-18T09:10:05Z] <ladsgroup@deploy1002> Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:824411|Simplify wmfEtcdApplyDBConfig() a bit (T298485)]], Part II (duration: 03m 11s)

Change 824413 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Allow passing arguments to wmfEtcdApplyDBConfig()

https://gerrit.wikimedia.org/r/824413

Change 824413 merged by jenkins-bot:

[operations/mediawiki-config@master] Allow passing arguments to wmfEtcdApplyDBConfig()

https://gerrit.wikimedia.org/r/824413

Mentioned in SAL (#wikimedia-operations) [2022-08-18T09:56:50Z] <ladsgroup@deploy1002> Synchronized wmf-config/etcd.php: Config: [[gerrit:824413|Allow passing arguments to wmfEtcdApplyDBConfig() (T298485)]] (duration: 03m 40s)

Change 824418 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Call wmfApplyEtcdDBConfig() directly in CS.php

https://gerrit.wikimedia.org/r/824418

Change 824418 merged by jenkins-bot:

[operations/mediawiki-config@master] Call wmfApplyEtcdDBConfig() directly in CS.php

https://gerrit.wikimedia.org/r/824418

Mentioned in SAL (#wikimedia-operations) [2022-08-18T10:15:59Z] <ladsgroup@deploy1002> Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:824418|Call wmfApplyEtcdDBConfig() directly in CS.php (T298485)]] (duration: 03m 46s)

Change 824419 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Drop now-unused wmfEtcdApplyDBConfig()

https://gerrit.wikimedia.org/r/824419

Change 824419 merged by jenkins-bot:

[operations/mediawiki-config@master] Drop now-unused wmfEtcdApplyDBConfig()

https://gerrit.wikimedia.org/r/824419

Mentioned in SAL (#wikimedia-operations) [2022-08-18T10:27:44Z] <ladsgroup@deploy1002> Synchronized wmf-config/etcd.php: Config: [[gerrit:824419|Drop now-unused wmfEtcdApplyDBConfig() (T298485)]] (duration: 03m 36s)

Change 801721 merged by jenkins-bot:

[operations/mediawiki-config@master] Allow DB config reload

https://gerrit.wikimedia.org/r/801721

Change 824460 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Fix call to wmfApplyEtcdDBConfig

https://gerrit.wikimedia.org/r/824460

Change 824460 merged by jenkins-bot:

[operations/mediawiki-config@master] Fix call to wmfApplyEtcdDBConfig

https://gerrit.wikimedia.org/r/824460

It needs a bit of work. First of all, the config needs to actually apply; currently it gets overridden by db-production.php. Then it needs to also apply what's in db-production, keeping the s11 overrides in mind. My tests didn't show it'd be fixed after these, though. I need to debug even further.

Change 824476 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Disable LBFactory config callback for now

https://gerrit.wikimedia.org/r/824476

Change 824476 merged by jenkins-bot:

[operations/mediawiki-config@master] Disable LBFactory config callback for now

https://gerrit.wikimedia.org/r/824476

I'm trying to figure out what to do, and the biggest problem is that the configuration gets transformed multiple times in multiple places (MWLBFactory, LBFactoryMulti::newLoadBalancer(), LBFactoryMulti::makeServerConfigArrays(), etc.), so adding or removing a server will have to go through all of them.

The original proposal had the old config doing an array plus on the new one, but:

  • It doesn't apply the server template if we repool a host (or pool it for the first time).
  • It doesn't remove the old DB when it's depooled, because it still exists in the old config.

I'm trying to decide which way to go: rework the whole configuration so I can pass the new configuration there, or do the array plus but work around the issues I mentioned.
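Both failure modes come straight from PHP's `+` (union) operator on arrays, where keys already present in the left operand always win. A small illustrative Python equivalent (the config keys here are made up for the example):

```python
def array_plus(old, new):
    """Mimic PHP's `$old + $new`: keys of `old` win, keys only in `new` are appended."""
    merged = dict(new)
    merged.update(old)
    return merged


old = {"db1": {"load": 100, "template": True},
       "db2": {"load": 50, "template": True}}
new = {"db1": {"load": 100},              # db2 was depooled...
       "db3": {"load": 25}}               # ...and db3 was freshly pooled

merged = array_plus(old, new)
# Problem 1: the depooled host survives the merge.
assert "db2" in merged
# Problem 2: the newly pooled host comes in raw, without the server template.
assert merged["db3"] == {"load": 25}
```

This is why a plain union of old over new can neither drop a depooled host nor fully initialise a repooled one.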

Possibly we could pass the original config to the LBFactory classes instead of the LBs, and then make it transform the config and push it through to the LBs. I'll see what I can do about that.

Change 828577 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Instead of reconfiguring all of LB, just remove depooled db

https://gerrit.wikimedia.org/r/828577

I'm really tired. After hours and hours of reading code, the whole thing doesn't make any damn sense; I think I've found at least several bugs as well.