Page MenuHomePhabricator

MW scripts should reload the database config
Closed, ResolvedPublic

Description

Often when running long-running tasks such as schema changes we need to depool the hosts and wait for all the connections (from MW servers) to clean up.
This normally involves wikiuser and wikiadmin, the first one is used from production and wikiadmin is mostly used from mwmaint* hosts for maintenance tasks.

It is pretty common to find maintenance scripts connecting to a single host and not reloading the config and hence messing up with general depooling/pooling actions. While these are annoying, there are other critical operations where this can be a massive issue (ie: master swaps).

Lastly, as we are now in process of running automatic schema changes, the schema change scripts does wait for all the connections to be gone after the depooling, so this can mess up and make the script take hours before it can proceed.

Scripts should reload the config after each change, or after X minutes.

See also:

Details

SubjectRepoBranchLines +/-
operations/mediawiki-configmaster+8 -7
mediawiki/coreREL1_39+0 -1
mediawiki/coremaster+0 -1
mediawiki/corewmf/1.40.0-wmf.8+0 -1
mediawiki/corewmf/1.40.0-wmf.7+61 -121
mediawiki/coremaster+61 -121
operations/mediawiki-configmaster+1 -3
operations/mediawiki-configmaster+15 -11
mediawiki/coremaster+88 -48
mediawiki/corewmf/1.40.0-wmf.5+87 -48
operations/mediawiki-configmaster+11 -10
operations/mediawiki-configmaster+1 -1
operations/mediawiki-configmaster+18 -0
operations/mediawiki-configmaster+2 -7
operations/mediawiki-configmaster+1 -1
operations/mediawiki-configmaster+19 -12
operations/mediawiki-configmaster+55 -56
mediawiki/coremaster+12 -23
mediawiki/coremaster+592 -136
mediawiki/coremaster+15 -2
Show related patches Customize query in gerrit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

One way to tackle this and T305016 is to make maint scripts have one replica which would be a proxy db (possibly proxysql) and make all of the read queries to that db. Maint scripts already use a different db user, making them have a different replica set should be pretty easy and it would solve a lot of problems including db config reload, etc. etc.

Change 798678 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] EXPERIMENT: reload DB config on the fly

https://gerrit.wikimedia.org/r/798678

I just pushed an experimental patch that ads suppor for the following:

  • define a callback that loads some part of LBFactoryConf from somewhere.
  • define a autoreConfigure method on LBFactoryConf that maintenance scripts can call.
  • when autoreConfigure detects a change in config, it calls reconfigure on all LoadBalancers.
  • LoadBalancer::reconfigure causes all existing connections to be closed, and all existing DBConnRef instances to be notified that the need to get a new connection from the LB.

The patch will probably need some polishing wrt transaction handling, but otherwise it looks like this would work. Note that this doesn't do automatic error recovery, it needs each scriupt to check for config changes periodically.

I imaging the callback would be implemented to get the relevant info from etcd.

Change 801721 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/mediawiki-config@master] EXPERIMENT: allow DB config reload

https://gerrit.wikimedia.org/r/801721

I need to get back to this, it has been causing a lot of headache these days...

I just pushed an experimental patch that ads suppor for the following:

  • define a callback that loads some part of LBFactoryConf from somewhere.
  • define a autoreConfigure method on LBFactoryConf that maintenance scripts can call.
  • when autoreConfigure detects a change in config, it calls reconfigure on all LoadBalancers.
  • LoadBalancer::reconfigure causes all existing connections to be closed, and all existing DBConnRef instances to be notified that the need to get a new connection from the LB.

The patch will probably need some polishing wrt transaction handling, but otherwise it looks like this would work. Note that this doesn't do automatic error recovery, it needs each scriupt to check for config changes periodically.

I imaging the callback would be implemented to get the relevant info from etcd.

Actually I think this can be much simpler now that DbConnectionRef is used as default. This idea is very top-down, I think we can do part of the work as top-down and part of the work bottom-up and they meet at the middle:

  • Introduce LoadBalancer::reconfigure() that just updates the configuration. e.g. new calls to getConnectionInternal() wouldn't return the old replica
  • Introduce concept of connection max age in DBConnectionRef. Basically in __call() checks when the connection has been made and if it's older than let's say five minutes (and maybe one or two more conditions, e.g. avoid closing the connection at the middle of a primary transaction), it closes the connection, calls the reconfigure and then calls getConnectionInternal() again.

This would also solve the issue of maint scripts hitting new masters (after switchover) thinking it's a replica

Also it can avoid closing the connection in case the config hasn't changed this avoid drastically avoid a lot of reconnections too.

Also it can avoid closing the connection in case the config hasn't changed this avoid drastically avoid a lot of reconnections too.

My patch also avoid that. It only triggers if the config actually changed.

I think my patch is pretty much what you are asking for, except for the "timeout" bit. It is indeed relying on DBConnectionRef being used.

DBConnectionRef is the default everywhere now. It is really really hard to actually grab a Database object directly now ^^

Change 798678 merged by jenkins-bot:

[mediawiki/core@master] Allow DB config to be reloaded on the fly

https://gerrit.wikimedia.org/r/798678

Change 821197 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Stop accepting live connection in DBConnRef constructor

https://gerrit.wikimedia.org/r/821197

Change 821197 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Stop accepting live connection in DBConnRef constructor

https://gerrit.wikimedia.org/r/821197

Change 824411 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Simplify wmfEtcdApplyDBConfig() a bit

https://gerrit.wikimedia.org/r/824411

Change 824411 merged by jenkins-bot:

[operations/mediawiki-config@master] Simplify wmfEtcdApplyDBConfig() a bit

https://gerrit.wikimedia.org/r/824411

Mentioned in SAL (#wikimedia-operations) [2022-08-18T09:06:03Z] <ladsgroup@deploy1002> Synchronized wmf-config/etcd.php: Config: [[gerrit:824411|Simplify wmfEtcdApplyDBConfig() a bit (T298485)]], Part I (duration: 03m 02s)

Mentioned in SAL (#wikimedia-operations) [2022-08-18T09:10:05Z] <ladsgroup@deploy1002> Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:824411|Simplify wmfEtcdApplyDBConfig() a bit (T298485)]], Part II (duration: 03m 11s)

Change 824413 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Allow passing arguments to wmfEtcdApplyDBConfig()

https://gerrit.wikimedia.org/r/824413

Change 824413 merged by jenkins-bot:

[operations/mediawiki-config@master] Allow passing arguments to wmfEtcdApplyDBConfig()

https://gerrit.wikimedia.org/r/824413

Mentioned in SAL (#wikimedia-operations) [2022-08-18T09:56:50Z] <ladsgroup@deploy1002> Synchronized wmf-config/etcd.php: Config: [[gerrit:824413|Allow passing arguments to wmfEtcdApplyDBConfig() (T298485)]] (duration: 03m 40s)

Change 824418 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Call wmfApplyEtcdDBConfig() directly in CS.php

https://gerrit.wikimedia.org/r/824418

Change 824418 merged by jenkins-bot:

[operations/mediawiki-config@master] Call wmfApplyEtcdDBConfig() directly in CS.php

https://gerrit.wikimedia.org/r/824418

Mentioned in SAL (#wikimedia-operations) [2022-08-18T10:15:59Z] <ladsgroup@deploy1002> Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:824418|Call wmfApplyEtcdDBConfig() directly in CS.php (T298485)]] (duration: 03m 46s)

Change 824419 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Drop now-unused wmfEtcdApplyDBConfig()

https://gerrit.wikimedia.org/r/824419

Change 824419 merged by jenkins-bot:

[operations/mediawiki-config@master] Drop now-unused wmfEtcdApplyDBConfig()

https://gerrit.wikimedia.org/r/824419

Mentioned in SAL (#wikimedia-operations) [2022-08-18T10:27:44Z] <ladsgroup@deploy1002> Synchronized wmf-config/etcd.php: Config: [[gerrit:824419|Drop now-unused wmfEtcdApplyDBConfig() (T298485)]] (duration: 03m 36s)

Change 801721 merged by jenkins-bot:

[operations/mediawiki-config@master] Allow DB config reload

https://gerrit.wikimedia.org/r/801721

Change 824460 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Fix call to wmfApplyEtcdDBConfig

https://gerrit.wikimedia.org/r/824460

Change 824460 merged by jenkins-bot:

[operations/mediawiki-config@master] Fix call to wmfApplyEtcdDBConfig

https://gerrit.wikimedia.org/r/824460

It needs a bit of work, first of all, the config needs to actually apply, currently it gets overriden by db-production.php, then it needs to also apply what's in db-production, needs to keep s11 overrides in mind. My tests didn't show it'd be fixed after these tho. I need to debug even further.

Change 824476 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Disable LBFactory config callback for now

https://gerrit.wikimedia.org/r/824476

Change 824476 merged by jenkins-bot:

[operations/mediawiki-config@master] Disable LBFactory config callback for now

https://gerrit.wikimedia.org/r/824476

I'm trying to figure out what to do and the biggest problem is that the configuration gets transformed multiple times in multiple places (MWLBFactory, LBFactoryMulti::newLoadBalancer, LBFactoryMulti::makeServerConfigArrays(), etc.) so adding a new server or removing one will have to go through all of them.

The original proposal had old config doing array plus on the new one but: 1- It doesn't apply the server template if we repool a host (or pool it for the first time) 2- It doesn't remove the old db if depooled because it exists in the old config.
I'm trying to think which way to go, rework the whole configuration to I can pass the new configuration there or do plus but work around these issues I mentioned.

Possibly we could pass the original config to LBFactory classes instead of LB and then make it transform the config and push it trough to the LBs. I'll see what I can do about that.

Change 828577 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Instead of reconfiguring all of LB, just remove depooled db

https://gerrit.wikimedia.org/r/828577

I'm really tired, after hours and hours of reading code the whole thing doesn't make any damn sense, I think I've found at least several bugs as well.

Change 841873 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.40.0-wmf.5] rdbms: Instead of reconfiguring all of LB, just remove depooled db

https://gerrit.wikimedia.org/r/841873

Change 841873 merged by jenkins-bot:

[mediawiki/core@wmf/1.40.0-wmf.5] rdbms: Instead of reconfiguring all of LB, just remove depooled db

https://gerrit.wikimedia.org/r/841873

Mentioned in SAL (#wikimedia-operations) [2022-10-12T13:53:12Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:841873|rdbms: Instead of reconfiguring all of LB, just remove depooled db (T298485)]]

Mentioned in SAL (#wikimedia-operations) [2022-10-12T13:53:36Z] <ladsgroup@deploy1002> ladsgroup and ladsgroup: Backport for [[gerrit:841873|rdbms: Instead of reconfiguring all of LB, just remove depooled db (T298485)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet

I managed to test this in production. The patch needed a small fix and then config needs a bit of change but after that, it works like a charm.

I wrote a script that queries the database, outputs a value (max rev_id of the wiki), outputs what db it queried and sleeps for 30 seconds (source code: P35437)

ladsgroup@mwdebug1001:/srv/mediawiki/php-1.40.0-wmf.5$ mwscript maintenance/testDepool.php --wiki=testwiki
547707
db1175
547707
db1175
547707
db1175
*I depooled db1175 here*
547707
db1175
547707
db1179
547707
db1179
547707
db1179
547707
db1179

Change 828577 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Instead of reconfiguring all of LB, just remove depooled db

https://gerrit.wikimedia.org/r/828577

Change 828577 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Instead of reconfiguring all of LB, just remove depooled db

https://gerrit.wikimedia.org/r/828577

So for long running scripts that use this mechanism, it is possible that all dbs in a pool could be depooled, if the pool is small (like vslow/dumps). Then traffic would fall back to the regular dbs. What do we think about that?

Change 828577 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Instead of reconfiguring all of LB, just remove depooled db

https://gerrit.wikimedia.org/r/828577

So for long running scripts that use this mechanism, it is possible that all dbs in a pool could be depooled, if the pool is small (like vslow/dumps). Then traffic would fall back to the regular dbs. What do we think about that?

That already happens when we depool a host for maint and the script starts in the mean time (I have seen it multiple times) and keep it in mind that maint scripts need to finish soon regardless (updating the rest of config, mw version deployments, backport deployments, etc.) and this only is for quick need of a db being depooled and it doesn't mean it's okay to have a maint script running for days.

Change 848201 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Enable LBFactory config callback in CLI

https://gerrit.wikimedia.org/r/848201

Change 848201 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable LBFactory config callback in CLI

https://gerrit.wikimedia.org/r/848201

Mentioned in SAL (#wikimedia-operations) [2022-10-24T11:09:13Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:848201|Enable LBFactory config callback in CLI (T298485)]]

Mentioned in SAL (#wikimedia-operations) [2022-10-24T11:09:33Z] <ladsgroup@deploy1002> ladsgroup and ladsgroup: Backport for [[gerrit:848201|Enable LBFactory config callback in CLI (T298485)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-10-24T11:17:49Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:848201|Enable LBFactory config callback in CLI (T298485)]] (duration: 08m 35s)

This is done. Maint scripts need to use $this->waitForReplication(); to trigger this

Ladsgroup moved this task from Ready to Done on the DBA board.

Change 848452 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] maintenance: Use $this->waitForReplication()

https://gerrit.wikimedia.org/r/848452

Change 848453 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Avoid using DBLoadBalancerFactoryConfigBuilder mw service

https://gerrit.wikimedia.org/r/848453

Change 848453 merged by jenkins-bot:

[operations/mediawiki-config@master] Avoid using DBLoadBalancerFactoryConfigBuilder mw service

https://gerrit.wikimedia.org/r/848453

Mentioned in SAL (#wikimedia-operations) [2022-10-24T18:53:31Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:848453|Avoid using DBLoadBalancerFactoryConfigBuilder mw service (T298485)]]

Mentioned in SAL (#wikimedia-operations) [2022-10-24T18:53:50Z] <ladsgroup@deploy1002> ladsgroup and ladsgroup: Backport for [[gerrit:848453|Avoid using DBLoadBalancerFactoryConfigBuilder mw service (T298485)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-10-24T19:00:26Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:848453|Avoid using DBLoadBalancerFactoryConfigBuilder mw service (T298485)]] (duration: 06m 55s)

Change 848452 merged by jenkins-bot:

[mediawiki/core@master] maintenance: Use $this->waitForReplication()

https://gerrit.wikimedia.org/r/848452

Change 850079 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.40.0-wmf.7] maintenance: Use $this->waitForReplication()

https://gerrit.wikimedia.org/r/850079

Change 850079 merged by jenkins-bot:

[mediawiki/core@wmf/1.40.0-wmf.7] maintenance: Use $this->waitForReplication()

https://gerrit.wikimedia.org/r/850079

Mentioned in SAL (#wikimedia-operations) [2022-10-27T12:47:51Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:850079|maintenance: Use $this->waitForReplication() (T298485)]]

Mentioned in SAL (#wikimedia-operations) [2022-10-27T12:48:09Z] <ladsgroup@deploy1002> ladsgroup and ladsgroup: Backport for [[gerrit:850079|maintenance: Use $this->waitForReplication() (T298485)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-10-27T12:52:29Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:850079|maintenance: Use $this->waitForReplication() (T298485)]] (duration: 04m 40s)

Change 852990 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852990

Change 852883 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.40.0-wmf.8] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852883

Change 852990 merged by jenkins-bot:

[mediawiki/core@master] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852990

Change 852884 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/core@REL1_39] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852884

Change 852883 merged by jenkins-bot:

[mediawiki/core@wmf/1.40.0-wmf.8] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852883

Mentioned in SAL (#wikimedia-operations) [2022-11-03T18:59:29Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]]

Mentioned in SAL (#wikimedia-operations) [2022-11-03T18:59:48Z] <ladsgroup@deploy1002> ladsgroup and ladsgroup: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-11-03T19:03:54Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:852883|WikiExporter: Avoid calling reload in processing every row (T298485 T322360)]] (duration: 04m 24s)

Change 852884 merged by jenkins-bot:

[mediawiki/core@REL1_39] WikiExporter: Avoid calling reload in processing every row

https://gerrit.wikimedia.org/r/852884

Change 854090 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] Only Enable LBFactory config callback in CLI in production

https://gerrit.wikimedia.org/r/854090

Change 854090 merged by jenkins-bot:

[operations/mediawiki-config@master] Only Enable LBFactory config callback in CLI in production

https://gerrit.wikimedia.org/r/854090

Mentioned in SAL (#wikimedia-operations) [2022-11-09T21:09:50Z] <samtar@deploy1002> Started scap: Backport for [[gerrit:854090|Only Enable LBFactory config callback in CLI in production (T298485)]]

Mentioned in SAL (#wikimedia-operations) [2022-11-09T21:10:10Z] <samtar@deploy1002> samtar and dancy: Backport for [[gerrit:854090|Only Enable LBFactory config callback in CLI in production (T298485)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-11-09T21:15:32Z] <samtar@deploy1002> Finished scap: Backport for [[gerrit:854090|Only Enable LBFactory config callback in CLI in production (T298485)]] (duration: 05m 41s)