
Disable maintenance scripts via conftool
Closed, Resolved · Public

Description

Maintenance scripts used to be scheduled as cron jobs, running only in the active data center. Now, instead, they run as systemd timers in both DCs, but they're called via the mw-cli-wrapper script. That wrapper first checks confd to see if it's running in the active DC, and exits if not, so the actual maintenance work is still only performed in the active DC.

The switchdc cookbook stops cron jobs in the origin DC before the switch, and then starts them in the destination DC afterward -- which means it no longer has any effect on the maintenance scripts using this wrapper (i.e., most of them). That's mostly fine, because mw-cli-wrapper reads from confd, which is updated as part of the switchover. But it means nothing stops those scripts from running during the switch.

We could fix this by adding another field to the mwconfig confd entity, called maintenance_disabled. mw-cli-wrapper would exit early if it was in the inactive data center or if maintenance_disabled was true, so setting that flag would effectively pause maintenance scripts in both DCs. Then the 01-stop-maintenance cookbook can set the maintenance_disabled flag, and 08-start-maintenance can clear it.
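Roughly, the wrapper check being proposed would look something like this. This is only a sketch: the state-file path, field names, and the way mw-cli-wrapper actually reads confd state are assumptions for illustration, not the real implementation.

```bash
#!/bin/bash
# Hypothetical sketch: the state file path, field names and JSON layout are
# illustrative, not the real confd template used by mw-cli-wrapper.
STATE_FILE=/etc/conftool-state/mwconfig.json
LOCAL_DC="eqiad"   # in reality derived from the host's own configuration

master_dc=$(jq -r '.WMFMasterDatacenter' "$STATE_FILE")
maintenance_disabled=$(jq -r '.maintenance_disabled' "$STATE_FILE")

if [[ "$master_dc" != "$LOCAL_DC" ]]; then
    echo "Not the active datacenter; exiting." >&2
    exit 0
fi
if [[ "$maintenance_disabled" == "true" ]]; then
    echo "Maintenance is disabled (switchover in progress?); exiting." >&2
    exit 0
fi

exec "$@"   # hand off to the actual maintenance command
```

The maintenance_host variant suggested below would have the same shape, comparing a hostname string against the local hostname instead of testing a boolean.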

@Volans @Joe Seem reasonable to you? I'm happy to send patches if the basic approach is sound.

Event Timeline

RLazarus triaged this task as Medium priority. Oct 28 2020, 9:46 PM

I'm not familiar with the usual execution time of all the scripts, but I'm wondering if we also need to stop all the systemd units triggered by the timers, and keep the kill commands anyway to clean up jobs that haven't been migrated yet, plus anything else left over.

If we're introducing a new parameter, I'm also wondering if we should decouple the scripts from the master datacenter and have something more flexible. I'm thinking of the case where we want to add a new mwmaint server with a new OS to test, and stay in control of where the scripts are and aren't running, avoiding running them on multiple hosts. If we want to go in this direction, maybe we could convert your maintenance_disabled into maintenance_host, changing it from a boolean into a string holding the hostname where the scripts should run. Setting it to an empty string or none would keep any job from running anywhere, achieving the same goal.
My 2 cents.

Oh, that's a good idea! We'd set maintenance_host to the empty string before the switchover, so that no new jobs would start anywhere, then set it to the new hostname afterward.

I'm not sure if the switchdc cookbook could automatically deduce the correct value for maintenance_host after the switch, though. It's easy enough as long as A:mw-maintenance and A:eqiad only matches a single host, but if we end up with more than one (e.g. your new OS scenario), I'm not sure we'd want the cookbook to guess which one to use. I suppose in that case we can just list all the hostnames that match, and prompt the operator for which one to use.

(Re stopping in-progress maintenance scripts: Yep, we should do that, in the same way that we currently kill all the cronjobs. The good news is, the systemd unit name always starts with mediawiki_job_ so that'll be easy. We can rip the old crontab stuff out of spicerack too, but only when the WDQS dispatcher isn't using it anymore.)
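If it helps, here is a loose Python sketch of that deduce-or-prompt logic. The query result and the prompt function are passed in as stand-ins, since this isn't the actual Spicerack/cookbook API; names are illustrative only.

```python
from typing import Callable, Sequence


def pick_maintenance_host(hosts: Sequence[str],
                          ask: Callable[[str, Sequence[str]], str]) -> str:
    """Choose which mwmaint host should run maintenance scripts.

    `hosts` is whatever the 'A:mw-maintenance and A:<dc>' query returned;
    `ask` is a prompt function that lets the operator pick from a list.
    """
    hosts = sorted(hosts)
    if not hosts:
        raise RuntimeError("no maintenance hosts found in the target DC")
    if len(hosts) == 1:
        return hosts[0]          # unambiguous: no need to guess or prompt
    # More than one candidate (e.g. a new-OS test host): let the operator decide.
    return ask("Which host should run maintenance scripts?", hosts)
```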

> I'm not sure if the switchdc cookbook could automatically deduce the correct value for maintenance_host after the switch, though. It's easy enough as long as A:mw-maintenance and A:eqiad only matches a single host, but if we end up with more than one (e.g. your new OS scenario), I'm not sure we'd want the cookbook to guess which one to use. I suppose in that case we can just list all the hostnames that match, and prompt the operator for which one to use.

It depends on their puppetization: if there's anything in PuppetDB that we can query that marks the other host(s) as not active/primary, we could add that to the query. Otherwise I agree with the prompt; it sounds simple and flexible enough to me, +1.

I think we have a better way to avoid this. Basically we want to stop running scripts once we get into the readonly phase. So we could modify the wrapper to start only if both the following conditions are met:

  1. the {"name": "WMFMasterDatacenter", "scope": "common"} variable is equal to the local datacenter
  2. the 'ReadOnly' object with scope equal to the current datacenter is set to false

So scripts will stop running once we enter the readonly phase, well before we actually switch datacenter.
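For reference, those two conftool objects can be read (and flipped during the switchover) with confctl; something along these lines, though the exact selectors and flags here are an approximation and should be checked against the conftool documentation before use:

```bash
# Which DC is the MediaWiki primary?
confctl --object-type mwconfig select 'name=WMFMasterDatacenter,scope=common' get

# Is this DC read-only? (flipped to true during the readonly phase)
confctl --object-type mwconfig select 'name=ReadOnly,scope=eqiad' get

# The wrapper would run jobs only when the first value matches the local DC
# and the second is false.
```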

As an alternative, I'd be happy to add a maintenance_enabled key to the common scope in mwconfig, and maybe add support for poolcounter locking to the wrapper (although that's a bit tricky) to avoid running from multiple hosts in the same DC. I don't particularly like the maintenance_host approach because it's very specific to our current configuration, which ties execution of periodic jobs to a single host.

> I think we have a better way to avoid this. Basically we want to stop running scripts once we get into the readonly phase. So we could modify the wrapper to start only if both the following conditions are met:
>
>   1. the {"name": "WMFMasterDatacenter", "scope": "common"} variable is equal to the local datacenter
>   2. the 'ReadOnly' object with scope equal to the current datacenter is set to false
>
> So scripts will stop running once we enter the readonly phase, well before we actually switch datacenter.

That's elegant from the point of view of stopping the scripts; I really like the idea of not having to add anything to the config. But it also means that when we exit readonly, the scripts will start again immediately, instead of waiting for us to enable them. Are you worried about that at all?

> As an alternative, I'd be happy to add a maintenance_enabled key to the common scope in mwconfig, and maybe add support for poolcounter locking to the wrapper (although that's a bit tricky) to avoid running from multiple hosts in the same DC. I don't particularly like the maintenance_host approach because it's very specific to our current configuration, which ties execution of periodic jobs to a single host.

I feel a little differently about the relative ranking of these, as we discussed -- I think maintenance_host is good enough for MW-on-metal, and it's fine to keep that until MW-on-k8s, when we'll have to rework mw-cli-wrapper more heavily anyway. In particular I'd rather just use maintenance_host than involve poolcounter, which is both more complicated and wouldn't let us choose which of the hosts is active.

But if (as is being discussed elsewhere) we're sticking to a single maintenance host per DC, and just scheduling a maintenance downtime rather than cleanly cutting over from the old host to the new one, then maintenance_enabled is all we'll need.

Switching between the three different mwconfig-only proposals would also be pretty easy, so we aren't constraining our future selves by choosing one for now.

T265936 is partially a duplicate of this ticket, but it also covers the fact that the maintenance servers are web hosts for https://noc.wikimedia.org. During the last switchover we had the situation where 'maintenance host running jobs' was switched but 'maintenance host hosting noc' was not. Usually both are the same server.

In T265936#6732896 I tried to summarize my thoughts. I think it can mostly be considered a duplicate of this ticket, but we should also add geoDNS for noc and make it active/active. If we do that, we avoid having to make a DNS change during a switchover, even though we already use a discovery name.

Also, there is mwmaint.discovery.wmnet, currently used for noc/envoy, and there is also maintenance.eqiad.wmnet, which I'm not sure where it's used. Both point to mwmaint1002, and ideally we can get rid of both, or at least one of them.

>> I think we have a better way to avoid this. Basically we want to stop running scripts once we get into the readonly phase. So we could modify the wrapper to start only if both the following conditions are met:
>>
>>   1. the {"name": "WMFMasterDatacenter", "scope": "common"} variable is equal to the local datacenter
>>   2. the 'ReadOnly' object with scope equal to the current datacenter is set to false
>>
>> So scripts will stop running once we enter the readonly phase, well before we actually switch datacenter.
>
> That's elegant from the point of view of stopping the scripts; I really like the idea of not having to add anything to the config. But it also means that when we exit readonly, the scripts will start again immediately, instead of waiting for us to enable them. Are you worried about that at all?

I guess the concern is that starting the scripts right away just adds extra pressure and load while things are starting up, right?

The other thing I suggested on IRC is that we could mask all the systemd units to disable them, and unmask them whenever we're ready. I'm not sure if Puppet will have an issue with that, though.
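For concreteness, the mask/unmask idea would look roughly like this on the mwmaint hosts (a sketch only; as noted further down, Puppet's handling of masked units turned out to be the snag):

```bash
# Collect the job timers (mask/unmask want explicit unit names, so expand
# the glob via list-unit-files rather than passing the pattern directly).
timers=$(systemctl list-unit-files 'mediawiki_job_*.timer' --no-legend | awk '{print $1}')

# Before the switchover: prevent the timers from firing and stop running jobs.
systemctl mask $timers
systemctl stop 'mediawiki_job_*.service'

# Afterwards, once we're ready:
systemctl unmask $timers
systemctl start $timers
```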

Change 701053 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/software/spicerack@master] mediawiki: Stop all systemd job units when stopping cronjobs

https://gerrit.wikimedia.org/r/701053

Change 701055 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] mediawiki: mw-cli-wrapper: Only run if read only in confctl is false

https://gerrit.wikimedia.org/r/701055

> I guess the concern is that starting the scripts right away just adds extra pressure and load while things are starting up, right?

Right, but also, it's more moving parts at the same time. The moment right after we set RW is usually pretty busy; it's common for us to have a bunch of stuff to look at right then. If the maintenance jobs swing into action by themselves, it will be more difficult to troubleshoot any issues that crop up.

If restarting maintenance remains a separate step in the procedure, it'll still be one more thing for us to do, but it gives us a little more flexibility to start it when we're ready (just a few moments later), and makes root-causing simpler.

That's not insurmountable, though -- if you consider it a worthwhile tradeoff to have one less step in the procedure, I can get behind that. Your call.

Ack, and we already have a separate step to re-enable them. The masking I suggested earlier doesn't quite work yet; see T285425: Puppet does not undo manual "systemctl mask $unit".

Instead, we can just disable the timers and stop any running units in 01-stop-maintenance; that stops everything currently running, and between the disabled timers and the new readonly check, theoretically nothing will restart them. Then 08-start-maintenance will re-enable Puppet, which should re-enable the timers and start jobs again.

I updated https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/701053/ with a process that should address all the concerns.

In 01-stop-maintenance:

  • Disable puppet
  • Disable systemd timers, no new jobs will start
  • Stop running systemd services

... switch dc ...

In 08-start-maintenance:

  • Re-enable and run puppet, which re-enables the systemd timers
  • mw-cli-wrapper will check the conftool state, so only jobs in the primary DC will run

This process both prevents jobs from running during the switchover and still keeps a manual enable-maintenance step. There's still one cron left (the Wikidata dispatcher), so all that cron code is still needed, but hopefully not for long.
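In shell terms, the sequence above boils down to something like the following. This is a sketch of the intent, not the actual spicerack code; plain puppet agent commands stand in for whatever disable/enable mechanism the cookbook really uses.

```bash
# 01-stop-maintenance, run on the maintenance hosts in both DCs:
puppet agent --disable "DC switchover in progress"
systemctl disable $(systemctl list-unit-files 'mediawiki_job_*.timer' --no-legend | awk '{print $1}')
systemctl stop 'mediawiki_job_*.timer' 'mediawiki_job_*.service'

# ... switch dc ...

# 08-start-maintenance, run on the maintenance hosts in both DCs:
puppet agent --enable
puppet agent --test    # Puppet re-enables and starts the timers it manages
# mw-cli-wrapper's conftool check then ensures only the new primary DC runs jobs.
```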

Change 701219 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/cookbooks@master] sre.switchdc.mediawiki: Update for periodic job changes in spicerack

https://gerrit.wikimedia.org/r/701219

Change 701055 merged by Legoktm:

[operations/puppet@production] mediawiki: mw-cli-wrapper: Only run if read only in confctl is false

https://gerrit.wikimedia.org/r/701055

Change 701053 merged by jenkins-bot:

[operations/software/spicerack@master] mediawiki: Update cronjob code now that most are systemd timers

https://gerrit.wikimedia.org/r/701053

Change 701219 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.mediawiki: Update for periodic job changes in spicerack

https://gerrit.wikimedia.org/r/701219