
Control mw-on-k8s periodic maintenance jobs with an etcd value
Closed, ResolvedPublic

Description

Presently, maintenance jobs are systemd timers. They run in both DCs, but inside a wrapper script that first checks the MW config state in etcd to make sure it's running in the current primary_dc and that DC is not read-only; otherwise, it quits without invoking the maintenance script.

Today, when we do a DC switchover, we kill the jobs that are currently running but we don't actually prevent new jobs from starting. That means in the time between stopping maintenance, and setting read-only, new maintenance jobs can start, and they'll run into trouble when the RW DC disappears out from under them. This race condition adds some more coordination complexity, and a little rush, to the early stages of the switchover, which we could eliminate.

When we move periodic maintenance jobs to Kubernetes CronJobs, we'll add an additional config value to etcd specifically for controlling them. There are a few different ways we might structure that, each with pros and cons:

  1. A new string value maintenance_dc that could be eqiad, codfw, or none. (In an eqiad-codfw switch, we'd change it from eqiad to none, then to codfw.) The downside is this is partly redundant with the primary_dc value; primary_dc: eqiad; maintenance_dc: codfw is a nonsense state that we shouldn't be able to encode.
  2. A new map value maintenance_enabled like {eqiad: true, codfw: false}. (In an eqiad-codfw switch, we'd change it to {eqiad: false, codfw: false}, then to {eqiad: false, codfw: true}.) This has the same downside as #1, plus it allows the additional nonsense state {eqiad: true, codfw: true}. (On the other hand, read_only has a similar structure already.)
  3. A new boolean value maintenance_enabled which would be true most of the time and false during a switchover. (In an eqiad-codfw switch, we'd change it from true to false, then switch the primary_dc, then change it back to true.) We'd get the DC state from primary_dc, and check maintenance_enabled along with (or instead of) read_only. The only nonsense state here is if maintenance is enabled while the primary DC is RO, and we can mitigate that by checking both.

Clearly I'm leaning toward #3 for simplicity, but open to other possibilities. (I prefer "enabled" to "disabled," because maintenance_disabled: false is an unnecessarily confusing double negative. Unfortunately that means the sense is reversed from read_only but I think that's okay.)
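To make option #3 concrete, the preflight gate could look something like the sketch below (a hypothetical helper; field names mirror the etcd values discussed above, but the real wrapper's logic may differ). Checking read_only as well as maintenance_enabled mitigates the one remaining nonsense state:

```python
def maintenance_should_run(state: dict, local_dc: str) -> bool:
    """Option #3 gate: run only in the primary DC, only when that DC is
    read-write, and only when the global maintenance flag is on."""
    return (
        state["maintenance_enabled"]
        and local_dc == state["primary_dc"]
        and not state["read_only"][local_dc]
    )


# Normal operation: maintenance runs in the primary DC only.
steady = {
    "primary_dc": "eqiad",
    "read_only": {"eqiad": False, "codfw": False},
    "maintenance_enabled": True,
}
assert maintenance_should_run(steady, "eqiad")
assert not maintenance_should_run(steady, "codfw")

# Mid-switchover: the flag is off, so nothing runs anywhere.
switching = dict(steady, maintenance_enabled=False)
assert not maintenance_should_run(switching, "eqiad")
assert not maintenance_should_run(switching, "codfw")
```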

Event Timeline

#3 sounds like the sanest option to me as well, and it enables us to halt maintenance for reasons other than a DC switchover or a read-only window. Maintenance should check read_only state anyway as part of the pre-flight.

Scott_French subscribed.

+1 to option #3 as the most sensible / obvious one: adding something more complex than a single global boolean invites odd nonsense states in combination with read-only (currently a per-DC toggle) and primary DC.

I'd like to make this happen soon, and also wire it into the preflight checks in mw-cli-wrapper.py so that we can use it in the upcoming switchover.

Reminder to self: once live, wire this into the 01-stop-maintenance.py and 08-start-maintenance.py cookbooks.

See also T359130 for the cookbook work. We aren't as far as I expected we'd be, so we can revisit which of those steps for cronjobs-on-k8s need to be accomplished before the switchover, but I agree it's a good idea to add this to mw-cli-wrapper.py in the meantime and start using it.

Thanks, Reuven!

Agreed, yeah: Some subset of those items will need to be done before the switchover, but exactly which subset depends on how far along we expect things to be by then. I'll follow up on the task shortly.

I took some time earlier today to sketch out what putting maintenance-enabled state in etcd might look like, and have some thoughts ...

A hypothetical "path of least resistance" approach could look something like: introduce a new mwconfig-type object [0] and associated schema, wire that into the mediawiki.yaml confd template [1] instantiated by P:conftool::state and already consumed by mw-cli-wrapper.py, and update the latter to now check the new field.

Now, one significant issue with that is that there's no good reason for this to be a mwconfig object, as it's not actually consumed by MediaWiki (and there isn't a clear future use case for doing so).

Also, it would be nice to take a step back and think a bit about how the k8s case would consume this, along with mwconfig state, and gate as needed.

I tend to be of the opinion that adding direct clients of etcd is something to be avoided, so sticking with a config file written by confd seems to be the right approach.

How about something like the following:

  • Rather than using the mwconfig type, introduce a new type for maintenance-related dynamic state, which includes enabledness.
  • Introduce a new confd template (referencing objects from both the new type and mwconfig), instantiated by a new puppet profile, that renders a combined configuration file for mw-cli-wrapper.py.
  • Add the profile to maintenance hosts and migrate the wrapper to the new configuration (i.e., get it out of the business of following mediawiki.yaml directly).
  • In the k8s case, add the new profile to the deployment host, along with a simple script to sync the configuration to a ConfigMap object in the mw-script namespace, with the latter mounted into the job's container.

This would do the trick without muddying mwconfig, while providing a fairly(?) straightforward path to making the same wrapper work on k8s (T341555).
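As a rough illustration of the second bullet above, the new confd template would render a single combined configuration file for mw-cli-wrapper.py from objects of both types. Sketched in Python for clarity (the real file would be rendered by a confd template, and the field names here are hypothetical):

```python
def render_combined_state(mwconfig: dict, maintenance: dict) -> dict:
    """Merge mwconfig objects (primary_dc, read_only) with the new
    maintenance-state type into one config for mw-cli-wrapper.py, so the
    wrapper no longer follows mediawiki.yaml directly."""
    return {
        "primary_dc": mwconfig["primary_dc"],
        "read_only": mwconfig["read_only"],
        "maintenance_enabled": maintenance["enabled"],
    }


combined = render_combined_state(
    {"primary_dc": "codfw", "read_only": {"eqiad": False, "codfw": False}},
    {"enabled": True},
)
assert combined["primary_dc"] == "codfw"
assert combined["maintenance_enabled"] is True
```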

Note: mw-cli-wrapper.py will need to lose the check_confd_template call if we go this route. Fun fact, though: that check does nothing, and probably has not for some time, given that the confd::file resource in P:conftool::state doesn't specify a check command (so no error state files will ever be produced).

Thoughts?

[0] https://gerrit.wikimedia.org/g/operations/puppet/+/8fae5805c21c77137273d3c86eeef719eb267c4a/modules/profile/files/conftool/schema.yaml#19

[1] https://gerrit.wikimedia.org/g/operations/puppet/+/8fae5805c21c77137273d3c86eeef719eb267c4a/modules/profile/templates/conftool/state-mediawiki.tmpl.erb

Unassigning this, as it's not something I'm planning to work on in the near future (i.e., it was not needed for the switchover).

Also, there's ongoing work for T341555 that may supersede this in some form.

Copying from T341555 with added reflections

While we could rely on setting the mwcron.enabled parameter to false in the values-{eqiad,codfw}.yaml files and redeploying the release, I think it would be nice to have the information from conftool on the deployment server in a helmfile-defaults file.

This would allow us to just change the conftool values for primary_dc or read_only and helmfile apply. It also makes it possible to control maintenance state (T367118) without multiplying the number of conftool clients.

Pros:

  • Easy to wire up: Add ::profile::conftool::state on the deployment servers and symlink the file from (for instance) /etc/helmfile-default/mediawiki-state.yaml
  • Only two conftool clients added
  • Flexible: We can still disable kubernetes cronjobs at the values-{eqiad,codfw}.yaml level, but have a second layer of protection if it is forgotten, or if a code deployment isn't possible

Cons:

  • Only semi-automated: This will not kill running jobs, or actually disable them on its own; a helmfile apply will be required

Switchover considerations: This requires us to add some code to the switchover cookbook that will helmfile destroy both datacenters' mwcron releases at the 01-stop-maintenance step, and helmfile apply them at the 08-start-maintenance step. If we go with the ::profile::conftool::state way, this should be enough to ensure that maintenance follows the switchover correctly.

Implementation detail: Will need some logic in the helmfile.yaml to ensure the only release available to apply is the one in the primary datacenter.

Clement_Goubert changed the task status from Open to In Progress.Oct 23 2024, 2:42 PM
Clement_Goubert claimed this task.
Clement_Goubert triaged this task as High priority.

The shape of this sounds right to me. Similarly we can have the mw-script helmfiles gate on the same conftool value for defense-in-depth, but also read it in the wrapper script and exit early with a friendlier error message.

From a switchover POV, in theory we should be able to make it a helmfile apply at both the stop-maintenance and start-maintenance phases, right? We just arrange for the release to be empty when maintenance is disabled, so the apply deletes all the CronJobs. One reason to like that is, it means more of the cookbook is --live-testable. (Another reason, beyond the switchover, is it means a helmfile apply in the secondary datacenter is a perfectly safe noop.)
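The "empty release" idea could be sketched as a selection function (job names and field names here are hypothetical): when maintenance is disabled, the set of CronJobs the release should contain is empty, so a plain helmfile apply prunes whatever was installed before, and an apply in the secondary DC is always a safe noop:

```python
def cronjobs_to_install(state: dict, local_dc: str, all_jobs: list[str]) -> list[str]:
    """Return the CronJobs the mwcron release should contain in local_dc.
    An empty list means 'apply' deletes any previously installed CronJobs."""
    enabled = (
        state["maintenance_enabled"]
        and local_dc == state["primary_dc"]
        and not state["read_only"][local_dc]
    )
    return all_jobs if enabled else []


jobs = ["purge-expired-data", "update-special-pages"]  # hypothetical job names
state = {
    "primary_dc": "eqiad",
    "read_only": {"eqiad": False, "codfw": False},
    "maintenance_enabled": True,
}
assert cronjobs_to_install(state, "eqiad", jobs) == jobs
assert cronjobs_to_install(state, "codfw", jobs) == []  # secondary: apply is a noop

# stop-maintenance: flip the flag, apply in both DCs, everything is pruned.
stopped = dict(state, maintenance_enabled=False)
assert cronjobs_to_install(stopped, "eqiad", jobs) == []
```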

Since the spawned Jobs are not part of the release, they won't be stopped, so that would need to be added as well.

Change #1083146 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] deploy: Provide conftool data for mwcron

https://gerrit.wikimedia.org/r/1083146

If we are indeed introducing an etcd-based "maintenance enabled" state, wiring that into the confd template, and wiring that in turn into the release template [0], then yeah - agreed with T367118#10256952 that in the switchover case, both stop-maintenance and start-maintenance could just helmfile apply after updating etcd and waiting for confd (in the stop case, also killing all running jobs as we do today, after the apply completes).

I guess the question is whether a dynamic "maintenance enabled" state is much of a win if actuating it is this involved. I think the answer is still yes, at least from the standpoint of the point @RLazarus makes about maximizing the --live-test-able-ness of the switchover procedure.

Edit: I guess another benefit of using an apply/apply over destroy/apply is that destroy is asynchronous IIRC. In the case where there's a "maintenance enabled" gate, when the apply returns we know the cron objects are gone. With destroy, we would probably want to add a "check for all crons to be gone" step before deleting all running jobs.

[0] i.e., mwcron releases are only defined if maintenance is enabled && local DC == primary DC && !readonly

Change #1083146 merged by Clément Goubert:

[operations/puppet@production] Provide conftool data for mwcron and mwscript-k8s

https://gerrit.wikimedia.org/r/1083146

Now available on deployment servers, updated every 30s from confd:

/etc/helmfile-defaults/mediawiki/conftool-state.yaml
mw:
  # the master datacenter for mediawiki
  primary_dc: codfw
  # read-only settings
  read_only:
    codfw: false
    eqiad: false

Change #1114777 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-script: Add conftool state to helmfile.yaml

https://gerrit.wikimedia.org/r/1114777

Change #1114777 merged by jenkins-bot:

[operations/deployment-charts@master] mw-script: Add conftool state to helmfile.yaml

https://gerrit.wikimedia.org/r/1114777