Allow running one-off scripts manually
Open, Needs Triage, Public

Description

Our developers/deployers are used to launching MediaWiki maintenance scripts using mwscript as follows:

  • ssh to the maintenance server
  • run mwscript <script-name> --wiki <wiki> or mwscript <script-name> <wiki>

How can we get developers to run one-off scripts on Kubernetes?

I would imagine it would go as follows:

  • Add a Job definition to the mediawiki chart, so that either the Deployment or a Job can be applied. The Job should accept values that supply the arguments to "mwscript".
  • Create a dedicated namespace to run these one-offs.
  • Each Job should be a separate Helm release if we want multiple Jobs to run in parallel; that's the only way these resources can share the same namespace. We need to check whether our helmfile can be adapted to accept arbitrary release names.
  • A small wrapper, called something like mwscript-k8s, should check the user name, generate a random release name, and run helm(file), passing the CLI arguments as a value that we'll inject as args for the container.
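
As a rough illustration of the wrapper in the last bullet, here is a minimal Python sketch of the two pieces that do not depend on the cluster: generating a random release token and packaging the CLI arguments as Helm values. It is not the wrapper that was later merged. The function names, the "mwscript.args" and "mwscript.username" value keys, and the "Example.php" script name are all hypothetical, and a real wrapper would take the user name from the login session rather than from a literal.

```python
import json
import secrets


def make_release_token(nbytes=4):
    """Return a short random hex token so each release name is unique."""
    return secrets.token_hex(nbytes)


def build_values(script, script_args, user):
    """Package the mwscript arguments as Helm values (hypothetical keys)."""
    if not user:
        raise ValueError("cannot determine the invoking user")
    return {
        # Hypothetical value names; the real chart may use different ones.
        "mwscript": {
            "args": [script, *script_args],
            "username": user,
        }
    }


if __name__ == "__main__":
    token = make_release_token()
    # "Example.php" and "alice" are placeholders for illustration only.
    values = build_values("Example.php", ["--wiki", "enwiki"], "alice")
    print(f"release name: job-{token}")
    print(json.dumps(values, indent=2))
```

Keeping the value-building step separate from the helm(file) invocation makes it easy to check what would be injected into the container before anything is applied.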

Event Timeline

One update:

It should be possible to have helmfile support arbitrary release names, with something like:

releases:
 - name: job-{{ requiredEnv "NAME_TOKEN" }}
   <<: *default

and then setting the NAME_TOKEN environment variable from the wrapper.
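
On the wrapper side, that could look roughly like the Python sketch below, which builds the environment and the helmfile command without running anything. The function names and the "eqiad" environment name are assumptions for illustration. The validation is deliberately conservative: it allows only lowercase letters, digits and hyphens, and enforces Helm's 53-character limit on release names.

```python
import os
import re
import subprocess

# Conservative pattern: lowercase alphanumerics and hyphens only.
_RELEASE_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


def helmfile_env(name_token, base_env=None):
    """Return a copy of the environment with NAME_TOKEN set for helmfile."""
    release = f"job-{name_token}"
    if len(release) > 53 or not _RELEASE_RE.match(release):
        raise ValueError(f"invalid release name: {release!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["NAME_TOKEN"] = name_token
    return env


def helmfile_command(environment):
    """Build the helmfile invocation for one environment."""
    return ["helmfile", "--environment", environment, "apply"]


def launch(name_token, environment):
    """Run helmfile; requiredEnv "NAME_TOKEN" is resolved from env."""
    subprocess.run(
        helmfile_command(environment),
        env=helmfile_env(name_token),
        check=True,
    )


if __name__ == "__main__":
    # Demonstration only: print what would be run, without calling helmfile.
    env = helmfile_env("abc123", base_env={"PATH": "/usr/bin"})
    print(env["NAME_TOKEN"], helmfile_command("eqiad"))
```

Because requiredEnv makes helmfile fail when the variable is unset, rejecting a bad token in the wrapper gives the user a clearer error than a template failure would.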

I hope that mwscript-k8s will finish with the right kubectl logs --follow command, so that deployers can see the output.
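
A minimal sketch of how the wrapper might build that final command, in Python. It assumes that the Job is named after the release and lives in the mw-script namespace. Both are assumptions, and the function name is hypothetical.

```python
import shlex


def logs_command(job_name, namespace="mw-script", container=None):
    """Build a `kubectl logs --follow` command for the pod of a Job."""
    cmd = [
        "kubectl", "logs", "--follow",
        f"job/{job_name}",
        "--namespace", namespace,
    ]
    if container:
        # Needed when the pod runs more than one container.
        cmd += ["--container", container]
    return cmd


if __name__ == "__main__":
    # Print the command so the deployer can copy it or re-run it later.
    print(shlex.join(logs_command("job-abc123")))
```

Printing the command as well as running it would let a deployer reattach to the output after disconnecting.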

But also: if this system allows any deployer to list the currently running maintenance scripts and list their output, then that would be a pretty big advantage over the current setup IMHO. Right now, long-running scripts are usually run from tmux or screen, and only the original deployer and roots can see the output then.
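
For instance, any deployer with read access to the namespace could list the running scripts by filtering the JSON that "kubectl get jobs --namespace mw-script --output json" returns. The Python sketch below shows only the filtering step, on a hand-written sample. The job names and the "username" label are made up for illustration, while the "status.active" field is the Job API's count of pending and running pods.

```python
import json

# Hand-written sample in the shape of a Kubernetes Job list (illustrative).
SAMPLE = """
{
  "items": [
    {"metadata": {"name": "job-abc123", "labels": {"username": "alice"}},
     "status": {"active": 1}},
    {"metadata": {"name": "job-def456", "labels": {"username": "bob"}},
     "status": {"succeeded": 1}}
  ]
}
"""


def running_jobs(job_list):
    """Return (name, labels) for every Job that still has active pods."""
    result = []
    for item in job_list.get("items", []):
        active = item.get("status", {}).get("active") or 0
        if active > 0:
            meta = item["metadata"]
            result.append((meta["name"], meta.get("labels", {})))
    return result


if __name__ == "__main__":
    for name, labels in running_jobs(json.loads(SAMPLE)):
        print(name, labels)
```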

Change 957375 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] hieradata: Add kubeconfig files for mw-script

https://gerrit.wikimedia.org/r/957375

Change 957375 merged by RLazarus:

[operations/puppet@production] hieradata: Add kubeconfig files for mw-script

https://gerrit.wikimedia.org/r/957375

Change 957377 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] admin_ng: Add mw-script namespace

https://gerrit.wikimedia.org/r/957377

Change 957377 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Add mw-script namespace

https://gerrit.wikimedia.org/r/957377

Change 988849 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mediawiki: Support one-off jobs

https://gerrit.wikimedia.org/r/988849

Change 988850 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] Add helmfile for running MediaWiki one-off jobs.

https://gerrit.wikimedia.org/r/988850

Change 988851 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Add mwscript_k8s

https://gerrit.wikimedia.org/r/988851

@Joe @JMeybohm That's a lot of code review at once, across two tasks -- I posted it all for context, but no expectation you'll have time to look at all of it immediately. (It's all tested together from my homedir on deploy2002, and works.) Here's the suggested reviewing order:

  1. Sidecar controller (chart and helmfile): https://gerrit.wikimedia.org/r/988847 and https://gerrit.wikimedia.org/r/988848
  2. MW Job config (chart and helmfile): https://gerrit.wikimedia.org/r/988849 and https://gerrit.wikimedia.org/r/988850
  3. Python wrapper: https://gerrit.wikimedia.org/r/988851

Surfacing @JMeybohm's reasonable concern from https://gerrit.wikimedia.org/r/c/988851/comments/3827b6cd_15427748:

Running jobs via helmfile will result in one helm release per job run which will never be cleaned up. While not an immediate problem, this will clutter the mw-scripts namespace over time, leaving one k8s secret object per job run around. I think we should have an idea of how to (automatically) clean those up.

And my response:

That's true. A couple of options offhand: we could add a periodic cleanup cronjob (which could also handle the case of a maintenance script that gets wedged at runtime, and hangs around forever). Or we could modify the sidecar controller to perform some additional cleanup after shutting down each job.
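
To make the first option concrete, here is a minimal Python sketch of the selection logic such a cleanup job might use. It assumes that each Helm release is named after its Job and that releases are removed with "helm uninstall". The function names and the seven-day threshold are hypothetical. Only Jobs with a completionTime are considered, so failed or wedged scripts would need a separate rule (for example, one based on startTime).

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes timestamp such as 2024-01-01T00:00:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def stale_jobs(job_list, now, max_age):
    """Return names of Jobs that completed more than max_age ago."""
    stale = []
    for item in job_list.get("items", []):
        completed = item.get("status", {}).get("completionTime")
        if completed and now - parse_k8s_time(completed) > max_age:
            stale.append(item["metadata"]["name"])
    return stale


def uninstall_command(release, namespace="mw-script"):
    """Build the helm command that removes one release and its secret."""
    return ["helm", "uninstall", release, "--namespace", namespace]


if __name__ == "__main__":
    # Hand-written sample data; the job names are illustrative.
    sample = {"items": [
        {"metadata": {"name": "job-old"},
         "status": {"completionTime": "2024-01-01T00:00:00Z"}},
        {"metadata": {"name": "job-new"},
         "status": {"completionTime": "2024-01-31T12:00:00Z"}},
        {"metadata": {"name": "job-running"}, "status": {"active": 1}},
    ]}
    now = datetime(2024, 2, 1, tzinfo=timezone.utc)
    for name in stale_jobs(sample, now, timedelta(days=7)):
        print(" ".join(uninstall_command(name)))
```

With the sample data, only job-old is selected: job-new finished less than a day before the reference time, and job-running has no completion time yet.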

I don't love the second option -- partly because it changes the controller's semantics, and partly because we'd have to keep the controller around indefinitely, rather than dropping it at Kubernetes 1.29, which has job-sidecar logic built in.

Keeping that discussion open here.

Change 988849 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Support one-off jobs

https://gerrit.wikimedia.org/r/988849

Change 988850 merged by jenkins-bot:

[operations/deployment-charts@master] Add helmfile for running MediaWiki one-off jobs.

https://gerrit.wikimedia.org/r/988850

Change 988851 merged by RLazarus:

[operations/puppet@production] deployment_server: Add mwscript_k8s

https://gerrit.wikimedia.org/r/988851

Change 1006607 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Add missing env variables to mwscript_k8s

https://gerrit.wikimedia.org/r/1006607

Change 1006607 merged by RLazarus:

[operations/puppet@production] deployment_server: Add missing env variables to mwscript_k8s

https://gerrit.wikimedia.org/r/1006607

Random idea: T315510 still needs a few more maintenance script runs (at least one on enwiki and one on viwiki), but is currently blocked on application errors. If this task is close to being done, and that other task gets unblocked soon, then maybe we could use those maintenance scripts as an opportunity to test this new feature.

Change 1008975 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Typo fix in mwscript_k8s.py

https://gerrit.wikimedia.org/r/1008975

Change 1008975 merged by RLazarus:

[operations/puppet@production] deployment_server: Typo fix in mwscript_k8s.py

https://gerrit.wikimedia.org/r/1008975

Change 1009373 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mediawiki: Add mwscript labels to the job as well as the pods

https://gerrit.wikimedia.org/r/1009373

Change 1009373 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Add mwscript labels to the job as well as the pods

https://gerrit.wikimedia.org/r/1009373

Change 1012802 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/deployment-charts@master] mediawiki: Add a comment annotation for mwscript jobs

https://gerrit.wikimedia.org/r/1012802

Change 1012803 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Label and annotation improvements for mwscript-k8s

https://gerrit.wikimedia.org/r/1012803

Change 1012802 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Add a comment annotation for mwscript jobs

https://gerrit.wikimedia.org/r/1012802

Change 1012803 merged by RLazarus:

[operations/puppet@production] deployment_server: Label and annotation improvements for mwscript-k8s

https://gerrit.wikimedia.org/r/1012803