
mw-scripts SAL integration
Closed, Resolved (Public)

Description

Long-running jobs (if not all jobs) should be announced on SAL. Today's incident (link TBA) was related to an mw-script job that had been announced; however, the job started hours before the incident.

We could consider which of the following would work for us:

  • Every triggered mw-script job must/may be announced on SAL
  • For every 30 or 60 minutes (interval to be decided) that the job is still running, a new announcement could be made, e.g. "mw-script picking-apples job has been running since $timestamp"
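
The periodic re-announcement idea in the second bullet could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the 60-minute default, and the ISO timestamp format are hypothetical and are not part of mwscript-k8s or any existing tooling.

```python
from datetime import datetime, timedelta, timezone


def reminder_due(started, last_announced, now, interval=timedelta(minutes=60)):
    """Return True once `interval` has elapsed since the last announcement.

    If the job has never been re-announced, the start time is the reference.
    All names and the default interval are assumptions for illustration.
    """
    reference = last_announced or started
    return now - reference >= interval


def reminder_message(job_name, started):
    """Build the reminder text, following the wording proposed in the task."""
    return f"mw-script {job_name} job has been running since {started.isoformat()}"


if __name__ == "__main__":
    started = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    now = started + timedelta(minutes=61)
    if reminder_due(started, None, now):
        print(reminder_message("picking-apples", started))
```

Whatever interval is chosen, a job that is meant to run for days would produce one message per interval, which is the noise trade-off this proposal has to weigh.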

Event Timeline

A mwscript-k8s flag to log to SAL is on my to-do list -- I hadn't gotten around to filing a task, thanks.

I don't think it would have helped us in this incident, though: there actually wasn't a long-running mw-script job, there were 1,839 short-running ones. (And in fact that's where the problem came from.) Logging each job would have been too spammy, so even if mwscript-k8s could do that, the user would have turned it off and made one manual !log instead -- which is what she did. So I think that would have looked the same.

Thinking hypothetically about a single long-running job that misbehaves, I agree we'd want some way for responders to know it's there. I think re-announcing every job every hour might also be too noisy to be effective, especially since there are some intentionally long-lived scripts, but it's a good starting point for discussion.

The usual way to check on job status is to run kube_env mw-script codfw; kubectl get job. In this case, when your terminal started vomiting a couple thousand almost-identical lines, you'd have a quick idea what the problem might be. The catch is that this only helps after you think "hm, I wonder if a maintenance script might be the problem." So we might still need some other form of visibility to lead you down that path.
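
As a rough sketch of how that check could be condensed into a single summary instead of thousands of lines, the following assumes the job list has been exported with kubectl get job -o json and counts jobs by the standard Job status fields (active, failed, succeeded). The function name and the "other" bucket are hypothetical; this is not an existing tool.

```python
import json
from collections import Counter


def summarize_jobs(kubectl_json):
    """Count Kubernetes Jobs by coarse state.

    `kubectl_json` is the text output of `kubectl get job -o json`.
    Returns (total_jobs, {state: count}).
    """
    items = json.loads(kubectl_json).get("items", [])
    states = Counter()
    for job in items:
        status = job.get("status", {})
        if status.get("active"):
            states["active"] += 1
        elif status.get("failed"):
            states["failed"] += 1
        elif status.get("succeeded"):
            states["succeeded"] += 1
        else:
            # No pods reported yet, or a state this sketch does not classify.
            states["other"] += 1
    return len(items), dict(states)


if __name__ == "__main__":
    import sys

    total, states = summarize_jobs(sys.stdin.read())
    print(f"{total} jobs: {states}")
```

Saved under a hypothetical name such as summarize_jobs.py, it could be run as kubectl get job -o json | python3 summarize_jobs.py. It shares the limitation described above: it only helps once a responder already suspects a maintenance script.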

Change #1174579 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] deployment_server: Add --sal to mwscript_k8s

https://gerrit.wikimedia.org/r/1174579

Change #1174579 merged by RLazarus:

[operations/puppet@production] deployment_server: Add --sal to mwscript_k8s

https://gerrit.wikimedia.org/r/1174579

Mentioned in SAL (#wikimedia-operations) [2025-08-04T19:39:44Z] <rzl@deploy1003> mwscript-k8s job started: Version.php --wiki=urwiki # Testing --sal for T376776

Implemented and documented on Wikitech.

I think this will probably be all we do here. I was considering enabling --sal by default eventually, but I think I'm convinced it would be a mistake: thinking of cases like resetUserEmail.php, changePassword.php, and a few other examples where script args are sensitive, the convenience of automatic logging is outweighed by the risk of inadvertently logging something inappropriate.
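
To make the concern concrete, here is a minimal sketch of opt-in logging with a denylist of sensitive scripts. Everything in it is an assumption for illustration: the function name, the denylist, and the "[arguments withheld]" wording are hypothetical, and only the "mwscript-k8s job started:" prefix is taken from the SAL entry above. It does not describe how --sal is actually implemented.

```python
import os

# Hypothetical and deliberately incomplete: only the two scripts named above.
SENSITIVE_SCRIPTS = {"resetUserEmail.php", "changePassword.php"}


def sal_message(script, args, sal_requested=False):
    """Return the text to log to SAL, or None when nothing should be logged.

    Logging is opt-in. Even when requested, arguments of scripts on the
    denylist are withheld from the message.
    """
    if not sal_requested:
        return None
    name = os.path.basename(script)
    if name in SENSITIVE_SCRIPTS:
        return f"mwscript-k8s job started: {name} [arguments withheld]"
    return "mwscript-k8s job started: " + " ".join([script, *args])


if __name__ == "__main__":
    print(sal_message("Version.php", ["--wiki=urwiki"], sal_requested=True))
```

A denylist like this is only as good as its coverage: any sensitive script missing from it would have its arguments logged, which is why opt-in logging looks like the safer default.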

As discussed above, I still agree with the idea of another form of visibility for long-running or high-volume jobs, but I don't think the SAL is it.