Proposal: mw-cron failure tasks that get automatically filed for unstewarded components should also tag the ServiceOps Phabricator project
Closed, ResolvedPublic

Description

...The rationale basically being, at this point, they get tagged on them (by me) anyway :p (e.g. T416970#11600701, T416440#11582344, T412325#11450239, and the same for some of the other tasks returned by this search query).

But more seriously -- because these components don't have any code stewards, they - IIUC - don't have anyone whose job it is to triage the automated cron-job-failure tasks that get filed about them (xref T341555#10760093). As a volunteer without Logstash access, I currently find myself tagging ServiceOps on these tasks almost without exception, in order to - at least - retrieve the error's stack trace & time. If this information wasn't retrieved by ServiceOps for these tasks that get filed, I worry that they might just end up sitting around Phabricator 'gathering dust' (so to speak), which doesn't seem like it'd be beneficial to anyone (and - at which point - it feels like cron-job-failure tasks for unstewarded components might as well just not be automatically filed at all).

Event Timeline

Change #1238369 had a related patch set uploaded (by A smart kitten; author: A smart kitten):

[operations/puppet@production] alertmanager: Also add ServiceOps to mw-cron tasks for unstewarded components

https://gerrit.wikimedia.org/r/1238369

Blake removed Blake as the assignee of this task.
Blake triaged this task as Low priority.
Blake moved this task from Inbox to Backlog on the ServiceOps new board.
Blake subscribed.

Hello :) I was wondering if ServiceOps had any opinions on this idea?

This seems reasonable to me - in the longer term, I'd prefer that we (ServiceOps) find a way to improve the resilience of these scripts, so they can mostly be retrying, rather than opening tickets on failure, but until we fix that, someone should be aware of the failures.

[ copying over from #wikimedia-operations for the Phab record ]

2026-02-24 17:04:23 <rzl> A_smart_kitten: so, in serviceops we're in the middle of redoing our phab workflow (cc matthieulec) -- I want to run the proposal by the team before +2ing, just for social reasons not technical ones :)
2026-02-24 17:05:37 <rzl> sorry for the extra delay, I know it's frustrating especially because I see you got a positive reply on the task already, I just want to make sure we get a chance to discuss
2026-02-24 17:06:59 <A_smart_kitten> rzl: sounds okay to me, but thanks for acknowledging the situation re the positive reply on the task :) [I probably assumed it represented the okay from serviceops generally]
2026-02-24 17:07:02 <A_smart_kitten> do you want me to reschedule in the future for another puppet request window, or should I leave it to serviceops to deploy as/when?
2026-02-24 17:08:00 <rzl> good question -- you can consider this handed off, if the team is happy with it I'll merge it async and no need for another window
2026-02-24 17:08:30 <A_smart_kitten> rzl: ty, will leave it with you :)
2026-02-24 17:08:52 <rzl> if you don't hear back in, let's say a week, please do ping me directly
2026-02-24 17:09:11 <A_smart_kitten> will do (probably on the task)
2026-02-24 17:09:18 <rzl> sgtm!

Change #1238369 merged by RLazarus:

[operations/puppet@production] alertmanager: Also add ServiceOps to mw-cron tasks for unstewarded components

https://gerrit.wikimedia.org/r/1238369

RLazarus claimed this task.
RLazarus subscribed.

(Copying from my Gerrit comment:)

Thanks @A_smart_kitten for this. Some caveats:

The plan is for these mw-cron alerts to fire less often, pretty soon. Most cron jobs are resilient to intermittent failures, so most single job failures don't need to be investigated, and in the old days we wouldn't get alerts at all. These tickets were added as part of the migration to Kubernetes -- primarily to warn us of problems with that migration itself! We're about to start working on tuning them so we only get alerts for real problems, so you can expect this issue to come up less frequently. @Blake is working on that in T416576.

Service Ops is hesitant to be auto-tagged on all these tasks, because historically it's sometimes assumed that we're responsible for fixing everything in production that isn't owned by someone else. We're concerned that if we tag these alerts for ourselves, it will look like we're promising to investigate and fix every mw-cron job: we don't have the capacity for that. We can take a quick look for triage, and (for users without access who want to look deeper) we can help fetch stack traces. But we can't take ownership for debugging these, and adding our tag doesn't mean that we will. (For posterity I added a comment to your patch to reflect that.)

Conclusion from both of those: In the long run we'll be reducing mw-cron alerts to only those that (a) require action (b) by someone who is going to act. (That is, not just reducing the noisy alerts that need no response, but cutting down on alerts for jobs not owned by anyone.) That brings us back around to the last sentence of your task description here -- we agree, tasks that don't have someone responsible probably shouldn't be filed automatically to nowhere, and tagging them all with Service Ops doesn't really change that. We're going ahead with this change in the meantime, but it's a temporary measure.

Thank you @RLazarus & Service Ops!

Service Ops is hesitant to be auto-tagged on all these tasks, because historically it's sometimes assumed that we're responsible for fixing everything in production that isn't owned by someone else. We're concerned that if we tag these alerts for ourselves, it will look like we're promising to investigate and fix every mw-cron job: we don't have the capacity for that. We can take a quick look for triage, and (for users without access who want to look deeper) we can help fetch stack traces. But we can't take ownership for debugging these, and adding our tag doesn't mean that we will.

Given what you've said ^ here, I understand the reluctance to be auto-tagged.

Conclusion from both of those: In the long run we'll be reducing mw-cron alerts to only those that (a) require action (b) by someone who is going to act. (That is, not just reducing the noisy alerts that need no response, but cutting down on alerts for jobs not owned by anyone.) That brings us back around to the last sentence of your task description here -- we agree, tasks that don't have someone responsible probably shouldn't be filed automatically to nowhere, and tagging them all with Service Ops doesn't really change that.

Question: this sounds good to me in principle, but can I ask -- will 'actual' new issues/regressions with unstewarded components' cron-jobs (that cause them to fail) -- e.g. T408530, maybe also T405313, maybe also some others that don't immediately come to mind -- still be flagged up/filed on Phabricator (I suppose, maybe manually by a member of ServiceOps)?

We're going ahead with this change in the meantime, but it's a temporary measure.

Understood!