
MediaWiki periodic job startupregistrystats-mediawikiwiki failed
Open, Needs Triage, Public

Description

Common information

  • alertname: MediaWikiCronJobFailed
  • label_cronjob: startupregistrystats-mediawikiwiki
  • label_team: mediawiki-platform
  • prometheus: k8s
  • severity: task
  • site: codfw
  • source: prometheus
  • team: mediawiki-platform

Firing alerts


  • dashboard: https://w.wiki/DocP
  • description: Use kube-env mw-cron codfw; kubectl get jobs -l team=mediawiki-platform,cronjob=startupregistrystats-mediawikiwiki --field-selector status.successful=0 to see failures
  • runbook: https://wikitech.wikimedia.org/wiki/Periodic_jobs#Troubleshooting
  • summary: MediaWiki periodic job startupregistrystats-mediawikiwiki failed
  • alertname: MediaWikiCronJobFailed
  • label_cronjob: startupregistrystats-mediawikiwiki
  • label_team: mediawiki-platform
  • prometheus: k8s
  • severity: task
  • site: codfw
  • source: prometheus
  • team: mediawiki-platform

Event Timeline

matmarex subscribed.

This keeps happening over the last couple of weeks (https://phabricator.wikimedia.org/search/query/NBa4SdpM463N/#R). I'd like to find out why and/or make it stop.

We've had several past instances of issues like this.

It seems like @Krinkle (if willing) could help us during our collab sessions with a demo on the potential causes of these failures, or on how to debug them. From the history above, I've also seen @Clement_Goubert deal with some of these issues. So it seems the cause is not obvious and requires some digging through logs?

Did something happen with the blameStartupRegistry.php script recently? The most recent instance was T409212: MediaWiki periodic job startupregistrystats-testwiki failed, which Krinkle took care of.

I went through https://logstash.wikimedia.org/goto/53c805d5b28776f67ebcc8821f06d153 and it looks like the issue has resolved itself?

@matmarex, this one just showed up on our board: T411654: MediaWiki periodic job startupregistrystats failed. I left a comment there.

In general I think it would be nice to rename these auto-generated tasks and put something about the specific failure reason in their title. They are going to be very unhelpful when searching for related issues in the future.

kubectl doesn't return anything for this error. Maybe the logs already expired? It would be nice to know the k8s log retention period; if it's less than one week, we should call that out in the chores docs.

As I complained in T411654#11429879, our logging for this kind of thing is crap, but once you have a guess about the error message, you can search for it in Logstash, and indeed there's an identical error message to the one in that task that roughly matches the time of this task getting logged.
(Could phaultfinder just automatically get the logs instead of dumping instructions on how to do it? I guess there's a secrecy aspect to it, but it could go to a restricted paste by default.)

> In general I think it would be nice to rename these auto-generated tasks and put something about the specific failure reason in their title. They are going to be very unhelpful when searching for related issues in the future.

If you rename them, the next time the alert fires, it will create a new task instead of appending to the existing open one.

> kubectl doesn't return anything for this error. Maybe the logs already expired? It would be nice to know the k8s log retention period; if it's less than one week, we should call that out in the chores docs.

By default, completed (failed or not) jobs are kept for 30 hours in Kubernetes. We could definitely change that for your job, or even change the default to something like a week or so. Afterwards they are available in Logstash through the dashboard link in the task. You'll need to select the cronjob you want to see yourself, as I've not found a way to pass that as a query parameter to logstash.
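If this retention is implemented with Kubernetes' built-in Job cleanup, it presumably looks something like the sketch below. The ttlSecondsAfterFinished field is a real Kubernetes API field, but the manifest itself is illustrative, not the actual WMF configuration:

```yaml
# Illustrative CronJob manifest -- not the real WMF one. Finished Jobs
# (failed or successful) are garbage-collected once ttlSecondsAfterFinished
# elapses, which is one common way to get "kept for 30 hours" behavior.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: startupregistrystats-mediawikiwiki
spec:
  schedule: "0 * * * *"               # placeholder schedule
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 108000 # 30 hours; 604800 would keep a week
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: mediawiki
              image: example/mediawiki # placeholder image
```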

> As I complained in T411654#11429879, our logging for this kind of thing is crap, but once you have a guess about the error message, you can search for it in Logstash, and indeed there's an identical error message to the one in that task that roughly matches the time of this task getting logged.

We at serviceops need to get to work on T390972: Restart CronJobs on failure of the service mesh, now that we have the right version of Kubernetes to use podFailurePolicy.

> (Could phaultfinder just automatically get the logs instead of dumping instructions on how to do it? I guess there's a secrecy aspect to it, but it could go to a restricted paste by default.)

I am not aware of a way to do that, although maybe SRE Observability does.

> If you rename them, the next time the alert fires, it will create a new task instead of appending to the existing open one.

IMO that's not ideal; these tasks will clutter up search results and will be hard to navigate. Couldn't it use a custom Maniphest field rather than the title for identifying which task is related to a script?

> By default, completed (failed or not) jobs are kept for 30 hours in Kubernetes. We could definitely change that for your job, or even change the default to something like a week or so. Afterwards they are available in Logstash through the dashboard link in the task.

I think one week would be nice; we have a weekly cadence for chores, errors in these cronjobs tend not to be urgent, and Logstash is very hard to use (T411663: Normal output and error output from Wikimedia scheduled maintenance scripts should be logged differently in Logstash).

> I've not found a way to pass that as a query parameter to logstash.

https://logstash.wikimedia.org/app/dashboards#/view/d51552d0-e309-11ef-87d0-9371e01d3c68?_a=(filters:!((query:(match_phrase:(kubernetes.labels.team.keyword:growth))),(query:(match_phrase:(kubernetes.labels.cronjob.keyword:growthexperiments-refreshlinkrecommendations-s2))),(query:(match_phrase:(kubernetes.labels.script.keyword:refreshLinkRecommendations.php)))))&_g=(time:(from:now-24h,to:now))

Change the team name, cronjob name, script name, and time range as needed.
If you only need the cronjob name, then:

https://logstash.wikimedia.org/app/dashboards#/view/d51552d0-e309-11ef-87d0-9371e01d3c68?_a=(filters:!((query:(match_phrase:(kubernetes.labels.cronjob.keyword:growthexperiments-refreshlinkrecommendations-s2)))))&_g=(time:(from:now-24h,to:now))

More generally, use share -> snapshot -> copy URL on the dashboard (without the short URL option), then amend as needed. The _a and _g query parameters are RISON, and anything you don't want to change from the dashboard's default value can be omitted from the tree. _a contains an Elasticsearch filter expression in its filters field, so usually that's the only thing you need to override. You can omit the various metadata from filters; you only need the query part.
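To illustrate, here's a small sketch of assembling such a URL from the pieces described above. The dashboard id and the label names are copied from the example URLs; everything else is illustrative, and it assumes the values are simple tokens that RISON accepts unquoted:

```python
# Sketch of building a Logstash dashboard URL with the _a/_g RISON
# parameters described above. The dashboard id and label names are taken
# from the example URLs in this thread; the rest is illustrative.
DASHBOARD = (
    "https://logstash.wikimedia.org/app/dashboards#/view/"
    "d51552d0-e309-11ef-87d0-9371e01d3c68"
)

def match_phrase(field: str, value: str) -> str:
    # One entry of the _a filters list: (query:(match_phrase:(field:value)))
    return f"(query:(match_phrase:({field}:{value})))"

def dashboard_url(cronjob: str, time_from: str = "now-24h", time_to: str = "now") -> str:
    # _a overrides the filters; _g sets the time range.
    a = f"(filters:!({match_phrase('kubernetes.labels.cronjob.keyword', cronjob)}))"
    g = f"(time:(from:{time_from},to:{time_to}))"
    return f"{DASHBOARD}?_a={a}&_g={g}"

url = dashboard_url("startupregistrystats-mediawikiwiki")
```

Values containing characters special to RISON would need quoting, but cronjob and team labels are plain tokens, so plain string interpolation suffices here.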

>> (Could phaultfinder just automatically get the logs instead of dumping instructions on how to do it? I guess there's a secrecy aspect to it, but it could go to a restricted paste by default.)

> I am not aware of a way to do that, although maybe SRE Observability does.

Can't it just execute the same commands it pastes into the task? Or would it be too insecure to give permissions for a script to do that?

The dashboard URL structure is pretty obscure so I added it to the docs on wikitech.

>> If you rename them, the next time the alert fires, it will create a new task instead of appending to the existing open one.

> IMO that's not ideal; these tasks will clutter up search results and will be hard to navigate. Couldn't it use a custom Maniphest field rather than the title for identifying which task is related to a script?

I don't know, we'd have to ask SRE Observability about that.

>> By default, completed (failed or not) jobs are kept for 30 hours in Kubernetes. We could definitely change that for your job, or even change the default to something like a week or so. Afterwards they are available in Logstash through the dashboard link in the task.

> I think one week would be nice; we have a weekly cadence for chores, errors in these cronjobs tend not to be urgent, and Logstash is very hard to use (T411663: Normal output and error output from Wikimedia scheduled maintenance scripts should be logged differently in Logstash).

Ack, I'll file a CR to change the default to a week.

>> I've not found a way to pass that as a query parameter to logstash.

> https://logstash.wikimedia.org/app/dashboards#/view/d51552d0-e309-11ef-87d0-9371e01d3c68?_a=(filters:!((query:(match_phrase:(kubernetes.labels.team.keyword:growth))),(query:(match_phrase:(kubernetes.labels.cronjob.keyword:growthexperiments-refreshlinkrecommendations-s2))),(query:(match_phrase:(kubernetes.labels.script.keyword:refreshLinkRecommendations.php)))))&_g=(time:(from:now-24h,to:now))

> Change the team name, cronjob name, script name, and time range as needed.
> If you only need the cronjob name, then:

> https://logstash.wikimedia.org/app/dashboards#/view/d51552d0-e309-11ef-87d0-9371e01d3c68?_a=(filters:!((query:(match_phrase:(kubernetes.labels.cronjob.keyword:growthexperiments-refreshlinkrecommendations-s2)))))&_g=(time:(from:now-24h,to:now))

> More generally, use share -> snapshot -> copy URL on the dashboard (without the short URL option), then amend as needed. The _a and _g query parameters are RISON, and anything you don't want to change from the dashboard's default value can be omitted from the tree. _a contains an Elasticsearch filter expression in its filters field, so usually that's the only thing you need to override. You can omit the various metadata from filters; you only need the query part.

Thanks! I'll see if that fits into the alert generator as far as URL length goes.

>>> (Could phaultfinder just automatically get the logs instead of dumping instructions on how to do it? I guess there's a secrecy aspect to it, but it could go to a restricted paste by default.)

>> I am not aware of a way to do that, although maybe SRE Observability does.

> Can't it just execute the same commands it pastes into the task? Or would it be too insecure to give permissions for a script to do that?

I am not aware of a way to make Alertmanager execute arbitrary commands for enrichment, and the alert hosts don't have access to the Kubernetes cluster anyway. Everything in the alert is either from Prometheus or defined in the templates in operations/alerts.

>>> If you rename them, the next time the alert fires, it will create a new task instead of appending to the existing open one.

>> IMO that's not ideal; these tasks will clutter up search results and will be hard to navigate. Couldn't it use a custom Maniphest field rather than the title for identifying which task is related to a script?

> I don't know, we'd have to ask SRE Observability about that.

From a cursory glance at the phalerts docs, this title-oriented behavior is intentional.

>>>> (Could phaultfinder just automatically get the logs instead of dumping instructions on how to do it? I guess there's a secrecy aspect to it, but it could go to a restricted paste by default.)

>>> I am not aware of a way to do that, although maybe SRE Observability does.

>> Can't it just execute the same commands it pastes into the task? Or would it be too insecure to give permissions for a script to do that?

> I am not aware of a way to make Alertmanager execute arbitrary commands for enrichment, and the alert hosts don't have access to the Kubernetes cluster anyway. Everything in the alert is either from Prometheus or defined in the templates in operations/alerts.

@Clement_Goubert is correct. Neither Alertmanager nor phalerts know anything about how to get debugging information. The debugging and docs links provided in the description are set in the alert definition. The data Prometheus knows about (labels) are dumped into the task description as bullet points.

We should now keep the last failure for up to a week; a subtask was created for the alert improvements and is waiting for review. I wanted to add a timing variable to scope down the Logstash dashboard, but Alertmanager is *very* limited in what processing it can do in templates, so for now it links to the last 24h. It's pretty easy to change the default, or just to change the relative times oneself when debugging, so I think it's a good enough compromise.

As far as the task creation goes, I think it's pretty clear we can't really do anything substantially better than what we currently have. One way you could handle it is to leave the autogenerated tasks open until the problem is known to be resolved (that way, additional failures will only add information to the existing task and not create new ones), add debugging information there, and maybe create more specific subtasks with more descriptive titles and descriptions to track the uncovered bugs and fixes.

Thanks, those changes are very helpful!

> As far as the task creation goes, I think it's pretty clear we can't really do anything substantially better than what we currently have.

I still think we could (whether it's worth the effort is another matter). phaultfinder has some way of checking whether an open task with a given title exists, presumably via the Phabricator search API. It could just use something other than the title for that (a custom "job name" field, presumably).
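As a sketch of that idea: the lookup could go through Phabricator's Conduit maniphest.search endpoint with a constraint on a custom field. The field key "custom.job-name" below is hypothetical (such a field would have to be created first), and the exact transport details depend on the Conduit client in use:

```python
# Hypothetical sketch: find the open alert task by a machine-readable
# "job name" custom field instead of by title, via Conduit's
# maniphest.search. The "custom.job-name" field key is an assumption.
import json
import urllib.parse

CONDUIT_URL = "https://phabricator.wikimedia.org/api/maniphest.search"

def search_body(api_token: str, job_name: str) -> bytes:
    """Form-encoded POST body asking for open tasks matching the job name."""
    constraints = {
        "statuses": ["open"],
        "custom.job-name": [job_name],  # hypothetical custom field key
    }
    return urllib.parse.urlencode({
        "api.token": api_token,
        "constraints": json.dumps(constraints),
    }).encode()

body = search_body("api-token-placeholder", "startupregistrystats-mediawikiwiki")
```

If exactly one open task matches, the bot appends a comment; otherwise it creates a fresh task, which would let the titles stay descriptive without breaking deduplication.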