
MediaWiki periodic job startupregistrystats failed
Open, Needs Triage, Public

Description

Common information

  • alertname: MediaWikiCronJobFailed
  • label_cronjob: startupregistrystats
  • label_team: mediawiki-platform
  • prometheus: k8s
  • severity: task
  • site: codfw
  • source: prometheus
  • team: mediawiki-platform

Firing alerts


  • dashboard: https://w.wiki/DocP
  • description: Use kube-env mw-cron codfw; kubectl get jobs -l team=mediawiki-platform,cronjob=startupregistrystats --field-selector status.successful=0 to see failures
  • runbook: https://wikitech.wikimedia.org/wiki/Periodic_jobs#Troubleshooting
  • summary: MediaWiki periodic job startupregistrystats failed

Event Timeline

Per the logs at https://logstash.wikimedia.org/goto/f9b4204879f7bc8969d1bf80d9a0f9ff, this appears to be a service-mesh issue.

{F70840900}

tgr@deploy2002:~$ kube-env mw-cron codfw
tgr@deploy2002:~$ kubectl get jobs -l team=mediawiki-platform,cronjob=startupregistrystats-mediawikiwiki --field-selector status.successful=0
No resources found in mw-cron namespace.
tgr@deploy2002:~$ kubectl get jobs --field-selector status.successful=0
NAME                                                       STATUS    COMPLETIONS   DURATION   AGE
startupregistrystats-29413055                              Failed    0/1           59m        59m
tgr@deploy2002:~$ kubectl logs jobs/startupregistrystats-29413055 mediawiki-main-app
extensions/WikimediaMaintenance/maintenance/blameStartupRegistry.php: Start run
extensions/WikimediaMaintenance/maintenance/blameStartupRegistry.php: Running on large
extensions/WikimediaMaintenance/maintenance/blameStartupRegistry.php: Running on large
arwiki The service mesh is unavailable, which can lead to unexpected results.
arwiki 
arwiki Therefore, the script will not be executed. If you are *very* sure your script will
arwiki not need the service mesh at all, you can run it again with MESH_CHECK_SKIP=1

It's really annoying that on Logstash this error is buried in a huge soup of normal script output. I filed T411663: Normal output and error output from Wikimedia scheduled maintenance scripts should be logged differently in Logstash.
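
Until that is addressed, a small filter over the raw job log can surface the mesh message. This is only a sketch: mesh_errors is a hypothetical helper name, and the patterns are taken from the two log lines quoted above, so it will miss any error that is worded differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Wikimedia tooling): read log text on
# stdin and print only the lines that mention the service mesh or the
# MESH_CHECK_SKIP override. Patterns come from the log excerpt in this task.
mesh_errors() {
    grep -E 'service mesh|MESH_CHECK_SKIP'
}

# Example use on a deployment host, reusing the command from the transcript:
#   kubectl logs jobs/startupregistrystats-29413055 mediawiki-main-app | mesh_errors
```

grep exits non-zero when nothing matches, so the helper also works as a quick yes/no check in a script.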

That aside, it seems that jobs are already run with MESH_CHECK_SKIP (owing to T387480: Beta update job fails: The service mesh is unavailable, which can lead to unexpected results.), so I'm not sure why the mesh check still stopped this one.
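
One way to check that assumption is to look for the variable in the failed job's manifest. The sketch below assumes the variable would appear as a plain env entry in the pod template. If it is injected some other way (for example via envFrom or a wrapper script), this check will report it as missing even when it is set. has_mesh_skip is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read a job manifest (YAML) on stdin and succeed if it
# declares an environment variable named MESH_CHECK_SKIP.
has_mesh_skip() {
    grep -Eq 'name:[[:space:]]*"?MESH_CHECK_SKIP"?'
}

# Example use on a deployment host, with the job name from the transcript:
#   kubectl get job startupregistrystats-29413055 -o yaml | has_mesh_skip \
#       && echo "MESH_CHECK_SKIP is declared" \
#       || echo "MESH_CHECK_SKIP not found in manifest"
```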

So this is essentially a duplicate of T410764: MediaWiki periodic job startupregistrystats-mediawikiwiki failed (and, I think, will be fixed by T390972: Restart CronJobs on failure of the service mesh), but we can't close it because the bot would just reopen it.