MediaWiki periodic job update-special-pages-s2 failed
Closed, Resolved · Public · Production Error

Description

Common information

  • alertname: MediaWikiCronJobFailed
  • label_cronjob: update-special-pages-s2
  • label_team: mediawiki-special-pages
  • prometheus: k8s
  • severity: task
  • site: codfw
  • source: prometheus
  • team: mediawiki-special-pages

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Production Error". · Wed, Feb 4, 6:22 AM
Restricted Application added subscribers: A_smart_kitten, Aklapper.

Hello serviceops! Please could an SRE grab the stack trace & error time when you have some spare time?

Blake moved this task from Inbox to In Progress on the ServiceOps new board.
Blake subscribed.

Hello! It looks like the MariaDB server was briefly read-only for some maintenance during the time when the cron executed:

2026-02-04T06:16:25.490906024Z nowiki Mostimages                     [QueryPage] Wikimedia\Rdbms\DBQueryError from line [0/0] of /srv/mediawiki/php-1.46.0-wmf.13/includes/libs/Rdbms/Database/Database.php: Error 1290: The MariaDB server is running with the --read-only option so it cannot execute this statement

That work was taking place in T416300, and should now be complete - I'd expect this cron to succeed on next run.
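
For anyone triaging similar failures, here is a minimal sketch of how a log excerpt could be checked for this specific cause. It assumes only that the log contains the `Error NNNN:` text seen in the trace above; the function names are hypothetical and not part of MediaWiki or any existing tooling. Error 1290 is the MariaDB/MySQL code for a statement blocked by a server option such as `--read-only`.

```python
import re

# MariaDB/MySQL error 1290 (ER_OPTION_PREVENTS_STATEMENT), as seen in the trace above.
READ_ONLY_ERRNO = 1290

# Matches the "Error NNNN:" fragment of the logged exception message.
_ERROR_RE = re.compile(r"Error (\d+):")


def extract_errno(log_text):
    """Return the first 'Error NNNN:' code found in log_text, or None if absent."""
    match = _ERROR_RE.search(log_text)
    return int(match.group(1)) if match else None


def is_read_only_failure(log_text):
    """Hypothetical helper: True if the first logged error is the read-only error."""
    return extract_errno(log_text) == READ_ONLY_ERRNO
```

A check like this only classifies the failure; it says nothing about whether the maintenance has finished, so it would be a triage aid rather than a fix.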

Thanks for fetching that! From my perspective this task can probably now be closed (although I don't know whether the failed job would need to be deleted first or not). I'll leave it up to y'all about whether or not you think it's worth rerunning this job :)

(For cross-referencing purposes: it seems like this might be a similar reason for the job failing as in T404280: MediaWiki periodic job initsitestats failed)

If this happens from time to time, would it be worth referencing those example bugs (or the MariaDB switchover) in the runbook, @Blake?

@MLechvien-WMF If possible, I'd rather find a way to make the script resilient to the maintenance more generally. Ideally, I think we'd just schedule a re-run on failure, and only alert if the cron hasn't successfully run in a particular time window. I'll do some reading, and see what might make sense here. I'll change the docs if there's no better way to automate this away.
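
The idea in the comment above (re-run on failure, alert only when there has been no success within a time window) can be sketched roughly as follows. This is an illustration under stated assumptions, not how mw-cron jobs are actually scheduled or alerted on: `run_with_retries`, `should_alert`, and all parameter names and defaults are hypothetical.

```python
import time
from datetime import datetime, timedelta, timezone


def run_with_retries(job, max_attempts=3, delay_seconds=300.0, sleep=time.sleep):
    """Call job() until it succeeds or max_attempts is reached.

    Returns True on success, False if every attempt raised an exception.
    The sleep function is injectable so the wait can be skipped in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            return True
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < max_attempts:
                sleep(delay_seconds)
    return False


def should_alert(last_success, now, max_age):
    """Alert only if there has been no successful run within max_age.

    last_success is a datetime, or None if the job has never succeeded.
    """
    if last_success is None:
        return True
    return now - last_success > max_age
```

With a policy like this, a single failure during a short read-only window would not raise an alert as long as a retry or the next scheduled run succeeds inside the window. For example, `should_alert(last_success, datetime.now(timezone.utc), timedelta(days=1))` would stay quiet for a daily job that last succeeded a few hours ago.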

I'd like to have a chat with the team about how we alert for mw-crons. I've created T416576 to track the follow-up here, but in the short term, the failed job has been deleted, and we expect it to succeed on next run, so closing this out.