
MediaWiki periodic job update-special-pages-s3 failed
Closed, Resolved · Public · PRODUCTION ERROR

Description

Common information

  • alertname: MediaWikiCronJobFailed
  • label_cronjob: update-special-pages-s3
  • label_team: mediawiki-special-pages
  • prometheus: k8s
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: mediawiki-special-pages

Firing alerts


  • dashboard: https://w.wiki/DocP
  • description: Use kube-env mw-cron eqiad; kubectl get jobs -l team=mediawiki-special-pages,cronjob=update-special-pages-s3 --field-selector status.successful=0 to see failures
  • runbook: https://wikitech.wikimedia.org/wiki/Periodic_jobs#Troubleshooting
  • summary: MediaWiki periodic job update-special-pages-s3 failed
  • alertname: MediaWikiCronJobFailed
  • label_cronjob: update-special-pages-s3
  • label_team: mediawiki-special-pages
  • prometheus: k8s
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: mediawiki-special-pages
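As a sketch, the inspection steps from the alert description can be written out in shell. kube-env is WMF's wrapper for selecting a Kubernetes cluster environment; the label values below come from the alert labels above, and the small helper function is purely illustrative:

```shell
# Build the label selector used by the alert's troubleshooting command.
# (Illustrative helper; the label values come from the alert above.)
failed_jobs_selector() {
  printf 'team=%s,cronjob=%s' "$1" "$2"
}

selector="$(failed_jobs_selector mediawiki-special-pages update-special-pages-s3)"
echo "$selector"

# With cluster access (commented out here), the actual inspection would be:
#   kube-env mw-cron eqiad
#   kubectl get jobs -l "$selector" --field-selector status.successful=0
```

The `--field-selector status.successful=0` part restricts the listing to Job objects that never recorded a successful completion, i.e. the failed runs.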

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Production Error". · Aug 7 2025, 8:40 AM
Restricted Application added a subscriber: Aklapper.
nlwiktionary Fewestrevisions                [QueryPage] Wikimedia\Rdbms\DBConnectionError from line 1160 of /srv/mediawiki/php-1.45.0-wmf.13/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Cannot access the database: could not connect to any replica DB server
nlwiktionary #0 /srv/mediawiki/php-1.45.0-wmf.13/includes/libs/rdbms/loadbalancer/LoadBalancer.php(789): Wikimedia\Rdbms\LoadBalancer->reportConnectionError('could not conne...')
nlwiktionary #1 /srv/mediawiki/php-1.45.0-wmf.13/includes/libs/rdbms/database/DBConnRef.php(111): Wikimedia\Rdbms\LoadBalancer->getConnectionInternal(-1, Array, 'nlwiktionary', 0)
nlwiktionary #2 /srv/mediawiki/php-1.45.0-wmf.13/includes/libs/rdbms/database/DBConnRef.php(125): Wikimedia\Rdbms\DBConnRef->ensureConnection()
nlwiktionary #3 /srv/mediawiki/php-1.45.0-wmf.13/includes/libs/rdbms/database/DBConnRef.php(384): Wikimedia\Rdbms\DBConnRef->__call('select', Array)
nlwiktionary #4 /srv/mediawiki/php-1.45.0-wmf.13/includes/libs/rdbms/querybuilder/SelectQueryBuilder.php(762): Wikimedia\Rdbms\DBConnRef->select(Array, Array, Array, 'MediaWiki\\Speci...', Array, Array)
nlwiktionary #5 /srv/mediawiki/php-1.45.0-wmf.13/includes/specialpage/QueryPage.php(569): Wikimedia\Rdbms\SelectQueryBuilder->fetchResultSet()
nlwiktionary #6 /srv/mediawiki/php-1.45.0-wmf.13/includes/specialpage/QueryPage.php(416): MediaWiki\SpecialPage\QueryPage->reallyDoQuery(5000, false)
nlwiktionary #7 /srv/mediawiki/php-1.45.0-wmf.13/maintenance/updateSpecialPages.php(92): MediaWiki\SpecialPage\QueryPage->recache(5000)
nlwiktionary #8 /srv/mediawiki/php-1.45.0-wmf.13/maintenance/includes/MaintenanceRunner.php(691): UpdateSpecialPages->execute()
nlwiktionary #9 /srv/mediawiki/php-1.45.0-wmf.13/maintenance/run.php(53): MediaWiki\Maintenance\MaintenanceRunner->run()
nlwiktionary #10 /srv/mediawiki/multiversion/MWScript.php(221): require_once('/srv/mediawiki/...')
nlwiktionary #11 {main}

I can rerun that job manually if necessary, but we may want to modify it to continue with the next wiki on error?
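A minimal sketch of the "continue with the next wiki" idea, assuming a dblist file with one wiki DB name per line. The function name and the wrapper are illustrative, not the actual WMF tooling; the real invocation would be something like mwscript updateSpecialPages.php per wiki:

```shell
# Run the given command once per wiki, continuing past failures but
# returning a nonzero failure count so the Kubernetes Job (and hence
# the alert) still surfaces that something went wrong.
run_for_all_wikis() {
  local run_cmd="$1" dblist="$2" failures=0 wiki
  while IFS= read -r wiki; do
    if ! "$run_cmd" "$wiki"; then
      echo "updateSpecialPages failed for $wiki; continuing" >&2
      failures=$((failures + 1))
    fi
  done < "$dblist"
  return "$failures"
}

# Example (hypothetical wrapper around the real maintenance script):
#   run_one() { mwscript updateSpecialPages.php --wiki="$1"; }
#   run_for_all_wikis run_one s3.dblist
```

Returning the failure count rather than swallowing errors is one way to continue per wiki without losing the visibility of failures that is discussed below.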

Thanks for grabbing the stack trace @Clement_Goubert! It seems like another occurrence of T400133: When a database replica is depooled/rebooted, update-special-pages maintenance scripts connected to that replica seem to fail (instead of continuing with a different replica) to me at a first glance. (If you have a second, please could you grab the error's timestamp?)

> we may want to modify it to continue with the next wiki on error?

Out of interest, how often is this job scheduled to run? I don't imagine I have the authority to make this decision, but it occurs to me that, if it runs fairly frequently anyway, it might not be worth setting it to continue on error: doing so would (IIUC) limit the visibility of how often T400133 tends to occur, and might also obscure any new errors that start intermittently affecting this job.

> Thanks for grabbing the stack trace @Clement_Goubert! It seems like another occurrence of T400133: When a database replica is depooled/rebooted, update-special-pages maintenance scripts connected to that replica seem to fail (instead of continuing with a different replica) to me at a first glance. (If you have a second, please could you grab the error's timestamp?)

From logstash, Aug 7, 2025 @ 08:33:52

> we may want to modify it to continue with the next wiki on error?

> Out of interest, how often is this job scheduled to run? I don't imagine I have the authority to make this decision; but it occurs to me that - if it runs fairly frequently anyway - it might not be worth setting it to continue on error, given that this would (IIUC) limit the visibility of (e.g.) how often T400133 tends to occur.

It runs every three days. I am also unsure which is the best approach, at least until we get a good way to conditionally restart jobs on error with Kubernetes 1.31.
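For reference, Kubernetes 1.31 graduated the Job podFailurePolicy field to stable, which is one way to make retry behaviour conditional on how a pod failed. The sketch below is hypothetical (the container name, exit code, and backoffLimit are illustrative assumptions, not the actual mw-cron configuration), written out as a patch file:

```shell
# Write a hypothetical Job-spec patch enabling a pod failure policy:
# only the listed transient exit code is retried (against backoffLimit),
# other exit codes fail the Job immediately, and pod disruptions such
# as node drains do not consume retries at all.
cat > update-special-pages-failure-policy.yaml <<'EOF'
spec:
  jobTemplate:
    spec:
      backoffLimit: 2
      podFailurePolicy:
        rules:
          # Don't count evictions/drains against backoffLimit.
          - action: Ignore
            onPodConditions:
              - type: DisruptionTarget
                status: "True"
          # Fail immediately on exit codes we know are not transient.
          - action: FailJob
            onExitCodes:
              containerName: mediawiki   # hypothetical container name
              operator: NotIn
              values: [1]                # hypothetical transient-error code
EOF
```

With this shape, an exit code of 1 falls through the rules and is counted against backoffLimit (i.e. retried), while anything else fails the Job outright.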

> Thanks for grabbing the stack trace @Clement_Goubert! It seems like another occurrence of T400133: When a database replica is depooled/rebooted, update-special-pages maintenance scripts connected to that replica seem to fail (instead of continuing with a different replica) to me at a first glance. (If you have a second, please could you grab the error's timestamp?)

> From logstash, Aug 7, 2025 @ 08:33:52

Yep, correlates with the depool of db1166 - https://sal.toolforge.org/log/xEGqg5gBvg159pQrP_Xc

> It runs every three days. I am also unsure which is the best approach, at least until we get a good way to conditionally restart jobs on error with Kubernetes 1.31

Hrm. For now, my immediate personal gut instinct would probably be to not change anything/to keep the status quo with regards to the job [not] continuing on error. It would probably be worth reconsidering this (and/or prioritizing T400133) if its error proportion started to get quite high, though. (Purely out of curiosity, is there a rough idea of when WMF servers might be running Kubernetes 1.31+?)

For this task, I'd probably say you could re-run this job if you get the time, and otherwise the task can be kept open for now to ensure the job runs successfully in 3 days.

> Thanks for grabbing the stack trace @Clement_Goubert! It seems like another occurrence of T400133: When a database replica is depooled/rebooted, update-special-pages maintenance scripts connected to that replica seem to fail (instead of continuing with a different replica) to me at a first glance. (If you have a second, please could you grab the error's timestamp?)

> From logstash, Aug 7, 2025 @ 08:33:52

> Yep, correlates with the depool of db1166 - https://sal.toolforge.org/log/xEGqg5gBvg159pQrP_Xc

Seems weird that the DB layer doesn't retry on another replica, but I don't know enough about the PHP DB LoadBalancer.

> It runs every three days. I am also unsure which is the best approach, at least until we get a good way to conditionally restart jobs on error with Kubernetes 1.31

> Hrm. For now, my immediate personal gut instinct would probably be to not change anything/to keep the status quo with regards to the job [not] continuing on error. It would probably be worth reconsidering this (and/or prioritizing T400133) if its error proportion started to get quite high, though. (Purely out of curiosity, is there a rough idea of when WMF servers might be running Kubernetes 1.31+?)

You can follow T341984: Update Kubernetes clusters to 1.31 for updates on the Kubernetes upgrade. For now, only codfw (the current secondary DC) runs that version, but we will update eqiad after the next DC switchover.

> For this task, I'd probably say you could re-run this job if you get the time, and otherwise the task can be kept open for now to ensure the job runs successfully in 3 days.

I agree, I will rerun it.

kubectl create job update-special-pages-s3-$(date +"%Y%m%d%H%M") --from=cronjobs/update-special-pages-s3
job.batch/update-special-pages-s3-202508071235 created
A_smart_kitten assigned this task to Clement_Goubert.

I'm gonna boldly assume that the rerun has now hopefully completed successfully — FewestRevisions on nlwiktionary has now been updated, and so has FewestRevisions on zuwiktionary (AFAICS, the final wiki in s3.dblist) :)

Clement_Goubert added a subscriber: Jdforrester-WMF.

Deleted the failed job update-special-pages-s3-29242380 following the successful manual run and a successful scheduled run on Sunday.