
MediaWiki periodic job startupregistrystats failed
Closed, Invalid (Public)

Description

Common information

  • alertname: MediaWikiCronJobFailed
  • label_cronjob: startupregistrystats
  • label_team: mediawiki-platform
  • prometheus: k8s
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: mediawiki-platform

Firing alerts


  • dashboard: https://w.wiki/DocP
  • description: Use kube-env mw-cron eqiad; kubectl get jobs -l team=mediawiki-platform,cronjob=startupregistrystats --field-selector status.successful=0 to see failures
  • runbook: https://wikitech.wikimedia.org/wiki/Periodic_jobs#Troubleshooting
  • summary: MediaWiki periodic job startupregistrystats failed

Event Timeline

Looking at the Logstash link, there are only stdout lines showing successful progress, odd.

Looking at kubectl, after exiting screen to work around T404739, we get:

$ kubectl logs jobs/startupregistrystats-29300555 mediawiki-main-app
shwiki Sending stats...
shwiki Done!
specieswiki Warning: EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached in /srv/mediawiki/php-1.45.0-wmf.18/includes/config/EtcdConfig.php on line 206
specieswiki Warning: EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached in /srv/mediawiki/php-1.45.0-wmf.18/includes/config/EtcdConfig.php on line 206
specieswiki Fatal error: Uncaught MediaWiki\Config\ConfigException: Failed to load configuration from etcd: (curl error: 28) Timeout was reached in /srv/mediawiki/php-1.45.0-wmf.18/includes/config/EtcdConfig.php:233
specieswiki Stack trace:
specieswiki #0 /srv/mediawiki/php-1.45.0-wmf.18/includes/config/EtcdConfig.php(156): MediaWiki\Config\EtcdConfig->load()
specieswiki #1 /srv/mediawiki/wmf-config/CommonSettings.php(201): MediaWiki\Config\EtcdConfig->getModifiedIndex()
specieswiki #2 /srv/mediawiki/php-1.45.0-wmf.18/LocalSettings.php(4): require('/srv/mediawiki/...')
specieswiki #3 /srv/mediawiki/php-1.45.0-wmf.18/includes/Setup.php(228): require_once('/srv/mediawiki/...')
specieswiki #4 /srv/mediawiki/php-1.45.0-wmf.18/maintenance/run.php(51): require_once('/srv/mediawiki/...')
specieswiki #5 /srv/mediawiki/multiversion/MWScript.php(221): require_once('/srv/mediawiki/...')
specieswiki #6 {main}
specieswiki   thrown in /srv/mediawiki/php-1.45.0-wmf.18/includes/config/EtcdConfig.php on line 233

This appears to be T346971/T349376, a known issue where Kubernetes pods sometimes can't reach etcd (https://logstash.wikimedia.org/goto/4c62d7b8b82a82de0bea42cbcaba2ec3).
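To pick the failing wiki out of a long run of output like the above, the log can be filtered on the wiki prefix. This is a minimal sketch, assuming every line starts with the wiki ID followed by a space, as in the output shown; the function name failed_wikis is hypothetical and not part of any existing tooling.

```shell
# Print each wiki whose output contains a PHP fatal error, once per wiki.
# Assumption: every log line begins with the wiki ID and a space, and fatal
# errors appear as "<wiki> Fatal error: ..." as in the kubectl output above.
failed_wikis() {
  awk '$2 == "Fatal" && $3 == "error:" { print $1 }' | sort -u
}

# Usage (same kubectl invocation as above, piped through the filter):
#   kubectl logs jobs/startupregistrystats-29300555 mediawiki-main-app | failed_wikis
```

For the run above this would list only specieswiki, since shwiki finished with "Done!" and produced no fatal error.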

Going back to Logstash, I can see other mw-cron jobs failing with the same error on the mediawiki-errors dashboard (https://logstash.wikimedia.org/app/dashboards#/view/mediawiki-errors) when filtering for servergroup:kube-mw-cron. For example:

type: mediawiki
channel: error
servergroup: kube-mw-cron
wiki: enwiki
message:
[7d3645bb2847e8b6a51cb4a9] [no req]   PHP Warning: EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached

However, I don't see any entries for blameStartupRegistry.php there. And the error we do get from the kubectl jobs output is also not shown on the MediaWiki Periodic Jobs dashboard. Did these get lost in transit?

> Looking at the Logstash link, there are only stdout lines showing successful progress, odd.

I see the same error message you've found here: https://logstash.wikimedia.org/goto/766a2aff7c1a024dbd24fd5138cb5155

It looks like it runs hourly and produces lots of output, so you have to filter by time.

Krinkle closed this task as Invalid. Edited Sep 17 2025, 12:53 AM

> > Looking at the Logstash link, there are only stdout lines showing successful progress, odd.
>
> I see the same error message you've found here: https://logstash.wikimedia.org/goto/766a2aff7c1a024dbd24fd5138cb5155

Oops, I've crossed out my previous comment. Thanks.

Even with an exact time there are still many pages full of output, but it is ordered reverse-chronologically, so the last output of any failed run is the error, which appears at the top. I didn't think of that.

I actually did a search for "EtcdConfig" on the Logstash link after finding the error via kubectl, and still found nothing. I think I was looking at the default time range, which is 15 minutes. The failure was a fluke and more recent runs of this cronjob have succeeded. I should have looked at the last 24 hours.

Thanks! I'll close this as invalid since it is working now, and the error is not related to our code.
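
For narrowing the Logstash time range to one specific run, as discussed above, the job name itself may help. This is a sketch under two assumptions that are not confirmed anywhere in this task: that the numeric suffix of the job name follows the upstream Kubernetes CronJob convention of the scheduled time in minutes since the Unix epoch, and that GNU date is available. The function name job_scheduled_time is hypothetical.

```shell
# Convert a CronJob-created Job name to its scheduled start time in UTC.
# Assumption: the part after the last "-" is the scheduled time expressed in
# minutes since the Unix epoch (upstream Kubernetes CronJob naming).
# Assumption: GNU date, which accepts "-d @SECONDS".
job_scheduled_time() {
  minutes=${1##*-}                      # strip everything up to the last "-"
  date -u -d "@$((minutes * 60))" '+%Y-%m-%dT%H:%M:%SZ'
}

# Usage:
#   job_scheduled_time startupregistrystats-29300555
```

If the naming assumption holds, the job above decodes to 2025-09-16T14:35:00Z, which can then be used as the centre of the time filter instead of the default 15-minute window.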