Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.35.0-wmf.38/includes/config/EtcdConfig.php:202
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error

MediaWiki version: 1.35.0-wmf.38

message
Uncaught ConfigException: Failed to load configuration from etcd:  in /srv/mediawiki/php-1.35.0-wmf.38/includes/config/EtcdConfig.php:202

Impact

spurious but worrying production error logspam

Notes

This should not be possible

The only occurrences over the past 7 days took place around 2020-07-01T18:00:00, in 1.35.0-wmf.38 and 1.35.0-wmf.39.

Details

Request ID
d0022155-0fb7-4b4c-89cf-659e7b605b33
Request URL
https://en.wikipedia.org/wiki/Lipoprotein
Stack Trace
exception.trace
#0 /srv/mediawiki/php-1.35.0-wmf.38/includes/config/EtcdConfig.php(124): EtcdConfig->load()
#1 /srv/mediawiki/wmf-config/CommonSettings.php(132): EtcdConfig->getModifiedIndex()
#2 /srv/mediawiki/php-1.35.0-wmf.38/LocalSettings.php(4): require('/srv/mediawiki/...')
#3 /srv/mediawiki/php-1.35.0-wmf.38/includes/Setup.php(143): require_once('/srv/mediawiki/...')
#4 /srv/mediawiki/php-1.35.0-wmf.38/includes/WebStart.php(89): require_once('/srv/mediawiki/...')
#5 /srv/mediawiki/php-1.35.0-wmf.38/index.php(44): require('/srv/mediawiki/...')
#6 /srv/mediawiki/w/index.php(3): require('/srv/mediawiki/...')
#7 {main}
  thrown

Event Timeline

		if ( $loop->invoke() !== WaitConditionLoop::CONDITION_REACHED ) {
			// No cached value exists and etcd query failed; throw an error
			throw new ConfigException( "Failed to load configuration from etcd: $error" );
		}

WaitConditionLoop runs with a timeout, so this exception can also be thrown when the timeout is reached.

It seems $error was not set when the exception was thrown, which is possible in the timeout case. That would explain the empty text after "etcd:" in the logged message.

Maybe the result of invoke() should be checked more strictly, distinguishing WaitConditionLoop::CONDITION_TIMED_OUT, CONDITION_FAILED, and CONDITION_ABORTED.
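
As a rough illustration of that suggestion, here is a minimal standalone sketch of how the message could be built from the loop status so that it never ends in an empty "etcd: " suffix. It is not the actual EtcdConfig code: describeLoadFailure() is a hypothetical helper, and the constants below are local stand-ins for the WaitConditionLoop ones, with values chosen only for illustration.

```php
<?php
// Hypothetical stand-ins for the WaitConditionLoop status constants.
// The numeric values are illustrative only.
const CONDITION_REACHED = 1;
const CONDITION_FAILED = -1;
const CONDITION_TIMED_OUT = -2;
const CONDITION_ABORTED = -3;

/**
 * Build a failure message from a non-success loop status.
 * Hypothetical helper, not part of EtcdConfig.
 */
function describeLoadFailure( int $status, ?string $error ): string {
	switch ( $status ) {
		case CONDITION_TIMED_OUT:
			$reason = 'timed out';
			break;
		case CONDITION_FAILED:
			$reason = 'query failed';
			break;
		case CONDITION_ABORTED:
			$reason = 'aborted';
			break;
		default:
			$reason = "unexpected status $status";
	}
	// Append the backend error only when one was actually recorded.
	if ( $error !== null && $error !== '' ) {
		$reason .= " ($error)";
	}
	return "Failed to load configuration from etcd: $reason";
}

// Timeout with no $error set, as seen in this task.
echo describeLoadFailure( CONDITION_TIMED_OUT, null ), "\n";
```

With something along these lines, a timeout with no $error would be logged as "Failed to load configuration from etcd: timed out" rather than with nothing after the colon.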

Krinkle claimed this task.
Krinkle subscribed.

I guess it's normal that etcd can time out in rare cases. The request fatals in that case and leaves the user stranded with a system error page. In the future, for cases where we fail with HTTP 5xx and know the failure happened early and the request is safe to restart, perhaps we can handle that automatically in our infrastructure, but that's a separate task. For now, given we saw one in over a month, that seems expected.

We have the general Scap and Icinga alerts to detect any spikes in this and other fatals.