Uncaught ConfigException: Failed to load configuration from etcd
Open, Low | Public | PRODUCTION ERROR

Description

Error
labels.normalized_message
Uncaught ConfigException: Failed to load configuration from etcd:  in /srv/mediawiki/php-1.41.0-wmf.27/includes/config/EtcdConfig.php:229
error.stack_trace
(REDACTED)
Impact
Notes

~13 in the last hour. I'm seeing this for .26 as well, so presumably not train-related.

Details

Request URL
https://commons.wikimedia.org/wiki/Special:EntityData/M137871102.ttl?flavor=*&revision=*

Event Timeline

brennen renamed this task from Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.41.0-wmf.27/includes/config/EtcdConfig.php:229 to Uncaught ConfigException: Failed to load configuration from etcd. Sep 20 2023, 6:36 PM
Krinkle subscribed.

Untagging MediaWiki-Platform-Team since MediaWiki-Configuration does not have an owner today (mw:Maintainers).

Tagging MW-on-K8s and serviceops because in the Logstash mediawiki dashboard, results for "Failed to load configuration from etcd" over the past 7 days come exclusively from Kubernetes nodes. Baremetal seems to be fine.

I apologize in advance for speculating on something I know very little about, but it's starting to smell like Kubernetes just doesn't know how to do networking. See also T342201#9151176 about MW-on-K8s failing to reach Memcached. We also have an as-yet-unknown issue with reaching Shellbox, AQS, and EventBus in K8s from services outside it (T292663, T341634, T249700).


Given this is K8s-specific and likely to get worse as traffic increases, also pinging T346422: Move 10% of mediawiki external requests to mw on k8s and T290536: Serve production traffic via Kubernetes.

The error condition is reached when there is no value in apc and a response can't be fetched from etcd in time.
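As a rough sketch of that condition (simplified, with hypothetical names; not the actual EtcdConfig code):

```php
<?php
// Simplified, hypothetical illustration of the failure mode described above;
// the real logic lives in includes/config/EtcdConfig.php and is more involved.
function loadConfig( callable $fetchFromEtcd ): array {
	$cached = apcu_fetch( 'etcd-config' );
	if ( $cached !== false ) {
		// A cached value, even a stale one, means we never hit the fatal path.
		return $cached;
	}
	// Empty cache (e.g. a freshly started worker/pod): the request now depends
	// entirely on the etcd fetch completing within its timeout.
	$fresh = $fetchFromEtcd();
	if ( $fresh === null ) {
		// This is the path that surfaces as the uncaught ConfigException.
		throw new RuntimeException( 'Failed to load configuration from etcd' );
	}
	apcu_store( 'etcd-config', $fresh );
	return $fresh;
}
```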

I kind of suspect one of the following two conditions is causing this:

  • This happens on container startup, and for some reason we're triggering some specific network-related race condition
  • We evict APC entries more often in pods because we use a smaller APC size (to be verified; I'm not sure this is the case, to be clear), so we load from etcd more often, sometimes causing such failures

I'll take a better look tomorrow.

As for your speculations: it seems improbable that we fail HTTP requests due to network failures often enough to end up in this code path. Also suspicious is the fact that $error is empty, which I wouldn't expect to happen if the problem were with fetching data from etcd.

Maybe a DNS resolution issue would explain this behaviour: DnsSrvDiscoverer::getServers throwing an exception that gets silently caught around the try/finally block. I would consider it entirely possible that DNS resolution inside of k8s has some issues. The code, as written, doesn't really help debug this problem.
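Purely to illustrate that hypothesis (this is not the actual MediaWiki code; names and structure are made up), an exception thrown during server discovery could leave the reason string untouched:

```php
<?php
// Hypothetical sketch only: if discovery throws before any HTTP request is
// attempted, nothing ever assigns $error, and the resulting exception message
// ends with an empty reason, matching what we see in the logs.
function loadFromEtcd( callable $getServers, callable $httpGet ): array {
	$config = null;
	$error = '';
	try {
		foreach ( $getServers() as $server ) { // DNS SRV discovery; may throw
			[ $config, $error ] = $httpGet( $server );
			if ( $config !== null ) {
				break;
			}
		}
	} catch ( Exception $e ) {
		// Swallowed: $error is still '' at this point.
	}
	if ( $config === null ) {
		throw new RuntimeException( "Failed to load configuration from etcd: $error" );
	}
	return $config;
}
```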

Change 959801 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] config: replace unused Logger in EtcdConfig with native trigger_error

https://gerrit.wikimedia.org/r/959801
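For context, the shape of that change is roughly the following (a sketch of the pattern named in the commit message, not the actual diff):

```php
<?php
// Sketch only, not the actual patch. The idea: instead of a PSR-3 logger call
// that is effectively lost this early in the request (before Monolog exists),
// use PHP's native error mechanism, which still reaches the php-fpm/syslog log.
function reportEtcdFetchFailure( string $reason ): void {
	// Before (roughly): $this->logger->error( 'Failed to fetch config: ' . $reason );
	// After (roughly):
	trigger_error( 'EtcdConfig failed to fetch data: ' . $reason, E_USER_WARNING );
}
```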

JMeybohm subscribed.

I've checked the logs from September (https://logstash.wikimedia.org/goto/37581ef39fe3ed2251e9cf0e13d12445), where this has happened 132 times as of now, usually in batches of around 13 messages coming from a single pod. Cross-referencing with k8s events showed that this "sometimes" happens during the startup of a new mediawiki pod. I would assume there is some kind of race condition at play.

Change 959801 merged by jenkins-bot:

[mediawiki/core@master] config: replace unused Logger in EtcdConfig with native trigger_error

https://gerrit.wikimedia.org/r/959801

Task description:
Uncaught ConfigException: Failed to load configuration from etcd:  in /srv/mediawiki/php-1.41.0-wmf.27/includes/config/EtcdConfig.php:229

With the instrumentation in place to fix the empty string reason, we now know the recorded reason. All entries have the same reason: lost lock

message
Uncaught MediaWiki\Config\ConfigException: Failed to load configuration from etcd: lost lock

There are ~100 per day. They are exclusively from MW-on-k8s.

Correlation with the new non-fatal diagnostic warnings added by the above patch is hard, since they happen before Monolog is initialised during the MW web request. This means those warnings go to the php7.2-fpm (baremetal) or php-fpm-error (k8s) syslog, where there is no reqId, HTTP host name, or request URL logged.

The best I know to do is to use the "View surrounding documents" button in Logstash and limit the surrounding documents to the same logsource host.

What I'm seeing consistently is that the Kubernetes pods are encountering this within the first minute of booting up. This makes logical sense, since the data is stored indefinitely in php-apcu once it has been successfully loaded, and is allowed to be used in a stale way for as long as needed. The EtcdConfig class treats cache data as fresh for 10s; after that it lets the first request take a lock to renew the data, and uses the cached data until new data is available, waiting either for that request to succeed or for another request to try after it times out after 2s.
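To make the timing easier to follow, here is a heavily simplified sketch of that behaviour, using the numbers from the description above (names and structure are illustrative, not the real EtcdConfig code):

```php
<?php
// Heavily simplified sketch; timings match the prose above (10s freshness,
// 9s lock, 2s wait loop, 2s+2s etcd fetch). Not the real EtcdConfig code.
function getConfig( callable $fetchFromEtcd ): array {
	$deadline = microtime( true ) + 2.0; // how long one request will wait in total
	$reason = 'lost lock';               // reported if we time out without ever renewing
	do {
		$cached = apcu_fetch( 'etcd-config' );
		if ( $cached !== false && microtime( true ) < $cached['asOf'] + 10 ) {
			return $cached['data'];      // fresh enough; the common case
		}
		if ( apcu_add( 'etcd-config-lock', 1, 9 ) ) {
			// This request holds the renewal lock (auto-expires after 9s).
			[ $data, $error ] = $fetchFromEtcd(); // subject to 2s connect + 2s request timeouts
			if ( $data !== null ) {
				apcu_store( 'etcd-config', [ 'data' => $data, 'asOf' => microtime( true ) ] );
				return $data;
			}
			$reason = $error;            // the fetch itself failed
		}
		if ( $cached !== false ) {
			return $cached['data'];      // stale data is still usable
		}
		usleep( 10000 );                 // brief sleep, then re-check the cache
	} while ( microtime( true ) < $deadline );

	// New pod, empty APCu, and the lock holder did not finish within our 2s wait.
	throw new RuntimeException( "Failed to load configuration from etcd: $reason" );
}
```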

For example:

timestamp: Nov 1, 2023 @ 20:17:18.458
host: mw-api-ext.codfw.main-78dc85dd44-f5zvw
type: mediawiki
caught_by: php-wmerrors
message: Uncaught MediaWiki\Config\ConfigException: Failed to load configuration from etcd: lost lock in /srv/mediawiki/php-1.42.0-wmf.2/includes/config/EtcdConfig.php:231

The "View surrounding documents" button, and then pressing the positive filter for the host name, yields a warning saying that only 7 earlier documents exist in the Logstash index. This is without any type or channel filter, meaning it would include any messages from syslog/php-fpm, but there is nothing earlier.

Querying the same from the Discover page confirms this: there are only 7 earlier messages:

Screenshot 2023-11-02 at 22.39.29.png (679×996 px, 78 KB) Screenshot 2023-11-02 at 22.39.40.png (1×2 px, 625 KB)

The 1 selected message and 7 earlier messages are all the same uncaught exception. So here's what I think is happening:

Time | Event
? | K8s pod boots up.
? | MW request 1 starts, tries cache (empty), acquires lock (expires after 9s if not released), starts Etcd fetch (HTTP connect timeout: 2s, HTTP request timeout: 2s). The Etcd fetch takes over 2 seconds to complete, but does eventually succeed.
? | MW request 2 starts, tries cache (empty), loses lock, enters a poll loop for up to 2 seconds with brief sleeps in between.
Nov 1, 2023 @ 20:17:16.139 | MW request 2 fails with HTTP 500 after 2 seconds of waiting: Uncaught MediaWiki\Config\ConfigException: Failed to load configuration from etcd: lost lock.
Nov 1, 2023 @ 20:17:16.426 | MW request 3 fails, idem.
Nov 1, 2023 @ 20:17:16.448 | MW request 4 fails, idem.
Nov 1, 2023 @ 20:17:17.181 | MW request 5 fails, idem.
Nov 1, 2023 @ 20:17:17.704 | MW request 6 fails, idem.
Nov 1, 2023 @ 20:17:18.050 | MW request 7 fails, idem.
Nov 1, 2023 @ 20:17:18.338 | MW request 8 fails, idem.
Nov 1, 2023 @ 20:17:18.458 | MW request 9 fails, idem.
Nov 1, 2023 @ 20:17:18.859 | MW request 10 is the second successful request. It either saw cache data on the first try (if it started after MW request 1 ended), or maybe it also entered a poll loop for a short time and then saw the cache data appear.

MW requests 2-9 represent the 8 uncaught exceptions in the logs.

The only messages we see during the first minute of the pod's existence are Failed to load configuration from etcd: lost lock. By definition this means the first request, which acquired the lock, must not have gotten any HTTP error or HTTP timeout for the Etcd fetch, as that would have resulted in a different or extra log message.

Main questions:

  • How often do pods (re)start? Do they stay up until the next deployment, or does the cluster restart/grow/reallocate them periodically? Based on this, we can maybe determine how often a new pod enters this kind of failure cycle.
  • Why would an Etcd fetch take more than 2 seconds? My guess is that Etcd isn't actually taking this long, but rather that this is yet another example of Kubernetes networking losing packets, so the request just hangs until the curl client reaches the allocated connect timeout (T249700, T342201, T341634, T292663, etc.).

Action item: Can we avoid this? I think so.

In the long term, I hope a warmup can be implemented, which would benefit lots of things and absorb many different kinds of bootstrapping scenarios through a single mechanism. As I understand it, Kubernetes has a built-in health check probe that it requires to finish successfully before pooling the pod. However, I'm guessing we've either disabled this for MW right now, or it points to a generic virtualhost that isn't served by MediaWiki.
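As a sketch of what a minimal warmup gate could look like (the URL, port, Host header, and timings below are assumptions for illustration, not current production configuration):

```php
<?php
// Hypothetical warmup helper run inside the pod before it is pooled (e.g. as
// a startup/readiness hook): keep hitting a cheap MediaWiki-served URL until
// it returns 200, which guarantees EtcdConfig has populated APCu at least once.
$url = 'http://localhost:8080/wiki/Special:BlankPage'; // assumed in-pod address
$deadline = time() + 60;                               // assumed warmup budget

do {
	$ch = curl_init( $url );
	curl_setopt_array( $ch, [
		CURLOPT_RETURNTRANSFER => true,
		CURLOPT_TIMEOUT => 5,
		CURLOPT_HTTPHEADER => [ 'Host: commons.wikimedia.org' ], // assumed vhost
	] );
	curl_exec( $ch );
	$status = curl_getinfo( $ch, CURLINFO_RESPONSE_CODE );
	curl_close( $ch );
	if ( $status === 200 ) {
		exit( 0 ); // warm: the pod can be pooled
	}
	sleep( 1 );
} while ( time() < $deadline );

exit( 1 ); // never warmed up in time: keep the pod unpooled
```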

In the short term, I suppose we could re-align the timeouts so that we don't rely on servers starting in a way where the first request must succeed in under 2s, or else all incoming requests during that window fail with HTTP 500.

Current timeouts:

  • Cache is considered fresh for 10s.
  • Wait-loop timeout: 2s. Each loop checks cache, then lock, then etcd fetch.
  • Lock is held for 9s.
  • Etcd fetch timeout: 4s total (connect timeout 2s + request timeout 2s).

We could lower the Etcd fetch timeouts (currently 4s = 2s connect + 2s request) to be less than the poll-wait timeout (currently 2s). Given the Etcd SLO of 32ms at https://wikitech.wikimedia.org/wiki/SLO/etcd_main_cluster, we could for example set these to 0.5s and 0.5s. That way, if a fetch would have taken 2+ seconds to complete, it will time out before that and have time left to retry under the same acquired lock, succeed, and then feed the result to the other queued requests instead of letting them all fail.
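Put concretely, the relationship between the knobs would then look something like this (option names are made up for the sketch; the point is the invariant, not the exact configuration surface):

```php
<?php
// Hypothetical values illustrating the proposal: a full etcd fetch attempt
// (connect + request) should fit inside the wait-loop window, leaving room
// for a retry under the same lock before queued requests give up.
$etcdTimeouts = [
	'cacheFreshSec'  => 10.0, // unchanged: cached value treated as fresh
	'lockTtlSec'     => 9.0,  // unchanged here (see the alternative below)
	'waitLoopSec'    => 2.0,  // unchanged: how long other requests poll
	'connectTimeout' => 0.5,  // proposed, was 2.0
	'requestTimeout' => 0.5,  // proposed, was 2.0
];

// Invariant: connect + request < waitLoop, so a slow fetch times out, gets
// retried by the lock holder, and the result can still reach the queued
// requests before they fail with "lost lock".
assert( $etcdTimeouts['connectTimeout'] + $etcdTimeouts['requestTimeout'] < $etcdTimeouts['waitLoopSec'] );
```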

Another way could be to lower the lock expiry significantly. For example, if we set the lock expiry to 0.5s, then if the lock-holding thread takes more than 0.5s to renew the value, one of the other queued threads is given a chance to make a successful renewal. The Etcd request rate should be similar, but rather than happening serially from a single thread, it would happen in parallel from one other thread that successfully gets the lock on its next wait-loop iteration.