Uncaught ConfigException: Failed to load configuration from etcd
Open, Low | Public | PRODUCTION ERROR

Description

Error
labels.normalized_message
Uncaught ConfigException: Failed to load configuration from etcd:  in /srv/mediawiki/php-1.41.0-wmf.27/includes/config/EtcdConfig.php:229
error.stack_trace
(REDACTED)
Impact
Notes

~13 in the last hour. I'm seeing this for .26 as well, so presumably not train-related.

Details

Request URL
https://commons.wikimedia.org/wiki/Special:EntityData/M137871102.ttl?flavor=*&revision=*

Event Timeline

brennen renamed this task from Uncaught ConfigException: Failed to load configuration from etcd: in /srv/mediawiki/php-1.41.0-wmf.27/includes/config/EtcdConfig.php:229 to Uncaught ConfigException: Failed to load configuration from etcd. Sep 20 2023, 6:36 PM
Krinkle subscribed.

Untagging MediaWiki-Platform-Team since MediaWiki-Configuration does not have an owner today (mw:Maintainers).

Tagging MW-on-K8s and serviceops because in the Logstash mediawiki dashboard, results for "Failed to load configuration from etcd" over the past 7 days come exclusively from Kubernetes nodes. Baremetal seems to be fine.

I apologize in advance for speculating on something I know very little about, but it's starting to smell like Kubernetes just doesn't know how to do networking. See also T342201#9151176 about MW-on-K8s failing to reach Memcached. We also have an as-yet-unknown issue with reaching Shellbox, AQS, and EventBus in K8s from services outside it (T292663, T341634, T249700).


Given this is K8s-specific and likely to get worse as traffic increases, also pinging T346422: Move 10% of mediawiki external requests to mw on k8s and T290536: Serve production traffic via Kubernetes.

The error condition is reached when there is no value in apc and a response can't be fetched from etcd in time.
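As a rough sketch of that condition (simplified, with hypothetical names; not the actual EtcdConfig code):

```php
<?php
// Simplified, hypothetical illustration of the failure mode described above;
// the real logic lives in includes/config/EtcdConfig.php and is more involved.
function loadConfig( callable $fetchFromEtcd ): array {
	$cached = apcu_fetch( 'etcd-config' );
	if ( $cached !== false ) {
		// A cached value, even a stale one, means we never hit the fatal path.
		return $cached;
	}
	// Empty cache (e.g. a freshly started worker/pod): the request now depends
	// entirely on the etcd fetch completing within its timeout.
	$fresh = $fetchFromEtcd();
	if ( $fresh === null ) {
		// This is the path that surfaces as the uncaught ConfigException.
		throw new RuntimeException( 'Failed to load configuration from etcd' );
	}
	apcu_store( 'etcd-config', $fresh );
	return $fresh;
}
```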

I kind of suspect one of the following two conditions is causing this:

  • This happens on container startup, and for some reason we're triggering some specific network-related race condition
  • We evict APC entries more often in pods because we use a smaller APC size (to be verified; I'm not sure this is the case, to be clear), so we load from etcd more often, sometimes causing such failures

I'll take a better look tomorrow.

As for your speculations: it seems improbable that we fail HTTP requests due to network failures often enough to end up in this code path. Also suspicious is the fact that $error is empty, which I wouldn't expect to happen if the problem were with fetching data from etcd.

Maybe a DNS resolution issue would explain this behaviour: DnsSrvDiscoverer::getServers throwing an exception that gets silently caught around the try/finally block. I would consider it entirely possible that DNS resolution inside of k8s has some issues. The code, as written, doesn't really help debug this problem.
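Purely to illustrate that hypothesis (this is not the actual MediaWiki code; names and structure are made up), an exception thrown during server discovery could leave the reason string untouched:

```php
<?php
// Hypothetical sketch only: if discovery throws before any HTTP request is
// attempted, nothing ever assigns $error, and the resulting exception message
// ends with an empty reason, matching what we see in the logs.
function loadFromEtcd( callable $getServers, callable $httpGet ): array {
	$config = null;
	$error = '';
	try {
		foreach ( $getServers() as $server ) { // DNS SRV discovery; may throw
			[ $config, $error ] = $httpGet( $server );
			if ( $config !== null ) {
				break;
			}
		}
	} catch ( Exception $e ) {
		// Swallowed: $error is still '' at this point.
	}
	if ( $config === null ) {
		throw new RuntimeException( "Failed to load configuration from etcd: $error" );
	}
	return $config;
}
```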

Change 959801 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] config: replace unused Logger in EtcdConfig with native trigger_error

https://gerrit.wikimedia.org/r/959801
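For context, the shape of that change is roughly the following (a sketch of the pattern named in the commit message, not the actual diff):

```php
<?php
// Sketch only, not the actual patch. The idea: instead of a PSR-3 logger call
// that is effectively lost this early in the request (before Monolog exists),
// use PHP's native error mechanism, which still reaches the php-fpm/syslog log.
function reportEtcdFetchFailure( string $reason ): void {
	// Before (roughly): $this->logger->error( 'Failed to fetch config: ' . $reason );
	// After (roughly):
	trigger_error( 'EtcdConfig failed to fetch data: ' . $reason, E_USER_WARNING );
}
```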

JMeybohm subscribed.

I've checked the logs from September (https://logstash.wikimedia.org/goto/37581ef39fe3ed2251e9cf0e13d12445), where this has happened 132 times as of now, usually in batches of around 13 messages coming from a single pod. Cross-referencing with k8s events showed that this "sometimes" happens during the startup of a new mediawiki pod. I would assume there is some kind of race condition at play.

Change 959801 merged by jenkins-bot:

[mediawiki/core@master] config: replace unused Logger in EtcdConfig with native trigger_error

https://gerrit.wikimedia.org/r/959801

Task description:
Uncaught ConfigException: Failed to load configuration from etcd:  in /srv/mediawiki/php-1.41.0-wmf.27/includes/config/EtcdConfig.php:229

With the instrumentation in place to fix the empty string reason, we now know the recorded reason. All entries have the same reason: lost lock

message
Uncaught MediaWiki\Config\ConfigException: Failed to load configuration from etcd: lost lock

There are ~100 per day. They are exclusively from MW-on-k8s.

Correlation with the new non-fatal diagnostic warnings added by the above patch is hard, since they happen before Monolog is initialised during the MW web request. This means those warnings go to the php7.2-fpm (baremetal) or php-fpm-error (k8s) syslog, where there is no reqId, HTTP host name, or request URL logged.

The best I know to do is to use the "View surrounding documents" button in Logstash and limit the surrounding documents to the same logsource host.

What I'm seeing consistently is that the Kubernetes pods are encountering this within the first minute of booting up. This makes logical sense, since the data is stored indefinitely in php-apcu once it has been successfully loaded, and is allowed to be used in a stale way for as long as needed. The EtcdConfig class treats cache data as fresh for 10s; after that it lets the first request take a lock to renew the data, and uses the cached data until new data is available, waiting either for that request to succeed or for another request to try after it times out after 2s.
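To make the timing easier to follow, here is a heavily simplified sketch of that behaviour, using the numbers from the description above (names and structure are illustrative, not the real EtcdConfig code):

```php
<?php
// Heavily simplified sketch; timings match the prose above (10s freshness,
// 9s lock, 2s wait loop, 2s+2s etcd fetch). Not the real EtcdConfig code.
function getConfig( callable $fetchFromEtcd ): array {
	$deadline = microtime( true ) + 2.0; // how long one request will wait in total
	$reason = 'lost lock';               // reported if we time out without ever renewing
	do {
		$cached = apcu_fetch( 'etcd-config' );
		if ( $cached !== false && microtime( true ) < $cached['asOf'] + 10 ) {
			return $cached['data'];      // fresh enough; the common case
		}
		if ( apcu_add( 'etcd-config-lock', 1, 9 ) ) {
			// This request holds the renewal lock (auto-expires after 9s).
			[ $data, $error ] = $fetchFromEtcd(); // subject to 2s connect + 2s request timeouts
			if ( $data !== null ) {
				apcu_store( 'etcd-config', [ 'data' => $data, 'asOf' => microtime( true ) ] );
				return $data;
			}
			$reason = $error;            // the fetch itself failed
		}
		if ( $cached !== false ) {
			return $cached['data'];      // stale data is still usable
		}
		usleep( 10000 );                 // brief sleep, then re-check the cache
	} while ( microtime( true ) < $deadline );

	// New pod, empty APCu, and the lock holder did not finish within our 2s wait.
	throw new RuntimeException( "Failed to load configuration from etcd: $reason" );
}
```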

For example:

timestamp: Nov 1, 2023 @ 20:17:18.458
host: mw-api-ext.codfw.main-78dc85dd44-f5zvw
type: mediawiki
caught_by: php-wmerrors
message: Uncaught MediaWiki\Config\ConfigException: Failed to load configuration from etcd: lost lock in /srv/mediawiki/php-1.42.0-wmf.2/includes/config/EtcdConfig.php:231

The "View surrounding documents" button, and then pressing the positive filter for the host name, yields a warning saying that only 7 earlier documents exist in the Logstash index. This is without any type or channel filter, meaning it would include any messages from syslog/php-fpm, but there is nothing earlier.

Querying the same from the Discover page confirms this: there are only 7 earlier messages:

Screenshot 2023-11-02 at 22.39.29.png (679×996 px, 78 KB) Screenshot 2023-11-02 at 22.39.40.png (1×2 px, 625 KB)

The 1 selected message and 7 earlier messages are all the same uncaught exception. So here's what I think is happening:

Time | Event
? | K8s pod boots up.
? | MW request 1 starts, tries cache (empty), acquires lock (expires after 9s if not released), starts Etcd fetch (HTTP connect timeout: 2s, HTTP request timeout: 2s). The Etcd fetch takes over 2 seconds to complete, but does eventually succeed.
? | MW request 2 starts, tries cache (empty), loses lock, enters a poll loop for up to 2 seconds with brief sleeps in between.
Nov 1, 2023 @ 20:17:16.139 | MW request 2 fails with HTTP 500 after 2 seconds of waiting: Uncaught MediaWiki\Config\ConfigException: Failed to load configuration from etcd: lost lock.
Nov 1, 2023 @ 20:17:16.426 | MW request 3 fails, idem.
Nov 1, 2023 @ 20:17:16.448 | MW request 4 fails, idem.
Nov 1, 2023 @ 20:17:17.181 | MW request 5 fails, idem.
Nov 1, 2023 @ 20:17:17.704 | MW request 6 fails, idem.
Nov 1, 2023 @ 20:17:18.050 | MW request 7 fails, idem.
Nov 1, 2023 @ 20:17:18.338 | MW request 8 fails, idem.
Nov 1, 2023 @ 20:17:18.458 | MW request 9 fails, idem.
Nov 1, 2023 @ 20:17:18.859 | MW request 10 is the second successful request. It either saw cache data on the first try (if it started after MW request 1 ended), or maybe it also entered a poll loop for a short time and then saw the cache data appear.

MW requests 2-9 represent the 8 uncaught exceptions in the logs.

The only messages we see during the first minute of the pod's existence are Failed to load configuration from etcd: lost lock. By definition this means the first request, which acquired the lock, must not have gotten any HTTP error or HTTP timeout for the Etcd fetch, as that would have resulted in a different or extra log message.

Main questions:

  • How often do pods (re)start? Do they stay up until the next deployment, or does the cluster restart/grow/reallocate them periodically? Based on this, we can maybe determine how often a new pod enters this kind of failure cycle.
  • Why would an Etcd fetch take more than 2 seconds? My guess is that Etcd isn't actually taking this long, but rather that this is yet another example of Kubernetes networking losing packets, so the request just hangs until the curl client reaches the allocated connect timeout (T249700, T342201, T341634, T292663, etc.).

Action item: Can we avoid this? I think so.

In the long term, I hope a warmup can be implemented, which would benefit lots of things and absorb many different kinds of bootstrapping scenarios through a single mechanism. As I understand it, Kubernetes has a built-in health check probe that it requires to finish successfully before pooling the pod. However, I'm guessing we've either disabled this for MW right now, or it points to a generic virtualhost that isn't served by MediaWiki.
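As a sketch of what a minimal warmup gate could look like (the URL, port, Host header, and timings below are assumptions for illustration, not current production configuration):

```php
<?php
// Hypothetical warmup helper run inside the pod before it is pooled (e.g. as
// a startup/readiness hook): keep hitting a cheap MediaWiki-served URL until
// it returns 200, which guarantees EtcdConfig has populated APCu at least once.
$url = 'http://localhost:8080/wiki/Special:BlankPage'; // assumed in-pod address
$deadline = time() + 60;                               // assumed warmup budget

do {
	$ch = curl_init( $url );
	curl_setopt_array( $ch, [
		CURLOPT_RETURNTRANSFER => true,
		CURLOPT_TIMEOUT => 5,
		CURLOPT_HTTPHEADER => [ 'Host: commons.wikimedia.org' ], // assumed vhost
	] );
	curl_exec( $ch );
	$status = curl_getinfo( $ch, CURLINFO_RESPONSE_CODE );
	curl_close( $ch );
	if ( $status === 200 ) {
		exit( 0 ); // warm: the pod can be pooled
	}
	sleep( 1 );
} while ( time() < $deadline );

exit( 1 ); // never warmed up in time: keep the pod unpooled
```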

In the short term, I suppose we could re-align the timeouts so that we don't rely on servers starting in a way where the first request must succeed in under 2s, or else all incoming requests during that window fail with HTTP 500.

Current timeouts:

  • Cache is considered fresh for 10s.
  • Wait-loop timeout: 2s. Each loop checks cache, then lock, then etcd fetch.
  • Lock is held for 9s.
  • Etcd fetch timeout: 4s total (connect timeout 2s + request timeout 2s).

We could lower the Etcd fetch timeouts (currently 4s = 2s connect + 2s request) to be less than the poll-wait timeout (currently 2s). Given the Etcd SLO of 32ms at https://wikitech.wikimedia.org/wiki/SLO/etcd_main_cluster, we could for example set these to 0.5s and 0.5s. That way, if a fetch would have taken 2+ seconds to complete, it will time out before that and have time left to retry under the same acquired lock, succeed, and then feed the result to the other queued requests instead of letting them all fail.
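Put concretely, the relationship between the knobs would then look something like this (option names are made up for the sketch; the point is the invariant, not the exact configuration surface):

```php
<?php
// Hypothetical values illustrating the proposal: a full etcd fetch attempt
// (connect + request) should fit inside the wait-loop window, leaving room
// for a retry under the same lock before queued requests give up.
$etcdTimeouts = [
	'cacheFreshSec'  => 10.0, // unchanged: cached value treated as fresh
	'lockTtlSec'     => 9.0,  // unchanged here (see the alternative below)
	'waitLoopSec'    => 2.0,  // unchanged: how long other requests poll
	'connectTimeout' => 0.5,  // proposed, was 2.0
	'requestTimeout' => 0.5,  // proposed, was 2.0
];

// Invariant: connect + request < waitLoop, so a slow fetch times out, gets
// retried by the lock holder, and the result can still reach the queued
// requests before they fail with "lost lock".
assert( $etcdTimeouts['connectTimeout'] + $etcdTimeouts['requestTimeout'] < $etcdTimeouts['waitLoopSec'] );
```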

Another way could be to lower the lock expiry significantly. For example, if we set the lock expiry to 0.5s, then if the lock-holding thread takes more than 0.5s to renew the value, one of the other queued threads is given a chance to make a successful renewal. The Etcd request rate should be similar, but rather than happening serially from a single thread, it would happen in parallel from one other thread that successfully gets the lock on its next wait-loop iteration.