
EtcdConfig using stale data: lost lock in /srv/mediawiki/php-1.42.0-wmf.1/includes/config/EtcdConfig.php on line 218
Open, Medium, Public

Description

This issue surfaced following this change that restored logging capability to the EtcdConfig class.

Logstash link: https://logstash.wikimedia.org/goto/3a693de4428249a88d707aa8bf4e1672

It is currently logging about 300k messages per 24 hours, which will undoubtedly become an issue.

  • It only happens on MW-on-K8s; these errors do not surface on bare metal
  • We do not observe either memcached or confd connectivity issues
    • curl'ing confd from a mediawiki pod namespace works
    • no mcrouter TKO logs which would indicate it lost connection with the upstream memcached server
    • if there were packet drops or transient connectivity errors, it is safe to assume that either TCP would take care of it, or the rate would be high enough to cause other issues
  • The error is caused by this lock-setting conditional failing

This lock seems to be using BagOStuff, but I can't tell from reading the code whether it uses a local cache or a memcached key as the lock. On the off-chance it was local, we checked whether APCu was full, and it isn't.

Can you help us understand this lock setting mechanism? Is this lock local or global? If it is global and set in memcached, can you help us find the name of the key used for the lock so we can check if it actually gets set?

We need to find out why it fails, and if we actually do refresh EtcdConfig during pod lifetimes.

Tagging @Krinkle for initial contact as the most recent person to touch that code :)

Event Timeline

EtcdConfig eventually uses APCUBagOStuff (see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/refs/heads/master/wmf-config/etcd.php#29), so the lock is taken in APCu.

Looking at the code, though, the logging of "lost lock" (which actually means "failed to acquire a lock") was only added recently, in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/959801, and I'm not sure it does what it should.

Specifically, it reports a lost lock each time a thread fails to acquire a lock, which could well be because another thread is actually performing the refresh from etcd.
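That pattern can be sketched as follows: a generic add-if-absent try-lock where exactly one caller wins and every other caller fails immediately. This is a Python illustration of the concept, not MediaWiki's actual BagOStuff code:

```python
import threading

class LocalLock:
    """Add-if-absent lock: the first caller wins, all concurrent callers
    fail immediately. An APCu-backed lock behaves this way within one
    server (generic sketch, not MediaWiki's actual implementation)."""

    def __init__(self):
        self._held = False
        self._mutex = threading.Lock()

    def try_acquire(self) -> bool:
        with self._mutex:
            if self._held:
                return False  # reported as "lost lock", but nothing is broken
            self._held = True
            return True

lock = LocalLock()
first = lock.try_acquire()   # the refreshing thread wins
second = lock.try_acquire()  # a concurrent thread "loses the lock"
```

The second caller failing here is the normal, designed outcome, which is why logging every such failure produces noise.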

I am unsure why it only appears to come from MW on K8s, but I'd suppose it has to do with slightly different error-reporting configurations between the two platforms. This needs to be properly investigated, though.

Joe triaged this task as Medium priority. Oct 23 2023, 7:38 AM

Setting the priority to Medium, as we clearly do read correctly from etcd in K8s; otherwise we would have noticed all sorts of other problems, like DB connections failing.

The increase in warnings is deliberate: the change I made in T346971 explicitly added instrumentation that should help SRE determine the root cause of why Etcd is unreliable in MW-on-K8s. I'm inclined to merge this task into that one, but I'll leave it to you in case there's more to this than continuing T346971: Uncaught ConfigException: Failed to load configuration from etcd.

@Krinkle the instrumentation added doesn't distinguish between failing to get a lock because it's already taken by another thread and just a complete failure. That would help us understand if it's actually an issue or not.

no mcrouter TKO logs which would indicate it lost connection with the upstream memcached server. […] This lock seems to be using BagOStuff, but I can't tell from reading the code whether it uses a local cache or a memcached key as the lock. On the off-chance it was local, we checked whether APCu was full, and it isn't.

[…] If it is global and set in memcached, can you help us find the name of the key used for the lock so we can check if it actually gets set?

Memcached isn't involved. php-apcu is used to store the cached data from Etcd, as well as to coordinate updates to that cache.
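To illustrate how one local store can serve both roles, here is a sketch with a dict standing in for APCu. The key names, TTL handling, and fetch callback are hypothetical, not the ones EtcdConfig actually uses:

```python
import time

apcu = {}  # stands in for the per-server APCu store

def get_config(fetch, ttl=10.0):
    """Serve cached data within the TTL; after that, let exactly one
    caller refresh while everyone else falls back to the stale copy."""
    now = time.monotonic()
    entry = apcu.get("etcd:data")
    if entry and now - entry["at"] < ttl:
        return entry["value"]          # cache used unconditionally within TTL
    if "etcd:lock" not in apcu:        # add-if-absent: one refresher per expiry
        apcu["etcd:lock"] = True
        try:
            value = fetch()            # the actual fetch from Etcd
            apcu["etcd:data"] = {"value": value, "at": now}
            return value
        finally:
            del apcu["etcd:lock"]
    # Lock held by another caller: the "using stale data: lost lock" path
    return entry["value"] if entry else None

config = get_config(lambda: {"some": "setting"})
```

Because both the data and the lock live in the same server-local store, the lock cannot be "lost" due to a network failure; it can only be held by another local thread.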

@Krinkle the instrumentation added doesn't distinguish between failing to get a lock because it's already taken by another thread and just a complete failure. That would help us understand if it's actually an issue or not.

The issue is that MediaWiki is serving infinitely more HTTP 500s on K8s than on baremetal in relation to Etcd.

To my knowledge there is no such thing as a failure to get a response from php-apcu. Hypothetically the process could segfault, hang indefinitely, or throw a generic exception, but it can only return if it got an actual response, afaik. This isn't a networked service like memcached, where an abstraction layer like php-memcached or BagOStuff may treat a negative response and the inability to get a response as the same thing.

As I understand it, the root cause here, as with the other half dozen unresolved tasks about Kubernetes network failures, is a failure to contact Etcd.

"lost lock" is a diagnostic message added by my instrumentation, and a condition that is expected to happen from time to time. The HTTP 500 is caused after that condition is reached, when mw-on-k8s fails to get a response from Etcd.

Un-merging. I see now that the issues don't overlap completely. It also isn't specific to Kubernetes in any way, untagging MW-on-K8s as such.

I think my assumption is correct that the fatal error regression at T346971 only occurs after "stale data" is encountered first. But that's kind of obvious in retrospect, and doesn't mean one "causes" the other. The "stale data" condition is orders of magnitude more common and not even an issue per se. I expect this has been happening for several years and was simply not previously visible to us through logging.

This wasn't obvious to me at first because these errors don't go to Logstash under type:mediawiki, since they happen early on during the request setup, before MediaWiki has a chance to initialise Monolog.

Correlation is hard to do with these raw php-fpm logs since they contain neither a reqId nor a request URL. It's hard to say with certainty which diagnostic messages coincide with other entries in php-fpm. If some of the non-fatal diagnostic conditions are hit in BagOStuff, we wouldn't know, since that operates with a NullLogger until Monolog is set up.

On the Discover page in Logstash, one can see messages from bare metal by querying program:php7.2-fpm in the logstash-* index; for the K8s ones one has to query program:php-fpm-error (the two can be OR'ed). Analysis is hindered by T350430: php-fpm logs from Kubernetes lack 'message' and 'normalized_message', but if you then AND with a query that duplicates its check against both message and originalmsg, one can see that it happens in both environments about equally often.

(type:mediawiki OR program:php7.2-fpm OR program:php-fpm-error) AND (message:EtcdConfig OR originalmsg:EtcdConfig)

Last 7 days:

  • 8% host:mw-web
  • 6% host:mw-api-int
  • 3% host:mw-api-ext
  • 1% host:parse1001
  • 1% host:mw1356
  • 1% host: (various mw1xxx Eqiad hosts)
  • 0.5% host:mw2279
  • 0.5% host:mw2281
  • 0.5% host: (various mw2xxx Codfw hosts)

Of these 12.9 million messages in the last 7 days, 12.8 million are for EtcdConfig using stale data: lost lock.

The remaining messages are:

  • EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached
    • ~1000 in the last 24h. ~70% are from MediaWiki on K8s, ~30% from Codfw appservers.
  • EtcdConfig using stale data: (curl error: 6) Couldn't resolve host name
    • ~35 in the last 24h. ~40% are from MediaWiki on K8s, ~60% from Codfw appservers.

These may be worth looking into separately from this task.

I guess for this task, the main question is: what is considered a "normal" number of requests to happen between expiry and the end of an Etcd fetch? In other words, from when the 10s logical expiry has elapsed, until the thread that acquired the lock has finished fetching fresh data from Etcd and setting it into php-apcu (timeout: 2.0s).

I believe we have about ~15,000 requests per second on appservers and mw-on-k8s combined. The above logs indicate we see about ~150/s (12.8 million / 24 / 3600) using stale data. MW's EtcdConfig class stores and uses the cached Etcd data unconditionally for 10 seconds, and then gives 1 thread the lock to fetch new data. That would mean during the 10 seconds we use no stale data, and then the 11th second sees 1,650 (150/s × 11) requests fall back to stale data, out of the roughly ~15,000 requests in that second. This suggests the Etcd refresh takes about 0.1s on average to complete, which seems reasonable?
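The same arithmetic, spelled out (using the rough figures above; these are back-of-envelope estimates, not measurements):

```python
total_rps = 15_000     # combined appserver + mw-on-k8s requests per second
stale_per_sec = 150    # observed rate of "using stale data" log entries
cache_ttl = 10         # seconds EtcdConfig serves cached data unconditionally

# All stale fallbacks in an 11-second cycle land after the TTL expires:
stale_per_cycle = stale_per_sec * (cache_ttl + 1)   # 1,650 requests

# ...which corresponds to this fraction of one second's traffic, i.e. the
# approximate duration of the Etcd refresh:
refresh_seconds = stale_per_cycle / total_rps       # 0.11 s
```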

Unless I measured these numbers incorrectly, or if taking 0.1s to refresh is longer than expected, I think this is working exactly as designed and as expected.

A potential action item is that we can limit this logging to only happen when it is unexpected. For example, we can limit it to when stale data is used after the allocated ~2s max timeout for one Etcd fetch. E.g. if, after we told 1 thread to renew the data (and all other requests use the 10s stale data), 2 seconds have elapsed and we are still using stale data, only then emit a diagnostic warning. Before that, it is entirely expected to happen, of course, since time doesn't stand still when 1 thread starts fetching new data.
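A sketch of that threshold (the 2s fetch timeout and 10s TTL come from the discussion above; the function name is hypothetical, not EtcdConfig's actual code):

```python
CACHE_TTL = 10.0      # seconds cached data is served unconditionally (from above)
FETCH_TIMEOUT = 2.0   # max time one Etcd fetch may take (from above)

def should_warn_stale(data_age_seconds: float) -> bool:
    """Stay quiet while one thread is refreshing (expected staleness);
    warn only once stale data outlives the whole refresh window."""
    return data_age_seconds > CACHE_TTL + FETCH_TIMEOUT

warn_at_11s = should_warn_stale(11.0)   # expected staleness: no warning
warn_at_13s = should_warn_stale(13.0)   # refresh window exceeded: warn
```

This would keep the diagnostic signal for genuinely slow or failed refreshes while eliminating the ~150/s of expected-case noise.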

This message is still the most frequent in php-fpm error logs (nowadays, Logstash: mediawiki-k8s-errorlog).

It doesn't bother me much since afaik MW devs tend not to look at that (php-fpm logs have no reqId or URLs, and anything between process start and MW Monolog setup is ignorable unless there's a fatal, in which case php-fatal-error hopefully logs it appropriately under type:mediawiki, where we look instead).

I'm looking at it now during incident review, after T349376.

For SRE that monitor php-fpm or etcd, any thoughts on the above logging/threshold strategy?