This issue surfaced after the change that restored logging capability to the EtcdConfig class.
Logstash link: https://logstash.wikimedia.org/goto/3a693de4428249a88d707aa8bf4e1672
It is currently logging about 300k messages per 24 hours (roughly 3.5 per second), which will become a problem if left unaddressed.
- It only happens on MW-on-K8s; these errors do not surface on bare metal
- We do not observe either memcached or confd connectivity issues
  - curl'ing confd from a mediawiki pod namespace works
  - no mcrouter TKO logs which would indicate it lost connection with the upstream memcached server
  - if there were packet drops or transient connectivity errors, it is safe to assume that either TCP would take care of it, or the rate would be high enough to cause other issues
- The error is caused by this lock-setting conditional failing
This lock seems to use BagOStuff, but I can't tell from reading the code whether it uses a local cache or a memcached key as the lock. On the off chance it is local, we checked whether any APCu caches were full, and none are.
Can you help us understand this lock setting mechanism? Is this lock local or global? If it is global and set in memcached, can you help us find the name of the key used for the lock so we can check if it actually gets set?
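
To make the local-versus-global question concrete, here is a minimal Python sketch of the distinction we are asking about. It is not MediaWiki code: `TinyCache`, `try_refresh`, and the key name are all hypothetical, and the add-if-absent pattern is only our assumption about how such a lock might behave. With one shared store, only the first caller wins the lock; with separate per-process stores, every caller wins its own.

```python
import time


class TinyCache:
    """Toy key-value store with an add-if-absent operation.

    Stands in for either a process-local cache or a shared,
    memcached-like server. Hypothetical, for illustration only.
    """

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def add(self, key, value, ttl):
        """Store value only if key is absent or expired; True on success."""
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # key still held by an earlier caller
        self._data[key] = (value, now + ttl)
        return True


def try_refresh(cache, lock_key, ttl=10):
    """Return True if this caller acquired the lock and may refresh config."""
    return cache.add(lock_key, 1, ttl)


LOCK_KEY = "hypothetical:etcd-config:lock"  # made-up name, not the real key

# Global lock: both callers talk to the same store, so the second one fails.
shared = TinyCache()
results_shared = [try_refresh(shared, LOCK_KEY), try_refresh(shared, LOCK_KEY)]

# Local lock: each caller has its own store, so both succeed independently.
local_a, local_b = TinyCache(), TinyCache()
results_local = [try_refresh(local_a, LOCK_KEY), try_refresh(local_b, LOCK_KEY)]

print(results_shared, results_local)
```

If the real lock is global, a failing conditional could simply mean another pod already holds it; if it is local, a failure would point at something inside the pod itself.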
We need to find out why it fails, and whether we actually refresh EtcdConfig during pod lifetimes.
Tagging @Krinkle for initial contact as the most recent person to touch that code :)