Spurred by a small, but still surprising, number of periodic job failures over the weekend associated with fetch timeouts in EtcdConfig, I spent some time earlier today thinking about what might be driving the background rate of timeouts (i.e., inclusive of those that are not fatal).
That's summarized in T346971#11791136, and specifically raises concerns about DNS resolution.
However, I'd somehow not looked directly at the overall rate of logged fetch timeouts, having focused on just a couple of examples. I did that this evening, and wow, was I surprised (PHP-errorlog logstash):
That's March 17th (previous spurious correlation was due to inadvertently capturing the source line in the log).
That's April 1st and 2nd last week, and those large steps seem to correlate quite strongly with 1.46.0-wmf.22 hitting group1 and group2.
Note: These are "just" timeouts, in the sense that, in the vast majority of cases, MediaWiki is able to continue with the (stale) APCu-cached config. However, the overall rate is rather concerning.
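For anyone less familiar with why most of these timeouts are non-fatal, here is a minimal sketch of the general "serve stale on timeout" pattern. It is an illustration only, not EtcdConfig's actual code: the class name, the `fetch` callable, and the `timeouts` counter are all hypothetical, and APCu is stood in for by plain instance attributes.

```python
import time


class StaleOnTimeoutCache:
    """Hypothetical illustration of falling back to a stale cached value.

    This is NOT MediaWiki's EtcdConfig implementation; it only shows the
    general shape of the behaviour described above.
    """

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable that may raise TimeoutError
        self._ttl = ttl_seconds      # how long a fetched value counts as fresh
        self._clock = clock          # injectable so the logic can be tested
        self._value = None
        self._expires_at = None      # None means nothing has been cached yet
        self.timeouts = 0            # every timeout is counted, fatal or not

    def get(self):
        now = self._clock()
        if self._expires_at is not None and now < self._expires_at:
            return self._value       # still fresh: no backend request

        try:
            value = self._fetch()
        except TimeoutError:
            self.timeouts += 1
            if self._expires_at is None:
                raise                # cold cache: the timeout is fatal
            return self._value       # warm cache: continue with stale value

        self._value = value
        self._expires_at = now + self._ttl
        return value
```

The point of the sketch is that the timeout counter climbs on every failed refresh, while callers only see an error when there is no cached value at all, which is why the logged rate can be far higher than the rate of visible failures.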
Further, there's not really a strong correlation with the overall rate of requests to etcd itself (e.g., in eqiad) over the same time window (i.e., I don't think that's where the problem is).
If I had to venture a guess, this feels like some form of antagonistic workload that landed on the 17th of March and is impacting the performance of a shared dependency, DNS resolution being one possibility (e.g., some new source of traffic that's not using the service mesh).
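One cheap way to probe the DNS hypothesis from an affected host would be to time repeated lookups. The following is a rough sketch under stated assumptions, not an existing tool: `time_dns_lookups` and its parameters are hypothetical names, it goes through the system resolver via `socket.getaddrinfo` (so any local resolver cache will mask upstream latency), and it says nothing about how MediaWiki itself resolves names.

```python
import socket
import statistics
import time


def time_dns_lookups(hostname, samples=20, resolve=socket.getaddrinfo):
    """Time repeated lookups of `hostname` and summarize the results.

    `resolve` is injectable (hypothetical parameter) so the function can be
    exercised without network access; by default it uses the system resolver.
    """
    durations_ms = []
    failures = 0
    for _ in range(samples):
        start = time.perf_counter()
        try:
            resolve(hostname, None)
        except OSError:              # socket.gaierror is a subclass of OSError
            failures += 1
            continue
        durations_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "samples": samples,
        "failures": failures,
        "median_ms": statistics.median(durations_ms) if durations_ms else None,
        "max_ms": max(durations_ms) if durations_ms else None,
    }
```

Calling, say, `time_dns_lookups(name, samples=50)` with the etcd endpoint name, from hosts inside and outside the suspected blast radius, would give a first indication of whether resolution latency or failures differ; a long tail in `max_ms` would be more interesting than the median.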
Maybe related task: T422486: MediaWiki periodic job failures due to timeouts



