Spurred by a small, but still surprising, number of periodic job failures over the weekend associated with fetch timeouts in `EtcdConfig`, I spent some time earlier today thinking about what might be driving the background rate of timeouts (i.e., inclusive of those that are not fatal).
That's summarized in T346971#11791136, and specifically raises concerns about DNS resolution.
However, I'd somehow not looked directly at the overall rate of [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/wmf/1.46.0-wmf.22/includes/Config/EtcdConfig.php#180 | logged ]] fetch timeouts (just focused on a couple of examples). I did that this evening, and wow was I surprised ([[ https://logstash.wikimedia.org/goto/558058a068356e3fd52d0e76afae1cac | PHP-errorlog logstash ]]):
{F75198994}
That's April 1st and 2nd last week, and those large steps //seem// to correlate quite strongly with 1.46.0-wmf.22 hitting [[ https://sal.toolforge.org/log/MjcnSZ0Bvg159pQrNWrX | group1 ]] and [[ https://sal.toolforge.org/log/Fzo-TZ0Bvg159pQrZySF | group2 ]].
**Note**: These are "just" timeouts in the sense that, in the vast majority of cases, MediaWiki is able to continue with the (stale) APCu-cached config. However, the overall rate is rather concerning.
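For anyone less familiar with the failure mode, here is a minimal Python sketch of the serve-stale-on-timeout pattern described in the note. It is illustrative only and is not the actual `EtcdConfig` code: the class name `StaleOnTimeoutCache`, the `fetch` callable, and the `timeouts` counter are all hypothetical. It also assumes that a timeout is only fatal when nothing is cached yet.

```python
import time
from typing import Callable, Optional


class StaleOnTimeoutCache:
    """Hypothetical sketch of a cache that serves stale data on fetch timeout.

    Not MediaWiki's EtcdConfig; the names and behaviour here are assumptions.
    """

    def __init__(
        self,
        fetch: Callable[[], dict],
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # backend call; may raise TimeoutError
        self._ttl = ttl              # seconds a fetched value counts as fresh
        self._clock = clock          # injectable so the logic can be tested
        self._value: Optional[dict] = None
        self._expires_at = 0.0
        self.timeouts = 0            # every timeout, fatal or not

    def get(self) -> dict:
        now = self._clock()
        if self._value is not None and now < self._expires_at:
            return self._value       # fresh hit, no backend call

        try:
            fresh = self._fetch()
        except TimeoutError:
            self.timeouts += 1       # this is what would show up in the logs
            if self._value is None:
                raise                # assumed fatal case: nothing to fall back on
            return self._value       # non-fatal case: keep serving stale config

        self._value = fresh
        self._expires_at = now + self._ttl
        return fresh
```

In this simplified version, every call made after expiry retries the backend, so a slow dependency raises the logged timeout count well before it causes any visible failures. A real implementation would likely add locking or backoff, which is omitted here.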
Further, there's not really a strong correlation with overall rate of requests to etcd itself (e.g., in [[ https://grafana.wikimedia.org/goto/afibuez0xnitce?orgId=1 | eqiad ]]) over the same time window (i.e., I don't think that's where the problem is).
If I had to venture a guess, this feels like some form of antagonist workload that landed in 1.46.0-wmf.22 and is impacting the performance of a shared dependency, DNS resolution being one possibility (e.g., some new source of traffic that's not using the service mesh).
Maybe related task: {T422486}