Massive increase in "EtcdConfig failed to fetch data: Timeout was reached" warnings and errors since March 17th
Closed, Resolved · Public · PRODUCTION ERROR

Description

Spurred by a small, but still surprising, number of periodic job failures over the weekend associated with fetch timeouts in EtcdConfig, I spent some time earlier today thinking about what might be driving the background rate of timeouts (i.e., inclusive of those that are not fatal).

That's summarized in T346971#11791136, and specifically raises concerns about DNS resolution.

However, I'd somehow not looked directly at the overall rate of logged fetch timeouts (just focused on a couple of examples). I did that this evening, and wow was I surprised (PHP-errorlog logstash):

Screenshot From 2026-04-07 07-15-48.png (499×902 px, 41 KB)

That's March 17th (previous spurious correlation was due to inadvertently capturing the source line in the log).

That's April 1st and 2nd last week, and those large steps seem to correlate quite strongly with 1.46.0-wmf.22 hitting group1 and group2.

Note: These are "just" timeouts, in the sense that, in the vast majority of cases MediaWiki is able to continue with the (stale) APCu-cached config. However, the overall rate is rather concerning.

Further, there's not really a strong correlation with overall rate of requests to etcd itself (e.g., in eqiad) over the same time window (i.e., I don't think that's where the problem is).

If I had to venture a guess, this feels like some form of antagonist workload landed on the 17th of March that's impacting performance of a shared dependency, DNS resolution being one possibility (e.g., some new source traffic that's not using the service mesh).

Maybe related task: T422486: MediaWiki periodic job failures due to timeouts

Related Objects

Mentioned In
T424280: Consider surfacing curl error details in MultiHttpClient
T422967: Investigate DNS query improvements in MediaWiki-on-k8s
T422955: Detect elevated rates of EtcdConfig fetch failures
T349376: Fix noisy "EtcdConfig using stale data: lost lock" warning from EtcdConfig.php
T422414: MediaWiki periodic job purge-temporary-accounts failed
T422489: rdbms errors in eqiad
T422486: MediaWiki periodic job failures due to timeouts
T422227: MediaWiki periodic job refreshlinks-delete-from-nonexistent-s3 failed
Mentioned Here
T422486: MediaWiki periodic job failures due to timeouts
T348255: Parser cache infrastructure for OutputTransform
T413811: 1.46.0-wmf.20 deployment blockers
T418518: Remove code for legacy GrowthMentorList validator
T418534: Update the design of the popup login form for use in a mobile web view
T419125: hCaptcha: Update mediawiki-config to enforce checks for API edits coming from the MobileFrontend
T419163: Opt new accounts into ReadingLists BetaFeature
T419730: Vector 2022 should support duplication of languages in header and sidebar
T420062: Uninstall PSI extensions on closed wikis which are not needed
T420063: Uninstall AbuseFilter from wikis which are closed and have no AbuseLog entries
T420288: VisualEditor link tool is confusing project namespace and interwiki links
T420315: Error: Cannot modify readonly property MediaWiki\Category\CategoryViewer::$query
T422320: Android & iOS app login broken: "Could not extract login status"
T346971: Uncaught ConfigException: Failed to load configuration from etcd

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Production Error". Apr 7 2026, 2:27 AM
Restricted Application added a subscriber: Aklapper.

Options to consider:

  1. We could try temporarily reverting (at least group1 and group2) to php-1.46.0-wmf.21 to confirm the correlation described above. This may of course end badly if anyone has already assumed .22 is rollback-safe and made dependent changes.
  2. If we believe core DNS might potentially be the impacted shared dependency (assuming the theory above is plausible), we could try significantly upsizing the number of replicas.
  3. If anyone happens to know of a feature rolling out with .22 that may be introducing a sizable amount of non-service-mesh traffic, we could try turning it off.

Adding @jnuche and @dancy as train deployers this week - FYI, given the possibility of needing to explore option #1 in T422455#11792120.

  1. We could try temporarily reverting (at least group1 and group2) to php-1.46.0-wmf.21 to confirm the correlation described above. This may of course end badly if anyone has already assumed .22 is rollback-safe and made dependent changes.

That could be problematic at this point. A quick glance already shows backports for 1.46.0-wmf.22, e.g.: https://gerrit.wikimedia.org/r/c/mediawiki/skins/MinervaNeue/+/1267214 or https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1268281. The latter fixes a mobile login bug (T422320).

It seems like "EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached" started sometime around March 17th, with eqiad showing significantly more errors.

March 17th

| Time | Summary | Gerrit |
| --- | --- | --- |
| 00:00 | Enable languages in main menu on Russian Wikipedia (T419730) | 1251158 |
| 08:43 | hcaptcha: Enforce hCaptcha on API edits from MobileFrontend (T419125) | 1250575 |
| 08:57 | Revert hcaptcha enforce on API edits (T419125) | 1254114 |
| 13:09 | Remove misplaced readonly from CategoryViewer (T420315) | 1254166 |
| 13:20 | Turn on postprocessing cache for all Parsoid parses (T348255) | 1251610 |
| 13:37 | TitleWidget: Prioritise namespace prefix over interwiki prefix (T420288) | 1254189, 1254190 |
| 13:57 | group0 bumped to 1.46.0-wmf.20 (T413811) | |
| 15:02 | Create dblists for wikis where CheckUser/AbuseFilter disabled (T420063, T420062) | 1254217 |
| 15:11 | Growth: Remove temporary GrowthMentorList overrides (T418518) | 1244723 |
| 20:03 | Set wgReadingListsBetaDefaultForNewAccountsAfter for beta cluster (T419163) | 1251309 |
| 20:22 | Passwordless login: Don't display conditional auth errors | 1254301, 1254302 |
| 20:52 | Remove notice from login form in popup mode (T418534) | 1254280 |

DC Switchover

  • March 24th @ 15:00 UTC: Read traffic switched to eqiad
  • March 25th @ 15:00 UTC: codfw depooled
  • April 2nd @ 13:00 UTC: codfw repooled

Setting aside any MediaWiki changes, the difference between the two DCs over the same time period (post codfw repooling) is alarming:

April 2nd 14:00 - April 7th 9:30

  • eqiad: ~228.225
  • codfw: ~76

image.png (916×3 px, 223 KB)

image.png (1×3 px, 213 KB)

jijiki changed the task status from Open to In Progress. Apr 7 2026, 11:23 AM
jijiki moved this task from Inbox to In Progress on the ServiceOps new board.

Take this with a grain of salt, but it seems that something did change during the week of March 17th, and if eqiad had not been producing all those errors, we wouldn't have noticed.

Scott_French renamed this task from Massive increase in `EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached` with 1.46.0-wmf.22 to Massive increase in `EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached` on March 17th. Apr 7 2026, 2:19 PM

Change #1268569 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] wikikube: Request 1 CPU and 500M memory per replica

https://gerrit.wikimedia.org/r/1268569

Change #1268573 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] wikikube: Temporarily double coredns replicas (12)

https://gerrit.wikimedia.org/r/1268573

Change #1268569 merged by jenkins-bot:

[operations/deployment-charts@master] wikikube: Request 1 CPU and 500M memory per replica

https://gerrit.wikimedia.org/r/1268569

Change #1268579 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] coredns: Add a switch to enable autopath

https://gerrit.wikimedia.org/r/1268579

Many thanks for continuing to investigate this @jijiki - particularly for spotting the actual onset date (March 17th) rather than the spurious correlation with the .22 release last week (now noted in the task description).

From what you collected, it looks like the onset time - if there was indeed a single trigger - would have been at / around 15:00 UTC on that day.

@jnuche and @dancy - Given that the correlation with .22 was spurious, feel free to ignore this going forward in the context of this week's train.

@jnuche and @dancy - Given that the correlation with .22 was spurious, feel free to ignore this going forward in the context of this week's train.

Thank you!

The changes in https://gerrit.wikimedia.org/r/1268569 were applied to codfw and eqiad at 15:10 and 15:20 UTC today respectively.

Following those changes, the rate of errors has plummeted (logstash):

Screenshot From 2026-04-07 09-50-55.png (713×1 px, 106 KB)

We're now down to single-digits (and often zero) per 10m, with the exception of brief clusters of timeouts mainly during deployments.

Note: This is also a slightly different query than before, still focusing on causative errors (rather than the duplicate EtcdConfig using stale data: {reason} logs), covering all 3 of curl 28 (timeout) and 6 (outright resolution failure) during fetch, plus dns_get_record failure in DnsSrvDiscoverer (i.e., during SRV record resolution).
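As a rough illustration of what that query counts, here is a minimal sketch that classifies log messages into the three causative classes and tallies them per 10-minute bucket. The class names and the regular expressions are hypothetical; in particular, the exact message text for the dns_get_record case is an assumption, since it isn't quoted in this task.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical patterns for the three causative error classes named above.
# The dns_get_record pattern is a guess at the message text, not a known format.
PATTERNS = {
    "curl_28_timeout": re.compile(r"failed to fetch data: \(curl error: 28\)"),
    "curl_6_resolve": re.compile(r"failed to fetch data: \(curl error: 6\)"),
    "srv_lookup": re.compile(r"dns_get_record"),
}


def classify(message):
    """Return the causative class for a log message, or None to skip it."""
    if "using stale data" in message:
        return None  # duplicate of a causative error, so not counted
    for name, pattern in PATTERNS.items():
        if pattern.search(message):
            return name
    return None


def bucket_10m(ts):
    """Floor a datetime to the start of its 10-minute bucket."""
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)


def count_per_bucket(events):
    """Count (bucket, class) pairs for an iterable of (datetime, message)."""
    counts = Counter()
    for ts, message in events:
        cls = classify(message)
        if cls is not None:
            counts[(bucket_10m(ts), cls)] += 1
    return counts
```

Skipping the "using stale data" messages is what keeps each underlying failure from being counted twice.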

While it would at least seem that coredns was somehow implicated in this, it's very hard to precisely identify why this change appears to have improved the situation - e.g., decoupling the effect of the restart vs. the benefits of having more appropriate resource requests set (i.e., reducing the likelihood of resource contention with collocated workloads).

I'll continue to monitor for now and hold off on the proposed upsize unless the situation begins to degrade again (https://gerrit.wikimedia.org/r/1268573).

The changes in https://gerrit.wikimedia.org/r/1268569 were applied to codfw and eqiad at 15:10 and 15:20 UTC today respectively.

Following those changes, the rate of errors has plummeted (logstash):

Thank you for sorting it! As we suspected, this sorted the k8s jobs issue too (T422486#11796063).

Scott_French lowered the priority of this task from High to Medium.

Checking in near the end of my day, the observations in T422455#11795500 continue to hold - the rate of fetch timeouts seems to remain stable in the single-digits-per-10m, with the exception of windows containing MediaWiki deployments.

While I would still like to observe the effect that https://gerrit.wikimedia.org/r/1268573 (2x replicas) has on the lingering rate of timeouts, at this point I'm inclined to give the current situation a longer soak time - most likely a full 24h through to tomorrow.

In the interim, I'll drop this to Medium, since the immediate impact (e.g., on periodic jobs) has been mitigated (thanks for confirming that, @jijiki!).

I agree with a longer soak, but I'm also not opposed to doubling the replicas even if the current situation holds. My rationale is that we keep having infrequent issues with coredns related to scaling up the number of clients making DNS queries, and that would give us more margin without having to revisit that once again. Thoughts?

Krinkle renamed this task from Massive increase in `EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached` on March 17th to Massive increase in "EtcdConfig failed to fetch data: Timeout was reached" warnings since March 17th. Apr 8 2026, 1:38 PM

the vast majority of cases MediaWiki is able to continue with the (stale) APCu-cached config.

I believe it was 100%, not just a majority.

Reading EtcdConfig.php I understand it as follows:

  1. This thread found no data in APC, or it found data that is stale and preferably not used; we try to renew.
  2. This thread succeeds in getting the renew lock (other concurrent threads might use stale data).
  3. This thread fetches from Etcd but this fetch failed with a timeout.
    • If we have no retries left, then we log a PHP Warning EtcdConfig failed to fetch data: {error}.
    • If we have retries left and no stale data in APC, we go back to step 1 and retry.
    • If we do have stale data, then we log a PHP Notice EtcdConfig using stale data: {error} and use that.
    • If we have no retries left and did not find stale data, then we abort the retry loop, and unconditionally throw ConfigException: Failed to load configuration from etcd: {error} which terminates the response with HTTP 500.
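One possible reading of step 3, as a minimal sketch rather than the actual EtcdConfig.php logic: `load_config`, its parameters, and the ordering of the checks are all hypothetical, chosen only so that each of the four bullets holds as written. Steps 1 and 2 (the APC lookup and the renew lock) are left out.

```python
class ConfigException(Exception):
    """Stand-in for MediaWiki's ConfigException."""


def load_config(fetch, stale, retries, log):
    """Try to fetch fresh config, falling back to stale data when available.

    fetch:   callable returning the config, or raising TimeoutError
    stale:   previously cached config, or None if there is none
    retries: number of retries still allowed after a failed fetch
    log:     list that collects the messages that would be logged
    """
    while True:
        try:
            return fetch()
        except TimeoutError as err:
            if retries == 0:
                # No retries left: the fetch failure itself is logged.
                log.append(f"Warning: EtcdConfig failed to fetch data: {err}")
            if stale is not None:
                # Stale data is available: note it and carry on with it.
                log.append(f"Notice: EtcdConfig using stale data: {err}")
                return stale
            if retries == 0:
                # Nothing to fall back on: this is the HTTP 500 case.
                raise ConfigException(
                    f"Failed to load configuration from etcd: {err}"
                ) from err
            retries -= 1  # no stale data, but retries remain: try again
```

Under this reading, a process with stale data never reaches the exception, which is consistent with the claim above that fallback to stale config covers every case where stale data exists.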

Last 7 days of Warning: EtcdConfig messages in Logstash mediawiki-errors:

  • PHP Warning: EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached
    • Count: 1,800
  • PHP Warning: EtcdConfig failed to fetch data: (curl error: 6) Couldn't resolve host name
    • Count: 40

Server groups:

  • kube-mw-cron: 1654
  • kube-mw-script: 183

Last 7 days of Notice: EtcdConfig messages to php-fpm in mediawiki-k8s-errorlog

  • PHP Notice: EtcdConfig using stale data: (curl error: 28) Timeout was reached
    • Count: 290K
    • Server groups: mw-web 143K, mw-api-ext 69K, mw-jobrunner 34K, ..
  • PHP Notice: EtcdConfig using stale data: lost lock
    • Count: 85M
    • Server groups: mw-web 33M, mw-api-ext 30M, mw-jobrunner 7M, ..
  • PHP Notice: EtcdConfig using stale data: (curl error: 6) Couldn't resolve host name
    • Count: 12K
    • Server groups: mw-web 6K, mw-api-ext 3K, mw-jobrunner 1K

Last 7 days of ConfigException: Failed to load configuration from etcd messages in Logstash mediawiki-errors:

  • Uncaught ConfigException: Failed to load configuration from etcd: lost lock
    • Count: 363
    • Server groups: mw-api-int 258, mw-api-ext 55, mw-jobrunner 50

It is interesting that the fetch failures are all from mw-cron and mw-script. Afaik those pods are always one-off for a single process and thus have had no previous processes in the same pod (and APCu disabled). Volume-wise, it makes sense that they are the most likely to encounter the issue given their uncached circumstances. Web-facing pods make several orders of magnitude fewer cold-start fetches (essentially only on pod startup after a deployment, with a lock to deduplicate attempts; once started, a pod has stale data as a fallback).

Spurred by a small, but still surprising, number of periodic job failures over the weekend associated with fetch timeouts in EtcdConfig, I spent some time earlier today thinking about what might be driving the background rate of timeouts (i.e., inclusive of those that are not fatal).

There were no uncaught exceptions with "Timeout was reached" or "resolve host name" as the error, which suggests they always retried and succeeded, or had stale data available.

There were also no uncaught exceptions from mw-cron or mw-script under Logstash mediawiki-errors, which suggests no periodic jobs failed? That would contradict what you found. Something is funky with the logs. And indeed, looking at Logstash mw-cron directly I see the warnings and exceptions there now.

  • Warning: EtcdConfig failed to fetch data: (curl error: 28) Timeout was reached
    • Count: 770
  • Warning: EtcdConfig failed to fetch data: (curl error: 6) Couldn't resolve host name
    • Count: 27
  • ConfigException: Failed to load configuration from etcd: (curl error: 28) Timeout was reached
    • Count: 35
  • ConfigException: Failed to load configuration from etcd: (curl error: 6) Couldn't resolve host name
    • Count: 3

It seems wrong that none of those exceptions were reported to the general MediaWiki logstash intake, yet other warnings and PSR logs from mw-cron do end up there. This was not the case pre-Kubernetes afaik. Also, it's strange that the numbers don't line up. In the general MediaWiki logstash intake we see 1,800 of these warnings from mw-cron, but here we see only 770.

Krinkle renamed this task from Massive increase in "EtcdConfig failed to fetch data: Timeout was reached" warnings since March 17th to Massive increase in "EtcdConfig failed to fetch data: Timeout was reached" warnings and errors since March 17th. Apr 8 2026, 2:37 PM

Change #1268573 merged by jenkins-bot:

[operations/deployment-charts@master] wikikube: Temporarily double coredns replicas (12)

https://gerrit.wikimedia.org/r/1268573

Thanks for the detailed analysis, @Krinkle.

It is interesting that the fetch failures are all from mw-cron and mw-script. [...]

I don't have a good answer to that (see edit below). Naively, I'd expect that the (numerous) fetch-failure Warnings emitted by PHP-FPM workloads to the mediawiki-k8s-errorlog "should" have followed the same path to mediawiki-errors as the CLI / maintenance workloads (i.e., surfaced via MWExceptionHandler::handleError wired into set_error_handler).

Either I'm fundamentally misunderstanding something about trigger_error handling under PHP-FPM, or we've done something to otherwise suppress these during ingestion.

What's also interesting is that, spot checking the mw-cron and mw-script Warnings over the last 7 days, these are all via the LBFactory::autoReconfigure callback rather than anything that has occurred at startup. Meaning, they're happening "at runtime" after initialization has already succeeded. Which brings me to ...

There were also no uncaught exceptions from mw-cron or mw-script under Logstash mediawiki-errors, which suggests no periodic jobs failed? [...]

Indeed, as you noted, we can only see these in the raw stderr for mw-cron (or mw-script), rather than seeing them in mediawiki-errors.

Previously, I've wondered whether these are happening so early in wmf-config/CommonSettings.php that it's questionable whether those errors could end up in logstash at all (e.g., comparing the positions of the first use of the EtcdConfig instance returned by wmfSetupEtcd and where the monolog channels are configured).

However, that would run counter to the observation that we do see the ConfigException case for PHP-FPM workloads in mediawiki-errors logstash.

Edit: Notably, these appear to be from the PHP default exception handler, which makes sense given that we've not reached the point in Setup.php where MWExceptionHandler::installHandler would have been called. While it's initially puzzling how these might make it into logstash, you can see that these are actually shipped by rsyslog by way of php-wmerrors. That explains how they get there in the absence of the subsequent logging configuration (and why we would not see this in CLI workloads, where the error_script_file that actually tees the payload to rsyslog is not configured, nor would it ever be executed under that SAPI). It's also consistent with the fact that we never see the fetch-failure Warnings in mediawiki-errors logstash in the PHP-FPM case - if they're happening at all, it's at a point where we've not yet called set_error_handler (and indeed wmerrors does not care about E_USER_WARNING).

Curiously, those are all the lost-lock case - i.e., the one you've described previously where there are requests that fail by way of exhausting the timeout on the WaitConditionLoop, while one worker is (presumably) experiencing a slow fetch, which of course would only affect PHP-FPM workloads (where the notion of multiple processes is relevant). I'm not quite sure what to make of that.


So, what was up with DNS?

We made some additional changes (https://gerrit.wikimedia.org/r/1268573) today to double the number of coredns replicas, in order to assess whether this might have an effect on the lingering rate of timeouts. In short, it has not (the rate is unchanged). While the story is undoubtedly complicated, it does suggest that capacity alone is not (at least as of now) the primary contributor.

That brings us back to the question of what effect https://gerrit.wikimedia.org/r/1268569 actually had - i.e.,

While it would at least seem that coredns was somehow implicated in this, it's very hard to precisely identify why this change appears to have improved the situation - e.g., decoupling the effect of the restart vs. the benefits of having more appropriate resource requests set (i.e., reducing the likelihood of resource contention with collocated workloads).

Many thanks to @JMeybohm, who made a number of improvements to the Kubernetes DNS grafana dashboard and introduced a logstash dashboard, which reveals not only a high rate of upstream resolution timeouts during the period of badness, but also that those were limited to a single coredns pod. Indeed, when that pod was terminated, the problem went away.

I don't see anything obviously wrong with the specific host (wikikube-worker1067.eqiad.wmnet) that pod was scheduled on, nor do I have a good working theory as to why one pod might be in a bad state or to disprove that other pods were misbehaving as well (albeit, silently).

Many thanks to @JMeybohm, who made a number of improvements to the Kubernetes DNS grafana dashboard and introduced a logstash dashboard, which reveals not only a high rate of upstream resolution timeouts during the period of badness, but also that those were limited to a single coredns pod. Indeed, when that pod was terminated, the problem went away.

I don't see anything obviously wrong with the specific host (wikikube-worker1067.eqiad.wmnet) that pod was scheduled on, nor do I have a good working theory as to why one pod might be in a bad state or to disprove that other pods were misbehaving as well (albeit, silently).

Around that time, the host was (and partly still is) dropping incoming packets at a rather high rate (probably due to increased throughput with spikes of 500 Mb/s and more). It's likely that this caused the timeouts for responses from upstream DNS. There is also a pretty steep increase in conntrack entries that immediately drops to normal after the coredns pod was terminated on wikikube-worker1067 (see: https://grafana.wikimedia.org/goto/cfikqaqby3p4wc?orgId=1), which I can't really explain.

Ah, thanks for highlighting that @JMeybohm!

I'd not noticed the in-drops previously. Those definitely correlate with when coredns was scheduled there, and appear to hover in the 0 - 50 mpps range (so a couple of lost packets per minute), though peaking at up to ~ 5x that. I do wonder to what degree that's a contributing cause vs. more an effect of a slowdown in software (e.g., something that slows processing of inbound datagrams).

That's also an interesting observation about the conntrack entries - indeed, the same can be seen on other hosts (example) where coredns was scheduled at the time. No doubt these are rather "popular" pods (both for workloads that would likely "reuse" the same entries and for those where the source port would constantly churn), but the effect is still quite impressive.

Next steps

As much as I would like to see it, I suspect we're unlikely to get to the bottom of the specific issue that struck this time around.

In which case, I think it makes sense to think about what else we do next. Thoughts:

  1. Cleanup - We should decide whether to revert https://gerrit.wikimedia.org/r/1268573. On the one hand, it seems to have no effect on the lingering rate of timeouts. On the other, I'm not opposed to over-provisioning such a critical cluster-wide service.
  2. Observability - This went on for weeks, and we didn't really notice until periodic jobs started failing more frequently.
    • One aspect of that is timeouts are happening early enough that the associated warnings will never make it into mediawiki-errors logstash unless they result in an unhandled ConfigException and instead end up in the errorlog (and that's only in the PHP-FPM case; for CLI, we only have stderr, unless they happen during LBFactory::autoReconfigure).
    • I'm wondering whether we should have an alert for this, though I'm not sure off hand what the standard pattern is for alerting on a logs-based signal.
  3. DNS queries - As we've talked about at various points, we could probably do something smarter with our DNS queries on certain critical paths like this one (e.g., dot suffixing). There are likely some tricky aspects to doing so on this path, but I wonder whether it makes sense to investigate anyway.

Both #2 and #3 would be follow-on tasks.
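To illustrate why dot suffixing matters for item 3, here is a minimal sketch of how a stub resolver conventionally expands a name using the `search` and `ndots` options from resolv.conf. The function name, the hostname, and the search list in the usage note are hypothetical; the actual values in the MediaWiki pods are not stated in this task.

```python
def candidate_queries(name, search, ndots):
    """Return the names a stub resolver would try, in order.

    Models the conventional resolv.conf rules: a trailing dot marks the name
    as fully qualified, so it is tried as-is and nothing else. Otherwise,
    names with fewer than `ndots` dots go through the search list first and
    are tried as-is last; names with at least `ndots` dots are tried as-is
    first.
    """
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]
```

With a Kubernetes-style search list such as `["ns.svc.cluster.local", "svc.cluster.local", "cluster.local"]` and `ndots` of 5, the name `conf.example.org` yields three search-list candidates before the name itself, while `conf.example.org.` yields exactly one query. Each extra candidate is a query that coredns has to answer (typically with NXDOMAIN) before the useful one is sent.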

  1. Cleanup - We should decide whether to revert https://gerrit.wikimedia.org/r/1268573. On the one hand, it seems to have no effect on the lingering rate of timeouts. On the other, I'm not opposed to over-provisioning such a critical cluster-wide service.

I'd rather err on the side of over-provisioning than under-provisioning. There's a fairly good chance that having more pods would also lessen both the amount of traffic and the size of the conntrack table per host, which could help with the packet loss.

  2. Observability - This went on for weeks, and we didn't really notice until periodic jobs started failing more frequently.
    • One aspect of that is timeouts are happening early enough that the associated warnings will never make it into mediawiki-errors logstash unless they result in an unhandled ConfigException and instead end up in the errorlog (and that's only in the PHP-FPM case; for CLI, we only have stderr, unless they happen during LBFactory::autoReconfigure).
    • I'm wondering whether we should have an alert for this, though I'm not sure off hand what the standard pattern is for alerting on a logs-based signal.

I'm not finding definitive documentation for it, even though I think we already do it for some things like MediaWiki error channels. We should ask o11y about it (not tagging them on purpose, we can tag them on the followup task itself).

  3. DNS queries - As we've talked about at various points, we could probably do something smarter with our DNS queries on certain critical paths like this one (e.g., dot suffixing). There are likely some tricky aspects to doing so on this path, but I wonder whether it makes sense to investigate anyway.

Agreed we should at least investigate the impact of bounding query recursion with dot suffixing where we can. Should we also think about alerting on things like sustained health check failures (coredns_proxy_healthcheck_failures_total) on a pod so we can reset it?

  1. Cleanup - We should decide whether to revert https://gerrit.wikimedia.org/r/1268573. On the one hand, it seems to have no effect on the lingering rate of timeouts. On the other, I'm not opposed to over-provisioning such a critical cluster-wide service.

I'd rather err on the side of over than under-provisioning. There's a fairly good chance that having more pods would also lessen both the amount of traffic and the size of the conntrack per host, which could help with the packet loss

I agree. This will also halve the blast radius (if only a single replica is affected) since we're serving 5k rather than 10k rps per replica.
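Working through the numbers in the comment above: the 5k and 10k rps per-replica figures are from the thread, while the replica counts (6 before, 12 after) are inferred from the patch title "double coredns replicas (12)", and an even spread of queries across replicas is an assumption. The function names are hypothetical.

```python
def per_replica_rps(total_rps, replicas):
    """Queries per second handled by each replica, assuming an even spread."""
    return total_rps / replicas


def affected_share(replicas, bad_replicas=1):
    """Fraction of queries that land on a misbehaving replica."""
    return bad_replicas / replicas


TOTAL_RPS = 60_000  # implied by 12 replicas at 5k rps each

before = (per_replica_rps(TOTAL_RPS, 6), affected_share(6))
after = (per_replica_rps(TOTAL_RPS, 12), affected_share(12))
```

Under these assumptions, a single bad pod goes from seeing one in six queries to one in twelve.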

  3. DNS queries - As we've talked about at various points, we could probably do something smarter with our DNS queries on certain critical paths like this one (e.g., dot suffixing). There are likely some tricky aspects to doing so on this path, but I wonder whether it makes sense to investigate anyway.

Agreed we should at least investigate the impact of bounding query recursion with dot suffixing where we can. Should we also think about alerting on things like sustained health check failures (coredns_proxy_healthcheck_failures_total) on a pod so we can reset it?

Yes, let's open a subtask to pick the low-hanging fruit, at least in mw-config and the mesh config. After that we can decide whether it's worth going over all the other places.

I was trying to get a proper signal out of coredns_proxy_healthcheck_failures_total but it is very flaky in nature. Maybe something like https://grafana.wikimedia.org/goto/afio8mkkpfbb4b?orgId=1 over 5min?
While playing around with this I started wondering whether it makes sense to have these health checks at all, given we have only one upstream DNS configured. When all upstreams are failing, CoreDNS sends the query to a random upstream. So disabling health checks completely will have exactly the same effect with lower load (since no health checks are done). But that would mean we lose the signal. We could maybe alert on elevated p99 latency of forwarded requests, but that signal seems even less clear than the failing health checks.
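One way to get a steadier signal out of a flaky counter is to look at its total increase over a window rather than at individual samples. The sketch below is an illustration of that idea only, not the expression behind the Grafana link above; the function names, the sample layout, and the threshold are hypothetical.

```python
def increase(samples):
    """Total increase of a counter across consecutive samples.

    A drop between two samples is treated as a counter reset (for example a
    pod restart), in which case the new value is counted from zero.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def pods_over_threshold(samples_by_pod, window, threshold):
    """Pods whose counter grew by more than `threshold` over the last `window` intervals."""
    return sorted(
        pod
        for pod, samples in samples_by_pod.items()
        if increase(samples[-(window + 1):]) > threshold
    )
```

With one sample per minute, `window=5` approximates the "over 5min" idea: a pod with an occasional failed health check stays below the threshold, while a pod failing continuously stands out.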

Change #1270056 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] wikikube: Remove comment on coredns replicas

https://gerrit.wikimedia.org/r/1270056

Change #1270056 merged by jenkins-bot:

[operations/deployment-charts@master] wikikube: Remove comment on coredns replicas

https://gerrit.wikimedia.org/r/1270056

Many thanks @Clement_Goubert and @JMeybohm. I've opened the two follow-up tasks and we can shift further discussion there.