
noc.wikimedia.org is slow and it times out sporadically
Closed, Duplicate · Public

Description

On T368755, we use https://noc.wikimedia.org to get the latest dblists, and to resolve Analytic Replicas.

We typically hit endpoints like:

https://noc.wikimedia.org/conf/dblists/open.dblist
https://noc.wikimedia.org/db.php?dc=eqiad&format=json
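
For reference, a minimal sketch of how we consume these (the --max-time and --retry values are illustrative, not necessarily what T368755 actually uses):

  # Fetch the open-wikis dblist and the eqiad DB config, with a client-side
  # timeout so a hung request fails fast instead of blocking the job.
  curl --fail --silent --max-time 10 --retry 2 \
    'https://noc.wikimedia.org/conf/dblists/open.dblist' -o open.dblist
  curl --fail --silent --max-time 10 --retry 2 \
    'https://noc.wikimedia.org/db.php?dc=eqiad&format=json' -o db-eqiad.json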

On or around 2024-11-14T15:00:00 UTC, we started seeing responses taking >= 30 seconds, as well as outright failures with:

upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure

We discussed this on IRC in #wikimedia-sre, and it initially looked like it was happening because of T364092: Upgrade core routers to Junos 23.4R2, but now that that work is done, I can still see timeouts and slow responses from clients running inside eqiad as well as from external clients.

Event Timeline

Example behavior, with 3 requests taking <= 1 sec, and the 4th one taking 30s:

mediawiki-config % curl 'https://noc.wikimedia.org/db.php?dc=eqiad&format=json' -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10888    0 10888    0     0  48516      0 --:--:-- --:--:-- --:--:-- 48391

√ mediawiki-config % curl 'https://noc.wikimedia.org/db.php?dc=eqiad&format=json' -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10888    0 10888    0     0  50140      0 --:--:-- --:--:-- --:--:-- 50407

√ mediawiki-config % curl 'https://noc.wikimedia.org/db.php?dc=eqiad&format=json' -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10888    0 10888    0     0  70945      0 --:--:-- --:--:-- --:--:-- 71163

√ mediawiki-config % curl 'https://noc.wikimedia.org/db.php?dc=eqiad&format=json' -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10888    0 10888    0     0    358      0 --:--:--  0:00:30 --:--:--  2591
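
The sporadic slow requests are easy to surface by looping with curl's timing output (a rough sketch; the iteration count and --max-time value are arbitrary):

  # Print total time and HTTP status per request; the occasional bad request
  # shows up as a ~30s line and/or a 503.
  for i in $(seq 1 10); do
    curl -s -o /dev/null --max-time 35 \
      -w '%{time_total}s %{http_code}\n' \
      'https://noc.wikimedia.org/db.php?dc=eqiad&format=json'
  done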

A couple of points of note:

Comparing istio-reported error rates for mw-misc in eqiad (https://grafana.wikimedia.org/goto/a4cH6UGHR?orgId=1) vs. codfw (https://grafana.wikimedia.org/goto/I4Ov6UMHg?orgId=1), we can see a marked increase in codfw-emitted errors after eqiad was depooled for edge traffic around 15:40 - i.e., when traffic using the "noc.wikimedia.org" host would have shifted. (You can also see another drop in traffic in eqiad around 17:00, which is when the discovery service was depooled.)

Interestingly, this seems to be codfw-specific behavior - e.g., we don't see this in eqiad at steady state, and indeed I can reproduce it only in codfw right now, using a query like T379968#10324204 pointed at mw-misc.svc.codfw.wmnet, which fails for maybe 1 in 10 requests.
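
(For completeness, this is the kind of request I mean - a sketch, where port 443 for the internal service and keeping the noc.wikimedia.org Host/SNI are assumptions on my part; --insecure may be needed depending on the certificate served internally:)

  # Send the same query straight at the codfw mw-misc service, bypassing the
  # edge, while keeping the public hostname for Host/SNI-based routing.
  curl -s -o /dev/null -w '%{time_total}s %{http_code}\n' \
    --connect-to noc.wikimedia.org:443:mw-misc.svc.codfw.wmnet:443 \
    'https://noc.wikimedia.org/db.php?dc=eqiad&format=json'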

Also interestingly, I don't see errors reported on the LOCAL_mw-misc cluster from the colocated envoy: https://grafana.wikimedia.org/goto/ojc4gQGHR?orgId=1

That suggests maybe some odd interaction with Ingress in codfw?

These errors also pop up in the Istio gateway logs (https://logstash.wikimedia.org/goto/528e857bf1c7f06f26cd830b897d99f4), but don't really give us much more information than can be seen from the dashboards linked above (e.g., it's all upstream_reset_before_response_started{connection_failure}).

Adding @JMeybohm in case Janis has any thoughts.

As of today, I cannot repro this behavior anymore.

I will let serviceops decide whether we should pursue a postmortem analysis beyond what @Scott_French shared above, or whether we should simply close this for now.

Thanks, @xcollazo.

Agreed, yeah - I can't seem to reproduce this either today.

However, I realized this morning that something didn't add up when looking at the Istio gateway logs yesterday. Specifically, mw-misc has only 2 replicas, so I would expect to see only 2 distinct upstream-host pod IPs in the logs at any given time, but it felt like there were more ...
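
A rough way to check, assuming the gateway access logs can be exported from Logstash as JSON lines with an upstream_host field (the file name and that field name are assumptions about the access-log format in use):

  # List the distinct upstream pod IPs the gateway reports having used as
  # backends during the window of interest (strips the :port suffix).
  jq -r '.upstream_host' istio-gateway-logs.json | cut -d: -f1 | sort -u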

Looking again today, let's pick a short time window far away from deployments, say 17:00 - 17:10 UTC on 11/14. During that 10-minute window, there are indeed exactly 2 pod IPs that consistently result in non-503 responses when used as backends:

  • 10.194.134.189
  • 10.194.186.105

However, there are also 3 pod IPs that consistently result in 503 responses with upstream_reset_before_response_started{connection_failure}:

  • 10.194.131.5
  • 10.194.157.16
  • 10.194.186.92

Spot-checking Calico and/or k8s event logs, these correspond to the pods mw-misc.codfw.main-8d9f549cb-tcmlx, mw-misc.codfw.main-8d9f549cb-dmdw8, and mw-misc.codfw.main-75795dcd8f-s6hwj, all of which were long gone by 17:00.
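
(As an aside: for currently-running pods the IP-to-pod mapping is directly visible with something like the below - the namespace name is a guess on my part - but for the long-gone pods above it had to be recovered from Calico / k8s event logs instead, since deleted pods no longer show up there.)

  # Show pod names alongside their IPs for the mw-misc deployment.
  kubectl get pods -n mw-misc -o wide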

So, it seems like the Istio control plane was somehow very slow to distribute changes in upstream backend membership to the proxy envoys, or the envoys' xDS subscriptions were otherwise borked.
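
If this happens again, the xDS sync state should be directly inspectable - a sketch, where the istio-system namespace and the gateway pod name are placeholders for whatever the actual deployment uses:

  # Compare what istiod has pushed vs. what each proxy has acknowledged;
  # STALE entries would point at proxies lagging behind the control plane.
  istioctl proxy-status

  # Dump the endpoints a given ingress gateway envoy currently holds for the
  # mw-misc backends.
  istioctl proxy-config endpoints <ingressgateway-pod>.istio-system | grep mw-misc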

This is about the limit of my understanding of how this works, but one last thing did catch my eye: a sharp drop in connected ADS endpoints as reported by istiod, starting a bit before 11:00 and only recovering to roughly the number of proxy instances by around 22:00: https://grafana.wikimedia.org/goto/889SUuMHR?orgId=1 (you can also see the tail end of a similar drop the day prior).
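
(For reference, I believe that panel is backed by istiod's pilot_xds gauge - that's an assumption about the dashboard - which can also be checked directly against istiod's monitoring port; deployment and namespace names below are the upstream defaults and may differ here:)

  # 15014 is the standard istiod monitoring port; pilot_xds is the number of
  # endpoints connected to this istiod over XDS.
  kubectl -n istio-system port-forward deploy/istiod 15014:15014 &
  sleep 2   # give the port-forward a moment to establish
  curl -s localhost:15014/metrics | grep '^pilot_xds'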

That would suggest the xDS subscriptions were somehow being disrupted and took some time to recover?

Looking at the SAL, the only operation that seems to plausibly correlate is the containerd migration work on wikikube-ctrl2001, but I don't have an obvious way to directly connect those dots.

Thanks for the very thorough analysis, @Scott_French!
I think I found the missing piece by looking at logs from istiod: https://logstash.wikimedia.org/goto/248249e30080b1e0af92f3b4cd15f546 - there we see authentication errors that align with the drop in XDS connections and with my reimaging of the control plane(s). The same can be seen a day earlier in eqiad (when I was reimaging control planes there).

I think what happens is that reimaging one control plane basically invalidates all k8s service account tokens that have been issued/signed by that control plane. This is because each control plane uses its own cert & key to sign service account tokens, and every control plane depends on having the certificates of all other control planes available in order to validate tokens signed by them. Reimaging one control plane overrides its cert & key for service account signing, which in turn drops its previous service account signing certificate from the validation chain (of all control planes). This system was implemented in T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen).
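
In kube-apiserver terms, roughly (a sketch of the mechanism; the flag names are the standard upstream ones, the file paths are made up for illustration):

  # Each control plane signs service-account tokens with its own key, but
  # must list the public keys of *all* control planes (the validation flag is
  # repeatable) in order to validate tokens signed by any of them. Reimaging
  # ctrl2001 regenerates its key pair, so tokens signed with the old key stop
  # validating everywhere. (The rest of the usual apiserver flags are omitted.)
  kube-apiserver \
    --service-account-signing-key-file=/etc/kubernetes/pki/sa-ctrl2001.key \
    --service-account-key-file=/etc/kubernetes/pki/sa-ctrl2001.pub \
    --service-account-key-file=/etc/kubernetes/pki/sa-ctrl2002.pub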

As the service account tokens are pretty short-lived, they got refreshed after some time and things slowly started to work again.

jijiki triaged this task as Medium priority. Nov 18 2024, 10:31 AM
jijiki moved this task from Incoming 🐫 to Production Errors 🚜 on the serviceops board.

Thank you very much, @JMeybohm! That explains what we saw quite well.