
noc.wikimedia.org is slow and it times out sporadically
Closed, Duplicate · Public

Description

On T368755, we use https://noc.wikimedia.org to get the latest dblists, and to resolve Analytic Replicas.

We typically hit endpoints like:

https://noc.wikimedia.org/conf/dblists/open.dblist
https://noc.wikimedia.org/db.php?dc=eqiad&format=json
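
For reference, a minimal sketch of how we consume these (the --max-time and --retry values are illustrative, not necessarily what T368755 actually uses):

  # Fetch the open-wikis dblist and the eqiad DB config, with a client-side
  # timeout so a hung request fails fast instead of blocking the job.
  curl --fail --silent --max-time 10 --retry 2 \
    'https://noc.wikimedia.org/conf/dblists/open.dblist' -o open.dblist
  curl --fail --silent --max-time 10 --retry 2 \
    'https://noc.wikimedia.org/db.php?dc=eqiad&format=json' -o db-eqiad.json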

On or around 2024-11-14T15:00:00 UTC, we started seeing responses taking >= 30 seconds, as well as outright failures with:

upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure

We discussed this on IRC in #wikimedia-sre, and it initially looked like it was happening because of T364092: Upgrade core routers to Junos 23.4R2, but now that that work is done, I can still see timeouts and slow responses from clients running inside eqiad as well as from external clients.

Event Timeline

Example behavior, with 3 requests taking <= 1 sec, and the 4th one taking 30s:

mediawiki-config % curl 'https://noc.wikimedia.org/db.php?dc=eqiad&format=json' -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10888    0 10888    0     0  48516      0 --:--:-- --:--:-- --:--:-- 48391

√ mediawiki-config % curl 'https://noc.wikimedia.org/db.php?dc=eqiad&format=json' -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10888    0 10888    0     0  50140      0 --:--:-- --:--:-- --:--:-- 50407

√ mediawiki-config % curl 'https://noc.wikimedia.org/db.php?dc=eqiad&format=json' -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10888    0 10888    0     0  70945      0 --:--:-- --:--:-- --:--:-- 71163

√ mediawiki-config % curl 'https://noc.wikimedia.org/db.php?dc=eqiad&format=json' -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10888    0 10888    0     0    358      0 --:--:--  0:00:30 --:--:--  2591
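
The sporadic slow requests are easy to surface by looping with curl's timing output (a rough sketch; the iteration count and --max-time value are arbitrary):

  # Print total time and HTTP status per request; the occasional bad request
  # shows up as a ~30s line and/or a 503.
  for i in $(seq 1 10); do
    curl -s -o /dev/null --max-time 35 \
      -w '%{time_total}s %{http_code}\n' \
      'https://noc.wikimedia.org/db.php?dc=eqiad&format=json'
  done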

A couple of points of note:

Comparing istio-reported error rates for mw-misc in eqiad (https://grafana.wikimedia.org/goto/a4cH6UGHR?orgId=1) vs. codfw (https://grafana.wikimedia.org/goto/I4Ov6UMHg?orgId=1), we can see a marked increase in codfw-emitted errors after eqiad was depooled for edge traffic around 15:40 - i.e., when traffic using the "noc.wikimedia.org" host would have shifted. (You can also see another drop in traffic in eqiad around 17:00, which is when the discovery service was depooled.)

Interestingly, this seems to be codfw-specific behavior - e.g., we don't see this in eqiad at steady state, and indeed I can reproduce it only in codfw right now, using a query like T379968#10324204 pointed at mw-misc.svc.codfw.wmnet, which fails for maybe 1 in 10 requests.
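
(For completeness, this is the kind of request I mean - a sketch, where port 443 for the internal service and keeping the noc.wikimedia.org Host/SNI are assumptions on my part; --insecure may be needed depending on the certificate served internally:)

  # Send the same query straight at the codfw mw-misc service, bypassing the
  # edge, while keeping the public hostname for Host/SNI-based routing.
  curl -s -o /dev/null -w '%{time_total}s %{http_code}\n' \
    --connect-to noc.wikimedia.org:443:mw-misc.svc.codfw.wmnet:443 \
    'https://noc.wikimedia.org/db.php?dc=eqiad&format=json'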

Also interestingly, I don't see errors reported on the LOCAL_mw-misc cluster from the colocated envoy: https://grafana.wikimedia.org/goto/ojc4gQGHR?orgId=1

That suggests maybe some odd interaction with Ingress in codfw?

These errors also pop up in the Istio gateway logs (https://logstash.wikimedia.org/goto/528e857bf1c7f06f26cd830b897d99f4), but don't really give us much more information than can be seen from the dashboards linked above (e.g., it's all upstream_reset_before_response_started{connection_failure}).

Adding @JMeybohm in case Janis has any thoughts.

As of today, I cannot repro this behavior anymore.

I will let serviceops decide whether we should pursue a postmortem analysis beyond what @Scott_French shared above, or whether we should simply close this for now.

Thanks, @xcollazo.

Agreed, yeah - I can't seem to reproduce this either today.

However, I realized this morning that something didn't add up when looking at the Istio gateway logs yesterday. Specifically, mw-misc has only 2 replicas, so I would expect to see only 2 distinct upstream-host pod IPs in the logs at any given time, but it felt like there were more ...
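
A rough way to check, assuming the gateway access logs can be exported from Logstash as JSON lines with an upstream_host field (the file name and that field name are assumptions about the access-log format in use):

  # List the distinct upstream pod IPs the gateway reports having used as
  # backends during the window of interest (strips the :port suffix).
  jq -r '.upstream_host' istio-gateway-logs.json | cut -d: -f1 | sort -u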

Looking again today, let's pick a short time window far away from deployments, say 17:00 - 17:10 UTC on 11/14. During that 10-minute window, there are indeed exactly 2 pod IPs that consistently result in non-503 responses when used as backends:

  • 10.194.134.189
  • 10.194.186.105

However, there are also 3 pod IPs that consistently result in 503 responses with upstream_reset_before_response_started{connection_failure}:

  • 10.194.131.5
  • 10.194.157.16
  • 10.194.186.92

Spot-checking Calico and/or k8s event logs, these correspond to the pods mw-misc.codfw.main-8d9f549cb-tcmlx, mw-misc.codfw.main-8d9f549cb-dmdw8, and mw-misc.codfw.main-75795dcd8f-s6hwj, all of which were long gone by 17:00.
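
(As an aside: for currently-running pods the IP-to-pod mapping is directly visible with something like the below - the namespace name is a guess on my part - but for the long-gone pods above it had to be recovered from Calico / k8s event logs instead, since deleted pods no longer show up there.)

  # Show pod names alongside their IPs for the mw-misc deployment.
  kubectl get pods -n mw-misc -o wide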

So, it seems like the Istio control plane was somehow very slow to distribute changes in upstream backend membership to the proxy envoys, or the envoys' xDS subscriptions were otherwise borked.
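
If this happens again, the xDS sync state should be directly inspectable - a sketch, where the istio-system namespace and the gateway pod name are placeholders for whatever the actual deployment uses:

  # Compare what istiod has pushed vs. what each proxy has acknowledged;
  # STALE entries would point at proxies lagging behind the control plane.
  istioctl proxy-status

  # Dump the endpoints a given ingress gateway envoy currently holds for the
  # mw-misc backends.
  istioctl proxy-config endpoints <ingressgateway-pod>.istio-system | grep mw-misc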

This is about the limit of my understanding of how this works, but one last thing did catch my eye: a sharp drop in connected ADS endpoints as reported by istiod, starting a bit before 11:00 and only recovering to roughly the number of proxy instances by around 22:00: https://grafana.wikimedia.org/goto/889SUuMHR?orgId=1 (you can also see the tail end of a similar drop the day prior).
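
(For reference, I believe that panel is backed by istiod's pilot_xds gauge - that's an assumption about the dashboard - which can also be checked directly against istiod's monitoring port; deployment and namespace names below are the upstream defaults and may differ here:)

  # 15014 is the standard istiod monitoring port; pilot_xds is the number of
  # endpoints connected to this istiod over XDS.
  kubectl -n istio-system port-forward deploy/istiod 15014:15014 &
  sleep 2   # give the port-forward a moment to establish
  curl -s localhost:15014/metrics | grep '^pilot_xds'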

That would suggest the xDS subscriptions were somehow being disrupted and took some time to recover?

Looking at the SAL, the only operation that seems to plausibly correlate is the containerd migration work on wikikube-ctrl2001, but I don't have an obvious way to directly connect those dots.

Thanks for the very thorough analysis, @Scott_French!
I think I found the missing piece by looking at logs from istiod: https://logstash.wikimedia.org/goto/248249e30080b1e0af92f3b4cd15f546 - there we see authentication errors that align with the drop in XDS connections and with my reimaging of the control plane(s). The same can be seen a day earlier in eqiad (when I was reimaging control planes there).

I think what happens is that reimaging one control plane basically invalidates all k8s service account tokens that have been issued/signed by that control plane. This is because each control plane uses its own cert & key to sign service account tokens, and every control plane depends on having the certificates of all other control planes available in order to validate tokens signed by them. Reimaging one control plane overrides its cert & key for service account signing, which in turn drops its previous service account signing certificate from the validation chain (of all control planes). This system was implemented in T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen).
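
In kube-apiserver terms, roughly (a sketch of the mechanism; the flag names are the standard upstream ones, the file paths are made up for illustration):

  # Each control plane signs service-account tokens with its own key, but
  # must list the public keys of *all* control planes (the validation flag is
  # repeatable) in order to validate tokens signed by any of them. Reimaging
  # ctrl2001 regenerates its key pair, so tokens signed with the old key stop
  # validating everywhere. (The rest of the usual apiserver flags are omitted.)
  kube-apiserver \
    --service-account-signing-key-file=/etc/kubernetes/pki/sa-ctrl2001.key \
    --service-account-key-file=/etc/kubernetes/pki/sa-ctrl2001.pub \
    --service-account-key-file=/etc/kubernetes/pki/sa-ctrl2002.pub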

As the service account tokens are pretty short-lived, they got refreshed after some time and things slowly started to work again.

jijiki triaged this task as Medium priority. Nov 18 2024, 10:31 AM
jijiki moved this task from Incoming 🐫 to Production Errors 🚜 on the serviceops board.

Thank you very much, @JMeybohm! That explains what we saw quite well.