Page MenuHomePhabricator

High (relatively) number of memcached errors in eqiad
Open, LowPublic

Assigned To
None
Authored By
jijiki
Mar 16 2026, 3:28 PM
Referenced Files
F78451196: image.png
Tue, Apr 28, 8:34 AM
F75242738: Screenshot From 2026-04-07 16-20-30.png
Apr 7 2026, 2:21 PM
F75145604: image.png
Apr 6 2026, 11:05 AM
F75144607: image.png
Apr 6 2026, 11:05 AM
F74718424: crazy_nine_seconds.pcap
Apr 1 2026, 9:29 AM
F73768915: image.png
Mar 26 2026, 7:24 PM
F73744770: image.png
Mar 26 2026, 10:30 AM
F73744638: image.png
Mar 26 2026, 10:30 AM

Description

Since 16th Dec ~13:50 we can mw-mcrouter in eqiad has been logging errors connecting to memcached servers

eqiad:
https://logstash.wikimedia.org/goto/cdcc1d4a5bf1e57824c455ddc81bdd6e

image.png (1×3 px, 464 KB)

codfw:
https://logstash.wikimedia.org/goto/71141b2137a4edae3156b364e01438db

image.png (1×3 px, 435 KB)

It is quite odd that codfw is handling a lot

From Mcrouters PoV:
eqiad: https://grafana.wikimedia.org/goto/efg76kg9965tsa?orgId=1

image.png (1×3 px, 531 KB)

codfw: https://grafana.wikimedia.org/goto/afg76m6hulblsd?orgId=1

image.png (1×3 px, 402 KB)

It is also visible here that there are noticeable more TKOs recorded by mcrouter

image.png (994×2 px, 260 KB)

The impact is that due to those connection (?) errors, eqiad is switching to the gutter pool. It is notable that eqiad is much less traffic than codfw, while soon we will be switching to eqiad as part of the March DC switchover

Note: I created a grafana dashboard in an effort to combine various metrics that could help: https://grafana.wikimedia.org/goto/efggyhi41r18gf?orgId=1

Note2: Some errors are due to T374366: Race condition in iptables rules during puppet runs on k8s nodes, while others are due to the occasional ooms

Related Objects

Mentioned In
T422489: rdbms errors in eqiad
Mentioned Here
T423863: decommission wikikube-worker[1002-1005,1011-1012,1019-1020,1029-1031,1058-1063,1082-1083,1088-1092,1096-1112,1166-1168].eqiad.wmnet
T422489: rdbms errors in eqiad
T421504: mcrouter live config reload without pod restarts
P90140 RTTs of SYN/ACK exchanges from mcrouter on wikikube-worker1273 (10.67.160.184) to codfw memcached
T374366: Race condition in iptables rules during puppet runs on k8s nodes
T280718: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practice
T320769: Don't show over limit checks in the CheckUserLog or remove all over limit entries from enwiki
T361173: Add schema migration config for cu_useragent table
T373711: Add support for Scribunto, JavaScript, CSS, JSON and Vue to CodeMirror 6
T379086: Remove wgGlobalBlockingEnableAutoblocks
T379087: Remove wgGlobalBlockingHideAutoblocksInGlobalBlocksAPIResponse
T398500: [timebox: 3 days] Impact module: Support larger wgGEUserImpactMaxEdits
T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch)
T408842: Surface nominated collections in Search view
T410512: Add support for maintaining legacy non-emergency flow during transition to v2
T411268: Suggested Investigations: Show link to checkuser log if target has been checked before
T411779: Handle invalid featured collection name
T412455: [Legal Footer] Turn on wmgUseFooterLegalContactLink config for English and German
T412710: Enable protection indicators for zhwiki
T412713: Set $wgAutoConfirmCount to 10 for sv.wikipedia
T412791: jawiki: Add abusefilter-access-protected-vars to temporary-account-viewer
T412848: Disable spellcheck in non-wikitext
T412884: Move linter icons to the left of line numbers
T412886: Enable active line feature by default in non-wikitext
T413084: Parsoid Read Views to deploy ~2025-12-18 (itwiki, nlwiki)
T412785: Enable QoS for upload video files

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Scott_French moved this task from Inbox to Needs Info / Blocked on the ServiceOps new board.
Scott_French subscribed.

Moving this to Needs Info / Blocked while awaiting input from netops.

@jijiki - Please adjust the priority if needed. Thanks!

I tried to isolate a connection between a single mcrouter pod in eqiad (10.67.189.81 on wikikube-worker1270.eqiad.wmnet) and mc2039 (10.192.0.22) in codfw:

  • The only weird thing that I see in metrics is STAT ssl_handshake_errors 491652, but the host has been up for ~260 days and I am not sure if anything weird happened in the middle. We don't track the metric afaics, it may be something worth to add for port 11214
  • tcpdump shows relatively frequent RSTs and related SYNs to create new TCP connections, I am wondering if the this and the above are related.
  • The memcached logs show errors like Mar 18 10:19:37 mc2039 memcached[553977]: accept4(): Resource temporarily unavailable. We use -c 25000 and I think we have roughly the same limit for ulimits. Should it be higher?
  • The last 100 days of the memcached metrics for the shard don't show anything changing afaics, but the above error is very telling.
sudo lsof -p 553977 | wc -l
4842

So in theory we are not saturated by TCP connections..

@jijiki thanks for the task. In terms of the network in general nothing changed the week of Dec 15th last. We had done some work in Nov/Dec but it was all done by then, and we we're firmly in change freeze mode that week so definitely nothing changed.

Looking at the above destination mc* hosts I don't see an obvious pattern in terms of location:

SiteRackHost(s)
eqiadC4mc1049
eqiadA7mc-gp1004
eqiadC2mc-gp1005
codfwA2mc2038, mc2039
codfwA7mc2040, mc2041
codfwB2mc2042, mc2043
codfwB4mc2044, mc2045
codfwB7mc2046
codfwC4mc2047, mc2048
codfwC7mc2049, mc2050
codfwD4mc2051, mc2052
codfwD7mc2053, mc2054
codfwA4mc2055

Codfw is most consistent in terms of the network, with almost all racks being connected with Juniper QFX devices in the newer spine/leaf model. So there shouldn't be much difference between any of the above codfw racks. Eqiad has more variance (mixture of Juniper and Nokia, plus rows a/b still being on the old setup). But there is no pattern jumping out either way. Might be worth seeing if there is any pattern in terms of rack/location on the mcrouter side.

The latency across the WAN from eqiad to codfw seems to be steady before/after Dec 17th, so that doesn't look to be a factor:

https://w.wiki/Jx2Q

Screenshot from 2026-03-18 12-16-31.png (966×1 px, 361 KB)

I know you said it's been there forever, but I do wonder about the memcached[553977]: accept4(): Resource temporarily unavailable messages getting logged. That's an error accepting an incoming client connection, one assumes the other side of the conversation experiences a timeout when that occurs. Could be unrelated of course.

For next steps I'd say if we can identify an endpoint that regularly gets these timeouts, and the destination it is trying to talk to, we can run some tests to try and determine if there is any packet loss between them. Grabbing some pcaps from both sides might help too, if we can have a tcpdump running both sides at a time one of these errors are logged it should tell us a lot about what actually happened on the wire.

@cmooney Check my update above, there are a couple of IPs that we could use to verify if anything is weird in their path. It is difficult from the mcrouter logs to understand what IP is related to a timeout, but I do see RSTs when trying to connect from eqiad to codfw via TLS. Memcached also shows a high number of failed TLS transactions, we don't have it as metric so it may be something unrelated, but it smells like something problematic.

I noticed the accept() errors as well, and even if they have been there forever, it is a sign that memcached is under stress so anything added on top could cause troubles. We didn't know what started on Dec 17th but it could be related.

@cmooney Check my update above, there are a couple of IPs that we could use to verify if anything is weird in their path. It is difficult from the mcrouter logs to understand what IP is related to a timeout, but I do see RSTs when trying to connect from eqiad to codfw via TLS. Memcached also shows a high number of failed TLS transactions, we don't have it as metric so it may be something unrelated, but it smells like something problematic.

I noticed the accept() errors as well, and even if they have been there forever, it is a sign that memcached is under stress so anything added on top could cause troubles. We didn't know what started on Dec 17th but it could be related.

If what is written in this thread (which is just a wee bit old), it is probably nothing we should look into further. However, stranger things have happened :)

@elukey thank you for getting nerdsnipped!

@jijiki I checked the thread, and I think it is a separate issue. Yesterday I found occurrences of accept4(): Resource temporarily unavailable, meanwhile the thread mentions people stracing memcached and seeing read() EAGAIN. The former means that memcached wasn't able to create a new socket via accept, the latter that no data was read (that is expected in async code). This is why I am worried - accept() not being able to work is usually a sign of distress of the network stack, there is probably a bottleneck somewhere.

In general this worries me a little:

elukey@mc2043:~$ echo stats | nc localhost 11211 | grep ssl
STAT ssl_handshake_errors 514238
elukey@mc2043:~$ echo stats | nc localhost 11211 | grep ssl
STAT ssl_handshake_errors 514239
[.. some mins..]
elukey@mc2043:~$ echo stats | nc -q 1 localhost 11211 | grep ssl
STAT ssl_handshake_errors 514246
elukey@mc2043:~$ echo stats | nc -q 1 localhost 11211 | grep ssl
STAT ssl_handshake_errors 514246

The metric is not listed in the exporter, but it would be great to have it. My understanding is the following:

  • mcrouter in eqiad tries to contact a mc2xxx shard in codfw via TLS, using the memcached native TLS implementation.
  • memcached in codfw is already showing distress in accepting new TCP conns/sockets, and it is also returning RSTs (see above). Moreover the ssl_handshake_errors grows steadily.

Let me know if what I wrote is incorrect or doesn't make sense. It may be not the exact issue that triggered the errors, but ideally it shouldn't happen. My worry is that the baseline of our shards is already a little in distress, and anything on top causes isuses.

After some back and forth with Cathal on IRC, we didn't find anything that could point to accept() being problematic.

This bit of code may be relevant: https://github.com/memcached/memcached/blob/1.6.18/memcached.c#L3000 - EGAIN is tested after emitting the error, so @jijiki this may be again something benign, probably wrong track.

ssl_handshake_errors is also not a lot daily, and I checked TLS certs etc.. so maybe some weird client doing it? Not sure which one.

In general this worries me a little:

elukey@mc2043:~$ echo stats | nc localhost 11211 | grep ssl
STAT ssl_handshake_errors 514238
elukey@mc2043:~$ echo stats | nc localhost 11211 | grep ssl
STAT ssl_handshake_errors 514239
[.. some mins..]
elukey@mc2043:~$ echo stats | nc -q 1 localhost 11211 | grep ssl
STAT ssl_handshake_errors 514246
elukey@mc2043:~$ echo stats | nc -q 1 localhost 11211 | grep ssl
STAT ssl_handshake_errors 514246

The metric is not listed in the exporter, but it would be great to have it. My understanding is the following:

  • mcrouter in eqiad tries to contact a mc2xxx shard in codfw via TLS, using the memcached native TLS implementation.
  • memcached in codfw is already showing distress in accepting new TCP conns/sockets, and it is also returning RSTs (see above). Moreover the ssl_handshake_errors grows steadily.

Let me know if what I wrote is incorrect or doesn't make sense. It may be not the exact issue that triggered the errors, but ideally it shouldn't happen. My worry is that the baseline of our shards is already a little in distress, and anything on top causes isuses.

I find the handshake errors alarming as well, and worth digging further. What I am not convinced of is that this is related to resources/traffic, but alas, I could be wrong.

mc1038:~$ echo "stats" | nc 10.64.0.125 11211 | grep ssl
STAT ssl_handshake_errors 1661

mc1038:~$ uptime
 10:38:32 up 22:30,  1 user,  load average: 0.41, 0.55, 0.65

I thought the same but if you count 1661 / 22 you get ~75 errors for each hour, that is not ideal but not even a ton. Ideally it shouldn't be there, but I do see some TCP RST sent periodically by the memcached shards to the mcrouter pods (via tcpdump) so that my correlate.

I also tried to compare irate() for total traffic and tko/errors, it seems that we are having trouble with ~1% of the traffic or a little more. It is not great but also not incredibly alarming, and it puts things into perspective: maybe the change was related to specific wikis etc..

The issue started around 2025-12-16T13:40/14 from the mcrouter's perspective, and it is easy to spot:

  • the same number of error and tko (so I guess they both track the same)
  • a lower amount of connect timeouts that follow the same pattern (so they are correlated in some way).

I see some deployments around that time, we could try to follow up: https://sal.toolforge.org/production?p=2&q=&d=2025-12-16

Moreover, the issue is only present for traffic that goes from mcrouter in eqiad to memcached in codfw, hence using TLS. Assuming that the errors happen to every mcrouter pod trying to contact a codfw shard, we could try to concentrate on a single set of IPS:

mcrouter pod in eqiad (10.67.189.81 on wikikube-worker1270.eqiad.wmnet) and mc2039 (10.192.0.22) in codfw.

I already tried to tcpdump those and I see periodical RST sent by mcrouter to memcached, but it seems always happening after FIN-ACK and TLS encrypted alert. So IIUC mcrouter closes its side of the TLS session earlier, and the kernel (pod side) sent a RST to memcached. This may be what contributes to the ssl_handshake_failures, so likely another red herring.

I think we should concentrate on what changed, likely a deployment?

@cmooney one thing worth to look is if we added QOS or similar changes over the network in response to the traffic attacks that happened in December. I found T412785 but the timeline doesn't match 100% to the timings provided by the metrics (started around 16th Dec ~13:50 more or less). mcrouter is configured to SET certain keys to both eqiad and codfw, waiting for both before returning success. We also have a 250ms timeout of cross-region calls like mcrouter in eqiad reaching codfw, so anything that slows down may start some timeout errors. There is nothing that lines up in puppet or sal that I can see, or even k8s deployments.

I tried to tcpdump the traffic between two IPs (eqiad -> codfw, see above) and there seem to be several occasions in which after a FIN-ACK, mcrouter (eqiad) respond with a RST to memcached in codfw. I am wondering if the 250ms timeout influences this, because a TCP+TLS handshake may take some tens of ms only by themselves. Lemme know if you can think of anything :)

I checked netbox changes too https://netbox.wikimedia.org/extras/changelog/?page=251&sort=-time (though if anyone would like to take another look, that would be great).

The keys and the key sizes we set cross dc, are relatively small and low in volume, so we can cross that off out list too. Deployments during 16th-18th Dec:

Dec 16th 2025

December 17, 2025

December 18, 2025

The keys and the key sizes we set cross dc, are relatively small and low in volume, so we can cross that off out list too. Deployments during 16th-18th Dec:

Sure but what else affects writes from eqiad to codfw? Even if the keys are not big the 250ms max latency could be breached easily in case of some delays, ending up in possibly those RSTs that I captured in tcpdump. The issue happens with ~1% of the traffic that memcached sees, so it would make sense if affected the wan keys.

I've reworked a little https://grafana-rw.wikimedia.org/d/ltSHWhHIk/mw-mcrouter to use irate() everywhere and not a mix of increase and irate, to more easily compare panels. In eqiad we roughly handle 400k rps and we have a background of 1 or 2k requests / s tko-ed, that is around 0.5%. I think that this is a great example of use case where an SLO would be perfect to understand the urgency of this task, since we are talking about 99.5+ SLO target for availability. I am not suggesting to not work on this task, but its priority may need to vary since the impact of the problem is not big at the moment.

Some notes while debugging:

  • After the last series of eqiad wikikube-workers reboots the mcrouter error log changed. It moved from TCP connect failed: AsyncSocketException: connect timed out after 1000ms, type = Timed out to client handshakeErr: AsyncSocketException: SSL connect timed out after 362ms, type = Timed out (or more, even 900ms etc..).
    Screenshot From 2026-03-23 09-36-55.png (924×5 px, 257 KB)
  • Another thing that is noticeable is the drop in processing latency after the reboot:

Screenshot From 2026-03-23 09-57-15.png (2×4 px, 1 MB)

  • Very weirdly, it doesn't seem that the above affected the tkos registered, that keep being registered at the same rate (it varies, but it is consistent across days).
  • The memcached errors seen by MediaWiki didn't change much, but most of them are related to WANCache:global:rdbms-server-readonly:dbXXXX (https://logstash.wikimedia.org/goto/8569336195cc373338bc4ab96954e698). It may be due to the fact that this is a key that is very frequently used, so a red herring, but let's keep it in mind.
  • Mcrouter seems to be recycling a lot of connections in eqiad, that is consistent with the pattern of TCP RST that I noticed while tcpdumping.

The gutter pool in eqiad matches the December pattern, due to tkos: dashboard

I ran echo "lru_crawler metadump all" | nc localhost 11211 | tee all_keys.txt on mc-gp1004, some of the keys in there may be the ones related to the problem that we are investigating. All of them are WANCache prefixed, that corroborates out theory about the AllSyncRoute being problematic and causing some timeouts.

Maybe there is something that we run only in eqiad and that SETs WANCache keys that we haven't investigated so far?

Something interesting while checking mcrouter stats (the full list, not the ones exposed via the exporter):

"libmcrouter.mcrouter.4442.inactive_connection_closed_interval_sec": 337.30648483070996,

It seems that connections are recycled every 337 seconds in eqiad, and every time we have a new TLS connection from eqiad to codfw. The value is weird since the mcrouter's default is 60s afaics, so it may be some sort of misleading average. Given that the overall timeout for a request (IIRC) is 250ms for cross region, multiple RTT may be needed very often to establish a connection to a codfw shard. I tried to use openssl -connect via nsenter on a eqiad pod to a codfw shard, and the time to do the TLS handshake + TCP handshake etc.. was around 120/130ms.

Overall things that we could test:

  • Play with keep alive settings for mcrouter to explicitly reduce the churn of creating new TLS connections/sessions every few minutes (mcrouter-pods * codfw memc-shards is not a low number).
  • Test a cross-region timeout of say 500ms, to see if reduces the error rate. It wouldn't be permanent but it would tell us if we are in the right direction or not.
  • There is the option to force memcached in codfw to use only TLS 1.3, that has less RTT for each handshake and generally more performant.

My personal option would be to go with the following on the mcrouter side:

--reset-inactive-connection-interval=0 --keepalive-idle=120 --keepalive-interval=10 --keepalive-count=3

Or similar. IIUC we'd move away from mcrouter actively closing connections (with TCP RST etc..) to something IIUC more handled by the kernel, hopefully in a more graceful way.

We could also play with memcached's internal SSL session resumption logic but we may end up in a mess in my opinion, meanwhile forcing TLS 1.3 could be easy enough to test (my understanding is that with extstore we are not clearing the shards anymore upon restart, so we could try with a few and see).

edit: both mcrouter and mcXXXX hosts have openssl 3.x installed, so in theory they should already be using TLS 1.3. In theory :)

@cmooney @jijiki Lemme know your thoughts :)

Change #1259148 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/docker-images/production-images@master] mcrouter: ease testing new cli parameters

https://gerrit.wikimedia.org/r/1259148

MLechvien-WMF lowered the priority of this task from Medium to Low.Mar 25 2026, 8:49 AM

Screenshot From 2026-03-26 10-21-47.png (1×4 px, 896 KB)

It is very interesting what happened directly after the depool of codfw for the MW Switchover - the total number of mcrouter rps in eqiad jumped to 1.3M/s (!!) and the tkos raised somehow proportionally, peaking now at 7/8k rps. We are still around the 0.5% of requests ending up in TKO, so I am more and more convinced that this may be an issue of connection recycling that may cause this issue. Why it doesn't happen the same in codfw is not clear to me yet, but my proposal is to test the settings that I mentioned above as next step:

--reset-inactive-connection-interval=0 --keepalive-idle=300 --keepalive-interval=10 --keepalive-count=3

The keepalive-idle can vary, I don't have a strong opinion, but I wouldn't push it too far. The idea is to move from mcrouter recycling conns to the kernel handling keep alives via sockets etc..

I reworked a little the graph that shows TKOs registered by mcrouter pods, and I added a column to sort by total occurrences. The first 14 pods have their k8s workers in the following racks:

4xD8
3xD1
3xC6
3xC5
1xC3

That is a little weird, I expected a little more variance.

The top mcrouter pod for TKOs is mcrouter-main-d7czx, running on wikikube-worker1070.eqiad.wmnet with IP 10.67.223.126.

elukey@wikikube-worker1070:~$ sudo nsenter -t 2488 -n mtr -4 -r -w -c 200 mc2041.codfw.wmnet
Start: 2026-03-26T10:11:45+0000
HOST: wikikube-worker1070                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- wikikube-worker1070.eqiad.wmnet        0.0%   200    0.1   0.0   0.0   0.2   0.0
  2.|-- et-1-0-5-1020.cr2-eqiad.wikimedia.org  0.0%   200    0.3   8.5   0.3 557.2  53.0
  3.|-- ae0.cr1-eqiad.wikimedia.org            0.0%   200    0.3   3.8   0.2 358.7  29.7
  4.|-- et-1-0-2.cr1-codfw.wikimedia.org       0.0%   200   30.5  33.7  30.2 418.7  29.1
  5.|-- mc2041.codfw.wmnet                     0.0%   200   30.2  31.2  30.2 221.1  13.5

No packet loss but some of the worst latency values are concerning, since mcrouter works with very tight settings. I tried to run the same for a wikikube worker in Row A:

elukey@wikikube-worker1244:~$ sudo nsenter -t 2687 -n mtr -4 -r -w -c 200 mc2041.codfw.wmnet
Start: 2026-03-26T10:19:03+0000
HOST: wikikube-worker1244              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- wikikube-worker1244.eqiad.wmnet   0.0%   200    0.1   0.0   0.0   0.1   0.0
  2.|-- ae1-1017.cr1-eqiad.wikimedia.org  0.0%   200    0.4   0.7   0.2  15.0   1.6
  3.|-- et-1-0-2.cr1-codfw.wikimedia.org  0.0%   200   30.3  30.9  30.2  70.4   3.9
  4.|-- mc2041.codfw.wmnet                0.0%   200   30.3  30.3  30.2  32.4   0.3

Much better. @cmooney nothing definitive because there may be some variance but what do you think?

Just to add that I keep seeing a pattern where a handful of nodes make up the majority of the errors, for example:

image.png (1×3 px, 385 KB)

image.png (1×2 px, 384 KB)

Much better. @cmooney nothing definitive because there may be some variance but what do you think?

The "avg" time isn't much worse than the lowest, which means that the number of times when the latency is "worst" was very low. You also need to bear in mind with traceroute is that sometimes hosts or routers are busy and take time to generate the ICMP "TLL exceeded" message to send back to the source. It isn't necessarily just the transit time in each direction that contributes to the reported RTT.

Overall I'd not be too worried here given how close the lowest and the avg are, we can probably dismiss the worst values as occasional outliers. Obviously over a long / WAN path there is more scope for buffering so that might play a role, but overall these stats look good to me.

I retried the above experiment with an eqiad memcached shard:

elukey@wikikube-worker1244:~$ sudo nsenter -t 2687 -n mtr -4 -r -w -c 200 mc1041.eqiad.wmnet
Start: 2026-03-26T11:32:32+0000
HOST: wikikube-worker1244              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- wikikube-worker1244.eqiad.wmnet   0.0%   200    0.0   0.0   0.0   0.1   0.0
  2.|-- ae1-1017.cr1-eqiad.wikimedia.org  0.5%   200    1.6   1.4   0.2   3.3   0.6
  3.|-- mc1041.eqiad.wmnet                0.0%   200    1.3   1.2   0.1   2.8   0.7

elukey@wikikube-worker1244:~$ sudo nsenter -t 2687 -n mtr -4 -r -w -c 200 mc1041.eqiad.wmnet
Start: 2026-03-26T11:36:11+0000
HOST: wikikube-worker1244              Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- wikikube-worker1244.eqiad.wmnet   0.0%   200    0.0   0.0   0.0   0.1   0.0
  2.|-- ae1-1017.cr1-eqiad.wikimedia.org  0.0%   200    1.7   1.5   0.2   7.1   0.9
  3.|-- mc1041.eqiad.wmnet                0.0%   200    1.6   1.3   0.1   3.5   0.6

vs

elukey@wikikube-worker1070:~$ sudo nsenter -t 2488 -n mtr -4 -r -w -c 200 mc1041.eqiad.wmnet
Start: 2026-03-26T11:25:46+0000
HOST: wikikube-worker1070                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- wikikube-worker1070.eqiad.wmnet        0.0%   200    0.0   0.0   0.0   0.1   0.0
  2.|-- et-1-0-5-1020.cr2-eqiad.wikimedia.org  0.0%   200  424.2  16.5   0.3 1128. 111.1
  3.|-- mc1041.eqiad.wmnet                     0.0%   200   92.7   9.7   0.1 799.4  74.3

elukey@wikikube-worker1070:~$ sudo nsenter -t 2488 -n mtr -4 -r -w -c 200 mc1041.eqiad.wmnet
Start: 2026-03-26T11:36:27+0000
HOST: wikikube-worker1070                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- wikikube-worker1070.eqiad.wmnet        0.0%   200    0.1   0.0   0.0   0.1   0.0
  2.|-- et-1-0-5-1020.cr2-eqiad.wikimedia.org  0.0%   200    0.3   5.6   0.2 458.3  39.9
  3.|-- mc1041.eqiad.wmnet                     0.0%   200    0.1   5.3   0.1 859.1  61.4

@cmooney I totally get your point about the outliers and possibly weird timing generated by unrelated TTL exceeded packets by routers, but there is some consistency. The additional latency is marginal and not affecting avg too much, so overall everything is good. It matches really well with what we are seeing though - a low percentage of memcached traffic gets impacted because the extra latency causes timeouts every once in a while, only in specific k8s workers (in row D/C?). I don't want to sound like somebody preaching a conspiracy theory, but so far this is the only theory that fits all the things that we are seeing :D

I took a pcap on wikikube-worker1070 for TCP packets to mc1041, and did some comparisons on RTT (i.e. time between packet sent to mc1041 and the response arriving.

Total RTT samples: 280984

Bucket            Count     Pct  Bar
--------------------------------------------------
> 500ms               2    0.0%  
250 - 500ms           2    0.0%  
100 - 250ms          13    0.0%  
50 - 100ms           10    0.0%  
40 - 50ms         19849    7.1%  ███
30 - 40ms          4854    1.7%  
20 - 30ms          7949    2.8%  █
10 - 20ms         14398    5.1%  ██
5 - 10ms          12662    4.5%  ██
2 - 5ms           11842    4.2%  ██
1 - 2ms            7543    2.7%  █
0 - 1ms          201860   71.8%  ███████████████████████████████████

The thing which clouds this is we are looking at all packets here, and the RTT includes however long the server took create the response, as well as the transmission time. My gut feeling here is the ~200k RTTs under 1ms are probably "no-ops", and the others above that up to about 50ms are when the other side has data to assemble etc and takes longer. Can't be 100% on that, I'll do another pcap and just focus on SYN / SYN-ACK packets, which will be more reflective on the network latency (though the end systems, NICs, CPUs etc still have a role).

Overall the number of very high RTTs here (above 50ms) don't seem excessive out of ~300k samples (less than 0.05%). From experience it is likely going to be a lot easier to work on the application side to tolerate this level of jitter than try to eliminate it on the network (if indeed the network is causing it).

These nodes should be depooled to test if tkos decrese considerably for a stable amount of time:

wikikube-worker1055.eqiad.wmnet
wikikube-worker1097.eqiad.wmnet
wikikube-worker1273.eqiad.wmnet
wikikube-worker1070.eqiad.wmnet
wikikube-worker1160.eqiad.wmnet

There is a large spike of Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER ERROR errors in the last few days https://logstash.wikimedia.org/goto/f093e55fa4bf14cb68615c4b0e776d44, noting it here in case that's helpful to the investigation

image.png (1×5 px, 488 KB)

Change #1259148 merged by Elukey:

[operations/docker-images/production-images@master] mcrouter: ease testing new cli parameters

https://gerrit.wikimedia.org/r/1259148

I'll do another pcap and just focus on SYN / SYN-ACK packets, which will be more reflective on the network latency

Since last night I captured exactly 0 new connections / tcp handshakes between wikikube-worker1070 and mc1041, is that expected?

@cmooney yes Effie depooled it IIRC! You can probably use wikikube-worker1273.eqiad.wmnet (@jijiki let's not depool it).

Yesterday I depooled

  • wikikube-worker1055.eqiad.wmnet
  • wikikube-worker1097.eqiad.wmnet
  • wikikube-worker1070.eqiad.wmnet

@cmooney yes Effie depooled it IIRC! You can probably use wikikube-worker1273.eqiad.wmnet (@jijiki let's not depool it).

Ah my bad you even mention above! I've a capture going on wikikube-worker1273 now, so if we can leave that pooled I'll gather data for a few days (there are relatively few SYN / SYN-ACK handshakes as the connections are long lived, so we need some time to see a statistically significant number).

Ok so I gathered stats for the past few days (Mar 27 - Apr 1) of the SYN / SYN-ACK exchanges starting the tcp handshake, and this is the breakdown of RTTs:

Total SYN / SYN-ACK RTTs measured: 146553

Bucket            Count     Pct  Bar
--------------------------------------------------
> 500ms              71  0.048%  
250 - 500ms          77  0.053%  
100 - 250ms         193  0.132%  
50 - 100ms          321  0.219%  
40 - 50ms           258  0.176%  
30 - 40ms        145633  99.372%  █████████████████████████████████████████████████
20 - 30ms             0  0.000%  
10 - 20ms             0  0.000%  
5 - 10ms              0  0.000%  
2 - 5ms               0  0.000%  
1 - 2ms               0  0.000%  
0 - 1ms               0  0.000%

So the outliers we see in mtr are reproduced (also a breakdown per-destination in P90140, but all destinations basically the same). I still can't say for sure if the delay here is being introduced by:

  • Time on the host to process and respond to the SYN
  • Normal buffering on the network if interfaces are momentarily busy (can esp. happen on the WAN)
  • A malfunctioning network

I still see no evidence for the latter. But right now we have a situation where approx ~0.4% of RTTs exceed 50ms, and ~0.16% exceed 100ms. There is also ~0.05% over 500ms which definitely is not ideal. The highest RTT is an insane 9 seconds, one thing we can say for sure is this delay is being caused by the host, as none of our routers or switches have large enough buffers to store a packet that long.

I think the next step is to define the following SLOs for the mcrouter <-> mc host network latency:

  • Maximum RTT for a successful mcrouter -> memcached request
  • Percentage of RTTs beyond this / failures the service as a whole can tolerate before it is considered degraded

If the assessment after that is that the current network performance is insufficient to support the application we can start looking more deeply into what is causing the outliers, and what would be required if we wanted to eliminate them.

@elukey the part of this which worries me most is the suggestion that this is a problem we only have on row c/d - due to the Nokia switches. I don't see strong evidence for that in the above comments tbh, if you have more detail please share. If there is a wikikube-worker in rows a/d with mcrouter regularly talking to codfw mc hosts let me know, I can potentially do the same kind of analysis on traffic there so we can compare the difference?

If there is a wikikube-worker in rows a/d with mcrouter regularly talking to codfw mc hosts let me know, I can potentially do the same kind of analysis on traffic there so we can compare the difference?

wikikube-worker1258 is sending such requests so I started a similar capture on it and will check back in a few days.

@cmooney I added some info T420223#11753137, where I tested jitter seen by MTR on a worker in row A/B vs a worker in C/D: the former doesn't show it. I also tried on another couple of nodes, but I don't have anything definitive form a statistics point of view. I can collect more info if you want!

I am really curious to see if wikikube workers migrated to the new vlan configs change their behavior in any way, I saw Arzhel working on them so there may be something ready for a test.

This was also relevant: T420223#11752776

@cmooney I added some info T420223#11753137, where I tested jitter seen by MTR on a worker in row A/B vs a worker in C/D: the former doesn't show it. I also tried on another couple of nodes, but I don't have anything definitive form a statistics point of view. I can collect more info if you want!

Thanks Luca. No that's fine thanks. I seen those stats, they show a difference alright. But they are only samples of 200 packets, from the tests I ran it seems possible we could randomly get a "clean" 200 packets all with low RTT, and also that a single high RTT could skew results. I'm not saying there isn't a wider pattern, but we'd definitely need something more concrete before declaring the Nokias are defective.

Let's see what the capture I'm currently running shows tomorrow, more samples if it has no high RTT responses that will be very telling.

This was also relevant: T420223#11752776

That one is definitely more worrying. If the issue is on the hosts or with the application we shouldn't see pattern like that.

I am really curious to see if wikikube workers migrated to the new vlan configs change their behavior in any way

It shouldn't really make a difference, as ultimately the packets flow through the same set of hops/links. But yes it will be interesting to see if there is any difference.

Due to an unfortunate coincidence, this issue caused a paging event.

Background
The mw-api-ext-next deployment (mw-api-ext-next.discovery.wmnet) serves traffic only during upgrades. The rest of time, it runs a single pod solely to pass PyBal health checks.

Today, that pod encountered a memcached connectivity issue, blocking all workers (

image.png (1×3 px, 391 KB)
). This triggered a page that, in reality, had no production impact, though wasn't obvious from the alert itself. Only someone from ServiceOps new could recognise it as unimportant.

What is though quite alarming that mcrouter has registered TKOs towards the gutter pool hosts, which is our failover pool.

  • 10.64.0.21 -> mc-gp1004.eqiad.wmnet
  • 10.64.32.126 -> mc-gp1005.eqiad.wmnet
  • 10.64.48.182 -> mc-gp1006.eqiad.wmnet

image.png (644×1 px, 111 KB)

Worker wikikube-worker1105 had a puppet run at the time, though in that case I do not think it is related to T374366: Race condition in iptables rules during puppet runs on k8s nodes

2026-04-06T09:21:25.439081+00:00 wikikube-worker1105 puppet-agent[174531]: Using environment 'production'
2026-04-06T09:21:25.692805+00:00 wikikube-worker1105 puppet-agent[174531]: Retrieving pluginfacts
2026-04-06T09:21:25.772053+00:00 wikikube-worker1105 puppet-agent[174531]: Retrieving plugin
2026-04-06T09:21:26.596299+00:00 wikikube-worker1105 puppet-agent[174531]: Loading facts
2026-04-06T09:21:29.936138+00:00 wikikube-worker1105 puppet-agent[174531]: The current total number of facts: 2185 exceeds the number of facts limit: 2048
2026-04-06T09:22:18.555982+00:00 wikikube-worker1105 puppet-agent[174531]: Caching catalog for wikikube-worker1105.eqiad.wmnet
2026-04-06T09:22:19.422001+00:00 wikikube-worker1105 puppet-agent[174531]: Applying configuration version '(617ca2cb7f) Manuel Arostegui - db2142: Disable notifications'
2026-04-06T09:22:50.588977+00:00 wikikube-worker1105 puppet-agent[174531]: Applied catalog in 31.67 seconds

Ok the results from wikikube-worker1258 (row B) don't seem to show the same percentage of longer RTT packets as wikikube-worker1273 (row D - in above comment).

Bucket            Count     Pct  Bar
--------------------------------------------------
> 500ms               1  0.001%  
250 - 500ms           0  0.000%  
100 - 250ms           0  0.000%  
50 - 100ms            0  0.000%  
40 - 50ms             2  0.001%  
30 - 40ms        136548  99.998%  █████████████████████████████████████████████████
20 - 30ms             0  0.000%  
10 - 20ms             0  0.000%  
5 - 10ms              0  0.000%  
2 - 5ms               0  0.000%  
1 - 2ms               0  0.000%  
0 - 1ms               0  0.000%

So perhaps we do need to consider there is a problem with Nokia. Or maybe there is a link not performing well somewhere (though it'd be unusual for such a thing to be present and not be causing a much higher number of problems than observed).

I'm sort of at a loss to explain tbh. It will need more investigation but I fear it could be a long and difficult one.

Screenshot From 2026-04-07 16-20-30.png (2×4 px, 1 MB)

It is interesting how the TKOs started to reduce from around the third/fourth of April.

On April 2nd we completed the switch-dc procedures for services: https://sal.toolforge.org/log/gbxtTp0BffdvpiTrvPTF

In order to try and exclude the new switches with old vlan as a possible cause, I've pooled a host that is on the new vlan wikikube-worker1347.eqiad.wmnet, and am in the process of renumbering an existing one wikikube-worker1273.eqiad.wmnet.
Once workloads get scheduled on them we should see if they have exhibit the issue or not.

@jijiki shall we deploy mcrouter 2023.07.17.00-2 and test the keep alive options? I have the feeling that the TKOs will go down after it.

@jijiki shall we deploy mcrouter 2023.07.17.00-2 and test the keep alive options? I have the feeling that the TKOs will go down after it.

I will be redeploying mcrouter sometime next week for T421504: mcrouter live config reload without pod restarts, so I can add this as well. I will be testing this on a container on mw-debug before updating the daemonset

In order to try and exclude the new switches with old vlan as a possible cause, I've pooled a host that is on the new vlan wikikube-worker1347.eqiad.wmnet, and am in the process of renumbering an existing one wikikube-worker1273.eqiad.wmnet.
Once workloads get scheduled on them we should see if they have exhibit the issue or not.

These two hosts have mediawiki workloads on them, feel free to go capture

While investigating a different problem, I found that we have a similar(?) issue when mediawiki contacts the DBs T422489: rdbms errors in eqiad

@Clement_Goubert wikikube-worker1273 shows the majority of tkos, and it is a big outlier compared to the other ones. Can we depool it to see how the TKO impact looks like afterwards?

@Clement_Goubert wikikube-worker1273 shows the majority of tkos, and it is a big outlier compared to the other ones. Can we depool it to see how the TKO impact looks like afterwards?

Sure, on it.

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 depool for host wikikube-worker1273.eqiad.wmnet completed:

  • wikikube-worker1273.eqiad.wmnet (PASS)
    • Host wikikube-worker1273.eqiad.wmnet depooled from wikikube-eqiad

That seems to have lowered the total amount of TKOs, or at least some of the volatility on it, but the overall error level still bothers me

Taking the next host with the highest "last" amount of TKOs, I can once again correlate most of the TKO spikes with these logs:

I20260413 23:27:34.135628    12 AsyncSSLSocket.cpp:175] client handshakeErr: AsyncSocketException: SSL connect timed out after 968ms, type = Timed out
I20260413 23:50:10.883538    10 AsyncSSLSocket.cpp:175] client handshakeErr: AsyncSocketException: SSL connect timed out after 968ms, type = Timed out
I20260414 00:04:47.102306     7 AsyncSSLSocket.cpp:159] TCP connect failed: AsyncSocketException: connect timed out after 1000ms, type = Timed out
I20260414 04:22:03.026331    12 AsyncSSLSocket.cpp:175] client handshakeErr: AsyncSocketException: SSL connect timed out after 968ms, type = Timed out
I20260414 07:12:59.181154    17 AsyncSSLSocket.cpp:175] client handshakeErr: AsyncSocketException: SSL connect timed out after 968ms, type = Timed out
I20260414 07:21:10.240903    19 AsyncMcClientImpl.cpp:700] Failed to write into socket with remote endpoint "10.64.0.124:11211:ascii:plain:notcompressed", wrote 143616 bytes. Exception: AsyncSocketException: write timed out after 1000ms, type = Timed out
I20260414 07:26:49.328914    13 AsyncSSLSocket.cpp:175] client handshakeErr: AsyncSocketException: SSL connect timed out after 966ms, type = Timed out
I20260414 08:15:18.548874    16 AsyncSSLSocket.cpp:175] client handshakeErr: AsyncSocketException: SSL connect timed out after 968ms, type = Timed out
I20260414 08:46:24.012511    15 AsyncSSLSocket.cpp:175] client handshakeErr: AsyncSocketException: SSL connect timed out after 969ms, type = Timed out
I20260414 10:57:10.440109    14 AsyncSSLSocket.cpp:175] client handshakeErr: AsyncSocketException: SSL connect timed out after 969ms, type = Timed out

How do we feel about trying to implement these settings?

--reset-inactive-connection-interval=0 --keepalive-idle=300 --keepalive-interval=10 --keepalive-count=3

The keepalive-idle can vary, I don't have a strong opinion, but I wouldn't push it too far. The idea is to move from mcrouter recycling conns to the kernel handling keep alives via sockets etc..

@Clement_Goubert do you mind to repool the server? I'd like to understand if it goes back to cause so many tkos or not, it may be a problem of containers/pods scheduled on a single host?

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1003 pool for host wikikube-worker1273.eqiad.wmnet completed:

  • wikikube-worker1273.eqiad.wmnet (PASS)
    • Host wikikube-worker1273.eqiad.wmnet pooled in wikikube-eqiad

Change #1271771 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-debug: use new mcrouter image

https://gerrit.wikimedia.org/r/1271771

The current pod for wikikube-worker1273 is mcrouter-main-28xsl and afaics in the last 12 hours there was only one TKO event. Let's keep this graph monitored, I want to figure out if the pod will get back to its previous error rate. If it is a network problem, it should eventually get back to a bad state, otherwise the jitter may be caused by the actual traffic that it flows through it.

The other thing that may play a role here is the number of proxy threads, that is high: 16. There may be some contention going on, I don't see any container cpu throttling but it may manifest as jitter. I am starting to think that the eqiad C/D rows play a role in somehow adding a little jitter, that gets "amplified" by some mcrouter bottlenecks.

I would personally try these things incrementally:

  • --reset-inactive-connection-interval=0 --keepalive-idle=300 --keepalive-interval=10 --keepalive-count=3 (to be verified what values are best).
  • --timeouts-until-tko=10 (given the huge amount of requests that every mcrouter pod handles, 3 is a very high target and we pay a big price when the logic gets triggered).
  • --num-proxies=8 (or similar, just to understand how the number plays a role)

Change #1271771 merged by jenkins-bot:

[operations/deployment-charts@master] mw-debug: use new mcrouter image

https://gerrit.wikimedia.org/r/1271771

I would personally try these things incrementally:

  • --reset-inactive-connection-interval=0 --keepalive-idle=300 --keepalive-interval=10 --keepalive-count=3 (to be verified what values are best).
  • --timeouts-until-tko=10 (given the huge amount of requests that every mcrouter pod handles, 3 is a very high target and we pay a big price when the logic gets triggered).
  • --num-proxies=8 (or similar, just to understand how the number plays a role)

I find everything reasonable, however, given how slow and noisy mcrouter deployments are, I would rather take more conservative action:

Test 1

  • deploy the image @elukey built to support the extra command line options
  • --timeouts-until-tko=10 and --num-proxies=12: we already have those options configurable
  • wait for a day or two

Test 2

  • continue with the connection tuning options

Change #1275376 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-mcrouter: bump image and new config (codfw)

https://gerrit.wikimedia.org/r/1275376

Yesterday I depooled

  • wikikube-worker1055.eqiad.wmnet
  • wikikube-worker1097.eqiad.wmnet
  • wikikube-worker1070.eqiad.wmnet

wikikube-worker1097.eqiad.wmnet is up for decom (T423863). Should I exclude it or is it fine to remove?

Yesterday I depooled

  • wikikube-worker1055.eqiad.wmnet
  • wikikube-worker1097.eqiad.wmnet
  • wikikube-worker1070.eqiad.wmnet

wikikube-worker1097.eqiad.wmnet is up for decom (T423863). Should I exclude it or is it fine to remove?

fine to remove!

Change #1275376 merged by jenkins-bot:

[operations/deployment-charts@master] mw-mcrouter: bump image and new config (codfw)

https://gerrit.wikimedia.org/r/1275376

Change #1277300 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-mcrouter: bump image and new config (eqiad+codfw)

https://gerrit.wikimedia.org/r/1277300

Change #1277300 merged by jenkins-bot:

[operations/deployment-charts@master] mw-mcrouter: bump image and new config (eqiad+codfw)

https://gerrit.wikimedia.org/r/1277300

We have rolled out a couple of changes that seem to have improved the reliability on codfw. There are "bursts" of errors like

image.png (826×1 px, 119 KB)

and the gutterpool kicks in occasionally. Let's keep an eye and make further changes next week.

@jijiki can we close this? And consider reopening if we see new symptoms.

@MLechvien-WMF We need to rollout another set of new settings to add hopefully better connection re-use, after that we'll close.