User Details
- User Since: May 10 2021, 3:25 PM (256 w, 6 d)
- Availability: Available
- IRC Nick: topranks
- LDAP User: Cathal Mooney
- MediaWiki User: CMooney (WMF)
Fri, Apr 10
Broadly the patch submitted looked good to me, though I see it was abandoned.
Thu, Apr 9
Wed, Apr 8
Ok Juniper came back with the following:
I found that your version 23.4R2-S7.4 is hitting the PR1933049. Unfortunately, this is a confidential PR, but in order to get this issue resolved and avoid further issues, you need to upgrade to a slightly higher version.
Tue, Apr 7
Ok the results from wikikube-worker1258 (row B) don't seem to show the same percentage of longer-RTT packets as wikikube-worker1273 (row D, in the comment above).
Bucket        Count    Pct       Bar
--------------------------------------------------
> 500ms           1     0.001%
250 - 500ms       0     0.000%
100 - 250ms       0     0.000%
50 - 100ms        0     0.000%
40 - 50ms         2     0.001%
30 - 40ms    136548    99.998%   █████████████████████████████████████████████████
20 - 30ms         0     0.000%
10 - 20ms         0     0.000%
5 - 10ms          0     0.000%
2 - 5ms           0     0.000%
1 - 2ms           0     0.000%
0 - 1ms           0     0.000%
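In case it's useful to reproduce, below is a rough sketch of how a breakdown like the above can be produced from a list of RTT samples (in seconds). The bucket edges mirror the table; the helper itself is just illustrative, not the actual tooling I used.

BUCKETS = [
    ("0 - 1ms", 0.001), ("1 - 2ms", 0.002), ("2 - 5ms", 0.005),
    ("5 - 10ms", 0.010), ("10 - 20ms", 0.020), ("20 - 30ms", 0.030),
    ("30 - 40ms", 0.040), ("40 - 50ms", 0.050), ("50 - 100ms", 0.100),
    ("100 - 250ms", 0.250), ("250 - 500ms", 0.500), ("> 500ms", float("inf")),
]

def print_histogram(rtts):
    # Count each RTT sample into the first bucket whose upper bound covers it.
    counts = {label: 0 for label, _ in BUCKETS}
    for rtt in rtts:
        for label, upper in BUCKETS:
            if rtt <= upper:
                counts[label] += 1
                break
    total = len(rtts) or 1
    print(f"{'Bucket':<14}{'Count':>8}  {'Pct':>9}  Bar")
    # Largest bucket first, to match the table layout above.
    for label, _ in reversed(BUCKETS):
        count = counts[label]
        pct = 100.0 * count / total
        bar = "█" * int(round(pct / 2))   # 50 characters == 100%
        print(f"{label:<14}{count:>8}  {pct:8.3f}%  {bar}")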
Thu, Apr 2
Is it maybe an idea to re-use some of the existing vlans? For example, if we assign rack A1 as the public rack for the A/B pod, we could add all the hosts to public1-a-eqiad as we move them, and then rename the vlan to public1-a1-eqiad once complete?
We are hopeful the situation has improved now that codfw has been repooled, adding additional capacity. The root cause of the circuit breaking is still being investigated.
Wed, Apr 1
If we are going to have one public-enabled rack per "pod" then should we not have just one vlan assigned for codfw row E/F (and then one also for a/b and c/d)?
Ok so I gathered stats for the past few days (Mar 27 - Apr 1) on the SYN / SYN-ACK exchanges starting the TCP handshake, and this is the breakdown of RTTs:
Total SYN / SYN-ACK RTTs measured: 146553
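For reference, the measurement itself is simple enough to sketch in a few lines of scapy; the capture filename and server IP below are placeholders rather than the actual details from this host:

from scapy.all import rdpcap, IP, TCP

SERVER = "10.64.0.1"       # placeholder for the server the SYNs were sent to
syn_times = {}             # (client_ip, client_port) -> time the SYN was sent
rtts = []

for pkt in rdpcap("handshakes.pcap"):   # placeholder capture file
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
        continue
    ip, tcp = pkt[IP], pkt[TCP]
    syn = bool(tcp.flags & 0x02)
    ack = bool(tcp.flags & 0x10)
    if syn and not ack and ip.dst == SERVER:
        # Outgoing SYN: note when this flow started the handshake.
        syn_times[(ip.src, tcp.sport)] = float(pkt.time)
    elif syn and ack and ip.src == SERVER:
        # Returning SYN-ACK: pair it with the SYN for the same flow.
        sent = syn_times.pop((ip.dst, tcp.dport), None)
        if sent is not None:
            rtts.append(float(pkt.time) - sent)

print(f"Total SYN / SYN-ACK RTTs measured: {len(rtts)}")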
Mon, Mar 30
Fri, Mar 27
Thanks for the write-up @JMeybohm. Definitely an odd one.
Thu, Mar 26
I took a pcap on wikikube-worker1070 for TCP packets to mc1041, and did some comparisons on RTT (i.e. the time between a packet being sent to mc1041 and the response arriving).
Total RTT samples: 280984
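The approach was roughly the below (a scapy sketch; the capture file and mc1041 address are placeholders, and it ignores retransmissions and delayed ACKs, so treat the numbers as indicative rather than exact):

from scapy.all import rdpcap, IP, TCP

MC1041 = "10.64.0.83"      # placeholder address for mc1041
sent = {}                  # (local_port, expected_ack) -> time the segment was sent
rtts = []

for pkt in rdpcap("capture.pcap"):      # placeholder capture file
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP)):
        continue
    ip, tcp = pkt[IP], pkt[TCP]
    payload_len = len(tcp.payload)
    if ip.dst == MC1041 and payload_len > 0:
        # Outgoing request segment: remember the ACK number that will cover it.
        sent.setdefault((tcp.sport, tcp.seq + payload_len), float(pkt.time))
    elif ip.src == MC1041:
        # Packet coming back: if its ACK exactly covers a segment we sent,
        # the gap between the two timestamps is one RTT sample.
        t_sent = sent.pop((tcp.dport, tcp.ack), None)
        if t_sent is not None:
            rtts.append(float(pkt.time) - t_sent)

print(f"Total RTT samples: {len(rtts)}")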
Wed, Mar 25
Tue, Mar 24
This should now be working again. Big thanks to @ayounsi for the heavy-lifting with all the puppet patches to add the $INSTALL_HOSTS set.
Mon, Mar 23
I think a cookbook that takes down doh and durum simultaneously at a site (I assume by changing bird?) would solve this perfectly.
Ok this should no longer be an issue after updating the wikimedia6 prefix list. Right now, with Wikidough depooled in esams, traffic which lands in esams for Wikidough gets sent to eqiad and answered:
cathal@officepc:~$ mtr -b -w -c 5 -6 wikimedia-dns.org
Start: 2026-03-23T15:06:23+0000
HOST: officepc                                                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- pool-ipv6-pd.agg1.srl.blp-srl.eir.ie (redacted)              0.0%     5    0.3   0.4   0.3   0.5   0.1
  2.|-- agg1.srl.blp-srl.eircom.net (2001:bb0:6:a11d::1)             0.0%     5    5.7   5.9   4.5   8.3   1.6
  3.|-- 2001:bb0:6:a197::1                                           0.0%     5    4.5   4.6   4.3   4.9   0.2
  4.|-- 10ge16-1.core1.dub1.he.net (2001:7f8:18::69)                 0.0%     5    5.2   5.4   5.1   5.5   0.2
  5.|-- e0-32.core2.man1.he.net (2001:470:0:410::1)                 20.0%     5   10.0  10.0   9.9  10.3   0.2
  6.|-- ???                                                         100.0     5    0.0   0.0   0.0   0.0   0.0
  7.|-- ae1-380.cr1-esams.wikimedia.org (2001:7f8:1::a501:4907:1)    0.0%     5   19.2  19.4  19.2  19.7   0.2
  8.|-- xe-3-2-1.cr2-eqiad.wikimedia.org (2a02:ec80:300:fe09::1)     0.0%     5   93.6  93.5  93.1  94.0   0.3
  9.|-- wikimedia-dns.org (2001:67c:930::1)                          0.0%     5   92.0  92.1  91.9  92.6   0.3
cathal@officepc:~$ dig -6 +https +nsid www.ietf.org @wikimedia-dns.org
Sat, Mar 21
FWIW the reason traffic was re-routed to eqiad rather than drmrs comes down to how we have the core routers set up. TL;DR: depooling the service (i.e. stopping the doh VMs announcing the /32 IPs) did not cause the CRs in Amsterdam to stop announcing the /24 and /48 prefixes to the world, because other anycast IPs in the same range (the durum IPs) were still being announced locally in esams.
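To illustrate the logic (toy example only, using RFC 5737 documentation prefixes rather than the real ranges, and nothing to do with the actual router config): the covering prefix only gets withdrawn once no component anycast route in the range is left at the site.

from ipaddress import ip_network

AGGREGATE = ip_network("192.0.2.0/24")   # stand-in for the covering prefix

def still_announce_aggregate(local_anycast_routes):
    # Keep announcing the covering prefix while any /32 inside it is still
    # being announced locally at the site.
    return any(ip_network(r).subnet_of(AGGREGATE) for r in local_anycast_routes)

# doh /32 withdrawn (service depooled) but a durum /32 still announced locally:
print(still_announce_aggregate(["192.0.2.20/32"]))   # True  - the /24 stays announced
print(still_announce_aggregate([]))                  # False - only now would it be withdrawn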
Fri, Mar 20
Ticket 05547487 opened with Nokia.
Unfortunately we hit another blocker with this so we will have to review the way forward. See T420706.
Thu, Mar 19
Nice work!
Wed, Mar 18
@jijiki thanks for the task. In terms of the network in general, nothing changed the week of Dec 15th last. We had done some work in Nov/Dec but it was all done by then, and we were firmly in change-freeze mode that week, so definitely nothing changed.
Tue, Mar 17
Ok all vxlan tunnels on the row C/D leaf switches to ssw1-d1-eqiad and ssw1-d8-eqiad currently have a valid vxlan tunnel ID. So unless something causes that to change (it shouldn't), we should not hit this issue again.
Ok this work is now complete. We only had to reset the tunnel on lsw1-d4-eqiad, as it was the only one with an ID of '1' going to ssw1-d8.
FWIW I agree it'd be better if the web proxy could be used here, as conceptually this is "private WMF host needs access to external internet IP".
This won't be needed now; we were able to reset the tunnels for this switch without disrupting traffic to the rack.
A:lsw1-c7-eqiad# show network-instance default tunnel-table ipv4 | grep "10.64.128.17\|10.64.128.18"
| 10.64.128.17/32 | vxlan | vxlan | 27 | Y | 8 | 0 | 2026-03-17T12:02:30.154Z | 10.64.129.68 | ethernet-1/56.0 |
| 10.64.128.18/32 | vxlan | vxlan | 11 | Y | 8 | 0 | 2025-10-02T18:41:01.245Z | 10.64.129.70 | ethernet-1/55.0 |
This won't be required now; we have reset the tunnels without disrupting traffic to the hosts in the rack.
A:lsw1-c2-eqiad# show network-instance default tunnel-table ipv4 | grep "10.64.128.17\|10.64.128.18"
| 10.64.128.17/32 | vxlan | vxlan | 27 | Y | 8 | 0 | 2026-03-17T11:56:33.887Z | 10.64.129.26 | ethernet-1/56.0 |
| 10.64.128.18/32 | vxlan | vxlan | 12 | Y | 8 | 0 | 2025-10-02T18:49:18.796Z | 10.64.129.28 | ethernet-1/55.0 |
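As an aside, checking for the bad ID across this output is easy to script. A rough sketch that parses the pipe-delimited tunnel-table output above (column positions are inferred from that output, and the helper is mine rather than any existing tooling):

import sys

def find_id_one_tunnels(lines):
    hits = []
    for line in lines:
        cols = [c.strip() for c in line.split("|") if c.strip()]
        # In the output above, cols[0] is the prefix and cols[3] the tunnel ID.
        if len(cols) >= 4 and cols[1] == "vxlan" and cols[3] == "1":
            hits.append(cols[0])
    return hits

# e.g. pipe the "show network-instance default tunnel-table ipv4" output in:
for prefix in find_id_one_tunnels(sys.stdin):
    print(f"tunnel to {prefix} has ID 1 - needs a reset")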
Work is all complete. BGP sessions to ssw1-d1-eqiad were reset on these switches, which all had tunnels with ID 1 towards it; no packet loss to servers was detected:
lsw1-c2-eqiad
lsw1-c3-eqiad
lsw1-c4-eqiad
lsw1-c6-eqiad
lsw1-c7-eqiad
lsw1-d1-eqiad
lsw1-d3-eqiad
lsw1-d8-eqiad
Mon, Mar 16
Can we hold off on any work related to this? I am planning to drain the spine switches in order tomorrow morning and will reset the tunnels on all the switches, so we shouldn't have to arrange downtime with users.
Mar 13 2026
Thanks for the task @BTullis
@Papaul please tell them to keep the case open as they have not yet fixed it.