cmooney (Cathal Mooney)
SRE (netops)

User Details

User Since
May 10 2021, 3:25 PM (256 w, 6 d)
Availability
Available
IRC Nick
topranks
LDAP User
Cathal Mooney
MediaWiki User
CMooney (WMF)

Recent Activity

Fri, Apr 10

cmooney added a comment to T356877: Increase visibility of kubernetes network status.

Broadly the patch submitted looked good to me, though I see it was abandoned.

Fri, Apr 10, 2:41 PM · Patch-For-Review, Sustainability (Incident Followup), ServiceOps-good-first-task, Infrastructure-Foundations, netops, ServiceOps new, observability, Prod-Kubernetes, Kubernetes
cmooney added a comment to T422043: Create public vlans in eqiad and codfw.

D row has no specialty rack at all so we can easily work around that for future private vlan installs.

Fri, Apr 10, 12:06 PM · Infrastructure-Foundations, netops

Thu, Apr 9

cmooney updated the task description for T422525: cr1-esams failed upgrade.
Thu, Apr 9, 9:23 AM · netops, Infrastructure-Foundations, SRE

Wed, Apr 8

cmooney added a comment to T422525: cr1-esams failed upgrade.

Ok Juniper came back with the following:

I found that your version 23.4R2-S7.4 is hitting the PR1933049. Unfortunately, this is a confidential PR, but in order to get this issue resolved and avoid further issues, you need to upgrade to a slightly higher version.
Wed, Apr 8, 4:14 PM · netops, Infrastructure-Foundations, SRE

Tue, Apr 7

cmooney updated the task description for T422525: cr1-esams failed upgrade.
Tue, Apr 7, 4:49 PM · netops, Infrastructure-Foundations, SRE
cmooney updated the task description for T422525: cr1-esams failed upgrade.
Tue, Apr 7, 4:47 PM · netops, Infrastructure-Foundations, SRE
cmooney added a subtask for T416450: esams: upgrade routers & switches (2026): T422525: cr1-esams failed upgrade.
Tue, Apr 7, 3:56 PM · Patch-For-Review, Infrastructure-Foundations, netops
cmooney added a parent task for T422525: cr1-esams failed upgrade: T416450: esams: upgrade routers & switches (2026).
Tue, Apr 7, 3:55 PM · netops, Infrastructure-Foundations, SRE
cmooney created T422525: cr1-esams failed upgrade.
Tue, Apr 7, 3:55 PM · netops, Infrastructure-Foundations, SRE
cmooney created P90315 rpd logs.
Tue, Apr 7, 1:48 PM
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

Ok the results from wikikube-worker1258 (row B) don't seem to show the same percentage of longer RTT packets as wikikube-worker1273 (row D - in above comment).

Bucket            Count     Pct  Bar
--------------------------------------------------
> 500ms               1  0.001%  
250 - 500ms           0  0.000%  
100 - 250ms           0  0.000%  
50 - 100ms            0  0.000%  
40 - 50ms             2  0.001%  
30 - 40ms        136548  99.998%  █████████████████████████████████████████████████
20 - 30ms             0  0.000%  
10 - 20ms             0  0.000%  
5 - 10ms              0  0.000%  
2 - 5ms               0  0.000%  
1 - 2ms               0  0.000%  
0 - 1ms               0  0.000%
Tue, Apr 7, 11:00 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
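
A breakdown like the one in the comment above can be produced by sorting RTT samples into fixed buckets. The following is only a minimal sketch, not the script actually used: the bucket edges are copied from the table, while the function names and the handling of samples that fall exactly on an edge are assumptions.

```python
from bisect import bisect_right

# Bucket edges in milliseconds, taken from the table in the comment above.
EDGES_MS = [0, 1, 2, 5, 10, 20, 30, 40, 50, 100, 250, 500]


def bucket_label(index):
    """Return the label for bucket `index` (the last bucket is open-ended)."""
    if index == len(EDGES_MS) - 1:
        return f"> {EDGES_MS[index]}ms"
    return f"{EDGES_MS[index]} - {EDGES_MS[index + 1]}ms"


def rtt_histogram(rtts_ms):
    """Count RTT samples per bucket; returns a list of (label, count, pct)."""
    counts = [0] * len(EDGES_MS)
    for rtt in rtts_ms:
        # Assumption: a sample equal to an edge goes into the higher bucket.
        index = max(bisect_right(EDGES_MS, rtt) - 1, 0)
        counts[index] += 1
    total = len(rtts_ms) or 1
    return [(bucket_label(i), c, 100.0 * c / total) for i, c in enumerate(counts)]


def format_histogram(rows, width=50):
    """Render rows highest bucket first, with a proportional bar."""
    lines = []
    for label, count, pct in reversed(rows):
        bar = "█" * int(width * pct / 100)
        lines.append(f"{label:<14}{count:>9}  {pct:6.3f}%  {bar}")
    return "\n".join(lines)
```

Calling `print(format_histogram(rtt_histogram(samples)))` on a list of RTTs in milliseconds prints one row per bucket in the same order as the table.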

Thu, Apr 2

cmooney created P90248 (An Untitled Masterwork).
Thu, Apr 2, 6:31 PM
cmooney added a comment to T422043: Create public vlans in eqiad and codfw.

Is it maybe an idea to re-use some of the existing vlans? Like if we assign rack A1 as the public rack for the A/B POD we could add all the hosts to public1-a-eqiad as we move them? And then when complete rename the vlan to public1-a1-eqiad?

Thu, Apr 2, 3:57 PM · Infrastructure-Foundations, netops
cmooney added a comment to T422130: Database servers in cluster(number) are overloaded.

We are hopeful the situation should have improved after codfw was repooled, adding additional capacity. Root cause of the circuit breaking is still being investigated.

Thu, Apr 2, 1:57 PM · Wikimedia-Incident, SRE, DBA
cmooney claimed T417873: eqiad: upgrade routers (2026).
Thu, Apr 2, 9:45 AM · Infrastructure-Foundations, netops
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@cmooney I added some info in T420223#11753137, where I tested jitter seen by MTR on a worker in row A/B vs a worker in C/D: the former doesn't show it. I also tried on another couple of nodes, but I don't have anything definitive from a statistics point of view. I can collect more info if you want!

Thu, Apr 2, 9:44 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Wed, Apr 1

cmooney added a comment to T422043: Create public vlans in eqiad and codfw.

If we are going to have one public-enabled rack per "pod" then should we not have just one vlan assigned for codfw row E/F (and then one also for a/b and c/d)?

Wed, Apr 1, 3:56 PM · Infrastructure-Foundations, netops
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

If there is a wikikube-worker in rows a/d with mcrouter regularly talking to codfw mc hosts let me know, I can potentially do the same kind of analysis on traffic there so we can compare the difference?

Wed, Apr 1, 2:29 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

Ok so I gathered stats for the past few days (Mar 27 - Apr 1) of the SYN / SYN-ACK exchanges starting the tcp handshake, and this is the breakdown of RTTs:

Total SYN / SYN-ACK RTTs measured: 146553
Wed, Apr 1, 9:29 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
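
The SYN / SYN-ACK measurement described above amounts to matching each SYN with the SYN-ACK that acknowledges it and taking the time difference. The sketch below illustrates that idea only; the `Packet` record and `syn_synack_rtts` function are hypothetical, and it assumes the capture has already been parsed into such records by some other tool.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical minimal record for one captured TCP packet."""
    ts: float      # capture timestamp in seconds
    src: str
    dst: str
    sport: int
    dport: int
    flags: str     # "S" for SYN, "SA" for SYN-ACK
    seq: int
    ack: int


def syn_synack_rtts(packets):
    """Return RTTs in milliseconds for each SYN matched to its SYN-ACK."""
    # Key: (client, server, client port, server port, expected ack number).
    pending = {}
    rtts_ms = []
    for pkt in sorted(packets, key=lambda p: p.ts):
        if pkt.flags == "S":
            expected_ack = (pkt.seq + 1) % 2**32
            key = (pkt.src, pkt.dst, pkt.sport, pkt.dport, expected_ack)
            # Keep the first SYN only, so a retransmitted SYN lengthens the RTT.
            pending.setdefault(key, pkt.ts)
        elif pkt.flags == "SA":
            # The SYN-ACK travels in the reverse direction of the SYN.
            key = (pkt.dst, pkt.src, pkt.dport, pkt.sport, pkt.ack)
            sent = pending.pop(key, None)
            if sent is not None:
                rtts_ms.append((pkt.ts - sent) * 1000.0)
    return rtts_ms
```

Restricting the measurement to the handshake keeps application processing time on the server out of the result, since the SYN-ACK is generated by the TCP stack rather than by the service.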
cmooney edited P90140 RTTs of SYN/ACK exchanges from mcrouter on wikikube-worker1273 (10.67.160.184) to codfw memcached.
Wed, Apr 1, 8:56 AM
cmooney created P90140 RTTs of SYN/ACK exchanges from mcrouter on wikikube-worker1273 (10.67.160.184) to codfw memcached.
Wed, Apr 1, 8:50 AM

Mon, Mar 30

cmooney triaged T421706: Infrastructure foundations : Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets as Low priority.
Mon, Mar 30, 2:55 PM · Infrastructure-Foundations
cmooney triaged T421238: mr1-eqiad: move from OSPF to BGP as Medium priority.
Mon, Mar 30, 2:55 PM · Patch-For-Review, Infrastructure-Foundations, netops

Fri, Mar 27

cmooney edited P89962 (An Untitled Masterwork).
Fri, Mar 27, 4:47 PM
cmooney created P89962 (An Untitled Masterwork).
Fri, Mar 27, 4:39 PM
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@cmooney yes Effie depooled it IIRC! You can probably use wikikube-worker1273.eqiad.wmnet (@jijiki let's not depool it).

Fri, Mar 27, 3:53 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
cmooney added a comment to T416249: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6].

@cmooney Could you assist with this next week?

Fri, Mar 27, 2:22 PM · fundraising-tech-ops, SRE, DC-Ops, ops-eqiad
cmooney edited P89956 (An Untitled Masterwork).
Fri, Mar 27, 1:53 PM
cmooney created P89956 (An Untitled Masterwork).
Fri, Mar 27, 1:27 PM
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

I'll do another pcap and just focus on SYN / SYN-ACK packets, which will be more reflective of the network latency.

Fri, Mar 27, 1:13 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
cmooney added a comment to T421343: Some traffic still flowing to mw-api-int after the switchover.

Thanks for the write-up @JMeybohm. Definitely an odd one.

Fri, Mar 27, 12:33 PM · Observability-Metrics, Prod-Kubernetes, Kubernetes, Patch-For-Review, ServiceOps new

Thu, Mar 26

cmooney edited P89948 (An Untitled Masterwork).
Thu, Mar 26, 3:40 PM
cmooney edited P89948 (An Untitled Masterwork).
Thu, Mar 26, 3:26 PM
cmooney created P89948 (An Untitled Masterwork).
Thu, Mar 26, 3:17 PM
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

I took a pcap on wikikube-worker1070 for TCP packets to mc1041, and did some comparisons on RTT (i.e. the time between a packet being sent to mc1041 and the response arriving).

Total RTT samples: 280984
Thu, Mar 26, 1:27 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

Much better. @cmooney nothing definitive because there may be some variance but what do you think?

Thu, Mar 26, 11:25 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Wed, Mar 25

cmooney edited P89932 (An Untitled Masterwork).
Wed, Mar 25, 1:15 PM
cmooney created P89932 (An Untitled Masterwork).
Wed, Mar 25, 1:13 PM
cmooney created P89931 (An Untitled Masterwork).
Wed, Mar 25, 1:10 PM

Tue, Mar 24

cmooney closed T420975: Atlas no longer reachable from monitoring on routed ganeti as Resolved.

This should now be working again. Big thanks to @ayounsi for the heavy-lifting with all the puppet patches to add the $INSTALL_HOSTS set.

Tue, Mar 24, 12:42 PM · Infrastructure-Foundations, netops, SRE
cmooney created P89911 (An Untitled Masterwork).
Tue, Mar 24, 12:10 PM

Mon, Mar 23

cmooney added a comment to T420821: Anycast services - depool strategy in terms of BGP routing.

I think a cookbook that takes down doh and durum simultaneously at a site (I assume by changing bird?) would solve this perfectly.

Mon, Mar 23, 7:18 PM · netops, Infrastructure-Foundations, SRE
cmooney added a parent task for T420975: Atlas no longer reachable from monitoring on routed ganeti: Unknown Object (Task).
Mon, Mar 23, 5:40 PM · Infrastructure-Foundations, netops, SRE
cmooney created T420975: Atlas no longer reachable from monitoring on routed ganeti.
Mon, Mar 23, 5:38 PM · Infrastructure-Foundations, netops, SRE
cmooney closed T420820: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP, a subtask of T420821: Anycast services - depool strategy in terms of BGP routing, as Resolved.
Mon, Mar 23, 3:09 PM · netops, Infrastructure-Foundations, SRE
cmooney closed T420820: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP as Resolved.

Ok this should no longer be an issue after updating the wikimedia6 prefix list. Right now, with Wikidough depooled in esams, traffic which lands in esams for wikidough gets sent to eqiad and answered:

cathal@officepc:~$ mtr -b -w -c 5 -6 wikimedia-dns.org 
Start: 2026-03-23T15:06:23+0000
HOST: officepc                                                     Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- pool-ipv6-pd.agg1.srl.blp-srl.eir.ie (redacted)  0.0%     5    0.3   0.4   0.3   0.5   0.1
  2.|-- agg1.srl.blp-srl.eircom.net (2001:bb0:6:a11d::1)              0.0%     5    5.7   5.9   4.5   8.3   1.6
  3.|-- 2001:bb0:6:a197::1                                            0.0%     5    4.5   4.6   4.3   4.9   0.2
  4.|-- 10ge16-1.core1.dub1.he.net (2001:7f8:18::69)                  0.0%     5    5.2   5.4   5.1   5.5   0.2
  5.|-- e0-32.core2.man1.he.net (2001:470:0:410::1)                  20.0%     5   10.0  10.0   9.9  10.3   0.2
  6.|-- ???                                                          100.0     5    0.0   0.0   0.0   0.0   0.0
  7.|-- ae1-380.cr1-esams.wikimedia.org (2001:7f8:1::a501:4907:1)     0.0%     5   19.2  19.4  19.2  19.7   0.2
  8.|-- xe-3-2-1.cr2-eqiad.wikimedia.org (2a02:ec80:300:fe09::1)      0.0%     5   93.6  93.5  93.1  94.0   0.3
  9.|-- wikimedia-dns.org (2001:67c:930::1)                           0.0%     5   92.0  92.1  91.9  92.6   0.3
cathal@officepc:~$ dig -6 +https +nsid www.ietf.org @wikimedia-dns.org
Mon, Mar 23, 3:09 PM · netops, Infrastructure-Foundations, SRE
cmooney created P89900 (An Untitled Masterwork).
Mon, Mar 23, 10:56 AM

Sat, Mar 21

cmooney updated the task description for T420706: Nokia SR-Linux - wonky routing with IPv6 RAs and EVPN Anycast GW.
Sat, Mar 21, 4:24 PM · netops, Infrastructure-Foundations, SRE
cmooney updated the task description for T420821: Anycast services - depool strategy in terms of BGP routing.
Sat, Mar 21, 4:02 PM · netops, Infrastructure-Foundations, SRE
cmooney updated the task description for T420820: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP.
Sat, Mar 21, 3:58 PM · netops, Infrastructure-Foundations, SRE
cmooney updated the task description for T420820: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP.
Sat, Mar 21, 12:33 PM · netops, Infrastructure-Foundations, SRE
cmooney closed T420819: Wikidough: consider regional Anycast addresses as Declined.

FWIW the reason traffic was re-routed to eqiad rather than drmrs was how we have the core routers set up. TL;DR: depooling the service (i.e. stopping the doh VMs announcing the /32 IPs) did not cause the CRs in Amsterdam to cease announcing the /24 and /48 prefixes to the world. The reason was that other anycast IPs in the same range (the durum IPs) were still being announced locally in esams.

Sat, Mar 21, 12:31 PM · Traffic, SRE
cmooney updated the task description for T420820: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP.
Sat, Mar 21, 12:28 PM · netops, Infrastructure-Foundations, SRE
cmooney added a parent task for T420820: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP: T420821: Anycast services - depool strategy in terms of BGP routing.
Sat, Mar 21, 12:27 PM · netops, Infrastructure-Foundations, SRE
cmooney added a subtask for T420821: Anycast services - depool strategy in terms of BGP routing: T420820: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP.
Sat, Mar 21, 12:27 PM · netops, Infrastructure-Foundations, SRE
cmooney created T420821: Anycast services - depool strategy in terms of BGP routing.
Sat, Mar 21, 12:27 PM · netops, Infrastructure-Foundations, SRE
cmooney created T420820: Wikidough unreachable over IPv6 if it is depooled but still announced from a POP.
Sat, Mar 21, 12:16 PM · netops, Infrastructure-Foundations, SRE
cmooney created T420819: Wikidough: consider regional Anycast addresses.
Sat, Mar 21, 11:02 AM · Traffic, SRE

Fri, Mar 20

cmooney added a comment to T420706: Nokia SR-Linux - wonky routing with IPv6 RAs and EVPN Anycast GW.

Ticket 05547487 opened with Nokia.

Fri, Mar 20, 9:42 AM · netops, Infrastructure-Foundations, SRE
cmooney added a comment to T416872: Eqiad: move row-wide vlan gateways to Nokia switches.

Unfortunately we hit another blocker with this so we will have to review the way forward. See T420706.

Fri, Mar 20, 9:16 AM · Infrastructure-Foundations, netops, SRE
cmooney added a parent task for T420706: Nokia SR-Linux - wonky routing with IPv6 RAs and EVPN Anycast GW: T405562: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs.
Fri, Mar 20, 9:15 AM · netops, Infrastructure-Foundations, SRE
cmooney added a subtask for T405562: Eqiad C/D refresh: move legacy switch uplinks to Nokias and migrate Vlan GWs: T420706: Nokia SR-Linux - wonky routing with IPv6 RAs and EVPN Anycast GW.
Fri, Mar 20, 9:15 AM · netops, Infrastructure-Foundations, SRE
cmooney created T420706: Nokia SR-Linux - wonky routing with IPv6 RAs and EVPN Anycast GW.
Fri, Mar 20, 9:14 AM · netops, Infrastructure-Foundations, SRE

Thu, Mar 19

cmooney added a comment to T420342: esams/magru: 185.71.138.0/24 (wikidough) prefix not advertized.

Nice work!

Thu, Mar 19, 3:27 PM · Traffic, Infrastructure-Foundations, netops
cmooney added a comment to T366193: Anycast ns[01].wikimedia.org for IPv4.

I think we should clean up stuff in the interim though, since it will be a while before we can get our hands on the /24. I will need your help with those bits once I get the paperwork out of the way :)

Thu, Mar 19, 2:05 PM · SRE, Traffic

Wed, Mar 18

cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@jijiki thanks for the task. In terms of the network in general, nothing changed the week of Dec 15th last. We had done some work in Nov/Dec but it was all done by then, and we were firmly in change freeze mode that week, so definitely nothing changed.

Wed, Mar 18, 12:31 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Tue, Mar 17

cmooney lowered the priority of T411054: Nokia SR-Linux DHCP Relay Bug from Medium to Low.

Ok all vxlan tunnels right now on row c/d leaf switches to ssw1-d1-eqiad and ssw1-d8-eqiad have a valid vxlan tunnel id. So unless something causes that to change (shouldn't) we should not hit this issue again.

Tue, Mar 17, 3:15 PM · ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney closed T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment as Resolved.

Ok this work is now complete. Only had to reset the tunnel on lsw1-d4-eqiad, as it was the only one with an ID of '1' going to ssw1-d8.

Tue, Mar 17, 3:14 PM · Infrastructure-Foundations, netops, SRE
cmooney closed T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment, a subtask of T411054: Nokia SR-Linux DHCP Relay Bug, as Resolved.
Tue, Mar 17, 3:14 PM · ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney added a subtask for T411054: Nokia SR-Linux DHCP Relay Bug: T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment.
Tue, Mar 17, 1:43 PM · ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney added a parent task for T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment: T411054: Nokia SR-Linux DHCP Relay Bug.
Tue, Mar 17, 1:43 PM · Infrastructure-Foundations, netops, SRE
cmooney created T420351: Drain ssw1-d8-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment.
Tue, Mar 17, 1:43 PM · Infrastructure-Foundations, netops, SRE
cmooney created P89874 Set cr1-eqiad to vrrp master for all vlans.
Tue, Mar 17, 1:40 PM
cmooney edited P89873 (An Untitled Masterwork).
Tue, Mar 17, 1:27 PM
cmooney created P89873 (An Untitled Masterwork).
Tue, Mar 17, 1:24 PM
cmooney created P89872 (An Untitled Masterwork).
Tue, Mar 17, 1:19 PM
cmooney added a comment to T419996: cloudcumin not able to communicate with openstack.eqiad1.wikimediacloud.org:25000 anymore.

FWIW I agree it'd be better if the web proxy could be used here, as conceptually this is "private WMF host needs access to external internet IP".

Tue, Mar 17, 1:00 PM · Cloud-VPS, cloud-services-team
cmooney closed T420159: Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT, a subtask of T411054: Nokia SR-Linux DHCP Relay Bug, as Declined.
Tue, Mar 17, 12:48 PM · ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney closed T420159: Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT as Declined.

This won't be needed now, we were able to reset the tunnels for this switch without disrupting traffic to the rack.

A:lsw1-c7-eqiad# show network-instance default tunnel-table ipv4 | grep "10.64.128.17\|10.64.128.18"
| 10.64.128.17/32 | vxlan | vxlan | 27 | Y | 8 | 0 | 2026-03-17T12:02:30.154Z | 10.64.129.68 | ethernet-1/56.0 |
| 10.64.128.18/32 | vxlan | vxlan | 11 | Y | 8 | 0 | 2025-10-02T18:41:01.245Z | 10.64.129.70 | ethernet-1/55.0 |
Tue, Mar 17, 12:48 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney closed T420158: Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT, a subtask of T411054: Nokia SR-Linux DHCP Relay Bug, as Declined.
Tue, Mar 17, 12:47 PM · ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney closed T420158: Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT as Declined.

This won't be required now, we have reset the tunnels without disrupting traffic to the hosts in the rack.

A:lsw1-c2-eqiad# show network-instance default tunnel-table ipv4 | grep "10.64.128.17\|10.64.128.18"
| 10.64.128.17/32 | vxlan | vxlan | 27 | Y | 8 | 0 | 2026-03-17T11:56:33.887Z | 10.64.129.26 | ethernet-1/56.0 |
| 10.64.128.18/32 | vxlan | vxlan | 12 | Y | 8 | 0 | 2025-10-02T18:49:18.796Z | 10.64.129.28 | ethernet-1/55.0 |
Tue, Mar 17, 12:47 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney closed T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment, a subtask of T411054: Nokia SR-Linux DHCP Relay Bug, as Resolved.
Tue, Mar 17, 12:43 PM · ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney closed T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment as Resolved.

Work is all complete. BGP sessions to ssw1-d1-eqiad were reset on the following switches, which all had tunnels with ID 1 towards it. No packet loss to servers was detected:

lsw1-c2-eqiad
lsw1-c3-eqiad
lsw1-c4-eqiad
lsw1-c6-eqiad
lsw1-c7-eqiad
lsw1-d1-eqiad
lsw1-d3-eqiad
lsw1-d8-eqiad
Tue, Mar 17, 12:42 PM · Infrastructure-Foundations, netops
cmooney edited P89871 (An Untitled Masterwork).
Tue, Mar 17, 11:33 AM
cmooney created P89871 (An Untitled Masterwork).
Tue, Mar 17, 11:28 AM

Mon, Mar 16

cmooney added a comment to T415743: Inbound errors on interface cr2-magru:xe-0/1/0 (Transit: EdgeUno (E1-SER-7853-IP) {#70091}).

Please re-drain this link Wednesday in advance of this work, thank you!

Mon, Mar 16, 5:46 PM · ops-magru
cmooney triaged T419919: Consider reducing verbosity of IRC logging as Low priority.
Mon, Mar 16, 2:29 PM · SRE, Infrastructure-Foundations
cmooney triaged T420159: Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT as Low priority.
Mon, Mar 16, 2:27 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney triaged T420158: Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT as Low priority.
Mon, Mar 16, 2:27 PM · Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney triaged T419992: Alert if calico BGP sessions are not established on any kubernetes worker as Low priority.
Mon, Mar 16, 2:22 PM · Sustainability (Incident Followup), Infrastructure-Foundations, Data-Platform-SRE
cmooney added a comment to T411054: Nokia SR-Linux DHCP Relay Bug.

Will all of the switches in rows C & D be getting this configuration change?

Mon, Mar 16, 12:08 PM · ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney updated the task description for T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment.
Mon, Mar 16, 9:54 AM · Infrastructure-Foundations, netops
cmooney added a subtask for T411054: Nokia SR-Linux DHCP Relay Bug: T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment.
Mon, Mar 16, 9:53 AM · ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney added a parent task for T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment: T411054: Nokia SR-Linux DHCP Relay Bug.
Mon, Mar 16, 9:52 AM · Infrastructure-Foundations, netops
cmooney updated the task description for T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment.
Mon, Mar 16, 9:52 AM · Infrastructure-Foundations, netops
cmooney created T420180: Drain ssw1-d1-eqiad and reset BGP EVPN sessions to force new vxlan tunnel establishment.
Mon, Mar 16, 9:51 AM · Infrastructure-Foundations, netops
cmooney added a comment to T420159: Eqiad: lsw1-c7-eqiad BGP maintenance/ Thursday 19th at 10:00 am CDT.

Can we hold off on any work related to this? I am planning to drain the spine switches in order tomorrow morning and will reset the tunnels on all the switches, so we shouldn't have to arrange downtime with users.

Mon, Mar 16, 9:49 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps new, netops, Infrastructure-Foundations, SRE
cmooney added a comment to T420158: Eqiad: lsw1-c2-eqiad BGP maintenance/ Tuesday 17th at 9:30 CDT.

Can we hold off on any work related to this? I am planning to drain the spine switches in order tomorrow morning and will reset the tunnels on all the switches, so we shouldn't have to arrange downtime with users.

Mon, Mar 16, 9:49 AM · Data-Platform-SRE (2026-03-06 - 2026-03-27), ServiceOps new, netops, Infrastructure-Foundations, SRE

Mar 13 2026

cmooney added a comment to T419992: Alert if calico BGP sessions are not established on any kubernetes worker.

Thanks for the task @BTullis

Mar 13 2026, 4:30 PM · Sustainability (Incident Followup), Infrastructure-Foundations, Data-Platform-SRE
cmooney added a comment to T412733: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1.

@Papaul please tell them to keep the case low as they have not yet fixed it.

Mar 13 2026, 1:45 PM · netops, Infrastructure-Foundations, SRE