Today

cmooney created P92481 ulsfo: netbox tidy-up for ulsfo to remove blockers for T424611.
Tue, May 12, 12:13 PM
cmooney added a comment to T424611: POPs - free up 2xQSFP ports.

Agh hit a bit of a hiccup with this (really should have anticipated). Take drmrs for example:

cmooney@asw1-b12-drmrs> show configuration interfaces et-0/0/48 
description "Core: cr1-drmrs:et-0/0/1 {#D0100}";
mtu 9192;
unit 0 {
    family inet {
        address 185.15.58.143/31;
    }
    family inet6 {
        address 2a02:ec80:600:fe06::2/64;
    }
}
Tue, May 12, 10:50 AM · Infrastructure-Foundations, netops

Yesterday

cmooney added a comment to T424683: Network telemetry - collect device sub-interface statistics with gnmic.

Nice!

We can also filter out the .16386, .16384, .16385, .16383, .32769 - weird juniper...
as well as .0 like we do for SNMP (for LibreNMS) - see T283060: SNMP: filter out default sub interfaces

We don’t get the .0 via gnmi so one less worry.

But yep, it sounds like it makes sense to drop the others if needed. I’ll try to validate which ones we get back in stats.
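
As a rough illustration of the filtering idea above, here is a minimal Python sketch. It is not the gnmic configuration itself (that is not shown in this feed); the function names and the `drop_unit_zero` flag are hypothetical, and only the unit numbers come from the comment.

```python
import re

# Unit numbers named in the comment above as Juniper-internal sub-interfaces.
INTERNAL_UNITS = {16383, 16384, 16385, 16386, 32769}

# A sub-interface name ends in ".<unit>", e.g. "et-0/0/48.16386".
UNIT_RE = re.compile(r"\.(\d+)$")


def is_unwanted_subinterface(name: str, drop_unit_zero: bool = False) -> bool:
    """Return True if `name` ends in a unit number we want to drop."""
    match = UNIT_RE.search(name)
    if match is None:
        return False  # no unit suffix, so not a sub-interface
    unit = int(match.group(1))
    if unit in INTERNAL_UNITS:
        return True
    # Unit 0 is only dropped on request (hypothetical flag).
    return drop_unit_zero and unit == 0


def filter_subinterfaces(names, drop_unit_zero=False):
    """Keep only the interface names we would want stats for."""
    return [n for n in names if not is_unwanted_subinterface(n, drop_unit_zero)]
```

For example, `filter_subinterfaces(["et-0/0/48.16386", "xe-0/0/1.100"])` keeps only `xe-0/0/1.100`.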

Mon, May 11, 6:31 PM · Patch-For-Review, Infrastructure-Foundations, netops, SRE
cmooney added a comment to T424611: POPs - free up 2xQSFP ports.

So anyway, for now I'd propose we add the following vlans for this:

341  core1-bw27-esams
342  core1-by27-esams
Mon, May 11, 5:40 PM · Infrastructure-Foundations, netops
cmooney added a comment to T425921: cr2-drmrs<->asw1-b12-drmrs down.

All looks good with the link, traffic flowing again.

Mon, May 11, 5:36 PM · ops-drmrs
cmooney placed T425921: cr2-drmrs<->asw1-b12-drmrs down up for grabs.

It still shows no light incoming to asw1-b12-drmrs on lane 3:

cmooney@asw1-b12-drmrs> show interfaces diagnostics optics xe-0/0/50:2 | match "Laser receiver power" | match dB    
    Laser receiver power                      :  0.000 mW / -40.04 dBm
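
For reference, the two figures on that line are the same reading in different units: dBm is 10·log10 of the power in milliwatts, so -40.04 dBm is roughly 0.0001 mW, which rounds to the 0.000 mW shown. A small Python sketch of the conversion and of parsing that output line follows; the helper names are hypothetical and the regex only assumes the line format pasted above.

```python
import math
import re

# Matches the "Laser receiver power" line format pasted above.
RX_POWER_RE = re.compile(
    r"Laser receiver power\s*:\s*([\d.]+) mW / (-?[\d.]+) dBm"
)


def dbm_to_mw(dbm: float) -> float:
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (dbm / 10)


def mw_to_dbm(mw: float) -> float:
    """Convert milliwatts to dBm (mw must be greater than zero)."""
    return 10 * math.log10(mw)


def parse_rx_power(line: str):
    """Return (milliwatts, dBm) from a diagnostics line, or None if no match."""
    match = RX_POWER_RE.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))
```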
Mon, May 11, 4:49 PM · ops-drmrs
cmooney added a comment to T424611: POPs - free up 2xQSFP ports.

I suggest core1 instead of corebgp but that lgtm!

Mon, May 11, 4:44 PM · Infrastructure-Foundations, netops
cmooney closed T425813: Nokia SR-Linux: BFD broken with default homer configuration as Resolved.

Patch merged and config pushed to all Nokia devices now.

Mon, May 11, 12:55 PM · netops, Infrastructure-Foundations, SRE
cmooney added a comment to T425921: cr2-drmrs<->asw1-b12-drmrs down.

@RobH can you raise a task with Digital Realty to take a look at this in MRS2?

Mon, May 11, 10:14 AM · ops-drmrs
cmooney added a comment to T425921: cr2-drmrs<->asw1-b12-drmrs down.

Very weird, box still sees the optic but thinks interface is invalid.

cmooney@asw1-b12-drmrs> show chassis pic fpc-slot 0 pic-slot 0 | match "^  50" 
  50   40GBASE SR4       MM    FS                 QSFP-SR4-40G      850 nm                    0.0           REV 01   SFF-8436 ver n/a
cmooney@asw1-b12-drmrs> show interfaces et-0/0/50 detail                          
error: device et-0/0/50 not found
Mon, May 11, 9:52 AM · ops-drmrs
cmooney added a comment to T424683: Network telemetry - collect device sub-interface statistics with gnmic.

Nice!

We can also filter out the .16386, .16384, .16385, .16383, .32769 - weird juniper...
as well as .0 like we do for SNMP (for LibreNMS) - see T283060: SNMP: filter out default sub interfaces

Mon, May 11, 9:09 AM · Patch-For-Review, Infrastructure-Foundations, netops, SRE

Sat, May 9

cmooney added a comment to T425674: neutron vlan interfaces should not enable jumbo frames.

It seems like the same problem is present on all VLAN interfaces on cloudvirt/cloudcontrol hosts:

taavi@cloudvirt1076 ~ $ ip link show vlan1105
13: vlan1105@eno12399np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether 04:32:01:db:42:d0 brd ff:ff:ff:ff:ff:ff
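
To check this across many interfaces, something like the following Python sketch could pull the MTU out of `ip link show` output. The function names are hypothetical, and treating anything above 1500 bytes as "jumbo" is an assumption made for illustration.

```python
import re

# Matches the first line of each `ip link show` entry, e.g.
# "13: vlan1105@eno12399np0: <...> mtu 9000 ..."
LINK_RE = re.compile(
    r"^\d+:\s+([^\s:@]+)(?:@[^\s:]+)?:\s+<[^>]*>\s+mtu\s+(\d+)",
    re.MULTILINE,
)

STANDARD_MTU = 1500  # assumption: anything larger counts as jumbo


def parse_ip_link(output: str):
    """Return a list of (interface name, mtu) from `ip link show` output."""
    return [(m.group(1), int(m.group(2))) for m in LINK_RE.finditer(output)]


def jumbo_interfaces(output: str):
    """Return the names of interfaces whose MTU exceeds STANDARD_MTU."""
    return [name for name, mtu in parse_ip_link(output) if mtu > STANDARD_MTU]
```

Run against the output above, `jumbo_interfaces` would report `vlan1105`.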
Sat, May 9, 9:03 AM · tools-infrastructure-team, Cloud-VPS

Fri, May 8

cmooney created T425813: Nokia SR-Linux: BFD broken with default homer configuration.
Fri, May 8, 6:56 PM · netops, Infrastructure-Foundations, SRE
cmooney created P92444 Allowed BGP / MBFD ranges after change.
Fri, May 8, 12:55 PM
cmooney created P92443 IPv6 Nokia cpm ACL - BGP / MBFD rules after change.
Fri, May 8, 12:49 PM
cmooney added a comment to T424640: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41.

So it seems the reason for this is some ferm complexity. When puppet signals a 'refresh' to the service, it asks ferm itself to reload its rules; however, ferm does not pick up on the changes to these files for some reason. The ultimate answer is for hosts to move to nftables, so it's not worth spending time to fix.

Fri, May 8, 11:41 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE
cmooney created P92439 (An Untitled Masterwork).
Fri, May 8, 11:08 AM
cmooney added a comment to T424640: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41.

Just an update here.

Fri, May 8, 10:57 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE

Thu, May 7

cmooney created P92420 Nokia - show int desc.
Thu, May 7, 6:12 PM

Wed, May 6

cmooney added a comment to T425300: lvs on cloudelastic1012 is misconfigured.

10.2.2.30 is the IP for search.svc.eqiad.wmnet, so it looks like cloudelastic1012 is set up for that rather than cloudelastic.wikimedia.org.

Wed, May 6, 5:59 PM · Data-Platform-SRE (2026-04-24 - 2026-05-15), Discovery-Search, CirrusSearch

Fri, May 1

cmooney added a comment to T422816: Observability: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets.

It doesn't seem to be a propagation error related to ferm, though if I had to guess it is something related to ferm. No errors were reported after it, so we should be good.

Fri, May 1, 11:21 AM · observability, SRE
cmooney added a comment to T422816: Observability: Re-IP codfw private baremetal hosts to new per-rack vlans/subnets.

@ayounsi @cmooney would you be able to have a look to see if anything stands out from netops perspective?

Fri, May 1, 10:55 AM · observability, SRE
cmooney added a comment to T408892: ULSFO: New switch configuration.

@Papaul thanks. I see most of those don't exist even for IPv4, nor are there any IPv6 addresses listed, so I'm not sure exactly what might need to be added.

Fri, May 1, 10:34 AM · Patch-For-Review, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo
cmooney created P92133 (An Untitled Masterwork).
Fri, May 1, 10:02 AM

Thu, Apr 30

cmooney created P92131 (An Untitled Masterwork).
Thu, Apr 30, 9:10 PM

Wed, Apr 29

cmooney edited parent tasks for T424611: POPs - free up 2xQSFP ports, added: Unknown Object (Task); removed: Unknown Object (Task).
Wed, Apr 29, 12:26 PM · Infrastructure-Foundations, netops

Tue, Apr 28

cmooney added a comment to T424683: Network telemetry - collect device sub-interface statistics with gnmic.

I had a stab at this in the above patch. Some notes on the event processors added:

Tue, Apr 28, 10:49 PM · Patch-For-Review, Infrastructure-Foundations, netops, SRE
cmooney updated the task description for T424683: Network telemetry - collect device sub-interface statistics with gnmic.
Tue, Apr 28, 5:11 PM · Patch-For-Review, Infrastructure-Foundations, netops, SRE
cmooney created T424683: Network telemetry - collect device sub-interface statistics with gnmic.
Tue, Apr 28, 5:09 PM · Patch-For-Review, Infrastructure-Foundations, netops, SRE
cmooney updated the task description for T424639: Network QoS: expand support to Nokia switches.
Tue, Apr 28, 3:55 PM · netops, Infrastructure-Foundations, SRE
cmooney added a parent task for T424639: Network QoS: expand support to Nokia switches: Unknown Object (Task).
Tue, Apr 28, 3:50 PM · netops, Infrastructure-Foundations, SRE
cmooney updated subscribers of T424640: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41.
Tue, Apr 28, 3:30 PM · Patch-For-Review, netops, Infrastructure-Foundations, SRE
cmooney renamed T424611: POPs - free up 2xQSFP ports from POPs - free up 2x100G ports to POPs - free up 2xQSFP ports.
Tue, Apr 28, 3:26 PM · Infrastructure-Foundations, netops
cmooney added a comment to T424611: POPs - free up 2xQSFP ports.

My basic thoughts on this are:

Tue, Apr 28, 3:24 PM · Infrastructure-Foundations, netops
cmooney updated the task description for T424640: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41.
Tue, Apr 28, 10:48 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE
cmooney updated the task description for T424640: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41.
Tue, Apr 28, 10:48 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE
cmooney updated the task description for T424640: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41.
Tue, Apr 28, 10:45 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE
cmooney added a subtask for T424639: Network QoS: expand support to Nokia switches: T424640: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41.
Tue, Apr 28, 10:34 AM · netops, Infrastructure-Foundations, SRE
cmooney added a parent task for T424640: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41: T424639: Network QoS: expand support to Nokia switches.
Tue, Apr 28, 10:34 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE
cmooney created T424640: Network QoS: use the 'CS1' DSCP code point for low-priority instead of AF41.
Tue, Apr 28, 10:34 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE
cmooney created T424639: Network QoS: expand support to Nokia switches.
Tue, Apr 28, 10:13 AM · netops, Infrastructure-Foundations, SRE

Mon, Apr 27

cmooney edited P91645 (An Untitled Masterwork).
Mon, Apr 27, 4:21 PM
cmooney created P91645 (An Untitled Masterwork).
Mon, Apr 27, 4:20 PM
cmooney added a comment to T390052: Enable gNMI on SRX devices and fasw.

Now we need to figure out if it's worth upgrading the management routers or not, as it's more recent than the current Junos Recommended Version.

Mon, Apr 27, 1:46 PM · Patch-For-Review, Infrastructure-Foundations, netops
cmooney closed T421238: mr1-eqiad: move from OSPF to BGP as Resolved.
Mon, Apr 27, 1:21 PM · Infrastructure-Foundations, netops

Mon, Apr 20

cmooney triaged T423852: Add calico network alerting as Medium priority.
Mon, Apr 20, 2:21 PM · Sustainability (Incident Followup), ServiceOps-good-first-task, ServiceOps new, observability, Prod-Kubernetes, Kubernetes

Fri, Apr 17

cmooney edited P91109 (An Untitled Masterwork).
Fri, Apr 17, 5:33 PM
cmooney edited P91109 (An Untitled Masterwork).
Fri, Apr 17, 5:21 PM
cmooney added a comment to P91109 (An Untitled Masterwork).

IPv4 looks ok:

Target: 208.80.154.224
Location: USA, CA, Fremont
Provider: Linode
MTR captured UTC: 2026-04-17T17:02:48.834Z
MTR captured local: Fri Apr 17 2026 18:02:48 GMT+0100 (Irish Standard Time)
Fri, Apr 17, 5:07 PM
cmooney created P91109 (An Untitled Masterwork).
Fri, Apr 17, 4:55 PM
cmooney added a comment to P91106 (An Untitled Masterwork).

3 working example traces with mitigation off:

Target: 2620:0:861:ed1a::1
Location: USA, TX, Houston
Provider: Cubepath
MTR captured UTC: 2026-04-17T16:47:50.831Z
MTR captured local: Fri Apr 17 2026 17:47:50 GMT+0100 (Irish Standard Time)
Fri, Apr 17, 4:48 PM
cmooney created P91106 (An Untitled Masterwork).
Fri, Apr 17, 4:34 PM
cmooney edited P91028 (An Untitled Masterwork).
Fri, Apr 17, 9:05 AM
cmooney created P91028 (An Untitled Masterwork).
Fri, Apr 17, 9:03 AM

Wed, Apr 15

cmooney reopened T421238: mr1-eqiad: move from OSPF to BGP as "Open".

Re-opening this so we can look at the automation changes that are needed to remove the OSPF stuff (see here).

Wed, Apr 15, 7:49 PM · Infrastructure-Foundations, netops
cmooney renamed T423430: Don't announce OSPF routes in unicast BGP on Nokia SR-Linux from No not announce OSPF routes in unicast BGP on Nokia SR-Linux to Don't announce OSPF routes in unicast BGP on Nokia SR-Linux.
Wed, Apr 15, 2:59 PM · Infrastructure-Foundations, netops, SRE
cmooney added a parent task for T423430: Don't announce OSPF routes in unicast BGP on Nokia SR-Linux: T423384: Investigate internal rejected prefixes.
Wed, Apr 15, 2:53 PM · Infrastructure-Foundations, netops, SRE
cmooney added a subtask for T423384: Investigate internal rejected prefixes: T423430: Don't announce OSPF routes in unicast BGP on Nokia SR-Linux.
Wed, Apr 15, 2:53 PM · Infrastructure-Foundations, netops
cmooney created T423430: Don't announce OSPF routes in unicast BGP on Nokia SR-Linux.
Wed, Apr 15, 2:53 PM · Infrastructure-Foundations, netops, SRE

Mon, Apr 13

cmooney added a comment to T356877: Increase visibility of kubernetes network status.

We can also run the prometheus bird exporter as a sidecar to calico-node (a container image needs to be built and some yaml needs to be edited), which should give us more standardized metrics. We could add that as a stretch goal for this task.

Mon, Apr 13, 12:58 PM · Sustainability (Incident Followup), ServiceOps-good-first-task, Infrastructure-Foundations, netops, ServiceOps new, observability, Prod-Kubernetes, Kubernetes

Apr 10 2026

cmooney added a comment to T356877: Increase visibility of kubernetes network status.

Broadly the patch submitted looked good to me, though I see it was abandoned.

Apr 10 2026, 2:41 PM · Sustainability (Incident Followup), ServiceOps-good-first-task, Infrastructure-Foundations, netops, ServiceOps new, observability, Prod-Kubernetes, Kubernetes
cmooney added a comment to T422043: Create public vlans in eqiad and codfw.

D row has no specialty rack at all so we can easily work around that for future private vlan installs.

Apr 10 2026, 12:06 PM · Infrastructure-Foundations, netops

Apr 9 2026

cmooney updated the task description for T422525: cr1-esams failed upgrade.
Apr 9 2026, 9:23 AM · netops, Infrastructure-Foundations, SRE

Apr 8 2026

cmooney added a comment to T422525: cr1-esams failed upgrade.

Ok Juniper came back with the following:

I found that your version 23.4R2-S7.4 is hitting the PR1933049. Unfortunately, this is a confidential PR, but in order to get this issue resolved and avoid further issues, you need to upgrade to a slightly higher version.
Apr 8 2026, 4:14 PM · netops, Infrastructure-Foundations, SRE

Apr 7 2026

cmooney updated the task description for T422525: cr1-esams failed upgrade.
Apr 7 2026, 4:49 PM · netops, Infrastructure-Foundations, SRE
cmooney updated the task description for T422525: cr1-esams failed upgrade.
Apr 7 2026, 4:47 PM · netops, Infrastructure-Foundations, SRE
cmooney added a subtask for T416450: esams: upgrade routers & switches (2026): T422525: cr1-esams failed upgrade.
Apr 7 2026, 3:56 PM · Infrastructure-Foundations, netops
cmooney added a parent task for T422525: cr1-esams failed upgrade: T416450: esams: upgrade routers & switches (2026).
Apr 7 2026, 3:55 PM · netops, Infrastructure-Foundations, SRE
cmooney created T422525: cr1-esams failed upgrade.
Apr 7 2026, 3:55 PM · netops, Infrastructure-Foundations, SRE
cmooney created P90315 rpd logs.
Apr 7 2026, 1:48 PM
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

Ok the results from wikikube-worker1258 (row B) don't seem to show the same percentage of longer RTT packets as wikikube-worker1273 (row D - in above comment).

Bucket            Count     Pct  Bar
--------------------------------------------------
> 500ms               1  0.001%  
250 - 500ms           0  0.000%  
100 - 250ms           0  0.000%  
50 - 100ms            0  0.000%  
40 - 50ms             2  0.001%  
30 - 40ms        136548  99.998%  █████████████████████████████████████████████████
20 - 30ms             0  0.000%  
10 - 20ms             0  0.000%  
5 - 10ms              0  0.000%  
2 - 5ms               0  0.000%  
1 - 2ms               0  0.000%  
0 - 1ms               0  0.000%
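
A breakdown like this can be produced with a few lines of Python. The sketch below is not the script used for the table above; the bucket edges are copied from the table, while the function names and the choice to put a value that sits exactly on an edge into the higher bucket are assumptions.

```python
from bisect import bisect_right

# Lower edges of each bucket in milliseconds, taken from the table above.
EDGES_MS = [0, 1, 2, 5, 10, 20, 30, 40, 50, 100, 250, 500]


def bucket_label(index: int) -> str:
    """Human-readable label for the bucket starting at EDGES_MS[index]."""
    if index == len(EDGES_MS) - 1:
        return f"> {EDGES_MS[-1]}ms"
    return f"{EDGES_MS[index]} - {EDGES_MS[index + 1]}ms"


def rtt_histogram(rtts_ms):
    """Return (label, count, percent) rows, largest bucket first."""
    counts = [0] * len(EDGES_MS)
    for rtt in rtts_ms:
        # Index of the last edge that is <= rtt; clamp negatives to bucket 0.
        index = max(bisect_right(EDGES_MS, rtt) - 1, 0)
        counts[index] += 1
    total = len(rtts_ms)
    return [
        (bucket_label(i), counts[i], 100.0 * counts[i] / total if total else 0.0)
        for i in reversed(range(len(EDGES_MS)))
    ]
```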
Apr 7 2026, 11:00 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Apr 2 2026

cmooney created P90248 (An Untitled Masterwork).
Apr 2 2026, 6:31 PM
cmooney added a comment to T422043: Create public vlans in eqiad and codfw.

Is it maybe an idea to re-use some of the existing vlans? Like if we assign rack A1 as the public rack for the A/B POD we could add all the hosts to public1-a-eqiad as we move them? And then when complete rename the vlan to public1-a1-eqiad?

Apr 2 2026, 3:57 PM · Infrastructure-Foundations, netops
cmooney added a comment to T422130: External store unreachable: "Database servers in clusterXX are overloaded".

We are hopeful the situation should have improved after codfw was repooled, adding additional capacity. Root cause of the circuit breaking is still being investigated.

Apr 2 2026, 1:57 PM · Wikimedia-Incident, SRE, DBA
cmooney claimed T417873: eqiad: upgrade routers (2026).
Apr 2 2026, 9:45 AM · Infrastructure-Foundations, netops
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@cmooney I added some info in T420223#11753137, where I tested jitter seen by MTR on a worker in row A/B vs a worker in C/D: the former doesn't show it. I also tried on another couple of nodes, but I don't have anything definitive from a statistical point of view. I can collect more info if you want!

Apr 2 2026, 9:44 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Apr 1 2026

cmooney added a comment to T422043: Create public vlans in eqiad and codfw.

If we are going to have one public-enabled rack per "pod" then should we not have just one vlan assigned for codfw row E/F (and then one also for a/b and c/d)?

Apr 1 2026, 3:56 PM · Infrastructure-Foundations, netops
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

If there is a wikikube-worker in rows a/d with mcrouter regularly talking to codfw mc hosts let me know, I can potentially do the same kind of analysis on traffic there so we can compare the difference?

Apr 1 2026, 2:29 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

Ok so I gathered stats for the past few days (Mar 27 - Apr 1) of the SYN / SYN-ACK exchanges starting the tcp handshake, and this is the breakdown of RTTs:

Total SYN / SYN-ACK RTTs measured: 146553
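
The measurement idea is to pair each SYN with the SYN-ACK coming back on the reversed 4-tuple and take the time difference. A minimal Python sketch of that pairing follows; it works on already-decoded packet records rather than a real pcap, and the `Packet` type and `syn_rtts` function are hypothetical names, not the tooling actually used here.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    ts: float   # capture timestamp in seconds
    src: str
    sport: int
    dst: str
    dport: int
    syn: bool
    ack: bool


def syn_rtts(packets):
    """Return RTTs (seconds) between each SYN and its matching SYN-ACK."""
    pending = {}  # (src, sport, dst, dport) -> timestamp of first SYN
    rtts = []
    for pkt in packets:
        if pkt.syn and not pkt.ack:
            # Keep the first SYN only, so retransmissions lengthen the RTT.
            pending.setdefault((pkt.src, pkt.sport, pkt.dst, pkt.dport), pkt.ts)
        elif pkt.syn and pkt.ack:
            # The SYN-ACK travels in the reverse direction of the SYN.
            sent = pending.pop((pkt.dst, pkt.dport, pkt.src, pkt.sport), None)
            if sent is not None:
                rtts.append(pkt.ts - sent)
    return rtts
```

Measuring only the handshake avoids counting server-side processing time for memcached requests, which is why it should track network latency more closely.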
Apr 1 2026, 9:29 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
cmooney edited P90140 RTTs of SYN/ACK exchanges from mcrouter on wikikube-worker1273 (10.67.160.184) to codfw memcached.
Apr 1 2026, 8:56 AM
cmooney created P90140 RTTs of SYN/ACK exchanges from mcrouter on wikikube-worker1273 (10.67.160.184) to codfw memcached.
Apr 1 2026, 8:50 AM

Mar 30 2026

cmooney triaged T421706: Infrastructure foundations : Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets as Low priority.
Mar 30 2026, 2:55 PM · Infrastructure-Foundations
cmooney triaged T421238: mr1-eqiad: move from OSPF to BGP as Medium priority.
Mar 30 2026, 2:55 PM · Infrastructure-Foundations, netops

Mar 27 2026

cmooney edited P89962 (An Untitled Masterwork).
Mar 27 2026, 4:47 PM
cmooney created P89962 (An Untitled Masterwork).
Mar 27 2026, 4:39 PM
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

@cmooney yes Effie depooled it IIRC! You can probably use wikikube-worker1273.eqiad.wmnet (@jijiki let's not depool it).

Mar 27 2026, 3:53 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
cmooney added a comment to T416249: Q3:rack/setup/install frdata1003, frmx1002, frqueue100[5-6].

@cmooney Could you assist with this next week?

Mar 27 2026, 2:22 PM · fundraising-tech-ops, SRE, DC-Ops, ops-eqiad
cmooney edited P89956 (An Untitled Masterwork).
Mar 27 2026, 1:53 PM
cmooney created P89956 (An Untitled Masterwork).
Mar 27 2026, 1:27 PM
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

I'll do another pcap and just focus on SYN / SYN-ACK packets, which will be more reflective of the network latency.

Mar 27 2026, 1:13 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
cmooney added a comment to T421343: Some traffic still flowing to mw-api-int after the switchover.

Thanks for the write-up @JMeybohm. Definitely an odd one.

Mar 27 2026, 12:33 PM · Observability-Metrics, Prod-Kubernetes, Kubernetes, ServiceOps new

Mar 26 2026

cmooney edited P89948 (An Untitled Masterwork).
Mar 26 2026, 3:40 PM
cmooney edited P89948 (An Untitled Masterwork).
Mar 26 2026, 3:26 PM
cmooney created P89948 (An Untitled Masterwork).
Mar 26 2026, 3:17 PM
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

I took a pcap on wikikube-worker1070 for TCP packets to mc1041, and did some comparisons on RTT (i.e. the time between a packet being sent to mc1041 and the response arriving).

Total RTT samples: 280984
Mar 26 2026, 1:27 PM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores
cmooney added a comment to T420223: High (relatively) number of memcached errors in eqiad.

Much better. @cmooney nothing definitive because there may be some variance but what do you think?

Mar 26 2026, 11:25 AM · Infrastructure-Foundations, ServiceOps new, ServiceOps-Datastores

Mar 25 2026

cmooney edited P89932 (An Untitled Masterwork).
Mar 25 2026, 1:15 PM
cmooney created P89932 (An Untitled Masterwork).
Mar 25 2026, 1:13 PM
cmooney created P89931 (An Untitled Masterwork).
Mar 25 2026, 1:10 PM

Mar 24 2026

cmooney closed T420975: Atlas no longer reachable from monitoring on routed ganeti as Resolved.

This should now be working again. Big thanks to @ayounsi for the heavy-lifting with all the puppet patches to add the $INSTALL_HOSTS set.

Mar 24 2026, 12:42 PM · Infrastructure-Foundations, netops, SRE