
ayounsi (Arzhel Younsi)
Staff Network SRE

User Details

User Since
Apr 3 2017, 6:23 PM (414 w, 5 d)
Availability
Available
IRC Nick
xionox
LDAP User
Ayounsi
MediaWiki User
AYounsi (WMF) [ Global Accounts ]

Recent Activity

Wed, Mar 12

ayounsi added a comment to T372457: Remove librenms -> graphite integration, replace with gnmi.

@dcaro you might find this useful https://gerrit.wikimedia.org/r/c/operations/alerts/+/1126966

Wed, Mar 12, 2:07 PM · Observability-Metrics, SRE Observability (FY2024/2025-Q3), Cloud-VPS, cloud-services-team
ayounsi closed T388642: gnmi_interfaces_interface_state_oper_status missing from most devices, a subtask of T388641: Migrate network icinga alerts to gNMI/prometheus, as Resolved.
Wed, Mar 12, 11:41 AM · Patch-For-Review, Infrastructure-Foundations, Observability-Alerting, netops
ayounsi closed T388642: gnmi_interfaces_interface_state_oper_status missing from most devices as Resolved.

Chatted about it with Cathal on IRC, the gNMIc daemon just needed a restart.

Wed, Mar 12, 11:40 AM · Infrastructure-Foundations, netops
ayounsi added a subtask for T388641: Migrate network icinga alerts to gNMI/prometheus: T388642: gnmi_interfaces_interface_state_oper_status missing from most devices.
Wed, Mar 12, 10:54 AM · Patch-For-Review, Infrastructure-Foundations, Observability-Alerting, netops
ayounsi added a parent task for T388642: gnmi_interfaces_interface_state_oper_status missing from most devices: T388641: Migrate network icinga alerts to gNMI/prometheus.
Wed, Mar 12, 10:54 AM · Infrastructure-Foundations, netops
ayounsi updated the task description for T388641: Migrate network icinga alerts to gNMI/prometheus.
Wed, Mar 12, 10:53 AM · Patch-For-Review, Infrastructure-Foundations, Observability-Alerting, netops
ayounsi created T388642: gnmi_interfaces_interface_state_oper_status missing from most devices.
Wed, Mar 12, 10:53 AM · Infrastructure-Foundations, netops
ayounsi triaged T388641: Migrate network icinga alerts to gNMI/prometheus as Low priority.
Wed, Mar 12, 10:46 AM · Patch-For-Review, Infrastructure-Foundations, Observability-Alerting, netops

Tue, Mar 11

ayounsi added a comment to T388419: cannot ping ripe-atlas-codfw from the codfw prometheus instance.

That's great, thanks for digging into it.

Tue, Mar 11, 12:50 PM · Patch-For-Review, Infrastructure-Foundations, SRE Observability (FY2024/2025-Q3), Observability-Alerting

Mon, Mar 10

ayounsi added a comment to T388419: cannot ping ripe-atlas-codfw from the codfw prometheus instance.

The RIPE Atlas hosts are in a special sandbox VLAN, which behaves like the internet (no special permissions into our network).
alert1002 has a public IP, so the RIPE Atlas is reachable like any internet host. However, the prometheus hosts are internal, so, as expected, pings can't go out (or back in).
Obviously the HTTP proxies don't work for pings. I don't know what the best course of action is here.

Mon, Mar 10, 3:52 PM · Patch-For-Review, Infrastructure-Foundations, SRE Observability (FY2024/2025-Q3), Observability-Alerting
ayounsi closed T387773: Different BFD settings on direct connected links as Resolved.

nice !

Mon, Mar 10, 9:34 AM · Infrastructure-Foundations, netops
ayounsi awarded T388169: Eliminate use of secondary IP interfaces & DNS for Cassandra instances a Love token.
Mon, Mar 10, 8:38 AM · Cassandra, SRE

Wed, Mar 5

ayounsi added a comment to T387287: Prometheus: attach host's BGP/interface remote side metrics.

I also added that metric to this dashboard as an example visualization : https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1

Wed, Mar 5, 5:41 PM · SRE Observability (FY2024/2025-Q3), Observability-Alerting, Infrastructure-Foundations, netops
ayounsi closed T387287: Prometheus: attach host's BGP/interface remote side metrics, a subtask of T384731: Prevent BGP alerts triggering when K8s host maintenance is being done, as Resolved.
Wed, Mar 5, 5:24 PM · SRE Observability, Prod-Kubernetes, serviceops-radar, observability, netops, Infrastructure-Foundations, SRE
ayounsi closed T387287: Prometheus: attach host's BGP/interface remote side metrics as Resolved.
Wed, Mar 5, 5:24 PM · SRE Observability (FY2024/2025-Q3), Observability-Alerting, Infrastructure-Foundations, netops
ayounsi added a comment to T387287: Prometheus: attach host's BGP/interface remote side metrics.

It's live and working fine :

https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000026%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22000000026%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22remote_instance:gnmi_bgp_neighbor_session_state%7Bpeer_group%3D~%5C%22Kubernetes.%2A%5C%22%7D%20%21%3D%206%22,%22legendFormat%22:%22between%20%7B%7Binstance%7D%7D%20and%20%7B%7Bremote_instance%7D%7D%22,%22range%22:true,%22instant%22:true,%22exemplar%22:false%7D%5D,%22range%22:%7B%22from%22:%22now-30m%22,%22to%22:%22now%22%7D%7D

Wed, Mar 5, 2:05 PM · SRE Observability (FY2024/2025-Q3), Observability-Alerting, Infrastructure-Foundations, netops
ayounsi added a comment to T387773: Different BFD settings on direct connected links .

They're mandatory on long-distance links, as we've had issues with the interface status being up while the provider was not forwarding traffic through said link. For local (direct) links I don't think BFD is needed, as it's unlikely that the interface is up without forwarding data.

Wed, Mar 5, 10:20 AM · Infrastructure-Foundations, netops
ayounsi added a comment to T387839: Update `netflow` retention strategy in Druid (too much data).

So what about:

  • turnilo full dimensions - 1 month
  • turnilo sanitized/reduced - 12 months
  • everything else, in data lake?
Wed, Mar 5, 10:11 AM · netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th)

Tue, Mar 4

ayounsi added a comment to T387839: Update `netflow` retention strategy in Druid (too much data).

Mostly to be able to see long term trends, for example per destination AS.

Tue, Mar 4, 3:27 PM · netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th)
ayounsi added a comment to T387839: Update `netflow` retention strategy in Druid (too much data).

Let's see what other people think, but I think it would be fine to :

  • Keep only 1 month of unsanitized data, as that data is especially important for its real-time/short-term aspect
  • Reduce the sanitized data, for example by sampling it even further; as long as we can keep historical trends, it's enough.
  • Remove those 3 dimensions from the sanitized data : "parsed_comms", "as_name_src", "as_name_dst"
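
A minimal Python sketch of the proposal above, for illustration only. The real change would presumably live in the Druid ingestion and retention configuration; the function names, the 30-day approximation of "1 month", and the sample row are assumptions, and only the three dimension names come from the comment.

```python
from datetime import datetime, timedelta, timezone

# The three dimensions proposed for removal from the sanitized data.
DROPPED_DIMENSIONS = ("parsed_comms", "as_name_src", "as_name_dst")

# Assumption: "1 month" of unsanitized data approximated as 30 days.
RAW_RETENTION = timedelta(days=30)


def sanitize(row: dict) -> dict:
    """Return a copy of a netflow row without the dropped dimensions."""
    return {key: value for key, value in row.items() if key not in DROPPED_DIMENSIONS}


def keep_unsanitized(event_time: datetime, now: datetime) -> bool:
    """True while an unsanitized row is still inside the short retention window."""
    return now - event_time <= RAW_RETENTION


# Hypothetical sample row; field values are illustrative only.
sample = {"as_dst": 64500, "as_name_dst": "EXAMPLE-AS", "parsed_comms": "x", "bytes": 1200}
sanitized = sanitize(sample)

now = datetime(2025, 3, 4, tzinfo=timezone.utc)
recent = keep_unsanitized(datetime(2025, 2, 20, tzinfo=timezone.utc), now)
old = keep_unsanitized(datetime(2025, 1, 1, tzinfo=timezone.utc), now)
```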
Tue, Mar 4, 10:58 AM · netops, Infrastructure-Foundations, Data-Engineering (Q3 2025 January 1st - March 31th)
ayounsi added a comment to T387773: Different BFD settings on direct connected links .

Not a strong feeling, but I usually try to steer towards the leaner option. So in that case it's to remove BFD between cr1/2-codfw.
Looking at https://github.com/wikimedia/operations-homer-public/blob/master/config/common.yaml#L164 it seems like only cr1/2-codfw is impacted, all the others already don't have BFD.
Automation wise, we could probably automate "no metric = no BFD".
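
The "no metric = no BFD" rule could be sketched roughly as below. This is only an illustration: the link dictionaries and the `metric` key are hypothetical stand-ins, not the actual layout of Homer's common.yaml.

```python
def bfd_enabled(link: dict) -> bool:
    """Hypothetical rule: enable BFD only when the link defines a metric."""
    return link.get("metric") is not None


# Hypothetical link definitions; names and values are illustrative only.
links = [
    {"name": "long-haul-example", "metric": 340},
    {"name": "local-direct-example"},  # no metric defined, so no BFD
]

bfd_by_link = {link["name"]: bfd_enabled(link) for link in links}
```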

Tue, Mar 4, 7:31 AM · Infrastructure-Foundations, netops

Mon, Mar 3

ayounsi claimed T387287: Prometheus: attach host's BGP/interface remote side metrics.
Mon, Mar 3, 4:07 PM · SRE Observability (FY2024/2025-Q3), Observability-Alerting, Infrastructure-Foundations, netops

Wed, Feb 26

ayounsi added a comment to T387287: Prometheus: attach host's BGP/interface remote side metrics.

I'm wondering what the benefit is of having the additional metrics saved for other network devices? As in, what alerts will we have, and downtime configuration, where that will help us? Consider a link from cr1-codfw to cr2-drmrs. Our normal alerting will fire for both of these if the session goes down. Is there a way to reconfigure that so we still always get alerted when we need, but we can suppress any alerts if only one side is downtimed? Or is there other benefit of having them?

No strong feelings, but part of it is to treat network devices a bit like servers, with fewer exceptions.

Wed, Feb 26, 4:03 PM · SRE Observability (FY2024/2025-Q3), Observability-Alerting, Infrastructure-Foundations, netops
ayounsi added a comment to T385560: Create RIPE Atlas anchors VMs.

Step 1, create/trunk the vlan to the hypervisors - https://netbox.wikimedia.org/ipam/vlans/?q=sandbox1&site_id=11&site_id=9&site_id=6&site_id=8

Wed, Feb 26, 2:06 PM · Infrastructure-Foundations, observability
ayounsi added a comment to T387287: Prometheus: attach host's BGP/interface remote side metrics.

Another question is how to name those new metrics ?
One suggestion, to stay generic as well, is to do something like
gnmi_bgp_neighbor_session_state -> remote_bgp_neighbor_session_state (or peer_bgp_neighbor_session_state)
What do you think ?

Wed, Feb 26, 1:56 PM · SRE Observability (FY2024/2025-Q3), Observability-Alerting, Infrastructure-Foundations, netops
ayounsi added subtasks for T384731: Prevent BGP alerts triggering when K8s host maintenance is being done: T387288: Add prometheus-bird-exporter sidecar to calico-node pods, T387287: Prometheus: attach host's BGP/interface remote side metrics.
Wed, Feb 26, 11:01 AM · SRE Observability, Prod-Kubernetes, serviceops-radar, observability, netops, Infrastructure-Foundations, SRE
ayounsi added a parent task for T387287: Prometheus: attach host's BGP/interface remote side metrics: T384731: Prevent BGP alerts triggering when K8s host maintenance is being done.
Wed, Feb 26, 11:01 AM · SRE Observability (FY2024/2025-Q3), Observability-Alerting, Infrastructure-Foundations, netops
ayounsi added a parent task for T387288: Add prometheus-bird-exporter sidecar to calico-node pods: T384731: Prevent BGP alerts triggering when K8s host maintenance is being done.
Wed, Feb 26, 11:01 AM · Prod-Kubernetes, Kubernetes, serviceops
ayounsi added a comment to T384731: Prevent BGP alerts triggering when K8s host maintenance is being done.

I forked the discussion to T387287: Prometheus: attach host's BGP/interface remote side metrics and T387288: Add prometheus-bird-exporter sidecar to calico-node pods as that task was becoming more difficult to read.

Wed, Feb 26, 11:01 AM · SRE Observability, Prod-Kubernetes, serviceops-radar, observability, netops, Infrastructure-Foundations, SRE
ayounsi created T387288: Add prometheus-bird-exporter sidecar to calico-node pods.
Wed, Feb 26, 11:00 AM · Prod-Kubernetes, Kubernetes, serviceops
ayounsi created T387287: Prometheus: attach host's BGP/interface remote side metrics.
Wed, Feb 26, 10:56 AM · SRE Observability (FY2024/2025-Q3), Observability-Alerting, Infrastructure-Foundations, netops

Tue, Feb 25

ayounsi closed T380469: eqiad/esams/drmrs LVS: use Netbox BGP flag as Resolved.

All done ! There was no diff, as expected in the best case scenario.

Tue, Feb 25, 4:38 PM · netops, Infrastructure-Foundations, Traffic
ayounsi added a comment to T384731: Prevent BGP alerts triggering when K8s host maintenance is being done.

And what happens if peer_descr is missing or empty ?

Good question, in that case the instance label will be :0 since $1 will be empty. Is a missing peer_descr something that can happen?

Tue, Feb 25, 2:48 PM · SRE Observability, Prod-Kubernetes, serviceops-radar, observability, netops, Infrastructure-Foundations, SRE
ayounsi triaged T387220: BGP peers with missing descriptions as Low priority.
Tue, Feb 25, 2:44 PM · Infrastructure-Foundations, netops
ayounsi closed T387006: cr2-magru errors on xe-0/1/0 (EdgeUno Transit) as Resolved.

No more errors.

Tue, Feb 25, 9:24 AM · ops-magru, netops, Infrastructure-Foundations, SRE

Mon, Feb 24

ayounsi claimed T385560: Create RIPE Atlas anchors VMs.
Mon, Feb 24, 3:44 PM · Infrastructure-Foundations, observability
ayounsi added a comment to T354872: Re-IP Swift hosts to per-rack subnets in codfw row A and B..

@MatthewVernon that's correct. Thanks !

Mon, Feb 24, 2:04 PM · SRE-swift-storage, Infrastructure-Foundations, SRE
ayounsi added a comment to T387006: cr2-magru errors on xe-0/1/0 (EdgeUno Transit).

Our datacenter engineering team has concluded the on-site activity, and no problems were found on our side. Could you please confirm if this has improved the situation and ceased the errors on your side?
We performed a fiber cleaning and checked all the physical components.

Mon, Feb 24, 1:53 PM · ops-magru, netops, Infrastructure-Foundations, SRE
ayounsi added a comment to T387018: gNMIc connection not working for cloudsw2-d5-eqiad.

Enabling traceoptions shows a no shared cipher error on the switch :

Feb 24 09:33:58 ssl_transport_security.c:948: Handshake failed with fatal error SSL_ERROR_SSL: error:1408A0C1:SSL routines:ssl3_get_client_hello:no shared cipher.
Feb 24 09:33:58 chttp2_server.c:83: Handshaking failed: {"created":"@1740389638.118250387","description":"Handshake failed","file":"../../../../../../../src/external/bsd/grpc/dist/src/core/lib/security/transport/security_handshaker.c","file_line":276,"tsi_code":10,"tsi_error":"TSI_PROTOCOL_FAILURE"}
Feb 24 09:34:02 ssl_transport_security.c:948: Handshake failed with fatal error SSL_ERROR_SSL: error:1408A0C1:SSL routines:ssl3_get_client_hello:no shared cipher.
Feb 24 09:34:02 chttp2_server.c:83: Handshaking failed: {"created":"@1740389642.111017067","description":"Handshake failed","file":"../../../../../../../src/external/bsd/grpc/dist/src/core/lib/security/transport/security_handshaker.c","file_line":276,"tsi_code":10,"tsi_error":"TSI_PROTOCOL_FAILURE"}
Mon, Feb 24, 9:59 AM · Infrastructure-Foundations, netops, SRE
ayounsi added a comment to T387018: gNMIc connection not working for cloudsw2-d5-eqiad.

The switch is running a Junos version that is too old for analytics-agent. I tried cloudsw2-d5-eqiad> restart SDN-Telemetry gracefully instead, but that didn't work.

Mon, Feb 24, 9:33 AM · Infrastructure-Foundations, netops, SRE

Wed, Feb 19

ayounsi added a comment to T384731: Prevent BGP alerts triggering when K8s host maintenance is being done.

Since we have to overwrite instance with the host instead of the router, that information is effectively lost, unless we move instance elsewhere. At any rate I think for a limited (i.e. internal bgp peers) use case it is fine.

The base query looks like this: label_replace(sum without (instance) (gnmi_bgp_neighbor_session_state), "instance", "$1:0", "peer_descr", "(.*)"), we would record that query into a new metric such as instance:gnmi_bgp_neighbor_session_state in modules/profile/files/prometheus/rules_ops.yml and we can alert if said metric == 6.
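
As a rough illustration of what the label rewrite in that query does to each series, here is a minimal Python sketch. It mimics PromQL's label_replace on a plain dict (fully anchored regex, $1 substitution); the helper names and the sample label values are assumptions made for the example, not Prometheus code.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Mimic PromQL label_replace on one series' label set (a plain dict)."""
    value = labels.get(src, "")  # a missing label behaves like an empty string
    match = re.fullmatch(regex, value)  # PromQL anchors the regex on both ends
    if match is None:
        return dict(labels)  # no match: the series is returned unchanged
    # Translate PromQL's $1 syntax into Python's \g<1> template syntax.
    template = re.sub(r"\$(\d+)", lambda m: "\\g<" + m.group(1) + ">", replacement)
    out = dict(labels)
    out[dst] = match.expand(template)
    return out


def rewrite_instance(labels):
    """Drop the router's instance label, then derive a new one from peer_descr."""
    # "sum without (instance)" removes the original instance label first.
    without_instance = {k: v for k, v in labels.items() if k != "instance"}
    return label_replace(without_instance, "instance", "$1:0", "peer_descr", "(.*)")


# Hypothetical sample label sets.
with_descr = rewrite_instance({"instance": "router-example:1234", "peer_descr": "host-example"})
without_descr = rewrite_instance({"instance": "router-example:1234"})
```

With a populated peer_descr the new instance label becomes host-example:0; with a missing or empty one it collapses to :0.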

Wed, Feb 19, 2:54 PM · SRE Observability, Prod-Kubernetes, serviceops-radar, observability, netops, Infrastructure-Foundations, SRE
ayounsi closed T316539: Upgrade network devices to Junos 20+, a subtask of T254013: all network devices must run OpenSSH >= 7.2p1 but != 7.4p1, as Resolved.
Wed, Feb 19, 10:57 AM · Infrastructure-Foundations, netops, SRE
ayounsi closed T316539: Upgrade network devices to Junos 20+, a subtask of T317175: Junos: resolve DNS through mgmt_junos, as Resolved.
Wed, Feb 19, 10:57 AM · SRE, Infrastructure-Foundations, netops
ayounsi closed T316539: Upgrade network devices to Junos 20+, a subtask of T327862: Use mgmt_junos on all network devices, as Resolved.
Wed, Feb 19, 10:57 AM · SRE, netops, Infrastructure-Foundations
ayounsi closed T316539: Upgrade network devices to Junos 20+ as Resolved.

Nope, thanks for the ping.
There is now T364092: Upgrade core routers to Junos 23.4R2

Wed, Feb 19, 10:57 AM · SRE, netops, Infrastructure-Foundations
ayounsi added a comment to T386026: Decide what to do with SUL attached Wikitech accounts that Bitu associates with a different SUL account.

Thanks @bd808
Please detach 'Ayounsi' from SUL, rename it to 'AYounsi (WMF)', and reattach to SUL.

Wed, Feb 19, 10:55 AM · User-bd808, wikitech.wikimedia.org
ayounsi assigned T386766: cr2-esams:interface ae1 present under protocol ospf but not configure to Papaul.

Thanks! You can remove the now obsolete references from the ospf section in https://github.com/wikimedia/operations-homer-public/blob/master/config/common.yaml

Wed, Feb 19, 10:51 AM · Infrastructure-Foundations, netops

Feb 5 2025

ayounsi added a comment to T384774: Jan 2025 - Magru core router connectivity blips.

Good idea regarding BFD. From https://supportportal.juniper.net/s/article/Observing-BGP-IO-ERROR-CLOSE-SESSION-error-logs-when-BGP-protocolgoes-down?language=en_US it seems like JTAC is the next step :(

Feb 5 2025, 3:40 PM · Patch-For-Review, ops-magru, netops, Infrastructure-Foundations
ayounsi added a comment to T382017: Q2:rack/setup E8/F8 new leaf switches.

Sure, as usual for power/console/mgmt.
Regarding production ports :
On the ssw1 side: use et-0/0/7 towards e8 and et-0/0/15 towards f8. They're currently used by the links to the dell switches, so no need to re-run new cables.

Feb 5 2025, 8:14 AM · Patch-For-Review, SRE, Infrastructure-Foundations, ops-eqiad, netops, DC-Ops
ayounsi updated subscribers of T378715: Possibility to transition some codfw data persistence hosts to 10G.

Cool, let's start with pc2011 to validate the workflow, then if all good we can iterate faster on the other hosts.

Feb 5 2025, 8:05 AM · DBA, Patch-For-Review

Feb 4 2025

ayounsi added a comment to T382219: codfw:expansion: Network devices/patch panel wiring .

Just coming back, I'm also curious about the upcoming FR-tech changes, is that discussed somewhere ?

Feb 4 2025, 1:40 PM · ops-codfw, DC-Ops, SRE, procurement
ayounsi added a comment to T382518: WMF RIPE Atlas probe in Eqiad offline.

Let's decom it and focus our efforts on spinning up VMs instead (T385560).
It needs to be removed from the list on https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common.yaml#L1881 as well as from RIPE's dashboard. Then shut down the switch port, reclaim the IP and hand it over to DCops to be recycled. There are no confidential data on the box so no need for a wipe.

Feb 4 2025, 11:01 AM · netops, Infrastructure-Foundations, SRE
ayounsi added a comment to T382519: WMF RIPE Atlas probe in Eqsin offline.

Let's decom it and focus our efforts on spinning up VMs instead (T385560).
It needs to be removed from the list on https://github.com/wikimedia/operations-puppet/blob/production/hieradata/common.yaml#L1881 as well as from RIPE's dashboard. Then shut down the switch port, reclaim the IP and hand it over to DCops to be recycled in the next DC visit. There are no confidential data on the box so no need for a wipe.

Feb 4 2025, 11:01 AM · Infrastructure-Foundations, netops, SRE, ops-eqsin
ayounsi created T385560: Create RIPE Atlas anchors VMs.
Feb 4 2025, 10:58 AM · Infrastructure-Foundations, observability
ayounsi added a comment to T381175: Homer trying to delete BGP peerings for VMs on new Eqiad ganeti nodes.

For (1) we can have the sre.ganeti.addnode cookbook call the PuppetDBImport script towards the end. What do you and @MoritzMuehlenhoff think ?

Feb 4 2025, 10:44 AM · Patch-For-Review, Infrastructure-Foundations, netops, SRE

Feb 3 2025

ayounsi added a comment to T378715: Possibility to transition some codfw data persistence hosts to 10G.

I assume no IP changes would happen right?

That's correct.

Feb 3 2025, 4:13 PM · DBA, Patch-For-Review
ayounsi added a comment to T384052: Migrate port utilisation alert from LibreNMS to alertmanager.

Looks all good to me !

Feb 3 2025, 12:28 PM · Observability-Alerting, Infrastructure-Foundations, netops, SRE
ayounsi added a comment to T384731: Prevent BGP alerts triggering when K8s host maintenance is being done.

An alternative (or short term solution until the above/cleaner one is live) is to not alert for sessions on the router side towards k8s nodes, and only monitor BGP from the k8s side.

Feb 3 2025, 9:44 AM · SRE Observability, Prod-Kubernetes, serviceops-radar, observability, netops, Infrastructure-Foundations, SRE

Nov 22 2024

ayounsi updated the task description for T380050: Decommission E/F 8 Dell switches.
Nov 22 2024, 11:19 AM · Patch-For-Review, SRE, DC-Ops, ops-eqiad

Nov 21 2024

ayounsi created T380469: eqiad/esams/drmrs LVS: use Netbox BGP flag.
Nov 21 2024, 12:43 PM · netops, Infrastructure-Foundations, Traffic
ayounsi added a comment to T380451: Lumen codfw-ulsfo down (Nov 2024).

Ticket ID 30654682 has been successfully created.

Nov 21 2024, 10:24 AM · Infrastructure-Foundations, netops
ayounsi triaged T380451: Lumen codfw-ulsfo down (Nov 2024) as High priority.
Nov 21 2024, 10:18 AM · Infrastructure-Foundations, netops

Nov 18 2024

ayounsi updated the task description for T380147: Homer failure on port speed change.
Nov 18 2024, 8:51 AM · homer, Infrastructure-Foundations
ayounsi created T380147: Homer failure on port speed change.
Nov 18 2024, 8:50 AM · homer, Infrastructure-Foundations

Nov 15 2024

ayounsi updated the task description for T380050: Decommission E/F 8 Dell switches.
Nov 15 2024, 1:31 PM · Patch-For-Review, SRE, DC-Ops, ops-eqiad
ayounsi created T380050: Decommission E/F 8 Dell switches.
Nov 15 2024, 1:22 PM · Patch-For-Review, SRE, DC-Ops, ops-eqiad
ayounsi closed T335028: Put Dell SONiC switches in production as Declined.

Because of the various limitations listed in {T342673} (plus the ones from pygnmi) we're not going to proceed any further on Dell SONiC, focusing on {T371088} now.

Nov 15 2024, 1:04 PM · SRE, netops, Infrastructure-Foundations
ayounsi closed T320638: Add Dell switches support to Homer/Cookbooks, a subtask of T335028: Put Dell SONiC switches in production, as Declined.
Nov 15 2024, 1:03 PM · SRE, netops, Infrastructure-Foundations
ayounsi closed T320638: Add Dell switches support to Homer/Cookbooks as Declined.

Because of the various limitations listed in {T342673} (plus the ones from pygnmi) we're not going to proceed any further on Dell SONiC, focusing on {T371088} now.

Nov 15 2024, 1:03 PM · Patch-For-Review, SRE, netops, Infrastructure-Foundations
ayounsi closed T340045: Package pyGNMI and dictdiffer to be used by cookbooks, a subtask of T320638: Add Dell switches support to Homer/Cookbooks, as Declined.
Nov 15 2024, 1:02 PM · Patch-For-Review, SRE, netops, Infrastructure-Foundations
ayounsi closed T340045: Package pyGNMI and dictdiffer to be used by cookbooks, a subtask of T338028: Users management on SONiC, as Declined.
Nov 15 2024, 1:02 PM · SRE, Infrastructure-Foundations, netops
ayounsi closed T340045: Package pyGNMI and dictdiffer to be used by cookbooks, a subtask of T344325: gNMI module in Spicerack, as Declined.
Nov 15 2024, 1:02 PM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
ayounsi closed T340045: Package pyGNMI and dictdiffer to be used by cookbooks as Declined.

Thanks for dictdiffer, because of a change in priorities and current limitations in pyGNMI, there is no more need to package it.

Nov 15 2024, 1:02 PM · Infrastructure-Foundations, SRE-tools
ayounsi closed T344325: gNMI module in Spicerack as Declined.

Going to close that task as we're not planning on using gNMI for automation any further, due to various shortcomings in the existing Python gNMI library. We're looking into JSON-RPC as an alternative; see T371088#10272661 for example.

Nov 15 2024, 1:01 PM · Patch-For-Review, Infrastructure-Foundations, SRE-tools, Spicerack
ayounsi closed T344325: gNMI module in Spicerack, a subtask of T320638: Add Dell switches support to Homer/Cookbooks, as Declined.
Nov 15 2024, 1:00 PM · Patch-For-Review, SRE, netops, Infrastructure-Foundations
ayounsi closed Restricted Task, a subtask of T320638: Add Dell switches support to Homer/Cookbooks, as Declined.
Nov 15 2024, 12:57 PM · Patch-For-Review, SRE, netops, Infrastructure-Foundations
ayounsi moved T364092: Upgrade core routers to Junos 23.4R2 from Backlog to This quarter on the netops board.
Nov 15 2024, 12:50 PM · netops, Infrastructure-Foundations, SRE
ayounsi added a comment to T378715: Possibility to transition some codfw data persistence hosts to 10G.

Cool, nothing urgent. In that case please let us know, when you can, which hosts you want to migrate (or the ones that are not worth it); we can then figure out a plan of attack.

Nov 15 2024, 7:56 AM · DBA, Patch-For-Review

Nov 14 2024

ayounsi created T379907: Netbox: librenms report errors.
Nov 14 2024, 12:00 PM · Patch-For-Review, Infrastructure-Foundations, netops, netbox
ayounsi updated the task description for T379778: Decom prod infra side of the ulsfo-office link.
Nov 14 2024, 7:36 AM · ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops, procurement, SRE

Nov 13 2024

ayounsi updated the task description for T379778: Decom prod infra side of the ulsfo-office link.
Nov 13 2024, 4:33 PM · ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops, procurement, SRE
ayounsi created T379778: Decom prod infra side of the ulsfo-office link.
Nov 13 2024, 4:24 PM · ops-ulsfo, DC-Ops, Infrastructure-Foundations, netops, procurement, SRE
ayounsi added a comment to T362392: Routed Ganeti: Add support for VM BGP.

Interesting idea, definitely worth a try. I'm particularly curious about how routing between VMs would work in that setup, and where to apply filtering. But not requiring multihop would be a plus.

Nov 13 2024, 10:33 AM · Patch-For-Review, Ganeti

Nov 12 2024

ayounsi closed T379465: https://wikitech.wikimedia.org/wiki/Out-of-band_network out of date as Resolved.

Updated :)

Nov 12 2024, 7:59 AM · Documentation, netops, Infrastructure-Foundations

Nov 7 2024

ayounsi added a comment to T374379: BFD won't esablish between QFX in VRF and host from IPv6 link-local.

If it's a bug on the switch it's probably worth opening a JTAC ticket. Even if it's not fixed in time for us, they could provide a workaround or fix it in the longer run.

Nov 7 2024, 10:36 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE
ayounsi added a comment to T364092: Upgrade core routers to Junos 23.4R2.

Upgrades should follow the standard process

The standard process docs are outdated I fear.

Depool site (optional)
(optional) if codfw, drain mw traffic sudo cookbook sre.mediawiki.route-traffic primary

codfw will be the primary during that set of dates, it should NOT be depooled.

Nov 7 2024, 7:23 AM · netops, Infrastructure-Foundations, SRE

Nov 5 2024

ayounsi added a comment to T375216: Top-of-rack 'MoveServersUplinks' Netbox scripts doesn't clean up the old trunk port.

Another point, after running the script, the changelog on a problematic interface shows 3 changes (for that interface) in the same transaction:

  1. "updated" Post-Change Data looks like what we want (disabled, no vlans, no mtu, cable still attached).
  2. "updated" that "reverts" the values we don't want to keep <- that's the odd one

  3. "delete" that removes the cable termination, as expected
Nov 5 2024, 1:58 PM · Infrastructure-Foundations, netops, SRE
ayounsi updated subscribers of T375216: Top-of-rack 'MoveServersUplinks' Netbox scripts doesn't clean up the old trunk port.

I added some logging (self.log_info(f"{interface} {interface.enabled} {interface.untagged_vlan} {interface.tagged_vlans}")) at the end of def clean_interface(self, interface: Interface): (after the save), as it's the problematic part of the script, and was able to reproduce on netbox-next:
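
For readers unfamiliar with the script, the placement of that logging call can be pictured with the sketch below. It is not the reproduction output and not the real script: the Interface and script classes are hypothetical stand-ins for the NetBox model and script, and the cleanup steps are only inferred from the desired end state described in the earlier comment (disabled, no VLANs, no MTU).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Interface:
    """Hypothetical stand-in for NetBox's dcim Interface model."""
    name: str
    enabled: bool = True
    mtu: Optional[int] = None
    untagged_vlan: Optional[str] = None
    tagged_vlans: List[str] = field(default_factory=list)

    def save(self) -> None:
        """No-op here; the real model would write to the database."""

    def __str__(self) -> str:
        return self.name


class MoveServersUplinksSketch:
    """Stand-in for the NetBox script; only log_info is mimicked."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def log_info(self, message: str) -> None:
        self.messages.append(message)

    def clean_interface(self, interface: Interface) -> None:
        # Assumed cleanup: disabled, no VLANs, no MTU.
        interface.enabled = False
        interface.mtu = None
        interface.untagged_vlan = None
        interface.tagged_vlans = []
        interface.save()
        # The debug line quoted in the comment, placed after the save.
        self.log_info(
            f"{interface} {interface.enabled} {interface.untagged_vlan} {interface.tagged_vlans}"
        )


script = MoveServersUplinksSketch()
script.clean_interface(Interface(name="ge-5/0/1", mtu=9192, untagged_vlan="example-vlan"))
```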

Nov 5 2024, 1:36 PM · Infrastructure-Foundations, netops, SRE

Oct 31 2024

ayounsi triaged T378751: Netbox: ImportPuppetDB uses wrong netmask for some hosts as High priority.
Oct 31 2024, 5:25 PM · Infrastructure-Foundations, netbox
ayounsi added a parent task for T378744: GeoDNS: consider sending CN to eqsin: Unknown Object (Task).
Oct 31 2024, 4:42 PM · Traffic
ayounsi created T378744: GeoDNS: consider sending CN to eqsin.
Oct 31 2024, 4:42 PM · Traffic
ayounsi added a comment to T373519: Allow UEFI DHCP configs.

@bking I think it's a question worth asking, but probably not in that task :) Could you open a dedicated one for the Procurement/DCops team?

Oct 31 2024, 3:03 PM · Infrastructure-Foundations
ayounsi added a subtask for T360297: Take advantage of 10Gb NICs in the new network stack: T378715: Possibility to transition some codfw data persistence hosts to 10G.
Oct 31 2024, 1:30 PM · Infrastructure-Foundations, DC-Ops, netops
ayounsi added a parent task for T378715: Possibility to transition some codfw data persistence hosts to 10G: T360297: Take advantage of 10Gb NICs in the new network stack.
Oct 31 2024, 1:30 PM · DBA, Patch-For-Review
ayounsi triaged T378715: Possibility to transition some codfw data persistence hosts to 10G as Low priority.
Oct 31 2024, 1:30 PM · DBA, Patch-For-Review
ayounsi added a subtask for T360297: Take advantage of 10Gb NICs in the new network stack: T378714: Possibility to transition ml-serve[2001-2008] and and ml-staging[2001-2002] to 10G.
Oct 31 2024, 1:24 PM · Infrastructure-Foundations, DC-Ops, netops
ayounsi added a parent task for T378714: Possibility to transition ml-serve[2001-2008] and and ml-staging[2001-2002] to 10G: T360297: Take advantage of 10Gb NICs in the new network stack.
Oct 31 2024, 1:24 PM · Machine-Learning-Team
ayounsi triaged T378714: Possibility to transition ml-serve[2001-2008] and and ml-staging[2001-2002] to 10G as Low priority.
Oct 31 2024, 1:24 PM · Machine-Learning-Team
ayounsi added a comment to T360297: Take advantage of 10Gb NICs in the new network stack.

Had a chat with Riccardo on IRC, here is the new list I came up with:

Oct 31 2024, 9:35 AM · Infrastructure-Foundations, DC-Ops, netops
ayounsi added a comment to T360297: Take advantage of 10Gb NICs in the new network stack.

@Papaul 54, but that only included rows A and B; now C and D are also eligible for a free 10G upgrade when available.

Oct 31 2024, 9:03 AM · Infrastructure-Foundations, DC-Ops, netops