
ayounsi (Arzhel Younsi)
Staff Network SRE

User Details

User Since
Apr 3 2017, 6:23 PM (376 w, 3 d)
Availability
Available
IRC Nick
xionox
LDAP User
Ayounsi
MediaWiki User
AYounsi (WMF) [ Global Accounts ]

Recent Activity

Yesterday

ayounsi committed rOSNE3391d63fe21c: Netbox-extra: Add bandit and prospector to CI.
Thu, Jun 20, 11:45 AM
ayounsi committed rOSNEd50e85c312c5: Fix lots of CI errors.
Thu, Jun 20, 11:40 AM
ayounsi updated the task description for T336275: Upgrade Netbox to 4.x.
Thu, Jun 20, 8:00 AM · Patch-For-Review, Infrastructure-Foundations, netbox

Wed, Jun 19

ayounsi added a comment to T336275: Upgrade Netbox to 4.x.

Some notes before I forget, to make the sre.deploy.python-code work I had to:

Wed, Jun 19, 4:43 PM · Patch-For-Review, Infrastructure-Foundations, netbox
ayounsi created T367973: Replace ping offload servers with eBPF.
Wed, Jun 19, 1:12 PM · Traffic
Syaifulnizamshamsudin awarded Blog Post: Ganeti on modern network design a Manufacturing Defect? token.
Wed, Jun 19, 12:47 PM

Mon, Jun 17

ayounsi removed projects from T367056: Rise in ms-fe2* TCP retransmits since 11:40 UTC today : Infrastructure-Foundations, netops.

We had a quick look at the network side and couldn't find any smoking gun.

Mon, Jun 17, 3:19 PM · Traffic, SRE, SRE-swift-storage
ayounsi claimed T367265: Capirca setup for routed Ganeti VMs.
Mon, Jun 17, 3:05 PM · Infrastructure-Foundations, netops
ayounsi triaged T367731: drmrs/esams/magru LVS : remove cross-rack links as Low priority.
Mon, Jun 17, 3:05 PM · netops, Traffic, Infrastructure-Foundations
ayounsi triaged T367732: POPs LVS : remove public vlan trunking as Low priority.
Mon, Jun 17, 3:04 PM · netops, Traffic, Infrastructure-Foundations
ayounsi added a comment to T367731: drmrs/esams/magru LVS : remove cross-rack links.

Of course ! not planning on doing it today :) The task is there to not forget.

Mon, Jun 17, 1:49 PM · netops, Traffic, Infrastructure-Foundations
ayounsi created T367732: POPs LVS : remove public vlan trunking.
Mon, Jun 17, 11:13 AM · netops, Traffic, Infrastructure-Foundations
ayounsi created T367731: drmrs/esams/magru LVS : remove cross-rack links.
Mon, Jun 17, 11:05 AM · netops, Traffic, Infrastructure-Foundations
ayounsi added a comment to T250415: Homer: add parallelization support.

Yeah I think it's what I tried to mean with

We can also decide that batch means to silently skip any device that has a different diff, to not risk blocking the run in the middle of it if a device has local changes

Basically decide if the batch behavior is (3) or (4) and then stick to it. 4 options seem a bit too much.
I tend to prefer (3), and would be ok to not support (4), especially as in a good state there should be no local changes.

Mon, Jun 17, 9:06 AM · User-Elukey, Infrastructure-Foundations, SRE-tools, homer
ayounsi added a comment to T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G.

I don't understand why they need to be moved to get upgraded to 10G. If we take for example wikikube-ctrl2001, the switch in rack B6 has plenty of available/ready to use 10G ports (for example 44-47).

Mon, Jun 17, 9:01 AM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops
ayounsi added a comment to T367408: Should we channelize unused QSFP28 ports on QFX5120s to provide 'buffer' for 10G upgrades?.

Can we move the cables instead of moving the servers ?

Mon, Jun 17, 8:54 AM · netops, Infrastructure-Foundations, SRE
ayounsi added a comment to T250415: Homer: add parallelization support.

It's necessary to do the diff on all target devices anyway, so that behavior is fine.

Mon, Jun 17, 8:36 AM · User-Elukey, Infrastructure-Foundations, SRE-tools, homer

Fri, Jun 7

ayounsi updated the task description for T336275: Upgrade Netbox to 4.x.
Fri, Jun 7, 8:35 AM · Patch-For-Review, Infrastructure-Foundations, netbox
ayounsi updated the task description for T366874: Netbox: accounting report failure.
Fri, Jun 7, 6:49 AM · netbox, Infrastructure-Foundations
ayounsi created T366874: Netbox: accounting report failure.
Fri, Jun 7, 6:15 AM · netbox, Infrastructure-Foundations
ayounsi triaged T366864: cr2-eqdfw: PEM 0 Input Voltage Out Of Range as High priority.
Fri, Jun 7, 4:53 AM · SRE, DC-Ops, ops-eqdfw

Thu, Jun 6

ayounsi updated the task description for T336275: Upgrade Netbox to 4.x.
Thu, Jun 6, 8:21 AM · Patch-For-Review, Infrastructure-Foundations, netbox
ayounsi updated the task description for T336275: Upgrade Netbox to 4.x.
Thu, Jun 6, 8:14 AM · Patch-For-Review, Infrastructure-Foundations, netbox

Wed, Jun 5

ayounsi added a comment to T336275: Upgrade Netbox to 4.x.

Plan so far is to merge https://gerrit.wikimedia.org/r/1037784 to be able to have a puppetized test server compatible with the new deploy directory scheme (netbox-dev)
Then to merge https://gerrit.wikimedia.org/r/1038694 and check it out from /srv/deployment/netbox-dev/deploy
Then load a copy of the prod Netbox DB on the dev instance pbsql
Then run the deploy python code cookbook to have a working Netbox 4 setup (and fix any issue that could prevent it)
Then check if the DB migration went well
In parallel merge https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/905570/ and parent change to have better CI on netbox-extra ahead of fixing all the Netbox 4 breaking changes.
Then send/merge patches to fix those netbox-extra changes.
Then (non blocker) Update the sre.netbox.update-extras cookbook to account for those changes.
Then send Spicerack, Cookbooks and Homer patches to fix Netbox's breaking changes. Ideally by moving some of the Cookbook's Netbox API calls to Spicerack.

Wed, Jun 5, 2:09 PM · Patch-For-Review, Infrastructure-Foundations, netbox

Mon, Jun 3

ayounsi added a comment to T366360: Anycast NTP and update the list of timeservers for P:systemd::timesyncd.

Last time we rolled out this change, it was simply updating modules/install_server/files/autoinstall/common.cfg. Do you have any other place in mind where this might need to be reconfigured? I am personally for removing this completely but it's not a big deal and we can keep it around as well.

https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Opengear_Serial_Consoles
https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/ServerTech

Mon, Jun 3, 3:45 PM · Patch-For-Review, SRE, Traffic
ayounsi closed T362523: Juniper: use export-format state-data json compact as Resolved.

Our engineering team has now indicated that the compact json is not supported, due to hardware limitations with respect to compact json formatting. The feature will be deprecated in Junos 24.4. So, please do not use compact json to export data.

Mon, Jun 3, 1:30 PM · Infrastructure-Foundations, netops
ayounsi added a comment to T366360: Anycast NTP and update the list of timeservers for P:systemd::timesyncd.

Moving the dynamic nature of NTP definition to some automated system instead of human or Puppet is a great idea :)
Human as in right now for network devices, the list is hard-coded https://github.com/wikimedia/operations-homer-public/blob/master/config/common.yaml#L365

Mon, Jun 3, 12:11 PM · Patch-For-Review, SRE, Traffic
ayounsi added a comment to T366193: Anycast ns1.wikimedia.org.
  • i.e. one in Germany which will pick ns0 rather than lower latency ns2

Seems like the main one is adguard-dns.com, which picks them randomly.
https://w.wiki/AGmr

Mon, Jun 3, 8:09 AM · SRE, Traffic
ayounsi added a comment to T366193: Anycast ns1.wikimedia.org.

I think the difficult part is where to stop the overengineering, for example it could make sense to use Liberica to healthcheck/advertise one of the NS anycast IPs, but it might not be worth using a different AuthDNS software on half the servers, or a different Puppet infra.
Before going full anycast we need to make sure we're covering all major failure scenarios, or alternatively making a call to keep some unicast, knowing some places with broken/dumb implementations won't be the fastest, but maybe an ok tradeoff for better resiliency.

Mon, Jun 3, 7:43 AM · SRE, Traffic

Fri, May 31

ayounsi added a comment to T366193: Anycast ns1.wikimedia.org.

That's quite interesting seeing the variation of tradeoffs, and can be quite (an important) rabbithole. Is the goal to figure it out before anycasting ns1, or first anycast ns1 from anywhere then figure out how to modify the setup for possible better redundancy.
It could be useful to list all the failure scenarios, and if we need to mitigate them or not. (server or network misconfig, Bird bug, etc). In other words are we putting too many of our eggs in the same basket ?

Fri, May 31, 8:04 AM · SRE, Traffic

Thu, May 30

ayounsi added a comment to T366193: Anycast ns1.wikimedia.org.

assign a /24 from https://netbox.wikimedia.org/ipam/aggregates/ to be used for this

As we couldn't get a /24 from LACNIC for magru, we only have two free /24s
We have to decide between a few options:

  1. allocate a new whole /24 for ns1 right now
    • Pro: quick turnaround, no added cost
    • Con: risk of lacking public v4 IPs for future projects (e.g. new POPs) or core site growth; can be mitigated by applying for more prefixes in parallel, but with no guarantee of getting them
  2. Apply (and pay) for more IPs at RIPE or ARIN (T288342) and wait for an allocation before anycasting ns1
    • Pro: limited cost, more flexible on IP usage
    • Con: long turnaround (months to years)
  3. Buy a /24 on resale market
    • Pro: faster turnaround
    • Con: higher cost
  4. Don't Anycast ns1
    • Listing it only for the sake of completeness, but not preferred. Even though there is a diminishing return after anycasting ns2, we believe anycasting one more ns would bring performance improvements to users
  5. Use the DoH Anycast prefix for ns1
    • Pro: quick turnaround, no added cost
    • Con: risk of providers blocking ns1 as a side effect of blocking DoH, mitigated by having 2 other NS.
Thu, May 30, 10:36 AM · SRE, Traffic
ayounsi closed Unknown Object (Task), a subtask of T346722: Sao Paulo, Brazil, South America POP tracking task, as Resolved.
Thu, May 30, 7:29 AM · ops-magru

Tue, May 28

ayounsi added a comment to T365687: Improve calico-typha firewall rules.

The Typha firewall service is now based on firewall::service and does dynamic name resolution on the puppet server side, let's see if this improves things with the next rename.

The issue didn't happen again, but we also did the move vlan in addition to the rename (so the IP changed too).

Tue, May 28, 5:28 PM · serviceops, Prod-Kubernetes, Kubernetes

Mon, May 27

ayounsi claimed T365697: Arelion IPv6 transit renumbering.
Mon, May 27, 2:27 PM · Patch-For-Review, Infrastructure-Foundations, netops
ayounsi created P63281 move-vlan dry-run error.
Mon, May 27, 9:12 AM
ayounsi added a comment to T362523: Juniper: use export-format state-data json compact.

JTAC was able to confirm/duplicate the bug on 22.3R3-S2.4, they're escalating it to their engineering team.

Mon, May 27, 7:35 AM · Infrastructure-Foundations, netops

Fri, May 24

ayounsi added a comment to T362523: Juniper: use export-format state-data json compact.

Opened JTAC case 2024-0524-163553

Fri, May 24, 7:52 AM · Infrastructure-Foundations, netops

Thu, May 23

ayounsi added a comment to T355750: CFSSL gencert "remote error: tls: certificate require".

sudo cookbook sre.network.tls --system lsw1-f8-eqiad

Thu, May 23, 2:24 PM · CFSSL-PKI, Infrastructure-Foundations
ayounsi added a comment to T365571: Rename wikikube worker nodes during OS reimage.

Before I forget, please notify DCops so they update the physical labels on the server.

Thu, May 23, 12:47 PM · Kubernetes, Prod-Kubernetes, serviceops
ayounsi updated the task description for T365697: Arelion IPv6 transit renumbering.
Thu, May 23, 12:19 PM · Patch-For-Review, Infrastructure-Foundations, netops
ayounsi created T365697: Arelion IPv6 transit renumbering.
Thu, May 23, 12:12 PM · Patch-For-Review, Infrastructure-Foundations, netops
ayounsi triaged T365694: Cookbooks: move Netbox IP allocation to spicerack module as Low priority.
Thu, May 23, 11:56 AM · Infrastructure-Foundations, SRE-tools, netbox, Spicerack
ayounsi triaged T365680: Redfish _get_dummy_response() should return empty json as Low priority.
Thu, May 23, 8:49 AM · Infrastructure-Foundations, SRE-tools, Spicerack

Wed, May 22

ayounsi added a comment to T365571: Rename wikikube worker nodes during OS reimage.

Sounds good ! I'd recommend doing first a rename then normal re-image, then just a move-vlan, then on a different host, test both actions one after the other.

Wed, May 22, 1:58 PM · Kubernetes, Prod-Kubernetes, serviceops
ayounsi added a comment to T362523: Juniper: use export-format state-data json compact.

Tested on a MX204 running Junos 21.2 and 22.4R3.25, the returned JSON is invalid...

Wed, May 22, 1:13 PM · Infrastructure-Foundations, netops
ayounsi added a comment to P62883 cr3-ulsfo> show interfaces xe-0/1/2 | display json - list not being defined properly.

Diff:

@@ -100,7 +100,7 @@
                 {
                 }, 
                 "address-family" :
-                {
+                [{
                     "address-family-name" : "inet", 
                     "mtu" : "4456", 
                     "max-local-cache" : "100000", 
@@ -142,7 +142,7 @@
                         "internal-flags" : "0x0"
                     }, 
                     "interface-address" :
-                    {
+                    [{
                         "ifa-flags" :
                         {
                             "ifaf-current-preferred" : "[null]", 
@@ -174,6 +174,7 @@
                             }
                         }
                     }
+                    ]
                 }, 
                 {
                     "address-family-name" : "multiservice", 
@@ -182,9 +183,8 @@
                     {
                         "internal-flags" : "0x0"
                     }
-                }
+                }]
             }
         }
     }
 }
Wed, May 22, 1:02 PM
ayounsi added a comment to P62883 cr3-ulsfo> show interfaces xe-0/1/2 | display json - list not being defined properly.

Valid:

{
    "interface-information" :
    {
        "physical-interface" :
        {
            "name" : "xe-0/1/2", 
            "admin-status" : "up", 
            "oper-status" : "up", 
            "local-index" : "164", 
            "snmp-index" : "536", 
            "description" : "Transit: Arelion (IC-) {#1071}", 
            "link-level-type" : "Ethernet", 
            "sonet-mode" : "LAN-PHY", 
            "mtu" : "4470", 
            "mru" : "4478", 
            "source-filtering" : "disabled", 
            "speed" : "10Gbps", 
            "bpdu-error" : "none", 
            "ld-pdu-error" : "none", 
            "l2pt-error" : "none", 
            "loopback" : "none", 
            "if-flow-control" : "enabled", 
            "if-speed-cfg" : "Auto", 
            "pad-to-minimum-frame-size" : "Disabled", 
            "if-device-flags" :
            {
                "ifdf-present" : "[null]", 
                "ifdf-running" : "[null]"
            }, 
            "ifd-specific-config-flags" :
            {                           
                "internal-flags" : "0x100200"
            }, 
            "if-config-flags" :
            {
                "iff-snmp-traps" : "[null]", 
                "internal-flags" : "0x4000"
            }, 
            "if-media-flags" :
            {
                "ifmf-none" : "[null]"
            }, 
            "physical-interface-cos-information" :
            {
                "physical-interface-cos-hw-max-queues" : "8", 
                "physical-interface-cos-use-max-queues" : "8", 
                "physical-interface-schedulers" : "0"
            }, 
            "current-physical-address" : "f0:4b:3a:ef:7e:45", 
            "hardware-physical-address" : "f0:4b:3a:ef:7e:45", 
            "interface-flapped" : "2023-03-09 08:13:40 UTC (62w6d 04:14 ago)", 
            "traffic-statistics" :
            {
                "input-bps" : "7184824", 
                "input-pps" : "8233", 
                "output-bps" : "85449600", 
                "output-pps" : "8470"
            }, 
            "active-alarms" :
            {
                "interface-alarms" :    
                {
                    "alarm-not-present" : "[null]"
                }
            }, 
            "active-defects" :
            {
                "interface-alarms" :
                {
                    "alarm-not-present" : "[null]"
                }
            }, 
            "ethernet-pcs-statistics" :
            {
                "bit-error-seconds" : "3", 
                "errored-blocks-seconds" : "3"
            }, 
            "interface-transmit-statistics" : "Disabled", 
            "logical-interface" :
            {
                "name" : "xe-0/1/2.0", 
                "local-index" : "343", 
                "snmp-index" : "555", 
                "if-config-flags" :
                {
                    "iff-up" : "[null]", 
                    "iff-snmp-traps" : "[null]", 
                    "internal-flags" : "0x4004000"
                }, 
                "encapsulation" : "ENET2", 
                "policer-overhead" :    
                {
                }, 
                "traffic-statistics" :
                {
                    "input-packets" : "1184552371726", 
                    "output-packets" : "1206955771514"
                }, 
                "filter-information" :
                {
                }, 
                "address-family" :
                [{
                    "address-family-name" : "inet", 
                    "mtu" : "4456", 
                    "max-local-cache" : "100000", 
                    "new-hold-limit" : "100000", 
                    "intf-curr-cnt" : "1", 
                    "intf-unresolved-cnt" : "0", 
                    "intf-dropcnt" : "0", 
                    "address-family-flags" :
                    {
                        "ifff-rpf-check" : "[null]", 
                        "ifff-rpf-loose-mode" : "[null]", 
                        "ifff-sendbcast-pkt-to-re" : "[null]", 
                        "internal-flags" : "0x0"
                    }, 
                    "interface-address" :
                    {
                        "ifa-flags" :
                        {               
                            "ifaf-current-preferred" : "[null]", 
                            "ifaf-current-primary" : "[null]"
                        }, 
                        "ifa-destination" : "80.239.192.64/30", 
                        "ifa-local" : "80.239.192.66", 
                        "ifa-broadcast" : "80.239.192.67"
                    }
                }, 
                {
                    "address-family-name" : "inet6", 
                    "mtu" : "4456", 
                    "max-local-cache" : "75000", 
                    "new-hold-limit" : "75000", 
                    "intf-curr-cnt" : "2", 
                    "intf-unresolved-cnt" : "0", 
                    "intf-dropcnt" : "0", 
                    "address-family-flags" :
                    {
                        "ifff-rpf-check" : "[null]", 
                        "ifff-rpf-loose-mode" : "[null]", 
                        "internal-flags" : "0x0"
                    }, 
                    "interface-address" :
                    [{
                        "ifa-flags" :
                        {
                            "ifaf-current-preferred" : "[null]", 
                            "ifaf-current-primary" : "[null]"
                        }, 
                        "ifa-destination" : "2001:2000:3080:a9a::/64", 
                        "ifa-local" : "2001:2000:3080:a9a::2", 
                        "interface-address" :
                        {
                            "in6-addr-flags" :
                            {
                                "ifaf-none" : "[null]"
                            }
                        }
                    }, 
                    {
                        "ifa-flags" :
                        {
                            "ifaf-current-preferred" : "[null]", 
                            "internal-flags" : "0x800"
                        }, 
                        "ifa-destination" : "fe80::/64", 
                        "ifa-local" : "fe80::f24b:3aff:feef:7e45", 
                        "interface-address" :
                        {
                            "in6-addr-flags" :
                            {
                                "ifaf-none" : "[null]"
                            }
                        }
                    }
                    ]
                }, 
                {
                    "address-family-name" : "multiservice", 
                    "mtu" : "Unlimited", 
                    "address-family-flags" :
                    {
                        "internal-flags" : "0x0"
                    }
                }]
            }
        }
    }
}
Wed, May 22, 1:02 PM
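
In the output above, the same key holds either a single object (`interface-address` under `inet`) or a list of objects (`interface-address` under `inet6`), so a consumer has to accept both shapes. A minimal sketch of one way to handle that, assuming a hypothetical `as_list` helper and a trimmed-down sample that keeps only the `ifa-local` values from the paste:

```python
import json


def as_list(value):
    """Wrap value in a list unless it already is one (None becomes [])."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


# Trimmed-down sample shaped like the "Valid" output above.
sample = json.loads("""
{"logical-interface": {"address-family": [
  {"address-family-name": "inet",
   "interface-address": {"ifa-local": "80.239.192.66"}},
  {"address-family-name": "inet6",
   "interface-address": [{"ifa-local": "2001:2000:3080:a9a::2"},
                         {"ifa-local": "fe80::f24b:3aff:feef:7e45"}]},
  {"address-family-name": "multiservice"}
]}}
""")


def local_addresses(logical_interface):
    """Map each address family name to its list of local addresses."""
    result = {}
    for family in as_list(logical_interface.get("address-family")):
        name = family["address-family-name"]
        addresses = as_list(family.get("interface-address"))
        result[name] = [address["ifa-local"] for address in addresses]
    return result


addresses = local_addresses(sample["logical-interface"])
```

Normalizing at read time keeps the rest of the parsing code free of type checks, whichever shape the device returns.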
ayounsi updated the title for P62883 cr3-ulsfo> show interfaces xe-0/1/2 | display json - list not being defined properly from untitled to cr3-ulsfo> show interfaces xe-0/1/2 | display json - list not being defined properly.
Wed, May 22, 12:45 PM
ayounsi created P62883 cr3-ulsfo> show interfaces xe-0/1/2 | display json - list not being defined properly.
Wed, May 22, 12:35 PM
ayounsi closed Restricted Task, a subtask of T346722: Sao Paulo, Brazil, South America POP tracking task, as Resolved.
Wed, May 22, 7:34 AM · ops-magru
ayounsi committed rOSNEfef0e314d323: LibreNMS report: fix for server tech PDUs special case.
Wed, May 22, 7:34 AM
ayounsi committed rOSNE1ee9d9e76ec6: LibreNMS: add special case.
Wed, May 22, 7:26 AM

May 21 2024

ayounsi added a comment to T362392: Routed Ganeti: Add support for VM BGP.

https://trubka.network.cz/pipermail/bird-users/2024-May/017687.html

I already made it a feature request and plan to implement it.

May 21 2024, 12:44 PM · Ganeti
Restricted Application added a project to T365289: partial power outage for lsw1-e5-eqiad: DC-Ops.
May 21 2024, 6:48 AM · DC-Ops, SRE, netops, ops-eqiad, Infrastructure-Foundations
ayounsi reopened Restricted Task, a subtask of T346722: Sao Paulo, Brazil, South America POP tracking task, as Open.
May 21 2024, 6:42 AM · ops-magru

May 17 2024

ayounsi added a comment to P62586 drop-nak diff.

To block DHCPNAK packets while allowing other DHCP packets through using iptables, you need a rule that specifically matches the DHCPNAK message type. DHCPNAK packets carry the value 6 in the DHCP Message Type option (Option 53).

May 17 2024, 3:06 PM
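
The matching logic described in that comment can be illustrated outside of iptables. The following is a minimal sketch, not the rule from the paste: it assumes the input is the raw DHCP payload (the UDP data), walks the options after the 236-byte BOOTP header and magic cookie, and reports whether Option 53 carries the value 6. The function names are hypothetical.

```python
from typing import Optional

BOOTP_HEADER_LEN = 236                        # fixed BOOTP fields before the options
MAGIC_COOKIE = bytes([0x63, 0x82, 0x53, 0x63])
OPT_PAD, OPT_MESSAGE_TYPE, OPT_END = 0, 53, 255
DHCPNAK = 6


def dhcp_message_type(payload: bytes) -> Optional[int]:
    """Return the value of Option 53, or None if it is absent or malformed."""
    options_start = BOOTP_HEADER_LEN + len(MAGIC_COOKIE)
    if payload[BOOTP_HEADER_LEN:options_start] != MAGIC_COOKIE:
        return None
    i = options_start
    while i < len(payload):
        code = payload[i]
        if code == OPT_END:
            break
        if code == OPT_PAD:                   # pad is a single byte, no length
            i += 1
            continue
        if i + 1 >= len(payload):             # truncated option
            break
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == OPT_MESSAGE_TYPE and len(value) == 1:
            return value[0]
        i += 2 + length
    return None


def is_dhcpnak(payload: bytes) -> bool:
    return dhcp_message_type(payload) == DHCPNAK


# Synthetic payloads: zeroed header, cookie, then options.
header = bytes(BOOTP_HEADER_LEN) + MAGIC_COOKIE
nak = header + bytes([53, 1, 6, 255])
ack = header + bytes([53, 1, 5, 255])
nak_after_other_option = header + bytes([0, 12, 2, 0x61, 0x62, 53, 1, 6, 255])
```

Because options are type-length-value encoded, Option 53 is not guaranteed to sit at a fixed offset, which is why the sketch scans the option list rather than reading a single byte position.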
ayounsi added a comment to T362421: magru network setup.

The Telxius community doesn't seem to be of any effect so far, I'll wait for their reply, maybe they changed or need to be enabled on their side first. I'll look at the other providers afterwards.

May 17 2024, 1:17 PM · Patch-For-Review, netops, Infrastructure-Foundations, SRE
ayounsi added a comment to T362421: magru network setup.

Cogent is a bit surprising, from EU or the US they route to magru.

May 17 2024, 10:35 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE
ayounsi added a comment to T362421: magru network setup.

Before advertising ns2, we need to do some traffic engineering. Telxius being part of Spain's main ISP, Telefonica ES prefers magru to drmrs :
See https://w.wiki/A6qH

Screenshot 2024-05-17 at 09-44-34 Turnilo (1.38.2).png (699×1 px, 91 KB)

May 17 2024, 7:53 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE

May 15 2024

ayounsi added a comment to T355750: CFSSL gencert "remote error: tls: certificate require".

As a data point, same error today with cumin1002:~$ sudo cookbook sre.network.tls lsw1-d1-codfw

May 15 2024, 7:26 AM · CFSSL-PKI, Infrastructure-Foundations

May 14 2024

ayounsi created P62386 dhcpd /32 v4 allocation.
May 14 2024, 1:46 PM
ayounsi added a comment to T187929: Cloud IPv6 subnets.

@cmooney what do you think of duplicating the other POPs allocation scheme?
Looking at eqiad as an example, keep 2a02:ec80:a000::/40 as "reserved for future growth"
Then use 2a02:ec80:a000::/48 for the existing WMCS eqiad infra
Then 2a02:ec80:a000::/56 for public, another /56 for private, /55 for the infra, 2a02:ec80:a000:ed1a::/64 for VIPs, etc
Or is the risk to not be able to allocate a /64 for each VM ? (a /48 is 65536 /64s)

May 14 2024, 9:28 AM · User-aborrero, Infrastructure-Foundations, SRE, netops
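
The arithmetic in that comment can be checked with Python's standard `ipaddress` module. A small sketch using only the prefixes quoted above; the variable names are illustrative:

```python
import ipaddress

reserved = ipaddress.ip_network("2a02:ec80:a000::/40")   # "reserved for future growth"
site = ipaddress.ip_network("2a02:ec80:a000::/48")       # existing WMCS eqiad infra
vips = ipaddress.ip_network("2a02:ec80:a000:ed1a::/64")  # VIPs

# Number of /64s in the /48: 2^(64 - 48)
count_64 = 2 ** (64 - site.prefixlen)

# First /56 carved out of the /48
first_56 = next(site.subnets(new_prefix=56))

print(count_64)                    # 65536
print(first_56)                    # 2a02:ec80:a000::/56
print(site.subnet_of(reserved))    # True
print(vips.subnet_of(site))        # True
```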

May 13 2024

ayounsi closed T363117: eqiad: magru transport down as Resolved.

All good now.

May 13 2024, 3:56 PM · SRE, ops-eqiad
ayounsi updated subscribers of T351418: Upgrade from ISC-DHCP Server to KEA-DHCP Server.

Thanks, indeed it looks very promising ! If he has the time maybe @MoritzMuehlenhoff can provide some pointers on the packaging steps.

May 13 2024, 9:05 AM · Infrastructure-Foundations

May 10 2024

ayounsi created T364633: connected console ports attached to unracked device.
May 10 2024, 3:23 PM · SRE, ops-codfw
ayounsi reopened T361871: codfw: use old asw switches from row A and B as msw switches in row C and D as "Open".

I'm not fond of setting up a password on un-managed switches. For example what is the procedure to rotate the password ?

May 10 2024, 2:49 PM · SRE, netops, ops-codfw, Infrastructure-Foundations
ayounsi added a subtask for T346722: Sao Paulo, Brazil, South America POP tracking task: Unknown Object (Task).
May 10 2024, 2:26 PM · ops-magru

May 3 2024

ayounsi added a comment to T364092: Upgrade core routers to Junos 22.4R3.

Both Junos 22.2R3-Sx and Junos 22.4R3 are latest recommended. fyi, I went with 22.4R3 in magru.

May 3 2024, 9:55 AM · netops, Infrastructure-Foundations, SRE

May 2 2024

ayounsi updated the task description for T364016: Q4:magru VM tracking task.
May 2 2024, 3:53 PM · Traffic, Infrastructure-Foundations

Apr 29 2024

ayounsi created P61404 cr1-magru - junos validate.
Apr 29 2024, 1:02 PM

Apr 28 2024

ayounsi added a comment to T362522: mr1-eqsin performance issue.

Good idea, worth trying ! If it's enough it would be less of a pain than changing the SSH port.

Apr 28 2024, 1:15 PM · Infrastructure-Foundations, netops

Apr 23 2024

ayounsi added a comment to T362985: Improve how we generate DNS entries from Netbox.

Another question I think is "do we still have to go through text files ?"
It made sense back when we were manually editing the configuration, and for the few places we still do, but it seems sub-optimal to go from Netbox database to text file to gdnsd.
Probably too simplistic, but could we generate a raw list of IP/FQDN from Netbox, and feed it to gDNSd without having to care about PTR and zones structures ?

Apr 23 2024, 12:02 PM · Infrastructure-Foundations, SRE
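
As a rough illustration of the "raw list of IP/FQDN" idea, the sketch below derives both the forward and the PTR record from a single (IP, FQDN) pair, so no reverse zone has to be maintained by hand. The input pairs are made-up documentation addresses standing in for a Netbox export, and the tuple format is an assumption; it is not a format gdnsd is known to accept.

```python
import ipaddress

# Hypothetical (IP, FQDN) pairs standing in for a Netbox export.
records = [
    ("192.0.2.10", "host1001.example.org"),
    ("2001:db8::10", "host1001.example.org"),
]


def forward_and_reverse(ip, fqdn):
    """Return (forward, reverse) records as (name, type, value) tuples."""
    addr = ipaddress.ip_address(ip)
    rtype = "A" if addr.version == 4 else "AAAA"
    forward = (fqdn + ".", rtype, str(addr))
    # reverse_pointer builds the in-addr.arpa / ip6.arpa name for us.
    reverse = (addr.reverse_pointer + ".", "PTR", fqdn + ".")
    return forward, reverse


generated = [forward_and_reverse(ip, fqdn) for ip, fqdn in records]
```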

Apr 22 2024

ayounsi triaged T363117: eqiad: magru transport down as High priority.
Apr 22 2024, 9:07 PM · SRE, ops-eqiad

Apr 15 2024

ayounsi added a comment to T362522: mr1-eqsin performance issue.

I have checked the logs and it looks like the issue we are facing with the slowness on the device and the reboots is the product of a brute force SSH attack on the SRX.
The login attempts are creating processes on the SRX that sometimes don't close correctly or take more time to fully close. If enough of them stack up it can cause the reboot.
To fix this we can set a firewall filter for the control plane of the SRX and use an allow list to mitigate the packets that are actually reaching the device.

Apr 15 2024, 9:22 PM · Infrastructure-Foundations, netops
ayounsi added a comment to T277438: Move management routers ssh port.

We might have to re-prioritize this task because of T362522: mr1-eqsin performance issue

Apr 15 2024, 9:22 PM · Patch-For-Review, Infrastructure-Foundations, SRE, netops
ayounsi removed a project from T346722: Sao Paulo, Brazil, South America POP tracking task: procurement.
Apr 15 2024, 9:16 PM · ops-magru
ayounsi added a comment to T346722: Sao Paulo, Brazil, South America POP tracking task.

Done :)

Apr 15 2024, 9:15 PM · ops-magru
ayounsi updated subscribers of T360297: Take advantage of 10Gb NICs in the new network stack.

I started implementing a fix for that but it quickly gets complex as it means shutting down a port, and fully setting up another one. Before going that way let's see if it's something we want/need to do.
Also as I think you mentioned somewhere else, it would mess with @Papaul 's rack U to switch port mapping.

Apr 15 2024, 4:08 PM · Infrastructure-Foundations, DC-Ops, netops
ayounsi claimed T362421: magru network setup.
Apr 15 2024, 2:27 PM · Patch-For-Review, netops, SRE, Infrastructure-Foundations
ayounsi claimed T362523: Juniper: use export-format state-data json compact.
Apr 15 2024, 2:25 PM · Infrastructure-Foundations, netops
ayounsi added a comment to T362522: mr1-eqsin performance issue.

Opened JTAC 2024-0415-128563 and attached logs/RSI/coredump.

Apr 15 2024, 12:31 PM · Infrastructure-Foundations, netops
ayounsi created T362523: Juniper: use export-format state-data json compact.
Apr 15 2024, 12:11 PM · Infrastructure-Foundations, netops
ayounsi updated the task description for T362522: mr1-eqsin performance issue.
Apr 15 2024, 11:56 AM · Infrastructure-Foundations, netops
ayounsi triaged T362522: mr1-eqsin performance issue as High priority.
Apr 15 2024, 11:53 AM · Infrastructure-Foundations, netops

Apr 12 2024

ayounsi added a comment to T362421: magru network setup.

Prefixes assigned in Netbox: https://netbox.wikimedia.org/ipam/prefixes/?site_id=11

Apr 12 2024, 3:33 PM · Patch-For-Review, netops, SRE, Infrastructure-Foundations
ayounsi updated the task description for T362421: magru network setup.
Apr 12 2024, 3:25 PM · Patch-For-Review, netops, SRE, Infrastructure-Foundations
ayounsi created T362421: magru network setup.
Apr 12 2024, 3:25 PM · Patch-For-Review, netops, SRE, Infrastructure-Foundations
ayounsi triaged T362392: Routed Ganeti: Add support for VM BGP as Low priority.
Apr 12 2024, 10:42 AM · Ganeti
ayounsi added a comment to T351418: Upgrade from ISC-DHCP Server to KEA-DHCP Server.

As a data point, Ganeti's iPXE sends 4 DHCP requests, doubling the timeout between each: 1s, 2s, 4s.

12:34:21.812153 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:22.831693 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:24.863614 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:28.928072 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
Apr 12 2024, 8:15 AM · Infrastructure-Foundations
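
The doubling can be read straight off the capture timestamps above. A short sketch that computes the gaps between the four requests, with the timestamps copied from the tcpdump lines:

```python
from datetime import datetime

# Timestamps of the four DHCP requests in the capture above.
timestamps = [
    "12:34:21.812153",
    "12:34:22.831693",
    "12:34:24.863614",
    "12:34:28.928072",
]

parsed = [datetime.strptime(t, "%H:%M:%S.%f") for t in timestamps]

# Gap between consecutive requests, rounded to the nearest second.
gaps = [round((later - earlier).total_seconds())
        for earlier, later in zip(parsed, parsed[1:])]

# Each timeout is twice the previous one.
doubling = all(b == 2 * a for a, b in zip(gaps, gaps[1:]))

print(gaps)      # [1, 2, 4]
print(doubling)  # True
```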
ayounsi updated the task description for T362330: Routed Ganeti : Add support for public IPs.
Apr 12 2024, 7:03 AM · Ganeti

Apr 11 2024

ayounsi created T362330: Routed Ganeti : Add support for public IPs.
Apr 11 2024, 2:19 PM · Ganeti

Apr 8 2024

ayounsi added a parent task for T336275: Upgrade Netbox to 4.x: T252747: Generate ssh_known_hosts for network devices.
Apr 8 2024, 2:55 PM · Patch-For-Review, Infrastructure-Foundations, netbox
ayounsi added a subtask for T252747: Generate ssh_known_hosts for network devices: T336275: Upgrade Netbox to 4.x.
Apr 8 2024, 2:55 PM · Infrastructure-Foundations, SRE-tools, SRE
ayounsi added a subtask for T361549: Automatically run Capirca Netbox script regularly: T358339: Netbox: capirca.getHosts script runs into timeout.
Apr 8 2024, 2:28 PM · netbox, Infrastructure-Foundations, netops
ayounsi added a parent task for T358339: Netbox: capirca.getHosts script runs into timeout: T361549: Automatically run Capirca Netbox script regularly.
Apr 8 2024, 2:28 PM · Infrastructure-Foundations, netbox
ayounsi triaged T361549: Automatically run Capirca Netbox script regularly as Medium priority.
Apr 8 2024, 2:28 PM · netbox, netops, Infrastructure-Foundations
ayounsi added a comment to T358096: Automation to add extra IPs to servers.

Netbox script is great, we can call it from a cookbook if needed later on.

Apr 8 2024, 9:26 AM · Infrastructure-Foundations
ayounsi added a comment to T361871: codfw: use old asw switches from row A and B as msw switches in row C and D.

Thanks. What I don't understand is that if they go through ZTP or manual basic setup, they will by definition be managed switches (with root password, IP, etc). I don't think we can have a middle ground where we have only some config.

Apr 8 2024, 9:19 AM · SRE, netops, ops-codfw, Infrastructure-Foundations

Apr 5 2024

ayounsi added a comment to T361871: codfw: use old asw switches from row A and B as msw switches in row C and D.

We first need to discuss if we want to start using managed switches for management switches (except the aggregation ones).
On the plus side it's convenient to have the extra visibility, but it adds a lot of management overhead to our automation, while I'm not sure we have the resources for that.

Apr 5 2024, 8:03 AM · SRE, netops, ops-codfw, Infrastructure-Foundations
ayounsi closed T361825: eqiad-drmrs transport down (April 2024) as Resolved.

RFO: The unavailability of the link was due to problems with optical modules and cards at the Marseille and Paris, France locations on the Telxius network. The link returned to normal after the modules and cards were replaced.

Apr 5 2024, 6:36 AM · Infrastructure-Foundations, netops