Both Junos 22.2R3-Sx and Junos 22.4R3 are the latest recommended releases. FYI, I went with 22.4R3 on magru.
Fri, May 3
Thu, May 2
Mon, Apr 29
Sun, Apr 28
Good idea, worth trying! If it's enough, it would be less of a pain than changing the SSH port.
Tue, Apr 23
Another question, I think, is "do we still have to go through text files?"
It made sense back when we were manually editing the configuration, and it still does for the few places where we do, but it seems sub-optimal to go from the Netbox database to a text file to gdnsd.
Probably too simplistic, but could we generate a raw list of IP/FQDN pairs from Netbox and feed it to gdnsd without having to care about PTR records and zone structures?
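The raw-list idea could be sketched like this (hypothetical helper; in practice the IP/FQDN pairs would come from the Netbox API, e.g. via pynetbox, and gdnsd or a wrapper would still need to build the actual zones):

```python
def gdnsd_records(pairs):
    """Turn (ip, fqdn) pairs into simple A/AAAA record lines,
    leaving zone and PTR structure to whatever consumes them."""
    lines = []
    for ip, fqdn in pairs:
        # Crude family detection: IPv6 addresses contain a colon
        rtype = "AAAA" if ":" in ip else "A"
        lines.append(f"{fqdn}. {rtype} {ip}")
    return lines

# Example with made-up data:
print("\n".join(gdnsd_records([
    ("10.64.0.10", "host1.example.wmnet"),
    ("2620:0:861:1:10::10", "host1.example.wmnet"),
])))
```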
Mon, Apr 22
Mon, Apr 15
I have checked the logs, and it looks like the slowness on the device and the reboots we are facing are the product of a brute-force SSH attack on the SRX.
The login attempts create processes on the SRX that sometimes don't close correctly or take more time to fully close. If enough of them are stuck, it can cause the reboot.
To fix this we can set a firewall filter for the control plane of the SRX and use an allow list to limit the packets that actually reach the device.
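A control-plane filter along those lines could look like this (hedged sketch in Junos set-style; the prefix-list name and addresses are placeholders, not our actual config):

```
set policy-options prefix-list mgmt-allowlist 203.0.113.0/24
set firewall family inet filter protect-re term allow-ssh from source-prefix-list mgmt-allowlist
set firewall family inet filter protect-re term allow-ssh from protocol tcp
set firewall family inet filter protect-re term allow-ssh from destination-port ssh
set firewall family inet filter protect-re term allow-ssh then accept
set firewall family inet filter protect-re term drop-ssh from protocol tcp
set firewall family inet filter protect-re term drop-ssh from destination-port ssh
set firewall family inet filter protect-re term drop-ssh then discard
set interfaces lo0 unit 0 family inet filter input protect-re
```

Applying the filter on lo0 means it only affects traffic destined to the routing engine, not transit traffic; the final drop term would also need accept terms for any other control-plane protocols in use (BGP, NTP, etc.) before being deployed for real.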
We might have to re-prioritize this task because of T362522: mr1-eqsin performance issue
Done :)
I started implementing a fix for that but it quickly gets complex as it means shutting down a port, and fully setting up another one. Before going that way let's see if it's something we want/need to do.
Also, as I think you mentioned somewhere else, it would mess with @Papaul's rack-U-to-switch-port mapping.
Opened JTAC 2024-0415-128563 and attached logs/RSI/coredump.
Fri, Apr 12
Prefixes assigned in Netbox: https://netbox.wikimedia.org/ipam/prefixes/?site_id=11
As a data point, Ganeti's iPXE sends 4 DHCP requests, doubling the timeout between each: 1s, 2s, 4s.
12:34:21.812153 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:22.831693 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:24.863614 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:28.928072 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
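The doubling pattern in that capture can be expressed as a simple backoff sequence (illustrative sketch, not iPXE's actual code):

```python
def dhcp_retry_delays(attempts=4, base=1.0):
    """Delays between successive DHCP requests when the timeout
    doubles each time: 1s, 2s, 4s for 4 total requests,
    matching the inter-packet gaps in the capture above."""
    return [base * 2 ** i for i in range(attempts - 1)]

print(dhcp_retry_delays())  # [1.0, 2.0, 4.0]
```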
Thu, Apr 11
Apr 8 2024
Netbox script is great, we can call it from a cookbook if needed later on.
Thanks. What I don't understand is that if they go through ZTP or manual basic setup, they will by definition be managed switches (with root password, IP, etc.). I don't think we can have a middle ground where we have only some of the config.
Apr 5 2024
We first need to discuss if we want to start using managed switches for management switches (except the aggregation ones).
On the plus side, it's convenient to have the extra visibility, but it adds a lot of management overhead to our automation, and I'm not sure we have the resources for that.
RFO: The unavailability of the link was due to problems with optical modules and cards at the Marseille and Paris, France locations on the Telxius network. The link returned to normal after the modules and cards were replaced.
Apr 4 2024
Emailed Telxius NOC.
Apr 3 2024
For information, https://github.com/Eskemm-Numerique/ntc-netbox-plugin-metrics-ext/pull/1 got merged, so ntc-netbox-plugin-metrics-ext should now work out of the box.
We can consider this task completed successfully.
Apr 2 2024
Ping? :)
For the record I looked deeper at gNMI to configure Juniper devices.
Thanks for the task. I was thinking of either a timer or using Netbox's webhooks to only run it when relevant changes are made.
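For the timer variant, a minimal systemd sketch (the unit name and the service it triggers are hypothetical, not anything deployed):

```
# /etc/systemd/system/netbox-dns-export.timer (hypothetical unit name)
[Unit]
Description=Periodically run the Netbox export

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
```

The webhook route would instead register a Netbox webhook on the relevant object types, trading the fixed delay of a timer for event-driven runs at the cost of handling retries ourselves.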
Mar 28 2024
Mar 25 2024
So we need to decide if this imbalance for local queries is going to be an issue.
I think load is the main thing to look at. I briefly thought about cold caches but if I understand correctly, all servers will keep receiving some traffic.
Mar 20 2024
The counters are for failed packets, not packets dropped due to saturation (that's a different counter). So there is something wrong somewhere, and it looks like it's not the cable or the NIC based on @Papaul's comment.
It's fine to not do anything about it as long as people are aware; there is a little risk of alerting noise, but we can revisit later on if it becomes a larger issue.
Thanks, and no problem!
elastic2107-2108 are unreachable and have DRAC problems. I'll try and take a look at them tomorrow.
Please set their Netbox status to Failed then :)
Mar 19 2024
First use of the journaling feature in https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1012680/
For the ML hosts - our K8s clusters don't currently require 10G bandwidth, and at the time we didn't want to "waste" 10G ports if not really needed. But if it is not a problem anymore, we'd be happy to switch (let us know what the current best practice is regarding 1G vs 10G) :)
Feel free to test it on Netbox next
Mar 18 2024
Thanks for the task, nothing private in there.
Hi, as some of those hosts had Puppet disabled for a long time (with this task as the disable message), they got removed from PuppetDB.
As hosts not in PuppetDB can be problematic (lack of security updates, for example), we have a check to catch them:
https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/
Full list is currently:
an-worker1096 (WMF4839)
elastic2107 (WMF11895)
elastic2108 (WMF11896)
moss-be2001 (WMF5769)
moss-be2002 (WMF5772)
wdqs1022 (WMF11314)
wdqs1023 (WMF11315)
wdqs1024 (WMF11316)
FYI it's alerting for one of its PSUs being down, but we don't really care anymore:
asw-a-codfw> show system alarms
1 alarms currently active
Alarm time Class Description
2024-03-16 09:20:23 UTC Major FPC 6 PEM 1 is not powered
Mar 8 2024
Mar 6 2024
Thanks for looking into it!
Mar 5 2024
I'd recommend starting by turning up a small country/region on that continent (Uruguay/Paraguay for example), ideally outside of peak time. That will help warm up the caches nice and slowly and reduce the impact of any issue, then ramping it up progressively.
Thank you both! Something seems funky with db2099 as well.
Feb 29 2024
@ABran-WMF this host needs a quick downtime to replace the SFP. Please sync up with Jenn.
As far as I know, the motherboard serial number is the most convenient unique identifier we can use, as it's on the chassis for most of the devices and we can query it programmatically.
One possible path forward is to work with Dell's support to solve T304483: PXE boot NIC firmware regression
Feb 28 2024
All good, thanks!
Feb 27 2024
To clarify, there was no blocker in any of my comments.
Puppet and LDAP updated.
Given the very sporadic nature of the issue, I'd say it's a provider issue and not an optic issue.
https://librenms.wikimedia.org/graphs/to=1709017500/id=11592/type=port_errors/from=1677481500/
If it happens too often we could look at replacing the optic.
cloud VPS doesn't really seem feasible to me
I'm curious to hear more about why it doesn't.
Feb 26 2024
Peering link to DE-CIX on cr2-codfw was saturating, deployed the patch above to fix the immediate issue.
See also {T192688}
Feb 23 2024
Give it ~30min for the change to propagate and you should be good to go. Please let us know if there is any issue.
For testing hosts I'd prefer running on private IPs, as those tend to have Puppet disabled for longer periods of time and carry "experimental" changes.
User added to the NDA LDAP group. Only thing left is the patch above once reviewed.
https://netbox.wikimedia.org/admin/extras/jobresult/?name=capirca.GetHosts&o=-3.1.2
Not a good track record.
I think we should "just" put it on the list of things to check after the Netbox upgrade. This behavior seems like a bug, and might have been fixed since.
Can you update the key on your MediaWiki page as well? Thanks!
Setting to "decommissioning" will cause automation to remove the mgmt DNS record.
Feb 21 2024
The one at a time part is what worries me a bit,
Doesn't seem like a hard problem to solve :)
The other bit is the name
Similarly we could pass a pattern or prefix.
T358096: Automation to add extra IPs to servers for the Cassandra/extra IPs usecase.
Feb 20 2024
You should be good to go! Please re-open if there are any issues.
Closing as this ticket has stalled; please re-open if needed or follow up in the other one.