Both Junos 22.2R3-Sx and Junos 22.4R3 are the latest recommended releases. FYI, I went with 22.4R3 on magru.
Fri, May 3
Thu, May 2
Mon, Apr 29
Sun, Apr 28
Good idea, worth trying! If it's enough, it would be less of a pain than changing the SSH port.
Tue, Apr 23
Another question, I think, is "do we still have to go through text files?"
It made sense back when we were manually editing the configuration, and it still does for the few places where we do, but it seems sub-optimal to go from the Netbox database to a text file to gdnsd.
Probably too simplistic, but could we generate a raw list of IP/FQDN pairs from Netbox and feed it to gdnsd without having to care about PTR records and zone structures?
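The raw-list idea could be sketched like this (hypothetical helper; in practice the IP/FQDN pairs would come from the Netbox API, e.g. via pynetbox, and gdnsd or a wrapper would still need to build the actual zones):

```python
def gdnsd_records(pairs):
    """Turn (ip, fqdn) pairs into simple A/AAAA record lines,
    leaving zone and PTR structure to whatever consumes them."""
    lines = []
    for ip, fqdn in pairs:
        # Crude family detection: IPv6 addresses contain a colon
        rtype = "AAAA" if ":" in ip else "A"
        lines.append(f"{fqdn}. {rtype} {ip}")
    return lines

# Example with made-up data:
print("\n".join(gdnsd_records([
    ("10.64.0.10", "host1.example.wmnet"),
    ("2620:0:861:1:10::10", "host1.example.wmnet"),
])))
```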
Mon, Apr 22
Mon, Apr 15
I have checked the logs, and it looks like the slowness on the device and the reboots we are facing are the product of a brute-force SSH attack on the SRX.
The login attempts create processes on the SRX that sometimes don't close correctly or take more time to fully close. If enough of them are stuck, it can cause the reboot.
To fix this we can set a firewall filter for the control plane of the SRX and use an allow list to limit the packets that actually reach the device.
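A control-plane filter along those lines could look like this (hedged sketch in Junos set-style; the prefix-list name and addresses are placeholders, not our actual config):

```
set policy-options prefix-list mgmt-allowlist 203.0.113.0/24
set firewall family inet filter protect-re term allow-ssh from source-prefix-list mgmt-allowlist
set firewall family inet filter protect-re term allow-ssh from protocol tcp
set firewall family inet filter protect-re term allow-ssh from destination-port ssh
set firewall family inet filter protect-re term allow-ssh then accept
set firewall family inet filter protect-re term drop-ssh from protocol tcp
set firewall family inet filter protect-re term drop-ssh from destination-port ssh
set firewall family inet filter protect-re term drop-ssh then discard
set interfaces lo0 unit 0 family inet filter input protect-re
```

Applying the filter on lo0 means it only affects traffic destined to the routing engine, not transit traffic; the final drop term would also need accept terms for any other control-plane protocols in use (BGP, NTP, etc.) before being deployed for real.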
We might have to re-prioritize this task because of T362522: mr1-eqsin performance issue
Done :)
I started implementing a fix for that but it quickly gets complex as it means shutting down a port, and fully setting up another one. Before going that way let's see if it's something we want/need to do.
Also, as I think you mentioned somewhere else, it would mess with @Papaul's rack-U-to-switch-port mapping.
Opened JTAC 2024-0415-128563 and attached logs/RSI/coredump.
Fri, Apr 12
Prefixes assigned in Netbox: https://netbox.wikimedia.org/ipam/prefixes/?site_id=11
As a data point, Ganeti's iPXE sends 4 DHCP requests, doubling the timeout between each: 1s, 2s, 4s.
12:34:21.812153 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:22.831693 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:24.863614 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
12:34:28.928072 IP ganeti2033.codfw.wmnet.bootps > install2004.wikimedia.org.bootps: BOOTP/DHCP, Request from aa:00:00:7e:e0:91 (oui Unknown), length 414
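The doubling pattern in that capture can be expressed as a simple backoff sequence (illustrative sketch, not iPXE's actual code):

```python
def dhcp_retry_delays(attempts=4, base=1.0):
    """Delays between successive DHCP requests when the timeout
    doubles each time: 1s, 2s, 4s for 4 total requests,
    matching the inter-packet gaps in the capture above."""
    return [base * 2 ** i for i in range(attempts - 1)]

print(dhcp_retry_delays())  # [1.0, 2.0, 4.0]
```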
Thu, Apr 11
Apr 8 2024
Netbox script is great, we can call it from a cookbook if needed later on.
Thanks. What I don't understand is that if they go through ZTP or manual basic setup, they will by definition be managed switches (with root password, IP, etc.). I don't think we can have a middle ground where we have only some of the config.
Apr 5 2024
We first need to discuss if we want to start using managed switches for management switches (except the aggregation ones).
On the plus side, it's convenient to have the extra visibility, but it adds a lot of management overhead to our automation, and I'm not sure we have the resources for that.
RFO: The unavailability of the link was due to problems with optical modules and cards at the Marseille and Paris, France locations on the Telxius network. The link returned to normal after the modules and cards were replaced.
Apr 4 2024
Emailed Telxius NOC.
Apr 3 2024
For information, https://github.com/Eskemm-Numerique/ntc-netbox-plugin-metrics-ext/pull/1 got merged, so ntc-netbox-plugin-metrics-ext should now work out of the box.
We can consider this task completed successfully.
Apr 2 2024
Ping? :)
For the record I looked deeper at gNMI to configure Juniper devices.
Thanks for the task. I was thinking of either a timer or using Netbox's webhooks to only run it when relevant changes are made.
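For the timer variant, a minimal systemd sketch (the unit name and the service it triggers are hypothetical, not anything deployed):

```
# /etc/systemd/system/netbox-dns-export.timer (hypothetical unit name)
[Unit]
Description=Periodically run the Netbox export

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
```

The webhook route would instead register a Netbox webhook on the relevant object types, trading the fixed delay of a timer for event-driven runs at the cost of handling retries ourselves.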
Mar 28 2024
Mar 25 2024
So we need to decide if this imbalance for local queries is going to be an issue.
I think load is the main thing to look at. I briefly thought about cold caches but if I understand correctly, all servers will keep receiving some traffic.
Mar 20 2024
The counters are for failed packets, not packets dropped due to saturation (that's a different counter). So there is something wrong somewhere, and it looks like it's not the cable or the NIC based on @Papaul's comment.
It's fine to not do anything about it as long as people are aware; there is a little risk of alerting noise, but we can revisit later on if it becomes a larger issue.
Thanks, and no problem!
elastic2107-2108 are unreachable and have DRAC problems. I'll try and take a look at them tomorrow.
Please set their Netbox status to Failed then :)
Mar 19 2024
First use of the journaling feature in https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/1012680/
For the ML hosts - our K8s clusters don't currently require 10G bandwidth, and at the time we didn't want to "waste" 10G ports if not really needed. But if it is not a problem anymore, we'd be happy to switch (let us know what the current best practice is regarding 1G vs 10G) :)
Feel free to test it on Netbox next
Mar 18 2024
Thanks for the task, nothing private in there.
Hi, as some of those hosts had Puppet disabled for a long time (with this task as the disable message), they got removed from PuppetDB.
As hosts not in PuppetDB can be problematic (lack of security updates, for example), we have a check to catch them:
https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/
Full list is currently:
an-worker1096 (WMF4839)
elastic2107 (WMF11895)
elastic2108 (WMF11896)
moss-be2001 (WMF5769)
moss-be2002 (WMF5772)
wdqs1022 (WMF11314)
wdqs1023 (WMF11315)
wdqs1024 (WMF11316)
FYI it's alerting for one of its PSUs being down, but we don't really care anymore:
asw-a-codfw> show system alarms
1 alarms currently active
Alarm time Class Description
2024-03-16 09:20:23 UTC Major FPC 6 PEM 1 is not powered
Mar 8 2024
Mar 6 2024
Thanks for looking into it!
Mar 5 2024
I'd recommend starting by turning up a small country/region on that continent (Uruguay/Paraguay for example), ideally outside of peak time. That will help warm up the caches nice and slowly and reduce the impact of any issue, then ramping it up progressively.
Thank you both! Something seems funky with db2099 as well.
Feb 29 2024
@ABran-WMF this host needs a quick downtime to replace the SFP. Please sync up with Jenn.
As far as I know, the motherboard serial number is the most convenient unique identifier we can use, as it's on the chassis for most of the devices and we can query it programmatically.
One possible path forward is to work with Dell's support to solve T304483: PXE boot NIC firmware regression
Feb 28 2024
All good, thanks!
Feb 27 2024
To clarify, there was no blocker in any of my comments.
Puppet and LDAP updated.
Given the very sporadic nature of the issue, I'd say it's a provider issue and not an optic issue.
https://librenms.wikimedia.org/graphs/to=1709017500/id=11592/type=port_errors/from=1677481500/
If it happens too often we could look at replacing the optic.
cloud VPS doesn't really seem feasible to me
I'm curious to hear more about why it doesn't.
Feb 26 2024
Peering link to DE-CIX on cr2-codfw was saturating, deployed the patch above to fix the immediate issue.
See also {T192688}
Feb 23 2024
Give it ~30min for the change to propagate and you should be good to go. Please let us know if there is any issue.
For testing hosts I'd prefer running on private IPs, as those tend to have Puppet disabled for longer periods of time and carry "experimental" changes.
User added to the NDA LDAP group. Only thing left is the patch above once reviewed.
https://netbox.wikimedia.org/admin/extras/jobresult/?name=capirca.GetHosts&o=-3.1.2
Not a good track record.
I think we should "just" put it on the list of things to check after the Netbox upgrade. This behavior seems like a bug, and might have been fixed since.
Can you update the key on your MediaWiki page as well? Thanks!
Setting to "decommissioning" will cause automation to remove the mgmt DNS record.
Feb 21 2024
The one at a time part is what worries me a bit,
Doesn't seem like a hard problem to solve :)
The other bit is the name
Similarly we could pass a pattern or prefix.
T358096: Automation to add extra IPs to servers for the Cassandra/extra IPs usecase.
Feb 20 2024
You should be good to go! Please re-open if there are any issues.
Closing as this ticket has stalled; please re-open if needed or follow up in the other one.