Thu, May 21
Confirmed that if dns4001 and dns4002 are down, ulsfo will stop advertising 22.214.171.124/24 to the world but will still have a route to 126.96.36.199/32 via codfw.
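Not our actual implementation, but to illustrate the failover logic above: a minimal ExaBGP-style health-check sketch that withdraws the anycast prefix once every local DNS host fails (hostnames, prefix, and port below are placeholders, not production values).
```
#!/usr/bin/env python3
# Illustration only: withdraw the anycast prefix when all local DNS hosts are down.
# Hostnames and the prefix are placeholders, not our production values.
import socket
import time

DNS_HOSTS = ["dns-a.example", "dns-b.example"]   # placeholder local resolvers
ANYCAST_PREFIX = "192.0.2.0/24"                  # placeholder advertised prefix

def host_up(host: str, port: int = 53, timeout: float = 2.0) -> bool:
    """TCP connect check; a real check would validate an actual DNS answer."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

advertised = False
while True:
    healthy = any(host_up(h) for h in DNS_HOSTS)
    if healthy and not advertised:
        # ExaBGP reads announce/withdraw commands from this process's stdout.
        print(f"announce route {ANYCAST_PREFIX} next-hop self", flush=True)
        advertised = True
    elif not healthy and advertised:
        print(f"withdraw route {ANYCAST_PREFIX} next-hop self", flush=True)
        advertised = False
    time.sleep(10)
```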
Good idea! What's the limit?
- Comcast - large US ISP - https://atlas.ripe.net/probes/6080/ - https://atlas.ripe.net/probes/6072/
- RIPE to have something to compare esams with - https://atlas.ripe.net/probes/6307/
- LACNIC to have something in South America - https://atlas.ripe.net/probes/6054/
- And on the other coast - https://atlas.ripe.net/probes/6554/
- This one sponsored by APNIC, located in Singapore - https://atlas.ripe.net/probes/6096/
- https://atlas.ripe.net/probes/6380/ and https://atlas.ripe.net/probes/6358/ to have something on both sides of Africa
We could add more regions depending on the "granularity" we want; a quick status check for these probes is sketched below.
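For reference, a quick way to check the status of the candidate probes listed above via the public RIPE Atlas API (the v2 endpoint is standard; the exact field names are my assumption and may need adjusting).
```
#!/usr/bin/env python3
# Quick status check of the candidate RIPE Atlas probes listed above.
# Uses the public API at https://atlas.ripe.net/api/v2/probes/<id>/ .
import json
import urllib.request

PROBES = [6080, 6072, 6307, 6054, 6554, 6096, 6380, 6358]

for probe_id in PROBES:
    url = f"https://atlas.ripe.net/api/v2/probes/{probe_id}/"
    with urllib.request.urlopen(url, timeout=10) as resp:
        probe = json.load(resp)
    # "country_code" and "status" are the field names I expect; adjust if the API differs.
    print(probe_id, probe.get("country_code"), probe.get("status", {}).get("name"))
```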
Wed, May 20
Tue, May 19
Relevant Turnilo where lots of things happened in a short timeframe:
Discussed it with John: 57.15.185.in-addr.arpa is configured to have ns1/2/3.wikimedia.org as NS, which is correct.
Thanks! This will also help in case the wrong cable gets bumped into during the new link provisioning.
I looked at the last actions I did yesterday and at the POP server links, and can't see anything missing, thanks!
Sounds good! This will have to wait until we do something like T196487, and outside of COVID times, as it's impactful and not urgent.
Mon, May 18
domain:   56.15.185.in-addr.arpa
descr:    Wikimedia_cloud_eqiad
admin-c:  FAID1-RIPE
admin-c:  MBE96-RIPE
tech-c:   FAID1-RIPE
tech-c:   MBE96-RIPE
tech-c:   AY3199-RIPE
zone-c:   WMF-RIPE
nserver:  ns0.openstack.eqiad1.wikimediacloud.org
nserver:  ns1.openstack.eqiad1.wikimediacloud.org
mnt-by:   WIKIMEDIA-MNT
source:   RIPE
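A quick way to double-check the delegations discussed above from a resolver's point of view (sketch using dnspython, which would need to be installed).
```
#!/usr/bin/env python3
# Verify the NS records actually served for the two reverse zones above.
# Requires dnspython >= 2 (older versions use dns.resolver.query instead).
import dns.resolver

for zone in ("56.15.185.in-addr.arpa", "57.15.185.in-addr.arpa"):
    answers = dns.resolver.resolve(zone, "NS")
    print(zone, "->", sorted(ns.target.to_text() for ns in answers))
```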
- Routinator was upgraded in T252010, which helped remove the "dubious" targets.
- Since this task was opened, proxies have been moved to new hosts and performance has increased.
- Alerting has been tuned to only trigger on HTTP codes > 399; as we can't control the repositories we connect to, there will always be a risk of alerts.
Not sure if I'm re-opening the proper task, but looks relevant.
For the record:
WARNING - (for 2d 15h 51m 27s) - Status Information: WARN: Long running SCREEN process. (user: root PID: 13601, 1089352s > 864000s).
1089352s ≈ 12.6 days (vs. the 864000s / 10-day threshold).
I'd not ping someone about a tmux running for a few hours.
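For context, the check appears to just compare process age against a fixed threshold; rough numbers, taken from the alert above:
```
# Rough math behind the alert above: process age vs. the plugin's threshold.
age_s = 1089352          # SCREEN process age from the alert
threshold_s = 864000     # plugin threshold

print(age_s / 86400)        # ~12.6 days running
print(threshold_s / 86400)  # 10.0 days before it warns
```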
FYI, Prometheus is trying to query netbox2001.wikimedia.org:8443 but there is nothing listening on that port, which is causing this alert:
Prometheus jobs reduced availability - job=netbox_device_statistics site=codfw
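To confirm from the Prometheus side, querying the up metric for that job shows which targets are failing; a sketch against the standard Prometheus HTTP API (the Prometheus hostname below is a placeholder).
```
#!/usr/bin/env python3
# Ask Prometheus which netbox_device_statistics targets are down (up == 0).
# The Prometheus URL is a placeholder; /api/v1/query is the standard endpoint.
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus.example:9090"  # placeholder, not the real host
query = 'up{job="netbox_device_statistics"} == 0'

url = f"{PROM}/api/v1/query?" + urllib.parse.urlencode({"query": query})
with urllib.request.urlopen(url, timeout=10) as resp:
    result = json.load(resp)["data"]["result"]

for series in result:
    print("down:", series["metric"].get("instance"))
```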
Physically moving the optic to a different port solved the issue.
Opened T252988 to troubleshoot that specific issue.
Fri, May 15
Your service was affected by an outage along the transmission path, but the Loss of Signal we saw in Chicago happened after that outage had already started so it is unrelated.
Regarding the Loss of Sync alarm, it is something we see in our Chicago equipment:
This alarm is generated when a Loss Of Sync is detected from the client signal.
This alarm is most likely caused by:
- A physically severed fiber between the Trib port and the client equipment
- A physically severed fiber between the local network element and the upstream network element
- A faulty transmitter in the client equipment
When previously tested and again re-tested right now, placing a soft loop in Chicago facing the line, the traffic from San Francisco makes it to Chicago and then loops back, so we start transmitting back to you in San Francisco, so we know the span is good from San Francisco to Chicago. In Chicago, we had previously dispatched our equipment to hard loop test and replace our optic just in case, and I believe this was after you had already replaced your optic. Since that test was also passing, the next step in isolating the issue is if you can try a different port on your equipment, as well as verify all cabling.
If they're dead:
- Either we need them (e.g. short on ports), and in that case we need to replace the switch, which is a heavy operation.
- Or we mark the ports as dead (with a mention of that task), disable them, and call it a day (see the sketch below).
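If we go the second route, something along these lines with PyEZ would disable and label the ports (switch name, interface list, and description text below are placeholders).
```
#!/usr/bin/env python3
# Sketch: disable a dead port and describe it so nobody re-enables it by accident.
# Switch, interfaces, and task reference are placeholders.
from jnpr.junos import Device
from jnpr.junos.utils.config import Config

DEAD_PORTS = ["ge-1/0/47"]  # placeholder interface list

with Device(host="asw-example.mgmt", user="automation") as dev:
    with Config(dev, mode="exclusive") as cu:
        for port in DEAD_PORTS:
            cu.load(f"set interfaces {port} disable", format="set")
            cu.load(f'set interfaces {port} description "DEAD port - see task"', format="set")
        cu.commit(comment="Disable dead switch ports")
```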
From Telia, after asking them for the light levels they're getting:
Looks like we are still at times seeing low light and errors in Chicago and transmitting those to San Francisco:
- CHI: Rx -3.25 dBm, Tx -3.45 dBm
- Rx -55.00 dBm @ 11:00 to 11:15 UTC
- Rx -55.00 dBm @ 2:15 to 2:30 UTC
- Rx PCS: ES (errored seconds) for the last week from customer
- San Jose: Rx -3.26 dBm, Tx -55.00 dBm
- Tx -55.00 dBm for the last week, same as the errors in CHI
- Tx errors for last week
Was it on the Chicago side that you changed optics, and can you try a different port there?
Thu, May 14
Unplugging that link caused fpc1 to lose connectivity to the rest of the VC, even though it's neither a VCP nor enabled.
asw2-d-eqiad fpc1 PFEMAN: Shutting down in 5 seconds, PFEMAN Resync aborted! No peer info on reconnect or master rebooted?
asw2-d-eqiad fpc1 CMLC: Going disconnected; Routing engine chassis socket closed abruptly
Disabled the last link, and the errors are still showing up, so I'm confused on where the issue is coming from.
pic-slot 1 port 3 member 1 was a leftover port configured as a VC port, but with no cable connected to it.
Errors are still happening.
I first disabled the mentioned link on the fpc2 side (so we don't risk fully losing access to fpc1).
Then on the fpc1 side to check if the alert was caused by this DAC.
Wed, May 13
Remote hands replaced the optics yesterday but the link is still down. Lights are correct.
Tue, May 12
I started to look into that:
Yep, see diagram (minus the typo).
Fix is now running in prod.
Grafana alerts have been updated accordingly.
I'd say yes. 1/ and 2/ are done.
VictorOps seems to be a good replacement for the [stretch], as it's possible to page people directly even if the infra is down.
Mon, May 11
The only downside to fully removing the link is that D1 is 3 hops away from D8, which doesn't seem to have been an issue since May 2nd.
Upside is that it brings us closer to a proper cabling diagram.
As 1001 and 2002 are gone, this task might be good to close?
Fri, May 8
After ticket 01157098 was resolved, the link didn't come back up.
Ticket 01157707 was opened.
Telia set up a loop on the Chicago side towards SF, which brought the SF interface up, but the Chicago-facing loop didn't bring the interface up.
ACKed for 6 more hours while Telia fixes it.
Added routinator_rtr_current_connections to the Grafana dashboard.
Thu, May 7
Wed, May 6
So far so good, will let it sit until tomorrow before tackling rpki1001.
Note that we only have netflow at our borders, and we sample at 1:1000, so it might not be the right tool for now (rough numbers below).
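Back-of-the-envelope illustration of why 1:1000 sampling is limiting here (not a measurement, just the arithmetic):
```
# At 1:1000 sampling, a flow needs on the order of a thousand packets
# before we can expect to see even one sample of it.
sampling_rate = 1 / 1000
flow_packets = 200          # e.g. a short-lived flow
expected_samples = flow_packets * sampling_rate
print(expected_samples)     # 0.2 -> most short flows never show up at all
```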
Tue, May 5
Cabling diagram, let me know if something is missing or unclear:
Mon, May 4
Same for https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=phab.wmfusercontent.org&service=HTTPS-wmfusercontent
Sat, May 2
Fri, May 1
Thu, Apr 30
Stalling the task until either:
- we can start doing more intrusive testing to see if it works as expected, or
- msw1-eqiad is replaced (T225121)
Thanks. Manual action is better here to prevent flapping.
Wed, Apr 29
Tue, Apr 28
Yes, both PtMP VPLS (displayed as 3 links from site X to provider, and not site X to site Y) and GRE tunnels between sites.
- What do you envision the difference to be between "primary" and "preferred"? (I know you said TBD, but curious :)
TBD, but this is to reflect our current logic as shown in the diagram.
Primary would be the default state; Preferred would be an override to drain alternate links.