User Details
- User Since: May 10 2021, 3:25 PM (158 w, 1 d)
- Availability: Available
- IRC Nick: topranks
- LDAP User: Cathal Mooney
- MediaWiki User: CMooney (WMF)
Today
@Jhancock.wm @Papaul I'd been using the server in b7 for testing already, but I should be able to move over to the one in a8 instead (I assume we have the same problem with public1-a-codfw as we had with public1-b-codfw)
Yesterday
So some interesting findings when testing today.
Fri, May 17
From what I can tell the 'authoritative' statement only controls NAK generation. I think we're hitting this part of the code, and the different source address (of another switch) on the duplicate REQUESTS is why it is sending the NAKs:
Re-reading the man page for dhcpd.conf, it seems that potentially changing the 'authoritative' statement at the top of our config to 'not authoritative' would prevent it sending the NAKs. Might be worth a shot? Better to not create them than to filter them elsewhere. I don't believe there is any use-case in our environment where we need NAKs.
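As a concrete sketch, the change discussed would look something like the below in dhcpd.conf (a hypothetical minimal excerpt with a placeholder subnet, not our actual config):

```
# Replace the global "authoritative;" statement with "not authoritative;"
# so the server stays silent rather than sending DHCPNAKs for REQUESTs
# it considers invalid. Subnet below is an illustrative placeholder.
not authoritative;

subnet 10.64.32.0 netmask 255.255.252.0 {
  # ... existing ranges/options unchanged ...
}
```

Per the dhcpd.conf man page, 'authoritative'/'not authoritative' can also be scoped per-subnet, so it could be limited to the affected networks rather than set globally.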
Just a note on this, I only discovered this document after the task:
Also I didn't see in the dhcpd docs any way to constrain the generation of NAKs in response to invalid REQUEST messages.
One observation is that the NAKs are unique insofar as they are sent from 208.80.153.33 (switch IRB int IP) to 255.255.255.255 (and matching L2 MACs).
Pcap of DHCP request from contint2002 here:
+1 sounds like a good idea. Nice we have some limited scope to experiment with the DoH ranges before pulling the plug on ns2.
Thu, May 16
This has been implemented and the new vlan setup is recorded here. Closing task
Thanks. It is very much something we wish to do but unfortunately other priorities have always trumped it for multiple past quarters.
And FWIW the announcement looks good: all 3 of our transits are learning it ok, and I see it on other carriers from those sources as well. We also see live requests on the doh servers.
Wed, May 15
Gonna close this one. As a last datapoint, if you 'stack' the Hadoop graph in Grafana you can clearly see the cumulative reads at ~15:55 on May 14th were a good deal higher than any of the other spikes of usage over the past few days (peaking at almost 200Gbit/sec). So it makes sense that it paged and the others didn't.
Seems this is not possible, as the cloudsws still on JunOS 18 don't support exporting the data within the mgmt routing-instance.
Tue, May 14
Thanks for the task and analysis.
Patch to Homer wmf plugin merged now, so BGP to VMs at POPs / on L3 switches now under automation too.
Thanks for the task @taavi. Looks well put together. Let me know the exact time you're starting, and feel free to ping me if there is anything you need checked from the physical network side of things (where MAC addresses are in the forwarding tables etc.)
Mon, May 13
Happy to discuss. I think if we are doing this it makes sense to do the cloudgw <-> cloudsw BGP at the same time (we will need to create the Bird config for the cloudgw to talk to cloudnet, so while we are doing so let's do the other side too).
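For illustration, the cloudgw side of such a session might be sketched in Bird2 syntax roughly as below (the ASNs, neighbor address, and prefix are placeholders I've made up, not the real cloud setup):

```
# Hypothetical Bird2 BGP stanza for a cloudgw <-> cloudsw peering.
# All numbers here are illustrative placeholders.
protocol bgp cloudsw_peer {
    local as 64710;                  # placeholder private ASN for cloudgw
    neighbor 185.15.56.1 as 64605;   # placeholder cloudsw address/ASN
    ipv4 {
        import none;                              # accept nothing inbound in this sketch
        export where net ~ [ 185.15.56.0/25+ ];   # placeholder cloud prefix
    };
}
```

The same template would then be reusable for the cloudgw <-> cloudnet side, which is part of the appeal of doing both at once.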
Fri, May 10
@Papaul I've added all the links for the new switches in Netbox now:
Thu, May 9
Hey Andrew,
Wed, May 8
Sorry for the delay, the capirca script times out a lot for some reason; I'll need to look at that.
Fri, May 3
@Papaul I think this one is ready to be moved to rack D1 now.
Device has been removed from LibreNMS now. I also downtimed it for 2 weeks just in case I mess up the order of anything.
Thu, May 2
Not sure if it might be worth taking a step back and weighing up what's happening here?
Wed, May 1
These are direct peerings to Equinix themselves over their own exchange. We are waiting on them to complete the configuration of their side (see peering@wikimedia.org). I emailed last week to chase them for an update.
Tue, Apr 30
Mon, Apr 29
Looks like this was a brief blip of inbound errors (unlike last time when they began and kept increasing until eventually the link failed).
Actually it may just be easier to check the route for each pooled IP and make sure the check doesn't return saying it's using the default, as per the task description.
cmooney@lvs1019:~$ ip --json route get fibmatch 1.1.1.1
[{"dst":"default","gateway":"10.64.32.1","dev":"eno1np0","flags":["onlink"]}]
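The check described above could be sketched in Python along these lines (a hypothetical helper, not an existing script; it assumes iproute2 with JSON output support, and the function/variable names are my own):

```python
import json
import subprocess

def uses_default_route(route_json: str) -> bool:
    """Return True if 'ip --json route get fibmatch <ip>' matched the default route."""
    routes = json.loads(route_json)
    return any(r.get("dst") == "default" for r in routes)

def pooled_ips_on_default(ips):
    """Return the pooled IPs whose best FIB match is the default route.

    Hypothetical helper: shells out to iproute2 on the local host.
    """
    flagged = []
    for ip in ips:
        out = subprocess.run(
            ["ip", "--json", "route", "get", "fibmatch", ip],
            capture_output=True, text=True, check=True,
        ).stdout
        if uses_default_route(out):
            flagged.append(ip)
    return flagged

# Using the output captured on lvs1019 above:
sample = '[{"dst":"default","gateway":"10.64.32.1","dev":"eno1np0","flags":["onlink"]}]'
print(uses_default_route(sample))  # True: 1.1.1.1 only matches the default route
```

Failing the check whenever `uses_default_route()` returns True for a pooled IP would match the behaviour asked for in the task description.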
Thanks Brian.