Page MenuHomePhabricator

WMCS VIPs: Netbox netmask inconsistencies
Open, MediumPublic

Description

As a result of an audit (see parent task) I've noticed some discrepancies for some IPs in Netbox and their role and usage.
As reported in [1] IPs that are VIPs and have role VIP in Netbox should have /32 or /128 netmasks (IPv4/6 respectively) and not the netmask of their parent prefix (Netbox default).
But for some hosts there are IPs marked as VIP that have a different netmask.
If trying to run the PuppetDB import script on the hosts where those IPs are allocated, brings various warnings.

The affected hosts are:
cloudgw1002 cloudgw2002-dev

And the related affected IPs are:

185.15.56.237/30
185.15.56.244/29
185.15.57.9/30
208.80.153.190/29

I think they should be converted all to be /32 both on Netbox and on the instances. This will also let the automation know that they are proper VIPs and will prevent it to attach them specifically to any host on Netbox (because as VIPs they can float from one host to another).

[1] https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_to_manually_allocate_a_special_purpose_IP_address_in_Netbox

Event Timeline

Change 887282 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] codfw1dev: cloudgw: make VIPs use /32 netmask

https://gerrit.wikimedia.org/r/887282

Change 887288 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] eqiad1: cloudgw: make VIPs use /32 netmask

https://gerrit.wikimedia.org/r/887288

Change 887282 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] codfw1dev: cloudgw: make VIPs use /32 netmask

https://gerrit.wikimedia.org/r/887282

Mentioned in SAL (#wikimedia-cloud) [2023-02-07T11:26:42Z] <arturo> [codfw1dev] testing network changes in cloudgw, expect unrealiable network (T295774)

Change 887373 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: introduce additional route for VIPs

https://gerrit.wikimedia.org/r/887373

Change 887373 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: introduce additional route for VIPs

https://gerrit.wikimedia.org/r/887373

Mentioned in SAL (#wikimedia-cloud) [2023-02-08T17:08:31Z] <arturo> changing to cloudgw network setup, make VIPs /32 (T295774)

Change 887811 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: eqiad1: move remaining VIP to /32 netmask

https://gerrit.wikimedia.org/r/887811

Change 887811 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: eqiad1: move remaining VIP to /32 netmask

https://gerrit.wikimedia.org/r/887811

Change 887288 abandoned by Arturo Borrero Gonzalez:

[operations/puppet@production] eqiad1: cloudgw: make VIPs use /32 netmask

Reason:

merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/887373 instead

https://gerrit.wikimedia.org/r/887288

aborrero subscribed.

hey @Volans I made the required changes on our side. Please verify if Netbox is happier now and close the ticket if so, thanks!

PD: sorry, this went long overlooked in my backlog.

I think they should be converted all to be /32 both on Netbox and on the instances. This will also let the automation know that they are proper VIPs and will prevent it to attach them specifically to any host on Netbox (because as VIPs they can float from one host to another).

I'm not convinced this logic is best on the network/host side.

We have a normal Ethernet segment (vlan) at play, over which ARP should function to discover the link-layer (MAC) address of devices owning a given IP. Using a /32 netmask on the interface tells the system "there are no other devices connected to this vlan".

On the Juniper's we don't have the problem of importing the IPs back into Netbox, which means we don't have this constraint. Ultimately I think the fix to the problem should be changes that mean we don't need the PuppetDB import script (see T347411).

In the meantime, for the cloudgw issue in particular, we should widen the /30 to a /29 on the cloudgw-transport vlans. That will allow ARP to function normally without the workarounds the were required to make things work with the /32.

I'm gonna re-open this for now, as it looks like the issue isn't fully solved.

On the cloudnet side of this particular link the VIP is still added to interfaces with a /30 netmask:

image.png (296×988 px, 65 KB)

When I changed the 185.15.56.237 IP from a /30 to /32 netmask in Netbox (to reflect the changes done in this task and how the cloudgw side is now done), the Network report throws an error because this VIP netmask is now inconsistent with 185.15.56.238/30.

One approach to solve this would be to tell OpenStack neutron to configure the IP with a /32 netmask, but as per my last comment this will require other changes to force the kernel to ARP, and could be even trickier on the neutron side than cloudgw.

@Volans I think there may be a better option here:

  1. Modify the PuppetDB import script to skip import/modification of IPs it finds which are already type VIP in Netbox
  2. Configure two instances of the VIP, one on each device that could use it
    • Yes only 1 uses it at a time, but this is the traditional way we did VRRP on the Junipers so there is precedent.

@cmooney this change would affect a lot of VIPs assigned by puppet all over production so we must check carefully the consequences of any changes. That said I'm happy to evaluate/discuss alternative approaches.
See for example vipexempt , _process_binding_address() and _handle_vip() in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/netbox-extras/+/refs/heads/master/customscripts/interface_automation.py

Change 962614 had a related patch set uploaded (by Cathal Mooney; author: Cathal Mooney):

[operations/software/netbox-extras@master] Interface automation: skip import of existing int IPs and VIPs

https://gerrit.wikimedia.org/r/962614

@Volans yep thanks.

I created a provisional patch but I agree we need to consider all the cases. I believe from looking through the code the vipexempt and handle_vip parts only relate to when we need to import new IPs to netbox that aren't already present. So my approach was to just skip things that already exist, which I think should be safe, I avoided touching any of the other code for vip processing as indeed it looks fairly complex.

Very much a quick thing to allow for discussion. I've no strong feelings on this one way or other. I guess the main difference I stumbled on is "VIPs" that are on the main ethernet interface/subnet (and should have wider mask technically) versus those on loopback device (where /32 is the correct choice).

Trying to run the import puppetdb script on cloudgw1002 is now a noop, but for cloudgw2002-dev fails with this exception:

Failure	
An exception occurred: MultipleObjectsReturned: get() returned more than one IPAddress -- it returned 2!

Traceback (most recent call last):
  File "/srv/deployment/netbox/current/src/netbox/extras/scripts.py", line 461, in _run_script
    script.output = script.run(data=data, commit=commit)
  File "/srv/deployment/netbox-extras//customscripts/import_server_facts.py", line 70, in run
    messages.extend(self._import_interfaces_for_device(device, net_driver, networking, lldp, False))
  File "/srv/deployment/netbox-extras//customscripts/_common.py", line 750, in _import_interfaces_for_device
    self._assign_ip_to_interface(address, nbiface, networking, iface, is_primary, False)
  File "/srv/deployment/netbox-extras//customscripts/_common.py", line 203, in _assign_ip_to_interface
    ipaddr = IPAddress.objects.get(address=str(address))
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.9/site-packages/django/db/models/query.py", line 499, in get
    raise self.model.MultipleObjectsReturned(
ipam.models.ip.IPAddress.MultipleObjectsReturned: get() returned more than one IPAddress -- it returned 2!
Volans triaged this task as Medium priority.Jan 30 2024, 2:15 PM

@cmooney did you had a chance to test the above failure scenario? AFAICT is still happening