Improve Netbox "locations" use
Open, Medium, Public

Description

We currently use the "locations" model as physical rack groups only.

I'm suggesting that we change that to match failure groups when relevant.

For example, the eqiad row C location (https://netbox.wikimedia.org/dcim/racks/?location_id=7) includes racks dedicated to both WMCS and Fundraising.
I'm suggesting that we remove C1 from this rack group, as well as C8 once T308339 (eqiad: move non WMCS servers out of rack C8) is solved.
The same goes for row D now that D5 is WMCS only (thanks to T308331: eqiad: move non WMCS servers out of rack D5).

This will, for example, allow queries such as 'P{P:netbox::host%location ~ "C.*eqiad"}' to return only the prod hosts impacted by network maintenance.
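To illustrate the effect on such a query, here is a minimal Python sketch. It is not Cumin itself: the host names and the "Row C eqiad" location strings are hypothetical, and it assumes the `~` operator behaves like an unanchored regex search. Hosts in racks removed from the row location carry an empty location and drop out of the result.

```python
import re

# Hypothetical hosts and the Netbox location each one would export.
# Names and location strings are illustrative, not real inventory data.
HOST_LOCATIONS = {
    "prod-host-c2": "Row C eqiad",
    "prod-host-c3": "Row C eqiad",
    "frack-host-c1": "",  # rack C1 removed from the row C location
    "wmcs-host-c8": "",   # rack C8 removed from the row C location
    "prod-host-d2": "Row D eqiad",
}


def hosts_in_location(pattern, locations):
    """Return the hosts whose location matches the regex (unanchored search)."""
    regex = re.compile(pattern)
    return sorted(host for host, loc in locations.items() if regex.search(loc))


print(hosts_in_location(r"C.*eqiad", HOST_LOCATIONS))
```

With the location left empty for the WMCS and Fundraising racks, only the two hypothetical prod hosts in row C are selected.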

I'm not sure there is a need for the row E and F locations either (and if we keep them, we should remove E4/F4, as they're WMCS as well).

For automation, e.g. the Puppet hiera export, we could either keep the location field empty or duplicate the rack information there.

Event Timeline

For hosts, I think we may be able to get the parent, which would be a much better way of doing this? (We still need to talk about how we do that for net devices.)

Another way to turn the question is: 'what "locations" should match to?' and apply it consistently. To me, 1:1 match to rows is not relevant anymore.

Indeed, your suggestion is good for this specific use case (network maintenance), to get all physical devices connected to a given ToR. The LLDP parent is already in PuppetDB, e.g. https://puppetboard.wikimedia.org/node/netmon1003.wikimedia.org (parent: asw2-b-eqiad.mgmt.eqiad.wmnet). Is it possible to filter on it using Cumin?
However it gets more complex when doing VMs, but in that case we can rely on the existing Ganeti cluster location selector.

> Another way to turn the question is: 'what "locations" should match to?' and apply it consistently. To me, 1:1 match to rows is not relevant anymore.

I started this bit trying to describe the current data structure, but I think it might be better for you to say what you need or find useful and we can adjust to that. I'll say that the current structure was designed with the following considerations:

  • for VMs it's generally more useful to know about the VM host than the actual rack/row (we could probably add the latter with some munging)
  • site/datacenter is generally useful and it would be great to replace the $::site global
  • rack is useful for rack-aware services, e.g. Cassandra
  • row is useful to combine with rack to get a unique rack, i.e. ensure a box in A1 is not considered to be in the same rack as B1
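As a rough illustration of the last point, the sketch below combines site, row and rack into one key so that rack 1 in row A and rack 1 in row B stay distinct. The host names, the placement data and the `rack_key` helper are all hypothetical; this is not how the current export is implemented.

```python
from collections import defaultdict


def rack_key(site, row, rack):
    """Combine site, row and rack so that A1 and B1 never collide."""
    return (site, row, rack)


# Hypothetical replica placement: host -> (site, row, rack number).
PLACEMENT = {
    "replica-1": ("eqiad", "A", "1"),
    "replica-2": ("eqiad", "B", "1"),
    "replica-3": ("eqiad", "A", "1"),
}


def hosts_per_rack(placement):
    """Group hosts by their unique rack key."""
    groups = defaultdict(list)
    for host, (site, row, rack) in sorted(placement.items()):
        groups[rack_key(site, row, rack)].append(host)
    return dict(groups)


print(hosts_per_rack(PLACEMENT))
```

A rack-aware service could use such a grouping to notice that two replicas share a physical rack, which the rack number alone would not reveal.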

AFAIK this info is not being used yet, but we should keep the Cassandra use case in mind.

> Indeed, your suggestion is good for this specific use case (network maintenance), to get all physical devices connected to a given ToR. The LLDP parent is already in PuppetDB, e.g. https://puppetboard.wikimedia.org/node/netmon1003.wikimedia.org (parent: asw2-b-eqiad.mgmt.eqiad.wmnet). Is it possible to filter on it using Cumin?

Ish. Currently Cumin doesn't have good support for structured facts, but we can use a pattern on the lldp fact, e.g. sudo cumin 'F:lldp ~ "parent.+asw2-b-eqiad.mgmt.eqiad.wmnet"'. We could improve this by creating a legacy fact for lldp_parent, which would allow us to do sudo cumin 'F:lldp_parent = "asw2-c-eqiad.mgmt.eqiad.wmnet"'. This is a simple change, so just let me know if it's useful.
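The difference between the two approaches can be sketched in plain Python. This is only an approximation under stated assumptions: the fact layout and host names are hypothetical, and it assumes a pattern on a structured fact is evaluated as a regex search over the serialized value, which may not be exactly what Cumin or PuppetDB do.

```python
import json
import re

# Hypothetical structured lldp facts for two hosts; the real layout may differ.
FACTS = {
    "host-a": {"parent": "asw2-b-eqiad.mgmt.eqiad.wmnet"},
    "host-b": {"parent": "asw2-c-eqiad.mgmt.eqiad.wmnet"},
}


def match_structured(pattern, facts):
    """Regex search against the JSON-serialized fact (assumed behaviour)."""
    regex = re.compile(pattern)
    return sorted(h for h, fact in facts.items() if regex.search(json.dumps(fact)))


def match_flat(parent, facts):
    """Exact comparison against a flat lldp_parent value derived from the fact."""
    return sorted(h for h, fact in facts.items() if fact.get("parent") == parent)


print(match_structured(r"parent.+asw2-b-eqiad\.mgmt\.eqiad\.wmnet", FACTS))
print(match_flat("asw2-c-eqiad.mgmt.eqiad.wmnet", FACTS))
```

The pattern form depends on how the fact happens to be serialized, while the flat fact allows a plain equality test, which is the main appeal of the suggested lldp_parent fact.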

> However it gets more complex when doing VMs, but in that case we can rely on the existing Ganeti cluster location selector.

Exactly, and we can use the same fact, e.g. sudo cumin 'F:lldp ~ "parent.+ganeti1024.eqiad.wmnet"'

jbond triaged this task as Medium priority. Apr 5 2023, 9:47 AM