
Puppet: get row/rack info from Netbox
Open, MediumPublic

Description

In the SRE-Foundations weekly meeting we discussed the need to integrate datacenter row/rack information into Puppet, reading it from Netbox and exposing it so that Grafana dashboards and Cumin can use it.

The requirement is a system that doesn't depend directly on Netbox: if Netbox is down, it should continue to work with stale data.

The various options discussed boil down to:

  1. Have a script that periodically generates a hiera file with that information for each host, to be merged by the puppetmasters with the public tree as we already do with the private repo
  2. Have a hiera backend that dynamically queries Netbox (or a local file acting as a cache)
  3. Have a Puppet function that queries Netbox (TBD how to make a local cache)
  4. Use client side facts (requires each host to contact Netbox with a token and also keep a local cache)
  5. See if server side facts could be used for this (at first sight it seems not, but it needs more investigation)

As for the exposure to Cumin, it might not be needed as we already have a Netbox backend for Cumin in CR. Alternatively, it could be exposed using the same trick used for $::site in profile::cumin::target.
As for the exposure to Grafana dashboards, it needs investigation.

Event Timeline

Volans created this task.Jul 31 2019, 11:19 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 31 2019, 11:19 AM
Volans triaged this task as Medium priority.Jul 31 2019, 11:40 AM

It seems to me that the simplest option would be #1. It would also be the one that optimizes API calls to Netbox (just one per puppetmaster every X minutes), and it has a natural caching mechanism.
In addition we could add an alert if the file is stale (too old).
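
A minimal sketch of such a staleness check, assuming the generated file's modification time reflects the last successful sync. The function name and the threshold are hypothetical, not part of any existing tooling:

```python
import os
import time


def is_stale(path, max_age_seconds, now=None):
    """Return True if the file is missing or older than max_age_seconds.

    `now` can be injected to make the check testable; it defaults to the
    current time.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except FileNotFoundError:
        # A missing file is treated as stale so that the alert fires.
        return True
    return (now - mtime) > max_age_seconds
```

Note that if the generator skips rewriting files whose content did not change, the modification time would not advance, so a separate marker file touched on every successful run would probably be a better target for this check.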

Change 526664 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] netbox/puppet: An example of how we may intergrate netbox data with puppet

https://gerrit.wikimedia.org/r/526664

jbond added a comment.Jul 31 2019, 1:12 PM

Have a script that periodically generates a hiera file with that information for each host, to be merged by the puppetmasters with the public tree as we already do with the private repo

The use of the word "merged" suggests [to me] that some git operation is taking place; unless I'm missing something, I don't think this is the case. I would suggest:

Have a script that periodically generates a hiera file with that information for each host and places that file into the appropriate place in the hiera hierarchy.

If we go this route, and getting a bit more into the details, I think it would probably make more sense to create one file per host; then wherever the data is needed in Puppet one can just call lookup('netbox::rack'). Or is there a need for some systems to have all the Netbox data?

I created a quick example of how the one file per host configuration may look

If we go with the one file per host approach then I'd say we can read the file before writing, so that we write/overwrite only if it's not there or has the wrong info. This should limit the rewrite operations, which in turn should reduce to a negligible amount the race conditions of Puppet not finding the file at the exact moment it's reading it.
FYI row/rack don't change for most hosts during their lifetime, but in some cases we move hosts around, so it's a use case to take into account.
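
The read-before-write idea can be sketched as follows. This is only an illustration under assumptions: the function names and the `netbox::row` key are hypothetical, and the values are rendered as JSON scalars (which are valid YAML) to avoid a YAML library dependency. Writing to a temporary file and renaming it over the target means a reader sees either the old or the new file, never a partial one:

```python
import json
import os
import tempfile


def render_hiera(data):
    """Render a flat dict as YAML lines; JSON scalars are valid YAML values."""
    lines = [f"{key}: {json.dumps(value)}" for key, value in sorted(data.items())]
    return "\n".join(lines) + "\n"


def write_if_changed(path, content):
    """Write content to path only if it differs; return True if written."""
    try:
        with open(path, encoding="utf-8") as handle:
            if handle.read() == content:
                return False  # Already up to date, nothing to rewrite.
    except FileNotFoundError:
        pass  # First run for this host: fall through and create the file.

    directory = os.path.dirname(path) or "."
    # The temporary file lives in the same directory so the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".netbox-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return True
```

With this approach a host that is moved to another rack gets its file replaced on the next run, while unchanged hosts are never touched.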

crusnov claimed this task.Oct 2 2019, 5:32 PM
jbond moved this task from Unsorted 💣 to Watching 👀 on the User-jbond board.Oct 30 2019, 6:07 PM

Change 563186 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] netbox/puppet: Add machinery to get Puppet facts from Netbox

https://gerrit.wikimedia.org/r/563186

Change 526664 merged by Jbond:
[operations/puppet@production] netbox/puppet: Add machinery to get Puppet facts from Netbox

https://gerrit.wikimedia.org/r/526664

crusnov moved this task from Backlog to In Progress on the User-crusnov board.May 1 2020, 9:59 PM

Let me add a number of use cases for this:

  • Kubernetes nodes currently have their datacenter and rack row manually set in hiera. The rack itself is not set, however it would also be usable if we had it. It would be awesome if we did not have to do that manually, as we tend to forget it.
  • Icinga could have hostgroups per rack row (if not rack), allowing us to spot networking and/or rack level issues. That could prove very useful during outages as it would allow us to visually verify which hosts are affected and also pinpoint possible causes.
  • Ganeti also is, networking wise, rack row aware and can even be made rack aware. This is a more difficult case as the row membership is set at the Ganeti node configuration level, which up to now I don't know how to populate from Puppet.
  • Cassandra is also rack row aware in infrastructure and automating this instead of doing it manually via hiera would be awesome.

Adding observability for tracking/visibility, +1 on the use cases!

lmata moved this task from Inbox to Radar on the observability board.Sep 14 2020, 3:23 PM

I fear that we are going down another path like the DNS generation, in which the Netbox API doesn't really suit our needs in terms of performance and efficiency (multiple API calls per device). I'm wondering if we should convert John's patch into a Netbox script instead and take advantage of the speed and power of the Django/Netbox internal APIs.

It was already discussed but I want to re-surface the fact that this will create a direct link Netbox data -> Puppet without any manual stopgap. Are we ready for this?
If not we could run it as part of the provisioning workflow, as it will work fine for the location, but might not work at all if we start exposing additional data, like the status in Netbox as the current patch does.

I fear that we are going down another path like the DNS generation, in which the Netbox API doesn't really suit our needs in terms of performance and efficiency (multiple API calls per device). I'm wondering if we should convert John's patch into a Netbox script instead and take advantage of the speed and power of the Django/Netbox internal APIs.

It was already discussed but I want to re-surface the fact that this will create a direct link Netbox data -> Puppet without any manual stopgap. Are we ready for this?

Possibly not, so maybe the first draft could be "Have a script that periodically generates a hiera file with that information for each host, to be merged by the puppetmasters with the public tree as we already do with the private repo"?

Volans added a comment.EditedOct 23 2020, 3:38 PM

We were discussing this offline with John and there are still various open questions, we plan to discuss them in the next I/F meeting next Wed.
I'll try to summarize them here:

  • Manual vs Automatic: in all discussions we tend to prefer a manual stopgap when possible, here there are multiple options to achieve that:
    • Export a file showing a diff
    • Use the same mechanism for the current Netbox -> DNS integration (export a git repo with manual approval of the diff)
  • What will trigger it (it depends on how we'll end up using it and what data we'll be exporting):
    • Integrate it with puppet-merge
    • Have an independent stop gap like a dedicated cookbook
    • Integrate it with the Netbox -> DNS update process during provisioning (will cover devices location but might fail to pick status changes)
  • Should the data be public or private? It mostly depends on what data we'll export. For now it could all be public, but we might not want to design it in a way that prevents us from exporting private data in the future.
  • Distributed or centralized: given that we have multiple puppetmasters it might be wiser to export once (centralize) and have the same data exported to all of them, like the current Netbox -> DNS integration does with the authdns hosts instead of having all the puppetmasters query Netbox (distributed). This would reduce the load on the Netbox host and should simplify the synchronization across puppetmasters so that they share the same data avoiding race conditions between two consecutive runs of puppet on the same host.
  • Pull or push: this mostly depends on how we trigger the update that in turn depends on the Manual vs Automatic answer.
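
For the "export a file showing a diff" option listed above, a minimal sketch of how a reviewable diff between the current and the newly generated data could be produced. The function name and labels are hypothetical; only the standard library `difflib` module is used:

```python
import difflib


def review_diff(old_text, new_text, name="netbox-hiera"):
    """Return a unified diff between the current and proposed data.

    An empty string means there is nothing to review or approve.
    """
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=f"{name} (current)",
        tofile=f"{name} (proposed)",
    )
    return "".join(diff)
```

An operator (or a cookbook) would then apply the new data only after the diff has been shown and explicitly approved, which is roughly what the manual stopgap amounts to.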

Once we have the above answers we can decide how to implement the Netbox side of it: whether to query the Netbox APIs (slower and less efficient) or to have a Netbox script or even a plugin expose the needed data more efficiently.

Let's not over-engineer it.

Automatic. From what I understand, that data is to be used for "Grafana dashboards and Cumin", so if Netbox gets compromised and the data modified, no harm would happen. So I don't see the need for a stopgap so far, as long as it's properly documented.
If the need for a stopgap arises one day (eg. IPs, Puppet role, etc), we could/should revisit it and consider applying it to only that specific data.
Public. The only private data for a server is its serial number, and it's already in Facter.
Centralized. Seems to be working fine for DNS.

We discussed it a bit during the last Infrastructure Foundation meeting on Wed. I'll try to summarize the outcome of it, please correct me if I'm misrepresenting anything.
Also worth noting that it was a smaller than usual meeting due to some absences, so nothing is set in stone yet, feel free to comment / propose alternative approaches.

  • Manual stopgap vs Automatic: better to keep it manual for now, as for the use cases already listed tampered data might harm production
  • What triggers it: it seems that for 90% of the cases the provisioning workflow is the right place to trigger it. The remaining use cases should be converted into cookbooks. As a practical example, a change in status from Active to Failed in Netbox should trigger it. But instead of documenting that, let's make a cookbook that marks a host as failed: changing the Netbox status, triggering the sync, possibly running puppet on the host and icinga (if applicable and if we disable notifications based on Netbox status), opening a task to DCOps for the host, etc...
  • Public or private data: we agreed that all the data exported from Netbox to Puppet should be public and, if using the APIs, we could enforce it with a dedicated user that has access only to the public bits in Netbox (but we might end up not using the APIs, see below)
  • Distributed or centralized: given that we already have machinery in place for the DNS integration, it seems that we could re-use the same logic: have a centralized git repo that is exposed by the Netbox hosts (as a plus, it would allow the puppet-compiler to consume it too) and have the cookbook pull the SHA1 commit to the puppetmasters. That also ensures that all the puppetmasters have the same consistent data at all times.
  • Pull or push: see above

As for the implementation, given the limitations we're finding in the Netbox APIs we were thinking that this might be a good test case to try to make a Netbox plugin that exposes an endpoint in the API with all the data that's needed already in a nicely formatted JSON so that the client has to make just a single API call and then apply the same logic we have for the DNS stuff.
If this experiment is successful then we could consider converting the DNS generation stuff into a plugin too.
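
To illustrate the client side of that idea, here is a minimal sketch that turns a single JSON payload into per-host hiera data. The payload shape (a `devices` list with `fqdn`, `row` and `rack` fields) and the `netbox::row` key are assumptions made for the example; no such plugin endpoint exists yet:

```python
import json


def per_host_data(payload):
    """Map each host FQDN to the hiera keys derived from one JSON payload.

    The payload format is hypothetical: {"devices": [{"fqdn": ..., "row": ...,
    "rack": ...}, ...]}.
    """
    result = {}
    for device in json.loads(payload)["devices"]:
        result[device["fqdn"]] = {
            "netbox::row": device["row"],
            "netbox::rack": device["rack"],
        }
    return result
```

The resulting mapping could then feed the same commit and review logic used for the DNS data, with one file per host.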

Another use case for netbox data in puppet is exposing the network devices so they can be used in configuration such as icinga parent mapping and turnilo data augmentation. @ayounsi perhaps there are others in this space as well.

I think we could probably include all devices under something like

netbox::network_devices:
  fqdn:
    * => $metadata

A larger scope could be to look at all the IPs hardcoded in Puppet and see if it would make sense to import them from Netbox?
Same for prefixes I guess.

A larger scope could be to look at all the IPs hardcoded in Puppet and see if it would make sense to import them from Netbox?
Same for prefixes I guess.

e.g. autogenerating, at least part of, the data that is in network/data/data.yaml ? Yes that would be awesome as it's now manually maintained.

jbond added a comment.Wed, Dec 2, 11:14 AM

I think we could probably include all devices under something like

netbox::network_devices:
  fqdn:
    * => $metadata

Another use case https://phabricator.wikimedia.org/T268806