
Abstract a bit more the server provisioning process
Open, Medium, Public

Description

The server provisioning workflow has improved significantly in the last few years. However, toil and misconfiguration still happen.

We can make it even better by automating the following steps from https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Requested_-%3E_Planned_additional_steps_&_Spare_-%3E_Planned :

  • Run the Netbox ProvisionServerNetwork script to assign mgmt IP, primary IPv4/IPv6, vlan and switch interface
  • Follow the DNS/Netbox#Update_generated_records steps to create and deploy the mgmt and primary IPs (for mgmt this should include both the $assettag.mgmt.site.wmnet and the $hostname.mgmt.site.wmnet records).
  • Run the sre.network.configure-switch-interfaces cookbook to configure the switch side.
  • System BIOS & out-of-band mgmt settings are configured at this time.
  • Serial Redirection and mgmt must be tested at this time.

This automation could take the form of a meta cookbook, let's say sre.hosts.makeitplanned, with the following arguments (a rough sketch of the argument parsing follows the list):

  • host (mandatory) - should match a Netbox device hostname
  • switch port (mandatory)
  • port speed (mandatory, but could be derived from the switch or the host group specs table) - 1/10/25 G
  • cable ID (optional)
  • task (optional)
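As a minimal sketch, assuming the cookbook is implemented like our other Python cookbooks, the argument handling could look like this (the cookbook name and all defaults here are hypothetical):

```
import argparse


def argument_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the proposed sre.hosts.makeitplanned cookbook."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("host", help="must match a Netbox device hostname")
    parser.add_argument("--switch-port", required=True, help="switch interface to use")
    parser.add_argument("--port-speed", type=int, choices=(1, 10, 25),
                        help="in Gbps; mandatory unless derived from the switch or specs table")
    parser.add_argument("--cable-id", help="cable ID (optional)")
    parser.add_argument("--task", help="Phabricator task to update (optional)")
    return parser
```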

The cookbook would then do the following (a sketch of this orchestration follows the list):

  1. Display to the user what's going to happen and ask for confirmation (like the makevm cookbook)
  2. Run the Netbox ProvisionServerNetwork script to assign mgmt IP, primary IPv4/IPv6, vlan and switch interface
    • Select the proper vlan based on a host group specs table (e.g. if it's a cp host, set it to the private vlan)
    • Same for "Skip IPv6 DNS records" and "How many Cassandra instances"
  3. Run the DNS cookbook
  4. Run the sre.network.configure-switch-interfaces cookbook to configure the switch side.
  5. Run the sre.hosts.provision cookbook
    • Enable virtualization based on the same specs table
  6. Update the task with what has been done
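For illustration only, the orchestration could be as simple as chaining the existing pieces behind a single confirmation. Everything below is a made-up stub (including using sre.dns.netbox as "the DNS cookbook"); a real implementation would spawn the actual scripts/cookbooks with the right arguments:

```
def run_step(name: str) -> None:
    # Placeholder: a real cookbook would invoke the Netbox script or
    # spawn the named cookbook here, with the right arguments.
    print(f"running {name}")


STEPS = [
    "Netbox ProvisionServerNetwork",            # assign IPs, vlan, switch interface
    "sre.dns.netbox",                           # deploy the generated DNS records
    "sre.network.configure-switch-interfaces",  # configure the switch side
    "sre.hosts.provision",                      # BIOS & out-of-band settings
]


def run(task: str | None = None) -> None:
    # 1. Display what's going to happen and ask for confirmation (like makevm).
    print("About to run:", ", ".join(STEPS))
    if input("Proceed? [y/N] ").strip().lower() != "y":
        return
    for name in STEPS:
        run_step(name)
    # 6. Update the task with what has been done.
    if task:
        print(f"would update {task}: completed {len(STEPS)} steps")
```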

Not sure if "Serial Redirection and mgmt must be tested at this time" is still needed now that we have automation and monitoring, but the cookbook could at least check for SSH access.

Of course, for special cases the individual actions would still need to be run manually, like we currently do. This would also only be used for the very first provisioning.
It could also potentially take care of running sre.hardware.upgrade-firmware.

The "host group specs table" could for now be defined in Puppet's Hiera, written to disk on cumin1001 as yaml file, and loaded by the cookbook.

dns:
  vlan: public
ganeti:
  vlan: ganeti # special vlan trunking
  virtualization: true
db:
  vlan: private
  no_ipv6: true
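A sketch of how the cookbook could consume such a file, assuming the group is identified by the hostname's leading letters; the path and the default keys are illustrative:

```
import re

import yaml  # PyYAML

SPECS_PATH = "/etc/spicerack/host-group-specs.yaml"  # illustrative path


def load_specs(hostname: str, path: str = SPECS_PATH) -> dict:
    """Return the specs entry for a host, keyed by its alphabetic name prefix.

    E.g. 'db1234' -> 'db', 'ganeti2001' -> 'ganeti'.
    """
    with open(path) as fh:
        table = yaml.safe_load(fh)
    match = re.match(r"[a-z][a-z-]*", hostname)
    if match is None or match.group() not in table:
        raise ValueError(f"no specs entry for host {hostname!r}")
    specs = table[match.group()]
    specs.setdefault("virtualization", False)  # defaults for optional keys
    specs.setdefault("no_ipv6", False)
    return specs
```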

Event Timeline


When we introduced the sre.hosts.provision cookbook we envisioned something similar.
Piling many changes together simplifies the user interaction, but it leaves a lot of open questions that need answering before automating the process, mainly about what to do in case of errors:

  • A single cookbook that just calls other cookbooks will not be able to easily roll back changes or keep track of everything. In addition, the longer it takes, the higher the chance that other changes were merged in the meanwhile by other cookbooks, so the rollback might have side effects depending on how it's done.
  • Failing in the middle leaves the host in an undefined state, so the cookbook should be idempotent and potentially able to resume from the last successfully completed step using auto-detection (not relying on user input); see the sketch after this list.
  • When retrying a step, it should not take too much additional time to get back to the failed step just because it is one of the last ones.
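One possible pattern for the resume concern (all names below are invented): pair every step with a cheap "already applied?" check that queries the relevant source of truth, so a re-run auto-detects where to restart and retrying a late step only costs a few no-op checks:

```
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    is_done: Callable[[], bool]  # cheap auto-detection, e.g. query Netbox or DNS
    apply: Callable[[], None]    # the actual action, itself idempotent


def run_steps(steps: list[Step]) -> None:
    """Skip steps that auto-detect as already applied, so re-runs resume fast."""
    for step in steps:
        if step.is_done():
            print(f"{step.name}: already done, skipping")
            continue
        print(f"{step.name}: applying")
        step.apply()
```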

In addition I think that we need to solve another problem first, one that is a pre-requisite for this and other similar automation requests: an authoritative mapping between hostnames and what you called the specs table.
We currently have that mapping in Puppet, but to be able to query it in an official way (via puppet commands) we need the facts of a host, which we don't have until the host is up. And I'm not only talking about the current hostname->$role mapping, but also any other potentially specific setting that might change within the same role because of different hardware.
Once we solve that problem for good, then sure, we can have specs for each "group" (there can be more than one group per $role) and automate most of the above process solely based on that, significantly reducing the amount of input to ask the operator.

Unfortunately Netbox doesn't seem the best place to put that mapping at the moment, first because it doesn't support it natively and we would have to "force" it, and second because we would lose in terms of auditing and human review.
Needless to say, this mapping should exist in a single place and be the source of truth for Puppet too, either by living inside Puppet or by acting as an ENC for Puppet.

Great work! I had some thoughts on this, more around the latter pieces than the workflow itself.

In terms of the proposed cookbook, do you envision it running the Netbox ProvisionServerNetwork script? Or instead would it do the same thing, but directly via Netbox API etc? I will try to progress T346428 to fit the direction here.

The "host group specs table" could for now be defined in Puppet's Hiera, written to disk on cumin1001 as yaml file, and loaded by the cookbook.

Thanks, yep. I had some half-formed ideas about how this should work. I suspect that instead of a key "vlan" under the host type we would have a key "network_profile" or similar. Its values would point to an item in another dict, defined in the same YAML, which would define:

  • Untagged vlan
  • Tagged vlans (optional; their presence means the switch port mode is trunk)
  • Bridge interfaces on the host

Tagged vlans would get a vlan sub-interface defined on the host side and IPs allocated. We could include an optional attribute "skip_ips", which we could set to "True" for ganeti hosts to add the vlan sub-interfaces but not allocate IPs to them. We could also have an optional "v6_dns" flag to control whether the v6 IPs should get DNS records. Lastly we could tie vlans to FQDN suffixes, probably based on the vlan prefix, i.e. (a combined sketch follows this mapping):

public: wikimedia.org
private: $SITE.wmnet
analytics: $SITE.wmnet
cloud-private: private.$SITE.wikimedia.cloud
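Rendering that idea as Python data for illustration (every key and value below is tentative):

```
# Host types reference a network profile, profiles define the vlans,
# and vlan prefixes map to DNS zones.
NETWORK_PROFILES = {
    "public-simple": {
        "untagged_vlan": "public",
    },
    "ganeti-trunk": {
        "untagged_vlan": "private",
        "tagged_vlans": ["analytics"],  # presence => switch port in trunk mode
        "bridges": ["private"],         # bridge interfaces on the host
        "skip_ips": True,               # vlan sub-interfaces, but no IP allocation
        "v6_dns": False,                # no DNS records for the v6 IPs
    },
}

HOST_TYPES = {
    "dns": {"network_profile": "public-simple"},
    "ganeti": {"network_profile": "ganeti-trunk"},
}

VLAN_DOMAINS = {
    "public": "wikimedia.org",
    "private": "$SITE.wmnet",
    "analytics": "$SITE.wmnet",
    "cloud-private": "private.$SITE.wikimedia.cloud",
}
```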

I'd need to work through it and refine it as I went, but some structure along those lines is kind of where I see this going. Obviously this task is more focussed on removing the potential for error in simple host setups, while I'm focussing on the complex host setups to remove the manual steps, but I think these two goals are aligned.

> In addition I think that we need to solve another problem first, one that is a pre-requisite for this and other similar automation requests: an authoritative mapping between hostnames and what you called the specs table.

I fully understand the first problem, which is not easy to fix. But I'm wondering why that would be a pre-requisite to the "specs table" mapping: for instance, couldn't we just update the existing Netbox ProvisionServerNetwork script to allocate IPs/vlan/dns names based on such a table? That doesn't seem related to me at first glance, but perhaps I'm missing something.

>> In addition I think that we need to solve another problem first, one that is a pre-requisite for this and other similar automation requests: an authoritative mapping between hostnames and what you called the specs table.

> I fully understand the first problem, which is not easy to fix. But I'm wondering why that would be a pre-requisite to the "specs table" mapping: for instance, couldn't we just update the existing Netbox ProvisionServerNetwork script to allocate IPs/vlan/dns names based on such a table? That doesn't seem related to me at first glance, but perhaps I'm missing something.

How does the cookbook know which specs table entry to use for a given host? User input? Then we're back to square one.

> How does the cookbook know which specs table entry to use for a given host? User input? Then we're back to square one.

As Arzhel defined it there would be one table, and the script (be that the existing Netbox ProvisionServerNetwork or a replacement cookbook) would select the entry based on the name of the host it was operating on?

If we need _different_ network setups for hosts of the same type then we should probably have a way to manually override the "network_profile", but tbh I think that's something of an anti-pattern anyway. Hosts with a given hostname prefix should ideally all have the same network profile.

> As Arzhel defined it there would be one table, and the script (be that the existing Netbox ProvisionServerNetwork or a replacement cookbook) would select the entry based on the name of the host it was operating on?

If we have one file per host we're back to square one: it's a lot of duplication, easy to make typos and mistakes, and hard to spot errors.

> If we need _different_ network setups for hosts of the same type then we should probably have a way to manually override the "network_profile", but tbh I think that's something of an anti-pattern anyway. Hosts with a given hostname prefix should ideally all have the same network profile.

This is not limited to the network setup; we also have other parameters like BIOS settings, and we'll need the same solution to automate hardware RAID setup. And once we have this mechanism we could move the partman recipes to it too.

And yes, we do have hosts in the same logical cluster that need different settings because of different hardware, so we need to support that use case too.

>> As Arzhel defined it there would be one table, and the script (be that the existing Netbox ProvisionServerNetwork or a replacement cookbook) would select the entry based on the name of the host it was operating on?

> If we have one file per host we're back to square one: it's a lot of duplication, easy to make typos and mistakes, and hard to spot errors.

One definition per host-type was the suggestion as I understood it. And we'd need to structure the data properly to minimise duplication (I touched on this in my original comment).

>> If we need _different_ network setups for hosts of the same type then we should probably have a way to manually override the "network_profile", but tbh I think that's something of an anti-pattern anyway. Hosts with a given hostname prefix should ideally all have the same network profile.

> This is not limited to the network setup; we also have other parameters like BIOS settings, and we'll need the same solution to automate hardware RAID setup. And once we have this mechanism we could move the partman recipes to it too.

> And yes, we do have hosts in the same logical cluster that need different settings because of different hardware, so we need to support that use case too.

Ok yeah, I was perhaps focussing on only one part of this, given my quarterly goal of allowing more complex network setups to be created without manual Netbox edits.

For now, it seems this way forward needs some more thought put into it. So I will proceed with those improvements to the Netbox ProvisionServerNetwork script as I'd planned, and we can factor them back into this improved workflow once the way forward is clear.

Wow, nice to see all the comments!

First, this proposal is not about replacing the "ProvisionServerNetwork" Netbox script, so it makes sense to me to keep improving it, probably by removing the VLAN field and making the VLAN Type smarter, but we can discuss that in another task.

About "the first problem" (which is error management), this proposal conveniently chains existing and well defined steps and cookbooks, so I'm predicting that in most of the cases they will run smoothly and prevent operator mistakes between scripts as well as general oversight.
Each step is a natural checkpoint as they've already been designed to be independent. If the run fails at any step we will know exactly where and the task will be updated accordingly by the cookbook so it will be easy to do the following steps manually.

On "the second problem" (authoritative list of server configuration specs), I worry we're falling deep into scope creep, like partman receipe or full on ENC is clearly not the goal here.
The current scope is to prevent issues during the very first provisioning of servers, as it caused at least 1 outage and growing frustration in SRE.

> And yes, we do have hosts in the same logical cluster that need different settings because of different hardware, so we need to support that use case too.

I'm not sure a new batch of servers in the same group ever gets provisioned with different settings (e.g. some in the private and some in the public vlan, or some with virtualization and some without). If some very specific hosts do, they can still be configured the way we currently do.

However I do agree that the next big step is to solve the ENC++ problem, and we could migrate the data defined here to wherever it's needed next.

I had a quick thought about the ENC++ problem, as you have named it, and I think that in the end, given a Netbox device object (hostname + location + possibly other data) plus hardware specs (auto-detected via Redfish?), we will need something to map this to:

  • Puppet role (currently in site.pp)
  • Hardware profile [BIOS virtualization + hardware RAID configuration] (currently manually set via cookbook argument and manually setup)
  • Network profile [VLAN, skip IPv6, Cassandra IPs, etc...] (currently manually set via Netbox provision script arguments)

The other problem to tackle is the mapping: the same regex matching a role might need to match two different hardware or network profiles, based on parameters. We should not lock ourselves into a system that doesn't allow that. At the same time, those will be exceptions, so we should keep a way to support the default use case without duplication.
Ideally the selection should not be purely based on a regex; it should also allow deciding based on other factors like location (for the new/old network stack, for example) and hardware specs.
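As a hedged sketch of a mapping that allows such exceptions without duplicating the default case: an ordered list of matchers where a rule can optionally constrain location and hardware, and the first match wins. All hostnames, models and profile names below are made up:

```
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """First-match-wins profile selection; fields beyond the regex are optional."""
    hostname_re: str
    profile: str
    sites: tuple[str, ...] = ()      # e.g. limit a rule to new network stack sites
    hw_models: tuple[str, ...] = ()  # e.g. limit a rule to one hardware generation

    def matches(self, hostname: str, site: str, hw_model: str) -> bool:
        return (bool(re.match(self.hostname_re, hostname))
                and (not self.sites or site in self.sites)
                and (not self.hw_models or hw_model in self.hw_models))


RULES = [
    # Exceptions first, the cluster-wide default last, so the common
    # case is written only once.
    Rule(r"db2\d{3}", profile="db-hw-raid", hw_models=("model-x",)),
    Rule(r"db[12]\d{3}", profile="db-default"),
]


def select_profile(hostname: str, site: str, hw_model: str) -> str:
    for rule in RULES:
        if rule.matches(hostname, site, hw_model):
            return rule.profile
    raise LookupError(f"no profile rule matches {hostname}")
```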

In my mind, trying to be too automatic or too smart here will only cause edge-case issues and complexity that is hard to troubleshoot, and make the project too big to implement.

As a puppet role can be assigned to virtually any kind of host, it should be a drop-down menu in Netbox.

BIOS virtualization is quite static, so that can be a static list based on hostname (like I defined previously).
Network profile as well (it will always be possible to manually edit Netbox for snowflakes).

I haven't dug much into RAID config so I won't comment on that here.

This also solves the mapping side of things. Either it's a "per cluster" setting, or it's a per host one. Multi-factor mapping seems like a slippery slope.

The downsides of using Netbox here are 1/ a new dependency (and a potential circular dependency), 2/ changes move away from Git (so no possibility to review them).