The server provisioning workflow has improved significantly in the last few years. However, toil and misconfiguration still happen.
We can make it even better by automating the following steps from https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Requested_-%3E_Planned_additional_steps_&_Spare_-%3E_Planned :
- Run the Netbox ProvisionServerNetwork script to assign mgmt IP, primary IPv4/IPv6, vlan and switch interface
- Follow the DNS/Netbox#Update_generated_records procedure to create and deploy the DNS records for the mgmt and primary IPs (the mgmt records should include both $assettag.mgmt.site.wmnet and $hostname.mgmt.site.wmnet).
- Run the sre.network.configure-switch-interfaces cookbook to configure the switch side.
- System Bios & out of band mgmt settings are configured at this time.
- Serial Redirection and mgmt must be tested at this time
This automation could take the form of a meta cookbook, say sre.hosts.makeitplanned, with the following arguments:
- host (mandatory) - should match a Netbox device hostname
- switch port (mandatory)
- port speed (mandatory, but could be derived from the switch or from the host group specs table) - 1, 10 or 25 Gbps
- cable ID (optional)
- task (optional)
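As a rough illustration of the arguments above, here is a minimal argparse sketch. The option names (`--switch-port`, `--port-speed`, `--cable-id`, `--task-id`) and the example values are assumptions made for this example, not an existing interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical argument parser for the proposed sre.hosts.makeitplanned cookbook."""
    parser = argparse.ArgumentParser(prog="sre.hosts.makeitplanned")
    # Positional and mandatory: should match a Netbox device hostname.
    parser.add_argument("host", help="Netbox device hostname")
    parser.add_argument("--switch-port", required=True,
                        help="switch interface the host is connected to")
    # Mandatory here; it could become optional if derived from a specs table.
    parser.add_argument("--port-speed", type=int, choices=(1, 10, 25), required=True,
                        help="port speed in Gbps")
    parser.add_argument("--cable-id", help="cable ID (optional)")
    parser.add_argument("--task-id", help="task to update (optional)")
    return parser


# Example invocation with made-up values.
args = build_parser().parse_args(
    ["db1001", "--switch-port", "xe-0/0/1", "--port-speed", "10"]
)
print(args)
```

Restricting the port speed with `choices` rejects a typo such as `100` before any Netbox or switch change is attempted.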
The cookbook would then do the following:
- Display to the user what's going to happen and ask for confirmation (like the makevm cookbook)
- Run the Netbox ProvisionServerNetwork script to assign mgmt IP, primary IPv4/IPv6, vlan and switch interface
  - Select the proper vlan based on a host group specs table (e.g. if it's a cp host, set it to the private vlan)
  - Same for the "Skip IPv6 DNS records" and "How many Cassandra instances" options
- Run the DNS cookbook
- Run the sre.network.configure-switch-interfaces cookbook to configure the switch side.
- Run the sre.hosts.provision cookbook
  - Enable virtualization based on the same specs table
- Update the task with what has been done
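The flow above (show the plan, ask for confirmation, run each step in order, report what was done) could be sketched as follows. This is only an outline under assumptions: `run_plan` and the placeholder step actions are hypothetical, and a real cookbook would call the actual Netbox script and cookbooks instead of the lambdas used here.

```python
from typing import Callable, List, Tuple

# A step is a human-readable description plus the action that performs it.
Step = Tuple[str, Callable[[], None]]


def run_plan(steps: List[Step], confirm: Callable[[str], bool]) -> List[str]:
    """Show the plan, ask for confirmation, then run the steps in order.

    Returns the descriptions of the completed steps, which could then be
    posted to the task. An exception raised by a step stops the run.
    """
    plan = "\n".join(f"{i}. {desc}" for i, (desc, _) in enumerate(steps, start=1))
    if not confirm(plan):
        return []
    done = []
    for desc, action in steps:
        action()
        done.append(desc)
    return done


# Placeholder actions that only record that they were called.
calls = []
steps = [
    ("Run the Netbox ProvisionServerNetwork script", lambda: calls.append("netbox")),
    ("Run the DNS cookbook", lambda: calls.append("dns")),
    ("Run sre.network.configure-switch-interfaces", lambda: calls.append("switch")),
    ("Run sre.hosts.provision", lambda: calls.append("provision")),
]
completed = run_plan(steps, confirm=lambda plan: True)
print("\n".join(completed))
```

Passing the confirmation prompt in as a function keeps the sequencing testable without user interaction.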
Not sure if "Serial Redirection and mgmt must be tested at this time" is still needed now that we have automation and monitoring, but the cookbook could at least check for SSH access.
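One possible shape for that SSH check, using only the Python standard library: open a TCP connection and look for the SSH identification banner. The function name `ssh_reachable` is hypothetical, and the check assumes the server sends its `SSH-` version string first, which is the common case. It only shows that an SSH service answers, not that logins work.

```python
import socket


def ssh_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if something answering with an SSH banner listens on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(64)
    except OSError:
        # Covers connection refused, unreachable hosts, DNS failures and timeouts.
        return False
    return banner.startswith(b"SSH-")
```

The cookbook could call this against the `$hostname.mgmt.site.wmnet` name once the DNS records are deployed, and report a failure instead of silently continuing.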
Of course, for special cases the individual actions would still need to be run by hand, as we do today. The cookbook would also only be used for the very first provisioning.
It could also potentially take care of running sre.hardware.upgrade-firmware.
The "host group specs table" could for now be defined in Puppet's Hiera, written to disk on cumin1001 as a YAML file, and loaded by the cookbook.
dns:
  vlan: public
ganeti:
  vlan: ganeti  # special vlan trunking
  virtualization: true
db:
  vlan: private
  no_ipv6: true
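A minimal sketch of how the cookbook could look up a host's settings in such a table. To stay self-contained, the table is written as a Python dict mirroring the example above rather than loaded from the YAML file. Deriving the group by stripping the trailing digits from the hostname, treating missing flags as false, and the names `specs_for` and `FLAG_DEFAULTS` are all assumptions for this example.

```python
import re

# Mirrors the example table; in the proposal this would come from the YAML file on disk.
HOST_GROUP_SPECS = {
    "dns": {"vlan": "public"},
    "ganeti": {"vlan": "ganeti", "virtualization": True},  # special vlan trunking
    "db": {"vlan": "private", "no_ipv6": True},
}

# Assumption: flags that a group does not set are treated as disabled.
FLAG_DEFAULTS = {"no_ipv6": False, "virtualization": False}


def specs_for(hostname: str) -> dict:
    """Return the provisioning specs for a host, based on its hostname prefix."""
    group = re.sub(r"\d+$", "", hostname)  # e.g. "db1001" -> "db"
    if group not in HOST_GROUP_SPECS:
        # Unknown groups are special cases that still need the manual steps.
        raise ValueError(f"no specs defined for host group {group!r}")
    return {**FLAG_DEFAULTS, **HOST_GROUP_SPECS[group]}


print(specs_for("db1001"))
```

Raising on an unknown group, rather than falling back to a default vlan, keeps special cases on the manual path.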