Update makevm to include completion of the installation with the puppet runs
Closed, Resolved. Public.

Description

Task-related: T305589#7837933. Duplicating comment here for tracking purposes:

AIUI the decom cookbook doesn't support VMs yet (?)

That's not actually correct: the decommission cookbook has supported VMs from the start.
What is missing is that the makevm cookbook does not yet complete the installation with the Puppet runs and related steps.
My idea is to complete that part so that makevm can automate the whole process of creating a new VM.
At that point there will be two options:

  1. sre.ganeti.reimage with the same hostname (and we could make adjustments to keep the same IP too). That will be logically equivalent to a physical hardware reimage, including the downtime of the host.
  2. makevm with a new name and IPs + decommission of the old VM. That will be logically equivalent to a physical hardware refresh (getting new hardware), including the possibility to bring the new host up before decommissioning the old one.

Event Timeline

For reference, depooling a Wikidough or durum host involves stopping the bird and bird6 services (after disabling Puppet but I think that's generic).
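The depool order described above (disable Puppet, then stop both BIRD daemons) could be sketched as a small command plan. This is only an illustration: the function name `depool_commands` and the reason string are hypothetical, and the task does not say which Puppet-disabling wrapper these hosts use, so the stock `puppet agent --disable` call is an assumption.

```python
from typing import List


def depool_commands(reason: str) -> List[List[str]]:
    """Return the depool commands in the order they should run.

    Assumption: Puppet is disabled with the stock `puppet agent --disable`;
    the real hosts may use a site-specific wrapper instead.
    """
    return [
        # Disable Puppet first, as the comment in the task describes.
        ["puppet", "agent", "--disable", reason],
        # Then stop both BIRD daemons (bird and bird6).
        ["systemctl", "stop", "bird.service", "bird6.service"],
    ]


if __name__ == "__main__":
    # Only print the plan; nothing is executed here.
    for cmd in depool_commands("depool before reimage"):
        print(" ".join(cmd))
```

Returning the plan as data rather than running it keeps the sketch safe to try anywhere; a cookbook would presumably hand these commands to its remote-execution layer.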

Change 793856 had a related patch set uploaded (by Volans; author: Volans):

[operations/software/spicerack@master] ganeti: add set_boot_media() method

https://gerrit.wikimedia.org/r/793856

Change 793856 merged by jenkins-bot:

[operations/software/spicerack@master] ganeti: add set_boot_media() method

https://gerrit.wikimedia.org/r/793856

Volans triaged this task as High priority. May 24 2022, 3:37 PM

@Volans / @joanna_borun: This open task has no active project tags associated. Could you please either update the task status, or associate an active project tag? Thanks!

Volans updated the task description.
Volans added a subscriber: SLyngshede-WMF.

Change 860080 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.ganeti.makevm: refactor to simplify expansion

https://gerrit.wikimedia.org/r/860080

Change 865057 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/cookbooks@master] Ganeti: Add reimaging cookbook

https://gerrit.wikimedia.org/r/865057

In testing, others have seen the following error: NetboxHostNotFoundError

I tried to test the new Ganeti cookbook with a local checkout of the cookbooks repo plus a custom config on the cumin hosts (1001 and 2002), but I keep getting:

elukey@cumin1001:~$ sudo cookbook --dry-run -c config-elukey.yaml sre.ganeti.reimage --os bullseye --no-downtime ml-staging-etcd2001.codfw.wmnet
DRY-RUN: Executing cookbook sre.ganeti.reimage with args: ['--os', 'bullseye', '--no-downtime', 'ml-staging-etcd2001.codfw.wmnet']
DRY-RUN: Exception raised while initializing the Cookbook sre.ganeti.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/netbox.py", line 229, in get_server
    server = self._fetch_host(hostname)
  File "/usr/lib/python3/dist-packages/spicerack/netbox.py", line 79, in _fetch_host
    raise NetboxHostNotFoundError
spicerack.netbox.NetboxHostNotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 219, in run
    runner = self.instance.get_runner(args)
  File "/home/elukey/cookbooks/cookbooks/sre/ganeti/reimage.py", line 87, in get_runner
    return ReimageRunner(args, self.spicerack)
  File "/home/elukey/cookbooks/cookbooks/sre/ganeti/reimage.py", line 100, in __init__
    self.netbox_server = spicerack.netbox_server(self.host)
  File "/usr/lib/python3/dist-packages/spicerack/__init__.py", line 647, in netbox_server
    return self.netbox(read_write=read_write).get_server(hostname)
  File "/usr/lib/python3/dist-packages/spicerack/netbox.py", line 231, in get_server
    server = self._fetch_virtual_machine(hostname)
  File "/usr/lib/python3/dist-packages/spicerack/netbox.py", line 104, in _fetch_virtual_machine
    raise NetboxHostNotFoundError
spicerack.netbox.NetboxHostNotFoundError

Has it happened to anybody else?

EDIT: my bad, I passed the FQDN instead of the short hostname.
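The failure above came from passing the FQDN where the Netbox lookup expects the short hostname. A minimal sketch of normalizing the argument before such a lookup follows; `short_hostname` is a hypothetical helper for illustration and is not part of spicerack.

```python
def short_hostname(name: str) -> str:
    """Return the host part of `name`, dropping any domain suffix.

    Hypothetical helper: it only splits on the first dot, so it is not
    meant for IP addresses.
    """
    cleaned = name.strip().rstrip(".")
    if not cleaned:
        raise ValueError("empty hostname")
    return cleaned.split(".", 1)[0]


if __name__ == "__main__":
    print(short_hostname("ml-staging-etcd2001.codfw.wmnet"))
```

Whether a cookbook should silently accept an FQDN or reject it with a clear message is a design choice; rejecting it avoids ambiguity when the same short name exists in more than one domain.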

Change 865057 merged by Slyngshede:

[operations/cookbooks@master] sre.ganeti.reimage: add new cookbook

https://gerrit.wikimedia.org/r/865057

Hi folks, I tried to call the new reimage cookbook from sre.k8s.upgrade-cluster and I got the following error after the first try:

Internet Systems Consortium DHCP Server 4.4.1
Copyright 2004-2018 Internet Systems Consortium.
All rights reserved.
For info, please visit https://www.isc.org/software/dhcp/
/etc/dhcp/linux-host-entries.ttyS0-115200 line 901: host ml-staging-etcd2001: already exists
}
 ^
/etc/dhcp/dhcpd.conf line 766: /etc/dhcp/linux-host-entries.ttyS0-115200: bad parse.
        include "/etc/dhcp/linux-host-entries.ttyS0-115200"
                 ^
Configuration file errors encountered -- exiting

If you think you have received this message due to a bug rather
than a configuration issue please read the section on submitting
bugs on either our web page at www.isc.org or in the README file
before submitting a bug.  These pages explain the proper
process and the information we find helpful for debugging.

exiting.
2023-02-10 14:36:15,037 [ERROR] dhcp config test returned non-zero.
================
100.0% (1/1) of nodes failed to execute command '/usr/local/sbin/...cludes -r commit': install2004.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: '/usr/local/sbin/...cludes -r commit'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Exception raised while executing cookbook sre.ganeti.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/dhcp.py", line 218, in refresh_dhcp
    self._hosts.run_sync("/usr/local/sbin/dhcpincludes -r commit", print_progress_bars=False)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 520, in run_sync
    return self._execute(
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 720, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed")
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)

Do I have to change linux-host-entries.ttyS0-115200 before running my cookbook?
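The `already exists` error means the same `host` name is declared more than once by the time dhcpd has parsed its includes. One rough way to spot that before the config test fails is to scan the configuration text for repeated `host NAME {` declarations. The sketch below assumes the usual ISC dhcpd stanza syntax; `find_duplicate_hosts` is a hypothetical helper and is not something the cookbook or `dhcpincludes` provides.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches the start of a stanza such as: host some-name {
HOST_RE = re.compile(r"^\s*host\s+([^\s{]+)\s*\{", re.MULTILINE)


def find_duplicate_hosts(config_text: str) -> Dict[str, List[int]]:
    """Map each host name declared more than once to its 1-based line numbers."""
    seen: Dict[str, List[int]] = defaultdict(list)
    for match in HOST_RE.finditer(config_text):
        # Count newlines before the host name to get its line number.
        line_no = config_text.count("\n", 0, match.start(1)) + 1
        seen[match.group(1)].append(line_no)
    return {name: lines for name, lines in seen.items() if len(lines) > 1}
```

The duplicate may sit in a different include file from the one named in the error, so it may be worth scanning every include; if the files are concatenated first, the reported line numbers refer to the concatenated text.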

Change 888241 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/cookbooks@master] sre:ganeti:reimage switch tty

https://gerrit.wikimedia.org/r/888241

Change 888241 merged by Slyngshede:

[operations/cookbooks@master] sre:ganeti:reimage switch tty

https://gerrit.wikimedia.org/r/888241

Cookbook cookbooks.sre.ganeti.reimage was started by slyngshede@cumin1001 for host test-reimage2001.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by slyngshede@cumin1001 for host test-reimage2001.codfw.wmnet with OS bullseye completed:

  • test-reimage2001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202302130941_slyngshede_555149_test-reimage2001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Change 889999 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] sre.ganeti.makevm: Stop printing the dhcp config snippet

https://gerrit.wikimedia.org/r/889999

Change 889999 merged by Muehlenhoff:

[operations/cookbooks@master] sre.ganeti.makevm: Stop printing the dhcp config snippet

https://gerrit.wikimedia.org/r/889999

Should we also extend the cookbook to run sre.puppet.sync-netbox-hiera? Or at least print a note? It is easy to forget.

@MoritzMuehlenhoff in theory no: the makevm cookbook should call the reimage one directly and do everything automatically.

Yeah, but currently only sre.hosts.reimage calls sre.puppet.sync-netbox-hiera; sre.ganeti.reimage does not.

Right. Thinking about it again: since VMs currently can't change cluster and we don't use other status values, the only time it needs to run is with makevm, not with the reimage.
@SLyngshede-WMF could you take care of this and add the call to the hiera cookbook? If not, let me know and I can take it.

Change 860080 abandoned by Volans:

[operations/cookbooks@master] sre.ganeti.makevm: refactor to simplify expansion

Reason:

Superseded by more recent refactors

https://gerrit.wikimedia.org/r/860080