Create Ganeti test cluster
Closed, Resolved · Public

Description

We need a Ganeti test cluster to test invasive changes (such as OS updates) without production impact.

New hardware will be procured via https://phabricator.wikimedia.org/T284954, but given current lead times, two servers from the current codfw capex will initially be repurposed for this.

Details

Repo                             Branch        Lines +/-
operations/software/spicerack    master        +1 -1
operations/puppet                production    +3 -2
operations/puppet                production    +3 -4
operations/puppet                production    +2 -7
operations/puppet                production    +2 -1
operations/puppet                production    +1 -1
operations/puppet                production    +6 -1
operations/puppet                production    +1 -2
operations/puppet                production    +4 -3
operations/puppet                production    +1 -1
operations/puppet                production    +6 -0
operations/dns                   master        +1 -0
operations/cookbooks             master        +1 -1
operations/software/spicerack    master        +1 -0
operations/puppet                production    +2 -1
operations/puppet                production    +17 -1
operations/puppet                production    +7 -0
operations/puppet                production    +14 -3
operations/puppet                production    +23 -0
labs/private                     master        +4 -0
operations/puppet                production    +22 -0

Event Timeline

Change 703213 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add separate role for Ganeti test cluster

https://gerrit.wikimedia.org/r/703213

Change 703213 merged by Muehlenhoff:

[operations/puppet@production] Add separate role for Ganeti test cluster

https://gerrit.wikimedia.org/r/703213

Change 705894 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[labs/private@master] Add dummy certs for ganeti test cluster

https://gerrit.wikimedia.org/r/705894

Change 705894 merged by Muehlenhoff:

[labs/private@master] Add dummy certs for ganeti test cluster

https://gerrit.wikimedia.org/r/705894

Change 705899 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add cert for ganeti-test RAPI

https://gerrit.wikimedia.org/r/705899

Change 705899 merged by Muehlenhoff:

[operations/puppet@production] Add cert for ganeti-test RAPI

https://gerrit.wikimedia.org/r/705899

Change 706315 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ganeti2025/2026 to Ganeti test cluster

https://gerrit.wikimedia.org/r/706315

Change 706424 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make the RAPI certname configurable

https://gerrit.wikimedia.org/r/706424

Change 706424 merged by Muehlenhoff:

[operations/puppet@production] Make the RAPI certname configurable

https://gerrit.wikimedia.org/r/706424

Change 708500 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ganeti_test to wikimedia_clusters

https://gerrit.wikimedia.org/r/708500

Change 708500 merged by Muehlenhoff:

[operations/puppet@production] Add ganeti_test to wikimedia_clusters

https://gerrit.wikimedia.org/r/708500

Change 706315 merged by Muehlenhoff:

[operations/puppet@production] Add ganeti2025/2026 to Ganeti test cluster

https://gerrit.wikimedia.org/r/706315

Change 708763 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/software/spicerack@master] ganeti: Add ganeti test cluster

https://gerrit.wikimedia.org/r/708763

Change 708973 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Update Cumin aliases for Ganeti test cluster

https://gerrit.wikimedia.org/r/708973

Change 708973 merged by Muehlenhoff:

[operations/puppet@production] Update Cumin aliases for Ganeti test cluster

https://gerrit.wikimedia.org/r/708973

Change 708976 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] addnode cookbook: Also allow ganeti test cluster role

https://gerrit.wikimedia.org/r/708976

Change 708763 merged by Muehlenhoff:

[operations/software/spicerack@master] ganeti: Add ganeti test cluster to locations

https://gerrit.wikimedia.org/r/708763

Change 708976 merged by Muehlenhoff:

[operations/cookbooks@master] addnode cookbook: Also allow ganeti test cluster role

https://gerrit.wikimedia.org/r/708976

Change 709386 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Add record for ganeti testcluster

https://gerrit.wikimedia.org/r/709386

Change 709386 merged by Muehlenhoff:

[operations/dns@master] Add record for ganeti testcluster

https://gerrit.wikimedia.org/r/709386

Mentioned in SAL (#wikimedia-operations) [2021-08-03T11:13:02Z] <moritzm> rename Ganeti group for test cluster to row_D T286206

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2001.codfw.wmnet

  • testvm2001.codfw.wmnet (FAIL)
    • Host steps raised exception: Host testvm2001 was not found in Icinga status - no hosts have been downtimed.

ERROR: some step on some host failed, check the bolded items above

Mentioned in SAL (#wikimedia-operations) [2021-08-03T15:25:05Z] <moritzm> prune testvm2001 from Ganeti and clean up from Netbox (stuck in some inconsistent state the decom cookbook can't handle) T286206

Change 710017 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add testvm200[12] to site.pp

https://gerrit.wikimedia.org/r/710017

Change 710017 merged by Muehlenhoff:

[operations/puppet@production] Add testvm200[12] to site.pp

https://gerrit.wikimedia.org/r/710017

The Ganeti test cluster has been set up, along with two test instances (testvm2001/2002). Next, it will be used to test the Buster update.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2002.codfw.wmnet

  • testvm2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2001.codfw.wmnet

  • testvm2001.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox

joanna_borun changed the task status from Open to In Progress. Sep 21 2021, 1:32 PM

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2002.codfw.wmnet

  • testvm2002.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
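
The decommission reports above label each host PASS, WARN, or FAIL depending on how its individual steps went. A minimal sketch of that roll-up follows, assuming the host status is simply the worst outcome among its steps. The `Outcome` and `HostReport` names are hypothetical and written for illustration only; this is not the Spicerack or cookbook implementation.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Outcome(IntEnum):
    """Severity levels, ordered so that max() picks the worst one."""
    PASS = 0
    WARN = 1
    FAIL = 2


@dataclass
class HostReport:
    """Hypothetical model of one per-host entry in a cookbook summary."""
    host: str
    steps: list = field(default_factory=list)  # (Outcome, message) pairs

    def add(self, outcome, message):
        """Record one step together with its outcome."""
        self.steps.append((outcome, message))

    @property
    def status(self):
        # Assumption: a host with no recorded steps counts as PASS.
        return max((outcome for outcome, _ in self.steps), default=Outcome.PASS)

    def render(self):
        """Format the entry in the bullet layout used in this task."""
        lines = [f"  • {self.host} ({self.status.name})"]
        lines += [f"    • {message}" for _, message in self.steps]
        return "\n".join(lines)


# Rebuild the start of the testvm2002 entry shown above.
report = HostReport("testvm2002.codfw.wmnet")
report.add(Outcome.WARN, "Host not found on Icinga, unable to downtime it")
report.add(Outcome.PASS, "Found Ganeti VM")
report.add(Outcome.PASS, "VM shutdown")
print(report.render())
```

Under this model a single warning, such as the missing Icinga host, downgrades the whole entry to WARN without hiding the steps that succeeded.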

Change 724753 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add testvm2003

https://gerrit.wikimedia.org/r/724753

Change 724753 merged by Muehlenhoff:

[operations/puppet@production] Add testvm2003

https://gerrit.wikimedia.org/r/724753

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2001.codfw.wmnet

  • testvm2001.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2003.codfw.wmnet

  • testvm2003.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2004.codfw.wmnet

  • testvm2004.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2001.codfw.wmnet

  • testvm2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2002.codfw.wmnet

  • testvm2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2006.codfw.wmnet

  • testvm2006.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2026.codfw.wmnet with OS buster completed:

  • ganeti2026 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110141146_jmm_3011474_ganeti2026.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2005.codfw.wmnet

  • testvm2005.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2007.codfw.wmnet

  • testvm2007.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox

Change 734204 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] netbox: remove sync from codfw ganeti cluster

https://gerrit.wikimedia.org/r/734204

Change 734204 merged by Volans:

[operations/puppet@production] netbox: remove sync from codfw ganeti cluster

https://gerrit.wikimedia.org/r/734204

Change 734206 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] cumin: temporary remove ganeti-test alias

https://gerrit.wikimedia.org/r/734206

Change 734206 merged by Volans:

[operations/puppet@production] cumin: temporary remove ganeti-test alias

https://gerrit.wikimedia.org/r/734206

Change 736447 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Switch ganeti-test2001 to ganeti_test role

https://gerrit.wikimedia.org/r/736447

Change 736447 merged by Muehlenhoff:

[operations/puppet@production] Switch ganeti-test2001 to ganeti_test role

https://gerrit.wikimedia.org/r/736447

Change 736449 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Reset profile::ganeti::ganeti216 to false for the role

https://gerrit.wikimedia.org/r/736449

Change 736449 merged by Muehlenhoff:

[operations/puppet@production] Reset profile::ganeti::ganeti216 to false for the role

https://gerrit.wikimedia.org/r/736449

Mentioned in SAL (#wikimedia-operations) [2021-11-03T14:10:37Z] <moritzm> initialising ganeti-test01.svc.codfw.wmnet cluster on ganeti-test2001 T286206

Change 736529 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Extend ganeti-all alias to also include ganeti_test

https://gerrit.wikimedia.org/r/736529

Change 736529 merged by Muehlenhoff:

[operations/puppet@production] Extend ganeti-all alias to also include ganeti_test

https://gerrit.wikimedia.org/r/736529

Change 736707 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Also apply ganeti_test role to 2002/2003

https://gerrit.wikimedia.org/r/736707

Change 736707 merged by Muehlenhoff:

[operations/puppet@production] Also apply ganeti_test role to 2002/2003

https://gerrit.wikimedia.org/r/736707

Change 736724 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Enable RAPI Netbox sync for new test cluster

https://gerrit.wikimedia.org/r/736724

Change 736724 merged by Muehlenhoff:

[operations/puppet@production] Enable RAPI Netbox sync for new test cluster

https://gerrit.wikimedia.org/r/736724

Change 736759 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Update names of nodes in new test cluster

https://gerrit.wikimedia.org/r/736759

Change 736759 merged by Muehlenhoff:

[operations/puppet@production] Update names of nodes in new test cluster

https://gerrit.wikimedia.org/r/736759

Mentioned in SAL (#wikimedia-operations) [2021-11-05T08:52:49Z] <moritzm> installing set kvm::machine_version for ganeti-test cluster to pc-i440fx-2.8 T286206

Change 736994 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/software/spicerack@master] ganeti: Fix up row configuration for ganeti test cluster

https://gerrit.wikimedia.org/r/736994

Change 736994 merged by Muehlenhoff:

[operations/software/spicerack@master] ganeti: Fix up row configuration for ganeti test cluster

https://gerrit.wikimedia.org/r/736994

Mentioned in SAL (#wikimedia-operations) [2021-11-05T12:22:58Z] <moritzm> renamed Ganeti group of test cluster from "default" to "row_A" (following conventions in main DCs) T286206
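
The renames logged in this task ("row_D" for the first test cluster, "row_A" for the new one) follow a `row_<letter>` pattern for Ganeti node groups. A small sketch of that convention is shown here, assuming one group per datacenter row. The `CLUSTER_ROWS` table and the helper names are hypothetical; only the cluster FQDN and its row come from this task, and the real Spicerack configuration may be structured differently.

```python
import re

# Hypothetical lookup table: cluster FQDN -> rows hosting its nodes.
# Only the test cluster entry is taken from this task.
CLUSTER_ROWS = {
    "ganeti-test01.svc.codfw.wmnet": ("A",),
}

# Assumed naming convention: "row_" followed by a single uppercase letter.
_GROUP_NAME = re.compile(r"row_[A-Z]")


def group_for_row(row):
    """Return the Ganeti group name for a datacenter row."""
    return f"row_{row}"


def expected_groups(cluster):
    """List the group names a cluster should have, one per row."""
    return [group_for_row(row) for row in CLUSTER_ROWS[cluster]]


def follows_convention(group):
    """Check whether a group name matches the row_<letter> pattern."""
    return _GROUP_NAME.fullmatch(group) is not None


print(expected_groups("ganeti-test01.svc.codfw.wmnet"))
print(follows_convention("default"))
```

A check like `follows_convention` would flag the initial "default" group, which is the name that was replaced here.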

The new Ganeti test cluster has been set up: it consists of three nodes in row A of codfw (ganeti-test200[1-3].codfw.wmnet). A test instance has been added with the makevm cookbook, and instance migration and master failover are also working fine.
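
The bracket notation in the node list is shorthand for a numeric range. As an illustration, the sketch below expands it into individual hostnames. The `expand_hosts` helper is hypothetical and handles only a single `[start-end]` range; it is not the parser used by Cumin or the cookbooks.

```python
import re

# Matches one numeric range such as "[1-3]".
_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_hosts(pattern):
    """Expand a single numeric bracket range into a list of hostnames."""
    match = _RANGE.search(pattern)
    if match is None:
        # No range present: the pattern is already a plain hostname.
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep any zero padding used in the range start
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [
        f"{prefix}{str(number).zfill(width)}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


nodes = expand_hosts("ganeti-test200[1-3].codfw.wmnet")
for node in nodes:
    print(node)
```

This yields ganeti-test2001, ganeti-test2002 and ganeti-test2003 under codfw.wmnet, matching the three nodes described above.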