Create Ganeti test cluster
Open, In Progress, Needs Triage, Public

Description

We need a Ganeti test cluster to test invasive changes (such as OS updates) without production impact.

New hardware will be procured via https://phabricator.wikimedia.org/T284954, but given current lead times, two servers from the current codfw capex will initially be repurposed for this.

Event Timeline

Change 703213 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add separate role for Ganeti test cluster

https://gerrit.wikimedia.org/r/703213

Change 703213 merged by Muehlenhoff:

[operations/puppet@production] Add separate role for Ganeti test cluster

https://gerrit.wikimedia.org/r/703213

Change 705894 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[labs/private@master] Add dummy certs for ganeti test cluster

https://gerrit.wikimedia.org/r/705894

Change 705894 merged by Muehlenhoff:

[labs/private@master] Add dummy certs for ganeti test cluster

https://gerrit.wikimedia.org/r/705894

Change 705899 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add cert for ganeti-test RAPI

https://gerrit.wikimedia.org/r/705899

Change 705899 merged by Muehlenhoff:

[operations/puppet@production] Add cert for ganeti-test RAPI

https://gerrit.wikimedia.org/r/705899

Change 706315 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ganeti2025/2026 to Ganeti test cluster

https://gerrit.wikimedia.org/r/706315

Change 706424 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Make the RAPI certname configurable

https://gerrit.wikimedia.org/r/706424

Change 706424 merged by Muehlenhoff:

[operations/puppet@production] Make the RAPI certname configurable

https://gerrit.wikimedia.org/r/706424

Change 708500 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add ganeti_test to wikimedia_clusters

https://gerrit.wikimedia.org/r/708500

Change 708500 merged by Muehlenhoff:

[operations/puppet@production] Add ganeti_test to wikimedia_clusters

https://gerrit.wikimedia.org/r/708500

Change 706315 merged by Muehlenhoff:

[operations/puppet@production] Add ganeti2025/2026 to Ganeti test cluster

https://gerrit.wikimedia.org/r/706315

Change 708763 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/software/spicerack@master] ganeti: Add ganeti test cluster

https://gerrit.wikimedia.org/r/708763

Change 708973 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Update Cumin aliases for Ganeti test cluster

https://gerrit.wikimedia.org/r/708973

Change 708973 merged by Muehlenhoff:

[operations/puppet@production] Update Cumin aliases for Ganeti test cluster

https://gerrit.wikimedia.org/r/708973

Change 708976 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/cookbooks@master] addnode cookbook: Also allow ganeti test cluster role

https://gerrit.wikimedia.org/r/708976

Change 708763 merged by Muehlenhoff:

[operations/software/spicerack@master] ganeti: Add ganeti test cluster to locations

https://gerrit.wikimedia.org/r/708763

Change 708976 merged by Muehlenhoff:

[operations/cookbooks@master] addnode cookbook: Also allow ganeti test cluster role

https://gerrit.wikimedia.org/r/708976

Change 709386 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Add record for ganeti testcluster

https://gerrit.wikimedia.org/r/709386

Change 709386 merged by Muehlenhoff:

[operations/dns@master] Add record for ganeti testcluster

https://gerrit.wikimedia.org/r/709386

Mentioned in SAL (#wikimedia-operations) [2021-08-03T11:13:02Z] <moritzm> rename Ganeti group for test cluster to row_D T286206

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2001.codfw.wmnet

  • testvm2001.codfw.wmnet (FAIL)
    • Host steps raised exception: Host testvm2001 was not found in Icinga status - no hosts have been downtimed.

ERROR: some step on some host failed, check the bolded items above

Mentioned in SAL (#wikimedia-operations) [2021-08-03T15:25:05Z] <moritzm> prune testvm2001 from Ganeti and clean up from Netbox (stuck in some inconsistent state the decom cookbook can't handle) T286206

Change 710017 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add testvm200[12] to site.pp

https://gerrit.wikimedia.org/r/710017

Change 710017 merged by Muehlenhoff:

[operations/puppet@production] Add testvm200[12] to site.pp

https://gerrit.wikimedia.org/r/710017

The Ganeti test cluster has been set up, along with two test instances (testvm2001/2002). Next, it will be used to test the Buster update.

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2002.codfw.wmnet

  • testvm2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2001.codfw.wmnet

  • testvm2001.codfw.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti-test01.svc.codfw.wmnet to Netbox
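
The decommission runs for testvm2001 logged above differ in how the missing Icinga entry was treated: the first run stopped with FAIL, while the later run continued and ended with WARN. A minimal sketch of that distinction follows. It is not the actual cookbook code; `run_steps`, `NonFatalStepError`, `Status` and `downtime_missing` are hypothetical names introduced only for illustration.

```python
from enum import Enum


class Status(Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


class NonFatalStepError(Exception):
    """Hypothetical: raised by a step whose failure should not abort the run."""


def run_steps(host, steps):
    """Run (description, callable) steps in order; return (status, log lines).

    A NonFatalStepError downgrades the result to WARN and the run continues.
    Any other exception marks the host FAIL and stops the remaining steps.
    """
    status = Status.PASS
    log = []
    for description, action in steps:
        try:
            action(host)
            log.append(description)
        except NonFatalStepError as exc:
            status = Status.WARN
            log.append(str(exc))
        except Exception as exc:
            log.append(f"Host steps raised exception: {exc}")
            return Status.FAIL, log
    return status, log


def downtime_missing(host):
    # Hypothetical stand-in for a host that is absent from Icinga.
    raise NonFatalStepError("Host not found on Icinga, unable to downtime it")


if __name__ == "__main__":
    host = "testvm2001.codfw.wmnet"
    steps = [
        ("Downtimed host on Icinga", downtime_missing),
        ("VM shutdown", lambda h: None),
        ("VM removed", lambda h: None),
    ]
    status, log = run_steps(host, steps)
    print(f"{host} ({status.value})")
    for line in log:
        print(f"  {line}")
```

Treating the monitoring step as non-fatal lets the remaining cleanup proceed for a host that was never registered in Icinga, which matches the WARN run in the log.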
joanna_borun changed the task status from Open to In Progress. Tue, Sep 21, 1:32 PM