
Recreate ml-etcd2002 in a different row
Closed, Resolved, Public

Description

As Moritz noticed, we have two ml-etcd nodes on the same ganeti host:

elukey@ganeti2021:~$ sudo gnt-instance list | grep ml-etcd
ml-etcd2001.codfw.wmnet         kvm        debootstrap+default ganeti2019.codfw.wmnet running      3.0G
ml-etcd2002.codfw.wmnet         kvm        debootstrap+default ganeti2019.codfw.wmnet running      3.0G
ml-etcd2003.codfw.wmnet         kvm        debootstrap+default ganeti2015.codfw.wmnet running      3.0G

This is my mistake: I created the cluster without double-checking the VM placement. If possible, I'd like to destroy and re-create ml-etcd2002 on a different Ganeti row/host.
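The check above could also be scripted. The following is a minimal sketch, not an existing tool: it assumes the column layout of the `gnt-instance list` output pasted in this task (instance name in the first column, primary node in the fourth), and the function name `colocated_instances` is hypothetical. It only detects instances sharing a Ganeti host; the listing carries no row information, so row placement would still need to be checked separately.

```python
from collections import defaultdict

# Sample copied from the `gnt-instance list | grep ml-etcd` output in this task.
SAMPLE = """\
ml-etcd2001.codfw.wmnet         kvm        debootstrap+default ganeti2019.codfw.wmnet running      3.0G
ml-etcd2002.codfw.wmnet         kvm        debootstrap+default ganeti2019.codfw.wmnet running      3.0G
ml-etcd2003.codfw.wmnet         kvm        debootstrap+default ganeti2015.codfw.wmnet running      3.0G
"""


def colocated_instances(listing: str) -> dict[str, list[str]]:
    """Group instances by primary node and keep nodes hosting more than one.

    Assumption: whitespace-separated columns, instance name first and
    primary node fourth, as in the listing shown in this task.
    """
    by_node = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or unexpected lines
        instance, node = fields[0], fields[3]
        by_node[node].append(instance)
    return {node: names for node, names in by_node.items() if len(names) > 1}


if __name__ == "__main__":
    for node, names in colocated_instances(SAMPLE).items():
        print(f"{node}: {', '.join(names)}")
```

Run against the listing in the description, this reports ganeti2019.codfw.wmnet as hosting both ml-etcd2001 and ml-etcd2002.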

Event Timeline

elukey@ml-etcd2002:~$ etcdctl -C https://ml-etcd2002.codfw.wmnet:2379 cluster-health
member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
cluster is healthy
elukey@ml-etcd2002:~$ curl -k -L https://ml-etcd2002.codfw.wmnet:2379/v2/stats/leader
{"message":"not current leader"}
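The transcript above shows all three members healthy and ml-etcd2002 answering "not current leader", i.e. it was a follower at that point. As a rough illustration of why that matters before a decommission, the sketch below checks whether the remaining healthy members would still form a quorum once one member is removed. It assumes the `cluster-health` line format shown above (`member <id> is healthy: ...`) and treats any other `member` line as not healthy; the helper names are hypothetical.

```python
# Sample copied from the cluster-health output in this task.
HEALTH = """\
member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
cluster is healthy
"""


def quorum(cluster_size: int) -> int:
    """Majority needed for a cluster of the given size."""
    return cluster_size // 2 + 1


def parse_health(output: str) -> tuple[list[str], list[str]]:
    """Return (healthy member IDs, all member IDs) from cluster-health output."""
    healthy, seen = [], []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "member":
            seen.append(parts[1])
            if parts[2:4] == ["is", "healthy:"]:
                healthy.append(parts[1])
    return healthy, seen


def quorum_survives_removal(output: str, member_id: str) -> bool:
    """True if the members left after removing member_id still have a quorum."""
    healthy, seen = parse_health(output)
    if member_id not in seen:
        raise ValueError(f"unknown member: {member_id}")
    remaining_healthy = [m for m in healthy if m != member_id]
    return len(remaining_healthy) >= quorum(len(seen) - 1)


if __name__ == "__main__":
    # 367f7076aea55538 is ml-etcd2002 in the output above.
    print(quorum_survives_removal(HEALTH, "367f7076aea55538"))
```

With three healthy members, removing one leaves a two-member cluster whose quorum is two, so both remaining members must be healthy for the cluster to keep accepting writes.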

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: ml-etcd2002.codfw.wmnet

  • ml-etcd2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

Change 674486 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/dns@master] Remove ml-etcd2002 SRV record for decom

https://gerrit.wikimedia.org/r/674486

Change 674486 merged by Elukey:
[operations/dns@master] Remove ml-etcd2002 SRV record for decom

https://gerrit.wikimedia.org/r/674486

elukey renamed this task from Recreate ml-etcd2001 in a different row to Recreate ml-etcd2002 in a different row. (Mar 24 2021, 8:01 AM)
elukey updated the task description.

Change 674523 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: change mac address for ml-etcd2002

https://gerrit.wikimedia.org/r/674523

Change 674523 merged by Elukey:
[operations/puppet@production] install_server: change mac address for ml-etcd2002

https://gerrit.wikimedia.org/r/674523

After some trouble, I was able to remove and re-add the etcd node:

elukey@ml-etcd2001:~$ sudo etcdctl -C https://ml-etcd2001.codfw.wmnet:2379 cluster-health
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
member 8ac1f758490b9cfc is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
cluster is healthy
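Comparing this output with the earlier `cluster-health` run shows that ml-etcd2002 now has a different member ID, which is consistent with the old member having been removed and the rebuilt VM joining as a new member. A small sketch of that comparison follows; it assumes each `member` line ends with the client URL, as in the healthy lines pasted in this task, and the function names are hypothetical.

```python
# Both samples are copied from the cluster-health outputs in this task.
BEFORE = """\
member 367f7076aea55538 is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
cluster is healthy
"""

AFTER = """\
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
member 8ac1f758490b9cfc is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
cluster is healthy
"""


def members_by_url(output: str) -> dict[str, str]:
    """Map client URL -> member ID for every line starting with 'member'."""
    members = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "member":
            members[parts[-1]] = parts[1]  # assumption: URL is the last token
    return members


def replaced_members(before: str, after: str) -> dict[str, tuple[str, str]]:
    """Return URLs present in both outputs whose member ID changed."""
    old, new = members_by_url(before), members_by_url(after)
    return {
        url: (old[url], new[url])
        for url in old.keys() & new.keys()
        if old[url] != new[url]
    }


if __name__ == "__main__":
    for url, (old_id, new_id) in replaced_members(BEFORE, AFTER).items():
        print(f"{url}: {old_id} -> {new_id}")
```

For the two outputs in this task, only https://ml-etcd2002.codfw.wmnet:2379 is reported, with the ID changing from 367f7076aea55538 to 8ac1f758490b9cfc.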

Also:

elukey@ganeti2021:~$ sudo gnt-instance list | grep ml-etcd
ml-etcd2001.codfw.wmnet         kvm        debootstrap+default ganeti2019.codfw.wmnet running      3.0G
ml-etcd2002.codfw.wmnet         kvm        debootstrap+default ganeti2014.codfw.wmnet running      3.0G
ml-etcd2003.codfw.wmnet         kvm        debootstrap+default ganeti2015.codfw.wmnet running      3.0G

For the record: I just checked, and no, it wasn't you who created the VM on the wrong Ganeti host, but me. Thanks for fixing it!