
Decommission elastic2037-2054
Closed, Resolved · Public

Description

Hosts elastic2037-2054 have reached the end of their 5-year lifespan.

Creating this ticket to track their decommissioning.

Event Timeline

Gehel triaged this task as High priority. Mar 5 2024, 10:42 AM

Change #1013398 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic-codfw: Add new master-eligibles

https://gerrit.wikimedia.org/r/1013398

Change #1013398 merged by Bking:

[operations/puppet@production] elastic-codfw: Add new master-eligibles

https://gerrit.wikimedia.org/r/1013398
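
A master-eligible node is one permitted to participate in master elections. As a rough illustration only (the exact keys depend on the Elasticsearch version, and WMF's role assignment is managed through Puppet rather than hand-edited YAML):

```yaml
# elasticsearch.yml — illustrative sketch, not the actual Puppet-managed config
# Elasticsearch 7.9+ style:
node.roles: [ master, data ]
# Older style (pre-7.9):
# node.master: true
# node.data: true
```

Adding replacement master-eligibles before removing the old ones keeps a quorum of masters available throughout the decommission.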

Change #1013401 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: move elastic2037 to insetup

https://gerrit.wikimedia.org/r/1013401

Change #1013401 merged by Bking:

[operations/puppet@production] elastic: move elastic2037 to insetup

https://gerrit.wikimedia.org/r/1013401

Mentioned in SAL (#wikimedia-operations) [2024-03-22T06:22:19Z] <ryankemper> T358882 Updating cross-cluster seeds to bring into concordance with newly added masters: ryankemper@mwmaint1002:~/elastic$ python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9643/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst

Mentioned in SAL (#wikimedia-operations) [2024-03-22T06:33:10Z] <ryankemper> T358882 Also updated cross-cluster seeds for ports 9243 and 9443. Everything should be as expected now.
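
Cross-cluster seeds are registered through the standard Elasticsearch `_cluster/settings` API under `cluster.remote.<alias>.seeds`. The `push_cross_cluster_conf.py` script referenced above is WMF-internal and its implementation is not shown here; the sketch below only illustrates the shape of the payload such a script would PUT (hostnames, ports, and the helper name are hypothetical):

```python
def build_remote_seeds_settings(seeds_by_cluster):
    """Build a _cluster/settings body registering cross-cluster remote
    seed nodes. Hypothetical helper; mirrors the documented
    cluster.remote.<alias>.seeds setting, not the internal script."""
    return {
        "persistent": {
            "cluster": {
                "remote": {
                    alias: {"seeds": seeds}
                    for alias, seeds in seeds_by_cluster.items()
                }
            }
        }
    }

# Example seed host/port — illustrative, not the real master list
payload = build_remote_seeds_settings(
    {"omega": ["elastic2055.codfw.wmnet:9500"]}
)
# payload would be PUT to e.g.
# https://search.svc.codfw.wmnet:9643/_cluster/settings
```

When masters change, these seed lists must be refreshed on every cluster that treats the changed cluster as a remote, which is why separate updates for ports 9243, 9443, and 9643 appear above.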

bking reopened this task as In Progress.
bking claimed this task.
bking lowered the priority of this task from High to Medium.
bking updated Other Assignee, added: RKemper.

Change #1014600 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: replace some masters

https://gerrit.wikimedia.org/r/1014600

Change #1014600 merged by Ryan Kemper:

[operations/puppet@production] elastic: replace some masters

https://gerrit.wikimedia.org/r/1014600

Mentioned in SAL (#wikimedia-operations) [2024-03-26T20:09:20Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: cycle some masters - ryankemper@cumin2002 - T358882

Mentioned in SAL (#wikimedia-operations) [2024-03-26T21:45:32Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: cycle some masters - ryankemper@cumin2002 - T358882

Mentioned in SAL (#wikimedia-operations) [2024-03-27T01:06:46Z] <ryankemper> T358882 Updated remote cluster seeds for new master state

Mentioned in SAL (#wikimedia-operations) [2024-03-27T01:35:23Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2037*,elastic2038*,elastic2041*,elastic2042*,elastic2045*,elastic2046*,elastic2047*,elastic2050*,elastic2051*,elastic2052*,elastic2039*,elastic2040*,elastic2043*,elastic2044*,elastic2048*,elastic2053*,elastic2054* for prepare for decom of hosts - ryankemper@cumin2002 - T358882

Mentioned in SAL (#wikimedia-operations) [2024-03-27T01:35:28Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2037*,elastic2038*,elastic2041*,elastic2042*,elastic2045*,elastic2046*,elastic2047*,elastic2050*,elastic2051*,elastic2052*,elastic2039*,elastic2040*,elastic2043*,elastic2044*,elastic2048*,elastic2053*,elastic2054* for prepare for decom of hosts - ryankemper@cumin2002 - T358882
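
Banning nodes before decommissioning drains their shards so data is rebalanced onto the remaining hosts. The `sre.elasticsearch.ban` cookbook's internals are not shown here; as a sketch of the effect, Elasticsearch's allocation-filtering setting excludes nodes by name pattern (whether the cookbook uses `persistent` or `transient` settings, and `_name` vs. `_host`, is an assumption):

```python
def build_ban_settings(hostname_patterns):
    """Build a _cluster/settings body excluding the named nodes from
    shard allocation, so their shards drain away. Sketch of the
    documented cluster.routing.allocation.exclude setting only."""
    return {
        "persistent": {
            "cluster.routing.allocation.exclude._name": ",".join(hostname_patterns)
        }
    }

body = build_ban_settings(["elastic2037*", "elastic2038*"])
```

Once the cluster reports the banned nodes hold no shards, they can be powered off safely.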

Netbox reports that elastic2037 is no longer in PuppetDB; please either decommission it or shut it down. No host should remain powered on without Puppet running for an extended period of time.

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:51:40Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 12 days, 0:00:00 on elastic2038.codfw.wmnet with reason: T358882

Mentioned in SAL (#wikimedia-operations) [2024-03-27T15:51:44Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12 days, 0:00:00 on elastic2038.codfw.wmnet with reason: T358882

Change #1015119 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elasticsearch: remove soon-to-be-decommed codfw hosts

https://gerrit.wikimedia.org/r/1015119

Change #1015119 merged by Bking:

[operations/puppet@production] elasticsearch: remove soon-to-be-decommed codfw hosts

https://gerrit.wikimedia.org/r/1015119

Change #1015123 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: decom elastic20[37-54]

https://gerrit.wikimedia.org/r/1015123

Change #1015123 merged by Bking:

[operations/puppet@production] elastic: decom elastic20[37-54]

https://gerrit.wikimedia.org/r/1015123

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: elastic[2052-2054].codfw.wmnet

  • elastic2052.codfw.wmnet (FAIL)
    • Missing DNSName in Netbox for elastic2052, unable to verify it.
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.193.3.58
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic2053.codfw.wmnet (FAIL)
    • Missing DNSName in Netbox for elastic2053, unable to verify it.
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.193.3.59
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic2054.codfw.wmnet (FAIL)
    • Missing DNSName in Netbox for elastic2054, unable to verify it.
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.193.3.60
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: elastic2037.codfw.wmnet

  • elastic2037.codfw.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

With dc-ops having closed out the decom subtask, this should be all done.