Service implementation for elastic1[084-102].eqiad.wmnet
Closed, Resolved | Public | 5 Estimated Story Points

Description

This is a ticket composed of:

elastic108[4-8] procured in T291655 and racked in T294152; these refresh hosts are replacing elastic10[48-52]

elastic1[089-102] procured in T297645 and racked in T299609; these are 14 new expansion hosts
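For anyone unsure how the bracketed host ranges above expand, here is a minimal Python sketch. It is an illustration only: `expand_hosts` is a hypothetical helper written for this note, it handles a single `[start-end]` range per pattern, and it is not the parser used by the production tooling.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracketed numeric range, e.g. 'elastic108[4-8]'.

    Hypothetical helper for illustration; supports exactly one
    '[start-end]' group and keeps the zero-padding width of 'start'.
    """
    match = re.fullmatch(r"(.*?)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range present: treat the pattern as a single hostname.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # '089' -> pad numbers to 3 digits
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


refresh_hosts = expand_hosts("elastic108[4-8]")
expansion_hosts = expand_hosts("elastic1[089-102]")
replaced_hosts = expand_hosts("elastic10[48-52]")
```

Under these assumptions, `refresh_hosts` and `replaced_hosts` each contain 5 names and `expansion_hosts` contains 14, matching the counts stated in the description.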

TODO rest of ticket description, use https://phabricator.wikimedia.org/T300943 as a template

Event Timeline

MPhamWMF triaged this task as High priority. Jun 6 2022, 3:33 PM
MPhamWMF moved this task from needs triage to Ops / SRE on the Discovery-Search board.

Starting this work now, as we need more capacity for the Bullseye upgrades.

Change 822129 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: racking info for new hosts

https://gerrit.wikimedia.org/r/822129

Here's the rack/row info for the new hosts, pulled from Netbox and corresponding to https://gerrit.wikimedia.org/r/c/operations/puppet/+/822129/:

elastic1084 A4
elastic1085 B7
elastic1086 B7
elastic1087 C7
elastic1088 C7

elastic1089 E1
elastic1090 E1

elastic1091 E2
elastic1092 E2

elastic1093 E3
elastic1094 E3
elastic1095 E3

elastic1096 F1
elastic1097 F1

elastic1098 F2
elastic1099 F2

elastic1100 F3
elastic1101 F3
elastic1102 F3

Change 822129 merged by Ryan Kemper:

[operations/puppet@production] elastic: racking info for new hosts

https://gerrit.wikimedia.org/r/822129

Change 822169 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: allocate psi vs omega for new hosts

https://gerrit.wikimedia.org/r/822169

host        row assigned_small_cluster

elastic1084 A4  Psi
elastic1085 B7  Psi
elastic1086 B7  Psi
elastic1087 C7  Psi
elastic1088 C7  Psi

elastic1089 E1  Omega
elastic1090 E1  Psi  

elastic1091 E2  Omega
elastic1092 E2  Psi  

elastic1093 E3  Omega
elastic1094 E3  Omega
elastic1095 E3  Psi  

elastic1096 F1  Omega
elastic1097 F1  Psi  

elastic1098 F2  Omega
elastic1099 F2  Psi  

elastic1100 F3  Omega
elastic1101 F3  Psi  
elastic1102 F3  Psi

Note that elastic108[4-8] are replacing elastic10[48-52], which are psi hosts; that is why the whole block was assigned to psi.
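As a quick sanity check on the allocation table above, the following Python sketch tallies hosts per small cluster, overall and per row. It is only an illustration: the table is pasted in by hand, `tally` is a hypothetical helper, and it assumes the first character of the location (e.g. the `E` in `E3`) is the row letter.

```python
from collections import Counter

# Allocation copied by hand from the table in this task.
ALLOCATION = """\
elastic1084 A4 Psi
elastic1085 B7 Psi
elastic1086 B7 Psi
elastic1087 C7 Psi
elastic1088 C7 Psi
elastic1089 E1 Omega
elastic1090 E1 Psi
elastic1091 E2 Omega
elastic1092 E2 Psi
elastic1093 E3 Omega
elastic1094 E3 Omega
elastic1095 E3 Psi
elastic1096 F1 Omega
elastic1097 F1 Psi
elastic1098 F2 Omega
elastic1099 F2 Psi
elastic1100 F3 Omega
elastic1101 F3 Psi
elastic1102 F3 Psi
"""


def tally(table: str) -> tuple[Counter, Counter]:
    """Count hosts per cluster, and per (row letter, cluster) pair."""
    per_cluster: Counter = Counter()
    per_row: Counter = Counter()
    for line in table.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        _host, location, cluster = line.split()
        cluster = cluster.lower()
        per_cluster[cluster] += 1
        # Assumption: the leading letter of the location is the row.
        per_row[(location[0], cluster)] += 1
    return per_cluster, per_row


per_cluster, per_row = tally(ALLOCATION)
for cluster, count in sorted(per_cluster.items()):
    print(f"{cluster}: {count} hosts")
for (row, cluster), count in sorted(per_row.items()):
    print(f"row {row} {cluster}: {count}")
```

With the table as listed, this gives 12 psi and 7 omega hosts. Rows A through C hold only the five psi refresh hosts, while the expansion hosts in rows E and F are split between the two clusters.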

Change 822169 merged by Ryan Kemper:

[operations/puppet@production] elastic: allocate psi vs omega for new hosts

https://gerrit.wikimedia.org/r/822169

Mentioned in SAL (#wikimedia-operations) [2022-08-10T21:09:30Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: T309810

Mentioned in SAL (#wikimedia-operations) [2022-08-10T21:09:44Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on elastic[1101-1102].eqiad.wmnet with reason: T309810

Mentioned in SAL (#wikimedia-operations) [2022-08-10T21:10:16Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on 16 hosts with reason: T309810

Mentioned in SAL (#wikimedia-operations) [2022-08-10T21:10:39Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 16 hosts with reason: T309810

Change 822173 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: fix bad copypaste

https://gerrit.wikimedia.org/r/822173

Change 822173 merged by Ryan Kemper:

[operations/puppet@production] elastic: fix bad copypaste

https://gerrit.wikimedia.org/r/822173

Mentioned in SAL (#wikimedia-operations) [2022-08-11T16:29:37Z] <bking@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T309810

Mentioned in SAL (#wikimedia-operations) [2022-08-11T16:29:52Z] <bking@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1100-1102].eqiad.wmnet with reason: T309810

MPhamWMF set the point value for this task to 5. Aug 15 2022, 3:40 PM
MPhamWMF moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

Change 823747 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: decom elastic1048

https://gerrit.wikimedia.org/r/823747

Change 823747 merged by Bking:

[operations/puppet@production] elastic: decom elastic1048

https://gerrit.wikimedia.org/r/823747

cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: elastic1048.eqiad.wmnet

  • elastic1048.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 823771 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: decom elastic10[49-50]

https://gerrit.wikimedia.org/r/823771

Change 823771 merged by Ryan Kemper:

[operations/puppet@production] elastic: decom elastic10[49-50]

https://gerrit.wikimedia.org/r/823771

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: elastic[1049-1050].eqiad.wmnet

  • elastic1049.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1050.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 823786 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: decom elastic10[51-52]

https://gerrit.wikimedia.org/r/823786

Change 823786 merged by Ryan Kemper:

[operations/puppet@production] elastic: decom elastic10[51-52]

https://gerrit.wikimedia.org/r/823786

Change 823788 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: pick new canary host

https://gerrit.wikimedia.org/r/823788

Change 823788 merged by Ryan Kemper:

[operations/puppet@production] elastic: pick new canary host

https://gerrit.wikimedia.org/r/823788

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: elastic[1051-1052].eqiad.wmnet

  • elastic1051.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • elastic1052.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Remembered I still need to create a dc-ops decom ticket for the 5 eqiad elastic refresh hosts.

Created this decom task: https://phabricator.wikimedia.org/T316728