Service implementation for wdqs10[17-21]
Closed, Resolved, Public

Description

Once this ticket is done, the net result will be two net-new wdqs-public hosts (wdqs101[8-9]); the rest are refreshes.

AC -> installation

  • wdqs10[18-21] are added to eqiad wdqs-public with wdqs102[0,1] replacing wdqs100[6,7]
  • wdqs1017 is added to eqiad wdqs-internal to replace wdqs1008

AC -> decoms

Some servers need to be decom'd per the information in https://phabricator.wikimedia.org/T342749.

  • wdqs1006 and wdqs1007 (both eqiad wdqs-public) replaced by wdqs102[0-1]
  • wdqs1008 (eqiad wdqs-internal) needs to be decom'd and replaced. It was originally supposed to be replaced by wdqs1022, but that host has already been set up as a test host for the graph split work, so wdqs1017 will replace it instead
  • wdqs10[09,10] need to be decom'd. These can be decom'd at any time.
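
The bracket notation used throughout this task (wdqs10[17-21], wdqs102[0,1], wdqs10[09,10]) can be expanded mechanically. Below is a minimal Python sketch of that expansion. The function names `expand_hosts` and `_expand_group` are hypothetical and are not part of Cumin or any Wikimedia tooling; the sketch only assumes that a bracket group holds comma-separated values and zero-padded numeric ranges.

```python
import re
from itertools import product


def _expand_group(group: str) -> list[str]:
    """Expand the inside of one bracket group, e.g. '17-21' or '09,10'."""
    values = []
    for item in group.split(","):
        if "-" in item:
            start, end = item.split("-")
            width = len(start)  # keep zero padding, e.g. '09' stays two digits
            values.extend(
                str(n).zfill(width) for n in range(int(start), int(end) + 1)
            )
        else:
            values.append(item)
    return values


def expand_hosts(pattern: str) -> list[str]:
    """Expand a host pattern such as 'wdqs10[17-21]' into full host names."""
    # re.split with one capturing group alternates literal text and group bodies.
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [
        [part] if index % 2 == 0 else _expand_group(part)
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for pattern in ("wdqs10[17-21]", "wdqs102[0,1]", "wdqs10[09,10]"):
        print(pattern, "->", ", ".join(expand_hosts(pattern)))
```

Read this way, wdqs10[18-21] covers wdqs1018 through wdqs1021, of which wdqs102[0,1] (wdqs1020 and wdqs1021) are the replacements for wdqs100[6,7].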

Event Timeline

Let's leave wdqs1021 out for now, as we need it for performance testing in T351662

bking renamed this task from Service implementation for wdqs1017-1021 to Service implementation for wdqs1017-1020. Nov 20 2023, 8:47 PM
bking updated the task description.
Gehel triaged this task as High priority. Nov 22 2023, 9:23 AM
Gehel moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.
RKemper renamed this task from Service implementation for wdqs1017-1020 to Service implementation for wdqs10[17-21]. Dec 8 2023, 7:46 PM
RKemper claimed this task.
RKemper updated the task description.

Change 982933 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: decom wdqs10[09-10]

https://gerrit.wikimedia.org/r/982933

Change 982933 merged by Ryan Kemper:

[operations/puppet@production] wdqs: decom wdqs10[09-10]

https://gerrit.wikimedia.org/r/982933

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: wdqs[1009-1010].eqiad.wmnet

  • wdqs1009.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • wdqs1010.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 984289 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: bring wdqs10[17-21] online

https://gerrit.wikimedia.org/r/984289

Change 984289 merged by Ryan Kemper:

[operations/puppet@production] wdqs: bring wdqs10[17-21] online

https://gerrit.wikimedia.org/r/984289

Mentioned in SAL (#wikimedia-operations) [2023-12-19T22:26:23Z] <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671

Mentioned in SAL (#wikimedia-operations) [2023-12-19T22:26:41Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671

Current status

New hosts have been added in puppet. Their weights have been set in pybal (more specifically, in etcd via conftool), and they are currently marked inactive while we do the data xfers. Batch #1 is about to start; batch #2 will follow once the first batch has finished:

BATCH #1

1016 -> 1017
1014 -> 1018
1013 -> 1019

BATCH #2

1014 -> 1020
1013 -> 1021
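
The two batches above are consistent with a simple rule: no source host feeds more than one destination within the same batch. The task does not state that rule explicitly, so treat it as an assumption. Under that assumption, a minimal Python sketch of the grouping looks like this (`plan_batches` is a hypothetical helper, not part of any data-transfer cookbook):

```python
def plan_batches(transfers):
    """Group (source, destination) pairs so each source appears once per batch.

    Assumption: a source host should only serve one transfer at a time.
    Each transfer goes into the first batch that does not already use its source.
    """
    batches = []
    for src, dst in transfers:
        for batch in batches:
            if all(src != used_src for used_src, _ in batch):
                batch.append((src, dst))
                break
        else:
            # Every existing batch already uses this source, so open a new one.
            batches.append([(src, dst)])
    return batches


# Transfers as listed in the status update (wdqs host numbers only).
transfers = [
    (1016, 1017),
    (1014, 1018),
    (1013, 1019),
    (1014, 1020),
    (1013, 1021),
]

if __name__ == "__main__":
    for number, batch in enumerate(plan_batches(transfers), start=1):
        print(f"BATCH #{number}")
        for src, dst in batch:
            print(f"{src} -> {dst}")
```

Because 1014 and 1013 each appear twice as a source, their second transfers fall into a second batch, which matches the split listed above.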

Mentioned in SAL (#wikimedia-operations) [2023-12-20T00:02:58Z] <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 22:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671

Mentioned in SAL (#wikimedia-operations) [2023-12-20T00:03:18Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 22:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671

Mentioned in SAL (#wikimedia-operations) [2023-12-20T06:31:12Z] <ryankemper> T351671 Pooled wdqs10[17-21]*; data xfers completed and test queries are passing on wdqs1018. Will decom related hosts tomorrow (2023-12-20)

Change 984644 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: decom wdqs100[6-8]

https://gerrit.wikimedia.org/r/984644

Change 984644 merged by Ryan Kemper:

[operations/puppet@production] wdqs: decom wdqs100[6-8]

https://gerrit.wikimedia.org/r/984644

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1002 for hosts: wdqs[1006-1008].eqiad.wmnet

  • wdqs1006.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Host steps raised exception: Cumin execution failed (exit_code=2)
  • wdqs1007.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Host steps raised exception: Cumin execution failed (exit_code=2)
  • wdqs1008.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Host steps raised exception: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

bking updated Other Assignee, added: RKemper.

After talking in the #wikimedia-sre IRC channel: I'll run the sre.network.configure-switch-interfaces cookbook myself, and Volans will then take care of the PuppetDB/DebMonitor cleanup after checking whether the decommission cookbook can be improved to handle those steps idempotently.

cookbooks.sre.hosts.decommission executed by volans@cumin1002 for hosts: wdqs1006.eqiad.wmnet

  • wdqs1006.eqiad.wmnet (FAIL)
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.65.7.87
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

@RKemper the original failure happened because homer was not yet set up on the new cumin1002 host, and this cookbook requires a fully functional homer. We've fixed that, and homer now works fine on cumin1002.
I've also sent https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/984812 to make the cookbook idempotent again; the run above was made from that patch. I'll run it for the remaining hosts shortly.
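
To illustrate what "idempotent" means for a run like this, here is a minimal Python sketch. It is not the code from the Gerrit change and does not use the real cookbook or Spicerack APIs; `Step`, `run_steps`, and the in-memory `state` dictionary are all hypothetical. The idea is that each step first checks whether its work is already done, so a rerun after a partial failure skips completed steps instead of raising.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One decommission step (hypothetical structure, for illustration only)."""
    name: str
    is_done: Callable[[], bool]   # returns True if there is nothing left to do
    apply: Callable[[], None]     # performs the step; may raise on failure


def run_steps(steps):
    """Run every step, skipping those already done and recording failures."""
    results = []
    for step in steps:
        if step.is_done():
            results.append((step.name, "already done"))
            continue
        try:
            step.apply()
            results.append((step.name, "done"))
        except Exception as exc:
            # Record the failure and keep going so later steps still run.
            results.append((step.name, f"failed: {exc}"))
    return results


# Stand-in for real host state; a real cookbook would query the actual systems.
state = {"powered_on": True, "in_puppetdb": True}


def power_off():
    state["powered_on"] = False


def remove_from_puppetdb():
    state["in_puppetdb"] = False


steps = [
    Step("power off", lambda: not state["powered_on"], power_off),
    Step("remove from PuppetDB", lambda: not state["in_puppetdb"], remove_from_puppetdb),
]

first = run_steps(steps)   # performs both steps
second = run_steps(steps)  # rerun: both steps are reported as already done

if __name__ == "__main__":
    print(first)
    print(second)
```

This mirrors lines such as "Host is already powered off" in the rerun output, where a step detects that its work was completed by the earlier run and the cookbook carries on to the remaining steps.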

cookbooks.sre.hosts.decommission executed by volans@cumin1002 for hosts: wdqs1007.eqiad.wmnet

  • wdqs1007.eqiad.wmnet (FAIL)
    • Missing DNSName in Nebox for wdqs1007, unable to verify it.
    • Missing DNS record for wdqs1007.eqiad.wmnet, the steps requiring DNS will fail.
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.65.7.88
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by volans@cumin1002 for hosts: wdqs1008.eqiad.wmnet

  • wdqs1008.eqiad.wmnet (FAIL)
    • Missing DNSName in Nebox for wdqs1008, unable to verify it.
    • Missing DNS record for wdqs1008.eqiad.wmnet, the steps requiring DNS will fail.
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.65.7.89
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

I can confirm that all hosts are active in pybal and their data is loaded. Closing...