Service implementation for wdqs10[17-21]
Closed, Resolved; Public

Assigned To

	bking

Authored By

	bking
	Nov 20 2023, 7:07 PM

Description

Once this ticket is done, the net result will be two net-new wdqs-public hosts (wdqs101[8-9]); the remaining new hosts are hardware refreshes of existing ones.

AC -> installation

wdqs10[18-21] are added to eqiad wdqs-public with wdqs102[0,1] replacing wdqs100[6,7]
wdqs1017 is added to eqiad wdqs-internal to replace wdqs1008

AC -> decoms

Some servers need to be decom'd per the information in https://phabricator.wikimedia.org/T342749.

wdqs1006 and wdqs1007 (both eqiad wdqs-public) replaced by wdqs102[0-1]
wdqs1008 (eqiad wdqs-internal) needs to be decom'd and replaced. It was originally supposed to be replaced by wdqs1022, but that host has already been set up as a test host for the graph split work, so wdqs1017 will replace it instead.
wdqs10[09,10] need to be decom'd. These can be decom'd whenever.
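
The host ranges above use an informal bracket shorthand (wdqs10[17-21], wdqs102[0,1], wdqs10[09,10]). As a reading aid, here is a minimal Python sketch that expands such patterns into individual host names. The function name `expand_hosts` and its parsing rules are assumptions made for illustration; this is not the parser used by Cumin or any other production tooling.

```python
import re
from itertools import product


def expand_hosts(pattern: str) -> list[str]:
    """Expand bracket groups such as 'wdqs10[17-21]' or 'wdqs102[0,1]'."""
    # re.split with a capturing group alternates: literal, group, literal, ...
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            choices.append([part])  # literal text outside brackets
            continue
        values = []
        for item in part.split(","):
            item = item.strip()
            if "-" in item:
                low, high = item.split("-")
                width = len(low)  # preserve zero padding, e.g. 09-10
                values.extend(
                    str(n).zfill(width) for n in range(int(low), int(high) + 1)
                )
            else:
                values.append(item)
        choices.append(values)
    return ["".join(combo) for combo in product(*choices)]


# Hosts added to and removed from eqiad wdqs-public, per the lists above.
added_public = expand_hosts("wdqs10[18-21]")
replaced_public = expand_hosts("wdqs100[6,7]")
print(added_public)
print(len(added_public) - len(replaced_public))  # net-new wdqs-public hosts: 2
```

Four hosts join wdqs-public and two leave it, which matches the "2 net-new" figure in the description.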

Details

Other Assignee: RKemper

Subject	Repo	Branch	Lines +/-
wdqs: decom wdqs100[6-8]	operations/puppet	production	+0 -14
wdqs: bring wdqs10[17-21] online	operations/puppet	production	+15 -1
wdqs: decom wdqs10[09-10]	operations/puppet	production	+23 -26

Related Objects

Status	Subtype	Assigned	Task
Resolved		bking	T351671 Service implementation for wdqs10[17-21]
Resolved	Request	VRiley-WMF	T353482 decommission wdqs10[09-10].eqiad.wmnet
Resolved	Request	VRiley-WMF	T353845 decommission wdqs100[6-8]

Event Timeline

bking created this task. Nov 20 2023, 7:07 PM

Restricted Application added a subscriber: Aklapper. Nov 20 2023, 7:07 PM

bking added a subscriber: RKemper. Nov 20 2023, 7:12 PM

Let's leave wdqs1021 out for now, as we need it for performance testing in T351662

bking renamed this task from Service implementation for wdqs1017-1021 to Service implementation for wdqs1017-1020. Nov 20 2023, 8:47 PM

bking updated the task description.

Gehel triaged this task as High priority. Nov 22 2023, 9:23 AM

Gehel moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.

Gehel moved this task from Ready for Work to Hardware refresh on the Data-Platform-SRE board. Dec 6 2023, 1:11 PM

RKemper renamed this task from Service implementation for wdqs1017-1020 to Service implementation for wdqs10[17-21]. Dec 8 2023, 7:46 PM

RKemper claimed this task.

RKemper updated the task description.

RKemper updated the task description. Dec 8 2023, 8:04 PM

Change 982933 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: decom wdqs10[09-10]

https://gerrit.wikimedia.org/r/982933

gerritbot added a project: Patch-For-Review. Dec 13 2023, 11:13 PM

Change 982933 merged by Ryan Kemper:

[operations/puppet@production] wdqs: decom wdqs10[09-10]

https://gerrit.wikimedia.org/r/982933

Maintenance_bot removed a project: Patch-For-Review. Dec 14 2023, 8:30 PM

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1001 for hosts: wdqs[1009-1010].eqiad.wmnet

wdqs1009.eqiad.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

wdqs1010.eqiad.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

RKemper updated the task description. Dec 14 2023, 8:42 PM

RKemper added a subtask: T353482: decommission wdqs10[09-10].eqiad.wmnet. Dec 14 2023, 8:45 PM

RKemper updated the task description. Dec 15 2023, 10:37 PM

Change 984289 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: bring wdqs10[17-21] online

https://gerrit.wikimedia.org/r/984289

gerritbot added a project: Patch-For-Review. Dec 19 2023, 10:12 PM

Change 984289 merged by Ryan Kemper:

[operations/puppet@production] wdqs: bring wdqs10[17-21] online

https://gerrit.wikimedia.org/r/984289

Mentioned in SAL (#wikimedia-operations) [2023-12-19T22:26:23Z] <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671

Mentioned in SAL (#wikimedia-operations) [2023-12-19T22:26:41Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671

Maintenance_bot removed a project: Patch-For-Review. Dec 19 2023, 10:30 PM

Current status

New hosts added in puppet. Their weights have been set in pybal (more specifically, etcd via conftool), and they're currently marked inactive while we do data xfers. About to kick off batch #1 shortly and then will do batch #2 after the first batch is all finished:

BATCH #1

1016 -> 1017
1014 -> 1018
1013 -> 1019

BATCH #2

1014 -> 1020
1013 -> 1021
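
The ticket does not say why the transfers are split into two batches. One plausible reading is that a source host can feed only one transfer at a time, since 1014 and 1013 each appear once per batch. Under that assumption, the following Python sketch reproduces the grouping above. The function name `plan_batches` and the one-transfer-per-source rule are illustrative assumptions, not a description of the actual transfer tooling.

```python
def plan_batches(transfers):
    """Greedily group (source, destination) pairs so that no source
    host appears more than once within the same batch."""
    batches = []
    for source, destination in transfers:
        for batch in batches:
            # Reuse the first batch in which this source is still idle.
            if all(source != used_source for used_source, _ in batch):
                batch.append((source, destination))
                break
        else:
            # Every existing batch already uses this source: open a new one.
            batches.append([(source, destination)])
    return batches


# Source -> destination pairs taken from the status update above.
transfers = [(1016, 1017), (1014, 1018), (1013, 1019), (1014, 1020), (1013, 1021)]

for number, batch in enumerate(plan_batches(transfers), start=1):
    pairs = ", ".join(f"{src} -> {dst}" for src, dst in batch)
    print(f"BATCH #{number}: {pairs}")
```

With the five transfers listed, the greedy grouping yields the same two batches as the manual plan.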

Mentioned in SAL (#wikimedia-operations) [2023-12-20T00:02:58Z] <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 22:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671

Mentioned in SAL (#wikimedia-operations) [2023-12-20T00:03:18Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 22:00:00 on wdqs[1017-1021].eqiad.wmnet with reason: bringing new wdqs hosts online T351671

Mentioned in SAL (#wikimedia-operations) [2023-12-20T06:31:12Z] <ryankemper> T351671 Pooled wdqs10[17-21]*; data xfers completed and test queries are passing on wdqs1018. Will decom related hosts tomorrow (2023-12-20)

Gehel edited projects, added Data-Platform-SRE (2023.12.01 - 2023.12.31); removed Data-Platform-SRE. Dec 20 2023, 10:47 AM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2023.12.01 - 2023.12.31) board.

RKemper updated the task description. Dec 20 2023, 10:05 PM

Change 984644 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] wdqs: decom wdqs100[6-8]

https://gerrit.wikimedia.org/r/984644

gerritbot added a project: Patch-For-Review. Dec 20 2023, 10:07 PM

Change 984644 merged by Ryan Kemper:

[operations/puppet@production] wdqs: decom wdqs100[6-8]

https://gerrit.wikimedia.org/r/984644

Decom cookbook ran: https://sal.toolforge.org/log/tSXrhYwBhuQtenzvzt4I

RKemper updated the task description. Dec 20 2023, 10:23 PM

RKemper moved this task from In Progress to Done on the Data-Platform-SRE (2023.12.01 - 2023.12.31) board.

cookbooks.sre.hosts.decommission executed by ryankemper@cumin1002 for hosts: wdqs[1006-1008].eqiad.wmnet

wdqs1006.eqiad.wmnet (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Host steps raised exception: Cumin execution failed (exit_code=2)

wdqs1007.eqiad.wmnet (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Host steps raised exception: Cumin execution failed (exit_code=2)

wdqs1008.eqiad.wmnet (FAIL)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Host steps raised exception: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

Maintenance_bot removed a project: Patch-For-Review. Dec 20 2023, 10:30 PM

bking claimed this task. Dec 20 2023, 10:52 PM

bking updated Other Assignee, added: RKemper.

After talking in the #wikimedia-sre IRC channel, I'll run the sre.network.configure-switch-interfaces cookbook myself, and then Volans will take care of the PuppetDB/DebMonitor cleanup after seeing whether the decommission cookbook can be improved to handle those steps idempotently.

cookbooks.sre.hosts.decommission executed by volans@cumin1002 for hosts: wdqs1006.eqiad.wmnet

wdqs1006.eqiad.wmnet (FAIL)
- Unable to find/resolve the mgmt DNS record, using the IP instead: 10.65.7.87
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Host is already powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

@RKemper the original failure happened because homer was not yet set up on the new cumin1002 host, and this cookbook requires a fully functional homer. We've fixed that, and homer now works fine on cumin1002.
I've also sent https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/984812 from which I've run the above to make the cookbook idempotent again. I'll run it also for the remaining hosts shortly.
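
To illustrate what "idempotent" means for a re-run like the one above, here is a minimal Python sketch. It is not the code from the linked change: the `Host` class, the `decommission` function, and the `switch_tool_works` flag are all hypothetical, and the in-memory state only stands in for real systems such as Netbox, DebMonitor and PuppetDB. The idea is that a second run tolerates steps already completed by the first, instead of failing on them.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    # Hypothetical in-memory model of a host being decommissioned.
    powered_on: bool = True
    switch_configured: bool = False
    in_debmonitor: bool = True
    in_puppetdb: bool = True
    log: list = field(default_factory=list)


def decommission(host: Host, switch_tool_works: bool) -> bool:
    """Run all steps; return True only if no step had to be skipped."""
    clean = True
    if host.powered_on:
        host.log.append("wiped disks")
        host.powered_on = False
        host.log.append("powered off")
    else:
        # Already done by an earlier run: record it and keep going.
        host.log.append("wipe skipped, host already powered off")
        clean = False
    if not host.switch_configured:
        if not switch_tool_works:
            raise RuntimeError("switch configuration failed")
        host.switch_configured = True
        host.log.append("configured switch interface")
    # Removing something that is already gone is treated as success.
    host.in_debmonitor = False
    host.in_puppetdb = False
    host.log.append("removed from DebMonitor and PuppetDB")
    return clean


host = Host()
try:
    # First run: the switch tooling is broken, so the run aborts midway.
    decommission(host, switch_tool_works=False)
except RuntimeError:
    pass
# Second run: finishes the remaining steps, but is not reported as clean
# because the wipe could no longer be performed.
result = decommission(host, switch_tool_works=True)
print(result, host.log)
```

In this sketch the second run completes the cleanup yet still returns `False`, loosely mirroring how the re-run above finished the remaining steps while still flagging the skipped wipe.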

cookbooks.sre.hosts.decommission executed by volans@cumin1002 for hosts: wdqs1007.eqiad.wmnet

wdqs1007.eqiad.wmnet (FAIL)
- Missing DNSName in Nebox for wdqs1007, unable to verify it.
- Missing DNS record for wdqs1007.eqiad.wmnet, the steps requiring DNS will fail.
- Unable to find/resolve the mgmt DNS record, using the IP instead: 10.65.7.88
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Host is already powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by volans@cumin1002 for hosts: wdqs1008.eqiad.wmnet

wdqs1008.eqiad.wmnet (FAIL)
- Missing DNSName in Nebox for wdqs1008, unable to verify it.
- Missing DNS record for wdqs1008.eqiad.wmnet, the steps requiring DNS will fail.
- Unable to find/resolve the mgmt DNS record, using the IP instead: 10.65.7.89
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Host is already powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

I can confirm that all hosts are active in pybal and their data is loaded. Closing...

VRiley-WMF closed subtask T353482: decommission wdqs10[09-10].eqiad.wmnet as Resolved. Jan 10 2024, 4:19 PM

VRiley-WMF closed subtask T353845: decommission wdqs100[6-8] as Resolved. Mar 22 2024, 5:23 PM
