
Decommission labstore100[123] and their disk shelves
Open, High, Public

Description

labstore100[123] are old WMCS storage servers that were replaced (refreshed) by labstore100[45] with T161345 and cloudstore100[89] with T186931. They also have disk shelves connected to them, and these should be decom'ed as well.

Disk Wipe Notes: Please note these are HIGH capacity systems: they have their own internal disks, and each host has 2-3 disk arrays attached.
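For illustration only, a minimal sketch of what a single-drive wipe pass might look like; the actual onsite wipe procedure and tooling are not specified in this task, and /dev/sdX is a placeholder device name:

  # Placeholder device -- verify the device name first; this destroys data.
  # One pass of random data over the whole drive, with progress output.
  sudo shred -v -n 1 /dev/sdX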

*labstore1001:*
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted; stopping partway will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host; see the sketch after these non-interrupt steps)
  • - host set to 'decommissioning' in netbox.

End non-interrupt steps.
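For reference, a rough sketch of the manual equivalents of the automated steps above (wmf-decommission-host / the decommission cookbook normally handles these; the hostname below is just an example):

  # On the puppetmaster: remove the host's cert and deactivate it in PuppetDB.
  HOST_FQDN=labstore1001.eqiad.wmnet
  sudo puppet node clean "${HOST_FQDN}"
  sudo puppet node deactivate "${HOST_FQDN}"

  # On neodymium/sarin: remove the host's debmonitor entry.
  sudo curl -X DELETE "https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN}" \
    --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key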

  • - system disks wiped (by onsite)
  • - disk shelf disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update status in netbox to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

*labstore1002:*
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted; stopping partway will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)
  • - host set to 'decommissioning' in netbox.

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - disk shelf disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update status in netbox to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

*labstore1003:*
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted; stopping partway will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)
  • - host set to 'decommissioning' in netbox.

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - disk shelf disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update status in netbox to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

Event Timeline

faidon triaged this task as Low priority. Feb 15 2018, 3:45 PM
faidon created this task.
Restricted Application added a subscriber: Aklapper. Feb 15 2018, 3:45 PM
faidon renamed this task from "Decommission labstore100[12]" to "Decommission labstore100[12] and their disk shelves". Feb 15 2018, 3:46 PM
faidon updated the task description.

I believe labstore1003 was part of the refresh for labstore1006/7, whereas labstore1004/1005 were the most direct refresh for labstore1001/1002. But labstore1006/7 did not take on all functions of labstore1003 directly -- and the scratch and maps NFS shares have grown over the last year, so they would not easily fit into the labstore1004/1005 setup at the moment either.

T186931 tracks the plan for where we offload the remaining labstore1003 rw use cases.

Labstore1001/1002 are only tangentially related, but it would be ideal, if at all possible, to keep them on hand even though they are out of warranty: at the moment they are the only setup of this capacity, in case labstore1003 dies while T186931 is in progress.

faidon updated the task description. Feb 15 2018, 9:03 PM

My apologies, this is all confusing! I corrected the task description to reflect that labstore100[12] have been replaced by labstore100[45]. I guess we can wait until labstore100[89] are procured (T186931), but in general let's please decom systems soon after we replace them in the future :)

Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. Feb 16 2018, 3:48 PM
bd808 changed the task status from Open to Stalled. Mar 10 2018, 11:42 PM

@chasemp @Bstorm What is the status of these? Can the decom process continue?

Thanks

Bstorm added a comment (edited). Aug 27 2018, 8:37 PM
labstore1001 is an unused spare system (spare::system)
labstore1002 is an unused spare system (spare::system)

They are not running the NFS service, and they don't even seem to have a mount for such a thing to share out. Considering their only real filesystem has just 3G in use, I'd say that's only the OS, and these are pretty much unused.

I know of no other odd uses of them.

[EDIT] -- see my next comment based on chat with @chasemp

Apparently they are being held for a reason, though. They are thought of as a possible backup for labstore1003 if we cannot get cloudstore1008/9 up.

So, we are waiting on T193655

The issues we've had with these new Dell systems give me pause. So far, so good, and the issues around these are different, but I'd like to see if we can actually get them in service before we get rid of these two old machines.

faidon renamed this task from "Decommission labstore100[12] and their disk shelves" to "Decommission labstore100[123] and their disk shelves". Dec 6 2018, 6:50 PM
faidon updated the task description.
faidon added a subscriber: bd808.

Per @bd808 on IRC:

labstore1003 is still in use, blocked by T209527. labstore100[12] are not in use at the moment, but serve as a backup to labstore1003 and we'd like to hold on to them until all three are ready to go.

So, do not decom just yet; wait until that task is resolved and we get the OK from cloud-services-team.

Change 481159 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove obsolete Hiera entries for labstore1001/labstore1002

https://gerrit.wikimedia.org/r/481159

Change 481159 merged by Muehlenhoff:
[operations/puppet@production] Remove obsolete Hiera entries for labstore1001/labstore1002

https://gerrit.wikimedia.org/r/481159

Holding this back until Monday in case of any data concerns, but we are now pretty much unblocked here.

Bstorm claimed this task. May 31 2019, 6:47 PM
Bstorm raised the priority of this task from Low to High.
Bstorm moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Change 513678 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: disable alerts for labstore1003 and pass to cloudstores

https://gerrit.wikimedia.org/r/513678

Bstorm changed the task status from Stalled to Open. May 31 2019, 7:31 PM

Change 513678 merged by Bstorm:
[operations/puppet@production] cloudstore: disable alerts for labstore1003 and pass to cloudstores

https://gerrit.wikimedia.org/r/513678

Change 513690 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: remove hieradata for labstore1003/misc and strip down puppet

https://gerrit.wikimedia.org/r/513690

Change 513690 merged by Bstorm:
[operations/puppet@production] labstore: remove hieradata for labstore1003/misc and strip down puppet

https://gerrit.wikimedia.org/r/513690

Change 513702 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: remove unused hiera yaml

https://gerrit.wikimedia.org/r/513702

One note for @Cmjohnson for the upcoming decom, which is apparently imminent: labstore1003-arrayN are among the handful of cases that lack an asset tag in Netbox. Last time we talked about this (1+ year ago), I believe you mentioned that the tag wasn't visible due to the way they are racked. Now that they are getting unracked, it'd be ideal to recover that asset tag and enter it in Netbox, so it's on record and stays with them while these remain in storage. Thanks!

Bstorm updated the task description. Jun 3 2019, 3:22 PM

Change 513702 merged by Bstorm:
[operations/puppet@production] labstore: remove unused hiera yaml

https://gerrit.wikimedia.org/r/513702

Change 514035 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: switch labstore1003 to the spare:system role

https://gerrit.wikimedia.org/r/514035

Change 514035 merged by Bstorm:
[operations/puppet@production] labstore: switch labstore1003 to the spare:system role

https://gerrit.wikimedia.org/r/514035

Bstorm updated the task description. Jun 3 2019, 3:53 PM
Bstorm updated the task description.
Bstorm updated the task description. Jun 3 2019, 3:58 PM
Bstorm reassigned this task from Bstorm to RobH. Jun 3 2019, 4:01 PM

I think these are ready to hand off now.

RobH updated the task description. Jun 4 2019, 4:28 PM
RobH updated the task description.

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labstore1001.eqiad.wmnet

  • labstore1001.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labstore1002.eqiad.wmnet

  • labstore1002.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labstore1003.eqiad.wmnet

  • labstore1003.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
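
For context, runs like the three above are normally started from the cumin host; a rough sketch of such an invocation follows (the exact flags and the task ID are placeholders, not taken from this task):

  # Hypothetical invocation from cumin1001; confirm the cookbook's current
  # arguments before use. T<task-id> is a placeholder.
  sudo cookbook sre.hosts.decommission labstore1001.eqiad.wmnet -t T<task-id>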
RobH added a comment. Jun 4 2019, 4:39 PM

Switch port info:

  • labstore1001: asw2-c-eqiad:ge-2/0/15
  • labstore1002: asw2-c-eqiad:ge-3/0/5
  • labstore1003: asw2-a-eqiad:ge-8/0/8

All have been set to disabled and are ready to have their descriptions removed once the systems are unracked.

Change 514343 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] remove labstore100[123] repo entries

https://gerrit.wikimedia.org/r/514343

Change 514343 merged by RobH:
[operations/puppet@production] remove labstore100[123] repo entries

https://gerrit.wikimedia.org/r/514343

Change 514350 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom labstore100[1-3] prod dns

https://gerrit.wikimedia.org/r/514350

Change 514350 merged by RobH:
[operations/dns@master] decom labstore100[1-3] prod dns

https://gerrit.wikimedia.org/r/514350

RobH updated the task description. Jun 4 2019, 5:03 PM
RobH edited projects, added decommission; removed Patch-For-Review.
RobH reassigned this task from RobH to Cmjohnson. Jun 4 2019, 5:07 PM
RobH moved this task from Backlog to pending onsite steps (eqiad) on the decommission board.
RobH added subscribers: MoritzMuehlenhoff, RobH.

For some reason this lacked the decommission tag and I didn't know about it until @MoritzMuehlenhoff pinged me about it yesterday.

Systems have been processed and are now ready for onsite wipe. Please note that the systems all have 2-3 disk shelves that also need to be wiped.