
Decommission labstore100[123] and their disk shelves
Closed, ResolvedPublic

Description

labstore100[123] are old WMCS storage servers, that were replaced (refreshed) by labstore100[45] with T161345 and cloudstore100[89] with T186931. They also have disk shelves connected to them, and these should be decom'ed as well.

Disk Wipe Notes: Please note these are HIGH capacity systems, have their own internal disks, and each host has 2-3 disk arrays.

*labstore1001:*
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps,

Steps for DC-Ops:
The following steps cannot be interrupted, as it will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)
  • - host set to 'decommissioning' in netbox.

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - disk shelf disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update status in netbox to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

*labstore1002:*
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps,

Steps for DC-Ops:
The following steps cannot be interrupted, as it will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)
  • - host set to 'decommissioning' in netbox.

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - disk shelf disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update status in netbox to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

*labstore1003:*
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps,

Steps for DC-Ops:
The following steps cannot be interrupted, as it will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)
  • - host set to 'decommissioning' in netbox.

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - disk shelf disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update status in netbox to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

labstore-spare-array:

  • - locate labstore-spare-array https://netbox.wikimedia.org/dcim/devices/1405/
  • - ensure no disks are in it, remove from racks and stack in decom pile
  • - update netbox to show offline/unracked, remove rack info and record their asset tags
  • - add to decom tracking sheet.
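
The DebMonitor cleanup step in the per-host checklists above can be sketched roughly as follows. This is only an illustration, not the wmf-decommission-host implementation: the helper names `expand_hosts` and `debmonitor_delete_cmd` are hypothetical, the `eqiad.wmnet` domain is taken from the cookbook log entries further down this task, and the script only builds and prints the curl command from the checklist rather than running it.

```python
import re
import shlex

# URL and certificate paths are copied from the checklist step.
DEBMONITOR_URL = "https://debmonitor.discovery.wmnet/hosts/"
CERT = "/etc/debmonitor/ssl/cert.pem"
KEY = "/etc/debmonitor/ssl/server.key"
# Assumption: all three hosts live in eqiad.wmnet (per the cookbook log lines).
DOMAIN = "eqiad.wmnet"


def expand_hosts(pattern):
    """Expand one bracket set, e.g. 'labstore100[123]' or 'labstore100[1-3]'."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]  # no bracket set: treat as a literal host name
    prefix, chars, suffix = match.groups()
    digits = []
    # Accept single digits ("123") and simple single-digit ranges ("1-3").
    for low, high, single in re.findall(r"(\d)-(\d)|(\d)", chars):
        if single:
            digits.append(single)
        else:
            digits.extend(str(d) for d in range(int(low), int(high) + 1))
    return [f"{prefix}{d}{suffix}" for d in digits]


def debmonitor_delete_cmd(fqdn):
    """Build (but do not run) the curl command from the checklist as an argv list."""
    return ["sudo", "curl", "-X", "DELETE", f"{DEBMONITOR_URL}{fqdn}",
            "--cert", CERT, "--key", KEY]


if __name__ == "__main__":
    for host in expand_hosts("labstore100[123]"):
        print(shlex.join(debmonitor_delete_cmd(f"{host}.{DOMAIN}")))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; on a real host this step is handled by the decommission tooling.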

Event Timeline


@chasemp @Bstorm What is the status of these? Can the decom process continue?

Thanks

Bstorm added a comment. Edited Aug 27 2018, 8:37 PM
labstore1001 is a Unused spare system (spare::system)
labstore1002 is a Unused spare system (spare::system)

They are not running the NFS service, and they don't even seem to have a mount for such a thing to share out. Considering their only real filesystem has 3G of use, I'd say that's just the OS and these are pretty unused.

I know of no other odd uses of them.

[EDIT] -- see my next comment based on chat with @chasemp

Apparently they are being held for a reason, though. They are thought of as a possible backup for labstore1003 if we cannot get cloudstore1008/9 up.

So, we are waiting on T193655

The issues we've had with these new Dell systems give me pause. So far, so good, and the issues around these are different, but I'd like to see if we can actually get them in service before we get rid of these two old machines.

faidon renamed this task from Decommission labstore100[12] and their disk shelves to Decommission labstore100[123] and their disk shelves. Dec 6 2018, 6:50 PM
faidon updated the task description.
faidon added a subscriber: bd808.

Per @bd808 on IRC:

labstore1003 is still in use, blocked by T209527. labstore100[12] are not in use at the moment, but serve as a backup to labstore1003 and we'd like to hold on to them until all three are ready to go.

So, do not decom just yet, but wait until that task is resolved and we get the OK from cloud-services-team.

Change 481159 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove obsolete Hiera entries for labstore1001/labstore1002

https://gerrit.wikimedia.org/r/481159

Change 481159 merged by Muehlenhoff:
[operations/puppet@production] Remove obsolete Hiera entries for labstore1001/labstore1002

https://gerrit.wikimedia.org/r/481159

Holding this back until Monday in case of any data concerns, but we are now pretty much unblocked here.

Bstorm claimed this task. May 31 2019, 6:47 PM
Bstorm raised the priority of this task from Low to High.
Bstorm moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Change 513678 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: disable alerts for labstore1003 and pass to cloudstores

https://gerrit.wikimedia.org/r/513678

Bstorm changed the task status from Stalled to Open. May 31 2019, 7:31 PM

Change 513678 merged by Bstorm:
[operations/puppet@production] cloudstore: disable alerts for labstore1003 and pass to cloudstores

https://gerrit.wikimedia.org/r/513678

Change 513690 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: remove hieradata for labstore1003/misc and strip down puppet

https://gerrit.wikimedia.org/r/513690

Change 513690 merged by Bstorm:
[operations/puppet@production] labstore: remove hieradata for labstore1003/misc and strip down puppet

https://gerrit.wikimedia.org/r/513690

Change 513702 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: remove unused hiera yaml

https://gerrit.wikimedia.org/r/513702

One note for @Cmjohnson for the upcoming decom, which is apparently imminent: labstore1003-arrayN are among the handful of cases that lack an asset tag in Netbox. Last time we talked about this (1+ year ago), I believe you mentioned that the tag wasn't visible due to the way they are racked. Now that they are getting unracked, it'd be ideal to recover that asset tag and enter it in Netbox to have it on record and keep it while these remain in storage. Thanks!

Bstorm updated the task description. Jun 3 2019, 3:22 PM

Change 513702 merged by Bstorm:
[operations/puppet@production] labstore: remove unused hiera yaml

https://gerrit.wikimedia.org/r/513702

Change 514035 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] labstore: switch labstore1003 to the spare:system role

https://gerrit.wikimedia.org/r/514035

Change 514035 merged by Bstorm:
[operations/puppet@production] labstore: switch labstore1003 to the spare:system role

https://gerrit.wikimedia.org/r/514035

Bstorm updated the task description. Jun 3 2019, 3:53 PM
Bstorm updated the task description.
Bstorm updated the task description. Jun 3 2019, 3:58 PM
Bstorm reassigned this task from Bstorm to RobH. Jun 3 2019, 4:01 PM

I think these are ready to hand off now.

RobH updated the task description. Jun 4 2019, 4:28 PM
RobH updated the task description.

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labstore1001.eqiad.wmnet

  • labstore1001.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labstore1002.eqiad.wmnet

  • labstore1002.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: labstore1003.eqiad.wmnet

  • labstore1003.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

RobH added a comment. Jun 4 2019, 4:39 PM

Switch port info:

labstore1001:asw2-c-eqiad:ge-2/0/15

labstore1002:asw2-c-eqiad:ge-3/0/5

labstore1003:asw2-a-eqiad:ge-8/0/8

all have been added to disabled, ready to have the descriptions removed when they are unracked.
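
For the later cleanup on the switches, the port notes above can be grouped per switch. The snippet below is a minimal sketch under stated assumptions: the `host:switch:interface` line format is taken from the comment above, while the function name `ports_by_switch` is hypothetical and not part of any existing tooling.

```python
from collections import defaultdict

# Port assignments exactly as noted in the comment above.
PORT_INFO = """\
labstore1001:asw2-c-eqiad:ge-2/0/15
labstore1002:asw2-c-eqiad:ge-3/0/5
labstore1003:asw2-a-eqiad:ge-8/0/8
"""


def ports_by_switch(text):
    """Parse 'host:switch:interface' lines into {switch: [(interface, host), ...]}."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip the blank lines between entries
        host, switch, interface = line.split(":")
        grouped[switch].append((interface, host))
    return dict(grouped)


if __name__ == "__main__":
    # Print one line per port, grouped by switch, as a removal worklist.
    for switch, ports in sorted(ports_by_switch(PORT_INFO).items()):
        for interface, host in ports:
            print(f"{switch}: {interface} ({host})")
```

Grouping by switch means each device only needs to be visited once when the interface descriptions are eventually removed.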

Change 514343 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] remove labstore100[123] repo entries

https://gerrit.wikimedia.org/r/514343

Change 514343 merged by RobH:
[operations/puppet@production] remove labstore100[123] repo entries

https://gerrit.wikimedia.org/r/514343

Change 514350 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom labstore100[1-3] prod dns

https://gerrit.wikimedia.org/r/514350

Change 514350 merged by RobH:
[operations/dns@master] decom labstore100[1-3] prod dns

https://gerrit.wikimedia.org/r/514350

RobH updated the task description. Jun 4 2019, 5:03 PM
RobH edited projects, added decommission-hardware; removed Patch-For-Review.
RobH reassigned this task from RobH to Cmjohnson. Jun 4 2019, 5:07 PM
RobH moved this task from Backlog to pending onsite steps (eqiad) on the decommission-hardware board.
RobH added subscribers: MoritzMuehlenhoff, RobH.

For some reason this lacked the decommission-hardware tag and I didn't know about it until @MoritzMuehlenhoff pinged me about it yesterday.

Systems have been processed and are now ready for onsite wipe. Please note that the systems all have 2-3 disk shelves that also need to be wiped.

RobH reassigned this task from Cmjohnson to Jclark-ctr. Oct 10 2019, 6:39 PM
RobH added a subscriber: Jclark-ctr.

So the labstore1003-array[123] are all causing report errors on https://netbox.wikimedia.org/extras/reports/coherence.Coherence/ section: test_malformed_asset_tags

Can @Jclark-ctr complete this decommission and wipe the disks/unrack the shelves so they no longer have a reporting error?

RobH added a comment. Oct 22 2019, 7:54 PM

IRC update with John:

These are going to take WEEKS to wipe, and are all old HDDs. Rather than tie up that much onsite time swapping disks and ensuring wipe, @Jclark-ctr asked if he can just degauss them.

I'm cool with this, since we will just unrack and pay to have the hard disks physically destroyed anyhow.

RobH added a comment. Nov 19 2019, 5:21 PM

Please note this should have included https://netbox.wikimedia.org/dcim/devices/1405/ in the decom as well.

@Jclark-ctr: I'm adding to the task description the following items:

  • - locate labstore-spare-array https://netbox.wikimedia.org/dcim/devices/1405/
  • - ensure no disks are in it, remove from racks and stack in decom pile
  • - update netbox to show offline/unracked and remove rack info.
  • - add to decom tracking sheet.

RobH updated the task description. Nov 19 2019, 5:22 PM
Jclark-ctr updated the task description. Nov 28 2019, 12:03 AM
faidon updated the task description. Dec 5 2019, 6:13 PM
Papaul added a subscriber: Papaul. Dec 6 2019, 12:36 AM

papaul@asw2-a-eqiad# show | compare
[edit interfaces]
-   ge-8/0/8 {
-       description labstore1003;
-   }
papaul@asw2-c-eqiad# show | compare
[edit interfaces]
-   ge-2/0/15 {
-       description labstore1001;
-   }

Papaul updated the task description. Feb 29 2020, 12:51 AM
RobH removed a subscriber: RobH. Feb 29 2020, 12:51 AM
Papaul updated the task description. Feb 29 2020, 1:01 AM
Papaul added a subscriber: RobH.

@Jclark-ctr once you are done with this task, you can resolve it. I checked: the switch part is done, and mgmt DNS is done as well. Thanks

RobH removed a subscriber: RobH. Mar 3 2020, 6:00 PM
RobH removed a project: DC-Ops. Apr 1 2020, 5:06 PM
RobH added a subscriber: RobH.
RobH updated the task description. Apr 1 2020, 5:33 PM
RobH removed subscribers: RobH, Papaul.
Cmjohnson closed this task as Resolved. May 13 2020, 5:47 PM
Cmjohnson updated the task description.

Removed all of these from the racks, resolving this task.