
Decommission labstore100[123] and their disk shelves
Open, Stalled, Low · Public

Description

labstore100[123] are old WMCS storage servers that were replaced (refreshed) by labstore100[45] in T161345 and cloudstore100[89] in T186931. They also have disk shelves connected to them, and these should be decom'ed as well.
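For readers unfamiliar with the bracket shorthand used throughout this task, here is a minimal Python sketch of how a name like labstore100[123] expands into individual hosts. The helper name `expand_hosts` is hypothetical and not part of any tooling mentioned here; it assumes a single bracket group in which each character is one alternative.

```python
import re
from typing import List


def expand_hosts(pattern: str) -> List[str]:
    """Expand a shorthand such as 'labstore100[123]' into full host names.

    Assumption: at most one bracket group, and each character inside the
    brackets is one alternative (as the shorthand is used in this task).
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group: the pattern is already a single host name.
        return [pattern]
    prefix, choices, suffix = match.groups()
    return [f"{prefix}{choice}{suffix}" for choice in choices]


if __name__ == "__main__":
    for pattern in ("labstore100[123]", "labstore100[45]", "cloudstore100[89]"):
        print(pattern, "->", ", ".join(expand_hosts(pattern)))
```

So labstore100[123] covers labstore1001, labstore1002, and labstore1003.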

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role::spare::system if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate

END NON-INTERRUPTIBLE STEPS
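The block above could be turned into a per-host dry-run plan along these lines. This is only a sketch: the function name `decom_plan` and the disable message are hypothetical, the host is named as written in this task (no domain is assumed), and only the three steps that map onto standard Puppet commands get a command string. Everything else is marked manual, because the checklist does not say which tools are used.

```python
from typing import List, Tuple


def decom_plan(host: str) -> List[Tuple[str, str]]:
    """Return (step, action) pairs for the non-interruptible block, in checklist order.

    An action of "manual" means the checklist names no command for that step.
    """
    return [
        # Run on the host itself; the message text is an illustrative assumption.
        ("disable puppet on host", f'puppet agent --disable "decommission {host}"'),
        ("remove all remaining puppet references", "manual"),
        ("power down host", "manual"),
        ("disable switch port", "manual"),
        ("note switch port assignment on the task", "manual"),
        ("remove production dns entries", "manual"),
        # Typically run on the puppet master rather than on the host.
        ("puppet node clean", f"puppet node clean {host}"),
        ("puppet node deactivate", f"puppet node deactivate {host}"),
    ]


if __name__ == "__main__":
    for number, (step, action) in enumerate(decom_plan("labstore1001"), start=1):
        print(f"{number}. {step}: {action}")
```

Printing the plan before starting makes it easier to confirm that nothing in the block is skipped once the first step has been run.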

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: system added back to spares tracking (by onsite)

Event Timeline

faidon created this task. Feb 15 2018, 3:45 PM
faidon triaged this task as Low priority.
Restricted Application added a subscriber: Aklapper. Feb 15 2018, 3:45 PM
faidon renamed this task from Decommission labstore100[12] to Decommission labstore100[12] and their disk shelves. Feb 15 2018, 3:46 PM
faidon updated the task description.

I believe labstore1003 was part of the refresh for labstore1006/7, whereas labstore1004/1005 were the most direct refresh for labstore1001/1002. But labstore1006/7 did not take on all functions of labstore1003 directly, and the scratch and maps NFS shares have grown in the last year and would not easily fit into the labstore1004/1005 setup at the moment either.

T186931 is the idea for where we offload the remaining labstore1003 rw use cases.

Labstore1001/1002 are only tangentially related, but it would be ideal to keep them on hand if at all possible, even though they are out of warranty. At the moment they are the only setup of this capacity available in case labstore1003 dies while T186931 is in progress.

faidon updated the task description. Feb 15 2018, 9:03 PM

My apologies, this is all confusing! I corrected the task description to reflect that labstore100[12] have been replaced by labstore100[45]. I guess we can wait until labstore100[89] are procured (T186931), but in general let's please decom systems soon after we replace them in the future :)

Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. Feb 16 2018, 3:48 PM
bd808 changed the task status from Open to Stalled. Mar 10 2018, 11:42 PM

@chasemp @Bstorm What is the status of these? Can the decom process continue?

Thanks

Bstorm added a comment. Edited Aug 27 2018, 8:37 PM
labstore1001 is a Unused spare system (spare::system)
labstore1002 is a Unused spare system (spare::system)

They are not running the NFS service, and they don't even seem to have a mount for such a thing to share out. Considering their only real filesystem has 3G of use, I'd say that's just the OS and these are pretty unused.

I know of no other odd uses of them.

[EDIT] -- see my next comment based on chat with @chasemp

Apparently they are being held for a reason, though. They are thought of as a possible backup for labstore1003 if we cannot get cloudstore1008/9 up.

So, we are waiting on T193655

The issues we've had with these new Dell systems give me pause. So far, so good, and the issues around these are different, but I'd like to see if we can actually get them in service before we get rid of these two old machines.

faidon renamed this task from Decommission labstore100[12] and their disk shelves to Decommission labstore100[123] and their disk shelves. Dec 6 2018, 6:50 PM
faidon updated the task description.
faidon added a subscriber: bd808.

Per @bd808 on IRC:

labstore1003 is still in use, blocked by T209527. labstore100[12] are not in use at the moment, but serve as a backup to labstore1003 and we'd like to hold on to them until all three are ready to go.

So, do not decom just yet, but wait until that task is resolved and we get the OK from cloud-services-team.

Change 481159 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove obsolete Hiera entries for labstore1001/labstore1002

https://gerrit.wikimedia.org/r/481159

Change 481159 merged by Muehlenhoff:
[operations/puppet@production] Remove obsolete Hiera entries for labstore1001/labstore1002

https://gerrit.wikimedia.org/r/481159