
decom iridium
Closed, ResolvedPublic

Description

iridium was phabricator, has been replaced by phab1001.

after a grace period, start decom'ing with all the steps.. not yet though

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role::spare if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result & added to gsheet tracking.
  • - switch port configuration removed from switch once system is unracked.
  • - mgmt dns entries removed.
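
As a rough illustration only, the checklist above could be tracked as plain data. Everything in this sketch is hypothetical (the `Step` class, the `before`/`after` phase labels, and the `remaining` helper are not part of any Wikimedia tooling). The step texts are paraphrased from the description, and no non-interruptible steps are modeled because the description lists none between the START and END markers.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    phase: str          # hypothetical labels: "before" or "after" the non-interruptible block
    done: bool = False


# Step texts paraphrased from the task description; phases are an assumption.
CHECKLIST = [
    Step("all system services confirmed offline from production use", "before"),
    Step("icinga checks set to maint mode/disabled", "before"),
    Step("removed from all lvs/pybal active configuration", "before"),
    Step("service group puppet/hiera/dsh config removed", "before"),
    Step("removed from site.pp (or replaced with role::spare)", "before"),
    Step("system disks wiped (by onsite)", "after"),
    Step("system unracked and decommissioned (by onsite)", "after"),
    Step("switch port configuration removed", "after"),
    Step("mgmt dns entries removed", "after"),
]


def remaining(steps, phase=None):
    """Return the texts of unfinished steps, optionally limited to one phase."""
    return [
        step.text
        for step in steps
        if not step.done and (phase is None or step.phase == phase)
    ]
```

Marking a step with `CHECKLIST[0].done = True` and calling `remaining(CHECKLIST, "before")` would then list only the preparation steps still open.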

Event Timeline

Dzahn changed the task status from Open to Stalled. Aug 4 2017, 7:49 AM
Dzahn updated the task description.

please keep stalled for a few more days like this and don't shut down. we are a spare but don't want the disk wiped just yet, and phab-admins still have shell.. just in case

Dzahn changed the task status from Stalled to Open. Edited Aug 17 2017, 7:30 PM
Dzahn added a subscriber: demon.

@20after4 Last call before we are actually killing iridium and wiping the disk?

Dzahn removed Dzahn as the assignee of this task. Aug 21 2017, 9:16 PM

assigning it from me to pool. it can now be finalized and iridium can be shut down and wiped. I can't do all the non-interruptible steps myself due to lack of switch access.

status is: in puppet as role::spare, in DHCP and DNS. remnants in mysql grants (https://gerrit.wikimedia.org/r/369832), all else is removed

Dzahn triaged this task as Medium priority. Sep 6 2017, 5:34 PM

I rebooted this spare host for completeness wrt the Meltdown kernel update, and while it's now running the fixed kernel, sshd came up running /etc/ssh/sshd_config.phabricator instead of the regular /etc/ssh/sshd_config. iridium can still be reached via mgmt and is up for decom, so no point in debugging/fixing this IMO.

No, no point in debugging indeed. Instead it would be really nice if it could be shut down after running such a long time doing nothing.

Agreed, it would be good to perform the "non-interruptible steps" soon. Currently this adds some noise to monitoring etc.

Yes, i can't do them though because i don't have the access to disable switch ports.

ssh on this host is down, so i cannot disable puppet on the host. I'll do the remainder of the uninterruptible steps now.

its port is asw-c-eqiad:ge-4/0/21

Change 407683 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom iridium

https://gerrit.wikimedia.org/r/407683

Change 407684 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom iridium

https://gerrit.wikimedia.org/r/407684

Change 407683 merged by RobH:
[operations/dns@master] decom iridium

https://gerrit.wikimedia.org/r/407683

Change 407684 merged by RobH:
[operations/puppet@production] decom iridium

https://gerrit.wikimedia.org/r/407684

RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH added a subscriber: RobH.

Ok, this is now ready for onsite wipe.

That host has a broken sshd config (coming from Phabricator), but it's possible to login via mgmt and the root password.

Done!

Powered off, ready for wipe.

Change 421571 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns for iridium

https://gerrit.wikimedia.org/r/421571

Change 421571 abandoned by Cmjohnson:
Removing mgmt dns for iridium

Reason:
This server is meant for spares not full decom.

https://gerrit.wikimedia.org/r/421571

RobH updated the task description.

So the port that was listed and disabled for this host was not the actual system port. I just fixed it, so it'll stop calling into puppet and monitoring. Task description updated, ready for decom/wipe.

MoritzMuehlenhoff added a subscriber: Cmjohnson.

The server is still visible in Cumin:

jmm@sarin:~$ sudo cumin irid*
1 hosts will be targeted:
iridium.eqiad.wmnet
DRY-RUN mode enabled, aborting
jmm@sarin:~$

When it powered on and called into puppet, it put itself back into everything. Since I cleared it on the 23rd, I don't see it in cumin:

robh@neodymium:~$ sudo cumin irid*
No hosts found that matches the query
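
The check above can be scripted by inspecting the text cumin prints. The following is a minimal sketch, not an official cumin interface: the `host_still_targeted` helper is hypothetical, and it assumes only the two output shapes quoted in this task (an "N hosts will be targeted:" header followed by the host on the next line, or "No hosts found that matches the query"). Multi-host results, which may be printed in a folded form, are not handled.

```python
import re

# Header line as it appears in the output pasted in this task.
HEADER = re.compile(r"^(\d+) hosts will be targeted:$")


def host_still_targeted(cumin_output, hostname):
    """Return True if hostname is listed right after the 'will be targeted' header."""
    lines = [line.strip() for line in cumin_output.splitlines()]
    for index, line in enumerate(lines):
        if HEADER.match(line) and index + 1 < len(lines):
            # Assumption: the line after the header holds the host list.
            return hostname in lines[index + 1].split(",")
    # No header found, e.g. "No hosts found that matches the query".
    return False
```

Feeding it the first paste from this task would report iridium.eqiad.wmnet as still targeted, while the output from neodymium would not.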