
Decom ms-be101[345]
Closed, Resolved · Public

Description

ms-be1013.eqiad.wmnet

The first 5 steps should be completed by the service owner that is returning the server to DC-Ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove production role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on the steps below.

Steps for DC-Ops:

The following steps cannot be interrupted, as stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host; see the command sketch after this checklist)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite); update netbox with the result and set status to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.
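
For reference, the manual equivalents of the scripted parts of the checklist above look roughly like this (a sketch only, using this task's hostnames; in practice wmf-decommission-host drives the puppet and debmonitor steps):

# On the host being decommissioned (non-interrupt steps 1 and 2):
puppet agent --disable "decom T220590"
poweroff

# On a puppetmaster, once the host is down (normally handled by wmf-decommission-host):
puppet node clean ms-be1013.eqiad.wmnet
puppet node deactivate ms-be1013.eqiad.wmnet

# Debmonitor removal, exactly as given in the checklist above:
HOST_FQDN=ms-be1013.eqiad.wmnet
sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key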

ms-be1014.eqiad.wmnet

The first 5 steps should be completed by the service owner that is returning the server to DC-Ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove production role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on the steps below.

Steps for DC-Ops:

The following steps cannot be interrupted, as stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite); update netbox with the result and set status to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

ms-be1015.eqiad.wmnet

The first 5 steps should be completed by the service owner that is returning the server to DC-Ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove production role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on the steps below.

Steps for DC-Ops:

The following steps cannot be interrupted, as stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite); update netbox with the result and set status to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-04-16T12:52:30Z] <godog> swift eqiad-prod continue ms-be1013 decom - T220590

colewhite triaged this task as Medium priority. Apr 16 2019, 6:06 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-17T09:17:08Z] <godog> swift eqiad-prod continue ms-be1013 decom - T220590

Mentioned in SAL (#wikimedia-operations) [2019-04-23T12:15:45Z] <godog> swift eqiad-prod: fully decom ms-be1013 - T220590

Mentioned in SAL (#wikimedia-operations) [2019-04-24T08:29:09Z] <godog> swift eqiad-prod: start decom for ms-be101[45] - T220590

Change 506478 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software/swift-ring@master] swift eqiad-prod: continue decom ms-be101[45]

https://gerrit.wikimedia.org/r/506478

Change 506478 merged by CDanis:
[operations/software/swift-ring@master] swift eqiad-prod: continue decom ms-be101[45]

https://gerrit.wikimedia.org/r/506478

Mentioned in SAL (#wikimedia-operations) [2019-05-06T14:35:35Z] <godog> swift eqiad-prod: finish decom ms-be101[45] - T220590

Hosts are out of the swift rings now; ms-be1013 is still off the network and I'll take care of it before handover.

Some filesystems report an "Input/output error" when trying to list them:

root@ms-be1014:~# find /srv/swift-storage/sdc1
find: ‘/srv/swift-storage/sdc1’: Input/output error

And indeed that fs has reported errors in dmesg:

May 11 10:59:58 ms-be1014 kernel: [2000697.050792] XFS (sdc1): xlog_write: reservation summary:
May 11 10:59:58 ms-be1014 kernel: [2000697.050796] XFS (sdc1):   unit res    = 79332 bytes
May 11 10:59:58 ms-be1014 kernel: [2000697.050798] XFS (sdc1):   current res = -6104 bytes
May 11 10:59:58 ms-be1014 kernel: [2000697.050799] XFS (sdc1):   total reg   = 0 bytes (o/flow = 0 bytes)
May 11 10:59:58 ms-be1014 kernel: [2000697.050801] XFS (sdc1):   ophdrs      = 0 (ophdr space = 0 bytes)
May 11 10:59:58 ms-be1014 kernel: [2000697.050802] XFS (sdc1):   ophdr + reg = 0 bytes
May 11 10:59:58 ms-be1014 kernel: [2000697.050803] XFS (sdc1):   num regions = 0
May 11 10:59:58 ms-be1014 kernel: [2000697.050804] XFS (sdc1): xlog_write: reservation ran out. Need to up reservation
May 11 10:59:58 ms-be1014 kernel: [2000697.059184] XFS (sdc1): xfs_do_force_shutdown(0x2) called from line 2091 of file /build/linux-UEAD6s/linux-4.9.144/fs/xfs/xfs_log.c.  Return address = 0xffffffffc0a35d85
May 11 10:59:58 ms-be1014 kernel: [2000697.059189] XFS (sdc1): Log I/O Error Detected.  Shutting down filesystem
May 11 10:59:58 ms-be1014 kernel: [2000697.066968] XFS (sdc1): Please umount the filesystem and rectify the problem(s)

Searching for "reservation ran out. Need to up reservation" turns up some bug reports, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1092853, and it looks like not enough log space was available at some point. I tested the proposed solution (umount/remount the affected filesystems) on ms-be1015 and it seems to work: the filesystems are back, with some data on them that is being replicated:

/dev/sdc1       2.8T   28G  2.7T   1% /srv/swift-storage/sdc1
/dev/sdb1       2.8T   83G  2.7T   3% /srv/swift-storage/sdb1
/dev/sde1       2.8T   34G  2.7T   2% /srv/swift-storage/sde1
/dev/sdf1       2.8T   34G  2.7T   2% /srv/swift-storage/sdf1

Ditto on ms-be1014:

/dev/sdl1       2.8T   29G  2.7T   2% /srv/swift-storage/sdl1
/dev/sdj1       2.8T   34G  2.7T   2% /srv/swift-storage/sdj1
/dev/sdc1       2.8T   28G  2.8T   1% /srv/swift-storage/sdc1
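
For reference, the remount fix applied per affected filesystem was along these lines (a sketch; sdc1 is taken from the example above, and mounting by mount point assumes the usual fstab entries for the swift disks):

# Unmount and remount so XFS can replay its log and clear the shutdown state:
umount /srv/swift-storage/sdc1
mount /srv/swift-storage/sdc1
# Confirm the filesystem is accessible again:
ls /srv/swift-storage/sdc1 && df -h /srv/swift-storage/sdc1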

ms-be1014 has finished swift decom; what's left is old zero-byte quarantined files:

root@ms-be1014:~# find /srv/swift-storage/ -type f -ls
242077616      0 -rw-------   1 swift    swift           0 Feb 12  2015 /srv/swift-storage/sdf1/quarantined/objects/ad2ba865bfc51470d13ef487e664a6a9/1423720270.37474.ts
 67166182      0 -rw-r--r--   1 swift    swift           0 Apr 25 16:22 /srv/swift-storage/sdm3/quarantined/containers/372dc401b6e1b95b978ac6183c4212e3/372dc401b6e1b95b978ac6183c4212e3.db
  9019504      0 -rw-------   1 swift    swift           0 Aug 17  2015 /srv/swift-storage/sdi1/quarantined/objects/d74323f822bc8848e2f81a62e3effcb3/1439835859.39412.ts
 22909456      0 -rw-------   1 swift    swift           0 Feb 22  2016 /srv/swift-storage/sdd1/quarantined/objects/28fff0d06aad75c47ed8c6bbfe1a3b39/1456169668.86455.ts

Ditto for ms-be1015

A bunch of old/zero-byte files, plus a container database in tmp that has been replicated but left behind, as far as I can see:

root@ms-be1015:~# find /srv/swift-storage/ -type f -ls
 43951820      0 -rw-------   1 swift    swift           0 Jul  4  2016 /srv/swift-storage/sdd1/quarantined/objects/cbecbfaf083b4678a9e0961125a11617/1467596340.24874.ts
446698099      0 -rw-------   1 swift    swift           0 Feb 28  2015 /srv/swift-storage/sdh1/quarantined/objects/5734145a2237eaa7d1c519c01eb83427/1425151966.15747.ts
   913034      0 -rw-------   1 swift    swift           0 Apr  4  2017 /srv/swift-storage/sdk1/quarantined/objects/ccbb2a5f9413541708ad94c8287cb1b2/1491312115.96396.ts
    16053 1840956 -rw-------   1 swift    swift    1885138944 May  6 14:52 /srv/swift-storage/sdm3/tmp/641de3c5-de84-4f69-89c9-7280265c6fad
230760591       0 -rw-------   1 swift    swift             0 Feb 12  2015 /srv/swift-storage/sdj1/quarantined/objects/94d85af99d78cb3cbbc65298bd0a3366/1423720250.83718.ts
341531972       0 -rw-------   1 swift    swift             0 Feb 12  2015 /srv/swift-storage/sdj1/quarantined/objects/459354336644cd353295540d963ade93/1423720258.80436.ts
 49229533       0 -rw-------   1 swift    swift             0 Jul  9  2016 /srv/swift-storage/sdj1/quarantined/objects/13f3dce15f3652e37220c451e5376905/1468025106.79144.ts
 49676914       0 -rw-------   1 swift    swift             0 Jul  9  2016 /srv/swift-storage/sdl1/quarantined/objects/58895ef77cd23d8944ccb9e891c15918/1468024564.30721.ts
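
A quick way to double-check that nothing beyond the stale quarantined files and the in-flight tmp database is left behind (a hypothetical check, not something recorded in this task):

# List any remaining non-empty files under the swift storage mount points;
# only the ~1.8G tmp container database above should show up.
find /srv/swift-storage/ -type f ! -size 0 -ls
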
RobH subscribed.

Please note these show 'decommission' in netbox while they are still actively calling into puppet. They should stay active in netbox until they are added to the decommission-hardware queue and shifted to DC-Ops to decom them.

@fgiunchedi: I added in the decommission-hardware project so it's easier to find out why these are showing on the report listed here.

We should likely shift all those ms-be systems back to active in netbox.

Change 510819 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Set spares for ms-be[12]01[345]

https://gerrit.wikimedia.org/r/510819

Mentioned in SAL (#wikimedia-operations) [2019-05-17T09:27:53Z] <godog> swift remove ms-be101[345] from rings - T220590

> Please note these show 'decommission' in netbox while they are still actively calling into puppet. They should stay active in netbox until they are added to the decommission-hardware queue and shifted to DC-Ops to decom them.
>
> @fgiunchedi: I added in the decommission-hardware project so it's easier to find out why these are showing on the report listed here.
>
> We should likely shift all those ms-be systems back to active in netbox.

Indeed, I've moved them back to active!

Change 510819 merged by Filippo Giunchedi:
[operations/puppet@production] Set spares for ms-be[12]01[345]

https://gerrit.wikimedia.org/r/510819

Task updated with the checklist; hosts are now marked as spare in puppet and I've set the netbox status to decommissioning. Moving to @RobH.

Also a note re: ms-be1013: its RAID failed in T220907: Degraded RAID on ms-be1013 and I haven't been able to make it boot again. It's not worth spending more time on it, so it should be wiped and that's it.

@fgiunchedi FYI we got some email to root@ from ms-be1014 with the following:

Cron <root@ms-be1014> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )

/etc/cron.daily/logrotate:
error: skipping "/var/log/swift/background.log" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation.
error: skipping "/var/log/swift/server.log" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation.
run-parts: /etc/cron.daily/logrotate exited with return code 1

> @fgiunchedi FYI we got some email to root@ from ms-be1014 with the following:

Thanks! These are spare hosts now, so I've removed the swift logrotate config.
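
For what it's worth, a minimal sketch of that cleanup (the config file path is an assumption, it isn't stated in the task; the alternative fix would have been adding the "su" directive the error message suggests):

# Find whichever logrotate config still references the swift logs, then remove it
# (the filename below is hypothetical).
grep -rl '/var/log/swift' /etc/logrotate.d/
rm /etc/logrotate.d/swift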

I've put the state of those hosts in Netbox back to active: they are currently "active" in the spare::system role, and decommissioning should only be set once we run the decom script (it will be done automatically by the script very soon) and the host is removed from puppet completely.
I've also updated the documentation to reduce confusion:
https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle&type=revision&diff=1827408&oldid=1827206

Can we please move forward with the decom steps for at least ms-be1013? This host has been down due to hardware trouble for nearly two months (T220907) and always shows up as failing in fleet-wide Cumin runs.

@RobH I had a similar issue with cumin and ms-be1013.eqiad.wmnet; is it possible to move forward with removing it from the fleet?

asw2-d-eqiad:

ge-1/0/8   down  down  ms-be1013
ge-1/0/9   up    up    ms-be1014
ge-1/0/10  up    up    ms-be1015

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: ms-be1013.eqiad.wmnet

  • ms-be1013.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: ms-be1014.eqiad.wmnet

  • ms-be1014.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: ms-be1015.eqiad.wmnet

  • ms-be1015.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
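
For reference, each of the runs above would have been started on the cumin host with something like the following (invocation reconstructed from the output headers; the -t/--task-id flag is an assumption):

# Run the decommission cookbook for one host, linking it to this task.
sudo cookbook sre.hosts.decommission -t T220590 ms-be1013.eqiad.wmnet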

Change 520486 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom ms-be101[345] prod dns

https://gerrit.wikimedia.org/r/520486

Change 520486 merged by RobH:
[operations/dns@master] decom ms-be101[345] prod dns

https://gerrit.wikimedia.org/r/520486

Change 520487 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom ms-be101[345] puppet repo entries

https://gerrit.wikimedia.org/r/520487

Change 520487 merged by RobH:
[operations/puppet@production] decom ms-be101[345] puppet repo entries

https://gerrit.wikimedia.org/r/520487

RobH removed RobH as the assignee of this task. Jul 3 2019, 4:38 PM
RobH edited projects, added ops-eqiad; removed Patch-For-Review.
RobH moved this task from Backlog to Decommission on the ops-eqiad board.

Change 538103 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns decom host ms-be101[3-5]

https://gerrit.wikimedia.org/r/538103

Change 538103 merged by Cmjohnson:
[operations/dns@master] Removing mgmt dns decom host ms-be101[3-5]

https://gerrit.wikimedia.org/r/538103

Cmjohnson updated the task description.