
Decom ms-be101[345]
Closed, Resolved · Public

Description

ms-be1013.eqiad.wmnet

The first 5 steps should be completed by the service owner that is returning the server to DC-Ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove production role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on the steps below.

Steps for DC-Ops:

The following steps cannot be interrupted, as stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host; see the command sketch after this checklist)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite); update netbox with the result and set status to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.
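
For reference, the manual equivalents of the scripted parts of the checklist above look roughly like this (a sketch only, using this task's hostnames; in practice wmf-decommission-host drives the puppet and debmonitor steps):

# On the host being decommissioned (non-interrupt steps 1 and 2):
puppet agent --disable "decom T220590"
poweroff

# On a puppetmaster, once the host is down (normally handled by wmf-decommission-host):
puppet node clean ms-be1013.eqiad.wmnet
puppet node deactivate ms-be1013.eqiad.wmnet

# Debmonitor removal, exactly as given in the checklist above:
HOST_FQDN=ms-be1013.eqiad.wmnet
sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key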

ms-be1014.eqiad.wmnet

The first 5 steps should be completed by the service owner that is returning the server to DC-Ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove production role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on the steps below.

Steps for DC-Ops:

The following steps cannot be interrupted, as stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite); update netbox with the result and set status to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

ms-be1015.eqiad.wmnet

The first 5 steps should be completed by the service owner that is returning the server to DC-Ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove production role from site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on the steps below.

Steps for DC-Ops:

The following steps cannot be interrupted, as stopping partway will leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Decommissioning
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite); update netbox with the result and set status to offline
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2019-04-16T12:52:30Z] <godog> swift eqiad-prod continue ms-be1013 decom - T220590

colewhite triaged this task as Medium priority. Apr 16 2019, 6:06 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-17T09:17:08Z] <godog> swift eqiad-prod continue ms-be1013 decom - T220590

Mentioned in SAL (#wikimedia-operations) [2019-04-23T12:15:45Z] <godog> swift eqiad-prod: fully decom ms-be1013 - T220590

Mentioned in SAL (#wikimedia-operations) [2019-04-24T08:29:09Z] <godog> swift eqiad-prod: start decom for ms-be101[45] - T220590

Change 506478 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software/swift-ring@master] swift eqiad-prod: continue decom ms-be101[45]

https://gerrit.wikimedia.org/r/506478

Change 506478 merged by CDanis:
[operations/software/swift-ring@master] swift eqiad-prod: continue decom ms-be101[45]

https://gerrit.wikimedia.org/r/506478

Mentioned in SAL (#wikimedia-operations) [2019-05-06T14:35:35Z] <godog> swift eqiad-prod: finish decom ms-be101[45] - T220590

Hosts are out of the swift rings now; ms-be1013 is still off the network and I'll take care of it before handover.

Some filesystems report an "Input/output error" when trying to list them:

root@ms-be1014:~# find /srv/swift-storage/sdc1
find: ‘/srv/swift-storage/sdc1’: Input/output error

And indeed that fs has reported errors in dmesg:

May 11 10:59:58 ms-be1014 kernel: [2000697.050792] XFS (sdc1): xlog_write: reservation summary:
May 11 10:59:58 ms-be1014 kernel: [2000697.050796] XFS (sdc1):   unit res    = 79332 bytes
May 11 10:59:58 ms-be1014 kernel: [2000697.050798] XFS (sdc1):   current res = -6104 bytes
May 11 10:59:58 ms-be1014 kernel: [2000697.050799] XFS (sdc1):   total reg   = 0 bytes (o/flow = 0 bytes)
May 11 10:59:58 ms-be1014 kernel: [2000697.050801] XFS (sdc1):   ophdrs      = 0 (ophdr space = 0 bytes)
May 11 10:59:58 ms-be1014 kernel: [2000697.050802] XFS (sdc1):   ophdr + reg = 0 bytes
May 11 10:59:58 ms-be1014 kernel: [2000697.050803] XFS (sdc1):   num regions = 0
May 11 10:59:58 ms-be1014 kernel: [2000697.050804] XFS (sdc1): xlog_write: reservation ran out. Need to up reservation
May 11 10:59:58 ms-be1014 kernel: [2000697.059184] XFS (sdc1): xfs_do_force_shutdown(0x2) called from line 2091 of file /build/linux-UEAD6s/linux-4.9.144/fs/xfs/xfs_log.c.  Return address = 0xffffffffc0a35d85
May 11 10:59:58 ms-be1014 kernel: [2000697.059189] XFS (sdc1): Log I/O Error Detected.  Shutting down filesystem
May 11 10:59:58 ms-be1014 kernel: [2000697.066968] XFS (sdc1): Please umount the filesystem and rectify the problem(s)

Searching for "reservation ran out. Need to up reservation" turns up some bug reports, e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1092853, and it looks like not enough log space was available at some point. I tested the proposed solution (umount/remount the affected filesystems) on ms-be1015 and it seems to work: the filesystems are back, with some data on them that is being replicated:

/dev/sdc1       2.8T   28G  2.7T   1% /srv/swift-storage/sdc1
/dev/sdb1       2.8T   83G  2.7T   3% /srv/swift-storage/sdb1
/dev/sde1       2.8T   34G  2.7T   2% /srv/swift-storage/sde1
/dev/sdf1       2.8T   34G  2.7T   2% /srv/swift-storage/sdf1

Ditto on ms-be1014:

/dev/sdl1       2.8T   29G  2.7T   2% /srv/swift-storage/sdl1
/dev/sdj1       2.8T   34G  2.7T   2% /srv/swift-storage/sdj1
/dev/sdc1       2.8T   28G  2.8T   1% /srv/swift-storage/sdc1
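
For reference, the remount fix applied per affected filesystem was along these lines (a sketch; sdc1 is taken from the example above, and mounting by mount point assumes the usual fstab entries for the swift disks):

# Unmount and remount so XFS can replay its log and clear the shutdown state:
umount /srv/swift-storage/sdc1
mount /srv/swift-storage/sdc1
# Confirm the filesystem is accessible again:
ls /srv/swift-storage/sdc1 && df -h /srv/swift-storage/sdc1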

ms-be1014 has finished swift decom; what's left is old zero-byte quarantined files:

root@ms-be1014:~# find /srv/swift-storage/ -type f -ls
242077616      0 -rw-------   1 swift    swift           0 Feb 12  2015 /srv/swift-storage/sdf1/quarantined/objects/ad2ba865bfc51470d13ef487e664a6a9/1423720270.37474.ts
 67166182      0 -rw-r--r--   1 swift    swift           0 Apr 25 16:22 /srv/swift-storage/sdm3/quarantined/containers/372dc401b6e1b95b978ac6183c4212e3/372dc401b6e1b95b978ac6183c4212e3.db
  9019504      0 -rw-------   1 swift    swift           0 Aug 17  2015 /srv/swift-storage/sdi1/quarantined/objects/d74323f822bc8848e2f81a62e3effcb3/1439835859.39412.ts
 22909456      0 -rw-------   1 swift    swift           0 Feb 22  2016 /srv/swift-storage/sdd1/quarantined/objects/28fff0d06aad75c47ed8c6bbfe1a3b39/1456169668.86455.ts

Ditto for ms-be1015

A bunch of old/zero-byte files, plus a container database in tmp that has been replicated but left behind, as far as I can see:

root@ms-be1015:~# find /srv/swift-storage/ -type f -ls
 43951820      0 -rw-------   1 swift    swift           0 Jul  4  2016 /srv/swift-storage/sdd1/quarantined/objects/cbecbfaf083b4678a9e0961125a11617/1467596340.24874.ts
446698099      0 -rw-------   1 swift    swift           0 Feb 28  2015 /srv/swift-storage/sdh1/quarantined/objects/5734145a2237eaa7d1c519c01eb83427/1425151966.15747.ts
   913034      0 -rw-------   1 swift    swift           0 Apr  4  2017 /srv/swift-storage/sdk1/quarantined/objects/ccbb2a5f9413541708ad94c8287cb1b2/1491312115.96396.ts
    16053 1840956 -rw-------   1 swift    swift    1885138944 May  6 14:52 /srv/swift-storage/sdm3/tmp/641de3c5-de84-4f69-89c9-7280265c6fad
230760591       0 -rw-------   1 swift    swift             0 Feb 12  2015 /srv/swift-storage/sdj1/quarantined/objects/94d85af99d78cb3cbbc65298bd0a3366/1423720250.83718.ts
341531972       0 -rw-------   1 swift    swift             0 Feb 12  2015 /srv/swift-storage/sdj1/quarantined/objects/459354336644cd353295540d963ade93/1423720258.80436.ts
 49229533       0 -rw-------   1 swift    swift             0 Jul  9  2016 /srv/swift-storage/sdj1/quarantined/objects/13f3dce15f3652e37220c451e5376905/1468025106.79144.ts
 49676914       0 -rw-------   1 swift    swift             0 Jul  9  2016 /srv/swift-storage/sdl1/quarantined/objects/58895ef77cd23d8944ccb9e891c15918/1468024564.30721.ts
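
A quick way to double-check that nothing beyond the stale quarantined files and the in-flight tmp database is left behind (a hypothetical check, not something recorded in this task):

# List any remaining non-empty files under the swift storage mount points;
# only the ~1.8G tmp container database above should show up.
find /srv/swift-storage/ -type f ! -size 0 -ls
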
RobH subscribed.

Please note these show 'decommission' in netbox while they are still actively calling into puppet. They should stay active in netbox until they are added to the decommission-hardware queue and shifted to DC-Ops to decom them.

@fgiunchedi: I added in the decommission-hardware project so it's easier to find out why these are showing on the report listed here.

We should likely shift all those ms-be systems back to active in netbox.

Change 510819 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Set spares for ms-be[12]01[345]

https://gerrit.wikimedia.org/r/510819

Mentioned in SAL (#wikimedia-operations) [2019-05-17T09:27:53Z] <godog> swift remove ms-be101[345] from rings - T220590

> Please note these show 'decommission' in netbox while they are still actively calling into puppet. They should stay active in netbox until they are added to the decommission-hardware queue and shifted to DC-Ops to decom them.
>
> @fgiunchedi: I added in the decommission-hardware project so it's easier to find out why these are showing on the report listed here.
>
> We should likely shift all those ms-be systems back to active in netbox.

Indeed, I've moved them back to active!

Change 510819 merged by Filippo Giunchedi:
[operations/puppet@production] Set spares for ms-be[12]01[345]

https://gerrit.wikimedia.org/r/510819

Task updated with the checklist; hosts are now marked as spare in puppet and I've set the netbox status to decommissioning. Moving to @RobH.

Also a note re: ms-be1013: its RAID failed in T220907: Degraded RAID on ms-be1013 and I haven't been able to make it boot again. It's not worth spending more time on it, so it should be wiped and that's it.

@fgiunchedi FYI we got some email to root@ from ms-be1014 with the following:

Cron <root@ms-be1014> test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )

/etc/cron.daily/logrotate:
error: skipping "/var/log/swift/background.log" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation.
error: skipping "/var/log/swift/server.log" because parent directory has insecure permissions (It's world writable or writable by group which is not "root") Set "su" directive in config file to tell logrotate which user/group should be used for rotation.
run-parts: /etc/cron.daily/logrotate exited with return code 1

> @fgiunchedi FYI we got some email to root@ from ms-be1014 with the following:

Thanks! These are spare hosts now, so I've removed the swift logrotate config.
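
For what it's worth, a minimal sketch of that cleanup (the config file path is an assumption, it isn't stated in the task; the alternative fix would have been adding the "su" directive the error message suggests):

# Find whichever logrotate config still references the swift logs, then remove it
# (the filename below is hypothetical).
grep -rl '/var/log/swift' /etc/logrotate.d/
rm /etc/logrotate.d/swift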

I've put the state of those hosts in Netbox back to active: they are currently "active" in the spare::system role, and decommissioning should only be set once we run the decom script (it will be done automatically by the script very soon) and the host is removed from puppet completely.
I've also updated the documentation to reduce confusion:
https://wikitech.wikimedia.org/w/index.php?title=Server_Lifecycle&type=revision&diff=1827408&oldid=1827206

Can we please move forward with the decom steps for at least ms-be1013? This host has been down due to hardware trouble for nearly two months (T220907) and always shows up as failing in fleet-wide Cumin runs.

@RobH I had a similar issue with cumin and ms-be1013.eqiad.wmnet; is it possible to move forward with removing it from the fleet?

asw2-d-eqiad:

ge-1/0/8   down  down  ms-be1013
ge-1/0/9   up    up    ms-be1014
ge-1/0/10  up    up    ms-be1015

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: ms-be1013.eqiad.wmnet

  • ms-be1013.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: ms-be1014.eqiad.wmnet

  • ms-be1014.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: ms-be1015.eqiad.wmnet

  • ms-be1015.eqiad.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor
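
For reference, each of the runs above would have been started on the cumin host with something like the following (invocation reconstructed from the output headers; the -t/--task-id flag is an assumption):

# Run the decommission cookbook for one host, linking it to this task.
sudo cookbook sre.hosts.decommission -t T220590 ms-be1013.eqiad.wmnet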

Change 520486 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom ms-be101[345] prod dns

https://gerrit.wikimedia.org/r/520486

Change 520486 merged by RobH:
[operations/dns@master] decom ms-be101[345] prod dns

https://gerrit.wikimedia.org/r/520486

Change 520487 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom ms-be101[345] puppet repo entries

https://gerrit.wikimedia.org/r/520487

Change 520487 merged by RobH:
[operations/puppet@production] decom ms-be101[345] puppet repo entries

https://gerrit.wikimedia.org/r/520487

RobH removed RobH as the assignee of this task. Jul 3 2019, 4:38 PM
RobH edited projects, added ops-eqiad; removed Patch-For-Review.
RobH moved this task from Backlog to Decommission on the ops-eqiad board.

Change 538103 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns decom host ms-be101[3-5]

https://gerrit.wikimedia.org/r/538103

Change 538103 merged by Cmjohnson:
[operations/dns@master] Removing mgmt dns decom host ms-be101[3-5]

https://gerrit.wikimedia.org/r/538103

Cmjohnson updated the task description.