
mw2092 - disk issue
Closed, Resolved (Public)

Description

mw2092 stopped working, as in:

17:48 < Krenair> mw2092.codfw.wmnet returned [70]: 01:44:12 Copying to mw2092.codfw.wmnet from mw2080.codfw.wmnet
17:48 < Krenair> !log mw2092 seems broken
17:49 < Krenair> Connection to mw2092.codfw.wmnet closed.
17:53 < ostriches> Krenair: mw2092 r/o? Hrm....
17:59 < mutante> i'm trying to connect to that mw2092 now
18:00 < mutante> mw2092 login: root
18:04 < logmsgbot> !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2092.codfw.wmnet
18:05 < mutante> !log depooled mw2092 because it had I/O errors, dev sda


ssh root@mw2092.codfw.wmnet
Linux mw2092 4.4.0-3-amd64 #1 SMP Debian 4.4.2-3+wmf7 (2016-11-04) x86_64
Debian GNU/Linux 8.5 (jessie)
mw2092 is role::mediawiki::appserver
The last Puppet run was at Wed Nov 23 01:16:17 UTC 2016 (54 minutes ago).
Debian GNU/Linux 8 auto-installed on Tue Aug 30 07:29:09 UTC 2016.
Connection to mw2092.codfw.wmnet closed.

and on mgmt:

mw2092 login: root
[1263230.537241] blk_update_request: I/O error, dev sda, sector 663148816
[1263230.544728] blk_update_request: I/O error, dev sda, sector 663148816
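The repeated `blk_update_request` lines above have a regular shape, so the failing device and sector can be pulled out of a captured console log mechanically. What follows is only a minimal Python sketch, assuming the kernel message format shown in this task; the names `parse_io_errors` and `summarize` are hypothetical helpers, not part of any existing tooling.

```python
import re

# Matches kernel lines such as:
# [1263230.537241] blk_update_request: I/O error, dev sda, sector 663148816
# Assumption: the format is exactly the one quoted in this task.
IO_ERROR_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+blk_update_request: I/O error, "
    r"dev (?P<dev>[^,\s]+), sector (?P<sector>\d+)"
)


def parse_io_errors(lines):
    """Return (timestamp, device, sector) tuples for every matching line."""
    errors = []
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            errors.append(
                (
                    float(match.group("ts")),
                    match.group("dev"),
                    int(match.group("sector")),
                )
            )
    return errors


def summarize(errors):
    """Map each device name to the sorted list of distinct failing sectors."""
    summary = {}
    for _timestamp, dev, sector in errors:
        summary.setdefault(dev, set()).add(sector)
    return {dev: sorted(sectors) for dev, sectors in summary.items()}
```

Feeding the two console lines from this task through `parse_io_errors` and then `summarize` groups them under `sda` with a single distinct sector, which matches the picture of one bad region on one disk.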

Event Timeline

When I tried to log in earlier:
-bash: /etc/bash_completion: Input/output error

This host is gone for now. I've added a 3-month scheduled downtime and disabled notifications on Icinga.
It looks like this is either a single-disk host or a two-disk host where only one disk was used, without RAID. Indeed, there is no RAID check on Icinga for it.
I'm checking the others and will follow up on a separate task.

Volans triaged this task as High priority. Nov 23 2016, 10:59 AM
Volans added a subscriber: Papaul.

Setting high because it is a broken production host. @Papaul, feel free to lower it if it is in the list of soon-to-be-decom hosts.

Set set/pooled=inactive to remove it from scap targets too
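To audit which hosts had their pooled state changed, a `logmsgbot` line like the one quoted in the description can be split into its action and selector. This is a rough Python sketch, assuming only the message shape seen in this task; `parse_conftool_log` is a hypothetical helper and not part of conftool, and the comma-separated handling of multiple selectors is an assumption.

```python
import re

# Assumption: messages look like the single example in this task:
#   "... conftool action : set/pooled=no; selector: name=mw2092.codfw.wmnet"
CONFTOOL_RE = re.compile(
    r"conftool action : (?P<action>[^;]+); selector: (?P<selector>.+)$"
)


def parse_conftool_log(message):
    """Split a conftool log message into key, value, and selector dict.

    Returns None when the message does not match the assumed shape.
    """
    match = CONFTOOL_RE.search(message)
    if match is None:
        return None
    key, _, value = match.group("action").strip().partition("=")
    # Assumption: several selectors, if present, are comma-separated.
    selector = dict(
        part.strip().split("=", 1)
        for part in match.group("selector").split(",")
        if "=" in part
    )
    return {"key": key, "value": value, "selector": selector}
```

For the line in the description this yields the key `set/pooled`, the value `no`, and a selector naming `mw2092.codfw.wmnet`.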

Disk replacement and re-image complete.

De-assigning from myself. I just cleaned its conftool status; it would be better to have someone with more expertise on this cluster check it before re-adding it to production.

Papaul raised the priority of this task from High to Needs Triage. Dec 1 2016, 4:25 PM
fgiunchedi triaged this task as Medium priority. Dec 1 2016, 6:59 PM
fgiunchedi claimed this task.
fgiunchedi subscribed.

Back in service, resolving

Mentioned in SAL (#wikimedia-operations) [2017-02-27T12:00:09Z] <elukey> rebooting mw2092 due to puppet errors for mw-cgroup - T151427