Degraded RAID on mw2250
Closed, Resolved (Public)

Authored By

	ops-monitoring-bot
	Jul 1 2019, 3:17 AM

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host mw2250. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0] sdb1[1](F)
      488253440 blocks super 1.2 [2/1] [U_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

unused devices: <none>
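The snapshot above can be read mechanically: the `(F)` suffix marks a failed member and `[2/1] [U_]` means two devices are expected but only one is active. A minimal sketch of such a check follows, assuming the usual `/proc/mdstat` layout shown in this task. It is not the actual `get-raid-status-md` plugin; `parse_mdstat`, `degraded`, and `SAMPLE` are hypothetical names used only for illustration.

```python
import re

# Sample input copied from the snapshot in this task.
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0] sdb1[1](F)
      488253440 blocks super 1.2 [2/1] [U_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

unused devices: <none>
"""

# "md0 : active raid1 sda1[0] sdb1[1](F)"
ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.*)$")
# "[2/1] [U_]" -> expected devices / active devices
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")
# "sdb1[1](F)" -> device name plus optional flag such as (F)
MEMBER_RE = re.compile(r"(\w+)\[\d+\](\([A-Z]\))?")


def parse_mdstat(text):
    """Return one dict per md array found in mdstat-style text (sketch only)."""
    arrays = []
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            members = MEMBER_RE.findall(header.group(4))
            current = {
                "name": header.group(1),
                "level": header.group(3),
                "failed": [dev for dev, flag in members if flag == "(F)"],
                "expected": None,
                "active": None,
            }
            arrays.append(current)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None and current["expected"] is None:
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
    return arrays


def degraded(arrays):
    """Arrays with fewer active devices than expected."""
    return [a for a in arrays
            if a["expected"] is not None and a["active"] < a["expected"]]


if __name__ == "__main__":
    for array in degraded(parse_mdstat(SAMPLE)):
        print(array["name"], array["level"], "failed:", array["failed"])
```

On a live host the same functions could be fed the contents of `/proc/mdstat` instead of `SAMPLE`. Inactive arrays, which omit the RAID level on the header line, are not handled by this sketch.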

Related Objects

Mentioned In: T228328: 'scap pull' stopped working on appservers ?
T228482: Deploy scap 3.11.1-1
T227547: Host mw2250 is not in mediawiki-installation dsh group
Mentioned Here: T228482: Deploy scap 3.11.1-1
T227547: Host mw2250 is not in mediawiki-installation dsh group
T228328: 'scap pull' stopped working on appservers ?

Event Timeline

ops-monitoring-bot added projects: ops-codfw, SRE. Jul 1 2019, 3:17 AM

ops-monitoring-bot subscribed.

Restricted Application added a subscriber: Aklapper. Jul 1 2019, 3:17 AM

Warranty expired a month ago. Do we have any spare disks of that type around?

Papaul triaged this task as Medium priority. Jul 1 2019, 4:54 PM

jijiki added projects: serviceops, User-jijiki. Jul 1 2019, 9:22 PM

jijiki subscribed.

I talked to @MoritzMuehlenhoff on IRC about this system. We have no 500GB 2.5" SATA disks on site for replacement.
Option 1: Open a procurement task to request spare disks for mw systems that are out of warranty.
Option 2: Since we have a lot of 250GB 2.5" SATA disks on site, we can remove both 500GB disks, replace them with 2x250GB, and re-image the system.
Option 3: Decommission the system.
@MoritzMuehlenhoff will look into those options next week.

Papaul moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-codfw board. Jul 12 2019, 12:51 AM

We don't use a lot of disk space on mw servers, so let's go with option 2.

Replaced both 500GB disks with 250GB disks. All yours for re-imaging.

Dzahn claimed this task. Jul 17 2019, 10:33 PM

Dzahn mentioned this in T227547: Host mw2250 is not in mediawiki-installation dsh group. Jul 18 2019, 5:32 PM

Ran wmf-auto-reimage-host on it. The OS is freshly installed, though the first puppet run fails because it tries to run scap pull, which is currently broken (T228328).

So this host should not be repooled until either scap pull is fixed in the ticket above or we run a manual deploy to just this host.

Also, T227547 will be fixed once it gets repooled.

Dzahn changed the task status from Open to Stalled. Jul 18 2019, 6:48 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-19T00:04:30Z] <mutante> install1002 - exported indices for new scap version - copied back from buster to stretch - upgraded scap version on mw2250 - scap pull now works and starts to rsync (T228482, T228328, T226948)

the above was after "19:50 < mutante> !log built new scap version 3.11.1-1 on boron, copied to install1002, imported package with reprepro, copied from stretch to jessie and buster (T228482)"

20:08 <+icinga-wm> RECOVERY - PHP7 rendering on mw2250 is OK: HTTP OK: HTTP/1.1 200 OK - 327 bytes in 0.074 second response time

20:10 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2250.codfw.wmnet
