
Degraded RAID on mw2250
Closed, Resolved · Public

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host mw2250. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 1, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sda1[0] sdb1[1](F)
      488253440 blocks super 1.2 [2/1] [U_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

unused devices: <none>
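
For readers unfamiliar with the md status format: `[2/1]` means the array expects 2 member devices and 1 is currently active, `[U_]` shows one slot up and one missing, and the `(F)` suffix marks `sdb1` as failed. The following is a minimal sketch of reading such a snapshot programmatically. It is not the actual `get-raid-status-md` plugin; the function name `parse_mdstat` and the regular expressions are assumptions made for illustration, and the patterns only cover active arrays laid out like the one above.

```python
import re

# Hypothetical helper for illustration; NOT the real get-raid-status-md plugin.
# Matches array header lines such as "md0 : active raid1 sda1[0] sdb1[1](F)".
MD_LINE = re.compile(r"^(md\d+) : (\w+) (\w+) (.+)$")
# Matches the "[expected/active] [U_]" status at the end of the blocks line.
STATUS = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return one dict per md array found in /proc/mdstat-style text."""
    arrays = []
    current = None
    for line in text.splitlines():
        header = MD_LINE.match(line)
        if header:
            name, state, level, members = header.groups()
            current = {
                "name": name,
                "state": state,
                "level": level,
                # Members flagged (F) have been marked failed by the kernel.
                "failed": [d for d in members.split() if d.endswith("(F)")],
                "degraded": False,
            }
            arrays.append(current)
            continue
        status = STATUS.search(line)
        if status and current is not None:
            expected = int(status.group(1))
            active = int(status.group(2))
            flags = status.group(3)
            current["expected"] = expected
            current["active"] = active
            # Degraded if fewer devices are active than expected,
            # or if any slot is shown as "_" (missing).
            current["degraded"] = active < expected or "_" in flags
    return arrays


# The snapshot attached to this task, copied verbatim.
SNAPSHOT = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sda1[0] sdb1[1](F)
      488253440 blocks super 1.2 [2/1] [U_]
      bitmap: 2/4 pages [8KB], 65536KB chunk

unused devices: <none>
"""

if __name__ == "__main__":
    for array in parse_mdstat(SNAPSHOT):
        status = "degraded" if array["degraded"] else "ok"
        print(array["name"], status, array["failed"])
```

Run against the snapshot, this reports `md0` as degraded with `sdb1[1](F)` as the failed member, which matches the CRITICAL line above (Active: 1, Failed: 1).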

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 1 2019, 3:17 AM

The warranty expired a month ago. Do we have any spare disks of that type around?

Papaul triaged this task as Normal priority. · Jul 1 2019, 4:54 PM
jijiki added a subscriber: jijiki.
Papaul added a comment. · Jul 3 2019, 7:54 PM

I talked to @MoritzMuehlenhoff on IRC about this system. We have no 500GB 2.5" SATA disks on site for replacement.
Option 1: Open a procurement task to request spare disks for mw systems that are out of warranty.
Option 2: Since we have a lot of 250GB 2.5" SATA disks on site, we can remove both 500GB disks, replace them with 2x 250GB disks, and re-image the system.
Option 3: Decommission the system.
@MoritzMuehlenhoff will look into those options next week.

We don't use a lot of disk space on mw servers, so let's go with option 2.

Papaul added a subscriber: Papaul.

Replaced both 500GB disks with 250GB disks. All yours for re-imaging.

Dzahn claimed this task. · Jul 17 2019, 10:33 PM
Dzahn added a comment. · Thu, Jul 18, 6:48 PM

Ran wmf-auto-reimage-host on it. The OS is freshly installed, but the first Puppet run fails because it tries to run scap pull, which is currently broken (T228328).

So this host should not be repooled until either scap pull is fixed in the ticket above or we run a manual deploy to just this host.

Also, T227547 will be fixed once it gets repooled.

Dzahn changed the task status from Open to Stalled. · Thu, Jul 18, 6:48 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-19T00:04:30Z] <mutante> install1002 - exported indices for new scap version - copied back from buster to stretch - upgraded scap version on mw2250 - scap pull now works and starts to rsync (T228482, T228328, T226948)

the above was after "19:50 < mutante> !log built new scap version 3.11.1-1 on boron, copied to install1002, imported package with reprepro, copied from stretch to jessie and buster (T228482)"

Dzahn closed this task as Resolved. · Fri, Jul 19, 12:10 AM

20:08 <+icinga-wm> RECOVERY - PHP7 rendering on mw2250 is OK: HTTP OK: HTTP/1.1 200 OK - 327 bytes in 0.074 second response time

20:10 <+logmsgbot> !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2250.codfw.wmnet