Degraded RAID on wikikube-worker1256
Open, Medium, Public

Assigned To

Authored By

	ops-monitoring-bot
	Sat, Nov 9, 9:02 AM

Description

- Provide FQDN of system.
- If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
- Put system into a failed state in Netbox.
- Provide urgency of request, along with justification (redundancy, dependencies, etc.).
- Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help)
- Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

FQDN: wikikube-worker1256.eqiad.wmnet
Urgency: Medium, one of many wikikube nodes

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (md) was detected on host wikikube-worker1256. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-md
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sdb2[1]
      937267200 blocks super 1.2 [2/1] [_U]
      bitmap: 4/7 pages [16KB], 65536KB chunk

unused devices: <none>
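The snapshot above can be checked mechanically. The following is a minimal sketch, not the actual `get-raid-status-md` plugin: it assumes the text follows the usual `/proc/mdstat` layout, where `[2/1]` means two members expected and one active, and `[_U]` marks the missing slot with `_`. The function name `degraded_arrays` and the `SAMPLE` constant are hypothetical, introduced only for illustration.

```python
import re

# Hypothetical sample, copied from the snapshot attached to this task.
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb2[1]
      937267200 blocks super 1.2 [2/1] [_U]
      bitmap: 4/7 pages [16KB], 65536KB chunk

unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return {array_name: slot_map} for md arrays with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header lines look like "md0 : active raid1 sdb2[1]".
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status lines carry "[expected/active] [slot map]", e.g. "[2/1] [_U]".
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            expected = int(status.group(1))
            active = int(status.group(2))
            slots = status.group(3)
            if active < expected or "_" in slots:
                degraded[current] = slots
            current = None
    return degraded

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

For this snapshot the sketch reports `md0` with slot map `_U`: the first mirror member is absent and only `sdb2` remains active.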

Event Timeline

ops-monitoring-bot created this task. · Sat, Nov 9, 9:02 AM

Restricted Application added a project: DC-Ops. · Sat, Nov 9, 9:02 AM

Clement_Goubert edited projects, added serviceops; removed SRE. · Wed, Nov 13, 11:22 AM

Clement_Goubert moved this task from Incoming 🐫 to 🛠 Upgrades and Hardware on the serviceops board.

depool host wikikube-worker1256.eqiad.wmnet by cgoubert@cumin1002 with reason: Degraded RAID

Cookbook cookbooks.sre.k8s.pool-depool-node started by cgoubert@cumin1002 depool for host wikikube-worker1256.eqiad.wmnet completed:

wikikube-worker1256.eqiad.wmnet (PASS)
- Host wikikube-worker1256.eqiad.wmnet depooled from wikikube-eqiad

Icinga downtime and Alertmanager silence (ID=b845f658-b5b1-44ba-b75b-ce7430a01e60) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Degraded RAID

wikikube-worker1256.eqiad.wmnet

Host depooled and downtimed; you can replace the disk when able.

Jclark-ctr moved this task from Backlog to Hardware Failure / Troubleshoot on the ops-eqiad board. · Sat, Nov 16, 5:17 PM

Clement_Goubert assigned this task to Jclark-ctr. · Mon, Nov 18, 12:35 PM

Clement_Goubert triaged this task as Medium priority.

Clement_Goubert updated the task description.

Mentioned in SAL (#wikimedia-operations) [2024-11-18T14:16:34Z] <claime> running homer 'cr*-eqiad' 'T379454'

Opened a ticket with Dell, advised them of the I/O errors on sda, and uploaded the TSR report.

[Sat Nov  9 08:53:19 2024] blk_update_request: I/O error, dev sda, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[Sat Nov  9 08:53:19 2024] blk_update_request: I/O error, dev sda, sector 585744 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[Sat Nov  9 08:53:19 2024] md: super_written gets error=-5
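To summarize log excerpts like the one above before attaching them to a vendor ticket, the I/O errors can be counted per block device. This is only a sketch: it assumes the `blk_update_request: I/O error, dev <name>, sector <n>` wording seen in these lines, and the function name `io_errors_by_device` and the `LOG` constant are hypothetical.

```python
import re

# Hypothetical sample, copied from the kernel log lines quoted in this task.
LOG = """\
[Sat Nov  9 08:53:19 2024] blk_update_request: I/O error, dev sda, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[Sat Nov  9 08:53:19 2024] blk_update_request: I/O error, dev sda, sector 585744 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[Sat Nov  9 08:53:19 2024] md: super_written gets error=-5
"""

# Matches the device name and sector in a block-layer I/O error line.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")

def io_errors_by_device(log_text):
    """Return {device: [sectors]} for every I/O error line in the log text."""
    errors = {}
    for line in log_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            device, sector = match.group(1), int(match.group(2))
            errors.setdefault(device, []).append(sector)
    return errors

if __name__ == "__main__":
    for device, sectors in io_errors_by_device(LOG).items():
        print(f"{device}: {len(sectors)} I/O errors at sectors {sectors}")
```

On the excerpt above this yields two write errors on `sda` (sectors 0 and 585744), consistent with `sda` being the member missing from `md0`.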

Confirmed: Service Request 201149035
