SMART error (CurrentPendingSector) detected on host: cp5004
Closed, Resolved
Public

Description

This message was generated by the smartd daemon running on:

host name:  cp5004
DNS domain: eqsin.wmnet

The following warning/error was logged by the smartd daemon:

Device: /dev/sda [SAT], 2 Currently unreadable (pending) sectors

Event Timeline

Vgutierrez triaged this task as Medium priority. Mar 7 2022, 4:10 PM
Vgutierrez moved this task from Triage to Active Issues on the Traffic board.
Vgutierrez added subscribers: wiki_willy, Vgutierrez.

@wiki_willy how should we handle this HW issue on eqsin?

Hi @Vgutierrez - it's due to be refreshed towards the end of this calendar year (and will be on next FY's budget). Would you be able to go that long without repairing this server? If not, we can definitely order a replacement disk and have our dedicated contractor out there swap out the part.

Thanks,
Willy

Mentioned in SAL (#wikimedia-operations) [2022-03-07T17:20:48Z] <vgutierrez@cumin1001> START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cp5004.eqsin.wmnet with reason: HW issues see T303043

Icinga downtime set by vgutierrez@cumin1001 for 30 days, 0:00:00 1 host(s) and their services with reason: HW issues see T303043

cp5004.eqsin.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-03-07T17:20:51Z] <vgutierrez@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp5004.eqsin.wmnet with reason: HW issues see T303043

BBlack added a subscriber: BBlack.

This seems to have resolved itself. There's no current SMART error, and all disks seem present and working at a glance. We can revisit if it fails again!
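
For anyone revisiting this later, here is a rough sketch of how the pending-sector count could be re-checked from `smartctl -A` output. It assumes the usual ATA attribute table layout (attribute name in the second column, raw value in the tenth) and that `smartctl` is installed and run with sufficient privileges. The function names and the sample text are hypothetical and for illustration only; the sample is not actual output from cp5004.

```python
import subprocess
from typing import Optional

# Illustrative sample only (NOT real output from cp5004); it mirrors the
# column layout that `smartctl -A` normally prints for ATA devices.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
"""


def parse_pending_sectors(smartctl_output: str) -> Optional[int]:
    """Return the raw Current_Pending_Sector value, or None if not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Expected columns: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw_value.
        if len(fields) >= 10 and fields[1] == "Current_Pending_Sector":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_pending_sectors(device: str = "/dev/sda") -> Optional[int]:
    """Run `smartctl -A` on a device and parse the result (hypothetical helper)."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit bits for warnings too
    )
    return parse_pending_sectors(result.stdout)
```

A result of `0` would be consistent with the "no current SMART error" observation above; a non-zero count would suggest the pending sectors are still there or have come back.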