cp3043 disk failure
Closed, Resolved, Public

Description

System cp3043 alerted via email to a SMART disk failure.

This message was generated by the smartd daemon running on:

   host name:  cp3043
   DNS domain: esams.wmnet

The following warning/error was logged by the smartd daemon:

Device: /dev/sdb [SAT], 2 Currently unreadable (pending) sectors

Device info:
INTEL SSDSC2BA400G3, S/N:BTTV506304K7400HGN, WWN:5-5cd2e4-04b789ea7, FW:5DV10270, 400 GB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
Another message will be sent in 24 hours if the problem persists.
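
For triage, the warning line in a smartd message like the one above can be pulled apart mechanically. The following is a minimal sketch, assuming the line keeps the exact wording shown in this email; the function name `parse_pending_sectors` and the returned field names are hypothetical and not part of smartmontools.

```python
import re
from typing import Optional

# Matches the smartd warning wording quoted in this task, e.g.
# "Device: /dev/sdb [SAT], 2 Currently unreadable (pending) sectors"
PENDING_RE = re.compile(
    r"Device:\s+(?P<device>\S+)\s+\[(?P<type>[^\]]+)\],\s+"
    r"(?P<count>\d+)\s+Currently unreadable \(pending\) sectors"
)


def parse_pending_sectors(line: str) -> Optional[dict]:
    """Return device, device type and pending-sector count, or None if no match."""
    match = PENDING_RE.search(line)
    if match is None:
        return None
    return {
        "device": match.group("device"),
        "type": match.group("type"),
        "count": int(match.group("count")),
    }


if __name__ == "__main__":
    warning = "Device: /dev/sdb [SAT], 2 Currently unreadable (pending) sectors"
    print(parse_pending_sectors(warning))
```

The count extracted here is only what smartd reported at alert time; the current value would still need to be read on the host itself, for example with smartctl as the email suggests.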

This system (C0MYV42) is under warranty until 2018-03-04. I (@RobH) can try to self dispatch a replacement disk (not sure if it will work for Dell Netherlands), but wanted to coordinate with @mark before doing so, since it will require his onsite work.

Event Timeline

Restricted Application added a subscriber: Aklapper.
BBlack renamed this task from "cp4043 disk failure" to "cp3043 disk failure". Nov 7 2017, 4:31 PM

I have requested parts dispatch SR956320029. Once they notify me of shipment, I'll open an inbound shipment request with EvoSwitch, as well as a smart hands ticket for them to swap the SSD and ship back the defective one.

I've cleared all of this with @mark via IRC discussion (he is aware of the issue and the pending smart hands request.)

I added the new group to self dispatch as Dell support advised, but it did not allow me to send the part. I've emailed support asking for next steps.

Any update on disk replacement here?

Once we ask Dell to send the replacement part, we have 14 days until they charge us for it. It seems this would be best called in by @mark when it's ready (or we can attempt self dispatch again).

Mentioned in SAL (#wikimedia-traffic) [2018-07-04T11:03:40Z] <ema> depool cp3043 (cache_upload) for hardware maintenance T179953

cp3043 drive 2 (sdb) has been swapped for cp3048's drive 2 (sdb), as cp3048 is out of warranty and unfixable anyway. RAID1 md0 has been restored, and the server is back up and running.
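
A restored RAID1 array like md0 can be confirmed from `/proc/mdstat`, where each member slot shows as `U` (up) or `_` (missing). The sketch below is an illustration only: the helper name `md_array_healthy` and the sample text are hypothetical, and the sample is not output captured from cp3043.

```python
import re


def md_array_healthy(mdstat_text: str, array: str = "md0") -> bool:
    """Return True if `array` is listed and every member slot is up ('U')."""
    lines = mdstat_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(f"{array} :"):
            # The slot map, e.g. [UU] or [U_], is on an indented line below.
            for follow in lines[i + 1:]:
                if not follow.startswith(" "):
                    break  # left this array's stanza without finding a map
                match = re.search(r"\[([U_]+)\]", follow)
                if match:
                    return "_" not in match.group(1)
            return False
    return False


# Hypothetical sample in the usual /proc/mdstat layout (not real host output).
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      390577920 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    # On a live Linux host, pass open("/proc/mdstat").read() instead of SAMPLE.
    print(md_array_healthy(SAMPLE))
```

A `[UU]` map only says both members are present; a rebuild may still be in progress, which `/proc/mdstat` reports on a separate recovery line that this sketch does not inspect.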

Mentioned in SAL (#wikimedia-traffic) [2018-07-04T11:54:44Z] <ema> repool cp3043 after hardware maintenance T179953