Investigate disk errors on wcqs1003.eqiad.wmnet
Closed, Resolved, Public

Description

While reloading data for WCQS in T316236, we had to reboot wcqs1003.eqiad.wmnet several times before it would load the OS, and the BIOS displayed disk errors each time:

Initializing Serial ATA devices...
 Port A: Device initialization error
 Port B: MTFDDAK1T9TDT
 Port C: MTFDDAK1T9TDT
 Port D: MTFDDAK1T9TDT
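
For anyone collecting this output from a serial console log, the failing port can be picked out mechanically. The following is a minimal sketch, assuming the BIOS text has already been captured as a string; the function name `failed_ports` and the "any port line containing the word error" rule are illustrative assumptions, not part of any existing Wikimedia tooling.

```python
import re

# BIOS output as quoted in this task (assumed to be captured as plain text).
BIOS_OUTPUT = """\
Initializing Serial ATA devices...
 Port A: Device initialization error
 Port B: MTFDDAK1T9TDT
 Port C: MTFDDAK1T9TDT
 Port D: MTFDDAK1T9TDT
"""

# Matches lines such as " Port A: <status or model string>".
PORT_LINE = re.compile(r"^\s*Port\s+(\w+):\s*(.+?)\s*$")


def failed_ports(text):
    """Return the port labels whose status text mentions an error.

    Hypothetical helper: the 'error' substring check is an assumption
    based only on the single sample shown above.
    """
    failed = []
    for line in text.splitlines():
        match = PORT_LINE.match(line)
        if match and "error" in match.group(2).lower():
            failed.append(match.group(1))
    return failed


if __name__ == "__main__":
    print(failed_ports(BIOS_OUTPUT))
```

Run against the sample above, this flags only Port A; the other three ports report a model string rather than an error.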

Opening ticket to continue this investigation.

Event Timeline

Hello DC Ops,

The server is currently up and in production, but can be depooled and shut down any time. WCQS is a new service and does not yet have an SLO, so don't worry too much about disruptions.

Thanks for your help.

RobH added a project: ops-eqiad.
RobH added subscribers: Jclark-ctr, RobH.

@bking pinged in DC-ops channel asking about this:

Hey DC Ops, this isn't urgent but wondering if we need to add some tags or something? Haven't gotten any response from your team yet https://phabricator.wikimedia.org/T323380

I explained that no one had noticed or triaged this request because it wasn't tagged properly. Our hardware repair template, linked on https://phabricator.wikimedia.org/tag/dc-ops/, includes directions on how to tag the task. (The only thing missing was the ops-eqiad tag.)

I've edited this task's projects to assign it properly.

Checking the host's info, it is in warranty: https://netbox.wikimedia.org/dcim/devices/3182/

So this can go to @Jclark-ctr for him to enter a self dispatch for a replacement disk.

Dell support ticket opened. Confirmed: Service Request 158186356 was successfully submitted.

@bking I need the server depooled and shut down for hardware testing. Do you have time to assist with that today?

Icinga downtime and Alertmanager silence (ID=6ab1ce1c-f2e1-46e4-b6d5-7551d7cbe870) set by bking@cumin2002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: hardware diagnostics

wcqs1003.eqiad.wmnet

@Jclark-ctr No problem. wcqs1003 is now depooled and shut down. Reach out here or to inflatador (me) in IRC if you need anything else. Thanks for your help!

Reseated the hard drive and performed a hardware test, which showed no errors at this time. Resubmitted the TSR report to Dell.

Removed downtime and repooled WCQS, as it sounds like reseating the hard drives may have fixed it. @Jclark-ctr let us know if you hear anything else from Dell; if not, feel free to close this one out.