
db1082 hardware check
Closed, Resolved, Public

Description

Hello Chris,

db1082.eqiad.wmnet crashed yesterday and we had to power cycle it from the ILO.
So far we haven't been able to figure out why; apparently it had a kernel panic.

The RAID looks healthy at this point, but as @MoritzMuehlenhoff suggests - @Cmjohnson could you perform a hardware check to make sure the RAID controller and/or RAM is healthy?

This server is not pooled, so it would be safe to power it off if needed. Just ping me if you need me for anything.

Thanks
Manuel.

Related Objects

Status    Subtype  Assigned    Task
Resolved           Marostegui
Resolved           jcrespo

Event Timeline

Marostegui added a project: ops-eqiad.
Marostegui updated the task description.
jcrespo renamed this task from Hardware check to db1082 hardware check. Sep 14 2016, 7:35 AM

I am going to set this to high priority because it blocks putting the server back into production, which means it will lag for longer, so this is time-sensitive.

The kernel has been upgraded to 4.4.0-2 and a full-upgrade has been performed as well.

Note:

Replication was started and it went well.
The host was powered off shortly afterwards for a memtest.

Performed a memtest; the test came back with zero errors.

jcrespo claimed this task.

:-( Let's repool, note it for the future, move on.

Thanks Chris - I will close this ticket and we will keep updating the upstream.