
ocg1001 is broken
Closed, Resolved · Public

Description

ocg1001 went down. I tried to power-cycle it, but it did not come back, so I depooled it and opened this ticket.


14:14 < icinga-wm> PROBLEM - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100%
14:18 < mutante> !log powercycling ocg1001 which went down and had no console output at all
14:19 < mutante> !log ocg1001 - dead - " Exception Inside the Exception Handler
14:20 < logmsgbot> !log dzahn@neodymium conftool action : set/pooled=no; selector: name=ocg1001.eqiad.wmnet
14:20 <+stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
14:21 < mutante> !log ocg1001 - Type: General Protection Fault (13) Source: Software (UEFI0011) -  depooled
Exception Inside the Exception Handler
Type: General Protection Fault (13) Source: Software (UEFI0011)
AX=0000000000000000 BX=0000000000200800 BP=00000000CCA87160 SP=00000000CCA85F70
CX=0000000000000000 DX=0000000000000007 R8=0000000000000001 R9=0000000000000000
SI=0000000000248110 DI=00000000CCA86E10 SS=0018 CS=0038     Flags=00010246
IP=00000000C8653274

Event Timeline

Dzahn created this task. Jul 17 2017, 9:23 PM
Restricted Application added a subscriber: Aklapper. Jul 17 2017, 9:23 PM
Peachey88 updated the task description. Jul 18 2017, 1:04 PM
Joe added a subscriber: Joe. Jul 19 2017, 1:53 PM

Please note that if no one has run the script that removes the entries for ocg1001 from the cache, we are still serving failing requests, because MediaWiki will try to connect to ocg1001 directly.

The steps are described at https://wikitech.wikimedia.org/wiki/OCG#Decommissioning_a_host

Mentioned in SAL (#wikimedia-operations) [2017-07-19T13:55:18Z] <_joe_> running clear-host-cache.js for ocg1001 decommission T170886

Joe added a comment. Jul 19 2017, 2:02 PM

I ran the script and it worked fine, but we also need to add a redis slave on ocg1003: according to puppet, the redis master is on ocg1002, and with ocg1001 down we have lost its replica.

The server was not booting. I saw a hardware error in the racadm syslog that pertained to a PCIe port, although no ports are being used. I opened the server up, reseated the DIMM and CPUs, cleared the log, and powered it on. All appears normal at this time.

Joe claimed this task. Jul 21 2017, 10:15 AM
Joe triaged this task as Medium priority.
Joe added a project: User-Joe.
Joe moved this task from Backlog to Doing on the User-Joe board. Jul 21 2017, 10:17 AM
Joe added a comment. Jul 21 2017, 2:56 PM

I've put this server back into the load-balancer rotation.

Joe closed this task as Resolved. Jul 24 2017, 2:43 PM