
ocg1001 is broken
Closed, Resolved, Public


ocg1001 went down. I tried to powercycle it, but it did not come back. I depooled it and made this ticket.

14:14 < icinga-wm> PROBLEM - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100%
14:18 < mutante> !log powercycling ocg1001 which went down and had no console output at all
14:19 < mutante> !log ocg1001 - dead - " Exception Inside the Exception Handler
14:20 < logmsgbot> !log dzahn@neodymium conftool action : set/pooled=no; selector: name=ocg1001.eqiad.wmnet
14:20 <+stashbot> Logged the message at
14:21 < mutante> !log ocg1001 - Type: General Protection Fault (13) Source: Software (UEFI0011) -  depooled
Exception Inside the Exception Handler
Type: General Protection Fault (13) Source: Software (UEFI0011)
AX=0000000000000000 BX=0000000000200800 BP=00000000CCA87160 SP=00000000CCA85F70
CX=0000000000000000 DX=0000000000000007 R8=0000000000000001 R9=0000000000000000
SI=0000000000248110 DI=00000000CCA86E10 SS=0018 CS=0038     Flags=00010246

Event Timeline

Please note that if no one has run the script that removes entries corresponding to ocg1001 from the cache, we're still serving failing requests, because MediaWiki will try to connect to ocg1001 directly.

Steps are described here
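
To illustrate why stale entries keep causing failures, here is a minimal sketch of the idea, not the actual clear-host-cache.js. It assumes (hypothetically) that the cache is a simple mapping from a job key to the backend host holding the result; the function name, key names, and data layout are made up for illustration.

```python
from typing import Dict, List


def purge_host_entries(cache: Dict[str, str], dead_host: str) -> List[str]:
    """Delete every cache entry that points at dead_host and return the removed keys."""
    # Collect keys first so we do not mutate the dict while iterating over it.
    stale = [key for key, host in cache.items() if host == dead_host]
    for key in stale:
        del cache[key]
    return stale


# Hypothetical cache contents: job key -> backend host that holds the result.
cache = {
    "job-a": "ocg1001.eqiad.wmnet",
    "job-b": "ocg1002.eqiad.wmnet",
    "job-c": "ocg1001.eqiad.wmnet",
}

removed = purge_host_entries(cache, "ocg1001.eqiad.wmnet")
print(sorted(removed))
print(cache)
```

Until entries like these are gone, any lookup that hits one of them is sent to the dead host, which is why depooling alone is not enough.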

Mentioned in SAL (#wikimedia-operations) [2017-07-19T13:55:18Z] <_joe_> running clear-host-cache.js for ocg1001 decommission T170886

I ran the script and it worked fine, but we also need to add a redis slave on ocg1003: according to puppet, the redis master is on ocg1002, and with ocg1001 down we have lost its replica.

The server was not booting. I saw a h/w error in the racadm syslog pertaining to the PCIe ports being used. I opened it up, reseated the DIMM and CPUs, cleared the log, and powered it on. All appears normal at this time.

Joe triaged this task as Medium priority.
Joe added a project: User-Joe.

I've put this server back into rotation on the load balancer.