ocg1001 is broken
Closed, Resolved, Public

Assigned To

Authored By

	Dzahn
	Jul 17 2017, 9:23 PM

Description

ocg1001 went down. I tried to power-cycle it, but it did not come back, so I depooled it and opened this ticket.

14:14 < icinga-wm> PROBLEM - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100%
14:18 < mutante> !log powercycling ocg1001 which went down and had no console output at all
14:19 < mutante> !log ocg1001 - dead - " Exception Inside the Exception Handler
14:20 < logmsgbot> !log dzahn@neodymium conftool action : set/pooled=no; selector: name=ocg1001.eqiad.wmnet
14:20 <+stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
14:21 < mutante> !log ocg1001 - Type: General Protection Fault (13) Source: Software (UEFI0011) -  depooled

Exception Inside the Exception Handler
Type: General Protection Fault (13) Source: Software (UEFI0011)
AX=0000000000000000 BX=0000000000200800 BP=00000000CCA87160 SP=00000000CCA85F70
CX=0000000000000000 DX=0000000000000007 R8=0000000000000001 R9=0000000000000000
SI=0000000000248110 DI=00000000CCA86E10 SS=0018 CS=0038     Flags=00010246
IP=00000000C8653274
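
For anyone reading the dump above: vector 13 is the x86 General Protection Fault, and the Flags word can be decoded with the standard RFLAGS bit layout. A minimal Python sketch follows; the helper names are made up for illustration and are not part of any tool used in this incident.

```python
import re

# Register dump copied verbatim from the console output in this task.
DUMP = """
AX=0000000000000000 BX=0000000000200800 BP=00000000CCA87160 SP=00000000CCA85F70
CX=0000000000000000 DX=0000000000000007 R8=0000000000000001 R9=0000000000000000
SI=0000000000248110 DI=00000000CCA86E10 SS=0018 CS=0038     Flags=00010246
IP=00000000C8653274
"""

# Status and control bits of the x86 RFLAGS register (bit 1 is reserved, always 1).
FLAG_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
    8: "TF", 9: "IF", 10: "DF", 11: "OF", 16: "RF",
}


def parse_registers(dump):
    """Return a dict mapping register name to its integer value."""
    return {name: int(value, 16)
            for name, value in re.findall(r"(\w+)=([0-9A-Fa-f]+)", dump)}


def decode_flags(value):
    """Return the names of the RFLAGS bits that are set, lowest bit first."""
    return [name for bit, name in sorted(FLAG_BITS.items()) if (value >> bit) & 1]


registers = parse_registers(DUMP)
set_flags = decode_flags(registers["Flags"])
```

For this dump the set flags come out as PF, ZF, IF and RF, which by itself says nothing about the root cause; the fault was raised inside the UEFI firmware before any OS was running.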

Event Timeline

Dzahn created this task. Jul 17 2017, 9:23 PM

Restricted Application added a subscriber: Aklapper. Jul 17 2017, 9:23 PM

Dzahn added a project: ops-eqiad. Jul 17 2017, 9:24 PM

Peachey88 updated the task description. Jul 18 2017, 1:04 PM

ovasileva added a project: Web-Team-Backlog. Jul 19 2017, 10:20 AM

ovasileva edited projects, added Web-Team-Backlog (Tracking); removed Web-Team-Backlog.

Please note that unless someone has run the script that removes the cache entries corresponding to ocg1001, we are still serving failing requests, because MediaWiki will try to connect to ocg1001 directly.

The steps are described here: https://wikitech.wikimedia.org/wiki/OCG#Decommissioning_a_host
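
To illustrate why the stale entries matter, here is a rough Python sketch. It assumes a hypothetical in-memory cache that maps a job ID to the host that handled it. This is not the actual clear-host-cache.js logic and not OCG's real data layout; it only shows the general idea of dropping every entry that still points at the dead host.

```python
def purge_host_entries(cache, dead_host):
    """Remove every cache entry that points at dead_host.

    `cache` is a hypothetical dict of job ID -> {"host": ...}.
    Returns the sorted list of job IDs that were removed.
    """
    stale = sorted(job_id for job_id, entry in cache.items()
                   if entry.get("host") == dead_host)
    for job_id in stale:
        del cache[job_id]
    return stale


# Made-up sample data: two jobs were rendered on the host that went down.
job_cache = {
    "job-a": {"host": "ocg1001.eqiad.wmnet"},
    "job-b": {"host": "ocg1002.eqiad.wmnet"},
    "job-c": {"host": "ocg1001.eqiad.wmnet"},
}

removed = purge_host_entries(job_cache, "ocg1001.eqiad.wmnet")
```

Until such entries are gone, any lookup that hits one of them is sent to the dead host and fails, even though the host is depooled from the load balancer.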

Mentioned in SAL (#wikimedia-operations) [2017-07-19T13:55:18Z] <_joe_> running clear-host-cache.js for ocg1001 decommission T170886

I ran the script and it worked fine, but we also need to add a Redis slave on ocg1003: we have currently lost the replica of the Redis master, which according to Puppet is on ocg1002.

Cmjohnson moved this task from Backlog to High Priority Task on the ops-eqiad board. Jul 20 2017, 3:25 PM

The server was not booting. I saw a hardware error in the racadm syslog pertaining to a PCIe port, but no ports are being used. I opened the server up, reseated the DIMM and CPUs, cleared the log, and powered it on. All appears normal at this time.

Joe claimed this task. Jul 21 2017, 10:15 AM

Joe triaged this task as Medium priority.

Joe added a project: User-Joe.

Joe moved this task from Backlog to Doing on the User-Joe board. Jul 21 2017, 10:17 AM

I've put this server back into the load-balancer rotation.

Jdlrobson moved this task from Untriaged to Move to Backlog on the Web-Team-Backlog (Tracking) board. Jul 21 2017, 7:12 PM

Joe closed this task as Resolved. Jul 24 2017, 2:43 PM
