Investigate short period of ores-web-03 insanity
Closed, Resolved (Public)

Description

Times are in UTC-5 (CDT, Central Daylight Time):

[23:15:53] <icinga-wm> PROBLEM - ORES web node labs ores-web-03 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:19:27] <icinga-wm> PROBLEM - ORES worker labs on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:24:14] <halfak> ARG
[23:24:16] <halfak> WHY
[23:24:19] <icinga-wm> RECOVERY - ORES worker labs on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 443 bytes in 0.628 second response time
[23:24:58] <halfak> Hmm... looks like I can't even get the homepage to load
[23:25:27] <halfak> https://ores.wmflabs.org/node/ores-web-05/
[23:25:29] <halfak> Is up
[23:25:33] <halfak> but https://ores.wmflabs.org/node/ores-web-03/ is down
[23:25:44] <halfak> I can ssh to ores-web-03
[23:26:07] <halfak> On ores-web-03, there's one big python process.
[23:26:28] <halfak> Top line:  3338 www-data  20   0 4020768 3.123g   5600 S   7.3 80.8   1296:19 python   
[23:26:32] <halfak> 80% of memory!
[23:26:49] <halfak> It hovers around 4-8% cpu
[23:27:59] * Amir1 (uid102662@gateway/web/irccloud.com/x-duqukzbslvjcdzgc) has joined
[23:27:59] * ChanServ gives voice to Amir1
[23:28:05] <halfak> o/ Amir1
[23:28:15] <Amir1> halfak: hey
[23:28:26] <Amir1> it's morning here, why are you awake? :D
[23:28:33] <halfak> Been looking into the icinga notification.
[23:28:39] <halfak> Will get you a paste of my notes shortly.
[23:28:49] <halfak> TL;DR: ores-web-03 got into a weird state
[23:29:49] <halfak> Service restart did nothing.  Executed without error.
[23:30:11] <Amir1> okay We should look into that why our instances suddenly gets crazy
[23:30:24] <halfak> This one is really weird.
[23:30:38] <halfak> uwsgi seems to have died and been replaced with a python process.
[23:30:43] <halfak> Usually the "command" is uwsgi
[23:30:57] <halfak> https://ores.wmflabs.org/node/ores-web-03/ is back online
[23:31:11] <icinga-wm> RECOVERY - ORES web node labs ores-web-03 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 457 bytes in 0.860 second response time
[23:31:36] <halfak> And we're back!

Event Timeline

Halfak renamed this task from "Investigate show period of ores-web-03 insanity" to "Investigate short period of ores-web-03 insanity". Sep 12 2016, 4:38 AM

I'm going to claim that this investigation is enough. I'm deciding not to file an incident report because I think this is just labs being... labs. And we didn't experience downtime -- just degraded service for a little while. We should find this card in the future and reference the notes if it happens again.
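
If this does happen again, the two symptoms from the log (the process showing up as "python" rather than "uwsgi", and memory at about 80%) could be checked quickly from a pasted `top` row. Below is a minimal sketch, not something that was run during this incident. It assumes `top`'s default column layout (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND). The helper names `parse_top_line` and `looks_unhealthy` and the 50% memory threshold are hypothetical and chosen only for illustration.

```python
def parse_top_line(line):
    """Parse one process row from `top`, assuming the default column layout:
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    """
    fields = line.split()
    if len(fields) < 12:
        raise ValueError("not a default-format top process row")
    return {
        "pid": int(fields[0]),
        "user": fields[1],
        "cpu_percent": float(fields[8]),
        "mem_percent": float(fields[9]),
        # COMMAND is the last column; join in case it contains spaces.
        "command": " ".join(fields[11:]),
    }


def looks_unhealthy(proc, expected_command="uwsgi", mem_threshold=50.0):
    """Return a list of reasons the process looks wrong (empty if none).

    expected_command and mem_threshold are illustrative assumptions,
    not values taken from the ORES configuration.
    """
    reasons = []
    if proc["command"] != expected_command:
        reasons.append("unexpected command: " + proc["command"])
    if proc["mem_percent"] > mem_threshold:
        reasons.append("high memory: %.1f%%" % proc["mem_percent"])
    return reasons


if __name__ == "__main__":
    # The row pasted in the chat log at 23:26:28.
    row = "3338 www-data  20   0 4020768 3.123g   5600 S   7.3 80.8   1296:19 python"
    proc = parse_top_line(row)
    for reason in looks_unhealthy(proc):
        print(reason)
```

Run against the row from the log, this flags both the unexpected command name and the memory usage. It only describes the symptoms; it says nothing about why uwsgi ended up in that state.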