Investigate web-05 downtime
Closed, Resolved (Public)

Description

Icinga has been reporting minor downtime for ores-web-05 (but not ores-web-03). Investigate what is causing this.

Times in UTC from 2016-07-28:

[01:37:11] <icinga-wm> PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:41:11] <icinga-wm> RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 338 bytes in 9.751 second response time
[01:49:12] <icinga-wm> PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:53:02] <icinga-wm> RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 339 bytes in 1.546 second response time
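To put a number on "minor", the alert lines above can be paired up into PROBLEM/RECOVERY windows. This is a rough sketch only: `outage_windows` and `LINE_RE` are names made up for illustration, the log is pasted in as a string, and all timestamps are assumed to fall on the same UTC day. The result is the time between notifications, which depends on the check interval and is not necessarily the exact time the node was unresponsive.

```python
import re
from datetime import datetime

# The four notification lines from the description, pasted verbatim.
LOG = """\
[01:37:11] <icinga-wm> PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:41:11] <icinga-wm> RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 338 bytes in 9.751 second response time
[01:49:12] <icinga-wm> PROBLEM - ORES web node labs ores-web-05 on ores.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:53:02] <icinga-wm> RECOVERY - ORES web node labs ores-web-05 on ores.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 339 bytes in 1.546 second response time
"""

# Hypothetical pattern: "[HH:MM:SS] <icinga-wm> PROBLEM|RECOVERY - ..."
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] <icinga-wm> (PROBLEM|RECOVERY) - ")


def outage_windows(log_text):
    """Return (start, end, duration) for each PROBLEM -> RECOVERY pair."""
    windows = []
    started = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip anything that is not an icinga-wm notification
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        if match.group(2) == "PROBLEM":
            # Keep the first PROBLEM if several arrive before a RECOVERY.
            if started is None:
                started = stamp
        elif started is not None:
            windows.append((started.time(), stamp.time(), stamp - started))
            started = None
    return windows


if __name__ == "__main__":
    for start, end, duration in outage_windows(LOG):
        print(f"{start} -> {end}  ({duration})")
```

For the excerpt above this gives two windows, of 4 minutes and 3 minutes 50 seconds.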

Event Timeline

Halfak triaged this task as High priority. Jul 28 2016, 2:13 PM
Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.
Halfak added a project: ORES.

I thought I'd caught some new downtime today, but it turns out it was labs' proxy.

It looks like there are some OOM issues on web-05, so I reduced the number of workers, and it seems that we're doing better. I'm calling this done.