
labs NFS slowness / high load
Closed, Resolved · Public

Description

13:31 <icinga-wm> PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
15:10 <icinga-wm> PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
15:12 <icinga-wm> RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 973513 bytes in 5.487 second response time
15:24 <icinga-wm> PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds
15:26 <icinga-wm> RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 973454 bytes in 12.739 second response time
15:39 <icinga-wm> RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]

File system operations on tool-bastion regularly hang.

Event Timeline

valhallasw updated the task description.
valhallasw raised the priority of this task from to Needs Triage.
valhallasw added a subscriber: valhallasw.
Restricted Application added subscribers: StudiesWorld, Aklapper. · Jan 2 2016, 2:40 PM
12:50 <shinken-wm> PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
12:55 <shinken-wm> RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 974411 bytes in 6.102 second response time
12:56 <icinga-wm> PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0]
13:03 <icinga-wm> RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
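
For readers unfamiliar with these alerts, here is a rough sketch of how a percentage-over-threshold check like the one quoted above might classify a window of load average samples. The thresholds 16.0 and 24.0 come from the alert text; the 50% cutoff, the function names, and the sample values are assumptions for illustration, not the actual check configuration.

```python
from typing import Optional, Sequence

WARNING_THRESHOLD = 16.0   # from the RECOVERY lines above
CRITICAL_THRESHOLD = 24.0  # from the PROBLEM lines above


def percent_above(values: Sequence[Optional[float]], threshold: float) -> float:
    """Return the percentage of non-null samples strictly above threshold."""
    points = [v for v in values if v is not None]  # Graphite may return nulls
    if not points:
        return 0.0
    return 100.0 * sum(1 for v in points if v > threshold) / len(points)


def classify(values: Sequence[Optional[float]], over_percent: float = 50.0) -> str:
    """Classify a window of samples; over_percent is an assumed cutoff."""
    if percent_above(values, CRITICAL_THRESHOLD) >= over_percent:
        return "CRITICAL"
    if percent_above(values, WARNING_THRESHOLD) >= over_percent:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical samples, not real labstore1001 data.
    print(classify([30.0, 28.5, 26.0, 25.1]))
    print(classify([10.0, 12.0, None, 8.0]))
```

Under this reading, a short spike only raises an alert once enough of the window sits above the threshold, which would match the alerts flapping during transient load.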

Which metric in Graphite (.wm.o) is used for the load average check? Graphite only seems to expose cpu.user et al.

Hmm, it is using metric => "servers.${::hostname}.loadavg.01", so that, I guess?

Ah, yes. Found them -- servers.labstore1001.loadavg.*. This is the graph of the last week:

so it does seem to be a mostly transient load issue.
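
For anyone wanting to pull the same series, a minimal sketch of building a Graphite render API URL for these metrics follows. The metric path is the one named above; the base URL is a placeholder, and the helper name and one-week window are assumptions.

```python
from urllib.parse import urlencode


def render_url(base_url: str, target: str, window: str = "-7d") -> str:
    """Build a Graphite render API URL that returns JSON datapoints."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


if __name__ == "__main__":
    # Placeholder host; substitute the real Graphite instance.
    print(render_url("https://graphite.example.org",
                     "servers.labstore1001.loadavg.*"))
```

The JSON response lists each matching series with its datapoints as [value, timestamp] pairs, which is enough to check whether the spikes are short-lived.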

chasemp closed this task as Resolved. · Feb 5 2016, 11:06 PM
chasemp claimed this task.
chasemp added a subscriber: chasemp.

I am resolving this not because it's all sunshine and roses, but because it falls under the umbrella of other ongoing work. Let me know if there is anything specific.