
SGE queues all overloaded / jobs not submitting although load averages are low
Closed, Resolved · Public

Description

After reports on IRC/mail, I tried to start two jobs, one precise (511100) and one trusty (511132). Neither gets submitted. Hosts that are practically empty:

valhallasw@tools-bastion-02:~$ ssh tools-exec-1211 uptime
 06:15:26 up 14 days, 14:48,  0 users,  load average: 0.27, 0.25, 0.30

are still reported as overloaded:

queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.810000 (= 0.060000 + 0.50 * 14.000000 with nproc=4) >= 1.75

This is probably because of that *14 multiplier -- I'm not sure where that comes from. Other hosts sometimes show other values (12 or 15).
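
For reference, the "dropped because it is overloaded" lines come from the scheduler's own diagnostics; something along these lines should reproduce the numbers above (stock gridengine commands, output not pasted here; qstat -j only shows the scheduling info if schedd_job_info is enabled in the scheduler configuration):

$ qstat -j 511100 | grep -A 5 'scheduling info'   # per-queue "dropped because ..." messages
$ qhost -h tools-exec-1211                        # plain load as qhost reports it
$ qconf -sq task | grep load_thresholds           # the 1.75 comes from the queue's load_thresholds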

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description. (Show Details)
valhallasw added a subscriber: valhallasw.
valhallasw triaged this task as Unbreak Now! priority. Sep 1 2015, 6:18 AM
valhallasw added a project: Toolforge.
valhallasw set Security to None.
valhallasw lowered the priority of this task from Unbreak Now! to High. Sep 1 2015, 6:23 AM

Restarting the gridengine master seems to have helped -- but what happened to get the queues into this state?! More fallout from the earlier NFS outage?
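
(For the record, the restart itself is just the init script on the grid master host -- assuming the stock Ubuntu gridengine packaging, something like:)

$ sudo service gridengine-master restart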

I noticed job numbers were reset. I'm not sure if this could be related to the cause or whether this is just an effect of restarting the master.

I did disable and later enable queues on some hosts (cf. T109485#1591475). After that, only the queue for tools-webgrid-lighttpd-1411 was disabled, so I assumed that everything would be "normal" again.
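
(For context: disabling/enabling a queue instance is done with qmod; the host below is just an example, the actual list is in T109485:)

$ qmod -d 'task@tools-exec-1211.eqiad.wmflabs'   # disable a queue instance
$ qmod -e 'task@tools-exec-1211.eqiad.wmflabs'   # re-enable it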

I think it's unlikely that was the cause, because *all* queues were in this weird state (qhost didn't report high load). I think it's because of the * 14 multiplier, which has something to do with load adjustment (job_load_adjustments), but it's unclear to me what exactly causes this.

My plan is to heavily load a single node, then to check the output of qstat for a new job. The heavily loaded node should give that np_load_avg message. (I have no clue how to get that info for a non-disabled queue...)
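
Something along these lines should do it (the burner jobs and names are made up for illustration; qstat -j only shows the scheduling info when schedd_job_info is enabled):

$ for i in $(seq 8); do qsub -b y -l hostname=tools-exec-1211.eqiad.wmflabs -N burn-cpu sha1sum /dev/zero; done   # keep the node's cores busy
$ qsub -b y -N probe sleep 600                             # throwaway job to schedule
$ qstat -j <probe job id> | grep -A 20 'scheduling info'   # look for the np_load_avg lines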

np_load_avg=1.810000 (= 0.060000 + 0.50 * 14.000000 with nproc=4) >= 1.75
= np_load_avg + (job_load_adjustments * [decaying count of jobs started in the last 7.5 minutes]) / nproc

0.5 is job_load_adjustments, as shown by "qconf -ssconf".
14 is the sum of the per-job adjustment factors: when a new job is started, its factor starts at 1 and then decays to 0 over 7.5 minutes (load_adjustment_decay_time). The sum of these [0..1] values over all jobs started in the last 7.5 minutes is 14 in this case.

This is done to avoid overload, because recently started jobs normally need some time to start up before they cause their maximum load on the server.
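
Plugging the numbers from the scheduler message into that (the adjustment sum gets divided by nproc because np_load_avg is normalized per processor -- that is the only way the reported 1.81 works out):

np_load_avg (adjusted) = 0.06 + 0.50 * 14.0 / 4 = 0.06 + 1.75 = 1.81 >= 1.75 (load_thresholds), so the queue instance is dropped.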

Thank you for that explanation, that helps a lot!

A bit more post-mortem work. From the accounting file, I got a list of finished jobs starting from midnight UTC last night, and grouped them by start time in ten-minute bins (a rough sketch of the extraction follows the table). Typically there are 150-250 jobs per bin, but during the outage this dropped to just a few:

 260 02:0
 199 02:1
 177 02:2
 212 02:3
 209 02:4
 139 02:5
  12 03:0
   2 03:1
  19 03:2
   3 03:3
   2 03:4
  31 03:5
   3 04:0
  35 04:2
   6 04:3
   3 04:4
  27 04:5
  18 05:0
  11 05:1
  12 05:2
   5 05:3
  29 05:5
   2 06:0
1057 06:1
 320 06:2
 232 06:3
 221 06:4
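
Roughly how such a histogram can be produced from the accounting file (a sketch, not necessarily the exact invocation: start_time is the tenth colon-separated field per accounting(5), 1441065600 is 2015-09-01 00:00 UTC, and strftime needs GNU awk):

$ awk -F: '$10 >= 1441065600 { print strftime("%H:%M", $10, 1) }' accounting | cut -c1-4 | sort | uniq -c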

~06:10 UTC is when I restarted gridengine-master. The issue seems to have started just before 03:00 UTC (139 jobs for the 02:50-03:00 bin is already on the low side). There also doesn't seem to be a crazy influx of jobs explaining the 14-jobs-started-in-the-last-7.5-minutes number...

What's also odd is that the first report on IRC came in at 02:34 CEST = 00:34 UTC, while there doesn't seem to be anything in the accounting log to suggest issues at that time...

My logging in T109485 was done fairly soon after the corresponding action, i.e. I enabled the queues ~3:00Z after seeing the pending queue grow, but I disabled them before ~0:20Z.

valhallasw claimed this task.

Hm. I'm not sure what we can do to find the root cause of this issue at the moment. Let's just hope it doesn't happen again? :/