
Enable and build more webgrid hosts
Closed, Resolved · Public


The trusty webgrid queue has been overloaded for more than 12 hours now. Andrew and I can't figure out how to get the host configured correctly. Please fix ASAP.

my attempts: please see

Event Timeline

valhallasw raised the priority of this task from to Unbreak Now!.
valhallasw updated the task description.
valhallasw added a project: Toolforge.
valhallasw added subscribers: valhallasw, coren, scfc and 2 others.
Restricted Application added a subscriber: Aklapper.
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1403.eqiad.wmflabs" dropped because it is temporarily not available
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1404.eqiad.wmflabs" dropped because it is temporarily not available
queue instance "" dropped because it is temporarily not available


This seems to be caused by gridengine-exec not running. Running sudo service gridengine-exec start seems to fix it on these hosts.
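For anyone repeating this, here is a minimal sketch of how the affected hosts could be pulled out of messages like the ones above. It assumes the messages have been saved to a file or are piped in on stdin. The function names dropped_hosts and print_restart_commands are made up for illustration, and the ssh invocation is an assumption about how the hosts are reached. The second function only prints commands; it does not run them.

```shell
# Print the unique host names found in
# 'queue instance "<queue>@<host>" dropped ...' messages read from stdin.
# Lines with an empty queue instance ("") are skipped because they carry no host.
dropped_hosts() {
    sed -n 's/^queue instance "[^"@]*@\([^"]*\)" dropped because it is temporarily not available$/\1/p' | sort -u
}

# Dry run: print (do not execute) one restart command per affected host.
# Reaching the host via ssh is an assumption, not something the task states.
print_restart_commands() {
    dropped_hosts | while read -r host; do
        printf 'ssh %s sudo service gridengine-exec start\n' "$host"
    done
}
```

Usage would be something like print_restart_commands < messages.txt, where messages.txt is a hypothetical file holding the pasted scheduler messages. Review the printed commands before running any of them.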

What I did so far:

  • qconf -mhgrp \@webgrid => => tools-webgrid-lighttpd-1411.eqiad.wmflabs (consistency always good)


scfc@tools-bastion-01:~$ qconf -de
Host object "" is still referenced in cluster queue "webgrid-lighttpd".

So I ran qconf -mhgrp \@webgrid (because qconf -sq webgrid-lighttpd referenced @webgrid) and removed -1411 completely, then tried qconf -de again. Still no luck, so I added -1411 back with qconf -mhgrp \@webgrid.
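A small sketch of the lookup step described here: finding which host groups a cluster queue references before editing them. It assumes the hostlist entry in the qconf -sq output fits on a single line (no backslash continuation). The function name queue_hostgroups is hypothetical.

```shell
# Read `qconf -sq <queue>` output on stdin and print every host group
# (an entry starting with "@") named on the hostlist line.
# Assumption: the hostlist value is not wrapped onto continuation lines.
queue_hostgroups() {
    awk '$1 == "hostlist" { for (i = 2; i <= NF; i++) print $i }' \
        | tr ',' '\n' \
        | grep '^@'
}

# Example (not executed here):
#   qconf -sq webgrid-lighttpd | queue_hostgroups
#   qconf -shgrp @webgrid      # show the members of the group that was found
```

Viewing the group with qconf -shgrp first is read-only, which is a little safer than opening it in the editor with qconf -mhgrp just to look.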

Note that there are jobs running on -1411 at the moment, so please be careful with the sge magic.

More DNS craziness with vs tools-exec-1401.eqiad.wmflabs

and vs tools-exec-catscan.eqiad.wmflabs.

Those two don't have jobs running, so they might be better targets to figure this out.

Ha! The alias list struck:

scfc@tools-bastion-01:~$ qconf -de tools-webgrid-lighttpd-1411.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-webgrid-lighttpd-1411.eqiad.wmflabs" from execution host list
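To check up front whether the alias list will get in the way, something like the following might help. It is a sketch under these assumptions: the alias file is the usual host_aliases file (commonly at $SGE_ROOT/$SGE_CELL/common/host_aliases, which may differ on this installation), and each line lists the primary name first, followed by its aliases. The function name canonical_host is made up for illustration.

```shell
# Print the name gridengine would treat as primary for host $1,
# according to the host_aliases file given as $2.
# Each line is assumed to be: <primary name> <alias> [<alias> ...]
# If no line mentions the host, the name is printed unchanged.
canonical_host() {
    awk -v h="$1" '
        /^[[:space:]]*#/ { next }          # defensively skip comment-looking lines
        {
            for (i = 1; i <= NF; i++)
                if ($i == h) { print $1; found = 1; exit }
        }
        END { if (!found) print h }
    ' "$2"
}

# Example (not executed here; the path is an assumption):
#   canonical_host tools-webgrid-lighttpd-1411.eqiad.wmflabs \
#       "$SGE_ROOT/$SGE_CELL/common/host_aliases"
```

If the printed name differs from the one passed in, that printed name is probably the one to use with qconf -de and in the host group.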

(Side note: I assume the disabled queues are disabled for the upcoming reboot?)

I'm sorry, I read your comment too late. But qstat -f showed no running jobs anyway?

Ugh, wrong column. Yes there were and still are jobs running on that host.

Okay, so I have done practically nothing (except changing the list in qconf -mhgrp \@webgrid back to ……, because I hadn't considered that the host was not in host_aliases), and @valhallasw did a lot. There are no pending jobs now, and the host seems to run jobs. Is this task done?

I think so, for now. I still need to document what I did in more detail (T109417) and write up a post-mortem, but I think the crisis is averted.