The trusty webgrid queue has been overloaded for more than 12 hours now. Andrew and I can't figure out how to get the host configured correctly. Please fix ASAP.
My attempts: please see https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1403.eqiad.wmflabs" dropped because it is temporarily not available
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1404.eqiad.wmflabs" dropped because it is temporarily not available
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" dropped because it is temporarily not available
?
This seems to be caused by gridengine-exec not running. Running sudo service gridengine-exec start seems to fix it on these hosts.
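A minimal sketch of how the affected hosts could be pulled out of "temporarily not available" scheduler messages and turned into restart commands. The function names are hypothetical, reaching the hosts via ssh is an assumption, and the script only prints the commands (a dry run) rather than executing anything. It relies on grep -o, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Hypothetical helper: read scheduler messages on stdin and print one unique
# host name per line for every queue instance reported as
# "temporarily not available".
unavailable_hosts() {
    grep -o 'queue instance "[^"]*" dropped because it is temporarily not available' |
        sed -e 's/^queue instance "[^@"]*@//' -e 's/" dropped.*$//' |
        sort -u
}

# Dry run: print the restart command for each host instead of running it.
# Using ssh to reach the host is an assumption; the service command is the
# one quoted in this task.
print_restart_commands() {
    unavailable_hosts | while read -r host; do
        printf 'ssh %s sudo service gridengine-exec start\n' "$host"
    done
}
```

Piping the messages quoted in this task into print_restart_commands gives one line per host to review before running anything by hand.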
I did so far:
But:
scfc@tools-bastion-01:~$ qconf -de tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
Host object "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" is still referenced in cluster queue "webgrid-lighttpd".
scfc@tools-bastion-01:~$
Because qconf -sq webgrid-lighttpd referenced @webgrid, I ran qconf -mhgrp \@webgrid and removed -1411 completely, then tried qconf -de again, but still no luck. So I added -1411 back with qconf -mhgrp \@webgrid.
Note that there are jobs running on -1411 at the moment, so please be careful with the sge magic.
More DNS craziness with
tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs
and
tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs.
Those two don't have jobs running, so they might be better targets to figure this out.
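A small sketch for spotting hosts that show up under both name forms. It assumes a list of host names on stdin, one per line (for example the output of qconf -sel), and that the only difference between the two forms is the extra .tools label. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read host names on stdin, fold the
# ".tools.eqiad.wmflabs" form into ".eqiad.wmflabs", and print every name
# that then occurs more than once, i.e. hosts known under both forms.
duplicate_names() {
    sed 's/\.tools\.eqiad\.wmflabs$/.eqiad.wmflabs/' | sort | uniq -d
}
```

Any name it prints is a candidate for the kind of mismatch seen with tools-exec-1401 and tools-exec-catscan.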
Ha! The alias list struck:
scfc@tools-bastion-01:~$ qconf -de tools-webgrid-lighttpd-1411.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-webgrid-lighttpd-1411.eqiad.wmflabs" from execution host list
scfc@tools-bastion-01:~$
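To check which name gridengine would use for a host, a lookup against the alias file can help. This is a sketch under assumptions: it takes the file path as an argument (the actual location on these hosts is not stated here), and it assumes the usual host_aliases layout where each line lists the names of one host and the first name is the one gridengine uses. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: canonical_name HOST ALIAS_FILE
# Prints the first name on the line of ALIAS_FILE that mentions HOST,
# or HOST itself when no line mentions it. Lines starting with "#" are
# skipped.
canonical_name() {
    awk -v h="$1" '
        /^[[:space:]]*#/ { next }
        {
            for (i = 1; i <= NF; i++) {
                if ($i == h) { print $1; found = 1; exit }
            }
        }
        END { if (!found) print h }
    ' "$2"
}
```

If the printed name differs from the one passed in, commands such as qconf -de are probably best tried with the printed name.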
(Side note: I assume the disabled queues are disabled for the upcoming reboot?)
Okay. I have done practically nothing, except changing the list in qconf -mhgrp \@webgrid back to …-1411.tools.… because I hadn't considered that the host was not in host_aliases, while @valhallasw did a lot. There are now no pending jobs and the host seems to run jobs, so is this task done?
I think so, for now. I still need to document what I did in more detail (T109417) and write up a post-mortem, but I think it's crisis averted for now.