
Enable tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs and build more webgrid hosts
Closed, Resolved, Public

Description

The trusty webgrid queue has been overloaded for more than 12 hours now. Andrew and I can't figure out how to get the host configured correctly. Please fix ASAP.

My attempts: please see https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL

Event Timeline

valhallasw raised the priority of this task to Unbreak Now!.
valhallasw updated the task description.
valhallasw added a project: Toolforge.
valhallasw added subscribers: valhallasw, coren, scfc and 2 others.
Restricted Application added a subscriber: Aklapper.
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1403.eqiad.wmflabs" dropped because it is temporarily not available
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1404.eqiad.wmflabs" dropped because it is temporarily not available
queue instance "webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" dropped because it is temporarily not available

?

This seems to be caused by gridengine-exec not running. Running sudo service gridengine-exec start seems to fix it on these hosts.

What I have done so far:
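To act on scheduler messages like the ones quoted above, it can help to pull out just the host names first. The following is a minimal sketch, assuming the messages arrive on stdin in exactly the format shown in this task; the function name dropped_hosts is made up for illustration and is not part of gridengine.

```shell
#!/bin/sh
# Hypothetical helper: read scheduler messages on stdin and print each host
# whose queue instance was reported as "dropped", one host per line.
# Assumes the exact message format quoted in this task.
dropped_hosts() {
    sed -n 's/^queue instance "[^@"]*@\([^"]*\)" dropped because.*/\1/p' | sort -u
}

# Possible usage (left commented out, because it touches real hosts):
#   dropped_hosts < messages.txt | while read -r host; do
#       ssh -n "$host" sudo service gridengine-exec start
#   done
```

The ssh loop only repeats the service start command mentioned above; messages.txt is a placeholder for wherever the messages were saved.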

  • qconf -mhgrp \@webgrid: changed tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs to tools-webgrid-lighttpd-1411.eqiad.wmflabs (consistency is always good)

But:

scfc@tools-bastion-01:~$ qconf -de tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
Host object "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" is still referenced in cluster queue "webgrid-lighttpd".
scfc@tools-bastion-01:~$

So I ran qconf -mhgrp \@webgrid, because qconf -sq webgrid-lighttpd references @webgrid, and removed -1411 completely. I then tried qconf -de again, but still had no luck, so I added -1411 back with qconf -mhgrp \@webgrid.

Note that there are jobs running on -1411 at the moment, so please be careful with the SGE magic.

More DNS craziness with

tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs

and

tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs.

Those two don't have jobs running, so they might be better targets for figuring this out.
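Since the two spellings differ only by the .tools label, one way to check which of them gridengine actually knows is to generate both and query each in turn. A minimal sketch, assuming every affected name ends in .eqiad.wmflabs as in the examples above; name_variants is a hypothetical helper, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: given either spelling of a host name, print both,
# first without and then with the ".tools" label.
# Assumes the name ends in ".eqiad.wmflabs".
name_variants() {
    host=$1
    short=$(printf '%s\n' "$host" | sed 's/\.tools\.eqiad\.wmflabs$/.eqiad.wmflabs/')
    long=$(printf '%s\n' "$short" | sed 's/\.eqiad\.wmflabs$/.tools.eqiad.wmflabs/')
    printf '%s\n%s\n' "$short" "$long"
}

# Possible usage (commented out; needs a gridengine admin host):
#   for h in $(name_variants tools-exec-1401.tools.eqiad.wmflabs); do
#       qconf -se "$h"    # show the execution host entry, if one exists
#   done
```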

Ha! The alias list struck:

scfc@tools-bastion-01:~$ qconf -de tools-webgrid-lighttpd-1411.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-webgrid-lighttpd-1411.eqiad.wmflabs" from execution host list
scfc@tools-bastion-01:~$
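For anyone retracing this, a quick way to see which name gridengine treats as canonical is to look the host up in the alias file. This is only a sketch under stated assumptions: it assumes the usual host_aliases layout (one line per host, the canonical name first, followed by its aliases), the usual location of $SGE_ROOT/$SGE_CELL/common/host_aliases, and a made-up function name canonical_name. It does not show what the file actually contained during this incident.

```shell
#!/bin/sh
# Hypothetical helper: print the canonical name for a host according to a
# host_aliases-style file (first field of the line on which the host appears).
# If the host is not listed at all, print it unchanged.
canonical_name() {
    aliases_file=$1
    host=$2
    awk -v h="$host" '
        /^[[:space:]]*#/ { next }    # skip comment lines
        {
            for (i = 1; i <= NF; i++) {
                if ($i == h) { print $1; found = 1; exit }
            }
        }
        END { if (!found) print h }
    ' "$aliases_file"
}

# Possible usage (path is an assumption about the local installation):
#   canonical_name "$SGE_ROOT/$SGE_CELL/common/host_aliases" \
#       tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
```

If the name printed differs from the one passed in, that printed name is probably the one to use with qconf -de.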

(Side note: I assume the disabled queues are disabled for the upcoming reboot?)

I'm sorry, I read your comment too late. But qstat -f showed no running jobs anyway?

Ugh, wrong column. Yes, there were and still are jobs running on that host.

Okay. I have done practically nothing, except changing the list in qconf -mhgrp \@webgrid back to …-1411.tools.… because I hadn't considered that the host was not in host_aliases, while @valhallasw did a lot. There are now no pending jobs, and the host seems to run jobs. Is this task done?

I think so, for now. I still need to document what I did in more detail (T109417) and write up a post-mortem, but I think it's crisis averted for now.