
toolsbeta grid misconfigured
Closed, Invalid · Public

Description

Unable to run job: Job was rejected because job requests unknown queue "task".
valhallasw@toolsbeta-master:~$ sudo qconf -Ae /var/lib/gridengine/etc/exechosts/toolsbeta-exec-1209.toolsbeta.eqiad.wmflabs
root@toolsbeta-master.toolsbeta.eqiad.wmflabs added "toolsbeta-exec-1209.toolsbeta.eqiad.wmflabs" to exechost list
valhallasw@toolsbeta-master:~$ sudo qconf -mhgrp \@general
Host group "@general" does not exist

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.

Created host group @general using:

valhallasw@toolsbeta-master:~$ sudo qconf -ahgrp


group_name @general
hostlist toolsbeta-exec-1209
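
The step above goes through the interactive editor that `qconf -ahgrp` opens. A minimal non-interactive sketch follows, assuming the installed Grid Engine supports `qconf -Ahgrp <file>` (add a host group from a file). The helper name `write_hostgroup` and the temporary file path are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# write_hostgroup GROUP HOST...: print a host group definition in the
# two-field format shown above. Hypothetical helper, not part of SGE.
write_hostgroup() {
    group=$1
    shift
    printf 'group_name %s\n' "$group"
    # "$*" joins the remaining arguments with single spaces.
    printf 'hostlist %s\n' "$*"
}

# Example, commented out because it changes the grid configuration
# (assumes qconf -Ahgrp is available):
# write_hostgroup @general toolsbeta-exec-1209 > /tmp/general.hgrp
# sudo qconf -Ahgrp /tmp/general.hgrp
```

Writing the definition to a file first keeps it reviewable and repeatable, which matters if the grid has to be rebuilt by hand again.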

Copied the task queue definition from tools:

qname                 task
hostlist              @general
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 50
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      unix_behavior
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      /usr/local/bin/jobkill $job_pid
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

and created it on toolsbeta using:

sudo qconf -aq task
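
The copy could probably also be done without the editor. The sketch below assumes `qconf -sq`, `qconf -shgrp` and `qconf -Aq <file>` behave as documented for Grid Engine, and that `qconf -shgrp` exits non-zero for an unknown group. The helper `queue_hostgroups` and the file name `task.queue` are hypothetical. It lists the host groups a queue definition refers to, so that a missing `@general` is caught before the queue is added.

```shell
#!/bin/sh
# queue_hostgroups FILE: print every host group (a name starting with "@")
# found on the hostlist line of a queue definition file.
# Hypothetical helper, not part of SGE.
queue_hostgroups() {
    awk '$1 == "hostlist" { for (i = 2; i <= NF; i++) if ($i ~ /^@/) print $i }' "$1"
}

# Example, commented out because it needs a running qmaster:
# (on tools)      qconf -sq task > task.queue
# (on toolsbeta)  for g in $(queue_hostgroups task.queue); do
#                     qconf -shgrp "$g" >/dev/null 2>&1 || echo "missing host group: $g"
#                 done
#                 sudo qconf -Aq task.queue
```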

That seems to work (at least partially); the job now hangs with:

scheduling info:            queue instance "task@toolsbeta-exec-1209.toolsbeta.eqiad.wmflabs" dropped because it is temporarily not available
valhallasw@toolsbeta-master:~$ tail /var/spool/gridengine/qmaster/messages
...
05/27/2016 19:06:03| timer|toolsbeta-master|E|error opening file "/var/lib/gridengine/default/common/accounting" for writing: Permission denied
valhallasw@toolsbeta-master:~$ ls -l /var/lib/gridengine/default/common/accounting
lrwxrwxrwx 1 root sgeadmin 32 Mar  6  2014 /var/lib/gridengine/default/common/accounting -> /data/project/.system/accounting
valhallasw@toolsbeta-master:~$ ls -l /data/project/.system/
total 16
-rw-r--r-- 1 root     toolsbeta.admin   72 Jan 25 14:55 bigbrother.scoreboard
drwxrwx--- 5 root     toolsbeta.admin 4096 Feb  7 20:09 crontabs
-rw-r--r-- 1 root     toolsbeta.admin    0 May 27 00:41 dynamic
drwxr-sr-x 7 sgeadmin sgeadmin        4096 Dec 25  2014 gridengine
drwxr-xr-x 2 root     root            4096 Mar 16 20:05 store

The symlink target /data/project/.system/accounting is missing from that listing, so I created it and handed it to sgeadmin:

valhallasw@toolsbeta-master:~$ sudo touch /data/project/.system/accounting
valhallasw@toolsbeta-master:~$ sudo chown sgeadmin:sgeadmin /data/project/.system/accounting
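
The accounting problem above comes down to a symlink whose target does not exist. A small check for that condition could look like the sketch below. The function name `is_dangling_symlink` is hypothetical, and the commented usage simply repeats the manual fix from this task.

```shell
#!/bin/sh
# is_dangling_symlink PATH: succeed if PATH is a symlink whose target
# cannot be found. Note that [ -e ] follows symlinks, so it is false for
# a link pointing at a missing file.
is_dangling_symlink() {
    [ -L "$1" ] && [ ! -e "$1" ]
}

# Example, commented out because it modifies files on the master:
# acct=/var/lib/gridengine/default/common/accounting
# if is_dangling_symlink "$acct"; then
#     target=$(readlink -f "$acct")
#     sudo touch "$target"
#     sudo chown sgeadmin:sgeadmin "$target"
# fi
```
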
valhallasw added a subscriber: scfc.

If I read the Puppet manifests correctly, this configuration (of hostgroups and queues) should happen automatically. If it does not, that significantly lengthens our recovery time when something goes badly wrong with SGE.

@scfc, from your work on the gridengine manifest, do you have any insight into why this is broken? (I haven't spent any time debugging the manifest application; I just worked around it manually to get a somewhat-working system, so that I could run h_fsize tests.)

The automatic configuration of hostgroups and queues has never worked (correct me if I'm wrong; cf. T88711). From modules/gridengine/files/tracker:

[…]
### XXX: RIGHT NOW THIS ONLY ECHOS THE COMMANDS TO AVOID
###      CHANGING THE RUNNING CONFIGURATION.  THIS SCRIPT
###      IS JUST A NOISY NOOP.
[…]
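
For readers unfamiliar with the pattern that comment describes, here is a generic sketch of an echo-only ("dry run") wrapper. It is not taken from the tracker script. The function name `run` and the `DRY_RUN` variable are assumptions made purely for illustration.

```shell
#!/bin/sh
# run CMD ARGS...: print the command instead of executing it unless
# DRY_RUN is explicitly set to something other than 1.
# Defaults to dry-run so that nothing changes by accident.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example, commented out (the qconf call would only be echoed by default):
# run qconf -Ahgrp /tmp/general.hgrp
```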

I find the current setup too risky: instances write files to NFS, and another instance then reads them (and would apply them as the active configuration if it weren't for the no-op). I would prefer an approach where the configuration lives in Puppet, with the names of the execution, submission, and administration hosts in Hiera.

Currently there is no grid in Toolsbeta :-).