I created a new tool named sge-status and deployed the code from https://phabricator.wikimedia.org/source/tool-gridengine-status/ there.
When I run webservice start from tools-sgebastion-06.tools.eqiad.wmflabs, the job is submitted to the new grid but fails to run because of flags that the webservice script adds to the job:
$ qstat -j 2
==============================================================
job_number:                 2
exec_file:                  job_scripts/2
submission_time:            Thu Dec 20 03:10:21 2018
owner:                      tools.sge-status
uid:                        53920
group:                      tools.sge-status
gid:                        53920
sge_o_home:                 /data/project/sge-status
sge_o_log_name:             tools.sge-status
sge_o_path:                 /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
sge_o_shell:                /bin/bash
sge_o_workdir:              /mnt/nfs/labstore-secondary-tools-project/sge-status
sge_o_host:                 tools-sgebastion-06
account:                    sge
stderr_path_list:           NONE:NONE:/data/project/sge-status/error.log
hard resource_list:         h_vmem=4G,release=trusty
mail_list:                  firstname.lastname@example.org
notify:                     FALSE
job_name:                   lighttpd-sge-status
stdout_path_list:           NONE:NONE:/data/project/sge-status/error.log
stdin_path_list:            NONE:NONE:/dev/null
jobshare:                   0
hard_queue_list:            webgrid-lighttpd
env_list:                   TERM=NONE
script_file:                /usr/bin/webservice-runner --register-proxy --type lighttpd
binding:                    NONE
job_type:                   binary
scheduling info:            cannot run in queue "task" because it is not contained in its hard queue list (-q)
                            cannot run in queue "continuous" because it is not contained in its hard queue list (-q)
                            cannot run in queue "webgrid-generic" because it is not contained in its hard queue list (-q)
                            (-l h_vmem=4G,release=trusty) cannot run at host "tools-sgewebgrid-lighttpd-0901.tools.eqiad.wmflabs" because it offers only hf:release=stretch
                            (-l h_vmem=4G,release=trusty) cannot run at host "tools-sgewebgrid-lighttpd-0902.tools.eqiad.wmflabs" because it offers only hf:release=stretch
This didn't surprise me much, but what happened next did. The webservicemonitor watchdog script for the legacy grid saw the new ~tools.sge-status/service.manifest file, noticed that the associated job was not running on the legacy grid, and started it there. The end result is that the lighttpd process is running on the legacy grid rather than the new grid.
So that's 3 bugs in one:
- new grid is missing the expected 'webgrid-lighttpd' queue
- webservice is still adding '-l release=trusty' to submitted jobs
- webservicemonitor doesn't know about separate grids and will launch any gridengine service.manifest job on the grid that the monitor is attached to
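
For anyone triaging similar jobs, here is a minimal sketch of how the first two bugs can be read out of the qstat -j text shown above. It is only an illustration, not part of webservice or webservicemonitor: the function names requested_resources and scheduling_reasons are hypothetical, the SAMPLE string is an abbreviated copy of the output in this report, and the regular expressions assume the "cannot run in queue/at host ... because ..." wording seen there.

```python
import re

# Abbreviated copy of the `qstat -j 2` output from this report (assumption:
# other jobs use the same wording in their "scheduling info" section).
SAMPLE = (
    'hard resource_list: h_vmem=4G,release=trusty '
    'hard_queue_list: webgrid-lighttpd '
    'scheduling info: '
    'cannot run in queue "task" because it is not contained in its hard queue list (-q) '
    'cannot run in queue "webgrid-generic" because it is not contained in its hard queue list (-q) '
    '(-l h_vmem=4G,release=trusty) cannot run at host '
    '"tools-sgewebgrid-lighttpd-0901.tools.eqiad.wmflabs" because it offers only hf:release=stretch'
)

RESOURCE_RE = re.compile(r'hard resource_list:\s*(\S+)')

# One rejection reason; the lookahead stops before the next "cannot run"
# (optionally preceded by a "(-l ...)" prefix) or at the end of the text.
REASON_RE = re.compile(
    r'cannot run (in queue|at host) "([^"]+)" because (.*?)'
    r'(?=\s*(?:\(-l [^)]*\)\s*)?cannot run |\s*\Z)',
    re.DOTALL,
)


def requested_resources(qstat_text):
    """Return the hard resource requests (-l) as a dict, e.g. {'release': 'trusty'}."""
    match = RESOURCE_RE.search(qstat_text)
    if not match:
        return {}
    pairs = (item.split('=', 1) for item in match.group(1).split(',') if '=' in item)
    return dict(pairs)


def scheduling_reasons(qstat_text):
    """Return (kind, name, reason) tuples, where kind is 'queue' or 'host'."""
    return [
        (kind.split()[-1], name, ' '.join(reason.split()))
        for kind, name, reason in REASON_RE.findall(qstat_text)
    ]


if __name__ == '__main__':
    print(requested_resources(SAMPLE))
    for kind, name, reason in scheduling_reasons(SAMPLE):
        print(f'{kind:5} {name}: {reason}')
```

Run against the sample, this lists the queues rejected by the hard queue list separately from the hosts rejected by the release=trusty request, which matches the first two bugs above.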