
Make webservice grid jobs "non-rerunable"
Closed, Resolved, Public

Description

When a bunch of webgrid-lighttpd nodes were depooled on Jan. 6th, we discovered that they didn't inform the proxylistener process on the Toolforge proxy system that the services were rescheduled.

That left the system out of sync, and we got 503 errors on the web for those tools. It turns out there is a long-standing bug: webgrid jobs *should* be marked "not rerunable" so that they are instead stopped as grid jobs (which will run the epilog script against the proxy). They will then be restarted by the webservice monitor. Rescheduling a job on the grid skips the epilog script, so the job ends up without a port on the proxy and no webservice is visible, even though it reports to the webservice monitor that it is running (because it is running... on the grid, just not on the web).

The fix is to ensure that webservice does qsub -r n instead of -r y or whatever it does now.

Details

Related Gerrit Patches:
[operations/puppet@production] gridengine: set the webgrid queues to not rerunable
[operations/software/tools-webservice@master] gridengine: Make webservices "not rerunable"

Event Timeline

Bstorm triaged this task as High priority. Jan 10 2020, 12:23 AM
Bstorm created this task.
Restricted Application added a subscriber: Aklapper. Jan 10 2020, 12:23 AM

Change 564095 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/software/tools-webservice@master] gridengine: Make webservices "not rerunable"

https://gerrit.wikimedia.org/r/564095

Change 564095 abandoned by Bstorm:
gridengine: Make webservices "not rerunable"

Reason:
The problem is the queue definition, not webservice/job definition

https://gerrit.wikimedia.org/r/564095

Ok, so while I knew the jobs were "rerunable" because I'd done it, @bd808 wisely looked at an individual job and found that it was marked "not rerunable" per the default. The problem is that the queue config for this marks *everything* as rerunable, and we cannot override it at the job level, apparently.

So the right change is the queue config, not the jobs/webservice.
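To make the queue-level setting concrete, here is a minimal sketch of checking a queue's `rerun` attribute. It assumes the queue definition is available as text in the "attribute value" layout that `qconf -sq <queue>` prints; the function name and the sample text are hypothetical and are not taken from the actual patch.

```python
def queue_rerun_flag(queue_conf_text):
    """Return True/False for the queue's `rerun` attribute, or None if absent.

    Assumes one "attribute value" pair per line, as in a gridengine
    queue definition. This helper is illustrative, not part of Toolforge.
    """
    for line in queue_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "rerun":
            return parts[1].upper() == "TRUE"
    return None


# Illustrative sample only, not captured from a real queue.
SAMPLE = """\
qname                 webgrid-lighttpd
rerun                 FALSE
"""

print(queue_rerun_flag(SAMPLE))
```

For the webgrid queues, the result we want after this change is `False`, meaning jobs in those queues are not rescheduled by the grid.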

Change 564174 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: set the webgrid queues to not rerunable

https://gerrit.wikimedia.org/r/564174

Change 564174 merged by Bstorm:
[operations/puppet@production] gridengine: set the webgrid queues to not rerunable

https://gerrit.wikimedia.org/r/564174

Mentioned in SAL (#wikimedia-cloud) [2020-01-16T16:45:24Z] <bstorm_> ran configurator to set the gridengine web queues to rerun FALSE T242397

Bstorm closed this task as Resolved. Jan 16 2020, 6:36 PM

Where job 2937052 happens to be bd808-test2:

bstorm@tools-sgegrid-master:/data/project/.system_sge/gridengine/etc/queues$ sudo qmod -rj 2937052
The job 2937052 is running in queue none where jobs are not rerunable

I call this done.

Note: I don't have any idea why the output says "queue none":
hard_queue_list: webgrid-lighttpd
Also
2937052 0.32584 lighttpd-b tools.bd808- r 12/12/2019 12:48:52 webgrid-li MASTER

So that must just be a bug in SGE. The depooling script looks for "not rerunable" and kills the job.
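For illustration, the behaviour described above could be sketched roughly as follows. This is not the actual depooling script: the function names are hypothetical, and using `qdel` as the "kill" step is an assumption. The command runner is passed in so the logic can be exercised without a grid.

```python
import subprocess

NOT_RERUNABLE = "not rerunable"  # text seen in the qmod output above


def run(cmd):
    """Run a command and return its combined stdout and stderr as text."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.stdout + proc.stderr


def drain_job(job_id, runner=run):
    """Try to reschedule a job; delete it if the queue is not rerunable.

    Hypothetical sketch: `qmod -rj` is the command used in this task,
    while `qdel` as the fallback is an assumption about how the job is killed.
    """
    output = runner(["qmod", "-rj", str(job_id)])
    if NOT_RERUNABLE in output:
        # The job stops as a normal grid job, so the epilog script runs
        # and the webservice monitor can restart the webservice.
        runner(["qdel", str(job_id)])
        return "deleted"
    return "rescheduled"
```

With the queue set to not rerunable, a webgrid job takes the "deleted" path, so the proxy is updated by the epilog rather than being left out of sync.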