Make webservice grid jobs "non-rerunable"
Closed, Resolved · Public

Assigned To

Authored By

	• Bstorm
	Jan 10 2020, 12:23 AM

Description

When a bunch of webgrid-lighttpd nodes were depooled on Jan. 6th, we discovered that they didn't inform the proxylistener process on the Toolforge proxy system that the services were rescheduled.

That caused the system to be out of sync, and we got 503 errors on the web for those tools. It turns out there is a long-standing bug here: webgrid jobs *should* be marked "not rerunable" so that they are stopped as grid jobs rather than rescheduled (stopping them runs the epilog script against the proxy). They will then be restarted by the webservice monitor. Rescheduling them on the grid skips the epilog script, so the job ends up without a port on the proxy and no webservice is reachable, even though it reports to the webservice monitor that it is running (because it is running... on the grid, just not on the web).

The fix is to ensure that webservice does qsub -r n instead of -r y or whatever it does now.

Details

	Subject	Repo	Branch	Lines +/-
	gridengine: set the webgrid queues to not rerunable	operations/puppet	production	+4 -4
	gridengine: Make webservices "not rerunable"	operations/software/tools-webservice	master	+2 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• Bstorm	T242397 Make webservice grid jobs "non-rerunable"
		Resolved		bd808	T242538 Many grid engine backend webservices not registered at tools-proxy redis following depool restarts

Event Timeline

• Bstorm triaged this task as High priority. Jan 10 2020, 12:23 AM

• Bstorm created this task.

Restricted Application added a subscriber: Aklapper. Jan 10 2020, 12:23 AM

• Bstorm moved this task from Inbox to Doing on the cloud-services-team (Kanban) board. Jan 10 2020, 12:23 AM

bd808 added a project: Toolforge. Jan 10 2020, 8:13 PM

Krenair mentioned this in T242538: Many grid engine backend webservices not registered at tools-proxy redis following depool restarts. Jan 13 2020, 2:12 AM

bd808 added a subtask: T242538: Many grid engine backend webservices not registered at tools-proxy redis following depool restarts. Jan 13 2020, 6:41 PM

Change 564095 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/software/tools-webservice@master] gridengine: Make webservices "not rerunable"

https://gerrit.wikimedia.org/r/564095

gerritbot added a project: Patch-For-Review. Jan 13 2020, 7:01 PM

bd808 closed subtask T242538: Many grid engine backend webservices not registered at tools-proxy redis following depool restarts as Resolved. Jan 13 2020, 10:57 PM

Change 564095 abandoned by Bstorm:
gridengine: Make webservices "not rerunable"

Reason:
The problem is the queue definition, not webservice/job definition

https://gerrit.wikimedia.org/r/564095

Ok, so while I knew the jobs were "rerunable" because I'd done it, @bd808 wisely looked at an individual job and found that it was marked "not rerunable" per the default. The problem is that the queue config for this marks *everything* as rerunable, and we cannot override it at the job level, apparently.

So the right change is the queue config, not the jobs/webservice.
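To make the queue-level setting concrete, here is a minimal sketch of checking the `rerun` attribute in a queue definition as printed by `qconf -sq <queue>`. The helper names and the sample text are hypothetical and only illustrate the idea; treating a missing `rerun` line as "not rerunable" is an assumption, and this is not the actual puppet change.

```python
def parse_queue_conf(text: str) -> dict:
    """Parse 'key   value' lines in the style printed by `qconf -sq <queue>`."""
    conf = {}
    # Long values are wrapped with a trailing backslash; join them first.
    joined = text.replace("\\\n", "")
    for line in joined.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def queue_is_rerunable(conf: dict) -> bool:
    """True when the queue definition sets rerun to TRUE.

    Assumption: a missing 'rerun' line is treated as FALSE.
    """
    return conf.get("rerun", "FALSE").upper() == "TRUE"


# Illustrative sample only, not copied from the real Toolforge queue config.
SAMPLE_QUEUE_CONF = """\
qname                 webgrid-lighttpd
seq_no                0
rerun                 FALSE
"""

if __name__ == "__main__":
    conf = parse_queue_conf(SAMPLE_QUEUE_CONF)
    print(conf["qname"], "rerunable:", queue_is_rerunable(conf))
```

A check like this could run over each webgrid queue after a config change to confirm that none of them still has rerun set to TRUE.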

Maintenance_bot removed a project: Patch-For-Review. Jan 14 2020, 1:11 AM

Change 564174 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: set the webgrid queues to not rerunable

https://gerrit.wikimedia.org/r/564174

gerritbot added a project: Patch-For-Review. Jan 14 2020, 1:12 AM

Change 564174 merged by Bstorm:
[operations/puppet@production] gridengine: set the webgrid queues to not rerunable

https://gerrit.wikimedia.org/r/564174

Maintenance_bot removed a project: Patch-For-Review. Jan 14 2020, 4:10 PM

Mentioned in SAL (#wikimedia-cloud) [2020-01-16T16:45:24Z] <bstorm_> ran configurator to set the gridengine web queues to rerun FALSE T242397

Where job 2937052 happens to be bd808-test2:

bstorm@tools-sgegrid-master:/data/project/.system_sge/gridengine/etc/queues$ sudo qmod -rj 2937052
The job 2937052 is running in queue none where jobs are not rerunable

I call this done.

Note: I don't have any idea why the output says "queue none":
hard_queue_list: webgrid-lighttpd
Also
2937052 0.32584 lighttpd-b tools.bd808- r 12/12/2019 12:48:52 webgrid-li MASTER

So that must just be a bug in SGE. The depooling script looks for "not rerunable" and kills the job.
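As a rough sketch of that last step (not the real depooling script, which is not shown here), the decision could look like the following. The function names are hypothetical; the only things taken from this task are the "not rerunable" wording in the `qmod -rj` output and the fact that the job is then killed. Using `qdel` for the kill is an assumption.

```python
from typing import List, Optional


def is_not_rerunable(qmod_output: str) -> bool:
    """Return True when `qmod -rj` reported that the job cannot be rerun."""
    return "not rerunable" in qmod_output


def plan_depool_action(job_id: str, qmod_output: str) -> Optional[List[str]]:
    """Return the command to run next, or None when nothing more is needed.

    Assumption: a job that cannot be rescheduled is deleted with `qdel`, so
    that it is stopped as a grid job and the webservice monitor restarts it.
    """
    if is_not_rerunable(qmod_output):
        return ["qdel", job_id]
    return None


if __name__ == "__main__":
    message = (
        "The job 2937052 is running in queue none where jobs are not rerunable"
    )
    print(plan_depool_action("2937052", message))
```

Matching on the message text is fragile, but it matches what the comment above describes the script as doing.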

zhuyifei1999 mentioned this in T115231: dplbot webservice on Toolforge repeatedly have its dynamicproxy entry removed (because qsub schedules tasks to webgrid queues, causing portreleaser to run as job epilogue). Jan 27 2020, 7:36 AM
