qmaster chokes on old jobs from hosts that have been renamed
Closed, Resolved (Public)

Authored By

	scfc
	Sep 24 2015, 3:12 PM

Description

This is more to describe what happened: After restarting tools-webgrid-generic-1404, qmaster died constantly with /var/lib/gridengine/spool/qmaster/messages saying:

09/24/2015 14:56:01|worker|tools-master|E|execd@tools-exec-1218.eqiad.wmflabs reports running job (11474.1/master) in queue "task@tools-exec-1218.eqiad.wmflabs" that was not supposed to be there - killing
09/24/2015 14:56:09|worker|tools-master|E|writing job finish information: can't locate queue "webgrid-generic@tools-webgrid-generic-1404.eqiad.wmflabs"
09/24/2015 14:56:09|worker|tools-master|W|job 1766173.1 failed on host <unknown host> before writing exit_status because: shepherd exited with exit status 19: before writing exit_status
09/24/2015 14:56:09|worker|tools-master|C|!!!!!!!!!! got NULL element for QU_rerun !!!!!!!!!!
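
To spot which hosts qmaster can no longer resolve, the messages file can be scanned for the "can't locate queue" error. The following is a minimal sketch, not an official Grid Engine tool: it assumes every line has the five pipe-separated fields seen in the excerpt above (timestamp, thread, host, level, text), and the names `parse_line` and `unlocatable_queue_hosts` are made up for this illustration.

```python
import re
from typing import Iterable, NamedTuple, Optional, Set


class QmasterMessage(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str  # single letter as it appears in the log, e.g. E, W or C
    text: str


# Queue instances appear in the log text in double quotes as "queue@host".
QUEUE_INSTANCE = re.compile(r'"([^"@\s]+)@([^"\s]+)"')


def parse_line(line: str) -> Optional[QmasterMessage]:
    """Split one log line into its five fields; return None if it does not fit."""
    # Assumption: the first four '|' characters separate the fields, and the
    # message text (which may itself contain '|') is everything after them.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return QmasterMessage(*parts)


def unlocatable_queue_hosts(lines: Iterable[str]) -> Set[str]:
    """Collect host names from queue instances that qmaster could not locate."""
    hosts = set()
    for line in lines:
        msg = parse_line(line)
        if msg is None or "can't locate queue" not in msg.text:
            continue
        for _queue, host in QUEUE_INSTANCE.findall(msg.text):
            hosts.add(host)
    return hosts
```

Passing an open handle on /var/lib/gridengine/spool/qmaster/messages to `unlocatable_queue_hosts` would return the set of host names that appear in such errors. For the excerpt above that is only tools-webgrid-generic-1404.eqiad.wmflabs, since the tools-exec-1218 line is a different error.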

So I added the host back to host_aliases, restarted gridengine-master and gridengine-exec on the host and everything seems to be fine so far.

I'll try again (restarting gridengine-master and gridengine-exec with host_aliases not containing the host) in a few hours to see if the old jobs just needed to be purged from some list.

More importantly, I'm interested in how to avoid this :-). I had looked at qhost -h $hostname and it was empty, and so I never would have assumed that a reference to that host was anywhere. I rechecked that it didn't appear in any queue execution host lists, so the only reference indeed seems to have been the host itself.

Related Objects

Status	Assigned	Task
Resolved	Bstorm	T204530 cloudvps: tools and toolsbeta trusty deprecation
Resolved	aborrero	T187219 Remove support for Trusty Grid Engine exec hosts
Resolved	None	T63484 Make tools-mail route mail for @tools-*.pmtpa.wmflabs correctly
Resolved	bd808	T109485 Remove modules/toollabs/files/host_aliases
Resolved	valhallasw	T113614 qmaster chokes on old jobs from hosts that have been renamed

Event Timeline

scfc created this task. Sep 24 2015, 3:12 PM

scfc set the priority of this task to Medium.

scfc updated the task description.

scfc added projects: Cloud-Services, Toolforge.

scfc added subscribers: yuvipanda, gerritbot, scfc, Aklapper.

Shutting down gridengine-exec on the host before restarting qmaster might help. Or maybe we should just rebuild exec hosts and delete the old ones if this gives too many issues otherwise...

Changing the order of shutting down would make sense :-). I'll try that next time.
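
The ordering suggested above can be written down as an explicit plan. This is only a sketch: nobody in this thread has confirmed that the ordering avoids the crash, the use of the `service` command to control the daemons is an assumption, and `restart_plan` is a made-up helper that builds the steps without running anything.

```python
from typing import List, Tuple

# One step: (host to run the command on, command as an argv list).
Step = Tuple[str, List[str]]


def restart_plan(master_host: str, exec_host: str) -> List[Step]:
    """Return the restart steps in the suggested order: stop execd first."""
    return [
        # Stop execd on the renamed host so it cannot report its old jobs.
        (exec_host, ["service", "gridengine-exec", "stop"]),
        # Restart qmaster while that execd is down.
        (master_host, ["service", "gridengine-master", "restart"]),
        # Bring execd back once qmaster is running again.
        (exec_host, ["service", "gridengine-exec", "start"]),
    ]


def format_plan(plan: List[Step]) -> str:
    """Render the plan as numbered lines for review before running it by hand."""
    return "\n".join(
        f"{n}. on {host}: {' '.join(argv)}"
        for n, (host, argv) in enumerate(plan, start=1)
    )
```

Calling `print(format_plan(restart_plan("tools-master", "tools-webgrid-generic-1404")))` lists the three steps in order, so the sequence can be checked before anything is executed.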

scfc mentioned this in T109485: Remove modules/toollabs/files/host_aliases. Sep 26 2015, 3:29 AM

valhallasw closed this task as Resolved. Oct 4 2015, 11:53 AM

valhallasw claimed this task.

Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:47 PM
