
qmaster chokes on old jobs from hosts that have been renamed
Closed, Resolved | Public

Description

This is mostly to document what happened: after restarting tools-webgrid-generic-1404, qmaster kept dying, with /var/lib/gridengine/spool/qmaster/messages saying:

09/24/2015 14:56:01|worker|tools-master|E|execd@tools-exec-1218.eqiad.wmflabs reports running job (11474.1/master) in queue "task@tools-exec-1218.eqiad.wmflabs" that was not supposed to be there - killing
09/24/2015 14:56:09|worker|tools-master|E|writing job finish information: can't locate queue "webgrid-generic@tools-webgrid-generic-1404.eqiad.wmflabs"
09/24/2015 14:56:09|worker|tools-master|W|job 1766173.1 failed on host <unknown host> before writing exit_status because: shepherd exited with exit status 19: before writing exit_status
09/24/2015 14:56:09|worker|tools-master|C|!!!!!!!!!! got NULL element for QU_rerun !!!!!!!!!!

So I added the host back to host_aliases, restarted gridengine-master and gridengine-exec on the host and everything seems to be fine so far.

I'll try again (restarting gridengine-master and gridengine-exec with host_aliases not containing the host) in a few hours to see if the old jobs just needed to be purged from some list.

More importantly, I'm interested in how to avoid this :-). I had looked at qhost -h $hostname and the output was empty, so I would never have assumed that a reference to that host remained anywhere. I rechecked that it didn't appear in any queue's execution host list, so the only remaining reference indeed seems to have been on the host itself.
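Since qhost showed nothing, one other place to look for lingering references is the qmaster messages file itself. The following is a minimal sketch, not an existing tool: it assumes the pipe-separated layout seen in the excerpt above (timestamp, thread, master host, one-letter level, message) and that queue instances appear in double quotes as queue@host. The names parse_line and hosts_referenced are hypothetical.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional, Set


class LogEntry(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one messages line into its five fields, or return None."""
    # Split on the first four pipes only, in case the message contains "|".
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return LogEntry(*parts)


# Assumption: queue instances are written as "queue@host" in double quotes,
# as in the excerpt above. Unquoted names such as execd@host are ignored.
QUEUE_INSTANCE = re.compile(r'"([\w.-]+)@([\w.-]+)"')


def hosts_referenced(lines: Iterable[str],
                     levels: Iterable[str] = ("E", "C")) -> Dict[str, Set[str]]:
    """Map each host named in a quoted queue instance to its queue names.

    Only entries whose level field is in `levels` are considered.
    """
    wanted = set(levels)
    found: Dict[str, Set[str]] = {}
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry.level not in wanted:
            continue
        for queue, host in QUEUE_INSTANCE.findall(entry.message):
            found.setdefault(host, set()).add(queue)
    return found
```

Feeding it the lines of /var/lib/gridengine/spool/qmaster/messages would list the hosts that qmaster still complains about, which can then be compared against host_aliases before a restart.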

Event Timeline

scfc created this task. Sep 24 2015, 3:12 PM
scfc updated the task description.
scfc raised the priority of this task to Normal.
scfc added projects: Cloud-Services, Toolforge.
scfc added subscribers: yuvipanda, gerritbot, scfc, Aklapper.

Shutting down gridengine-exec on the host before restarting qmaster might help. Or maybe we should just rebuild exec hosts and delete the old ones if this gives too many issues otherwise...
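The suggested ordering can be written down as a dry-run plan. This is only a sketch under stated assumptions: the service names gridengine-exec and gridengine-master come from this task, but the ssh/sudo/service invocation, the final start step, and the function name restart_plan are illustrative guesses rather than a tested procedure. Nothing is executed; the function only returns the commands in order.

```python
from typing import List


def restart_plan(exec_host: str, master_host: str) -> List[List[str]]:
    """Return the commands, in order, for restarting qmaster safely.

    Assumption: both services are managed through `service` and reached
    over ssh with sudo; adjust to the local setup.
    """
    return [
        # 1. Stop execd on the affected host first, so it cannot report
        #    stale jobs while qmaster is coming up.
        ["ssh", exec_host, "sudo", "service", "gridengine-exec", "stop"],
        # 2. Only then restart qmaster.
        ["ssh", master_host, "sudo", "service", "gridengine-master", "restart"],
        # 3. Assumed follow-up step: bring execd back once qmaster is up.
        ["ssh", exec_host, "sudo", "service", "gridengine-exec", "start"],
    ]


if __name__ == "__main__":
    for cmd in restart_plan("tools-webgrid-generic-1404", "tools-master"):
        print(" ".join(cmd))
```

Printing the plan first, rather than running it, leaves room to check each step by hand on a production grid.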

scfc added a comment. Sep 26 2015, 3:17 AM

Changing the order of shutting down would make sense :-). I'll try that next time.

valhallasw closed this task as Resolved. Oct 4 2015, 11:53 AM
valhallasw claimed this task.