
Warnings/errors in /var/lib/gridengine/spool/qmaster/messages
Closed, Declined (Public)

Description

/var/lib/gridengine/spool/qmaster/messages is full of:

scfc@tools-bastion-03:~$ tail /var/lib/gridengine/spool/qmaster/messages
12/06/2016 02:55:42|schedu|tools-grid-master|E|unable to find job 760793 from the scheduler order package
12/06/2016 02:55:48|worker|tools-grid-master|E|execd@tools-webgrid-lighttpd-1208.eqiad.wmflabs reports running job (4594249.1/master) in queue "webgrid-lighttpd@tools-webgrid-lighttpd-1208.eqiad.wmflabs" that was not supposed to be there - killing
12/06/2016 02:56:16|worker|tools-grid-master|W|unable to find job 760804 from the scheduler order package
12/06/2016 02:56:16|worker|tools-grid-master|W|Skipping remaining 0 orders
12/06/2016 02:56:16|schedu|tools-grid-master|E|unable to find job 760804 from the scheduler order package
12/06/2016 02:56:17|worker|tools-grid-master|W|unable to find job 760805 from the scheduler order package
12/06/2016 02:56:17|worker|tools-grid-master|W|Skipping remaining 0 orders
12/06/2016 02:56:17|schedu|tools-grid-master|E|unable to find job 760805 from the scheduler order package
12/06/2016 02:56:17|worker|tools-grid-master|E|got load report of unknown exec host "tools-exec-1204.eqiad.wmflabs"
12/06/2016 02:56:28|worker|tools-grid-master|E|execd@tools-webgrid-lighttpd-1208.eqiad.wmflabs reports running job (4594249.1/master) in queue "webgrid-lighttpd@tools-webgrid-lighttpd-1208.eqiad.wmflabs" that was not supposed to be there - killing
scfc@tools-bastion-03:~$

So the gridengine master apparently needs to be told to discard these stale jobs and hosts instead of re-examining them every few seconds.
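To get a feel for which messages dominate the file, the excerpt above can be grouped by severity and message shape. This is only a rough sketch: it assumes the five pipe-separated fields visible in the excerpt (timestamp, thread, host, level, message), and the function names `parse_line` and `summarize` are made up for illustration, not part of gridengine.

```python
import re
from collections import Counter


def parse_line(line):
    """Split one messages line into its five pipe-separated fields.

    Returns None for lines that do not match the assumed layout
    (for example a shell prompt pasted along with the log).
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, message = parts
    return {
        "timestamp": timestamp,
        "thread": thread.strip(),
        "host": host,
        "level": level,
        "message": message,
    }


def summarize(lines):
    """Count messages per (level, template).

    Every run of digits is replaced by "N", so messages that differ only
    in a job ID (or a host number) fall into the same bucket.
    """
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        template = re.sub(r"\d+", "N", record["message"])
        counts[(record["level"], template)] += 1
    return counts


if __name__ == "__main__":
    # Path taken from the task description; reading it needs suitable permissions.
    with open("/var/lib/gridengine/spool/qmaster/messages", errors="replace") as fh:
        for (level, template), n in summarize(fh).most_common(10):
            print(f"{n:8d} {level} {template}")
```

Collapsing digits also merges host names that differ only by number, which is acceptable for a quick overview but hides which host a message refers to.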

I believe we had a similar situation in the past, and IIRC then @valhallasw looked up the necessary commands to solve that. @valhallasw, am I remembering correctly? Do you still know what you did?

Event Timeline

Restricted Application added a subscriber: Aklapper.

The only thing I can find is T122638: GridEngine down due to bdb issues, but that has no clear solution. A reboot of the master might solve it (but do we dare to do so?). The ghost job on t-w-l-1208 can probably be solved by a reboot of that specific host.

I have asked on serverfault.com about the hosts that are gone, and tomorrow I will ask about "unable to find job 802469 from the scheduler order package" and "execd@tools-webgrid-lighttpd-1208.eqiad.wmflabs reports running job (4594249.1/master) in queue "webgrid-lighttpd@tools-webgrid-lighttpd-1208.eqiad.wmflabs" that was not supposed to be there - killing" (a quota of one question per 40 minutes is in effect). I will also post the questions to the mailing list (users@gridengine.org) and then update serverfault.com. (I don't have much confidence that there is a sizable gridengine community on serverfault.com, but I really like StackExchange's way of presenting canonical, "good" answers.)

(IIRC, restarting the master process is no problem, as it does not keep state information in memory, in contrast to the execds. But I wouldn't consider this issue important enough to do that.)

T151980 changed host_aliases, but the grid master was probably not restarted afterwards, so it was still working with a reference to that host. I therefore decided to restart it. Et voilà, the warnings about the load reports are gone.

I will disable, drain, reboot and enable tools-webgrid-lighttpd-1208 because (in addition to the master messages) its /var/spool/gridengine/execd/tools-webgrid-lighttpd-1208/messages is full of:

12/08/2016 03:23:22|  main|tools-webgrid-lighttpd-1208|W|can't read pid from pid file "active_jobs/4594249.1/pid" of shepherd for job active_jobs/4594249.1

As this is a web host, it won't be that disruptive.
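The thread does not record which commands were used for the disable, drain and enable steps. As a hedged sketch only, the plan could look like the following, assuming the standard `qmod -d` / `qmod -e` switches and a `qstat` query for running jobs; the helper name `maintenance_plan` and the `*@host` queue-instance pattern are illustrative choices, and the reboot itself is left out.

```python
import shlex


def maintenance_plan(exec_host):
    """Build (but do not run) the commands for taking one exec host out of rotation.

    Assumption: every queue instance on the host is addressed with the
    wildcard "*@<host>". Adjust this if only specific queues are affected.
    """
    queue_instances = f"*@{exec_host}"
    return {
        # Stop the scheduler from placing new jobs on the host.
        "disable": ["qmod", "-d", queue_instances],
        # Drain: repeat until this lists no running jobs (or reschedule them).
        "check_running": ["qstat", "-u", "*", "-s", "r", "-q", queue_instances],
        # After the reboot, put the queue instances back into service.
        "enable": ["qmod", "-e", queue_instances],
    }


if __name__ == "__main__":
    plan = maintenance_plan("tools-webgrid-lighttpd-1208.eqiad.wmflabs")
    for step, argv in plan.items():
        # shlex.join quotes the "*" so the line can be pasted into a shell.
        print(f"{step}: {shlex.join(argv)}")
```

Printing the commands instead of executing them keeps the sketch harmless; whoever runs them still has to decide when the host counts as drained.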

Rebooting tools-webgrid-lighttpd-1208 was not enough: I had to remove the directory /var/spool/gridengine/execd/tools-webgrid-lighttpd-1208/active_jobs/4594249.1. Unfortunately, I didn't use the opportunity to test whether execd would have picked up that change on its own; instead I rebooted the instance again.
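Before removing anything by hand, it may help to list which job directories are in the state the execd warning describes. The sketch below is an assumption-laden illustration: it infers from the warning text that each directory under `active_jobs/` should contain a `pid` file holding an integer, and the function name `find_unreadable_pid_dirs` is hypothetical. It only reports candidates and deletes nothing.

```python
from pathlib import Path


def find_unreadable_pid_dirs(execd_spool):
    """Return active_jobs subdirectories whose "pid" file is missing or not an integer.

    execd_spool is the per-host spool directory, e.g.
    /var/spool/gridengine/execd/tools-webgrid-lighttpd-1208
    """
    active_jobs = Path(execd_spool) / "active_jobs"
    suspects = []
    if not active_jobs.is_dir():
        return suspects
    for job_dir in sorted(active_jobs.iterdir()):
        if not job_dir.is_dir():
            continue
        try:
            # A healthy entry is assumed to hold a single process ID.
            int((job_dir / "pid").read_text().strip())
        except (OSError, ValueError):
            suspects.append(job_dir)
    return suspects


if __name__ == "__main__":
    spool = "/var/spool/gridengine/execd/tools-webgrid-lighttpd-1208"
    for job_dir in find_unreadable_pid_dirs(spool):
        print(job_dir)
```

A directory can also lack a readable pid file for a moment while a job is starting, so anything this reports should be cross-checked against the running jobs before it is removed.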

Remaining /var/lib/gridengine/spool/qmaster/messages:

12/08/2016 03:42:05|worker|tools-grid-master|W|unable to find job 1383 from the scheduler order package
12/08/2016 03:42:05|worker|tools-grid-master|W|Skipping remaining 0 orders
12/08/2016 03:42:05|schedu|tools-grid-master|E|unable to find job 1383 from the scheduler order package
12/08/2016 03:42:23|worker|tools-grid-master|W|unable to find job 1422 from the scheduler order package
12/08/2016 03:42:23|worker|tools-grid-master|W|Skipping remaining 0 orders
12/08/2016 03:42:24|schedu|tools-grid-master|E|unable to find job 1422 from the scheduler order package
12/08/2016 03:42:24|worker|tools-grid-master|W|unable to find job 1423 from the scheduler order package
12/08/2016 03:42:24|worker|tools-grid-master|W|Skipping remaining 0 orders
12/08/2016 03:42:24|schedu|tools-grid-master|E|unable to find job 1423 from the scheduler order package
12/08/2016 03:43:07|worker|tools-grid-master|W|unable to find job 1424 from the scheduler order package
12/08/2016 03:43:07|worker|tools-grid-master|W|Skipping remaining 0 orders
12/08/2016 03:43:07|schedu|tools-grid-master|E|unable to find job 1424 from the scheduler order package
12/08/2016 03:43:17|worker|tools-grid-master|W|unable to find job 1428 from the scheduler order package
12/08/2016 03:43:17|worker|tools-grid-master|W|Skipping remaining 3 orders
12/08/2016 03:43:17|schedu|tools-grid-master|E|unable to find job 1428 from the scheduler order package
12/08/2016 03:43:24|worker|tools-grid-master|W|unable to find job 1432 from the scheduler order package
12/08/2016 03:43:24|worker|tools-grid-master|W|Skipping remaining 0 orders
12/08/2016 03:43:24|schedu|tools-grid-master|E|unable to find job 1432 from the scheduler order package

bd808 subscribed.

The job grid has been completely rebuilt since this was last updated.