
Warnings/errors in /var/lib/gridengine/spool/qmaster/messages
Open, Low, Public

Description

/var/lib/gridengine/spool/qmaster/messages is full of:

scfc@tools-bastion-03:~$ tail /var/lib/gridengine/spool/qmaster/messages
12/06/2016 02:55:42|schedu|tools-grid-master|E|unable to find job 760793 from the scheduler order package
12/06/2016 02:55:48|worker|tools-grid-master|E|execd@tools-webgrid-lighttpd-1208.eqiad.wmflabs reports running job (4594249.1/master) in queue "webgrid-lighttpd@tools-webgrid-lighttpd-1208.eqiad.wmflabs" that was not supposed to be there - killing
12/06/2016 02:56:16|worker|tools-grid-master|W|unable to find job 760804 from the scheduler order package
12/06/2016 02:56:16|worker|tools-grid-master|W|Skipping remaining 0 orders
12/06/2016 02:56:16|schedu|tools-grid-master|E|unable to find job 760804 from the scheduler order package
12/06/2016 02:56:17|worker|tools-grid-master|W|unable to find job 760805 from the scheduler order package
12/06/2016 02:56:17|worker|tools-grid-master|W|Skipping remaining 0 orders
12/06/2016 02:56:17|schedu|tools-grid-master|E|unable to find job 760805 from the scheduler order package
12/06/2016 02:56:17|worker|tools-grid-master|E|got load report of unknown exec host "tools-exec-1204.eqiad.wmflabs"
12/06/2016 02:56:28|worker|tools-grid-master|E|execd@tools-webgrid-lighttpd-1208.eqiad.wmflabs reports running job (4594249.1/master) in queue "webgrid-lighttpd@tools-webgrid-lighttpd-1208.eqiad.wmflabs" that was not supposed to be there - killing
scfc@tools-bastion-03:~$

So the gridengine master apparently needs to learn to discard those messages instead of re-examining them every few seconds.
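To see which of these messages actually dominate the file, the lines can be grouped by their text. The following is a minimal sketch, assuming the pipe-separated layout visible in the excerpt above (timestamp, thread, host, level, text); the function names `parse_line` and `summarize` are made up for illustration and are not part of gridengine.

```python
import re
from collections import Counter


def parse_line(line):
    """Split one messages line into (timestamp, thread, host, level, text).

    Returns None for lines that do not have the five pipe-separated
    fields seen in the excerpt (e.g. a shell prompt).
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, text = parts
    # execd logs pad the thread name with spaces ("  main"), so strip it.
    return timestamp, thread.strip(), host, level, text


def summarize(lines):
    """Count messages per (level, text), with job numbers masked.

    "job 760804" and "job 760805" are folded into "job <id>" so that
    repeats of the same complaint are counted together.
    """
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, _, _, level, text = parsed
        counts[(level, re.sub(r"\bjob \d+", "job <id>", text))] += 1
    return counts
```

Calling `summarize(open("/var/lib/gridengine/spool/qmaster/messages"))` and then `.most_common(5)` on the result would list the most frequent complaints first.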

I believe we had a similar situation in the past, and IIRC then @valhallasw looked up the necessary commands to solve that. @valhallasw, am I remembering correctly? Do you still know what you did?

Event Timeline

scfc created this task. Dec 6 2016, 3:04 AM
Restricted Application added a project: Cloud-Services. Dec 6 2016, 3:04 AM
Restricted Application added a subscriber: Aklapper.

The only thing I can find is T122638: GridEngine down due to bdb issues, but that has no clear solution. A reboot of the master might solve it (but do we dare to do so?). The ghost job on t-w-l-1208 can probably be solved by a reboot of that specific host.

scfc added a comment. Dec 7 2016, 4:32 AM

I have asked on serverfault.com regarding gone hosts, and will do so tomorrow for "unable to find job 802469 from the scheduler order package" and "execd@tools-webgrid-lighttpd-1208.eqiad.wmflabs reports running job (4594249.1/master) in queue "webgrid-lighttpd@tools-webgrid-lighttpd-1208.eqiad.wmflabs" that was not supposed to be there - killing" (a quota of one question per 40 minutes is in effect). I will also post the questions to the mailing list (users@gridengine.org) and then update serverfault.com. (I don't have much confidence that there is a sizable gridengine community on serverfault.com, but I really like StackExchange's way of presenting canonical, "good" answers.)

scfc added a comment. Dec 7 2016, 4:35 AM

(IIRC, restarting the master (the process, that is) is no problem, as it does not keep state information in memory, in contrast to the execds. But I wouldn't consider this issue important enough to do that.)

scfc added a comment. Dec 8 2016, 3:25 AM

T151980 changed host_aliases, but the grid master was probably not restarted afterwards, so it was still working with a reference to that host. Therefore I decided to restart it. Et voilà, the errors about the load reports are gone.

I will disable, drain, reboot and enable tools-webgrid-lighttpd-1208 because (in addition to the master messages) its /var/spool/gridengine/execd/tools-webgrid-lighttpd-1208/messages is full of:

12/08/2016 03:23:22|  main|tools-webgrid-lighttpd-1208|W|can't read pid from pid file "active_jobs/4594249.1/pid" of shepherd for job active_jobs/4594249.1

As this is a web host, it won't be that disruptive.
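For reference, the disable/drain/enable cycle can be written down as a small helper. This is only a sketch under assumptions: it uses `qmod -d` and `qmod -e` on all queue instances of the host via the `*@host` wildcard and `qstat -u '*' -q` to check for remaining jobs. It is not necessarily the exact set of commands run here, and the helper name `maintenance_commands` is hypothetical. The function only builds the command lines and does not execute anything.

```python
def maintenance_commands(host):
    """Return the gridengine command lines for each maintenance phase.

    Nothing is executed here; the caller decides when to run each one
    (and reboots the host between "list_jobs" showing no jobs and "enable").
    """
    # Assumption: "*@<host>" selects every queue instance on that host.
    queues = f"*@{host}"
    return {
        # Stop new jobs from being scheduled onto the host.
        "disable": ["qmod", "-d", queues],
        # List jobs of all users still in those queue instances.
        "list_jobs": ["qstat", "-u", "*", "-q", queues],
        # Put the queue instances back into service after the reboot.
        "enable": ["qmod", "-e", queues],
    }
```

Each list can be passed to `subprocess.run` from an account with gridengine admin rights, for example `subprocess.run(maintenance_commands("tools-webgrid-lighttpd-1208.eqiad.wmflabs")["disable"], check=True)`.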

scfc added a comment. Dec 8 2016, 3:44 AM

Rebooting tools-webgrid-lighttpd-1208 was not enough: I had to remove the directory /var/spool/gridengine/execd/tools-webgrid-lighttpd-1208/active_jobs/4594249.1. Unfortunately, I didn't use the opportunity to test whether execd would have picked up that change, but rebooted the instance again.
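Should this happen again, leftover directories of this kind could be located before removing anything. The sketch below assumes the layout implied by the execd warning (one directory per job under `active_jobs`, each with a `pid` file) and treats a missing or unparsable `pid` file as a hint of a stale job. That is an assumption: a job that is just starting may not have written its pid yet, so the function only lists candidates and deletes nothing. The name `find_stale_job_dirs` is made up for illustration.

```python
from pathlib import Path


def find_stale_job_dirs(active_jobs_dir):
    """Return job directories whose "pid" file is missing or not a number.

    Only reports candidates; removing them is left to the admin.
    """
    stale = []
    for job_dir in sorted(Path(active_jobs_dir).iterdir()):
        if not job_dir.is_dir():
            continue
        pid_file = job_dir / "pid"
        try:
            # A healthy entry is assumed to hold the shepherd's pid as an integer.
            int(pid_file.read_text().strip())
        except (OSError, ValueError):
            # OSError: file missing or unreadable; ValueError: empty or garbage.
            stale.append(job_dir)
    return stale
```

On the host in question this would be called as `find_stale_job_dirs("/var/spool/gridengine/execd/tools-webgrid-lighttpd-1208/active_jobs")`.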

Remaining /var/lib/gridengine/spool/qmaster/messages:

12/08/2016 03:42:05|worker|tools-grid-master|W|unable to find job 1383 from the scheduler order package
12/08/2016 03:42:05|worker|tools-grid-master|W|Skipping remaining 0 orders
12/08/2016 03:42:05|schedu|tools-grid-master|E|unable to find job 1383 from the scheduler order package
12/08/2016 03:42:23|worker|tools-grid-master|W|unable to find job 1422 from the scheduler order package
12/08/2016 03:42:23|worker|tools-grid-master|W|Skipping remaining 0 orders
12/08/2016 03:42:24|schedu|tools-grid-master|E|unable to find job 1422 from the scheduler order package
12/08/2016 03:42:24|worker|tools-grid-master|W|unable to find job 1423 from the scheduler order package
12/08/2016 03:42:24|worker|tools-grid-master|W|Skipping remaining 0 orders
12/08/2016 03:42:24|schedu|tools-grid-master|E|unable to find job 1423 from the scheduler order package
12/08/2016 03:43:07|worker|tools-grid-master|W|unable to find job 1424 from the scheduler order package
12/08/2016 03:43:07|worker|tools-grid-master|W|Skipping remaining 0 orders
12/08/2016 03:43:07|schedu|tools-grid-master|E|unable to find job 1424 from the scheduler order package
12/08/2016 03:43:17|worker|tools-grid-master|W|unable to find job 1428 from the scheduler order package
12/08/2016 03:43:17|worker|tools-grid-master|W|Skipping remaining 3 orders
12/08/2016 03:43:17|schedu|tools-grid-master|E|unable to find job 1428 from the scheduler order package
12/08/2016 03:43:24|worker|tools-grid-master|W|unable to find job 1432 from the scheduler order package
12/08/2016 03:43:24|worker|tools-grid-master|W|Skipping remaining 0 orders
12/08/2016 03:43:24|schedu|tools-grid-master|E|unable to find job 1432 from the scheduler order package