
gridengine master dependencies are missing for gridengine_resources
Closed, ResolvedPublic

Description

It should be started by puppet, but:

Error: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]: Could not evaluate: Field 'shortcut' is required

After @scfc's comment below, I checked the logs, and it seems to be another issue altogether:

Info: Sleeping for 51 seconds (splay is enabled)
Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/labsprojectfrommetadata.rb
Info: Loading facts in /var/lib/puppet/lib/facter/physicalcorecount.rb
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/ganeti.rb
Info: Loading facts in /var/lib/puppet/lib/facter/lldp.rb
Info: Loading facts in /var/lib/puppet/lib/facter/initsystem.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_config_dir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/apt.rb
Info: Caching catalog for tools-grid-master.tools.eqiad.wmflabs
Info: Applying configuration version '1455835101'
error: commlib error: got select error (Connection refused)
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]: Could not evaluate: Execution of 'qconf -Mc /tmp/gridengine_resource20160218-1423-18h6wth' returned 1: error: commlib error: got select error (Connection refused)
ERROR: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error

Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[release]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[release]: Could not evaluate: Execution of 'qconf -Mc /tmp/gridengine_resource20160218-1423-1fhsrsi' returned 1: error: commlib error: got select error (Connection refused)
ERROR: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error

Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[exechosts]/Exec[track-exechosts]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[quotas]/Exec[track-quotas]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[adminhosts]/Exec[track-adminhosts]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[hostgroups]/Exec[track-hostgroups]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[queues]/Exec[track-queues]/returns: executed successfully
Notice: /Stage[main]/Toollabs::Master/Gridengine::Collectors::Hostgroups[@general]/Gridengine::Collector[@general]/Exec[collect-@general-resource]/returns: executed successfully
Notice: /Stage[main]/Toollabs::Master/Gridengine::Collectors::Queues[webgrid-lighttpd]/Gridengine::Collector[webgrid-lighttpd]/Exec[collect-webgrid-lighttpd-resource]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[submithosts]/Exec[track-submithosts]/returns: executed successfully
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]: Could not evaluate: Field 'shortcut' is required
Notice: /Stage[main]/Toollabs::Master/Gridengine::Collectors::Queues[webgrid-generic]/Gridengine::Collector[webgrid-generic]/Exec[collect-webgrid-generic-resource]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[checkpoints]/Exec[track-checkpoints]/returns: executed successfully
Notice: Finished catalog run in 18.20 seconds

Event Timeline

Restricted Application added subscribers: StudiesWorld, Aklapper.

I cannot replicate that:

scfc@tools-grid-master:~$ sudo puppet agent -t
Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/labsprojectfrommetadata.rb
Info: Loading facts in /var/lib/puppet/lib/facter/physicalcorecount.rb
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/ganeti.rb
Info: Loading facts in /var/lib/puppet/lib/facter/lldp.rb
Info: Loading facts in /var/lib/puppet/lib/facter/initsystem.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_config_dir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/apt.rb
Info: Caching catalog for tools-grid-master.tools.eqiad.wmflabs
Info: Applying configuration version '1455846456'
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[exechosts]/Exec[track-exechosts]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[quotas]/Exec[track-quotas]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[adminhosts]/Exec[track-adminhosts]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[hostgroups]/Exec[track-hostgroups]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[queues]/Exec[track-queues]/returns: executed successfully
Notice: /Stage[main]/Toollabs::Master/Gridengine::Collectors::Hostgroups[@general]/Gridengine::Collector[@general]/Exec[collect-@general-resource]/returns: executed successfully
Notice: /Stage[main]/Toollabs::Master/Gridengine::Collectors::Queues[webgrid-lighttpd]/Gridengine::Collector[webgrid-lighttpd]/Exec[collect-webgrid-lighttpd-resource]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[submithosts]/Exec[track-submithosts]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Exec[update-config-conf]/returns: executed successfully
Notice: /Stage[main]/Toollabs::Master/Gridengine::Collectors::Queues[webgrid-generic]/Gridengine::Collector[webgrid-generic]/Exec[collect-webgrid-generic-resource]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[checkpoints]/Exec[track-checkpoints]/returns: executed successfully
Notice: Finished catalog run in 44.79 seconds
scfc@tools-grid-master:~$
valhallasw renamed this task from Error: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]: Could not evaluate: Field 'shortcut' is required to gridengine-master does not come back up after reboot. Feb 19 2016, 8:32 AM
valhallasw reopened this task as Open.
valhallasw removed scfc as the assignee of this task.
valhallasw updated the task description.
valhallasw updated the task description.
valhallasw added a subscriber: scfc.
chasemp renamed this task from gridengine-master does not come back up after reboot to gridengine master dependencies are missing for gridengine_resources. Feb 19 2016, 5:34 PM

I figured out the deal here.

If the grid engine master process is stopped, you see:

error: commlib error: got select error (Connection refused)
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]: Could not evaluate: Execution of 'qconf -Mc /tmp/gridengine_resource20160219-14676-1oy3bz5' returned 1: error: commlib error: got select error (Connection refused)
ERROR: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error

Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[release]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[release]: Could not evaluate: Execution of 'qconf -Mc /tmp/gridengine_resource20160219-14676-1pmz2df' returned 1: error: commlib error: got select error (Connection refused)
ERROR: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error

Puppet tries to do some "smart" things that require the service but it doesn't have any of the appropriate dependencies.
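
The errors above boil down to qconf being run while nothing is listening on port 6444. As a rough illustration only (this is not what the gridengine_resource provider does, and the function names `qmaster_reachable` and `wait_for_qmaster` are made up for this sketch), a guard could probe the qmaster port before any qconf call is attempted:

```python
import socket
import time


def qmaster_reachable(host, port=6444, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts and name resolution failures.
        return False


def wait_for_qmaster(host, port=6444, deadline=30.0, interval=1.0):
    """Poll until the port accepts connections or the deadline (seconds) passes."""
    end = time.monotonic() + deadline
    while True:
        if qmaster_reachable(host, port):
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval)
```

A caller would only go on to run qconf when `wait_for_qmaster("tools-grid-master.tools.eqiad.wmflabs")` returns True. Polling like this only papers over the ordering problem; declaring the dependency in the manifest is the cleaner fix.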

I stopped the master briefly to make sure that https://gerrit.wikimedia.org/r/#/c/271800/ works.

tl;dr: it does work, and the master is now started (without starting duplicates), but it throws these warnings the first time around due to race conditions. Honestly, I want to get rid of the way gridengine_resources works now, so I'm not going to get into this today as it's not a big problem.

scfc assigned this task to chasemp.

The race condition should only exist on an instance when the Puppet role is applied for the very first time; on a rebooted instance, the grid master will be started by SysVInit before cron gets a chance to run Puppet.

scfc removed chasemp as the assignee of this task.

(Though the dependencies are indeed missing, so why not fix them …)
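
To make the effect of such a dependency concrete, here is a toy model, not Puppet's actual implementation: each resource lists what it requires, and a topological sort yields an order in which the required service always comes first. The title `Service[gridengine-master]` is an assumption made for this sketch; the Gridengine_resource titles are taken from the log in the description.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def apply_order(requires):
    """Return an order in which each resource follows everything it requires.

    `requires` maps a resource title to the set of titles it depends on.
    """
    return list(TopologicalSorter(requires).static_order())


# Hypothetical dependency edges: every gridengine_resource requires the master.
requires = {
    "Gridengine_resource[user_slot]": {"Service[gridengine-master]"},
    "Gridengine_resource[release]": {"Service[gridengine-master]"},
    "Gridengine_resource[h_vmem]": {"Service[gridengine-master]"},
}

order = apply_order(requires)
print(order)
```

With these edges the service is always applied before any of the three resources; without them, nothing constrains the order, which matches the first-run failures seen above.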

@valhallasw: This was by design as a, eh, robust fix for T122638; cf. ffe315326317adb4d53fc6d23983b842d89d1fce.

Ah, I vaguely remembered a patchset where the check-for-running was fixed, but I must have confused it with something else.

The race condition should only exist on an instance when the Puppet role is applied for the very first time; on a rebooted instance, the grid master will be started by SysVInit before cron gets a chance to run Puppet.

Well, SysVInit did not start gridengine yesterday, so something must have gone wrong there as well.

chasemp triaged this task as Medium priority. Apr 4 2016, 2:25 PM

Change 339921 had a related patch set uploaded (by Tim Landscheidt):
Tools: Require gridengine-master for gridengine_resource

https://gerrit.wikimedia.org/r/339921

Change 339921 merged by Andrew Bogott:
[operations/puppet@production] Tools: Require gridengine-master for gridengine_resource

https://gerrit.wikimedia.org/r/339921