
Test running a stretch exec node in the existing system on toolsbeta
Closed, Resolved (Public)

Description

To decide how best to move people's jobs between versions of gridengine, we need to know whether a stretch/sge node can run happily in the current cluster setup.

This task is to stand up a stretch exec node, tagged with a resource identifying its release, and see whether it can communicate with the older gridengine version on the existing grid.

Event Timeline

Bstorm triaged this task as Medium priority.

Interestingly, the grid master is down in toolsbeta, and puppet runs on it fail as well. Poking at that.

Apparently, there is a dependency on a particular directory that isn't in puppet:
06/06/2018 18:53:57| main|toolsbeta-grid-master|C|can't change to directory "/var/spool/gridengine/qmaster"

Found this in /tmp/sge_messages, which is only created if it cannot find other directories it needs.

This is getting added to the "puppetize me" list for SGE.

And another:

07/12/2018 19:20:25|  main|toolsbeta-grid-master|E|database directory /var/spool/gridengine/spooldb doesn't exist
07/12/2018 19:20:25|  main|toolsbeta-grid-master|E|startup of rule "default rule" in context "berkeleydb spooling" failed
07/12/2018 19:20:25|  main|toolsbeta-grid-master|C|setup failed
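The message lines quoted so far share a pipe-delimited layout: timestamp, thread, host, severity letter, text. A minimal Python sketch for pulling the serious entries out of a file such as /tmp/sge_messages might look like the following. The function and class names are made up for illustration, and treating C and E as the severities worth surfacing is an assumption based only on the lines quoted in this task.

```python
from typing import Iterable, List, NamedTuple, Optional


class SgeMessage(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_sge_message(line: str) -> Optional[SgeMessage]:
    """Split one message line into its five fields, or return None."""
    # Split on the first four pipes only, so a '|' inside the text survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return SgeMessage(*(part.strip() for part in parts))


def serious_messages(lines: Iterable[str], levels=("C", "E")) -> List[SgeMessage]:
    """Keep only parsed messages whose severity letter is in `levels`."""
    parsed = (parse_sge_message(line) for line in lines)
    return [msg for msg in parsed if msg is not None and msg.level in levels]


# Lines copied from this task.
SAMPLE = [
    '06/06/2018 18:53:57| main|toolsbeta-grid-master|C|can\'t change to directory "/var/spool/gridengine/qmaster"',
    "07/12/2018 19:20:25|  main|toolsbeta-grid-master|E|database directory /var/spool/gridengine/spooldb doesn't exist",
    '07/12/2018 19:20:25|  main|toolsbeta-grid-master|E|startup of rule "default rule" in context "berkeleydb spooling" failed',
    "07/12/2018 19:20:25|  main|toolsbeta-grid-master|C|setup failed",
]

if __name__ == "__main__":
    for msg in serious_messages(SAMPLE):
        print(msg.level, msg.text)
```

To use it against a real file, pass an open file handle instead of SAMPLE, for example serious_messages(open("/tmp/sge_messages")).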

After manual creation (for now), we are brought to:

07/12/2018 19:23:21|  main|toolsbeta-grid-master|E|couldn't open berkeley database "sge": (2) No such file or directory

I think much of this was a chicken or egg thing? All of the above should have been created by the package on install. This suggests that maybe puppet mounts NFS over the package install locations (which can be resolved). It also suggests that getting rid of the NFS config would be good, yet again.
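One way to test the "mounted over" theory is to check, for each directory the master complained about, whether it exists and which mount point it lives under. The following is only a sketch under that assumption: the helper names are hypothetical, and the path list is taken from the log lines and the init_cluster command in this task rather than from any authoritative list.

```python
import os
from typing import Dict, Iterable


def nearest_mount(path: str) -> str:
    """Return the closest ancestor of `path` (or `path` itself) that is a mount point."""
    current = os.path.abspath(path)
    while not os.path.ismount(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def report(paths: Iterable[str]) -> Dict[str, dict]:
    """For each path, record whether it is a directory and which mount it sits under."""
    return {
        path: {"exists": os.path.isdir(path), "mount": nearest_mount(path)}
        for path in paths
    }


# Paths mentioned in this task; adjust for the host being checked.
SGE_PATHS = [
    "/var/spool/gridengine/qmaster",
    "/var/spool/gridengine/spooldb",
    "/var/lib/gridengine/spool/spooldb",
]

if __name__ == "__main__":
    for path, info in report(SGE_PATHS).items():
        print(path, info)
```

A mount point other than / for any of these paths would be consistent with a filesystem being mounted over the package's install locations. Whether that mount is actually NFS still needs checking separately, for example with the mount command.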

To create the cluster in toolsbeta, I had to run:
su -s /bin/sh -c "/usr/share/gridengine/scripts/init_cluster /var/lib/gridengine default /var/lib/gridengine/spool/spooldb sgeadmin" sgeadmin

The service survives a puppet run, but puppet invariably complains about some things:

Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for toolsbeta-grid-master.toolsbeta.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1531433278'
Notice: /Stage[main]/Gridengine::Master/Service[gridengine-master]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Gridengine::Master/Service[gridengine-master]: Unscheduling refresh on Service[gridengine-master]
error: commlib error: got select error (Connection refused)
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]: Could not evaluate: Field 'shortcut' is required
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[release]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[release]: Could not evaluate: Field 'shortcut' is required
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]: Could not evaluate: Field 'shortcut' is required
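When comparing a run like this one against another host, it can help to collapse the Error lines into "which resources failed with which message". A rough Python sketch follows. The regular expression is an assumption fitted to the lines pasted above, not a general parser for puppet agent output, and the function name is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Matches lines such as:
#   Error: /Stage[main]/Some::Class/Type[title]: message text
ERROR_RE = re.compile(r"^Error: (?P<resource>/\S+): (?P<message>.*)$")
# Captures the title inside the final [...] of a resource path.
TITLE_RE = re.compile(r"\[([^\]]*)\]$")


def group_puppet_errors(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each error message to the titles of the resources that raised it."""
    grouped = defaultdict(list)
    for line in lines:
        match = ERROR_RE.match(line.strip())
        if not match:
            continue
        resource = match.group("resource")
        title = TITLE_RE.search(resource)
        grouped[match.group("message")].append(title.group(1) if title else resource)
    return dict(grouped)


# Lines copied from the run above.
SAMPLE_OUTPUT = """\
error: commlib error: got select error (Connection refused)
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[h_vmem]: Could not evaluate: Field 'shortcut' is required
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[release]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[release]: Could not evaluate: Field 'shortcut' is required
Notice: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]/ensure: created
Error: /Stage[main]/Toollabs::Master/Gridengine_resource[user_slot]: Could not evaluate: Field 'shortcut' is required
"""

if __name__ == "__main__":
    for message, titles in group_puppet_errors(SAMPLE_OUTPUT.splitlines()).items():
        print(message, "->", ", ".join(titles))
```

On the sample, all three Gridengine_resource failures fall under one message, which makes it easier to see that they probably share a single cause.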

I wonder if that happens in tools production?

Excellent, apparently now that the cluster is running in toolsbeta, puppet succeeds correctly.

Getting puppet to run on a new Trusty tools node requires downgrading libgdal-dev (just as a note). I presume this is because Trusty isn't really supported here anymore: libgdal was upgraded past the version that the libraries needed for the grid at WMF support.

With this fixed, it is clear that the structure of the grid is not puppetized, except for a portion of the toollabs module that adds complexes. To get this node up, I'm going to add configuration to the beta grid manually. In parallel, I'm working on puppetizing these features where possible: submit, exec and admin hosts, queues, and user and host groups are all apparently not in puppet, and they need to be for T195889 to go anywhere good.

Puppetizing is tracked under T88711.

Grid engine now works on toolsbeta. I am able to submit jobs! It only has a "task" queue, since I don't really feel like messing with web and mail stuff right now, but that makes it a sort of valid test environment.

It only has a "task" queue

I (mostly manually) set up all the queues for T190893. I wonder what happened.

Change 446990 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: stretch doesn't have an hhvm package

https://gerrit.wikimedia.org/r/446990

Change 446990 abandoned by Bstorm:
gridengine: stretch doesn't have an hhvm package

Reason:
Wrong approach. Install it on the stretch VM first instead :-p

https://gerrit.wikimedia.org/r/446990

Change 447089 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: Add package information for stretch exec nodes

https://gerrit.wikimedia.org/r/447089

Change 447089 merged by Bstorm:
[operations/puppet@production] gridengine: Add package information for stretch exec nodes

https://gerrit.wikimedia.org/r/447089

E: Failed to fetch http://tools-services-01.tools.eqiad.wmflabs/repo/dists/stretch-toolsbeta/main/binary-amd64/Packages 404 Not Found

Well, that's interesting. That might explain all the language packs that failed to install?

Nope. Fixed that. The problem is that those lang packs are specific to Ubuntu.

Change 447561 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: try to translate all the Ubuntu package calls to Debian

https://gerrit.wikimedia.org/r/447561

Change 447561 merged by Bstorm:
[operations/puppet@production] gridengine: try to translate all the Ubuntu package calls to Debian

https://gerrit.wikimedia.org/r/447561

Change 447727 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: some more exec node package cleanup for stretch

https://gerrit.wikimedia.org/r/447727

Change 447727 merged by Bstorm:
[operations/puppet@production] gridengine: some more exec node package cleanup for stretch

https://gerrit.wikimedia.org/r/447727

Change 447729 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: just a couple more changes to work with stretch

https://gerrit.wikimedia.org/r/447729

Change 447729 merged by Bstorm:
[operations/puppet@production] gridengine: just a couple more changes to work with stretch

https://gerrit.wikimedia.org/r/447729

Change 447835 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: correct a few issues with stretch exec packages

https://gerrit.wikimedia.org/r/447835

Change 447835 merged by Bstorm:
[operations/puppet@production] gridengine: correct a few issues with stretch exec packages

https://gerrit.wikimedia.org/r/447835

Change 447860 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: include the right profile for the right OS

https://gerrit.wikimedia.org/r/447860

Change 447860 abandoned by Bstorm:
gridengine: include the right profile for the right OS

Reason:
Linting issues won't allow this way of doing it.

https://gerrit.wikimedia.org/r/447860

Change 447914 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] gridengine: switch back to extended locales for debian

https://gerrit.wikimedia.org/r/447914

Change 447914 merged by Bstorm:
[operations/puppet@production] gridengine: switch back to extended locales for debian

https://gerrit.wikimedia.org/r/447914

So it appears that simply turning on a Stretch node configured to point at a Trusty master is enough to segfault the master. I haven't even added the exec node to the host list, so I find this extremely surprising. I'm not sure exactly what causes it, but the same thing has been seen with other versions of gridengine (6.1 and 6.2 occasionally did this when they coexisted). It might be related to the shared config and binaries: the nodes are not just communicating but sharing files.

This tells me that we absolutely must establish the new toolforge grid in parallel with no communication between the two. In some ways that simplifies the refactoring that also needs to happen around puppet for all this. However, it makes the transition much less tidy.

Also, while I haven't found the exact cause of the segfault yet, this implies that sharing NFS between the grids might be a no-no as well. Since the new node is not actually in any of the host lists yet, that may very well be true.

Ok, I can confirm that even with NFS unmounted and the execd stopped, running any gridengine binary at all (including qstat) will crash the qmaster with a segfault. A Son of Grid Engine grid must have its own master, config and nodes. There can be no communication on the grid level.

Curiosity satisfied. All future work on this must be done as a parallel grid, most likely with its own role names, to prevent issues and make refactoring simpler.