This task tracks work building on the deployment in toolsbeta (T200557) and replicating it in the tools project for users to try out.
It also covers testing and issue reporting by users before the new grid becomes the replacement for the tools gridengine, which is what we are all trying to move to.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Bstorm | T199271 Upgrade the tools gridengine system
Resolved | | Bstorm | T204530 cloudvps: tools and toolsbeta trusty deprecation
Resolved | | Bstorm | T212153 Stand up the new sonofgridengine and stretch grid for testing in toolforge
Resolved | | Bstorm | T212390 Basic lighttpd+php webservice fails to run on Stretch grid
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2018-12-17T22:16:19Z] <bstorm_> Adding a bunch of hiera values and prefixes for the new grid - T212153
Per T162955, the grid master will be a large instance. We probably should be running puppetdb and refreshing the puppetmaster, since the known hosts bit is done differently, but I'm going to hold off on that and see how it goes first. It's easy to test whether it is needed for how we did things in beta.
Change 480264 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: restrict the cluster from initializing itself
Change 480264 merged by Bstorm:
[operations/puppet@production] sonofgridengine: restrict the cluster from initializing itself
Gridengine cluster actually initialized itself correctly without the semi-manual run of the init script at the end!! The puppetization is finally correct.
Change 480291 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: restore ssh knownhosts hack for tools
Change 480291 merged by Bstorm:
[operations/puppet@production] sonofgridengine: restore ssh knownhosts hack for tools
Ok, so accepting that we really should introduce puppetdb when there is more time, the above patch keeps it all working like before without it.
At this point, this is the new grid:
tools-sgecron-01
tools-sgebastion-06
tools-sgeexec-0901/2
tools-sgewebgrid-generic-0901/2
tools-sgewebgrid-lighttpd-0901/2
tools-sgegrid-master
I have not configured a shadow yet, partly because they don't work right now. The configuration is managed using my script, which I need to document now that it's up like this. There are two exec hosts of each type for now. Adding an exec host is far easier than in the original docs, but I'll actually update the doc instead of putting that here. The global config is set correctly.
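For anyone scripting against the host list above, here is a minimal sketch that expands the `0901/2` shorthand into full host names. It assumes `/2` means "the same name with the last digit replaced by 2", which matches the statement that there are two exec hosts of each type. The helper name `expand_hosts` and the `NEW_GRID` list are illustrative only and are not part of any existing tooling.

```python
def expand_hosts(entry: str) -> list[str]:
    """Expand shorthand like 'name-0901/2' into ['name-0901', 'name-0902'].

    Assumption: each '/x' suffix replaces the trailing len(x) characters
    of the first host name. Entries without '/' are returned unchanged.
    """
    if "/" not in entry:
        return [entry]
    first, *suffixes = entry.split("/")
    hosts = [first]
    for suffix in suffixes:
        # Swap the tail of the first name for the alternate suffix.
        hosts.append(first[: len(first) - len(suffix)] + suffix)
    return hosts


# Host list copied from the comment above (shorthand kept as written).
NEW_GRID = [
    "tools-sgecron-01",
    "tools-sgebastion-06",
    "tools-sgeexec-0901/2",
    "tools-sgewebgrid-generic-0901/2",
    "tools-sgewebgrid-lighttpd-0901/2",
    "tools-sgegrid-master",
]

ALL_HOSTS = [host for entry in NEW_GRID for host in expand_hosts(entry)]

if __name__ == "__main__":
    for host in ALL_HOSTS:
        print(host)
```
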
The major differences from the current "main grid" are: root cannot run jobs, and only the master is an admin host. We'll see how long that lasts, but it is a much better security config.
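One way to keep an eye on the "only the master is an admin host" property is to compare the admin host list against an allow-list. The sketch below is a hypothetical check, not something already in the puppetization. It assumes the text passed in is the output of `qconf -sh` (one host name per line), and the FQDNs in the example are placeholders.

```python
def unexpected_admin_hosts(qconf_sh_output: str, allowed: set[str]) -> list[str]:
    """Return admin hosts that are not in the allow-list, sorted by name.

    Assumption: qconf_sh_output holds one host name per line, as printed
    by `qconf -sh`; blank lines are ignored.
    """
    hosts = [line.strip() for line in qconf_sh_output.splitlines() if line.strip()]
    return sorted(host for host in hosts if host not in allowed)


if __name__ == "__main__":
    # Placeholder output and domain, for illustration only.
    sample = "tools-sgegrid-master.example\ntools-sgebastion-06.example\n"
    extras = unexpected_admin_hosts(sample, {"tools-sgegrid-master.example"})
    print(extras)
```

An empty result would mean the admin host list matches the intended configuration; anything else lists the hosts to review.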
It's ready for us to try some things. If that works, please have users try it.
Note: k8s should work from projects on tools-sgebastion-06. If it doesn't, we got work to do :)
Change 480519 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: removing commented manual db init
Change 480519 merged by Bstorm:
[operations/puppet@production] sonofgridengine: removing commented manual db init
Since T211258 was successful, such as it is, I added a shadow master to the new tools grid as well.
I'm ready to call this one done. We still need to scale out the new cluster to be ready for lots of jobs to move to it, but the core grid is up and running well at this point.