Stand up the new sonofgridengine and stretch grid for testing in toolforge
Closed, Resolved · Public

Description

This task tracks work that builds on the deployment in toolsbeta (T200557) and replicates it in the tools project so users can try it out.
It also covers testing and issue reporting by users before the new grid becomes the replacement for the tools gridengine, which is where we are all trying to move.

Bstorm created this task. Mon, Dec 17, 7:46 PM
Bstorm triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. Mon, Dec 17, 7:46 PM

Adding the necessary prefixes to tools puppet settings in Horizon.

Mentioned in SAL (#wikimedia-cloud) [2018-12-17T22:16:19Z] <bstorm_> Adding a bunch of hiera values and prefixes for the new grid - T212153

Per T162955, the grid master will be a large instance. We probably should be running puppetdb and refresh the puppetmaster, since the known hosts bit is done differently, but I'm going to hold off on that and see how it goes at first. It's easy to test whether it is needed, given how we did things in beta.

Change 480264 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: restrict the cluster from initializing itself

https://gerrit.wikimedia.org/r/480264

Change 480264 merged by Bstorm:
[operations/puppet@production] sonofgridengine: restrict the cluster from initializing itself

https://gerrit.wikimedia.org/r/480264

Gridengine cluster actually initialized itself correctly without the semi-manual run of the init script at the end!! The puppetization is finally correct.

Stood up generic exec, webgrid, lighttpd and cron hosts in eqiad1-r.

Standing up a bastion as well (tools-sgebastion-06).

Change 480291 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: restore ssh knownhosts hack for tools

https://gerrit.wikimedia.org/r/480291

Change 480291 merged by Bstorm:
[operations/puppet@production] sonofgridengine: restore ssh knownhosts hack for tools

https://gerrit.wikimedia.org/r/480291

Ok, so accepting that we really should introduce puppetdb when there is more time, the above patch keeps it all working like before without it.

At this point, this is the new grid:
tools-sgecron-01
tools-sgebastion-06
tools-sgeexec-0901/2
tools-sgewebgrid-generic-0901/2
tools-sgewebgrid-lighttpd-0901/2
tools-sgegrid-master
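
For anyone scripting against the list above, here is a minimal sketch that expands the `0901/2` shorthand into full hostnames. It assumes `/2` means "the same name with the trailing digit replaced", i.e. `0901` and `0902`. The helper names `expand_hosts` and `expand_all` are hypothetical and not part of any existing tooling.

```python
from typing import List

# Host list copied from the comment above; "/2" is assumed to be shorthand
# for a second host whose name differs only in its trailing digit(s).
NEW_GRID = [
    "tools-sgecron-01",
    "tools-sgebastion-06",
    "tools-sgeexec-0901/2",
    "tools-sgewebgrid-generic-0901/2",
    "tools-sgewebgrid-lighttpd-0901/2",
    "tools-sgegrid-master",
]


def expand_hosts(shorthand: str) -> List[str]:
    """Expand 'name-0901/2' into ['name-0901', 'name-0902'] (hypothetical helper)."""
    base, *alternates = shorthand.split("/")
    hosts = [base]
    for suffix in alternates:
        # Each alternate replaces the same number of trailing characters of the base name.
        hosts.append(base[: len(base) - len(suffix)] + suffix)
    return hosts


def expand_all(entries: List[str]) -> List[str]:
    """Flatten a list of shorthand entries into individual hostnames."""
    return [host for entry in entries for host in expand_hosts(entry)]


if __name__ == "__main__":
    for host in expand_all(NEW_GRID):
        print(host)
```

Entries without a slash pass through unchanged, so the same function handles the single-host lines.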

I have not configured a shadow yet, partly because they don't work right now. The configuration is managed using my script, which I need to document now that it's up like this. There are two exec hosts of each type for now. Adding an exec host is far easier than in the original docs, but I'll actually update the doc instead of putting that here. The global config is set correctly.

The major differences from the current "main grid" are: root cannot run jobs and only the master is an admin host. We'll see how long that lasts, but it is a much better security config.
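
A rough sketch of how those two properties could be checked from captured `qconf` output. This thread does not say how the root restriction is implemented. The sketch assumes it is done through the `min_uid` setting in the global configuration (`qconf -sconf`), and that the admin host list comes from `qconf -sh`. The sample text and the `check_security` helper are hypothetical.

```python
from typing import Dict, List


def parse_sconf(text: str) -> Dict[str, str]:
    """Parse 'key   value' lines as printed by `qconf -sconf`, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def check_security(sconf_text: str, admin_hosts_text: str, master: str) -> List[str]:
    """Return a list of problems; an empty list means both properties hold."""
    problems = []
    # Assumption: root is kept from running jobs by min_uid being above 0.
    min_uid = int(parse_sconf(sconf_text).get("min_uid", "0"))
    if min_uid < 1:
        problems.append("min_uid is 0, so root (UID 0) may run jobs")
    # `qconf -sh` prints one admin host per line.
    admin_hosts = [h.strip() for h in admin_hosts_text.splitlines() if h.strip()]
    if master not in admin_hosts:
        problems.append("master is not listed as an admin host")
    extra = [h for h in admin_hosts if h != master]
    if extra:
        problems.append("unexpected admin hosts: " + ", ".join(extra))
    return problems


if __name__ == "__main__":
    # Hypothetical sample output, not captured from the real grid.
    sample_conf = "#global:\nmin_uid                      100\nmin_gid                      100\n"
    sample_admin = "tools-sgegrid-master\n"
    print(check_security(sample_conf, sample_admin, "tools-sgegrid-master"))
```

On a live grid the two text arguments would come from running the commands on the master; hostnames may be fully qualified there, so the comparison would need adjusting.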

It's ready for us to try some things. If that works, please have users try it.

Note: k8s should work from projects on tools-sgebastion-06. If it doesn't, we got work to do :)

Change 480519 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: removing commented manual db init

https://gerrit.wikimedia.org/r/480519

Change 480519 merged by Bstorm:
[operations/puppet@production] sonofgridengine: removing commented manual db init

https://gerrit.wikimedia.org/r/480519

Since T211258 was successful, such as it is, I added a shadow master to the new tools grid as well.

bd808 moved this task from Triage to In Progress on the Toolforge board. Sat, Jan 5, 5:12 PM
bd808 closed this task as Resolved. Wed, Jan 9, 5:19 PM
bd808 added a subscriber: bd808.

I'm ready to call this one done. We still need to scale out the new cluster to be ready for lots of jobs to move to it, but the core grid is up and running well at this point.