Right now tools-grid-master is a single-core instance that is easy to overwhelm. A lot of what made T161951: tools.iabot is overloading the grid by running too many workers in parallel impactful is the fact that a puppet run on the grid master sometimes eats 20-30% of its single CPU. Any time some overzealous tool hammers the control plane we are in trouble. The other consideration is that we have never rebuilt the master and have no confidence in its current Puppetization. We are on a single-core server we are not sure we can recreate :)
We should be able to use toolsbeta to make a simple and small grid and figure out the right way to persist the settings from our current master to another.
Previously we did breath-holding impactful maintenance on the grid master and used https://github.com/jtriley/gridscheduler/blob/master/source/dist/util/upgrade_modules/save_sge_config.sh and https://github.com/jtriley/gridscheduler/blob/master/source/dist/util/upgrade_modules/load_sge_config.sh to ensure we could at least recover current state (or so we hoped).
Much of what makes our current grid master setup complex is the dynamic on-NFS resource collection, which is mostly a solution looking for a problem. I propose we take all of that out and simplify things in the same fashion as https://gerrit.wikimedia.org/r/#/c/334203/ using a static config file (YAML?) and a conversion layer that configures the grid itself.
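To make the static-config idea a bit more concrete, here is a rough sketch of what the conversion layer could look like. It is only an illustration, not a worked-out design: the config keys, the hostnames, and the `render_queue` / `plan_commands` helpers are all made up for this example. The config is a plain dict here to keep the sketch self-contained; in practice it could be loaded from the YAML file. The only grid-specific pieces are the standard `qconf` options `-ah` (add admin host), `-as` (add submit host) and `-Aq` (add queue from file). A real queue definition needs many more attributes than the two shown.

```python
# Hypothetical static description of a small grid. In practice this could be
# loaded from a YAML file kept in version control; all names are placeholders.
GRID_CONFIG = {
    "admin_hosts": ["toolsbeta-grid-master"],
    "submit_hosts": ["toolsbeta-bastion-01"],
    "queues": {
        "task": {"hostlist": "@general", "slots": 50},
    },
}


def render_queue(name, settings):
    """Render a queue definition as 'key value' lines, the format qconf -Aq reads."""
    lines = [f"qname {name}"]
    for key in sorted(settings):
        lines.append(f"{key} {settings[key]}")
    return "\n".join(lines) + "\n"


def plan_commands(config, queue_dir="/tmp/grid-queues"):
    """Return the qconf invocations (as argument lists) that would apply config.

    Nothing is executed here; the caller decides whether to run or just
    print the plan. queue_dir is an assumed location for rendered queue files.
    """
    commands = []
    for host in config.get("admin_hosts", []):
        commands.append(["qconf", "-ah", host])
    for host in config.get("submit_hosts", []):
        commands.append(["qconf", "-as", host])
    for name in sorted(config.get("queues", {})):
        commands.append(["qconf", "-Aq", f"{queue_dir}/{name}"])
    return commands


if __name__ == "__main__":
    for command in plan_commands(GRID_CONFIG):
        print(" ".join(command))
```

Keeping the planning step separate from execution means the same file could be dry-run against toolsbeta first and diffed against what the current master reports before anything is applied.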