rebuild tools-grid-master as a large instance
Closed, Resolved (Public)

Description

Right now tools-grid-master is a single-core instance that is easy to overwhelm. A lot of what made T161951 (tools.iabot is overloading the grid by running too many workers in parallel) impactful is the fact that a Puppet run on the grid master sometimes eats 20-30% of its single CPU. Any time an overzealous tool hammers the control plane, we are in trouble. The other consideration is that we have never rebuilt the master and have no confidence in its current Puppetization. We are on a single-core server that we are not sure we can recreate :)

We should be able to use toolsbeta to build a simple, small grid and figure out the right way to carry the settings over from our current master to a new one.

Previously, when we did breath-holding, impactful maintenance on the grid master, we used https://github.com/jtriley/gridscheduler/blob/master/source/dist/util/upgrade_modules/save_sge_config.sh and https://github.com/jtriley/gridscheduler/blob/master/source/dist/util/upgrade_modules/load_sge_config.sh to ensure we could at least recover the current state (or so we hoped).

Much of what makes our current grid master setup complex is the on-NFS resource collection dynamism that is mostly a solution looking for a problem. I propose we could take out all of that and simplify things in the same fashion as https://gerrit.wikimedia.org/r/#/c/334203/ using a static config file (yaml?) and a conversion layer that configures the grid itself.
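As a rough illustration of the "static config plus conversion layer" idea, here is a minimal sketch, not a proposal for the final implementation. It assumes the static file has already been parsed into a dict (for example from YAML) and renders each queue into the flat "attribute value" text form that gridengine uses for queue definitions, which could then be fed to something like `qconf -Aq` or `qconf -Mq`. The queue names, values, and helper names (`GRID_CONFIG`, `render_value`, `render_queue`, `render_all`) are all hypothetical.

```python
# Hypothetical static description of the grid. In practice this could be
# loaded from a YAML file managed by Puppet; the names and values here
# are made up for illustration.
GRID_CONFIG = {
    "queues": {
        "task": {"hostlist": ["@general"], "slots": 50, "seq_no": 0, "rerun": False},
        "continuous": {"hostlist": ["@general"], "slots": 50, "seq_no": 1, "rerun": True},
    },
}


def render_value(value):
    """Render a Python value in the flat text form used in gridengine config files."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if value is None:
        return "NONE"
    if isinstance(value, (list, tuple)):
        # Empty lists are written as NONE; otherwise items are space separated.
        return " ".join(str(v) for v in value) if value else "NONE"
    return str(value)


def render_queue(name, attrs):
    """Return the text of one queue definition, one 'attribute value' pair per line."""
    lines = [f"qname {name}"]
    for key in sorted(attrs):  # sorted so the output is stable between runs
        lines.append(f"{key} {render_value(attrs[key])}")
    return "\n".join(lines) + "\n"


def render_all(config):
    """Map each queue name to its rendered definition text."""
    return {name: render_queue(name, attrs) for name, attrs in config["queues"].items()}


if __name__ == "__main__":
    for name, text in sorted(render_all(GRID_CONFIG).items()):
        print(f"# {name}")
        print(text)
```

Keeping the rendering step pure (dict in, text out) means the generated definitions can be diffed against the live grid before anything is applied, which is the part the NFS-based collection makes hard today.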

Event Timeline

Change 351214 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] sge: Add gridengine-client package dependency to grid master and shadow-master

https://gerrit.wikimedia.org/r/351214

Change 351214 merged by Madhuvishy:
[operations/puppet@production] sge: Add gridengine-client package dependency to grid master and shadow-master

https://gerrit.wikimedia.org/r/351214

Change 351379 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] sge: Fix global config handling

https://gerrit.wikimedia.org/r/351379

Change 352281 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] gridengine: Cleanup mergeconf script and references

https://gerrit.wikimedia.org/r/352281

Change 352294 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] gridengine: Cleanup old scripts, tracker and collector

https://gerrit.wikimedia.org/r/352294

Change 352301 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] gridengine: Follow up - delete old maintenance scripts and tracker/collector puppet code

https://gerrit.wikimedia.org/r/352301

bd808 added subscribers: Bstorm, bd808.

At this point we may be best served by folding this idea into the Stretch grid project that @Bstorm is starting to work towards.

bd808 assigned this task to Bstorm.

Both tools-sgegrid-master.tools.eqiad.wmflabs and tools-sgegrid-shadow.tools.eqiad.wmflabs are m1.large instances in the new grid.