A single virt* host going down should not take down toollabs.
Description
Event Timeline
For the record, this basically requires three things:
(1) That the shadow master be on a different virt host than the grid master (that is already the case; see the sketch after this comment)
(2) That tools-submit be redundant
(3) That the webproxy be redundant
Of course, having the virt nodes themselves distributed over as many hosts as possible reduces the fraction lost should any one of them be down - I believe right now they are spread over three or four of them, but they may not be spread equally. Migrating them to spread them around would also be a robustness gain.
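For point (1), gridengine's shadow-master failover is mostly configuration: sge_shadowd runs on the shadow host, watches the qmaster's heartbeat in the shared common/ directory (which has to live on NFS or similar shared storage), and takes over if it goes stale. A minimal sketch of the shadow_masters file, with hypothetical host names; the shadow of course has to sit on a different virt host than the master:

```
# $SGE_ROOT/$SGE_CELL/common/shadow_masters
# First line: current qmaster; following lines: shadow candidates.
# Host names here are hypothetical.
tools-grid-master
tools-grid-shadow
```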
As I wrote on T89995, I don't think this is feasible. Tools running in the Tools project would require the whole networking foundation to work, OpenStack on top of that, the Tools project on top of that, SGE on top of that, and finally the individual tools (written by individual authors with varying skills) to be fault-tolerant enough to recover from outage scenarios nobody has even imagined.
On the other hand, even the worst outage (hardware failure of a virt node) was for the most part handled in less than three hours (for the uninitiated: on Toolserver, this could have taken days to fix).
If it is unacceptable for a tool to be unavailable for three hours, it needs to be moved to production proper, with code review and assigned babysitters. But setting unattainable goals will only lead to frustration.
@scfc: We don't need full-on redundancy like we have for prod, just enough that tools will limp along rather than die outright. The webproxy comes to mind as an immediate start, for example, and enough nodes for jobs to be relocated.
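One generic way to get webproxy redundancy is an active/standby pair sharing a floating IP via VRRP, so the standby grabs the address when the active proxy's host dies. A minimal keepalived sketch, not the actual Tools setup; interface name, router id, and address are placeholders:

```
vrrp_instance webproxy {
    state MASTER          # the standby uses state BACKUP and a lower priority
    interface eth0        # placeholder interface
    virtual_router_id 51  # arbitrary, but must match on both proxies
    priority 100
    advert_int 1
    virtual_ipaddress {
        10.68.16.4        # placeholder floating IP that DNS points at
    }
}
```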
The "limping along" bit is what I am afraid of :-). Planning for catastrophes is a hard problem. Instead of depending on that a fire will stop at the preplanned fire walls and the "unaffected" neighbours can carry on with their day, I'd much rather invest in the fire department so that they can show up in force and short time and do whatever the individual situation requires. The includes shiny engines like the spare servers that @Andrew put on the wish list.
But IIRC only the last two outages (whose rapid succession skewed the public perception) were due to a virt host being down. In "most" cases it was scheduled NFS maintenance, and the last unplanned outage I remember was the digger operator who found the third line to Tampa.
My concern is that a lot of effort could be spent on this task, followed by a lot of PR that Tools is now fire-proof, and then a month later a dying network switch undermines all that.
@scfc: hmm, fair enough. I guess we should make sure to avoid any PR that says Tools is now 'fire-proof'. Being able to lose one virt host while toollabs stays 'fine' will also help with other things (like a rolling restart of all virt* hosts, as happened with GHOST). I guess we'll work on these slowly as time permits, without making any proclamations.
I think this is largely resolved -- we have monitoring that keeps us from having too many eggs in one basket.
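For the record, the "eggs in one basket" check can be as simple as counting instances per hypervisor. A hypothetical sketch (the project name, threshold, and availability of admin credentials plus the openstack CLI are all assumptions, not the actual monitoring in place):

```python
#!/usr/bin/env python3
"""Warn when too many Tools instances share one virt host (hypothetical sketch)."""
import collections
import json
import subprocess

THRESHOLD = 10  # arbitrary example value

# Admin credentials are needed in the environment for the Host column to be populated.
out = subprocess.run(
    ["openstack", "server", "list", "--project", "tools", "--long", "-f", "json"],
    check=True, capture_output=True, text=True,
).stdout

per_host = collections.Counter(server.get("Host") for server in json.loads(out))
for host, count in per_host.most_common():
    flag = "  <-- too many eggs in this basket" if count > THRESHOLD else ""
    print(f"{host}: {count} instances{flag}")
```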
The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag for this task. Thanks!