Make sure that toollabs can function fully even with one virt* host fully down
Closed, Resolved, Public

Authored By

	yuvipanda
	Feb 24 2015, 9:24 AM

Description

A single virt* host going down should not take down toollabs.

Related Objects

Status	Assigned	Task
Resolved	None	T90534 Make toolforge reliable enough (tracking)
Declined	None	T91068 Set up a schedule for doing failover exercises for toollabs
Resolved	Andrew	T90542 Make sure that toollabs can function fully even with one virt* host fully down
Resolved	coren	T90546 Test and verify that OGE master/shadow failover works as expected
Resolved	yuvipanda	T89995 ToolLabs web proxy tolerate the failure of virt host
Declined	yuvipanda	T91484 Monitor that the redundant webproxies have same state in terms of what they are proxying to whom
Resolved	yuvipanda	T90557 Generic services nodes should be redundant so OGE can reschedule them onto another machine if one goes down
Resolved	yuvipanda	T91065 Move uwsgi jobs to be run on generic hosts, retire uwsgi hosts
Resolved	yuvipanda	T91066 Retire 'tomcat' node, make Java apps run on the generic webgrid
Resolved	bd808	T91072 Move toollabs instances around to minimize damage from a single downed virt* host
Resolved	yuvipanda	T99347 Migrate tools-checker-02 away from labvirt1003
Resolved	yuvipanda	T101635 Write an icinga check to ensure that toollabs instances are appropriately distributed across labvirt** hosts
Declined	yuvipanda	T91237 Have bigbrother run on multiple nodes to provide redundancy against tools-submit failure
Resolved	yuvipanda	T91239 Setup a redis slave for toollabs as backup / redundancy
Resolved	yuvipanda	T96966 Make tools-static redundant
Declined	None	T96967 Make tools-mail redundant
Duplicate	None	T101636 Move tools-shadow away from labvirt1004

Event Timeline

yuvipanda created this task. Feb 24 2015, 9:24 AM

yuvipanda set the priority of this task to Needs Triage.

yuvipanda updated the task description. (Show Details)

yuvipanda added projects: Cloud-Services, Tracking-Neverending, Toolforge.

yuvipanda added subscribers: Aklapper, yuvipanda.

yuvipanda added a subtask: T89995: ToolLabs web proxy tolerate the failure of virt host. Feb 24 2015, 9:29 AM

For the record, this basically requires three things:

(1) That the shadow master be on a different virt host than the grid master (that is already the case)
(2) That tools-submit be redundant
(3) That the webproxy be redundant

Of course, having the virt nodes themselves distributed over as many hosts as possible reduces the fraction lost should any one of them be down - I believe right now they are spread over three or four of them, but they may not be spread equally. Migrating them to spread them around would also be a robustness gain.
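To make the "fraction lost" point concrete, here is a minimal sketch of how one might measure it. It assumes the instance-to-host placement is already available as a plain dictionary; the function names, the role labels, and the webproxy instance names are hypothetical and are not taken from any actual Tool Labs tooling.

```python
from collections import Counter, defaultdict


def worst_case_loss(placement):
    """Return (host, fraction) for the host whose failure loses the most instances.

    placement maps instance name -> virt host name (hypothetical input format).
    Ties are broken by host name so the result is deterministic.
    """
    counts = Counter(placement.values())
    host, n = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return host, n / len(placement)


def single_points_of_failure(placement, roles):
    """Return the sorted roles whose instances all sit on one virt host.

    roles maps instance name -> role label (hypothetical labels).
    """
    hosts_by_role = defaultdict(set)
    for instance, role in roles.items():
        hosts_by_role[role].add(placement[instance])
    return sorted(role for role, hosts in hosts_by_role.items() if len(hosts) == 1)


# Illustrative data only; not the real layout.
placement = {
    "tools-master": "labvirt1001",
    "tools-shadow": "labvirt1002",
    "tools-webproxy-01": "labvirt1001",
    "tools-webproxy-02": "labvirt1001",
}
roles = {
    "tools-master": "grid-master",
    "tools-shadow": "grid-master",
    "tools-webproxy-01": "webproxy",
    "tools-webproxy-02": "webproxy",
}

print(worst_case_loss(placement))
print(single_points_of_failure(placement, roles))
```

With this made-up layout, the grid master and its shadow are on different hosts, but both proxies share one host, so the second function flags the webproxy role. Migrating one proxy to another host would clear it and lower the worst-case fraction.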

@coren: I've filed blocking tasks for things I think need to be done.

As I wrote on T89995, I don't think this is feasible. Tools running in the Tools project would require the whole foundation of networking & Co. to work, on top of that OpenStack, on top of that the Tools project, on top of that SGE, on top of that the individual tools written (by individual authors with varying skills) in a fault-tolerant way to recover from literally unimaginable outage scenarios.

On the other hand, even the worst outage (hardware failure of a virtual node) was handled in less than three hours for the most part (for the uninitiated: On Toolserver, this could have taken days to fix).

If it is unacceptable for a tool to be unavailable for three hours, it needs to be moved to production proper, with code review and assigned babysitters. But setting unattainable goals will only lead to frustration.

@scfc: We don't need full-on redundancy like we have for prod, just enough that tools will limp along rather than die. The webproxy comes to mind as an immediate start, for example, along with enough nodes for jobs to be relocated.

The "limping along" bit is what I am afraid of :-). Planning for catastrophes is a hard problem. Instead of relying on a fire stopping at the preplanned fire walls so that the "unaffected" neighbours can carry on with their day, I'd much rather invest in the fire department so that they can show up in force at short notice and do whatever the individual situation requires. This includes shiny engines like the spare servers that @Andrew put on the wish list.

But IIRC only the last two outages (where the rapid succession skewed the public perception) were due to a virt host being down. In "most" cases it was scheduled NFS maintenance, and the last unplanned outage that I remember was the digger operator who found the third line to Tampa.

My concern is that a lot of effort could be spent on this task, a lot of PR follows that Tools is now fire-proof, and a month later a network switch dying undermines all that.

@scfc: hmm, fair enough. I guess we should make sure to avoid any PR that says Tools is now 'fire-proof'. Allowing one virt host to die while toollabs stays 'fine' will also help with other things (like a rolling restart of all virt* hosts, as happened with GHOST). I guess we'll work on these slowly as time permits, without making any proclamations.

Ricordisamoa subscribed. Feb 27 2015, 9:23 AM

yuvipanda closed subtask T89995: ToolLabs web proxy tolerate the failure of virt host as Resolved. Feb 27 2015, 1:53 PM

yuvipanda mentioned this in T91068: Set up a schedule for doing failover exercises for toollabs. Feb 27 2015, 2:33 PM

coren closed subtask T90546: Test and verify that OGE master/shadow failover works as expected as Invalid. Feb 27 2015, 3:35 PM

coren changed the status of subtask T90546: Test and verify that OGE master/shadow failover works as expected from Invalid to Resolved.

yuvipanda closed subtask T90557: Generic services nodes should be redundant so OGE can reschedule them onto another machine if one goes down as Resolved. Mar 2 2015, 6:55 AM

yuvipanda closed subtask T91065: Move uwsgi jobs to be run on generic hosts, retire uwsgi hosts as Resolved. Mar 4 2015, 12:00 PM

yuvipanda added a project: ToolLabs-Goals-Q4. Mar 25 2015, 9:21 PM

yuvipanda moved this task from Backlog to Redundancy on the ToolLabs-Goals-Q4 board. Mar 25 2015, 9:38 PM

yuvipanda closed subtask T91239: Setup a redis slave for toollabs as backup / redundancy as Resolved. Apr 3 2015, 12:57 AM

scfc triaged this task as Medium priority. Apr 6 2015, 7:47 AM

scfc moved this task from Backlog to Ready to be worked on on the Toolforge board.

scfc added a parent task: T91068: Set up a schedule for doing failover exercises for toollabs.

yuvipanda closed subtask T91237: Have bigbrother run on multiple nodes to provide redundancy against tools-submit failure as Declined. Apr 10 2015, 6:51 AM

scfc closed subtask T91066: Retire 'tomcat' node, make Java apps run on the generic webgrid as Resolved. Apr 10 2015, 10:00 AM

yuvipanda added subtasks: T96966: Make tools-static redundant, T96967: Make tools-mail redundant. Apr 24 2015, 9:54 PM

yuvipanda closed subtask T96966: Make tools-static redundant as Resolved. Apr 25 2015, 3:43 AM

yuvipanda reopened subtask T90546: Test and verify that OGE master/shadow failover works as expected as Open. May 28 2015, 3:08 PM

coren closed subtask T90546: Test and verify that OGE master/shadow failover works as expected as Resolved. Jun 4 2015, 1:14 PM

Krenair subscribed. Jun 9 2015, 3:50 PM

valhallasw mentioned this in T109732: Add monitoring for expected load issues on tool labs exec nodes. Aug 20 2015, 4:45 PM

intracer subscribed. Oct 24 2015, 4:57 PM

Luke081515 moved this task from Triage to Backlog on the Cloud-Services board. Mar 25 2016, 4:12 PM

tom29739 subscribed. Apr 4 2016, 3:36 PM

Danny_B moved this task from Tag to Should be Goal instead on the Tracking-Neverending board. Jul 9 2016, 1:51 PM

• Phabricator_maintenance added a project: Goal. Aug 13 2016, 8:41 PM

• Phabricator_maintenance removed a project: Tracking-Neverending. Aug 13 2016, 9:57 PM

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:54 PM

• GTirloni edited projects, added cloud-services-team (Kanban); removed Toolforge. Mar 23 2019, 9:53 PM

bd808 closed subtask T91072: Move toollabs instances around to minimize damage from a single downed virt* host as Resolved. Mar 26 2019, 2:10 AM

I think this is largely resolved -- we have monitoring that keeps us from having too many eggs in one basket.

dcaro closed subtask T96967: Make tools-mail redundant as Declined. Feb 21 2024, 11:50 AM

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!
