Not sure if we have explicitly done this before. We should, if we haven't.
Description
Status | Assigned | Task
---|---|---
Open | None | T90534 Make toolforge reliable enough (tracking)
Open | None | T91068 Set up a schedule for doing failover exercises for toollabs
Resolved | Andrew | T90542 Make sure that toollabs can function fully even with one virt* host fully down
Resolved | valhallasw | T100554 Grid engine masters down
Resolved | coren | T90546 Test and verify that OGE master/shadow failover works as expected
Event Timeline
@coren tried it just now, didn't work. He's investigating.
If virt1003 goes down then master goes down as well, and things are bad.
It worked all along, so long as the failover is tested by making the master fail. If it's shut down cleanly then the shadow masters (correctly) refuse to start a new one.
Failover failed during the outage caused by T100554. tools-shadow just didn't start a master process at all, even with explicit start attempts (after the main master was killed in various ways).
This works properly (and has been tested), provided that whatever prevented the master from starting does not also apply to the shadow master, as was the case during that specific outage (an overlong /etc/hosts causing libnss to fail).
A point to note, however, is that while the start delay is configurable, its effective resolution is over a minute because the shadow master only checks the staleness of the heartbeat file every 60 seconds. This means that, effectively, switchover time is 90s + timeout.
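To illustrate why the polling interval dominates the resolution, here is a rough sketch of the timing under a simplified model. It is not how the shadow master is actually implemented. The function name, the parameter names, and the assumptions (heartbeat last refreshed at the moment of failure, polls landing on exact multiples of the check interval) are all hypothetical and only there to show the arithmetic.

```python
import math


def detection_delay(fail_at, check_interval=60.0, timeout=120.0):
    """Seconds between the master dying and the shadow noticing.

    Simplified model (assumptions, not the real shadow daemon logic):
    - the heartbeat file was last refreshed exactly at `fail_at`;
    - the shadow polls at t = 0, check_interval, 2 * check_interval, ...;
    - the heartbeat counts as stale once its age reaches `timeout`.
    """
    first_stale = fail_at + timeout
    # The shadow can only notice at its next poll at or after that moment.
    next_poll = math.ceil(first_stale / check_interval) * check_interval
    return next_poll - fail_at


# Master dies exactly on a poll boundary: only the timeout is paid.
print(detection_delay(fail_at=0, check_interval=60, timeout=120))  # 120.0
# Master dies one second after a poll: almost a full extra interval is added.
print(detection_delay(fail_at=1, check_interval=60, timeout=120))  # 179.0
```

In this model the delay always lies between `timeout` and `timeout + check_interval`, so tuning the timeout by a few seconds makes little difference while the check interval stays at 60 seconds.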