Not sure if we have explicitly done this before. We should, if we haven't.
|Status|Assigned|Task|
|---|---|---|
|Open|None|T90534 Make toolforge reliable enough (tracking)|
|Open|None|T91068 Set up a schedule for doing failover exercises for toollabs|
|Resolved|Andrew|T90542 Make sure that toollabs can function fully even with one virt* host fully down|
|Resolved|valhallasw|T100554 Grid engine masters down|
|Resolved|coren|T90546 Test and verify that OGE master/shadow failover works as expected|
This works properly (and has been tested), provided that whatever prevented the master from starting does not also affect the shadow master, as was the case during that specific outage (an overlong /etc/hosts caused libnss to fail).
One point to note, however: while the start delay is configurable, its effective resolution is over a minute, because the shadow master only checks the staleness of the heartbeat file every 60 seconds. In practice, this means that switchover time is 90s + timeout.
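To illustrate why a fixed polling interval makes the configured delay so coarse, here is a minimal sketch in Python. It models only the quantization caused by checking the heartbeat file at fixed intervals; it does not attempt to reproduce the 90s figure above, and it is not taken from the grid engine code. The function name, its parameters, and the idea of a "first check offset" are hypothetical and exist only for this illustration.

```python
import math


def first_stale_check(stale_after, check_interval=60, first_check_offset=0):
    """Return the time (in seconds after the last heartbeat update) of the
    first periodic check at which the heartbeat is at least `stale_after`
    seconds old.

    Hypothetical model: checks happen at
    first_check_offset + k * check_interval for k = 0, 1, 2, ...
    """
    if stale_after <= first_check_offset:
        return first_check_offset
    # Smallest k such that the k-th check happens at or after stale_after.
    k = math.ceil((stale_after - first_check_offset) / check_interval)
    return first_check_offset + k * check_interval


# Very different configured thresholds are detected at the same check.
for threshold in (10, 30, 50, 61):
    print(threshold, first_stale_check(threshold))
```

Under this model, thresholds of 10, 30 and 50 seconds are all noticed at the same 60-second check, and a threshold of 61 seconds is not noticed until the 120-second check, so tuning the delay below the check interval has little practical effect.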