Test and verify that OGE master/shadow failover works as expected
Closed, ResolvedPublic


Not sure if we have explicitly done this before. We should, if we haven't.

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda added subscribers: yuvipanda, Aklapper.
yuvipanda added a subscriber: coren.

@coren tried it just now, didn't work. He's investigating.

If virt1003 goes down then master goes down as well, and things are bad.

It worked all along, so long as the failover is tested by making the master fail. If the master is shut down cleanly, the shadow masters (correctly) refuse to start a new one.

coren changed the task status from Invalid to Resolved. Feb 27 2015, 3:36 PM

Failover failed during the outage caused by T100554. tools-shadow did not start a master process at all, even with explicit start attempts (after the main master was killed in various ways).

This works properly (and has been tested), provided that whatever prevented the master from starting does not also apply to the shadow master, as was the case during that specific outage (an overlong /etc/hosts causing libnss to fail).

A point to note, however, is that while the start delay is configurable, its effective resolution is over a minute because the shadow master only checks the staleness of the heartbeat file every 60 seconds. This means that, effectively, switchover time is 90s + timeout.
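To illustrate why a delay configured more finely than the check interval makes little difference, here is a minimal sketch of a poller that only looks at the heartbeat on fixed ticks. It is a toy model, not the shadow master's actual logic: the function name `detection_time`, its parameters, and the assumption that polls land on exact multiples of the interval are all hypothetical, and it does not try to reproduce the 90s figure above.

```python
import math


def detection_time(failure_time, poll_interval=60.0, stale_after=0.0, first_poll=0.0):
    """Return the time at which a periodic poller first sees a stale heartbeat.

    Hypothetical model: the heartbeat stops updating at `failure_time`, it
    counts as stale once it is `stale_after` seconds old, and the poller only
    checks at first_poll, first_poll + poll_interval, and so on.
    """
    stale_at = failure_time + stale_after
    # Index of the earliest poll tick that is not before the staleness point.
    ticks = max(0, math.ceil((stale_at - first_poll) / poll_interval))
    return first_poll + ticks * poll_interval


if __name__ == "__main__":
    # Master dies 1 second after a poll; try several configured delays.
    for delay in (10, 50, 70):
        print(delay, detection_time(1, poll_interval=60, stale_after=delay))
```

In this model, delays of 10 and 50 seconds are detected on the same poll tick, and a delay of 70 seconds is not seen until the tick after that, which is the coarse resolution described above.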

coren moved this task from Doing to Done on the Labs-Sprint-100 board.