
Test and verify that OGE master/shadow failover works as expected
Closed, Resolved, Public

Description

Not sure if we have explicitly done this before. We should, if we haven't.

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda added subscribers: yuvipanda, Aklapper.
yuvipanda triaged this task as High priority. Feb 27 2015, 3:17 PM
yuvipanda added a subscriber: coren.

@coren tried it just now, didn't work. He's investigating.

If virt1003 goes down then master goes down as well, and things are bad.

coren closed this task as Invalid. Feb 27 2015, 3:35 PM

It worked all along, as long as the failover is tested by making the master fail. If the master is shut down cleanly, the shadow masters (correctly) refuse to start a new one.

coren changed the task status from Invalid to Resolved. Feb 27 2015, 3:36 PM
yuvipanda reopened this task as Open. May 28 2015, 3:08 PM

Failover failed during the outage caused by T100554. tools-shadow just didn't start a master process at all, even with explicit start attempts (after the main master was killed in various ways).

coren moved this task from To Do to Doing on the Labs-Sprint-100 board. Jun 1 2015, 7:15 PM
coren added a comment. Jun 4 2015, 1:14 PM

This works properly (and has been tested), provided that whatever prevents the master from starting does not also affect the shadow master, as was the case during that specific outage (an overlong /etc/hosts caused libnss to fail).

A point to note, however, is that while the start delay is configurable, its effective resolution is over a minute because the shadow master only checks the staleness of the heartbeat file every 60 seconds. This means that, effectively, switchover time is 90s + timeout.
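
To illustrate why the polling interval limits the resolution of the start delay, here is a minimal sketch in Python. It assumes a simplified model in which the shadow master polls at fixed multiples of the check interval and can only act on a poll. The function name `effective_delay` and its parameters are hypothetical and are not part of OGE. The sketch is not meant to reproduce the 90s figure above, only the rounding effect.

```python
def effective_delay(configured_delay: int, check_interval: int = 60) -> int:
    """Return the time (in seconds) of the first poll at or after the configured delay.

    Assumption: the shadow master polls at t = 0, check_interval,
    2 * check_interval, ... and can only react when a poll happens, so any
    configured delay is effectively rounded up to the next poll.
    """
    if configured_delay < 0 or check_interval <= 0:
        raise ValueError("delay must be >= 0 and check interval must be > 0")
    # Ceiling division using integers only.
    polls_needed = -(-configured_delay // check_interval)
    return polls_needed * check_interval


if __name__ == "__main__":
    for delay in (10, 60, 61, 120):
        print(delay, "->", effective_delay(delay))
```

Under this model, a configured delay of 10 seconds and one of 60 seconds behave identically, which is why tuning the delay below the polling interval has no practical effect.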

coren closed this task as Resolved. Jun 4 2015, 1:14 PM
coren moved this task from Doing to Done on the Labs-Sprint-100 board.