Following the fallout from the lead failure, @Dzahn, the rest of RelEng and I discussed it and agreed that we need a warm spare of Gerrit running (ideally in the other DC) to avoid extended downtime of a crucial service.
It doesn't have to be completely hot, and failover doesn't have to be instantaneous or automatic, but the warmer it is and the less we have to do to swap, the better.
Ideally I'm thinking:
- Misc server in codfw with public IP like eqiad
- Run gerrit role in slave mode (read-only)
- rsync git, lucene data hourly from master -> slave (index drift & rebuild would suck!)
- Failover would be "swap which node is master" and "swap dns"
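
To make the hourly sync idea a bit more concrete, here is a rough sketch of what a pull-style job on the slave could look like. Everything specific in it is an assumption for illustration only: the hostname `gerrit-master.example.org`, the site path `/srv/gerrit`, and the assumption that git repos and Lucene indexes live in `git/` and `index/` under that path. It only uses standard `rsync -a --delete`, and defaults to a dry run that prints the commands instead of executing them.

```python
import subprocess

# All of these values are hypothetical placeholders, not the real setup.
MASTER_HOST = "gerrit-master.example.org"  # assumed hostname of the current master
SITE_PATH = "/srv/gerrit"                  # assumed Gerrit site directory on both hosts
SYNC_DIRS = ("git", "index")               # assumed locations of git repos and Lucene data


def build_rsync_command(subdir, master_host=MASTER_HOST, site_path=SITE_PATH):
    """Return the rsync argv that pulls one subdirectory from master to this host."""
    src = f"{master_host}:{site_path}/{subdir}/"
    dest = f"{site_path}/{subdir}/"
    # -a preserves permissions and timestamps; --delete removes files that
    # no longer exist on the master so the slave does not drift.
    return ["rsync", "-a", "--delete", src, dest]


def sync_from_master(dry_run=True):
    """Sync every directory in SYNC_DIRS; with dry_run, only print the commands."""
    for subdir in SYNC_DIRS:
        cmd = build_rsync_command(subdir)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if rsync exits non-zero.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    sync_from_master(dry_run=True)
```

Something like this could be run hourly from cron or a timer on the slave. Pulling from the slave side is just one option; pushing from the master would work equally well, and on failover the direction would need to be reversed along with the master/slave roles.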