Build warm slave for Gerrit in Dallas
Closed, Resolved (Public)

Description

Following the fallout from the lead failure, @Dzahn, the rest of RelEng and I agreed that a warm spare of Gerrit (ideally in the other DC) is necessary to avoid extended downtime of a crucial service.

It doesn't have to be completely hot, and failover does not have to be instantaneous or automatic, but the warmer it is and the less we have to do to swap, the better.

Ideally I'm thinking:

  • Misc server in codfw with public IP like eqiad
  • Run gerrit role in slave mode (read-only)
  • rsync git, lucene data hourly from master -> slave (index drift & rebuild would suck!)
  • Failover would be "swap which node is master" and "swap dns"
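
The hourly rsync step above could be sketched roughly as follows. This is only an illustration: the ticket does not name the master host or the data paths, so MASTER_HOST and SYNC_PATHS are made-up placeholders, and the function names are hypothetical. It defaults to a dry run that prints the commands rather than executing them.

```python
import shlex
import subprocess

# Hypothetical values: the ticket does not name the hosts or data paths.
MASTER_HOST = "gerrit-master.example.org"
SYNC_PATHS = ["/srv/gerrit/git/", "/srv/gerrit/index/"]


def build_rsync_command(master_host, path):
    """Return the rsync argv that pulls one directory from the master."""
    return [
        "rsync",
        "--archive",   # preserve permissions, timestamps and symlinks
        "--delete",    # remove files that no longer exist on the master
        "--rsh=ssh",   # use SSH as the transport
        f"{master_host}:{path}",
        path,
    ]


def sync_all(master_host=MASTER_HOST, paths=SYNC_PATHS, dry_run=True):
    """Build one rsync command per path; print or run them; return them."""
    commands = [build_rsync_command(master_host, p) for p in paths]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling sync_all(dry_run=False) from an hourly scheduled job on the slave would approximate the plan. Pulling from the slave side keeps the master free of any credentials for the spare, though pushing from the master would work as well.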

Event Timeline

Change 325596 had a related patch set uploaded (by Dzahn):
add gerrit2001.mgmt for WMF6408.mgmt

https://gerrit.wikimedia.org/r/325596

Change 325596 abandoned by Dzahn:
add gerrit2001.mgmt for WMF6408.mgmt

Reason:
it was already done by Rob now in https://gerrit.wikimedia.org/r/#/c/325860/1

https://gerrit.wikimedia.org/r/325596

demon claimed this task.

The spare is running in Dallas and data is being replicated in real time, so I think we're warm.

The only remaining improvements would be things like shared cache stores (T152802) and switching to Elasticsearch for shared indexing. Then we'd be able to run a much hotter spare.

But I think we could fail over pretty quickly at this point, so resolving.