We know this will cause downtime. We should really have a better way to fail over.
Indeed. Here it is: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180314-ORES
Anyway, let's close this for now and follow up on new actionables. I've already added 2 ideas to the incident report; I still need to create actual tasks out of them.