
Make primary DB masters page on HOST DOWN alert
Open, Medium, Public


At the moment, a host going down does not page; it only sends an IRC alert.

While this might be OK for the rest of the infrastructure, if a primary database master goes down, all the wikis on it automatically go read-only (and replication also breaks on the slaves).
In some cases, replication-broken alerts can take up to 15 minutes to actually send an SMS. We should page when a master goes down, as it needs immediate action.
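
To make the request concrete, here is a minimal sketch of the routing rule being asked for. It is not the actual monitoring configuration: the `Host` type, the `is_primary_master` flag, and the channel names are all hypothetical and only illustrate "every HOST DOWN goes to IRC, primary masters also page".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Host:
    # Hypothetical model of a monitored host; not the real inventory schema.
    name: str
    is_primary_master: bool


def notification_channels(host: Host) -> List[str]:
    """Return where a HOST DOWN alert for this host should be sent."""
    # Current behaviour described in this task: every host alerts on IRC.
    channels = ["irc"]
    # Proposed change: a primary DB master going down also pages.
    if host.is_primary_master:
        channels.append("page")
    return channels
```

With this rule, a replica that goes down still only produces an IRC alert, while a primary master produces both an IRC alert and a page.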

Event Timeline

Marostegui triaged this task as Medium priority. Sep 24 2019, 5:55 AM
Marostegui added a project: Wikimedia-Incident.
Marostegui moved this task from Triage to Backlog on the DBA board.

There is some interaction between this and T252679, although they are technically separate tickets. T252679 would solve this by no longer monitoring "Host X is down" and instead alerting on "Section X is in read-only mode (probably because the primary server is down)", i.e. moving away from monitoring hosts and monitoring abstract services instead.
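
As a rough illustration of the service-level model, assuming the read-only state of each section is already available as a boolean, the check could be sketched like this. The function name and the input mapping are hypothetical and do not come from T252679.

```python
from typing import Dict, List


def sections_to_page(read_only_by_section: Dict[str, bool]) -> List[str]:
    """Return the sections that should page because they are read-only."""
    # The alert is keyed on the service state (the section is read-only),
    # not on whether any particular host is up or down.
    return sorted(
        section
        for section, read_only in read_only_by_section.items()
        if read_only
    )
```

The point of the design is that the alert fires whatever the cause of the read-only state, with a downed primary being only the most likely one.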

This ticket is the short-term solution; that one is a longer-term "model change". But I think it is useful to link it here for architectural considerations.