Ensure we are unlikely to have both deployment-prep DB instances hosted together again in future
Closed, Resolved · Public

Description

Somehow during instance shuffling, probably during the eqiad1-r migration, deployment-db03 and deployment-db04 ended up on the same host, cloudvirt1018.
This went unnoticed until cloudvirt1018 popped its clogs last month and resulted in both T216404: deployment-db03.deployment-prep.eqiad.wmflabs instance can not start and T216067: Recover from corrupted beta MySQL slave (deployment-db04).
The damage could've been far worse than it was. We didn't entirely lose instances (and were able to recover the databases), but others did.
I've heard there's an instance spread alarm used in tools. We should make that apply to these instances, and maybe to the ms-be instances too.
(Note: due to T219087: Get rid of deployment-db0[34], the instances in question are probably currently db04 and db05, soon to be db05 and db06.)

Event Timeline

Krenair created this task. · Mar 24 2019, 4:48 AM
Restricted Application added a subscriber: Aklapper. · Mar 24 2019, 4:48 AM
Krenair updated the task description. · Mar 24 2019, 4:49 AM
Krenair updated the task description. · Mar 24 2019, 4:52 AM
bd808 added a subscriber: bd808. · Mar 24 2019, 5:49 AM

The openstack::monitor::spreadcheck Puppet module sets up the nrpe monitor for Toolforge instances. Adding a config file and associated monitor for deployment-prep shouldn't be too difficult.
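For anyone unfamiliar with the idea, here is a rough sketch of the kind of logic a spread check performs. It is not the actual spreadcheck code: the function name `find_colocated`, the group name `db`, and the shape of the `placement` and `groups` inputs are all assumptions made up for illustration. A real check would get instance placement from OpenStack rather than from a hard-coded dict.

```python
from collections import defaultdict
from typing import Dict, List


def find_colocated(
    placement: Dict[str, str],
    groups: Dict[str, List[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Report hypervisors that host more than one member of a group.

    placement: hypothetical mapping of instance name -> hypervisor name.
    groups:    hypothetical mapping of group name -> instances that
               should each live on a different hypervisor.
    """
    violations: Dict[str, Dict[str, List[str]]] = {}
    for group, instances in groups.items():
        by_host = defaultdict(list)
        for instance in instances:
            host = placement.get(instance)
            if host is not None:  # skip instances whose placement is unknown
                by_host[host].append(instance)
        # Keep only hosts carrying two or more members of this group.
        crowded = {h: sorted(m) for h, m in by_host.items() if len(m) > 1}
        if crowded:
            violations[group] = crowded
    return violations


if __name__ == "__main__":
    # The situation described in this task: both DB instances on one host.
    placement = {
        "deployment-db03": "cloudvirt1018",
        "deployment-db04": "cloudvirt1018",
    }
    groups = {"db": ["deployment-db03", "deployment-db04"]}
    for group, hosts in find_colocated(placement, groups).items():
        for host, members in hosts.items():
            print(f"{group}: {', '.join(members)} share {host}")
```

A monitor built on something like this would raise an alert whenever the returned dict is non-empty, which is what would have flagged db03 and db04 sharing cloudvirt1018 before the host failed.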

Change 498699 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/puppet@production] openstack: Add an instance spread check for deployment-prep

https://gerrit.wikimedia.org/r/498699

Krenair claimed this task. · Mar 24 2019, 5:51 AM

Change 498699 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: Add an instance spread check for deployment-prep

https://gerrit.wikimedia.org/r/498699

Krenair closed this task as Resolved. · Mar 26 2019, 1:51 PM