db1073 is the current primary master for m5 which holds the following databases:
root@db1073.eqiad.wmnet[(none)]> show databases;
+------------------------+
| Database               |
+------------------------+
| designate              |
| designate_pool_manager |
| glance                 |
| keystone               |
| labsdbaccounts         |
| labspuppet             |
| labswiki               |
| labtestwiki            |
| neutron                |
| nova                   |
| nova_api               |
| nova_api_eqiad1        |
| nova_eqiad1            |
| performance_schema     |
| striker                |
| test_labsdbaccounts    |
| testreduce_0715        |
| testreduce_vd          |
+------------------------+
22 rows in set (0.00 sec)
Apart from the cloud ones, it also holds wikitech (labswiki database).
db1073 is very old, out of warranty, and has 2 disks in predictive failure. This host is also scheduled for decommissioning (T217396: Decommission db1061-db1073).
I would like to fail it over to db1133, a newer and more powerful host.
The procedure would be to set db1073 to read-only, promote db1133, and set db1133 to be writable; db1073 will remain read-only. Those MySQL operations should take only a few seconds.
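As a rough illustration of that ordering, the sketch below only prints the statements one might run on each host; it connects to nothing. The host names come from this task, but the function name `failover_plan` and the exact choice of statements are assumptions, not the actual switchover tooling.

```shell
#!/bin/sh
# Sketch only: prints a possible statement order, executes nothing.
# The statement list is an assumption, not the real switchover procedure.
OLD_MASTER="db1073.eqiad.wmnet"
NEW_MASTER="db1133.eqiad.wmnet"

failover_plan() {
    # 1. Stop accepting writes on the old master.
    printf '%s: SET GLOBAL read_only = 1;\n' "$OLD_MASTER"
    # 2. Note the old master's final binlog position.
    printf '%s: SHOW MASTER STATUS;\n' "$OLD_MASTER"
    # 3. Check that the new master has applied everything up to it.
    printf '%s: SHOW SLAVE STATUS;\n' "$NEW_MASTER"
    # 4. Only then open the new master for writes.
    printf '%s: SET GLOBAL read_only = 0;\n' "$NEW_MASTER"
}
```

Calling `failover_plan` prints the four steps in order. The point is that db1133 becomes writable only after db1073 is read-only and db1133 has caught up.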
However, we need to make sure the services start using db1133.
For the cloud services that use it
m5 currently doesn't use a proxy:
# host m5-master
m5-master.eqiad.wmnet is an alias for db1073.eqiad.wmnet.
db1073.eqiad.wmnet has address 10.64.16.79
Even though the proxy isn't in use, we also have to update its configuration to reflect that db1133 is the master.
So we'd need to do a DNS switch for it.
Currently its TTL is 5M, so I think we should decrease it to 1M, to avoid up to 5 minutes of downtime before clients fully start using db1133.
Update 8th August: TTL changed: https://gerrit.wikimedia.org/r/529065
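To check from a client that the record has actually moved, something like the following could be used. It is a minimal sketch that assumes `host` prints the alias line in the format shown above; the helper names `alias_target` and `alias_points_to` are hypothetical.

```shell
#!/bin/sh
# Print the alias (CNAME) target found in `host` output read from stdin.
alias_target() {
    sed -n 's/^.* is an alias for \(.*\)\.$/\1/p' | head -n 1
}

# Succeed if NAME currently resolves as an alias for EXPECTED.
# usage: alias_points_to NAME EXPECTED
alias_points_to() {
    [ "$(host "$1" | alias_target)" = "$2" ]
}

# Example (not run here): poll until the switch is visible locally.
#   until alias_points_to m5-master.eqiad.wmnet db1133.eqiad.wmnet; do
#       sleep 5
#   done
```

With a 1M TTL, a client's resolver may keep returning db1073 for up to about a minute after the change, so a short polling interval is enough.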
For wikitech, we just need to use the new dbctl tool to promote db1133 to master (after pooling db1133 with weight 0, which can be done a day in advance). So the commands would be:
dbctl --scope eqiad section wikitech set-master db1133
dbctl config commit
When:
Tuesday 3rd Sept at 13:00 UTC
I think the total read-only time would be around 5 minutes. Reads won't be affected, as db1073 will be up at all times.
I would like to coordinate with cloud-services-team to find a proper date and time to do this operation and communicate it on wikitech-l and on other channels you might consider necessary.
Also CCing @CDanis and @Volans as this would be the first time we'd use dbctl to set a master and it would be nice to have one of them online just in case :)
Procedure:
Old master: db1073
New master: db1133
Pre-failover steps a few minutes before 13:00 UTC
- @Marostegui to silence alerts on m5 hosts
- @Marostegui to change replication and get everything to replicate from db1133
- @Marostegui to pool db1133 with weight 0 on wikitech section via dbctl instance db1133 edit and then dbctl config commit -m "Pool db1133 with weight 0 T229657" so it can be later set as master.
- @Marostegui to disable puppet on db1073 and db1133 and merge: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/529333/ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/529331/
Failover at 13:00 UTC
- @Marostegui to log on -operations that the failover is starting
- @Marostegui to set read-only
dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T229657" && dbctl config commit -m "Set wikitech as read-only for maintenance T229657"
- @Marostegui to perform the failover at the MySQL level (at this point db1073 will become read-only)
- @Marostegui to change the master on MW: dbctl --scope eqiad section wikitech set-master db1133 ; dbctl config commit -m "Promote db1133 to wikitech master T229657"
- @Marostegui to kill connections on db1073
- @Marostegui to set wikitech back to RW: dbctl --scope eqiad section wikitech rw && dbctl config commit -m "Set wikitech back to RW after maintenance T229657"
- @Marostegui to authdns-update the DNS change
- @Marostegui to reload dbproxy1005 proxy
- @JHedden to verify everything starts connecting to db1133 as the m5-master record gets changed from db1073 to db1133 and restart services if needed.
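The dbctl part of the steps above can be rehearsed with a small wrapper that stops at the first failing command and defaults to a dry run. This is a sketch only: the dbctl invocations are copied from the steps above, while `run_step`, `wikitech_failover`, and the `DRY_RUN` variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
TASK="T229657"

run_step() {
    if [ "$DRY_RUN" = "1" ]; then
        # Display only: quoting of multi-word arguments is not preserved.
        printf 'would run:'
        printf ' %s' "$@"
        printf '\n'
    else
        "$@"
    fi
}

wikitech_failover() {
    # Set wikitech read-only in MediaWiki.
    run_step dbctl --scope eqiad section wikitech ro "Maintenance on wikitech $TASK" || return 1
    run_step dbctl config commit -m "Set wikitech as read-only for maintenance $TASK" || return 1
    # The MySQL-level failover happens here, outside this sketch.
    run_step dbctl --scope eqiad section wikitech set-master db1133 || return 1
    run_step dbctl config commit -m "Promote db1133 to wikitech master $TASK" || return 1
    # Connections on db1073 are killed here, outside this sketch.
    run_step dbctl --scope eqiad section wikitech rw || return 1
    run_step dbctl config commit -m "Set wikitech back to RW after maintenance $TASK" || return 1
}
```

Running `wikitech_failover` with the default `DRY_RUN=1` just lists the six commands. Returning on the first failure means a commit is never attempted after a failed change.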
Failover clean up steps
- @Marostegui to merge https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/534144/
- @Marostegui to re-enable and run puppet on db1073 and db1133
- @Marostegui to change query killers for db1073 and db1133.
- @Marostegui to depool db1073 from wikitech: dbctl instance db1073 depool ; dbctl config commit -m "Depool db1073 from wikitech T229657"