
Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC
Closed, Resolved · Public

Description

db1073 is the current primary master for m5 which holds the following databases:

root@db1073.eqiad.wmnet[(none)]> show databases;
+------------------------+
| Database               |
+------------------------+
| designate              |
| designate_pool_manager |
| glance                 |
| keystone               |
| labsdbaccounts         |
| labspuppet             |
| labswiki               |
| labtestwiki            |
| neutron                |
| nova                   |
| nova_api               |
| nova_api_eqiad1        |
| nova_eqiad1            |
| performance_schema     |
| striker                |
| test_labsdbaccounts    |
| testreduce_0715        |
| testreduce_vd          |
+------------------------+
18 rows in set (0.00 sec)

Apart from the cloud ones, it also holds wikitech (labswiki database).

db1073 is very old, out of warranty, and has 2 disks on predictive failure. This host is also scheduled for decommission in T217396: Decommission db1061-db1073.
I would like to fail it over to db1133, a newer and more powerful host.

The procedure would be to set db1073 to read-only, promote db1133, and set db1133 to writable; db1073 will remain read-only. The MySQL operations should only take a few seconds.
However, we need to make sure the services start using db1133.
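A minimal sketch of those MySQL-level steps (hostnames from this task; the actual switchover uses WMF tooling and additional replication sanity checks):

```shell
# On db1073 (old master): stop accepting writes
mysql -h db1073.eqiad.wmnet -e "SET GLOBAL read_only = 1"

# Confirm db1133 has caught up before promoting it
mysql -h db1133.eqiad.wmnet -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master

# On db1133 (new master): detach from db1073 and allow writes
mysql -h db1133.eqiad.wmnet -e "STOP SLAVE; RESET SLAVE ALL; SET GLOBAL read_only = 0"
```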

For the cloud services that use it, note that m5 currently doesn't use a proxy:

# host m5-master
m5-master.eqiad.wmnet is an alias for db1073.eqiad.wmnet.
db1073.eqiad.wmnet has address 10.64.16.79

Even though the proxy isn't in use, we also have to update it to reflect that db1133 is the master.
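For reference, on the proxy side this amounts to repointing the backend server; a minimal haproxy stanza of this shape (illustrative only; the real, puppet-managed config on dbproxy1005 differs):

```
listen mariadb
    bind 0.0.0.0:3306
    server db1133 db1133.eqiad.wmnet:3306 check
```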

So we'd need to do a DNS switch for it.
Its TTL is currently 5M, so I think we should decrease it to 1M, to shorten the window before clients fully start using db1133.
Update 8th August: TTL changed: https://gerrit.wikimedia.org/r/529065
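A quick way to confirm the TTL change has propagated (record name from this task; exact output shape depends on the resolver):

```shell
# The answer for m5-master should now carry a 60 second (1M) TTL
dig +noall +answer m5-master.eqiad.wmnet
```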

For wikitech, we just need to use the new dbctl tool to promote it to master (after pooling db1133 with weight 0, which can be done a day in advance). So the commands would be:

dbctl --scope eqiad section wikitech set-master db1133
dbctl config commit

When:
Tuesday 3rd Sept at 13:00 UTC
I think the total read-only window would be around 5 minutes; reads won't be affected, as db1073 will stay up at all times.

I would like to coordinate with cloud-services-team to find a proper date and time to do this operation and communicate it on wikitech-l and on other channels you might consider necessary.
Also CCing @CDanis and @Volans as this would be the first time we'd use dbctl to set a master and it would be nice to have one of them online just in case :)

Procedure:

Old master: db1073
New master: db1133

Pre-failover steps a few minutes before 13:00 UTC

Failover at 13:00 UTC

  • @Marostegui to set read-only: dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T229657" && dbctl config commit -m "Set wikitech as read-only for maintenance T229657"
  • @Marostegui to perform the failover on a mysql level (at this point db1073 will become read-only)
  • @Marostegui to change the master on MW: dbctl --scope eqiad section wikitech set-master db1133 ; dbctl config commit -m "Promote db1133 to wikitech master T229657"
  • @Marostegui to kill connections on db1073
  • @Marostegui to set wikitech back to RW: dbctl --scope eqiad section wikitech rw && dbctl config commit -m "Set wikitech back to RW after maintenance T229657"
  • @Marostegui to authdns-update the DNS change
  • @Marostegui to reload dbproxy1005 proxy
  • @JHedden to verify everything starts connecting to db1133 as the m5-master record gets changed from db1073 to db1133 and restart services if needed.

Failover clean up steps

Event Timeline

Marostegui triaged this task as Medium priority.Aug 2 2019, 11:54 AM

I think we could either do this next week or wait until September, because the WMCS team will be traveling for Wikimania + the offsite.

This mostly affects the CloudVPS (OpenStack) control plane. We would need to somehow notify our CloudVPS users about this. The only problem I can think of with doing it next week is the short notice period.

This can wait until September, no problem.
Just pick a date/time in September and let me know (if it can be done on a Tuesday/Wednesday it would be good, so we have a few days to monitor the new host before the weekend arrives) :-)

FYI I'll be on vacation and without a work laptop approx Sept 10th - Sept 20th, and possibly Sept 9th as well. Outside of that window I'm happy to be around for this any time of day.

Let me pick a tentative.....Tuesday 3rd Sept at 13:00 UTC? @aborrero @CDanis ?

Ok, so I'm proposing two dates:

Let me pick a tentative.....Tuesday 3rd Sept at 13:00 UTC? @aborrero @CDanis ?

LGTM

@aborrero's dates would be fine with me as well.

Ok, 2019-10-03 works for us. Will let my team know, since I won't be around.

@aborrero are you proposing October?

no, sorry, a typo. Fixed.

Let's try to go for the 3rd of September at 13:00 UTC if @Andrew and/or @JHedden can confirm they'll be available to support this.

As per the sync on the SRE meeting, @JHedden will be online from WMCS.
I will handle the announcement for wikitech, could you handle the announcement (if it is needed) for the OpenStack part of things?

Yeah, I'll take care of the Cloud VPS / OpenStack announcement.

Change 529065 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Decrease m5-master TTL to 1M

https://gerrit.wikimedia.org/r/529065

Change 529065 merged by Marostegui:
[operations/dns@master] wmnet: Decrease m5-master TTL to 1M

https://gerrit.wikimedia.org/r/529065

Change 529331 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1133 to m5 master

https://gerrit.wikimedia.org/r/529331

Change 529333 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Promote db1133 to m5 master

https://gerrit.wikimedia.org/r/529333

I have submitted the patches for review; I would appreciate it if the cloud-services-team folks could give them a look (especially @JHedden, as he will be online supporting this). Also @CDanis for the dbctl part.
The procedure for the failover, at a high level, goes along these lines:

Old master: db1073
New master: db1133

Pre-failover steps a few minutes before 13:00 UTC
@Marostegui to silence alerts on m5 hosts
@Marostegui to change replication and get everything to replicate from db1133
@Marostegui to pool db1133 with weight 0 on wikitech section via dbctl instance db1133 edit and then dbctl config commit -m "Pool db1133 with weight 0 T229657" so it can be later set as master.
@Marostegui to disable puppet on db1073 and db1133 and merge: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/529333/ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/529331/
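The weight-0 pooling step, spelled out as commands (the edit happens interactively; the section/field names are taken from this task, and `dbctl config diff` is a safe way to preview before committing):

```shell
# Make db1133 known to the wikitech section without sending it traffic
dbctl instance db1133 edit    # set the wikitech section to weight 0, pooled
dbctl config diff             # review the pending change
dbctl config commit -m "Pool db1133 with weight 0 T229657"
```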

Failover at 13:00 UTC
@Marostegui to log on -operations that the failover is starting
@Marostegui to set read-only

dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T229657 " && dbctl config commit -m "Set wikitech as read-only for maintenance T229657"

@Marostegui to perform the failover on a mysql level (at this point db1073 will become read-only)
@Marostegui to change the master on MW: dbctl --scope eqiad section wikitech set-master db1133 ; dbctl config commit -m "Promote db1133 to wikitech master T229657"
@Marostegui to authdns-update the DNS change
@Marostegui to kill connections on db1073
@Marostegui to reload dbproxy1005 proxy
@JHedden to verify everything starts connecting to db1133 as the m5-master record gets changed from db1073 to db1133 and restart services if needed.

Failover clean up steps
@Marostegui to re-enable and run puppet on db1073 and db1133
@Marostegui to change query killers for db1073 and db1133.
@Marostegui to depool db1073 from wikitech: dbctl instance db1073 depool ; dbctl config commit -m "Depool db1073 from wikitech T229657"

That plan sounds good. Remember that you may need to manually restart ferm in some places: we pass FQDNs directly to the ferm config, so the backing IP change for m5-master.eqiad.wmnet won't be detected as a puppet change (and puppet agent won't restart ferm).
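A one-off restart across the affected hosts can cover that, e.g. with cumin (the host selection here is an assumption; adjust it to whatever actually references m5-master in its ferm rules):

```shell
# Restart ferm so it re-resolves m5-master.eqiad.wmnet to db1133's IP
sudo cumin 'cloudcontrol1*.wikimedia.org or cloudservices1*.wikimedia.org' 'systemctl restart ferm'
```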

@Marostegui to pool db1133 with weight 0 on wikitech section via dbctl instance db1133 edit so it can be later set as master.

You likely also want a dbctl config commit here.

Otherwise LGTM!

This comment was removed by Marostegui.

Fixed! Thank you :)

The plan looks good to me. In the pre-failover stage I'll be shutting down the OpenStack scheduler and designate services to ensure there are no actions in queue, then re-enabling these in the clean up steps.
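In shell terms, that pre/clean-up work on the OpenStack control hosts looks roughly like this (the exact systemd unit names are an assumption; they vary per deployment):

```shell
# Pre-failover: stop the schedulers so no queued actions race the switchover
systemctl stop nova-scheduler designate-central

# ...the m5 failover happens here...

# Clean up: bring the services back once db1133 is the master
systemctl start nova-scheduler designate-central
```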

@CDanis @Volans can you confirm this command will set wikitech (db1073 is its master) on read-only?:

# set read-only
dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T229657 " && dbctl config commit -m "Set wikitech as read-only for maintenance T229657"

Thanks!

@Marostegui LGTM. Also, you can test the first command, see the diff, and then issue a rw command and check that there is no diff.

Indeed @Volans - thanks!

root@cumin1001:~# dbctl --scope eqiad section wikitech rw  && dbctl config diff
root@cumin1001:~#
Marostegui renamed this task from Switchover m5 primary master: db1073 to db1133 to Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.Aug 28 2019, 11:40 AM
Marostegui updated the task description. (Show Details)
Marostegui updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2019-08-29T09:08:07Z] <marostegui> Reboot db1133 to upgrade kernel - T229657

Window reserved on the Deployments page. It will happen at the same time as the Train, but that shouldn't really be an issue, and it will only take a few seconds.

Marostegui updated the task description. (Show Details)
Marostegui updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2019-09-03T06:39:33Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1133 with weight 0 T229657', diff saved to https://phabricator.wikimedia.org/P9029 and previous config saved to /var/cache/conftool/dbconfig/20190903-063932-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T08:09:59Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1133 with weight 0 T229657', diff saved to https://phabricator.wikimedia.org/P9031 and previous config saved to /var/cache/conftool/dbconfig/20190903-080958-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T08:26:50Z] <marostegui> Add REPLICATION grant to wikiuser and wikiadmin on db1073 with replication enabled - T229657

Mentioned in SAL (#wikimedia-operations) [2019-09-03T11:55:31Z] <marostegui> Change topology on m5 and make everything replicate from db1133 - T229657

Change 534144 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1133 as wikitech master

https://gerrit.wikimedia.org/r/534144

Mentioned in SAL (#wikimedia-operations) [2019-09-03T12:02:22Z] <marostegui> Disable puppet on db1073 and db1133 - T229657

Change 529331 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1133 to m5 master

https://gerrit.wikimedia.org/r/529331

Mentioned in SAL (#wikimedia-cloud) [2019-09-03T12:42:17Z] <jeh> Set Icinga downtime on cloudcontrol100[34] and cloudservices100[34] for database switch over T229657

Change 529333 merged by Marostegui:
[operations/dns@master] wmnet: Promote db1133 to m5 master

https://gerrit.wikimedia.org/r/529333

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:00:49Z] <marostegui> Failover m5 from db1073 to db1133 - T229657

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:01:14Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set wikitech as read-only for maintenance T229657', diff saved to https://phabricator.wikimedia.org/P9033 and previous config saved to /var/cache/conftool/dbconfig/20190903-130113-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:08:40Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set wikitech as read-only for maintenance T229657', diff saved to https://phabricator.wikimedia.org/P9035 and previous config saved to /var/cache/conftool/dbconfig/20190903-130839-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:09:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1133 to wikitech master T229657', diff saved to https://phabricator.wikimedia.org/P9036 and previous config saved to /var/cache/conftool/dbconfig/20190903-130937-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:10:01Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set wikitech back to RW after maintenance T229657', diff saved to https://phabricator.wikimedia.org/P9037 and previous config saved to /var/cache/conftool/dbconfig/20190903-131000-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:11:47Z] <marostegui> Reload haproxy on dbproxy1005 T229657

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:13:34Z] <marostegui> Re-enable puppet on db1073 and db1133 T229657

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:14:58Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1073 from wikitech T229657', diff saved to https://phabricator.wikimedia.org/P9038 and previous config saved to /var/cache/conftool/dbconfig/20190903-131456-marostegui.json

This was done successfully.

wikitech read only start: 13:08:40
wikitech read only stop: 13:10:01
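From those two timestamps, the actual read-only window works out to 81 seconds:

```shell
start=$(date -u -d '2019-09-03T13:08:40Z' +%s)
stop=$(date -u -d '2019-09-03T13:10:01Z' +%s)
echo "$(( stop - start )) seconds of read-only"   # 81 seconds of read-only
```

Well under the estimated 5 minutes.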

@JHedden is still checking the WMCS side to confirm everything is back up normally

Cloud VPS OpenStack has been fully switched over and all services are back online.

Change 534144 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1133 as wikitech master

https://gerrit.wikimedia.org/r/534144

Mentioned in SAL (#wikimedia-operations) [2019-09-03T14:39:12Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Promote db1133 as wikitech master T229657 (duration: 00m 54s)

Marostegui updated the task description. (Show Details)

This is all done - db1073 will be decommissioned in a few days (most likely next week) once we are sure everything is working as expected.

Change 535509 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Restore 5M TTL for m5-master

https://gerrit.wikimedia.org/r/535509

Change 535509 merged by Marostegui:
[operations/dns@master] wmnet: Restore 5M TTL for m5-master

https://gerrit.wikimedia.org/r/535509