
Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC
Closed, Resolved · Public

Description

db1073 is the current primary master for m5 which holds the following databases:

root@db1073.eqiad.wmnet[(none)]> show databases;
+------------------------+
| Database               |
+------------------------+
| designate              |
| designate_pool_manager |
| glance                 |
| keystone               |
| labsdbaccounts         |
| labspuppet             |
| labswiki               |
| labtestwiki            |
| neutron                |
| nova                   |
| nova_api               |
| nova_api_eqiad1        |
| nova_eqiad1            |
| performance_schema     |
| striker                |
| test_labsdbaccounts    |
| testreduce_0715        |
| testreduce_vd          |
+------------------------+
18 rows in set (0.00 sec)

Apart from the cloud ones, it also holds wikitech (labswiki database).

db1073 is very old, out of warranty, and has 2 disks on predictive failure. This host is also scheduled for decommission in T217396: Decommission db1061-db1073.
I would like to fail it over to db1133, a newer and more powerful host.

The procedure would be to set db1073 to read-only, promote db1133, and set db1133 to writable; db1073 will remain read-only. The MySQL operations should only take a few seconds.
However, we need to make sure the services start using db1133.
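A minimal sketch of those MySQL-level steps (hostnames from this task; the actual switchover uses WMF tooling and additional replication sanity checks):

```shell
# On db1073 (old master): stop accepting writes
mysql -h db1073.eqiad.wmnet -e "SET GLOBAL read_only = 1"

# Confirm db1133 has caught up before promoting it
mysql -h db1133.eqiad.wmnet -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master

# On db1133 (new master): detach from db1073 and allow writes
mysql -h db1133.eqiad.wmnet -e "STOP SLAVE; RESET SLAVE ALL; SET GLOBAL read_only = 0"
```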

For the cloud services that use it, note that m5 currently doesn't use a proxy:

# host m5-master
m5-master.eqiad.wmnet is an alias for db1073.eqiad.wmnet.
db1073.eqiad.wmnet has address 10.64.16.79

Even though the proxy isn't in use, we also have to update it to reflect that db1133 is the master.
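For reference, on the proxy side this amounts to repointing the backend server; a minimal haproxy stanza of this shape (illustrative only; the real, puppet-managed config on dbproxy1005 differs):

```
listen mariadb
    bind 0.0.0.0:3306
    server db1133 db1133.eqiad.wmnet:3306 check
```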

So we'd need to do a DNS switch for it.
Its TTL is currently 5M, so I think we should decrease it to 1M, to shorten the window before clients fully start using db1133.
Update 8th August: TTL changed: https://gerrit.wikimedia.org/r/529065
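A quick way to confirm the TTL change has propagated (record name from this task; exact output shape depends on the resolver):

```shell
# The answer for m5-master should now carry a 60 second (1M) TTL
dig +noall +answer m5-master.eqiad.wmnet
```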

For wikitech, we just need to use the new dbctl tool to promote it to master (after pooling db1133 with weight 0, which can be done a day in advance). So the commands would be:

dbctl --scope eqiad section wikitech set-master db1133
dbctl config commit

When:
Tuesday 3rd Sept at 13:00 UTC
I think the total read-only window would be around 5 minutes; reads won't be affected, as db1073 will stay up at all times.

I would like to coordinate with cloud-services-team to find a proper date and time to do this operation and communicate it on wikitech-l and on other channels you might consider necessary.
Also CCing @CDanis and @Volans as this would be the first time we'd use dbctl to set a master and it would be nice to have one of them online just in case :)

Procedure:

Old master: db1073
New master: db1133

Pre-failover steps a few minutes before 13:00 UTC

Failover at 13:00 UTC

  • @Marostegui to set read-only: dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T229657" && dbctl config commit -m "Set wikitech as read-only for maintenance T229657"
  • @Marostegui to perform the failover on a mysql level (at this point db1073 will become read-only)
  • @Marostegui to change the master on MW: dbctl --scope eqiad section wikitech set-master db1133 ; dbctl config commit -m "Promote db1133 to wikitech master T229657"
  • @Marostegui to kill connections on db1073
  • @Marostegui to set wikitech back to RW: dbctl --scope eqiad section wikitech rw && dbctl config commit -m "Set wikitech back to RW after maintenance T229657"
  • @Marostegui to authdns-update the DNS change
  • @Marostegui to reload dbproxy1005 proxy
  • @JHedden to verify everything starts connecting to db1133 as the m5-master record gets changed from db1073 to db1133 and restart services if needed.

Failover clean up steps

Event Timeline

Marostegui triaged this task as Medium priority.Aug 2 2019, 11:54 AM

I think we could either do this next week or wait until September, because the WMCS team will be traveling for Wikimania + the offsite.

This mostly affects the CloudVPS (OpenStack) control plane. We would need to somehow notify our CloudVPS users about this. The only problem I can think of with doing it next week is the short notice period.

This can wait until September, no problem.
Just pick a date/time in September and let me know (if it can be done on a Tuesday/Wednesday it would be good, so we have a few days to monitor the new host before the weekend arrives) :-)

FYI I'll be on vacation and without a work laptop approx Sept 10th - Sept 20th, and possibly Sept 9th as well. Outside of that window I'm happy to be around for this any time of day.

Let me pick a tentative.....Tuesday 3rd Sept at 13:00 UTC? @aborrero @CDanis ?

Ok, so I'm proposing two dates:

Let me pick a tentative.....Tuesday 3rd Sept at 13:00 UTC? @aborrero @CDanis ?

LGTM

@aborrero's dates would be fine with me as well.

Ok, 2019-10-03 works for us. Will let my team know, since I won't be around.

@aborrero are you proposing October?

no, sorry, a typo. Fixed.

Let's try to go for the 3rd of September at 13:00 UTC if @Andrew and/or @JHedden can confirm they'll be available to support this.

As per the sync on the SRE meeting, @JHedden will be online from WMCS.
I will handle the announcement for wikitech, could you handle the announcement (if it is needed) for the OpenStack part of things?

Yeah, I'll take care of the Cloud VPS / OpenStack announcement.

Change 529065 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Decrease m5-master TTL to 1M

https://gerrit.wikimedia.org/r/529065

Change 529065 merged by Marostegui:
[operations/dns@master] wmnet: Decrease m5-master TTL to 1M

https://gerrit.wikimedia.org/r/529065

Change 529331 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1133 to m5 master

https://gerrit.wikimedia.org/r/529331

Change 529333 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Promote db1133 to m5 master

https://gerrit.wikimedia.org/r/529333

I have submitted the patches for review; I would appreciate it if the cloud-services-team folks could give them a look (especially @JHedden, as he will be online supporting this). Also @CDanis for the dbctl part.
The procedure for the failover, at a high level, goes along these lines:

Old master: db1073
New master: db1133

Pre-failover steps a few minutes before 13:00 UTC
@Marostegui to silence alerts on m5 hosts
@Marostegui to change replication and get everything to replicate from db1133
@Marostegui to pool db1133 with weight 0 on wikitech section via dbctl instance db1133 edit and then dbctl config commit -m "Pool db1133 with weight 0 T229657" so it can be later set as master.
@Marostegui to disable puppet on db1073 and db1133 and merge: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/529333/ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/529331/
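The weight-0 pooling step, spelled out as commands (the edit happens interactively; the section/field names are taken from this task, and `dbctl config diff` is a safe way to preview before committing):

```shell
# Make db1133 known to the wikitech section without sending it traffic
dbctl instance db1133 edit    # set the wikitech section to weight 0, pooled
dbctl config diff             # review the pending change
dbctl config commit -m "Pool db1133 with weight 0 T229657"
```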

Failover at 13:00 UTC
@Marostegui to log on -operations that the failover is starting
@Marostegui to set read-only

dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T229657 " && dbctl config commit -m "Set wikitech as read-only for maintenance T229657"

@Marostegui to perform the failover on a mysql level (at this point db1073 will become read-only)
@Marostegui to change the master on MW: dbctl --scope eqiad section wikitech set-master db1133 ; dbctl config commit -m "Promote db1133 to wikitech master T229657"
@Marostegui to authdns-update the DNS change
@Marostegui to kill connections on db1073
@Marostegui to reload dbproxy1005 proxy
@JHedden to verify everything starts connecting to db1133 as the m5-master record gets changed from db1073 to db1133 and restart services if needed.

Failover clean up steps
@Marostegui to re-enable and run puppet on db1073 and db1133
@Marostegui to change query killers for db1073 and db1133.
@Marostegui to depool db1073 from wikitech: dbctl instance db1073 depool ; dbctl config commit -m "Depool db1073 from wikitech T229657"

That plan sounds good. Remember that you may need to manually restart ferm in some places: we pass FQDNs directly to the ferm config, so the backing IP change for m5-master.eqiad.wmnet won't be detected as a puppet change (and puppet agent won't restart ferm).
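A one-off restart across the affected hosts can cover that, e.g. with cumin (the host selection here is an assumption; adjust it to whatever actually references m5-master in its ferm rules):

```shell
# Restart ferm so it re-resolves m5-master.eqiad.wmnet to db1133's IP
sudo cumin 'cloudcontrol1*.wikimedia.org or cloudservices1*.wikimedia.org' 'systemctl restart ferm'
```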

@Marostegui to pool db1133 with weight 0 on wikitech section via dbctl instance db1133 edit so it can be later set as master.

You likely also want a dbctl config commit here.

Otherwise LGTM!

This comment was removed by Marostegui.

Fixed! Thank you :)

The plan looks good to me. In the pre-failover stage I'll be shutting down the OpenStack scheduler and designate services to ensure there are no actions in queue, then re-enabling these in the clean up steps.
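In shell terms, that pre/clean-up work on the OpenStack control hosts looks roughly like this (the exact systemd unit names are an assumption; they vary per deployment):

```shell
# Pre-failover: stop the schedulers so no queued actions race the switchover
systemctl stop nova-scheduler designate-central

# ...the m5 failover happens here...

# Clean up: bring the services back once db1133 is the master
systemctl start nova-scheduler designate-central
```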

@CDanis @Volans can you confirm this command will set wikitech (db1073 is its master) on read-only?:

# set read-only
dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T229657 " && dbctl config commit -m "Set wikitech as read-only for maintenance T229657"

Thanks!

@Marostegui LGTM. Also, you can test the first command, see the diff, and then issue a rw command and check that there is no diff.

Indeed @Volans - thanks!

root@cumin1001:~# dbctl --scope eqiad section wikitech rw  && dbctl config diff
root@cumin1001:~#
Marostegui renamed this task from Switchover m5 primary master: db1073 to db1133 to Switchover m5 primary master: db1073 to db1133: Tuesday 3rd Sept at 13:00 UTC.Aug 28 2019, 11:40 AM
Marostegui updated the task description. (Show Details)
Marostegui updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2019-08-29T09:08:07Z] <marostegui> Reboot db1133 to upgrade kernel - T229657

Window reserved on the Deployments page. It will happen at the same time as the Train, but that shouldn't really be an issue, and it will only take a few seconds.

Marostegui updated the task description. (Show Details)
Marostegui updated the task description. (Show Details)

Mentioned in SAL (#wikimedia-operations) [2019-09-03T06:39:33Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1133 with weight 0 T229657', diff saved to https://phabricator.wikimedia.org/P9029 and previous config saved to /var/cache/conftool/dbconfig/20190903-063932-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T08:09:59Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1133 with weight 0 T229657', diff saved to https://phabricator.wikimedia.org/P9031 and previous config saved to /var/cache/conftool/dbconfig/20190903-080958-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T08:26:50Z] <marostegui> Add REPLICATION grant to wikiuser and wikiadmin on db1073 with replication enabled - T229657

Mentioned in SAL (#wikimedia-operations) [2019-09-03T11:55:31Z] <marostegui> Change topology on m5 and make everything replicate from db1133 - T229657

Change 534144 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1133 as wikitech master

https://gerrit.wikimedia.org/r/534144

Mentioned in SAL (#wikimedia-operations) [2019-09-03T12:02:22Z] <marostegui> Disable puppet on db1073 and db1133 - T229657

Change 529331 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1133 to m5 master

https://gerrit.wikimedia.org/r/529331

Mentioned in SAL (#wikimedia-cloud) [2019-09-03T12:42:17Z] <jeh> Set Icinga downtime on cloudcontrol100[34] and cloudservices100[34] for database switch over T229657

Change 529333 merged by Marostegui:
[operations/dns@master] wmnet: Promote db1133 to m5 master

https://gerrit.wikimedia.org/r/529333

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:00:49Z] <marostegui> Failover m5 from db1073 to db1133 - T229657

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:01:14Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set wikitech as read-only for maintenance T229657', diff saved to https://phabricator.wikimedia.org/P9033 and previous config saved to /var/cache/conftool/dbconfig/20190903-130113-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:08:40Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set wikitech as read-only for maintenance T229657', diff saved to https://phabricator.wikimedia.org/P9035 and previous config saved to /var/cache/conftool/dbconfig/20190903-130839-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:09:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1133 to wikitech master T229657', diff saved to https://phabricator.wikimedia.org/P9036 and previous config saved to /var/cache/conftool/dbconfig/20190903-130937-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:10:01Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set wikitech back to RW after maintenance T229657', diff saved to https://phabricator.wikimedia.org/P9037 and previous config saved to /var/cache/conftool/dbconfig/20190903-131000-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:11:47Z] <marostegui> Reload haproxy on dbproxy1005 T229657

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:13:34Z] <marostegui> Re-enable puppet on db1073 and db1133 T229657

Mentioned in SAL (#wikimedia-operations) [2019-09-03T13:14:58Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1073 from wikitech T229657', diff saved to https://phabricator.wikimedia.org/P9038 and previous config saved to /var/cache/conftool/dbconfig/20190903-131456-marostegui.json

This was done successfully.

wikitech read only start: 13:08:40
wikitech read only stop: 13:10:01
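From those two timestamps, the actual read-only window works out to 81 seconds:

```shell
start=$(date -u -d '2019-09-03T13:08:40Z' +%s)
stop=$(date -u -d '2019-09-03T13:10:01Z' +%s)
echo "$(( stop - start )) seconds of read-only"   # 81 seconds of read-only
```

Well under the estimated 5 minutes.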

@JHedden is still checking the WMCS side to confirm everything is back up normally

Cloud VPS OpenStack has been fully switched over and all services are back online.

Change 534144 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1133 as wikitech master

https://gerrit.wikimedia.org/r/534144

Mentioned in SAL (#wikimedia-operations) [2019-09-03T14:39:12Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Promote db1133 as wikitech master T229657 (duration: 00m 54s)

Marostegui updated the task description. (Show Details)

This is all done - db1073 will be decommissioned in a few days (most likely next week) once we are sure everything is working as expected.

Change 535509 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Restore 5M TTL for m5-master

https://gerrit.wikimedia.org/r/535509

Change 535509 merged by Marostegui:
[operations/dns@master] wmnet: Restore 5M TTL for m5-master

https://gerrit.wikimedia.org/r/535509