Upgrade m5 to Buster and MariaDB 10.4
Closed, Resolved (Public)

Description

Continuing with the misc sections upgrade, m5 needs to be upgraded to Buster and MariaDB 10.4.

Hosts:

  • db2135
  • db2078
  • db1133 (to be replaced by db1128 once it is removed as master from m3 T259589)
  • db1128
  • db1117

This will require a master failover once db1128 is provisioned.
m5 doesn't use the proxies, so we'll need to do a DNS failover, similar to what we did last time: T229657

Failover procedure:

Old master: db1133
New master: db1128

Decrease TTL a few days before the switchover

Pre-failover steps a few minutes before 14:00 UTC

Failover at 14:00 UTC

  • @Marostegui to set wikitech read-only: dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T260324" && dbctl config commit -m "Set wikitech as read-only for maintenance T260324"
  • @Marostegui to perform the failover on a mysql level (at this point db1133 will become read-only)
  • @Marostegui to pool db1128 first on dbctl (see T260324#6429410)
  • @Marostegui to change the master on MW: dbctl --scope eqiad section wikitech set-master db1128 ; dbctl config commit -m "Promote db1128 to wikitech master T260324"
  • @Marostegui to kill connections on db1133
  • @Marostegui to set wikitech back to RW: dbctl --scope eqiad section wikitech rw && dbctl config commit -m "Set wikitech back to RW after maintenance T260324"
  • @Marostegui to authdns-update the DNS change
  • @Marostegui to reload dbproxy1021 and dbproxy1017
  • @Andrew or someone from cloud-services-team to verify everything starts connecting to db1128 as the m5-master record gets changed from db1133 to db1128 and restart services if needed.
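
The verification step above can be partly automated by checking that the m5-master record resolves to the same address as the new master. The following is a minimal sketch, not the procedure actually used on this task. It assumes the alias is reachable as m5-master.eqiad.wmnet (the task only names the "m5-master" record), and the helper name alias_points_to is hypothetical.

```python
import socket
from typing import Callable


def alias_points_to(
    alias: str,
    expected_host: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True when the alias and the expected host resolve to the same IPv4 address.

    `resolve` is injectable so the check can be exercised without real DNS.
    """
    return resolve(alias) == resolve(expected_host)


# Assumed FQDN for the alias; db1128.eqiad.wmnet is the new master named in the task.
M5_MASTER_ALIAS = "m5-master.eqiad.wmnet"
NEW_MASTER = "db1128.eqiad.wmnet"
```

Calling alias_points_to(M5_MASTER_ALIAS, NEW_MASTER) from a host that can resolve wmnet names should start returning True once the DNS change has been deployed and cached records have expired. With the TTL lowered to 1M, that should take about a minute. Services that hold long-lived connections may still need a restart, as the step above notes.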

Failover clean up steps

  • @Marostegui to re-enable and run puppet on db1133 and db1128
  • @Marostegui to depool db1133 from wikitech: dbctl instance db1133 depool ; dbctl config commit -m "Depool db1133 from wikitech T260324"
  • @Marostegui to change m5's master alias TTL from 1M to 5M

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 13 2020, 5:35 AM

Change 619902 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2135: Upgrade m5 codfw master to Buster

https://gerrit.wikimedia.org/r/619902

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.

Change 619902 merged by Marostegui:
[operations/puppet@production] db2135: Upgrade m5 codfw master to Buster

https://gerrit.wikimedia.org/r/619902

Mentioned in SAL (#wikimedia-operations) [2020-08-13T05:43:06Z] <marostegui> Stop MySQL on db2135 (codfw master), haproxy irc alert will fire T260324

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2135.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008130544_marostegui_11138.log.

Completed auto-reimage of hosts:

['db2135.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1099.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008170532_marostegui_19563.log.

Completed auto-reimage of hosts:

['db1099.eqiad.wmnet']

and were ALL successful.

Change 620818 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1128: Disable notifications

https://gerrit.wikimedia.org/r/620818

Change 620818 merged by Marostegui:
[operations/puppet@production] db1128: Disable notifications

https://gerrit.wikimedia.org/r/620818

Change 622120 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Move db1128 to m5

https://gerrit.wikimedia.org/r/622120

Change 622120 merged by Marostegui:
[operations/puppet@production] mariadb: Move db1128 to m5

https://gerrit.wikimedia.org/r/622120

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1128.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202008241313_marostegui_12147.log.

Completed auto-reimage of hosts:

['db1128.eqiad.wmnet']

and were ALL successful.

Mentioned in SAL (#wikimedia-operations) [2020-08-24T13:59:22Z] <marostegui> Stop mysql on db1117:3325 to clone db1128 - T260324

Change 622164 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Do not reimage db1128

https://gerrit.wikimedia.org/r/622164

Change 622164 merged by Marostegui:
[operations/puppet@production] install_server: Do not reimage db1128

https://gerrit.wikimedia.org/r/622164

Change 622260 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1128: Enable notifications

https://gerrit.wikimedia.org/r/622260

Change 622260 merged by Marostegui:
[operations/puppet@production] db1128: Enable notifications

https://gerrit.wikimedia.org/r/622260

Marostegui added a subscriber: Kormat.

@Andrew and cloud-services-team I would like to failover this master maybe Monday 31st at 14:00 UTC? Would that work for you?
The procedure for m5 is a bit different from the rest of the misc sections as there is no proxy being used here. So the procedure we've used in the past is the following:

Procedure:

Old master: db1133
New master: db1128

Decrease TTL a few days before the switchover

Pre-failover steps a few minutes before 14:00 UTC

  • @Marostegui to silence alerts on m5 hosts
  • @Marostegui to change replication and get everything to replicate from db1128
  • @Marostegui to pool db1128 with weight 0 on wikitech section via dbctl instance db1128 edit and then dbctl config commit -m "Pool db1128 with weight 0 T260324" so it can be later set as master.
  • @Marostegui to disable puppet on db1133 and db1128 and merge: DNS change to change m5-master alias and puppet change to change site.pp and dbproxy1021 dbproxy1017 config (even if it is not used)

Failover at 14:00 UTC

  • @Marostegui to set wikitech read-only: dbctl --scope eqiad section wikitech ro "Maintenance on wikitech T260324" && dbctl config commit -m "Set wikitech as read-only for maintenance T260324"
  • @Marostegui to perform the failover on a mysql level (at this point db1133 will become read-only)
  • @Marostegui to change the master on MW: dbctl --scope eqiad section wikitech set-master db1128 ; dbctl config commit -m "Promote db1128 to wikitech master T260324"
  • @Marostegui to kill connections on db1133
  • @Marostegui to set wikitech back to RW: dbctl --scope eqiad section wikitech rw && dbctl config commit -m "Set wikitech back to RW after maintenance T260324"
  • @Marostegui to authdns-update the DNS change
  • @Marostegui to reload dbproxy1021 and dbproxy1017
  • @Andrew or someone from cloud-services-team to verify everything starts connecting to db1128 as the m5-master record gets changed from db1133 to db1128 and restart services if needed.

Failover clean up steps

  • @Marostegui to re-enable and run puppet on db1133 and db1128
  • @Marostegui to change query killers for db1133 and db1128.
  • @Marostegui to depool db1133 from wikitech: dbctl instance db1133 depool ; dbctl config commit -m "Depool db1133 from wikitech T260324"
  • @Marostegui to change m5's master alias TTL from 1M to 5M

Change 622266 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Decrease m5-master TTL to 1M

https://gerrit.wikimedia.org/r/622266

Sorry I missed this before! I can definitely be available at 14:00 UTC on Monday. It might be nice to have a mediawiki expert around as well, possibly @Reedy ?

Change 622266 merged by Marostegui:
[operations/dns@master] wmnet: Decrease m5-master TTL to 1M

https://gerrit.wikimedia.org/r/622266

I didn't realise we have the DC switchover the day after this, so we'll probably be busy getting that ready. Let's move it to Thursday 3rd Sep at 14:00 UTC instead?

Change 623748 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] instances.yaml: Add db1128 to dbctl

https://gerrit.wikimedia.org/r/623748

Change 623748 merged by Marostegui:
[operations/puppet@production] instances.yaml: Add db1128 to dbctl

https://gerrit.wikimedia.org/r/623748

Mentioned in SAL (#wikimedia-operations) [2020-09-02T08:55:04Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1128 into s10 (wikitech) with weight 0 - T260324', diff saved to https://phabricator.wikimedia.org/P12431 and previous config saved to /var/cache/conftool/dbconfig/20200902-085455-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-02T08:57:06Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Remove db1128 from s10 - T260324', diff saved to https://phabricator.wikimedia.org/P12432 and previous config saved to /var/cache/conftool/dbconfig/20200902-085705-marostegui.json

@Andrew @CDanis when doing the initial step of pooling the future wikitech master with weight 0 (https://phabricator.wikimedia.org/P12431) I broke wikitech: it reported that it couldn't connect to the database and that there were no replicas available.
Does s10 not support having replicas, even with weight 0?
If that doesn't work, I guess the steps of adding db1128 and promoting it to master with dbctl set-master need to be done while wikitech is read-only.

Change 623757 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1128 to m5 master

https://gerrit.wikimedia.org/r/623757

Change 623759 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Promote db1128 to m5-master

https://gerrit.wikimedia.org/r/623759

Reserved failover window on the deployment's calendar

So I will do the pooling and the master promotion in the same dbctl commit, while wikitech is read-only.
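
The revised ordering can be written down as a small generator of the dbctl invocations, which makes it easier to review before the window. This is only a sketch of the plan as stated here, not a record of what was run. It uses only dbctl subcommands that appear elsewhere in this task, it prints the commands rather than executing them, and the function name failover_commands is hypothetical.

```python
import shlex
from typing import List

TASK = "T260324"


def failover_commands(new_master: str, section: str = "wikitech", scope: str = "eqiad") -> List[str]:
    """Build the ordered dbctl commands for the read-only window (nothing is executed)."""
    sec = f"dbctl --scope {scope} section {section}"

    def commit(message: str) -> str:
        return "dbctl config commit -m " + shlex.quote(message)

    return [
        # 1. Set the section read-only and commit.
        f"{sec} ro " + shlex.quote(f"Maintenance on {section} {TASK}"),
        commit(f"Set {section} as read-only for maintenance {TASK}"),
        # (The MySQL-level failover happens here, outside dbctl.)
        # 2. Pool the new master and promote it, covered by a single commit.
        f"dbctl instance {new_master} edit",
        f"{sec} set-master {new_master}",
        commit(f"Promote {new_master} to {section} master {TASK}"),
        # 3. Set the section back to read-write and commit.
        f"{sec} rw",
        commit(f"Set {section} back to RW after maintenance {TASK}"),
    ]


if __name__ == "__main__":
    for command in failover_commands("db1128"):
        print(command)
```

Note that dbctl instance db1128 edit is interactive, so in practice that step is done by hand. The point of the sketch is that pooling and set-master share one commit, so db1128 is never pooled as a weight 0 replica on its own.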

Mentioned in SAL (#wikimedia-operations) [2020-09-03T13:08:26Z] <marostegui> Start pre m5 failover steps T260324

Change 623757 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1128 to m5 master

https://gerrit.wikimedia.org/r/623757

Change 623759 merged by Marostegui:
[operations/dns@master] wmnet: Promote db1128 to m5-master

https://gerrit.wikimedia.org/r/623759

Mentioned in SAL (#wikimedia-operations) [2020-09-03T14:00:54Z] <marostegui> Failover m5 (wikitech) master - T260324

Mentioned in SAL (#wikimedia-operations) [2020-09-03T14:01:38Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set wikitech as read-only for maintenance T260324', diff saved to https://phabricator.wikimedia.org/P12487 and previous config saved to /var/cache/conftool/dbconfig/20200903-140135-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-03T14:04:12Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1128 to wikitech master T260324', diff saved to https://phabricator.wikimedia.org/P12488 and previous config saved to /var/cache/conftool/dbconfig/20200903-140411-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-03T14:04:37Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1128 to wikitech master T260324', diff saved to https://phabricator.wikimedia.org/P12489 and previous config saved to /var/cache/conftool/dbconfig/20200903-140436-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-09-03T14:04:52Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set wikitech back to RW after maintenance T260324', diff saved to https://phabricator.wikimedia.org/P12490 and previous config saved to /var/cache/conftool/dbconfig/20200903-140451-marostegui.json

Change 624079 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1133: Disable notifications

https://gerrit.wikimedia.org/r/624079

Change 624079 merged by Marostegui:
[operations/puppet@production] db1133: Disable notifications

https://gerrit.wikimedia.org/r/624079

The failover was successfully done.
wikitech went read-only at 14:01:38 UTC and was writable again at 14:04:40 UTC.
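
As a cross-check, the length of the read-only window can be estimated from the two SAL entries above: the read-only commit was logged at 2020-09-03T14:01:38Z and the read-write commit at 2020-09-03T14:04:52Z. The sketch below just does that subtraction. SAL timestamps record when each commit was logged, so the real window may differ by a few seconds. The function name window_seconds is hypothetical.

```python
from datetime import datetime

SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def window_seconds(start_iso: str, end_iso: str) -> int:
    """Return the number of whole seconds between two SAL-style UTC timestamps."""
    start = datetime.strptime(start_iso, SAL_FORMAT)
    end = datetime.strptime(end_iso, SAL_FORMAT)
    return int((end - start).total_seconds())


# Timestamps copied from the SAL entries for the read-only and read-write commits.
READ_ONLY_AT = "2020-09-03T14:01:38Z"
READ_WRITE_AT = "2020-09-03T14:04:52Z"

if __name__ == "__main__":
    secs = window_seconds(READ_ONLY_AT, READ_WRITE_AT)
    print(f"read-only for about {secs // 60}m{secs % 60}s")  # about 3m14s
```

By the SAL entries, that is a read-only window of a little over three minutes.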

I am going to leave the TTL at 1M until tomorrow, just in case. I will revert it to 5M tomorrow morning.

Change 624628 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Revert m5-master TTL from 1M to 5M

https://gerrit.wikimedia.org/r/624628

Change 624628 merged by Marostegui:
[operations/dns@master] wmnet: Revert m5-master TTL from 1M to 5M

https://gerrit.wikimedia.org/r/624628

Marostegui updated the task description.