Upgrade s2 to Buster + MariaDB 10.4
Closed, ResolvedPublic

Description

Given that the switch back will be in September, let's fully upgrade s2 to Buster and MariaDB 10.4.

  • Available backup source on the standby DC (all sections have sources everywhere now)
  • Reimage dbprov1002 from stretch to buster, reorganize backup generation
  • Switch over backup generation on the standby DC (to be confirmed by @jcrespo)
  • Candidate master on the standby DC
  • db1129
  • Master on the standby DC
  • Candidate master on the primary DC
  • Available backup source on the primary DC (to be confirmed by @jcrespo)
  • Reimage dbprov2002 from stretch to buster, reorganize backup generation
  • Switch over backup generation on the primary DC (to be confirmed by @jcrespo)
  • Switchover on the primary DC to promote a Buster+10.4 host to master: T287454
  • Upgrade the old master and make it a candidate master, pool it in s2 in API (weight 100) and main (weight 500)
  • Cleanup (remove) old backup sources from both DCs

Please read the doc about the procedure for more details.

Event Timeline

^ I have my patches ready; I will wait for Monday morning's backups to finish and then reimage dbprov1002.

Change 707243 merged by Jcrespo:

[operations/puppet@production] dbbackups: Reimage dbprov1002 to buster

https://gerrit.wikimedia.org/r/707243

Change 707250 merged by Jcrespo:

[operations/puppet@production] dbbackups: Reorganize backups after dbprov1002 reimage

https://gerrit.wikimedia.org/r/707250

dbprov1002 has been successfully reimaged to buster, with no issues.

I cannot rule out that I made some mistakes in the backup reorganization, but those should not affect the following steps; they would only need minor corrections and backup tuning afterwards, to be done tomorrow.

Change 708202 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1162: Disable notifications

https://gerrit.wikimedia.org/r/708202

Mentioned in SAL (#wikimedia-operations) [2021-07-27T05:12:12Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1162 T287230', diff saved to https://phabricator.wikimedia.org/P16899 and previous config saved to /var/cache/conftool/dbconfig/20210727-051212-marostegui.json

Change 708202 merged by Marostegui:

[operations/puppet@production] db1162: Disable notifications

https://gerrit.wikimedia.org/r/708202

Change 708203 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db1162 to Buster

https://gerrit.wikimedia.org/r/708203

Change 708203 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db1162 to Buster

https://gerrit.wikimedia.org/r/708203

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1162.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107270519_marostegui_20252.log.

Completed auto-reimage of hosts:

['db1162.eqiad.wmnet']

and were ALL successful.

db1162 reimaged (now checking tables).
I just realised we also have db1129 on Stretch, which needs to be reimaged too. I will wait for db1162 to finish its check first.

Change 708239 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db1129 to buster

https://gerrit.wikimedia.org/r/708239

Change 708239 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db1129 to buster

https://gerrit.wikimedia.org/r/708239

Mentioned in SAL (#wikimedia-operations) [2021-07-27T14:25:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1129 T287230', diff saved to https://phabricator.wikimedia.org/P16916 and previous config saved to /var/cache/conftool/dbconfig/20210727-142520-marostegui.json

Change 708299 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1129: Disable notifications

https://gerrit.wikimedia.org/r/708299

Change 708299 merged by Marostegui:

[operations/puppet@production] db1129: Disable notifications

https://gerrit.wikimedia.org/r/708299

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1129.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107271427_marostegui_6876.log.

Completed auto-reimage of hosts:

['db1129.eqiad.wmnet']

and were ALL successful.

db1129 reimaged, now checking its tables.

Mentioned in SAL (#wikimedia-operations) [2021-07-27T14:53:53Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1162 T287230', diff saved to https://phabricator.wikimedia.org/P16917 and previous config saved to /var/cache/conftool/dbconfig/20210727-145352-marostegui.json
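The dbctl depool/repool entries logged above follow a regular shape, so they can be paired up to see how long a host stayed out of rotation. The following is only a minimal sketch: the regular expression and the `parse_sal_dbctl` helper are hypothetical names written for this illustration (they are not part of dbctl or of the SAL tooling), and the pattern is inferred solely from the entries in this task.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the SAL entries in this task; an assumption, not an official format.
SAL_DBCTL = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\]\s+"
    r"<(?P<user>[^@>]+)@(?P<cumin>[^>]+)>\s+"
    r"dbctl commit \(dc=(?P<dc>[^)]+)\):\s+"
    r"'(?P<action>Depool|Repool) (?P<db>\S+) (?P<task>T\d+)'"
)


def parse_sal_dbctl(line):
    """Return the fields of a dbctl depool/repool SAL entry, or None if it does not match."""
    match = SAL_DBCTL.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # SAL timestamps end in "Z", so treat them as UTC.
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%dT%H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    return fields


# The two db1162 entries from this task (trailing "previous config" part omitted).
DEPOOL = (
    "Mentioned in SAL (#wikimedia-operations) [2021-07-27T05:12:12Z] "
    "<marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1162 T287230', "
    "diff saved to https://phabricator.wikimedia.org/P16899"
)
REPOOL = (
    "Mentioned in SAL (#wikimedia-operations) [2021-07-27T14:53:53Z] "
    "<marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1162 T287230', "
    "diff saved to https://phabricator.wikimedia.org/P16917"
)

depool = parse_sal_dbctl(DEPOOL)
repool = parse_sal_dbctl(REPOOL)
depooled_for = repool["ts"] - depool["ts"]
print(f"{depool['db']} depooled for {depooled_for}")  # db1162 depooled for 9:41:41
```

Free-form SAL entries that do not use `dbctl commit` simply return `None`, so they can be skipped when scanning a longer log.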

Change 708397 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db1122 to Buster

https://gerrit.wikimedia.org/r/708397

Change 708397 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db1122 to Buster

https://gerrit.wikimedia.org/r/708397

Change 708466 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1122: Disable notifications

https://gerrit.wikimedia.org/r/708466

Change 708466 merged by Marostegui:

[operations/puppet@production] db1122: Disable notifications

https://gerrit.wikimedia.org/r/708466

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1122.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107280820_marostegui_18853.log.

Completed auto-reimage of hosts:

['db1122.eqiad.wmnet']

and were ALL successful.

db1122 (s2 eqiad master) has been reimaged; I am now checking its tables before starting replication.

eqiad master has been upgraded to 10.4+Buster.
@jcrespo you can probably proceed with db1139 as you wish.

@jcrespo am I good to go with the codfw candidate master, or would you prefer to work out the backup sources there first?
Let me know!

Yes, that can be done now; there is no blocker, as the next backup won't be until Sunday.

Change 708730 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db2104 to Buster

https://gerrit.wikimedia.org/r/708730

Change 708730 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db2104 to Buster

https://gerrit.wikimedia.org/r/708730

Mentioned in SAL (#wikimedia-operations) [2021-07-29T10:27:54Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2104 T287230', diff saved to https://phabricator.wikimedia.org/P16925 and previous config saved to /var/cache/conftool/dbconfig/20210729-102753-marostegui.json

Change 708733 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2104: Disable notifications

https://gerrit.wikimedia.org/r/708733

Change 708733 merged by Marostegui:

[operations/puppet@production] db2104: Disable notifications

https://gerrit.wikimedia.org/r/708733

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2104.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202107291032_marostegui_7546.log.

Change 708736 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Reimage dbprov2002 to buster

https://gerrit.wikimedia.org/r/708736

Change 708737 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Reorganize backups after dbprov2002 reimage

https://gerrit.wikimedia.org/r/708737

Completed auto-reimage of hosts:

['db2104.codfw.wmnet']

and were ALL successful.

db2104 (candidate) reimaged - checking its tables before pooling it back

db2104 check was ok, started replication again.

Change 709393 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Purge db1139:s2, old eqiad stretch backup source

https://gerrit.wikimedia.org/r/709393

Change 709393 merged by Jcrespo:

[operations/puppet@production] dbbackups: Purge db1139:s2, old eqiad stretch backup source

https://gerrit.wikimedia.org/r/709393

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1107.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108090651_marostegui_24172.log.

The db1107 reimage was meant to be logged on T288197: Failover m3 (phabricator) master (db1132) to a different host to upgrade its kernel

Completed auto-reimage of hosts:

['db1107.eqiad.wmnet']

and were ALL successful.

Change 708736 merged by Jcrespo:

[operations/puppet@production] dbbackups: Reimage dbprov2002 to buster

https://gerrit.wikimedia.org/r/708736

Script wmf-auto-reimage was launched by jynus on cumin2002.codfw.wmnet for hosts:

['dbprov2002.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108090818_jynus_1653649.log.

Completed auto-reimage of hosts:

['dbprov2002.codfw.wmnet']

and were ALL successful.

Change 708737 merged by Jcrespo:

[operations/puppet@production] dbbackups: Reorganize backups after dbprov2002 reimage

https://gerrit.wikimedia.org/r/708737

Change 711258 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db2107 to Buster

https://gerrit.wikimedia.org/r/711258

Change 711258 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db2107 to Buster

https://gerrit.wikimedia.org/r/711258

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2107.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202108110537_marostegui_4804.log.

Completed auto-reimage of hosts:

['db2107.codfw.wmnet']

and were ALL successful.

db2107 has been reimaged - I am now checking its tables.

The check finished - all clean. Once the host catches up, I will start pooling it.

db2107 is being slowly repooled

All done; the only remaining step is the cleanup of the old backup sources, to be done by @jcrespo.

Change 712925 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Remove s2 stretch codfw backup source, move s4, upgrade 2099

https://gerrit.wikimedia.org/r/712925

Change 712925 merged by Jcrespo:

[operations/puppet@production] dbbackups: Remove s2 stretch codfw backup source, move s4, upgrade 2099

https://gerrit.wikimedia.org/r/712925

Mentioned in SAL (#wikimedia-operations) [2021-08-16T08:49:54Z] <jynus> replacing s2 with s4 on db2097 T287230

@Marostegui I think there are no more 10.1 s2 instances, but please confirm.

Only db2097:3312 is still on tendril (orchestrator and zarcillo look good). Is it there for any reason, or should I remove it?

I must have made a mistake when removing it, or the script didn't complete; the instance doesn't actually exist anymore. Could you please clean it up? Thank you and sorry.

All done and looking good
Resolving!