
Upgrade x1 databases to Buster and Mariadb 10.4
Closed, Resolved (Public)

Description

x1 should be an "easy" section to upgrade fully, as it doesn't replicate to labs hosts, so we can upgrade everything up to and including the master.

  • db1120 (eqiad master: switched over to db1103 on 15th July 2020)
  • db1095 (removed): x1 now moved to db1102, which runs buster (backup source)
  • db1103
  • db1137
  • db2096 (codfw master)
  • db2101 (backup source)
  • db2115
  • db2131
  • dbstore1005 (T254870)

Event Timeline


@jcrespo would you be okay if I upgrade db1095 and db2101 to Buster? Those are backup sources.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2131.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006091353_marostegui_105111.log.

Completed auto-reimage of hosts:

['db2131.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2131.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006091421_marostegui_129669.log.

Completed auto-reimage of hosts:

['db2131.codfw.wmnet']

and were ALL successful.

Change 604036 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Enable notification on db2131

https://gerrit.wikimedia.org/r/604036

Change 604036 merged by Marostegui:
[operations/puppet@production] mariadb: Enable notification on db2131

https://gerrit.wikimedia.org/r/604036

@jcrespo would you be okay if I upgrade db1095 and db2101 to Buster? Those are backup sources.

If you upgrade them, snapshots may not work, as dbprov hosts are not upgraded and it is not advisable to prepare them with older xtrabackup versions.

They also host s2 and s3 (db1095).

Aside from the version mismatch, as long as they are prepared with the same or a newer version, everything should work.

Let me know how you want to proceed with this.
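
As a rough illustration of the rule stated above (a snapshot should be prepared with the same or a newer version than the one that took it), here is a minimal Python sketch. The function names and the version strings are hypothetical and only serve the example; this is not how the backup tooling itself performs the check.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '10.4.13' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def can_prepare(taken_with: str, prepared_with: str) -> bool:
    """Return True when the preparing version is the same as, or newer
    than, the version that took the snapshot."""
    return parse_version(prepared_with) >= parse_version(taken_with)


# Version strings below are made up for illustration only.
print(can_prepare("10.4.13", "10.4.13"))  # same version on both sides
print(can_prepare("10.4.13", "10.1.44"))  # older version preparing a newer snapshot
```

Comparing tuples of integers avoids the pitfall of comparing version strings lexically, where "10.10" would sort before "10.4".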

They also host s2 and s3 (db1095).

I can migrate x1 to db1140, which is already on buster, solving the extra sections issue. But not sure how to go about dbprov hosts.

I don't really know how to proceed :-( I was looking for ideas. Should we maybe upgrade more hosts in s2 and s3 so it is "worth" upgrading db1095 and dbprov?

This is the plan after a conversation on IRC:

<jynus> 1) I upgrade db2101 (x1) to 10.4
<jynus> and send snapshots to backup2002
<jynus> 2) I move db1102 (s4, s5) to db1145 (stretch)
<jynus> (I will keep dumps on the same dbprov)
<jynus> 3) I put x1 on db1102 (buster)
<jynus> (I then can remove it from db1095)
<jynus> 4) I backup x1 from db1102 to backup1002

<jynus> on Q1
<jynus> we will get dbprov[12]003
<jynus> and will be on buster directly and return snapshots from buster to it
<jynus> plus we will get an extra backup source also on buster for extra flexibility
jcrespo moved this task from Pending comment to In progress on the DBA board.

Change 607228 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Reimage db2101 (x1 backup source) to buster

https://gerrit.wikimedia.org/r/607228

Change 607228 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Reimage db2101 (x1 backup source) to buster

https://gerrit.wikimedia.org/r/607228

Mentioned in SAL (#wikimedia-operations) [2020-06-23T09:46:39Z] <jynus> stopping and reimaging db2101 into buster T254871

Script wmf-auto-reimage was launched by jynus on cumin2001.codfw.wmnet for hosts:

['db2101.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006230948_jynus_15257.log.

Completed auto-reimage of hosts:

['db2101.codfw.wmnet']

and were ALL successful.

Change 607264 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Reenable db2101 with snapshots to backup2002

https://gerrit.wikimedia.org/r/607264

Change 607264 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Reenable db2101 with snapshots to backup2002

https://gerrit.wikimedia.org/r/607264

Change 607267 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] install_server: Reimage and wipe db1145 into stretch

https://gerrit.wikimedia.org/r/607267

Change 607267 merged by Jcrespo:
[operations/puppet@production] install_server: Reimage and wipe db1145 into stretch

https://gerrit.wikimedia.org/r/607267

Change 607288 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Setup db1145 as a mariadb backup source for s4, s5

https://gerrit.wikimedia.org/r/607288

Change 607288 merged by Jcrespo:
[operations/puppet@production] mariadb: Setup db1145 as a mariadb backup source for s4, s5

https://gerrit.wikimedia.org/r/607288

See: T253217#6252401. After that, x1 will be removed from db1095 (which will continue with stretch).

Change 607510 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Move x1 backup source from db1095 to db1102

https://gerrit.wikimedia.org/r/607510

Change 607515 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Remove x1 from db1095 and enable db1102 notif.

https://gerrit.wikimedia.org/r/607515

Change 607510 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Move x1 backup source from db1095 to db1102

https://gerrit.wikimedia.org/r/607510

Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts:

['db1102.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006260844_jynus_24187.log.

Completed auto-reimage of hosts:

['db1102.eqiad.wmnet']

and were ALL successful.

Change 607515 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Remove x1 from db1095 and enable db1102 notif.

https://gerrit.wikimedia.org/r/607515

db1095 instance (stretch) has been backed up and moved to db1102 (buster). Backups are now done there and sent to backup1002.

Edit: s/not/now/

Thank you! I will go ahead and finish the codfw master and then start looking at dates to fail over the x1 primary master.

Change 608304 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Reimage db2096 (codfw x1 master) to Buster

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608304

Change 608304 merged by Marostegui:
[operations/puppet@production] mariadb: Reimage db2096 (codfw x1 master) to Buster

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608304

Mentioned in SAL (#wikimedia-operations) [2020-06-29T12:20:39Z] <marostegui> Stop MySQL on db2096 (codfw x1 master) for reimage T254871

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2096.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202006291223_marostegui_12612.log.

Completed auto-reimage of hosts:

['db2096.codfw.wmnet']

and were ALL successful.

Change 608327 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2096: Enable notifications

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608327

Change 608327 merged by Marostegui:
[operations/puppet@production] db2096: Enable notifications

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608327

Pending: schedule x1 master switchover

@jcrespo @Kormat let's do this Wednesday 15th July at 06:00 AM UTC?

Marostegui added subscribers: Tgr, JoeWalsh, Dbrant and 4 others.

On Wednesday 15th July at 06:00 AM UTC we will set x1 to read-only for around 1 minute to switch over the master to an upgraded one.
During this time, writes will be blocked. Reads will not be affected in any way.

Change 612474 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Promote db1103 to x1 master

https://gerrit.wikimedia.org/r/612474

Change 612475 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/dns@master] wmnet: Update x1 alias

https://gerrit.wikimedia.org/r/612475

Mentioned in SAL (#wikimedia-operations) [2020-07-15T04:44:32Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1103 weight to 0 before the switchover T254871', diff saved to https://phabricator.wikimedia.org/P11908 and previous config saved to /var/cache/conftool/dbconfig/20200715-044432-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-15T04:46:25Z] <marostegui> Start x1 pre failover steps T254871

Change 612474 merged by Marostegui:
[operations/puppet@production] mariadb: Promote db1103 to x1 master

https://gerrit.wikimedia.org/r/612474

Mentioned in SAL (#wikimedia-operations) [2020-07-15T06:00:52Z] <marostegui> Starting x1 failover from db1120 to db1103 - T254871

Mentioned in SAL (#wikimedia-operations) [2020-07-15T06:01:45Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1103 to x1 master T254871', diff saved to https://phabricator.wikimedia.org/P11910 and previous config saved to /var/cache/conftool/dbconfig/20200715-060145-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-07-15T06:06:50Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1120 for reimage T254871', diff saved to https://phabricator.wikimedia.org/P11911 and previous config saved to /var/cache/conftool/dbconfig/20200715-060649-marostegui.json

Change 612475 merged by Marostegui:
[operations/dns@master] wmnet: Update x1 alias

https://gerrit.wikimedia.org/r/612475

Mentioned in SAL (#wikimedia-operations) [2020-07-15T06:09:14Z] <marostegui> Stop replication on db1120 to avoid having 10.4 -> 10.1 replication for long T254871

Switchover was done successfully. We had only 69 read-only errors.

RO started at 06:01:26
RO stopped at 06:01:45
Total read-only time: 19 seconds
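
For anyone re-checking the figure, a small Python sketch of the arithmetic follows. It assumes both timestamps fall on the same day and in the same timezone; the helper name is made up for this example.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def read_only_seconds(start: str, stop: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    delta = datetime.strptime(stop, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


print(read_only_seconds("06:01:26", "06:01:45"))  # 19
```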

All of them came from:

Error 1290 from MediaWiki\Extensions\ReadingLists\ReadingListRepository::selectValidList, The MariaDB server is running with the --read-only option so it cannot execute this statement (10.64.32.11) SELECT  rl_id,rl_is_default,rl_name,rl_description,rl_date_created,rl_date_updated,rl_deleted,rl_user_id  FROM `reading_list`    WHERE rl_id = xx  LIMIT 1   FOR UPDATE 10.64.32.11
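
To confirm that errors like the one above all share one caller, a tally along these lines could be used. This is only a sketch: it assumes each log line follows the "Error <code> from <caller>, ..." shape seen here, and the regular expression, helper name and sample list are illustrative rather than part of any existing tooling.

```python
import re
from collections import Counter
from typing import Iterable

# Matches "Error <code> from <caller>," as in the line quoted above.
ERROR_RE = re.compile(r"Error (\d+) from ([^,\s]+),")


def count_by_caller(lines: Iterable[str]) -> Counter:
    """Count log lines per (error code, caller) pair; skip non-matching lines."""
    counts: Counter = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            code, caller = match.groups()
            counts[(int(code), caller)] += 1
    return counts


# Sample input: the same line repeated, standing in for a real log extract.
sample = [
    r"Error 1290 from MediaWiki\Extensions\ReadingLists\ReadingListRepository::selectValidList, "
    r"The MariaDB server is running with the --read-only option so it cannot execute this statement",
] * 3

for (code, caller), total in count_by_caller(sample).items():
    print(code, caller, total)
```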

Change 612701 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db1120: Disable notifications

https://gerrit.wikimedia.org/r/612701

Change 612701 merged by Marostegui:
[operations/puppet@production] db1120: Disable notifications

https://gerrit.wikimedia.org/r/612701

Change 612702 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] install_server: Reimage db1120 to Buster

https://gerrit.wikimedia.org/r/612702

Change 612702 merged by Marostegui:
[operations/puppet@production] install_server: Reimage db1120 to Buster

https://gerrit.wikimedia.org/r/612702

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1120.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202007150825_marostegui_20513.log.

Completed auto-reimage of hosts:

['db1120.eqiad.wmnet']

and were ALL successful.

All done, x1 fully running Buster and MariaDB 10.4