Upgrade all sanitarium masters to 10.4 and Buster
Open, Medium, Public

Description

labsdb* hosts have been moved under 10.4 sanitarium hosts, so we can now migrate all sanitarium masters to Buster and 10.4.

Hosts:

eqiad:

  • s1 db1106 - [x] tables checked - [ ] tables checked after the upgrade
  • s2 db1074 (replace it with db1156, T258361) - [x] tables checked
  • s3 db1112 - [x] tables checked - [ ] tables checked after the upgrade
  • s4 db1121 - [x] tables checked (before the upgrade) - [ ] tables checked after the upgrade
  • s5 db1082 (replace it with db1161, T258361) - [x] tables checked
  • s6 db1085 (replace it with db1165, T258361) - [x] tables checked
  • s7 db1079 (replace it with db1158, T258361) - [x] tables checked
  • s8 db1087 (replace it with db1167, T258361) - [x] tables checked

codfw:

  • s1 db2072 - [x] tables checked
  • s2 db2126 - [x] tables checked
  • s3 db2074 - [x] tables checked
  • s4 db2073 - [x] tables checked
  • s5 db2128 - [x] tables checked
  • s6 db2076 - [x] tables checked
  • s7 db2077 - [x] tables checked
  • s8 db2082 - [x] tables checked

Event Timeline

Change 681448 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Remove s3 from db2098

https://gerrit.wikimedia.org/r/681448

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2077.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104210546_marostegui_8470.log.

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db2082.codfw.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104210549_marostegui_8813.log.

Completed auto-reimage of hosts:

['db2077.codfw.wmnet']

and were ALL successful.

Completed auto-reimage of hosts:

['db2082.codfw.wmnet']

and were ALL successful.

I finished setting up db2139 with an s3 instance on Buster. As soon as I merge the above patch (https://gerrit.wikimedia.org/r/681439), db2139:s3 will be the canonical location for s3 backups on codfw. I will do it now, unless any of you happen to still be around for a review.

There is some cleanup to do afterwards (reenabling alerts, destroying db2098:s3, removing it from tendril and zarcillo), but that can wait. In fact, if you want to keep db2098:s3 and start replicating on it to check whether it breaks, as an experiment, I think it could be a good check. Otherwise I am going to destroy it, as I have kept a longer-term backup on dbprov2002. It does not matter if db2098:s3 breaks; db2139:s3 is the new source of backups now.

In other words, feel free to restart replication as soon as you are able to; I am no longer a blocker.

Thanks for being so fast with this.
Replication restarted on the s3 codfw master.

@jcrespo - I also started replication on db2098:3313; let's see what happens.
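
For reference, a minimal sketch of what starting replication on one instance of a multi-instance host like db2098 can look like; the socket path and the port-to-section mapping (s3 = 3313) follow the usual per-section layout and are assumptions here, not commands copied from this task:

# hypothetical sketch: start replication on the s3 instance only
sudo mysql --socket=/run/mysqld/mysqld.s3.sock -e "START SLAVE;"
sudo mysql --socket=/run/mysqld/mysqld.s3.sock -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'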

Tables being checked on db1156 as it was just productionized at T258361

Change 681447 merged by Jcrespo:

[operations/puppet@production] mariadb: Reenable notifications for db2139 after maintenance

https://gerrit.wikimedia.org/r/681447

db1165 is ready to take over db1085 in s6

db1156 needs to be built from a logical dump. The copy from db1074 looks corrupted, so it is best to rebuild it from the logical dumps.

@jcrespo may I offload the above ^ to you? That would help me a lot time-wise.

All codfw sanitarium masters are running 10.4 and Buster

Could you provide/confirm more details about what you want to achieve? db1156 needs to host s2, and you want to rebuild it logically from db1171 (the stretch backup source), removing its current content and setting it up as a regular dedicated core host on Buster, then possibly check it against db1122 (the eqiad s2 primary)? Is that correct?

db1156 is now running 10.4 + Buster.
What I would like to do is the following (see the command sketch after this list):

  • Delete all its content
  • Load a logical dump from the Buster backup source (if there is not one, stretch is also fine)
  • Start replication from its master (db1122)
  • Check data against db1074 (the host it will replace)
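
A rough sketch of those steps, with hypothetical binlog coordinates and leaving the actual dump reload to the backup/recovery tooling; this is not the exact procedure that was run on db1156:

# hypothetical sketch of the rebuild; coordinates and credentials are placeholders
sudo mysql -e "STOP SLAVE; RESET SLAVE ALL;"   # clear the old replication state
# ... drop the old wiki databases and reload the logical dump (backup/recovery tooling) ...
sudo mysql -e "CHANGE MASTER TO MASTER_HOST='db1122.eqiad.wmnet', MASTER_LOG_FILE='db1122-bin.000001', MASTER_LOG_POS=4; START SLAVE;"   # replication credentials omitted
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'

The final data check against db1074 corresponds to the db-compare loop shown further down in this task.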

Thanks for the clarifications; this helps me make sure I don't break the wrong host and that I recover the right one, because I lack all the context.
Will do as requested.

We don't yet have s2 Buster backups (I set up s6 ones as I was told that was next), but it shouldn't matter too much for a logical recovery.

No worries, stretch + mysql_upgrade should be fine :)
Feel free to ask as many questions as you need if something isn't clear.

Thanks a lot for the help, I really appreciate it.

FYI - I will load the stretch backup directly, as the logical backups do not contain system tables (only wiki ones); those will have to be set up separately. Don't worry, I will still run mysql_upgrade just in case (as I always do), but because of the way we do backups it should be a no-op: I expect our logical backups to be interoperable across versions (unlike snapshots, which are very version-dependent), as we don't use any deprecated data types.
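
A minimal sketch of that post-load verification, assuming a standard MariaDB 10.4 setup; the exact commands are not recorded in this task:

# hypothetical post-load check after importing a 10.1 (stretch) logical dump on 10.4
sudo mysql_upgrade --force          # revalidates/updates system tables; expected to be close to a no-op here
sudo mysql -e "SELECT VERSION();"   # confirm the running server version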

That's perfectly fine yeah. Thanks :)

Mentioned in SAL (#wikimedia-operations) [2021-04-23T07:56:22Z] <jynus> deleting db1156 s2 database and reloading it from logical backups T280492

db1156 should be almost ready for handover to core production, but I am going to take the opportunity to set up a new s2 backup source, now that it has been logically loaded.

Change 682574 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Add s2 to db1102, a buster backup source

https://gerrit.wikimedia.org/r/682574

Change 682574 merged by Jcrespo:

[operations/puppet@production] dbbackups: Add s2 to db1102, a buster backup source

https://gerrit.wikimedia.org/r/682574

Change 682668 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Reenable notifications on db1156 after maintenance

https://gerrit.wikimedia.org/r/682668

I am running a compare on db1156:

# tail -n +2 mediawiki-config/dblists/s2.dblist | while read db; do while read table column; do echo "$db.$table"; db-compare $db $table $column db1074 db1156 || break 2; done < software/dbtools/tables_to_check.txt; done

And then it will be all yours.

Excellent, thanks. It will take around a day I'd guess.

Change 682682 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariadb: Reenable notifications on db1102 after maintenance

https://gerrit.wikimedia.org/r/682682

Change 682682 merged by Jcrespo:

[operations/puppet@production] mariadb: Reenable notifications on db1102 after maintenance

https://gerrit.wikimedia.org/r/682682

It finished at ~3am: all yours. Please note I loaded grants and events to the best of my ability, but please double-check those: the grant files are very outdated (no tendril grants, prometheus and icinga don't use socket authentication, etc.), so I fixed them as best I could.

Thank you, I will take it over!

Change 682668 merged by Marostegui:

[operations/puppet@production] mariadb: Reenable notifications on db1156 after maintenance

https://gerrit.wikimedia.org/r/682668

Change 683483 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1121: Disable notifications

https://gerrit.wikimedia.org/r/683483

Change 683483 merged by Marostegui:

[operations/puppet@production] db1121: Disable notifications

https://gerrit.wikimedia.org/r/683483

Change 684667 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1121: Enable notifications

https://gerrit.wikimedia.org/r/684667

Change 684667 merged by Marostegui:

[operations/puppet@production] db1121: Enable notifications

https://gerrit.wikimedia.org/r/684667

Mentioned in SAL (#wikimedia-operations) [2021-05-04T07:11:46Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1161 and db1082 to change s5 sanitarium master T280492', diff saved to https://phabricator.wikimedia.org/P15692 and previous config saved to /var/cache/conftool/dbconfig/20210504-071146-marostegui.json

s5 sanitarium master switched: db1154 now replicates from db1161 (10.4)
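
For context, a rough sketch of what repointing the sanitarium host involves, with hypothetical binlog coordinates and assuming the usual per-section socket on the multi-instance sanitarium host; the real switch is coordinated and the coordinates come from the masters at switch time:

# hypothetical sketch: point db1154's s5 instance at the new master db1161
sudo mysql --socket=/run/mysqld/mysqld.s5.sock -e "STOP SLAVE;"
sudo mysql --socket=/run/mysqld/mysqld.s5.sock -e "CHANGE MASTER TO MASTER_HOST='db1161.eqiad.wmnet', MASTER_LOG_FILE='db1161-bin.000010', MASTER_LOG_POS=4; START SLAVE;"   # placeholder coordinates
sudo mysql --socket=/run/mysqld/mysqld.s5.sock -e "SHOW SLAVE STATUS\G" | grep -E 'Master_Host|Slave_(IO|SQL)_Running'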

Mentioned in SAL (#wikimedia-operations) [2021-05-04T08:02:07Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1106 from s1 vslow to get its tables checked and pool db1099:3311 instead T280492', diff saved to https://phabricator.wikimedia.org/P15699 and previous config saved to /var/cache/conftool/dbconfig/20210504-080206-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-05-04T08:02:58Z] <marostegui> Check tables on db1106, lag will show up on s1 on wiki replicas (T280492)
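
A minimal sketch of what such a table check can look like on a depooled replica; the exact tooling used on db1106 is not recorded here:

# hypothetical full table check on a depooled replica; CHECK TABLE is slow and blocks
# replication on the host, which is why lag then shows up downstream on the wiki replicas
sudo mysqlcheck --check --all-databases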

Change 684795 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1106: Disable notifications

https://gerrit.wikimedia.org/r/684795

Change 684795 merged by Marostegui:

[operations/puppet@production] db1106: Disable notifications

https://gerrit.wikimedia.org/r/684795

Mentioned in SAL (#wikimedia-operations) [2021-05-05T06:40:59Z] <marostegui> Check tables on db1112 (lag might show up on s3 on wiki replicas) T280492

Mentioned in SAL (#wikimedia-operations) [2021-05-05T06:42:05Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1074 and db1156 to switch sanitarium hosts T280492', diff saved to https://phabricator.wikimedia.org/P15730 and previous config saved to /var/cache/conftool/dbconfig/20210505-064204-marostegui.json

s2 sanitarium master db1074 has been replaced by db1156

s7 sanitarium master db1079 has been replaced by db1158

I am going to remove the db2098 s3 10.1 instance, now that db2139 has been working fine for a while. A last backup of the old instance will be available on dbprov2002 until it is no longer recoverable (and we can always recover from logical backups).

Change 685717 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: remove db2098 s3 section for this codfw backup source

https://gerrit.wikimedia.org/r/685717

Change 685717 merged by Jcrespo:

[operations/puppet@production] dbbackups: remove db2098 s3 section for this codfw backup source

https://gerrit.wikimedia.org/r/685717

db2098 s3 should be gone now, and will soon be gone from grafana/prometheus.

s8 sanitarium master db1087 has been replaced by db1167

Change 687965 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db1121 to Buster

https://gerrit.wikimedia.org/r/687965

Change 687965 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db1121 to Buster

https://gerrit.wikimedia.org/r/687965

Change 688742 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1121: Disable notifications

https://gerrit.wikimedia.org/r/688742

Mentioned in SAL (#wikimedia-operations) [2021-05-11T05:11:03Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1121 - going to be reimaged to buster T280492', diff saved to https://phabricator.wikimedia.org/P15895 and previous config saved to /var/cache/conftool/dbconfig/20210511-051102-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2021-05-11T05:11:41Z] <marostegui> Reimage db1121 to buster, this will generate lag on s4 (commonswiki) on wikireplicas T280492

Change 688742 merged by Marostegui:

[operations/puppet@production] db1121: Disable notifications

https://gerrit.wikimedia.org/r/688742

Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts:

['db1121.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105110519_marostegui_32061.log.

Completed auto-reimage of hosts:

['db1121.eqiad.wmnet']

and were ALL successful.

db1121 has been reimaged to Buster.
I am checking the tables now, which means commonswiki will show lag on wikireplicas.