
Test MariaDB 10.4 with Bullseye
Open, MediumPublic

Description

Let's start testing out Bullseye + 10.4.

We are going to go for Bullseye and 10.4 rather than 10.5 or 10.6 in order to minimize the amount of variables that could impact performance.

  • Compile and package 10.4 for Bullseye
  • Reimage db1125 replica (test-cluster host) with Bullseye (tested both: keeping /srv and wiping it entirely)
  • Reimage db1124 master (test-cluster) with Bullseye
  • Reimage db1128 (it was the m5 master, but it is now a spare after T288720) with bullseye and move it to s1 to let it replicate.
  • Reimage pc2014 (pc1, codfw spare) with Bullseye and let it replicate

Event Timeline

Marostegui moved this task from Triage to Ready on the DBA board.

I have finished compiling 10.4.22 on bullseye. Tomorrow I'll try to package it

Packaged 10.4.22 on bullseye - tested very briefly on my local testing environment. I will try to go for the reimage on db1125 next week and install the packages there and see how it goes in our environment.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye executed with errors:

  • db1125 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

The installer is failing to get db1125 installed with bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye executed with errors:

  • db1125 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye executed with errors:

  • db1125 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111221012_marostegui_16710_db1125.out
    • The reimage failed, see the cookbook logs for the details

The installer went fine this time after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/740531/
There's now a different, puppet-related error. I will double check.

Change 740541 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] packages_wmf.pp: Add bullseye support

https://gerrit.wikimedia.org/r/740541

Change 740541 merged by Marostegui:

[operations/puppet@production] packages_wmf.pp: Add bullseye support

https://gerrit.wikimedia.org/r/740541

The above patch fixed it, puppet ran fine. I am going to issue another full reimage.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye completed:

  • db1125 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111221045_marostegui_21527_db1125.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

The package needs to be recreated due to an issue between buster and bullseye. We cannot have the same package name for both versions in our internal repo, so I have talked to Moritz and the solution is to name the 10.4 build for bullseye wmf-mariadb104_10.4.22+deb11u1_amd64.deb instead.
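
As a quick sanity check on the naming scheme, the +deb11u1 suffix still sorts as newer than the plain buster version as far as dpkg is concerned. A minimal sketch (not part of the actual build or repo tooling):

# dpkg sorts end-of-string before '+', so the bullseye build wins in version comparison
dpkg --compare-versions 10.4.22 lt 10.4.22+deb11u1 && echo "10.4.22+deb11u1 sorts newer"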

Pushed wmf-mariadb104-client_10.4.22+deb11u1_amd64.deb to the repo. Going to double check it has not broken anything before rebuilding the server package.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Pushed wmf-mariadb104_10.4.22+deb11u1_amd64.deb

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye completed:

  • db1125 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111230641_marostegui_20506_db1125.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Marostegui renamed this task from Reimage db1125 (test cluster) with Bullseye to Reimage db1124 and db1125 (test cluster) with Bullseye. Tue, Nov 23, 7:18 AM
Marostegui updated the task description.

I have also added db1124 to be reimaged, so that we can have a master (even if it doesn't carry production traffic) running Bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bullseye completed:

  • db1124 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111230909_marostegui_11175_db1124.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

db1128 is now free from m5. I am going to do a few more tests on db1124 and db1125, and then I will reimage it with bullseye and place it on s1 to simply let it replicate.

Marostegui renamed this task from Reimage db1124 and db1125 (test cluster) with Bullseye to Test MariaDB 10.4 with Bullseye. Wed, Nov 24, 6:11 AM
Marostegui updated the task description.
Marostegui updated the task description.

Change 740967 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1128: Disable notifications

https://gerrit.wikimedia.org/r/740967

Change 740967 merged by Marostegui:

[operations/puppet@production] db1128: Disable notifications

https://gerrit.wikimedia.org/r/740967

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye completed:

  • db1125 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111240740_marostegui_27279_db1125.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

It looks like prometheus isn't showing the query latency; I am investigating.

So it looks like the http_request_duration_microseconds metric is removed in bullseye, which is the one we use for the query monitoring graph (note: it does not measure how long our queries take, but how long it takes for prometheus to reach them - we know that, but we use it as a proxy, as it also spikes when the host is very loaded).

After investigating with Filippo, we've decided to create a graph that uses sum(mysql_exporter_collector_duration_seconds) to see if it can be a replacement for the other one.
The idea is to leave it there and see if the two behave in a somewhat similar way when hosts have issues. We have no other option anyway (apart from patching prometheus to add the metric back - but that is very painful and has been discarded for now).
This is the example graph: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=40&orgId=1&from=now-30m&to=now&refresh=1m&var-job=All&var-server=db1125&var-port=9104

It is enabled everywhere, but it will only come up if the latency is higher than 10 microseconds.

We actually believe it might be even better than the current one, but we'll see how the two compare once a given server is under real issues, as we don't have any recent data from servers suffering lots of real issues to compare them.
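
For anyone wanting to eyeball the replacement metric outside Grafana, a query along these lines against the Prometheus HTTP API should do. This is just a sketch - the Prometheus URL and the instance label are placeholders guessed from the dashboard variables:

# placeholder URL; instance label assumed to be host:exporter_port
curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=sum by (instance) (mysql_exporter_collector_duration_seconds{instance="db1125:9104"})'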

Given that db1124 and db1125 seemed OK, I have configured replication on db1124 (test-s4 cluster) with STATEMENT+GTID, replicating just the enwiki.recentchanges table, to see how that goes with some more writes (apart from pt-heartbeat).
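
For the record, the setup is roughly of this shape. This is only a sketch - the master host name is a placeholder, credentials are omitted, and the exact options may differ from what was actually run:

# replication filters are dynamic in MariaDB, but the slave threads must be stopped first
mysql -e "STOP SLAVE;"
mysql -e "SET GLOBAL binlog_format = 'STATEMENT';"
mysql -e "SET GLOBAL replicate_do_table = 'enwiki.recentchanges';"
# MASTER_HOST is illustrative; MASTER_USER/MASTER_PASSWORD left out on purpose
mysql -e "CHANGE MASTER TO MASTER_HOST='test-s4-master.example', MASTER_USE_GTID=slave_pos;"
mysql -e "START SLAVE;"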

Change 741754 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1128: Move it to test-s1

https://gerrit.wikimedia.org/r/741754

Change 741754 merged by Marostegui:

[operations/puppet@production] db1128: Move it to test-s1

https://gerrit.wikimedia.org/r/741754

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1128.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2021-11-25T07:49:43Z] <marostegui> Stop mysql on db1133 to clone db1128 as a test host T295965

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1128.eqiad.wmnet with OS bullseye completed:

  • db1128 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111250727_marostegui_1795_db1128.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

I have disconnected db1124 from s1, as I am getting db1128 ready to fully replicate from s1.

db1128 is now fully replicating from the s1 master (all tables). It is using GTID like any other normal replica.
I am not planning to pool this host any time soon (definitely not before the end-of-year holidays), but I want to do a few tests with it, apart from letting replication flow, to see if there's any performance regression on bullseye (reminder: the mariadb version remains 10.4).
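
Quick checks for a replica in this state would be something like the following (a sketch run locally on the host; output trimmed):

mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Using_Gtid|Seconds_Behind_Master'
mysql -e "SELECT @@GLOBAL.gtid_slave_pos;"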

@MatthewVernon question about prometheus-mysqld-exporter: I think on Buster the exporter starts (or restarts) automatically once mysql is started, right? I don't see that happening on bullseye. Do we need to do something else to get that behaviour on Bullseye as well? It was a great improvement on buster, as we would sometimes forget to start/restart it, so if you could take a look and see if something is needed there, that'd be great.

If you need to test, you can stop/start mariadb on db1124 or db1125 as you wish (they do not replicate from anywhere else nor have valid data)

On a side note, it looks like there's no need to recreate the host in tendril when going from buster to bullseye (essentially because I think the earlier problem was the 10.1 -> 10.4 jump, and in this case we are not switching versions).

The problem is that /lib/systemd/system/mariadb.service lacks the changes from https://gerrit.wikimedia.org/r/c/operations/software/+/715926

so, e.g. on a working system:

mvernon@db1118:~$ grep prometheus /lib/systemd/system/mariadb.service 
# If available, cause prometheus-mysqld-exporter to be started when
Before=prometheus-mysqld-exporter.service
Wants=prometheus-mysqld-exporter.service

whereas on db1124:

mvernon@db1124:~$ grep prometheus /lib/systemd/system/mariadb.service
mvernon@db1124:~$

Thanks @MatthewVernon - it looks like I built the package from the wrong directory, which didn't contain the latest version of the dbtool directory, hence the missing patch.
I am rebuilding the packages from the correct one.

Thanks for troubleshooting!

I have rebuilt the server package with the fix. I will test it tomorrow.

New package installed:

root@db1128:/home/marostegui# dpkg -i wmf-mariadb104_10.4.22+deb11u2_amd64.deb
(Reading database ... 44973 files and directories currently installed.)
Preparing to unpack wmf-mariadb104_10.4.22+deb11u2_amd64.deb ...
Unpacking wmf-mariadb104 (10.4.22+deb11u2) over (10.4.22+deb11u1) ...
Setting up wmf-mariadb104 (10.4.22+deb11u2) ...
<snip>
root@db1128:/home/marostegui# grep prometheus /lib/systemd/system/mariadb.service
# If available, cause prometheus-mysqld-exporter to be started when
Before=prometheus-mysqld-exporter.service
Wants=prometheus-mysqld-exporter.service
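
For completeness, the unit-level wiring can also be confirmed via systemd itself; an optional extra check, just a sketch:

# should print the Wants= and Before= coupling to the exporter
systemctl show -p Wants -p Before mariadb.service | grep -i prometheus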

Change 741997 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] control-mariadb-client-10.4-bullseye: Bump version

https://gerrit.wikimedia.org/r/741997

Change 741997 merged by jenkins-bot:

[operations/software@master] control-mariadb-client-10.4-bullseye: Bump version

https://gerrit.wikimedia.org/r/741997

Upgraded db1128, db1124 and db1125

Also pushed wmf-mariadb104_10.4.22+deb11u2_amd64.deb to the repo

On the grafana mysql dashboard (the per host one) we'd also need to change the monitoring response top box to get it to use sum(mysql_exporter_collector_duration_seconds)

Change 742282 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc2014: Disable notifications

https://gerrit.wikimedia.org/r/742282

Change 742282 merged by Marostegui:

[operations/puppet@production] pc2014: Disable notifications

https://gerrit.wikimedia.org/r/742282

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc2014.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc2014.codfw.wmnet with OS bullseye completed:

  • pc2014 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111290539_marostegui_23235_pc2014.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

pc2014 has been reimaged with Bullseye too. It is a pc1 codfw spare, so it will get no reads (even if we switch DCs). I would like to test Bullseye on it, as pc has a very peculiar mysql traffic pattern (a huge amount of REPLACEs), and I want to see if there are regressions on that front.

Change 742588 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] control-mariadb-10.4-bullseye: Control file for 10.4 on Bullseye

https://gerrit.wikimedia.org/r/742588

Change 742588 merged by jenkins-bot:

[operations/software@master] control-mariadb-10.4-bullseye: Control file for 10.4 on Bullseye

https://gerrit.wikimedia.org/r/742588

Pushed this new file after checking that 10.4.22+deb11u2 works fine and there are no more builds immediately needed.

On the grafana mysql dashboard (the per host one) we'd also need to change the monitoring response top box to get it to use sum(mysql_exporter_collector_duration_seconds)

Done

Looking at the source for the exporter, the metric we were using went away shortly after 0.11.0 with this PR: https://github.com/prometheus/mysqld_exporter/pull/397

http_request_duration_microseconds used to measure how long it took to respond to a request (from prometheus). It was a Summary, which meant it couldn't be aggregated.

The "new" metric, mysql_exporter_collector_duration_seconds, has been around for a long time. It measures how long each collector module takes to run (e.g. running show global variables is one, running show slave status is another).

sum(mysql_exporter_collector_duration_seconds) should be roughly equivalent to http_request_duration_microseconds for our purposes, from what I can see.
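
If the summed value ever spikes, a per-collector breakdown could help pin down which scrape is slow. A sketch only, with a placeholder Prometheus URL and a guessed instance label:

curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=topk(5, mysql_exporter_collector_duration_seconds{instance="db1125:9104"})'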

Sweet - thanks for the analysis! Let's keep sum(mysql_exporter_collector_duration_seconds) for now then. As of now it is added to the dashboard, so it runs on both bullseye and buster hosts, as stated at T295965#7526111 and T295965#7536603.

Change 743285 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db1125 deleting /srv

https://gerrit.wikimedia.org/r/743285

I am going to reimage db1125 including deleting /srv as we've not tested that with Bullseye yet.

Change 743285 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db1125 deleting /srv

https://gerrit.wikimedia.org/r/743285

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye completed:

  • db1125 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112030602_marostegui_5561_db1125.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

I am going to reimage db1125 including deleting /srv as we've not tested that with Bullseye yet.

This worked fine

Change 743288 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Add testing cluster

https://gerrit.wikimedia.org/r/743288

Change 743288 merged by Marostegui:

[operations/puppet@production] site.pp: Add testing cluster

https://gerrit.wikimedia.org/r/743288