
Test MariaDB 10.4 with Bullseye
Open, MediumPublic

Description

Let's start testing out Bullseye + 10.4.

We are going to go for Bullseye and 10.4 rather than 10.5 or 10.6 in order to minimize the amount of variables that could impact performance.

  • Compile and package 10.4 for Bullseye
  • Reimage db1125 replica (test-cluster host) with Bullseye (tested both: keeping /srv and wiping it entirely)
  • Reimage db1124 master (test-cluster) with Bullseye
  • Reimage db1128 (it was the m5 master, but it is now a spare after T288720) with bullseye and move it to s1 to let it replicate.
  • Reimage pc2014 (pc1, codfw spare) with Bullseye and let it replicate

Event Timeline

Marostegui moved this task from Triage to Ready on the DBA board.

I have finished compiling 10.4.22 on bullseye. Tomorrow I'll try to package it

Packaged 10.4.22 on bullseye - tested very briefly on my local testing environment. I will try to go for the reimage on db1125 next week and install the packages there and see how it goes in our environment.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye executed with errors:

  • db1125 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

The installer is failing to get db1125 installed with bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye executed with errors:

  • db1125 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye executed with errors:

  • db1125 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111221012_marostegui_16710_db1125.out
    • The reimage failed, see the cookbook logs for the details

The installer went fine this time after merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/740531/
There's now a different, puppet-related error. I will double check.

Change 740541 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] packages_wmf.pp: Add bullseye support

https://gerrit.wikimedia.org/r/740541

Change 740541 merged by Marostegui:

[operations/puppet@production] packages_wmf.pp: Add bullseye support

https://gerrit.wikimedia.org/r/740541

The above patch fixed it, puppet ran fine. I am going to issue another full reimage.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye completed:

  • db1125 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111221045_marostegui_21527_db1125.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

The package needs to be recreated due to an issue between buster and bullseye. We cannot have the same package name for both versions in our internal repo, so I have talked to Moritz and the solution is to name the 10.4 build for bullseye wmf-mariadb104_10.4.22+deb11u1_amd64.deb instead.
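
As a quick sanity check on the naming scheme, the +deb11u1 suffix still sorts as newer than the plain buster version as far as dpkg is concerned. A minimal sketch (not part of the actual build or repo tooling):

# dpkg sorts end-of-string before '+', so the bullseye build wins in version comparison
dpkg --compare-versions 10.4.22 lt 10.4.22+deb11u1 && echo "10.4.22+deb11u1 sorts newer"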

Pushed wmf-mariadb104-client_10.4.22+deb11u1_amd64.deb to the repo. Going to double check it has not broken anything before rebuilding the server package.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Pushed wmf-mariadb104_10.4.22+deb11u1_amd64.deb

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye completed:

  • db1125 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111230641_marostegui_20506_db1125.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Marostegui renamed this task from Reimage db1125 (test cluster) with Bullseye to Reimage db1124 and db1125 (test cluster) with Bullseye. Tue, Nov 23, 7:18 AM
Marostegui updated the task description.

I have also added db1124 to be reimaged, so that we can have a master (even if it doesn't carry production traffic) running Bullseye.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1124.eqiad.wmnet with OS bullseye completed:

  • db1124 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111230909_marostegui_11175_db1124.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

db1128 is now free from m5. I am going to do a few more tests on db1124 and db1125, and then I will reimage it with bullseye and place it on s1 to simply let it replicate.

Marostegui renamed this task from Reimage db1124 and db1125 (test cluster) with Bullseye to Test MariaDB 10.4 with Bullseye. Wed, Nov 24, 6:11 AM
Marostegui updated the task description.
Marostegui updated the task description.

Change 740967 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1128: Disable notifications

https://gerrit.wikimedia.org/r/740967

Change 740967 merged by Marostegui:

[operations/puppet@production] db1128: Disable notifications

https://gerrit.wikimedia.org/r/740967

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye completed:

  • db1125 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111240740_marostegui_27279_db1125.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

It looks like prometheus isn't showing the query latency; I am investigating.

So it looks like the http_request_duration_microseconds metric is removed in bullseye, which is the one we use for the query monitoring graph (note: it does not measure how long our queries take, but how long it takes for prometheus to reach them - we know that, but we use it as a proxy, as it also spikes when the host is very loaded).

After investigating with Filippo, we've decided to create a graph that uses sum(mysql_exporter_collector_duration_seconds) to see if it can be a replacement for the other one.
The idea is to leave it there and see if the two behave in a somewhat similar way when hosts have issues. We have no other option anyway (apart from patching prometheus to add the metric back - but that is very painful and has been discarded for now).
This is the example graph: https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=40&orgId=1&from=now-30m&to=now&refresh=1m&var-job=All&var-server=db1125&var-port=9104

It is enabled everywhere, but it will only come up if the latency is higher than 10 microseconds.

We actually believe it might be even better than the current one, but we'll see how the two compare once a given server is under real issues, as we don't have any recent data from servers suffering lots of real issues to compare them.
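
For anyone wanting to eyeball the replacement metric outside Grafana, a query along these lines against the Prometheus HTTP API should do. This is just a sketch - the Prometheus URL and the instance label are placeholders guessed from the dashboard variables:

# placeholder URL; instance label assumed to be host:exporter_port
curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=sum by (instance) (mysql_exporter_collector_duration_seconds{instance="db1125:9104"})'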

Given that db1124 and db1125 seemed OK, I have configured replication on db1124 (test-s4 cluster) with STATEMENT+GTID, replicating just the enwiki.recentchanges table, to see how that goes with some more writes (apart from pt-heartbeat).
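
For the record, the setup is roughly of this shape. This is only a sketch - the master host name is a placeholder, credentials are omitted, and the exact options may differ from what was actually run:

# replication filters are dynamic in MariaDB, but the slave threads must be stopped first
mysql -e "STOP SLAVE;"
mysql -e "SET GLOBAL binlog_format = 'STATEMENT';"
mysql -e "SET GLOBAL replicate_do_table = 'enwiki.recentchanges';"
# MASTER_HOST is illustrative; MASTER_USER/MASTER_PASSWORD left out on purpose
mysql -e "CHANGE MASTER TO MASTER_HOST='test-s4-master.example', MASTER_USE_GTID=slave_pos;"
mysql -e "START SLAVE;"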

Change 741754 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1128: Move it to test-s1

https://gerrit.wikimedia.org/r/741754

Change 741754 merged by Marostegui:

[operations/puppet@production] db1128: Move it to test-s1

https://gerrit.wikimedia.org/r/741754

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1128.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2021-11-25T07:49:43Z] <marostegui> Stop mysql on db1133 to clone db1128 as a test host T295965

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1128.eqiad.wmnet with OS bullseye completed:

  • db1128 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111250727_marostegui_1795_db1128.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

I have disconnected db1124 from s1, as I am getting db1128 ready to fully replicate from s1.

db1128 is now fully replicating from the s1 master (all tables). It is using GTID like any other normal replica.
I am not planning to pool this host any time soon (definitely not before the end-of-year holidays), but I want to do a few tests with it, apart from letting replication flow, to see if there's any performance regression on bullseye (reminder: the mariadb version remains 10.4).
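
Quick checks for a replica in this state would be something like the following (a sketch run locally on the host; output trimmed):

mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Using_Gtid|Seconds_Behind_Master'
mysql -e "SELECT @@GLOBAL.gtid_slave_pos;"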

@MatthewVernon question about prometheus-mysqld-exporter: I think on Buster the exporter starts (or restarts) automatically once mysql is started, right? I don't see that happening on bullseye. Do we need to do something else to get that behaviour on Bullseye as well? It was a great improvement on buster, as we would sometimes forget to start/restart it, so if you could take a look and see if something is needed there, that'd be great.

If you need to test, you can stop/start mariadb on db1124 or db1125 as you wish (they do not replicate from anywhere else nor have valid data)

On a side note, it looks like there's no need to recreate the host in tendril when going from buster to bullseye (essentially because I think the earlier problem was the 10.1 -> 10.4 jump, and in this case we are not switching versions).

The problem is that /lib/systemd/system/mariadb.service lacks the changes from https://gerrit.wikimedia.org/r/c/operations/software/+/715926

so, e.g. on a working system:

mvernon@db1118:~$ grep prometheus /lib/systemd/system/mariadb.service 
# If available, cause prometheus-mysqld-exporter to be started when
Before=prometheus-mysqld-exporter.service
Wants=prometheus-mysqld-exporter.service

whereas on db1124:

mvernon@db1124:~$ grep prometheus /lib/systemd/system/mariadb.service
mvernon@db1124:~$

Thanks @MatthewVernon - it looks like I built the package from the wrong directory, which didn't contain the latest version of the dbtool directory, hence the missing patch.
I am rebuilding the packages from the correct one.

Thanks for troubleshooting!

I have rebuilt the server package with the fix. I will test it tomorrow.

New package installed:

root@db1128:/home/marostegui# dpkg -i wmf-mariadb104_10.4.22+deb11u2_amd64.deb
(Reading database ... 44973 files and directories currently installed.)
Preparing to unpack wmf-mariadb104_10.4.22+deb11u2_amd64.deb ...
Unpacking wmf-mariadb104 (10.4.22+deb11u2) over (10.4.22+deb11u1) ...
Setting up wmf-mariadb104 (10.4.22+deb11u2) ...
<snip>
root@db1128:/home/marostegui# grep prometheus /lib/systemd/system/mariadb.service
# If available, cause prometheus-mysqld-exporter to be started when
Before=prometheus-mysqld-exporter.service
Wants=prometheus-mysqld-exporter.service
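
For completeness, the unit-level wiring can also be confirmed via systemd itself; an optional extra check, just a sketch:

# should print the Wants= and Before= coupling to the exporter
systemctl show -p Wants -p Before mariadb.service | grep -i prometheus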

Change 741997 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] control-mariadb-client-10.4-bullseye: Bump version

https://gerrit.wikimedia.org/r/741997

Change 741997 merged by jenkins-bot:

[operations/software@master] control-mariadb-client-10.4-bullseye: Bump version

https://gerrit.wikimedia.org/r/741997

Upgraded db1128, db1124 and db1125

Also pushed wmf-mariadb104_10.4.22+deb11u2_amd64.deb to the repo

On the grafana mysql dashboard (the per host one) we'd also need to change the monitoring response top box to get it to use sum(mysql_exporter_collector_duration_seconds)

Change 742282 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] pc2014: Disable notifications

https://gerrit.wikimedia.org/r/742282

Change 742282 merged by Marostegui:

[operations/puppet@production] pc2014: Disable notifications

https://gerrit.wikimedia.org/r/742282

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc2014.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc2014.codfw.wmnet with OS bullseye completed:

  • pc2014 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111290539_marostegui_23235_pc2014.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

pc2014 has been reimaged with Bullseye too. It is a pc1 codfw spare, so it will get no reads (even if we switch DCs). I would like to test Bullseye on it, as pc has a very peculiar mysql traffic pattern (a huge amount of REPLACEs), and I want to see if there are regressions on that front.

Change 742588 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] control-mariadb-10.4-bullseye: Control file for 10.4 on Bullseye

https://gerrit.wikimedia.org/r/742588

Change 742588 merged by jenkins-bot:

[operations/software@master] control-mariadb-10.4-bullseye: Control file for 10.4 on Bullseye

https://gerrit.wikimedia.org/r/742588

Pushed this new file after checking that 10.4.22+deb11u2 works fine and there are no more builds immediately needed.

On the grafana mysql dashboard (the per host one) we'd also need to change the monitoring response top box to get it to use sum(mysql_exporter_collector_duration_seconds)

Done

Looking at the source for the exporter, the metric we were using went away shortly after 0.11.0 with this PR: https://github.com/prometheus/mysqld_exporter/pull/397

http_request_duration_microseconds used to measure how long it took to respond to a request (from prometheus). It was a Summary, which meant it couldn't be aggregated.

The "new" metric, mysql_exporter_collector_duration_seconds, has been around for a long time. It measures how long each collector module takes to run (e.g. running show global variables is one, running show slave status is another).

sum(mysql_exporter_collector_duration_seconds) should be roughly equivalent to http_request_duration_microseconds for our purposes, from what I can see.
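
If the summed value ever spikes, a per-collector breakdown could help pin down which scrape is slow. A sketch only, with a placeholder Prometheus URL and a guessed instance label:

curl -sG 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=topk(5, mysql_exporter_collector_duration_seconds{instance="db1125:9104"})'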

Sweet - thanks for the analysis! Let's keep sum(mysql_exporter_collector_duration_seconds) for now then. As of now it is added to the dashboard, so it runs on both bullseye and buster hosts, as stated at T295965#7526111 and T295965#7536603.

Change 743285 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db1125 deleting /srv

https://gerrit.wikimedia.org/r/743285

I am going to reimage db1125 including deleting /srv as we've not tested that with Bullseye yet.

Change 743285 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db1125 deleting /srv

https://gerrit.wikimedia.org/r/743285

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1125.eqiad.wmnet with OS bullseye completed:

  • db1125 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202112030602_marostegui_5561_db1125.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

I am going to reimage db1125 including deleting /srv as we've not tested that with Bullseye yet.

This worked fine

Change 743288 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Add testing cluster

https://gerrit.wikimedia.org/r/743288

Change 743288 merged by Marostegui:

[operations/puppet@production] site.pp: Add testing cluster

https://gerrit.wikimedia.org/r/743288