Test MariaDB 10.4 with Bullseye
Closed, ResolvedPublic

Description

Let's start testing out Bullseye + MariaDB 10.4.

We are going to go for Bullseye and MariaDB 10.4 rather than 10.5 or 10.6 in order to minimize the number of variables that could impact performance.

  • Compile and package 10.4 for Bullseye
  • Reimage db1125 replica (test-cluster host) with Bullseye (tested both: keeping /srv and wiping it entirely)
  • Reimage db1124 master (test-cluster) with Bullseye
  • Reimage db1128 (formerly the m5 master, now a spare after T288720) with Bullseye and move it to s1 to let it replicate. Update from 12 Dec 2021: T295965#7580466
  • Reimage pc2014 (pc1, codfw spare) with Bullseye and let it replicate
  • Reimage one dbproxy (in codfw) with Bullseye (dbproxy2004 - m5) T295965#7580099
  • Reimage one codfw sanitarium T295965#7595058
  • Reimage one codfw multi-instance (db2087 s6 and s7) T295965#7597921
  • Reimage db2078 (misc multi-instance)
  • Reimage an external store codfw host (es2032, es2022)
  • Reimage s1 replica and let it serve traffic. (db1128 - originally planned for db1169 but we ran into issues: T299025)
  • Reimage pc1 master and let it serve traffic (pc1011)
  • Reimage an external store eqiad host (es1022)
  • Possibly, move db1128 to a different section or use it for some other testing (move it to an mX section to make it a master at some point)
    • currently serving on s1, but to be moved to m1 next week

Event Timeline

There are a very large number of changes, so older changes are hidden.

db2078 (misc host in codfw used for backups) has finished running the dumps, so I am going to go ahead and reimage it to Bullseye.

Change 752993 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2078: Disable notifications

https://gerrit.wikimedia.org/r/752993

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db2078.codfw.wmnet with OS bullseye

Change 752993 merged by Marostegui:

[operations/puppet@production] db2078: Disable notifications

https://gerrit.wikimedia.org/r/752993

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db2078.codfw.wmnet with OS bullseye completed:

  • db2078 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201110848_marostegui_15620_db2078.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

I am going to include the reimage of an eqiad parsercache host in this testing. Given that parsercache has a very peculiar workload (lots of REPLACEs and DELETEs), it would be good to see whether there is any regression on the OS side when dealing with writes.
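
For context, the parsercache write pattern looks schematically like the following. This is only an illustration: the pcNNN table name and the keyname/value/exptime columns follow the usual objectcache-style layout, and none of it is a literal capture of production queries.

# Illustrative sketch of a parsercache-style write/expiry cycle:
mysql -h pc1011.eqiad.wmnet parsercache -e "
  REPLACE INTO pc001 (keyname, value, exptime)
    VALUES ('enwiki:pcache:idhash:12345-0', '<serialized blob>', NOW() + INTERVAL 21 DAY);
  DELETE FROM pc001 WHERE exptime < NOW() LIMIT 500;"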

In preparation for the pc1 failover needed to reimage pc1011, I have upgraded pc1014's MariaDB to 10.4.22; pc1014 will most likely become the pc1 master tomorrow.

I just saw that db2078 has a failed service: prometheus-mysqld-exporter.service. I haven't researched further, so I don't know whether it fails because of the new package, is a one-time failure caused by the reimage, or is a WIP/known issue, but I am noting it here since you asked me to bring up anything weird I saw. This doesn't affect backups or my work in any way.

Ah, it is probably because the main (non-instance) service isn't disabled, as this is a multi-instance host. I will do that tomorrow morning. Thanks for the heads-up!
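
For reference, a minimal sketch of what that amounts to on a multi-instance host (the exact unit names, and whether stop/disable or mask is used, are assumptions here):

# Stop and disable the default single-instance exporter; the per-instance
# templated units (e.g. prometheus-mysqld-exporter@s6) keep running.
sudo systemctl stop prometheus-mysqld-exporter.service
sudo systemctl disable prometheus-mysqld-exporter.service
systemctl list-units 'prometheus-mysqld-exporter*'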

Change 753341 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1169: Disable notifications

https://gerrit.wikimedia.org/r/753341

Mentioned in SAL (#wikimedia-operations) [2022-01-12T06:08:03Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1169 for Bullseye reimage T295965', diff saved to https://phabricator.wikimedia.org/P18617 and previous config saved to /var/cache/conftool/dbconfig/20220112-060803-marostegui.json

Change 753341 merged by Marostegui:

[operations/puppet@production] db1169: Disable notifications

https://gerrit.wikimedia.org/r/753341

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1169.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1169.eqiad.wmnet with OS bullseye executed with errors:

  • db1169 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1169.eqiad.wmnet with OS bullseye

I am facing issues with the reboot/pxe of db1169 - investigating

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1169.eqiad.wmnet with OS bullseye executed with errors:

  • db1169 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Created T299025, as I am unable to do anything: the iDRAC doesn't seem to be working on db1169. The host gets stuck somewhere during the reboot; it never reaches the Debian installer (or a normal boot), and ping isn't available either.

I think I am going to convert db1128 into an s1 replica (it is already replicating there) to cover for db1169, so we can test live traffic there.
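
A quick way to confirm it is indeed replicating s1 before pointing traffic at it (standard MariaDB commands, nothing WMF-specific):

# Check replication health on db1128.
mysql -h db1128.eqiad.wmnet -e "SHOW SLAVE STATUS\G" \
  | grep -E 'Master_Host|Slave_(IO|SQL)_Running|Seconds_Behind_Master'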

Change 753343 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Move db1128 to s1.

https://gerrit.wikimedia.org/r/753343

Change 753343 merged by Marostegui:

[operations/puppet@production] mariadb: Move db1128 to s1.

https://gerrit.wikimedia.org/r/753343

Doing a data check on db1128 before letting it serve traffic; a sketch of the kind of check is shown after the list. I am checking these tables:

  • user
  • recentchanges
  • watchlist
  • logging
  • actor
  • slots
  • revision
  • archive
  • page
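
A sketch of the kind of per-table consistency check meant above. In practice this is done with dedicated comparison tooling, so the CHECKSUM-based loop below is only a schematic stand-in, and the trusted-replica hostname is a placeholder:

# Schematic only: compare per-table checksums on db1128 against a known-good s1 replica.
for table in user recentchanges watchlist logging actor slots revision archive page; do
  for host in db1128.eqiad.wmnet TRUSTED-S1-REPLICA.eqiad.wmnet; do
    printf '%s %s: ' "$host" "$table"
    mysql -h "$host" enwiki -N -e "CHECKSUM TABLE $table"
  done
done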

Change 753420 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/mediawiki-config@master] ProductionServices.php: Replace pc1011 with pc1014

https://gerrit.wikimedia.org/r/753420

Replication position for pc1014:

root@pc1014.eqiad.wmnet[(none)]> show master status\G
*************************** 1. row ***************************
            File: pc1014-bin.050274
        Position: 406581448
    Binlog_Do_DB:
Binlog_Ignore_DB:
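
For reference, this is the coordinate the other pc1 hosts would be repointed to during the switchover; schematically (standard MariaDB syntax, normally run by the switchover tooling rather than by hand):

# Repoint a pc1 replica to pc1014 at the binlog position recorded above.
mysql -e "STOP SLAVE;
  CHANGE MASTER TO
    MASTER_HOST='pc1014.eqiad.wmnet',
    MASTER_LOG_FILE='pc1014-bin.050274',
    MASTER_LOG_POS=406581448;
  START SLAVE;"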

Change 753420 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices.php: Replace pc1011 with pc1014

https://gerrit.wikimedia.org/r/753420

Change 753422 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote pc1014 to pc1 master

https://gerrit.wikimedia.org/r/753422

Change 753422 merged by Marostegui:

[operations/puppet@production] mariadb: Promote pc1014 to pc1 master

https://gerrit.wikimedia.org/r/753422

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host pc1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host pc1011.eqiad.wmnet with OS bullseye completed:

  • pc1011 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201120912_marostegui_29890_pc1011.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

pc1011 is now serving pc1 as a master and it is running Bullseye, so far so good. Let's see how the performance is and if there's any regression.

The data check on db1128 (the tables listed above) all came back clean, so I am going to start pooling it with very low weight.

Change 753430 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1128: Enable notifications

https://gerrit.wikimedia.org/r/753430

Mentioned in SAL (#wikimedia-operations) [2022-01-12T10:29:39Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1128 in s1 with minimal weight T295965', diff saved to https://phabricator.wikimedia.org/P18631 and previous config saved to /var/cache/conftool/dbconfig/20220112-102938-marostegui.json

Change 753430 merged by Marostegui:

[operations/puppet@production] db1128: Enable notifications

https://gerrit.wikimedia.org/r/753430

Mentioned in SAL (#wikimedia-operations) [2022-01-12T10:36:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1128 in s1 with minimal weight T295965', diff saved to https://phabricator.wikimedia.org/P18633 and previous config saved to /var/cache/conftool/dbconfig/20220112-103619-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-01-12T10:56:51Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18637 and previous config saved to /var/cache/conftool/dbconfig/20220112-105650-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-01-12T11:31:19Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18640 and previous config saved to /var/cache/conftool/dbconfig/20220112-113119-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-01-12T11:52:59Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18644 and previous config saved to /var/cache/conftool/dbconfig/20220112-115259-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-01-12T12:09:32Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18646 and previous config saved to /var/cache/conftool/dbconfig/20220112-120931-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-01-12T12:27:42Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18648 and previous config saved to /var/cache/conftool/dbconfig/20220112-122742-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-01-12T13:00:50Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18654 and previous config saved to /var/cache/conftool/dbconfig/20220112-130050-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-01-12T13:58:59Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give more traffic to db1128 in s1 T295965', diff saved to https://phabricator.wikimedia.org/P18661 and previous config saved to /var/cache/conftool/dbconfig/20220112-135858-marostegui.json

db1128 is on Bullseye and is now serving s1 with normal weight. If you notice anything strange, depool it with: dbctl instance db1128 depool ; dbctl config commit -m "Depooling db1128"
Let's wait until next week to make sure it is serving fine before giving the green light for Bullseye!

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1169.eqiad.wmnet with OS bullseye

db1169 is back up after Chris fixed the HW issue on-site. I have reimaged it to Bullseye and tomorrow I will start pooling it (after running a schema change that has been deployed on s1).

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1169.eqiad.wmnet with OS bullseye completed:

  • db1169 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201121700_marostegui_16909_db1169.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

db1169 is now fully pooled back in s1.

We currently have two Bullseye hosts serving MW traffic in s1: db1128 and db1169.

Change 753681 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] es1022: Disable notifications

https://gerrit.wikimedia.org/r/753681

Mentioned in SAL (#wikimedia-operations) [2022-01-13T08:59:07Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es1022, give weight to es1021 T295965 ', diff saved to https://phabricator.wikimedia.org/P18718 and previous config saved to /var/cache/conftool/dbconfig/20220113-085906-marostegui.json

Change 753681 merged by Marostegui:

[operations/puppet@production] es1022: Disable notifications

https://gerrit.wikimedia.org/r/753681

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye executed with errors:

  • es1022 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye

I am trying to reinstall es1022; it does PXE boot but does not get into the Debian installer, so I am investigating.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye executed with errors:

  • es1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye executed with errors:

  • es1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye executed with errors:

  • es1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye

I was able to get into the Debian installer by forcing PXE from the iDRAC manually. I am not sure whether it is not trying to PXE on the right interface or something else (the screen goes blank after PXE gets selected, then it times out and boots from disk).
Using F12 to make sure it PXE boots seems to have worked. I will create a task for DC Ops about this.
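
What forcing PXE amounts to, schematically; the actual workaround here was picking the boot device from the console with F12, and the ipmitool equivalent is shown only for context:

# Force PXE for the next boot only, then power-cycle the host via the management interface.
ipmitool -I lanplus -H es1022.mgmt.eqiad.wmnet -U root -E chassis bootdev pxe
ipmitool -I lanplus -H es1022.mgmt.eqiad.wmnet -U root -E chassis power cycle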

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host es1022.eqiad.wmnet with OS bullseye executed with errors:

  • es1022 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201130949_marostegui_26834_es1022.out
    • The reimage failed, see the cookbook logs for the details

The reimage itself was actually fine; what failed was changing the BIOS parameters back, and I have forced that manually.
This host probably needs a good firmware and BIOS upgrade to start with, because the reimage failed with:

Running IPMI command: ipmitool -I lanplus -H es1022.mgmt.eqiad.wmnet -U root -E chassis bootparam get 5
Exception raised while executing cookbook sre.hosts.reimage:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 487, in run
    self.ipmi.check_bootparams()
  File "/usr/lib/python3/dist-packages/spicerack/ipmi.py", line 125, in check_bootparams
    raise IpmiCheckError(f"Expected BIOS boot params in {IPMI_SAFE_BOOT_PARAMS} got: {param}")
spicerack.ipmi.IpmiCheckError: Expected BIOS boot params in ('0000000000', '8000020000') got: 0000020000

It worked manually; I will create the task for DC Ops.
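
For reference, the check and the manual fix were along these lines. The bootparam call is copied from the log above; the bootdev invocation is only a sketch of how the forced-PXE flag is normally cleared, not necessarily the exact command used here:

# Same check the cookbook makes: read BIOS boot parameter 5.
ipmitool -I lanplus -H es1022.mgmt.eqiad.wmnet -U root -E chassis bootparam get 5
# Sketch of the manual fix: persistently set the boot device back to disk.
ipmitool -I lanplus -H es1022.mgmt.eqiad.wmnet -U root -E chassis bootdev disk options=persistent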

Change 753698 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] es2022: Disable notifications

https://gerrit.wikimedia.org/r/753698

Change 753698 merged by Marostegui:

[operations/puppet@production] es2022: Disable notifications

https://gerrit.wikimedia.org/r/753698

I am trying a reimage on es2022 to see whether this was a one-time thing or something affecting all the new es hosts.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host es2022.codfw.wmnet with OS bullseye

es2022 worked fine, so the issue could be restricted to es1022 alone or to the es10XX hosts in general. We'll see...

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host es2022.codfw.wmnet with OS bullseye completed:

  • es2022 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202201131042_marostegui_2024_es2022.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

All tests were fine and no regressions were found. We can proceed and start migrating hosts to Bullseye.