Compile and package MariaDB 10.6.16 and 10.4.32
Closed, Resolved, Public

Description

New versions released:

MariaDB 10.4.32

MariaDB 10.6.16

10.6:

  • Bullseye (db2122, db1132)
    • Uploaded to repo
  • Bookworm (testing on db1119, db2133, pc2014, pc2012)
    • Uploaded to repo

10.4:

  • Bullseye (testing on db1210, db1126)

Finally, https://jira.mariadb.org/browse/MDEV-32132 (affecting both versions) has been fixed. This bug has potentially bitten us in the past.

Event Timeline

Marostegui moved this task from Triage to In progress on the DBA board.

Compiled 10.6.16 for bookworm, and testing it for now on pc2014.

Mentioned in SAL (#wikimedia-operations) [2023-11-16T12:33:06Z] <marostegui> Install Test MariaDB 10.6.16 (Bookworm) on pc2014 T351283

Change 974978 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] control-mariadb-10.6-bookworm: Bump version

https://gerrit.wikimedia.org/r/974978

Change 974978 merged by jenkins-bot:

[operations/software@master] control-mariadb-10.6-bookworm: Bump version

https://gerrit.wikimedia.org/r/974978

Also installed 10.6.16 on bookworm on db2133

Installed 10.6.16 (Bookworm) on db1119

10.4.32 for bullseye is compiled. Given it is Friday I am not going to install this on a production host, and rather I will do some initial testing on my local env.

Mentioned in SAL (#wikimedia-operations) [2023-11-20T06:47:34Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1210 T351283', diff saved to https://phabricator.wikimedia.org/P53591 and previous config saved to /var/cache/conftool/dbconfig/20231120-064733-root.json

Change 975453 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1210: Disable notifications

https://gerrit.wikimedia.org/r/975453

Change 975453 merged by Marostegui:

[operations/puppet@production] db1210: Disable notifications

https://gerrit.wikimedia.org/r/975453

Testing on db1210 (s5)

db1210 being repooled with 10.4.32

Compiled 10.6.16 for Bullseye, testing locally first.

Change 975855 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1210: Disable notifications

https://gerrit.wikimedia.org/r/975855

Something is wrong with 10.4.32 on db1210: first it was lagging behind, and now it has crashed.
I am investigating and will probably report back to MariaDB.

Change 975855 merged by Marostegui:

[operations/puppet@production] db1210: Disable notifications

https://gerrit.wikimedia.org/r/975855

Nov 20 13:59:36 db1210 mysqld[2791013]: 2023-11-20 13:59:36 0x7eb76e05d700  InnoDB: Assertion failure in file /root/mariadb-10.4.32/storage/innobase/row/row0ins.cc line 219
Nov 20 13:59:36 db1210 mysqld[2791013]: InnoDB: Failing assertion: !cursor->index->is_committed()
Nov 20 13:59:36 db1210 mysqld[2791013]: InnoDB: We intentionally generate a memory trap.
Nov 20 13:59:36 db1210 mysqld[2791013]: InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
Nov 20 13:59:36 db1210 mysqld[2791013]: InnoDB: If you get repeated assertion failures or crashes, even
Nov 20 13:59:36 db1210 mysqld[2791013]: InnoDB: immediately after the mysqld startup, there may be
Nov 20 13:59:36 db1210 mysqld[2791013]: InnoDB: corruption in the InnoDB tablespace. Please refer to
Nov 20 13:59:36 db1210 mysqld[2791013]: InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
Nov 20 13:59:36 db1210 mysqld[2791013]: InnoDB: about forcing recovery.
Nov 20 13:59:36 db1210 mysqld[2791013]: 231120 13:59:36 [ERROR] mysqld got signal 6 ;
Nov 20 13:59:36 db1210 mysqld[2791013]: Sorry, we probably made a mistake, and this is a bug.
Nov 20 13:59:36 db1210 mysqld[2791013]: Your assistance in bug reporting will enable us to fix this for the next release.
Nov 20 13:59:36 db1210 mysqld[2791013]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Nov 20 13:59:36 db1210 mysqld[2791013]: We will try our best to scrape up some info that will hopefully help
Nov 20 13:59:36 db1210 mysqld[2791013]: diagnose the problem, but since we have already crashed,
Nov 20 13:59:36 db1210 mysqld[2791013]: something is definitely wrong and this may fail.
Nov 20 13:59:36 db1210 mysqld[2791013]: Server version: 10.4.32-MariaDB-log source revision: c4143f909528e3fab0677a28631d10389354c491
Nov 20 13:59:36 db1210 mysqld[2791013]: key_buffer_size=134217728
Nov 20 13:59:36 db1210 mysqld[2791013]: read_buffer_size=131072
Nov 20 13:59:36 db1210 mysqld[2791013]: max_used_connections=38
Nov 20 13:59:36 db1210 mysqld[2791013]: max_threads=2010
Nov 20 13:59:36 db1210 mysqld[2791013]: thread_count=25
Nov 20 13:59:36 db1210 mysqld[2791013]: It is possible that mysqld could use up to
Nov 20 13:59:36 db1210 mysqld[2791013]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4753987 K  bytes of memory
Nov 20 13:59:36 db1210 mysqld[2791013]: Hope that's ok; if not, decrease some variables in the equation.
Nov 20 13:59:36 db1210 mysqld[2791013]: Thread pointer: 0x7eb388001538
Nov 20 13:59:36 db1210 mysqld[2791013]: Attempting backtrace. You can use the following information to find out
Nov 20 13:59:36 db1210 mysqld[2791013]: where mysqld died. If you see no messages after this, something went
Nov 20 13:59:36 db1210 mysqld[2791013]: terribly wrong...
Nov 20 13:59:36 db1210 mysqld[2791013]: stack_bottom = 0x7eb76e05c760 thread_stack 0x30000
Nov 20 13:59:37 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(my_print_stacktrace+0x2e)[0x55d14b2599fe]
Nov 20 13:59:37 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(handle_fatal_signal+0x53d)[0x55d14acdad8d]
Nov 20 13:59:37 db1210 mysqld[2791013]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140)[0x7f168ebaa140]
Nov 20 13:59:38 db1210 mysqld[2791013]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x141)[0x7f168e6e2ce1]
Nov 20 13:59:38 db1210 mysqld[2791013]: /lib/x86_64-linux-gnu/libc.so.6(abort+0x123)[0x7f168e6cc537]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(+0x5c9a7a)[0x55d14a9b7a7a]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(+0x5b60c5)[0x55d14a9a40c5]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(+0xc89637)[0x55d14b077637]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(+0xc89e85)[0x55d14b077e85]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(+0xc9b105)[0x55d14b089105]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(+0xbde14f)[0x55d14afcc14f]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(_ZN7handler12ha_write_rowEPKh+0x326)[0x55d14ace9506]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(_Z12write_recordP3THDP5TABLEP12st_copy_info+0x19d)[0x55d14aa95dad]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(_Z12mysql_insertP3THDP10TABLE_LISTR4ListI4ItemERS3_IS5_ES6_S6_15enum_duplicatesb+0xb26)[0x55d14aaa0776]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(_Z21mysql_execute_commandP3THD+0x1aba)[0x55d14aacd1da]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x223)[0x55d14aad3e73]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(_ZN15Query_log_event14do_apply_eventEP14rpl_group_infoPKcj+0x6ad)[0x55d14adf296d]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(+0x62ba14)[0x55d14aa19a14]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(handle_slave_sql+0x1672)[0x55d14aa23992]
Nov 20 13:59:38 db1210 mysqld[2791013]: /opt/wmf-mariadb104/bin/mysqld(+0xb7c7d2)[0x55d14af6a7d2]
Nov 20 13:59:39 db1210 mysqld[2791013]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7)[0x7f168eb9eea7]
Nov 20 13:59:39 db1210 mysqld[2791013]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f168e7a5a2f]
Nov 20 13:59:39 db1210 mysqld[2791013]: Trying to get some variables.
Nov 20 13:59:39 db1210 mysqld[2791013]: Some pointers may be invalid and cause the dump to abort.
Nov 20 13:59:39 db1210 mysqld[2791013]: Query (0x7eb38b306855): INSERT /* FlaggableWikiPage::updatePendingList  */ INTO `flaggedpage_pending` (fpp_page_id,fpp_quality,fpp_rev_id,fpp_pending_since) VALUES (519821,0,237617526,'20231118074553')
Nov 20 13:59:39 db1210 mysqld[2791013]: Connection ID (thread ID): 15630
Nov 20 13:59:39 db1210 mysqld[2791013]: Status: NOT_KILLED
Nov 20 13:59:39 db1210 mysqld[2791013]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=on,mrr_cost_based=on,mrr_sort_keys=on,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=off,condition_pushdown_from_having=on
Nov 20 13:59:39 db1210 mysqld[2791013]: The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains
Nov 20 13:59:39 db1210 mysqld[2791013]: information that should help you find out what is causing the crash.
Nov 20 13:59:39 db1210 mysqld[2791013]: Writing a core file...
Nov 20 13:59:39 db1210 mysqld[2791013]: Working directory at /srv/sqldata
Nov 20 13:59:39 db1210 mysqld[2791013]: Resource Limits:
Nov 20 13:59:39 db1210 mysqld[2791013]: Limit                     Soft Limit           Hard Limit           Units
Nov 20 13:59:39 db1210 mysqld[2791013]: Max cpu time              unlimited            unlimited            seconds
Nov 20 13:59:39 db1210 mysqld[2791013]: Max file size             unlimited            unlimited            bytes
Nov 20 13:59:39 db1210 mysqld[2791013]: Max data size             unlimited            unlimited            bytes
Nov 20 13:59:39 db1210 mysqld[2791013]: Max stack size            8388608              unlimited            bytes
Nov 20 13:59:39 db1210 mysqld[2791013]: Max core file size        0                    0                    bytes
Nov 20 13:59:39 db1210 mysqld[2791013]: Max resident set          unlimited            unlimited            bytes
Nov 20 13:59:39 db1210 mysqld[2791013]: Max processes             2061612              2061612              processes
Nov 20 13:59:39 db1210 mysqld[2791013]: Max open files            200001               200001               files
Nov 20 13:59:39 db1210 mysqld[2791013]: Max locked memory         65536                65536                bytes
Nov 20 13:59:39 db1210 mysqld[2791013]: Max address space         unlimited            unlimited            bytes
Nov 20 13:59:39 db1210 mysqld[2791013]: Max file locks            unlimited            unlimited            locks
Nov 20 13:59:39 db1210 mysqld[2791013]: Max pending signals       2061612              2061612              signals
Nov 20 13:59:39 db1210 mysqld[2791013]: Max msgqueue size         819200               819200               bytes
Nov 20 13:59:39 db1210 mysqld[2791013]: Max nice priority         0                    0
Nov 20 13:59:39 db1210 mysqld[2791013]: Max realtime priority     0                    0
Nov 20 13:59:39 db1210 mysqld[2791013]: Max realtime timeout      unlimited            unlimited            us
Nov 20 13:59:39 db1210 mysqld[2791013]: Core pattern: /var/tmp/core/core.%h.%e.%p.%t
Nov 20 13:59:39 db1210 mysqld[2791013]: Kernel version: Linux version 5.10.0-21-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.162-1 (2023-01-21)

This definitely seems related to https://jira.mariadb.org/browse/MDEV-32132
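When comparing a crash like the one above against an upstream report, the useful fields are the assertion location, the failing expression, the signal, and the server version. The following is a minimal Python sketch for pulling those out of journal lines. The function name `summarize_crash` and the dictionary keys are illustrative assumptions, not part of any existing tool.

```python
import re

# Patterns for fields printed by the crash handler in the log above.
PATTERNS = {
    "assertion_location": re.compile(r"Assertion failure in file (\S+) line (\d+)"),
    "failing_assertion": re.compile(r"Failing assertion: (.+)$"),
    "signal": re.compile(r"mysqld got signal (\d+)"),
    "server_version": re.compile(r"Server version: (\S+)"),
}


def summarize_crash(lines):
    """Return the first match of each pattern found in the given log lines."""
    summary = {}
    for line in lines:
        for key, pattern in PATTERNS.items():
            if key in summary:
                continue  # keep only the first occurrence of each field
            match = pattern.search(line)
            if match:
                groups = match.groups()
                summary[key] = groups[0] if len(groups) == 1 else groups
    return summary


if __name__ == "__main__":
    # Lines copied from the crash log above.
    sample = [
        "Nov 20 13:59:36 db1210 mysqld[2791013]: 2023-11-20 13:59:36 0x7eb76e05d700  "
        "InnoDB: Assertion failure in file "
        "/root/mariadb-10.4.32/storage/innobase/row/row0ins.cc line 219",
        "Nov 20 13:59:36 db1210 mysqld[2791013]: InnoDB: Failing assertion: "
        "!cursor->index->is_committed()",
        "Nov 20 13:59:36 db1210 mysqld[2791013]: 231120 13:59:36 [ERROR] mysqld got signal 6 ;",
        "Nov 20 13:59:36 db1210 mysqld[2791013]: Server version: 10.4.32-MariaDB-log "
        "source revision: c4143f909528e3fab0677a28631d10389354c491",
    ]
    for key, value in summarize_crash(sample).items():
        print(f"{key}: {value}")
```

Keeping only the first match per field is a deliberate simplification: it assumes a single crash per input, so multi-crash journals would need to be split beforehand.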

Running ANALYZE TABLE on all the tables has made this host crash too. I am going to upgrade it to 10.6.16 and see if this is just a 10.4.x issue.
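Forcing an analyze across every table can be scripted by generating one `ANALYZE TABLE` statement per table. The Python sketch below only builds the statements as strings. The helper names, the list of skipped schemas, the `examplewiki` schema name, and the way the table list is obtained are all assumptions for illustration, not the procedure actually used on db1210.

```python
# Schemas skipped when touching "all the tables" (an assumed list).
SYSTEM_SCHEMAS = {"information_schema", "performance_schema", "mysql", "sys"}

# Query that would list candidate tables. Running it needs a client
# connection, which this sketch deliberately leaves out.
LIST_TABLES_SQL = (
    "SELECT table_schema, table_name FROM information_schema.tables "
    "WHERE table_type = 'BASE TABLE'"
)


def quote_identifier(name):
    """Quote an identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def analyze_statements(tables):
    """Build ANALYZE TABLE statements for (schema, table) pairs."""
    statements = []
    for schema, table in tables:
        if schema.lower() in SYSTEM_SCHEMAS:
            continue  # leave system schemas alone
        statements.append(
            f"ANALYZE TABLE {quote_identifier(schema)}.{quote_identifier(table)};"
        )
    return statements


if __name__ == "__main__":
    # "examplewiki" is a hypothetical schema; the table name comes from the crash log.
    example = [("examplewiki", "flaggedpage_pending"), ("mysql", "user")]
    for statement in analyze_statements(example):
        print(statement)
```

Generating the statements separately from executing them makes it possible to review the list, or feed it to a client one statement at a time, before touching a host that has already crashed once.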

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1210.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1210.eqiad.wmnet with OS bookworm completed:

  • db1210 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311210727_marostegui_578298_db1210.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

db1210 migrated to Bookworm (and 10.6) so I am going to give it 24h depooled and just replicating to see if it crashes again. If not, I will start repooling and see if the load makes it crash.
I am going to force an analyze on all the tables and see if that makes it crash (it did with 10.4.32)

This finished with no issues, so I am going to start repooling and see if it crashes that way.

Mentioned in SAL (#wikimedia-operations) [2023-11-22T07:19:12Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1126 to test 10.4.32 T351283', diff saved to https://phabricator.wikimedia.org/P53692 and previous config saved to /var/cache/conftool/dbconfig/20231122-071911-root.json

Marostegui added a subscriber: ABran-WMF.

Testing 10.4.32 on db1126. cc @ABran-WMF: please don't decommission/depool this host without coordinating with me first. Thanks :)

It seems that 10.4.32 has been perfectly stable on db1126, so I am wondering if it was just a one-time thing on db1210. I will get two more hosts to test it.

10.6.16 for bookworm pushed to the repo

Change 977120 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/software@master] control-mariadb-10.6-bullseye: Update version

https://gerrit.wikimedia.org/r/977120

Change 977120 merged by jenkins-bot:

[operations/software@master] control-mariadb-10.6-bullseye: Update version

https://gerrit.wikimedia.org/r/977120

Pushed 10.6.16 for bullseye to the repo

I am still unsure what to do with 10.4.32. We are just 6 months away from EOL for 10.4 and we should aim for 10.6 anyway, so maybe it is wise NOT to upload this version to the repo, as it made one host crash. The other host is fine, though, so it could have been just a one-time thing.

My very unscientific preference is to aim at 10.6 so we can focus on one thing only (unless there are security fixes in 10.4.32 or higher). But feel free to discard my preference :D

That's a very good approach, thanks for the input :)

I have been talking to @MoritzMuehlenhoff about the possible CVE fix that is included in 10.4.32 and whether it is worth continuing to test the stability of 10.4.32. So far there is very little information about it, but it is unlikely that it can affect our environment.
I am going to reimage the testing hosts to Bookworm and 10.6, and then I will close this task, NOT pushing 10.4.32 to the repo.

Mentioned in SAL (#wikimedia-operations) [2023-11-30T06:32:59Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1126 T351283', diff saved to https://phabricator.wikimedia.org/P53951 and previous config saved to /var/cache/conftool/dbconfig/20231130-063258-root.json

Mentioned in SAL (#wikimedia-operations) [2023-11-30T06:33:17Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1210 T351283', diff saved to https://phabricator.wikimedia.org/P53952 and previous config saved to /var/cache/conftool/dbconfig/20231130-063317-root.json

Change 978726 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1210,db1126: Disable notifications

https://gerrit.wikimedia.org/r/978726

Change 978726 merged by Marostegui:

[operations/puppet@production] db1210,db1126: Disable notifications

https://gerrit.wikimedia.org/r/978726

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1210.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1126.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1126.eqiad.wmnet with OS bookworm completed:

  • db1126 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311300649_marostegui_2091702_db1126.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1210.eqiad.wmnet with OS bookworm completed:

  • db1210 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311300653_marostegui_2091642_db1210.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

This is done and the hosts are being repooled.