
[toolsdb] crash recovery can fail because of insufficient innodb_log_file_size
Closed, Resolved | Public | BUG REPORT

Description

Yesterday (2025-11-11), tools-db-4, the active ToolsDB primary host at the time, crashed suddenly with:

Nov 11 13:28:06 tools-db-4 mysqld[3528040]: 2025-11-11 13:28:06 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: 251111 13:28:06 [ERROR] /opt/wmf-mariadb106/bin/mysqld got signal 6 ;
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: Sorry, we probably made a mistake, and this is a bug.
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: Your assistance in bug reporting will enable us to fix this for the next release.
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs about how to report
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: a bug on https://jira.mariadb.org/.
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: Please include the information from the server start above, to the end of the
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: information below.
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: Server version: 10.6.22-MariaDB-log source revision: 19644f6821d59ecca0f9b1f44fadb3b887061965
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: The information page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: contains instructions to obtain a better version of the backtrace below.
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: Following these instructions will help MariaDB developers provide a fix quicker.
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: Attempting backtrace. Include this in the bug report.
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: (note: Retrieving this information may fail)
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: Thread pointer: 0x0
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: 2025-11-11 13:28:06 0 [Note] /opt/wmf-mariadb106/bin/mysqld (initiated by: unknown): Normal shutdown
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: 2025-11-11 13:28:06 0 [Note] Event Scheduler: Killing the scheduler thread, thread id 2
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: 2025-11-11 13:28:06 0 [Note] Event Scheduler: Waiting for the scheduler thread to reply
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: 2025-11-11 13:28:06 0 [Note] Event Scheduler: Stopped
Nov 11 13:28:06 tools-db-4 mysqld[3528040]: stack_bottom = 0x0 thread_stack 0x30000
Nov 11 13:28:07 tools-db-4 mysqld[3528040]: Printing to addr2line failed
Nov 11 13:28:07 tools-db-4 mysqld[3528040]: /opt/wmf-mariadb106/bin/mysqld(my_print_stacktrace+0x2e)[0x56099c91fcbe]
Nov 11 13:28:07 tools-db-4 mysqld[3528040]: /opt/wmf-mariadb106/bin/mysqld(handle_fatal_signal+0x229)[0x56099c3bb0f9]
Nov 11 13:28:08 tools-db-4 mysqld[3528040]: /lib/x86_64-linux-gnu/libc.so.6(+0x3c050)[0x7f73b0c5a050]
Nov 11 13:28:08 tools-db-4 mysqld[3528040]: /lib/x86_64-linux-gnu/libc.so.6(+0x8aeec)[0x7f73b0ca8eec]
Nov 11 13:28:08 tools-db-4 mysqld[3528040]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x12)[0x7f73b0c59fb2]
Nov 11 13:28:08 tools-db-4 mysqld[3528040]: /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f73b0c44472]
Nov 11 13:28:09 tools-db-4 mysqld[3528040]: /opt/wmf-mariadb106/bin/mysqld(+0x694dab)[0x56099c025dab]
Nov 11 13:28:09 tools-db-4 mysqld[3528040]: /opt/wmf-mariadb106/bin/mysqld(+0x68cfbf)[0x56099c01dfbf]
Nov 11 13:28:09 tools-db-4 mysqld[3528040]: /opt/wmf-mariadb106/bin/mysqld(_ZN5tpool19thread_pool_generic13timer_generic7executeEPv+0x38)[0x56099c8b6478]
Nov 11 13:28:09 tools-db-4 mysqld[3528040]: /opt/wmf-mariadb106/bin/mysqld(_ZN5tpool4task7executeEv+0x2f)[0x56099c8b6d3f]
Nov 11 13:28:09 tools-db-4 mysqld[3528040]: /opt/wmf-mariadb106/bin/mysqld(_ZN5tpool19thread_pool_generic11worker_mainEPNS_11worker_dataE+0x4f)[0x56099c8b577f]
Nov 11 13:28:10 tools-db-4 mysqld[3528040]: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44a3)[0x7f73b0ed44a3]
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: /lib/x86_64-linux-gnu/libc.so.6(+0x891f5)[0x7f73b0ca71f5]
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: /lib/x86_64-linux-gnu/libc.so.6(+0x1098dc)[0x7f73b0d278dc]
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Writing a core file...
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Working directory at /srv/labsdb/data
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Resource Limits (excludes unlimited resources):
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Limit                     Soft Limit           Hard Limit           Units
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Max stack size            8388608              unlimited            bytes
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Max core file size        0                    0                    bytes
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Max processes             257164               257164               processes
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Max open files            200001               200001               files
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Max locked memory         8388608              8388608              bytes
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Max pending signals       257164               257164               signals
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Max msgqueue size         819200               819200               bytes
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Max nice priority         0                    0
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Max realtime priority     0                    0
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Core pattern: core
Nov 11 13:28:11 tools-db-4 mysqld[3528040]: Kernel version: Linux version 6.1.0-31-cloud-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.128-1 (2025-02-07)
Nov 11 13:28:12 tools-db-4 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT
Nov 11 13:28:12 tools-db-4 systemd[1]: mariadb.service: Failed with result 'signal'.
Nov 11 13:28:12 tools-db-4 systemd[1]: mariadb.service: Consumed 3w 23h 16min 6.223s CPU time.

Systemd immediately restarted it, but MariaDB could not recover from the crash:

Nov 11 13:28:17 tools-db-4 systemd[1]: mariadb.service: Scheduled restart job, restart counter is at 1.
Nov 11 13:28:17 tools-db-4 systemd[1]: Stopped mariadb.service - mariadb database server.
Nov 11 13:28:17 tools-db-4 systemd[1]: mariadb.service: Consumed 3w 23h 16min 6.223s CPU time.
Nov 11 13:28:17 tools-db-4 systemd[1]: Starting mariadb.service - mariadb database server...
Nov 11 13:28:17 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:17 0 [Note] Starting MariaDB 10.6.22-MariaDB-log source revision 19644f6821d59ecca0f9b1f44fadb3b887061965 server_uid 1AP9uRg2La3G8x2jAq3EyE1A3qk= as process 3869146
Nov 11 13:28:17 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:17 0 [ERROR] mysqld: Plugin 'unix_socket' already installed
Nov 11 13:28:17 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:17 0 [ERROR] Couldn't load plugin 'unix_socket' from 'auth_socket.so'.
Nov 11 13:28:17 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:17 0 [Note] mysqld: Aria engine: starting recovery
Nov 11 13:28:17 tools-db-4 mysqld[3869146]: recovered pages: 0% 10% 21% 32% 42% 54% 64% 74% 84% 95% 100% (0.0 seconds); tables to flush: 9 8 7 6 5 4 3 2 1 0
Nov 11 13:28:17 tools-db-4 mysqld[3869146]:  (0.0 seconds);
Nov 11 13:28:17 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:17 0 [Note] mysqld: Aria engine: recovery done
Nov 11 13:28:17 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:17 0 [Note] InnoDB: Compressed tables use zlib 1.3.1
Nov 11 13:28:17 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:17 0 [Note] InnoDB: Number of pools: 1
Nov 11 13:28:17 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:17 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
Nov 11 13:28:18 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:18 0 [Note] InnoDB: Using Linux native AIO
Nov 11 13:28:18 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:18 0 [Note] InnoDB: Initializing buffer pool, total size = 33285996544, chunk size = 134217728
Nov 11 13:28:18 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:18 0 [Note] InnoDB: Completed initialization of buffer pool
Nov 11 13:28:18 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:18 0 [ERROR] InnoDB: Missing FILE_CHECKPOINT at 123517410991977 between the checkpoint 123517317140845 and the end 123517410991616.
Nov 11 13:28:18 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:18 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
Nov 11 13:28:18 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:18 0 [Note] InnoDB: Starting shutdown...
Nov 11 13:28:18 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:18 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
Nov 11 13:28:18 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:18 0 [ERROR] Unknown/unsupported storage engine: InnoDB
Nov 11 13:28:18 tools-db-4 mysqld[3869146]: 2025-11-11 13:28:18 0 [ERROR] Aborting
Nov 11 13:28:18 tools-db-4 systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Nov 11 13:28:18 tools-db-4 systemd[1]: mariadb.service: Failed with result 'exit-code'.
Nov 11 13:28:18 tools-db-4 systemd[1]: Failed to start mariadb.service - mariadb database server.

About seven minutes before the crash, MariaDB started logging "Crash recovery is broken" every 16 seconds, with the last checkpoint LSN never advancing:

Nov 11 13:20:48 tools-db-4 mysqld[3528040]: 2025-11-11 13:20:48 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517437934963.
Nov 11 13:21:04 tools-db-4 mysqld[3528040]: 2025-11-11 13:21:04 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517443605760.
Nov 11 13:21:20 tools-db-4 mysqld[3528040]: 2025-11-11 13:21:20 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517448805961.
Nov 11 13:21:36 tools-db-4 mysqld[3528040]: 2025-11-11 13:21:36 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517454457173.
Nov 11 13:21:52 tools-db-4 mysqld[3528040]: 2025-11-11 13:21:52 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517459906387.
Nov 11 13:22:08 tools-db-4 mysqld[3528040]: 2025-11-11 13:22:08 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517467057426.
Nov 11 13:22:24 tools-db-4 mysqld[3528040]: 2025-11-11 13:22:24 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517471936563.
Nov 11 13:22:40 tools-db-4 mysqld[3528040]: 2025-11-11 13:22:40 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517474357994.
Nov 11 13:22:56 tools-db-4 mysqld[3528040]: 2025-11-11 13:22:56 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517480078015.
Nov 11 13:23:12 tools-db-4 mysqld[3528040]: 2025-11-11 13:23:12 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517485540056.
Nov 11 13:23:28 tools-db-4 mysqld[3528040]: 2025-11-11 13:23:28 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517490866192.
Nov 11 13:23:44 tools-db-4 mysqld[3528040]: 2025-11-11 13:23:44 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517496772178.
Nov 11 13:24:00 tools-db-4 mysqld[3528040]: 2025-11-11 13:24:00 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517501932264.
Nov 11 13:24:16 tools-db-4 mysqld[3528040]: 2025-11-11 13:24:16 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517507581802.
Nov 11 13:24:32 tools-db-4 mysqld[3528040]: 2025-11-11 13:24:32 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517513385334.
Nov 11 13:24:48 tools-db-4 mysqld[3528040]: 2025-11-11 13:24:48 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517518999950.
Nov 11 13:25:04 tools-db-4 mysqld[3528040]: 2025-11-11 13:25:04 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517525232292.
Nov 11 13:25:20 tools-db-4 mysqld[3528040]: 2025-11-11 13:25:20 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517530567886.
Nov 11 13:25:36 tools-db-4 mysqld[3528040]: 2025-11-11 13:25:36 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517536511964.
Nov 11 13:25:52 tools-db-4 mysqld[3528040]: 2025-11-11 13:25:52 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517542710565.
Nov 11 13:26:08 tools-db-4 mysqld[3528040]: 2025-11-11 13:26:08 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517549227844.
Nov 11 13:26:24 tools-db-4 mysqld[3528040]: 2025-11-11 13:26:24 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517555908148.
Nov 11 13:26:40 tools-db-4 mysqld[3528040]: 2025-11-11 13:26:40 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517562440274.
Nov 11 13:26:56 tools-db-4 mysqld[3528040]: 2025-11-11 13:26:56 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517568556637.
Nov 11 13:27:12 tools-db-4 mysqld[3528040]: 2025-11-11 13:27:12 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517574177154.
Nov 11 13:27:28 tools-db-4 mysqld[3528040]: 2025-11-11 13:27:28 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517580469967.
Nov 11 13:27:44 tools-db-4 mysqld[3528040]: 2025-11-11 13:27:44 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517586206627.
Nov 11 13:28:00 tools-db-4 mysqld[3528040]: 2025-11-11 13:28:00 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, current LSN=123517592381940.
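
The distance between the two LSNs in each message is the amount of redo log written since the last checkpoint. A minimal sketch for extracting it from a log line follows; the helper name `checkpoint_gap` is made up for illustration, and the two sample lines are the first and last messages above with the syslog prefix removed.

```python
import re

# Matches the two LSN values in the "Crash recovery is broken" message.
LSN_RE = re.compile(r"last checkpoint LSN=(\d+), current LSN=(\d+)")


def checkpoint_gap(line: str) -> int:
    """Return current LSN minus last checkpoint LSN (bytes of redo log)."""
    match = LSN_RE.search(line)
    if match is None:
        raise ValueError("line does not contain checkpoint and current LSNs")
    checkpoint, current = (int(group) for group in match.groups())
    return current - checkpoint


# First (13:20:48) and last (13:28:00) messages from the log excerpt above.
first = (
    "2025-11-11 13:20:48 0 [ERROR] InnoDB: Crash recovery is broken due to "
    "insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, "
    "current LSN=123517437934963."
)
last = (
    "2025-11-11 13:28:00 0 [ERROR] InnoDB: Crash recovery is broken due to "
    "insufficient innodb_log_file_size; last checkpoint LSN=123517317140845, "
    "current LSN=123517592381940."
)

print(checkpoint_gap(first))  # 120794118
print(checkpoint_gap(last))   # 275241095
```

The gap more than doubles between the first and the last message while the checkpoint LSN stays the same, which matches the recovery failure reported at startup against that same checkpoint.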

The same error appears in the logs on many earlier days, going back to April, but we had not noticed it:

fnegri@tools-db-4:~$ sudo journalctl -u mariadb -g "recovery is broken" |awk '{print $1 $2}' |uniq
Apr14
Apr22
Apr25
Jun11
Jun12
Jun14
Jun18
Jun19
Jun21
Jun23
Jun24
Jun26
Jun27
Jul03
Jul04
Jul24
Jul29
Aug08
Aug09
Aug10
Aug11
Aug21
Aug28
Aug29
Sep09
Sep10
Sep11
Sep12
Oct03
Oct05
Oct08
Oct13
Oct19
Oct21
Oct27
Nov10
Nov11

Event Timeline

fnegri changed the task status from Open to In Progress. (Wed, Nov 12, 12:40 PM)
fnegri claimed this task.
fnegri triaged this task as High priority.

Change #1204472 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] toolsdb: increase innodb_log_file_size to 512M

https://gerrit.wikimedia.org/r/1204472

This blog post has a lot of info on how to pick a good value for innodb_log_file_size: https://www.percona.com/blog/what-is-a-big-innodb_log_file_size/

The main downside of a big innodb_log_file_size seems to be slower recovery times, but every time ToolsDB successfully recovered in the past it was very quick (a few seconds), so I think we should be able to bump the current value significantly without making recovery times too slow to be usable.
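
One input for choosing a value is how fast redo log is being written. As a rough sketch, the average write rate can be estimated from two of the "Crash recovery is broken" messages in the description, since each one reports the current LSN at a known time. The function name `redo_rate` is hypothetical, and the seven-minute window just before the crash is an assumption about what counts as representative load; it may well not be.

```python
from datetime import datetime


def redo_rate(t0: datetime, lsn0: int, t1: datetime, lsn1: int) -> float:
    """Average redo log write rate in bytes per second between two samples."""
    seconds = (t1 - t0).total_seconds()
    if seconds <= 0:
        raise ValueError("second sample must be later than the first")
    return (lsn1 - lsn0) / seconds


# "current LSN" values from the first and last messages in the description.
t0 = datetime(2025, 11, 11, 13, 20, 48)
t1 = datetime(2025, 11, 11, 13, 28, 0)
rate = redo_rate(t0, 123517437934963, t1, 123517592381940)

# 512M is the value proposed in the patch; the usable checkpoint age is
# somewhat lower than the configured file size, so treat this as an upper bound.
log_size = 512 * 1024 * 1024
minutes_to_fill = log_size / rate / 60

print(f"redo write rate: {rate / 2**20:.2f} MiB/s")
print(f"time to write 512 MiB of redo at that rate: {minutes_to_fill:.0f} minutes")
```

This only measures one short window on one host, so it is a sanity check on the order of magnitude rather than a sizing rule.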

This is still happening:

fnegri@tools-db-6:~$ sudo journalctl -u mariadb -g "recovery is broken" --since -7d --no-pager
Nov 23 06:51:01 tools-db-6 mysqld[2532036]: 2025-11-23  6:51:01 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=125343066437714, current LSN=125343187231841.
Nov 23 09:53:32 tools-db-6 mysqld[2532036]: 2025-11-23  9:53:32 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=125349415185380, current LSN=125349535979510.
Nov 23 11:35:05 tools-db-6 mysqld[2532036]: 2025-11-23 11:35:05 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=125351917257630, current LSN=125352038051742.
Nov 23 11:35:12 tools-db-6 mysqld[2532036]: 2025-11-23 11:35:12 132484610 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=125351920724125, current LSN=125352042108536.
Nov 23 22:44:51 tools-db-6 mysqld[2532036]: 2025-11-23 22:44:51 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=125375253581073, current LSN=125375374375198.
Nov 24 12:13:38 tools-db-6 mysqld[2532036]: 2025-11-24 12:13:38 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=125411113700635, current LSN=125411234494768.
Nov 24 12:13:40 tools-db-6 mysqld[2532036]: 2025-11-24 12:13:40 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=125411113701007, current LSN=125411235250286.
Nov 24 13:17:59 tools-db-6 mysqld[2532036]: 2025-11-24 13:17:59 0 [ERROR] InnoDB: Crash recovery is broken due to insufficient innodb_log_file_size; last checkpoint LSN=125414307362770, current LSN=125414428156885.

Change #1204472 merged by FNegri:

[operations/puppet@production] toolsdb: increase innodb_log_file_size to 512M

https://gerrit.wikimedia.org/r/1204472

I restarted MariaDB on the replica tools-db-7 to apply the new setting innodb_log_file_size=512M.

Things are looking good, but the replica rarely reached the limit even with the previous setting, because it doesn't have the same load as the primary.

Screenshot 2025-11-25 at 17.32.59.png (924×1 px, 198 KB)

I will restart the primary tomorrow and see whether the new setting reduces the occurrences of "Crash recovery is broken due to insufficient innodb_log_file_size" in the logs.

I created a new debugging dashboard in grafana: https://grafana.wmcloud.org/d/5968f014-0a22-4f48-b6dc-6d1c46636f80/fnegri-toolsdb-debugging

It shows checkpoint_age and checkpoint_max_age for both primary and replica in the same graph:
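
The same ratio can be checked by hand. The sketch below assumes the graphed values correspond to MariaDB's Innodb_checkpoint_age and Innodb_checkpoint_max_age status variables, and that their values were obtained as tab-separated rows (for example from a batch-mode `SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint%'`). The function name and the sample numbers are invented for illustration and are not taken from ToolsDB.

```python
def checkpoint_usage(status_text: str) -> float:
    """Return checkpoint_age / checkpoint_max_age from tab-separated status rows."""
    values = {}
    for row in status_text.strip().splitlines():
        name, _, value = row.partition("\t")
        values[name.strip().lower()] = int(value)
    return values["innodb_checkpoint_age"] / values["innodb_checkpoint_max_age"]


# Hypothetical sample values, NOT real ToolsDB measurements.
sample = (
    "Innodb_checkpoint_age\t400000000\n"
    "Innodb_checkpoint_max_age\t500000000\n"
)
usage = checkpoint_usage(sample)
print(f"{usage:.0%} of the maximum checkpoint age is in use")
```

A value that stays close to 100% for long periods would suggest the redo log is still too small for the write load.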

Screenshot 2025-11-25 at 18.08.59.png (624×1 px, 153 KB)

Mentioned in SAL (#wikimedia-cloud) [2025-11-26T09:46:17Z] <dhinus> restarting tools-db-6 to apply a config change T409922

This is promising but I'll keep this task open for a few days before resolving:

fnegri@tools-db-6:~$ sudo journalctl -u mariadb --since 2025-11-26 -g "Crash recovery is broken"
-- No entries --

Screenshot 2025-11-27 at 16.10.26.png (622×1 px, 204 KB)

fnegri moved this task from In progress to Done on the cloud-services-team (FY2025/26-Q1-Q2) board.

A week later, the "Crash recovery is broken" message has not been logged again, although the graph shows that checkpoint_age is still getting close to the limit:

fnegri@tools-db-6:~$ sudo journalctl -u mariadb --since 2025-11-26 -g "Crash recovery is broken"
-- No entries --

Screenshot 2025-12-02 at 16.01.45.png (614×1 px, 279 KB)

We could increase innodb_log_file_size again, but a better strategy is probably to work on T291782 (Migrate largest ToolsDB users to Trove), which would decrease the overall load on the server.

I will mark this as Resolved for now. We should check back in a few weeks to verify whether we're still getting "Crash recovery is broken" messages in the logs; I have set myself a reminder to do so.