$ sudo tail -50 kern.log Aug 15 23:09:12 tools-db-1 kernel: [15825831.803662] lowmem_reserve[]: 0 0 0 0 0 Aug 15 23:09:12 tools-db-1 kernel: [15825831.803665] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB Aug 15 23:09:12 tools-db-1 kernel: [15825831.803677] Node 0 DMA32: 52*4kB (UE) 127*8kB (UEH) 901*16kB (UEH) 1006*32kB (UEH) 782*64kB (UMEH) 563*128kB (UMEH) 192*256kB (UMEH) 51*512kB (UME) 5*1024kB (UM) 0*2048kB 0*4096kB = 250328kB Aug 15 23:09:12 tools-db-1 kernel: [15825831.803690] Node 0 Normal: 1676*4kB (ME) 1450*8kB (ME) 3326*16kB (ME) 2861*32kB (UME) 2113*64kB (UME) 491*128kB (UM) 32*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 369344kB Aug 15 23:09:12 tools-db-1 kernel: [15825831.803701] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB Aug 15 23:09:12 tools-db-1 kernel: [15825831.803702] 1618 total pagecache pages Aug 15 23:09:12 tools-db-1 kernel: [15825831.803703] 0 pages in swap cache Aug 15 23:09:12 tools-db-1 kernel: [15825831.803704] Swap cache stats: add 0, delete 0, find 0/0 Aug 15 23:09:12 tools-db-1 kernel: [15825831.803705] Free swap = 0kB Aug 15 23:09:12 tools-db-1 kernel: [15825831.803705] Total swap = 0kB Aug 15 23:09:12 tools-db-1 kernel: [15825831.803706] 16777081 pages RAM Aug 15 23:09:12 tools-db-1 kernel: [15825831.803707] 0 pages HighMem/MovableOnly Aug 15 23:09:12 tools-db-1 kernel: [15825831.803707] 311481 pages reserved Aug 15 23:09:12 tools-db-1 kernel: [15825831.803708] Tasks state (memory values in pages): Aug 15 23:09:12 tools-db-1 kernel: [15825831.803708] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name Aug 15 23:09:12 tools-db-1 kernel: [15825831.803716] [ 538] 0 538 24986 353 69632 0 0 dhclient Aug 15 23:09:12 tools-db-1 kernel: [15825831.803719] [ 664] 101 664 4912 255 65536 0 -900 dbus-daemon Aug 15 23:09:12 tools-db-1 kernel: [15825831.803722] [ 674] 0 674 36101 16503 262144 0 0 sssd Aug 15 23:09:12 tools-db-1 kernel: [15825831.803724] [ 714] 0 714 3461 244 61440 0 0 systemd-logind Aug 15 23:09:12 tools-db-1 kernel: [15825831.803726] [ 727] 0 727 719 28 49152 0 0 agetty Aug 15 23:09:12 tools-db-1 kernel: [15825831.803728] [ 728] 0 728 5102 187 65536 0 0 login Aug 15 23:09:12 tools-db-1 kernel: [15825831.803730] [ 735] 0 735 6652 1793 90112 0 0 unattended-upgr Aug 15 23:09:12 tools-db-1 kernel: [15825831.803732] [ 768] 0 768 3798 312 69632 0 0 systemd Aug 15 23:09:12 tools-db-1 kernel: [15825831.803734] [ 769] 0 769 41768 723 98304 0 0 (sd-pam) Aug 15 23:09:12 tools-db-1 kernel: [15825831.803736] [ 783] 0 783 1749 386 49152 0 0 bash Aug 15 23:09:12 tools-db-1 kernel: [15825831.803739] [ 55238] 0 55238 1954 91 53248 0 0 rsync Aug 15 23:09:12 tools-db-1 kernel: [15825831.803741] [ 55492] 497 55492 16206140 15900418 128692224 0 -600 mysqld Aug 15 23:09:12 tools-db-1 kernel: [15825831.803743] [ 995371] 0 995371 6835 3392 90112 0 0 perl Aug 15 23:09:12 tools-db-1 kernel: [15825831.803745] [2144336] 0 2144336 1409 68 45056 0 0 cron Aug 15 23:09:12 tools-db-1 kernel: [15825831.803748] [2206324] 111 2206324 548461 6145 344064 0 0 prometheus-mysq Aug 15 23:09:12 tools-db-1 kernel: [15825831.803750] [3793592] 111 3793592 604787 6605 405504 0 0 prometheus-node Aug 15 23:09:12 tools-db-1 kernel: [15825831.803752] [1749845] 0 1749845 4802 187 57344 0 -1000 systemd-udevd Aug 15 23:09:12 tools-db-1 kernel: [15825831.803755] [1752273] 109 1752273 22109 188 73728 0 0 systemd-timesyn Aug 15 23:09:12 tools-db-1 kernel: [15825831.803757] [1774204] 112 1774204 4651 246 81920 0 0 lldpd Aug 15 23:09:12 tools-db-1 kernel: [15825831.803758] [1774206] 112 1774206 4651 243 77824 0 0 lldpd Aug 15 23:09:12 tools-db-1 kernel: [15825831.803764] [1786952] 0 1786952 3338 244 61440 0 -1000 sshd Aug 15 23:09:12 tools-db-1 kernel: [15825831.803767] [1826782] 0 1826782 45886 276 376832 0 -250 systemd-journal Aug 15 23:09:12 tools-db-1 kernel: [15825831.803769] [1969545] 0 1969545 110493 13790 212992 0 0 rsyslogd Aug 15 23:09:12 tools-db-1 kernel: [15825831.803771] [2911330] 0 2911330 40057 16581 241664 0 0 sssd_be Aug 15 23:09:12 tools-db-1 kernel: [15825831.803773] [2934250] 0 2934250 566074 1201 319488 0 0 prometheus-rsys Aug 15 23:09:12 tools-db-1 kernel: [15825831.803775] [2939448] 110 2939448 7429 243 77824 0 0 exim4 Aug 15 23:09:12 tools-db-1 kernel: [15825831.803778] [3014750] 0 3014750 26771 357 204800 0 0 sssd_nss Aug 15 23:09:12 tools-db-1 kernel: [15825831.803780] [3014799] 0 3014799 3407 263 69632 0 0 sshd Aug 15 23:09:12 tools-db-1 kernel: [15825831.803782] [3014800] 107 3014800 3367 248 69632 0 0 sshd Aug 15 23:09:12 tools-db-1 kernel: [15825831.803784] [3014801] 0 3014801 6425 2952 86016 0 0 prometheus-pupp Aug 15 23:09:12 tools-db-1 kernel: [15825831.803786] [3014802] 0 3014802 18177 1456 172032 0 0 puppet Aug 15 23:09:12 tools-db-1 kernel: [15825831.803788] [3014805] 110 3014805 2649 40 57344 0 0 exim4 Aug 15 23:09:12 tools-db-1 kernel: [15825831.803790] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/mariadb.service,task=mysqld,pid=55492,uid=497 Aug 15 23:09:12 tools-db-1 kernel: [15825831.804009] Out of memory: Killed process 55492 (mysqld) total-vm:64824560kB, anon-rss:63601672kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:125676kB oom_score_adj:-600 Aug 15 23:09:13 tools-db-1 kernel: [15825832.764060] oom_reaper: reaped process 55492 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Description
Related Objects
- Mentioned In
- T353093: [toolsdb] MariaDB process is killed by OOM killer (December 2023)
T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023)
T344411: [toolsdb] ToolsDB replication is broken on tools-db-2 (errno 1032) - 2023-08-17
T344299: toolsdb perhaps down? - Mentioned Here
- T313030: [toolforge.infra] Replace Toolschecker alerts with Prometheus based ones
T344717: [toolsdb] test creating a new replica host
T344719: [toolsdb] test failover procedure
Event Timeline
Mentioned in SAL (#wikimedia-cloud) [2023-08-15T23:29:32Z] <bd808> Rebooted tools-db-1.tools.eqiad1.wikimedia.cloud for T344298
After rebooting mariadb wouldn't start:
$ sudo journalctl -u mariadb.service --no-pager -- Journal begins at Fri 2023-06-23 15:55:02 UTC, ends at Tue 2023-08-15 23:36:53 UTC. -- Jul 13 16:08:59 tools-db-1 mysqld[55492]: 2023-07-13 16:08:59 401717440 [Note] InnoDB: Deferring DROP TABLE `s51434__mixnmatch_p`.`cnd_tmp`; renaming to s51434__mixnmatch_p/#sql-ib2492962 Jul 17 11:29:07 tools-db-1 mysqld[55492]: 2023-07-17 11:29:07 401717440 [Note] InnoDB: Deferring DROP TABLE `s51434__mixnmatch_p`.`cnd_tmp`; renaming to s51434__mixnmatch_p/#sql-ib2492966 Aug 15 23:09:12 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer. Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Failed with result 'oom-kill'. Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Consumed 10month 1w 1d 9h 11min 58.904s CPU time. -- Boot 1bf46427f5154442abde2c180a6e3825 -- Aug 15 23:28:52 tools-db-1 systemd[1]: mariadb database server is not active. Aug 15 23:31:32 tools-db-1 systemd[1]: Starting mariadb database server... Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] Starting MariaDB 10.4.29-MariaDB-log source revision 2594da7a33580bf03590502a011679c878487d0c as process 2228 Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [ERROR] mysqld: Plugin 'unix_socket' already installed Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [ERROR] Couldn't load plugin 'unix_socket' from 'auth_socket.so'. Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Warning] The parameter innodb_file_format is deprecated and has no effect. It may be removed in future releases. See https://mariadb.com/kb/en/library/xtradbinnodb-file-format/ Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Using Linux native AIO Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: The first innodb_system data file 'ibdata1' did not exist. A new tablespace will be created! Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Uses event mutexes Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Compressed tables use zlib 1.2.13 Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Number of pools: 1 Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Using SSE2 crc32 instructions Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Initializing buffer pool, total size = 31G, instances = 8, chunk size = 128M Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Completed initialization of buffer pool Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority(). Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting file './ibdata1' size to 12 MB. Physically writing the file full; Please wait ... Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: File './ibdata1' size is now 12 MB. Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting log file ./ib_logfile101 size to 134217728 bytes Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Sebd808@tools-db-1:~$ sudo journalctl -u mariadb.service --no-pager -- Journal begins at Fri 2023-06-23 15:55:02 UTC, ends at Tue 2023-08-15 23:37:42 UTC. -- Jul 13 16:08:59 tools-db-1 mysqld[55492]: 2023-07-13 16:08:59 401717440 [Note] InnoDB: Deferring DROP TABLE `s51434__mixnmatch_p`.`cnd_tmp`; renaming to s51434__mixnmatch_p/#sql-ib2492962 Jul 17 11:29:07 tools-db-1 mysqld[55492]: 2023-07-17 11:29:07 401717440 [Note] InnoDB: Deferring DROP TABLE `s51434__mixnmatch_p`.`cnd_tmp`; renaming to s51434__mixnmatch_p/#sql-ib2492966 Aug 15 23:09:12 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer. Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Failed with result 'oom-kill'. Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Consumed 10month 1w 1d 9h 11min 58.904s CPU time. -- Boot 1bf46427f5154442abde2c180a6e3825 -- Aug 15 23:28:52 tools-db-1 systemd[1]: mariadb database server is not active. Aug 15 23:31:32 tools-db-1 systemd[1]: Starting mariadb database server... Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] Starting MariaDB 10.4.29-MariaDB-log source revision 2594da7a33580bf03590502a011679c878487d0c as process 2228 Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [ERROR] mysqld: Plugin 'unix_socket' already installed Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [ERROR] Couldn't load plugin 'unix_socket' from 'auth_socket.so'. Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Warning] The parameter innodb_file_format is deprecated and has no effect. It may be removed in future releases. See https://mariadb.com/kb/en/library/xtradbinnodb-file-format/ Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Using Linux native AIO Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: The first innodb_system data file 'ibdata1' did not exist. A new tablespace will be created! Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Uses event mutexes Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Compressed tables use zlib 1.2.13 Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Number of pools: 1 Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Using SSE2 crc32 instructions Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Initializing buffer pool, total size = 31G, instances = 8, chunk size = 128M Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Completed initialization of buffer pool Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority(). Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting file './ibdata1' size to 12 MB. Physically writing the file full; Please wait ... Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: File './ibdata1' size is now 12 MB. Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting log file ./ib_logfile101 size to 134217728 bytes Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting log file ./ib_logfile1 size to 134217728 bytes Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Renaming log file ./ib_logfile101 to ./ib_logfile0 Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: New log files created, LSN=11451 Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Doublewrite buffer not found: creating new Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Doublewrite buffer created Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: 128 out of 128 rollback segments are active. Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Creating foreign key constraint system tables. Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Creating tablespace and datafile system tables. Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Creating sys_virtual system tables. Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Creating shared tablespace for temporary tables Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ... Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB. Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: 10.4.29 started; log sequence number 0; transaction id 7 Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [ERROR] Could not open mysql.plugin table. Some plugins may be not loaded Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 6 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory") Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] Server socket created on IP: '::'. Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [ERROR] Fatal error: Can't open and lock privilege tables: Table 'mysql.db' doesn't exist Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [ERROR] Aborting Aug 15 23:31:36 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE Aug 15 23:31:36 tools-db-1 systemd[1]: mariadb.service: Failed with result 'exit-code'. Aug 15 23:31:36 tools-db-1 systemd[1]: Failed to start mariadb database server. Aug 15 23:31:36 tools-db-1 systemd[1]: mariadb.service: Consumed 1.725s CPU time.
@Andrew noticed that the data volume was not mounted.
- /etc/fstab was totally wrong so the volume didn't remount on reboot. It had an invalid volume uuid and also mounted to the wrong place (mariadb expected /srv/labsdb)
- we need updated docs about how to properly validate replication before switching to r/w
- we need updated docs about failover should that be necessary
Outage is over, but the config drift and runbook status seem worthy of leaving this open and high priority. Hopefully @fnegri can make time to update things or delegate to someone.
@bd808 agreed, I'm claiming this and will look at it next week when I'm back from holidays!
Sorry for the trouble, I'm pretty sure I did reboot both hosts a few times without seeing issues with /etc/fstab, not sure what happened this time!
I will also update the failover docs, though that's harder to test. We might schedule a planned failover as a test.
I did a big refresh of the ToolsDB admin docs at https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Toolsdb
I also created T344719 and T344717 to road-test the new procedures I have written.
We should also move this alert(s) from icinga to alertmanager
This is tracked in T313030.