Page MenuHomePhabricator

mysqld killed by oomkiller on tools-db-1.tools.eqiad1.wikimedia.cloud
Closed, ResolvedPublicBUG REPORT

Description

$ sudo tail -50  kern.log
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803662] lowmem_reserve[]: 0 0 0 0 0
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803665] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803677] Node 0 DMA32: 52*4kB (UE) 127*8kB (UEH) 901*16kB (UEH) 1006*32kB (UEH) 782*64kB (UMEH) 563*128kB (UMEH) 192*256kB (UMEH) 51*512kB (UME) 5*1024kB (UM) 0*2048kB 0*4096kB = 250328kB
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803690] Node 0 Normal: 1676*4kB (ME) 1450*8kB (ME) 3326*16kB (ME) 2861*32kB (UME) 2113*64kB (UME) 491*128kB (UM) 32*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 369344kB
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803701] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803702] 1618 total pagecache pages
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803703] 0 pages in swap cache
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803704] Swap cache stats: add 0, delete 0, find 0/0
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803705] Free swap  = 0kB
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803705] Total swap = 0kB
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803706] 16777081 pages RAM
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803707] 0 pages HighMem/MovableOnly
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803707] 311481 pages reserved
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803708] Tasks state (memory values in pages):
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803708] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803716] [    538]     0   538    24986      353    69632        0             0 dhclient
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803719] [    664]   101   664     4912      255    65536        0          -900 dbus-daemon
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803722] [    674]     0   674    36101    16503   262144        0             0 sssd
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803724] [    714]     0   714     3461      244    61440        0             0 systemd-logind
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803726] [    727]     0   727      719       28    49152        0             0 agetty
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803728] [    728]     0   728     5102      187    65536        0             0 login
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803730] [    735]     0   735     6652     1793    90112        0             0 unattended-upgr
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803732] [    768]     0   768     3798      312    69632        0             0 systemd
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803734] [    769]     0   769    41768      723    98304        0             0 (sd-pam)
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803736] [    783]     0   783     1749      386    49152        0             0 bash
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803739] [  55238]     0 55238     1954       91    53248        0             0 rsync
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803741] [  55492]   497 55492 16206140 15900418 128692224        0          -600 mysqld
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803743] [ 995371]     0 995371     6835     3392    90112        0             0 perl
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803745] [2144336]     0 2144336     1409       68    45056        0             0 cron
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803748] [2206324]   111 2206324   548461     6145   344064        0             0 prometheus-mysq
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803750] [3793592]   111 3793592   604787     6605   405504        0             0 prometheus-node
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803752] [1749845]     0 1749845     4802      187    57344        0         -1000 systemd-udevd
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803755] [1752273]   109 1752273    22109      188    73728        0             0 systemd-timesyn
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803757] [1774204]   112 1774204     4651      246    81920        0             0 lldpd
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803758] [1774206]   112 1774206     4651      243    77824        0             0 lldpd
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803764] [1786952]     0 1786952     3338      244    61440        0         -1000 sshd
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803767] [1826782]     0 1826782    45886      276   376832        0          -250 systemd-journal
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803769] [1969545]     0 1969545   110493    13790   212992        0             0 rsyslogd
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803771] [2911330]     0 2911330    40057    16581   241664        0             0 sssd_be
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803773] [2934250]     0 2934250   566074     1201   319488        0             0 prometheus-rsys
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803775] [2939448]   110 2939448     7429      243    77824        0             0 exim4
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803778] [3014750]     0 3014750    26771      357   204800        0             0 sssd_nss
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803780] [3014799]     0 3014799     3407      263    69632        0             0 sshd
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803782] [3014800]   107 3014800     3367      248    69632        0             0 sshd
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803784] [3014801]     0 3014801     6425     2952    86016        0             0 prometheus-pupp
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803786] [3014802]     0 3014802    18177     1456   172032        0             0 puppet
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803788] [3014805]   110 3014805     2649       40    57344        0             0 exim4
Aug 15 23:09:12 tools-db-1 kernel: [15825831.803790] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/mariadb.service,task=mysqld,pid=55492,uid=497
Aug 15 23:09:12 tools-db-1 kernel: [15825831.804009] Out of memory: Killed process 55492 (mysqld) total-vm:64824560kB, anon-rss:63601672kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:125676kB oom_score_adj:-600
Aug 15 23:09:13 tools-db-1 kernel: [15825832.764060] oom_reaper: reaped process 55492 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2023-08-15T23:29:32Z] <bd808> Rebooted tools-db-1.tools.eqiad1.wikimedia.cloud for T344298

After rebooting mariadb wouldn't start:

$ sudo journalctl -u mariadb.service --no-pager
-- Journal begins at Fri 2023-06-23 15:55:02 UTC, ends at Tue 2023-08-15 23:36:53 UTC. --
Jul 13 16:08:59 tools-db-1 mysqld[55492]: 2023-07-13 16:08:59 401717440 [Note] InnoDB: Deferring DROP TABLE `s51434__mixnmatch_p`.`cnd_tmp`; renaming to s51434__mixnmatch_p/#sql-ib2492962
Jul 17 11:29:07 tools-db-1 mysqld[55492]: 2023-07-17 11:29:07 401717440 [Note] InnoDB: Deferring DROP TABLE `s51434__mixnmatch_p`.`cnd_tmp`; renaming to s51434__mixnmatch_p/#sql-ib2492966
Aug 15 23:09:12 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer.
Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Failed with result 'oom-kill'.
Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Consumed 10month 1w 1d 9h 11min 58.904s CPU time.
-- Boot 1bf46427f5154442abde2c180a6e3825 --
Aug 15 23:28:52 tools-db-1 systemd[1]: mariadb database server is not active.
Aug 15 23:31:32 tools-db-1 systemd[1]: Starting mariadb database server...
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] Starting MariaDB 10.4.29-MariaDB-log source revision 2594da7a33580bf03590502a011679c878487d0c as process 2228
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [ERROR] mysqld: Plugin 'unix_socket' already installed
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [ERROR] Couldn't load plugin 'unix_socket' from 'auth_socket.so'.
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Warning] The parameter innodb_file_format is deprecated and has no effect. It may be removed in future releases. See https://mariadb.com/kb/en/library/xtradbinnodb-file-format/
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Using Linux native AIO
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: The first innodb_system data file 'ibdata1' did not exist. A new tablespace will be created!
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Uses event mutexes
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Compressed tables use zlib 1.2.13
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Number of pools: 1
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Using SSE2 crc32 instructions
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Initializing buffer pool, total size = 31G, instances = 8, chunk size = 128M
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Completed initialization of buffer pool
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting file './ibdata1' size to 12 MB. Physically writing the file full; Please wait ...
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: File './ibdata1' size is now 12 MB.
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting log file ./ib_logfile101 size to 134217728 bytes
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Sebd808@tools-db-1:~$ sudo journalctl -u mariadb.service --no-pager
-- Journal begins at Fri 2023-06-23 15:55:02 UTC, ends at Tue 2023-08-15 23:37:42 UTC. --
Jul 13 16:08:59 tools-db-1 mysqld[55492]: 2023-07-13 16:08:59 401717440 [Note] InnoDB: Deferring DROP TABLE `s51434__mixnmatch_p`.`cnd_tmp`; renaming to s51434__mixnmatch_p/#sql-ib2492962
Jul 17 11:29:07 tools-db-1 mysqld[55492]: 2023-07-17 11:29:07 401717440 [Note] InnoDB: Deferring DROP TABLE `s51434__mixnmatch_p`.`cnd_tmp`; renaming to s51434__mixnmatch_p/#sql-ib2492966
Aug 15 23:09:12 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer.
Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Failed with result 'oom-kill'.
Aug 15 23:09:13 tools-db-1 systemd[1]: mariadb.service: Consumed 10month 1w 1d 9h 11min 58.904s CPU time.
-- Boot 1bf46427f5154442abde2c180a6e3825 --
Aug 15 23:28:52 tools-db-1 systemd[1]: mariadb database server is not active.
Aug 15 23:31:32 tools-db-1 systemd[1]: Starting mariadb database server...
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] Starting MariaDB 10.4.29-MariaDB-log source revision 2594da7a33580bf03590502a011679c878487d0c as process 2228
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [ERROR] mysqld: Plugin 'unix_socket' already installed
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [ERROR] Couldn't load plugin 'unix_socket' from 'auth_socket.so'.
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Warning] The parameter innodb_file_format is deprecated and has no effect. It may be removed in future releases. See https://mariadb.com/kb/en/library/xtradbinnodb-file-format/
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Using Linux native AIO
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: The first innodb_system data file 'ibdata1' did not exist. A new tablespace will be created!
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Uses event mutexes
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Compressed tables use zlib 1.2.13
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Number of pools: 1
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Using SSE2 crc32 instructions
Aug 15 23:31:32 tools-db-1 mysqld[2228]: 2023-08-15 23:31:32 0 [Note] InnoDB: Initializing buffer pool, total size = 31G, instances = 8, chunk size = 128M
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Completed initialization of buffer pool
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting file './ibdata1' size to 12 MB. Physically writing the file full; Please wait ...
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: File './ibdata1' size is now 12 MB.
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting log file ./ib_logfile101 size to 134217728 bytes
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting log file ./ib_logfile1 size to 134217728 bytes
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Renaming log file ./ib_logfile101 to ./ib_logfile0
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: New log files created, LSN=11451
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Doublewrite buffer not found: creating new
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Doublewrite buffer created
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: 128 out of 128 rollback segments are active.
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Creating foreign key constraint system tables.
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Creating tablespace and datafile system tables.
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Creating sys_virtual system tables.
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Creating shared tablespace for temporary tables
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] InnoDB: 10.4.29 started; log sequence number 0; transaction id 7
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [ERROR] Could not open mysql.plugin table. Some plugins may be not loaded
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 6 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory")
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [Note] Server socket created on IP: '::'.
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [ERROR] Fatal error: Can't open and lock privilege tables: Table 'mysql.db' doesn't exist
Aug 15 23:31:33 tools-db-1 mysqld[2228]: 2023-08-15 23:31:33 0 [ERROR] Aborting
Aug 15 23:31:36 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Aug 15 23:31:36 tools-db-1 systemd[1]: mariadb.service: Failed with result 'exit-code'.
Aug 15 23:31:36 tools-db-1 systemd[1]: Failed to start mariadb database server.
Aug 15 23:31:36 tools-db-1 systemd[1]: mariadb.service: Consumed 1.725s CPU time.

@Andrew noticed that the data volume was not mounted.

  • /etc/fstab was totally wrong so the volume didn't remount on reboot. It had an invalid volume uuid and also mounted to the wrong place (mariadb expected /srv/labsdb)
  • we need updated docs about how to properly validate replication before switching to r/w
  • we need updated docs about failover should that be necessary
bd808 triaged this task as High priority.Aug 16 2023, 12:09 AM
bd808 added a project: cloud-services-team.
bd808 added a subscriber: fnegri.

Outage is over, but the config drift and runbook status seem worthy of leaving this open and high priority. Hopefully @fnegri can make time to update things or delegate to someone.

We should also move this alert(s) from icinga to alertmanager

@bd808 agreed, I'm claiming this and will look at it next week when I'm back from holidays!

Sorry for the trouble, I'm pretty sure I did reboot both hosts a few times without seeing issues with /etc/fstab, not sure what happened this time!

I will also update the failover docs, though that's harder to test. We might schedule a planned failover as a test.

I did a big refresh of the ToolsDB admin docs at https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Toolsdb

I also created T344719 and T344717 to road-test the new procedures I have written.

We should also move this alert(s) from icinga to alertmanager

This is tracked in T313030.