
[toolsdb] MariaDB process is killed by OOM killer (December 2023)
Closed, Resolved · Public

Description

This happened twice (according to the IRC logs) in less than one day:

<taavi> i'm looking at toolsdb
01:53:57 oom killed once again
01:53:58 [3211718.996904] Out of memory: Killed process 725680 (mysqld) total-vm:64552308kB, anon-rss:63683220kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:125676kB oom_score_adj:-600
01:55:03 last one was on nov 9th, it made it almost a month without crashing
01:55:11 <andrewbogott> oops I just rebooted it out from under you sorry
01:55:26 But I imagine you were about to do that anyway :)
01:55:56 taavi: do you mind setting r/w after it comes up?  I have food on the stove that's about to burn :)
01:56:00 <taavi> sure
01:56:18 yeah I was just going through lots of email from the past week and happened to spot that particular one just as it got in
01:56:22 <andrewbogott> thank you! I'll check back in a few to make sure it was an easy resolution
01:58:24 <taavi> mariadb is taking suspiciously long to even start
01:58:54 there we go
08:14:07 — dcaro paged
08:16:57 <dcaro> oom killed too
08:16:57 Dec 09 07:03:41 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer.
08:18:29 I'm starting it again
08:19:32 made it writable again
root@tools-db-1:~# journalctl | grep  oom-kill
...
Dec 09 00:45:39 tools-db-1 kernel: mysqld invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-600
Dec 09 00:45:39 tools-db-1 kernel:  oom_kill_process.cold+0xb/0x10
Dec 09 00:45:39 tools-db-1 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Dec 09 00:45:39 tools-db-1 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/mariadb.service,task=mysqld,pid=725680,uid=497
Dec 09 00:45:39 tools-db-1 kernel: Out of memory: Killed process 725680 (mysqld) total-vm:64552308kB, anon-rss:63683220kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:125676kB oom_score_adj:-600
Dec 09 00:45:40 tools-db-1 kernel: oom_reaper: reaped process 725680 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Dec 09 07:03:41 tools-db-1 kernel: mysqld invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-600
Dec 09 07:03:41 tools-db-1 kernel:  oom_kill_process.cold+0xb/0x10
Dec 09 07:03:41 tools-db-1 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Dec 09 07:03:41 tools-db-1 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/mariadb.service,task=mysqld,pid=892,uid=497
Dec 09 07:03:41 tools-db-1 kernel: Out of memory: Killed process 892 (mysqld) total-vm:68657680kB, anon-rss:64977180kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:129096kB oom_score_adj:-600
Dec 09 07:03:41 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer.
Dec 09 07:03:42 tools-db-1 kernel: oom_reaper: reaped process 892 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Dec 09 07:03:43 tools-db-1 systemd[1]: mariadb.service: Failed with result 'oom-kill'.
Dec 09 11:23:37 tools-db-1 kernel: mysqld invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-600
Dec 09 11:23:37 tools-db-1 kernel:  oom_kill_process.cold+0xb/0x10
Dec 09 11:23:37 tools-db-1 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Dec 09 11:23:37 tools-db-1 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/mariadb.service,task=mysqld,pid=26159,uid=497
Dec 09 11:23:37 tools-db-1 kernel: Out of memory: Killed process 26159 (mysqld) total-vm:67846824kB, anon-rss:64994320kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:129200kB oom_score_adj:-600
Dec 09 11:23:39 tools-db-1 kernel: oom_reaper: reaped process 26159 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Dec 09 16:11:12 tools-db-1 kernel: mysqld invoked oom-killer: gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-600
Dec 09 16:11:12 tools-db-1 kernel:  oom_kill_process.cold+0xb/0x10
Dec 09 16:11:12 tools-db-1 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Dec 09 16:11:12 tools-db-1 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/mariadb.service,task=mysqld,pid=41033,uid=497
Dec 09 16:11:12 tools-db-1 kernel: Out of memory: Killed process 41033 (mysqld) total-vm:68536228kB, anon-rss:64961000kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:129092kB oom_score_adj:-600
Dec 09 16:11:14 tools-db-1 kernel: oom_reaper: reaped process 41033 (mysqld), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

It had been a while since the last one, so it was getting "stabler" :)

Event Timeline

dcaro triaged this task as High priority. Dec 9 2023, 7:30 AM
dcaro created this task.

It has happened 4 times this Saturday.

After discussing with @dcaro, I started a tmux session on tools-db-1 that dumps the MariaDB processlist every 30 seconds:

while pgrep mysqld; do date; mariadb --execute 'show processlist'; sleep 30; done

In this way, if MariaDB gets killed again, we can check which queries were running just before the crash.

There was a new OOM crash this morning:

fnegri@tools-db-1:~$ sudo dmesg -T |grep Killed
[Thu Dec 14 09:34:59 2023] Out of memory: Killed process 922 (mysqld) total-vm:68471344kB, anon-rss:64957444kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:129048kB oom_score_adj:-600

Unfortunately the server was rebooted for unrelated reasons last night, which meant the tmux session dumping the processlist was no longer running.

I've restarted a similar tmux session, this time adding a tee -a that appends the output to /root/mariadb_processlist:

root@tools-db-1:~# while pgrep mysqld; do date; mariadb --execute 'show processlist' | tee -a mariadb_processlist; sleep 30; done
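For an even fuller picture, a variant of the same loop could also record overall memory usage and MariaDB's own memory counter on each iteration (a hypothetical sketch, not what was actually run):

# hypothetical variant: also capture memory counters alongside the processlist
while pgrep mysqld >/dev/null; do
    { date; free -m; mariadb --execute 'SHOW FULL PROCESSLIST; SHOW GLOBAL STATUS LIKE "Memory_used"'; } | tee -a /root/mariadb_processlist
    sleep 30
done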

And another crash now:

[Thu Dec 14 16:22:41 2023] Out of memory: Killed process 28489 (mysqld) total-vm:68017760kB, anon-rss:64976664kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:129144kB oom_score_adj:-600

This time systemd restarted MariaDB automatically; I only had to set the db back to read-write.
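For context, making the primary writable again essentially means flipping the read_only flag; the exact procedure used here may differ, but it amounts to something like:

# sketch only: re-enable writes on the primary once it is back up and healthy
mariadb --execute 'SET GLOBAL read_only = OFF'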

These were the queries running just before the crash: https://phabricator.wikimedia.org/P54447

Change 983221 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] [toolsdb] Kill queries taking longer than 1 hour

https://gerrit.wikimedia.org/r/983221

The log before the crash shows two very long queries (running for 2 hours) on the db s51412__data (persondata tool). I'm not sure those queries caused the OOM, but it's worth setting a query timeout at a generous value (1 hour) and seeing if it helps. We should monitor whether that ends up killing too many queries, but it will likely affect very few users.

I have prepared a patch to set the query timeout, but not enabled it yet: https://gerrit.wikimedia.org/r/983221
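Assuming the patch sets MariaDB's max_statement_time (specified in seconds), the runtime equivalent would be roughly the following; the actual Puppet change may differ:

# sketch: abort any statement running longer than one hour (0 would mean no limit)
mariadb --execute 'SET GLOBAL max_statement_time = 3600'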

Another crash a few minutes ago:

[Fri Dec 15 04:13:09 2023] Out of memory: Killed process 898 (mysqld) total-vm:68600852kB, anon-rss:64962980kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:129136kB oom_score_adj:-600

Restarting mariadb took longer than usual (about 5 minutes):

Dec 15 04:21:37 tools-db-1 mysqld[22395]: 2023-12-15  4:21:37 0 [Note] mysqld: Aria engine: starting recovery
Dec 15 04:26:54 tools-db-1 mysqld[22395]: recovered pages: 0% 10% 20% 30% 40% 50% 63% 73% 83% 93% 100% (11.3 seconds); transactions to roll back: 1 0 (302.8 seconds); tables to flush: 2 1 0
Dec 15 04:26:54 tools-db-1 mysqld[22395]:  (2.5 seconds);
Dec 15 04:26:54 tools-db-1 mysqld[22395]: 2023-12-15  4:26:54 0 [Note] mysqld: Aria engine: recovery done

The tmux session writing the mariadb_processlist debug file was not running; I don't know why. The file was last updated on Dec 14 22:42, a few hours before the crash. I have restarted it.
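To make the logger less dependent on a live tmux session, one option would be to run it as a transient systemd unit instead (hypothetical, not what was done here):

# hypothetical: run the same loop under a transient systemd unit so it survives a lost tmux/SSH session
systemd-run --unit=mariadb-processlist-logger /bin/bash -c \
    'while pgrep mysqld >/dev/null; do date; mariadb --execute "show processlist"; sleep 30; done >> /root/mariadb_processlist'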

Change 983221 merged by FNegri:

[operations/puppet@production] [toolsdb] Kill queries taking longer than 1 hour

https://gerrit.wikimedia.org/r/983221

Mentioned in SAL (#wikimedia-cloud) [2023-12-15T10:48:37Z] <dhinus> restarting toolsdb to apply new config T353093

Change 983368 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] [toolsdb] Fix slow query logging condition

https://gerrit.wikimedia.org/r/983368

Change 983368 merged by FNegri:

[operations/puppet@production] [toolsdb] Fix slow query logging condition

https://gerrit.wikimedia.org/r/983368

Mentioned in SAL (#wikimedia-cloud) [2023-12-15T12:19:06Z] <dhinus> restarting toolsdb again to apply a config fix T353093

cc-ing @Marostegui as I'm running out of ideas here. As described in the previous comments, I've started logging the output of SHOW PROCESSLIST as Manuel suggested back in October, but that didn't help me identify any suspicious query. This is a dump of the processlist a few seconds before one of the OOM crashes: https://phabricator.wikimedia.org/P54447

The only odd thing I can see in that list is two queries running for more than 2 hours, so I introduced a 60-minute query timeout earlier today. Only one query was killed (again on the db s51412__data, persondata tool), but when the long-running query was killed, no memory was freed and memory usage continued to grow.
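If it happens again, one thing worth checking is MariaDB's per-connection memory accounting, which is exposed in information_schema (a suggestion, not something that was run here):

# list the connections holding the most memory (MariaDB tracks MEMORY_USED per thread)
mariadb --execute 'SELECT id, user, db, time, memory_used, max_memory_used
                   FROM information_schema.processlist
                   ORDER BY memory_used DESC LIMIT 10'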

Free memory is slowly going to zero, as it has done multiple times in the past few days:

Screenshot 2023-12-15 at 17.29.56.png (graph of free memory over time)

Each spike in the graph is either an OOM crash or a manual service restart: free memory jumps back up to ~62 GB, then starts going down again. Sometimes it plateaus for a few days before dropping further.

Link to the graph: https://grafana.wmcloud.org/goto/Y0yPY64Sk?orgId=1

It looks like there are quite a few memory issues that have been fixed in later versions. Some are listed at the end of this Stack Exchange answer: https://dba.stackexchange.com/questions/279486/troubleshooting-high-memory-usage-on-mariadb-10-5-4

It seems like it'll be a pain to upgrade, but we should do it if possible. We could consider switching to the stock Debian-packaged versions.

I agree we should upgrade ASAP, but I'd also like to find a temporary fix for 10.4...

If we're really pressed we could do an in-place upgrade, on the replica first, then on the primary, without a failover.

Upgrade task: T352206

Mentioned in SAL (#wikimedia-cloud) [2023-12-16T00:21:12Z] <dhinus> restarting toolsdb again as it's again low in free mem T353093

I'm going to try using jemalloc, as suggested in this Stack Overflow thread. I tested the config on tools-db-2 and it seems to be working fine; I will add the same line to tools-db-1:

# /lib/systemd/system/mariadb.service

Environment="LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"

Change 983513 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] [toolsdb] Use jemalloc to prevent memory issues

https://gerrit.wikimedia.org/r/983513

I think jemalloc did the trick! Memory usage is no longer growing constantly:

Screenshot 2023-12-16 at 10.17.39.png (memory usage graph after enabling jemalloc)

I opened a patch to set this option permanently: https://gerrit.wikimedia.org/r/983513
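One quick way to confirm the preload actually took effect is to check the running mysqld's memory maps for the jemalloc library (a sketch):

# confirm jemalloc is mapped into the running mysqld process
grep jemalloc /proc/$(pgrep -o mysqld)/maps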

fnegri lowered the priority of this task from Unbreak Now! to High. Dec 16 2023, 9:18 AM
fnegri changed the task status from Open to In Progress. Dec 16 2023, 10:18 AM

Change 983513 merged by FNegri:

[operations/puppet@production] [toolsdb] Use jemalloc to prevent memory issues

https://gerrit.wikimedia.org/r/983513

Change 983722 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] [toolsdb] fix override syntax

https://gerrit.wikimedia.org/r/983722

Change 983722 merged by FNegri:

[operations/puppet@production] [toolsdb] fix override syntax

https://gerrit.wikimedia.org/r/983722

Change 984207 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] Revert "[toolsdb] Lower innodb_buffer_pool_size"

https://gerrit.wikimedia.org/r/984207

Change 984207 merged by FNegri:

[operations/puppet@production] Revert "[toolsdb] Lower innodb_buffer_pool_size"

https://gerrit.wikimedia.org/r/984207

Mentioned in SAL (#wikimedia-cloud) [2023-12-19T15:39:49Z] <dhinus> restarting toolsdb to apply a config change T353093