
[toolsdb] MariaDB process is killed by OOM killer (October 2023)
Closed, Resolved · Public

Description

This happened twice in the last 2 days:

Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer.
Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: Failed with result 'oom-kill'.
Oct 25 09:52:46 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
Oct 25 09:52:46 tools-db-1 systemd[1]: mariadb.service: Failed with result 'signal'.
Oct 25 09:52:46 tools-db-1 systemd[1]: mariadb.service: Consumed 2d 17h 12min 12.958s CPU time.
Oct 25 09:52:51 tools-db-1 systemd[1]: mariadb.service: Scheduled restart job, restart counter is at 1.
Oct 25 09:52:51 tools-db-1 systemd[1]: Stopped mariadb database server.
Oct 25 09:52:51 tools-db-1 systemd[1]: mariadb.service: Consumed 2d 17h 12min 12.958s CPU time.
Oct 25 09:52:51 tools-db-1 systemd[1]: Starting mariadb database server...

Note how the error on Oct 25 is slightly different and doesn't mention OOM, but dmesg shows the reason was OOM in both instances:

[Tue Oct 24 09:19:23 2023] Out of memory: Killed process 2437 (mysqld) total-vm:64835688kB, anon-rss:64103256kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:126460kB oom_score_adj:-600

[Wed Oct 25 09:48:25 2023] Out of memory: Killed process 1853830 (mysqld) total-vm:67422936kB, anon-rss:64695060kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:128052kB oom_score_adj:-600

Another difference is that on Oct 25 systemd restarted the unit automatically, while on Oct 24 it did not and I had to run systemctl start mariadb manually. In both cases I also had to manually enable read-write mode, because MariaDB is configured to start in read-only mode.
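
For reference, a minimal sketch of the manual recovery steps (assuming the unit name and read-only setup described above, and that the root mysql client authenticates via the local socket):

# start the unit if systemd did not restart it automatically
systemctl start mariadb
# re-enable writes (MariaDB starts in read-only mode)
mysql -e "SET GLOBAL read_only = 0;"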

A similar issue happened a few months ago: T344298: mysqld killed by oomkiller on tools-db-1.tools.eqiad1.wikimedia.cloud

Event Timeline

fnegri triaged this task as High priority.
fnegri changed the task status from Open to In Progress. Oct 30 2023, 1:01 PM

This happened yesterday, and again today.

Oct 29 17:23:09
Oct 30 17:32:29

And one more time; it seems to happen at roughly the same time every day:

Oct 31 17:27:00 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer.

It seems like there's some scheduled job running a query that causes MariaDB to crash?

That's also my assumption, but I'm not sure how to find it!

Again! But three hours later this time.

Nov 1 20:29:33

We should tell systemd to tell the OOM killer not to kill our mariadb process. I think this can be done by using the mariadb::service::override hiera setting to provision an /etc/systemd/system/mariadb.service.d/override.conf file containing something like:

/etc/systemd/system/mariadb.service.d/override.conf
[Service]
OOMScoreAdjust=-500

See also: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/mysql.service#127
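
If that override were provisioned, the effective value could be checked with something like this (a sketch):

systemctl daemon-reload
systemctl show mariadb.service --property=OOMScoreAdjust
# or inspect the running process directly
cat /proc/$(pidof mysqld)/oom_score_adj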

Another occurrence today:

Nov 03 05:34:07 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer.

@bd808 thanks for that suggestion, I will try setting that value.

Turns out that value is already set in /lib/systemd/system/mariadb.service and I verified it is applied:

root@tools-db-1:~# cat /proc/119707/oom_score_adj
-600

The MariaDB documentation on memory allocation (linked in a previous comment by @dcaro) has some useful things to check and a promising statement:

If the MariaDB server is crashing because of 'out-of-memory' then it is probably wrongly configured.

I asked for help in #wikimedia-data-persistence and @Marostegui had some tips:

[15:55:46] <dhinus>	 o/ it's been a while since I came here with some pesky mariadb issues... but I have an "interesting" one :P
[15:55:50] <dhinus>	 T349695
[15:55:51] <stashbot>	 T349695: [toolsdb] MariaDB process is killed by OOM killer (October 2023) - https://phabricator.wikimedia.org/T349695
[15:56:27] <dhinus>	 do you have any hint at what config value I should tweak to prevent mariadb from using too much memory?
[15:56:56] <dhinus>	 there are a few mentioned here but I'm not sure where to start https://mariadb.com/kb/en/mariadb-memory-allocation/
[15:58:16] <marostegui>	 dhinus: Probably the most obvious ones is to reduce the innodb buffer pool size, however, I doubt't that is the root cause. Reducing it might give you just a bit more time until it crashes again
[15:59:05] <dhinus>	 I checked that it's already lower than the recommended "70% of RAM", but I can try reducing it more
[15:59:05] <marostegui>	 If it is a pattern, I would try to see if there's an specific query causing issues
[15:59:32] <marostegui>	 Well the recommendation is that, but you can reduce it to whatever number you want. That's going to affect how "warm" the database is
[15:59:51] <marostegui>	 So the more memory you have there, the more tables will be stored in memory, so the faster the queries using those tables would be
[15:59:53] <dhinus>	 makes sense. I also suspect some specific query, but it's hard to find it...
[16:00:18] <dhinus>	 for a couple days it happened at the same time, but then it's now happening at totally different times
[16:00:20] <marostegui>	 But again, this might be a matter of time until it hits the threshold again. In other words, it is probably something else
[16:00:41] <marostegui>	 Maybe you can enable the slow query log and see if there's something massive there
[16:01:23] <dhinus>	 yeah I was thinking that as well. or also a query timeout
[16:01:24] <marostegui>	 Another option would be to monitor show full processlist periodically and check what was there before each crash
[16:01:58] <marostegui>	 You can use pt-kill to kill read queries that take more than X time
[16:01:59] <marostegui>	 Like we have on the wikireplicas
[16:02:22] <marostegui>	 I'd reduce the buffer pool size, but you definitely need to keep investigating because the crashes are likely to keep happening
[16:08:00] <dhinus>	 thanks, I will try a combination of these strategies!
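
A rough sketch of the periodic processlist capture suggested above, assuming a root cron on the host and a hypothetical output path:

# /etc/cron.d/mariadb-processlist (hypothetical)
* * * * * root { date; mysql -e 'SHOW FULL PROCESSLIST'; } >> /var/log/mariadb-processlist.log 2>&1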

I decreased the buffer pool size from 31G to 20G and enabled slow query logging for queries longer than 30 seconds. Note that I made these changes from a SQL prompt, so they won't persist if the MariaDB process is restarted, but for some initial testing that should be enough.

MariaDB [(none)]> SET GLOBAL long_query_time=30;
Query OK, 0 rows affected (0.000 sec)

MariaDB [(none)]> SET GLOBAL slow_query_log=1;
Query OK, 0 rows affected (1.900 sec)

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-------------+
| Variable_name           | Value       |
+-------------------------+-------------+
| innodb_buffer_pool_size | 33285996544 |
+-------------------------+-------------+
1 row in set (0.001 sec)

MariaDB [(none)]> SET GLOBAL innodb_buffer_pool_size=21474836480;
Query OK, 0 rows affected (0.001 sec)

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-------------+
| Variable_name           | Value       |
+-------------------------+-------------+
| innodb_buffer_pool_size | 21474836480 |
+-------------------------+-------------+
1 row in set (0.001 sec)

Slow queries are being logged to /srv/labsdb/data/tools-db-1-slow.log and I verified the logging works with a test query SELECT SLEEP(35).

root@tools-db-1:~# cat /srv/labsdb/data/tools-db-1-slow.log
/opt/wmf-mariadb104/bin/mysqld, Version: 10.4.29-MariaDB-log (MariaDB Server). started with:
Tcp port: 3306  Unix socket: /var/run/mysqld/mysqld.sock
Time                Id Command  Argument
# Time: 231103 16:23:28
# User@Host: root[root] @ localhost []
# Thread_id: 402799  Schema:   QC_hit: No
# Query_time: 35.000193  Lock_time: 0.000000  Rows_sent: 1  Rows_examined: 0
# Rows_affected: 0  Bytes_sent: 64
SET timestamp=1699028608;
SELECT SLEEP(35);

Two more OOM crashes in the last two days :/

Nov 08 11:08:55 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
Nov 09 11:12:21 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL

I have analyzed the slow query log file with mariadb-dumpslow (I downloaded the log file and analyzed it locally, because mariadb-dumpslow is not installed on the tools-db hosts).
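
A sketch of the kind of invocation that produces the summaries below (sort flags per the mariadb-dumpslow documentation; the exact options used are an assumption):

# top 10 queries by number of occurrences
mariadb-dumpslow -s c -t 10 tools-db-1-slow.log
# top 10 queries by total query time
mariadb-dumpslow -s t -t 10 tools-db-1-slow.log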

These two are the slow queries with the highest count:

Count: 679  Time=289.80s (196774s)  Lock=0.00s (0s)  Rows_sent=0.0 (0), Rows_examined=67452086.0 (45799966394), Rows_affected=0.0 (0), s53952[s53952]@[172.16.1.36]
  SELECT `keys`.topic_id AS keys_topic_id, `keys`.`key` AS keys_key
  FROM `keys`
  WHERE `keys`.`key` = 'S'
  LIMIT N

Count: 408  Time=1889.12s (770760s)  Lock=0.00s (0s)  Rows_sent=655.9 (267598), Rows_examined=123972948.3 (50580962891), Rows_affected=0.0 (0), s51434[s51434]@[172.16.6.134]
  SELECT `id`,`ext_name` FROM entry WHERE catalog=N AND q IS NULL
  AND NOT EXISTS (SELECT * FROM `log` WHERE log.entry_id=entry.id AND log.action='S')
  AND ((`q` IS NULL) OR (`q`>N AND `user`=N))
  ORDER BY `id` LIMIT N OFFSET N

Sorting by total time used, the same query comes out on top:

Count: 408  Time=1889.12s (770760s)  Lock=0.00s (0s)  Rows_sent=655.9 (267598), Rows_examined=123972948.3 (50580962891), Rows_affected=0.0 (0), s51434[s51434]@[172.16.6.134]
  SELECT `id`,`ext_name` FROM entry WHERE catalog=N AND q IS NULL
  AND NOT EXISTS (SELECT * FROM `log` WHERE log.entry_id=entry.id AND log.action='S')
  AND ((`q` IS NULL) OR (`q`>N AND `user`=N))
  ORDER BY `id` LIMIT N OFFSET N

This is a query from the mix-n-match tool, on database s51434__mixnmatch_p.

I'm not sure those slow queries are causing the OOM errors, but it's one more reason to move s51434__mixnmatch_p to a Trove instance. I have created T350862: [toolsdb] Migrate mixnmatch db to Trove.

FWIW I just saw this ticket and added an index to the log table, which should speed things up. I can also rewrite the query if that does not suffice; please let me know. Let's discuss Trove in T350862.
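
For context, an index covering the NOT EXISTS subquery above would look something like this (column names taken from the query; the actual index Magnus added may differ):

MariaDB [s51434__mixnmatch_p]> ALTER TABLE `log` ADD INDEX `log_entry_action` (`entry_id`, `action`);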

Thanks @Magnus, much appreciated. Fingers crossed this index will also help with the OOM errors, though I'm in no way sure the two things were related... And of course we would like ToolsDB to be stable enough that some slow queries don't crash it completely!

I will comment in T350862 about strategies for the Trove migration.

Another OOM crash last night, this one woke up @taavi who was on call:

Nov 10 03:53:05 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL

Possibly another red herring, but looking at the logs, it seems that after every OOM crash some tables are "marked as crashed" and are repaired automatically by MariaDB. Looking at the tables that are "marked as crashed" most often, we might find out what's happening on the database just before the crash:

fnegri@tools-db-1:~$ sudo journalctl -u mariadb |grep "marked as crashed" |awk '{print $12}' |sort |uniq -c |sort -r
     10 './s52861__secWatch/rev_log'
     10 './heartbeat/heartbeat'
      9 './s51083__ukbot/fulltexts'
      9 './s51083__ukbot/contribs'
      9 './s51083__ukbot/contests'
      5 './s51211__duplicity_p/no_wd'
      5 './s51138__heritage_p/commonscat'
      4 './s51138__heritage_p/prox_search'
      3 './s52861__fnbot/review_opt_in'
      2 './s54159__discordwiki/searchindex'
      2 './s52861__thanks_de/thx_list'
      2 './s51391__mp/mentor'
      2 './s51138__heritage_p/wlpa_es@002dct_@0028ca@0029'
      2 './s51138__heritage_p/image'
      1 './s53507__zh_delwiki/searchindex'
      1 './s51999__merge_candidates/users'
      1 './s51999__merge_candidates/scores'
      1 './s51138__heritage_p/wlpa_il_@0028en@0029'
      1 './s51138__heritage_p/wlpa_fi_@0028fi@0029'
      1 './s51138__heritage_p/wlpa_at_@0028de@0029'

Example log line:

Oct 24 09:43:48 tools-db-1 mysqld[1853830]: 2023-10-24  9:43:48 3388 [ERROR] mysqld: Table './s51083__ukbot/fulltexts' is marked as crashed and should be repaired

The Grafana chart for node_memory_MemAvailable_bytes makes this look more like a memory leak, because the available memory decreases steadily after each restart.

7-day view:

Screenshot 2023-11-10 at 13.35.33.png (1×2 px, 361 KB)

30-day view:

Screenshot 2023-11-10 at 13.37.30.png (1×2 px, 417 KB)

Looking at the graphs: when I lowered innodb_buffer_pool_size from 31G to 20G, the memory leak became slower. Then, after MariaDB went OOM and was restarted, the value was reset to 31G, because that's what we have in the config file.

I have just lowered it again, this time to 10G, so hopefully it will last for a few days before going OOM again, but it doesn't look like a long-term solution:

MariaDB [(none)]> SET GLOBAL innodb_buffer_pool_size=10737418240;
Query OK, 0 rows affected (0.006 sec)

I see a couple of threads online suggesting that using jemalloc can be an answer to MariaDB memory leaks, e.g. https://stackoverflow.com/a/60488432/1765344. I know nothing about jemalloc so I'm not sure if this is something we should try.
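
If we did want to try it, the usual approach is to preload the allocator via a systemd drop-in; a sketch, assuming the Debian libjemalloc2 package path (hypothetical drop-in file):

/etc/systemd/system/mariadb.service.d/jemalloc.conf
[Service]
Environment="LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"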

The crashed tables are a consequence, not a cause. That's normal after a crash. As I mentioned a few days ago, lowering the buffer pool would just give you a few more days, but it wouldn't stop the crashes.

Also, I suggested monitoring SHOW FULL PROCESSLIST every minute to see if you can catch what was running before the crash.

The crashed tables are a consequence, not a cause.

I was just hoping they might point to some database that was very active during the crash, but I agree they're not likely to be useful, also because...

I suggested monitoring SHOW FULL PROCESSLIST every minute to see if you can catch what was running before the crash.

I don't think what happens right before the crash is interesting, because the problem starts hours earlier: basically as soon as the server is restarted, the memory starts going down slowly until it gets very close to zero and the OOM killer steps in:

Screenshot 2023-11-10 at 16.32.39.png (1×2 px, 349 KB)

The points in the graph where the free memory jumps up correspond to three restarts, and the last, smaller jump was caused by me changing innodb_buffer_pool_size from 31G to 10G.

Until we find a proper fix, we can monitor the memory graph and pre-emptively restart MariaDB when the available memory gets under 10G. This should at least avoid being paged in the middle of the night.

Link to the memory graph: https://grafana.wmcloud.org/goto/Y0yPY64Sk?orgId=1
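
A minimal sketch of a check that could drive such a pre-emptive restart (the threshold and mechanism are illustrative only, not something we have agreed on):

#!/bin/bash
# restart MariaDB if available memory drops below ~10G (illustrative threshold)
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
if [ "$avail_kb" -lt $((10 * 1024 * 1024)) ]; then
    systemctl restart mariadb
fi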

I don't think what happens right before the crash is interesting, because the problem starts hours earlier: basically as soon as the server is restarted, the memory starts going down slowly until it gets very close to zero and the OOM killer steps in.

That is pretty normal behaviour with MariaDB: it uses more and more memory while it fills its caches. Even if you've limited innodb_buffer_pool_size, there's still additional memory usage from other buffers, connection tracking, etc.
You could also be dealing with temporary tables (due to long transactions).

As we've discussed on IRC, maybe you could migrate to MariaDB 10.6 (which we use in production) and see how it works.

Reminder, to be able to upgrade to 10.6, you'd need to (see the command sketch after this list):

  • Stop mariadb
  • Remove 10.4 package
  • Install 10.6 package
  • Start mariadb
  • Run mysql_upgrade
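
A shell sketch of those steps (wmf-mariadb104 matches the /opt path seen in the slow log; wmf-mariadb106 and the apt-based install are assumptions, so the production docs should be the reference):

systemctl stop mariadb
apt-get remove wmf-mariadb104      # assumed package name
apt-get install wmf-mariadb106     # assumed package name
systemctl start mariadb
mysql_upgrade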

Thanks @Marostegui, I will try that on a replica next week.

I will probably combine this with the already pending T344717: [toolsdb] test creating a new replica host, so I will set up a new 10.4 replica on the new instance tools-db-3, then upgrade it to 10.6 following your procedure.

Change 973847 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] [toolsdb] Lower innodb_buffer_pool_size

https://gerrit.wikimedia.org/r/973847

Change 973847 merged by FNegri:

[operations/puppet@production] [toolsdb] Lower innodb_buffer_pool_size

https://gerrit.wikimedia.org/r/973847

After I lowered innodb_buffer_pool_size from 31G to 10G on Nov 10th, the free memory on the primary stabilized around 12GB, and we haven't seen more OOM crashes. I have hardcoded that value in my.cnf in the patch that I merged today.
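
The resulting my.cnf stanza is essentially the following (a sketch; the actual file is managed by Puppet):

[mysqld]
innodb_buffer_pool_size = 10G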

The amount of memory used by MariaDB has stabilized around 50G (as reported by top), which roughly matches 64G (total memory) minus 12G (free memory, shown in the graph below). I would like to understand better what exactly is using all that memory.

Screenshot 2023-11-14 at 18.32.58.png (1×2 px, 412 KB)

fnegri moved this task from In progress to Done on the cloud-services-team (FY2023/2024-Q1-Q2) board.

Two weeks later, the free memory is quite stable around 12GB and has never gone below 9GB, as shown in the attached graph. There doesn't seem to be any memory leak, but I haven't found an explanation for why MariaDB is using around 50GB of memory in total, given that innodb_buffer_pool_size is now limited to 10G and I could not identify other caches or buffers taking a significant amount of memory.

Upgrading to MariaDB 10.6 would give us some interesting new monitoring options, such as memory_summary_global_by_event_name. I created T352206: [toolsdb] Upgrade to MariaDB 10.6.
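
For future reference, once on 10.6 (and with the performance_schema memory instrumentation enabled) the per-component usage could be inspected with something like:

SELECT event_name, current_number_of_bytes_used
FROM performance_schema.memory_summary_global_by_event_name
ORDER BY current_number_of_bytes_used DESC
LIMIT 10;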

In the meantime, I think we can keep innodb_buffer_pool_size at 10G and resolve this task. I will open a new one if we get more OOM errors.

Screenshot 2023-11-28 at 18.23.57.png (1×3 px, 532 KB)