
dbstore1007 is swapping heavily, potentially soon killing mysql services due to an OOM error
Closed, Resolved · Public · BUG REPORT

Description

dbstore1007 is using 96% of its total memory: https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=4&orgId=1&var-server=dbstore1007&var-datasource=thanos&var-cluster=misc&from=1623138837938&to=1631516764499

It is frequently swapping:
https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=18&orgId=1&var-server=dbstore1007&var-datasource=thanos&var-cluster=misc&from=1623740593737&to=1631516593737&refresh=30s

which not only degrades its performance, but also carries the danger of the OOM killer activating and killing a mysql daemon.

I recommend researching a possible memory leak on those servers and/or restarting some instances to prevent the killing.
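One quick way to confirm that it is the mysqld processes being swapped out is to sum the VmSwap fields from their /proc/&lt;pid&gt;/status files. A minimal sketch (the helper and the sample data are illustrative; on the host you would feed it the real status files, e.g. `for pid in $(pgrep mysqld); do cat /proc/$pid/status; done | sum_vmswap`):

```shell
#!/bin/sh
# Sum VmSwap (in kB) across /proc/<pid>/status-style input.
# Illustrative helper, not a tool that exists on these hosts.
sum_vmswap() {
  awk '/^VmSwap:/ { total += $2 } END { print total + 0 }'
}

# Example with sample data standing in for two mysqld processes:
printf 'VmSwap:\t1024 kB\nVmSwap:\t2048 kB\n' | sum_vmswap   # prints 3072
```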

Event Timeline

This is not the first time it happens, and seems specific to analytics dbs: T270112

Change 720739 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbstore1007: Decrease buffer_pool_sizes

https://gerrit.wikimedia.org/r/720739

Change 720739 merged by Marostegui:

[operations/puppet@production] dbstore1007: Decrease buffer_pool_sizes

https://gerrit.wikimedia.org/r/720739

I have merged the above patch to decrease the mysql buffer pool sizes for all the instances. This requires mysql restarts. Please do so, or let me know when I can do it.
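The actual values live in the Gerrit change above; purely as an illustrative sketch (the file path and size are hypothetical, not taken from the patch), a per-instance buffer pool decrease is a change of this shape:

```ini
# Hypothetical fragment; the real values are in the puppet change (Gerrit 720739).
# One config per instance, e.g. for the s3 section (path is an assumption):
[mysqld]
# Decreased so that the three instances plus the OS page cache fit in RAM
# without pushing the host into swap.
innodb_buffer_pool_size = 50G
```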

odimitrijevic triaged this task as High priority.
odimitrijevic moved this task from Incoming to Operational Excellence on the Analytics board.

Mentioned in SAL (#wikimedia-analytics) [2021-09-13T18:13:00Z] <razzi> razzi@dbstore1007:~$ sudo systemctl restart mariadb@s3.service for T290841

Mentioned in SAL (#wikimedia-operations) [2021-09-13T18:13:13Z] <razzi> razzi@dbstore1007:~$ sudo systemctl restart mariadb@s3.service for T290841

Mentioned in SAL (#wikimedia-analytics) [2021-09-13T18:19:28Z] <razzi> razzi@dbstore1007:~$ sudo systemctl restart mariadb@s4.service for T290841

Mentioned in SAL (#wikimedia-analytics) [2021-09-13T18:24:57Z] <razzi> razzi@dbstore1007:~$ for socket in /run/mysqld/*; do sudo mysql --socket=$socket -e "START SLAVE"; done - reenable replication for T290841

Mentioned in SAL (#wikimedia-operations) [2021-09-13T18:25:37Z] <razzi> reenable replication on dbstore1007 for T290841

Restarting the 3 mysqld sections brought memory down to a reasonable 14% usage. It's possible that something is leaking memory, however, and this isn't the last we'll see of this situation.
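The per-section restart sequence logged above can be sketched as a small helper. This is a reconstruction, not the exact commands run: the socket naming under /run/mysqld and the DRY_RUN guard (added so the sequence can be previewed safely) are assumptions.

```shell
#!/bin/sh
# Sketch of the per-section restart used above.
# run: execute the command, or just print it when DRY_RUN=1.
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi; }

restart_section() {
  section="$1"
  sock="/run/mysqld/mysqld.${section}.sock"           # socket path is an assumption
  run sudo mysql --socket="$sock" -e "STOP SLAVE"     # pause replication cleanly
  run sudo systemctl restart "mariadb@${section}.service"
  run sudo mysql --socket="$sock" -e "START SLAVE"    # re-enable replication
}

DRY_RUN=1 restart_section s3   # preview the commands without executing them
```

With DRY_RUN=1 the helper only echoes the three commands, which makes the sequence reviewable before running it against s3, s4, and s8 in turn.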

odimitrijevic moved this task from Incoming to Ops on the Data-Engineering board.
odimitrijevic added a subscriber: razzi.

This has occurred again on dbstore1007.

image.png (797×1 px, 121 KB)

I will do some investigation to see if I can find out where the memory leak might be, but if I can't get anywhere I will have to restart the three mariadb sections again.

@BTullis mariadb 10.4.22 has fixed some memory leaks, which might or might not be related to this. If you want, I can try to install it now (or whenever you tell me it is a good moment to restart mariadb).

@Marostegui - That sounds like a great idea to me. I think that now would be a good time to try this upgrade to 10.4.22.
It looks like they're not being heavily used at the moment, according to Grafana, so I say go for it if you have the time.

Sorry, I commented after Manuel already had.


Ok, I will go for it now.

Mentioned in SAL (#wikimedia-operations) [2021-11-18T12:15:53Z] <marostegui> Upgrade dbstore1007 to 10.4.22 T290841 T295970

Upgrade done, replication started.

Great, thanks. I'll try to keep an eye on this graph for the next few months to see if it's resolved the issue fully.

Never mind, I was looking at the wrong graph. It keeps increasing; we'll see if it stabilizes at some point.

At this point in my career, I have reached a state of zen and accept memleaks as a fact of life. I suggest we just restart those hosts from time to time as part of usual maintenance (OS upgrades, security updates, mariadb minor and major upgrades, etc.).

BTullis claimed this task.

Thanks @Ladsgroup for the reflection. Sadly, at this point in my career I have yet to achieve these levels of zen, and the killing of processes due to memory leaks like this still causes me a degree of anguish.
To misquote Dylan Thomas:

Rage, rage against the dying of the bytes.

However, on this occasion, I think you're probably right and we should just take the pragmatic decision to restart it as required.
The rate of increase in memory usage is now so slow that normal maintenance restarts will occur more frequently than any likely incident of memory exhaustion and associated swapping, so I'll resolve this ticket.

image.png (922×1 px, 157 KB)

While searching for other things on MariaDB's JIRA, I saw there are one or more bugs related to performance_schema memory leaks on MariaDB (this seems to be specific to MariaDB, and not happening on MySQL). It was reported that disabling P_S didn't make the leak fully disappear, but it did make it much slower. I would advise against doing that on production mw hosts, as P_S is such a great debugging tool, but maybe it is something that could be considered for the analytics dbs? Because the analytics dbs have such different query patterns (long-running queries), it would make sense that they are more affected. Alternatively, check if there are active events/cron jobs using it that could be disabled. Just a suggestion that seemed relevant; you don't have to listen to me.

I always listen to your suggestions @jcrespo :-)
Do you happen to have a handy link to any of those MariaDB bug reports about the performance_schema memory leaks please?

I'm not aware of any active jobs that use the feature, but I'd be happy to try turning it off to find out:

  • if any users complain that their jobs no longer work
  • if it slows or stops the memory leak

Maybe we could run the dbstore servers for a few months with this feature disabled, just to test the hypothesis.

@Ottomata - can you see any issues with this? Do you know of any active jobs that make use of the performance_schema feature on the dbstore hosts?

Not that I know of! But I probably wouldn't know either! :)

> Do you happen to have a handy link to any of those MariaDB bug reports about the performance_schema memory leaks please?

I cannot find the exact one right now, but these are related (I was monitoring some of these for mediawiki production):
https://jira.mariadb.org/browse/MDEV-24417
https://jira.mariadb.org/browse/MDEV-20933
https://jira.mariadb.org/browse/MDEV-23936

Heads up to @Ladsgroup about https://jira.mariadb.org/browse/MDEV-12205 which is unrelated to analytics, but could hit production.

If performance_schema is disabled, please consider enabling the user_stats plugin, which is a poor man's P_S.
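A hedged sketch of what that combined change could look like in an instance's config (the variable names are the standard MariaDB ones; whether this is appropriate for the dbstore hosts is exactly what the proposed test would establish, and both changes require a mariadb restart to take effect):

```ini
[mysqld]
# Disable performance_schema to test the memory-leak hypothesis.
performance_schema = OFF
# Enable user/table statistics as a lightweight substitute; results are
# exposed via INFORMATION_SCHEMA tables such as USER_STATISTICS.
userstat = ON
```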

@jcrespo Thanks, but it seems that only happens with write queries that have a max execution time set (unlike our system, where only read queries have one). I make sure we don't add a max time to our write queries, and it's not needed, since mw is good at killing slow write queries (unlike slow read queries).