
Check into the configuration, cause, and usefulness of memory alerts for multi-instance replicas
Closed, Resolved, Public

Description

At Mon May 3 15:58:28 UTC 2021, clouddb1013 pinged the ops channel and the WMCS email list about a memory alert.

PROBLEM - MariaDB memory on clouddb1013 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (6246) = 66.4% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting

The databases are honestly under only light usage, without a connection count much above normal. What should we do?

Event Timeline

Bstorm triaged this task as Medium priority. May 3 2021, 4:10 PM
Bstorm created this task.

I've ack'd the alert. We could try tuning the buffer pool settings, or we could just change the alert. I'm honestly surprised that the web side threw the alert first.

I think this alert is "normal" as we are starting to get more and more requests.
The buffer pool size is the amount of memory reserved for InnoDB's data and index cache, but mysql obviously uses more than that for other buffers, connection tracking, etc.
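As a rough illustration of why resident memory ends up well above the configured pool, a back-of-the-envelope footprint for one instance could look like the sketch below. The variable names mirror real MariaDB settings, but the formula and all of the numbers are simplifying assumptions, not our actual configuration.

```
# Simplified, assumed model of one mysqld instance's memory footprint.
# The names mirror real MariaDB settings, but the formula is only an
# approximation for reasoning about the alert, not exact server accounting.
GiB = 1024 ** 3
MiB = 1024 ** 2

innodb_buffer_pool_size = 64 * GiB   # hypothetical pool size for one instance
active_connections      = 200        # assumed concurrent clients
per_connection_buffers  = 16 * MiB   # sort/join/read/tmp buffers per connection (assumed)
misc_overhead           = 2 * GiB    # dictionary, thread stacks, etc. (assumed)

estimated_total = (innodb_buffer_pool_size
                   + active_connections * per_connection_buffers
                   + misc_overhead)
print(f"estimated mysqld footprint: {estimated_total / GiB:.1f} GiB")
```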

Change 693492 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] wikireplicas: raise limits on memory alerts

https://gerrit.wikimedia.org/r/693492

Change 693492 merged by Bstorm:

[operations/puppet@production] wikireplicas: raise limits on memory alerts

https://gerrit.wikimedia.org/r/693492

BTW, not sure if the alert is fully understood, based on the summary:

Largest process: mysqld (6246) = 66.4%

It points at the largest process, but I am guessing the host is running at least 2 mysqld processes, which together take over 95% of the total memory. Using all of the memory (or almost all of it) is not by itself an issue, but once usage goes over a certain threshold (less than 100%; the exact point depends on the type of workload and kernel configuration), MySQL will start to swap, and performance can range from slightly degraded to completely unusable.
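To illustrate the point about the summary only naming the largest process, a quick way to see what all mysqld processes use combined would be something like the following; it assumes the psutil package and is only a sketch, not the actual Icinga check.

```
# Hedged sketch: sum memory across every mysqld process, since the alert
# summary only names the single largest one while the 95% threshold applies
# to the whole host. Assumes the psutil package; not the real Icinga check.
import psutil

host_pct = psutil.virtual_memory().percent
mysqld_pct = sum(p.info['memory_percent'] or 0.0
                 for p in psutil.process_iter(['name', 'memory_percent'])
                 if p.info['name'] == 'mysqld')

print(f"host memory used:            {host_pct:.1f}%")
print(f"all mysqld processes in sum: {mysqld_pct:.1f}%")
```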

I tuned this for production usage, but both the innodb_buffer_pool_size and the alerting should be tuned to detect slowdowns before things get too bad for cloud needs. The idea is that if a host is swapping, reducing the buffer pool and giving more memory to queries (per connected client) is desirable, which is probably what needs tuning in the cloud environment. In the past, in most cases when there was a slowdown on the old labsdbs, it was because the host was swapping due to large ongoing queries.
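To check whether it really is mysqld being pushed into swap (and therefore whether shrinking the buffer pool is the right lever), one could read the VmSwap field from /proc for each mysqld process. Again, this is only an illustrative sketch, not something we run today.

```
# Hedged sketch: report how much of each mysqld process currently sits in swap,
# using the VmSwap field from /proc/<pid>/status. Purely illustrative; this is
# not a production check.
import glob

for status_path in glob.glob('/proc/[0-9]*/status'):
    try:
        with open(status_path) as f:
            fields = dict(line.split(':', 1) for line in f if ':' in line)
    except OSError:
        continue  # process exited while we were scanning
    if fields.get('Name', '').strip() != 'mysqld':
        continue
    vmswap = fields.get('VmSwap', '0 kB').strip()
    print(f"{status_path}: VmSwap = {vmswap}")
```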

Of course, the other aspect is tuning the monitoring: the alert can be made less strict, or even removed if it is not useful for the cloud environment. For production, I believe it has been useful in the past to detect memory leaks and prevent a worse outage. But like most things, the profiles are configurable, and that should be taken advantage of :-). Sadly, there is no perfect way to tune monitoring other than trial and error.

For example, in my previous comment (T281732#7115507) the db seems to actually be swapping (which MySQL does not handle well, performance-wise), so what I would do is reduce the buffer pool size a bit so that the host does not run so memory-bound. Clients may get less cache area but will likely run more smoothly [test needed]. If it ends up consuming all of the memory, with WMF's default configuration the usual effect is the largest mysql process getting killed by the OOM killer.

But cloud has the final decision on tuning vs monitoring! :-) For example, for cloud it might make more sense to monitor query latency instead, if that is a better indicator of "are there wikireplica issues?". Or increase the critical percentage to 99%, so it detects OOM risk rather than swapping. Or monitor swapping frequency rather than memory usage. So many possibilities :-D!
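For the "monitor swapping frequency rather than memory usage" idea, a sketch of what such a check could look like is below. It samples the kernel's pswpin/pswpout counters from /proc/vmstat; the interval and threshold are made-up values, and this is not an existing production check.

```
# Hedged sketch of a swap-activity check: sample the kernel's swap-in/swap-out
# counters twice and alert on the rate instead of the memory percentage.
# Interval and threshold are made-up numbers, not production values.
import time

def swap_counters():
    counters = {}
    with open('/proc/vmstat') as f:
        for line in f:
            key, value = line.split()
            counters[key] = int(value)
    return counters['pswpin'], counters['pswpout']

INTERVAL = 60      # seconds between samples (assumed)
THRESHOLD = 1000   # pages swapped per interval considered "actively swapping" (assumed)

in1, out1 = swap_counters()
time.sleep(INTERVAL)
in2, out2 = swap_counters()

activity = (in2 - in1) + (out2 - out1)
status = "CRITICAL" if activity > THRESHOLD else "OK"
print(f"{status}: {activity} pages swapped in/out over the last {INTERVAL}s")
```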

So, I did read your comment, but I got very busy. Overall, it seems like this could use a good hard look to make the whole setup more useful for figuring out how to best tune the systems. The config as it stands was mostly guesswork until we saw real traffic. We have real traffic now, so it's time to dig in and decide what we think and what to monitor.

> could use a good hard look

I think this is true because going from a dedicated db model to a multi-instance one is not as easy as just "memory_available / number of instances". There will be some overhead, plus some changes needed due to per-connection memory, which will be large on cloud. I think the monitoring will be useful once that is done.
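To make the "not just memory_available / number of instances" point concrete, a back-of-the-envelope split for a multi-instance host could look like the following. Every number here is an assumption for illustration, not the real clouddb sizing.

```
# Hedged back-of-the-envelope sizing for a multi-instance replica host.
# Every number below is an assumption for illustration, not the real
# clouddb configuration.
GiB = 1024 ** 3
MiB = 1024 ** 2

host_ram              = 512 * GiB   # assumed total RAM on the host
os_and_misc_reserve   = 16 * GiB    # kernel, page cache, monitoring agents, ... (assumed)
n_instances           = 4           # mysqld instances on this host (assumed)
per_instance_overhead = 4 * GiB     # dictionary, temp tables, thread stacks per instance (assumed)
max_connections       = 300         # per instance (assumed)
per_connection_mem    = 16 * MiB    # sort/join/read buffers per connection (assumed)

naive_pool = host_ram / n_instances
usable = (host_ram - os_and_misc_reserve) / n_instances
pool = usable - per_instance_overhead - max_connections * per_connection_mem

print(f"naive buffer pool per instance:    {naive_pool / GiB:.0f} GiB")
print(f"buffer pool leaving real headroom: {pool / GiB:.0f} GiB")
```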

Change 719758 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] wikireplicas: reduce the innodb_buffer_pool_size for s1 analytics

https://gerrit.wikimedia.org/r/719758

Change 719758 merged by Bstorm:

[operations/puppet@production] wikireplicas: reduce the innodb_buffer_pool_size for s1

https://gerrit.wikimedia.org/r/719758

Change 720187 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1018: Depool clouddb1017

https://gerrit.wikimedia.org/r/720187

Change 720187 merged by Marostegui:

[operations/puppet@production] dbproxy1018: Depool clouddb1017

https://gerrit.wikimedia.org/r/720187

Change 720189 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1019: Depool clouddb1013

https://gerrit.wikimedia.org/r/720189

Change 720189 merged by Marostegui:

[operations/puppet@production] dbproxy1019: Depool clouddb1013

https://gerrit.wikimedia.org/r/720189

s1 restarted on clouddb1017 and clouddb1013