Description
Available memory on the scb machines is decreasing over time. Restarting ORES (and its workers) reduces the memory pressure. This suggests a memory leak of some sort. Investigate.
Details
| Project | Branch | Lines +/- | Subject |
|---|---|---|---|
| mediawiki/services/ores/deploy | master | +1 -0 | Restart celery workers once in a while |
Related Objects
- Mentioned In
- rORESDEPLOY0e9555f3131e: Restart celery workers once in a while
- Mentioned Here
- P3413 uwsgi ores memory load
Event Timeline
Reviewing the status of scb1001.
We have 16 celery and 72 uwsgi workers running. Each process takes between 2.9 and 3.5% of memory. Obviously much of this is shared.
The RSS of uwsgi is between 964 MB and 1167 MB.
The RSS of celery is between 1068 MB and 1162 MB.
I've been tracking memory usage using ps.
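For anyone who wants to reproduce the sampling, here's a rough sketch of how the `ps` output could be collected periodically into a CSV. The file name, interval, and `sample()` helper are hypothetical; this isn't the exact script I ran:

```python
#!/usr/bin/env python3
"""Periodically sample resident memory (RSS) of uwsgi and celery
processes via `ps` and append the samples to a CSV file."""
import csv
import subprocess
import time
from datetime import datetime, timezone

SAMPLE_INTERVAL = 3600  # seconds between samples (assumption: hourly)
OUTPUT_PATH = "ores_memory_samples.csv"  # hypothetical output file


def sample(pattern):
    """Return (pid, rss_kib) pairs for processes whose command matches `pattern`."""
    out = subprocess.run(
        ["ps", "-eo", "pid,rss,args"], capture_output=True, text=True, check=True
    ).stdout
    rows = []
    for line in out.splitlines()[1:]:  # skip the ps header line
        pid, rss, args = line.strip().split(None, 2)
        if pattern in args:
            rows.append((int(pid), int(rss)))
    return rows


def main():
    with open(OUTPUT_PATH, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            now = datetime.now(timezone.utc).isoformat()
            for service in ("uwsgi", "celery"):
                for pid, rss_kib in sample(service):
                    writer.writerow([now, service, pid, rss_kib])
            f.flush()
            time.sleep(SAMPLE_INTERVAL)


if __name__ == "__main__":
    main()
```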
Here are two plots showing the distribution of *resident* memory usage over time:
https://commons.wikimedia.org/wiki/File:Ores.uwsgi_memory_usage_over_time.rss.svg
https://commons.wikimedia.org/wiki/File:Ores.celery_memory_usage_over_time.rss.svg
Better yet, these two plots show how memory usage changes per process:
https://commons.wikimedia.org/wiki/File:Ores.per_process.celery_memory_usage_over_time.rss.svg
https://commons.wikimedia.org/wiki/File:Ores.per_process.uwsgi_memory_usage_over_time.rss.svg
It looks like individual processes experience a bump in memory usage, but then they don't show a substantial jump after that point.
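For reference, per-process plots like the ones above could be produced from periodic samples roughly like this. It assumes the hypothetical CSV layout from the sketch above and matplotlib; it's not the code that generated the linked SVGs:

```python
#!/usr/bin/env python3
"""Plot per-process RSS over time from a CSV of periodic samples
(hypothetical columns: timestamp, service, pid, rss_kib)."""
import csv
from collections import defaultdict
from datetime import datetime

import matplotlib.pyplot as plt

INPUT_PATH = "ores_memory_samples.csv"  # hypothetical input file
SERVICE = "celery"  # or "uwsgi"

# Group samples by PID so each worker gets its own line on the plot.
series = defaultdict(list)
with open(INPUT_PATH, newline="") as f:
    for timestamp, service, pid, rss_kib in csv.reader(f):
        if service == SERVICE:
            series[pid].append((datetime.fromisoformat(timestamp), int(rss_kib) / 1024))

for pid, points in series.items():
    points.sort()
    times, rss_mib = zip(*points)
    plt.plot(times, rss_mib, label=f"pid {pid}")

plt.xlabel("time (UTC)")
plt.ylabel("RSS (MiB)")
plt.title(f"ORES {SERVICE} per-process resident memory")
plt.legend(fontsize="small")
plt.show()
```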
Graphs updated. I'm going to call it a night, but if someone else could run the following commands on scb1001 before I get back and record the results and the hour, that would be great :)
$ ps aux | head -n1; ps aux | grep uwsgi
$ ps aux | head -n1; ps aux | grep celery
It seems we do have a memory leak on labs as well: https://grafana.wikimedia.org/dashboard/db/ores-labs
OK. Graphs are updated. It looks like uwsgi clearly has a memory ceiling, while celery's usage has been creeping slowly upwards.
If the leak is in celery, I don't think it's in our code; it may be in celery itself.
I looked at celery's CELERYD_MAX_TASKS_PER_CHILD setting. That could give us periodic restarts for each worker, which would be a good mitigation strategy while we work out what's up.
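For illustration, the setting would look something like this in a celeryconfig module (the value of 100 is just a placeholder, not necessarily what we'd deploy):

```python
# Hypothetical celeryconfig.py snippet.
# Each celery worker process is recycled after handling this many tasks,
# which bounds how long per-process memory growth can accumulate.
# The same limit can also be passed on the worker command line as
# --maxtasksperchild.
CELERYD_MAX_TASKS_PER_CHILD = 100
```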
Change 298922 had a related patch set uploaded (by Ladsgroup):
Restart celery workers once in a while