Available memory on the scb machines is decreasing over time. Restarting ORES (and its workers) reduces the memory pressure. This suggests a memory leak of some sort. Investigate.
Reviewing the status of scb1001.
We have 16 celery and 72 uwsgi workers running. Each process takes between 2.9% and 3.5% of memory. Much of this must be shared: 88 processes at roughly 3% each would add up to well over 100% of memory.
The RSS of uwsgi is between 964 MB and 1167 MB.
The RSS of celery is between 1068 MB and 1162 MB.
I've been tracking memory usage using ps.
Here are two plots showing the distribution of *resident* memory usage over time.
Better, these two plots show how memory usage changes per-process:
It looks like individual processes experience a bump in memory usage, but then they don't show a substantial jump after that point.
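For anyone who wants to reproduce the ps-based tracking, here is a minimal sketch of how the RSS ranges could be summarized per process type. It assumes the standard `ps aux` column layout (RSS in the sixth column, reported in KiB, COMMAND last). The function name `summarize_rss` is hypothetical, and this is not necessarily how the numbers above were produced.

```python
from collections import defaultdict


def summarize_rss(ps_output, names=("uwsgi", "celery")):
    """Summarize RSS per process type from the text output of `ps aux`.

    Returns {name: (process_count, min_rss_mb, max_rss_mb)}.
    Assumes the usual `ps aux` columns:
    USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    """
    rss_by_name = defaultdict(list)
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        # COMMAND is the 11th column and may itself contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        rss_kib = int(fields[5])  # ps reports RSS in KiB
        command = fields[10]
        for name in names:
            if name in command:
                rss_by_name[name].append(rss_kib / 1024.0)
                break
    return {
        name: (len(values), min(values), max(values))
        for name, values in rss_by_name.items()
    }


# Example usage (requires a system with `ps`):
#   import subprocess
#   print(summarize_rss(subprocess.check_output(["ps", "aux"], text=True)))
```

Note that RSS counts shared pages once per process, so summing these values across workers overstates the real footprint.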
Graphs updated. I'm going to call it a night, but if someone else could run the following commands on scb1001 before I get back and record the results and the hour, that would be great :)
$ ps aux | head -n1; ps aux | grep uwsgi;
$ ps aux | head -n1; ps aux | grep celery;
OK. Graphs are updated. It looks like uwsgi clearly has a ceiling and celery has been migrating slowly upwards in memory usage.
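One way to put a number on "ceiling" versus "migrating slowly upwards" is a least-squares slope over the recorded samples for each process. This is only a sketch: the helper name `rss_slope` and the sample values below are made up for illustration and are not measurements from scb1001.

```python
def rss_slope(samples):
    """Least-squares slope of RSS over time.

    `samples` is a list of (hour, rss_mb) pairs for one process.
    Returns the estimated growth in MB per hour.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one time point")
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    return cov / var_t


# Hypothetical samples: a process that has hit a ceiling vs. one drifting up.
flat = [(0, 1000.0), (1, 1000.0), (2, 1000.0)]
drifting = [(0, 1000.0), (1, 1010.0), (2, 1020.0)]
print(rss_slope(flat), rss_slope(drifting))
```

A slope near zero is consistent with a process that has plateaued, while a persistently positive slope across many hours is what a slow leak would look like.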
If the leak is in celery, I don't think it's our code, but maybe celery's.
I looked into celery's CELERYD_MAX_TASKS_PER_CHILD setting. It replaces each worker process after it has executed a set number of tasks, which would give us periodic restarts for each worker. This would be a good mitigation while we work out what's up.
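As a rough sketch, the setting is a single line in the celery configuration. The value of 100 is an arbitrary placeholder, not a tuned number, and where this line lives depends on how ORES loads its celery config, which I'm assuming here.

```python
# Hypothetical celery configuration fragment.
# After a worker process has executed this many tasks it is replaced
# with a fresh one, which releases whatever memory it had accumulated.
CELERYD_MAX_TASKS_PER_CHILD = 100  # placeholder value, needs tuning
```

Newer celery releases (4.0 and later) spell this setting `worker_max_tasks_per_child`. A lower value means more frequent restarts and less accumulated memory, at the cost of more process startup overhead.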