Explore growing memory usage of web workers on scb machines
Closed, Resolved (Public)

Description

Available memory on the scb machines is decreasing over time. Restarting ORES (and its workers) reduces the memory pressure. This suggests a memory leak of some sort. Investigate.

Event Timeline

Halfak triaged this task as High priority.

Reviewing the status of scb1001.

We have 16 celery and 72 uwsgi workers running. Each process takes between 2.9% and 3.5% of memory. Obviously, much of this is shared between processes.

The RSS of uwsgi is between 964 MB and 1167 MB.

The RSS of celery is between 1068 MB and 1162 MB.

Better, these two plots show how memory usage changes per-process:

https://commons.wikimedia.org/wiki/File:Ores.per_process.celery_memory_usage_over_time.rss.svg

https://commons.wikimedia.org/wiki/File:Ores.per_process.uwsgi_memory_usage_over_time.rss.svg

It looks like individual processes experience an initial bump in memory usage, but they don't show any substantial growth after that point.

I'll update these graphs in the next hour.

Graphs updated. I'm going to call it a night, but if someone else could run the following commands on scb1001 before I get back and record the results and the hour, that would be great :)

$ ps aux | head -n1; ps aux | grep uwsgi;
$ ps aux | head -n1; ps aux | grep celery;
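For anyone who would rather record the numbers with a script, here is a minimal Python sketch of the same idea: run `ps aux` once, keep the rows whose command matches, and report the RSS range together with the hour. The function names (`parse_ps_aux`, `rss_range_mb`, `snapshot`) and the sample rows are made up for illustration; they are not part of ORES or of anything deployed on scb1001. It assumes the usual `ps aux` column order, with PID in column 2, RSS (in KiB) in column 6, and COMMAND last.

```python
import subprocess
from datetime import datetime, timezone


def parse_ps_aux(ps_output, needle):
    """Return (pid, rss_kib, command) for each `ps aux` row whose command contains needle."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # COMMAND is the 11th column and may itself contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or needle not in fields[10]:
            continue
        rows.append((int(fields[1]), int(fields[5]), fields[10]))
    return rows


def rss_range_mb(rows):
    """Return (min, max) RSS in MB (KiB / 1024), or None if no rows matched."""
    if not rows:
        return None
    rss = [row[1] / 1024 for row in rows]
    return min(rss), max(rss)


def snapshot(needle):
    """Run `ps aux` once and return (UTC timestamp, process count, RSS range)."""
    out = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    rows = parse_ps_aux(out, needle)
    stamp = datetime.now(timezone.utc).isoformat(timespec="minutes")
    return stamp, len(rows), rss_range_mb(rows)


# Made-up sample rows, for illustration only (not real scb1001 output).
SAMPLE = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
www-data 101 1.0 3.0 2000000 1048576 ? S 10:00 0:01 /usr/bin/uwsgi --die-on-term
www-data 102 1.0 3.2 2000000 1153434 ? S 10:00 0:01 celery worker
www-data 103 1.0 3.1 2000000 1100000 ? S 10:00 0:01 /usr/bin/uwsgi --die-on-term
"""
```

Calling `snapshot("uwsgi")` and `snapshot("celery")` once an hour and appending the results to a file would give the same data as the two commands above, already reduced to a min/max per hour.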

OK. Graphs are updated. It looks like uwsgi clearly has a ceiling, while celery's memory usage has been creeping slowly upward.
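One rough way to put a number on "ceiling" versus "creeping upward" is to fit a straight line to each process's RSS samples and look at the slope. The sketch below is only an illustration of that idea, not how the graphs were produced; `slope_per_hour` and the sample values are hypothetical.

```python
def slope_per_hour(samples):
    """Least-squares slope of RSS (MB) against time (hours).

    `samples` is a list of (hours_since_start, rss_mb) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den


# Made-up readings, for illustration only.
flat = [(0, 1000.0), (1, 1000.0), (2, 1000.0)]      # a process at its ceiling
drifting = [(0, 1000.0), (1, 1010.0), (2, 1020.0)]  # a process gaining 10 MB per hour
```

A slope near zero after the initial bump is consistent with a ceiling; a persistently positive slope is what a slow leak would look like. A few noisy samples will not distinguish the two reliably, so this is best read alongside the plots.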

If the leak is in the celery workers, I don't think it's in our code; it may be in celery itself.

I looked into celery's CELERYD_MAX_TASKS_PER_CHILD setting. It would give us periodic restarts for each worker, which would be a good mitigation while we work out what's up.
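As a sketch of what that looks like in a Celery 3.x-style Python config module: the setting is the maximum number of tasks a pool worker process executes before it is replaced with a fresh one, and by default there is no limit. The value below is a placeholder chosen for illustration, not the value used in the patch, and where ORES actually reads its celery settings from is not shown here.

```python
# Celery 3.x-style configuration fragment (old uppercase setting names).

# Replace each pool worker process after it has executed this many tasks,
# so that any memory it has leaked is returned to the OS.
# Hypothetical value: tune it against how fast memory actually grows.
CELERYD_MAX_TASKS_PER_CHILD = 100
```

A lower value means more frequent restarts and tighter memory, at the cost of reloading the worker's state more often. In Celery 4 and later the same setting is spelled `worker_max_tasks_per_child`.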

I just deployed this to wmflabs.

While it's not my favorite solution, I think this is good for now.

Change 298922 had a related patch set uploaded (by Ladsgroup):
Restart celery workers once in a while

https://gerrit.wikimedia.org/r/298922

Change 298922 merged by Ladsgroup:
Restart celery workers once in a while

https://gerrit.wikimedia.org/r/298922

Deployed in prod. Monitoring it.