
Explore growing memory usage of web workers on scb machines
Closed, Resolved · Public

Description

Available memory on the scb machines is decreasing over time. Restarting ORES (and its workers) relieves the memory pressure, which suggests a memory leak of some sort. Investigate.

Details

Related Gerrit Patches:
mediawiki/services/ores/deploy (master): Restart celery workers once in a while

Event Timeline

Halfak created this task. Jul 11 2016, 9:10 PM
Restricted Application added subscribers: Zppix, Aklapper. Jul 11 2016, 9:10 PM
Halfak claimed this task. Jul 11 2016, 9:10 PM
Halfak triaged this task as High priority.

Reviewing the status of scb1001.

We have 16 celery workers and 72 uwsgi workers running. Each process reports between 2.9% and 3.5% of memory. Since 88 processes at roughly 3% each would nominally add up to well over 100%, much of that memory must be shared between processes.

The RSS of uwsgi is between 964 MB and 1167 MB.

The RSS of celery is between 1068 MB and 1162 MB.

I've been tracking memory usage using ps.
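For reference, roughly the kind of sampling I mean; this is a sketch rather than the exact script I'm running, and the log file name and five-minute interval are just illustrative:

# Append a UTC timestamp plus the PID, RSS (in KiB), and command name of
# every celery and uwsgi process to a log file, once every five minutes.
$ while true; do
    date -u '+%Y-%m-%dT%H:%M:%SZ' >> ores_rss.log;
    ps -C celery,uwsgi -o pid=,rss=,comm= >> ores_rss.log;
    sleep 300;
  done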

Here are two plots showing the distribution of *resident* memory usage over time:

https://commons.wikimedia.org/wiki/File:Ores.uwsgi_memory_usage_over_time.rss.svg

https://commons.wikimedia.org/wiki/File:Ores.celery_memory_usage_over_time.rss.svg

Better still, these two plots show how memory usage changes per process:

https://commons.wikimedia.org/wiki/File:Ores.per_process.celery_memory_usage_over_time.rss.svg

https://commons.wikimedia.org/wiki/File:Ores.per_process.uwsgi_memory_usage_over_time.rss.svg

It looks like individual processes experience an initial bump in memory usage but don't show a substantial jump after that point.

I'll update these graphs in the next hour.

Graphs updated. I'm going to call it a night, but if someone else could run the following commands on scb1001 before I get back and record the results and the hour, that would be great :)

$ ps aux | head -n1; ps aux | grep uwsgi;
$ ps aux | head -n1; ps aux | grep celery;
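Something along these lines would record the hour along with the output (the log path is just an example):

# Prepend the current hour, then capture both listings in one log entry.
$ { date '+%Y-%m-%d %H:00'; ps aux | head -n1; ps aux | grep uwsgi; ps aux | grep celery; } >> /tmp/ores_memcheck.log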

It seems we do have a memory leak on labs as well: https://grafana.wikimedia.org/dashboard/db/ores-labs

OK. Graphs are updated. It looks like uwsgi clearly has a ceiling, while celery has been creeping slowly upwards in memory usage.

If the leak is in the celery workers, I don't think it's in our code; it may be in celery itself.

I looked at setting celery's CELERYD_MAX_TASKS_PER_CHILD option, which restarts each worker process after it has handled a fixed number of tasks. That would be a good mitigation strategy while we work out what's going on.
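Roughly, the command-line equivalent of that setting looks like this; the app name and the value 100 are illustrative, not what we'd actually deploy:

# Recycle each worker child process after it has executed 100 tasks, which
# caps how far any single process's memory can grow before it is replaced.
$ celery worker --app=ores_celery --maxtasksperchild=100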

I just deployed this to wmflabs.

While it's not my favorite solution, I think this is good for now.

Change 298922 had a related patch set uploaded (by Ladsgroup):
Restart celery workers once in a while

https://gerrit.wikimedia.org/r/298922

Change 298922 merged by Ladsgroup:
Restart celery workers once in a while

https://gerrit.wikimedia.org/r/298922

Deployed in prod. Monitoring it.

Ladsgroup closed this task as Resolved. Jul 18 2016, 9:09 PM