
Investigate increased memory pressure on scb1001/2
Closed, Resolved · Public

Description

The mobileapps endpoints health check has flapped four times so far on 2016-07-01 (UTC).

[01:46:25] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:48:35] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[04:15:26] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:17:36] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[05:46:47] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:48:57] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[16:07:05] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[16:09:24] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy

Checking Ganglia showed increased memory pressure, and the yearly graphs show a steadily growing need for memory.[1][2] The long-term trend is very concerning.

I checked scb1001 for the top memory-using processes, and ORES looks like the most memory-hungry service, with the top processes consuming around 3-4% of memory each.

$ ps aux | awk '{print $2, $4, $11, $15}' | sort -k2,2 -rn | head -n 40
PID %MEM COMMAND 
2844 4.0 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2910 3.9 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2913 3.9 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2914 3.9 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2679 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2692 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2718 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2736 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2757 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2781 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2798 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2804 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2814 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2832 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2859 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2872 3.9 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2909 3.7 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2912 3.7 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2887 3.7 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2894 3.7 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2911 3.6 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2763 3.6 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2905 3.5 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2904 3.4 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2906 3.4 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2907 3.4 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2908 3.4 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2898 3.2 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2903 3.2 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2125 3.2 /srv/deployment/ores/venv/bin/python3 ores_celery.application
2181 3.1 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2676 3.1 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2677 3.1 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2678 3.1 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2682 3.1 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2686 3.1 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2687 3.1 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2688 3.1 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
2695 3.1 /usr/bin/uwsgi /etc/uwsgi/apps-enabled/ores.ini
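
For reference, the same per-process view can also be produced programmatically. This is a minimal sketch, assuming the psutil package is available on the host (it was not part of the original investigation):

# Hypothetical helper, not part of the original investigation: list the
# top processes by %MEM, similar to the ps/awk pipeline above.
import psutil

procs = []
for p in psutil.process_iter(attrs=["pid", "memory_percent", "cmdline"]):
    info = p.info
    if info["memory_percent"] is not None:
        procs.append(info)

for info in sorted(procs, key=lambda i: i["memory_percent"], reverse=True)[:40]:
    cmd = " ".join(info["cmdline"] or [])
    print(info["pid"], round(info["memory_percent"], 1), cmd)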

[1] https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=scb1001.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=mem_report&c=Service+Cluster+B+eqiad
[2] https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=scb1002.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=mem_report&c=Service+Cluster+B+eqiad

Event Timeline

bearND updated the task description.

This is ORES' expected behavior. I'm working on a change right now that will limit memory usage for uwsgi. The problem is that each worker process needs to have the full set of prediction models loaded into memory in order to make use of them. Some of the models are pretty big.

We should be experimenting with this lower memory footprint strategy in the coming week. If all goes as planned, we could have the change deployed within a couple of weeks.
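
For context, here is a hedged sketch of the memory pattern described above (not the actual ORES code; the model directory, file names, and loading helper are made up for illustration): each preforked uwsgi/celery worker ends up holding its own copy of the loaded models, so resident memory grows roughly linearly with the worker count.

# Illustrative sketch only -- not the actual ORES implementation.
# Each uwsgi/celery worker imports a module like this at startup and
# unpickles the full set of prediction models into its own address space.
import glob
import os
import pickle

MODEL_DIR = "/srv/deployment/ores/models"  # hypothetical path

def load_all_models(model_dir=MODEL_DIR):
    """Load every pickled model file once, at worker startup."""
    models = {}
    for path in glob.glob(os.path.join(model_dir, "*.model")):
        with open(path, "rb") as f:
            models[os.path.basename(path)] = pickle.load(f)
    return models

# With N workers and models totalling M bytes, resident usage approaches
# N * M once each worker has touched its copies (copy-on-write sharing
# from the preforking master erodes as the Python objects are modified).
MODELS = load_all_models()

This is why limiting memory per uwsgi worker, or reducing the number of workers, is the lever being worked on here.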

If it is expected that ORES takes over the memory of the machine, I think it would be better to use dedicated machines for ORES instead of sharing these machines with other RB services.

Well, ORES doesn't "take over memory" so much as it has high requirements.

Either way, I agree. I was surprised to find out that we were sharing a machine with other services.

You can see the list of other services hosted on the machines when you log in there.

Indeed. It was not my decision to host this service on scb100(1|2).

Added Ops to bring this to their attention. I hope we can find a way to move the ORES service or the other RB services to separate production machines in the near future. I would like to hear some thoughts from the services team, too.

We can reduce the number of web workers to 3/4 for now (by changing https://github.com/wikimedia/operations-puppet/blob/production/modules/ores/manifests/web.pp#L4 to 3, or via Hiera) until we improve memory usage, and then come back up to full capacity. The only exception would be if the number of requests is already close to our capacity right now, which would itself be worrying.

@Ladsgroup, that would be great as a stop-gap.

Resource usage was one of the concerns we brought up in the initial ORES-to-production discussion (as well as the experimental nature of this service), but then the consensus was that resource use would be very moderate, and sharing would be okay for now.

Now it looks like usage is higher than projected, and my understanding is that it is also set to grow as additional languages are added. If this is correct, then isolating ORES more strongly with restrictive cgroups or separate hardware would be strongly desirable, to avoid destabilizing other production services and taking up other teams' time.

Change 297137 had a related patch set uploaded (by Ladsgroup):
ores: reduce the web workers to 3/4

https://gerrit.wikimedia.org/r/297137

Okay, I checked the ORES uwsgi files, and each node has 96 web processes (192 across both nodes), even though most of the work should be done by the worker nodes rather than the web processes. I'm reducing the number to 3/4, which is acceptable for now.
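
As a rough back-of-envelope (my own numbers, based on the ps listing above, and assuming "3/4" means going from 4 to 3 web workers per core, i.e. roughly 96 to 72 processes per node):

# Back-of-envelope only; %MEM double-counts pages shared between the
# preforked workers, so these sums are upper bounds, not real usage.
pct_mem_per_worker = 3.5          # typical ores uwsgi worker %MEM above
for label, workers in [("before", 96), ("after (assumed)", 72)]:
    nominal = workers * pct_mem_per_worker
    print("%s: %d workers -> ~%.0f%% of RAM (upper bound)"
          % (label, workers, nominal))

Even as an upper bound, the worst case scales directly with the worker count, so the reduction buys real headroom.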

Change 297137 merged by Alexandros Kosiaris:
ores: reduce the web workers to 3/4

https://gerrit.wikimedia.org/r/297137

Thank you, @Ladsgroup; memory usage on scb1001/2 looks much better.

1001.png (430×747 px, 32 KB)

1002.png (430×747 px, 32 KB)

Thanks! That's much better. Since it sounds like there will be more languages added to ORES in the future, I'm thinking long-term we still should consider moving ORES to dedicated servers.


I'm not against moving it to a dedicated server, but first let's wait and see how our refactoring goes. We expect a huge decrease in memory usage.