uwsgi takes a long time to restart (Debian Jessie in labs)
Closed, Resolved (Public)

Assigned To

Authored By

	Halfak
	Nov 12 2015, 8:21 PM

Description

For some reason, uwsgi server restarts take a very long time on my labs servers. On my local machine, uwsgi with the same configuration takes seconds. On the labs servers, it takes 1-2 minutes.

Labs servers tested on:

ores-web-01.eqiad.wmflabs (uwsgi-ores-web)
ores-web-02.eqiad.wmflabs (uwsgi-ores-web)
ores-staging-01.eqiad.wmflabs (uwsgi-ores-web)

Details

	Subject	Repo	Branch	Lines +/-
	Use die-on-term on ores uwsgi	operations/puppet	production	+1 -0


Related Objects

Status	Assigned	Task
Resolved	None	T130369 [Epic] Structured deployment of ORES
Resolved	Ladsgroup	T130404 Setup ORES service in beta cluster
Resolved	Ladsgroup	T118495 uwsgi takes a long time to restart (Debian Jessie in labs)

Event Timeline

Halfak created this task. Nov 12 2015, 8:21 PM

Halfak assigned this task to yuvipanda.

Halfak raised the priority of this task from to Needs Triage.

Halfak updated the task description.

Halfak added a project: SRE.

Halfak subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Nov 12 2015, 8:21 PM

I've noticed the same thing on other servers too. I think graphite, invisible-unicorn, etc. all take a loooong time to restart.

Halfak renamed this task from uwsgi takes a long time to restart to uwsgi takes a long time to restart (Debian Jessie in labs). Nov 12 2015, 8:23 PM

Halfak updated the task description.

yuvipanda removed yuvipanda as the assignee of this task. Nov 13 2015, 7:02 AM

I can't reproduce this on graphite2.eqiad.wmflabs using service. Is it a specific command that takes a long time to return, or is it uwsgi taking a long time to come up and bind the port, or something like that?

$ /usr/bin/time sudo service uwsgi-graphite-web restart
uwsgi-graphite-web stop/waiting
uwsgi-graphite-web start/running, process 28006
0.10user 0.02system 0:01.24elapsed 10%CPU (0avgtext+0avgdata 8348maxresident)k
0inputs+0outputs (0major+5825minor)pagefaults 0swaps
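One way to tell the two cases in the question above apart is to time how long the port takes to accept connections, independently of how long the service command takes to return. The following is a minimal sketch, not something taken from this task: `wait_for_port` is a hypothetical helper, and the host and port in the usage comment are placeholders, since the task does not state which port uwsgi binds.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=0.1):
    """Return the seconds elapsed until host:port accepts a TCP connection.

    Raises TimeoutError if the port is still closed after `timeout` seconds.
    """
    start = time.monotonic()
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() - start >= timeout:
                raise TimeoutError(f"{host}:{port} is not accepting connections")
            time.sleep(interval)


# Hypothetical usage (placeholder host and port), run right after the restart:
#   seconds = wait_for_port("127.0.0.1", 8080)
```

If the restart command returns quickly but `wait_for_port` reports a long wait, the delay is in uwsgi coming up; if the command itself blocks, the delay is in the stop or start step.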

intracer subscribed. Feb 24 2016, 10:06 AM

Halfak merged a task: T131572: uwsgi takes a long time to restart. Apr 2 2016, 6:01 AM

A uwsgi start takes less than one second. The majority of the waiting seems to happen when stopping the last uwsgi. I ran these commands on our staging server while no traffic was being directed to the instance.

halfak@ores-staging-02:~$ time sudo service uwsgi-ores-web restart

real	1m30.771s
user	0m0.044s
sys	0m0.008s
halfak@ores-staging-02:~$ time sudo service uwsgi-ores-web stop

real	1m31.253s
user	0m0.032s
sys	0m0.016s
halfak@ores-staging-02:~$ time sudo service uwsgi-ores-web start

real	0m0.138s
user	0m0.020s
sys	0m0.012s

Here's the relevant configuration for the ores uwsgi process: https://github.com/wikimedia/operations-puppet/blob/production/modules/ores/manifests/web.pp#L11

ORES starts up a lot of workers per core (currently 28!), so let's compare to wikilabels, which only starts 4 processes per core and uses minimal memory per process.

halfak@wikilabels-staging-01:~$ time sudo service uwsgi-wikilabels-web restart

real	1m30.325s
user	0m0.040s
sys	0m0.004s
halfak@wikilabels-staging-01:~$ time sudo service uwsgi-wikilabels-web stop

real	1m30.430s
user	0m0.028s
sys	0m0.020s
halfak@wikilabels-staging-01:~$ time sudo service uwsgi-wikilabels-web start

real	0m3.607s
user	0m0.024s
sys	0m0.012s

See relevant configuration for the wikilabels uwsgi process: https://github.com/wikimedia/operations-puppet/blob/production/modules/wikilabels/manifests/web.pp#L45
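The stop/start comparison above can be scripted so that the same measurement is repeatable across hosts. This is a rough sketch under stated assumptions: `time_command` and `time_service_phases` are hypothetical helpers, and the script assumes it runs on a host where `sudo service <name> stop|start` works as in the transcripts above.

```python
import subprocess
import time


def time_command(argv):
    """Run argv and return (exit_code, wall_clock_seconds)."""
    start = time.monotonic()
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode, time.monotonic() - start


def time_service_phases(service, runner=time_command):
    """Time `stop` and `start` separately for one service.

    `runner` is injectable so the logic can be exercised without sudo.
    """
    timings = {}
    for action in ("stop", "start"):
        _, seconds = runner(["sudo", "service", service, action])
        timings[action] = seconds
    return timings


# Usage on a host like the ones above:
#   print(time_service_phases("uwsgi-ores-web"))
```

Timing the two phases separately is what shows that the wait is in `stop` rather than in `start`.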

Halfak added a parent task: T130404: Setup ORES service in beta cluster. Apr 2 2016, 6:20 AM

Halfak mentioned this in T130404: Setup ORES service in beta cluster.

Joe subscribed. Apr 2 2016, 6:33 AM

Change 281161 had a related patch set uploaded (by Ladsgroup):
Use die-on-term on ores uwsgi

https://gerrit.wikimedia.org/r/281161

gerritbot added a project: Patch-For-Review. Apr 2 2016, 1:08 PM

I checked the logs, and it seems the uwsgi service can't shut down on SIGTERM. (On restart, uwsgi sends SIGHUP to the workers and then SIGTERM to the main process.) The workers shut down gracefully, but the main process ignores SIGTERM (it seems uwsgi can do that, and it does so too often). After the timeout period, which is 90 seconds, the system sends SIGINT (or another brutal signal). With die-on-term added, it was able to restart fast:

ladsgroup@deployment-ores-web:/etc$ time sudo service uwsgi-ores-web restart

real	0m1.157s
user	0m0.040s
sys	0m0.024s
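The mechanism described above (a process that does not exit on SIGTERM is only stopped once the supervisor's timeout expires) can be reproduced in miniature. This sketch is a simulation, not uwsgi itself: the two child programs are stand-ins, the timeout is shortened from 90 seconds, and the escalation signal is assumed to be SIGKILL, whereas the comment above only says "SIGINT (or another brutal signal)". It assumes a POSIX system.

```python
import signal
import subprocess
import sys
import time

# Stand-in for a main process that does not exit on SIGTERM.
IGNORES_TERM = (
    "import signal, time; "
    "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
    "print('ready', flush=True); time.sleep(60)"
)
# Stand-in for a process that exits on SIGTERM (default signal handling).
DIES_ON_TERM = "import time; print('ready', flush=True); time.sleep(60)"


def stop_with_timeout(child_code, stop_timeout):
    """Start a child, send SIGTERM, and escalate to SIGKILL after stop_timeout.

    Returns (seconds_to_stop, name_of_signal_that_ended_the_child).
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code], stdout=subprocess.PIPE
    )
    proc.stdout.readline()  # wait until the child has set up its signal handling
    start = time.monotonic()
    proc.terminate()  # sends SIGTERM
    try:
        proc.wait(timeout=stop_timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # assumed escalation: SIGKILL
        proc.wait()
    elapsed = time.monotonic() - start
    proc.stdout.close()
    # On POSIX, a negative return code is the number of the terminating signal.
    return elapsed, signal.Signals(-proc.returncode).name
```

Calling `stop_with_timeout(IGNORES_TERM, 1.0)` takes the full timeout and ends with the escalation signal, while `stop_with_timeout(DIES_ON_TERM, 1.0)` returns almost immediately. That is the same shape as the 1m30s versus 1s restart times reported in this task.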

Also, this article is very good reading. I think we should implement subscription so that LVS knows when one of our web nodes is restarting.

Ladsgroup claimed this task. Apr 4 2016, 5:31 AM

Ladsgroup added a project: Machine-Learning-Team (Active Tasks).

Ladsgroup moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board.

schana subscribed. Apr 4 2016, 4:44 PM

MZMcBride subscribed. Apr 7 2016, 4:46 PM

Change 281161 merged by Alexandros Kosiaris:
Use die-on-term on ores uwsgi

https://gerrit.wikimedia.org/r/281161

Ladsgroup moved this task from Review to Completed on the Machine-Learning-Team (Active Tasks) board. Apr 7 2016, 6:12 PM

Ladsgroup closed this task as Resolved. Apr 26 2016, 3:06 PM

Phabricator_maintenance added a project: User-Ladsgroup. Aug 12 2016, 8:09 PM
