uwsgi takes a long time to restart (Debian Jessie in labs)
Closed, Resolved (Public)

Description

For some reason, uwsgi server restarts take a very long time on my labs servers. On my local machine, uwsgi with the same configuration takes seconds. On the labs servers, it takes 1-2 minutes.

Labs servers tested on:

  • ores-web-01.eqiad.wmflabs (uwsgi-ores-web)
  • ores-web-02.eqiad.wmflabs (uwsgi-ores-web)
  • ores-staging-01.eqiad.wmflabs (uwsgi-ores-web)

Event Timeline

Halfak assigned this task to yuvipanda.
Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak added a project: SRE.
Halfak subscribed.
yuvipanda renamed this task from "uwsgi takes a long time to restart (Debian Jessie in labs)" to "uwsgi takes a long time to restart". Nov 12 2015, 8:22 PM
yuvipanda set Security to None.

I've noticed the same thing on other servers too. I think graphite, invisible-unicorn, etc. all take a loooong time to restart.

Halfak renamed this task from "uwsgi takes a long time to restart" to "uwsgi takes a long time to restart (Debian Jessie in labs)". Nov 12 2015, 8:23 PM
Halfak updated the task description.
fgiunchedi triaged this task as Medium priority. Dec 1 2015, 12:38 PM
fgiunchedi subscribed.

I can't reproduce this on graphite2.eqiad.wmflabs using service. Is it a specific command that takes a long time to return, or is uwsgi slow to come up and bind the port, or something like that?

$ /usr/bin/time sudo service uwsgi-graphite-web restart
uwsgi-graphite-web stop/waiting
uwsgi-graphite-web start/running, process 28006
0.10user 0.02system 0:01.24elapsed 10%CPU (0avgtext+0avgdata 8348maxresident)k
0inputs+0outputs (0major+5825minor)pagefaults 0swaps

A uwsgi start takes less than one second. The majority of the waiting seems to happen when stopping the last uwsgi. I ran these commands on our staging server while no traffic was being directed to the instance.

halfak@ores-staging-02:~$ time sudo service uwsgi-ores-web restart

real	1m30.771s
user	0m0.044s
sys	0m0.008s
halfak@ores-staging-02:~$ time sudo service uwsgi-ores-web stop

real	1m31.253s
user	0m0.032s
sys	0m0.016s
halfak@ores-staging-02:~$ time sudo service uwsgi-ores-web start

real	0m0.138s
user	0m0.020s
sys	0m0.012s

Here's the relevant configuration for the ores uwsgi process: https://github.com/wikimedia/operations-puppet/blob/production/modules/ores/manifests/web.pp#L11

ORES starts up a lot of workers per core (currently 28!), so let's compare to wikilabels, which only starts 4 processes per core and uses minimal memory per process.

halfak@wikilabels-staging-01:~$ time sudo service uwsgi-wikilabels-web restart

real	1m30.325s
user	0m0.040s
sys	0m0.004s
halfak@wikilabels-staging-01:~$ time sudo service uwsgi-wikilabels-web stop

real	1m30.430s
user	0m0.028s
sys	0m0.020s
halfak@wikilabels-staging-01:~$ time sudo service uwsgi-wikilabels-web start

real	0m3.607s
user	0m0.024s
sys	0m0.012s

See relevant configuration for the wikilabels uwsgi process: https://github.com/wikimedia/operations-puppet/blob/production/modules/wikilabels/manifests/web.pp#L45
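For anyone wanting to repeat the stop/start comparison above on other hosts, here is a minimal sketch of a timing helper. The function name `time_command` is made up for illustration and is not part of any existing tooling; it simply measures the wall-clock time of whatever command it is given, much like `time` does in the transcripts.

```python
import subprocess
import time


def time_command(argv):
    """Run argv and return (elapsed_seconds, return_code).

    Hypothetical helper: it only measures wall-clock time, like the
    'real' line printed by the shell's `time` builtin.
    """
    started = time.monotonic()
    completed = subprocess.run(argv, check=False)
    return time.monotonic() - started, completed.returncode
```

Assuming the same service names as in the transcripts, calling `time_command(["sudo", "service", "uwsgi-ores-web", "stop"])` followed by the same call with `"start"` should show whether the delay sits in the stop or the start step.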

Change 281161 had a related patch set uploaded (by Ladsgroup):
Use die-on-term on ores uwsgi

https://gerrit.wikimedia.org/r/281161

I checked the logs, and it seems the uwsgi service can't shut down on SIGTERM. On restart, uwsgi sends SIGHUP to the workers and then SIGTERM to the main process. The workers shut down gracefully, but the main process ignores SIGTERM (it seems uwsgi can do that, and does so too often). After the stop timeout, which is 90 seconds, the system sends SIGINT (or another brutal signal). With die-on-term added, it was able to restart fast:

ladsgroup@deployment-ores-web:/etc$ time sudo service uwsgi-ores-web restart

real	0m1.157s
user	0m0.040s
sys	0m0.024s
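The mechanism described above (a process that does not exit on SIGTERM, and a supervisor that waits out a timeout before escalating) can be illustrated with a toy sketch. This is not uwsgi or the init system: the child scripts, the `stop_with_timeout` name, and the use of SIGKILL as the escalation signal are all assumptions made for the example, and a short timeout stands in for the 90 seconds.

```python
import signal
import subprocess
import sys
import time

# Toy child that ignores SIGTERM, standing in for the main process
# that does not exit on SIGTERM (assumption: analogy only).
CHILD_IGNORES_TERM = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

# Toy child with default SIGTERM handling, standing in for a process
# that exits on SIGTERM, as with die-on-term.
CHILD_DIES_ON_TERM = (
    "import time\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_with_timeout(child_source, timeout_s):
    """Start a child, send SIGTERM, and escalate to SIGKILL on timeout.

    Returns (elapsed_seconds, escalated).
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", child_source], stdout=subprocess.PIPE
    )
    # Wait for the 'ready' line so the child has set up its signal handling.
    proc.stdout.readline()
    started = time.monotonic()
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout_s)
        escalated = False
    except subprocess.TimeoutExpired:
        proc.kill()  # forceful stop after the timeout expires
        proc.wait()
        escalated = True
    proc.stdout.close()
    return time.monotonic() - started, escalated
```

With the ignoring child, the stop takes the full timeout and needs escalation; with the other child, it returns almost immediately, which matches the difference between the 1m30s and 1s restarts above.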

Also, this article is a very good read. I think we should implement subscription so that LVS knows when one of our web nodes is restarting.

Change 281161 merged by Alexandros Kosiaris:
Use die-on-term on ores uwsgi

https://gerrit.wikimedia.org/r/281161