
webservice stop says service not running but service.manifest not cleared
Closed, Resolved (Public)

Description

Current workaround for people having this problem:

  • webservice stop
  • rm $HOME/service.manifest
  • webservice [add other args here as needed] start
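
The three steps above could be wrapped in a small shell function. This is only a sketch: the name reset_webservice and the WEBSERVICE_CMD override are hypothetical and not part of the webservice tool, and it assumes the manifest lives at $HOME/service.manifest as described in this task.

```shell
# Hypothetical helper; not part of the webservice tool.
# WEBSERVICE_CMD is an assumed override so the function can be tried with a stub.
reset_webservice() {
    ws="${WEBSERVICE_CMD:-webservice}"
    # Step 1: ask webservice to stop; it may only report that nothing is running.
    "$ws" stop || true
    # Step 2: remove the stale manifest so the old backend is no longer recorded.
    rm -f "$HOME/service.manifest"
    # Step 3: start again, passing through any extra arguments (backend, type, ...).
    "$ws" "$@" start
}
```

For example, reset_webservice --backend=kubernetes python would end by running webservice --backend=kubernetes python start.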

The commons-mass-description tool ran into this problem today. @Urbanecm tried to start it with webservice --backend=kubernetes python start and got this error message:

Looks like you already have another webservice running, with a gridengine backend
You should stop that webservice by issuing:
    webservice --backend=gridengine stop
And then start it again with backend kubernetes by issuing:
    webservice --backend=kubernetes start

qstat showed no job running at that moment, but service.log recorded "2017-04-19T17:13:22.646841 No running webservice job found, attempting to start it", after which a grid job was spawned. That job died as soon as it started, leaving this output in uwsgi.log:

*** Starting uWSGI 1.9.17.1-debian (64bit) on [Wed Apr 19 17:13:55 2017] ***
compiled with version: 4.8.2 on 23 March 2014 17:15:32
os: Linux-3.13.0-100-generic #147-Ubuntu SMP Tue Oct 18 16:48:51 UTC 2016
nodename: tools-webgrid-generic-1404
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 4
current working directory: /mnt/nfs/labstore-secondary-tools-project/commons-mass-description
detected binary path: /usr/bin/uwsgi-core
your processes number limit is 63707
your process address space limit is 4294967296 bytes (4096 MB)
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address :52476 fd 3
Python version: 2.7.6 (default, Oct 26 2016, 20:33:43)  [GCC 4.8.4]
Set PythonHome to /data/project/commons-mass-description/www/python/venv
ImportError: No module named site

This error occurs because $HOME/www/python/venv was built for Python 3 on Kubernetes, while the grid engine backend runs the app with Python 2.
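
One rough way to spot this kind of mismatch is to check which interpreter version the virtualenv was built for. The sketch below assumes the common virtualenv layout with a lib/pythonX.Y directory; the function name venv_python_dirs is made up for illustration.

```shell
# Hypothetical helper: list the pythonX.Y directories inside a virtualenv.
# Assumes the common layout <venv>/lib/pythonX.Y; prints nothing if none exist.
venv_python_dirs() {
    venv="${1:-$HOME/www/python/venv}"
    for dir in "$venv"/lib/python*; do
        # Skip the unexpanded glob when no directory matches.
        [ -d "$dir" ] && basename "$dir"
    done
    return 0
}
```

Running venv_python_dirs "$HOME/www/python/venv" and seeing only a python3.x entry would suggest the venv cannot be used by a Python 2 uwsgi on the grid.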

Repeated attempts to stop the grid engine webservice with webservice stop only printed "Your webservice is not running". The service.manifest file, however, still contained:

# This file is used by toollabs infrastructure.
# Please do not edit manually at this time.
backend: gridengine
version: 2
web: uwsgi-python

Ultimately, using rm service.manifest stopped the webservice reconciliation loop from trying to restart the service on the job grid and allowed it to be started on kubernetes with webservice --backend=kubernetes python start.
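
Before removing the file, it can help to see which backend the manifest still records and compare that with what qstat reports. A minimal sketch, assuming the simple "key: value" layout shown in this task; the function name manifest_backend is hypothetical.

```shell
# Hypothetical helper: print the value of the "backend:" key in a manifest file.
# Assumes the simple "key: value" layout shown in this task.
manifest_backend() {
    manifest="${1:-$HOME/service.manifest}"
    # Fail quietly when there is no manifest at all.
    [ -f "$manifest" ] || return 1
    # Split each line on the colon and print the value of the first backend key.
    awk -F': *' '$1 == "backend" { print $2; exit }' "$manifest"
}
```

If this prints gridengine while qstat shows no job for the tool, the manifest is probably stale in the way described here.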

Event Timeline

bd808 triaged this task as Medium priority.
bd808 moved this task from Backlog to Waiting for code review on the Toolforge board.

Mentioned in SAL (#wikimedia-labs) [2017-05-30T17:14:04Z] <bd808> Removed $HOME/service.manifest. webservice was stuck in restart loop. (T163355)

Change 350362 merged by jenkins-bot:
[operations/software/tools-webservice@master] Always cleanup manifest on stop

https://gerrit.wikimedia.org/r/350362

Mentioned in SAL (#wikimedia-labs) [2017-05-31T19:16:15Z] <bd808> Installed toollabs-webservice_0.37_all.deb from local file on tools-bastion-02 (T163355)

Mentioned in SAL (#wikimedia-labs) [2017-05-31T19:24:22Z] <bd808> Updating toollabs-webservice package via clush (T163355)

Mentioned in SAL (#wikimedia-labs) [2017-05-31T19:29:36Z] <bd808> Rebuilding all Docker images to pick up toollabs-webservice v0.37 (T163355)

bd808 moved this task from Doing to Done on the cloud-services-team (Kanban) board.

tools.commons-video-clicks experienced this today. Here are the contents of the existing service.manifest:

backend: gridengine
version: 2

Trying to stop/start it:

tools.commons-video-clicks@tools-sgebastion-07:~$ webservice --backend=gridengine stop
Your webservice is not running

tools.commons-video-clicks@tools-bastion-03:~$ webservice --backend=gridengine stop
Your webservice is not running


tools.commons-video-clicks@tools-sgebastion-07:~$ webservice --backend=kubernetes start
Looks like you already have another webservice running, with a gridengine backend
You should stop that webservice by issuing:
    webservice --backend=gridengine stop
And then start it again with backend kubernetes by issuing:
    webservice --backend=kubernetes start

And the new service.manifest after applying the workaround:

backend: kubernetes
distribution: debian
version: 3
web: php5.6

There were no pods or jobs running for this tool when the error happened.

Reopening because variations of this problem keep happening during the Trusty grid deprecation.

Moving back from the hidden "Done" column to the "Inbox" column on the workboard so the task is actually visible. Feel free to correct.

Re-closing as resolved. No reports of this in quite some time and we have done significant updates to the webservice code since the last report.