webservicemonitor failing to restart grid-based webservices
Closed, Resolved, Public

Description

Sep 17 20:22:34 tools-sgecron-01 sudo[20106]:     root : TTY=unknown ; PWD=/ ; USER=tools.ato ; COMMAND=/bin/bash -c /usr/bin/webservice restart
Sep 17 20:22:34 tools-sgecron-01 sudo[20106]: pam_unix(sudo:session): session opened for user tools.ato by (uid=0)
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]: WARNING: No explict backend provided.
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:   Using default of 'kubernetes'
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:   For help refer to <https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web>
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]: Backend 'gridengine' from /data/project/ato/service.manifest does not match 'kubernetes'
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:   Try stopping your current webservice:
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:     webservice stop
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:   Then try starting it again:
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:     /usr/bin/webservice restart
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:   If you have already tried that and it did not help, remove the state file before retrying:
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:     rm /data/project/ato/service.manifest
Sep 17 20:22:35 tools-sgecron-01 sudo[20106]: pam_unix(sudo:session): session closed for user tools.ato
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]: 2020-09-17 20:22:35,611 Could not start webservice for tool ato
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]: Traceback (most recent call last):
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:   File "/usr/lib/python3/dist-packages/tools/manifest/webservicemonitor.py", line 192, in _start_webservice
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:     subprocess.check_output(command, timeout=30)  # 30 second timeout!
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:   File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:     **kwargs).stdout
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:   File "/usr/lib/python3.5/subprocess.py", line 398, in run
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]:     output=stdout, stderr=stderr)
Sep 17 20:22:35 tools-sgecron-01 collector-runner[19548]: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.ato

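This looks like a bug in webservice itself rather than in webservicemonitor: the restart that webservicemonitor issues fails because webservice defaults to 'kubernetes' even though service.manifest says 'gridengine'.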
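
To illustrate why the traceback lands in webservicemonitor even though the failure comes from the child command: subprocess.check_output raises CalledProcessError whenever the command exits non-zero. The following is a minimal sketch, not the actual webservicemonitor code; run_and_report and the stand-in child command are hypothetical names used only for illustration.

```python
import subprocess
import sys


def run_and_report(command, timeout=30):
    """Run a command and return (ok, combined stdout/stderr text).

    Hypothetical helper: a non-zero exit is reported instead of raised.
    subprocess.TimeoutExpired is deliberately not caught here.
    """
    try:
        out = subprocess.check_output(
            command, stderr=subprocess.STDOUT, timeout=timeout
        )
        return True, out.decode("utf-8", "replace")
    except subprocess.CalledProcessError as exc:
        # The child's own message (e.g. the backend mismatch hint)
        # is available on the exception.
        return False, exc.output.decode("utf-8", "replace")


# Stand-in for a failing "webservice restart": prints a message, exits 1.
fake_restart = [
    sys.executable,
    "-c",
    "import sys; print('backend mismatch'); sys.exit(1)",
]
ok, output = run_and_report(fake_restart)
```

Capturing the child's output this way keeps the mismatch message next to the failure, which makes it easier to see that the monitor is only relaying an error from the command it ran.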

Event Timeline

tools.ato@tools-sgebastion-08:~$ webservice restart
WARNING: No explict backend provided.
  Using default of 'kubernetes'
  For help refer to <https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web>
Backend 'gridengine' from /data/project/ato/service.manifest does not match 'kubernetes'
  Try stopping your current webservice:
    webservice stop
  Then try starting it again:
    /usr/local/bin/webservice restart
  If you have already tried that and it did not help, remove the state file before retrying:
    rm /data/project/ato/service.manifest
tools.ato@tools-sgebastion-08:~$ webservice --backend=gridengine restart
Your job is not running, starting...
tools.ato@tools-sgebastion-08:~$ webservice status
WARNING: No explict backend provided.
  Using default of 'kubernetes'
  For help refer to <https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web>
Backend 'gridengine' from /data/project/ato/service.manifest does not match 'kubernetes'
  Try stopping your current webservice:
    webservice stop
  Then try starting it again:
    /usr/local/bin/webservice status
  If you have already tried that and it did not help, remove the state file before retrying:
    rm /data/project/ato/service.manifest

It looks like webservice is no longer picking up the backend: ... stanza from service.manifest. Passing --backend=gridengine explicitly works.
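
The behavior expected here can be sketched as a simple precedence rule: an explicit --backend flag first, then the backend recorded in service.manifest, then the default. This is an illustrative sketch only, not the tools-webservice implementation. The function names are hypothetical, and the manifest is assumed to contain a plain top-level "backend: <name>" line, parsed here without a YAML library.

```python
from typing import Optional

# Default named in the warning output above.
DEFAULT_BACKEND = "kubernetes"


def read_manifest_backend(manifest_text: str) -> Optional[str]:
    """Return the value of a top-level 'backend:' line, or None if absent."""
    for line in manifest_text.splitlines():
        key, sep, value = line.partition(":")
        # Only an unindented 'backend' key counts as top-level.
        if sep and key == "backend":
            return value.strip() or None
    return None


def resolve_backend(cli_backend: Optional[str], manifest_text: str) -> str:
    """Assumed precedence: CLI flag, then manifest, then default."""
    return cli_backend or read_manifest_backend(manifest_text) or DEFAULT_BACKEND


# With no flag given, the manifest value should win over the default.
chosen = resolve_backend(None, "backend: gridengine\nweb: php\n")
```

Under this rule a bare webservice restart for a tool whose manifest records gridengine would resolve to gridengine instead of reporting a mismatch against the default.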

bd808 triaged this task as High priority. Sep 17 2020, 8:28 PM

Mentioned in SAL (#wikimedia-cloud) [2020-09-17T20:34:27Z] <bd808> Live hacked "--backend=gridengine" into webservicemonitor on tools-sgecron-01 (T263190)

Change 628194 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-manifest@master] Add --backend=gridengine to restart command

https://gerrit.wikimedia.org/r/628194

Change 628194 merged by jenkins-bot:
[operations/software/tools-manifest@master] Add --backend=gridengine to restart command

https://gerrit.wikimedia.org/r/628194
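
The idea behind this change can be sketched as building the restart command with the backend pinned, so it no longer depends on webservice choosing a default. This is a hedged sketch and not the merged patch: build_restart_command is a hypothetical name, and the command layout is an assumption based on the sudo log line in the description and the flag placement used in the manual run above.

```python
from typing import List


def build_restart_command(tool: str, backend: str = "gridengine") -> List[str]:
    """Return an argv list that restarts a tool's webservice on one backend.

    Assumed layout: sudo -i -u tools.<tool> /usr/bin/webservice
    --backend=<backend> restart
    """
    return [
        "/usr/bin/sudo",
        "-i",
        "-u",
        "tools.{}".format(tool),
        "/usr/bin/webservice",
        "--backend={}".format(backend),
        "restart",
    ]


command = build_restart_command("ato")
```

Passing the result as a list to subprocess avoids shell quoting issues, and pinning the backend means the monitor keeps working even if the manifest lookup in webservice regresses again.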

Change 628200 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/software/tools-webservice@master] webservice: restore setting backend via service.manifest

https://gerrit.wikimedia.org/r/628200

Mentioned in SAL (#wikimedia-cloud) [2020-09-17T21:56:19Z] <bd808> Built and deployed tools-manifest v0.22 (T263190)

bd808 lowered the priority of this task from High to Medium. Sep 17 2020, 10:07 PM

The updated version of tools-manifest has fixed the horrible bug here: grid-based webservices were not getting restarted. Lowering priority to Medium pending the merge and deploy of an updated webservice that fixes its failure to read the backend from $HOME/service.manifest.

Change 628200 merged by jenkins-bot:
[operations/software/tools-webservice@master] webservice: restore setting backend via service.manifest

https://gerrit.wikimedia.org/r/628200