webservice command not available when running a job on the task queue of the job grid
Open, Needs Triage, Public

Description

Steps to reproduce

  1. crontab -e
  2. jsub -N something -once -quiet script.sh
  3. script.sh: if webservice-broken do webservice restart end (pseudocode)

Expected behavior
It should work! At least jlocal works!

Current behavior
webservice command not found in grid!

Configuration
Stretch bastion

Event Timeline

Dvorapa created this task. Feb 9 2019, 8:08 PM
Restricted Application added a subscriber: Aklapper. Feb 9 2019, 8:08 PM
Dvorapa updated the task description. Feb 9 2019, 8:09 PM

@bd808 Nope, they are not the same: jlocal works for me, only jsub and jstart do not.

Dvorapa updated the task description. Feb 10 2019, 8:13 PM
bd808 reopened this task as Open. Feb 10 2019, 11:41 PM

The webservice command is installed on the bastions, the cron hosts, and the job grid nodes in the "web" queues. It is not, as @Dvorapa points out, installed on the job grid nodes in the "task" queues.

@Dvorapa can you explain your use-case a bit more? Is your script.sh implementation going to do something more than check to see if https://tools.wmflabs.org/$TOOLNAME/ is returning a response? I am wondering if a better solution might be for us to find a way to configure liveness probes for webservices running on the Kubernetes cluster rather than providing a way to run jobs on the job grid that can restart webservices.

bd808 renamed this task from "Jsub doesn't know about webservice" to "webservice command not available when running a job on the task queue of the job grid". Feb 10 2019, 11:58 PM
Dvorapa added a comment. Edited Feb 11 2019, 12:15 AM

Look into the /data/project/mapycz/restart.sh and /data/project/mapycz/webwatcher.py (the same for /data/project/kmlexport/*).

If I add jsub -N jobname -once -quiet python3 webwatcher.py to my crontab, I sometimes get an error saying there is no such file as 'webservice'.

If I add jlocal python3 webwatcher.py >jobname.out 2>jobname.err into crontab, it just works (at least it worked yesterday).

The Python liveness probe is the easiest thing I could think of. It just asks for the HTTP status, and if it gets anything other than 200, it calls webservice restart (hidden in a .sh file because I sometimes also use it manually as a shortcut).

Look into the /data/project/mapycz/restart.sh and /data/project/mapycz/webwatcher.py (the same for /data/project/kmlexport/*).

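webwatcher.py
import subprocess
import urllib.error
import urllib.request

try:
    # urlopen raises HTTPError for 4xx/5xx responses; the assert covers any other non-200 status
    assert urllib.request.urlopen("https://tools.wmflabs.org/mapycz/").getcode() == 200
except (AssertionError, urllib.error.HTTPError):
    # The probe failed, so restart the webservice
    subprocess.call(['./restart.sh'])

restart.sh
#!/bin/sh
webservice restart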

If I add jsub -N jobname -once -quiet python3 webwatcher.py to my crontab, I sometimes get an error saying there is no such file as 'webservice'.

I would expect it to fail with this message every time that the assert fails and it tries to execute restart.sh because the webservice command is not available on the "task" queue nodes of either grid engine deployment.

If I add jlocal python3 webwatcher.py >jobname.out 2>jobname.err into crontab, it just works (at least it worked yesterday).

This should work because jlocal ... runs the task directly on the cron server and the webservice command is installed there.
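The difference between the two hosts can be made visible in the job output. The following is a minimal sketch, not part of either tool: the helper name restart_webservice, the exit status 127 (the usual shell convention for "command not found"), and the injectable which/run parameters are all assumptions made for illustration. It checks whether webservice is on the PATH before trying to restart.

```python
import shutil
import subprocess
import sys


def restart_webservice(which=shutil.which, run=subprocess.call):
    """Run `webservice restart` if the command exists on this host.

    Returns the command's exit status, or 127 (hypothetical choice, mirroring
    the shell's "command not found" status) when `webservice` is not on PATH,
    as on the "task" queue nodes.
    """
    if which("webservice") is None:
        print("webservice is not installed on this host; "
              "consider running the script with jlocal instead of jsub",
              file=sys.stderr)
        return 127
    return run(["webservice", "restart"])
```

Called from webwatcher.py in place of the restart.sh call, this would leave an explicit message in the job's error file when the script lands on a node without the command.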

The Python liveness probe is the easiest thing I could think of. It just asks for the HTTP status, and if it gets anything other than 200, it calls webservice restart (hidden in a .sh file because I sometimes also use it manually as a shortcut).

This liveness probe really should not be needed at all. We have system level monitoring for all webservice jobs. This monitoring is slightly different for webservices run with the gridengine backend vs those using the kubernetes backend, but both systems will ensure that the webservice is running somewhere on the associated backend. This monitoring system is looking at the executable process rather than the output of an HTTP request, so there is a chance that the executable is somehow hung. This possibility is what made me think about the liveness probe feature of Kubernetes.
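To illustrate the difference between process-level monitoring and an HTTP-level check, here is a rough sketch of a probe that also treats a hung executable as down. The function name is_alive, the 10-second timeout, and the injectable opener parameter are assumptions for illustration; this is not how the Toolforge monitoring or the Kubernetes liveness probe is implemented.

```python
import socket
import urllib.error
import urllib.request


def is_alive(url, timeout=10, opener=urllib.request.urlopen):
    """Return True only if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.getcode() == 200
    except (urllib.error.URLError, socket.timeout, ConnectionError):
        # HTTPError (4xx/5xx) is a subclass of URLError. A timeout suggests the
        # process is running but not answering, which a process check would miss.
        return False
```

A process that is still running but no longer answers requests would pass a process check yet fail this one, because the request times out.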

The mapycz tool is currently only serving a static index.html document, so I would personally recommend running it with webservice --backend=kubernetes php7.2 start. This should eliminate any need for your own cron-based monitoring except in the most extreme circumstances, where your tool is being overwhelmed by a large number of concurrent requests.

Dvorapa added a comment. Edited Feb 11 2019, 11:41 AM

This liveness probe really should not be needed at all. We have system level monitoring for all webservice jobs.

That is definitely not working well. kmlexport was down for 14 days in January, until someone got mad and e-mailed me. The old .sh probe, several MB in size, was not working correctly, so I created a new .py one.

The mapycz tool is currently only serving a static index.html document, so I would personally recommend running it with webservice --backend=kubernetes php7.2 start. This should eliminate any need for your own cron-based monitoring except in the most extreme circumstances, where your tool is being overwhelmed by a large number of concurrent requests.

I could not make webservice --backend=kubernetes start work here, neither for mapycz nor for kmlexport. For mapycz the .kube folder is probably missing; for kmlexport it fails with T214343.

Dvorapa added a comment. Edited Feb 11 2019, 11:44 AM

BTW, I don't know if you know, but multiple Toolforge users had all their cron jobs, grid jobs, and webservices cancelled on 7 February on Trusty with no warning. So kmlexport and mapycz were down for about 24 hours (and T215704 didn't help me), and multiple cswiki maintenance robots have been down since then too, also without any fix from your webservice monitoring. That's why I decided to move my tools to Stretch the very next day: everything was down, so why wait.

This comment was removed by Dvorapa.

I could not make webservice --backend=kubernetes start work here, neither for mapycz nor for kmlexport. For mapycz the .kube folder is probably missing; for kmlexport it fails with T214343.

The missing $HOME/.kube directory for mapycz is not a problem I was aware of. I can see that there is a token in the Kubernetes auth control file for this tool, so either there was a crash at the time it was generated or someone deleted the directory at some point. I will try to fix that.

Mentioned in SAL (#wikimedia-cloud) [2019-02-12T01:24:52Z] <bd808> Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers (T215704)

$ sudo become mapycz
$ webservice --backend=kubernetes php7.2 shell
Defaulting container name to interactive.
Use 'kubectl describe pod/interactive -n mapycz' to see all of the containers in this pod.
If you don't see a command prompt, try pressing enter.
$ pstree
bash───pstree
$ ls
access.log           error.log       restart.sh
cron-webwatcher.err  public_html     service.manifest
cron-webwatcher.out  replica.my.cnf  webwatcher.py

That problem is at least fixed now.

BTW, I don't know if you know, but multiple Toolforge users had all their cron jobs, grid jobs, and webservices cancelled on 7 February on Trusty with no warning. So kmlexport and mapycz were down for about 24 hours (and T215704 didn't help me), and multiple cswiki maintenance robots have been down since then too, also without any fix from your webservice monitoring. That's why I decided to move my tools to Stretch the very next day: everything was down, so why wait.

I did not know, apparently because no one filed a bug or reported this in the #wikimedia-cloud irc channel. I have looked through our SAL logging, email alerts, and irc logs and do not see an obvious reason for mass job failures. I can understand why an unplanned outage would make you want to look at better monitoring solutions, but I can't help fix problems that nobody reports either.

No problem. All the bots and tools I'm aware of are up and running now, and the outage also pushed us to move to Stretch, so maybe the failure was a good thing. I'll look into kube on tools later. Thank you for your investigation and work.