I have cleaned this problem up for a couple of tools now, most recently T267368: Meetbot server is down.
From the tool maintainer's point of view, everything looks to be running correctly, but the front proxy responds with a "503 No webservice" message instead of the output of the tool's running webservice.
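A quick way to see the symptom from the outside is to request the tool's public URL directly; the response comes from the proxy, not the tool. The tool name and URL below are placeholders, shown only to illustrate:

```
# example-tool is a hypothetical tool name
$ curl -sI https://example-tool.toolforge.org/
HTTP/1.1 503 Service Unavailable
...
```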
Here is one scenario that can lead to this:
- webservice --backend=gridengine ... is running for the tool
- The front proxy has a routing entry in its redis data store pointing to the IP and port of the grid engine job.
- Something unexpected happens which causes the process launched by webservice-runner on the job grid to die without executing the normal post-script that removes the route registration from the front proxy.
- The tool maintainer notices that their webservice is down and runs webservice ... to restart it.
- The tool maintainer either does not notice, or is happy with, the switch to --backend=kubernetes being the new default.
- kubectl get pods and other debugging show the webservice running properly on the Kubernetes cluster.
- Requests for the tool via the front proxy find the dangling registration of a grid engine IP and port (see the sketch after this list for one way to spot such a stale route).
- The front proxy tries to reverse proxy the request to that IP and port and cannot establish the necessary TCP socket because the grid job died.
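A dangling registration like this can be confirmed from the front proxy host. The redis key names below are an assumption made for illustration (the authoritative schema is whatever the proxy implementation uses), and example-tool plus the IP and port are placeholders:

```
# Run on the front proxy host; key layout is assumed, not authoritative.
$ redis-cli keys 'prefix:*' | grep example-tool   # locate the tool's route key
$ redis-cli hgetall prefix:example-tool           # shows the registered backend, e.g. http://10.x.y.z:40005
# If nothing is listening on that address any more, the route is dangling:
$ nc -z -w 2 10.x.y.z 40005 || echo "no listener: dangling route"
```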
I believe this has happened more frequently in recent weeks/months because of the migration of grid engine worker nodes to the Ceph storage scheme. That wave of disruptions is over now, but an unknown number of dangling route registrations have been left behind in the front proxy as a result.
Quick fix:
- Stop the Kubernetes webservice.
- Run webservice --backend=gridengine start to start a new grid engine backend process.
- Run webservice stop to kill the grid engine backend process in the normal manner, which includes deleting the route registration from the front proxy.
- Run webservice [type] start to start the Kubernetes webservice again.
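Put together, the quick fix might look like the following session for a hypothetical tool named example-tool; the python3.7 type on the last line is only an example, use whatever type the tool actually runs:

```
$ become example-tool
$ webservice stop                         # stop the Kubernetes webservice
$ webservice --backend=gridengine start   # start a grid engine backend so a fresh route is registered
$ webservice stop                         # normal stop: kills the grid job and deletes the proxy route
$ webservice python3.7 start              # restart on Kubernetes (now the default backend)
```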