
Front proxy can keep bad routing info for webservices previously running on the grid engine
Closed, Declined · Public

Description

I have cleaned this problem up for a couple of tools now, most recently T267368: Meetbot server is down.

From the tool maintainer's point of view everything looks to be running correctly, but the front proxy is responding with a "503 No webservice" message instead of the output of the tool's running webservice.

Here is one scenario that can lead to this:

  1. webservice --backend=gridengine ... is running for the tool
  2. The front proxy has a routing entry in its redis data store pointing to the IP and port of the grid engine job.
  3. Something unexpected happens which causes the process launched by webservice-runner on the job grid to die without executing the normal post-script that removes the route registration from the front proxy.
  4. The tool maintainer notices that their webservice is down and runs webservice ... to restart it.
  5. The tool maintainer does not notice, or is happy with, the switch to --backend=kubernetes being the new default.
  6. kubectl get pods and other debugging shows the webservice running properly on the Kubernetes cluster.
  7. Requests for the tool via the front proxy find the dangling registration of a grid engine IP and port (a way to inspect such an entry is sketched after this list).
  8. The front proxy tries to reverse proxy the request to that IP:port and cannot establish a TCP connection because the grid job died.
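
A hypothetical way to spot such a dangling entry from the proxy host, sketched in Python. The connection details and the key layout (redis hashes named prefix:<tool>, each field mapping a route to a backend URL) are assumptions made for illustration and may not match what the front proxy actually stores:

```python
# Sketch only: the redis key layout ("prefix:<tool>") and connection details
# are assumptions for illustration, not verified against the real proxy code.
import socket
from urllib.parse import urlparse

import redis

TOOL = "example-tool"  # hypothetical tool name

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
routes = r.hgetall(f"prefix:{TOOL}")  # e.g. {"/": "http://10.68.x.y:40123"}

for route, backend in routes.items():
    parsed = urlparse(backend)
    try:
        # If the grid job died, nothing is listening on the registered port.
        socket.create_connection((parsed.hostname, parsed.port), timeout=2).close()
        print(f"{route} -> {backend}: backend reachable")
    except OSError:
        print(f"{route} -> {backend}: dangling route (no listener)")
```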

I believe this has happened more frequently in recent weeks/months because of migration of grid engine worker nodes to the Ceph storage scheme. That wave of disruptions is over now, but there are an unknown number of dangling route registrations left in the front proxy as a result.


Quick fix:

  1. Stop the Kubernetes webservice
  2. Run webservice --backend=gridengine start to start a new grid engine backend process
  3. Run webservice stop to kill the grid engine backend process in the normal manner, which includes deleting the route registration from the front proxy
  4. Run webservice [type] start to start the Kubernetes webservice

Related Objects

Event Timeline

One possible bandaid for this would be to have the front proxy remove routing registrations when the proxying fails. Basically handle the 503 on the front proxy by cleaning up the route before returning the 503 message to the original requestor. In the switch from grid to k8s scenario, the first request that is expected to hit the k8s pod would still return a 503, but a refresh should then pass through and hit the k8s cluster.
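
Sketching what that bandaid amounts to, in Python rather than whatever the proxy actually runs, and reusing the assumed prefix:<tool> key layout from the sketch above:

```python
# Sketch of the "clean up on 503" bandaid; the key layout is still an assumption.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def handle_upstream_failure(tool: str, route: str) -> None:
    """Called when the proxy cannot open a TCP connection to the registered backend.

    Drop the dead route before answering 503, so the next request (e.g. a
    browser refresh) is resolved fresh instead of hitting the dangling
    grid engine registration again.
    """
    r.hdel(f"prefix:{tool}", route)
```

The tradeoff described above still applies: the request that discovers the dead backend gets a 503, and only subsequent requests benefit from the cleanup.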

A different fix would be to have the k8s backend send a message to the front proxy on process start telling the front proxy to clear any and all registered routes for the tool. This feels like a more robust solution.
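
In that approach the start-up path would simply wipe whatever the proxy has recorded for the tool before registering anything new. Roughly, and still under the same assumed (unverified) key layout, with a hypothetical proxy redis endpoint:

```python
# Sketch: clear any stale front proxy routes for a tool at webservice start.
# "prefix:<tool>" and "front-proxy.example" are illustrative assumptions.
import redis


def clear_stale_routes(tool: str) -> None:
    r = redis.Redis(host="front-proxy.example", port=6379, decode_responses=True)
    r.delete(f"prefix:{tool}")  # drops every route registered for the tool
```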

I wonder if this was caused by grid engine nodes being migrated to ceph? The processes might have died without the grid job being deleted and thus not running the post-script.

The gridmaster's messages file is literally filled with messages like this:

11/05/2020 02:29:38|worker|tools-sgegrid-master|E|execd@tools-sgeexec-0909.tools.eqiad.wmflabs reports running job (336318.1/master) in queue "continuous@tools-sgeexec-0909.tools.eqiad.wmflabs" that was not supposed to be there - killing

I don't know if that helps.

Cleaned up all the error'd jobs in the grid. Hopefully that helps things.

> A different fix would be to have the k8s backend send a message to the front proxy on process start telling the front proxy to clear any and all registered routes for the tool. This feels like a more robust solution.

What about cleaning up old entries instead? Can we track which routes are "alive" and remove the ones that are not?
Maybe something like "if a route has not had a successful request in the last <grace period>, remove it from the routing table". That would clean up any old routes while still allowing new flaky jobs to start, even if they take a while to reply correctly.
It also lets the k8s backend stay 'simple', as that route registry won't be needed in the long run, if I understood correctly.

+1 to the garbage collector approach. Here is the logic I see for it (a rough sketch follows after this comment):

  • the script runs on the active front proxy node
  • it fetches the redis entries (remember, only grid engine webservices live here!)
  • it checks each entry in redis: if the backend is offline for whatever reason, it is marked for deletion
  • it waits a few seconds
  • it checks each entry marked for deletion again: if the 2nd request also fails, the entry is removed from redis!
  • if the 2nd request succeeds, the entry is left alone

This sounds fun! I volunteer to write it!
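
A minimal sketch of that garbage collector, assuming the same hypothetical prefix:<tool> key layout as earlier and using plain TCP reachability as the liveness check; the real script would have to match how the proxy actually stores and probes its routes:

```python
# Two-pass garbage collector sketch for dangling grid engine routes.
# Assumptions (illustrative, not verified): routes live in redis hashes named
# "prefix:<tool>" with fields mapping a route to a backend URL, and a backend
# that refuses TCP connections twice in a row is dead.
import socket
import time
from urllib.parse import urlparse

import redis

RECHECK_DELAY = 5  # seconds to wait between the first and second check


def backend_alive(backend_url: str, timeout: float = 2.0) -> bool:
    parsed = urlparse(backend_url)
    try:
        socket.create_connection((parsed.hostname, parsed.port), timeout=timeout).close()
        return True
    except OSError:
        return False


def collect_garbage(r: redis.Redis) -> None:
    # First pass: mark entries whose backend looks offline.
    suspects = []
    for key in r.scan_iter(match="prefix:*"):
        for route, backend in r.hgetall(key).items():
            if not backend_alive(backend):
                suspects.append((key, route, backend))

    if not suspects:
        return
    time.sleep(RECHECK_DELAY)

    # Second pass: only delete entries that fail again; flaky-but-alive
    # backends are left alone.
    for key, route, backend in suspects:
        if not backend_alive(backend):
            r.hdel(key, route)


if __name__ == "__main__":
    collect_garbage(redis.Redis(host="localhost", port=6379, decode_responses=True))
```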

I prefer the "garbage collector script" approach; it should be able to handle many different cases of unclean deregistration, not only the grid -> k8s migration scenario. But perhaps the other solutions proposed by @bd808 could be implemented too, for extra robustness.

Andrew triaged this task as Medium priority. Dec 8 2020, 5:45 PM
Andrew moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

There already is a script that is *supposed* to do the cleanup here, but it just restarts stuff based on a very cursory analysis. I'll get the name of that script and maybe we can just update that.

The thing is https://gerrit.wikimedia.org/r/admin/repos/operations/software/tools-manifest

The webservicemonitor script is meant to keep an eye on gridengine web services.