
Service health monitoring / connection timeouts
Closed, ResolvedPublic

Description

All services need to properly handle long-running requests:

  • Client connection should receive an error if the service takes longer than N seconds, not an abrupt connection drop
  • Resilience - maybe allow Varnish to re-issue a request to another backend service if the first one took too long.
  • Hung service should be restarted if it no longer responds to a heartbeat
  • Possibly handle cases of "responds to heartbeat but doesn't handle any requests"

Event Timeline

Yurik raised the priority of this task from to Needs Triage.
Yurik updated the task description. (Show Details)
Yurik added subscribers: Yurik, GWicke, mobrovac, ori.
Yurik renamed this task from Handle service timeouts to Service health monitoring / connection timeouts. Feb 22 2015, 12:20 AM
Yurik set Security to None.

Based on our IRC conversation I believe that you are concerned with CPU-bound code (e.g. infinite loops). Those are normally bugs, so they should be fixed. I do agree though that it would be nice to systematically enforce per-request CPU timeouts. In node 0.10 this is quite difficult, but in iojs and 0.12 the vm module (which exposes v8's sandbox capabilities) has gained a timeout parameter, which might make it easier to enforce CPU limits.
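
For illustration, a minimal sketch of what such a per-request CPU limit could look like with the vm module's timeout option; the script, sandbox and limit value are placeholders, not service-runner code:

```
var vm = require('vm');

function runWithCpuLimit(code, sandbox, limitMs) {
    try {
        // Throws if the synchronous script runs longer than limitMs.
        return vm.runInNewContext(code, sandbox, { timeout: limitMs });
    } catch (e) {
        // A CPU timeout surfaces as an error here instead of hanging the worker.
        throw new Error('CPU limit of ' + limitMs + 'ms exceeded: ' + e.message);
    }
}
```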

The Parsoid team has developed an intermediate solution, which works by setting a per-request timer in the master process and killing the worker if the timeout has expired. This does however kill the other requests in the same process as well. Alternatively, it could perform a graceful restart by using worker.disconnect() first (which lets other requests finish), followed by a hard kill a minute or so later.
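
A rough sketch of that master-side approach, assuming the cluster module and illustrative timeout values (this is not Parsoid's actual implementation):

```
var cluster = require('cluster');

// Arm a per-request timer in the master; the caller clears the returned
// timer when the request completes normally.
function armRequestTimer(worker, requestTimeoutMs, hardKillDelayMs) {
    return setTimeout(function() {
        // Graceful restart: stop routing new connections to this worker
        // and let its other in-flight requests finish.
        worker.disconnect();
        // If the worker still hasn't exited after the grace period, kill it hard.
        var hardKill = setTimeout(function() {
            worker.kill('SIGKILL');
        }, hardKillDelayMs);
        worker.on('exit', function() {
            clearTimeout(hardKill);
        });
    }, requestTimeoutMs);
}
```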

Another more general variant on this that would be more self-contained is to set up a periodic heartbeat from the worker (which can be set up transparently in service-runner), and do the graceful restart of the worker if the heartbeat stops.
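
A hedged sketch of that heartbeat variant; the message shape, interval and window below are assumptions for illustration, not service-runner's actual protocol:

```
var cluster = require('cluster');

if (cluster.isWorker) {
    // Worker side: send a heartbeat over the IPC channel every few seconds.
    setInterval(function() {
        process.send({ type: 'heartbeat' });
    }, 5000);
} else {
    var worker = cluster.fork();
    var lastBeat = Date.now();
    worker.on('message', function(msg) {
        if (msg && msg.type === 'heartbeat') {
            lastBeat = Date.now();
        }
    });
    // Master side: if no heartbeat arrived within the window, restart the
    // worker gracefully (disconnect, then fork a replacement).
    setInterval(function() {
        if (Date.now() - lastBeat > 15000) {
            worker.disconnect();
            worker = cluster.fork();
            lastBeat = Date.now();
        }
    }, 5000);
}
```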

Most of your other concerns aren't really CPU timeout related, so let me respond inline:

Client connection should receive an error if the service takes longer than N seconds, not an abrupt connection drop

The HTTP timeouts we set in Varnish are normally shorter anyway, so this isn't a problem in practice.

Resilience - maybe allow Varnish to re-issue a request to another backend service if the first one took too long.

Varnish retries idempotent requests by default, so there is nothing special needed here.

Possibly handle cases of "responds to heartbeat but doesn't handle any requests"

AFAIK it is not easily possible to target requests at a specific worker, as all workers listen on the same socket. Errors like this would be picked up by regular monitoring (including LVS) in any case, so I'm not too concerned about this case. The majority of the cases are a) CPU- or b) memory-related lock-ups, of which a) can be covered by heartbeats, and b) is covered by T89932.

So far my biggest concern has been long network timeouts when my service calls the API (this was due to the host being unable to access itself via its public hostname). So IO-bound issues have been the higher priority.

So far my biggest concern has been long network timeouts when my service calls the API (this was due to the host being unable to access itself via its public hostname).

Request timeouts and retries are normally handled by the library you use to make those calls, preq for example. That's per request and should not require any special support in service-runner.
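
For example, something along these lines; the `timeout` and `retries` option names are my reading of preq's interface and should be checked against its documentation:

```
var preq = require('preq');

preq.get({
    uri: 'https://example.org/api',   // placeholder URI
    timeout: 2000,                    // fail the request after 2s
    retries: 2                        // retry up to twice before giving up
})
.then(function(res) {
    console.log(res.status);
})
.catch(function(err) {
    console.error('request failed or timed out:', err);
});
```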

@GWicke, I am not the one using preq; the Vega lib does its own resource loading. I could modify it a bit to set a timeout, but there is no per-request "execution context", so if Vega needs multiple remote resources to process one request, the resource-loading code does not know how much longer it is allowed to keep processing.

I work around it by wrapping the whole execution in a race against a timeout promise, but it might make sense to provide that as a lib/template somehow.
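
A minimal version of that race-timeout work-around, with illustrative names and an illustrative 30s budget:

```
// Race the whole per-request execution against a timeout, so the request
// fails with an error instead of hanging while Vega loads its resources.
function withTimeout(promise, ms) {
    var timer;
    var timeout = new Promise(function(resolve, reject) {
        timer = setTimeout(function() {
            reject(new Error('Request processing exceeded ' + ms + 'ms'));
        }, ms);
    });
    return Promise.race([promise, timeout]).then(function(result) {
        clearTimeout(timer);
        return result;
    }, function(err) {
        clearTimeout(timer);
        throw err;
    });
}

// Usage (renderGraph is a hypothetical per-request handler):
// withTimeout(renderGraph(spec), 30000);
```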

The PR for this issue has been merged and service-runner version 0.2.2 has been released, so this ticket can be closed.