Some webservice-created applications do not have functioning liveness checks
Closed, Duplicate, Public

Description

When a critical part of a web application pod is killed by the OOM killer or fails for some other reason, the pod should fail. This lets the tool maintainer see that there is a problem at that level and makes it easier for admins to troubleshoot.

Importantly, a liveness check should be in place for scaled-out applications so that Kubernetes doesn't keep trying to send traffic to a broken pod.

webservice should define pods with some minimum level of health checking, even if that only means making sure it knows which process to watch. This may not be possible in every case, since not every app works the same way. Simply watching TCP port 8000 might be the best we've got.
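As a rough illustration of the "watch TCP port 8000" idea, here is a minimal sketch that adds a `tcpSocket` liveness probe to a container spec represented as a plain dict. The helper names, the timing values, and the example container are assumptions made for illustration; they are not taken from the webservice code. Only the probe field names (`tcpSocket`, `initialDelaySeconds`, `periodSeconds`, `failureThreshold`) come from the Kubernetes Probe schema.

```python
import json


def default_liveness_probe(port=8000):
    """Build a tcpSocket liveness probe as a plain dict.

    The timing values are illustrative guesses, not tuned defaults.
    """
    return {
        "tcpSocket": {"port": port},
        "initialDelaySeconds": 15,  # give the app time to start listening
        "periodSeconds": 10,        # check every 10 seconds
        "failureThreshold": 3,      # restart after 3 consecutive failures
    }


def with_liveness_probe(container, port=8000):
    """Return a copy of a container spec with a livenessProbe added.

    The input dict is not modified.
    """
    patched = dict(container)
    patched["livenessProbe"] = default_liveness_probe(port)
    return patched


if __name__ == "__main__":
    # Hypothetical container spec, for demonstration only.
    container = {
        "name": "webservice",
        "image": "example/image:latest",
        "ports": [{"containerPort": 8000}],
    }
    print(json.dumps(with_liveness_probe(container), indent=2))
```

A TCP check like this only proves that something is accepting connections on the port, so it would catch a dead listener but not an application that accepts connections and then hangs.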

Event Timeline

Bstorm created this task.

"some or all" may be more accurate here.

We are relying on complete PID 1 process failure today rather than any sort of functional liveness test in the Pod description. For containers which are running uWSGI (python*) or lighttpd (php*) as PID 1, there are definitely situations where things can be obviously broken but PID 1 is still running.

Finding a default liveness probe would be a good start. Going on to make it possible to configure a custom probe that ties into internal application logic would be ideal. I keep hoping that, rather than adding more complexity to our homegrown webservice command, we will jump ahead to using some upstream FLOSS PaaS solution (like OpenShift, Rancher, Knative, etc.). In addition to the liveness issue, we also have, at minimum, various container limits, replica count, and ingress customizations that should be relatively easy to configure for a Kubernetes webservice. This list is likely to keep growing as we unblock more powerful Kubernetes features on the platform side.
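To make the "default probe, optionally overridden by a custom one" idea concrete, here is a minimal sketch. It assumes a hypothetical `http_path` option that a tool maintainer could set; no such option is described in this task. The function name and default values are likewise illustrative. Only the `httpGet`, `tcpSocket`, `periodSeconds`, and `failureThreshold` field names come from the Kubernetes Probe schema.

```python
def build_liveness_probe(port=8000, http_path=None,
                         period_seconds=10, failure_threshold=3):
    """Return a liveness probe dict.

    With no http_path, fall back to a plain TCP check on the port.
    With an http_path (hypothetical maintainer-supplied setting), use an
    HTTP GET so the check can reach into application logic.
    """
    if http_path is not None:
        if not http_path.startswith("/"):
            raise ValueError("http_path must start with '/'")
        check = {"httpGet": {"path": http_path, "port": port}}
    else:
        check = {"tcpSocket": {"port": port}}

    probe = dict(check)
    probe["periodSeconds"] = period_seconds
    probe["failureThreshold"] = failure_threshold
    return probe


if __name__ == "__main__":
    # Default: TCP check on port 8000.
    print(build_liveness_probe())
    # Custom: HTTP check against a hypothetical health endpoint.
    print(build_liveness_probe(http_path="/healthz"))
```

Keeping the TCP check as the fallback means tools that set nothing still get some coverage, while tools with a real health endpoint can opt in to a stricter check.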

bd808 renamed this task from Some webservice-created applications do not have sane liveness checks to Some webservice-created applications do not have functioning liveness checks. Feb 29 2020, 9:21 PM
Andrew lowered the priority of this task from High to Medium. Mar 10 2020, 4:23 PM