Some webservice-created applications do not have functioning liveness checks
Closed, Duplicate
Public
Assigned To

None

Authored By

	• Bstorm
	Feb 29 2020, 5:14 PM

Description

When a critical part of a web application pod is killed by the OOM killer or fails for some other reason, the pod should fail. This lets the tool maintainer see that there is a problem at that level and lets admins troubleshoot more easily.

Importantly, a health check should be in place for scaled-out applications so that Kubernetes doesn't keep trying to send traffic to a broken pod.

webservice should define pods with some minimum level of health checking, even if that is just making sure that it knows which process to watch. This may or may not be possible in all cases, since not every app works the same way. Simply watching TCP port 8000 might be the best we've got.
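To make "watching TCP port 8000" concrete, here is a minimal Python sketch of what such a check amounts to: try to open a TCP connection and treat success as "alive". The function name `tcp_port_is_open`, the default host, and the timeout are illustrative assumptions, not part of webservice.

```python
import socket


def tcp_port_is_open(host: str = "127.0.0.1", port: int = 8000,
                     timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only proves that something is accepting
    connections on the port, not that the application behind it is healthy.
    """
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("alive" if tcp_port_is_open() else "not alive")
```

A check like this would catch a dead listener, but not an application that still accepts connections while returning errors.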

Related Objects

		Status	Subtype	Assigned	Task
		Resolved	BUG REPORT	• Bstorm	T246523 Scholia on Toolforge is not responsive and does not start
		Duplicate		None	T246540 Some webservice-created applications do not have functioning liveness checks

Event Timeline

"some or all" may be more accurate here.

In T246540#5929307, @Bstorm wrote:

"some or all" may be more accurate here.

We are relying on complete PID 1 process failure today rather than any sort of functional liveness test in the Pod description. For containers which are running uWSGI (python*) or lighttpd (php*) as PID 1, there are definitely situations where things can be obviously broken but PID 1 is still running.

Finding a default liveness probe would be a good start. Going on to make it possible to configure a custom probe that might be tied into internal application logic would be ideal. I keep hoping that, rather than adding more complexity to our home-grown webservice command, we will jump ahead into using some upstream FLOSS PaaS solution (like OpenShift, Rancher, Knative, etc.). In addition to the liveness issue, we also have, at minimum, various container limits, replica count, and ingress customizations that should be relatively easy to configure for a Kubernetes webservice. This list is likely to keep growing as we unblock more powerful Kubernetes features on the platform side.
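As a rough illustration of "a default probe, with an optional custom override", the following Python sketch builds the container portion of a pod spec as a plain dict. The field names (`livenessProbe`, `tcpSocket`, `httpGet`, `initialDelaySeconds`, `periodSeconds`, `failureThreshold`) are standard Kubernetes probe fields. The function names, the timing values, the image name, and the `/healthz` path are assumptions made up for this example and do not reflect how webservice is actually implemented.

```python
import json
from typing import Any, Dict, Optional

DEFAULT_PORT = 8000  # the port mentioned in the task description


def default_liveness_probe(port: int = DEFAULT_PORT) -> Dict[str, Any]:
    """Hypothetical default: a TCP socket probe with assumed timings."""
    return {
        "tcpSocket": {"port": port},
        "initialDelaySeconds": 5,   # assumed value
        "periodSeconds": 10,        # assumed value
        "failureThreshold": 3,      # assumed value
    }


def container_spec(name: str, image: str, port: int = DEFAULT_PORT,
                   custom_probe: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a container spec, using the custom probe if one is given."""
    probe = custom_probe if custom_probe is not None else default_liveness_probe(port)
    return {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        "livenessProbe": probe,
    }


if __name__ == "__main__":
    # Default TCP probe.
    print(json.dumps(container_spec("webservice", "example-image"), indent=2))
    # Custom HTTP probe tied to an application-provided endpoint (hypothetical path).
    custom = {"httpGet": {"path": "/healthz", "port": DEFAULT_PORT}}
    print(json.dumps(container_spec("webservice", "example-image",
                                    custom_probe=custom), indent=2))
```

The default keeps tools working without any configuration, while the override leaves room for a probe tied to application logic.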

bd808 renamed this task from Some webservice-created applications do not have sane liveness checks to Some webservice-created applications do not have functioning liveness checks. Feb 29 2020, 9:21 PM

Andrew lowered the priority of this task from High to Medium. Mar 10 2020, 4:23 PM

• Bstorm removed • Bstorm as the assignee of this task. Jun 18 2020, 10:36 PM

bd808 mentioned this in T314053: Allow automatically restarting tool web services if non OK error code. Jul 29 2022, 4:30 PM

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban). Jan 18 2023, 7:32 PM

fnegri moved this task from Kanban to Inbox on the cloud-services-team board.

bd808 mentioned this in T322483: Is it possible to run webservice restart in toolforge-jobs?. Jan 5 2024, 1:51 AM

Dvorapa subscribed. Feb 17 2024, 6:37 PM

bd808 closed this task as a duplicate of T341919: Support probes in kubernetes webservices. Feb 25 2024, 10:18 PM