Add /healthz system status check endpoint
Closed, Resolved (Public)
Assigned To

Authored By

	• bd808
	Mar 3 2021, 5:55 PM

Description

Using a /healthz route as a check for service health is a common Kubernetes convention (which seems to come from Google internal practices).

There are a few Django packages intended for making this sort of health check.

Details

	Subject	Repo	Branch	Lines +/-
	monitoring: Add /healthz endpoint	wikimedia/toolhub	main	+8 -0

Related Objects

Status	Assigned	Task
Open	None	T288685 Establish active/active multi-dc support for Toolhub
Resolved	• bd808	T115650 Create an authoritative and well promoted catalog of Wikimedia tools
Resolved	• bd808	T271483 Complete and announce initial production deployment of Toolhub
Resolved	• bd808	T276373 Add /healthz system status check endpoint

Event Timeline

• bd808 created this task. Mar 3 2021, 5:55 PM

Restricted Application added a subscriber: Aklapper. Mar 3 2021, 5:56 PM

• bd808 added a parent task: T271483: Complete and announce initial production deployment of Toolhub. Mar 3 2021, 5:56 PM

After some discussion on IRC with @CDanis, @Joe, @RLazarus, and @JMeybohm, I have a different understanding of /healthz. Basically, we should only be testing that the container's main process (whatever our WSGI container is) is alive and the app is mounted. Testing external dependencies like databases and cache systems is not desired.

@CDanis summarized the discussion like this:

[16:51]  <   cdanis> bd808: yeah, basically liveness should just demonstrate that your process isn't deadlocked or unresponsive.  if it does depending on external things -- especially if you have transitive chains of that -- thrashing lots of pods with restarts becomes a real concern, amongst other things
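
To make the liveness-only idea concrete, here is a minimal, framework-agnostic sketch of such an endpoint as a plain WSGI callable. This is not the change that was merged for Toolhub; the function name `healthz_app` and the JSON body `{"status": "OK"}` are assumptions chosen for illustration. In a Django project the equivalent would presumably be a small view wired into the URL configuration. The point is only that the handler answers from within the process and never touches a database, cache, or other external service.

```python
import json


def healthz_app(environ, start_response):
    """Hypothetical WSGI liveness endpoint.

    Returns 200 for /healthz without contacting any external
    dependency, so a failing database cannot trigger pod restarts.
    """
    if environ.get("PATH_INFO") == "/healthz":
        # Static payload: proves only that the process can serve requests.
        body = json.dumps({"status": "OK"}).encode("utf-8")
        start_response(
            "200 OK",
            [
                ("Content-Type", "application/json"),
                ("Content-Length", str(len(body))),
            ],
        )
        return [body]

    # Anything else is outside the scope of this sketch.
    body = b"Not Found"
    start_response(
        "404 Not Found",
        [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

To try it locally, the callable can be served with the standard library, for example `wsgiref.simple_server.make_server("", 8000, healthz_app).serve_forever()`, and then queried at `/healthz`.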