Using a /healthz route as a check for service health is a common Kubernetes convention (which seems to come from Google internal practices).
There are a few Django packages intended for making this sort of health check:
Subject | Repo | Branch | Lines +/-
---|---|---|---
monitoring: Add /healthz endpoint | wikimedia/toolhub | main | +8 -0
Status | Assigned | Task
---|---|---
Open | None | T288685 Establish active/active multi-dc support for Toolhub
Resolved | bd808 | T115650 Create an authoritative and well promoted catalog of Wikimedia tools
Resolved | bd808 | T271483 Complete and announce initial production deployment of Toolhub
Resolved | bd808 | T276373 Add /healthz system status check endpoint
After some discussion on IRC with @CDanis, @Joe, @RLazarus, and @JMeybohm, I have a different understanding of /healthz. Basically, we should only be testing that the container's main process (whatever our WSGI container is) is alive and that the app is mounted. Testing external dependencies such as databases and cache systems is not desired.
@CDanis summarized the discussion like this:
[16:51] < cdanis> bd808: yeah, basically liveness should just demonstrate that your process isn't deadlocked or unresponsive. if it does depending on external things -- especially if you have transitive chains of that -- thrashing lots of pods with restarts becomes a real concern, amongst other things
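A minimal sketch of the liveness-only approach described above, assuming a generic WSGI deployment: the middleware answers `/healthz` with `200 OK` directly and never opens a database or cache connection, so a slow or failing dependency cannot make the orchestrator restart otherwise healthy pods. `HealthzMiddleware` and `demo_app` are hypothetical names used only for illustration; this is not the code from change 708608, and the real Toolhub endpoint may be implemented differently (for example as a Django view).

```python
HEALTHZ_PATH = "/healthz"  # route name taken from this task


class HealthzMiddleware:
    """Hypothetical WSGI middleware for a liveness-only /healthz check.

    It deliberately does not touch databases, caches, or other external
    services: answering at all shows the worker process is alive and
    able to serve requests through the wrapped application.
    """

    def __init__(self, app, path=HEALTHZ_PATH):
        self.app = app
        self.path = path

    def __call__(self, environ, start_response):
        if environ.get("PATH_INFO") == self.path:
            body = b"OK"
            start_response(
                "200 OK",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Every other request goes to the real application unchanged.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real WSGI application."""
    body = b"not found"
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [body]


application = HealthzMiddleware(demo_app)
```

A 200 response here only demonstrates that a worker loaded the application and can answer requests, which is the property a liveness check is meant to verify.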
Change 708608 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):
[wikimedia/toolhub@main] monitoring: Add /healthz endpoint
Change 708608 merged by jenkins-bot:
[wikimedia/toolhub@main] monitoring: Add /healthz endpoint