Support probes in kubernetes webservices
Closed, Resolved · Public · Feature

Description

Feature summary:
The webservice command should include support to declare k8s probes for webservices using the Kubernetes backend.

Use case(s):

  • zero-downtime restart (compare T337182) using startup probes (startupProbe)
  • automatic restart in case of issues (e.g. worker node problems) using readiness or liveness probes (readinessProbe, livenessProbe)

Benefits:
With native support of probes in webservice, users won’t have to bypass webservice and interact with k8s objects directly to fulfill the use cases listed above. (Currently, users can patch probes into their deployment on their own, but any such patches will be lost on the next webservice restart, because that recreates the entire deployment; so until webservice supports probes natively, such users will have to do restarts without it, using kubectl rollout restart deployment.)
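
For readers unfamiliar with the probe types named above, here is a minimal sketch of what such probe definitions look like in a container spec, built as a Python dict and printed as JSON. The helper names, the port 8000, the /healthz path and all timing values are assumptions for illustration only; this is not necessarily what webservice generates.

```python
import json


def http_probe(path: str, port: int, period_seconds: int = 10,
               failure_threshold: int = 3) -> dict:
    """Return a Kubernetes HTTP probe definition as a plain dict."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


def container_probes(path: str = "/healthz", port: int = 8000) -> dict:
    """Build the three probe stanzas for one container (illustrative values)."""
    return {
        # Startup probe: gives a slow-starting service time to come up
        # (here up to 30 checks, 5 seconds apart) before other probes run.
        "startupProbe": http_probe(path, port, period_seconds=5,
                                   failure_threshold=30),
        # Liveness probe: the container is restarted after repeated failures.
        "livenessProbe": http_probe(path, port),
        # Readiness probe: the pod only receives traffic while this passes.
        "readinessProbe": http_probe(path, port),
    }


if __name__ == "__main__":
    print(json.dumps(container_probes(), indent=2))
```

Patching something like this into a deployment by hand is exactly the workaround described above, and it is lost whenever webservice recreates the deployment.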

Details

Title | Reference | Author | Source Branch | Dest Branch
d/changelog: bump to 0.103.2 | repos/cloud/toolforge/tools-webservice!25 | dcaro | bump_to_0.103.2 | main
k8s: allow passing the http probe path | repos/cloud/toolforge/tools-webservice!24 | dcaro | add_http_probes | main

Event Timeline

I don’t think webservice necessarily has to support patching in raw YAML snippets for each probe type, at least not as the main mode of this feature. We can probably assume that HTTP probes are the common case, and standardize on a common default path (e.g. /health or /healthz), which would then also be filtered out of access logs by default (T127367? though that talks about error logs).

See also T314053: Allow automatically restarting tool web services if non OK error code (which I think is closer to the automatic restart use case than the zero-downtime restart use case).

and standardize on a common default path (e.g. /health or /healthz), which would then also be filtered out of access logs by default

And if you have some good reason to deviate from the default path (my preference is /healthz since it doesn't conflict with a real word), it could be configurable via service.template.
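
To make the idea of a common health path concrete, here is a minimal sketch of a tool answering on /healthz, using only the Python standard library. The handler name, the make_server helper, the response body and the port 8000 are assumptions for illustration; a real tool would add such a route to its existing web framework. Kubernetes HTTP probes treat any status from 200 to 399 as success.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

HEALTH_PATH = "/healthz"  # assumed default path, as suggested in this thread


class HealthHandler(BaseHTTPRequestHandler):
    """Answer the health path with 200 and everything else with 404."""

    def do_GET(self):
        # Ignore any query string when matching the health path.
        path = self.path.split("?", 1)[0]
        if path == HEALTH_PATH:
            body = b"ok\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)


def make_server(host: str = "0.0.0.0", port: int = 8000) -> ThreadingHTTPServer:
    """Create (but do not start) the HTTP server."""
    return ThreadingHTTPServer((host, port), HealthHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

Keeping the handler cheap matters because the probe is called repeatedly for the lifetime of the pod.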

dcaro triaged this task as Medium priority. Feb 8 2024, 9:00 AM
dcaro changed the task status from Open to In Progress. Feb 9 2024, 11:48 AM
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 05) board.

standardize on a common default path (e.g. /health or /healthz), which would then also be filtered out of access logs by default (T127367? though that talks about error logs).

Liveness and other lifecycle probes would happen from within Kubernetes and thus show on grafana dashboards via the Kubernetes stats import to Prometheus, but they would not route through the front proxy where we collect access log data for https://toolviews.toolforge.org/.

Mentioned in SAL (#wikimedia-cloud) [2024-02-28T11:57:21Z] <dcaro> deploy tools-webservice 0.103.2 with probes (T341919)

This is available for use now. I'll leave the task open for the next few days while I monitor https://grafana.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1

Current state:

image.png (1×3 px, 438 KB)

@dcaro, this is awesome and I think also completely worthy of a message to cloud-announce. :)

Mentioned in SAL (#wikimedia-cloud) [2024-02-28T18:50:17Z] <wmbot~lucaswerkmeister@tools-sgebastion-10> deployed 3030faaa3c (health-check-path, T341919)

Seems to work like a charm, thanks a lot! The “only terminate old pod once new one is ready” seems to behave as expected:

Screenshot from 2024-02-28 19-49-39.png (633×952 px, 165 KB)

(The fact that the pods often seem to need exactly one restart before they become ready is strange but not new; that’s been happening on and off for a while.)

We can probably assume that HTTP probes are the common case, and standardize on a common default path (e.g. /health or /healthz), which would then also be filtered out of access logs by default (T127367? though that talks about error logs).

FWIW, as long as we don’t have the common logging solution yet, filtering out /healthz from the uWSGI logs turns out to be easy enough for Python webservices: see uwsgi.ini in the commit in my tool.
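
For tools whose web server writes its access log through Python's logging module rather than through uWSGI, a similar effect can be sketched with a logging filter. This is not the uwsgi.ini approach from the commit mentioned above; the class name, the logger name "access" and the assumed log line shape ("GET /healthz HTTP/1.1") are illustrative assumptions.

```python
import logging


class DropHealthChecks(logging.Filter):
    """Drop log records that look like requests for the health path."""

    def __init__(self, path: str = "/healthz"):
        super().__init__()
        self.path = path

    def filter(self, record: logging.LogRecord) -> bool:
        # Access log lines usually contain 'GET /healthz HTTP/1.1', so match
        # the path surrounded by spaces to avoid hitting e.g. '/healthzfoo'.
        # Returning False tells the logging module to discard the record.
        return f" {self.path} " not in record.getMessage()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # "access" is a hypothetical logger name; use whichever logger your
    # web server actually writes access lines to.
    access_log = logging.getLogger("access")
    access_log.addFilter(DropHealthChecks())
    access_log.info('"GET /healthz HTTP/1.1" 200')  # filtered out
    access_log.info('"GET /index HTTP/1.1" 200')    # still logged
```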

@dcaro, this is awesome and I think also completely worthy of a message to cloud-announce. :)

+1 :)

If HTTP probes are configurable in service.template, can that please be documented on Wikitech? If it is not configurable, can that feature be added?

Added something there. (It’s health-check-path: PATH, i.e. the service.template YAML and the command line option use the same key, similar to mem:/--mem or cpu:/--cpu.)

Thanks, yes, that's the key. You can see all the options that are set in the service.manifest file after you run toolforge webservice start ...options..., and you can copy that file to use as a template.
We should also add a subcommand that generates an example template file with all the defaults; that would be helpful.

dcaro moved this task from In Review to Done on the Toolforge (Toolforge iteration 06) board.

Things seem stable, I'll close the task and reopen if any bugs arise.

Could someone point me to how to keep the probe from appearing in access.log for a perl5.36 webservice?

Based on https://redmine.lighttpd.net/projects/lighttpd/wiki/Mod_accesslog#Disable-logging, this might work:

.lighttpd.conf
$HTTP["url"] == "/healthz" { accesslog.filename = "" }