Summary
We want to gather uptime status for the harbor service (currently only on toolsbeta).
The check should be as close as possible to what a user would do, following the steps below.
Steps
Create script
It should send an HTTPS GET request to the toolsbeta harbor to check that it is up (that is, a 200 is returned with some content).
It should print the result to standard output, formatted as:
  # HELP harbor_ui_up <Replace this with a metric description>
  # TYPE harbor_ui_up gauge
  harbor_ui_up{method="get"} 1 1395066363000
Or, when the service is down:
  # HELP harbor_ui_up <Replace this with a metric description>
  # TYPE harbor_ui_up gauge
  harbor_ui_up{method="get"} 0 1395066363000
The last number is the current time (milliseconds since the epoch, i.e. 1970-01-01 00:00:00 UTC, excluding leap seconds).
For a detailed description of the format see [[ https://prometheus.io/docs/concepts/metric_types/ | metric types ]] and [[ https://prometheus.io/docs/concepts/data_model/ | data model ]].
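A minimal Python sketch of such a script, under a few assumptions: the default URL is the one this task names for toolsbeta harbor, the HELP text is a placeholder, and the function names (`is_up`, `format_metric`, `main`) are purely illustrative. The HELP and TYPE lines use the sample's own metric name, as the text format expects. The node-exporter textfile collector documentation states that client-side timestamps are not supported, so the timestamp is kept behind a flag in case it has to be dropped.

```python
#!/usr/bin/env python3
"""Sketch: probe a URL and print a Prometheus gauge in text format."""
import http.client
import sys
import time
import urllib.error
import urllib.request

# URL taken from the task description; the names below are illustrative.
DEFAULT_URL = "https://harbor.toolsbeta.wmflabs.org"


def is_up(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if a GET on `url` answers 200 with a non-empty body."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200 and len(response.read(1)) > 0
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Non-2xx answers, DNS/TLS/connection failures, timeouts, malformed URLs.
        return False


def format_metric(up, now_ms=None, with_timestamp=True):
    """Render the gauge as HELP, TYPE and one sample line."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # milliseconds since the epoch
    sample = f'harbor_ui_up{{method="get"}} {int(bool(up))}'
    if with_timestamp:
        sample += f" {now_ms}"
    lines = [
        "# HELP harbor_ui_up <Replace this with a metric description>",
        "# TYPE harbor_ui_up gauge",
        sample,
    ]
    return "\n".join(lines) + "\n"


def main(argv):
    """Probe the URL given as first argument (or the default) and print the metric."""
    url = argv[1] if len(argv) > 1 else DEFAULT_URL
    sys.stdout.write(format_metric(is_up(url)))
    return 0
```

To use it as a standalone script, end the file with `raise SystemExit(main(sys.argv))`. The `opener` parameter exists only so the probe can be exercised without network access.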
Add the puppet code for it
This will be under the existing profile::toolforge::k8s::worker profile.
Convert the script into an EPP Puppet template (see this for an example) that takes a 'url' parameter for the URL to check, defaulting to 'https://harbor.toolsbeta.wmflabs.org'.
Add a systemd timer that runs the script:
- Running every 5 min
- Writing its output to the (new) file /var/lib/prometheus/node.d/node_harbor.prom, which will later be picked up by the prometheus-node-exporter service.
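One detail worth considering when producing that file: if the exporter reads it while it is half written, it sees a truncated metric. A common way around this is to write to a temporary file in the same directory and rename it into place. The sketch below is one possible way to do that from Python rather than via a shell redirect; the helper name `write_atomically` is hypothetical, and it assumes the exporter only reads files ending in `.prom`, so the temporary file is ignored.

```python
import os
import tempfile

# Path taken from the task description.
PROM_FILE = "/var/lib/prometheus/node.d/node_harbor.prom"


def write_atomically(text, path=PROM_FILE):
    """Write `text` to `path` so that readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".node_harbor.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # mkstemp creates the file with mode 0600; make it readable by the exporter.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this helper the timer unit would only need to run the script, instead of redirecting its standard output.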
Existing example
As an example you can check https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/node_cloudvirt_libvirt_stats.pp
Acceptance
The new statistics should appear in Prometheus, with a label for each k8s worker (hostname) and one for the project (toolsbeta).
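The acceptance check could be scripted roughly as follows, using the Prometheus HTTP query API (`/api/v1/query`). This is only a sketch: the Prometheus base URL has to be supplied by the caller, and the label names `instance` and `project` are assumptions about how the hostname and project end up labelled in this setup.

```python
import json
import urllib.parse
import urllib.request


def fetch_series(prometheus_url, query="harbor_ui_up", opener=urllib.request.urlopen):
    """Return the label sets of all series matching `query` (instant query)."""
    url = (
        prometheus_url.rstrip("/")
        + "/api/v1/query?"
        + urllib.parse.urlencode({"query": query})
    )
    with opener(url, timeout=10) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload!r}")
    return [sample["metric"] for sample in payload["data"]["result"]]


def series_missing_labels(series, required=("instance", "project")):
    """Return the label sets that lack any of the `required` label names."""
    # The label names are assumptions; adjust them to the actual relabelling rules.
    return [labels for labels in series if not all(name in labels for name in required)]
```

An empty list from `series_missing_labels(fetch_series(base_url))`, together with one series per k8s worker, would indicate that the criteria are met.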