tbs: user-story 9: Gather statistics for harbor uptime in toolsbeta
Closed, Resolved, Public

Assigned To

Authored By

	dcaro
	Feb 20 2023, 4:00 PM

Description

Summary

We want to gather uptime statistics for the harbor service (currently only on toolsbeta).

The idea is to check it as closely as possible to the way a user would, following the steps below.

Steps

Create script

It should make an HTTPS request to the toolsbeta harbor to check that it's up (that a 200 is returned, with some content).
It should write the result to standard output, formatted as:

# HELP harbor_ui_up <Replace this with a metric description>
# TYPE harbor_ui_up gauge
harbor_ui_up{method="get"}  1   1395066363000

Or, when the service is down:

# HELP harbor_ui_up <Replace this with a metric description>
# TYPE harbor_ui_up gauge
harbor_ui_up{method="get"}  0   1395066363000

The last number on the sample line is the current time in milliseconds since the epoch (1970-01-01 00:00:00 UTC, excluding leap seconds).
For a detailed description of the format see [[ https://prometheus.io/docs/concepts/metric_types/ | metric types]] and [[ https://prometheus.io/docs/concepts/data_model/ | data model]].
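
As a rough illustration of the script described above, here is a minimal Python sketch. The task does not prescribe a language, so Python, the function names (`is_up`, `render_metric`, `main`), the 10 second timeout and the HELP text are all assumptions made for the example. The timestamp is included because the task asks for it; it is worth checking whether the node exporter's textfile collector accepts samples that carry timestamps, and dropping the timestamp if it does not.

```python
#!/usr/bin/env python3
"""Sketch: probe a URL and print a Prometheus gauge in the text format."""
import http.client
import sys
import time
import urllib.request

# Default taken from the task description.
DEFAULT_URL = "https://harbor.toolsbeta.wmflabs.org"


def is_up(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if `url` answers 200 with a non-empty body.

    `opener` is a hypothetical injection point so the check can be
    exercised without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200 and len(response.read()) > 0
    except (OSError, http.client.HTTPException, ValueError):
        # URLError and HTTPError are OSError subclasses, so connection
        # failures, TLS errors and non-2xx responses all end up here.
        return False


def render_metric(up, now_ms):
    """Format the gauge the way the task asks for it."""
    return (
        "# HELP harbor_ui_up Whether harbor answered an HTTP GET with 200 and a body\n"
        "# TYPE harbor_ui_up gauge\n"
        f'harbor_ui_up{{method="get"}} {int(up)} {now_ms}\n'
    )


def main(argv):
    """Probe the URL given as first argument (or the default) and print the metric."""
    url = argv[1] if len(argv) > 1 else DEFAULT_URL
    now_ms = int(time.time() * 1000)  # milliseconds since the epoch
    sys.stdout.write(render_metric(is_up(url), now_ms))
    return 0
```

To use it as a standalone script, call `main(sys.argv)` from an `if __name__ == "__main__":` guard at the bottom of the file. Any failure to get a 200 with content is reported as `0` instead of crashing, so the timer always produces a sample.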

Add the puppet code for it

This will be under the existing profile::toolforge::k8s::worker profile.

Turn the script into an EPP Puppet template (see this for an example), with a 'url' parameter for the URL to check, defaulting to 'https://harbor.toolsbeta.wmflabs.org'.

Add a systemd timer that runs the script:

Running every 5 minutes
Putting its output in the (new) file /var/lib/prometheus/node.d/node_harbor.prom, which will later be picked up by the prometheus-node-exporter service.
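
The task leaves open how the output gets into the file. If that is done from Python and not via a shell redirect, one common approach is to write to a temporary file in the same directory and rename it into place, so the exporter never reads a half-written file. The sketch below assumes this; the helper name `write_prom_file` and the temporary file naming are hypothetical.

```python
import os
import tempfile


def write_prom_file(content, target):
    """Atomically replace `target` with `content`.

    The temporary file lives in the same directory as `target` so that
    os.replace() is a rename on one filesystem. Its name does not end in
    ".prom", so the textfile collector ignores it while it is being written.
    """
    directory = os.path.dirname(target) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".node_harbor.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
        # mkstemp creates the file with mode 0600; the exporter may run as
        # a different user, so make the file world-readable.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

For example, `write_prom_file(metric_text, "/var/lib/prometheus/node.d/node_harbor.prom")` would publish the latest sample, where `metric_text` is the script's output.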

Existing example

As an example you can check https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/node_cloudvirt_libvirt_stats.pp

Acceptance

You should be able to see the new statistics appearing in Prometheus, with a label for each k8s worker (hostname) and one for the project (toolsbeta).

Related Objects

Status	Assigned	Task
Open	None	T380882 openstack network problems (November 2024)
Resolved	aborrero	T380827 tools-nfs outage 2024-11-25
Open	None	T380832 [jobs-api] crashing
Open	None	T380959 [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components
Resolved	LucasWerkmeister	T320140 Migrate wd-shex-infer from Toolforge GridEngine to Toolforge Kubernetes
Resolved	matmarex	T319707 Migrate dtcheck from Toolforge GridEngine to Toolforge Kubernetes
Resolved	Legoktm	T320062 Migrate steve-adder from Toolforge GridEngine to Toolforge Kubernetes
Resolved	Legoktm	T320011 Migrate rfa-voting-history from Toolforge GridEngine to Toolforge Kubernetes
Open	dcaro	T194332 [Epic,builds-api,components-api,webservice,jobs-api] Make Toolforge a proper platform as a service with push-to-deploy and build packs
Resolved	dcaro	T267374 [tbs.beta] Create a toolforge build service beta release
Resolved	dcaro	T325172 [builds-api,harbor,builds-builder] user-story 11: I want to know how to debug the service
Resolved	None	T325174 [builds-builder,harbor,bulid-service,docs] user-story 11: Add section to admin docs on how to debug the service, how to pin-point the failing component and how to get the logs for each of them.
Resolved	dcaro	T325166 tbs: user-story 10: I want to know how to manage the service
Resolved	dcaro	T325167 tbs: user-story 10: Create admin wiki page for the toolforge build service
Resolved	Raymond_Ndibe	T325175 tbs: user-story 11: Add a runbook for each of the service alerts.
Resolved	Raymond_Ndibe	T325160 tbs: user-story 9: I want to know when the service is down
Resolved	Raymond_Ndibe	T325165 tbs: user-story 9: Create an alert on metricsinfra for harbor being down on toolsbeta
Resolved	dcaro	T330096 tbs: user-story 9: Gather statistics for harbor uptime in toolsbeta

Event Timeline

dcaro created this task. Feb 20 2023, 4:00 PM

This kind of HTTP monitoring is already a solved problem, and the solution is blackbox probes. You'll need a Puppet prometheus::blackbox::check::http resource on the monitored hosts (something similar to what's in profile::toolforge::static, for example), and those checks will automatically be applied to the appropriate Prometheus instances.

That sounds good!

I'll update the parent task with the blackbox details and close this one.
