
tbs: user-story 9: Gather statistics for harbor uptime in toolsbeta
Closed, Resolved (Public)

Description

Summary

We want to gather uptime status for the harbor service (currently only on toolsbeta)

The idea is to check it as closely as possible to the way a user would reach it, so we want to:

Steps

Create script

It should make an HTTPS request to the toolsbeta harbor to check that it's up (that a 200 is returned, with some content).
It should write the result to standard output, formatted as:

# HELP harbor_ui_up <Replace this with a metric description>
# TYPE harbor_ui_up gauge
harbor_ui_up{method="get"}  1   1395066363000

Or, when the service is down:

# HELP harbor_ui_up <Replace this with a metric description>
# TYPE harbor_ui_up gauge
harbor_ui_up{method="get"}  0   1395066363000

The last number is the current time (milliseconds since the epoch, i.e. 1970-01-01 00:00:00 UTC, excluding leap seconds).
For a detailed description of the format see https://prometheus.io/docs/concepts/metric_types/ and https://prometheus.io/docs/concepts/data_model/.
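A minimal sketch of such a script, assuming Python 3 with only the standard library (the task does not name a language). The function names `check_url`, `render_metric` and `main`, the 10 second timeout, and the HELP description text are illustrative assumptions, not part of the task; the metric name, label, and default URL come from the description above.

```python
"""Probe a URL and print a Prometheus gauge in the text exposition format."""
import http.client
import sys
import time
import urllib.error
import urllib.request

# Default taken from the task description; override via the first CLI argument.
DEFAULT_URL = "https://harbor.toolsbeta.wmflabs.org"


def check_url(url: str, timeout: float = 10.0) -> bool:
    """Return True if a GET on url answers 200 with a non-empty body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200 and len(response.read()) > 0
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failures, refused connections, TLS errors, timeouts, HTTP error
        # statuses and malformed URLs are all reported as "down".
        return False


def render_metric(up: bool, timestamp_ms: int) -> str:
    """Format the gauge as HELP, TYPE and one sample line."""
    return (
        # The description text is a placeholder chosen for this sketch.
        "# HELP harbor_ui_up Whether the harbor UI answered a GET with 200 and some content\n"
        "# TYPE harbor_ui_up gauge\n"
        f'harbor_ui_up{{method="get"}} {int(up)} {timestamp_ms}\n'
    )


def main(argv: list) -> None:
    """Check the URL given in argv[1] (or the default) and print the metric."""
    url = argv[1] if len(argv) > 1 else DEFAULT_URL
    now_ms = int(time.time() * 1000)  # milliseconds since the epoch
    sys.stdout.write(render_metric(check_url(url), now_ms))
```

Calling `main(sys.argv)` at the bottom of the file turns this into the standalone script; keeping the probe and the formatting in separate functions makes the output format easy to test without network access.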

Add the puppet code for it

This will be under the existing profile::toolforge::k8s::worker profile.

Modify the script to become an EPP Puppet template (see this for an example), with a 'url' parameter for the address to check, defaulting to 'https://harbor.toolsbeta.wmflabs.org'.

Add a systemd timer that runs the script:

  • Running every 5 min
  • Putting its output in the (new) file /var/lib/prometheus/node.d/node_harbor.prom, which will be picked up later by the prometheus-node-exporter service.
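Since the exporter may read the file at any moment, one common precaution is to write the output to a temporary file and rename it into place, so that a half-written file is never scraped. The following is a sketch in Python under that assumption; the helper name `write_atomically` and the temporary-file prefix are hypothetical, and the task itself does not prescribe how the file is written. It may also be worth verifying whether the node exporter's textfile collector accepts the trailing timestamp, since its documentation notes that timestamps are not supported.

```python
import os
import tempfile


def write_atomically(path: str, content: str) -> None:
    """Write content to path by way of a temporary file in the same directory."""
    directory = os.path.dirname(path) or "."
    # The temporary name has no .prom suffix, so a collector that only reads
    # *.prom files should ignore it while it is being written.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_harbor_")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(content)
        # mkstemp creates the file with mode 0600; make it world-readable so
        # a service running as another user can read it.
        os.chmod(tmp_path, 0o644)
        # Renaming within one filesystem is atomic on POSIX systems.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

For example, `write_atomically("/var/lib/prometheus/node.d/node_harbor.prom", metric_text)` would replace the file in one step, where `metric_text` is whatever the check script produced.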

Existing example

As an example you can check https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/prometheus/manifests/node_cloudvirt_libvirt_stats.pp

Acceptance

You should be able to see the new statistics appearing in prometheus, with a label for each k8s worker (hostname), and for the project (toolsbeta).

Related Objects

Status      Assigned
Resolved    LucasWerkmeister
Resolved    matmarex
Resolved    Legoktm
Resolved    Legoktm
Open        dcaro
Resolved    dcaro
Open        None
Open        None
Resolved    dcaro
Resolved    dcaro
Resolved    Raymond_Ndibe
Resolved    Raymond_Ndibe
Resolved    Raymond_Ndibe
Resolved    dcaro

Event Timeline

This kind of HTTP monitoring is an already solved problem, and that solution is blackbox probes. You'll need a Puppet prometheus::blackbox::check::http resource on the monitored hosts (something similar to what's in profile::toolforge::static for example), and those will automatically be applied to the appropriate Prometheus instances.

dcaro claimed this task.

That sounds good!

I'll update the parent task with the blackbox details and close this one.