Measure how often PAWS stays up / goes down.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
prometheus: tools: scrape paws metrics into prometheus | operations/puppet | production | +12 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | rook | T218228 Optimize for capacity
Resolved | | None | T223511 PAWS newcomer tasks
Duplicate | Feature | None | T195030 Develop availability metrics for PAWS
Event Timeline
@Harej What exactly do you want to measure? The availability of the https://paws.wmflabs.org/paws/hub/login entry page, or something more complex like the ability of PAWS to spawn a new container as a particular Wikimedia user account?
More along those lines. Going to paws.wmflabs.org by itself seems to work consistently, but the Start My Server button works inconsistently and I'd like to begin tracking when it works and when it doesn't.
Jupyterhub 0.9 will come with prometheus metrics. See https://paws-beta.wmflabs.org/paws/hub/metrics
server_spawn_duration_seconds_count{status="failure"} might take care of that.
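As a rough sketch of how that counter could be turned into a "how often does Start My Server fail" number, the snippet below parses metrics text in the Prometheus exposition format and computes a failure ratio. The sample values are made up, the `status="success"` label value is an assumption (only `status="failure"` is mentioned above), and `spawn_counts` / `failure_ratio` are hypothetical helper names.

```python
import re

# Made-up sample in the Prometheus text exposition format.
# These numbers are illustrative only; they are not real PAWS data.
SAMPLE = """\
# illustrative sample, not scraped from a real hub
server_spawn_duration_seconds_count{status="success"} 18.0
server_spawn_duration_seconds_count{status="failure"} 2.0
"""

# Matches the per-status spawn counter lines and captures (status, value).
LINE_RE = re.compile(
    r'^server_spawn_duration_seconds_count\{status="([^"]+)"\}\s+(\S+)'
)


def spawn_counts(text):
    """Return {status: count} for the spawn counter lines found in text."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] = float(match.group(2))
    return counts


def failure_ratio(counts):
    """Return failures / all spawns, or None if no spawns were recorded."""
    total = sum(counts.values())
    if total == 0:
        return None
    return counts.get("failure", 0.0) / total


if __name__ == "__main__":
    counts = spawn_counts(SAMPLE)
    print(counts, failure_ratio(counts))
```

Because the counter is cumulative, a real check would compare two scrapes over time rather than read a single snapshot.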
https://medium.com/@yuvipanda/prometheus-metrics-from-jupyter-notebooks-8ffc8e5d0319
There will also be prometheus metrics from the notebooks soon.
Change 441514 had a related patch set uploaded (by Chico Venancio; owner: Chico Venancio):
[operations/puppet@production] prometheus: tools: scrape paws metrics into prometheus
Change 441514 merged by Andrew Bogott:
[operations/puppet@production] prometheus: tools: scrape paws metrics into prometheus
@Andrew Thanks for the merge. I can't verify the resulting file due to permissions, but puppet has run on tools-prometheus-01 and the log indicates that /srv/prometheus/tools/prometheus.yml should now include the new paws-hub target, so it should be set up correctly. However, https://tools-prometheus.wmflabs.org/tools/targets still does not list it. Perhaps a prometheus service reload/restart is needed?
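One way to check this without file access might be the Prometheus HTTP API, which exposes the same information as the targets page at `/api/v1/targets`. The sketch below only filters an already-fetched response for one job. The payload is made up, treating `paws-hub` as the job name is an assumption, and `targets_for_job` is a hypothetical helper.

```python
import json

# Made-up payload shaped like a Prometheus /api/v1/targets response.
# Instance names and health values are illustrative only.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "host-a:9100"},
             "health": "up"},
            {"labels": {"job": "paws-hub", "instance": "paws.example:443"},
             "health": "down"},
        ],
        "droppedTargets": [],
    },
})


def targets_for_job(response_text, job):
    """Return (instance, health) pairs for active targets of the given job."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("targets query did not succeed")
    return [
        (target["labels"].get("instance"), target["health"])
        for target in payload["data"]["activeTargets"]
        if target["labels"].get("job") == job
    ]


if __name__ == "__main__":
    # An empty list would mean the running Prometheus has not loaded the job.
    print(targets_for_job(SAMPLE_RESPONSE, "paws-hub"))
```

An empty result for the job would be consistent with the config file having changed on disk while the running service has not yet reloaded it.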
Since the metrics are available for most aspects of this in k8s now, I'm just merging this over to the enable metrics task for the new cluster.