
[jobs-api,jobs-emailer] Prometheus monitoring toolforge-jobs server side components
Closed, Resolved (Public)

Description

This should include:

  • Retrieve metrics on the Prometheus side (if anything is missing)
  • Add alerts for "down" events, with runbooks
  • Add a basic Grafana board with the "up/down" metric, to link as the 'dashboard' for the alerts
  • jobs-api
    • gather stats
    • add alert
  • jobs-emailer
    • gather stats
    • add alerts
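
As a rough illustration of the "down" check described in the list above: Prometheus records an `up` series per scrape target (1 when the last scrape succeeded, 0 otherwise). The sketch below assumes the latest `up` values have already been fetched into a plain mapping keyed by job name. The function name `down_components` and the job names `jobs-api` and `jobs-emailer` as label values are assumptions made for the example; this is not the alert rule that was actually deployed.

```python
def down_components(up_samples, expected):
    """Return the expected components that are not reporting as up.

    up_samples: mapping of job name -> latest `up` value (1 = up, 0 = down).
    expected:   iterable of job names that should be monitored.

    A job with no sample at all is treated as down, which is the same
    idea as alerting on an absent series.
    """
    return sorted(job for job in expected if up_samples.get(job, 0) != 1)


if __name__ == "__main__":
    # Hypothetical snapshot: jobs-api is healthy, jobs-emailer failed its scrape.
    snapshot = {"jobs-api": 1, "jobs-emailer": 0}
    print(down_components(snapshot, ["jobs-api", "jobs-emailer"]))
```

Treating a missing series as down is a deliberate choice here: a component that was never scraped should not be reported as healthy.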

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
jobs-emailer: add basic alerts | repos/cloud/toolforge/alerts!28 | dcaro | add_jobs_emailer_basic | main
jobs-emailer: bump to 0.0.52-20250219170643-a14ae54d | repos/cloud/toolforge/toolforge-deploy!670 | ghost | bump_jobs-emailer | main
jobs-api: add alerts for it being down | repos/cloud/toolforge/alerts!20 | dcaro | add_jobs_api | main
webserver: add a minimal metrics endpoint | repos/cloud/toolforge/jobs-emailer!7 | dcaro | add_prometheus_stats | main

Event Timeline

Change 840225 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/toolforge/jobs-framework-api@main] Configure prometheus flask exporter

https://gerrit.wikimedia.org/r/840225

Change 840225 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] Configure prometheus flask exporter

https://gerrit.wikimedia.org/r/840225
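
For readers unfamiliar with what a metrics endpoint serves, here is a minimal sketch using only the Python standard library. It is not the merged change, which configures a Prometheus exporter for Flask; the counter name `http_requests_total`, the `REQUEST_COUNTS` store, the handler class, and the port are all hypothetical and chosen for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter store; names are illustrative only.
REQUEST_COUNTS = {"GET": 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for method in sorted(counts):
        lines.append(f'http_requests_total{{method="{method}"}} {counts[method]}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Count every GET request, including scrapes of /metrics itself.
        REQUEST_COUNTS["GET"] += 1
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Assumed bind address and port, for local experimentation only.
    HTTPServer(("127.0.0.1", 8000), MetricsHandler).serve_forever()
```

In a real service an exporter library would normally maintain the counters and handle the formatting; the sketch only shows the shape of the response that a Prometheus scrape expects.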

Change 841033 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::prometheus: scrape jobs-api

https://gerrit.wikimedia.org/r/841033

Change 841033 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] P:toolforge::prometheus: scrape jobs-api

https://gerrit.wikimedia.org/r/841033

dcaro renamed this task from Prometheus monitoring toolforge-jobs server side components to [jobs-api,jobs-emailer] Prometheus monitoring toolforge-jobs server side components. (Mar 11 2024, 2:32 PM)
dcaro triaged this task as High priority.
dcaro edited projects, added Toolforge; removed Toolforge Jobs framework.
dcaro updated the task description.
dcaro moved this task from Backlog to Ready to be worked on on the Toolforge board.
dcaro changed the task status from Open to In Progress. (Nov 14 2024, 1:56 PM)
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 16) board.

group_203_bot_4866fc124f4b41659f667468a6115cf3 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/670

jobs-emailer: bump to 0.0.52-20250219170643-a14ae54d

Mentioned in SAL (#wikimedia-cloud-feed) [2025-02-19T17:22:11Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.component.deploy for component jobs-emailer (T320284)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-02-19T17:30:40Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-emailer (T320284)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-02-20T13:18:13Z] <wmbot~dcaro@urcuchillay> START - Cookbook wmcs.toolforge.component.deploy for component jobs-emailer (T320284)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-02-20T13:26:35Z] <wmbot~dcaro@urcuchillay> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component jobs-emailer (T320284)

Change #1121364 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] toolforge: add jobs-emailer stats gathering

https://gerrit.wikimedia.org/r/1121364

Change #1121364 merged by David Caro:

[operations/puppet@production] toolforge: add jobs-emailer stats gathering

https://gerrit.wikimedia.org/r/1121364

Added them to the toolforge overview dashboard:

image.png (394×1 px, 37 KB)

dcaro moved this task from In Review to Done on the Toolforge (Toolforge iteration 17) board.