[components-api] Add alerts and runbooks for basic service health
Closed, Resolved, Public

Description

This includes:

  • Add 'up' alert

For this, add a new directory in https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts , similar to the ones for the other services (feel free to copy-paste and tweak afterwards).
The alert should be very similar to https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/blob/main/buildservice/builds-api.yaml?ref_type=heads

  • Add runbook for that alert (the previous patch will not pass the tests until you add it)

You can use https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/BuildsApiDown as inspiration; the new runbook should have '''exactly''' the new alert name as its title. The focus is to give whoever gets paged by the alert a starting point for triage without having to know much context.
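
The "exactly the alert name as title" requirement can be sanity-checked locally before pushing. This is a minimal sketch, not the actual test from the alerts repository: the helper name `runbook_title_matches` is hypothetical, and it assumes the runbook title is the last path segment of the runbook URL and that the BuildsApiDown runbook belongs to an alert of the same name.

```python
from urllib.parse import urlparse


def runbook_title_matches(alert_name: str, runbook_url: str) -> bool:
    """Return True when the last path segment of the runbook URL is exactly the alert name.

    Hypothetical helper: the real repository tests may check this differently.
    """
    path = urlparse(runbook_url).path
    # Drop a trailing slash, then keep only the final segment (the page title).
    title = path.rstrip("/").rsplit("/", 1)[-1]
    # The comparison is case-sensitive on purpose: the title must match exactly.
    return title == alert_name


if __name__ == "__main__":
    url = "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/BuildsApiDown"
    print(runbook_title_matches("BuildsApiDown", url))
```

A mismatch in case or a trailing suffix in the title makes the check fail, which is the kind of slip the "exactly" above is warning about.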

Event Timeline

dcaro triaged this task as High priority. May 14 2025, 8:54 AM
Raymond_Ndibe changed the task status from Open to In Progress. May 30 2025, 5:30 PM

Hmm... it seems that the alerts are not being deployed in tools (they were in toolsbeta).

The alerts-deploy service seems to have some permission issues:

Jun 12 17:01:39 tools-prometheus-8 systemd[1]: Starting alerts-deploy@project-tools.service - Deploy alerts from git to Prometheus/Thanos (instance project/tools)...
░░ Subject: A start job for unit alerts-deploy@project-tools.service has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit alerts-deploy@project-tools.service has begun execution.
░░ 
░░ The job identifier is 19636.
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]: Traceback (most recent call last):
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:   File "/usr/local/bin/alerts-deploy", line 178, in <module>
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:     sys.exit(main())
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:              ^^^^^^
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:   File "/usr/local/bin/alerts-deploy", line 168, in main
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:     deployed_paths = deploy_rulefiles(rulefiles, deploy_dir, alerts_dir)
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:   File "/usr/local/bin/alerts-deploy", line 40, in deploy_rulefiles
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:     shutil.copy2(src_path.as_posix(), dest_path.as_posix())
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:   File "/usr/lib/python3.11/shutil.py", line 436, in copy2
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:     copyfile(src, dst, follow_symlinks=follow_symlinks)
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:   File "/usr/lib/python3.11/shutil.py", line 258, in copyfile
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:     with open(dst, 'wb') as fdst:
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]:          ^^^^^^^^^^^^^^^
Jun 12 17:01:39 tools-prometheus-8 alerts-deploy@project/tools[14227]: PermissionError: [Errno 13] Permission denied: '/srv/alerts/project-tools/kubernetes_kyverno.yaml'
Jun 12 17:01:39 tools-prometheus-8 systemd[1]: alerts-deploy@project-tools.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ An ExecStart= process belonging to unit alerts-deploy@project-tools.service has exited.
░░ 
░░ The process' exit code is 'exited' and its exit status is 1.
Jun 12 17:01:39 tools-prometheus-8 systemd[1]: alerts-deploy@project-tools.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ The unit alerts-deploy@project-tools.service has entered the 'failed' state with result 'exit-code'.
Jun 12 17:01:39 tools-prometheus-8 systemd[1]: Failed to start alerts-deploy@project-tools.service - Deploy alerts from git to Prometheus/Thanos (instance project/tools).
░░ Subject: A start job for unit alerts-deploy@project-tools.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit alerts-deploy@project-tools.service has finished with a failure.
░░ 
░░ The job identifier is 19636 and the job result is failed.

Just chowning seemed to do the trick:

root@tools-prometheus-8:/srv# chown -R alerts-deploy:alerts-deploy /srv/alerts
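
To see which files were affected before (or after) a recursive chown, something like the following could be used. It is only a sketch: the function name `paths_not_owned_by` is made up for illustration, and it assumes that ownership by the wrong user is what caused the `PermissionError` above, which the chown fix suggests but the log alone does not prove.

```python
import os
from pathlib import Path


def paths_not_owned_by(root: Path, uid: int) -> list[Path]:
    """List root and every entry below it whose owner is not the given uid.

    Hypothetical helper for illustration; it only reads metadata and changes nothing.
    """
    mismatched = []
    # lstat() inspects the entry itself rather than following symlinks.
    if root.lstat().st_uid != uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in [*dirnames, *filenames]:
            path = Path(dirpath) / name
            if path.lstat().st_uid != uid:
                mismatched.append(path)
    return sorted(mismatched)


if __name__ == "__main__":
    import pwd

    # User and directory as they appear in the log and the chown command above.
    deploy_uid = pwd.getpwnam("alerts-deploy").pw_uid
    for path in paths_not_owned_by(Path("/srv/alerts"), deploy_uid):
        print(path)
```

An empty result after the chown would mean every file under /srv/alerts is now owned by the alerts-deploy user.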

Now showing up on tools too \o/

image.png (554×787 px, 133 KB)

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/816

components-api: bump to 0.0.118-20250616130629-11e499d8

All alerts are now in place in both Prometheus instances, and the stats are correct; closing.

dcaro moved this task from In Review to Done on the Toolforge (Toolforge iteration 21) board.