[toolforge.infra] Replace Toolschecker alerts with Prometheus based ones
Open, Medium, Public

Description

Remove or move to Alertmanager:

  • k8s etcd node health
  • k8s worker health

These can be replaced by the tests from T357977: [toolforge.infra] create fullstack tests:

  • dumps access
  • nfs read/write
  • redis set/get
  • ldap
  • dns
  • toolsdb read/write
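
As a rough illustration of what one of these replacement checks could look like, here is a minimal sketch of a read/write probe that reports its result as a Prometheus gauge. It is not the Toolschecker code nor the planned fullstack test: the function names, the metric name `toolforge_check_success`, and the `check` label are all hypothetical, and the probed directory is only a stand-in for an NFS mount.

```python
import os
import tempfile
import time


def read_write_check(directory: str) -> bool:
    """Write a small file under `directory`, read it back, and report success."""
    payload = f"probe-{time.time_ns()}".encode()
    try:
        # delete=False so the file can be reopened by path after closing it.
        with tempfile.NamedTemporaryFile(dir=directory, delete=False) as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
            path = handle.name
        try:
            with open(path, "rb") as handle:
                return handle.read() == payload
        finally:
            os.unlink(path)
    except OSError:
        # Missing directory, permission error, stale mount, and so on.
        return False


def render_metric(name: str, check: str, ok: bool) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    return f'{name}{{check="{check}"}} {1 if ok else 0}\n'


if __name__ == "__main__":
    # Hypothetical metric and label names; the temp dir stands in for an NFS path.
    ok = read_write_check(tempfile.gettempdir())
    print(render_metric("toolforge_check_success", "nfs_read_write", ok), end="")
```

A gauge that is 1 on success and 0 on failure keeps the alerting side simple: an Alertmanager-routed rule only needs to fire when the value stays at 0 for some time.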

To be removed:
T358333: Remove toolschecker grid engine checks

  • grid cron job mtime
  • grid long-running
  • grid start a job

Event Timeline

Change 956071 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::checker: remove ToolsDB R/W check

https://gerrit.wikimedia.org/r/956071

Change 956071 merged by Majavah:

[operations/puppet@production] P:toolforge::checker: remove ToolsDB R/W check

https://gerrit.wikimedia.org/r/956071

Change 982786 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::checker: remove kubernetes node readiness check

https://gerrit.wikimedia.org/r/982786

Change 982786 merged by Majavah:

[operations/puppet@production] P:toolforge::checker: remove kubernetes node readiness check

https://gerrit.wikimedia.org/r/982786

Change 991289 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::checker: remove webservice checks

https://gerrit.wikimedia.org/r/991289

Change 991289 merged by Majavah:

[operations/puppet@production] P:toolforge::checker: remove webservice checks

https://gerrit.wikimedia.org/r/991289

dcaro triaged this task as Medium priority. Feb 20 2024, 1:14 PM
dcaro moved this task from Backlog to Workspace for triaging whenever needed on the Toolforge board.
dcaro renamed this task from Replace Toolschecker alerts with Prometheus based ones to [toolforge.infra] Replace Toolschecker alerts with Prometheus based ones. Feb 21 2024, 10:23 AM