
Apt-staging: add alerting
Closed, Resolved (Public)

Description

Follow up from T409253: Continuous breakages of apt-staging

Right now we have no alerting on failed imports and rely on users to tell us when there is a problem. We should add alerting so that we are aware of issues with the import pipeline.

Event Timeline

Change #1205162 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] apt-staging: logging and metrics

https://gerrit.wikimedia.org/r/1205162

Basic metrics are visible in T409833#11382292:

# HELP gitlab_package_puller_jobs_considered Number of CI jobs inspected in the last run
# TYPE gitlab_package_puller_jobs_considered gauge
gitlab_package_puller_jobs_considered 2566
# HELP gitlab_package_puller_jobs_downloaded Number of CI jobs whose artifacts were downloaded in the last run
# TYPE gitlab_package_puller_jobs_downloaded gauge
gitlab_package_puller_jobs_downloaded 8
# HELP gitlab_package_puller_run_success Whether the last run completed without unhandled exceptions (1=success, 0=failure)
# TYPE gitlab_package_puller_run_success gauge
gitlab_package_puller_run_success 1
# HELP gitlab_package_puller_last_run_timestamp_seconds Unix timestamp of the end of the last run
# TYPE gitlab_package_puller_last_run_timestamp_seconds gauge
gitlab_package_puller_last_run_timestamp_seconds 1763451717
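
For illustration only, here is a minimal Python sketch of how a script could emit metrics in the text format shown above. It is not the code from the puppet change. The function names `render_metrics` and `write_atomically` are hypothetical, and the write-to-temp-then-rename step is an assumption about how to avoid the exporter reading a half-written file.

```python
import os
import tempfile


def render_metrics(jobs_considered, jobs_downloaded, run_success, end_time):
    """Render the four gauges in Prometheus text exposition format."""
    metrics = [
        ("gitlab_package_puller_jobs_considered",
         "Number of CI jobs inspected in the last run",
         int(jobs_considered)),
        ("gitlab_package_puller_jobs_downloaded",
         "Number of CI jobs whose artifacts were downloaded in the last run",
         int(jobs_downloaded)),
        ("gitlab_package_puller_run_success",
         "Whether the last run completed without unhandled exceptions "
         "(1=success, 0=failure)",
         int(bool(run_success))),
        ("gitlab_package_puller_last_run_timestamp_seconds",
         "Unix timestamp of the end of the last run",
         int(end_time)),
    ]
    lines = []
    for name, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write text to a temp file in the same directory, then rename it.

    Assumption: the reader only picks up *.prom files, so the ".tmp"
    suffix keeps the partial file out of its view until the rename.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Calling `write_atomically(path, render_metrics(2566, 8, True, 1763451717))` with a path of your choosing would reproduce the sample values above.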

I'll add some more with T409832: Apt-staging: add error handling to gitlab_package_puller.

Change #1207791 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/alerts@master] apt: add an alert on reprepro errors

https://gerrit.wikimedia.org/r/1207791

I've added alerts on reprepro and more general script failures in operations/alerts/+/1207791.

It'll also be possible to add another notification medium if needed.

Change #1207791 merged by jenkins-bot:

[operations/alerts@master] apt: add an alert on reprepro errors

https://gerrit.wikimedia.org/r/1207791

Change #1205162 merged by Arnaudb:

[operations/puppet@production] apt-staging: logging and metrics

https://gerrit.wikimedia.org/r/1205162

As mentioned in T409833: Apt-staging: fix logging

We now have a bit more observability via metrics [...]:

root@apt-staging2001:node.d $ cat gitlab_package_puller.prom 
[...]
gitlab_package_puller_run_success 1
[...]
gitlab_package_puller_reprepro_notify_failed 0

[...]

A simple set of alerts is now in production.
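
As a rough sketch of what such alerts check, the conditions can be written as a small Python function. The alert names are the ones that appear later in this task. The thresholds are assumptions derived from the HELP text of the metrics (`run_success` is 1 on success, `reprepro_notify_failed` counts notifications), not a copy of the rules in operations/alerts, and `firing_alerts` is a hypothetical helper.

```python
def firing_alerts(samples):
    """Return the names of the alerts whose condition holds.

    `samples` maps metric names to their latest values. A missing metric
    fires nothing, which mirrors a PromQL comparison on an absent series.
    """
    firing = []
    # Assumed condition: the last run ended with an unhandled exception.
    if samples.get("gitlab_package_puller_run_success") == 0:
        firing.append("GitlabPackagePullerFailedOnRun")
    # Assumed condition: at least one reprepro failure notification.
    if samples.get("gitlab_package_puller_reprepro_notify_failed", 0) > 0:
        firing.append("GitlabPackagePullerFailedOnReprepro")
    return firing
```

With the values shown above (`run_success` 1, `reprepro_notify_failed` 0) the function returns an empty list.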

Change #1211001 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/alerts@master] apt-staging: wrong error code in gitlab_package_puller_run_success

https://gerrit.wikimedia.org/r/1211001

Change #1211001 merged by jenkins-bot:

[operations/alerts@master] apt-staging: wrong error code in gitlab_package_puller_run_success

https://gerrit.wikimedia.org/r/1211001

Jelto subscribed.

There has been an open alert for 11 days, "Linting problems found for GitlabPackagePullerFailedOnRun":

description: Pint reporter promql/series found problem(s) in /srv/alerts/ops/team-sre_apt.yaml: prometheus "ops" at http://127.0.0.1:9900/ops didn't have any series for "gitlab_package_puller_reprepro_notify_failed" metric in the last 1w
summary: Linting problems found for GitlabPackagePullerFailedOnReprepro

https://alerts.wikimedia.org/?q=%40state%3Dactive&q=alertname%3DAlertLintProblem&q=filename%3D%2Fsrv%2Falerts%2Fops%2Fteam-sre_apt.yaml

I'm not sure why this is happening, since Thanos shows a series with that name. The summary mentions GitlabPackagePullerFailedOnRun, which is a test added in https://gerrit.wikimedia.org/r/c/operations/alerts/+/1211001.

So I'll reopen the task until that alert test is fixed.
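
One way to check by hand whether a specific Prometheus instance has had the series in the last week is its HTTP API series endpoint. The sketch below is not a reimplementation of pint's check. The base URL is taken from the alert description, while `build_series_url` and `has_series` are hypothetical helpers, and whether the endpoint is reachable from a given host is an assumption.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

WEEK_SECONDS = 7 * 24 * 3600


def build_series_url(base_url, metric, end=None, window=WEEK_SECONDS):
    """Build a /api/v1/series URL matching `metric` over the last `window` seconds."""
    end = int(time.time()) if end is None else int(end)
    query = urlencode({"match[]": metric, "start": end - window, "end": end})
    return f"{base_url.rstrip('/')}/api/v1/series?{query}"


def has_series(url, timeout=10):
    """Return True if the instance reports at least one matching series."""
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("status") == "success" and bool(payload.get("data"))
```

For example, `has_series(build_series_url("http://127.0.0.1:9900/ops", "gitlab_package_puller_reprepro_notify_failed"))` asks only that one instance, which can give a different answer from a global Thanos query.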

There has been an open alert for 11 days, "Linting problems found for GitlabPackagePullerFailedOnRun":

I'm not sure why this is happening, since Thanos shows a series with that name.

I can confirm the metric is also present on the host:

root@apt-staging2001:node.d $ rg gitlab_package_puller_reprepro_notify_failed gitlab_package_puller.prom
25:# HELP gitlab_package_puller_reprepro_notify_failed Number of times a reprepro failure notification was (or would have been) triggered in the last run
26:# TYPE gitlab_package_puller_reprepro_notify_failed gauge
27:gitlab_package_puller_reprepro_notify_failed 0
root@apt-staging2001:node.d $ pwd
/var/lib/prometheus/node.d
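
The same check can be scripted. This is a minimal sketch that parses the contents of a `.prom` file into a dictionary. `parse_textfile` is a hypothetical helper, and it assumes samples without labels or trailing timestamps, as in the output above.

```python
def parse_textfile(text):
    """Parse Prometheus text-format samples into {metric_name: value}.

    Assumption: each sample line is "<name> <value>" with no labels and
    no timestamp. Comment lines (# HELP, # TYPE) and blanks are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples
```

Reading the file and testing `"gitlab_package_puller_reprepro_notify_failed" in parse_textfile(contents)` then confirms the metric is exported on the host.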

I'll ask Observability for help.

Change #1216549 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/alerts@master] apt-staging: alert only on codfw

https://gerrit.wikimedia.org/r/1216549

Change #1216549 merged by jenkins-bot:

[operations/alerts@master] apt-staging: alert only on codfw

https://gerrit.wikimedia.org/r/1216549

I mistakenly deployed these alerts on eqiad as well.