Follow up from T409253: Continuous breakages of apt-staging
Right now we have no alerting on failed imports, relying on the users to tell us when there is a problem. We should add alerting to be aware of issues with the import pipeline.
| Status | Assigned | Task |
|---|---|---|
| Resolved | ABran-WMF | T409835 Apt-staging: add alerting |
| Resolved | ABran-WMF | T409832 Apt-staging: add error handling to gitlab_package_puller |
| Resolved | ABran-WMF | T410984 GitlabPackagePullerFailedOnRun |
Change #1205162 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/puppet@production] apt-staging: logging and metrics
basic metrics are visible in T409833#11382292:
# HELP gitlab_package_puller_jobs_considered Number of CI jobs inspected in the last run
# TYPE gitlab_package_puller_jobs_considered gauge
gitlab_package_puller_jobs_considered 2566
# HELP gitlab_package_puller_jobs_downloaded Number of CI jobs whose artifacts were downloaded in the last run
# TYPE gitlab_package_puller_jobs_downloaded gauge
gitlab_package_puller_jobs_downloaded 8
# HELP gitlab_package_puller_run_success Whether the last run completed without unhandled exceptions (1=success, 0=failure)
# TYPE gitlab_package_puller_run_success gauge
gitlab_package_puller_run_success 1
# HELP gitlab_package_puller_last_run_timestamp_seconds Unix timestamp of the end of the last run
# TYPE gitlab_package_puller_last_run_timestamp_seconds gauge
gitlab_package_puller_last_run_timestamp_seconds 1763451717
I'll add some more metrics as part of T409832: Apt-staging: add error handling to gitlab_package_puller.
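For readers unfamiliar with how such a file is produced, here is a minimal sketch of rendering gauges in the Prometheus text exposition format and writing them atomically for the node exporter textfile collector. It is not the actual gitlab_package_puller code, which is not shown in this task; the function names `render_metrics` and `write_textfile` are hypothetical.

```python
import os
import tempfile


def render_metrics(metrics):
    """Render gauges in the Prometheus text exposition format.

    `metrics` maps a metric name to a (help text, value) pair.
    Hypothetical helper, not taken from gitlab_package_puller.
    """
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def write_textfile(path, content):
    """Write `content` to `path` atomically.

    The data goes to a temporary file in the same directory first and is
    then renamed over the target, so a reader never sees a partial file.
    The ".tmp" suffix keeps the temporary file out of the "*.prom" glob
    used by the textfile collector.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A script along these lines would call `write_textfile` once at the end of each run, so that `gitlab_package_puller_run_success` and `gitlab_package_puller_last_run_timestamp_seconds` always describe the same run.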
Change #1207791 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/alerts@master] apt: add an alert on reprepro errors
I've added alerts on reprepro errors and on more general script failures in operations/alerts/+/1207791.
It'll also be possible to add another notification medium if needed.
Change #1207791 merged by jenkins-bot:
[operations/alerts@master] apt: add an alert on reprepro errors
Change #1205162 merged by Arnaudb:
[operations/puppet@production] apt-staging: logging and metrics
As mentioned in T409833: Apt-staging: fix logging, a simple set of alerts is now in production.
Change #1211001 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/alerts@master] apt-staging: wrong error code in gitlab_package_puller_run_success
Change #1211001 merged by jenkins-bot:
[operations/alerts@master] apt-staging: wrong error code in gitlab_package_puller_run_success
An alert has been open for 11 days, "Linting problems found for GitlabPackagePullerFailedOnRun":
description: Pint reporter promql/series found problem(s) in /srv/alerts/ops/team-sre_apt.yaml: prometheus "ops" at http://127.0.0.1:9900/ops didn't have any series for "gitlab_package_puller_reprepro_notify_failed" metric in the last 1w
summary: Linting problems found for GitlabPackagePullerFailedOnReprepro
I'm not sure why this is happening: Thanos shows a series with that name. The summary mentions GitlabPackagePullerFailedOnRun, which is a test added in https://gerrit.wikimedia.org/r/c/operations/alerts/+/1211001.
So I'll reopen the task until that alert test is fixed.
I can confirm the metric is also present on the host:
root@apt-staging2001:node.d $ rg gitlab_package_puller_reprepro_notify_failed gitlab_package_puller.prom
25:# HELP gitlab_package_puller_reprepro_notify_failed Number of times a reprepro failure notification was (or would have been) triggered in the last run
26:# TYPE gitlab_package_puller_reprepro_notify_failed gauge
27:gitlab_package_puller_reprepro_notify_failed 0
root@apt-staging2001:node.d $ pwd
/var/lib/prometheus/node.d
I'll ask Observability for help
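The manual check on the host could also be scripted. The sketch below parses the contents of a textfile-collector file and reports which samples it exposes. It is a simplified illustration: it assumes samples without labels or timestamps, as in the output above, and `read_samples` is a hypothetical helper name.

```python
def read_samples(text):
    """Return a {metric_name: value} dict from exposition-format text.

    Simplified: assumes each sample line is "<name> <value>" with no
    labels and no trailing timestamp. Comment lines (# HELP, # TYPE)
    and blank lines are skipped.
    """
    samples = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def has_metric(text, name):
    """Tell whether a sample for `name` is present in the text."""
    return name in read_samples(text)
```

Reading /var/lib/prometheus/node.d/gitlab_package_puller.prom and passing its contents to `has_metric` with `gitlab_package_puller_reprepro_notify_failed` would reproduce the check above. It only confirms that the file exposes the metric, not that the Prometheus instance queried by the linter has scraped it.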
Change #1216549 had a related patch set uploaded (by Arnaudb; author: Arnaudb):
[operations/alerts@master] apt-staging: alert only on codfw
Change #1216549 merged by jenkins-bot:
[operations/alerts@master] apt-staging: alert only on codfw