Description
As per the title: right now we alert on high etcdmirror lag via a regexp-based check on the /lag endpoint. Nowadays we must use a Prometheus metric-based check instead.
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T321808 Port most/all Icinga checks to Prometheus/Alertmanager
Open | | None | T288622 All Prometheus based alerts move from Icinga to alert manager exclusively
Open | | None | T305847 Migrate SRE paging alerts off Icinga and to Alertmanager
Resolved | | fgiunchedi | T309546 Export etcdmirror 'lag' metric and alert on it
Event Timeline
Change 801642 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/software/etcd-mirror@master] Run isort/black on the codebase
Change 801643 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/software/etcd-mirror@master] tox: add formattercheck
Change 801644 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/software/etcd-mirror@master] Use etcdmirror namespace for metrics
Change 801645 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/software/etcd-mirror@master] Export lag as a Gauge metric
Change 801646 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/software/etcd-mirror@master] Port to Python 3.5
Change 801642 merged by jenkins-bot:
[operations/software/etcd-mirror@master] Run isort/black on the codebase
Change 801643 merged by jenkins-bot:
[operations/software/etcd-mirror@master] tox: add formattercheck
Change 801644 merged by jenkins-bot:
[operations/software/etcd-mirror@master] Use etcdmirror namespace for metrics
Change 801645 merged by jenkins-bot:
[operations/software/etcd-mirror@master] Export lag as a Gauge metric
Change 803871 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/software/etcd-mirror@master] New release 0.0.7-1
Change 803871 merged by jenkins-bot:
[operations/software/etcd-mirror@master] New release 0.0.7-1
Change 810864 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/software/etcd-mirror@master] rest: fix getLag typo and add test
Change 810864 merged by jenkins-bot:
[operations/software/etcd-mirror@master] rest: fix getLag typo and add test
Change 810918 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/alerts@master] sre: add etcd-mirror lag page
Change 810919 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] etcd: remove paging alert, moved to Prometheus
Change 810927 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] wmflib: remove distro conditionals from blackbox http module options
Change 810927 merged by Filippo Giunchedi:
[operations/puppet@production] wmflib: remove distro conditionals from blackbox http module options
Change 810919 merged by Filippo Giunchedi:
[operations/puppet@production] etcd: remove paging alert, moved to Prometheus
Change 810918 merged by Filippo Giunchedi:
[operations/alerts@master] sre: add etcd-mirror lag page
We now have the etcdmirror_lag metric, and pages are set up to fire from Alertmanager. The Icinga check is gone!
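For readers unfamiliar with metric-based checks, here is a minimal Python sketch of the idea: read the etcdmirror_lag sample from a Prometheus text-format scrape and compare it against a threshold. It is an illustration only. The actual alert is defined in operations/alerts (change 810918) and is not reproduced here. The sample scrape text, the threshold values, and the helper names `read_gauge` and `lag_too_high` are all assumptions, and the sketch assumes the metric is exposed without labels.

```python
from typing import Optional

# Hypothetical scrape output in the Prometheus text format.
# The help text and the value 3.0 are made up for illustration.
SAMPLE_SCRAPE = """\
# HELP etcdmirror_lag Replication lag (illustrative help text)
# TYPE etcdmirror_lag gauge
etcdmirror_lag 3.0
"""


def read_gauge(scrape_text: str, metric: str) -> Optional[float]:
    """Return the value of an unlabelled sample named `metric`, or None."""
    for line in scrape_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        # A sample line is "<name> <value> [timestamp]".
        if len(parts) >= 2 and parts[0] == metric:
            return float(parts[1])
    return None


def lag_too_high(scrape_text: str, threshold: float) -> bool:
    """True when etcdmirror_lag is present and strictly above `threshold`."""
    lag = read_gauge(scrape_text, "etcdmirror_lag")
    return lag is not None and lag > threshold
```

Calling `lag_too_high(SAMPLE_SCRAPE, threshold=2)` on the sample above reports a breach, while a threshold of 5 does not. Unlike the old regexp check against /lag, the comparison works on a numeric sample, so the same metric can also be graphed and given different thresholds.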
Change 801646 abandoned by Alexandros Kosiaris:
[operations/software/etcd-mirror@master] Port to Python 3.5
Reason:
Missed this change, sorry about that. Already done in https://gerrit.wikimedia.org/r/c/operations/software/etcd-mirror/+/812306