The Toolforge Prometheus server has been crashing for the last day or so.
Description
Details
Title | Reference | Author | Source Branch | Dest Branch | |
---|---|---|---|---|---|
Revert "wmcs-k8s-metrics: rollback tools" | repos/cloud/toolforge/toolforge-deploy!183 | taavi | taavi/metrics | main | |
wmcs-k8s-metrics: rollback tools | repos/cloud/toolforge/toolforge-deploy!125 | taavi | taavi/revert-metrics | main | |
Event Timeline
The instances are using g3.cores8.ram36.disk20, so I'm a bit surprised they're running out of RAM.
taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/125
wmcs-k8s-metrics: rollback tools
taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/125
wmcs-k8s-metrics: rollback tools
Mentioned in SAL (#wikimedia-cloud) [2023-11-02T13:13:31Z] <taavi> wiping data directory from tools-prometheus-7 so we have at least one working server T350227
taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/183
Revert "wmcs-k8s-metrics: rollback tools"
taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/183
Revert "wmcs-k8s-metrics: rollback tools"