The Toolforge Prometheus server has been crashing for the last day or so.
Description
| Status | Assigned | Task |
|---|---|---|
| Resolved | Raymond_Ndibe | T359641 [infra,k8s] Upgrade Toolforge Kubernetes to version 1.27 |
| Restricted Task | | |
| Resolved | Slst2020 | T327025 [infra,k8s] Upgrade Toolforge Kubernetes to version 1.26 |
| Resolved | aborrero | T316107 [infra,k8s] Upgrade Toolforge Kubernetes to version 1.25 |
| Resolved | aborrero | T307651 Upgrade Toolforge Kubernetes to version 1.24 |
| Resolved | None | T360699 Toolsbeta: migrate to Debian Bullseye or later |
| Resolved | taavi | T311897 [infra] Toolforge: migrate to Debian Bullseye or later |
| Resolved | taavi | T311908 Migrate Toolforge Kubernetes hosts to Debian Bullseye or later |
| Resolved | Bstorm | T262550 Toolforge returns HTTP 502 error |
| Open | taavi | T262562 [infra] Fix the mis-named k8s service in tools and toolsbeta projects |
| Resolved | taavi | T355883 Create a pool of NFS-less Toolforge Kubernetes workers |
| Resolved | taavi | T284656 Toolforge k8s: Migrate workers to Containerd and Bookworm |
| Resolved | taavi | T349795 Upgrade cadvisor |
| Resolved | taavi | T350227 toolforge prometheus servers OOMing |
Event Timeline
The instances are using g3.cores8.ram36.disk20, so I'm a bit surprised they're running out of RAM.
taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/125
wmcs-k8s-metrics: rollback tools
taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/125
wmcs-k8s-metrics: rollback tools
Mentioned in SAL (#wikimedia-cloud) [2023-11-02T13:13:31Z] <taavi> wiping data directory from tools-prometheus-7 so we have at least one working server T350227
taavi opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/183
Revert "wmcs-k8s-metrics: rollback tools"
taavi merged https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/183
Revert "wmcs-k8s-metrics: rollback tools"