toolforge prometheus servers OOMing
Closed, Resolved (Public)

Description

The Toolforge Prometheus server has been crashing for the last day or so.

Details

Title | Reference | Author | Source Branch | Dest Branch
Revert "wmcs-k8s-metrics: rollback tools" | repos/cloud/toolforge/toolforge-deploy!183 | taavi | taavi/metrics | main
wmcs-k8s-metrics: rollback tools | repos/cloud/toolforge/toolforge-deploy!125 | taavi | taavi/revert-metrics | main

Event Timeline

taavi triaged this task as High priority. Nov 1 2023, 11:03 AM
taavi created this task.

The instances are using g3.cores8.ram36.disk20, so I'm a bit surprised they're running out of RAM.

Mentioned in SAL (#wikimedia-cloud) [2023-11-02T13:13:31Z] <taavi> wiping data directory from tools-prometheus-7 so we have least one working server T350227