The beta cluster has been unreachable (timing out) for the better part of today, as far as I’m aware (under both its old and new domain names). The Ceph issues at T399281: 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures are a likely culprit according to T399281#10995431.
Description
Description
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Resolved | Andrew | T399858 Cloud Ceph misbehaving on Debian Bookworm | |||
| Resolved | taavi | T399870 [k8s,infra,o11y] Add paging alert when many tools are unreachable | |||
| Resolved | Andrew | T399281 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures | |||
| Resolved | hashar | T399283 Remove reliance on EXECUTOR_NUMBER environment variable in CI | |||
| Duplicate | None | T399303 Beta cluster down 2025-07-11 | |||
| Resolved | fnegri | T399261 Widespread instances down in project deployment-prep | |||
| Duplicate | None | T399297 Widespread instances down in project deployment-prep |
Event Timeline
Comment Actions
The beta-code-update-eqiad, beta-scap-sync-world and beta-update-databases-eqiad CI jobs have been green throughout FWIW.