KernelErrors
Closed, Resolved, Public

Description

Common information

  • alertname: KernelErrors
  • cluster: wmcs
  • job: node
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: wmcs

Firing alerts

Event Timeline

fnegri claimed this task.
fnegri subscribed.

cloudcephosd1013 had a hard drive failure; see T399366: KernelErrors Server cloudcephosd1013 logged kernel errors.

cloudcephosd1036 logged a single error message on 2025-07-11 at 15:56 UTC, during the outage (T399281: 2025-07-11 Ceph issues causing Toolforge and Cloud VPS failures). Unfortunately, the journald logs have already been rotated, so I can no longer see that message. It was one of the hosts that had been upgraded to Bookworm and were struggling, but it is the only one that reported a kernel error.

Resolving as I don't think there's much more investigation we can do.