KernelErrors Server cloudcephosd1013 logged kernel errors
Closed, Resolved, Public

Description

Common information

  • alertname: KernelErrors
  • category: priority_err
  • cluster: wmcs
  • instance: cloudcephosd1013:9100
  • job: node
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: wmcs

Firing alerts


Event Timeline

This was due to a reboot, after sdj disappeared.

The alert was not due to the reboot. The metric was reporting 318 kernel errors starting on Jul 11 at 14:17 UTC (before the reboot). The reboot happened on Jul 11 at 18:36 UTC. The alert did not fire until Jul 12 only because there was a silence in place that expired on Jul 12.

[Screenshot: Screenshot 2025-07-14 at 17.37.13.png]
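The ordering argument above can be checked with a few lines of Python. This is only a sketch: the two timestamps are the ones quoted in the comment, the year 2025 is an assumption taken from the screenshot filename, and the function name is made up for illustration.

```python
from datetime import datetime, timezone

# Timestamps quoted in the comment above; the year (2025) is assumed
# from the screenshot filename, it is not stated in the comment itself.
first_errors = datetime(2025, 7, 11, 14, 17, tzinfo=timezone.utc)
reboot = datetime(2025, 7, 11, 18, 36, tzinfo=timezone.utc)


def errors_precede_reboot(first_errors, reboot):
    """Return True if the errors were already reported before the reboot."""
    return first_errors < reboot


# Time between the first reported errors and the reboot.
gap = reboot - first_errors

print(errors_precede_reboot(first_errors, reboot), gap)  # True 4:19:00
```

The metric was therefore already non-zero more than four hours before the reboot, which is consistent with the reboot not being the cause of the alert.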

A sample of the 318 kernel errors:

Jul 11 14:12:32 cloudcephosd1013 kernel: INFO: task md2_raid1:668 blocked for more than 120 seconds.
Jul 11 14:12:32 cloudcephosd1013 kernel:       Not tainted 5.10.0-35-amd64 #1 Debian 5.10.237-1
Jul 11 14:12:32 cloudcephosd1013 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

[...]

Jul 11 14:12:32 cloudcephosd1013 kernel: megaraid_sas 0000:18:00.0: pending commands remain after waiting, will reset adapter scsi0.
Jul 11 14:12:32 cloudcephosd1013 kernel: blk_update_request: I/O error, dev sdj, sector 427105432 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Jul 11 14:12:32 cloudcephosd1013 kernel: scsi 0:0:8:0: rejecting I/O to dead device

[...]

Jul 11 14:12:32 cloudcephosd1013 kernel: Buffer I/O error on dev dm-4, logical block 226257573, async page read
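To see which block devices a log excerpt like the one above implicates, a rough Python sketch could count the I/O error lines per device. The regular expression and the function name are assumptions made for this example; they only cover the two message shapes that appear in the sample ("I/O error, dev X" and "Buffer I/O error on dev X"), not every kernel error format.

```python
import re
from collections import Counter

# Lines copied from the sample above (elided parts are not reconstructed).
SAMPLE = """\
Jul 11 14:12:32 cloudcephosd1013 kernel: INFO: task md2_raid1:668 blocked for more than 120 seconds.
Jul 11 14:12:32 cloudcephosd1013 kernel: megaraid_sas 0000:18:00.0: pending commands remain after waiting, will reset adapter scsi0.
Jul 11 14:12:32 cloudcephosd1013 kernel: blk_update_request: I/O error, dev sdj, sector 427105432 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
Jul 11 14:12:32 cloudcephosd1013 kernel: scsi 0:0:8:0: rejecting I/O to dead device
Jul 11 14:12:32 cloudcephosd1013 kernel: Buffer I/O error on dev dm-4, logical block 226257573, async page read
"""

# Hypothetical pattern: matches "I/O error, dev sdj" and "I/O error on dev dm-4".
DEV_RE = re.compile(r"I/O error(?: on|,) dev ([\w-]+)")


def io_errors_by_device(log_text):
    """Count I/O error lines per device name found in the log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DEV_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


print(io_errors_by_device(SAMPLE))
```

On this sample it counts one error for sdj and one for dm-4; the hung-task, adapter reset, and "dead device" lines do not name a block device in this form and are skipped.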
fnegri triaged this task as Medium priority. Jul 23 2025, 2:07 PM