Today at 11:12:54, cp3053 began logging errors related to nvme0 (the device that stores the ATS backend cache):
vgutierrez@cp3053:/var/log$ grep nvm kern.log
Jun 29 11:12:54 cp3053 kernel: [11938064.923042] nvme nvme0: I/O 131 QID 38 timeout, aborting
[...]
Jun 29 11:13:24 cp3053 kernel: [11938095.130806] nvme nvme0: I/O 131 QID 38 timeout, reset controller
Jun 29 11:13:55 cp3053 kernel: [11938126.106589] nvme nvme0: I/O 0 QID 0 timeout, reset controller
Jun 29 11:15:09 cp3053 kernel: [11938199.897851] nvme nvme0: Device not ready; aborting reset
[...]
Jun 29 11:17:25 cp3053 kernel: [11938335.728449] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 11:17:25 cp3053 kernel: [11938335.736151] block nvme0n1: no path - failing I/O
Jun 29 11:17:25 cp3053 kernel: [11938335.736154] block nvme0n1: no path - failing I/O
Jun 29 11:17:25 cp3053 kernel: [11938335.736155] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 11:47:44 cp3053 kernel: [11940155.128342] block nvme0n1: no path - failing I/O
Jun 29 11:47:44 cp3053 kernel: [11940155.128346] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 11:47:44 cp3053 kernel: [11940155.136074] block nvme0n1: no path - failing I/O
Jun 29 11:47:44 cp3053 kernel: [11940155.136079] block nvme0n1: no path - failing I/O
Jun 29 11:47:44 cp3053 kernel: [11940155.136080] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 12:18:09 cp3053 kernel: [11941979.570358] block nvme0n1: no path - failing I/O
Jun 29 12:18:09 cp3053 kernel: [11941979.570361] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 12:18:09 cp3053 kernel: [11941979.578093] block nvme0n1: no path - failing I/O
Jun 29 12:18:09 cp3053 kernel: [11941979.578098] block nvme0n1: no path - failing I/O
Jun 29 12:18:09 cp3053 kernel: [11941979.578100] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 12:47:59 cp3053 kernel: [11943770.083909] block nvme0n1: no path - failing I/O
Jun 29 12:47:59 cp3053 kernel: [11943770.083913] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 12:47:59 cp3053 kernel: [11943770.091672] block nvme0n1: no path - failing I/O
Jun 29 12:47:59 cp3053 kernel: [11943770.091676] block nvme0n1: no path - failing I/O
Jun 29 12:47:59 cp3053 kernel: [11943770.091678] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 13:18:01 cp3053 kernel: [11945571.669553] block nvme0n1: no path - failing I/O
Jun 29 13:18:01 cp3053 kernel: [11945571.669556] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 13:18:01 cp3053 kernel: [11945571.677274] block nvme0n1: no path - failing I/O
Jun 29 13:18:01 cp3053 kernel: [11945571.677278] block nvme0n1: no path - failing I/O
Jun 29 13:18:01 cp3053 kernel: [11945571.677279] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
I've depooled the node as ats-backend is clearly being impacted by this:
Jun 29 11:15:40 cp3053 kernel: [11938230.555735] INFO: task [ET_AIO 3:169]:35516 blocked for more than 120 seconds.
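
For anyone who wants to triage a similar kern.log quickly, here is a minimal Python sketch that counts the kinds of kernel messages quoted in this task and records when each kind first appeared. The category names and the `summarize` helper are hypothetical, made up for illustration; the substrings it matches are taken only from the log lines above, so other kernels or drivers may word these messages differently.

```python
import re
from collections import Counter

# Matches syslog-style kernel lines in the shape seen in this task, e.g.
# "Jun 29 11:12:54 cp3053 kernel: [11938064.923042] nvme nvme0: ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"\[\s*(?P<uptime>\d+\.\d+)\] (?P<msg>.*)$"
)

# Hypothetical category names; the substrings come from the excerpts above.
CATEGORIES = [
    ("timeout", "timeout,"),
    ("reset_failed", "Device not ready"),
    ("no_path", "no path - failing I/O"),
    ("buffer_io_error", "Buffer I/O error"),
    ("hung_task", "blocked for more than"),
]


def classify(msg):
    """Return the first category whose substring occurs in msg, else None."""
    for name, needle in CATEGORIES:
        if needle in msg:
            return name
    return None


def summarize(lines):
    """Count messages per category and note the first timestamp of each."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a kernel line in the expected shape
        category = classify(match.group("msg"))
        if category is None:
            continue  # kernel line, but not one of the patterns of interest
        counts[category] += 1
        first_seen.setdefault(category, match.group("ts"))
    return counts, first_seen
```

It could be run over the whole file with something like `with open("/var/log/kern.log") as f: counts, first_seen = summarize(f)`. Comparing the first `timeout` timestamp with the first `hung_task` timestamp gives a rough idea of how long the device misbehaved before the service started blocking.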