
cp3053 nvme0 issues
Open, Stalled, Medium, Public

Description

Today (Jun 29) at 11:12:54, cp3053 began logging errors related to nvme0 (used to store the ATS backend cache):

kern.log
vgutierrez@cp3053:/var/log$ grep nvm kern.log
Jun 29 11:12:54 cp3053 kernel: [11938064.923042] nvme nvme0: I/O 131 QID 38 timeout, aborting
[...]
Jun 29 11:13:24 cp3053 kernel: [11938095.130806] nvme nvme0: I/O 131 QID 38 timeout, reset controller
Jun 29 11:13:55 cp3053 kernel: [11938126.106589] nvme nvme0: I/O 0 QID 0 timeout, reset controller
Jun 29 11:15:09 cp3053 kernel: [11938199.897851] nvme nvme0: Device not ready; aborting reset
[...]
Jun 29 11:17:25 cp3053 kernel: [11938335.728449] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 11:17:25 cp3053 kernel: [11938335.736151] block nvme0n1: no path - failing I/O
Jun 29 11:17:25 cp3053 kernel: [11938335.736154] block nvme0n1: no path - failing I/O
Jun 29 11:17:25 cp3053 kernel: [11938335.736155] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 11:47:44 cp3053 kernel: [11940155.128342] block nvme0n1: no path - failing I/O
Jun 29 11:47:44 cp3053 kernel: [11940155.128346] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 11:47:44 cp3053 kernel: [11940155.136074] block nvme0n1: no path - failing I/O
Jun 29 11:47:44 cp3053 kernel: [11940155.136079] block nvme0n1: no path - failing I/O
Jun 29 11:47:44 cp3053 kernel: [11940155.136080] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 12:18:09 cp3053 kernel: [11941979.570358] block nvme0n1: no path - failing I/O
Jun 29 12:18:09 cp3053 kernel: [11941979.570361] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 12:18:09 cp3053 kernel: [11941979.578093] block nvme0n1: no path - failing I/O
Jun 29 12:18:09 cp3053 kernel: [11941979.578098] block nvme0n1: no path - failing I/O
Jun 29 12:18:09 cp3053 kernel: [11941979.578100] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 12:47:59 cp3053 kernel: [11943770.083909] block nvme0n1: no path - failing I/O
Jun 29 12:47:59 cp3053 kernel: [11943770.083913] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 12:47:59 cp3053 kernel: [11943770.091672] block nvme0n1: no path - failing I/O
Jun 29 12:47:59 cp3053 kernel: [11943770.091676] block nvme0n1: no path - failing I/O
Jun 29 12:47:59 cp3053 kernel: [11943770.091678] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 13:18:01 cp3053 kernel: [11945571.669553] block nvme0n1: no path - failing I/O
Jun 29 13:18:01 cp3053 kernel: [11945571.669556] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 13:18:01 cp3053 kernel: [11945571.677274] block nvme0n1: no path - failing I/O
Jun 29 13:18:01 cp3053 kernel: [11945571.677278] block nvme0n1: no path - failing I/O
Jun 29 13:18:01 cp3053 kernel: [11945571.677279] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
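
After the failed controller reset, the "no path - failing I/O" messages above arrive in short bursts (11:17:25, 11:47:44, 12:18:09, 12:47:59, 13:18:01), i.e. roughly every 30 minutes. A minimal Python sketch for pulling that pattern out of kern.log follows. The function names (`parse_line`, `burst_starts`, `gaps`), the 60-second burst threshold, and the default year (2020, taken from the SAL entries on this task, since syslog timestamps carry no year) are assumptions for illustration, not part of any existing tooling. It also assumes an English/C locale for month names.

```python
import re
from datetime import datetime, timedelta

# Syslog-style prefix as seen in kern.log, e.g.
# "Jun 29 11:17:25 cp3053 kernel: [11938335.736151] block nvme0n1: ..."
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: \[\s*[\d.]+\] (.*)$"
)


def parse_line(line, year=2020):
    """Return (timestamp, message) for a kern.log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    # Syslog timestamps have no year, so the caller supplies one (assumed 2020 here).
    stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return stamp, m.group(2)


def burst_starts(lines, needle="no path - failing I/O",
                 max_gap=timedelta(seconds=60)):
    """Group lines containing `needle` into bursts; return each burst's first timestamp."""
    starts = []
    last = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or needle not in parsed[1]:
            continue
        stamp = parsed[0]
        # A new burst begins when the previous matching line is older than max_gap.
        if last is None or stamp - last > max_gap:
            starts.append(stamp)
        last = stamp
    return starts


def gaps(starts):
    """Return the time differences between consecutive burst starts."""
    return [b - a for a, b in zip(starts, starts[1:])]
```

Feeding it an open file handle on /var/log/kern.log and printing `gaps(burst_starts(fh))` would show whether the failures keep a fixed cadence, which may hint at a periodic reader (for example a scheduled job probing the device) rather than steady cache traffic.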

I've depooled the node as ats-backend is clearly being impacted by this:

Jun 29 11:15:40 cp3053 kernel: [11938230.555735] INFO: task [ET_AIO 3:169]:35516 blocked for more than 120 seconds.

Event Timeline

Restricted Application added a project: Operations. Mon, Jun 29, 1:47 PM
Restricted Application added a subscriber: Aklapper.
Vgutierrez triaged this task as Medium priority. Mon, Jun 29, 1:47 PM
Vgutierrez moved this task from Triage to Hardware on the Traffic board.

Mentioned in SAL (#wikimedia-operations) [2020-06-30T07:42:52Z] <vgutierrez> reboot cp3053 - T256632

Mentioned in SAL (#wikimedia-operations) [2020-06-30T08:05:15Z] <vgutierrez> powercycle cp3053 (unresponsive after reboot) - T256632

Mentioned in SAL (#wikimedia-operations) [2020-06-30T08:23:53Z] <vgutierrez> repool cp3053 - T256632

Vgutierrez changed the task status from Open to Stalled. Tue, Jun 30, 8:29 AM

Repooled after power cycling the host and running the following commands:

/usr/sbin/nvme format /dev/nvme0n1 -l 2
echo ';' | /usr/sbin/sfdisk /dev/nvme0n1

I'll keep an eye on kern.log
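
As a rough aid for that follow-up, the sketch below counts how often each failure signature from the excerpt above shows up in a set of log lines. The signature names and the `count_signatures` helper are hypothetical; the patterns are copied from the messages quoted in this task and may not cover every way the device can fail.

```python
import re
from collections import Counter

# Failure signatures copied from the kern.log excerpt in this task.
# The dictionary keys are illustrative labels, not kernel terminology.
SIGNATURES = {
    "timeout": re.compile(r"nvme nvme0: I/O \d+ QID \d+ timeout"),
    "reset_failed": re.compile(r"nvme nvme0: Device not ready; aborting reset"),
    "no_path": re.compile(r"block nvme0n1: no path - failing I/O"),
    "buffer_io_error": re.compile(r"Buffer I/O error on dev nvme0n1"),
}


def count_signatures(lines):
    """Count log lines matching each known nvme0 failure signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Running `count_signatures(open("/var/log/kern.log"))` after the repool and getting an empty counter would suggest the errors have not come back, at least for the patterns listed here.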