cp3053 nvme0 issues
Closed, Resolved · Public

Description

Today at 11:12:54 cp3053 began to log errors related to nvme0 (used to store the ATS backend cache):

kern.log
vgutierrez@cp3053:/var/log$ grep nvm kern.log
Jun 29 11:12:54 cp3053 kernel: [11938064.923042] nvme nvme0: I/O 131 QID 38 timeout, aborting
[...]
Jun 29 11:13:24 cp3053 kernel: [11938095.130806] nvme nvme0: I/O 131 QID 38 timeout, reset controller
Jun 29 11:13:55 cp3053 kernel: [11938126.106589] nvme nvme0: I/O 0 QID 0 timeout, reset controller
Jun 29 11:15:09 cp3053 kernel: [11938199.897851] nvme nvme0: Device not ready; aborting reset
[...]
Jun 29 11:17:25 cp3053 kernel: [11938335.728449] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 11:17:25 cp3053 kernel: [11938335.736151] block nvme0n1: no path - failing I/O
Jun 29 11:17:25 cp3053 kernel: [11938335.736154] block nvme0n1: no path - failing I/O
Jun 29 11:17:25 cp3053 kernel: [11938335.736155] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 11:47:44 cp3053 kernel: [11940155.128342] block nvme0n1: no path - failing I/O
Jun 29 11:47:44 cp3053 kernel: [11940155.128346] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 11:47:44 cp3053 kernel: [11940155.136074] block nvme0n1: no path - failing I/O
Jun 29 11:47:44 cp3053 kernel: [11940155.136079] block nvme0n1: no path - failing I/O
Jun 29 11:47:44 cp3053 kernel: [11940155.136080] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 12:18:09 cp3053 kernel: [11941979.570358] block nvme0n1: no path - failing I/O
Jun 29 12:18:09 cp3053 kernel: [11941979.570361] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 12:18:09 cp3053 kernel: [11941979.578093] block nvme0n1: no path - failing I/O
Jun 29 12:18:09 cp3053 kernel: [11941979.578098] block nvme0n1: no path - failing I/O
Jun 29 12:18:09 cp3053 kernel: [11941979.578100] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 12:47:59 cp3053 kernel: [11943770.083909] block nvme0n1: no path - failing I/O
Jun 29 12:47:59 cp3053 kernel: [11943770.083913] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 12:47:59 cp3053 kernel: [11943770.091672] block nvme0n1: no path - failing I/O
Jun 29 12:47:59 cp3053 kernel: [11943770.091676] block nvme0n1: no path - failing I/O
Jun 29 12:47:59 cp3053 kernel: [11943770.091678] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
Jun 29 13:18:01 cp3053 kernel: [11945571.669553] block nvme0n1: no path - failing I/O
Jun 29 13:18:01 cp3053 kernel: [11945571.669556] Buffer I/O error on dev nvme0n1, logical block 128, async page read
Jun 29 13:18:01 cp3053 kernel: [11945571.677274] block nvme0n1: no path - failing I/O
Jun 29 13:18:01 cp3053 kernel: [11945571.677278] block nvme0n1: no path - failing I/O
Jun 29 13:18:01 cp3053 kernel: [11945571.677279] Buffer I/O error on dev nvme0n1p1, logical block 390703168, async page read
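
To see at a glance how often each kind of message recurs, the excerpt above can be collapsed into per-message counts. This is only a minimal sketch, assuming the syslog line layout shown above; `summarize_nvme_log` is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# Hypothetical helper: count nvme-related kernel messages by kind.
# Assumes kern.log lines shaped like the excerpt above:
#   "<date> <host> kernel: [<uptime>] <message>"
summarize_nvme_log() {
    # $1: path to a kern.log-style file
    grep 'nvme' "$1" |
        # Drop everything up to and including the "[uptime] " prefix.
        sed -E 's/^.*kernel: \[ *[0-9.]+\] //' |
        # Replace varying numbers so that identical events group together.
        sed -E 's#I/O [0-9]+ QID [0-9]+#I/O N QID N#; s#logical block [0-9]+#logical block N#' |
        # Count identical messages, most frequent first.
        sort | uniq -c | sort -rn
}
```

Usage would be something like `summarize_nvme_log /var/log/kern.log`, which prints one line per distinct message preceded by its count.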

I've depooled the node as ats-backend is clearly being impacted by this:

Jun 29 11:15:40 cp3053 kernel: [11938230.555735] INFO: task [ET_AIO 3:169]:35516 blocked for more than 120 seconds.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Vgutierrez triaged this task as Medium priority. Jun 29 2020, 1:47 PM
Vgutierrez moved this task from Triage to Hardware on the Traffic board.

Mentioned in SAL (#wikimedia-operations) [2020-06-30T08:05:15Z] <vgutierrez> powercycle cp3053 (unresponsive after reboot) - T256632

Vgutierrez changed the task status from Open to Stalled. Jun 30 2020, 8:29 AM

Repooled after power-cycling and issuing the following commands:

/usr/sbin/nvme format /dev/nvme0n1 -l 2
echo ';' | /usr/sbin/sfdisk /dev/nvme0n1
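
Both commands are destructive: `nvme format ... -l 2` reformats the namespace using LBA format index 2 (which block size that index maps to depends on the device), and the lone `;` fed to `sfdisk` requests a single partition with default start, size and type. A rough sketch of a dry-run wrapper around them follows; the function name `reset_cache_device` and the `RUN` variable are assumptions made for illustration, not part of any existing tooling:

```shell
#!/bin/sh
# Hypothetical wrapper around the two commands above.
# Prints the commands by default; set RUN=1 to actually execute them.
reset_cache_device() {
    dev="$1"   # e.g. /dev/nvme0n1
    if [ "${RUN:-0}" = "1" ]; then
        # -l 2 selects LBA format index 2; a lone ';' asks sfdisk for a
        # single partition with default start, size and type.
        /usr/sbin/nvme format "$dev" -l 2 &&
            echo ';' | /usr/sbin/sfdisk "$dev"
    else
        echo "/usr/sbin/nvme format $dev -l 2"
        echo "echo ';' | /usr/sbin/sfdisk $dev"
    fi
}
```

Calling `reset_cache_device /dev/nvme0n1` only prints the two commands for review; `RUN=1` has to be set explicitly before anything touches the device.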

I'll keep an eye on kern.log

Hi @Vgutierrez, just wanted to follow up to see if you've seen any issues since, and whether this can be closed now. Much appreciated. Thanks, Willy

Vgutierrez claimed this task.

Thanks for pinging me @wiki_willy, we can close this task; everything seems good on cp3053 so far. I'll reopen the task if needed.