
restbase2009 reimaging issues
Closed, Resolved (Public)

Description

restbase2009 had a disk replaced as part of T268622. When reimaging, the Debian installer panics and the logs show hpsa errors while configuring disks. The installation eventually stalls when attempting to create the ext4 filesystem for md0, and a resync is triggered for md0 while the installer is still running.

Errors along the following lines are emitted for all disks:

[  565.139342] sd 0:0:2:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ABORT driverbyte=DRIVER_SENSE
[  565.139347] sd 0:0:2:0: [sdb] tag#0 Sense Key : Unit Attention [current]
[  565.139354] sd 0:0:2:0: [sdb] tag#0 Add. Sense: Power on, reset, or bus device reset occurred
[  565.139359] sd 0:0:2:0: [sdb] tag#0 CDB: Write same(16) 93 08 00 00 00 00 05 f0 68 00 00 40 00 00 00 00
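For reference, the failing CDB can be decoded by hand. The following is a minimal sketch, assuming the standard SBC field layout for WRITE SAME(16) (opcode 0x93, flags in byte 1 with the UNMAP bit at 0x08, an 8-byte big-endian LBA, then a 4-byte block count). The function name `decode_write_same16` is made up for illustration and is not part of any tooling used on this host.

```python
import struct


def decode_write_same16(cdb_hex: str) -> dict:
    """Decode a WRITE SAME(16) CDB given as a hex string (spaces allowed)."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x93:
        raise ValueError("not a WRITE SAME(16) CDB")
    # Assumed layout: opcode, flags, 8-byte LBA, 4-byte block count,
    # group number, control. All multi-byte fields are big-endian.
    _opcode, flags, lba, blocks, _group, _control = struct.unpack(">BBQIBB", cdb)
    return {
        "unmap": bool(flags & 0x08),  # UNMAP bit: the write is a discard request
        "lba": lba,
        "blocks": blocks,
    }


if __name__ == "__main__":
    # CDB copied from the kernel log line above.
    print(decode_write_same16("93 08 00 00 00 00 05 f0 68 00 00 40 00 00 00 00"))
```

If that layout assumption holds, the UNMAP bit is set in this CDB, which would mean the commands being aborted are discard requests rather than ordinary writes.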

The hpsa errors:

[ 1101.771168] hpsa 0000:03:00.0: hpsa_send_abort_ioaccel2: Tag:0x00000000:000000d0: unknown abort service response 0x00
[ 1101.771171] hpsa 0000:03:00.0: scsi 0:0:2:0 Aborting command ffff90796bfc4240Tag:0x00000000:000000d0 CDBLen: 16 CDB: 0x9308... SN: 0x0  SENT, FAILED
[ 1101.771178] hpsa 0000:03:00.0: scsi 0:0:2:0: FAILED to abort command Direct-Access     ATA      LK1600GEYMV      PHYS DRV SSDSmartPathCap- En- Exp=1
[ 1101.771202] hpsa 0000:03:00.0: scsi 0:0:4:0 Aborting command ffff90796bfc4840Tag:0x00000000:00000100 CDBLen: 16 CDB: 0x9308... SN: 0x0  BEING SENT
[ 1101.771207] hpsa 0000:03:00.0: scsi 0:0:4:0: Aborting command Direct-Access     ATA      INTEL SSDSC2BX01 PHYS DRV SSDSmartPathCap- En- Exp=1
[ 1119.507897] hpsa 0000:03:00.0: CDB 93080000000000010800004000000000 was aborted with status 0x0
[ 1132.490841] hpsa 0000:03:00.0: Command timed out.
[ 1132.490847] hpsa 0000:03:00.0: hpsa_send_abort_ioaccel2: Tag:0x00000000:00000100: unknown abort service response 0x00
[ 1132.490850] hpsa 0000:03:00.0: scsi 0:0:4:0 Aborting command ffff90796bfc4840Tag:0x00000000:00000100 CDBLen: 16 CDB: 0x9308... SN: 0x0  SENT, FAILED
[ 1132.490856] hpsa 0000:03:00.0: scsi 0:0:4:0: FAILED to abort command Direct-Access     ATA      INTEL SSDSC2BX01 PHYS DRV SSDSmartPathCap- En- Exp=1
[ 1132.490915] hpsa 0000:03:00.0: scsi 0:0:2:0: resetting physical  Direct-Access     ATA      LK1600GEYMV      PHYS DRV SSDSmartPathCap- En- Exp=1
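To see which drives are affected, the failed aborts can be tallied per SCSI address from a saved copy of the log. This is only a rough sketch: the regular expression is written against the message format shown above, which may differ in other hpsa driver versions, and `count_failed_aborts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches e.g. "hpsa 0000:03:00.0: scsi 0:0:2:0: FAILED to abort command ..."
FAILED_ABORT = re.compile(
    r"hpsa \S+: scsi (\d+:\d+:\d+:\d+): FAILED to abort command"
)


def count_failed_aborts(lines):
    """Return a Counter mapping SCSI address to the number of failed aborts."""
    counts = Counter()
    for line in lines:
        match = FAILED_ABORT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: pipe saved kernel log output in on stdin.
    for address, count in sorted(count_failed_aborts(sys.stdin).items()):
        print(address, count)
```

For the excerpt above, this would count one failed abort each for 0:0:2:0 and 0:0:4:0.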

The kernel traces start with: task md0_resync:20551 blocked for more than 120 seconds.

This could be a problem with the controller firmware or configuration (similar errors were seen in T128107#2234413 after the host had been provisioned), but I'm not certain: restbase2009 had a disk replaced earlier this year, which also required reimaging, and this error did not occur then.

Event Timeline

hnowlan updated the task description.

"task md0_resync:20551 blocked for more than 120 seconds" smells like a hardware issue. Best to open a DC ops ticket to get the controller and system firmware updated, and then retry the reimage.

jbond triaged this task as Medium priority. Dec 10 2020, 4:08 PM
jbond added a project: DC-Ops.

Change 650159 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/restbase/deploy@master] Remove restbase2009, reimaging issues.

https://gerrit.wikimedia.org/r/650159

Change 650159 merged by Ppchelko:
[mediawiki/services/restbase/deploy@master] Remove restbase2009, reimaging issues.

https://gerrit.wikimedia.org/r/650159

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

restbase2009.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101150228_pt1979_18340_restbase2009_codfw_wmnet.log.

Completed auto-reimage of hosts:

['restbase2009.codfw.wmnet']

Of which those FAILED:

['restbase2009.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

restbase2009.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101150229_pt1979_18377_restbase2009_codfw_wmnet.log.

Completed auto-reimage of hosts:

['restbase2009.codfw.wmnet']

Of which those FAILED:

['restbase2009.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

restbase2009.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101191734_pt1979_8366_restbase2009_codfw_wmnet.log.

Completed auto-reimage of hosts:

['restbase2009.codfw.wmnet']

Of which those FAILED:

['restbase2009.codfw.wmnet']

Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts:

restbase2009.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202101191734_pt1979_8395_restbase2009_codfw_wmnet.log.

Completed auto-reimage of hosts:

['restbase2009.codfw.wmnet']

and were ALL successful.

Papaul claimed this task.
Papaul subscribed.

@hnowlan, this server is ready for service.