cloudvirt1004 failed RAID controller
Closed, Resolved, Public

Description

root@cloudvirt1004:/opt/hp/hpssacli# hpssacli controller all show

Smart Array P420i in Slot 0 (Embedded)  

root@cloudvirt1004:/opt/hp/hpssacli# hpssacli controller slot=0 pd all show

Error: The specified controller does not have any physical drives on it.
[Tue Apr 21 21:27:33 2020] hpsa 0000:02:00.0: scsi 0:1:0:0: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1                                                                                                   
[Tue Apr 21 21:28:06 2020] hpsa 0000:02:00.0: Controller lockup detected: 0xffff0000 after 30                                                                                                                                                                      
[Tue Apr 21 21:28:06 2020] hpsa 0000:02:00.0: controller lockup detected: LUN:0000004000000000 CDB:01040000000000000000000000000000                                                                                                                                
[Tue Apr 21 21:28:06 2020] hpsa 0000:02:00.0: Controller lockup detected during reset wait                                                                                                                                                                         
[Tue Apr 21 21:28:06 2020] hpsa 0000:02:00.0: scsi 0:1:0:0: reset logical  failed Direct-Access     HP       LOGICAL VOLUME   RAID-1(+0) SSDSmartPathCap- En- Exp=1                                                                                                
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:0: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:0: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:0: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: Device offlined - not ready after error recovery                                                                                                                                                                            
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: [sdb] tag#19 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK                                                                                                                                                    
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: [sdb] tag#19 CDB: Write(10) 2a 00 87 5e 74 a9 00 00 08 00                                                                                                                                                                   
[Tue Apr 21 21:28:06 2020] hpsa 0000:02:00.0: failed 21 commands in fail_all                                                                                                                                                                                       
[Tue Apr 21 21:28:06 2020] blk_update_request: I/O error, dev sdb, sector 2271114409                                                                                                                                                                               
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: rejecting I/O to offline device                                                                                                                                                                                             
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: [sdb] killing request                                                                                                                                                                                                       
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: rejecting I/O to offline device                                                                                                                                                                                             
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: rejecting I/O to offline device                                                                                                                                                                                             
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: [sdb] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK                                                                                                                                                           
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: [sdb] CDB: Write(10) 2a 00 7e 12 a2 00 00 02 00 00                                                                                                                                                                          
[Tue Apr 21 21:28:06 2020] blk_update_request: I/O error, dev sdb, sector 2115150336                                                                                                                                                                               
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK                                                                                                                                                    
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 0f c1 ed 11 00 00 18 00                                                                                                                                                                   
[Tue Apr 21 21:28:06 2020] blk_update_request: I/O error, dev sdb, sector 264367377                                                                                                                                                                                
[Tue Apr 21 21:28:06 2020] sd 0:1:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
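
The failed commands in the log above are SCSI WRITE(10) requests, and their CDB bytes can be cross-checked against the sectors reported by blk_update_request. The following is a minimal sketch, not part of the original report: the helper name decode_write10 is hypothetical, and it assumes the standard WRITE(10) layout (opcode 0x2a in byte 0, big-endian LBA in bytes 2-5, big-endian transfer length in bytes 7-8).

```python
from typing import Tuple


def decode_write10(cdb_hex: str) -> Tuple[int, int]:
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, block_count). Hypothetical helper for illustration only.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks
    return lba, blocks


if __name__ == "__main__":
    # The three CDBs that appear in the kernel log above.
    for cdb in (
        "2a 00 87 5e 74 a9 00 00 08 00",
        "2a 00 7e 12 a2 00 00 02 00 00",
        "2a 00 0f c1 ed 11 00 00 18 00",
    ):
        print(decode_write10(cdb))
```

For the three CDBs in the log this yields LBAs 2271114409, 2115150336 and 264367377, which match the sectors in the blk_update_request lines. That agreement suggests 512-byte logical blocks on sdb, although the log does not state the block size directly.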

Event Timeline

List of affected virtual machines

/etc/libvirt/qemu/i-00000406.xml:      <nova:name>toolsbeta-sgewebgrid-generic-0901</nova:name>
/etc/libvirt/qemu/i-00001507.xml:      <nova:name>incubator-mw</nova:name>
/etc/libvirt/qemu/i-00001d3c.xml:      <nova:name>tools-sgeexec-0901</nova:name>
/etc/libvirt/qemu/i-00002cf4.xml:      <nova:name>tools-sgewebgrid-lighttpd-0918</nova:name>
/etc/libvirt/qemu/i-00002cf5.xml:      <nova:name>tools-sgewebgrid-lighttpd-0919</nova:name>
/etc/libvirt/qemu/i-0000735c.xml:      <nova:name>media-streaming</nova:name>
/etc/libvirt/qemu/i-00007e14.xml:      <nova:name>wikilink-prod</nova:name>
/etc/libvirt/qemu/i-00007e7c.xml:      <nova:name>commonsarchive-mwtest</nova:name>
/etc/libvirt/qemu/i-000081a8.xml:      <nova:name>wikidata-autodesc</nova:name>
/etc/libvirt/qemu/i-000088a9.xml:      <nova:name>deployment-schema-2</nova:name>
/etc/libvirt/qemu/i-0000892a.xml:      <nova:name>discovery-testing-02</nova:name>
/etc/libvirt/qemu/i-00009819.xml:      <nova:name>visionoid</nova:name>
/etc/libvirt/qemu/i-0001027b.xml:      <nova:name>deployment-echostore01</nova:name>
/etc/libvirt/qemu/i-000105b2.xml:      <nova:name>Esther-outreachy-intern</nova:name>
/etc/libvirt/qemu/i-00012d1a.xml:      <nova:name>tools-k8s-worker-38</nova:name>
/etc/libvirt/qemu/i-00012d29.xml:      <nova:name>tools-k8s-worker-52</nova:name>
/etc/libvirt/qemu/i-00014212.xml:      <nova:name>canary1004-01</nova:name>
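
The lines above look like grep output over the libvirt domain definitions. As a rough sketch (the function name parse_affected and the regular expression are assumptions for illustration, not something used in this task), that output can be turned into a mapping from instance ID to VM name:

```python
import re

# Matches lines such as:
#   /etc/libvirt/qemu/i-00000406.xml:      <nova:name>some-vm</nova:name>
GREP_LINE = re.compile(
    r"^/etc/libvirt/qemu/(?P<instance>i-[0-9a-f]+)\.xml:"
    r"\s*<nova:name>(?P<name>[^<]+)</nova:name>\s*$"
)


def parse_affected(lines):
    """Map libvirt instance IDs to nova names; lines that do not match are skipped."""
    affected = {}
    for line in lines:
        match = GREP_LINE.match(line)
        if match:
            affected[match.group("instance")] = match.group("name")
    return affected
```

Parsing the grep output, rather than the XML files themselves, keeps the sketch independent of the nova XML namespace, which this task does not show.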
JHedden renamed this task from "cloudvirt1004 lost access to all drives" to "cloudvirt1004 failed RAID controller". Apr 21 2020, 9:55 PM

Screen Shot 2020-04-21 at 2.57.05 PM.png (134×2 px, 50 KB)

Uploading this one for the timestamp.

VMs by project
  • cloudvirt-canary
    • canary1004-01.cloudvirt-canary.eqiad.wmflabs
  • commonsarchive
    • commonsarchive-mwtest.commonsarchive.eqiad.wmflabs
  • deployment-prep
    • deployment-echostore01.deployment-prep.eqiad.wmflabs
    • deployment-schema-2.deployment-prep.eqiad.wmflabs
  • incubator
    • incubator-mw.incubator.eqiad.wmflabs
  • machine-vision
    • visionoid.machine-vision.eqiad.wmflabs
  • ogvjs-integration
    • media-streaming.ogvjs-integration.eqiad.wmflabs
  • services
    • Esther-outreachy-intern.services.eqiad.wmflabs
  • shiny-r
    • discovery-testing-02.shiny-r.eqiad.wmflabs
  • tools
    • tools-k8s-worker-38.tools.eqiad.wmflabs
    • tools-k8s-worker-52.tools.eqiad.wmflabs
    • tools-sgeexec-0901.tools.eqiad.wmflabs
    • tools-sgewebgrid-lighttpd-0918.tools.eqiad.wmflabs
    • tools-sgewebgrid-lighttpd-0919.tools.eqiad.wmflabs
  • toolsbeta
    • toolsbeta-sgewebgrid-generic-0901.toolsbeta.eqiad.wmflabs
  • wikidata-autodesc
    • wikidata-autodesc.wikidata-autodesc.eqiad.wmflabs
  • wikilink
    • wikilink-prod.wikilink.eqiad.wmflabs
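
The grouping above follows from the instance FQDNs, which have the form name.project.eqiad.wmflabs. A minimal sketch of that grouping, assuming every hostname follows this pattern (the function name group_by_project is hypothetical):

```python
from collections import defaultdict


def group_by_project(fqdns):
    """Group instance names by the project label, the second component of the FQDN."""
    projects = defaultdict(list)
    for fqdn in fqdns:
        # "name.project.eqiad.wmflabs" -> name, project, "eqiad.wmflabs"
        name, project, _domain = fqdn.split(".", 2)
        projects[project].append(name)
    # Sort projects and the names within each project for stable output.
    return {project: sorted(names) for project, names in sorted(projects.items())}
```

A hostname with fewer than three dot-separated parts raises ValueError in this sketch, which seems preferable to silently filing a VM under the wrong project.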

Mentioned in SAL (#wikimedia-cloud) [2020-04-21T22:02:31Z] <bstorm_> draining tools-k8s-worker-38 and tools-k8s-worker-52 as they are on the crashed host T250869

Mentioned in SAL (#wikimedia-cloud) [2020-04-21T22:09:17Z] <bstorm_> depooling tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 T250869

Mentioned in SAL (#wikimedia-cloud) [2020-04-21T23:06:34Z] <bstorm_> repooled tools-k8s-worker-38/52, tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 T250869

All instances were successfully evacuated from the host. We should have logged this here earlier for anyone following along from the cloud-announce announcement.

Andrew claimed this task.
Andrew subscribed.

I'm about to mark this host for decom, so I don't think there's anything else interesting left for this task.