
pc2006 crashed
Closed, ResolvedPublic

Description

Latest hardware logs:

2017-07-12T21:08:52-0500	CPU0000	CPU 1 has an internal error (IERR).
2017-07-12T21:08:50-0500	SYS1003	System CPU Resetting.
2017-07-12T21:08:49-0500	RAC0703	Requested system hardreset.
2017-07-12T21:08:49-0500	SYS1003	System CPU Resetting.
2017-07-12T21:08:45-0500	MEM0001	Multi-bit memory errors detected on a memory device at location(s) DIMM_A2.

Complete logs:

@Papaul please contact the supplier to request replacement memory and CPUs. This is leased hardware, AFAIK.

Event Timeline

jcrespo created this task.Jul 13 2017, 8:25 AM
Restricted Application added a project: Operations.Jul 13 2017, 8:25 AM
Restricted Application added a subscriber: Aklapper.

I think the RAID is OK, but I wouldn't mind a second opinion, @Marostegui. The boot was noisy only because of the unclean unmount:

[    4.993319] ata1: SATA max UDMA/133 abar m2048@0x91c01000 port 0x91c01100 irq 38                                                  
[    4.993321] ata2: SATA max UDMA/133 abar m2048@0x91c01000 port 0x91c01180 irq 38                                                  
[    4.993323] ata3: SATA max UDMA/133 abar m2048@0x91c01000 port 0x91c01200 irq 38                                                  
[    4.993325] ata4: SATA max UDMA/133 abar m2048@0x91c01000 port 0x91c01280 irq 38                                                  
[    4.993590] ahci 0000:00:1f.2: SSS flag set, parallel bus scan disabled                                                           
[    4.993648] ahci 0000:00:1f.2: AHCI 0001.0300 32 slots 6 ports 6 Gbps 0x3f impl SATA mode                                         
[    4.993651] ahci 0000:00:1f.2: flags: 64bit ncq stag led clo pio slum part ems apst                                               
[    5.041159] scsi host4: ahci                                                                                                      
[    5.041280] scsi host5: ahci                                                                                                      
[    5.041397] scsi host6: ahci                                                                                                      
[    5.041513] scsi host7: ahci                                                                                                      
[    5.041623] scsi host8: ahci                                                                                                      
[    5.041737] scsi host9: ahci                                                                                                      
[    5.041807] ata5: SATA max UDMA/133 abar m2048@0x91c00000 port 0x91c00100 irq 39                                                  
[    5.041809] ata6: SATA max UDMA/133 abar m2048@0x91c00000 port 0x91c00180 irq 39                                                  
[    5.041811] ata7: SATA max UDMA/133 abar m2048@0x91c00000 port 0x91c00200 irq 39                                                  
[    5.041813] ata8: SATA max UDMA/133 abar m2048@0x91c00000 port 0x91c00280 irq 39                                                  
[    5.041815] ata9: SATA max UDMA/133 abar m2048@0x91c00000 port 0x91c00300 irq 39                                                  
[    5.041818] ata10: SATA max UDMA/133 abar m2048@0x91c00000 port 0x91c00380 irq 39                                                 
Begin: Loading essential drivers ... done.                                                                                           
Begin: Running /scripts/init-premount ... Waiting 5s for disks to show up (T131961)                                                  
[    5.220235] usb 1-1: new high-speed USB device number 2 using ehci-pci                                                            
[    5.232230] usb 2-1: new high-speed USB device number 2 using ehci-pci                                                            
[    5.312245] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)                                                                
[    5.319463] ata1.00: ATA-10: INTEL SSDSC2BX800G4R, G201DL29, max UDMA/133                                                         
[    5.327049] ata1.00: 1562824368 sectors, multi 1: LBA48 NCQ (depth 31/32)                                                         
[    5.335040] ata1.00: configured for UDMA/133                                                                                      
[    5.340209] scsi 0:0:0:0: Direct-Access     ATA      INTEL SSDSC2BX80 DL29 PQ: 0 ANSI: 5                                          
[    5.350191] ata1.00: Enabling discard_zeroes_data                                                                                 
[    5.355452] sd 0:0:0:0: [sda] 1562824368 512-byte logical blocks: (800 GB/745 GiB)                                                
[    5.356502] usb 1-1: New USB device found, idVendor=8087, idProduct=800a                                                          
[    5.356503] usb 1-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0                                                     
[    5.356653] hub 1-1:1.0: USB hub found                                                                                            
[    5.356750] hub 1-1:1.0: 6 ports detected                                                                                         
[    5.368204] ata5: SATA link down (SStatus 0 SControl 300)                                                                         
[    5.368500] usb 2-1: New USB device found, idVendor=8087, idProduct=8002                                                          
[    5.368502] usb 2-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0                                                     
[    5.368665] hub 2-1:1.0: USB hub found                                                                                            
[    5.368749] hub 2-1:1.0: 8 ports detected                                                                                         
[    5.418099] sd 0:0:0:0: [sda] 4096-byte physical blocks                                                                           
[    5.423949] sd 0:0:0:0: [sda] Write Protect is off                                                                                
[    5.429293] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA                              
[    5.439571] ata1.00: Enabling discard_zeroes_data                                                                                 
[    5.445014]  sda: sda1 sda2 sda3                                                                                                  
[    5.448873] ata1.00: Enabling discard_zeroes_data                                                                                 
[    5.454189] sd 0:0:0:0: [sda] Attached SCSI disk                                                                                  
[    5.632198] usb 1-1.6: new high-speed USB device number 3 using ehci-pci                                                          
[    5.672174] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)                                                                
[    5.679358] ata2.00: ATA-10: INTEL SSDSC2BX800G4R, G201DL29, max UDMA/133                                                         
[    5.686934] ata2.00: 1562824368 sectors, multi 1: LBA48 NCQ (depth 31/32)                                                         
[    5.694893] ata2.00: configured for UDMA/133                                                                                      
[    5.699936] scsi 1:0:0:0: Direct-Access     ATA      INTEL SSDSC2BX80 DL29 PQ: 0 ANSI: 5                                          
[    5.709129] ata2.00: Enabling discard_zeroes_data                                                                                 
[    5.714391] sd 1:0:0:0: [sdb] 1562824368 512-byte logical blocks: (800 GB/745 GiB)                                                
[    5.722839] sd 1:0:0:0: [sdb] 4096-byte physical blocks                                                                           
[    5.728691] sd 1:0:0:0: [sdb] Write Protect is off                                                                                
[    5.729090] usb 1-1.6: New USB device found, idVendor=413c, idProduct=a001                                                        
[    5.729091] usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=3                                                   
[    5.729092] usb 1-1.6: Product: Gadget USB HUB                                                                                    
[    5.729093] usb 1-1.6: Manufacturer: no manufacturer                                                                              
[    5.729094] usb 1-1.6: SerialNumber: 0123456789                                                                                   
[    5.729366] hub 1-1.6:1.0: USB hub found                                                                                          
[    5.729465] hub 1-1.6:1.0: 6 ports detected                                                                                       
[    5.741936] sd 0:0:0:0: Attached scsi generic sg0 type 0                                                                          
[    5.742015] sd 1:0:0:0: Attached scsi generic sg1 type 0                                                                          
[    5.786310] sd 1:0:0:0: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA                              
[    5.796613] ata2.00: Enabling discard_zeroes_data                                                                                 
[    5.802032]  sdb: sdb1 sdb2 sdb3                                                                                                  
[    5.806068] ata2.00: Enabling discard_zeroes_data                                                                                 
[    5.811403] sd 1:0:0:0: [sdb] Attached SCSI disk                                                                                  
[    6.032199] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)                                                                
[    6.039437] ata3.00: ATA-10: INTEL SSDSC2BX800G4R, G201DL29, max UDMA/133                                                         
[    6.047024] ata3.00: 1562824368 sectors, multi 1: LBA48 NCQ (depth 31/32)                                                         
[    6.054999] ata3.00: configured for UDMA/133                                                                                      
[    6.060352] scsi 2:0:0:0: Direct-Access     ATA      INTEL SSDSC2BX80 DL29 PQ: 0 ANSI: 5                                          
[    6.069691] ata3.00: Enabling discard_zeroes_data                                                                                 
[    6.069818] sd 2:0:0:0: Attached scsi generic sg2 type 0                                                                          
[    6.080881] sd 2:0:0:0: [sdc] 1562824368 512-byte logical blocks: (800 GB/745 GiB)                                                
[    6.089332] sd 2:0:0:0: [sdc] 4096-byte physical blocks                                                                           
[    6.095175] sd 2:0:0:0: [sdc] Write Protect is off                                                                                
[    6.100527] sd 2:0:0:0: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA                              
[    6.110843] ata3.00: Enabling discard_zeroes_data                                                                                 
[    6.116281]  sdc: sdc1 sdc2 sdc3                                                                                                  
[    6.120476] ata3.00: Enabling discard_zeroes_data                                                                                 
[    6.125818] sd 2:0:0:0: [sdc] Attached SCSI disk                                                                                  
[    6.400162] ata4: SATA link down (SStatus 0 SControl 300)                                                                         
[    6.724129] ata6: SATA link down (SStatus 0 SControl 300)                                                                         
[    7.048102] ata7: SATA link down (SStatus 0 SControl 300)                                                                         
[    7.372068] ata8: SATA link down (SStatus 0 SControl 300)                                                                         
[    7.696041] ata9: SATA link down (SStatus 0 SControl 300)                                                                         
[    8.020008] ata10: SATA link down (SStatus 0 SControl 300)                                                                        
done.                                                                                                                                
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... Begin: Assembling all MD arrays ... [   10.152425] md: md.
[   10.156684] md: bind<sdb2>                                                                                                        
[   10.161867] md: bind<sdc2>                                                                                                        
[   10.165160] md: bind<sda2>                                                                                                        
[   10.169712] md: raid0 personality registered for level 0                                                                          
[   10.176351] md/raid0:md0: md_size is 2924544 sectors.                                                                             
[   10.181998] md: RAID0 configuration for md0 - 1 zone                                                                              
[   10.187540] md: zone0=[sda2/sdb2/sdc2]                                                                                            
[   10.191768]       zone-offset=         0KB, device-offset=         0KB, size=   1462272KB                                         
[   10.200898]                                                                                                                       
[   10.202577] md0: detected capacity change from 0 to 1497366528                                                                    
mdadm: /dev/md/0 has been started with 3 drives.                                                                                     
[   10.220863] md: md1 stopped.                                                                                                      
[   10.225855] md: bind<sdb3>                                                                                                        
[   10.229103] md: bind<sdc3>                                                                                                        
[   10.232363] md: bind<sda3>                                                                                                        
[   10.236877] md/raid0:md1: md_size is 4684689408 sectors.                                                                          
[   10.242822] md: RAID0 configuration for md1 - 1 zone                                                                              
[   10.248366] md: zone0=[sda3/sdb3/sdc3]                                                                                            
[   10.252589]       zone-offset=         0KB, device-offset=         0KB, size=2342344704KB                                         
[   10.261716]                                                                                                                       
[   10.263389] md1: detected capacity change from 0 to 2398560976896                                                                 
mdadm: /dev/md/1 has been started with 3 drives.                                                                                     
Success: assembled all arrays.                                                                                                       
done.                                                                                                                                
[   10.284315] device-mapper: uevent: version 1.0.3                                                                                  
[   10.289698] device-mapper: ioctl: 4.34.0-ioctl (2015-10-28) initialised: dm-devel@redhat.com                                      
[   10.304726] random: lvm: uninitialized urandom read (4 bytes read, 15 bits of entropy available)                                  
[   10.336463] random: lvm: uninitialized urandom read (2 bytes read, 16 bits of entropy available)                                  
done.                                                                                                                                
Begin: Running /scripts/local-premount ... done.                                                                                     
Begin: Will now check root file system ... fsck from util-linux 2.25.2                                                               
[/sbin/fsck.ext3 (1) -- /dev/mapper/pc2006--vg-root] fsck.ext3 -a -C0 /dev/mapper/pc2006--vg-root                                    
/dev/mapper/pc2006--vg-root: recovering journal                                                                                      
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2293834 (uid=0, gid=0, mode=0100644, size=924096)                               
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294068 (uid=0, gid=0, mode=0100644, size=165856)                               
[   10.686065] EXT4-fs (dm-0): mounting ext3 file system using the ext4 subsystem                                                    
/dev/mapper/pc2006--vg-root: Cle[   10.698218] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)               
aring orphaned inode 2378714 (uid=0, gid=0, mode=0100644, size=1607712)                                                              
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294466 (uid=0, gid=0, mode=0100644, size=10680)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294463 (uid=0, gid=0, mode=0100644, size=31784)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294462 (uid=0, gid=0, mode=0100644, size=84856)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294459 (uid=0, gid=0, mode=0100644, size=43592)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294457 (uid=0, gid=0, mode=0100644, size=47712)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294456 (uid=0, gid=0, mode=0100644, size=22952)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294455 (uid=0, gid=0, mode=0100644, size=31632)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294454 (uid=0, gid=0, mode=0100644, size=89104)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294452 (uid=0, gid=0, mode=0100644, size=1051056)                              
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294450 (uid=0, gid=0, mode=0100644, size=14664)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294448 (uid=0, gid=0, mode=0100644, size=35176)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2294444 (uid=0, gid=0, mode=0100755, size=1738176)                              
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2293981 (uid=0, gid=0, mode=0100755, size=140928)                               
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2293794 (uid=0, gid=0, mode=0100755, size=137384)                               
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2277804 (uid=0, gid=0, mode=0100644, size=26258)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2056195 (uid=0, gid=0, mode=0100644, size=151984)                               
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 2379253 (uid=0, gid=0, mode=0100644, size=31048)                                
/dev/mapper/pc2006--vg-root: Clearing orphaned inode 1056772 (uid=0, gid=0, mode=0100600, size=0)                                    
/dev/mapper/pc2006--vg-root: clean, 63275/3055616 files, 2343815/12206080 blocks                                                     
done.                                                                                                                                
done.                                                                                                                                
Begin: Running /scripts/local-bottom ... done.                                                                                       
Begin: Running /scripts/init-bottom ... done.                                                                                        
[   10.921945] random: systemd: uninitialized urandom read (16 bytes read, 53 bits of entropy available)                             
[   10.935045] systemd[1]: systemd 215 running in system mode. (+PAM +AUDIT +SELINUX +IMA +SYSVINIT +LIBCRYPTSETUP +GCRYPT +ACL +XZ )
[   10.951045] systemd[1]: Detected architecture 'x86-64'.                                                                           

Welcome to Debian GNU/Linux 8 (jessie)!

[   10.976599] systemd[1]: Inserted module 'autofs4'
[   10.982238] systemd[1]: Set hostname to <pc2006>.
[   11.005711] random: systemd-sysv-ge: uninitialized urandom read (16 bytes read, 60 bits of entropy available)
[   11.025415] systemd-sysv-generator[474]: Overwriting existing symlink /run/systemd/generator.late/mysql.service with real service
[   11.039031] random: systemd: uninitialized urandom read (16 bytes read, 62 bits of entropy available)
[   11.049564] random: systemd: uninitialized urandom read (16 bytes read, 62 bits of entropy available)
[   11.060003] random: systemd: uninitialized urandom read (16 bytes read, 62 bits of entropy available)
[   11.073227] random: systemd: uninitialized urandom read (16 bytes read, 62 bits of entropy available)
[   11.083808] random: systemd: uninitialized urandom read (16 bytes read, 62 bits of entropy available)
[   11.118179] systemd[1]: Cannot add dependency job for unit display-manager.service, ignoring: Unit display-manager.service failed.
[   11.135022] systemd[1]: Expecting device dev-ttyS1.device...
         Expecting device dev-ttyS1.device...
[   11.151747] systemd[1]: Starting Forward Password Requests to Wall Directory Watch.
[   11.160371] systemd[1]: Started Forward Password Requests to Wall Directory Watch.
[   11.168847] systemd[1]: Starting Remote File Systems (Pre).
[  OK  ] Reached target Remote File Systems (Pre).
[   11.187718] systemd[1]: Reached target Remote File Systems (Pre).
[   11.194560] systemd[1]: Starting Dispatch Password Requests to Console Directory Watch.
[   11.203537] systemd[1]: Started Dispatch Password Requests to Console Directory Watch.
[   11.212412] systemd[1]: Starting Arbitrary Executable File Formats File System Automount Point.
[  OK  ] Set up automount Arbitrary Executable File Formats F...utomount Point.
[   11.235709] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[   11.246222] systemd[1]: Starting Swap.
[  OK  ] Reached target Swap.
[   11.259697] systemd[1]: Reached target Swap.
[   11.264489] systemd[1]: Expecting device dev-disk-by\x2duuid-8734275b\x2d11d9\x2d4deb\x2db5e5\x2d26ceda7a00fe.device...
         Expecting device dev-disk-by\x2duuid-8734275b\x2d11d...a00fe.device...
[   11.291700] systemd[1]: Expecting device dev-mapper-pc2006\x2d\x2dvg\x2dsrv.device...
         Expecting device dev-mapper-pc2006\x2d\x2dvg\x2dsrv.device...
[   11.311692] systemd[1]: Starting Root Slice.
[  OK  ] Created slice Root Slice.

I agree, RAID looks good. mdadm didn't log anything relevant, and it didn't rename the arrays or anything (which usually happens when it comes back in a weird state).
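For reference, the check described above can be sketched in a few lines of Python. This is only an illustration that scans saved console output for the `mdadm: ... has been started with N drives.` lines seen in the paste; the function names and the `expected` mapping are made up for this example, and it does not replace looking at the arrays on the live host.

```python
import re

# Excerpt of the console output pasted above.
CONSOLE_EXCERPT = """\
[   10.156684] md: bind<sdb2>
[   10.161867] md: bind<sdc2>
[   10.165160] md: bind<sda2>
mdadm: /dev/md/0 has been started with 3 drives.
[   10.225855] md: bind<sdb3>
[   10.229103] md: bind<sdc3>
[   10.232363] md: bind<sda3>
mdadm: /dev/md/1 has been started with 3 drives.
Success: assembled all arrays.
"""

# Matches the "started" message printed by mdadm during assembly.
STARTED = re.compile(
    r"^mdadm: (/dev/md/\d+) has been started with (\d+) drives?\.", re.M
)


def started_arrays(console_text):
    """Return {array_name: drive_count} for every array mdadm reported as started."""
    return {name: int(count) for name, count in STARTED.findall(console_text)}


def arrays_as_expected(console_text, expected):
    """True only if the started arrays match the expected names and drive counts.

    A renamed array (for example one that came back under a different number)
    or a missing member shows up as a mismatch.
    """
    return started_arrays(console_text) == expected


if __name__ == "__main__":
    # Hypothetical expectation for this host: two arrays, three members each.
    expected = {"/dev/md/0": 3, "/dev/md/1": 3}
    print(started_arrays(CONSOLE_EXCERPT))
    print(arrays_as_expected(CONSOLE_EXCERPT, expected))
```

Comparing against an explicit name-to-count mapping catches both the "renamed array" case and a missing drive in one step.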

I have talked to Papaul about this and he'll contact Dell next week

Papaul triaged this task as Medium priority.Jul 17 2017, 3:38 PM

Mentioned in SAL (#wikimedia-operations) [2017-07-17T15:47:25Z] <marostegui> Stop MySQL on pc2006 - T170520

RobH added a comment.Jul 17 2017, 6:34 PM

This should be handled via self dispatch, not by contacting Dell.

@Papaul: Please see my email on self dispatch, and get that handled before processing this request.

Test complete with error 2002-0251

RobH added a comment.Jul 17 2017, 8:17 PM

Please note that the memory test output doesn't say which DIMM failed the test. That information is required for the test to be useful.

Otherwise, the logs show a single CPU reporting errors, along with errors from:

"Normal","Thu Dec 24 2015 11:53:28","A problem was detected in Memory Reference Code (MRC)."
"Critical","Thu Dec 24 2015 11:53:28","Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
"Normal","Thu Dec 24 2015 12:12:34","A problem was detected in Memory Reference Code (MRC)."
"Critical","Thu Dec 24 2015 12:12:34","Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
"Normal","Tue Jan 05 2016 15:35:52","A problem was detected in Memory Reference Code (MRC)."
"Critical","Tue Jan 05 2016 15:35:52","Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
"Critical","Sun May 08 2016 22:56:15","CPU 2 has an internal error (IERR)."
"Critical","Sun May 08 2016 21:57:47","Multi-bit memory errors detected on a memory device at location(s) DIMM_B8."
"Critical","Sun May 08 2016 21:57:47","Multi-bit memory errors detected on a memory device at location(s) DIMM_B7."
"Critical","Sun May 08 2016 21:57:47","Multi-bit memory errors detected on a memory device at location(s) DIMM_B7."
"Critical","Sun May 08 2016 21:57:47","Multi-bit memory errors detected on a memory device at location(s) DIMM_B7."
"Critical","Sun May 08 2016 21:57:47","Multi-bit memory errors detected on a memory device at location(s) DIMM_B7."
"Critical","Sun May 08 2016 21:57:47","Multi-bit memory errors detected on a memory device at location(s) DIMM_B7."
"Warning","Sat Jun 18 2016 23:08:48","Correctable memory error rate exceeded for DIMM_B7."
"Critical","Sun Jul 03 2016 13:19:55","Multi-bit memory errors detected on a memory device at location(s) DIMM_B8."
"Critical","Sun Jul 03 2016 13:21:41","Multi-bit memory errors detected on a memory device at location(s) DIMM_B8."
"Critical","Sun Jul 03 2016 13:21:41","Multi-bit memory errors detected on a memory device at location(s) DIMM_B7."
"Critical","Wed Jul 12 2017 21:08:45","Multi-bit memory errors detected on a memory device at location(s) DIMM_A2."
"Critical","Wed Jul 12 2017 21:08:52","CPU 1 has an internal error (IERR)."

Since both CPUs and memory from both banks have reported errors, this seems to be a mainboard issue.
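A rough way to tally which components show up in a log export like the one quoted above, as a Python sketch. It assumes the export is CSV with severity, timestamp and message columns (as in the paste), uses only a few of the quoted lines as sample input, and treats the letter after `DIMM_` as the bank; the function names are illustrative only.

```python
import csv
import io
import re
from collections import Counter

# A few of the log lines quoted above, used as sample input.
SEL_EXCERPT = '''\
"Critical","Thu Dec 24 2015 11:53:28","Multi-bit memory errors detected on a memory device at location(s) DIMM_B2."
"Critical","Sun May 08 2016 22:56:15","CPU 2 has an internal error (IERR)."
"Critical","Sun May 08 2016 21:57:47","Multi-bit memory errors detected on a memory device at location(s) DIMM_B8."
"Critical","Sun May 08 2016 21:57:47","Multi-bit memory errors detected on a memory device at location(s) DIMM_B7."
"Warning","Sat Jun 18 2016 23:08:48","Correctable memory error rate exceeded for DIMM_B7."
"Critical","Wed Jul 12 2017 21:08:45","Multi-bit memory errors detected on a memory device at location(s) DIMM_A2."
"Critical","Wed Jul 12 2017 21:08:52","CPU 1 has an internal error (IERR)."
'''


def tally_components(sel_text):
    """Count log entries per DIMM slot and per CPU mentioned in the messages."""
    counts = Counter()
    for row in csv.reader(io.StringIO(sel_text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        _severity, _timestamp, message = row
        for bank, slot in re.findall(r"DIMM_([A-Z])(\d+)", message):
            counts["DIMM_%s%s" % (bank, slot)] += 1
        cpu = re.search(r"CPU (\d+) has an internal error", message)
        if cpu:
            counts["CPU" + cpu.group(1)] += 1
    return counts


def summarize(counts):
    """Return (sorted bank letters, sorted CPU names) seen in the tally."""
    banks = sorted({key[5] for key in counts if key.startswith("DIMM_")})
    cpus = sorted(key for key in counts if key.startswith("CPU"))
    return banks, cpus


if __name__ == "__main__":
    counts = tally_components(SEL_EXCERPT)
    print(dict(counts))
    print(summarize(counts))
```

Errors spread across both CPUs and both DIMM letter groups are what point away from a single bad part.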

I'll go ahead and have a new mainboard dispatched via the self-dispatch portal (I wanted to process one of these anyhow) to codfw for repair. Papaul will need to remove the parts, replace the mainboard, restore the system tag, and then the system will need to be reimaged. (The new MAC makes a reimage easier than trying to re-detect the hardware and hoping no chipsets have changed.)

RobH added a comment.EditedJul 17 2017, 8:47 PM

I've gone ahead and used the self-dispatch tool to request a new mainboard. It seems self-dispatch requests are still reviewed by their team, but we can list the exact parts and timeline for replacement. At the time of dispatch creation, we can also elect to have a Dell tech show up, and @Papaul likes having them do board replacements (from what I have seen in the past), so I've elected for them to call and schedule a time with @Papaul.

If he decides he doesn't need a Dell Tech, he can decline them when they contact him.

case dispatch: SR951102210

When we replace the mainboard, we normally do not reimage the server unless it is necessary (to avoid copying all the content back, in this case 2TB).
I would say let's first try whether it works without a reimage.
We normally delete the persistent net rules before shutdown so the new MAC is detected cleanly, and ask @Papaul to disable PXE boot during the first start to avoid accidental reimages.
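As a minimal sketch of the "delete the persistent net rules" step, in Python: it moves the rules file aside instead of deleting it, so the old MAC mapping can still be inspected later. The path `/etc/udev/rules.d/70-persistent-net.rules` is an assumption (the usual location on older Debian releases), and the function name and backup suffix are made up for this example; check the actual file on the host before running anything like this as root.

```python
from pathlib import Path

# Assumed location of the persistent net rules; verify on the host first.
DEFAULT_RULES = Path("/etc/udev/rules.d/70-persistent-net.rules")


def set_aside_persistent_net_rules(rules_path=DEFAULT_RULES,
                                   backup_suffix=".pre-mainboard-swap"):
    """Rename the rules file so it is no longer picked up on the next boot.

    The backup name no longer ends in ".rules", so udev should ignore it.
    Returns the backup path, or None if there was nothing to move.
    """
    rules_path = Path(rules_path)
    if not rules_path.exists():
        return None
    backup = rules_path.with_name(rules_path.name + backup_suffix)
    rules_path.rename(backup)
    return backup


if __name__ == "__main__":
    moved_to = set_aside_persistent_net_rules()
    print("nothing to do" if moved_to is None else "moved to %s" % moved_to)
```

Keeping a copy rather than deleting outright makes it easy to compare the old and new interface entries after the first boot.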

Thank you guys!

@Papaul I have started MySQL again on pc2006 so it doesn't fall too many days behind. If you need this host to be taken down again, please ping us so we can stop MySQL again.
Thank you!

Mentioned in SAL (#wikimedia-operations) [2017-07-18T15:14:16Z] <marostegui> Stop MySQL and shutdown pc2006 for mainboard replacement - T170520

Papaul reassigned this task from Papaul to Marostegui.Jul 18 2017, 6:27 PM

Main board replacement complete. Server is back up. Please see below for return information on bad main board for reference.

Marostegui closed this task as Resolved.Jul 19 2017, 5:31 AM

Thanks @Papaul

I have started MySQL and the replication thread is catching up.
Going to close this as resolved, and if it crashes again, let's reopen it.