Details
Project | Branch | Lines +/- | Subject
---|---|---|---
operations/mediawiki-config | master | +1 -1 | db-codfw.php: Repool es2019
Status | Assigned | Task
---|---|---
Resolved | None | T130702 Several es20XX servers keep crashing (es2017, es2019, es2015, es2014) since 23 March
Resolved | jcrespo | T149526 es2019 crashed again
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2016-10-30T16:35:25Z] <jynus> powercycle es2019 after crash T149526
Fri May 27 2016 17:31:34 Correctable memory error rate exceeded for DIMM_A1.
Fri May 27 2016 18:42:52 Correctable memory error rate exceeded for DIMM_A1.
Tue Jun 07 2016 17:21:45 Correctable memory error rate exceeded for DIMM_A1.
Tue Jun 07 2016 17:27:14 Correctable memory error rate exceeded for DIMM_A1.
Tue Jun 07 2016 17:27:31 Correctable memory error rate exceeded for DIMM_A1.
Sun Oct 30 2016 15:14:10 Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
Sun Oct 30 2016 15:14:13 CPU 2 has an internal error (IERR).
MEM0001: Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
2016-10-30T15:14:10-0500
Log Sequence Number: 362
Detailed Description: The memory has encountered a uncorrectable error. System performance may be degraded. The operating system and/or applications may fail as a result.
Recommended Action: Re-install the memory component. If the problem persists, contact technical support. Refer to the product documentation to choose a convenient contact method.
Comment: root

CPU0000: CPU 2 has an internal error (IERR).
2016-10-30T15:14:14-0500
Log Sequence Number: 366
Detailed Description: System event log and OS logs may indicate that the exception is external to the processor.
Recommended Action: Review System Event Log and Operating System Logs. If the issue persists, contact technical support. Refer to the product documentation to choose a convenient contact method.
Comment: root
This is from the hardware logs
Open to scheduling it whenever you have the time. Should we update the BIOS (again)? CC'ing @RobH here.
Mentioned in SAL (#wikimedia-operations) [2016-11-21T15:58:12Z] <marostegui> Shutting down MySQL es2019 for HW maintenance - T149526
Mentioned in SAL (#wikimedia-operations) [2016-11-21T15:59:31Z] <marostegui> Powering off es2019 for HW maintenance - T149526
1 - Swapped CPU 2 with CPU 1
2 - Updated BIOS from 2.1.6 to 2.2.5
3 - Cleared syslog
Leaving this task open for now.
This task has been open for almost 2 months. Closing it; it can be reopened any time we have the issue again.
"Normal","Mon Nov 21 2016 16:22:07","Log cleared."
"Critical","Sat Apr 22 2017 07:52:10","CPU 2 has an internal error (IERR)."
"Normal","Sat Apr 22 2017 06:56:24","A problem was detected related to the previous server boot."
"Critical","Sat Apr 22 2017 06:56:24","Multi-bit memory errors detected on a memory device at location(s) DIMM_B4."
"Critical","Sat Apr 22 2017 06:56:24","Multi-bit memory errors detected on a memory device at location(s) DIMM_A1."
"Critical","Sat Apr 22 2017 06:56:24","Multi-bit memory errors detected on a memory device at location(s) DIMM_A1."
"Critical","Sat Apr 22 2017 06:56:24","Multi-bit memory errors detected on a memory device at location(s) DIMM_A1."
"Critical","Sat Apr 22 2017 06:56:24","Multi-bit memory errors detected on a memory device at location(s) DIMM_A1."
Severity | Date and Time | Message ID | Summary | Comment
2017-04-22T07:37:48-0500 USR0032 The session for root from 10.64.32.20 using SSH is logged off.
2017-04-22T07:37:41-0500 USR0030 Successfully logged in using root, from 10.192.16.172 and GUI.
2017-04-22T07:36:16-0500 USR0030 Successfully logged in using root, from 10.64.32.20 and SSH.
2017-04-22T07:36:08-0500 USR0032 The session for root from 208.80.154.149 using SSH is logged off.
2017-04-22T07:30:25-0500 USR0032 The session for root from 10.64.32.20 using SSH is logged off.
2017-04-22T07:27:55-0500 USR0030 Successfully logged in using root, from 208.80.154.149 and SSH.
2017-04-22T07:22:50-0500 USR0030 Successfully logged in using root, from 10.64.32.20 and SSH.
2017-04-22T06:56:31-0500 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
2017-04-22T06:56:30-0500 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
2017-04-22T06:56:28-0500 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
2017-04-22T06:56:27-0500 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
2017-04-22T06:56:26-0500 MEM0001 Multi-bit memory errors detected on a memory device at location(s) DIMM_B4.
2017-04-22T06:56:24-0500 UEFI0079 One or more Uncorrectable Memory errors occurred in the previous boot.
2017-04-22T06:56:24-0500 PST0090 A problem was detected related to the previous server boot.
2017-04-22T07:52:10-0500 CPU0000 CPU 2 has an internal error (IERR).
2017-04-22T07:52:10-0500 SYS1003 System CPU Resetting.
2017-04-22T07:52:09-0500 RAC0703 Requested system hardreset.
2017-04-22T07:52:09-0500 SYS1003 System CPU Resetting.
2017-04-22T03:58:35-0500 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-04-15T08:15:00-0500 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-04-15T03:58:37-0500 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-04-08T08:10:41-0500 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-04-08T03:58:44-0500 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-04-01T08:10:57-0500 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-04-01T03:58:49-0500 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-03-25T08:12:35-0500 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-03-25T03:58:52-0500 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-03-18T08:10:54-0500 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-03-18T03:58:55-0500 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-03-11T07:10:17-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-03-11T02:59:02-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-03-04T07:20:34-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-03-04T02:59:05-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-02-25T07:14:41-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-02-25T02:59:06-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-02-18T07:18:58-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-02-18T02:59:12-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-02-11T07:18:01-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-02-11T02:59:14-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-02-04T07:11:28-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-02-04T02:59:20-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-01-28T07:19:12-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-01-28T02:59:22-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-01-21T07:11:32-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-01-21T02:59:31-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-01-14T07:15:38-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-01-14T02:59:34-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2017-01-07T07:14:37-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2017-01-07T02:59:35-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2016-12-31T07:10:41-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2016-12-31T02:59:43-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2016-12-24T07:14:53-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2016-12-24T02:59:47-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2016-12-17T07:14:41-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2016-12-17T02:59:50-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2016-12-10T07:13:03-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2016-12-10T02:59:51-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2016-12-03T07:09:55-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2016-12-03T02:59:55-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2016-11-26T07:14:52-0600 CTL38 The Patrol Read operation completed for Integrated RAID Controller 1.
2016-11-26T03:00:00-0600 CTL37 A Patrol Read operation started for Integrated RAID Controller 1.
2016-11-22T19:30:50-0600 USR0032 The session for root from 208.80.154.149 using SSH is logged off.
2016-11-22T19:30:45-0600 JCP036 The (installation or configuration) job JID_798645851255 is successfully deleted.
2016-11-22T19:29:45-0600 JCP027 Job created successfully.
2016-11-22T19:29:37-0600 USR0030 Successfully logged in using root, from 208.80.154.149 and SSH.
2016-11-22T19:29:37-0600 LOG007 The previous log entry was repeated 1 times.
2016-11-22T19:28:31-0600 SYS1003 System CPU Resetting.
2016-11-22T19:24:13-0600 JCP036 The (installation or configuration) job JID_798641868867 is successfully deleted.
2016-11-22T19:23:51-0600 USR0032 The session for root from 208.80.154.149 using SSH is logged off.
2016-11-22T19:23:06-0600 JCP027 Job created successfully.
2016-11-22T19:23:02-0600 USR0030 Successfully logged in using root, from 208.80.154.149 and SSH.
2016-11-22T19:23:02-0600 LOG007 The previous log entry was repeated 1 times.
2016-11-22T19:21:52-0600 SYS1003 System CPU Resetting.
2016-11-21T17:11:23-0600 USR0032 The session for root from 208.80.153.5 using SSH is logged off.
2016-11-21T17:11:03-0600 USR0032 The session for root from 10.193.0.7 using GUI is logged off.
2016-11-21T17:09:41-0600 USR0174 The Front Panel USB device is removed from the operating system.
2016-11-21T17:09:07-0600 NIC101 The NIC Integrated 1 Port 1 network link is started.
2016-11-21T17:08:26-0600 USR0030 Successfully logged in using root, from 208.80.153.5 and SSH.
2016-11-21T17:08:15-0600 USR0032 The session for root from 91.198.174.112 using SSH is logged off.
2016-11-21T17:08:13-0600 USR0032 The session for root from 208.80.153.5 using SSH is logged off.
2016-11-21T17:08:13-0600 LOG007 The previous log entry was repeated 1 times.
2016-11-21T17:02:30-0600 USR0030 Successfully logged in using root, from 208.80.153.5 and SSH.
2016-11-21T17:00:04-0600 USR0030 Successfully logged in using root, from 10.193.0.7 and GUI.
2016-11-21T17:00:04-0600 LOG007 The previous log entry was repeated 1 times.
2016-11-21T16:59:16-0600 USR0030 Successfully logged in using root, from 208.80.153.5 and SSH.
2016-11-21T16:58:51-0600 USR0032 The session for root from 10.193.0.7 using GUI is logged off.
2016-11-21T16:58:48-0600 JCP036 The (installation or configuration) job JID_797690674507 is successfully deleted.
2016-11-21T16:57:47-0600 JCP027 Job created successfully.
2016-11-21T16:57:28-0600 USR0173 The Front Panel USB port switched automatically from iDRAC to operating system.
2016-11-21T16:57:28-0600 USR0171 The Front Panel USB port is detached from the iDRAC Disk.USBFront.1. Device Details: Device Class 3, Vendor ID 0B38, Product ID 0010.
2016-11-21T16:57:23-0600 USR0170 The Front Panel USB port is attached to iDRAC Disk.USBFront.1. Device details: Device class 3, Vendor ID 0B38, Manufacturer Name Not Available, Product ID 0010, Product Name Not Available, Serial Number Not Available.
2016-11-21T16:57:15-0600 PR36 Version change detected for BIOS firmware. Previous version:2.1.6, Current version:2.2.5
2016-11-21T16:55:55-0600 USR0030 Successfully logged in using root, from 208.80.153.5 and SSH.
2016-11-21T16:55:52-0600 USR0031 Unable to log in for root from 208.80.153.5 using SSH.
2016-11-21T16:55:41-0600 USR0030 Successfully logged in using root, from 208.80.153.5 and SSH.
2016-11-21T16:55:35-0600 USR0031 Unable to log in for root from 208.80.153.5 using SSH.
2016-11-21T16:55:35-0600 LOG007 The previous log entry was repeated 1 times.
2016-11-21T16:55:26-0600 SYS1003 System CPU Resetting.
2016-11-21T16:55:25-0600 SYS1000 System is turning on.
2016-11-21T16:55:16-0600 SYS1003 System CPU Resetting.
The memory module on this host was apparently already replaced (see T149526#2756898). Maybe the memory slot itself is what is broken?
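To see whether the same slots keep coming back across the incidents, the memory events quoted in this task can be tallied per DIMM. The following is only a rough sketch: the event lines are copied from the hardware log excerpts earlier in this task (the 2017 rows are rewritten into the same plain format as the 2016 ones), and `SEL_EVENTS`, `DIMM_RE` and `tally_by_dimm` are names made up for this example, not part of any existing tool.

```python
import re
from collections import Counter

# Memory events copied from the hardware logs quoted in this task.
# Assumption: the 2017 CSV rows are rewritten into the plain 2016 format.
MULTI_BIT = "Multi-bit memory errors detected on a memory device at location(s)"
SEL_EVENTS = [
    "Fri May 27 2016 17:31:34 Correctable memory error rate exceeded for DIMM_A1.",
    "Fri May 27 2016 18:42:52 Correctable memory error rate exceeded for DIMM_A1.",
    "Tue Jun 07 2016 17:21:45 Correctable memory error rate exceeded for DIMM_A1.",
    "Tue Jun 07 2016 17:27:14 Correctable memory error rate exceeded for DIMM_A1.",
    "Tue Jun 07 2016 17:27:31 Correctable memory error rate exceeded for DIMM_A1.",
    f"Sun Oct 30 2016 15:14:10 {MULTI_BIT} DIMM_B2.",
    f"Sat Apr 22 2017 06:56:24 {MULTI_BIT} DIMM_B4.",
] + [f"Sat Apr 22 2017 06:56:24 {MULTI_BIT} DIMM_A1."] * 4

# Matches the error kind and the DIMM slot name in one event line.
DIMM_RE = re.compile(r"(Correctable|Multi-bit) memory error.*?(DIMM_[A-Z]\d+)")


def tally_by_dimm(events):
    """Count memory events per (slot, kind); lines without a DIMM are skipped."""
    counts = Counter()
    for line in events:
        match = DIMM_RE.search(line)
        if match:
            kind = "correctable" if match.group(1) == "Correctable" else "multi-bit"
            counts[(match.group(2), kind)] += 1
    return counts


if __name__ == "__main__":
    for (slot, kind), count in sorted(tally_by_dimm(SEL_EVENTS).items()):
        print(f"{slot:8} {kind:12} {count}")
```

On these lines DIMM_A1 appears both in the 2016 correctable errors and in the 2017 multi-bit errors, which is consistent with suspecting the slot rather than the module.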
Mentioned in SAL (#wikimedia-operations) [2017-04-26T15:24:31Z] <marostegui> Shutdown es2019 for maintenance with papaul and Dell - T149526
Hi Papaul,
Thank you for contacting Dell EMC Basic Server Support.
This mail is with reference to the (Memory and CPU Issue) you had reported on your PowerEdge(R730XD).
Please find the service request number for your reference in below. Please reply back with the address of the location where you want the part to be delivered
Service Tag : 8PVW382
Service request number : 947570621
Please feel free to reply to the email if you have any other concerns since I would be able to reply back and help you further.
Regards
Suraj Kumar
Enterprise Tech Support Analyst
Dell EMC | NA Basic Server Support
Enterprise Remote Services and Solutions
Hi Papaul,
I will get the motherboard and the memory module replaced at the same time but at the same time would like to request you to help me with the address of the location where you want the parts and the engineer to be sent .
Thanks and regards
Suraj Kumar
Thanks Papaul! As per our chat, I have brought MySQL back up. Ping me when you need it down again.
Mentioned in SAL (#wikimedia-operations) [2017-04-27T14:29:54Z] <marostegui> Stop MySQL and shutdown es2019 for HW replacement - T149526
Main board replacement
DIMM B4 Replaced
DIMM A1 Replaced
BIOS update from 2.2.5 to 2.4.3
Left running on neodymium (I did some optimizations to ignore values below and beyond max id respectively):
while read -r db id; do
    echo "./compare.py es1019.eqiad.wmnet es2019.codfw.wmnet $db blobs_cluster25 blob_id --step=100 --from-value $(($id - 300000)) --to-value $(($id + 300000))"
    ./compare.py es1019.eqiad.wmnet es2019.codfw.wmnet "$db" blobs_cluster25 blob_id --step=100 --from-value $(($id - 300000)) --to-value $(($id + 300000))
done < P5309
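For anyone reading along, the idea behind that loop is roughly: take an id per wiki, check a window of 300000 ids on each side of it, and compare the two hosts in chunks of 100 rows. The sketch below illustrates that idea only. It does not show how compare.py works internally; `window`, `chunks` and `diff_ranges` are hypothetical helpers, plain dicts stand in for the two tables, and treating the ranges as inclusive is an assumption.

```python
def window(center_id, margin=300000):
    """Return the (from_value, to_value) range checked around center_id."""
    return center_id - margin, center_id + margin


def chunks(start, end, step=100):
    """Yield inclusive (lo, hi) sub-ranges covering start..end, step ids each."""
    lo = start
    while lo <= end:
        hi = min(lo + step - 1, end)
        yield lo, hi
        lo = hi + 1


def diff_ranges(rows_a, rows_b, start, end, step=100):
    """Compare two {id: value} mappings chunk by chunk.

    Returns the list of (lo, hi) chunks whose rows differ between the two.
    """
    differing = []
    for lo, hi in chunks(start, end, step):
        part_a = {i: rows_a[i] for i in range(lo, hi + 1) if i in rows_a}
        part_b = {i: rows_b[i] for i in range(lo, hi + 1) if i in rows_b}
        if part_a != part_b:
            differing.append((lo, hi))
    return differing


if __name__ == "__main__":
    # Toy data standing in for the same table on two hosts.
    host_a = {1: "blob-a", 150: "blob-b", 240: "blob-c"}
    host_b = {1: "blob-a", 150: "blob-CORRUPT", 240: "blob-c"}
    print(window(1000000))
    print(diff_ranges(host_a, host_b, 1, 250))
```

Working in small chunks means a mismatch can be narrowed down to a range of 100 ids instead of flagging the whole window.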
Change 352752 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Repool es2019
Change 352752 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Repool es2019
Mentioned in SAL (#wikimedia-operations) [2017-05-09T07:12:23Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Repool es2019 - T149526 (duration: 00m 39s)
As compare.py found no differences, I have repooled this host. Fingers crossed that it doesn't crash anymore!