
db1139 memory errors on boot (issue continues after board change) 2020-08-27
Closed, ResolvedPublic

Assigned To
Authored By
jcrespo
Aug 27 2020, 12:43 PM
Referenced Files
F9: aiOZA8rh.jpg
Nov 25 2020, 8:57 PM
F33917836: Screenshot_20201117_201050.png
Nov 17 2020, 7:11 PM
F33917835: Screenshot_20201117_201025.png
Nov 17 2020, 7:11 PM
F33917670: IML.csv
Nov 17 2020, 4:50 PM
F33917669: EventLog.csv
Nov 17 2020, 4:50 PM
F32443139: HPE_QB98NC3094_20201106.ahs
Nov 6 2020, 9:54 AM
F32409575: db1139.iml.csv
Oct 20 2020, 9:28 PM

Description

Ongoing issue

Memory keeps failing to be recognized after board change, see: T261405#6617019

Previous maintenance cycle

  • - Provide FQDN of system. - done db1139
  • - DB systems are left in service until work is ready to begin, then ping a DBA and they'll depool it.
  • - Put system into a failed state in Netbox. - cannot be done until the above is confirmed and it is ready for power state changes.
  • - Provide urgency of request, along with justification (redundancy, dependencies, etc) - host is enwiki slave, so important but not critical unless we lose more hosts. This shouldn't be UBN, but should reside at top of hw repairs.
  • - Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help) - partially done with the original request body below
  • - rob working with HP to get replacement part/engineer dispatch.
  • - system repaired
  • - system reimaged (due to mainboard swap)

original hw failure request body

315 - An uncorrectable memory error was detected prior to this system boot.
Action: Check the Integrated Management Log (IML) for additional information.



Starting required devices. Please wait, this may take a few moments....
Important information available or errors detected
 Press 'ESC+1' to continue, or 'ESC+2' for more information
System will continue to boot in 1 seconds..

Event Timeline

Marostegui triaged this task as Medium priority.Aug 27 2020, 1:07 PM
Marostegui added a subscriber: wiki_willy.

Hi @Jclark-ctr - can you take a look at this one? It was just purchased last year, so it's still under warranty. Thanks, Willy

Hello John,

Thank you for the AHS. However, as per the AHS we see no hardware errors and the log event page also seems to be empty. Hence, request you to assist us with the screenshot of the error from the iLO IML or the iLO GUI.

Also, kindly help us with the memory tab screenshot on the iLO
iLO -> System Information -> Memory Tab

Thanks and Regards,
Namrata Dutta
Technical Solutions Consultant
Hewlett Packard Enterprise
Working Days: Mon-Fri - 10:30AM-06:30PM CT
Team PDL: iss.scm@hpe.com
Feedback to my manager – Ramesh Srinivasan– Ramesh.srinivasan@hpe.com

@Jclark-ctr I understand that is hpe's response. What is your advice regarding followup steps, close this due to "no actionable"?

@Jclark-ctr Have you had any discussion with HPE about this?

@Cmjohnson @Jclark-ctr Is there anything we (DBA) can do to help move this forward?

Hi @LSobanski - John is still out with the broken hand and Chris was out sick the entire week, so we'll sync up next week to where things currently are with this one. Thanks, Willy

wiki_willy added subscribers: RobH, Jclark-ctr.

Hi @RobH - since John is still out and Chris is knee deep with installs, can you see if you're able to work with HP remotely, in getting a replacement part for this? Thanks, Willy

This task has a number of issues, starting with:

I've updated the task description with the checkboxes required. However, I'm not sure of a few things and need input from DBA team:

  • Is this system able to be shutdown and rebooted at will so we can perform memory and other hardware testing?
    • If not, please remove from service and update task when complete.
  • I am assuming the hardware error was observed on 2020-08-27?
    • The HP log tool that pulls detailed hardware info requires a date range, and uploading the resulting log to HPE support requires it to be under 250MB. Generating from 2020-07-01 to present is too large, so I'll just run this report for 2020-07-01 to 2020-07-31 and try to parse that.
    • Any further testing other than pulling the hw log requires I reboot this machine, update firmwares, etc... So I just need confirmation from DBA team that this is indeed offline/out of service and I can proceed.
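The date-range constraint described above (a full export from 2020-07-01 to present was too large for the 250MB upload limit) suggests pulling the log one calendar month at a time. A minimal sketch of generating such windows follows; `month_windows` is a hypothetical helper, not part of any HPE tool, and the end date in the example is an assumption chosen only for illustration.

```python
from datetime import date, timedelta


def month_windows(start, end):
    """Split [start, end] into consecutive windows that never cross a month boundary."""
    windows = []
    cur = start
    while cur <= end:
        # First day of the month after `cur`.
        if cur.month == 12:
            nxt = date(cur.year + 1, 1, 1)
        else:
            nxt = date(cur.year, cur.month + 1, 1)
        # Window ends on the last day of the month, or on `end` if that comes first.
        last = min(nxt - timedelta(days=1), end)
        windows.append((cur, last))
        cur = nxt
    return windows


if __name__ == "__main__":
    # The end date here is an assumed "present", for illustration only.
    for first, last in month_windows(date(2020, 7, 1), date(2020, 10, 20)):
        print(first.isoformat(), "to", last.isoformat())
```

Each window can then be exported and checked against the size limit on its own, which is roughly what running the report for July only amounts to.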

Please comment and assign back to me!

HW Testing Notes:
If any hardware testing was done previously, no one logged the tests on this task (or in the SAL tied to this task), which means I have to basically start from step 1 on everything. In the future, please everyone update tasks with any and all troubleshooting, as you do it!

  • The Active Health Log tool requires you to input a date range, so I went with August, and it is downloading. Hopefully it's under 250MB, since that is the parse limit for the HPE support site. When done I'll upload and attempt to open a support case with just that, but I suspect, like Dell, they'll want us to try the memory in a different slot (if I can recreate the error) to rule out the mainboard. I'll update as I find out.
RobH updated the task description.

I'm waiting on the very slow HPE site upload to parse the AHS file I downloaded for this, and I also noticed via the https interface (https://db1139.mgmt.eqiad.wmnet/) that it has an Integrated Management Log (nearly identical to Dell's Service Event Log) which includes the memory error.

The AHS log cannot be uploaded, as it has an improper filetype for phab storage. I've downloaded it locally and put it into our Google Drive for DC ops with this format: task.server.filetype, so T261405.db1139.ahs in the DC ops Google Drive under the HP logs directory.

After 10 minutes, HP site is still slowly processing the AHS file =P

DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 6)
DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 5)

Additionally, both CPUs report machine check errors.

Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000026, Bank 0x0000000A, Status 0xFE200000'0004017A, Address 0x00000000'BF003982, Misc 0x00000AC1'E0400086).
Uncorrectable Machine Check Exception (Processor 1, APIC ID 0x00000006, Bank 0x00000006, Status 0xB3800000'00000E0B, Address 0x00000000'00000000, Misc 0x00000000'00000000).

In my experience, it is fairly rare to get two bad dimms and two bad cpus, and far more likely to have a bad mainboard causing all of these issues. With that in mind, I'll request a new mainboard via support and update this task as things progress.
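The Status values in the machine check lines above can be unpacked a little further. The sketch below assumes the standard Intel layout of the IA32_MCi_STATUS register for the upper flag bits (VAL, OVER, UC, EN, MISCV, ADDRV, PCC) and treats the low 16 bits as the MCA error code. It does not attempt any model-specific decoding, and `decode_mce` is a name made up for this example.

```python
import re

# Architectural flag bits of IA32_MCi_STATUS (assumed standard Intel layout).
STATUS_FLAGS = {
    63: "VAL",    # register holds valid error information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the Misc field holds valid data
    58: "ADDRV",  # the Address field holds a valid address
    57: "PCC",    # processor context may be corrupted
}

MCE_RE = re.compile(
    r"Processor (?P<proc>\d+), APIC ID (?P<apic>0x[0-9A-Fa-f]+), "
    r"Bank (?P<bank>0x[0-9A-Fa-f]+), Status (?P<status>0x[0-9A-Fa-f']+)"
)

# The two lines quoted from the IML earlier in this task.
IML_MCE_LINES = [
    "Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000026, "
    "Bank 0x0000000A, Status 0xFE200000'0004017A, Address 0x00000000'BF003982, "
    "Misc 0x00000AC1'E0400086).",
    "Uncorrectable Machine Check Exception (Processor 1, APIC ID 0x00000006, "
    "Bank 0x00000006, Status 0xB3800000'00000E0B, Address 0x00000000'00000000, "
    "Misc 0x00000000'00000000).",
]


def decode_mce(line):
    """Return processor, bank, flag names and MCA error code, or None if no match."""
    m = MCE_RE.search(line)
    if m is None:
        return None
    # iLO prints 64-bit values as two halves joined by an apostrophe.
    status = int(m.group("status").replace("'", ""), 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "processor": int(m.group("proc")),
        "bank": int(m.group("bank"), 16),
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    for line in IML_MCE_LINES:
        print(decode_mce(line))
```

On these two lines, both entries decode as valid, uncorrected errors with PCC set. The Processor 1 entry has ADDRV clear, which matches its all-zero Address field.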

Case ID: 5350976764 opened, requesting a new mainboard and any/all migration directions to be dispatched to eqiad to @Cmjohnson's attention. (He is currently out sick, but is projected to be on-site before John.)

I'm not reassigning this yet, until I get confirmation of parts dispatch. I listed myself as case contact but Chris as the engineering/shipping contact. Having not opened a case with HP in years, I'm not sure how they'll notify us of the parts dispatch yet.

They emailed me and required I upload the AHS log via a https drop box utility, so I did so along with the IML log file.

Awaiting reply from HP support.

Sorry for the late response, it was very late on our TZ.

Apologies also for not using the template, I was not aware of its existence, at least I've never seen it used before. I know there are several, but I may not know them all. Is https://phabricator.wikimedia.org/tag/dc-ops/ a list of all that there is? Sorry again.

Regarding your questions:

"Is this system able to be shutdown and rebooted at will so we can perform memory and other hardware testing?"

The main issue with databases is that we normally can "depool them/retire from service" with no issue, but we cannot do it for long periods of time because it affects redundancy and can make them desync if shut down for long periods of time, due to replication. This means we can leave a server off for a day or even a week, but not for 2 months without a huge toll on us. For this reason, when we work with people onsite we request some time in advance (can be the same day, or even the day before if it is more comfortable for TZ reasons, or schedule it, whatever is more comfortable to DC Ops) to ping us that "they may be working on it", and we can shut them down in advance, no issue. We like to do it because these are stateful services so we make sure they are shut down cleanly.

Let me know how to communicate this to you, although John, Chris and Papaul are all used to operating like this with us, so I am sure they are aware.

Regarding "Provide urgency of request", this is not high priority because the host is up and giving service right now, so we can live with it until some actionable is about to be done. The main concern was this crashing. We stopped receiving updates to direct questions (I am guessing because of staff unavailability) and we just wanted to be sure things were on the radar, even if it would take time to get actionables done. Our worry was that this could have been forgotten due to almost 2 months passing since the initial report.

So precisely because we know you are low on availability, we wait for your answer to put it down, as we know it may take some days, no problem with that. Let me know if you want to proceed with it now and I will shut it down now safely and downtime it.

Let me know if the explanation is enough.

Jaime: I didn't realize the DB systems hardware repair cadence was different than the other systems (with DBA team only taking it offline immediately before work.) I'll have to figure out where to document that so I don't forget when I don't work on the db systems for a few months. Your explanation makes perfect sense, thank you!

While all of the DC ops templates are indeed linked off our landing page, I am pretty sure we end up having to tell every single SRE individually. Folks tend to not hear about documentation updates in SRE meetings, no worries. Since DB machines cannot be offline far in advance of work, it's a bit different than the rest.

For now, I have a support request in with HP support. I uploaded the log files (AHS and IML) last evening, so we should hear something back today.

I suspect this actually has a bad mainboard, as it had both CPU and memory errors, and it is rare to see both CPUs throw errors. We'll know soon if HP will just dispatch us a new mainboard, or if they are going to be a PITA about parts dispatch and potential testing.

Scheduling/Timeline Notes: Typically if it's just one bad bit of kit, you swap its location and determine if the error follows the smaller hardware, or if it stays in the slot that had the failure. When it's both CPU sockets, the chances of it being the mainboard become very, very high. My understanding of our on-sites' preference is that if they can perform the hardware swap easily, they prefer it to having an engineer from Dell or HP dispatched with the hardware. If HP doesn't push back, and will just send a new mainboard, we'll have it dispatched out for swap ASAP. If they push back, and we have to coordinate an HP engineer to meet with Chris or John at eqiad, this will take a week or more longer to handle.
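The swap test described above can be written down as a small decision rule. This is only a sketch of the reasoning, not a tool anyone on this task used; `swap_test` and the `(processor, dimm)` tuples are hypothetical names chosen for the example.

```python
def swap_test(failing_before, swapped_with, failing_after):
    """Interpret the outcome of swapping a suspect part into another slot.

    failing_before: slot that reported the error before the swap
    swapped_with:   slot the suspect part was moved to
    failing_after:  slot that reports the error after the swap
    """
    if failing_after == failing_before:
        # The error stayed put: suspect the slot, channel, CPU or board.
        return "slot"
    if failing_after == swapped_with:
        # The error moved with the part: suspect the part itself.
        return "module"
    # Error disappeared or showed up somewhere else entirely.
    return "inconclusive"


if __name__ == "__main__":
    # Hypothetical example: a DIMM moved from (2, 6) to (1, 6), error still at (2, 6).
    print(swap_test((2, 6), (1, 6), (2, 6)))  # slot
```

A "slot" result points away from the DIMM and towards the board or CPU side, which is the case where a mainboard or processor replacement becomes the likelier fix.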

Oh, if it is a mainboard replacement, the host will need reimage. I assume if that is the case, it can come offline well in advance as it's basically re-entering service as a new host. We'll know later today.

the host will need reimage

A reimage is not a problem, even with data loss- the problem is being down for an extended amount of time (e.g. ~1 week).

RobH updated the task description.

John, on Thursday can you swap the motherboard out please? The new one is in the flex space.

I will have db1139 down and downtimed for a day by Thursday, unless you tell me not to.

@jcrespo thanks, please have the host down; I will change the mainboard tomorrow.

Thanks, will do and report here when done (will do on my -Europe- morning).

Creating a backup before shutting them down, in case data got lost after maintenance.

Mentioned in SAL (#wikimedia-operations) [2020-11-05T11:58:32Z] <jynus> shutting down db1139 in preparation of maintenance T261405

jcrespo updated the task description.

db1139 is down, backed up, and ready for maintenance - I have downtime'd until Friday. Let us know either if you will need more time or when it has been done to put it back into production.

As a reminder (you probably already know this) please note that default boot option should be set back to disk (BIOS gets reset when board changes, may need similar procedures as first install).

Thank you!

@jcrespo mainboard replaced, server is powered on.

Jclark-ctr updated the task description.

Is it possible one of the memory sticks needs reseating?

462 - Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 6).  The
DIMM is mapped out and is currently not available.
Action: Take corrective action for the failing DIMM. Re-map all DIMMs back into
the memory map in RBSU. If the issue persists, contact support.


511 - One or more DIMMs have been mapped out due to a memory error, resulting
in an unbalanced memory configuration across memory controllers. This may
result in non-optimal memory performance.
Action: See the Integrated Management Log (IML) for information on the memory
error.  Consult documentation for memory population guidelines.

IML:

"ID","Severity","Class","Description","Last Update","Count","Category",
"139","Repaired","Network","HPE Ethernet 1Gb 4-port 331i Adapter - NIC Connectivity status changed to OK for adapter in slot 0, port 1","11/06/2020 09:41:25","1","Hardware",
"138","Informational","UEFI","One or more DIMMs have been mapped out due to a memory error, resulting in an unbalanced memory configuration across memory controllers. This may result in non-optimal memory performance.","11/06/2020 09:40:29","1","Configuration",
"137","Critical","UEFI","Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 6).  The DIMM is mapped out and is currently not available.","11/06/2020 09:41:08","1","Hardware",
"136","Repaired","Network","HPE Ethernet 1Gb 4-port 331i Adapter - NIC Connectivity status changed to OK for adapter in slot 0, port 1","11/05/2020 18:14:17","1","Hardware",
"135","Informational","UEFI","One or more DIMMs have been mapped out due to a memory error, resulting in an unbalanced memory configuration across memory controllers. This may result in non-optimal memory performance.","11/05/2020 18:12:58","1","Configuration",
"134","Critical","UEFI","Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 6).  The DIMM is mapped out and is currently not available.","11/05/2020 18:12:15","2","Hardware",
"133","Critical","UEFI","DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 6)","11/05/2020 18:11:28","1","Hardware",
"132","Critical","CPU","Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000026, Bank 0x0000000F, Status 0xFD092E00'001000C0, Address 0x00000040'530FD140, Misc 0x11201305'61BCC086). ","11/05/2020 18:10:18","1","Hardware",
"131","Caution","UEFI","Uncorrectable Error Detected on the Previous Boot. Error information logged to the Integrated Management Log.","11/05/2020 18:10:01","1","Hardware",
"130","Critical","CPU","Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000026, Bank 0x0000000F, Status 0xFD08E080'001000C0, Address 0x00000040'530FD140, Misc 0x11201305'61BCC086). ","11/05/2020 18:09:56","1","Hardware",
"129","Informational","UEFI","A new network or storage device has been detected.  This device will not be shown in the Legacy BIOS Boot Order options in RBSU until the system has booted once.","11/05/2020 18:08:46","1","Administration",
"128","Repaired","Network","HPE Ethernet 1Gb 4-port 331i Adapter - NIC Connectivity status changed to OK for adapter in slot 0, port 1","11/05/2020 17:52:04","1","Hardware",
"127","Informational","UEFI","Default configuration settings have been restored per user request. If Secure Boot was enabled, related security settings may have been lost.","11/05/2020 17:51:14","1","Administration",
"126","Informational","Maintenance","IML Cleared ( user: System Administrator)","07/17/2020 22:14:33","1","Maintenance, Administration",
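An IML export like the one above can be filtered mechanically to see which DIMM locations keep producing critical events. The following is a minimal sketch using only the Python standard library; `critical_dimm_events` is a hypothetical helper, and the sample embeds five of the rows above verbatim (including the trailing comma on each line, which yields one empty, unused column).

```python
import csv
import io
import re
from collections import Counter

# Five rows copied from the IML export above, header included.
IML_SAMPLE = '''"ID","Severity","Class","Description","Last Update","Count","Category",
"138","Informational","UEFI","One or more DIMMs have been mapped out due to a memory error, resulting in an unbalanced memory configuration across memory controllers. This may result in non-optimal memory performance.","11/06/2020 09:40:29","1","Configuration",
"137","Critical","UEFI","Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 6).  The DIMM is mapped out and is currently not available.","11/06/2020 09:41:08","1","Hardware",
"134","Critical","UEFI","Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 6).  The DIMM is mapped out and is currently not available.","11/05/2020 18:12:15","2","Hardware",
"133","Critical","UEFI","DIMM Failure - Uncorrectable Memory Error (Processor 2, DIMM 6)","11/05/2020 18:11:28","1","Hardware",
"132","Critical","CPU","Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000026, Bank 0x0000000F, Status 0xFD092E00'001000C0, Address 0x00000040'530FD140, Misc 0x11201305'61BCC086). ","11/05/2020 18:10:18","1","Hardware",
'''

# Matches "(Processor 2, DIMM 6)" but not the machine check lines,
# where "Processor N," is followed by "APIC ID".
DIMM_RE = re.compile(r"Processor (\d+), DIMM (\d+)")


def critical_dimm_events(text):
    """Count Critical IML rows per (processor, dimm) location."""
    hits = Counter()
    for row in csv.DictReader(io.StringIO(text)):
        if row["Severity"] != "Critical":
            continue
        match = DIMM_RE.search(row["Description"])
        if match:
            hits[(int(match.group(1)), int(match.group(2)))] += 1
    return hits


if __name__ == "__main__":
    for (proc, dimm), n in critical_dimm_events(IML_SAMPLE).items():
        print(f"Processor {proc}, DIMM {dimm}: {n} critical rows")
```

The sketch counts rows rather than summing the "Count" column, so a row that iLO has already aggregated (such as ID 134, with a count of 2) still contributes one hit.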
$ free -m
              total       
Mem:         483434

(514666 or 515688 expected)
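The gap between the observed and expected totals is close to the size of a single module, which fits one DIMM being mapped out. A quick check of that arithmetic, assuming 32 GiB modules (the size the iLO memory view in this task reports for these DIMMs); `missing_dimms` is a name invented for the example.

```python
DIMM_MIB = 32 * 1024  # assumed module size: 32 GiB expressed in MiB


def missing_dimms(expected_mib, observed_mib, dimm_mib=DIMM_MIB):
    """Estimate how many modules are absent from the OS-visible total."""
    return round((expected_mib - observed_mib) / dimm_mib)


if __name__ == "__main__":
    observed = 483434
    for expected in (514666, 515688):
        gap = expected - observed
        print(expected, "-", observed, "=", gap, "MiB ->", missing_dimms(expected, observed), "DIMM(s)")
```

The gap is slightly smaller than a full 32768 MiB in both cases because `free` reports usable memory rather than installed memory, so rounding to the nearest whole module is the meaningful figure.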

ahs:

Either the "Processor 2, DIMM 6" stick is not seated correctly (which would be a reasonable assumption, given there are a lot of sticks to take out and put back) or the board was not the issue, and that stick is problematic.

@Jclark-ctr: Can we try to pull it out and put it back? Let me know if you think this is a reasonable action to take and I will shut down the server (can be next week).

jcrespo renamed this task from db1139 memory errors on boot 2020-08-27 to db1139 memory errors on boot (Issues continues after board change) 2020-08-27.Nov 6 2020, 1:33 PM
jcrespo renamed this task from db1139 memory errors on boot (Issues continues after board change) 2020-08-27 to db1139 memory errors on boot (issue continues after board change) 2020-08-27.

@Jclark-ctr: Has the defective HP mainboard been sent back to HP yet? They are spamming my inbox about it =]

Mentioned in SAL (#wikimedia-operations) [2020-11-10T17:30:17Z] <jynus> about to shutdown db1139 for hw maintenance T261405

@Cmjohnson host should be shut down right now after stopping mysql cleanly- you are free to disconnect/open/check ram now. Thank you!

Reseated all of the DIMMs; the error remained the same.

Uncorrectable Machine Check Exception (Processor 2, APIC ID 0x00000026, Bank 0x0000000D, Status 0xEE0007C0'001000C0, Address 0x00000040'4088B4C0, Misc 0x12294F78'D3D4C086).
12:49 Uncorrectable Memory Error - The failed memory module could not be determined.
12:49 Uncorrectable Memory Error Threshold Exceeded (Processor 2, DIMM 6). The DIMM is mapped out and is currently not available.

Powered off again and swapped DIMM 6 processor 2 with DIMM 6 processor 1

The new error is DIMM Initialization Error - Processor 2 Channel 1

Possibly a bad cpu or they sent us a bad board. @Jclark-ctr this will require a call to HPE.

:-( I will put db1139 back into production so it is somewhat useful until next week.

The memory view tells us that it is now 2, not 1 memory slot that is affected (of course, given the above test it is more likely it is CPU / board, not the sticks themselves).

PROC 2 DIMM 7 	Map Out Error 	32.00 GB 	2666 MHz 	RDIMM
PROC 2 DIMM 8 	Map Out Error 	32.00 GB 	2666 MHz 	RDIMM

Change 640482 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Reduce memory consumption of mariadb@s6 while hw degraded

https://gerrit.wikimedia.org/r/640482

Change 640482 merged by Jcrespo:
[operations/puppet@production] mariadb: Reduce memory consumption of mariadb@s6 while hw degraded

https://gerrit.wikimedia.org/r/640482

@Jclark-ctr - can you double-check the S/N for db1139. We're getting the following Netbox error:

mismatched serials: MXQ91300JF (netbox) != QB98NC3094 (puppetdb)

Thanks,
Willy

@wiki_willy I do not know what the Q number would be, all of the HP servers start with MXQ and confirmed MXQ91300JF is correct.

This was due to the new mainboard being installed, but the serial number not being imported during the mainboard swap. While Dell servers do this automatically, it appears that HP servers do not. We'll need to reboot this to troubleshoot/fix.
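The mismatch message quoted above comes from comparing the serial recorded in Netbox with the one the host itself reports to PuppetDB. A minimal sketch of that kind of comparison follows; it is not the actual report code, and `serial_report` is a hypothetical function that only reproduces the message format seen in this task.

```python
def serial_report(netbox_serial, puppetdb_serial):
    """Return a mismatch message, or None when both sources agree."""
    if netbox_serial == puppetdb_serial:
        return None
    return f"mismatched serials: {netbox_serial} (netbox) != {puppetdb_serial} (puppetdb)"


if __name__ == "__main__":
    # Values taken from the error quoted in this task.
    print(serial_report("MXQ91300JF", "QB98NC3094"))
```

After a board swap the host-reported value is whatever the replacement board carries, so the check keeps failing until the chassis serial is written back into the BIOS settings.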

@Jclark-ctr is taking this over, as the mainboard swap did not fix the memory and CPU errors.

phab won't let me upload the AHS file, so emailing it to John with this task #.


@jcrespo will need downtime for host to remap dimms per HPE

@Jclark-ctr: When you reboot to troubleshoot memory, please also fix the serial number in the bios settings!

Mentioned in SAL (#wikimedia-operations) [2020-11-17T18:09:37Z] <jynus> stopping db1139 for hw maintenance T261405

@Jclark-ctr I just stopped the host and downtimed it for almost a day, thank you!

@jcrespo replacement DIMMs should arrive Thursday. I'm unsure what time they will arrive, but we can shoot for Thursday. If they arrive late it will not be till Friday.

Hello John,

Greetings!

As discussed on call, I have placed order for 3 DIMMs to be shipped.
You will receive an email with tracking number once the parts are dispatched.

Please replace the DIMMs in the slot Proc 1 DIMM 6, Proc 1 DIMM 7, Proc 1 DIMM 8.
Serial number of the existing DIMMs :
Proc 1 DIMM 6 : 1E2041CD
Proc 1 DIMM 7 : 1E204B4E
Proc 1 DIMM 8 : 1E204B39

I will put the server back up temporarily for some hours so it catches up and we can generate a full backup before the maintenance.

Server is back down and ready for maintenance after the backup.

Still happening. Log from last night:

233 - DIMM Initialization Error - Processor 2 Channel 1. The identified memory
channel could not be properly trained and has been mapped out. (Major
Code:00000026, Minor Code:00000001).
Action: Re-seat the DIMMs in the identified channel and update the System ROM.
If the issue persists, contact support.


511 - One or more DIMMs have been mapped out due to a memory error, resulting
in an unbalanced memory configuration across memory controllers. This may
result in non-optimal memory performance.
DIMM Initialization Error - Processor 2 Channel 1. The identified memory channel could not be properly trained and has been mapped out. (Major Code:00000026, Minor Code:00000001).
Initial Update: 11/19/2020 20:10:11
Event Class: 0x32
Event Code: 0x233
Learn More: http://www.hpe.com/support/class0x0032code0x0233-gen10
Recommended Action: Re-seat the DIMMs in the identified channel and update the System ROM. If the issue persists, contact support.
PROC 2 DIMM 7 	Map Out Error 	32.00 GB 	2666 MHz 	RDIMM
PROC 2 DIMM 8 	Map Out Error 	32.00 GB 	2666 MHz 	RDIMM

Host will be down for now.

@Jclark-ctr Any updates from HP? The host is shut down for now.

@jcrespo just heard back from HP, they want to remap the DIMMs again:

We see that the original issue with the 3 DIMMs is now cleared.
However, the DIMMs on Proc 2 Slot 7 & 8 are mapped.
Please perform the below steps and let us know the outcome.

Reseat the DIMMs in proc 2 slot 7 & 8.
Reboot the server > press F9 > Select System Configuration > Select BIOS configuration > Select Memory Options > Select Remap memory > Remap all memory.

If the error persists, then please swap the DIMM in proc 2 DIMM 7 & 8 with proc 2 DIMM 9 & 10.

collect the AHS logs after swapping the DIMMs and send it to us.
I will check and let you know the next action plan.

@Jclark-ctr based on the DC entry schedule, when do you expect you will be able to take a look at this? Knowing this would allow us to better plan next steps regarding backup source locations.

Reseated the DIMMs; errors persisted. Following up with HP.

@Jclark-ctr were you able to contact HP again? Host is again down so it can be managed at any time.

Downloaded and sent the log to HP again.

Moved the DIMMs again per HP's request. The error continues.

DIMM 7 to slot 9 and DIMM 9 to slot 7
DIMM 8 to slot 10 and DIMM 10 to slot 8.

HP sent 2 more DIMMs and requested they be changed. They just arrived; I will change them tomorrow.

Replaced more DIMMs per HP; no errors at this time.

Can confirm from os command line:

$ free -m
Mem:         515690

Thank you very much!

Also no more errors on reboot:

Installed System Memory: 512 GB, Available System Memory: 512 GB

2 Processor(s) detected, 8 total cores enabled, Hyperthreading is enabled
Proc 1: Intel(R) Xeon(R) Gold 5122 CPU @ 3.60GHz
Proc 2: Intel(R) Xeon(R) Gold 5122 CPU @ 3.60GHz
UPI Speed: 10.4 GT/s 

Workload Profile: General Power Efficient Compute
Power Regulator Mode: Dynamic Power Savings
Advanced Memory Protection Mode: Fast Fault Tolerant Memory (ADDDC)
Boot Mode: Legacy BIOS
HPE SmartMemory authenticated in all populated DIMM slots.

For access via BIOS Serial Console:
Press 'ESC+9' for System Utilities
Press 'ESC+0' for Intelligent Provisioning
Press 'ESC+!' for One-Time Boot Menu
Press 'ESC+@' for Network Boot