
labvirt1005 memory errors
Closed, Resolved · Public

Description

dmesg full of happy messages:

[422615.937175] TSC 0 ADDR 303fff2580 MISC 424c3e00 PROCESSOR 0:306e4 TIME 1430311410 SOCKET 0 APIC 1
[422615.937256] sbridge: HANDLING MCE MEMORY ERROR
[422615.937351] CPU 24: Machine Check Exception: 0 Bank 14: c800d74d00800091
[422615.937352] TSC 0 ADDR 0 MISC 2431800222001a8c PROCESSOR 0:306e4 TIME 1430311410 SOCKET 0 APIC 1
[422616.696930] EDAC MC0: 13926 CE memory read error on CPU_SrcID#0_Channel#1_DIMM#2 (channel:1 slot:2 page:0x1c9faa4 offset:0xfc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:0 channel_mask:2 rank:8)
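
For anyone reading the raw MCE line, the Bank 14 status word can be decoded by hand. Below is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS bit layout described in the Intel SDM. The function name and dictionary keys are made up for illustration, and the corrected-error count field is only meaningful on CPUs that report it.

```python
# Request types for memory-controller MCA codes (format 0000 0000 1MMM CCCC).
MEMORY_REQUESTS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    """Split an IA32_MCi_STATUS value into its architectural fields."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER: an earlier error was lost
        "uncorrected": bool((status >> 61) & 1),  # UC
        "enabled": bool((status >> 60) & 1),      # EN
        "miscv": bool((status >> 59) & 1),        # MISC register is valid
        "addrv": bool((status >> 58) & 1),        # ADDR register is valid
        "pcc": bool((status >> 57) & 1),          # processor context corrupt
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Only decode request/channel when the code is a memory-controller error.
    if (mca_code & 0xFF80) == 0x0080:
        fields["request"] = MEMORY_REQUESTS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        fields["channel"] = None if channel == 0xF else channel
    return fields


if __name__ == "__main__":
    # Status word copied from the "Bank 14" line above.
    for name, value in decode_mci_status(0xC800D74D00800091).items():
        print(f"{name}: {value}")
```

For the status word above this gives a valid, overflowed, corrected (UC clear) memory read error on channel 1 with no valid address, which lines up with the `ADDR 0` field and the OVERFLOW flag in the EDAC line.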

Event Timeline

coren raised the priority of this task from to Unbreak Now!.
coren updated the task description.
coren added subscribers: coren, Andrew.
Restricted Application added a subscriber: Aklapper.

For the record, the following instances live on that dying hardware:

| e43664a5-e763-469a-9948-2f2c6c539db2 | cvn-app4                       | labvirt1005           |
| 9d05dbda-4103-432c-9449-498243e10db6 | deployment-cache-bits01        | labvirt1005           |
| 1411d0ec-e934-4bfa-8327-81bfbbe4df32 | deployment-elastic06           | labvirt1005           |
| dfacf7e3-d60c-4990-9681-30610df4ae3d | deployment-kafka02             | labvirt1005           |
| 2cfaf18c-e6ea-4c2d-b96f-df7f50b6bc9a | deployment-mediawiki03         | labvirt1005           |
| b26e5c79-7190-431c-9fc9-e12bf05c0cd6 | deployment-parsoid05           | labvirt1005           |
| a71ec107-2c2a-4a5a-bdb5-d35d1ca95302 | deployment-restbase01          | labvirt1005           |
| ec228fb1-7cca-4c1b-9f5f-63bfc0aee45c | deployment-test                | labvirt1005           |
| f7e8f15f-d5b3-4cf7-847b-612f4443b86c | etherpadt                      | labvirt1005           |
| b76d3c87-ed26-4ad0-aa66-49bca3d7496b | huggle-d2                      | labvirt1005           |
| 92196d8b-2520-4fc1-b4f8-93c29c4661fb | integration-raita              | labvirt1005           |
| 78c56d53-1770-466b-9ad2-6955a539561c | integration-saltmaster         | labvirt1005           |
| 31b01867-44f1-48d5-8ce6-90dabd8d0fe5 | integration-vmbuilder-trusty   | labvirt1005           |
| 5853c165-347a-4597-b9ff-a80288b9332d | otto-hadoop-worker01           | labvirt1005           |
| 65baec4e-311c-402e-b409-2591372d8c94 | phantomcirrus                  | labvirt1005           |
| 6a73ec36-5f5b-4074-9c30-128a738f91ee | puppet-jmm-salt-trusty-minion  | labvirt1005           |
| 0d61121b-5f29-4c3c-a5db-3a2b5f20ad56 | sol                            | labvirt1005           |
| 43224d6d-882a-4da0-9e4b-6a594edb3901 | staging-elastic03              | labvirt1005           |
| 63356051-954e-4d82-965f-2718d5976fe9 | staging-ms-be01                | labvirt1005           |
| d35af3fe-0e9e-41e3-82df-ae5bcad08812 | staging-ocg01                  | labvirt1005           |
| c69e1936-4c35-4892-a84c-c5b2803a60ee | staging-stream                 | labvirt1005           |
| fa611e16-6b85-4f74-92a3-2ed1635fa481 | tools-exec-04                  | labvirt1005           |
| c75db281-14e8-4f6a-a1dd-13f9b89aac8a | tools-exec-1201                | labvirt1005           |
| 9e2161be-8058-4306-b29d-51327a2a00b7 | tools-exec-1202                | labvirt1005           |
| f7def60c-ff22-4fec-a9d9-0cf9d23fe0c6 | tools-exec-1203                | labvirt1005           |
| 14d68305-9095-4e64-9c69-068063c2e7d9 | tools-exec-1208                | labvirt1005           |
| 81e62cef-f1e9-468f-ac65-6ecd8f0abd6a | tools-exec-1401                | labvirt1005           |
| 7f99aa1d-4ee3-4256-9b70-871271501600 | tools-mailrelay-01             | labvirt1005           |
| 74d91098-55cc-43df-b353-5dd2d0efcb50 | tools-static-02                | labvirt1005           |
| 3fd88e9c-cf82-4a64-82e2-e015ae90f489 | toolsbeta-exec-101             | labvirt1005           |
| 75f972e3-4735-4c29-8342-fc3a1cf8d8ec | wikibrain0                     | labvirt1005           |

Further detail: the logs report errors on DIMMs 0, 1 and 2 of channel 1.
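
A per-DIMM summary like this can be pulled out of dmesg mechanically. The following is a rough Python sketch, assuming EDAC lines shaped like the one in the task description (an error count, then CE or UE, then `channel:` and `slot:` fields). The regular expression and function name are illustrative, not part of any existing tool.

```python
import re
from collections import Counter

# Matches e.g. "EDAC MC0: 13926 CE memory read error on <label> (channel:1 slot:2 ..."
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on \S+ "
    r"\(channel:(?P<channel>-?\d+) slot:(?P<slot>-?\d+)"
)


def tally_edac(lines):
    """Sum reported error counts per (kind, controller, channel, slot)."""
    totals = Counter()
    for line in lines:
        match = EDAC_RE.search(line)
        if match:
            key = (
                match["kind"],
                int(match["mc"]),
                int(match["channel"]),
                int(match["slot"]),
            )
            totals[key] += int(match["count"])
    return totals


if __name__ == "__main__":
    # Line copied from the task description.
    sample = (
        "[422616.696930] EDAC MC0: 13926 CE memory read error on "
        "CPU_SrcID#0_Channel#1_DIMM#2 (channel:1 slot:2 page:0x1c9faa4 "
        "offset:0xfc0 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM "
        "err_code:0001:0091 socket:0 channel_mask:2 rank:8)"
    )
    for key, count in sorted(tally_edac([sample]).items()):
        print(key, count)
```

Lines that are not EDAC reports (the raw MCE lines, for instance) are simply skipped, so the full dmesg output can be fed in unfiltered.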

Andrew added a subscriber: Cmjohnson.

All instances are now migrated off of labvirt1005 -- Chris, you can do whatever you need to fix this box; I'm going to re-image it before putting it back to work.

@Andrew thank you for the instance migrations!

POST error:

Inlet Ambient Temperature: 17C/62F
207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
Advanced Memory Protection Mode: Advanced ECC Support
HP SmartMemory authenticated in all populated DIMM slots.

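Moved the suspect DIMM to processor 2 socket 4 to see whether the error would follow the DIMM. After rebooting, the error was still reported on the original socket (processor 1 socket 4), which suggests the DIMM itself is not the problem. POST message below. Swapping the CPUs next.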
207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
Advanced Memory Protection Mode: Advanced ECC Support
HP SmartMemory authenticated in all populated DIMM slots.
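
The reasoning behind the swap test can be written down as a small check. This is only a sketch in Python, assuming the POST text contains lines of the form shown above. `failing_slots` and `classify_swap` are hypothetical helper names, not an HP tool.

```python
import re

POST_207 = re.compile(
    r"207-Memory initialization error on Processor (\d+) Socket (\d+)"
)


def failing_slots(post_text):
    """Return the set of (processor, socket) pairs named in 207 errors."""
    return {(int(p), int(s)) for p, s in POST_207.findall(post_text)}


def classify_swap(after_slots, moved_from, moved_to):
    """Interpret POST errors seen after moving a suspect DIMM."""
    if moved_to in after_slots and moved_from not in after_slots:
        return "follows DIMM"      # the module itself looks bad
    if moved_from in after_slots and moved_to not in after_slots:
        return "stays with slot"   # look at the socket, CPU or board instead
    return "inconclusive"


if __name__ == "__main__":
    post_after_swap = (
        "207-Memory initialization error on Processor 1 Socket 4. "
        "The operating system\n"
        "may not have access to all of the memory installed in the system.\n"
    )
    slots = failing_slots(post_after_swap)
    print(classify_swap(slots, moved_from=(1, 4), moved_to=(2, 4)))
```

With the POST output above, the error stays with processor 1 socket 4, which is why the next step is to look beyond the DIMM.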

Swapping the CPUs changed nothing; the same errors appear at POST:

207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
207-Memory initialization error on Processor 1 Socket 4. The operating system
may not have access to all of the memory installed in the system.
Advanced Memory Protection Mode: Advanced ECC Support
HP SmartMemory authenticated in all populated DIMM slots.

Hi Christopher,

This is regarding case number 4651331170.

I have made arrangements to ship a replacement System board along with an onsite engineer.

Part description: System I/O board (motherboard) assembly - For use with Ivy Bridge (E5-2600 v2) series processors - Includes subpan, thermal grease, alcohol pad, and instruction card

Customer satisfaction is very important to us, and we would like to be sure that you are very satisfied with the management of this case. If you have any suggestions for improving our service, please do not hesitate to contact me or my manager, Rohan Saraf <rohan.saraf@hp.com>.

Cmjohnson claimed this task.

The system board has been replaced and everything POSTs as it should. Updated the BIOS and iLO settings. Verified that the MAC address didn't change. Plugged the server back in and booted to the OS.