
mw1272 crashed: Bad page map in process hhvm
Closed, Resolved · Public

Description

I rebooted mw1272 today around 9:30 UTC; it had been marked as down in Icinga for the previous 12 hours.

It looks like the host had been rebooted several times due to crashes in the past:

A few days before the latest crash, the kernel logged this:

Dec  5 16:30:49 mw1272 kernel: [1519174.005014] BUG: Bad page map in process hhvm  pte:a3ae6f845 pmd:8081f1067

A similar issue occurred in October this year: T207983

Event Timeline

ema created this task. Dec 11 2018, 9:42 AM
Restricted Application added a subscriber: Aklapper. Dec 11 2018, 9:42 AM
ema triaged this task as Medium priority. Dec 11 2018, 9:42 AM
ema added a subscriber: Cmjohnson.

The problem could be due to bad RAM. @Cmjohnson could you check?

The iDRAC logs are reporting a couple of things. The errors could just be the DIMM, but there is also a CPU machine check error, which indicates that CPU2 may be bad now. A DIMM swap is needed first: swap the DIMM, clear the log, and see whether the error follows the DIMM or stays with CPU2.

Record: 77
Date/Time: 12/10/2018 21:41:54
Source: system
Severity: Ok

Description: A problem was detected related to the previous server boot.

Record: 78
Date/Time: 12/10/2018 21:41:54
Source: system
Severity: Critical

Description: CPU 1 machine check error detected.


Record: 112
Date/Time: 12/10/2018 21:44:45
Source: system
Severity: Non-Critical

Description: Correctable memory error rate exceeded for DIMM_B1.

Record: 113
Date/Time: 12/10/2018 21:46:35
Source: system
Severity: Critical

Description: Correctable memory error rate exceeded for DIMM_B1.

Today I swapped the DIMM from B1 to A1 and cleared the log. We have to wait and see.
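The reasoning behind the swap test (does the error follow the DIMM or stay with the slot/CPU?) can be sketched as a small decision function. This is only an illustration: the function name `diagnose_after_swap` and its return labels are hypothetical, and it assumes the modules in the two slots were exchanged and the log was cleared before new errors were collected.

```python
def diagnose_after_swap(old_slot, new_slot, slots_with_errors):
    """Interpret post-swap errors for a module moved from old_slot to new_slot.

    slots_with_errors: slot names reported in the log after it was cleared.
    The returned labels are illustrative, not vendor terminology.
    """
    errors = set(slots_with_errors)
    followed = new_slot in errors   # errors now appear where the module went
    stayed = old_slot in errors     # errors still appear at the original slot

    if followed and not stayed:
        return "module"         # fault moved with the DIMM
    if stayed and not followed:
        return "slot-side"      # fault stayed put: slot, channel, or CPU
    if followed and stayed:
        return "inconclusive"   # both locations report errors
    return "no-errors-yet"      # keep waiting
```

For example, `diagnose_after_swap("DIMM_B1", "DIMM_A1", {"DIMM_B1"})` returns `"slot-side"`, which is the case where suspicion shifts from the memory module towards the CPU or slot.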

This host crashed today again:

-------------------------------------------------------------------------------
Record:      40
Date/Time:   02/22/2019 06:10:16
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      74
Date/Time:   02/22/2019 06:10:18
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      75
Date/Time:   02/22/2019 06:12:12
Source:      system
Severity:    Non-Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
Record:      76
Date/Time:   02/22/2019 06:14:41
Source:      system
Severity:    Critical
Description: Correctable memory error rate exceeded for DIMM_B1.
-------------------------------------------------------------------------------
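Log excerpts like the ones pasted in this task can be summarised per DIMM slot with a few lines of Python. This is a minimal sketch that assumes the plain "Key: value" layout shown above (Record, Date/Time, Source, Severity, Description); the helper names `parse_sel` and `dimm_error_counts` are made up for illustration and are not part of any iDRAC tooling.

```python
import re
from collections import Counter

# Matches "Key: value" lines such as "Severity:    Critical".
FIELD_RE = re.compile(r"^(Record|Date/Time|Source|Severity|Description):\s*(.*)$")
# Matches slot names as they appear in the descriptions, e.g. "DIMM_B1".
DIMM_RE = re.compile(r"\bDIMM_[A-Z]\d+\b")


def parse_sel(text):
    """Split pasted log text into a list of dicts, one per 'Record:' entry."""
    records = []
    current = None
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if not match:
            continue  # separators, blank lines, anything unrecognised
        key, value = match.groups()
        if key == "Record":
            current = {}  # a new record starts here
            records.append(current)
        if current is not None:
            current[key] = value
    return records


def dimm_error_counts(records):
    """Count records whose description names a DIMM slot, per slot."""
    counts = Counter()
    for record in records:
        for slot in DIMM_RE.findall(record.get("Description", "")):
            counts[slot] += 1
    return counts
```

Calling `dimm_error_counts(parse_sel(pasted_text))` on a pasted excerpt gives a quick view of which slot keeps reappearing across crashes.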

Mentioned in SAL (#wikimedia-operations) [2019-02-22T06:51:38Z] <marostegui> Power cycle mw1272 as it crashed - T211668

Mentioned in SAL (#wikimedia-operations) [2019-02-22T09:32:49Z] <_joe_> set pooled=inactive on mw1272, T211668

MoritzMuehlenhoff added a subscriber: RobH.

This server is still under warranty for another 6-7 weeks.

A self-dispatch ticket has been created for a new DIMM and CPU.

You have successfully submitted request SR986941367.

Received the parts, replaced CPU2 and DIMM B1, and cleared the log.

Return shipping info
USPS 9202 3946 2441 1124 14
FEDEX 9611918 2393026 77862432

RobH closed this task as Resolved. Mar 12 2019, 11:03 PM