db2127 memory issues
Closed, Resolved · Public

Description

@Papaul reported the following on irc:

15:08:57 <papaul> marostegui: hello can you check db2127 it is making a lot of noise and i have a yellow blinking lid

These are the HW logs, which show memory issues:

Record: 1
Date/Time: 07/12/2019 03:59:07
Source: system
Severity: Ok
Description: Log cleared.
-------------------------------------------------------------------------------
Record: 2
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Non-Critical
Description: Correctable Machine Check Exception detected on CPU 1.
-------------------------------------------------------------------------------
Record: 3
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 4
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 5
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 6
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 7
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 8
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 9
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 10
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 11
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 12
Date/Time: 08/14/2019 21:11:00
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 13
Date/Time: 08/14/2019 21:11:00
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 14
Date/Time: 08/14/2019 21:11:00
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 15
Date/Time: 08/14/2019 21:11:00
Source: system
Severity: Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_A6.
-------------------------------------------------------------------------------
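
Most of the records in the log are "Ok" OEM diagnostic events, so the two entries that matter (the CPU 1 machine check and the DIMM_A6 error) are easy to miss. A minimal sketch of how such a dump could be filtered is below. It assumes the log is plain `Key: value` lines with dashed separators and no line-number prefixes; the names `parse_sel`, `non_ok` and `SAMPLE` are hypothetical and not part of any tooling used on this task, and `SAMPLE` holds only three of the records above.

```python
import re

# Three records copied from the log in this task (2, 3 and 15), used as sample input.
SAMPLE = """\
Record: 2
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Non-Critical
Description: Correctable Machine Check Exception detected on CPU 1.
-------------------------------------------------------------------------------
Record: 3
Date/Time: 08/14/2019 21:10:59
Source: system
Severity: Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 15
Date/Time: 08/14/2019 21:11:00
Source: system
Severity: Critical
Description: Correctable memory error logging disabled for a memory device at location DIMM_A6.
-------------------------------------------------------------------------------
"""


def parse_sel(text):
    """Split a log dump into a list of dicts, one dict per record."""
    records = []
    # Records are separated by a line consisting only of dashes.
    for block in re.split(r"^-{5,}[ \t]*$", text, flags=re.MULTILINE):
        fields = {}
        for line in block.strip().splitlines():
            # Split on the first colon only, so timestamps keep their colons.
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def non_ok(records):
    """Keep only the records whose severity is something other than 'Ok'."""
    return [r for r in records if r.get("Severity") != "Ok"]


if __name__ == "__main__":
    for rec in non_ok(parse_sel(SAMPLE)):
        print(rec["Record"], rec["Severity"], rec["Description"])
```

Run against the full log, this would leave only records 2 and 15, which is the pair that points at a memory problem on DIMM_A6.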

After taking the host down, @Papaul reports that the noise and the blinking LED are gone and that the DIMM issue has cleared.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Marostegui triaged this task as Medium priority. Sep 18 2019, 5:03 AM
Marostegui moved this task from Triage to In progress on the DBA board.

We are leaving this task open for a few days to see if the errors come back.

Mentioned in SAL (#wikimedia-operations) [2019-09-18T05:47:56Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool host after onsite checks T233184', diff saved to https://phabricator.wikimedia.org/P9123 and previous config saved to /var/cache/conftool/dbconfig/20190918-054755-marostegui.json

No more errors. If it is still clean on Monday, I will close this task.

Marostegui assigned this task to Papaul.

HW logs look clean, closing this!
Thanks @Papaul for catching this!