wtp1032 bootlooping on CPU error
Closed, Resolved · Public

Description

wtp1032's racadm lclog view shows this happening over and over:

/admin1-> racadm lclog view
SeqNumber = 368
Message ID = RAC0703
Category = Audit
AgentID = RACLOG
Severity = Information
Timestamp = 2020-06-02 15:09:14
Message = Requested system hardreset.
FQDD = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber = 367
Message ID = SYS1003
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2020-06-02 15:09:14
Message = System CPU Resetting.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 366
Message ID = CPU0000
Category = System
AgentID = iDRAC
Severity = Information
Timestamp = 2020-06-02 15:09:14
Message = Internal error has occurred check for additional logs.
--------------------------------------------------------------------------------
SeqNumber = 365
Message ID = RAC0703
Category = Audit
AgentID = RACLOG
Severity = Information
Timestamp = 2020-06-02 15:08:49
Message = Requested system hardreset.
FQDD = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber = 364
Message ID = SYS1003
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2020-06-02 15:08:49
Message = System CPU Resetting.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 363
Message ID = CPU0000
Category = System
AgentID = iDRAC
Severity = Information
Timestamp = 2020-06-02 15:08:48
Message = Internal error has occurred check for additional logs.
--------------------------------------------------------------------------------
SeqNumber = 362
Message ID = RAC0703
Category = Audit
AgentID = RACLOG
Severity = Information
Timestamp = 2020-06-02 15:08:24
Message = Requested system hardreset.
FQDD = iDRAC.Embedded.1
--------------------------------------------------------------------------------
SeqNumber = 361
Message ID = SYS1003
Category = Audit
AgentID = DE
Severity = Information
Timestamp = 2020-06-02 15:08:24
Message = System CPU Resetting.
FQDD = iDRAC.Embedded.1#HostPowerCtrl
--------------------------------------------------------------------------------
SeqNumber = 360
Message ID = CPU0000
Category = System
AgentID = iDRAC
Severity = Information
Timestamp = 2020-06-02 15:08:23
Message = Internal error has occurred check for additional logs.

I tried a manual power cycle but that didn't fix it.

For now it is depooled pending dcops checking out what's up. I changed its Netbox status to FAILED.
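
The repeating CPU0000 -> SYS1003 -> RAC0703 sequence in the lclog output can also be counted with a script rather than by eye. The following is only a rough sketch, assuming the plain text of `racadm lclog view` (records separated by dashed lines, one `key = value` pair per line) has been saved to a string. The names `parse_lclog`, `count_reset_cycles`, and `SAMPLE` are hypothetical helpers for illustration, not part of racadm or any Wikimedia tooling, and `SAMPLE` is an abbreviated copy of records 363 to 368 from the log.

```python
import re

# A record separator is assumed to be a line made up only of dashes.
SEPARATOR = re.compile(r"^-{10,}\s*$")


def parse_lclog(text):
    """Split saved `racadm lclog view` output into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        if SEPARATOR.match(line):
            if current:
                records.append(current)
                current = {}
            continue
        key, sep, value = line.partition(" = ")
        if sep:  # lines without " = " (such as the prompt) are skipped
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def count_reset_cycles(records):
    """Count CPU0000 -> SYS1003 -> RAC0703 runs, reading oldest record first."""
    ordered = sorted(
        (r for r in records if "SeqNumber" in r),
        key=lambda r: int(r["SeqNumber"]),
    )
    ids = [r.get("Message ID") for r in ordered]
    pattern = ["CPU0000", "SYS1003", "RAC0703"]
    return sum(ids[i:i + 3] == pattern for i in range(len(ids) - 2))


# Abbreviated copy of records 363-368 from the log in this task.
SAMPLE = """\
SeqNumber = 368
Message ID = RAC0703
Message = Requested system hardreset.
--------------------------------------------------------------------------------
SeqNumber = 367
Message ID = SYS1003
Message = System CPU Resetting.
--------------------------------------------------------------------------------
SeqNumber = 366
Message ID = CPU0000
Message = Internal error has occurred check for additional logs.
--------------------------------------------------------------------------------
SeqNumber = 365
Message ID = RAC0703
Message = Requested system hardreset.
--------------------------------------------------------------------------------
SeqNumber = 364
Message ID = SYS1003
Message = System CPU Resetting.
--------------------------------------------------------------------------------
SeqNumber = 363
Message ID = CPU0000
Message = Internal error has occurred check for additional logs.
"""

if __name__ == "__main__":
    print(count_reset_cycles(parse_lclog(SAMPLE)))
```

A count well above one within a few minutes, as in this log where the cycle repeats roughly every 25 seconds, would be consistent with a boot loop rather than a single reset.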

Event Timeline

CDanis created this task. Jun 2 2020, 3:40 PM
Restricted Application added a project: Operations. Jun 2 2020, 3:40 PM
wiki_willy added subscribers: Cmjohnson, wiki_willy.

@Cmjohnson - looks like the warranty on this one just ended a few months ago, so just let me know whatever you find during troubleshooting, and we can order the part. Thanks, Willy

Machine seems to still be in the dsh group; can this be fixed?

Cmjohnson closed this task as Resolved. Jun 2 2020, 6:54 PM

The server is out of warranty. I reseated both CPUs and cleared the system event log, and the server booted okay. I will resolve this for now; please reopen if the issue comes back.

Mentioned in SAL (#wikimedia-operations) [2020-06-02T21:12:33Z] <cdanis> repooled wtp1032 T254258