
hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet
Closed, Resolved, Public

Description

  • Provide FQDN of system.
  • If other than a hard drive issue, please depool the machine (and confirm that it’s been depooled) for us to work on it. If not, please provide time frame for us to take the machine down.
  • Put system into a failed state in Netbox.
  • Provide urgency of request, along with justification (redundancy, dependencies, etc.).
  • Describe issue and/or attach hardware failure log. (Refer to https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook if you need help.)
  • Assign correct project tag and appropriate owner (based on above). Also, please ensure the service owners of the host(s) are added as subscribers to provide any additional input.

FQDN: rdb1014.eqiad.wmnet
Urgency: Medium. It is the passive failover for rdb1013, so nothing is degraded right now, but if rdb1013 misbehaves we have a problem.

Record:      103
Date/Time:   10/14/2024 10:05:49
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.

Event Timeline

akosiaris claimed this task.
akosiaris subscribed.

The host has some history of failure per T370633: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet

It is the passive failover for rdb1013, which means we have no degradation of anything right now.

Nothing particularly useful in SEL. The most recent entries are mostly repeats of this one:

Record:      87
Date/Time:   10/08/2024 22:55:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------

The SEL date format is MM/DD/YYYY, so this is 2024-10-08 (October 8th, 2024), fwiw.
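To take the guesswork out of the date format, here is a minimal Python sketch that parses SEL records laid out like the two quoted in this task and converts the timestamps to ISO 8601. The sample text is assembled from those two records. The `parse_sel` helper is hypothetical, and the dashed separator and month-first order are assumptions taken from the excerpts (the 10/14/2024 entry can only be month-first, since there is no month 14).

```python
from datetime import datetime

# Sample assembled from the two SEL records quoted in this task.
SEL_TEXT = """\
Record:      87
Date/Time:   10/08/2024 22:55:57
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      103
Date/Time:   10/14/2024 10:05:49
Source:      system
Severity:    Critical
Description: CPU 2 machine check error detected.
"""


def parse_sel(text):
    """Split 'Key: value' lines into one dict per record (hypothetical helper)."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("---"):
            # A dashed line closes the current record.
            if current:
                records.append(current)
                current = {}
            continue
        if ":" not in line:
            continue
        # Split on the first colon only, so the time of day stays intact.
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        records.append(current)
    for rec in records:
        # Assumption: month-first dates, as argued in the lead-in.
        rec["timestamp"] = datetime.strptime(rec["Date/Time"], "%m/%d/%Y %H:%M:%S")
    return records


for rec in parse_sel(SEL_TEXT):
    print(rec["timestamp"].isoformat(), rec["Severity"], rec["Description"])
```

Filtering the parsed records on `Severity` would make it easier to pick out the Critical entries among the repeated OEM diagnostic events.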

After a racadm serveraction powerdown, the machine came up fine.

SEL has new entries, same text as above, just current dates. journald doesn't have anything worthy of note.

I'll resolve, although something tells me we'll soon see this again.

Mentioned in SAL (#wikimedia-operations) [2024-10-14T11:26:31Z] <claime> Running ./redis-check-aof --fix on rdb1014 tcp_6379 instance - T376961

root@rdb1014:/srv/redis# redis-check-aof --fix rdb1014-6379.aof
The AOF appears to start with an RDB preamble.
Checking the RDB preamble to start:
[offset 0] Checking RDB file --fix
[offset 27] AUX FIELD redis-ver = '6.0.16'
[offset 41] AUX FIELD redis-bits = '64'
[offset 53] AUX FIELD ctime = '1728427840'
[offset 68] AUX FIELD used-mem = '1430438952'
[offset 84] AUX FIELD aof-preamble = '1'
[offset 86] Selecting DB ID 0
[offset 684290431] Checksum OK
[offset 684290431] \o/ RDB looks OK! \o/
[info] 6149454 keys read
[info] 6113290 expires
[info] 6109592 already expired
RDB preamble is OK, proceeding with AOF tail...
0x        2be9e78b: Expected prefix '*', got: '
AOF analyzed: size=736751616, ok_up_to=736749451, diff=2165
This will shrink the AOF from 736751616 bytes, with 2165 bytes, to 736749451 bytes
Continue? [y/N]: y
Successfully truncated AOF

After that, a full resync from the master was needed to restart the redis instance on port 6379.
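As a sanity check on the numbers in the redis-check-aof output above, the following Python sketch redoes the arithmetic. All values are copied from the log. Reading `ctime` as Unix epoch seconds is an assumption, and the variable names are only illustrative.

```python
from datetime import datetime, timezone

# Values copied from the redis-check-aof output in this task.
size = 736751616                  # AOF size before the fix, in bytes
ok_up_to = 736749451              # last offset that parsed cleanly
bad_offset = int("2be9e78b", 16)  # hex offset from the "Expected prefix" line
ctime = 1728427840                # AUX FIELD ctime from the RDB preamble

# Bytes dropped by --fix: everything after the last valid offset.
discarded = size - ok_up_to

# Assumption: ctime is Unix epoch seconds.
preamble_time = datetime.fromtimestamp(ctime, tz=timezone.utc)

print("bytes discarded:", discarded)
print("bad offset equals ok_up_to:", bad_offset == ok_up_to)
print("RDB preamble written at:", preamble_time.isoformat())
```

The hex offset in the error line is the same position as `ok_up_to`, so the truncation removes exactly the 2165-byte corrupt tail. Under the epoch assumption, the preamble was written at 22:50:40 UTC on 2024-10-08, a few minutes before the SEL entry dated 10/08/2024 22:55:57, though the SEL clock's timezone is not stated here.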

MoritzMuehlenhoff subscribed.

I'll resolve, although something tells me we'll soon see this again.

You jinxed it :-) rdb1014 has been down again for three days, reopening the task.

Clement_Goubert renamed this task from host rdb1014 is down to hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet. Oct 23 2024, 9:29 AM
Clement_Goubert reassigned this task from akosiaris to Jclark-ctr.
Clement_Goubert triaged this task as Medium priority.
Clement_Goubert added projects: ops-eqiad, DC-Ops.
Clement_Goubert updated the task description.

I'll resolve, although something tells me we'll soon see this again.

You jinxed it :-) rdb1014 has been down again for three days, reopening the task.

Hurrah! Not.

Icinga downtime and Alertmanager silence (ID=0d71122b-3e94-47c7-a121-4dda9db372d8) set by cgoubert@cumin1002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Hardware issue

rdb1014.eqiad.wmnet

Confirmed: Service Request 199807744 was successfully submitted.

Dell has agreed to replace the mainboard and CPU. This should happen this week.

Mentioned in SAL (#wikimedia-operations) [2024-10-30T22:03:40Z] <brett> Running ./redis-check-aof --fix on rdb1014 tcp_6379 instance - T376961

CPU 2 and the mainboard were replaced today by the tech.