cloudstore1008 crash - Memory error
Closed, Resolved · Public

Description

Just now I got an alert that cloudstore1008 was down. I cycled power via mgmt.

Checking the web console, this appears to have been caused by a memory error in "DIMM_B2".
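
For reference, cycling power through the mgmt interface can be done with ipmitool from a machine that can reach the BMC; the .mgmt hostname and credentials below are illustrative assumptions, not values from this task:

# Hypothetical invocation; host, user, and password variable are placeholders
ipmitool -I lanplus -H cloudstore1008.mgmt.example.net -U root -P "$IPMI_PASS" chassis power cycle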

Event Timeline

I see nothing at all in the syslog that would explain this crash -- just an empty spot:

Dec  2 07:13:09 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=48 ID=62385 PROTO=TCP SPT=49648 DPT=873 SEQ=4054715434 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Dec  2 07:13:09 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=48 ID=54280 PROTO=TCP SPT=49649 DPT=873 SEQ=4054780971 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Dec  2 07:13:09 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=57 ID=14724 PROTO=TCP SPT=49650 DPT=873 SEQ=4054584360 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Dec  2 07:13:15 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=56 ID=21729 PROTO=TCP SPT=49648 DPT=1207 SEQ=4054715434 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Dec  2 07:13:15 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=56 ID=60664 PROTO=TCP SPT=49649 DPT=1207 SEQ=4054780971 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Dec  2 07:13:15 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=35 ID=26501 PROTO=TCP SPT=49650 DPT=1207 SEQ=4054584360 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
^@^@^@^@^@^@^@^@ [... long run of NUL bytes where the crash occurred ...] ^@^@^@^@^@^@^@^@
Dec  2 07:21:03 cloudstore1008 systemd-modules-load[1656]: Inserted module 'nf_conntrack'
Dec  2 07:21:03 cloudstore1008 systemd-modules-load[1656]: Inserted module 'ipmi_devintf'
Dec  2 07:21:03 cloudstore1008 systemd[1]: Started Load Kernel Modules.
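
Since syslog only has that run of NUL bytes where the crash happened, the BMC's System Event Log is the more useful record; a generic way to check it (standard ipmitool and EDAC sysfs paths, nothing specific to this host):

# Recent hardware events from the BMC; an uncorrectable DIMM error usually lands here
ipmitool sel elist | tail -n 20
# Corrected-error counters from the EDAC driver, if it is loaded
grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null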

When it was rebooted, it filled the root disk because the bind mounts were missing. I'm going to add the silly bind mounts to fstab today (example below).
That doesn't explain the above crash, though.
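
For the record, a bind mount in /etc/fstab looks like this; the paths here are placeholders rather than the actual cloudstore mounts:

# Hypothetical /etc/fstab entry: bind a directory from the data volume over
# the path that was filling the root disk (both paths are placeholders)
/srv/cloudstore/scratch  /var/lib/scratch  none  bind  0  0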

I see that debugfs is mounted. Not sure why that is.
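
A quick way to see what mounted it (generic commands):

# Show the debugfs mount source and options
findmnt /sys/kernel/debug
# debugfs is often mounted by systemd's sys-kernel-debug.mount unit
systemctl status sys-kernel-debug.mount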

Well, that was easy. Once I logged into the web console, the error was right there 🙂

Screen Shot 2019-12-02 at 3.08.26 PM.png (65 KB, web console screenshot showing the memory error)

Bstorm renamed this task from "cloudstore1008 crash" to "cloudstore1008 crash - Memory error". Dec 2 2019, 10:11 PM
Bstorm triaged this task as Medium priority.
Bstorm updated the task description. (Show Details)
Bstorm added a subscriber: wiki_willy.

This host is currently the passive partner in the cluster and can be power cycled for troubleshooting if it is downtimed and syncs are stopped. Assigning to @wiki_willy to have DCOps take a look.

Please coordinate with me for restarts and downtime.
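
A rough sketch of that sequence; the sync unit name and downtime tooling below are assumptions for illustration, not the actual commands used:

# 1. Stop the sync on the active partner (unit name is a placeholder)
ssh cloudstore1009.example.net sudo systemctl stop cloudstore-sync.timer
# 2. Downtime both hosts in monitoring (script name is assumed)
icinga-downtime -h cloudstore1008 -d 14400 -r "DIMM_B2 replacement T239569"
icinga-downtime -h cloudstore1009 -d 14400 -r "DIMM_B2 replacement T239569"
# 3. Power off the passive host so DCOps can swap the DIMM
ipmitool -I lanplus -H cloudstore1008.mgmt.example.net -U root -P "$IPMI_PASS" chassis power off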

wiki_willy added subscribers: Jclark-ctr, JHedden.

Re-assigning to @Jclark-ctr, who can coordinate with @JHedden during their weekly sync-up on Tuesdays. Thanks, Willy

Confirmed: Service Request 1004932600 was successfully submitted.

Mentioned in SAL (#wikimedia-operations) [2019-12-04T22:39:32Z] <bstorm_> powered off cloudstore1008, disabled sync from cloudstore1009, and downtimed both cloudstore1008 and cloudstore1009 for memory module replacement T239569

Finished replacement of DIMM_B2

The server shows the right amount of RAM, so that's a good start. Checking logs on the web console.
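
Typical post-replacement checks (generic commands; nothing here is specific to this host):

# Total memory should match the installed capacity again
free -h
# Every slot, including DIMM_B2, should report a size and part number
sudo dmidecode --type memory | grep -E 'Locator|Size'
# No new memory events should appear in the BMC log
ipmitool sel elist | tail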

No errors for now. Hopefully, we don't see more pop up. Thanks!