Just now I got an alert that cloudstore1008 was down. I cycled power via mgmt.
According to the web console, this was apparently caused by a memory error in "DIMM_B2".
Andrew, Dec 2 2019, 7:19 AM
F31454493: Screen Shot 2019-12-02 at 3.08.26 PM.png, Dec 2 2019, 10:09 PM
Status | Subtype | Assigned | Task
---|---|---|---
Unknown Object (Task) | | |
Open | | None | T209530 Build user data backup service based on remote sync rather than NFS
Resolved | | Bstorm | T193655 rack/setup/install cloudstore1008 & cloudstore1009
Resolved | | Jclark-ctr | T239569 cloudstore1008 crash - Memory error
Resolved | | Bstorm | T239721 Set static bind mounts on cloudstore1008/9
I see nothing at all in the syslog that would explain this crash -- just an empty spot:
Dec 2 07:13:09 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=48 ID=62385 PROTO=TCP SPT=49648 DPT=873 SEQ=4054715434 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Dec 2 07:13:09 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=48 ID=54280 PROTO=TCP SPT=49649 DPT=873 SEQ=4054780971 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Dec 2 07:13:09 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=57 ID=14724 PROTO=TCP SPT=49650 DPT=873 SEQ=4054584360 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Dec 2 07:13:15 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=56 ID=21729 PROTO=TCP SPT=49648 DPT=1207 SEQ=4054715434 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Dec 2 07:13:15 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=56 ID=60664 PROTO=TCP SPT=49649 DPT=1207 SEQ=4054780971 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
Dec 2 07:13:15 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1 OUT= MAC=d0:94:66:26:d5:6a:5c:5e:ab:3d:87:c4:08:00 SRC=172.16.2.185 DST=208.80.155.125 LEN=44 TOS=00 PREC=0x00 TTL=35 ID=26501 PROTO=TCP SPT=49650 DPT=1207 SEQ=4054584360 ACK=0 WINDOW=1024 SYN URGP=0 MARK=0
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [long run of NUL bytes trimmed]
Dec 2 07:21:03 cloudstore1008 systemd-modules-load[1656]: Inserted module 'nf_conntrack'
Dec 2 07:21:03 cloudstore1008 systemd-modules-load[1656]: Inserted module 'ipmi_devintf'
Dec 2 07:21:03 cloudstore1008 systemd[1]: Started Load Kernel Modules.
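For anyone wanting to find this kind of gap without paging through the file by hand, here is a rough Python sketch. It assumes the `^@` characters above are caret notation for raw NUL bytes (0x00) in the file, and it treats any run of 16 or more NUL bytes as a gap; that threshold, the `find_nul_gaps` name and the shortened sample data are made up for illustration and are not part of any tooling on the host.

```python
import re

# Classic syslog timestamp, e.g. "Dec 2 07:13:15".
TIMESTAMP = re.compile(rb"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}")
# Assumption: 16 or more consecutive NUL bytes counts as a gap.
NUL_RUN = re.compile(rb"\x00{16,}")


def find_nul_gaps(data: bytes):
    """Return (last timestamp before, first timestamp after, length) per NUL run."""
    gaps = []
    for run in NUL_RUN.finditer(data):
        before = TIMESTAMP.findall(data[: run.start()])
        after = TIMESTAMP.search(data, run.end())
        gaps.append((
            before[-1].decode() if before else None,
            after.group().decode() if after else None,
            run.end() - run.start(),
        ))
    return gaps


# Shortened stand-in for the excerpt above, not the real file contents.
sample = (
    b"Dec 2 07:13:15 cloudstore1008 ulogd[1910]: [fw-in-drop] IN=eno1\n"
    + b"\x00" * 64
    + b"Dec 2 07:21:03 cloudstore1008 systemd[1]: Started Load Kernel Modules.\n"
)
gaps = find_nul_gaps(sample)
print(gaps)
```

On a real host the bytes would come from something like `open(path, "rb").read()`. The timestamps on either side of the run bracket the window in which the machine stopped writing logs.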
When it was rebooted, it filled the root disk due to lack of bind mounts. I'm going to add the silly bind mounts to fstab today.
That doesn't explain the above crash, though.
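As a rough illustration of the fstab side of this, the Python sketch below checks whether a set of expected bind mounts is present in fstab text. The paths, the `missing_bind_mounts` helper and the sample fstab are all hypothetical and are not the real cloudstore1008 layout; the only thing taken as given is the standard fstab field order (source, target, type, options).

```python
def missing_bind_mounts(fstab_text: str, expected: dict) -> list:
    """Return expected mount points lacking a bind entry from the expected source."""
    found = {}
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        source, target, _fstype, options = fields[:4]
        opts = options.split(",")
        if "bind" in opts or "rbind" in opts:
            found[target] = source
    return [t for t, s in expected.items() if found.get(t) != s]


# Hypothetical paths for illustration only.
fstab = """\
# <source>        <target>      <type>  <options>
/srv/data/home    /exp/home     none    bind    0 0
"""
expected = {"/exp/home": "/srv/data/home", "/exp/project": "/srv/data/project"}
missing = missing_bind_mounts(fstab, expected)
print(missing)
```

A check like this could run after a reboot and before anything starts writing, so data does not land on the root disk when a bind mount is absent.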
This host is currently the passive partner in the cluster and can be power cycled for troubleshooting if it is downtimed and syncs are stopped. Assigning to @wiki_willy to have DCOps take a look.
Re-assigning to @Jclark-ctr, who can coordinate with @JHedden during their weekly sync-up on Tuesdays. Thanks, Willy
Mentioned in SAL (#wikimedia-operations) [2019-12-04T22:39:32Z] <bstorm_> powered off cloudstore1008, disabled sync from cloudstore1009, and downtimed both cloudstore1008 and cloudstore1009 for memory module replacement T239569
The server shows the right amount of RAM, so that's a good start. Checking logs on the web console.