Still under investigation:
[15:21:00] <+icinga-wm> PROBLEM - Host db1150 is DOWN: PING CRITICAL - Packet loss = 100%
This is a backup source for s4 and s5
racadm getsel shows RAM issues
-------------------------------------------------------------------------------
Record:      135
Date/Time:   03/21/2023 14:17:07
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A8.
-------------------------------------------------------------------------------
Record:      136
Date/Time:   03/21/2023 14:17:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      137
Date/Time:   03/21/2023 14:17:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      138
Date/Time:   03/21/2023 14:17:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      139
Date/Time:   03/21/2023 14:17:08
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A8.
-------------------------------------------------------------------------------
Record:      140
Date/Time:   03/21/2023 14:17:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      141
Date/Time:   03/21/2023 14:17:08
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      142
Date/Time:   03/21/2023 14:17:09
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record:      143
Date/Time:   03/21/2023 14:17:09
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A8.
-------------------------------------------------------------------------------
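For anyone triaging a similar log, here is a minimal sketch of how the Critical entries in the SEL output above could be tallied per DIMM slot. It assumes the text has already been saved from `racadm getsel`. The function names (`parse_sel`, `critical_dimms`) and the `SAMPLE` excerpt are illustrative and not part of any existing tooling. The regexes only assume the field labels seen in this ticket.

```python
import re
from collections import Counter

# Records are delimited by long runs of dashes in the output pasted above.
SEPARATOR = re.compile(r"-{10,}")

# Field labels as they appear in this ticket; whitespace between fields may be
# spaces or newlines, so the pattern accepts either.
RECORD = re.compile(
    r"Record:\s*(?P<record>\d+)\s+"
    r"Date/Time:\s*(?P<timestamp>\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}:\d{2})\s+"
    r"Source:\s*(?P<source>\S+)\s+"
    r"Severity:\s*(?P<severity>\S+)\s+"
    r"Description:\s*(?P<description>.+)",
    re.DOTALL,
)

# Slot names such as DIMM_A8 (assumed naming pattern, based on this log only).
DIMM = re.compile(r"DIMM_[A-Z]\d+")


def parse_sel(text):
    """Split saved SEL text into a list of dicts, one per record."""
    records = []
    for chunk in SEPARATOR.split(text):
        match = RECORD.search(chunk)
        if match:
            entry = match.groupdict()
            entry["record"] = int(entry["record"])
            # Collapse any line wrapping inside the description.
            entry["description"] = " ".join(entry["description"].split())
            records.append(entry)
    return records


def critical_dimms(records):
    """Count how many Critical records mention each DIMM slot."""
    counts = Counter()
    for entry in records:
        if entry["severity"] == "Critical":
            counts.update(DIMM.findall(entry["description"]))
    return counts


# Three records copied from the log above (135, 136 and 143).
SAMPLE = """
------------------------------------------------------------------------------- Record: 135 Date/Time: 03/21/2023 14:17:07 Source: system Severity: Critical Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_A8.
-------------------------------------------------------------------------------
Record:      136
Date/Time:   03/21/2023 14:17:07
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Record: 143 Date/Time: 03/21/2023 14:17:09 Source: system Severity: Critical Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A8.
-------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    for slot, hits in critical_dimms(parse_sel(SAMPLE)).items():
        print(f"{slot}: {hits} critical record(s)")
```

A tally like this only shows which slot the firmware is blaming. It cannot tell a bad DIMM from a bad main board, which is why the swap test is still needed.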
@wiki_willy we'd need help with the above. We probably need to get the DIMM swapped with another DIMM to see if the problem is the DIMM itself or the main board.
This host may still be under warranty until August.
Change 901601 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] monitoring: Disable notifications for db1150 after crash
Change 901601 merged by Jcrespo:
[operations/puppet@production] monitoring: Disable notifications for db1150 after crash
Sadly, it doesn't power cycle from the management interface, so it requires a "manual" power drain and reboot from DC-Ops when possible.
The host will be left unused and with notifications disabled so it can be serviced at any time (no rush). Thank you.
Change 901624 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] dbbackups: Setup db1145 as a backup source replacement for db1150
FYI: Logged but forgot to add the ticket number:
jynus: running from cumin1001: transfer.py --type=decompress dbprov1003.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s5.2023-03-20--04-00-30.tar.gz db1145.eqiad.wmnet:/srv/sqldata.s5
Change 901624 merged by Jcrespo:
[operations/puppet@production] dbbackups: Setup db1145 as a backup source replacement for db1150
@Marostegui @jynus I apologize for the delay with this DIMM; Dell had a question that needed a response, and it's delaying the shipment. It should go out today.
No worries, as I said, this is not in a rush. If you can, please power drain it the next time you are available (it can be next week); that would help us.
Icinga downtime and Alertmanager silence (ID=3b38157a-7d2c-4b9f-ad17-b2b2c6932dcb) set by jynus@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reason: reprovisioning after maintenance
db1150.eqiad.wmnet
Change 904764 had a related patch set uploaded (by Jcrespo; author: Jcrespo):
[operations/puppet@production] database-backups: Provision db1150 with s4 and s3 sections
Change 904764 merged by Jcrespo:
[operations/puppet@production] database-backups: Provision db1150 with s4 and s3 sections