
logstash1011 is down
Closed, Resolved (Public)

Description

Syslog indicates that /dev/sdc has failed. The node has been cordoned and Puppet has been disabled.

Mar  8 04:32:33 logstash1011 kernel: [17036681.856985] ata5.00: exception Emask 0x0 SAct 0x6000 SErr 0x0 action 0x0
Mar  8 04:32:33 logstash1011 kernel: [17036681.863972] ata5.00: irq_stat 0x40000008
Mar  8 04:32:33 logstash1011 kernel: [17036681.868188] ata5.00: failed command: READ FPDMA QUEUED
Mar  8 04:32:33 logstash1011 kernel: [17036681.873600] ata5.00: cmd 60/08:68:18:ea:3c/00:00:c3:01:00/40 tag 13 ncq dma 4096 in
Mar  8 04:32:33 logstash1011 kernel: [17036681.873600]          res 43/40:05:1b:ea:3c/00:00:c3:01:00/40 Emask 0x409 (media error) <F>
Mar  8 04:32:33 logstash1011 kernel: [17036681.890014] ata5.00: status: { DRDY SENSE ERR }
Mar  8 04:32:33 logstash1011 kernel: [17036681.894814] ata5.00: error: { UNC }
Mar  8 04:32:33 logstash1011 kernel: [17036682.029786] ata5.00: configured for UDMA/133
Mar  8 04:32:33 logstash1011 kernel: [17036682.029813] sd 4:0:0:0: [sdc] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
Mar  8 04:32:33 logstash1011 kernel: [17036682.029821] sd 4:0:0:0: [sdc] tag#13 Sense Key : Medium Error [current] 
Mar  8 04:32:33 logstash1011 kernel: [17036682.029828] sd 4:0:0:0: [sdc] tag#13 Add. Sense: Unrecovered read error - auto reallocate failed
Mar  8 04:32:33 logstash1011 kernel: [17036682.029834] sd 4:0:0:0: [sdc] tag#13 CDB: Read(16) 88 00 00 00 00 01 c3 3c ea 18 00 00 00 08 00 00
Mar  8 04:32:33 logstash1011 kernel: [17036682.029842] blk_update_request: I/O error, dev sdc, sector 7570516504 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Mar  8 04:32:33 logstash1011 kernel: [17036682.040756] ata5: EH complete
Mar  8 04:32:36 logstash1011 kernel: [17036684.844914] ata5.00: exception Emask 0x0 SAct 0x208000 SErr 0x0 action 0x0
Mar  8 04:32:36 logstash1011 kernel: [17036684.852069] ata5.00: irq_stat 0x40000008
Mar  8 04:32:36 logstash1011 kernel: [17036684.856278] ata5.00: failed command: READ FPDMA QUEUED
Mar  8 04:32:36 logstash1011 kernel: [17036684.861699] ata5.00: cmd 60/08:a8:18:ea:3c/00:00:c3:01:00/40 tag 21 ncq dma 4096 in
Mar  8 04:32:36 logstash1011 kernel: [17036684.861699]          res 43/40:05:1b:ea:3c/00:00:c3:01:00/40 Emask 0x409 (media error) <F>
Mar  8 04:32:36 logstash1011 kernel: [17036684.878108] ata5.00: status: { DRDY SENSE ERR }
Mar  8 04:32:36 logstash1011 kernel: [17036684.882912] ata5.00: error: { UNC }
Mar  8 04:32:36 logstash1011 kernel: [17036684.947296] ata5.00: configured for UDMA/133
Mar  8 04:32:36 logstash1011 kernel: [17036684.947329] sd 4:0:0:0: [sdc] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s
Mar  8 04:32:36 logstash1011 kernel: [17036684.947336] sd 4:0:0:0: [sdc] tag#21 Sense Key : Medium Error [current] 
Mar  8 04:32:36 logstash1011 kernel: [17036684.947342] sd 4:0:0:0: [sdc] tag#21 Add. Sense: Unrecovered read error - auto reallocate failed
Mar  8 04:32:36 logstash1011 kernel: [17036684.947348] sd 4:0:0:0: [sdc] tag#21 CDB: Read(16) 88 00 00 00 00 01 c3 3c ea 18 00 00 00 08 00 00
Mar  8 04:32:36 logstash1011 kernel: [17036684.947355] blk_update_request: I/O error, dev sdc, sector 7570516504 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Mar  8 04:32:36 logstash1011 kernel: [17036684.958259] ata5: EH complete
...
Mar  8 04:32:36 logstash1011 systemd[1]: opensearch_2@production-elk7-eqiad.service: Main process exited, code=killed, status=6/ABRT
Mar  8 04:32:36 logstash1011 systemd[1]: opensearch_2@production-elk7-eqiad.service: Failed with result 'signal'.
Mar  8 04:32:36 logstash1011 systemd[1]: opensearch_2@production-elk7-eqiad.service: Consumed 7min 25.738s CPU time.
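
As a cross-check on the log above, the Read(16) CDB can be decoded by hand to confirm that the failing request and the sector in the blk_update_request line refer to the same location. The following is only a sketch: the helper names are hypothetical, and it assumes the standard SCSI READ(16) layout (opcode 0x88, 8-byte big-endian LBA in bytes 2-9, 4-byte transfer length in bytes 10-13). The two input lines are copied from the log with the syslog prefix trimmed.

```python
import re

# Lines copied from the syslog excerpt above (syslog prefix trimmed).
CDB_LINE = ("sd 4:0:0:0: [sdc] tag#13 CDB: Read(16) "
            "88 00 00 00 00 01 c3 3c ea 18 00 00 00 08 00 00")
BLK_LINE = ("blk_update_request: I/O error, dev sdc, sector 7570516504 "
            "op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0")


def decode_read16(line):
    """Return (lba, block_count) from a kernel 'CDB: Read(16) ...' line.

    Hypothetical helper; assumes the standard READ(16) byte layout.
    """
    hex_bytes = line.split("Read(16)", 1)[1].split()
    cdb = bytes(int(b, 16) for b in hex_bytes)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


def reported_sector(line):
    """Return (device, sector) from a blk_update_request I/O error line."""
    match = re.search(r"dev (\w+), sector (\d+)", line)
    if match is None:
        raise ValueError("no 'dev <name>, sector <n>' field found")
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    lba, blocks = decode_read16(CDB_LINE)
    dev, sector = reported_sector(BLK_LINE)
    print(f"CDB: lba={lba} blocks={blocks}")
    print(f"blk_update_request: dev={dev} sector={sector}")
    print("match" if lba == sector else "mismatch")
```

The decoded LBA equals the sector reported for sdc, and both logged attempts (tag 13 and tag 21) carry the same CDB, so the same read was retried against the same location. If the drive uses 512-byte logical blocks (an assumption, not stated in the log), the 8-block transfer length agrees with the "ncq dma 4096" size on the failed command.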

Event Timeline

colewhite changed the task status from Open to In Progress. Mar 11 2024, 7:44 AM
colewhite claimed this task.
colewhite triaged this task as High priority.

The logging-hd nodes are in service and the old nodes have been cordoned. Shards are rebalancing. One known issue: Puppet fails to install elasticsearch-curator, because the package still has to be rebuilt and tested for bookworm.

Hello, I see several email alerts regarding logstash1011:

Systemd timer ran the following command:

    /usr/bin/debmonitor-client

Its return value was 1 and emitted the following output:

INFO:debmonitor:Found 599 installed binary packages
INFO:debmonitor:Found 13 upgradable binary packages (including new dependencies)
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2622)'))': /hosts/logstash1011.eqiad.wmnet/update
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2622)'))': /hosts/logstash1011.eqiad.wmnet/update
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2622)'))': /hosts/logstash1011.eqiad.wmnet/update
ERROR:debmonitor:Failed to execute DebMonitor CLI: HTTPSConnectionPool(host='debmonitor.discovery.wmnet', port=443): Max retries exceeded with url: /hosts/logstash1011.eqiad.wmnet/update (Caused by SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2622)')))

logstash1011 is no more