We've noticed that an-worker1145 is effectively out of the cluster.
The datanode service is down and we can't SSH into the host, although it is still responding to pings.
Description
Event Timeline
I've logged in via the SOL console and I can see that there is a problem with the storage controller.
Errors like the following are scrolling past on the console:
[14278134.771367] systemd[22798]: confd.service: Failed to execute command: Input/output error
[14278134.779775] systemd[22798]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278145.015307] print_req_error: I/O error, dev sda, sector 43786240
[14278145.021627] systemd[22810]: confd.service: Failed to execute command: Input/output error
[14278145.030029] systemd[22810]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278154.484173] print_req_error: I/O error, dev sda, sector 46724152
[14278154.490521] EXT4-fs error (device dm-0): __ext4_find_entry:1488: inode #784954: comm rm: reading directory lblock 0
[14278154.501294] print_req_error: I/O error, dev sda, sector 21487616
[14278154.507574] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[14278154.515197] EXT4-fs (dm-0): I/O error while writing superblock
[14278154.521395] print_req_error: I/O error, dev sda, sector 46724152
[14278154.527727] EXT4-fs error (device dm-0): __ext4_find_entry:1488: inode #784954: comm rm: reading directory lblock 0
[14278154.538483] print_req_error: I/O error, dev sda, sector 21487616
[14278154.544759] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[14278154.552386] EXT4-fs (dm-0): I/O error while writing superblock
[14278154.560382] systemd[1]: Failed to start Update Debian version stat exported by node_exporter.
[14278155.265134] print_req_error: I/O error, dev sda, sector 43786240
[14278155.271462] systemd[22837]: confd.service: Failed to execute command: Input/output error
[14278155.279861] systemd[22837]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278162.783786] print_req_error: I/O error, dev sda, sector 46724152
[14278162.790072] EXT4-fs error (device dm-0): __ext4_find_entry:1488: inode #784954: comm confd-prometheu: reading directory lblock 0
[14278162.801935] print_req_error: I/O error, dev sda, sector 21487616
[14278162.808205] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[14278162.815822] EXT4-fs (dm-0): I/O error while writing superblock
[14278162.822317] print_req_error: I/O error, dev sda, sector 47010056
[14278162.828611] print_req_error: I/O error, dev sda, sector 47010056
[14278162.834910] print_req_error: I/O error, dev sda, sector 47010056
[14278162.850566] systemd[1]: Failed to start Export confd Prometheus metrics.
[14278163.692088] print_req_error: I/O error, dev sda, sector 46724152
[14278163.698442] EXT4-fs error (device dm-0): __ext4_find_entry:1488: inode #784954: comm prometheus-pupp: reading directory lblock 0
[14278163.710351] print_req_error: I/O error, dev sda, sector 21487616
[14278163.716631] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[14278163.724283] EXT4-fs (dm-0): I/O error while writing superblock
[14278163.731196] print_req_error: I/O error, dev sda, sector 47010056
[14278163.737498] print_req_error: I/O error, dev sda, sector 47010056
[14278163.743976] print_req_error: I/O error, dev sda, sector 47010056
[14278163.769342] systemd[1]: Failed to start Regular job to collect puppet agent stats.
[14278165.515784] systemd[22856]: confd.service: Failed to execute command: Input/output error
[14278165.524213] systemd[22856]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
exim[22857]: 2023-07-10 14:00:09 Start queue run: pid=22857
exim[22857]: 2023-07-10 14:00:09 Cannot open main log file "/var/log/exim4/mainlog": Permission denied: euid=0 egid=115
exim[22857]: exim: could not open panic log - aborting: see message(s) above
[14278175.765740] print_req_error: I/O error, dev sda, sector 43786240
[14278175.772072] systemd[22860]: confd.service: Failed to execute command: Input/output error
[14278175.780475] systemd[22860]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278186.016155] print_req_error: I/O error, dev sda, sector 43786240
[14278186.022557] systemd[22875]: confd.service: Failed to execute command: Input/output error
[14278186.031024] systemd[22875]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278196.266208] print_req_error: I/O error, dev sda, sector 43786240
[14278196.272561] systemd[22886]: confd.service: Failed to execute command: Input/output error
[14278196.281013] systemd[22886]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278206.516312] print_req_error: I/O error, dev sda, sector 43786240
[14278206.522630] systemd[22899]: confd.service: Failed to execute command: Input/output error
[14278206.531055] systemd[22899]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
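To get a feel for how many distinct sectors are failing, the console output above could be tallied with a short script. This is only a rough sketch in Python, assuming the console text has been captured to a string; the function name `summarize_io_errors` and the `SAMPLE` excerpt are illustrative and not part of any existing tooling.

```python
import re
from collections import Counter

# Matches kernel lines of the form seen on the console, e.g.
# "[14278145.015307] print_req_error: I/O error, dev sda, sector 43786240"
IO_ERROR_RE = re.compile(r"print_req_error: I/O error, dev (\w+), sector (\d+)")


def summarize_io_errors(console_text):
    """Count I/O errors per (device, sector) pair in captured console text.

    Hypothetical helper: it scans the whole string, so it works whether or
    not the log entries are separated by newlines.
    """
    counts = Counter()
    for device, sector in IO_ERROR_RE.findall(console_text):
        counts[(device, int(sector))] += 1
    return counts


# A short excerpt copied from the console output above.
SAMPLE = """
[14278145.015307] print_req_error: I/O error, dev sda, sector 43786240
[14278154.484173] print_req_error: I/O error, dev sda, sector 46724152
[14278154.501294] print_req_error: I/O error, dev sda, sector 21487616
[14278154.521395] print_req_error: I/O error, dev sda, sector 46724152
"""

if __name__ == "__main__":
    for (device, sector), n in sorted(summarize_io_errors(SAMPLE).items()):
        print(f"{device} sector {sector}: {n} error(s)")
```

A small set of sectors repeating, as in this excerpt, only tells us which reads are being retried; it does not by itself distinguish a failed disk from a failed controller.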
Mentioned in SAL (#wikimedia-analytics) [2023-07-10T14:02:33Z] <btullis> powered off an-worker1145 for T341481
Cold booted the host. We'll see if this reinitializes the storage system, or whether it fails to boot.