an-worker1145 has a problem
Closed, Resolved, Public

Description

We've noticed that an-worker1145 has effectively dropped out of the cluster.
The datanode service is down and we can't SSH into the host, although it still responds to pings.

Event Timeline

BTullis triaged this task as High priority.
BTullis added a project: Data-Platform-SRE.
BTullis moved this task from Incoming to In Progress on the Data-Platform-SRE board.

I've logged in via the serial-over-LAN (SOL) console and I can see that there is a problem with the storage controller.
Messages like these are scrolling past on the console:

[14278134.771367] systemd[22798]: confd.service: Failed to execute command: Input/output error
[14278134.779775] systemd[22798]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278145.015307] print_req_error: I/O error, dev sda, sector 43786240
[14278145.021627] systemd[22810]: confd.service: Failed to execute command: Input/output error
[14278145.030029] systemd[22810]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278154.484173] print_req_error: I/O error, dev sda, sector 46724152
[14278154.490521] EXT4-fs error (device dm-0): __ext4_find_entry:1488: inode #784954: comm rm: reading directory lblock 0
[14278154.501294] print_req_error: I/O error, dev sda, sector 21487616
[14278154.507574] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[14278154.515197] EXT4-fs (dm-0): I/O error while writing superblock
[14278154.521395] print_req_error: I/O error, dev sda, sector 46724152
[14278154.527727] EXT4-fs error (device dm-0): __ext4_find_entry:1488: inode #784954: comm rm: reading directory lblock 0
[14278154.538483] print_req_error: I/O error, dev sda, sector 21487616
[14278154.544759] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[14278154.552386] EXT4-fs (dm-0): I/O error while writing superblock
[14278154.560382] systemd[1]: Failed to start Update Debian version stat exported by node_exporter.
[14278155.265134] print_req_error: I/O error, dev sda, sector 43786240
[14278155.271462] systemd[22837]: confd.service: Failed to execute command: Input/output error
[14278155.279861] systemd[22837]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278162.783786] print_req_error: I/O error, dev sda, sector 46724152
[14278162.790072] EXT4-fs error (device dm-0): __ext4_find_entry:1488: inode #784954: comm confd-prometheu: reading directory lblock 0
[14278162.801935] print_req_error: I/O error, dev sda, sector 21487616
[14278162.808205] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[14278162.815822] EXT4-fs (dm-0): I/O error while writing superblock
[14278162.822317] print_req_error: I/O error, dev sda, sector 47010056
[14278162.828611] print_req_error: I/O error, dev sda, sector 47010056
[14278162.834910] print_req_error: I/O error, dev sda, sector 47010056
[14278162.850566] systemd[1]: Failed to start Export confd Prometheus metrics.
[14278163.692088] print_req_error: I/O error, dev sda, sector 46724152
[14278163.698442] EXT4-fs error (device dm-0): __ext4_find_entry:1488: inode #784954: comm prometheus-pupp: reading directory lblock 0
[14278163.710351] print_req_error: I/O error, dev sda, sector 21487616
[14278163.716631] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[14278163.724283] EXT4-fs (dm-0): I/O error while writing superblock
[14278163.731196] print_req_error: I/O error, dev sda, sector 47010056
[14278163.737498] print_req_error: I/O error, dev sda, sector 47010056
[14278163.743976] print_req_error: I/O error, dev sda, sector 47010056
[14278163.769342] systemd[1]: Failed to start Regular job to collect puppet agent stats.
[14278165.515784] systemd[22856]: confd.service: Failed to execute command: Input/output error
[14278165.524213] systemd[22856]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
exim[22857]: 2023-07-10 14:00:09 Start queue run: pid=22857
exim[22857]: 2023-07-10 14:00:09 Cannot open main log file "/var/log/exim4/mainlog": Permission denied: euid=0 egid=115
exim[22857]: exim: could not open panic log - aborting: see message(s) above
[14278175.765740] print_req_error: I/O error, dev sda, sector 43786240
[14278175.772072] systemd[22860]: confd.service: Failed to execute command: Input/output error
[14278175.780475] systemd[22860]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278186.016155] print_req_error: I/O error, dev sda, sector 43786240
[14278186.022557] systemd[22875]: confd.service: Failed to execute command: Input/output error
[14278186.031024] systemd[22875]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278196.266208] print_req_error: I/O error, dev sda, sector 43786240
[14278196.272561] systemd[22886]: confd.service: Failed to execute command: Input/output error
[14278196.281013] systemd[22886]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
[14278206.516312] print_req_error: I/O error, dev sda, sector 43786240
[14278206.522630] systemd[22899]: confd.service: Failed to execute command: Input/output error
[14278206.531055] systemd[22899]: confd.service: Failed at step EXEC spawning /usr/bin/confd: Input/output error
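
To get a quick picture of how widespread the failures are, a console capture like the one above can be summarized by device and sector. The following is a minimal sketch, not a tool used in this task: the function name `summarize_io_errors` and the regular expression are assumptions based only on the `print_req_error` line format visible in the excerpt.

```python
import re
from collections import Counter

# Matches kernel lines of the form seen in the excerpt, e.g.:
# [14278145.015307] print_req_error: I/O error, dev sda, sector 43786240
IO_ERROR = re.compile(r"print_req_error: I/O error, dev (\w+), sector (\d+)")


def summarize_io_errors(lines):
    """Count block-layer I/O errors per (device, sector) pair.

    Hypothetical helper: lines that do not match the pattern are ignored.
    """
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the console output above.
    sample = [
        "[14278145.015307] print_req_error: I/O error, dev sda, sector 43786240",
        "[14278154.484173] print_req_error: I/O error, dev sda, sector 46724152",
        "[14278154.501294] print_req_error: I/O error, dev sda, sector 21487616",
        "[14278155.265134] print_req_error: I/O error, dev sda, sector 43786240",
    ]
    for (dev, sector), n in sorted(summarize_io_errors(sample).items()):
        print(f"{dev} sector {sector}: {n} error(s)")
```

In the excerpt above, every reported error is on sda and only four distinct sectors appear (43786240, 46724152, 21487616 and 47010056), each failing repeatedly.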

Mentioned in SAL (#wikimedia-analytics) [2023-07-10T14:02:33Z] <btullis> powered off an-worker1145 for T341481

Cold booted the host. We'll see if this reinitializes the storage system, or whether it fails to boot.

Server appears to have booted correctly and all services are recovering.