We are getting timeouts generally on the grid, making jobs fail and the master unresponsive.
Errors on the jobs look like:
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
2023-04-17T12:46:05+00:00 Restarting job 'test-long-running-stretch' ('/data/project/toolschecker/bin/long-running.sh')
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
Traceback (most recent call last):
File "/usr/bin/job", line 52, in <module>
root = xml.etree.ElementTree.fromstring(proc.stdout.read())
File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 1316, in XML
return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0On the master, the logs show:
Apr 17 12:27:42 tools-sgegrid-master kernel: INFO: task sge_qmaster:25531 blocked for more than 120 seconds. Apr 17 12:27:42 tools-sgegrid-master kernel: Not tainted 4.19.0-23-cloud-amd64 #1 Debian 4.19.269-1 Apr 17 12:27:42 tools-sgegrid-master kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Apr 17 12:27:42 tools-sgegrid-master kernel: sge_qmaster D 0 25531 1 0x00000000 Apr 17 12:27:42 tools-sgegrid-master kernel: Call Trace: Apr 17 12:27:42 tools-sgegrid-master kernel: __schedule+0x29f/0x840 Apr 17 12:27:42 tools-sgegrid-master kernel: schedule+0x28/0x80 Apr 17 12:27:42 tools-sgegrid-master kernel: io_schedule+0x12/0x40 Apr 17 12:27:42 tools-sgegrid-master kernel: __lock_page+0x118/0x170 Apr 17 12:27:42 tools-sgegrid-master kernel: ? page_cache_tree_insert+0xe0/0xe0 Apr 17 12:27:42 tools-sgegrid-master kernel: pagecache_get_page+0x19a/0x2c0 Apr 17 12:27:42 tools-sgegrid-master kernel: grab_cache_page_write_begin+0x1f/0x40 Apr 17 12:27:42 tools-sgegrid-master kernel: nfs_write_begin+0x63/0x310 [nfs] Apr 17 12:27:42 tools-sgegrid-master kernel: generic_perform_write+0xfd/0x1c0 Apr 17 12:27:42 tools-sgegrid-master kernel: nfs_file_write+0xdc/0x210 [nfs] Apr 17 12:27:42 tools-sgegrid-master kernel: new_sync_write+0xfb/0x160 Apr 17 12:27:42 tools-sgegrid-master kernel: vfs_write+0xa5/0x1a0 Apr 17 12:27:42 tools-sgegrid-master kernel: ksys_pwrite64+0x61/0xa0 Apr 17 12:27:42 tools-sgegrid-master kernel: do_syscall_64+0x50/0xf0 Apr 17 12:27:42 tools-sgegrid-master kernel: entry_SYSCALL_64_after_hwframe+0x5c/0xc1 Apr 17 12:27:42 tools-sgegrid-master kernel: RIP: 0033:0x7f8fec9f6edf Apr 17 12:27:42 tools-sgegrid-master kernel: Code: Bad RIP value. Apr 17 12:27:42 tools-sgegrid-master kernel: RSP: 002b:00007f8fe33fbad0 EFLAGS: 00000297 ORIG_RAX: 0000000000000012 Apr 17 12:27:42 tools-sgegrid-master kernel: RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f8fec9f6edf Apr 17 12:27:42 tools-sgegrid-master kernel: RDX: 000000000000128a RSI: 00007f8fec270580 RDI: 000000000000000b Apr 17 12:27:42 tools-sgegrid-master kernel: RBP: 00007f8fec270580 R08: 0000000000000001 R09: 00007f8fe77ff900 Apr 17 12:27:42 tools-sgegrid-master kernel: R10: 00000000003daa35 R11: 0000000000000297 R12: 000000000000128a Apr 17 12:27:42 tools-sgegrid-master kernel: R13: 00000000003daa35 R14: 0000000000000002 R15: 000000000000128a
Right after the first error, looking