
[wmcs.gridengine] syncron message receive errors
Closed, Resolved (Public)

Description

We are seeing timeouts across the grid, causing jobs to fail and leaving the master unresponsive.

Errors on the jobs look like:

error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
2023-04-17T12:46:05+00:00 Restarting job 'test-long-running-stretch' ('/data/project/toolschecker/bin/long-running.sh')
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
Traceback (most recent call last):
  File "/usr/bin/job", line 52, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 1316, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
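
For reference, a minimal sketch of why the traceback ends in that ParseError, assuming /usr/bin/job reads XML from a qstat-like subprocess (the exact command is an assumption): when the qmaster times out, the subprocess prints nothing, and parsing an empty string raises exactly "no element found: line 1, column 0".

import subprocess
import xml.etree.ElementTree

# "qstat -xml" is assumed here; the real command run by /usr/bin/job may differ.
proc = subprocess.Popen(["qstat", "-xml"], stdout=subprocess.PIPE)
output = proc.stdout.read()

if not output.strip():
    # A guard like this would turn the confusing secondary ParseError
    # into a clear "qmaster unreachable" message.
    raise SystemExit("qstat returned no output (qmaster timed out?)")
root = xml.etree.ElementTree.fromstring(output)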

On the master, the logs show:

Apr 17 12:27:42 tools-sgegrid-master kernel: INFO: task sge_qmaster:25531 blocked for more than 120 seconds.
Apr 17 12:27:42 tools-sgegrid-master kernel:       Not tainted 4.19.0-23-cloud-amd64 #1 Debian 4.19.269-1
Apr 17 12:27:42 tools-sgegrid-master kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 17 12:27:42 tools-sgegrid-master kernel: sge_qmaster     D    0 25531      1 0x00000000
Apr 17 12:27:42 tools-sgegrid-master kernel: Call Trace:
Apr 17 12:27:42 tools-sgegrid-master kernel:  __schedule+0x29f/0x840
Apr 17 12:27:42 tools-sgegrid-master kernel:  schedule+0x28/0x80
Apr 17 12:27:42 tools-sgegrid-master kernel:  io_schedule+0x12/0x40
Apr 17 12:27:42 tools-sgegrid-master kernel:  __lock_page+0x118/0x170
Apr 17 12:27:42 tools-sgegrid-master kernel:  ? page_cache_tree_insert+0xe0/0xe0
Apr 17 12:27:42 tools-sgegrid-master kernel:  pagecache_get_page+0x19a/0x2c0
Apr 17 12:27:42 tools-sgegrid-master kernel:  grab_cache_page_write_begin+0x1f/0x40
Apr 17 12:27:42 tools-sgegrid-master kernel:  nfs_write_begin+0x63/0x310 [nfs]
Apr 17 12:27:42 tools-sgegrid-master kernel:  generic_perform_write+0xfd/0x1c0
Apr 17 12:27:42 tools-sgegrid-master kernel:  nfs_file_write+0xdc/0x210 [nfs]
Apr 17 12:27:42 tools-sgegrid-master kernel:  new_sync_write+0xfb/0x160
Apr 17 12:27:42 tools-sgegrid-master kernel:  vfs_write+0xa5/0x1a0
Apr 17 12:27:42 tools-sgegrid-master kernel:  ksys_pwrite64+0x61/0xa0
Apr 17 12:27:42 tools-sgegrid-master kernel:  do_syscall_64+0x50/0xf0
Apr 17 12:27:42 tools-sgegrid-master kernel:  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Apr 17 12:27:42 tools-sgegrid-master kernel: RIP: 0033:0x7f8fec9f6edf
Apr 17 12:27:42 tools-sgegrid-master kernel: Code: Bad RIP value.
Apr 17 12:27:42 tools-sgegrid-master kernel: RSP: 002b:00007f8fe33fbad0 EFLAGS: 00000297 ORIG_RAX: 0000000000000012
Apr 17 12:27:42 tools-sgegrid-master kernel: RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f8fec9f6edf
Apr 17 12:27:42 tools-sgegrid-master kernel: RDX: 000000000000128a RSI: 00007f8fec270580 RDI: 000000000000000b
Apr 17 12:27:42 tools-sgegrid-master kernel: RBP: 00007f8fec270580 R08: 0000000000000001 R09: 00007f8fe77ff900
Apr 17 12:27:42 tools-sgegrid-master kernel: R10: 00000000003daa35 R11: 0000000000000297 R12: 000000000000128a
Apr 17 12:27:42 tools-sgegrid-master kernel: R13: 00000000003daa35 R14: 0000000000000002 R15: 000000000000128a
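
The trace shows sge_qmaster parked in uninterruptible sleep inside an NFS write (nfs_write_begin), which is what the hung-task watchdog flags after 120 seconds. A hedged diagnostic sketch, not taken from this incident, for spotting other tasks stuck the same way:

import os

def d_state_tasks():
    """Yield (pid, comm) for tasks in uninterruptible sleep ('D')."""
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                data = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat is "pid (comm) state ..."; comm may contain
        # spaces, so locate the state character after the last ")".
        comm = data[data.index("(") + 1:data.rindex(")")]
        state = data[data.rindex(")") + 2]
        if state == "D":
            yield pid, comm

for pid, comm in d_state_tasks():
    print(pid, comm)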

Right after the first error, looking

Event Timeline

dcaro triaged this task as High priority.

Trying to stop and start the process did not help; the process did not go away:

Apr 17 12:56:43 tools-sgegrid-master systemd[1]: gridengine-master.service: State 'stop-sigterm' timed out. Killing.
Apr 17 12:56:43 tools-sgegrid-master systemd[1]: gridengine-master.service: Killing process 25521 (sge_qmaster) with signal SIGKILL.
Apr 17 12:58:13 tools-sgegrid-master systemd[1]: gridengine-master.service: Processes still around after SIGKILL. Ignoring.
Apr 17 12:59:44 tools-sgegrid-master systemd[1]: gridengine-master.service: State 'stop-final-sigterm' timed out. Killing.
Apr 17 12:59:44 tools-sgegrid-master systemd[1]: gridengine-master.service: Killing process 25521 (sge_qmaster) with signal SIGKILL.
Apr 17 13:01:14 tools-sgegrid-master systemd[1]: gridengine-master.service: Processes still around after final SIGKILL. Entering failed mode.
Apr 17 13:01:14 tools-sgegrid-master systemd[1]: gridengine-master.service: Failed with result 'timeout'.
Apr 17 13:01:14 tools-sgegrid-master systemd[1]: Stopped SGE Master daemon.
Apr 17 13:01:45 tools-sgegrid-master systemd[1]: gridengine-master.service: Found left-over process 25521 (sge_qmaster) in control group while starting unit. Ignoring.
Apr 17 13:01:45 tools-sgegrid-master systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 17 13:01:45 tools-sgegrid-master systemd[1]: Started SGE Master daemon.
Apr 17 13:01:45 tools-sgegrid-master sge_qmaster[25756]: error: communication error for "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud/qmaster/1" running on port 6444: "can't bind socket"
Apr 17 13:01:45 tools-sgegrid-master sge_qmaster[25756]:   main|tools-sgegrid-master|E|communication error for "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud/qmaster/1" running on port 6444: "can't bind socket"
Apr 17 13:01:46 tools-sgegrid-master sge_qmaster[25756]: error: commlib error: can't bind socket (no additional information available)
Apr 17 13:01:46 tools-sgegrid-master sge_qmaster[25756]:   main|tools-sgegrid-master|E|commlib error: can't bind socket (no additional information available)

The process became a zombie:

root@tools-sgegrid-master:~# ps aux | grep 25521
sgeadmin 25521  1.9  0.0      0     0 ?        Zsl  Apr05 333:56 [sge_qmaster] <defunct>
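
The "Zsl" state says the main thread is a zombie but the process is multithreaded; the likely reason SIGKILL had no effect (an assumption, though consistent with the kernel trace above) is that remaining threads were stuck in uninterruptible NFS I/O, and such threads cannot be reaped. A sketch that would confirm this by listing per-thread states and kernel wait channels:

import os

pid = 25521  # the defunct sge_qmaster from the log above
for tid in sorted(os.listdir(f"/proc/{pid}/task"), key=int):
    with open(f"/proc/{pid}/task/{tid}/stat") as f:
        data = f.read()
    state = data[data.rindex(")") + 2]
    try:
        with open(f"/proc/{pid}/task/{tid}/wchan") as f:
            wchan = f.read() or "-"
    except OSError:
        wchan = "?"
    # Threads blocked on NFS would show state "D" and an nfs_* wait channel.
    print(tid, state, wchan)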

Mentioned in SAL (#wikimedia-cloud) [2023-04-17T13:10:33Z] <dcaro> rebooting tools-sgegrid-master node (T334847)

After rebooting the host, I manually started the master service:

root@tools-sgegrid-master:~# systemctl start gridengine-master.service

It started OK this time, and that brought everything back up.

tools-sgewebgen-10-3 was unresponsive too; I could not ssh to it myself, though andrew was able to and rebooted the host.

There was this log in dmesg:

root@tools-sgewebgen-10-3:~# dmesg -T
...
[Mon Apr 17 06:23:28 2023] Process accounting resumed
[Mon Apr 17 12:27:08 2023] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, timed out
[Mon Apr 17 14:04:35 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:35 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:35 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:35 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:43 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:43 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:43 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:43 2023] serial8250: too much work for irq4

Ran puppet on the console, and it failed with:

Error: '/usr/bin/timeout -k 5s 20s /bin/mkdir -p /mnt/nfs/labstore-secondary-tools-project' returned 124 instead of one of [0]
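
Exit code 124 is timeout(1)'s "deadline expired" status, so the mkdir on the NFS mount hung for the full 20 seconds. A sketch of the same liveness probe (the Puppet check itself lives elsewhere; this is just its shape):

import subprocess

cmd = ["/usr/bin/timeout", "-k", "5s", "20s",
       "/bin/mkdir", "-p", "/mnt/nfs/labstore-secondary-tools-project"]
result = subprocess.run(cmd)
if result.returncode == 124:
    print("NFS mount appears stuck: mkdir did not return within 20s")
elif result.returncode == 0:
    print("NFS mount responded")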

So everything points to NFS being stuck. Waiting for it to reboot.

dcaro moved this task from Backlog to Done on the cloud-services-team (FY2022/2023-Q4) board.