
[wmcs.gridengine] syncron message receive errors
Closed, Resolved · Public

Description

We are getting timeouts generally on the grid, making jobs fail and the master unresponsive.

Errors on the jobs look like:

error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
2023-04-17T12:46:05+00:00 Restarting job 'test-long-running-stretch' ('/data/project/toolschecker/bin/long-running.sh')
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
Traceback (most recent call last):
  File "/usr/bin/job", line 52, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 1316, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
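
The ParseError is a side effect: when the master times out, the SGE client that /usr/bin/job shells out to writes the syncron error to stderr and nothing to stdout, so ElementTree gets an empty string. A minimal sketch of a more defensive wrapper, assuming it parses qstat -xml output (the exact command is an assumption, not taken from this task):

#!/usr/bin/env python3
# Sketch of a more defensive /usr/bin/job-style wrapper.  Assumption:
# it parses XML produced by an SGE client such as "qstat -xml"; when
# the master times out, the client prints the syncron error to stderr
# and nothing to stdout, so ElementTree sees an empty string and
# raises "no element found: line 1, column 0".
import subprocess
import sys
import xml.etree.ElementTree as ET

proc = subprocess.run(
    ["qstat", "-xml"],  # assumed client command, not taken from the task
    capture_output=True,
    text=True,
)

if proc.returncode != 0 or not proc.stdout.strip():
    # Surface the real grid error instead of an XML traceback.
    sys.exit(f"grid master did not answer: {proc.stderr.strip()}")

root = ET.fromstring(proc.stdout)
print(f"parsed {len(root)} top-level elements from qstat output")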

On the master, the logs show:

Apr 17 12:27:42 tools-sgegrid-master kernel: INFO: task sge_qmaster:25531 blocked for more than 120 seconds.
Apr 17 12:27:42 tools-sgegrid-master kernel:       Not tainted 4.19.0-23-cloud-amd64 #1 Debian 4.19.269-1
Apr 17 12:27:42 tools-sgegrid-master kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 17 12:27:42 tools-sgegrid-master kernel: sge_qmaster     D    0 25531      1 0x00000000
Apr 17 12:27:42 tools-sgegrid-master kernel: Call Trace:
Apr 17 12:27:42 tools-sgegrid-master kernel:  __schedule+0x29f/0x840
Apr 17 12:27:42 tools-sgegrid-master kernel:  schedule+0x28/0x80
Apr 17 12:27:42 tools-sgegrid-master kernel:  io_schedule+0x12/0x40
Apr 17 12:27:42 tools-sgegrid-master kernel:  __lock_page+0x118/0x170
Apr 17 12:27:42 tools-sgegrid-master kernel:  ? page_cache_tree_insert+0xe0/0xe0
Apr 17 12:27:42 tools-sgegrid-master kernel:  pagecache_get_page+0x19a/0x2c0
Apr 17 12:27:42 tools-sgegrid-master kernel:  grab_cache_page_write_begin+0x1f/0x40
Apr 17 12:27:42 tools-sgegrid-master kernel:  nfs_write_begin+0x63/0x310 [nfs]
Apr 17 12:27:42 tools-sgegrid-master kernel:  generic_perform_write+0xfd/0x1c0
Apr 17 12:27:42 tools-sgegrid-master kernel:  nfs_file_write+0xdc/0x210 [nfs]
Apr 17 12:27:42 tools-sgegrid-master kernel:  new_sync_write+0xfb/0x160
Apr 17 12:27:42 tools-sgegrid-master kernel:  vfs_write+0xa5/0x1a0
Apr 17 12:27:42 tools-sgegrid-master kernel:  ksys_pwrite64+0x61/0xa0
Apr 17 12:27:42 tools-sgegrid-master kernel:  do_syscall_64+0x50/0xf0
Apr 17 12:27:42 tools-sgegrid-master kernel:  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Apr 17 12:27:42 tools-sgegrid-master kernel: RIP: 0033:0x7f8fec9f6edf
Apr 17 12:27:42 tools-sgegrid-master kernel: Code: Bad RIP value.
Apr 17 12:27:42 tools-sgegrid-master kernel: RSP: 002b:00007f8fe33fbad0 EFLAGS: 00000297 ORIG_RAX: 0000000000000012
Apr 17 12:27:42 tools-sgegrid-master kernel: RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f8fec9f6edf
Apr 17 12:27:42 tools-sgegrid-master kernel: RDX: 000000000000128a RSI: 00007f8fec270580 RDI: 000000000000000b
Apr 17 12:27:42 tools-sgegrid-master kernel: RBP: 00007f8fec270580 R08: 0000000000000001 R09: 00007f8fe77ff900
Apr 17 12:27:42 tools-sgegrid-master kernel: R10: 00000000003daa35 R11: 0000000000000297 R12: 000000000000128a
Apr 17 12:27:42 tools-sgegrid-master kernel: R13: 00000000003daa35 R14: 0000000000000002 R15: 000000000000128a
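
The trace shows sge_qmaster stuck in uninterruptible sleep inside the NFS write path (nfs_write_begin), which is what triggers the hung-task warning. A rough sketch for spotting such tasks, assuming root access; this is an illustration, not a tool that was run here:

#!/usr/bin/env python3
# Sketch: list processes in uninterruptible sleep (state "D"), the
# state the hung-task watchdog reports for sge_qmaster above, plus
# the kernel wait channel (wchan usually needs root to read).
import os

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
        # comm is wrapped in parentheses and may contain spaces.
        comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        state = stat[stat.rindex(")") + 2]
        if state == "D":
            with open(f"/proc/{pid}/wchan") as f:
                wchan = f.read().strip() or "?"
            print(f"{pid:>7} {comm:<20} waiting in {wchan}")
    except (FileNotFoundError, PermissionError, ProcessLookupError):
        continue  # process exited or file not readable; skip it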

Right after the first error, looking

Event Timeline

dcaro triaged this task as High priority. Apr 17 2023, 1:01 PM
dcaro created this task.

Trying to stop and start the process did not help; the process did not go away:

Apr 17 12:56:43 tools-sgegrid-master systemd[1]: gridengine-master.service: State 'stop-sigterm' timed out. Killing.
Apr 17 12:56:43 tools-sgegrid-master systemd[1]: gridengine-master.service: Killing process 25521 (sge_qmaster) with signal SIGKILL.
Apr 17 12:58:13 tools-sgegrid-master systemd[1]: gridengine-master.service: Processes still around after SIGKILL. Ignoring.
Apr 17 12:59:44 tools-sgegrid-master systemd[1]: gridengine-master.service: State 'stop-final-sigterm' timed out. Killing.
Apr 17 12:59:44 tools-sgegrid-master systemd[1]: gridengine-master.service: Killing process 25521 (sge_qmaster) with signal SIGKILL.
Apr 17 13:01:14 tools-sgegrid-master systemd[1]: gridengine-master.service: Processes still around after final SIGKILL. Entering failed mode.
Apr 17 13:01:14 tools-sgegrid-master systemd[1]: gridengine-master.service: Failed with result 'timeout'.
Apr 17 13:01:14 tools-sgegrid-master systemd[1]: Stopped SGE Master daemon.
Apr 17 13:01:45 tools-sgegrid-master systemd[1]: gridengine-master.service: Found left-over process 25521 (sge_qmaster) in control group while starting unit. Ignoring.
Apr 17 13:01:45 tools-sgegrid-master systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Apr 17 13:01:45 tools-sgegrid-master systemd[1]: Started SGE Master daemon.
Apr 17 13:01:45 tools-sgegrid-master sge_qmaster[25756]: error: communication error for "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud/qmaster/1" running on port 6444: "can't bind socket"
Apr 17 13:01:45 tools-sgegrid-master sge_qmaster[25756]:   main|tools-sgegrid-master|E|communication error for "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud/qmaster/1" running on port 6444: "can't bind socket"
Apr 17 13:01:46 tools-sgegrid-master sge_qmaster[25756]: error: commlib error: can't bind socket (no additional information available)
Apr 17 13:01:46 tools-sgegrid-master sge_qmaster[25756]:   main|tools-sgegrid-master|E|commlib error: can't bind socket (no additional information available)
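
The bind failure is expected: the old qmaster (pid 25521) could not be killed and still held port 6444, so the freshly started process (pid 25756) could not bind it. A small sketch, using port 6444 from the log above, to confirm the port is actually free before retrying the start; the rest is an assumed illustration:

#!/usr/bin/env python3
# Sketch: check that the qmaster port is free before restarting the
# service, to avoid the "can't bind socket" loop above.  Port 6444 is
# taken from the log; everything else is an assumed illustration.
import socket
import sys

QMASTER_PORT = 6444

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    try:
        sock.bind(("0.0.0.0", QMASTER_PORT))
    except OSError as err:
        sys.exit(f"port {QMASTER_PORT} is still held (old qmaster?): {err}")

print(f"port {QMASTER_PORT} is free; safe to start gridengine-master")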

The process became a zombie:

root@tools-sgegrid-master:~# ps aux | grep 25521
sgeadmin 25521  1.9  0.0      0     0 ?        Zsl  Apr05 333:56 [sge_qmaster] <defunct>

Mentioned in SAL (#wikimedia-cloud) [2023-04-17T13:10:33Z] <dcaro> rebooting tools-sgegrid-master node (T334847)

After rebooting the host, I manually started the master service:

root@tools-sgegrid-master:~# systemctl start gridengine-master.service

It started ok this time, and that brought everything back up.

tools-sgewebgen-10-3 was non-responsive too; I could not ssh to it myself, though andrew was able to and rebooted the host.

There was this log in dmesg:

root@tools-sgewebgen-10-3:~# dmesg -T
...
[Mon Apr 17 06:23:28 2023] Process accounting resumed
[Mon Apr 17 12:27:08 2023] nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, timed out
[Mon Apr 17 14:04:35 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:35 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:35 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:35 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:43 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:43 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:43 2023] serial8250: too much work for irq4
[Mon Apr 17 14:04:43 2023] serial8250: too much work for irq4

Ran puppet on the console and it failed with:

Error: '/usr/bin/timeout -k 5s 20s /bin/mkdir -p /mnt/nfs/labstore-secondary-tools-project' returned 124 instead of one of [0]

So this points to NFS being stuck. Waiting for it to reboot.
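
The puppet failure is effectively an NFS liveness probe: timeout exits 124 when the mkdir on the mount never returns. The same check can be scripted; a sketch assuming the mount path from the error above, with a read-only stat instead of mkdir:

#!/usr/bin/env python3
# Sketch: probe the NFS mount with a hard deadline, mirroring the
# failing puppet exec ("timeout -k 5s 20s /bin/mkdir -p ..." exiting
# 124).  The mount path is from the error above; using a read-only
# stat instead of mkdir is an assumption for illustration.
import subprocess

MOUNT = "/mnt/nfs/labstore-secondary-tools-project"

probe = subprocess.run(
    ["timeout", "-k", "5s", "20s", "stat", MOUNT],
    capture_output=True,
    text=True,
)

if probe.returncode == 124:
    print(f"NFS mount {MOUNT} looks stuck: probe timed out")
elif probe.returncode != 0:
    print(f"probe failed: {probe.stderr.strip()}")
else:
    print(f"NFS mount {MOUNT} is responding")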

dcaro moved this task from Backlog to Done on the cloud-services-team (FY2022/2023-Q4) board.