
New SGE nodes can't talk to the grid engine master
Closed, Resolved · Public

Description

I just created two new SGE nodes, tools-sgeexec-10-11/12. Both are showing similar errors when trying to talk to the grid master:

Jun 01 19:03:17 tools-sgeexec-10-11 gridengine-exec[2094]: error: got send timeout
Jun 01 19:03:17 tools-sgeexec-10-11 gridengine-exec[2094]: error: can't get configuration from qmaster -- backgrounding
Jun 01 19:03:17 tools-sgeexec-10-11 gridengine-exec[2094]: critical error: unable to write to file fd_pipe[1]: Broken pipe
Jun 01 19:03:17 tools-sgeexec-10-11 sge_execd[2234]:   main|tools-sgeexec-10-11|E|got send timeout
Jun 01 19:03:17 tools-sgeexec-10-11 sge_execd[2234]:   main|tools-sgeexec-10-11|E|can't get configuration from qmaster -- backgrounding
Jun 01 19:03:17 tools-sgeexec-10-11 sge_execd[2234]:   main|tools-sgeexec-10-11|C|unable to write to file fd_pipe[1]: Broken pipe

Jun 01 19:04:20 tools-sgeexec-10-11 sge_execd[2234]:   main|tools-sgeexec-10-11|E|getting configuration: unable to send message to qmaster using port 6444 on host "tools-sgegrid-master.tools.eqiad1.wikimed

This persists after service restarts and reboots. Running tcpdump on the grid master (tcpdump "port 6444" | grep tools-sgeexec-10-1) shows traffic from a working node (tools-sgeexec-10-10) but no traffic at all from the new nodes. tools-sgeexec-10-11 and -12 are in the correct 'execnode' security group.

Event Timeline

drive-by: can you confirm that this isn't a firewall/connectivity issue (e.g. with telnet)?

Not sure, although I'm leaning towards it being one. If I run echo foo | nc tools-sgegrid-master 6444 from any other host and then run a tcpdump on the SGE master, it shows packets being received. That doesn't happen when I run it from tools-sgeexec-10-11.
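The probe above can be sketched as a small reachability check. This is a generic sketch (it assumes bash with /dev/tcp support and the coreutils timeout command), not part of the actual debugging session; the hostname and port are the ones from this task:

```shell
#!/usr/bin/env bash
# TCP reachability probe: prints "open" if a connection to HOST:PORT
# succeeds within 5 seconds, "closed/filtered" otherwise.
check_port() {
  if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "open"
  else
    echo "closed/filtered"
  fi
}

check_port tools-sgegrid-master 6444   # qmaster port from this task
```

A working exec node should print "open"; the broken nodes would time out and print "closed/filtered", matching the tcpdump observation that their packets never reach the master.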

OK, this pretty much confirms it is a networking issue:

taavi@tools-sgegrid-master:~ $ curl tools-sgegrid-master:9100
<html>
taavi@tools-sgeexec-10-10:~ $ curl tools-sgegrid-master:9100
<html>
taavi@tools-sgeexec-10-11:~ $ curl --connect-timeout 5 tools-sgegrid-master:9100
curl: (28) Connection timed out after 5000 milliseconds

Mentioned in SAL (#wikimedia-cloud) [2022-06-02T07:51:47Z] <taavi> restart neutron-linuxbridge-agent.service on cloudvirt1034 T309732

Found this in the neutron logs on the cloudvirt that tools-sgegrid-master runs on:

May 29 14:51:23 cloudvirt1034 neutron-linuxbridge-agent[3730]: 2022-05-29 14:51:23.844 3730 ERROR oslo.messaging._drivers.impl_rabbit [-] [dc4d716a-767a-4ec4-b412-1651105ce32e] AMQP server on cloudcontrol1004.wikimedia.org:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: ConnectionRefusedError: [Errno 111] ECONNREFUSED
May 29 14:51:23 cloudvirt1034 neutron-linuxbridge-agent[3730]: 2022-05-29 14:51:23.962 3730 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
May 29 14:51:23 cloudvirt1034 neutron-linuxbridge-agent[3730]: Traceback (most recent call last):
May 29 14:51:23 cloudvirt1034 neutron-linuxbridge-agent[3730]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/hub.py", line 476, in fire_timers
May 29 14:51:23 cloudvirt1034 neutron-linuxbridge-agent[3730]:     timer()
May 29 14:51:23 cloudvirt1034 neutron-linuxbridge-agent[3730]:   File "/usr/lib/python3/dist-packages/eventlet/hubs/timer.py", line 59, in __call__
May 29 14:51:23 cloudvirt1034 neutron-linuxbridge-agent[3730]:     cb(*args, **kw)
May 29 14:51:23 cloudvirt1034 neutron-linuxbridge-agent[3730]:   File "/usr/lib/python3/dist-packages/eventlet/semaphore.py", line 152, in _do_acquire
May 29 14:51:23 cloudvirt1034 neutron-linuxbridge-agent[3730]:     waiter.switch()
May 29 14:51:23 cloudvirt1034 neutron-linuxbridge-agent[3730]: greenlet.error: cannot switch to a different thread

Restarted that service and things seem to be working much better.
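For future reference, the symptom/fix pairing can be sketched as a trivial log check. The log line is copied from the journal excerpt above; the grep-then-restart pairing is my summary of what was done here, not an official runbook, and on a real host you would feed it from journalctl -u neutron-linuxbridge-agent instead of a variable:

```shell
#!/usr/bin/env bash
# Stand-in for a journal excerpt from the affected cloudvirt.
LOG='2022-05-29 14:51:23.962 3730 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED'

# If the linuxbridge agent cannot reach RabbitMQ, it stops applying
# security-group/port changes for new VMs, which is the symptom seen here.
if printf '%s\n' "$LOG" | grep -q 'ECONNREFUSED'; then
  echo 'agent cannot reach RabbitMQ; try: systemctl restart neutron-linuxbridge-agent.service'
fi
```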

taavi claimed this task.