There's a flood of emails like the following:
Date: Mon, 01 Mar 2021 09:18:01 +0000
From: Cron Daemon <root@tools.wmflabs.org>
To: root@tools.wmflabs.org
Subject: Cron <root@toolsbeta-sgegrid-master> /usr/local/bin/prometheus-sge-stats --outfile /var/lib/prometheus/node.d/sge.prom

error: commlib error: got select error (Connection refused)
unable to send message to qmaster using port 6444 on host "toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs": got send error
WARNING:__main__:Output from failed shell command ['/usr/bin/qconf', '-sql']:
Traceback (most recent call last):
  File "/usr/local/bin/prometheus-sge-stats", line 235, in <module>
    sys.exit(main())
  File "/usr/local/bin/prometheus-sge-stats", line 226, in main
    collect_sge_stats(registry)
  File "/usr/local/bin/prometheus-sge-stats", line 162, in collect_sge_stats
    for q in get_queues():
  File "/usr/local/bin/prometheus-sge-stats", line 92, in get_queues
    queues = grid_cmd(["/usr/bin/qconf", "-sql"])
  File "/usr/local/bin/prometheus-sge-stats", line 81, in grid_cmd
    raise e
  File "/usr/local/bin/prometheus-sge-stats", line 75, in grid_cmd
    cmd, env={"SGE_ROOT": SGE_ROOT}, universal_newlines=True
  File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/bin/qconf', '-sql']' returned non-zero exit status 1
Investigate (the qmaster on toolsbeta-sgegrid-master appears to be refusing connections on port 6444), then triage or fix.
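
One possible script-side mitigation, as a rough sketch only: catch the failed `qconf` call instead of letting the traceback reach cron. This is not the actual prometheus-sge-stats source. The names `grid_cmd` and `get_queues` are taken from the traceback above, but their bodies here are guesses, and the `SGE_ROOT` value is a placeholder assumption. The sketch also assumes cron mails only when a job writes output, so it captures stderr and logs at debug level.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

# Placeholder assumption: the real SGE_ROOT used by the script is not shown
# in the email above.
SGE_ROOT = "/var/lib/gridengine"


def grid_cmd(cmd, sge_root=SGE_ROOT):
    """Run a grid engine command and return its output, or None on failure.

    stderr is merged into the captured output so that qconf's own error
    text does not leak into the cron job's output.
    """
    try:
        return subprocess.check_output(
            cmd,
            env={"SGE_ROOT": sge_root},
            universal_newlines=True,
            stderr=subprocess.STDOUT,
        )
    except (subprocess.CalledProcessError, OSError) as e:
        # Debug level: with the default logging configuration this prints
        # nothing, so an unreachable qmaster no longer produces a traceback.
        logger.debug("command %r failed: %s", cmd, e)
        return None


def get_queues():
    """Return the queue names from `qconf -sql`, or [] if it failed."""
    output = grid_cmd(["/usr/bin/qconf", "-sql"])
    if output is None:
        return []
    return output.split()
```

The trade-off is that a dead qmaster becomes silent in email, so it would need to be caught some other way, for example by alerting on stale or missing metrics. The underlying "Connection refused" still needs its own investigation.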