
toolsbeta.sgegrid: crons failing
Closed, Resolved (Public)

Description

There's a flood of emails like the following:

Date: Mon, 01 Mar 2021 09:18:01 +0000
From: Cron Daemon <root@tools.wmflabs.org>
To: root@tools.wmflabs.org
Subject: Cron <root@toolsbeta-sgegrid-master> /usr/local/bin/prometheus-sge-stats --outfile /var/lib/prometheus/node.d/sge.prom

error: commlib error: got select error (Connection refused)
unable to send message to qmaster using port 6444 on host "toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs": got send error
WARNING:__main__:Output from failed shell command ['/usr/bin/qconf', '-sql']:
Traceback (most recent call last):
  File "/usr/local/bin/prometheus-sge-stats", line 235, in <module>
    sys.exit(main())
  File "/usr/local/bin/prometheus-sge-stats", line 226, in main
    collect_sge_stats(registry)
  File "/usr/local/bin/prometheus-sge-stats", line 162, in collect_sge_stats
    for q in get_queues():
  File "/usr/local/bin/prometheus-sge-stats", line 92, in get_queues
    queues = grid_cmd(["/usr/bin/qconf", "-sql"])
  File "/usr/local/bin/prometheus-sge-stats", line 81, in grid_cmd
    raise e
  File "/usr/local/bin/prometheus-sge-stats", line 75, in grid_cmd
    cmd, env={"SGE_ROOT": SGE_ROOT}, universal_newlines=True
  File "/usr/lib/python3.5/subprocess.py", line 316, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.5/subprocess.py", line 398, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/usr/bin/qconf', '-sql']' returned non-zero exit status 1

Investigate and triage/fix
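
One way the cron job could fail more quietly when qmaster is unreachable is to catch the error in main() and return a
non-zero exit code with a single-line message instead of a full traceback. The following is a minimal sketch, not the
actual script: the function names grid_cmd, get_queues and main are taken from the traceback, but their bodies, the
SGE_ROOT value and the injectable `run` parameter are assumptions made for illustration. Cron still mails any output,
so this only shortens the message; it does not stop the emails.

```python
import subprocess
import sys

# Placeholder value: the real script defines its own SGE_ROOT.
SGE_ROOT = "/var/lib/gridengine"


def grid_cmd(cmd, run=subprocess.check_output):
    """Run a grid engine command and return its stdout as text."""
    return run(cmd, env={"SGE_ROOT": SGE_ROOT}, universal_newlines=True)


def get_queues(run=subprocess.check_output):
    """Return the queue names printed by `qconf -sql`, split on whitespace."""
    return grid_cmd(["/usr/bin/qconf", "-sql"], run=run).split()


def main(run=subprocess.check_output):
    """Collect queue names; return 0 on success, 1 if the command fails."""
    try:
        queues = get_queues(run=run)
    except (subprocess.CalledProcessError, OSError) as e:
        # qmaster unreachable or qconf missing: one line, no traceback.
        print("sge stats skipped: {}".format(e), file=sys.stderr)
        return 1
    print("found {} queue(s)".format(len(queues)))
    return 0
```

The `run` parameter exists only so the failure path can be exercised without a grid; a real entry point would call
sys.exit(main()) as the traceback shows.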

Event Timeline

dcaro triaged this task as High priority. Mar 1 2021, 9:20 AM
dcaro created this task.

Current theory is that a backlog of stuck emails (the oldest from Feb 26th) got flushed, caused by a server reboot on
the same date (https://sal.toolforge.org/log/fg8R4HcBgTbpqNOmDYXN). Coupled with the server not starting the
sge_qmaster daemon by default, this explains both why a bunch of error emails arrived suddenly and why new error emails
kept being sent. Started the sge_qmaster service and everything seemed to go back to normal:

root@toolsbeta-sgegrid-master:~# sudo systemctl status sge_qmaster
● gridengine-master.service - SGE Master daemon
   Loaded: loaded (/lib/systemd/system/gridengine-master.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Fri 2021-02-26 20:44:26 UTC; 2 days ago
  Process: 735 ExecStart=/usr/lib/gridengine/sge_qmaster (code=exited, status=216/GROUP)
 Main PID: 735 (code=exited, status=216/GROUP)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

root@toolsbeta-sgegrid-master:~# sudo systemctl start sge_qmaster

root@toolsbeta-sgegrid-master:~# sudo systemctl status sge_qmaster
● gridengine-master.service - SGE Master daemon
   Loaded: loaded (/lib/systemd/system/gridengine-master.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-03-01 09:22:09 UTC; 1s ago
 Main PID: 26108 (sge_qmaster)
    Tasks: 5 (limit: 4915)
   CGroup: /system.slice/gridengine-master.service
           └─26108 /usr/lib/gridengine/sge_qmaster

Mar 01 09:22:09 toolsbeta-sgegrid-master systemd[1]: Started SGE Master daemon.
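
The manual check-and-start done above could be scripted roughly as follows. This is a sketch under assumptions: the
unit name is copied from the status output, the helper names is_active and ensure_running are hypothetical, and the
`run` parameter is only there so the logic can be tested without systemd. It does not address why the unit failed in
the first place (status=216/GROUP, which systemd documents as a problem determining or changing group credentials).

```python
import subprocess

# Unit name as shown in the `systemctl status` output.
UNIT = "gridengine-master.service"


def is_active(unit, run=subprocess.run):
    """Return True if `systemctl is-active --quiet UNIT` exits with 0."""
    result = run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def ensure_running(unit, run=subprocess.run):
    """Start the unit if it is not active.

    Returns "already-active", "started", or "failed".
    """
    if is_active(unit, run=run):
        return "already-active"
    run(["systemctl", "start", unit])
    # Re-check, since `start` can succeed and the daemon exit right after.
    return "started" if is_active(unit, run=run) else "failed"
```

Calling ensure_running(UNIT) as root would mirror the status/start/status sequence shown in the console output.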