toolsbeta grid is down
Closed, Resolved · Public

Description

The toolsbeta grid is down. We need to fix it so that it can serve its purpose as a staging/development environment for tools proper.

toolsbeta.automated-toolforge-tests@toolsbeta-sgebastion-05:~$ qstat
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "toolsbeta-sgegrid-shadow.toolsbeta.eqiad1.wikimedia.cloud": got send error
arturo@nostromo:~$ cookbook wmcs.toolforge.grid.get_cluster_status --project toolsbeta
START - Cookbook wmcs.toolforge.grid.get_cluster_status
PASS |                                                                                                                       |   0% (0/1) [00:07<?, ?hosts/s]
FAIL |███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:07<00:00,  7.19s/hosts]
Exception raised while executing cookbook wmcs.toolforge.grid.get_cluster_status:
Traceback (most recent call last):
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/home/arturo/git/wmf/operations/cookbooks/cookbooks/wmcs/toolforge/grid/get_cluster_status.py", line 99, in run
    nodes_info = self.grid_controller.get_nodes_info()
  File "/home/arturo/git/wmf/operations/cookbooks/cookbooks/wmcs/libs/grid.py", line 314, in get_nodes_info
    xml_output = run_one_raw(node=self._master_node, command=["qhost", "-q", "-xml"], print_output=False)
  File "/home/arturo/git/wmf/operations/cookbooks/cookbooks/wmcs/libs/common.py", line 423, in run_one_raw
    result = next(node.run_sync(command, **kwargs))
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/remote.py", line 520, in run_sync
    return self._execute(
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/remote.py", line 720, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed")
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook wmcs.toolforge.grid.get_cluster_status (exit_code=99)
arturo@nostromo:~$ cookbook wmcs.toolforge.tests --bastion-hostname toolsbeta-sgebastion-05 --project toolsbeta
START - Cookbook wmcs.toolforge.tests
----- OUTPUT of 'sudo -i cmd-chec...forge-tests.yaml' -----
[2022-09-28 09:12:34] INFO: --- toolsbeta-sgebastion-05 Debian GNU/Linux 10 (buster) 4.19.0-19-cloud-amd64
[2022-09-28 09:12:34] INFO: ---
[...]
[2022-09-28 09:20:01] INFO: ---
[2022-09-28 09:20:01] INFO: --- passed tests: 9
[2022-09-28 09:20:01] INFO: --- failed tests: 11
[2022-09-28 09:20:01] INFO: --- total tests: 20
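
The summary lines at the end of the test run can be checked mechanically rather than by eye. The following is only a sketch: the helper names `parse_test_summary` and `grid_tests_ok` are made up for illustration, and the regular expression assumes the summary always uses the `INFO: --- <kind> tests: <n>` format seen in the output above.

```python
import re

# Matches summary lines such as
# "[2022-09-28 09:20:01] INFO: --- passed tests: 9"
# (format assumed from the output pasted in this task).
SUMMARY_RE = re.compile(r"INFO: --- (passed|failed|total) tests: (\d+)")


def parse_test_summary(output: str) -> dict:
    """Return the passed/failed/total counters found in the test output."""
    return {kind: int(count) for kind, count in SUMMARY_RE.findall(output)}


def grid_tests_ok(output: str) -> bool:
    """True only when a summary is present and it reports zero failures."""
    summary = parse_test_summary(output)
    return "failed" in summary and summary["failed"] == 0


sample = """\
[2022-09-28 09:20:01] INFO: --- passed tests: 9
[2022-09-28 09:20:01] INFO: --- failed tests: 11
[2022-09-28 09:20:01] INFO: --- total tests: 20
"""
print(parse_test_summary(sample))
print(grid_tests_ok(sample))
```

A missing summary is treated as a failure on purpose, so that truncated output is not mistaken for a clean run.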

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-cloud) [2022-09-28T09:48:44Z] <arturo> manually starting gridengine-master.service on toolsbeta-sgegrid-master (T318788)

aborrero triaged this task as Medium priority. Sep 28 2022, 9:48 AM

Making sure the grid is configured correctly:

aborrero@toolsbeta-sgegrid-master:~ $ sudo grid-configurator --all-domains --beta
2022-09-28 09:47:37,190 WARNING unknown file prefix: /data/project/.system_sge/store/hostkey-toolsbeta-sgecron-01.toolsbeta.eqiad.wmflabs, we only know {'submithost', 'execnode'}
2022-09-28 09:47:37,191 WARNING unknown file prefix: /data/project/.system_sge/store/hostkey-toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs, we only know {'submithost', 'execnode'}
2022-09-28 09:47:37,192 WARNING unknown file prefix: /data/project/.system_sge/store/hostkey-toolsbeta-sgegrid-shadow.toolsbeta.eqiad.wmflabs, we only know {'submithost', 'execnode'}
2022-09-28 09:47:37,727 WARNING command 'qconf -ds toolsbeta-sgecron-01.toolsbeta.eqiad.wmflabs' generated stderr: 'root@toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud removed "toolsbeta-sgecron-01.toolsbeta.eqiad.wmflabs" from submit host list'
2022-09-28 09:47:37,728 INFO removing /data/project/.system_sge/gridengine/etc/hosts/toolsbeta-sgecron-01.toolsbeta.eqiad1.wikimedia.cloud, 'toolsbeta-sgecron-01' is not a VM
aborrero@toolsbeta-sgegrid-master:~ $ sudo rm /data/project/.system_sge/store/hostkey-toolsbeta-sgecron-01.toolsbeta.eqiad.wmflabs
aborrero@toolsbeta-sgegrid-master:~ $ sudo rm /data/project/.system_sge/store/hostkey-toolsbeta-sgegrid-master.toolsbeta.eqiad.wmflabs
aborrero@toolsbeta-sgegrid-master:~ $ sudo rm /data/project/.system_sge/store/hostkey-toolsbeta-sgegrid-shadow.toolsbeta.eqiad.wmflabs
aborrero@toolsbeta-sgegrid-master:~ $ sudo grid-configurator --all-domains --beta
aborrero claimed this task.

For reasons not yet identified, gridengine-master.service was down on toolsbeta-sgegrid-master. Starting it manually brought the grid back into working order.
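
To confirm that qmaster is accepting connections again, one option is to probe its TCP port (6444, per the qstat error in the description). The snippet below is a minimal sketch, not part of the cookbooks: `qmaster_reachable` is a hypothetical helper, and a successful connect only shows that something is listening on the port, not that the grid is healthy.

```python
import socket


def qmaster_reachable(host: str, port: int = 6444, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    The default port 6444 is the one named in the qstat error; the
    timeout value is an arbitrary choice for this sketch.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False
```

For example, `qmaster_reachable("toolsbeta-sgegrid-master.toolsbeta.eqiad1.wikimedia.cloud")` could be run from a bastion after starting the service. A `False` result corresponds to the "Connection refused" that qstat reported.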