
Grid job submission failing on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
Closed, Resolved | Public | BUG REPORT

Description

Multiple tools report errors for job submissions. Example:

Cron <tools.heritage@tools-sgecron-2> jsub -mem 200m -release buster -once -j y -o /data/project/heritage/logs/check_emailable_users.log -N check_emailable_users /data/project/heritage/bin/run_erfgoedbot_script.sh erfgoedbot/check_emailable_users.py -category:"Images_from_Wiki_Loves_Monuments_2022" -delta:4 -notify >> /data/project/heritage/logs/crontab.log

error: unable to send message to qmaster using port 6444 on host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud": got send timeout
Traceback (most recent call last):
  File "/usr/bin/job", line 52, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 1316, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
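
The ParseError at "line 1, column 0" is what ElementTree raises when it is handed empty input, which fits the qmaster timeout leaving nothing on stdout. The following is a minimal sketch of how a caller could guard against that. The helper name parse_xml_or_none is hypothetical and is not taken from /usr/bin/job, whose source is not shown here.

```python
import xml.etree.ElementTree as ET


def parse_xml_or_none(raw):
    """Parse XML text or bytes; return None if it is empty or malformed.

    Hypothetical helper for illustration only.
    """
    # A timed-out command typically leaves stdout empty.
    if not raw or not raw.strip():
        return None
    try:
        return ET.fromstring(raw)
    except ET.ParseError:
        # Truncated or otherwise invalid XML.
        return None


if __name__ == "__main__":
    root = parse_xml_or_none(b"")
    if root is None:
        print("no usable XML; the grid master may be unreachable")
```

With a check like this, the caller can print a short message about the grid master being unreachable instead of surfacing a traceback.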

Event Timeline

Yes it is, it should be fixed already, can you retry?

@dcaro I'm still seeing issues when working with the toolforge-jobs API:

tools.patrocle@tools-sgebastion-10:~$ toolforge-jobs list -l
WARNING: the `--long` flag is deprecated, use `--output long` instead
ERROR: An internal error occured while executing this command.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tjf_cli/api.py", line 167, in _make_request
    response.raise_for_status()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/api/v1/list/

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tjf_cli/api.py", line 60, in _make_http_error
    json = original.response.json()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3/dist-packages/simplejson/__init__.py", line 518, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 370, in decode
    obj, end = self.raw_decode(s)
  File "/usr/lib/python3/dist-packages/simplejson/decoder.py", line 400, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 645, in main
    run_subcommand(args=args, api=api)
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 601, in run_subcommand
    op_list(api, output_format)
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 310, in op_list
    list = _list_jobs(api)
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 305, in _list_jobs
    list = api.get("/list/").json()
  File "/usr/lib/python3/dist-packages/tjf_cli/api.py", line 175, in get
    return self._make_request("GET", url_path, **kwargs)
  File "/usr/lib/python3/dist-packages/tjf_cli/api.py", line 170, in _make_request
    new_error = _make_http_error(e)
  File "/usr/lib/python3/dist-packages/tjf_cli/api.py", line 72, in _make_http_error
    except requests.exceptions.InvalidJSONError:
AttributeError: module 'requests.exceptions' has no attribute 'InvalidJSONError'
ERROR: Please report this issue to the Toolforge admins: https://w.wiki/6Zuu
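
The last traceback shows the error handler itself failing: the installed requests package has no InvalidJSONError attribute, so the except clause raises AttributeError and hides the original 502. One portable option, sketched below, is to catch ValueError, since the decode errors raised by both the standard json module and simplejson subclass it. The function name and the "message" key are assumptions for illustration, not the actual tjf_cli code or API schema.

```python
def error_message_from_response(response):
    """Build a readable message from a failed HTTP response.

    `response` is any object with `status_code`, `text` and `json()`,
    for example a requests.Response. Hypothetical helper.
    """
    try:
        body = response.json()
    except ValueError:
        # json.JSONDecodeError and simplejson's JSONDecodeError are both
        # ValueError subclasses, so no version-specific name is needed.
        snippet = (response.text or "").strip()[:200]
        return "HTTP {}: {}".format(
            response.status_code, snippet or "(empty body)"
        )
    # Assumed schema: error details under a "message" key.
    if isinstance(body, dict) and "message" in body:
        return "HTTP {}: {}".format(response.status_code, body["message"])
    return "HTTP {}: {}".format(response.status_code, body)
```

For a 502 with a non-JSON body, this returns a short "HTTP 502: ..." string rather than a chained traceback.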

Yes it is, it should be fixed already, can you retry?

Retry? These are all failing cron emails. I had about 40 of these. Last failure was 19:05 (Amsterdam time) and that's a couple of hours ago. I see some jobs running so things seem to be coming back.

@dcaro I'm still seeing issues when working with the toolforge-jobs API:

tools.patrocle@tools-sgebastion-10:~$ toolforge-jobs list -l
WARNING: the `--long` flag is deprecated, use `--output long` instead
ERROR: An internal error occured while executing this command.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tjf_cli/api.py", line 167, in _make_request
    response.raise_for_status()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 502 Server Error: Bad Gateway for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/api/v1/list/

The Toolforge Jobs framework is backed by the Kubernetes cluster, not the grid engine that @Multichill was reporting issues with in this ticket. That said, the error message shows the load balancer for the API service behind the toolforge-jobs command failing to connect to a backend.
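
Because a 502 comes from the proxy rather than the API itself, a client can treat it as possibly transient and retry a few times before giving up. The sketch below assumes a hypothetical zero-argument fetch callable returning a (status_code, body) pair; it is not part of toolforge-jobs. A retry only helps with brief outages and would not have worked around a backend that stayed down until it was restarted.

```python
import time

# Status codes that usually mean a proxy could not reach its backend.
GATEWAY_ERRORS = {502, 503, 504}


def call_with_retry(fetch, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-gateway status or attempts run out.

    `fetch` is a hypothetical callable returning (status_code, body).
    The wait doubles after each failed attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        status, body = fetch()
        if status not in GATEWAY_ERRORS or attempt == attempts:
            # Success, a non-retryable error, or the final attempt.
            return status, body
        sleep(delay * 2 ** (attempt - 1))
```

The sleep parameter is injectable so the backoff can be exercised without actually waiting.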

@dcaro restarted part of this service and it seems to be back to working now.

Multichill claimed this task.