
[jobs-api] Getting errors when listing jobs
Closed, Resolved (Public)

Description

For the tool listeria, when listing jobs we get intermittent errors.

An example:

On the cli:

tools.listeria@tools-sgebastion-10:~$ toolforge jobs list
ERROR: An internal error occured while executing this command.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 117, in _make_request
    response.raise_for_status()
  File "/usr/lib/python3/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/jobs/api/v1/jobs/

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 797, in main
    run_subcommand(args=args, api=api)
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 731, in run_subcommand
    op_list(api, output_format)
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 366, in op_list
    list = _list_jobs(api)
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 362, in _list_jobs
    return api.get("/jobs/")
  File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 146, in get
    return self._make_request("GET", url, **kwargs).json()
  File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 130, in _make_request
    raise self.exception_handler(e) from e
tjf_cli.api.TjfCliHttpError: Internal Server Error
ERROR: Please report this issue to the Toolforge admins: https://w.wiki/6Zuu

On the API logs:

172.16.1.69 - - [22/Feb/2024:10:17:51 +0000] "GET /healthz HTTP/1.1" 200 2 "-" "kube-probe/1.23"
    for entry in self._get_pod_logs(
  File "/opt/lib/poetry/tjf-9TtSrW0h-py3.9/lib/python3.9/site-packages/toolforge_weld/logs/kubernetes.py", line 41, in _get_pod_logs
    datetime, message = line.split(" ", 1)
ValueError: not enough values to unpack (expected 2, got 1)
[2024-02-22 10:13:06 +0000] [18108] [ERROR] Error handling request
Traceback (most recent call last):
  File "/opt/lib/poetry/tjf-9TtSrW0h-py3.9/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 183, in handle_request
    for item in respiter:
  File "/opt/lib/poetry/tjf-9TtSrW0h-py3.9/lib/python3.9/site-packages/werkzeug/wsgi.py", line 256, in __next__
    return self._next()
  File "/opt/lib/poetry/tjf-9TtSrW0h-py3.9/lib/python3.9/site-packages/werkzeug/wrappers/response.py", line 32, in _iter_encoded
    for item in iterable:
  File "/app/tjf/api/logs.py", line 33, in format_logs
    for entry in logs:
  File "/opt/lib/poetry/tjf-9TtSrW0h-py3.9/lib/python3.9/site-packages/toolforge_weld/logs/kubernetes.py", line 66, in query
    for entry in self._get_pod_logs(
  File "/opt/lib/poetry/tjf-9TtSrW0h-py3.9/lib/python3.9/site-packages/toolforge_weld/logs/kubernetes.py", line 41, in _get_pod_logs
    datetime, message = line.split(" ", 1)
ValueError: not enough values to unpack (expected 2, got 1)
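The ValueError above comes from unpacking `line.split(" ", 1)` into two names when a log line contains no space. A minimal sketch of a more tolerant split follows, assuming (as the linked toolforge-weld merge request title suggests) that a default date is acceptable when a line arrives without one. The function name `parse_log_lines` and its `default_timestamp` parameter are hypothetical and are not the actual toolforge-weld code.

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional, Tuple


def parse_log_lines(
    lines: Iterable[str], default_timestamp: Optional[str] = None
) -> Iterator[Tuple[str, str]]:
    """Yield (timestamp, message) pairs from '<timestamp> <message>' lines.

    Hypothetical helper: lines without a space separator are kept whole as
    the message and given a default timestamp instead of raising ValueError.
    """
    for line in lines:
        # partition() always returns three parts, so unpacking cannot fail.
        timestamp, separator, message = line.partition(" ")
        if not separator:
            # No space found: the line carries no recognisable date prefix.
            timestamp = default_timestamp or datetime.now(timezone.utc).isoformat()
            message = line
        yield timestamp, message
```

With this shape, a line such as `"orphan"` yields the default timestamp and the full line as the message, whereas `"orphan".split(" ", 1)` returns a single-element list and fails on unpacking.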

There is also a second error, which I have not reproduced yet, so I don't have the API logs for it:

On the cli:

Job name:     Job type:             Status:
------------  --------------------  ----------------------------------------
update-wikis  schedule: 17 * * * *  Last schedule time: 2024-02-22T09:17:00Z
rustbot       continuous            Running
  File "/usr/lib/python3/dist-packages/tjf_cli/cli.py", line 528, in op_logs
    params=params,
  File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 167, in get_raw_lines
    **kwargs,
  File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 130, in _make_request
    raise self.exception_handler(e) from e
  File "/usr/lib/python3/dist-packages/tjf_cli/api.py", line 59, in handle_http_exception
    except requests.exceptions.InvalidJSONError:
AttributeError: module 'requests.exceptions' has no attribute 'InvalidJSONError'
11:12:33 ERROR: Please report this issue to the Toolforge admins: https://w.wiki/6Zuu
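The AttributeError above is raised while evaluating the `except` clause itself, presumably because the `requests` version installed on the bastion does not define `InvalidJSONError`. One possible way to guard against that is to look the exception classes up defensively, as sketched below. `existing_exceptions` and `classify` are hypothetical helpers, not part of tjf_cli, and the stand-in namespace only simulates a `requests.exceptions` module that lacks the attribute.

```python
import types
from typing import Iterable, Tuple


def existing_exceptions(module: object, names: Iterable[str]) -> Tuple[type, ...]:
    """Return the exception classes from `names` that `module` defines.

    Names the module does not define are skipped, so the result can be used
    in an `except` clause without risking an AttributeError.
    """
    return tuple(getattr(module, name) for name in names if hasattr(module, name))


def classify(error: Exception, json_errors: Tuple[type, ...]) -> str:
    """Label an error as a JSON problem or as something else."""
    try:
        raise error
    except json_errors:  # an empty tuple is valid here and matches nothing
        return "invalid-json"
    except Exception:
        return "other"


# Stand-in for an exceptions module that has no InvalidJSONError attribute.
old_exceptions = types.SimpleNamespace(HTTPError=OSError)
json_errors = existing_exceptions(old_exceptions, ["InvalidJSONError"])
```

In real code the first argument would be `requests.exceptions`; when the attribute is missing, the handler falls through to the generic branch instead of crashing.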

Details

Title | Reference | Author | Source Branch | Dest Branch
kubernetes.logs: use a default date if the logs come without it | repos/cloud/toolforge/toolforge-weld!42 | dcaro | handle_logs_without_date | main
jobs-api: bump to 0.0.263-20240222104806-5ddd710f | repos/cloud/toolforge/toolforge-deploy!206 | project_1317_bot_df3177307bed93c3f34e421e26c86e38 | bump_jobs-api | main
deployment: Pin jobs-api pod to NFS-enabled workers | repos/cloud/toolforge/jobs-api!62 | taavi | main-I0b1d43e2a173b39f145bbcbf5142ed169b9ea259 | main

Event Timeline

dcaro triaged this task as High priority. Feb 22 2024, 10:22 AM

This happens ~50% of the time; re-running the exact same command often works.

This error is new today; it didn't happen yesterday.

project_1317_bot_df3177307bed93c3f34e421e26c86e38 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/206

jobs-api: bump to 0.0.263-20240222104806-5ddd710f

taavi subscribed.

Adding the missing nodeSelector seems to have fixed it. So T355883 (Create a pool of NFS-less Toolforge Kubernetes workers) broke this; I thought I'd already added the nodeSelector everywhere.