
[jobs-api] crashing
Open, Low, Public

Description

This is also maybe (or maybe not) related to the NFS outage; it started crashing in earnest after I started rebooting nfs k8s worker nodes.

@Slst2020 is bailing me out now, but I was very confused by the docs at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Jobs_framework, which don't really explain where or how the API is run. I'm not clear on whether those docs are just really out of date or whether I'm badly misreading them. For instance, 'jobs-framework-api (code) --- uses flask-restful and runs inside the k8s cluster as a webservice' sent me to the 'jobs' tool, but that seems to be unrelated?

Event Timeline

This seems to be resolved now, pending questions are:

  • why no alerts?
  • are the docs as wrong as they look to me at 2AM?

This seems to be resolved now, pending questions are:

  • why no alerts?

Were jobs-api pods crashing? I think the monitoring for it was recently reviewed in T320284: [jobs-api,jobs-emailer] Prometheus monitoring toolforge-jobs server side components.

I think the actual problem was T380844: 2024-11-26 Toolforge DNS incident.

  • are the docs as wrong as they look to me at 2AM?

I think the page you landed on is a 'how this works' kind of document, not a 'runbook' kind of document.

The page has just been refreshed, but the content keeps the same 'how this works' tone.

Maybe we need to:

  • review and create 'runbook' kind of documents for Toolforge components
  • have a round of training about Toolforge components

I think the actual problem was T380844: 2024-11-26 Toolforge DNS incident.

Yes, I can confirm this:

Caused by NameResolutionError(
    "<urllib3.connection.HTTPSConnection object at 0x7f60e6211a50>: 
    Failed to resolve 'tools-harbor.wmcloud.org' 
    ([Errno -3] Temporary failure in name resolution)"
)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/lib/poetry/tjf-9TtSrW0h-py3.11/lib/python3.11/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/lib/poetry/tjf-9TtSrW0h-py3.11/lib/python3.11/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/app/tjf/api/jobs.py", line 110, in create_job
    new_job = NewJob.model_validate(request.json)
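
To check quickly whether a host is affected by the same kind of DNS failure, something like the following could be run from inside a pod. This is only a minimal sketch: `can_resolve` is a hypothetical helper, not part of the jobs-api code, and it only tests name resolution with the standard library rather than reproducing the Harbor request that failed above.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return (True, None) if hostname resolves, else (False, error message).

    `resolver` is injectable so the check can be exercised without network
    access; by default it is the standard library's socket.getaddrinfo.
    """
    try:
        resolver(hostname, 443)
    except socket.gaierror as exc:
        # This is the same class of failure urllib3 reports as
        # NameResolutionError in the traceback above.
        return False, str(exc)
    return True, None
```

Calling `can_resolve("tools-harbor.wmcloud.org")` during the incident would presumably have returned `False` along with the 'Temporary failure in name resolution' message.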

why no alerts?

Patch for the alerts https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/20

For some reason I was expecting to get an alert for any scrape target being down, but I think that only exists on the prod alertmanager. This patch adds it explicitly.

Added a runbook for each alert too (that helps a bit with the docs side, giving an idea on how to check the pods, get the logs, ...)
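
For illustration, the 'scrape target down' condition can also be checked by hand by running the query `up == 0` against the Prometheus HTTP API (`/api/v1/query`) and reading the result. The sketch below is not the content of the merge request above. `down_targets` is a hypothetical helper, and the job and instance names in the sample payload are made up.

```python
def down_targets(payload):
    """List (job, instance) pairs from a Prometheus instant-query response.

    `payload` is the decoded JSON body returned by /api/v1/query for the
    expression `up == 0`, so every series in the result is a target whose
    last scrape failed.
    """
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = payload["data"]["result"]
    return sorted(
        (s["metric"].get("job", "?"), s["metric"].get("instance", "?"))
        for s in series
    )


# Hypothetical response: one target is down.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "job": "jobs-api", "instance": "10.0.0.1:8000"},
                "value": [1732716000, "0"],
            }
        ],
    },
}
print(down_targets(sample))
```

An empty list means every scrape target known to that Prometheus instance was up when the query ran.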

dcaro renamed this task from jobs-api crashing to [jobs-api] crashing. Wed, Nov 27, 2:20 PM
dcaro edited projects, added Toolforge (Toolforge iteration 16); removed Toolforge.