[jobs-api] crashing
Closed, ResolvedPublic

Assigned To

None

Authored By

	Andrew
	Nov 26 2024, 7:23 AM

Description

This may or may not be related to the NFS outage; it started crashing in earnest after I began rebooting the NFS k8s worker nodes.

@Slst2020 is bailing me out now, but I was very confused by the docs at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Jobs_framework, which don't really explain where or how the API is run. I'm not clear on whether those docs are just really out of date or whether I'm badly misreading them. For instance, 'jobs-framework-api (code) --- uses flask-restful and runs inside the k8s cluster as a webservice' sent me to the 'jobs' tool, but that seems to be unrelated?

Related Objects

Status	Assigned	Task
Open	None	T380882 openstack network problems (November 2024)
Resolved	aborrero	T380827 tools-nfs outage 2024-11-25
Resolved	None	T380832 [jobs-api] crashing

Event Timeline

Andrew created this task. Nov 26 2024, 7:23 AM

This seems to be resolved now, pending questions are:

why no alerts?
are the docs as wrong as they look to me at 2AM?

dcaro subscribed. Nov 26 2024, 7:57 AM

Count_Count subscribed. Nov 26 2024, 9:37 AM

In T380832#10356349, @Andrew wrote:

This seems to be resolved now, pending questions are:

why no alerts?

Were jobs-api pods crashing? I think the monitoring for it was recently reviewed in T320284: [jobs-api,jobs-emailer] Prometheus monitoring toolforge-jobs server side components.

I think the actual problem was T380844: 2024-11-26 Toolforge DNS incident.

are the docs as wrong as they look to me at 2AM?

I think the page you picked is a 'how this works' kind of document, not a 'runbook' kind of document.

The page has just been refreshed, but the content keeps the same 'how this works' tone.

Maybe we need to:

review and create 'runbook' kind of documents for Toolforge components
have a round of training about Toolforge components

aborrero added a project: User-aborrero. Nov 26 2024, 12:01 PM

aborrero moved this task from Backlog to Radar/observer on the User-aborrero board.

In T380832#10357169, @aborrero wrote:

I think the actual problem was T380844: 2024-11-26 Toolforge DNS incident.

Yes, I can confirm this:

sed by NameResolutionError(
    "<urllib3.connection.HTTPSConnection object at 0x7f60e6211a50>: 
    Failed to resolve 'tools-harbor.wmcloud.org' 
    ([Errno -3] Temporary failure in name resolution)"
)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/lib/poetry/tjf-9TtSrW0h-py3.11/lib/python3.11/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/lib/poetry/tjf-9TtSrW0h-py3.11/lib/python3.11/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/app/tjf/api/jobs.py", line 110, in create_job
    new_job = NewJob.model_validate(request.json)
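
To check whether the same failure is still happening, one option is to reproduce the lookup outside the API. The following is only a minimal sketch, not part of jobs-api: `check_resolution` and its `resolver` parameter are hypothetical names, and the only thing taken from the traceback is the hostname and the `[Errno -3]` error, which on Linux corresponds to `socket.EAI_AGAIN`.

```python
import socket
from typing import Callable, Optional

# EAI_AGAIN is the getaddrinfo error code reported on Linux as
# "[Errno -3] Temporary failure in name resolution".
TEMPORARY_FAILURE = socket.EAI_AGAIN


def check_resolution(
    hostname: str,
    port: int = 443,
    resolver: Callable = socket.getaddrinfo,  # injectable so it can be tested offline
) -> Optional[str]:
    """Return None if the hostname resolves, else a short failure description."""
    try:
        resolver(hostname, port)
    except socket.gaierror as error:
        # Distinguish the transient case seen in the traceback from other errors.
        kind = "temporary" if error.errno == TEMPORARY_FAILURE else "non-temporary"
        return f"{kind} DNS failure for {hostname}: {error}"
    return None
```

Calling `check_resolution("tools-harbor.wmcloud.org")` from the affected environment should return `None` once DNS is healthy again, and a "temporary DNS failure" message while the resolver is still failing.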

why no alerts?

Patch for the alerts: https://gitlab.wikimedia.org/repos/cloud/toolforge/alerts/-/merge_requests/20

For some reason I was expecting to get an alert for any scrape target being down, but I think that only exists on the prod alertmanager; this patch adds it explicitly.

I also added a runbook for each alert (that helps a bit with the docs side, giving an idea of how to check the pods, get the logs, ...)
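
For illustration only (this is not the content of the merge request above): Prometheus exposes an `up` metric per scrape target, which is `0` when the last scrape failed, so "any scrape target down" can be checked with the query `up == 0`. The sketch below parses the response of Prometheus' instant-query HTTP API; the function names and the `prometheus_url` argument are assumptions, not existing Toolforge tooling.

```python
import json
import urllib.parse
import urllib.request
from typing import Any


def down_targets(query_response: dict[str, Any]) -> list[str]:
    """List 'job/instance' for each sample in an instant-query response for `up == 0`."""
    if query_response.get("status") != "success":
        raise ValueError(f"query failed: {query_response.get('error', 'unknown error')}")
    return sorted(
        f"{sample['metric'].get('job', '?')}/{sample['metric'].get('instance', '?')}"
        for sample in query_response["data"]["result"]
    )


def fetch_down_targets(prometheus_url: str) -> list[str]:
    """Query a Prometheus server (the URL is an assumption) for down scrape targets."""
    query = urllib.parse.urlencode({"query": "up == 0"})
    url = f"{prometheus_url}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=10) as reply:
        return down_targets(json.load(reply))
```

An empty list means every target was scraped successfully; an alert rule built on the same expression would fire for whatever this function lists.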

aborrero mentioned this in T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components. Nov 27 2024, 10:09 AM

dcaro renamed this task from jobs-api crashing to [jobs-api] crashing. Nov 27 2024, 2:20 PM

dcaro edited projects, added Toolforge (Toolforge iteration 16); removed Toolforge.

AFAICS this is resolved, and fixing the alerts and docs is being tracked in separate tasks.

dcaro moved this task from Next Up to Done on the Toolforge (Toolforge iteration 16) board. Nov 29 2024, 10:16 AM

fnegri removed a subtask: T380959: [docs,envvars-api,jobs-api,builds-api] create docs on how to operate the cluster and core components. Wed, Dec 18, 11:25 AM
