[jobs-emailer] http requests are blocked by the loops
Closed, ResolvedPublic

Description

When making an HTTP request (e.g. /healthz or /metrics), the request hangs for several seconds (I'm guessing one of the task loops or similar is blocking it; to investigate).

ex:

local.tf-test@lima-kilo:~$ time curl http://127.0.0.1:8080/healthz
{"state": "pretty much alive :-)"}
real    0m13.820s
user    0m0.008s
sys     0m0.008s

Event Timeline

dcaro triaged this task as Medium priority.Nov 14 2024, 3:06 PM

The easier solution is to run the webserver task in a different thread.

Running it in the same thread as the other emailer tasks was a conscious decision back when I introduced it, to use the /health endpoint as a kind of 'detector' to see if the main loop was still running.
It was maybe too defensive an approach. There is a task that checks that all other tasks are scheduled, which should work well, so there is no need to have the webserver in the same thread.

That will prevent us from gathering any stats from the other tasks though (without some not-so-easy/nice workarounds).

Hmm... and that will still block the other tasks :/. I think the ideal would be to wrap the non-async code and run it in its own thread (like FastAPI does by itself).
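For illustration, here is a minimal sketch of that idea using only the standard library's `asyncio.to_thread` (Python 3.9+). It is not the jobs-emailer code: `blocking_work`, `heartbeat`, and `run` are hypothetical names, and `time.sleep` stands in for whatever synchronous call is blocking the loop. The heartbeat task plays the role of the webserver: it measures how long it goes without being scheduled.

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    # Hypothetical stand-in for non-async code, e.g. a synchronous API call.
    time.sleep(seconds)
    return "done"


async def heartbeat(ticks: list[float], interval: float, count: int) -> None:
    # Records a timestamp each time the event loop schedules it.
    for _ in range(count):
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def run(offload: bool) -> float:
    """Return the largest gap between two heartbeat ticks, in seconds."""
    ticks: list[float] = []
    beat = asyncio.create_task(heartbeat(ticks, 0.01, 20))
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    if offload:
        # Runs in a worker thread; the event loop stays free meanwhile.
        await asyncio.to_thread(blocking_work, 0.2)
    else:
        # Runs inline; nothing else on the loop is scheduled until it returns.
        blocking_work(0.2)
    await beat
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print("inline gap:   ", asyncio.run(run(offload=False)))
    print("offloaded gap:", asyncio.run(run(offload=True)))
```

With the inline call, the largest gap is at least as long as the blocking call itself; with the offloaded call it should stay close to the heartbeat interval. Exact numbers depend on the machine.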

If the use case is incrementing a counter for Prometheus, I don't see that as very problematic.

From the main thread/tasks you write, then in the webserver thread you read. One thread is RW, the other is RO. There are no problems with that regarding memory access. RW/RW would be a different thing, but I don't think that's the case here.

Thanks, I already knew what you meant.

The way the stats are generated is by writing to them; having more than one task writing to the stats (and/or the webserver itself, if we instrument it) means many writers, one reader.

It also does not solve the blocking issues between the current tasks. I'm testing another approach currently and will send the patch soon.
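To make the "many writers, one reader" concern concrete, here is a minimal sketch of a counter that stays correct when several threads increment it. It assumes plain `threading` from the standard library; the `Counter` class and `hammer` helper are hypothetical and are not the stats code used by jobs-emailer.

```python
import threading


class Counter:
    """Hypothetical thread-safe counter, for illustration only."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount: int = 1) -> None:
        # The read-modify-write is done under the lock so that
        # concurrent writers cannot lose each other's updates.
        with self._lock:
            self._value += amount

    def get(self) -> int:
        # The reader takes the same lock to see a consistent value.
        with self._lock:
            return self._value


def hammer(counter: Counter, writers: int, increments: int) -> int:
    """Increment the counter from several threads and return the final value."""

    def work() -> None:
        for _ in range(increments):
            counter.inc()

    threads = [threading.Thread(target=work) for _ in range(writers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.get()


if __name__ == "__main__":
    print(hammer(Counter(), writers=4, increments=10_000))
```

The lock only addresses the shared-state question; as noted above, it does nothing for tasks blocking each other on the event loop.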

dcaro changed the task status from Open to In Progress.Nov 19 2024, 10:37 AM
dcaro moved this task from Next Up to In Review on the Toolforge (Toolforge iteration 16) board.

group_203_bot_4866fc124f4b41659f667468a6115cf3 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/670

jobs-emailer: bump to 0.0.52-20250219170643-a14ae54d

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 17) board.