
Update cache in a single thread
Closed, Resolved (Public)

Description

The collection cache is updated at service startup and every hour after that. The update currently runs at roughly the same time in every worker process. There are 4 workers by default in dev, maybe more in prod. This is wasteful; the update should run in a single thread.

Event Timeline

SBisson triaged this task as High priority. Nov 5 2024, 2:18 AM
SBisson moved this task from Backlog to Prioritized on the LPL Hypothesis board.

Change #1088376 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[research/recommendation-api@master] update cache in a single thread

https://gerrit.wikimedia.org/r/1088376

Change #1088376 merged by jenkins-bot:

[research/recommendation-api@master] update cache in a single thread

https://gerrit.wikimedia.org/r/1088376

The patch above was a big improvement, but I think we should do a little more before calling it done. Ideally, we should use the actual number of workers. At a minimum, if we hardcode a number, we should make sure it is the one used in production. It shouldn't be too hard to get that from the startup logs.

Change #1098064 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[research/recommendation-api@master] fix cache update in a single thread for production

https://gerrit.wikimedia.org/r/1098064

Change #1098064 merged by jenkins-bot:

[research/recommendation-api@master] fix cache update in a single thread for production

https://gerrit.wikimedia.org/r/1098064

I think the current logic for restricting the cache update to one thread, the PID % number-of-workers check, is fragile.

Gunicorn knows the number of workers, but none of the FastAPI workers does. The TODO annotation in the code asks for it to be read from a configuration file, but this is not application configuration per se. It is run-time information, like the number of processor cores. The FastAPI application should be agnostic of its workers.
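To illustrate the fragility, here is a minimal sketch of a PID-modulo election. It assumes the check has the form `pid % num_workers == 0`; the function names are hypothetical and the actual code in recommendation-api may differ. It shows that the check elects exactly one worker only when the PIDs are consecutive and the assumed worker count matches the real one.

```python
def is_cache_updater(pid: int, num_workers: int) -> bool:
    """Assumed form of the check: the worker whose PID is divisible
    by the worker count is the one that updates the cache."""
    return pid % num_workers == 0


def count_updaters(pids: list[int], num_workers: int) -> int:
    """Count how many workers would consider themselves the updater."""
    return sum(is_cache_updater(pid, num_workers) for pid in pids)


# Consecutive PIDs, correct worker count: exactly one updater.
print(count_updaters([8, 9, 10, 11], 4))      # 1

# Non-consecutive PIDs (for example after a worker restart): no updater.
print(count_updaters([9, 10, 11, 13], 4))     # 0

# Non-consecutive PIDs can also elect two workers.
print(count_updaters([8, 9, 10, 12], 4))      # 2

# 8 real workers, but the check hardcodes 4: two updaters.
print(count_updaters(list(range(8, 16)), 4))  # 2
```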

A better approach would be to do the cache update outside the Gunicorn and FastAPI processes. The container (Docker) would set up a cron job that runs poetry run update_cache at regular intervals. That guarantees that only one process updates the cache, and that it runs independently of the FastAPI workers.
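As a rough sketch of what a standalone entry point behind such a command might look like: everything here is an assumption, including the file-based cache, the default path, and the placeholder `build_cache` function. It writes the new cache to a temporary file and renames it into place, so that readers never see a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path


def build_cache() -> dict:
    """Hypothetical stand-in for the real collection fetch."""
    return {"collections": []}


def write_cache_atomically(path: Path, data: dict) -> None:
    """Write to a temp file in the target directory, then rename it
    over the target so the swap happens in one step."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        os.unlink(tmp_name)
        raise


def main(cache_path: Path = Path("collection_cache.json")) -> None:
    """Entry point that a cron job could invoke."""
    write_cache_atomically(cache_path, build_cache())
```

With Poetry, a function like `main` could be exposed as a command through a `[tool.poetry.scripts]` entry; the module path would depend on the project layout.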

However, one small headache in setting this up in Docker (the Blubber file) is that we run the application as the user somebody, which has limited permissions, and I found that Mr Somebody cannot run cron jobs. Blubber is not flexible enough to set these permissions either. This needs some exploration, or maybe a discussion with the Developer Experience team to see whether somebody can run cron jobs.

Change #1098495 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update recommendation-api to 2024-11-27-065850-production

https://gerrit.wikimedia.org/r/1098495

Change #1098495 merged by jenkins-bot:

[operations/deployment-charts@master] Update recommendation-api to 2024-11-27-065850-production

https://gerrit.wikimedia.org/r/1098495

Mentioned in SAL (#wikimedia-operations) [2024-11-27T15:05:23Z] <kart_> Updated recommendation-api to 2024-11-27-142924-production (T380838, T379036, T380699)

While the cache update is not where we want it to be yet, I can confidently say it is happening in a single thread, as opposed to in all 4 workers as it was initially. I'll open another task for setting it up as a cron job.