Toolforge kubernetes does not meet expectation of some users/tools
Closed, Resolved | Public | BUG REPORT

Description

When kubernetes for toolforge was announced, I was quick to change my tools over from grid, even though that meant sometimes massive reorganisation and, in a few cases, rewrites; some tools worked well with the fire-and-forget grid engine approach, but conflict with the kubernetes toolforge approach.

I rewrote some massive codebases in Rust, mostly in order to perform continuous jobs in an efficient way, using async and multi-threading. But even these jobs suffer from the limitations imposed by toolforge.

My latest example is my Mix'n'match tool. I rewrote large parts of the PHP code in Rust, in order to run both on-demand and scheduled tasks on the MnM database. These jobs can vary from a few seconds to a few days. Ideally, I would start a kubernetes job with several CPUs, to run threads efficiently, and a few GB of RAM. Sadly, there are massive restrictions on both CPUs and RAM.

My workaround (and when your users have to create workarounds to be able to use your infrastructure, you know you have an issue) is to start two copies of the same job under slightly different names. This is obviously less efficient than running a single larger job. Furthermore, these jobs are prone to spontaneously stop (an issue I have noticed with other tools using kubernetes) and are sometimes just left as "fails to start", even when marked as continuous. On top of that, if I start two jobs (1CPU/1GB RAM each), the webservice stops as well, and needs to be restarted manually. Again, this does not just affect this tool but several.

I believe this will not be solved by changing configuration for one tool or the other; there needs to be a re-think of how to offer kubernetes to us toolforge users, in a way that does not make us want to ditch it in favour of screen.

Event Timeline

aborrero changed the task status from Open to In Progress. Jun 30 2023, 3:00 PM
aborrero triaged this task as Low priority.
aborrero added a project: User-aborrero.
aborrero subscribed.

In general, when a tool encounters the kind of "massive restrictions on both CPUs and RAM" that you mention, we kindly suggest moving it into a Cloud VPS project.

Would that be an option for Mix'n'match ?

aborrero renamed this task from Toolforge kubernetes is dysfunctional to Toolforge kubernetes does not meet expectation of some users/tools. Jun 30 2023, 3:00 PM
aborrero moved this task from Backlog to Radar on the User-aborrero board.

In principle it would (I am already running the petscan VPS), but this is missing my point.

My point is, gridengine was a lot more flexible, accommodating, and (as an end result) useful for me.

I get that kubernetes is the way to go. I am usually one of the first to adopt new tech on toolforge. But given that kubernetes is better, why do I now have to move to a VPS? Is there no middle ground?

That is to say, you say I'd have to move to VPS because you changed the tech stack, not because of resource usage. Resource usage was fine under grid engine. In fact, it was fine running the PHP version of the code. Don't tell me kubernetes cannot support the Rust version, which uses a lot less resources?

And it doesn't answer the question why the webservice stops pretty much every time I start kubernetes jobs?

Most of the problems you are experiencing are the result of Kubernetes being a better quota / policy enforcer compared to Grid Engine.
In the grid you could effectively take over a whole worker node resource-wise, to the point of making other tools fail or even bringing the worker virtual machine down completely.
Stricter quota enforcement in Toolforge is a feature in this case. It makes the system more stable for 99% of the users (the exception being resource-intensive tools like yours).

The quotas in k8s are also shared between jobs and webservices. The webservice restart situation that you mention could be a bug, or could be the result of your tool being on the edge of the quotas.
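To illustrate the shared-quota point, here is a minimal sketch of the arithmetic, assuming a per-tool quota that every pod (webservice and jobs alike) counts against. The `Resources` type, the `fits` helper, and all numbers are hypothetical and are not the actual Toolforge quotas; the thread does not establish whether this is what stops the webservice.

```rust
/// Hypothetical resource amounts: CPU in millicores, memory in MiB.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Resources {
    cpu_millis: u32,
    mem_mib: u32,
}

/// Returns true if `new_pod` still fits under `quota` once the
/// resources of all already-running pods are added up.
fn fits(quota: Resources, running: &[Resources], new_pod: Resources) -> bool {
    let used_cpu: u32 = running.iter().map(|r| r.cpu_millis).sum();
    let used_mem: u32 = running.iter().map(|r| r.mem_mib).sum();
    used_cpu + new_pod.cpu_millis <= quota.cpu_millis
        && used_mem + new_pod.mem_mib <= quota.mem_mib
}

fn main() {
    // All values below are made up for illustration.
    let quota = Resources { cpu_millis: 2000, mem_mib: 2048 };
    let webservice = Resources { cpu_millis: 500, mem_mib: 512 };
    let job = Resources { cpu_millis: 1000, mem_mib: 1000 };

    // One job next to the webservice stays under the shared quota.
    println!("first job fits: {}", fits(quota, &[webservice], job));
    // A second identical job would push the CPU total past the quota.
    println!("second job fits: {}", fits(quota, &[webservice, job], job));
}
```

With these made-up numbers, a webservice plus one job fits, while a second identical job would not, because all three count against the same total.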

Speaking of quotas, we are also in the process of revisiting the Toolforge k8s quotas, see T333979: Re-visit Toolforge Kubernetes default quotas (April 2023).

The webservice restart situation that you mention could be a bug, or could be the result of your tool being on the edge of the quotas.

Is there any specific tool right now that you are seeing this issue in?
That would help on trying to reproduce/debug it, thanks!

In the meantime I'm looking at mix-n-match

@aborrero To be fair, Mix'n'match used to fire off small/medium-sized jobs to gridengine, unlikely to bring a machine down. The requirement for uniquely named kubernetes jobs, and the limitation on the number of jobs, with no "waiting queue", forced me to rewrite much of my code into my own "job engine" that runs as a single kubernetes job. However, now the restrictions on individual kubernetes jobs bite me.
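For readers unfamiliar with the pattern, a "job engine" with a waiting queue inside a single process can be sketched as below. This is not the Mix'n'match code, only a minimal illustration using the Rust standard library; the `Job` type and `run_queue` function are hypothetical names. Queued tasks wait in a channel and a fixed number of worker threads pick them up as they become free.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// A queued task: a boxed closure that returns a numeric result.
type Job = Box<dyn FnOnce() -> u64 + Send + 'static>;

/// Runs every job in `jobs` on at most `workers` threads and returns
/// the results in completion order (not submission order).
fn run_queue(jobs: Vec<Job>, workers: usize) -> Vec<u64> {
    let (job_tx, job_rx) = mpsc::channel::<Job>();
    // Workers share the receiving end of the queue behind a mutex.
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (res_tx, res_rx) = mpsc::channel::<u64>();

    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let res_tx = res_tx.clone();
            thread::spawn(move || loop {
                // The lock is released at the end of this statement,
                // so jobs themselves run without holding it.
                let next = job_rx.lock().unwrap().recv();
                match next {
                    Ok(job) => {
                        res_tx.send(job()).unwrap();
                    }
                    Err(_) => break, // queue closed and drained
                }
            })
        })
        .collect();
    // Drop our own sender so the result channel closes once workers exit.
    drop(res_tx);

    // Enqueue everything; jobs beyond `workers` simply wait in the channel.
    for job in jobs {
        job_tx.send(job).unwrap();
    }
    drop(job_tx); // signal that no more jobs will arrive

    for handle in handles {
        handle.join().unwrap();
    }
    res_rx.iter().collect()
}

fn main() {
    // Five toy jobs, two workers: three jobs wait in the queue at first.
    let jobs: Vec<Job> = (1..=5u64)
        .map(|n| Box::new(move || n * n) as Job)
        .collect();
    let mut results = run_queue(jobs, 2);
    results.sort();
    println!("{:?}", results);
}
```

The trade-off described above follows from this design: all queued work shares the CPU and memory limits of the one pod that hosts the workers.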

The webservice does not restart, it just stops. It works fine after I restart it manually. So it can run in parallel with the other jobs. That it fails in the first place points to some faulty configuration. Also, that kubernetes, known for auto-restarting failed pods, does not do so here should raise questions.

Good to hear, though, that in T333979 it seems clear that the quotas are way too tight for many tools.

@dcaro Listeria was one of them, and quickstatements at some point.

Ah here's a new one. I start kubernetes job rustbot. It runs. I start rustbot2. Now rustbot2 runs but rustbot has vanished:

tools.mix-n-match@tools-sgebastion-10:~/mixnmatch_rs$ toolforge jobs list
Job name:    Job type:    Status:
-----------  -----------  ---------
rustbot      continuous   Running
tools.mix-n-match@tools-sgebastion-10:~/mixnmatch_rs$ toolforge-jobs delete rustbot2 ; \rm ~/rustbot2.* ; \
> toolforge-jobs run --image tf-php74 --mem 1000Mi --continuous --command '/data/project/mix-n-match/mixnmatch_rs/run.sh second' rustbot2

tools.mix-n-match@tools-sgebastion-10:~/mixnmatch_rs$
tools.mix-n-match@tools-sgebastion-10:~/mixnmatch_rs$ toolforge jobs list
Job name:    Job type:    Status:
-----------  -----------  ---------
rustbot2     continuous   Running

If I then delete rustbot2, rustbot will not re-emerge. And, of course, this silently killed the webservice:

tools.mix-n-match@tools-sgebastion-10:~/mixnmatch_rs$ webservice status
Your webservice is not running

How am I supposed to work with this?

And it doesn't answer the question why the webservice stops pretty much every time I start kubernetes jobs?

Please report such bugs so they can be fixed. toolforge jobs delete works properly now.

It might be worth mentioning that the limitations for Mix'n'match have been increased at some point, I can now run a job with 2CPUs/3GB RAM, which is much better.

I do need to wait a minute or so after deleting a job before I can start the new one; otherwise, it will be accepted by toolforge jobs run, but then enter a "not enough CPUs" state. Apparently kubernetes needs some time to update internal numbers.

And it doesn't answer the question why the webservice stops pretty much every time I start kubernetes jobs?

Please report such bugs so they can be fixed. toolforge jobs delete works properly now.

Thanks!

aborrero claimed this task.

Thanks for the reports and the feedback.

Please do keep sending bug reports in the future about things you feel are unexpected or misbehaving.