
toolforge-jobs: figure out default quotas and limits
Closed, Resolved · Public

Description

I'm not sure we ever evaluated default quotas and limits for jobs @ kubernetes.

I propose we double-check:

  • max number of defined Job objects
  • max number of defined Deployment objects (also used by webservices)
  • max number of defined CronJob objects
  • max number of Pod objects
  • default memory & CPU for jobs, etc.

I suspect @Bstorm may have an opinion on this.

Event Timeline

In this case, the original setup of quotas was intended to take jobs into account, but that doesn't mean we did a good job! I see we didn't use any of the count/<resource> quotas, just the standard object quotas available in v1.15. I recall something weird about them not working quite right back then.

There should be no reason not to use them now.

We currently have some very basic limits:

configmaps: "10"
limits.cpu: "2"
limits.memory: 8Gi
persistentvolumeclaims: "3"
pods: "4"
replicationcontrollers: "1"
requests.cpu: "2"
requests.memory: 6Gi
secrets: "10"
services: "1"
services.nodeports: "0"

Compared to grid engine's per-user limit of 50 jobs, 4 pods is pretty tight. We may very well want to increase that to encourage adoption.
For web services, we already limit services to 1 until someone requests more. That suggests that it might be reasonable to set
count/deployments.apps to 1 as well unless that is somehow used in this system.
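
For illustration, the count-based quotas would just be extra entries under the ResourceQuota's hard section. A minimal sketch, assuming the quota lives in the tool's namespace (tool-example and the numbers other than the deployments one are placeholders, not decided values):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tool-example
  namespace: tool-example
spec:
  hard:
    count/deployments.apps: "1"
    count/jobs.batch: "10"
    count/cronjobs.batch: "50"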

For jobs and cronjobs, there are more questions. A cronjob is a controller with many jobs, right? You get one job and pod per execution, and they get cleaned up, but not necessarily right away, no?

The main constraint I see in putting pods/jobs at 50 is that our images are too thick. If a person is using only tf-buster-std, they can easily get away with running more jobs while filling up less disk on the workers than with tf-python37. While we could experiment with producing a smaller Python container or something like that, perhaps a new default limit of 10 would be sensible? We also want people to spread out their cronjob object schedules, so that might work.

For cronjobs themselves, the limit likely doesn't need to be terribly strict, and we could even use 50 as the base. You can only have so many pods running at once with only so much CPU and mem anyway.

Since you've added the ability to select memory and CPU for jobs, I'd suggest we go with small initial sizes, which are already taken care of by the limit ranges in place. Those have this default:

- default:
    cpu: 500m
    memory: 512Mi

If you set nothing, it should give you that. I'm game for setting something lower in the API if you want! I believe we do in webservice.
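
For reference, a rough sketch of the full LimitRange that default would come from (the name/namespace and the defaultRequest values are placeholders; only the default block above is from this task):

apiVersion: v1
kind: LimitRange
metadata:
  name: tool-example
  namespace: tool-example
spec:
  limits:
  - type: Container
    default:            # limit applied when a container sets none
      cpu: 500m
      memory: 512Mi
    defaultRequest:     # request applied when a container sets none; placeholder values
      cpu: 250m
      memory: 256Mi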

This'll call for a patch to maintain-kubeusers and a script to backfill the quotas without changing the existing ones (since some people already have non-default quotas).

@aborrero if all that sounds reasonable, I can try to start implementing it quick before you are done with the beta so we get feedback.

Compared to grid engine's per-user limit of 50 jobs, 4 pods is pretty tight. We may very well want to increase that to encourage adoption.

The pod limit was one of my concerns with adoption. I wouldn't have hit >4 yet, but I've only migrated some jobs over (mostly very short ones). Most of the time it wouldn't be an issue, but sometimes multiple longer-duration cronjobs running at the same time could prevent others from starting (at all, or when they should).

For jobs and cronjobs, there are more questions. A cronjob is a controller with many jobs, right? You get one job and pod per execution, and they get cleaned up, but not necessarily right away, no?

The jobs from cronjobs have ttlSecondsAfterFinished set to 30 (was 0 before today; rCTJF07346d715d17: jobs: adjust garbage collection), so they get cleaned up pretty quickly.
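
For context, that knob lives on the Job spec, so for cronjobs it is set through the jobTemplate. A rough sketch (the apiVersion depends on the cluster version, and the image/command are placeholders):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: example-cronjob
spec:
  schedule: "17 * * * *"
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 30   # finished Jobs (and their pods) are garbage-collected after 30s
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: job
            image: example-image    # placeholder
            command: ["./mytask.sh"]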

For cronjobs themselves, the limit likely doesn't need to be terribly strict, and we could even use 50 as the base. You can only have so many pods running at once with only so much CPU and mem anyway.

50 seems like a reasonable place to start. If you have a way to do so, maybe you could check the max number of grid jobs scheduled by any tool currently and base it on that.

@aborrero if all that sounds reasonable, I can try to start implementing it quick before you are done with the beta so we get feedback.

sounds reasonable!

The 50 limit also means that we should probably do a serious scaling of the k8s cluster before/while we migrate away from the old grid.

50 seems like a reasonable place to start. If you have a way to do so, maybe you could check the max number of grid jobs scheduled by any tool currently and base it on that.

It's 50 on the grid. It should be noted that a grid job is much lighter than a k8s pod because it doesn't come with one of our really huge container images; that said, by only using a few images we get a lot of deduplication. We are working on slimming those down a bit. I'm thinking roughly 50 for the schedules, with maybe 10 for concurrent pods/jobs (instead of the current 4) to start. It is much easier to process increases for people who need more than that than it is on the grid, which is why the grid limits are artificially high. There are actually only 2 or 3 users who have ever hit the limit there, and the limit was really created for them.

The grid is also not a very healthy environment and requires regular manual work to keep going, so it is not always the best basis for such things.

@aborrero: we will probably need to scale the cluster up to something like the tools-sgeexec cluster size in general to really move most people, anyway, so yeah :)

This is all figuring that a cronjob = the schedule, vs. a pod, which = something actually running concurrently. For one-off jobs, we can set the limit high to prevent DoS, but the pod limit should be the real limit there.
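
Putting those illustrative numbers in quota terms (nothing here is final; the pods and cronjobs values are the ones floated above, and the jobs value is just a placeholder for "high"):

    pods: "10"                    # actually-concurrent things
    count/cronjobs.batch: "50"    # schedules
    count/jobs.batch: "100"       # one-off jobs; set high to prevent DoS, pods is the real limit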

The main constraint I see in putting pods/jobs at 50 is that our images are too thick. If a person is using only tf-buster-std, they can easily get away with running more jobs while filling up less disk on the workers than with tf-python37. While we could experiment with producing a smaller Python container or something like that, perhaps a new default limit of 10 would be sensible?

Part of the problem is that we use the same images for interactive mode (which includes all your favorite $EDITORs) as for execution. The current toolforge-buster-sssd image is 724 MB; when I strip out emacs, vim and nano, it's down to 408 MB. Stripping out gawk, git, jq, curl, less and sed brings us down to 324 MB, 55% smaller.

We could have -slim images that drop all these tools and have toolforge-jobs use them by default. If you pass some --debug flag to the job, it would use the full container so you can attach to it for debugging.

Depending on how far we want to trim, the various -dev packages could be replaced with just the shared library (e.g. libldap2 instead of libldap2-dev), since they shouldn't be needed at runtime, only at compile/install time.

Removing locales gets us down to 81.4 MB. Could we mount the locales in from the host instead of shipping them in every container, when they're basically static?

Sidenote: I think pywikibot uses git to identify what version people are on, but that would be a good reason to include it in the Pywikibot container and promote usage of that.

We also want people to spread out their cronjob object schedules, so that might work.

Does k8s support something like systemd's RandomizedDelaySec=? I have some bots that need to run hourly, but it doesn't matter what time in the hour they run; I usually put them at the top of the hour because that's easier than picking a random minute. If k8s doesn't do it, the toolforge-jobs command could interpret some @randomdaily cron syntax and substitute in a random time that still runs daily.
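
For illustration, the hypothetical @randomdaily could simply mean toolforge-jobs picking a minute/hour once at creation time and writing a normal cron expression into the CronJob, e.g.:

spec:
  schedule: "37 3 * * *"   # still daily, just not at midnight or the top of the hour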

nskaggs moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Change 711728 had a related patch set uploaded (by Bstorm; author: Bstorm):

[labs/tools/maintain-kubeusers@master] jobs service: add count quotas for a bunch of related objects

https://gerrit.wikimedia.org/r/711728

Change 711728 merged by jenkins-bot:

[labs/tools/maintain-kubeusers@master] jobs service: add count quotas for a bunch of related objects

https://gerrit.wikimedia.org/r/711728

Does k8s support something like systemd's RandomizedDelaySec=? I have some bots that need to run hourly, but it doesn't matter what time in the hour they run; I usually put them at the top of the hour because that's easier than picking a random minute. If k8s doesn't do it, the toolforge-jobs command could interpret some @randomdaily cron syntax and substitute in a random time that still runs daily.

Not to my knowledge, unfortunately. Mind you, I haven't looked in a bit.

Mentioned in SAL (#wikimedia-cloud) [2021-09-02T01:02:03Z] <bstorm> deployed new version of maintain-kubeusers with new count quotas for new tools T286784

Still to do: a backfill script.

Mentioned in SAL (#wikimedia-cloud) [2021-09-03T22:36:23Z] <bstorm> backfilling quotas in screen for T286784

That's done. We can always make adjustments later and folks can request increases on a case-by-case basis as well.