
toolforge-jobs: figure out default quotas and limits
Closed, Resolved · Public

Description

I'm not sure we ever evaluated default quotas and limits for jobs @ kubernetes.

I propose we double-check:

  • max number of defined Job objects
  • max number of defined Deployment objects (also used by webservices)
  • max number of defined CronJob objects
  • max number of Pod objects
  • default memory & CPU for jobs, etc.

I suspect @Bstorm may have an opinion on this.

Event Timeline

In this case, the original setup of quotas was intended to take jobs into account, but that doesn't mean we did a good job! I see we didn't use any of the count/<resource> quotas, just the standard object quotas available in v1.15. I recall something weird about them not working quite right back then.

There should be no reason not to use them now.

We currently have some very basic limits:

configmaps: "10"
limits.cpu: "2"
limits.memory: 8Gi
persistentvolumeclaims: "3"
pods: "4"
replicationcontrollers: "1"
requests.cpu: "2"
requests.memory: 6Gi
secrets: "10"
services: "1"
services.nodeports: "0"

Compared to grid engine's per-user limit of 50 jobs, 4 pods is pretty tight. We may very well want to increase that to encourage adoption.
For web services, we already limit services to 1 until someone requests more. That suggests that it might be reasonable to set
count/deployments.apps to 1 as well unless that is somehow used in this system.
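
For illustration, the count-based quotas would just be extra entries under the ResourceQuota's hard section. A minimal sketch, assuming the quota lives in the tool's namespace (tool-example and the numbers other than the deployments one are placeholders, not decided values):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tool-example
  namespace: tool-example
spec:
  hard:
    count/deployments.apps: "1"
    count/jobs.batch: "10"
    count/cronjobs.batch: "50"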

For jobs and cronjobs, there are more questions. A cronjob is a controller with many jobs, right? You get one job and pod per execution, and they get cleaned up, but not necessarily right away, no?

The main constraint I see in putting pods/jobs at 50 is that our images are too thick. If a person is using only tf-buster-std, they can easily get away with running more jobs while filling up less disk on the workers than with tf-python37. While we could experiment with producing a smaller Python container or something like that, perhaps a new default limit of 10 would be sensible? We also want people to spread out their cronjob object schedules, so that might work.

For cronjobs themselves, the limit likely doesn't need to be terribly strict, and we could even use 50 as the base. You can only have so many pods running at once with only so much CPU and mem anyway.

Since you've added the ability to select memory and CPU for jobs, I'd suggest we go with small initial sizes, which are already taken care of by the limit ranges in place. Those have this default:

- default:
    cpu: 500m
    memory: 512Mi

If you set nothing, it should give you that. I'm game for setting something lower in the API if you want! I believe we do in webservice.
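
For reference, a rough sketch of the full LimitRange that default would come from (the name/namespace and the defaultRequest values are placeholders; only the default block above is from this task):

apiVersion: v1
kind: LimitRange
metadata:
  name: tool-example
  namespace: tool-example
spec:
  limits:
  - type: Container
    default:            # limit applied when a container sets none
      cpu: 500m
      memory: 512Mi
    defaultRequest:     # request applied when a container sets none; placeholder values
      cpu: 250m
      memory: 256Mi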

This'll call for a patch to maintain-kubeusers and a script to backfill the quotas without changing the existing ones (since some people already have non-default quotas).

@aborrero if all that sounds reasonable, I can try to start implementing it quick before you are done with the beta so we get feedback.

Compared to grid engine's per-user limit of 50 jobs, 4 pods is pretty tight. We may very well want to increase that to encourage adoption.

The pod limit was one of my concerns with adoption. I wouldn't have hit >4 yet, but I've only migrated some jobs over (mostly very short ones). Most of the time it wouldn't be an issue, but sometimes multiple longer-duration cronjobs running at the same time could prevent others from starting (at all, or when they should).

For jobs and cronjobs, there are more questions. A cronjob is a controller with many jobs, right? You get one job and pod per execution, and they get cleaned up, but not necessarily right away, no?

The jobs from cronjobs have ttlSecondsAfterFinished set to 30 (was 0 before today; rCTJF07346d715d17: jobs: adjust garbage collection), so they get cleaned up pretty quickly.
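
For context, that knob lives on the Job spec, so for cronjobs it is set through the jobTemplate. A rough sketch (the apiVersion depends on the cluster version, and the image/command are placeholders):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: example-cronjob
spec:
  schedule: "17 * * * *"
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 30   # finished Jobs (and their pods) are garbage-collected after 30s
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: job
            image: example-image    # placeholder
            command: ["./mytask.sh"]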

For cronjobs themselves, the limit likely doesn't need to be terribly strict, and we could even use 50 as the base. You can only have so many pods running at once with only so much CPU and mem anyway.

50 seems like a reasonable place to start. If you have a way to do so, maybe you could check the max number of grid jobs scheduled by any tool currently and base it on that.

@aborrero if all that sounds reasonable, I can try to start implementing it quick before you are done with the beta so we get feedback.

sounds reasonable!

The 50 limit also means that we should probably do a serious scaling of the k8s cluster before/while we migrate away from the old grid.

50 seems like a reasonable place to start. If you have a way to do so, maybe you could check the max number of grid jobs scheduled by any tool currently and base it on that.

It's 50 on the grid. It should be noted that a grid job is much lighter than a k8s pod because it doesn't come with one of our really huge container images; that said, by only using a few images we get a lot of deduplication. We are working on slimming those down a bit. I'm thinking roughly 50 for the schedules, with maybe 10 for concurrent pods/jobs (instead of the current 4) to start. It is much easier to process increases for people who need more than that than it is on the grid, which is why the grid limits are artificially high. There are actually only 2 or 3 users who have ever hit the limit there, and the limit was really created for them.

The grid is also not a very healthy environment and requires regular manual work to keep going, so it is not always the best basis for such things.

@aborrero: we will probably need to scale the cluster up to something like the tools-sgeexec cluster size in general to really move most people, anyway, so yeah :)

This is all figuring that a cronjob = the schedule, vs. a pod, which = something actually running concurrently. For one-off jobs, we can set the limit high to prevent DoS, but the pod limit should be the real limit there.
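
Putting those illustrative numbers in quota terms (nothing here is final; the pods and cronjobs values are the ones floated above, and the jobs value is just a placeholder for "high"):

    pods: "10"                    # actually-concurrent things
    count/cronjobs.batch: "50"    # schedules
    count/jobs.batch: "100"       # one-off jobs; set high to prevent DoS, pods is the real limit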

The main constraint I see in putting pods/jobs at 50 is that our images are too thick. If a person is using only tf-buster-std, they can easily get away with running more jobs while filling up less disk on the workers than with tf-python37. While we could experiment with producing a smaller Python container or something like that, perhaps a new default limit of 10 would be sensible?

Part of the problem is that we use the same images for interactive mode (which includes all your favorite $EDITORs) as for execution. The current toolforge-buster-sssd image is 724 MB; when I strip out emacs, vim and nano, it's down to 408 MB. Stripping out gawk, git, jq, curl, less and sed brings us down to 324 MB, 55% smaller.

We could have -slim images that drop all these tools and have toolforge-jobs use them by default. If you pass some --debug flag to the job, it would use the full container so you can attach to it for debugging.

Depending on how far we want to trim, the various -dev packages could be replaced with just the shared library (e.g. libldap2 instead of libldap2-dev), since they shouldn't be needed at runtime, only at compile/install time.

Removing locales gets us down to 81.4 MB. Could we mount the locales in from the host instead of shipping them in every container, when they're basically static?

Sidenote: I think pywikibot uses git to identify what version people are on, but that would be a good reason to include it in the Pywikibot container and promote usage of that.

We also want people to spread out their cronjob object schedules, so that might work.

Does k8s support something like systemd's RandomizedDelaySec=? I have some bots that need to run hourly, but it doesn't matter what time in the hour they run; I usually put them at the top of the hour because that's easier than picking a random minute. If k8s doesn't do it, the toolforge-jobs command could interpret some @randomdaily cron syntax and substitute in a random time that still runs daily.
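
For illustration, the hypothetical @randomdaily could simply mean toolforge-jobs picking a minute/hour once at creation time and writing a normal cron expression into the CronJob, e.g.:

spec:
  schedule: "37 3 * * *"   # still daily, just not at midnight or the top of the hour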

nskaggs moved this task from Inbox to Soon! on the cloud-services-team (Kanban) board.

Change 711728 had a related patch set uploaded (by Bstorm; author: Bstorm):

[labs/tools/maintain-kubeusers@master] jobs service: add count quotas for a bunch of related objects

https://gerrit.wikimedia.org/r/711728

Change 711728 merged by jenkins-bot:

[labs/tools/maintain-kubeusers@master] jobs service: add count quotas for a bunch of related objects

https://gerrit.wikimedia.org/r/711728

Does k8s support something like systemd's RandomizedDelaySec=? I have some bots that need to run hourly, but it doesn't matter what time in the hour they run; I usually put them at the top of the hour because that's easier than picking a random minute. If k8s doesn't do it, the toolforge-jobs command could interpret some @randomdaily cron syntax and substitute in a random time that still runs daily.

Not to my knowledge, unfortunately. Mind you, I haven't looked in a bit.

Mentioned in SAL (#wikimedia-cloud) [2021-09-02T01:02:03Z] <bstorm> deployed new version of maintain-kubeusers with new count quotas for new tools T286784

Still to do: a backfill script.

Mentioned in SAL (#wikimedia-cloud) [2021-09-03T22:36:23Z] <bstorm> backfilling quotas in screen for T286784

That's done. We can always make adjustments later and folks can request increases on a case-by-case basis as well.