Create a separate memory optimized gitlab runner pool for memory hungry jobs
Closed, ResolvedPublic

Description

(Original thread on Slack at https://wikimedia.slack.com/archives/C05H0JYT85V/p1692975749455829 )

TL;DR: The recent change that moved untagged jobs to Kubernetes runners broke my CI pipeline, apparently because my CI uses >= 1GB of RAM. Can we bump that limit?

Longer:
I have a Gitlab project at https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump, with a simple CI pipeline like so https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/main/.gitlab-ci.yml.
My builds started to fail recently, with no changes to my CI scripts. I get this weird output:

...
Running hooks in /etc/ca-certificates/update.d...
done.
$ source /opt/conda/etc/profile.d/conda.sh
$ conda env create -f conda-environment.yaml
Collecting package metadata (repodata.json): ...working... 
/opt/conda/etc/profile.d/conda.sh: line 35:  3677 Killed                  "$CONDA_EXE" $_CE_M $_CE_CONDA "$@"
...

So my process is being killed for some reason.
Example failure on main branch (which had previously succeeded): https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/135829#L446
Other example failure on separate branch where I first reproed this issue: https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/135826#L446

@dancy pointed out that untagged Gitlab jobs now use k8s runners, but that these runners are configured as follows:

cpu_limit = "1500m"
cpu_request = "250m"
memory_limit = "1Gi"
memory_request = "512Mi"

I can tag my job to get the former behavior, but we should consider bumping the memory limit. In Data Platform Engineering, we have many CI pipelines that run the same conda env create -f conda-environment.yaml, so they are likely to fail as well.
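
For reference, the failing pattern looks roughly like this (a simplified sketch, not the actual linked .gitlab-ci.yml; the job name, image, and environment name are illustrative):

test_conda_env:
  image: docker-registry.wikimedia.org/conda-base:latest   # illustrative image name
  script:
    - source /opt/conda/etc/profile.d/conda.sh
    # This is the step that gets killed under the 1Gi memory limit:
    - conda env create -f conda-environment.yaml
    - conda activate my-project-env                         # illustrative environment name
    - pytest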

See the archived Slack thread at https://wikimedia.slack.com/archives/C05H0JYT85V/p1692975749455829 for the discussion, but the TL;DR is that we can't simply add a separate [[runners]] section, since tags are seemingly not configurable per runner (only per registration). We can, however, parametrize the workload/requests/limits and perform two separate runner deployments.
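
To make that concrete, the two-deployment approach could look roughly like the following pair of values files (a hedged sketch only: the key names follow the upstream gitlab-runner Helm chart's runners.tags / runners.config conventions, the file names are invented, and the limits actually chosen for the new pool are in the MRs listed under Details):

# values-default.yaml (illustrative): existing pool, serves untagged jobs
runners:
  tags: ""
  config: |
    [[runners]]
      [runners.kubernetes]
        memory_request = "512Mi"
        memory_limit = "1Gi"

# values-memopt.yaml (illustrative): second deployment of the same chart
runners:
  tags: "memory-optimized"
  config: |
    [[runners]]
      [runners.kubernetes]
        memory_request = "2Gi"   # illustrative; see !250 for the real values
        memory_limit = "4Gi"     # illustrative; see !250 for the real values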

Details

Title | Reference | Author | Source Branch | Dest Branch
Use memory-optimized runners on .gitlab-ci.yml | repos/data-engineering/dumps/mediawiki-content-dump!9 | xcollazo | test-memory-optimized-runners | main
gitlab: Parameterize node toleration and selector | repos/releng/gitlab-cloud-runner!252 | dduvall | review/parameterize-runner-workload | main
gitlab: Revert resource name changes | repos/releng/gitlab-cloud-runner!251 | dduvall | review/revert-gitlab-resource-name-changes | main
gitlab-runner: Create memory optimized runners | repos/releng/gitlab-cloud-runner!250 | dduvall | review/memory-optimized-runner-pool | main

Event Timeline

dduvall triaged this task as Medium priority.

@xcollazo the memory optimized runner pool has been deployed. You can schedule jobs to this pool by using the tag memory-optimized, either at the pipeline level under default: or at the job level.
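
For example, in a project's .gitlab-ci.yml (a minimal sketch; the script line is illustrative):

# Pipeline level: every job inherits the tag
default:
  tags:
    - memory-optimized

# Or per job, e.g. for the heavier conda jobs only:
publish_conda_env:
  tags:
    - memory-optimized
  script:
    - conda env create -f conda-environment.yaml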

Scratch that. I just realized that I overlooked the node tolerations in the runner config. I'll have to do a follow-up. Sorry.
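
For reference, the missing piece is along these lines in the Kubernetes executor section of the runner config (a hedged sketch: node_selector and node_tolerations are the executor's documented config keys, but the label and taint names below are hypothetical; the actual change is in !252):

runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        # Schedule job pods onto the dedicated memory-optimized nodes...
        [runners.kubernetes.node_selector]
          "wikimedia.org/gitlab-runner-pool" = "memory-optimized"   # hypothetical node label
        # ...and tolerate the taint that keeps other workloads off them.
        [runners.kubernetes.node_tolerations]
          "dedicated=memory-optimized" = "NoSchedule"               # hypothetical taint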

@xcollazo alright, we should be good now. Test out the memory-optimized tag and let us know the results!

Some testing from my side:

My pytest job ran successfully: https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/137133

My somewhat heavier publish_conda_env release job seems to have stalled though: https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/137134

It was already publishing the artifact at 11 minutes in, but then it got a warning, presumably from k8s:

WARNING: Getting job pod status pods "runner-emdqpmey-project-1420-concurrent-02lhsm" is forbidden: User "system:serviceaccount:gitlab-runner:gitlab-runner-memopt" cannot get resource "pods" in API group "" in the namespace "gitlab-runner"

And even though these new pods have a timeout of 20m, the job is still running at 26m+ as of this comment and seems stalled. This job typically takes ~10 minutes.
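
For context, that warning means the gitlab-runner-memopt service account lacked RBAC permission to read pod status in the gitlab-runner namespace. The runner deployment normally creates that grant itself, so a Role/RoleBinding missing (or briefly absent mid-redeploy) along these lines would explain it (a hedged sketch, trimmed to the verbs relevant to the warning):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gitlab-runner-memopt   # name is illustrative
  namespace: gitlab-runner
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gitlab-runner-memopt   # name is illustrative
  namespace: gitlab-runner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gitlab-runner-memopt
subjects:
  - kind: ServiceAccount
    name: gitlab-runner-memopt
    namespace: gitlab-runner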

That's a strange error indeed, and definitely k8s related. It's very possible I was redeploying GitLab runners during that time to get some important config changes out. Could you re-run the job and see what happens?

Could you re-run the job and see what happens?

Certainly. Will report in a few.

Ok, it seems we did indeed hit a transient issue when I reproduced this a couple of days ago.

Both my jobs were successful this time:
https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/139559
and
https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/139560

This confirms that the new memory-optimized runners solve my issue.

@xcollazo: Checking in. Are your CI issues resolved with the use of the memory-optimized runner?

They are! This is solved from my side. Thanks for checking in!