Create a separate memory optimized gitlab runner pool for memory hungry jobs
Closed, ResolvedPublic

Description

(Original thread on Slack at https://wikimedia.slack.com/archives/C05H0JYT85V/p1692975749455829 )

TL;DR: The recent change that moved untagged jobs to Kubernetes runners broke my CI pipeline, apparently because my CI uses >= 1GB of RAM. Can we bump that limit?

Longer:
I have a Gitlab project at https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump, with a simple CI pipeline like so https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/blob/main/.gitlab-ci.yml.
My builds started to fail recently, with no changes to my CI scripts. I get this weird output:

...
Running hooks in /etc/ca-certificates/update.d...
done.
$ source /opt/conda/etc/profile.d/conda.sh
$ conda env create -f conda-environment.yaml
Collecting package metadata (repodata.json): ...working... 
/opt/conda/etc/profile.d/conda.sh: line 35:  3677 Killed                  "$CONDA_EXE" $_CE_M $_CE_CONDA "$@"
...

So my process is being killed for some reason.
Example failure on main branch (which had previously succeeded): https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/135829#L446
Other example failure on separate branch where I first reproed this issue: https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/135826#L446

@dancy pointed out that untagged Gitlab jobs now use k8s runners, but that these runners are configured as follows:

cpu_limit = "1500m"
cpu_request = "250m"
memory_limit = "1Gi"
memory_request = "512Mi"

I can tag my job to get the former behavior, but we should consider bumping the memory limit. In Data Platform Engineering, we have many CI pipelines that run the same conda env create -f conda-environment.yaml, so they are likely to fail as well.
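
For reference, the failing pattern looks roughly like this (a simplified sketch, not the actual linked .gitlab-ci.yml; the job name, image, and environment name are illustrative):

test_conda_env:
  image: docker-registry.wikimedia.org/conda-base:latest   # illustrative image name
  script:
    - source /opt/conda/etc/profile.d/conda.sh
    # This is the step that gets killed under the 1Gi memory limit:
    - conda env create -f conda-environment.yaml
    - conda activate my-project-env                         # illustrative environment name
    - pytest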

See the archived Slack thread at https://wikimedia.slack.com/archives/C05H0JYT85V/p1692975749455829 for the discussion, but the TL;DR is that we can't simply add a separate [[runners]] section, since tags are seemingly not configurable per runner (only per registration). We can, however, parametrize the workload/requests/limits and perform two separate runner deployments.
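
To make that concrete, the two-deployment approach could look roughly like the following pair of values files (a hedged sketch only: the key names follow the upstream gitlab-runner Helm chart's runners.tags / runners.config conventions, the file names are invented, and the limits actually chosen for the new pool are in the MRs listed under Details):

# values-default.yaml (illustrative): existing pool, serves untagged jobs
runners:
  tags: ""
  config: |
    [[runners]]
      [runners.kubernetes]
        memory_request = "512Mi"
        memory_limit = "1Gi"

# values-memopt.yaml (illustrative): second deployment of the same chart
runners:
  tags: "memory-optimized"
  config: |
    [[runners]]
      [runners.kubernetes]
        memory_request = "2Gi"   # illustrative; see !250 for the real values
        memory_limit = "4Gi"     # illustrative; see !250 for the real values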

Details

Title | Reference | Author | Source Branch | Dest Branch
Use memory-optimized runners on .gitlab-ci.yml | repos/data-engineering/dumps/mediawiki-content-dump!9 | xcollazo | test-memory-optimized-runners | main
gitlab: Parameterize node toleration and selector | repos/releng/gitlab-cloud-runner!252 | dduvall | review/parameterize-runner-workload | main
gitlab: Revert resource name changes | repos/releng/gitlab-cloud-runner!251 | dduvall | review/revert-gitlab-resource-name-changes | main
gitlab-runner: Create memory optimized runners | repos/releng/gitlab-cloud-runner!250 | dduvall | review/memory-optimized-runner-pool | main

Event Timeline

dduvall triaged this task as Medium priority.

@xcollazo the memory optimized runner pool has been deployed. You can schedule jobs to this pool by using the tag memory-optimized, either at the pipeline level under default: or at the job level.
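
For example, in a project's .gitlab-ci.yml (a minimal sketch; the script line is illustrative):

# Pipeline level: every job inherits the tag
default:
  tags:
    - memory-optimized

# Or per job, e.g. for the heavier conda jobs only:
publish_conda_env:
  tags:
    - memory-optimized
  script:
    - conda env create -f conda-environment.yaml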

Scratch that. I just realized that I overlooked the node tolerations in the runner config. I'll have to do a follow-up. Sorry.
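
For reference, the missing piece is along these lines in the Kubernetes executor section of the runner config (a hedged sketch: node_selector and node_tolerations are the executor's documented config keys, but the label and taint names below are hypothetical; the actual change is in !252):

runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        # Schedule job pods onto the dedicated memory-optimized nodes...
        [runners.kubernetes.node_selector]
          "wikimedia.org/gitlab-runner-pool" = "memory-optimized"   # hypothetical node label
        # ...and tolerate the taint that keeps other workloads off them.
        [runners.kubernetes.node_tolerations]
          "dedicated=memory-optimized" = "NoSchedule"               # hypothetical taint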

@xcollazo alright, we should be good now. Test out the memory-optimized tag and let us know the results!

Some testing from my side:

My pytest job ran successfully: https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/137133

My somewhat heavier publish_conda_env release job seems to have stalled though: https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/137134

It was already publishing the artifact at 11 minutes in, but then it got a warning, presumably from k8s:

WARNING: Getting job pod status pods "runner-emdqpmey-project-1420-concurrent-02lhsm" is forbidden: User "system:serviceaccount:gitlab-runner:gitlab-runner-memopt" cannot get resource "pods" in API group "" in the namespace "gitlab-runner"

And even though these new pods have a timeout of 20m, the job is still running at 26m+ as of this comment and seems stalled. This job typically takes ~10 minutes.
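
For context, that warning means the gitlab-runner-memopt service account lacked RBAC permission to read pod status in the gitlab-runner namespace. The runner deployment normally creates that grant itself, so a Role/RoleBinding missing (or briefly absent mid-redeploy) along these lines would explain it (a hedged sketch, trimmed to the verbs relevant to the warning):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gitlab-runner-memopt   # name is illustrative
  namespace: gitlab-runner
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gitlab-runner-memopt   # name is illustrative
  namespace: gitlab-runner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gitlab-runner-memopt
subjects:
  - kind: ServiceAccount
    name: gitlab-runner-memopt
    namespace: gitlab-runner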

That's a strange error indeed, and definitely k8s related. It's very possible I was redeploying GitLab runners during that time to get some important config changes out. Could you re-run the job and see what happens?

Could you re-run the job and see what happens?

Certainly. Will report in a few.

Ok, it seems we did indeed hit a transient issue when I reproduced this a couple of days ago.

Both my jobs were successful this time:
https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/139559
and
https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/jobs/139560

This confirms that the new memory-optimized runners solve my issue.

@xcollazo: Checking in. Are your CI issues resolved with the use of the memory-optimized runner?

They are! This is solved from my side. Thanks for checking in!