(some) Gitlab builds hanging
Closed, Resolved, Public

Description

Hi,

I've been looking at the problems of packaging trafficserver; those are mostly out of scope here, but it does seem that builds that used to work are now hanging.

The last successful build was pipeline 35775, 5th January 2024, building commit 6f4ce348 (the build step taking 10m37s). One can't just re-run that, though, as the CI steps needed have subsequently changed. But the same code now reliably fails to build - what you observe is that the build gets so far and then just stops until the timeout comes along.

For examples of failures, see job 329574, job 329639 (observed to have hung after 32m), job 329657 (observed to have hung after 9m45s), all of which use memory-optimized runners, and job 329641 which used a regular one and I noticed had hung after 1m33s.

Obviously this could be bugs in the build system, but the fact that the code built back in January and now does not (and the fact that the build hangs rather than something failing) makes me suspect the infrastructure instead. I'd expect resource exhaustion to result in a more obvious failure than a hang, but I really don't know...

Details

Related Changes in GitLab:
Title: production/prod.tfvars: Increase runner_memopt_memory_limit to 5700Mi
Reference: repos/releng/gitlab-cloud-runner!396
Author: dancy
Source Branch: main-I743812c6da363684268d8ff25589cbf3439af191
Dest Branch: main

Event Timeline

As far as I can tell, all of the linked jobs used the Digital Ocean Kubernetes runners, and all jobs fail because of system errors with the Kubernetes Pod. So, this is most likely a Kubernetes issue. I was not able to find logs or events for the old pods. My guess is that autoscaling removed one of the nodes, which then caused the pods to be removed as well. Maybe RelEng can help troubleshoot this more.

A workaround could be to use the Wikimedia Cloud runners. You can force jobs to run on the Wikimedia runners by setting tags: [wmcs] in your job definition, similar to this example.

Thanks for that suggestion - I tried a wmcs build, and it completed OK.

I'm now re-trying the memory-optimized build...

Build 334109 has hung again. So either I'm really unlucky or there's a consistent problem with the Kubernetes runners.

If it's useful for debugging, feel free to re-run pipeline 68330 ("CI: specify memory-optimized tag") as much as you like :)

[After just over an hour it did go on to fail, with: ERROR: Job failed (system failure): pods "runner-9e2abdumz-project-1859-concurrent-0-wmwlr4zv" not found]

I re-ran the 68330 pipeline, but it failed with the same error. I did some troubleshooting in the DigitalOcean Kubernetes cluster.

The job pod ran for about 10 minutes and was terminated - OOMKilled (exit code: 137). The pod requested 3.6GB of memory and 2.1 CPUs. It peaked at around 3.6GB RAM and 7.2 CPUs.
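
For readers unfamiliar with exit code 137: by the common convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9. The sketch below is only an illustration of that arithmetic; `describe_exit_code` is a hypothetical helper, not part of GitLab or Kubernetes tooling, and the signal name lookup assumes a Linux host. Note that the code alone only shows a SIGKILL; the OOMKilled reason comes from the pod status reported by Kubernetes.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit status using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


# 137 - 128 = 9, which is SIGKILL on Linux.
print(describe_exit_code(137))
```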

This appears to be a resource issue. I'm not sure if we can create even larger runner pods. @dancy, do you think it makes sense to increase the requested memory of the memory-optimized pods even more?

I increased the memory limit to 5700Mi but that did not resolve the problem. The information I need is how much memory this job actually needs to complete. That said, if the WMCS runners are adequate, using them seems like the best solution.
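
To put the 5700Mi limit next to the roughly 3.6GB peak mentioned earlier, here is a rough unit conversion. It is a sketch only: `quantity_to_bytes` is a hypothetical helper that handles just a few of the Kubernetes quantity suffixes, and it is not clear from the thread whether the 3.6GB figure was decimal or binary.

```python
# Binary (Ki, Mi, Gi) and decimal (k, M, G) multipliers for a few of the
# suffixes used in Kubernetes memory quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "k": 10**3, "M": 10**6, "G": 10**9,
}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity such as '5700Mi' to a number of bytes."""
    # Check two-letter suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            number = float(quantity[: -len(suffix)])
            return int(number * _SUFFIXES[suffix])
    # No recognised suffix: treat the value as plain bytes.
    return int(quantity)


limit = quantity_to_bytes("5700Mi")
print(limit)                     # 5976883200
print(round(limit / 10**9, 2))   # 5.98 (decimal GB)
print(round(limit / 2**30, 2))   # 5.57 (GiB)
```

So the new limit is a bit under 6 GB, and the job was still being killed at that size.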

Btw, the job timing out instead of failing outright after the pod is killed is bad behavior on the part of the Kubernetes gitlab-runner. The build itself actually fails about 10 minutes after starting.

@MatthewVernon Can you give me an update on the status of your gitlab builds? Have you managed to work around the original problems?

@dancy it's been a while now, but I think we just moved trafficserver to specify tags: [wmcs] for all its jobs and that worked around the issue.

dancy claimed this task.

Great! I will consider this matter resolved then.