
Unable to connect to Beta Cluster from WMCS GitLab CI runners
Closed, Resolved · Public

Description

As part of migrating m3api to Wikimedia GitLab, I’m currently working on porting the CI from GitHub Actions to GitLab CI. Some of the repositories (m3api-botpassword, later also m3api-oauth2) run tests against the Beta Cluster, and currently I’m running into issues there. When trying to use a WMCS runner (via tags: [ "wmcs" ]), the build fails with a “Connection refused” error:

TypeError: fetch failed
 at node:internal/deps/undici/undici:12618:11
 at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
 at async NodeSession.internalGet (file:///builds/repos/m3api/tmp-m3api-botpassword/node_modules/m3api/fetch.js:34:20)
 at async NodeSession.request (file:///builds/repos/m3api/tmp-m3api-botpassword/node_modules/m3api/core.js:561:28)
 at async NodeSession.requestAndContinue (file:///builds/repos/m3api/tmp-m3api-botpassword/node_modules/m3api/core.js:625:21)
 at async NodeSession.getToken (file:///builds/repos/m3api/tmp-m3api-botpassword/node_modules/m3api/core.js:706:22)
 at async NodeSession.request (file:///builds/repos/m3api/tmp-m3api-botpassword/node_modules/m3api/combine.js:28:4)
 at async login (file:///builds/repos/m3api/tmp-m3api-botpassword/index.js:109:19)
 at async Context.<anonymous> (file:///builds/repos/m3api/tmp-m3api-botpassword/test/integration/index.test.js:86:24)
Caused by: Error: connect ECONNREFUSED 172.16.3.164:443
 at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1555:16)
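
For reference, a job selects the WMCS runners with a tags entry roughly like the following. This is a minimal sketch only: the job name, image, and script lines are illustrative, not the actual m3api CI configuration.

```yaml
# Hypothetical .gitlab-ci.yml fragment; only the `tags` entry is taken
# from the task description above, everything else is illustrative.
integration-test:
  image: node:20
  tags:
    - wmcs
  script:
    - npm ci
    - npm run test:integration
```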

@bd808 says this is probably because the GitLab runners are firewalled off from the rest of Cloud VPS, yet they still have a DNS setup that returns private IP addresses (the 172.16.3.164 above).
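
The address in the error is indeed in private space, which can be confirmed with a quick check using Python's stdlib ipaddress module (illustrative only):

```python
# Check that the address from the ECONNREFUSED error falls in RFC 1918
# private space (172.16.0.0/12), i.e. it is only routable inside Cloud VPS.
import ipaddress

addr = ipaddress.ip_address("172.16.3.164")
print(addr.is_private)  # True
print(addr in ipaddress.ip_network("172.16.0.0/12"))  # True
```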

(FTR, trying to use non-WMCS runners fails with a different error – some kind of HTML being returned rather than JSON – which I haven’t investigated yet, but that’s not in scope for this task.)

Event Timeline

(FTR, trying to use non-WMCS runners fails with a different error – some kind of HTML being returned rather than JSON – which I haven’t investigated yet, but that’s not in scope for this task.)

FWIW, I tried to investigate this now and instead found that non-WMCS runners currently work fine. Which is nice, if a bit irritating.

So, while in general I think it would make sense for m3api to use WMCS runners (and for this task to be fixed), it’s probably not a blocker for me at the moment.

According to https://wikitech.wikimedia.org/wiki/GitLab/Gitlab_Runner/Security_Evaluation#Firewall_configuration,

If access to a specific WMCS service is needed, the hostname can be added to profile::gitlab::runner::allowed_services. See T317341 for more information.

Would that be acceptable for deployment-prep, or does anything in there rely on the network being private? (IIUC the hostnames to add would be deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud and maybe also deployment-cache-upload08.deployment-prep.eqiad1.wikimedia.cloud, though I’m a bit worried about the prospect of Someone™ having to update those in Puppet each time a new server is set up.)

I'm not really sure that I agree with the analysis by @Jelto from T317341#8324825 that traffic allowed by OpenStack security groups should be further restricted by ferm rules installed on the shared runners themselves. My argument would be that access to GitLab in its entirety as a contributor requires the same level of vetting as becoming a Toolforge maintainer, and becoming a Toolforge maintainer is one of the lower-friction ways of getting past the GitLab account vetting process. Once you are a Toolforge maintainer (or a member of any other Cloud VPS project), you have access to the same network that you can reach from inside a GitLab runner.

The threat as stated is that ICMP traffic could be abused to explore community services and other WMCS projects. It is true that ping can be used in this network, but https://openstack-browser.toolforge.org/ already makes the project topology and IP addresses of all instances public information. See https://openstack-browser.toolforge.org/server/runner-1030.gitlab-runners.eqiad1.wikimedia.cloud as an example of the data we already publish for anyone on the internet to see.

T374129: openstack: consider removing labs-ip-aliaser is the task to remove the split horizon DNS resolution magic from Cloud VPS generally. It is stalled at the moment, so not super likely to be the answer to this particular problem.

According to https://wikitech.wikimedia.org/wiki/GitLab/Gitlab_Runner/Security_Evaluation#Firewall_configuration,

If access to a specific WMCS service is needed, the hostname can be added to profile::gitlab::runner::allowed_services. See T317341 for more information.

Would that be acceptable for deployment-prep, or does anything in there rely on the network being private? (IIUC the hostnames to add would be deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud and maybe also deployment-cache-upload08.deployment-prep.eqiad1.wikimedia.cloud, though I’m a bit worried about the prospect of Someone™ having to update those in Puppet each time a new server is set up.)

These two hostnames can be added to profile::gitlab::runner::allowed_services. Changing the hostnames also requires a Puppet change, so yes, this is quite painful. We could introduce additional DNS entries and use those; then you could manage the DNS entries on your own.
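
For example, a project-managed DNS record could point at the current cache host, so that only the record's target changes when a new server is built. The record name below is made up for illustration; only the CNAME target is taken from the discussion above.

```
; Hypothetical stable alias in the project's wmcloud.org zone
beta-cdn.deployment-prep.wmcloud.org. 300 IN CNAME deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud.
```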

I'm not really sure that I agree with the analysis by @Jelto from T317341#8324825

Most security groups in WMCS projects are too loose and also allow access beyond ICMP. Restricting access between the gitlab-runners project and other projects was a recommendation from a security review we did three years ago: T317341: Findings in Security Readiness Reviews of Trusted GitLab Runners.

The current process of explicitly allow-listing WMCS services is confusing and painful. I'm open to discussing dropping this rule in a dedicated task and getting sign-off from WMCS/management.

The current process of explicitly allow-listing WMCS services is confusing and painful. I'm open to discussing dropping this rule in a dedicated task and getting sign-off from WMCS/management.

Let's take that discussion to {T397888}.

According to https://wikitech.wikimedia.org/wiki/GitLab/Gitlab_Runner/Security_Evaluation#Firewall_configuration,

If access to a specific WMCS service is needed, the hostname can be added to profile::gitlab::runner::allowed_services. See T317341 for more information.

Would that be acceptable for deployment-prep, or does anything in there rely on the network being private? (IIUC the hostnames to add would be deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud and maybe also deployment-cache-upload08.deployment-prep.eqiad1.wikimedia.cloud, though I’m a bit worried about the prospect of Someone™ having to update those in Puppet each time a new server is set up.)

I've been thinking about this a bit and I think we can and should put the public service names into Hiera. The hostnames are resolved to IP addresses in the ferm rules (daddr (@resolve(${allowed_service['host']})) proto ${proto} dport ${allowed_service['port']} ACCEPT;), and this resolution will produce the correct private address based on where the service name is bound. We know that all Beta Cluster ingress is handled by the cache servers, so we don't need to enumerate the cache hosts individually; just adding en.wikipedia.beta.wmcloud.org should cover them all. I'll prepare a patch for review.
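
Based on the ferm template quoted above, which reads allowed_service['host'] and allowed_service['port'], the Hiera data presumably looks something like the following sketch. The exact key layout in operations/puppet may differ.

```yaml
# Hypothetical Hiera sketch; key names inferred from the ferm template,
# not copied from operations/puppet.
profile::gitlab::runner::allowed_services:
  - host: puppet-enc.cloudinfra.wmcloud.org
    port: 443
  - host: en.wikipedia.beta.wmcloud.org
    port: 443
```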

Change #1166262 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] gitlab: Allow WMCS runners to talk to deployment-prep wikis

https://gerrit.wikimedia.org/r/1166262

Change #1166262 merged by Dzahn:

[operations/puppet@production] gitlab: Allow WMCS runners to talk to deployment-prep wikis

https://gerrit.wikimedia.org/r/1166262

Mentioned in SAL (#wikimedia-cloud) [2025-07-08T00:26:58Z] <mutante> gitlab-runners-puppetserver-01 cd /srv/git/operations/puppet ; sudo -u gitpuppet git pull | after deploying gerrit:1166262 & gerrit:1166263 | run puppet on runner-1033 which created /etc/ferm/conf.d/18_docker-allow-cloudinfra-puppet-enc, open firewall holes for T397591

I deployed Bryan's changes that had already been reviewed by Jelto.

After merging on the prod puppetserver, I updated the local gitlab-runners-puppetserver.

Then ran puppet on runner-1033 first.

It created /etc/ferm/conf.d/18_docker-allow-cloudinfra-puppet-enc and /etc/ferm/conf.d/18_docker-allow-deployment-prep-cdn which look like:

domain (ip ip6) {
	table filter {
		chain DOCKER-ISOLATION {
			daddr (@resolve(puppet-enc.cloudinfra.wmcloud.org)) proto tcp dport 443 ACCEPT;
		}
	}
}

...

domain (ip ip6) {
	table filter {
		chain DOCKER-ISOLATION {
			daddr (@resolve(en.wikipedia.beta.wmcloud.org)) proto tcp dport 443 ACCEPT;
		}
	}
}

To see an actual change in iptables -L DOCKER-ISOLATION I had to systemctl restart ferm.

The DNS name for the cloudinfra rule ends up being enc-1.cloudinfra.eqiad1.wikimedia.cloud because that's the reverse record for 172.16.7.53.

And the DNS name for the deployment-prep rule is deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud because that's the reverse record for 172.16.3.164.

iptables -L DOCKER-ISOLATION | grep infra
ACCEPT     tcp  --  anywhere             enc-1.cloudinfra.eqiad1.wikimedia.cloud  tcp dpt:https

...

iptables -L DOCKER-ISOLATION | grep deploy
ACCEPT     tcp  --  anywhere             deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud  tcp dpt:https
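
The hostnames in that iptables output come from reverse (PTR) lookups on the rule's destination addresses. The PTR query name for the cache server's address can be sketched with Python's stdlib; the actual answer, of course, comes from the Cloud VPS DNS:

```python
# Build the reverse-DNS (PTR) query name for the cache server's address.
# iptables -L shows hostnames because it performs exactly this lookup.
import ipaddress

addr = ipaddress.ip_address("172.16.3.164")
print(addr.reverse_pointer)  # 164.3.16.172.in-addr.arpa
```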

I am running puppet and restarting ferm manually on all gitlab-runner instances.

This should work now. ferm restarted on all wmcs gitlab-runners.

They all have the iptables rule for deployment-cache-text08 as destination now.

Can confirm that a build with the wmcs tag works now, thanks!

Anything left to do here before closing the task?

Dzahn claimed this task.

Thanks for confirming it works! Not that I know of. I will be bold and say it should just be reopened if there is anything left.