GitLab Runners in WMCS are offline
Closed, Resolved, Public

Assigned To: aborrero

Authored By: LSobanski, Sep 11 2023, 4:17 PM

Description

GitLab Runners in WMCS are showing as offline in the GitLab UI and are also unreachable via SSH. Timing indicates that this is related to T342621: eqiad1: cloudlb: transition DNS clients (VMs) to the new BGP-based recursor VIP.

Firewall rules have been updated in https://gerrit.wikimedia.org/r/c/operations/puppet/+/956463 but applying them on the hosts is failing.

Details

	Subject	Repo	Branch	Lines +/-
	gitlab_runner: change docker_subnet in WMCS	operations/puppet	production	+2 -2


Related Objects

Status	Assigned	Task
Resolved	aborrero	T296411 cloud: decide on general idea for having cloud-dedicated hardware provide service in the cloud realm & the internet
Resolved	aborrero	T297596 have cloud hardware servers in the cloud realm using a dedicated LB layer
Resolved	Jelto	T338130 cloud-private: CIDR clash with gitlab-runners
Resolved	aborrero	T346060 GitLab Runners in WMCS are offline

Event Timeline

LSobanski created this task. Sep 11 2023, 4:17 PM

Restricted Application added a subscriber: Aklapper. Sep 11 2023, 4:17 PM

LSobanski added subscribers: Jelto, aborrero, Andrew. Sep 11 2023, 4:18 PM

LSobanski added subscribers: Arnoldokoth, eoghan.

There was a suggestion that T338130: cloud-private: CIDR clash with gitlab-runners may be related.

dancy subscribed. Sep 11 2023, 4:42 PM

Change 956784 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] gitlab_runner: change docker_subnet in WMCS

https://gerrit.wikimedia.org/r/956784

gerritbot added a project: Patch-For-Review. Sep 12 2023, 8:13 AM

P52422 GitLab Runners to run Puppet on

runner-1021.gitlab-runners.eqiad1.wikimedia.cloud
runner-1022.gitlab-runners.eqiad1.wikimedia.cloud
runner-1023.gitlab-runners.eqiad1.wikimedia.cloud
runner-1024.gitlab-runners.eqiad1.wikimedia.cloud
runner-1025.gitlab-runners.eqiad1.wikimedia.cloud
runner-1026.gitlab-runners.eqiad1.wikimedia.cloud
runner-1027.gitlab-runners.eqiad1.wikimedia.cloud
runner-1028.gitlab-runners.eqiad1.wikimedia.cloud
runner-1029.gitlab-runners.eqiad1.wikimedia.cloud
runner-1030.gitlab-runners.eqiad1.wikimedia.cloud
gitlab-runner-1002.devtools.eqiad1.wikimedia.cloud
gitlab-runner-1003.devtools.eqiad1.wikimedia.cloud

Change 956784 merged by Jelto:

[operations/puppet@production] gitlab_runner: change docker_subnet in WMCS

https://gerrit.wikimedia.org/r/956784

Maintenance_bot removed a project: Patch-For-Review. Sep 12 2023, 8:30 AM

aborrero added a parent task: T338130: cloud-private: CIDR clash with gitlab-runners. Sep 12 2023, 10:38 AM

I hadn't realised we had a potential clash here. Unsure exactly what the answer is.

Assuming the affected machines running docker containers are VMs on 172.16.0.0/24, you could add a workaround to improve the situation until Jelto's patch above is rolled out and working everywhere: add static routes for the unreachable IPs via the gateway, e.g.

ip route add 172.20.255.1/32 via 172.16.0.1

The more specific mask on that route would take precedence over the range assigned to the local docker0 bridge, so traffic should reach the affected (non-docker) hosts that use 172.20.x.x addressing.
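
For illustration only, the precedence rule described above (the longest matching prefix wins) can be sketched in plain shell. This is not how the runners decide anything; on a real host, `ip route get 172.20.255.1` would show the route the kernel actually picks. The helper names `ip_to_int`, `in_cidr` and `best_route` are made up for this sketch, and the 172.20.0.0/16 entry assumes the docker bridge route is the one deleted later in this task.

```shell
#!/bin/sh
# Sketch: pick the most specific (longest-prefix) route matching a destination.
# All function names here are hypothetical helpers, not part of iproute2.

ip_to_int() {
  # Split the dotted quad on "." and combine the four octets into one integer.
  old_ifs=$IFS; IFS=.
  set -- $1
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

in_cidr() {
  # usage: in_cidr ADDRESS NETWORK PREFIXLEN
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "$2")
  mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
}

best_route() {
  # usage: best_route DEST CIDR...
  # Prints the matching CIDR with the longest prefix, or nothing if none match.
  dest=$1; shift
  best=""; best_len=-1
  for cidr in "$@"; do
    len=${cidr#*/}
    if in_cidr "$dest" "${cidr%/*}" "$len" && [ "$len" -gt "$best_len" ]; then
      best=$cidr; best_len=$len
    fi
  done
  echo "$best"
}

# Assumed routing table: docker bridge /16, the suggested static /32, default route.
best_route 172.20.255.1 172.20.0.0/16 172.20.255.1/32 0.0.0.0/0
# prints 172.20.255.1/32
```

Any other 172.20.x.x address would still match only the /16 and stay on the docker bridge, which is why this workaround needs one static route per unreachable IP.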

Solved with:

user@laptop:~$ ssh -o StrictHostKeyChecking=no root@gitlab-runner-1003.devtools.eqiad1.wikimedia.cloud "ip route delete 172.20.0.0/16 ; run-puppet-agent"

on all of the affected hosts.
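
A rough sketch of how the same command could be repeated across a host list, assuming the hostnames (for example those in P52422) are saved one per line in a file. This is not necessarily how it was actually run; `fix_runners` and the file name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: run the fix from this task on every host listed in a file.
# fix_runners is a hypothetical helper; the remote command is the one quoted above.

fix_runners() {
  # usage: fix_runners HOSTS_FILE   (one hostname per line, blank lines skipped)
  while IFS= read -r host; do
    [ -n "$host" ] || continue
    # -n stops ssh from reading stdin, which would swallow the rest of the host list.
    ssh -n -o StrictHostKeyChecking=no "root@${host}" \
      "ip route delete 172.20.0.0/16 ; run-puppet-agent" \
      || echo "failed: ${host}" >&2
  done < "$1"
}

# Example: fix_runners runners.txt
```

Hosts are handled one at a time, and a failure on one host is reported on stderr without stopping the loop.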

Thanks @aborrero for fixing all WMCS runners!

In addition to the workaround, it was necessary to delete and re-create the gitlab-runner docker network. The following command was used:

systemctl stop docker-resource-monitor.service ; systemctl stop buildkitd.service ; docker network rm gitlab-runner ; run-puppet-agent

The network looks good on all WMCS runners now:

docker network inspect gitlab-runner
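
As a hedged sketch, the check could be narrowed to the subnet alone using the `--format` option of `docker network inspect`. The helpers `runner_subnets` and `subnet_is_clear` are hypothetical, and treating any `172.20.` subnet as "still clashing" is an assumption based on the route deleted earlier, not something this task states about the new `docker_subnet` value.

```shell
#!/bin/sh
# Sketch: confirm the gitlab-runner network no longer uses the old range.
# Function names and the default 172.20. prefix are assumptions for illustration.

runner_subnets() {
  # Print every subnet configured on the gitlab-runner network, space separated.
  docker network inspect --format '{{range .IPAM.Config}}{{.Subnet}} {{end}}' gitlab-runner
}

subnet_is_clear() {
  # usage: subnet_is_clear [PREFIX]
  # Returns 0 if no subnet starts with PREFIX, 1 if one does, 2 if docker failed.
  prefix=${1:-172.20.}
  subnets=$(runner_subnets) || return 2
  for subnet in $subnets; do
    case $subnet in
      "$prefix"*) echo "still clashing: $subnet" >&2; return 1 ;;
    esac
  done
  return 0
}

# Example: subnet_is_clear && echo "gitlab-runner network looks good"
```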
