jenkins-slave@contint1001 not a member of docker group (CI tests for mathoid broken)
Closed, Resolved · Public

Description

It occurs to me that the CI tests for mathoid are broken.
cf. https://gerrit.wikimedia.org/r/#/c/409014/
Jenkins complains about permission problems

Root cause:

The puppet admin module enforces group memberships, but the unix user jenkins-slave is not managed by it. Hence, the user gets removed from the docker unix group:

Notice: /Stage[main]/Profile::Ci::Docker/Exec[jenkins user docker membership]/returns: executed successfully
Notice: /Stage[main]/Admin/Admin::Groupmembers[contint-docker]/Exec[docker_ensure_members]/returns: executed successfully

Details

Related Gerrit Patches:
operations/puppet : production · Force jenkins-slave being member of docker

Event Timeline

Restricted Application added a subscriber: Aklapper. Feb 8 2018, 1:07 PM
hashar added subscribers: akosiaris, hashar.

https://integration.wikimedia.org/ci/job/service-pipeline-test-only/28/flowGraphTable/

Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock

The jenkins-slave unix user on contint1001 is no longer a member of the docker unix group:

contint1001$ id -Gn jenkins-slave
jenkins-slave
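
The same check can be scripted. The following is a minimal sketch, not something taken from the puppet repository: the helper names list_has_group and user_in_group are hypothetical, and the only assumption is that `id -Gn <user>` prints a space-separated list of group names, as in the output above.

```shell
#!/bin/sh
# Return 0 if group $2 appears in the space-separated group list $1.
# The list is expected in the format printed by `id -Gn <user>`.
list_has_group() {
    for g in $1; do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}

# Return 0 if the account database lists user $1 as a member of group $2,
# 1 if it does not, and 2 if the user cannot be looked up.
user_in_group() {
    grp_list=$(id -Gn "$1" 2>/dev/null) || return 2
    list_has_group "$grp_list" "$2"
}

# Example with the output shown above: jenkins-slave only has its own group.
if list_has_group "jenkins-slave" "docker"; then
    echo "member of docker"
else
    echo "NOT a member of docker"
fi
```

On the host itself, `user_in_group jenkins-slave docker` would perform the live lookup. Matching whole words avoids a false positive on a group whose name merely contains "docker".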

Running puppet:

contint1001# run-puppet-agent
Notice: /Stage[main]/Profile::Ci::Docker/Exec[jenkins user docker membership]/returns: executed successfully
Notice: /Stage[main]/Admin/Admin::Groupmembers[contint-docker]/Exec[docker_ensure_members]/returns: executed successfully

Exec[jenkins user docker membership] runs /usr/sbin/usermod -aG docker 'jenkins-slave', i.e. it adds the user to the group.

I have confirmed the user gets properly added, but that then seems to get overridden by our admin module:

modules/admin/data/data.yaml
groups:
  contint-docker:
    posix_name: docker # Use posix_name to avoid potential conflicts with other uses of the docker group
    description: Allow releng team to be in the docker group for contint. No gid on purpose
    members: [*ops_members, dduvall, demon, gjg, hashar, thcipriani, twentyafterfour, zfilipin, legoktm, addshore]

Maybe @akosiaris would know?

hashar renamed this task from CI tests for mathoid broken to jenkins-slave@contint1001 not a member of docker group (CI tests for mathoid broken). Feb 8 2018, 3:24 PM

I fail to see how that ever worked. admin::groupmembers makes sure the unix group only has the members defined in the yaml file, but jenkins-slave is not in there.

Or maybe puppet used to run our custom Exec[jenkins user docker membership] AFTER admin::groupmembers. We could potentially add a before/require statement in puppet to enforce that. But it is still a race condition.
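
To make the ordering problem concrete, here is a small simulation. It touches nothing on the system and is only a sketch under assumptions: member lists are plain comma-separated strings, append_member and enforce_members are hypothetical names, the additive step stands in for `usermod -aG`, and the authoritative step models the admin module as simply resetting the group to the declared list (its real implementation is not shown in this task).

```shell
#!/bin/sh
# Simulation only: member lists are comma-separated strings, nothing on the
# system is modified. Both function names are hypothetical.

# Additive step, in the spirit of `usermod -aG <group> <user>`:
# append user $2 to list $1 unless it is already present.
append_member() {
    case ",$1," in
        *",$2,"*) printf '%s\n' "$1" ;;
        *)        printf '%s\n' "${1:+$1,}$2" ;;
    esac
}

# Authoritative step (assumed behaviour of the admin module):
# whatever the current list $1 is, the result is the declared list $2.
enforce_members() {
    printf '%s\n' "$2"
}

declared="dduvall,hashar,thcipriani"   # shortened list, for illustration
current="$declared"

# Order observed in the puppet run: add first, enforce second.
current=$(append_member "$current" "jenkins-slave")
current=$(enforce_members "$current" "$declared")
echo "add then enforce: $current"

# Reverse order: enforce first, add second.
current=$(enforce_members "$current" "$declared")
current=$(append_member "$current" "jenkins-slave")
echo "enforce then add: $current"
```

Only the second ordering leaves jenkins-slave in the list at the end of a run. Even then, under the assumption above, the membership would be dropped and re-added on every run, which is why forcing the order alone still leaves a window where the user is not in the group.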

So I guess the question is: how do we add a user created by Puppet to an admin group managed by the yaml file? :(

I fail to see how that ever worked. admin::groupmembers makes sure the unix group only has the members defined in the yaml file, but jenkins-slave is not in there.

This worked previously because jenkins-slave was the only user executing docker on that machine. Since then we've started running docker-pkg on that machine since it is a convenient place from which to push images to the docker registry.

For the tests we could add blubber back to permanent labs docker instances and build the Mathoid test variant on those machines; however, we'll still need to solve this problem for the post-merge portion of the Mathoid pipeline (wherein it pushes a tested production-ready image to the registry).

Adding @dduvall since we setup this process initially.

A puppet run on contint1001 currently first adds jenkins-slave to the docker group and a bit later removes it. And that's behavior we actually want, that is, the yaml file being the canonical representation of our groups. There is some talk in T174465 about an approach to solve this, but it's probably gonna take a while. So, I'd say we figure out something else to unblock this and solve it correctly down the road. One simple way would be to stop managing the group altogether via the admin module, but it's clearly not ideal.

I thought of moving the jenkins-slave user to admin but it seems to be merely for mortals, not system accounts :-( I am not even sure puppet would like it since there would be no User['jenkins-slave'] resource defined.

Another idea is to check whether admin supports adding a user it does not manage to a group, i.e. adding jenkins-slave to the contint-docker admin group.

hashar updated the task description. Feb 14 2018, 8:07 AM

Change 410763 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] WIP: Force jenkins-slave being member of docker

https://gerrit.wikimedia.org/r/410763

Change 410763 merged by Alexandros Kosiaris:
[operations/puppet@production] Force jenkins-slave being member of docker

https://gerrit.wikimedia.org/r/410763

akosiaris closed this task as Resolved. Feb 15 2018, 11:30 AM
akosiaris claimed this task.

I don't particularly love the solution I've given above as it's brittle, but it has solved the problem for now. When a better approach is fleshed out in T174465 I'll amend this to conform to that. For now, I am resolving this.

@akosiaris Thank you very much for fixing that. When will the patch be "deployed" to the main Jenkins server? At the moment the problem still seems to be present: https://integration.wikimedia.org/ci/job/service-pipeline-test-only/30/console

It already has been. jenkins-slave is part of that group currently. I am guessing jenkins needs a restart? @hashar

Had to restart the jenkins-agent so that the ssh connection would close and re-open. Did that, looks like the job is working again! https://integration.wikimedia.org/ci/job/service-pipeline-test-only/32/
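
The restart was needed because a running session keeps the group list it received when it started, so the already-open ssh connection did not see the new membership. A rough way to spot such a stale session is to compare the groups of the running process (`id -Gn` with no argument) against what the account database currently lists for the user (`id -Gn <user>`). The helper names below are hypothetical and the snippet is only a sketch.

```shell
#!/bin/sh
# Print the groups present in list $2 but missing from list $1
# (both space-separated, as printed by `id -Gn`), one per line.
missing_groups() {
    for g in $2; do
        case " $1 " in
            *" $g "*) ;;
            *) printf '%s\n' "$g" ;;
        esac
    done
}

# Live check for the current shell; defined here but not called, since its
# result depends on the machine and session it runs in.
check_current_session() {
    missing_groups "$(id -Gn)" "$(id -Gn "$(id -un)")"
}

# Illustration with the values from this task: the old session only knows
# about jenkins-slave, while the database already lists docker as well.
missing_groups "jenkins-slave" "jenkins-slave docker"
```

Any group printed by check_current_session would be one the session predates; reconnecting, as was done here by restarting the agent, picks it up.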

Thanks @akosiaris!

Thank you for the hotfix @akosiaris !!!

I guess the long term fix is T174465: Puppet admin module should support adding system users to managed groups, so at least that is tracked :]