
Migrate contint* hosts to Buster
Closed, ResolvedPublic

Description

The CI production servers are running Debian Jessie and have to be upgraded. We will go for Buster. The hosts are:

Host | Role
contint1001.wikimedia.org | Primary
contint2001.wikimedia.org | Spare

The services to be migrated are:

  • docker-pkg
    • Buster is supported since Gerrit change 585451
    • Updated on April 2nd. Will need to be redeployed after the upgrade: scap deploy --limit contint.*
  • Docker
    • We can afford to lose the containers. They will be re-downloaded from the registry if need be.
  • Pipeline containers building

Migration

The overall sequence is to upgrade contint2001, migrate the services to it, upgrade contint1001, move the services back to contint1001.

Zuul to scap

Zuul has been deployed using a Debian package, but that method is painful for everyone. On Buster we will deploy it using scap. We can do all the scap-related work before upgrading from Jessie to Buster, but we need a feature switch based on the target distribution so that a host still relies on the Debian package as long as it is still running Jessie.

  • Craft a scap deployment repository for Zuul
  • Get puppet patches to vary based on the distribution (a rough sketch follows below)

The deployment on a host can only be done after it has been upgraded to Buster.
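A rough shell-level sketch of that feature switch, purely illustrative: the real switch is implemented in the puppet patches, and the echo messages below are placeholders rather than commands run in production.

    # Pick the Zuul deployment method from the distribution codename.
    codename=$(lsb_release -sc)
    if [ "$codename" = "buster" ]; then
        echo "zuul: deployed via scap"      # Buster hosts use the scap repository
    else
        echo "zuul: Debian package"         # Jessie hosts keep the packaged zuul
    fi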

contint2001 upgrade

zuul-merger

The sole production service running on contint2001 is zuul-merger.

docker-pkg

For contint2001, Puppet needs to run without errors and the missing packages need to be built. Current errors:

  • E: Unable to locate package blubber
  • E: Unable to locate package zuul
  • E: Unable to locate package helm
  • E: Unable to locate package helmfile
  • E: Unable to locate package helm-diff
  • E: Unable to locate package kubernetes-client
  • mod_php_7.3 - ERROR: Module mpm_event is enabled - cannot proceed due to conflicts. It needs to be disabled first (known issue -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206)

Jenkins agent

  • Add contint2001 as a Jenkins agent (copying contint1001)
  • Disable contint1001 agent
  • Update jobs in integration/config to point to contint2001

Migrate

  • Stop zuul, zuul-merger, jenkins on contint1001
  • rsync data
  • change DNS backend for contint.wikimedia.org
  • update fabfile to have zuul reloaded on contint2001 instead of contint1001
  • Set contint2001 as master in Puppet / Hiera
  • Start Jenkins, verify that agents are connected and jobs set
  • Start Zuul scheduler

contint1001 upgrade

  • reinstall with Buster
  • deploy the Zuul scap repository to the machine: scap deploy --limit contint1001 - deployed as part of
  • redeploy docker-pkg scap deploy --limit contint1001
  • update the fabfile.py for deploy_docker (we now use contint.wikimedia.org)
  • Stop zuul, zuul-merger, jenkins on contint2001
  • rsync data
  • change DNS backend for contint.wikimedia.org
  • update fabfile to have Zuul reloaded on contint1001 instead of contint2001 (we now use contint.wikimedia.org)
  • Set contint1001 as master in Puppet / Hiera
  • Start Jenkins, verify that agents are connected and jobs set
  • Start Zuul scheduler

Details

Due Date
Mar 29 2020, 10:00 PM
Repo | Branch | Lines +/-
operations/puppet | production | +13 -17
operations/puppet | production | +13 -15
operations/puppet | production | +4 -1
operations/puppet | production | +5 -6
operations/puppet | production | +2 -2
operations/puppet | production | +1 -1
operations/puppet | production | +1 -1
operations/puppet | production | +0 -1
operations/puppet | production | +4 -0
operations/puppet | production | +7 -37
operations/puppet | production | +1 -1
operations/puppet | production | +13 -1
operations/puppet | production | +7 -0
operations/dns | master | +1 -1
operations/puppet | production | +7 -7
integration/config | master | +1 -1
integration/config | master | +13 -13
operations/puppet | production | +10 -15
operations/puppet | production | +1 -0
operations/puppet | production | +1 -1
operations/puppet | production | +6 -1
blubber | debian | +6 -0
operations/puppet | production | +1 -1
operations/dns | master | +2 -0
operations/puppet | production | +16 -1
operations/puppet | production | +0 -1
operations/puppet | production | +9 -0
operations/puppet | production | +18 -8
operations/puppet | production | +12 -0
operations/puppet | production | +7 -0
operations/puppet | production | +8 -2
integration/zuul/deploy | master | +1 -1
integration/zuul/deploy | master | +1 -1
integration/config | master | +3 -2
operations/puppet | production | +0 -5
operations/puppet | production | +13 -1
operations/puppet | production | +2 -0
operations/puppet | production | +10 -17
operations/puppet | production | +18 -10
operations/puppet | production | +21 -5
operations/puppet | production | +38 -0
operations/puppet | production | +1 -0
operations/puppet | production | +2 -0
operations/puppet | production | +1 -1

Related Objects

Status | Assigned
Stalled | None
Resolved | None
Resolved | akosiaris
Resolved | Jdforrester-WMF
Resolved | Jdforrester-WMF
Resolved | Jdforrester-WMF
Invalid | Jdforrester-WMF
Resolved | MoritzMuehlenhoff
Resolved | Krinkle
Resolved | Krinkle
Resolved | hashar
Resolved | Jdforrester-WMF
Resolved | Jdforrester-WMF
Declined | Jdforrester-WMF
Duplicate | None
Resolved | Milimetric
Resolved | Milimetric
Resolved | Ladsgroup
Resolved | akosiaris
Declined | None
Resolved | Mholloway
Duplicate | None
Resolved | None
Resolved | None
Declined | None
Resolved | MSantos
Duplicate | None
Resolved | jeena
Resolved | Jdforrester-WMF
Resolved | Jdrewniak
Duplicate | None
Resolved | Jdforrester-WMF
Resolved | Jdforrester-WMF
Resolved | Jdforrester-WMF
Resolved | MoritzMuehlenhoff
Resolved | hashar
Resolved | hashar
Declined | MoritzMuehlenhoff
Invalid | thcipriani
Resolved | mmodell
Resolved | hashar
Resolved | Joe
Resolved | JMeybohm
Resolved | JMeybohm

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 594477 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] contint: switch jenkins/zuul from contint1001 to contint2001

https://gerrit.wikimedia.org/r/594477

Change 594480 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] switch contint from 1001 to 2001

https://gerrit.wikimedia.org/r/594480

Change 594475 merged by Dzahn:
[operations/puppet@production] contint: move common and default Hiera settings to role level

https://gerrit.wikimedia.org/r/594475

@hashar So.. when do we schedule the maintenance window early next week? Monday?

Mentioned in SAL (#wikimedia-operations) [2020-05-11T08:32:36Z] <mutante> rsynced data from contint1001 to contint2001 - paths per T224591#6039192 for the migration later today

Mentioned in SAL (#wikimedia-operations) [2020-05-11T09:07:32Z] <mutante> contint1001 - rsync -avpz --delete /srv/jenkins/ rsync://contint2001.wikimedia.org/ci--srv-/jenkins/ (T224591)

Change 595511 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Switch jobs to contint2001

https://gerrit.wikimedia.org/r/595511

Change 595512 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] fab: deploy zuul config on contint2001

https://gerrit.wikimedia.org/r/595512

Change 595511 merged by jenkins-bot:
[integration/config@master] Switch jobs to contint2001

https://gerrit.wikimedia.org/r/595511

Change 595512 merged by jenkins-bot:
[integration/config@master] fab: deploy zuul config on contint2001

https://gerrit.wikimedia.org/r/595512

Mentioned in SAL (#wikimedia-operations) [2020-05-11T12:14:47Z] <hashar> shutting down Zuul and Jenkins for system switch # T224591

Change 594480 merged by Dzahn:
[operations/dns@master] switch contint from 1001 to 2001

https://gerrit.wikimedia.org/r/594480

Change 594477 merged by Dzahn:
[operations/puppet@production] contint: switch jenkins/zuul/gearman to contint2001

https://gerrit.wikimedia.org/r/594477

Mentioned in SAL (#wikimedia-operations) [2020-05-11T12:50:26Z] <hashar> Pointing CI Jenkins to contint2001 Gearman server T224591

Change 595521 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] jenkins: add missing /srv/jenkins dir

https://gerrit.wikimedia.org/r/595521

Mentioned in SAL (#wikimedia-operations) [2020-05-11T13:36:31Z] <hashar> Rolling back CI system switch to previous known state # T224591

Change 595525 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] contint: fix git cloning of docroot for integration.wm.org

https://gerrit.wikimedia.org/r/595525

The upgrade itself went well:

  • the rsync of data from contint1001 to contint2001 finished just in time
  • our rsync configuration caused file ownership to be transferred by uid instead of by name, which forced us to run a lot of chown commands to fix it
  • the zuul scheduler deployed from scap did process events. I have tested it locally.

In Jenkins, the job requests landed in its build queue but the agent executors rejected them, each claiming it was busy. It might be due to the different Java version being used (8 vs 11), which in turn might point at the Gearman plugin :-\ We have thus rolled back to contint1001, which took just a couple of minutes.

Follow up actions

Try to reproduce the issue with the same zuul/jenkins/java 11 combination

Moritz pointed out we have a java 8 package for Buster: deb http://apt.wikimedia.org/wikimedia buster-wikimedia component/jdk8
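For reference, a hedged sketch of enabling that component by hand; the actual pin was applied through the puppet change below (595531), and the sources.list file name and the openjdk package name here are assumptions, not copied from the patch.

    # Illustrative only -- production manages this via puppet (change 595531).
    echo 'deb http://apt.wikimedia.org/wikimedia buster-wikimedia component/jdk8' \
        | sudo tee /etc/apt/sources.list.d/jdk8.list
    sudo apt-get update
    sudo apt-get install -y openjdk-8-jdk-headless   # assumed package name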

Find a fix for rsync to keep the user/group, either:

  • ensure a consistent uid across the fleet for the jenkins and zuul users
  • update rsyncd.conf, which has use chroot enabled; that defaults to forcing numeric ids and thus prevents the name/id mapping from occurring

Change 595531 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Initially use Java 8 for contint on Buster

https://gerrit.wikimedia.org/r/595531

Change 595531 merged by Dzahn:
[operations/puppet@production] Initially use Java 8 for contint on Buster

https://gerrit.wikimedia.org/r/595531

Change 595521 merged by Dzahn:
[operations/puppet@production] jenkins: add missing /srv/jenkins dir

https://gerrit.wikimedia.org/r/595521

Mentioned in SAL (#wikimedia-operations) [2020-05-18T09:46:37Z] <mutante> contint2001 - apt-get remove --purge openjdk-11-* - T224591

Change 597090 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] jenkins: master should stick to java 8

https://gerrit.wikimedia.org/r/597090

Change 597090 merged by Dzahn:
[operations/puppet@production] jenkins: master should stick to java 8

https://gerrit.wikimedia.org/r/597090

Change 597255 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] jenkins/icinga: fix process monitoring after change in command line

https://gerrit.wikimedia.org/r/597255

Change 597255 merged by Dzahn:
[operations/puppet@production] jenkins/icinga: fix process monitoring after change in command line

https://gerrit.wikimedia.org/r/597255

We now have Jenkins pinned to Java 8, so I guess we can do the switchover again.

@Dzahn would you be available at the beginning of next week (Monday/Tuesday)? (we can sync up over irc to find a good time)

@Dzahn would you be available at the beginning of next week (Monday/Tuesday)? (we can sync up over irc to find a good time)

Yes, some time between 9 and 5 on Monday or 9 and 4 on Tuesday. Or simply tomorrow, Thursday, which would even be preferable.

After discussion with @Dzahn we will do the switch Wednesday May 27th at 9:30 CEST (7:30 UTC).

Change 598068 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] zuul: add convenience link to 'zuul' bin

https://gerrit.wikimedia.org/r/598068

Change 598068 merged by Dzahn:
[operations/puppet@production] zuul: add convenience link to 'zuul' bin

https://gerrit.wikimedia.org/r/598068

Mentioned in SAL (#wikimedia-operations) [2020-05-27T09:39:11Z] <hashar> Stopping Zuul and Jenkins CI for scheduled maintenance # T224591

2020-05-27

    10:15 hashar: contint2001: starting zuul
    10:15 hashar: contint2001: started jenkins
    10:03 mutante: contint2001 - find /var/lib/jenkins -user statsite -exec chown -h jenkins:jenkins {} \;
    10:02 mutante: repeated rsync of /var/lib/jenkins with -p ; find /var/lib/jenkins -group bacula -user statsite -exec chown -h jenkins:jenkins {} \;
    09:55 hashar: contint2001: starting jenkins
    09:54 hashar: contint1001 / contint2001 : deleted obsolete files /var/lib/jenkins/.git and /var/lib/jenkins/jobs/_shared/
    09:52 mutante: contint2001 - find /var/lib/jenkins -user statsite -exec chown -h jenkins:jenkins {} \;
    09:49 mutante: contint2001 - find /var/lib/jenkins -group bacula -user statsite -exec chown jenkins:jenkins {} \;
    09:48 hashar: contint2001: unmasked jenkins and started it
    09:42 mutante: switching CI backend from contint1001 to contint2001
    09:40 mutante: repeated rsync -avp --delete /var/lib/zuul/ rsync://contint2001.wikimedia.org/ci--var-lib-zuul-
    09:40 hashar: contint1001: masked jenkins and zuul
    09:39 mutante: repeated rsync -avp --delete /var/lib/jenkins/ rsync://contint2001.wikimedia.org/ci--var-lib-jenkins-
    09:39 hashar: Stopping Zuul and Jenkins CI for scheduled maintenance # T224591
    08:52 hashar: contint1001: find /srv/jenkins/builds/operations-puppet-wmf-style-guide -type f -name '*.tmp' -delete # T253729
    08:08 hashar: contint1001 / contint2001 : deleted unused /var/lib/zuul/git (the real one is /srv/zuul/git )
    08:02 mutante: contint2001 - chown root:root /var/lib/zuul/git
    07:45 hashar: contint2001 also fixing symlink permissions: sudo find /var/lib/jenkins -not -user jenkins -exec chown -h jenkins:jenkins {} +
    07:35 mutante: contint2001 - find /var/lib/jenkins -group bacula -user jenkins -exec chown jenkins:jenkins {} \;
    07:30 mutante: contint2001 - find /var/lib/jenkins -user statsite -exec chown jenkins {} \;
    07:26 mutante: contint2001 - chown -R zuul:zuul /var/lib/zuul/
    07:26 mutante: contint1001:~# rsync -avpz --delete /srv/jenkins/ rsync://contint2001.wikimedia.org/ci--srv-/jenkins/
    07:25 mutante: contint1001:~# rsync -avp --delete /var/lib/jenkins/ rsync://contint2001.wikimedia.org/ci--var-lib-jenkins-
    07:25 mutante: contint1001:~# rsync -avp --delete /var/lib/zuul/ rsync://contint2001.wikimedia.org/ci--var-lib-zuul-

I forgot to have the data synchronized ahead of the maintenance window, which caused a two hour delay. We then had to fight with file permission changes introduced by rsync. Eventually the service got switched and seems operational now. I will continue monitoring, but I am guessing it will be fine from now on.

Dzahn raised the priority of this task from Medium to High. May 27 2020, 10:34 AM

Mentioned in SAL (#wikimedia-releng) [2020-05-27T15:33:39Z] <James_F> Zuul: Re-pulling config forwards ~2 weeks on contint2001; forgotten in T224591?

Sorry, my mistake, not two weeks, just lots and lots of semi-familiar changes, apparently all merged but not deployed so far today by @hashar and one by @Reedy: abcca40b..9fb87190.

Hashar confirmed things are working now.

We agreed that next week I will reimage contint1001 to buster.

Change 601645 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: switch contint1001 from jessie to buster installer

https://gerrit.wikimedia.org/r/601645

Change 601645 merged by Dzahn:
[operations/puppet@production] DHCP: switch contint1001 from jessie to buster installer

https://gerrit.wikimedia.org/r/601645

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

contint1001.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202006020808_dzahn_28298_contint1001_wikimedia_org.log.

Completed auto-reimage of hosts:

['contint1001.wikimedia.org']

Of which those FAILED:

['contint1001.wikimedia.org']

Change 601666 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] partman: switch contint1001 to raid10-4dev, previous recipe is gone

https://gerrit.wikimedia.org/r/601666

Change 601666 merged by Dzahn:
[operations/puppet@production] partman: switch contint1001 to raid10-4dev, previous recipe is gone

https://gerrit.wikimedia.org/r/601666

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

contint1001.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202006020932_dzahn_98476_contint1001_wikimedia_org.log.

Completed auto-reimage of hosts:

['contint1001.wikimedia.org']

Of which those FAILED:

['contint1001.wikimedia.org']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

contint1001.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202006020937_dzahn_104332_contint1001_wikimedia_org.log.

Completed auto-reimage of hosts:

['contint1001.wikimedia.org']

Of which those FAILED:

['contint1001.wikimedia.org']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

contint1001.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202006020938_dzahn_105219_contint1001_wikimedia_org.log.

Completed auto-reimage of hosts:

['contint1001.wikimedia.org']

Of which those FAILED:

['contint1001.wikimedia.org']

Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts:

contint1001.wikimedia.org

The log can be found in /var/log/wmf-auto-reimage/202006020945_dzahn_110360_contint1001_wikimedia_org.log.

Completed auto-reimage of hosts:

['contint1001.wikimedia.org']

and were ALL successful.

Dzahn updated the task description. (Show Details)

re-assigning for the deployment steps that are next

Partitions on contint1001 have changed. It used to have a secondary volume group with a 250G volume mounted at /mnt/docker. That followed the addition of two extra SSDs that went into their own LVM group (T207707).

Now we have four SSDs in the same RAID (md0) and a single volume group. It has an 80G logical volume for / and the rest (1.4T) for /srv. I guess it is nicer this way; we just have to move Docker to write under /srv instead of /mnt/docker.
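A minimal sketch of the Docker side of that move, assuming it is done via /etc/docker/daemon.json; the puppet change below does the equivalent through the contint role, and the data-root key matches what the later SAL entry mentions.

    # Illustrative only: point dockerd's data-root at /srv/docker and restart.
    sudo install -d /srv/docker
    printf '{\n    "data-root": "/srv/docker"\n}\n' | sudo tee /etc/docker/daemon.json
    sudo systemctl restart docker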

Change 601760 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: move Docker data to /srv/docker

https://gerrit.wikimedia.org/r/601760

Change 601760 merged by Dzahn:
[operations/puppet@production] contint: move Docker data to /srv/docker

https://gerrit.wikimedia.org/r/601760

Mentioned in SAL (#wikimedia-operations) [2020-06-02T15:45:54Z] <mutante> contint1001 - restarting docker after changed data-root path (T224591)

Mentioned in SAL (#wikimedia-operations) [2020-06-02T15:48:24Z] <mutante> contint1001 - rm -rf /mnt/docker (T224591)

Technically this is done because both contint hosts are on buster.

Whether contint2001 or contint1001 is the "active" server could be seen as unrelated.

hashar lowered the priority of this task from High to Medium. Jun 15 2020, 8:57 AM

I would like to move the service back to contint1001. But before doing that, I am going to dig into the rsync setup to save us from having to manually fix up the file ownerships. It is a bit of a rabbit hole, but it affected us for gerrit-test as well, so it is probably worth investigating and fixing ;)

We have run into this issue several times before. Let's solve it by using the same UID for our system users (also see the recent mail from Moritz about systemd-sysusers). Let's not try to solve it with rsync parameters again.

Also, I would prefer if it were a separate ticket. It's not really related to the Buster upgrade. It's a general issue we have had since the beginning of Wikimedia time, when https://wikitech.wikimedia.org/wiki/UID was created.

Definitely, if you need a system user with fixed IDs across servers, use one in the 9xx range and reserve it in data.yaml. Everything else will cause problems down the road.
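A minimal systemd-sysusers sketch of that suggestion, assuming the UID 903 that a later (eventually abandoned) patch reserves for jenkins; the file path and description text are illustrative only.

    # Illustrative only: declare the jenkins system user/group with a fixed id
    # so every host allocates the same uid/gid (production would reserve the id
    # in data.yaml and ship this via puppet).
    printf 'g jenkins 903\nu jenkins 903 "Jenkins CI" /var/lib/jenkins\n' \
        | sudo tee /etc/sysusers.d/jenkins.conf
    sudo systemd-sysusers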

The easy trick is to have the rsync daemon run as root without a chroot; it is then able to:

  • map names (being outside of a chroot, it has access to e.g. getpwnam())
  • adjust the file ownership (being root)

Which in rsync configuration can be achieved with:

uid = root
use chroot = no

Which is scary. An alternative is to keep the chroot and explicitly enable mapping with numeric ids = no, but that requires copying libraries into the chroot and (per the documentation) making sure they can't be written to. That all sounds a bit too complicated.
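For completeness, roughly what that chroot-preserving alternative would look like per module; the module name is taken from the rsync URLs used during the migration, and this was never actually deployed:

use chroot = yes
numeric ids = no

[ci--var-lib-jenkins-]
    path = /var/lib/jenkins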

So yeah, let's reserve a uid.

Actually we do not need any name mapping. The uid and gid are the exact same ones on each server:

$ getent passwd zuul jenkins jenkins-slave
zuul:x:497:498::/var/lib/zuul:/bin/bash
jenkins:x:499:499::/var/lib/jenkins:/bin/bash
jenkins-slave:x:498:1001::/var/lib/jenkins-slave:/bin/bash
$ getent group zuul jenkins jenkins-slave
zuul:x:498:
jenkins:x:499:
jenkins-slave:x:1001:

So I do not get why we would get the jenkins user becoming statsite or the jenkins group becoming bacula. That does not make any sense :-\

Change 606394 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] ci:master: prepare for contint switchover

https://gerrit.wikimedia.org/r/606394

Actually we do not need any name mapping. The uid and gid are the exact same ones on each server

They are unique by chance, but not by design. When a local system user is created, the next available UID in the specified range is used, but it's not necessarily deterministic. If we reimage any of the contint* servers at a later point, the execution could be different and we could end up with different UIDs again.
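A quick, illustrative way to double-check that assumption before a switch, reusing the getent lookups quoted earlier (run from any host with ssh access to both servers):

    # uids/gids should match line for line on both hosts before trusting a
    # numeric-id rsync of /var/lib/jenkins and /var/lib/zuul.
    for h in contint1001.wikimedia.org contint2001.wikimedia.org; do
        echo "== $h"
        ssh "$h" 'getent passwd zuul jenkins jenkins-slave; getent group zuul jenkins jenkins-slave'
    done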

Change 606286 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] jenkins: replace system user/group with systemd-sysuser

https://gerrit.wikimedia.org/r/606286

They are unique by chance, but not by design.

Or it's likely we already changed it in the past to avoid this issue. I remember doing this on a bunch of occasions where I first just edited the UID and then used find -exec to chown all the files. I agree though it's a bit mysterious why we had that issue during the 1001->2001 switch recently.

I just noticed the permissions of files in /var/lib/jenkins on contint1001 are _not_ like on contint2001 again, while I definitely remember fixing them to be the same on both sides.

Also, for example, there is a file called "secret.key.not-so-secret" there that does not exist on 2001, even though we ran rsync with --delete to have exact mirrors.

Why is that?

Change 606394 merged by Dzahn:
[operations/puppet@production] ci:master: prepare for contint switchover

https://gerrit.wikimedia.org/r/606394

Change 595525 abandoned by Hashar:
contint: fix git cloning of docroot for integration.wm.org

Reason:
The way we deploy integration/docroot and maintain its deployment is broken. Instead we will migrate the deployment to use scap (T256005) and have the Apache DocumentRoot point to the deployed path under /srv/deployment. That will also address the issue Daniel raised with me: the git workspace is polluted by files published by CI.

https://gerrit.wikimedia.org/r/595525

Change 607645 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] admins: add system user for jenkins, reserve UID 903

https://gerrit.wikimedia.org/r/607645

Change 607645 abandoned by Dzahn:
admins: add system user for jenkins, reserve UID 903

Reason:
needs to be merged into https://gerrit.wikimedia.org/r/c/operations/puppet/+/606286

https://gerrit.wikimedia.org/r/607645

Change 607853 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] zuul: replace user/group with systemd-sysuser and reserved UID

https://gerrit.wikimedia.org/r/607853

Decided with Hashar we are calling this resolved because both hosts are on buster.

We are treating the switch-back and rsync issues in a separate ticket.

Eventually I wanted to switch back to contint1001 as part of the migration, notably due to the added network latency between contint2001 in codfw and Gerrit in eqiad. But that is a non-issue in the end.

There are a bunch of follow-up actions for issues that surfaced during the migration. Some are listed at T224591#6124875.

Both hosts are now using Buster, which was the aim of this task. As such, in agreement with @Dzahn, let's mark this resolved and file other tasks for the rest ;)

Change 606286 abandoned by Dzahn:
[operations/puppet@production] jenkins: replace system user/group with systemd-sysuser

Reason:

https://gerrit.wikimedia.org/r/606286

Change 607853 abandoned by Dzahn:
[operations/puppet@production] zuul: replace user/group with systemd-sysuser and reserved UID

Reason:

https://gerrit.wikimedia.org/r/607853