
contint2002 service implementation tracking
Open, In Progress, MediumPublic

Description

This is to track the service implementation of serviceops host contint2002, which is the primary CI server with Jenkins/Zuul etc. It is done independently from contint1002, which is simpler (T313832).

topic branch https://gerrit.wikimedia.org/r/q/topic:contint2002

Details

| Project | Branch | Lines +/- |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +34 -0 |
| operations/puppet | production | +1 -1 |
| operations/puppet | production | +53 -67 |
| operations/puppet | production | +14 -14 |
| operations/puppet | production | +181 -172 |
| operations/puppet | production | +3 -0 |
| labs/private | master | +16 -0 |
| operations/puppet | production | +17 -2 |
| operations/puppet | production | +16 -10 |
| operations/puppet | production | +9 -1 |
| operations/puppet | production | +0 -4 |
| operations/puppet | production | +32 -33 |
| operations/puppet | production | +1 -1 |
| operations/puppet | production | +3 -1 |
| operations/puppet | production | +2 -2 |
| operations/puppet | production | +1 -0 |
| operations/puppet | production | +1 -1 |
| operations/puppet | production | +8 -0 |
| operations/puppet | production | +1 -0 |
| operations/puppet | production | +1 -2 |

Event Timeline

LSobanski triaged this task as Medium priority.Dec 12 2022, 4:55 PM
LSobanski moved this task from Incoming to Backlog on the serviceops-collab board.

Change 867670 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] scap: add contint2002 to ci-docroot, jenkins, zuul deploy

https://gerrit.wikimedia.org/r/867670

Change 867673 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add contint2002 to ci::master role

https://gerrit.wikimedia.org/r/867673

Change 867675 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] cloud: allow VMs to connect to contint1002 and contint2002

https://gerrit.wikimedia.org/r/867675

Change 867703 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: add contint2002 to firewall, jenkins and zuul-merger

https://gerrit.wikimedia.org/r/867703

Change 867705 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci/zuul: switch gearman server from contint2001 to contint2002

https://gerrit.wikimedia.org/r/867705

Change 867708 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] docker_registry_ha: add contint2002 to image builder hosts

https://gerrit.wikimedia.org/r/867708

Change 867710 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: add contint2002 to zuul_merger firewall, ferm_srange

https://gerrit.wikimedia.org/r/867710

Change 867711 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: add contint2002 as an migration rsync source host

https://gerrit.wikimedia.org/r/867711

Change 867712 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: make contint2002 the new rsync source, remove contint2001

https://gerrit.wikimedia.org/r/867712

Change 867714 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] elasticsearch/relforge: add contint2002 to cirrus::ferm_srange

https://gerrit.wikimedia.org/r/867714

Change 867708 merged by Dzahn:

[operations/puppet@production] docker_registry_ha: add contint2002 to image builder hosts

https://gerrit.wikimedia.org/r/867708

Change 867675 merged by Dzahn:

[operations/puppet@production] cloud: allow VMs to connect to contint1002 and contint2002

https://gerrit.wikimedia.org/r/867675

Change 867711 merged by Dzahn:

[operations/puppet@production] ci: add contint2002 as a rsync destination host

https://gerrit.wikimedia.org/r/867711

Change 867710 merged by Dzahn:

[operations/puppet@production] ci: add contint2002 to zuul_merger firewall, ferm_srange

https://gerrit.wikimedia.org/r/867710

Change 867703 merged by Dzahn:

[operations/puppet@production] ci: add contint2002 to firewall, jenkins and zuul-merger

https://gerrit.wikimedia.org/r/867703

Change 867714 merged by Dzahn:

[operations/puppet@production] elasticsearch/relforge: add contint2002 to cirrus::ferm_srange

https://gerrit.wikimedia.org/r/867714

Change 901576 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] zuul: fix up service enable and ensure

https://gerrit.wikimedia.org/r/901576

Change 901576 merged by Dzahn:

[operations/puppet@production] zuul: fix up service enable and ensure

https://gerrit.wikimedia.org/r/901576

Change 867673 merged by Dzahn:

[operations/puppet@production] site: add contint2002 to ci::master role

https://gerrit.wikimedia.org/r/867673

Change 904370 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: remove superfluous insetup role for contint2002

https://gerrit.wikimedia.org/r/904370

Change 904370 merged by Dzahn:

[operations/puppet@production] site: remove superfluous insetup role for contint2002

https://gerrit.wikimedia.org/r/904370

Mentioned in SAL (#wikimedia-operations) [2023-03-29T23:48:45Z] <mutante> contint2002 - a2dismod mpm_event (ONCE AGAIN this year old issue when applying roles with apache for the first time) - running puppet - now it can actually install PHP 7.3 and start apache T324659

Change 904374 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci::master: add parameter to enable/disable monitoring of jenkins/httpd

https://gerrit.wikimedia.org/r/904374

Change 904374 merged by Dzahn:

[operations/puppet@production] ci::master: add parameter to enable/disable monitoring of jenkins/httpd

https://gerrit.wikimedia.org/r/904374

The production role for ci::master is now applied on contint2002.

Some minor follow-ups were needed:

  • run puppet multiple times due to dependencies (not a big deal, only needed on the first run)
  • add the ability to disable the recently added monitoring on the new host, to avoid false alerts because https://integration.wikimedia.org does not work on it yet
  • manually run "a2dismod mpm_event" and then run puppet to unblock installing PHP (an old and known, but still pretty annoying, issue that we can potentially fix)

Previously @hashar had fixed the situation with masking zuul, zuul-merger and jenkins.

I confirmed zuul, zuul-merger and jenkins are all masked on the new host.

Puppet runs now without errors or warnings.

releng now has shell access to the new machine, which came with the role.

Dzahn changed the task status from Open to In Progress.Mar 30 2023, 1:37 AM
Dzahn moved this task from Backlog to Work in Progress on the serviceops-collab board.

I don't know what happened overnight, but the zuul-merger service started alerting:

Notification Type: PROBLEM

Service: Check systemd state
Host: contint2002
Address: 208.80.153.39
State: CRITICAL

Date/Time: Wed Apr 5 23:54:18 UTC 2023

Notes URLs: https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

Acknowledged by : 

Additional Info:

CRITICAL - degraded: The following units failed: wmf_auto_restart_zuul-merger.service

Even though the zuul-merger has always been masked on contint2002... It is a mystery.

Change 906307 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] zuul: disable monitoring for disabled merger service

https://gerrit.wikimedia.org/r/906307

Change 906307 merged by Elukey:

[operations/puppet@production] zuul: disable monitoring for disabled merger service

https://gerrit.wikimedia.org/r/906307

Change 907885 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] ci: rename ci::master role to ci::manager

https://gerrit.wikimedia.org/r/907885

Change 907886 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] ci: split contint hosts to different roles

https://gerrit.wikimedia.org/r/907886

Change 907898 had a related patch set uploaded (by Hashar; author: Hashar):

[labs/private@master] ci: add secreats for ci::manager and ci::worker roles

https://gerrit.wikimedia.org/r/907898

Change 907898 merged by Hashar:

[labs/private@master] ci: add secreats for ci::manager and ci::worker roles

https://gerrit.wikimedia.org/r/907898

Change 867670 abandoned by Hashar:

[operations/puppet@production] scap: add contint2002 to ci-docroot, jenkins, zuul deploy

Reason:

I have made the list of scap targets be populated from PuppetDB with https://gerrit.wikimedia.org/r/c/operations/puppet/+/893483/ .

This change prompted me to get the pending change deployed and verified (I ran a dummy deployment for the new hosts).

https://gerrit.wikimedia.org/r/867670

Change 908232 had a related patch set uploaded (by Jbond; author: Hashar):

[operations/puppet@production] ci: indicate which server is the control server via a hiera param

https://gerrit.wikimedia.org/r/908232

Change 907886 abandoned by Hashar:

[operations/puppet@production] ci: split contint hosts to different roles

Reason:

Abandoning in favor of a single role for all hosts and using a hiera setting to define which one is the primary: https://gerrit.wikimedia.org/r/c/operations/puppet/+/908232/

https://gerrit.wikimedia.org/r/907886

Change 907885 merged by Jbond:

[operations/puppet@production] ci: rename ci::master role to ci

https://gerrit.wikimedia.org/r/907885

Change 908232 merged by Jbond:

[operations/puppet@production] ci: indicate which server is the control server via a hiera param

https://gerrit.wikimedia.org/r/908232

The migration process involves an rsync running in a chroot, which is thus unable to do user/group ID mapping between the hosts. Short of removing the use_chroot clause, all users and groups need their IDs reserved in Puppet.

The transferred directories are:

hieradata/role/common/ci.yaml
profile::ci::migration::rsync_data_dirs:
  - "/var/lib/jenkins/"
  - "/var/lib/zuul/"
  - "/srv/"

On the current primary I went with a brute force approach to find all users/groups in use:

sudo find /var/lib/jenkins /var/lib/zuul /srv -printf '| %u | %U | %g | %G\n'|uniq > uid_gid.txt
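One caveat with the command above: plain uniq only collapses adjacent identical lines, so the output file can still contain repeated rows. A minimal sketch of a variant that deduplicates across the whole output, assuming GNU find (for -printf); the function name list_owners is hypothetical and not part of any existing tooling:

```shell
#!/bin/sh
# Hypothetical helper: print each distinct user/UID/group/GID combination
# found under the given directories, in the same row layout as above.
# Assumes GNU find, which provides -printf.
list_owners() {
    # sort -u removes duplicates wherever they occur in the output,
    # whereas uniq alone only merges neighbouring identical lines.
    find "$@" -printf '| %u | %U | %g | %G\n' | sort -u
}

# Example (run as root so every file is readable):
#   list_owners /var/lib/jenkins /var/lib/zuul /srv > uid_gid.txt
```

This yields one row per owner/group pair rather than separate user and group lists, which is also the shape used for the narrower checks further down.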

Which gives me:

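| User | UID |
| 65533 | 65533 |
| 900 | 900 |
| _apt | 100 |
| brennen | 20958 |
| dancy | 25006 |
| deploy-ci-docroot | 492 |
| deploy-jenkins | 491 |
| deploy-service | 494 |
| deploy-zuul | 493 |
| hashar | 1010 |
| jenkins | 499 |
| jenkins-slave | 498 |
| jforrester | 2417 |
| root | 0 |
| slyngshede | 39083 |
| thcipriani | 11634 |
| zuul | 497 |

| Group | GID |
| 65533 | 65533 |
| 900 | 900 |
| adm | 4 |
| bacula | 118 |
| deploy-ci-docroot | 494 |
| deploy-jenkins | 493 |
| deploy-service | 496 |
| deploy-zuul | 495 |
| jenkins | 499 |
| jenkins-slave | 1001 |
| mail | 8 |
| nogroup | 65534 |
| root | 0 |
| shadow | 42 |
| staff | 50 |
| tty | 5 |
| ulog | 119 |
| utmp | 43 |
| wikidev | 500 |
| zuul | 498 |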

Some are human users whose UIDs are reserved via modules/admin/data/data.yaml. The deploy-* users are created by Puppet for the scap::target; I guess it is sufficient to NOT rsync /srv/deployment.
There are unknown owners such as _apt (100), 65533 and 900, which apparently come from /srv/docker. We do not need to rsync that directory (the images will be downloaded from the Docker registry when they are missing).

Further inspecting what is in /srv, we only need to rsync /srv/jenkins, which has:

sudo find /srv/jenkins -printf '| %u | %U | %g | %G\n'|uniq|sort|uniq

| User | UID | Group | GID |
| jenkins | 499 | jenkins | 499 |
| root | 0 | root | 0 |

And for /var/lib/jenkins and /var/lib/zuul:

sudo find /var/lib/jenkins /var/lib/zuul -printf '| %u | %U | %g | %G\n'|uniq|sort|uniq

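| User | UID | Group | GID |
| jenkins | 499 | adm | 4 |
| jenkins | 499 | jenkins | 499 |
| jenkins | 499 | nogroup | 65534 |
| root | 0 | bacula | 118 |
| root | 0 | jenkins | 499 |
| root | 0 | root | 0 |
| zuul | 497 | zuul | 498 |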

The adm group comes from /var/lib/jenkins/logs and is set by Puppet. Debian might have hardcoded it to be GID=4.

/var/lib/jenkins/.gitignore is owned by root:bacula, which I believe is from a restore from backup we did when gallium had a disk crash. We no longer use git to manage that directory, so I went ahead and deleted it on contint2001 (it was not on the others).

We thus need to reserve a UID/GID for jenkins and zuul, then migrate the files to the new UID (which will take a while since there are a lot of files owned by jenkins under /srv/jenkins).

The alternative is to set use chroot = no (which implies numeric ids = no, which lets rsync do the name mapping).

Change 917908 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] ci: in /srv only migrate /srv/jenkins

https://gerrit.wikimedia.org/r/917908

+ @jnuche who co-manages our Jenkins nowadays. This task is to migrate the Jenkins/Zuul/integration website services from contint2001.wikimedia.org to contint2002.wikimedia.org.

Change 917916 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] admin: reserve jenkins and zuul uid/gid

https://gerrit.wikimedia.org/r/917916

Change 917918 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] zuul: switch to fixed uid/gid 923

https://gerrit.wikimedia.org/r/917918

Change 917919 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] jenkins: switch to fixed uid/gid 924

https://gerrit.wikimedia.org/r/917919

Change 917908 merged by Dzahn:

[operations/puppet@production] ci: in /srv only migrate /srv/jenkins

https://gerrit.wikimedia.org/r/917908

Change 917916 merged by Dzahn:

[operations/puppet@production] admin: reserve jenkins and zuul uid/gid

https://gerrit.wikimedia.org/r/917916

Change 917918 merged by Dzahn:

[operations/puppet@production] zuul: switch to fixed uid/gid 923

https://gerrit.wikimedia.org/r/917918

After carefully deploying the change above on all 3 contint* servers (stopping services, running manual chown commands, starting services, verifying, etc.), the zuul user now uses uid:gid 923:923 instead of the old 497:498. This is good for several reasons.

[contint2002:~] $ id zuul
uid=923(zuul) gid=923(zuul) groups=923(zuul)

[contint2001:~] $ id zuul
uid=923(zuul) gid=923(zuul) groups=923(zuul)

[contint2001:~] $ id zuul
uid=923(zuul) gid=923(zuul) groups=923(zuul)

A find / -uid 497 and find / -gid 498 on all hosts also showed that no files were left owned by the old UID or GID.
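The two find runs can be combined into one pass. A minimal sketch, assuming GNU or POSIX find; the function name find_leftovers is hypothetical, and the use of -xdev is an assumption added here (it was not part of the commands actually run):

```shell
#!/bin/sh
# Hypothetical helper: print every path still owned by the old UID or
# the old GID. No output means nothing was left behind.
find_leftovers() {
    root=$1 old_uid=$2 old_gid=$3
    # -xdev keeps find on a single filesystem so it skips /proc and
    # other mounts; drop it to match the plain "find /" used above.
    find "$root" -xdev \( -uid "$old_uid" -o -gid "$old_gid" \) -print 2>/dev/null
}

# Example for the zuul change: find_leftovers / 497 498
```

With -xdev, separately mounted filesystems (for example a dedicated /srv partition, if there is one) need their own invocation.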

Monitoring alerted and recovered. Triggering a "recheck" on Gerrit worked and got a Jenkins vote.

  • disable puppet
  • stop services
  • chown -R 923:923 /srv/zuul/git /var/lib/zuul
  • chown -R 923:923 /var/log/zuul_repack/ /var/log/zuul/
  • chown -R 923:923 /etc/zuul/
  • sudo chown root:root /var/lib/zuul/{.gitconfig,git-template*}
  • re-enable puppet, it corrects /var/log/zuul to zuul:adm
  • puppet does the actual change for the zuul user name to point to 923 instead of 497 (and 923 instead of 498 for the gid)
  • id zuul (to verify)
  • start service
  • find / -uid 497 (to verify)
  • find / -gid 498 (to verify)
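The steps above can be sketched as a script that only prints the commands unless DRY_RUN=0 is set. This is an illustration, not what was actually executed: "puppet agent --disable/--enable/--test" and "systemctl stop/start" are assumptions standing in for whatever tooling was really used, the service names are guessed from this task, and the names run and migrate_zuul_ids are hypothetical.

```shell
#!/bin/sh
# Sketch of the zuul UID/GID switch described in the list above.
# By default every command is printed, not executed (DRY_RUN=1).
NEW_ID=923
OLD_UID=497
OLD_GID=498
SERVICES="zuul zuul-merger"   # assumption: unit names taken from this task

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

migrate_zuul_ids() {
    run puppet agent --disable "zuul uid/gid change"
    run systemctl stop $SERVICES
    run chown -R "$NEW_ID:$NEW_ID" /srv/zuul/git /var/lib/zuul
    run chown -R "$NEW_ID:$NEW_ID" /var/log/zuul_repack/ /var/log/zuul/
    run chown -R "$NEW_ID:$NEW_ID" /etc/zuul/
    run chown root:root /var/lib/zuul/.gitconfig /var/lib/zuul/git-template*
    # Puppet then repoints the zuul account to the new IDs and resets
    # /var/log/zuul to zuul:adm.
    run puppet agent --enable
    run puppet agent --test
    run id zuul
    run systemctl start $SERVICES
    # Verification: both searches should print nothing.
    run find / -uid "$OLD_UID"
    run find / -gid "$OLD_GID"
}

# Print the plan:      migrate_zuul_ids
# Actually execute it: DRY_RUN=0 migrate_zuul_ids   (as root)
```

Printing by default makes it possible to review the exact command sequence on each host before anything is stopped or re-owned.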

Mentioned in SAL (#wikimedia-operations) [2023-05-23T23:30:05Z] <mutante> contint*, releases* - maintenance - changing UID of jenkins user - jenkins will be stopped for a little bit, releases-jenkins is first though - T324659

Change 917919 merged by Dzahn:

[operations/puppet@production] jenkins: switch to fixed uid/gid 924

https://gerrit.wikimedia.org/r/917919

After carefully deploying the patch above to change the jenkins UID/GID (following the instructions, changing file ownership, etc.; details are in the Gerrit comments), we now have:

[contint2001:~] $ id jenkins
uid=924(jenkins) gid=924(jenkins) groups=924(jenkins)

[contint2002:~] $ id jenkins
uid=924(jenkins) gid=924(jenkins) groups=924(jenkins)

[contint1002:/tmp] $ id jenkins
uid=924(jenkins) gid=924(jenkins) groups=924(jenkins)

[releases1002:~] $ id jenkins
uid=924(jenkins) gid=924(jenkins) groups=924(jenkins)

[releases2002:~] $ id jenkins
uid=924(jenkins) gid=924(jenkins) groups=924(jenkins)

so no more worrying about rsync and file ownership when migrating. yay!
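For re-checking this later, the manual "id" comparison can be wrapped in a small test. A minimal sketch; the function name check_ids is hypothetical, and how it gets run on each host is left open:

```shell
#!/bin/sh
# Hypothetical check: succeed only if the account has the expected
# primary UID and GID on the local host.
check_ids() {
    user=$1 want_uid=$2 want_gid=$3
    have_uid=$(id -u "$user") || return 1
    have_gid=$(id -g "$user") || return 1
    if [ "$have_uid" = "$want_uid" ] && [ "$have_gid" = "$want_gid" ]; then
        echo "OK: $user is $have_uid:$have_gid"
    else
        echo "MISMATCH: $user is $have_uid:$have_gid, expected $want_uid:$want_gid"
        return 1
    fi
}

# Example, on each contint*/releases* host:
#   check_ids jenkins 924 924 && check_ids zuul 923 923
```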