setup 2 contint machines for jenkins
Closed, Resolved, Public

Description

We are going to set up 2 contint machines on physical hardware as a stopgap for T418109 and migrate jenkins to them.

host names: contint1003.wikimedia.org, contint2003.wikimedia.org

In another step a "zuul-legacy" site will be added to the existing contint machines.

https://gerrit.wikimedia.org/r/q/topic:%22contint-split%22

  • add jenkins to APT repo for trixie
  • create puppet role/profile to install only jenkins without other CI role
  • create jenkins.discovery.wmnet
  • install envoy on contint1003/2003
  • make puppet changes to allow configuring jenkins proxy config to a new host
  • upload change to make the actual switch to new jenkins
  • set up rsync in puppet to allow syncing /var/lib/jenkins
  • pre-sync /var/lib/jenkins
  • empty out /var/lib/jenkins/jobs
  • edit firewall to allow existing contint manager to connect to port 1443 (envoy) on new machines
  • verify connection to cloud VPS works
  • verify connecting from the Internet is not allowed by firewall
  • patch puppet to enable jenkins service on contint1003 (and not contint2003)
  • remove httpd again
  • fix firewalling to connect to envoy in front of jenkins from legacy hosts
  • start jenkins manually the first time, use an SSH tunnel to open the web UI from home and go through the setup
  • make jenkins service actually start with systemd
  • make envoy listen on IPv6

    contint1002 (Manager)                                      contint1003 (New)
    [ Debian 11 Bullseye ]                                     [ Debian 13 Trixie ]
+---------------------------+                              +---------------------------+
|                           |                              |                           |
|  [ Zuul Manager ] --------|== jenkins.discovery.wmnet:1443 (HTTPS) ==> [ FIREWALL ]  |
|         |                 |                              |                |          |
|         v                 |                              |                v          |
|  [ Envoy (1443) ]         |                              |         [ Envoy (1443) ]  |
|         |                 |                              |                |          |
|         v                 |                              |                v          |
|  [ Apache (80) ]          |                              |         [ Jenkins (8080)] |
|         |                 |                              |            (Java 21)      |
|         v                 |                              +---------------------------+
|  [ Jenkins (8080) ]       |
|      (Java 17)            |
+---------------------------+

Details

Related Changes in Gerrit:
Repo                Branch        Lines +/-
operations/puppet   production    +1 -1
operations/puppet   production    +1 -1
operations/puppet   production    +4 -0
operations/puppet   production    +4 -0
operations/puppet   production    +1 -1
operations/puppet   production    +20 -1
operations/puppet   production    +1 -0
operations/puppet   production    +3 -0
operations/puppet   production    +7 -0
operations/puppet   production    +0 -1
operations/puppet   production    +3 -0
operations/puppet   production    +1 -1
operations/puppet   production    +8 -0
operations/puppet   production    +1 -1
operations/puppet   production    +14 -0
operations/puppet   production    +5 -3
operations/puppet   production    +13 -5
operations/dns      master        +3 -1
operations/puppet   production    +5 -1
operations/puppet   production    +1 -0
operations/puppet   production    +1 -0
operations/puppet   production    +5 -0
operations/dns      master        +1 -0
operations/puppet   production    +14 -12
operations/puppet   production    +29 -8
operations/puppet   production    +3 -5
operations/puppet   production    +2 -0
operations/puppet   production    +2 -0
operations/puppet   production    +5 -1
operations/puppet   production    +21 -0
operations/puppet   production    +1 -1
operations/puppet   production    +4 -0
Event Timeline

Change #1248083 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] installserver: add contint[1-2]003 to preseed regex

https://gerrit.wikimedia.org/r/1248083

Change #1248083 merged by Dzahn:

[operations/puppet@production] installserver: add contint[1-2]003 to preseed regex

https://gerrit.wikimedia.org/r/1248083

Change #1248118 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci::website: support 2 different websites, integration vs zuul-legacy

https://gerrit.wikimedia.org/r/1248118

Change #1248127 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci::website/ci::httpd: move monitoring to website, not httpd

https://gerrit.wikimedia.org/r/1248127

Dzahn renamed this task from setup 2 contint machines for zuul or jenkins to setup 2 contint machines for jenkins. Mar 5 2026, 11:02 PM
Dzahn updated the task description.

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host contint2003.wikimedia.org with OS trixie

Change #1248082 merged by Dzahn:

[operations/puppet@production] create role skeleton for jenkins

https://gerrit.wikimedia.org/r/1248082

Change #1248635 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: apply jenkins stub role on contint2003

https://gerrit.wikimedia.org/r/1248635

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host contint2003.wikimedia.org with OS trixie completed:

  • contint2003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh trixie OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202603052333_dzahn_4003818_contint2003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1248641 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] aptrepo: add jenkins for trixie

https://gerrit.wikimedia.org/r/1248641

Change #1248641 merged by Dzahn:

[operations/puppet@production] aptrepo: add jenkins for trixie

https://gerrit.wikimedia.org/r/1248641

Change #1248635 merged by Dzahn:

[operations/puppet@production] site: apply jenkins stub role on contint2003

https://gerrit.wikimedia.org/r/1248635

Change #1248892 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: set the CI manager host in Hiera

https://gerrit.wikimedia.org/r/1248892

Change #1248892 merged by Dzahn:

[operations/puppet@production] jenkins: set the CI manager host in Hiera

https://gerrit.wikimedia.org/r/1248892

Imported jenkins into trixie-wikimedia.

[apt1002:/srv/wikimedia] $ sudo -i reprepro -C thirdparty/jenkins checkupdate trixie-wikimedia
Calculating packages to get...
Updates needed for 'trixie-wikimedia|thirdparty/jenkins|amd64':
'jenkins': newly installed as '2.541.2' (from 'jenkins-bookworm'):
 files needed: pool/thirdparty/jenkins/j/jenkins/jenkins_2.541.2_all.deb

--

[apt1002:/srv/wikimedia] $ sudo -i reprepro -C thirdparty/jenkins update trixie-wikimedia
Calculating packages to get...
Getting packages...
Installing (and possibly deleting) packages...
Exporting indices...

contint2003 now has the basic jenkins installed via the new stub role "jenkins".

Notice: /Stage[main]/Jenkins/Systemd::Service[jenkins]/Systemd::Unit[jenkins]/Exec[systemd daemon-reload for jenkins.service (jenkins)]: Triggered 'refresh' from 1 event
Notice: /Stage[main]/Jenkins/Systemd::Service[jenkins]/Service[jenkins]/ensure: ensure changed 'running' to 'stopped'
Notice: Applied catalog in 33.11 seconds
[contint2003:~]

The service gets masked because the puppet code tells it to: this is not the main/active contint server.

[contint2003:~] $ systemctl status jenkins
○ jenkins.service
     Loaded: masked (Reason: Unit jenkins.service is masked.)
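
The masked state shown above can also be checked in a script. This is a minimal sketch, assuming only that `systemctl is-enabled` prints `masked` for a masked unit; the function name `jenkins_unit_state` is hypothetical and not part of the puppet code.

```shell
# Hypothetical helper: report whether the jenkins unit is masked on this host.
jenkins_unit_state() {
    local state
    # is-enabled exits non-zero for masked units, so ignore the exit code
    state="$(systemctl is-enabled jenkins.service 2>/dev/null || true)"
    if [ "$state" = "masked" ]; then
        echo "jenkins is masked (standby host)"
    else
        echo "jenkins unit state: ${state:-unknown}"
    fi
}
```

Calling `jenkins_unit_state` on contint2003 would be expected to report the masked state, and something else on the active host.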

Java 21 has been installed straight from Debian.

[contint2003:~] $ dpkg -l | grep jre
ii  openjdk-21-jre:amd64                 21.0.10+7-1~deb13u1                  amd64        OpenJDK Java runtime, using Hotspot JIT
ii  openjdk-21-jre-headless:amd64        21.0.10+7-1~deb13u1                  amd64        OpenJDK Java runtime, using Hotspot JIT (headless)

Change #1250743 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/jenkins: apply jenkins role on contint1003

https://gerrit.wikimedia.org/r/1250743

Change #1250743 merged by Dzahn:

[operations/puppet@production] site/jenkins: apply jenkins role on contint1003

https://gerrit.wikimedia.org/r/1250743

Change #1250748 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: add proxy_jenkins profile to role

https://gerrit.wikimedia.org/r/1250748

Change #1248118 merged by Dzahn:

[operations/puppet@production] ci::website: support 2 different websites, integration vs zuul-legacy

https://gerrit.wikimedia.org/r/1248118

Change #1248127 merged by Dzahn:

[operations/puppet@production] ci::website/ci::httpd: move monitoring to website, not httpd

https://gerrit.wikimedia.org/r/1248127

Change #1250752 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: add ci::httpd profile to role

https://gerrit.wikimedia.org/r/1250752

Change #1250755 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] profile::ci: add support for trixie / PHP8.4

https://gerrit.wikimedia.org/r/1250755

Change #1250756 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] add zuul-legacy to point at old zuul

https://gerrit.wikimedia.org/r/1250756

Change #1250757 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] trafficserver/contint: add zuul-legacy site

https://gerrit.wikimedia.org/r/1250757

Change #1250756 abandoned by Dzahn:

[operations/dns@master] add zuul-legacy to point at old zuul

Reason:

we probably don't need this after all

https://gerrit.wikimedia.org/r/1250756

Change #1250757 abandoned by Dzahn:

[operations/puppet@production] trafficserver/contint: add zuul-legacy site

Reason:

not needed

https://gerrit.wikimedia.org/r/1250757

Change #1250755 merged by Dzahn:

[operations/puppet@production] profile::ci: add support for trixie / PHP8.4

https://gerrit.wikimedia.org/r/1250755

Change #1250752 merged by Dzahn:

[operations/puppet@production] jenkins: add ci::httpd profile to role

https://gerrit.wikimedia.org/r/1250752

Mentioned in SAL (#wikimedia-operations) [2026-03-13T01:26:39Z] <dzahn@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on contint2003.wikimedia.org with reason: T418521

Mentioned in SAL (#wikimedia-operations) [2026-03-13T01:26:53Z] <dzahn@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on contint1003.wikimedia.org with reason: T418521

Mentioned in SAL (#wikimedia-operations) [2026-03-13T01:37:20Z] <mutante> contint1003/contint2003 - every time(?) we setup machines with puppet using our httpd module and PHP - and puppet runs for the first time we run into the same old issue with "Exec[ensure_present_mod_php" failing and "Considering conflict mpm_worker for mpm_prefork"sudo a2dismod mpm_event". The fix is: 'sudo a2dismod mpm_event' and run puppet again. T418521

The following is a known issue that we have when puppet runs for the very first time on a machine using our httpd module and PHP.

Notice: /Stage[main]/Httpd/Httpd::Mod_conf[php8.4]/Exec[ensure_present_mod_php8.4]/returns: Considering conflict mpm_worker for mpm_prefork:

Error: '/usr/sbin/a2enmod php8.4' returned 1 instead of one of [0]
Error: /Stage[main]/Httpd/Httpd::Mod_conf[php8.4]/Exec[ensure_present_mod_php8.4]/returns: change from 'notrun' to ['0'] failed: '/usr/sbin/a2enmod php8.4' returned 1 instead of one of [0]

The fix for this is:

root@contint1003:/home/dzahn# a2dismod mpm_event
Module mpm_event disabled.

and run puppet a second time.

From there on puppet recovers and there is no manual step needed.
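
The two manual steps above could be wrapped as follows. This is only a sketch: the function name is hypothetical, it assumes a root shell, and it assumes `puppet agent --test` is an acceptable way to trigger the second run (substitute whatever wrapper the host normally uses).

```shell
# Hypothetical wrapper around the manual fix described above.
# Assumption: runs as root; "puppet agent --test" triggers the agent run.
fix_first_run_mpm_conflict() {
    # Disable the conflicting MPM so that a2enmod php8.4 can succeed
    a2dismod mpm_event || return 1
    # Second puppet run; --test uses detailed exit codes, where
    # 0 = no changes and 2 = changes applied (both count as success here)
    local rc=0
    puppet agent --test || rc=$?
    [ "$rc" -eq 0 ] || [ "$rc" -eq 2 ]
}
```

Any other exit code from the puppet run makes the function return non-zero, so a remaining failure is not hidden.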

Change #1251205 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: add envoy and config for jenkins.discovery.wmnet (WIP)

https://gerrit.wikimedia.org/r/1251205

Change #1251208 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: enable the jenkins service if using new role

https://gerrit.wikimedia.org/r/1251208

Change #1250748 abandoned by Dzahn:

[operations/puppet@production] jenkins: add proxy_jenkins profile to role

Reason:

not needed - proxy config stays on old host

https://gerrit.wikimedia.org/r/1250748

Change #1254292 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] create a discovery name for new jenkins on contint machines

https://gerrit.wikimedia.org/r/1254292

Change #1254292 merged by Dzahn:

[operations/dns@master] create a discovery name for new jenkins on contint machines

https://gerrit.wikimedia.org/r/1254292

name to talk to the new jenkins:

[dns1004:~] $ host jenkins.discovery.wmnet
jenkins.discovery.wmnet is an alias for contint1003.wikimedia.org.
contint1003.wikimedia.org has address 208.80.154.137
contint1003.wikimedia.org has IPv6 address 2620:0:861:2:208:80:154:137

Change #1254295 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: define contint1003 as the manager_host for the jenkins role

https://gerrit.wikimedia.org/r/1254295

Change #1251208 abandoned by Dzahn:

[operations/puppet@production] jenkins: enable the jenkins service if using new role

Reason:

replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1254295

https://gerrit.wikimedia.org/r/1251208

Change #1254307 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint/jenkins: make the jenkins host name configurable

https://gerrit.wikimedia.org/r/1254307

Change #1254308 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: switch jenkins proxy target to new discovery name

https://gerrit.wikimedia.org/r/1254308

Change #1254307 merged by Dzahn:

[operations/puppet@production] contint/jenkins: make the jenkins host name configurable

https://gerrit.wikimedia.org/r/1254307

Change #1251205 merged by Dzahn:

[operations/puppet@production] jenkins: add envoy and config for jenkins.discovery.wmnet

https://gerrit.wikimedia.org/r/1251205

Change #1254295 merged by Dzahn:

[operations/puppet@production] jenkins: define contint1003 as the manager_host for the jenkins role

https://gerrit.wikimedia.org/r/1254295

Change #1255136 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: allow rsyncing of data for migrating a jenkins server

https://gerrit.wikimedia.org/r/1255136

Change #1255144 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci::jenkins: add firewall rule to allow legacy machines to new jenkins

https://gerrit.wikimedia.org/r/1255144

Change #1255136 merged by Dzahn:

[operations/puppet@production] jenkins: allow rsyncing of data for migrating a jenkins server

https://gerrit.wikimedia.org/r/1255136

Change #1255144 merged by Dzahn:

[operations/puppet@production] ci::jenkins: add firewall rule to allow legacy machines to new jenkins

https://gerrit.wikimedia.org/r/1255144

Change #1255797 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: pass srange as an array to firewall::service

https://gerrit.wikimedia.org/r/1255797

Change #1255797 merged by Dzahn:

[operations/puppet@production] jenkins: pass srange as an array to firewall::service

https://gerrit.wikimedia.org/r/1255797

Change #1256485 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci::jenkins: add dependency of jenkins service on firewall

https://gerrit.wikimedia.org/r/1256485

Change #1256485 merged by Dzahn:

[operations/puppet@production] ci::jenkins: add dependency of jenkins service on firewall

https://gerrit.wikimedia.org/r/1256485

Change #1256508 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: ensure /srv/jenkins/builds exists

https://gerrit.wikimedia.org/r/1256508

Change #1255139 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: remove httpd profile from role

https://gerrit.wikimedia.org/r/1255139

Change #1255139 merged by Dzahn:

[operations/puppet@production] jenkins: remove httpd profile from role

https://gerrit.wikimedia.org/r/1255139

Mentioned in SAL (#wikimedia-operations) [2026-03-20T20:45:08Z] <mutante> contint1003/2003 apt remove --purge apache2* ; apt remove --purge php* | T418521

Change #1256508 merged by Dzahn:

[operations/puppet@production] jenkins: ensure /srv/jenkins/builds exists

https://gerrit.wikimedia.org/r/1256508

Change #1256614 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: include firewall and set provider for new role

https://gerrit.wikimedia.org/r/1256614

Change #1256614 merged by Dzahn:

[operations/puppet@production] jenkins: include firewall and set provider for new role

https://gerrit.wikimedia.org/r/1256614

Change #1256642 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: let envoy listen on IPv6

https://gerrit.wikimedia.org/r/1256642

Change #1256642 merged by Dzahn:

[operations/puppet@production] jenkins: let envoy listen on IPv6

https://gerrit.wikimedia.org/r/1256642

@thcipriani @hashar @LSobanski @Muehlenhoff

This is resolved.

We now have 2 new machines, contint1003 and contint2003, with:

  • Debian trixie
  • Java 21
  • envoy
  • Jenkins

using the new puppet role::jenkins which sets up envoy and jenkins.

We removed the Apache httpd proxy between them.

The name jenkins.discovery.wmnet has been created and envoy listens there on port 1443 and proxies to jenkins on 8080.

Only legacy contint machines are allowed to connect to it.

We are using nftables as firewall and it also works via IPv6 since envoy now listens on it.

A couple of puppet issues have been fixed.

Rsyncing the /var/lib/jenkins data is possible via the rsync::quickdatacopy class, which sets up rsyncd and the firewall rules for it, plus the file /usr/local/sbin/sync-var-lib-jenkins-contint with the actual sync command.

Automatic syncing is disabled.

The profile::ci::manager_host (active contint machine) is set to legacy contint1002 for role::ci and to contint1003 for role::jenkins.

This is what decides if jenkins runs or is masked. Therefore we have exactly one old and one new jenkins running, with jenkins masked on both codfw machines.
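
As an illustration of that rule only: the sketch below mirrors "jenkins runs on the manager host and is masked everywhere else". The function name and its arguments are hypothetical, and whether the Hiera value is an FQDN is an assumption; the real decision is made in the puppet code via profile::ci::manager_host.

```shell
# Illustration only, not the puppet implementation.
# Assumption: host and manager_host are compared as plain strings (FQDNs here).
jenkins_expected_state() {
    local host="$1" manager_host="$2"
    if [ "$host" = "$manager_host" ]; then
        echo "running"
    else
        echo "masked"
    fi
}

# role::jenkins, where manager_host is contint1003
jenkins_expected_state contint1003.wikimedia.org contint1003.wikimedia.org
jenkins_expected_state contint2003.wikimedia.org contint1003.wikimedia.org
```

The same comparison with contint1002 as the manager host gives the result for role::ci on the legacy machines.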

[contint1002:~] $ telnet -6 jenkins.discovery.wmnet 1443
Trying 2620:0:861:2:208:80:154:137...
Connected to contint1003.wikimedia.org.
Escape character is '^]'.

On first start jenkins wants to go through the initial admin setup. I have done this twice.

Once with the default plugins and now another time after syncing data. This makes it "preselect" the plugins we have installed in prod, but let's double check that together.

The current password is in /var/lib/jenkins/secrets/initialAdminPassword.

The setup process can be restarted by deleting the contents of /var/lib/jenkins and restarting the service.

After syncing data I deleted the contents of /var/lib/jenkins/jobs so that it does not start doing anything on existing jobs.
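
One way to empty the jobs directory while keeping the directory itself in place is sketched below. The helper name is hypothetical and this is not necessarily the command that was used; it assumes a `find` that supports `-mindepth` and `-delete`.

```shell
# Hypothetical helper: remove everything below a jobs directory but keep the
# directory itself (and its ownership and permissions) in place.
empty_jobs_dir() {
    local jobs_dir="${1:-/var/lib/jenkins/jobs}"
    [ -d "$jobs_dir" ] || { echo "no such directory: $jobs_dir" >&2; return 1; }
    # -mindepth 1 spares the top-level directory; -delete works depth-first
    find "$jobs_dir" -mindepth 1 -delete
}
```

Running it while the jenkins service is stopped avoids jenkins rewriting job data during the cleanup.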

To configure it, use an SSH tunnel:

ssh -D 8080 contint1002.wikimedia.org

Configure your browser to use a SOCKS5 proxy on localhost:8080.

Open https://jenkins.discovery.wmnet:1443/ci/ in your browser (accept the warning).

(The /ci/ prefix is needed because we used that in the apache config on the legacy servers.)
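
The tunnel can also be tested from the command line before configuring a browser. This is a sketch, assuming curl is available locally and the SOCKS tunnel is already open on the given port; the function name is hypothetical.

```shell
# Terminal 1: open the SOCKS tunnel (-N: no remote command, only forwarding)
#   ssh -N -D 8080 contint1002.wikimedia.org

# Terminal 2: hypothetical helper that prints the HTTP status code seen
# through the tunnel.
jenkins_status_via_tunnel() {
    local socks_port="${1:-8080}"
    # --socks5-hostname resolves the name on the far side of the tunnel;
    # -k skips certificate verification, like accepting the browser warning
    curl -sk -o /dev/null -w '%{http_code}\n' \
        --socks5-hostname "localhost:${socks_port}" \
        "https://jenkins.discovery.wmnet:1443/ci/"
}
```

Resolving the name through the proxy matters here because jenkins.discovery.wmnet is an internal name that may not resolve from a machine outside the production network.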

Change #1260816 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: include docker, add comments

https://gerrit.wikimedia.org/r/1260816

Change #1260816 abandoned by Dzahn:

[operations/puppet@production] jenkins: include docker, add comments

Reason:

replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1267173

https://gerrit.wikimedia.org/r/1260816

Change #1267173 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: add profile::ci::docker to role

https://gerrit.wikimedia.org/r/1267173

Change #1267173 merged by Dzahn:

[operations/puppet@production] jenkins: add profile::ci::docker to role

https://gerrit.wikimedia.org/r/1267173

After https://gerrit.wikimedia.org/r/c/operations/puppet/+/1267173 the new contint1003/2003 hosts have docker(.io) installed.

Change #1268258 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] jenkins: switch firewall provider to ferm

https://gerrit.wikimedia.org/r/1268258

Change #1268258 merged by Dzahn:

[operations/puppet@production] jenkins: switch firewall provider to ferm

https://gerrit.wikimedia.org/r/1268258

Change #1268262 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci::docker: also install docker-cli when installing docker.io

https://gerrit.wikimedia.org/r/1268262

Change #1268262 merged by Dzahn:

[operations/puppet@production] ci::docker: also install docker-cli when installing docker.io

https://gerrit.wikimedia.org/r/1268262

@Dzahn Would it make sense to wait until @hashar is back from his vacation before we do the switch? He's the only one who really knows the Jenkins contint setup.