Migrate contint* hosts to Buster
Open, Medium, Public

Description

The CI production servers are using Debian Jessie and have to be upgraded. We will go for Buster. The hosts are:

Host                       Role
contint1001.wikimedia.org  Primary
contint2001.wikimedia.org  Spare

The services to be migrated are:

  • docker-pkg
    • Buster has been supported since Gerrit change 585451
    • Updated on April 2nd. Will need to be redeployed after upgrade: scap deploy --limit contint.*
  • Docker
    • We can afford to lose the containers. They will be redownloaded from the registry if need be.
  • Pipeline containers building

Migration

The overall sequence is to upgrade contint2001, migrate the services to it, upgrade contint1001, then move the services back to contint1001.

Zuul to scap

Zuul has been deployed using a Debian package, but that method is painful for everyone. On Buster we will deploy it using scap. We can do all the scap-related work before upgrading from Jessie to Buster; we need a feature switch based on the target distribution so that a host still relies on the Debian package as long as it is still running Jessie.

  • Craft a scap deployment repository for Zuul
  • Get puppet patches to vary based on the Distribution

The deployment on a host can only be done after it has been upgraded to Buster.
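
The feature switch described in this section could be pictured roughly as follows. This is only a shell illustration of the decision, not the actual Puppet code: the function name zuul_deploy_method is hypothetical, and the sketch assumes the distribution codename is passed as the first argument (for instance from `lsb_release -sc`).

```shell
#!/bin/sh
# Hypothetical helper (not part of Puppet or scap): choose how Zuul is
# installed based on the Debian codename of the target host.
zuul_deploy_method() {
    case "$1" in
        jessie) echo "package" ;;  # Jessie hosts keep the Debian package
        buster) echo "scap" ;;     # Buster hosts are deployed with scap
        *)      echo "unknown"; return 1 ;;
    esac
}

# Example: zuul_deploy_method "$(lsb_release -sc)"
```

Keeping the decision in a single place means a host switches deployment method simply by being reinstalled with Buster, without a separate configuration change.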

contint2001 upgrade

zuul-merger
~~~~~~~~~

The sole production service being run on contint2001 is zuul-merger:

docker-pkg
~~~~~~~~

For contint2001:

Puppet run on contint2001 without errors / missing packages built:

  • E: Unable to locate package blubber
  • E: Unable to locate package zuul
  • E: Unable to locate package helm
  • E: Unable to locate package helmfile
  • E: Unable to locate package helm-diff
  • E: Unable to locate package kubernetes-client
  • mod_php_7.3 - ERROR: Module mpm_event is enabled - cannot proceed due to conflicts. It needs to be disabled first (known issue -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206)

Jenkins agent
~~~~~~~~~~~

  • Add contint2001 as a Jenkins agent (copying contint1001)
  • Disable contint1001 agent
  • Update jobs in integration/config to point to contint2001

Migrate

  • Stop zuul, zuul-merger, jenkins on contint1001
  • rsync data
  • change DNS backend for contint.wikimedia.org
  • update fabfile to have zuul reloaded on contint2001 instead of contint1001
  • Set contint2001 as master in Puppet / Hiera
  • Start Jenkins, verify that agents are connected and jobs set
  • Start Zuul scheduler
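
As a memory aid, the switchover order above can be written as a dry-run script that only prints the steps. This is a sketch under assumptions: the function name print_switch_plan is hypothetical, the systemd unit names (zuul, zuul-merger, jenkins) are assumed, and nothing is executed on any host.

```shell
#!/bin/sh
# Dry-run sketch: print the switchover steps in order. Nothing is executed.
# Unit names are assumptions made for illustration.
print_switch_plan() {
    old="$1"   # host currently running the services
    new="$2"   # host taking over
    echo "1. on ${old}: systemctl stop zuul zuul-merger jenkins"
    echo "2. rsync data from ${old} to ${new}"
    echo "3. change DNS backend for contint.wikimedia.org to ${new}"
    echo "4. update fabfile to have zuul reloaded on ${new}"
    echo "5. set ${new} as master in Puppet / Hiera"
    echo "6. on ${new}: start Jenkins, verify agents and jobs"
    echo "7. on ${new}: start the Zuul scheduler"
}

print_switch_plan contint1001 contint2001
```

Swapping the two arguments gives the plan for moving the services back once contint1001 has been reinstalled.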

contint1001 upgrade

  • reinstall with buster
  • deploy the zuul scap to the machine: scap deploy --limit contint1001
  • redeploy docker-pkg: scap deploy --limit contint1001
  • update the fabfile.py for deploy_docker
  • Stop zuul, zuul-merger, jenkins on contint2001
  • rsync data
  • change DNS backend for contint.wikimedia.org
  • update fabfile to have zuul reloaded on contint1001 instead of contint2001
  • Set contint1001 as master in Puppet / Hiera
  • Start Jenkins, verify that agents are connected and jobs set
  • Start Zuul scheduler

Details

Due Date
Mar 29 2020, 10:00 PM
Project                   Branch      Lines +/-  Subject
operations/puppet         production  +4 -0
operations/puppet         production  +5 -6
operations/puppet         production  +7 -37
operations/puppet         production  +1 -1
operations/puppet         production  +13 -1
operations/puppet         production  +7 -0
operations/dns            master      +1 -1
operations/puppet         production  +7 -7
integration/config        master      +1 -1
integration/config        master      +13 -13
operations/puppet         production  +10 -15
operations/puppet         production  +1 -0
operations/puppet         production  +1 -1
operations/puppet         production  +6 -1
blubber                   debian      +6 -0
operations/puppet         production  +1 -1
operations/dns            master      +2 -0
operations/puppet         production  +16 -1
operations/puppet         production  +0 -1
operations/puppet         production  +9 -0
operations/puppet         production  +18 -8
operations/puppet         production  +12 -0
operations/puppet         production  +7 -0
operations/puppet         production  +8 -2
integration/zuul/deploy   master      +1 -1
integration/zuul/deploy   master      +1 -1
integration/config        master      +3 -2
operations/puppet         production  +0 -5
operations/puppet         production  +13 -1
operations/puppet         production  +2 -0
operations/puppet         production  +10 -17
operations/puppet         production  +18 -10
operations/puppet         production  +21 -5
operations/puppet         production  +38 -0
operations/puppet         production  +1 -0
operations/puppet         production  +2 -0
operations/puppet         production  +1 -1

Event Timeline

https://gerrit.wikimedia.org/r/587706 would build Blubber for Buster. At least it works for me locally, though there is another, unrelated failure.

Anyway, we might not depend on the Debian package anymore. Blubber is nowadays deployed as a micro service (blubberoid) and it seems the Pipeline stuff entirely relies on the micro service rather than the package. So maybe we will be able to drop it entirely.

@dduvall / @thcipriani would know whether the blubber package is still needed.

Change 587782 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] zuul: fix dependency on /etc/zuul and package if on buster

https://gerrit.wikimedia.org/r/587782

Should not block the migration. The pipeline library now uses blubberoid rather than blubber. Feel free to drop this from blocking tasks.

Icinga downtime for 1 day, 0:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: reimage

contint2001.wikimedia.org

Icinga downtime for 4 days, 0:00:00 set by dzahn@cumin1001 on 1 host(s) and their services with reason: reimage

contint2001.wikimedia.org

Change 587862 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] ci: remove blubber Debian package

https://gerrit.wikimedia.org/r/587862

Change 587782 merged by Dzahn:
[operations/puppet@production] zuul: fix dependency on /etc/zuul and package if on buster

https://gerrit.wikimedia.org/r/587782

Change 587862 merged by Dzahn:
[operations/puppet@production] ci: remove blubber Debian package

https://gerrit.wikimedia.org/r/587862

Change 587963 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] fab: use contint2001 to build Docker images

https://gerrit.wikimedia.org/r/587963

hashar updated the task description. Apr 10 2020, 8:11 AM

Change 587963 merged by jenkins-bot:
[integration/config@master] fab: use contint2001 to build Docker images

https://gerrit.wikimedia.org/r/587963

Dzahn added a comment. Apr 10 2020, 8:35 AM

Merged the change that removes the blubber package from puppet. Did not manually remove it from contint1001 but should have removed one of the issues on contint2001.

Fixed the comment in the deployment key for zuul, re-arming keyholder. Antoine tested zuul deployment with scap before and it worked.

Dzahn updated the task description. Apr 10 2020, 8:37 AM

Change 587967 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/zuul/deploy@master] scap: strip 'integration' from git_repo

https://gerrit.wikimedia.org/r/587967

Change 587967 merged by Hashar:
[integration/zuul/deploy@master] scap: strip 'integration' from git_repo

https://gerrit.wikimedia.org/r/587967

Change 587970 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/zuul/deploy@master] scap: fix deploy_user

https://gerrit.wikimedia.org/r/587970

Change 587970 merged by Hashar:
[integration/zuul/deploy@master] scap: fix deploy_user

https://gerrit.wikimedia.org/r/587970

Change 587974 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] zuul: use zuul-deployers not contint-root

https://gerrit.wikimedia.org/r/587974

Change 587974 merged by Dzahn:
[operations/puppet@production] zuul: use zuul-deployers not contint-root

https://gerrit.wikimedia.org/r/587974

hashar updated the task description. Apr 10 2020, 2:35 PM
hashar updated the task description. Apr 10 2020, 4:15 PM

Mentioned in SAL (#wikimedia-releng) [2020-04-10T16:46:26Z] <hashar> contint1001: deleting docker-pkg maintained images. They are in the registry anyway. # T224591

I have pruned all the containers from contint1001, they are in the Docker registry anyway:

Filesystem                            Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--data-docker  246G  168M  234G   1% /mnt/docker

Change 588687 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] zuul: create /var/log/zuul

https://gerrit.wikimedia.org/r/588687

Change 588687 merged by Dzahn:
[operations/puppet@production] zuul: create /var/log/zuul

https://gerrit.wikimedia.org/r/588687

Change 588707 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] zuul: add missing ssh public key for zuul-merger

https://gerrit.wikimedia.org/r/588707

Change 588708 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] Profile to inject Gerrit ssh public key to known_hosts

https://gerrit.wikimedia.org/r/588708

Change 588707 merged by Dzahn:
[operations/puppet@production] zuul: add missing ssh public key for zuul-merger

https://gerrit.wikimedia.org/r/588707

Change 588708 merged by Dzahn:
[operations/puppet@production] Profile to inject Gerrit ssh public key to known_hosts

https://gerrit.wikimedia.org/r/588708

Change 588968 had a related patch set uploaded (by Dzahn; owner: Hashar):
[operations/puppet@production] zuul: inject Gerrit ssh public key to known_hosts

https://gerrit.wikimedia.org/r/588968

Change 588973 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] ATS: switch backend for integration.wm.org to contint2001

https://gerrit.wikimedia.org/r/588973

Change 588968 merged by Dzahn:
[operations/puppet@production] zuul: inject Gerrit ssh public key to known_hosts

https://gerrit.wikimedia.org/r/588968

Mentioned in SAL (#wikimedia-operations) [2020-04-15T13:10:39Z] <hashar> contint2001: starting zuul-merger process # T224591

Change 589013 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: enable zuul-merger on contint2001

https://gerrit.wikimedia.org/r/589013

hashar updated the task description. Apr 15 2020, 1:26 PM

Change 589013 merged by Dzahn:
[operations/puppet@production] contint: enable zuul-merger on contint2001

https://gerrit.wikimedia.org/r/589013

Change 589023 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: allow masters to ssh to themselves

https://gerrit.wikimedia.org/r/589023

Change 589023 merged by Dzahn:
[operations/puppet@production] contint: allow masters to ssh to themselves

https://gerrit.wikimedia.org/r/589023

Change 589285 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add contint.wikimedia.org service alias for contint machines

https://gerrit.wikimedia.org/r/589285

Change 589285 merged by Dzahn:
[operations/dns@master] add contint.wikimedia.org service alias for contint machines

https://gerrit.wikimedia.org/r/589285

Change 588973 merged by Dzahn:
[operations/puppet@production] ATS: use contint service alias as backend for integration.wm.org

https://gerrit.wikimedia.org/r/588973

Dzahn added a comment. Edited Apr 17 2020, 11:38 AM
  • E: Unable to locate package blubber
  • E: Unable to locate package zuul
  • E: Unable to locate package helm
  • E: Unable to locate package helmfile
  • E: Unable to locate package helm-diff
  • E: Unable to locate package kubernetes-client
  • mod_php_7.3 - ERROR: Module mpm_event is enabled - cannot proceed due to conflicts. It needs to be disabled first (known issue -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/451206)
Dzahn updated the task description. Apr 17 2020, 11:56 AM

kubernetes-client just ships a static ELF binary, so I copied the existing package from stretch-wikimedia to buster-wikimedia.

Mentioned in SAL (#wikimedia-operations) [2020-04-17T12:28:06Z] <moritzm> copied kubernetes-client from stretch-wikimedia to buster-wikimedia T224591

Dzahn updated the task description. Apr 17 2020, 12:41 PM

Change 587706 abandoned by Hashar:
Rebuild for Buster

Reason:
CI uses Blubberoid and we no longer rely on the Debian package.

https://gerrit.wikimedia.org/r/587706

Change 591000 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] tlsproxy::envoy: allow limiting firewall srange

https://gerrit.wikimedia.org/r/591000

Dzahn updated the task description. Apr 20 2020, 9:14 AM
Dzahn added a comment. Apr 20 2020, 9:17 AM

@hashar Thanks to Janis also building the helmfile package, puppet on contint2001 now runs without any errors. It is also all green on Icinga now, and I removed all monitoring downtimes.

Change 591000 merged by Dzahn:
[operations/puppet@production] tlsproxy::envoy: allow limiting firewall srange

https://gerrit.wikimedia.org/r/591000

Change 591037 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: ignore more Docker partitions disk checks

https://gerrit.wikimedia.org/r/591037

Change 591038 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: move Docker data out of / on contint2001

https://gerrit.wikimedia.org/r/591038

Over the weekend I noticed a few ephemeral disk alarms for Docker on contint2001. The reason is that the data files are under /var/lib/docker, which is not ignored by the disk checks; ultimately the data files should be on the /srv/ disk.

The following patches should fix that:

  • https://gerrit.wikimedia.org/r/591037
  • https://gerrit.wikimedia.org/r/591038

On another topic, I have cleaned up a bunch of data on contint1001 to speed up the rsync that will have to be done before the migration:

$ ssh contint1001.wikimedia.org df -h /srv
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/contint1001--vg-data  870G  214G  613G  26% /srv

Change 591037 merged by Dzahn:
[operations/puppet@production] contint: ignore more Docker partitions disk checks

https://gerrit.wikimedia.org/r/591037

Change 591038 merged by Dzahn:
[operations/puppet@production] contint: move Docker data out of / on contint2001

https://gerrit.wikimedia.org/r/591038

Mentioned in SAL (#wikimedia-operations) [2020-04-27T10:32:42Z] <mutante> contint2001 - systemd status was degraded. icinga alerted. failed unit was jenkins. starting it failed with "address already in use". manually started without using systemctl? killed jenkins and started again with systemctl. T224591

@hashar I noticed that the zuul command is missing on contint2001 (e.g. for zuul enqueue). Not sure whether that is expected right now, but figured I'd report it here in case :)

Change 594475 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] contint: move common and default Hiera settings to role level

https://gerrit.wikimedia.org/r/594475

Change 594477 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] contint: switch jenkins/zuul from contint1001 to contint2001

https://gerrit.wikimedia.org/r/594477

Change 594480 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] switch contint from 1001 to 2001

https://gerrit.wikimedia.org/r/594480

Change 594475 merged by Dzahn:
[operations/puppet@production] contint: move common and default Hiera settings to role level

https://gerrit.wikimedia.org/r/594475

Dzahn added a comment. Thu, May 7, 7:51 AM

@hashar So.. when do we schedule the maintenance window early next week? Monday?

Dzahn updated the task description. Mon, May 11, 8:27 AM

Mentioned in SAL (#wikimedia-operations) [2020-05-11T08:32:36Z] <mutante> rsynced data from contint1001 to contint2001 - pathes per T224591#6039192 for the migration later today

Mentioned in SAL (#wikimedia-operations) [2020-05-11T09:07:32Z] <mutante> contint1001 - rsync -avpz --delete /srv/jenkins/ rsync://contint2001.wikimedia.org/ci--srv-/jenkins/ (T224591)

hashar updated the task description. Mon, May 11, 12:04 PM

Change 595511 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Switch jobs to contint2001

https://gerrit.wikimedia.org/r/595511

Change 595512 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] fab: deploy zuul config on contint2001

https://gerrit.wikimedia.org/r/595512

Change 595511 merged by jenkins-bot:
[integration/config@master] Switch jobs to contint2001

https://gerrit.wikimedia.org/r/595511

Change 595512 merged by jenkins-bot:
[integration/config@master] fab: deploy zuul config on contint2001

https://gerrit.wikimedia.org/r/595512

Mentioned in SAL (#wikimedia-operations) [2020-05-11T12:14:47Z] <hashar> shutting down Zuul and Jenkins for system switch # T224591

Change 594480 merged by Dzahn:
[operations/dns@master] switch contint from 1001 to 2001

https://gerrit.wikimedia.org/r/594480

hashar updated the task description. Mon, May 11, 12:28 PM

Change 594477 merged by Dzahn:
[operations/puppet@production] contint: switch jenkins/zuul/gearman to contint2001

https://gerrit.wikimedia.org/r/594477

Mentioned in SAL (#wikimedia-operations) [2020-05-11T12:50:26Z] <hashar> Pointing CI Jenkins to contint2001 Gearman server T224591

Change 595521 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] jenkins: add missing /srv/jenkins dir

https://gerrit.wikimedia.org/r/595521

Mentioned in SAL (#wikimedia-operations) [2020-05-11T13:36:31Z] <hashar> Rolling back CI system switch to previous known state # T224591

Change 595525 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] contint: fix git cloning of docroot for integration.wm.org

https://gerrit.wikimedia.org/r/595525

The upgrade itself went well:

  • the rsync of data from contint1001 to contint2001 finished just in time
  • our rsync configuration caused file ownership to be transferred by uid instead of by name, so we had to run a lot of chown commands to fix it
  • the zuul scheduler deployed from scap did process events. I have tested it locally.

In Jenkins, the job requests landed in its build queue but the agent executors rejected them, each claiming to be busy. It might be due to the different Java version being used (8 vs 11), which in turn might point at the Gearman plugin :-\ We have thus rolled back to contint1001, which took just a couple of minutes.
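
Since the suspected cause is a Java 8 vs 11 mismatch, it may help to compare the major version on both hosts before retrying. The helper below is a minimal sketch: the name java_major is hypothetical, and the version strings in the comments are sample values, not taken from the hosts.

```shell
#!/bin/sh
# Hypothetical helper: reduce a Java version string to its major version.
# Java 8 and earlier use the legacy "1.x" scheme, later releases do not.
#   "1.8.0_252" -> 8
#   "11.0.7"    -> 11
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;  # legacy scheme: major is the 2nd field
        *)   echo "$1" | cut -d. -f1 ;;  # modern scheme: major is the 1st field
    esac
}
```

The version string itself would have to be extracted from the output of `java -version` on each host; that parsing is left out here.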

Follow up actions

Try to reproduce the issue with same zuul/jenkins/java11

Moritz pointed out we have a java 8 package for Buster: deb http://apt.wikimedia.org/wikimedia buster-wikimedia component/jdk8

Find a fix for rsync to keep the user/group, either:

  • ensure a consistent uid across the fleet for the jenkins and zuul users
  • update rsyncd.conf: it has use chroot enabled, which makes numeric ids default to on and thus prevents the name/id mapping from occurring
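
To check whether the uids actually differ between the two hosts, something like the following could be used. It is a sketch only: uid_from_passwd is a hypothetical helper that reads passwd-formatted lines on stdin, and the ssh invocations in the comments assume shell access to both hosts.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric uid of a user from passwd-formatted
# input on stdin (name:password:uid:gid:gecos:home:shell).
uid_from_passwd() {
    awk -F: -v user="$1" '$1 == user { print $3 }'
}

# Example (requires ssh access to both hosts):
#   ssh contint1001.wikimedia.org getent passwd jenkins | uid_from_passwd jenkins
#   ssh contint2001.wikimedia.org getent passwd jenkins | uid_from_passwd jenkins
# If the two numbers differ, a transfer by numeric id leaves the files
# owned by the wrong user on the receiving side.
```

The same check applies to the zuul user.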

Change 595531 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Initially use Java 8 for contint on Buster

https://gerrit.wikimedia.org/r/595531

Change 595531 merged by Dzahn:
[operations/puppet@production] Initially use Java 8 for contint on Buster

https://gerrit.wikimedia.org/r/595531

Change 595521 merged by Dzahn:
[operations/puppet@production] jenkins: add missing /srv/jenkins dir

https://gerrit.wikimedia.org/r/595521

Mentioned in SAL (#wikimedia-operations) [2020-05-18T09:46:37Z] <mutante> contint2001 - apt-get remove --purge openjdk-11-* - T224591

Change 597090 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] jenkins: master should stick to java 8

https://gerrit.wikimedia.org/r/597090

Change 597090 merged by Dzahn:
[operations/puppet@production] jenkins: master should stick to java 8

https://gerrit.wikimedia.org/r/597090

Change 597255 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] jenkins/icinga: fix process monitoring after change in command line

https://gerrit.wikimedia.org/r/597255

Change 597255 merged by Dzahn:
[operations/puppet@production] jenkins/icinga: fix process monitoring after change in command line

https://gerrit.wikimedia.org/r/597255

We now have Jenkins pinned to Java 8, so I guess we can try the switchover again.

@Dzahn would you be available at the beginning of next week (Monday/Tuesday)? (we can sync up over irc to find a good time)

Dzahn added a comment. Wed, May 20, 7:46 AM

Yes, some time between Monday 9 to 5 or Tuesday 9 to 4. Or simply tomorrow, Thursday, even preferable.

After discussion with @Dzahn we will do the switch Wednesday May 27th at 9:30 CEST (7:30 UTC).

Change 598068 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] zuul: add convenience link to 'zuul' bin

https://gerrit.wikimedia.org/r/598068

Change 598068 merged by Dzahn:
[operations/puppet@production] zuul: add convenience link to 'zuul' bin

https://gerrit.wikimedia.org/r/598068