
Rebuild integration-slave-docker-* instances to use less RAM, new name and Stretch
Closed, Resolved (Public)

Description

I would like to rebuild the whole fleet of integration-slave-docker instances for a few reasons:

  1. upgrade from Jessie to Stretch
  2. reduce RAM from 32G to 24G by switching the flavor from bigram to mediumram (available since T225025)
  3. in the hostname, replace slave with agent
  4. use /srv/jenkins/workspace (instead of jenkins-workspace)

Steps:

Wait 3 or 4 minutes for the instance to be fully provisioned. Then on the instance:

  • rm -fR /var/lib/puppet/ssl && puppet agent -tv
  • if that complains:
    • get the instance's fully qualified domain name: hostname --fqdn
    • on integration-puppetmaster01.integration.eqiad.wmflabs: sudo puppet cert clean <FQDN OF INSTANCE HERE>
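The recovery step above can be sketched as a small helper that only prints the command to run on the puppetmaster, so nothing is executed by accident. The function name cert_clean_command is hypothetical; the puppet command itself is the one from the checklist.

```shell
#!/bin/sh
# Sketch only: prints the recovery command instead of running it.
# cert_clean_command is a hypothetical helper name, not part of any tool.

# Print the command to run on integration-puppetmaster01 for a given FQDN.
cert_clean_command() {
    fqdn="$1"
    if [ -z "$fqdn" ]; then
        echo "usage: cert_clean_command <fqdn>" >&2
        return 1
    fi
    printf 'sudo puppet cert clean %s\n' "$fqdn"
}
```

On the new instance, cert_clean_command "$(hostname --fqdn)" prints the line to paste on the puppetmaster. After cleaning the certificate, rerun the rm and puppet agent step on the instance.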

Apply the puppet role

  • Run puppet on the instance
  • Make sure there is a /var/lib/docker partition for Docker
  • Upgrade and reboot: sudo apt -y dist-upgrade && sudo /usr/sbin/reboot
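One possible way to check the /var/lib/docker item above, assuming a Linux instance where /proc/mounts is readable. The function names are hypothetical, and the optional second argument (an alternative mounts table) exists only to make the check easy to try out.

```shell
#!/bin/sh
# Sketch only: is_mount_point and check_docker_partition are hypothetical names.

# Return 0 if the path in $1 is listed as a mount point in a mounts table.
# $2 is the mounts table to read and defaults to /proc/mounts.
is_mount_point() {
    awk -v target="$1" '$2 == target { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Report whether the Docker data directory from the checklist is its own mount.
check_docker_partition() {
    if is_mount_point /var/lib/docker "$@"; then
        echo "/var/lib/docker is a separate mount"
    else
        echo "/var/lib/docker is NOT a separate mount" >&2
        return 1
    fi
}
```

Running check_docker_partition with no arguments on the instance reads /proc/mounts. A non-zero exit status suggests the partition was not set up by Puppet.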

Add the instance to Jenkins

  • Create node copying integration-agent-docker-1001
  • Remote root directory is /srv/jenkins/workspace
  • Change the IP address
  • Once the master tries to connect, manually accept the ssh key from the side bar: Trust SSH Host Key

Event Timeline

E: Version '18.06.2~ce~3-0~debian' for 'docker-ce' was not found

On Jessie we have:

$ apt-cache policy docker-ce
docker-ce:
  Installed: 18.06.2~ce~3-0~debian
  Candidate: 18.06.2~ce~3-0~debian
  Version table:
 *** 18.06.2~ce~3-0~debian 0
       1001 http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/thirdparty/ci amd64 Packages
        100 /var/lib/dpkg/status
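To compare hosts quickly, a helper along these lines could tell whether a given version shows up in the version table that apt-cache policy prints. This is a sketch: policy_lists_version is a hypothetical name, and it only inspects the text it is given on standard input. A version that is listed solely because it is installed (/var/lib/dpkg/status) would still match.

```shell
#!/bin/sh
# Sketch only: policy_lists_version is a hypothetical helper name.

# Read `apt-cache policy <package>` output on stdin and return 0 if the
# version in $1 appears as an entry in the "Version table:" section.
policy_lists_version() {
    awk -v want="$1" '
        /Version table:/ { in_table = 1; next }
        in_table {
            # Installed entries are prefixed with "***".
            ver = ($1 == "***") ? $2 : $1
            if (ver == want) found = 1
        }
        END { exit !found }
    '
}
```

For example, apt-cache policy docker-ce | policy_lists_version '18.06.2~ce~3-0~debian' would succeed on the Jessie host shown above and, given the error message, presumably fail on a Stretch host.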

Change 518222 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: remove zuul-cloner from Docker agent

https://gerrit.wikimedia.org/r/518222

I have also removed the zuul package from the Docker instances.

hashar@integration-cumin:~$ sudo cumin --force 'name:docker' 'apt-get -y remove --purge zuul'
hashar@integration-cumin:~$ sudo cumin --force 'name:docker' 'rm -fR /var/lib/zuul /usr/share/python/zuul'

greg triaged this task as Medium priority. Jul 6 2019, 5:01 AM
greg moved this task from Soon-ish to Next on the Release-Engineering-Team-TODO board.

Given that as of two days ago Buster is now officially stable and so per https://wikitech.wikimedia.org/wiki/Operating_system_upgrade_policy is allowed to be used in production, do we want to upgrade straight to that?

I would rather not be one of the first adopters of Buster :-]

Change 518222 merged by Alexandros Kosiaris:
[operations/puppet@production] contint: remove zuul-cloner from Docker agent

https://gerrit.wikimedia.org/r/518222

Yay Stretch! Please make sure to test that the reduced RAM will still be enough for phan to run relatively quickly.

hashar renamed this task from "Rebuild integration-slave-docker-* instances to use less RAM, new name and Stretch" to "Rebuild integration-slave-docker-* instances to use less RAM, new name". Sep 19 2019, 9:50 AM
hashar updated the task description.

Will do the Stretch upgrade later on when I can also handle the upgrade to a more recent Docker daemon. Let's stick to the current stack for now.

Change 538259 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: pin Docker on stretch

https://gerrit.wikimedia.org/r/538259

hashar renamed this task from "Rebuild integration-slave-docker-* instances to use less RAM, new name" to "Rebuild integration-slave-docker-* instances to use less RAM, new name and Stretch". Sep 20 2019, 1:02 PM
hashar updated the task description.

I originally wanted to reuse the exact same Docker package on Stretch (T226236), but that got rejected. After some madness, things seem to work (so far). The Stretch instances receive Docker 18.09.7 from thirdparty/ci instead of the 18.06.2 used on Jessie.
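For readers unfamiliar with apt pinning, the general shape of a version pin can be sketched as below. This is not necessarily what change 538259 does; render_apt_pin is a hypothetical helper that only prints a stanza in apt_preferences format. The task does not give the full version string of the Stretch package, so the usage note reuses the Jessie string quoted earlier, and the priority of 1001 is only an illustrative value.

```shell
#!/bin/sh
# Sketch only: render_apt_pin is a hypothetical helper; it prints a stanza
# and does not write anything under /etc/apt/preferences.d.

# $1 = package name, $2 = version (or version glob), $3 = pin priority.
render_apt_pin() {
    printf 'Package: %s\nPin: version %s\nPin-Priority: %s\n' "$1" "$2" "$3"
}
```

Calling render_apt_pin docker-ce '18.06.2~ce~3-0~debian' 1001 prints a three-line stanza that could be reviewed before being placed in a preferences file. A priority above 1000 makes apt prefer the pinned version even if that means a downgrade.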

There is now:

integration-agent-docker-1001    172.16.7.144    24G RAM    8 vCPUs

https://integration.wikimedia.org/ci/computer/integration-agent-docker-1001/

Mentioned in SAL (#wikimedia-releng) [2019-09-20T13:14:32Z] <hashar> Pooled integration-agent-docker-1001 based on Stretch # T226233

Change 538259 merged by Muehlenhoff:
[operations/puppet@production] contint: pin Docker on stretch

https://gerrit.wikimedia.org/r/538259

Mentioned in SAL (#wikimedia-releng) [2019-09-20T14:47:27Z] <hashar> Pooled integration-agent-docker-1002 integration-agent-docker-1003 based on Stretch # T226233

Mentioned in SAL (#wikimedia-releng) [2019-09-20T15:14:26Z] <hashar> Pooled integration-agent-docker-1004 based on Stretch # T226233

Mentioned in SAL (#wikimedia-releng) [2019-09-20T15:16:20Z] <hashar> Pooled integration-agent-docker-1005 based on Stretch # T226233

Mentioned in SAL (#wikimedia-releng) [2019-09-23T18:43:52Z] <hashar> Added integration-agent-docker-1006 # T226233

Mentioned in SAL (#wikimedia-releng) [2019-09-23T19:06:37Z] <hashar> Added integration-agent-docker-1007 # T226233

Mentioned in SAL (#wikimedia-releng) [2019-09-23T19:09:11Z] <hashar> Added integration-agent-docker-1008 # T226233

Mentioned in SAL (#wikimedia-releng) [2019-09-23T20:23:12Z] <hashar> Added integration-agent-docker-1009 # T226233

Mentioned in SAL (#wikimedia-releng) [2019-09-23T20:37:22Z] <hashar> Added integration-agent-docker-1010 # T226233

Mentioned in SAL (#wikimedia-releng) [2019-09-23T20:38:21Z] <hashar> Added integration-agent-docker-1011 # T226233

Mentioned in SAL (#wikimedia-releng) [2019-09-23T20:42:56Z] <hashar> Added integration-agent-docker-1012 # T226233

All the bigram instances are gone \o/