Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye
Closed, Resolved, Public

Description

T236576: Move all Wikimedia CI (WMCS integration project) instances from jessie to stretch || NOTYETCREATED

Most instances are currently running Stretch; we should plan to migrate them over to Buster or Bullseye ahead of the deadline (WHEN?).

Full list: https://openstack-browser.toolforge.org/project/integration and https://os-deprecation.toolforge.org/stretch/integration.html

Stretch-based machines as of 2022-01-26:

  • integration-agent-docker-XXXX
  • integration-agent-puppet-docker-1002
  • integration-agent-qemu-1001 - should be replaced by integration-agent-qemu-1003 when ready T284774
  • integration-castor03 (new flavor asked T304080)

Event Timeline

hashar triaged this task as Medium priority. Aug 20 2020, 6:52 PM

Mentioned in SAL (#wikimedia-releng) [2021-09-10T21:44:45Z] <James_F> Pulling oldest CI agent integration-agent-docker-1001 from rotation so it can be replaced by a bullseye one for T252071

Mentioned in SAL (#wikimedia-releng) [2021-09-10T21:48:39Z] <James_F> Deleting CI agent integration-agent-docker-1001 for T252071

Mentioned in SAL (#wikimedia-releng) [2021-09-10T21:52:27Z] <James_F> Created experimental integration-agent-docker-1021 for T252071

taavi renamed this task from "Move all Wikimedia CI (WMCS integration project) instances from stretch to buster" to "Move all Wikimedia CI (WMCS integration project) instances from stretch to buster/bullseye". Jan 14 2022, 3:28 PM
taavi updated the task description.

To summarize, we would need an instance flavor that has:

  • more disk; I have added some figures at T290783#7622515. We still need to find out the exact limit and change how Puppet creates the Docker image partition (70% FREE is too large).
  • boosted IO rate limits, which I have asked for at T299211

This is ready to be done and there is already an integration-agent-docker-1023.

I have updated the setup documentation: https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup

Up-to-date progress can be seen at https://horizon.wikimedia.org/project/instances/

Mentioned in SAL (#wikimedia-releng) [2022-01-26T18:17:01Z] <hashar> integration: pooled in Jenkins a few Bullseye docker agent for T252071

Mentioned in SAL (#wikimedia-releng) [2022-01-26T18:56:00Z] <hashar> integration: pooled in Jenkins a few more Bullseye docker agents for T252071

Mentioned in SAL (#wikimedia-releng) [2022-01-26T20:29:12Z] <hashar> Completed migration of integration-agent-docker-XXXX instances from Stretch to Bullseye - T252071

I have created and pooled integration-agent-puppet-docker-1003 so we can put it online and put the old one integration-agent-puppet-docker-1002 offline to switch the Puppet job. It will have to be verified, but at least Docker seems to work properly.

Mentioned in SAL (#wikimedia-releng) [2022-02-23T12:41:33Z] <hashar> Depooling integration-agent-puppet-docker-1002 , pooling integration-agent-puppet-docker-1003 # T252071

Change 770921 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] dockerfiles: set castor hostname from CASTOR_HOST env

https://gerrit.wikimedia.org/r/770921

Change 770921 merged by jenkins-bot:

[integration/config@master] dockerfiles: set castor hostname from CASTOR_HOST env

https://gerrit.wikimedia.org/r/770921

Change 771608 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: update castor image to support CASTOR_HOST

https://gerrit.wikimedia.org/r/771608

Mentioned in SAL (#wikimedia-releng) [2022-03-17T14:11:47Z] <hashar> Update all jobs to support CASTOR_HOST env variable | https://gerrit.wikimedia.org/r/770921 | T216244 | T252071

Change 771608 merged by jenkins-bot:

[integration/config@master] jjb: update castor image to support CASTOR_HOST

https://gerrit.wikimedia.org/r/771608
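
As a rough illustration of what the two changes above make possible, the sketch below resolves the Castor host from the CASTOR_HOST environment variable and falls back to a default when the variable is unset or empty. The function name `resolve_castor_host` and the fallback behaviour are assumptions made for this example; the actual logic lives in the integration/config Dockerfiles and is not reproduced here.

```python
import os

# Assumed fallback, for illustration only: the host this thread reports
# as the value CASTOR_HOST was set to before the migration.
DEFAULT_CASTOR_HOST = "integration-castor03.integration.eqiad.wmflabs"


def resolve_castor_host(env=None):
    """Return the Castor host from CASTOR_HOST, or the default if unset or empty."""
    env = os.environ if env is None else env
    host = env.get("CASTOR_HOST", "").strip()
    return host or DEFAULT_CASTOR_HOST


if __name__ == "__main__":
    print(resolve_castor_host())
```

With this kind of indirection, switching CI to a new Castor instance only requires changing one variable in Jenkins rather than rebuilding every job.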

hashar updated the task description.

I have asked for a new OpenStack flavor to create the new integration-castor instance: T304080

Mentioned in SAL (#wikimedia-releng) [2022-03-28T16:55:15Z] <hashar> integration: created instance integration-castor04 with flavor g3.cores8.ram32.disk20 (twice more ram than integration-castor03) # T252071

Change 774525 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] ci: on castor server drop /srv requirement

https://gerrit.wikimedia.org/r/774525

Change 774771 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] ci: relocate castor storage directory

https://gerrit.wikimedia.org/r/774771

Jdforrester-WMF added subscribers: StrikerBot, greg, Legoktm and 21 others.

@hashar, it looks like integration-castor04 is up and perhaps working now? Does that mean we can kill integration-castor03 and declare this done?

Aklapper set Due Date to Apr 30 2022, 11:59 PM. Apr 18 2022, 6:21 PM

@hashar, it looks like integration-castor04 is up and perhaps working now? Does that mean we can kill integration-castor03 and declare this done?

The new instance is indeed provisioned, but the migration has not happened yet. Notably, I need to get the couple of pending Puppet patches merged.

For the migration:

  • create the Jenkins agent for castor04
  • unpool castor03 in Jenkins
  • move the Cinder volume attachment from castor04 to castor03
  • manually mount the volume on castor03
  • copy the cache data
  • unmount the volume, attach it back to castor04 and ensure it is mounted
  • run Puppet
  • change the CASTOR_HOST env variable in the CI Jenkins (currently set to integration-castor03.integration.eqiad.wmflabs)
  • pool castor04 in Jenkins
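
The steps above can be sketched as an ordered, dry-run checklist. This is only an illustration: the function name `castor_migration_plan` is hypothetical, nothing is executed against Jenkins or OpenStack, and it assumes the new instance is integration-castor04, as named in the surrounding comments.

```python
def castor_migration_plan(old, new):
    """Return the ordered migration steps as plain strings (nothing is executed)."""
    return [
        f"Create the Jenkins agent for {new}",
        f"Unpool {old} in Jenkins",
        f"Move the Cinder volume attachment from {new} to {old}",
        f"Manually mount the volume on {old}",
        "Copy the cache data",
        f"Unmount the volume, attach it back to {new} and ensure it is mounted",
        "Run Puppet",
        f"Change CASTOR_HOST in the CI Jenkins to point at {new}",
        f"Pool {new} in Jenkins",
    ]


if __name__ == "__main__":
    # Host names as used in this thread; adjust if the new instance differs.
    plan = castor_migration_plan("integration-castor03", "integration-castor04")
    for number, step in enumerate(plan, start=1):
        print(f"{number}. {step}")
```

Keeping the old and new host names as parameters makes it easy to regenerate the checklist if the target instance is rebuilt under a different name.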

I will look at getting the Puppet patches merged ( https://gerrit.wikimedia.org/r/774525 and https://gerrit.wikimedia.org/r/774771 ), then commit to doing the migration in the morning, when CI is quiet. Based on the notes in my last comment, it should be a rather quick migration.

Thank you @komla for the ping :)

getting the puppet patch merged ( https://gerrit.wikimedia.org/r/774525 and https://gerrit.wikimedia.org/r/774771 )

Please add some reviewers to get patches merged.

A side effect is that Bullseye instances do not have Diamond/Graphite monitoring. I filed T307655 to find a replacement (likely Prometheus?)

Change 774525 merged by Jbond:

[operations/puppet@production] ci: on castor server drop /srv requirement

https://gerrit.wikimedia.org/r/774525

Change 774771 merged by Jbond:

[operations/puppet@production] ci: relocate castor storage directory

https://gerrit.wikimedia.org/r/774771

I had to delete integration-castor04 because it had some issue after being migrated earlier this week.

The new one is integration-castor05 with IP 172.16.1.98. I have attached the volume and run wmcs-prepare-cinder-volume.

Mentioned in SAL (#wikimedia-releng) [2022-05-06T12:55:41Z] <hashar> Migrated Castor service from integration-castor03 to integration-castor05 # T252071

hashar closed this task as Resolved. May 6 2022, 1:11 PM
hashar updated the task description.

After 3 years and 2 months integration-castor03 is gone. That was the last Stretch instance in the integration project.

I have added some updates to https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Castor

Mentioned in SAL (#wikimedia-releng) [2022-05-18T07:32:12Z] <hashar> Deleting Jenkins agent configuration for integration-castor03 # T252071

Part of this task relocated the Castor instance, and I took that opportunity to also change the directory layout for the service. As a result, we have been serving stale caches for quite a while. T323051 fixes it.