
Ganeti VM for contint migration
Closed, ResolvedPublic

Description

As part of upgrading the contint servers from Buster to Bullseye, we could use a Ganeti VM based on Bullseye to:

  • test the puppet role
  • check Zuul works under Bullseye
  • verify Jenkins works

I am not sure about the limits to use:

CPU: 2
RAM: 2 GB
Disk: 80 GB (20 GB for the OS, the rest for Zuul/Jenkins/Scap/Docker etc.)

Details

Repo                      Branch      Lines +/-
operations/puppet         production  +1 -1
operations/puppet         production  +1 -0
integration/zuul/deploy   master      +4 -1
integration/zuul/deploy   master      +1 -1
operations/puppet         production  +7 -0
operations/puppet         production  +0 -7
operations/puppet         production  +3 -0
operations/puppet         production  +1 -1
operations/puppet         production  +1 -1
operations/puppet         production  +1 -0
operations/puppet         production  +1 -1
operations/puppet         production  +1 -1
operations/puppet         production  +1 -0
operations/puppet         production  +53 -1
operations/puppet         production  +1 -0
operations/puppet         production  +0 -0
operations/puppet         production  +1 -1
operations/puppet         production  +0 -1
operations/puppet         production  +5 -0
operations/puppet         production  +3 -0
operations/puppet         production  +4 -0

Event Timeline


What do you want to use as the host name, something like zuul1001?

I'd go with contint1003. Daniel mentioned using the host to test the Puppet manifests and, besides, we can then also verify the PHP web service and Jenkins.

We can get a public IP assigned if it is not too much hassle. That would be for the duration of the tests, a few weeks. If it is troublesome, we can survive with private IPs (one would just have to establish an SSH tunnel to reach the web service and do some host mapping, which is not the end of the world).
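
A minimal sketch of the SSH tunnel plus host mapping mentioned above, written as a dry run that only prints the commands. The local port 8080, the remote port 80 and the virtual host name `ci.example.org` are assumptions for illustration and do not come from this task.

```shell
#!/bin/sh
# Dry run: print the commands rather than executing them.
# Assumptions (hypothetical): local port 8080, remote port 80,
# and the virtual host name ci.example.org.

tunnel_cmd() {
    # $1 = target host, $2 = local port, $3 = remote port
    printf 'ssh -N -L %s:localhost:%s %s\n' "$2" "$3" "$1"
}

curl_cmd() {
    # $1 = virtual host name, $2 = local port
    # --resolve maps the name to 127.0.0.1 without editing /etc/hosts.
    printf 'curl --resolve %s:%s:127.0.0.1 http://%s:%s/\n' "$1" "$2" "$1" "$2"
}

tunnel_cmd contint1003.eqiad.wmnet 8080 80
curl_cmd ci.example.org 8080
```

Using `curl --resolve` keeps the host mapping local to a single request; an `/etc/hosts` entry would be the alternative when testing from a browser.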

For testing hosts I'd prefer running on private IPs, as those hosts tend to have Puppet disabled for longer periods of time and carry "experimental" changes.

Would a Cloud VPS instance work for your use case? Other than maybe the disks, it seems to fit all your requirements (including external access, Puppet, etc.).

LSobanski moved this task from Incoming to Backlog on the collaboration-services board.

I'll go with private IP but cloud VPS doesn't really seem feasible to me.

Change 1006557 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add contint1003 with insetup::collab role

https://gerrit.wikimedia.org/r/1006557

Change 1006557 merged by Dzahn:

[operations/puppet@production] site: add contint1003 with insetup::collab role

https://gerrit.wikimedia.org/r/1006557

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host contint1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host contint1003.eqiad.wmnet with OS bullseye executed with errors:

  • contint1003 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint1003.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Mentioned in SAL (#wikimedia-operations) [2024-02-26T19:04:16Z] <mutante> T358237 - makevm cookbook was interrupted by accident. re-running it would create a second IP with the same DNS name, running decom cookbook also fails, stuck

Mentioned in SAL (#wikimedia-operations) [2024-02-26T19:09:20Z] <mutante> decom cookbook finishes with 0 but does not remove DNS record of virtual machine T358237

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host contint1004.eqiad.wmnet with OS bullseye

Change 1006581 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: replace contint1003 with contint1004

https://gerrit.wikimedia.org/r/1006581

Change 1006581 merged by Dzahn:

[operations/puppet@production] site: replace contint1003 with contint1004

https://gerrit.wikimedia.org/r/1006581

Mentioned in SAL (#wikimedia-operations) [2024-02-26T20:44:55Z] <mutante> T358237 used the next hostname number,1004, to avoid the duplicate IP issue. makevm cookbook is at attempt 103/240 to detect a reboot of the VM and uptime just keeps going up. used the "gnt-instance console --show-cmd " trick to get a console despite https://phabricator.wikimedia.org/T309724 - was missing partman config

Change 1006583 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] installserver: add partman config for contint100[34]

https://gerrit.wikimedia.org/r/1006583

Change 1006583 merged by Dzahn:

[operations/puppet@production] installserver: add partman config for contint100[34]

https://gerrit.wikimedia.org/r/1006583

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host contint1004.eqiad.wmnet with OS bullseye executed with errors:

  • contint1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" contint1004.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Change 1006586 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: add shell access and cluster contacts to contint1004

https://gerrit.wikimedia.org/r/1006586

Change 1006586 merged by Dzahn:

[operations/puppet@production] contint: add shell access and cluster contacts to contint1004

https://gerrit.wikimedia.org/r/1006586

Change 1006590 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: remove contint-docker group since it won't work without ci role

https://gerrit.wikimedia.org/r/1006590

Change 1006590 merged by Dzahn:

[operations/puppet@production] contint: remove contint-docker group from contint1004

https://gerrit.wikimedia.org/r/1006590

Dzahn changed the task status from Open to In Progress. Feb 26 2024, 10:00 PM

cloud VPS doesn't really seem feasible to me

I'm curious to know more about why it isn't.

If there are limitations, maybe they can be solved on the WMCS side (cc @taavi)?

Dzahn changed the task status from In Progress to Stalled. Feb 27 2024, 3:57 PM
Dzahn removed Dzahn as the assignee of this task.

@ayounsi we're on a tight schedule here as we're trying to get contint off of Buster by EOL. The approach of using a Ganeti VM was discussed over a few sessions with Moritz and RelEng as the fastest path forward and I would prefer not to change it.

I'd be happy for us to discuss WMCS limitations and opportunities but as a side conversation that doesn't delay this particular task.

To clarify, there was no blocker in any of my comments.

On the last one, I was genuinely wondering why a CloudVPS was not suitable for this task.

cookbooks.sre.hosts.decommission executed by dzahn@cumin1002 for hosts: contint1003.eqiad.wmnet

  • contint1003.eqiad.wmnet (WARN)
    • Host not found on Icinga, unable to downtime it
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

cookbooks.sre.hosts.decommission executed by dzahn@cumin1002 for hosts: contint1004.eqiad.wmnet

  • contint1004.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
Dzahn changed the task status from Stalled to Open. Feb 27 2024, 5:38 PM

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host contint1003.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2024-02-27T23:52:26Z] <mutante> T358237 - creating VM with cookbook fails because puppet runs have certificate issue, applied role is already migrated to puppet 7 though

Mentioned in SAL (#wikimedia-operations) [2024-02-27T23:57:22Z] <mutante> T358237 - manually went through "fix forward"-steps from T349619 (install puppet-agent package, delete old key material, create new CSR, sign on puppetserver, node clean on puppetmaster) to fix puppet failures while makevm cookbook still running (which couldn't find successful puppet run)
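
The five "fix forward" steps named in the log entry above could be laid out roughly as follows. This is only a dry-run sketch that prints a plan: the specific commands and their options are assumptions based on standard Puppet tooling, not a copy of the procedure in T349619, and the function name `fix_forward_plan` is hypothetical.

```shell
#!/bin/sh
# Dry run only: prints a plan, executes nothing.
# Assumption: the commands below are illustrative stand-ins for the steps
# listed in the log entry and are not taken from T349619.

fix_forward_plan() {
    host="$1"
    cat <<EOF
# on ${host}: install the agent, drop old key material, request a new cert
apt-get install puppet-agent
rm -rf "\$(puppet config print ssldir)"
puppet agent --test --waitforcert 60
# on the Puppet 7 server: sign the new CSR
puppetserver ca sign --certname ${host}
# on the old puppetmaster: remove the stale node
puppet node clean ${host}
EOF
}

fix_forward_plan contint1003.eqiad.wmnet
```

Printing the plan first makes it easy to review the order of operations before running anything against a host whose cookbook run is still in progress.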

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host contint1003.eqiad.wmnet with OS bullseye completed:

  • contint1003 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402272244_dzahn_3369692_contint1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 1007017 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site: add ci role to contint1003

https://gerrit.wikimedia.org/r/1007017

The VM has been created.

The "ci" prod role is not applied yet but I'm about to.

Looking at the compiler output to see what the change does is possible before merge, and it is quite interesting:

https://puppet-compiler.wmflabs.org/output/1007017/1517/contint1003.eqiad.wmnet/index.html

A lot of monitoring comes with this. We can either ACK and downtime all of that or, maybe better, first add a parameter to the ci::profiles to indicate that this is a testing host, so that it skips the prod monitoring.

I think I'm going to first upload a patch for that.

Change 1007426 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: move hiera hosts file to correct hostname for test host

https://gerrit.wikimedia.org/r/1007426

Change 1007426 merged by Dzahn:

[operations/puppet@production] contint: move hiera hosts file to correct hostname for test host

https://gerrit.wikimedia.org/r/1007426

Change 1007428 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: ensure zuul-merger is disabled on test host initially

https://gerrit.wikimedia.org/r/1007428

Change 1007428 merged by Dzahn:

[operations/puppet@production] contint: ensure zuul-merger is disabled on test host initially

https://gerrit.wikimedia.org/r/1007428

Change 1007433 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: allow data rsyncing to contint1003

https://gerrit.wikimedia.org/r/1007433

Change 1007434 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] contint: create ci_test role for zuul-only and apply on contint1003

https://gerrit.wikimedia.org/r/1007434

Change 1007434 merged by Dzahn:

[operations/puppet@production] contint: create ci_test role for zuul-only and apply on contint1003

https://gerrit.wikimedia.org/r/1007434

Change 1007958 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: add profile::ci::httpd to ci_test role

https://gerrit.wikimedia.org/r/1007958

Change 1007958 merged by Dzahn:

[operations/puppet@production] ci: add profile::ci::httpd to ci_test role

https://gerrit.wikimedia.org/r/1007958

This is done now. Please use contint1003.eqiad.wmnet, which has a private IP.

The test VM has been created and has a new "ci_test" role on it that installs zuul_server, zuul_merger, zuul_proxy and httpd (but not all the other stuff that comes with the "ci" role, for now).

The zuul service exists but is currently masked (as the Puppet code specifies for a server that isn't the ci::manager_host, which is contint2002).

releng-roots have shell access.

The contint1003 host has NOT been added to the list of prod CI servers that is used for firewall holes (common/profile/ci/firewall.yaml), and it is not allowed to push to docker_registry (common/profile/docker_registry_ha/registry.yaml).

If needed there is a patch to allow it to be a destination for rsyncing data including the contents of /var/lib/zuul from a prod host.

[contint1003:~] $ id hashar; id jnuche
uid=1010(hashar) gid=500(wikidev) groups=500(wikidev),720(contint-roots)
uid=37672(jnuche) gid=500(wikidev) groups=500(wikidev),720(contint-roots)
..

[contint1003:~] $ sudo systemctl status zuul
● zuul.service
     Loaded: masked (Reason: Unit zuul.service is masked.)
..
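
As a small sketch of how the masked state shown above could be checked in a script, the helper below reads `systemctl status` output on stdin and looks for the `Loaded: masked` line. The function name `unit_is_masked` is hypothetical, and the sample text is copied from the transcript rather than taken from a live host.

```shell
#!/bin/sh
# Succeeds if the `systemctl status <unit>` text on stdin reports a masked unit.
unit_is_masked() {
    grep -q 'Loaded: masked'
}

# Sample text copied from the transcript above.
status_output='● zuul.service
     Loaded: masked (Reason: Unit zuul.service is masked.)'

if printf '%s\n' "$status_output" | unit_is_masked; then
    echo "zuul is masked"
else
    echo "zuul is not masked"
fi
```

On a live host the same check would read `sudo systemctl status zuul | unit_is_masked`; `systemctl is-enabled zuul` also reports `masked` for a masked unit.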
Dzahn claimed this task.

Change 1007017 abandoned by Dzahn:

[operations/puppet@production] site: add ci role to contint1003

Reason:

for now replaced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007958

https://gerrit.wikimedia.org/r/1007017

Change 1008539 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci_test: add profile::ci::website to allow deployments

https://gerrit.wikimedia.org/r/1008539

Change 1008539 merged by Dzahn:

[operations/puppet@production] ci_test: add profile::ci::website to allow deployments

https://gerrit.wikimedia.org/r/1008539

Change 1008545 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci_test: test switching firewall provider back to iptables

https://gerrit.wikimedia.org/r/1008545

Change 1008545 merged by Dzahn:

[operations/puppet@production] ci_test: test switching firewall provider back to iptables

https://gerrit.wikimedia.org/r/1008545

Change 1008576 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci_test: include scap::ferm class directly

https://gerrit.wikimedia.org/r/1008576

Change 1008576 merged by Dzahn:

[operations/puppet@production] ci_test: switch firewall::provider to ferm

https://gerrit.wikimedia.org/r/1008576

Change 1008579 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci_test: include profile::firewall in test role

https://gerrit.wikimedia.org/r/1008579

Change 1008579 merged by Dzahn:

[operations/puppet@production] ci_test: include profile::firewall in test role

https://gerrit.wikimedia.org/r/1008579

Change 1008823 had a related patch set uploaded (by Jaime Nuche; author: Jaime Nuche):

[operations/puppet@production] ci_test: do not remove python 2 packages

https://gerrit.wikimedia.org/r/1008823

Change 1008823 merged by Muehlenhoff:

[operations/puppet@production] ci_test: do not remove python 2 packages

https://gerrit.wikimedia.org/r/1008823

Change 1008849 had a related patch set uploaded (by Jaime Nuche; author: Jaime Nuche):

[operations/puppet@production] ci_test.pp: add missing Python2 packages

https://gerrit.wikimedia.org/r/1008849

Change 1008850 had a related patch set uploaded (by Jaime Nuche; author: Jaime Nuche):

[operations/puppet@production] ci_test.pp: remove explicit installation of Python2 packages

https://gerrit.wikimedia.org/r/1008850

Change 1008850 abandoned by Jaime Nuche:

[operations/puppet@production] ci_test.pp: remove explicit installation of Python2 packages

Reason:

Installed the package manually on the host

https://gerrit.wikimedia.org/r/1008850

Change 1008849 abandoned by Jaime Nuche:

[operations/puppet@production] ci_test.pp: add missing Python2 packages

Reason:

Installed the packages manually on the host

https://gerrit.wikimedia.org/r/1008849

Change 1008866 had a related patch set uploaded (by Jaime Nuche; author: Jaime Nuche):

[integration/zuul/deploy@master] Makefile.deploy: use `python2.7` as the interpreter for virtualenv

https://gerrit.wikimedia.org/r/1008866

Change 1008866 merged by jenkins-bot:

[integration/zuul/deploy@master] Makefile.deploy: use `python2.7` as the interpreter for virtualenv

https://gerrit.wikimedia.org/r/1008866

Change 1009524 had a related patch set uploaded (by Jaime Nuche; author: Jaime Nuche):

[integration/zuul/deploy@master] Makefile.deploy: split installation in two steps

https://gerrit.wikimedia.org/r/1009524

Change 1009524 merged by jenkins-bot:

[integration/zuul/deploy@master] Makefile.deploy: split installation in two steps

https://gerrit.wikimedia.org/r/1009524

zuul has now successfully been deployed to this machine by @jnuche :)


Indeed, installation worked and zuul + dependencies were installed at the right location :)

jnuche@deploy2002:/srv/deployment/zuul/deploy$ scap deploy -l contint1003.eqiad.wmnet 'test deployment for new host'
13:27:26 Started deploy [zuul/deploy@efce3ee]
13:27:26 Deploying Rev: HEAD = efce3ee5ddd596bac6e83a5af0696eb0abdf8f90
13:27:26 Started deploy [zuul/deploy@efce3ee]: test deployment for new host
13:27:26 
== DEFAULT ==
:* contint1003.eqiad.wmnet
13:27:33 zuul/deploy: fetch stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
13:27:34 zuul/deploy: config_deploy stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
13:27:40 zuul/deploy: promote stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
13:27:40 default deploy successful
13:27:40 
== DEFAULT ==
:* contint1003.eqiad.wmnet
13:27:41 zuul/deploy: finalize stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0) |
13:27:41 default deploy successful
13:27:41 Finished deploy [zuul/deploy@efce3ee]: test deployment for new host (duration: 00m 15s)
13:27:41 Finished deploy [zuul/deploy@efce3ee] (duration: 00m 15s)
jnuche@contint1003:/srv/deployment/zuul$ ls -1 venv/lib/python2.7/site-packages/ | grep dist-info
APScheduler-3.0.6.dist-info
Babel-1.3.dist-info
docutils-0.16.dist-info
ecdsa-0.15.dist-info
extras-0.0.3.dist-info
futures-3.0.5.dist-info
gear-0.16.0.dist-info
gitdb2-2.0.5.dist-info
GitPython-2.1.11.dist-info
lockfile-0.12.2.dist-info
paramiko-1.18.5.dist-info
Paste-1.7.5.1.dist-info
pbr-1.10.0.dist-info
pip-20.3.4.dist-info
pkg_resources-0.0.0.dist-info
prettytable-0.7.2.dist-info
pycrypto-2.6.1.dist-info
python_daemon-2.0.6.dist-info
pytz-2020.1.dist-info
PyYAML-3.11.dist-info
setuptools-44.0.0.dist-info
six-1.9.0.dist-info
smmap2-2.0.5.dist-info
statsd-2.1.2.dist-info
tzlocal-1.2.2.dist-info
voluptuous-0.8.4.dist-info
WebOb-1.4.dist-info
wheel-0.34.2.dist-info
zuul-2.5.2.dev30.dist-info
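
To compare installed versions between hosts, the `dist-info` listing above can be reduced to name and version pairs. The sketch below assumes entries of the form `NAME-VERSION.dist-info`, one per line; the function name `dist_info_versions` is hypothetical and the sample entries are copied from the listing.

```shell
#!/bin/sh
# Turn "NAME-VERSION.dist-info" entries (one per line on stdin) into "NAME VERSION".
# The version is taken to be everything after the last hyphen.
dist_info_versions() {
    sed -n -E 's/^(.*)-([^-]+)\.dist-info$/\1 \2/p'
}

# Sample entries copied from the listing above.
printf '%s\n' \
    zuul-2.5.2.dev30.dist-info \
    Paste-1.7.5.1.dist-info \
    python_daemon-2.0.6.dist-info | dist_info_versions
```

On the host itself the input would come from `ls -1 venv/lib/python2.7/site-packages/ | dist_info_versions`, and the output can be diffed against the same listing from a production host.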

Change 1007433 abandoned by Dzahn:

[operations/puppet@production] contint: allow data rsyncing to contint1003

Reason:

hashar said this isn't needed, just what puppet deploys

https://gerrit.wikimedia.org/r/1007433

Change #1020329 had a related patch set uploaded (by Dzahn; author: Hashar):

[operations/puppet@production] zuul: require python2.7

https://gerrit.wikimedia.org/r/1020329

Change #1020329 merged by Dzahn:

[operations/puppet@production] zuul: require python2.7

https://gerrit.wikimedia.org/r/1020329

Change #1029295 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] ci: include envoyproxy in ci_test role

https://gerrit.wikimedia.org/r/1029295

Mentioned in SAL (#wikimedia-operations) [2024-05-08T22:53:03Z] <mutante> contint1003 - systemctl start wmf_auto_restart_envoyproxy T364510 T358237

added envoy to contint1003 to fix T364510