
Bullseye upgrade for remaining Collab hosts
Closed, Resolved, Public

Description

ServiceOps-Collab systems still running Buster: https://os-reports.wikimedia.org/os-report-todo-2023-05-15-buster.html

  • role::ci::master (2 host(s)) (T334517)
  • role::releases (2 host(s)) (T334435)
  • role::gerrit (2 host(s)) (T326368, T334521); decommission of gerrit1001: T336427
  • role::miscweb (2 host(s)) (Racktables should be removed first: T327405)
  • role::requesttracker (1 host(s))
  • role::planet (2 host(s))
    • Requires upgrading the RSS aggregator (T281219); alternatively, it may be a decom target
  • role::phabricator (2 host(s)) (T334519)
  • role::aphlict (1 host(s)) (T333452)
  • role::vrts (1 host(s)) (T295416)
  • role::doc (2 host(s)) (T319477)
  • role::lists (1 host) (T331706)

Related Objects

Status    Subtype     Assigned
Open      -           None
Resolved  -           Dzahn
Resolved  -           Arnoldokoth
Resolved  -           Arnoldokoth
Resolved  -           Arnoldokoth
Resolved  -           eoghan
Resolved  -           andrea.denisse
Resolved  -           andrea.denisse
Resolved  -           hashar
Resolved  -           Dzahn
Resolved  -           Dzahn
Resolved  -           Dzahn
Resolved  Request     Dzahn
Resolved  -           Dzahn
Resolved  -           eoghan
Resolved  -           eoghan
Resolved  -           eoghan
Resolved  -           Dzahn
Resolved  -           Dzahn
Resolved  -           Dzahn
Resolved  -           None
Resolved  -           hashar
Resolved  -           Dzahn
Resolved  -           hashar
Declined  -           None
Resolved  -           hashar
Resolved  -           Dzahn
Resolved  -           Dzahn
Resolved  -           Dzahn
Resolved  -           Dzahn
Resolved  -           Jclark-ctr
Resolved  Bug Report  hashar
Resolved  -           Jelto
Resolved  -           Dzahn
Resolved  -           Legoktm
Resolved  -           Dzahn
Resolved  -           MoritzMuehlenhoff
Invalid   -           None
Resolved  -           eoghan
Resolved  -           eoghan
Resolved  -           eoghan
Resolved  -           eoghan
Resolved  -           eoghan
Resolved  -           eoghan
Resolved  -           eoghan
Resolved  -           eoghan
Resolved  -           eoghan
Resolved  -           eoghan
Resolved  -           Ladsgroup

Event Timeline


Change 908644 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] sre: update planned quarters and tickets for collab services

https://gerrit.wikimedia.org/r/908644

Change 908644 merged by Dzahn:

[operations/puppet@production] sre: update planned quarters and tickets for collab services

https://gerrit.wikimedia.org/r/908644

Re gerrit: gerrit1003 is now in production and running bullseye. gerrit1001 is in a grace period before being shut down in T336427. Reimaging gerrit2002 with bullseye remains to be done and is part of T334521.

Upgrade of gerrit2002 is scheduled for May 25th.

contint: in progress. Just merged changes to reserve a UID for zuul etc. in preparation for the migration.

doc: in progress. Eoghan is working on it after Andrea created new VMs.

gerrit2002 is now on bullseye, so the number of buster hosts is reduced by 1 without a hostname change.

Dzahn updated the task description.

gerrit1001 has been destroyed today

Dzahn removed Dzahn as the assignee of this task. Jun 22 2023, 8:21 PM

Change 973213 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator/httpd: add support for bullseye/bookworm PHP versions

https://gerrit.wikimedia.org/r/973213

Change 973213 merged by Dzahn:

[operations/puppet@production] phabricator/httpd: add support for bullseye/bookworm PHP versions

https://gerrit.wikimedia.org/r/973213

Change 974280 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator::main: add support for PHP versions other than 7.3

https://gerrit.wikimedia.org/r/974280

Change 974280 merged by Dzahn:

[operations/puppet@production] phabricator::main: add support for PHP versions other than 7.3

https://gerrit.wikimedia.org/r/974280

Change 974286 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] php: add templates to support php8.2 on bookworm

https://gerrit.wikimedia.org/r/974286

After the patches above, PHP 7.4 can be installed on phab hosts and the phab role can be applied on bullseye. Tested in Cloud VPS.
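The patches above make the installed PHP version depend on the Debian release. Below is a minimal Python sketch of that selection. It assumes the stock Debian PHP versions: 7.3 on buster, 7.4 on bullseye and 8.2 on bookworm. The names `php_version_for` and `php_packages_for` are hypothetical, and this is an illustration rather than the actual Puppet code.

```python
# Hypothetical illustration only: the real logic lives in the Puppet
# phabricator/php modules, which are not reproduced here.
# Default PHP version shipped by each Debian release.
DEBIAN_PHP_VERSIONS = {
    "buster": "7.3",
    "bullseye": "7.4",
    "bookworm": "8.2",
}


def php_version_for(codename: str) -> str:
    """Return the PHP version to install for a Debian codename."""
    try:
        return DEBIAN_PHP_VERSIONS[codename]
    except KeyError:
        raise ValueError(f"unsupported Debian release: {codename}") from None


def php_packages_for(codename: str, modules: list[str]) -> list[str]:
    """Build versioned Debian package names such as 'php7.4-curl'."""
    version = php_version_for(codename)
    return [f"php{version}-{module}" for module in modules]


if __name__ == "__main__":
    print(php_packages_for("bullseye", ["cli", "curl", "mysql"]))
```

Keeping the mapping in one table means adding a release (as was done for php8.2 on bookworm) is a one-line change.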

Remaining issues:

  • Unable to locate package python-phabricator
  • Needs a scap deployment, but "Provider scap3 is not functional on this host"
  • An issue with SSL certificates

Change 974286 merged by Dzahn:

[operations/puppet@production] php: add templates to support php8.2 on bookworm

https://gerrit.wikimedia.org/r/974286

Change 974660 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: install python3-phabricator if bullseye or newer

https://gerrit.wikimedia.org/r/974660

Change 974660 merged by Dzahn:

[operations/puppet@production] phabricator: install python3-phabricator if bullseye or newer

https://gerrit.wikimedia.org/r/974660
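The "bullseye or newer" condition can be expressed as a comparison against an ordered list of Debian releases. Below is a minimal Python sketch that assumes the release order shown. `is_at_least` and `phabricator_python_package` are hypothetical helpers and are not part of the Puppet module.

```python
# Hypothetical sketch of "install python3-phabricator if bullseye or newer".
# Debian releases in chronological order.
DEBIAN_RELEASES = ["stretch", "buster", "bullseye", "bookworm"]


def is_at_least(codename: str, minimum: str) -> bool:
    """True if `codename` is the same release as `minimum` or newer.

    Raises ValueError for a codename not in DEBIAN_RELEASES.
    """
    return DEBIAN_RELEASES.index(codename) >= DEBIAN_RELEASES.index(minimum)


def phabricator_python_package(codename: str) -> str:
    """Pick the Phabricator Python bindings package for a release."""
    # On bullseye, apt reported "Unable to locate package python-phabricator",
    # so bullseye and newer get the Python 3 package instead.
    if is_at_least(codename, "bullseye"):
        return "python3-phabricator"
    return "python-phabricator"
```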

Mentioned in SAL (#wikimedia-cloud) [2023-11-21T21:00:34Z] <mutante> - deleted instance phorge-1001 to get quota back and allow for creating new phabricator-on-bullseye instance T328595 T327068

Mentioned in SAL (#wikimedia-cloud) [2023-11-21T21:05:17Z] <mutante> - creating instance phabricator-bullseye g3.cores2.ram4.disk20 T327068

Mentioned in SAL (#wikimedia-cloud) [2023-11-21T21:24:21Z] <mutante> - initial puppet run on newly created VM fails with "SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: puppetmaster-1001.devtools.eqiad.wmflabs" T327068

Mentioned in SAL (#wikimedia-cloud) [2023-11-21T21:41:52Z] <mutante> - cert issue on new machine related to having local puppetmaster, like T349937#9288547 except "rm -rf /var/lib/puppet/ssl" was enough since puppetmaster did auto-sign new CSR - T327068
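The fix noted in the entry above removes the agent's SSL directory so that a fresh CSR is generated and auto-signed by the local puppetmaster. It can be sketched in Python as below. The path /var/lib/puppet/ssl comes from the log. The helper name `reset_agent_certs` and the optional follow-up `puppet agent --test` run are assumptions. The approach only works where the puppetmaster auto-signs new CSRs.

```python
import shutil
import subprocess
from pathlib import Path

# Path taken from the SAL entry above.
PUPPET_SSL_DIR = Path("/var/lib/puppet/ssl")


def reset_agent_certs(ssl_dir: Path = PUPPET_SSL_DIR, run_agent: bool = False) -> bool:
    """Delete the agent's SSL state and optionally trigger a new Puppet run.

    Returns True if a directory was removed. Assumes (hypothetically, as on
    puppetmaster-1001.devtools) that the puppetmaster auto-signs the new CSR.
    """
    removed = False
    if ssl_dir.exists():
        shutil.rmtree(ssl_dir)  # same effect as "rm -rf /var/lib/puppet/ssl"
        removed = True
    if run_agent:
        # The next agent run creates a new key and CSR; exit code 2 only
        # means "changes applied", so the return code is not checked here.
        subprocess.run(["puppet", "agent", "--test"], check=False)
    return removed
```

Like `rm -rf`, calling it when the directory is already gone is harmless and simply returns False.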

Mentioned in SAL (#wikimedia-cloud) [2023-11-21T21:52:05Z] <mutante> - commit fake key for phabricator-bullseye host in git /var/lib/git/labs/private/modules/secret/secrets/ssl on puppetmaster-1001.devtools T327068

@brennen Per our meeting chat today:

  • phorge-1001.devtools deleted
  • new VM created: phabricator-bullseye.devtools
  • added Hiera keys needed for phabricator
  • fixed cert signing issue and added fake SSL key on local puppetmaster, fixed initial puppet run
  • added the prod phabricator puppet role (it installed the php7.4 modules etc. thanks to the previous puppet code changes)

Now Apache is installed along with everything else, and things work up to:

Package[phabricator/deployment]: Provider scap3 is not functional on this host
and "Dependency Package[phabricator/deployment] has failures: true"

The local deployment server should be deploy-1004.devtools.

edit: Sorry Brennen, wrong ticket. We should use T334519 instead.

Mentioned in SAL (#wikimedia-operations) [2023-12-12T18:10:02Z] <mutante> reimaging phab2002 (stand-by phorge server) with bullseye - T327068

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host phab2002.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host phab2002.codfw.wmnet with OS bullseye completed:

  • phab2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202312121832_dzahn_1780890_phab2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

[phab2002:~] $ lsb_release -c
Codename: bullseye

phab2002 is now on bullseye and ready for a deployment @brennen

planet VMs: buster replaced with bookworm machines, done!

Dzahn changed the status of subtask T334517: upgrade contint servers to bullseye from Open to In Progress.
Dzahn updated the task description.
Dzahn updated the task description.

contint2002 upgraded

@LSobanski This was resolved, but I have now also added lists1001.

Dzahn added a subscriber: eoghan.

Since @eoghan shut down lists1001 in T331706#9944135 this epic task is now completed.

[cumin2002:~] $ sudo cumin 'A:owner-collaboration-services' 'lsb_release -c'
37 hosts will be targeted:
...
(10) .. Codename: bookworm
(27) .. Codename: bullseye
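The cumin result above can also be checked mechanically. Below is a minimal Python sketch that sums the grouped lines of the form `(N) .. Codename: X` and confirms that no buster hosts remain. The line format is assumed from the condensed excerpt above. Real cumin output prints node groups and command output on separate lines, so the pattern would need adapting. `tally_codenames` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches condensed result lines such as "(10) .. Codename: bookworm".
GROUP_LINE = re.compile(r"^\((\d+)\)\s.*Codename:\s*(\S+)\s*$")


def tally_codenames(output: str) -> Counter:
    """Sum host counts per Debian codename; non-matching lines are ignored."""
    counts: Counter = Counter()
    for line in output.splitlines():
        match = GROUP_LINE.match(line.strip())
        if match:
            counts[match.group(2)] += int(match.group(1))
    return counts


if __name__ == "__main__":
    sample = """37 hosts will be targeted:
(10) .. Codename: bookworm
(27) .. Codename: bullseye"""
    counts = tally_codenames(sample)
    print(dict(counts), "total:", sum(counts.values()))
    # Counter returns 0 for missing keys, so this checks "no buster left".
    assert counts["buster"] == 0, "buster hosts remain"
```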