
upgrade phab (phorge) hosts to bullseye
Closed, Resolved (Public)

Description

Hosts phab1004 and phab2002 are currently on buster but should become bullseye machines, one way or another.

Unrelatedly there is also T333885 to migrate Phabricator to Phorge.

In today's meeting we did not reach a final decision on the order in which to do these things, or on whether to use (temporary) additional hardware.

migration plan for phab1004 upgrade (Jan 20th 2024)

https://etherpad.wikimedia.org/p/phabricator-20240120

Event Timeline

This should also include (or have a subtask for):

  • remove buster phabricator instance from devtools (phabricator-prod-1001) (T334801)
  • have a bullseye phorge instance in devtools that uses the same role that production uses (like phorge-1001 but should use prod role or phorge role becomes new prod role)
LSobanski triaged this task as Medium priority. Apr 20 2023, 4:44 PM
LSobanski raised the priority of this task from Medium to High. Jun 26 2023, 10:45 AM

We've been running Phorge for two weeks now so it's a good time to revisit this task. The devtools Phorge instance is running Bullseye so there should be no obvious blockers to upgrading production. Considering the complexity of failing over to codfw, procuring a temporary host to do a switchover in eqiad would be the preferred approach.

I used the wrong ticket. My bad. All the updates from T327068#9350860 should have been here.

  • deleted instance phorge-1001 to get quota back and allow for creating a new phabricator-on-bullseye instance
  • created instance phabricator-bullseye g3.cores2.ram4.disk20
  • fixed cert issue on new machine related to having local puppetmaster (rm -rf /var/lib/puppet/ssl on agent)
  • committed fake key for phabricator-bullseye host in git /var/lib/git/labs/private/modules/secret/secrets/ssl on puppetmaster-1001.devtools
  • added Hiera keys needed for phabricator, including envoy key/values
  • fixed initial puppet run
  • added prod phabricator puppet role (it installed php7.4 modules etc after previous puppet code changes)
  • apache, php and other phab things are installed, works up to:

Package[phabricator/deployment]: Provider scap3 is not functional on this host
and "Dependency Package[phabricator/deployment] has failures: true"

@brennen: Now we can/should try deployment. The local deployment server should be deploy-1004.devtools.

current issue: scap fails to bootstrap itself from the local deployment server: -> T352223

the bootstrap script expects to rsync a 'scap-wheels' directory which does not exist on our deployment server.

Fixes applied on the deployment server, see T352223#9368436 ff

After this, re-enabled puppet on the phab machine and:

Nov 29 20:08:39 phabricator-bullseye puppet-agent[445784]: Applying configuration version '(315b1db31a) Andrew Bogott - labs-ip-alias-dump.py: fix check for attached/detached IPs'
Nov 29 20:08:51 phabricator-bullseye puppet-agent[445784]: (/Stage[main]/Scap/Exec[bootstrap-scap-target]/returns) executed successfully (corrective)
Nov 29 20:08:51 phabricator-bullseye puppet-agent[445784]: (/Stage[main]/Scap/File[/usr/bin/scap]/ensure) created

Brennen made an attempt at deployment which showed us what other things we need to fix.

for example: E: Package 'python-mysqldb' has no installation candidate

full output is at P53950

Change 978697 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: if distro newer than buster, use python3-mysqldb

https://gerrit.wikimedia.org/r/978697

Change 978697 merged by Dzahn:

[operations/puppet@production] phabricator: if distro newer than buster, use python3-mysqldb

https://gerrit.wikimedia.org/r/978697
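
The package choice made in this change can be sketched roughly as follows. This is only an illustration in Python, not the Puppet code from the patch: the helper name `mysqldb_package` and the release list are assumptions made for the example. It relies on `python-mysqldb` (the Python 2 binding) no longer being installable after buster, which matches the "no installation candidate" error above.

```python
# Debian releases in chronological order (oldest first).
DEBIAN_RELEASES = ["stretch", "buster", "bullseye", "bookworm"]


def mysqldb_package(codename: str) -> str:
    """Return the MySQLdb binding package name for a Debian release.

    Hypothetical helper mirroring the idea of the patch:
    "if distro newer than buster, use python3-mysqldb".
    """
    if codename not in DEBIAN_RELEASES:
        raise ValueError(f"unknown Debian release: {codename}")
    newer_than_buster = (
        DEBIAN_RELEASES.index(codename) > DEBIAN_RELEASES.index("buster")
    )
    return "python3-mysqldb" if newer_than_buster else "python-mysqldb"


if __name__ == "__main__":
    for release in ("buster", "bullseye"):
        print(release, "->", mysqldb_package(release))
```

Comparing positions in an ordered release list (rather than comparing codenames as strings) keeps the check correct for any later release that is appended to the list.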

Change 978710 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: turn deploy script into template, support for php7.4-fpm

https://gerrit.wikimedia.org/r/978710

Change 978710 merged by Dzahn:

[operations/puppet@production] phabricator: turn deploy script into template, support for php7.4-fpm

https://gerrit.wikimedia.org/r/978710

Those 2 issues are now fixed and gone. Details are in the patches above.

The next remaining issue is that the service cannot access the mysql/mariadb DB.

Also, after restarting mariadb/mysql, the service failed to come back because the mysql database was missing.

I had to do quite a few things to get the DB and DB access set up, including:

  • restarted mariadb-server, removed mariadb-server package, let puppet reinstall it
  • run 'mariadb-install-db'
  • run 'mariadb-secure-installation' (interactive script) (changed bind_address to socket and other things)
  • ./phabricator/bin/config set mysql.host localhost
  • ./phabricator/bin/config set mysql.user app_user
  • ./phabricator/bin/config set mysql.pass app_pass
  • MariaDB [(none)]> create user 'app_user'@'127.0.0.1' identified by 'app_pass';
  • MariaDB [(none)]> grant all privileges on *.* to 'app_user'@'127.0.0.1' identified by 'app_pass';
  • ./phabricator/bin/storage upgrade --force

Then I created a "domain proxy" in Horizon and pointed https://phab-bull.wmcloud.org to the instance.

  • Had to add phabricator_domain: phab-bull.wmcloud.org to web Hiera for just this instance to override project defaults
  • commented out redirect to phabricator.wikimedia.org in apache2/sites-enabled/50-git-wikimedia-org.conf
  • restarted apache
  • more DB access issues were now showing up because now @localhost is used instead of @127.0.0.1
  • MariaDB [(none)]> grant all privileges on *.* to 'app_user'@'localhost' identified by 'app_pass';
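
The second round of grants was needed because MariaDB identifies an account by user name and host together: 'app_user'@'127.0.0.1' (TCP) and 'app_user'@'localhost' (normally the Unix socket) are separate accounts. A minimal sketch in Python that generates the statements for both host variants in one go. The function name `grant_statements` is hypothetical, and `app_user` / `app_pass` are the same placeholders used above. It does no escaping, so it is only suitable for trusted literal values on a test instance like this one.

```python
def grant_statements(user: str, password: str, hosts=("127.0.0.1", "localhost")):
    """Build CREATE USER / GRANT statements for each host variant.

    Hypothetical helper: returns SQL strings only, it does not connect
    to any database. Inputs are inserted verbatim (no escaping).
    """
    statements = []
    for host in hosts:
        account = f"'{user}'@'{host}'"
        statements.append(
            f"CREATE USER IF NOT EXISTS {account} IDENTIFIED BY '{password}';"
        )
        statements.append(f"GRANT ALL PRIVILEGES ON *.* TO {account};")
    return statements


if __name__ == "__main__":
    # Print the statements so they can be pasted into the MariaDB prompt.
    for statement in grant_statements("app_user", "app_pass"):
        print(statement)
```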

Now we can open a URL like https://phab-bull.wmcloud.org/lol/ and see that no site is configured yet.

[phab2002:~] $ lsb_release -c
Codename: bullseye

phab2002 has been reimaged and is now on bullseye.
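
For a scripted version of the same check, the codename can also be read from the `VERSION_CODENAME` field of `/etc/os-release`. A small sketch in Python, where the function name `debian_codename` is made up for this example and the file path in the usage part is the standard location on Debian:

```python
from typing import Optional


def debian_codename(os_release_text: str) -> Optional[str]:
    """Extract VERSION_CODENAME from the contents of an os-release file.

    Hypothetical helper: returns None if the field is not present.
    """
    for line in os_release_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "VERSION_CODENAME":
            # Values may optionally be wrapped in quotes.
            return value.strip().strip('"').strip("'")
    return None


if __name__ == "__main__":
    with open("/etc/os-release", encoding="utf-8") as handle:
        print(debian_codename(handle.read()))
```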

ready for a deployment @brennen

Mentioned in SAL (#wikimedia-operations) [2024-01-03T18:27:17Z] <brennen@deploy2002> Started deploy [phabricator/deployment@369e797]: deploy to phab2002 for T334519

Mentioned in SAL (#wikimedia-operations) [2024-01-03T18:27:44Z] <brennen@deploy2002> Finished deploy [phabricator/deployment@369e797]: deploy to phab2002 for T334519 (duration: 00m 27s)

deployment ran without errors now, seems we fixed them all, yay!

next week we will talk about how to proceed next to get both servers to this version

Change 991439 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: temp test of repo syncing, using gitlab2003 spare host

https://gerrit.wikimedia.org/r/991439

Change 991439 merged by Dzahn:

[operations/puppet@production] phabricator: temp test of repo syncing, using gitlab2003 spare host

https://gerrit.wikimedia.org/r/991439

Change 991642 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: fix source host for repo sync test

https://gerrit.wikimedia.org/r/991642

Change 991642 merged by Dzahn:

[operations/puppet@production] phabricator: fix source host for repo sync test

https://gerrit.wikimedia.org/r/991642

Change 989535 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] switch phabricator server to codfw

https://gerrit.wikimedia.org/r/989535

Change 991649 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: switch active server from eqiad to codfw

https://gerrit.wikimedia.org/r/991649

Made a test to figure out how long rsyncing /srv/repos really takes.

Temp puppetized something to be able to pull from phab2002 (passive host) to gerrit2003 (not used yet, but has the space).

Result: To copy the 61G of data it took about 12 minutes.

However, this was within the same DC and not cross-DC, so for a better test we should have used an eqiad host. Not sure we have one, though.

Change 991677 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: repo-sync test, use a machine in other DC

https://gerrit.wikimedia.org/r/991677

Change 991677 merged by Dzahn:

[operations/puppet@production] phabricator: repo-sync test, use a machine in other DC

https://gerrit.wikimedia.org/r/991677

Change 991829 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: clean up repo sync class after test

https://gerrit.wikimedia.org/r/991829

Change 991830 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: delete unused repo sync class after test

https://gerrit.wikimedia.org/r/991830

Did another test syncing the repo data, this time from codfw to eqiad into a VM, and it took 28 minutes. This is longer, but we can still live with it as the upper limit.
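
To put the two measurements side by side, here is a rough throughput calculation in Python. It assumes that "61G" means 61 GiB and that the cross-DC test copied the same amount of data as the first one. Neither is stated explicitly, so treat the numbers as estimates. The function name is made up for the example.

```python
def throughput_mib_per_s(size_gib: float, minutes: float) -> float:
    """Average transfer rate in MiB/s for size_gib copied in the given minutes."""
    return size_gib * 1024 / (minutes * 60)


# Assumed inputs: 61 GiB in 12 minutes (same DC) and in 28 minutes (cross-DC).
same_dc = throughput_mib_per_s(61, 12)
cross_dc = throughput_mib_per_s(61, 28)

if __name__ == "__main__":
    print(f"same DC:  {same_dc:.1f} MiB/s")
    print(f"cross DC: {cross_dc:.1f} MiB/s")
```

Under those assumptions this works out to about 87 MiB/s within the DC and about 37 MiB/s across DCs.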

Change 991829 merged by Dzahn:

[operations/puppet@production] phabricator: clean up repo sync class after test

https://gerrit.wikimedia.org/r/991829

Change 991830 merged by Dzahn:

[operations/puppet@production] phabricator: delete unused repo sync class after test

https://gerrit.wikimedia.org/r/991830

Change 991934 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: switch phab1004 to migration role for syncing data

https://gerrit.wikimedia.org/r/991934

Change 991937 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: switch phab1004 back to production role

https://gerrit.wikimedia.org/r/991937

Change 989537 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: use same db server regardless of DC of phab server

https://gerrit.wikimedia.org/r/989537

Change 991939 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: revert changes to DB server settings

https://gerrit.wikimedia.org/r/991939

Change 991940 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: switch active_server back to phab1004

https://gerrit.wikimedia.org/r/991940

Change 991941 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] phabricator: switch phab server back to phab1004

https://gerrit.wikimedia.org/r/991941

Mentioned in SAL (#wikimedia-operations) [2024-01-20T20:04:41Z] <brennen> start of phab/phorge bullseye update window - T334519

Change 991934 merged by Dzahn:

[operations/puppet@production] phabricator: switch phab1004 to migration role for syncing data

https://gerrit.wikimedia.org/r/991934

Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host phab1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host phab1004.eqiad.wmnet with OS bullseye executed with errors:

  • phab1004 (FAIL)
    • The reimage failed, see the cookbook logs for the details

Change 991939 merged by Dzahn:

[operations/puppet@production] phabricator: revert changes to DB server settings

https://gerrit.wikimedia.org/r/991939

We had issues with scap deploying itself on the reimaged host, but could eventually work around them (there will be a follow-up).

So the window took longer than expected, but phab1004 is now on bullseye.