
Upgrade dborch1001 to Bullseye
Closed, Resolved · Public

Description

We need to explore how difficult or painful it could be to upgrade dborch1001 (the orchestrator central node) from Buster to Bullseye.

Event Timeline

Marostegui moved this task from Triage to Ready on the DBA board.

@Kormat I have taken the liberty to assign this directly to you - congratulations!

This is on my radar. I don't anticipate there being any real issues with this; we already can't build the orchestrator packages using a Debian-supplied version of Go, so this shouldn't change anything.

Any opposition to me picking this up during our sprint week?

> Any opposition to me picking this up during our sprint week?

Thanks for wanting to pick this up!
There is some context you probably need before deciding to go for this. Unfortunately this is not just an apt-get dist-upgrade case, and it is likely to take more than a week. However, there is progress that can definitely be made if you want to help us get closer to upgrading this host.

This is the main Orchestrator host, meaning we should probably avoid working directly on it, and instead spin up a new VM with buster + orchestrator and work out all the needed steps there.
That way of working is preferred because, if the upgrade fails, we'd lose visibility over our infrastructure until the issues are fixed or reverted, which can take a long time.

We run orchestrator (packaged by us) and go (packaged by us) on this host, so we'd need to generate bullseye packages for both (https://wikitech.wikimedia.org/wiki/Orchestrator#Packaging), then install them on a bullseye host and try to start them.
If you want to work on this (I'd be happy if you do!), my suggested path would be the following (again, I think this can take more than a week, but any progress you make can be reused by us after the sprint week):

  • Spin up dborch1002 with buster
  • Get dborch1002 to be a copy of dborch1001
  • Try to upgrade dborch1002 to bullseye with apt-get
  • Following https://wikitech.wikimedia.org/wiki/Orchestrator#Packaging, generate the go and orchestrator bullseye packages
  • Install them on dborch1002 and see if orchestrator starts and works
  • Go ahead and do the same on dborch1001
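The in-place upgrade step in the plan above (step 3) can be sketched as a shell dry run. The sources.list path and flags here are assumptions, and the `run` wrapper only echoes commands until it is swapped for real execution:

```shell
#!/bin/sh
# Dry-run sketch of the buster -> bullseye in-place upgrade step.
# Assumptions: classic /etc/apt/sources.list layout; adjust for the host.
# Note: the security suite was renamed to bullseye-security in this release,
# so the sources.list rewrite needs checking by hand afterwards.
run() { echo "+ $*"; }  # swap for: run() { "$@"; } to execute for real

run sed -i 's/buster/bullseye/g' /etc/apt/sources.list
run apt-get update
run apt-get dist-upgrade -y
run reboot
```

Shown as a dry run on purpose: on a host like this you'd want to eyeball each command before letting it touch the system.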

The reason not to simply swap dborch1002 in for dborch1001 once it is ready is because of database grants, IPs, etc.

Let me know what you think :)
Thanks!

@Marostegui thanks for the detailed proposed plan, I'll start implementing your plan, and update this ticket as I go.

Thank you, this is going to help us a lot!

Change 901692 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] Add a dborch vm for testing the bullseye upgrade

https://gerrit.wikimedia.org/r/901692

Change 901692 merged by JHathaway:

[operations/puppet@production] Add a dborch vm for testing the bullseye upgrade

https://gerrit.wikimedia.org/r/901692

Cookbook cookbooks.sre.ganeti.reimage was started by jhathaway@cumin1001 for host dborch1002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by jhathaway@cumin1001 for host dborch1002.wikimedia.org with OS bullseye executed with errors:

  • dborch1002 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303211852_jhathaway_2589599_dborch1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • The reimage failed, see the cookbook logs for the details

The pkg is built and installed on dborch1002.wikimedia.org. Any thoughts on testing? It at least starts up!

Change 901709 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] dborch: allow dborch1002 to issue an ssl cert

https://gerrit.wikimedia.org/r/901709

Change 901709 merged by JHathaway:

[operations/puppet@production] dborch: allow dborch1002 to issue an ssl cert

https://gerrit.wikimedia.org/r/901709

Change 901770 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] ferm.pp: Add dborch1002 to the firewall rules

https://gerrit.wikimedia.org/r/901770

@jhathaway I have added the grants needed for this host to initialize orchestrator. To continue my tests, please double check the above patch. If it looks good, I can probably try to get orchestrator started and see how it looks.

Self note: Remove grants for 208.80.154.77 once this is done

Change 901770 merged by Marostegui:

[operations/puppet@production] ferm.pp: Add dborch1002 to the firewall rules

https://gerrit.wikimedia.org/r/901770

Change 902046 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] ferm.pp: Add dborch1002

https://gerrit.wikimedia.org/r/902046

Change 902046 merged by Marostegui:

[operations/puppet@production] ferm.pp: Add dborch1002

https://gerrit.wikimedia.org/r/902046

Change 902050 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS

https://gerrit.wikimedia.org/r/902050

Change 902050 merged by Marostegui:

[operations/puppet@production] common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS

https://gerrit.wikimedia.org/r/902050

Change 902055 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb/ferm.pp: Add dborch1002 to the firewall rules

https://gerrit.wikimedia.org/r/902055

Change 902055 merged by Marostegui:

[operations/puppet@production] mariadb/ferm.pp: Add dborch1002 to the firewall rules

https://gerrit.wikimedia.org/r/902055

Looks like I can start orchestrator fine on dborch1002:

2023-03-22 11:23:55 DEBUG Connected to orchestrator backend: orchestrator_srv:?@tcp(db1115.eqiad.wmnet:3306)/orchestrator?timeout=1s&readTimeout=30s&rejectReadOnly=false&interpolateParams=true
2023-03-22 11:23:55 DEBUG Orchestrator pool SetMaxOpenConns: 128
2023-03-22 11:23:55 DEBUG Initializing orchestrator
2023-03-22 11:23:55 INFO Connecting to backend db1115.eqiad.wmnet:3306: maxConnections: 128, maxIdleConns: 32
2023-03-22 11:23:55 INFO Starting Discovery
2023-03-22 11:23:55 INFO Registering endpoints
2023-03-22 11:23:55 INFO continuous discovery: setting up
2023-03-22 11:23:55 INFO continuous discovery: starting
2023-03-22 11:23:55 DEBUG Queue.startMonitoring(DEFAULT)
2023-03-22 11:23:55 INFO Starting HTTP listener on localhost:3000
2023-03-22 11:23:56 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:23:57 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:23:58 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:23:59 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:00 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:01 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:02 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:03 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:04 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:05 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:06 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:07 INFO Not elected as active node; active node: dborch1001; polling

If I stop dborch1001 it starts fine, but it fails all the checks (as it has no grants on the databases).

@jhathaway at this point, could you write down the steps you had to do to get everything sorted on dborch1002? Especially regarding packages (and where they are).
Because I see:

root@dborch1001:~# dpkg  -l | grep orch
ii  orchestrator                         3.2.6-1                      amd64        service web+cli
ii  orchestrator-client                  3.2.6-1                      amd64        client script

root@dborch1002:~# dpkg -l | grep orch
ii  orchestrator                         3.2.6-2                        amd64        service web+cli

Do you have the client package too? Are they on the repo already?

> @jhathaway at this point, could you write down the steps you had to do to get everything sorted on dborch1002? Especially regarding packages (and where they are).

I rebuilt the packages following the steps on wikitech, https://wikitech.wikimedia.org/wiki/Orchestrator#Building_orchestrator_packages

except I built them on bullseye with golang 1.15
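For reference, a rebuild along those lines might look like the following dry-run sketch. The authoritative steps are on the wikitech page linked above; this only assumes a standard Debian packaging flow with the source tree already checked out, and the version bump matches the 3.2.6-1 -> 3.2.6-2 jump visible in the dpkg output in this thread:

```shell
#!/bin/sh
# Dry-run sketch of rebuilding the orchestrator debs on bullseye.
# Assumption: run from an already checked-out source tree with a debian/ dir.
run() { echo "+ $*"; }  # swap for: run() { "$@"; } to execute for real

run apt-get build-dep -y ./              # pull build dependencies for the tree
run dch -i "Rebuild for bullseye"        # bump the Debian revision, e.g. 3.2.6-1 -> 3.2.6-2
run dpkg-buildpackage -us -uc -b         # unsigned, binary-only build
```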

> Because I see:
>
> root@dborch1001:~# dpkg -l | grep orch
> ii  orchestrator                         3.2.6-1                      amd64        service web+cli
> ii  orchestrator-client                  3.2.6-1                      amd64        client script
>
> root@dborch1002:~# dpkg -l | grep orch
> ii  orchestrator                         3.2.6-2                        amd64        service web+cli
>
> Do you have the client package too? Are they on the repo already?

The client package is built as well and uploaded to Wikimedia's apt repo. I don't see the client pkg referenced in our puppetry. Strangely, the client module installs the orchestrator package? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/orchestrator/manifests/client.pp

I wonder if it was installed manually on dborch1001... I cannot recall; it was a long time ago. What I can do is install it manually on dborch1002 and test it too, just in case.

> I wonder if it was installed manually on dborch1001... I cannot recall; it was a long time ago. What I can do is install it manually on dborch1002 and test it too, just in case.

sounds good, let me know if I can help in any way

The client seems to be working fine too:

root@dborch1002:~#  dpkg -l | grep orch
ii  orchestrator                         3.2.6-2                        amd64        service web+cli
ii  orchestrator-client                  3.2.6-2                        amd64        client script

root@dborch1002:~# orchestrator-client -c topology -a s1
db2112.codfw.wmnet:3306              [0s,ok,10.4.25-MariaDB-log,rw,STATEMENT,>>,semi:master]
+ db1118.eqiad.wmnet:3306            [0s,ok,10.4.25-MariaDB-log,ro,STATEMENT,>>,GTID,semi:master]
  + db1106.eqiad.wmnet:3306          [0s,ok,11.0.1-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + db1107.eqiad.wmnet:3306          [0s,ok,10.4.28-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + db1119.eqiad.wmnet:3306          [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + db1128.eqiad.wmnet:3306          [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + db1132.eqiad.wmnet:3306          [0s,ok,10.6.12-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + db1134.eqiad.wmnet:3306          [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + db1135.eqiad.wmnet:3306          [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + db1139.eqiad.wmnet:3311          [0s,ok,10.4.25-MariaDB,ro,nobinlog,GTID]
  + db1140.eqiad.wmnet:3311          [0s,ok,10.4.25-MariaDB,ro,nobinlog,GTID]
  + db1163.eqiad.wmnet:3306          [0s,ok,10.4.22-MariaDB-log,ro,STATEMENT,>>,GTID,semi:replica]
  + db1169.eqiad.wmnet:3306          [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + db1184.eqiad.wmnet:3306          [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + db1186.eqiad.wmnet:3306          [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + db1196.eqiad.wmnet:3306          [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
    + db1154.eqiad.wmnet:3311        [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID]
      + clouddb1013.eqiad.wmnet:3311 [0s,ok,10.4.22-MariaDB,ro,nobinlog,GTID]
      + clouddb1017.eqiad.wmnet:3311 [0s,ok,10.4.22-MariaDB,ro,nobinlog,GTID]
      + clouddb1021.eqiad.wmnet:3311 [0s,ok,10.4.22-MariaDB,ro,nobinlog,GTID]
  + db1206.eqiad.wmnet:3306          [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + dbstore1003.eqiad.wmnet:3311     [0s,ok,10.4.22-MariaDB,ro,nobinlog,GTID]
+ db2097.codfw.wmnet:3311            [0s,ok,10.4.25-MariaDB,ro,nobinlog,GTID]
+ db2102.codfw.wmnet:3306            [0s,ok,10.6.12-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2103.codfw.wmnet:3306            [0s,ok,10.4.26-MariaDB-log,ro,STATEMENT,>>,GTID,semi:replica]
+ db2116.codfw.wmnet:3306            [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2130.codfw.wmnet:3306            [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2141.codfw.wmnet:3311            [0s,ok,10.4.25-MariaDB,ro,nobinlog,GTID]
+ db2145.codfw.wmnet:3306            [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2146.codfw.wmnet:3306            [0s,ok,10.6.12-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2153.codfw.wmnet:3306            [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2167.codfw.wmnet:3311            [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID]
+ db2170.codfw.wmnet:3311            [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID]
+ db2173.codfw.wmnet:3306            [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
  + db2186.codfw.wmnet:3311          [0s,ok,10.4.28-MariaDB-log,ro,ROW,>>,GTID]
+ db2174.codfw.wmnet:3306            [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2176.codfw.wmnet:3306            [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
root@dborch1002:~#

@jhathaway should we go ahead then and reimage dborch1001?

> @jhathaway should we go ahead then and reimage dborch1001?

sounds good to me, I'll do a *reimage* since there is no state that matters, correct? And after that, tear down dborch1002?

A bit late to the battle, but it could have been a good opportunity to move it to a private IP: T317179: Move dborch to private IPs + CDN

> @jhathaway should we go ahead then and reimage dborch1001?

> sounds good to me, I'll do a *reimage* since there is no state that matters, correct? And after that, tear down dborch1002?

Yeah, let's go for the reimage and see how it comes back. Fingers crossed.

Cookbook cookbooks.sre.ganeti.reimage was started by jhathaway@cumin1001 for host dborch1001.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by jhathaway@cumin1001 for host dborch1001.wikimedia.org with OS bullseye completed:

  • dborch1001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303221529_jhathaway_2852582_dborch1001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Reverted the patches and removed the extra grants.

Change 902127 had a related patch set uploaded (by JHathaway; author: JHathaway):

[operations/puppet@production] Revert "Add a dborch vm for testing the bullseye upgrade"

https://gerrit.wikimedia.org/r/902127

The initial package built for Bullseye failed because golang 1.15 does not support certificates that rely on verifying the common name as the hostname, which is how our puppet server generates them (https://phabricator.wikimedia.org/T273637). This was resolved by rebuilding with Go 1.14. I also improved the docs a bit to make this issue a little more prominent.
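The failure mode can be reproduced without Go: a certificate whose only identity is its CommonName, with no Subject Alternative Name, is exactly what Go >= 1.15 refuses to verify against a hostname. A quick demonstration with openssl (throwaway files in /tmp, hostname chosen only for illustration):

```shell
#!/bin/sh
# Generate a throwaway CN-only certificate, then check for a SAN extension.
# openssl req does not copy the CN into a SAN, so this cert ends up CN-only.
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=dborch1002.wikimedia.org" \
  -keyout /tmp/key.pem -out /tmp/cert.pem 2>/dev/null

openssl x509 -in /tmp/cert.pem -noout -text \
  | grep -q "Subject Alternative Name" \
  && echo "has SAN: fine with Go 1.15+" \
  || echo "CN only, no SAN: Go 1.15+ hostname verification fails"
```

Go 1.15 offered `GODEBUG=x509ignoreCN=0` as a temporary escape hatch, but since that was later removed entirely, rebuilding with an older Go (as done here) or reissuing certificates with SANs is the durable fix.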

> A bit late to the battle, but it could have been a good opportunity to move it to a private IP: T317179: Move dborch to private IPs + CDN

You are right, sorry for not getting to that piece as well.

Change 902127 merged by JHathaway:

[operations/puppet@production] Revert "Add a dborch vm for testing the bullseye upgrade"

https://gerrit.wikimedia.org/r/902127