We'd need to explore how difficult or painful it could be to upgrade dborch1001 (orchestrator central node) from Buster to Bullseye
- Remove testing grants T298959#8717027
- Revert patches at T298959#8717406
Status | Subtype | Assigned | Task
---|---|---|---
Open | None | | T291916 Tracking task for Bullseye migrations in production
Resolved | | Marostegui | T298585 Upgrade WMF database-and-backup-related hosts to bullseye
Resolved | | jhathaway | T298959 Upgrade dborch1001 to Bullseye
This is on my radar. I don't anticipate any real issues with this; we already can't build the orchestrator packages using a Debian-supplied version of Go, so this shouldn't change anything.
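For anyone picking this up, a quick hedged sanity check of which Go toolchain a build host would actually use; the package name `golang-go` is Debian's, and the exact name of the WMF-packaged toolchain is not asserted here:

```
# Confirm the Go in use is the one we package ourselves, not Debian's.
go version                    # should report the toolchain we package, not Debian's
apt-cache policy golang-go    # Debian-supplied Go, expected to be absent or unused
```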
Thanks for wanting to pick this up!
There is some context you probably need before deciding to go for this. Unfortunately this is not just an apt-get dist-upgrade case, and it is likely to take more than a week. However, there is progress that can definitely be made if you want to help us get closer to upgrading this host.
This is the Orchestrator main host, meaning we should probably avoid working directly on it and instead spin up a new VM with buster+orchestrator, then work out all the needed steps there.
That way of working is preferred because, if the upgrade fails, we'd lose visibility over our infrastructure until the issues are fixed/reverted, which could take a long time.
We run orchestrator (packaged by us) and go (packaged by us) on this host, so we'd need to generate bullseye packages for both (https://wikitech.wikimedia.org/wiki/Orchestrator#Packaging) and then install them on a bullseye host and try to start them.
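For illustration, a minimal sketch of what that install-and-start step might look like on the test VM; the deb file names and the systemd unit name are assumptions, and the wikitech page above is the authoritative reference:

```
# Minimal sketch, assuming locally built debs and a systemd unit named "orchestrator".
dpkg -i orchestrator_3.2.6-*_amd64.deb orchestrator-client_3.2.6-*_amd64.deb
systemctl start orchestrator
journalctl -u orchestrator -n 50 --no-pager   # check it connects to the backend and starts polling
```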
If you want to work on this (I'd be happy if you do!), my suggested path would be as follows (again, I think this can take more than a week, but any progress you make can be reused by us after the sprint week):
The reason not to simply flip dborch1002 with dborch1001 once it is ready is database grants, IPs, etc.
Let me know what you think :)
Thanks!
@Marostegui thanks for the detailed proposed plan, I'll start implementing your plan, and update this ticket as I go.
Change 901692 had a related patch set uploaded (by JHathaway; author: JHathaway):
[operations/puppet@production] Add a dborch vm for testing the bullseye upgrade
Change 901692 merged by JHathaway:
[operations/puppet@production] Add a dborch vm for testing the bullseye upgrade
Cookbook cookbooks.sre.ganeti.reimage was started by jhathaway@cumin1001 for host dborch1002.wikimedia.org with OS bullseye
Cookbook cookbooks.sre.ganeti.reimage started by jhathaway@cumin1001 for host dborch1002.wikimedia.org with OS bullseye executed with errors:
The package is built and installed on dborch1002.wikimedia.org. Any thoughts on testing? It at least starts up!
Change 901709 had a related patch set uploaded (by JHathaway; author: JHathaway):
[operations/puppet@production] dborch: allow dborch1002 to issue an ssl cert
Change 901709 merged by JHathaway:
[operations/puppet@production] dborch: allow dborch1002 to issue an ssl cert
Change 901770 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] ferm.pp: Add dborch1002 to the firewall rules
@jhathaway I have added the grants needed for this host to initialize orchestrator. So I can continue my tests, please double-check the above patch. If it looks good, I can probably try to get orchestrator started and see how it looks.
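As a side note for anyone reading along, this is roughly the kind of grant an orchestrator discovery user needs on the hosts it monitors; the user name, host and privilege list below are illustrative assumptions, not the actual WMF grants added in these patches:

```
# Illustrative only; not the WMF grant. Replication/status privileges of this kind are
# what orchestrator's discovery user typically needs on each monitored instance.
mysql -e "GRANT PROCESS, REPLICATION SLAVE, REPLICATION CLIENT, RELOAD ON *.* TO 'orchestrator'@'dborch1002.wikimedia.org';"
```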
Change 901770 merged by Marostegui:
[operations/puppet@production] ferm.pp: Add dborch1002 to the firewall rules
Change 902046 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] ferm.pp: Add dborch1002
Change 902046 merged by Marostegui:
[operations/puppet@production] ferm.pp: Add dborch1002
Change 902050 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS
Change 902050 merged by Marostegui:
[operations/puppet@production] common.yaml: Add dborch1002 to MYSQL_ROOT_CLIENTS
Change 902055 had a related patch set uploaded (by Marostegui; author: Marostegui):
[operations/puppet@production] mariadb/ferm.pp: Add dborch1002 to the firewall rules
Change 902055 merged by Marostegui:
[operations/puppet@production] mariadb/ferm.pp: Add dborch1002 to the firewall rules
The following patches will need to be reverted once done with the testing:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/902055
https://gerrit.wikimedia.org/r/c/operations/puppet/+/902050
https://gerrit.wikimedia.org/r/c/operations/puppet/+/902046
Looks like I can start orchestrator fine on dborch1002:
2023-03-22 11:23:55 DEBUG Connected to orchestrator backend: orchestrator_srv:?@tcp(db1115.eqiad.wmnet:3306)/orchestrator?timeout=1s&readTimeout=30s&rejectReadOnly=false&interpolateParams=true
2023-03-22 11:23:55 DEBUG Orchestrator pool SetMaxOpenConns: 128
2023-03-22 11:23:55 DEBUG Initializing orchestrator
2023-03-22 11:23:55 INFO Connecting to backend db1115.eqiad.wmnet:3306: maxConnections: 128, maxIdleConns: 32
2023-03-22 11:23:55 INFO Starting Discovery
2023-03-22 11:23:55 INFO Registering endpoints
2023-03-22 11:23:55 INFO continuous discovery: setting up
2023-03-22 11:23:55 INFO continuous discovery: starting
2023-03-22 11:23:55 DEBUG Queue.startMonitoring(DEFAULT)
2023-03-22 11:23:55 INFO Starting HTTP listener on localhost:3000
2023-03-22 11:23:56 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:23:57 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:23:58 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:23:59 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:00 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:01 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:02 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:03 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:04 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:05 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:06 INFO Not elected as active node; active node: dborch1001; polling
2023-03-22 11:24:07 INFO Not elected as active node; active node: dborch1001; polling
If I stop dborch1001 it starts fine, but it fails all the checks (as it has no grants on the databases).
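Side note: a quick way to see which node currently holds the active role, assuming the localhost:3000 HTTP listener shown in the log above; /api/leader-check is orchestrator's health endpoint, which returns 200 only on the active node:

```
# Prints the HTTP status of the leader check; expect 200 on the active node only.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:3000/api/leader-check
```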
@jhathaway at this point, could you write down the steps you had to do to get all sorted on dborch1002? Especially regarding packages (and where they are).
Because I see:
root@dborch1001:~# dpkg -l | grep orch
ii  orchestrator         3.2.6-1  amd64  service web+cli
ii  orchestrator-client  3.2.6-1  amd64  client script
root@dborch1002:~# dpkg -l | grep orch
ii  orchestrator         3.2.6-2  amd64  service web+cli
Do you have the client package too? Are they on the repo already?
> @jhathaway at this point, could you write down the steps you had to do to get all sorted on dborch1002? Especially regarding packages (and where they are).
I rebuilt the packages following the steps on wikitech, https://wikitech.wikimedia.org/wiki/Orchestrator#Building_orchestrator_packages, except that I built them on bullseye with golang 1.15.
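For context, a hedged sketch of what such a rebuild on a bullseye builder can look like; the exact WMF procedure is the wikitech page above, and the source/package names here are illustrative only:

```
# Illustrative rebuild flow, not the actual WMF build invocation.
apt-get source orchestrator                            # or check out the packaging source
cd orchestrator-3.2.6*/
dch -i -D bullseye-wikimedia "Rebuild for bullseye"    # bumps the Debian revision (e.g. 3.2.6-1 -> 3.2.6-2)
dpkg-buildpackage -us -uc -b                           # build unsigned binary packages
```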
> Do you have the client package too? Are they on the repo already?
The client package is built as well and uploaded to Wikimedia's apt repo. I don't see the client package referenced in our puppetry. Strangely, the client module installs the orchestrator package? https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/orchestrator/manifests/client.pp
I wonder if it was installed manually on dborch1001... I cannot recall; it was a long time ago. What I can do is install it manually on dborch1002 and test it too, just in case.
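A minimal sketch of that manual install and smoke test, assuming the 3.2.6-2 client package is already on the Wikimedia apt repo as mentioned above:

```
# Install the client from the apt repo and run a quick topology check against s1.
apt-get install orchestrator-client
orchestrator-client -c topology -a s1   # same smoke test shown in the output below
```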
The client seems to be working fine too:
root@dborch1002:~# dpkg -l | grep orch
ii  orchestrator         3.2.6-2  amd64  service web+cli
ii  orchestrator-client  3.2.6-2  amd64  client script
root@dborch1002:~# orchestrator-client -c topology -a s1
db2112.codfw.wmnet:3306 [0s,ok,10.4.25-MariaDB-log,rw,STATEMENT,>>,semi:master]
+ db1118.eqiad.wmnet:3306 [0s,ok,10.4.25-MariaDB-log,ro,STATEMENT,>>,GTID,semi:master]
+ db1106.eqiad.wmnet:3306 [0s,ok,11.0.1-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db1107.eqiad.wmnet:3306 [0s,ok,10.4.28-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db1119.eqiad.wmnet:3306 [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db1128.eqiad.wmnet:3306 [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db1132.eqiad.wmnet:3306 [0s,ok,10.6.12-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db1134.eqiad.wmnet:3306 [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db1135.eqiad.wmnet:3306 [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db1139.eqiad.wmnet:3311 [0s,ok,10.4.25-MariaDB,ro,nobinlog,GTID]
+ db1140.eqiad.wmnet:3311 [0s,ok,10.4.25-MariaDB,ro,nobinlog,GTID]
+ db1163.eqiad.wmnet:3306 [0s,ok,10.4.22-MariaDB-log,ro,STATEMENT,>>,GTID,semi:replica]
+ db1169.eqiad.wmnet:3306 [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db1184.eqiad.wmnet:3306 [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db1186.eqiad.wmnet:3306 [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db1196.eqiad.wmnet:3306 [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db1154.eqiad.wmnet:3311 [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID]
+ clouddb1013.eqiad.wmnet:3311 [0s,ok,10.4.22-MariaDB,ro,nobinlog,GTID]
+ clouddb1017.eqiad.wmnet:3311 [0s,ok,10.4.22-MariaDB,ro,nobinlog,GTID]
+ clouddb1021.eqiad.wmnet:3311 [0s,ok,10.4.22-MariaDB,ro,nobinlog,GTID]
+ db1206.eqiad.wmnet:3306 [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ dbstore1003.eqiad.wmnet:3311 [0s,ok,10.4.22-MariaDB,ro,nobinlog,GTID]
+ db2097.codfw.wmnet:3311 [0s,ok,10.4.25-MariaDB,ro,nobinlog,GTID]
+ db2102.codfw.wmnet:3306 [0s,ok,10.6.12-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2103.codfw.wmnet:3306 [0s,ok,10.4.26-MariaDB-log,ro,STATEMENT,>>,GTID,semi:replica]
+ db2116.codfw.wmnet:3306 [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2130.codfw.wmnet:3306 [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2141.codfw.wmnet:3311 [0s,ok,10.4.25-MariaDB,ro,nobinlog,GTID]
+ db2145.codfw.wmnet:3306 [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2146.codfw.wmnet:3306 [0s,ok,10.6.12-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2153.codfw.wmnet:3306 [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2167.codfw.wmnet:3311 [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID]
+ db2170.codfw.wmnet:3311 [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID]
+ db2173.codfw.wmnet:3306 [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2186.codfw.wmnet:3311 [0s,ok,10.4.28-MariaDB-log,ro,ROW,>>,GTID]
+ db2174.codfw.wmnet:3306 [0s,ok,10.4.26-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
+ db2176.codfw.wmnet:3306 [0s,ok,10.4.25-MariaDB-log,ro,ROW,>>,GTID,semi:replica]
root@dborch1002:~#
@jhathaway should we go ahead and reimage dborch1001 then?
> @jhathaway should we go ahead and reimage dborch1001 then?
Sounds good to me. I'll do a *reimage* since there is no state that matters, correct? And after that tear down dborch1002?
A bit late to the battle, but it could have been a good opportunity to move it to a private IP: T317179: Move dborch to private IPs + CDN
Cookbook cookbooks.sre.ganeti.reimage was started by jhathaway@cumin1001 for host dborch1001.wikimedia.org with OS bullseye
Cookbook cookbooks.sre.ganeti.reimage started by jhathaway@cumin1001 for host dborch1001.wikimedia.org with OS bullseye completed:
Change 902127 had a related patch set uploaded (by JHathaway; author: JHathaway):
[operations/puppet@production] Revert "Add a dborch vm for testing the bullseye upgrade"
The initial package built for bullseye failed because golang 1.15 does not support certificates that rely on verifying the common name as the hostname, which is how our puppet server generates them: https://phabricator.wikimedia.org/T273637. This was resolved by rebuilding with golang 1.14. I also improved the docs a bit to make this issue a little more prominent.
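For background on that choice: Go 1.15 only deprecated CommonName-based certificate verification (it was removed entirely in Go 1.17), so rebuilding with 1.14 avoids it cleanly. A 1.15 build could in principle have been kept running with the temporary debug flag below; the binary name, config path and `http` subcommand follow upstream orchestrator conventions and are assumptions here, not the WMF unit definition:

```
# Assumption: upstream-style invocation, not the actual WMF systemd unit.
GODEBUG=x509ignoreCN=0 orchestrator --config=/etc/orchestrator.conf.json http
```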
Change 902127 merged by JHathaway:
[operations/puppet@production] Revert "Add a dborch vm for testing the bullseye upgrade"