During our meeting, @CWilliams-WMF and I discussed the possibility of adapting sre.mysql.major-upgrade to allow mass major MariaDB/OS upgrades.
The idea would be to allow a whole section (or even a DC) to get upgraded in an unattended way.
There are a few things that we need to keep in mind:
- Some level of controlled parallelism would be appreciated:
  - 1 host per DC
  - 1 host per section
- If a host doesn't come back - the whole process stops.
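A minimal sketch of how these constraints could be expressed, not the actual cookbook code. It assumes both limits apply at the same time (no two hosts in a batch share a DC or a section). The `Host` record, `next_batch`, `run_upgrades`, and the `upgrade` callback are hypothetical names used only for illustration, as are the hostnames in the usage note.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Host:
    # Hypothetical record; a real cookbook would build this from its inventory.
    name: str
    dc: str
    section: str


def next_batch(pending: Iterable[Host]) -> List[Host]:
    """Greedily pick hosts so that no two share a DC or a section."""
    used_dcs, used_sections, batch = set(), set(), []
    for host in pending:
        if host.dc in used_dcs or host.section in used_sections:
            continue
        batch.append(host)
        used_dcs.add(host.dc)
        used_sections.add(host.section)
    return batch


def run_upgrades(hosts: List[Host], upgrade: Callable[[Host], bool]) -> List[Host]:
    """Upgrade hosts batch by batch and stop everything on the first failure.

    `upgrade` is a placeholder that returns True once the host is back.
    Returns the upgraded hosts in order; raises RuntimeError on a failure.
    """
    pending, done = list(hosts), []
    while pending:
        batch = next_batch(pending)
        # Sequential here for simplicity; a real run could do a batch in parallel.
        for host in batch:
            if not upgrade(host):
                raise RuntimeError(f"{host.name} did not come back, stopping")
            done.append(host)
        pending = [h for h in pending if h not in batch]
    return done
```

For example, with `db1001` (eqiad, s1), `db1002` (eqiad, s2), `db2001` (codfw, s1) and `db2002` (codfw, s2), the first batch would be `db1001` and `db2002`, and the second `db1002` and `db2001`.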
There are also some constraints to keep in mind:
- The locking mechanism @FCeratto-WMF is working on would help here to prevent multiple operations from blocking each other (we discussed queuing systems, but they may take much longer to implement, so they are left out of this first iteration).
- In case of OS upgrades, each reimage needs the operator to manually input the iDRAC password.
- In case of major MariaDB upgrades, each host needs a Puppet patch + merge.
As discussed, some safety measures to avoid upgrading things that shouldn't be upgraded in this first iteration could be:
- Hosts with replicas - never upgrade unless forced by the user
- Hosts running on a port other than 3306
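These checks could look roughly like the sketch below. It assumes the force option only overrides the replica check and that hosts on a non-default port are always skipped. The `Candidate` record, its fields, and `skip_reason` are hypothetical and not part of the existing cookbook.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_PORT = 3306


@dataclass(frozen=True)
class Candidate:
    # Hypothetical fields; a real cookbook would query these from the host.
    name: str
    port: int
    replica_count: int


def skip_reason(host: Candidate, force: bool = False) -> Optional[str]:
    """Return why a host should be skipped, or None if it can be upgraded."""
    if host.port != DEFAULT_PORT:
        # Assumption: force does not override the port check.
        return f"runs on port {host.port}, not {DEFAULT_PORT}"
    if host.replica_count > 0 and not force:
        return "has replicas (pass force to override)"
    return None
```

Returning a reason rather than a boolean lets an unattended run log why each host was left out.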
The idea behind this would be to run it and get most of the simple hosts upgraded in an unattended way.