
Automate mass upgrades (OS/mariadb)
Open, Medium, Public

Description

During our meeting, @CWilliams-WMF and I discussed the possibility of adapting sre.mysql.major-upgrade to allow mass major mariadb/OS upgrades.
The idea would be to allow a whole section (or even a DC) to get upgraded in an unattended way.

There are a few things that we need to keep in mind:

  • Some level of parallelism would be appreciated in a controlled way:
    • 1 host per DC
    • 1 host per section
    • If a host doesn't come back - the whole process stops.
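A minimal sketch of how those parallelism rules could be expressed, assuming "1 host per DC" and "1 host per section" mean that no two hosts in flight share a DC or a section. The `Host` type, the `upgrade` callback and the host names in the usage note are hypothetical and not part of sre.mysql.major-upgrade.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Host:
    # Hypothetical inventory record, used only for this sketch.
    name: str
    dc: str
    section: str


def next_batch(pending: Iterable[Host]) -> List[Host]:
    """Pick hosts so that no two in the batch share a DC or a section."""
    batch: List[Host] = []
    used_dcs, used_sections = set(), set()
    for host in pending:
        if host.dc in used_dcs or host.section in used_sections:
            continue
        batch.append(host)
        used_dcs.add(host.dc)
        used_sections.add(host.section)
    return batch


def run_upgrades(hosts: List[Host], upgrade: Callable[[Host], bool]) -> List[Host]:
    """Upgrade hosts batch by batch and stop everything on the first failure.

    `upgrade` is assumed to return True only once the host is back and healthy.
    Returns the upgraded hosts in the order they were processed.
    """
    pending = list(hosts)
    done: List[Host] = []
    while pending:
        # A real tool would run the members of a batch concurrently;
        # this sketch walks them one by one to stay short.
        for host in next_batch(pending):
            if not upgrade(host):
                raise RuntimeError(f"{host.name} did not come back, stopping")
            done.append(host)
            pending.remove(host)
    return done
```

For example, given two hosts in each of two DCs spread over s1 and s4, the first batch would contain one s1 host from one DC and one s4 host from the other, and the remaining two hosts would form the second batch.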

There are also some practical constraints to keep in mind:

  • The locking mechanism @FCeratto-WMF is working on would help here by preventing multiple operations from blocking each other (we discussed queuing systems, but those may take much longer to implement, so they are left out of this first iteration).
  • In case of OS upgrades, each reimage needs the operator to manually input the iDRAC password
  • In case of major mariadb upgrades, each host needs a puppet patch + merge

As discussed, some safety measures to avoid upgrading things that shouldn't be upgraded on this first iteration could be:

  • Hosts with replicas - never upgrade unless forced by the user
  • Hosts running on a port other than 3306
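These two safety checks could look roughly like the sketch below. It assumes the port rule is unconditional while the replica rule can be overridden by the user, which is one reading of the list above. The `Candidate` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_PORT = 3306


@dataclass(frozen=True)
class Candidate:
    # Hypothetical host description, used only for this sketch.
    name: str
    port: int
    replica_count: int


def skip_reason(host: Candidate, force: bool = False) -> Optional[str]:
    """Return why a host should be skipped, or None if it may be upgraded."""
    if host.port != DEFAULT_PORT:
        return f"runs on port {host.port}, not {DEFAULT_PORT}"
    if host.replica_count > 0 and not force:
        return f"has {host.replica_count} replica(s), needs force"
    return None
```

Returning a reason instead of a plain boolean lets the tool log why each host was left out, which should make the list of hosts needing manual attention easier to review afterwards.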

The idea behind this would be to run it and get most of the simple hosts upgraded in an unattended way.

Event Timeline

Marostegui triaged this task as Medium priority. Wed, Jun 3, 11:32 AM
Marostegui moved this task from Triage to Refine on the DBA board.

Thoughts, ideas to polish this?

We can follow the same pattern as the schema change helper and rolling restarts: walking across DCs and sections in the safest sequence, with optional CLI flags to limit scope, e.g. --sections s1,s4 --dc codfw.
We can reuse a good chunk of existing code for this.
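As a rough illustration of the scope flags, here is a sketch using Python's standard argparse. Only the --sections and --dc flag names come from the example above; the helper names and defaults are assumptions, and this is not how the existing tools necessarily parse them.

```python
import argparse
from typing import List


def comma_list(value: str) -> List[str]:
    """Split 's1,s4' into ['s1', 's4'], ignoring empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of a mass upgrade CLI")
    parser.add_argument(
        "--sections",
        type=comma_list,
        default=[],
        help="comma-separated sections to limit the run to (default: all)",
    )
    parser.add_argument(
        "--dc",
        default=None,
        help="limit the run to a single DC (default: all)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.sections, args.dc)
```

With no flags given, the empty list and None are meant to stand for "no restriction", so the tool would walk every section in every DC.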

The locking API is ready; any tool can lock a section-dc pair when needed or poll until it's available.
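A sketch of the "poll until it's available" side, for illustration only. The real locking API is not shown in this task, so `try_lock` here is a hypothetical callable that returns True when the section-dc pair was locked; the timeout and interval defaults are also assumptions.

```python
import time
from typing import Callable


def wait_for_lock(
    try_lock: Callable[[str, str], bool],
    section: str,
    dc: str,
    timeout: float = 3600.0,
    interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll try_lock(section, dc) until it succeeds or the timeout expires.

    Returns True if the lock was acquired, False if the timeout was reached.
    """
    deadline = clock() + timeout
    while True:
        if try_lock(section, dc):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without actually waiting.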

Can we do a single puppet patch + merge for a whole set of hosts before the script starts? It would reduce the manual workload. The script could potentially check if puppet has been updated before going forward.

We can follow the same pattern as the schema change helper and rolling restarts: walking across DCs and sections in the safest sequence, with optional CLI flags to limit scope, e.g. --sections s1,s4 --dc codfw.

Yep

We can reuse a good chunk of existing code for this.

Ideally this should be a cookbook, so if the code is reusable, great!

The locking API is ready; any tool can lock a section-dc pair when needed or poll until it's available.

Can we do a single puppet patch + merge for a whole set of hosts before the script starts? It would reduce the manual workload. The script could potentially check if puppet has been updated before going forward.

The problem is that puppet would keep failing on each host until its reimage happens, and it could be days from the start of a section until the end.