
Move the Data Engineering infrastructure to Debian Bullseye
Open, Needs Triage, Public

Description

This is a task like T234629, where we should organize the work to migrate to Debian Bullseye over the fiscal year. At the time of writing (August 2021) the new release is not official yet, but SRE has already started working on supporting it (T275873).

During the migration to Buster we worked on two things that should greatly reduce the pain of upgrading:

  1. Partman partition re-use recipes for Debian installs on most of our hosts. This makes it much easier to reimage/reinstall every node of the cluster without worrying too much about backing up data first.
  2. Fixed uid/gid for most of the system users. This lets us avoid odd permission errors/mismatches after a reinstall/reimage.
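As an illustration of the second point, here is a minimal sketch of a post-reimage check that compares passwd-style entries against a table of expected fixed ids. The function name, the user names and the numeric ids are all hypothetical, not our real assignments.

```python
from typing import Dict, List, Tuple


def find_uid_gid_mismatches(
    passwd_text: str, expected: Dict[str, Tuple[int, int]]
) -> List[str]:
    """Report users whose uid/gid differ from the expected fixed values.

    passwd_text is the content of a passwd-format file
    (name:password:uid:gid:...). expected maps a user name to its
    (uid, gid) pair; the values used below are made up for the example.
    """
    actual: Dict[str, Tuple[int, int]] = {}
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed entries
        name, _password, uid, gid = fields[:4]
        actual[name] = (int(uid), int(gid))

    problems: List[str] = []
    for user, ids in sorted(expected.items()):
        if user not in actual:
            problems.append(f"{user}: missing")
        elif actual[user] != ids:
            found = actual[user]
            problems.append(
                f"{user}: expected {ids[0]}:{ids[1]}, found {found[0]}:{found[1]}"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical expectations; real values would come from configuration.
    expected_ids = {"hdfs": (903, 903), "yarn": (904, 904)}
    with open("/etc/passwd") as handle:
        for problem in find_uid_gid_mismatches(handle.read(), expected_ids):
            print(problem)
```

Run on a freshly reimaged node, an empty output would suggest that files on the preserved partitions are still owned by the right users.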

It is nonetheless a sizeable amount of work :)

Some high level notes:

  • Moving the Hadoop test cluster to Bullseye ahead of time may be a good way to see if anything weird comes up.
  • A lot of VMs like matomo1002, archiva1002, eventlog1003, an-tool100*, etc. should be easy to migrate. The work consists of creating a new VM with Bullseye running the same packages, testing that everything is fine and flipping the traffic over.
  • Most of our systems like Hadoop, Druid, etc. are not ready for Java 11, so we'll need to use 8. In T287960 the SRE team added a new component to bullseye-wikimedia to support openjdk-8, since from Buster onward only 11 is supported. Having openjdk-8 available for Bullseye means that upgrading hosts like Druid should be just a matter of reimaging one node at a time.
  • Moving Hadoop to Bullseye poses some further questions, since on paper the current version of Bigtop that we run (1.5) doesn't support Bullseye (namely, the upstream CI/checks/etc. cover only Buster, not Bullseye). There are some options to review in my opinion:
    • Test Bigtop 1.5 packages on Bullseye (for example on Hadoop test) and see if anything weird pops up. I am fairly confident that they should work fine, and in the worst case we could use the upstream docker images to rebuild the packages for Bullseye.
    • Upgrade to Bigtop 3.0 (the next version, since the 2.x series is skipped), and then upgrade to Bullseye (the 3.x series will support Bullseye natively). This also means upgrading to Hadoop 3.2, Hive 3.x, etc., so it is a major project: a gigantic set of tests for all our jobs and software needs to happen, and people need to be moved to the new Hadoop. There is also the open question of whether to back up Hadoop data before the upgrade (see T277015 for more info).
    • Stay on Buster for the time being, but I'd avoid this road if possible. In this case we could theoretically keep Buster for as long as it is supported by Debian (see https://wiki.debian.org/LTS). The SRE team strongly encourages moving to new Debian stable versions, so we should follow up with them in this case to know what they suggest/advise.
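For the hosts that get reimaged one node at a time while staying on Java 8, a small sanity check after each reimage could confirm the Debian codename and the Java major version before moving on to the next node. The following is only a sketch: the function names are hypothetical, and it assumes the usual VERSION_CODENAME field of /etc/os-release and the usual `version "..."` line printed by `java -version` (on stderr).

```python
import re
import subprocess
from typing import Dict, Optional


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse KEY=value lines of an os-release file into a dict."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def java_major_version(version_output: str) -> Optional[int]:
    """Extract the major version from `java -version` output.

    Java 8 reports itself as "1.8.0_...", later releases as "11.0...".
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def node_ready(
    os_release_text: str,
    java_version_output: str,
    codename: str = "bullseye",
    java_major: int = 8,
) -> bool:
    """True if the node reports the wanted Debian codename and Java major."""
    os_fields = parse_os_release(os_release_text)
    return (
        os_fields.get("VERSION_CODENAME") == codename
        and java_major_version(java_version_output) == java_major
    )


if __name__ == "__main__":
    with open("/etc/os-release") as handle:
        os_release = handle.read()
    # `java -version` writes its banner to stderr.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print("ready" if node_ready(os_release, result.stderr) else "not ready")
```

The parsing is kept separate from the file and subprocess access so that the check can be tried against captured output from the Hadoop test cluster first.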

Just to clarify, this is a long project that will probably span months; there is no expectation to see it completed in a quarter or so. In the past we created a high-level plan of when to move hosts, so that SRE could check progress periodically.