
Upgrade the Data Engineering infrastructure to Debian Bullseye
Open, High, Public

Description

Inventory of hosts to be upgraded to bullseye

Hadoop-test

Hadoop

Stats clients

Launcher

Presto

Druid

Kafka

Airflow

  • airflow - 5 - cumin 'P{F:lsbdistcodename = buster} and A:analytics-airflow'

AQS

Zookeeper

Event schemas

  • schema - 4 - cumin 'P{F:lsbdistcodename = buster} and A:schema'
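The cumin selectors in this inventory all follow the same pattern: a PuppetDB fact filter on the distribution codename combined with a host alias. As a minimal sketch, assuming that pattern holds for the other host groups too, the query string can be generated rather than typed by hand. The function name `remaining_hosts_query` is hypothetical, and the snippet only builds the string; it does not invoke cumin.

```python
def remaining_hosts_query(alias, codename="buster"):
    """Build a cumin query string selecting hosts in `alias` that still
    report the given Debian codename via the lsbdistcodename fact.

    Mirrors the selectors written out in the inventory; running the
    query with cumin is left to the operator.
    """
    return f"P{{F:lsbdistcodename = {codename}}} and A:{alias}"


if __name__ == "__main__":
    # Aliases taken from the inventory entries for Airflow and Event schemas.
    for alias in ("analytics-airflow", "schema"):
        print(remaining_hosts_query(alias))
```

The resulting string would then be passed, quoted, as the host selection argument to cumin, as in the inventory entries.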

Misc

  • eventlogging - 1 - eventlog1003.eqiad.wmnet
  • archiva - 1 - archiva1001.wikimedia.org (Archiva is to be decommissioned)
  • matomo - 1 - matomo1001.eqiad.wmnet (see T349397: Migrate the matomo host to bookworm)
  • web publishing - 1 - an-web1001.eqiad.wmnet
  • hue - 1 - an-tool1009.eqiad.wmnet (decommissioned)
  • yarn - 1 - an-tool1008.eqiad.wmnet

Original description below

Recent updates are written in bold text

During the migration to Buster we worked on two things that should greatly reduce the pain of upgrading:

  1. Partman partition re-use recipes for Debian installs of most of our hosts. This means that it will be much easier to reimage/reinstall every node of the cluster without worrying too much about backing up data first.
  2. Fixed uid/gid for most of the system users. This will allow us to avoid permission errors/mismatches after a reinstall/reimage.
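The fixed uid/gid point lends itself to a quick post-reimage check. The following is a minimal sketch, not the team's actual tooling: it compares the IDs of selected system users against an expected table. The user name and ID values in `EXPECTED_IDS` are made up for illustration; the real assignments live in configuration management and are not listed in this task.

```python
import pwd

# Hypothetical example values only; replace with the real fixed
# uid/gid assignments from configuration management.
EXPECTED_IDS = {
    "exampleuser": (900, 900),
}


def check_ids(expected, lookup=pwd.getpwnam):
    """Return a list of mismatch descriptions (empty if all users match).

    `expected` maps user name -> (uid, gid). `lookup` defaults to the
    local passwd database and can be swapped out for testing.
    """
    problems = []
    for user, (uid, gid) in expected.items():
        try:
            entry = lookup(user)
        except KeyError:
            # pwd.getpwnam raises KeyError when the user does not exist.
            problems.append(f"{user}: missing")
            continue
        if (entry.pw_uid, entry.pw_gid) != (uid, gid):
            problems.append(
                f"{user}: expected {uid}:{gid}, "
                f"found {entry.pw_uid}:{entry.pw_gid}"
            )
    return problems


if __name__ == "__main__":
    for line in check_ids(EXPECTED_IDS):
        print(line)
```

Running this on a freshly reimaged host before bringing services back would flag any user whose IDs no longer match the ownership of data on the re-used partitions.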

It is nonetheless a sizeable amount of work :)

Some high level notes:

  • Moving the Hadoop test cluster to Bullseye ahead of time may be a good way to see if anything weird comes up.
  • A lot of VMs like matomo1002, archiva1002, eventlog1003, an-tool100*, etc. should be easy to migrate. The work to do is to create a new VM with Bullseye running the same packages, test that everything is fine and flip the traffic over. There is a sre.ganeti.reimage cookbook, making a reimage in place an even easier option in many cases.
  • Most of our systems like Hadoop, Druid, etc. are not ready for Java 11, so we'll need to use 8. We now have full support for Java 8 in bullseye, so we are good to go.
  • Moving Hadoop to Bullseye poses some further questions, since on paper the current version of Bigtop that we run (1.5) doesn't support Bullseye. We have now built bigtop 1.5 for bullseye and deployed it, so we are good to go.

Related Objects

Status | Subtype | Assigned | Task
Open | | None |
Open | | None |
Open | | None |
Resolved | | nfraison |
Resolved | | nfraison |
Resolved | | BTullis |
Resolved | | Stevemunene |
Resolved | | Stevemunene |
Resolved | | Stevemunene |
Resolved | | BTullis |
Resolved | | BTullis |
Resolved | Request | Stevemunene |
Resolved | | Stevemunene |
Duplicate | | BTullis |
Open | | BTullis |
Resolved | | BTullis |
Resolved | | Stevemunene |
Resolved | | brouberol |
Resolved | | Eevans |
Resolved | | Eevans |
Resolved | | Jclark-ctr |
Resolved | | brouberol |
Declined | | None |
Resolved | | brouberol |
Resolved | | BTullis |
Resolved | | BTullis |
Resolved | | BTullis |
Resolved | | Mstyles |
Resolved | | BTullis |
Resolved | | BTullis |
Resolved | | brouberol |
Resolved | | BTullis |
Resolved | | Stevemunene |
Open | | None |
Resolved | | BTullis |
Resolved | | BTullis |
Resolved | Request | VRiley-WMF |
Resolved | Request | VRiley-WMF |
Resolved | | BTullis |
Resolved | | brouberol |
Duplicate | | None |
Resolved | | brouberol |

Event Timeline


Well, if we had it installed on all the workers, then you wouldn't have to ship your conda env every time you run a spark job and want python3.9.

BTullis renamed this task from "Move the Data Engineering infrastructure to Debian Bullseye" to "Upgrade the Data Engineering infrastructure to Debian Bullseye". Feb 7 2023, 12:10 PM
BTullis removed a subscriber: razzi.
BTullis added a subscriber: JArguello-WMF.

@JArguello-WMF this one is an epic and will need several child tickets. There are about 172 servers to be upgraded at the moment.

BTullis updated the task description.

Change 922082 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/docker-images/production-images@master] Downgrade the version of spark to 3.1.2

https://gerrit.wikimedia.org/r/922082

Change 922082 merged by Btullis:

[operations/docker-images/production-images@master] Downgrade the version of spark to 3.1.2

https://gerrit.wikimedia.org/r/922082

BTullis moved this task from Incoming to In Progress on the Data-Platform-SRE board.
brouberol updated the task description.
BTullis updated the task description.
BTullis removed subscribers: JArguello-WMF, nfraison.