Upgrade eventlogging VM to bullseye (or bookworm)
Closed, Resolved (Public)

Description

Reference ticket for the buster upgrade: T278137: Migrate eventlog1002 to buster

We currently run our legacy eventlogging on a single VM:

  • eventlog1003.eqiad.wmnet

It runs the following eventlogging-processor services:

btullis@eventlog1003:~$ pstree -aT eventlogging
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-07
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-01
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-09
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-05
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-04
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-11
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-10
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-08
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-02
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-03
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-00
python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor @/etc/eventlogging.d/processors/client-side-06

However, the virtual machine is otherwise stateless.
All state is now stored in Kafka.
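
As a quick sanity check after any rebuild, the process listing above can be compared against the expected set of processors. The following is a minimal sketch, assuming the processors are named client-side-00 through client-side-11 as in the listing; the helper names `processor_names` and `missing_processors` are hypothetical and not part of eventlogging.

```python
from __future__ import annotations

import re

# Matches the "@/etc/eventlogging.d/processors/<name>" argument shown in the
# pstree output above and captures the processor name.
PROCESSOR_ARG = re.compile(r"@/etc/eventlogging\.d/processors/(\S+)")


def processor_names(pstree_output: str) -> list[str]:
    """Return the sorted processor names found in `pstree -aT` output."""
    return sorted(PROCESSOR_ARG.findall(pstree_output))


def missing_processors(pstree_output: str, expected_count: int = 12) -> list[str]:
    """Return expected processor names that do not appear in the output.

    Assumption: processors are numbered client-side-00 .. client-side-NN,
    which matches the listing in this task but is not otherwise confirmed.
    """
    expected = {f"client-side-{i:02d}" for i in range(expected_count)}
    return sorted(expected - set(processor_names(pstree_output)))


if __name__ == "__main__":
    sample = (
        "python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor "
        "@/etc/eventlogging.d/processors/client-side-07\n"
        "python3 /srv/deployment/eventlogging/analytics/bin/eventlogging-processor "
        "@/etc/eventlogging.d/processors/client-side-01\n"
    )
    print("running:", processor_names(sample))
    print("missing:", missing_processors(sample))
```

Feeding it the saved output of `pstree -aT eventlogging` from the old and the new host gives a simple before/after comparison.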

As per T278137, the recommended approach the last time we needed to upgrade was to create a parallel VM running the next OS.
We then ran the two systems in parallel until we were confident enough that we could turn off the older version.

We may have to do some work on the eventlogging code to make sure that it works in the system python.
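
One way to find out early whether the system Python is missing anything is to ask the interpreter which modules it can locate. This is a minimal sketch only: the module list is a hypothetical placeholder rather than eventlogging's real dependency list, and `missing_modules` is an illustrative helper.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names that the running interpreter cannot locate."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises for dotted names with a missing parent package,
            # or for modules whose __spec__ is unset.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical list: replace with the modules eventlogging actually imports.
    candidates = ["json", "MySQLdb", "kafka"]
    print("interpreter:", sys.version.split()[0])
    print("not importable:", missing_modules(candidates))
```

Run with `/usr/bin/python3` on the target OS, this only checks that modules can be found, not that they load or behave correctly.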

There is perhaps an argument here for skipping bullseye and moving straight to bookworm.

Tagging Event-Platform and Data-Engineering for visibility and in case they need to help update the code, but I believe that Data-Platform-SRE will provision the new VM and migrate the service once it has been tested.

Event Timeline

Pretty sure old eventlogging is python 2

edit: Oh! nevermind! I guess not, I see python3 in your ps output now

Gehel triaged this task as High priority. Nov 15 2023, 9:44 AM

I believe that we're on the verge of finishing the migration of all legacy eventlogging components.
See T259163: Migrate legacy metawiki schemas to Event Platform and T238230: Decommission EventLogging backend components by migrating to MEP for further details on that effort.

Therefore, I think it likely that we will be able to decommission the eventlog1003 VM instead of upgrading it to bullseye/bookworm.

Decommissioning probably won't get done until after I'm back from leave in late April. Can we wait that long?

OK, that's quite a long time then. Maybe we will upgrade it in the meantime.

Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1002 for host eventlog1003.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1002 for host eventlog1003.eqiad.wmnet with OS bullseye completed:

  • eventlog1003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402141339_brouberol_4116050_eventlog1003.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 1003438 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] eventlogging: tweak PYTHONPATH to allow eventlogging to import _mysql.so

https://gerrit.wikimedia.org/r/1003438

Change 1003438 merged by Brouberol:

[operations/puppet@production] eventlogging: tweak PYTHONPATH to allow eventlogging to import _mysql.so

https://gerrit.wikimedia.org/r/1003438
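
The contents of the patch are not reproduced in this task, so the following is only a sketch of how one might check which PYTHONPATH entry (if any) provides the compiled `_mysql` extension. The helper names and the example directory are hypothetical.

```python
import os
from pathlib import Path


def pythonpath_entries(env=None):
    """Split PYTHONPATH into its non-empty entries (os.environ by default)."""
    env = os.environ if env is None else env
    raw = env.get("PYTHONPATH", "")
    return [entry for entry in raw.split(os.pathsep) if entry]


def find_extension(module_stem, search_paths):
    """Return compiled extension files named <module_stem>*.so under the given dirs."""
    hits = []
    for entry in search_paths:
        directory = Path(entry)
        if directory.is_dir():
            # rglob also matches ABI-tagged names such as
            # _mysql.cpython-39-x86_64-linux-gnu.so in subdirectories.
            hits.extend(sorted(directory.rglob(f"{module_stem}*.so")))
    return hits


if __name__ == "__main__":
    entries = pythonpath_entries()
    print("PYTHONPATH entries:", entries)
    for path in find_extension("_mysql", entries):
        print("found:", path)
```

If nothing is found, the directory that holds the extension on the host is not yet on PYTHONPATH, which is the kind of situation the merged change appears to address.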