Upgrade stat1008 to bullseye
Closed, ResolvedPublic

Description

We have five stat servers that are still running buster, namely: stat100[4-8]

However, stat100[4-7] are currently scheduled to be decommissioned in T353785: Decom EOL stats servers stat100[4-7]

Therefore, the only remaining host that needs to be upgraded is: stat1008
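The reasoning above can be checked with a small sketch. It assumes the bracket notation used in this task (a single group such as `[4-8]` or `[5,8]`); the helper name `expand_hosts` is hypothetical and is not part of any WMF tooling.

```python
import re


def expand_hosts(pattern):
    """Expand one bracket group such as 'stat100[4-8]' or 'stat100[5,8]'.

    Hypothetical helper for illustration only; it handles a single
    bracket group containing comma-separated numbers and numeric ranges.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        # No bracket group: the pattern is already a single host name.
        return [pattern]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-")
            width = len(start)  # preserve zero padding, if any
            for n in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts


# Hosts still on buster, minus the hosts scheduled for decommissioning.
buster_hosts = expand_hosts("stat100[4-8]")
to_decommission = set(expand_hosts("stat100[4-7]"))
to_upgrade = [h for h in buster_hosts if h not in to_decommission]
print(to_upgrade)  # ['stat1008']
```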

Event Timeline

Hi folks! In T295661 I updated the AMD ROCm stack for our GPU to 5.4, but I added support only for Bullseye. When you migrate stat100[5,8] it should be sufficient to bump the ROCM version to 54 in puppet's regex.yaml. Please ping me in case something doesn't work!

Can we revisit this? Primarily for using recent ROCM versions when working with GPUs as mentioned by Luca above, but it would also be nice to add the ability to run docker containers on the stat clients (T275551).

Hi @fkaelin - I believe that we will be tackling T336040: Bring stat1010 into service with GPU from stat1005 within the next couple of weeks, which should at least get you ROCm 5.4 on bullseye. That's as soon as T336036: Bring stat1009 into service is finished, which will be our first bullseye based stats server. (cc @Stevemunene)

Unfortunately, I don't think that getting the docker CLI on the stat servers is going to happen any time soon. I know that @elukey has made some good progress recently with getting access to GPUs on the dse-k8s cluster (see: T333009#8808370) so perhaps you could build on that work, somehow.

Hi folks! Would it be possible to have either stat1005 or stat1008 (the ones with GPUs) on bookworm? I am asking since we'd have a place with python 3.11 and the latest packages, which is very useful for debugging etc. I can help in adjusting puppet where needed to accommodate bullseye -> bookworm!

It's a great idea, but we haven't even finished getting a stat server running on bullseye yet. We are about to start work on T336040: Bring stat1010 into service with GPU from stat1005, so if you like we could change the target for that from bullseye to bookworm.

I've also just made good progress on rebuilding hadoop for bullseye (see T337465: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism and https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Bigtop_Packages), so it should be easier for us to try building hadoop etc. for bookworm.

The only other alternative would seem to be upgrading stat1008 in place, but I have a feeling that this would be pretty disruptive for users who are currently using it.

What do you think @elukey?

@BTullis sorry I completely forgot about the hadoop packages, let's not jump to bookworm yet, you folks have enough work on your plate, we'll see in the future :)

Gehel triaged this task as High priority. Nov 22 2023, 9:50 AM
Gehel moved this task from Misc to Ready for Work on the Data-Platform-SRE board.
BTullis renamed this task from Upgrade Stats clients to bullseye to Upgrade stat1008 to bullseye. May 8 2024, 9:56 AM
BTullis updated the task description.
BTullis edited subscribers, added: brouberol; removed: fkaelin, JArguello-WMF, elukey and 5 others.

I have scheduled this work for Thursday May 23rd 2024 at approximately 09:15 UTC.
Reminder to self that I will have to update this page after the upgrade: https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/stat1008.eqiad.wmnet
...and let users know that they can expect to see warnings of changed keys, so they will need to remove their local known_hosts entries before being able to access stat1008 again.
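For users clearing the stale host key, `ssh-keygen -R stat1008.eqiad.wmnet` is the usual way to do it. The sketch below only illustrates what that removal amounts to. It is an assumption-laden example, not a replacement: the function name `drop_known_host` is hypothetical, and it only matches plain-text (unhashed) entries, leaving hashed `|1|...` entries untouched.

```python
def drop_known_host(lines, hostname):
    """Return known_hosts lines without plain-text entries for hostname.

    Hypothetical helper for illustration. Hashed entries (starting with
    '|1|') cannot be matched by name here and are kept as they are.
    """
    kept = []
    for line in lines:
        fields = line.split()
        # Keep blank lines and comments unchanged.
        if not fields or fields[0].startswith("#"):
            kept.append(line)
            continue
        # Optional markers (@cert-authority, @revoked) precede the host field.
        if fields[0].startswith("@") and len(fields) > 1:
            host_field = fields[1]
        else:
            host_field = fields[0]
        # The host field may list several comma-separated names or addresses;
        # the whole line is dropped if any of them matches.
        if hostname in host_field.split(","):
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    # Made-up sample entries; the key material is a placeholder.
    sample = [
        "stat1008.eqiad.wmnet ssh-ed25519 AAAAexample1",
        "stat1009.eqiad.wmnet ssh-ed25519 AAAAexample2",
    ]
    for entry in drop_known_host(sample, "stat1008.eqiad.wmnet"):
        print(entry)
```

After removing the old entry, the new key can be checked against the fingerprints page mentioned above on the first connection.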

Change #1035346 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a new partman reuser recipe for stat1008

https://gerrit.wikimedia.org/r/1035346

Change #1035346 merged by Btullis:

[operations/puppet@production] Add a new partman reuse recipe for stat1008

https://gerrit.wikimedia.org/r/1035346

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host stat1008.eqiad.wmnet with OS bullseye

I'm not confident that the reuse recipe that I have created is correct.

image.png (404×691 px, 48 KB)

It looks like it's going to format md0 instead of using it as an LVM pv.
I could fix it by hand in the installer, but I will re-work the recipe and restart the reimage.

Change #1035348 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Fix the stat1008 partman-reuse recipe

https://gerrit.wikimedia.org/r/1035348

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host stat1008.eqiad.wmnet with OS bullseye executed with errors:

  • stat1008 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" stat1008.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Change #1035348 merged by Btullis:

[operations/puppet@production] Fix the stat1008 partman-reuse recipe

https://gerrit.wikimedia.org/r/1035348

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host stat1008.eqiad.wmnet with OS bullseye

Change #1035353 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove trailing slash from stat1008 partman recipe

https://gerrit.wikimedia.org/r/1035353

Change #1035353 merged by Btullis:

[operations/puppet@production] Remove trailing slash from stat1008 partman recipe

https://gerrit.wikimedia.org/r/1035353

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host stat1008.eqiad.wmnet with OS bullseye executed with errors:

  • stat1008 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" stat1008.eqiad.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host stat1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host stat1008.eqiad.wmnet with OS bullseye completed:

  • stat1008 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405231111_btullis_2285471_stat1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB