Upgrade ORES to Debian Buster
Closed, Resolved, Public

Description

Debian 9 support will end in June 2022, and SRE rightfully asks to avoid running on an unsupported OS for Production traffic/infrastructure.

Migrating to Debian 10 means also upgrading to Python 3.7, something that we should be able to do (in theory) without re-training all models (pickle's main format doesn't change between 3.5 and 3.7 for example).
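
A minimal sketch of how the pickle claim above could be sanity-checked. It relies on protocol 3 being the default pickle protocol in both Python 3.5 and 3.7 (protocol 4 only became the default in 3.8). The helper names are hypothetical, and a readable pickle stream does not by itself guarantee that the libraries referenced inside a model behave the same across versions, so real models would still need scoring tests.

```python
import pickle
from typing import Optional

# Default pickle protocol for Python 3.0 through 3.7.
PROTOCOL = 3


def roundtrip(obj, protocol: int = PROTOCOL):
    """Serialize and deserialize an object with an explicit protocol."""
    return pickle.loads(pickle.dumps(obj, protocol=protocol))


def pickle_protocol(payload: bytes) -> Optional[int]:
    """Return the protocol declared in a pickle stream, if any.

    Streams written with protocol 2 or later start with the PROTO
    opcode (0x80) followed by one byte holding the protocol number.
    Protocols 0 and 1 have no such header, so None is returned.
    """
    if len(payload) >= 2 and payload[:1] == b"\x80":
        return payload[1]
    return None
```

Running pickle_protocol on the first bytes of an existing model file would show which protocol it was written with, without unpickling it.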

We should try to review what is needed to upgrade ORES to Python 3.7 and Buster (Debian 10), and define a strategy to roll it out. It will drain some time from Lift Wing, but we'll gain ~2 more years of OS support from upstream.

Details

| Repo | Branch | Lines +/- |
| operations/puppet | production | +1 -10 |
| operations/puppet | production | +4 -38 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +20 -5 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +35 -8 |
| operations/puppet | production | +2 -0 |
| operations/puppet | production | +2 -0 |
| mediawiki/services/ores/deploy | master | +9 -8 |
| mediawiki/tools/scap | master | +1 -5 |
| research/ores/wheels | python37 | +6 -6 |
| research/ores/wheels | python37 | +6 -3 |
| operations/puppet | production | +27 -3 |
| research/ores/wheels | python37 | +3 -3 |
| research/ores/wheels | python37 | +18 -0 |
| research/ores/wheels | python37 | +51 -0 |
| research/ores/wheels | master | +51 -0 |
| operations/puppet | production | +1 -0 |

Event Timeline

Change 785154 merged by jenkins-bot:

[mediawiki/tools/scap@master] Remove the --global option from git lfs calls

https://gerrit.wikimedia.org/r/785154

Summary of what I have done so far:

To test in deployment-prep:

  • Created deployment-ores02.deployment-prep.eqiad1.wikimedia.cloud (Debian Buster VM) and configured it.
  • Cherry picked https://gerrit.wikimedia.org/r/c/mediawiki/services/ores/deploy/+/784649 on deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud
  • Deployed via scap to deployment-ores02
  • Ran httpbb /srv/deployment/httpbb-tests/ores/test_ores.yaml --hosts=deployment-ores02.deployment-prep.eqiad1.wikimedia.cloud --http_port=8081 to test all models on deployment-deploy03.

Note: I had to manually apply https://gerrit.wikimedia.org/r/c/mediawiki/services/ores/deploy/+/784649 on deployment-ores02 since git-lfs is updated on Debian Buster, and scap doesn't work anymore. I have asked the Releng team if we can create a new scap release; until then we can't move forward with production.

If the above makes sense, the remaining steps are:

  1. Merge the python37 branch into the master one for ores wheels
  2. Update https://gerrit.wikimedia.org/r/c/mediawiki/services/ores/deploy/+/784649 accordingly, and merge it (Ores deploy repo).
  3. For every ORES node:
    • Add hiera config to support celery 5 (see https://gerrit.wikimedia.org/r/c/operations/puppet/+/785124), host level override.
    • Depool and reimage the node with Debian Buster
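
The per-node part of the steps above can be sketched as a small plan generator. This is only an illustration of the ordering (one node fully done before the next one starts); the function name is hypothetical, and the "verify" and "repool" steps are assumptions that are implied by, but not listed in, the steps above.

```python
from typing import Iterator, List, Tuple

# Per-node actions, in order. The first three come from the list above;
# the last two are assumed follow-up steps.
NODE_STEPS = [
    "add host-level hiera override for celery 5",
    "depool",
    "reimage with Debian Buster",
    "verify uwsgi/celery start and error logs are clean",  # assumption
    "repool",  # assumption
]


def rolling_upgrade_plan(hosts: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (host, step) pairs, finishing each host before the next."""
    for host in hosts:
        for step in NODE_STEPS:
            yield host, step


if __name__ == "__main__":
    for host, step in rolling_upgrade_plan(
        ["ores2001.codfw.wmnet", "ores2002.codfw.wmnet"]
    ):
        print(f"{host}: {step}")
```

Doing one node at a time keeps the rollback small: at any point only a single host is in an intermediate state.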

Every host should come up fine and it should start working without issues. I haven't fully tested the scenario of a mixed cluster with ORES Buster nodes (celery 5) and ORES Stretch nodes (celery 4), but they should be able to work on the same Redis queue without issues. If this is not the case, we can depool an entire cluster/DC (like all ORES eqiad nodes) and reimage them one by one, before returning any live traffic to them.

@kevinbazira @AikoChou @klausman I'd need you to verify what I wrote above, and possibly test that everything is indeed working correctly without issues.

Thank you for digging into this @elukey, it looks good to me. I think we should have a straightforward rollback process in case there are issues after implementing the remaining steps.

Thanks for the review! To have a cleaner rollback plan, and to avoid messing with git-lfs as much as possible, we could keep the python37 branch in the ORES wheels repository. In case of a rollback, we'd just need to use the master branch (left untouched). Aside from that, rolling back should be very easy, especially if we do one node at a time.

Hi @elukey, thanks for the summary. It is clear to me what you have done in general. Just a few things I would like to clarify. First, not sure what you meant by "Cherry picked", but I assumed it is a way of picking. Secondly, regarding the scap issue due to the updated git-lfs on Buster, I am wondering how you create a new scap release? And for the remaining steps, how could we test the mixed cluster? Do we just use the normal way (httpbb) to test until one ORES Buster node processes a task? Or is there a special way to do it?

Hi @elukey, thanks for the summary. It is clear to me what you have done in general. Just a few things I would like to clarify.

Thanks a lot for the review :)

First, not sure what you meant by "Cherry picked", but I assumed it is a way of picking.

Yep I meant the git cherry pick. What I usually do to test in deployment-prep (that has a dedicated deployment node, like deploy1002) is to:

  1. Work on a change and create the gerrit code review
  2. Use the "Download" menu of the gerrit code review to get the git cherry pick command
  3. Ssh to the test deployment node, apply the cherry pick command, deploy

In this way I am able to deploy a new change without merging first. Let me know if it makes sense or if you have more doubts!

Secondly, regarding the scap issue due to the updated git-lfs on Buster, I am wondering how you create a new scap release?

The tracking task is T306998 if you are curious!

And for the remaining steps, how could we test the mixed cluster? Do we just use the normal way (httpbb) to test until one ORES Buster node processes a task? Or is there a special way to do it?

Exactly yes, I have used httpbb. We have two ores instances in deployment-prep now:

  • deployment-ores01.deployment-prep.eqiad1.wikimedia.cloud
  • deployment-ores02.deployment-prep.eqiad1.wikimedia.cloud

The first one runs with Debian Stretch and Python 3.5, and we have to deprecate it (same as prod basically). The latter runs Debian Buster and Python 3.7. Due to how the puppet code for testing was set up in the past, on the same VM/instance we have:

  • redis (job queue for scores)
  • celery
  • uwsgi

So practically, every instance is isolated from the other one like it was a separate cluster (in prod terms). Today I tested running a mixed Stretch/Buster set up with the following:

  • Stopped celery on deployment-ores01
  • Configured celery on deployment-ores02 to use the Redis instance on deployment-ores01

Then I ran httpbb and checked that celery on deployment-ores02 was correctly picking up jobs from Redis to score without erroring out. This should mimic a state in production when we'll have, for the same cluster, part of the nodes on Buster (upgraded) and part on Stretch (to be upgraded). Let me know if it makes sense!

Change 788276 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Upgrade ores2001's celery settings

https://gerrit.wikimedia.org/r/788276

Change 784649 merged by Elukey:

[mediawiki/services/ores/deploy@master] Update scap settings for the Python 3.7 migration

https://gerrit.wikimedia.org/r/784649

Change 788276 merged by Elukey:

[operations/puppet@production] Upgrade ores2001's celery settings

https://gerrit.wikimedia.org/r/788276

ores2001 is on Buster! Everything looks good afaics, nothing strange in uwsgi/celery/logstash error logs! Let's keep things as they are for the next day, and then if nothing comes up we'll proceed with the reimages.

Procedure to upgrade a node:

I had a little issue with the first reimage (of 2001) since I had not updated deploy1002's ores repository before the reimage, and I believe that the first puppet run asks scap to pull the latest version of the repo from it (I thought that it would pull from gerrit, but that is probably not the case). It is easy to see when this happens since celery and uwsgi will not start, so a scap deploy --limit ores2001.codfw.wmnet on deploy1002 (/srv/deployment/ores/deploy) was needed after the cookbook to update the ORES code.

Change 788293 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Upgrade ores2002's celery settings

https://gerrit.wikimedia.org/r/788293

Change 788293 merged by Klausman:

[operations/puppet@production] Upgrade ores2002's celery settings

https://gerrit.wikimedia.org/r/788293

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ores2002.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ores2002.codfw.wmnet with OS buster executed with errors:

  • ores2002 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run failed, asking the operator what to do
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205020950_klausman_1339098_ores2002.out
    • The reimage failed, see the cookbook logs for the details

One possibility to explain the current issues with ORES wheels not being in zip format (but simply text files) is that https://gerrit.wikimedia.org/r/785154 is not doing what it is meant to do. From the git lfs install docs, there should be some settings deployed for the smudge and clean filters, which are responsible (IIUC) for resolving SHA to binary conversions for git lfs when the repo is cloned etc.

I have a .gitconfig in my home dir (on my laptop) with:

[filter "lfs"]
	clean = git-lfs clean -- %f
	smudge = git-lfs smudge -- %f
	process = git-lfs filter-process
	required = true

And when I clone + submodule update --init all the binaries are populated correctly in the submodules dirs.

I am wondering if, with the new git lfs install settings, we should test/add something like --local (in the scap code) to instruct git lfs what to do, but not super sure.
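
To spot the symptom described above (wheels that are still git-lfs pointer text files rather than zip archives), a minimal sketch along these lines could be run against a checkout. It relies on git-lfs pointer files starting with the line "version https://git-lfs.github.com/spec/v1"; the function names and the idea of scanning a directory tree for *.whl files are assumptions for illustration, not part of scap or puppet.

```python
import zipfile
from pathlib import Path

# First line of every git-lfs pointer file (pointer spec v1).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path) -> bool:
    """True if the file is a git-lfs pointer, i.e. it was never smudged."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def classify_wheel(path) -> str:
    """Classify a .whl file as 'lfs-pointer', 'zip' or 'unknown'."""
    if is_lfs_pointer(path):
        return "lfs-pointer"
    if zipfile.is_zipfile(path):
        return "zip"
    return "unknown"


def find_unsmudged_wheels(root) -> list:
    """Return all .whl files under root that are still LFS pointers."""
    return sorted(p for p in Path(root).rglob("*.whl") if is_lfs_pointer(p))
```

An empty result from find_unsmudged_wheels on the deployed wheels directory would suggest the smudge filter ran; any hit points at the filter configuration not being applied.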

Change 790297 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] ores: refactor git setup and add settings for Buster

https://gerrit.wikimedia.org/r/790297

Change 790297 merged by Elukey:

[operations/puppet@production] ores: refactor git setup and add settings for Buster

https://gerrit.wikimedia.org/r/790297

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ores2002.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ores2002.codfw.wmnet with OS buster completed:

  • ores2002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205091145_klausman_2569573_ores2002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ores2003.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ores2003.codfw.wmnet with OS buster completed:

  • ores2003 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205100853_klausman_2734124_ores2003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 790634 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Use celery v5 on ores2003

https://gerrit.wikimedia.org/r/790634

Change 790634 merged by Klausman:

[operations/puppet@production] hiera: Use celery v5 on ores2003

https://gerrit.wikimedia.org/r/790634

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ores2004.codfw.wmnet with OS buster

Change 790661 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Use celery v5 on ores2004

https://gerrit.wikimedia.org/r/790661

Change 790661 merged by Klausman:

[operations/puppet@production] hiera: Use celery v5 on ores2004

https://gerrit.wikimedia.org/r/790661

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ores2004.codfw.wmnet with OS buster completed:

  • ores2004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205101150_klausman_2758362_ores2004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 790678 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Use celery v5 on ores2005

https://gerrit.wikimedia.org/r/790678

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ores2005.codfw.wmnet with OS buster

Change 790678 merged by Klausman:

[operations/puppet@production] hiera: Use celery v5 on ores2005

https://gerrit.wikimedia.org/r/790678

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ores2005.codfw.wmnet with OS buster completed:

  • ores2005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205101302_klausman_2767846_ores2005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 790978 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set celery 5 settings for ores2009

https://gerrit.wikimedia.org/r/790978

Change 790978 merged by Elukey:

[operations/puppet@production] Set celery 5 settings for ores2009

https://gerrit.wikimedia.org/r/790978

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ores2009.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ores2009.codfw.wmnet with OS buster completed:

  • ores2009 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205110751_elukey_797072_ores2009.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.wikimedia.org/api/extras/job-results/3032166/

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ores2009.codfw.wmnet with OS buster executed with errors:

  • ores2009 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205110751_elukey_797072_ores2009.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.wikimedia.org/api/extras/job-results/3032166/
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ores2006.codfw.wmnet with OS buster

Change 790999 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Use celery v5 on ores2006

https://gerrit.wikimedia.org/r/790999

Change 790999 merged by Klausman:

[operations/puppet@production] hiera: Use celery v5 on ores2006

https://gerrit.wikimedia.org/r/790999

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ores2006.codfw.wmnet with OS buster completed:

  • ores2006 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205111021_klausman_2921593_ores2006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 791023 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Use celery v5 on ores2005

https://gerrit.wikimedia.org/r/791023

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ores2007.codfw.wmnet with OS buster

Change 791023 merged by Klausman:

[operations/puppet@production] hiera: Use celery v5 on ores2007

https://gerrit.wikimedia.org/r/791023

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ores2007.codfw.wmnet with OS buster completed:

  • ores2007 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205111223_klausman_9348_ores2007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 791034 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Use celery v5 on ores2008

https://gerrit.wikimedia.org/r/791034

Change 791034 merged by Klausman:

[operations/puppet@production] hiera: Use celery v5 on ores2008

https://gerrit.wikimedia.org/r/791034

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin2002 for host ores2008.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin2002 for host ores2008.codfw.wmnet with OS buster completed:

  • ores2008 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205111330_klausman_18792_ores2008.out
    • Checked BIOS boot parameters are back to normal
    • Unable to run puppet on puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 791295 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set celery 5 settings for ores1001

https://gerrit.wikimedia.org/r/791295

Change 791295 merged by Elukey:

[operations/puppet@production] Set celery 5 settings for ores1001

https://gerrit.wikimedia.org/r/791295

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ores1001.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ores1001.eqiad.wmnet with OS buster completed:

  • ores1001 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205120644_elukey_1192727_ores1001.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 791299 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set celery 5 settings for ores1002

https://gerrit.wikimedia.org/r/791299

Change 791308 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Use celery v5 on ores1003

https://gerrit.wikimedia.org/r/791308

Change 791308 merged by Klausman:

[operations/puppet@production] hiera: Use celery v5 on ores1003

https://gerrit.wikimedia.org/r/791308

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ores1003.eqiad.wmnet with OS buster

Change 791299 merged by Klausman:

[operations/puppet@production] Set celery 5 settings for ores1002

https://gerrit.wikimedia.org/r/791299

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ores1002.eqiad.wmnet with OS buster

Change 771947 abandoned by Elukey:

[operations/puppet@production] WIP - ores::base: add conditionals for buster

Reason:

https://gerrit.wikimedia.org/r/771947

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ores1003.eqiad.wmnet with OS buster completed:

  • ores1003 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205120840_klausman_1209790_ores1003.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ores1002.eqiad.wmnet with OS buster completed:

  • ores1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205120852_elukey_1210797_ores1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 791341 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Use celery v5 on ores1005

https://gerrit.wikimedia.org/r/791341

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ores1005.eqiad.wmnet with OS buster

Change 791341 merged by Klausman:

[operations/puppet@production] hiera: Use celery v5 on ores1005

https://gerrit.wikimedia.org/r/791341

Change 791353 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set celery 5 settings for ores1004

https://gerrit.wikimedia.org/r/791353

Change 791353 merged by Elukey:

[operations/puppet@production] Set celery 5 settings for ores1004

https://gerrit.wikimedia.org/r/791353

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host ores1004.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ores1005.eqiad.wmnet with OS buster completed:

  • ores1005 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205121117_klausman_1234067_ores1005.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 791355 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Use celery v5 on ores1007

https://gerrit.wikimedia.org/r/791355

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ores1007.eqiad.wmnet with OS buster

Change 791355 merged by Klausman:

[operations/puppet@production] hiera: Use celery v5 on ores1007

https://gerrit.wikimedia.org/r/791355

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host ores1004.eqiad.wmnet with OS buster completed:

  • ores1004 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205121212_elukey_1242844_ores1004.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ores1007.eqiad.wmnet with OS buster completed:

  • ores1007 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205121220_klausman_1243535_ores1007.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 791374 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Use celery v5 on ores1009

https://gerrit.wikimedia.org/r/791374

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ores1009.eqiad.wmnet with OS buster

Change 791374 merged by Klausman:

[operations/puppet@production] hiera: Use celery v5 on ores1009

https://gerrit.wikimedia.org/r/791374

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ores1009.eqiad.wmnet with OS buster completed:

  • ores1009 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205121329_klausman_1254981_ores1009.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 791388 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] hiera: Use celery v5 on ores1008

https://gerrit.wikimedia.org/r/791388

Change 791388 merged by Klausman:

[operations/puppet@production] hiera: Use celery v5 on ores1008

https://gerrit.wikimedia.org/r/791388

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ores1008.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ores1008.eqiad.wmnet with OS buster completed:

  • ores1008 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205121443_klausman_1268453_ores1008.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 791401 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] Update default celery version for ORES to v5

https://gerrit.wikimedia.org/r/791401

Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1001 for host ores1006.eqiad.wmnet with OS buster

Change 791401 merged by Klausman:

[operations/puppet@production] Update default celery version for ORES to v5

https://gerrit.wikimedia.org/r/791401

Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1001 for host ores1006.eqiad.wmnet with OS buster completed:

  • ores1006 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205121635_klausman_1284054_ores1006.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
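
The reimage log paths above appear to follow a `<timestamp>_<user>_<number>_<host>.out` naming pattern. As a minimal sketch for anyone reconstructing the order of the rollout from these entries, the snippet below splits such a path into its parts. The field names, and the reading of the third field as a run identifier, are assumptions made for illustration; they are not taken from the spicerack documentation.

```python
import re
from datetime import datetime
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # parsed from the 12-digit prefix (assumed YYYYMMDDHHMM)
    user: str          # operator who launched the cookbook
    run_id: str        # numeric third field; its exact meaning is an assumption
    host: str          # short hostname, e.g. "ores1006"


# Hypothetical pattern inferred only from the paths shown in this task.
_LOG_NAME = re.compile(
    r"(?P<ts>\d{12})_(?P<user>[^_/]+)_(?P<run_id>\d+)_(?P<host>[^_/]+)\.out$"
)


def parse_reimage_log(path: str) -> ReimageLog:
    """Split a reimage log path into timestamp, user, run id and host."""
    match = _LOG_NAME.search(path)
    if match is None:
        raise ValueError(f"unrecognised reimage log name: {path!r}")
    return ReimageLog(
        started=datetime.strptime(match["ts"], "%Y%m%d%H%M"),
        user=match["user"],
        run_id=match["run_id"],
        host=match["host"],
    )


if __name__ == "__main__":
    paths = [
        "/var/log/spicerack/sre/hosts/reimage/202205121635_klausman_1284054_ores1006.out",
        "/var/log/spicerack/sre/hosts/reimage/202205120852_elukey_1210797_ores1002.out",
    ]
    # Sort by start time to see the order in which the hosts were reimaged.
    for log in sorted(map(parse_reimage_log, paths)):
        print(log.started.isoformat(), log.host, log.user)
```

Sorting the parsed tuples orders the entries by start time, since `started` is the first field.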

Change 791561 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] modules: clean up special case for celery v4 in ORES

https://gerrit.wikimedia.org/r/791561

Change 791561 merged by Klausman:

[operations/puppet@production] modules: clean up special case for celery v4 in ORES

https://gerrit.wikimedia.org/r/791561
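
The patch titles above suggest a three-step rollout: per-host overrides selecting celery v5, then a change of the default to v5, then removal of the v4 special case. The sketch below illustrates that override-then-default pattern in plain Python. It is not the actual Puppet or hiera code; the function names and the dictionary layout are hypothetical, chosen only to show why the per-host entries become redundant once the default is flipped.

```python
from typing import Dict, List


def resolve_celery_version(host: str, default: int, per_host: Dict[str, int]) -> int:
    """Return the per-host override if one exists, otherwise the default."""
    return per_host.get(host, default)


def redundant_overrides(default: int, per_host: Dict[str, int]) -> List[str]:
    """List hosts whose override equals the default and can be cleaned up."""
    return sorted(h for h, v in per_host.items() if v == default)


if __name__ == "__main__":
    # Step 1: default stays at v4, reimaged hosts opt in to v5 one by one.
    overrides = {"ores1005": 5, "ores1004": 5}
    print(resolve_celery_version("ores1005", 4, overrides))  # override applies
    print(resolve_celery_version("ores1006", 4, overrides))  # falls back to default

    # Step 2: the default is switched to v5; every override is now redundant.
    print(redundant_overrides(5, overrides))
```

Keeping the override lookup separate from the default makes the final cleanup step mechanical: any entry that matches the new default can be dropped without changing what a host resolves to.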