
Migrate Dumps Snapshot hosts from Buster to Bullseye
Closed, Resolved · Public

Description

We hope to start this migration by the second quarter (Q2) of next year, because it depends on the ICU migration and on when the new packages will be built for Bullseye.

Progress:

  • snapshot1008 - Decommissioned in T364455
  • snapshot1009 - Decommissioned in T364456
  • snapshot1010
  • snapshot1011
  • snapshot1012
  • snapshot1013
  • snapshot1014
  • snapshot1015
  • snapshot1016
  • snapshot1017

Event Timeline


Mentioned in SAL (#wikimedia-operations) [2024-01-09T12:43:52Z] <moritzm> imported mwbzutils 0.1.4~wmf-1+deb11u1 for bullseye-wikimedia T325228

I've reimaged snapshot1014. After the rebuild of mwbzutils, most parts of the Puppet setup work fine, except one: the setup of the mw-cgroup (configured via mw-cgroup.systemd.erb) fails on Bullseye, with a permission error when trying to write to /sys/fs/cgroup/memory/release_agent:

jmm@snapshot1014:~$ sudo echo '/usr/local/bin/cgroup-mediawiki-clean' > /sys/fs/cgroup/memory/release_agent
-bash: /sys/fs/cgroup/memory/release_agent: Permission denied
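As an aside, and not the root cause here (the follow-up change below shows it is a cgroup v1/v2 issue): with sudo echo … > file the redirection is performed by the calling, unprivileged shell, so this test would be denied even where root may write to the file. A minimal sketch of the usual workaround, writing through tee under sudo; note that on an unmodified Bullseye host this still fails, since the unified v2 hierarchy has no memory/release_agent file:

# tee opens the target file with elevated privileges, unlike a
# redirection performed by the invoking shell.
echo '/usr/local/bin/cgroup-mediawiki-clean' | sudo tee /sys/fs/cgroup/memory/release_agent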

Change 991347 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] mediawiki::cgroup: Enable v1 cgroups on bullseye

https://gerrit.wikimedia.org/r/991347

Change 991347 merged by Muehlenhoff:

[operations/puppet@production] mediawiki::cgroup: Enable v1 cgroups on bullseye

https://gerrit.wikimedia.org/r/991347
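For context (the patch content isn't quoted here): Bullseye defaults to the unified cgroup v2 hierarchy, under which /sys/fs/cgroup/memory and the release_agent mechanism no longer exist. A minimal sketch of the usual way to restore the legacy v1 hierarchy via the kernel command line, assuming a GRUB-based setup (the actual Puppet change may well differ):

# In /etc/default/grub, append to the existing kernel command line options:
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"

# Regenerate the GRUB config; the change only takes effect after a reboot.
sudo update-grub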

When running the MediaWiki train, scap complained because the SSH host key of snapshot1016.eqiad.wmnet was not recognized. From deploy2002.codfw.wmnet:

scap pull ...
(ran as mwdeploy@snapshot1016.eqiad.wmnet) returned [255]: Host key verification failed.

That's expected; the server was in the middle of a reimage to Bullseye.
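For reference: after a reimage, the new host key is distributed to deployment hosts via wmf-update-known-hosts-production (visible in the cookbook output later in this task). If a stale entry ever needed clearing by hand, a minimal sketch would be:

# Drop the outdated known_hosts entry; the next connection re-learns the key.
ssh-keygen -R snapshot1016.eqiad.wmnet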

Change 992398 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] late_command: Drop special case for snapshot1016/1017

https://gerrit.wikimedia.org/r/992398

Change 992398 merged by Muehlenhoff:

[operations/puppet@production] late_command: Drop special case for snapshot1016/1017

https://gerrit.wikimedia.org/r/992398

Gehel triaged this task as High priority. Jan 23 2024, 2:05 PM
Gehel moved this task from Incoming to OS Upgrade on the Data-Platform-SRE board.

Change 1008451 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dumps/scap@master] Add a new deployment target in the beta cluster

https://gerrit.wikimedia.org/r/1008451

Moving this into our current milestone, as we are now working on testing these dump scripts on Bullseye.

BTullis updated the task description.

Change 1008451 merged by ArielGlenn:

[operations/dumps/scap@master] Add a new deployment target in the beta cluster

https://gerrit.wikimedia.org/r/1008451

Change 1009288 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow the lilypond packages to be installed on bullseye

https://gerrit.wikimedia.org/r/1009288

Change 1009288 merged by ArielGlenn:

[operations/puppet@production] Allow the lilypond packages to be installed on bullseye

https://gerrit.wikimedia.org/r/1009288

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1015.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1015.eqiad.wmnet with OS bullseye completed:

  • snapshot1015 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404091034_btullis_1548395_snapshot1015.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

… The setup of the mw-cgroup (configured via mw-cgroup.systemd.erb) fails on Bullseye, with a permission error when trying to write to /sys/fs/cgroup/memory/release_agent …

I ran into the same issue when trying to run a deployment_server in Cloud VPS on bullseye (cgroup issue: T363957 -> deployment server on bullseye: T363415 -> get rid of buster VMs: T360964).

Update: rebooting the VM fixed the problem, because the GRUB config was only applied then: T363957#9762525. You just have to know that you need that extra reboot.
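A quick, generic way to confirm which hierarchy a host is actually running after that reboot:

# Prints "cgroup2fs" on the unified v2 hierarchy; "tmpfs" on the
# legacy/hybrid v1 layout, where /sys/fs/cgroup/memory exists again.
stat -fc %T /sys/fs/cgroup/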

Hello there.

Due to T364250, host snapshot1011 will not be running the typical wikidatawiki dump and thus will be idle until roughly the 20th, so this is a good window in which to migrate it.

CC @BTullis

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1011.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1011.eqiad.wmnet with OS bullseye completed:

  • snapshot1011 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405080956_btullis_1291660_snapshot1011.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1029220 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Move dumps::generation::worker::dumper_misc_crons_only role

https://gerrit.wikimedia.org/r/1029220

I have created https://gerrit.wikimedia.org/r/c/operations/puppet/+/1029220, which will move all of the following dumps from snapshot1008 to snapshot1017.

  • adds-changes
  • categoriesrdf-dump-daily
  • categoriesrdf-dump
  • cirrussearch-dump-s1
  • cirrussearch-dump-s11
  • cirrussearch-dump-s2
  • cirrussearch-dump-s3
  • cirrussearch-dump-s4
  • cirrussearch-dump-s5
  • cirrussearch-dump-s6
  • cirrussearch-dump-s7
  • cirrussearch-dump-s8
  • cirrussearch-dump
  • commonsjson-dump
  • commonsrdf-dump
  • global_blocks_dump
  • growth_mentorship_dump
  • list-media-per-project
  • pagetitles-ns0
  • pagetitles-ns6
  • shorturls
  • wikidatajson-dump
  • wikidatajson-lexemes-dump
  • wikidatardf-all-dumps
  • wikidatardf-lexemes-dumps
  • wikidatardf-truthy-dumps
  • xlation-dumps

When we deploy this patch, the systemd timers and services will become unmanaged on snapshot1008, so we will want to disable the timers by hand to avoid duplicate runs.

Change #1029509 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Move snapshot1009 to insetup::data_engineering

https://gerrit.wikimedia.org/r/1029509

Change #1029509 merged by Btullis:

[operations/puppet@production] Move snapshot1009 to insetup::data_engineering

https://gerrit.wikimedia.org/r/1029509

@xcollazo added a comment on my patch:

LGTM, however, let's wait till snapshot1008 is idle.
Right now it is running the dumpRdf job. I expect it to be idle by the ~15th of the month.

I have been checking snapshot1008 to see when it will be idle, but it looks like it's pretty much always running one dump or another.
These four dumps are currently running:

  • wikidatardf-all-dumps
  • wikidatardf-truthy-dumps
  • cirrussearch-dump-s4
  • cirrussearch-dump-s8

Listing the timers and filtering for dumps, we can see that more dumps will start on May 17th, 18th, 19th, 20th, and 22nd.
So I'm not sure that there is ever going to be a time when it's properly idle.
I think that I would be happy to merge the patch now, then manually stop and disable the timers on snapshot1008 to try to avoid duplicate runs.

Xabriel, what do you think? Is this a workable way to get the host roles switched without duplicate dumps conflicting with each other?

Here is a one-liner to list the next scheduled runs of all of the timers from the list in T325228#9781322.
It looks to me like this host is going to be doing some kind of dump all the time.

btullis@snapshot1008:~$ systemctl list-timers $(for t in $(cat timers.txt); do echo $t.timer;done)
NEXT                         LEFT           LAST                         PASSED       UNIT                            ACTIVATES
Thu 2024-05-16 20:50:00 UTC  8h left        Wed 2024-05-15 20:50:00 UTC  15h ago      adds-changes.timer              adds-changes.service
Fri 2024-05-17 05:00:00 UTC  16h left       Thu 2024-05-16 05:00:00 UTC  7h ago       categoriesrdf-dump-daily.timer  categoriesrdf-dump-daily.service
Fri 2024-05-17 08:10:00 UTC  19h left       Thu 2024-05-16 08:10:00 UTC  4h 1min ago  pagetitles-ns0.timer            pagetitles-ns0.service
Fri 2024-05-17 08:50:00 UTC  20h left       Thu 2024-05-16 08:50:00 UTC  3h 21min ago pagetitles-ns6.timer            pagetitles-ns6.service
Fri 2024-05-17 09:10:00 UTC  20h left       Fri 2024-05-10 09:10:00 UTC  6 days ago   xlation-dumps.timer             xlation-dumps.service
Fri 2024-05-17 23:00:00 UTC  1 day 10h left Fri 2024-05-10 23:00:00 UTC  5 days ago   wikidatardf-lexemes-dumps.timer wikidatardf-lexemes-dumps.service
Sat 2024-05-18 08:15:00 UTC  1 day 20h left Sat 2024-05-11 08:15:00 UTC  5 days ago   global_blocks_dump.timer        global_blocks_dump.service
Sat 2024-05-18 08:15:00 UTC  1 day 20h left Sat 2024-05-11 08:15:00 UTC  5 days ago   growth_mentorship_dump.timer    growth_mentorship_dump.service
Sat 2024-05-18 20:00:00 UTC  2 days left    Sat 2024-05-11 20:00:00 UTC  4 days ago   categoriesrdf-dump.timer        categoriesrdf-dump.service
Sun 2024-05-19 07:10:00 UTC  2 days left    Sun 2024-05-12 07:10:00 UTC  4 days ago   list-media-per-project.timer    list-media-per-project.service
Sun 2024-05-19 19:00:00 UTC  3 days left    Sun 2024-05-12 19:00:00 UTC  3 days ago   commonsrdf-dump.timer           commonsrdf-dump.service
Mon 2024-05-20 03:15:00 UTC  3 days left    Mon 2024-05-13 03:15:00 UTC  3 days ago   commonsjson-dump.timer          commonsjson-dump.service
Mon 2024-05-20 03:15:00 UTC  3 days left    Mon 2024-05-13 03:15:00 UTC  3 days ago   wikidatajson-dump.timer         wikidatajson-dump.service
Mon 2024-05-20 08:05:00 UTC  3 days left    Mon 2024-05-13 08:05:00 UTC  3 days ago   shorturls.timer                 shorturls.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s1.timer      cirrussearch-dump-s1.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s11.timer     cirrussearch-dump-s11.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s2.timer      cirrussearch-dump-s2.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s3.timer      cirrussearch-dump-s3.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s5.timer      cirrussearch-dump-s5.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s6.timer      cirrussearch-dump-s6.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s7.timer      cirrussearch-dump-s7.service
Wed 2024-05-22 03:15:00 UTC  5 days left    Wed 2024-05-15 03:15:00 UTC  1 day 8h ago wikidatajson-lexemes-dump.timer wikidatajson-lexemes-dump.service
n/a                          n/a            Mon 2024-05-13 21:52:12 UTC  2 days ago   cirrussearch-dump-s4.timer      cirrussearch-dump-s4.service
n/a                          n/a            Wed 2024-05-15 23:20:55 UTC  12h ago      cirrussearch-dump-s8.timer      cirrussearch-dump-s8.service
n/a                          n/a            Mon 2024-05-13 23:00:00 UTC  2 days ago   wikidatardf-all-dumps.timer     wikidatardf-all-dumps.service
n/a                          n/a            Wed 2024-05-15 23:00:00 UTC  13h ago      wikidatardf-truthy-dumps.timer  wikidatardf-truthy-dumps.service

26 timers listed.
Pass --all to see loaded but inactive timers, too.

Xabriel, what do you think? Is this a workable way to get the host roles switched without duplicate dumps conflicting with each other?

I had missed the continuous stream of jobs. Considering that these are miscellaneous dumps, I'm not super worried if they fail or don't run once or twice.

I think that I would be happy to merge the patch now, then manually stop and disable the timers on snapshot1008 to try to avoid duplicate runs.

Go for it!

Change #1029220 merged by Btullis:

[operations/puppet@production] Move dumps::generation::worker::dumper_misc_crons_only role

https://gerrit.wikimedia.org/r/1029220

Mentioned in SAL (#wikimedia-analytics) [2024-05-16T15:52:58Z] <btullis> moving the dumps::generation::worker::dumper_misc_crons role from snapshot1008 to snapshot1017 for T325228

I have disabled the timers on snapshot1008 with the following:

btullis@snapshot1008:~$ for t in $(cat timers.txt); do sudo systemctl disable $t.timer ; done
Removed /etc/systemd/system/multi-user.target.wants/adds-changes.timer.
Removed /etc/systemd/system/multi-user.target.wants/categoriesrdf-dump-daily.timer.
Removed /etc/systemd/system/multi-user.target.wants/categoriesrdf-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s1.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s11.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s2.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s3.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s4.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s5.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s6.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s7.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s8.timer.
Removed /etc/systemd/system/multi-user.target.wants/commonsjson-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/commonsrdf-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/global_blocks_dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/growth_mentorship_dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/list-media-per-project.timer.
Removed /etc/systemd/system/multi-user.target.wants/pagetitles-ns0.timer.
Removed /etc/systemd/system/multi-user.target.wants/pagetitles-ns6.timer.
Removed /etc/systemd/system/multi-user.target.wants/shorturls.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatajson-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatajson-lexemes-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatardf-all-dumps.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatardf-lexemes-dumps.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatardf-truthy-dumps.timer.
Removed /etc/systemd/system/multi-user.target.wants/xlation-dumps.timer.

However, I think I may have to stop the timers as well. Hopefully this will not affect the running services.

I stopped the timers with:

btullis@snapshot1008:~$ for t in $(cat timers.txt); do sudo systemctl stop $t.timer ; done

Now the timers are no longer listed, but the existing dump processes are still running:

btullis@snapshot1008:~$ for p in $(pgrep -f systemd-timer); do pstree -a $p ; done
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject wikidatardf-all-dumps --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d all -f ttl ...
  └─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d all -f ttl -e nt
      ├─gzip -dc /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20240513/wikidata-20240513-all-BETA.ttl.gz
      └─lbzip2 -n 4 -c
          └─6*[{lbzip2}]
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject wikidatardf-truthy-dumps --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy ...
  └─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 5 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 6 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      └─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
          ├─gzip -9
          └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 7 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject cirrussearch-dump-s4 --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpcirrussearch.sh --config/etc/dumps/confs/wiki
  └─dumpcirrussearc /usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other --dblist /srv/mediawiki/dblists/s4.dblist
      ├─gzip
      └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/CirrusSearch/maintenance/DumpIndex.php --wiki=commonswiki --indexSuffix=file
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject cirrussearch-dump-s8 --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpcirrussearch.sh --config/etc/dumps/confs/wiki
  └─dumpcirrussearc /usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other --dblist /srv/mediawiki/dblists/s8.dblist
      ├─gzip
      └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/CirrusSearch/maintenance/DumpIndex.php --wiki=wikidatawiki --indexSuffix=content

So I think we're good. I'll keep monitoring these dump processes on snapshot1008, and once they have finished I can proceed to decommission it.

I'll also check on snapshot1017 that the migrated timers start and run as expected.
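Concretely, the checks would look something like this, reusing the same hand-maintained timers.txt list as above:

# On snapshot1008: confirm no dump wrappers remain once the last runs finish.
pgrep -af systemd-timer-mail-wrapper

# On snapshot1017: confirm every migrated timer now has a scheduled NEXT run.
systemctl list-timers $(for t in $(cat timers.txt); do echo $t.timer; done)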

Change #1032610 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] scap: remove snapshot1008 from dsh group mediawiki-installation

https://gerrit.wikimedia.org/r/1032610

Change #1032610 merged by Dzahn:

[operations/puppet@production] scap: remove snapshot1008 from dsh group mediawiki-installation

https://gerrit.wikimedia.org/r/1032610

Host rebooted by btullis@cumin1002 with reason: Rebooting to pick up new kernel

Host rebooted by btullis@cumin1002 with reason: Rebooting to pick up new kernel

There is still one dump running on snapshot1008: cirrussearch-dump-s8, which is dumping the CirrusSearch index for wikidatawiki.

Change #1036626 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure snapshot1017 to be the misc cron snapshot runner

https://gerrit.wikimedia.org/r/1036626

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1013.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-analytics) [2024-06-03T09:44:24Z] <btullis> reimaging snapshot1013 to bullseye for T325228

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1013.eqiad.wmnet with OS bullseye completed:

  • snapshot1013 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406031002_btullis_250840_snapshot1013.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1012.eqiad.wmnet with OS bullseye

Change #1036626 merged by Btullis:

[operations/puppet@production] Configure snapshot1017 to be the misc cron snapshot runner

https://gerrit.wikimedia.org/r/1036626

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1010.eqiad.wmnet with OS bullseye completed:

  • snapshot1010 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406031402_btullis_294033_snapshot1010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1012.eqiad.wmnet with OS bullseye completed:

  • snapshot1012 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406031405_btullis_294141_snapshot1012.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

BTullis updated the task description.