⚓ T325228 Migrate Dumps Snapshot hosts from Buster to Bullseye

Subject	Repo	Branch	Lines +/-
Configure snapshot1017 to be the misc cron snapshot runner	operations/puppet	production	+1 -0
scap: remove snapshot1008 from dsh group mediawiki-installation	operations/puppet	production	+0 -1
Move dumps::generation::worker::dumper_misc_crons_only role	operations/puppet	production	+6 -11
Move snapshot1009 to insetup::data_engineering	operations/puppet	production	+2 -4
mediawiki::cgroup: Enable v1 cgroups on bullseye	operations/puppet	production	+9 -0
Allow the lilypond packages to be installed on bullseye	operations/puppet	production	+2 -2
Add a new deployment target in the beta cluster	operations/dumps/scap	master	+1 -0
late_command: Drop special case for snapshot1016/1017	operations/puppet	production	+0 -8
mwbzutils: Build for bullseye	operations/debs/mwbzutils	master	+9 -3

		Status	Subtype	Assigned	Task
		Open		None	T291916 Tracking task for Bullseye migrations in production
		Resolved		BTullis	T325228 Migrate Dumps Snapshot hosts from Buster to Bullseye

Mentioned in SAL (#wikimedia-operations) [2024-01-09T12:43:52Z] <moritzm> imported mwbzutils 0.1.4~wmf-1+deb11u1 for bullseye-wikimedia T325228

I've reimaged snapshot1014. After the rebuild of mwbzutils, most parts of the Puppet setup work fine, except one: the setup of the mw-cgroup (configured via mw-cgroup.systemd.erb) fails on Bullseye with a permission error when trying to write to /sys/fs/cgroup/memory/release_agent:

jmm@snapshot1014:~$ sudo echo '/usr/local/bin/cgroup-mediawiki-clean' > /sys/fs/cgroup/memory/release_agent
-bash: /sys/fs/cgroup/memory/release_agent: Permission denied
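A side note on the reproduction above: in `sudo echo ... > file`, the redirection is opened by the calling shell rather than by sudo, which is why the error is reported by `-bash`. The underlying mw-cgroup failure may still be real, but this particular command would fail for an unprivileged user either way. A minimal sketch of a variant where the write itself runs with elevated privileges; `write_privileged` and the `SUDO` override are hypothetical names introduced here for illustration:

```shell
# bash sketch. Hypothetical helper: write one line to a file, with the
# write itself (not only the echo) running under sudo.
# Set SUDO="" to run without sudo, e.g. for a dry test on a scratch file.
write_privileged() {
    local target="$1" value="$2"
    # ${SUDO-sudo}: use sudo unless SUDO is explicitly set (even to empty).
    printf '%s\n' "$value" | ${SUDO-sudo} tee "$target" > /dev/null
}

# Example invocation (not run here):
# write_privileged /sys/fs/cgroup/memory/release_agent /usr/local/bin/cgroup-mediawiki-clean
```

If this variant also reports a permission error, the problem lies with the cgroup setup rather than with the shell redirection.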

Maintenance_bot removed a project: Patch-For-Review. Jan 9 2024, 1:30 PM

MoritzMuehlenhoff added a project: SRE. Jan 9 2024, 1:33 PM

Change 991347 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] mediawiki::cgroup: Enable v1 cgroups on bullseye

https://gerrit.wikimedia.org/r/991347

gerritbot added a project: Patch-For-Review. Jan 17 2024, 1:46 PM

LSobanski added a project: Data-Platform-SRE. Jan 22 2024, 6:47 AM

Change 991347 merged by Muehlenhoff:

[operations/puppet@production] mediawiki::cgroup: Enable v1 cgroups on bullseye

https://gerrit.wikimedia.org/r/991347

When running the MediaWiki train, scap complained due to the ssh host key of snapshot1016.eqiad.wmnet not being recognized. From deploy2002.codfw.wmnet:

scap pull ...
(ran as mwdeploy@snapshot1016.eqiad.wmnet) returned [255]: Host key verification failed.

Maintenance_bot removed a project: Patch-For-Review. Jan 23 2024, 9:30 AM

In T325228#9480025, @hashar wrote:
When running the MediaWiki train, scap complained due to the ssh host key of snapshot1016.eqiad.wmnet not being recognized. From deploy2002.codfw.wmnet:
scap pull ...
(ran as mwdeploy@snapshot1016.eqiad.wmnet) returned [255]: Host key verification failed.

That's expected, the server was in the middle of a reimage to Bullseye.

Change 992398 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] late_command: Drop special case for snapshot1016/1017

https://gerrit.wikimedia.org/r/992398

gerritbot added a project: Patch-For-Review. Jan 23 2024, 11:33 AM

Change 992398 merged by Muehlenhoff:

[operations/puppet@production] late_command: Drop special case for snapshot1016/1017

https://gerrit.wikimedia.org/r/992398

Maintenance_bot removed a project: Patch-For-Review. Jan 23 2024, 12:30 PM

Gehel triaged this task as High priority. Jan 23 2024, 2:05 PM

Gehel moved this task from Incoming to OS Upgrade on the Data-Platform-SRE board.

lbowmaker moved this task from Incoming (new tickets) to Radar (External Teams) on the Data-Engineering board. Feb 8 2024, 5:38 PM

Change 1008451 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dumps/scap@master] Add a new deployment target in the beta cluster

https://gerrit.wikimedia.org/r/1008451

gerritbot added a project: Patch-For-Review. Mar 4 2024, 1:27 PM

Moving this into our current milestone, as we are currently working on testing these dumps scripts on bullseye.

BTullis claimed this task. Mar 4 2024, 1:30 PM

BTullis updated the task description.

BTullis moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.03.04 - 2024.03.24) board. Mar 4 2024, 1:32 PM

Change 1008451 merged by ArielGlenn:

[operations/dumps/scap@master] Add a new deployment target in the beta cluster

https://gerrit.wikimedia.org/r/1008451

Maintenance_bot removed a project: Patch-For-Review. Mar 4 2024, 2:30 PM

BTullis mentioned this in R1885:213c58068252: Add a new deployment target in the beta cluster. Mar 4 2024, 4:16 PM

Change 1009288 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Allow the lilypond packages to be installed on bullseye

https://gerrit.wikimedia.org/r/1009288

gerritbot added a project: Patch-For-Review. Mar 6 2024, 3:43 PM

Change 1009288 merged by ArielGlenn:

[operations/puppet@production] Allow the lilypond packages to be installed on bullseye

https://gerrit.wikimedia.org/r/1009288

Maintenance_bot removed a project: Patch-For-Review. Mar 6 2024, 4:30 PM

Gehel edited projects, added Data-Platform-SRE (2024.03.25 - 2024.04.14); removed Data-Platform-SRE (2024.03.04 - 2024.03.24). Mar 22 2024, 8:45 AM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board. Mar 22 2024, 8:45 AM

hashar unsubscribed. Mar 22 2024, 2:46 PM

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1015.eqiad.wmnet with OS bullseye

BTullis updated the task description. Apr 9 2024, 10:47 AM

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1015.eqiad.wmnet with OS bullseye completed:

snapshot1015 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404091034_btullis_1548395_snapshot1015.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Gehel edited projects, added Data-Platform-SRE (2024.04.15 - 2024.05.05); removed Data-Platform-SRE (2024.03.25 - 2024.04.14). Apr 15 2024, 12:39 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.

Dzahn mentioned this in T363957: deployment_server bullseye - mw-cgroup.service: Failed. May 1 2024, 10:38 PM

In T325228#9445729, @MoritzMuehlenhoff wrote:

.. The setup of the mw-cgroup (configured via mw-cgroup.systemd.erb) fails with Bullseye, there's a permission error trying to write to /sys/fs/cgroup/memory/release_agent:

I ran into the same issue when trying to run a deployment_server in cloud VPS on bullseye (cgroup issue: T363957 -> deployment server on bullseye: T363415 -> get rid of buster VMs: T360964).

Update: rebooting the VM fixed the problem, because the grub config was only applied at that point (T363957#9762525). You just have to know that you need that extra reboot.
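The extra reboot matters because changes to the kernel command line written into the grub config only take effect at the next boot. A minimal sketch of a check for whether a host still needs that reboot, assuming the cgroup v1 switch is made through the `systemd.unified_cgroup_hierarchy=0` kernel parameter (an assumption: the thread does not quote the parameter the patch sets). `cmdline_has` and `needs_reboot_for_cgroup_v1` are hypothetical helper names:

```shell
# bash sketch. Hypothetical helper: succeed if the given parameter appears
# as a whole word in a kernel command line file (default: the running kernel's).
cmdline_has() {
    local param="$1" file="${2:-/proc/cmdline}"
    # One parameter per line, then an exact fixed-string line match.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}

# Succeeds (exit 0) when the assumed parameter is NOT yet active,
# i.e. the grub change has presumably not been booted into.
needs_reboot_for_cgroup_v1() {
    ! cmdline_has 'systemd.unified_cgroup_hierarchy=0' "${1:-/proc/cmdline}"
}
```

Running `needs_reboot_for_cgroup_v1 && echo "reboot required"` after the first Puppet run would flag hosts in the state described above.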

Gehel edited projects, added Data-Platform-SRE (2024.05.06 - 2024.05.26); removed Data-Platform-SRE (2024.04.15 - 2024.05.05). May 3 2024, 3:39 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.05.06 - 2024.05.26) board.

Hello there.

Due to T364250, the host snapshot1011 will not be running the typical wikidatawiki dump, and thus will be idle till the ~20th. So a good window to migrate.

CC @BTullis

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1011.eqiad.wmnet with OS bullseye

BTullis updated the task description. May 8 2024, 9:49 AM

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1011.eqiad.wmnet with OS bullseye completed:

snapshot1011 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405080956_btullis_1291660_snapshot1011.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

BTullis updated the task description. May 8 2024, 10:19 AM

Change #1029220 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Move dumps::generation::worker::dumper_misc_crons_only role

https://gerrit.wikimedia.org/r/1029220

gerritbot added a project: Patch-For-Review. May 8 2024, 3:41 PM

I have created https://gerrit.wikimedia.org/r/c/operations/puppet/+/1029220 which will move all of the following dumps from snapshot1008 to snapshot1017.

adds-changes
categoriesrdf-dump-daily
categoriesrdf-dump
cirrussearch-dump-s1
cirrussearch-dump-s11
cirrussearch-dump-s2
cirrussearch-dump-s3
cirrussearch-dump-s4
cirrussearch-dump-s5
cirrussearch-dump-s6
cirrussearch-dump-s7
cirrussearch-dump-s8
cirrussearch-dump
commonsjson-dump
commonsrdf-dump
global_blocks_dump
growth_mentorship_dump
list-media-per-project
pagetitles-ns0
pagetitles-ns6
shorturls
wikidatajson-dump
wikidatajson-lexemes-dump
wikidatardf-all-dumps
wikidatardf-lexemes-dumps
wikidatardf-truthy-dumps
xlation-dumps

When we deploy this patch the systemd timers and services will become unmanaged on snapshot1008, so we will want to disable the timers by hand in order to avoid duplicate runs.

Change #1029509 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Move snapshot1009 to insetup::data_engineering

https://gerrit.wikimedia.org/r/1029509

Change #1029509 merged by Btullis:

[operations/puppet@production] Move snapshot1009 to insetup::data_engineering

https://gerrit.wikimedia.org/r/1029509

BTullis updated the task description. May 15 2024, 1:47 PM

@xcollazo added a comment on my patch:

LGTM, however, let's wait till snapshot1008 is idle.
Right now it is running the dumpRdf job. I expect it to be idle by the ~15th of the month.

I have been checking snapshot1008 to see when it will be idle, but it looks like it's pretty much always running one dump or another.
These four dumps are currently running:

wikidatardf-all-dumps
wikidatardf-truthy-dumps
cirrussearch-dump-s4
cirrussearch-dump-s8

Listing the timers and filtering for dump we can see that more dumps will start on May 17th, 18th, 19th, 20th, and 22nd.
So I'm not sure that there is ever going to be a time when it's properly idle.
I think that I would be happy to merge the patch now, then manually stop and disable the timers on snapshot1008 to try to avoid duplicate runs.

Xabriel, what do you think? Is this workable to try to get the host roles switched without duplicate dumps conflicting with each other?

Here is a one-liner to list the next scheduled runs of all of the timers from the list in T325228#9781322
It looks to me like this host is going to be doing some kind of dump all the time.

btullis@snapshot1008:~$ systemctl list-timers $(for t in $(cat timers.txt); do echo $t.timer;done)
NEXT                         LEFT           LAST                         PASSED       UNIT                            ACTIVATES
Thu 2024-05-16 20:50:00 UTC  8h left        Wed 2024-05-15 20:50:00 UTC  15h ago      adds-changes.timer              adds-changes.service
Fri 2024-05-17 05:00:00 UTC  16h left       Thu 2024-05-16 05:00:00 UTC  7h ago       categoriesrdf-dump-daily.timer  categoriesrdf-dump-daily.service
Fri 2024-05-17 08:10:00 UTC  19h left       Thu 2024-05-16 08:10:00 UTC  4h 1min ago  pagetitles-ns0.timer            pagetitles-ns0.service
Fri 2024-05-17 08:50:00 UTC  20h left       Thu 2024-05-16 08:50:00 UTC  3h 21min ago pagetitles-ns6.timer            pagetitles-ns6.service
Fri 2024-05-17 09:10:00 UTC  20h left       Fri 2024-05-10 09:10:00 UTC  6 days ago   xlation-dumps.timer             xlation-dumps.service
Fri 2024-05-17 23:00:00 UTC  1 day 10h left Fri 2024-05-10 23:00:00 UTC  5 days ago   wikidatardf-lexemes-dumps.timer wikidatardf-lexemes-dumps.service
Sat 2024-05-18 08:15:00 UTC  1 day 20h left Sat 2024-05-11 08:15:00 UTC  5 days ago   global_blocks_dump.timer        global_blocks_dump.service
Sat 2024-05-18 08:15:00 UTC  1 day 20h left Sat 2024-05-11 08:15:00 UTC  5 days ago   growth_mentorship_dump.timer    growth_mentorship_dump.service
Sat 2024-05-18 20:00:00 UTC  2 days left    Sat 2024-05-11 20:00:00 UTC  4 days ago   categoriesrdf-dump.timer        categoriesrdf-dump.service
Sun 2024-05-19 07:10:00 UTC  2 days left    Sun 2024-05-12 07:10:00 UTC  4 days ago   list-media-per-project.timer    list-media-per-project.service
Sun 2024-05-19 19:00:00 UTC  3 days left    Sun 2024-05-12 19:00:00 UTC  3 days ago   commonsrdf-dump.timer           commonsrdf-dump.service
Mon 2024-05-20 03:15:00 UTC  3 days left    Mon 2024-05-13 03:15:00 UTC  3 days ago   commonsjson-dump.timer          commonsjson-dump.service
Mon 2024-05-20 03:15:00 UTC  3 days left    Mon 2024-05-13 03:15:00 UTC  3 days ago   wikidatajson-dump.timer         wikidatajson-dump.service
Mon 2024-05-20 08:05:00 UTC  3 days left    Mon 2024-05-13 08:05:00 UTC  3 days ago   shorturls.timer                 shorturls.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s1.timer      cirrussearch-dump-s1.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s11.timer     cirrussearch-dump-s11.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s2.timer      cirrussearch-dump-s2.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s3.timer      cirrussearch-dump-s3.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s5.timer      cirrussearch-dump-s5.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s6.timer      cirrussearch-dump-s6.service
Mon 2024-05-20 16:15:00 UTC  4 days left    Mon 2024-05-13 16:15:00 UTC  2 days ago   cirrussearch-dump-s7.timer      cirrussearch-dump-s7.service
Wed 2024-05-22 03:15:00 UTC  5 days left    Wed 2024-05-15 03:15:00 UTC  1 day 8h ago wikidatajson-lexemes-dump.timer wikidatajson-lexemes-dump.service
n/a                          n/a            Mon 2024-05-13 21:52:12 UTC  2 days ago   cirrussearch-dump-s4.timer      cirrussearch-dump-s4.service
n/a                          n/a            Wed 2024-05-15 23:20:55 UTC  12h ago      cirrussearch-dump-s8.timer      cirrussearch-dump-s8.service
n/a                          n/a            Mon 2024-05-13 23:00:00 UTC  2 days ago   wikidatardf-all-dumps.timer     wikidatardf-all-dumps.service
n/a                          n/a            Wed 2024-05-15 23:00:00 UTC  13h ago      wikidatardf-truthy-dumps.timer  wikidatardf-truthy-dumps.service

26 timers listed.
Pass --all to see loaded but inactive timers, too.

Xabriel, what do you think? Is this workable to try to get the host roles switched without duplicate dumps conflicting with each other?

I had missed the continuous stream of jobs. Considering these are miscellaneous dumps, I'm not super worried if they fail or do not run once or twice.

I think that I would be happy to merge the patch now, then manually stop and disable the timers on snapshot1008 to try to avoid duplicate runs.

Go for it!

Change #1029220 merged by Btullis:

[operations/puppet@production] Move dumps::generation::worker::dumper_misc_crons_only role

https://gerrit.wikimedia.org/r/1029220

Mentioned in SAL (#wikimedia-analytics) [2024-05-16T15:52:58Z] <btullis> moving the dumps::generation::worker::dumper_misc_crons role from snapshot1008 to snapshot1017 for T325228

I have disabled the timers on snapshot1008 with the following.

btullis@snapshot1008:~$ for t in $(cat timers.txt); do sudo systemctl disable $t.timer ; done
Removed /etc/systemd/system/multi-user.target.wants/adds-changes.timer.
Removed /etc/systemd/system/multi-user.target.wants/categoriesrdf-dump-daily.timer.
Removed /etc/systemd/system/multi-user.target.wants/categoriesrdf-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s1.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s11.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s2.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s3.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s4.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s5.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s6.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s7.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s8.timer.
Removed /etc/systemd/system/multi-user.target.wants/commonsjson-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/commonsrdf-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/global_blocks_dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/growth_mentorship_dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/list-media-per-project.timer.
Removed /etc/systemd/system/multi-user.target.wants/pagetitles-ns0.timer.
Removed /etc/systemd/system/multi-user.target.wants/pagetitles-ns6.timer.
Removed /etc/systemd/system/multi-user.target.wants/shorturls.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatajson-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatajson-lexemes-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatardf-all-dumps.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatardf-lexemes-dumps.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatardf-truthy-dumps.timer.
Removed /etc/systemd/system/multi-user.target.wants/xlation-dumps.timer.

However, I think I may have to stop the timers as well. Hopefully this will not affect the running services.

I stopped the timers with:

btullis@snapshot1008:~$ for t in $(cat timers.txt); do sudo systemctl stop $t.timer ; done
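The separate disable and stop loops above could probably be folded into one pass, since `systemctl disable --now` also stops the unit. A minimal sketch, assuming `timers.txt` holds one timer base name per line as in the commands above; `timer_units` and `disable_timers` are hypothetical helper names, and `DRY_RUN=1` only prints the commands instead of running them:

```shell
# bash sketch. Print "<name>.timer" for each non-empty line of the list file.
timer_units() {
    local name
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] && printf '%s.timer\n' "$name"
    done < "$1"
    return 0
}

# Disable and stop each timer in one step.
# With DRY_RUN=1 the commands are printed rather than executed.
disable_timers() {
    local unit
    timer_units "$1" | while IFS= read -r unit; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "sudo systemctl disable --now $unit"
        else
            # Keep sudo/systemctl from consuming the loop's stdin.
            sudo systemctl disable --now "$unit" < /dev/null
        fi
    done
}
```

Running `DRY_RUN=1 disable_timers timers.txt` first gives a chance to review the unit names before anything is changed on the host.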

Now the timers cannot be listed, but the existing processes are still running:

btullis@snapshot1008:~$ for p in $(pgrep -f systemd-timer); do pstree -a $p ; done
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject wikidatardf-all-dumps --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d all -f ttl ...
  └─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d all -f ttl -e nt
      ├─gzip -dc /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20240513/wikidata-20240513-all-BETA.ttl.gz
      └─lbzip2 -n 4 -c
          └─6*[{lbzip2}]
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject wikidatardf-truthy-dumps --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy ...
  └─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 5 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 6 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      └─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
          ├─gzip -9
          └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 7 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject cirrussearch-dump-s4 --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpcirrussearch.sh --config/etc/dumps/confs/wiki
  └─dumpcirrussearc /usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other --dblist /srv/mediawiki/dblists/s4.dblist
      ├─gzip
      └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/CirrusSearch/maintenance/DumpIndex.php --wiki=commonswiki --indexSuffix=file
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject cirrussearch-dump-s8 --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpcirrussearch.sh --config/etc/dumps/confs/wiki
  └─dumpcirrussearc /usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other --dblist /srv/mediawiki/dblists/s8.dblist
      ├─gzip
      └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/CirrusSearch/maintenance/DumpIndex.php --wiki=wikidatawiki --indexSuffix=content

So I think we're good. I'll keep monitoring these dump processes on snapshot1008, but once they are finished I think that I can proceed to decommission it.

I'll also check on snapshot1017 that they start and run as expected.

Maintenance_bot removed a project: Patch-For-Review. May 16 2024, 4:32 PM

Change #1032610 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] scap: remove snapshot1008 from dsh group mediawiki-installation

https://gerrit.wikimedia.org/r/1032610

gerritbot added a project: Patch-For-Review. May 17 2024, 12:47 AM

Change #1032610 merged by Dzahn:

[operations/puppet@production] scap: remove snapshot1008 from dsh group mediawiki-installation

https://gerrit.wikimedia.org/r/1032610

Maintenance_bot removed a project: Patch-For-Review. May 17 2024, 1:30 AM

Host rebooted by btullis@cumin1002 with reason: Rebooting to pick up new kernel

There is still one dump running on snapshot1008. This is the cirrussearch-dump-s8 which is dumping cirrussearch for wikidatawiki.

BTullis added a parent task: T291916: Tracking task for Bullseye migrations in production. May 23 2024, 11:23 AM

Gehel edited projects, added Data-Platform-SRE (2024.05.27 - 2024.06.16); removed Data-Platform-SRE (2024.05.06 - 2024.05.26). May 24 2024, 12:20 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.05.27 - 2024.06.16) board.

BTullis updated the task description. May 24 2024, 1:45 PM

BTullis mentioned this in T366043: Some dumps are not available since mid may 2024. May 28 2024, 8:30 AM

Change #1036626 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Configure snapshot1017 to be the misc cron snapshot runner

https://gerrit.wikimedia.org/r/1036626

gerritbot added a project: Patch-For-Review. May 28 2024, 11:03 AM

Dzahn unsubscribed. May 28 2024, 4:57 PM

BTullis mentioned this in T365155: Text id verification makes dumps skip many good rows. Jun 3 2024, 9:00 AM

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1013.eqiad.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-analytics) [2024-06-03T09:44:24Z] <btullis> reimaging snapshot1013 to bullseye for T325228

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1013.eqiad.wmnet with OS bullseye completed:

snapshot1013 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406031002_btullis_250840_snapshot1013.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

BTullis updated the task description. Jun 3 2024, 10:45 AM

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1010.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1012.eqiad.wmnet with OS bullseye

Change #1036626 merged by Btullis:

[operations/puppet@production] Configure snapshot1017 to be the misc cron snapshot runner

https://gerrit.wikimedia.org/r/1036626

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1010.eqiad.wmnet with OS bullseye completed:

snapshot1010 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406031402_btullis_294033_snapshot1010.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Maintenance_bot removed a project: Patch-For-Review. Jun 3 2024, 2:31 PM

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1012.eqiad.wmnet with OS bullseye completed:

snapshot1012 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406031405_btullis_294141_snapshot1012.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

BTullis closed this task as Resolved. Jun 3 2024, 2:47 PM

BTullis updated the task description.

BTullis moved this task from In Progress to Done on the Data-Platform-SRE (2024.05.27 - 2024.06.16) board.

	Hokwelum
	Dec 14 2022, 6:31 PM