We are hoping to start this migration by the second quarter (Q2) of next year, because it depends on the ICU migration and on when the new packages are built for Bullseye.
Progress:
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T291916 Tracking task for Bullseye migrations in production
Resolved | | BTullis | T325228 Migrate Dumps Snapshot hosts from Buster to Bullseye
Mentioned in SAL (#wikimedia-operations) [2024-01-09T12:43:52Z] <moritzm> imported mwbzutils 0.1.4~wmf-1+deb11u1 for bullseye-wikimedia T325228
I've reimaged snapshot1014. After the rebuild of mwbzutils, most parts of the Puppet setup work fine, except one: the setup of the mw-cgroup (configured via mw-cgroup.systemd.erb) fails on Bullseye with a permission error when writing to /sys/fs/cgroup/memory/release_agent:
```
jmm@snapshot1014:~$ sudo echo '/usr/local/bin/cgroup-mediawiki-clean' > /sys/fs/cgroup/memory/release_agent
-bash: /sys/fs/cgroup/memory/release_agent: Permission denied
```
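Note that in `sudo echo ... > /sys/fs/cgroup/memory/release_agent` only `echo` runs as root: the `>` redirection is opened by the invoking user's shell, so this particular test can fail with `Permission denied` even where root could write the file. A minimal sketch of doing the same write with root privileges by piping into `sudo tee`; the helper name `write_as_root` and its `SUDO` override are hypothetical, added only for illustration:

```shell
# Write VALUE to FILE with root privileges.
# In `sudo echo VALUE > FILE` the shell opens FILE as the calling user;
# piping into `sudo tee` makes tee (running as root) open FILE instead.
# SUDO is a hypothetical override: set it to an empty string when already
# root (or when testing against an ordinary file) to drop the sudo prefix.
write_as_root() {
    local value=$1 file=$2
    printf '%s\n' "$value" | ${SUDO-sudo} tee "$file" > /dev/null
}

# Usage, mirroring the failing command (requires the v1 memory controller):
#   write_as_root /usr/local/bin/cgroup-mediawiki-clean /sys/fs/cgroup/memory/release_agent
```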
Change 991347 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/puppet@production] mediawiki::cgroup: Enable v1 cgroups on bullseye
Change 991347 merged by Muehlenhoff:
[operations/puppet@production] mediawiki::cgroup: Enable v1 cgroups on bullseye
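On Debian Bullseye, systemd mounts the unified cgroup v2 hierarchy by default, and in that layout the v1 per-controller directories such as `/sys/fs/cgroup/memory/` do not exist. A minimal sketch for checking which layout a host is actually running; it relies on GNU `stat -f` reporting the filesystem type mounted at `/sys/fs/cgroup`, and the function name `cgroup_mode` is hypothetical:

```shell
# Print "v2" for a unified cgroup v2 mount, "v1" for the legacy/hybrid layout,
# or "unknown" (with a non-zero status if the path cannot be inspected).
# A unified mount reports the filesystem type "cgroup2fs"; legacy and hybrid
# layouts mount a tmpfs there with per-controller directories (memory/, ...).
cgroup_mode() {
    local root=${1:-/sys/fs/cgroup}
    local fstype
    if ! fstype=$(stat -fc %T "$root" 2>/dev/null); then
        echo "unknown"
        return 1
    fi
    case $fstype in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1" ;;
        *)         echo "unknown" ;;
    esac
}

# Usage on a host: cgroup_mode
```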
When running the MediaWiki train, scap complained because the SSH host key of snapshot1016.eqiad.wmnet was not recognized. From deploy2002.codfw.wmnet:
scap pull ... (ran as mwdeploy@snapshot1016.eqiad.wmnet) returned [255]: Host key verification failed.
Change 992398 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/puppet@production] late_command: Drop special case for snapshot1016/1017
Change 992398 merged by Muehlenhoff:
[operations/puppet@production] late_command: Drop special case for snapshot1016/1017
Change 1008451 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/dumps/scap@master] Add a new deployment target in the beta cluster
Moving this into our current milestone, as we are now testing these dump scripts on Bullseye.
Change 1008451 merged by ArielGlenn:
[operations/dumps/scap@master] Add a new deployment target in the beta cluster
Change 1009288 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Allow the lilypond packages to be installed on bullseye
Change 1009288 merged by ArielGlenn:
[operations/puppet@production] Allow the lilypond packages to be installed on bullseye
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1015.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1015.eqiad.wmnet with OS bullseye completed:
Update: rebooting the VM fixed the problem, because the reboot applied the grub config (T363957#9762525). You just have to know that you need that extra reboot.
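Since a grub change only reaches the running kernel after a reboot, a quick post-reimage check is to look for the expected parameter in `/proc/cmdline`. A minimal sketch with a hypothetical helper `has_kernel_param`; the example parameter `systemd.unified_cgroup_hierarchy=0` (systemd's switch back to the legacy v1 hierarchy) is an assumption about what the grub configuration sets, not something this task confirms:

```shell
# Succeed if the kernel command line contains PARAM as an exact word.
# The optional second argument lets the check run against a saved copy of a
# command line instead of the live /proc/cmdline.
has_kernel_param() {
    local param=$1 cmdline_file=${2:-/proc/cmdline}
    # One word per line, then an exact, fixed-string, whole-line match.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}

# Usage (parameter is an illustrative assumption):
#   has_kernel_param systemd.unified_cgroup_hierarchy=0 || echo "reboot still needed?"
```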
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1011.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1011.eqiad.wmnet with OS bullseye completed:
Change #1029220 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Move dumps::generation::worker::dumper_misc_crons_only role
I have created https://gerrit.wikimedia.org/r/c/operations/puppet/+/1029220 which will move all of the following dumps from snapshot1008 to snapshot1017.
When we deploy this patch the systemd timers and services will become unmanaged on snapshot1008, so we will want to disable the timers by hand in order to avoid duplicate runs.
Change #1029509 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Move snapshot1009 to insetup::data_engineering
Change #1029509 merged by Btullis:
[operations/puppet@production] Move snapshot1009 to insetup::data_engineering
@xcollazo added a comment on my patch:
> LGTM, however, let's wait till snapshot1008 is idle.
> Right now it is running the dumpRdf job. I expect it to be idle by around the 15th of the month.
I have been checking snapshot1008 to see when it will be idle, but it looks like it's pretty much always running one dump or another.
These four dumps are currently running:
Listing the timers and filtering for dumps, we can see that more dumps will start on May 17th, 18th, 19th, 20th, and 22nd.
So I'm not sure that there is ever going to be a time when it's properly idle.
I think that I would be happy to merge the patch now, then manually stop and disable the timers on snapshot1008 to try to avoid duplicate runs.
Xabriel, what do you think? Is this workable to try to get the host roles switched without duplicate dumps conflicting with each other?
Here is a one-liner to list the next scheduled runs of all of the timers from the list in T325228#9781322:
It looks to me like this host is going to be doing some kind of dump all the time.
```
btullis@snapshot1008:~$ systemctl list-timers $(for t in $(cat timers.txt); do echo $t.timer;done)
NEXT                        LEFT           LAST                        PASSED       UNIT                            ACTIVATES
Thu 2024-05-16 20:50:00 UTC 8h left        Wed 2024-05-15 20:50:00 UTC 15h ago      adds-changes.timer              adds-changes.service
Fri 2024-05-17 05:00:00 UTC 16h left       Thu 2024-05-16 05:00:00 UTC 7h ago       categoriesrdf-dump-daily.timer  categoriesrdf-dump-daily.service
Fri 2024-05-17 08:10:00 UTC 19h left       Thu 2024-05-16 08:10:00 UTC 4h 1min ago  pagetitles-ns0.timer            pagetitles-ns0.service
Fri 2024-05-17 08:50:00 UTC 20h left       Thu 2024-05-16 08:50:00 UTC 3h 21min ago pagetitles-ns6.timer            pagetitles-ns6.service
Fri 2024-05-17 09:10:00 UTC 20h left       Fri 2024-05-10 09:10:00 UTC 6 days ago   xlation-dumps.timer             xlation-dumps.service
Fri 2024-05-17 23:00:00 UTC 1 day 10h left Fri 2024-05-10 23:00:00 UTC 5 days ago   wikidatardf-lexemes-dumps.timer wikidatardf-lexemes-dumps.service
Sat 2024-05-18 08:15:00 UTC 1 day 20h left Sat 2024-05-11 08:15:00 UTC 5 days ago   global_blocks_dump.timer        global_blocks_dump.service
Sat 2024-05-18 08:15:00 UTC 1 day 20h left Sat 2024-05-11 08:15:00 UTC 5 days ago   growth_mentorship_dump.timer    growth_mentorship_dump.service
Sat 2024-05-18 20:00:00 UTC 2 days left    Sat 2024-05-11 20:00:00 UTC 4 days ago   categoriesrdf-dump.timer        categoriesrdf-dump.service
Sun 2024-05-19 07:10:00 UTC 2 days left    Sun 2024-05-12 07:10:00 UTC 4 days ago   list-media-per-project.timer    list-media-per-project.service
Sun 2024-05-19 19:00:00 UTC 3 days left    Sun 2024-05-12 19:00:00 UTC 3 days ago   commonsrdf-dump.timer           commonsrdf-dump.service
Mon 2024-05-20 03:15:00 UTC 3 days left    Mon 2024-05-13 03:15:00 UTC 3 days ago   commonsjson-dump.timer          commonsjson-dump.service
Mon 2024-05-20 03:15:00 UTC 3 days left    Mon 2024-05-13 03:15:00 UTC 3 days ago   wikidatajson-dump.timer         wikidatajson-dump.service
Mon 2024-05-20 08:05:00 UTC 3 days left    Mon 2024-05-13 08:05:00 UTC 3 days ago   shorturls.timer                 shorturls.service
Mon 2024-05-20 16:15:00 UTC 4 days left    Mon 2024-05-13 16:15:00 UTC 2 days ago   cirrussearch-dump-s1.timer      cirrussearch-dump-s1.service
Mon 2024-05-20 16:15:00 UTC 4 days left    Mon 2024-05-13 16:15:00 UTC 2 days ago   cirrussearch-dump-s11.timer     cirrussearch-dump-s11.service
Mon 2024-05-20 16:15:00 UTC 4 days left    Mon 2024-05-13 16:15:00 UTC 2 days ago   cirrussearch-dump-s2.timer      cirrussearch-dump-s2.service
Mon 2024-05-20 16:15:00 UTC 4 days left    Mon 2024-05-13 16:15:00 UTC 2 days ago   cirrussearch-dump-s3.timer      cirrussearch-dump-s3.service
Mon 2024-05-20 16:15:00 UTC 4 days left    Mon 2024-05-13 16:15:00 UTC 2 days ago   cirrussearch-dump-s5.timer      cirrussearch-dump-s5.service
Mon 2024-05-20 16:15:00 UTC 4 days left    Mon 2024-05-13 16:15:00 UTC 2 days ago   cirrussearch-dump-s6.timer      cirrussearch-dump-s6.service
Mon 2024-05-20 16:15:00 UTC 4 days left    Mon 2024-05-13 16:15:00 UTC 2 days ago   cirrussearch-dump-s7.timer      cirrussearch-dump-s7.service
Wed 2024-05-22 03:15:00 UTC 5 days left    Wed 2024-05-15 03:15:00 UTC 1 day 8h ago wikidatajson-lexemes-dump.timer wikidatajson-lexemes-dump.service
n/a                         n/a            Mon 2024-05-13 21:52:12 UTC 2 days ago   cirrussearch-dump-s4.timer      cirrussearch-dump-s4.service
n/a                         n/a            Wed 2024-05-15 23:20:55 UTC 12h ago      cirrussearch-dump-s8.timer      cirrussearch-dump-s8.service
n/a                         n/a            Mon 2024-05-13 23:00:00 UTC 2 days ago   wikidatardf-all-dumps.timer     wikidatardf-all-dumps.service
n/a                         n/a            Wed 2024-05-15 23:00:00 UTC 13h ago      wikidatardf-truthy-dumps.timer  wikidatardf-truthy-dumps.service

26 timers listed.
Pass --all to see loaded but inactive timers, too.
```
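The timer unit names passed to `systemctl list-timers` in the one-liner above can also be generated by a small helper that tolerates blank lines, stray whitespace, and a missing trailing newline in `timers.txt`. A sketch; the function name `timer_units` is hypothetical:

```shell
# Print "<name>.timer" for each non-empty line of a list file such as timers.txt.
# `read` with the default IFS trims surrounding whitespace; the `|| [ -n ... ]`
# clause still processes a final line that lacks a trailing newline.
timer_units() {
    local list=$1 name
    while read -r name || [ -n "$name" ]; do
        if [ -n "$name" ]; then
            printf '%s.timer\n' "$name"
        fi
    done < "$list"
}

# Usage, equivalent to the one-liner above:
#   systemctl list-timers $(timer_units timers.txt)
```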
> Xabriel, what do you think? Is this workable to try to get the host roles switched without duplicate dumps conflicting with each other?
I had missed the continuous stream of jobs. Considering these are miscellaneous dumps, I'm not super worried if they fail or don't run once or twice.
> I think that I would be happy to merge the patch now, then manually stop and disable the timers on snapshot1008 to try to avoid duplicate runs.
Go for it!
Change #1029220 merged by Btullis:
[operations/puppet@production] Move dumps::generation::worker::dumper_misc_crons_only role
Mentioned in SAL (#wikimedia-analytics) [2024-05-16T15:52:58Z] <btullis> moving the dumps::generation::worker::dumper_misc_crons role from snapshot1008 to snapshot1017 for T325228
I have disabled the timers on snapshot1008 with the following.
```
btullis@snapshot1008:~$ for t in $(cat timers.txt); do sudo systemctl disable $t.timer ; done
Removed /etc/systemd/system/multi-user.target.wants/adds-changes.timer.
Removed /etc/systemd/system/multi-user.target.wants/categoriesrdf-dump-daily.timer.
Removed /etc/systemd/system/multi-user.target.wants/categoriesrdf-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s1.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s11.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s2.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s3.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s4.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s5.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s6.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s7.timer.
Removed /etc/systemd/system/multi-user.target.wants/cirrussearch-dump-s8.timer.
Removed /etc/systemd/system/multi-user.target.wants/commonsjson-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/commonsrdf-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/global_blocks_dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/growth_mentorship_dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/list-media-per-project.timer.
Removed /etc/systemd/system/multi-user.target.wants/pagetitles-ns0.timer.
Removed /etc/systemd/system/multi-user.target.wants/pagetitles-ns6.timer.
Removed /etc/systemd/system/multi-user.target.wants/shorturls.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatajson-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatajson-lexemes-dump.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatardf-all-dumps.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatardf-lexemes-dumps.timer.
Removed /etc/systemd/system/multi-user.target.wants/wikidatardf-truthy-dumps.timer.
Removed /etc/systemd/system/multi-user.target.wants/xlation-dumps.timer.
```
However, I think I may also have to stop the timers. Hopefully this will not affect the running services.
I stopped the timers with:
```
btullis@snapshot1008:~$ for t in $(cat timers.txt); do sudo systemctl stop $t.timer ; done
```
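The disable and stop loops above can be combined into one pass, since `systemctl disable --now` disables a unit and also stops it. By default, stopping a `.timer` unit only prevents future activations; a service the timer has already started keeps running. A sketch with a hypothetical helper `disable_timers_now` and an illustrative `DRY_RUN` switch (not a systemctl option):

```shell
# Disable and stop "<name>.timer" for every non-empty line of a list file.
# Already-running services started by these timers are left alone.
# DRY_RUN=1 prints the commands instead of executing them.
disable_timers_now() {
    local list=$1 name
    while read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "sudo systemctl disable --now $name.timer"
        else
            sudo systemctl disable --now "$name.timer"
        fi
    done < "$list"
}

# Usage: preview first, then run for real.
#   DRY_RUN=1 disable_timers_now timers.txt
#   disable_timers_now timers.txt
```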
Now the timers cannot be listed, but the existing processes are still running:
```
btullis@snapshot1008:~$ for p in $(pgrep -f systemd-timer); do pstree -a $p ; done
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject wikidatardf-all-dumps --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d all -f ttl ...
  └─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d all -f ttl -e nt
      ├─gzip -dc /mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20240513/wikidata-20240513-all-BETA.ttl.gz
      └─lbzip2 -n 4 -c
          └─6*[{lbzip2}]
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject wikidatardf-truthy-dumps --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy ...
  └─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 0 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 1 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 2 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 3 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 4 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 5 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      ├─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
      │   ├─gzip -9
      │   └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 6 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
      └─dumpwikibaserdf /usr/local/bin/dumpwikibaserdf.sh -p wikidata -d truthy -f nt
          ├─gzip -9
          └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dumpRdf.php --wiki wikidatawiki --shard 7 --sharding-factor 8 --batch-size 2000 --format nt --flavor ...
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject cirrussearch-dump-s4 --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpcirrussearch.sh --config/etc/dumps/confs/wiki
  └─dumpcirrussearc /usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other --dblist /srv/mediawiki/dblists/s4.dblist
      ├─gzip
      └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/CirrusSearch/maintenance/DumpIndex.php --wiki=commonswiki --indexSuffix=file
systemd-timer-m /usr/local/bin/systemd-timer-mail-wrapper --subject cirrussearch-dump-s8 --mail-to root@snapshot1008.eqiad.wmnet --only-on-error /usr/local/bin/dumpcirrussearch.sh --config/etc/dumps/confs/wiki
  └─dumpcirrussearc /usr/local/bin/dumpcirrussearch.sh --config /etc/dumps/confs/wikidump.conf.other --dblist /srv/mediawiki/dblists/s8.dblist
      ├─gzip
      └─php7.4 /srv/mediawiki/multiversion/MWScript.php extensions/CirrusSearch/maintenance/DumpIndex.php --wiki=wikidatawiki --indexSuffix=content
```
So I think we're good. I'll keep monitoring these dump processes on snapshot1008, but once they are finished I think that I can proceed to decommission it.
I'll also check on snapshot1017 that they start and run as expected.
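While monitoring, the `--subject` argument of each running `systemd-timer-mail-wrapper` process names the timer-launched dump that is still in progress. A sketch that extracts it from `pgrep -af` output (a PID followed by the full command line); the function name `running_dump_subjects` is hypothetical:

```shell
# Read "PID COMMAND LINE" records on stdin and print the --subject value of
# every systemd-timer-mail-wrapper command line; other processes are ignored.
running_dump_subjects() {
    sed -n 's/.*systemd-timer-mail-wrapper.* --subject \([^ ]*\).*/\1/p'
}

# Usage on a host such as snapshot1008:
#   pgrep -af systemd-timer-mail-wrapper | running_dump_subjects
```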
Change #1032610 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] scap: remove snapshot1008 from dsh group mediawiki-installation
Change #1032610 merged by Dzahn:
[operations/puppet@production] scap: remove snapshot1008 from dsh group mediawiki-installation
There is still one dump running on snapshot1008. This is the cirrussearch-dump-s8 which is dumping cirrussearch for wikidatawiki.
Change #1036626 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Configure snapshot1017 to be the misc cron snapshot runner
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1013.eqiad.wmnet with OS bullseye
Mentioned in SAL (#wikimedia-analytics) [2024-06-03T09:44:24Z] <btullis> reimaging snapshot1013 to bullseye for T325228
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1013.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1010.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host snapshot1012.eqiad.wmnet with OS bullseye
Change #1036626 merged by Btullis:
[operations/puppet@production] Configure snapshot1017 to be the misc cron snapshot runner
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1010.eqiad.wmnet with OS bullseye completed:
Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host snapshot1012.eqiad.wmnet with OS bullseye completed: