These are currently running jessie:
|operations/puppet||production||+1 -1||configure adds-changes dumps to skip locking for now|
|operations/puppet||production||+1 -1||dumpsdata1001 will install with buster now|
|operations/dumps||master||+31 -17||add ability to skip locking for adds-changes dumps|
|operations/puppet||production||+7 -7||move misc crons to dumpsdata1002 nfs server|
|operations/puppet||production||+75 -1||add new partman recipe that skips format of /data partition for dumps servers|
|operations/puppet||production||+76 -1||add partman recipe that leaves /data on dump servers alone|
|operations/puppet||production||+1 -1||make dumpsdata1002 install buster instead of jessie|
|operations/puppet||production||+6 -1||make dumpsdata1002 spare before reimaging|
|operations/puppet||production||+2 -2||make dumpsdata primary nfs server rsync to dumpsdata1003 now|
|operations/puppet||production||+6 -1||fix up dump stats script to use either mail or s-nail|
|operations/puppet||production||+5 -2||buster doesn't have mailx, replace with s-nail|
|operations/puppet||production||+17 -0||add per-host configs for new dumps fallback NFS server|
|operations/puppet||production||+1 -1||make dumpsdata1003 another secondary dumps NFS server along with dumpsdata1002|
|Open||None||T224549 Track remaining jessie systems in production|
|Resolved||ArielGlenn||T224563 Migrate dumpsdata hosts to Stretch/Buster|
|Resolved||ArielGlenn||T219768 Get a third dumpsdata server|
|Open||ArielGlenn||T234076 (Need by Aug 1) rack/setup/install dumpsdata1003.eqiad.wmnet|
Plan for migration:
At this point we should have three hosts on buster, one doing misc crons, one as fallback for the xml/sql dumps, and one as primary for the xml/sql dumps.
If the misc crons host fails, the xml/sql fallback server can take over for it without issues once the right role is applied.
rsync -v labstore1006.wikimedia.org::data/xmldatadumps/public/rsync-inc-last-2.txt .
rsync -av --include '/*wik*/' --include-from=rsync-inc-last-2.txt --exclude='*' labstore1006.wikimedia.org::data/xmldatadumps/public/ /data/xmldatadumps/public
This will bring over some older files from 2007 and 2009, but it's easier to clean those up later than to try to get the rsync args right to exclude them.
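For the later cleanup, one possible approach is to list files by modification time before deciding what to remove. The sketch below assumes that `rsync -a` preserved the original mtimes and that GNU `find` is available; the helper name `list_old_dump_files` and the cutoff date are made up for illustration, and nothing is deleted.

```shell
# Hypothetical helper: print regular files under the directory given as $1
# whose mtime is not newer than the date given as $2.
# It only lists files; it never deletes anything.
list_old_dump_files() {
    find "$1" -type f ! -newermt "$2" -print
}

# Example (path taken from the rsync target above; the cutoff is an assumption):
#   list_old_dump_files /data/xmldatadumps/public 2010-01-01
```

Reviewing the list by hand before removing anything keeps the cleanup separate from the sync itself.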
The above rsync completed; I will be rerunning it from time to time. In the meantime I have moved on to the 'misc' dumps:
rsync -av labstore1006.wikimedia.org::data/xmldatadumps/public/other/ /data/otherdumps
rsync -av --bwlimit=80000 dumpsdata1002.eqiad.wmnet::data/otherdumps/ /data/otherdumps
I have extended the rsync of xml/sql dumps to the last three good dumps and have been running a bandwidth-limited pull from labstore1006 to dumpsdata1003 in a screen session on dumpsdata1003. I've periodically been updating the misc/other dumps via pull from dumpsdata1002.
Rsync of both xmldatadumps/public and otherdumps from dumpsdata1002 to dumpsdata1003 is caught up as of earlier this evening. I'll be running these throughout the day tomorrow, waiting for the misc cron dumps to finish.
Script wmf-auto-reimage was launched by ariel on cumin1001.eqiad.wmnet for hosts:
The log can be found in /var/log/wmf-auto-reimage/201911161927_ariel_118105.log.
Expanded /data on dumpsdata1002, rsyncing copies of adds-changes dumps now from dumpsdata1003 in a screen session. After that I'll pick up the categoryrdf dumps, also via rsync from dumpsdata1003.
The schedule is now:
And then of course check that everything is running ok when xml dumps start on Dec 1st.
Given that the wikidata entity dumps are still finishing up the truthy gz files, and after that there will be bz2 recompression and the Lexemes, I'll be making the switchover tomorrow morning or mid-day EET.
snapshot1008 now uses dumpsdata1002 as its nfs server. I had to manually systemctl stop nfs-mountd.service and start it again for dumpsdata1002 to pick up the values (and especially the port setting) in /etc/default/nfs-kernel-server so that's poor. Other than that, no problems with puppet's unmounting and remounting of the share.
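To see what is actually being advertised after such a restart, the registered ports can be read from `rpcinfo -p`. This is only a sketch: the helper name `mountd_ports` is hypothetical, and the port to expect is whatever is configured in /etc/default/nfs-kernel-server, which is not shown here.

```shell
# Hypothetical helper: read `rpcinfo -p` output on stdin and print the
# distinct ports registered for the mountd service, one per line.
mountd_ports() {
    awk '$5 == "mountd" { print $4 }' | sort -un
}

# Example, run from a client or on the server itself:
#   rpcinfo -p dumpsdata1002.eqiad.wmnet | mountd_ports
# Compare the result against the port configured for mountd.
```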
The next misc cron dump (pagetitles) is already running, so I expect to see the files over on labstore1006 and labstore1007 in a little while.
Adds-changes dumps did not run properly; when I checked this afternoon the Nov 23 job was hung indefinitely trying to get a lockfile on the first wiki to be processed (abwiki). I watched snapshot1008 attempt to connect to dumpsdata1002 for some nfs request and then try dumpsdata1003 when that failed (!). I rebooted snapshot1008, which no longer does this. It seems some port was still advertised wrongly on dumpsdata1002; a reboot took care of that.
However, locks over nfs in buster either behave differently or there is some other flag someplace I missed.
I've pushed over changes to the adds-changes scripts to skip locking for now, since only one process runs at a time for a given date anyway. However, it needs to be fixed soon. I also need to check whether the locking mechanism for xml/sql dumps works in buster as is, since that switchover is coming up very soon.
The changeset for skipping locks is not yet merged; that will go out tomorrow.
Backrunning the Nov 23 adds-changes now so they'll be complete in time for the Nov 24 run which kicks off around 9 pm UTC.
I have tested on snapshot1008, which mounts only the buster nfs share, that the dump_lock.py script with multiple instances works as it should; this is the locking mechanism for xml/sql dumps. This means that although the adds-changes dumps locking must still be investigated later, I can go ahead and re-image dumpsdata1001 now that the current xml/sql run has completed.
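A rough way to repeat this kind of check by hand is to have two processes contend for the same lock file on the share. The sketch below uses the util-linux `flock` command rather than dump_lock.py, so it may not exercise the same locking calls as the dumps scripts; the helper name `hold_lock` and the lock path in the example are assumptions for illustration.

```shell
# Hypothetical helper: try to take an exclusive lock on the file given as $1
# without blocking and, if that succeeds, hold it for $2 seconds.
# Returns non-zero when another process already holds the lock.
hold_lock() {
    flock -n "$1" sleep "$2"
}

# Example on a host that mounts the share (the path is an assumption):
#   hold_lock /mnt/dumpsdata/locktest 30 &    # first instance holds the lock
#   hold_lock /mnt/dumpsdata/locktest 0 || echo "second instance refused"
```

If the second instance is refused while the first is still running, and succeeds once it has exited, the mount is at least honouring this kind of lock.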