
Pre-load the 20250601 dumps to the cephfs volume to use as prefetch data for the 20250701 dumps
Closed, Resolved · Public

Description

The first full-size set of XML/SQL dumps on Airflow is set for July 1st.

It would be useful if we could pre-load the previous month's full dump onto the CephFS volume, so that the prefetch mechanism will function.

There are 1008 wikis to be included and the full size of the set of dumps is 12 TB.

btullis@clouddumps1001:/srv/dumps/xmldatadumps/public$ find . -maxdepth 2 -name 20250601|wc -l
1008
btullis@clouddumps1001:/srv/dumps/xmldatadumps/public$ find . -maxdepth 2 -name 20250601 -exec du -ch {} + | grep total$
12T	total

Event Timeline

BTullis triaged this task as High priority. Jun 25 2025, 4:35 PM
BTullis updated the task description.

I am running the following command in the sync toolbox pod.

runuser@mediawiki-dumps-legacy-sync-toolbox-6d56d67d48-zg2nd:/mnt/dumpsdata/xmldatadumps/public$ rsync --stats -r -v --relative dumpsgen@clouddumps1001.wikimedia.org:/srv/dumps/xmldatadumps/public/./*/20250601 .

This should copy all of the data and put it into the right place.
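
For reference, the /./ marker is what makes --relative recreate only the per-wiki path components at the destination; a dry run, sketched below with the same paths, would show the layout that gets produced without transferring anything.

# -n (--dry-run) lists what would be copied without transferring any data.
# --relative reproduces only the path components after the "/./" marker, so
# each wiki lands as ./<wiki>/20250601/... under the destination directory.
rsync -n -r -v --relative dumpsgen@clouddumps1001.wikimedia.org:/srv/dumps/xmldatadumps/public/./*/20250601 .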

For some reason, it keeps stopping after this file:

cawiki/20250601/cawiki-20250601-pages-meta-history.xml.7z
cawiki/20250601/cawiki-20250601-pages-meta-history.xml.bz2
rsync: connection unexpectedly closed (186779565598 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(232) [receiver=3.2.7]
rsync: connection unexpectedly closed (1136499 bytes received so far) [generator]
rsync error: error in rsync protocol data stream (code 12) at io.c(232) [generator=3.2.7]

It keeps happening, so I'm going to attempt to work around the issue by creating a tarball of the dumps, copying that single file across, and extracting it into the CephFS directory.
I'm running the following on clouddumps1001.

find . -maxdepth 2 -name 20250601 -exec tar rvf /srv/dumps/20250601_dumps_backup/20250601_dumps_backup.tar {} \;
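
The follow-up steps would then be roughly as below (a sketch: it assumes the find above is run from /srv/dumps/xmldatadumps/public, so the archive members are relative paths of the form ./<wiki>/20250601/..., and the staging location for the tarball on the CephFS side is an assumption).

# Pull the single tarball across to the CephFS volume (run from the sync toolbox
# pod); --partial keeps a partially transferred file so it can be resumed.
rsync --partial --stats -v dumpsgen@clouddumps1001.wikimedia.org:/srv/dumps/20250601_dumps_backup/20250601_dumps_backup.tar /mnt/dumpsdata/xmldatadumps/

# Unpack in place; the relative member paths recreate the per-wiki layout.
tar -xvf /mnt/dumpsdata/xmldatadumps/20250601_dumps_backup.tar -C /mnt/dumpsdata/xmldatadumps/public/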

This tarball generation seems to have slowed to a crawl. It was fast to start with, but it has now been sitting at 1.7 TB for a few hours.

Every 2.0s: pstree -a dumpsgen && ls -sh /srv/dumps/20250601_dumps_backup/20250601_dumps_backup.tar                                                                        clouddumps1001: Fri Jun 27 11:36:46 2025

bash
  `-find . -maxdepth 2 -name 20250601 -exec tar rvf /srv/dumps/20250601_dumps_backup/20250601_dumps_backup.tar {} ;
      `-tar rvf /srv/dumps/20250601_dumps_backup/20250601_dumps_backup.tar ./taywiki/20250601
1.7T /srv/dumps/20250601_dumps_backup/20250601_dumps_backup.tar

I've gone back to the rsync method. So far, every time it has broken on a particular file, I have used scp to copy that file across and then retried the rsync.
It's quite laborious, but hopefully it will succeed.
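
For the record, the per-file workaround looks roughly like this (a sketch, run from the sync toolbox pod, using the cawiki history file from the error above as the example):

# Copy the offending file directly; -p preserves the modification time so the
# subsequent rsync quick-check treats the file as already up to date.
scp -p dumpsgen@clouddumps1001.wikimedia.org:/srv/dumps/xmldatadumps/public/cawiki/20250601/cawiki-20250601-pages-meta-history.xml.bz2 cawiki/20250601/

# Then re-run the same rsync; files that already match are skipped.
rsync --stats -r -v --relative dumpsgen@clouddumps1001.wikimedia.org:/srv/dumps/xmldatadumps/public/./*/20250601 .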

I've also added the --partial and --append flags to the rsync command, so that it will leave partial files in place and resume them.
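
For reference, the amended command would look something like this (the flags are as described; otherwise as in the command above):

# --partial keeps partially transferred files instead of deleting them on failure;
# --append resumes such files by transferring only the missing tail.
rsync --stats -r -v --relative --partial --append dumpsgen@clouddumps1001.wikimedia.org:/srv/dumps/xmldatadumps/public/./*/20250601 .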

I believe that this is all done now.
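
A final sanity check along the lines of the counts in the task description would confirm it (a sketch, run from the sync toolbox pod against the CephFS mount):

# Should report 1008 directories and roughly 12T in total, matching clouddumps1001.
cd /mnt/dumpsdata/xmldatadumps/public
find . -maxdepth 2 -name 20250601 | wc -l
find . -maxdepth 2 -name 20250601 -exec du -ch {} + | grep total$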