
Synchronize the dumps files to the clouddumps hosts
Closed, ResolvedPublic

Description

The dumps v1 jobs will write their files on a CephFS volume. However, that volume is not publicly accessible on the internet, meaning that the files will need to be synchronized onto the clouddumps100[1-2].wikimedia.org hosts, to /srv/dumps/xmldatadumps/public/.

We probably need to set up some kind of continuous synchronization.
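As a starting point for discussion, continuous synchronization could be driven by a systemd timer/service pair along these lines. This is only a sketch: the unit names are hypothetical, and the source/destination paths are assumed from the task description rather than taken from any deployed config.

```
# /etc/systemd/system/dumps-cephfs-rsync.service (hypothetical name)
[Unit]
Description=Synchronize dumps from the CephFS volume to the clouddumps hosts

[Service]
Type=oneshot
User=dumpsgen
ExecStart=/usr/bin/rsync -a /mnt/dumpsdata/xmldatadumps/public/ dumpsgen@clouddumps1001.wikimedia.org:/srv/dumps/xmldatadumps/public/

# /etc/systemd/system/dumps-cephfs-rsync.timer (hypothetical name)
[Unit]
Description=Run the dumps rsync periodically

[Timer]
OnUnitInactiveSec=15min

[Install]
WantedBy=timers.target
```

Using `OnUnitInactiveSec` rather than a fixed calendar schedule means the next run starts a fixed interval after the previous one finishes, so overlapping rsyncs are avoided even when a run takes a long time.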

Event Timeline

brouberol triaged this task as Medium priority.

Let's look at how things are done at the moment, then we can work out how best to replicate this functionality and which bits we might be able to improve:

Firstly, I don't think that any synchronization happens on the snapshot servers themselves. It happens on the dumpsdata servers, using these puppet classes, I believe.

They set up a systemd timer called dumps-rsyncer.service on each of the dumpsdata hosts, which use the templates here:

From this command:

btullis@cumin1002:~$ sudo cumin 'A:dumps and not D{htmldumper1001.eqiad.wmnet}' 'systemctl show dumps-rsyncer.service|grep ExecStart='

...we can see that this service is currently deployed only to:

  • dumpsdata1006 for xml/sql dumps
  • dumpsdata1003 for misc dumps

dumpsdata1006 is running this command:

/usr/local/bin/rsync-via-primary.sh --do_tarball --do_rsync_xml --xmldumpsdir /data/xmldatadumps/public --xmlremotedirs dumpsdata1007.eqiad.wmnet::data/xmldatadumps/public/,clouddumps1001.wikimedia.org::data/xmldatadumps/public/,clouddumps1002.wikimedia.org::data/xmldatadumps/public/

dumpsdata1003 is running this command:

/usr/local/bin/rsync-via-primary.sh --do_rsync_misc --do_rsync_miscsubs --miscdumpsdir /data/otherdumps --miscremotedirs clouddumps1001.wikimedia.org::data/xmldatadumps/public/other/,clouddumps1002.wikimedia.org::data/xmldatadumps/public/other/ --miscsubdirs incr,categoriesrdf --miscremotesubs dumpsdata1007.eqiad.wmnet::data/otherdumps/

In addition to these, dumpsdata1007 has the profile::dumps::generation::server::dumpstatusfiles_sync profile applied.
This profile adds the following class: dumps::web::dumpstatusfiles

That causes unpack-dumpstatusfiles.sh to be executed every 5 minutes on the XML fallback host.

I think that this setup exists so that the HTML status files can be kept up to date independently of the larger dump files, and also so that the index.html files don't show links to files that haven't yet been copied from the dumpsdata servers to the clouddumps servers. This commit and T179857: Make sure rsynced dump status/html files don't contain links to files not yet copied over explain the intention a little more.

It's pretty convoluted. I wonder how much we can remove.

Ah, now I see that it is not only dumpsdata1007 that unpacks these HTML status files. The two clouddumps servers also run the same script.

btullis@cumin1002:~$ sudo cumin C:dumps::web::dumpstatusfiles
3 hosts will be targeted:
clouddumps[1001-1002].wikimedia.org,dumpsdata1007.eqiad.wmnet
DRY-RUN mode enabled, aborting

We can see from here that the /usr/local/bin/rsync-via-primary.sh script specifically excludes *.html and *.json files.

This means that these HTML and JSON files are only updated on the clouddumps hosts by means of the following processes:

  • The make_statusfiles_tarball function:
    • running on dumpsdata1006
    • as part of the rsync-via-primary.sh script
    • because the --do_tarball and --do_rsync_xml options are present.
    • This script runs in a loop, so we cannot easily see how often the tarball is generated.
  • The unpack-dumpstatusfiles.sh script:
    • running on clouddumps100[1-2]
    • every 5 minutes
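The exclusion behaviour is easy to demonstrate locally. The following is a throwaway sketch, not the real script: it uses temp directories instead of the dumps trees, and only illustrates how `--exclude='*.html' --exclude='*.json'` keeps the status files out of the main rsync pass.

```shell
# Throwaway local directories; the real script syncs between hosts,
# this only shows the filter behaviour.
src=$(mktemp -d)
dst=$(mktemp -d)
touch "$src/enwiki.xml.bz2" "$src/index.html" "$src/dumpstatus.json"

# Same style of exclusions as rsync-via-primary.sh uses
rsync -a --exclude='*.html' --exclude='*.json' "$src/" "$dst/"
```

After this runs, only enwiki.xml.bz2 is present in the destination, which is why the HTML and JSON status files need their own tarball/unpack path.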

Change #1135416 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps

https://gerrit.wikimedia.org/r/1135416

Change #1135416 merged by Btullis:

[operations/puppet@production] Add an rsync fragment to permit dse-k8s pods to sync mediawiki dumps

https://gerrit.wikimedia.org/r/1135416

bking renamed this task from Syncronize the dumps files to the clouddumps hosts to Synchronize the dumps files to the clouddumps hosts.Apr 29 2025, 3:18 PM

Change #1140664 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update dumpsgen SSH settings on clouddumps servers.

https://gerrit.wikimedia.org/r/1140664

Change #1140664 merged by Btullis:

[operations/puppet@production] Update dumpsgen SSH settings on clouddumps servers.

https://gerrit.wikimedia.org/r/1140664

It's working and the first rsync over ssh is running now.
I am running this command in the pod.

runuser@sync-pod-with-cephfs-volume:~$ rsync -av /mnt/dumpsdata/xmldatadumps/public dumpsgen@clouddumps1001.wikimedia.org:/srv/mediawiki-dumps-legacy/xmldatadumps

On the server, this is the process tree for the dumpsgen user.

btullis@clouddumps1001:/srv/mediawiki-dumps-legacy$ pstree -a dumpsgen
sshd
  └─rsync --server -vlogDtpre.iLsfxCIvu . /srv/mediawiki-dumps-legacy/xmldatadumps
      └─rsync --server -vlogDtpre.iLsfxCIvu . /srv/mediawiki-dumps-legacy/xmldatadumps

systemd --user
  └─(sd-pam)

I had to get rid of the ChrootDirectory in order to use rsync, because the rsync binary doesn't exist inside the chroot.

I am still considering whether to use just the SFTP protocol instead, with rclone copy and an sftp back-end.

But for now, rsync over ssh is working.
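If we do go the rclone route, a minimal sketch might look like the following. The remote name, key path, and destination path are assumptions for illustration, not the deployed configuration.

```
# ~/.config/rclone/rclone.conf (hypothetical remote definition)
[clouddumps1001]
type = sftp
host = clouddumps1001.wikimedia.org
user = dumpsgen
key_file = /etc/dumps/ssh/id_ed25519

# Then, roughly:
#   rclone copy /mnt/dumpsdata/xmldatadumps/public \
#       clouddumps1001:/srv/mediawiki-dumps-legacy/xmldatadumps/public
```

One advantage of the sftp back-end is that it works within the SSH subsystem, so the ChrootDirectory restriction wouldn't need to be removed the way it did for rsync.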

brouberol changed the task status from Open to In Progress.May 5 2025, 9:56 AM

Change #1141926 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] mediawiki-dumps-legacy: rename Secret key associated to private key

https://gerrit.wikimedia.org/r/1141926

Change #1141926 merged by Brouberol:

[operations/deployment-charts@master] mediawiki-dumps-legacy: rename Secret key associated to private key

https://gerrit.wikimedia.org/r/1141926

I am now having some success with the command:

runuser@sync-pod-with-cephfs-volume:/home/sync-utils$ parallel-rsync -v -ra -H dumpsgen@clouddumps1001.wikimedia.org -H dumpsgen@clouddumps1002.wikimedia.org /mnt/dumpsdata/xmldatadumps/public/ /srv/mediawiki-dumps-legacy/xmldatadumps/public

Transfer speeds are around 190 MB/s when the pod has 2 CPUs and 4GB of RAM. I think that this is workable.


We can populate a hosts file to use instead of the two -H options, which would make the command line a little nicer.

We can also use a /home/runuser/.ssh/config file to set a slightly faster default cipher, pass extra ssh options with the -S argument, or extra rsync options with the -X option.
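Sketching both of those files, with contents assumed rather than taken from the final patch (the file names follow the later ssh_config/rsync_targets change, but the cipher choice is only a suggestion):

```
# /home/sync-utils/rsync_targets (hypothetical contents)
# used as: parallel-rsync -h /home/sync-utils/rsync_targets ...
dumpsgen@clouddumps1001.wikimedia.org
dumpsgen@clouddumps1002.wikimedia.org

# /home/runuser/.ssh/config (hypothetical contents)
Host clouddumps*.wikimedia.org
    User dumpsgen
    # aes128-gcm tends to be faster than the default cipher on
    # hosts with AES-NI; worth benchmarking before committing to it.
    Ciphers aes128-gcm@openssh.com
```

Putting the user into ssh_config also means the hosts file could list bare hostnames, keeping the two files independent.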

Change #1143060 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy

https://gerrit.wikimedia.org/r/1143060

Change #1143060 merged by jenkins-bot:

[operations/deployment-charts@master] Add ssh_config and an rsync_targets files to mediawiki-dumps-legacy

https://gerrit.wikimedia.org/r/1143060

Change #1143101 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] mediawiki-dumps-legacy: fix typo

https://gerrit.wikimedia.org/r/1143101

Change #1143103 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Fix typo in mediawiki-dumps-legacy

https://gerrit.wikimedia.org/r/1143103

Change #1143103 abandoned by Btullis:

[operations/deployment-charts@master] Fix typo in mediawiki-dumps-legacy

Reason:

duplicate

https://gerrit.wikimedia.org/r/1143103

Change #1143101 merged by Brouberol:

[operations/deployment-charts@master] mediawiki-dumps-legacy: fix typo

https://gerrit.wikimedia.org/r/1143101

Change #1143106 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] Fix typo

https://gerrit.wikimedia.org/r/1143106

Change #1143106 merged by Brouberol:

[operations/deployment-charts@master] Fix typo

https://gerrit.wikimedia.org/r/1143106

Change #1145201 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update the managed airflow temp directory for dumps on k8s

https://gerrit.wikimedia.org/r/1145201

Change #1145201 merged by Btullis:

[operations/puppet@production] Update the managed airflow temp directory for dumps on k8s

https://gerrit.wikimedia.org/r/1145201

Change #1145874 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] clouddumps: Manage directories beneath /srv/dumps/xmldatadumps_airflow_temp

https://gerrit.wikimedia.org/r/1145874

Change #1145874 merged by Btullis:

[operations/puppet@production] clouddumps: Manage directories beneath /srv/dumps/xmldatadumps_airflow_temp

https://gerrit.wikimedia.org/r/1145874