
Provide dumps using bittorrent
Open, MediumPublicFeature

Description

Without citing stats: these huge files demand multisourcing, either over HTTP using mirrors or, even better, over BitTorrent. I hear this can dramatically reduce bandwidth demand.

BitTorrent is particularly nice because files can be selectively downloaded from within the bundle. You could provide a single torrent containing all outputs from a particular wiki snapshot date.


Version: unspecified
Severity: enhancement

Details

Reference
bz27653

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:30 PM
bzimport set Reference to bz27653.

http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps

As for the BitTorrent part, that would be somewhat feasible, hosting the tracker at WMF, but seeding from WMF infrastructure might be more of an issue.

This is not an area I know much about, but what is the objection to seeding? I imagine you will get the maximum benefit by using an open tracker which is already tied into search services. And if your mirrors agree to use this protocol, they would provide a natural pool of seeders, even before they have finished replicating.

One major downside of the torrent idea is that it would be inefficient to offer incomplete dumps, because the .torrent would have to be changed as data grows. Unless there is a workaround, it would only make sense to wait until the dump is completed, by which point the data has aged...

Once the dump is available there is nothing preventing someone in the community or several someones from setting up a torrent of these files, and I encourage folks to do so (as has been done a number of times in the past).

Waiting until the dump is completed before adding it to a torrent is a good idea in all cases; only then are we sure that the files are intact and worth your while to download.

Folks that have talked with us about setting up a mirror site have expressed a preference for rsync, and that works best for us for distributing a subset of the dumps for mirroring.

Per Ariel's comment I am closing this bug. Either set up your own torrent or ask for rsync access.

Legoktm subscribed.

I'm re-opening this task because I think providing torrents is something that should be reasonably easy to integrate, and will provide enough benefits to users and the dumps infrastructure to be worth the amount of work.

I've been running https://tools.wmflabs.org/dump-torrents/ for a little over a year now; it has been creating torrent files for all the dumps mirrored to Toolforge. However, it has started to run into performance problems with the NFS setup, so I've paused it for now - that is the main instigator in suggesting a move to production.

With web seeds, downloaders will automatically download from multiple mirrors, distributing the required bandwidth across hosters and getting faster speeds. In addition, torrents have built-in integrity checking. I doubt many people will reseed torrents faster than any of the web seeds, but if it happens, it would be a nice bonus.
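To illustrate the built-in integrity checking: per BEP 3, a .torrent's "pieces" field is a flat concatenation of 20-byte SHA-1 digests, one per piece, so a client can verify every piece as it arrives, whether it came from a peer or a web seed. A minimal sketch of that check:

```python
import hashlib

def piece_ok(piece_bytes, index, pieces_blob):
    """Verify one downloaded piece against the .torrent's 'pieces'
    field (a concatenation of 20-byte SHA-1 digests, per BEP 3)."""
    expected = pieces_blob[index * 20:(index + 1) * 20]
    return hashlib.sha1(piece_bytes).digest() == expected
```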

It's also super trivial (read: no major CPU resources needed) to add new web seeds/trackers to existing torrents once they've been created (source). Here is the mktorrent configuration I've been using, which seems to work reasonably well for most files, though we might want to consider adjusting the piece length based on the size of the dump.
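As a sketch of what adjusting the piece length could look like (mktorrent's -l flag takes log2 of the piece size in bytes; the target piece count and the 256 KiB to 16 MiB bounds below are illustrative assumptions, not values from my actual configuration):

```python
import math

def piece_length_exponent(file_size, target_pieces=1500,
                          min_exp=18, max_exp=24):
    """Pick a value for mktorrent's -l flag (log2 of the piece size)
    so a dump of `file_size` bytes ends up with roughly `target_pieces`
    pieces, clamped between 2^18 (256 KiB) and 2^24 (16 MiB).
    All three defaults are illustrative, not tuned values."""
    if file_size <= 0:
        return min_exp
    exp = math.ceil(math.log2(file_size / target_pieces))
    return max(min_exp, min(max_exp, exp))
```

Under these assumptions a 20 GiB dump would get -l 24 (16 MiB pieces), while a 100 MiB dump would stay at the 256 KiB floor.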

I'm happy to work on integrating my existing code into the dumps infrastructure given a few pointers.

tl;dr: the win from torrents is automatically using multiple web seeds and distribution of load, not necessarily P2P.

This shouldn't run on the snapshot (dumps-generating) hosts; if it were to run anywhere it would run on the web server. Looping in @Bstorm who is the point person for the labstore boxes (which handle web service) now.

Can I hear a little about the performance problems you have been running into?

This shouldn't run on the snapshot (dumps-generating) hosts; if it were to run anywhere it would run on the web server.

Hmm, why wouldn't those hosts be the right place to call mktorrent? It can be CPU intensive, so I don't think running it on a web server is a good idea.
(I have very little understanding of how the actual dumps generating process works fwiw)

Can I hear a little about the performance problems you have been running into?

I know that it was overloading NFS, @Bstorm is the one who has more details on the specifics :)

Yep, it shot the NFS server up to a load avg of 20 all on its own. It didn't pin the CPU of the NFS server itself. The load was all network and IO related.

This shouldn't run on the snapshot (dumps-generating) hosts; if it were to run anywhere it would run on the web server.

Hmm, why wouldn't those hosts be the right place to call mktorrent? It can be CPU intensive, so I don't think running it on a web server is a good idea.
(I have very little understanding of how the actual dumps generating process works fwiw)

Those hosts write/read dumps to NFS, so running it there defeats the purpose of your move. The dumpsdata boxes (which provide the NFS filesystems) have lighter-weight specs than the labstore boxes, too.

Aklapper added subscribers: ArielGlenn, Aklapper.

@ArielGlenn: Hi, I'm resetting the task assignee due to inactivity. Please feel free to reclaim this task if you plan to work on this - it would be welcome! Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for more information - thanks!

Aklapper changed the subtype of this task from "Task" to "Feature Request".Feb 4 2022, 12:24 PM
Aklapper removed a subscriber: Tfinc.

So this seems not possible as long as the only way to access/mount dumps is through NFS, doesn't it? Unless somebody obtains a WMCloud VPS with a lot of space to cache them locally, but that seems not feasible to me.

Subscribing to this as I'm the person currently taking on making torrents of the non-English-language Wikipedia dumps and cloning them to the Internet Archive. I've often wondered why WMF doesn't take on making torrents in-house, and from this ticket I'm getting the impression it's mainly to do with server resources and an unfortunate consequence of the filesystem(s) WMF employs on the back-end.

FWIW: my first set of torrents seems to have been well received, and the data on Academic Torrents shows the primary interest is in the Spanish, German, Russian and Polish dumps, respectively. I've got 18 languages lined up for this month's batch (which I do quarterly), and these will be uploaded to IA when I have access to a more appropriate connection than the cellular one I use at home. Progress of the IA uploads can be followed on my Wikimedia userpage.

If it helps, my process for handling and torrenting/seeding the dumps is as follows:

  1. Download the XXwiki-YYYYMMDD-pages-articles-multistream checksums from the WMF dump server, and the index files and the dumps themselves from a mirror (usually mirror.accum.se), then verify that the downloaded files match the checksums from WMF.
  2. Upload all of these items to new collections on IA, noting that IA supports both web-seeding and BitTorrent distribution¹.
  3. Create new torrents (via transmission-create) with both official WMF mirrors and the IA collections specified as web seeds. Most torrent clients are designed to always prefer P2P download over web seeds (which exist as a fall-back for any blocks not available from an accessible peer), making these a convenient way of combining both into a single point of access.
  4. Upload the torrents to Academic Torrents (which has good torrent policing and, unlike most torrent-sharing services, is accessible in both the UK and US because it doesn't interfere with the moneymen) and to other locations where they can be lawfully shared and are likely to be useful. I'm also intending to start mirroring the torrent files in my own Wikipedia userspace when the time/motivation presents itself. (It's a rather exhausting process, and my health isn't the best.)
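The verification in step 1 above can be sketched as follows (assuming the usual "&lt;hex digest&gt;  &lt;filename&gt;" checksum-listing format; the file names used here are placeholders, not real dump names):

```python
import hashlib
from pathlib import Path

def load_checksums(checksum_file):
    """Parse a '<hex>  <filename>' listing like the sha1sums files
    published alongside each dump into a {filename: digest} dict."""
    sums = {}
    for line in Path(checksum_file).read_text().splitlines():
        if line.strip():
            digest, name = line.split(maxsplit=1)
            sums[name.strip()] = digest.lower()
    return sums

def file_matches(path, expected_hex, algo="sha1", chunk=1 << 20):
    """Hash the file in 1 MiB chunks so multi-gigabyte dumps
    never need to fit in RAM."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest() == expected_hex.lower()
```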

¹ - An unfortunate hiccup with IA's torrent setup is that any change in collection metadata (title, description, etc.) re-generates the torrent with a new InfoHash, meaning that torrents downloaded after the change won't see or download from clients that grabbed the torrent file before the change, making IA-generated torrents unsuitable for sharing until a collection is definitely stable and theoretically „locked“.
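The footnote above comes down to how the infohash is computed: it is the SHA-1 of the bencoded "info" dictionary only, so edits outside "info" (trackers, web seeds, comments) leave the swarm intact, while a regeneration that touches "info" splits it. A minimal demonstration on a toy metainfo dict (not a real dump's):

```python
import hashlib

def bencode(value):
    """Minimal bencoder (BEP 3) for ints, bytes, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        return (b"d"
                + b"".join(bencode(k) + bencode(v)
                           for k, v in sorted(value.items()))
                + b"e")
    raise TypeError(f"cannot bencode {type(value)}")

def infohash(metainfo):
    """The infohash covers ONLY the bencoded 'info' dict, never
    the top-level keys around it."""
    return hashlib.sha1(bencode(metainfo[b"info"])).hexdigest()
```

Adding a top-level "url-list" (web seed) key leaves the infohash unchanged, while renaming anything inside "info" produces a different swarm.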

This month's torrents are intended to bundle the hash files, index, dump, and a user's guide (presently a WiP) into a single torrent for convenience. I have some experience with low-level data handling and data-structure alignment, so mine should be able to avoid padding files and other inconveniences associated with multi-file torrents.

It should be added that I achieve the above primarily through Linux terminals on older computers (and yes, transmission-create takes a little while to create torrent files on my setups), plus a bit of sneakernet and a friend's unlimited cable connection for the larger data transfers. I'm intending to automate this as time goes on, and from where I'm standing it looks like the above process should reasonably be achievable on a Raspberry Pi with a large enough SD card in it.
ᛒᚱᛟᚴᛂᚾ ᚢᛁᚴᛁᚾᚷ (Broken Viking)