
Image tarball dumps on are not being generated
Open, MediumPublic


Image tarball dumps were being generated on a somewhat monthly basis from April 2012 to December 2012. The last complete set was in December 2012 [1]. A January 2013 post [2] indicated a hardware issue; a July 2013 post [3] indicated that the hardware issue was resolved, but that further progress required a new setup due to the recent Wikimedia datacenter move.


Version: unspecified
Severity: normal



Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 2:01 AM
bzimport set Reference to bz51001.

Ariel: Do you plan to take a look at this?

AH yes, sorry for not responding; I'm working on scripts that are independent of the specific media backend to handle the rsync.

(In reply to Ariel T. Glenn from comment #2)

I'm working on scripts that are independent of the specific media backend

Any news? :)

Nemo_bis changed the task status from Open to Stalled.Apr 9 2015, 7:16 AM
Nemo_bis set Security to None.

Is this still blocked on the lack of a rsync daemon for Your.Org to use?

Aklapper renamed this task from Image tarball dumps are not being generated to Image tarball dumps on are not being generated.Nov 18 2016, 4:03 PM

Is this specifically about the tarballs or is similarly affected? Given our tendency to lose image files (see T153565) it's pretty scary if there is no external backup for the files uploaded in the last 3 years.

These files are in the Swift filesystem so there are multiple copies of each file that is uploaded. There are no external copies of media uploaded since the move from a flat filesystem to Swift, afaik.

Swift copies are good for hardware errors but when there is a bug in the application code, all the copies get deleted (or, more likely, renamed to something that's hard to find).

I don't know if we'll bring back the tarballs, but I do have a stealth project to get the rsyncable directory structure updated again. Expect it to take a while. The script (in progress) lives off-site since it would never be run on WMF servers:

This would only sync media actually in use on the projects, and it will be slow to catch up once it's written and running.
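To make the reconciliation concrete, here is a minimal sketch of how a mirror could decide what to fetch and what to drop, given a published list of in-use media and a listing of what it already holds. This is hypothetical illustration, not the actual off-site script; the `plan_sync` name and file names are invented:

```python
# Hypothetical sketch: compute per-project download/delete plans as a set
# difference between the published "in use" list and the mirror's holdings.

def plan_sync(in_use, mirrored):
    """Return (to_download, to_delete) as sorted lists of file names."""
    in_use = set(in_use)
    mirrored = set(mirrored)
    to_download = sorted(in_use - mirrored)   # in use but not yet mirrored
    to_delete = sorted(mirrored - in_use)     # mirrored but no longer in use
    return to_download, to_delete

# Illustrative run with made-up file names:
dl, rm = plan_sync(["A.jpg", "B.png", "C.svg"], ["B.png", "Old.gif"])
# dl == ["A.jpg", "C.svg"], rm == ["Old.gif"]
```

Since the published lists cover both locally uploaded images and images hosted on commons but used by the project, the same comparison works at the public mirror end without any extra access to WMF infrastructure.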

ArielGlenn changed the task status from Stalled to Open.Jun 26 2019, 1:49 PM
ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.

Some notes on architecture of the media sync scripts above:

  • The plan is that these would only ever run on some primary mirror, so that other mirrors and media end users could grab from there.
  • No local additional copy of media would be kept on Wikimedia servers. Swift already has enough copies.
  • Only original (unscaled) versions of the media would be provided, at least at first.
  • Only media in use on a Wikimedia project would be provided, so the bulk of what is uploaded to commons would not be synced. Plans should be made for a public mirror of all of commons but that's beyond the scope of these scripts.
  • Making lists of which files to delete locally and which to download, on a per-project basis, is something that can easily be done at the public mirror end, since we publish periodic lists of images locally uploaded to projects and of images in use on the projects but housed on commons.
  • It's likely much better to request these files from our caches than directly from the Swift backend, since there's the hope that some portion will be cached already. In particular, once we are caught up with the backlog (something that can be done slowly over time), new requests will be for newly uploaded files, which might already be cached; I should check with Traffic about that.
  • The default wait time between retrievals is 5 seconds (configurable), with the plan to run this serially only, as a single instance. I should check with the Traffic folks about that too; is this being overly cautious? Not cautious enough? The wait time is tunable, and we could run in parallel by starting up separate instances per wiki if desired. In past discussions one process and a short wait has seemed acceptable, but it's best to check in again now that this idea is being revived.
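The serial, rate-limited fetch loop described above could look something like the following. This is a hedged sketch only: `fetch_serial` and `UPLOAD_BASE` are invented names (the public upload URL is assumed as the cache-facing endpoint, and the injectable `opener` is there purely to make the sketch testable), not the in-progress script:

```python
# Hypothetical sketch of the serial fetch loop: one instance, one request
# at a time, with a configurable pause (5 seconds by default) between
# retrievals, fetching originals via the public caches rather than
# directly from the Swift backend.
import os
import time
import urllib.request

UPLOAD_BASE = "https://upload.wikimedia.org"  # assumed cache-facing endpoint

def fetch_serial(paths, dest_dir, wait=5.0, opener=urllib.request.urlopen):
    """Fetch each media path in turn, sleeping `wait` seconds between requests."""
    for path in paths:
        url = f"{UPLOAD_BASE}/{path}"
        with opener(url) as resp:
            data = resp.read()
        # Store under the bare file name; a real script would reproduce
        # the rsyncable directory layout instead.
        dest = os.path.join(dest_dir, path.rsplit("/", 1)[-1])
        with open(dest, "wb") as out:
            out.write(data)
        time.sleep(wait)  # be gentle on the caches; tune per Traffic's advice
```

Running one instance per wiki, each with its own `paths` list, is how the optional parallelism mentioned above would fall out of this design without changing the loop itself.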

There is an ongoing discussion about setting up offline backups of media originals at T262669. While public dumps are not a priority of that project (backups are), it would be silly not to consider the possibility of also generating them with a similar workflow. @ArielGlenn was invited to the discussion there, and other MediaWiki stakeholders (media, network, backups, operations, dumps) will be asked as well.

If media tarballs become a real possibility, I may ask for broader input on what would be good exporting formats for reuse, but we are not yet there.

I wanted to update here that progress on media resiliency is ongoing, although no concrete promise can be made yet; this is a long-term project.