
Image tarball dumps on your.org are not being generated
Open, Normal, Public

Description

Image tarball dumps were being generated at http://ftpmirror.your.org/ roughly monthly from April 2012 to December 2012; the last complete set is from December 2012 [1]. A January 2013 post [2] indicated a hardware issue at your.org. A July 2013 post [3] indicated that the hardware issue had been resolved, but that further progress required a new setup due to the recent Wikimedia datacenter move.

[1] http://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121201/
[2] http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000665.html
[3] http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-July/000861.html


Version: unspecified
Severity: normal

Details

Reference
bz51001

Event Timeline

bzimport raised the priority of this task to Normal. Nov 22 2014, 2:01 AM
bzimport set Reference to bz51001.
gnosygnu created this task. Jul 9 2013, 3:06 AM

Ariel: Do you plan to take a look at this?

Ah yes, sorry for not responding; I'm working on scripts that are independent of the specific media backend to handle the rsync.

(In reply to Ariel T. Glenn from comment #2)

I'm working on scripts that are independent of the specific media backend

Any news? :)

Hydriz added a subscriber: Hydriz.
Nemo_bis changed the task status from Open to Stalled. Apr 9 2015, 7:16 AM
Nemo_bis set Security to None.

Is this still blocked on the lack of an rsync daemon for Your.Org to use?

555 added a subscriber: 555. Apr 29 2015, 4:47 PM
Restricted Application added a subscriber: Matanya. Oct 17 2015, 7:49 AM
Aklapper renamed this task from "Image tarball dumps are not being generated" to "Image tarball dumps on your.org are not being generated". Nov 18 2016, 4:03 PM
Tgr added a subscriber: Tgr. Dec 18 2016, 12:55 AM

Is this specifically about the tarballs, or is http://ftpmirror.your.org/pub/wikimedia/images/ similarly affected? Given our tendency to lose image files (see T153565), it's pretty scary if there is no external backup for the files uploaded in the last 3 years.

These files are in the Swift filesystem, so there are multiple copies of each uploaded file. There are no external copies of media uploaded since the move from a flat filesystem to Swift, afaik.

Tgr added a comment. Dec 19 2016, 7:18 PM

Swift copies are good for hardware errors but when there is a bug in the application code, all the copies get deleted (or, more likely, renamed to something that's hard to find).

I don't know if we'll bring back the tarballs, but I do have a stealth project to get the rsyncable directory structure updated again. Expect it to take a while. The script (in progress) lives off-site, since it would never be run on WMF servers: https://github.com/apergos/mw_media_sync

This would only sync media actually in use on the projects, and it will be slow to catch up once it's written and running.
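For context on the directory structure being synced: MediaWiki stores original files under a two-level hashed directory derived from the MD5 of the underscore-normalized title, so a sync script can compute both the local path and the public URL from a title alone. A minimal sketch in Python (an illustration of the layout, not the actual mw_media_sync code; a real version would also need to percent-encode the URL):

```python
import hashlib

def media_paths(title, project="commons"):
    """Return (relative path, public URL) for an original file,
    using MediaWiki's standard MD5-based hashed directory layout:
    <first hex char>/<first two hex chars>/<filename>."""
    name = title.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    rel = f"{digest[0]}/{digest[:2]}/{name}"
    url = f"https://upload.wikimedia.org/wikipedia/{project}/{rel}"
    return rel, url

# e.g. media_paths("Example.jpg") yields the hashed relative path and
# the matching upload.wikimedia.org URL for the original file.
```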

ArielGlenn changed the task status from Stalled to Open. Jun 26 2019, 1:49 PM
ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.
matmarex removed a subscriber: matmarex. Jun 26 2019, 5:17 PM

Some notes on the architecture of the media sync scripts mentioned above:

  • The plan is that these would only ever run on some primary mirror, so that other mirrors and media end users could grab from there.
  • No local additional copy of media would be kept on Wikimedia servers. Swift already has enough copies.
  • Only original (unscaled) versions of the media would be provided, at least at first.
  • Only media in use on a Wikimedia project would be provided, so the bulk of what is uploaded to commons would not be synced. Plans should be made for a public mirror of all of commons, but that's beyond the scope of these scripts.
  • Making per-project lists of which files to download and which to delete locally can easily be done at the public mirror end, since we publish periodic lists of images uploaded locally to each project and of images in use on each project but housed on commons (see the first sketch below).
  • It's likely much better to request these files from our caches than directly from the Swift backend, since there's the hope that some portion will be cached already. In particular, once we are caught up with the backlog (something that can be done slowly over time), new requests will be for newly uploaded files, which might still be in the cache; I should check with the Traffic folks about that.
  • The default wait time between retrievals is 5 seconds (configurable), and the plan is to run a single instance in serial only (see the second sketch below). I should check with the Traffic folks about that too: is this overly cautious, or not cautious enough? The wait time is tunable, and we could run in parallel by starting separate instances per wiki if desired. In past discussions one process with a short wait has seemed acceptable, but it's best to check in again now that this idea is being revived.
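To make the list-making bullet concrete, a minimal sketch (hypothetical function and argument names; the real inputs would be the periodic per-project lists mentioned above):

```python
def compute_sync_lists(in_use, mirrored):
    """Given the titles listed as in use on a project (from the
    published per-project lists) and the titles the mirror already
    holds, return (titles to download, titles to delete)."""
    in_use, mirrored = set(in_use), set(mirrored)
    return sorted(in_use - mirrored), sorted(mirrored - in_use)
```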
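And a sketch of the serial, rate-limited retrieval described in the last bullet (assumes the third-party requests library; `items` would be (URL, destination path) pairs built as in the earlier path sketch):

```python
import os
import time
import requests

def fetch_serial(items, wait=5.0):
    """Fetch (url, dest) pairs one at a time through the public
    caches, sleeping `wait` seconds between requests (5 seconds by
    default, as proposed above). Strictly serial; parallelism, if
    wanted, would come from separate per-wiki instances."""
    session = requests.Session()
    for url, dest in items:
        dirname = os.path.dirname(dest)
        if dirname:
            os.makedirs(dirname, exist_ok=True)
        resp = session.get(url, timeout=60)
        if resp.status_code == 200:
            with open(dest, "wb") as f:
                f.write(resp.content)
        time.sleep(wait)
```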