Background
As part of creating, and regularly keeping up to date, a backup of media files (currently stored in Swift), we need the backup processor to know what to iterate. The raw hierarchy in which files are stored in Swift is not adequate for this on its own (e.g. we can't just blindly iterate all relevant Swift containers and sync them to a backup).
The way MediaWiki uses Swift today involves files often physically moving around on disk to reflect certain changes in state. For example, when a file title is renamed, the file is also moved on disk. The "latest" revision of a file is stored at a canonical/unversioned file path (e.g. a/ab/Foo.jpg); all previous revisions of a file are stored under an "archive" file path with a version timestamp (e.g. archive/a/ab/20110401090000!Foo.jpg); and if a file is deleted/hidden from public view, it moves to yet another file path under a "deleted" subdirectory.
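The hashed layout above can be sketched as follows. This is a minimal illustration only: MediaWiki derives the directory prefix from the MD5 of the file title, but the real logic lives in MediaWiki's FileRepo/FileBackend classes and additionally handles character normalisation, deleted-file paths, and per-wiki configuration.

```python
import hashlib

def hash_path(title: str) -> str:
    """MediaWiki-style hashed directory prefix: the first one and
    first two hex chars of the MD5 of the file title (e.g. "a/ab")."""
    h = hashlib.md5(title.encode("utf-8")).hexdigest()
    return f"{h[0]}/{h[:2]}"

def public_path(title: str) -> str:
    # Canonical/unversioned path of the latest revision.
    return f"{hash_path(title)}/{title}"

def archive_path(title: str, timestamp: str) -> str:
    # Older revisions carry an archive timestamp prefix before the title.
    return f"archive/{hash_path(title)}/{timestamp}!{title}"

print(public_path("Foo.jpg"))
print(archive_path("Foo.jpg", "20110401090000"))
```

Note that a title rename changes the MD5 hash and therefore the whole path, which is exactly why these physical paths are not stable backup identifiers.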
The direction that has been proposed, to make the backup process efficient and able to update incrementally and often, is to normalize these with a minimal best-effort approach so that, within the backup, the actual file binaries are mostly stable and don't move. This, together with the absence of much state tracking in this area, means it seems simplest to let MW iterate the file database (instead of iterating Swift directly) and have it dictate what to back up from where, giving each file a reasonably stable identifier so we can recognise files we've backed up before.
See also:
- T262668: WMF media storage must be adequately backed up
- Google Doc (WMF-restricted)
Outcome
Develop a CLI maintenance script for MediaWiki that supports at least the Swift file backend and produces a standard output stream of lines containing:
- some reasonably stable ID (e.g. wiki ID + file title + file sha1 + upload timestamp).
- the path/url to download the file from in Swift.
- whether to encrypt (e.g. is it a private wiki).
It would do this for all public file revisions (the image and oldimage tables) and for "deleted" file revisions (the filearchive table).
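The output contract could be sketched as follows. This is an assumption-laden illustration in Python (the actual script would be a PHP maintenance script), and the record fields, ID separator, and tab-separated line format are hypothetical choices, not a decided format:

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass
class FileRevision:
    # Hypothetical record shape; the real script would read these
    # fields from the image, oldimage, and filearchive tables.
    wiki_id: str
    title: str
    sha1: str
    timestamp: str    # upload timestamp, e.g. "20110401090000"
    swift_url: str    # where to download the binary from
    is_private: bool  # e.g. the wiki is private, so encrypt the backup

def backup_lines(revs: Iterable[FileRevision]) -> Iterator[str]:
    """Emit one line per file revision: a reasonably stable ID,
    the Swift path/URL, and whether to encrypt."""
    for r in revs:
        stable_id = f"{r.wiki_id}:{r.title}:{r.sha1}:{r.timestamp}"
        yield "\t".join([stable_id, r.swift_url, "1" if r.is_private else "0"])

rev = FileRevision("testwiki", "Foo.jpg", "abc123", "20110401090000",
                   "mwstore://local-swift/local-public/a/ab/Foo.jpg", False)
print(next(backup_lines([rev])))
```

Because the stable ID is built from logical properties (wiki, title, hash, timestamp) rather than the physical Swift path, the backup processor can recognise a previously backed-up binary even after the file moves on disk.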
This could work as a re-usable script in MW core, which the backup processor would invoke via WMF's foreachwiki utility; or it could be a WMF-specific script in the WikimediaMaintenance extension that itself iterates all wikis.