
Develop maintenance script for enumerating Swift media files from MediaWiki (for backup processing)
Open, Medium, Public

Description

Background

As part of creating and regularly keeping up to date a backup of media files (currently in Swift), we need the backup processor to know what to iterate. It is understood that the raw hierarchy with which files are stored in Swift is not adequate for this alone (e.g. we can't just blindly iterate all relevant Swift containers and sync to a backup).

The way MediaWiki uses Swift today involves files often physically moving around on disk to reflect certain changes in state. For example, when a file title is renamed, we also move the file on disk. The "latest" revision of a file is stored at a canonical/unversioned file path (e.g. a/ab/Foo.jpg), all previous revisions are stored under an "archive" file path with a version timestamp (e.g. archive/a/ab/20110401090000!Foo.jpg), and if a file is deleted/hidden from public view it moves to yet another file path under a "deleted" subdirectory.
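For illustration, here is a minimal sketch (Python, not MediaWiki's actual PHP code) of the public and archive layout described above, assuming the standard two-level hash directories derived from the MD5 of the underscored title, as MediaWiki uses with hashed upload directories; the "deleted" zone uses a different, content-hash-based layout and is omitted.

```python
# Minimal sketch of the hashed layout described above; not MediaWiki code.
import hashlib

def hash_path(title: str) -> str:
    """Return the '<x>/<xy>/' directory prefix derived from the MD5 of the title."""
    digest = hashlib.md5(title.replace(" ", "_").encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}/"

def public_path(title: str) -> str:
    # The latest revision lives at the canonical, unversioned path.
    return hash_path(title) + title

def archive_path(title: str, timestamp: str) -> str:
    # Old revisions keep the same hash prefix, under archive/, with a
    # '<timestamp>!' prefix on the file name.
    return "archive/" + hash_path(title) + f"{timestamp}!{title}"

print(public_path("Foo.jpg"))                     # something like a/ab/Foo.jpg
print(archive_path("Foo.jpg", "20110401090000"))  # something like archive/a/ab/20110401090000!Foo.jpg
```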

The direction that has been proposed, to make the backup process efficient and able to update incrementally and often, is to normalize these with a minimal best-effort approach, so that within the backup the actual file binaries are mostly stable and don't move. This, together with the absence of much state tracking in this area, means it seems simplest to let MW iterate the file database (instead of iterating Swift directly), have it dictate what to back up from where, and have it give a reasonably stable identifier to recognise files we've backed up before.

See also:

Outcome

Develop a CLI maintenance script for MediaWiki that supports at least the Swift file backend, and will produce a standard output stream of lines that contain:

  • a reasonably stable id (e.g. wiki ID + file title + file sha1 + upload timestamp).
  • the path/url to download the file from in Swift.
  • whether to encrypt (e.g. is it a private wiki).

It would do this for all public file revisions (image and oldimage), and "deleted" file revisions (filearchive).
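For illustration only, a sketch of what such output lines could look like; the field order, the tab separator, the sha1 value and the Swift container/path below are all made-up examples, not a decided format.

```python
# Illustrative only: one possible stdout line format for the maintenance script.
import sys

def emit(wiki, title, sha1, timestamp, swift_path, private):
    # "Reasonably stable id" built from wiki ID + file title + sha1 + upload timestamp.
    stable_id = f"{wiki}:{title}:{sha1}:{timestamp}"
    encrypt = "encrypt" if private else "plain"   # e.g. private wikis get encrypted
    sys.stdout.write("\t".join([stable_id, swift_path, encrypt]) + "\n")

# Hypothetical example row (container name and sha1 are made up):
emit("commonswiki", "Foo.jpg", "abc0123456789", "20110401090000",
     "wikipedia-commons-local-public/a/ab/Foo.jpg", False)
```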

This could work as a re-usable script in MW core, which the backup processor could invoke via WMF's foreachwiki utility. Or it could be a WMF-specific script in the WikimediaMaintenance extension that also iterates all wikis.

Event Timeline

Tagging both SDE and PE. I believe in conversations so far this has been associated with PET, but I can also see this being a good opportunity for knowledge transfer and pairing. The outcome is fairly simple and it would be a good exercise in reviewing how MediaWiki stores multimedia files, and the overall model behind FileRepo/FileBackend/SwiftFileBackend in MW.

The script might take a while to run on Commons. Any plans for slicing it or otherwise multiprocessing?


Unless php mw functions are really underperformant, I wonder why? Querying all the info with a custom db query on Commonswiki took me 5-15 minutes:

https://gerrit.wikimedia.org/r/c/operations/software/wmfbackups/+/637769/1/wmfbackups/media/MySQLMedia.py#48

We reached the conclusion that querying the db or swift itself was not worth it, and that it needed to go through PHP/the application layer, because there is just too much MW-specific logic in how the database and swift are used to try to rebuild it outside of MediaWiki (configuration, wikis, large wikis vs small wikis, internal structure and location). Of course, 5-15 minute queries would be too long for web requests, but similar queries happen for dumps on slow dbs, and we can even -if needed- use dedicated mw application servers and databases (the mysql backup ones), so I don't see it as a huge issue unless I am missing something.

Of course we can batch the output: you can see in the scripts that while I query the whole image list, I process it in batches of 1000, so very little memory is used. If you think for some reason multiple queries are better, we can batch by title within a transaction (sadly, there is no good numeric index to batch on).
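A minimal sketch of that title-based batching, assuming a plain DB-API cursor on the wiki database and keyset pagination on img_name; this is not the actual wmfbackups code.

```python
# Batch by title since there is no good numeric index to batch on.
def iterate_images(cur, batch_size=1000):
    last_name = ""
    while True:
        cur.execute(
            "SELECT img_name, img_sha1, img_timestamp FROM image "
            "WHERE img_name > %s ORDER BY img_name LIMIT %s",
            (last_name, batch_size),
        )
        rows = cur.fetchall()
        if not rows:
            return
        yield rows                 # one batch of up to batch_size rows
        last_name = rows[-1][0]    # resume after the last title seen
```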

Given the 5-15 minute query time, I wasn't intending to multiprocess the listing output: just output it serially to stdout, and it will be read in batches through an OS pipe buffer. For database reads, either a single query with a small cursor buffer or multiple queries can be done; that is an implementation detail. BTW, despite the image table being quite big, the complete list of images for Commons would easily fit into memory (even if we don't need it all in memory).
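A sketch of the consumer side under those assumptions: the command line and the process() handler are hypothetical, and it only shows reading the serial stdout through an OS pipe in batches so neither side holds the full list in memory.

```python
import subprocess

def process(batch):
    # Placeholder downstream handler; the real backup logic lives elsewhere.
    print(f"got {len(batch)} entries")

# Hypothetical invocation of the (not yet written) maintenance script.
proc = subprocess.Popen(
    ["mwscript", "listMediaFilesForBackup.php", "--wiki=commonswiki"],
    stdout=subprocess.PIPE, text=True)

batch = []
for line in proc.stdout:
    batch.append(line.rstrip("\n").split("\t"))
    if len(batch) >= 1000:
        process(batch)
        batch = []
if batch:
    process(batch)
proc.wait()
```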

Backup implementation (the actual image download) will be a later problem not in scope of this ticket- and of course that would be multithreaded, but resolved outside of MediaWiki.

See the Python equivalent script, for reference, on how I yield on every batch of 1000 rows, so I don't store the whole list in memory. The MySQL server also uses a buffer to return results without using all its memory, so neither client nor server ever holds the full thing in memory to output it all.
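A sketch of the producer side, assuming MySQLdb with a server-side cursor so rows are streamed and fetched 1000 at a time; this is not the actual MySQLMedia.py code, just the shape of the idea.

```python
import MySQLdb
import MySQLdb.cursors

def list_images(host, db):
    # SSCursor streams results instead of buffering them all client-side.
    conn = MySQLdb.connect(host=host, db=db,
                           cursorclass=MySQLdb.cursors.SSCursor)
    cur = conn.cursor()
    cur.execute("SELECT img_name, img_sha1, img_timestamp FROM image")
    while True:
        rows = cur.fetchmany(1000)
        if not rows:
            break
        yield rows   # one batch of up to 1000 rows
    conn.close()
```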

I think the confusion would be that this would be a bad idea to expose as a public API without app-level batching (à la iterators), and that is 100% true, but note this is intended just as an internal, read-only maintenance script (think dump scripts), not a public API. I implemented a mw-api (serial, as it would otherwise be too disruptive) download script: https://gerrit.wikimedia.org/r/c/operations/software/wmfbackups/+/636007/1/wmfbackups/MediaBackup.py

But it had a few issues (aside from the naive download procedure):

  1. It didn't provide archived and deleted images
  2. It wasn't transactionally-consistent
  3. It didn't provide mw config information (eg. list of all wikis)
  4. It didn't provide all info (some info may have to be extracted querying directly swift or after download)

To stress: we only need a script for listing; downloading will be handled directly through Swift for performance, but we need something to get us the urls and metadata that won't break easily on the next image table structure update.

@ArielGlenn will look into whether URLs can be extracted for deleted images and what the alternatives are if not.

Urls exist in the database- of that I can be sure. The problem is I had to hardcode/duplicate a lot of existing mw logic, which is not sustainable long term.

Any plans for slicing it or otherwise multiprocessing

Answering @tstarling above, I have chosen to split into chunks of fewer than 500K records each, based on the image title initials: https://gerrit.wikimedia.org/r/c/operations/software/wmfbackups/+/643980/11/wmfbackups/media/MySQLMedia.py#75
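A sketch of that slicing idea: count rows per leading character of img_name and greedily group initials into chunks under a row threshold. The real MySQLMedia.py code may split further (a single initial can exceed the threshold); this only shows the shape of it.

```python
def build_chunks(cur, max_rows=500_000):
    cur.execute(
        "SELECT LEFT(img_name, 1) AS initial, COUNT(*) "
        "FROM image GROUP BY initial ORDER BY initial")
    chunks, current, current_rows = [], [], 0
    for initial, count in cur.fetchall():
        if current and current_rows + count > max_rows:
            chunks.append(current)
            current, current_rows = [], 0
        current.append(initial)
        current_rows += count
    if current:
        chunks.append(current)
    return chunks   # e.g. [['0', ..., 'A'], ['B'], ...] depending on the distribution
```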

@ArielGlenn will look into whether URLs can be extracted for deleted images and what the alternatives are if not.

To clarify, are we talking about urls of the form upload.wikimedia.org/something (or lang.project.org/something.php?somethingelse...), or swift urls?


@ArielGlenn Could we get both? :)

If not, I think Swift URLs would be the ones we're looking for, @jcrespo could you confirm / disprove?

End-user URLs are not really that useful for us. We just need the swift ones, as we will use swift for mass backup, not the app or traffic layers. For originals, it is trivial to convert between both, but of course "public urls" for private files don't make much sense in context. So we will mostly work with Swift locations (container + path).
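A sketch of that "trivial" conversion for public originals, mapping between a zone-relative path and its upload.wikimedia.org URL; the SITE_PATH lookup is a made-up example, and Swift container naming/sharding is deliberately left out.

```python
SITE_PATH = {"commonswiki": "wikipedia/commons"}   # hypothetical wiki -> URL prefix map

def public_url(wiki, rel_path):
    # rel_path is the path within the public zone, e.g. "a/ab/Foo.jpg"
    return f"https://upload.wikimedia.org/{SITE_PATH[wiki]}/{rel_path}"

def rel_path_from_url(url):
    # Drop the host and the two site components to recover the zone-relative path.
    prefix = "https://upload.wikimedia.org/"
    return url[len(prefix):].split("/", 2)[2]

print(public_url("commonswiki", "a/ab/Foo.jpg"))
# -> https://upload.wikimedia.org/wikipedia/commons/a/ab/Foo.jpg
print(rel_path_from_url("https://upload.wikimedia.org/wikipedia/commons/a/ab/Foo.jpg"))
# -> a/ab/Foo.jpg
```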

@ArielGlenn what's the relative priority of this in your opinion? Do you see a chance of looking into it anytime soon?

Let me rope @WDoranWMF into this discussion; he's trying to help me manage priorities.