Document media recovery use case proposals and decide their priority
Closed, Resolved · Public

Description

After the analysis, design, implementation and first run of full backups at both WMF primary datacenters (eqiad and codfw), a workflow for backup and limited recovery is already in place.

If one or a few files were to be lost, corrupted or otherwise unavailable from Commons, or any other wiki hosting media files, there exists a method/application that can easily recover a single object, or a few related ones (e.g. all uploaded with a given title, or with a given hash), from backups and upload them back into the production Swift cluster.
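
As an illustration, a single-object recovery could look roughly like the following sketch. All names here (find, get, put_object, the metadata fields) are hypothetical and do not reflect the actual mediabackups API; the sketch only shows the shape of the workflow: look up by title or hash in the backup metadata, fetch the bytes from backup storage, and re-upload them to Swift.

```python
# Hypothetical sketch of a single-file recovery; function and field names are
# illustrative, not the real mediabackups API.
from typing import Iterable, Optional


def recover_files(metadata_db, backup_store, swift, wiki: str,
                  title: Optional[str] = None,
                  sha1: Optional[str] = None) -> int:
    """Recover one object (or a few related ones) from backups into Swift."""
    # 1. Locate the backed-up copies by title or by hash in the backup metadata.
    rows: Iterable[dict] = metadata_db.find(wiki=wiki, title=title, sha1=sha1)
    recovered = 0
    for row in rows:
        # 2. Fetch the bytes from the backup storage backend.
        data = backup_store.get(row["backup_location"])
        # 3. Upload them back to the production Swift container and path
        #    recorded at backup time.
        swift.put_object(row["swift_container"], row["swift_path"], data)
        recovered += 1
    return recovered
```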

Equally, if there was a full data availability outage (e.g. all or a large part of the files were lost), a "best effort" recovery would always be possible, where all files in the backup would be sent back to production as they were at the time of the last backup. This use case, of course, needs large-scale testing to make sure we could optimize for performance, not only by parallelizing the recovery, but also by ordering it, assigning a priority to each file (e.g. by file type, by access statistics, by wiki usage, by state, etc.). While this needs further preparation, the way to move forward is relatively clear.
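
A sketch of what that prioritized, parallel recovery could look like follows, assuming each backed-up file gets a sort key computed from attributes such as status, file type or usage; the scoring heuristic and the recover_one callable are invented for illustration, not part of any existing tool.

```python
# Sketch of prioritized, parallel bulk recovery; the priority heuristic and
# the recover_one() callable are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor


def priority(row: dict) -> tuple:
    # Lower sorts first: latest public versions of common file types and
    # heavily used files are recovered before the rest.
    return (
        0 if row.get("status") == "public-latest" else 1,
        0 if row.get("file_type") in ("jpg", "png", "svg") else 1,
        -row.get("usage_count", 0),
    )


def bulk_recover(rows: list, recover_one, workers: int = 16) -> None:
    rows = sorted(rows, key=priority)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submitting in priority order means higher-priority files are
        # picked up by the worker pool first.
        for row in rows:
            pool.submit(recover_one, row)
```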

However, as with most recoveries, the case in between these two scenarios, where a substantial amount of files is lost, but not all, creates the most challenges. Unlike the media backups, highly available media storage in production, in its logical model as used by MediaWiki, does not follow an append-only model, and in the opinion of the author of this document (and numerous other MediaWiki stakeholders), has clear inefficiencies and defects due to lack of maintenance over the years. Equally, a recovery of a physical, concrete set of bytes will not work without its matching metadata, and the metadata state also changes during the life of the file (it can be the latest version, an old version, deleted entirely (with all its versions) or have only one of its revisions deleted, and depending on that, its required metadata and recovery location will differ). There is no immutable identifier to consistently track a media file or media page, and the closest thing to it, a hash, uses the older sha1 algorithm, which has been practically demonstrated to generate collisions rather easily.

As the file changes, the question of how to recover it becomes non-trivial, and has the same issues as trying to partially recover a database: some recoveries will be trivial, while others will require decisions and trade-offs. For example, a file could be uploaded, then one of its revisions lost, but by the time it is recovered, a new file could have been uploaded as the new, latest version, or the file could have been renamed, or removed, or a combination of all of these. There is no easy way to automatically merge backups and subsequent changes, and not all changes are even tracked or possible to back up easily (e.g. a file can be deleted and restored multiple times, logs can be unreliable or difficult to parse, etc.).

While in an ideal world we would improve the logical storage model as needed (adding unique identifiers, using hashing other than sha1, having a more append-only model, not requiring file changes after upload, etc.), backups have to be working now, and cannot wait for a rearchitecture of production media hosting.

As a consequence, in the current state, while backing up every single file at once (full backups) or continuously (streaming backups) is very easy to implement (through kafka events, or by monitoring the database at certain intervals), recovery will require some concessions. For example, following the model of databases, a snapshot of the metadata could be generated periodically (every week? every day?) and we could recover a consistent state of metadata and files as of each incremental backup. We could also store those incremental metadata backups for 3 months. Under these assumptions, recovery could be done to those points in time, but not to others. All files could theoretically be retrieved (losing at most those uploaded between the latest backup and the incident), but there could be no automatic recovery of changes made since the latest metadata snapshot.
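
Under that model, choosing a recovery point would amount to picking the latest metadata snapshot at or before the requested time, within the retention window. A minimal sketch follows; the 3-month retention value mirrors the example above and is not a decision.

```python
# Minimal sketch of recovery-point selection under a periodic metadata
# snapshot model; the retention value is an example, not a decision.
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)  # example: keep incremental metadata ~3 months


def pick_recovery_point(snapshots: list, target: datetime,
                        now: datetime) -> datetime:
    """Return the latest snapshot <= target that is still within retention."""
    candidates = [s for s in snapshots
                  if s <= target and now - s <= RETENTION]
    if not candidates:
        raise ValueError("no metadata snapshot available for that time")
    return max(candidates)
```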

Another potential workflow is implementing the recovery as just another MediaWiki user: recovery would not write directly to Swift and the databases, but would generate a new upload and its log entry, without breaking the MediaWiki workflow, even if that means that older references to certain files would be lost. For example, if an old version of a file has been lost, we would upload it again, generating a recent log entry, which would be visible to end users.
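
For reference, such a "recovery as a regular upload" could go through the standard action=upload module of the MediaWiki Action API. The sketch below is minimal: the endpoint URL, filename and comment are placeholders, and the session is assumed to be already authenticated with a valid CSRF token obtained via action=query&meta=tokens.

```python
# Minimal sketch of re-uploading a recovered file through the normal
# MediaWiki upload workflow (action=upload). Endpoint, filename and comment
# are placeholders; the session is assumed to be authenticated already.
import requests

API_URL = "https://commons.wikimedia.org/w/api.php"  # placeholder endpoint


def reupload(session: requests.Session, csrf_token: str,
             filename: str, local_path: str) -> dict:
    with open(local_path, "rb") as f:
        resp = session.post(API_URL, data={
            "action": "upload",
            "filename": filename,
            "comment": "Restored from media backups",
            "ignorewarnings": 1,
            "token": csrf_token,
            "format": "json",
        }, files={"file": f})
    resp.raise_for_status()
    return resp.json()
```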

These two, of course, are only two of many possible workflows. It is those who will respond to a media outage that will set the requirements and the preference of supported use cases for recovery, and the finalization of the backup workflow will be built around that. Several workflows are possible (backups were built with flexibility in mind), although probably not all will be practically supported, or not with the same priority, always taking into account the finite backup resources available.

Event Timeline

jcrespo triaged this task as High priority. Jan 21 2022, 1:09 PM

Change 773192 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackups: Test mediabackups updates on testwiki only

https://gerrit.wikimedia.org/r/773192

Change 773192 merged by Jcrespo:

[operations/puppet@production] mediabackups: Test mediabackups updates on testwiki only

https://gerrit.wikimedia.org/r/773192

Change 773442 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Update s4 backup in eqiad

https://gerrit.wikimedia.org/r/773442

Change 773444 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/mediabackups@master] Add new command line utility to update existing metadata

https://gerrit.wikimedia.org/r/773444

Change 773442 merged by Jcrespo:

[operations/puppet@production] mediabackup: Update s4 backup in eqiad

https://gerrit.wikimedia.org/r/773442

Mentioned in SAL (#wikimedia-operations) [2022-03-24T11:47:41Z] <jynus> updating eqiad swift-commonswiki backups of originals T299764

Change 773444 merged by Jcrespo:

[operations/software/mediabackups@master] Add new command line utility to update existing metadata

https://gerrit.wikimedia.org/r/773444

Because performing backups takes multiple days, the following issues have been detected:

  • Some files are inserted into the metadata twice due to their status changing (deletion, undeletion, etc.) during the backup process. This doesn't affect the backups, but it gets confusing on recovery. The older version needs to be moved to the newly proposed file_history table for archival (see the sketch after this list).
  • Some files failed to be backed up because they changed between metadata collection and the actual file transmission (that was expected, and will be fixed on a followup run). However, where a different file with the same production storage location was found, that one was backed up instead. This means there is missing metadata for existing backed-up files. If the new file has a not-yet-seen sha1, it will be fixed automatically on the next run, but a more complex operation (revert of an existing old file as the latest version, rename of an existing file) may not be easy to track.
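
A sketch of how the first issue could be handled on the backup metadata database follows. It assumes a files table and the proposed file_history table with compatible columns; all table and column names are hypothetical, not the actual mediabackups schema.

```python
# Hypothetical sketch: archive the older of two duplicate metadata rows into
# the proposed file_history table. Table and column names are illustrative.
def archive_older_duplicates(cursor, wiki: str, title: str) -> None:
    cursor.execute(
        "SELECT id FROM files WHERE wiki = %s AND title = %s "
        "ORDER BY backup_time ASC",
        (wiki, title),
    )
    ids = [row[0] for row in cursor.fetchall()]
    # Keep the newest row in `files`, move every older one to `file_history`
    # (assumes both tables share the same column layout).
    for old_id in ids[:-1]:
        cursor.execute(
            "INSERT INTO file_history SELECT * FROM files WHERE id = %s",
            (old_id,),
        )
        cursor.execute("DELETE FROM files WHERE id = %s", (old_id,))
```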

Change 774378 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Update backup of testwiki media on codfw

https://gerrit.wikimedia.org/r/774378

Change 774378 merged by Jcrespo:

[operations/puppet@production] mediabackup: Update backup of testwiki media on codfw

https://gerrit.wikimedia.org/r/774378

Change 774474 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mediabackup: Update s4 backup in codfw

https://gerrit.wikimedia.org/r/774474

Change 774474 merged by Jcrespo:

[operations/puppet@production] mediabackup: Update s4 backup in codfw

https://gerrit.wikimedia.org/r/774474

Change 774477 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] mariabackup: Prefer db2099 mw db (backup source) for mediabackups

https://gerrit.wikimedia.org/r/774477

Change 774477 merged by Jcrespo:

[operations/puppet@production] mariabackup: Prefer db2099 mw db (backup source) for mediabackups

https://gerrit.wikimedia.org/r/774477

The main issue I ran into is that it was said it was guaranteed by MediaWiki that no file with the same title was ever uploaded on the same timestamp. But checking the original commonswiki database, this is not true for many files.

If they were errors on upload, that doesn't matter to me now: the objective of backups is to keep a copy of what is actually on production. But given that many files are in constant state change, it is difficult to uniquely identify file versions; e.g. I don't know if the original db is "bad" or I just captured the file twice after it was restored or a new version was uploaded while the backup was ongoing.

One option would be to ignore, on the backups side, duplicate file versions with the same sha1, wiki, upload date and file status (deleted/latest version/older version), and trust they are not sha1 collisions of different files. Thoughts, @Krinkle? Could those db entries be deleted/constrained in the future on production?
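
In code, that deduplication would amount to something like the sketch below; the field names are illustrative, and it assumes sha1 collisions between different files can be discounted.

```python
# Sketch of the proposed deduplication: keep a single backup entry per
# (wiki, sha1, upload date, status) key. Field names are illustrative.
def deduplicate(rows: list) -> list:
    seen = set()
    kept = []
    for row in rows:
        key = (row["wiki"], row["sha1"], row["upload_date"], row["status"])
        if key in seen:
            # Same wiki, hash, date and status: treat it as the same version,
            # trusting that these are not sha1 collisions of different files.
            continue
        seen.add(key)
        kept.append(row)
    return kept
```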

A summary of my conclusions after the feedback was documented at: https://docs.google.com/document/d/17iVqWzd6ebTo7M3Xn79vC1mXxqNVChum7cTx34flJkI/edit#

Feel free to provide additional feedback on my conclusions, which will result in the creation of a few followup tickets.


Interesting issue!

Historical context

First some context for future reference (Jaime knows most of this already): the title,timestamp combination is our only* unique key for file versions, and this is a long-standing issue. For, well, over ten years by now. In fact, one of my earliest technical proposals was T28741: Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys), from 2011 with Roan at the Amsterdam Hackathon. This was "approved" in T589, but remains unresourced, in part due to the loss in scope of (and subsequently, existence of) our Multimedia team, and the various levels of confused management over what we currently call "Platform". In a nutshell, just like the sibling task about the page archive tables (T20493), this model requires moving primary data between tables, and on top of that lacks a stable identifier. Because, yes, not only is it a weird identifier, it's not a stable one either, because files can be renamed. And those renames are reflected in Swift, including retroactively for previous versions. This is because otherwise we couldn't discover the file in Swift or the related metadata in the DB, since it's keyed by (page_title, timestamp), and there's no other tracking of what those values would be other than the current page_title.

Now, in theory we have logic in place to at least prevent duplicate insertions. For example, during regular uploads MW refuses to insert duplicate image rows for the same title, and during file revision uploads ("overwrite") it refuses to insert a duplicate oldimage row for an existing title/timestamp pair. Note, btw, that file overwrite actions are generally rare. Our community's policy is generally that modified versions of files should be uploaded as their own file, so that both versions can be referenced in content. Unlike pages, most files are never edited/replaced, and that is mostly expected and desirable. For example, crops or major retouches would be their own file linked to the original, not a file revision ("overwrite").

Client perspective

The scenario where this bug would happen, I suspect, is when a file is periodically updated by a bot, the bot was misconfigured, and concurrent runs produced visually identical but binary-different files, going out of control and overwriting each other.

Practically speaking, my opinion is that:

  • What we do here for backups doesn't matter. So long as we have "the" file from Swift and one of the metadata rows, that's good enough; even if we could determine which one is "correct", I don't think it matters.
  • In all likelihood, the uploads were near-identical, with near-identical metadata.

Server-side, in theory...

If during upload we find that a file was already revised during the same second, MW first tries to resolve the conflict by allocating now + 1 second as a future timestamp instead. If that's taken as well, it waits up to a certain amount of time, and then bails. Or rather, that's how it behaves in theory from my reading, and has for many years. Based on Jaime's evidence, it's very likely that there's at least one race condition bug here.
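
Paraphrasing that described behaviour as a runnable sketch (this is a reading of the logic, not actual MediaWiki code; the retry limit is an invented example value):

```python
# Paraphrase of the described conflict-resolution behaviour, not actual
# MediaWiki code; the retry limit is an invented example value.
import time


def allocate_timestamp(requested_ts: int, is_taken, max_retries: int = 5) -> int:
    """Return a free per-second timestamp for a file revision, or bail."""
    ts = requested_ts
    for _ in range(max_retries):
        if not is_taken(ts):
            # Note: the window between this check and the actual insert is
            # exactly where a race condition could still slip through.
            return ts
        # The slot is taken: try the next second instead, waiting for the
        # wall clock to catch up with the bumped timestamp.
        ts += 1
        time.sleep(1)
    raise RuntimeError("could not allocate a unique title/timestamp slot")
```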

Alternatively, as @Catrope mentioned to me yesterday, it could also be due to conflicts arising after the fact. In particular, when file histories are merged (upload A1, A2, B2, B3; delete A; rename B to A; restore history of A; we now have two A2s).

*: The caveat here is that not only is the insertion logic likely subject to race conditions, and there's nothing preventing conflicting histories from being merged together, but there is also no protection on the database side: the oldimage schema doesn't even set a UNIQUE INDEX on the title,timestamp pair. This was pointed out in its own issue in 2014, ref T67264, three years after T28741 in 2011.

As for Swift, that is where, afaik, a decision is made for us. With title-timestamp as our only identifier, this inevitably means that a conflict is resolved by just overwriting in place, dereferencing whatever was there before without recovery. As per the "client" story above, this is imho not a big deal, as these aren't likely to be significant versions either way, and are by design never the "current" version of a file.

Recommendation

I recommend against putting in effort to try to preserve both rows of metadata, or otherwise accommodate this within your data model long-term. Do whatever is the least effort from your end that still preserves something.

As for what "we" can do as MW developers:

  • Make an inventory, listing all current violations of this constraint (see the sketch after this list).
  • Triage each one and remove one of the rows. We can try to remove the "wrong" one where we know it, e.g. based on the sha1 field.
  • Apply the schema change per T67264 so that new violations at least cause insertion errors. This is a fairly small and low-risk schema change from the MW perspective. Having errors here would be a good thing, and they will be rare either way. But it would not surprise me if applying such a unique constraint were a slow and/or risky operation. I'll rely on DBAs for strategy on this.
  • Do T28741 at some point.
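
For the inventory step above, the violations could be listed with a query along these lines against a database replica (shown here for oldimage; the deleted-file tables would need a similar pass, and connection handling is omitted):

```python
# Sketch: list title/timestamp pairs with more than one oldimage row.
# Run against a wiki's database replica; connection handling is omitted.
DUPLICATE_OLDIMAGE_QUERY = """
    SELECT oi_name, oi_timestamp, COUNT(*) AS copies
    FROM oldimage
    GROUP BY oi_name, oi_timestamp
    HAVING COUNT(*) > 1
"""


def list_violations(cursor):
    cursor.execute(DUPLICATE_OLDIMAGE_QUERY)
    return cursor.fetchall()
```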

Do whatever is the least effort from your end that still preserves something

Thank you a lot, that is exactly what I needed, with all the context, of course: permission to "skip" production data that is "bad" to start with, from a data recovery perspective. The easiest thing I will likely implement is to "log" to history a random version of the duplicate file and try to back up the other, so in case of a recovery only one version will be available.

Later, when/if production metadata is cleaned up, it should be reconciled with production/Swift. As I track all backup errors, nothing should be lost or fail silently (plus most issues are with archived or deleted versions, not the latest ones, which will probably have a much higher recovery priority). I will document this decision and bake the logic into the script. I also believe that the backup work will help guide production redesigns and will be of great value to understand what is *actually* in media storage, not what it is supposed to contain. 0:-)

With all open questions (or at least the basic ones) resolved, we will basically do a "best effort" in the places where we get "garbage in". I will open a separate ticket for the finalization of the implementation of incremental backups and streaming backups.