
Old image unexpectedly overwritten by a revision several years later (after Internal server error)
Open, High, Public

Description

I tried to upload a new version of https://commons.wikimedia.org/wiki/File:Portrait_of_Nora_Perry.jpg by going to https://commons.wikimedia.org/w/index.php?title=Special:Upload&wpDestFile=Portrait_of_Nora_Perry.jpg&wpForReUpload=1. After I clicked "Upload file", it gave me the following error:

Internal error
[0e08f82e-9189-41cc-964d-b870ce88dad6] 2020-09-18 21:44:17: Fatal exception of type "JobQueueError"

I then clicked the back button and re-submitted the form. This time the upload worked, but something very strange happened. Now the thumbnails for both versions of the image are the image I just uploaded, and the link to the 2016 version goes to https://upload.wikimedia.org/wikipedia/commons/archive/a/a1/20200918214448%21Portrait_of_Nora_Perry.jpg, which is from today. (Notice the 20200918 in the URL.) So now there is no way to access the original 2016 version of File:Portrait_of_Nora_Perry.jpg. I think that the error during upload caused some sort of revision sequencing to get out of whack.

Event Timeline

kaldari created this task. · Sep 18 2020, 10:08 PM
Restricted Application added a subscriber: Aklapper. · Sep 18 2020, 10:08 PM
kaldari updated the task description. · Sep 18 2020, 10:08 PM
kaldari added a project: WMF-JobQueue.
Krinkle added a subscriber: Krinkle.

There are two issues here:

  1. Job queue failure.

Not sure if this is caused by how the upload code specifically does its job submissions (does it follow current best practices?) or by a general issue with job queue availability. I've tagged both CPT and SDE for this.

  2. File corruption: Permanent loss of the old file?

This is definitely something that needs a closer look from SDE, and it is specific to uploading.

The problem is that on https://commons.wikimedia.org/wiki/File:Portrait_of_Nora_Perry.jpg the 2016 version of that file appears to have gone permanently missing. The thumbnail is now linked to a 2020 upload that was meant to be a new version.

MediaWiki does not have any (intentional) way of replacing old versions; it is meant to be an append-only system, so this is a pretty serious break of the expected model.

Hopefully the file still exists on disk and can be manually re-inserted as a reference in the oldimage table once we figure out how to prevent new uploads from breaking in this manner.

Restricted Application added a project: Structured-Data-Backlog. · Sep 22 2020, 4:36 PM
Krinkle triaged this task as High priority. · Sep 22 2020, 4:36 PM
Krinkle renamed this task from Old image version replaced by new upload due to internal server error to Old image unexpectedly overwritten by a revision several years later (after Internal server error). · Sep 22 2020, 4:41 PM
daniel added a subscriber: daniel. (Edited) · Wed, Sep 23, 10:40 AM

> MediaWiki does not have any (intentional) way of replacing old versions; it is meant to be an append-only system, so this is a pretty serious break of the expected model.
>
> Hopefully the file still exists on disk and can be manually re-inserted as a reference in the oldimage table once we figure out how to prevent new uploads from breaking in this manner.

I don't think the old image is still there. According to the timestamp in the oldimage table, the archive URL of the old image should be:
https://upload.wikimedia.org/wikipedia/commons/archive/a/a1/20160305221455!Portrait_of_Nora_Perry.jpg. But that's a 404.
[I was wrong here, see next comment]

Note the timestamp in the URL of the (wrong) old version: https://upload.wikimedia.org/wikipedia/commons/archive/a/a1/20200918214448%21Portrait_of_Nora_Perry.jpg - 2020-09-18 21:44:48 is seven seconds before the current image's timestamp, 2020-09-18 21:44:55. This is probably the time between the error and the re-submit. So the first upload was recorded, and then replaced by the second upload. But somehow the original version was lost on disk.
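The gap between the two timestamps can be checked directly. A quick sketch (MediaWiki's TS_MW format is a 14-digit YYYYMMDDHHMMSS string):

```python
from datetime import datetime

def parse_mw_timestamp(ts: str) -> datetime:
    """Parse a MediaWiki TS_MW timestamp (14 digits, YYYYMMDDHHMMSS)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

archived = parse_mw_timestamp("20200918214448")  # timestamp in the archive URL
current = parse_mw_timestamp("20200918214455")   # img_timestamp of the current version
print((current - archived).total_seconds())      # 7.0
```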

I had a quick look at the database:

> select * from oldimage where oi_name = "Portrait_of_Nora_Perry.jpg" \G                    
*************************** 1. row ***************************
          oi_name: Portrait_of_Nora_Perry.jpg
  oi_archive_name: 20200918214448!Portrait_of_Nora_Perry.jpg
          oi_size: 467670
         oi_width: 821
        oi_height: 1307
          oi_bits: 8
oi_description_id: 44
         oi_actor: 71746
     oi_timestamp: 20160305221455
      oi_metadata: a:1:{s:22:"MEDIAWIKI_EXIF_VERSION";i:2;}
    oi_media_type: BITMAP
    oi_major_mime: image
    oi_minor_mime: jpeg
       oi_deleted: 0
          oi_sha1: c98lqab46vdlf50ou0f375zdqk5vmai

> select img_name, img_timestamp, img_size, img_width, img_height, img_sha1 from image where img_name = "Portrait_of_Nora_Perry.jpg"\G
*************************** 1. row ***************************
     img_name: Portrait_of_Nora_Perry.jpg
img_timestamp: 20200918214455
     img_size: 282232
    img_width: 680
   img_height: 1080
     img_sha1: j7gv3ceaednhaccy7xpami04rg1c1c8

So, only one row was inserted into oldimage (or the original insert was rolled back). I note that there is no unique index on oldimage, so this is not due to a key conflict.

What confuses me is that, had the first insert been successful, the row should have the right oi_archive_name value. If the second insert had been successful, it would have the meta-data of the new version - but it doesn't. The entry in the oldimage table has the old file's meta-data, including the timestamp and hash. Is oi_archive_name updated later? But without a unique key on the table, how would that work? [I was wrong here, see next comment]

As far as I understand the process, when the new version of the file was uploaded, the old version should have been renamed from a/a1/Portrait_of_Nora_Perry.jpg to a/a1/20160305221455!Portrait_of_Nora_Perry.jpg. After that, the new version would be written to a/a1/Portrait_of_Nora_Perry.jpg. Now, a row is inserted into oldimage based on the current data in the image table. Finally, the row in the image table is updated.

[this analysis is somewhat off, see later comments] To explain the result we now see, something like this must have happened: Renaming the old version from a/a1/Portrait_of_Nora_Perry.jpg to a/a1/20160305221455!Portrait_of_Nora_Perry.jpg failed somehow, but we didn't notice, so the new version was written to a/a1/Portrait_of_Nora_Perry.jpg anyway, causing the data to be lost! A row was inserted into oldimage and the image table was updated. At this point, something (the job queue, for some reason?) triggered an error. When re-submitting, a/a1/Portrait_of_Nora_Perry.jpg was renamed to a/a1/20200918214448!Portrait_of_Nora_Perry.jpg based on the img_timestamp value from the first attempt. But for some reason, no new row was inserted into oldimage. Instead, the existing row was updated with a new value for oi_archive_name.

It all doesn't quite fit...

So the right place to start digging appears to be LocalFile::upload(), which calls LocalFile::publish() and later LocalFile::recordUpload3(). In LocalFile::upload(), there is a comment that reads:

			// There will be a copy+(one of move,copy,store).
			// The first succeeding does not commit us to updating the DB
			// since it simply copied the current version to a timestamped file name.
			// It is only *preferable* to avoid leaving such files orphaned.
			// Once the second operation goes through, then the current version was
			// updated and we must therefore update the DB too.

So, the first operation is to copy the old file to the archive. The second is to overwrite the current version. Only if that succeeds are the image and oldimage tables updated. The oldimage row is created by copying values from the image table, except for the value of oi_archive_name, which comes from the return value of the earlier call to publish(). LocalFile::publishTo() constructs the archive file name from the current time: $archiveName = wfTimestamp( TS_MW ) . '!' . $this->getName();.
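A Python rendering of that naming logic (a sketch; the real code is PHP in LocalFile::publishTo()) makes the failure mode visible: the archive name encodes the time of the upload that displaces the file, not the displaced file's own upload time, which is how a 2016 original can end up under a 2020 archive name:

```python
from datetime import datetime, timezone

def make_archive_name(name: str, now: datetime) -> str:
    # Equivalent of: $archiveName = wfTimestamp( TS_MW ) . '!' . $this->getName();
    # TS_MW is the current UTC time as a 14-digit YYYYMMDDHHMMSS string.
    return now.strftime("%Y%m%d%H%M%S") + "!" + name

when = datetime(2020, 9, 18, 21, 44, 48, tzinfo=timezone.utc)
print(make_archive_name("Portrait_of_Nora_Perry.jpg", when))
# 20200918214448!Portrait_of_Nora_Perry.jpg
```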

So I was wrong above. We might be able to find the lost file by looking for a/a1/*!Portrait_of_Nora_Perry.jpg. We can probably even narrow it down to a/a1/20200918214*!Portrait_of_Nora_Perry.jpg. I have no idea how to do this on our Swift boxes...

daniel added a comment. (Edited) · Wed, Sep 23, 11:31 AM

New attempt to reconstruct the events, after digging around the code some more:

First submit:

  1. the old version was copied to an archive name
  2. the new version was written to the primary name of the file
  3. a new row was inserted into oldimage (but later rolled back)
  4. the row in image was updated (but later rolled back)
  5. prerenderThumbnails() threw an exception, causing the AutoCommitUpdate that also contains the above database updates to be rolled back.

At this point, we have an orphaned archive file, the new data under the primary file path, no entry in oldimage, and the old meta-data in the image table. We have already lost (orphaned) the old version of the file.

Second submit:

  1. the data in the primary file path was copied to a new archive name (this is the data of the new version from the previous submit)
  2. the new version was written to the primary path of the file (a no-op, as it was already there)
  3. a new row was inserted into oldimage by copying data from the image table, which still had the meta-data of the original version of the file. oi_archive_name, however, was set to the new archive name.
  4. the row in image was updated to reflect the new version's meta-data (correctly)

We now have one orphaned archive file, one archive file on record, correct data in the image table, and one row in oldimage which points to the second archive file but has meta-data that relates to the first (orphaned) archive file.
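The reconstruction can be played through as a toy model (hypothetical Python; storage stands in for Swift, and the DB rows are plain dicts). The key assumption, taken from the reconstruction above, is that the file copies are not undone when the DB transaction rolls back:

```python
PRIMARY = "a/a1/Portrait_of_Nora_Perry.jpg"

storage = {PRIMARY: "ORIGINAL-2016-DATA"}                          # file backend
image = {"ts": "20160305221455", "sha1": "c98lqab46vdlf50ou0f375zdqk5vmai"}
oldimage = []                                                      # archived rows

def submit(new_data, new_meta, archive_ts, fail_in_txn):
    global image
    archive = f"archive/a/a1/{archive_ts}!Portrait_of_Nora_Perry.jpg"
    storage[archive] = storage[PRIMARY]   # 1. copy current version to archive name
    storage[PRIMARY] = new_data           # 2. overwrite the primary path
    snapshot = (list(oldimage), image)    # DB transaction begins
    oldimage.append({**image, "archive_name": archive})  # 3. row copied from image
    image = new_meta                      # 4. image row updated
    if fail_in_txn:                       # 5. prerenderThumbnails() throws:
        rows, img = snapshot              #    the DB rolls back, file ops do not
        oldimage[:] = rows
        image = img

new_meta = {"ts": "20200918214455", "sha1": "j7gv3ceaednhaccy7xpami04rg1c1c8"}
submit("NEW-2020-DATA", new_meta, "20200918214414", fail_in_txn=True)   # 1st submit
submit("NEW-2020-DATA", new_meta, "20200918214448", fail_in_txn=False)  # re-submit

# End state matches what was observed: one oldimage row carrying the 2016
# meta-data but pointing at the 20200918214448 archive file (which holds the
# NEW data), while the original bytes sit only in the orphaned 20200918214414 file.
print(oldimage[0]["ts"], oldimage[0]["archive_name"])
print(storage["archive/a/a1/20200918214414!Portrait_of_Nora_Perry.jpg"])
```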

The fundamental problem is this:
As noted in the comment I quoted: "Once the second operation goes through, then the current version was updated and we must therefore update the DB too". But there is nothing in the code to ensure this. Any error that happens before the DB transaction is committed will prevent the necessary updates from being recorded. And since the AutoCommitUpdate that contains the DB updates is deferred to PRESEND, there is actually a lot that can go wrong before it runs. Or, in our case, within the transaction.

Further, I note that LocalFile::publishTo() uses FileRepo::quickImport(), which says "This does no locking nor journaling and overrides existing files" and "This is intended for copying generated thumbnails into the repo". FileRepo::quickImport() uses FileBackend::doQuickOperations(), which says: "This does no locking, nor journaling, and possibly no stat calls" and "This should *only* be used on non-original files, like cache files". So we are using backend functions that are explicitly documented to be unfit for primary user data.

I see several ways in which we can reduce the risk of this happening, but I don't see a way to fully ensure consistency.

daniel added subscribers: aaron, tstarling. (Edited) · Wed, Sep 23, 11:48 AM

Ideas for making the upload process less prone to data loss:

Stage one:

  1. start db transaction (do not use a deferred update)
  2. determine archive name of the file, insert a row into oldimage, based on data in the image table
  3. copy the current version to the archive name. Do not use "quick" operations.
  4. commit db transaction (really flush! we must know this is permanent before overwriting the current version of the file!)

Stage two:

  1. start db transaction (do not use a deferred update)
  2. determine the meta-data for the new version, update the row in the image table
  3. copy the new version into the primary location. Do not use "quick" operations.
  4. commit db transaction (really flush!)

Stage three:

  1. schedule jobs for thumbnail generation (in a deferred update?)

This should prevent any data loss. However, if we fail before stage two is committed, we end up with an extra row in oldimage, which is visible to users. It would point to a copy of the current version and have the same meta-data. We could try to detect this during the next upload and remove such a row. This would be even easier with a unique index over oi_name and oi_timestamp (plus perhaps oi_sha1).
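The staged ordering can be sketched in the same toy-model style (hypothetical Python; the inline "commit + flush" comments stand in for real, flushed DB commits). The point is that a failure between the stages can only produce a redundant oldimage row, never lost bytes:

```python
PRIMARY = "a/a1/Portrait_of_Nora_Perry.jpg"

storage = {PRIMARY: "ORIGINAL-2016-DATA"}
image = {"ts": "20160305221455"}
oldimage = []   # only rows whose transaction committed

def upload(new_data, new_meta, archive_ts, fail_between_stages=False):
    global image
    # Stage one: archive row + archive copy, committed before anything is overwritten.
    archive = f"archive/a/a1/{archive_ts}!Portrait_of_Nora_Perry.jpg"
    oldimage.append({**image, "archive_name": archive})
    storage[archive] = storage[PRIMARY]   # no "quick" operation here
    # commit + flush: the old version is now permanently accounted for
    if fail_between_stages:
        raise RuntimeError("simulated failure between stage one and stage two")
    # Stage two: only now overwrite the primary location and update image.
    storage[PRIMARY] = new_data
    image = new_meta
    # commit + flush
    # Stage three (deferred): schedule thumbnail jobs; a failure here is harmless.

try:
    upload("NEW-2020-DATA", {"ts": "20200918214455"}, "20200918214414",
           fail_between_stages=True)
except RuntimeError:
    pass

# Worst case: a redundant, user-visible oldimage row pointing at a copy of the
# still-current version -- detectable and removable on the next upload.
print(storage[PRIMARY])   # ORIGINAL-2016-DATA: nothing was lost
print(len(oldimage))      # one redundant row
```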

@daniel I can provide dumps (or even binlog events: every write that happened and when) of the db from last week, and even recover single rows if needed. Please guide me as to what you need. I will meanwhile be recovering image and oldimage from last week.

This is the row from the image table as of 2020-09-15:

("Portrait_of_Nora_Perry.jpg",467670,821,1307,"a:1:{s:22:\"MEDIAWIKI_EXIF_VERSION\";i:2;}",8,"BITMAP","image","jpeg",44,71746,"20160305221455","c98lqab46vdlf50ou0f375zdqk5vmai"),

There were no oldimage rows.

@daniel regarding swift, this is what I found:

$ swift list wikipedia-commons-local-public.a1 | grep Portrait_of_Nora_Perry.jpg
a/a1/Portrait_of_Nora_Perry.jpg
archive/a/a1/20200918214414!Portrait_of_Nora_Perry.jpg
archive/a/a1/20200918214448!Portrait_of_Nora_Perry.jpg

I hope that is useful, let me know how I can help further.

@jcrespo: Bingo! archive/a/a1/20200918214414!Portrait_of_Nora_Perry.jpg is the file name we failed to record in the DB. We can now access it again, and could even create an appropriate oldimage row for it. Thank you very much!

Here's the URL to the original version of the file: https://upload.wikimedia.org/wikipedia/commons/archive/a/a1/20200918214414!Portrait_of_Nora_Perry.jpg

We could now manually fix the oldimage row, by orphaning the newer, redundant archive file:

UPDATE oldimage SET oi_archive_name = '20200918214414!Portrait_of_Nora_Perry.jpg'
WHERE oi_name = 'Portrait_of_Nora_Perry.jpg' AND oi_archive_name = '20200918214448!Portrait_of_Nora_Perry.jpg';

The row already has the old meta-data, so pointing it to the corresponding file would be appropriate.

I'm not sure the manual intervention is needed, since the old file in this case isn't terribly valuable. But we have a misleading history, so... what does everyone think?

In any case, the key thing is to prevent this from happening in the future.

I will let CPT and/or end users decide the best way to proceed with the recovery, unless you need me involved for any database work. I believe that, with the above information, nothing was lost.

For the longer term, we should push for the modern multimedia workflow/structure T28741, and on my side for T262669: Plan logical and physical design for media backups, which should make it impossible to ever lose media data in the future.

Cheers.

@daniel - I'm not too worried about restoring the old version unless it will prevent future problems. For example, any idea what would happen currently if I clicked "revert" for the old revision? Otherwise, I'm more concerned about preventing the problem in the future, as sometimes new image versions are vandalism and we actually need to revert to the old versions.

daniel added a comment. (Edited) · Wed, Sep 23, 7:28 PM

> @daniel - I'm not too worried about restoring the old version unless it will prevent future problems. For example, any idea what would happen currently if I clicked "revert" for the old revision?

I guess the image would stay the same, but the meta-data would be that of the original revision. This might cause the aspect ratio to be off.

But it's also possible that the meta-data just gets re-computed.