
Media storage metadata inconsistent with Swift or corrupted in general
Open, Low, Public

Description

During the work on media storage backups, certain inconsistencies between the MediaWiki metadata (image tables) and the actual files stored in Swift were discovered. The goal of this task is to determine the correct areas of ownership for identifying, reporting and fixing these issues.

Examples:

  • A file correctly identified in the metadata returns a 404 when retrieved from Swift. Options:
    • The file was moved, deleted or undeleted after scanning (it will be retrieved on a subsequent scan) [backup limitation]
    • The file has been permanently deleted by T&S
    • The file has been lost, or the metadata has de-synced from the file data and the file is still on Swift, but under an unexpected name
  • A file expected to have a particular sha1 hash is downloaded and turns out to have a different sha1 hash
    • The file was moved, deleted or undeleted after scanning (it will be retrieved on a subsequent scan) [backup limitation]
    • The file was incorrectly hashed or only partially downloaded during backup [backup limitation]
    • The file was incorrectly hashed or only partially uploaded at upload time
    • The file has been corrupted or damaged, the metadata has de-synced from the file data, or something went wrong during a reshard or replication
  • A file has an impossible name (upload_name): NULL or the empty string
    • Metadata lost or corrupted, incomplete upload process, interrupted workflow (e.g. an outage while the file was being moved or renamed)
  • A file has an unexpected name or location, or is missing a property needed to locate it on Swift (storage key, sha) (e.g. a deleted file has a name that is not a hash, or the key is missing from the metadata)
    • Metadata loss or garbage, incomplete upload process, interrupted workflow (e.g. an outage while file was being moved or renamed)
  • A file has a missing or impossible deletion timestamp
    • Metadata loss or garbage, incomplete upload process, interrupted workflow (e.g. an outage while file was being moved or renamed)
  • A file is referenced from before the wiki was hosted at Wikimedia
    • enwikivoyage was probably imported from elsewhere in 2012 without its non-public files, so deleted files referenced from before that time are missing from Swift but still present in the metadata tables
  • A file is referenced in the table of archived (non-latest) versions as non-deleted, but the wiki doesn't have any non-deleted files
    • eswiki has no publicly available files (all of them have been deleted), but the oldimage table has 2 non-deleted entries whose files are missing
  • A file has 2 entries in the database with the same title and upload timestamp
    • This can be the same file or a different one. There are 2 commonswiki examples at T299764#7813469
  • A file has an invalid (non-UTF-8) path in the oldimage table
    • Two files found on commons: ДАЖО_127-1-68.1897._Геодезичний_опис_ділянки_землі_вічного_чиншовика_Антона_Станіслава_Гарбовських_села_Рудня-Старики_Овруцького_повіту.pdf and Алфавітно-предметний_покажчик_за_1938_рік_до_Збірника_постанов_і_розпоряджень_Уряду_Української_Радянської_Соціалістичної_Республіки.pdf have invalid UTF-8 characters.
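
Several of the metadata-only checks listed above (impossible names, non-UTF-8 paths, duplicate title and timestamp pairs, sha1 mismatches) could be sketched roughly as follows. This is only an illustration: `FileRecord` and its fields are hypothetical and do not match the real image/oldimage/filearchive schema, and the hash is assumed to be a hex string, which may differ from how the tables actually store it.

```python
import hashlib
from collections import Counter
from typing import List, NamedTuple, Optional, Tuple


class FileRecord(NamedTuple):
    """Hypothetical, simplified view of one metadata row (not the real schema)."""
    upload_name: Optional[bytes]  # raw bytes as read from the database
    upload_timestamp: str
    sha1_hex: Optional[str]       # assumption: hex-encoded sha1


def name_problems(record: FileRecord) -> List[str]:
    """Report impossible or invalid names: NULL, empty string, non-UTF-8."""
    if record.upload_name is None:
        return ["name is NULL"]
    if record.upload_name == b"":
        return ["name is empty"]
    try:
        record.upload_name.decode("utf-8")
    except UnicodeDecodeError:
        return ["name is not valid UTF-8"]
    return []


def hash_matches(record: FileRecord, content: bytes) -> bool:
    """Compare the sha1 of the downloaded bytes with the one in the metadata."""
    if record.sha1_hex is None:
        return False
    return hashlib.sha1(content).hexdigest() == record.sha1_hex.lower()


def duplicate_keys(records: List[FileRecord]) -> List[Tuple[Optional[bytes], str]]:
    """Return (name, timestamp) pairs that appear more than once."""
    counts = Counter((r.upload_name, r.upload_timestamp) for r in records)
    return [key for key, n in counts.items() if n > 1]
```

A mismatch reported by `hash_matches` does not say which of the causes above applies; a file changed after scanning looks the same as a corrupted one, so a rescan would still be needed before reporting it.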

Event Timeline

There seems to be functionality in the core maintenance scripts to check some of these issues, e.g. based on the name: https://github.com/wikimedia/mediawiki/blob/master/maintenance/findMissingFiles.php We should review the work that has probably already been done and try to apply it, fix it, or implement new checks based on other maintenance scripts.

Enwikivoyage was created in 2012 at Wikimedia: https://www.mail-archive.com/newprojects@lists.wikimedia.org/msg00015.html

However, its metadata still references deleted files from 2008.

All publicly available files were backed up correctly, though.

jcrespo renamed this task from Media storage metadata inconsistent with Swift to Media storage metadata inconsistent with Swift or corrupted in general. Mar 31 2022, 8:43 AM
jcrespo updated the task description.