Currently all file metadata is permanently stored in the img_metadata column. Each media handler can version its metadata and have it refreshed, but because of the sheer number of files on Wikimedia Commons, $wgUpdateCompatibleMetadata is disabled there to avoid a traffic stampede when a version change starts discarding old data and reading new data.
While refreshImageMetadata.php can update specific file types, or properties of some types, it is poorly suited to reparsing files (especially bitmaps) where we previously missed a metadata property that we now want to extract. One example is T118799, which likely caused us to miss XMP metadata in possibly millions of files; T32961 is similar. This clearly needs a long-term fix, so that we don't require human-monitored runs of maintenance scripts for days on end, only to start the same process over again when we add another new attribute half a year later.
I think one of the problems is that we basically have a cache with neither a max_age nor consistent versioning. While each media handler can add a version to the metadata itself, that is not done consistently, and it also doesn't tell us when the metadata was read.
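To make the missing pieces concrete, here is a minimal sketch of a staleness check that uses both a version and a read timestamp. It is written in Python for brevity and is not MediaWiki code; the names `StoredMetadata`, `handler_version`, `parsed_at` and `is_stale` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StoredMetadata:
    """Hypothetical record: what we would need to store next to the metadata blob."""
    handler_version: int  # version of the media handler that parsed the file
    parsed_at: datetime   # when the metadata was last read from the file


def is_stale(stored: StoredMetadata, current_version: int,
             max_age: timedelta, now: datetime) -> bool:
    """Stale if parsed by an older handler version, or if older than max_age."""
    if stored.handler_version < current_version:
        return True
    return now - stored.parsed_at > max_age
```

With only a version we cannot expire entries by age, and with only a timestamp we cannot tell which parser produced them; the sketch assumes both are available.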
Ideas:
- Storing it in MCR is one idea, but I believe MCR is first of all not really a cache layer, and second, I don't think it currently has much awareness of the file tables?
  - We all agree, not the right hammer
- Maintenance script
  - Avoids the stampede
  - Running scripts for 2 months is not desirable according to ops
- Create a PostSaveHook, get the 'files used' list and fire off a job to check for old file data?
  - Downside 1: Lots of images are not used at all
  - Downside 2: Lots of pages are very rarely saved
- Modify getMetadata() and add a configurable algorithm to refresh the data. Maybe something like: if metadata === old, then if ((upload day - configurable number of days) < current day < upload day), then refresh. Here the days are the day within the year of the upload date and the current date, so at most 366.
  - Downside: the definition of "old" can only be determined per media handler
  - Downside: it might take a year, or several config changes to increase the "configurable number of days", to get everything reparsed in under a year
  - Or maybe make the whole algorithm a configurable wgUploadMaxAgeFunction?
- Maintenance script which queues jobs on a separate job runner (T32961#2864157)
  - Should we do this more often?
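The day-of-year window from the getMetadata() idea could look roughly like the sketch below. It is Python rather than MediaWiki code, and `should_refresh`, `window_days` and `metadata_is_old` are hypothetical names. Unlike the formula as written in the list, it also assumes the window should wrap around the new year, so that a file uploaded on 5 January still has a window reaching back into December.

```python
from datetime import date

DAYS_IN_CYCLE = 366  # day-of-year values run from 1 to 366


def should_refresh(upload_date: date, today: date,
                   window_days: int, metadata_is_old: bool) -> bool:
    """Refresh old metadata only in the window_days before the upload anniversary.

    Mirrors (upload day - window) < current day < upload day, using
    day-of-year values and modular arithmetic to handle the year boundary.
    """
    if not metadata_is_old:
        return False
    upload_doy = upload_date.timetuple().tm_yday
    today_doy = today.timetuple().tm_yday
    # Days remaining until the next upload "anniversary", in the range 0..365.
    days_until = (upload_doy - today_doy) % DAYS_IN_CYCLE
    # Strict inequalities on both ends, as in the formula in the list.
    return 0 < days_until < window_days
```

Because upload dates are spread over the year, each file becomes eligible during its own window, which is what spreads the reparsing load. A larger `window_days` makes more files eligible on any given day.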
Please add more ideas.