
Make it possible to refresh metadata of files without jumping through crazy hoops
Open, Needs Triage, Public

Description

Currently, all file metadata is stored permanently in the img_metadata column. Each media handler can version its metadata and have it refreshed, but because of the sheer number of files on Wikimedia Commons, $wgUpdateCompatibleMetadata is disabled there to avoid any sort of traffic stampede when a version bump starts dumping old data and reading new data.

While refreshImageMetadata.php can update specific file types, or properties of some types, it is really unsuited to reparsing files (especially bitmaps) where we previously missed a metadata property that we now want to extract. For instance T118799, which likely caused us to miss XMP metadata in possibly millions of files, and similarly T32961. This is clearly a problem that needs a long-term fix, so that we don't require human-monitored maintenance script runs for days on end, and then, when we add another new attribute half a year later, have to start the same process all over again.

I think one of the problems is that we basically have a cache without a max_age and without consistent versioning. Each media handler can embed a version in the metadata itself, but that is not done consistently, and it also doesn't tell us when the metadata was read.

Ideas:

  • Storing it in MCR is one idea, but I believe MCR is, first of all, not really a cache layer, and second, I don't think it currently has much awareness of the file tables?
    • We all agree, not the right hammer
  • Maintenance script
    • Avoids the stampede
    • running scripts for 2 months is not desirable according to ops
  • Create a post-save hook handler, get the 'files used' list, and fire off a job to check for old file data?
    • Downside 1: Lots of images are not used at all
    • Downside 2: Lots of pages are very rarely saved
  • Modify getMetadata() and add a configurable algorithm to refresh the data. Maybe something like: if the metadata is old, refresh when (uploadDay - configurable number of days) < currentDay < uploadDay, where uploadDay and currentDay are the day within the year of the upload date and the current date, so at most 366 (see the sketch after this list).
    • Downside: the definition of "old" can only be determined per media handler
    • Downside: it might take a year, or several config changes increasing the "configurable number of days", to get everything reparsed in under a year.
  • Or maybe make the whole algorithm a configurable wgUploadMaxAgeFunction?
  • Maintenance script which queues jobs on a separate job runner T32961#2864157
    • Should we do this more often ?
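
A rough sketch of the getMetadata() idea above. All names here ($wgMetadataRefreshWindowDays, isMetadataCurrent(), shouldRefreshMetadata()) are made up for illustration, not existing MediaWiki config or functions, and the year wrap-around edge case is ignored:

```php
<?php
// Hypothetical sketch only; none of these names exist in MediaWiki.
$wgMetadataRefreshWindowDays = 14;

// Placeholder for the per-handler "is this metadata current?" check.
function isMetadataCurrent( array $metadata ): bool {
	return isset( $metadata['_version'] ) && $metadata['_version'] >= 2;
}

function shouldRefreshMetadata( array $metadata, string $uploadTimestamp ): bool {
	global $wgMetadataRefreshWindowDays;

	if ( isMetadataCurrent( $metadata ) ) {
		return false;
	}

	// Day-of-year (1..366) of the upload date and of today.
	$uploadDay = (int)date( 'z', strtotime( $uploadTimestamp ) ) + 1;
	$today = (int)date( 'z' ) + 1;

	// Only refresh in the N days leading up to the upload anniversary,
	// so reparsing is spread over the whole year. (Wrap-around at the
	// start of the year is not handled in this sketch.)
	return $today >= $uploadDay - $wgMetadataRefreshWindowDays
		&& $today < $uploadDay;
}
```

Widening the window via the config value would drain the backlog faster, at the cost of more load at once, which is exactly the second downside noted above.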

Please add more ideas.

Event Timeline

TheDJ created this task. Sep 12 2019, 8:01 PM
Restricted Application added a project: Commons. Sep 12 2019, 8:01 PM
Restricted Application added a subscriber: Aklapper.

Possibly the RevisionDataUpdates hook?

Tgr added a subscriber: Tgr. (Edited) Sep 12 2019, 10:35 PM

MCR is for primary data, not derived data. It is a snapshot of what information a user entered at a certain point in time; it never needs to change. Derived data needs to change whenever the algorithms to derive it do. (Also, img_metadata is derived page data, not revision data. Keeping old versions of it around forever, as MCR would, would be a waste of space.)

In any case, storing the data in img_metadata is fine. It's the invalidation rule that needs to change, not the storage location.
...although I'm not sure why it would have to change. When the way to calculate XMP metadata changes, you need to go through all affected images and recalculate; that's exactly what the maintenance script does. Seems like the perfect tool for the job to me. It could perhaps be replaced with a two-level job queue approach (a master job scheduling per-image child jobs, much like GWT does) that works much like a maintenance script internally; I think that is, if anything, less convenient to monitor and more fragile, although maybe better for performance, as it's a bit simpler to parallelize. Any other approach would be a worse fit that would make the invalidation take way longer, IMO.
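
Very roughly, the two-level shape could look something like this (hypothetical class and function names, not the actual MediaWiki JobQueue API):

```php
<?php
// Hypothetical sketch: a scheduler ("master") job that fans out cheap
// per-image child jobs. None of these classes exist in MediaWiki.

interface ChildJobQueue {
	public function push( callable $job ): void;
}

// Placeholder for the actual per-image refresh.
function refreshOneImage( string $name ): void {
}

class RefreshMetadataSchedulerJob {
	public function __construct(
		private ChildJobQueue $queue,
		private int $batchSize = 1000
	) {
	}

	/** @param string[] $imageNames */
	public function run( array $imageNames ): void {
		foreach ( array_chunk( $imageNames, $this->batchSize ) as $chunk ) {
			foreach ( $chunk as $name ) {
				// One image per child job: failures stay small and the
				// queue's normal rate limiting throttles the reparse.
				$this->queue->push( static function () use ( $name ) {
					refreshOneImage( $name );
				} );
			}
		}
	}
}
```

The queue runner's rate then plays the role that the maintenance script's thread count would.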

Any other approach would be a worse fit that would make the invalidation take way longer, IMO.

Taking longer is the objective, to avoid the stampeding herd.

Tgr added a comment. Sep 12 2019, 11:24 PM

A maintenance script does not stampede. A multithreaded maintenance script is pretty much the ideal approach: determine how many refreshes are acceptable at the same time, do that amount but no more, and keep going continuously until all images are processed.

You could use some kind of probabilistic approach, or refresh on demand behind a PoolCounter or similar parallelism limit, but then getting through all the images would take practically forever, and it's probably more complexity in the long run.
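
For illustration, a throttled parallel loop of the kind I mean; fetchImageBatch() and refreshOneImage() are placeholders, not real script internals, and this assumes the pcntl extension on CLI:

```php
<?php
// Hypothetical sketch of a throttled, parallel refresh loop using
// process forking; the two helper functions are placeholders.

const MAX_CONCURRENT = 4;   // how many refreshes are acceptable at once
const BATCH_SIZE = 500;

/** Placeholder: fetch the next batch of image names to reparse. */
function fetchImageBatch( int $offset, int $limit ): array {
	return [];
}

/** Placeholder: re-read one file and rewrite its img_metadata. */
function refreshOneImage( string $name ): void {
}

$offset = 0;
$children = [];

while ( $batch = fetchImageBatch( $offset, BATCH_SIZE ) ) {
	$offset += count( $batch );

	foreach ( $batch as $name ) {
		// Never run more than MAX_CONCURRENT refreshes at a time.
		while ( count( $children ) >= MAX_CONCURRENT ) {
			$pid = pcntl_wait( $status );
			unset( $children[$pid] );
		}

		$pid = pcntl_fork();
		if ( $pid === -1 ) {
			fwrite( STDERR, "fork failed, skipping $name\n" );
		} elseif ( $pid === 0 ) {
			// Child process: do one refresh, then exit.
			refreshOneImage( $name );
			exit( 0 );
		} else {
			$children[$pid] = true;
		}
	}
}

// Let the last workers finish before exiting.
while ( $children ) {
	$pid = pcntl_wait( $status );
	unset( $children[$pid] );
}
```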

Krinkle updated the task description. Sep 13 2019, 12:34 AM
TheDJ updated the task description. Sep 13 2019, 7:43 AM
TheDJ updated the task description.