
Evaluate whether to create a RESTBase-backed JSON endpoint for image metadata
Closed, Declined · Public

Description

MediaWiki API imageinfo queries for extended unstructured metadata stored in File pages are very slow because the request involves parsing the info from the page HTML. (I believe the main reason that the /page/media endpoint is so slow is that it involves many such queries.)

@JoeWalsh and I were discussing how it would be handy to have a RESTBase-backed per-image metadata endpoint to help speed up extmetadata requests and include structured metadata from wikibase. The endpoint would be something like /media/image/metadata/{title}{/rev}. This would be useful anywhere we have MW API extmetadata queries now, which includes both the feed featured image and the media endpoint in MCS, as well as both native apps' image gallery activities.
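For reference, here is a minimal sketch of the kind of extmetadata query in question, alongside the route shape floated above. The `action=query`, `prop=imageinfo` and `iiprop=extmetadata` parameters are the standard MW API ones; the helper names and the example file title are made up for illustration, and the route is only the proposal, not an existing endpoint.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def extmetadata_url(domain: str, file_title: str) -> str:
    """Build an MW API imageinfo query asking only for extmetadata."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "extmetadata",
        "titles": file_title,
    }
    return f"https://{domain}/w/api.php?{urlencode(params)}"


def proposed_route(title: str, rev: Optional[int] = None) -> str:
    """Fill in the hypothetical /media/image/metadata/{title}{/rev} template."""
    path = "/media/image/metadata/" + quote(title, safe="")
    if rev is not None:
        path += f"/{rev}"
    return path


if __name__ == "__main__":
    # Example title is a placeholder, not a file referenced in this task.
    print(extmetadata_url("commons.wikimedia.org", "File:Example.jpg"))
    print(proposed_route("File:Example.jpg", 123))
```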

To eliminate redundancy, this endpoint would only store and return results for files stored locally to the request domain wiki; the vast majority would of course be on Commons.

@mobrovac / @Pchelolo: What do you think? Is this worth doing, at least as a bridge to when all file metadata properties have been migrated to SDC? Is Cassandra storage availability still a concern?

Event Timeline

Mholloway created this task. Jun 3 2019, 6:10 PM
Restricted Application added a subscriber: Aklapper. Jun 3 2019, 6:10 PM
Mholloway updated the task description. Jun 3 2019, 6:12 PM
Mholloway updated the task description. Jun 3 2019, 6:16 PM

This is an interesting idea. The issue here, though, is: how can updates to such metadata be triggered?

Tgr added a subscriber: Tgr. Jun 5 2019, 4:15 PM

Extmetadata will probably be killed after the SDC migration, which probably will have a different (and more complex) structure. I wouldn't put much effort in this.

MediaWiki API imageinfo queries for extended unstructured metadata stored in File pages are very slow because the request involves parsing the info from the page HTML.

Extmetadata should be cached for 30 days. So at least repeat / frequent requests should not be slow for that reason.

To eliminate redundancy, this endpoint would only store and return results for files stored locally to the request domain wiki; the vast majority would of course be on Commons.

Note that a file can be on Commons but have a description page on English Wikipedia (in which case the extmetadata from enwiki comes from there). This is super rare though.

how can updates to such metadata be triggered?

By editing the file description page (which is on the local wiki) or uploading a new version of the file (which might not be).

Extmetadata will probably be killed after the SDC migration, which probably will have a different (and more complex) structure. I wouldn't put much effort in this.

Part of why I thought this was worth proposing despite the SDC project is that this would take very little new code in MCS. When would you expect all file metadata fields to be converted to SDC?

Extmetadata should be cached for 30 days. So at least repeat / frequent requests should not be slow for that reason.

I thought there must be some MW-internal caching involved somewhere here. Still, my perception when developing services that consume extmetadata is that it's often very slow, even for repeated requests.

JoeWalsh triaged this task as High priority. Jun 12 2019, 3:46 PM

Could potentially help with the fix for T225443

JoeWalsh renamed this task from Evaluate whether to create a RESTBase-backed JSON endpoint for extended (unstructured) image metadata to Evaluate whether to create a RESTBase-backed JSON endpoint for image metadata. Jun 13 2019, 4:19 PM
JoeWalsh updated the task description.

The existing endpoints that return extended metadata (/page/media, /media/image/featured) are choosy about what fields they request from the MW API and return. Do we want to pick and choose likely candidates of interest here as well, or just return everything? I'm leaning slightly toward the latter.

If there's no significant performance penalty on the MW API for getting additional fields, I'd say it's worth just returning everything

Change 517125 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] [DNM] File metadata endpoint

https://gerrit.wikimedia.org/r/517125

Change 517135 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] [WIP] Add page media list endpoint

https://gerrit.wikimedia.org/r/517135

Tgr added a comment. Jun 16 2019, 5:13 PM

If there's no significant performance penalty on the MW API for getting additional fields, I'd say it's worth just returning everything

Depends entirely on the API. Sometimes it just means adding more fields to the SELECT clause of an SQL query (basically free), sometimes joining more tables (usually cheap), sometimes making additional queries (not so cheap).

In the case of the imageinfo API, I'd expect parsedcomment, archivename and badfile to have nontrivial performance impact. Although if the caching for extmetadata is indeed broken, that's probably the slowest by far.
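As a rough illustration of being choosy about fields, the sketch below builds a pipe-separated `iiprop` value while skipping the three properties named above. The "costly" set simply mirrors the expectation stated in this comment; it is not based on measurements, and the helper name is hypothetical.

```python
from typing import FrozenSet, Iterable

# Properties expected (not measured) to have nontrivial cost, per the comment above.
EXPECTED_COSTLY: FrozenSet[str] = frozenset({"parsedcomment", "archivename", "badfile"})


def iiprop_value(wanted: Iterable[str], skip: FrozenSet[str] = EXPECTED_COSTLY) -> str:
    """Join the requested imageinfo properties with '|', dropping skipped ones."""
    return "|".join(prop for prop in wanted if prop not in skip)


if __name__ == "__main__":
    print(iiprop_value(["url", "size", "parsedcomment", "extmetadata"]))
```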

Joe added a subscriber: Joe. Jun 17 2019, 4:55 AM

I have some concerns, but I might have misunderstood what this ticket is about.

Specifically, you are proposing to cache some specific data in RESTbase via change-propagation (I suppose), instead of fixing the MediaWiki API performance issue.

This is an antipattern we've followed too many times over the years, and we should stop. It's causing us all sorts of growing pains at multiple levels, and it's a way to mask problems instead of solving them.

Could we make the MW API faster? Is there information MW should cache and is not caching?

I concur with @Tgr that the API request shouldn't be too expensive, and if it is, we should probably find a simpler query to make / cache data at the MediaWiki level.

Hey @Joe,

Sorry, discussions around this ticket have sort of evolved in focus, but the description hasn't been updated to reflect that. As the description indicates, it started life as a possibly nice-to-have REST endpoint to work around the apparent slowness of the imageinfo API, but currently we're mostly thinking about whether it would help solve T225443: Media endpoint does not refresh structured captions. The PCS /page/media endpoint, as currently written, includes data from imageinfo and SDC on every (non-UI) image used in the requested page title, so the content cached in Varnish for that title must be purged via ChangeProp any time any of those images' File page contents or SDC content changes. That currently isn't happening (see T225443#5248591 for @Pchelolo's comment on that; we have scalability concerns). /page/media also happens to be quite slow (I think because of its heavy imageinfo usage).
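To make the purge fan-out concrete, here is a toy sketch of the dependency it implies: a reverse index from file title to the pages whose cached /page/media response embeds that file's data. This is not how ChangeProp is implemented; the function names and sample titles are purely illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_reverse_index(page_images: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each file title to the set of pages that use it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for page, files in page_images.items():
        for file_title in files:
            index[file_title].add(page)
    return index


def pages_to_purge(index: Dict[str, Set[str]], changed_file: str) -> List[str]:
    """Pages whose cached response goes stale when one file's metadata changes."""
    return sorted(index.get(changed_file, ()))


if __name__ == "__main__":
    usage = {"Cat": ["File:Cat.jpg", "File:Logo.svg"], "Dog": ["File:Logo.svg"]}
    index = build_reverse_index(usage)
    print(pages_to_purge(index, "File:Logo.svg"))
```

A widely used file touches every page that embeds it, which is where the scalability concern comes from.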

Our latest thinking is that perhaps /page/media should be scaled back to provide only a list of non-UI page titles along with a limited amount of other info gathered from the page itself, and that further data about the images themselves should be gathered in one or more separate requests. That would help to resolve both the performance and scalability concerns with the current /page/media.
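On the client side, the split could look roughly like the sketch below: take the list of file titles from a slimmed-down /page/media, then group them into `titles=` values for follow-up imageinfo requests. The batch size of 50 reflects the usual MW API limit on titles per request for ordinary clients; the helper name and the sample titles are assumptions for illustration.

```python
from typing import List, Sequence


def batch_titles(titles: Sequence[str], size: int = 50) -> List[str]:
    """Group file titles into pipe-separated values, one per imageinfo request."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return ["|".join(titles[i:i + size]) for i in range(0, len(titles), size)]


if __name__ == "__main__":
    # Placeholder titles standing in for the list returned by /page/media.
    titles = [f"File:Example {n}.jpg" for n in range(120)]
    for value in batch_titles(titles):
        print(len(value.split("|")), "titles in this request")
```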

That still leaves the question of whether this is worth doing at all (i.e., whether it would be of some value beyond having clients just call imageinfo and wbgetentities directly, and if so, whether that value is enough to outweigh the maintenance cost). With the feedback from you and @Tgr I am probably leaning toward "decline," myself.
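For comparison, a direct client call for the structured side might look like this sketch. It assumes the MediaInfo entity ID is "M" followed by the file's page ID on Commons and that captions come back under `labels`; the page ID and the response dict below are hand-written stand-ins, not real API output.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlencode


def sdc_url(page_id: int) -> str:
    """Build a wbgetentities request for a file's MediaInfo entity on Commons."""
    params = {"action": "wbgetentities", "format": "json", "ids": f"M{page_id}"}
    return "https://commons.wikimedia.org/w/api.php?" + urlencode(params)


def caption(response: Dict[str, Any], page_id: int, lang: str = "en") -> Optional[str]:
    """Pull the caption for one language out of a wbgetentities-style response."""
    entity = response.get("entities", {}).get(f"M{page_id}", {})
    label = entity.get("labels", {}).get(lang)
    return label["value"] if label else None


if __name__ == "__main__":
    # Hand-written sample with a made-up page ID, shaped like the expected response.
    sample = {"entities": {"M123": {"labels": {"en": {"language": "en", "value": "A sample caption"}}}}}
    print(sdc_url(123))
    print(caption(sample, 123))
```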

Change 517125 abandoned by Mholloway:
[DNM] File metadata endpoint

https://gerrit.wikimedia.org/r/517125