Page MenuHomePhabricator

Reader gets file description
Open, MediumPublic

Description

"As a Reader, I want to get the current version of a media file, so I can read, view or listen to it. "

GET /file/{title}

Get information about a file.

Notable request headers: none

Request body: None

Status:
200 - OK
404 – No such file

Notable Headers: none

Body: JSON

  • title: title of the file
  • latest: latest file revision, with the following properties:
    • timestamp: last modified timestamp, YYYY-MM-DDTHH:MM:SSZ
    • user: user object for the uploader with the following properties
      • id: numeric ID or null
      • name: registered user name or other identifier
  • default: information on the default, easy representation of the file for representation in a document, with these elements
    • mediatype: media type for the preferred representation of the file
    • size: size of the preferred representation of the file in bytes
    • width: width of the preferred representation of the file in pixels if applicable (image, video, audio ...?)
    • height: height of the preferred representation of the file in pixels if applicable (image, video, ...?)
    • duration: temporal duration of the the preferred representation of the file in seconds if applicable (video, audio, animated gif/png, ...?)
    • url: full URL of the binary version of the preferred representation of the file (the image/audio/video/document itself)
  • thumbnail: information on a bitmap image smaller than 512x512 pixels, that represents the file, such a smaller version of an image file, a still from a video, or an icon for audio or a document {size, width, height, duration, mediatype, url}
  • original: information on the original representation of the file as uploaded, with {size, width, height, duration, mediatype, url}
  • file_description_url: URL to the HTML page for the file description, which can help with license information

Event Timeline

Why get the metadata (and create an indirection) instead of serving the file directly? GET /file/{title} really says get me that file, and if, as the story says, I want to read, view or listen to the file, then I'm really interested in the file, not its metadata. I really think we should separate out metadata information into its own hierarchy.

Wrt status codes, I have the same comments as in T229663#5431424 :)

eprodromou subscribed.

@mobrovac We already have a hierarchy for getting the content of the file. This is the metadata endpoint. I'll update the user story to reflect that.

@mobrovac We already have a hierarchy for getting the content of the file. This is the metadata endpoint. I'll update the user story to reflect that.

Is there anything comprehensive yet, for the endpoints already implemented, and the ones being proposed? Something that would make it easier to see how these fit together as parts of the whole?

@mobrovac We already have a hierarchy for getting the content of the file. This is the metadata endpoint. I'll update the user story to reflect that.

I'll just chime in here with a me too; As I read the user story, I was surprised to find us returning JSON-encoded metadata (based on the resource naming and description).

id: ID of the file

Files don't have IDs.

versionID: version ID of the file

This is also not a pre-existing concept. Files just have timestamps.

url: location of the API entry

What is this?

Here are some properties returned by the action API's imageinfo which are missing from this user story and may be desired:

  • timestamp
  • width
  • height
  • pagecount
  • duration
  • sha1
  • metadata
  • commonmetadata
  • extmetadata
  • mediatype

Here are some properties returned by ParsoidBatchAPI, which are needed by Parsoid and anything else attempting to display the image:

  • mustRender
  • badFile

Note that the original source file URL is typically not what a client needs to display a file. I'm not sure if it is wise to encourage download of the original source file by providing this link in the fileUrl property and Location header.

I updated the endpoint as follows:

  • Removed the status codes that were causing issues
  • Removed ID and version ID
  • Added in some of the properties @tstarling mentioned from the Action API
  • Split the representation into three forms:
    • original, as uploaded (more or less)
    • modified, array of representations that are converted to different sizes, or to different media types
    • preferred, "best" representation for lazy developers who just want to plop the file into a widget or HTML page ("the representation that appears when you open the corresponding File:Filename.ext page" is my best definition here)

I want to add one more property, which is thumbnail, that is, a preferred thumbnail image for a file. Do we keep a preferred thumbnail?

One last thing: per API/LICENSE from the Architecture Principles, I've added license info, author, and contributor info. Yes, I realize that author=uploader, contributors=all previous uploaders is not entirely fair, but I think it's better than nothing.

How do I get "transforms" and know which representation is "preferred" vs "latest"?

I found RepoGroup::findFile() and the File class, which gets me most of the desired info, but I'm not clear on how multiple representations of a file work. Or am I looking at entirely the wrong code?

Per my previous discussion with @eprodromou , the preferred transform is a thumbnail or poster which would be displayed on the image description page, or an icon thumb if there is no other thumbnail image. To do this, you will need to duplicate some of the logic from ImagePage::openShowImage(). In particular, around line 415:

			if ( $this->displayImg->allowInlineDisplay() ) {
...
				$thumbnail = $this->displayImg->transform( $params );
...

$params takes a bit of work to derive.

If $thumbnail is a MediaTransformError then you should pass the error message back to the user if possible. If $thumbnail is false, that is also an error. If $thumbnail is a MediaTransformOutput and getUrl() is not false, and getWidth() and getHeight() are nonzero, then you can probably return them back to the user assuming they represent a valid thumbnail. Icon thumbnails should need no special handling since they are embedded in a MediaTransformOutput when necessary.

I'm not sure what should be in "transforms", but it could possibly be treated like an srcset attribute. See Linker::processResponsiveImages() and the call to it from ImagePage::openShowImage().

I'm hitting an unexpected little issue. The extension is part of the title, so I need to hit urls like (from my local, but similar urls will be true for live):

http://127.0.0.1:8082/mediawiki/rest.php/coredev/v0/files/File:Bill_Pirkle_Key_Largo_2016.jpg

This gives the error

Invalid file extension found in the path info or query string.

This is because of a security check in WebRequest.php. It is discussed in T85751: "Forbidden error" on en.wp for ?action=raw when page name contains . or %2E and you can see the security check code here

I think the simple fix is to change the path to have something after title, so maybe /file/{title}/metadata or something, so that the url doesn't end in the file extension.

@eprodromou , we talked in the check-in meeting regarding the "thumbnail" naming in the code Tim posted, and a concern that we didn't want the preferred image to be a small thumbnail. Local testing of uploading a 1000x667 image resulted in the API returning a preferred image size of 800x534, which seems reasonable to me.

Questions for @eprodromou :

  • path: can't end in a file extension due to security check, can we add /metadata (or something, anything) at the end?
  • what if "preferred" or "transforms" aren't available? Return "null" for that field? Other options would be an empty array, omit, or HTTP error
  • if there are no "transforms" should that return an empty array? Note that this means we were able to look for transforms and found there were none, which is different than the question above where we weren't even able to look.
  • "original" is specified for this endpoint, but not /page/{title}/links/media. Should we be consistent?
  • for all detail blocks (original, preferred, transforms) if "width" and "height" are unavailable, return "null"? (other options: omit or HTTP error)
  • "duration" returns 0 for things that don't have a duration (that's how our file handling code works). Is that okay?
  • file_description_url is top level for /file/{title} but in "preferred" for /page/{title}/links/media? Can we move it to top-level there too?

One thing Tim brought up was that we should look to MediaViewer and Parsoid to see what image data was needed for real-world API clients. Bill and I split up that task; I did MediaViewer.

MV uses two Action API query calls to get information about files:

  • filerepoinfo returns information about available media repos for the wiki. I'm not sure why this is needed; I'm going to delve into it a little more.
  • imageinfo returns information about a file (not necessarily an image).

For imageinfo, MediaViewer requests the following properties:

  • timestamp
  • url
  • size
  • mime
  • mediatype
  • extmetadata

For the extended metadata, it asks for the following:

  • DateTime
  • DateTimeOriginal
  • ObjectName
  • ImageDescription
  • License
  • LicenseShortName
  • UsageTerms
  • LicenseUrl
  • Credit
  • Artist
  • AuthorCount
  • GPSLatitude
  • GPSLongitude
  • Permission
  • Attribution
  • AttributionRequired
  • NonFree
  • Restrictions
  • DeletionReason

I'll look into which of these map directly to the properties we have above, and which we need to add.

tl;dr: let's add "mustrender" and "badfile" (or snakecased version of those names as you prefer) to the response data.

Here are some properties returned by ParsoidBatchAPI, which are needed by Parsoid and anything else attempting to display the image:

  • mustRender

This appears to be conveniently available from https://gerrit.wikimedia.org/g/mediawiki/core/+/864d4fb4c86925039d8d9ae44e3973e0dc7e4127/includes/filerepo/file/File.php#774 and could be added to the response data.

  • badFile

It looks like ParsoidBatchAPI gets this from https://gerrit.wikimedia.org/g/mediawiki/core/+/864d4fb4c86925039d8d9ae44e3973e0dc7e4127/includes/GlobalFunctions.php#2914 , but I notice that's deprecated in favor of the BadFileLookup service. Presumably I can inject that service and use it to get this data.

Invalid file extension found in the path info or query string.

The fix for this is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/552679 . I added you as a reviewer.

Invalid file extension found in the path info or query string.

The fix for this is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/552679 . I added you as a reviewer.

Thank you. WIP patch adjusted accordingly.

Summary of my local testing:

File types tested:

  • raster (jpg)
  • vector (svg)
  • audio (ogg)
  • video (webm)

Results:

  • size is constant, reflecting only the size of the original file. Recommend we include it only for original.
  • duration is constant, reflecting only the duration of the original file (even for “default” and “thumbnail” when the url refers to an image and not an audio file, the duration of the audio file is retrieved). Recommend we include it only for original.
  • mediatype: ditto, recommend including only for original
  • width and height are accurate for images, even for “default” and “thumbnail”, but unavailable for audio file thunbnails. Recommend we include as null.
  • “url” is fine any time it is available. For “default” and “thumbnail”, url will retrieve the desired (scaled) file. For audio, url will retrieve a placeholder image suitable for an icon
  • “default” and “thumbnail” are unavailable for videos (at least, for webm). Recommend we include as “null”.
  • the exact values returned are influenced by extensions. Support for audio and video requires the TimedMediaHandler extension (which is available on production). It is theoretically possible for third-party installations to enable different extensions that have different behavior (not sure any exist, but people could privately code whatever they like)

Summary of recommendations:

  • make the following fields available only on "original" (or alternatively, at the top level): size, duration, mediatype
  • when a field is available, include it as "null", to proactively indicate that it does not exist (applies to width, height, default, and thumbnail in various cases)

I am assuming that we do not want to change/extend core capability regarding media files. I notice there are other media-related extensions (MP3MediaHandler, MultimediaPlayer, MultimediaViewer, etc.), and I suspect they can interrogate media files in additional ways. That seemed outside the scope of this task, so I focused only on what data current core code could provide, and testing that in a standard configuration.

Just found that I can get a thumbnail size from options, in the same way I get the default size. I can use that instead of the somewhat arbitrary 512 from the task description. That'll make it overridable in settings, and more likely to correspond to an image already existing in the filesystem. On my local wiki this size is 320x240, so seems appropriate.

  • make the following fields available only on "original" (or alternatively, at the top level): size, duration, mediatype

Can you just put "null" for information we don't have? It's unfortunate we don't get file size or mediatype for thumbnails, but that seems like something we could fix in the future. I'd like to keep the duration on all file representations even if the value is inapplicable or we don't know it.

I'll open tickets for this stuff.

  • make the following fields available only on "original" (or alternatively, at the top level): size, duration, mediatype

Can you just put "null" for information we don't have? It's unfortunate we don't get file size or mediatype for thumbnails, but that seems like something we could fix in the future. I'd like to keep the duration on all file representations even if the value is inapplicable or we don't know it.

I'll open tickets for this stuff.

tl;dr: I'm going to disagree with you, then suggest a compromise.

You replied to the "size/duration/mediatype" part. They're a little different from each other:

size: this is flat-out wrong for everything but original. I could put in "null" for cases where it is currently incorrect. But I disagree that we could (realistically) make size available for other cases in the (reasonably near) future. Our media file handling system, from what I can tell, achieves scaling by not actually creating transforms until they're requested. Overhauling that subsystem for the purpose of providing more data in an API endpoint seems both backwards, and a multi-team coordination effort. And creating an API endpoint that is going to return size as "null" for for months or (more likely) years feels incorrect. I'm concerned we'll make users of the API think something is broken, or that they're using the API incorrectly. I'd much rather return only the data we have now. If in the future we have more data - great! We have a versioning system for the api, so we can cleanly add it.

duration/mediatype: these aren't technically wrong, they're just redundant. Well, they're kinda wrong in the default/thumbnail case where the url refers to an image while duration/mediatype continue to refer to the original file. But the image is a stand-in for the actual file, so that's arguably okay. An additional subtlety is that the still image vs actual playable media file behavior is governed by an extension, and doing anything "smart" with these would be making a risky assumption in core logic that might change without warning. I'd still rather omit these in the questionable cases.

In the interest of moving this endpoint forward, how about I go with "null" for size in the cases we know it is incorrect, and keep including the redundant (and slightly misleading) duration/mediatype values for now? This endpoint is still under coredev/v0, so it is experimental and we can continue to fiddle with the formats even after we merge it. Then we can see what kind of traction you get with the other tickets you're opening, and possibly reconsider handling of these fields before we move the endpoint under v1.

duration is also set for animated gif/png (i updated this in the description)

size: "But I disagree that we could (realistically) make size available for other cases in the (reasonably near) future."

I've been considering adding sizes to the generated transcodes of videos for some time. They are expensive resources and it would be valuable to know their size, but I don't think that should change your assessment of this right now.

mediatype: ditto, recommend including only for original

While mediatype is consistent, please note that mimetype is NOT consistent (and arguably more valuable/useful than mediatype).

For audio, url will retrieve a placeholder image suitable for an icon

Note that this is mostly an artifact of history, as we had lots of systems that assumed everything had an image thumbnail, since that was the only form of transclusion we supported. In general we no longer show this thumbnail for audio files at the html side.

Adding missing MediaWiki-REST-API code project tag as Platform Team Initiatives (MW REST API in PHP) team tag is archived and its parent Platform Engineering team does not exist anymore