
Reader gets media links
Open, Medium, Public

Description

"As a Reader, I want to get a list of media files embedded in a page, so that I can view, read or listen to them."

GET /page/{title}/links/media
Returns the media links for the page.

Notable request headers: none

Request body: None

Status:
200 - OK
404 - Not Found: there has never been such a page

Body: JSON
Object with these properties:

  • files: array of objects representing files embedded in the page; each object has the following properties:
    • title: title of the file
    • file_description_url: URL for the HTML file description page for this file, to get license info and other metadata
    • latest: latest file revision, with the following properties:
      • timestamp: last modified timestamp, YYYY-MM-DDTHH:MM:SSZ
      • user: user object for the uploader with the following properties
        • id: numeric ID or null
        • name: registered user name or other identifier
    • default: information on the default representation of the file for display in a document, with the following properties:
      • mediatype: media type for the preferred representation of the file
      • size: size of the preferred representation of the file in bytes
      • width: width of the preferred representation of the file in pixels if applicable (image, video, ...?)
      • height: height of the preferred representation of the file in pixels if applicable (image, video, ...?)
      • duration: temporal duration of the preferred representation of the file in seconds if applicable (video, audio, ...?)
      • url: full URL of the binary version of the preferred representation of the file (the image/audio/video/document itself)
    • thumbnail: information on a bitmap image smaller than 512x512 pixels that represents the file, such as a smaller version of an image file, a still from a video, or an icon for audio or a document, with {size, width, height, duration, mediatype, url}
    • original: information on the original representation of the file as uploaded, with {size, width, height, duration, mediatype, url}
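
For illustration, a response that follows the property list above might look like the sketch below. Every value (file title, URLs, sizes, user) is an invented placeholder. The "BITMAP" mediatype string and the null duration for an image are assumptions, since the description does not enumerate mediatype values.

```python
import json

# Hypothetical body for GET /page/{title}/links/media. Property names follow
# the task description; every value below is an invented placeholder.
SAMPLE_BODY = """
{
  "files": [
    {
      "title": "Example.jpg",
      "file_description_url": "https://example.org/wiki/File:Example.jpg",
      "latest": {
        "timestamp": "2019-10-25T18:37:00Z",
        "user": {"id": 1, "name": "ExampleUser"}
      },
      "default": {
        "mediatype": "BITMAP", "size": 204800, "width": 1000, "height": 750,
        "duration": null, "url": "https://example.org/media/1000px-Example.jpg"
      },
      "thumbnail": {
        "mediatype": "BITMAP", "size": 20480, "width": 320, "height": 240,
        "duration": null, "url": "https://example.org/media/320px-Example.jpg"
      },
      "original": {
        "mediatype": "BITMAP", "size": 4096000, "width": 4000, "height": 3000,
        "duration": null, "url": "https://example.org/media/Example.jpg"
      }
    }
  ]
}
"""


def default_urls(body: dict) -> list:
    """Return (title, default URL) pairs in the order the files are listed."""
    return [(f["title"], f["default"]["url"]) for f in body["files"]]
```

A client would call default_urls(json.loads(SAMPLE_BODY)) to get one (title, URL) pair per embedded file.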

Event Timeline

Same comments as in T230846 regarding HTTP statuses.

id: ID of the file

What is that?

versionID: version ID of the file

What is that? AFAIK files are versioned by timestamp, is it a timestamp?

url: location of the API entry

What is the API entry?

CCicalese_WMF triaged this task as Medium priority (Oct 23 2019, 7:33 PM).

Looks like I said the same thing as @Pchelolo above at T230848#5605205.

Also, as I said in that comment, the response properties listed here are not actually the information needed to "view, read or listen" to the files. To display an image, the client needs a thumbnail URL of an appropriate size. The thumbnail is the only way to display an image with mustRender=true. To display a video or audio file, content negotiation may be required.

If the idea is to support client-side media display, we should have a close look at existing examples, such as MultimediaViewer and Parsoid. If we're not delivering the information that existing clients already need, the endpoint probably won't effectively support future clients.

If a client developer asked me how to implement media display, I would not recommend reimplementing Parsoid or MultimediaViewer, I would recommend parsing the Parsoid HTML and modifying it as necessary. But I suppose we can whip up something to put here if we need a toy endpoint for demonstration purposes.

eprodromou updated the task description (Oct 25 2019, 6:37 PM).
eprodromou updated the task description.
eprodromou updated the task description (Oct 25 2019, 6:41 PM).

Updated as follows:

  • removed id, version id
  • added license data
  • added preferred representation from T230848
  • added URL for the API endpoint defined by T230848 so the client can get more information (like other formats or sizes)

If the idea is to support client-side media display, we should have a close look at existing examples, such as MultimediaViewer and Parsoid.

Thanks. I'll check those and see what we're missing.

eprodromou updated the task description (Oct 28 2019, 8:56 PM).
eprodromou updated the task description.

There's already a media-list endpoint used by the Android app. We need to talk with the Reading Infrastructure team and try to align the two endpoints, so that the app could eventually switch to the new endpoint.

eprodromou updated the task description (Oct 30 2019, 11:08 PM).
BPirkle added a subscriber: BPirkle (Nov 6 2019, 6:39 PM).

Following @Pchelolo's comment in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/548860/, should we use v1/page/{title}/links/media instead of v1/page/{title}/medialinks?

Then we could cleanly have both:

v1/page/{title}/links/media
v1/page/{title}/links/language

And we'd have a tidy place to put any other types of links we might need.
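
As a rough sketch of the proposed layout, the helper below builds both paths from a page title. The base URL is a hypothetical placeholder, and percent-encoding the whole title as a single path segment (including any "/" in it) is an assumption, not something this thread specifies.

```python
from urllib.parse import quote

# Hypothetical base URL; the thread only gives the path shape
# v1/page/{title}/links/{kind}.
BASE = "https://example.org/api"

# The two link kinds named in this thread.
LINK_KINDS = ("media", "language")


def links_url(title: str, kind: str) -> str:
    """Build v1/page/{title}/links/{kind} for one page title.

    Assumption: the title is percent-encoded as one path segment, so a
    "/" inside the title does not split the path.
    """
    if kind not in LINK_KINDS:
        raise ValueError(f"unknown link kind: {kind}")
    return f"{BASE}/v1/page/{quote(title, safe='')}/links/{kind}"
```

With this shape, another link type would only need a new entry in LINK_KINDS.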

Following @Pchelolo's comment in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/548860/, should we use v1/page/{title}/links/media instead of v1/page/{title}/medialinks?

Sure, that's fine. I'll update the user stories.

eprodromou updated the task description (Nov 6 2019, 6:40 PM).

Are there any de facto test pages for media links? For example, any pages with a very large number of embedded files or files with notably uncommon characteristics?

I know reading infrastructure were using https://en.wikipedia.org/wiki/User:BSitzmann_(WMF)/MCS/Test/Frankenstein as a page with a lot of different weird characteristics.

There's already a media-list endpoint used by the Android app. We need to talk with the Reading Infrastructure team and try to align the two endpoints, so that the app could eventually switch to the new endpoint.

Where is the actual implementation that powers this endpoint? I noticed some differences between the results my query gave for the test page you mentioned (when I ran it manually against enwiki) and the results from this endpoint, and I'd like to see how that endpoint generates its results.

I cloned the restbase repo and found media-list.yaml, but I'm not sure where to go from there.

Hi @BPirkle,

The specific files you'll want to look at are routes/page/media.js and lib/media.js.

I also left some thoughts on this on the Product-Platform Sync doc (link) (see notes from 2019-10-31), which I've copied below:


Yes, it looks in principle like this could replace media-list endpoint (and should, assuming it will be coming into existence in something like its proposed form). A couple of thoughts and questions:

  • It's important to the apps that the response list the files in the order that they appear on the page.
  • Will this include Mathoid images? One of the use cases of media-list for the apps is for saving page content for offline use, so they will want Mathoid images to be included. (The showInGallery property in the media-list endpoint is to tell the app that images like Mathoid math formulas should not be shown in the apps' native image gallery activity.)
  • A mobile client will typically want a thumbnail URL for a suitable size, or a set of thumbnail URLs it can choose from, in addition to the original image URL.

Related to the last point, I am skeptical of the notion of a single "preferred representation" of a file, because that necessarily depends on the client. I guess Tim is making largely the same point in his comment.
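
To make the client-dependence concrete, here is a minimal sketch of how a client might choose among the thumbnail, default and original properties listed in the task description, given the width it wants to display. The selection policy (smallest representation that is wide enough, else the widest one available) is an assumption for illustration, not part of the proposal.

```python
from typing import Optional


def pick_representation(file: dict, target_width: int) -> Optional[dict]:
    """Pick the smallest representation at least target_width pixels wide.

    Falls back to the widest representation with a known width. If no
    representation has a width (audio, say), returns "default" as-is.
    """
    candidates = [
        file[key]
        for key in ("thumbnail", "default", "original")
        if file.get(key) and file[key].get("width") is not None
    ]
    if not candidates:
        return file.get("default")
    big_enough = [rep for rep in candidates if rep["width"] >= target_width]
    if big_enough:
        return min(big_enough, key=lambda rep: rep["width"])
    return max(candidates, key=lambda rep: rep["width"])
```

A phone showing a 300-pixel-wide slot and a desktop lightbox would get different answers from the same file object, which is the point made above.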

I've posted a speculative work-in-progress patch to T236169. Among its deficiencies are wholesale copying of two functions from the ImagePage class, and its failure to factor out code that will be needed for T236170. Other questions or known areas that need attention are marked with todo's.

I'm posting it as-is because I'm concerned about whether this is even headed the right way. This comment from @tstarling concerns me:

If a client developer asked me how to implement media display, I would not recommend reimplementing Parsoid or MultimediaViewer, I would recommend parsing the Parsoid HTML and modifying it as necessary. But I suppose we can whip up something to put here if we need a toy endpoint for demonstration purposes.

I don't think any of us have a desire to build a toy demonstration endpoint. So if the general approach I'm taking in the patch that I posted is not going to satisfy client needs, let's reconsider how we're approaching this.

One thing I think is worth making explicit on this ticket is whether this endpoint is intended to return only media items that are part of the article content, or all linked media items. For example, on enwiki, many articles with linked media content contain an imagelink to Commons-logo.svg via the inclusion of Template:Commons_category. Is the intent to include images like these in the response, or to try and restrict inclusion to images more likely to be interesting to the caller? The media/media-list endpoints in mobileapps attempt to do the latter, but I'm not sure whether you intend to go down that road again here.

So, I think there's a "preferred" file representation, which is, "If you don't know what you want to do with this file, use this, and it will mostly do what you want." "default" might be a better name for this property. I'd think of this "default" representation as being a "normal" sized image (I'll say around 1000px in one direction), a medium-quality version of a video in a preferred format (ogv, say), medium-quality version of audio (ogg, say), and just the original for other documents and binary data.

I don't think that's a thumbnail. However, it may make sense for us to have an additional "thumbnail" file representation, which is either a "small" version of an image, a still from a video, or an icon for audio, documents, or binary data.

Questions for @eprodromou :

  • we do not need a limit on the underlying query that retrieves images for a page, do we? (if we did, we might need pagination)
  • do we really want an extra level of indirection in the response value? We could omit "files" and just return an array of file objects.
  • what if "preferred" isn't available? Return "null" for that field? Other options would be an empty array, omit, or HTTP error
  • we don't return transforms here, but we do in /file/{title}. That's probably best, for performance, but just confirming it is intentional
  • "original" is specified for /page/{title}/links/media but not this endpoint. Should we be consistent?
  • if "width" and "height" are unavailable, return "null"? (other options: omit or HTTP error)
  • "duration" returns 0 for things that don't have a duration (that's how our file handling code works). Is that okay?
  • file_description_url is top level for /file/{title} but in "preferred" for /page/{title}/links/media? Can we move it to top-level here too?

eprodromou updated the task description (Dec 9 2019, 6:59 PM).
eprodromou updated the task description (Dec 9 2019, 7:01 PM).

  • we do not need a limit on the underlying query that retrieves images for a page, do we? (if we did, we might need pagination)

This is a good question. My understanding is that images on a page are typically O(10) and at the limits O(100). I think that doesn't require segmenting the list. Can we verify this with a database query?

  • do we really want an extra level of indirection in the response value? We could omit "files" and just return an array of file objects.

One of the design principles is that we always return an object for REST results. This guards against a bug in older browsers (JSON hijacking), where an attacker can override the Array constructor and read a top-level JSON array returned to the client. OWASP recommends always returning an object as the outermost JSON value for this reason.
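
A minimal sketch of that principle: the list of file objects is always serialized under a "files" key, never as a bare top-level array. The helper name is hypothetical.

```python
import json


def wrap_files(files: list) -> str:
    """Serialize file objects inside a top-level JSON object, never a bare array."""
    return json.dumps({"files": files})
```

Even an empty result is returned as {"files": []}, so the outermost JSON value is an object in every case.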

  • what if "preferred" isn't available? Return "null" for that field? Other options would be an empty array, omit, or HTTP error

I think the "default" value should always be available. I guess there might be some situations where no representation of the file is available...? Unless that's common, I'd say an HTTP 500 error.

  • we don't return transforms here, but we do in /file/{title}. That's probably best, for performance, but just confirming it is intentional

Yes, intentional.

  • "original" is specified for /page/{title}/links/media but not this endpoint. Should we be consistent?

I think "original" makes sense here, and I think it's been added since you asked.

  • if "width" and "height" are unavailable, return "null"? (other options: omit or HTTP error)

If "width" and "height" are not applicable (say, for an audio file), then null.

If they're applicable (say, for an image file), but not present, I'd say... still make them null. This seems like a data problem, and something we should fix, but I don't think it's fatal.

  • "duration" returns 0 for things that don't have a duration (that's how our file handling code works). Is that okay?

I'd prefer "null" for files that don't have a time extent, like images, data blobs, or documents. Zero-length videos or audio I guess should still have 0 duration.
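
The null rules from the last few answers can be sketched as a small normalization step. The media type names used here ("BITMAP", "DRAWING", "VIDEO", "AUDIO") are assumptions for illustration; the task description does not list the possible mediatype values, and the function name is hypothetical.

```python
from typing import Optional

# Assumed media type names; not taken from the task description.
VISUAL_TYPES = {"BITMAP", "DRAWING", "VIDEO"}
TEMPORAL_TYPES = {"AUDIO", "VIDEO"}


def normalize_extent(
    mediatype: str,
    width: Optional[int],
    height: Optional[int],
    duration: Optional[float],
) -> dict:
    """Apply the null rules discussed above to one representation.

    - width/height: null when not applicable (audio, say); a missing value
      for an applicable type also stays null rather than raising an error.
    - duration: null for files with no time extent, even if the file
      handling code reports 0; a zero-length audio or video file keeps 0.
    """
    visual = mediatype in VISUAL_TYPES
    temporal = mediatype in TEMPORAL_TYPES
    return {
        "width": width if visual else None,
        "height": height if visual else None,
        "duration": duration if temporal else None,
    }
```

For example, an image whose handler reports a duration of 0 comes out with "duration": null, while a zero-length audio clip keeps 0.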

  • file_description_url is top level for /file/{title} but in "preferred" for /page/{title}/links/media? Can we move it to top-level here too?

That looks like an error. I'll fix it here.

eprodromou updated the task description (Jan 8 2020, 2:31 PM).

  • we do not need a limit on the underlying query that retrieves images for a page, do we? (if we did, we might need pagination)

This is a good question. My understanding is that images on a page are typically O(10) and at the limits O(100). I think that doesn't require segmenting the list. Can we verify this with a database query?

Hrm, https://en.wikipedia.org/?curid=17179817 is unpleasant:

SELECT COUNT(*) FROM imagelinks WHERE il_from=17179817;
+----------+
| COUNT(*) |
+----------+
|    11163 |
+----------+
1 row in set (0.01 sec)

The database query portion of the medialinks handler execution is very fast even for this page (small fractions of a second), presumably because this is a simple query and everything is indexed. I'm not sure how long the PHP portion would take to process the results and get the file info.

Here are the top ten pages on enwiki by number of image links:

SELECT il_from, COUNT(*) AS count FROM imagelinks GROUP BY il_from ORDER BY count DESC LIMIT 10;
+----------+-------+
| il_from  | count |
+----------+-------+
| 17179817 | 11163 |
| 34950575 | 10337 |
| 18555531 |  9991 |
| 29202369 |  9211 |
| 21325105 |  8388 |
| 19186599 |  8253 |
| 18558089 |  6431 |
| 56093453 |  6269 |
| 18555748 |  5675 |
| 46867042 |  5147 |
+----------+-------+
10 rows in set (3 min 1.57 sec)

So it does fall off fairly sharply. When I queried for the top 100 pages, the 100th had 1500 rows.

Testing via eval.php against enwiki suggests the endpoint would take around a minute to execute against the pages with the largest numbers of image links. That's not ideal, but given that most of this is PHP time and not database time, I'm not certain how horrible it is.

Thanks for the replies, @eprodromou. Changes made and new patchset uploaded under engineering task T236169. I still need to sync changes with T236170 (file description endpoint) and resolve some cut-and-paste code, but that's mostly just housekeeping.

daniel added a subscriber: daniel (Mar 30 2020, 2:20 PM).

I see a MediaLinksHandler in the codebase. Has this been implemented? What's the status?