
mobile-html-offline-resources endpoint
Open, High · Public

Description

Background information

In order to save an article for offline viewing, the apps need to know of any related files that would need to be downloaded as well. There's currently an endpoint for media, but there should be an additional endpoint for any other related files that the article would need.

What

The mobile-html-offline-resources endpoint would take a page title and revision and return a list of related scheme-less URLs.
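A scheme-less (protocol-relative) URL starts with "//" and carries no "https:" prefix, so a client has to pick a scheme before it can download the file. A minimal sketch of that step follows; the helper name resolve_schemeless and the default of "https" are assumptions made for illustration, not part of the proposal.

```python
def resolve_schemeless(url: str, scheme: str = "https") -> str:
    """Turn a scheme-less URL such as //host/path into an absolute URL.

    Hypothetical helper: the name and the default scheme are assumptions.
    URLs that do not start with "//" are returned unchanged.
    """
    if url.startswith("//"):
        return f"{scheme}:{url}"
    return url


# Example with a URL taken from the discussion in this task.
css_url = resolve_schemeless("//meta.wikimedia.org/api/rest_v1/data/css/mobile/base")
```

The stored page can keep the scheme-less form; only the download step needs the absolute URL.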

Event Timeline

JoeWalsh created this task. Feb 28 2019, 5:17 PM
Restricted Application added a subscriber: Aklapper. Feb 28 2019, 5:17 PM
bearND added a subscriber: bearND. Edited Feb 28 2019, 6:34 PM

I've been thinking we could add a JS function in the page library which would get all the resources needing to be saved for offline.
It could even do the DOM transformation before writing the file to disk if the replacement is predictable. For reading while offline, we could apply the reverse transformation, which would get invoked after reading the stored page from disk.

Edit: Is there a way to go through a WebView when saving a page for offline from a link, similar to what you do when you cover up the WebView content during regular page load on iOS?

@bearND I think the answer to that question would be no, unless I'm misunderstanding your question - when we save for offline, we don't interact with the webView at all. We initialize it when the user is about to view an article.

bearND added a comment. Mar 5 2019, 4:55 AM

Bummer. That's going to be tough. I don't think Android needs something like that since they use the same networking library (and interceptors) when they save for offline. Would it be possible to instantiate a hidden WebView somewhere to still get to run JavaScript DOM transformations?

Once T217348 is merged and this endpoint is implemented, we should be fine without additional DOM transformations for offline. I wrongly assumed that it would be necessary to swap out external links for local files in the HTML. It's not required if the links are scheme-less and a correct content security policy is in place.

Is this what you were concerned about (transforming DOM to get the links right) or were you thinking about other transformations?

bearND added a comment. Mar 5 2019, 4:08 PM

Great. That's much easier. (Yes, I was thinking about the DOM transformations you'd do when changing the URLs for reading and writing external links for offline use. Not sure if you do it both ways or just one way.)

We need to explicitly list here all of what this endpoint will provide.

bearND added a comment. Edited Mar 7 2019, 5:08 PM

My current understanding based on the discussion we had today is that the output of this endpoint should include:

  • all linked CSS
  • all linked JS
  • all URLs of <img> tags (the media endpoint only has a subset of images because it is meant for the gallery)
  • possibly links to video and audio files, too?

Every item in this list should have a mime type.

We could either:

  • expand the /page/media endpoint to include this, since there is a flag for showInGallery, but that would mean it includes non-media files, which seems bad, or
  • add a new endpoint
This comment was removed by JoeWalsh.
JoeWalsh added a comment. Edited Mar 8 2019, 7:09 PM

I think /page/media/ could be removed and replaced with a unified endpoint.

It would return a list of related files, some of which have a "media" property with the same information as the old media endpoint:

[
  {
    "url": "//meta.wikimedia.org/api/rest_v1/data/css/mobile/base",
    "mime": "text/css"
  },
  {
    "url": "//meta.wikimedia.org/api/rest_v1/data/javascript/mobile/pagelib",
    "mime": "text/javascript"
  },
  {
    "url": "https://upload.wikimedia.org/wikipedia/commons/d/d9/Collage_of_Nine_Dogs.jpg",
    "mime": "image/jpeg",
    "media": {
      "section_id": 0,
      "type": "image",
      "showInGallery": true,
      "titles": {
        "canonical": "File:Collage_of_Nine_Dogs.jpg",
        "normalized": "File:Collage of Nine Dogs.jpg",
        "display": "File:Collage of Nine Dogs.jpg"
      },
      "thumbnail": {
        "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Collage_of_Nine_Dogs.jpg/320px-Collage_of_Nine_Dogs.jpg",
        "width": 320,
        "height": 281,
        "mime": "image/jpeg"
      },
      "original": {
        "source": "https://upload.wikimedia.org/wikipedia/commons/d/d9/Collage_of_Nine_Dogs.jpg",
        "width": 1665,
        "height": 1463,
        "mime": "image/jpeg"
      },
      "file_page": "https://commons.wikimedia.org/wiki/File:Collage_of_Nine_Dogs.jpg",
      "artist": {
        "html": "...html here...",
        "text": "YellowLabradorLooking_new.jpg\nGolden_Retriever_Sammy.jpg\nCockerpoo.jpg\nLonghaired_yorkie.jpg\nBoxer_female_brown.jpg\nMilù_050.JPG\nBeagle1.jpg\nBasset_Hound_600.jpg\nNewfoundland_dog_Smoky.jpg"
      },
      "license": {
        "type": "CC BY-SA 3.0",
        "code": "cc-by-sa-3.0",
        "url": "https://creativecommons.org/licenses/by-sa/3.0"
      },
      "description": {
        "html": "perros",
        "text": "perros",
        "lang": "es"
      }
    }
  }
]

/page/media does a lot more than just listing resources. It has associated metadata for the resources themselves, and their relation with the page that embeds them, like the section id for example.

I think they serve very different use cases.

I agree with Bernd’s list above, with a definitely yes for video and audio.

The mime types are for letting clients decide if they want to save certain assets for offline. For example, skipping video could be an option.
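As a rough sketch of that client-side choice, the snippet below filters a resource list by mime type prefix. It assumes the entry shape from the examples in this task ("url" and "mime" keys); the function name select_for_offline, the skip_prefixes parameter, and the example.org video URL are made up for illustration.

```python
def select_for_offline(resources, skip_prefixes=("video/",)):
    """Keep only the entries whose mime type does not start with a skipped prefix.

    Hypothetical helper: assumes each entry is a dict with "url" and "mime".
    Entries without a "mime" key are kept.
    """
    prefixes = tuple(skip_prefixes)
    return [r for r in resources if not r.get("mime", "").startswith(prefixes)]


resources = [
    {"url": "//meta.wikimedia.org/api/rest_v1/data/css/mobile/base", "mime": "text/css"},
    {"url": "//example.org/clip.webm", "mime": "video/webm"},  # made-up entry
]
to_save = select_for_offline(resources)
```

A client on a metered connection could pass a longer tuple, for example ("video/", "audio/"), without any change on the service side.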

We can also consider embedding that filtering logic on the service itself, but then the clients lose some flexibility to make choices, and they have more information about the device like network conditions and disk space to make those decisions.

Additionally we should investigate if it is possible to get the download size of the different assets to return for each entry. I believe it could be possible to make a HEAD request and check for the Content-Length, but it depends on who is serving the assets and if they actually include that information in that response.
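One way to try this out, sketched with Python's standard library only: send a HEAD request and read Content-Length if the server provides it. The function names and the timeout value are assumptions, and the result is None whenever the header is missing or malformed, which is exactly the case described above where the server does not include that information.

```python
import urllib.request
from typing import Optional


def parse_content_length(headers) -> Optional[int]:
    """Return Content-Length as an int, or None if it is absent or invalid.

    `headers` is anything with a dict-like .get(), such as response.headers.
    """
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        size = int(value)
    except ValueError:
        return None
    return size if size >= 0 else None


def head_content_length(url: str, timeout: float = 10.0) -> Optional[int]:
    """Issue a HEAD request for an absolute URL and report the advertised size."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_content_length(response.headers)
```

If the service did this on behalf of clients, it would cost one extra request per asset, so caching the sizes would probably be worth considering.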

Would it then make sense to leave the media endpoint as is and structure this endpoint in a way described by @JoeWalsh above?

If, for media, url and mime come first in each entry, should mime be stripped from the thumbnail and original objects?

Jhernandez triaged this task as High priority.

Assuming we want to keep the two endpoints separate, do we need anything else in the new mobile-html-offline-resources endpoint besides the following?

[
  {
    "url": "//meta.wikimedia.org/api/rest_v1/data/css/mobile/base",
    "mime": "text/css"
  },
  {
    "url": "//meta.wikimedia.org/api/rest_v1/data/javascript/mobile/pagelib",
    "mime": "text/javascript"
  },
  {
    "url": "//upload.wikimedia.org/wikipedia/commons/d/d9/Collage_of_Nine_Dogs.jpg",
    "mime": "image/jpeg"
  }
]

I wanted to also include an example for video and audio files but I think we need to discuss these a bit more. A mime type may not be the best solution for these since there can be many derivatives for video files and some for audio files. Even for certain images there could be derivatives. So, my question is should we move to using types instead of mime types?

@bearND assuming we're keeping both endpoints, the media wouldn't need to be returned by this endpoint and we wouldn't even need the mime type - it could be gathered when making the request for the file to save for offline. The resulting response would be just a list of the CSS and JS URLs:

[
  "//meta.wikimedia.org/api/rest_v1/data/css/mobile/base",
  "//meta.wikimedia.org/api/rest_v1/data/javascript/mobile/pagelib"
]

Ok, that should be quite easy then. I like the simplicity.

MSantos claimed this task. Apr 18 2019, 3:07 PM
MSantos moved this task from To Do to Doing on the Reading-Infrastructure-Team-Backlog (Kanban) board.
MSantos added a comment. Edited Apr 18 2019, 4:26 PM

@bearND and @JoeWalsh, from the description, are we still keeping revision and title as parameters?

mobile-html-offline-resources endpoint would take a page title and revision and return a list of related scheme-less URLs

Because of the simplicity of the endpoint, these parameters seem useless.

@MSantos the apps are requesting what resources are needed to render a given article and revision offline. The fact that the response is the same right now should be irrelevant to them - this way if anything changes in the future that would make the response different for different articles, the apps would be able to handle it without a client update.
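To illustrate, a client could always build the request from the title and revision, even while the response is identical for every article. The route layout below is an assumption for the sake of the example (this task does not specify the final path), and the helper name is hypothetical.

```python
from urllib.parse import quote


def offline_resources_path(title: str, revision: int) -> str:
    """Build a request path that carries both the title and the revision.

    Assumption: the "/page/mobile-html-offline-resources/{title}/{revision}"
    layout is illustrative only; the real route may differ.
    """
    # safe="" percent-encodes "/" too, so titles like "AC/DC" stay one segment.
    return f"/page/mobile-html-offline-resources/{quote(title, safe='')}/{revision}"


path = offline_resources_path("Dog", 123)
```

Because the title and revision are already part of every request, the service can later start returning article-specific resources and existing clients keep working.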

Change 504937 had a related patch set uploaded (by MSantos; owner: MSantos):
[mediawiki/services/mobileapps@master] PCS: mobile-html-offline-resources endpoint

https://gerrit.wikimedia.org/r/504937

Change 504937 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] PCS: mobile-html-offline-resources endpoint

https://gerrit.wikimedia.org/r/504937