Include size of article in Summary
Open, Normal, Public

Description

Recent efforts in Reading have focused on providing better settings and information so users can control their data usage, meaning both bandwidth when downloading and storage when saving articles for offline use.

To support this effort, there must be a way to let users know how large a download will be. This is primarily important when downloading a "set" of articles, as in the upcoming Reading List Service.

In order to support this, can we:

  1. Calculate the size of an article?
  2. Calculate the download size (gzipped)?
  3. Calculate the size of images (if we know what size the platforms download)?
  4. Can this information be updated when summaries are updated (when a new revision occurs)?
Pchelolo added a subscriber: Pchelolo.

To support this effort, there must be a way to let users know how large a download will be. This is primarily important when downloading a "set" of articles, as in the upcoming Reading List Service.

Is the primary goal to download a single article or a set of articles? In case it's a single article from the apps, we can include the info in the MCS lead section?

Calculate the size of an article?

We can't really take it from the MediaWiki API, because what's actually going to be downloaded is the mobile-sections content, right? What we could do, though, is make the summary update after the MCS content, fetch the mobile-sections on update, and just see how big they are. Alternatively, the Mobile Content Service could add a special field with the size to the mobile-sections-lead content; then we'd only need to fetch the lead on update and take the value from that field.
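
For illustration, a minimal TypeScript sketch of the second idea, assuming a hypothetical sections_size field that MCS would add to the mobile-sections-lead response (no such field exists today):

```typescript
// Sketch only: `sections_size` is a hypothetical field that MCS would have to add
// to the mobile-sections-lead payload; it does not exist in the current API.
interface MobileSectionsLead {
  revision?: string;
  // Hypothetical: total byte size of the full mobile-sections content.
  sections_size?: number;
}

async function fetchArticleSize(domain: string, title: string): Promise<number | undefined> {
  const url = `https://${domain}/api/rest_v1/page/mobile-sections-lead/${encodeURIComponent(title)}`;
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`mobile-sections-lead request failed: ${res.status}`);
  }
  const lead: MobileSectionsLead = await res.json();
  return lead.sections_size;
}
```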

Calculate the download size (gzipped)?

Again, MCS could gzip the content and take the size, but that seems like a waste of CPU cycles. What we could do is gzip a ton of Wikipedia articles, take the average compression ratio, and then apply it on the client. It wouldn't be 100% accurate, but the client probably doesn't care about byte-to-byte accuracy here?
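
A minimal sketch of that client-side estimate; the compression ratio below is a placeholder, not a measured value, and would have to come from gzipping a large sample of articles offline:

```typescript
// Placeholder ratio (compressed / uncompressed); the real value would be measured
// offline over a large corpus of mobile-sections responses.
const AVERAGE_GZIP_RATIO = 0.25;

// Estimate the over-the-wire size from the uncompressed content size.
function estimateDownloadSizeBytes(uncompressedBytes: number): number {
  return Math.round(uncompressedBytes * AVERAGE_GZIP_RATIO);
}

// Example: a 400 KiB article body would be estimated at ~100 KiB over the wire.
console.log(estimateDownloadSizeBytes(400 * 1024));
```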

Calculate the size of images (if we know what size the platforms download)?

That's even more complex. With the introduction of the thumbnail API, the server wouldn't really know which size of pictures the client will download. Also, we shouldn't consult the MediaWiki API to find out which images are on the page, because MCS might decide to strip out some sections with images. Maybe we could make MCS count the images on the page, include that count in the lead section, and transfer it to the summary? The client could then approximate how much bandwidth it will need to download that number of images at the resolution it uses.
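
A rough sketch of how a client could turn such an image count into a bandwidth estimate; both the image_count field and the per-resolution averages are assumptions for illustration:

```typescript
// Placeholder average thumbnail sizes per requested width, in bytes; real values
// would have to be measured per platform.
const AVG_THUMB_BYTES: Record<number, number> = {
  320: 25 * 1024,
  640: 60 * 1024,
  800: 90 * 1024,
};

// Approximate image bandwidth from a hypothetical `image_count` value in the
// summary and the thumbnail width the client actually requests.
function estimateImageBytes(imageCount: number, thumbWidth: number): number {
  const avgBytes = AVG_THUMB_BYTES[thumbWidth] ?? 60 * 1024;
  return imageCount * avgBytes;
}
```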

Can this information be updated when summaries are updated (when a new revision occurs)?

Yes, why not?

With the above ideas the client would be able to estimate the download size of the article. I'm not sure how accurate the estimate would be, but I'm also not sure whether we need 100% accuracy, or whether being within roughly ±1 MB would be fine?

Another question is whether this data belongs in the general summary at all. The size of the mobile-sections and the number of pictures seem like a very use-case-specific set of properties, not useful for the general public. Perhaps we could store this data with the summaries, strip it for the general public, and include it only when doing content hydration for the reading list response? (if we are doing content hydration for the reading list response at all, which is not clear to me yet)

Also, right now we don't store and purge the summary if its content hasn't changed, and that works really well: only ~20% of rerenders result in a purge, which increases the cache hit ratio a great deal. If we include the size of the article, this optimization will be lost, since the size is likely to change on every edit. This could be avoided by including only an approximate size, for example at 1-kilobyte resolution.
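
A sketch of that rounding idea, assuming the stored value is the size in whole kibibytes:

```typescript
// Round the size up to 1 KiB resolution so that small edits usually leave the
// stored value (and therefore the summary) unchanged, preserving the
// "no change, no purge" optimization.
function approximateSizeKiB(sizeBytes: number): number {
  return Math.ceil(sizeBytes / 1024);
}

// Example: 54321 bytes and 54900 bytes both round to 54 KiB, so an edit that
// changes the size by a few hundred bytes would not force a purge.
```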

So, overall it seems doable, but I want to understand the use case better before deciding on the right path for it. It seems a bit controversial.

Alright, I think we need to hold our horses a bit, people. There have been a number of tickets lately about adding information and/or functionality to the summary end point, which, quite frankly, I believe don't belong there at all. Instead of adding things to summary, I think we should list/assess what information Reading will need in the short- to mid-term and find the appropriate ways to provide/expose/surface the needed information.

In the concrete case of this ticket, as @Pchelolo explained nicely, it is doable, but we need to keep in mind that all of the listed questions here add latency to the exposure of other data. The summary is supposed to be returned quickly and be a compact representation of what an article is about. The requirements in this task seem to violate that, given the huge latency introduced by the potential implementation.

Alright, I think we need to hold our horses a bit, people. There have been a number of tickets lately about adding information and/or functionality to the summary end point, which, quite frankly, I believe don't belong there at all. Instead of adding things to summary, I think we should list/assess what information Reading will need in the short- to mid-term and find the appropriate ways to provide/expose/surface the needed information.

+1. Potentially we could introduce a meta (extended summary) endpoint and stash all of that data there, letting us keep the summary nice and cozy while providing a much more sophisticated and rich dataset separately. So +1 on the list.

@Pchelolo @mobrovac some answers below, but I figure we can go more in-depth at this week's meeting. Thanks for all the comments.

Alright, I think we need to hold our horses a bit, people. There have been a number of tickets lately about adding information and/or functionality to the summary end point, which, quite frankly, I believe don't belong there at all.

If we want to have a separate endpoint for such data, that's fine with me… I'm just filing these against the summary because that's the only endpoint that a client hits before a user actually downloads an article. However, a separate purpose-built endpoint with this type of data (which seems less like a summary and more like metadata) would also make sense to me.

Instead of adding things to summary, I think we should list/assess what information Reading will need in the short- to mid-term and find the appropriate ways to provide/expose/surface the needed information.

Most of these updates are in support of the product requirements for Reading Lists and offline support in general. As far as timelines go… these changes could be staged and rolled out, so there is no immediate time pressure… we can get into specifics at the sync meeting.

Is the primary goal to download a single article or a set of articles? In case it's a single article from the apps

Both, for similar but different use cases. The size of a "set" of articles can be aggregated by the reading list endpoint, if it is available for each article through another API call.
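
A sketch of that aggregation, assuming a hypothetical per-article metadata endpoint that exposes a size field in bytes (the path and field below are placeholders for illustration):

```typescript
// Sum the estimated sizes of all articles in a reading list. The endpoint path
// and its `size` field are placeholders, not an existing API.
async function estimateListSizeBytes(domain: string, titles: string[]): Promise<number> {
  const sizes = await Promise.all(titles.map(async (title) => {
    const url = `https://${domain}/api/rest_v1/page/metadata/${encodeURIComponent(title)}`;
    const res = await fetch(url);
    const meta = await res.json();
    return typeof meta.size === 'number' ? meta.size : 0;
  }));
  return sizes.reduce((total, size) => total + size, 0);
}
```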

we can include the info in the MCS lead section?

Fetching the first section is likely going away in the new API - we may be streaming the HTML content of an article instead of getting JSON, which is why I was suggesting a separate API for this type of structured metadata (if we don't want it in the summary).

We can't really take it from the MediaWiki API, because what's actually gonna be downloaded is the mobile-sections content, right?

As mentioned above, this will come from the endpoints being developed right now. See: T162179: Extract HTML Compatibility Layer from MCS Mobile Sections API.
The output of that endpoint would be the appropriate thing to measure.

Again, MCS could gzip the content and take the size, but that seems like a waste of CPU cycles.

Agreed, seems like a waste.

What we could do is to gzip a ton of wikipedia articles and take the average compression ratio, and then apply it on the client. It wouldn't be 100% accurate, but the client probably doesn't care about byte-to-byte accuracy here?

Correct… we just need to be accurate to within a few MB… it's just about giving users control over their data usage. So any heuristic that gets us close enough is fine.

GWicke added a subscriber: GWicke. · Edited · Aug 8 2017, 9:21 PM

One way to achieve this using plain HTTP:

  1. Send a HEAD request for the PDF (or HTML) resource. As HEAD is handled at the Varnish layer, this will implicitly store the response in Varnish.
  2. Check the content-length header in the response, and possibly prompt the user about aborting / continuing the download (if the size is above some threshold).
  3. If the user consented, send a plain GET for the same resource, retrieving the recently-cached PDF (or HTML) (with *very* high probability).

To make this work for CORS requests, we'll need to add content-length to the access-control-expose-headers value.
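
A sketch of that flow from a web client; the size threshold and confirmation callback are app-specific placeholders:

```typescript
// Arbitrary placeholder threshold above which the user is asked before downloading.
const SIZE_PROMPT_THRESHOLD = 5 * 1024 * 1024; // 5 MiB

async function downloadWithSizeCheck(
  url: string,
  confirmDownload: (bytes: number) => Promise<boolean>
): Promise<Blob | null> {
  // 1. HEAD request; handled at the edge cache, which also warms the cache for the GET.
  const head = await fetch(url, { method: 'HEAD' });
  // For cross-origin requests this header is only readable if the server lists
  // content-length in access-control-expose-headers.
  const length = Number(head.headers.get('content-length') ?? 0);

  // 2. Prompt the user if the resource is larger than the threshold.
  if (length > SIZE_PROMPT_THRESHOLD && !(await confirmDownload(length))) {
    return null;
  }

  // 3. Plain GET for the same resource, which is now cached with very high probability.
  const res = await fetch(url);
  return res.blob();
}
```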

GWicke triaged this task as Normal priority. · Aug 8 2017, 10:31 PM