[Spike 3hr] Determine how to store PDFs in order to get file size
Closed, Declined (Public)

Description

Background

Story: As a data-conscious user, I want the ability to know the size of my PDF prior to download, so that I know whether I have space for it on my phone.

Based on the work from T163472: [EPIC] Provide a way to download articles in PDF on the mobile website, we want the ability to render a PDF, determine its size, and display the size to the user before allowing the user to download the PDF.

Acceptance Criteria

Answer the following questions:

  • Is there storage space for temporarily storing PDFs?

In Cassandra, yes! See T177219#3991177.

  • If so, how much? Does it have any limit?

Per T177219#3991177, since this is a low-traffic service, there should be enough space in Cassandra to temporarily store rendered PDFs.

  • What is the most efficient timespan to keep the PDFs for? (a day, until the user session completes, something else?)
  • How would we go about implementing this?

Event Timeline

ovasileva added a subscriber: phuedx.

@phuedx - took a shot at the acceptance criteria - feel free to add detail

Niedzielski subscribed.

Sam is going to email the Services team to ask how we should investigate the answers to these questions.

Just recording some meeting discussion from @Jdlrobson so it's not lost:

This doesn't sound possible, since we'd only generate PDFs upon request (not beforehand). We could use an HTTP HEAD request to get the headers (which could include the file size, I guess), but if the PDF has not been generated yet, we'd need to render it during the HEAD request, which still has a delay.

This may also apply to the parent.

Niedzielski renamed this task from [Spike] Determine how to store PDFs in order to get file size to [Spike 1hr] Determine how to store PDFs in order to get file size. Oct 3 2017, 4:46 PM
Niedzielski added a project: Spike.

The reasoning for working on this feature is not convincing.

Story: As a data-conscious user, I want the ability to know the size of my PDF prior to download, so that I know whether I have space for it on my phone.

We should let the OS warn the user if they don't have enough space on their phone. I vote for declining this task as the return on investment is too low. We should spend our time doing more important and impactful work.

I think the concern here is more about data-conscious users and users who look at their phone space more carefully because of holding larger amounts of offline content. From the New Readers finding 20 (https://meta.wikimedia.org/wiki/New_Readers/Offline#Web:_Building_mobile_PDF_support) we know that people are increasingly consuming and sharing information offline and are thus more conscious of the amount of space they have in their phones. We want to let the users know of the size of an article prior to download so that they can make the decision if downloading is too costly for them.

That said, I think we should do this spike and get an idea of how much effort this would take. I have spoken to @atgo and in her opinion providing size is not a blocker, but it is important. Anne - correct me if I'm wrong on this. If it's estimated to be too much work, we would most likely punt it for the time being.

Yep, that sounds right to me. Thanks

ovasileva changed the task status from Open to Stalled. Oct 10 2017, 12:31 PM
ovasileva lowered the priority of this task from High to Medium.
ovasileva changed the task status from Stalled to Open. Feb 20 2018, 4:44 PM
ovasileva renamed this task from [Spike 1hr] Determine how to store PDFs in order to get file size to [Spike 3hr] Determine how to store PDFs in order to get file size. Feb 20 2018, 5:34 PM
ovasileva updated the task description.

We discussed this task some today in grooming and a couple questions came up:

  1. How long does a PDF take to render on average? From @pmiazga: average render for popular articles is 7s, short articles is 4s, and long articles is 19s (these are tests on VPS, not production).
  2. Do we want to add tighter queue and render timeouts in the service? For example, if an article takes several minutes to render, do we just want to time out and guess "large"?
  3. Is a modal the correct UI pattern to use? A given article may render in less than 5 seconds but what if the queue is full? A user may scroll down the article away from the spinner in boredom and suddenly a modal pops up. Regardless, shouldn't the user be able to download a PDF without waiting for a file size if they wish?
  4. Should we approximate the PDF size client side (instantly) based on article content (such as word count) instead of calculating it precisely by waiting for a render to complete?
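To make question 2 concrete, here is a minimal Python sketch of a render timeout that falls back to a coarse "large" label. The names `size_or_large`, `render`, and `timeout_s` are hypothetical, and `render` stands in for whatever call actually produces the PDF bytes; this is an illustration of the idea, not how the service is implemented.

```python
import concurrent.futures


def size_or_large(render, timeout_s):
    """Return the PDF size in bytes, or "large" if rendering exceeds timeout_s.

    `render` is a hypothetical zero-argument callable returning the PDF bytes.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(render)
    try:
        return len(future.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        # The render keeps running in the background; only the wait is abandoned.
        return "large"
    finally:
        # Do not block the caller on a slow render.
        pool.shutdown(wait=False)
```

Note that in this sketch the slow render is not cancelled, so a real service would still need its own queue and render limits.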

Assigned to @phuedx to clarify this task at the Services sync meeting (@Niedzielski & @pmiazga will be there too)

Since we're exposing the PDF service via RESTBase, we can add temporary storage for the rendered articles in Cassandra. RESTBase supports HEAD requests and emits the Content-Length header, so the client can issue a HEAD request first. The render will then be stored in Cassandra for a little while, and the subsequent GET request will fetch the PDF without waiting for the service to render the article again.

Given that this feature will not be super high-traffic, we have space to temporarily cache PDFs in Cassandra, so that's no problem.
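As a rough illustration of the HEAD-then-GET flow, the following Python sketch reads Content-Length from a HEAD response and formats it for display. The function names are hypothetical and the URL is left as a parameter, since the actual endpoint is not specified here; it also assumes the server really does send Content-Length on HEAD.

```python
import urllib.request


def parse_content_length(headers):
    """Return Content-Length as an int, or None if missing or malformed."""
    value = headers.get("Content-Length")
    if value is None or not value.strip().isdigit():
        return None
    return int(value.strip())


def fetch_pdf_size(url, timeout_s=30.0):
    """Issue a HEAD request (no body is downloaded) and return the size in bytes."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return parse_content_length(response.headers)


def format_size(num_bytes):
    """Format a byte count for display, e.g. next to the download button."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    if num_bytes < 1024 * 1024:
        return f"{num_bytes / 1024:.1f} KB"
    return f"{num_bytes / (1024 * 1024):.1f} MB"
```

A client would call `fetch_pdf_size(url)` first, show `format_size(...)` to the user, and only issue the GET once the user confirms. If the PDF has not been rendered yet, the HEAD request itself still has to wait for the render.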

Should we approximate the PDF size client side (instantly) based on article content (such as word count) instead of calculating it precisely by waiting for a render to complete?

I personally think this is a better approach, but it's not my call to make.

☝️ We'll be discussing this at today's Audiences Services Sync meeting.

phuedx added a subscriber: mobrovac.

@Niedzielski will be updating the task description after chatting with @Pchelolo and @mobrovac in the Audiences Services sync.

We talked with @Pchelolo and @mobrovac today. As I understand it, we have the following options:

  1. The client issues a HEAD request for the PDF and only retrieves the size (no body) for the initial request. Any subsequent GET request within a short period of time would download the body. However, @phuedx and I were slightly puzzled in discussion afterwards because we thought we heard Services mention that actually we _can't_ cache renders.
  2. Regardless of #1, Services recommended considering an alternative user experience that would provide a near-instant file size estimation based on article content. There was some further discussion afterwards that doing this approximation in the service would keep the code in one place, allow us to change the implementation transparently to the client as desired, and avoid any browser compatibility concerns around how sizes are calculated. However, this could also be done in the client. We should verify the approximation is reasonable against a number of articles.
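For option 2, one possible shape of the approximation and its verification is sketched below in Python: fit a straight line from word count to rendered size, then check the worst relative error over a set of articles. Everything here is an assumption for illustration, including the linear model, the function names, and the sample numbers, which are made up and are not measurements.

```python
def fit_line(samples):
    """Least-squares fit of pdf_bytes = intercept + slope * word_count.

    `samples` is a list of (word_count, pdf_bytes) pairs.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_bytes(word_count, slope, intercept):
    """Estimate the PDF size in bytes from the article word count."""
    return max(0, round(intercept + slope * word_count))


def worst_relative_error(samples, slope, intercept):
    """Largest |estimate - actual| / actual over the samples (outlier check)."""
    return max(
        abs(estimate_bytes(x, slope, intercept) - y) / y for x, y in samples
    )


# Placeholder (word_count, pdf_bytes) pairs: made-up numbers, not measurements.
PLACEHOLDER_SAMPLES = [(1000, 150_000), (2000, 250_000), (4000, 450_000)]
slope, intercept = fit_line(PLACEHOLDER_SAMPLES)
```

Running `worst_relative_error` over real renders from several wikis would show whether a word-count model is acceptable or whether other inputs (for example image counts) are needed.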

@ovasileva, let's talk about the preferred approach in the next stand up and I'll update the task description as needed.

@ovasileva is worried about differences between wikis, such as templates, that would totally throw the approximations off. She's more worried about outliers than about accuracy.

@pmiazga, we have to ask Chromium to render the PDF though, and that takes time. It's the biggest problem.

Assigning to @ovasileva to think about whether we want to pursue this. She's fine with the approximation approach if we do decide to move forward with it.

We've decided to look at some of the initial data before making a decision on this. Removing from the board for now and moving to the PO backlog.

Based on our schedule and changes in team priorities, we have decided not to pursue this for the time being.