
Document and possibly fine-tune how Proton interacts with Varnish
Open, Medium, Public

Description

Proton fetches article HTML from RESTBase and converts it to PDF. Unlike our typical APIs, a request is pretty resource-intensive; well-developed articles take tens of seconds to render, and abnormally huge pages will simply time out and abort. If there is even a small spike in requests for the same page (e.g. some URL gets shared on social media), that will be pretty taxing. We should make sure we understand and document how Varnish behaves during such a spike (do requests get cached? do they get coalesced?), and fine-tune that behavior if needed.

In particular, do we want PDF responses (which will be several megabytes each) to be cached? Normally, requests go through 2-3 layers of Varnish and get cached in all layers. We don't expect much traffic for any single URL, and latency in the single-second range or below does not really matter, so this would be a waste of space; maybe we should only cache them in the backend Varnish (disk is cheaper than memory), or not at all.
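(One hedged way to spot-check the current behavior from the outside, assuming the public endpoint is /api/rest_v1/page/pdf/{title} and that the edge exposes X-Cache and Age response headers; this is an illustrative probe, not part of the task itself.)

```
# Probe whether a second request for the same PDF is served from cache.
# The endpoint path and the X-Cache / Age headers are assumptions about the
# current setup; adjust to match the actual deployment.
import time
import requests

URL = "https://en.wikipedia.org/api/rest_v1/page/pdf/Earth"

for attempt in (1, 2):
    start = time.monotonic()
    resp = requests.get(URL)
    elapsed = time.monotonic() - start
    print(f"attempt {attempt}: status={resp.status_code} "
          f"bytes={len(resp.content)} time={elapsed:.1f}s "
          f"x-cache={resp.headers.get('X-Cache')} age={resp.headers.get('Age')}")
```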

Event Timeline

Restricted Application added a subscriber: Aklapper.
herron triaged this task as Medium priority. Jan 10 2019, 10:24 PM

@phuedx and I talked about this.

We need some documentation about how Proton interacts with load balancers and caching, and how it responds to load spikes. cc/ @pmiazga

Ping Services, just to confirm you are aware of the setup and are ok with it as it stands.

Joe subscribed.

@Jhernandez I'm happy to explain to you whatever you might want to know about our load-balancing infrastructure, and how it interacts with proton.

If you want more generic documentation about our LVS infrastructure, it should be available on wikitech.

Pchelolo subscribed.

With a request rate as low as this endpoint is expected to have, the Varnish hit rate would probably be very close to zero, but indeed, accidental spikes in requests for a certain page are possible.

Currently, we cache responses in Varnish for 5 minutes in order to avoid active purging of one more endpoint, and surprisingly the hit ratio looks better than I expected, according to the following Hadoop query:

select cache_status, count(*) from webrequest where year = 2019 and month = 01 and day = 10 and uri_path like "%/pdf/%" group by cache_status;
cache_status    count
hit-front       16065
int-front        1102
hit-local        7913
int-remote        718
int-local        6174
pass               96
miss            74180
hit-remote         47

So it might be enough to keep the current caching strategy.
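(For a rough sense of what those counts mean, here is an illustrative back-of-the-envelope calculation, counting the hit-* rows as cache hits and miss/pass as requests that reached the backend; the int-* rows are internally generated responses and are left out.)

```
# Back-of-the-envelope hit ratio from the webrequest counts above.
counts = {
    "hit-front": 16065, "int-front": 1102, "hit-local": 7913,
    "int-remote": 718, "int-local": 6174, "pass": 96,
    "miss": 74180, "hit-remote": 47,
}

hits = sum(v for k, v in counts.items() if k.startswith("hit-"))
backend = counts["miss"] + counts["pass"]   # requests that reached the backend

print(f"hits={hits} backend={backend} ratio={hits / (hits + backend):.1%}")  # ~24%
```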

What IMO we need to know to understand how the service will deal with spikes:

  • Are requests cached? (Per @Pchelolo, yes, for 5 minutes.)
    • Does that also go for errors? (Especially when rendering takes too long and gets aborted by the service, or abandoned by Varnish.)
  • When a request is being processed and a new request arrives for the same document, do they get coalesced? (I.e. can there be cache stampedes?)
  • Are there any concerns about using too much storage/memory in Varnish cache? (PDFs are larger than HTMLs, although I guess not that much larger. Also they probably embed images?)

Does that also go for errors? (Especially when rendering takes too long and gets aborted by the service, or abandoned by Varnish.)

Errors are not cached; only 200 and 3xx responses are.

When a request is being processed and a new request arrives for the same document, do they get coalesced? (I.e. can there be cache stampedes?)

In the old system, no. AFAIK not in the new system either.

Are there any concerns about using too much storage/memory in Varnish cache? (PDFs are larger than HTMLs, although I guess not that much larger. Also they probably embed images?)

Given the req rate, my gut feeling is that PDFs will take a negligible amount of space. I can estimate the space needed; we have a metric for PDF size and we know the req rate.
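(For a concrete sense of scale, a rough upper bound on the resident cache footprint is miss rate × TTL × average PDF size; the numbers below are placeholders, not the metrics mentioned above.)

```
# Rough upper bound on Varnish space taken by cached PDFs.
# All three inputs are illustrative placeholders, not real measurements.
miss_rate_per_s = 1.0            # unique PDF misses per second (placeholder)
ttl_s = 5 * 60                   # current cache TTL: 5 minutes
avg_pdf_bytes = 2 * 1024 ** 2    # assumed average PDF size: ~2 MB

footprint = miss_rate_per_s * ttl_s * avg_pdf_bytes
print(f"~{footprint / 1024 ** 2:.0f} MB resident at any one time")  # ~600 MB
```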

What IMO we need to know to understand how the service will deal with spikes:

  • Are requests cached? (Per @Pchelolo, yes, for 5 minutes.)
    • Does that also go for errors? (Especially when rendering takes too long and gets aborted by the service, or abandoned by Varnish.)
  • When a request is being processed and a new request arrives for the same document, do they get coalesced? (I.e. can there be cache stampedes?)

I would let the Traffic team respond on this point, but from memory, unless we explicitly tell Varnish NOT to do it, a Varnish instance will hold all requests for a resource until the first one to arrive completes, and then respond with the cached object.
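(For readers unfamiliar with the term, the toy sketch below shows what request coalescing means conceptually; it is not how Varnish implements it, just an illustration of "the first request fetches, the rest wait for its result".)

```
# Toy illustration of request coalescing: concurrent requests for the same
# key share a single backend fetch instead of each hitting the backend.
import asyncio

_inflight: dict[str, asyncio.Task] = {}

async def render_pdf(title: str) -> bytes:
    await asyncio.sleep(10)              # stand-in for an expensive render
    return f"%PDF for {title}".encode()

async def coalesced_fetch(title: str) -> bytes:
    task = _inflight.get(title)
    if task is None:                     # first request starts the render...
        task = asyncio.ensure_future(render_pdf(title))
        _inflight[title] = task
        task.add_done_callback(lambda _: _inflight.pop(title, None))
    return await task                    # ...later requests await the same task
```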

  • Are there any concerns about using too much storage/memory in Varnish cache? (PDFs are larger than HTMLs, although I guess not that much larger. Also they probably embed images?)

Depending on usage, this could become a problem. Or we might want to move PDF caching to the upload cluster.

I'll let @BBlack or @ema comment further though, as we should probably consider the ATS transition too.

Given the req rate, my gut feeling is that PDFs will take a negligible amount of space.

Yeah, I didn't think that through; if they are only cached for 5 minutes, that's irrelevant. It's also short enough that we don't need to care about cache invalidation.

What IMO we need to know to understand how the service will deal with spikes:

There is a queue mechanism. Each Proton instance can only render X jobs at a time, and Y further jobs can be queued. If the spike is higher than X+Y, every new job will be rejected with an HTTP 503 error that includes a Retry-After header.
Additionally, the pool manager will de-pool that instance (allowing it to peacefully render all queued jobs). Once an instance returns HTTP 503 for the first time, it won't be asked to render another PDF until the Retry-After period has passed, which is currently set to conf.render_queue_timeout.
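(A minimal sketch of that admission logic; X, Y and the retry value are placeholders, conf.render_queue_timeout is only referenced as the stand-in for the Retry-After value, and this is illustrative rather than the actual Proton implementation.)

```
# Sketch of the described behaviour: at most X concurrent renders, Y extra
# jobs admitted beyond that, and anything above X+Y rejected with
# 503 + Retry-After. All limits below are placeholder values.
import asyncio

MAX_CONCURRENT = 3     # "X" - placeholder
MAX_QUEUED = 10        # "Y" - placeholder
RETRY_AFTER_S = 60     # stand-in for conf.render_queue_timeout

_render_slots = asyncio.Semaphore(MAX_CONCURRENT)
_admitted = 0          # jobs currently rendering or waiting for a slot

async def handle_render(render_job):
    """Return (status, headers, body) for one render request."""
    global _admitted
    if _admitted >= MAX_CONCURRENT + MAX_QUEUED:
        # Beyond X+Y jobs in flight: reject and tell the client when to retry.
        return 503, {"Retry-After": str(RETRY_AFTER_S)}, b""
    _admitted += 1
    try:
        async with _render_slots:          # at most X render concurrently
            pdf = await render_job()
    finally:
        _admitted -= 1
    return 200, {"Content-Type": "application/pdf"}, pdf
```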

This was discussed in the Web/Infra/SRE/Services Q3-Q4 interlock meeting today.

I think there is little concern about it in general given the traffic and TTL of the cached objects.

A possible consideration for the future depending on the PDF sizes could be what @Joe mentioned above:

we might want to move PDF caching to the upload cluster.

But it seems like a NOOP right now.


Does anyone have remaining questions about this? We'll wait a bit so that traffic can have a cursory look and ask any questions they may have.


@pmiazga Do you feel like the documentation reflects what you last wrote?

There is a queue mechanism.

Right, sorry, I still need to adjust to that mentally. So the usual problems with spikes don't apply here.

Does the queue merge duplicate requests? If not, it would still be interesting to know if request coalescing works (I also have the impression that it's the default behavior for Varnish but it would be good to know for sure).

(I also have the impression that it's the default behavior for Varnish but it would be good to know for sure).

Or at least it's the default for URLs where no response or a successful response is cached. I think unsuccessful responses create a hit-for-pass object and disable coalescing?

@ema can you help out with the Varnish questions?

  • Is my understanding correct that by default Varnish will coalesce requests but after an error response it will store a hit-for-pass object and disable coalescing? (And if so, how long is that stored?)
  • Is there a way to cache error responses (e.g. by setting a max-age header on them)?

The context is that Proton will time out and error out when asked to render abnormally large pages (which is acceptable; dealing with book-length text was not the goal of the service), and in that case we probably don't want to trigger several new renderings of the same page as the user retries.
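(If the answer to the second question turns out to be yes, one possibility would be for the service itself to mark timeout errors as briefly cacheable. The sketch below is hypothetical; whether the edge actually honours max-age on error statuses is exactly the open question above.)

```
# Hypothetical sketch: mark the service's timeout errors as briefly cacheable
# so that quick retries could be answered from cache instead of triggering
# new renders. The TTL value is a placeholder, and whether Varnish honours
# max-age on error statuses here is the unanswered question above.
ERROR_CACHE_TTL_S = 300  # placeholder

def timeout_error_response():
    headers = {
        "Cache-Control": f"public, max-age={ERROR_CACHE_TTL_S}",
        "Content-Type": "application/problem+json",
    }
    body = b'{"title": "Render timed out"}'
    return 503, headers, body
```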

@Jhernandez I'll update the documentation once all of @Tgr's questions are answered.

@Tgr I assume you're still waiting for answers from @ema? Is there anything I can help you with?

Yeah, but I don't think this task should be a blocker (for either handover or production switchover). It's just something we should document (and maybe consider coalescing as a future feature for the queue management system if Varnish itself does not handle it, although I expect it does).

@Tgr I assume you're still waiting for answers from @ema? Is there anything I can help you with?

@BBlack is probably the best to answer the above questions. @ema is currently on leave.

Joe moved this task from Backlog to Watched on the serviceops-radar board.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!