
Document and possibly fine-tune how Proton interacts with Varnish
Open, Medium, Public

Description

Proton fetches article HTML from RESTBase and converts it to PDF. Unlike our typical APIs, a request is pretty resource-intensive; well-developed articles take tens of seconds to render, and abnormally huge pages will simply time out and abort. If there is even a small spike in requests for the same page (e.g. some URL gets shared on social media), that will be pretty taxing. We should make sure we understand and document how Varnish behaves during such a spike (do requests get cached? do they get coalesced?), and fine-tune that behavior if needed.

In particular, do we want PDF responses (which will be several megabytes each) to be cached? Normally, requests go through 2-3 layers of Varnish and get cached in all layers. We don't expect much traffic for any single URL, and latency in the single-second range or below does not really matter, so this would be a waste of space; maybe we should only cache them in the backend Varnish (disk is cheaper than memory), or not at all.
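(One hedged way to spot-check the current behavior from the outside, assuming the public endpoint is /api/rest_v1/page/pdf/{title} and that the edge exposes X-Cache and Age response headers; this is an illustrative probe, not part of the task itself.)

```
# Probe whether a second request for the same PDF is served from cache.
# The endpoint path and the X-Cache / Age headers are assumptions about the
# current setup; adjust to match the actual deployment.
import time
import requests

URL = "https://en.wikipedia.org/api/rest_v1/page/pdf/Earth"

for attempt in (1, 2):
    start = time.monotonic()
    resp = requests.get(URL)
    elapsed = time.monotonic() - start
    print(f"attempt {attempt}: status={resp.status_code} "
          f"bytes={len(resp.content)} time={elapsed:.1f}s "
          f"x-cache={resp.headers.get('X-Cache')} age={resp.headers.get('Age')}")
```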

Event Timeline

Restricted Application added a subscriber: Aklapper.
herron triaged this task as Medium priority. Jan 10 2019, 10:24 PM

@phuedx and I talked about this.

We need some documentation about how Proton interacts with load balancers and caching, and how it responds to load spikes. cc/ @pmiazga

Ping Services, just to confirm you are aware of the setup and are ok with it as it stands.

Joe subscribed.

@Jhernandez I'm happy to explain to you whatever you might want to know about our load-balancing infrastructure, and how it interacts with proton.

If you want more generic documentation about our LVS infrastructure, it should be available on wikitech.

Pchelolo subscribed.

With a request rate as low as this endpoint is expected to have, the Varnish hit rate would probably be very close to zero, but indeed, accidental spikes in requests for a certain page are possible.

Currently, we cache responses in Varnish for 5 minutes in order to avoid active purging of one more endpoint, and surprisingly the hit ratio looks better than I expected, according to the following Hadoop query:

select cache_status, count(*) from webrequest where year = 2019 and month = 01 and day = 10 and uri_path like "%/pdf/%" group by cache_status;
cache_status    count
hit-front       16065
int-front        1102
hit-local        7913
int-remote        718
int-local        6174
pass               96
miss            74180
hit-remote         47

So it might be enough to keep the current caching strategy.
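(For a rough sense of what those counts mean, here is an illustrative back-of-the-envelope calculation, counting the hit-* rows as cache hits and miss/pass as requests that reached the backend; the int-* rows are internally generated responses and are left out.)

```
# Back-of-the-envelope hit ratio from the webrequest counts above.
counts = {
    "hit-front": 16065, "int-front": 1102, "hit-local": 7913,
    "int-remote": 718, "int-local": 6174, "pass": 96,
    "miss": 74180, "hit-remote": 47,
}

hits = sum(v for k, v in counts.items() if k.startswith("hit-"))
backend = counts["miss"] + counts["pass"]   # requests that reached the backend

print(f"hits={hits} backend={backend} ratio={hits / (hits + backend):.1%}")  # ~24%
```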

What IMO we need to know to understand how the service will deal with spikes:

  • Are requests cached? (Per @Pchelolo, yes, for 5 minutes.)
    • Does that also go for errors? (Especially when rendering takes too long and gets aborted by the service, or abandoned by Varnish.)
  • When a request is being processed and a new request arrives for the same document, do they get coalesced? (I.e. can there be cache stampedes?)
  • Are there any concerns about using too much storage/memory in Varnish cache? (PDFs are larger than HTMLs, although I guess not that much larger. Also they probably embed images?)

Does that also go for errors? (Especially when rendering takes too long and gets aborted by the service, or abandoned by Varnish.)

Errors are not cached; only 200 and 3xx responses are.

When a request is being processed and a new request arrives for the same document, do they get coalesced? (I.e. can there be cache stampedes?)

In the old system, no. AFAIK not in the new system either.

Are there any concerns about using too much storage/memory in Varnish cache? (PDFs are larger than HTMLs, although I guess not that much larger. Also they probably embed images?)

Given the req rate, my gut feeling is that PDFs will take a negligible amount of space. I can estimate the space needed; we have a metric for PDF size and we know the req rate.
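(For a concrete sense of scale, a rough upper bound on the resident cache footprint is miss rate × TTL × average PDF size; the numbers below are placeholders, not the metrics mentioned above.)

```
# Rough upper bound on Varnish space taken by cached PDFs.
# All three inputs are illustrative placeholders, not real measurements.
miss_rate_per_s = 1.0            # unique PDF misses per second (placeholder)
ttl_s = 5 * 60                   # current cache TTL: 5 minutes
avg_pdf_bytes = 2 * 1024 ** 2    # assumed average PDF size: ~2 MB

footprint = miss_rate_per_s * ttl_s * avg_pdf_bytes
print(f"~{footprint / 1024 ** 2:.0f} MB resident at any one time")  # ~600 MB
```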

What IMO we need to know to understand how the service will deal with spikes:

  • Are requests cached? (Per @Pchelolo, yes, for 5 minutes.)
    • Does that also go for errors? (Especially when rendering takes too long and gets aborted by the service, or abandoned by Varnish.)
  • When a request is being processed and a new request arrives for the same document, do they get coalesced? (I.e. can there be cache stampedes?)

I would let the Traffic team respond on this point, but from memory, unless we explicitly tell Varnish NOT to do it, a Varnish instance will hold all requests for a resource until the first one to arrive completes, and then respond with the cached object.
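(For readers unfamiliar with the term, the toy sketch below shows what request coalescing means conceptually; it is not how Varnish implements it, just an illustration of "the first request fetches, the rest wait for its result".)

```
# Toy illustration of request coalescing: concurrent requests for the same
# key share a single backend fetch instead of each hitting the backend.
import asyncio

_inflight: dict[str, asyncio.Task] = {}

async def render_pdf(title: str) -> bytes:
    await asyncio.sleep(10)              # stand-in for an expensive render
    return f"%PDF for {title}".encode()

async def coalesced_fetch(title: str) -> bytes:
    task = _inflight.get(title)
    if task is None:                     # first request starts the render...
        task = asyncio.ensure_future(render_pdf(title))
        _inflight[title] = task
        task.add_done_callback(lambda _: _inflight.pop(title, None))
    return await task                    # ...later requests await the same task
```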

  • Are there any concerns about using too much storage/memory in Varnish cache? (PDFs are larger than HTMLs, although I guess not that much larger. Also they probably embed images?)

Depending on usage, this could become a problem. Or we might want to move PDF caching to the upload cluster.

I'll let @BBlack or @ema comment further though, as we should probably consider the ATS transition too.

Given the req rate, my gut feeling is that PDFs will take a negligible amount of space.

Yeah, I didn't think that through; if they are only cached for 5 minutes, that's irrelevant. It's also short enough that we don't need to care about cache invalidation.

What IMO we need to know to understand how the service will deal with spikes:

There is a queue mechanism. Each Proton instance can only render X jobs at a time, and Y further jobs can be queued. If the spike is higher than X+Y, every new job will be rejected with an HTTP 503 error that includes a Retry-After header.
Additionally, the pool manager will de-pool that instance (allowing it to peacefully render all queued jobs). Once an instance returns HTTP 503 for the first time, it won't be asked to render another PDF until the Retry-After period has passed, which is currently set to conf.render_queue_timeout.
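(A minimal sketch of that admission logic; X, Y and the retry value are placeholders, conf.render_queue_timeout is only referenced as the stand-in for the Retry-After value, and this is illustrative rather than the actual Proton implementation.)

```
# Sketch of the described behaviour: at most X concurrent renders, Y extra
# jobs admitted beyond that, and anything above X+Y rejected with
# 503 + Retry-After. All limits below are placeholder values.
import asyncio

MAX_CONCURRENT = 3     # "X" - placeholder
MAX_QUEUED = 10        # "Y" - placeholder
RETRY_AFTER_S = 60     # stand-in for conf.render_queue_timeout

_render_slots = asyncio.Semaphore(MAX_CONCURRENT)
_admitted = 0          # jobs currently rendering or waiting for a slot

async def handle_render(render_job):
    """Return (status, headers, body) for one render request."""
    global _admitted
    if _admitted >= MAX_CONCURRENT + MAX_QUEUED:
        # Beyond X+Y jobs in flight: reject and tell the client when to retry.
        return 503, {"Retry-After": str(RETRY_AFTER_S)}, b""
    _admitted += 1
    try:
        async with _render_slots:          # at most X render concurrently
            pdf = await render_job()
    finally:
        _admitted -= 1
    return 200, {"Content-Type": "application/pdf"}, pdf
```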

This was discussed in the Web/Infra/SRE/Services Q3-Q4 interlock meeting today.

I think there is little concern about it in general given the traffic and TTL of the cached objects.

A possible consideration for the future depending on the PDF sizes could be what @Joe mentioned above:

we might want to move PDF caching to the upload cluster.

But it seems like a NOOP right now.


Does anyone have remaining questions about this? We'll wait a bit so that traffic can have a cursory look and ask any questions they may have.


@pmiazga Do you feel like the documentation reflects what you last wrote?

There is a queue mechanism.

Right, sorry, I still need to adjust to that mentally. So the usual problems with spikes don't apply here.

Does the queue merge duplicate requests? If not, it would still be interesting to know if request coalescing works (I also have the impression that it's the default behavior for Varnish but it would be good to know for sure).

(I also have the impression that it's the default behavior for Varnish but it would be good to know for sure).

Or at least it's the default for URLs where no response or a successful response is cached. I think unsuccessful responses create a hit-for-pass object and disable coalescing?

@ema can you help out with the Varnish questions?

  • Is my understanding correct that by default Varnish will coalesce requests but after an error response it will store a hit-for-pass object and disable coalescing? (And if so, how long is that stored?)
  • Is there a way to cache error responses (e.g. by setting a max-age header on them)?

The context is that Proton will time out and error out when asked to render abnormally large pages (which is acceptable; dealing with book-length text was not the goal of the service), and in that case we probably don't want to trigger several new renderings of the same page as the user retries.
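(If the answer to the second question turns out to be yes, one possibility would be for the service itself to mark timeout errors as briefly cacheable. The sketch below is hypothetical; whether the edge actually honours max-age on error statuses is exactly the open question above.)

```
# Hypothetical sketch: mark the service's timeout errors as briefly cacheable
# so that quick retries could be answered from cache instead of triggering
# new renders. The TTL value is a placeholder, and whether Varnish honours
# max-age on error statuses here is the unanswered question above.
ERROR_CACHE_TTL_S = 300  # placeholder

def timeout_error_response():
    headers = {
        "Cache-Control": f"public, max-age={ERROR_CACHE_TTL_S}",
        "Content-Type": "application/problem+json",
    }
    body = b'{"title": "Render timed out"}'
    return 503, headers, body
```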

@Jhernandez I'll update the documentation once all of @Tgr's questions are answered.

@Tgr I assume you're still waiting for answers from @ema? Is there anything I can help you with?

Yeah, but I don't think this task should be a blocker (for either handover or production switchover). It's just something we should document (and maybe consider coalescing as a future feature for the queue management system if Varnish itself does not handle it, although I expect it does).

@Tgr I assume you're still waiting for answers from @ema? Is there anything I can help you with?

@BBlack is probably the best to answer the above questions. @ema is currently on leave.

Joe moved this task from Backlog to Watched on the serviceops-radar board.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!