Page MenuHomePhabricator

Test ESI feasibility with current Varnish installation
Closed, ResolvedPublic

Description

The OKR stuff for this looks like:

O: Determine viability and difficulty of using ESI feature in Varnish as part of a banner bump solution
KR: Build a minimal scalable test content service deployable on edge nodes
KR: Include minimal content output via ESI on one Varnish instance
KR: Expand test to all nodes progressively, and if possible run on all nodes for a full week
KR: Summarize challenges and difficulty level of continuing on this solution path

The best idea we have for a trivial minimal test content service is a simple haproxy instance with the necessary static content outputs defined in its configuration. The basic plan is to start with that, then deploy it and integrate it as a backend service for ats-be, and then move on to manual tests and then getting a test ESI tag included in production banner outputs for "real" testing.

Details

Other Assignee
Vgutierrez
SubjectRepoBranchLines +/-
operations/puppetproduction+1 -0
operations/puppetproduction+0 -1
operations/puppetproduction+0 -1
operations/puppetproduction+1 -1
operations/puppetproduction+0 -3
operations/puppetproduction+8 -0
operations/puppetproduction+14 -0
operations/puppetproduction+1 -0
operations/puppetproduction+1 -0
operations/puppetproduction+1 -0
operations/puppetproduction+67 -9
operations/puppetproduction+7 -1
operations/puppetproduction+0 -1
operations/puppetproduction+1 -0
operations/puppetproduction+2 -0
operations/mediawiki-configmaster+4 -0
operations/puppetproduction+114 -0
operations/puppetproduction+1 -0
operations/puppetproduction+9 -0
operations/puppetproduction+3 -0
operations/puppetproduction+5 -0
Show related patches Customize query in gerrit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 793561 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] [WIP] esitest service

https://gerrit.wikimedia.org/r/793561

Heyy thanks so much for all the work on this!!! just a few notes here from a super uninformed perspective, just on the off chance they might be useful...

In the task description it says the test service would be deployed "as a backend service for ats-be". However, in the rough sketch of how a banner service might work (T303075) I was imagining the service would run on the Varnish nodes, not ATS, to avoid the overhead of an extra internal network request on every pageview.

On at least some requests (and possibly on many requests, depending on how things are optimized) the banner service would in turn request banner content via Special:BannerLoader. I'm not sure if those requests should go to an in-memory or on-disk cache. At least, the number of internal requests caused would not be greater than the number of external requests we currently handle for Special:BannerLoader.

So, if a banner is selected for the pageview, in some cases, the banner service wouldn't return until it has fetched the banner content it needs to provide. (Again, how often this happens would depend on how the banner service is optimized.) I don't know if such a delay could impact memory or thread usage by Varnish, but I imagine it's plausible it could have an impact?

Finally, recall that the banner service would need data points specific to the pageview (hostname, user agent, country code, country subdivision, logged-in status) and a cookie from the client. No idea if the Varnish code needed to gather and pass that data along could impact/complicate anything ESI-related... just thought I'd mention it, in case it might be a consideration for this test...

Thanks again!!! :) :)

Change 810030 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Add ESI testing remap rule

https://gerrit.wikimedia.org/r/810030

Change 810044 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] varnish: Enable ESI for /esitest-fa8a495983347898/includer

https://gerrit.wikimedia.org/r/810044

@AndyRussG currently in our CDN varnish and ATS runs on the same nodes. All the communication with backend servers/applayer is performed by ats-be (see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/trafficserver/backend.yaml#4 for reference)

I've implemented the current approach suggested by @BBlack on https://gerrit.wikimedia.org/r/c/operations/puppet/+/810030 && https://gerrit.wikimedia.org/r/c/810044.

vgutierrez@traffic-cache-atstext-buster:~$ curl http://127.0.0.1:4321/esitest-fa8a495983347898/includer
<html><body>
<p>Pre-ESI content</p>
<!--esi <esi:include src="/esitest-fa8a495983347898/content" /> -->
<p>Post-ESI content</p>
</body></html>

vgutierrez@traffic-cache-atstext-buster:~$ curl http://127.0.0.1:4321/esitest-fa8a495983347898/content
<!-- Test comment included via ESI tag -->

# varnish is listening on port 3120 and it's performing the ESI request
vgutierrez@traffic-cache-atstext-buster:~$ curl -H 'X-Forwarded-Proto:https' "http://127.0.0.1:3120/esitest-fa8a495983347898/includer"
<html><body>
<p>Pre-ESI content</p>
 <!-- Test comment included via ESI tag -->
 
<p>Post-ESI content</p>
</body></html>

here varnish fetches /[...]includer and via ESI /content is included

Change 816808 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] cache::text - include esitest service

https://gerrit.wikimedia.org/r/816808

Change 793561 merged by BBlack:

[operations/puppet@production] esitest service for cache nodes

https://gerrit.wikimedia.org/r/793561

Change 816808 merged by BBlack:

[operations/puppet@production] cache::text - include esitest service

https://gerrit.wikimedia.org/r/816808

Change 810030 merged by BBlack:

[operations/puppet@production] trafficserver: Add ESI testing remap rule

https://gerrit.wikimedia.org/r/810030

Change 810044 merged by BBlack:

[operations/puppet@production] varnish: Enable ESI for /esitest-fa8a495983347898/includer

https://gerrit.wikimedia.org/r/810044

Wooohoooo, many congrats on this merge @BBlack and @Vgutierrez!!!!! Also many thanks for the explanation and many apologies for the long delay in replying @Vgutierrez!!

Just adding here a few more notes from when I was poking at ESI on a local Varnish, just in case they might be useful... Again this is very much from a perspective of near total ignorance, so many apologies if this is just noise...

  • To prevent content injected via ESI from in turn injecting anything else via ESI, perhaps conditional somewhere along the lines req.esi_level != 0 could be worthwhile?
  • I didn't see any easy (built-in) way to count how many ESI inclusions have been performed per top-level request, but I imagine counting and checking that might be useful, to prevent more than one ESI inclusion and stop user-controlled content from ever triggering an ESI inclusion?

Congrats and thanks so much once again!!!!!!!!!! :) :)

Change 818460 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] cp4027: enable manual ESI testing

https://gerrit.wikimedia.org/r/818460

Change 818460 merged by BBlack:

[operations/puppet@production] cp4027: enable manual ESI testing

https://gerrit.wikimedia.org/r/818460

Sorry for all the delays on my end as well, but we're getting somewhere. On the earlier points about attachment via varnish vs ats-be: For now, it's just simpler to attach it via ats-be because that's how all other services behind varnish are currently defined, with varnish only knowing of ats-be as its singular source of data.. We may attach a real solution as a direct backend to Varnish, or not, depending on the tradeoffs once we get down to the details about cacheability and such. I'm hoping that least some banner outputs will be categorically cacheable for some short timeframes based on the determinant parameters, in which case we may want ats-be's longer-tail cache storage playing a role as well (or we may decide we'd rather have varnish reach out directly for efficiency.

Just adding here a few more notes from when I was poking at ESI on a local Varnish, just in case they might be useful... Again this is very much from a perspective of near total ignorance, so many apologies if this is just noise...

  • To prevent content injected via ESI from in turn injecting anything else via ESI, perhaps conditional somewhere along the lines req.esi_level != 0 could be worthwhile?
  • I didn't see any easy (built-in) way to count how many ESI inclusions have been performed per top-level request, but I imagine counting and checking that might be useful, to prevent more than one ESI inclusion and stop user-controlled content from ever triggering an ESI inclusion?

So, the current patches aims are merely to test and validate that varnish's ESI capabilities work at all as we expect and don't immediately cause grave issues, and isn't really reflective of how any final design would be configured. On the points above: I'd image we'll limit to both a single layer of inclusion and limit by-including-URI to only process ESI from known paths we're using for this feature.

In manual testing so far, the basic observations are:

  • ESI does work as we roughly expected it to and does the inclusion properly in a real request
  • All request attributes (other than the changed request URL), are preserved in the ESI subrequest it triggers (arbitrary headers, cookies, user-agent, etc). The VCL does also have access to the original request URL (via top_req) during the subrequest.
  • A cached "includer" still re-requests an expired or uncacheable ESI content subrequest.
  • Nothing seems to crash :)

I'd like to also test GZIP issues with this manual test harness mechanism, as we've seen issues there long ago. We'll make variants of both the includer and the content which are gzipped (and have enough length to cause real compression) and try all the combinations (gzip including non-gzip, non-gzip including gzip, gzip including gzip). Those patches should be pretty trivial. The potential/past issues there are mostly about re-adjusting gzip block boundaries and byte offsets (varnish stores the cached includer in gzip form, and then does some tricks to inject new content in the middle of gzip blocks efficiently...). Worst case, we could probably just ensure gzip is disabled (internally and on the backend side, not the user-facing side) for related requests if there's issues here.

Next step will be testing this at increasing scale with "real" user requests. Basically, this involves having the normal banner output code that's live on the wikis today use our includer HTML comment string <!--esi <esi:include src="/esitest-fa8a495983347898/content" /> --> (which will be ESI-replaced by another HTML comment in cases where ESI is enabled), and then testing the inclusion of the comment output in live user requests from a single cache node and expanding from there.

Cool, thanks so much for all these explanation!!

I'm hoping that least some banner outputs will be categorically cacheable for some short timeframes based on the determinant parameters, in which case we may want ats-be's longer-tail cache storage playing a role as well (or we may decide we'd rather have varnish reach out directly for efficiency.

The answer on this is, yes! A banner selection service will know if its decision was deterministic based on inputs. If a decision was deterministic, the banner service could include a header with the output indicating cacheability.

While the cacheability of the results of any given inputs could change following changes to campaign/banner configuration, it's fine to have a bit of lag, as we currently do. Currently updates to campaign/banner configuration take up to 10 minutes to roll out (or whatever the current TTL is for the ResourceLoader cache). So, yes, I think we'd want either a similar TTL across the board for decisions that are flagged by the banner service as cacheable, or (no clue if this would be feasible or a good idea) a longer TTL plus a purge automatically triggered any time campaign/banner config changes? Does that make sense?

Next step will be testing this at increasing scale with "real" user requests. Basically, this involves having the normal banner output code that's live on the wikis today use our includer HTML comment string <!--esi <esi:include src="/esitest-fa8a495983347898/content" /> --> (which will be ESI-replaced by another HTML comment in cases where ESI is enabled), and then testing the inclusion of the comment output in live user requests from a single cache node and expanding from there.

OK sounds great! Having CentralNotice add that to the base HTML will be trivial. We could also potentially add some reporting client-side, if that's useful (i.e., add some JS to verify the included content and report back).

Heyy just some quick thoughts and questions about testing with live user requests...

  • So, we could use CentralNotice to add the includer HTML comment string to the base HTML of pages that CN currently shows banners on... right?
  • CN already modifies the base HTML here, and I imagine that's where we'd put the code to insert the includer, too.
  • We could use config variables to determine whether or not we insert the includer, and what the includer src URL would be. That way, we could enable or disable things, or change the URL, on a per-wiki basis.
  • Idea for the client-side reporting: if it's useful and easy to do, a timestamp could be included in the ESI-injected comment. We could check the timestamp in JS and report back using a sampled client-side event.

Thanks again!!! :)

Hmmm one more note... so, another option would be to add the includer inside banner content, rather than the base HTML. (Maybe that's what you meant by "having the normal banner output code that's live on the wikis today use our includer"?)

The advantages of adding the includer in banner content are: (1) we wouldn't have to wait for the base HTML cache to roll over when we turn it on or off; and (2) and we could have more fine-grained control over how frequently and where it goes out.

On the flip side, the disadvantages of adding it in banner content would be: (1) we'd be testing something less similar to an expected production setup for a banner microservice; and (2) to test at scale, we'd end up adding an extra HTTP request from the client on many pageviews. So, maybe which route we take would depend on the requirements of the tests Traffic would like to run?

(Also, the above is just a quick brain-dump, and there are almost certainly more considerations that I've missed...)

Change 843590 had a related patch set uploaded (by AndyRussG; author: AndyRussG):

[operations/mediawiki-config@master] CentralNotice: Set ESI test string

https://gerrit.wikimedia.org/r/843590

@BBlack @Vgutierrez hiii! The CentralNotice change will go out with the main cluster deploy train this week! The related config change to set the requested string will be deployed tomorrow (afternoon UTC time).

The string to be added to the HTML is, as requested here, <!--esi <esi:include src="/esitest-fa8a495983347898/content" /> -->.

(I was initially confused about how the entire comment, and not just the <esi> tag inside it, would be replaced, as shown in the console output here. But now I see this is a specific feature that exists for precisely this use case! Just noting here for reference...)

Thanks much!!! :) :)

Change 843590 merged by jenkins-bot:

[operations/mediawiki-config@master] CentralNotice: Set ESI test string

https://gerrit.wikimedia.org/r/843590

The change has rolled out to group 0 wikis, and should go to groups 1 and 2 this week. An example of a page with the ESI comment in the base HTML is here.

However, there's a minor issue: for most wikis, the code we just deployed makes/will make the comment show up only for desktop sites. (This is due to a config setting for mobile sites that's on prod but that I didn't have in my local setup.) Details are in the attached subtask. I've reached out to the appropriate team for guidance.

You should still be able start testing live as soon as you're ready; I don't know of any issues that would affect this on desktop sites.

Thanks!!

The fix for T320734 is now deployed, so the ESI comment string for testing is now being added to the base HTML for both desktop and mobile sites. So, once the cache finishes rolling over, it'll be present throughout the base HTML served via Varnish.

Thanks so much!!

Change 884052 had a related patch set uploaded (by BBlack; author: BBlack):

[operations/puppet@production] esitest: compat with haproxy >= 2.5

https://gerrit.wikimedia.org/r/884052

Change 884052 abandoned by BBlack:

[operations/puppet@production] esitest: compat with haproxy >= 2.5

Reason:

https://gerrit.wikimedia.org/r/884052

Change 887882 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] cp4044: Enable ESI testing

https://gerrit.wikimedia.org/r/887882

Change 887882 merged by Vgutierrez:

[operations/puppet@production] cp4044: Enable ESI testing

https://gerrit.wikimedia.org/r/887882

Change 887850 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] Revert "cp4044: Enable ESI testing"

https://gerrit.wikimedia.org/r/887850

Change 887850 merged by Vgutierrez:

[operations/puppet@production] Revert "cp4044: Enable ESI testing"

https://gerrit.wikimedia.org/r/887850

Change 888044 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] varnish: Perform ESI processing on wiki pages

https://gerrit.wikimedia.org/r/888044

Change 888044 merged by Vgutierrez:

[operations/puppet@production] varnish: Perform ESI processing on wiki pages

https://gerrit.wikimedia.org/r/888044

Regarding the concern that malicious user input could lead to injection of ESI tags:

  • In the old parser:
    • HTML comments in user input are completely removed
    • Angle brackets in generated attributes are escaped by Html::expandAttributes()
    • Angle brackets in attributes from user input are escaped by Sanitizer::safeEncodeTagAttributes()
  • In generated JS scripts: callers of Xml::encodeJsVar() and ResourceLoader::makeConfigSetScript() are safe due to the use of JSON_HEX_TAG. Comments indicate that it is already an XSS vulnerability to omit JSON_HEX_TAG.
  • In TemplateStyles: security depends on the CSS sanitizer, which I'm not familiar with. So this needs further review.
  • In Parsoid:
    • HTML comments in user input are processed by WTUtils::encodeComment(), which allows left angle brackets but encodes right angle brackets as "&gt;"
    • Left angle brackets in attributes are escaped by XMLSerializer::serializeToString(), but right angle brackets are passed through.
  • Other content types:
    • It is possible to append ?action=raw to a URL in /wiki, e.g. https://en.wikipedia.org/wiki/Main_Page?action=raw . So it is not safe to depend on the URL prefix.
    • It would be prudent to make ESI processing be activated only when the response has Content-Type: text/html
    • It would be prudent to have an internal response header like X-Allow-ESI which MediaWiki will send when it generates a page that includes ESI tags, and to activate ESI replacement only when the header is seen.

It's unclear from the documentation the contexts in which Varnish will replace ESI tags. It says "The ESI parser expects the XML to be reasonably well formed, but this may fail if you are ESI including non-XML files." But our HTML output is typically not well-formed XML. If Varnish can avoid replacing ESI tags in <style> and <script> elements, this will help to reduce the attack surface, but these elements cannot be parsed with an XML parser.

Change 889530 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] varnish: Limit ESI processing to text/html pages

https://gerrit.wikimedia.org/r/889530

Change 889530 merged by Vgutierrez:

[operations/puppet@production] varnish: Limit ESI processing to text/html pages

https://gerrit.wikimedia.org/r/889530

Change 891586 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] varnish: Limit ESI depth to 1

https://gerrit.wikimedia.org/r/891586

Change 894600 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable ESI testing in cp4044

https://gerrit.wikimedia.org/r/894600

Change 894600 merged by Vgutierrez:

[operations/puppet@production] hiera: Enable ESI testing in cp4044

https://gerrit.wikimedia.org/r/894600

Mentioned in SAL (#wikimedia-operations) [2023-03-06T11:42:43Z] <vgutierrez> enable ESI testing in cp4044 - T308799

Change 895172 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable ESI testing in cp3064

https://gerrit.wikimedia.org/r/895172

Change 895172 merged by Vgutierrez:

[operations/puppet@production] hiera: Enable ESI testing in cp3064

https://gerrit.wikimedia.org/r/895172

Change 895702 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable ESI testing in cp5024

https://gerrit.wikimedia.org/r/895702

Change 895702 merged by Vgutierrez:

[operations/puppet@production] hiera: Enable ESI testing in cp5024

https://gerrit.wikimedia.org/r/895702

Change 903274 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] varnish: Bypass ATS for esitest requests

https://gerrit.wikimedia.org/r/903274

Change 903274 merged by Vgutierrez:

[operations/puppet@production] varnish: Bypass ATS for esitest requests

https://gerrit.wikimedia.org/r/903274

Change 904556 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] varnish: Set backend_hit = esitest for HfP requests

https://gerrit.wikimedia.org/r/904556

Change 904556 merged by Vgutierrez:

[operations/puppet@production] varnish: Set backend_hint = esitest for HfP requests

https://gerrit.wikimedia.org/r/904556

BCornwall changed the task status from Open to Stalled.Mar 30 2023, 9:02 PM
BCornwall raised the priority of this task from High to Needs Triage.
BCornwall moved this task from Ready for work to Backlog on the Traffic board.
BCornwall removed a project: SRE.

Change 904768 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] trafficserver: Remove esitest backend

https://gerrit.wikimedia.org/r/904768

Change 904768 merged by Vgutierrez:

[operations/puppet@production] trafficserver: Remove esitest backend

https://gerrit.wikimedia.org/r/904768

Change 905173 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Enable esitest on text@eqsin

https://gerrit.wikimedia.org/r/905173

Change 905173 merged by Vgutierrez:

[operations/puppet@production] hiera: Enable esitest on text@eqsin

https://gerrit.wikimedia.org/r/905173

Vgutierrez changed the task status from Stalled to In Progress.Apr 3 2023, 10:34 AM
Vgutierrez triaged this task as High priority.
Vgutierrez moved this task from Backlog to Traffic team actively servicing on the Traffic board.

This is an ongoing effort and shouldn't be flagged as stalled. Thanks

Mentioned in SAL (#wikimedia-operations) [2023-04-03T10:35:02Z] <vgutierrez> Extend the ESI test to text@eqsin, revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/905173/ if this gives any issue - T308799

Change 908568 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] Revert "hiera: Enable ESI testing in cp3064"

https://gerrit.wikimedia.org/r/908568

Change 908569 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] Revert "hiera: Enable esitest on text@eqsin"

https://gerrit.wikimedia.org/r/908569

Change 908568 merged by Vgutierrez:

[operations/puppet@production] Revert "hiera: Enable ESI testing in cp3064"

https://gerrit.wikimedia.org/r/908568

Change 908569 merged by Vgutierrez:

[operations/puppet@production] Revert "hiera: Enable esitest on text@eqsin"

https://gerrit.wikimedia.org/r/908569

We didn't experience any major issues with ESI enabled under normal load but under heavy traffic nodes running the ESI experiment starved varnish threads and failed to serve requests significantly faster than nodes where the ESI experiment wasn't running

Change 891586 abandoned by Vgutierrez:

[operations/puppet@production] varnish: Limit ESI depth to 1

Reason:

https://gerrit.wikimedia.org/r/891586