
Proposal: Fix banner bump with ESI + banner service behind Varnish
Open, Needs Triage, Public

Description

This is a proposal for preventing content shifts when banners are injected into pages (the "banner bump").

Overview

Implement a banner service that would select and return a banner, or no banner, for every pageview.

In CentralNotice, insert an <esi> tag pointing to the banner service into the base HTML.

In Varnish, use the ESI feature to call the banner service and pass along the required inputs. HTML content returned by the banner service will be injected into the base HTML.
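To illustrate the mechanism, here is a minimal Python sketch of the substitution Varnish would perform. It is a simulation, not Varnish code. It assumes the tag takes the standard `<esi:include src="..."/>` form; the `/banner` URL, the `fake_banner_service` function and the banner markup are hypothetical.

```python
import re
from typing import Callable

# Matches a self-closing ESI include tag and captures its src attribute.
ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]*)"\s*/>')


def process_esi(base_html: str, fetch: Callable[[str], str]) -> str:
    """Replace each <esi:include> tag with the body returned for its src URL."""
    return ESI_INCLUDE.sub(lambda match: fetch(match.group(1)), base_html)


def fake_banner_service(url: str) -> str:
    """Hypothetical stand-in for the banner service; "" means "no banner"."""
    if url == "/banner":
        return '<div id="centralNotice">Example banner</div>'
    return ""


page = '<body><esi:include src="/banner"/><h1>Article</h1></body>'
result = process_esi(page, fake_banner_service)
```

Because the banner HTML is already in the page when it reaches the browser, nothing is inserted after first paint, which is what removes the bump.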

Client-side JavaScript will detect if a banner was injected and may set a cookie with client data for the banner service to read on the next pageview. Code for this will be mostly the same as existing client-side CentralNotice code. Data reporting would not change from the current system.
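As a rough sketch of how client data could travel in a cookie, the Python below encodes a small dictionary as percent-encoded JSON and decodes it defensively on the service side. The cookie format and the field names are assumptions; the proposal does not specify either.

```python
import json
from urllib.parse import quote, unquote


def encode_client_data(data: dict) -> str:
    """Serialize client data as compact JSON, percent-encoded for use in a cookie."""
    compact = json.dumps(data, separators=(",", ":"), sort_keys=True)
    return quote(compact, safe="")


def decode_client_data(cookie_value: str) -> dict:
    """Parse a cookie value; return empty data if it is missing or malformed."""
    try:
        data = json.loads(unquote(cookie_value))
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return {}
    # Anything other than a JSON object is treated as "no client data".
    return data if isinstance(data, dict) else {}
```

Treating a bad cookie as empty data means a corrupted or tampered value degrades to the behaviour for a first-time visitor rather than to an error.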

End-user impact

For readers, CentralNotice campaign admins and data analysts, there would be essentially no change from the current system, except that the banner bump would be eliminated.

Implementation notes

Inputs

  • From MediaWiki:
    • Active CentralNotice campaign config, provided by CentralNotice via a MediaWiki API.
    • Mapping between hostnames and CentralNotice projects (could be included in campaign config API response, or provided in a static file).
  • Targeting data, provided by Varnish:
    • hostname
    • language
    • country code
    • regional subdivision code
    • user agent string
    • mobile or desktop skin
    • logged-in status
  • Client data, provided in cookies set in JavaScript on previous pageviews:
    • campaigns for which the client has reached the maximum number of impressions
    • previous clicks on banner close buttons
    • buckets by campaign
    • current step in campaigns using banner sequence
    • user preferences for campaign display
  • Banner content, provided by Special:BannerLoader.

(See notes on caching, below.)
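The per-request inputs listed above could be modelled roughly as follows. This is only a sketch in Python: all type names, field names and field types are hypothetical, chosen to mirror the list, and the campaign config and banner content are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class TargetingData:
    """Per-request values supplied by Varnish."""
    hostname: str
    language: str
    country: str
    region: str        # regional subdivision code
    user_agent: str
    mobile: bool       # True for the mobile skin, False for desktop
    logged_in: bool


@dataclass
class ClientData:
    """State read from cookies set by JavaScript on previous pageviews."""
    maxed_out_campaigns: Set[str] = field(default_factory=set)
    close_clicks: List[str] = field(default_factory=list)
    buckets: Dict[str, int] = field(default_factory=dict)
    sequence_steps: Dict[str, int] = field(default_factory=dict)
    display_preferences: Dict[str, bool] = field(default_factory=dict)


@dataclass
class BannerRequest:
    """Everything the service needs to know about one pageview."""
    targeting: TargetingData
    client: ClientData = field(default_factory=ClientData)
```

A first-time visitor has no cookies, so `ClientData` defaults to empty collections.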

Output

HTML elements with banner content (or no banner) and data for JavaScript post-processing.

Selection logic

Part 1:

[Diagram: banner_service_part1.png]

Part 2:

[Diagram: banner_service_part2.png]

(Code: P22017, P22018. Diagrams generated with PlantUML.)

Caching and optimization

Campaign config from CentralNotice

CentralNotice campaign config could be cached in RAM by the banner service. Currently, the config sent to the client for banner selection is cached on the same schedule as ResourceLoader modules.
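One simple way to hold the config in RAM is a single-value cache with a time-to-live, sketched below in Python. The `TtlCache` class, the `load` callback (which would call the CentralNotice API) and the TTL value used in the example are all hypothetical.

```python
import time
from typing import Any, Callable, Optional


class TtlCache:
    """Hold one value in memory and reload it once it is older than ttl seconds."""

    def __init__(self, load: Callable[[], Any], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load
        self._ttl = ttl
        self._clock = clock          # injectable so tests can control time
        self._value: Any = None
        self._expires_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._load()
            self._expires_at = now + self._ttl
        return self._value
```

For example, `TtlCache(load_campaign_config, ttl=300)` would refetch the config at most once every five minutes, where `load_campaign_config` is a function you would supply.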

Banner content

Banner content is specific to the user's language and the campaign the banner is being displayed for. (A single banner can be assigned to more than one campaign.) The banner service could cache content for frequently selected banner/language/campaign permutations, or it could always request banner content from Varnish (which should be tuned appropriately).
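If the service did cache banner content itself, a bounded least-recently-used cache keyed by banner, language and campaign would be one option. The Python sketch below uses the standard library's `functools.lru_cache`; `fetch_banner_content` is a hypothetical stand-in for a request to Special:BannerLoader, and the cache size is arbitrary.

```python
from functools import lru_cache
from typing import List, Tuple

# Records each simulated backend request so cache hits can be observed.
fetch_log: List[Tuple[str, str, str]] = []


def fetch_banner_content(banner: str, language: str, campaign: str) -> str:
    """Hypothetical stand-in for fetching rendered banner HTML."""
    fetch_log.append((banner, language, campaign))
    return f"<div>{banner} / {language} / {campaign}</div>"


@lru_cache(maxsize=1024)
def cached_banner_content(banner: str, language: str, campaign: str) -> str:
    """Return banner HTML, fetching only on the first request for each permutation."""
    return fetch_banner_content(banner, language, campaign)
```

This sketch has no expiry, so a real service would also need a way to invalidate entries when a banner is edited.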

Selection decisions

While the banner selection process can involve randomness, in many cases it doesn't. The banner service could cache decisions made deterministically, to speed up selection in such cases.

In addition, we'll often know in advance that selection for many large user segments doesn't need to take certain inputs into account. For example, sometimes entire projects will have no banners showing, so for those projects we can ignore geolocation inputs. Similarly, most countries will often have no active campaigns using regional subdivision targeting; for those countries, we can safely ignore regional geolocation. The campaign config provided by CentralNotice could include data about cases in which certain inputs can be ignored. By ignoring unneeded inputs, we can limit the size of an in-memory cache of deterministic decisions.
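The idea of ignoring unneeded inputs can be sketched as a cache-key function that blanks out any input known not to matter. This Python example is a sketch only: the `Hints` structure and its two fields are hypothetical, standing in for whatever data the campaign config would carry, and only three inputs are considered.

```python
from typing import FrozenSet, NamedTuple, Optional, Tuple


class Hints(NamedTuple):
    """Hypothetical data from campaign config about which inputs can be ignored."""
    hosts_without_campaigns: FrozenSet[str]
    countries_with_region_targeting: FrozenSet[str]


CacheKey = Tuple[str, Optional[str], Optional[str]]


def decision_cache_key(hostname: str, country: str, region: str,
                       hints: Hints) -> CacheKey:
    """Build a cache key, dropping inputs that cannot affect selection."""
    if hostname in hints.hosts_without_campaigns:
        # No banners on this project: geolocation is irrelevant.
        return (hostname, None, None)
    if country not in hints.countries_with_region_targeting:
        # No campaign in this country targets regions: ignore the region.
        return (hostname, country, None)
    return (hostname, country, region)
```

Requests that differ only in an ignored input then map to the same key, so they share one cache entry.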

Still, it's not clear that caching selection decisions would improve performance much, and it would add complexity to the system. So, it may not be worth doing.

Added restriction

One minor, new restriction is needed for CentralNotice campaign config: every campaign must either (a) have only one bucket (which can have multiple banners targeting different devices and logged-in statuses), or (b) have all banners, across all buckets, target the same set of devices and logged-in statuses.

In practice, all CentralNotice campaigns already comply with this restriction.
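A check for this restriction could look roughly like the Python sketch below, which could run when campaign config is saved or loaded. The `Banner` structure and its field names are illustrative assumptions, and a campaign is represented simply as a list of buckets, each a list of banners.

```python
from typing import FrozenSet, List, NamedTuple


class Banner(NamedTuple):
    """Illustrative banner targeting settings (field names are assumptions)."""
    name: str
    devices: FrozenSet[str]
    display_anon: bool       # shown to logged-out users
    display_account: bool    # shown to logged-in users


def complies_with_restriction(buckets: List[List[Banner]]) -> bool:
    """True if the campaign has one bucket, or all banners share the same targets."""
    if len(buckets) <= 1:
        return True  # case (a): banners within a single bucket may differ
    targets = {
        (banner.devices, banner.display_anon, banner.display_account)
        for bucket in buckets
        for banner in bucket
    }
    return len(targets) <= 1  # case (b): one distinct targeting combination
```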

Logging and monitoring

What sort of logging would the banner service do? Maybe log a summary of results once a second?
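If a once-per-second summary turns out to be the right approach, it could be as simple as counting outcomes and emitting the totals when the interval has elapsed. The Python sketch below is one possible shape; the `SummaryLogger` class, the outcome labels and the `emit` callback are hypothetical.

```python
import time
from collections import Counter
from typing import Callable, Dict, Optional


class SummaryLogger:
    """Count selection outcomes and emit one summary per interval."""

    def __init__(self, emit: Callable[[Dict[str, int]], None],
                 interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._emit = emit
        self._interval = interval
        self._clock = clock
        self._counts: Counter = Counter()
        self._window_start = clock()

    def record(self, outcome: str) -> None:
        """Count one outcome, first flushing the previous window if it has ended."""
        now = self._clock()
        if now - self._window_start >= self._interval:
            self.flush(now)
        self._counts[outcome] += 1

    def flush(self, now: Optional[float] = None) -> None:
        """Emit the current counts (if any) and start a new window."""
        if self._counts:
            self._emit(dict(self._counts))
        self._counts.clear()
        self._window_start = self._clock() if now is None else now
```

Note that this version only flushes when a new outcome is recorded, so a real service would probably also flush on a timer and on shutdown.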

What would real-time monitoring and alerts look like?

Programming languages

Two options for implementing a banner service have come up so far:

  • Rust. Advantages: performance, memory and thread safety, significant compile-time checks for correctness. Disadvantages: this language is not currently used much (or perhaps not at all) in our stack.
  • Go. Advantages: relative ease of development, previous use in our stack. Disadvantages: performance might be a bit slower than Rust, though it's not clear that the difference would be significant in this use case.

Unit tests

CentralNotice already includes extensive unit tests of banner selection code, currently used in both PHPUnit and QUnit tests. These could provide a starting point for unit tests for the banner service.

Proof-of-concept

Here is a working proof-of-concept, using Varnish <esi> tags and a standalone Rust web service! (Note: instructions there are not yet complete. Please ping me if you'd like to try it out!)

Questions

Risks

What are the risks involved in this proposal, and how could they be mitigated?

Risks and possible mitigations:

  • Risk: Banner service goes down or takes too long to respond. Possible mitigation: In Varnish (if possible), set a short timeout for requests to the banner service. Inject a synthetic response if the connection is refused or if the timeout is exceeded.
  • Risk: A bug causes the banner service to consume too much RAM or CPU. Possible mitigation: Automated kill switch: monitoring that automatically turns off banners if resource usage exceeds a specified level.
  • Risk: Site performance hit under load. Possible mitigation: Gradual rollout with constant profiling as we go, and automated kill switch (see above).
  • Risk: DDoS. Possible mitigation: Automated kill switch (see above).
  • Risk: Security. Possible mitigation: Prevent Varnish from processing any additional ESI tags that an attacker might be able to reflect into the base HTML. Prevent ESI tags from causing requests to anywhere other than the banner service.
  • Risk: Limited capacity to maintain the service or respond to urgent bugs (if we use languages or tools that are not currently part of our stack). Possible mitigation: Train multiple engineers in the languages and tools used to implement the banner service.
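The first mitigation (short timeout plus synthetic response) would live in Varnish configuration, but the intended behaviour can be illustrated in a few lines of Python. This is a sketch only: the `fetch` callback, the `NO_BANNER` constant and the 50 ms default are hypothetical.

```python
from typing import Callable

# Synthetic response: an empty body, so the page renders with no banner.
NO_BANNER = ""


def banner_or_fallback(fetch: Callable[[float], str], timeout: float = 0.05) -> str:
    """Return the banner service's response, or NO_BANNER if it fails or is too slow.

    `fetch` receives the timeout in seconds and is expected to raise
    TimeoutError or ConnectionError (for example, connection refused).
    """
    try:
        return fetch(timeout)
    except (TimeoutError, ConnectionError):
        return NO_BANNER
```

The point of the design is that a failing banner service costs readers a banner, never the page.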

Durability

Could this solution potentially be permanent or semi-permanent? If so, or if not, why? For how long might it be acceptable to leave it in place?

Rollout

What steps might be required for a rollout of this system? Tentatively, perhaps it could go something like this?

  • Analysis of webrequest and server load data to try to understand performance requirements.
  • Initial test of Varnish ESI tag on production (for example to inject a static HTML comment).
  • Initial profiling (lab conditions)
  • Initial rollout on one or two small wikis
  • More profiling
  • Second rollout stage (more wikis, including one medium-sized wiki)
  • More profiling
  • Third rollout stage (one large wiki)
  • More profiling
  • Full rollout to all wikis

Event Timeline

Restricted Application added a subscriber: Aklapper.
Krinkle renamed this task from "Fix banner bump with ESI and banner service behind Varnish" to "Proposal: Fix banner bump with ESI-like banner service behind Varnish". Mar 7 2022, 7:38 PM
AndyRussG renamed this task from "Proposal: Fix banner bump with ESI-like banner service behind Varnish" to "Proposal: Fix banner bump with ESI + banner service behind Varnish". Mar 8 2022, 2:41 AM