Investigate ways to reduce cache retention timespans
Closed, Declined | Public | BUG REPORT

Description

Background

Recent discussions on T366517 as well as recent feature deployments (e.g. font-size and dark-mode) have led the Web team to question some assumptions around our front-end Varnish caching. When asking "how long are documents served from the Varnish/ATS cache?" we’ve gotten responses ranging from 24 hours to 2 weeks. After tracking some metrics related to dark-mode-specific HTML and digging through some documentation, we discovered that the cache lifetime for articles could be as long as 14 days.

From my (very basic) understanding: Varnish has a TTL of 24 hours, after which it asks MediaWiki to revalidate a document. If the document's 'last-modified' header (reflecting when an article was last edited) hasn't changed, then MediaWiki continues revalidating the document for up to 14 days (based on T124954#2404883).
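To make that understanding concrete, here is a minimal Python sketch of the behaviour as described above. It is a toy model, not Varnish's or MediaWiki's actual logic: the names `CacheEntry` and `serve` are hypothetical, and the 24-hour TTL and 14-day cap are simply the figures quoted in this task.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60
TTL_SECONDS = 1 * DAY        # assumed freshness window before revalidation
MAX_AGE_SECONDS = 14 * DAY   # assumed cap on how long one rendered body is kept


@dataclass
class CacheEntry:
    fetched_at: int     # when the origin last rendered this HTML body
    validated_at: int   # when the cache last confirmed the body was current
    last_modified: int  # article's last edit time when the body was rendered


def serve(entry: CacheEntry, now: int, origin_last_modified: int):
    """Return (outcome, entry) for a request arriving at time `now`."""
    # Within the TTL the cache answers without contacting the origin.
    if now - entry.validated_at < TTL_SECONDS:
        return "hit", entry

    # TTL expired: revalidate. A new body is rendered only if the article
    # was edited or the body has reached the assumed 14-day cap.
    too_old = now - entry.fetched_at >= MAX_AGE_SECONDS
    if too_old or origin_last_modified > entry.last_modified:
        return "refetched", CacheEntry(now, now, origin_last_modified)

    # Unchanged article: the old body (with the old skin HTML) is kept.
    return "revalidated", CacheEntry(entry.fetched_at, now, entry.last_modified)


def simulate_unedited_article(days: int):
    """Request an unedited article once a day; return the daily outcomes."""
    entry = CacheEntry(fetched_at=0, validated_at=0, last_modified=0)
    outcomes = []
    for day in range(1, days + 1):
        outcome, entry = serve(entry, day * DAY, origin_last_modified=0)
        outcomes.append(outcome)
    return outcomes
```

In this model, an article that is never edited is revalidated daily but only re-rendered once the cap is reached, which is why a skin change deployed on day 1 would not appear on that page until day 14.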

Since skin-level changes are not taken into account in the 'last-modified' header, this 14-day timespan complicates feature development as well as product releases. It essentially means we have to accommodate cached HTML for at least two deployment cycles while building new features. To give an idea of how onerous this can be, there are currently 75 commits in Vector related to just cached HTML. This also means we can’t reliably communicate the availability of new features to our end users for two weeks, e.g. when we put out a press release informing users of a new feature, there is often confusion as to why it’s available on some pages but not others (especially when a change is visually striking like dark-mode).

Problem statement

As a team building features for anonymous users, accommodating cached HTML has come at a considerable cost. Writing code to accommodate new features as well as cached HTML has slowed down feature development, leading to bugs and confusion around feature availability.
Acknowledging that caching HTML is a necessity in our environment and at our scale, how could we reduce the cache retention lifetime so that it leads to more predictable feature releases? We suspect other teams have come across this before - is there an existing mechanism we could use? Can or should we build something specifically for this purpose? We'd like to explore solutions to this issue that work for both product and platform/infrastructure teams.

Event Timeline

Jdlrobson added a subscriber: ovasileva.

@ovasileva we need to set up a meeting with the platform team and move this to their backlog, so I'm moving it out of the sprint board and leaving it with you for next steps.

Jdlrobson triaged this task as Medium priority. Sep 12 2024, 10:30 PM
Jdlrobson moved this task from Incoming to Groomed on the Web-Team-Backlog-Archived board.
Jdlrobson-WMF subscribed.

Per quarterly grooming: I think cached HTML is the cost of us doing business. I don't think it makes sense to explore a generic solution here; instead, we should revisit this the next time we have that problem.