Description

It is a truth universally acknowledged that page HTML is cached for 30 days. But that number is only a theoretical maximum; the actual lifetime of a cached page is probably much shorter. It would be good to know the real figure, because the cache lifetime of HTML influences which resources we can inline.
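As a quick sanity check, one can fetch a page anonymously and inspect the caching headers it comes back with; a minimal sketch, where the header values in the comments are illustrative rather than guaranteed:

```
# Fetch a random article anonymously and inspect its caching headers.
import requests

r = requests.get('https://en.wikipedia.org/wiki/Special:Random')
print(r.headers.get('Cache-Control'))  # e.g. "s-maxage=2592000, must-revalidate, max-age=0"
print(r.headers.get('Age'))            # seconds this copy has already spent in cache
```

Here 2592000 seconds is the 30-day maximum mentioned above, and the Age header shows how long the particular copy you received has actually been sitting in the cache.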
Related Objects
Mentioned Here
- T124954: Decrease max object TTL in varnishes
Event Timeline
Comment Actions
Put differently, what would the performance impact be if we reduced s-maxage to
a) 2 weeks,
b) 1 week,
c) 3 days?
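For concreteness, those candidate values in seconds (a trivial conversion, shown only for reference against the current 30-day value of 2592000):

```
# Candidate s-maxage values in seconds; the loop is just illustration.
DAY = 24 * 60 * 60
for label, days in (('2 weeks', 14), ('1 week', 7), ('3 days', 3)):
    print('%s -> s-maxage=%d' % (label, days * DAY))
# 2 weeks -> s-maxage=1209600
# 1 week  -> s-maxage=604800
# 3 days  -> s-maxage=259200
```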
Comment Actions
Script to get cached HTML ages:
```
# -*- coding: utf-8 -*-
"""
cache_age
~~~~~~~~~

Retrieve random pages from random Wikimedia projects and scrape their
cache age. Then write the ages to a file on disk.
"""
import datetime
import random
import re

import requests

NUM_PAGES = 10000


def get_sites():
    """Return the URLs of all Wikimedia wikis, via the sitematrix API."""
    params = {'action': 'sitematrix', 'format': 'json'}
    r = requests.get('https://meta.wikimedia.org/w/api.php', params)
    site_matrix = r.json()['sitematrix'].values()
    projects = (p for p in site_matrix if isinstance(p, dict))
    sites = []
    for project in projects:
        for site in project.get('site', ()):
            sites.append(site['url'])
    return sites


sites = get_sites()
ages = []
while len(ages) < NUM_PAGES:
    site = random.choice(sites)
    r = requests.get(site + '/wiki/Special:Random')
    # The rendered HTML embeds a parser-cache comment of the form
    # "... and timestamp 20160127123456 and revision id ...", which tells
    # us when this copy of the page was generated.
    m = re.search(r'and timestamp (\d+) and revision', r.text)
    if not m:
        continue
    parsed = datetime.datetime.strptime(m.group(1), '%Y%m%d%H%M%S')
    delta = datetime.datetime.utcnow() - parsed
    age = delta.total_seconds()
    ages.append(age)

with open('html_ages.csv', 'w') as f:
    for age in ages:
        print(age, file=f)
```
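Once html_ages.csv exists, a short follow-up sketch can summarize how much of the 30-day window is actually used; this assumes the file holds one age in seconds per line, as written by the script above:

```
# Summarize the sampled cache ages from html_ages.csv.
import statistics

with open('html_ages.csv') as f:
    ages = sorted(float(line) for line in f if line.strip())

days = [a / 86400.0 for a in ages]
print('median age: %.1f days' % statistics.median(days))
for pct in (75, 90, 95, 99):
    idx = min(len(days) - 1, len(days) * pct // 100)
    print('p%d: %.1f days' % (pct, days[idx]))
```

Note, though, that Special:Random weights this sample by page, not by traffic, which matters for the next comment.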
Comment Actions
My point in the other ticket is that it's not really about the percentage of pages that are cached longer than X; it's about the percentage of requests. In the extreme case, if reducing the Varnish maximum TTL from 30 days to 3 days only actually affects 0.01% of all requests, why not set the max TTL at 3 days and save a lot of pain?
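To make that concrete, here is a minimal sketch of the request-weighted calculation. It assumes you have a traffic-weighted sample of cache-object ages, for example Age header values pulled from a slice of production request logs; the file name and format are made up for illustration. The Special:Random sampling above weights by page rather than by request, so it cannot answer this directly.

```
# Sketch: fraction of requests that a lower max TTL would affect.
# Assumes ages.txt holds one Age header value (seconds) per sampled
# request; both the file and its format are illustrative.

THRESHOLD = 3 * 24 * 3600  # candidate max TTL: 3 days

with open('ages.txt') as f:
    ages = [int(line) for line in f if line.strip()]

affected = sum(1 for a in ages if a > THRESHOLD)
print('%.4f%% of sampled requests hit an object older than 3 days'
      % (100.0 * affected / len(ages)))
```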