
Estimate effective cache time for text
Closed, Duplicate · Public

Description

It is a truth universally acknowledged that page HTML is cached for 30 days. But this number is a theoretical maximum; the effective lifetime of cached HTML is probably much lower. It would be good to know what the effective lifetime actually is, because it influences which resources we can inline.

Event Timeline

ori raised the priority of this task from to Needs Triage.
ori updated the task description.
ori added projects: Traffic, Performance Issue.
ori subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

Put differently, what would the performance impact be if we reduced s-maxage to

a) 2 weeks,
b) 1 week,
c) days?

Script to get cached HTML ages:

# -*- coding: utf-8 -*-
"""
  cache_age
  ~~~~~~~~~
  Retrieve random pages from random Wikimedia projects and scrape
  their cache age. Then write the ages to a file on disk.

"""
import datetime
import random
import re

import requests


NUM_PAGES = 10000

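def get_sites():
    """Return the base URL of every site listed in the site matrix."""
    params = {'action': 'sitematrix', 'format': 'json'}
    # The 10-second timeout is an arbitrary choice so the call cannot hang.
    r = requests.get('https://meta.wikimedia.org/w/api.php', params=params,
                     timeout=10)
    r.raise_for_status()
    site_matrix = r.json()['sitematrix'].values()
    # Keep only dict entries (the per-language groups); other values are skipped.
    projects = (p for p in site_matrix if isinstance(p, dict))
    sites = []
    for project in projects:
        for site in project.get('site', ()):
            sites.append(site['url'])
    return sites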

sites = get_sites()

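ages = []
while len(ages) < NUM_PAGES:
    site = random.choice(sites)
    try:
        # Special:Random redirects to a random page on the chosen wiki.
        # The 10-second timeout is an arbitrary choice.
        r = requests.get(site + '/wiki/Special:Random', timeout=10)
    except requests.RequestException:
        # Skip sites that are unreachable or time out, and try another.
        continue
    # Look for the timestamp that the page HTML is expected to embed.
    m = re.search(r'and timestamp (\d+) and revision', r.text)
    if not m:
        continue
    # The timestamp has the form YYYYMMDDHHMMSS and is compared with UTC now.
    parsed = datetime.datetime.strptime(m.group(1), '%Y%m%d%H%M%S')
    delta = datetime.datetime.utcnow() - parsed
    age = delta.total_seconds()
    ages.append(age)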


with open('html_ages.csv', 'w') as f:
    for age in ages:
        print(age, file=f)
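
To relate the collected ages to the 2-week and 1-week options asked about earlier, the file can be summarized as the share of sampled pages older than each threshold. The following is a minimal sketch, not part of the original script: the helper names (`load_ages`, `fraction_older_than`, `summarize`) are made up for illustration, and the sample ages are invented rather than measured. It assumes `html_ages.csv` holds one age in seconds per line, which is what the script writes.

```python
DAY = 86400  # seconds in a day


def load_ages(path):
    """Read one age in seconds per line, skipping blank lines."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


def fraction_older_than(ages, threshold_seconds):
    """Return the share of ages strictly greater than the threshold."""
    if not ages:
        return 0.0
    return sum(1 for a in ages if a > threshold_seconds) / len(ages)


def summarize(ages, day_thresholds=(7, 14)):
    """Map each threshold (in days) to the share of pages older than it."""
    return {days: fraction_older_than(ages, days * DAY)
            for days in day_thresholds}


# Invented ages for illustration only; real input would come from
# load_ages('html_ages.csv').
sample = [3600, 2 * DAY, 8 * DAY, 20 * DAY]
print(summarize(sample))  # {7: 0.5, 14: 0.25}
```

Note that this is a per-page share: Special:Random picks pages without regard to how often they are requested.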

My point in the other ticket is that it's not really about the percentage of pages that are cached longer than X; it's about the percentage of requests. To take an extreme case: if reducing the Varnish maximum TTL from 30 days to 3 days actually affects only 0.01% of all requests, why not set the max TTL to 3 days and save a lot of pain?
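
The difference between the two measures can be illustrated with a small sketch. Everything here is hypothetical: the `(age, hits)` pairs are invented numbers, the function names are made up, and treating a request as "affected" whenever its page's cached copy is older than the new TTL is a simplifying assumption rather than a model of how Varnish behaves.

```python
DAY = 86400  # seconds in a day


def page_share(pages, threshold_seconds):
    """Share of pages whose cached age exceeds the threshold."""
    if not pages:
        return 0.0
    old = sum(1 for age, _ in pages if age > threshold_seconds)
    return old / len(pages)


def request_share(pages, threshold_seconds):
    """Share of requests going to pages whose cached age exceeds the threshold."""
    total = sum(hits for _, hits in pages)
    if total == 0:
        return 0.0
    affected = sum(hits for age, hits in pages if age > threshold_seconds)
    return affected / total


# Hypothetical (age_seconds, hits) pairs: rarely edited pages are assumed
# here to also be rarely requested. These are not measurements.
pages = [
    (1 * DAY, 9000),
    (2 * DAY, 900),
    (10 * DAY, 90),
    (25 * DAY, 10),
]

print(page_share(pages, 3 * DAY))     # 0.5
print(request_share(pages, 3 * DAY))  # 0.01
```

In this made-up data half of the pages are older than 3 days, yet they receive only 1% of requests, which is the kind of gap the per-request view is meant to expose.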