The MLT cache typically answers ~50% of requests for a very expensive and very common query. This cache is stored in memcached from the mediawiki side. On switchover the cache was completely empty,
Fallout:
- a few hundred MLT requests were rejected due to concurrency limits that protect the system from overload
- mlt qps increased from ~130 to ~250
- cluster cpu usage increase from 25% to 35%. Still well within reasonable values.
- backend mlt p95 latency increased from ~330ms to 400ms (excluding effects of network round trips). User visible latency would have a much higher impact, as the 50% of requests with effectively 0ms cached responses are not considered in this metric.
Todo:
- Ensure the WAN cache is being used
- Check for cluster-specific data ending up in the hashed data included in the cache key