Cirrus MLT cache has 0% hit rate on switchover
Closed, Resolved (Public)

Description

The MLT cache typically answers ~50% of requests for a very expensive and very common query. This cache is stored in memcached on the MediaWiki side. On switchover the cache was completely empty.

Fallout:

  • a few hundred MLT requests were rejected by the concurrency limits that protect the system from overload
  • MLT QPS increased from ~130 to ~250
  • cluster CPU usage increased from 25% to 35%, still well within reasonable values
  • backend MLT p95 latency increased from ~330ms to 400ms (excluding effects of network round trips). The impact on user-visible latency would be much higher, because the 50% of requests that were previously answered from cache in effectively 0ms are not counted in this metric.
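
As a rough consistency check on the QPS figures above, here is a minimal Python sketch of how the cache hit rate relates to the query rate reaching the backend. The total of 260 requests per second is an assumption inferred from ~130 backend QPS at a ~50% hit rate; it is not a measured value from this task.

```python
def backend_qps(total_qps: float, hit_rate: float) -> float:
    """Requests per second that miss the cache and reach the search cluster."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return total_qps * (1.0 - hit_rate)


# Assumption: total MLT demand, inferred from ~130 backend QPS at ~50% hit rate.
TOTAL_MLT_QPS = 260.0

warm = backend_qps(TOTAL_MLT_QPS, 0.5)  # cache behaving as usual
cold = backend_qps(TOTAL_MLT_QPS, 0.0)  # cache empty right after switchover
print(f"warm cache: {warm:.0f} qps, cold cache: {cold:.0f} qps")
```

Under that assumption an empty cache roughly doubles the backend load, which is in line with the observed jump to ~250 once the rejected requests are taken into account.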

Todo:

  • Ensure the WAN cache is being used
  • Check for cluster-specific data ending up in the hashed data included in the cache key
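
The second item can be illustrated with a minimal Python sketch. It is not the CirrusSearch implementation: the key prefix, the field names, and the `CLUSTER_SPECIFIC_FIELDS` set are all hypothetical, chosen only to show why cluster-specific data in the hashed payload would yield different keys in each datacenter and therefore a cold cache after switchover.

```python
import hashlib
import json

# Hypothetical names of fields that differ between clusters or datacenters.
CLUSTER_SPECIFIC_FIELDS = {"cluster", "datacenter", "host"}


def cache_key(wiki: str, query_params: dict) -> str:
    """Build a cache key from only the datacenter-independent parts of a query."""
    portable = {
        name: value
        for name, value in query_params.items()
        if name not in CLUSTER_SPECIFIC_FIELDS
    }
    # Sorted keys and fixed separators keep the serialization deterministic.
    payload = json.dumps(portable, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"mlt-cache:{wiki}:{digest}"


# The same logical query issued from two datacenters maps to one key.
key_a = cache_key("enwiki", {"like": "Some page", "limit": 3, "cluster": "dc-a"})
key_b = cache_key("enwiki", {"like": "Some page", "limit": 3, "cluster": "dc-b"})
print(key_a == key_b)
```

If the cluster name were left in the hashed payload, the two keys would differ and every entry written in one datacenter would be unreachable from the other.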

Event Timeline

@EBernhardson could running a warmup script before the switchover solve the issue?
We're already running a MW warmup script, so I guess we could either have one specific to ES or reuse the existing one, hitting URLs on MW that perform the search requests needed to populate these cache items.

A warmup wouldn't be enough: the cache holds a value per MediaWiki content page across all wikis. 5 hours after switchover the cache hit rate is still down 10% from yesterday. Mirroring the existing cache across DCs seems the most direct solution. Note that this cache is on the MediaWiki side, not in Elasticsearch itself, which makes this relatively painless.

For cache hit rates see https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=55&fullscreen&orgId=1&from=1536611812617&to=1536784466597

Change 460424 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Cache cirrus queries to the WAN

https://gerrit.wikimedia.org/r/460424

Change 460424 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Cache cirrus queries to the WAN

https://gerrit.wikimedia.org/r/460424

debt claimed this task.