Maniphest T204148

Cirrus MLT cache has 0% hit rate on switchover
Closed, Resolved, Public

Assigned To

Authored By

	EBernhardson
	Sep 12 2018, 6:29 PM

Description

The MLT cache typically answers ~50% of requests for a very expensive and very common query. This cache is stored in memcached on the MediaWiki side. On switchover the cache was completely empty.

Fallout:

A few hundred MLT requests were rejected due to the concurrency limits that protect the system from overload.
MLT qps increased from ~130 to ~250.
Cluster CPU usage increased from 25% to 35%, still well within reasonable values.
Backend MLT p95 latency increased from ~330ms to 400ms (excluding effects of network round trips). The impact on user-visible latency would be much higher, because the 50% of requests normally served from cache in effectively 0ms are not included in this metric.
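
As a rough sanity check on the qps figures above, here is a minimal sketch. It assumes total MLT demand stayed the same across the switchover and that the pre-switchover hit rate was exactly 50%; both are assumptions for illustration, not measured values, and the function name is hypothetical.

```python
def backend_qps(total_qps: float, hit_rate: float) -> float:
    """Requests per second that miss the cache and reach the search backend."""
    return total_qps * (1.0 - hit_rate)


# Assumption: ~130 backend qps at a ~50% hit rate implies ~260 total qps.
total = 130 / (1.0 - 0.5)

before = backend_qps(total, hit_rate=0.5)  # warm cache
after = backend_qps(total, hit_rate=0.0)   # empty cache right after switchover

print(f"backend qps with warm cache:  {before:.0f}")
print(f"backend qps with empty cache: {after:.0f}")
```

Under these assumptions an empty cache roughly doubles backend load, which is in line with the observed jump from ~130 to ~250 qps. It also means every request, rather than about half, pays the backend latency.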

Todo:

Ensure the WAN cache is being used
Check for cluster-specific data ending up in the hashed data included in the cache key
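
The second item could be checked along the lines of the sketch below. This is not CirrusSearch's actual key scheme: the `cache_key` helper, the `cluster` field, and the `dc-a`/`dc-b` names are all hypothetical, chosen only to show how a cluster-specific value in the hashed data would give each datacenter its own keys.

```python
import hashlib
import json

# Hypothetical name of a field that differs between datacenters.
CLUSTER_SPECIFIC_FIELDS = frozenset({"cluster"})


def cache_key(prefix: str, query: dict, exclude: frozenset = frozenset()) -> str:
    """Hash a query into a cache key, optionally dropping some fields first."""
    stable = {k: v for k, v in query.items() if k not in exclude}
    # sort_keys makes the serialization independent of dict insertion order.
    payload = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"


# The same logical query as built in two datacenters (illustrative values).
query_dc_a = {"type": "more_like", "page": "Example", "cluster": "dc-a"}
query_dc_b = {"type": "more_like", "page": "Example", "cluster": "dc-b"}

# Keys that include the cluster name never match across datacenters.
leaky_a = cache_key("mlt", query_dc_a)
leaky_b = cache_key("mlt", query_dc_b)

# Keys built without cluster-specific fields are shared.
clean_a = cache_key("mlt", query_dc_a, CLUSTER_SPECIFIC_FIELDS)
clean_b = cache_key("mlt", query_dc_b, CLUSTER_SPECIFIC_FIELDS)

print("keys match with cluster field:   ", leaky_a == leaky_b)
print("keys match without cluster field:", clean_a == clean_b)
```

If the real keys behaved like the first pair, even a cache mirrored across datacenters would miss after a switchover, which is why the check is worth doing alongside the WAN cache change.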

Details

	Subject	Repo	Branch	Lines +/-
	Cache cirrus queries to the WAN	mediawiki/extensions/CirrusSearch	master	+1 -1

Event Timeline

EBernhardson created this task. Sep 12 2018, 6:29 PM

Restricted Application added a subscriber: Aklapper. Sep 12 2018, 6:29 PM

EBernhardson updated the task description. Sep 12 2018, 6:49 PM

EBernhardson updated the task description.

Gehel subscribed. Sep 12 2018, 7:00 PM

@EBernhardson could running a warmup script before the switchover solve the issue?
We're already running a MW warmup script, so I guess we could either have one specific to ES or use the same one, hitting URLs on MW that perform the search requests needed to populate these cache items.

A warmup wouldn't be enough: the cache holds a value per MediaWiki content page across all wikis. Five hours after the switchover, the cache hit rate is still down 10% from yesterday. Mirroring the existing cache across DCs seems the most direct solution. Note that this cache is on the MediaWiki side, not in Elasticsearch itself, which makes this relatively painless.

For cache hit rates see https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=55&fullscreen&orgId=1&from=1536611812617&to=1536784466597

Change 460424 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Cache cirrus queries to the WAN

https://gerrit.wikimedia.org/r/460424

gerritbot added a project: Patch-For-Review. Sep 13 2018, 8:08 PM

EBernhardson moved this task from Incoming to Needs review on the Discovery-Search (Current work) board. Sep 13 2018, 8:08 PM

Change 460424 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Cache cirrus queries to the WAN

https://gerrit.wikimedia.org/r/460424

ReleaseTaggerBot added a project: MW-1.32-notes (WMF-deploy-2018-09-18 (1.32.0-wmf.22)). Sep 17 2018, 2:00 PM

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Sep 17 2018, 4:08 PM

debt closed this task as Resolved. Oct 5 2018, 4:08 PM

debt claimed this task.