
Improve automation of CirrusSearch caches during database switchover
Open, Medium, Public

Description

https://wikitech.wikimedia.org/wiki/Switch_Datacenter#ElasticSearch

Currently, before a datacenter switchover, we hardcode the more_like queries to go to the active datacenter, so that after the switchover, when the caches in the newly active datacenter are cold, these queries still hit a hot cache. After ~24h the caches have warmed up enough and the hardcoding is removed.

Ideally this process would be automated, potentially by:

  • Replicating cache
  • A cookbook to warm up the cache ahead of time
  • Deciding that it's fast enough with the cold cache (likely what would happen in an emergency) and no longer requiring manual intervention
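
As a rough illustration of the cookbook option, the sketch below replays a list of queries against the soon-to-be-active datacenter at a throttled rate so its cache is populated before traffic moves. It is only a sketch under assumptions: `warm_cache`, `fetch`, and `max_per_second` are hypothetical names, and how the query list is obtained (for example, from recently requested more_like titles) is not specified in this task.

```python
import time
from typing import Callable, Dict, Iterable


def warm_cache(
    queries: Iterable[str],
    fetch: Callable[[str], object],
    max_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Replay queries through `fetch` at a bounded rate.

    `fetch` is a hypothetical callable that issues one more_like request
    against the target datacenter (which populates its cache as a side
    effect). Failures are counted, not raised, so one bad query does not
    abort the warm-up.
    """
    interval = 1.0 / max_per_second
    stats = {"ok": 0, "failed": 0}
    for query in queries:
        try:
            fetch(query)
            stats["ok"] += 1
        except Exception:
            stats["failed"] += 1
        # Throttle so the warm-up itself cannot overload the cold cluster.
        sleep(interval)
    return stats
```

The throttle matters here because an unthrottled replay would create the same load spike the warm-up is meant to avoid.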

Event Timeline

Restricted Application added a subscriber: Aklapper.

Deciding that it's fast enough with the cold cache (likely what would happen in an emergency) and no longer requiring manual intervention

It's not that it's slow, it's that we end up rejecting lots of requests while the cache fills back up. To prevent this one use case (more_like requests, primarily via related articles and fetched on a significant % of mobile page views) from dominating the cluster, it has its own PoolCounter; once that fills, we reject a large stream of requests. If we have too much load, the remedy would instead be to reduce the concurrency allowed by this PoolCounter until the cluster is stable and typical autocomplete/fulltext requests continue to be served. If my memory serves, the cache fills to a reasonable level in a few hours, but the initial spike will include significant numbers of rejected requests.
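
To make the rejection behaviour concrete, here is a toy, in-process sketch of this kind of admission control: a fixed number of concurrent workers, a bounded wait queue, and immediate rejection once both are full. It is not the actual PoolCounter service or its client; `PoolLimiter`, `PoolRejected`, and the parameter names are illustrative assumptions.

```python
import threading


class PoolRejected(Exception):
    """Raised when a request is turned away instead of being run."""


class PoolLimiter:
    """Toy limiter: at most `workers` running, at most `max_queue` waiting."""

    def __init__(self, workers: int, max_queue: int) -> None:
        self._slots = threading.Semaphore(workers)
        self._lock = threading.Lock()
        self._max_queue = max_queue
        self._waiting = 0
        self.rejected = 0

    def _reject(self) -> None:
        with self._lock:
            self.rejected += 1
        raise PoolRejected()

    def run(self, work, timeout: float = 1.0):
        # Fast path: a worker slot is free, so run immediately.
        if not self._slots.acquire(blocking=False):
            # All slots busy: join the queue only if it has room.
            with self._lock:
                queue_full = self._waiting >= self._max_queue
                if not queue_full:
                    self._waiting += 1
            if queue_full:
                self._reject()
            try:
                acquired = self._slots.acquire(timeout=timeout)
            finally:
                with self._lock:
                    self._waiting -= 1
            if not acquired:
                # Waited too long for a slot: treat as a rejection.
                self._reject()
        try:
            return work()
        finally:
            self._slots.release()
```

In this model, a cold cache makes each `work()` call slower, so slots stay occupied longer and the queue fills sooner, which is why rejections spike right after a switchover. Lowering `workers` corresponds to the remedy described above: more_like requests are rejected earlier, leaving cluster capacity for other query types.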

MPhamWMF triaged this task as Medium priority. · Jun 28 2021, 3:32 PM
MPhamWMF moved this task from needs triage to elastic / cirrus on the Discovery-Search board.