
Increase ltr.cache.max_size in Cirrus elasticsearch clusters
Closed, Resolved (Public)

Description

While evaluating the new, larger ranking models, I ran into an issue where queries run in 150ms on codfw and 1s on eqiad. This turns out to be because codfw was caching models, while eqiad was churning the model cache. It appears the default cache size of 10MB is too small to hold the models, so they are repeatedly evicted and recompiled. Compiling models, especially large ones, can take a second or more and is not something we can have happening regularly. Resize the cache to 100MB, which is still a very small fraction of heap but should be large enough to prevent churn.
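The churn described above can be illustrated with a minimal sketch of a size-bounded LRU cache. This is not the LTR plugin's actual cache implementation; the class, the model names, and the 6MB model sizes are hypothetical and chosen only to show how a working set slightly larger than the cache forces a recompile on every request.

```python
from collections import OrderedDict

MB = 1024 * 1024


class ByteBoundedLRU:
    """Illustrative LRU cache bounded by total entry size in bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()  # model name -> size in bytes
        self.compiles = 0  # cache misses, each standing in for a model compile

    def get(self, name, size_bytes):
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as most recently used
            return
        self.compiles += 1  # miss: the model would have to be compiled
        if size_bytes > self.max_bytes:
            return  # too large to cache at all
        # Evict least recently used models until the new one fits.
        while self.used_bytes + size_bytes > self.max_bytes:
            _, evicted_size = self.entries.popitem(last=False)
            self.used_bytes -= evicted_size
        self.entries[name] = size_bytes
        self.used_bytes += size_bytes


def count_compiles(max_bytes, model_sizes, rounds):
    """Cycle through all models `rounds` times and count compilations."""
    cache = ByteBoundedLRU(max_bytes)
    for _ in range(rounds):
        for name, size in model_sizes.items():
            cache.get(name, size)
    return cache.compiles


# Hypothetical model sizes, not measured values from the clusters.
MODEL_SIZES = {"model_a": 6 * MB, "model_b": 6 * MB, "model_c": 6 * MB}

small = count_compiles(10 * MB, MODEL_SIZES, rounds=100)
large = count_compiles(100 * MB, MODEL_SIZES, rounds=100)
print(small, large)  # 300 3
```

With the 10MB limit only one of the assumed models fits at a time, so every lookup is a miss; with 100MB each model is compiled once and then served from cache.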

Sadly, this is not currently updatable via the cluster settings API (task filed upstream), so we need to do a rolling restart.
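Because the value only takes effect on restart, it can be useful to check which nodes have picked it up as the rolling restart progresses. The sketch below assumes a response shaped like Elasticsearch's nodes info API with flat settings (`GET /_nodes/settings?flat_settings=true`). The setting key is taken from this task's title, and the `100mb` value format, node names, and helper name are assumptions for illustration.

```python
def nodes_missing_setting(nodes_info, key, expected):
    """Return sorted names of nodes whose flat settings lack key == expected."""
    missing = []
    for node_id, node in nodes_info.get("nodes", {}).items():
        settings = node.get("settings", {})
        if settings.get(key) != expected:
            # Fall back to the node id if the response carries no name.
            missing.append(node.get("name", node_id))
    return sorted(missing)


# Hypothetical response: one node restarted with the new value, one not yet.
SAMPLE_NODES_INFO = {
    "nodes": {
        "id1": {"name": "node-a", "settings": {"ltr.cache.max_size": "100mb"}},
        "id2": {"name": "node-b", "settings": {}},
    }
}

pending = nodes_missing_setting(SAMPLE_NODES_INFO, "ltr.cache.max_size", "100mb")
print(pending)  # ['node-b']
```

An empty result would suggest every node in the response is running with the new cache size.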

Event Timeline

EBernhardson created this task.

Change 413407 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Resize the Cirrus LTR model cache

https://gerrit.wikimedia.org/r/413407

Change 413407 merged by Gehel:
[operations/puppet@production] Resize the Cirrus LTR model cache

https://gerrit.wikimedia.org/r/413407

Change 414637 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] Resize the Cirrus LTR model cache

https://gerrit.wikimedia.org/r/414637

Change 414637 merged by Gehel:
[operations/puppet@production] Resize the Cirrus LTR model cache

https://gerrit.wikimedia.org/r/414637

Tested again today after the restarts were completed. After an initial warmup to get the models into the caches, I am now seeing consistent performance on both clusters, in an acceptable range.

EBjune subscribed.

Seems to be working as expected, closing this one.