
Adjust CirrusSearch PoolCounter limits
Closed, Resolved (Public)

Description

On a few occasions, maintenance or an outage in the Elasticsearch cluster has affected the MediaWiki API cluster by starving it of PHP-FPM workers.

CirrusSearch is mostly limited by PoolCounter: https://noc.wikimedia.org/conf/highlight.php?file=PoolCounterSettings.php

On average we have ~3k idle PHP-FPM workers. Previously the sum of all maxqueue values was 1,910, which meant that if the ES cluster was slow, a majority of workers could end up waiting on ES or sitting in the PoolCounter queue, triggering pages.

As of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/740674/ the new limits total 1,035, so CirrusSearch can no longer cause a majority of workers to hang.
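
A rough way to sanity-check the arithmetic above is sketched below. It assumes, as this description does, that the sum of the maxqueue values approximates the worst-case number of workers CirrusSearch can tie up. The function names and the sample pool names are hypothetical and are not taken from PoolCounterSettings.php; only the totals (1,910 and 1,035) and the ~3k worker figure come from this task.

```python
def total_maxqueue(pools):
    """Sum the 'maxqueue' limit across all pool types."""
    return sum(cfg["maxqueue"] for cfg in pools.values())


def can_block_majority(queue_total, idle_workers):
    """True if queued requests alone could occupy more than half the workers."""
    return queue_total > idle_workers / 2


# "~3k idle PHP-FPM workers" from the description (approximate).
IDLE_WORKERS = 3000

# Totals quoted in this task: before and after the config change.
print(can_block_majority(1910, IDLE_WORKERS))  # old limits
print(can_block_majority(1035, IDLE_WORKERS))  # new limits

# Hypothetical pool layout, only to show how a total would be derived.
example_pools = {
    "ExamplePoolA": {"workers": 10, "maxqueue": 50},
    "ExamplePoolB": {"workers": 20, "maxqueue": 100},
}
print(total_maxqueue(example_pools))
```

With a threshold of half the idle workers (1,500), the old total exceeds it and the new total does not. The same check could be re-run whenever the limits or the worker count change.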

That change was mostly written as a very quick temporary solution; all of the limits should probably be double-checked to make sure they are reasonable for current ES and MW cluster capacities.