Investigate whether it’s intentional / correct that default CirrusSearch setups run cirrusSearchElasticaWrite as separate jobs
Closed, Resolved · Public · 2 Estimated Story Points

Description

In a mostly unconfigured CirrusSearch setup (in my case: FactGrid), the job queue contains lots of cirrusSearchElasticaWrite jobs. According to @dcausse (IRC conversation 2025-03-10 and 2025-03-19), this is surprising: Updater::pushElasticaWriteJobs() is supposed to run these jobs immediately (“inline”), rather than push them to the job queue, if the cluster isn’t configured to be isolated. However, it turns out that ClusterSettings::isIsolated() returns true by default:

public function isIsolated(): bool {
	$isolate = $this->config->get( 'CirrusSearchWriteIsolateClusters' ); // defaults to null according to extension.json
	return $isolate === null || in_array( $this->cluster, $isolate );
}

Perhaps $wgCirrusSearchWriteIsolateClusters is supposed to default to [] rather than null? Or the condition should be $isolate !== null && in_array( $this->cluster, $isolate ) instead?
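To make the effect of the default concrete, here is a minimal standalone sketch of the same check, evaluated for the three relevant config values. The function name isIsolatedSketch() and the cluster name 'default' are assumptions for illustration only; neither is taken from CirrusSearch.

```php
<?php
// Standalone sketch of the condition in ClusterSettings::isIsolated().
// isIsolatedSketch() is a hypothetical helper, not part of CirrusSearch.
function isIsolatedSketch( ?array $isolate, string $cluster ): bool {
	// null means "not set", which the condition treats as "isolate every cluster"
	return $isolate === null || in_array( $cluster, $isolate );
}

// 'default' is an assumed cluster name used only for this example.
var_dump( isIsolatedSketch( null, 'default' ) ); // bool(true): config unset, so isolated
var_dump( isIsolatedSketch( [], 'default' ) ); // bool(false): empty list, so not isolated
var_dump( isIsolatedSketch( [ 'default' ], 'default' ) ); // bool(true): cluster explicitly listed
```

So with the shipped default of null, every cluster counts as isolated, and only an explicit empty array turns isolation off.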

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel set the point value for this task to 2. · Mar 24 2025, 4:32 PM

According to docs/settings.txt this is the intended behaviour:

List of clusters, by name, that will have their writes isolated from the other
clusters. If not set all clusters will be isolated from each other. Limiting
isolation to only clusters that may have issues will result in reduced job
queue load.

To disable isolation we should set:

$wgCirrusSearchWriteIsolateClusters = [];

I suppose it's reasonable to question whether this is desirable as the default behaviour. It's also worth noting that disabling write isolation won't prevent the creation of cirrusSearchElasticaWrite jobs altogether: their primary purpose is to allow retrying a write later. Whenever a write fails we queue an ElasticaWrite job to try again once the cluster comes back up. Usually if the queue has lots of these jobs that means there are failures writing to elasticsearch.

Change #1131067 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Remove write isolation

https://gerrit.wikimedia.org/r/1131067

Change #1131067 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Remove write isolation

https://gerrit.wikimedia.org/r/1131067

Usually if the queue has lots of these jobs that means there are failures writing to elasticsearch

Well, that’s what David thought too, but IIUC it’s not actually true on my wiki. (I guess it’s true at Wikimedia.) So it sounds like the above change should reduce confusion everywhere – thanks :)

I’m guessing this change is a bit too big to backport… any advice for other wikis in the meantime? Is it safe to configure $wgCirrusSearchWriteIsolateClusters = []; with no other config changes? Or should we just wait and look forward to the job load dropping in the future 1.44 release?

It should be safe to configure $wgCirrusSearchWriteIsolateClusters = []; without any other changes. That looks to be the intended use case, and there were test cases for that configuration.

Alright, thanks a lot! Trying this out now :)

At least on REL1_39, setting $wgCirrusSearchWriteIsolateClusters = []; directly in LocalSettings.php didn’t work – it got reset to null for some reason. (I couldn’t figure out why; for all I know, it could be a general issue with extension registration that has since been fixed – ExtensionRegistry::exportExtractedData() has changed a little since then, at least.) But an extension function works:

// disable CirrusSearch write isolation so search updates are run inline rather than via jobs
// this setting can be removed on MediaWiki 1.44+ as the corresponding feature is completely removed there (T389429)
$wgExtensionFunctions[] = function () {
	global $wgCirrusSearchWriteIsolateClusters;
	$wgCirrusSearchWriteIsolateClusters = [];
};

And with that, cirrusSearchElasticaWrite jobs are gone from our job queue and search (including updates) still works \o/

Change #1134064 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] Remove unused config vars

https://gerrit.wikimedia.org/r/1134064

Change #1134064 merged by jenkins-bot:

[operations/mediawiki-config@master] Remove unused config vars

https://gerrit.wikimedia.org/r/1134064

Mentioned in SAL (#wikimedia-operations) [2025-04-08T13:22:49Z] <lucaswerkmeister-wmde@deploy1003> Started scap sync-world: Backport for [[gerrit:1133317|Increase entityAccessLimit from 400 to 500 for all wikis except commons. (T384455)]], [[gerrit:1134064|Remove unused config vars (T389429)]], [[gerrit:1134691|Fix EntitySchema propertyType on Test Wikidata (T371196)]]

Mentioned in SAL (#wikimedia-operations) [2025-04-08T13:30:20Z] <lucaswerkmeister-wmde@deploy1003> lucaswerkmeister-wmde, ebernhardson, seanleong-wmde: Backport for [[gerrit:1133317|Increase entityAccessLimit from 400 to 500 for all wikis except commons. (T384455)]], [[gerrit:1134064|Remove unused config vars (T389429)]], [[gerrit:1134691|Fix EntitySchema propertyType on Test Wikidata (T371196)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-04-08T13:38:19Z] <lucaswerkmeister-wmde@deploy1003> Finished scap sync-world: Backport for [[gerrit:1133317|Increase entityAccessLimit from 400 to 500 for all wikis except commons. (T384455)]], [[gerrit:1134064|Remove unused config vars (T389429)]], [[gerrit:1134691|Fix EntitySchema propertyType on Test Wikidata (T371196)]] (duration: 15m 30s)