cirrusSearchElasticaWrite job failures in quibble
Closed, Resolved · Public

Description

As I discovered in T389863#10671903, cirrusSearchElasticaWrite jobs are consistently failing in quibble-apitests, and get re-enqueued over and over again, until they're dropped. This is especially problematic because the api-testing framework, for unrelated reasons, will under certain conditions wait for the job queue to be empty. When there are many of these jobs enqueued (and Flow will enqueue some: T389894), the test becomes prone to timing out. Here are the relevant excerpts from the MW log:

2025-03-24 20:52:16 cirrusSearchElasticaWrite Special: method=sendData arguments=["general",[{"data":{"version":9,"wiki":"wikidb","page_id":9,"namespace":10,"namespace_text":"Template","title":"Archive for converted wikitext talk page","timestamp":"2025-03-24T20:51:56Z","create_timestamp":"2025-03-24T20:51:56Z","redirect":[]},"params":{"_id":"9","_index":"","_cirrus_hints":{"BuildDocument_flags":0,"noop":{"version":"documentVersion"}}},"upsert":true}]] cluster=default jobqueue_partition=default-0 update_group=page update_kind=page_refresh root_event_time=1742849517 createdAt=1742849536 errorCount=0 retryCount=0 requestId=f83047a9bb4bcb171e3bb648 namespace=-1 title= (id=23,timestamp=20250324205216) STARTING
2025-03-24 20:52:16 cirrusSearchElasticaWrite Special: method=sendData arguments=["general",[{"data":{"version":9,"wiki":"wikidb","page_id":9,"namespace":10,"namespace_text":"Template","title":"Archive for converted wikitext talk page","timestamp":"2025-03-24T20:51:56Z","create_timestamp":"2025-03-24T20:51:56Z","redirect":[]},"params":{"_id":"9","_index":"","_cirrus_hints":{"BuildDocument_flags":0,"noop":{"version":"documentVersion"}}},"upsert":true}]] cluster=default jobqueue_partition=default-0 update_group=page update_kind=page_refresh root_event_time=1742849517 createdAt=1742849536 errorCount=0 retryCount=0 requestId=f83047a9bb4bcb171e3bb648 namespace=-1 title= (id=23,timestamp=20250324205216) t=62 error=ElasticaWrite job failed: Requeued
2025-03-24 20:52:16 sending 1 documents to the wikidb_general index(s) against wikidb_general took 1 millis. Requested via web for 2c33c9c64851a02e2ba8143c6550a3b8 by executor 840951780
2025-03-24 20:52:16 Search backend error during sending 1 documents to the wikidb_general index(s) after 1: http_exception: Couldn't connect to host, Elasticsearch down?
2025-03-24 20:52:16 Failed to update documents 9
#0 /workspace/src/vendor/ruflin/elastica/src/Request.php(183): Elastica\Transport\Http->exec(Elastica\Request, array)
#1 /workspace/src/vendor/ruflin/elastica/src/Client.php(545): Elastica\Request->send()
#2 /workspace/src/vendor/ruflin/elastica/src/Bulk.php(298): Elastica\Client->request(string, string, string, array, string)
#3 /workspace/src/extensions/CirrusSearch/includes/DataSender.php(270): Elastica\Bulk->send()
#4 /workspace/src/extensions/CirrusSearch/includes/Job/ElasticaWrite.php(178): CirrusSearch\DataSender->sendData(string, array)
#5 /workspace/src/extensions/CirrusSearch/includes/Job/JobTraits.php(139): CirrusSearch\Job\ElasticaWrite->doJob()
#6 /workspace/src/includes/jobqueue/JobRunner.php(375): CirrusSearch\Job\CirrusGenericJob->run()
#7 /workspace/src/includes/jobqueue/JobRunner.php(337): MediaWiki\JobQueue\JobRunner->doExecuteJob(CirrusSearch\Job\ElasticaWrite)
#8 /workspace/src/includes/jobqueue/JobRunner.php(232): MediaWiki\JobQueue\JobRunner->executeJob(CirrusSearch\Job\ElasticaWrite)
#9 /workspace/src/includes/specials/SpecialRunJobs.php(132): MediaWiki\JobQueue\JobRunner->run(array)
#10 /workspace/src/includes/specials/SpecialRunJobs.php(117): MediaWiki\Specials\SpecialRunJobs->doRun(array)
#11 /workspace/src/includes/specialpage/SpecialPage.php(729): MediaWiki\Specials\SpecialRunJobs->execute(null)
#12 /workspace/src/includes/specialpage/SpecialPageFactory.php(1737): MediaWiki\SpecialPage\SpecialPage->run(null)
#13 /workspace/src/includes/actions/ActionEntryPoint.php(499): MediaWiki\SpecialPage\SpecialPageFactory->executePath(string, MediaWiki\Context\RequestContext)
#14 /workspace/src/includes/actions/ActionEntryPoint.php(143): MediaWiki\Actions\ActionEntryPoint->performRequest()
#15 /workspace/src/includes/MediaWikiEntryPoint.php(202): MediaWiki\Actions\ActionEntryPoint->execute()
#16 /workspace/src/index.php(58): MediaWiki\MediaWikiEntryPoint->run()
#17 {main}

Full output, LocalSettings and everything else can be found in https://integration.wikimedia.org/ci/job/mediawiki-quibble-apitests-vendor-php74/40825/ (kept forever). I'm assuming it might be a simple misconfiguration, but I don't know how to fix it.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Even if we fix T389429, this will still occur. In that ticket, the intended behavior is that the first execution happens in the initial job; if that execution fails, a cirrusSearchElasticaWrite job is queued to try again later. Some suggestions:

  • The main failure is "Couldn't connect to host, Elasticsearch down?". This suggests elasticsearch isn't configured appropriately in the test environment and Cirrus is unable to communicate with it. Should it be configured? If elasticsearch is intentionally not installed or available, is CirrusSearch still needed?
  • If CirrusSearch needs to be installed but we don't want it to write anywhere it can be configured to have only a read-only cluster with something like the following. Setting $wgCirrusSearchClusters is important, as by default it uses $wgCirrusSearchServers which skips all of the multi-cluster support including the read-only bits:
$wgCirrusSearchClusters = [
    'default' => ['localhost'],
];
$wgCirrusSearchWriteClusters = [];
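
If the read-only setup should apply only in CI, the same two settings could be wrapped in a guard. This is a minimal sketch of a LocalSettings.php fragment, not the change that was eventually merged. It assumes the CI environment defines a marker constant, here called MW_QUIBBLE_CI; that name is an assumption and is not confirmed anywhere in this task.

```php
// Hypothetical LocalSettings.php fragment.
// Assumption: the CI environment defines the MW_QUIBBLE_CI constant.
if ( defined( 'MW_QUIBBLE_CI' ) ) {
    // Use the multi-cluster setting rather than $wgCirrusSearchServers,
    // so that the write-cluster list below is taken into account.
    $wgCirrusSearchClusters = [
        'default' => [ 'localhost' ],
    ];
    // No writable clusters: the 'default' cluster is treated as read-only.
    $wgCirrusSearchWriteClusters = [];
}
```

Outside CI the guard is a no-op, so a normal CirrusSearch installation would keep writing to its clusters as usual.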

Thank you, makes sense.

This suggests elasticsearch isn't configured appropriately in the test environment and Cirrus is unable to communicate with it. Should it be configured? If elasticsearch is intentionally not installed or available, is CirrusSearch still needed?

This is inside quibble, so it's for extensions that declare CirrusSearch as a dependency (and therefore is needed). Meaning that yes, it should be configured, but maybe it doesn't necessarily need to work correctly. Tagging Quibble because it might be possible to fix this by amending quibble's LocalSettings.

If CirrusSearch needs to be installed but we don't want it to write anywhere it can be configured to have only a read-only cluster with something like the following.

This seems like it would do it.

Setting $wgCirrusSearchClusters is important, as by default it uses $wgCirrusSearchServers which skips all of the multi-cluster support including the read-only bits:

Pardon my ignorance, but would it be possible to have [something like] this as the default config? I tried your sample config locally and it does prevent the failure. I also checked other quibble jobs just to be sure (https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-noselenium/83910 / r1131064), and it is also not configured. So, in all quibble tests we might be wasting time trying to execute these jobs, and just re-enqueuing them over and over again.

Daimona renamed this task from cirrusSearchElasticaWrite job failures in quibble-apitests to cirrusSearchElasticaWrite job failures in quibble. Mar 25 2025, 6:11 PM

I don't think it would be a good idea for CirrusSearch to default itself to not writing. The default assumption when installing CirrusSearch is that you want to use CirrusSearch.

I don't think it would be a good idea for CirrusSearch to default itself to not writing. The default assumption when installing CirrusSearch is that you want to use CirrusSearch.

That also makes sense, I guess... I'll make a patch for quibble.

Change #1131078 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[integration/quibble@master] LocalSettings: explicitly configure CirrusSearch to do nothing

https://gerrit.wikimedia.org/r/1131078

Change #1131078 abandoned by Daimona Eaytoy:

[integration/quibble@master] LocalSettings: explicitly configure CirrusSearch to do nothing

https://gerrit.wikimedia.org/r/1131078

Change #1131331 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/CirrusSearch@master] Disable writes when running in quibble

https://gerrit.wikimedia.org/r/1131331

Change #1131331 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Disable writes when running in quibble

https://gerrit.wikimedia.org/r/1131331

Thanks for the quick review! As I wrote in T389863#10679137, things are indeed looking better now.

If that is causing enough issues, should the fix https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1131331 be backported to REL1_39, REL1_42 and REL1_43?

If that is causing enough issues, should the fix https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1131331 be backported to REL1_39, REL1_42 and REL1_43?

The issues become noticeable when something attempts to process the job queue. Even then, the only consequence is that the test will take a bit longer. And this is typically not noticeable or not a problem, with the exception of time-sensitive tests, such as E2E tests running with a timeout. It can be backported if we want to, but I'm not sure if it's worthwhile.

It can be backported if we want to, but I'm not sure if it's worthwhile.

Good, let's skip it then. I wasn't sure how critical the fix was :) Thank you for the investigation and the fix, that was great to watch.