
Move bulk content out of the ElasticaWrite job
Closed, Resolved · Public

Description

ElasticaWrite jobs can be multiple megabytes, roughly an order of magnitude larger than is reasonable for a job. The problem is that we put the entire Elasticsearch update into the ElasticaWrite job, and this includes, among other things, two copies of the page text content.

Adjust CirrusSearch so that bulk content is loaded from MediaWiki as part of the ElasticaWrite job, rather than before it. This means each retry will need to rebuild the document, but that should be acceptable. Care must be taken not to move every piece of document building behind ElasticaWrite. In particular, we want link counting to happen once, with the result written to all clusters, because that query is quite expensive.
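
The split described above can be sketched roughly as follows. This is an illustrative Python model, not CirrusSearch code (which is PHP): `WriteJob`, `enqueue_update`, `run_job`, and the `count_links`, `load_text`, and `send` callbacks are all hypothetical names, and the document field names are assumptions. The job payload carries only small identifiers plus the link count computed once up front; the page text is loaded when the job runs, so a retry rebuilds the document.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class WriteJob:
    """Hypothetical job payload: identifiers and precomputed values only."""
    page_id: int
    cluster: str
    incoming_links: int  # computed once, before any job is enqueued

    def payload_size(self) -> int:
        # Size of the serialized job; stays small because no page text is stored.
        return len(json.dumps(asdict(self)))


def enqueue_update(page_id, clusters, count_links):
    """Run the expensive link count once and share it across all clusters."""
    incoming_links = count_links(page_id)
    return [WriteJob(page_id, cluster, incoming_links) for cluster in clusters]


def run_job(job, load_text, send):
    """Load bulk content inside the job, so every retry rebuilds the document."""
    text = load_text(job.page_id)
    doc = {
        # Assumed field names; the two text fields stand in for the
        # "two copies of the page text" mentioned in the description.
        "text": text,
        "source_text": text,
        "incoming_links": job.incoming_links,
    }
    send(job.cluster, job.page_id, doc)
```

In this sketch the serialized job is tens of bytes regardless of page size, while the document sent to each cluster still contains the full text and the shared link count.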

Details

Related Gerrit Patches:
mediawiki/extensions/CirrusSearch : master · Move bulk content for update after ElasticaWrite

Event Timeline

Restricted Application edited projects, added Discovery-Search; removed Discovery-Search (Current work). Oct 22 2019, 4:36 PM
EBernhardson triaged this task as Normal priority. Thu, Oct 24, 9:13 PM
chasemp removed a subscriber: chasemp. Tue, Oct 29, 5:44 PM

Change 546285 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Move bulk content for update after ElasticaWrite

https://gerrit.wikimedia.org/r/546285

Change 546285 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Move bulk content for update after ElasticaWrite

https://gerrit.wikimedia.org/r/546285

TJones closed this task as Resolved. Wed, Nov 20, 4:56 PM