Move bulk content out of the ElasticaWrite job
Closed, ResolvedPublic

Description

ElasticaWrite jobs can be multiple megabytes in size, an order of magnitude larger than is reasonable. The problem is that we put the entire Elasticsearch update into the ElasticaWrite job, and this contains two copies of the page text content, among other things.

Adjust CirrusSearch so that bulk content loading from MediaWiki is done as part of the ElasticaWrite job, rather than before it. This means each retry will need to recalculate the document, but that should be acceptable. Care must be taken not to move every piece of document building behind ElasticaWrite. In particular, we want to make sure link counting happens once and its result is written to all clusters, as that query is quite expensive.

Details

	Subject	Repo	Branch	Lines +/-
	Move bulk content for update after ElasticaWrite	mediawiki/extensions/CirrusSearch	master	+245 -44


Related Objects

Status	Subtype	Assigned	Task
Declined	Feature	None	T71489 Expose mwgrep functionality on-wiki
Resolved		None	T109715 Replicate production elasticsearch indices to labs
Resolved		EBernhardson	T220625 Initialize CirrusSearch on cloudelastic
Resolved		TJones	T235831 [Objective Fiscal 19-20/Q2] (4) Isolate failures between different search components to maintain a resilient service
Resolved		EBernhardson	T235832 CirrusSearch writes are split into per cluster kafka partitions to isolate clusters from each others by end of Q2
Resolved		EBernhardson	T230495 Partition CirrusSearch mediawiki jobs by cluster
Resolved		Ottomata	T239135 Create partitioned CirrusSearchElasticaWrite topic
Resolved		EBernhardson	T235833 CirrusSearch writes can be paused during cluster operations without causing excessive stress on change propagation infrastructure by end of Q2
Resolved		EBernhardson	T236186 Move bulk content out of the ElasticaWrite job

Event Timeline

EBernhardson created this task. Oct 22 2019, 4:36 PM

Restricted Application edited projects, added Discovery-Search; removed Discovery-Search (Current work). Oct 22 2019, 4:36 PM

EBernhardson moved this task from needs triage to Current work on the Discovery-Search board. Oct 23 2019, 8:25 PM

EBernhardson edited projects, added Discovery-Search (Current work); removed Discovery-Search.

EBernhardson triaged this task as Medium priority. Oct 24 2019, 9:13 PM

Gehel added a parent task: T235832: CirrusSearch writes are split into per cluster kafka partitions to isolate clusters from each others by end of Q2. Oct 29 2019, 5:31 PM

Gehel added a parent task: T235833: CirrusSearch writes can be paused during cluster operations without causing excessive stress on change propagation infrastructure by end of Q2. Oct 29 2019, 5:36 PM

• chasemp unsubscribed. Oct 29 2019, 5:44 PM

EBernhardson moved this task from Incoming to Needs review on the Discovery-Search (Current work) board. Oct 29 2019, 5:49 PM

Change 546285 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Move bulk content for update after ElasticaWrite

https://gerrit.wikimedia.org/r/546285

gerritbot added a project: Patch-For-Review. Oct 29 2019, 9:52 PM

• WDoranWMF moved this task from Inbox to Backlog on the Platform Team Workboards (Clinic Duty Team) board. Oct 31 2019, 8:54 PM

Change 546285 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Move bulk content for update after ElasticaWrite

https://gerrit.wikimedia.org/r/546285

EBernhardson mentioned this in rECIR77592f4c2e72: Move bulk content for update after ElasticaWrite. Nov 12 2019, 10:44 AM

ReleaseTaggerBot added a project: MW-1.35-notes (1.35.0-wmf.8; 2019-11-26). Nov 12 2019, 11:00 AM

Maintenance_bot removed a project: Patch-For-Review. Nov 12 2019, 11:10 AM

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Nov 19 2019, 10:49 PM

TJones closed this task as Resolved. Nov 20 2019, 4:56 PM