Limit the size of the documents indexed by CirrusSearch
Closed, Resolved | Public | 5 Estimated Story Points

Assigned To

Authored By

	dcausse
	Aug 23 2022, 2:32 PM

Description

As a maintainer of the search cluster I want to have control over the size of the documents sent to elasticsearch.

As of today we do not have an explicit limit on the maximum size of a document produced by CirrusSearch. A limit is indirectly imposed by elasticsearch, which is configured with a 100Mb limit through http.max_content_length, so we can't push anything bigger than 100Mb.

We should probably have better control over the doc size to avoid relying on this elasticsearch limit, especially if we want to store this document somewhere other than elastic (kafka).

As of today we don't have a good sense of the size distribution of the various docs indexed in the clusters, but @EBernhardson wrote a painless script that could give us a good estimate of what to expect (P32761). Knowing this could inform us about what a reasonable limit would be, one that does not affect too many documents.

Going forward monitoring the doc size from CirrusSearch might be a good idea so that we can get a sense of the actual usage during index updates.
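
A minimal sketch of what such monitoring could compute, assuming the doc size is taken as the byte length of the compact UTF-8 JSON serialization. The helper names doc_size_bytes and percentile are hypothetical and are not part of CirrusSearch, which is written in PHP and reports to statsd:

```python
import json
import math


def doc_size_bytes(doc: dict) -> int:
    """Byte length of the document serialized as compact UTF-8 JSON (assumed size definition)."""
    payload = json.dumps(doc, ensure_ascii=False, separators=(",", ":"))
    return len(payload.encode("utf-8"))


def percentile(sizes: list[int], pct: float) -> int:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not sizes:
        raise ValueError("no samples")
    ordered = sorted(sizes)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Collecting doc_size_bytes for each document in an update batch and reporting percentile(sizes, 99) and similar values would give the kind of top-percentile graph described here.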

Implementing the limitation might not be entirely trivial given the various fields we index, so we might investigate an approximation that truncates some well-known fields (auxiliary_text, text, source).

AC:

have a distribution of the approx doc size we have in our indices
monitor the doc size seen during index requests from CirrusSearch and graph a few top percentiles
add a configurable size limit in CirrusSearch and truncate the doc accordingly
decide on a limit in coordination with the Data Engineering Team so that we are confident that the content can be stored in kafka (in anticipation of the search update pipeline rewrite)
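
The truncation approach mentioned in the description could look roughly like the sketch below. It is an illustration only, not the DocumentSizeLimiter that was eventually merged: the field order, the function names, and the assumption that each truncatable field is a plain string are all hypothetical.

```python
import json

# Fields named in the task as truncation candidates; the order is an assumption.
TRUNCATABLE_FIELDS = ("auxiliary_text", "source", "text")


def json_size(doc: dict) -> int:
    """Byte length of the document serialized as UTF-8 JSON."""
    return len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))


def truncate_utf8(value: str, max_bytes: int) -> str:
    """Cut value to at most max_bytes of UTF-8 without splitting a character."""
    return value.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


def limit_doc_size(doc: dict, max_bytes: int) -> dict:
    """Return a copy of doc whose truncatable string fields are shortened to fit max_bytes."""
    limited = dict(doc)
    for field in TRUNCATABLE_FIELDS:
        excess = json_size(limited) - max_bytes
        if excess <= 0:
            break
        value = limited.get(field)
        if not isinstance(value, str):
            continue  # this sketch only handles plain string fields
        keep = max(0, len(value.encode("utf-8")) - excess)
        limited[field] = truncate_utf8(value, keep)
    return limited
```

If the other fields alone already exceed max_bytes, the result stays over the limit; a real implementation would need a policy for that case.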

Details

Subject	Repo	Branch	Lines +/-
Add DocumentSizeLimiter a component to limit cirrus doc sizes	mediawiki/extensions/CirrusSearch	master	+510 -6
Fix metric CirrusSearch._cluster_.updates.all.doc_size	mediawiki/extensions/CirrusSearch	master	+1 -1
Monitor doc sizes in statsd	mediawiki/extensions/CirrusSearch	master	+25 -4


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T317045 [Epic] Re-architect the Search Update Pipeline
		Resolved		dcausse	T316016 Limit the size of the documents indexed by CirrusSearch

Event Timeline

dcausse created this task. Aug 23 2022, 2:32 PM

Restricted Application added a project: Discovery-Search. Aug 23 2022, 2:32 PM

Restricted Application added a subscriber: Aklapper.

What are the risks and their likelihoods if we don't do this?

I don't think there are any immediate risks in not doing this; the main purpose of the task is to inform some decisions regarding the design of the new update pipeline. Since it might involve some discussions outside of the team, I filed this task early so that we get some numbers "soon".

Gehel triaged this task as High priority. Aug 29 2022, 3:15 PM

Gehel moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.

dcausse added a parent task: T317045: [Epic] Re-architect the Search Update Pipeline. Sep 7 2022, 8:48 AM

Gehel moved this task from ML & Data Pipeline to needs triage on the Discovery-Search board. Sep 23 2022, 1:11 PM

Gehel moved this task from needs triage to Current work on the Discovery-Search board. Sep 23 2022, 1:14 PM

Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

MPhamWMF set the point value for this task to 5. Oct 3 2022, 3:48 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

dcausse claimed this task. Oct 4 2022, 8:34 AM

dcausse moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Size percentiles of the json document representing a page, over all the indices (excluding private wikis), are as follows:

+-------+-------+------+-----+-----+
|99.999%| 99.99%| 99.9%|  99%|  90%|
+-------+-------+------+-----+-----+
|4034658|1137197|296056|73224|17970|
+-------+-------+------+-----+-----+

So setting a size limit between 1M and 1.5M is going to affect only the pages around and above the 99.99th percentile (between 20000 and 40000 pages out of 455 million pages).
Being more conservative and allowing more (15M, like what WM Enterprise has on their kafka setup), we'd affect only 20 pages; a 4Mb limit would affect fewer than 5000 pages.
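
For reference, the arithmetic behind these estimates can be sketched as follows. It only uses the percentile table and the 455 million total quoted above; the function name pages_above is hypothetical, and the result is an upper bound on the pages larger than the size at a given percentile, not an exact count for an arbitrary limit.

```python
TOTAL_PAGES = 455_000_000  # total page count quoted in the comment above

# Percentile (in parts per 100,000) -> document size at that percentile, from the table above.
SIZE_AT_PERCENTILE = {
    99_999: 4_034_658,
    99_990: 1_137_197,
    99_900: 296_056,
    99_000: 73_224,
    90_000: 17_970,
}


def pages_above(percentile_per_100k: int, total: int = TOTAL_PAGES) -> int:
    """Number of pages above the given percentile, using integer math to avoid float rounding."""
    return total * (100_000 - percentile_per_100k) // 100_000


for pct, size in SIZE_AT_PERCENTILE.items():
    print(f"{pct / 1000:.3f}% -> size {size}: about {pages_above(pct)} pages are larger")
```

This gives about 45,500 pages above the 99.99th percentile size (1137197) and about 4,550 above the 99.999th percentile size (4034658), the latter matching the "fewer than 5000 pages" estimate for a 4Mb limit.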

Couldn't help poking this a bit. I adjusted the previous percentile query to instead perform an aggregation and report the top 4 (arbitrary) pages by size from each index (P35367), then imported the results into pandas (P35368) and got the following list (which excludes private wikis) of the top 100 pages by size from those results: P35369
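
The merge step can be illustrated without the actual pastes, which are not reproduced here. The sketch below uses only the standard library rather than pandas, and the input shape (a mapping from index name to a list of (title, size) pairs) is a hypothetical stand-in for the real query output:

```python
import heapq


def top_pages(
    per_index_results: dict[str, list[tuple[str, int]]], n: int = 100
) -> list[tuple[int, str, str]]:
    """Merge per-index (title, size) lists into the n largest pages overall.

    Returns (size, index, title) tuples, largest first.
    """
    rows = (
        (size, index, title)
        for index, pages in per_index_results.items()
        for title, size in pages
    )
    return heapq.nlargest(n, rows)
```

With sample input such as {"index_a": [("A", 50), ("B", 10)], "index_b": [("C", 30)]}, calling top_pages(sample, 2) keeps page A from index_a and page C from index_b.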

Exact query results also include metadata about private wikis so I haven't posted them to phab, but they can be found (for the moment) at stat1006.eqiad.wmnet:/home/ebernhardson/top_pages_by_index-*.json.

Notable limitations: while the query against the omega and psi clusters is relatively quick, the chi cluster query took quite some time. It had to be issued directly to port 9200 on an elastic instance, skipping the proxies, to avoid timing out. And for funsies, the big query reported a "took" time of 331 hours over about 15 minutes.

Change 840135 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Monitor doc sizes in statsd

https://gerrit.wikimedia.org/r/840135

gerritbot added a project: Patch-For-Review. Oct 7 2022, 2:19 PM

Change 840135 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Monitor doc sizes in statsd

https://gerrit.wikimedia.org/r/840135

ReleaseTaggerBot added a project: MW-1.40-notes (1.40.0-wmf.6; 2022-10-17). Oct 11 2022, 4:00 PM

Maintenance_bot removed a project: Patch-For-Review. Oct 11 2022, 4:30 PM

Change 845645 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Fix metric CirrusSearch._cluster_.updates.all.doc_size

https://gerrit.wikimedia.org/r/845645