Limit the size of the documents indexed by CirrusSearch
Closed, Resolved · Public · 5 Estimated Story Points

Description

As a maintainer of the search cluster I want to have control over the size of the documents sent to elasticsearch.

As of today we do not have an explicit limit on the maximum size of a document produced by CirrusSearch. A limit is indirectly imposed by elasticsearch, where http.max_content_length is set to 100 MB, so we can't push anything bigger than that.

We should probably have better control over the doc size to avoid relying on this elasticsearch limit, especially if we want to store this document somewhere other than elastic (kafka).

As of today we don't have a good sense of the size distribution of the various docs indexed in the clusters, but @EBernhardson wrote a painless script that could give us a good estimate of what to expect (P32761). Knowing this could inform us about what would be a reasonable limit that does not affect too many documents.

Going forward monitoring the doc size from CirrusSearch might be a good idea so that we can get a sense of the actual usage during index updates.
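
As an illustration of what such monitoring could measure, here is a minimal sketch that sizes a document as UTF-8 JSON and derives a few top percentiles with the nearest-rank method. It is not the CirrusSearch implementation (which is PHP and reports to statsd); the helper names `doc_size_bytes` and `percentile` and the sample documents are hypothetical.

```python
import json
import math


def doc_size_bytes(doc: dict) -> int:
    """Size of the document once serialized as compact UTF-8 JSON."""
    payload = json.dumps(doc, ensure_ascii=False, separators=(",", ":"))
    return len(payload.encode("utf-8"))


def percentile(sorted_sizes: list, pct: float) -> int:
    """Nearest-rank percentile of an ascending list of sizes."""
    if not sorted_sizes:
        raise ValueError("no sizes recorded")
    rank = max(1, math.ceil(pct * len(sorted_sizes) / 100))
    return sorted_sizes[rank - 1]


# Hypothetical documents standing in for what an index request would carry.
docs = [
    {"title": "A", "text": "x" * 10},
    {"title": "B", "text": "x" * 1000},
    {"title": "C", "text": "é" * 50},
]
sizes = sorted(doc_size_bytes(d) for d in docs)
for pct in (90, 99, 99.9):
    print(pct, percentile(sizes, pct))
```

Measuring bytes rather than characters matters here because non-ASCII text takes more than one byte per character once encoded.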

Implementing the limit might not be entirely trivial given the various fields we index, so we might investigate an approximation that truncates some well-known fields (auxiliary_text, text, source).
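
A rough sketch of that approximation, assuming the three fields above (spelled auxiliary_text here) hold plain strings and are truncated in a fixed order until the serialized document fits. The function names, the field order, and the string-only handling are assumptions made for illustration; this is not the DocumentSizeLimiter code.

```python
import json

# Hypothetical truncation order; only string values are handled in this sketch.
TRUNCATABLE_FIELDS = ("auxiliary_text", "text", "source")


def json_size(doc: dict) -> int:
    """Size in bytes of the document serialized as UTF-8 JSON."""
    return len(json.dumps(doc, ensure_ascii=False).encode("utf-8"))


def truncate_utf8(value: str, max_bytes: int) -> str:
    """Cut to at most max_bytes of UTF-8 without splitting a character."""
    cut = value.encode("utf-8")[: max(0, max_bytes)]
    return cut.decode("utf-8", errors="ignore")


def limit_doc_size(doc: dict, max_bytes: int, fields=TRUNCATABLE_FIELDS) -> dict:
    """Return a copy of doc with the given fields truncated to approach max_bytes."""
    limited = dict(doc)
    for field in fields:
        excess = json_size(limited) - max_bytes
        if excess <= 0:
            break
        value = limited.get(field)
        if not isinstance(value, str):
            continue
        # Dropping `excess` raw bytes removes at least that many JSON bytes,
        # since JSON escaping never makes a character shorter.
        keep = len(value.encode("utf-8")) - excess
        limited[field] = truncate_utf8(value, keep)
    return limited


doc = {"title": "T", "text": "é" * 100, "source": "x" * 100}
limited = limit_doc_size(doc, 150)
print(json_size(doc), json_size(limited))
```

If the remaining fields alone exceed the limit, the returned document can still be too large; a real implementation would need to decide what to do in that case.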

AC:

  • have a distribution of the approx doc size we have in our indices
  • monitor the doc size seen during index requests from CirrusSearch and graph a few top percentiles
  • add a configurable size limit in CirrusSearch and truncate the doc accordingly
  • decide on a limit in coordination with the Data Engineering Team so that we are confident that the content can be stored in kafka (in anticipation of the search update pipeline rewrite)

Event Timeline

Restricted Application added a subscriber: Aklapper.

What are the risks and their likelihoods if we don't do this?

I don't think there are any immediate risks in not doing this; the main purpose of the task is to inform some decisions regarding the design of the new update pipeline. Since it might involve some discussions outside of the team, I filed this task early so that we get some numbers "soon".

Gehel triaged this task as High priority. Aug 29 2022, 3:15 PM
Gehel moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.
MPhamWMF set the point value for this task to 5. Oct 3 2022, 3:48 PM

Size percentiles of the JSON document representing a page, over all the indices (excluding private wikis), are as follows:

+-------+-------+------+-----+-----+
|99.999%| 99.99%| 99.9%|  99%|  90%|
+-------+-------+------+-----+-----+
|4034658|1137197|296056|73224|17970|
+-------+-------+------+-----+-----+

So setting a size limit between 1M and 1.5M is only going to affect pages around or above the 99.99th percentile, i.e. roughly 0.01% of the pages (between 20,000 and 40,000 pages out of 455 million).
Being more conservative and allowing more (15M, as WM Enterprise has on their kafka setup), we'd affect only 20 pages; a 4M limit would affect fewer than 5,000 pages.
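
The page counts follow from the tail above each percentile: with N pages in total, about N × (1 − p/100) pages are larger than the p-th percentile. A small sketch of that arithmetic, using the 455 million total and the percentile table from this comment (the function name `pages_above` is hypothetical):

```python
from decimal import Decimal

TOTAL_PAGES = 455_000_000  # total page count quoted in the comment above

# Percentile -> document size, copied from the table above.
PERCENTILES = {
    "99.999": 4034658,
    "99.99": 1137197,
    "99.9": 296056,
    "99": 73224,
    "90": 17970,
}


def pages_above(percentile: str, total_pages: int = TOTAL_PAGES) -> int:
    """Approximate number of pages larger than the given percentile."""
    tail = (Decimal(100) - Decimal(percentile)) / Decimal(100)
    return int(tail * total_pages)


for pct, size in PERCENTILES.items():
    print(f"limit ~{size}: about {pages_above(pct)} pages affected")
```

This gives about 45,500 pages above the 99.99th percentile and about 4,550 above the 99.999th, the same order of magnitude as the estimates above. Decimal is used so that the small tail fractions are not distorted by floating-point rounding.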

Couldn't help poking at this a bit. I adjusted the previous percentile query to instead perform an aggregation and report the top 4 (arbitrary) pages by size from each index (P35367), then imported the results into pandas (P35368) to get the following list (which excludes private wikis) of the top 100 pages by size from those results: P35369

Exact query results also include metadata about private wikis so I haven't posted them to phab, but they can be found (for the moment) at stat1006.eqiad.wmnet:/home/ebernhardson/top_pages_by_index-*.json.

Notable limitations: while the query against the omega and psi clusters is relatively quick, the chi cluster query took quite some time. It had to be issued directly to port 9200 on an elastic instance, skipping the proxies, to avoid timing out. And for funsies, the big query reported a "took" time of 331 hours over about 15 minutes.

Change 840135 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Monitor doc sizes in statsd

https://gerrit.wikimedia.org/r/840135

Change 840135 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Monitor doc sizes in statsd

https://gerrit.wikimedia.org/r/840135

Change 845645 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Fix metric CirrusSearch._cluster_.updates.all.doc_size

https://gerrit.wikimedia.org/r/845645

Change 845646 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Add DocumentSizeLimiter a component to limit cirrus doc sizes

https://gerrit.wikimedia.org/r/845646

Change 845645 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Fix metric CirrusSearch._cluster_.updates.all.doc_size

https://gerrit.wikimedia.org/r/845645

Change 845646 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add DocumentSizeLimiter a component to limit cirrus doc sizes

https://gerrit.wikimedia.org/r/845646