
Optimize the elasticsearch analysis settings for wikibase
Closed, Resolved · Public · 8 Estimated Story Points

Description

The analysis settings for wikibase may create a set of analyzers, token filters, and char filters prefixed per language.

Currently it generates 1200+ analyzers, and most of them are identical.
Only analysis components like token and char filters are deduplicated.
Deduplicating analyzers is not entirely trivial as they are referenced from the mapping config builders and all of them are expected to be there.
It might make sense to quickly evaluate the performance gain of such an optimization (possibly by measuring index creation and node startup times).
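The deduplication in question is conceptually simple. A minimal sketch of the idea, assuming a plain dict of analysis settings and handling only top-level mapping fields (this is not the actual CirrusSearch AnalysisFilter code; all names here are illustrative):

```
import json

def dedupe_analyzers(analysis: dict, mapping: dict) -> None:
    """Collapse analyzers with identical definitions and rewrite mapping references."""
    canonical = {}  # serialized definition -> canonical analyzer name
    rename = {}     # duplicate analyzer name -> canonical analyzer name
    analyzers = analysis.get("analyzer", {})
    for name, definition in sorted(analyzers.items()):
        key = json.dumps(definition, sort_keys=True)
        if key in canonical:
            rename[name] = canonical[key]
        else:
            canonical[key] = name
    # Drop the duplicate definitions from the settings.
    for old in rename:
        del analyzers[old]
    # Rewrite references in the field mappings (top-level fields only, for brevity).
    for field in mapping.get("properties", {}).values():
        for attr in ("analyzer", "search_analyzer"):
            if field.get(attr) in rename:
                field[attr] = rename[field[attr]]
```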

AC:

  • the number of analyzers created on wikibase indices is largely reduced

Event Timeline

Restricted Application added a subscriber: Aklapper.

We should measure the gain. We already have a component that can deduplicate (AnalysisFilter), but we should test whether it has any useful effect. Based on the change in index creation time we can decide if it should move forward.

We're waiting on @Lydia_Pintscher to get a better idea of the priority of this.

Is there anywhere where I can briefly read up on what analyzers, token filters and char filters are in this context? Then I can probably help.


For details, Trey has a series of blog posts on the subject. The short answer is that they are language-specific text processing components that improve the matching between user queries and content.
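To make that concrete, here is a simplified, hypothetical example of what one per-language analysis chain looks like in index settings (not the actual Wikibase/CirrusSearch configuration; component names are made up):

```
# One language's chain: a char_filter cleans the raw text, the tokenizer
# splits it into tokens, and token filters normalize each token.
english_like_analysis = {
    "char_filter": {
        # strip soft hyphens before tokenization
        "strip_soft_hyphen": {"type": "mapping", "mappings": ["\u00ad=>"]},
    },
    "filter": {
        "english_stemmer": {"type": "stemmer", "language": "english"},
    },
    "analyzer": {
        "en_text": {
            "type": "custom",
            "char_filter": ["strip_soft_hyphen"],
            "tokenizer": "standard",
            "filter": ["lowercase", "english_stemmer"],
        },
    },
}
# Wikibase builds a chain like this for every supported language, which is
# where the 1200+ analyzers come from.
```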

For how that specifically relates here, though: each elasticsearch index has some amount of configuration that defines it. On a typical wiki, say eswiktionary, this configuration is ~8 kB. For a wikidata index, which contains the text processing configuration of all possible languages, the configuration is ~450 kB and probably outside the normal operating expectations of elasticsearch. For the WMF cluster, where we have 3 of these indices, it's a bit of a pain but manageable.
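If you want to see these numbers for your own cluster, a quick check of the per-index settings size (assuming a locally reachable cluster at localhost:9200; the index names are just examples):

```
import json
import requests

def settings_size_kb(index: str, host: str = "http://localhost:9200") -> float:
    """Size of an index's settings payload in kB."""
    settings = requests.get(f"{host}/{index}/_settings").json()
    return len(json.dumps(settings)) / 1024

for index in ("eswiktionary_content", "wikidatawiki_content"):
    print(index, round(settings_size_kb(index), 1), "kB")
```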

We talked recently with someone at our office hours who was running hundreds of wikibase instances against a single elasticsearch cluster. In general that's not an unreasonable ask; we put a couple hundred wikis per cluster for WMF deployments. Unfortunately, their elasticsearch cluster became unresponsive, failed master elections, and generally became unusable. After a light review of stack traces and logs, we think this is because it took tens of minutes for the master to load the cluster state, which includes the configuration of those hundreds of wikibase indices. I forget the exact size of their cluster state, but I think it was two orders of magnitude larger than the cluster state we see on WMF clusters. One theory to investigate in this ticket is whether we could improve the time it takes to load the wikibase search index configuration by clearing out duplication between languages, and by proxy reduce the size of the elasticsearch cluster state created by each wikibase instance.
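A quick way to gauge that theory on any cluster is to fetch the metadata portion of the cluster state and see how much of it each index contributes (a diagnostic sketch, assuming direct REST access at localhost:9200):

```
import json
import requests

HOST = "http://localhost:9200"

state = requests.get(f"{HOST}/_cluster/state/metadata").json()
total_kb = len(json.dumps(state)) / 1024
per_index = {
    name: len(json.dumps(meta)) / 1024
    for name, meta in state["metadata"]["indices"].items()
}
print(f"cluster state (metadata only): {total_kb:.0f} kB")
for name, kb in sorted(per_index.items(), key=lambda kv: -kv[1])[:10]:
    print(f"  {name}: {kb:.0f} kB")
```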

Thank you!
I assume the cluster we are talking about is Wikibase.cloud? If so, this is fairly important for WMDE's LOD work: it is a serious issue for giving more people access to Wikibase.cloud and, by extension, for alleviating pain from Wikidata and the Wikidata Query Service by moving data there longer-term.

If this is about Wikibase.cloud, how much should we de-duplicate vs. reduce the number of supported languages? My intuition is that on Wikibase.cloud the number of languages per instance is unlikely to be as high as on Wikidata. Would that also be an option? Would that help?

Gehel triaged this task as High priority.May 8 2023, 3:25 PM

> If this is about Wikibase.cloud, how much should we de-duplicate vs. reduce the number of supported languages? My intuition is that on Wikibase.cloud the number of languages per instance is unlikely to be as high as on Wikidata. Would that also be an option? Would that help?

Both limiting the languages and reducing duplication between languages are viable options for shrinking the per-index configuration. I suppose this one was ticketed first because we previously wrote the deduplication routines but haven't turned them on. Limiting the languages wouldn't be a super invasive patch either: the mapping side is straightforward, and while the query-time code needs a little rework (it currently assumes all languages exist), it shouldn't be a significant undertaking.
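For illustration, the "limit the languages" option boils down to filtering the generated analysis components down to an allowed set of language prefixes, something like the sketch below (the naming convention here is hypothetical; the real Wikibase builders work differently):

```
def filter_languages(analysis: dict, keep: set[str]) -> dict:
    """Keep only components with an allowed language prefix, plus language-neutral ones."""
    def wanted(name: str) -> bool:
        prefix = name.split("_", 1)[0]
        return prefix in keep or "_" not in name
    return {
        section: {name: cfg for name, cfg in components.items() if wanted(name)}
        for section, components in analysis.items()
    }

slim = filter_languages(
    {"analyzer": {"en_text": {}, "zh_text": {}, "keyword": {}}},
    keep={"en", "de", "fr", "es"},
)
# -> {"analyzer": {"en_text": {}, "keyword": {}}}
```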

To get an idea of what we need to optimize, I ran an experiment (a sketch of the harness follows the list below). It stands up a fresh elasticsearch instance, creates 100 indexes with the same settings, and restarts the instance every 10 indexes. I measured how long the instance takes to come up and how long indices take to create, using 4 different index configurations:

  1. prod_enwiki - the current enwiki_content settings in production
  2. prod_wikidata - the current wikidatawiki_content settings in production
  3. wbcs_content_dedup - current development branches with CirrusSearch hacked to enable deduplication in AnalysisFilter::filterAnalysis
  4. wbcs_content_minlang - current development branches with Wikibase hacked to only have 4 default terms languages
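For reference, a minimal sketch of that harness (not the script actually used; it assumes a systemd-managed local elasticsearch and an index_settings.json file holding one of the four configurations):

```
import json
import subprocess
import time

import requests

HOST = "http://localhost:9200"
SETTINGS = json.load(open("index_settings.json"))  # one of the 4 configurations

def wait_for_cluster() -> float:
    """Seconds until the cluster answers a health check after a restart."""
    start = time.monotonic()
    while True:
        try:
            resp = requests.get(f"{HOST}/_cluster/health",
                                params={"wait_for_status": "yellow", "timeout": "5s"},
                                timeout=10)
            if resp.ok:
                return time.monotonic() - start
        except requests.RequestException:
            pass
        time.sleep(1)

for i in range(100):
    if i and i % 10 == 0:
        subprocess.run(["sudo", "systemctl", "restart", "elasticsearch"], check=True)
        print(f"restart after {i} indexes: {wait_for_cluster():.1f}s to come back up")
    start = time.monotonic()
    requests.put(f"{HOST}/test_{i:03d}", json=SETTINGS).raise_for_status()
    print(f"create index {i}: {time.monotonic() - start:.1f}s")
```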

Initially I was intending to plot these values, but the differences are so large it seems unnecessary. Also, I didn't repeat any of these tests, so the error margins are probably significant. But I think this provides a strong enough trend line to ignore all that:

| configuration | 2nd index create | 100th index creation | first restart | last restart |
| --- | --- | --- | --- | --- |
| prod_enwiki | 0.2s | 0.3s | 12s | 30s |
| prod_wikidata | 30s | 48s | 177s | 1860s |
| wbcs_content_dedup | 1s | 1.2s | 19s | 67s |
| wbcs_content_minlang | 0.5s | 0.5s | 20s | 54s |

I attached a profiler to elasticsearch while it was picking up after a restart to see what it was doing. It is essentially stuck inside routines that iterate over the available settings. It looks like some O(n^2) behavior might have snuck into their Settings iteration code, where n is the number of settings in a single index.
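For a rough sense of how large that n gets, you can flatten an index's settings payload and count the leaf values (diagnostic only; the host here is an assumed localhost:9200 instance):

```
import requests

def count_leaves(obj) -> int:
    """Count leaf values in a nested settings structure."""
    if isinstance(obj, dict):
        return sum(count_leaves(v) for v in obj.values())
    if isinstance(obj, list):
        return sum(count_leaves(v) for v in obj)
    return 1

settings = requests.get("http://localhost:9200/wikidatawiki_content/_settings").json()
print("settings leaf values:", count_leaves(settings))
```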

As for ways forward, I think enabling the deduplication procedure is going to retain the most functionality. It keeps all of the per-language fields but greatly reduces the number of settings used to configure those fields. In theory this should be transparent, although it means we can never override search_analyzer at query time (not that we do that anywhere currently, afaict).

Since we've never enabled the deduplication anywhere, I will have to work out the appropriate mechanism for turning it on.

Change 920797 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add a config flag to enable analysis chain deduplication

https://gerrit.wikimedia.org/r/920797

Change 920797 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add a config flag to enable analysis chain deduplication

https://gerrit.wikimedia.org/r/920797

Change 929411 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Enable analysis chain deduplication for wikibase

https://gerrit.wikimedia.org/r/929411

Change 929411 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Enable analysis chain deduplication for wikibase

https://gerrit.wikimedia.org/r/929411

Mentioned in SAL (#wikimedia-operations) [2023-06-13T20:48:09Z] <ebernhardson@deploy1002> Started scap: Backport for [[gerrit:929411|cirrus: Enable analysis chain deduplication for wikibase (T334194)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-13T20:49:37Z] <ebernhardson@deploy1002> ebernhardson: Backport for [[gerrit:929411|cirrus: Enable analysis chain deduplication for wikibase (T334194)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-13T20:55:46Z] <ebernhardson@deploy1002> Finished scap: Backport for [[gerrit:929411|cirrus: Enable analysis chain deduplication for wikibase (T334194)]] (duration: 07m 36s)