
Optimize the elasticsearch analysis settings for wikibase
Closed, Resolved · Public · 8 Estimated Story Points

Description

The analysis settings for wikibase may create a set of analyzers, token filters, and char filters prefixed per language.

Currently it generates 1200+ analyzers, and most of them are identical.
Only analysis components like token and char filters are deduplicated.
Deduplicating analyzers is not entirely trivial as they are referenced from the mapping config builders and all of them are expected to be there.
It might make sense to quickly evaluate the performance gain of such an optimization (possibly by measuring index creation and node startup times).
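The deduplication in question is conceptually simple. A minimal sketch of the idea, assuming a plain dict of analysis settings and handling only top-level mapping fields (this is not the actual CirrusSearch AnalysisFilter code; all names here are illustrative):

```
import json

def dedupe_analyzers(analysis: dict, mapping: dict) -> None:
    """Collapse analyzers with identical definitions and rewrite mapping references."""
    canonical = {}  # serialized definition -> canonical analyzer name
    rename = {}     # duplicate analyzer name -> canonical analyzer name
    analyzers = analysis.get("analyzer", {})
    for name, definition in sorted(analyzers.items()):
        key = json.dumps(definition, sort_keys=True)
        if key in canonical:
            rename[name] = canonical[key]
        else:
            canonical[key] = name
    # Drop the duplicate definitions from the settings.
    for old in rename:
        del analyzers[old]
    # Rewrite references in the field mappings (top-level fields only, for brevity).
    for field in mapping.get("properties", {}).values():
        for attr in ("analyzer", "search_analyzer"):
            if field.get(attr) in rename:
                field[attr] = rename[field[attr]]
```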

AC:

  • the number of analyzers created on wikibase indices is largely reduced

Event Timeline

Restricted Application added a subscriber: Aklapper.

We should measure the gain. We already have a component that can deduplicate (AnalysisFilter), but we should test whether it has any useful effect. Based on the change in index creation time we can decide if it should move forward.

We're waiting on @Lydia_Pintscher to get a better idea of the priority of this.

Is there anywhere where I can briefly read up on what analyzers, token filters and char filters are in this context? Then I can probably help.


For details, Trey has a series of blog posts on the subject. The short answer is that they are language-specific text processing components that improve the matching between user queries and content.
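To make that concrete, here is a simplified, hypothetical example of what one per-language analysis chain looks like in index settings (not the actual Wikibase/CirrusSearch configuration; component names are made up):

```
# One language's chain: a char_filter cleans the raw text, the tokenizer
# splits it into tokens, and token filters normalize each token.
english_like_analysis = {
    "char_filter": {
        # strip soft hyphens before tokenization
        "strip_soft_hyphen": {"type": "mapping", "mappings": ["\u00ad=>"]},
    },
    "filter": {
        "english_stemmer": {"type": "stemmer", "language": "english"},
    },
    "analyzer": {
        "en_text": {
            "type": "custom",
            "char_filter": ["strip_soft_hyphen"],
            "tokenizer": "standard",
            "filter": ["lowercase", "english_stemmer"],
        },
    },
}
# Wikibase builds a chain like this for every supported language, which is
# where the 1200+ analyzers come from.
```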

For how that specifically relates here, though: each elasticsearch index has some amount of configuration that defines it. On a typical wiki, say eswiktionary, this configuration is ~8 kB. For a wikidata index, which contains the text processing configuration of all possible languages, the configuration is ~450 kB and probably outside the normal operating expectations of elasticsearch. For the WMF cluster, where we have 3 of these indices, it's a bit of a pain but manageable.
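If you want to see these numbers for your own cluster, a quick check of the per-index settings size (assuming a locally reachable cluster at localhost:9200; the index names are just examples):

```
import json
import requests

def settings_size_kb(index: str, host: str = "http://localhost:9200") -> float:
    """Size of an index's settings payload in kB."""
    settings = requests.get(f"{host}/{index}/_settings").json()
    return len(json.dumps(settings)) / 1024

for index in ("eswiktionary_content", "wikidatawiki_content"):
    print(index, round(settings_size_kb(index), 1), "kB")
```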

We talked recently with someone at our office hours who was running hundreds of wikibase instances against a single elasticsearch cluster. In general that's not an unreasonable ask; we put a couple hundred wikis per cluster for WMF deployments. Unfortunately, their elasticsearch cluster became unresponsive, failed master elections, and generally became unusable. After a light review of stack traces and logs, we think this is because it took tens of minutes for the master to load the cluster state, which includes the configuration of those hundreds of wikibase indices. I forget the exact size of their cluster state, but I think it was two orders of magnitude larger than the cluster state we see on WMF clusters. One theory to investigate in this ticket is whether we could improve the time it takes to load the wikibase search index configuration by clearing out duplication between languages, and by proxy reduce the size of the elasticsearch cluster state created by each wikibase instance.
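A quick way to gauge that theory on any cluster is to fetch the metadata portion of the cluster state and see how much of it each index contributes (a diagnostic sketch, assuming direct REST access at localhost:9200):

```
import json
import requests

HOST = "http://localhost:9200"

state = requests.get(f"{HOST}/_cluster/state/metadata").json()
total_kb = len(json.dumps(state)) / 1024
per_index = {
    name: len(json.dumps(meta)) / 1024
    for name, meta in state["metadata"]["indices"].items()
}
print(f"cluster state (metadata only): {total_kb:.0f} kB")
for name, kb in sorted(per_index.items(), key=lambda kv: -kv[1])[:10]:
    print(f"  {name}: {kb:.0f} kB")
```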

Thank you!
I assume the cluster we are talking about is Wikibase.cloud? If so, this is fairly important for WMDE's LOD work: it is a serious issue for giving more people access to Wikibase.cloud and, by extension, for alleviating pain from Wikidata and the Wikidata Query Service by moving data there longer-term.

If this is about Wikibase.cloud, how much should we de-duplicate vs. reduce the number of supported languages? My intuition is that on Wikibase.cloud the number of languages per instance is unlikely to be as high as on Wikidata. Would that also be an option? Would that help?

Gehel triaged this task as High priority.May 8 2023, 3:25 PM

> If this is about Wikibase.cloud, how much should we de-duplicate vs. reduce the number of supported languages? My intuition is that on Wikibase.cloud the number of languages per instance is unlikely to be as high as on Wikidata. Would that also be an option? Would that help?

Both limiting the languages and reducing duplication between languages are viable options for shrinking the per-index configuration. I suppose this one was ticketed first because we previously wrote the deduplication routines but haven't turned them on. Limiting the languages wouldn't be a super invasive patch either: the mapping side is straightforward, and while the query-time code needs a little rework (it currently assumes all languages exist), it shouldn't be a significant undertaking.
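For illustration, the "limit the languages" option boils down to filtering the generated analysis components down to an allowed set of language prefixes, something like the sketch below (the naming convention here is hypothetical; the real Wikibase builders work differently):

```
def filter_languages(analysis: dict, keep: set[str]) -> dict:
    """Keep only components with an allowed language prefix, plus language-neutral ones."""
    def wanted(name: str) -> bool:
        prefix = name.split("_", 1)[0]
        return prefix in keep or "_" not in name
    return {
        section: {name: cfg for name, cfg in components.items() if wanted(name)}
        for section, components in analysis.items()
    }

slim = filter_languages(
    {"analyzer": {"en_text": {}, "zh_text": {}, "keyword": {}}},
    keep={"en", "de", "fr", "es"},
)
# -> {"analyzer": {"en_text": {}, "keyword": {}}}
```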

To get an idea of what we need to optimize, I ran an experiment (a sketch of the harness follows the list below). It stands up a fresh elasticsearch instance, creates 100 indexes with the same settings, and restarts the instance every 10 indexes. I measured how long the instance takes to come up and how long indices take to create, using 4 different index configurations:

  1. prod_enwiki - the current enwiki_content settings in production
  2. prod_wikidata - the current wikidatawiki_content settings in production
  3. wbcs_content_dedup - current development branches with CirrusSearch hacked to enable deduplication in AnalysisFilter::filterAnalysis
  4. wbcs_content_minlang - current development branches with Wikibase hacked to only have 4 default terms languages
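For reference, a minimal sketch of that harness (not the script actually used; it assumes a systemd-managed local elasticsearch and an index_settings.json file holding one of the four configurations):

```
import json
import subprocess
import time

import requests

HOST = "http://localhost:9200"
SETTINGS = json.load(open("index_settings.json"))  # one of the 4 configurations

def wait_for_cluster() -> float:
    """Seconds until the cluster answers a health check after a restart."""
    start = time.monotonic()
    while True:
        try:
            resp = requests.get(f"{HOST}/_cluster/health",
                                params={"wait_for_status": "yellow", "timeout": "5s"},
                                timeout=10)
            if resp.ok:
                return time.monotonic() - start
        except requests.RequestException:
            pass
        time.sleep(1)

for i in range(100):
    if i and i % 10 == 0:
        subprocess.run(["sudo", "systemctl", "restart", "elasticsearch"], check=True)
        print(f"restart after {i} indexes: {wait_for_cluster():.1f}s to come back up")
    start = time.monotonic()
    requests.put(f"{HOST}/test_{i:03d}", json=SETTINGS).raise_for_status()
    print(f"create index {i}: {time.monotonic() - start:.1f}s")
```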

Initially I was intending to plot these values, but the differences are so large it seems unnecessary. Also, I didn't repeat any of these tests, so the error margins are probably significant. But I think this provides a strong enough trend line to ignore all that:

| configuration | 2nd index create | 100th index creation | first restart | last restart |
| --- | --- | --- | --- | --- |
| prod_enwiki | 0.2s | 0.3s | 12s | 30s |
| prod_wikidata | 30s | 48s | 177s | 1860s |
| wbcs_content_dedup | 1s | 1.2s | 19s | 67s |
| wbcs_content_minlang | 0.5s | 0.5s | 20s | 54s |

I attached a profiler to elasticsearch while it was picking up after a restart to see what it was doing. It is essentially stuck inside routines that iterate over the available settings. It looks like some O(n^2) behavior might have snuck into their Settings iteration code, where n is the number of settings in a single index.
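For a rough sense of how large that n gets, you can flatten an index's settings payload and count the leaf values (diagnostic only; the host here is an assumed localhost:9200 instance):

```
import requests

def count_leaves(obj) -> int:
    """Count leaf values in a nested settings structure."""
    if isinstance(obj, dict):
        return sum(count_leaves(v) for v in obj.values())
    if isinstance(obj, list):
        return sum(count_leaves(v) for v in obj)
    return 1

settings = requests.get("http://localhost:9200/wikidatawiki_content/_settings").json()
print("settings leaf values:", count_leaves(settings))
```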

As for ways forward, I think enabling the deduplication procedure is going to retain the most functionality. It keeps all of the per-language fields but greatly reduces the number of settings used to configure those fields. In theory this should be transparent, although it means we can never override search_analyzer at query time (not that we do that anywhere currently, afaict).

Since we've never enabled the deduplication anywhere, I will have to work out the appropriate mechanism for turning it on.

Change 920797 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add a config flag to enable analysis chain deduplication

https://gerrit.wikimedia.org/r/920797

Change 920797 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add a config flag to enable analysis chain deduplication

https://gerrit.wikimedia.org/r/920797

Change 929411 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Enable analysis chain deduplication for wikibase

https://gerrit.wikimedia.org/r/929411

Change 929411 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Enable analysis chain deduplication for wikibase

https://gerrit.wikimedia.org/r/929411

Mentioned in SAL (#wikimedia-operations) [2023-06-13T20:48:09Z] <ebernhardson@deploy1002> Started scap: Backport for [[gerrit:929411|cirrus: Enable analysis chain deduplication for wikibase (T334194)]]

Mentioned in SAL (#wikimedia-operations) [2023-06-13T20:49:37Z] <ebernhardson@deploy1002> ebernhardson: Backport for [[gerrit:929411|cirrus: Enable analysis chain deduplication for wikibase (T334194)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-06-13T20:55:46Z] <ebernhardson@deploy1002> Finished scap: Backport for [[gerrit:929411|cirrus: Enable analysis chain deduplication for wikibase (T334194)]] (duration: 07m 36s)