
🔍Investigate ways to reduce the resources consumed by ES
Open, High, Public

Description

There could be a variety of ways to reduce the resources consumed by ES including:

  • reducing the number of indices or shards
  • using more modern or patched versions of ES
  • changing config settings or optimising the installation

Patches:

Event Timeline

Tarrow renamed this task from Investigate ways to reduce the requirements of ES to Investigate ways to reduce the resources consumed by ES. Jan 19 2023, 5:31 PM
Tarrow created this task.

This task was already partially started by @Andrew-WMDE; created this ticket so we don't lose his work

Change 881642 had a related patch set uploaded (by Tarrow; author: Andrew-WMDE):

[mediawiki/extensions/CirrusSearch@REL1_37] [POC] Support for sharing indices across multiple wikis

https://gerrit.wikimedia.org/r/881642

We had explored this possibility in the past (see T139496), but we never finished/enabled this work because we realized that the perf benefits were not worth the effort (T148554).
But I believe that there are a few pieces in CirrusSearch that were written for this purpose and that you could use to run this POC:

  • conflicting IDs: page ids are used as the elasticsearch doc_id and thus won't work well if multiple wikis have overlapping page_ids; $wgCirrusSearchPrefixIds can be set to true to prefix page_ids with the wiki and use that as the elastic doc_id.
  • the basename of the index can already be forced with $wgCirrusSearchIndexBaseName; have you considered testing with this setting instead of adding a new sharedName option to UpdateOneSearchIndexConfig?
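The doc_id prefixing mentioned above can be sketched roughly as follows. This is an illustration of the idea only; the exact separator and format CirrusSearch uses when $wgCirrusSearchPrefixIds is enabled may differ.

```python
def cirrus_doc_id(page_id: int, wiki_id: str, prefix_ids: bool = False) -> str:
    """Build an Elasticsearch doc_id for a page.

    Mimics the idea behind $wgCirrusSearchPrefixIds: when several wikis
    share one index, bare page_ids collide, so the wiki id is prepended.
    (The exact format used by CirrusSearch may differ.)
    """
    if prefix_ids:
        return f"{wiki_id}|{page_id}"
    return str(page_id)

# Without prefixing, page 1 on two different wikis collides:
assert cirrus_doc_id(1, "wiki1") == cirrus_doc_id(1, "wiki2")
# With prefixing, the doc_ids stay distinct:
assert cirrus_doc_id(1, "wiki1", prefix_ids=True) != cirrus_doc_id(1, "wiki2", prefix_ids=True)
```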

The missing bits are, I think, on the query side: as all docs will be stored in the same index, the generated query must attach a "wiki: current_wiki" filter to all searches (we have an indexed field named wiki).
There are certainly other problems to solve but it's hard to anticipate all of them at this point.
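The query-side change could look something like the following sketch, which wraps an arbitrary user query in a bool filter on the indexed wiki field (the field name comes from the comment above; the surrounding query shape is illustrative, not the exact DSL CirrusSearch generates):

```python
def add_wiki_filter(query: dict, current_wiki: str) -> dict:
    """Wrap an Elasticsearch query so it only matches documents belonging
    to current_wiki, via a term filter on the indexed 'wiki' field."""
    return {
        "bool": {
            "must": [query],
            "filter": [{"term": {"wiki": current_wiki}}],
        }
    }

q = add_wiki_filter({"match": {"title": "cat"}}, "wiki1")
assert q["bool"]["filter"] == [{"term": {"wiki": "wiki1"}}]
```

Because it is a non-scoring filter clause, the per-wiki restriction should not affect relevance ranking of the wrapped query.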

Might there be any benefit from cloning (ES 7.4+, so maybe MW 1.39+) a default index per MediaWiki release used by Wikibase.cloud, rather than generating a new one for each instance, which presumably creates a lot of duplicate content if default messages are indexed?

It seems to hardlink from a readonly index on compatible filesystems, rather than make new files (though I don't know whether the subsequent 'recovery' creates a copy of the old data or simply references it from the readonly index).

Apologies if this does not make sense, or is being done already as part of the creation process - I know little about how ES works or the details of its use in MediaWiki/WB.C.

One way to test this would be to add this gerrit patch to the patchUrls section of pacman.yaml.

It would be interesting to know if this does indeed work in combination with the "platform api", even on some old version of our MediaWiki image.

A few manual steps to see if this thing is working would be to create two wikis and add some content to both of them.

For example, an item in wiki1 called cat1 and an item in wiki2 called cat2. It would then be good to confirm that there are no results on wiki1 for cat2, but that the search results we would normally expect are still there. Similarly, it would be good to confirm that searching for Q1 returns results from only one wiki.
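Assuming both wikis write into one shared index, the isolation check above can be simulated with a toy in-memory index; everything here (the prefixed doc_ids, the per-document wiki field) is illustrative, not the real CirrusSearch data model:

```python
# Toy shared index: each doc records which wiki it belongs to.
shared_index = [
    {"wiki": "wiki1", "doc_id": "wiki1|1", "label": "cat1"},
    {"wiki": "wiki2", "doc_id": "wiki2|1", "label": "cat2"},
]

def search(index, current_wiki, text):
    """Return matching docs, always filtered to the current wiki."""
    return [d for d in index if d["wiki"] == current_wiki and text in d["label"]]

# cat2 must not leak into wiki1's results...
assert search(shared_index, "wiki1", "cat2") == []
# ...while wiki1's own content is still found:
assert [d["label"] for d in search(shared_index, "wiki1", "cat1")] == ["cat1"]
```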

Today I tried this patch on a local wikibase.cloud cluster and mediawiki image 1.37-7.4-20220621-fp-beta-0 (slightly modified to contain the patched code).

I used mediawiki 481e0b535... and wbaas-deploy c54a972...
and added the patch URL to pacman.yaml

$ git diff pacman.yaml
diff --git a/pacman.yaml b/pacman.yaml
index c0121f437..ae1d55c18 100644
--- a/pacman.yaml
+++ b/pacman.yaml
@@ -244,6 +244,8 @@
   - Gruntfile.js
   - Doxyfile
 - name: CirrusSearch
+  patchUrls:
+  - https://gerrit.wikimedia.org/r/changes/mediawiki%2Fextensions%2FCirrusSearch~881642/revisions/2/patch?download
   artifactUrl: https://codeload.github.com/wikimedia/mediawiki-extensions-CirrusSearch/zip/e9fe241ff135f666dc1837cedb1afd5b8b78a338
   artifactLevel: 1
   destination: ./dist/extensions/CirrusSearch

After that I ran sync.sh and built a local image, which I used in the helmfile local env config.

Results:

Creating wikis worked. I created several wikis where the Q1 Item label was the same and/or slightly different, and in both cases the searchbox suggestions led to the correct item. I also didn't see suggestions for items from other wikis.

I also looked a bit at the ES status via elasticHQ (docker run --network host elastichq/elasticsearch-hq and kubectl port-forward elasticsearch-master-0 9002).

There I could confirm that only the shared indices exist for the wikis (mw_cirrus_metastore_first, wiki_content_first and wiki_general_first),
as well as the correct aliases for the wikis.


Same results for mediawiki image 1.38-7.4-20230323-0 (used with wbaas-deploy d39ace9...)

also: the patched maint script UpdateOneSearchIndexConfig.php didn't change between the last 1.37 image and this 1.38 image. I'm now looking at our current mediawiki upstream at 1.39 and I see some changes when diffing, so I will first check whether the patch is still compatible with that version.

For 1.39 I tried to port the patch; it currently only lives here: https://phabricator.wikimedia.org/P46080

wbaas-deploy 6dec041...
mediawiki 1.39-7.4-20230328-0

At first it looked like it was working, but then I noticed that for every wiki other than the first one, ES wasn't enabled (which decreased my confidence in my earlier results - maybe I was looking at the wrong search suggestions without realizing it).

I created a third wiki, and for all but the first one there was a failed wbstackElasticSearchInit log: https://phabricator.wikimedia.org/P46081
Looking at the existing ES aliases in that setup, though, it looked like the aliases had been created successfully earlier.

Adding to the last experiment with 1.39: I found out that it works fine for the other wikis if the ES WikiSetting gets enabled manually. My assumption is that it probably just "fails" because the wikibase.cloud API's approach to checking whether the job completed successfully is to look for specific strings in the output, which is quite error-prone. I assume the output changed and this would work after T333559 is merged. (edit: tried it with that fix locally, and it worked fine!)

Here's the formatted output and stack trace of a failed job from that setup: https://phabricator.wikimedia.org/P46555

Change 881642 abandoned by Umherirrender:

[mediawiki/extensions/CirrusSearch@REL1_37] [POC] Support for sharing indices across multiple wikis

Reason:

This branch is EOL, please upload in the master branch if still relevant/needed

https://gerrit.wikimedia.org/r/881642

notes from a call about this:

An upstream patch to try this out is visible at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1010190

We're dusting off and reconsidering this attempt to alias many indices to one. The suggestion is that we make a replica ES cluster in staging, which would cost around 1kEUR/month (though hopefully we won't need it for a full month).

We have maybe around 18GB of ES storage, although we think a lot of that might be index overhead, e.g. for empty wikis.

Sticking everything in one index is claimed to have an approximate limit of 200 million docs and 50GB, but we're well below that right now, and in the future we could also try aliasing to a small number of upstream indices if we get near this limit.

We think that we could set this new cluster as the write-only cluster for a bit, populate it, and then try to move it to become a read cluster while keeping the existing many-index cluster as write-only.
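The staged cutover described above might be sketched like this; the cluster names and phase ordering here are assumptions for illustration, not a confirmed plan:

```python
# Hypothetical phases of the migration: dual-write while the new
# shared-index cluster is populated, then flip reads over to it.
phases = [
    # 1. New cluster receives writes alongside the old one while it is
    #    being populated; reads stay on the old many-index cluster.
    {"write": ["many-index", "shared-index"], "read": "many-index"},
    # 2. Once populated, move reads to the shared-index cluster while the
    #    old many-index cluster keeps receiving writes as a fallback.
    {"write": ["many-index", "shared-index"], "read": "shared-index"},
]

# The new cluster is written to throughout, so it never misses updates.
assert all("shared-index" in p["write"] for p in phases)
```

Keeping the old cluster writable in both phases means a rollback is just pointing reads back at it.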

We want to try this on staging first; we also want to document both how the aliasing technique works and the mechanics of doing the test migration.

Anton.Kokh renamed this task from Investigate ways to reduce the resources consumed by ES to 🔍Investigate ways to reduce the resources consumed by ES. Wed, May 29, 1:47 PM
Anton.Kokh triaged this task as High priority.
Andrew-WMDE claimed this task.
Andrew-WMDE moved this task from Doing to Done on the Wikibase Cloud (Kanban Board Q2 2024) board.