
Implement text analysis to support stemming
Closed, Resolved (Public)

Description

Stemming is the process of reducing a word to its root form. This is commonly done when indexing and searching freeform text content to increase the chance of matching a document containing a word form that varies in tense or cardinality from the user's search terms.

Cardinality is an approachable way to think about this complex problem. If a user searches Toolhub for the plural English noun templates they are probably equally happy to find results where the toolinfo author used the singular English noun template. A savvy user can use wildcards to work around a lack of cardinality stemming in some languages (like English) by searching for template*. This type of workaround is limited, however, to suffix-based variations.

Elasticsearch uses token filters to implement stemming support. Fully supporting all languages is a never-ending task, but we should be able to support a number of commonly used languages without investing hundreds of human hours in implementation and configuration by using multi-fields with language-specific analyzers. For the initial implementation, supporting English stemming would be sufficient. We do not have a localization process for toolinfo records yet, and as a result most content is only available in English.
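As a sketch of the multi-field approach (the field names here are illustrative, not Toolhub's actual schema), the same string can be indexed twice, once as-is and once through the built-in "english" analyzer, which stems tokens at index and query time:

```python
# Sketch of an Elasticsearch multi-field mapping: "description" is
# indexed with the default analyzer, and a "description.en" sub-field
# is indexed through the built-in "english" analyzer, which stems
# tokens (e.g. "templates" and "template" share a root form).
# Field names are illustrative, not Toolhub's actual schema.
mapping = {
    "properties": {
        "description": {
            "type": "text",
            "fields": {
                "en": {
                    "type": "text",
                    "analyzer": "english",
                },
            },
        },
    },
}

# A full text query can then target the stemmed sub-field:
query = {"match": {"description.en": "templates"}}
```

Because the original field is still indexed unstemmed, exact-match and wildcard queries keep working unchanged alongside the stemmed sub-field.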

Event Timeline

bd808 triaged this task as Medium priority.Dec 2 2021, 11:13 PM
bd808 updated the task description.
bd808 moved this task from Backlog to Groomed/Ready on the Toolhub board.

@bd808 Could you help me understand what user value this enables? Likewise, how might we acceptance test this when completed?

@bd808 is there any preferred method of solving this? Elasticsearch seems to have certain built-in stemmers, each having their advantages and disadvantages. I was wondering if you have any preference, or should we come up with something?

@bd808 Could you help me understand what user value this enables? Likewise, how might we acceptance test this when completed?

The simplest description is that it would enable a search for "template" to match a record containing the word "templates" and vice versa. This is accomplished by stemming each term as it is added to the full text index or search criteria, so that it is stored/searched in a root form according to the rules of the token filter being used. For our current English language corpus, a good acceptance test would be that a search for "template" returns the same number of matched records as a search for "template*". At the time I'm writing this, the latter query is returning 29 more results than the former.
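For intuition only, a toy suffix-stripper shows why both forms land on the same index term. This is not the actual algorithm Elasticsearch's "english" filter uses (that is a real stemmer with far more rules); it is just an illustration of the store-and-search-in-root-form idea:

```python
def toy_stem(word):
    """Strip a plural "s" suffix; a crude stand-in for a real stemmer,
    used only to illustrate how stemming makes word forms converge."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

# Both the indexed term and the search term reduce to the same root,
# so a search for "template" matches a document containing "templates"
# and vice versa.
assert toy_stem("templates") == toy_stem("template")
```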

@bd808 is there any preferred method of solving this? Elasticsearch seems to have certain built-in stemmers, each having their advantages and disadvantages. I was wondering if you have any preference, or should we come up with something?

For the moment we really only need English stemming due to the dominance of English in our source material and our current lack of translation for the dynamic content of toolinfo records. The upstream recommended "english" stemming filter would be the simplest thing to start with, but we could also try to get some advice from @TJones on customizations that would be even more likely to be effective.

Trey has previously written about improvements to search for enwiki that he researched and tested. Perhaps he would be able to tell us roughly which parts of the text analyzer config used for English by CirrusSearch are worth attempting to apply to our index.

Because our production deployment is made to Elasticsearch clusters which also service CirrusSearch we should be able to use custom analyzers built for CirrusSearch. Use of custom things may need to be gated by a new configuration variable (a feature toggle) if adding the same Elasticsearch customization to our development and demo mode Elasticsearch Docker container are too complicated to support.
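A hedged sketch of what such a feature toggle might look like in Django-style settings; the environment variable and the "cirrus_english" analyzer name are invented for illustration:

```python
import os

# Hypothetical feature toggle: only reference CirrusSearch's custom
# analyzers when the deployment's Elasticsearch cluster actually
# provides them (i.e. in production, not the demo Docker container).
ES_USE_CIRRUS_ANALYZERS = (
    os.environ.get("TOOLHUB_ES_USE_CIRRUS_ANALYZERS", "false").lower()
    == "true"
)


def pick_analyzer():
    """Fall back to the built-in "english" analyzer when the custom
    CirrusSearch analyzers are unavailable (dev and demo mode)."""
    return "cirrus_english" if ES_USE_CIRRUS_ANALYZERS else "english"
```

The toggle keeps the index definition valid in every environment: the development container never references an analyzer it does not have.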


I think our first step should be to contact @TJones then and hear what he thinks.

The default English analyzer (the inner details of which are here) is probably a good place to start.

On English Wikipedia, Wiktionary, and other wikis, we have a lot more complex configuration because of the wild breadth of content covered. We need to handle unusual Unicode characters, extensive non-English text, hard-to-type diacritics, creative forms of vandalism, etc. It sounds like you probably don't need all that—and if it turns out that you do, you can iterate later after making a big easy improvement with the English analyzer ("type": "english").
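Per the upstream Elasticsearch documentation, the built-in english analyzer is equivalent to a custom analyzer assembled from a handful of standard pieces (written here as a plain Python dict for readability). Starting from this decomposition makes it easy to later drop or tweak a single filter, such as stop words, without losing the rest:

```python
# Decomposition of the built-in "english" analyzer, per the upstream
# Elasticsearch docs: standard tokenizer plus possessive stemming,
# lowercasing, English stop words, a (currently empty) keyword
# protection list, and the English stemmer.
english_rebuilt = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {"type": "stop", "stopwords": "_english_"},
                "english_keywords": {"type": "keyword_marker", "keywords": []},
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english",
                },
            },
            "analyzer": {
                "rebuilt_english": {
                    "tokenizer": "standard",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_keywords",
                        "english_stemmer",
                    ],
                },
            },
        },
    },
}
```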

If you want to just implement stemming—for example, if you are worried that stop words would filter out too much—you could do that, too. But unless you have tools with weird names like "The The" or "To Be Or Not To Be", it should be fine.

Also, stemming is better than wildcards, even in English. Some words are prefixes of each other, so if you just want wiki and wikis, wiki* will get too much (it'll match Wikipedia, Wikisource, WikiGnome, etc., etc.). That also means that template (with stemming) and template* might not get the same number of results.
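The prefix-overmatch point can be shown concretely. The stem groupings below are hand-written stand-ins for illustration (a real English stemmer produces similar groupings):

```python
# Wildcard prefix matching overmatches: "wiki*" catches every term
# that merely starts with "wiki", not just inflections of the word.
terms = ["wiki", "wikis", "Wikipedia", "Wikisource", "WikiGnome"]

prefix_hits = [t for t in terms if t.lower().startswith("wiki")]
# All five terms match the prefix, including unrelated compounds.

# A stemmer instead groups only true inflections. These stems are
# hand-written stand-ins, not real stemmer output:
stems = {
    "wiki": "wiki",
    "wikis": "wiki",
    "Wikipedia": "wikipedia",
    "Wikisource": "wikisource",
    "WikiGnome": "wikignome",
}
stem_hits = [t for t in terms if stems[t] == "wiki"]
# Only "wiki" and "wikis" share the stem "wiki".
```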

If you want simple "is it on?" acceptance criteria, searching for less common forms of words (like templated) and seeing the common forms highlighted in the results will let you know it's working. If you want more complex acceptance criteria, I can show anyone who is interested how I do "analyzer analysis" offline for new stemmers / analysis chains, and you can get a speaker to look at the kinds of words that will be indexed as "equivalent" by your analysis chain. At least English speakers are easy to come by. ;)

bd808 moved this task from Groomed/Ready to In Progress on the Toolhub board.

would love to take on this task @bd808

Do it! :)

The current Elasticsearch schema is generated by toolhub.apps.search.documents.ToolDocument using the toolhub.apps.search.documents.SearchDocument base class and the toolhub.apps.toolinfo.models.Tool model. I think a first attempt at adding the desired stemming support would start by adjusting the analyzer configured for "string" fields by SearchDocument.build_string_field(). The upstream documentation at https://elasticsearch-dsl.readthedocs.io/en/latest/index.html should be helpful.
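As a sketch of the direction (the actual signature and return type of SearchDocument.build_string_field() may differ; this shows the raw field mapping such a helper might emit rather than the elasticsearch-dsl objects it would really build):

```python
def build_string_field_mapping():
    """Illustrative mapping a stemming-aware build_string_field()
    might emit: the field keeps its default analyzer and gains an
    English-stemmed "en" sub-field. Names are hypothetical."""
    return {
        "type": "text",
        "fields": {
            "en": {"type": "text", "analyzer": "english"},
        },
    }
```

Queries would then search both the base field and the stemmed `field.en` sub-field, so exact matches and stemmed matches are both found.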

Change 745283 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[wikimedia/toolhub@main] Search: Implement text analysis to support stemming

https://gerrit.wikimedia.org/r/745283

Change 745283 merged by jenkins-bot:

[wikimedia/toolhub@main] Search: Implement text analysis to support stemming

https://gerrit.wikimedia.org/r/745283

To deploy this change to production we will need to update the document mapping for our index and then reindex the existing toolinfo documents. With a solution for T290357: Maintenance environment needed for running one-off commands in place, this could be done with poetry run ./manage.py search_index --rebuild from inside our Python container. Until that task is resolved, we will have to do some creative thinking to find another solution. I haven't really tried this yet, but I think it may be possible to use SSH tunneling to give a container running on a local laptop connectivity into the production data sources.

I didn't really see this before closing the task as Resolved. Perhaps I should re-open it until the change has been fully deployed?
I'm reading the description and discussions under T290357: Maintenance environment needed for running one-off commands to understand the context surrounding your comment.

bd808 moved this task from In Progress to Review on the Toolhub board.

Yes, let's keep this open for now so we don't lose track of the production deployment challenge. I'm moving the task over to the review column and also marking it as stalled for now.

Change 749220 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2021-12-20-122341-production

https://gerrit.wikimedia.org/r/749220

Change 749220 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2021-12-23-121200-production

https://gerrit.wikimedia.org/r/749220

The code for this is now in production, but rebuilding the existing index failed because of the TLS issue for the current maintenance environment documented at T290357#7599589.

Change 751809 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] config: Add setting to disable Elasticsearch TLS cert verification

https://gerrit.wikimedia.org/r/751809

The prod index has been rebuilt via more ugly hacks (https://gerrit.wikimedia.org/r/751809 built locally), making this complete.

Change 751809 merged by jenkins-bot:

[wikimedia/toolhub@main] config: Add setting to disable Elasticsearch TLS cert verification

https://gerrit.wikimedia.org/r/751809

Change 770638 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638

Change 770638 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638