
Implement text analysis to support stemming
Closed, Resolved (Public)

Description

Stemming is the process of reducing a word to its root form. This is commonly done when indexing and searching freeform text content to increase the chance of matching a document containing a word form that varies in tense or cardinality from the user's search terms.

Cardinality is an approachable way to think about this complex problem. If a user searches Toolhub for the plural English noun templates they are probably equally happy to find results where the toolinfo author used the singular English noun template. A savvy user can use wildcards to work around a lack of cardinality stemming in some languages (like English) by searching for template*. This type of workaround is limited, however, to suffix-based variations.

Elasticsearch uses token filters to implement stemming support. Fully supporting all languages is a never-ending task, but we should be able to support a number of commonly used languages without investing hundreds of human hours in implementation and configuration by using multi-fields with language-specific analyzers. For the initial implementation, supporting English stemming would be sufficient. We do not have a localization process for toolinfo records yet, and as a result most content is only available in English.
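As a sketch of the multi-field approach (the field names here are illustrative, not Toolhub's actual schema), the same string can be indexed twice, once as-is and once through the built-in "english" analyzer, which stems tokens at index and query time:

```python
# Sketch of an Elasticsearch multi-field mapping: "description" is
# indexed with the default analyzer, and a "description.en" sub-field
# is indexed through the built-in "english" analyzer, which stems
# tokens (e.g. "templates" and "template" share a root form).
# Field names are illustrative, not Toolhub's actual schema.
mapping = {
    "properties": {
        "description": {
            "type": "text",
            "fields": {
                "en": {
                    "type": "text",
                    "analyzer": "english",
                },
            },
        },
    },
}

# A full text query can then target the stemmed sub-field:
query = {"match": {"description.en": "templates"}}
```

Because the original field is still indexed unstemmed, exact-match and wildcard queries keep working unchanged alongside the stemmed sub-field.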

Event Timeline

bd808 triaged this task as Medium priority.Dec 2 2021, 11:13 PM
bd808 updated the task description.
bd808 moved this task from Backlog to Groomed/Ready on the Toolhub board.

@bd808 Could you help me understand what user value this enables? Likewise, how might we acceptance test this when completed?

@bd808 is there any preferred method of solving this? Elasticsearch seems to have certain built-in stemmers, each having their advantages and disadvantages. I was wondering if you have any preference, or should we come up with something?

@bd808 Could you help me understand what user value this enables? Likewise, how might we acceptance test this when completed?

The simplest description is that it would enable a search for "template" to match a record containing the word "templates" and vice versa. This is accomplished by stemming each term as it is added to the full text index or search criteria, so that it is stored/searched in a root form according to the rules of the token filter being used. For our current English language corpus, a good acceptance test would be that a search for "template" returns the same number of matched records as a search for "template*". At the time I'm writing this, the latter query is returning 29 more results than the former.
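For intuition only, a toy suffix-stripper shows why both forms land on the same index term. This is not the actual algorithm Elasticsearch's "english" filter uses (that is a real stemmer with far more rules); it is just an illustration of the store-and-search-in-root-form idea:

```python
def toy_stem(word):
    """Strip a plural "s" suffix; a crude stand-in for a real stemmer,
    used only to illustrate how stemming makes word forms converge."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

# Both the indexed term and the search term reduce to the same root,
# so a search for "template" matches a document containing "templates"
# and vice versa.
assert toy_stem("templates") == toy_stem("template")
```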

@bd808 is there any preferred method of solving this? Elasticsearch seems to have certain built-in stemmers, each having their advantages and disadvantages. I was wondering if you have any preference, or should we come up with something?

For the moment we really only need English stemming due to the dominance of English in our source material and our current lack of translation for the dynamic content of toolinfo records. The upstream recommended "english" stemming filter would be the simplest thing to start with, but we could also try to get some advice from @TJones on customizations that would be even more likely to be effective.

Trey has previously written about improvements to search for enwiki that he researched and tested. Perhaps he would be able to tell us roughly which parts of the text analyzer config used for English by CirrusSearch are worth attempting to apply to our index.

Because our production deployment is made to Elasticsearch clusters which also service CirrusSearch we should be able to use custom analyzers built for CirrusSearch. Use of custom things may need to be gated by a new configuration variable (a feature toggle) if adding the same Elasticsearch customization to our development and demo mode Elasticsearch Docker container are too complicated to support.
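A hedged sketch of what such a feature toggle might look like in Django-style settings; the environment variable and the "cirrus_english" analyzer name are invented for illustration:

```python
import os

# Hypothetical feature toggle: only reference CirrusSearch's custom
# analyzers when the deployment's Elasticsearch cluster actually
# provides them (i.e. in production, not the demo Docker container).
ES_USE_CIRRUS_ANALYZERS = (
    os.environ.get("TOOLHUB_ES_USE_CIRRUS_ANALYZERS", "false").lower()
    == "true"
)


def pick_analyzer():
    """Fall back to the built-in "english" analyzer when the custom
    CirrusSearch analyzers are unavailable (dev and demo mode)."""
    return "cirrus_english" if ES_USE_CIRRUS_ANALYZERS else "english"
```

The toggle keeps the index definition valid in every environment: the development container never references an analyzer it does not have.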


I think our first step should be to contact @TJones then and hear what he thinks.

The default English analyzer (the inner details of which are here) is probably a good place to start.

On English Wikipedia, Wiktionary, and other wikis, we have a lot more complex configuration because of the wild breadth of content covered. We need to handle unusual Unicode characters, extensive non-English text, hard-to-type diacritics, creative forms of vandalism, etc. It sounds like you probably don't need all that—and if it turns out that you do, you can iterate later after making a big easy improvement with the English analyzer ("type": "english").
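Per the upstream Elasticsearch documentation, the built-in english analyzer is equivalent to a custom analyzer assembled from a handful of standard pieces (written here as a plain Python dict for readability). Starting from this decomposition makes it easy to later drop or tweak a single filter, such as stop words, without losing the rest:

```python
# Decomposition of the built-in "english" analyzer, per the upstream
# Elasticsearch docs: standard tokenizer plus possessive stemming,
# lowercasing, English stop words, a (currently empty) keyword
# protection list, and the English stemmer.
english_rebuilt = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {"type": "stop", "stopwords": "_english_"},
                "english_keywords": {"type": "keyword_marker", "keywords": []},
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english",
                },
            },
            "analyzer": {
                "rebuilt_english": {
                    "tokenizer": "standard",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_keywords",
                        "english_stemmer",
                    ],
                },
            },
        },
    },
}
```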

If you want to just implement stemming—for example, if you are worried that stop words would filter out too much—you could do that, too. But unless you have tools with weird names like "The The" or "To Be Or Not To Be", it should be fine.

Also, stemming is better than wildcards, even in English. Some words are prefixes of each other, so if you just want wiki and wikis, wiki* will get too much (it'll match Wikipedia, Wikisource, WikiGnome, etc., etc.). That also means that template (with stemming) and template* might not get the same number of results.
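The prefix-overmatch point can be shown concretely. The stem groupings below are hand-written stand-ins for illustration (a real English stemmer produces similar groupings):

```python
# Wildcard prefix matching overmatches: "wiki*" catches every term
# that merely starts with "wiki", not just inflections of the word.
terms = ["wiki", "wikis", "Wikipedia", "Wikisource", "WikiGnome"]

prefix_hits = [t for t in terms if t.lower().startswith("wiki")]
# All five terms match the prefix, including unrelated compounds.

# A stemmer instead groups only true inflections. These stems are
# hand-written stand-ins, not real stemmer output:
stems = {
    "wiki": "wiki",
    "wikis": "wiki",
    "Wikipedia": "wikipedia",
    "Wikisource": "wikisource",
    "WikiGnome": "wikignome",
}
stem_hits = [t for t in terms if stems[t] == "wiki"]
# Only "wiki" and "wikis" share the stem "wiki".
```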

If you want simple "is it on?" acceptance criteria, searching for less common forms of words (like templated) and seeing the common forms highlighted in the results will let you know it's working. If you want more complex acceptance criteria, I can show anyone who is interested how I do "analyzer analysis" offline for new stemmers / analysis chains, and you can get a speaker to look at the kinds of words that will be indexed as "equivalent" by your analysis chain. At least English speakers are easy to come by. ;)

bd808 moved this task from Groomed/Ready to In Progress on the Toolhub board.

would love to take on this task @bd808

Do it! :)

The current Elasticsearch schema is generated by toolhub.apps.search.documents.ToolDocument using the toolhub.apps.search.documents.SearchDocument base class and the toolhub.apps.toolinfo.models.Tool model. I think a first attempt at adding the desired stemming support would start by adjusting the analyzer configured for "string" fields by SearchDocument.build_string_field(). The upstream documentation at https://elasticsearch-dsl.readthedocs.io/en/latest/index.html should be helpful.
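As a sketch of the direction (the actual signature and return type of SearchDocument.build_string_field() may differ; this shows the raw field mapping such a helper might emit rather than the elasticsearch-dsl objects it would really build):

```python
def build_string_field_mapping():
    """Illustrative mapping a stemming-aware build_string_field()
    might emit: the field keeps its default analyzer and gains an
    English-stemmed "en" sub-field. Names are hypothetical."""
    return {
        "type": "text",
        "fields": {
            "en": {"type": "text", "analyzer": "english"},
        },
    }
```

Queries would then search both the base field and the stemmed `field.en` sub-field, so exact matches and stemmed matches are both found.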

Change 745283 had a related patch set uploaded (by Raymond Ndibe; author: Raymond Ndibe):

[wikimedia/toolhub@main] Search: Implement text analysis to support stemming

https://gerrit.wikimedia.org/r/745283

Change 745283 merged by jenkins-bot:

[wikimedia/toolhub@main] Search: Implement text analysis to support stemming

https://gerrit.wikimedia.org/r/745283

To deploy this change to production we will need to update the document mapping for our index and then reindex the existing toolinfo documents. With a solution for T290357: Maintenance environment needed for running one-off commands in place, this could be done with poetry run ./manage.py search_index --rebuild from inside our Python container. Until that task is resolved, we will have to do some creative thinking to find another solution. I haven't really tried this yet, but I think it may be possible to use SSH tunneling to give a container running on a local laptop connectivity into the production data sources.

I didn't really see this before closing the task as Resolved. Perhaps I should re-open it until the change has been fully deployed?
I'm reading the description and discussions under T290357: Maintenance environment needed for running one-off commands to understand the context surrounding your comment.

bd808 moved this task from In Progress to Review on the Toolhub board.

Yes, let's keep this open for now so we don't lose track of the production deployment challenge. I'm moving the task over to the review column and also marking it as stalled for now.

Change 749220 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2021-12-20-122341-production

https://gerrit.wikimedia.org/r/749220

Change 749220 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2021-12-23-121200-production

https://gerrit.wikimedia.org/r/749220

The code for this is now in production, but rebuilding the existing index failed because of the TLS issue for the current maintenance environment documented at T290357#7599589.

Change 751809 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[wikimedia/toolhub@main] config: Add setting to disable Elasticsearch TLS cert verification

https://gerrit.wikimedia.org/r/751809

The prod index has been rebuilt via more ugly hacks (https://gerrit.wikimedia.org/r/751809 built locally), making this complete.

Change 751809 merged by jenkins-bot:

[wikimedia/toolhub@main] config: Add setting to disable Elasticsearch TLS cert verification

https://gerrit.wikimedia.org/r/751809

Change 770638 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638

Change 770638 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Bump container version to 2022-03-15-002555-production

https://gerrit.wikimedia.org/r/770638