Investigate applying aggressive_splitting everywhere, not just on English-language wikis
Closed, Resolved · Public · 5 Estimated Story Points

Description

English-language wikis use aggressive_splitting, which is a language analysis filter (a version of Elasticsearch's Word Delimiter Token Filter) that splits words on case changes (as was the original issue in this ticket) and in other circumstances. Investigate applying it everywhere, or at least for many more languages.


Original task title & description:

Cross-wiki search tokenizer is better than local search one

Searching for “FilesystemHierarchyStandard” in fr.wp gives me no local results but several results from en.wp, including [en:Filesystem Hierarchy Standard], even though the equivalent [fr:Filesystem Hierarchy Standard] exists.

I’ve run into this strange issue before: global search is sometimes better than local search, especially at tokenizing phrases (when I leave out spaces).

Maybe it’s because I used English phrasing on a French wiki?

Event Timeline

Restricted Application added a subscriber: Aklapper.
debt triaged this task as Medium priority. Mar 28 2019, 5:09 PM
debt moved this task from needs triage to Language Stuff on the Discovery-Search board.
debt added subscribers: TJones, debt.

hey @TJones can you take a look?

I'll take a look today. I'm pretty sure I know what's happening, but will double check.

As I thought, this is a customization that was added to the English language analysis years ago, before my time. It was originally limited to search on MediaWiki.org in 2013, and then expanded to all English-language wikis in 2014, but it was never expanded beyond that.

English has an additional filter called aggressive_splitting, which is a version of Elastic's Word Delimiter Token Filter. It breaks on case changes, among others, so FilesystemHierarchyStandard is split up on English Wikipedia, but not on French Wikipedia.
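To make the difference concrete, here is a minimal Python sketch of the behavior described above. It is not the Elasticsearch filter itself: the function names are hypothetical, the regex is ASCII-only, and the real Word Delimiter filter has many more options. It only shows why a query split on case changes can match the title terms, while a query that is merely lowercased cannot.

```python
import re

# Simplified stand-in for a word-delimiter style filter (an assumption, not
# the real implementation): split on non-alphanumerics, lower-to-upper case
# changes, and letter/digit boundaries. ASCII-only for brevity.
_PARTS = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_like_word_delimiter(token: str) -> list[str]:
    """Split a token into lowercased parts at case changes and delimiters."""
    return [part.lower() for part in _PARTS.findall(token)]


def plain_lowercase(token: str) -> list[str]:
    """Analysis without splitting: the token is only lowercased."""
    return [token.lower()]


title = "Filesystem Hierarchy Standard"
query = "FilesystemHierarchyStandard"

# Terms that would be indexed for the article title.
indexed_terms = {word.lower() for word in title.split()}

print(split_like_word_delimiter(query))                        # ['filesystem', 'hierarchy', 'standard']
print(set(split_like_word_delimiter(query)) <= indexed_terms)  # True: every part matches
print(set(plain_lowercase(query)) <= indexed_terms)            # False: one unsplit term, no match
```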

I have a long-term goal (not yet documented on Phab) of comparing the output of the language analyzers of at least the top 10 wikis and trying to harmonize the features that aren't language-specific. The Word Delimiter filter and another—word_break_helper, which splits on underscores, periods, and parens (see T170625)—are two examples that I know are not implemented consistently (or even correctly) across languages.
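As a rough illustration of what word_break_helper is described as doing, the sketch below maps underscores, periods, and parentheses to spaces before splitting on whitespace. The names `WORD_BREAK_MAP` and `word_break_helper_sketch` are hypothetical, and the whitespace split is a stand-in for the real tokenizer.

```python
# Hypothetical mapping modeled on the description above: underscores, periods,
# and parentheses are turned into spaces before tokenization.
WORD_BREAK_MAP = str.maketrans({"_": " ", ".": " ", "(": " ", ")": " "})


def word_break_helper_sketch(text: str) -> list[str]:
    """Replace the mapped characters with spaces, then split on whitespace."""
    return text.translate(WORD_BREAK_MAP).split()


print(word_break_helper_sketch("en.wikipedia.org"))   # ['en', 'wikipedia', 'org']
print(word_break_helper_sketch("Mercury_(planet)"))   # ['Mercury', 'planet']
```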

Part of the problem is that default language analyzers don't allow any customization until they are "unpacked" into their component parts. Unpacked analyzers get some automatic "upgrades", like ICU normalization instead of lowercasing, which can actually break some things (see T217602 and related tasks). And I've seen other unexpected dependencies; for example, Greek stemming doesn't work unless certain diacritics are removed, which is done by Greek-specific "lowercasing" but not by ICU normalization. The result is that we can't really just enable everything that seems useful everywhere at once without running the risk of breaking many things.

I'll convert this task into one parallel to the one for word_break_helper and create a parent task for language analysis harmonization and put this and word_break_helper under it, and document some of the other steps I know need to be taken.

Thanks @Pols12 for pointing this out!

TJones renamed this task from Cross-wiki search tokenizer is better than local search one to Investigate applying aggressive_splitting everywhere, not just on English-language wikis. Mar 28 2019, 7:04 PM
TJones updated the task description.

Another use case: searching for “download helper” on fr.wp doesn’t seem to return the Video DownloadHelper article, but it does return the Video DownloadHelper Wikidata item (at the bottom of the page).

Meanwhile, typing “video download helper” in the search field does display “Video DownloadHelper” as a spelling suggestion, but once the form is submitted, that suggestion is not offered again on the results page (spelling suggestions could/should improve results).

TJones raised the priority of this task from Medium to High. Aug 27 2020, 9:46 PM

More details to come, but aggressive_splitting (which is a word_delimiter filter underneath) is just too aggressive. It breaks things ICU normalization does better, and word_break_helper (T170625) does or can do the good things aggressive_splitting does. My new plan is to deactivate aggressive_splitting on English-language wikis and replace it with a split_camelCase filter that addresses the original issue of "FilesystemHierarchyStandard" in this ticket, and delegate the good things it does to word_break_helper.

My full writeup is on MediaWiki.org.

Summary:

  • aggressive_splitting is too aggressive, so it should be dropped in favor of word_break_helper, which does most of the good parts, and a new CamelCase-splitting filter, which is the other good thing aggressive_splitting does.
  • In this ticket, we'll just drop aggressive_splitting and add the CamelCase filter.
    • (And clean up some related zombie code and fix an old, related bug in italian_elision that has been ignoring uppercase elision forever.)

@dcausse is also assigned the 0.1-point task to worry a little about the efficiency of the pattern_replace filter that handles CamelCase. 😉
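For readers unfamiliar with the approach, here is a minimal Python sketch of regex-based CamelCase splitting in the spirit of a pattern_replace filter. The pattern and the name `split_camel_case` are assumptions for illustration only; the pattern used in the actual patch is not reproduced here, and this one is ASCII-only.

```python
import re

# Assumed pattern: a zero-width boundary between a lowercase ASCII letter and
# an uppercase ASCII letter. A production filter would need to be
# Unicode-aware (accented letters, combining marks, and so on).
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def split_camel_case(token: str) -> str:
    """Insert a space at each lower-to-upper case change."""
    return _CAMEL_BOUNDARY.sub(" ", token)


print(split_camel_case("FilesystemHierarchyStandard"))  # Filesystem Hierarchy Standard
print(split_camel_case("DownloadHelper"))               # Download Helper
print(split_camel_case("NASA"))                         # NASA (no lower-to-upper change)
```

Because the replacement runs on every token, the cost of the regex matters at index time, which is presumably the efficiency concern mentioned above.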

Patch coming soon-ish, after I get an evaluation dataset in place.

Change 933672 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Disable aggressive_splitting, add split_camelCase

https://gerrit.wikimedia.org/r/933672

Change 933672 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Disable aggressive_splitting, add split_camelCase

https://gerrit.wikimedia.org/r/933672

Is this supposed to be fully deployed? The original issue is still present: searching for “FilesystemHierarchyStandard” in fr.wp returns no local results but several results from en.wp, including [en:Filesystem Hierarchy Standard], even though the equivalent [fr:Filesystem Hierarchy Standard] exists.

@Pols12, the code is deployed, but not activated yet. In our workflow, we generally close tickets when the code is deployed, separate from when the feature is available.

Activating this change requires reindexing the wikis. Since this change is for all wikis, reindexing is going to be a fairly large undertaking; it takes a couple of weeks, unfortunately. I have several such projects that affect all or almost all wikis, and I decided to wait until I could merge the code for three of them before doing the reindexing. The last patch of the three just got merged, and it should be deployed soon; then we will reindex. French is early in alphabetical order, so it will get reindexed a little sooner than most others. :)

I'll open a separate ticket for the reindexing and link back to this one. I should have opened it sooner, but this is my first time doing several of these all-wiki changes together. Sorry for the confusion, but the change is coming relatively soon!