
Investigate applying aggressive_splitting everywhere, not just on English-language wikis
Open, High, Public


English-language wikis use aggressive_splitting, which is a language analysis filter (a version of Elasticsearch's Word Delimiter Token Filter) that splits words on case changes (as was the original issue in this ticket) and in other circumstances. Investigate applying it everywhere, or at least for many more languages.

Original task title & description:

Cross-wiki search tokenizer is better than local search one

Searching for “FilesystemHierarchyStandard” on fr.wp gives me no local results but several results from en.wp, including [en:Filesystem Hierarchy Standard], even though the equivalent [fr:Filesystem Hierarchy Standard] exists.

I’ve already encountered this strange issue: global search is sometimes better than local search, especially for phrase tokenization (when I omit spaces).

Maybe it’s because I’m using English phrasing on a French wiki?

Event Timeline

Restricted Application added a subscriber: Aklapper.
debt triaged this task as Medium priority. Mar 28 2019, 5:09 PM
debt moved this task from needs triage to Language Stuff on the Discovery-Search board.
debt added subscribers: TJones, debt.

hey @TJones can you take a look?

I'll take a look today. I'm pretty sure I know what's happening, but will double check.

As I thought, this is a customization that was added to the English language analysis years ago, before my time. It was originally limited in scope when it was introduced in 2013, then expanded to all English-language wikis in 2014, but it was never expanded beyond that.

English has an additional filter called aggressive_splitting, which is a version of Elasticsearch's Word Delimiter Token Filter. It breaks tokens on case changes, among other transitions, so FilesystemHierarchyStandard is split up on English Wikipedia but not on French Wikipedia.
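The case-change splitting can be approximated in a few lines of Python. This is only a sketch, not Elasticsearch's actual implementation; the real word delimiter filter handles many more cases (digit transitions, intra-word delimiters, and so on):

```python
import re

def aggressive_split(token):
    """Approximate the case-change behavior of a word delimiter filter:
    insert a break wherever a lowercase letter is followed by an
    uppercase letter, then split on the inserted breaks."""
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', token).split(' ')

print(aggressive_split("FilesystemHierarchyStandard"))
# → ['Filesystem', 'Hierarchy', 'Standard']
```

With tokens indexed this way, a query for “Filesystem Hierarchy Standard” can match the unspaced form, which is exactly what happens on en.wp but not on fr.wp.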

I have a long-term goal (not yet documented on Phab) of comparing the output of the language analyzers of at least the top 10 wikis and trying to harmonize the features that aren't language-specific. The Word Delimiter filter and another—word_break_helper, which splits on underscores, periods, and parens (see T170625)—are two examples that I know are not implemented consistently (or even correctly) across languages.
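For comparison, the effect of word_break_helper described above can be sketched as a character-level mapping applied before tokenization. This is a simplified illustration, not the production CirrusSearch implementation:

```python
def word_break_helper(text):
    """Sketch of a character filter that maps word-joining punctuation
    (underscores, periods, parentheses) to spaces so the tokenizer
    splits on them; whitespace split stands in for real tokenization."""
    for ch in "_.()":
        text = text.replace(ch, " ")
    return text.split()

print(word_break_helper("Filesystem_Hierarchy_Standard"))
# → ['Filesystem', 'Hierarchy', 'Standard']
```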

Part of the problem is that default language analyzers don't allow any customization until they are "unpacked" into their component parts. Unpacked analyzers get some automatic "upgrades", like ICU normalization instead of lowercasing, which can actually break some things (see T217602 and related tasks). And I've seen other unexpected dependencies, like Greek stemming doesn't work unless certain diacritics are removed, which is done by Greek-specific "lowercasing" but not by ICU normalization. The result is that we can't really just enable everything that seems useful everywhere at once without running the risk of breaking many things.
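To illustrate what "unpacking" involves, here is a hypothetical sketch (as a Python dict of Elasticsearch index settings) of the built-in french analyzer re-declared as a custom analyzer, which is what makes it possible to insert extra filters into the chain. The filter order and the elision article list are illustrative, not the production CirrusSearch configuration:

```python
# Hypothetical unpacking of the built-in "french" analyzer into a
# custom analyzer. Once unpacked, filters like aggressive_splitting
# could be slotted into the "filter" chain; note that icu_normalizer
# (from the ICU plugin) stands in for plain lowercasing, which is one
# of the automatic "upgrades" that can change behavior.
unpacked_french = {
    "analysis": {
        "filter": {
            "french_elision": {
                "type": "elision",
                "articles_case": True,
                "articles": ["l", "m", "t", "qu", "n", "s", "j"],  # illustrative subset
            },
            "french_stop": {"type": "stop", "stopwords": "_french_"},
            "french_stemmer": {"type": "stemmer", "language": "light_french"},
        },
        "analyzer": {
            "french_unpacked": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "french_elision",
                    "icu_normalizer",
                    "french_stop",
                    "french_stemmer",
                ],
            }
        },
    }
}
```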

I'll convert this task into one parallel to the one for word_break_helper and create a parent task for language analysis harmonization and put this and word_break_helper under it, and document some of the other steps I know need to be taken.

Thanks @Pols12 for pointing this out!

TJones renamed this task from Cross-wiki search tokenizer is better than local search one to Investigate applying aggressive_splitting everywhere, not just on English-language wikis. Mar 28 2019, 7:04 PM
TJones updated the task description.

Another use case: searching for “download helper” on fr.wp doesn’t seem to return the Video DownloadHelper article, but it does return the Video DownloadHelper Wikidata item (at the bottom of the page).

Meanwhile, typing “video download helper” in the search field does display “Video DownloadHelper” as a spelling suggestion, but once the form is submitted, that suggestion is not shown again on the results page (spelling suggestions could/should improve results).

TJones raised the priority of this task from Medium to High. Aug 27 2020, 9:46 PM