
Make list of languages where using stemmed analyzer for Wikibase is beneficial
Closed, Resolved (Public)

Description

After talking with @dcausse, we decided that setting up two custom analyzers (a stemmed and a non-stemmed one) for every language in descriptions is wasteful, since not all of them are useful for the Wikibase use case. We want to create stemmed analyzers only for the languages where they add value, and use just the plain (non-stemmed) analyzer for the others.

Here is the list of languages for which we have a "non-trivial" configuration for the stemming (text) analyzer:

ar
bg
ca
ckb
cs
da
de
el
en
en-ca
en-gb
es
eu
fa
fi
fr
ga
gl
hi
hu
hy
id
it
ja
ko
lt
lv
nb
nl
nn
pt
pt-br
ro
ru
simple
sv
th
tr

That includes having named analyzer types (e.g. 'bulgarian') and specialized filters or tokenizers.

Note that we are only concerned with whether our text analyzer adds value over the plain analyzer, since we're keeping the plain one anyway, and only in the context of common Wikibase/Wikidata usage on descriptions.
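
The selection rule described above can be sketched in Python. This is only an illustration: the names `STEMMED_LANGUAGES` and `analyzers_for` are hypothetical and are not taken from Wikibase or CirrusSearch; the language codes are the ones listed in this task.

```python
from typing import List

# Hypothetical sketch; these names are illustrative, not actual Wikibase code.
# Language codes copied from the list in this task description.
STEMMED_LANGUAGES = frozenset(
    "ar bg ca ckb cs da de el en en-ca en-gb es eu fa fi fr ga gl hi hu hy "
    "id it ja ko lt lv nb nl nn pt pt-br ro ru simple sv th tr".split()
)


def analyzers_for(lang: str) -> List[str]:
    """Return the analyzer kinds to set up for a description field.

    The plain analyzer is always kept; the text (stemmed) analyzer is
    added only for languages with a non-trivial configuration.
    """
    if lang in STEMMED_LANGUAGES:
        return ["plain", "text"]
    return ["plain"]
```

With this sketch, `analyzers_for("bg")` yields both analyzers, while a code that is not in the list, such as `"sw"`, gets only the plain one.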

Event Timeline

Smalyshev renamed this task from "Make list of languages where using custom analyzer for Wikibase is beneficial" to "Make list of languages where using stemmed analyzer for Wikibase is beneficial". Nov 9 2017, 9:23 PM
Smalyshev created this task.
Smalyshev removed a subscriber: gerritbot.
Smalyshev updated the task description.

@TJones this is something you may want to look at I think :)

@Smalyshev, I think this covers the info you need. Let me know if I can give more info or help with anything else. :)

TL;DR: yep, text is useful compared to plain for ar, bg, ca, ckb, cs, da, de, el, en, en-ca, en-gb, es, eu, fa, fi, fr, ga, gl, hi, hu, hy, id, it, ja, ko, lt, lv, nb, nl, nn, pt, pt-br, ro, ru, simple, sv, th, and tr.

Also, if the standard plugins are installed, include pl, zh, he, and uk.

It is probably also worth noting that bo, dz, gan, ja, km, lo, my, th, wuu, zh, zh-classical/lzh, zh-yue/yue, bug, cdo, cr, hak, jv, and zh-min-nan probably do better with the icu_tokenizer than with the standard tokenizer.

For everything else, keep in mind that the difference between text and plain is that plain has word_break_helper enabled.

Details:

The default plain analyzer consists of the standard tokenizer, the ICU Normalizer (which does some folding, but much less than full ICU Folding), and the "word break helper" (which breaks words on periods, underscores, and parens). So "default" in the tables below is the same as "standard + icu_normalizer + word_break_helper".
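
As a rough illustration of that composition, here is how the default plain analyzer might look as Elasticsearch analysis settings, written as a Python dict. This is a sketch under assumptions, not the actual CirrusSearch configuration: it assumes word_break_helper is a "mapping" char filter that turns each listed character into a space, and that icu_normalizer comes from the ICU analysis plugin. The helper `apply_word_break_helper` is hypothetical and only emulates the character mapping locally.

```python
# Characters the word break helper splits on: period, underscore, parens.
WORD_BREAK_CHARS = "._()"

# Assumed shape of the settings; the real configuration may differ.
PLAIN_ANALYSIS = {
    "char_filter": {
        "word_break_helper": {
            "type": "mapping",
            # Each entry maps one character to a space (\u0020).
            "mappings": [f"{ch}=>\\u0020" for ch in WORD_BREAK_CHARS],
        }
    },
    "analyzer": {
        "plain": {
            "type": "custom",
            "char_filter": ["word_break_helper"],
            "tokenizer": "standard",
            "filter": ["icu_normalizer"],
        }
    },
}


def apply_word_break_helper(text: str) -> str:
    """Emulate the char filter: replace each break character with a space."""
    return text.translate({ord(ch): " " for ch in WORD_BREAK_CHARS})
```

For example, `apply_word_break_helper("en.wikipedia_org (site)").split()` separates the input into `en`, `wikipedia`, `org`, and `site`, which is the effect the plain analyzer gets before tokenization.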

All of the analyzers except CJK, Persian, and Thai have stemmers, which I assume do something useful.

Persian and Thai have stop words (as do most of the others), which I also assume do something useful.

CJK has the CJK bigram filter (which gives overlapping bigrams as tokens) and, oddly, English stop words; that seems useful.
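
To make "overlapping bigrams" concrete, here is a small Python sketch. It is not the Elasticsearch filter itself, which works on the tokenizer's output and also looks at script types; the function name `overlapping_bigrams` is hypothetical, and returning a lone character as-is is an assumption made for this example.

```python
from typing import List


def overlapping_bigrams(run: str) -> List[str]:
    """Return overlapping two-character tokens for a run of CJK characters."""
    if len(run) < 2:
        # Assumption for this sketch: a lone character is kept as a unigram.
        return [run] if run else []
    # Slide a window of width 2 over the run, one character at a time.
    return [run[i:i + 2] for i in range(len(run) - 1)]
```

For the four-character run 東京都庁 this gives 東京, 京都, and 都庁: every adjacent pair becomes a token, so a query for 京都 can match inside the longer string.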

Also, if this is in an environment where the usual plugins are installed, you also have custom analyzers for pl, zh, he, and uk, so I've included them below in their own little sub-table.

There is also a list of languages that have the icu_tokenizer enabled rather than the standard tokenizer: bo, dz, gan, ja, km, lo, my, th, wuu, zh, zh-classical/lzh, zh-yue/yue, bug, cdo, cr, hak, jv, and zh-min-nan. That might be worth having as another config option for those languages.
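
If this were exposed as a config option, the lookup could be as simple as the following Python sketch. The names `ICU_TOKENIZER_LANGUAGES` and `tokenizer_for` are hypothetical; the codes are the ones listed above, with both spellings of the aliased codes included.

```python
# Hypothetical sketch; codes taken from the list in this comment.
ICU_TOKENIZER_LANGUAGES = frozenset({
    "bo", "dz", "gan", "ja", "km", "lo", "my", "th", "wuu", "zh",
    "zh-classical", "lzh", "zh-yue", "yue",
    "bug", "cdo", "cr", "hak", "jv", "zh-min-nan",
})


def tokenizer_for(lang: str) -> str:
    """Pick the tokenizer name for a language code."""
    if lang in ICU_TOKENIZER_LANGUAGES:
        return "icu_tokenizer"
    return "standard"
```

Here `tokenizer_for("km")` returns `"icu_tokenizer"`, while `tokenizer_for("de")` falls back to `"standard"`.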

For all of the languages without a custom analyzer (including the ones using the icu_tokenizer), there is always a difference between text and plain: plain includes word_break_helper. Most of the language-specific analyzers do not use word_break_helper (but a few do).

Default Elastic analyzers:

| Code | Language | text | plain |
| --- | --- | --- | --- |
| ar | Arabic | arabic | default |
| bg | Bulgarian | bulgarian | default |
| ca | Catalan | catalan | default |
| ckb | Sorani | sorani | default |
| cs | Czech | czech | default |
| da | Danish | danish | default |
| de | German | german | default |
| el | Greek | greek | standard + icu_normalizer + icu_folding + word_break_helper |
| en | English | english | standard + icu_normalizer + icu_folding + word_break_helper |
| en-ca | Canadian English | english | standard + icu_normalizer + icu_folding + word_break_helper |
| en-gb | British English | english | standard + icu_normalizer + icu_folding + word_break_helper |
| es | Spanish | spanish | default |
| eu | Basque | basque | default |
| fa | Persian | persian | default |
| fi | Finnish | finnish | default |
| fr | French | french | standard + icu_normalizer + icu_folding + word_break_helper |
| ga | Irish | irish | default |
| gl | Galician | galician | default |
| hi | Hindi | hindi | default |
| hu | Hungarian | hungarian | default |
| hy | Armenian | armenian | default |
| id | Indonesian | indonesian | default |
| it | Italian | italian | standard + icu_normalizer + ascii_folding + dedupe_asciifolding |
| ja | Japanese | cjk | icu_tokenizer + icu_normalizer + word_break_helper |
| ko | Korean | cjk | default |
| lt | Lithuanian | lithuanian | default |
| lv | Latvian | latvian | default |
| nb | Norwegian Bokmål | norwegian | default |
| nl | Dutch | dutch | default |
| nn | Norwegian Nynorsk | norwegian | default |
| pt | Portuguese | portuguese | default |
| pt-br | Brazilian Portuguese | brazilian | default |
| ro | Romanian | romanian | default |
| ru | Russian | russian | standard + icu_normalizer + russian_char_filter + word_break_helper |
| simple | Simple English | english | standard + icu_normalizer + icu_folding + word_break_helper |
| sv | Swedish | swedish | standard + icu_normalizer + icu_folding + word_break_helper |
| th | Thai | thai | default |
| tr | Turkish | turkish | default |

Analyzers with usual plugins:

| Code | Language | text | plain |
| --- | --- | --- | --- |
| pl | Polish | polish | default |
| zh | Chinese | chinese | icu_tokenizer + smartcn_stop + icu_normalizer + word_break_helper |
| he | Hebrew | hebrew | standard + icu_normalizer + icu_folding + word_break_helper |
| uk | Ukrainian | ukrainian | default |

ICU Tokenization languages:

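| Code | Language |
| --- | --- |
| bo | Tibetan |
| dz | Dzongkha |
| gan | Gan |
| ja | Japanese |
| km | Khmer |
| lo | Lao |
| my | Burmese |
| th | Thai |
| wuu | Wu |
| zh | Chinese |
| zh-classical | Classical Chinese |
| zh-yue | Cantonese |
| bug | Buginese |
| cdo | Min Dong |
| cr | Cree |
| hak | Hakka |
| jv | Javanese |
| zh-min-nan | Min Nan |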
Smalyshev claimed this task.

Thank you @TJones I think this is exactly what I needed.