
Make list of languages where using stemmed analyzer for Wikibase is beneficial
Closed, Resolved (Public)

Description

After talking with @dcausse, we decided that having two custom analyzers set up (a stemmed and a non-stemmed one) for every language in descriptions is wasteful, since not all of them are useful for the Wikibase use case. We want to build stemmed analyzers only for the languages where they help, and use the plain (non-stemmed) analyzer for the others.

Here is the list of languages for which we have "non-trivial" configuration for stemming (text) analyzer:

ar
bg
ca
ckb
cs
da
de
el
en
en-ca
en-gb
es
eu
fa
fi
fr
ga
gl
hi
hu
hy
id
it
ja
ko
lt
lv
nb
nl
nn
pt
pt-br
ro
ru
simple
sv
th
tr

That includes having named analyzer types (e.g. 'bulgarian') and specialized filters or tokenizers.

Note that we are only concerned with whether our text analyzer adds value compared to the plain analyzer, since we're keeping the plain one anyway, and only in the context of common Wikibase/Wikidata usage on descriptions.
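As a hypothetical sketch of how such a list could be consumed (this is not actual CirrusSearch/Wikibase code; `STEMMED_LANGS` and `analyzers_for` are made-up names, and the whitelist contents would be the outcome of this task):

```python
# Hypothetical whitelist; the real contents would come from this task.
STEMMED_LANGS = {"en", "de", "fr", "ru"}  # ...and the rest of the list

def analyzers_for(lang):
    """Build the plain analyzer for every language, and add the stemmed
    text analyzer only where it is known to be useful."""
    analyzers = ["plain"]
    if lang in STEMMED_LANGS:
        analyzers.append("text")
    return analyzers

# analyzers_for("de") -> ["plain", "text"]
# analyzers_for("xx") -> ["plain"]
```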


Event Timeline

Smalyshev renamed this task from "Make list of languages where using custom analyzer for Wikibase is beneficial" to "Make list of languages where using stemmed analyzer for Wikibase is beneficial". Nov 9 2017, 9:23 PM
Smalyshev created this task.
Smalyshev removed a subscriber: gerritbot.
Smalyshev updated the task description. (Show Details)

@TJones this is something you may want to look at I think :)

@Smalyshev, I think this covers the info you need. Let me know if I can give more info or help with anything else. :)

TL;DR: yep, text is useful compared to plain for ar, bg, ca, ckb, cs, da, de, el, en, en-ca, en-gb, es, eu, fa, fi, fr, ga, gl, hi, hu, hy, id, it, ja, ko, lt, lv, nb, nl, nn, pt, pt-br, ro, ru, simple, sv, th, and tr.

Also, if the standard plugins are installed, include pl, zh, he, and uk.

Also worth noting: bo, dz, gan, ja, km, lo, my, th, wuu, zh, zh-classical/lzh, zh-yue/yue, bug, cdo, cr, hak, jv, and zh-min-nan probably do better with the icu_tokenizer than with the standard tokenizer.

For everything else, keep in mind that the difference between text and plain is that plain has word_break_helper enabled.

Details:

The default plain analyzer is the standard tokenizer, the ICU Normalizer (which does some folding but much less than full ICU Folding) and the "word break helper" (which breaks words on periods, underscores, and parens). So default below is the same as "standard + icu_normalizer + word_break_helper".
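A rough Python illustration of that plain chain (the regex and the normalization here are simplifications I'm assuming for the sketch; the real char filters are defined in the CirrusSearch analysis config):

```python
import re
import unicodedata

def word_break_helper(text):
    # Approximates the word_break_helper char filter: turn periods,
    # underscores, and parentheses into spaces so they break words.
    return re.sub(r"[._()]", " ", text)

def icu_normalize_ish(text):
    # Very rough stand-in for the ICU Normalizer: Unicode NFKC
    # normalization plus lowercasing. Real ICU normalization does more,
    # but still far less than full ICU Folding.
    return unicodedata.normalize("NFKC", text).lower()

def plain_analyze(text):
    # standard-ish tokenization: split on whitespace after filtering.
    return icu_normalize_ish(word_break_helper(text)).split()

# plain_analyze("Some_file.txt (Draft)") -> ["some", "file", "txt", "draft"]
```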

All of the analyzers except CJK, Persian, and Thai have stemmers, which I assume do something useful.

Persian and Thai have stop words (as do most of the others), which I also assume do something useful.
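To make the value of stemming and stop words concrete, here is a toy sketch; the suffix list and stop list are invented for illustration and are nothing like the real Snowball stemmers and stop lists Elastic uses:

```python
STOP_WORDS = {"the", "of", "a", "is"}        # illustrative stop list
SUFFIXES = ("ing", "ers", "er", "ed", "s")   # illustrative suffix list

def toy_stem(token):
    # Strip the first matching suffix; real stemmers are far more
    # careful than this naive chop.
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def text_analyze(tokens):
    # Stop-word removal plus stemming on already-tokenized input, so
    # "painting" and "painters" match each other where plain would not.
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

# text_analyze(["painting", "of", "the", "painters"]) -> ["paint", "paint"]
```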

CJK has the CJK bigram filter (which gives overlapping bigrams as tokens) and, oddly, English stop words; that seems useful.
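The bigram part can be sketched as overlapping character 2-grams over a run of CJK text (a simplification of the real cjk_bigram token filter, which also handles mixed scripts and non-CJK runs):

```python
def cjk_bigrams(text):
    # Emit overlapping character bigrams, the core idea of the CJK
    # analyzer's bigram filter; a lone character is emitted as-is.
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

# cjk_bigrams("日本語") -> ["日本", "本語"]
```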

Also, if this is in an environment where the usual plugins are installed, you also have custom analyzers for pl, zh, he, and uk, so I've included them below in their own little sub-table.

There is also a list of languages that have the icu_tokenizer enabled rather than the standard tokenizer: bo, dz, gan, ja, km, lo, my, th, wuu, zh, zh-classical/lzh, zh-yue/yue, bug, cdo, cr, hak, jv, and zh-min-nan. That might be worth having as another config option for those languages.
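To see why, here is a sketch of the failure mode with a standard-style tokenizer, which only breaks on spaces and punctuation and so cannot segment scripts written without spaces; the real icu_tokenizer uses dictionary-based segmentation, which this sketch does not reproduce:

```python
import re

def standard_like_tokenize(text):
    # Approximation of the standard tokenizer: break on whitespace and
    # punctuation only.
    return re.findall(r"\w+", text)

# Spaced scripts are fine, but an unspaced Thai run stays one token:
# standard_like_tokenize("hello, world") -> ["hello", "world"]
# standard_like_tokenize("ภาษาไทย") -> ["ภาษาไทย"]
```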

For all of the languages without a custom analyzer (including the ones using the icu_tokenizer), there is always a difference between text and plain: plain includes word_break_helper. Most of the language-specific analyzers do not use word_break_helper (but a few do).

Default Elastic analyzers:

Code | Language | text | plain
ar | Arabic | arabic | default
bg | Bulgarian | bulgarian | default
ca | Catalan | catalan | default
ckb | Sorani | sorani | default
cs | Czech | czech | default
da | Danish | danish | default
de | German | german | default
el | Greek | greek | standard + icu_normalizer + icu_folding + word_break_helper
en | English | english | standard + icu_normalizer + icu_folding + word_break_helper
en-ca | Canadian English | english | standard + icu_normalizer + icu_folding + word_break_helper
en-gb | British English | english | standard + icu_normalizer + icu_folding + word_break_helper
es | Spanish | spanish | default
eu | Basque | basque | default
fa | Persian | persian | default
fi | Finnish | finnish | default
fr | French | french | standard + icu_normalizer + icu_folding + word_break_helper
ga | Irish | irish | default
gl | Galician | galician | default
hi | Hindi | hindi | default
hu | Hungarian | hungarian | default
hy | Armenian | armenian | default
id | Indonesian | indonesian | default
it | Italian | italian | standard + icu_normalizer + ascii_folding + dedupe_asciifolding
ja | Japanese | cjk | icu_tokenizer + icu_normalizer + word_break_helper
ko | Korean | cjk | default
lt | Lithuanian | lithuanian | default
lv | Latvian | latvian | default
nb | Norwegian Bokmål | norwegian | default
nl | Dutch | dutch | default
nn | Norwegian Nynorsk | norwegian | default
pt | Portuguese | portuguese | default
pt-br | Brazilian Portuguese | brazilian | default
ro | Romanian | romanian | default
ru | Russian | russian | standard + icu_normalizer + russian_char_filter + word_break_helper
simple | Simple English | english | standard + icu_normalizer + icu_folding + word_break_helper
sv | Swedish | swedish | standard + icu_normalizer + icu_folding + word_break_helper
th | Thai | thai | default
tr | Turkish | turkish | default

Analyzers with usual plugins:

Code | Language | text | plain
pl | Polish | polish | default
zh | Chinese | chinese | icu_tokenizer + smartcn_stop + icu_normalizer + word_break_helper
he | Hebrew | hebrew | standard + icu_normalizer + icu_folding + word_break_helper
uk | Ukrainian | ukrainian | default

ICU Tokenization languages:

Code | Language
bo | Tibetan
dz | Dzongkha
gan | Gan
ja | Japanese
km | Khmer
lo | Lao
my | Burmese
th | Thai
wuu | Wu
zh | Chinese
zh-classical | Classical Chinese
zh-yue | Cantonese
bug | Buginese
cdo | Min Dong
cr | Cree
hak | Hakka
jv | Javanese
zh-min-nan | Min Nan

Smalyshev claimed this task.

Thank you @TJones I think this is exactly what I needed.