
Consider asking communities which languages are analysed the poorest in search
Closed, ResolvedPublic

Description

One of the Search Team's Q3 goals is to investigate new language analysers to improve search. For more information on what that means, review the recent mailing list post.

To figure out which analysers to start with, we're using our intuition and data from previous tests. For example, we know from our recent BM25 tests on that wiki that the Chinese language analyser performs very poorly, so we can research a new analyser for that community. We could also prioritise other analysers, but it's hard for us to know where to start; the languages spoken by the Search Team are fairly limited.

It would be great if we could do some outreach to figure out which communities could benefit from having a new language analyser. We'll have to craft the questions we ask them carefully; feedback such as "X query gives Y bad results" is not helpful in this case, since we're talking specifically about bad language analysis.

Event Timeline

Deskana renamed this task from Ask communities which languages are analysed the poorest in search to Consider ask communities which languages are analysed the poorest in search.Jan 18 2017, 9:29 PM

I changed the title to "consider"; we may consider this and decide not to do it, for the reasons I listed above. :-)

Here's a thing you could try measuring without asking the community: check in which languages the search box is used relatively less, as a percentage of searches out of pageviews. For example, if Syldavian and Bordurian both have ~100,000 views per week, but Syldavian has 20,000 searches per week and Bordurian has 5,000, then maybe Bordurian is supported less well (the numbers and the language names are made up just for the sake of the example).
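The metric above can be sketched in a few lines. This is illustrative only, using the made-up languages and figures from the example; real pageview and search counts would come from analytics data.

```python
# Hypothetical weekly numbers from the example above (Syldavian and
# Bordurian are invented languages; the figures are for illustration only).
weekly_stats = {
    "syldavian": {"pageviews": 100_000, "searches": 20_000},
    "bordurian": {"pageviews": 100_000, "searches": 5_000},
}

def search_rate(stats):
    """Searches per pageview; a relatively low rate might hint at poor search support."""
    return stats["searches"] / stats["pageviews"]

# Rank languages from lowest search rate (most suspect) to highest.
for lang, stats in sorted(weekly_stats.items(), key=lambda kv: search_rate(kv[1])):
    print(f"{lang}: {search_rate(stats):.2f} searches per pageview")
```

A low rate alone doesn't prove the analyser is bad, of course; it's a signal for where to look first.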

A probably more complicated thing to measure is in which languages Google is used more for in-site searches.

Deskana renamed this task from Consider ask communities which languages are analysed the poorest in search to Consider asking communities which languages are analysed the poorest in search.Jan 20 2017, 11:30 PM

While we (Discovery) look into analytical methods to measure poor results, I put together a proposal for a message to communities.

https://meta.wikimedia.org/wiki/User:CKoerner_(WMF)/Language_analyzer_question

Translators-l, wikitech-ambassadors, mediawiki-i18n, languages and discovery mailing lists would be candidates for disseminating the request.

Part of the problem is that full-blown analyzers are a lot of work. Elastic makes a fair number available, and we use all of them. They don't all map to language names—there's both Portuguese and Brazilian, and CJK does something generic but reasonable with bigrams for Chinese, Japanese, and Korean. Here's the list, for reference:

Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Thai, Turkish.

Most of these are enabled based on language settings. Additional analyzers are available as plugins, some of which ("core plugins") are supported by Elastic, and some of which are linked to by Elastic on their Analysis Plugins page.
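As a rough sketch of what "enabled based on language settings" amounts to, a built-in Elastic language analyzer is selected per field in the index mappings. This is a minimal illustration, not the actual CirrusSearch configuration, which is more involved.

```python
# Illustrative index mapping selecting one of Elastic's built-in
# language analyzers ("swedish" here) for a text field.
# Not the real CirrusSearch settings; just the general shape.
import json

mappings = {
    "properties": {
        "text": {
            "type": "text",
            "analyzer": "swedish",  # built-in language analyzer, applied at index time
        }
    }
}

# The same JSON would be sent to Elastic when creating the index.
print(json.dumps(mappings, indent=2))
```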

Among the core plugins:

  • We didn't previously use SmartCN for Chinese because it only works on Simplified characters, but we added another plugin, STConvert, that converts Traditional characters to Simplified, and lets SmartCN do a decent job. That's been deployed.
  • We've also deployed Stempel for Polish and the Ukrainian plugin.
  • We use the ICU plugin (which is not language specific) when it is available.
  • I'm working on configuring and enabling Kuromoji for Japanese now.
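The STConvert-plus-SmartCN approach described above amounts to chaining a character filter in front of the tokenizer. The sketch below shows the general shape of such a custom analyzer; the component names (`stconvert`, `smartcn_tokenizer`) follow the plugins' documentation, but the deployed CirrusSearch settings may differ.

```python
# Illustrative analysis settings: convert Traditional Chinese characters
# to Simplified (STConvert char filter) before SmartCN segments the text.
# An assumption-laden sketch, not the production configuration.
analysis_settings = {
    "analysis": {
        "char_filter": {
            "t2s": {
                "type": "stconvert",      # from the STConvert plugin
                "convert_type": "t2s",    # Traditional -> Simplified
            }
        },
        "analyzer": {
            "chinese_text": {
                "type": "custom",
                "char_filter": ["t2s"],            # runs first
                "tokenizer": "smartcn_tokenizer",  # then SmartCN segmentation
            }
        },
    }
}
```

The key point is ordering: SmartCN only handles Simplified characters, so the conversion must happen before tokenization.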

Among the language-related community-contributed plugins:

  • Hebrew is done with configuration, but is a bit more complicated to deploy because it uses external resources; we're making progress.
  • IK and Mmseg are for Chinese, which we have covered, and I don't think we need Pinyin/Chinese conversion.
  • I haven't looked at the Russian/English one, but I don't think we need it for Russian and English.

That leaves Vietnamese—which we could definitely add to the list to work on in Q1 of 2017/2018.

While researching various other options, I've noted other analyzers—which have generally been few and far between—in T154511: [Tracking] Research, test, and deploy new language analyzers. There's Vietnamese, as noted above, and another approach to Thai, which doesn't actually look very promising, but which I would be glad to look into.

I didn't search hard for other analyzers, but I did keep running into the same ones we have ended up using, over and over. Happy to look more and report back.

Anyway, the point would be that building a language analyzer is a very large effort, and requires expertise (or a whole lot of data of the right kind) in the language you are interested in. I wouldn't want to promise significant improvement for a language where there is no analyzer already available—so asking for open-ended suggestions is probably going to turn into a lot of disappointment.

However—there's always a however—I think generally disabling fallback languages for language analyzers (T147959) would make analysis for some languages suck less.

Also, if there are specific problems of particular characters being treated badly—such as Swedish å, ä, and ö (T160562) or Russian ё/е (T124592) or Russian stress accents (T102298)—it'd be great to know about those problems so we could address them.
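Character-level problems like the Russian ё/е case can often be handled with a simple mapping character filter rather than a new analyzer. The sketch below is illustrative; the actual fix tracked in T124592 may take a different form.

```python
# A possible Elastic "mapping" character filter folding Russian ё into е
# at analysis time, so both spellings tokenize identically.
# Illustrative only; not necessarily the deployed solution.
char_filter = {
    "russian_yo_fold": {
        "type": "mapping",
        "mappings": ["ё=>е", "Ё=>Е"],
    }
}

def fold_yo(text):
    """Python stand-in for what the char filter above would do inside Elastic."""
    return text.replace("ё", "е").replace("Ё", "Е")

print(fold_yo("ёлка"))  # folded to the е spelling, so both forms match
```

Accent marks (as in T102298) could be stripped the same way, or via ICU normalization where the ICU plugin is available.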

Qgil lowered the priority of this task from Medium to Lowest.
Qgil moved this task from Backlog to Team radar on the Community-Relations-Support board.
Qgil added subscribers: CKoerner_WMF, Qgil.

As far as I can see, as of today Community-Relations-Support cannot act on this task. I'm triaging it accordingly. If you want to work on this again, please let us know. If nobody is planning to work on this anytime soon, then maybe we can decline it (and reopen in the future if needed)?

debt claimed this task.
debt subscribed.

@Qgil - I don't think there is anything left for the Community-Relations-Support to do on this particular ticket, and it can be closed. We've got this ticket going (T147959#3636082) where we're moving forward with community consultation with help from @CKoerner_WMF.