Lack of diacritic folding in e.g. Ancient Greek
Closed, Resolved · Public

Assigned To

Authored By

	ObsequiousNewt
	Apr 13 2016, 10:01 PM

Description

On multiple sites including at least en.wiktionary.org and en.wikipedia.org, it is impossible to search for a word which has diacritics by entering the word without diacritics, and vice versa, e.g. the searches

https://en.wiktionary.org/w/index.php?search=ἔθανον
https://en.wiktionary.org/w/index.php?search=ἐθάνετον

will fail, while

https://en.wiktionary.org/w/index.php?search=ἔθᾰνον
https://en.wiktionary.org/w/index.php?search=ἐθᾰ́νετον

will succeed. However, the latter case requires not only knowledge of the length of the vowel (which the user may very well be trying to find) but also the ability to type it on the keyboard.

(The corresponding problem also exists—e.g. one cannot find κλίνω by searching for κλῑ́νω. Ideally the page κλίνω should include length information, but not all entries have yet been thusly updated.)

Note that this problem exists both for precomposed diacritics (ἔθᾰνον) and also combining diacritics (ἐθᾰ́νετον—ᾰ́ = U+1FB0 U+0301.)
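As a rough illustration of what diacritic folding would have to do for these two cases, the sketch below decomposes the text (NFD) and drops nonspacing marks using Python's standard `unicodedata` module. The helper name `strip_marks` is made up for this example, and this is only an approximation of folding, not what CirrusSearch or ICU actually runs.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, drop nonspacing marks (category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


# Precomposed example: alpha with vrachy is the single code point U+1FB0.
precomposed = "\u1f14\u03b8\u1fb0\u03bd\u03bf\u03bd"
# Combining example: U+1FB0 followed by U+0301 COMBINING ACUTE ACCENT.
combining = "\u1f10\u03b8\u1fb0\u0301\u03bd\u03b5\u03c4\u03bf\u03bd"

print(strip_marks(precomposed))
print(strip_marks(combining))
```

Both inputs reduce to unaccented Greek letters, which is the form a user without a polytonic keyboard would type.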

Additionally, typing e.g. αθανατος into the search bar does not bring up ἀθάνατος. From the user's point of view this is a regression: previously (and I don't know whether this is the fault of Cirrus or not) it was possible to do this. It is a desirable, if not necessary, ability: diacritics are often difficult to read (try distinguishing ἀ from ἁ in some fonts/handwriting), and if you study original inscriptions or papyri, no diacritics will be given at all.

This problem is not limited to the Greek script. Bugs have been filed to this effect for Cyrillic (T124592 and T102298), Hebrew (T71361), Devanagari and Arabic (T29055), as well as Latin (T123179 and T104814).

(Note that in the case of Latin the problem is only partial: searching for any one of 'Bronte' (no diacritic), 'Brónte' (precomposed character), or 'Brónte' (combining diacritic) will yield the expected pages Brontë and Bronte both in the suggestion box and the results page, however, searching for 'Bro̍nte' (combining diacritic) will yield Brontë and Bronte in the suggestion box but not in the results page. Additionally, the list of pages in the suggestion box is much shorter and only apparently includes pages whose titles match exactly up to the point of the diacritic—so while Brónte matches pages beginning with bry-, Bro̍nte doesn't. The reason for this discrepancy between Brónte and Bro̍nte is apparently that o̍ does not have a canonically equivalent precomposed form. This is supported by the fact that a search for Bront́e fails just as Bro̍nte does.)
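The hypothesis about canonically equivalent precomposed forms can be checked with Python's `unicodedata` module. The sketch assumes the mark in the failing example is U+030D COMBINING VERTICAL LINE ABOVE; the helper name `has_precomposed_form` is made up for illustration.

```python
import unicodedata


def has_precomposed_form(base: str, mark: str) -> bool:
    """True if NFC composes base + combining mark into one code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1


ACUTE = "\u0301"          # COMBINING ACUTE ACCENT
VERTICAL_LINE = "\u030d"  # COMBINING VERTICAL LINE ABOVE (assumed mark)

print(has_precomposed_form("o", ACUTE))          # composes to U+00F3
print(has_precomposed_form("o", VERTICAL_LINE))  # stays two code points
print(has_precomposed_form("t", ACUTE))          # stays two code points
```

A pipeline that only normalizes to NFC and then folds precomposed characters would therefore handle the first case and miss the other two, which matches the behaviour described above.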

[EDIT: I changed the example above, κλī́νω, to κλῑ́νω. The first one uses a Latin i with a macron (ī), while the second one uses a Greek iota with a macron (ῑ)—the visual difference is in the tail at the bottom of the character. Both have an additional combining acute accent attached. I agree that distinguishing ἀ and ἁ in handwriting and some fonts is almost impossible, but it seems fair to assume the ı/ι-shaped thing in the middle of a Greek word is an iota, right? —TJones]
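Look-alike characters of this kind are easy to tell apart by listing code points and Unicode names. A minimal sketch follows (the helper name `describe` is made up for this example):

```python
import unicodedata


def describe(text: str) -> list:
    """List each code point in `text` with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


latin = "\u012b\u0301"  # Latin i with macron + combining acute
greek = "\u1fd1\u0301"  # Greek iota with macron + combining acute

for line in describe(latin) + describe(greek):
    print(line)
```

The two strings differ in their base letter (Latin versus Greek), so removing the accents alone would still not make them match.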

Related Objects

Status	Assigned	Task
Resolved	debt	T132637 Lack of diacritic folding in e.g. Ancient Greek
Resolved	dcausse	T137830 Use the icu_folding filter if available instead of asciifolding
Resolved	dcausse	T138749 Add a generic preserve original token filter to the extra plugin

Event Timeline

ObsequiousNewt created this task. Apr 13 2016, 10:01 PM

Restricted Application added projects: Discovery-ARCHIVED, Discovery-Search. Apr 13 2016, 10:01 PM

Restricted Application added a subscriber: Aklapper.

ObsequiousNewt updated the task description. Apr 13 2016, 10:03 PM

@ObsequiousNewt If I'm understanding this task correctly, it only really affects searches for Ancient Greek words on the English Wiktionary. I think we fixed this issue for the Greek Wikipedia a while ago. Is that correct?

@Deskana It affects other scripts than Greek (e.g. searching for емкостях will not yield ёмкостях, etc.) and more sites than en.wikt (de.wikt and el.wikt, as well as en.wp and de.wp, and almost certainly others). It does not seem to affect el.wp—although I can't tell whether the problem has also been solved there in other scripts, or whether just a partial workaround has been implemented.

dcausse added a subtask: T137830: Use the icu_folding filter if available instead of asciifolding. Jun 14 2016, 6:07 PM

• Deskana closed subtask T137830: Use the icu_folding filter if available instead of asciifolding as Resolved. Dec 9 2016, 3:24 PM

@Deskana What does the recently resolved subtask mean for this issue as a whole?

@Wikitiki89 it means that we added all the necessary code to enable ICU folding (required for all non-Latin alphabets).

At first we thought we could blindly enable ICU folding everywhere, but we realized that it may cause trouble in languages where diacritics are used to denote very different letters, e.g. o and ö in Finnish or й in Russian.

So we decided to do it on a case-by-case basis: ICU folding will be enabled on the next reindex batch for English/French and Greek wikis. If other languages want to activate it, a request must be made with the list of characters to exclude (if any) and we can enable it.

Sorry if this is not very clear...
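To make the "list of characters to exclude" idea concrete, here is a minimal Python sketch of folding with a per-language keep list. The function `fold` and the two exception sets are hypothetical and for illustration only; the real exception lists would come from the communities, and the production filter is ICU folding rather than this approximation.

```python
import unicodedata


def fold(text: str, keep: frozenset = frozenset()) -> str:
    """Strip nonspacing marks, except on characters listed in `keep`."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)  # protected letter: leave untouched
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if unicodedata.category(c) != "Mn"))
    return "".join(out)


# Hypothetical exception sets (assumptions, not agreed lists).
RUSSIAN_KEEP = frozenset("\u0439\u0419")              # Cyrillic short i, both cases
FINNISH_KEEP = frozenset("\u00e4\u00f6\u00c4\u00d6")  # a/o with diaeresis, both cases

print(fold("\u0439"))                # folded to plain Cyrillic i
print(fold("\u0439", RUSSIAN_KEEP))  # preserved
print(fold("\u00f6", FINNISH_KEEP))  # preserved
```

With an empty keep set this behaves like blanket folding; each wiki's exception list only narrows what gets folded.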

In T132637#2876944, @dcausse wrote:

ICU folding will be enabled on the next reindex batch for English/French and Greek wikis.

To clarify, do you mean all Wikimedia wikis in those languages (i.e. including Wiktionary, Wikipedia, and others)?

In T132637#2876984, @Wikitiki89 wrote:

To clarify, do you mean all Wikimedia wikis in those languages (i.e. including Wiktionary, Wikipedia, and others)?

Yes, exactly

@dcausse When do we expect the reindex to be done for these wikis? I thought we'd just reindexed them to launch BM25 everywhere.

@Deskana yes it was the initial plan but due to timing issues the ICU analysis patch chain was not yet ready.
I'd suggest running the reindex next quarter. We could try to contact more wiki stakeholders in the meantime to see if more wikis would be interested.
The problem is that we don't yet have a very comprehensive workflow for tracking such changes. We have a recurring task, T147505, that should be used for this, but imo it's not very clear what's going on: the tasks we use to track work on the code are reused in comments there, so when you see a task closed in those comments you may think the work is done, but that's not necessarily the case...
Maybe a dedicated work-board with dedicated tasks could be easier to read?

In T132637#2880285, @dcausse wrote:

@Deskana yes it was the initial plan but due to timing issues the ICU analysis patch chain was not yet ready.
I'd suggest running the reindex next quarter. We could try to contact more wiki stakeholders in the meantime to see if more wikis would be interested.

Works for me. Thanks! I'll chat to @CKoerner_WMF about contact people today. To confirm I understand, is the audience "wikis in languages where diacritics are used often", and the improvement "searching with diacritics should match words in a more logical manner"?

The problem is that we don't yet have a very comprehensive workflow for tracking such changes. We have a recurring task, T147505, that should be used for this, but imo it's not very clear what's going on: the tasks we use to track work on the code are reused in comments there, so when you see a task closed in those comments you may think the work is done, but that's not necessarily the case...
Maybe a dedicated work-board with dedicated tasks could be easier to read?

Interesting! I have a few ideas how that could work. I'll write something to the Discovery list about it. :-)

In T132637#2880285, @dcausse wrote:

I'd suggest running the reindex next quarter. We could try to contact more wiki stakeholders in the meantime to see if more wikis would be interested.

That's a long time. Why should we have to wait another three months?

In T132637#2881207, @Deskana wrote:

Works for me. Thanks! I'll chat to @CKoerner_WMF about contact people today. To confirm I understand, is the audience "wikis in languages where diacritics are used often", and the improvement "searching with diacritics should match words in a more logical manner"?

I would state the audience as "wikis that often contain text in languages with diacritics".
And I would state the improvement with the more frequent scenario: "searching without diacritics should be able to match words that have diacritics".

In T132637#2881207, @Deskana wrote:

In T132637#2880285, @dcausse wrote:

@Deskana yes it was the initial plan but due to timing issues the ICU analysis patch chain was not yet ready.
I'd suggest running the reindex next quarter. We could try to contact more wiki stakeholders in the meantime to see if more wikis would be interested.

Works for me. Thanks! I'll chat to @CKoerner_WMF about contact people today. To confirm I understand, is the audience "wikis in languages where diacritics are used often", and the improvement "searching with diacritics should match words in a more logical manner"?

I think that's a question for our language nerd: @TJones HELP!! :)

In T132637#2881246, @Wikitiki89 wrote:

In T132637#2880285, @dcausse wrote:

I'd suggest running the reindex next quarter. We could try to contact more wiki stakeholders in the meantime to see if more wikis would be interested.

That's a long time. Why should we have to wait another three months?

Mainly because it's still hard for us to track index changes, and doing big batches is easier.
I understand that it can be frustrating, do you have a short list of wikis where it's desperately needed? I can probably do an exception if the list is not huge.

In T132637#2881246, @Wikitiki89 wrote:

That's a long time. Why should we have to wait another three months?

"Next quarter" could also mean in January, in which case it's just a month or so. We could probably have started a reindex sooner, but it's generally unwise to perform big changes before the holidays. We are interested in making the reindex process faster and less painful to increase the frequency with which we can do them, but that's a long process.

I guess what I'm trying to say is that mid-December is a bad time of year to expect things to move super fast, but we'll get to it as soon as we can. :-)

In T132637#2881282, @dcausse wrote:

Mainly because it's still hard for us to track index changes, and doing big batches is easier.
I understand that it can be frustrating, do you have a short list of wikis where it's desperately needed? I can probably do an exception if the list is not huge.

I know for sure that it's desperately needed at the English Wiktionary; many readers and editors, including myself, have been complaining about it for years, and have even been proposing convoluted workarounds, such as using modules to add hidden text with other spelling variants. It is probably also desired at other large Wiktionaries, such as the French Wiktionary, but I can't speak for them.

In T132637#2881306, @Deskana wrote:

In T132637#2881246, @Wikitiki89 wrote:

That's a long time. Why should we have to wait another three months?

"Next quarter" could also mean in January, in which case it's just a month or so. We could probably have started a reindex sooner, but it's generally unwise to perform big changes before the holidays. We are interested in making the reindex process faster and less painful to increase the frequency with which we can do them, but that's a long process.

I guess what I'm trying to say is that mid-December is a bad time of year to expect things to move super fast, but we'll get to it as soon as we can. :-)

Thanks! I do appreciate the effort. For the longest time I was under the impression that this issue was being completely ignored until I happened to check yesterday and was surprised to see that a major subtask was marked completed just a week earlier and that the fix would be coming soon. I guess that got me a little overexcited.

TL;DR: The question for the communities is, I think, which diacritics should not be folded in your language, and is it fair to assume all the others should be folded on wikis in your language?

In T132637#2881246, @Wikitiki89 wrote:

In T132637#2881207, @Deskana wrote:

Works for me. Thanks! I'll chat to @CKoerner_WMF about contact people today. To confirm I understand, is the audience "wikis in languages where diacritics are used often", and the improvement "searching with diacritics should match words in a more logical manner"?

I would state the audience as "wikis that often contain text in languages with diacritics".
And I would state the improvement with the more frequent scenario: "searching without diacritics should be able to match words that have diacritics".

I think @Wikitiki89 is on the right track, though I'd clarify that it's text in other languages with unfamiliar diacritics, and that searching without unfamiliar diacritics should match words that have them.

French speakers usually have no trouble typing French diacritics, but they may have no idea how to type Ancient Greek polytonic diacritics—which speakers of Modern Greek may also have trouble with, just as speakers of Modern English usually don't know how to type ð, þ, æ, or ē, despite them all being used in the first few lines of Beowulf! Hwæt! (You call me a language nerd, now I gotta act like one.)

I think the most general description is that people want to fold most diacritics, except the ones that are relevant to the "host" language of the wiki they are on. On English Wikipedia/Wiktionary/etc., you'd probably want to fold all the Cyrillic and (Modern and Ancient) Greek diacritics, and probably all the Latin ones, too: even though English makes light use of ´ and ¨ (e.g., resumé and Zoë) and rarely a few others, many English speakers don't see them as important.

French probably wants to fold mācrōns and hold on to ácúté and gràvè accents, while Hawaiian wants to do the opposite.

Generally, precomposed characters in the "host" language are keepers, though there are exceptions: Russian doesn't seem to care about the distinction between ё and е—though Belarusian and Rusyn do! I have no idea what Belarusian speakers searching for Russian words on Belarusian Wiktionary want to do about that, given that the dictionary citation form of a Russian word may use ё (such as чёрная дыра, "black hole"), but usage even in formal and academic sources may not (i.e., a user may have only seen it in print in Russian as черная дыра, as a quick search on Russian Google News shows is at least plausible).

Dealing with precomposed vs. decomposed versions of the same character is more complex, and probably needs to be handled as it comes up. If I type an e + combining acute accent <é> (U+0065 U+0301) in an article on my test wiki in Vagrant, it gets converted to a single precomposed character <é> (U+00E9). I'm not sure if it's my OS, my browser, or MediaWiki that does the conversion, so I'm not sure how big of a problem it is. We'll see what Phab does when I submit; the preview is keeping them distinct so far. Yep, Phab kept them distinct: éé. They even look slightly different on my screen.
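Whichever layer performs that conversion, the two spellings are canonically equivalent, so comparing text after normalizing both sides sidesteps the question. A small sketch, assuming the combining mark is U+0301 COMBINING ACUTE ACCENT (the helper name `same_text` is made up for this example):

```python
import unicodedata

decomposed = "e\u0301"  # e + COMBINING ACUTE ACCENT
precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE


def same_text(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence (via NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(decomposed == precomposed)           # raw code points differ
print(same_text(decomposed, precomposed))  # equal after normalization
```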

There are also non-diacritic foldings that happen, such as converting alternate forms of letters to their base forms, such as Greek word-final ς to standard σ, converting stylistic ligatures like ﬁ to fi, and converting fullwidth Ｌａｔｉｎ characters to "normal" halfwidth forms.
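These three non-diacritic foldings can be approximated in Python with compatibility normalization (NFKC) followed by `str.casefold()`. This is a sketch of the idea rather than ICU folding itself, and the function name `fold_letterforms` is made up for this example.

```python
import unicodedata


def fold_letterforms(text: str) -> str:
    """Approximate letterform folding: NFKC, then Unicode case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


print(fold_letterforms("\u03c2"))                          # final sigma -> sigma
print(fold_letterforms("\ufb01"))                          # fi ligature -> "fi"
print(fold_letterforms("\uff2c\uff41\uff54\uff49\uff4e"))  # fullwidth -> "latin"
```

NFKC handles the ligature and the fullwidth forms; the final sigma is mapped by case folding, which also lowercases the text.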

I believe that most wikis would want to implement maximal ICU folding, with exceptions for diacritics (and possibly other folded distinctions) that are relevant in the "host" language of that wiki—and this is what @dcausse has already implemented; we just don't have the exception list for many languages.
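For illustration, "maximal folding with exceptions" might look like the following index settings, written here as a Python dict. This is a sketch, not the CirrusSearch configuration: the filter and analyzer names are invented, the Swedish exception set is an assumption, and the exact spelling of the `unicode_set_filter` option has varied between versions of the Elasticsearch ICU analysis plugin.

```python
import json

# Hypothetical analysis settings for a Swedish-language index.
settings = {
    "analysis": {
        "filter": {
            "icu_folding_sv": {
                "type": "icu_folding",
                # UnicodeSet of characters the filter may fold; the six
                # letters excluded here (a-ring, a/o with diaeresis, in
                # both cases) pass through unchanged.
                "unicode_set_filter": "[^\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6]",
            }
        },
        "analyzer": {
            "text_sv": {
                "type": "custom",
                "tokenizer": "icu_tokenizer",
                "filter": ["icu_folding_sv"],
            }
        },
    }
}

print(json.dumps(settings, ensure_ascii=False, indent=2))
```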

We need to find out those exceptions for each language—preferably by asking people who are familiar with the language, though it can be done by research into the host language's orthography. Additional corner cases will come up and we'll have to figure those out as they happen. We should focus first on languages where people are noting that they have problems, since that's probably where there is the biggest overlap between number of users and the size of the problem.

Note: Unfortunately, I can't find a good list of everything that could/would/should get folded. There are withdrawn drafts of Unicode Consortium technical reports. The Lucene docs have a list of categories, at least:

Accent removal
Case folding
Canonical duplicates folding
Dashes folding
Diacritic removal (including stroke, hook, descender)
Greek letterforms folding
Han Radical folding
Hebrew Alternates folding
Jamo folding
Letterforms folding
Math symbol folding
Multigraph Expansions: All
Native digit folding
No-break folding
Overline folding
Positional forms folding
Small forms folding
Space folding
Spacing Accents folding
Subscript folding
Superscript folding
Suzhou Numeral folding
Symbol folding
Underline folding
Vertical forms folding
Width folding

• Deskana mentioned this in T3836: Enable ICU folding on Hebrew wikis. Dec 19 2016, 7:54 PM

(Dan's traveling, but I wanted to respond with some notes before I forgot)

@Deskana and I spoke about this on Friday. The approach we came up with (please correct me if I'm wrong, Dan!) was to generate a list of the languages where we know folding would create a poor experience (by research into the host language's orthography). We could then enable folding on all languages except those on the list, and let communities know (mailing lists, tech news, search documentation, weekly discovery updates) that they can contact the team if they'd like it enabled or disabled.

The concern about reaching out with more vigor is that the smaller Wikipedias won't have much of a response if we ask directly. It is also hard to tell whether the change would be desired because it is a layman's request ("I never include diacritics") rather than because it is linguistically correct (I'm probably wording that wrong, but I hope my meaning is coming across).

Does that sound agreeable?

@CKoerner_WMF thanks for bringing this problem to our attention. If I understand correctly the approach you describe is to be bold and activate ICU everywhere we think it may be beneficial. The problem is that we don't really have this list... another problem is, as pointed out by @TJones, there are some very important diacritics that we should not fold, we know some of them (ru, sw, no) but not all of them.
The strategy we planned to apply was: be conservative and do not apply ICU folding unless the community asks for it. Now I understand that this may not be fair, small communities may not be aware that this feature now exists and it's unlikely that they will jump into phab to request it.
What you seem to suggest is to switch to a bold strategy: activate ICU folding everywhere and wait for communities to complain if some important diacritics are destroyed.
My feeling is that this feature is generally well accepted, it should not cause any big problems (except for those important diacritics). It may cause some annoyance for typo hunters but we have tools to help them (i.e. use of insource).

I don't really like this kind of situation because it generally ends with a status quo. I'm for the following approach:
Dan & Chris suggested to be bold and unless someone has strong objections we should go for it.

Personally I have some minor concerns but no strong objections.

Something I forgot to mention is the discussion on the community wishlist survey for wiktionaries: https://meta.wikimedia.org/wiki/2016_Community_Wishlist_Survey/Categories/Wiktionary#Character_normalization_search

In T132637#2881770, @TJones wrote:

TL;DR: The question for the communities is, I think, which diacritics should not be folded in your language, and is it fair to assume all the others should be folded on wikis in your language?

In T132637#2881246, @Wikitiki89 wrote:

In T132637#2881207, @Deskana wrote:

Works for me. Thanks! I'll chat to @CKoerner_WMF about contact people today. To confirm I understand, is the audience "wikis in languages where diacritics are used often", and the improvement "searching with diacritics should match words in a more logical manner"?

I would state the audience as "wikis that often contain text in languages with diacritics".
And I would state the improvement with the more frequent scenario: "searching without diacritics should be able to match words that have diacritics".

I think @Wikitiki89 is on the right track, though I'd clarify that it's text in other languages with unfamiliar diacritics, and that searching without unfamiliar diacritics should match words that have them.

...

I'm gonna have to disagree with most of your post. Just because most French speakers are capable of typing with diacritics, doesn't mean that they can't benefit from the flexibility of a diacritic-folding search. A French speaker, for example, might search for the name "Etienne" (dropping the diacritic on the capital letter) and expect to see results that contain the properly diacriticized "Étienne". Or a French speaker might search for the word "connaître" (using the more common spelling) and expect to see results containing the alternative spelling "connaitre" (which was introduced by a recent spelling reform, but not widely adopted). Additionally, a French speaker traveling abroad might not have access to diacritics and still want to search French Wikipedia. Luckily, for these French speakers, diacritic folding already works this way for Latin-script text on French wikis and most other Latin-script wikis. I could go on about French, and I could go on about the peculiarities of other languages, but I think you'll get my point from these examples, that diacritic folding is useful even for a given wiki's main language, and even for very familiar diacritics.
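The behaviour described here (folded matching for the wiki's own language, with the exact spelling still distinguishable) can be sketched as a folded lookup index that keeps the original titles and lists exact matches first. All names below are made up for illustration; this is not how CirrusSearch ranks results.

```python
import unicodedata
from collections import defaultdict


def fold(text: str) -> str:
    """Strip nonspacing marks and case-fold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped.casefold()


def build_index(titles):
    """Map each folded form to the original titles that produce it."""
    index = defaultdict(list)
    for title in titles:
        index[fold(title)].append(title)
    return index


def search(index, query):
    """Return folded matches, listing exact (NFC-equal) matches first."""
    hits = index.get(fold(query), [])
    exact = unicodedata.normalize("NFC", query)
    return sorted(hits, key=lambda t: unicodedata.normalize("NFC", t) != exact)


titles = ["\u00c9tienne", "conna\u00eetre", "connaitre"]
index = build_index(titles)

print(search(index, "Etienne"))    # finds the accented title
print(search(index, "connaitre"))  # both spellings, exact match first
```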

And besides, if anyone wants to search for an exact match including diacritics, they would still be able to do so by putting quotes around the search term.

@CKoerner_WMF, @Deskana, & @dcausse, I'm still against the bold version of turning on ICU folding everywhere. It seems that it would be annoying to many users in Russian (и/й), Finnish (o,a/ö,ä) and Swedish (å/ä/a), without folding exceptions in place—so similar problems are likely to exist in other languages.

The problem is that we don't know what we don't know. I remembered that ə is used in some alphabet, so I looked it up, and it's used in Azerbaijani. A quick search on English or French Wikipedia, where it currently gets folded, shows that it gets folded to an a! I would have expected an e based on orthography, but a makes sense in terms of pronunciation (in English, at least).

On the Azerbaijani Wikipedia, folding ə to a would be really annoying. The word və means "and". Searching for va would match və. That's like searching for end in English and getting every instance of and. And that's just one example word for one folded character in one language.
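One detail worth noting: the schwa has no canonical decomposition, so a fold to "a" must come from an explicit mapping table rather than from accent stripping. The sketch below uses a hypothetical one-entry table that mimics the reported behaviour and shows how a per-language keep set would protect the letter.

```python
import unicodedata

SCHWA = "\u0259"  # LATIN SMALL LETTER SCHWA

# An empty decomposition means decompose-and-strip cannot change the letter.
print(repr(unicodedata.decomposition(SCHWA)))

# Hypothetical mapping table reproducing the fold reported in this thread.
TABLE = {SCHWA: "a"}


def table_fold(text: str, table=TABLE, keep: frozenset = frozenset()) -> str:
    """Apply a character mapping table, skipping characters in `keep`."""
    return "".join(ch if ch in keep else table.get(ch, ch) for ch in text)


word = "v" + SCHWA                                   # the Azerbaijani word for "and"
print(table_fold(word))                              # collides with "va"
print(table_fold(word, keep=frozenset({SCHWA})))     # unchanged
```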

@Wikitiki89: Quotes don't solve all such problems. In addition to blocking stemming (which is much more useful in some languages than in English) title matches from the upper right search box are still folded, so searching for "ə" matches "A" on English Wikipedia (which for reasons beyond my pop culture knowledge redirects to Pretty Little Liars.)

In French, I know that even with diacritic folding on, not everything works the way one might like. My go-to example is element vs élément. Without diacritics it gets treated like any other French word and the -ment ending is stripped and mostly English element in titles matches, along with the Élé in Élé Asu. With diacritics, it only matches other instances with diacritics (matching plurals as usual). More examples of changes caused by ascii-folding in French—most good, but not all—are in my analysis. In this case folding is interacting with stemming in unexpected ways, and though stemming isn't available for all languages, I still worry about unexpected interactions and behavior in general.

Unfortunately, I didn't look at not folding French-specific accents, which would solve some of the problems, but not others, so I can't say much on that right now.

In general, I prefer caution when making sweeping changes to how we process text/words/characters and how that affects various languages. @CKoerner_WMF, If we don't hear from the smaller wikis about current problems, I worry we'll make things worse and never know.

I'm happy to take on a long-term project (after TextCat winds down in early Q3) to go language by language and research likely problems based on orthography to include in exceptions for each language. This could be coupled with necessary research for {T147959: Generic language fallbacks in Mediawiki should not be used for Elasticsearch language analyzers}.

Even though @Wikitiki89 seems to prefer the bolder approach, I hope we could agree that folding all diacritics not used in a given language would be a useful improvement.

In T132637#2894355, @TJones wrote:

@CKoerner_WMF, @Deskana, & @dcausse, I'm still against the bold version of turning on ICU folding everywhere. It seems that it would be annoying to many users in Russian (и/й), Finnish (o,a/ö,ä) and Swedish (å/ä/a), without folding exceptions in place—so similar problems are likely to exist in other languages.

I can speak for Russian, that folding и/й would not really be that annoying. I can't speak for Finnish or Swedish, but I would ask a native speaker before assuming that it would be annoying to them. In Swedish it may even make sense to fold "å" to "aa".

The problem is that we don't know what we don't know. I remembered that ə is used in some alphabet, so I looked it up, and it's used in Azerbaijani. A quick search on English or French Wikipedia, where it currently gets folded, shows that it gets folded to an a! I would have expected an e based on orthography, but a makes sense in terms of pronunciation (in English, at least).

On the Azerbaijani Wikipedia, folding ə to a would be really annoying. The word və means "and". Searching for va would match və. That's like searching for end in English and getting every instance of and. And that's just one example word for one folded character in one language.

That seems like a really bad decision. I don't think ə should fold to either of a or e.

@Wikitiki89: Quotes don't solve all such problems. In addition to blocking stemming (which is much more useful in some languages than in English) title matches from the upper right search box are still folded, so searching for "ə" matches "A" on English Wikipedia (which for reasons beyond my pop culture knowledge redirects to Pretty Little Liars.)

In French, I know that even with diacritic folding on, not everything works the way one might like. My go-to example is element vs élément. Without diacritics it gets treated like any other French word and the -ment ending is stripped and mostly English element in titles matches, along with the Élé in Élé Asu. With diacritics, it only matches other instances with diacritics (matching plurals as usual). More examples of changes caused by ascii-folding in French—most good, but not all—are in my analysis. In this case folding is interacting with stemming in unexpected ways, and though stemming isn't available for all languages, I still worry about unexpected interactions and behavior in general.

Unfortunately, I didn't look at not folding French-specific accents, which would solve some of the problems, but not others, so I can't say much on that right now.

I think the problem you point out says more about the problems with stemming than the problems with folding. There are many other words whose stem ending coincides with a suffix.

In general, I prefer caution when making sweeping changes to how we process text/words/characters and how that affects various languages. @CKoerner_WMF, If we don't hear from the smaller wikis about current problems, I worry we'll make things worse and never know.

I'm happy to take on a long-term project (after TextCat winds down in early Q3) to go language by language and research likely problems based on orthography to include in exceptions for each language. This could be coupled with necessary research for {T147959: Generic language fallbacks in Mediawiki should not be used for Elasticsearch language analyzers}.

Even though @Wikitiki89 seems to prefer the bolder approach, I hope we could agree that folding all diacritics not used in a given language would be a useful improvement.

You may have misunderstood me. I'm not trying to advocate a bold approach. I am trying to point out reasons to have a broader range of wikis consider ICU folding, but not to force it upon them.

I would imagine that universal folding would be helpful—especially given the sense that we print inexact matches anyway—but I am only an English speaker.

In T132637#2894519, @Wikitiki89 wrote:

I can speak for Russian, that folding и/й would not really be that annoying.

I think that in this case, and in most cases, opinions differ, and you are likely to find the full range of opinions out there. At least one native speaker of Russian on the Discovery team (hi, @Smalyshev!) said that folding и/й was more bad than good, and pointed out that yandex.ru, a leading Russian-language search engine, treats confusing the two as a typo. The conversation is buried in Gerrit, unfortunately, but it's in the comments here, on line 136.

Doing a quick search on Google—in English, and with a sample that is not statistically significant, but also not cherry-picked—here's a StackExchange comment pointing out that one is a vowel, one a consonant, and why would you ever confuse the two? And a similar comment on Quora saying they are distinct, though they happen to look alike. Both those comments are probably frustrating to English-speaking Russian learners, who seem to have trouble distinguishing the two. The comparison to i and y in English seems apt.

A quick search on English or French Wikipedia, where it currently gets folded, shows that it gets folded to an a!

That seems like a really bad decision. I don't think ə should fold to either of a or e.

I agree. It makes an /æ/ sound (the vowel in English ash) in Azerbaijani, so a isn't unreasonable, but far from perfect. Phonetics and orthography don't always align, especially from the perspective of someone who doesn't speak the language.

I also think that folding ɰ to m is ridiculous (briefly, it's like the consonant sound y makes in you, but moved back to where you make a g), but Elasticsearch asciifolding does that. It seems to be based on the name, "turned m with long leg", which is based on physical typesetting—the related symbol, "turned m" ɯ can be typeset by turning an m upside down, even though it's really an unrounded u.

Some of the foldings seem to be based on unfamiliarity with the symbols, or done from the perspective of a speaker of English or another Indo-European language. There's no magic one-folding-fits-all for all languages—it's just not possible.

I can't speak for Finnish or Swedish, but I would ask a native speaker before assuming that it would be annoying to them. In Swedish it may even make sense to fold "å" to "aa".

Here's one Finnish speaker who is pretty upset about people ignoring Finnish umlauts—and fortunately for me, complaining in English.

And here are some English speakers discussing the importance of diacritics, with examples of words differing only by diacritics in Swedish, French, Spanish, Finnish and German, with mention of Danish and Norwegian, but no examples.

[ French élément vs element. ]

I think the problem you point out says more about the problems with stemming than the problems with folding. There are many other words whose stem ending coincides with a suffix.

I haven't gone looking for examples, but I presume that the more common ones are known, like élément, and treated properly. The point is that lack of diacritics moves something from "known" to "unknown" where it receives the default morphological processing, just like a word following the regular pattern, or a nonce word like supercalifragilisticexpialidociousment would.

As I said before:

The problem is that we don't know what we don't know.

There are too many unknown elements in the languages and the software.

Even though @Wikitiki89 seems to prefer the bolder approach, I hope we could agree that folding all diacritics not used in a given language would be a useful improvement.

You may have misunderstood me. I'm not trying to advocate a bold approach. I am trying to point out reasons to have a broader range of wikis consider ICU folding, but not to force it upon them.

You are correct, I did misunderstand. I think you were actually fairly clear, but I conflated bold rollout with bold stemming. Would it be fair to say that you and I both favor a less bold approach to rolling out ICU stemming to all wikis, while you favor bolder folding and I favor less bold folding when it is deployed?

This discussion is also aimed at anyone else who favors a bolder rollout, with, I think, ample reasons and examples of why it's good to be a bit cautious.

I think I've made my points as well as I can—though I'm glad to discuss further with anyone who wants to. (Though I may be slow to respond at this point, with the holidays coming up—happy holidays, all!)

In T132637#2896573, @TJones wrote:

In T132637#2894519, @Wikitiki89 wrote:

I can speak for Russian, that folding и/й would not really be that annoying.

I think that in this case, and in most cases, opinions differ, and you are likely to find the full range of opinions out there. At least one native speaker of Russian on the Discovery team (hi, @Smalyshev!) said that folding и/й was more bad than good, and pointed out that yandex.ru, a leading Russian-language search engine, treats confusing the two as a typo. The conversation is buried in Gerrit, unfortunately, but it's in the comments here, on line 136.

Doing a quick search on Google—in English, and with a sample that is not statistically significant, but also not cherry-picked—here's a StackExchange comment pointing out that one is a vowel, one a consonant, and why would you ever confuse the two? And a similar comment on Quora saying they are distinct, though they happen to look alike. Both those comments are probably frustrating to English-speaking Russian learners, who seem to have trouble distinguishing the two. The comparison to i and y in English seems apt.

...

I can't speak for Finnish or Swedish, but I would ask a native speaker before assuming that it would be annoying to them. In Swedish it may even make sense to fold "å" to "aa".

Here's one Finnish speaker who is pretty upset about people ignoring Finnish umlauts—and fortunately for me, complaining in English.

And here are some English speakers discussing the importance of diacritics, with examples of words differing only by diacritics in Swedish, French, Spanish, Finnish and German, with mention of Danish and Norwegian, but no examples.

You seem to be conflating two different issues. One is about the actual orthographic rules, and the other is about searching. Just because "маика" is not an acceptable alternative spelling of "майка", doesn't mean that we can't fold them in searches. In fact I think it's pretty rare that two unrelated words would differ only in и vs й, but even if that weren't the case, that wouldn't be a deal-breaker.

The same probably goes for Finnish (although, again, we should get the opinion of an actual Finnish speaker). The complaint you linked to is about foreigners' careless spelling, not about search results.

I am not a big fan of combining и and й because I think there's not a lot of benefit in it: й is in every Russian keyboard layout, and not a lot of people (knowing Russian to at least a mediocre degree) would confuse и and й (note I don't mean typos here; a typo would be substituting a random letter, or a non-random one based on keyboard proximity, order transposition, etc.). The situation with е/ё is very different, since many texts just omit ё and always use е, which I personally frown upon, but it's a common practice. Text replacing й with и would be considered completely messed up. So the situation is different.

Now, could we still collate и/й? Maybe; it wouldn't do a lot of harm, because the distinction between those is rarely crucial. But I also don't think it would help a lot.

Should we do it? I tend to the side of "no" and to consider things like "маика" common typos. Unless there's some stats showing I am wildly wrong in estimating how common such things are :)

No idea about Finnish, I would suggest asking somebody who speaks/reads it, these things can't really be logically deduced without knowing the language :)

In T132637#2897251, @Wikitiki89 wrote:

You seem to be conflating two different issues. One is about the actual orthographic rules, and the other is about searching. Just because "маика" is not an acceptable alternative spelling of "майка", doesn't mean that we can't fold them in searches. In fact I think it's pretty rare that two unrelated words would differ only in и vs й, but even if that weren't the case, that wouldn't be a deal-breaker.

The same probably goes for Finnish (although, again, we should get the opinion of an actual Finnish speaker). The complaint you linked to is about foreigners' careless spelling, not about search results.

I'm not conflating the two issues, but I think they are deeply linked. It's clear that we have different views on the subject, and I fear we could end up talking past each other. It's okay if people have differing opinions; I'm sure it keeps the PMs' lives interesting. :)

I don't think the similarity in the glyphs for two letters should be an argument for folding them in a language that distinguishes them and, as best as one can tell, considers them different letters. To me, your argument is like saying a and o could be folded together because, at least in some fonts, a is just an o with an extra downstroke, or that R is just a P with an extra leg. The fact that the orthography of, say, Finnish, chose to use umlauts on some vowels to make more distinctions than the basic Latin alphabet allows for is incidental (in my mind), much like the fact that U and W are both historically derived from V; I certainly wouldn't want to fold those (in English, though maybe in Hawaiian).

While и and й may rarely differentiate a word, they are still different letters.

Diacritical versions that are not clearly different letters, like French or Spanish diacritics on vowels, are harder to guess at. Stas's keyboard criterion may also be a good heuristic.

The examples of foreigners' poor spelling (all I could easily and quickly find, since I don't speak those languages) are examples of distinctions that would be lost if those diacritical characters were folded into their plain counterparts. I was hoping to illustrate that, for example, searching for "tune" (Låt) and getting back results for "lazy" (Lat) might not be desirable, much as searching for yota and getting results for iota, or care for core, or rack for pack, or vary for wary.

Folding too aggressively waters down results, especially when the word you want is rare and the word it folds into is common. That's why disambiguation pages are so useful for words where, for example, two different meanings differ only by capitalization, like Jack and jack. If not for the disambiguation page, searches for jack would be overwhelmed by results for people and places named Jack.

On the contrary, I believe the similarity of two glyphs (that differ by strokes, diacritics, jots, or tittles) in a foreign language is an argument for folding them on a wiki where they are foreign. To a monolingual American English speaker, in my experience, there isn't any meaningful difference between e and é or u and ü, or и and й.

I'm not sure what to do about English speakers comparing German ß to B or Greek Ρ and Η to P and H, as I have known some to do, but that may be much too far afield.

In T132637#2897341, @Smalyshev wrote:

... these things can't really be logically deduced without knowing the language :)

True, but I do think we can make a first pass at it, and get reasonably far by, say, reading the Wiki page about a language. The ё / е situation, for example, is very well laid out in English Wikipedia. That reasonably well-informed opinion can also mean taking a much better informed first draft of a proposal to a language community for validation. (Though I'm willing to bet there will be disagreements!)

Thanks for weighing in, Stas!

Finally, as always, I would have written a shorter message, but I did not have the time. (Sorry.)

dcausse mentioned this in T26414: Special character "å" in the search menu. Jan 17 2017, 4:17 PM

dcausse mentioned this in T155515: Reindex el, en, fr and he wikis to enable ICU folding. Jan 17 2017, 6:50 PM

Moving to this quarter, for us to look at it in more depth after @TJones completes some of the other analyzers.

TJones updated the task description. Jun 27 2017, 4:45 PM

I suggest closing this ticket. We definitely need a language-by-language approach as different languages care about different distinctions in letters. I would prefer individual tickets (possibly under the umbrella of an Epic ticket) to address specific language-by-language needs.

I believe all of the specific issues in this ticket have been addressed for English language projects (and particularly Wiktionary where a lot of them come up).

Full text searches on English Wiktionary:

ἔθανον and ἔθᾰνον find ἔθᾰνον
ἐθάνετον and ἐθᾰ́νετον find ἐθᾰ́νετον
κλίνω and κλῑ́νω find κλίνω
- Note: I replaced the original κλī́νω (with Latin i-with-macron) with κλῑ́νω (with Greek iota-with-macron) in the example in the task description. The original κλī́νω gets normalized to κλiνω (with an i) and doesn't find κλίνω.
Bro̍nte gives both Brontë and Bronte in full text search
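The Latin-vs-Greek macron note above can be reproduced outside the search stack. This is only a sketch: `strip_marks` is a hypothetical helper that decomposes to NFD and drops combining marks, a crude stand-in for folding and not what the search analyzers actually run.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD and drop combining marks (a crude stand-in for folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# kappa, lambda, <i/iota with macron>, combining acute, nu, omega
latin = "\u03ba\u03bb\u012b\u0301\u03bd\u03c9"   # uses LATIN SMALL LETTER I WITH MACRON
greek = "\u03ba\u03bb\u1fd1\u0301\u03bd\u03c9"   # uses GREEK SMALL LETTER IOTA WITH MACRON
target = "\u03ba\u03bb\u03af\u03bd\u03c9"        # the page title, with iota-with-tonos

print(strip_marks(latin) == strip_marks(target))  # Latin i survives as "i", not iota
print(strip_marks(greek) == strip_marks(target))  # both reduce to plain iota
```

The two inputs look identical in most fonts, but the Latin one leaves an ASCII i in the middle of a Greek word, which is why it cannot match.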

Search bar searches on English Wiktionary:

αθανατος brings up ἀθάνατος and αθάνατος (among others)
- using various combinations of ἀ and ἁ in αθανατος works, too.
Brónte matches other br-nte words in the search box because ó is pre-composed and counts as one letter; o-like matches are preferred, but one-letter-off typos that fit the br-nte pattern are good matches, as they would be for brqnte or br8nte
Bro̍nte is different because it is a 7-character string, with the combining diacritic counting as another character.
- The difference between the two might be a bit hard to explain and harder to intuit, but it seems reasonable.
Bront́e behaves like Bro̍nte and gets Bronte and Brontë as results.
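The length difference behind the search-box behavior can be checked with a few lines of Python. This assumes the input is NFC-normalized before matching, which is my assumption for the sketch rather than a statement about what the suggester does internally; `nfc_length` is a made-up helper name.

```python
import unicodedata

def nfc_length(text):
    """Number of code points after canonical composition (NFC)."""
    return len(unicodedata.normalize("NFC", text))

precomposed = "Br\u00f3nte"       # o-with-acute as a single code point
combining_acute = "Bro\u0301nte"  # o + combining acute; NFC composes it
vertical_line = "Bro\u030dnte"    # o + combining vertical line above; nothing to compose to
acute_on_t = "Bront\u0301e"       # t + combining acute; nothing to compose to

for s in (precomposed, combining_acute, vertical_line, acute_on_t):
    print(nfc_length(s))
```

The first two come out as six code points and the last two as seven, which matches the "one letter off" versus "extra character" distinction described above.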

I'll push all of the linked-to, unclosed tasks onto my current task review stack and get to them as soon as I can:

TJones mentioned this in T123179: Provide an option for ignoring combining characters when searching. Jun 27 2017, 9:10 PM

Sounds good, @TJones, I'll go ahead and close it.

@ObsequiousNewt - let us know if the explanation makes sense and if you have any further questions or concerns.

I'm satisfied with the solution thus far, since I should at least be able to do everything that I want to do. I'm kind of surprised at the decision to use separate folding for different languages; I know that some letters are considered not equal to their diacritic-less equivalents, but I would have imagined that the problem here (if it is even a problem; I'm not sure that folding these wouldn't be desirable regardless; after all, we do already fold near-matches, don't we?) isn't worth the overhead of determining and implementing separate folding heuristics for each language.

But, as I said, my problem at least is solved, and having been on MediaWiki for as long as I have, I know the enemy of any progress is discussion of best practices. So I'm satisfied with marking this as closed.

@ObsequiousNewt, thanks for responding! Glad things are solved for you.

I do think it's important to work language-by-language. English speakers don't care about the distinction between ä and a, while Swedish speakers really do! Russian doesn't distinguish Cyrillic ё and е, but Belarusian and Rusyn do. There isn't a one-size-fits-all solution. It's a pain, but that's language—always messy, always interesting.
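To make the language-by-language idea concrete, here is a rough sketch of folding with per-language exceptions. Everything in it is hypothetical: the `PRESERVE` table and the `fold` helper are invented for illustration, and the exception lists are guesses based on this discussion, not the exceptions any wiki actually uses.

```python
import unicodedata

# Hypothetical per-language letters that should survive folding (illustrative only).
PRESERVE = {
    "en": set(),
    "sv": set("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6"),      # a-ring, a-umlaut, o-umlaut
    "ru": set("\u0439\u0419"),                              # short i only; io folds to ie
    "be": set("\u0439\u0419\u0451\u0401\u045e\u040e"),      # short i, io, short u
}

def fold(text, lang):
    """Strip combining marks, except on letters the language treats as distinct."""
    keep = PRESERVE.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

print(fold("l\u00e5t", "en"), fold("l\u00e5t", "sv"))    # Swedish "tune" vs "lazy"
print(fold("\u0451\u043b\u043a\u0430", "ru"), fold("\u0451\u043b\u043a\u0430", "be"))
```

With this table, the same input folds differently depending on the wiki language, so the Swedish word keeps its å on a Swedish wiki while an English wiki treats it as a plain a. Note that й and ё both decompose into a base letter plus a combining mark, so without the exception list a generic mark-stripper would fold them whether or not the language wants that.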

You can see more of my thoughts on the language-by-language approach on my mediawiki page on the subject.

TJones mentioned this in T75605: No normalization for ancient greek accents in searches. Jul 6 2017, 8:52 PM

debt mentioned this in T147505: [tracking] CirrusSearch: what is updated during re-indexing. Jul 11 2017, 5:47 PM
