
Map Romanian s&t with comma to cedilla internally
Closed, Resolved · Public · 2 Estimated Story Points

Description

Romanian uses s&t with commas (ș&ț), but because those were unavailable on many computers for many years, s&t with cedillas (ş&ţ) were substituted. Currently Romanian Wikipedia favors the correct forms (with commas), but the forms with cedillas are still common.

Some default Romanian analysis chain components, like the stopword list and the stemmer, are only aware of the cedilla forms, because they were created at a time when the comma forms were not available or in regular use.

The changes from T325091 add a stopword filter with the comma forms and map the cedilla forms to the comma forms. However, the stemmer internals are not as accessible as adding new stopwords. I've started a discussion upstream to get them to process both forms, but that's going to take a long time (likely years) to trickle down from Snowball to Lucene to ? (maybe OpenSearch?) to us.

For now, the most effective way to handle this is, counterintuitively, to map everything to the outdated cedilla form internally, so that stemming can be done correctly on these forms. Users will never see the incorrect forms if the original text doesn't use them.
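As a rough illustration of the internal mapping described above, here is a minimal Python sketch. It is not the CirrusSearch patch itself (which is not shown in this task); the names `COMMA_TO_CEDILLA`, `LEGACY_STOP_WORDS`, `internal_form`, and `is_stop_word` are hypothetical, and the one-word stop list is only a stand-in for a legacy list that knows the cedilla spelling.

```python
# Comma-below forms (modern, correct) mapped to cedilla forms (legacy).
COMMA_TO_CEDILLA = str.maketrans({
    "\u0219": "\u015f",  # ș -> ş
    "\u0218": "\u015e",  # Ș -> Ş
    "\u021b": "\u0163",  # ț -> ţ
    "\u021a": "\u0162",  # Ț -> Ţ
})

# Hypothetical one-word stand-in for a legacy stop word list that only
# knows the cedilla spelling of "și".
LEGACY_STOP_WORDS = {"\u015fi"}


def internal_form(token: str) -> str:
    """Lowercase a token and map comma forms to cedilla forms.

    Only this internal form is changed; the displayed text is untouched.
    """
    return token.lower().translate(COMMA_TO_CEDILLA)


def is_stop_word(token: str) -> bool:
    """Check a token against the legacy list after mapping."""
    return internal_form(token) in LEGACY_STOP_WORDS
```

With this mapping, the comma and cedilla spellings of a word reach the legacy components as the same string, so a cedilla-only stop list or stemmer treats them identically.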

I'm not yet sure of the scope of the changes, but adding the corrected stopwords filtered around 3½% of words in my Romanian Wikipedia sample, and I was able to quickly find a word on the front page of the Romanian Wikipedia that is stemmed incorrectly because it uses the modern, correct comma form.

Event Timeline

@TJones is there something the Romanian community can help with? What would it take to accelerate the changes upstream?

@Strainu There's nothing to do, I don't think—but thanks very much for asking! You can follow the conversation about Lucene on GitHub. They are looking to move fairly quickly, and even said they'd update to the latest Snowball stemmer as soon as it is available.

The slow part is how long it takes for changes to make it downstream to users. We won't be able to continue with Elasticsearch because of licensing changes (see T272111), but we can use them as an example. After Lucene updates to the new stemmer, they have to incorporate it into a release. I don't know their release cadence, but that could be weeks to months. Then Elastic has to upgrade to the new version of Lucene, and then release a new version, which could also take months or longer, depending on their release cadence and upgrade plan (for example, it would be reasonable to not incorporate major upgrades to components in a minor release, and instead hold off until a semi-major or major release). Once that's released, we would have to upgrade to that version of Elastic, but that can be a major undertaking that takes months to actually do, once we decide to start.

An extra wrinkle is the Elastic license change, which makes us more hesitant to upgrade, because it also requires switching search engines. OpenSearch is Amazon's fork of Elastic. There may be some compatibility issues with our plugins and other features or other issues, and we may not go to OpenSearch if something better comes along (though it seems like a plausible intermediate step, at least). All of that uncertainty further delays any upgrade/engine switching plans.

It's just the nature of software development and deployment when you have a long chain of dependencies.

Change 893441 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Map Romanian s&t with comma to cedilla internally

https://gerrit.wikimedia.org/r/893441

Full notes on MediaWiki.

Since most words on Romanian Wikipedia and Wiktionary use the modern forms, those forms weren't stemmed correctly by the stemmer. Fixing that had a big impact—lots of stemming groups, lots of tokens! Just from the mapping change and its interaction with the stemmer:

  • Romanian Wiktionary: 1,424 tokens (0.941% of tokens) were merged into 601 groups (1.400% of groups)
  • Romanian Wikipedia: 33,978 tokens (1.837% of tokens) were merged into 1,778 groups (1.313% of groups)

So 0.9%–1.8% of tokens were not being stemmed correctly, but now will be!

Change 893441 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Map Romanian s&t with comma to cedilla internally

https://gerrit.wikimedia.org/r/893441

@TJones one more question regarding this change: will it affect the results on Romanian wikis? I know the search is diacritics-insensitive, but I wonder if cedilla redirects will be favored in relation to the redirect targets, which are using comma diacritics.

Hi @Strainu, I'm not quite sure how to directly answer your question, so let me give you some background info to make sure we are on the same page.

  • A stemmer tries to convert a word to a "stem," which is usually similar to the dictionary form of a word, though not always. For example, an English stemmer might convert juggle, juggling, and juggler to juggl, and that would be fine. The correct base form is best, but as long as related forms mostly end up with the same stem and unrelated forms mostly don't, it's working well.
  • Stop words are usually "function words" that don't add much meaning to a search, and are heavily discounted. In English, this includes a, an, the, and, but, to, at, of, in, is, are, this, these, and others. They make it so that President of the United States and United States President get the same results (though possibly ranked somewhat differently).
  • When you search in the box at the top of the screen (in the upper right in the old Vector skin in Romanian and English, and more or less in the middle in the new Vector 2022 skin), that's the Go Box, and the suggestions given while you type in the Go Box and in the main search box on the search results page are autocomplete suggestions.
  • When you get "Search results"/"Rezultatele căutării" on the Special:Search/Special:Căutare page, that's full-text search.
  • Full-text search uses two indexes. The text index lowercases words, uses the stemmer on them (if available), filters stop words (if available), and sometimes does additional normalization (like converting uncommon characters to more common versions, or removing "foreign" diacritics, which is called "folding"). The plain index does less transforming of the words, and just lowercases, and sometimes does some folding of foreign diacritics.
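The difference between the two indexes can be sketched with a toy analysis chain in Python. This is only an illustration under stated assumptions: `toy_stem` and its suffix list are made up for the English examples above (they are not the Snowball stemmer), whitespace tokenization is a simplification, folding is left out, and stop words are dropped before stemming here purely for simplicity.

```python
# English stop words taken from the examples above.
STOP_WORDS = {"a", "an", "the", "and", "but", "to", "at", "of", "in",
              "is", "are", "this", "these"}

# Hypothetical suffix list, just enough for the juggle/juggling/juggler example.
TOY_SUFFIXES = ("ing", "er", "e")


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def plain_index(text: str) -> list[str]:
    """Plain index: lowercase only (folding is omitted in this sketch)."""
    return text.lower().split()


def text_index(text: str) -> list[str]:
    """Text index: lowercase, drop stop words, then stem."""
    return [toy_stem(t) for t in plain_index(text) if t not in STOP_WORDS]
```

In this sketch, `text_index("juggle juggling juggler")` yields the same stem three times, and "President of the United States" and "United States President" produce the same set of terms, while `plain_index` keeps every word form distinct.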

search is diacritics-insensitive

For the most part, search is sensitive to diacritics. şi (cedilla) gets ~5K results on Romanian Wikipedia, while și (comma) gets 289K. Typing des, deş, and deș in the Go Box all get different suggestions. ("Descălecatul Moldovei", "Sahara", and "Deșteaptă-te, române!" are the top suggestions, respectively.)

will it affect the results on Romanian wikis?

Yes, and in several good ways!

  • The stemmer will work on inflections with comma forms. Right now, if you search for apreciați (comma), you get back only exact matches for apreciați because the stemmer doesn't know what to do with ț. Compare that to apreciaţi (cedilla), which has almost 20x as many results because it matches related forms, too, like aprecia, apreciat, apreciate, apreciată, apreciabilă, apreciabile, apreciabil, apreciau, apreciatul, apreciative, aprecere, etc. Once this change goes through, both apreciați and apreciaţi will return the same results.
  • The stop word list will ignore comma forms in the text index. Right now acești, aceștia, aș, așadar, ăștia, ați, aveți, câți, cîți, deși, ești, fiți, îți, mulți, niște, noștri, și, sînteți, sunteți, ți, ție, toți, totuși, and voștri (all comma forms) are not recognized as stop words, but the cedilla forms are. These words will still count for ranking, but much, much less, and will not be strictly required to match results (though exact matches in the title and in the plain index are always weighted more heavily).
  • Ranking should improve, though probably not in really obvious ways in most cases. Cedilla forms in queries will match more results, as above, but will also be properly weighted in ranking. Right now, for example, național matches 10K documents, but naţional only matches 1500, which makes it seem like a much rarer word (and thus a more valuable match). If there are any cases where the cedilla form occurs in a bunch of documents that the comma form does not, it could change the weighting of the comma form to be more accurate, too.
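To make the ranking point concrete, here is a small Python sketch using a simplified inverse document frequency, ln(N / df). This is an assumption for illustration only: the search engine's real scoring formula is more involved, `TOTAL_DOCS` is a made-up corpus size, and the document counts are the rounded figures quoted above.

```python
import math


def idf(doc_freq: int, total_docs: int) -> float:
    """Simplified inverse document frequency: rarer terms score higher."""
    return math.log(total_docs / doc_freq)


TOTAL_DOCS = 1_000_000  # hypothetical corpus size, not a real figure
COMMA_DF = 10_000       # documents matching național (comma), as quoted above
CEDILLA_DF = 1_500      # documents matching naţional (cedilla), as quoted above

# Before the mapping, the cedilla spelling looks rarer and gets more weight.
weight_gap = idf(CEDILLA_DF, TOTAL_DOCS) - idf(COMMA_DF, TOTAL_DOCS)
```

With this simplified formula the gap equals ln(10,000 / 1,500) regardless of the corpus size chosen. Once both spellings are mapped to one internal form they share a single document frequency, so the gap disappears.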

[will] cedilla redirects be favored in relation to the redirect targets, which are using comma diacritics?

Probably not differently than they are now.

  • Autocomplete suggestions will show either the redirect title or the article title, depending on the UI element you are searching in. In the Go Box in the new Vector 2022 skin, the main article title is always shown (so you always see the comma form). In the Go Box in the old Vector skin, and in the main search box on the Special:Search page in both skins, the redirect title is shown (so you would see the cedilla form if you type the cedilla form). That won't change (at least not because of search—the UI teams may eventually change the Special:Search box to behave like the Go Box in Vector 2022, but I'm not sure). As an example, typing Deşteaptă (cedilla form) in the Vector 2022 Go Box gives the comma form as the suggestion. Typing it in the search box gives the cedilla form as the suggestion.
  • In the full-text results, the redirect title is shown if there isn't at least a partial match between the query and the main title. For example, if you search for Deşteaptă-te române (cedilla form) the result is "Deșteaptă-te, române!" because te române matches the main title. If you search for just Deşteaptă, the result is "Deșteaptă-te, române! (redirecționare de la Deşteaptă-te, române)" because the two forms don't match. There's a lot of magic that goes into redirect highlighting, and I'm not familiar with all of it, so I'm not 100% sure, but I expect it will stay the same because it works on the original text of the title and redirect.

Okay, that went on for a while. Anyway... after T330783 is complete, please let us know if you see anything that's unexpectedly bad related to commas and cedillas!