Reindex Romanian, Sorani wikis to enable unpacked analyzers
Closed, Resolved | Public | 2 Estimated Story Points

Description

Once T325091 and T330893 are deployed, reindex Romanian- and Sorani-language wikis to enable unpacked/upgraded analyzers.

Current wiki counts:

  • Romanian (ro): 8 wikis
  • Sorani (ckb): 1 wiki

Event Timeline

TJones set the point value for this task to 2. (Feb 28 2023, 6:13 PM)

Someone from Lucene pointed out that the problem with Romanian s and t with comma vs. cedilla (șț vs şţ) is not just a stopword problem, but also a stemmer problem. I think it's probably a pretty big issue, so I want to test and commit a patch for that, too, before we reindex everything.
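To illustrate why the two spellings miss each other, here is a minimal Python sketch that maps the cedilla code points onto the comma-below code points so both spellings compare equal. The helper name `unify_romanian_st` and the choice of direction (cedilla to comma) are assumptions made for illustration only; this is not the actual analyzer patch, which may normalize in the other direction to match what the stemmer and stopword list expect.

```python
# Hypothetical helper, for illustration only (not the CirrusSearch/Lucene patch).
# Cedilla forms (older encoding) are mapped to comma-below forms.
# The direction of the mapping is an assumption; what matters is that
# both spellings end up as the same code points before stopword
# filtering and stemming.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla -> S with comma below
    "\u015F": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021A",  # T with cedilla -> T with comma below
    "\u0163": "\u021B",  # t with cedilla -> t with comma below
})


def unify_romanian_st(text: str) -> str:
    """Return text with cedilla s/t replaced by comma-below s/t."""
    return text.translate(CEDILLA_TO_COMMA)


# The same word typed with cedilla forms and with comma forms.
with_cedilla = "\u015Ftiin\u0163e"
with_comma = "\u0219tiin\u021Be"

# Without normalization the strings differ, so an exact-match stopword
# list or a suffix-based stemmer would treat them as different words.
print(with_cedilla == with_comma)
print(unify_romanian_st(with_cedilla) == unify_romanian_st(with_comma))
```

Applying the same mapping to both the indexed text and the query is what lets either spelling match the other.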

T330893 is merged, though not deployed, so I'm moving this back into ready for dev, though we still have to wait for 1.40.0-wmf.27 to hit production.

Full write-up on MediaWiki.

Highlights:

  • Romanian Wikipedia saw a big impact: more than 1 in 4 queries got more results because of the merger of ş/ș and ţ/ț forms, which affects queries directly and also affects stemmed matches for words in queries and in the article text. There was also a nice (but more standard) change to zero-results rate from diacritic folding: 24.4% to 23.7% (-0.7% absolute change; -2.9% relative change).
  • Sorani Wikipedia also had a nice but normal-range change to zero-results rate from Arabic-script normalizations: 38.8% to 38.3% (-0.5% absolute change; -1.3% relative change). It also shows some instability for <1% of queries because of its smaller size.
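
For anyone checking the numbers above, the absolute and relative changes can be reproduced with a short Python sketch. The function name `zero_results_change` is hypothetical, and rounding to one decimal place is an assumption chosen to match how the figures are reported here.

```python
# Hypothetical helper for reproducing the reported zero-results figures.
def zero_results_change(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after_pct - before_pct
    relative = absolute / before_pct * 100
    # Rounding to one decimal is an assumption matching the write-up's precision.
    return round(absolute, 1), round(relative, 1)


romanian = zero_results_change(24.4, 23.7)
sorani = zero_results_change(38.8, 38.3)
print("Romanian:", romanian)
print("Sorani:", sorani)
```

The relative change is the absolute change divided by the starting rate, which is why a smaller absolute drop on Sorani's higher baseline yields a smaller relative change.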
TJones added a subscriber: MPhamWMF.

@MPhamWMF, move this to "Needs Reporting" when you are done looking it over, please.