Description
Status | Assigned | Task
---|---|---
Open | None | T219550 [EPIC] Harmonize language analysis across languages
Resolved | Gehel | T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved | TJones | T325091 Unpack Romanian, Sorani Elasticsearch Analyzers
Resolved | TJones | T330893 Map Romanian s&t with comma to cedilla internally
Resolved | TJones | T330783 Reindex Romanian, Sorani wikis to enable unpacked analyzers
Event Timeline
Someone from Lucene pointed out that the problem with Romanian s&t with comma vs cedilla (șț vs şţ) is not just a stopword problem, but also a stemmer problem. I think it's probably a pretty big issue, so I want to test and commit a patch for that, too, before we reindex everything.
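For illustration, here is a minimal Python sketch of the kind of character mapping that T330893 describes (comma forms mapped to cedilla forms before stopword filtering and stemming). It is not the actual analyzer configuration; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are hypothetical, and the sketch assumes that the downstream stopword list and stemmer expect the cedilla forms, as the task title suggests.

```python
# Hypothetical sketch: map Romanian comma-below letters to their cedilla
# counterparts so that both spellings reach later analysis steps as one form.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0219": "\u015F",  # ș (s with comma below) -> ş (s with cedilla)
    "\u021B": "\u0163",  # ț (t with comma below) -> ţ (t with cedilla)
    "\u0218": "\u015E",  # Ș -> Ş
    "\u021A": "\u0162",  # Ț -> Ţ
})


def normalize_romanian(text: str) -> str:
    """Return text with comma-below s/t replaced by the cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)
```

With a mapping like this, the comma and cedilla spellings of the same word normalize to an identical string, so a stopword lookup or stemmer written for one form would treat both alike.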
T330893 is merged, though not deployed, so I'm moving this back into ready for dev, though we still have to wait for 1.40.0-wmf.27 to hit production.
Full write-up on MediaWiki.
Highlights:
- Romanian Wikipedia saw a big impact: the number of results increased for more than 1 in 4 queries because of the merger of ş/ș and ţ/ț forms, which affects queries directly and also affects stemmed matches for words in queries and in the article text. There was also a nice (but more standard) drop in the zero-results rate from diacritic folding: 24.4% to 23.7% (-0.7% absolute change; -2.9% relative change).
- Sorani Wikipedia also had a nice but normal-range drop in the zero-results rate from Arabic-script normalizations: 38.8% to 38.3% (-0.5% absolute change; -1.3% relative change). It also shows some instability for <1% of queries because of its smaller size.
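
The absolute and relative changes quoted above can be reproduced with a short calculation. This is only a sketch of the arithmetic; the helper name `rate_change` is hypothetical and is not part of any analysis tooling mentioned in the write-up.

```python
def rate_change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute, relative) change in a percentage rate.

    absolute: difference in percentage points (after - before)
    relative: that difference as a percentage of the starting rate
    Both values are rounded to one decimal place.
    """
    absolute = after - before
    relative = absolute / before * 100
    return round(absolute, 1), round(relative, 1)


# Zero-results rates from the highlights above.
romanian = rate_change(24.4, 23.7)
sorani = rate_change(38.8, 38.3)
```

The absolute change is the difference in percentage points, while the relative change expresses that difference as a share of the starting rate, which is why the two figures differ for the same pair of rates.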