Unpack Romanian, Sorani Elasticsearch Analyzers
Closed, Resolved · Public · 5 Estimated Story Points

Assigned To

Authored By

	TJones
	Dec 13 2022, 6:58 PM

Description

See parent task for details.

(These were chosen next more or less at random.)

[Spun off Turkish because it became more complicated.]

Details

	Subject	Repo	Branch	Lines +/-
	Unpack Romanian and Sorani Analyzers	mediawiki/extensions/CirrusSearch	master	+442 -21


Related Objects

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved	TJones	T325091 Unpack Romanian, Sorani Elasticsearch Analyzers
Resolved	TJones	T330893 Map Romanian s&t with comma to cedilla internally
Resolved	TJones	T330783 Reindex Romanian, Sorani wikis to enable unpacked analyzers

Event Timeline

TJones created this task. Dec 13 2022, 6:58 PM

Restricted Application added a subscriber: Aklapper. Dec 13 2022, 6:58 PM

TJones triaged this task as High priority. Dec 13 2022, 6:58 PM

TJones added a parent task: T272606: [EPIC] Unpack all Elasticsearch analyzers.

TJones moved this task from needs triage to Language Stuff on the Discovery-Search board.

TJones set the point value for this task to 5.

TJones mentioned this in T272606: [EPIC] Unpack all Elasticsearch analyzers. Dec 13 2022, 7:02 PM

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search. Dec 13 2022, 7:22 PM

TJones edited projects, added Discovery-Search; removed Discovery-Search (Current work). Jan 9 2023, 4:20 PM

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search. Jan 26 2023, 9:08 PM

TJones moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board. Jan 30 2023, 2:59 PM

TJones claimed this task. Jan 30 2023, 4:20 PM

Found a problem with the Turkish apostrophe filter: it is very aggressive and mangles French and Italian words (which are common in names, sources, etc.). For example, d'Onofrio'nun, d'administration, d'administration'dan, and d'Arthur'unda all get indexed as plain d. Not optimal.
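
To illustrate why those tokens collapse to d, here is a minimal Python sketch. It assumes the filter keeps only the text before the first apostrophe (straight or curly), which is how Lucene's apostrophe filter is documented to behave. The function name and the sample list are hypothetical, and this is a simulation, not the Lucene code.

```python
# Simulation only: assumes the filter drops the first apostrophe
# (straight ' or curly U+2019) and everything after it.
APOSTROPHES = {"'", "\u2019"}


def strip_after_apostrophe(token: str) -> str:
    """Return the part of the token that precedes the first apostrophe."""
    for i, ch in enumerate(token):
        if ch in APOSTROPHES:
            return token[:i]
    return token


# Tokens from the comment above, lowercased as an analyzer would do.
examples = [
    "d'Onofrio'nun",
    "d'administration",
    "d'administration'dan",
    "d'Arthur'unda",
]
reduced = {tok: strip_after_apostrophe(tok.lower()) for tok in examples}

# A typical Turkish suffixed form, where this behaviour is what you want.
turkish = strip_after_apostrophe("ankara'da")
```

Cutting at the first apostrophe works for Turkish suffixes such as ankara'da, but it throws away the stem of elided French and Italian forms, which is the failure described above.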

I've come up with a bunch of heuristics that improve the apostrophe processing. Implementing them as a collection of existing filters is a mess, so making a plugin seems like a good approach—it also makes the logic more easily reusable by others.

I'm going to spin off Turkish as its own ticket and finish up the other two first.

TJones renamed this task from Unpack Romanian, Turkish, Sorani Elasticsearch Analyzers to Unpack Romanian, Sorani Elasticsearch Analyzers. Feb 15 2023, 4:32 PM

TJones updated the task description.

TJones mentioned this in T329762: Unpack Turkish Analyzer and improve apostrophe handling. Feb 15 2023, 4:38 PM

Full notes are on MediaWiki.

Sorani was pretty straightforward.
- ICU normalization cleaned up a lot of initial/medial/final form Arabic letters.
- ICU folding had more impact on Latin than Arabic-script tokens, but plenty of both.
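
A rough Python sketch of the two effects above. It uses the standard library's `unicodedata` as a stand-in for the ICU normalization and ICU folding filters, which is an assumption: the real filters do considerably more (case folding, many more foldings). The function names are hypothetical.

```python
import unicodedata


def normalize_token(token: str) -> str:
    """Stand-in for ICU normalization: NFKC maps Arabic presentation
    forms (initial/medial/final/isolated) back to the base letters."""
    return unicodedata.normalize("NFKC", token)


def fold_token(token: str) -> str:
    """Stand-in for ICU folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# U+FEDF (lam, initial form) + U+FE8E (alef, final form)
presentation_forms = "\uFEDF\uFE8E"
base_letters = normalize_token(presentation_forms)  # U+0644 U+0627

# A Latin-script token with a diacritic
folded_latin = fold_token("caf\u00E9")
```

After normalization, tokens typed with presentation-form code points match tokens typed with ordinary Arabic letters, so they are indexed as the same term.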

Romanian got interesting!
- There's often confusion of s/t with cedilla (ş & ţ) and the more correct forms with comma (ș & ț), for historical encoding/support reasons. Added a filter to correct for it, and it had a nice effect.
- However, it decreased the number of stop words filtered, because the stop word list only uses the incorrect/older cedilla forms! Enabled additional stop words with the comma forms; ~3.4% of terms in the Wikipedia sample are the Romanian word for "and" (și), which should be filtered as a stop word!
- I'll contact the stop word list creator and open a ticket and/or pull request upstream for Lucene to include both forms.
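
The interaction between the cedilla-to-comma correction and the stop word list can be sketched in Python as follows. The three-word stop list is a hypothetical excerpt, and the function names are made up for illustration; the actual CirrusSearch filter configuration is in the patch, not here.

```python
# Map the cedilla forms (U+015E/F, U+0162/3) to the comma-below forms
# (U+0218/9, U+021A/B).
CEDILLA_TO_COMMA = str.maketrans({
    "\u015F": "\u0219",  # s-cedilla -> s-comma
    "\u0163": "\u021B",  # t-cedilla -> t-comma
    "\u015E": "\u0218",  # S-cedilla -> S-comma
    "\u0162": "\u021A",  # T-cedilla -> T-comma
})


def to_comma_forms(text: str) -> str:
    return text.translate(CEDILLA_TO_COMMA)


def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t not in stop_words]


# Hypothetical excerpt of a stop list that only has the cedilla spelling
# of "and" (U+015F + i).
cedilla_stop_words = {"\u015Fi", "de", "la"}

# The same list with comma-form variants added alongside the originals.
extended_stop_words = cedilla_stop_words | {
    to_comma_forms(w) for w in cedilla_stop_words
}

raw_tokens = ["Ion", "\u015Fi", "Maria", "\u0219i", "Ana"]
tokens = [to_comma_forms(t.lower()) for t in raw_tokens]

# After the correction, both spellings of "and" are comma forms, so the
# cedilla-only list no longer matches either of them.
before = remove_stop_words(tokens, cedilla_stop_words)
after = remove_stop_words(tokens, extended_stop_words)
```

With the cedilla-only list, `before` still contains both occurrences of și; with the extended list, `after` contains only the three names.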

Patch forthcoming.

Change 891867 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack Romanian and Sorani Analyzers

https://gerrit.wikimedia.org/r/891867

gerritbot added a project: Patch-For-Review. Feb 24 2023, 9:51 PM

TJones moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Feb 24 2023, 9:52 PM

MPhamWMF moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Feb 27 2023, 3:34 PM

MPhamWMF moved this task from Needs Reporting to Needs review on the Discovery-Search (Current work) board. Feb 27 2023, 4:15 PM

Change 891867 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Unpack Romanian and Sorani Analyzers

https://gerrit.wikimedia.org/r/891867

ReleaseTaggerBot added a project: MW-1.40-notes (1.40.0-wmf.25; 2023-02-27). Feb 28 2023, 1:00 AM