Reindex Romanian, Sorani wikis to enable unpacked analyzers
Closed, Resolved · Public · 2 Estimated Story Points

Assigned To

Authored By

	TJones
	Feb 28 2023, 6:13 PM

Description

Once T325091 and T330893 are deployed, reindex Romanian- and Sorani-language wikis to enable unpacked/upgraded analyzers

Current wiki counts:

Romanian (ro): 8 wikis
Sorani (ckb): 1 wiki

Related Objects

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved	TJones	T325091 Unpack Romanian, Sorani Elasticsearch Analyzers
Resolved	TJones	T330893 Map Romanian s&t with comma to cedilla internally
Resolved	TJones	T330783 Reindex Romanian, Sorani wikis to enable unpacked analyzers

Event Timeline

TJones created this task. Feb 28 2023, 6:13 PM

Restricted Application added a subscriber: Strainu. Feb 28 2023, 6:13 PM

TJones set the point value for this task to 2. Feb 28 2023, 6:13 PM

TJones moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

Someone from Lucene pointed out that the problem with Romanian s&t with comma vs cedilla (șț vs şţ) is not just a stopword problem, but also a stemmer problem. I think it's probably a pretty big issue, so I want to test and commit a patch for that, too, before we reindex everything.
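To make the comma-versus-cedilla problem concrete, here is a minimal Python sketch of the character mapping named in T330893. It is only an illustration: the function name `fold_comma_to_cedilla` and the table name are hypothetical, it handles precomposed characters only, and it is not the actual analyzer configuration used in production.

```python
# Hypothetical illustration of mapping Romanian comma-below letters
# to their cedilla counterparts so both spellings compare equal.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0219": "\u015F",  # ș (s with comma below) -> ş (s with cedilla)
    "\u021B": "\u0163",  # ț (t with comma below) -> ţ (t with cedilla)
    "\u0218": "\u015E",  # Ș -> Ş
    "\u021A": "\u0162",  # Ț -> Ţ
})


def fold_comma_to_cedilla(text: str) -> str:
    """Return text with comma-below s/t replaced by cedilla s/t."""
    return text.translate(COMMA_TO_CEDILLA)


# The same word typed with comma forms and with cedilla forms.
with_comma = "\u0219tiin\u021B\u0103"
with_cedilla = "\u015Ftiin\u0163\u0103"
print(fold_comma_to_cedilla(with_comma) == fold_comma_to_cedilla(with_cedilla))
```

If a stopword list or stemmer only recognizes one of the two spellings, applying a mapping like this before those steps lets both spellings reach them in the same form.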

TJones added a parent task: T330893: Map Romanian s&t with comma to cedilla internally. Mar 1 2023, 3:31 PM

TJones updated the task description.

TJones mentioned this in T330893: Map Romanian s&t with comma to cedilla internally. Mar 7 2023, 7:16 PM

T330893 is merged but not yet deployed, so I'm moving this back into Ready for Dev. We still have to wait for 1.40.0-wmf.27 to hit production.

TJones claimed this task. Mar 23 2023, 6:25 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.

Full write-up on MediaWiki.

Highlights:

Romanian Wikipedia saw a big impact: more than 1 in 4 queries returned more results because of the merger of ş/ș and ţ/ț forms, which affects queries directly and also affects stemmed matches for words in queries and in the article text. There was also a nice (but more standard) improvement in zero-results rate from diacritic folding: 24.4% to 23.7% (-0.7% absolute change; -2.9% relative change).
Sorani Wikipedia also had a nice but normal-range improvement in zero-results rate from Arabic-script normalizations: 38.8% to 38.3% (-0.5% absolute change; -1.3% relative change). It also shows some instability for <1% of queries because of its smaller size.
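For anyone checking the figures, the absolute and relative changes above can be reproduced with a short Python sketch. The helper name `rate_change` is hypothetical, and rounding to one decimal place is an assumption chosen to match how the numbers are reported here.

```python
def rate_change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    # Round to one decimal place, matching the reported figures.
    return round(absolute, 1), round(relative, 1)


print(rate_change(24.4, 23.7))  # Romanian zero-results rate: (-0.7, -2.9)
print(rate_change(38.8, 38.3))  # Sorani zero-results rate: (-0.5, -1.3)
```

The absolute change is the difference between the two rates in percentage points, while the relative change expresses that difference as a share of the starting rate.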

@MPhamWMF, move this to "Needs Reporting" when you are done looking it over, please.

MPhamWMF moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Mar 28 2023, 4:31 AM

Gehel closed this task as Resolved. Mar 31 2023, 8:03 AM

TJones mentioned this in T147505: [tracking] CirrusSearch: what is updated during re-indexing. May 1 2023, 6:07 PM
