Reindex Hindi, Irish, Norwegian wikis to enable unpacked versions
Closed, Resolved, Public, 3 Estimated Story Points

Assigned To

Authored By

	TJones
	Oct 25 2021, 3:01 PM

Description

Once T289612 is deployed (probably in MediaWiki_1.38/wmf.6), we can reindex the relevant wikis, to activate ICU normalization, ICU folding, and homoglyph normalization.
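As a rough illustration of what these three steps do to a token at analysis time, here is a minimal Python sketch. It is not the ICU or Elasticsearch implementation: it approximates normalization and folding with the standard-library `unicodedata` module, and the homoglyph table and function names (`CYRILLIC_TO_LATIN`, `normalize_homoglyphs`, `fold`) are hypothetical. It is also not language-aware, so it strips diacritics that a language-specific configuration might deliberately keep.

```python
import unicodedata

# Hypothetical, deliberately tiny homoglyph table: Cyrillic letters that
# look like Latin ones. A real table would be much larger.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
})


def normalize_homoglyphs(token: str) -> str:
    """Map look-alike Cyrillic letters to Latin, but only in tokens that
    already contain a basic Latin letter (i.e. mixed-script tokens)."""
    has_latin = any("a" <= ch <= "z" for ch in token)
    return token.translate(CYRILLIC_TO_LATIN) if has_latin else token


def fold(token: str) -> str:
    """Approximate normalization + folding + homoglyph handling."""
    # Normalization: compatibility composition, then case folding.
    token = unicodedata.normalize("NFKC", token).casefold()
    # Folding: decompose and drop combining marks (diacritics).
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return normalize_homoglyphs(stripped)
```

With this sketch, `fold("Seán")` and `fold("Sean")` yield the same token, which is the kind of new match the reindex is meant to enable.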

Acceptance Criteria

All wikis in the relevant languages are reindexed
A before-and-after analysis for each language's Wikipedia is provided

Related Objects

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved	TJones	T289612 Unpack Hindi, Irish, Norwegian Elasticsearch Analyzers
Resolved	TJones	T294257 Reindex Hindi, Irish, Norwegian wikis to enable unpacked versions

Event Timeline

TJones created this task. Oct 25 2021, 3:01 PM

TJones set the point value for this task to 3. Oct 25 2021, 3:41 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board. Oct 25 2021, 3:56 PM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board. Dec 8 2021, 4:20 PM

TJones mentioned this in T297761: Create a Latin-to-Devanagari transliteration second-chance search for Hindi wikis. Dec 14 2021, 11:48 PM

Full write-up on MediaWiki.

Summary

Specific new matches in all three (Irish, Hindi, & Norwegian) Wikipedias are good.
The impact overall on the zero-results rate is fairly small for all three.
The zero-results rate for Hindi Wikipedia, independent of recent changes, is really high (60+%), so I investigated a bit. Transliteration of Latin queries to Devanagari could have a sizable impact. (See T297761)
Irish and Norwegian had a sizable increase in total results, and a noticeable increase in top results. Hindi had much smaller increases for both.
Irish changes were dominated by Irish diacritics (which are not part of the alphabet), while the Norwegian changes were dominated by foreign diacritics.
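For readers unfamiliar with the metrics in this summary, the sketch below shows one way a before-and-after comparison of this kind could be computed. It is an assumption-laden illustration, not the tooling used for the write-up: the function names and the input shape (a mapping from query string to total result count, one for the old index and one for the new) are hypothetical.

```python
from typing import Dict


def zero_results_rate(counts: Dict[str, int]) -> float:
    """Fraction of queries that returned no results."""
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n == 0) / len(counts)


def compare(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Compare result counts for queries run against both indexes."""
    shared = before.keys() & after.keys()
    old = {q: before[q] for q in shared}
    new = {q: after[q] for q in shared}
    return {
        "zrr_before": zero_results_rate(old),
        "zrr_after": zero_results_rate(new),
        # Queries that had no results before and have some now.
        "rescued": sum(1 for q in shared if old[q] == 0 and new[q] > 0),
        # Queries whose total result count went up.
        "more_results": sum(1 for q in shared if new[q] > old[q]),
    }
```

Only queries present in both mappings are compared, so the two rates are measured over the same sample.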

@MPhamWMF, move this to "needs reporting" when you are done looking at the summary/write-up.

Feedback or questions from anyone are welcome, of course!

Thanks, @TJones! This was really interesting. It looks like the Hindi zero results are quite high, which also seems to correlate with hiwiki having relatively low Search Engagement. I wonder which other wikis having a better Latin to native transliterator would help with.

MPhamWMF moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. Dec 16 2021, 2:58 PM

In T294257#7575370, @MPhamWMF wrote:

I wonder which other wikis having a better Latin to native transliterator would help with.

It's definitely an issue that we could try to address. There's an open ticket for Georgian (from Latin/Cyrillic): T127003. And there's a closed ticket with romanization in the title but DWIM ("wrong keyboard") in the description: T245677. I closed it because the difference between wrong keyboard and transliteration was never made clear and it was low priority.

I wouldn't mind getting a proper framework for these kinds of "second-chance" searches and implementing useful modules language-by-language. We have language detection models for wrong-keyboard and wrong-encoding Russian, though maybe detection isn't the right way to approach it. Suggestions or DWIM-style completion may be a good approach too. I over-engineered an early attempt at wrong-keyboard integration and it flailed and failed.

TJones mentioned this in T147505: [tracking] CirrusSearch: what is updated during re-indexing. Dec 16 2021, 7:55 PM

Gehel closed this task as Resolved. Jan 10 2022, 1:42 PM
