
Reindex Hindi, Irish, Norwegian wikis to enable unpacked versions
Closed, Resolved · Public · 3 Estimated Story Points


Once T289612 is deployed (probably in MediaWiki_1.38/wmf.6), we can reindex the relevant wikis, to activate ICU normalization, ICU folding, and homoglyph normalization.
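As a rough illustration of what folding and homoglyph normalization do to a token, here is a minimal Python sketch. It is not the ICU-based analysis chain that the reindex activates. It only approximates the idea with the standard `unicodedata` module, it is only meant for Latin-script examples, and the names `fold`, `fix_homoglyphs`, `CYRILLIC_TO_LATIN`, and the `keep` parameter are all hypothetical.

```python
import unicodedata

# Hypothetical, deliberately tiny table of Cyrillic letters that look like Latin ones.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0441": "c",  # Cyrillic с
    "\u0445": "x",  # Cyrillic х
})


def fold(text, keep=""):
    """Lowercase text and strip combining marks, except for characters in `keep`.

    `keep` is an assumption of this sketch: a string of lowercase letters
    that should survive folding unchanged.
    """
    out = []
    # NFKC composes characters first, so "A" + combining ring becomes "Å".
    for ch in unicodedata.normalize("NFKC", text).casefold():
        if ch in keep:
            out.append(ch)
            continue
        # Decompose the character and drop any combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def fix_homoglyphs(token):
    """Map look-alike Cyrillic letters to Latin, but only in mixed-script tokens."""
    names = [unicodedata.name(ch, "") for ch in token]
    has_latin = any(name.startswith("LATIN") for name in names)
    has_cyrillic = any(name.startswith("CYRILLIC") for name in names)
    if has_latin and has_cyrillic:
        return token.translate(CYRILLIC_TO_LATIN)
    return token


print(fold("fáilte"))                     # Irish acute accent is stripped
print(fold("Ålesund", keep="åæø"))        # å is exempted from folding here
print(fix_homoglyphs("\u0441hocolate"))   # leading Cyrillic с becomes Latin c
```

The `keep` argument is there because a naive fold would also flatten letters that a language treats as distinct, such as Norwegian å. Which characters the deployed configuration actually exempts is not stated in this task.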

Acceptance Criteria

  • All wikis in the relevant languages are reindexed
  • A before-and-after analysis for each language's Wikipedia is provided

Event Timeline

TJones set the point value for this task to 3. (Oct 25 2021, 3:41 PM)

Full write-up on MediaWiki.


  • Specific new matches in all three (Irish, Hindi, & Norwegian) Wikipedias are good.
  • The impact overall on the zero-results rate is fairly small for all three.
  • The zero-results rate for Hindi Wikipedia, independent of recent changes, is really high (60+%), so I investigated a bit. Transliteration of Latin queries to Devanagari could have a sizable impact. (See T297761)
  • Irish and Norwegian had a sizable increase in total results, and a noticeable increase in top results. Hindi had much smaller increases for both.
  • Irish changes were dominated by Irish diacritics (which are not part of the alphabet), while the Norwegian changes were dominated by foreign diacritics.

@MPhamWMF, move this to "needs reporting" when you are done looking at the summary/write-up.

Feedback or questions from anyone are welcome, of course!

Thanks, @TJones! This was really interesting. It looks like the Hindi zero results are quite high, which also seems to correlate with hiwiki having relatively low Search Engagement. I wonder which other wikis having a better Latin-to-native transliterator would help with.

I wonder which other wikis having a better Latin-to-native transliterator would help with.

It's definitely an issue that we could try to address. There's an open ticket for Georgian (from Latin/Cyrillic): T127003. And there's a closed ticket with romanization in the title but DWIM ("wrong keyboard") in the description: T245677. I closed it because the difference between wrong keyboard and transliteration was never made clear and it was low priority.

I wouldn't mind getting a proper framework for these kinds of "second-chance" searches and implementing useful modules language-by-language. We have language detection models for wrong-keyboard and wrong-encoding Russian, though maybe detection isn't the right way to approach it. Suggestions or DWIM-style completion may be a good approach too. I over-engineered an early attempt at wrong-keyboard integration and it flailed and failed.