Reindex Basque, Catalan, Danish wikis to enable unpacked versions
Closed, Resolved | Public | 3 Estimated Story Points

Description

Once T283366 is deployed (probably in MediaWiki_1.37/wmf.11), we can reindex the relevant wikis, to activate ICU normalization, ICU folding, and homoglyph normalization.

Current counts are: Basque, 5 wikis; Catalan, 6 wikis; Danish, 6 wikis.

Acceptance Criteria

  • All wikis in the relevant languages are reindexed
  • A before-and-after analysis for each language's Wikipedia is provided

Event Timeline

TJones set the point value for this task to 3.
TJones removed TJones as the assignee of this task. Jun 9 2021, 7:22 PM
TJones updated the task description.

This was blocked by the removal of my directories and files on mwmaint1002 after the data center switchover. My files have been restored, but David needs to reindex over 800 wikis for the ores_articletopicsweighted_tags rename. David's reindex will cover all of the wikis relevant to this ticket except dkwikimedia, which I just reindexed by itself, since it only has ~600 documents.

I've moved this to waiting, since it is now just waiting on David's reindex. I'll do some validation when it's done just to double-check, but I don't expect any problems for these wikis (they are much smaller than wikidata, commons, and enwiki, which is where problems are most likely to happen).

David's mass reindexing may make it harder to do the before-and-after analysis for this ticket. I'll capture what I can and report it here.

TJones renamed this task from Reindex Basque, Catalan, Danish Wikis to Reindex Basque, Catalan, Danish wikis to enable unpacked versions. Aug 2 2021, 8:04 PM
TJones triaged this task as High priority.

Full notes are on MediaWiki.

Overall, I'm trying to streamline the impact analysis process, so I'm only calling out the expected reindexing impacts (decreased zero-results rate, increased number of results for some queries, and changes in top queries from folding diacritics), and any unexpected impacts.

Summary:

  • Catalan has a very large improvement in zero-results rate (an 8.1% relative improvement, meaning about 1 in 12 queries that previously returned nothing now return results), largely driven by the fact that people type -cio for -ció (which is cognate with Spanish -ción and English -tion).
  • In general, the impact on Danish was very mild; the general variability in Danish query results is lower than for other wikis.
  • Basque improvements are in large part due to queries in Spanish that are missing the expected Spanish accents.
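To illustrate why folding diacritics helps queries like -cio for -ció, here is a minimal sketch in Python. It only approximates diacritic folding with the standard-library unicodedata module; it is not the ICU folding used in the actual analysis chain, and the function name fold_diacritics is hypothetical.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Rough stand-in for diacritic folding (NOT the real ICU folding filter).

    Decomposes characters (NFD), drops combining marks such as the acute
    accent, recomposes (NFC), and case-folds the result.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped).casefold()


# A query typed without the accent folds to the same form as the indexed,
# accented word, so the two can match.
print(fold_diacritics("informació") == fold_diacritics("informacio"))  # True
print(fold_diacritics("canción") == fold_diacritics("cancion"))        # True
```

With folding applied at both index and query time, the unaccented query and the accented text reduce to the same token, which is the mechanism behind the lower zero-results rates reported above. Real ICU folding covers many more cases than this sketch.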