
Evaluate the benefits of adding wikidata aliases to cirrus indices
Closed, Resolved · Public

Description

Before adding wikidata aliases to the cirrus indices, we should first evaluate what the benefit would be.
We could write a simple script for that purpose.

In the end, if the zero results rate (ZRR) for alias queries is high, then adding aliases could help reduce the Cirrus ZRR. If it is low, then it is not worth the effort, since it means wikidata aliases are already covered by the cirrus docs.

We should run 2 different tests:

  • add aliases from the same language
  • add all aliases
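
The two test variants above differ only in which aliases are selected per entity. A minimal sketch of that selection follows, assuming the entity is available as a dict in the Wikidata entity JSON layout (`aliases` keyed by language code, each entry carrying a `value`). The function name `select_aliases` and the sample entity are hypothetical and only for illustration.

```python
def select_aliases(entity, wiki_lang=None):
    """Return the alias strings of a Wikidata entity dict.

    wiki_lang=None selects aliases in every language ("add all aliases");
    otherwise only aliases in that language code are returned
    ("add aliases from the same language").
    """
    aliases_by_lang = entity.get("aliases", {})
    if wiki_lang is None:
        langs = sorted(aliases_by_lang)
    else:
        langs = [wiki_lang] if wiki_lang in aliases_by_lang else []

    seen = set()
    result = []
    for lang in langs:
        for alias in aliases_by_lang[lang]:
            value = alias["value"]
            # The same string can be an alias in several languages; keep it once.
            if value not in seen:
                seen.add(value)
                result.append(value)
    return result


# Hypothetical entity, shaped like the Wikidata entity JSON.
example_entity = {
    "id": "Q0",  # placeholder id, not a real entity
    "aliases": {
        "en": [{"language": "en", "value": "NYC"},
               {"language": "en", "value": "Big Apple"}],
        "it": [{"language": "it", "value": "Grande Mela"},
               {"language": "it", "value": "NYC"}],
    },
}

same_language = select_aliases(example_entity, "en")
all_languages = select_aliases(example_entity)
```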

Event Timeline

dcausse raised the priority of this task from to Needs Triage.
dcausse updated the task description.
dcausse subscribed.
Restricted Application added a subscriber: Aklapper.
Deskana triaged this task as Medium priority. Oct 8 2015, 4:51 PM
Deskana subscribed.

I ran a test with the data available in our hypothesis testing cluster (en, it, de, ru).
I extracted 10% of the pages with a wikibase entity id and ran a query against wikidata to load the aliases.
Each alias was then run as a query against our index to count zero results.
At a glance, there are very few entities with aliases (except on ru and de wiki), and very few of these aliases would help to find more results if they were added to the cirrus indices.
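
The procedure described above (sample pages that have an entity id, load their aliases, run each alias as a query, count zero results) can be sketched roughly as follows. This is not the script that was actually run: `fetch_aliases` and `search_hit_count` are hypothetical callables standing in for the wikidata lookup and the cirrus query, the `entity_id` key is an assumed page layout, and counting individual alias strings (rather than entities) is also an assumption.

```python
import random


def count_zero_result_aliases(pages, fetch_aliases, search_hit_count,
                              sample_rate=0.1, seed=0):
    """Sample pages, look up their aliases, and count aliases with no hits.

    pages            -- iterable of dicts with an "entity_id" key (assumed layout)
    fetch_aliases    -- hypothetical callable: entity id -> list of alias strings
    search_hit_count -- hypothetical callable: query string -> number of hits
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    tested = 0
    aliases_found = 0
    zero = 0
    for page in pages:
        # Keep roughly sample_rate of the pages (10% in the test above).
        if rng.random() >= sample_rate:
            continue
        tested += 1
        for alias in fetch_aliases(page["entity_id"]):
            aliases_found += 1
            if search_hit_count(alias) == 0:
                zero += 1
    return {"tested": tested, "aliases": aliases_found, "zero": zero}


# Tiny in-memory stand-ins for the wikidata and cirrus calls.
demo_pages = [{"entity_id": "Q1"}, {"entity_id": "Q2"}, {"entity_id": "Q3"}]
demo_aliases = {"Q1": ["a1", "a2"], "Q2": [], "Q3": ["c1"]}
demo_hits = {"a1": 3, "a2": 0, "c1": 0}

demo_stats = count_zero_result_aliases(
    demo_pages,
    fetch_aliases=lambda entity_id: demo_aliases[entity_id],
    search_hit_count=lambda query: demo_hits[query],
    sample_rate=1.0,  # test every page in the demo
)
```

Passing the two lookups in as callables keeps the counting logic independent of how wikidata and the search index are actually reached.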

wiki  total pages  tested   aliases found  zero  Data
en    4 898 988    489 898  14 533         583   missing aliases
de    1 642 622    164 262  36 116         107   missing aliases
it    327 580      32 758   2 618          15    missing aliases
ru    1 140 431    114 043  39 134         133   missing aliases
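
To put the table above in proportion, the share of aliases that returned zero results can be computed per wiki. The sketch below only restates the numbers from the table; reading "zero" as a fraction of "aliases found" is an interpretation, and the names `rows` and `zero_share` are introduced here for illustration.

```python
# Figures copied from the table above.
rows = {
    "en": {"total": 4898988, "tested": 489898, "aliases": 14533, "zero": 583},
    "de": {"total": 1642622, "tested": 164262, "aliases": 36116, "zero": 107},
    "it": {"total": 327580, "tested": 32758, "aliases": 2618, "zero": 15},
    "ru": {"total": 1140431, "tested": 114043, "aliases": 39134, "zero": 133},
}


def zero_share(row):
    """Fraction of the aliases found that returned zero results."""
    return row["zero"] / row["aliases"]


for wiki, row in rows.items():
    sampled = row["tested"] / row["total"]  # should be close to the 10% sample
    print(f"{wiki}: sampled {sampled:.1%} of pages, "
          f"{zero_share(row):.2%} of aliases returned zero results")
```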

I'm afraid that, even in the very optimistic case where each of these missing aliases gets one hit per day, adding them would not reduce the ZRR significantly.

Note that this test only included aliases for the same language (i.e. en aliases for enwiki).

We could try to add labels and aliases for all languages, but I think this would cover the same use case addressed by T110078.

  • T110078 is interesting because it will cover the full content of the target wiki (but depends on the language detector precision).
  • Adding labels and aliases for all languages into the same index is also interesting because we could run a single query over all languages and we won't rely on the performance of the language detector.

This multi-language index already exists: it's the cirrus index for wikidata. Example: https://it.wikipedia.org/w/index.php?title=Speciale%3ARicerca&profile=default&search=Kami+wa+Nihon+wo+Nikunderu&fulltext=Search
In this case wikidata was able to find the document, and the wikidata entity has a link to itwiki. So we should try to integrate this result in a seamless manner, perhaps with a small indication that the result came from wikidata.
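
One way to picture the integration described above: take the entities returned by the wikidata search and keep only those with a sitelink to the local wiki, tagging each result with its origin. This is a rough sketch, assuming entities in the Wikidata entity JSON layout (`sitelinks` keyed by site id such as `itwiki`, each with a `title`). The function name, the result shape and the sample data are hypothetical.

```python
def local_results_from_wikidata(entities, site_id):
    """Map wikidata search hits to pages on the local wiki via sitelinks.

    entities -- list of entity dicts (assumed Wikidata entity JSON layout)
    site_id  -- site id of the local wiki, e.g. "itwiki"
    """
    results = []
    for entity in entities:
        sitelink = entity.get("sitelinks", {}).get(site_id)
        if sitelink is None:
            # The entity has no page on this wiki, so there is nothing to show.
            continue
        results.append({
            "title": sitelink["title"],
            "entity_id": entity["id"],
            "source": "wikidata",  # lets the UI indicate where the hit came from
        })
    return results


# Hypothetical search hits; ids and titles are placeholders.
demo_entities = [
    {"id": "Q0", "sitelinks": {"itwiki": {"site": "itwiki", "title": "Pagina di esempio"},
                               "enwiki": {"site": "enwiki", "title": "Example page"}}},
    {"id": "Q00", "sitelinks": {"enwiki": {"site": "enwiki", "title": "Other page"}}},
]

demo_results = local_results_from_wikidata(demo_entities, "itwiki")
```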

This would be a "simple way" to address this missing aliases problem and also user queries with names in a different language or with small transliteration variations.

@Deskana: There is a question of what to do with this, and we think you are the best person to answer.

Assigning to @Deskana so it's clear on the board who this is waiting for.

We may also have to do this if we want to do T117494 for performance reasons, so that we get one result from Elasticsearch instead of going to SQL.

Thanks for the analysis, @dcausse! Given the above analysis, it does not seem to me that this avenue is worth going down at this stage, so I'm removing this from the sprint and placing it back into the backlog.

Really, this task was to perform an analysis. That analysis was done and posted by @dcausse above. So, this should actually be resolved.