Evaluate the benefits of adding wikidata aliases to cirrus indices
Closed, ResolvedPublic
Authored By

	dcausse
	Oct 7 2015, 9:54 AM

Description

Before adding wikidata aliases to the cirrus indices, we should first evaluate what the benefit would be.
We could write a simple script for that purpose:

Loop over a sample of cirrus docs that have a wikibase entity (could be done with a dump and IdHashMod).
Extract the aliases from wikidata (https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42&props=aliases).
Run a query for each alias against the cirrus index.
Count the number of queries that return zero results.
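
The steps above could be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function names are hypothetical, the sample of entity ids is assumed to come from elsewhere (e.g. a dump), and the search backend is passed in as a callable because the exact way of querying the cirrus index is not specified here. The `languages` parameter of `wbgetentities` is used to restrict aliases to one language.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable, List, Optional, Tuple

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def aliases_url(entity_id: str, language: Optional[str] = None) -> str:
    """Build the wbgetentities URL for one entity's aliases."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "props": "aliases",
        "format": "json",
    }
    if language is not None:
        params["languages"] = language  # same-language test only
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def extract_aliases(response: dict, entity_id: str,
                    language: Optional[str] = None) -> List[str]:
    """Pull alias strings out of a wbgetentities JSON response."""
    by_lang = response.get("entities", {}).get(entity_id, {}).get("aliases", {})
    langs = [language] if language is not None else list(by_lang)
    return [a["value"] for lang in langs for a in by_lang.get(lang, [])]


def fetch_aliases(entity_id: str, language: Optional[str] = None) -> List[str]:
    """Network variant; a real run would also need a descriptive
    User-Agent header and some rate limiting (not shown)."""
    with urllib.request.urlopen(aliases_url(entity_id, language)) as resp:
        return extract_aliases(json.load(resp), entity_id, language)


def count_zero_results(entity_ids: Iterable[str],
                       get_aliases: Callable[[str], List[str]],
                       count_hits: Callable[[str], int]) -> Tuple[int, int]:
    """Return (aliases tested, aliases whose query returned zero hits).

    count_hits is a placeholder for whatever queries the cirrus index.
    """
    tested = zero = 0
    for entity_id in entity_ids:
        for alias in get_aliases(entity_id):
            tested += 1
            if count_hits(alias) == 0:
                zero += 1
    return tested, zero
```

Passing `get_aliases` and `count_hits` as callables keeps the counting loop independent of the network, so it can be tried on canned responses before running against a real cluster.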

In the end, if the zero results rate (ZRR) for these queries is high, then adding aliases could help reduce the Cirrus ZRR. If it is low, it is not worth the effort, as it means the wikidata aliases are already covered by the cirrus docs.

We should run 2 different tests:

add aliases from the same language
add all aliases

Related Objects

		Status	Subtype	Assigned	Task
		Declined		• Deskana	T113379 EPIC: Investigate adding aliases from Wikidata into the search index so that they can enhance the results and reduce the zero results rate
		Resolved		• Deskana	T114867 Evaluate the benefits of adding wikidata aliases to cirrus indices

Event Timeline

dcausse created this task.Oct 7 2015, 9:54 AM

dcausse raised the priority of this task from to Needs Triage.

dcausse updated the task description. (Show Details)

dcausse added projects: CirrusSearch, Discovery-Search (Current work).

dcausse subscribed.

Restricted Application added a project: Discovery-ARCHIVED. · View Herald TranscriptOct 7 2015, 9:54 AM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

dcausse claimed this task.Oct 7 2015, 9:55 AM

dcausse moved this task from Needs triage to Search on the Discovery-ARCHIVED board.

dcausse added a parent task: T113379: EPIC: Investigate adding aliases from Wikidata into the search index so that they can enhance the results and reduce the zero results rate.

dcausse set Security to None.

dcausse moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

• Deskana triaged this task as Medium priority.Oct 8 2015, 4:51 PM

• Deskana subscribed.

• ksmith added a project: OKR-Work.Oct 8 2015, 11:56 PM

I ran a test with the data available in our hypothesis testing cluster (en, it, de, ru).
I extracted 10% of the pages with a wikibase entity id and ran a query against wikidata to load the aliases.
These aliases were used as a query against our index to count zero results.
At a glance, there are very few entities with aliases (except on ruwiki and dewiki), and very few of these aliases would help to find more results if they were added to the cirrus indices.

wiki	total pages	pages tested	aliases found	zero results	data
en	4 898 988	489 898	14 533	583	missing aliases
de	1 642 622	164 262	36 116	107	missing aliases
it	327 580	32 758	2 618	15	missing aliases
ru	1 140 431	114 043	39 134	133	missing aliases
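
For reference, the share of tested aliases that returned zero results can be derived from the table. A small sketch, assuming the zero-result column counts aliases whose query found nothing (the dictionary and function names are illustrative only):

```python
# (aliases found, aliases with zero results) per wiki, copied from the table
results = {
    "en": (14533, 583),
    "de": (36116, 107),
    "it": (2618, 15),
    "ru": (39134, 133),
}


def zero_share(aliases_found: int, zero: int) -> float:
    """Fraction of tested aliases whose query returned no result."""
    return zero / aliases_found


for wiki, (aliases_found, zero) in results.items():
    print(f"{wiki}: {zero_share(aliases_found, zero):.2%}")
```

Under that reading, only about 4% of the enwiki aliases return nothing, and well under 1% on the other three wikis.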

I'm afraid that even in the very optimistic case that we have one hit per day for these missing aliases it would not help to reduce the ZRR significantly.

Note that this test only included aliases for the same language (i.e. en aliases for enwiki).

We could try to add labels and aliases for all languages, but I think this would cover the same use case addressed by T110078.

T110078 is interesting because it will cover the full content of the target wiki (but depends on the language detector precision).
Adding labels and aliases for all languages into the same index is also interesting because we could run a single query over all languages and we won't rely on the performance of the language detector.

This multi-language index already exists: it's the cirrus index for wikidata. Example: https://it.wikipedia.org/w/index.php?title=Speciale%3ARicerca&profile=default&search=Kami+wa+Nihon+wo+Nikunderu&fulltext=Search
In this case wikidata was able to find the document, and the wikidata entity has a link to itwiki. We should try to integrate this result in a seamless manner, with maybe a small indication that it came from wikidata.

This would be a "simple way" to address this missing aliases problem and also user queries with names in a different language or with small transliteration variations.
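
A rough sketch of what such an integration might look like, assuming the wikidata hits are available as entity JSON with the `sitelinks` structure returned by `wbgetentities` (`props=sitelinks`). The function names and the result format are hypothetical, not existing CirrusSearch code:

```python
from typing import Dict, List, Optional


def local_title(entity: dict, site: str) -> Optional[str]:
    """Return the page title this entity links to on `site` (e.g. "itwiki")."""
    sitelink = entity.get("sitelinks", {}).get(site)
    return sitelink["title"] if sitelink else None


def merge_results(local_hits: List[str], wikidata_entities: List[dict],
                  site: str) -> List[Dict[str, str]]:
    """Append wikidata-derived pages to the local hits, flagged by source."""
    merged = [{"title": t, "source": "local"} for t in local_hits]
    seen = set(local_hits)
    for entity in wikidata_entities:
        title = local_title(entity, site)
        # Skip entities with no page on this wiki and pages already found locally.
        if title is not None and title not in seen:
            merged.append({"title": title, "source": "wikidata"})
            seen.add(title)
    return merged
```

The `source` field is what would let the UI show the small "came from wikidata" indication.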

dcausse moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board.Oct 9 2015, 8:10 AM

dcausse mentioned this in T113379: EPIC: Investigate adding aliases from Wikidata into the search index so that they can enhance the results and reduce the zero results rate.Oct 9 2015, 8:30 AM

@Deskana: There is a question of what to do with this, and we think you are the best person to answer.

Smalyshev subscribed.Oct 13 2015, 5:33 PM

EBernhardson moved this task from Search to On Sprint Board on the Discovery-ARCHIVED board.Oct 14 2015, 4:28 AM

@Deskana: There is a question of what to do with this, and we think you are the best person to answer.

Assigning to @Deskana so it's clear on the board who this is waiting for.

We may also have to do this if we want to do T117494 for performance reasons - so we get one result from ElasticSearch instead of going to SQL.

Thanks for the analysis, @dcausse! Given the above analysis, it does not seem to me that this avenue is worth going down at this stage, so I'm removing this from the sprint and placing it back into the backlog.

• Deskana removed a project: Discovery-Search (Current work).Nov 5 2015, 5:38 PM

• ksmith moved this task from On Sprint Board to Search on the Discovery-ARCHIVED board.Nov 5 2015, 5:41 PM

Really, this task was to perform an analysis. That analysis was done and posted by @dcausse above. So, this should actually be resolved.

• Deskana moved this task from Inbox to Resolved/Invalid/Declined/Legacy on the CirrusSearch board.Dec 31 2015, 5:07 AM
