
Add sitelink count to search index for Wikidata
Closed, Resolved; Public

Description

We should expose sitelink count (and label count and statement count?) as a field in the search index for Wikidata (and any Wikibase installs that use Cirrus).

These fields can then be considered when rescoring and ranking, to give more useful search results.

This task involves implementing hook handlers for the CirrusSearchMappingConfig and CirrusSearchBuildDocumentParse hooks.

At some future time, these fields could be exposed in a nicer way via the content objects, and Cirrus could be made to ingest them more smartly. For now, the hooks are simple enough. Most of the actual hook handler code would be abstracted anyway, so it will be easy to change later how it is connected to things.
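As a rough illustration of what the two handlers would have to produce, here is a minimal Python sketch (the real handlers would be PHP inside Wikibase). It assumes the standard Wikibase JSON layout with `sitelinks`, `labels` and `claims` keys. The field names `sitelink_count`, `label_count` and `statement_count` and the `integer` field type are assumptions for illustration, not what the patch actually uses.

```python
from typing import Any, Dict

# Hypothetical index field names; the task does not say what they are called.
COUNT_FIELDS = ("sitelink_count", "label_count", "statement_count")


def mapping_fragment() -> Dict[str, Dict[str, str]]:
    """Mapping side: declare each count as an integer field (assumed type)."""
    return {name: {"type": "integer"} for name in COUNT_FIELDS}


def entity_counts(entity: Dict[str, Any]) -> Dict[str, int]:
    """Document side: compute the counts from an entity in Wikibase JSON form."""
    # Empty maps are sometimes serialized as [] rather than {}, so
    # normalise falsy values to an empty dict before iterating.
    claims = entity.get("claims") or {}
    return {
        "sitelink_count": len(entity.get("sitelinks") or {}),
        "label_count": len(entity.get("labels") or {}),
        # "claims" maps each property ID to a list of statements.
        "statement_count": sum(len(statements) for statements in claims.values()),
    }
```

Keeping the counting logic in one plain function like this is what would make it easy to reconnect later, whether it is called from a hook or from a content object.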

Event Timeline

aude claimed this task.
aude raised the priority of this task from to High.
aude updated the task description. (Show Details)
aude added a subscriber: aude.

Using the sitelink count for scoring was intended to be a workaround. Cirrus already has the number of incoming links ("in-degree") for each item, which it uses for scoring by default. Why is that not good enough for our case?

The main problem with the current scoring seems to be that Cirrus uses tf/idf scoring. The "tf" bit ("term frequency", the number of times the search term occurs in the document) should not be used for Wikidata items; it's not a good indicator of relevance. The "idf" bit ("inverse document frequency") is intended to reduce the impact of irrelevant (too common) terms in the search string, which is useless for single-word (or prefix) searches.

If we want to improve scoring, we should make sure that in-degree is used, and tf/idf is not used.
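To make the tf/idf argument concrete, here is a small sketch using the textbook `tf * idf` formula. This is an assumption for illustration: Lucene's actual scoring adds normalisation terms not modelled here, and the token lists in the usage note are made up.

```python
import math
from typing import Dict, List


def tf_idf_scores(term: str, docs: Dict[str, List[str]]) -> Dict[str, float]:
    """Score every document for a single-term query with textbook tf * idf."""
    n_docs = len(docs)
    doc_freq = sum(1 for tokens in docs.values() if term in tokens)
    if doc_freq == 0:
        return {doc_id: 0.0 for doc_id in docs}
    # idf depends only on the term, so it is the same for every document.
    idf = math.log(n_docs / doc_freq) + 1.0
    # tf is the only per-document factor left.
    return {doc_id: tokens.count(term) * idf for doc_id, tokens in docs.items()}
```

For a single-word query the idf factor multiplies every score by the same constant and cannot change the order. The ranking is then decided purely by how often the term is repeated: a document with tokens `["half", "life", "half", "life", "half", "life"]` scores three times as high as one containing just `["life"]`.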

@daniel if you would like "encyclopedia of life" to be the first result for searching "life", then incoming links alone might be good for scoring

life (Q3) has 56 incoming links

encyclopedia of life (Q82486) has 1365362 incoming links

I'm not sure that *not* doing tf/idf is the solution, but we can investigate. The way we munge all the different terms in all the languages together in one field is probably not ideal for tf/idf. "life" is probably translated differently in most languages whereas "Half Life" (Q752241) is generally not translated yet has labels in lots of languages, so "life" is especially frequent. If we could consider just English when searching in English, then "Half Life" probably is not boosted as much compared to "life".

As well, things like exact title matches don't really work currently for Wikidata. Ideally, we would consider exact label matches in the search language and exact matches would get a boost.

I think considering other attributes (e.g. # of site links, # of statements, etc) of the document to boost scoring could help. This would not replace considering incoming links but just be additional consideration in scoring. It already works okayish enough in the entity selector. Once we put these in, then we can try different rescorings to see what works well. If this turns out to be a bad idea, then we can remove the custom rescoring config for wikidata and do as we do now.
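A minimal sketch of the kind of rescoring that could be tried once the fields exist. The weights, the multiplicative form and the `log1p` scaling are all assumptions chosen for illustration, not what Cirrus or the entity selector does.

```python
import math
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Signals:
    incoming_links: int
    sitelinks: int
    statements: int


# Hypothetical weights; real values would have to be tuned by experiment.
DEFAULT_WEIGHTS: Mapping[str, float] = {
    "incoming_links": 1.0,
    "sitelinks": 1.0,
    "statements": 0.5,
}


def rescore(
    text_score: float,
    signals: Signals,
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Multiply the text score by a boost built from log-scaled counts."""
    boost = 1.0
    boost += weights["incoming_links"] * math.log1p(signals.incoming_links)
    boost += weights["sitelinks"] * math.log1p(signals.sitelinks)
    boost += weights["statements"] * math.log1p(signals.statements)
    return text_score * boost
```

With log scaling, the incoming-link counts quoted above (1365362 versus 56) contribute roughly 14.1 versus 4.0 to the boost, so a strong sitelink or statement signal could still outweigh them. Setting a weight to zero switches that signal off, which makes it cheap to compare different combinations.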

@daniel if you would like "encyclopedia of life" to be the first result for searching "life", then incoming links alone might be good for scoring

life (Q3) has 56 incoming links

encyclopedia of life (Q82486) has 1365362 incoming links

Ah, right... we'd want to consider only links from main snaks, not from references (not sure about qualifiers). That would need some work...
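A rough sketch of what "links from main snaks only" could look like, assuming the standard Wikibase JSON layout (`claims` → statements → `mainsnak`, with `qualifiers` and `references` held separately) and assuming that entity-ID values carry an `id` key. The function names are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Optional, Set


def _item_target(snak: Dict[str, Any]) -> Optional[str]:
    """Return the item ID a snak points at, or None if it is not an item link."""
    if snak.get("snaktype") != "value":
        return None
    datavalue = snak.get("datavalue") or {}
    if datavalue.get("type") != "wikibase-entityid":
        return None
    value = datavalue.get("value") or {}
    if value.get("entity-type") != "item":
        return None
    return value.get("id")


def main_snak_targets(entity: Dict[str, Any]) -> Set[str]:
    """Collect items linked from main snaks, ignoring qualifiers and references."""
    targets: Set[str] = set()
    for statements in (entity.get("claims") or {}).values():
        for statement in statements:
            target = _item_target(statement.get("mainsnak") or {})
            if target is not None:
                targets.add(target)
    return targets


def main_snak_in_degree(entities: Iterable[Dict[str, Any]]) -> Counter:
    """Count, per target item, how many entities link to it from a main snak."""
    in_degree: Counter = Counter()
    for entity in entities:
        in_degree.update(main_snak_targets(entity))
    return in_degree
```

Each linking entity is counted once per target here. Whether qualifiers should count as well is left open, as above.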

I'm not sure that *not* doing tf/idf is the solution, but we can investigate.

Term frequency doesn't seem to be a good indicator in our use case.

The way we munge all the different terms in all the languages together in one field is probably not ideal for tf/idf. "life" is probably translated differently in most languages whereas "Half Life" (Q752241) is generally not translated yet has labels in lots of languages, so "life" is especially frequent. If we could consider just English when searching in English, then "Half Life" probably is not boosted as much compared to "life".

Yes, this should be per language.

As well, things like exact title matches don't really work currently for Wikidata. Ideally, we would consider exact label matches in the search language and exact matches would get a boost.

Indeed.
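A minimal sketch of a per-language exact-match boost, assuming labels in the Wikibase JSON form `{"en": {"language": "en", "value": "..."}}`. The boost values and the function name are hypothetical.

```python
from typing import Any, Dict


def label_match_boost(
    query: str,
    entity: Dict[str, Any],
    language: str,
    exact_boost: float = 2.0,
    prefix_boost: float = 1.0,
) -> float:
    """Boost based only on the label in the search language."""
    label = ((entity.get("labels") or {}).get(language) or {}).get("value")
    if label is None:
        return 0.0
    needle = query.strip().casefold()
    haystack = label.strip().casefold()
    if haystack == needle:
        return exact_boost  # exact label match in the search language
    if haystack.startswith(needle):
        return prefix_boost  # prefix match, as used for type-ahead search
    return 0.0
```

Because only the label in the search language is consulted, an item whose untranslated label is repeated across many languages gains nothing from the repetition.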

I think considering other attributes (e.g. # of site links, # of statements, etc) of the document to boost scoring could help. This would not replace considering incoming links but just be additional consideration in scoring. It already works okayish enough in the entity selector. Once we put these in, then we can try different rescorings to see what works well. If this turns out to be a bad idea, then we can remove the custom rescoring config for wikidata and do as we do now.

The number of sitelinks or statements can help. I'd like to avoid getting too many parameters into the mix, though. If we can, let's find one or two indicators that work well. If there are too many factors, things tend to become unpredictable.

My objection to sitelinks was based on the assumption that we already have something better (incoming links), so why invest time into the sitelinks stuff. But as you point out, the raw number of incoming links includes links from references, and can thus be misleading. So we need to invest time anyway.

Why is this on review? What should we review here?

Change 256023 had a related patch set uploaded (by Aude):
Introduce hook handlers for CirrusSearch

https://gerrit.wikimedia.org/r/256023

@thiemowmde sorry, the patch was not linked to the task. Now it is linked.

thiemowmde moved this task from Review to Done on the Wikidata-Sprint-2015-12-01 board.

Change 256023 merged by jenkins-bot:
Introduce hook handlers for CirrusSearch

https://gerrit.wikimedia.org/r/256023