boost weight of suggestions based on number of labels
Closed, Resolved (Public)

Description

The entity selector doesn't work well for many classifying statements. One example is "sex or gender: male": male does not show up on the selector's first page because it has no sitelinks and is therefore ranked low.
We can fix this by also taking the number of labels for an item into account in the ranking. The item for male would then be ranked considerably higher because it has labels in many languages.

Event Timeline

Incoming links are something that I think is feasible to do once we use Elasticsearch as a backend. In the short term, considering the number of labels might help?

Lydia_Pintscher renamed this task from boost weight of suggestions based on incoming links to boost weight of suggestions based on number of labels. Mar 30 2015, 1:36 PM
Lydia_Pintscher updated the task description.
Lydia_Pintscher set Security to None.

Good point. Adapted the description accordingly.

Hm, I don't think there is much difference between the number of sitelinks and the number of labels for the known queries that cause problems.

For things like male there is quite a difference. https://www.wikidata.org/wiki/Q6581097 has 0 sitelinks but ~90 labels.

Yeah, but that's English. I'm talking about searching in Dutch for "man" (the Dutch word for male): will the item appear on top, or will the island be on top?

The island will probably continue to be on top because it has so many sitelinks and roughly as many labels. But at least the item for male will now show up on the first page.

  • I support the max( number of labels, number of sitelinks ) approach discussed in the sprint start meeting.
  • What about considering the number of incoming links (a.k.a. WhatLinksHere) in the internal ranking algorithm? I understand that such a ranking will only be re-calculated when the entity is edited. But this should not be a big problem if the algorithm favors sitelinks and labels and uses backlinks as a minor aspect in the calculation.
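To make the two suggestions above concrete, here is a minimal sketch in Python. It is not the Wikibase implementation (which is PHP); the names `EntityStats`, `term_weight`, `rank` and the `backlink_factor` parameter are all hypothetical, and the counts for the two unnamed entities are made up for illustration. Only the figures for Q6581097 (0 sitelinks, roughly 90 labels) come from this discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityStats:
    """Per-entity counts used for ranking (hypothetical structure)."""
    entity_id: str
    label_count: int
    sitelink_count: int
    backlink_count: int = 0


def term_weight(stats, backlink_factor=0.0):
    """max(labels, sitelinks), plus an optional minor backlink term.

    backlink_factor is an assumption: a small multiplier so that
    backlinks only nudge the ranking. The default of 0.0 gives the
    plain max() approach.
    """
    base = max(stats.label_count, stats.sitelink_count)
    return base + backlink_factor * stats.backlink_count


def rank(candidates, backlink_factor=0.0):
    """Return candidates ordered from highest to lowest weight."""
    return sorted(
        candidates,
        key=lambda s: term_weight(s, backlink_factor),
        reverse=True,
    )


# Q6581097 (male): 0 sitelinks, about 90 labels, as mentioned above.
male = EntityStats("Q6581097", label_count=90, sitelink_count=0)
# Two made-up entities with invented counts, for comparison only.
other_a = EntityStats("hypothetical-A", label_count=5, sitelink_count=40)
other_b = EntityStats("hypothetical-B", label_count=3, sitelink_count=2)

ranked = rank([other_a, male, other_b])
print([s.entity_id for s in ranked])
```

With a sitelink-only weight, the item for male would score 0 and sort last among these three; with the max() weight it scores 90 and sorts first.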

Change 202399 had a related patch set uploaded (by Thiemo Mättig (WMDE)):
Introduce TermSqlIndex::supportsSearchKeys

https://gerrit.wikimedia.org/r/202399

Change 202399 merged by jenkins-bot:
Introduce TermSqlIndex::supportsSearchKeys

https://gerrit.wikimedia.org/r/202399

@thiemowmde The number of incoming links would be the best indicator, since it directly correlates with the probability of the user wanting to link to the entity. But calculating it is too expensive, even on edit. CirrusSearch has a similar problem, and a solution (I don't remember the details, ask Nik). Once we move term lookup to Elastic, we can use it.

Change 202456 had a related patch set uploaded (by Daniel Kinzler):
Use max( |sitelinks|, |labels| ) for term weight.

https://gerrit.wikimedia.org/r/202456

What if we stored the most used values for each property and updated it once a week? That wouldn't be expensive to calculate, would it? Even if we stored the top fifty values?
If we can do this once we move to Elasticsearch, should we just shelve this bug until then? Basing the recommendations on which values are used elsewhere with the property is definitely the way to go.
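As a rough illustration of the "most used values per property" idea, the following Python sketch counts value usage from a list of (property, value) pairs and keeps the top entries per property. This is not what was merged for this task; the function name `top_values_per_property`, the input format, and the sample counts are assumptions. A real job would read statements from a dump or database and run on a schedule such as weekly.

```python
from collections import Counter, defaultdict


def top_values_per_property(statements, limit=50):
    """Return the `limit` most frequently used values for each property.

    `statements` is an iterable of (property_id, value_id) pairs; this
    input format is an assumption made for the sketch.
    """
    counts = defaultdict(Counter)
    for property_id, value_id in statements:
        counts[property_id][value_id] += 1
    return {
        property_id: [value for value, _ in counter.most_common(limit)]
        for property_id, counter in counts.items()
    }


# Made-up usage counts, for illustration only.
sample = (
    [("P21", "Q6581097")] * 3    # sex or gender: male
    + [("P21", "Q6581072")] * 2  # sex or gender: female
    + [("P31", "Q5")]            # instance of: human
)

suggestions = top_values_per_property(sample)
print(suggestions["P21"])
```

The stored lists could then be consulted by the selector for the property being edited, before falling back to the general term weight.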

This is a relatively simple fix and already implemented - just needs review. Elastic will take quite some work and time. I want us to have an improvement now because the current situation is really bad.

I don't think we need to run a maintenance script for this. We can just purge items like male and female.

If, for any reason, we want to update term_weight in the entire table, we still have the rebuildTermsSearchKey.php script. It still works and now uses wfWaitForSlaves, so it is probably safe to use :)

Change 202456 merged by jenkins-bot:
Use max( |sitelinks|, |labels| ) for term weight.

https://gerrit.wikimedia.org/r/202456

hoo removed a project: Patch-For-Review.
hoo moved this task from Review to Done on the Wikidata-Sprint-2015-04-07 board.
hoo subscribed.