Explore the Entity Relevancy Scoring for Wikidata
Open, Medium, Public

Description

Wikidata has a lot of items with the same label. We should explore ways to rank them according to their relevance. For example, Berlin (the capital of Germany) should be ranked higher than Berlin (music album) in suggestions.

The purpose is to let users find the items they are looking for more easily. Currently, the ranking takes into account the number of sitelinks and labels. We should check whether this is still sufficient.
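
As a rough illustration of ranking by sitelinks and labels, here is a minimal Python sketch. It is not the actual Wikidata ranking code: the `Candidate` class, the `relevance` function, the equal weights, the placeholder IDs, and the counts are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    item_id: str      # placeholder ID, not a real Wikidata ID
    description: str
    sitelinks: int    # number of sitelinks (made-up value)
    labels: int       # number of labels (made-up value)


def relevance(candidate, sitelink_weight=1.0, label_weight=1.0):
    """Hypothetical score: a weighted sum of sitelink and label counts."""
    return (sitelink_weight * candidate.sitelinks
            + label_weight * candidate.labels)


def rank(candidates):
    """Return the candidates sorted from most to least relevant."""
    return sorted(candidates, key=relevance, reverse=True)


# Two items sharing the label "Berlin"; the counts are invented.
candidates = [
    Candidate("Q_album", "music album", sitelinks=3, labels=5),
    Candidate("Q_city", "capital of Germany", sitelinks=250, labels=300),
]
ranked = rank(candidates)
```

With these invented counts the city is ranked above the album. Whether such a simple sum is still sufficient is exactly what this task is meant to explore.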

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 19 2016, 3:33 PM
Glorian_Yapinus renamed this task from [Task] Exploring the Entity Relevancy Scoring for Wikidata to [Task] Explore the Entity Relevancy Scoring for Wikidata.Aug 19 2016, 3:34 PM
Glorian_Yapinus updated the task description.

The number of incoming links to an item could be an indication of relevancy, but it is probably the most difficult one to add.
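
For illustration, counting incoming links is simple once the links are available as (source, target) pairs. The sketch below assumes exactly that input. The function name `incoming_link_counts` and the sample pairs are hypothetical, and it leaves out how the pairs would be extracted and kept up to date at Wikidata scale, which is presumably the difficult part.

```python
from collections import Counter


def incoming_link_counts(links):
    """Count incoming links per item from (source, target) pairs.

    Self-links are ignored so that an item cannot boost itself.
    """
    return Counter(target for source, target in links if source != target)


# Placeholder item names, not real Wikidata IDs.
links = [("A", "C"), ("B", "C"), ("C", "A"), ("C", "C")]
counts = incoming_link_counts(links)
```

A `Counter` returns 0 for items that never appear as a target, so `counts["B"]` can be used directly as a score.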

thiemowmde triaged this task as Medium priority.Sep 5 2016, 3:38 PM
thiemowmde added a subscriber: thiemowmde.

I find this task's description oddly confusing. At the moment this ticket is non-actionable. It does not even say whether there is an actual issue to solve. Is the current method not sufficient? Do you have more specific examples? How do you suggest improving the situation? What would be the goal of such an improvement? How do we measure whether a change is a success?

Note that there are already multiple tickets about switching to CirrusSearch. The current ranking will be obsolete then.

thalhamm claimed this task.Sep 19 2016, 7:31 PM
thalhamm added a subscriber: thalhamm.

We were recently discussing a Wikipedia PageRank solution (or a combination of that ranking with other features). I could contribute these scores and, with some help, also work on the integration.

Smalyshev added a comment.EditedJan 17 2017, 10:28 PM

@thalhamm We'd like to know more about the PageRank solution, especially applied to Wikidata. In order to see if we could integrate this solution, we'd like to know more about:

  • What is the input for the algorithm?
  • What is the output produced?
  • What platform is it run on (e.g. Hadoop)?
  • How many resources would one run require, e.g. for Wikidata with the current set of entities?
  • Is the code freely available and if so, under which license?
This comment was removed by thalhamm.
thalhamm added a comment.EditedMar 17 2017, 10:32 PM

@Smalyshev, I think we should first check whether this type of output is of any use to you. Most of the information (e.g. input/output format) is available at http://people.aifb.kit.edu/ath/#Wikidata_PageRank. It is not run on Hadoop and it needs fairly few resources (it could actually be optimized to run on a laptop with 16 GB of RAM). Currently, there are no optimizations in place and we use about 200 GB of RAM (processing power doesn't matter). If good use cases exist and it has been verified that the current output is useful, I would consider the following next steps:

  • transform the actual link datasets of Wikipedia to a processable format (similar to the output of DBpedia pagelinks)
  • develop a processing pipeline as a docker file and make all source code available under a free license

@Smalyshev

I have developed a full Bash+Python3 framework that makes it possible to compute PageRank on any Wikipedia language edition (even on low-cost hardware). By default, the input is the latest version of the Wikipedia dump, and the output contains each page's Q-id and a corresponding ranking score. The software is licensed under GPL v3 and can be accessed at the following URL:

https://github.com/athalhammer/danker
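
To give an idea of the kind of computation involved, here is a minimal textbook PageRank sketch using power iteration. It is not the danker implementation. The function name `pagerank`, the damping factor of 0.85, the iteration count, and the placeholder IDs are assumptions chosen for illustration only.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank via power iteration.

    links: iterable of (source, target) pairs.
    Returns a dict mapping each node to its score; scores sum to 1.
    """
    links = list(links)
    nodes = {node for pair in links for node in pair}
    outgoing = {node: [] for node in nodes}
    for source, target in links:
        outgoing[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not outgoing[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for node in nodes:
            if outgoing[node]:
                share = damping * rank[node] / len(outgoing[node])
                for target in outgoing[node]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Placeholder IDs standing in for Q-ids; the links are invented.
links = [("Q_a", "Q_c"), ("Q_b", "Q_c"), ("Q_c", "Q_a")]
scores = pagerank(links)
```

In this toy graph `Q_c` receives links from both other nodes and ends up with the highest score, while `Q_b`, which has no incoming links, gets the lowest.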

I hope this is of some use.

thalhamm removed thalhamm as the assignee of this task.Nov 5 2017, 1:21 PM
Bovlb added a subscriber: Bovlb.Sep 20 2018, 8:32 PM
Lydia_Pintscher renamed this task from [Task] Explore the Entity Relevancy Scoring for Wikidata to Explore the Entity Relevancy Scoring for Wikidata.Jul 10 2019, 6:10 PM