
Entity suggester for Wikidata
Closed, Resolved, Public


Proposed at

Wikidata could be a lot smarter than it is right now e.g. by suggesting fields to fill and probable values.

For example: when an editor edits an item about a person that is still missing the date of birth, this should be suggested as a possible property. Or when the editor is entering the sex of the person, Wikidata should be smart and suggest the values that are used most often for this property first. Think of it as something very similar to the famous "people who bought x also bought y" systems.
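The "people who bought x also bought y" idea above can be sketched as simple property co-occurrence counting. A minimal sketch, assuming a toy corpus of per-item property sets (the property IDs and counts below are illustrative, not real Wikidata statistics):

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(items):
    """Count how often each property, and each pair of properties,
    appears on the same item."""
    pair_counts = defaultdict(int)
    prop_counts = defaultdict(int)
    for props in items:
        unique = set(props)
        for p in unique:
            prop_counts[p] += 1
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return prop_counts, pair_counts

def suggest(existing, prop_counts, pair_counts, k=3):
    """Rank properties not yet on the item by how often they
    co-occur with the properties the item already has."""
    scores = defaultdict(float)
    for p in existing:
        for q in prop_counts:
            if q not in existing:
                scores[q] += pair_counts.get((p, q), 0)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [p for p, _ in ranked[:k]]

# Toy corpus: each inner list is the property set of one item.
items = [
    ["P31", "P569", "P21"],   # person: instance-of, date of birth, sex
    ["P31", "P569", "P19"],
    ["P31", "P21", "P19"],
    ["P31", "P571"],          # a non-person item
]
print(suggest(["P31", "P21"], *build_cooccurrence(items)))
# → ['P19', 'P569', 'P571']
```

A production recommender (Myrrix/Mahout, as mentioned below) would replace the raw counts with a proper similarity measure, but the ranking idea is the same.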

Version: unspecified
Severity: major
See Also:



Event Timeline

bzimport raised the priority of this task from to High. (Nov 22 2014, 1:18 AM)
bzimport set Reference to bz46555.
bzimport added a subscriber: Unknown Object (MLST).

nilesh wrote:


I am a 3rd year undergraduate student of computer science, pursuing my B.Tech degree at RCC Institute of Information Technology. I am proficient in Java, PHP and C#.

Among the project ideas on the GSoC 2013 ideas page, the one that seemed most interesting to me is developing an Entity Suggester for Wikidata. I want to work on this.

I am passionate about data mining, big data and recommendation engines, so this idea naturally appeals to me. I have experience building music and people recommendation systems, and have worked with Myrrix and Apache Mahout. I recently designed and implemented such a recommendation system and deployed it on a live production site at the company where I'm interning, to recommend Facebook users to each other based on their interests.

The problem is, the documentation for Wikidata and the Wikibase extension seems pretty daunting to me, since I have never configured or actually used a MediaWiki instance. (I am on my way to trying it out by following the instructions.) I can easily build a recommendation system and create a web-service or REST-based API through which the engine can be trained with existing data and queried. This seems to be a collaborative filtering problem ("people who bought x also bought y"). It would be easier if I could get some help on where and how I need to integrate it with Wikidata. Also, some sample datasets (CSV files?) or schemas (just the column names and data types?) would help a lot for me to figure this out.

Please ask me if you have any questions. :-)


nilesh: Did you reach out to the Wikidata developer team at ?
Asking as your personal CV is offtopic for this specific bug report here. :)

Answered Nilesh on the mailing list.

(In reply to comment #2)

nilesh: Did you reach out to the Wikidata developer team at ?
Asking as your personal CV is offtopic for this specific bug report here. :)

Background: I asked Nilesh to create the bug report and send the proposal to wikitech-l. The Wikidata team was/is aware. Welcome Nilesh and good luck with your project idea.

Just a note to say that Nilesh Chakraborty has submitted a GSoC proposal related to this report:

Good luck!

nilesh wrote:

Thanks everyone, thank you Quim. Really appreciate it!

puneet.gkaur wrote:

Hello everyone, I am Puneet Kaur, an undergraduate student at Indira Gandhi Institute of Technology, New Delhi, India.

I am interested in this project, and by and large the concept behind Wikidata's features as a whole seems nice.

I have spent some time on web development and design, and I wish to make good use of my existing knowledge by helping Wikidata get some more features :)

nilesh wrote:

I'm considering two options for feeding the item/property data into the recommender:

i) Using the database-related code in the Wikidata extension (I'm studying the DataModel classes and how they interact with the database) to fetch what I need and feed it into the recommendation engine.

ii) Not accessing the DB at all. Instead, I can write map-reduce scripts to extract all the training data I need for each item from the wikidatawiki data dump and feed it into the recommendation engine. I can use a cron job to download the latest dump when it becomes available and run the scripts on it. Even if the engine lags behind by the dump-generation interval, I don't think it would be an issue, since recommendation is inherently about approximations.

I personally think (ii) will be cleaner and faster. Please share your views on this. More details on the idea can be found at:
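Option (ii) above, extracting training data from the dump rather than the live database, could be sketched roughly as follows. The JSON-lines layout with one canonical-format entity per line is an assumption (the 2013-era XML dumps actually embedded a different internal JSON, as noted in the next comment), and `property_lists` is a hypothetical helper name:

```python
import json

def property_lists(dump_lines):
    """Yield (item_id, [property IDs]) pairs from a JSON-lines
    entity dump. Assumes one canonical-format entity per line,
    e.g. {"id": "Q42", "claims": {"P31": [...], "P569": [...]}}."""
    for line in dump_lines:
        line = line.strip().rstrip(",")   # tolerate JSON-array dumps
        if not line or line in "[]":
            continue
        entity = json.loads(line)
        claims = entity.get("claims", {})
        if isinstance(claims, dict):
            yield entity.get("id"), sorted(claims.keys())

sample = ['{"id": "Q42", "claims": {"P31": [], "P569": []}}']
print(list(property_lists(sample)))
# → [('Q42', ['P31', 'P569'])]
```

The resulting pairs are exactly the per-item property sets the recommender needs for training, so the dump-based pipeline never has to touch the database.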

I agree that ii) is better, especially since this info isn't really in the database (yet), except in the form of json blobs.

The only downside is that the JSON in the XML dumps is the *internal* JSON, not the canonical JSON used in the API. We'll provide dumps using the canonical JSON at some point.

But even so, if we have code that uses the internal JSON, it should be easy to adapt later.
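The internal-vs-canonical distinction mentioned above could be isolated behind a small adapter so only one function needs changing when the canonical dumps arrive. The canonical layout ("claims" keyed by property ID) matches the documented format; the internal list-of-snaks layout shown here is only an assumed shape for illustration:

```python
def claim_properties(entity):
    """Extract property IDs from a parsed entity, tolerating two layouts.

    Canonical JSON keys claims by property: {"claims": {"P31": [...]}}.
    The internal layout is assumed here to store claims as a list of
    objects whose "m" (mainsnak) array has a numeric property ID at
    index 1 -- adjust if the real dump format differs.
    """
    claims = entity.get("claims", {})
    if isinstance(claims, dict):          # canonical layout
        return sorted(claims.keys())
    props = set()
    for claim in claims:                  # assumed internal layout
        snak = claim.get("m", [])
        if len(snak) > 1:
            props.add("P%d" % snak[1])
    return sorted(props)

print(claim_properties({"claims": {"P31": [], "P21": []}}))   # → ['P21', 'P31']
print(claim_properties({"claims": [{"m": ["value", 31, "x"]}]}))  # → ['P31']
```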

nilesh wrote:

Thanks Daniel. I'm going with (ii).

Please check out for some code, loads of info, and my immediate TODO list.

I'm prototyping the entity suggester and pushing the code there and will keep updating the github repo.

  • Bug 52553 has been marked as a duplicate of this bug.
  • Bug 41054 has been marked as a duplicate of this bug.

GSoC's "soft pencils down" date was yesterday, and all coding must stop on 23 September. Has this project been completed?

If you have open tasks or bugs left, one possibility is to list them at and volunteer yourself as a mentor.

We have heard from Google and from free software projects participating in Code-in that students participating in these programs have done great work finishing and polishing GSoC projects, often mentored by the former GSoC student. The key is to be able to split the pending work into small tasks.

More information is on the wiki page. If you have questions, you can ask there or contact me directly.

[replacing wikidata keyword by adding CC - see bug 56417]

Daniel: What is this bug report "tracking"? There are no dependencies here.
Are the assignee and priority still correct?

virginia.weidhaas wrote:

We are the student group mentioned above, and we would like to be assigned to this ticket.
We introduced ourselves on the mailing list:

You can find our current status and code on GitHub through the links provided in the mail.


Virginia Weidhaas
Christian Dullweber
Moritz Finke
Felix Niemeyer

Deployed on test now \o/ Time to close this.