Perform a Wikidata gap analysis
Closed, Resolved (Public)

Description

Wikidata has the potential to be at the heart of many new features and products, notably new search and reader experiences. However, we don't have a clear picture of the content currently contained in Wikidata; this makes it difficult to design new tools and experiences, because we don't know what we can rely on.

For example, if an article doesn't exist in a given language, but we have information about that topic in Wikidata, some automatically-formatted information could be presented to the reader. That information could also be used to seed the article and encourage the reader to create it. However, such a feature relies on the assumption that the relevant data is available in Wikidata. Without knowing the breadth and coverage of the content in Wikidata, it is difficult to build experiences on it.

Other possible uses for Wikidata content include:

  • multilingual search / "quick fact" type experiences
  • powering locally maintained infoboxes
  • interactive timelines, maps, charts, etc.
  • navigation within fact hierarchies (countries, politicians, books, albums, episode lists, movies, actors, etc.)

We need to perform a more systematic analysis regarding the current content in Wikidata and the growth patterns, in order to determine which purposes it is likely to be able to serve in the near term. This includes identifying content biases and clear gaps in content.

This should ideally be combined with an impact analysis in each of the uses outlined above. We should be able to quantify most of these things, by looking at % coverage in properties, languages, full tuples, etc.
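As a rough illustration of the "% coverage" idea, here is a minimal sketch that computes, for a set of items, the share that carry a given property and the share that have a label in a given language. It assumes each item is a dict shaped like Wikidata's entity JSON, with "labels" and "claims" keys; the item ids and values below are made-up toy data, and the function name `coverage` is hypothetical.

```python
from collections import Counter


def coverage(items, properties, languages):
    """Return (property coverage, label coverage) as percentages of all items."""
    total = len(items)
    prop_hits = Counter()
    lang_hits = Counter()
    for item in items:
        claims = item.get("claims", {})
        labels = item.get("labels", {})
        for prop in properties:
            # A property counts as covered if the item has at least one claim for it.
            if claims.get(prop):
                prop_hits[prop] += 1
        for lang in languages:
            if lang in labels:
                lang_hits[lang] += 1

    def pct(count):
        return 100.0 * count / total if total else 0.0

    return (
        {prop: pct(prop_hits[prop]) for prop in properties},
        {lang: pct(lang_hits[lang]) for lang in languages},
    )


# Toy data only: ids, labels and claim contents are invented for illustration.
sample = [
    {"id": "item-a", "labels": {"en": "a", "de": "a"}, "claims": {"P31": [{}], "P569": [{}]}},
    {"id": "item-b", "labels": {"en": "b"}, "claims": {"P31": [{}]}},
    {"id": "item-c", "labels": {"fr": "c"}, "claims": {}},
    {"id": "item-d", "labels": {"en": "d", "fr": "d"}, "claims": {"P31": [{}]}},
]

prop_cov, lang_cov = coverage(sample, ["P31", "P569"], ["en", "de", "fr"])
```

The same counting could be extended to "full tuples" by checking for combinations of properties per item rather than each property on its own.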

In any such analysis, it's important to remember that specific Wikidata platform capabilities (e.g. unit support) may act as catalysts for larger adoption/use.

We should also compare the results with existing datasets such as DBpedia.

We should take a first rough cut at this in March and aim to provide a public report in April.

Event Timeline

Eloquence raised the priority of this task from to Needs Triage.
Eloquence updated the task description.
Eloquence updated the task description.
Eloquence set Security to None.

Adding @leila and @kaldari as they have done extensive work on this question so we can avoid duplicating effort.

(my bad, they are already listed)

Thanks for starting this. Let us know once you've discussed who will take the lead on this; we're happy to contribute and work with you on it.

I caused some consternation by accidentally assigning this to Leila (sorry!)

What are the next steps with this? Meeting?

Let's discuss in our next weekly team meeting on Wednesday.

Tentatively suggesting Oliver and perhaps Leila (if she has bandwidth to help) work on this project.

Would love to work with both of them on this.

@Eloquence I reviewed this task and have quite a few questions about it. Before committing to it, I'd like to sit with someone from your team who will lead this to make sure I understand the details. I'll let you know if I can commit given the timeline and other constraints I have after that meeting.

Thanks @Eloquence for the chat today. Given the scope of the project, its timeline, and other commitments I have for the rest of this quarter, it's best if this is led by Tilman/Oliver/Guillaume as mentioned in your task description. The work requires some understanding of the Wikidata data model, familiarity with API queries, and a lot of exploration. I'm happy to meet with the person leading this to share what I know as needed.

Great, thanks for the offer to help. Oliver should be able to start poking at this as soon as we've completed some more urgent data requests.

Ironholds triaged this task as Medium priority. Mar 17 2015, 6:34 AM
Ironholds subscribed.

I've started a stub on Meta; edit boldly. @Ironholds, feel free to move to the Research: namespace as discussed, once there's some actual content :)

Nice, thanks! I added a few of my thoughts to it.

Update:

We met with Lydia (thanks Lydia!) yesterday and had a really awesome meeting; I'm just making sure my notes captured the essence of what was said before formally writing them up. @Deskana would you be interested in chatting and bringing along one or more of your engineers, about the ideas you've got, the problems you've run into, the good bits and bad bits, etc etc?

Update:

Currently:

  1. Working on the documentation
  2. Trying to compensate for the fact that it seems pretty much impossible to get information out of Wikidata in a humane format. The JSON isn't even minified. Commence the infinite, massively parallelised API queries.
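For anyone following along, a minimal sketch of what batched, parallel API retrieval could look like. It is not the script actually used here. It assumes the `wbgetentities` action of the Wikidata API and a limit of 50 ids per request; the helper names (`batches`, `build_url`, `fetch_all`) are hypothetical, and the `fetch` callable is passed in so the sketch runs without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"
BATCH_SIZE = 50  # assumed per-request id limit for ordinary clients


def batches(ids, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_url(id_batch, props="labels|claims"):
    """Build a wbgetentities request URL for one batch of ids."""
    query = urlencode({
        "action": "wbgetentities",
        "ids": "|".join(id_batch),
        "props": props,
        "format": "json",
    })
    return f"{API_URL}?{query}"


def fetch_all(urls, fetch, workers=4):
    """Apply `fetch` to every URL on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


item_ids = [f"Q{n}" for n in range(1, 121)]
urls = [build_url(batch) for batch in batches(item_ids)]

# Stand-in for a real HTTP call: just echo the URL back.
results = fetch_all(urls, fetch=lambda url: url)
```

In real use, `fetch` would wrap an HTTP GET that decodes the JSON response, and the worker count should stay modest to be polite to the API.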

Yep; see "not minified": streaming it line-by-line after it's on disc breaks since it seems to be formatted as one big JSON blob. I've worked out a way around this (and probably didn't need the entire db anyway), though :)
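One possible way to stream a dump of this kind without loading it whole, offered as a sketch rather than the workaround mentioned above. It assumes the dump is a single JSON array written with one entity per line, each line ending in a comma, so that individual lines parse once the comma and the enclosing brackets are stripped. The function name `iter_entities` and the two-line fake dump are made up for illustration.

```python
import io
import json


def iter_entities(lines):
    """Yield one parsed entity per line from a JSON array laid out one entity per line."""
    for line in lines:
        line = line.strip().rstrip(",")
        # Skip blank lines and the brackets that open and close the array.
        if not line or line in ("[", "]"):
            continue
        yield json.loads(line)


# Tiny in-memory stand-in for a dump file.
fake_dump = io.StringIO(
    '[\n'
    '{"id": "Q1", "labels": {}},\n'
    '{"id": "Q2", "labels": {}}\n'
    ']\n'
)

entity_ids = [entity["id"] for entity in iter_entities(fake_dump)]
```

With a real file, the same generator can be fed an open file handle (or a `gzip.open(..., "rt")` handle for a compressed dump), so only one entity is held in memory at a time.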

Update: 1m-item retrieval done!