
tiny subsets from Wikidata
Open, Medium, Public

Description

Wikidata can be a great provider of lists of entities based on queries. Some lists are used again and again, but they can be a bit hard to get out of Wikidata each time. We could regularly run queries and provide CSV downloads of generally useful lists of entities.

Some lists that could be useful:

  • countries and capitals (see the example query after this list)
  • timezones
  • airports
  • languages
  • units
  • ...
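
To make the first item concrete, here is a minimal sketch of what such a stored query might look like, wrapped in Python so it can feed the runner script further down. The file path queries/countries_capitals.rq is an assumed layout, not an agreed convention; the IDs are the usual Wikidata ones (P31 "instance of", Q6256 "country", P36 "capital"), and exactly which Items should count as a "country" is the kind of definitional question raised below.

```
# Hypothetical example of one stored query: countries and their capitals.
# In the proposed repository, the SPARQL would live as its own .rq file.
from pathlib import Path

COUNTRIES_AND_CAPITALS = """\
SELECT ?country ?countryLabel ?capital ?capitalLabel WHERE {
  ?country wdt:P31 wd:Q6256 .               # instance of (P31) country (Q6256)
  OPTIONAL { ?country wdt:P36 ?capital . }  # capital (P36), where modelled
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?countryLabel
"""

if __name__ == "__main__":
    # Write the query into the (assumed) per-list file in the repository.
    Path("queries").mkdir(exist_ok=True)
    Path("queries/countries_capitals.rq").write_text(
        COUNTRIES_AND_CAPITALS, encoding="utf-8"
    )
```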

What would need to happen:

  • open a repository in the Wikidata GitHub org
  • define the SPARQL query for each required list in a file in the repository
  • run the query, generate a result CSV, and upload it to the repository (see the sketch after this list)
  • automate the above to regularly update the result CSV
  • publicize that this is available
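
For steps two and three, here is a minimal sketch of a runner, assuming the public Wikidata Query Service endpoint (which can return CSV directly when asked for text/csv) and the per-list .rq layout sketched above; the file paths and User-Agent string are illustrative, not an agreed convention.

```
#!/usr/bin/env python3
"""Run a stored SPARQL query against the Wikidata Query Service, save the CSV.

Usage: python run_query.py queries/countries_capitals.rq data/countries_capitals.csv
"""
import sys

import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def run_query_to_csv(query_path: str, csv_path: str) -> None:
    with open(query_path, encoding="utf-8") as f:
        query = f.read()
    response = requests.get(
        WDQS_ENDPOINT,
        params={"query": query},
        headers={
            # WDQS content-negotiates; asking for text/csv yields CSV directly.
            "Accept": "text/csv",
            # The endpoint's usage policy asks for a descriptive User-Agent;
            # this value is just a placeholder.
            "User-Agent": "wikidata-tiny-subsets-example/0.1",
        },
        timeout=300,
    )
    response.raise_for_status()
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(response.text)


if __name__ == "__main__":
    run_query_to_csv(sys.argv[1], sys.argv[2])
```

Step four could then be as simple as running this script on a schedule (e.g. a cron-triggered CI job) and committing the refreshed CSV back to the repository.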

Why would this be useful?

  • It'd encourage more streamlined modeling for the Items in these lists.
  • It'd further position Wikidata as a provider of standard sets of entities.
  • It'd make it easier to spot changes in the data similar to what happens with Listeria.
  • It'd encourage discussion around sharpening the definitions of concepts like "country".

Event Timeline

I'd go more general and introduce "named queries". Some such queries might indeed have cached results. I'd also drop the "tiny" restriction as long as the named query is "relevant", e.g. popular, highly requested, or especially useful.

This sounds like a very good idea and shouldn't be too hard to implement (knock on wood!)… but if this becomes popular, are we going to run into any limits on GitHub?

I wouldn't cache the results on GitHub but somewhere else in the Wikimedia tool universe, as outlined by Bryan Davis (with whom I already talked about this during the hackathon last weekend).