[RfC] geoshape datatype and data namespace on Commons
Closed, Resolved · Public


It will soon be possible to store map data in the Data namespace on Commons. We want to link to this data from Wikidata statements.

Open questions:

  • Do we create a new datatype for this or extend the existing Commons Media one?
  • Do we have one datatype just for the map data, or one that also covers the other tabular data in the same namespace?

Event Timeline

I guess the question is whether there is some benefit in combining these new features in the same property with images/videos, e.g. detail map (P1621) could link to either a JPG or a shape file on Commons.

As handling between these differs, separate property data-types seem preferable.

Using the same property data-type would just add maintenance work to keep them separated. Maybe some of the experts on maintenance want to comment: @Pasleim @Fralambert @Mbch331 (I tried to add more, but not everyone seems to have an account here. Sorry to those I omitted).

I vote for separate data types for maps, tabular data (and media files). The purpose of data types is to define the semantics of a value, and thereby determine how a value is displayed and edited. I'd like to have the option to show shapes and tables in different ways, even if we just show the page title initially. More importantly, when selecting a geo-shape, I want suggestions only for geo-shapes, not for other data. So they need to have separate data types.

On the level of data value types, tabular data and geo-data can share a type. We can probably have a value type that is just an arbitrary wiki page reference. Or maybe just use a plain string value, like we do for URIs and CommonsMedia.

I think the data types should be separate. As a consumer of the data (e.g. the JSON dump), I might want to find all the shapes, search or query them in a different way, or render them differently. I think that would be easier with separate data types.
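To illustrate the consumer side, here is a minimal sketch of picking shape references out of one entity from the JSON dump by datatype. The datatype name `geo-shape`, the IDs `Q0` and `P0`, and the sample page titles are assumptions for illustration, since the datatype has not been defined yet. The `claims` / `mainsnak` / `datatype` / `datavalue` layout follows the existing Wikibase JSON format, with values held as plain strings as is done for CommonsMedia.

```python
# Assumed name for the new datatype; nothing in this discussion fixes it.
GEO_SHAPE = "geo-shape"


def values_by_datatype(entity, datatype):
    """Return (property_id, value) pairs for main snaks of the given datatype."""
    found = []
    for prop_id, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            # Snaks of type "novalue"/"somevalue" carry no datavalue.
            if snak.get("datatype") == datatype and "datavalue" in snak:
                found.append((prop_id, snak["datavalue"]["value"]))
    return found


# Hypothetical entity: one media file and one shape, both stored as plain strings.
sample_entity = {
    "id": "Q0",
    "claims": {
        "P1621": [{"mainsnak": {
            "snaktype": "value", "property": "P1621",
            "datatype": "commonsMedia",
            "datavalue": {"value": "Example map.jpg", "type": "string"}}}],
        "P0": [{"mainsnak": {
            "snaktype": "value", "property": "P0",
            "datatype": GEO_SHAPE,
            "datavalue": {"value": "Data:Example.map", "type": "string"}}}],
    },
}

shapes = values_by_datatype(sample_entity, GEO_SHAPE)
```

With a shared datatype, the same filter would need an extra check on the value itself (for example its file extension) to tell shapes from other pages.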

I hope this doesn't lead to tons of unstructured data (read: data not in a Wikidata-structured form).

Nothing is easy, but if we don't try, it's unlikely it will ever happen ..

I think the description needs to be updated - now it IS possible to store data :)

Maybe I should clarify that I don't really mind geoshapes being stored at Commons; I'm worried about tables of other data. The risk is that we start the Wikipedia infobox problem once more.

@Esc3300 The problem with infoboxes is a) that they exist once per language version and b) that they are hard to parse to get the data. Neither of those problems exists with the new tabular data sets on Commons.

Also note that we have said from the beginning that we explicitly do not want tabular data on Wikidata. Our model is not well suited for that kind of data. Tabular data needs a different approach.

Of course, we should take care to stay compatible, e.g. with regard to how individual values are represented.

Currently we already have tabular data at Wikidata, e.g. for population numbers and quarterly ELO ratings.

These are fairly straight-forward:

  • date
  • value
  • some qualifier about the type of value (optional or omitted)

This for each item.
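As a rough sketch of the layout described above (one date, one value, and an optional type per entry, repeated for each item), the following shows how values can be compared across items when every item uses the same layout. The item IDs, figures, and field names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    date: str                   # point in time, e.g. a year
    value: int
    kind: Optional[str] = None  # optional qualifier describing the value


# Hypothetical items with invented numbers.
series = {
    "Q_A": [Observation("2014", 1000), Observation("2015", 1100, "estimate")],
    "Q_B": [Observation("2015", 500)],
}


def values_at(series_by_item, date):
    """Collect each item's value for one date, skipping items without one."""
    result = {}
    for item_id, observations in series_by_item.items():
        for obs in observations:
            if obs.date == date:
                result[item_id] = obs.value
    return result


by_2015 = values_at(series, "2015")
```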

Obviously, the same could be stored at and retrieved from Commons, but that makes it much more complicated to compare values across items, and it's not even certain that all values at Commons would be stored in the same way. So we end up with a new infobox problem. The sole advantage could be that there is only one site involved (not hundreds of Wikipedias).

@Esc3300 Managing time-series data in Items this way is rather limited. You can't easily have multiple columns, you have to repeat the source references, and the whole thing does not scale well. I would not recommend putting more than 20 statements into a time series on a Wikidata item. Maybe 100 at most. The storage format is not designed for this. We actually explicitly excluded this use case when we defined requirements for the storage layer. Just because we now have a nice hammer doesn't mean we should treat everything like a nail.

I agree that we should make it easy to combine the new tabular data with Wikidata. For example, the data set could declare which item it relates to, and which property each column represents. That would (at least in theory) allow us to import the data set into WDQS.
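A minimal sketch of that idea, assuming a data set carries a small header naming the item it is about and the property each column stands for. The field names (`item`, `columns`, `property`), the IDs, and the rows are hypothetical; no such format is defined in this discussion.

```python
# Hypothetical data set with a self-describing header.
dataset = {
    "item": "Q_A",  # item the whole table is about
    "columns": [
        {"name": "year", "property": "P_DATE"},
        {"name": "population", "property": "P_POP"},
    ],
    "rows": [[2014, 1000], [2015, 1100]],
}


def rows_to_statements(data):
    """Turn each row into one {property: value} mapping tagged with the item."""
    props = [col["property"] for col in data["columns"]]
    statements = []
    for row in data["rows"]:
        if len(row) != len(props):
            raise ValueError("row length does not match declared columns")
        statements.append({"item": data["item"], "values": dict(zip(props, row))})
    return statements


statements = rows_to_statements(dataset)
```

Records like these could then be handed to an importer; what that importer would look like for WDQS is left open here.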

Interesting point. I hadn't thought of it that way. Obviously, it sounds somewhat optimistic to expect such consistency without any constraints. In theory, infoboxes could have had that too.

I think we were already hammered by the monthly (not quarterly) ratings, but still, I think it should be possible to have annual population numbers available. Some people have also started on economic data, but that might be too voluminous unless it's limited to annual figures.

Lydia_Pintscher claimed this task.

OK, decision: Given that it is all in the same namespace on Commons, we will have one datatype for now. If it turns out not to work well, we can split it later. We're not extending the existing datatype. Queries should be helped by making the individual properties specific.

Seems like the new property hasn't been created yet, or has it?

I lost track of whether it was possible to create the property yet or not.

Geoshape properties can be added from Monday on.