
RFC: Data namespace blob storage on wikidata.org
Closed, Resolved (Public)

Description

We have had numerous talks about how to store structured data in a wiki, covering both technical and social aspects. This is yet another idea; I hope it has some merit.

Which wiki to use for data storage is discussed at https://meta.wikimedia.org/wiki/User:Yurik/Storing_data

  • Store data on wikidata.org in the Data: namespace.
  • Metadata is stored as properties of a Wikidata item, allowing flexibility and localization. The content handler will show/edit the associated item's metadata directly in-place.
  • The type of data is determined by the "extension": Data:NewYorkStateDistricts.topojson. Other types could be tabular CSV/TSV and generic JSON. We might want to consider import transformations, e.g. CSV->TSV, GeoJSON->TopoJSON, etc.
  • Each type is handled/visualized by its own content handler, either utilizing existing wikipage storage, or some other backend.
  • An API could provide data retrieval, possibly even querying functionality (filtering/joining). This one is trickier and might need a separate discussion.
  • Any page consuming data (e.g. Lua or Graphs) will add a page dependency, utilizing the batchupdate process we have now.
  • Only data that passes structural validation is allowed to be saved
  • Data is never guaranteed to be roundtrip-able, e.g. for JSON the backend will remove trailing commas and unneeded whitespace.
  • We might want to consider raising maximum storage limits for this namespace.
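
To make the bullets above concrete, here is a minimal Python sketch of how type detection by "extension", structural validation on save, and lossy normalization might fit together. It is not part of the proposal. The function names, the handler table, and the choice to store CSV as TSV are all assumptions made for illustration. Note also that Python's strict JSON parser rejects trailing commas rather than removing them, so only whitespace normalization is shown.

```python
import csv
import io
import json


def data_type(title):
    """Return the type implied by the page title's extension (hypothetical rule)."""
    name = title.split(":", 1)[-1]  # drop the "Data:" namespace prefix if present
    if "." not in name:
        raise ValueError(f"no extension in title: {title!r}")
    return name.rsplit(".", 1)[1].lower()


def normalize_json(text):
    """Validate JSON and re-serialize it compactly.

    json.loads raises ValueError on malformed input, so invalid data is never
    saved. The output drops insignificant whitespace, which is why the stored
    text is not guaranteed to match what was submitted.
    """
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)


def csv_to_tsv(text):
    """Validate that all rows have the same width, then convert CSV to TSV."""
    rows = list(csv.reader(io.StringIO(text)))
    if rows and any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("rows have differing column counts")
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical mapping from extension to handler; a real content handler
# would also check type-specific structure (e.g. TopoJSON's required keys).
HANDLERS = {
    "json": normalize_json,
    "topojson": normalize_json,
    "csv": csv_to_tsv,
}


def prepare_for_save(title, text):
    """Pick a handler by extension and return the text that would be stored."""
    kind = data_type(title)
    handler = HANDLERS.get(kind)
    if handler is None:
        raise ValueError(f"unsupported data type: {kind!r}")
    return handler(text)
```

Calling `prepare_for_save("Data:NewYorkStateDistricts.topojson", text)` would either return the normalized text or raise `ValueError`, which is the point at which a save would be refused under the validation bullet.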

Event Timeline

Yurik raised the priority of this task from to Needs Triage.
Yurik updated the task description.
Yurik added a project: Proposal.
Yurik added subscribers: Yurik, Milimetric, Tfinc, MaxSem.

I am against this proposal for several reasons:

  • Wikidata is structured around concepts. Datasets are usually structured around the same data for many topics.
  • People already have trouble understanding the above and want to upload their spreadsheets to Wikidata. This proposal would make it even harder to explain and understand.
  • We cannot have a mix of licenses, which would surely be expected if we go down this path.
  • We are there to expose the data we have in a uniform way to Wikipedia, the other sister projects and third parties. This would make this impossible.
  • Wikidata is at the core a knowledge base. Not a place to put a dataset.
  • People expect to be able to query all the data in Wikidata in a uniform way. This would not be possible.
  • We are building data quality tools that all revolve around the way data is stored in Wikidata right now.

Lydia, would all the issues you raised be resolved if this were not on wikidata.org?

There is a big discussion on Commons related to this issue. @Lydia_Pintscher and @daniel, you might want to elaborate there on why Wikidata is not the right place. The overwhelming majority so far are in favor of putting data on Commons, but a few opposing votes do raise valid concerns.

@Yurik,

I agree that the majority seems to be in support, but I think the opposing voices should be considered more carefully. I read some of the opposition and your replies, and I agree with you about releasing early and releasing often, but I think the bare-bones release should consider this:

"before allowing this type of uploads we need a full developed environment for this data in Commons regarding policies (scope etc.), categories, maintance tools (like filters etc.), help & orientation pages (for uploaders and maintainers), etc.. Btw: what about references to support the data or can just everybody throw his data to Commons? But the main question is: who in Commons will be able to additionally monitoring this kind of stuff? Not only the uploads itselfs but also all later modifications (typical edit by IP: "sales": 2.000 --> 200.000)? And: they (companies, marketing, spammers, POV's/COI's users etc.) will abuse in medium-term also this system, providing [fake/false] data for their "interests". IMHO, Commons in the past already suffered some mass-oriented "features" like Wikipedia Zero, cross-wiki uploads via local Visual Editor, mobile uploads, or whatever --> all mostly either grabbed from Internet or out of project scope (often detected only months or years later) — btw: currently, around +/- 30-40 % of daily deletion requests at Commons are already related only to "out of project scope", mostly involving "Commbook"-uploads from spammers and user pics from gals & guys who (will) never touch an wiki article, vomiting an user page on "their" wiki, thinking Commons = Facebook. The concept of "data uploads" may be interesting but ignores (among other things) the completely under-staffed [maintainer] user base in Commons and instead of trying to keep a (+/-) quality database of "free-use images, sound, and other media files" the whole thing is already turning (also regarding e.g. thousands of images grabbed & uploaded from social media) more and more into a random web hoster. Gunnex (talk) 20:05, 8 May 2016 (UTC)"

I think this is mostly busy work and I'm happy to do it as one of the tasks on the new board you were going to set up.

@Milimetric, the new board is at Commons-Datasets. I agree with the concerns you raise, and I see that a significant portion of the community has raised them on wiki. I think we should continue the discussion there, because otherwise we risk splintering it into multiple unconnected branches.
Also, let's start writing down some of the ideas that have been floating around on the help page I just created. Any changes are welcome :)

RobLa-WMF mentioned this in Unknown Object (Event). Jun 6 2016, 6:46 AM

Quick update prior to today's RFC IRC discussion: we basically have two areas to discuss: 1) technical questions (data types, use as an 'aggregation layer' for data from Wikidata), and 2) the hosting question (which wiki to put it on, or whether to give it its own).

RobLa-WMF subscribed.

I'm removing the TechCom-RFC tag from this one, since the TechCom issues can be addressed in T137929.

Yurik claimed this task.

Yep, thx