RFC: Data namespace blob storage on wikidata.org
Closed, Resolved (Public)

Authored By

	Yurik
	Jan 24 2016, 3:27 AM

Description

We have had numerous talks about how we can store structured data in a wiki, covering both technical and social aspects. This is yet another idea; I hope it has some merit.

Which wiki to use for data storage is discussed at https://meta.wikimedia.org/wiki/User:Yurik/Storing_data

Store data in wikidata.org in Data: namespace.
Metadata is stored as properties of a Wikidata item, allowing flexibility and localization. The content handler will show/edit the associated item metadata directly in-place.
The type of data is determined by the "extension": Data:NewYorkStateDistricts.topojson. Other types could be tabular csv/tsv, and generic json. We might want to consider import transformations, e.g. CSV->TSV, GeoJSON->TopoJSON, etc.
Each type is handled/visualized by its own content handler, either utilizing existing wikipage storage, or some other backend.
An API could provide data retrieval, possibly even querying functionality (filtering/joining). This one is trickier and might need a separate discussion.
Any page consuming data (e.g. Lua or Graphs) will add a page dependency, utilizing the batchupdate process we have now.
Only data that passes structural validation is allowed to be saved.
Data is never guaranteed to be roundtrip-able, e.g. for JSON the backend will remove trailing commas and unneeded whitespace.
We might want to consider raising maximum storage limits for this namespace.
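
To make the points above about type-by-extension, structural validation, non-roundtrip storage, and import transformations more concrete, here is a minimal sketch in Python. It is an illustration only, not the proposed MediaWiki content handler: the `HANDLERS` table and the functions `data_type`, `normalize_json`, and `csv_to_tsv` are hypothetical names invented for this example. Note also that Python's strict `json` parser rejects trailing commas outright, whereas the proposal implies a more lenient parser that would drop them.

```python
import csv
import io
import json

# Hypothetical mapping from a title's "extension" to a handler name.
HANDLERS = {
    "json": "json",
    "topojson": "json",
    "geojson": "json",
    "csv": "tabular",
    "tsv": "tabular",
}


def data_type(title: str) -> str:
    """Return the handler name for a title such as 'Data:Foo.topojson'."""
    _, _, name = title.partition(":")
    base, dot, ext = name.rpartition(".")
    if not dot or not base:
        raise ValueError(f"no extension in title: {title!r}")
    try:
        return HANDLERS[ext.lower()]
    except KeyError:
        raise ValueError(f"unsupported data type: {ext!r}") from None


def normalize_json(text: str) -> str:
    """Validate JSON and return a compact form (so the input is not roundtripped)."""
    value = json.loads(text)  # raises json.JSONDecodeError (a ValueError) if invalid
    return json.dumps(value, separators=(",", ":"), ensure_ascii=False)


def csv_to_tsv(text: str) -> str:
    """Import transformation: re-serialize comma-separated rows as tab-separated."""
    rows = csv.reader(io.StringIO(text))
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()
```

With these definitions, `data_type("Data:NewYorkStateDistricts.topojson")` selects the JSON handler, and `normalize_json('{ "a" : [1, 2] }')` stores `{"a":[1,2]}`, which is why the saved text can differ from what was submitted.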

Related Objects

Mentioned In: T137929: RFC (WIP): Enable shared tabular data storage on a central wiki
E213: ArchCom RFC Meeting: Technical aspects of Data namespace blob storage (2016-06-15, #wikimedia-office)
T120452: Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...)
Mentioned Here: T137929: RFC (WIP): Enable shared tabular data storage on a central wiki

Event Timeline

Yurik created this task.Jan 24 2016, 3:27 AM

Yurik raised the priority of this task from to Needs Triage.

Yurik updated the task description.

Yurik added a project: Proposal.

Yurik added subscribers: Yurik, Milimetric, • Tfinc, MaxSem.

Restricted Application added subscribers: StudiesWorld, Aklapper. Jan 24 2016, 3:27 AM

Yurik set Security to None.Jan 24 2016, 3:28 AM

Yurik added subscribers: daniel, tstarling.

Yurik updated the task description. (Show Details)Jan 24 2016, 3:47 AM

Yurik added a subscriber: Lydia_Pintscher.Jan 24 2016, 3:52 AM

• MZMcBride subscribed.Jan 24 2016, 3:59 AM

JanZerebecki subscribed.Feb 3 2016, 10:43 AM

hoo subscribed.Feb 5 2016, 11:22 AM

aude subscribed.Feb 5 2016, 11:25 AM

Addshore added a project: Wikidata.Feb 5 2016, 11:27 AM

Addshore subscribed.

I am against this proposal for several reasons:

Wikidata is structured around concepts. Datasets are usually structured around the same data for many topics.
People already have trouble understanding the above and want to upload their spreadsheets to Wikidata. This would make it even harder to explain and understand.
We cannot have a mix of licenses, which would surely be expected if we go down this path.
We are there to expose the data we have in a uniform way to Wikipedia, the other sister projects and third parties. This would make that impossible.
Wikidata is at its core a knowledge base, not a place to put a dataset.
People expect to be able to query all the data in Wikidata in a uniform way. This would not be possible.
We are building data quality tools that all revolve around the way data is stored in Wikidata right now.

Lydia, would all the raised issues be resolved if this were not on wikidata.org?

I believe so.

I listed PROs and CONs for each domain. Did I miss anything? https://meta.wikimedia.org/wiki/User:Yurik/Storing_data

Yair_rand subscribed.Feb 18 2016, 9:43 AM

Lydia_Pintscher moved this task from incoming to needs discussion or investigation on the Wikidata board.Feb 19 2016, 10:29 AM

• iecetcwcpggwqpgciazwvzpfjpwomjxn subscribed.Feb 25 2016, 1:59 PM

This comment was removed by • iecetcwcpggwqpgciazwvzpfjpwomjxn.

Bene subscribed.Feb 26 2016, 7:39 AM

JanZerebecki added a parent task: T120452: Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...).Mar 5 2016, 2:15 PM

Ricordisamoa subscribed.Mar 5 2016, 2:17 PM

Yurik updated the task description. (Show Details)Mar 5 2016, 2:28 PM

Thryduulf subscribed.Mar 5 2016, 7:58 PM

Patch https://gerrit.wikimedia.org/r/#/c/281331/ implements support for on-wiki tabular storage.

There is a big discussion on Commons related to this issue. @Lydia_Pintscher and @daniel, you might want to elaborate there on why Wikidata is not the right place. The overwhelming majority so far are in favor of putting data on Commons, but a few opposing votes do raise valid concerns.

Yurik removed a parent task: T120452: Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...).May 5 2016, 6:36 PM

Yurik mentioned this in T120452: Allow structured datasets on a central repository (CSV, TSV, JSON, GeoJSON, XML, ...).

Danny_B added a project: Commons-Datasets.May 6 2016, 2:37 AM

Danny_B moved this task from Backlog to Monitoring on the Commons-Datasets board.

Danny_B subscribed.

@Yurik,

I agree that the majority seem to be supporting, but I think the opposing voices should be taken into consideration more carefully. I read some of the opposition and your replies, and I agree with you about the idea of releasing early and releasing often, but I think the bare bones release should consider this:

"before allowing this type of uploads we need a full developed environment for this data in Commons regarding policies (scope etc.), categories, maintance tools (like filters etc.), help & orientation pages (for uploaders and maintainers), etc.. Btw: what about references to support the data or can just everybody throw his data to Commons? But the main question is: who in Commons will be able to additionally monitoring this kind of stuff? Not only the uploads itselfs but also all later modifications (typical edit by IP: "sales": 2.000 --> 200.000)? And: they (companies, marketing, spammers, POV's/COI's users etc.) will abuse in medium-term also this system, providing [fake/false] data for their "interests". IMHO, Commons in the past already suffered some mass-oriented "features" like Wikipedia Zero, cross-wiki uploads via local Visual Editor, mobile uploads, or whatever --> all mostly either grabbed from Internet or out of project scope (often detected only months or years later) — btw: currently, around +/- 30-40 % of daily deletion requests at Commons are already related only to "out of project scope", mostly involving "Commbook"-uploads from spammers and user pics from gals & guys who (will) never touch an wiki article, vomiting an user page on "their" wiki, thinking Commons = Facebook. The concept of "data uploads" may be interesting but ignores (among other things) the completely under-staffed [maintainer] user base in Commons and instead of trying to keep a (+/-) quality database of "free-use images, sound, and other media files" the whole thing is already turning (also regarding e.g. thousands of images grabbed & uploaded from social media) more and more into a random web hoster. Gunnex (talk) 20:05, 8 May 2016 (UTC)"

I think this is mostly busy work and I'm happy to do it as one of the tasks on the new board you were going to set up.

@Milimetric the new board is at Commons-Datasets. I agree with the concerns you raise, and I see a significant portion of the community has raised them on wiki. I think we should continue the discussion there, because otherwise we risk splintering the discussion into multiple unrelated and unconnected branches.
Also, let's start putting down some of the ideas that have been floating around on the help page I just created. Any changes are welcome :)

Daniel_Mietchen subscribed.May 12 2016, 7:36 PM

• RobLa-WMF added a project: TechCom-RFC.Jun 6 2016, 6:35 AM

• RobLa-WMF mentioned this in Unknown Object (Event).Jun 6 2016, 6:46 AM

• RobLa-WMF mentioned this in E213: ArchCom RFC Meeting: Technical aspects of Data namespace blob storage (2016-06-15, #wikimedia-office).Jun 13 2016, 6:00 AM

Quick update prior to today's RFC IRC discussion: we basically have two areas to discuss: 1) technical questions (data types, use as an 'aggregation layer' for data from Wikidata), and 2) the hosting question (which wiki to put it on, or whether to give it its own).

• RobLa-WMF mentioned this in T137929: RFC (WIP): Enable shared tabular data storage on a central wiki.Jun 16 2016, 2:11 PM

I'm removing the TechCom-RFC tag from this one, since the TechCom issues can be addressed in T137929.

• RobLa-WMF moved this task from Inbox to Watching on the TechCom board.Jun 30 2016, 2:58 AM

• iecetcwcpggwqpgciazwvzpfjpwomjxn added a comment.Dec 18 2016, 10:56 PM

This comment was removed by • iecetcwcpggwqpgciazwvzpfjpwomjxn.

Yep, thx
