
Add a data-page-only wiki markup header to datasets
Open, High · Public

Description

We urgently need a way for the community to add wiki markup to the top of the dataset pages. That markup will allow for messages, categories, deletion requests, etc.

Community discussion permalink (more messages might have been added later)

The wiki markup will not be accessible via the API or via Lua calls - that field will be removed from the data results.

Proposed data structure:

{
   "info": "  any wiki markup   "
}

Will be shown at the top, right after the "description" tag (or should it be above?)

Event Timeline

debt added a comment.Jan 26 2017, 5:17 PM

On the English Wikipedia there appears to be some sort of pseudo-parser for JS/CSS pages, shouldn't such an approach work for datasets as well?

If this workaround is good, I think we can move this ticket to the 'nice to have' section on T155601

I think the use case is putting various templates - like templates that may control bots, or templates that provide some functionality, or descriptions that need more capabilities than plain text in "description", like links.

Using the talk page may be an option, but the talk page is not immediately visible to the user (many users are not even aware talk pages exist). I'm not sure what solution @FDMS is proposing: the linked page describes that the speedy deletion template does not work for JS/CSS, but I don't see any actual solution for this problem described there.

debt added a project: Maps.Oct 12 2017, 7:19 PM

There is a new discussion going on now, regarding the substance of this ticket:
https://commons.wikimedia.org/wiki/Commons:Deletion_requests/Data_talk:Kuala_Lumpur_Districts.map

Restricted Application added a project: Discovery.Oct 12 2017, 7:19 PM

Trying to parse JSON as Mediawiki markup seems like a bad idea. Escaping rules will mean that you still need to parse the JSON as JSON.

e.g.

{
   "info": "  any wiki markup   "
}

and

{"info":"\u0020\u0020\u0061\u006e\u0079\u0020\u0077\u0069\u006b\u0069\u0020\u006d\u0061\u0072\u006b\u0075\u0070\u0020\u0020\u0020"}

are identical JSON files, and need to lead to identical results.
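A minimal sketch of the point above, using Python's standard `json` module (the variable names are only illustrative): the two files differ as raw text but decode to the same value, so anything that treats the raw page text as wiki markup would see two different inputs for the same data.

```python
import json

# The two documents from the comment above, as raw page text.
plain = '{\n   "info": "  any wiki markup   "\n}'
escaped = r'{"info":"\u0020\u0020\u0061\u006e\u0079\u0020\u0077\u0069\u006b\u0069\u0020\u006d\u0061\u0072\u006b\u0075\u0070\u0020\u0020\u0020"}'

# Compared as text, the two files are different...
same_text = plain == escaped

# ...but once parsed as JSON they hold exactly the same value.
same_data = json.loads(plain) == json.loads(escaped)

print(same_text, same_data)
```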

This is not simply a theoretical issue - quotes and newlines are common, and are required to be escaped.

The right way to do this is to properly separate metadata and data, so the wikimarkup is stored side by side. As Commons-Datasets is orphaned with no team responsible, this is likely too difficult.

Another solution would be to parse the JSON then pull out a specific field, then pass that field to a mediawiki markup parser. This doesn't mean the raw text representation of the json will be searchable. The field would have to be removed when something requests the file, otherwise it might interfere. I'd also suggest a name less likely to collide than "info"
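As a rough sketch of that second option (not JsonConfig code; `split_header` and `HEADER_FIELD` are hypothetical names, and the page is assumed to be a JSON object at the top level): parse first, pull the markup field out, and hand only the remaining data to consumers.

```python
import json

# Field name taken from the proposal; a less collision-prone name may be better.
HEADER_FIELD = "info"

def split_header(raw_text):
    """Parse the page as JSON, then separate the wiki markup header from the data."""
    page = json.loads(raw_text)      # JSON escaping is resolved here, not by the wiki parser
    data = dict(page)                # assumes the top-level value is an object
    header = data.pop(HEADER_FIELD, None)
    return header, data

# Escaped quotes and a newline inside the markup, as they would appear in stored JSON.
raw = '{"info": "[[Category:Example]] \\"quoted\\"\\nsecond line", "data": [[1, 2]]}'
header, data = split_header(raw)

# `header` would go to the wiki markup parser; `data` is what the API or Lua would see.
print(repr(header))
print(data)
```

The header reaches the markup parser already unescaped, and the field never appears in the data handed to other consumers.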

A third option would be to strip out <noinclude> or similar sections when requesting the json.
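A hedged sketch of what that third option might look like, assuming the page text is stored as written and the markup sits in a `<noinclude>` block outside the JSON (`json_for_consumers` is a hypothetical helper, not an existing function):

```python
import json
import re

# Non-greedy match so several separate <noinclude> blocks are each removed.
NOINCLUDE = re.compile(r"<noinclude>.*?</noinclude>", re.DOTALL)

def json_for_consumers(page_text):
    """Strip the <noinclude> sections, then parse what is left as JSON."""
    return json.loads(NOINCLUDE.sub("", page_text))

page_text = '<noinclude>{{delete|reason=Example}}</noinclude>\n{"data": [[1, 2]]}'
result = json_for_consumers(page_text)
print(result)
```

One caveat with this sketch: a JSON string value that itself contains `<noinclude>` would be altered by the regular expression, so the stripping would need to be more careful in practice.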

@Pnorman actually we already use Wiki markup in these pages - the .map pages treat title and description fields as wiki markup. I haven't heard of any problems -- the values are being sanitized by the regular MW parser, and get consumed by mapframe/maplink/lua code. Adding this field wouldn't be much of a challenge from the tech perspective.

Jeff_G added a comment.EditedOct 16 2017, 3:39 AM

@Pnorman actually we already use Wiki markup in these pages - the .map pages treat title and description fields as wiki markup. I haven't heard of any problems -- the values are being sanitized by the regular MW parser, and get consumed by mapframe/maplink/lua code. Adding this field wouldn't be much of a challenge from the tech perspective.

When I tried to add the full delete template as a title field, I got "Syntax error". When I tried to add it to the description field, I got "⧼Parameter "description" must be an object that maps valid language codes to single line strings without tabs or trailing spaces, e.g. { "en":"String in English", ... }⧽". Admittedly, the full delete template has multiple lines. When I tried to add "{{delete|reason=No room for that here, please see the subpage.|subpage=Data talk:Kuala Lumpur Districts.map|year=2017|month=October|day=12}}" as the title, I got another "Syntax error"; when I added it to the description, it was rendered as if it was nowiki'd. When I tried to add "Please see [[Commons:Deletion requests/Data talk:Kuala Lumpur Districts.map]]." as the title, I got another "Syntax error"; when I added it to the description, it was again rendered as if it was nowiki'd (unlike edit summaries, which render wikilinks). All of these attempts were with preview; I didn't try actually saving anything because the preview always failed.

Yurik added a comment.Oct 16 2017, 3:41 AM

Clarification - I think (need to check in the code), the title and description use "limited" wiki syntax, similar to what is used in the edit comments. A full wiki markup parsing would be needed to track categories, etc.

Yurik added a comment.Oct 16 2017, 3:42 AM

BTW, IIRC, this fix would actually be just a few lines of code.

debt added a comment.Oct 16 2017, 3:24 PM

@Yurik - could you expand on how this could be fixed with 'a few lines of code'?

Gehel added a subscriber: Gehel.Oct 16 2017, 3:49 PM

I don't know much about Commons Datasets, so the questions below might be naive... Feel free to ignore.

If I understand correctly, commons dataset is a way to store arbitrary JSON on commons. Not only maps / geojson. In this case, we take fields that are geojson specific (title / description) and interpret them as metadata / wiki markup. Going from "limited" markup to full markup would solve at least part of the problem here, but would only be a solution for .map / geojson? Right?

Again if I understand correctly, there is no easy solution for the generic case. And the question of discoverability / indexing is also not solved here (yes, different issue, I know).

Fae added a comment.Oct 17 2017, 6:08 PM

I had not caught on that as well as templates, it's not possible to add data files to categories (unless I'm missing a way to do it). Again an unsatisfying workaround is to use Data talk pages, with a current example being the maintenance category: https://commons.wikimedia.org/wiki/Category:Data_files_with_Open_Street_Map_coordinates.

Fae awarded a token.Oct 17 2017, 6:12 PM
Yurik added a comment.Oct 17 2017, 7:02 PM

@Gehel, not exactly. The new wiki header field would apply to all data stores, both .tab & .map, because it should be implemented in its base class (they share one parent). It would use the current main page parser - thus parsing in the context of the whole page, rather than create a new parser instance and discarding the "side-effects" - such as categories, link tracking, etc. The reason I mentioned the .map title & description fields is because they use a very similar approach, thus showing that it is doable. They just use a new parser instance IIRC, without tracking things.

@debt not sure how I can explain, I would have to actually do it. I will try to find some time, but no ETA. Also, please check with @MaxSem - he knows this area pretty well.

@Fae, correct, you cannot add any "side-effect-causing" markup to the data pages, only to the data talk pages. And I agree, the workaround is not ideal.

Fae added a comment.Oct 19 2017, 12:23 PM

A new Wikimedia Commons proposal has been created to allow for additional licenses for Data files. This would reduce the confusion about whether data imported from elsewhere needs attribution or can be redefined as CC0.

An obvious consequence, if the proposal passes, is that any user must be able to add the license to the Data file, and it must be displayed with the map or table.

Link: https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Proposal_to_include_non-CC0_licenses_for_the_Data_namespace

IKhitron added a subscriber: IKhitron.

I really need a way to add category to json pages.

daniel added a subscriber: daniel.May 7 2018, 2:42 PM

From my perspective, putting wiki markup into JSON structures seems rather horrible.

However, with Multi-Content-Revisions (MCR), it will become possible to have a wikitext part ("slot" in MCR jargon) and a data part of the page co-exist, each with its own separate editor, but with a shared history etc. Having a "description" slot on data pages seems like a valid request.

For categories, I see three options:

  1. have a "categories" field in the JSON
  2. have a separate slot for categories (which could also be used with other kinds of content, e.g. Lua modules)
  3. have a "description" slot, and put categories there.

These options aren't exclusive; technically nothing keeps us from allowing all three. I just fear that it may be confusing to have three places where categories may be defined. On the other hand, this isn't really worse than categories being imposed by templates.

  1. have a "categories" field in the JSON

I saw that there is a plan to add /**/ comments to JSON in wiki, so it can be "4".

Yurik added a comment.May 7 2018, 3:06 PM

@daniel while I do agree with you in principle, it might take a while to implement. Adding a single field to JSON and passing it through the parser is about 5 lines of code, and should take a skilled dev an hour at most. Also, I wouldn't separate categories from the wiki markup here, simply because most of the time you want templates with categories to auto-add stuff, rather than each page having individual category fields. Also, this method does not preclude future migration to the multi-part system - rather, that migration will be very straightforward.

@IKhitron putting stuff into comments is horrible - not very reliable at parsing, gets easily overwritten by accident or by automatic tools, etc. JSON is just not safe with them (sadly).

@IKhitron putting stuff into comments is horrible - not very reliable at parsing, gets easily overwritten by accident or by automatic tools, etc. JSON is just not safe with them (sadly).

We do it this way in js and css.

daniel added a comment.May 7 2018, 4:52 PM

We do it this way in js and css.

Yes, it's horrible :)

Yurik added a comment.May 7 2018, 6:57 PM

We do it this way in js and css.

json is different - it gets parsed into data and back during the save. js & css are stored "as is" - just like wiki markup. For example, when saving, JSON data will lose all space formatting.
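To illustrate the general effect (with Python's `json` module standing in for whatever serializer the wiki actually uses, which is an assumption here): once text is parsed into data and written back out, the original layout is gone and only the values survive.

```python
import json

# Page text as an editor might have typed it, with custom spacing and blank lines.
stored_by_editor = '{\n   "info":   "x",\n\n   "data": [1,   2]\n}'

# Parse into data and serialize again, as happens on save.
resaved = json.dumps(json.loads(stored_by_editor))

print(resaved)                                                   # layout is normalized
print(json.loads(resaved) == json.loads(stored_by_editor))       # values are unchanged
```

Anything that is not part of the parsed data, comments included, has nowhere to live after such a round trip.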

You can't have comments in JSON, and they'll error out several JSON parsers. If you want any interoperability and require comments, use a different format. There's a format similar to JSON which extends it to allow JS-style comments.
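For instance, Python's standard `json` parser is one of the strict ones. A small sketch (the sample text is made up for illustration):

```python
import json

# A data page with a JS-style comment holding the wiki markup.
with_comment = '/* [[Category:Example]] */\n{"data": [1, 2]}'

try:
    json.loads(with_comment)
    rejected = False
except json.JSONDecodeError as err:
    rejected = True
    print("strict parser refused the file:", err)
```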

Ultimately, JSON is not a format designed for good human editability, so if comments are important, consider a different format.

have a "description" slot, and put categories there.

I think having a separate description slot, where you could put categories, templates, bot instructions, human instructions, etc. would be the best solution. Trying to make JSON into what it's not meant for is far inferior.

... if comments are important, consider a different format.

Not at all, for me.

Yurik added a comment.EditedMay 7 2018, 8:38 PM

I think having a separate description slot, where you could put categories, templates, bot instructions, human instructions, etc. would be the best solution. Trying to make JSON into what it's not meant for is far inferior.

Stas, I agree with you - this is the same as what Daniel proposed above. The fundamental problem is resourcing. It seems WMF has no resources to maintain many of these projects, in which case the MVP is the only path forward to solve the immediate problem. Given unlimited resources/time, a proper multi-slot system is more desirable. Unlike regular wiki pages, the good thing about JSON is that it will be trivial to implement the simple solution first and let the community actually move forward, then implement the proper long-term solution and do a simple migration to the multi-slot version.

I really need a way to add category to json pages.

Can I take a step back for a moment, seeing as the discussion is about the specifics of whether or not to implement, and ask -- @IKhitron can you describe what, exactly, you need, and why do you need to put categories on json pages? What are you trying to do? What is missing?

It might be that understanding the actual need will help us find a solution -- whether the specific one described/asked for, or, potentially, a new and/or better one. Seeing as we're talking about a specific need to categorize JSON files, I'm wondering if you can explain it further, @IKhitron ?

@IKhitron can you describe what, exactly, you need, and why do you need to put categories on json pages? What are you trying to do? What is missing?

Hi. Sure, it's very simple. We need to eliminate them from special:templates needs category.

Ltrlg added a subscriber: Ltrlg.Jun 13 2018, 12:08 PM

have a "description" slot, and put categories there.

I think having a separate description slot, where you could put categories, templates, bot instructions, human instructions, etc. would be the best solution. Trying to make JSON into what it's not meant for is far inferior.

Has anybody talked to the people working on StructuredData for Commons? Their stuff may make a lot of this obsolete sooner or later …

daniel added a comment.Aug 8 2018, 8:33 AM

Has anybody talked to the people working on StructuredData for Commons? Their stuff may make a lot of this obsolete sooner or later …

My understanding is that what is requested here is the opposite of what SDoC does. SDoC allows structured machine readable meta-data to be stored on file description pages, in addition to wikitext. This here ticket asks for a way to store wikitext along with the structured machine readable data on data pages.

The overlap I see is "storing two different kinds of content on the same page". This can be done with the new MCR infrastructure in core (which enables SDoC, but isn't really part of it). This would mean that the wikitext goes into a separate "slot", instead of being part of the JSON. I think that would be the correct approach, and very similar to other use cases targeted by MCR, such as documentation for templates and Lua modules.
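Purely as an illustration of the slot idea, and not the MCR API itself (the `Revision` class and its methods are hypothetical, and "description" is only the slot name suggested earlier in this discussion), a page revision with two slots could be modelled like this:

```python
import json
from dataclasses import dataclass, field

@dataclass
class Revision:
    """One saved revision holding several named content slots (hypothetical model)."""
    slots: dict = field(default_factory=dict)

    def data(self):
        # Consumers such as Lua or the API would read only the JSON slot.
        return json.loads(self.slots["main"])

    def wikitext(self):
        # Categories, templates and notices would live here, outside the JSON.
        return self.slots.get("description", "")

rev = Revision(slots={
    "main": '{"data": [[1, 2]]}',
    "description": "{{delete|reason=Example}}\n[[Category:Example]]",
})

print(rev.data())
print(rev.wikitext())
```

The JSON stays plain JSON, and the wikitext never has to be escaped into or stripped out of it.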

Nikki added a subscriber: Nikki.Aug 13 2018, 11:34 AM

I'm very happy to help with any documentation changes needed for this to progress. I've written some instructions for using map data in Wikidata and I'm waiting for the licenses to be fixed so I can publish it https://www.wikidata.org/wiki/User:John_Cummings/Map_data

Fae added a comment.Jan 23 2019, 10:49 AM

As a reminder, this task has been open for 2 years with a more detailed Wikimedia Commons community consensus to go ahead 15 months ago. https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals/Archive/2017/10#Proposal_to_include_non-CC0_licenses_for_the_Data_namespace

If deferring changes important to the community to Phabricator is the only way we can get things done, it is broken. Volunteers do not wait forever, we come and go. Leave any change long enough, and the momentum is gone, folks will be doing other things and every time this happens the long term members of the community get a little more jaded and pessimistic about the future.

I think the issue is that it's currently unclear who owns the code in question. Pinging the Multimedia team - is this yours? Pinging Community-Tech - can you help out?

Unrelated side note: I would personally love to see datasets closely integrated with Wikibase. But I don't think that's on anyone's roadmap at the moment.

MSantos added a subscriber: MSantos.

@daniel the team responsible for Maps maintenance is the Reading Infrastructure. I am tagging it so we can evaluate this task.

@MSantos but this does not have much to do with maps at all...

@daniel this is needed to allow any non-CC0 map data to be imported, e.g. OpenStreetMap

@Mrjohncummings Right, sorry, I wasn't clear: the code behind this is Extension:JsonConfig, which knows nothing about maps, and has nothing to do with the maps code. The data can of course be used in maps, and can represent coordinates, geo-shapes, etc.

But I see now that JsonConfig is also owned by ReadingInfrastructure, so ignore me :)

@daniel I'm glad someone understands how all this stuff works :)

Jhernandez added a subscriber: Jhernandez.

@MSantos Moving to needs analysis. Please have a look when you can and update the description like the template to prioritize it better. Thank you!

Mrjohncummings added a comment.EditedFeb 1 2019, 10:35 AM

@Jhernandez @MSantos is there anything I can do to help move this along? I'm stuck on my work for Wikidata till this gets addressed. Should this be assigned to someone in particular?

Jopparn added a subscriber: Jopparn.Feb 5 2019, 2:41 PM

@Mrjohncummings, I am not sure yet how I can help you with this case, I read the long discussion at T154071: Allow non-CC0 licensed data for datasets and I am now trying to understand the technical aspects of this issue. I need to point out that we are not planning active development on JsonConfig, but will be available to support any volunteer work, as mentioned in T154071#4323571.

This seems to be an important feature with no consensus on architecture design yet. Let's chat so I can better understand what your needs are. My IRC is mateusbs17.


@Jhernandez, thanks I will update the description as soon I have more info.

@MSantos thanks for your reply, I don't use IRC, I'll send you an email today

Best

@Lydia_Pintscher @johl just so you're aware this is happening: this is the blocker for adding non-CC0 datasets to Commons that can be used by the query service. I have documentation written and ready to go for people to use Commons data files in Wikidata queries (e.g. map shape files from OpenStreetMap). If you could ask in your network to find a volunteer (or staff member) to help move this along, that would be super.

@Mrjohncummings @MSantos have any clearer directions for development come out of your February discussion?

Looking at this task thus far, and seeing what is already implemented on Commons, a Multi-Content Revisions approach (with a slot for the accompanying wiki markup) sounds like the best solution IMO. I'm guessing this would only require a mechanism in JsonConfig to handle multiple slots?

Change 511088 had a related patch set uploaded (by MSantos; owner: MSantos):
[mediawiki/extensions/JsonConfig@master] Allow wikitext in description for Data namespace

https://gerrit.wikimedia.org/r/511088

NavinoEvans rescinded a token.
NavinoEvans added a subscriber: NavinoEvans.
Nforcer7 changed the task status from Declined to Invalid.Aug 1 2019, 5:19 PM