
RFC: Use YAML instead of JSON for structured on-wiki content
Closed, Declined, Public

Description

JSON has three major problems: it does not allow comments, it is hard to read, and it is very strict about things like trailing commas and quotes, making it much harder for humans to write. Since JSON is almost a perfect subset of YAML, I would like to propose that we start using YAML for things like template parameter documentation, graphs, on-wiki configuration, and other "JSON as wiki markup" places. See more examples below.

Note that internally we may still process all the data as JSON, e.g. if it needs to be available to JavaScript clients, or stored in the page properties.

<graph> Vega JS "JSON programming language"

example1 or example2 (scroll down to the <graph> section) - we already use the HHVM's non-standard JSON parsing for comments, lenient trailing commas, and multiline strings autoconversion to \n.

JSON
{
  // Create a dataset that indicates which cars are visible and should drive the domain of the scales.
  "name": "visible-cars",
  "source": "cars",
  "transform": [
    {
      "type": "filter",
      "test": "datum['Adjusted price'] >= selectedPrice"
    },
    {
      "type": "filter",
      "test": "if(selectedPriceType == 'greater', true, if (selectedPrice == 0, if(datum.bin_start==0 & datum['Adjusted price'] < 5000000, true, false), if (selectedPrice == 5000000, if(datum.bin_start==0 & datum['Adjusted price'] >= 5000000, true, false), if(selectedPrice == datum.bin_start, true, false))))"
    }
  ]
},
YAML
  # Create a dataset that indicates which cars are visible and should drive the domain of the scales.
- name: visible-cars
  source: cars
  transform:
  - type: filter
    test: datum['Adjusted price'] >= selectedPrice
  - type: filter
    test: >
      if (selectedPriceType == 'greater', true,
        if (selectedPrice == 0,
          if (datum.bin_start == 0 & datum['Adjusted price'] < 5000000, true, false),
          if (selectedPrice == 5000000,
            if (datum.bin_start == 0 & datum['Adjusted price'] >= 5000000, true, false),
            if (selectedPrice == datum.bin_start, true, false)
      )))

JSON as a template parameter

JSON is sometimes useful as a parameter for a template that contains a graph, e.g. this template:

JSON
{{ Graph:Street map with marks | lat=37.8 | lon=-122.4 | zoom=9 |
{"lat":37.8, "lon":-122.4, "img":"wikirawupload:{{filepath:Volcano red 32x32.svg|32}}", "width":25, "height": 25, "offsetY":-10, "text": "Volcano", "textFontWeight": "bold", "textFontSize": 20, "textColor": "#00f"}
}}
YAML
{{ Graph:Street map with marks | lat=37.8 | lon=-122.4 | zoom=9 |
  lat: 37.8
  lon: -122.4
  img: wikirawupload:{{filepath:Volcano red 32x32.svg|32}}
  width: 25
  height: 25
  offsetY: -10
  text: Volcano
  textFontWeight: bold
  textFontSize: 20
  textColor: "#00f"
}}

SPARQL query as a single string for Kartographer

The map example, while much better than the original URL-encoded example, is still a major pain to write because one has to be very careful about quote escaping. It also requires non-standard (multi-line) JSON parsing.

JSON
<maplink latitude="52.16" longitude="-112.15" zoom="3" width="800" height="500" text="Governors of US states with their party affiliation">
{
  "type":"ExternalData",
  "service": "geoshape",
  "query": "SELECT ?id ?head
(SAMPLE(?img) as ?img) 
(SAMPLE(?fill) as ?fill) 
(concat('[[', substr(str(?link),31,100),  ' | ', ?headLabel, ']]') as ?title)
...
YAML
<maplink latitude="52.16" longitude="-112.15" zoom="3" width="800" height="500" text="Governors of US states with their party affiliation">
  type: ExternalData
  service: geoshape
  query: >
    SELECT ?id ?head
    (SAMPLE(?img) as ?img) 
    (SAMPLE(?fill) as ?fill) 
    (concat('[[', substr(str(?link),31,100),  ' | ', ?headLabel, ']]') as ?title)

Zero configuration

The Zero team had tons of issues configuring Zero at first because of trailing commas and the lack of comments, but with the non-standard JSON it has gotten better. YAML-style would help, but building a full-blown interface might have been considerably more effort.

Event logging schema

Our event schema - the "action" field (enum) is horribly documented because there is a disconnect between the enum values and their descriptions, and it doesn't even allow JSON comments (event schema reformats JSON on save). Editing that doc is also painful because the \n are not that readable.

Event Timeline

Yurik created this task. Oct 2 2016, 6:41 PM
Restricted Application added a subscriber: Aklapper. Oct 2 2016, 6:41 PM
Restricted Application added a project: VisualEditor. Oct 2 2016, 6:45 PM
Yurik updated the task description. Oct 2 2016, 6:49 PM
Yurik renamed this task from RFC: Use YAML instead of JSON for content handler to RFC: Use YAML instead of JSON for structured wiki content. Oct 2 2016, 7:25 PM
Yurik renamed this task from RFC: Use YAML instead of JSON for structured wiki content to RFC: Use YAML instead of JSON for structured on-wiki content.
Yurik added projects: Maps (Kartographer), Maps.
Restricted Application added a project: Discovery. Oct 2 2016, 7:27 PM
Yurik moved this task from All map-related tasks to Tracking on the Maps board. Oct 2 2016, 9:15 PM
Yurik updated the task description. Oct 2 2016, 9:22 PM
Yurik added a project: TechCom-RFC.
Restricted Application added a project: Multimedia. Oct 2 2016, 10:29 PM

I think the premise here is flawed. I don't think we should ever have users editing raw JSON themselves - we should be building proper forms that have input, validation, etc. JSON vs yaml is just a backend implementation detail that users should not care about.

Restricted Application added a subscriber: Matanya. Oct 2 2016, 11:29 PM
Harej added a subscriber: Harej. Oct 3 2016, 1:25 AM
bd808 added a subscriber: bd808. Oct 3 2016, 2:25 AM

As the long-time maintainer of YAML bindings for PHP, I oppose a move from JSON to YAML. The fundamental benefit of JSON is that it is a simple specification that lends itself to relatively easy implementation for both producing and consuming valid documents. It's easy to think of YAML as "JSON but with comments" from the point of view of a basic user, but the full specification is far more complex. So complex, in fact, that arguably there is no single implementation of a YAML parser that completely fulfills the specification.

TheDJ added a subscriber: TheDJ. Oct 3 2016, 7:47 AM

I think the premise here is flawed. I don't think we should ever have users editing raw JSON themselves - we should be building proper forms that have input, validation, etc.

👏👍

jayvdb added a subscriber: jayvdb. Oct 3 2016, 7:47 AM

If we switched to YAML, we should define a subset that fixes only these problems.

Instead I recommend TOML, which is gaining support. E.g. https://www.python.org/dev/peps/pep-0518/

I get where Yuri is coming from. YAML's comments and more flexible handling of formatting make it a more interesting format for hand crafting. The idea of allowing a more forgiving format for text entry is appealing. Also, comments.

This doesn't seem like an idea we should pursue in 2016. Probably not 2017 either. JSON seems good enough for this phase of our usage, whereas a move to YAML would touch a lot of code (and introduce a lot more code) without providing enough benefit to justify the stability, security, and performance risks of the move. It seems worth setting this aside, with a plan to revisit it a little bit down the road (e.g. 2018)

Yuri, could you give an example of a JSON file in use that would benefit from a migration to YAML?

Harej added a comment. Oct 3 2016, 7:57 AM

I don't think we should ever have users editing raw JSON themselves - we should be building proper forms that have input, validation, etc.

Definitely agreed, though (a) the fact is we have features that can only be interacted with through editing raw JSON (UploadWizard campaigns, EventLogging) and (b) in the case of CollaborationKit, even though we are building editing interfaces there will still be the option to edit raw JSON, intended for advanced users.

Regarding this RFC I wonder what the target is. There's nothing in core that I know of that uses JSON. Would the outcome be to adapt the extensions that currently use it? Or to construct a layer that allows JSON to be edited as YAML, kind of like a Parsoid for structured data?

Paladox added a subscriber: Paladox. Oct 3 2016, 8:07 AM

YAML is nice for writing it (you have to worry less about quotes and strings and escape characters), but is annoying to read or modify (there's a dozen ways to encode an array of strings, so unless you've been working with it for a long time, every new file is a new adventure).

I don't think we really want YAML; what we want is JSON, but with comments and trailing commas allowed. Using a YAML parser is one of the ways to implement this (I do it myself sometimes), but adding a preprocessing step before JSON parsing would be another (we even have an implementation of it already).

I honestly don't think introducing YAML content would further your goal of making editing that content easier. We should try to define the format we really want (either as subset of YAML, or superset of JSON) and implement a parser for it. Has no one really specified a "JSON-with-comments-and-trailing-commas" anywhere on the internet before? It feels unlikely that we'd be the first people to have this problem…
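The "preprocessing step before JSON parsing" idea mentioned above can be sketched in a few lines of Python. This is only an illustration and not the existing implementation referred to in the comment; the function names `strip_lenient` and `loads_lenient` are made up for this sketch. It assumes `//` and `/* */` comment styles and treats a comma directly before a closing `]` or `}` as a trailing comma. Removed characters are replaced by spaces so that line and column numbers in later parse errors still match the source.

```python
import json
import re

# Each pattern first tries to match a complete JSON string literal so that
# comment markers or commas inside strings are never touched.
_STRING = r'("(?:[^"\\]|\\.)*")'
_COMMENT = re.compile(_STRING + r'|//[^\n]*|/\*.*?\*/', re.DOTALL)
_TRAILING_COMMA = re.compile(_STRING + r'|,(?=\s*[\]}])', re.DOTALL)


def _keep_strings_blank_rest(match):
    # Group 1 is a string literal: return it unchanged.
    if match.group(1) is not None:
        return match.group(0)
    # Anything else is a comment or trailing comma: overwrite it with
    # spaces, keeping newlines so error positions stay the same.
    return re.sub(r"[^\n]", " ", match.group(0))


def strip_lenient(source):
    """Blank out comments, then trailing commas (hypothetical helper)."""
    without_comments = _COMMENT.sub(_keep_strings_blank_rest, source)
    return _TRAILING_COMMA.sub(_keep_strings_blank_rest, without_comments)


def loads_lenient(source):
    """Parse JSON that may contain comments and trailing commas."""
    return json.loads(strip_lenient(source))


example = """{
  // which service to query
  "url": "http://example.org/a,]", /* not a comment inside the string */
  "list": [1, 2, ],
}"""
print(loads_lenient(example))
```

Comments are removed in a first pass so that a comma followed by a comment and then a closing bracket is still recognised as trailing in the second pass.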

bd808 added a comment. Oct 3 2016, 3:46 PM

I honestly don't think introducing YAML content would further your goal of making editing that content easier. We should try to define the format we really want (either as subset of YAML, or superset of JSON) and implement a parser for it. Has no one really specified a "JSON-with-comments-and-trailing-commas" anywhere on the internet before? It feels unlikely that we'd be the first people to have this problem…

The folks at Facebook decided that the JSON parser in HHVM should be lenient by default. {{cn}} but I remember hearing via unofficial channels that they later had a site outage because of a trailing comma in a JSON config file that was parsed outside of HHVM.

JSON with comment stripping seems reasonable. We can still provide pure JSON via a parameter. I don't think allowing trailing commas is that useful if you have (proper) input validation.

Florian added a subscriber: Florian. Oct 4 2016, 5:01 AM

I also think we shouldn't move to YAML for this little benefit. E.g. JSON already supports comments; not as easily as YAML (you have to define what you treat as comments in the consumer), but it's possible. On the other hand, I agree with legoktm that we shouldn't discuss a new markup language beside wikitext to be editable by the user by default (for advanced users it should still be possible, but not as the default); we should provide editors and forms that offer an easy way to edit and probably view things. Based on that, the back-end format doesn't really matter as long as the requirements are met. And as far as I know, JSON fulfils our requirements so far, right?

Btw, if this is really meant as a change to user-visible JSON content, we should probably ask the audience who will need to work with it, the editing community (e.g. template maintainers for your TemplateData example or the technical village pump for a general audience), probably with some "current" and "after" examples :) But we should first be sure that we all get a benefit from it :)

Bawolff added a subscriber: Bawolff. Oct 4 2016, 6:00 PM

I'm not a fan of YAML, as it has certain features which are unsafe. Obviously these can be disabled, but it's still kind of nasty.

Allowing comments might be kind of nice. Not having comments is one of the only things I dislike about JSON.

I also agree with Lego that JSON should not be exposed to users.

I honestly don't think introducing YAML content would further your goal of making editing that content easier. We should try to define the format we really want (either as subset of YAML, or superset of JSON) and implement a parser for it. Has no one really specified a "JSON-with-comments-and-trailing-commas" anywhere on the internet before? It feels unlikely that we'd be the first people to have this problem…

Quick web searches confirm that others have had this problem:

  • https://hjson.org/
    • Hjson is a syntax extension to JSON. It's NOT a proposal to replace JSON or to incorporate it into the JSON spec itself. It's intended to be used like a user interface for humans, to read and edit before passing the JSON data to the machine.
  • https://github.com/sindresorhus/strip-json-comments
    • It will replace single-line comments // and multi-line comments /**/ with whitespace. This allows JSON error positions to remain as close as possible to the original source.

...but it's not clear if anyone has submitted anything to any standards body proposing an update that allows this. The IETF json working group publishes all of their material if someone cares to research this further: https://tools.ietf.org/wg/json/

(p.s. some intrepid soul could even join the mailing list and ask on list)

Yurik added a comment. Oct 4 2016, 9:12 PM
This comment was removed by Yurik.
Yurik updated the task description. Oct 5 2016, 12:06 AM
Yurik added a comment. Oct 5 2016, 1:13 AM

I just moved the examples into description, as they represent the reason why in some cases YAML-like structure is better. Here I will try to sum up the comments I saw so far, and reply to them:

  • HJSON Love it! Seems very similar to what I had in mind. And there is even a PHP lib for it. I would like <graph> to support it. Thanks @RobLa-WMF!
  • YAML has too many quirks -- agree. I especially hate that yes, no, on, and off are special keywords that become true/false in JSON. I think true/false/null should be the only special keywords understood by YAML. I hope HJSON doesn't create as many WTF moments as YAML.
  • Humans should not edit JSON on Wiki/use proper UI -- just like with the Windows/Mac (GUI) vs Linux (shell) debate, or in attempts to create a "UI" for programming, we tried to get rid of the "raw text" and replace it with a GUI, and it repeatedly failed. We will always have novices and power users, and if there is only a simple interface, power users will be unable to perform all the complex editing conveniently. So we really need both. On the other hand, I agree that large amounts of structured data should not reside inside the wiki articles, just like you should not put all your code into one file. So if a multi-stream content handler is implemented, and/or if we have better cross-wiki content sharing, some of the data could be moved out more easily.

    There are many tags that require structured data - Graphs, Maps, Easy timeline, Template data, Image map, Input box - and most of them either used JSON or introduced their own syntax, possibly because JSON is so bad for user editing. At this point in time we do not have a good alternative to storing data in wiki markup, despite years of us trying to figure out a better way. I'm all for figuring it out, but in the meantime, let's think of a better "intermediary" step that we could consistently use. It does not have to be a magic bullet - every extension author may choose to migrate at their own pace, but it will be better if we, at least socially, recommend a common technological path.
  • JSON is good for document storage/parsing - agree, but we are not talking about either, we are talking about ease of use for the users to enter structured data, and users need much more flexibility. Imagine PHP without comments or with horribly strict commas and newlines.
  • TOML Thanks, I looked at it, but it seems it is targeted primarily at the config file use case (looks very much like .ini), and it might not be well suited for "data as code" for cases like graphs.
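The keyword quirk mentioned in the list above (yes, no, on, and off turning into booleans) can be illustrated with a small Python sketch. It is based on the boolean type definition of YAML 1.1, which accepts lower-case, Capitalized and UPPER-CASE spellings; real parsers differ (some ignore the single-letter y/n forms, and YAML 1.2 narrowed the set), so the table and the function name `yaml11_bool` are assumptions made for illustration only.

```python
# Words that YAML 1.1 resolves to booleans when written without quotes.
YAML11_BOOLEANS = {
    "y": True, "yes": True, "on": True, "true": True,
    "n": False, "no": False, "off": False, "false": False,
}


def yaml11_bool(scalar):
    """Return True/False if an unquoted scalar would become a boolean,
    or None if it would stay a string (hypothetical helper)."""
    # Only the lower, Capitalized and UPPER spellings are recognised,
    # so mixed-case words such as "yEs" remain plain strings.
    if scalar in (scalar.lower(), scalar.capitalize(), scalar.upper()):
        return YAML11_BOOLEANS.get(scalar.lower())
    return None


# Country codes are a classic way to trip over this: "NO" needs quoting.
for code in ["SE", "NO", "DK"]:
    print(code, yaml11_bool(code))
```

A check like this could be used to warn an editor that a value needs quotes before it silently changes type.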

Hjson's quoteless strings mildly conflict with wikitext links and templates. For example, this is a syntax error (testing on http://hjson.org/try.html):

{
    foo: [[Text]].
}

Although this isn't (it's only a problem at the beginning of the string):

{
    foo: Some [[text]].
}

So maybe not a big deal, but something that'd have to be documented.

Yurik added a comment. Oct 5 2016, 3:56 AM

Well, this isn't exactly a conflict with wiki markup - it is simply that a quote-less string cannot begin with a '[', so such a string has to be surrounded by quotes. BTW, I suspect YAML has the same issue. I am more concerned with the very common use case of {{#tag:graph| ... }} or passing data as a parameter, in which case data needs to be both structured and wiki-markup safe. For example, JSON restrictions are:

  • Two closing curly braces } must be separated by some whitespace: } }
  • Two opening or closing square brackets must be separated by some whitespace: [ [, ] ]
  • Pipe symbol | must be written as {{!}}
  • When you need to write {{ or }} without a space (e.g. in a string), use \u007b and \u007d instead of at least one of them.

In HJSON, multiline strings use ''', which would clearly be a problem that I'm not sure how to work around.
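The JSON restrictions listed above could be applied mechanically when generating data for a template parameter. The Python sketch below is one possible way to do it, not an existing tool; `wikitext_safe_json` is a made-up name. It deviates from the list in two labelled ways: it escapes every brace and bracket inside strings rather than "at least one of them", and it writes a pipe inside a string as the JSON escape `\u007c` instead of `{{!}}`, on the assumption that the consumer decodes standard JSON escapes.

```python
import json
import re

# Matches one complete JSON string literal, including escaped quotes.
_JSON_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')

# Characters wikitext may interpret, mapped to equivalent JSON escapes.
_ESCAPES = {
    "{": "\\u007b", "}": "\\u007d",
    "[": "\\u005b", "]": "\\u005d",
    "|": "\\u007c",
}


def _escape_string(match):
    # None of these characters can be part of a JSON escape sequence,
    # so a plain per-character substitution is safe here.
    return "".join(_ESCAPES.get(ch, ch) for ch in match.group(0))


def wikitext_safe_json(value):
    """Serialize value so the text has no {{, }}, [[, ]] or | in it."""
    text = json.dumps(value)
    text = _JSON_STRING.sub(_escape_string, text)
    # Only structural braces and brackets remain; put a space between
    # any two identical adjacent ones, e.g. "}}" becomes "} }".
    return re.sub(r"([{}\[\]])(?=\1)", r"\1 ", text)


print(wikitext_safe_json({"marks": [[1, 2]], "text": "a|b {{c}}", "opts": {}}))
```

The output still parses as ordinary JSON, so the receiving extension would not need any changes under these assumptions.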

Base added a subscriber: Base. Edited Oct 5 2016, 6:42 AM

@Yurik, I must note that Graph is super uber mega ultra difficult as it is, with very poor documentation. It would become worse if you couldn't even use JSON taken from upstream docs and examples. That is not the case for stuff like upload wizard campaigns, where the syntax is our own to take care of, so we can change it to whatever we want, but such things must be considered.

brion added a subscriber: brion. Oct 5 2016, 8:20 PM
Yurik moved this task from Unsorted to Tracking on the Maps (Kartographer) board. Oct 6 2016, 8:32 PM
RobLa-WMF moved this task from Backlog to Brion on the TechCom-Has-shepherd board.
RobLa-WMF moved this task from P1: Define to Under discussion on the TechCom-RFC board.
MarkTraceur moved this task from Untriaged to Tracking on the Multimedia board. Nov 28 2016, 5:45 PM
MarkTraceur added a subscriber: MarkTraceur.

I stumbled over this while looking at the Multimedia board, and apart from being baffled why our team is tagged in this (apart from the fact that this might affect two of our products currently using JSON), I think the most important comment so far is @Legoktm's. This is a solution in search of a problem, and we should not expose JSON to users unless they really want it.

Yurik removed a project: Maps. Dec 15 2016, 4:39 AM
Restricted Application added a project: Analytics. Jan 24 2017, 11:39 PM
MaxSem closed this task as Declined. Jan 25 2017, 12:05 AM

Closing per above.