Page MenuHomePhabricator

Remove materialized .json files from event schema repositories
Closed, ResolvedPublic

Description

Our event schema repositories are currently configured to 'materialize' the working 'current.yaml' files into 2 different formats: json and yaml. These versioned and materialized files are (or should be) semantically identical.

We originally decided to materialize both of these for convenience by users. If a user preferred to use a json parser instead of a yaml parser in their language, they could.

However, in hindsight, yaml parsers are common enough that it is probably not worth the cognitive and maintenance overhead of keeping both formats. In T308450, we realized that the ordering in the two formats was inconsistent, which...is annoying.

Let's just remove the .json files and stop materializing them.

Event Timeline

I'll wait until the week of Sept 5 2022 before moving forward on this. If you have objections, please comment before then. :)

For raw compatibility purposes, JSON is inherently valid YAML, so arguably the most-convenient direction to go in would be to keep just the materialized JSON.

For raw compatibility purposes, JSON is inherently valid YAML, so arguably the most-convenient direction to go in would be to keep just the materialized JSON.

+1

For raw compatibility purposes, JSON is inherently valid YAML

That's true, but I find the JSON harder to read, link to, and edit. We def use JSON for data serialization, but for schemas, I prefer the YAML.

We aren't considering getting rid of the current.yaml files, so I think having just one schema format to think about is better.

That's true, but I find the JSON harder to read, link to, and edit. We def use JSON for data serialization, but for schemas, I prefer the YAML.
We aren't considering getting rid of the current.yaml files, so I think having just one schema format to think about is better.

I am in favor of reducing cognitive overhead from switching between reading YAML and reading JSON (e.g. if reviewing the materialized version after editing current.yaml). Both should use the same format, be it YAML or JSON. For machines it doesn't matter, and I'm with @Ottomata on preferring YAML over JSON in terms of readability and ease of editing.

Would this change impact existing pipelines or code? I'm wondering, for instance, if the json files are accessed programmatically and whether removing them would have downstream effects.

Would this change impact existing pipelines or code?

Not that I am aware of! :)

YAML is a bit of a trainwreck of a format, with significant differences between its various revisions, significant differences between different parsers, severe usability defects at least in YAML 1.1 (e.g. Norway problem), some of the more esoteric language features facilitating security vulnerabilities (at random, here's a pyyaml arbitrary code execution vulnerability that was only fixed a year ago, after being in the wild for over a year). I'm not sure forcing the server and all clients to parse YAML is a good idea.

I'm not sure forcing the server and all clients to parse YAML is a good idea.

Hm. Getting rid of the yaml files would be more difficult than getting rid of the JSON files, in that the tooling uses the yaml now. The 'extensionless symlink' (e.g. this one) points at the .yaml file.

Existent event platform and data engineering tooling uses yaml parsers, which will parse the json files fine, so we wouldn't need tooling changes to make the extensionless symlink point at the .json file. But, I'm worried the spark/hive ingestion process will require more care for this change, mostly because of T308450. I'd need to verify that though.

The schemas are often linked for humans to read. Perhaps if we had a nice UI for them (or Datahub integration), we wouldn't do this anymore, but for now it is much easier to give a product manager a link to a yaml file.

Meh, my preference for yaml is mostly about useability for humans. Is that a good enough reason to keep them? I'm not sure. The materialized yaml files use pretty simple yaml, and because they are generated via a yaml dumper, I'd hope that they would correctly serialize e.g. "NO" instead of false.

The schemas are often linked for humans to read. Perhaps if we had a nice UI for them (or Datahub integration), we wouldn't do this anymore, but for now it is much easier to give a product manager a link to a yaml file.
Meh, my preference for yaml is mostly about useability for humans. Is that a good enough reason to keep them? I'm not sure. The materialized yaml files use pretty simple yaml, and because they are generated via a yaml dumper, I'd hope that they would correctly serialize e.g. "NO" instead of false.

For what it's worth, my experience is that these files aren't particularly intelligible to anyone who isn't a developer who's viewing them in an editor with syntax highlighting. As is generally true of YAML in most situations; it's not specific to these ones. Though the way the materialization process works arguably makes these worse than most.

(Because, ignoring the general illegibility, you either give them the current schema which requires cross-referencing other schema... or you give them the fully-expanded versioned schema, which gives little indication of which of the included fields will actually be present in the final data...)

For what it's worth, my experience is that these files aren't particularly intelligible to anyone who isn't a developer who's viewing them in an editor with syntax highlighting

That's true, but also true (and IMO worse) than json files?

That's true, but also true (and IMO worse) than json files?

Oh, sure, I was mostly just meaning to say that arguments about legibility don't really apply to either side of this choice. If we weren't pretty-printing the JSON there'd be a meaningful difference... but we are, so it's all equivalent.

I'll echo my comment in https://phabricator.wikimedia.org/T308450#8214099.

My understanding is that yaml is the only format actually in use. If so, I think there is no point in supporting both; I'd lean towards the path of least resistance (even if suboptimal) and deprecate json support.

My understanding is that yaml is the only format actually in use. If so, I think there is no point in supporting both; I'd lean towards the path of least resistance (even if suboptimal) and deprecate json support.

Agree with @gmodena!

I generally lean towards @Tgr's opinion of yaml as a format, and @DLynch's opinion of yaml not being all that readable. In our case here, yaml is being de-referenced and re-generated by a yaml dumper, so it seems unlikely that we'd hit problems with the format itself. It's kind of a subset of yaml that seems safe (this is hand-waivey, we maybe should look at it closer).

My main point here is that I think we need a UI anyway. What we want to do with these schemas as we communicate across teams and with product managers is too complicated to just rely on yaml being easy to read. I think we need a proper UI that shows the schema, its references to other schema parts, and makes it as easy as humanly possible to think about instrumentation and data flows. And we don't have to write it from scratch, there are a ton out there:

https://navneethg.github.io/jsonschemaviewer/
https://jlblcc.github.io/json-schema-viewer/

(try pasting one of our schemas into the above)

@Milimetric: I don't know that that type of visualizer/UI would be useful in practice:

Screen Shot 2022-09-07 at 11.01.07 AM.png (2×948 px, 325 KB)

…without heavy modifications from us, but yes at least we wouldn't be writing from scratch. You might laugh but I've had people tell me they even miss how schemas rendered on Meta wiki. Maybe something like this:

Screen Shot 2022-09-07 at 10.58.49 AM.png (1×4 px, 959 KB)

could work?

Re UI: We'd like to integrate Event Platform streams with datahub, so they'd end up being browsable just like Hive tables.

My understanding is that yaml is the only format actually in use. If so, I think there is no point in supporting both; I'd lean towards the path of least resistance (even if suboptimal) and deprecate json support.

+1

I generally lean towards @Tgr's opinion of yaml as a format, and @DLynch's opinion of yaml not being all that readable. In our case here, yaml is being de-referenced and re-generated by a yaml dumper, so it seems unlikely that we'd hit problems with the format itself. It's kind of a subset of yaml that seems safe (this is hand-waivey, we maybe should look at it closer).

My main point here is that I think we need a UI anyway.

yaml sounds like an internal implementation detail, and supporting two formats is overhead for the library maintainers. The shortcomings of yaml (which I agree with - I'm not defending formats) do not impact end users.

If someone wants to step up as maintainer, and support future/eventual use cases, a solution could be split out the JSON logic into it's own artifact.

Okay, objections about yaml vs json are noted. Since all users are currently using the yaml files, and we don't want to manage a 'migration' to json at this time, we are going to move forward with removing the json files.

In a future task, if we decide that we really want/need json instead of yaml, we can always re-materialize the json and do the migration then.

Change 839677 had a related patch set uploaded (by Ottomata; author: Ottomata):

[schemas/event/primary@master] Remove materialized json files and disable materializing them

https://gerrit.wikimedia.org/r/839677

Change 839678 had a related patch set uploaded (by Ottomata; author: Ottomata):

[schemas/event/secondary@master] Remove materialized json files and disable materializing them

https://gerrit.wikimedia.org/r/839678

Mentioned in SAL (#wikimedia-analytics) [2022-10-11T15:44:07Z] <ottomata> remove materialized .json files from schemas/event/secondary - this should be a no-op as no clients should actually be using the json files. - T315674

Mentioned in SAL (#wikimedia-operations) [2022-10-11T15:44:10Z] <ottomata> remove materialized .json files from schemas/event/secondary - this should be a no-op as no clients should actually be using the json files. - T315674

Change 839678 merged by jenkins-bot:

[schemas/event/secondary@master] Remove materialized json files and disable materializing them

https://gerrit.wikimedia.org/r/839678

Mentioned in SAL (#wikimedia-analytics) [2022-10-12T15:26:26Z] <ottomata> remove materialized .json files from schemas/event/primary - this should be a no-op as no clients should actually be using the json files. - T315674

Mentioned in SAL (#wikimedia-operations) [2022-10-12T15:26:33Z] <ottomata> remove materialized .json files from schemas/event/primary - this should be a no-op as no clients should actually be using the json files. - T315674

Change 839677 merged by jenkins-bot:

[schemas/event/primary@master] Remove materialized json files and disable materializing them

https://gerrit.wikimedia.org/r/839677