
Render human-readable schemas on schema.wikimedia.org
Open, Low, Public, 3 Estimated Story Points

Description

Background

schema.wikimedia.org hosts JSON Schema definitions for Wikimedia Event Schemas. The schemas are presented as raw YAML files.

Motivation

In addition to their programmatic uses, these schemas contain documentation that needs to be read by humans, including descriptions, types, and requirements. For example, WMF product teams building instruments using the Metrics Platform need to read schema definitions to learn about the properties supported by the Metrics Platform schemas. However, the raw YAML files are not easy to read, especially for longer schemas. Attempts to make this information more accessible usually result in the information being manually duplicated in another location, creating documentation that is likely to become out of date as schemas change.

Proposal

Integrate a rendering process into the build step for schema.wikimedia.org that produces an HTML version of the schema definitions that will be served when browsing the site.
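As a rough illustration of what such a rendering step might emit, here is a minimal Python sketch that turns an already-parsed schema (a plain dict) into an HTML table of its top-level properties. The function name render_schema_html and the table layout are hypothetical, and the sketch ignores nested objects, $ref, and allOf, which a real tool such as json-schema-for-humans would handle.

```python
from html import escape


def render_schema_html(schema: dict) -> str:
    """Render the top-level properties of a parsed JSON Schema as an HTML table.

    Hypothetical helper for illustration only: it does not descend into
    nested objects and does not resolve $ref.
    """
    required = set(schema.get("required", []))
    rows = []
    for name, prop in sorted(schema.get("properties", {}).items()):
        rows.append(
            '<tr id="{0}"><td><code>{0}</code></td>'
            "<td>{1}</td><td>{2}</td><td>{3}</td></tr>".format(
                escape(name),
                escape(str(prop.get("type", ""))),
                "yes" if name in required else "no",
                escape(str(prop.get("description", ""))),
            )
        )
    title = escape(str(schema.get("title", "Untitled schema")))
    description = escape(str(schema.get("description", "")))
    header = (
        "<tr><th>Property</th><th>Type</th>"
        "<th>Required</th><th>Description</th></tr>"
    )
    return (
        f"<h1>{title}</h1>\n<p>{description}</p>\n"
        f"<table>\n{header}\n" + "\n".join(rows) + "\n</table>"
    )
```

Each row carries an id attribute, so individual properties could be linked with a # anchor. Parsing the YAML file into a dict is left out here because it would need a third-party YAML library.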

Tooling options

json-schema-for-humans

jsonschema2md

[Feel free to add other tools]

Considerations

  • If we choose a tool that generates Markdown instead of HTML, we'll need an additional step to render the Markdown into HTML.
  • How should the URL paths be organized?
  • How do we link from the HTML version to the raw version?
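One possible answer to the last two questions, sketched in Python: keep the HTML file next to the raw file, swap only the extension, and link back with a relative URL. This convention is an assumption for discussion, not something decided in this task, and the helper names html_path_for and raw_link_html are hypothetical.

```python
from html import escape

# Raw schema file extensions this sketch knows how to map (an assumption).
RAW_SUFFIXES = (".yaml", ".json")


def html_path_for(raw_path: str) -> str:
    """Map a raw schema path to the path of a sibling HTML file."""
    for suffix in RAW_SUFFIXES:
        if raw_path.endswith(suffix):
            return raw_path[: -len(suffix)] + ".html"
    raise ValueError(f"not a recognized raw schema path: {raw_path}")


def raw_link_html(html_path: str, raw_suffix: str = ".yaml") -> str:
    """Build a relative link from a rendered HTML page back to its raw file."""
    if not html_path.endswith(".html"):
        raise ValueError(f"not an HTML path: {html_path}")
    raw_path = html_path[: -len(".html")] + raw_suffix
    # Both files live in the same directory, so the file name alone
    # is a valid relative URL.
    file_name = raw_path.rsplit("/", 1)[-1]
    return f'<a href="{escape(file_name)}">View raw schema ({escape(file_name)})</a>'
```

With this layout, analytics/product_metrics/web/base/1.3.0.yaml would be rendered to analytics/product_metrics/web/base/1.3.0.html, and the raw URLs would stay unchanged.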

Alternatives

See T372680: Investigate schema visualization tools for schema.wikimedia.org for alternative approaches.

Event Timeline

Can I see an example of "The schemas are presented as raw YAML files."? I want to make sure the raw YAML file is accessible for non-technical users.

Here's the raw version of the base schema for web: https://schema.wikimedia.org/repositories//secondary/jsonschema/analytics/product_metrics/web/base/1.3.0.yaml

Milimetric triaged this task as Medium priority. Oct 18 2024, 4:22 PM
Milimetric updated Other Assignee, added: Milimetric.
Milimetric set the point value for this task to 3.

This is an amazing idea, thank you! We have always wanted to make schema.wikimedia.org more readable, but never had time to prioritize it. The existing UI was a best-effort job I did in a day or two, so please! A replacement would be amazing.

Also, TBD, @daniel and I are considering expanding the scope of schema.wikimedia.org to host more than just event schemas. (Do we have a ticket, Daniel?)

There are other schema repo improvements we would like to make too!

Also:

EventStreams uses OpenAPI specs and doc UI to show a human readable form. E.g. https://stream.wikimedia.org/?doc#/streams/get_v2_stream_mediawiki_page_change_v1

There is an eventstreams-internal instance that can be accessed via an ssh tunnel now. This has access to all Event Platform streams and schemas. If we did T348763: Make eventstreams-internal available to WMF staff without an ssh tunnel, perhaps more folks would use it?

Also:

@tchin implemented Event Stream -> Datahub integration, so you can view all declared streams and schemas at https://datahub.wikimedia.org/search?filter_platform=urn:li:dataPlatform:eventstreams

You know what else would be cool!?

A quick and dirty readable UI of action=streamconfigs, maybe implemented in the EventStreamConfig extension! Special:StreamConfigs?

E.g. This https://meta.wikimedia.org/w/index.php?title=Special:ApiSandbox#action=streamconfigs&format=json&formatversion=2

But pretty!

Milimetric updated Other Assignee, removed: Milimetric.

Current approach is to look at https://modpython.org/ and see if we can just set up the transformation at serve time. We have too many unanswered questions around how to materialize on-build to resolve and be useful. But doing it at render-time is easier and allows us to iterate faster. The problem will be that jsfh renders the schema file but also renders js and css. So we'd have to host these somewhere and make Apache redirect relative requests for them, from anywhere, to the canonical spot. A bit too much Apache hacking for me, but it should work :)

schema.wikimedia.org is currently hosted via nginx.

Could change though.

"The problem will be that jsfh renders the schema file but also renders js and css"

The eventschemas module in puppet sets up the nginx webserver. It also has static js and css files for pretty-autoindex

"We have too many unanswered questions around how to materialize on-build to resolve and be useful."

FWIW, I think this would not be too hard to do with jsonschema-tools. You'd have to add support for a new html contentType serializer.

Milimetric lowered the priority of this task from Medium to Low. Wed, Dec 11, 12:07 PM

FWIW, I just toyed with https://github.com/coveooss/json-schema-for-humans and https://github.com/tomcollins/json-schema-static-docs a bit.

https://github.com/coveooss/json-schema-for-humans

Pros:

  • HTML docs are prettier and more dynamic.
  • Fields are linkable via # anchors.

Cons:

  • No index generation.
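If the missing index turned out to be the only blocker, it could probably be generated separately. A minimal Python sketch, assuming the build already knows the relative paths of the rendered HTML files (render_index is a hypothetical helper, not part of json-schema-for-humans):

```python
from collections.abc import Iterable
from html import escape


def render_index(html_paths: Iterable[str], title: str = "Schemas") -> str:
    """Render a flat, alphabetically sorted HTML index of rendered schema pages."""
    items = "\n".join(
        f'  <li><a href="{escape(path)}">{escape(path)}</a></li>'
        for path in sorted(html_paths)
    )
    return f"<h1>{escape(title)}</h1>\n<ul>\n{items}\n</ul>"
```

A flat list is the simplest thing that works; grouping by directory or by schema version would be a natural next step.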

https://github.com/tomcollins/json-schema-static-docs

Pros:

  • Should have index generation, but I couldn't get it to work
  • Javascript, so 'easier' (maybe?) to generate static docs with jsonschema-tools?

Cons:

  • Some weirdness with its use of ajv. E.g. our schemas use "https" urls for jsonschema metaschemas (e.g. draft-07), but ajv doesn't like this by default. We overcome this in EventGate by 'aliasing' the https url to the http or built in file version. It might be possible to overcome this with ajvOptions, but I'm not sure.
  • Some schemas that re-use a fragment schema multiple times will have that schema fragment's $id multiple times in the materialized schema (e.g. mediawiki/page/change). This results in Error: reference "/fragment/mediawiki/state/entity/page/2.0.0" resolves to more than one schema
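To see which materialized schemas would hit the second problem before feeding them to a doc generator, a small diagnostic could count repeated $id values. This is a Python sketch operating on an already-parsed schema; the helper names are hypothetical, and it also counts any string value stored under a "$id" key inside examples or defaults, which may produce false positives.

```python
from collections import Counter


def count_ids(node, counts=None) -> Counter:
    """Recursively count string "$id" values in a parsed schema."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        schema_id = node.get("$id")
        # Only string values are schema identifiers; a property that is
        # merely named "$id" has a dict value and is skipped here.
        if isinstance(schema_id, str):
            counts[schema_id] += 1
        for value in node.values():
            count_ids(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_ids(item, counts)
    return counts


def duplicate_ids(schema) -> list:
    """Return the sorted $id values that occur more than once."""
    return sorted(i for i, n in count_ids(schema).items() if n > 1)
```

A non-empty result from duplicate_ids would flag a schema that needs its repeated fragment $id values stripped or rewritten before rendering.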

"FWIW, I think this would not be too hard to do with jsonschema-tools. You'd have to add support for a new html contentType serializer."

On a closer look, this might work if we wanted the HTML doc files to sit alongside the materialized versioned files in the same schema directory, and if we used a JavaScript library. The contentType serializer functions in jsonschema-tools expect to be given a parsed schema as a JavaScript object, not a path to a file.