Define storage format for listed entries (JSON / YAML / Markdown etc) and their placement
Open, Medium, Public

Description

  • Have a categories/placement file which defines available placements/positions for items (translatable)
  • For items, have one file per each listed item, with key-value pairs, such as:
    • title (1 | required | translatable)
    • description (1 | required | translatable)
    • url (>=1 | required | translatable)
    • technology tags (>=1 | required | non-translatable) (for T276704: Allow filtering of content by programming language)
    • placement/position/category/where to display (>=1 | required | non-translatable, from list defined in categories/placement file)
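
The cardinality rules in the list above could be checked mechanically once an item file has been parsed. The following is a minimal sketch, not a decided design: the key names (`technology_tags`, `placement`), the `FIELD_RULES` table, and the `validate_item` helper are all hypothetical, and the item is assumed to be already loaded into a plain dict.

```python
from typing import Any

# Hypothetical rule table: field name -> (minimum count, maximum count or None).
# Mirrors the "(1 | required)" and "(>=1 | required)" notes in the task description.
FIELD_RULES = {
    "title": (1, 1),
    "description": (1, 1),
    "url": (1, None),
    "technology_tags": (1, None),
    "placement": (1, None),
}


def as_list(value: Any) -> list:
    """Normalize a missing, scalar, or list value to a list."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def validate_item(item: dict, allowed_placements: set) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, (minimum, maximum) in FIELD_RULES.items():
        count = len(as_list(item.get(field)))
        if count < minimum:
            errors.append(f"{field}: expected at least {minimum}, found {count}")
        elif maximum is not None and count > maximum:
            errors.append(f"{field}: expected at most {maximum}, found {count}")
    # Placements must come from the list defined in the categories/placement file.
    for placement in as_list(item.get("placement")):
        if placement not in allowed_placements:
            errors.append(f"placement: unknown value {placement!r}")
    return errors
```

Returning a list of problems rather than raising on the first one would let a build step report every broken item file in a single run.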

In the future (non-MVP), potentially expand with

Event Timeline

Aklapper created this task.
Aklapper moved this task from Backlog to 2021-Q3 on the Wikimedia-Developer-Portal board.
Aklapper renamed this task from Define data fields (key-value pairs) for each listed entry to Define Storage format for each listed entry (JSON?, etc). Jul 27 2021, 4:35 PM
Aklapper updated the task description.

This feels very related to what/how we do the static content generation generally. One thing that is sort of common in most of the static site generators I have worked with in the past is using YAML documents as the "source of truth" with the documents having a "metadata" section that describes titles, keywords, authorship, etc and then a "body" section that uses markdown or some other content markup language to describe the human readable content.

YAML (as of version 1.2) is designed as a superset of JSON: all valid JSON documents are valid YAML documents, but not the other way around. JSON Schema is a pretty good way to describe the validated structure of either JSON or YAML. JSON Schema is what is used at https://meta.wikimedia.org/wiki/Toolhub/Data_model#Version_1.2.0 to describe the structure of a toolinfo record.

A schema for these "document pointers" could be some reasonable combination of "required" metadata that we expect to have for all documents and "type specific" data that varies based on some key attribute like the document type or audience.

Aklapper renamed this task from Define Storage format for each listed entry (JSON?, etc) to Define Storage format for listed entries (JSON, YAML, etc). Aug 3 2021, 5:36 PM

I'll repost part of the comment in T287176: Display localized content to readers:

We may have a directory full of files (each file being one content entry), then aggregate template files for an entry, then generate files for each language (entry.cs, entry.de, entry.es, etc).

Aklapper renamed this task from Define Storage format for listed entries (JSON, YAML, etc) to Define storage format for listed entries (JSON / YAML) and their placement. Sep 16 2021, 3:39 PM
Aklapper raised the priority of this task from Low to Medium. Sep 16 2021, 3:51 PM
Aklapper updated the task description.
Aklapper renamed this task from Define storage format for listed entries (JSON / YAML) and their placement to Define storage format for listed entries (JSON / YAML / Markdown etc) and their placement. Sep 16 2021, 7:31 PM

@Aklapper and I talked through this ticket in a conference call today. We got a bit into the technical weeds and then back out to a higher level again, but I think we may be closing in on a reasonable initial data model. I will try to describe it here for others to comment on. Note that I am defining some terms below in order to make this description. The actual words used in the longer term may not be the same, but the concepts they describe feel fairly stable.

Terms

  • document descriptor - The most granular content item in the system. Describes a document/collection of documents external to the developer portal like https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker or https://meta.wikimedia.org/wiki/Offline_Projects.
  • category - A category aggregates one or more document descriptors which are somehow thematically related. Some examples might be "written in PHP" or "bots" or "new code contributor".
  • page - A page aggregates one or more categories which are somehow thematically related. Examples might be "use our content" or "contribute to MediaWiki" or "tools for your community".
  • portal - Collection of pages. This is the whole of the content that we are creating including translated and generated page content.

A document descriptor is the starting point for content authoring and what this task initially was trying to describe. Each descriptor will be a YAML file providing: a title (heading), a description, a collection of 1-N URLs with labels, and a collection of category names. Additional optional fields may be added in the future (like image?).

dd/performant-code.yaml
---
title: Write performant code
description: Learn about caching, backend and page load performance guidelines.
links:
  - url: https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Backend_performance
    label: mw:Wikimedia Performance Team/Backend performance
  - url: https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Page_load_performance
    label: mw:Wikimedia Performance Team/Page load performance
  - url: https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Performance_tuning
    label: mw:Manual:Performance tuning
categories:
  - code quality

A category is the second level of content organization. Each category will be a YAML file providing: the category id of document descriptors to aggregate, a title (heading), and a description. Additional optional fields may be added in the future (like image?).

cat/code-quality.yaml
---
category: code quality
title: Create quality software
description: Read about contribution standards and guidelines to make better software for everyone.
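
To make the relationship between the two file types concrete, here is a rough sketch of how descriptors might be attached to the categories they name. It assumes the YAML files have already been parsed into plain dicts with the keys shown in the samples above; the function names `group_by_category` and `resolve_categories` are made up for illustration.

```python
from collections import defaultdict


def group_by_category(descriptors: list) -> dict:
    """Map each category id to the descriptors that list it under `categories`."""
    grouped = defaultdict(list)
    for descriptor in descriptors:
        for category_id in descriptor.get("categories", []):
            grouped[category_id].append(descriptor)
    return dict(grouped)


def resolve_categories(categories: list, descriptors: list) -> list:
    """Return a copy of each category with its member descriptors attached."""
    grouped = group_by_category(descriptors)
    resolved = []
    for category in categories:
        members = grouped.get(category["category"], [])
        resolved.append({**category, "descriptors": members})
    return resolved


# Sample data shaped like dd/performant-code.yaml and cat/code-quality.yaml.
descriptors = [
    {
        "title": "Write performant code",
        "description": "Learn about caching, backend and page load performance guidelines.",
        "links": [
            {
                "url": "https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Performance_tuning",
                "label": "mw:Manual:Performance tuning",
            }
        ],
        "categories": ["code quality"],
    }
]
categories = [
    {
        "category": "code quality",
        "title": "Create quality software",
        "description": "Read about contribution standards and guidelines to make better software for everyone.",
    }
]

resolved = resolve_categories(categories, descriptors)
```

A descriptor that names several categories would simply appear under each of them, which matches the "collection of category names" field described above.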

A page is the third level of content organization. Each page will be in a format prescribed by the static site generation software that is chosen for T287175: Decide on most suitable underlying technical platform. It is likely that these will take the form of either YAML documents or Markdown documents with a YAML preamble as appropriate for the tooling. A page should have a title, heading(s), and prose text as well as include one or more categories which in turn include one or more document descriptors. We want the inclusion of categories to work like a template transclusion in MediaWiki: somehow in the page's source document we want to say "put the code quality category content here". How that technically works is :waves hands: an implementation detail that we do not know yet, but we will figure it out! The static site framework we choose will either have native support for transclusions, or we will make our own implementation as either a native extension of the framework or as a pre-processor step of some kind.
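
If the pre-processor route were taken, the transclusion step could look roughly like the sketch below. The `{{category:NAME}}` marker syntax is purely an assumption borrowed from MediaWiki's look and feel, not something decided in this task, and `rendered_categories` stands in for whatever already-expanded category content the build produces.

```python
import re

# Hypothetical marker: a line consisting only of {{category:NAME}}.
TRANSCLUDE = re.compile(r"^\{\{category:(?P<name>[^}\n]+)\}\}$", re.MULTILINE)


def transclude(page_source: str, rendered_categories: dict) -> str:
    """Replace each category marker line with that category's rendered content."""

    def replace(match: re.Match) -> str:
        name = match.group("name").strip()
        if name not in rendered_categories:
            # Fail loudly so a typo in a page does not silently drop content.
            raise KeyError(f"page references unknown category {name!r}")
        return rendered_categories[name]

    return TRANSCLUDE.sub(replace, page_source)


page = "Intro text.\n\n{{category:code quality}}\n\nOutro."
rendered = {"code quality": "Create quality software\n- Write performant code"}
expanded = transclude(page, rendered)
```

Passing a function to `re.sub` keeps the replacement text literal, so backslashes in rendered content are not interpreted as regex escapes.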

Finally we get all the way up to the portal level. What this exactly looks like will again depend on details that will not be fully known until T287175 is decided. At a high level though, this is where we will describe the templates for rendering our page files as static HTML.

The build process to get from the source files to the static HTML will be a multi-step pipeline. Our collection of category documents will be processed (likely by custom code we write) to generate intermediate files in the format needed for transclusion into their including pages. This will probably happen multiple times for each category YAML file so that we end up with one intermediate file for each category + display language combination (T287176: Display localized content to readers). As a category is expanded into transcludable content it will in turn expand the document descriptors tagged with the category as inline content.
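
As an illustration of the "one intermediate file for each category + display language combination" step, the sketch below enumerates those combinations and renders each one. Everything here is hypothetical: the `catalogs` structure (language code mapped to a source-text to translated-text dict), the fallback to the source text when no translation exists, the plain-text output, and the `slug.language` file naming.

```python
from itertools import product


def render_category(category: dict, descriptors: list, translate) -> str:
    """Expand one category, inlining the descriptors tagged with it."""
    lines = [translate(category["title"]), translate(category["description"])]
    for descriptor in descriptors:
        if category["category"] in descriptor["categories"]:
            title = translate(descriptor["title"])
            description = translate(descriptor["description"])
            lines.append(f"- {title}: {description}")
    return "\n".join(lines)


def build_intermediate_files(categories: list, descriptors: list, catalogs: dict) -> dict:
    """Return {file name: text}, one entry per category + language combination."""
    output = {}
    for category, (language, catalog) in product(categories, catalogs.items()):
        # Untranslated strings fall back to the source text (an assumption).
        def translate(text, catalog=catalog):
            return catalog.get(text, text)

        slug = category["category"].replace(" ", "-")
        output[f"{slug}.{language}"] = render_category(category, descriptors, translate)
    return output


categories = [
    {
        "category": "code quality",
        "title": "Create quality software",
        "description": "Read about contribution standards and guidelines.",
    }
]
descriptors = [
    {
        "title": "Write performant code",
        "description": "Learn about caching.",
        "categories": ["code quality"],
    }
]
catalogs = {"en": {}, "de": {"Create quality software": "Hochwertige Software erstellen"}}

files = build_intermediate_files(categories, descriptors, catalogs)
```

Here `files` has the keys `code-quality.en` and `code-quality.de`; a real pipeline would write each value to disk in whatever format the chosen site generator can transclude.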

Next the build pipeline will process each page document once for each display language and output a file in the proper human language containing the previously expanded categories called for by the page source. These output files will be kept to use as inputs to the next stage of the build pipeline, actually running the static site generator.

The more I think about this, the more I'm leaning towards each page that we commit to git being in our own format rather than the native format of the site generation software. The main reason, I think, is T287176: Display localized content to readers and the need to extract and later reassemble translation units for the prose embedded in the page. We are already going to need to generate files for each page + language combination to feed to the site generator, so this doesn't really introduce more work for the system. Controlling the format at this level will let us invent whatever conventions are needed for both extracting translation units from the origin document and transcluding categories into the translated page output.
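
One possible shape for the extract-and-reassemble convention is sketched below. It assumes a made-up page format in which blocks are separated by blank lines, a block consisting only of a `{{category:NAME}}` marker is a transclusion rather than prose, and each translation unit is identified by a short hash of its source text. None of these conventions come from the task; they only show that owning the format makes both steps small.

```python
import hashlib
import re

# Hypothetical transclusion marker; such blocks are not sent for translation.
MARKER = re.compile(r"^\{\{category:[^}\n]+\}\}$")


def unit_id(text: str) -> str:
    """Derive a short, stable id from the source text of a unit."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]


def extract_units(page_source: str) -> dict:
    """Return {unit id: source text} for every prose block in the page."""
    units = {}
    for block in page_source.split("\n\n"):
        block = block.strip()
        if block and not MARKER.match(block):
            units[unit_id(block)] = block
    return units


def reassemble(page_source: str, translations: dict) -> str:
    """Rebuild the page, swapping in translations where they exist."""
    rebuilt = []
    for block in page_source.split("\n\n"):
        # Markers and untranslated blocks pass through unchanged.
        rebuilt.append(translations.get(unit_id(block.strip()), block))
    return "\n\n".join(rebuilt)


page = "Contribute to MediaWiki\n\n{{category:code quality}}\n\nThanks for reading."
units = extract_units(page)
# Stand-in for real translations: upper-case each unit.
translated = reassemble(page, {uid: text.upper() for uid, text in units.items()})
```

Hashing the source text means an edited paragraph gets a new id and naturally falls back to the untranslated source until it is translated again; whether that trade-off is wanted is an open design question.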

As a neat side effect, this would also mean that the substantive content for the documentation portal would be decoupled from the static site generation software chosen. Our build pipeline tooling would become the only thing that needs to know what funky YAML/Markdown/whatever is required as input to the generator.