Define content storage format (levels, data structure, hierarchy, categorization, etc)
Closed, Resolved (Public)

Description

  • Have a categories/placement file which defines available placements/positions for items (translatable)
  • For items, have one file per listed item, with key-value pairs, such as:
    • title (1 | required | translatable)
    • description (1 | required | translatable)
    • url (>=1 | required | translatable)
    • technology tags (>=1 | required | non-translatable) (for T276704: Allow filtering of content by programming language)
    • placement/position/category/where to display (>=1 | required | non-translatable, from list defined in categories/placement file)

In the future (non-MVP), potentially expand with

Event Timeline

Aklapper created this task.
Aklapper moved this task from Inbox to 2021-Q3 on the Wikimedia-Developer-Portal board.
Aklapper renamed this task from Define data fields (key-value pairs) for each listed entry to Define Storage format for each listed entry (JSON?, etc). Jul 27 2021, 4:35 PM
Aklapper updated the task description.

This feels very related to what/how we do the static content generation generally. One thing that is sort of common in most of the static site generators I have worked with in the past is using YAML documents as the "source of truth" with the documents having a "metadata" section that describes titles, keywords, authorship, etc and then a "body" section that uses markdown or some other content markup language to describe the human readable content.

YAML is a superset of JSON (all valid JSON documents are valid YAML documents, but not the other way around). JSON Schema is a pretty good way to describe the validated structure of either JSON or YAML. JSON Schema is what is used at https://meta.wikimedia.org/wiki/Toolhub/Data_model#Version_1.2.0 to describe the structure of a toolinfo record.

A schema for these "document pointers" could be some reasonable combination of "required" metadata that we expect to have for all documents and "type specific" data that varies based on some key attribute like the document type or audience.

Aklapper renamed this task from Define Storage format for each listed entry (JSON?, etc) to Define Storage format for listed entries (JSON, YAML, etc). Aug 3 2021, 5:36 PM

I'll repost part of the comment in T287176: Display localized content to readers:

We may have a directory full of files (each file being one content entry), then aggregate template files for an entry, then generate files for each language (entry.cs, entry.de, entry.es, etc).

Aklapper renamed this task from Define Storage format for listed entries (JSON, YAML, etc) to Define storage format for listed entries (JSON / YAML) and their placement. Sep 16 2021, 3:39 PM
Aklapper raised the priority of this task from Low to Medium. Sep 16 2021, 3:51 PM
Aklapper updated the task description.
Aklapper renamed this task from Define storage format for listed entries (JSON / YAML) and their placement to Define storage format for listed entries (JSON / YAML / Markdown etc) and their placement. Sep 16 2021, 7:31 PM

@Aklapper and I talked through this ticket in a conference call today. We got a bit into technical weeds and then back out to a higher level again, but I think we may be closing in on a reasonable initial data model. I will try to describe it here a bit for others to comment on. Note there are terms being defined to make this description. The actual words used in the longer term may not be the same, but the concepts they describe feel fairly stable.

Terms

  • document descriptor - The most granular content item in the system. Describes a document/collection of documents external to the developer portal like https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker or https://meta.wikimedia.org/wiki/Offline_Projects.
  • category - A category aggregates one or more document descriptors which are somehow thematically related. Some examples might be "written in PHP" or "bots" or "new code contributor".
  • page - A page aggregates one or more categories which are somehow thematically related. Examples might be "use our content" or "contribute to MediaWiki" or "tools for your community".
  • portal - Collection of pages. This is the whole of the content that we are creating including translated and generated page content.

A document descriptor is the starting point for content authoring and what this task initially was trying to describe. Each descriptor will be a YAML file providing: a title (heading), a description, a collection of 1-N URLs with labels, and a collection of category names. Additional optional fields may be added in the future (like image?).

dd/performant-code.yaml
---
title: Write performant code
description: Learn about caching, backend and page load performance guidelines.
links:
  - url: https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Backend_performance
    label: mw:Wikimedia Performance Team/Backend performance
  - url: https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Page_load_performance
    label: mw:Wikimedia Performance Team/Page load performance
  - url: https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Performance_tuning
    label: mw:Manual:Performance tuning
categories:
  - code quality
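
The field requirements described above can be checked mechanically once a descriptor has been parsed. The following is only a sketch under assumptions: the function name validate_descriptor is hypothetical, the descriptor is assumed to be already loaded from YAML into a plain dict, and requiring at least one category is a guess based on the original task description rather than something decided in this comment.

```python
# Hypothetical validator for a document descriptor that has already been
# parsed from YAML into a plain dict. Not part of any existing tool.

def validate_descriptor(doc):
    """Return a list of problems; an empty list means the descriptor looks valid."""
    problems = []
    # title and description: one non-empty string each
    for key in ("title", "description"):
        value = doc.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key}: required non-empty string")
    # links: 1-N entries, each with a url and a label
    links = doc.get("links")
    if not isinstance(links, list) or not links:
        problems.append("links: at least one link is required")
    else:
        for i, link in enumerate(links):
            if not isinstance(link, dict):
                problems.append(f"links[{i}]: expected a mapping")
                continue
            for key in ("url", "label"):
                if not isinstance(link.get(key), str) or not link[key].strip():
                    problems.append(f"links[{i}].{key}: required non-empty string")
    # categories: assumed here to need at least one category name
    cats = doc.get("categories")
    if not isinstance(cats, list) or not cats or not all(isinstance(c, str) for c in cats):
        problems.append("categories: at least one category name is required")
    return problems


# A shortened version of the example descriptor above, as a parsed dict.
PERFORMANT_CODE = {
    "title": "Write performant code",
    "description": "Learn about caching, backend and page load performance guidelines.",
    "links": [
        {
            "url": "https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Performance_tuning",
            "label": "mw:Manual:Performance tuning",
        },
    ],
    "categories": ["code quality"],
}
```

Calling validate_descriptor(PERFORMANT_CODE) returns an empty list, while a descriptor missing its links would produce a "links: at least one link is required" entry.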

A category is the second level of content organization. Each category will be a YAML file providing: the category id of document descriptors to aggregate, a title (heading), and a description. Additional optional fields may be added in the future (like image?).

cat/code-quality.yaml
---
category: code quality
title: Create quality software
description: Read about contribution standards and guidelines to make better software for everyone.
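
To illustrate how a category could aggregate its document descriptors, here is a minimal sketch. It assumes descriptors and categories are already parsed into dicts shaped like the YAML examples above; index_by_category and expand_category are hypothetical names, and the id field is read from the "category" key as in this example.

```python
from collections import defaultdict


def index_by_category(descriptors):
    """Map each category name to the descriptors that list it under 'categories'."""
    index = defaultdict(list)
    for doc in descriptors:
        for name in doc.get("categories", []):
            index[name].append(doc)
    return dict(index)


def expand_category(category, index):
    """Return a copy of the category dict with its matching descriptors attached.

    The 'documents' key is an assumption made for this sketch.
    """
    return {**category, "documents": index.get(category["category"], [])}
```

A descriptor that lists several categories simply shows up under each of them, which matches the many-to-many tagging described above.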

A page is the third level of content organization. Each page will be in a format prescribed by the static site generation software that is chosen for T287175: Decide on most suitable underlying technical platform. It is likely that these will take the form of either YAML documents or Markdown documents with a YAML preamble as appropriate for the tooling. A page should have a title, heading(s), and prose text as well as include one or more categories which in turn include one or more document descriptors. We want the inclusion of categories to work like a template transclusion in MediaWiki: somehow in the page's source document we want to say "put the code quality category content here". How that technically works is :waves hands: an implementation detail that we do not know yet, but we will figure it out! The static site framework we choose will either have native support for transclusions, or we will make our own implementation as either a native extension of the framework or as a pre-processor step of some kind.

Finally we get all the way up to the portal level. What this exactly looks like will again depend on details that will not be fully known until T287175 is decided. At a high level though, this is where we will describe the templates for rendering our page files as static HTML.

The build process to get from the source files to the static HTML will be a multi-step pipeline. Our collection of category documents will be processed (likely by custom code we write) to generate intermediate files in the format needed for transclusion into their including pages. This will probably happen multiple times for each category YAML file so that we end up with one intermediate file for each category + display language combination (T287176: Display localized content to readers). As a category is expanded into transcludable content it will in turn expand the document descriptors tagged with the category as inline content.

Next the build pipeline will process each page document once for each display language and output a file in the proper human language containing the previously expanded categories called for by the page source. These output files will be kept to use as inputs to the next stage of the build pipeline, actually running the static site generator.
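
The two pre-generator stages described above amount to iterating over category + language and page + language combinations. The sketch below only plans the intermediate file names; the build/ directory layout, the ".lang" suffix, and the name plan_build are assumptions for illustration, not decisions made in this task.

```python
from itertools import product


def plan_build(categories, pages, languages):
    """List the intermediate files the two pre-generator stages would produce.

    Stage 1: one expanded category file per category + language combination.
    Stage 2: one page file per page + language combination, to be fed to the
    static site generator afterwards.
    """
    stage1 = [f"build/categories/{cat}.{lang}" for cat, lang in product(categories, languages)]
    stage2 = [f"build/pages/{page}.{lang}" for page, lang in product(pages, languages)]
    return stage1, stage2
```

For example, plan_build(["code-quality"], ["contribute"], ["en", "de"]) yields two category files and two page files, one per language.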

> A page is the third level of content organization. Each page will be in a format prescribed by the static site generation software that is chosen for T287175: Decide on most suitable underlying technical platform. It is likely that these will take the form of either YAML documents or Markdown documents with a YAML preamble as appropriate for the tooling. A page should have a title, heading(s), and prose text as well as include one or more categories which in turn include one or more document descriptors. We want the inclusion of categories to work like a template transclusion in MediaWiki: somehow in the page's source document we want to say "put the code quality category content here". How that technically works is :waves hands: an implementation detail that we do not know yet, but we will figure it out! The static site framework we choose will either have native support for transclusions, or we will make our own implementation as either a native extension of the framework or as a pre-processor step of some kind.

The more I think about this, the more I'm leaning towards each page that we commit to git actually being our own format rather than the native format of the site generation software. The main reason, I think, is T287176: Display localized content to readers and the need to extract and later reassemble translation units for the prose embedded in the page. We are already going to need to generate files for each page + language combination to feed to the site generator, so this doesn't really introduce more work for the system. Controlling the format at this level will let us invent whatever conventions are needed for both extracting translation units from the origin document and transcluding categories into the translated page output.

As a neat side effect, this would also mean that the substantive content for the documentation portal would be decoupled from the static site generation software chosen. Our build pipeline tooling would become the only thing that needs to know what funky YAML/Markdown/whatever is required as input to the generator.

Aklapper renamed this task from Define storage format for listed entries (JSON / YAML / Markdown etc) and their placement to Define content storage format (levels, data structure, hierarchy, categorization, etc). Oct 19 2021, 4:34 PM

Thanks for writing this up, @bd808! It's really cool to see this kind of thinking separate from choosing the tools. Overall +1; it sounds well-structured, easy to maintain, and nicely decoupled.

> document descriptor/category/page/portal

This makes sense to me! In general, I’d prefer to use full words instead of “dd”, “cat”, etc. Maybe just “document” to make it a single word? :bikeshed emoji:

> document descriptor
In the current version of the design, we’ve simplified how these appear. Instead of having a title, description, and one or more links, we have a title that is a link, a description, and an optional icon. (+1 to the optional icon/image field you mentioned) I don’t expect we’ll need the 1-N URLs with labels. We could switch to something like the following example, but since we’re just starting design review, this could change.

> A page should have a title, heading(s), and prose text as well as include one or more categories which in turn include one or more document descriptors. We want the inclusion of categories to work like a template transclusion in MediaWiki: somehow in the page's source document we want to say "put the code quality category content here".

Yes! This is what I was thinking as well and aligns with the behavior I’ve seen working with static site generators in the past. Usually there’s some templating language or markdown extension used for this, and it works pretty well.

> The more I think about this, the more I'm leaning towards each page that we commit to git actually being our own format rather than the native format of the site generation software.

This is a bit over my head, but decoupling always sounds good!

A few extra things that have come to mind:

Sub-categories?
In the current design, we have sections that function like sub-categories: h2 headings within the main h1 heading of the category. (For example, “Featured projects” and “Latest updates” in the draft) Currently these have a slightly different style (gray background, a bit more padding). This helps break up the page visually and avoid presenting too many options to choose from at once. Maybe we could have a class field in the category object that would apply a set of styles based on how it should appear?

Alternate link styles?
Another design outlier is a style that is more like a button than the usual title-link-heading format. This is another style difference that is trying to create some variety on the page, but in general, if we could find a way to allow some flexibility in styling elements, that would be ideal. (For example, the three gray boxes at the top of the draft)

Link groups without headers?
It’s likely that we’ll want to display the first set of links on the page without a category heading, since the main heading on the page is already acting like an initial heading. Is it possible to make the category title optional?

Editing this comment and the comment above to focus on organizing the site into small, modular pages and rescind my requests for these extra formatting options

Peer review comments:

  • nomenclature: "categories" and "category" are easy to confuse. The words are very similar but mean different things at the doc level (where "categories" refers to an entity or entities defined in other yaml files) and at the category level (where "category" refers to the entity defined by the current yaml file itself). It would be clearer to use something like "in_category" at the document level, so that it obviously refers to the "category" field defined in category-level yaml files.

Question: can one edit the markdown and put content between the template syntax without breaking how it renders and pulls in content from the category tree?

> Peer review comments:
>
>   • nomenclature "categories" and "category" is a little easy to confuse -- the words are very similar but they mean different things when they're at the doc level (where "categories" refers to an entity or entities defined in other yaml files) vs the category level (where "category" refers to the entity defined by the current yaml file itself). It would be clearer to use something like "in_category" for the document-level, so that it's more clear that it's referring to the "category" field defined in category-level yaml files

Agreed that the "category" label is goofy looking inside a categories/*.yaml file. I think a reasonable fix would be to use "name" for that label instead:

cat/code-quality.yaml
---
name: code quality
title: Create quality software
description: Read about contribution standards and guidelines to make better software for everyone.

> Question: can one edit the markdown and put content between the template syntax without breaking how it renders and pulls in content from the category tree?

Yes, with the possibly obvious caveat that changes to the markdown document need to preserve the syntax required by the jinja template implementation. In the current POC, the markdown files are "driving" the content generation. The YAML data files are only involved where jinja statements call for them: those statements output markdown chunks using data from the YAML files, which our local "macros" python module exposes to the jinja templating context.

Here's an example of a markdown document from the POC that is pulling in content from the YAML data files:

src/api/reading.md
---
title: Reading APIs
...
{% set cat = category("api-reading") %}

# {{ cat.title }}

{{ cat.description }}

{% for doc in cat.documents %}
## {{ doc.title }}

{{ doc.description }}

{% for link in doc.links %}
* [{{ link.label }}]({{ link.url }})
{% endfor %}
{% endfor %}

This file is a mixture of three things: a YAML header section (delimited by the --- YAML stream document start marker and the ... YAML stream document end marker) providing metadata for mkdocs about this page, Markdown content (in this example the # and ## heading indicators and the * unordered list item indicator), and Jinja template commands (the bits inside {{ ... }} and {% ... %} delimiters). At document generation time, mkdocs splits the metadata from the rest of the document and fires events related to converting the markdown to HTML. In my POC the mkdocs-macros plugin is used to evaluate any jinja commands in the markdown. The output of that step is next passed to the mkdocs-mdpo plugin, which extracts "translation units" from the markdown source. Eventually mkdocs core processes the markdown stream and converts it to HTML via the python-markdown library and any extensions to it that have been configured for the project. Finally, the generated HTML is inserted into a page template provided by the theme (possibly selected based on metadata in the document being processed) and written to disk as a site/api/reading/index.html file.
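
As a rough illustration of the first step (separating the YAML header from the rest of the document), here is a small sketch. It is not mkdocs' own code: split_front_matter is a hypothetical helper that only handles the --- start marker and ... end marker used in this example, and it returns the header as raw text rather than parsed YAML.

```python
def split_front_matter(text):
    """Split a source document into (header, body).

    The header is the text between a leading '---' line and the first '...'
    line. If either marker is missing, the whole text is treated as body.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "...":
            header = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return header, body
    return "", text  # no end marker found
```

The body returned here is what would then go through the Jinja evaluation, translation unit extraction, and Markdown conversion steps described above.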

There is a lot of boilerplate in this particular example that I expect we would refactor into a jinja macro or include in practical use, so that outputting the content of a typical category becomes something more like:

---
title: Reading APIs
...
{{ render_category("api-reading") }}
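
One way such a render_category helper might look in plain Python is sketched below. This is an assumption, not the POC's implementation: the second parameter stands in for whatever data the local "macros" module exposes to the templating context, and the category dicts are assumed to carry title, description, and a documents list as in the longer template above.

```python
def render_category(name, categories):
    """Render one category as Markdown, roughly mirroring the longer template.

    `categories` is a hypothetical dict mapping category ids to expanded
    category dicts (title, description, documents).
    """
    cat = categories[name]
    out = [f"# {cat['title']}", "", cat["description"], ""]
    for doc in cat["documents"]:
        out += [f"## {doc['title']}", "", doc["description"], ""]
        out += [f"* [{link['label']}]({link['url']})" for link in doc["links"]]
        out.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(out).rstrip() + "\n"
```

Registering a function like this with the templating layer would keep the page source down to the single call shown above, while the Markdown structure stays defined in one place.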

If a large number of our content pages end up being a simple 1-to-1 mapping to a single category we can go further in reducing boilerplate by introducing some mechanism for generating the entire source document at runtime and passing it along in the mkdocs pipeline.
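
For the 1-to-1 case, generating the source document could be as simple as the following sketch. The name stub_page_source is hypothetical, and how the generated text would be handed to the mkdocs pipeline is left open here, as it is in the comment above.

```python
def stub_page_source(title, category_name):
    """Build the source text of a page that only renders a single category.

    Note: the title is inserted as-is; a real implementation would need to
    quote or escape it so the header stays valid YAML.
    """
    lines = [
        "---",
        f"title: {title}",
        "...",
        '{{ render_category("' + category_name + '") }}',
    ]
    return "\n".join(lines) + "\n"
```

stub_page_source("Reading APIs", "api-reading") reproduces the short example document shown earlier.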

apaskulin assigned this task to bd808.

Implemented and working 🎉