We are working on new ways of presenting, editing and composing semi-structured content and data, on a variety of platforms and with a broader range of use cases. New types of derived content are being created and need to be stored and updated. We are getting serious about separating structured data from semi-structured content, and will need stronger tooling for the systematic extraction and presentation of data.
Overall, it has become very clear that we need to evolve the way we represent, process and store content. To start the discussion, we need to detail the issues, current status / efforts and possible solutions.
Here is a summary, based on the discussion in T96903:
## Challenge: Content portability
- Devices: from feature phones on satellite backhaul to many-core high-res desktops on fiber connections
- Contexts / use cases: from summaries and factoids to long form; need good contribution workflows for each context
## Structured data can help
- Enables content portability by separating presentation from data
- Semantic levels:
1) Semi-structured page content: One or more content fragments / variants per revision, ideally addressable at fine granularity, in standard format that allows effective post-processing and data extraction
2) Metadata: less semantic & more presentational information (ex: links, widget / info&navbox definitions, template parameters, data-parsoid, categories)
3) Wikidata: focus on semantic information; directly enables rich search, translations, recommendations. Currently a lot of this information is still inline; lots of extraction work needed. Many eyeballs and a good contribution workflow are critical for data quality.
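To make the three levels concrete, here is a minimal sketch of how one revision could carry them side by side. The `Revision` class, its field names and the fragment ids are hypothetical and only illustrate the separation; they do not describe an existing schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Revision:
    """Hypothetical container for the three semantic levels of one revision."""
    title: str
    rev_id: int
    # Level 1: semi-structured content, addressable per fragment id.
    fragments: Dict[str, str] = field(default_factory=dict)
    # Level 2: mostly presentational metadata (links, categories, ...).
    metadata: Dict[str, object] = field(default_factory=dict)
    # Level 3: pointer to the semantic data, here a Wikidata item id.
    wikidata_item: Optional[str] = None

    def fragment(self, fragment_id: str) -> Optional[str]:
        """Return a single fragment without loading the whole page."""
        return self.fragments.get(fragment_id)


# Illustrative values only.
rev = Revision(
    title="Douglas Adams",
    rev_id=1,
    fragments={"lead": "<p>Lead paragraph</p>", "works": "<ul><li>Item</li></ul>"},
    metadata={"categories": ["English writers"]},
    wikidata_item="Q42",
)
lead = rev.fragment("lead")
```

Keeping the levels in separate fields means a client that only needs a summary can fetch `lead` and skip the rest.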
## Need to store and retrieve the bits: Flexible storage & APIs
- flexible: easy to add new metadata and content types
- available through a consistent and high-performance API
- needs scalable and general change propagation strategy
- should integrate with history, link tables etc
- RESTBase is moving in this direction, but is not yet well integrated
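The storage requirements above can be sketched as a small in-memory store. This is an illustration under stated assumptions, not a description of RESTBase: `ContentStore`, `derive` and the `(title, content_type)` key are hypothetical names, and the dependency graph is assumed to be acyclic.

```python
from collections import defaultdict, deque


class ContentStore:
    """Hypothetical store keyed by (title, content_type) with change propagation."""

    def __init__(self):
        self._values = {}
        self._derivations = {}                 # derived key -> (source keys, function)
        self._dependents = defaultdict(list)   # source key -> derived keys

    def derive(self, key, sources, func):
        """Register a derived content type computed from other stored keys."""
        self._derivations[key] = (list(sources), func)
        for source in sources:
            self._dependents[source].append(key)

    def get(self, title, content_type):
        return self._values.get((title, content_type))

    def put(self, title, content_type, value):
        """Store a value and return the list of derived keys that were refreshed."""
        key = (title, content_type)
        self._values[key] = value
        return self._propagate(key)

    def _propagate(self, changed):
        # Breadth-first walk over dependents; assumes no dependency cycles.
        updated = []
        queue = deque([changed])
        while queue:
            current = queue.popleft()
            for dep in self._dependents[current]:
                sources, func = self._derivations[dep]
                if all(s in self._values for s in sources):
                    self._values[dep] = func(*(self._values[s] for s in sources))
                    updated.append(dep)
                    queue.append(dep)
        return updated


store = ContentStore()
# Adding a new derived content type takes one registration call.
store.derive(("Foo", "summary"), [("Foo", "html")], lambda html: html[:5])
refreshed = store.put("Foo", "html", "Hello world")
```

A real system would also need to propagate changes across pages (for example when a transcluded template changes) and to record history, both of which this sketch leaves out.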
## Assembling the bits: Generalized, late and hygienic content composition
- Generalized: Usable for page content (media, tag extensions, transclusions) and skins / chrome
- Late: assembly happens at the edge or on the client
- Hygienic: components are well-contained
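One way to picture these three properties is a small composition function that runs at the edge or on the client. The sketch below is a simplified illustration: the `<!--component:NAME-->` placeholder syntax, the `compose` function and the `data-component` wrapper attribute are all assumptions made for this example.

```python
import re
from html import escape

# Hypothetical placeholder left in stored / cached content.
PLACEHOLDER = re.compile(r"<!--component:([a-z0-9_]+)-->")


def compose(skeleton, components, context):
    """Fill placeholders in cached markup with per-request component output."""
    def render(match):
        name = match.group(1)
        component = components.get(name)
        if component is None:
            return ""  # unknown components are dropped
        try:
            body = component(context)
        except Exception:
            body = ""  # a failing component must not break the rest of the page
        # Each component gets its own wrapper so it stays well-contained.
        return '<div data-component="%s">%s</div>' % (escape(name, quote=True), body)

    return PLACEHOLDER.sub(render, skeleton)


def user_menu(context):
    user = context.get("user")
    return "Hello, " + escape(user) if user else "Log in"


skeleton = "<main><!--component:user_menu--><p>Stored article body</p></main>"
authenticated = compose(skeleton, {"user_menu": user_menu}, {"user": "Ada"})
anonymous = compose(skeleton, {"user_menu": user_menu}, {})
```

The same cached skeleton serves both anonymous and authenticated requests; only the small per-user component is rendered late.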
Why:
- flexible content composition lets us adapt to devices and use contexts (ex: video players, anon / authenticated views)
- major performance wins through move to the edge, use of stored / cached content for authenticated requests
Requirements:
- performance
- needs to integrate with / be part of caching and change propagation system (see storage)
- need an intuitive editing experience for embedded components