WikiDev 16 working area: Content format
Closed, ResolvedPublic

Description

This is a potential area for work at Wikimedia-Developer-Summit-2016. "Content format" is about the format of our data, with a primary emphasis on the future of Wikitext & markup (or possibly, the future of eliminating it).

Central problem

How do we make manipulating our data easier and more useful (both for humans and computers)?

Alternative version (2015-12-30): What format should we use for the authoritative version of our essential content to make accessing and manipulating it easier and more useful (both for humans and computers)?

Main session: Robertson 1: Monday, 2pm

General discussion of our 2016 strategy for dealing with our central problem. This includes:

  • Establishing the shared questions, challenges, vision
    • What should we head for? (e.g. move primary data out of SQL-based data stores into key-value stores? Should we use Cassandra and RESTBase as stable primary storage?)
    • Timeline and clarity of path

The goal of this session is to produce a document that can serve as the first wiki draft of a charter for this area.

Summary of the other session priorities so far

Must have

Nice to have

Needs investigation/discussion

Not at summit

Withdrawn by author

Other working areas (and the meta conversations about the idea of working areas) can/should be found here: T119018: Working groups/areas for macro-organization of RfCs for the summit


RobLa-WMF updated the task description. (Show Details)
RobLa-WMF raised the priority of this task from to Needs Triage.
RobLa-WMF claimed this task.
RobLa-WMF added a subscriber: tstarling.
Restricted Application added subscribers: StudiesWorld, Aklapper. Nov 19 2015, 12:30 AM

@tstarling - could you lead in this area? More on what this means in a bit.

RobLa-WMF set Security to None.

@tstarling, @RobLa-WMF (CC @Qgil, @Jhernandez, @phuedx, @Jdlrobson), if this is a "track", we'd like to have T114542: Next Generation Content Loading and Routing, in Practice at the kickoff as a 2 hour session, please. The rough format: present findings, do breakouts, reconvene.

Hi @dr0ptp4kt, you may have accidentally volunteered to lead this area. :-)

@Qgil, @Rfarrand and I have generally figured out "divide and conquer is good!", but we haven't actually sat down and designed what that means with respect to room allocation, timing, etc. I've got some half-formed ideas about how this should work, but I should probably file them in a separate ticket.

I would love for there to be a persistent collaboration space for this area of work. I think we can use this ticket as the collaboration space for this area until a better one becomes obvious.

@dr0ptp4kt, how would you propose organizing/prioritizing the proposals in this area?

Qgil added a comment. Nov 23 2015, 9:55 AM

@Qgil, @Rfarrand and I have generally figured out "divide and conquer is good!", but we haven't actually sat down and designed what that means with respect to room allocation, timing, etc. I've got some half-formed ideas about how this should work, but I should probably file them in a separate ticket.

T116024: WikiDev16 program is a good place to start building the schedule.

@RobLa-WMF, sounds good. Happy to help! I first wanted to acknowledge I saw your comment. I'll need several days to wrap my head around the content, and then it sounds like I should follow up on T116024: WikiDev16 program.

dr0ptp4kt moved this task from Backlog to Feature on the Reading-Admin board.
dr0ptp4kt moved this task from Feature to Next Up on the Reading-Admin board.

There are balance considerations. The authors break down as follows:

  • cscott: 10 proposals
  • gwicke: 4 proposals
  • isarra: 3 proposals
  • daniel, Adam, hoo, lucie: 1 each

I don't think it is practical for cscott to lead 10 sessions.

T114542 is a no-brainer, and T111588 and T106099 are closely related. It seems like they've been talking past each other a bit, so it would be good to have both Adam and Gabriel in the same room so we can get to the bottom of it. Adam is asking for 2 hours for T114542 -- I think 2-3 hours for these three RFCs combined may be reasonable, under the umbrella of T114542.

Isarra's RFCs are in striking contrast to those pie-in-the-sky architecture RFCs; hers are concrete and immediately practical. Prioritising them depends somewhat on how confident we are that these components will stay around. In terms of interest and immediate utility, I think the priority order is T114071, T114065, T114057, but in terms of risk, the order is probably reversed, i.e. T114057 is the least likely to be called a "legacy" component.

Of cscott's proposals, I think the highest priority for scheduling is T112991, on the basis of broad interest and feasibility. Also, we should probably have an hour for T112987 including hoo's T114251. At a slightly lower priority, T114072 and T114445 are imminent and resourced, and could perhaps be discussed together. T114454, T113002 (including T484) and T112999 are at an earlier stage of development, and resourcing for them is less clear. T112987 is vague, and T114432 has already had an RFC meeting.

Daniel's T107595 is important, but has already been discussed on IRC, and resourcing is unclear. If Daniel's RFCs in other areas are not scheduled, then this should have a high priority.

Of Gabriel's remaining RFCs, T99088 seems too vague. T55784 lacks active discussion, and I get the impression there is not much to talk about at this point.

T114477 was withdrawn by the author.

Isarra added a subscriber: Isarra. Dec 1 2015, 8:12 PM
cscott added a subscriber: cscott. Edited Dec 1 2015, 11:53 PM

Perhaps we can have a "lightning talk" round? Some of my proposals that get the ax might still benefit from a brief 5 or 10 minute presentation, to raise awareness of the issues and spark hallway conversations.

I'm a little concerned by the ordering which places "imminent and resourced" over "earlier stage of development". In my mind, I'd completely invert this. The tasks which are "imminent and resourced" are going to happen regardless of the dev summit. What I really want out of the dev summit is broader input on the "earlier stage of development" proposals, to get them moving along (and to hear others' ideas about sticking points, feasibility, etc). T112999: Let MediaWiki operate entirely without wikitext could really use broader input, for instance, and T113002: Let's discuss LanguageConverter has been stuck for many years. *Not* talking about T113002 isn't going to get it unstuck; we really need to come together and make some decisions. (T113002 actually has some significant security implications as well, since LanguageConverter has been, and still is, the source of a number of security issues. We need to decide whether to rearchitect or scrap it; we can't continue to just ignore it.)

Qgil added a comment. Dec 1 2015, 11:56 PM

Perhaps we can have a "lightning talk" round? Some of my proposals that get the ax might still benefit from a brief 5 or 10 minute presentation, to raise awareness of the issues and spark hallway conversations.

Just FYI, the unconference slots at https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016#Program can be organized by the speakers themselves in whatever way they prefer.

I just reread a comment @cscott made in September associated with T96903: Identify and prioritize architectural challenges, and I'd like to repeat the parts that I think are relevant to this working area here. He proposes two big challenges that seem pretty squarely in this area's purview:

In the interests of advancing concrete discussion, let me propose four concrete "user-focused" architectural challenges:

  1. Moving to web technologies. It's been twenty years since wikitext was introduced. Wikitext, PHP, and (even) Lua are off of the modern mainstream, and so we force our users to climb barriers to entry before they can use MediaWiki or contribute to development. Here are some ways we can resync with modern practice (not intended to be exhaustive, roughly arranged from least controversial to most):
    1. T112999: HTML-only wikis. Decouple wikitext from mediawiki-core. This doesn't mean that we're going to turn it off for everyone! Just that we lay the foundation necessary to have wikis which use other representations: HTML-native, markdown, a refreshed wikitext 2.0 -- who knows what the future will bring. Let's refactor core so that we are not tied to wikitext 1.0 going forward.
    2. JavaScript support for Scribunto. At the time Scribunto was first developed, heap- and time-limiting in the v8 engine was immature. That limitation is past; let's ensure that folks can use web technologies to script templates, so that learning a brand new language isn't a prerequisite to contributing to our project.
    3. [suggestions more relevant to T119032: WikiDev 16 working area: Software engineering]
    4. [suggestion more relevant to T119030: WikiDev 16 working area: Collaboration]
  2. [suggestion more relevant to T119030: WikiDev 16 working area: Collaboration]
  3. [suggestion more relevant to T119029: WikiDev 16 working area: Content access and APIs]
  4. Polyglot Wikimedia. MediaWiki supports a number of different mechanisms for accommodating content in all the world's languages, but technical development of these features has stalled: they are not supported in our latest VisualEditor/Flow work, for example, and there is no plan for this. We should turn this around, and restart active development on polyglot features. Let's embrace ContentTranslation. This experiment seems to have succeeded; time to bring its UX into core. Rather than treating our projects in different languages as isolated silos, we should make it as easy as possible for content in wiki A to borrow from or translate content from wiki B (regardless of "variant" or whether A and B share a database, etc). For example:
    1. Fine-grained content tagging (like Parsoid's stable IDs) so that CX can permanently relate translated sections.
    2. An easily accessible split-screen view, so that (for example) the author of an enwiki article on (say) a South American country can very easily see a Google-translated version of the eswiki article on that topic, and translate/incorporate information from it.
    3. A CX workflow for editors to keep translated sections up to date. Once we have persistent data that section A is a translated version of section B, edits to section B should be visible to editors who are maintaining section A.
    4. A process to migrate Language Converter and the Translate extension to using this same mechanism. ContentTranslation should be able to work on articles in different languages residing on the same wiki, in the way the Translate extension does, for example. We need to identify and develop whatever features are currently missing in ContentTranslation to enable this.

This seems like the beginning of a manifesto for this working area. Thoughts? Should we set up a wiki page on mediawiki.org as a collaboration venue for refining this?

@GWicke suggested on IRC that we should consider some of the tickets in T119029: WikiDev 16 working area: Content access and APIs at the same time as some of the tickets in this area. He lists these as "Content structure & APIs":

...and points out these touch on important considerations to T119162: WikiDev 16 working area: User interface presentation, specifically:

I think it's good that we have these discussions here, and use this as a venue to decide if we need to set aside some in-person time for this. How related is this bundle of session proposals?

Qgil triaged this task as Normal priority. Dec 11 2015, 8:11 AM

This is my attempt to summarize @tstarling's earlier comment for eventual placement in the description of this task.

Must have

Nice to have

Needs investigation/discussion

Not at summit

Withdrawn by author

I've been rethinking this area along lines very similar to my thoughts on T119029: WikiDev 16 working area: Content access and APIs. My first version of editing the description was my attempt to reiterate what Tim wrote in the comments as his priorities. I think his priorities are great, but I also think that many of the issues currently at the top of this list probably should be stressed in a different area.

@RobLa-WMF Thanks for linking my comment above. It would be great to get some consensus on the working items I enumerated above.

On the other hand, this session seems to be currently scheduled at the same time as the one for T114457 -- which is probably my own fault for having my fingers in too many pies. But it would be helpful to get a little bit more information about this session and that one (who is the actual owner of the 2pm Robertson 2 session?) to help me figure out which I should attend.

Hi @cscott, here's the main change I made to the description of this task yesterday:

Central problem

How do we make manipulating our data easier and more useful (both for humans and computers)?

Main session: Robertson 1: Monday, 2pm

General discussion of our 2016 strategy for dealing with our central problem. This includes:

  • Establishing the shared questions, challenges, vision
    • What should we head for? (e.g. move primary data out of SQL-based data stores into key-value stores? Should we use Cassandra and RESTBase as stable primary storage?)
    • Timeline and clarity of path

The goal of this session is to produce a document that can serve as the first wiki draft of a charter for this area.

Valerie (@Frameshiftconsulting) has helped me refine the structure of the conversation plan. The plan, in short, is to ask and answer the following questions:

  • Who are the stakeholders for the content format?
  • What qualities are important to the stakeholders?

A goal of this session is to walk out with reasonably structured lists for each. We won't have agreement about the validity of the answers, but we will attempt to gather the list in the clearest way possible.

I think I'll clarify the central question to be:

How do we make accessing and manipulating the authoritative version of our essential content easier and more useful (both for humans and computers)?

"Authoritative version" and "essential" are important qualifiers here. By saying "authoritative" and "essential", I think "content" and "data" could be interchanged here, though. @ssastry, what do you think?

In my conversation with @GWicke earlier, we disagreed about what constitutes a content format. I'm making this admittedly broad: the format includes the essential elements of the storage mechanism. I.e. if the data is stored in a MariaDB database, the database itself is part of the format, since things like character set (e.g. latin1 vs utf8 vs binary) have implications up the stack from the storage itself, and because accessing+replicating the content depends on where and how the authoritative data is stored and if/how that content gets replicated.
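To make the character-set point concrete, here is a minimal Python sketch. It is an illustration only: the sample string is made up, and nothing here reflects how any particular MediaWiki table is actually configured. It shows the same stored bytes being read under two different character-set assumptions.

```python
# The same stored bytes, interpreted under two character-set assumptions.
# The sample text is hypothetical and chosen only because it has non-ASCII letters.
text = "Zürich café"
stored = text.encode("utf-8")          # bytes as a utf8 or binary column would hold them

as_utf8 = stored.decode("utf-8")       # reader agrees with the writer: clean round trip
as_latin1 = stored.decode("latin-1")   # reader assumes latin1: each non-ASCII letter
                                       # turns into two unrelated characters

print(as_utf8)
print(as_latin1)
print(len(text), len(stored), len(as_latin1))
```

The bytes never change; only the reader's assumption does. That is the sense in which the storage layer's character set leaks "up the stack" and arguably belongs to the format.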

Given the clarified definition, another list we can strive to walk out with is the list for this question:

  • What is Wikimedia's essential content?

Just my quick "off the top of my head" version of the list:

  • Sites
  • Articles
  • Revisions
  • Attribution
  • Categories
  • Associations/links (e.g. language links, interwiki links)
  • Media (bitmaps, vector art, audio, video)
  • Locations/coordinates

...but this probably has obvious omissions.

@cscott: I think the next logical step may be to take what I dubbed a "manifesto" for this area in my earlier comment and turn it into a wiki page to start things off. Is that something you can do?

I added this to the central question:

Alternative version (2015-12-30): What format should we use for the authoritative version of our essential content to make accessing and manipulating it easier and more useful (both for humans and computers)?

Note: @GWicke and I discussed this extensively in E131: Informal WikiDev '16 agenda bashing session (2015-12-30). See the comment at E131#1260 for the transcript of that part of our conversation. I was going to try to summarize, but I'll leave that as an exercise for the reader ;-)

ssastry added a comment. Edited Dec 31 2015, 9:28 PM

I think it is fair to say that wikitext, as it exists today, is not a format that enables easy and robust ways of extracting semantic pieces out of it. So, for the central question (how do we make content / data easier to access and manipulate?), we need a representation that is easier to manipulate. Parsed HTML with the right semantic markup could be one. Today, Parsoid is the tool that enables that, but if wikitext evolves incrementally, that ability could migrate beyond Parsoid. Whether HTML or wikitext ought to be the canonical representation of *stored content* is a difficult question to answer right away, but it seems fairly clear that a structured format like HTML is a better representation for access and manipulation than wikitext.
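As a rough illustration of why semantically marked-up HTML is easier to process, here is a minimal Python sketch using only the standard library. The markup is hypothetical: the `typeof="infobox"` and `data-params` attributes are simplified assumptions made up for this example, not actual Parsoid output.

```python
import json
from html.parser import HTMLParser

# Hypothetical, simplified markup. The attribute names are assumptions for
# illustration and are not copied from real Parsoid output.
SAMPLE_HTML = """
<p>Lima is the capital of Peru.</p>
<table typeof="infobox" data-params='{"name": "Peru", "capital": "Lima"}'>
  <tr><th>Capital</th><td>Lima</td></tr>
</table>
"""


class InfoboxFinder(HTMLParser):
    """Collect the parameters of every element marked as an infobox."""

    def __init__(self):
        super().__init__()
        self.infoboxes = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if attr.get("typeof") == "infobox":
            # The parameters travel with the element as JSON in an attribute.
            self.infoboxes.append(json.loads(attr.get("data-params", "{}")))


finder = InfoboxFinder()
finder.feed(SAMPLE_HTML)
print(finder.infoboxes)
```

A generic HTML parser is enough to find the infobox and its parameters here. Doing the same against raw wikitext would require expanding templates first.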

The way I see the various RFCs pooled in this group is as follows:

  • the API RFCs establish concrete use cases and the importance of a good semantic representation of content.
  • skinning, TOC, section tags, templating improvements, improved semantic source markup for images, improved semantic markup in HTML (identifying navboxes, infoboxes, chem-boxes, succession tables, etc.), and the language-converter proposals are about specific problems / proposals that try to move both source (wikitext) and output (HTML) towards such a semantic representation of content.
  • the remaining proposals are about migrating all HTML API responses to this semantic representation (e.g. using Parsoid HTML for read views, Parsoid HTML for mobile, the future of the mobile front-end).

So, I see a natural progression of discussion there from the 1st to the 3rd bullet point.

In addition, the canonical storage question (should wikitext be the canonical storage, or is HTML a better choice of canonical storage) is a natural question that comes up in my mind from there, even if we might not be able to authoritatively answer that question right away. In my mind, this feels more like a followup of establishing shared understanding on the previous 3 bullet points.

Separately, I think while we are not ready to talk about specific plans for LanguageConverter, I agree with Scott and Gabriel that that ball has been punted for too long now. Once again, we may not be prepared to address that topic in depth at the summit, but we should commit to taking the discussion ahead in the coming quarters. Parsoid HTML / semantic HTML as the canonical representation (even if not canonical storage) requires us to generate proper output for language converter markup if it is to become the canonical output representation.

Questions explored today:

  • T99088: [RFC] Evolving our content platform: Content adaptability, structured data and caching
  • Need use cases: infoboxes, navboxes, performance (e.g. performance)
  • Is wikitext a legacy format?
  • What is "content"? For instance, is it a good idea to embed JSON blobs in wikitext?
  • What is a "revision", how should "transclusion" work? Can we show old revisions "as they were", while having the current revision automatically update?
  • How do we separate and prioritize the concerns? Performance, better semantic expression, presentation, cross linking between wikis?
  • How important is modern practice for our formats? Can we untangle MW wikitext/Lua/etc from use of MediaWiki?
  • Polyglot Wikipedia? What does language support really mean?
  • Who will understand the new format? Will normal users understand the source format?
  • Do we need to expose the source format as part of the user interface?
  • Can we have a uniform format for the data from Wikipedia/Wikimedia sites?
  • How can we make it easier to edit for humans, and easier to read for machines? On the first point - include Design Research in the process
  • What about putting JavaScript in Scribunto?
    • cscott: T114454: [RFC] Visual Templates: Authoring templates with Visual Editor is relevant?
  • How do we keep templates from getting even more complicated?
  • How can we make template parameters more uniform?
    • Santhosh: Translating content between languages, for adapting templates, uniform parameter definition is helpful
  • How can we curb the "abuse" of templates which make our content more complicated?
  • How can we semantically define what a template means?
  • Can we move toward multi-content revisions?
  • What are the short term and long term use cases we're trying to solve for?
  • How do we make our formats mobile friendly? E.g. image widths (T112991)
  • How do we balance granularity of wikitext vs various forms of translations?
  • Could templates be mapped to Qitems? (e.g. parameters, names) The idea is to make templates language neutral. (or at least to separate the (language-independent) data/code/facts and (language-specific) presentation)
  • How do we bridge the gap between human and machine friendly?
    • Maybe let people mark up free text input with machine readable meanings while they edit. This could work much like IDEs do auto-completion and suggestions.
  • Can we figure out how to collaborate on translation of machine form to human usable form?
  • What about data tables?
  • How should we translate human knowledge to machine knowledge?
  • How can we support commenting on fragments of articles, and have the comments survive article revisions?
  • Can we support authoritative format editing using various modes of editing? (how do we decouple format from editing form?)
  • How can we support richer content into articles? (see Yuri's proposal: T121044)
  • How can we make the content of an article suitable for new environments (e.g. 3D design)?
  • Can we get where we're going iteratively focusing on concrete use cases?
  • Can we support annotation of paragraphs and sentences?
  • Do we need to move to a richer format?
  • How should we deal with audio/video content?
  • Can we make wikitext more semantic?
  • What about structured data for smaller wikis?

Full meeting minutes: https://etherpad.wikimedia.org/p/WikiDev16-T119022

Amire80 added a subscriber: Amire80. Jan 5 2016, 8:56 PM

Etherpad copy:

Session name: Language: Let's discuss LanguageConverter / per-language URLs / getTargetLanguage
Meeting goal: Discuss RFCs and long term plans for wikis with language/script variants
Meeting style: Field narrowing (provisional)
Choose one of:

  • Problem-solving: surveying many possible solutions
  • Strawman: exploring one specific solution
  • Field narrowing: narrowing down choices of solution
  • Consensus: coming to agreement on one solution
  • Education: teaching people about an agreed solution

Phabricator task link: https://phabricator.wikimedia.org/T113002, https://phabricator.wikimedia.org/T114662, https://phabricator.wikimedia.org/T114640

Topics for discussion:

General notes
Examples:

color/colour
ten million / crore
berinjela / beringela (Brazilian / European Portuguese)

Current use:

Chinese
Serbian
Kazakh

Chinese also has word conversion, not only character conversion, although the NoteTA template (a.k.a. "glossary") is actually used more frequently.

Wiki splitting is too controversial, so it's not being discussed.

DISCUSSION
[Daniel Kinzler] UI translation doesn't go through the converter. CScott says that it's OK.
[Amir] Updating articles is in future plans for ContentTranslation, without a date for now.
[Subbu] Which problem are we trying to solve? CScott: Orthogonal interfaces to solve the same problems.

Amir: Feature request: Make it possible for me to read, edit, see diffs in my preferred language variant instead of being shown the raw text as it is stored (with mixed variants). This will require a "virtual" storage - a layer that the editor will see in the preferred variant, without having to edit something that is written in a different variant.
David: how much of the arbitrary functionality of language variants are we willing to throw away?
Bianjiang: Parsoid API doesn't let you specify the variant that you prefer to see.
David: to what extent are wikis besides zhwiki doing full-on vocab conversion?
Amir: The data shows that machine translation is useful: in languages that have it and where the quality is fairly OK, there are more translations (and few deletions of translated pages).
Amir: CX is an article creation tool, and it disconnects the two articles. So, for variants, the two articles may diverge and may not track each other.

Action items with owners:

  • Amir: Define the desired behavior from the user perspective according to the suggestion to write in your preferred variant.

Daniel - in which language to show Wikidata pages?
What is the language of Commons? Technically "en", but should be international.
How to detect the language before the output generation?
CScott: Is there a magic word that switches according to user language? Amir, Hoo: No, just a hacky template in some projects.

Stas: Variants
What is stored? Variant?
Answer: No, whatever the user typed.
To be searchable we can pre-render all variants of each page, and these should be searched.
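A toy Python sketch of the pre-rendering idea follows. It is an illustration only: the variant table and the whole-word swap are assumptions made up here, and real LanguageConverter rules are far richer. The idea is to store whatever the user typed, render every variant at index time, and index all of the renderings.

```python
# Toy variant table. These two variants and the single word pair are
# hypothetical stand-ins for real conversion rules.
VARIANTS = {
    "en-us": {"colour": "color"},
    "en-gb": {"color": "colour"},
}


def render(text, variant):
    """Render stored text in one variant by swapping whole words."""
    table = VARIANTS[variant]
    return " ".join(table.get(word, word) for word in text.split())


def build_index(pages):
    """Map every word of every rendered variant to the pages containing it."""
    index = {}
    for title, stored in pages.items():
        for variant in VARIANTS:
            for word in render(stored, variant).split():
                index.setdefault(word, set()).add(title)
    return index


# Each page is stored exactly as its author typed it, in mixed variants.
pages = {"Paint": "the colour red", "Flag": "color of the flag"}
index = build_index(pages)
print(sorted(index["color"]))
print(sorted(index["colour"]))
```

With this approach a search for either spelling finds both pages, at the cost of an index that grows with the number of variants.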

https://phabricator.wikimedia.org/T114640
RFC: make Parser::getTargetLanguage aware of multilingual wikis

DON’T FORGET: When the meeting is over, copy any relevant notes (especially areas of agreement or disagreement, useful proposals, and action items) into the Phabricator task.

See https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit_2016/Session_checklist for more details.

Somewhat of a tangent: my vague understanding is that WordPerfect is still used as a document format in the legal community. My theory is that one reason is the "view codes" feature, which (possibly) makes it possible to understand a document all the way down to the byte level. In legal document negotiation, this may be critical.

Given how much intense debate there frequently is about our documents, is a similar aspiration to fidelity and auditability critical to the credibility of our work?

dr0ptp4kt renamed this task from WikiDev 16 working area: Content format to WikiDev 16 working area: Content format. May 5 2017, 5:08 PM
dr0ptp4kt removed a project: Reading-Admin.
Qgil closed this task as Resolved. May 8 2017, 1:18 PM

Let's consider this (now orphan) task resolved.