[RFC] Balanced templates
Open, Stalled, Normal, Public

Description

(These were originally called "hygienic templates", which got confused with hygienic template arguments. The latter are now called "heredoc" arguments, and "hygiene" is no more.)

As described in my Wikimania 2015 talk (starting at slide 27), there are a number of reasons to mark certain templates as "balanced". Foremost among them: to allow high-performance incremental update of page contents after templates are modified, and to allow safe editing of template uses using HTML-based tools such as Visual Editor or jsapi. More discussion of motivation is at T130567 (and covered in RFC meeting E159).

"Balance" means (roughly) that the output of the template is a complete DocumentFragment: every open tag is closed. Furthermore, there are some restrictions on context to ensure there are no open tags which the template will implicitly close, nor nodes which the HTML adoption agency algorithm will reorder. (More precise details below.)

Template balance is enforced: tags are closed or removed as necessary to ensure that the output satisfies the necessary constraints, regardless of the values of the template arguments or how child templates are expanded.

Properly balanced template inclusion allows efficient update of articles by doing substring substitution for template bodies, without having to expand all templates to wikitext and reparse from scratch. It also guarantees that the template (and surrounding content) will be editable in Visual Editor; mistakes in template arguments won't "leak out" and prevent editing of surrounding content.

Wikitext Syntax
After some bikeshedding, we decided that balance should be an "opt-in" property of templates, indicated by adding a {{#balance:TYPE}} marker to the content. This syntax leverages the existing "parser function" syntax, and allows different types of balance to be named in place of TYPE.

We propose three forms of balance, of which the first and perhaps the second are likely to be implemented initially. Other balancing modes would provide safety in different HTML-parsing contexts, and may be added in the future if there is need.

  1. {{#balance:block}} (informally) would close any open <p>/<a>/<h*>/<table> tags in the article preceding the template insertion site. In the template content all tags left open at the end will be closed, but there is no other restriction. This is similar to how block-level tags work in HTML 5. This is useful for navboxes and other "block" content.
    • Formally: in context preceding template, close p, a, table, h[1-6], style, script, xmp, iframe, noembed, noframes, plaintext, noscript, textarea, select, template, dd, dt, and pre. (Alternatively, close all but div and section.) After template, close all open tags.
  2. {{#balance:inline}} would only allow inline (i.e. phrasing) content and silently delete block-level tags seen in the content. But because of this, it can be used inside a block-level context without closing active <p>/<a>/<h*> in the article (as {{#balance:block}} would). This is useful for simple plain text templates, e.g. age calculation.
    • Formally: In context preceding template, close style, script, xmp, iframe, noembed, noframes, plaintext, noscript, textarea, table, ruby, select, and template. These are the tags which change tokenizer or parser modes. (ruby affects subsequent parsing of rb/rtc/rp/rt.) Wrap the template with <span>...</span>, in order to trigger AFE reconstruction. Inside the template, strip address, article, aside, blockquote, center, details, dialog, dir, div, dl, fieldset, figcaption, figure, footer, header, hgroup, main, menu, nav, ol, p, section, summary, ul, h[1-6], pre, listing, form, li, dd, dt, plaintext, button, a, nobr, hr, isindex, xmp, optgroup, and option. These are the elements which can trigger a close tag to be emitted in body parsing mode.
    • To see the need for <span> wrapping, consider <div><b><i>foo</b>{{template}}</div> where the template is <meta>bar<b>bat</b>. The output with <span> wrapping is: <div><b><i>foo</i></b><i><span><meta>bar<b>bat</b></span></i></div> whereas without span wrapping we'd get <div><b><i>foo</i></b><meta><i>bar<b>bat</b></i></div> -- note that the <span> causes the <i> to precede the template content, instead of migrating inside it.
  3. {{#balance:table}} would allow insertion inside <table> and allow <td>/<th> tags in the content. The exact semantics need to be nailed down; it is possible that the inline mode might be extended to allow safe insertion inside <td>/<th> elements, which might remove some of the need for a special table mode. Templates which wish to insert rows or sequences of cells might still need a special mode.

We expect {{#balance:block}} to be most useful for the large-ish templates whose efficient replacement would make the most impact on performance, and so we propose {{#balance:}} as shorthand for {{#balance:block}}. (The current wikitext grammar does not allow {{#balance}}, since the trailing colon is required in parser function names, but the current patch set accommodates this without too much pain.)
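
As an illustration of the opt-in marker and its shorthand, here is a minimal Python sketch of how a tool might read the balance mode out of a template's wikitext. The function name `balance_mode` and the regular expression are hypothetical and are not taken from the patch set; the sketch only assumes what is stated above: an unmarked template is unbalanced, and both {{#balance:}} and {{#balance}} mean block.

```python
import re

# Hypothetical pattern: matches {{#balance:TYPE}}, {{#balance:}} and {{#balance}}.
BALANCE_RE = re.compile(r"\{\{#balance(?::\s*([a-z]*)\s*)?\}\}")


def balance_mode(template_wikitext):
    """Return the balance mode a template opts into, or None if it is unmarked."""
    match = BALANCE_RE.search(template_wikitext)
    if match is None:
        return None  # no marker: the template stays unbalanced
    # An empty or missing TYPE is treated as shorthand for "block".
    return match.group(1) or "block"
```

The sketch does not validate TYPE against the list of known modes; how unknown modes should be handled is not specified here.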

Violations of content restrictions (i.e., a <p> tag in a {{#balance:inline}} template) would be errors, but how these errors would be conveyed is an orthogonal issue. Currently bad tags are stripped silently. Some other options for error reporting include ugly bold text visible to readers (like {{cite}}), wikilint-like reports, or inclusion in [[Category:Balance Errors]]. Note that errors might not appear immediately: they may only occur when some other included template is edited to newly produce disallowed content, or only when certain values are passed as template arguments.

Implementation
Implementation is slightly different in the PHP parser and in Parsoid. Incremental parsing/update would necessarily not be done in the PHP parser, but it does need to enforce equivalent content model constraints for consistency.

In both implementations, we begin by recording the balance mode desired by each transclusion and then adding a synthetic <mw:balance-TYPE> tag around the transcluded content.

PHP parser implementation strategy:

  • In the Sanitizer validate the synthetic <mw:balance-TYPE> tag to prevent forgery in wikitext, but otherwise pass the tag through.
  • Just before handing the output to tidy/depurate, perform a "cheap" parse by splitting on < characters, as the Sanitizer does, and naïvely tracking open/close tags seen on a stack (again, as the Sanitizer already does). When the <mw:balance-TYPE> open/close tag is seen, traverse the open tag stack and emit close tags as needed. Even though this pass is just an approximation of true HTML5 parsing, and doesn't accurately track AFE state or implicitly generated tags (like <tbody>), this has been validated to be sufficient. For example, even though we don't track the implicit <tbody> tag on our naïve stack, it can only be present if there was an outer <table> tag, and emitting </table> is sufficient to close the implicit <tbody>.
  • So far it has not been necessary to access "precise" HTML5 parse information in order to implement balancing. Should this become necessary in the future, a pure-PHP implementation of the HTML5 Tree Builder pass is already available.
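
The "cheap" parse described above can be sketched roughly as follows. This is an illustrative Python approximation of block balancing only, not the PHP code: the names `balance_block`, `TAG_RE`, `VOID` and `CLOSE_IN_CONTEXT` are made up for the example, the list of void elements is an assumption, and the tag set is copied from the formal definition of {{#balance:block}}. It tracks open tags naively on a stack, closes the listed context tags when the opening marker is seen, and closes whatever the template left open when the closing marker is seen.

```python
import re

TAG_RE = re.compile(r"<(/?)([A-Za-z][^\s/>]*)[^>]*>")
MARKER = "mw:balance-block"
# Assumed list of void elements, which never go on the open-tag stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Tags closed in the context preceding a block-balanced template.
CLOSE_IN_CONTEXT = {
    "p", "a", "table", "h1", "h2", "h3", "h4", "h5", "h6", "style", "script",
    "xmp", "iframe", "noembed", "noframes", "plaintext", "noscript",
    "textarea", "select", "template", "dd", "dt", "pre"}


def balance_block(html):
    """Naively balance HTML around <mw:balance-block> markers."""
    out, stack, pos = [], [], 0
    base = None  # stack depth where template content began; None outside
    for m in TAG_RE.finditer(html):
        out.append(html[pos:m.start()])  # copy text between tags unchanged
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2).lower()
        if name == MARKER and not closing:
            # Close offending context tags, innermost first.
            out.extend(f"</{t}>" for t in reversed(stack)
                       if t in CLOSE_IN_CONTEXT)
            stack = [t for t in stack if t not in CLOSE_IN_CONTEXT]
            base = len(stack)
        elif name == MARKER and base is not None:
            # Close everything the template left open.
            while len(stack) > base:
                out.append(f"</{stack.pop()}>")
            base = None
        elif closing:
            floor = base if base is not None else 0
            if name in stack[floor:]:
                # Naive: drop the matching open tag and anything above it.
                del stack[len(stack) - 1 - stack[::-1].index(name):]
        elif name not in VOID:
            stack.append(name)
        out.append(m.group(0))
    out.append(html[pos:])
    return "".join(out)
```

Run on the input of example 1 in the Examples section, this sketch produces the balancer output listed there. It makes no attempt to track AFE state or implied tags, in line with the approximation described above.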

In Parsoid:

  • In the tree builder we have access to a fully accurate open-element stack, so we can emit precisely the correct close tags.
  • If/when PHP switches over to a DOM-based tidy, it might be able to use this same implementation strategy (balancing inside tidy) but it's not a requirement.

Testing

A fuzz tester has been written, based on domino, which generates random sequences of tags and text for template and context, and then evaluates whether the desired semantics hold; that is, whether the following two expressions are equal:

  • tidy(tidy(balance(context)).replace(':hole:', tidy(stripOutsideMarker(balance(template)))))
    • Context and template balanced and tidied in isolation, then template inserted via string replacement
  • tidy(tidy(balance(context.replace(':hole:', stripOutsideMarker(template)))))
    • Template inserted into context, then balanced and tidied.

In this context tidy is just an HTML5 parse and serialize. The context is expected to contain <mw:balance-TYPE>:hole:</mw:balance-TYPE> somewhere inside it. The template is also wrapped with <mw:balance-TYPE> tags. The stripOutsideMarker function removes everything outside the <mw:balance-TYPE> tag. Note that we use tidy twice in the second case, because some tidy transformations are sensitive to the number of times we've tidied -- for example, table fostering can leave nodes in positions where they will be further altered by a subsequent tidy.
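
The property being fuzzed can be written down as a small Python sketch. The real tester is built on domino; here `balance` and `tidy` are passed in as plain functions so that any implementation can be plugged in, and `property_holds` is a hypothetical name. The sketch also assumes that stripOutsideMarker returns only the content between the marker tags, which is one possible reading of the description above.

```python
import re

# Matches a balanced marker pair and captures what lies between the tags.
MARKER_RE = re.compile(r"<mw:balance-(\w+)>(.*?)</mw:balance-\1>", re.S)


def strip_outside_marker(html):
    """Keep only what sits between the balance markers (assumed reading)."""
    m = MARKER_RE.search(html)
    return m.group(2) if m else html


def property_holds(context, template, balance, tidy):
    """Compare 'balance in isolation, then splice' with 'splice, then balance'."""
    # Context and template balanced and tidied separately, then spliced.
    spliced_late = tidy(
        tidy(balance(context)).replace(
            ":hole:", tidy(strip_outside_marker(balance(template)))))
    # Template spliced into the context first, then balanced and tidied twice.
    spliced_early = tidy(tidy(balance(
        context.replace(":hole:", strip_outside_marker(template)))))
    return spliced_late == spliced_early
```

A fuzz loop would call `property_holds` with randomly generated context and template strings and report any pair for which it returns False.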

This tool has validated the set of tags named in the formal definitions of the balance modes, as well as verifying that the "sloppy parse" done in the PHP implementation yields the same results as a precise parse would.

CAVEAT: This tester does not run the output through "legacy tidy". It is possible that the p-wrapping, empty element removal, and other nonstandard evilness performed by legacy tidy might affect the correctness of the balancing. I will hook up legacy tidy to the fuzz tester to look into this; hopefully the transition from legacy tidy to depurate will also make this consideration moot.

Examples
Here are some examples of the balance transformation:

  1. <p><a href="hello"><mw:balance-block><a href="world">foo<p></mw:balance-block>bar
    • The balancer will transform this to: <p><a href="hello"></a></p><mw:balance-block><a href="world">foo<p></p></a></mw:balance-block>bar
    • An HTML5 parse (or tidy) will transform this to: <p><a href="hello"></a></p><a href="world">foo<p></p></a>bar
    • The block balancing ensured that we didn't have an <a> tag inside an <a> tag.
    • The block balancing ensured that the inner <p> didn't implicitly close an outer <p>.
  2. <p><code><center><mw:balance-inline><span></mw:balance-inline><h1>foo
    • The balancer will transform this to: <p><code><center><span><mw:balance-inline><span></span></mw:balance-inline></span><h1>foo
    • An HTML5 parse (or tidy) will transform this to: <p><code></code></p><center><code><span><span></span></span><h1>foo</h1></code></center>
    • Note that HTML5 implicitly closes the <p> when it encounters <center>. This is why <center> is stripped inside (inline balanced) template contents.
    • Note that the HTML5 "reconstruction of active formatting element list" algorithm adds a new synthetic <code> element before the <span>. The balance algorithm adds a <span> *outside* of the template content, to trigger AFE reconstruction and ensure that AFEs of the context don't leak inside the template.
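
Inline balancing can be sketched in the same naive style. The Python below is an illustration only: `balance_inline` and the constant names are hypothetical, the list of void elements is an assumption, and the two tag sets are copied from the formal definition of {{#balance:inline}}. It closes mode-changing tags in the context, wraps the template in <span>, strips block-level tags inside the template while keeping their text, and closes anything the template left open.

```python
import re

TAG_RE = re.compile(r"<(/?)([A-Za-z][^\s/>]*)[^>]*>")
MARKER = "mw:balance-inline"
# Assumed list of void elements, which never go on the open-tag stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Context tags closed before an inline-balanced template.
INLINE_CLOSE = {"style", "script", "xmp", "iframe", "noembed", "noframes",
                "plaintext", "noscript", "textarea", "table", "ruby",
                "select", "template"}
# Tags stripped from inline-balanced template content.
INLINE_STRIP = {
    "address", "article", "aside", "blockquote", "center", "details",
    "dialog", "dir", "div", "dl", "fieldset", "figcaption", "figure",
    "footer", "header", "hgroup", "main", "menu", "nav", "ol", "p",
    "section", "summary", "ul", "h1", "h2", "h3", "h4", "h5", "h6", "pre",
    "listing", "form", "li", "dd", "dt", "plaintext", "button", "a", "nobr",
    "hr", "isindex", "xmp", "optgroup", "option"}


def balance_inline(html):
    """Naively balance HTML around <mw:balance-inline> markers."""
    out, stack, pos = [], [], 0
    base = None  # stack depth where template content began; None outside
    for m in TAG_RE.finditer(html):
        out.append(html[pos:m.start()])  # copy text between tags unchanged
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2).lower()
        inside = base is not None
        if name == MARKER and not closing:
            out.extend(f"</{t}>" for t in reversed(stack) if t in INLINE_CLOSE)
            stack = [t for t in stack if t not in INLINE_CLOSE]
            base = len(stack)
            out.append("<span>" + m.group(0))  # span triggers AFE reconstruction
        elif name == MARKER and inside:
            while len(stack) > base:  # close what the template left open
                out.append(f"</{stack.pop()}>")
            base = None
            out.append(m.group(0) + "</span>")
        elif inside and name in INLINE_STRIP:
            continue  # drop the disallowed tag, keep surrounding text
        else:
            floor = base if inside else 0
            if closing and name in stack[floor:]:
                del stack[len(stack) - 1 - stack[::-1].index(name):]
            elif not closing and name not in VOID:
                stack.append(name)
            out.append(m.group(0))
    out.append(html[pos:])
    return "".join(out)
```

On the input of example 2 above, this sketch yields the balancer output shown there; the subsequent HTML5 parse is not modelled.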

Deployment
Unmarked templates are "unbalanced" and will render exactly the same as before; they will just be slower (require more CPU time) than balanced templates.

It is expected that we will profile the "costliest"/"most frequently used/changed" templates on Wikimedia projects and attempt to add balance markers first to those templates where the greatest potential performance gain may be achieved. @tstarling noticed that adding a balance marker to [[:en:Template:Infobox]] could affect over two million pages and have a large immediate effect on performance. We would want to carefully verify first that balance would not affect the appearance of any of those pages, using visual diff or other tools.

Related: T89331: Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool, T114072: <section> tags for MediaWiki sections.

Mailing list discussion: https://lists.wikimedia.org/pipermail/wikitech-l/2015-October/083449.html

tstarling moved this task from Inbox to Under discussion on the TechCom-RFC board. Oct 14 2015, 8:55 PM

Random thought: it might be handy to be able to tell that a template is "hygienic" by the inclusion syntax, maybe like this:

{{#include:Template:Foo}}

This should fail if Template:Foo wasn't marked SAFE, so the code parsing this has a guarantee that the template is well-behaved.

GWicke added a comment. Edited Oct 21 2015, 11:27 PM

@cscott: Originally, we discussed <domparse> & co in the context of the *opt-out* solution, as a way to still allow unbalanced templated constructs. It might be useful to mention this option in the opt-out section.

Another option we discussed is to use statistics to identify unbalanced templates, and treat all other templates as balanced by default (also opt-out). Parsoid collects a lot of this information during parsing, and exposes it in data-mw. Templates typically used as the first or last template in multi-template blocks would be the candidates we are looking for. Alternatively / additionally, we could ask authors to annotate templatedata manually. For the typical table-start and table-end template, it might be possible to infer that they normally open & close DOM scopes, and then parse content framed in them accordingly, in a limited DOM scope. The advantage of such a classification is that it could work reasonably well for old revisions, which I think is a requirement.

Isarra added a subscriber: Isarra. Nov 6 2015, 5:56 PM
Restricted Application added a subscriber: StudiesWorld. Nov 6 2015, 5:56 PM
cscott added a comment. Nov 6 2015, 7:52 PM

We discussed this in-depth at our parsing team offsite. Some notes:

  • After some bikeshedding, we decided on {{#balance:xyz}} as the preferred marker inside a template opting into hygiene. (We also decided to call it "balance" rather than "hygiene" to reduce confusion and the unpleasant connotation of "unhygienic" templates.) This syntax allows for different types of balance (named in the xyz part of the tag), since...
  • Well-formedness is generally not enough to guarantee that template updates can be reflected in the DOM by subtree replacement, because content from the template may be fostered or a/p/h* tags might be broken up (since you cannot nest those). For example, category <meta> tags may be fostered outside a <table>. Consider three cases:
    • {{#balance:block}} would close any open <p>/<a>/<h*>/<table> tags in the context, like block-level content in HTML 4. This is useful for navboxes, etc.
    • {{#balance:inline}} would only allow inline (i.e. phrasing) content and generate an error if a <p>/<a>/<h*>/<table>/<tr>/<td>/<th>/<li> tag is seen in the content. But because of this, it *can* be used inside a block-level context without closing active <p>/<a>/<h*>/<table> (as {{#balance:block}} would). This is useful for simple plain text templates, e.g. age calculation.
    • {{#balance:table}} might close <p>/<a>/<h*> but would allow insertion inside <table> and allow <td>/<th> tags in the content. (There might be some other content restrictions to prevent fostering.)
  • Not all of these balancing modes might be needed initially. We expect {{#balance:block}} to be most useful for the large-ish templates whose efficient replacement would make the most impact on performance.
    • {{#balance:}} could be shorthand for {{#balance:block}}.
    • The current parser grammar does not actually allow {{#balance}}, since the trailing colon is necessary in parser function names, but that shorthand could perhaps be supported as well without too much parser pain.
  • Violations of content restrictions (i.e., a <p> tag in a {{#balance:inline}} template) would be errors, but how these errors would be conveyed is an orthogonal issue. Note that errors might not appear immediately: they may only occur when some other included template is edited to newly produce disallowed content, or only with certain values for the template parameters. Some options for error reporting:
    • Like {{cite}} as ugly bold text visible to readers (and editors too).
    • Wikilint-like reports, or inclusions in a [[Category:Balance_Errors]] (ie, silently)

Implementation is slightly different in the PHP parser and in Parsoid. Incremental parsing/update would necessarily not be done in the PHP parser, but it does need to enforce equivalent content model constraints for consistency.

PHP parser implementation strategy:

  • When a template with {{#balance}} is expanded, add a marker to the start of its output.
  • In the Sanitizer, either:
    • Close relevant open tags when the marker is seen, since the Sanitizer already maintains an HTML tag stack. *However* the tag stack is not complete because this is before doBlockLevels() etc which creates most <p> tags. So instead...
    • The Sanitizer should leave the marker alone, and then we'll replace the marker with </p></table>...etc... just before handing the output to tidy/depurate, and let that pass close the tags (and discard any irrelevant </...> tags). Some care needed to ensure we discard unnecessary close tags, and not html-entity-escape them.
  • PHP might not be able to implement {{#balance:inline}} or {{#balance:table}} quite yet -- there might need to be a special depurate mode, or do it in a DOM-based sanitizer, something like that. We can concentrate on {{#balance:block}} initially.

In Parsoid:

  • We just need to emit synthetic </p></table></...> tokens, the tree builder will take care of closing a tag if necessary or else discarding the token.
  • When PHP switches over to a DOM-based sanitizer, it might be able to use this same strategy.
saper added a subscriber: saper. Nov 10 2015, 12:01 AM
cscott renamed this task from [RFC] Hygienic templates to [RFC] Balanced templates. Nov 10 2015, 9:07 PM
cscott updated the task description. (Show Details)
cscott added a subscriber: tstarling.

I updated the description to match the notes from our offsite; see the change details for information about the other colors we considered painting the bikeshed.

It may be worth addressing some of T14974 for templates using {{#balance:...}} as well. For instance, {{#balance:block}} would ensure that the start of the template was parsed in "start of line" context (adding an implicit newline before and after it), while {{#balance:inline}} would do the opposite (add an "invisible space" before and after to ensure the template wikitext was *not* parsed in "start of line" context).

There was further conversation on wikitech-l.

Two points were brought up:

  1. The restriction on open <a> tags in the context might be quite restrictive for some "inline" templates. It's necessary, however, because HTML5 doesn't allow <a> tags to be nested. So either there is no <a> tag in the context, or no <a> tag in the content. {{#balance:inline}} does the latter. We might want to introduce an alternate, say {{#balance:link}}, which does the former: close any <a> tag in the context, which would allow the content to contain <a> tags.
  2. <table> tags can be nested, so the "emit lots of close tags and let tidy figure it out" implementation strategy requires us to still count the number of open <table> tags so we can be sure to emit the correct number. I did a little experimentation with a copy of the HTML5 parsing spec in front of me, and I couldn't find any tag (other than </table>) which was guaranteed to close all existing tables. And in fact most experiments of this sort led to unpleasant foster-parenting rather than the desired outcome.

So we'll need to borrow the strategy mentioned above in the notes from the parser team offsite and count <table> nesting depth in the Sanitizer. This is before doBlockLevels, but hopefully we can reliably identify <table> tags? And are there other nested cases we need to worry about?
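
Counting <table> nesting depth could look something like the following Python sketch. It is not the Sanitizer code; `table_close_tags` is a hypothetical helper that scans the HTML preceding the template and returns exactly as many </table> tags as there are tables still open, ignoring stray close tags.

```python
import re

# Matches <table ...> and </table>, case-insensitively; \b avoids <tablex>.
TABLE_TAG_RE = re.compile(r"<(/?)table\b[^>]*>", re.I)


def table_close_tags(context_before_template):
    """Return the </table> tags needed to close every open table."""
    depth = 0
    for m in TABLE_TAG_RE.finditer(context_before_template):
        if m.group(1):
            depth = max(depth - 1, 0)  # ignore a stray </table>
        else:
            depth += 1
    return "</table>" * depth
```

This only handles literal <table> tags in the text it is given; tables produced later by wikitext table syntax would need to be expanded to HTML before the count is taken.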

Tgr added a comment. Nov 13 2015, 2:52 AM

a tags disallow images as well, so I guess those should also be forbidden inside an inline template?

The list of tags that need to be closed seems somewhat random to me. For block-level templates, we would have to close any open tag that disallows block-level content, right? (HTML5 does not seem to have an equivalent of block-level but it's basically flow content that's not phrasing content.) Assuming wikitext is the only possible source of unclosed tags, and MediaWiki has not been configured to be unusually permissive, that would be:

  • all tags with a phrasing content model: a abbr b bdi bdo cite code del dfn em i ins kbd q s samp small span strong sub sup u var.
  • tags with flow content model that have restrictions on their own content: h1..6 p ul ol dl table are the ones I can think of (ul ol dl table should actually be fine as long as we are within the right descendant but that seems hard to track).
  • some tags that technically don't exist in HTML5 but we support them anyway: font big tt center strike.
ssastry added a comment. Edited Nov 13 2015, 5:11 AM

These notes describe the general problem that we are trying to solve here by picking specific constrained scenarios. Here are some more thoughts about content model constraints. In order to respect these constraints, we have two possibilities: either per-template constraints or use-site constraints. The question is which of these is more reliably enforceable and easier for editors to reason about. It seems like per-template constraints are going to be more predictable in terms of behaviour, which is why we started looking at block vs. inline templates (block vs. inline as understood in the HTML4 content model, since they are easier to grapple with compared to HTML5 flow, phrasing, and other types).

Given this, at our offsite, we focused most of our attention on block-level templates and constraints. We didn't fully discuss the inline templating scenario, which seemed a lot stickier, as this ongoing discussion illustrates. So, we haven't figured out all the details for the inline template scenario -- let us continue to share ideas and thoughts here, but it might be easier to start with block templates.

Elitre added a subscriber: Elitre. Nov 19 2015, 4:07 PM

Efficient re-rendering on edits has some of my old notes where I was trying to think through possible approaches. Providing a pointer here in case it helps with working through this problem.

The list of tags that need to be closed seems somewhat random to me. For block-level templates, we would have to close any open tag that disallows block-level content, right? (HTML5 does not seem to have an equivalent of block-level but it's basically flow content that's not phrasing content.) Assuming wikitext is the only possible source of unclosed tags, and MediaWiki has not been configured to be unusually permissive, that would be:

  • all tags with a phrasing content model: a abbr b bdi bdo cite code del dfn em i ins kbd q s samp small span strong sub sup u var.

You can embed a block tag like a <div> inside an <a>, <small>, <span>, <strong>, etc. This is easy to verify by creating such HTML in a file and inspecting the DOM in a browser.

Tgr added a comment. Dec 8 2015, 12:57 AM

You can embed a block tag like a <div> inside an <a>, <small>, <span>, <strong>, etc. This is easy to verify by creating such HTML in a file and inspecting the DOM in a browser.

That's not what the HTML5 spec says (except for a which is transparent and can contain flow content; that one is problematic for other reasons, as it cannot be nested or contain interactive elements).

ssastry added a comment. Edited Dec 8 2015, 4:30 AM

You can embed a block tag like a <div> inside an <a>, <small>, <span>, <strong>, etc. This is easy to verify by creating such HTML in a file and inspecting the DOM in a browser.

That's not what the HTML5 spec says (except for a which is transparent and can contain flow content; that one is problematic for other reasons, as it cannot be nested or contain interactive elements).

Right, but we are not so much concerned with the HTML5 spec as with the HTML5 tree building algorithm, which is what determines how an HTML string translates into a DOM structure. This is a bit confusing and I still have to keep reminding myself of this distinction. If the tree building algorithm strictly enforced the HTML5 spec, then a lot of pages out there in the wild would be broken when viewed in a browser.

So, the same consideration applies here: when looking at an HTML string and its embedding inside a container string (as happens with transclusions), we are more concerned about the DOM that will result when the string is parsed. And those constraints are fewer (ex: no nesting of links, paragraphs, headings; fostering of content from tables), but I haven't yet found a page that lists out all these constraints. However, I think we can discover this with some experimentation (or a careful reading of the tree builder algorithm), and I am fairly certain that these constraints are far fewer than what the HTML5 spec dictates, since the parser has to be much more lenient in what it accepts.

I'm wondering about what should happen if there are multiple #balance calls in a template, or if a subtemplate contains #balance.

The #balance parser function can store the balance type into the current PPFrame object. Then after template expansion, in the $isChildObj case of Parser::braceSubstitution(), the balance type will be available as $newFrame->getBalanceType(). Then it can add markers to the start and end of the resulting text.

Parsoid just needs to know (presumably via a ParserOutput property) whether there was a balance type applying to the whole expansion.

In the MW parser main pass, the text will need to be segmented into sections based on a change of balance type, and then each segment needs to be fed into an HTML balancer (perhaps part of Html5Depurate).

Regarding subtemplates, one option is to allow balanced sections to nest. The inner balanced section would be sent to the balancer first, and the balancer's output would be used as part of the input to the outer balancer call. A challenge with this is Parsoid compatibility, since the preprocessor cannot balance inner nested sections for Parsoid. Maybe we would have to provide complete markup of balanced sections for Parsoid, rather than a global balance type.

Another option is to strip balance markers from the inner text before wrapping them around the outer text. This means that the balancer is non-recursive and parallelisable, while still allowing proxy templates.

If #balance is completely ignored in subtemplates, then proxy templates like navboxes will need to be marked individually. For example, {{Navbox}} is implemented in Lua, and the module can probably tell us what sort of balancing it has. There are thousands of navbox templates which only invoke {{Navbox}} with specific arguments. It would be convenient if we could respect the balance flag set by Lua without requiring every navbox template to contain {{#balance}}.
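
The "strip inner markers, then wrap" option can be illustrated with a short Python sketch. The function name `wrap_expansion` is hypothetical; in the PHP parser this step would sit around the point in Parser::braceSubstitution() where the frame's balance type is known. The marker tag names follow the <mw:balance-TYPE> convention from the task description.

```python
import re

# Matches any opening or closing balance marker left by a subtemplate.
INNER_MARKER_RE = re.compile(r"</?mw:balance-[a-z]+>")


def wrap_expansion(expanded_text, balance_type):
    """Wrap one template expansion in balance markers, flattening inner ones."""
    if balance_type is None:
        return expanded_text  # unbalanced template: leave the text alone
    # Strip markers contributed by subtemplates so balanced sections never nest.
    flat = INNER_MARKER_RE.sub("", expanded_text)
    return f"<mw:balance-{balance_type}>{flat}</mw:balance-{balance_type}>"
```

With this approach a proxy template that only invokes a balanced subtemplate still needs a balance type of its own, for example one set on its behalf by the Lua module as suggested above.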

I'm wondering about what should happen if there are multiple #balance calls in a template, or if a subtemplate contains #balance.
...
Another option is to strip balance markers from the inner text before wrapping them around the outer text. This means that the balancer is non-recursive and parallelisable, while still allowing proxy templates.

This seems the best option to me.

He7d3r added a subscriber: He7d3r. Feb 1 2016, 8:00 PM
Qgil removed a subscriber: Qgil. Feb 11 2016, 12:26 PM
RobLa-WMF triaged this task as Normal priority.
RobLa-WMF added a subscriber: RobLa-WMF.

Per E146

DStrine moved this task from Request IRC meeting to Under discussion on the TechCom-RFC board.
ssastry added a comment. Edited Mar 19 2016, 2:13 AM

Here is an updated proposal for "balanced templates" based on revisiting some old notes, discussion here, and @cscott's attempts to prototype the current proposal.

TL;DR

In the current proposal as outlined in previous comments, we are trying to modify the HTML at the transclusion site so that it is possible to drop the template output as is. As @cscott is discovering with his attempts to prototype this, this can be fairly complex (and can also be hard for editors to reason about). However, instead of grappling with the complexity of the HTML5 tree building algorithm and its idiosyncrasies or HTML5's content model constraints (and the mismatches between the two which are a source of confusion), I am proposing that we identify constraints on template output and constraints on the use-site that template authors can specify. The parser then enforces both these constraints by suitably modifying the HTML output of the template and the HTML at the use site. Details below.

Detailed proposal

In this RFC, we are interested in two properties of template output: its well-formedness (balanced, well-nested HTML tags) and HTML5 content-model constraints at the transclusion site (constraints on whether the output can be introduced at the transclusion site as is). These properties affect:

  • editors' ability to reason about template output
  • their editability in HTML editors (like VE)
  • whether templates lend themselves to incremental parsing solutions

Well-formedness is easy to guarantee by building a DOM fragment from the output (string) and reserializing it. This fixes all mismatched tags, bad nesting, etc. However, if we want to localize and bound the impacts of embedding this DOM fragment within the surrounding context, we have two knobs to work with.

  1. Output constraints on the template output: These are enforceable constraints on the output of a transclusion. For example, if a template declares that it produces "block" output, we can look at the DOM fragment of transclusions that use it and wrap that DOM fragment in a <div> if its output doesn't satisfy this constraint.
  2. Use-site constraints on the surrounding context: These are constraints on where a transclusion can be embedded. For example, if a template declares it shouldn't be used inside <a> tags, any surrounding <a> tags are closed before embedding the DOM fragment.

So, the primary difference from the existing proposal is that instead of making heroic efforts in the parser to satisfy HTML5 content-model constraints, we instead let template authors specify constraints on a template output and the use site for its transclusions. This of course means that when the template output is introduced at the use site, despite these constraints, there might be non-local effects. However, I think this is acceptable. There will be a subset of templates where incremental parsing, improved reasoning, and improved editability benefits will not be available. But, with this approach, we can gradually improve the set of supported output and use-site constraints, and their complexity. So, the expectation is that over time, this subset will diminish.

Given a template with its output constraints and use-site constraints, here is how a wikitext parser might use them.

  • The HTML string is parsed to a DOM fragment which ensures that its output is well-formed.
  • Any declared output constraints are enforced. Ex. a block-output constraint will wrap the entire output in a <div> (with a special class, if necessary). Or, a no-links-output constraint will cause all <a> tags to be stripped from the output. This may be because the template author intends for the template to be used within a link.
  • Any declared use-site constraints are enforced. For example, if a template declares that it cannot be used in <a> context, any surrounding <a> tags are closed before the DOM fragment is embedded.
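
The three steps above can be sketched in Python as follows. Everything here is hypothetical: the `Constraints` fields, the `embed` function and the way the context is represented as a list of open tag names are illustration only. The sketch assumes the first step has already happened, that is, `fragment_html` is the reserialized, well-formed DOM fragment, and it wraps block output in a <div> unconditionally rather than checking whether the output already satisfies the constraint.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Constraints:
    block_output: bool = False     # output constraint: wrap output in <div>
    no_links_output: bool = False  # output constraint: strip <a> from output
    close_in_context: set = field(default_factory=set)  # use-site constraint


# Matches <a ...> and </a> but not <abbr>, <address>, etc.
A_TAG_RE = re.compile(r"</?a\b[^>]*>", re.I)


def embed(open_context_tags, fragment_html, constraints):
    """Return the HTML to emit at the use site for one transclusion."""
    # Enforce declared output constraints on the (well-formed) fragment.
    if constraints.no_links_output:
        fragment_html = A_TAG_RE.sub("", fragment_html)
    if constraints.block_output:
        fragment_html = f"<div>{fragment_html}</div>"
    # Enforce use-site constraints: close forbidden context tags, innermost first.
    closers = "".join(f"</{tag}>" for tag in reversed(open_context_tags)
                      if tag in constraints.close_in_context)
    return closers + fragment_html
```

For instance, a template declared as block output that must not appear inside a link, used where <p> and <a> are open, would have the </a> emitted before its <div>-wrapped output.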

Some notes about constraints:

  • Templates can provide declarations about none, one, or both of these constraints.
  • Some use-site constraints could be derived from the output constraint. For example, if a template declares an output constraint of block tags, we could decide to enforce that it cannot be used inside <p> tags or <h*> tags.
  • Some use-site constraints could be derived from the output. If a template generates output that has an <a>, <p>, or <h*> tag, we could automatically add use-site constraints that close any surrounding tags of those types.
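
The two derivation rules above could be sketched as follows. This is only an illustration under stated assumptions: the function name, the `"block"` constraint label, and the choice of which tags count are hypothetical, and the sketch scans the output with Python's standard `html.parser` rather than a full HTML5 tree builder.

```python
from html.parser import HTMLParser
from typing import Optional, Set

HEADINGS = {f"h{i}" for i in range(1, 7)}
# Tags from the note above whose presence in the output should close
# any surrounding tag of the same type.
SELF_EXCLUDING = {"a", "p"} | HEADINGS

class _TagCollector(HTMLParser):
    """Records the name of every start tag seen in the input."""
    def __init__(self) -> None:
        super().__init__()
        self.tags: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)

def derive_use_site_constraints(output_html: str,
                                output_constraint: Optional[str] = None) -> Set[str]:
    """Return the set of surrounding tags that should be closed at the use site."""
    derived: Set[str] = set()
    # Rule 1: derive from the declared output constraint.
    if output_constraint == "block":
        derived |= {"p"} | HEADINGS
    # Rule 2: derive from the tags actually present in the output.
    collector = _TagCollector()
    collector.feed(output_html)
    collector.close()
    derived |= collector.tags & SELF_EXCLUDING
    return derived
```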

Here are some benefits of this approach:

  • It keeps the parser end of the bargain manageable. There is very little additional complexity here. This technique is fairly simple and relies solely on an HTML5 parsing library / service for enforcing well-formedness. Enforcing template-author-declared constraints eliminates guesswork and complexity from the implementation as well.
  • An editor can look at the template documentation and figure out fairly easily and clearly where and how the template is meant to be used. There are going to be fewer surprises in terms of how rendering is affected by non-local effects of transclusions.
  • An HTML editor like VE is very well-placed to enforce use-site constraints, i.e. if a template declares that it should not be used in links, VE might prevent it from being used in a link. Because of this, it can provide stronger WYSIWYG guarantees: when the edited HTML is saved to wikitext, there are going to be fewer surprises about changes to rendering compared to how it showed up in the editor in a VE session.
  • Incremental parsability is also improved. Note that the two constraints by themselves are insufficient to guarantee that when the output of a transclusion changes (either because the parameters to the transclusion were changed, or because the template source itself was changed), we can take the new DOM fragment and install it in place of the old DOM fragment in the original HTML. However, in some constraint scenarios, we can make very reliable guarantees about this drop-in replacement of a transclusion's output.

    For example, with block level constraints, even if the template output moves around (for example, it got fostered out of a table), we know that since we are guaranteed that the edited output will still be block-level output, we can replace a <div>..</div> with another <div>..</div>, a <table> with another <table>, etc. Additionally, if we were enforcing use-site constraints of not being used inside a p-tag, we can even replace a <div>..</div> with a <p>..</p> and so on.

    Rather than try to guarantee incremental parsability in all cases upfront, we can build up this capability in the corpus of templates gradually by coming up with a sane set of workable output and use-site constraints and have templates opt into these over time.

    If a template edit changes its output or use-site constraints, then incremental parsing might have to be disabled for that edit. The pages using that template will now incur a full parse penalty. Later edits will re-enable incremental parsing.

    Note that this incremental parsing ability is only achievable in Parsoid since Parsoid maintains a mapping between DOM fragments and wikitext offsets. So, on edits to a template, it can parse the old HTML, find the DOM fragments corresponding to the transclusions of that template, replace them with the updated DOM fragments, and serialize the DOM back to HTML. This feature cannot and will not be provided in the core PHP parser.
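
A rough sketch of the drop-in replacement described above, including the fallback to a full parse when a template edit changes its constraints. Everything here is hypothetical: the `Fragment` and `Constraints` types, and the `render` and `full_reparse` callbacks, stand in for whatever Parsoid would actually use and are not its API.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional

@dataclass(frozen=True)
class Constraints:
    output: Optional[str] = None           # e.g. "block"
    use_site: FrozenSet[str] = frozenset()  # e.g. frozenset({"p"})

@dataclass
class Fragment:
    template: Optional[str]  # name of the transcluded template, None for other content
    html: str

def update_page(page: List[Fragment],
                template: str,
                old: Constraints,
                new: Constraints,
                render: Callable[[Fragment], str],
                full_reparse: Callable[[], List[Fragment]]) -> List[Fragment]:
    """Swap in new output for every transclusion of `template`.

    If the edit changed the template's constraints, the drop-in guarantee
    no longer holds, so fall back to a full parse of the page."""
    if old != new:
        return full_reparse()
    return [
        Fragment(f.template, render(f)) if f.template == template else f
        for f in page
    ]
```

With unchanged block-output constraints, a `<div>..</div>` fragment can be replaced by a `<table>..</table>` fragment without touching the rest of the page; only an edit that changes the constraints pays the full parse penalty.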

Questions to resolve

So, here are some things that need to be resolved / discussed:

  1. Feedback about this approach in general. Does this seem like an improved and viable approach?
  2. What are the best ways to prototype this? It seems that we could start with just one output and use-site constraint each. For example, we could use block-output (block in the HTML4 sense, since that is easier to grok; #balance:block as used in T114445#1789708). The use-site constraint could be no-p, no-h*, no-a, i.e. this template cannot be used inside <p>, <h*>, and <a> tags. We should come up with a better way to specify this.

    We need to pick a set of templates on which we could declare this output and this use-site constraint. Infoboxes seem like good candidates.
  3. Come up with a simple taxonomy / terminology / mechanism for declaring these output and use-site constraints. We have considered link, table, list, etc. in T114445#1789708. Anyway, we need to enumerate constraint types and write up specifications for them.
  4. Figure out where these constraints will be specified. Options are:
    • template source via magic words, parser function syntax, something else.
    • templatedata: this seems a good place for this, but template source and its constraints would now be in different places.
  5. All along, we have been very strongly leaning towards an opt-in model for templates. As far as I can tell, opt-in is the only approach that makes sense with this updated proposal.

I don't think this is actually a simplification. As noted in my prototype, the hard part here is actually determining what the "use site" of the template is. That essentially requires a full HTML5 tree builder pass. Once you've precisely identified the use site, all of the fixup strategies are essentially the same. Exposing a full use-site constraint mechanism to the user is likely to make use of templates unwieldy. As noted in my proposal above, I think block/inline/table is probably about the most this should be exposed to the user, and I expect that the first prototypes will only include the block mode.

ssastry added a comment. Edited Mar 19 2016, 2:23 AM

I don't think this is actually a simplification. As noted in my prototype, the hard part here is actually determining what the "use site" of the template is. That essentially requires a full HTML5 tree builder pass. Once you've precisely identified the use site, all of the fixup strategies are essentially the same. Exposing a full use-site constraint mechanism to the user is likely to make use of templates unwieldy. As noted in my proposal above, I think block/inline/table is probably about the most this should be exposed to the user, and I expect that the first prototypes will only include the block mode.

The simplification comes from not needing to infer anything automatically. For example, the template author might specify that for infoboxes, you just need to ensure it is not inside a p-tag and that the output has to be forced to be a block tag. That eliminates the complexity of determining how to embed the template output. You just continue to use it as it has been done all along so far *after* enforcing template-author-specified constraints.

Change 279670 had a related patch set uploaded (by Cscott):
WIP: Add {{#balance}} to opt-in to balanced templates

https://gerrit.wikimedia.org/r/279670

RobLa-WMF mentioned this in Unknown Object (Event). Apr 13 2016, 6:54 PM
RobLa-WMF mentioned this in Unknown Object (Event). Apr 13 2016, 7:34 PM
cscott updated the task description. Apr 13 2016, 7:45 PM
cscott updated the task description. Apr 13 2016, 8:10 PM
cscott updated the task description. Apr 13 2016, 8:29 PM

Updated the RFC to match the current proposed semantics and implementation.

Excuse me if I missed something in the proposal, but I'd like to raise the question of template parameters. Currently, template parameters are wikitext, and can thus contain (unbalanced) HTML tags. How should parameters be treated in balanced templates? Should each parameter be pre-parsed on its own? Or sanitized? Or do we allow plain text parameters only? Or limited wiki syntax? Structured data?...

Allowing un-balanced wikitext parameters to be used in a balanced template can break it, or at least lead to undesired results.

Restricted Application added a subscriber: TerraCodes. Apr 19 2016, 3:11 PM
Bonvol added a subscriber: Bonvol. Jun 24 2016, 4:02 PM

Change 303431 had a related patch set uploaded (by Cscott):
WIP: Extend 'format' spec to include format strings.

https://gerrit.wikimedia.org/r/303431

cscott updated the task description. Oct 12 2016, 10:13 PM
cscott updated the task description. Oct 12 2016, 11:00 PM
jeblad added a comment. Jun 2 2017, 2:43 PM

Is there any progress?

ssastry changed the task status from Open to Stalled. Jun 7 2017, 7:21 PM

Sorry, we are pretty overcommitted and this is currently stalled till we finish up some ongoing projects.

Someone asked for a logo.

Balanced templates. Gettit?

Or the minimalist version:

{{===}}
jeblad removed a subscriber: jeblad. Aug 25 2017, 10:11 PM