
Add more information to MediaWiki parse tree
Open, Needs Triage, Public

Description

The action=parse&prop=parsetree API returns an XML representation of how the MediaWiki preprocessor splits wikitext into template names, parameters, extension tags, etc. Unfortunately, this representation doesn't include the template markup itself, which makes it hard to reconstruct the wikitext from the parse tree. That ability would be very useful for template manipulation (e.g. removing or changing a certain parameter of a certain template). Parsoid can't do that because it can't introspect nested templates ({{outer| {{inner| param }} }}, which is common in e.g. infoboxes), regexes are brittle, and writing a client-side parser is more effort than we should force on clients. (In Python, mwparserfromhell can do it, but there is no similar JS parser.)
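
For reference, a request for the parse tree of a wikitext snippet might be built roughly as follows. This is a minimal sketch: the endpoint URL and the helper name `parsetree_url` are illustrative assumptions, and the code only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Assumed endpoint, for illustration only; any MediaWiki api.php URL would do.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def parsetree_url(wikitext, endpoint=API_ENDPOINT):
    """Build an action=parse URL that asks for the preprocessor parse tree."""
    params = {
        "action": "parse",
        "prop": "parsetree",
        "contentmodel": "wikitext",
        "text": wikitext,
        "format": "json",
    }
    # urlencode percent-escapes the braces and pipes in the wikitext.
    return endpoint + "?" + urlencode(params)


print(parsetree_url("{{SomeTemplate|some-param}}"))
```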

Event Timeline

For example {{SomeTemplate|some-param}} is represented in the parse tree as

<template><title>SomeTemplate</title><part><name index="1"/><value>some-param</value></part></template>

Instead, it could be something like

<template><markup>{{</markup><title>SomeTemplate</title><part><markup>|</markup><name index="1"/><value>some-param</value></part><markup>}}</markup></template>

so that loading the parse tree into a DOM model and getting the textContent of the root element would return the original wikitext.
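
To illustrate the idea, here is a small sketch using Python's standard `xml.etree.ElementTree` as a stand-in for a browser DOM. The `<markup>` element is the proposal from this task, not something the API returns today, and `text_content` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Proposed (not yet existing) parse tree format with <markup> elements.
proposed = (
    '<template><markup>{{</markup><title>SomeTemplate</title>'
    '<part><markup>|</markup><name index="1"/>'
    '<value>some-param</value></part><markup>}}</markup></template>'
)


def text_content(xml_string):
    """Rough equivalent of DOM textContent: all text nodes in document order."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext())


print(text_content(proposed))  # {{SomeTemplate|some-param}}
```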

This is a breaking change so we'd need a version parameter and support for both versions, which seems straightforward.

The parse tree preserves whitespace, and the markup is implied by the tags, so it is already possible to map the parse tree back to wikitext. Doing so is not that complex, but it is also not as easy as reading textContent.

Some hints:

  • <template> => {{ ... }}, possibly with a lineStart= attribute
  • <tplarg> => {{{ ... }}}
    • both contain a <title> and optional <part> elements, each with <name>, <equals> and <value>; every part needs a leading pipe.
  • <ext> => < ... > with <name>, <attr>, <inner> and <close>; when <close> is missing, it represents a short tag ending in />
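
The hints above can be sketched in a few lines of Python. This is only an illustration under assumptions, not a complete serializer: `to_wikitext` and `children_to_wikitext` are hypothetical helper names, the sample XML strings are hand-written rather than fetched from the API, and an `<ext>` is treated as a short tag only when it has no `<inner>` element.

```python
import xml.etree.ElementTree as ET


def children_to_wikitext(node):
    """Concatenate a node's own text, its serialized children and their tails."""
    out = [node.text or ""]
    for child in node:
        out.append(to_wikitext(child))
        out.append(child.tail or "")
    return "".join(out)


def to_wikitext(node):
    """Re-insert the markup that the parse tree only implies through tags."""
    if node.tag == "ext":
        name = node.findtext("name", "")
        attr = node.findtext("attr", "")
        inner = node.find("inner")
        if inner is None:
            # Assumption: no <inner> means a short tag such as <ref name="x"/>.
            return "<" + name + attr + "/>"
        # <inner> holds raw, unexpanded wikitext; <close> may be absent.
        return "<" + name + attr + ">" + (inner.text or "") + node.findtext("close", "")
    body = children_to_wikitext(node)
    if node.tag == "template":
        return "{{" + body + "}}"
    if node.tag == "tplarg":
        return "{{{" + body + "}}}"
    if node.tag == "part":
        # Every part needs a pipe; <name index="..."/> contributes no text.
        return "|" + body
    return body


current = (
    '<root><template><title>SomeTemplate</title>'
    '<part><name index="1"/><value>some-param</value></part></template></root>'
)
print(to_wikitext(ET.fromstring(current)))  # {{SomeTemplate|some-param}}
```

Element types not covered by the hints fall through to plain text concatenation, which is an assumption that their text content is already the original wikitext.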

The wikitext in <inner> is not expanded, for the same reason as in T4700.

For a deep dive into templates you would need the inclusion parse tree, which in turn needs T51353.