
Provide means to extract the parameters and values of a template invocation in wikitext
Open, Needs Triage, Public

Description

Background

Parsing arbitrary wikitext for a template invocation is a common need, and it has been reinvented many times over: for example mwparserfromhell (Python), and wtf_wikipedia and wikiapi (JavaScript). Yet MediaWiki itself offers no straightforward way to parse templates; one must use the Core preprocessor directly and iterate over the DOM tree.

If the bots can do it easily, a MW extension or Core feature should as well.

Proposed solution

Introduce a method to find a template invocation in arbitrary wikitext and return the parameters and values as an associative array. The values should be kept as the original wikitext with strip markers removed.

In turn, this allows a new and more user-friendly means to interact with extensions through wikitext, as opposed to traditional parser tags and functions.
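
As a rough illustration of the kind of result proposed above (parameters and values returned as an associative array, with values kept as raw wikitext), here is a minimal Python sketch. It is not the Core preprocessor and not the API from any patch: the name find_template and its helpers are hypothetical, it only balances {{ }} and [[ ]] pairs, and it ignores strip markers, comments, nowiki, triple-brace arguments and everything else the preprocessor handles.

```python
from __future__ import annotations

from typing import Optional


def _split_top_level(text: str, sep: str, limit: Optional[int] = None) -> list[str]:
    """Split text on sep, skipping separators nested inside {{ }} or [[ ]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth = max(depth - 1, 0)
            current.append(pair)
            i += 2
        elif text[i] == sep and depth == 0 and (limit is None or len(parts) < limit):
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def _matching_close(text: str, start: int) -> int:
    """Return the index of the '}}' that closes the '{{' at start, or -1."""
    depth, i = 0, start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            if depth == 0:
                return i
            i += 2
        else:
            i += 1
    return -1


def _normalize(name: str) -> str:
    """Trim, treat underscores as spaces, and ignore case of the first letter."""
    name = name.strip().replace("_", " ")
    return name[:1].upper() + name[1:]


def find_template(wikitext: str, name: str) -> Optional[dict[str, str]]:
    """Return the parameters of the first invocation of name, or None."""
    target, pos = _normalize(name), 0
    while True:
        start = wikitext.find("{{", pos)
        if start == -1:
            return None
        end = _matching_close(wikitext, start)
        if end == -1:
            return None
        parts = _split_top_level(wikitext[start + 2:end], "|")
        if _normalize(parts[0]) == target:
            params, positional = {}, 0
            for part in parts[1:]:
                pieces = _split_top_level(part, "=", limit=1)
                if len(pieces) == 2:
                    # Named parameter: name and value are trimmed.
                    params[pieces[0].strip()] = pieces[1].strip()
                else:
                    # Positional parameter: numbered from 1, kept verbatim.
                    positional += 1
                    params[str(positional)] = part
            return params
        # Step inside this template so nested invocations are also visited.
        pos = start + 2


wikitext = (
    "Intro {{Wish|title=Better search"
    "|description=See [[Help:Search|the docs]] and {{tl|foo}}.|extra}} outro"
)
params = find_template(wikitext, "wish")
```

The values come back as untouched wikitext, so a nested link or template inside a value (as in the description above) stays intact for the caller to render or store.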

Event Timeline

Change #1153392 had a related patch set uploaded (by MusikAnimal; author: MusikAnimal):

[mediawiki/core@master] Parser: add getTemplateInvocation() for easy template parsing

https://gerrit.wikimedia.org/r/1153392

The immediate application is 1148985 and its parent 1148400. Certain parameters from the Community Wishlist wish template will be cached in the database for indexing.

The proposed summary fails to mention a couple of additional implementations, including in Parsoid. This strikes me as a https://xkcd.com/927/ issue.

I'm curious why a parser function is /not/ the right answer here? That's traditionally the thing we use for "looks just like a template". And for "adding a property from a template to the database" we have a number of existing mechanisms as well. I think we'd like to talk to Community Wishlist further to understand better why a new mechanism is necessary, and whether this is a pattern we actually want to pave a cowpath for (which means that many more folks are going to use this approach in the future as well).

In particular, it seems that "extra parses of wikitext" is a pattern we probably do *not* want to support. We already see this taking hold in a number of Scribunto modules which fetch the wikitext for a page and then try to extract information from it. You can imagine scenarios where a complex page does this multiple times, which isn't great. It is far better to use a parser function or other mechanism to extract the information and store it in the parser output /once/, rather than encourage "reparsing" of chunks of wikitext.

I've added this to the agenda for our next tech forum meeting (June 10) but we should probably try to set up a meeting w/ Community Wishlist as well.

Parsoid uses the Core preprocessor, no? The other implementations I mentioned do not, and are also what I would consider "extra parses of wikitext". My point is merely that they are able to do it easily enough, so it shouldn't be this hard to do directly in MediaWiki when we have access to the source of truth – the Core preprocessor. So in my view, it's not so much about https://xkcd.com/927/ as it is the actual standard (MediaWiki) lacking the same capabilities offered by external libraries.

My apologies for not being more clear on our specific use case. On a high level, the motivation behind MediaWiki-extensions-CommunityRequests can be seen as akin to MediaWiki-extensions-PageAssessments. We have a bunch of uses of a particular template, and we want to store the data passed to the template, so we have that template use a parser tag or function. Easy peasy. For CommunityRequests however, one of the pieces of data is a wikitext blob (the "description" of wishes). I think we could still store it using the same parser tag, but we lose the CirrusSearch integration, and it would also necessitate us using external storage due to its potential size. So we thought we might as well continue using wiki pages as the storage medium for the wikitext blobs.

Next, we have a form (see live example) that needs to take the values of the template invocation and pass them to the Vue application. This is much easier in the upcoming extension as we have the data for most of the form fields already in our database tables. The exception is the description, for the aforementioned reasons: we need to extract its raw wikitext value from the template invocation. This is needed when accessing [[Special:WishlistIntake/:wishid]], which I think means there is no parser pass of the wish page for us to leverage (?).

I went down a rabbit hole and was amazed to see that what we needed to do was quite doable using the preprocessor. Fetching template invocations in MediaWiki was also a subject of discussion at the recent Hackathon. One thing leads to another, and here I am adding a new entry point to the Parser – something quite bold and very much outside my expertise! Needless to say I did not get my hopes up that r1153392 would get anywhere. If we are concerned with misuse, I am content abandoning this effort. I do however think the approach is appropriate for our extension (unless you have better ideas – please indulge!), and certainly an improvement over our current TemplateParser.js. In both cases (MediaWiki extension or an on-wiki gadget), we need to fetch the raw wikitext and extract specific fields from a template invocation. I figured since we have access to the preprocessor, using that is more sane and foolproof than yet-another-wikitext-parser.

Discussed in Content-Transform-Team Tech Forum today. We should probably set up a meeting with y'all to discuss current status and understand better your use case.

Our two straw proposals at this time would be:

  • Provide the API from "Parser: add getTemplateInvocation() for easy template parsing", but based on the cached Parsoid ParserOutput.
  • As an alternative, provide a "template listener" hook that would get triggered during the parse of the page when the template structure is available, which would allow portions of the template tree to be canonicalized into page properties or other indexable data.
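
To make the second proposal a little more concrete, here is a purely hypothetical Python sketch of the listener pattern. Nothing in it is an existing MediaWiki or Parsoid interface: TemplateListeners, parse_page and the page_props dictionary are invented names, and the list of (name, params) tuples stands in for whatever template structure the real parse would expose. The point is only that listeners fire during the single parse of the page, so no second pass over the wikitext is needed.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable

Params = dict[str, str]
PageProps = dict[str, str]
Listener = Callable[[Params, PageProps], None]


class TemplateListeners:
    """Hypothetical registry mapping template names to listener callbacks."""

    def __init__(self) -> None:
        self._listeners: dict[str, list[Listener]] = defaultdict(list)

    def on(self, template_name: str, listener: Listener) -> None:
        self._listeners[template_name].append(listener)

    def dispatch(self, template_name: str, params: Params, props: PageProps) -> None:
        for listener in self._listeners.get(template_name, []):
            listener(params, props)


def parse_page(invocations: list[tuple[str, Params]], listeners: TemplateListeners) -> PageProps:
    """Stand-in for one parse: notify listeners as each template is encountered."""
    page_props: PageProps = {}
    for name, params in invocations:
        listeners.dispatch(name, params, page_props)
    return page_props


def record_wish_title(params: Params, props: PageProps) -> None:
    # Canonicalize one template parameter into an indexable page property.
    props["wish_title"] = params.get("title", "").strip()


listeners = TemplateListeners()
listeners.on("Wish", record_wish_title)
props = parse_page(
    [("Wish", {"title": " Better search "}), ("Other", {"1": "ignored"})],
    listeners,
)
```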

Thanks. This sounds like what I was hoping for. It is a bit beyond my means to take this on, but I hope it's okay to leave this task open in case someone else wants to tackle it? If it feels to you like a never-gonna-happen, feel free to close.

For the record, we don't need this for MediaWiki-extensions-CommunityRequests anymore. I just think it's functionality that Core should offer.

Thanks for your time and assistance :)

Change #1153392 abandoned by MusikAnimal:

[mediawiki/core@master] Parser: add getTemplateInvocation() for easy template parsing

https://gerrit.wikimedia.org/r/1153392