
Provide for asynchronously-available MediaWiki parser content fragments / components
Open, Needs Triage, Public

Description

Decision Statement Overview

What is the problem or opportunity?

MediaWiki's core feature is a content framework through which end users can invoke features inline in the "wikitext" markup syntax, triggering the Parser to include content in the HTML output or otherwise change the behaviour of the page. This is currently entirely synchronous. We would like the ability to add asynchronous content fragments to the parser output. Any changes we make should be backwards-compatible, requiring no changes from teams with existing uses of the parser system.
Background:

Over the years, we at Wikimedia have used the current system to great effect, both within MediaWiki itself, such as through categories, content transclusion ('templates'), and render-time-evaluated 'parser' functions, and through MediaWiki extensions that run server-executed code, such as embedded video playback, graph rendering (presently decommissioned), and Lua scripting.

However, all content inclusions are currently synchronous: they are either injected into the content at parse/render time, or take the form of an HTML link to media. The second option is evaluated by the reader's browser at read time, and so allows extensions to point to an item that potentially doesn't exist, or at least doesn't exist yet. The most high-profile use of this quasi-asynchronous inclusion feature is in the TimedMediaHandler extension, which queues up video transcoding on upload. Users can add wikitext to embed the video before it becomes available to view, and whether the target exists is re-evaluated each time the page is rendered. No such mechanism is possible for injected content fragments, however.

The two main performance-related limits within the parser are a cap on the size of the input wikitext (2 MiB) and a cap on the number of template invocations on a page (10,000). Later, a scripting facility (the Scribunto extension) was introduced to replace many of the template invocations, and thus make the 10,000 allowed calls go further.

For Wikifunctions, we would like to change this. Wikifunctions will allow inline invocation of function calls on Wikimedia wikis' pages, using code defined on Wikifunctions and executed on the back-end service, with the result spliced into the page at the invocation point. We would like function execution to be triggered asynchronously on page parse, function update, or input update, with content fragments injected when they become available, and with stale fragments or, potentially, placeholders shown in their place where necessary.

For a worked example, a page might use a Wikifunctions invocation to fetch the population of a city over time from Wikidata, calculate the drift, and turn that into prose. The wikitext {{#function|Z12345|en|Q12345|5 years}} might result in the content fragment "The city has an official population of 5.4 million, up by 1.4% over the five years from 2014–2019.". When the result changes, whether because an input was changed (e.g. from 5 years to 10), the Wikidata item was updated (e.g. to add a new value), or the code implementing the function was changed (e.g. fixed to support time spans of centuries), the rendered version of the article would continue to include the old version of the fragment for a few seconds until the new one was available to replace it.

The team does not have a particular solution in mind for how to build this, and is looking to work with partner teams to explore possible performant, sustainable, roadmap-aligned ways by which to achieve this aim.

What does the future look like if this is achieved?

For Wikifunctions, in-page calls would be executed asynchronously without blocking the initial page render. The return values would be cached and injected into the page after it is generated (or later in the synchronous generation if the responses are fast enough, e.g. less than 100ms). Users could add a number of function calls without hitting arbitrary limits. Pages using too many function calls would degrade gracefully to readers with a placeholder, rather than failing to load.
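The fast-path idea above (inject synchronously when a call returns within ~100 ms, otherwise degrade to a placeholder) can be sketched with a per-call time budget. This is an illustrative sketch under stated assumptions; PLACEHOLDER, render_call, and the 100 ms figure are taken from the prose, not from any real MediaWiki interface.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

# Hypothetical placeholder markup; not actual Wikifunctions output.
PLACEHOLDER = '<span class="wikifunctions-pending">…</span>'
FAST_PATH_BUDGET_S = 0.1  # ~100 ms, as suggested above

def render_call(executor, func, *args):
    """Run one function call with a fast-path budget.

    Returns (fragment, future): a fast result is spliced into the
    synchronous render; a slow one yields a placeholder now, with the
    future completing later so the real fragment can be injected."""
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=FAST_PATH_BUDGET_S), future
    except TimeoutError:
        # Graceful degradation: the page still loads for readers.
        return PLACEHOLDER, future
```

A page with many calls would submit them all to the pool, splice in whichever results arrive within the budget, and fill the remaining placeholders as the futures complete.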

More widely, beyond the work of the Abstract Wikipedia Team itself, there are several potential forms of asynchronous content creation, from which other features and extensions could benefit:

Asynchronous content creation could in principle also be extended to template calls, but we understand that such re-use would first require the "balanced templates" pre-TDMP RfC to be implemented to avoid secondary impacts, which itself awaits the final replacement of the legacy Parser with Parsoid RfC.

Another example would be Wikidata query result pages, whose queries often take up to a minute to run. These could be integrated 'live' from the query endpoint, rather than having wikitext pages updated by a user-maintained bot running on a cron job.

Most wide-rangingly and longer-term, this would be a big step towards giving MediaWiki the capability to asynchronously compose content in general. This would be a major enabler for combining the output from multiple, event-driven systems.

What happens if we do nothing?

Calls to Wikifunctions would be synchronous, and thus slow down the generation of pages, making the sites using Wikifunctions slower. Some function calls would hit the cache, but others wouldn't, providing inconsistent performance for users. On some pages on e.g. the English Wikipedia, logged-in users would sometimes have to wait multiple seconds for the page to load.

One alternative route to address this need would be a large capital outlay on additional production servers, but this would be very expensive, would unnecessarily increase our environmental impact, and would only support relatively limited uses.

For those wikis where Wikifunctions was deployed, we would have to impose very strict time limits on the function calls users could add to a page, reducing the value to editors (and thus readers) by limiting complexity to relatively trivial cases.

This would be an unacceptable performance outcome, and would effectively prevent the production enablement of Wikifunctions calls, and of the future Abstract Wikipedia service.

The opportunity for communities to share functions among themselves to replace local Lua modules would be lost, so smaller communities would either have to continue to invent their own, remember to copy other wikis' features, or never know of the options available to them.

Any additional background or context to provide?

Abstract Wikipedia planning home page

Event Timeline

Jenlenfantwright updated the task description.

A somewhat similar problem is how to include metadata about the content in the page HTML when that metadata needs to be calculated in not-quite-real-time. An example is T213505: RfC: OpenGraph descriptions in wiki pages, but there are all kinds of other potential use cases, e.g. around machine learning (such as altering page presentation based on the ORES rating of the revision). This doesn't involve the parser, which makes it much simpler, but one shared aspect of the two problems is the need to refresh the edge cache (and MediaWiki's HTML cache, if enabled) once the final HTML has been produced. It would be nice if the mechanism chosen for that were sufficiently generic.
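The shared refresh mechanism could be as simple as one generic "HTML finalized" event that every cache layer subscribes to. The sketch below is purely illustrative of that shape; PurgeBus and its subscribers are hypothetical names, not existing MediaWiki components.

```python
class PurgeBus:
    """Hypothetical generic purge mechanism: fire one event when the
    final HTML (content plus late-arriving metadata) is available, and
    let each cache layer react."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        # e.g. the CDN/edge-cache purger, the parser-cache invalidator.
        self._subscribers.append(callback)

    def html_finalized(self, page_url):
        # Called once an async fragment or metadata item lands in the
        # final HTML; every registered cache is told to purge the page.
        for callback in self._subscribers:
            callback(page_url)
```

Both the parser-fragment case and the metadata case could then share this one event, rather than each feature wiring up its own purge path.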

I know of two existing features that could benefit from this:

  • <math> tags (Math extension), when using MathML rendering. The actual rendering happens in a separate service, contacted via the command line or via HTTP requests. Currently the parser waits on them synchronously and, to improve performance slightly, does weird hacky stuff to batch them (which causes bugs when an incompletely parsed result is used, e.g. T242327).
  • Image thumbnails, when using $wgInstantCommons. Instant Commons uses synchronous HTTP requests to fetch image dimensions from Commons, and relies on caching so that it is only excruciatingly slow the first time they're fetched (and that caching is currently broken: T235551).

T249419: RFC: Render data visualizations on the server (Extension:Graph / Graphoid) is also somewhat related. It's similar to the Math use case, except that the graph definition is too large to be included in the thumbnail / iframe URL, so it needs to be retrieved via a side channel, which causes lots of complications.