
Consider whether Parsoid will support forced linear parsing.
Open, Low priority, Public

Description

Some extensions, in particular Variables, Arrays and Loops, may inherently depend on linear parsing to produce valid output. This is usually not visible from hook usage, but becomes apparent from setting attributes on the parser or using ParserOutput->setExtensionData.
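For concreteness, a minimal sketch of the pattern in question (hypothetical code; the hook and key names are made up): a parser function that accumulates state in the parser's output, which only yields consistent results if its calls run in document order.

```php
// Hypothetical parser-function handler; 'myext-seen' is an illustrative key.
// Correct output depends on these calls running in document order.
public static function onFunctionHook( Parser $parser, string $value ) {
	$output = $parser->getOutput();
	// Read back previously accumulated state (an ordering assumption!)...
	$seen = $output->getExtensionData( 'myext-seen' ) ?? [];
	$seen[] = $value;
	// ...and write the updated state for later calls on the same page.
	$output->setExtensionData( 'myext-seen', $seen );
	return 'item #' . count( $seen );
}
```

Nothing in the hook registration reveals this dependency; it only shows up in how the handler reads back its own earlier writes.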

At some point, a decision needs to be made whether there should be continued support for linear parsing in some low-performance mode, or whether this will break entirely sooner or later.

Event Timeline

I was chatting about this in the MediaWiki Discord server with another user by the name of "skizzerz", and they had an interesting idea: what about implementing the legacy parser as a content model? That would allow wikis to use one or the other on a per-namespace and even per-page basis, as needed, potentially even mixing and matching if transclusions can be handled correctly between the two. The legacy parser could be defined as always being linear, while Parsoid would not. That doesn't solve the underlying issue of allowing Variables (et al) to work with Parsoid, but it does, at least, provide a method for wikis to move forward with either or both of the legacy parser and Parsoid, potentially moving over gradually as time and resources permit. I've never looked at content handlers at anything but a surface level myself, so I don't know if this is actually a viable solution, but it seems like it might be a good way forward if it can be made to work without too much effort.

We've generally thought about splitting the content model for wikitext in terms of future "wikitext 2.0" work. But note that doing it for the legacy parser would probably also mean that Visual Editor (and probably other 'new' tools developed) won't be activated for "legacy wikitext" pages.

Note that the new ContentMetadataCollector abstract interface (which ParserOutput will implement, see T287216) is a write-only API. So as code moves to that interface, it will become more obvious where ordering is assumed, because callers will have to down-cast the ContentMetadataCollector to ParserOutput in order to read data out.
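Concretely, under a write-only collector, an ordering-dependent read would have to look something like this (illustrative sketch; the function and key names are made up):

```php
// The collector interface is write-only, so any code that needs to read
// back earlier metadata must down-cast; that makes the linear-order
// assumption explicit and easy to grep for.
function readEarlierState( ContentMetadataCollector $metadata ) {
	if ( !( $metadata instanceof ParserOutput ) ) {
		throw new LogicException( 'This code assumes in-order parsing.' );
	}
	// 'myext-state' is a hypothetical key an extension wrote earlier.
	return $metadata->getExtensionData( 'myext-state' );
}
```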

A linear parsing model might also be incompatible with the new Wikifunctions feature.

what about implementing the legacy parser as a content model? That would allow wikis to use one or the other on a per-namespace and even per-page basis, as needed, potentially even mixing and matching if transclusions can be handled correctly between the two. The legacy parser could be defined as always being linear, while Parsoid would not.

This seems to be a very good idea. Some extensions, like Variables or External Data, rely on the order of parsing.

In addition, a parser function, e.g. {{#legacy_parser:wikitext}}, could be added to allow linear parsing of just a part of a document with the old parser.

Alternatively, a parser function, e.g. {{#ordered:first fragment|second fragment|...}} can be defined to guarantee the order in which fragments of wikitext are parsed.

But this will work only if Parsoid's architecture allows the parser function to be called before its arguments are parsed. If it does, it will also allow lazy evaluation of the arguments of {{#if:}}, {{#ifeq:}} and {{#switch:}}.
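To illustrate, the two proposed parser functions might be used like this on a page (both are hypothetical; this syntax does not exist today):

```wikitext
<!-- Hypothetical: force this fragment through the legacy parser, in order -->
{{#legacy_parser: {{#vardefine:x|1}} The value is {{#var:x}}. }}

<!-- Hypothetical: guarantee these fragments are parsed first-to-last -->
{{#ordered: {{#vardefine:x|1}} | The value is {{#var:x}}. }}
```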

Hi, on my wiki thewoodcraft.org I make very intensive use of extensions such as Variables, Loops, and ParserFunctions. I don't use Scribunto. In my view, using Lua to work with variables and conditions is a perversion; native PHP code is more effective for it, and I prefer extensions that use native PHP.

To my mind, it isn't important whether the parser works linearly or in parallel, as long as I can add a hook that works on the raw code before the parser finishes its work.

I don't know if my comments will change anything or if they belong here, but I want to say that the disappearance of the Variables, Arrays, Loops, ParserFunctions and HashTables extensions will have disastrous repercussions on the wikis I maintain. I don't think I'm an isolated case, given the success of these extensions. These extensions seem to be indispensable when using Semantic MediaWiki.

I like the suggestions by @RobinHood70 and skizzerz about preserving linear parsing as part of a 'legacy' content model (which can then be adopted on the level of namespaces and wiki pages) and by @alex-mashin about the possibilities of forcing a preliminary step of linear parsing through a parser function. Especially since like @Megajoule and many others, I'm anxious about the future of my wikis and would be a little unhappy to see fifteen years of work go to waste.

I would love to hear MediaWiki developers on their perspective(s). Are the suggestions above feasible at all? Should we organise ourselves to get wider attention for the issue? Or is it not yet possible at this stage to get clarity - which would at least be an answer in itself.


I'm just someone who works with a number of small wikis that use these extensions, but I'm a bit confused about where all of this actually is going. I do have quite a bit of work built around these extensions and around Semantic MediaWiki. I'm happy to 'get onboard with the future', but I can't seem to find very much that actually explains what that means.

What should I be doing to build to be compatible with future versions?
What versions should I be expecting my current templates (using Variables) to stop working on?

What should I be doing to build to be compatible with future versions?

WMF's vision seems to be that everyone will use Scribunto and Parsoid, and that this will somehow be the perfect solution for everyone, or at least that everyone will convert to their preferred system when left with no other choice. The implied promise to maintain state throughout a document no longer holds true, as Parsoid processes things in parallel. This handily breaks Variables, Arrays, Loops, HashTables, and probably several in-house extensions like our own, since they all rely on the ability to maintain state as a document renders. In other words, you can define a thing, make reference to that thing, and possibly alter the thing as the document progresses. Parsoid simply doesn't make that promise. I'm not entirely sure why ParserFunctions would have issues, but perhaps someone more involved with that project can shed some light on it.
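For readers less familiar with these extensions, a minimal example of the state-keeping pattern described above, using Variables and ParserFunctions syntax:

```wikitext
{{#vardefine:hp|100}}
The hero starts with {{#var:hp}} HP.   <!-- 100, if parsed in order -->
{{#vardefine:hp|{{#expr:{{#var:hp}} - 30}}}}
After the battle: {{#var:hp}} HP.      <!-- 70, only under linear parsing -->
```

If the second half of the page is parsed before (or concurrently with) the first, the reads and writes of `hp` can interleave in any order, and the rendered numbers are no longer guaranteed.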

The only thing we can do that I'm aware of is to split off the older parser (and probably also the preprocessor) into a separate development branch. The idea has been tossed around that it could be developed as a separate content type so that entire namespaces, including main space, could be set to a specific default content type, and each page could pick and choose if it needed something other than the default. Such an extension could either work with the older parser or at least work in a linear fashion so that wikis that need to maintain document state can do so. From personal experience, I expect that there would be a strong desire to do at least some redevelopment on the older parser/preprocessor, as the current preprocessor is...let's say "badly in need of a rewrite". While I'm less familiar with the parser itself, I suspect that would also want a fair bit of rewriting over time, perhaps in conjunction with a second option that freezes the older parser at its current state of development for full backwards compatibility.

What versions should I be expecting my current templates (using Variables) to stop working on?

As I understand it, 1.35 is where problems might begin to arise. How much of an issue that'll be, I'm not sure. I suspect that in at least some cases, they'll appear to continue working, but in reality, that'll just be a matter of luck that the page is getting parsed in linear order or near enough not to matter. I believe the creator of Variables said his works with 1.35, but you'd have to check the pages of all extensions you use to see what breaks and what doesn't, and if there are any workarounds.

Variables still works without issue in MW 1.40; and the same should hold for the other extensions. However, due to the use of #var_final it will emit deprecation warnings.

The core functionality cannot really stop working without a huge breaking change in the parser that won't be easy to miss.

We are not going to break anything without sufficient notice. We have not declined this task, so this is still on our radar. The only reason there is no action on this yet is because we are focused on getting the Wikimedia cluster wikis migrated over to use Parsoid for all use cases and there is a lot to handle there that is taking all our focus and attention.

But yes, we discourage relying strongly on linear parse order expectations, and that functionality will eventually break (short of implementing a low-performance processing mode as in this task). When Parsoid becomes the default wikitext engine, we might pull out the legacy parser into its own library / extension or some such thing for wikis that don't want to or cannot migrate over to Parsoid. But, note that all future work on wikitext and templating will only be done with Parsoid.

Note that there is state in a document; the document itself is the global state. What we are not going to maintain is the promise that processing will always happen sequentially, OR even that all parts of the document will be processed (we may reuse fragments from a cache -- this is something that was part of Parsoid early on which we have since pulled out till we fully transition over). This is not something that is happening today or even in the short term. Our focus right now is heavily on making sure we transition over to Parsoid on the Wikimedia cluster.

There might be some ideas emerging out of SMWCon 2022 related to the Variables extension that might work for some of you. One of us will post links here when we have more solid info that is not experimental.

Thanks for the update, ssastry! Just for clarification, when I talk about maintaining state, I mean that, in a template or even just a regular document, you can define a variable in some fashion (as in Variables, Loops, etc.) and it will take effect in the document only from that point forward. Then, if it's changed again later in the document, the change will only take effect after that point. In concept, this turns wikitext itself into a primitive programming language, which is what Variables, Arrays, and Loops are essentially doing. Obviously, this will break in a non-linear environment, but if a linear option is maintained, I think that would likely allow those extensions to continue to work as they are, at least for the time being.

Side question: are there any plans to change the pre-processor code? Last I heard, you were planning on keeping the pre-processor the same under Parsoid as in the legacy parser. If that's the case, then both of them could benefit from a pre-processor rewrite. I ask because I've already done that rewrite in C# as part of my bot's wikitext parser (roughly halving the pre-processing time from the base PHP to C# port), so I'd be very interested in backporting the same algorithm to PHP to see if the speed improvements would translate. It wouldn't be directly compatible with the existing framework, but it's conceptually close enough that changes to dependent code would probably be minor.

Splitting the content model into sequential and non-sequential content models is an interesting concept but I am not sure that is really that useful or necessary.

A memoizing non-sequential parser like Parsoid wants all parts of the parse to be referentially transparent without hidden side effects.

Any part of the parse (e.g., parser functions introduced in an extension) that wants to store and modify data behind the back of such a non-sequential parser may cause the page render to fail to produce the same output as a purely sequential parser.

The solution is to make each parsing function referentially transparent by either only using the input and output text for storage (since that is what parsers already know about) or having some other means of providing the parser with information about what is being used for storage.

Why not legitimize variable storage by adding such an API to the parser itself? This is basically the same issue seen with out-of-order instruction execution in modern microprocessors where each instruction can read and write to memory and registers. If we create a variable storage API in the parser, it can then do more advanced things like register renaming to analyze and eliminate data dependency "hazards". The parser can then know when and how these variables change and thus when to schedule or reschedule parsing any and all parts of a complete page render.

This would give the currently affected parser function extensions a path to porting their current side-effects by storing their variable values in the parser via this API and allowing the parser to know all of their inputs and outputs so it can effectively analyze their data dependencies across page renders.
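No such API exists in MediaWiki today; as a very rough sketch of what it might look like (all names hypothetical):

```php
// Purely hypothetical interface; nothing like this exists in core today.
// The parser would own the storage, so it can track which fragments read
// or write each variable and (re)schedule fragment parses accordingly,
// much like dependency analysis in an out-of-order CPU.
interface ParserVariableStore {
	/** Declare a read; lets the parser record a data dependency. */
	public function getVar( string $name ): ?string;

	/**
	 * Declare a write; fragments that read $name after this point
	 * must be (re)parsed once the value is known.
	 */
	public function setVar( string $name, string $value ): void;
}
```

The key design point is that reads and writes go through the parser rather than around it, so nothing happens "behind its back".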


Any updates on this item? Not being able to use these parser functions on smaller wikis may end up forcing some of them to stay on older versions of MediaWiki (having to replace all content pages would likely not be feasible). This is quite problematic from a security standpoint.

I'm also interested in updates on this question. I manage a corporate wiki with a few thousand pages and we don't use Lua or Scribunto. I don't understand what this change will mean for me-- does my team have to rewrite everything that uses ParserFunctions? What support is being offered to help wikis that must make this change? If we have to do a ton of work in order to use a current version, we're going to be in the position Stux describes.

I'm also interested in updates on this question. I manage a corporate wiki with a few thousand pages and we don't use Lua or Scribunto. I don't understand what this change will mean for me-- does my team have to rewrite everything that uses ParserFunctions? What support is being offered to help wikis that must make this change? If we have to do a ton of work in order to use a current version, we're going to be in the position Stux describes.

This should only concern the parser functions that start with #var or #array in your wiki. While Variables is broken on MediaWiki 1.40+ at this point, we are working on fixing that issue. We'll see whether at some point this leads to actual breakage. If past experience is any indicator, it will be years before this becomes relevant for the Variables core functionality.

@cscott I would just like to bump this so that it is not forgotten, as the combination of T343227 and T300979 would make us lose our migration path for extensions building on forced linear parsing. If you decide against this, I would request two things:

  • A rough estimate of when these capabilities will finally be removed, ideally years in advance, so that small wikis have a lot of time to react
  • A stance regarding T207993, which more or less amounts to read/write template parameters and would allow for T207649, capturing at least some of the functionality going forward.

I would like to adapt as far as possible to align with Parsoid and keep the users of Variables happy as far as possible, but I am unwilling to spend time to provide workarounds for deprecations which will themselves be deprecated or stop working not much later.

I'll add my voice to MGChecker's comments, as we use Variables (and Loops to a lesser extent) in multiple wikis totalling around 300k content pages (and growing) that use 100s of templates that use Variables. Further, all of our wikis' contents are managed not by me or my company but by the wikis' respective communities and a small-ish set of editors who would be forced to completely rework all their use of the affected extensions somehow. (It's my understanding that the generally expected alternative would be Scribunto/Lua, meaning the editors would need to learn Lua if they don't already know it.)

We also have multiple wikis with similar problems, but using a custom extension similar to Variables. Ours modifies the PPTemplateFrame variables directly (allowing basic scoping), and other functions store information in ParserOutput as parser functions are parsed.

Our tags allow two-way communication between parent-child scopes (inheriting variable values from the parent and above as well as returning them). The inheritance part is actually multi-level, so values can come from the grandparent and higher as well. To my knowledge, Scribunto/Lua doesn't support either of these features, so is not viable as a workaround in those cases. While there are certainly many cases where these features aren't used and Lua would be fine, in the cases where it won't work, that means abandoning functionality we've had built into our wiki since MW 1.10 days.

Like Justin_C_Lloyd, we're looking at around 400k pages, roughly 70% of which use *something* from our custom extension. With only a small handful of advanced template programmers and two PHP programmers, this isn't something we can do overnight. MGChecker's not wrong in saying that we'll need forewarning of a year or more, and ideally, a very specific plan (read: programming interfaces) detailing exactly what will and won't be supported. I think this is what's gotten us into this situation in the first place: old versions of the MediaWiki software had an implicit contract, by virtue of offering public methods and/or properties, to allow certain things that were later assumed not to be needed. But by then, extension developers were already using them and wikis were making extensive use of those extensions.

Lastly, if linear parsing won't be offered, it would be useful if someone could provide a brief technical outline of what the issues are that prevent having both linear and parallel as options.

Hello everyone, thanks for the detailed responses, it's very useful to understand your use cases to make our plan for the Parser Unification effective.

I know this topic is still open-ended and I'm looking for a solution to keep communication clear and transparent as the project advances; as of now, the best place is the following wiki page: https://www.mediawiki.org/wiki/Parsoid/Parser_Unification/Updates.

FWIW, the core legacy parser functionality of extensions is not expected to stop working in MW 1.40 or beyond without advance warning and a reasonable period of deprecation. MW interface stability is something we are taking seriously in our planning, and I hope you can feel reassured about that.

As we progress with the Parsoid evolution, we will document how best to use Parsoid in your extensions, and I suggest you also keep an eye on the following wiki page: https://www.mediawiki.org/wiki/Parsoid/So_you_want_your_extension_to_work_with_Parsoid.

Again, thanks for reaching out with your concerns.

I know this topic is still open-ended and I'm looking for a solution to keep communication clear and transparent as the project advances; as of now, the best place is the following wiki page: https://www.mediawiki.org/wiki/Parsoid/Parser_Unification/Updates.

On this page, you write

There is also no strip-tag notion in Parsoid currently. Extensions seem to primarily make use of it to tunnel content through the parser without further processing. In Parsoid, all extension output (the DOM produced by one of the above methods) is always tunneled through the parser and expanded into the DOM before handing it off to additional processing that operates on the final DOM (including the DOM post processors that extensions might register for). So, extensions should not have to deal with this detail. As such, you will find all such methods absent in Parsoid's extension API.

I would like to point out that there is also the use case to insert some information into the page that is only known after the remainder of the page has been successfully parsed. #var_final would be an example of this.

Hi @MSantos,

Thank you very much for your update and statement regarding the future of the core legacy parser.

With that in mind I had a few questions I was hoping you or your team could clarify from the following statement:

"The core legacy parser functionality of extensions is not expected to stop working in MW 1.40 or beyond without advance warning and with a reasonable period of deprecation."

  • Given this statement, is the plan for Parsoid to eventually support variables, or at least provide the bare minimum interface so that a compatibility layer could be built for the relevant extensions? Or is the plan still to abandon variables altogether?
  • What is considered a "reasonable period of deprecation"? Is there a specific time frame in mind? For some wikis, I'm sure deprecation still means the end of any MediaWiki updates unless a compatibility layer is available.

Any clarification around these items would be greatly appreciated.

Thanks,
-stux

The Wikimedia Foundation does not support the Variables extension, and it is not used in WMF production. That said, there's been a patch for Variables attached to T203531 for six years now; I would encourage work to continue on that.

The MediaWiki deprecation policy is at https://www.mediawiki.org/wiki/Stable_interface_policy and https://www.mediawiki.org/wiki/Version_lifecycle. You can expect that 1.43, the next "long-term support" version of MW, to be branched in December 2024, will continue to have support for using the legacy parser, and that version of MW will be supported until 2027.

I can't make any guarantees about the long-term-support version after that (1.47, Dec 2026); I would hope that use of the legacy parser will be /at least/ deprecated by then.

In my personal opinion: content migration (a) takes time, and (b) is a fact of life, unless all development on MediaWiki were to freeze. Given that we've been talking about these changes for a decade now, I'd recommend starting whatever content migration efforts you feel are appropriate /now/ at an appropriate not-urgent pace. There are tools like pywikibot which ought to be able to automatically migrate content to newer syntaxes, as well as tools like tracking categories (in MW core) and the Linter extension to aid manual efforts. We participate annually in SMWcon and I personally am eager to further support any efforts to coordinate content migration efforts among the community.

My understanding of the problem is that parallelism is the main issue. That's something no patch can address, since it's a fundamental design difference between Parsoid and the legacy parser. Documents/templates are being used as state machines and therefore expect linear parsing. You can't set a variable to X and expect anything sensible if the read of that variable can happen before it's been set. That obviously won't work.

I'm not 100% sure how the extensions work, but I believe similar problems could occur with extensions like Cargo or Semantic, albeit as more of an edge case. Imagine retrieving a value "X" and displaying it, then later on the page saving "Y" over top of that value, then loading the value again and displaying it. You would expect "X" then "Y" to appear as the document is processed, but with parallel document processing, that would no longer be guaranteed. Again, no approach other than linear processing will work here, at least not that I can think of. Even a post-parser pass over the generated HTML wouldn't be guaranteed to work, as output could have been expected to change in response to what values were retrieved.

Given those issues, I suspect that either the legacy parser or a linear version of Parsoid will be the only approaches that will work...either that or abandoning not just these extensions but their entire approach to document generation. In my experience, that introduces a lot more manual and error-prone input into the process. Imagine, for example, a 10-iteration loop or having a changing variable controlling output in multiple parts of a template based on conditional logic. That all goes away with parallel processing and now you'll just have to type everything out by hand and hope you don't make too many mistakes. In some cases, switching to Lua would work, but that requires every single user of Variables, Loops, and Arrays to learn a programming language that's not tremendously intuitive. Increasingly, templates become accessible only to a limited few. On large wikis, that's not a tremendous issue, but on small wikis where there are already only a small handful of template programmers, you may end up limiting the templaters to an even smaller handful or none at all.

@RobinHood70 Yes, separating computation from the document is generally considered best practice; cf https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller. There are multiple ways that can be done, and as mentioned there are a number of different strategies that could be employed to yield either compatibility with existing markup or else aid in migration to new markup. I don't think it advances the conversation to say "something no patch can address" or "no approach other than linear processing will work"; that's an ultimatum not a discussion.

I apologize, it wasn't intended as an ultimatum, just a statement of my understanding of how things stand. If, indeed, Variables or other extensions can be made to work in some way that's compatible with parallelism, which I think is what you're suggesting when you say to continue working on the patch, I'd be very interested to hear what you have in mind. If my understanding of the direction the patch was taking is correct, moving the variable storage from Parser to ParserOutput would indeed be a more modern approach, but it still wouldn't solve the parallelism problem. How do you ensure that a variable is set before being read with Parsoid?

Just to chime in, I don't believe Cargo or Semantic MediaWiki will have any issues with the non-linear aspect of Parsoid. (SMW does have an issue with the "::" syntax, but that's a separate thing.)

Thanks! Do you happen to know how either of them handle possible out-of-order processing? If not, or if it would take us too far off-topic, no worries, it's just idle curiosity.

There isn't really the concept of an order of operations for either extension, as far as I can think.

There isn't really the concept of an order of operations

I think in Semantic MediaWiki the only feature that *might* be affected by the order of processing is the "Sequence map" feature (i.e.: the order the authors are recorded is the order they must be displayed in a query result), although this feature is not enabled by default.

[0] https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/4226
[1] https://www.semantic-mediawiki.org/wiki/Help:Sequence_map

Right, yes - Cargo has something similar too, where you can query values in the order in which they appear on the page, but it's already unreliable, even with linear parsing.

Okay, so using Cargo as an example, and forgive me if there are any syntax issues here because I only looked at the docs quickly, if you have the following on a page, presumably split up by other text:

{{#cargo_query:
_table=MyTable
|where=key=1
}} (returns value "Hello", let's say)

{{#cargo_store:
_table=MyTable
|key=1
|value=World
}}

{{#cargo_query:
_table=MyTable
|where=key=1
}} (returns value: ?)

...how do we know that the second query will return "World"? It could be parsed out of sequence from what comes above it if the parallel processing kicks in, so in theory, could it not return "Hello"? Or vice versa?

Right, yes - Cargo has something similar too, where you can query values in the order in which they appear on the page, but it's already unreliable, even with linear parsing.

It's pretty reliable currently, in the most-common circumstance of store-query, e.g.

cargo_store a value

<!-- more content -->

cargo_query that same value

You can add that to a page and save it once and all will be well. This is a relatively common pattern too, for example on a brand-new Item page (as in a video game item), you might want to store its data into the Items table & Recipes table, then query its entire crafting tree from the Recipes table; which would require that its own Recipes data is already inserted.

Out-of-order parsing would cause this to become unreliable because maybe the query happens before the store, and you have to do an extra purge cache after saving. This is particularly insidious because whether it requires an extra purge will be nondeterministic and editors might make mistakes.

Store-query-store is also a pattern that exists, e.g. insert a someone's data into the Athletes table then query RosterChanges to get their history of moves from one team to another, compute their list of start->end dates on each team, and store into Tenures. This would have the same issue, only exacerbated because now there are multiple layers of things that can happen in the wrong order; if you rng into 2nd store - query - 1st store twice in a row, it will take *three* saves to function properly.


Regardless, the biggest concern for me with Cargo is the randomization of fields designed to order the list via the Variables extension. Unless you advocate for the entire wiki page to be one single call to Lua (which goes very much against the ethos of wikis), there will be no way to ensure fields increment as you go "down" the page.

LOL, no, let's not put entire documents into Lua. That would be...bad. Thanks for the response!

Two orthogonal issues here:

  1. Current Parsoid calls back to the legacy Parser to handle various types of content. That can cause there to be multiple $parser objects used on a given page. Most *present* issues with "linear parsing" are due to the fact that, even though the parse is technically in order, data being attached to the Parser object is lost because there are multiple parsers in play.
  2. Incremental parsing can still be done in the presence of store-modify-write sorts of changes to shared parser state as long as the shared parser state can be fully serialized for use in a cache key (ie, only reuse this fragment if the incoming parser state is identical to what it was last time) and deserialized (ie, when reusing this fragment, reset the shared parser state to what it was *after* the fragment was originally parsed). (Or, simply bypass any caching or incremental parsing if read-modify-write is done.)

For compatibility with Cargo and Semantic MediaWiki, the other issue is that the fragments may have done database writes or updates. *In theory* if we're reusing the fragments those writes are already correct and the underlying database doesn't need to be updated, but *in practice* (as I understand it) Cargo begins by clearing the row corresponding to the existing page before the parse begins, so all of those templates either need to be re-evaluated or their database updates need to be otherwise redone, or else the data will be lost. I think one way of reframing this is: doing certain types of side-effecting operations from your extension tag or parser function will disable fragment reuse (at least) and might bypass caching as well.
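The fragment-reuse condition in point 2 might be sketched roughly like this (hypothetical, heavily simplified; the `Cache` type and `parseFragment` helper are stand-ins, not real MediaWiki APIs):

```php
// Hypothetical sketch: a fragment's cache entry records both the parser
// state it started from and the state it left behind, so a cached render
// is only reused when the incoming state matches exactly.
function reuseOrParse( Cache $cache, string $fragment, array $stateIn ): array {
	$key = hash( 'sha256', $fragment . serialize( $stateIn ) );
	$hit = $cache->get( $key );
	if ( $hit !== null ) {
		// Reuse the HTML and restore the outgoing shared state.
		return [ $hit['html'], $hit['stateOut'] ];
	}
	[ $html, $stateOut ] = parseFragment( $fragment, $stateIn );
	$cache->set( $key, [ 'html' => $html, 'stateOut' => $stateOut ] );
	return [ $html, $stateOut ];
}
```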

  1. Current Parsoid calls back to the legacy Parser to handle various types of content. That can cause there to be multiple $parser objects used on a given page. Most *present* issues with "linear parsing" are due to the fact that, even though the parse is technically in order, data being attached to the Parser object is lost because there are multiple parsers in play.

I saw that the DataAccess class would use the same legacy parser object for all extension tag calls when the PageConfig remains unchanged; basically, that means within the same page? Do you mean this is going to be changed soon, or did I miss some concept about the PageConfig, etc.?