Page MenuHomePhabricator

Replace use of deprecated hook InternalParseBeforeSanitize
Open, Needs TriagePublic

Description

Unit tests failure:

Use of InternalParseBeforeSanitize hook (used in VariablesHooks::onInternalParseBeforeSanitize) was deprecated in MediaWiki 1.35. [Called from MediaWiki\HookRunner\HookContainer::run in /workspace/src/includes/HookRunner/HookContainer.php at line 126]

Tests also failing for dependency extension: RegexFun and Arrays

Event Timeline

This can not be fixed by the extension without dropping features. See T236809 for further information.

Parsoid will probably be the doom for this extension.

Parsoid will probably be the doom for this extension.

It need not be necessarily. Most extensions will need some work to be compatible with Parsoid. In some cases, it might be become much harder to provide equivalent functionality since Parsoid does things differently. But, there may be scenarios where wikis might choose functionality over performance and in that mode, it may be possible to support features that wouldn't be possible otherwise.

https://www.mediawiki.org/wiki/Parsoid/Extension_API is where we are working out a Parsoid extension API that extensions will need to work with. It is still really early work in progress but we expect to have a first working version for extensions to play with in the coming month or two.

I took a quick look at the Variables extension and as far as I can tell, even though you might think (on the surface) that the "No_support_for_global_ordering" would mean that your extension won't work with Parsoid, it actually will. In a future where Parsoid starts doing more aggressive performance optimizations, using extensions like this might potentially disable those performance optimizations (unless we figure out appropriate solutions). But, let us cross that bridge when we get there.

As far as I can tell, Variables extension can implement #var_final in Parsoid by registering a DOM post-processing pass that inspects the DOM and updates the DOM appropriately. This maybe potentially be less performant in Parsoid as a result, but if we can provide options in Parsoid for extensions to add custom annotations to the DOM, you can use custom annotations to fetch just the DOM nodes that match them (which will likely be more performant than full DOM walks). So, at worst, this extension will be less performant in Parsoid but will still continue to function.

So, that just leaves us with the 'InternalParseBeforeSanitize' hook. Parsoid is highly unlikely to support this (because the parsing pipeline is different). So, the simplest solution would indeed be for Variables to ask for its content to be sanitized before insertion into Parsoid's final DOM. We can provide sanitization options in Parsoid's extension API so you don't have to explicit sanitize anything yourself. But for now, while Parsoid isn't the default yet, we could consider removing the deprecation on this hook but add clear guidelines that this hook will not be present when we switch over to Parsoid and extensions will have to implement their functionality differently. That said, @cscott and I will chat about this in our team meeting to see what the best approach is.

So TLDR is: I haven't seen anything in the extension that is fundamentally incompatible with Parsoid. But, any extension that relies on global ordered state can gets its work done by registering a DOM post processing pass that inspects the "final" DOM and update it appropriately (in your case, processing the #var_final uses).

Considering the discussion in T236809, I am against doing anything here. There is no suitable alternative, as InternalParseBeforeLinks will probably recieve the same treatment, and the deprecation should be removed soon.

Removing the deprecation will not 'fix' anything. I think perhaps T253768: No easy way to suppress hard-deprecation warnings for hooks would solve your problem, if you believe that #var_final is not worth reimplementing for Parsoid and you're content to just let it die once MediaWiki switches parsers. But deprecation is 100% the correct way to indicate to authors of new code that this hook should not be used because it will not be supported in the future.

There are other hooks that can be used to do roughly the same thing as #var_final, including ParserAfterTidy.

The thing is that i am in over my head here. I was able to pick up and modernize this extensions regarding best practices, extension registration, and do some basic bug fixes and so on. However, the nuances of the MediaWiki parser and Parsoid in particular escape me. (And the parser documentation is horrible.) And these are points with significant security relevance too – at the moment I am wondering if the migration from ParserBeforeTidy to ParserAfterTidy has not introduced various security holes, since the output of these extensions is apparently no longer tidied. I do not want to be reponsible for introducing serious security vulnerabilities into thousands of wikis.

Currently, Variables introduces a central store in the parser (probably, using ParserOutput would work as well) which can be freely read from and written to during a page parse. Hence, a change at a given point in the wikitext can affect arbitrary later points by changing the var values at these points. I would like to know how you think this can be made work with Parsoid considering the "No support for global ordering" heading on the Extension API page. If this feature can not recovered using Parsoid, all efforts to work around the current deprecations will be in vain anyway.

This behavior can actually be used quite effectively to cache constant, expensive template queries done may times within a single page. Using other extensions like Loops additionally, it can become even more potent.

"No support for global ordering" heading on the Extension API page

We should probably clarify that. That means that we don't guarantee that extension invocations will execute in the same order that they show up on the page. But, there is a global DOM pass that gives extensions a global ordered DOM. So, for example extensions can walk the DOM and get a global order.

However, if extensions introduce state that depends on the page parsed in linear order, yes, we cannot support that. So, this really is a difference between sequential parsing vs. out-of-order parsing (and in some cases, parsing could be skipped on segments as part of perf improvement strategies).

Which hooks are deprecated or undeprecated will not make a difference to that strategy. If this is a real blocker, then, we'll have to introduce some kind of page property OR some kind of extension config option that will force full sequential reparse for any page that has that flag or extension config set.

Legoktm added a subscriber: Legoktm.

I'm re-opening this based on discussion in T236809: Refactor Parser.php to allow alternate parser (Parsoid), where it seems that the declining of this task was based on a misunderstanding.

If this is a real blocker, then, we'll have to introduce some kind of page property OR some kind of extension config option that will force full sequential reparse for any page that has that flag or extension config set.

To me that sounds like the Parsing Team is open to modifying Parsoid to support your use case.

I would really appreciate clarification on this point, since right above it is stated "However, if extensions introduce state that depends on the page parsed in linear order, yes, we cannot support that.". At the time I read this, this lead me to the conclusion that this is more of a theoretical possibility which has no real chance to be implemented.

But rereading this again, this might actually be a misunderstanding.

Sorry about using imprecise language in my comments across phabricator tasks.

High-performance and other functionality that depends on independent / decoupled parsing of fragments of a page cannot, by definition, support linear in-order parsing (To clarify: that does not mean hat the final page output won't be in the right order. It will be.). But, if we provide an opt-out mechanism in Parsoid for that to force a linear-ordered full reparse of a page, then I suppose that does mean that Parsoid supports such pages (but disable certain features / functionality on the page as well). Once again, we haven't worked through the feasibility and complexity of providing such support (that is a separate technical investigation), but hopefully that answers your question. And, all things considered, once again so we aren't offering false hopes, we'll lean towards breaking incompatible functionality in the pursuit of higher-performance and future-proofing the technology. But, we will not go down the route of breaking things simply because we can. We have attempted to minimize impacts to the extent it is practical to do so.

I believe there are two issues that need solving:

  1. How to implement Variables as a Parsoid extension, and if new Parsoid functionality is needed for that
  2. What to do in the meantime for the legacy Parser, as the InternalParseBeforeSanitize hook is hard deprecated but still in use

I believe there are two issues that need solving:

  1. How to implement Variables as a Parsoid extension, and if new Parsoid functionality is needed for that

Yes. However, the fundamental question is not really related to the hook discussed here, since the core issue is modifying ParserOutput extension data in place during parsing. I have created T282499: Consider whether Parsoid will support forced linear parsing. to durther discuss this point.

  1. What to do in the meantime for the legacy Parser, as the InternalParseBeforeSanitize hook is hard deprecated but still in use

I still think undeprecating is not the worst idea, since "still in use by Semantic MediaWiki" was a sufficient point to not deprecate InternalParseBeforeLinks exhibiting precisely the same issues. Alternatively, silencing the warning would probably be fine as well.

var_final will be a special, independent problem from the first (and the problem this task is actually about).

While from past conversation with the Parsing-Team in this task, I am confident a solution in Parsoid exists for this case, there has not been an actual migration path until now. Which makes it hard to act on the deprecation warning in a sensible way…

Change 721950 had a related patch set uploaded (by Universal Omega; author: Universal Omega):

[mediawiki/extensions/Variables@master] [Variables] Replace deprecated InternalParseBeforeSanitize hook

https://gerrit.wikimedia.org/r/721950

It was recommended in one of the live channels that users of Variables should share their use cases with maintainers here on this phabricator thread. That said, that's what I'm hear to do.

I operate and maintain a semantic mediawiki server at NASA for a multitude of enterprise knowledge management and operational process management purposes. Over the years we have developed a large ensemble of wiki-based tools based on collections of Forms, Templates, Categories, Properties that all work together for specific (and sometimes complicated) purposes .. these wiki-tools are every-day becoming more and more the back-bone of our internal business processes and the "Variables" extension is just one of a number of extension we are critically dependent on. That said, we are critically dependent on it. I suppose if we had to, we could move our template logic to Lua, but I would prefer not to because what we have works really well using the #vardefine, #vardefineechom and #var. I would never have guessed that this would be one of the extensions that the value of would be questioned. From out perspective, it's a must have. Please don't let this extension fall by the way-side. Thank you for reading.

Echoing @Revansx , here's my primary use case:

Nearly all of my code on Leaguepedia (https://lol.fandom.com) is in Lua already; however, I still make extensive use of the Variables extension. The primary use is as a singleton object to coordinate creating indices for primary/foreign keys for stored values in Cargo. For example, if I need a "current row" key to join 3 tables together, that will be of the form {{FULLPAGENAME}}_{{#var:counter}}, and that counter must be globally incrementing across separate module calls.

There is no way to get around this, even putting all of my individual row stores into a single Lua module, because I need individual h2 tags splitting up sections so that editors are able to edit only a single section at a time. Without these anchors, the process would be prohibitively overwhelming. Here's an example page: https://lol.fandom.com/wiki/Data:LCS/2021_Season/Spring_Season

I wrote a blog post about some of the challenges of dealing with these keys here: https://river.me/blog/primary-key-caching-adventures/ - it's also not always as simple as just incrementing a counter by 1, or even as close to that.

Other, less critical use cases:

  • Adding line numbers to Cargo queries - https://help.fandom.com/wiki/Extension:Cargo/customizing_tables#Special_case:_adding_line_numbers - this is something I do a lot when a query doesn't require an entire module, and losing access to this functionality would add a large burden to create individual modules for what should be extremely simple queries (though I suppose this could be functionality written into the Cargo extension)
  • Caching results of Cargo queries from the top of the page to the bottom - I could simply redo the query, sure, but that seems rather counterproductive when the goal is to improve performance (as an example, this module reuses the list of redirects to a page - https://lol.fandom.com/wiki/Module:PlayerPageEnd - (the list is deliberately done through Cargo and not another tool like DPL because I want to be using the right set of equivalences for special characters, capitalizations, etc) - in a lot of cases I could maybe reorganize page content, but I'd rather not in most
  • When debugging Lua modules, I often use variables to log to {{#var:log}} and then print the contents of the log on the page afterwards

I am curious: Where has this topic come up recently?

@MGChecker

I'm following up on @Revansx, who may have been replying here due to me bringing this up on a live channel the same day he posted based on reviewing the documentation for all of the extensions we use as I prepare for a major version upgrade to 1.36. We use Variables extensively on our five Guild Wars 1 and 2 wikis.

I do not do any of the editing, but I build and manage the wiki platform for our communities. To summarize what I was told by a few of our main editors, we absolutely depend on this extension "on basically every page we have that's not a one-liner" and that "we use this extension in a huge number of templates[, though] Var_final is much less frequently used – we mainly use it on the achievement overview pages." One editor commented that "I am not sure there will be a work around for us if it has to go."

In general, Ext:Variables is incredibly useful for optimizing page performance when using SMW/Cargo, since it allows query results to be cached and reused. This isn't something that comes up constantly, but it happens often enough to have a sitewide impact for us on Yugipedia.

T282499 is related and we'll probably get around to that within the next 6 months. But, we'll discuss this on the team and see what action, if any, we need to take in the interim.

Thank you all for moving the discussion forward. As a bit of 'power user' of Variables myself, in conjunction with other extensions (Parser Functions, Arrays, Semantic MediaWiki, etc.), I can certainly add to the list of use cases.

But ... I'm a little confused now about the scope of this task/discussion. InternalParseBeforeSanitize is relevant only to the #var_final parser function from Variables and a number of other extensions (I don't think SMW uses the hook any longer), but we've gone on to have a fruitful discussion about the future of the Variables extension in general in the light of development on Parsoid and page parsing in non-linear order. (And there's also T282499 and T236809. )

Is this the right place to continue?