Page MenuHomePhabricator

Replace use of deprecated hook InternalParseBeforeSanitize
Open, Needs TriagePublic

Description

Unit tests failure:

Use of InternalParseBeforeSanitize hook (used in VariablesHooks::onInternalParseBeforeSanitize) was deprecated in MediaWiki 1.35. [Called from MediaWiki\HookRunner\HookContainer::run in /workspace/src/includes/HookRunner/HookContainer.php at line 126]

Tests also failing for dependency extension: RegexFun and Arrays

Event Timeline

This can not be fixed by the extension without dropping features. See T236809 for further information.

Parsoid will probably be the doom for this extension.

Parsoid will probably be the doom for this extension.

It need not be necessarily. Most extensions will need some work to be compatible with Parsoid. In some cases, it might be become much harder to provide equivalent functionality since Parsoid does things differently. But, there may be scenarios where wikis might choose functionality over performance and in that mode, it may be possible to support features that wouldn't be possible otherwise.

https://www.mediawiki.org/wiki/Parsoid/Extension_API is where we are working out a Parsoid extension API that extensions will need to work with. It is still really early work in progress but we expect to have a first working version for extensions to play with in the coming month or two.

I took a quick look at the Variables extension and as far as I can tell, even though you might think (on the surface) that the "No_support_for_global_ordering" would mean that your extension won't work with Parsoid, it actually will. In a future where Parsoid starts doing more aggressive performance optimizations, using extensions like this might potentially disable those performance optimizations (unless we figure out appropriate solutions). But, let us cross that bridge when we get there.

As far as I can tell, Variables extension can implement #var_final in Parsoid by registering a DOM post-processing pass that inspects the DOM and updates the DOM appropriately. This maybe potentially be less performant in Parsoid as a result, but if we can provide options in Parsoid for extensions to add custom annotations to the DOM, you can use custom annotations to fetch just the DOM nodes that match them (which will likely be more performant than full DOM walks). So, at worst, this extension will be less performant in Parsoid but will still continue to function.

So, that just leaves us with the 'InternalParseBeforeSanitize' hook. Parsoid is highly unlikely to support this (because the parsing pipeline is different). So, the simplest solution would indeed be for Variables to ask for its content to be sanitized before insertion into Parsoid's final DOM. We can provide sanitization options in Parsoid's extension API so you don't have to explicit sanitize anything yourself. But for now, while Parsoid isn't the default yet, we could consider removing the deprecation on this hook but add clear guidelines that this hook will not be present when we switch over to Parsoid and extensions will have to implement their functionality differently. That said, @cscott and I will chat about this in our team meeting to see what the best approach is.

So TLDR is: I haven't seen anything in the extension that is fundamentally incompatible with Parsoid. But, any extension that relies on global ordered state can gets its work done by registering a DOM post processing pass that inspects the "final" DOM and update it appropriately (in your case, processing the #var_final uses).

Considering the discussion in T236809, I am against doing anything here. There is no suitable alternative, as InternalParseBeforeLinks will probably recieve the same treatment, and the deprecation should be removed soon.

Removing the deprecation will not 'fix' anything. I think perhaps T253768: No easy way to suppress hard-deprecation warnings for hooks would solve your problem, if you believe that #var_final is not worth reimplementing for Parsoid and you're content to just let it die once MediaWiki switches parsers. But deprecation is 100% the correct way to indicate to authors of new code that this hook should not be used because it will not be supported in the future.

There are other hooks that can be used to do roughly the same thing as #var_final, including ParserAfterTidy.

The thing is that i am in over my head here. I was able to pick up and modernize this extensions regarding best practices, extension registration, and do some basic bug fixes and so on. However, the nuances of the MediaWiki parser and Parsoid in particular escape me. (And the parser documentation is horrible.) And these are points with significant security relevance too – at the moment I am wondering if the migration from ParserBeforeTidy to ParserAfterTidy has not introduced various security holes, since the output of these extensions is apparently no longer tidied. I do not want to be reponsible for introducing serious security vulnerabilities into thousands of wikis.

Currently, Variables introduces a central store in the parser (probably, using ParserOutput would work as well) which can be freely read from and written to during a page parse. Hence, a change at a given point in the wikitext can affect arbitrary later points by changing the var values at these points. I would like to know how you think this can be made work with Parsoid considering the "No support for global ordering" heading on the Extension API page. If this feature can not recovered using Parsoid, all efforts to work around the current deprecations will be in vain anyway.

This behavior can actually be used quite effectively to cache constant, expensive template queries done may times within a single page. Using other extensions like Loops additionally, it can become even more potent.

"No support for global ordering" heading on the Extension API page

We should probably clarify that. That means that we don't guarantee that extension invocations will execute in the same order that they show up on the page. But, there is a global DOM pass that gives extensions a global ordered DOM. So, for example extensions can walk the DOM and get a global order.

However, if extensions introduce state that depends on the page parsed in linear order, yes, we cannot support that. So, this really is a difference between sequential parsing vs. out-of-order parsing (and in some cases, parsing could be skipped on segments as part of perf improvement strategies).

Which hooks are deprecated or undeprecated will not make a difference to that strategy. If this is a real blocker, then, we'll have to introduce some kind of page property OR some kind of extension config option that will force full sequential reparse for any page that has that flag or extension config set.

Legoktm added a subscriber: Legoktm.

I'm re-opening this based on discussion in T236809: Refactor Parser.php to allow alternate parser (Parsoid), where it seems that the declining of this task was based on a misunderstanding.

If this is a real blocker, then, we'll have to introduce some kind of page property OR some kind of extension config option that will force full sequential reparse for any page that has that flag or extension config set.

To me that sounds like the Parsing Team is open to modifying Parsoid to support your use case.

I would really appreciate clarification on this point, since right above it is stated "However, if extensions introduce state that depends on the page parsed in linear order, yes, we cannot support that.". At the time I read this, this lead me to the conclusion that this is more of a theoretical possibility which has no real chance to be implemented.

But rereading this again, this might actually be a misunderstanding.

Sorry about using imprecise language in my comments across phabricator tasks.

High-performance and other functionality that depends on independent / decoupled parsing of fragments of a page cannot, by definition, support linear in-order parsing (To clarify: that does not mean hat the final page output won't be in the right order. It will be.). But, if we provide an opt-out mechanism in Parsoid for that to force a linear-ordered full reparse of a page, then I suppose that does mean that Parsoid supports such pages (but disable certain features / functionality on the page as well). Once again, we haven't worked through the feasibility and complexity of providing such support (that is a separate technical investigation), but hopefully that answers your question. And, all things considered, once again so we aren't offering false hopes, we'll lean towards breaking incompatible functionality in the pursuit of higher-performance and future-proofing the technology. But, we will not go down the route of breaking things simply because we can. We have attempted to minimize impacts to the extent it is practical to do so.

I believe there are two issues that need solving:

  1. How to implement Variables as a Parsoid extension, and if new Parsoid functionality is needed for that
  2. What to do in the meantime for the legacy Parser, as the InternalParseBeforeSanitize hook is hard deprecated but still in use

I believe there are two issues that need solving:

  1. How to implement Variables as a Parsoid extension, and if new Parsoid functionality is needed for that

Yes. However, the fundamental question is not really related to the hook discussed here, since the core issue is modifying ParserOutput extension data in place during parsing. I have created T282499: Consider whether Parsoid will support forced linear parsing. to durther discuss this point.

  1. What to do in the meantime for the legacy Parser, as the InternalParseBeforeSanitize hook is hard deprecated but still in use

I still think undeprecating is not the worst idea, since "still in use by Semantic MediaWiki" was a sufficient point to not deprecate InternalParseBeforeLinks exhibiting precisely the same issues. Alternatively, silencing the warning would probably be fine as well.

var_final will be a special, independent problem from the first (and the problem this task is actually about).

While from past conversation with the Parsing-Team in this task, I am confident a solution in Parsoid exists for this case, there has not been an actual migration path until now. Which makes it hard to act on the deprecation warning in a sensible way…