Clarify the extension API and fix extensions OR the API to ensure correct behavior and eliminate workarounds
Open, MediumPublic
Actions

Assigned To

None

Authored By

	ssastry
	Aug 14 2019, 11:56 AM

Description

<dl data-parsoid='{}'><dd data-parsoid='{}'><p typeof="mw:Extension/math" about="#mwtXX" data-parsoid='{}' data-mw='{"name":"math","attrs":{},"body":{"extsrc":"1^2 + 2^2 + \\cdots + n^2 = {n(n + 1)(2n + 1) \\over 6}"}}'>^?'"`UNIQ--postMath-00000001-QINU`"'^?</p></dd></dl>

Found when I parsed enwiki:Aryabhata on scandium and diffed PHP & JS output

Related Objects

Mentioned Here: T127738: [Regression pre-wmf.15] Math node is appearing as UNIQ marker in Read mode

Event Timeline

ssastry created this task.Aug 14 2019, 11:56 AM

Restricted Application added subscribers: Liuxinyu970226, Aklapper. · View Herald TranscriptAug 14 2019, 11:56 AM

ssastry triaged this task as Medium priority.Aug 14 2019, 11:57 AM

ssastry changed the subtype of this task from "Task" to "Bug Report".

ssastry moved this task from Backlog to Bugs, Notices, Crashers on the Parsoid-PHP board.

I can reproduce this locally with the <math> extension output .. Even a simple page with <math>1</math> is sufficient to reproduce this.

Also, seems specific to the <math> extension locally since I tried the syntaxhighlight extension and it renders just fine.

ssastry renamed this task from Bug: <math> output broken (or is it all non-native extensions) to Bug: <math> output broken.Sep 3 2019, 8:49 PM

This is the Math extension being weird. It inserts its own special variety of strip marker, saves the details to a private static variable, and then replaces the markers in a ParserAfterTidy hook. That hook is not called from Parser::recursiveTagParseFully() despite that function being documented as returning "HTML that is safe for output".

@Physikerwelt The problem is, the parser uses unstripBoth() to remove markers for various good reasons, and if Math doesn't respond to those requests, the markers will leak through to the output. Pretty much every call to unstripBoth() in Parser.php is a bug causing exposure of these Math strip markers. For example == <math>1 \over 2</math> == generates the HTML

<h2><span id="'&quot;`UNIQ--postMath-00000001-QINU`&quot;'"></span><span class="mw-headline" id=".7F.27.22.60UNIQ--postMath-00000001-QINU.60.22.27.7F">

One of the problems was discovered in T127738, but the fix was not applied to MathML output.

We can do requests in batches, but the batches need to be resolved when the Parser says it needs them, not on ParserAfterTidy. It will be like link holder batches. That will mean smaller batch sizes, but at least it will work reliably for all modes.

My idea is to add a batch option to Parser::setHook(), and a third type of strip marker to StripState. When registering a batched tag, the extension can also specify a collection name, allowing tags of different names to be collected into the same batch. Parser::extensionSubstitution() will select the marker type based on the batch option of the tag. To StripState we will add addBatched() and unstripBatched(), and unstripBoth() will become a deprecated alias to unstripAll(). unstripBatched() will gather batch-type markers from the input string and call each extension with the resulting array. The return value from the extension will be an array mapping each marker to its expanded HTML, and unstripBatched() will be responsible for making a second pass over the input string, inserting these HTML fragments.

Four extensions call StripState::unstripGeneral():

SemanticMediaWiki has StripMarkerDecoder, for some reason implementing its own unstrip function, including recursion of unlimited depth.
BlueSpiceSocial subclasses Parser and has a copy of Parser::parse() with random bits of it commented out.
Parser Fun exposes the names of the marker types to the user via parser function parameters.
Variables constructs a new StripState and calls addGeneral() and unstripGeneral() on it.

I think in all cases these extensions would be best served by retaining the current meaning of addGeneral() and unstripGeneral() as acting only on non-batched general strip markers. There's not really any way to do fully backwards-compatible semantics.

Math will then register its tag as a batched tag, and so will be able to do batched requests to RESTBase in the callback. Ordinarily there will only be one batch per parse, so average performance will be similar, but for things like math tags inside headings, there will be one batch per heading.

In T230473#5463848, @tstarling wrote:

Four extensions call StripState::unstripGeneral():
[...]

Should these extensions' Phab projects be added as tags to this task?

@tstarling this is exactly the right way of thinking. However, I do not understand why the titles are a sperate batch? By all means, we should avoid a significant increase in the page loading time. For Math, I had another idea in mind to eliminate the strip markers entirely. If there was an earlier call to a preTaghook that must not produce any output one could precalculate values that later can be used in the tagHook. The preTagHook could be called even before the output page gets created.

ssastry renamed this task from Bug: <math> output broken to Bug: <math> output broken (when parsed with Parsoid/PHP).Sep 4 2019, 1:32 PM

In T230473#5464307, @Physikerwelt wrote:

@tstarling this is exactly the right way of thinking. However, I do not understand why the titles are a sperate batch?

By titles do you mean headings? It's just easier to deal with the corner cases in the simplest way.

By all means, we should avoid a significant increase in the page loading time. For Math, I had another idea in mind to eliminate the strip markers entirely. If there was an earlier call to a preTaghook that must not produce any output one could precalculate values that later can be used in the tagHook. The preTagHook could be called even before the output page gets created.

Do you have a specific idea of how that would work? Strip markers exist to protect the output of extensions from being treated as wikitext.

Change 539219 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Work around T230473: Call 'ParserAfterTidy' hook in parseWikitext

https://gerrit.wikimedia.org/r/539219

gerritbot added a project: Patch-For-Review.Sep 26 2019, 3:40 AM

Change 539219 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Work around T230473: Call 'ParserAfterTidy' hook in parseWikitext

https://gerrit.wikimedia.org/r/539219

Maintenance_bot removed a project: Patch-For-Review.Sep 26 2019, 5:10 AM

Subbu's workaround merged above seems valid, but let's keep this bug open to remind us to fix the underlying core parser API issue.

ssastry renamed this task from Bug: <math> output broken (when parsed with Parsoid/PHP) to Clarify the exension API and fix extensions OR the API to ensure correct behavior and eliminate workarounds.Oct 2 2019, 8:29 PM

ssastry edited projects, added MediaWiki-Parser, Math; removed Parsoid-PHP.

ssastry changed the subtype of this task from "Bug Report" to "Task".

In T230473#5467189, @tstarling wrote:

In T230473#5464307, @Physikerwelt wrote:

@tstarling this is exactly the right way of thinking. However, I do not understand why the titles are a sperate batch?

By titles do you mean headings? It's just easier to deal with the corner cases in the simplest way.

Yes, headings. If there is a page that uses math in headings more than once, this might get be really slow.

By all means, we should avoid a significant increase in the page loading time. For Math, I had another idea in mind to eliminate the strip markers entirely. If there was an earlier call to a preTaghook that must not produce any output one could precalculate values that later can be used in the tagHook. The preTagHook could be called even before the output page gets created.

Do you have a specific idea of how that would work? Strip markers exist to protect the output of extensions from being treated as wikitext.

For math, we do not need strip markers as such. We face the problem that math rendering is slow. To speed up this process we render all math tags in parallel. To do so we fill a static variable in the mathTagHook

self::$tags[$marker] = [ $renderer, $parser ];

(source)
which we replace in the end

foreach ( self::$tags as $key => $tag ) {
	$value = call_user_func_array( [ self::class, 'mathPostTagHook' ], $tag );
	// Workaround for https://phabricator.wikimedia.org/T103269
	$text = preg_replace( '/(<mw:editsection[^>]*>.*?)' . preg_quote( $key ) .
		'(.*?)<\/mw:editsection>/',
		'\1 $' . htmlspecialchars( $tag[0]->getTex() ) . '\2</mw:editsection>', $text );
	$text = str_replace( $key, $value, $text );
}

(source).

I am so unhappy about this solution that I can barely sleep.

This has two orthogonal problems.

I would like to get the content of tags earlier in the parsing phase. Therefore, a new, non-modifying tag pre-hook as described above would be handy. In this hook, one could fill the $tags variable. After the pre-tag hook is completed for all tags, we would send the batch request, which would update the $tags variable. I am not sure if that would require a second new hook, or if there is already a hook to do that. Thereafter, the normal tag hook is called.
I would prefer an alternative to the static $tags variable and store the tag information as a property of the associated parser object, e.g., $parser->storeCache($key,$content). Here the key can be the composite key consisting of tag content and tag attributes.

The core problem of why we still need 1 is that tag hooks are not processed asynchronously.

Aklapper renamed this task from Clarify the exension API and fix extensions OR the API to ensure correct behavior and eliminate workarounds to Clarify the extension API and fix extensions OR the API to ensure correct behavior and eliminate workarounds.Oct 4 2019, 1:06 PM

Clarify the extension API and fix extensions OR the API to ensure correct behavior and eliminate workaroundsOpen, MediumPublicActions

Description

Related Objects

Event Timeline

Clarify the extension API and fix extensions OR the API to ensure correct behavior and eliminate workarounds
Open, MediumPublic
Actions