Wikipeg cache is unsafe when rule variables are set to null/undefined
Closed, ResolvedPublic
Actions

Assigned To

None

Authored By

	cscott
	Apr 18 2019, 3:56 PM

Description

This is a tricky one. Here's my most minimal test case, which is a cut down version of "Fuzz testing: Parser24" in parserTests.txt:

{{<u {{{{[[Sx-->}}

The output of the PHP and JS tokenizer differs -- the "correct" parse is:

<p>{{<u>}}</u></p>

That is, all the crazy stuff after <u and before the > is parsed as a very strange attribute name, which is later dropped. The {{ at the very start and the }} at the very end are treated as literal text, via the broken_template rule.

Turning off all caching (by hacking phpCacheRuleHook and jsCacheRuleHook to return '') results in the correct value on both PHP and JS. With caching enabled, JS still creates the correct result -- but that appears to be because it bypasses the cache unless visit count is 20 or greater, and our test case is short. PHP outputs an incorrect result (the <u> is parsed as a broken tag, so it comes out <u), because its simpler cache implementation always checks the cache When you change maxVisitCount to -1 in tokenizer.js to make the JS port always check the cache, JS also emits this incorrect result..

I *believe* what's going on is that the cache is missing some aspect of the parser rule variables, so it's reusing a previous result which was cached with a different version of the variable state. @tstarling appears to have a pretty sophisticated mechanism for including rule variable state into the cache key, but somehow it's missing something.

(As an alternative to this mechanism, we could create a new cache when a parser rule value is updated, and then when the value is restored when the rule is exited we can restore the old cache. Reference variables cause a little havoc, but it's probably fine just to flush all the caches when they are updated since that doesn't happen very often (only triggered by broken templates). But let me figure out where exactly the bug is lurking, first...)

Details

Subject	Repo	Branch	Lines +/-
Update wikipeg to 2.0.2	mediawiki/services/parsoid	master	+3 K -6 K
Update default cache key used by wikipeg	mediawiki/services/parsoid	master	+125 -112
Fix caching of reference rule variables	wikipeg	master	+8 -8

Customize query in gerrit

Revisions and Commits

rGPAR parsoid
	rGPARdb36503a9884 Update wikipeg to 2.0.2

Related Objects

Mentioned In: T219943: Create a composer library for wikipeg
T219938: Port HTML5Treebuilder
T219072: Extend JS/PHP hybrid testing to other Parsoid components
T205338: Extract and use a DOMHandler interface (API) for HTML -> wt handlers defined on DOM nodes
T113194: MediaWiki fails to parse  (en dash)
T106578: Sanitizer entity restrictions should match HTML5
Mentioned Here: rGPARc9dab9d47fde: Update wikipeg to 2.0.3 (PHP side)
T106578: Sanitizer entity restrictions should match HTML5
T113194: MediaWiki fails to parse  (en dash)
T205338: Extract and use a DOMHandler interface (API) for HTML -> wt handlers defined on DOM nodes
T219072: Extend JS/PHP hybrid testing to other Parsoid components
T219938: Port HTML5Treebuilder
T219943: Create a composer library for wikipeg

Event Timeline

cscott created this task.Apr 18 2019, 3:56 PM

Restricted Application added subscribers: Liuxinyu970226, Aklapper. · View Herald TranscriptApr 18 2019, 3:56 PM

cscott added a subscriber: Parsing-Team--ARCHIVED.Apr 18 2019, 3:56 PM

cscott updated the task description. (Show Details)Apr 18 2019, 4:29 PM

OK, pretty sure I found the bug. In the cache code, we save and restore reference values like so:

			start: [
				`$key = ${key};`,
				'$bucket = $this->currPos;',
				`$cached = $this->cache[$bucket][$key] ?? null;`,
				'if ($cached) {',
				'  $this->currPos = $cached[\'nextPos\'];',
				opts.loadRefs,
				'  return $cached[\'result\'];',
				'}',
				opts.saveRefs,
			].join('\n'),
			store: [
				`$cached = ['nextPos' => $this->currPos, 'result' => ${opts.result}];`,
				opts.storeRefs,
				`$this->cache[$bucket][$key] = $cached;`

Where opts.storeRefs typically looks something like this:

if ($saved_preproc !== $param_preproc) $cached['refs']["preproc"] = $param_preproc;
if ($saved_th !== $param_th) $cached['refs']["th"] = $param_th;

But this misses one crucial case: if the first time the rule is invoked the reference variable is (say) false, and then while parsing *it writes the variable to false again* (because there is a broken template, say), then the next time we try to reuse the cache the reference variable this time is something else ({{, say), we neglect to notice that a non-cached evaluation of this rule would have set the reference variable to false. And so the second time around we leave it as {{ instead of (properly) setting it to false.

~~Hacking the if (saved === current) ... test in opts.storeRefs to if (true) causes correct operation.~~

~~There is probably some slightly cleaner way to accomplish this fix, probably by explicitly tracking when reference variables are written.~~

EDIT: this was wrong, see next comment.

No, I'm an idiot. The above is close, but the original value of the variable is in fact stored in the cache key.

The problem (in PHP at least) is that we use 'isset' to check whether to restore the references, and that's false if the property is present but has the value NULL. So we never restore values which are set to NULL.

My original code set preproc to false on a broken template, but it got changed to null in one of the refactorings.

I could have sworn I got JS to exhibit the same bug, but JS distinguishes null from undefined and the cache logic in JS seems like it doesn't have this same bug. But I had reproduced this on JS... how did I manage that I wonder?

cscott renamed this task from Wikipeg cache is unsafe when rule variables are in use to Wikipeg cache is unsafe when rule variables are set to null/undefined.Apr 18 2019, 7:00 PM

Change 504943 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[wikipeg@master] Fix caching of reference rule variables

https://gerrit.wikimedia.org/r/504943

gerritbot added a project: Patch-For-Review.Apr 18 2019, 7:00 PM

Oh, there is one buglet on JS, I wonder if that's how I managed to get this to reproduce:

$ node
> [null, undefined].join(':')
':'
> [null, undefined,32].join(':')
'::32'

That is, null and undefined are treated as interchangeable in the cache key. That's an issue in the PHP port as well:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.3-1 — cli) by Justin Hileman
>>> implode(':', [null, 32, '', false])
=> ":32::"

It would be safer to use json_encode() for PHP, probably, but JSON.stringify doesn't distinguish null and undefined on JS.

Change 504954 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Update default cache key used by wikipeg

https://gerrit.wikimedia.org/r/504954

ssastry triaged this task as Medium priority.Apr 22 2019, 12:03 PM

Change 504943 merged by jenkins-bot:
[wikipeg@master] Fix caching of reference rule variables

https://gerrit.wikimedia.org/r/504943

Change 504954 abandoned by C. Scott Ananian:
Update default cache key used by wikipeg

Reason:
Squashed with I0f1e368a9c93e59157369f7767085dffa4e6ddde

https://gerrit.wikimedia.org/r/504954

Change 505267 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Update wikipeg to 2.0.2

https://gerrit.wikimedia.org/r/505267

Change 505267 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Update wikipeg to 2.0.2

https://gerrit.wikimedia.org/r/505267

jenkins-bot closed this task as Resolved by committing rGPARdb36503a9884: Update wikipeg to 2.0.2.Apr 25 2019, 4:19 PM

jenkins-bot added a commit: rGPARdb36503a9884: Update wikipeg to 2.0.2.

Mentioned in SAL (#wikimedia-operations) [2019-04-29T20:40:46Z] <arlolra> Updated Parsoid to c9dab9d (T106578, T113194, T205338, T219072, T219938, T221384, T219943)

Stashbot mentioned this in T219943: Create a composer library for wikipeg.Apr 29 2019, 8:40 PM

Aklapper removed a subscriber: Parsing-Team--ARCHIVED.May 24 2023, 1:09 PM

Maintenance_bot removed a project: Patch-For-Review.May 24 2023, 1:13 PM

Wikipeg cache is unsafe when rule variables are set to null/undefinedClosed, ResolvedPublicActions