Page MenuHomePhabricator

Wikipeg cache is unsafe when rule variables are set to null/undefined
Closed, ResolvedPublic

Description

This is a tricky one. Here's my most minimal test case, which is a cut down version of "Fuzz testing: Parser24" in parserTests.txt:

{{<u {{{{[[Sx-->}}

The output of the PHP and JS tokenizer differs -- the "correct" parse is:

<p>{{<u>}}</u></p>

That is, all the crazy stuff after <u and before the > is parsed as a very strange attribute name, which is later dropped. The {{ at the very start and the }} at the very end are treated as literal text, via the broken_template rule.

Turning off all caching (by hacking phpCacheRuleHook and jsCacheRuleHook to return '') results in the correct value on both PHP and JS. With caching enabled, JS still creates the correct result -- but that appears to be because it bypasses the cache unless visit count is 20 or greater, and our test case is short. PHP outputs an incorrect result (the <u> is parsed as a broken tag, so it comes out &lt;u), because its simpler cache implementation always checks the cache When you change maxVisitCount to -1 in tokenizer.js to make the JS port always check the cache, JS also emits this incorrect result..

I *believe* what's going on is that the cache is missing some aspect of the parser rule variables, so it's reusing a previous result which was cached with a different version of the variable state. @tstarling appears to have a pretty sophisticated mechanism for including rule variable state into the cache key, but somehow it's missing something.

(As an alternative to this mechanism, we could create a new cache when a parser rule value is updated, and then when the value is restored when the rule is exited we can restore the old cache. Reference variables cause a little havoc, but it's probably fine just to flush all the caches when they are updated since that doesn't happen very often (only triggered by broken templates). But let me figure out where exactly the bug is lurking, first...)

Revisions and Commits

Event Timeline

OK, pretty sure I found the bug. In the cache code, we save and restore reference values like so:

			start: [
				`$key = ${key};`,
				'$bucket = $this->currPos;',
				`$cached = $this->cache[$bucket][$key] ?? null;`,
				'if ($cached) {',
				'  $this->currPos = $cached[\'nextPos\'];',
				opts.loadRefs,
				'  return $cached[\'result\'];',
				'}',
				opts.saveRefs,
			].join('\n'),
			store: [
				`$cached = ['nextPos' => $this->currPos, 'result' => ${opts.result}];`,
				opts.storeRefs,
				`$this->cache[$bucket][$key] = $cached;`

Where opts.storeRefs typically looks something like this:

if ($saved_preproc !== $param_preproc) $cached['refs']["preproc"] = $param_preproc;
if ($saved_th !== $param_th) $cached['refs']["th"] = $param_th;

But this misses one crucial case: if the first time the rule is invoked the reference variable is (say) false, and then while parsing *it writes the variable to false again* (because there is a broken template, say), then the next time we try to reuse the cache the reference variable this time is something else ({{, say), we neglect to notice that a non-cached evaluation of this rule would have set the reference variable to false. And so the second time around we leave it as {{ instead of (properly) setting it to false.

Hacking the if (saved === current) ... test in opts.storeRefs to if (true) causes correct operation.

There is probably some slightly cleaner way to accomplish this fix, probably by explicitly tracking when reference variables are written.

EDIT: this was wrong, see next comment.

No, I'm an idiot. The above is close, but the original value of the variable is in fact stored in the cache key.

The problem (in PHP at least) is that we use 'isset' to check whether to restore the references, and that's false if the property is present but has the value NULL. So we never restore values which are set to NULL.

My original code set preproc to false on a broken template, but it got changed to null in one of the refactorings.

I could have sworn I got JS to exhibit the same bug, but JS distinguishes null from undefined and the cache logic in JS seems like it doesn't have this same bug. But I had reproduced this on JS... how did I manage that I wonder?

cscott renamed this task from Wikipeg cache is unsafe when rule variables are in use to Wikipeg cache is unsafe when rule variables are set to null/undefined.Apr 18 2019, 7:00 PM

Change 504943 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[wikipeg@master] Fix caching of reference rule variables

https://gerrit.wikimedia.org/r/504943

Oh, there is one buglet on JS, I wonder if that's how I managed to get this to reproduce:

$ node
> [null, undefined].join(':')
':'
> [null, undefined,32].join(':')
'::32'

That is, null and undefined are treated as interchangeable in the cache key. That's an issue in the PHP port as well:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.3-1 — cli) by Justin Hileman
>>> implode(':', [null, 32, '', false])
=> ":32::"

It would be safer to use json_encode() for PHP, probably, but JSON.stringify doesn't distinguish null and undefined on JS.

Change 504954 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Update default cache key used by wikipeg

https://gerrit.wikimedia.org/r/504954

ssastry triaged this task as Medium priority.Apr 22 2019, 12:03 PM

Change 504943 merged by jenkins-bot:
[wikipeg@master] Fix caching of reference rule variables

https://gerrit.wikimedia.org/r/504943

Change 504954 abandoned by C. Scott Ananian:
Update default cache key used by wikipeg

Reason:
Squashed with I0f1e368a9c93e59157369f7767085dffa4e6ddde

https://gerrit.wikimedia.org/r/504954

Change 505267 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Update wikipeg to 2.0.2

https://gerrit.wikimedia.org/r/505267

Change 505267 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Update wikipeg to 2.0.2

https://gerrit.wikimedia.org/r/505267