Page MenuHomePhabricator

The CachingParser doesn't report syntax errors in skipped branches
Closed, ResolvedPublic

Description

The following code:

false & ( foo )

is invalid, and with the old parser enabled, the syntax check will complain about an unrecognized variable named "foo". That's because, when checking syntax, the old parser knows that it must not use short-circuit, and hence the expression between parentheses is always evaluated.

However, the CachingParser doesn't have such a check, and always uses short-circuit. Hence, the snippet above will pass the syntax check. This is obviously wrong. And the same goes for all other types of syntax errors, e.g.

false & ( 1 / 0 )

Event Timeline

As a side note: there's also a special case for syntax errors: trying to add fields to an array inside of a skipped block. This cannot be handled the same way as others: assignments in skipped branches are processed separately (in discardWithHoisting), and that method doesn't check whether the array exists. Thus,

false & (nonexistent[] := 123)

will pass both the syntax check and the actual parsing in CachingParser (but will fail both with the old parser)

Change 535651 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] CachingParser: ensure to catch errors inside short-circuited blocks

https://gerrit.wikimedia.org/r/535651

The fix for "normal" errors was easy. The one for unrecognisedvar (T232498#5479795) is not (I think), hence that part is still "broken".

As a side note, T231536 will reduce the amount of errors that we can catch from syntax check, and the tests added in the patch above will make it very easy to see what errors we'd stop reporting.

Yeah, I think with variables existing conditionally and with users being able to create their own variables, it is acceptable not to have perfect coverage here.

The best we could possibly do is to walk the AST, determine all possible variables that could be created, and collect all possible variables that AF might create for any action, and then determine if there are any references to variables not in that list used by the filter code.

However, that will not catch errors of "use before defined", and will not catch use of conditionally created variables outside the condition, and will not catch use of AF variables that only exist for certain actions. I think that's okay. We can detect that at run-time instead. I vaguely recall there being a mechanism to automatically disable broken filters. Is there documentation about this I could familiarise myself with? I suppose it would help filter admins if we could show them the errors that have happened at run-time. Maybe as part of the filter log where we dump the variables etc, we could store an error message of some kind even if it didn't match. Anyway, that's for later.

I'll leave it to you whether we want the basic unknown variable detection as part of syntax checks or not. Perhaps we can have a set of dummy actions we apply the filter to in order to shake out the most obvious errors. Eg. 1 dummy edit, 1 dummy account creation, 1 dummy page move. And verify that it is able to filter all three without errors (doesn't matter whether it matches). And if it produces an error in any of them, treat similar to a syntax error (refuse to safe or auto-disable).

Yeah, I think with variables existing conditionally and with users being able to create their own variables, it is acceptable not to have perfect coverage here.

Yes, well, this is also an edge case: see the comment I left in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/535235/8/includes/parser/AbuseFilterCachingParser.php@355.

The best we could possibly do is to walk the AST, determine all possible variables that could be created, and collect all possible variables that AF might create for any action, and then determine if there are any references to variables not in that list used by the filter code.

Yes. Which could be a little expensive, although it can be cached. The patch above does something similar (annotates every node with the custom variables assigned in any of its descendants), but that can be done on the fly.

However, that will not catch errors of "use before defined", and will not catch use of conditionally created variables outside the condition, and will not catch use of AF variables that only exist for certain actions.

Indeed, which is what we'd like to catch.

I think that's okay. We can detect that at run-time instead.

And also at syntaxCheck-time, hopefully, given that when checking syntax the short-circuit is disabled.

I vaguely recall there being a mechanism to automatically disable broken filters. Is there documentation about this I could familiarise myself with?

Unfortunately not. A similar mechanism exists to disable filters that are matching too many actions in a short period, but it's not for syntax errors. It's documented on mw.o.

I suppose it would help filter admins if we could show them the errors that have happened at run-time. Maybe as part of the filter log where we dump the variables etc, we could store an error message of some kind even if it didn't match. Anyway, that's for later.

There are T193064 and related patches on gerrit for this. The main problem is that we'd need another DB table, similar to abuse_filter_log to store this data. Moreover, if a backend change introduces new syntax errors, existing filters may start to fail at every execution, thus flooding the log. I still have to find a smarter design for it.

I'll leave it to you whether we want the basic unknown variable detection as part of syntax checks or not.

IMHO, for now we're fine with the patches on gerrit (r535235 and r535651). Further improvements should come together with T231536, but the plan is to do that after WMF production is switched to use the CachingParser.

Perhaps we can have a set of dummy actions we apply the filter to in order to shake out the most obvious errors. Eg. 1 dummy edit, 1 dummy account creation, 1 dummy page move. And verify that it is able to filter all three without errors (doesn't matter whether it matches). And if it produces an error in any of them, treat similar to a syntax error (refuse to safe or auto-disable).

Interesting idea. That'd be similar to T208842, except with pre-defined tests, and we don't care about the result. I'm gonna leave a note there.

Change 535651 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] CachingParser: ensure to catch errors inside short-circuited blocks

https://gerrit.wikimedia.org/r/535651