Page MenuHomePhabricator

Rewrite the code for hoisting variables in skipped branches for CachingParser
Closed, ResolvedPublic

Description

In the old parser, this is done when skipping over braces: to do that, we were already checking the type of every token found inside the parentheses, so adding another check was not so bad performance-wise. But the new parser, when in short-circuit, just discards the node altogether, and thus it's potentially expensive to check every child recursively. Moreover, this is done at eval-time (and not while building the AST), which means that it cannot be cached, and thus it has to be repeated every time a filter is executed. A better solution would be to handle hoisting inside AFPTreeParser, e.g. by adding some annotations to the nodes. This way, when we have to discard a node, we can just check its "annotations" and see if any of its descendants is declaring a variable, that we can then hoist without doing anything further.

Event Timeline

Well, so this is a little harder than I thought. First of all, it's definitely easy to add code which would hoist all custom variables at the beginning of the filter. However, then we'd become way too permissive, making the unrecognisedvar exception completely useless. I don't think that's good. An alternative would be to keep variable names only inside parent nodes, but that would add some overhead to the tree parser.
So I'm starting to think that we should instead stop being permissive, and throw if undefined variables are used. Even if that's because the declaration was skipped by shortcircuit. The reason I'm not doing it now, though, is that the syntax check would currently pass for those cases. Moreover, users don't have (yet) any onwiki way to see if filters are failing. Maybe, at this point, we should wait for some profiling data with the new parser, and see if the hoisting code is really problematic. An easier alternative would be to keep a list of user-defined variables while parsing; then, while evaluating with shortcircuit enabled, dig through the nodes only if there are user-defined variables still undefined.

Change 532084 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Increase performance of CachingParser's discardWithHoisting

https://gerrit.wikimedia.org/r/532084

The patch above should be intended as a temporary fix. At this point, I believe that we should remove hoisting altogether (it was added recently, after all), BUT the syntax check should fail if a filter is trying to use a variable out of scope.

The patch above is actually dead: there's no way for it to handle situations where we add new elements to an existing array inside a skipped block. To sum up:

  • We cannot just remove hoisting and throw because users don't have any way to notice;
  • We cannot really keep the current code because I think that the hoisting part takes roughly 40-50% of the time it takes to check a filter;
  • We cannot hoist everything at the beginning;

A possible solution would be to:

  • Find a way to annotate the AST, so that each node has a list of variables defined in any of its descendant [1]. This should probably be done after creating the AST, but can be cached. Note that we'd still have to evaluate memory usage. This change would introduce a basic concept of scope in AbuseFilter. Another random idea: create a helper Scope class; there should be a single instance of this class for every existing scope, hence it could help saving memory (if all nodes in the same scope reference the same instance)
  • Then, we'd have two alternatives:
    1. Remove the hoisting part and, at syntax check time, warn if a variable is used out of scope. This would actually introduce scopes in AF.
    2. Keep the hoisting part and, in discardWithHoisting, read the scope of the node passed in, extract the variable names, and immediately set them without digging further.

Both options should be very quick (memory would be the only possible trouble). We could also start with 2, and move to 1 in the future - probably after the migration to CachingParser in prod.


[1] I spent a lot of time trying to draw a complex AST, but since I'm terrible when it comes to drawing, I just gave up. Here is a very simplified scheme:

                         Node(type:statement,vars:[my_var])
Node(type:something,vars:[])                        Node(type:sum,vars:[my_var])
                               Node(type:atom,value:1)                          Node(type:assignment,vars:[my_var])
                                                           Node(type:atom,value:'my_var')       Node(type:atom,value:'my_value')

Note that not all node types will have a vars prop (e.g. atom), but we may also set it to null/empty array. Alternatively, we could replace it with a scope property, holding a Scope class with an empty list of variables.

Change 535235 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Annotate the AST with var names before caching the AST

https://gerrit.wikimedia.org/r/535235

Change 532084 abandoned by Daimona Eaytoy:
Increase performance of CachingParser's discardWithHoisting

Reason:
In favour of I803cc58637d50eb90e57decf243f5ca78075d63d

https://gerrit.wikimedia.org/r/532084

Alright, the implementation looks good and should almost halve the runtime. Now, I believe it's actually worth it to do this before deploying the CachingParser.

Change 535235 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Annotate the AST with var names before caching the AST

https://gerrit.wikimedia.org/r/535235