Page MenuHomePhabricator

Short circuit fails with assignments
Closed, ResolvedPublic

Description

Similarly to T214642, there are several filters failing with message "Undefined variable xxx". Locally, I could get the following as the minimum failing syntax:

false & (var := 'a'); var != 'b'

Again, syntax check says everything is fine, but actual parsing fails due to short circuit. I played a bit with the parser and finally found the cause:
The first condition is of course false, and thus we enter short circuit. Then we get to "(". This would normally be handled by doLevelBraces, but when we get there mShortCircuit is true, and thus the parser calls skipOverBraces to ignore everything inside parentheses. Then the statement ends, and we get to evaluate var != 'b'. However, the variable declaration is inside brackets and was skipped, and thus the variable isn't defined.
For this case I don't really have a solution. Mostly, because the syntax shouldn't be considered valid by the syntax check: the correct way to write it would be

false & ( (var := 'a'); var != 'b' )

However, making that syntax invalid could possibly break many existing filters. Another solution could be to accept the syntax as valid and store variables inside skipOverBraces (maybe without actually computing them, and just storing the names), but that would add yet another bad syntax to AbuseFilter (and we already have T156096 and T212730) and could make skipOverBraces not so good in terms of performance save.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@Urbanecm Great :-) When would you be available? I don't know what time could be fine for me.

Anything after (including) Monday Morning SWAT.

Hm, right, today it's Friday. I'm afraid I won't be able to be there monday.

If you tell me what to do to test the change (if testable), I can do it for you.

If you tell me what to do to test the change (if testable), I can do it for you.

Mostly, this:

If the change has been tested locally there's not much to test; I believe it's enough to check logstash and make sure that we don't start getting new errors. It could take some time to see if the failures will actually disappear, so that's not something we can do during a SWAT window.

Ideally we should get a reduction of "AbuseFilter parser error" warnings (specifically, no more warnings about unrecognized variables like T227356), and I had created a specific search (here) to detect new AF-related errors.

Urbanecm added a project: User-Urbanecm.EditedJul 27 2019, 8:47 PM

Makes sense, so I'll deploy and watch for new fatals, and leave deep logstash analytics for you to do later. Adding my personal project to not forget then! (ftr, I don't think I need you during the window then)

Urbanecm moved this task from Backlog to To deploy on the User-Urbanecm board.Jul 27 2019, 8:47 PM

Change 525598 merged by Urbanecm:
[mediawiki/extensions/AbuseFilter@wmf/1.34.0-wmf.15] Initialize user-defined variables during shortcircuit

https://gerrit.wikimedia.org/r/525598

Mentioned in SAL (#wikimedia-operations) [2019-07-29T18:04:14Z] <urbanecm@deploy1001> Synchronized php-1.34.0-wmf.15/extensions/AbuseFilter: SWAT: [[:gerrit:525598|Initialize user-defined variables during shortcircuit]] (T214674) (duration: 00m 49s)

I still see these errors on logstash, and also T228677 I believe. I have to investigate further.

I still see these errors on logstash, and also T228677 I believe. I have to investigate further.

i'm sorry, i made a mistake when deploying this. The patch was never deployed, which is why nothing changed. will be fixed soon :)

Mentioned in SAL (#wikimedia-operations) [2019-07-29T22:16:50Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.15/extensions/AbuseFilter/: T214674 - 940955ea3844721a0 (duration: 00m 48s)

Mentioned in SAL (#wikimedia-operations) [2019-07-29T22:16:50Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.15/extensions/AbuseFilter/: T214674 - 940955ea3844721a0 (duration: 00m 48s)

I still see these errors on logstash, and also T228677 I believe. I have to investigate further.

i'm sorry, i made a mistake when deploying this. The patch was never deployed, which is why nothing changed. will be fixed soon :)

done!

Krinkle added a subscriber: Krinkle.

The patch for this task was backported to wmf.15 today, and caused more than 150,000 errors per minute to be logged. It has been immediately reverted, but given it was merged in master is thus likely to cause similar issues with wmf.16, which needs to be thought about and fixed accordingly.

Countmessage
12,432AbuseFilter parser error for filter global-102: Unexpected "T_BRACE" at character 442.
12,432AbuseFilter parser error for filter global-167: Expected a ) at character 242, not found (found $3 $4 instead).
12,430AbuseFilter parser error for filter global-131: Unexpected "T_BRACE" at character 207.
12,425AbuseFilter parser error for filter global-189: Unexpected "T_BRACE" at character 155.
12,423AbuseFilter parser error for filter global-141: Unexpected "T_BRACE" at character 353.
12,423AbuseFilter parser error for filter global-69: Unexpected "T_BRACE" at character 192.
12,422AbuseFilter parser error for filter global-100: Unexpected "T_BRACE" at character 800.
12,422AbuseFilter parser error for filter global-101: Unexpected "T_BRACE" at character 395.
12,421AbuseFilter parser error for filter global-144: Unexpected "T_BRACE" at character 864.
12,420AbuseFilter parser error for filter global-105: Expected a ) at character 627, not found (found $3 $4 instead).

Mentioned in SAL (#wikimedia-operations) [2019-07-29T22:32:37Z] <krinkle@deploy1001> Synchronized php-1.34.0-wmf.15/extensions/AbuseFilter/: T214674 - bfcaf0c26d6 (duration: 00m 48s)

Urbanecm triaged this task as Unbreak Now! priority.EditedJul 29 2019, 10:33 PM

this became a train blocker, and train blockers are UBN by definition =>setting the priority to UBN.

Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptJul 29 2019, 10:33 PM

Ouch, this is really bad. Not only we started getting all of those errors, but somehow filters started to match randomly. On itwiki, filter 496 (which has block enabled), caused a carnage of admins and other normal accounts. And probably, the same happened on other wikis. For the moment, I'm self-merging a revert of the change that caused this havoc (on master); tomorrow I'll investigate.
I apologize for the incident.

Change 526276 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Revert "Initialize user-defined variables during shortcircuit"

https://gerrit.wikimedia.org/r/526276

Daimona lowered the priority of this task from Unbreak Now! to High.Jul 29 2019, 11:22 PM

No more UBN with the revert on master, but still we should get this fixed soon.

Change 526276 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Revert "Initialize user-defined variables during shortcircuit"

https://gerrit.wikimedia.org/r/526276

A couple of random remarks:

  1. In order to fix T228677, it should be enough to keep lines 128-9 from the original change
  2. The syntax errors were probably caused by the other part (the change to skipOverBraces), which didn't really work (that $this->move inside the if branch is probably the bad part).
  3. I still don't know why filter 496 started to match all of those edits, but it's vital to understand it

ATM it's really late and I'm here by coincidence, so I'll investigate tomorrow. We'll also need some parser tests to ensure this won't happen again.

First of all, I cannot test anything locally, so I'm afraid it won't be easy to submit a patch to fix this issue. However, I understood 3. We have, I believe, at least 4 bugs playing their roles.

The new AFPData::DNONE is only supposed to appear in two situations: 1-when a built-in variable is unknown, and that only happens when using Special:AbuseFilter/tools and the like. This is the only case right now for master and WMF wikis. 2-In short circuit, where we find a variable declaration. And this can also happen while the filter is running. This case is handled by the patch that was reverted, and it can lead to troubles. For instance, take the following:

thing := "foo";
get_matches( thing, added_lines )[0] != false

or even just

add := 'foo';
add != false

when inside a short-circuited pair of braces.

Both will return true. In the second example, add is DNONE, and DNONE != false is true (and this requires the part of the reverted patch inside skipOverBraces). In the first one, thing is DNONE, so calling get_matches on it will return another DNONE, and its 0 element (which is only accessible with the other part of the patch, and currently fails in production - T228677) is yet another DNONE, and again DNONE != false is true.
(Random note: this test should also return false)
If a filter (like itwiki's 496) contains a bit of code like that, that part will match, and the whole filter could match too, if the other conditions are satisfied.
Lastly, the skipOverBraces part is probably broken like I said yesterday, so it will get to the end of the function at a non-brace token, thus throwing the exception.

Recapping, the involved bugs are:

  1. We cannot access offsets of DNONE (fixed by the non-skipOverBraces part of the patch)
  2. Some user-defined variables are not initialized in shortcircuit (the bug this task is about, which curiously doesn't play any role here, and it's probably fixed by the reverted patch)
  3. The addition to skipOverBraces was broken
  4. Comparing DNONE != false and similar return true, which can cause filters to match when they shoudn't

Points 1. and 2. are already handled by the reverted patch as-is. I'll probably be able to fix 3. easily as soon as I can test it locally. 4. is the only "hard" one. I'm unsure about what we should do in that case. First thing could be to make any comparison involving DNONE return false (although I thought we were already doing this). And then, understand what to do when we find DNONE in other situations, e.g. a logic operator. And AFPData::castTypes should handle DNONE as well.

Probably, I won't be able to work on this for the next 2 weeks, but further input about 4. or anything else is highly appreciated.

Thanks @Daimona. I don't understand yet why short-circuiting during filter execution involves any kind of DNONE.

The below questions are asked by me without understanding the internals, so maybe the answer is "This is not how AbuseFilter works", in which case, please tell me :)

  • Why does thing := "foo"; create thing as DNONE? I would expect it to create a string?
  • When an early condition in an AbuseFilter rule returns, and we are short-circuiting the rest of the rule - Why does it need to "have" an answer for the skipped code? When you execute foo = 12; ( foo === 5 || foo === 12 || foo[1] == 7 ) in most programming languages, it will return true from the second condition, the last part will not execute. There is no "fake" answer needed. Why do we need one?

Thanks @Daimona. I don't understand yet why short-circuiting during filter execution involves any kind of DNONE.

This will only happen with the reverted patch, which will put DNONE inside used-defined variables, as explained below.

  • Why does thing := "foo"; create thing as DNONE? I would expect it to create a string?

This would only happen if thing := "foo"; is inside of a pair of braces that we're skipping due to short-circuit. The idea being: short-circuit is meant to speed-up things, so instead of computing the value of the variable (which could be something more complicated than a literal) we just initialize the var as DNONE. We could indeed make the parser analyze the rhs and properly initialize the variable (and consequently avoid this bug altogether), my only concern is that we'd go against the short-circuit purpose.

  • When an early condition in an AbuseFilter rule returns, and we are short-circuiting the rest of the rule - Why does it need to "have" an answer for the skipped code? When you execute foo = 12; ( foo === 5 || foo === 12 || foo[1] == 7 ) in most programming languages, it will return true from the second condition, the last part will not execute. There is no "fake" answer needed. Why do we need one?

We don't really "need an answer". First of all, let me rewrite the example as foo = 12; ( foo == 12 || ( something here ) ). We can skip the pair of braces wrapping something here, but to do that, we have to parse the code inside to match any additional pair of braces inside the outer ones. And this also leads to the problem this task was about. Using the original example, false & (var := 'a'); var != 'b', if we just skip the braces altogether, we'll have an undeclared variables later. That's why we initialize var as DNONE, in the skipOverBraces part of the patch.

Krinkle added a comment.EditedJul 30 2019, 2:07 PM

Thanks, this is starting to make sense to me now :)

  1. The reason we do look inside the parenthetical, is because we need to know how much to ignore, syntax-wise. This is a fundamental issue due to the parser not having separation between tokenising and execution (unlike the new Cached parser). Do I understand that correctly? If we did tokenising separately, we would be able to skip the relevant branches without having any form of representation inside our execution model for things we haven't reached.
  2. I agree that analysing the RHS in-depth would be undesirable. Not only would it be inefficient, it would also cause other bugs I think. It seems unfair to expect a program to be able to do valid operations in branches that are meant to be skipped. E.g. we would not even want to initialise thing for real, because its value and ability to access parts of it might be dependent on other things that have happened. Maybe this can work in AbuseFilter today without issue because the language is simple enough, but as a generic thing it seems hard to be sure. Would be better to avoid indeed.
  3. Regarding var := inside a skipped branch - this sounds like "variable hoisting". This is where the declaration of a variable is implicitly moved to the start (e.g line 0) of the program so that all needed variables "exist", with only their value being assigned at run-time in the relevant code branches. The initial value would be something like "null" or a local equivalent. That seems like a good choice for our case of hoisted variables. NULL has much better defined behaviour here than DNONE would have.

If we use NULL for the hoisted/skipped variables, does that mean we avoided DNONE entirely during execution, or are the other cases?

Are there other problems or errors you know of that we might encounter even if we use NULL? For example, in the old/non-cached parser, we process the skipped (var := 'a'); branch as var := null, what does this mean for any other code after it? If there are other skipped branches that would normally inspect the variable, we need to make sure those are also gracefully handled in a way that doesn't cause undesirable errors and also doesn't change the outcome of the filter. I don't think assuming true or false can be a solution because both of those can have different meanings in different contexts.

  1. The reason we do look inside the parenthetical, is [...] This is a fundamental issue due to the parser not having separation between tokenising and execution [...]

Yes, mostly correct, I believe. Tokenization already happens before parsing. The issue here is that we have an ordered list of tokens, and not an AST.

  1. I agree that analysing the RHS in-depth would be undesirable. [...]

Agreed.

  1. Regarding var := inside a skipped branch - this sounds like "variable hoisting". [...] The initial value would be something like "null" or a local equivalent. That seems like a good choice for our case of hoisted variables. NULL has much better defined behaviour here than DNONE would have.

Yeah, exactly. I believe DNONE would be more appropriate because a variable is empty when hoisted, but we can use DNULL for now, and leave DNONE as a @todo for the future.

If we use NULL for the hoisted/skipped variables, does that mean we avoided DNONE entirely during execution, or are the other cases?

I believe this is the only case. There are some places in the code where we create a DNONE object and then pass it by reference around the parser, but we should always end up with a non-DNONE object in the end (unless there's a syntax error, maybe, in which case we don't care anyway).

Are there other problems or errors you know of that we might encounter even if we use NULL? [...]

Anything in the rest of the filter using the skipped variable will only see it's NULL, without knowing that it was skipped. So, if there's myvar === null, that will match. That's the main reason why I tried to initialize the variable as DNONE. But even then, I agree that assuming false or true cannot ultimately solve the issue. Although TBH, DNONE and DNULL are the same here. The only difference is that DNONE === null will be false, but lots of other comparisons (e.g. myvar != false, like for the itwiki filter) will return true all the same.

Oh, a minor correction to what I said above: there are other cases where we can have DNONE during execution. That's when the newly created object passed around is not populated. But those cases should be banned anyway, and are the ones described in T156096 - basically, empty conditions, operands, or function args like for T153251.

I kept thinking about this issue, so a few other thoughts:

  1. The new CachingParser wouldn't resolve the hoisting thing. We'll only be able to skip stuff more easily, but we'd still need to initialize the variables inside.
  2. My explanation above doesn't fully explain itwiki's problem: the variable declaration AND the var != false bit are both inside a short-circuited pair of braces, so it shouldn't be evaluated at all. The problem remains for variables declared inside skipped branches, but used from outside.
  3. A solution for DNONE could be the following: make all operators, functions etc. receiving a DNONE return another DNONE; if, after evaluating a filter, we end up with a DNONE result, then count it as false (in doLevelAtom I believe). This way, "intermediate" conditions won't return true or false, but just a DNONE, and the final result will be the filter not matching due to "incomplete data". However, I don't know if this can be implemented without refactoring too much.

Yeah, exactly. I believe DNONE would be more appropriate because a variable is empty when hoisted, but we can use DNULL for now, and leave DNONE as a todo for the future.

Looking at JS, PHP and Python, I'm not aware of a concept like "DNONE" in them. They all have "real" values as the implicit default for a variable. Usually something like null.

I think using a magic bottom value could make sense, but then we should make it illegal to refer to a variable holding such a value. E.g. the hoisted variable would be "untouchable" until it is assigned. Attempts to access it in any way during execution when no value was assigned yet, would then result in an error.

But, if I understand correctly, AF currently doesn't really skip execution of skipped statements and branches. Which means this probably isn't an option (yet).

Yeah, exactly. I believe DNONE would be more appropriate because a variable is empty when hoisted, but we can use DNULL for now, and leave DNONE as a todo for the future.

Looking at JS, PHP and Python, I'm not aware of a concept like "DNONE" in them. They all have "real" values as the implicit default for a variable. Usually something like null.

That's true. Note that DNONE is like a flavour of null.

I think using a magic bottom value could make sense, but then we should make it illegal to refer to a variable holding such a value. E.g. the hoisted variable would be "untouchable" until it is assigned. Attempts to access it in any way during execution when no value was assigned yet, would then result in an error.

I don't think so. Mainly because AF doesn't have explicit hoisting, and adding it would be a breaking syntax change requiring lots of work to fix existing filters. Since a variable can only be empty (=DNONE) if it's skipped in a short-circuited branch, and this is something that users cannot usually control, I don't think we should emit an error.
For the same reason, I don't think we can use a straight null as default value.

But, if I understand correctly, AF currently doesn't really skip execution of skipped statements and branches. Which means this probably isn't an option (yet).

It does. With the reverted patch, it will only process assignments when skipping, but everything else is skipped. The true fact is that "skipping" here means checking every token and determine whether it's a rbrace or not (without doing anything further), instead of going straight to the terminating brace. Note that everything here is about skipping code inside parentheses. If you have, say, false || myFunc(args) contains myOtherFunc(otherArgs), something similar happens, but it's handled in different places.

Daimona added a comment.EditedAug 1 2019, 9:10 PM

A reminder for myself:
In skipOverBraces, the elseif branch should be smth like

// T214674, define variables
$varname = $this->mCur->value;
$next = $this->getNextToken();
if ( $next->type === AFPToken::TOP && $next->value === ':=' ) {
	$this->setUserVariable( $varname, new AFPData( AFPData::DNONE ) );
} elseif ( $next->type === AFPToken::TSQUAREBRACKET && $next->value === '[' ) {
	if ( !$this->mVariables->varIsSet( $varname ) ) {
		throw new AFPUserVisibleException( 'unrecognisedvar',
			$next->pos,
			[ $varname ]
		);
	}
	$this->setUserVariable( $varname, new AFPData( AFPData::DNONE ) );
}

Which uses getNextToken instead of going to the next token: I believe that, if $next is a rbrace (e.g. ... contains myVar)), $braces is not updated and thus we get a syntax error like it happened when the change went live.
I cannot test it, but I believe this will fix point 3. of T214674#5376121.


Thus, we're only left with point 4., i.e. how to handle comparisons etc. involving DNONE. Point 3. of T214674#5384895 is the best idea that came to mind, @Krinkle what do you think about that?
I still don't know how easily we can implement it, but that shouldn't be an obstacle.


Plus I'd like to add (at least) the following unit tests:

  • Something like the itwiki filter
  • A couple of examples from filters that started throwing syntax errors (these should be cases where a variable inside a skipped branch is followed by a rbrace)
  • Make sure that the test added in the reverted patch results in a NOT MATCH
  • Add a ton of other comparisons to the DNONE test, and also array access

EDIT: Actually, I belive we should handle DNONE at the logic operator level, instead of at the end. So, we first propagate DNONE until we get to a logic operator, and at that point we turn DNONE into a boolean false (or null). So some other tests to add:

  • A filter containing only a DNONE without logic operators returns false (several test cases)
  • Various combinations of DNONE and logic operators, e.g. DNONE || true should match, DNONE && true shouldn't.

Change 527520 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] [WIP] Better handling of DNONE

https://gerrit.wikimedia.org/r/527520

I'm not really up to speed on what's going on, but:

A couple of random remarks:

  1. In order to fix T228677, it should be enough to keep lines 128-9 from the original change

This would be awesome. We haven't been able to update https://en.wikipedia.org/wiki/Special:AbuseFilter/944 and we really need to :/

I'm not really up to speed on what's going on

Well, we have 4 different bugs involved, mostly about non-initialized data, but I don't know how to put it simply.

A couple of random remarks:

  1. In order to fix T228677, it should be enough to keep lines 128-9 from the original change

This would be awesome.

Unfortunately it's not so simple, see T228677#5386151. The good news is that the patch above is now ready for review and will fix all of these bugs. As a side note, handling of DNONE was already good. We only missed a couple of spots where it should be propagated (namely, keywords and array assignments). The actual bug was with comparisons: we returned false for DNONE too late, at AFPData::equals, instead of at AFPData::compareOp, which is where the type of the return value is decided.

We haven't been able to update https://en.wikipedia.org/wiki/Special:AbuseFilter/944 and we really need to :/

I understand, on itwiki the "faulty" filter was disabled and now I cannot re-enable it, which means that we have a super-annoying LTA running free. TBH, there could be some trick that would allow you to save the filter by slightly changing the syntax, but I didn't try to find it, and focused on solving the bug instead. It'd be great if we could get it merged before next week's train.

Change 527520 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Better handling of DNONE

https://gerrit.wikimedia.org/r/527520

Daimona moved this task from Under review to Done on the User-Daimona board.

Leaving open until production deployment, just to be sure.

I'll SWAT deploy it on Monday.

@Urbanecm Great, thanks. I probably won't be there to oversee the process, but it's the same as last time. If you notice an increase in notices with the "AbuseFilter parser error" message, or the dedicated AF search starts showing new errors, please revert.

Change 528108 had a related patch set uploaded (by Urbanecm; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@wmf/1.34-wmf.16] Better handling of DNONE

https://gerrit.wikimedia.org/r/528108

Change 528108 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@wmf/1.34-wmf.16] Better handling of DNONE

https://gerrit.wikimedia.org/r/528108

Change 528111 had a related patch set uploaded (by Urbanecm; owner: Urbanecm):
[mediawiki/extensions/AbuseFilter@wmf/1.34-wmf.16] Revert "Better handling of DNONE"

https://gerrit.wikimedia.org/r/528111

Change 528111 merged by Urbanecm:
[mediawiki/extensions/AbuseFilter@wmf/1.34-wmf.16] Revert "Better handling of DNONE"

https://gerrit.wikimedia.org/r/528111

Change 528211 had a related patch set uploaded (by Urbanecm; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@wmf/1.34.0-wmf.16] Better handling of DNONE

https://gerrit.wikimedia.org/r/528211

Change 528211 merged by Urbanecm:
[mediawiki/extensions/AbuseFilter@wmf/1.34.0-wmf.16] Better handling of DNONE

https://gerrit.wikimedia.org/r/528211

Mentioned in SAL (#wikimedia-operations) [2019-08-05T18:32:22Z] <urbanecm@deploy1001> Synchronized php-1.34.0-wmf.16/extensions/AbuseFilter/: SWAT: 936a462: Better handling of DNONE (T214674, T228677) (duration: 00m 47s)

Change 528212 had a related patch set uploaded (by Urbanecm; owner: Urbanecm):
[mediawiki/extensions/AbuseFilter@wmf/1.34.0-wmf.16] Revert "Better handling of DNONE"

https://gerrit.wikimedia.org/r/528212

Change 528212 merged by Urbanecm:
[mediawiki/extensions/AbuseFilter@wmf/1.34.0-wmf.16] Revert "Better handling of DNONE"

https://gerrit.wikimedia.org/r/528212

Mentioned in SAL (#wikimedia-operations) [2019-08-05T18:39:44Z] <urbanecm@deploy1001> Synchronized php-1.34.0-wmf.16/extensions/AbuseFilter/: SWAT: d358f17: Revert "Better handling of DNONE" (T214674, T228677) (duration: 00m 47s)

Urbanecm raised the priority of this task from High to Unbreak Now!.Aug 5 2019, 6:46 PM

@Daimona: Tried today Morning SWAT, and we didn't get "a reduction of "AbuseFilter parser error" warnings". I've reverted the backport out of an abundance of caution. See the screenshot for logstash stats for channel:AbuseFilter.

I've also spotted few new parse errors, such as 2019-08-05T18:37:02 AbuseFilter en.wikivoyage.org AbuseFilter parser error for filter 19: Invalid IP range "117.120.18.0 - 117.120.18.255" provided at character 89. or 2019-08-05T18:36:33 AbuseFilter ar.wikipedia.org Edit filter 47 on arwiki is taking longer than expected (that's not a parser error, but also new (to me) and seems to be roughly related to the deployment actions, see below:

Not fully sure both are patch-caused, but since I didn't find your decrease conditions met, I've reverted the patch to be sure. Could you look into it, @Daimona? Also bumping back to UBN, since the patch is still in master, and the train is approaching.

Oh, geez... The invalid range and "longer than expected" should be unrelated - at most, filters could be taking longer if the patch introduced a bottleneck, but I don't know.
The real problem is e.g. for arwiki filter 44, that error is definitely caused by my patch.
Thanks for reverting, although it's not as bad as last week, it's still not good.
I'll try to see if I can fix this tomorrow, before the branch is cut. A revert will be non-easy due to subsequent patches. And first of all we need to ensure that the problem still happens with the master.

brennen added a subscriber: brennen.Aug 5 2019, 7:42 PM

OK, I'm on it.
First of all: everything below is based on the assumption that the patch remained in production from 18:32 to 18:39 UTC, basing on stashbot above. If the times aren't right, then my conclusions may be wrong.

That said, some notes:

  • The arwiki error I mentioned was already happening without my patch (logstash) - it's probably still related to DNONE, but that can be fixed separately
  • Some "unrecognized variable" errors will still remain after my patch, when the variable is declared using the set function instead of the := operator. That's fine for now, I'll handle that before closing this bug.
  • When the patch was live on WMF wikis, the parser error rate stayed constant, with no increase or decrease, just as if the patch wasn't deployed. So I'll have to check whether it actually fixes the issue.
  • I checked the spike for slow filters; while there's indeed a not-so-critical spike, this is not uncommon on that dashboard, see for instance the last 24 hours. There's an even bigger spike at 22:00 UTC, and I'm not aware of any AF-related deployment at that time. Moreover, the actual spike was around 18:33 UTC and lasted no more than a minute; if the patch really stayed live until 18:39, this may well be a coincidence. Some of the filters reported as slow during that period also appear as slow in other times. So, it could really be a coincidence. We'd need to try deploying again and see what happens to confirm.

Now I'm going to investigate a bit more locally, and see how things go with master. There are good chances that this is not a train blocker, but please give me some more time to check. Also note that a temporary workaround could be to change the values of the empty constants in AFPData to be identical to DNULL.

Hmmmm I tested locally. Without my patch, I do get some "unrecognized variable" errors, but they seem to disappear when I apply r527520. @Urbanecm Could you please confirm that the times above are correct, and that the patch was indeed deployed? And that it's not the same as T214674#5374643? Thanks.

Hmmmm I tested locally. Without my patch, I do get some "unrecognized variable" errors, but they seem to disappear when I apply r527520. @Urbanecm Could you please confirm that the times above are correct, and that the patch was indeed deployed? And that it's not the same as T214674#5374643? Thanks.

I confirm StashBot-provided times are correct. Note the times are when scap finished, and I started it ~minute ago (exact duration is in stashbot's comment). I also confirm T214674#5374643 did not happen this time, and that the patch really went live.

If you think it's a coincidence, I can believe that :). Ping me during any SWAT window, and I'll be happy to try once more, this time with you supervising Logstash.

Mentioned in SAL (#wikimedia-operations) [2019-08-06T11:21:38Z] <urbanecm@deploy1001> Synchronized php-1.34.0-wmf.16/extensions/AbuseFilter/: SWAT: 8cc96db: Better handling of DNONE (T214674, T228677) (duration: 00m 48s)

Re-deployment of the above patch seems to be fine now. As the patch is deployed, this is no longer a train blocker.

Krinkle lowered the priority of this task from Unbreak Now! to High.Aug 6 2019, 11:34 AM
Daimona added a comment.EditedAug 6 2019, 11:36 AM

And also not UBN. A few other remarks:

  • While deploying the patch, we got another spike for slow filters. This only seem to happen for about a minute, and could have something to do with php caching or the like. AFAIK, this could also happen for every deployment and we just never noticed it.
  • The only thing left to do for this task is the case for variables set via the set function, for which I'm going to send a patch
  • Only a few "Abusefilter parser error" notices are left on logstash, mostly for known problems. The only "unknown" one is the arwiki above, which could be related to DNONE and I'm going to test locally Meh, it's not related, it has number < variable < number which is correctly reported as invalid and should be fixed onwiki

Change 528441 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Partly handle set and set_var in shortcircuit

https://gerrit.wikimedia.org/r/528441

Change 528441 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Partly handle set and set_var in shortcircuit

https://gerrit.wikimedia.org/r/528441

Daimona closed this task as Resolved.Aug 6 2019, 5:38 PM
Daimona removed a project: Patch-For-Review.

Calling resolved. The only possible follow-up would be to handle assignments with variable variables. But IMHO, AF is a pretty simple language, and variable variables should be banned altogether. I'm also skeptic about the utility of the set_var function, but that's something we can leave in place - as long as we implement the restriction above. I'm going to open another task for that.