
AbuseFilter warning: "doCas failed due to race condition" (from Redis)
Open, Needs Triage · Public · PRODUCTION ERROR

Description

Error message
MediumSpecificBagOStuff::doCas failed due to race condition for wikidatawiki:abusefilter-profile:group:default.
MediumSpecificBagOStuff::doCas failed due to race condition for commonswiki:abusefilter-profile:group:default.
MediumSpecificBagOStuff::doCas failed due to race condition for enwiki:abusefilter-profile:group:default.
…
Impact

Unknown.

Notes

These are the only messages logged by channel:redis in production on a regular basis, which suggests that something unusual is going on.

Event Timeline

These keys are used for filter profiling, and they are updated after every edit. The fact that this error seems to appear mostly on the three biggest wikis suggests that it might be failing simply because of the volume of edits there. Also, profiling is not mission-critical, so failing to update the key is tolerated. Is there any way to silence this error?

These are logged as INFO messages in channel:redis, untagging Wikimedia-production-error as such.

These keys are used for filter profiling, and they are updated after every edit. […] profiling is not mission-critical, so failing to update the key is tolerated. Is there any way to silence this error?

The code currently uses MainStash and calls RedisBagOStuff->merge(), which makes up to 10 attempts to use cas().
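For illustration, here is a minimal sketch of the pattern being described; this is not the actual AbuseFilter code, and the key layout, TTL and placeholder callback are assumptions. The point is the shape of the call: the value lives in the main stash and is rewritten through merge(), which retries cas() internally up to the given number of attempts.

```php
use MediaWiki\MediaWikiServices;

$stash = MediaWikiServices::getInstance()->getMainObjectStash();
// Yields a key of the form "<wiki>:abusefilter-profile:group:default".
$key = $stash->makeKey( 'abusefilter-profile', 'group', 'default' );

$ok = $stash->merge(
	$key,
	// The callback receives the current value and returns the new one;
	// merge() re-runs it under cas() whenever another writer got there first.
	function ( $cache, $key, $value ) {
		// Placeholder computation; the real callbacks are sketched further down.
		$value = $value ?: [ 'total' => 0 ];
		$value['total']++;
		return $value;
	},
	BagOStuff::TTL_HOUR, // TTL chosen for the sketch only
	10 // up to 10 cas() attempts, as noted above
);
```

Each cas() attempt that loses the race is what produces one of the INFO messages quoted in the task description; merge() itself only reports an overall boolean.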

Right now, this is seen to fail about 500 times a minute. That's fairly high considering production sees about 1200 edits per minute, so it would seem to affect a rather significant portion of edits. This suggests to me that the logic inside the merge() callback is perhaps relatively slow; if it were fast, I would not expect it to fail this often. Maybe there's room for improvement there.

Having said that, the INFO log is currently emitted in BagOStuff for the internal attempts, not for the merge operation as a whole. Perhaps we can add a WARN-level log there for when all 10 attempts have failed, which would give us a more meaningful signal about how this impacts AbuseFilter profiling. For example, if it were true that 50% of attempts to update this data fail, that would be important to take into account when interpreting the data: it may be missing 50% of the underlying signal (involuntary sampling, basically, which may be worth replacing with deliberate sampling).
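Continuing the names from the sketch above, a caller-side variant of this idea could look as follows (the suggestion here is actually to log inside BagOStuff itself, so this is only an assumed illustration, and the 'AbuseFilter' channel name and message are assumptions too): merge() returns false once every attempt has been exhausted, so a single WARNING can be emitted per failed operation.

```php
use MediaWiki\Logger\LoggerFactory;

if ( !$ok ) {
	// All 10 cas() attempts failed: this edit's profiling update was dropped.
	LoggerFactory::getInstance( 'AbuseFilter' )->warning(
		'Failed to update {key} after exhausting all merge() attempts',
		[ 'key' => $key ]
	);
}
```

Counting such warnings against the edit rate would then give the effective "involuntary sampling" rate directly.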

Can you confirm the following?

  • Does it seem likely that the computation callback is slow here, e.g. more than 1 millisecond? Or is the main work to compute the value done outside the merge, with the merge just doing one or two number operations and simple key-value merges? Is there any low-hanging fruit to improve the callback runtime?
  • Is this happening in the critical path, or in a post-send deferred update? Given that MainStash will soon move from Redis to MySQL (T212129), this operation is expected to become slower in the future, which would be amplified by CAS roundtrips and even more so by doing multiple attempts.

Right now, this is seen to fail about 500 times a minute. That's fairly high considering production sees about 1200 edits per minute, so it would seem to affect a rather significant portion of edits.

Ugh, I didn't do the math, but 500 failures out of roughly 1200 edits per minute is indeed a significant portion (on the order of 40%, assuming each failure corresponds to a distinct edit).

Having said that, the INFO log is currently emitted in BagOStuff for the internal attempts, not for the merge operation as a whole. Perhaps we can add a WARN-level log there for when all 10 attempts have failed, which would give us a more meaningful signal about how this impacts AbuseFilter profiling.

Yes, this would indeed be useful!

Can you confirm the following?

  • Does it seem likely that the computation callback is slow here, e.g. more than 1 millisecond? Or is the main work to compute the value done outside the merge, with the merge just doing one or two number operations and simple key-value merges? Is there any low-hanging fruit to improve the callback runtime?

I should note that there are two callbacks involved here: one for <wiki>:abusefilter-profile:group:default keys as in the task description (here), and one for <wiki>:abusefilter-profile:v3:<filterID> keys that I've seen in Logstash (here). Both are very similar, and both should be very fast: they consist only of conditionals and comparison/assignment/increment operations, the only exception being a call to $this->options->get( 'AbuseFilterConditionLimit' ) in the first callback. Even considering this, I don't think either callback can be considered slow.
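To make the "only conditionals and increments" point concrete, here is roughly the shape of the first callback as it would sit inside the profiler class; the field names and the $timeTaken/$condsUsed inputs are assumptions for the sketch, not the literal AbuseFilter code.

```php
$callback = function ( $cache, $key, $profile ) use ( $timeTaken, $condsUsed ) {
	$profile = $profile ?: [
		'total' => 0,
		'overflow' => 0,
		'total-time' => 0,
		'total-cond' => 0,
	];
	$profile['total']++;
	$profile['total-time'] += $timeTaken;
	$profile['total-cond'] += $condsUsed;
	// The one external call mentioned above; everything else is plain arithmetic.
	if ( $condsUsed > $this->options->get( 'AbuseFilterConditionLimit' ) ) {
		$profile['overflow']++;
	}
	return $profile;
};
```

If there is any low-hanging fruit at all, it is probably just hoisting that options lookup out of the retried callback, and the saving would be negligible.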

  • Is this happening in the critical path, or in a post-send deferred update? Given that MainStash will soon move from Redis to MySQL (T212129), this operation is expected to become slower in the future, which would be amplified by CAS roundtrips and even more so by doing multiple attempts.

All of the merge() calls happen in a post-send DeferredUpdate, which is scheduled here.
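Reusing the names from the sketches above, the scheduling pattern being described looks roughly like this (POSTSEND is the default stage; it is spelled out here for clarity):

```php
DeferredUpdates::addCallableUpdate(
	function () use ( $stash, $key, $callback ) {
		// The merge(), and therefore every cas() retry, runs after the
		// response has been flushed to the client.
		$stash->merge( $key, $callback, BagOStuff::TTL_HOUR, 10 );
	},
	DeferredUpdates::POSTSEND
);
```

So a slower MySQL-backed MainStash would mostly show up as extra post-send server time rather than as user-visible save latency.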