Page MenuHomePhabricator

Slow queries on abuse_filter_log using afl_action or afl_actions
Open, Needs TriagePublic

Description

Looking at slow queries on tendril, I noticed some queries like the following show up several times:

SELECT * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((af_id=afl_filter)) WHERE (afl_actions = 'block' OR (afl_actions LIKE 'block, %' ESCAPE '`' ) OR (afl_actions LIKE '%, block' ESCAPE '`' ) OR (afl_actions LIKE '%, block, %' ESCAPE '`' )) AND ((afl_deleted = '0') OR (afl_deleted IS NULL)) ORDER BY afl_timestamp DESC LIMIT 51
SELECT * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((af_id=afl_filter)) WHERE afl_action = 'stashupload' AND ((afl_deleted = '0') OR (afl_deleted IS NULL)) ORDER BY afl_timestamp DESC LIMIT 51

(basically just grep AbuseLogPager to find them).

On large wikis (=enwiki, and sometimes commons), these queries can take up to 80 seconds. I wonder if there's something we can do to optimize them without adding any index. Or add them if it needs be: I guess one on (afl_action, afl_timestamp) and one on (afl_actions,afl_timestamp). Performance should be evaluated then.

Event Timeline

From a quick look at it:
That SELECT * from abuse_filter_log is basically going thru the the whole table.
Also, using LIKE %FOO% would make any index useless.
Look at the EXPLAIN:

root@db1089.eqiad.wmnet[enwiki]> show explain for 3164393912;
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
| id   | select_type | table            | type   | possible_keys | key           | key_len | ref                                | rows     | Extra                       |
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
|    1 | SIMPLE      | abuse_filter_log | index  | NULL          | afl_timestamp | 16      | NULL                               | 22668955 | Using where; Using filesort |
|    1 | SIMPLE      | abuse_filter     | eq_ref | PRIMARY       | PRIMARY       | 8       | enwiki.abuse_filter_log.afl_filter |        1 | Using where                 |
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
2 rows in set, 1 warning (0.00 sec)

And this is the second query:

root@db1089.eqiad.wmnet[enwiki]> show explain for 3164422276;
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
| id   | select_type | table            | type   | possible_keys | key           | key_len | ref                                | rows     | Extra                       |
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
|    1 | SIMPLE      | abuse_filter_log | index  | NULL          | afl_timestamp | 16      | NULL                               | 22669087 | Using where; Using filesort |
|    1 | SIMPLE      | abuse_filter     | eq_ref | PRIMARY       | PRIMARY       | 8       | enwiki.abuse_filter_log.afl_filter |        1 | Using where                 |
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
2 rows in set, 1 warning (0.00 sec)
root@db1089.eqiad.wmnet[enwiki]> select count(*) from abuse_filter_log;
+----------+
| count(*) |
+----------+
| 23380956 |
+----------+
1 row in set (12.28 sec)

You might need to go a bit less wide on the query and try to narrow a bit more to get exactly what you need (or close to it), so the indexes can be more beneficial.

@Marostegui Thanks! So that's because indexes are ignored with LIKE clauses starting with a wildcard, is it correct? Unfortunately the EXPLAIN on my local wiki doesn't help much, quarry seems to be a bit bugged and sql is down on beta cluster, so I don't really have many means to test new queries... Will try to find out how to avoid full table scan and filesort, although the structure of the afl_actions field could make it a bit difficult.

Yes, they get ignored.
Can't you run the new queries on labs?

Can't you run the new queries on labs?

Nope, since a few days. If I SSH to deployment-deploy01.eqiad.wmflabs and run sql deploymentwiki I get an exception due to (apparently) mysqli not being loaded.

No, I meant the wiki replicas: https://wikitech.wikimedia.org/wiki/Portal:Data_Services#Wiki_Replicas (essentially quarry but from the command line).
However, we do filter stuff from abuse_filter_log, so the table isn't completed there (we filter the afl_ip field).

No, I meant the wiki replicas: https://wikitech.wikimedia.org/wiki/Portal:Data_Services#Wiki_Replicas (essentially quarry but from the command line).
However, we do filter stuff from abuse_filter_log, so the table isn't completed there (we filter the afl_ip field).

Huh, I should give it a try. I'm also wondering whether FIND_IN_SET would work and use indexes, although at least it's not cross-DBMS compat. Here comes the drawback of storing comma-separated values!

Untagging us as there is no actionable here for us, I will remain subscribed in case you've got questions or further requests, happy to help!

Daimona added a subscriber: Stang.

As I just wrote at T307249#7893101... I'm not sure if it's possible to fix this without making the afl_actions field atomic, which would be nice, but also a big change. @Marostegui, @Ladsgroup do you have any recommendations here?

Yeah it's not fun. I can think of three options:

  • If number of possible actions is pretty limited, you can construct it.
  • If it's not, you can add a normalization table of possible afl_actions values which basically make the query internally translate it to a set of ids by going over the dictionary table first and then you can easily add index on the ids. It is ugly but it works and not too complicated.
  • The better solution would be to simply repeat every row with multiple values to be one action only and it's obvious what's needed.

Yeah it's not fun.

Indeed :-/

  • If number of possible actions is pretty limited, you can construct it.

There are 8 possible actions; some of these are mutually exclusive (e.g. disallow and block, warn and the others), some are rarely enabled (rangeblock), some are not reported in the AbuseLog (throttle). I tested this locally by creating a filter with all actions enabled and tripping it, and it seems like only four actions are reported (Block autopromote, Range-block, Tag, Block). Then there's the question of whether the ordering is always the same. In the current code it should be, but I'm not sure about older entries. All in all, let's assume that there are 4 actions that can appear together, and 3 that can only appear on their own, and also assume that the order is not guaranteed to always be the same. For any of the 4 actions that can be together, we'd end up with

(4 choose 4)*4! + (4 choose 3)*3! + (4 choose 2)*2! + (4 choose 1)*1! = 64

conditions. The assumptions most certainly makefor a lower bound though, and I'm not sure if it's a good idea to implement it.

  • If it's not, you can add a normalization table of possible afl_actions values which basically make the query internally translate it to a set of ids by going over the dictionary table first and then you can easily add index on the ids. It is ugly but it works and not too complicated.

This is the solution I had in mind. I think this is perhaps the easiest solution to implement, and adding a new table in production should be relatively easy IIRC. It is ugly though.

  • The better solution would be to simply repeat every row with multiple values to be one action only and it's obvious what's needed.

This seems nice; however, I also think it would be quite hard to implement. One random idea: if we do that, we could also add a column to uniquely identify the action that caused one or more filters to match. Currently it's not trivial to determine if two AbuseLog entries for different filters were caused by the same user action (you have to look at the timestamps, target page, variables, and then guess). This would be much more important if we were to insert not just one row per filter, but one row per filter per action.

Yeah it's not fun.

Indeed :-/

  • If number of possible actions is pretty limited, you can construct it.

There are 8 possible actions; some of these are mutually exclusive (e.g. disallow and block, warn and the others), some are rarely enabled (rangeblock), some are not reported in the AbuseLog (throttle). I tested this locally by creating a filter with all actions enabled and tripping it, and it seems like only four actions are reported (Block autopromote, Range-block, Tag, Block). Then there's the question of whether the ordering is always the same. In the current code it should be, but I'm not sure about older entries. All in all, let's assume that there are 4 actions that can appear together, and 3 that can only appear on their own, and also assume that the order is not guaranteed to always be the same. For any of the 4 actions that can be together, we'd end up with

(4 choose 4)*4! + (4 choose 3)*3! + (4 choose 2)*2! + (4 choose 1)*1! = 64

conditions. The assumptions most certainly makefor a lower bound though, and I'm not sure if it's a good idea to implement it.

  • If it's not, you can add a normalization table of possible afl_actions values which basically make the query internally translate it to a set of ids by going over the dictionary table first and then you can easily add index on the ids. It is ugly but it works and not too complicated.

This is the solution I had in mind. I think this is perhaps the easiest solution to implement, and adding a new table in production should be relatively easy IIRC. It is ugly though.

You can use NameTableStore to make it even easier.

  • The better solution would be to simply repeat every row with multiple values to be one action only and it's obvious what's needed.

This seems nice; however, I also think it would be quite hard to implement. One random idea: if we do that, we could also add a column to uniquely identify the action that caused one or more filters to match. Currently it's not trivial to determine if two AbuseLog entries for different filters were caused by the same user action (you have to look at the timestamps, target page, variables, and then guess). This would be much more important if we were to insert not just one row per filter, but one row per filter per action.

Sure but what would be the unique id for that? We have the similar situation with change_tag (how you would identify a "change") and why its design is suboptimal (and not easily fixable).

Sure but what would be the unique id for that? We have the similar situation with change_tag (how you would identify a "change") and why its design is suboptimal (and not easily fixable).

That's a good question, and I don't know the answer. I think something similar would really be useful though.

As I just wrote at T307249#7893101... I'm not sure if it's possible to fix this without making the afl_actions field atomic, which would be nice, but also a big change. @Marostegui, @Ladsgroup do you have any recommendations here?

I don't have access to the private task so I'm not sure if this point is already made there, but I think looking at T253462 may help here. Similar to the analysis in T217481#7912198. Essentially, instead of writing out the English name of the action into afl_action, we could assign a constant value to each action (1 for warn, 2 for block, 4 for rangeblock, 8 for disallow, etc.) and store a single number in afl_action. Then AbuseFilter can provide a public method that can take the number and return which actions were toggled on (so for 10, it would return ['block', 'disallow']). This has the advantage that adding new actions would only mean adding a new flag on the next digit of the underlying binary number (16, 32, etc.)

This will, however, be a breaking change to all analytics (via Wiki Replica servers or otherwise) that currently is taking advantage of the human readable afl_action labels.

Using bitwise calculation is clever. Similar to what we have in compact configuration of schema changes. The action would be 'block' and then you'd use BLOCK | DISALLOW which would give you a number you can easily later check. The problem is how to filter that in MySQL? You probably have to build all possible cases and do IN (54, 565, 346, etc.)

Sure but what would be the unique id for that? We have the similar situation with change_tag (how you would identify a "change") and why its design is suboptimal (and not easily fixable).

That's a good question, and I don't know the answer. I think something similar would really be useful though.

Also, abuse filter works on "actions" that are not even registered yet (e.g. a disallowed edit won't show up in revision table so won't have rev_id). You probably can make a new table and define your own id on changes that triggered abuse filter.

As I just wrote at T307249#7893101... I'm not sure if it's possible to fix this without making the afl_actions field atomic, which would be nice, but also a big change. @Marostegui, @Ladsgroup do you have any recommendations here?

I don't have access to the private task

I've just added you as subscriber, it's really just a duplicate without much information.

Essentially, instead of writing out the English name of the action into afl_action, we could assign a constant value to each action (1 for warn, 2 for block, 4 for rangeblock, 8 for disallow, etc.) and store a single number in afl_action. Then AbuseFilter can provide a public method that can take the number and return which actions were toggled on (so for 10, it would return ['block', 'disallow']). This has the advantage that adding new actions would only mean adding a new flag on the next digit of the underlying binary number (16, 32, etc.)

Ah, I like the bitwise approach, yes. I think I may have considered this in the past but I can't tell whether I just forgot about it or found some reason against it; probably the former.

Using bitwise calculation is clever. Similar to what we have in compact configuration of schema changes. The action would be 'block' and then you'd use BLOCK | DISALLOW which would give you a number you can easily later check. The problem is how to filter that in MySQL? You probably have to build all possible cases and do IN (54, 565, 346, etc.)

Can't we use bitwise operations with our DBAL? So for instance, if block is 2 as above, we'd do afl_action & 2 = 2 to filter for blocks.

Sure but what would be the unique id for that? We have the similar situation with change_tag (how you would identify a "change") and why its design is suboptimal (and not easily fixable).

That's a good question, and I don't know the answer. I think something similar would really be useful though.

Also, abuse filter works on "actions" that are not even registered yet (e.g. a disallowed edit won't show up in revision table so won't have rev_id). You probably can make a new table and define your own id on changes that triggered abuse filter.

Yeah, the ID would have to be specific to AF. It'd be great if we could do that without an additional table, but I'm not very sure.

Essentially, instead of writing out the English name of the action into afl_action, we could assign a constant value to each action (1 for warn, 2 for block, 4 for rangeblock, 8 for disallow, etc.) and store a single number in afl_action. Then AbuseFilter can provide a public method that can take the number and return which actions were toggled on (so for 10, it would return ['block', 'disallow']). This has the advantage that adding new actions would only mean adding a new flag on the next digit of the underlying binary number (16, 32, etc.)

Ah, I like the bitwise approach, yes. I think I may have considered this in the past but I can't tell whether I just forgot about it or found some reason against it; probably the former.

Using bitwise calculation is clever. Similar to what we have in compact configuration of schema changes. The action would be 'block' and then you'd use BLOCK | DISALLOW which would give you a number you can easily later check. The problem is how to filter that in MySQL? You probably have to build all possible cases and do IN (54, 565, 346, etc.)

Can't we use bitwise operations with our DBAL? So for instance, if block is 2 as above, we'd do afl_action & 2 = 2 to filter for blocks.

Oh yeah, see IDatabase::bitAnd. We should give this a try it seems.