Slow queries on abuse_filter_log using afl_action or afl_actions
Open, Needs TriagePublic
Actions

Assigned To

None

Authored By

	Daimona
	Mar 2 2019, 3:39 PM

Description

Looking at slow queries on tendril, I noticed some queries like the following show up several times:

SELECT * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((af_id=afl_filter)) WHERE (afl_actions = 'block' OR (afl_actions LIKE 'block, %' ESCAPE '`' ) OR (afl_actions LIKE '%, block' ESCAPE '`' ) OR (afl_actions LIKE '%, block, %' ESCAPE '`' )) AND ((afl_deleted = '0') OR (afl_deleted IS NULL)) ORDER BY afl_timestamp DESC LIMIT 51

SELECT * FROM `abuse_filter_log` LEFT JOIN `abuse_filter` ON ((af_id=afl_filter)) WHERE afl_action = 'stashupload' AND ((afl_deleted = '0') OR (afl_deleted IS NULL)) ORDER BY afl_timestamp DESC LIMIT 51

(basically just grep AbuseLogPager to find them).

On large wikis (=enwiki, and sometimes commons), these queries can take up to 80 seconds. I wonder if there's something we can do to optimize them without adding any index. Or add them if it needs be: I guess one on (afl_action, afl_timestamp) and one on (afl_actions,afl_timestamp). Performance should be evaluated then.

Related Objects

Mentioned In: T251221: Validate Special:AbuseLog options chosen before running query
Mentioned Here: T253462: Allow aspects of edits or logs to be revdeled while other aspects are suppressed

Event Timeline

Daimona created this task.Mar 2 2019, 3:39 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 2 2019, 3:39 PM

From a quick look at it:
That SELECT * from abuse_filter_log is basically going thru the the whole table.
Also, using LIKE %FOO% would make any index useless.
Look at the EXPLAIN:

root@db1089.eqiad.wmnet[enwiki]> show explain for 3164393912;
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
| id   | select_type | table            | type   | possible_keys | key           | key_len | ref                                | rows     | Extra                       |
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
|    1 | SIMPLE      | abuse_filter_log | index  | NULL          | afl_timestamp | 16      | NULL                               | 22668955 | Using where; Using filesort |
|    1 | SIMPLE      | abuse_filter     | eq_ref | PRIMARY       | PRIMARY       | 8       | enwiki.abuse_filter_log.afl_filter |        1 | Using where                 |
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
2 rows in set, 1 warning (0.00 sec)

And this is the second query:

root@db1089.eqiad.wmnet[enwiki]> show explain for 3164422276;
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
| id   | select_type | table            | type   | possible_keys | key           | key_len | ref                                | rows     | Extra                       |
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
|    1 | SIMPLE      | abuse_filter_log | index  | NULL          | afl_timestamp | 16      | NULL                               | 22669087 | Using where; Using filesort |
|    1 | SIMPLE      | abuse_filter     | eq_ref | PRIMARY       | PRIMARY       | 8       | enwiki.abuse_filter_log.afl_filter |        1 | Using where                 |
+------+-------------+------------------+--------+---------------+---------------+---------+------------------------------------+----------+-----------------------------+
2 rows in set, 1 warning (0.00 sec)

root@db1089.eqiad.wmnet[enwiki]> select count(*) from abuse_filter_log;
+----------+
| count(*) |
+----------+
| 23380956 |
+----------+
1 row in set (12.28 sec)

You might need to go a bit less wide on the query and try to narrow a bit more to get exactly what you need (or close to it), so the indexes can be more beneficial.

@Marostegui Thanks! So that's because indexes are ignored with LIKE clauses starting with a wildcard, is it correct? Unfortunately the EXPLAIN on my local wiki doesn't help much, quarry seems to be a bit bugged and sql is down on beta cluster, so I don't really have many means to test new queries... Will try to find out how to avoid full table scan and filesort, although the structure of the afl_actions field could make it a bit difficult.

Yes, they get ignored.
Can't you run the new queries on labs?

In T217481#4995625, @Marostegui wrote:

Can't you run the new queries on labs?

Nope, since a few days. If I SSH to deployment-deploy01.eqiad.wmflabs and run sql deploymentwiki I get an exception due to (apparently) mysqli not being loaded.

No, I meant the wiki replicas: https://wikitech.wikimedia.org/wiki/Portal:Data_Services#Wiki_Replicas (essentially quarry but from the command line).
However, we do filter stuff from abuse_filter_log, so the table isn't completed there (we filter the afl_ip field).

In T217481#4995638, @Marostegui wrote:

No, I meant the wiki replicas: https://wikitech.wikimedia.org/wiki/Portal:Data_Services#Wiki_Replicas (essentially quarry but from the command line).
However, we do filter stuff from abuse_filter_log, so the table isn't completed there (we filter the afl_ip field).

Huh, I should give it a try. I'm also wondering whether FIND_IN_SET would work and use indexes, although at least it's not cross-DBMS compat. Here comes the drawback of storing comma-separated values!

• Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board.Mar 4 2019, 6:08 AM

Daimona added a project: User-Daimona.Mar 4 2019, 3:00 PM

Daimona moved this task from Backlog to Future on the User-Daimona board.

Untagging us as there is no actionable here for us, I will remain subscribed in case you've got questions or further requests, happy to help!

Daimona mentioned this in T251221: Validate Special:AbuseLog options chosen before running query.Apr 30 2020, 3:37 PM

DannyS712 subscribed.Apr 30 2020, 4:05 PM

Daimona merged a task: Restricted Task.Apr 30 2022, 9:47 AM

Daimona added a subscriber: Stang.

As I just wrote at T307249#7893101... I'm not sure if it's possible to fix this without making the afl_actions field atomic, which would be nice, but also a big change. @Marostegui, @Ladsgroup do you have any recommendations here?

Yeah it's not fun. I can think of three options:

If number of possible actions is pretty limited, you can construct it.
If it's not, you can add a normalization table of possible afl_actions values which basically make the query internally translate it to a set of ids by going over the dictionary table first and then you can easily add index on the ids. It is ugly but it works and not too complicated.
The better solution would be to simply repeat every row with multiple values to be one action only and it's obvious what's needed.

In T217481#7893489, @Ladsgroup wrote:

Yeah it's not fun.

Indeed :-/

If number of possible actions is pretty limited, you can construct it.

There are 8 possible actions; some of these are mutually exclusive (e.g. disallow and block, warn and the others), some are rarely enabled (rangeblock), some are not reported in the AbuseLog (throttle). I tested this locally by creating a filter with all actions enabled and tripping it, and it seems like only four actions are reported (Block autopromote, Range-block, Tag, Block). Then there's the question of whether the ordering is always the same. In the current code it should be, but I'm not sure about older entries. All in all, let's assume that there are 4 actions that can appear together, and 3 that can only appear on their own, and also assume that the order is not guaranteed to always be the same. For any of the 4 actions that can be together, we'd end up with

(4 choose 4)*4! + (4 choose 3)*3! + (4 choose 2)*2! + (4 choose 1)*1! = 64

conditions. The assumptions most certainly makefor a lower bound though, and I'm not sure if it's a good idea to implement it.

If it's not, you can add a normalization table of possible afl_actions values which basically make the query internally translate it to a set of ids by going over the dictionary table first and then you can easily add index on the ids. It is ugly but it works and not too complicated.

This is the solution I had in mind. I think this is perhaps the easiest solution to implement, and adding a new table in production should be relatively easy IIRC. It is ugly though.

The better solution would be to simply repeat every row with multiple values to be one action only and it's obvious what's needed.

This seems nice; however, I also think it would be quite hard to implement. One random idea: if we do that, we could also add a column to uniquely identify the action that caused one or more filters to match. Currently it's not trivial to determine if two AbuseLog entries for different filters were caused by the same user action (you have to look at the timestamps, target page, variables, and then guess). This would be much more important if we were to insert not just one row per filter, but one row per filter per action.

In T217481#7912198, @Daimona wrote:
In T217481#7893489, @Ladsgroup wrote:

Yeah it's not fun.

Indeed :-/

If number of possible actions is pretty limited, you can construct it.

There are 8 possible actions; some of these are mutually exclusive (e.g. disallow and block, warn and the others), some are rarely enabled (rangeblock), some are not reported in the AbuseLog (throttle). I tested this locally by creating a filter with all actions enabled and tripping it, and it seems like only four actions are reported (Block autopromote, Range-block, Tag, Block). Then there's the question of whether the ordering is always the same. In the current code it should be, but I'm not sure about older entries. All in all, let's assume that there are 4 actions that can appear together, and 3 that can only appear on their own, and also assume that the order is not guaranteed to always be the same. For any of the 4 actions that can be together, we'd end up with
(4 choose 4)*4! + (4 choose 3)*3! + (4 choose 2)*2! + (4 choose 1)*1! = 64
conditions. The assumptions most certainly makefor a lower bound though, and I'm not sure if it's a good idea to implement it.

If it's not, you can add a normalization table of possible afl_actions values which basically make the query internally translate it to a set of ids by going over the dictionary table first and then you can easily add index on the ids. It is ugly but it works and not too complicated.

This is the solution I had in mind. I think this is perhaps the easiest solution to implement, and adding a new table in production should be relatively easy IIRC. It is ugly though.

You can use NameTableStore to make it even easier.

The better solution would be to simply repeat every row with multiple values to be one action only and it's obvious what's needed.

This seems nice; however, I also think it would be quite hard to implement. One random idea: if we do that, we could also add a column to uniquely identify the action that caused one or more filters to match. Currently it's not trivial to determine if two AbuseLog entries for different filters were caused by the same user action (you have to look at the timestamps, target page, variables, and then guess). This would be much more important if we were to insert not just one row per filter, but one row per filter per action.

Sure but what would be the unique id for that? We have the similar situation with change_tag (how you would identify a "change") and why its design is suboptimal (and not easily fixable).

In T217481#7914200, @Ladsgroup wrote:

Sure but what would be the unique id for that? We have the similar situation with change_tag (how you would identify a "change") and why its design is suboptimal (and not easily fixable).

That's a good question, and I don't know the answer. I think something similar would really be useful though.

In T217481#7893110, @Daimona wrote:

As I just wrote at T307249#7893101... I'm not sure if it's possible to fix this without making the afl_actions field atomic, which would be nice, but also a big change. @Marostegui, @Ladsgroup do you have any recommendations here?

I don't have access to the private task so I'm not sure if this point is already made there, but I think looking at T253462 may help here. Similar to the analysis in T217481#7912198. Essentially, instead of writing out the English name of the action into afl_action, we could assign a constant value to each action (1 for warn, 2 for block, 4 for rangeblock, 8 for disallow, etc.) and store a single number in afl_action. Then AbuseFilter can provide a public method that can take the number and return which actions were toggled on (so for 10, it would return ['block', 'disallow']). This has the advantage that adding new actions would only mean adding a new flag on the next digit of the underlying binary number (16, 32, etc.)

This will, however, be a breaking change to all analytics (via Wiki Replica servers or otherwise) that currently is taking advantage of the human readable afl_action labels.

Using bitwise calculation is clever. Similar to what we have in compact configuration of schema changes. The action would be 'block' and then you'd use BLOCK | DISALLOW which would give you a number you can easily later check. The problem is how to filter that in MySQL? You probably have to build all possible cases and do IN (54, 565, 346, etc.)

In T217481#7914942, @Daimona wrote:

In T217481#7914200, @Ladsgroup wrote:

Sure but what would be the unique id for that? We have the similar situation with change_tag (how you would identify a "change") and why its design is suboptimal (and not easily fixable).

That's a good question, and I don't know the answer. I think something similar would really be useful though.

Also, abuse filter works on "actions" that are not even registered yet (e.g. a disallowed edit won't show up in revision table so won't have rev_id). You probably can make a new table and define your own id on changes that triggered abuse filter.

In T217481#7915633, @Huji wrote:

In T217481#7893110, @Daimona wrote:

As I just wrote at T307249#7893101... I'm not sure if it's possible to fix this without making the afl_actions field atomic, which would be nice, but also a big change. @Marostegui, @Ladsgroup do you have any recommendations here?

I don't have access to the private task

I've just added you as subscriber, it's really just a duplicate without much information.

Essentially, instead of writing out the English name of the action into afl_action, we could assign a constant value to each action (1 for warn, 2 for block, 4 for rangeblock, 8 for disallow, etc.) and store a single number in afl_action. Then AbuseFilter can provide a public method that can take the number and return which actions were toggled on (so for 10, it would return ['block', 'disallow']). This has the advantage that adding new actions would only mean adding a new flag on the next digit of the underlying binary number (16, 32, etc.)

Ah, I like the bitwise approach, yes. I think I may have considered this in the past but I can't tell whether I just forgot about it or found some reason against it; probably the former.

In T217481#7916175, @Ladsgroup wrote:

Using bitwise calculation is clever. Similar to what we have in compact configuration of schema changes. The action would be 'block' and then you'd use BLOCK | DISALLOW which would give you a number you can easily later check. The problem is how to filter that in MySQL? You probably have to build all possible cases and do IN (54, 565, 346, etc.)

Can't we use bitwise operations with our DBAL? So for instance, if block is 2 as above, we'd do afl_action & 2 = 2 to filter for blocks.

In T217481#7914942, @Daimona wrote:

In T217481#7914200, @Ladsgroup wrote:

Sure but what would be the unique id for that? We have the similar situation with change_tag (how you would identify a "change") and why its design is suboptimal (and not easily fixable).

That's a good question, and I don't know the answer. I think something similar would really be useful though.

Also, abuse filter works on "actions" that are not even registered yet (e.g. a disallowed edit won't show up in revision table so won't have rev_id). You probably can make a new table and define your own id on changes that triggered abuse filter.

Yeah, the ID would have to be specific to AF. It'd be great if we could do that without an additional table, but I'm not very sure.

In T217481#7916920, @Daimona wrote:

In T217481#7915633, @Huji wrote:

Essentially, instead of writing out the English name of the action into afl_action, we could assign a constant value to each action (1 for warn, 2 for block, 4 for rangeblock, 8 for disallow, etc.) and store a single number in afl_action. Then AbuseFilter can provide a public method that can take the number and return which actions were toggled on (so for 10, it would return ['block', 'disallow']). This has the advantage that adding new actions would only mean adding a new flag on the next digit of the underlying binary number (16, 32, etc.)

Ah, I like the bitwise approach, yes. I think I may have considered this in the past but I can't tell whether I just forgot about it or found some reason against it; probably the former.

In T217481#7916175, @Ladsgroup wrote:

Using bitwise calculation is clever. Similar to what we have in compact configuration of schema changes. The action would be 'block' and then you'd use BLOCK | DISALLOW which would give you a number you can easily later check. The problem is how to filter that in MySQL? You probably have to build all possible cases and do IN (54, 565, 346, etc.)

Can't we use bitwise operations with our DBAL? So for instance, if block is 2 as above, we'd do afl_action & 2 = 2 to filter for blocks.

Oh yeah, see IDatabase::bitAnd. We should give this a try it seems.

Daimona removed a project: User-Daimona.Mar 3 2024, 5:02 PM

Slow queries on abuse_filter_log using afl_action or afl_actionsOpen, Needs TriagePublicActions

Description

Related Objects

Event Timeline

Slow queries on abuse_filter_log using afl_action or afl_actions
Open, Needs TriagePublic
Actions