
Addition of last hit date to Special:AbuseFilter table
Open, Low, Public

Description

Spammers and abusers come in cycles and certain filters have periods of quiescence. The only means to monitor filters is either to visually monitor (eyeball) the log, or to try and have a good memory of the count column and whether it has changed since you last checked. The addition of a "Last hit" per filter line would allow those of us who check filters daily/regularly to be able to see at a glance whether we should be digging further into the log of a filter.

Event Timeline

Legoktm subscribed.

Reverted due to the query timing out on various large projects. T106798#1478602 shows the query plan :/

@jcrespo Can you perhaps help with a better implementation for the query when you have time? Note that this is low priority, so it's not necessary, but it would be nice. :)

On my wiki, running the previously used query gives the following:

EXPLAIN SELECT af_id,af_enabled,af_deleted,af_global,af_public_comments,af_hidden,af_hit_count,af_timestamp,af_user_text,af_user,af_actions,af_group,MAX(afl_timestamp) AS afl_timestamp FROM `abuse_filter` LEFT JOIN `abuse_filter_log` ON ((af_id=afl_filter)) WHERE af_deleted = 0 GROUP BY af_id ORDER BY af_id LIMIT 51
+----+-------------+------------------+-------+------------------+------------------+---------+------+------+---------------------------------------------------------------+
| id | select_type | table            | type  | possible_keys    | key              | key_len | ref  | rows | Extra                                                         |
+----+-------------+------------------+-------+------------------+------------------+---------+------+------+---------------------------------------------------------------+
|  1 | SIMPLE      | abuse_filter     | ALL   | NULL             | NULL             | NULL    | NULL |  146 | Using where; Using temporary; Using filesort                  |
|  1 | SIMPLE      | abuse_filter_log | index | filter_timestamp | filter_timestamp | 80      | NULL | 4144 | Using where; Using index; Using join buffer (flat, BNL join)  |
+----+-------------+------------------+-------+------------------+------------------+---------+------+------+---------------------------------------------------------------+

And this sucks.

Is it possible to look at something other than a binary solution (off/on)? For those of us who do the unheralded work in the background (spambots, vandals, LTAs), having something to make our life easier would be most helpful, especially at Meta for the global filters, where the spambot hits drown out everything else. Is there an intermediate measure other than generating it at the point of call? Is it possible that this could be updated hourly, or six-hourly?

The result could probably be cached if need be, but the point is that this query is terrible. Maybe we could optimize it with a subquery, but that would be sort of hacky. Otherwise, I guess we'd need to add indexes, i.e. perform a schema change, which isn't quick to do.

@Daimona <shrug> I am just a nuff-nuff spam handler and a typical end user who needs some abilities to better handle the abuse filter/log, which has become a morass. I talk about desires for tools that allow me to work efficiently, and to better and more effectively protect our wikis. I generally put up with the inefficiencies in the system; however, when I get to the point of "f*** it" and am throwing my hands up (again) because this is wearing me down, then I will speak up.

I know that I work in an unsexy, somewhat invisible space, and we need to balance where we allocate resources. If this is not considered important, or of little or no value, then people just need to say so; I am sure that I can find something else to do.

I am an end user myself who fiddles with abuse filters every single day. Thus, I'm already aware of several parts of this extension that should be improved. Many of them are being worked on, but it's not a quick process, and sometimes it's not easy either.

In this specific case, I personally fail to see the last hit date as an indispensable feature. Still, I'd add it, if it weren't for the major performance issue outlined above. That's not easy to work around, and would instead require a big change which, IMHO, shouldn't be prioritized, given that we already have several important changes on hold.

If it's absolutely essential for you to see the last hit for each filter, and (as you said above) an hourly update would be fine, you can just run the query manually on Quarry.

So I looked at this again. None of the improvements I could think of would really increase the performance of the query. The only concrete boost would be to turn the JOIN + GROUP BY into a subquery, i.e.

SELECT af_id, af_enabled, af_deleted, af_global, af_public_comments, af_hidden, af_hit_count, af_timestamp, af_user_text, af_user, af_actions, af_group, (SELECT MAX(afl_timestamp) FROM `abuse_filter_log` WHERE ( ( af_id = afl_filter ) ) LIMIT 1) AS afl_timestamp FROM `abuse_filter` WHERE af_deleted = 0 ORDER BY af_id LIMIT 51

This query performs well, but isn't pretty to look at due to the LIMIT 1, and we'd need to specify an additional ORDER BY afl_id DESC to really ensure we get the last hit. This wouldn't impact performance, as it's very unlikely that we'll ever get duplicates on MAX(afl_timestamp) with af_id = afl_filter.

The only question is: how hacky would it be to use a subquery in getQueryInfo? I couldn't really find similar examples, and generally speaking I'd be against it. However, in this specific case it could be acceptable. I'm submitting a patch shortly, then we can go on discussing with some code at hand.

I'm almost there. A couple of notes:

  • Contrary to what I said above, the ORDER BY in the subquery is not necessary: the last hit column only shows the timestamp, and if there are duplicates we can of course just pick one. I initially thought that we were going to add a link to the log entry.
  • The last hit field will not be sortable. It would be easy and quick to do a simple ORDER BY afl_timestamp, but the problem comes when an offset is specified. That causes a WHERE clause like afl_timestamp > 123; but afl_timestamp is the result of the subquery and cannot be used in a WHERE clause (see the sketch after this list). To make it work we would have to add abuse_filter_log to the list of tables, but that would bring the performance issues back. I'm still leaving a ToDo in the code, but it won't be easy to accomplish with the current schema.
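
To make this concrete, here is a minimal sketch of what the paginated query would have to look like (the offset timestamp is made up for illustration):

SELECT af_id,
       (SELECT MAX(afl_timestamp)
          FROM abuse_filter_log
         WHERE af_id = afl_filter) AS afl_timestamp
  FROM abuse_filter
 WHERE af_deleted = 0
   AND afl_timestamp > '20190101000000' -- invalid: afl_timestamp is a select-list alias,
                                        -- and MySQL/MariaDB don't allow aliases in WHERE
 ORDER BY afl_timestamp                 -- aliases are fine in ORDER BY, so plain sorting works
 LIMIT 51;

The offset condition fails with an "unknown column" error, which is exactly the limitation described above; only pulling abuse_filter_log into the FROM clause would make afl_timestamp usable there.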

Unrelated: throughout the AbuseFilter code we perform several queries on the abuse_filter table with af_deleted = 0 in the WHERE clause. Notably, this happens on the default view of Special:AbuseFilter and every time the filters to execute for the current action are picked from the DB. Although the abuse_filter table is always pretty small (no more than 1000 rows on enwiki) and the query is already fast, maybe we could speed it up even more by adding an index on af_deleted.
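
A minimal sketch of that idea (the index name is my own choice, not from this task):

ALTER TABLE abuse_filter ADD INDEX af_deleted (af_deleted);
-- abuse_filter has fewer than 1000 rows on enwiki, so the gain would be modest,
-- but the af_deleted = 0 condition is evaluated on every filtered action.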


New query performance from beta cluster:

MariaDB [deploymentwiki]> EXPLAIN SELECT af_id,af_enabled,af_deleted,af_pattern,af_global,af_public_comments,af_hidden,af_hit_count,af_timestamp,af_user_text,af_user,af_actions,af_group,af_throttled,(SELECT MAX(afl_timestamp) FROM `abuse_filter_log` WHERE (af_id = afl_filter) LIMIT 1 ) AS `afl_timestamp` FROM `abuse_filter` WHERE af_deleted = '0' AND af_enabled = '1' AND af_global = '0' ORDER BY af_id LIMIT 51\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: abuse_filter
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 30
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: abuse_filter_log
         type: index
possible_keys: filter_timestamp
          key: filter_timestamp
      key_len: 80
          ref: NULL
         rows: 343
        Extra: Using where; Using index
2 rows in set (0.01 sec)

Change 487010 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] Add last hit column to Special:AbuseFilter

https://gerrit.wikimedia.org/r/487010

@jcrespo Could you please check the query from my last comment and see how it goes?

Not looking good on production (dependent subquery with a full table scan):

root@db1106.eqiad.wmnet[enwiki]> EXPLAIN SELECT af_id,af_enabled,af_deleted,af_pattern,af_global,af_public_comments,af_hidden,af_hit_count,af_timestamp,af_user_text,af_user,af_actions,af_group,af_throttled,(SELECT MAX(afl_timestamp) FROM `abuse_filter_log` WHERE (af_id = afl_filter) LIMIT 1 ) AS `afl_timestamp` FROM `abuse_filter` WHERE af_deleted = '0' AND af_enabled = '1' AND af_global = '0' ORDER BY af_id LIMIT 51\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: abuse_filter
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 51
        Extra: Using where
*************************** 2. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: abuse_filter_log
         type: index
possible_keys: afl_filter
          key: afl_filter
      key_len: 82
          ref: NULL
         rows: 22761859
        Extra: Using where; Using index
2 rows in set (0.00 sec)

If the stats were accurate, that would be a scan of roughly a billion rows (51 × ~22.7M), which would be a big no (it would not finish within 60 seconds and would get killed).

I finally have some time to look at this, but I am confused about the initial goal; the summary is helpful but unclear on the details. @Daimona, will you have some time on IRC (freenode, DBA) to chat so I can better understand it? Sorry, maybe you told me already, but I may have forgotten.

So, first of all, abuse_filter_log is a varbinary(64) and not a bigint, and that will cause all sorts of issues, like potentially refusing to use the right index on a JOIN. That should be fixed. On paper it is defined as varchar(64) binary NOT NULL, which is the same as varbinary. This is because a string is being used instead of a number (a 32- or 64-bit int).

I will check if the above issue can be fixed without doing that first, but the above should be a high priority.

What you are trying to do is a join combined with a group-wise maximum.

The correlated subquery version works, but it is too slow:

root@db1106.eqiad.wmnet[enwiki]> SELECT * FROM abuse_filter_log afl1 WHERE afl_id = (SELECT MAX(afl_id) FROM abuse_filter_log afl2 WHERE afl1.afl_filter = afl2.afl_filter ) ORDER BY afl_filter;
+----------+------------+
| afl_id   | afl_filter | 
+----------+------------+
| 23140805 | 1          | 
|  2869380 | 10         | 
|   588458 | 101        |
...
| 23180767 | 11         |
...
822 rows in set (50.12 sec)

Also notice that it is ordered lexicographically (since afl_filter is a string, '10' and '101' sort before '11').

So, first of all, abuse_filter_log is a varbinary(64) and not a bigint, and that will cause all sorts of issues, like potentially refusing to use the right index on a JOIN. That should be fixed. On paper it is defined as varchar(64) binary NOT NULL, which is the same as varbinary. This is because a string is being used instead of a number (a 32- or 64-bit int).

I guess you meant abuse_filter_log.afl_filter, and yes, that is a known issue tracked in T42757. This patch is a quick fix to unbreak Postgres compatibility, while this one is the long-term solution. The field is a varbinary only because its possible values may be "$id" or "global-$id", with $id being an integer, and the plan is to add another column, abuse_filter_log.afl_global, so that we only store the ID in afl_filter. The patch is WIP, but I guess I should start working on it again.
Aside from the obvious benefit of having 2 separate columns, I think that it would also boost performance (a little) by avoiding implicit CASTs for every row.
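
For illustration only, the split could look roughly like this; afl_global is mentioned above and afl_filter_id appears later in this task, but the exact types and defaults are my assumptions (the real DDL lives in the T42757 patches):

ALTER TABLE abuse_filter_log
  ADD COLUMN afl_global TINYINT UNSIGNED NOT NULL DEFAULT 0,    -- 1 for global filters
  ADD COLUMN afl_filter_id BIGINT UNSIGNED NOT NULL DEFAULT 0;  -- numeric filter ID only
-- Old representation: afl_filter = '17' or 'global-17'
-- New representation: afl_filter_id = 17, afl_global = 0 or 1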

So IIUC we should first address T42757 before moving on, right?

So IIUC we should first address T42757 before moving on, right?

Please let me complete the analysis before giving a final recommendation, I was drawn into a meeting before I could finish it.

it would also boost performance (a little) by avoiding implicit CASTs for every row.

That is normally not a huge issue; the real problem is joining columns of different types, which can (and often does) confuse the query planner because the comparison function used is not optimal (the casting itself is not such an issue, but rather the refusal to use an index because the optimizer doesn't understand that integer and string comparison functions would give the same results). Your reason is true, but a secondary concern.

@jcrespo Ah, sure. In the meanwhile, I'll revamp the WIP patch, which will be needed either way at some point.

The analysis will extend a bit more as I need to alter the 10G table on a spare host to check if a type change and a new index would work.

Indeed, the alter will be essential, from what I can see:

root@db1106.eqiad.wmnet[enwiki]> EXPLAIN select max(afl_id) FROM abuse_filter_log WHERE afl_filter = '3'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: abuse_filter_log
         type: ref
possible_keys: afl_filter
          key: afl_filter
      key_len: 66
          ref: const
         rows: 1185562
        Extra: Using where; Using index
1 row in set (0.01 sec)

root@db1106.eqiad.wmnet[enwiki]> select max(afl_id) FROM abuse_filter_log WHERE afl_filter = 3\G                 
*************************** 1. row ***************************
max(afl_id): <id>
1 row in set (5.48 sec)

root@db1106.eqiad.wmnet[enwiki]> EXPLAIN select max(afl_id) FROM abuse_filter_log WHERE afl_filter = 3\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: abuse_filter_log
         type: index
possible_keys: afl_filter
          key: afl_filter
      key_len: 82
          ref: NULL
         rows: 22769370
        Extra: Using where; Using index
1 row in set (0.00 sec)


root@db1106.eqiad.wmnet[enwiki]> select max(afl_id) FROM abuse_filter_log WHERE afl_filter = '3'\G
*************************** 1. row ***************************
max(afl_id): <id>
1 row in set (0.28 sec)

That 20x difference could be the difference between being able to run the query or not (as this action will be done 300 times). However, it is still very slow, so we may have to create an (afl_filter, afl_id) compound index.

Change 488410 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db2055 for performance testing

https://gerrit.wikimedia.org/r/488410

Change 488410 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db2055 for performance testing

https://gerrit.wikimedia.org/r/488410

Mentioned in SAL (#wikimedia-operations) [2019-02-06T15:08:56Z] <jynus@deploy1001> Synchronized wmf-config/db-codfw.php: Depool db2055 for performance testing T93564 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2019-02-06T16:19:18Z] <jynus> running alter table on db2055 T93564

This comment was removed by jcrespo.
I don't consider the IDs private, but just in case:
MariaDB [enwiki]> select max(afl_id) FROM abuse_filter_log WHERE afl_filter = '3';
+-------------+   
| max(afl_id) |   
+-------------+   
|    <id> |   
+-------------+   
1 row in set (2.67 sec)

MariaDB [enwiki]> select max(afl_id) FROM abuse_filter_log WHERE afl_filter = 3;
[many seconds later]
^CCtrl-C -- query killed. Continuing normally.

MariaDB [enwiki]> set sql_log_bin=0; alter table abuse_filter_log add index (`afl_filter`, `afl_id`);
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (2 min 15.96 sec)
Records: 0  Duplicates: 0  Warnings: 1

MariaDB [enwiki]> select max(afl_id) FROM abuse_filter_log WHERE afl_filter = '3';
+-------------+
| max(afl_id) |
+-------------+
|    <id> |
+-------------+
1 row in set (0.00 sec)

With the index, it even works without the quotes.

MariaDB [enwiki]> select max(afl_id) FROM abuse_filter_log WHERE afl_filter = 3;
+-------------+
| max(afl_id) |
+-------------+
|    <id> |
+-------------+
1 row in set (0.00 sec)

That looks better; now we only have to construct the right query that will return all results, even if it has to run the previous one 300 times.

I think, with the alter, this should be the strategy:

EXPLAIN SELECT * FROM abuse_filter_log JOIN abuse_filter ON af_id = afl_filter WHERE afl_id = (SELECT max(afl_id) FROM abuse_filter_log WHERE afl_filter = 1)
union all
SELECT * FROM abuse_filter_log JOIN abuse_filter ON af_id = afl_filter WHERE afl_id = (SELECT max(afl_id) FROM abuse_filter_log WHERE afl_filter = 2)
union all
SELECT * FROM abuse_filter_log JOIN abuse_filter ON af_id = afl_filter WHERE afl_id = (SELECT max(afl_id) FROM abuse_filter_log WHERE afl_filter = 3);
+------+-------------+------------------+-------+---------------------------------+---------+---------+-------+------+------------------------------+
| id   | select_type | table            | type  | possible_keys                   | key     | key_len | ref   | rows | Extra                        |
+------+-------------+------------------+-------+---------------------------------+---------+---------+-------+------+------------------------------+
|    1 | PRIMARY     | abuse_filter_log | const | PRIMARY,afl_filter,afl_filter_2 | PRIMARY | 8       | const |    1 |                              |
|    1 | PRIMARY     | abuse_filter     | const | PRIMARY                         | PRIMARY | 8       | const |    1 |                              |
|    2 | SUBQUERY    | NULL             | NULL  | NULL                            | NULL    | NULL    | NULL  | NULL | Select tables optimized away |
|    3 | UNION       | abuse_filter_log | const | PRIMARY,afl_filter,afl_filter_2 | PRIMARY | 8       | const |    1 |                              |
|    3 | UNION       | abuse_filter     | const | PRIMARY                         | PRIMARY | 8       | const |    1 |                              |
|    4 | SUBQUERY    | NULL             | NULL  | NULL                            | NULL    | NULL    | NULL  | NULL | Select tables optimized away |
|    5 | UNION       | abuse_filter_log | const | PRIMARY,afl_filter,afl_filter_2 | PRIMARY | 8       | const |    1 |                              |
|    5 | UNION       | abuse_filter     | const | PRIMARY                         | PRIMARY | 8       | const |    1 |                              |
|    6 | SUBQUERY    | NULL             | NULL  | NULL                            | NULL    | NULL    | NULL  | NULL | Select tables optimized away |
+------+-------------+------------------+-------+---------------------------------+---------+---------+-------+------+------------------------------+

Let me find the non-ugly version of this.

So this is my final recommendation:

  1. Prepare and deploy the alter table:
add index (`afl_filter`, `afl_id`)
  2. Change the query to a variant of the following (it depends on the fields needed; a LEFT JOIN can be used instead of a JOIN, etc.):
SELECT af.*, afl.* -- change the * for the right fields, explicitly (note selecting all fields can be a lot of bytes)
FROM abuse_filter af
JOIN abuse_filter_log afl ON af.af_id = afl.afl_filter
JOIN (SELECT afl_filter, MAX(afl_id) AS afl_id FROM abuse_filter_log GROUP BY afl_filter) as aflg ON afl.afl_id = aflg.afl_id
ORDER BY af.af_id; -- order by the field you need

Execution time:

823 rows in set (0.04 sec) -- time on a non-ssd host

Explain:

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 88365
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: afl
         type: eq_ref
possible_keys: PRIMARY,afl_filter,afl_filter_2
          key: PRIMARY
      key_len: 8
          ref: aflg.afl_id
         rows: 1
        Extra: 
*************************** 3. row ***************************
           id: 1
  select_type: PRIMARY
        table: af
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: enwiki.afl.afl_filter
         rows: 1
        Extra: Using where
*************************** 4. row ***************************
           id: 2
  select_type: DERIVED
        table: abuse_filter_log
         type: range
possible_keys: NULL
          key: afl_filter_2
      key_len: 66
          ref: NULL
         rows: 88365
        Extra: Using index for group-by

Rows read:

+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 1     |
| Handler_read_key           | 3292  |
| Handler_read_last          | 1     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 823   |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 1648  |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 1646  |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
26 rows in set (0.06 sec)

(most of the temporary table is for the last order by)

The string-to-int change is not a blocker, but I would recommend doing it afterwards too, as the current type can have negative effects, as shown above.

With this I consider the optimization help "fully done", and you can contact us back when you are ready for the schema change (it should take only 2-3 minutes per host).

@jcrespo Thanks for the deep and useful analysis! I'll submit the patch for the schema change later or tomorrow. Shall I open a new ticket for that?

The schema change (the code and its deployment) is up to you and how you organize yourselves; for the deployment of a schema change on WMF servers we do indeed require a separate ticket tagged Schema-change-in-production, for organization purposes. Details on the workflow at https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change

Oh, I just realized that I computed the "last hit" based on the ID, not the timestamp. I hope that works; if not, the idea would be very similar (and maybe you wouldn't need the new index). I just thought that the ID would be a more accurate representation of the "last hit".

E.g.: this may work without the new index, as I use timestamp instead of id:

EXPLAIN SELECT af.*, afl.*  FROM abuse_filter af JOIN abuse_filter_log afl ON af.af_id = afl.afl_filter JOIN (SELECT afl_filter, MAX(afl_timestamp) AS afl_timestamp FROM abuse_filter_log GROUP BY afl_filter) as aflg ON afl.afl_timestamp = aflg.afl_timestamp AND afl.afl_filter = aflg.afl_filter ORDER BY af.af_id\G
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 1     |
| Handler_read_key           | 3292  |
| Handler_read_last          | 1     |
| Handler_read_next          | 1279  |
| Handler_read_prev          | 0     |
| Handler_read_retry         | 0     |
| Handler_read_rnd           | 834   |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 1659  |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 1657  |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
26 rows in set (0.05 sec)

Yes, I meant a new task with Schema-change-in-production, thanks. As for the ID/timestamp, the ID is of course better to use. I think the only reason the query used afl_timestamp is the filter_timestamp index.

^see my comment update (the table already has some index overhead).

I tried to implement the query above in IndexPager, but I cannot get it to work... I pushed a new version of the change as a proof of concept, but IndexPager doesn't do what we expect. Instead, it executes the following query:

SELECT af_id,af_enabled,af_deleted,af_pattern,af_global,af_public_comments,af_hidden,af_hit_count,af_timestamp,af_user_text,af_user,af_actions,af_group,af_throttled FROM `abuse_filter` WHERE af_deleted = '0' AND af_global = '0' ORDER BY af_id LIMIT 51

where join_conds are completely ignored. I'm unsure whether this is some IndexPager limitation, or I just did it wrong (or both).

Yeah, whether the MediaWiki SQL abstraction supports it or not is a different beast; you should ask a MediaWiki expert (that's not me :-)) whether JOINing a subquery is supported, has a separate syntax, or can be overridden in some way.

Unlicking the cookie; I'm not currently working on this.

LSobanski claimed this task.
LSobanski subscribed.

Resolving. Please reopen when a Schema-change-in-production ticket is available.

@LSobanski I'm sorry, this wasn't meant as a Schema-change task, but rather as a feature request. The DBA part is done, so it's possible to untag if necessary, although the task itself is not resolved.

OTOH, I'm going to re-check the feasibility using IndexPager, although I'm probably going to add a dependency on T220791.

Daimona changed the task status from Open to Stalled.Oct 19 2020, 4:14 PM
Daimona claimed this task.

Based on some quick experiments, using the afl_filter_id + afl_global combination instead of afl_filter doesn't really solve the problem here. The only difference is that we'd be specifying afl_global = 0 in every ON and WHERE clause, hence reducing the number of scanned rows; on the remaining rows, the query would still be awful.

So we still need a new index as per T93564#4932358, except it wouldn't be

add index (`afl_filter`, `afl_id`)

but

add index (`afl_global`,`afl_filter_id`, `afl_id`)

(I think).
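
For reference, here is a sketch of how the group-wise-maximum query from T93564#4932358 might translate to the new schema; this is an assumption on my part, not a tested query:

SELECT af.af_id, af.af_public_comments, afl.afl_timestamp AS last_hit
  FROM abuse_filter af
  JOIN abuse_filter_log afl
    ON afl.afl_filter_id = af.af_id AND afl.afl_global = 0
  JOIN (SELECT afl_filter_id, MAX(afl_id) AS afl_id
          FROM abuse_filter_log
         WHERE afl_global = 0
         GROUP BY afl_filter_id) AS aflg
    ON aflg.afl_id = afl.afl_id
 ORDER BY af.af_id;
-- Relies on the (afl_global, afl_filter_id, afl_id) index above, so the GROUP BY
-- in the derived table can be resolved from the index alone.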

So I'm considering this stalled until the new schema is in place, and at that point I'll investigate whether IndexPager could allow this query.

@Daimona do we have a schema change task for that? I have not been able to find it under Schema-change-in-production tag (I believe you are familiar with the schema change process, but leaving this here just in case https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change)
Thanks!

@Daimona do we have a schema change task for that? I have not been able to find it under Schema-change-in-production tag (I believe you are familiar with the schema change process, but leaving this here just in case https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change)
Thanks!

Not yet, no. I'm going to submit a patch, create the task etc. once the old schema is removed (i.e. afl_filter is dropped), so we don't have to add support for two different schemas. Note, this is still far away, as we need to merge the patch that starts writing the new schema, run the migration script, give it some time (i.e. one MW release or two), and then remove the BC code. This is likely going to take at least 1 year, based on the previous times when I had to do something similar.

Excellent, thanks for clearing that up!
Also thanks for the timeframe, that helps planning!

Unsubscribing DBA as currently there is no action for us. Please add us back if you need any further assistance.

Jdforrester-WMF changed the task status from Stalled to Open.Sep 25 2021, 2:07 AM
Jdforrester-WMF subscribed.

No longer stalled!