Page MenuHomePhabricator

AbuseFilter protected variables: Make it possible for protected variable values expire when the IP address expires
Closed, ResolvedPublic

Description

Summary

We should expire MediaWiki-extensions-IPReputation AbuseFilter variable data when the IP address expires. To do this we need to add support for protected variable values to be purged when the IP address is purged.

Background

  • It is not a WMF Legal requirement to purge MediaWiki-extensions-IPReputation variable data, but it was considered that there is a risk in keeping this data when the IP address has purged.
    • This is especially the case when the action is an account creation, as it would tie data about an IP address to a registered user forever if an AbuseFilter matched on their account creation
    • This also could be an issue for temporary accounts, where their IP address expires after 90 days
  • We need to consider when the MediaWiki-extensions-IPReputation protected variables need to be expired, and whether we should expire in all cases for consistency
  • To do this we need a mechanism to store some protected variable values in the database
    • This mechanism should ideally not be too specific to MediaWiki-extensions-IPReputation AbuseFilter variables, to avoid the need to undo existing work if other variables are added
    • However, we need to consider that we would likely need to expire the data at the same time as the IP address so may still need to bound this expiration to a standard retention period
  • This solution may also be useful to consider for the existing user_unnamed_ip variable, as we could then remove some of it's specific handling code in favour of using this method

Technical notes

  • We will need a schema change to support this to the abuse_filter_log table
    • The schema change would likely involve adding a column that acts in a similar way to the log_params column (a serialised PHP array of parameters, in this case variable names to values)
    • When reading or writing the variable dump, the protected variables that need to have their values expire would be written to this new column instead of the blob used for all other variables
  • We can re-use the afl_var_dump column by making it a JSON array if the log contains variables that need purging
    • When the values are purged, the column could be reverted back to no longer use JSON as a space saving measure. Alternatively, we could make all afl_var_dump column values use a JSON array to be consistent.
  • We will need code to remove protected variables from afl_var_dump when they should be expired
    • To support this, we need a index which partially indexes afl_var_dump (just one character should be fine) and afl_timestamp. Then the query would look for the character { at the start of afl_var_dump. If it exists then the field should be in JSON and have protected variables to purge.
      • The other methods of doing this like adding a new tinyint column or making afl_ip nullable were less liked by DBAs.

Acceptance criteria

  • AbuseFilter supports protected variables being configured to have their associated values be purged when the IP address associated with the log is purged
  • MediaWiki-extensions-IPReputation variables are configured in this way to purge their values, which should at least be used when an account creation causes a log entry

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Alternatively we can introduce a new table like abuse_filter_protected_variable, since adding a new column to abuse_filter_log will take much more disk space.

Alternatively we can introduce a new table like abuse_filter_protected_variable, since adding a new column to abuse_filter_log will take much more disk space.

DBA usually recommend against adding a new table unless necessary as each table causes more files to be opened. For example, T359309#9614963. However I can check with them what they prefer.

I'm reviewing the schema of abuse filter and I have several notes:

  • A lot of work in this regard is not done yet. I'm still seeing afl_user_text and afl_user, can we finish these migrations please?
  • Currently the tables are not that big (9.5G abuse_filter_log.ibd in enwiki. It is big but not end of the world big) but the fact that it keeps everything forever and is growing without bound is concerning. Do you think it'd make sense to remove old abuse filter logs where the action is warn or tag? what's the point of keeping them forever. Definitely keep any row that's disallowed or block
    • Do we really use afl_patrolled_by?
  • I don't know the code well enough but maybe we can store the parameter in afl_var_dump itself instead of introducing a new column?
  • can we use hard-coded smallint values for afl_action and afl_actions?
  • "a serialised PHP array of parameters" <- please don't store php serialization. We did this mistake twenty years ago and at that time json didn't exist. We are already migrating as much as possible to json instead.

This table needs a lot of love :(

I'm reviewing the schema of abuse filter and I have several notes:

  • A lot of work in this regard is not done yet. I'm still seeing afl_user_text and afl_user, can we finish these migrations please?

Hmm. I see that T188180 was recently resolved by @matej_suchanek. Any thoughts on this?

  • Currently the tables are not that big (9.5G abuse_filter_log.ibd in enwiki. It is big but not end of the world big) but the fact that it keeps everything forever and is growing without bound is concerning. Do you think it'd make sense to remove old abuse filter logs where the action is warn or tag? what's the point of keeping them forever. Definitely keep any row that's disallowed or block
  • Do we really use afl_patrolled_by?

Doesn't look like it to me. Probably could be removed.

  • I don't know the code well enough but maybe we can store the parameter in afl_var_dump itself instead of introducing a new column?

afl_var_dump is a blob store address. We could re-use this column and store the blob store address in a JSON array when protected variables are stored. Then when purging we would modify afl_var_dump instead of blanking it, such that the field is back to just being the blob store address.

  • can we use hard-coded smallint values for afl_action and afl_actions?

Not easily, because other extensions can define actions. We could convert that if wanted but there could be issues with collisions for the integers that are selected. Plus, uninstalling an extension which defines an action would break those logs.

  • "a serialised PHP array of parameters" <- please don't store php serialization. We did this mistake twenty years ago and at that time json didn't exist. We are already migrating as much as possible to json instead.

Sure. Perhaps a JSON array in afl_var_dump would work.

This table needs a lot of love :(

Yeah, unfortunately.

Given the idea of re-using afl_var_dump was suggested, I will run with that and come back to a new column if that doesn't work for any reason. I'll file a task about afl_patrolled_by.

The abuse_filter_log table is not public per T375751: Public wiki replicas contain abuse filter logs for filters that are private or protected, so we can re-use this column and someone will need to consider this if it's ever made public again. That would likely be through making all protected filter logs be hidden.

This comment was removed by Dreamy_Jazz.

I'm reviewing the schema of abuse filter and I have several notes:

  • A lot of work in this regard is not done yet. I'm still seeing afl_user_text and afl_user, can we finish these migrations please?

Hmm. I see that T188180 was recently resolved by @matej_suchanek. Any thoughts on this?

I didn't initiate that migration simply because it's more complicated. You need to deal with non-IP afl_user_text where afl_user = 0, and references from metawiki for global filter hits. I was hoping to collect all the ideas on this table in a new task to kick it off, but I never got to it and you guys were faster.

This table needs a lot of love :(

Yeah, unfortunately.

^

I've filed T391027 for the afl_patrolled_by column.

Dreamy_Jazz removed a project: Schema-change.

Removing Schema-change from this ticket based on the updates.

Something to note is that once you write to blob store, you can't change the content of it anymore. Our external storage databases are append-only and will become read only once they get filled. Think of it like blockchain but actually useful. We can store a new entry and set that though. Would that be good enoughTM?

Would that be good enoughTM?

Maybe, but I'm thinking not. We want to permanently purge the value of the variables after 90 days to be consistent with how afl_ip is purged. If we write the values to the append-only storage, then we can't properly purge the data. Just unlinking the entry won't be enough, as I presume you could search through the blob store entries and find the one that matches variables in the new blob store (i.e. match the value of the new_wikitext variable).

A scenario is a user creates their account using an VPN, where the specific VPN provider is noted in the variable value. That data would never be fully purged, which I think would be an issue for wikis where some users are being targeted and some kind of leak of the blob store occurs.

Purging isn't a strict requirement, so if it's not possible to do the JSON values in afl_var_dump then it could be re-considered.

Would that be good enoughTM?

Maybe, but I'm thinking not. We want to permanently purge the value of the variables after 90 days to be consistent with how afl_ip is purged. If we write the values to the append-only storage, then we can't properly purge the data. Just unlinking the entry won't be enough, as I presume you could search through the blob store entries and find the one that matches variables in the new blob store (i.e. match the value of the new_wikitext variable).

We have roughly 50-60TB of gzipped text in ES. Searching in it is not really feasible. Unless you have the key to the row.

A scenario is a user creates their account using an VPN, where the specific VPN provider is noted in the variable value. That data would never be fully purged, which I think would be an issue for wikis where some users are being targeted and some kind of leak of the blob store occurs.

Purging isn't a strict requirement, so if it's not possible to do the JSON values in afl_var_dump then it could be re-considered.

One thing that you can do is to put the ip in afl_var_dump too. so it'll be like {'var_name_ip': 127.0.0.1, 'purge_timestamp': 1234, 'the rest of values': 'tt:1234'} if afl_var_dump can take it, that should work?

It's a blob:

`afl_var_dump` blob NOT NULL

Go crazy.

Thanks. In which case, we will store JSON in afl_var_dump when we need to expire the variable value. Given that, I'd say that DBA input is no longer needed given that we won't be making a Schema-change.

Awesome! I stay subscribed in case you need any help!

As I said above, we can alternatively introduce a new table for protected variable so this can be more easily purged (without enumbrating the abuse_filter_log table).

As I said above, we can alternatively introduce a new table for protected variable so this can be more easily purged (without enumbrating the abuse_filter_log table).

I saw and as Dreamy Jazz pointed out, it shouldn't be done.

Having looked at this task further the requirement to purge these variables is not mandated by WMF Legal. Given that we still want to purge the data we can use the easiest solution where the data isn't technically fully purged from blob store but is no longer associated with the AbuseFilter log entry:

  1. Saving all variables to the blob store and then placing that address in afl_var_dump
  2. When afl_ip is purged then create a new blob store address that no longer has the IPReputation variables

This will be the easiest technically to achieve, given that using a JSON string would require modifying the UpdateVarDumps.php maintenance scripts to support this JSON format. Plus it adds more tech debt.

If you're sure that it only gets done on logged-out edits/edit attempts. Then it's fine but if it accidentally starts creating two ES blobs for every afl entry, then I'd really prefer having the json blobs in afl_var_dump. ES is horizontally scalable but it's not that cheap.

If you're sure that it only gets done on logged-out edits/edit attempts. Then it's fine but if it accidentally starts creating two ES blobs for every afl entry, then I'd really prefer having the json blobs in afl_var_dump. ES is horizontally scalable but it's not that cheap.

We intend to at the moment only set it for logged-out editors and for account creations. However, we have a long-term plan to support these variables for registered users if T234155: Create CheckUser-level abuse filters is implemented.

Given that I'll proceed with the JSON approach to avoid having to undo work in the future.

Change #1140198 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/AbuseFilter@master] [Very WIP] Store protected variable values in DB

https://gerrit.wikimedia.org/r/1140198

Change #1148412 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/AbuseFilter@master] Drop the UpdateVarDumps.php maintenance script

https://gerrit.wikimedia.org/r/1148412

Change #1148412 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Drop the UpdateVarDumps.php maintenance script

https://gerrit.wikimedia.org/r/1148412

Change #1152154 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/AbuseFilter@master] Make afl_ip in abuse_filter_log nullable

https://gerrit.wikimedia.org/r/1152154

Dreamy_Jazz moved this task from Unsorted to Change on the Schema-change board.
Dreamy_Jazz updated the task description. (Show Details)

Change #1152162 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/AbuseFilter@master] [WIP] Store protected variables in DB instead of BlobStore

https://gerrit.wikimedia.org/r/1152162

Change #1152164 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/AbuseFilter@master] Update PurgeOldLogIPData to set afl_ip to null

https://gerrit.wikimedia.org/r/1152164

Change #1152164 abandoned by Dreamy Jazz:

[mediawiki/extensions/AbuseFilter@master] Update PurgeOldLogIPData to set afl_ip to null

https://gerrit.wikimedia.org/r/1152164

Change #1152154 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Add afl_var_dump_timestamp index to allow searching for JSON arrays

https://gerrit.wikimedia.org/r/1152154

Change #1152162 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Store protected variables in DB instead of BlobStore

https://gerrit.wikimedia.org/r/1152162

Change #1161978 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/AbuseFilter@master] [WIP] Rename PurgeOldLogIPData.php to PurgeOldLogData.php

https://gerrit.wikimedia.org/r/1161978

Change #1161978 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Rename PurgeOldLogIPData.php to PurgeOldLogData.php

https://gerrit.wikimedia.org/r/1161978

Change #1140198 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Purge protected variables stored in the DB

https://gerrit.wikimedia.org/r/1140198

dom_walden subscribed.

I ran the maintenance script with $wgAbuseFilterLogProtectedVariablesMaxAge = 1 to remove all the IP Reputation variables on my local wiki. I then checked that:

  • rows in abuse_filter_log with IP Reputation variables have their afl_var_dump column changed to just show the ID of the associated row in the text table
  • the associated row in the text table is unchanged
  • only users with the appropriate rights can access these abuse logs, and when they do it is logged

When you now look at a purged abuse log (e.g. in Special:AbuseLog/<log id>) all the IP Reputation variables show as true regardless of their value before purging.

Test environment: local docker Abuse Filter – (101543f) 07:51, 3 July 2025. IPReputation – (c04418c) 02:27, 21 June 2025.