Investigate what to do about the AbuseFilter log revealing someone's IP address via historical logs
Open, Needs Triage, Public

Description

Background

T363906 introduces the concept of variables that have PII, specifically a user_unnamed_ip variable, for use when temporary accounts are enabled, since user_name will no longer be the IP address. (This will not be available for fully registered users, just temporary users.)

The IP address in the filter and in the filter details will only be readable by users who have access to reveal IP addresses, as will the logs for that filter being triggered. In accordance with our policy of deleting IP addresses after a fixed time, the value will be stored in afl_ip (separately from the rest of the data, which is in afl_var_dump), so that it can be purged after the fixed time.

However, as it stands, logs will be visible forever, so whoever can read a filter containing the IP address can see who triggered the filter from that address or range.

Is this a problem?

This was brought up in T363906#9782548.

There's a comparable case in CheckUser, where it can be accurately guessed from the CheckUser logs which users are associated with which IP addresses, even after the IPs have been removed. This case is arguably worse, if the barrier to triggering a filter is lower than the barrier to triggering a CheckUser investigation.

What can be done?

Some suggestions:

  • Do nothing
  • Purge any logs for filters containing IP addresses after a fixed time
  • Remove the filter ID from any logs for filters containing IP addresses after a fixed time

See also: T234155#9720590

Event Timeline

Tchanders added a subscriber: Dreamy_Jazz.

Thanks @Dreamy_Jazz for discussing this with me.

After discussing with @STran, our thoughts are:

The "do nothing" approach could be changed at some later point in time, given more time/resources to work on the problem.

cc @Tchanders @Dreamy_Jazz

kostajh claimed this task.

Notes:

  • We need to remove the connection between a user name and their IP address after 90 days.
  • The connection can be found by looking at the user name and filter ID in the log line, and inspecting the contents of the filter with that ID, at the timestamp of the log. This could reveal the user's IP.
  • The logs are stored in the abuse_filter_log table.
  • The triggering variables are stored in the afl_var_dump field. This is stored in ExternalStorage which cannot be changed. However, since this patch, the IP address is not stored here. Instead we store true if the IP address triggered the filter.
  • We could remove the connection between the user name and the IP after 90 days by modifying the entry in the abuse_filter_log table to remove either the filter ID or the performer name and ID.

I think it would be nice if we could swap the performer name and ID out for a generic "A temporary account" string, e.g. "A temporary account triggered filter {number}".
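
As a rough sketch of what that swap could look like (not the actual AbuseFilter code): the example below uses an in-memory SQLite table with only a handful of the abuse_filter_log columns. The function names, the set of "sensitive" filter IDs, the placeholder user ID of 0, and the assumption that temporary account names start with "~" (as in ~2024-7) are all illustrative.

```python
import sqlite3
from datetime import datetime, timedelta

# Generic performer string, following the suggestion above.
PLACEHOLDER = "A temporary account"


def cutoff_timestamp(now, retention_days=90):
    """MediaWiki-style timestamp (YYYYMMDDHHMMSS) retention_days before now."""
    return (now - timedelta(days=retention_days)).strftime("%Y%m%d%H%M%S")


def anonymize_old_rows(conn, now, sensitive_filter_ids, retention_days=90):
    """Swap performer name/ID for a placeholder on old temp-account rows.

    Only rows for the given filter IDs (hypothetically, filters that use
    user_unnamed_ip) and older than the retention window are touched.
    Returns the number of rows changed.
    """
    ids = sorted(sensitive_filter_ids)
    if not ids:
        return 0
    marks = ",".join("?" for _ in ids)
    # Assumption: temporary account names start with '~'.
    sql = (
        "UPDATE abuse_filter_log SET afl_user = 0, afl_user_text = ? "
        "WHERE afl_timestamp < ? "
        f"AND afl_filter_id IN ({marks}) "
        "AND afl_user_text LIKE '~%'"
    )
    cur = conn.execute(sql, [PLACEHOLDER, cutoff_timestamp(now, retention_days), *ids])
    conn.commit()
    return cur.rowcount


def make_demo_db():
    """Build a tiny in-memory stand-in for abuse_filter_log (subset of columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE abuse_filter_log ("
        "afl_id INTEGER PRIMARY KEY, afl_filter_id INTEGER, "
        "afl_user INTEGER, afl_user_text TEXT, afl_timestamp TEXT)"
    )
    conn.executemany(
        "INSERT INTO abuse_filter_log VALUES (?, ?, ?, ?, ?)",
        [
            (1, 3, 8, "~2024-7", "20241205163742"),   # old, sensitive filter
            (2, 3, 9, "~2025-3", "20250520120000"),   # still within 90 days
            (3, 5, 10, "~2024-9", "20241101000000"),  # old, other filter
            (4, 3, 11, "Alice", "20241101000000"),    # old, registered user
        ],
    )
    return conn
```

Running anonymize_old_rows(make_demo_db(), datetime(2025, 6, 1), {3}) would only rewrite the first row. Note that this alone does not address tags or other side channels discussed in this task.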

I wonder if it's possible to try to maintain both bits of information even if they're no longer associable. E.g., if a log reads: ~2024-7 triggered filter 3, performing the action "edit" on Main Page2. Actions taken: Disallow; Filter description: 3, it would be nice to be able to:

  • search for ~2024-7 and see something like ~2024-7 triggered a filter
  • search for 3 (the filter id) and see something like a user triggered filter 3, performing the action "edit" on Main Page2. Actions taken: Disallow; Filter description: 3

I think both provide valuable information - the first is the account's abuse history and the latter is the filter's history. Unfortunately, this is all stored on one row:

+--------+------------+---------------+----------+---------------+-----------+------------+-------------+--------------+----------------+---------------+------------+----------+-------------+------------------+------------+
| afl_id | afl_global | afl_filter_id | afl_user | afl_user_text | afl_ip    | afl_action | afl_actions | afl_var_dump | afl_timestamp  | afl_namespace | afl_title  | afl_wiki | afl_deleted | afl_patrolled_by | afl_rev_id |
+--------+------------+---------------+----------+---------------+-----------+------------+-------------+--------------+----------------+---------------+------------+----------+-------------+------------------+------------+
|      1 |          0 |             3 |        8 | ~2024-7       | 127.0.0.1 | edit       | disallow    | tt:8         | 20241205163742 |             0 | Main_Page2 | NULL     |           0 |                0 |       NULL |
+--------+------------+---------------+----------+---------------+-----------+------------+-------------+--------------+----------------+---------------+------------+----------+-------------+------------------+------------+

So to do this, we'd probably have to double up the rows, e.g.

+--------+------------+---------------+----------+---------------+-----------+------------+-------------+--------------+----------------+---------------+------------+----------+-------------+------------------+------------+
| afl_id | afl_global | afl_filter_id | afl_user | afl_user_text | afl_ip    | afl_action | afl_actions | afl_var_dump | afl_timestamp  | afl_namespace | afl_title  | afl_wiki | afl_deleted | afl_patrolled_by | afl_rev_id |
+--------+------------+---------------+----------+---------------+-----------+------------+-------------+--------------+----------------+---------------+------------+----------+-------------+------------------+------------+
|      1 |          0 |             3 |        8 | user          | 127.0.0.1 | edit       | disallow    | tt:8         | 20241205163742 |             0 | Main_Page2 | NULL     |           0 |                0 |       NULL |
+--------+------------+---------------+----------+---------------+-----------+------------+-------------+--------------+----------------+---------------+------------+----------+-------------+------------------+------------+
|      1 |          0 |             3 |       -1 | ~2024-7       | 127.0.0.1 | edit       | disallow    | tt:8         | 20241205163742 |             0 | Main_Page2 | NULL     |           0 |                0 |       NULL |
+--------+------------+---------------+----------+---------------+-----------+------------+-------------+--------------+----------------+---------------+------------+----------+-------------+------------------+------------+

This would probably cause a bunch of downstream problems, but it still might be worth considering. The alternative, as stated, is to pick whichever one we think is more important and remove the other value after 90 days.
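
A minimal sketch of the "double up" idea, using plain dicts in place of database rows. This is one possible reading of the proposal: the account-history copy drops the filter ID, and the filter-history copy drops the performer. The function name, the "user" placeholder and the user ID of 0 are assumptions for illustration only.

```python
from typing import Dict, Tuple

EXAMPLE_ROW = {
    "afl_id": 1,
    "afl_filter_id": 3,
    "afl_user": 8,
    "afl_user_text": "~2024-7",
    "afl_action": "edit",
    "afl_actions": "disallow",
    "afl_var_dump": "tt:8",
    "afl_timestamp": "20241205163742",
    "afl_title": "Main_Page2",
}


def split_log_row(row: Dict) -> Tuple[Dict, Dict]:
    """Return (account_row, filter_row) derived from one log row.

    account_row keeps the performer but forgets which filter was hit;
    filter_row keeps the filter but replaces the performer with a placeholder.
    The input row is left unmodified.
    """
    account_row = dict(row)
    account_row["afl_filter_id"] = None

    filter_row = dict(row)
    filter_row["afl_user"] = 0            # hypothetical "no user" ID
    filter_row["afl_user_text"] = "user"  # generic performer name

    return account_row, filter_row
```

Both copies still share the remaining columns, so on its own this does not stop someone from matching them back up.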

The issue is that they may be easily connected, since afl_timestamp and afl_var_dump are the same in both rows.

I think this is also resolvable. If we're in here making these sorts of edits we can delete afl_var_dump for the account info row as it'll be captured in the filter info row and I suppose similarly remove some fidelity from afl_timestamp (to the day or something).
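
Roughly, for the account info row, that could look like the sketch below. It assumes MediaWiki-style YYYYMMDDHHMMSS timestamp strings; the function name and the dict representation of a row are made up for the example.

```python
def coarsen_account_row(account_row):
    """Return a copy of the row with afl_var_dump dropped and the
    timestamp truncated to the day (time part set to 000000)."""
    coarse = dict(account_row)
    coarse["afl_var_dump"] = None
    # Keep YYYYMMDD, zero out HHMMSS.
    coarse["afl_timestamp"] = account_row["afl_timestamp"][:8] + "000000"
    return coarse
```

For the example row above, a timestamp of 20241205163742 would become 20241205000000.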

However, is it valuable to keep that generalized information? Or are the specifics very important to historical abuse logs and if we can't have that it's not worth keeping anything?


We're also going to need a way to find and purge these. The obvious solution is to purge based on the protected flag, but we're about to add more variables to that state (IP reputation variables), and those variables aren't considered sensitive the way user_unnamed_ip is. I originally argued we should keep all of these variables under the protected workflow to avoid adding complexity, but if we're going to have to purge based on the inferred attack surface, it might be better to create a sensitive flag to do so?
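
One way a separate sensitive flag could be derived, as a sketch only: scan the filter's rule text for variables from a "sensitive" list. The list contents, the function name and the ip_reputation_example variable name are hypothetical, and a plain token scan like this would also match names that appear inside string literals or comments.

```python
import re

# Assumption: only user_unnamed_ip is treated as sensitive for purging.
SENSITIVE_VARIABLES = {"user_unnamed_ip"}

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_sensitive(rule_text):
    """True if the rule text mentions any variable in SENSITIVE_VARIABLES."""
    tokens = set(IDENTIFIER.findall(rule_text))
    return bool(tokens & SENSITIVE_VARIABLES)
```

With this, a rule like user_unnamed_ip == "127.0.0.1" would be flagged for purging, while a protected rule that only uses a (hypothetical) ip_reputation_example variable would not.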

"I suppose similarly remove some fidelity from afl_timestamp (to the day or something)."

This is still not enough. Take a popular non-protected enwiki article, Tom Hardy, as an example: it hit the abuse filter 19 times in 2024 (https://en.wikipedia.org/w/index.php?title=Special:AbuseLog&wpSearchTitle=Tom+Hardy), so if a temp account edits it, the relationship is still easy to find. In addition, users can just browse the abuse log page by page (it is ordered by afl_id) to find the connection.
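
To illustrate the concern: if the account row and the filter row still share the page title, action and day, a simple join narrows the candidates, often down to one. The sketch below uses plain dicts and made-up rows; the function name is hypothetical.

```python
from collections import defaultdict


def candidate_links(account_rows, filter_rows):
    """Map each account name to the filter IDs that share its
    (page title, day, action), i.e. the filters it plausibly triggered."""
    by_key = defaultdict(list)
    for f in filter_rows:
        key = (f["afl_title"], f["afl_timestamp"][:8], f["afl_action"])
        by_key[key].append(f["afl_filter_id"])

    links = {}
    for a in account_rows:
        key = (a["afl_title"], a["afl_timestamp"][:8], a["afl_action"])
        links[a["afl_user_text"]] = list(by_key.get(key, []))
    return links


ACCOUNT_ROWS = [
    {"afl_user_text": "~2024-7", "afl_title": "Main_Page2",
     "afl_action": "edit", "afl_timestamp": "20241205000000"},
]
FILTER_ROWS = [
    {"afl_filter_id": 3, "afl_title": "Main_Page2",
     "afl_action": "edit", "afl_timestamp": "20241205163742"},
    {"afl_filter_id": 5, "afl_title": "Other_Page",
     "afl_action": "edit", "afl_timestamp": "20241205170000"},
]
```

Here candidate_links(ACCOUNT_ROWS, FILTER_ROWS) leaves filter 3 as the only candidate for ~2024-7, even though the account row's timestamp was truncated to the day.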

It is worth mentioning that revision (change) tags also stay forever, so purging logs wouldn't help if the filter also applies tags to matching edits.

It would help Legal if we could present some specific approaches for their consideration.

@STran Would it be possible to make a summary of how we could do this, taking into account the comments added so far, to present to Legal?