
Raw IPs of logged-out users disclosed in wiki-replicas
Open, Stalled, Needs Triage · Public

Description

Summary
In line with T169097, the Security-Team recently completed an audit of the configuration file maintain-views.yaml to explore whether wiki-replicas pose privacy risks for contributors to Wikimedia projects. One of its conclusions is the recommendation that raw IPs of logged-out users be redacted from wiki-replicas.

Broader context
Displaying raw IP information to the public poses obvious privacy risks. IP information can provide very accurate geolocation of contributors, and leaving it open to the public makes it easier for malicious actors to exploit that information. The two queries below easily return a list of IP addresses used on a Wikimedia project.

SELECT * FROM actor
WHERE actor_user IS NULL
LIMIT 100;
SELECT * FROM ipblocks
WHERE ipb_user = 0
LIMIT 100;

One way to address this privacy issue could be to obfuscate or add noise to IPs in the tables ipblocks, ipblocks_ipindex, ipblocks_compat, and actor. For instance, IP information is already hidden from the abuse_filter_log table through obfuscation.
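As a rough illustration of what obfuscation could look like, the sketch below replaces an IP with a keyed hash (HMAC-SHA256) and leaves other user names untouched. This is only one possible approach and not the method used for abuse_filter_log or in maintain-views.yaml; the names `SECRET_KEY`, `is_ip`, and `obfuscate_ip`, as well as the `anon-` prefix, are hypothetical.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical secret; it would have to be stored outside the replicas.
SECRET_KEY = b"example-key-not-for-production"


def is_ip(text):
    """Return True if text parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def obfuscate_ip(text, key=SECRET_KEY):
    """Replace an IP with a stable pseudonym; return other values unchanged."""
    if not is_ip(text):
        return text
    # Normalize first so equivalent spellings of one address hash identically.
    normalized = ipaddress.ip_address(text).compressed
    digest = hmac.new(key, normalized.encode("ascii"), hashlib.sha256).hexdigest()
    return "anon-" + digest[:16]


masked = obfuscate_ip("213.175.51.124")
unchanged = obfuscate_ip("Tornyy12")
```

With this kind of scheme the same address always maps to the same token, so per-address grouping still works, but any analysis that depends on address ranges is no longer possible on the masked values.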

Details

Other Assignee
odimitrijevic

Event Timeline

Restricted Application added a subscriber: Aklapper.

I don't understand why we would restrict information in the replicas while the same information is (as of writing) available in the interface. I used replica databases to evaluate wide anon-only range blocks applied to certain ISP(s), and for those tasks, having raw IPs of (logged-out) users is certainly beneficial.

abuse_filter_log is not a comparable table to actor, as afl_ip in abuse_filter_log contains IP addresses of all users, logged in or not. It is still possible to get IP addresses of logged-out editors (which are stored in afl_user_text) from abuse_filter_log, as it should be (as long as this information is available in the production interface at all).

See example:

MariaDB [cswiki_p]> select * from abuse_filter_log where afl_id in (1205372, 1205361)\G
*************************** 1. row ***************************
          afl_id: 1205361
      afl_filter:
      afl_global: 0
   afl_filter_id: 32
        afl_user: 541955
   afl_user_text: Tornyy12 # logged in user, no IP in afl_ip
          afl_ip: NULL
      afl_action: edit
     afl_actions: tag
    afl_var_dump: tt:20632834
   afl_timestamp: 20210614193300
   afl_namespace: 0
       afl_title: Dan_Večeřa
        afl_wiki: NULL
     afl_deleted: 0
afl_patrolled_by: 0
      afl_rev_id: NULL
*************************** 2. row ***************************
          afl_id: 1205372
      afl_filter:
      afl_global: 0
   afl_filter_id: 3
        afl_user: 0
   afl_user_text: 213.175.51.124
          afl_ip: NULL # this is null, but afl_user_text still has the info
      afl_action: edit
     afl_actions: tag
    afl_var_dump: tt:20632849
   afl_timestamp: 20210614193525
   afl_namespace: 0
       afl_title: Masožravá_rostlina
        afl_wiki: NULL
     afl_deleted: 0
afl_patrolled_by: 0
      afl_rev_id: 20068878
2 rows in set (0.00 sec)

MariaDB [cswiki_p]>
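The pattern in the output above (afl_ip is NULL while afl_user_text holds the address) can be checked mechanically. The following is a minimal sketch, assuming rows are available as Python dictionaries keyed by column name; the helper `leaks_ip` is hypothetical, and the two rows simply repeat the relevant values from the example.

```python
import ipaddress


def leaks_ip(row):
    """Return True if a logged-out row still exposes an IP in afl_user_text."""
    if row.get("afl_user") != 0:
        return False  # logged-in user: afl_user_text is a user name
    try:
        ipaddress.ip_address(row.get("afl_user_text") or "")
    except ValueError:
        return False
    return True


# Values copied from the two rows shown in the example output.
rows = [
    {"afl_id": 1205361, "afl_user": 541955, "afl_user_text": "Tornyy12", "afl_ip": None},
    {"afl_id": 1205372, "afl_user": 0, "afl_user_text": "213.175.51.124", "afl_ip": None},
]

leaking_ids = [r["afl_id"] for r in rows if leaks_ip(r)]
```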

I'd appreciate this being discussed with tool owners before implementing.

I don't understand why we would restrict information in the replicas while the same information is (as of writing) available in the interface. I used replica databases to evaluate wide anon-only range blocks applied to certain ISP(s), and for those tasks, having raw IPs of (logged-out) users is certainly beneficial.

This basically. The wiki replicas should just allow access to whatever the wikis allow publicly (and some exceptions). I assume this should be blocked on some other task that redacts this from wikis?

I used replica databases to evaluate wide anon-only range blocks applied to certain ISP(s), and for those tasks, having raw IP of (logged out) users is certainly beneficial.

Hey @Urbanecm, and thanks for surfacing that use case. While bearing in mind the privacy risk highlighted earlier, I can definitely see how obfuscating that data now, without offering a proper alternative, would disrupt anti-vandalism work. Therefore, I am fine with keeping this ticket stalled until IP masking (T283177) is in effect, so as not to create unnecessary disruption.

sguebo_WMF added a parent task: Restricted Task.