
User genders publicly disclosed in wiki-replicas global_preferences and user_properties tables
Closed, Declined · Public

Description

Summary
The Security-Team recently completed an audit of the configuration file maintain-views.yaml to explore whether wiki-replicas pose privacy risks to the contributors supporting Wikimedia projects. As one of its conclusions, the audit recommends that gender data be redacted from the replicas, irrespective of whether users disclosed it in their preferences, willingly or not.

Broader context
Wiki-replicas currently allow users to pull information on a user's gender, provided that it was disclosed in their local or global profile preferences. Below is an illustration of a query that lets users collect details about gender from the centralauth_p database.

SELECT gu_name, gp_value AS gender, gu_home_db AS homewiki, gu_registration
FROM global_preferences
LEFT JOIN globaluser ON gp_user = gu_id
WHERE gp_property = 'gender'
LIMIT 100;

Here is a similar query run on eswiki_p, which highlights Spanish Wikipedia editors who have disclosed their genders.

SELECT * 
FROM user_properties
WHERE up_property = 'gender'
LIMIT 100;

Event Timeline

Restricted Application added a subscriber: Aklapper.

Note this information is already made public by MediaWiki itself; this is not something specific to the replicas. See https://cs.wikipedia.org/w/index.php?title=Wikipedista:Martin_Urbanec/P%C3%ADskovi%C5%A1t%C4%9B/3&oldid=20068901.

Also note it is very hard to not reveal this information at the wikis themselves, as this is used in things like Special:Log/block in languages that are gender-sensitive.

And we literally have an API that exposes it: https://en.wikipedia.org/w/api.php?action=query&list=users&ususers=Legoktm&usprop=groups|editcount|gender
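
For readers who want to see what that API call amounts to, here is a minimal sketch in Python. It only builds the request URL from the parameters in the link above and reads the `gender` field out of a response; it performs no network request. The helper names, the added `format=json` parameter, and the sample response (including the user name `ExampleUser`) are assumptions made for illustration, not output captured from the live API.

```python
from urllib.parse import urlencode

# Endpoint taken from the link in the comment above.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_users_query_url(usernames, endpoint=API_ENDPOINT):
    """Build a list=users query URL asking for groups, edit count and gender.

    Hypothetical helper: it mirrors the parameters of the URL quoted above
    and adds format=json so the result would be machine readable.
    """
    params = {
        "action": "query",
        "list": "users",
        "ususers": "|".join(usernames),
        "usprop": "groups|editcount|gender",
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)


def extract_genders(response):
    """Map user name to the reported gender value in a parsed response.

    Users without a gender field are reported as "unknown" here; that
    fallback is an assumption of this sketch.
    """
    users = response.get("query", {}).get("users", [])
    return {u["name"]: u.get("gender", "unknown") for u in users if "name" in u}


# Illustrative response shape only; the values are invented.
sample_response = {
    "query": {
        "users": [
            {"name": "ExampleUser", "editcount": 0, "groups": [], "gender": "unknown"}
        ]
    }
}

url = build_users_query_url(["Legoktm"])
genders = extract_genders(sample_response)
```

Fetching `url` with any HTTP client and passing the decoded JSON to `extract_genders` would be the natural next step; that part is left out so the sketch stays self-contained.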

@sguebo_WMF Is this a modification of the approval given in T150679: Some Labs DB user_properties view fields are sensitive to expose that preference?

Thank you for bringing this one up, @bd808. To be honest, I wasn't even aware of that five-year-old ticket. Had I been, I would probably have framed this ticket differently. My understanding of that earlier conversation is that, unlike the language, skin, timecorrection, and variant properties, gender was not redacted because it was not considered problematic. However, our privacy policy treats gender as personally identifying information. Having that PII publicly available, whether through the API or MW templates, threatens folks' privacy and is therefore problematic, irrespective of whether we think that displaying it has negligible consequences.

I concur that the gender property is embedded in existing on-wiki practices, as Martin pointed out above, and has valuable use cases beyond just the wiki-replicas. But, as I commented in this related ticket, I don't think that the same piece of information being available elsewhere should stop us from considering ways to mitigate a privacy issue when there is one. That was the original intent behind this audit: surfacing the privacy issues that the replicas may bring or amplify, and opening the floor for considerations around mitigations.

Here's the form when you set your gender, where it clearly says "This information will be public":

[Screenshot: Screenshot 2021-06-25 at 12-14-02 Preferences - MediaWiki.png]

I don't really see how the privacy policy applies here given that clear disclosure.

For what it's worth, gender is mostly used for l10n purposes so the software can refer to people in their preferred pronouns/gender, which is why it needs to be public. This is especially important in languages where words are gendered, e.g. in Spanish my userpage is located at https://es.wikipedia.org/wiki/Usuario:Legoktm and not Usuaria. I think it would be a step in the wrong direction if we were going to make it harder to do that.
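
To make the Usuario/Usuaria point concrete, here is a small Python sketch of how a gendered namespace name could be chosen from the gender preference. It is not MediaWiki's implementation: the function names are hypothetical, and the assumption that every value other than "female" falls back to "Usuario" is mine.

```python
def user_namespace_es(gender):
    """Return the Spanish user namespace name for a gender preference.

    Assumption of this sketch: only "female" selects "Usuaria"; any other
    value, including an unset preference, falls back to "Usuario".
    """
    return "Usuaria" if gender == "female" else "Usuario"


def user_page_title_es(username, gender="unknown"):
    """Build a Spanish user page title such as "Usuario:Example"."""
    return f"{user_namespace_es(gender)}:{username}"


title_default = user_page_title_es("Example")
title_female = user_page_title_es("Example", "female")
```

The point of the sketch is that the chosen form ends up in a publicly visible page title, so the preference cannot drive localization and stay hidden at the same time.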

Coincidentally, I was discussing gendered languages and MediaWiki localization with another colleague yesterday. We ended up reading the enwiki article Grammatical gender and specifically the section on 'Distribution of gender in the world's languages'. Apparently something near 25% of human languages include some form of grammatical gender.

This is functionally a matter of preferred pronouns with a limited selection provided. MediaWiki defaults to the gender neutral "they/them". As @Legoktm has pointed out, the preferences page where it can be changed clearly states that setting an alternate is optional and public. The privacy policy referenced by @sguebo_WMF in T284943#7178176 plainly states:

To gain a better understanding of the demographics of our users, to localize our services and to learn how we can improve our services, we may ask you for more demographic information, such as gender or age, about yourself. We will tell you if such information is intended to be public or private, so that you can make an informed decision about whether you want to provide us with that information. Providing such information is always completely optional. If you don't want to, you don't have to—it's as simple as that.

As I have stated on other tickets in this series, I understand the point of view that this data is in some sense toxic when consumed in aggregate and could be used for nefarious purposes. I don't think that it is inappropriate at all for the Wikimedia movement to discuss the pros and cons and attempt to find mitigations for negative results. I do however strongly and categorically reject attempts to use argument from authority framing to announce a decided action without a discussion of the implications of that action or the avenues for disclosure of the same information that are not covered by the action.

Honestly, I think there are only two actions we can take: delete the gender property entirely, or leave things as they are. The only use case of the gender property is to customize how MediaWiki talks about you: if I block a user, the Czech for "block" will be in the male form, while if a female administrator does it, MW will present the row at Special:Log/block using the female form.

This means the gender information is necessarily public, as the logs are public as well. IIRC, we don't use the gender property for anything that's visible only to the user, which means hiding this from replicas won't really accomplish anything.

@sguebo_WMF: Any comments? Personally I'd propose to decline this task given the last comments.

If @Urbanecm is correct that

IIRC, we don't use the gender property for anything that's visible only to the user

and we warn users that their answer to the gender question will be made public, and the default behavior is "no answer" (i.e. MW does not assume a particular gender), then it seems like there is very little incremental risk in exposing the gender response in the replicas. N.B. making already public data easier to access may still be considered a privacy violation, but it seems like, in this case, there is probably not much additional harm.

@sguebo_WMF please let us know if we've missed something in the analysis here, but it seems like we might be able to decline this task.

I agree that the privacy harm is considerably lowered by the fact that users are made aware that their gender information will be made public if they choose to disclose it. Since it is my understanding that we’re comfortable moving on with that low level of privacy risk, I don't see any issue with declining the ticket.

sguebo_WMF moved this task from In Progress to Completed on the Privacy Engineering board.
sguebo_WMF added a parent task: Restricted Task. · Aug 6 2021, 5:50 PM