Page MenuHomePhabricator

Some Labs DB user_properties view fields are sensitive
Closed, ResolvedPublic

Description

Run the following query from labs

select user_name, up_value from user_properties inner join user on up_user = user_id where up_property = 'timecorrection' and up_value != '' limit 30;

Gives a list of users and their timezone. This seems like a privacy breach as timezone is strongly correlated with location. You don't even have to do the correlation yourself. Consider the following query.

select user_name, up_value from user_properties inner join user on up_user = user_id where up_property = 'timecorrection' and up_value like '%America/Los_Angeles%' limit 30;

Now you know a bunch of users who live in California.

I think this preference should be considered private and redacted from the tool labs db.


From:

up_property in ( 'disablemail', 'fancysig', 'gender', 'language', 'nickname', 'skin', 'timecorrection', 'variant' )

to:

up_property in ( 'disablemail', 'fancysig', 'gender', 'nickname' )

Proposed for removal: language, skin, timecorrection and variant

Event Timeline

Bawolff created this task.Nov 14 2016, 7:13 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptNov 14 2016, 7:13 PM

[Tagging as DBA. I'm not really sure how the work for labs replication is divided, so please remove that tag if its the inappropriate one]

Restricted Application added a project: Cloud-Services. · View Herald TranscriptNov 14 2016, 7:14 PM
Anomie added a subscriber: Anomie.Nov 14 2016, 7:27 PM

See also T60196: Tool Labs: Provide anonymized view of the user_properties table where "timecorrection" was approved for exporting.

It appears that the list of exposed properties is currently 'disablemail', 'fancysig', 'gender', 'language', 'nickname', 'skin', 'timecorrection', and 'variant'

MariaDB [enwiki_p]> show create table user_properties\G
*************************** 1. row ***************************
                View: user_properties
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`viewmaster`@`%` SQL SECURITY DEFINER VIEW `user_properties` AS select `enwiki`.`user_properties`.`up_user` AS `up_user`,`enwiki`.`user_properties`.`up_property` AS `up_property`,`enwiki`.`user_properties`.`up_value` AS `up_value` from `enwiki`.`user_properties` where (`enwiki`.`user_properties`.`up_property` in ('disablemail','fancysig','gender','language','nickname','skin','timecorrection','variant'))
character_set_client: utf8
collation_connection: utf8_general_ci
1 row in set (0.00 sec)

@Krinkle seems to suggest on the other bug that these properties are available through other means, however I'm not aware of where in MW that language, variant, skin, and timecorrection are exposed.

@Krinkle seems to suggest on the other bug that these properties are available through other means, however I'm not aware of where in MW that language, variant, skin, and timecorrection are exposed.

@Bawolff I stand by that disablemail, fancysig, gender, nickname are not an issue and available through other means. However I agree I do not see any way that timecorrection, language, variant or skin are exposed through other means.

In addition, I fail to see any use case for needing them in Tool Labs. Especially when talking about larger queries for multiple users.

The only purpose I can think of is personalisation when a tool wants to use this for one individual user. E.g. when assisting a user in making semi-automated edits. In that case, the tool in question should use OAUth to authenticate the user and query the relevant properties from the MediaWiki API.

I would support immediate removal of these properties from the public user_properties view. We should announce this to the labs-announce mailing list as well.

Krenair renamed this task from tool labs exposes user timezone preferences which is potentially sensitive (?) to labs DB views expose user timezone preference which is potentially sensitive (?).Nov 15 2016, 8:13 PM

I agree there are no problem with disablemail gender nickname and fancysig.

Skin probably isnt super sensitive although mw does keep it private

Marostegui added a subscriber: Marostegui.

I am moving this to "Not DB team" (but leaving the DBA tag) as this is normally handled by the Labs Team as it involves changing the view (from what I understand).
Of course, we are here if needed.

chasemp added a subscriber: chasemp.EditedNov 17 2016, 1:41 PM

I don't feel like I'm the arbiter of this in any meaningful way. It seems like ideally fields would be vetted by legal, security, or research (who are generally savvy about this kind of thing) or all of them. If the proposal is to remove this column then someone should solicit labs-l looking for current users (seeking the why), loop in the aforementioned team(s), amend maintain-views.yaml, and then we would run maintain-views w/ --replace-all on user_properties.

I cannot agree more with Chase's point of view- for years, I have been saying that labs has been treated as a not-so-important piece of infrastructure (except for the users of it), but the truth is that having production data means that it has to be the same level of priority than any other production infrastructure. I disused a similar action plan here:

T103011#2296587

and I would encourage to continue disusing there (for external reasons -staff leaving- that got freezed) for workflow improvements. But that discussion cannot be postponed more unless issues will continue popping in.

Bawolff added a comment.EditedNov 17 2016, 3:37 PM

! In T150679#2802425, @chasemp wrote:
I don't feel like I'm the arbiter of this in any meaningful way. It seems like ideally fields would be vetted by legal, security, or research (who are generally savvy about this kind of thing) or all of them. If the proposal is to remove this column then someone should solicit labs-l looking for current users (seeking the why), loop in the aforementioned team(s), amend maintain-views.yaml, and then we would run maintain-views w/ --replace-all on user_properties.

If we do consider this a data breach we should not consult labs-l until after data has been removed.

I cannot agree more with Chase's point of view- for years, I have been saying that labs has been treated as a not-so-important piece of infrastructure (except for the users of it), but the truth is that having production data means that it has to be the same level of priority than any other production infrastructure. I disused a similar action plan here:
T103011#2296587
and I would encourage to continue disusing there (for external reasons -staff leaving- that got freezed) for workflow improvements. But that discussion cannot be postponed more unless issues will continue popping in.

For future reference, please tag any bug that needs security team input with either Security-Team-Reviews (if it involves auditing something) Security (if its an actual security bug) Security-Team (for anything else). Just subscribing people can make things get lost very easily.

I agree auditing labs for data leaks from production is just as important as auditing production. There are plenty of things in production that should get audited but arent due to lack of resources. However this is about a concrete issue which should take priority over the general issue.

@dpatrick could you possibly advise on a best course of action in this case?

@dpatrick could you possibly advise on a best course of action in this case?

In this instance, we should remove the column ASAP without consultation as Brian mentions. We're also notifying Legal to make sure this is tracked, small and contained as it is.

@chasemp, @jcrespo I concur with your assertions that there should be a process for determining what is included and what is not, but that will need to be addressed separately from the more immediate concern of this exposed field.

dpatrick triaged this task as High priority.Nov 22 2016, 10:00 PM

we should remove the column ASAP

This is not a column.

Bawolff added a comment.EditedNov 23 2016, 3:15 PM

@dpatrick clearly means we should remove these 3 4 values from the list of allowed values in the public view definition of user_properties.

Would someone be able to do this? I dont know enough about labs infrastructure to do this discretely (or for that matter if i even have the technical ability to do this discretely)

Krenair added a subscriber: Krenair.EditedNov 23 2016, 3:25 PM

timecorrection is one of the three, please can you clarify what the other two are?
As for the process, I imagine it would involve disabling puppet on the labsdb hosts, modifying the user_properties section of /etc/maintain-views.yaml, and running maintain-views across all databases on each host. Then to publish once it's verified fixed, edit modules/role/templates/labsdb/maintain-views.yaml in the puppet repository and re-enable puppet. All of this has to be done by ops (well, I guess someone else could upload the public puppet patch afterwards, but other than thaat)

chasemp added a comment.EditedNov 23 2016, 3:37 PM

Can someone put up a patch for maintain-views.yaml and we can agree on what things need to look like? I'm not sure what the desired outcome there is at the moment.

edit: or even attach to this task if that's preferred

To clarify, I meant the 4 user preferences that are considered secret by MW but exposed at labs: timecorrection, language, variant, skin.

FWIW skin and language are used by tools like https://en.wikipedia.org/wiki/Wikipedia:Database_reports/User_preferences which is why I suspect they were made public in the first place.

hmm. I think that's best handled by bringing that database report into core as a query special page.

FWIW skin and language are used by tools like https://en.wikipedia.org/wiki/Wikipedia:Database_reports/User_preferences which is why I suspect they were made public in the first place.

The user_properties_anon view would suffice for that. We can add skin and language to there instead (it seems they're currently missing there).

hmm. I think that's best handled by bringing that database report into core as a query special page.

I could generate a report on sanitarium (production) the same way I generate it for watchlist_count. If we take away a functionality and we say "we will add a new one in the future, but it is not yet there", people are going to get angry.

hmm. I think that's best handled by bringing that database report into core as a query special page.

I don't think it's a good fit for MediaWiki core. It's a slow query with a lot of properties to potentially track. Which ones to aggregate and how to present them is best left to users to do in a way they like. We could generate a periodic dump from production, but it wouldn't be very useful imho. Users would still a technical person to take those reports and make a user interface for them.

Given the ease of querying user_properties_anon, we might as well expose it there and leave it at that. Any given tool could cache queries for a short amount of time (or perhaps we could add an index to the table). I'd say it's better done done as a Tool Labs tool instead of having a bot edit a wiki page, but oh well.

I agree we should remove this data from the user_properties table, but we should definitely add it to user_properties_anon. I don't know why it isn't there already. It serves the same use case and is in fact where those on-wiki "Wikipedia:Database_reports" get their data for other user properties such as Gadgets - except for skin/language, which they have been forced to query from user_properties instead since the switch from Toolserver to Tool Labs - something we now have the chance to rectify.

or perhaps we could add an index to the table

If user_properties_anon is a view, views are not indexable. A summary table like watchlist_count is a better solution, and indexable.

Any given tool could cache queries

Horrible queries are executed on labs because views. And yes, people could also have cached user_properties to their own tables, like being stored forever on quarry resultsets.

Krinkle added a comment.EditedNov 25 2016, 7:57 PM

or perhaps we could add an index to the table

If user_properties_anon is a view, views are not indexable. A summary table like watchlist_count is a better solution, and indexable.

I don't mind which way user_properties_anon is implemented. If the queries are slow enough to make this a worthwhile investment to convert to a summary table, sounds good to me :)

Horrible queries are executed on labs because views. And yes, people could also have cached user_properties to their own tables, like being stored forever on quarry resultsets.

Quarry sounds like a good use for this kind of query indeed. It would allow easy sharing and making public of the aggregates for different bots to then publish on-wiki if they still prefer it to be on a wiki page (without needing to re-run the query). If they go with the tool approach and present it on a tools web page instead, they can query directly and cache anywhere.

Either way, does this mean we can add these two up_property values to the user_properties_anon whitelist before or after they are removed from user_properties?

hmm. I think that's best handled by bringing that database report into core as a query special page.

I don't think it's a good fit for MediaWiki core. It's a slow query with a lot of properties to potentially track. Which ones to aggregate and how to present them is best left to users to do in a way they like. We could generate a periodic dump from production, but it wouldn't be very useful imho. Users would still a technical person to take those reports and make a user interface for them.
Given the ease of querying user_properties_anon, we might as well expose it there and leave it at that. Any given tool could cache queries for a short amount of time (or perhaps we could add an index to the table). I'd say it's better done done as a Tool Labs tool instead of having a bot edit a wiki page, but oh well.
I agree we should remove this data from the user_properties table, but we should definitely add it to user_properties_anon. I don't know why it isn't there already. It serves the same use case and is in fact where those on-wiki "Wikipedia:Database_reports" get their data for other user properties such as Gadgets - except for skin/language, which they have been forced to query from user_properties instead since the switch from Toolserver to Tool Labs - something we now have the chance to rectify.

I'm fine with whatever solution is easiest. I just want non public preferences to not be directly associated with the user in question as soon as possible.

Either way, does this mean we can add these two up_property values to the user_properties_anon whitelist before or after they are removed from user_properties?

Yes, I think adding skin, variant and language to user_properties_anon whitelist makes sense. I would like to not add timecorrection unless someone can come up with a persuasive use case, since that property is so much more sensitive (due to location correlation).

chasemp renamed this task from labs DB views expose user timezone preference which is potentially sensitive (?) to Labs DB view user_properties exposes timezone (timecorrection).Nov 28 2016, 4:20 PM
chasemp renamed this task from Labs DB view user_properties exposes timezone (timecorrection) to Some Labs DB user_properties view fields are sensitive.Nov 28 2016, 5:07 PM
chasemp updated the task description. (Show Details)

I am going to honor this today if no one stops me:

Proposed for removal: language, skin, timecorrection and varient

Post I will send notice to labs-announce and I will refer to security@wikimedia.org as the point of contact for concerns.

I am going to honor this today if no one stops me:

Proposed for removal: language, skin, timecorrection and varient

Post I will send notice to labs-announce and I will refer to security@wikimedia.org as the point of contact for concerns.

Done.

https://phabricator.wikimedia.org/P4525

coren added a subscriber: coren.EditedNov 29 2016, 4:14 PM

I'm a bit surprised nobody searched Phab on the topic first given that T66115 is specifically about whether to include those properties.

Search is so bad on phab that at this point i assume it wont tell me anything.

Can this task be made public now?

I am going to honor this today if no one stops me:

Proposed for removal: language, skin, timecorrection and varient

Post I will send notice to labs-announce and I will refer to security@wikimedia.org as the point of contact for concerns.

Done.
https://phabricator.wikimedia.org/P4525

Thanks @chasemp.

For future reference, announcement is https://lists.wikimedia.org/pipermail/labs-announce/2016-November/000188.html.

Krinkle updated the task description. (Show Details)Nov 30 2016, 8:54 PM
jcrespo added a comment.EditedDec 8 2016, 2:53 PM

As a followup, Data retention https://meta.wikimedia.org/wiki/Data_retention_guidelines policy should be updated to include (in addition to the email) those fields for users that willingly shared their gender, etc. information "Until user deletes/changes the account setting.", or clarify that those are not deleted automatically, but the user can delete them when they want.

I think this can be closed now, we can followup on T152043 and other tickets.

chasemp closed this task as Resolved.Mar 16 2017, 8:04 PM
chasemp claimed this task.

I think this can be closed now, we can followup on T152043 and other tickets.

agreed

Bawolff changed the visibility from "Custom Policy" to "Public (No Login Required)".
sbassett moved this task from Backlog to Done on the Privacy board.Oct 16 2019, 5:41 PM