Page MenuHomePhabricator

Tool Labs: Provide anonymized view of the user_properties table
Closed, ResolvedPublic

Description

Toolserver has a limited and anonimized view of the user_properties table available. This is used for reports like https://en.wikipedia.org/wiki/Wikipedia:Database_reports/User_preferences

mysql> describe user_properties;

FieldTypeNullKeyDefaultExtra
up_userint(11)NO0
up_propertyvarbinary(255)YESNULL
up_valueblobYESNULL

3 rows in set (0.00 sec)

mysql> describe user_properties_anonym;

FieldTypeNullKeyDefaultExtra
up_propertyvarbinary(255)YESNULL
up_valueblobYESNULL
ts_user_touched_croppedvarbinary(8)NO

3 rows in set (0.00 sec)

Only preferences that are considered to be public are replicated, the list from enwiki_p is:
mysql> select distinct(up_property) from user_properties;

up_property
disablemail
fancysig
gender
language
nickname
skin
timecorrection
variant

Version: unspecified
Severity: major
See Also:

Details

Reference
bz58196

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

(In reply to Luis Villa (WMF Legal) from comment #14)

Ah! Thanks for explaining more clearly, Krinkle. The first comment now makes
a lot more sense to me :)

One question on the anonymization - this data is still per-wiki, right?
Would last-active dates at the day-level allow deanonymization for those
users?

Those are two different requests: one table (view) containing user identifiable info (user id) with fewer properties includes, and another anonymized one with more properties.

Retitling then :)

For the filtered table: I am OK with this in principle, but would like to better understand (1) how we choose which fields to filter and (2) how we prevent new fields from leaking.

For the anonymized table: Again, OK in principle; it would be good (but not a must-have) to know more about the impact on users of small wikis.

So if I understand correctly, the whitelist of properties you approved in comment 12 are in fact allowed for identifiable queries (e.g. with user_id/user name attached to it), as opposed to the anonymized view. That's great news. I wasn't expecting a non-anonymized view of the properties table at all. Having one with this limited set of properties can be useful indeed.

And when you say "OK in principle" for the anonymized view, that means all properties?

Good point on the user activity measure being potentially identifying on small wikis. I suppose a month crop of the timestamp (YYYY-MM) would still be good enough for practical applications and be a lot less likely to reveal identifying information.

Though no matter how much we crop the timestamp (even if we'd crop it to the month or year), one could in theory still retrieve the preferences of one user by performing queries in a loop until the next day, month or year starts and correlate that to the one user currently editing on that wiki.. The only way I can think of that would mitigate that is forcing the current matching crop of the date being delayed (e.g. all users active today on 2014-03-17 would be listed under 2014-03-16 until this day is over, so that all active users have had a chance to be active that day).

(In reply to Krinkle from comment #17)

So if I understand correctly, the whitelist of properties you approved in
comment 12 are in fact allowed for identifiable queries (e.g. with
user_id/user name attached to it), as opposed to the anonymized view.

Yes, that is correct, for those categories.

As far as implementation, we'll want to review new fields as they are added, which is why I prefer a whitelist approach (review what is released; hard to accidentally release data accidentally) to a blacklist approach (review what is not released; easy to release data accidentally by not adding it to the blacklist).

And when you say "OK in principle" for the anonymized view, that means all
properties?

Yes, based on my understanding that they are already available on toolserver in user_properties_anonym and/or via the API. If they aren't already public (via API or toolserver) we should probably review one-by-one, and/or do a more careful scrutiny of the anonymization procedures. Maybe that deserves a separate bug?

(In reply to Luis Villa (WMF Legal) from comment #18)

(In reply to Krinkle from comment #17)

So if I understand correctly, the whitelist of properties you approved in
comment 12 are in fact allowed for identifiable queries (e.g. with
user_id/user name attached to it), as opposed to the anonymized view.

Yes, that is correct, for those categories.

As far as implementation, we'll want to review new fields as they are added,
which is why I prefer a whitelist approach (review what is released; hard to
accidentally release data accidentally) to a blacklist approach (review what
is not released; easy to release data accidentally by not adding it to the
blacklist).

And when you say "OK in principle" for the anonymized view, that means all
properties?

Yes, based on my understanding that they are already available on toolserver
in user_properties_anonym and/or via the API. If they aren't already public
(via API or toolserver) we should probably review one-by-one, and/or do a
more careful scrutiny of the anonymization procedures. Maybe that deserves a
separate bug?

anonym on Toolserver is also whitelist based. I don't have access to the original list ts_global.user_properties_anonym_whilelist, but it's possible to sniff it with select distinct up_property from user_properties_anonym; (a slow query).

Do we know how that whitelist was created?

Comparison for nlwiki.

nlwiki at wmf> DISTINCT(up_property) FROM user_properties
+--------------------------------------------+

ajaxsearch
autopatrol
betafeatures-auto-enroll
betafeatures-geonotahack
betafeatures-vector-typography-update
ccmeonemails
cirrussearch-default
cols
contextchars
contextlines
date
diffonly
disablemail
disablesuggest
echo-email-format
echo-email-frequency
echo-email-last-batch
echo-show-alert
echo-subscriptions-email-article-linked
echo-subscriptions-email-edit-thank
echo-subscriptions-email-education-program
echo-subscriptions-email-mention
echo-subscriptions-email-reverted
echo-subscriptions-web-article-linked
echo-subscriptions-web-edit-thank
echo-subscriptions-web-education-program
echo-subscriptions-web-mention
echo-subscriptions-web-reverted
editfont
editondblclick
editsection
editsectionondblclick
editsectiononrightclick
editwidth
enotifminoredits
enotifrevealaddr
enotifusertalkpages
enotifwatchlistpages
ep_bulkdelcourses
ep_bulkdelorgs
ep_showdyk
ep_showtoplink
extendwatchlist
externaldiff
externaleditor
fancysig
forceeditsummary
gadget-Achtergrondkleur
gadget-Berichtbalk
gadget-Diffkleuren
gadget-Direct-link-to-Commons
gadget-Externe_links
gadget-Extra_bewerkingsknoppen
gadget-Geavanceerde_bijdragen
gadget-GoogleTrans
gadget-HideFundraiser
gadget-HotCat
gadget-IProject
gadget-LocalLiveClock
gadget-Navigatie_popups
gadget-OldDiff
gadget-OpenStreetMapFrame
gadget-OpenStreetMapFrameUit
gadget-ProtectionTemplates
gadget-ReferenceTooltips
gadget-Sneltoetsen
gadget-Twinkle
gadget-UTCLiveClock
gadget-Uitleg_verwijderlijst
gadget-VectorSearchNav
gadget-Verbergding
gadget-WhatLeavesHere
gadget-Zeusmodus
gadget-fixhoofdbetekenis
gadget-hoofdbetekenis-titelwijziging
gadget-labelfilter
gadget-mijnKladblok
gadget-purgelink
gadget-searchFocus
gadget-updatemarker
gadget-verwijderreden
gender
hideminor
hidepatrolled
highlightbroken
hover
imagesize
justify
language
math
mfWatchlistFilter
mfWatchlistView
minordefault
moodbar-disable
multimedia-viewer
newpageshidepatrolled
nickname
nocache
norollbackdiff
numberheadings
prefershttps
previewonfirst
previewontop
quickbar
rcdays
rclimit
rcshowwikidata
rememberpassword
rows
searchNs-1
searchNs0
searchNs1
searchNs10
searchNs100
searchNs101
searchNs11
searchNs12
searchNs13
searchNs14
searchNs15
searchNs2
searchNs3
searchNs4
searchNs4294967295
searchNs446
searchNs447
searchNs5
searchNs6
searchNs7
searchNs8
searchNs828
searchNs829
searchNs9
searcheverything
searchlimit
showhiddencats
showjumplinks
shownumberswatching
showtoc
showtoolbar
skin
stubthreshold
thumbsize
timecorrection
uls-compact-links
uls-enable
uls-preferences
underline
usebetatoolbar
usebetatoolbar-cgd
useeditwarning
uselivepreview
usenewrc
userjs-already-set-common-preferences
variant
vector-collapsiblenav
vector-noexperiments
vector-simplesearch
visualeditor-betatempdisable
visualeditor-enable
visualeditor-enable-mwmath
watchcreations
watchdefault
watchdeletion
watchlistdays
watchlisthideanons
watchlisthidebots
watchlisthideliu
watchlisthideminor
watchlisthideown
watchlisthidepatrolled
watchlisttoken
watchmoves
wllimit
wlshowwikibase

+--------------------------------------------+
176 rows in set (4.95 sec)

nlwiki_p at toolserver> DISTINCT(up_property) FROM user_properties_anonym
+-----------------------------------------+

up_property

+-----------------------------------------+

ajaxsearch
autopatrol
ccmeonemails
cols
contextchars
contextlines
date
diffonly
disablemail
disablesuggest
echo-email-format
echo-email-frequency
echo-show-alert
echo-subscriptions-email-article-linked
echo-subscriptions-email-edit-thank
echo-subscriptions-email-mention
echo-subscriptions-email-reverted
echo-subscriptions-web-article-linked
echo-subscriptions-web-edit-thank
echo-subscriptions-web-mention
echo-subscriptions-web-reverted
editfont
editondblclick
editsection
editsectiononrightclick
editwidth
enotifminoredits
enotifrevealaddr
enotifusertalkpages
enotifwatchlistpages
extendwatchlist
externaldiff
externaleditor
fancysig
forceeditsummary
gadget-Achtergrondkleur
gadget-Berichtbalk
gadget-Diffkleuren
gadget-Direct-link-to-Commons
gadget-Externe_links
gadget-Extra_bewerkingsknoppen
gadget-fixhoofdbetekenis
gadget-Geavanceerde_bijdragen
gadget-GoogleTrans
gadget-HideFundraiser
gadget-hoofdbetekenis-titelwijziging
gadget-HotCat
gadget-IProject
gadget-labelfilter
gadget-LocalLiveClock
gadget-mijnKladblok
gadget-Navigatie_popups
gadget-OldDiff
gadget-OpenStreetMapFrame
gadget-OpenStreetMapFrameUit
gadget-ProtectionTemplates
gadget-purgelink
gadget-ReferenceTooltips
gadget-searchFocus
gadget-Sneltoetsen
gadget-Twinkle
gadget-Uitleg_verwijderlijst
gadget-updatemarker
gadget-UTCLiveClock
gadget-VectorSearchNav
gadget-Verbergding
gadget-verwijderreden
gadget-WhatLeavesHere
gadget-Zeusmodus
gender
hideminor
hidepatrolled
highlightbroken
hover
imagesize
justify
language
math
minordefault
newpageshidepatrolled
nocache
norollbackdiff
numberheadings
previewonfirst
previewontop
quickbar
rcdays
rclimit
rememberpassword
searcheverything
searchlimit
searchNs-1
showhiddencats
showjumplinks
shownumberswatching
showtoc
showtoolbar
skin
stubthreshold
thumbsize
timecorrection
underline
usebetatoolbar
usebetatoolbar-cgd
useeditwarning
uselivepreview
usenewrc
variant
vector-noexperiments
visualeditor-enable
watchcreations
watchdefault
watchdeletion
watchlistdays
watchlisthideanons
watchlisthidebots
watchlisthideliu
watchlisthideminor
watchlisthideown
watchlisthidepatrolled
watchmoves
wllimit

+-----------------------------------------+
122 rows in set (51.05 sec)

INTERSECT:

ajaxsearch
autopatrol
ccmeonemails
cols
contextchars
contextlines
date
diffonly
disablemail
disablesuggest
echo-email-format
echo-email-frequency
echo-show-alert
echo-subscriptions-email-article-linked
echo-subscriptions-email-edit-thank
echo-subscriptions-email-mention
echo-subscriptions-email-reverted
echo-subscriptions-web-article-linked
echo-subscriptions-web-edit-thank
echo-subscriptions-web-mention
echo-subscriptions-web-reverted
editfont
editondblclick
editsection
editsectiononrightclick
editwidth
enotifminoredits
enotifrevealaddr
enotifusertalkpages
enotifwatchlistpages
extendwatchlist
externaldiff
externaleditor
fancysig
forceeditsummary
gadget-Achtergrondkleur
gadget-Berichtbalk
gadget-Diffkleuren
gadget-Direct-link-to-Commons
gadget-Externe_links
gadget-Extra_bewerkingsknoppen
gadget-Geavanceerde_bijdragen
gadget-GoogleTrans
gadget-HideFundraiser
gadget-HotCat
gadget-IProject
gadget-LocalLiveClock
gadget-Navigatie_popups
gadget-OldDiff
gadget-OpenStreetMapFrame
gadget-OpenStreetMapFrameUit
gadget-ProtectionTemplates
gadget-ReferenceTooltips
gadget-Sneltoetsen
gadget-Twinkle
gadget-UTCLiveClock
gadget-Uitleg_verwijderlijst
gadget-VectorSearchNav
gadget-Verbergding
gadget-WhatLeavesHere
gadget-Zeusmodus
gadget-fixhoofdbetekenis
gadget-hoofdbetekenis-titelwijziging
gadget-labelfilter
gadget-mijnKladblok
gadget-purgelink
gadget-searchFocus
gadget-updatemarker
gadget-verwijderreden
gender
hideminor
hidepatrolled
highlightbroken
hover
imagesize
justify
language
math
minordefault
newpageshidepatrolled
nocache
norollbackdiff
numberheadings
previewonfirst
previewontop
quickbar
rcdays
rclimit
rememberpassword
searchNs-1
searcheverything
searchlimit
showhiddencats
showjumplinks
shownumberswatching
showtoc
showtoolbar
skin
stubthreshold
thumbsize
timecorrection
underline
usebetatoolbar
usebetatoolbar-cgd
useeditwarning
uselivepreview
usenewrc
variant
vector-noexperiments
visualeditor-enable
watchcreations
watchdefault
watchdeletion
watchlistdays
watchlisthideanons
watchlisthidebots
watchlisthideliu
watchlisthideminor
watchlisthideown
watchlisthidepatrolled
watchmoves
wllimit

DIFF 1:

betafeatures-auto-enroll
betafeatures-geonotahack
betafeatures-vector-typography-update
cirrussearch-default
echo-email-last-batch
echo-subscriptions-email-education-program
echo-subscriptions-web-education-program
editsectionondblclick
ep_bulkdelcourses
ep_bulkdelorgs
ep_showdyk
ep_showtoplink
mfWatchlistFilter
mfWatchlistView
moodbar-disable
multimedia-viewer
nickname
prefershttps
rcshowwikidata
rows
searchNs0
searchNs1
searchNs10
searchNs100
searchNs101
searchNs11
searchNs12
searchNs13
searchNs14
searchNs15
searchNs2
searchNs3
searchNs4
searchNs4294967295
searchNs446
searchNs447
searchNs5
searchNs6
searchNs7
searchNs8
searchNs828
searchNs829
searchNs9
uls-compact-links
uls-enable
uls-preferences
userjs-already-set-common-preferences
vector-collapsiblenav
vector-simplesearch
visualeditor-betatempdisable
visualeditor-enable-mwmath
watchlisttoken
wlshowwikibase

DIFF 2:

(none)

To confirm: INTERSECT here is the whitelist; DIFF 1 is properties that never made it onto the whitelist?

(In reply to Luis Villa (WMF Legal) from comment #22)

To confirm: INTERSECT here is the whitelist; DIFF 1 is properties that never
made it onto the whitelist?

That is correct.

INTERSECT are properties present in both Toolserver's anonymized view and in wikimedia production databases.

DIFF 1 are properties present in production and not visible in Toolserver anonymized view.

DIFF 2 are properties only present in Toolserver's anonymized view (none).

The Toolserver's non-anonimized filtered view has a much stricter whitelist:

nlwiki_p at toolserver> SELECT DISTINCT(up_property) FROM user_properties;
+----------------+

up_property

+----------------+

disablemail
fancysig
gender
language
nickname
skin
timecorrection
variant

+----------------+
8 rows in set

Hello! What's the status on this?

  • Bug 56491 has been marked as a duplicate of this bug. ***

(In reply to silke.meyer from comment #24)

Hello! What's the status on this?

No movement in 4 months: resetting status.

There's no reason to change the status of the bug, it remains pending an all-clear from Legal and a short implementation away.

FYI, a "ghost infrastructure" makes this data available to select users, per http://lists.wikimedia.org/pipermail/analytics/2014-April/001768.html : folks should/can ask [[wikitech:EventLogging]] access if they need it.

I'm cloning this bug so that we can have two discussions - one about the anonymized data, one about the filtered data. The discussion about a filtered view (with the handful of properties described in the original bug description and in comment #23) is now in bug 64115. I will continue the other discussion in my next comment.

So, let's keep this bug focused on the question of the "anonymized" table; i.e., the table that currently has most (but not all) rows, with userID removed but last active timestamp available.

The current proposal is to provide this table on labs. It has also been suggested that:

  1. we remove the whitelist altogether and provide everything; and/or
  2. we sample down the timestamp to YYYY-MM so that it is more difficult to map "last active" to editors on small wikis.

I still have a few concerns:

  1. I still don't know how we created the original whitelist, or how we plan to create the new one. (Note that if we trust our anonymization, having a whitelist probably isn't necessary.)
  1. I'm still concerned about the impact on anonymity of small wikis. Rounding to months helps, but could still be problematic in at least some cases (as Krinkle pointed out).

I do not have a great solution to these questions. Trying to brainstorm/think out loud here: maybe we could provide aggregate totals for all editors, for active editors, and for new editors (based on the standard analytics definition of active editors/new editors), instead of access to individually anonymized rows?

(In reply to Luis Villa (WMF Legal) from comment #31)

So, let's keep this bug focused on the question of the "anonymized" table;
i.e., the table that currently has most (but not all) rows, with userID
removed but last active timestamp available.

The current proposal is to provide this table on labs. It has also been
suggested that:

  1. we remove the whitelist altogether and provide everything; and/or
  2. we sample down the timestamp to YYYY-MM so that it is more difficult to

map "last active" to editors on small wikis.

I still have a few concerns:

  1. I still don't know how we created the original whitelist, or how we plan

to create the new one. (Note that if we trust our anonymization, having a
whitelist probably isn't necessary.)

Note that some user_properties rows are not actually preferences. eg. watchlisttoken.

  1. I'm still concerned about the impact on anonymity of small wikis.

Rounding to months helps, but could still be problematic in at least some
cases (as Krinkle pointed out).

So ... wikis are already grouped as small, medium and large, then YYYYMMDD in large, YYYYMM in medium and YYYY in small wikis?

https://noc.wikimedia.org/conf/highlight.php?file=large.dblist
https://noc.wikimedia.org/conf/highlight.php?file=medium.dblist
https://noc.wikimedia.org/conf/highlight.php?file=small.dblist

I do not have a great solution to these questions. Trying to
brainstorm/think out loud here: maybe we could provide aggregate totals for
all editors, for active editors, and for new editors (based on the standard
analytics definition of active editors/new editors), instead of access to
individually anonymized rows?

Is the order of rows randomized? Being able to correlate other preferences with, say, fancysig based on row number would be bad.

As for timestamp anonimization, showing the (truncated) date of the last edit of the user which did not happen in the current day(/month/year) should be pretty straightforward.

Sorry for the slow response, Liangent - email slipped through the cracks for some reason. Q: what's the criteria for small/medium/large? And how often is that updated?

And Gergo's q is a good one too.

(In reply to Tisza Gergő from comment #33)

As for timestamp anonimization, showing the (truncated) date of the last
edit of the user which did not happen in the current day(/month/year) should
be pretty straightforward.

Nevermind, I see this comes from user_touched, not the edit log, and we do not keep a past record of that.

Maybe instead of a live view, this table could be generated by a query which runs at the end of the day; that way no information is leaked by the timing of the change. Although that still does not solve the issue for small wikis since we probably want to run it more often than once a month.

(In reply to Luis Villa (WMF Legal) from comment #34)

Sorry for the slow response, Liangent - email slipped through the cracks for
some reason. Q: what's the criteria for small/medium/large? And how often is
that updated?

I have no idea about the criteria, but I guess generally wikis only grow larger, and if the list is not updated in time, we're just anonymizing more, not less.

Small/medium/big wikis have 10/100/1000 thousands total pages or more. You can see the lists at https://noc.wikimedia.org/conf/ The lists were originally made as rough indication of how expensive certain heavy database operations are (you can easily find the bugs in question).

Changing the timestamp in this table based on wiki size would be a mistake in my opinion. It would make the data too arbitrary and hard to maintain. It also probably won't scale well into the MySQL mapping and relevant configuration.

The reason we don't want to provide the full timestamp is because it exposes rather detailed information that would be trivial to map to an individual on most wikis (even large wikis) based on when an edit has happened.

Shortening this to a month minimised this ability greatly, reducing it to only being able to tell that those with the current month as their last-active timestamp (YYYY-MM) are one of the users listed on Special:ActiveUsers.

In theory this could still be exploited by being a patient person and waiting for the moment one month has ended. Then the first user to become active in the new month would be the only person with a newer month timestamp.

For this particular exploit it is irrelevant how much we crop the timestamp (year, month, date, it's all the same, it only reduces how often one can make such attempt and with that the usefulness the data in the first place).

If we consider that possibility important enough to warrant a solution, then that solution will work equally for year, month and day. I don't think there is any reason to trim it further than month, nor is there any reason to bring small/medium/large into the balance.

(In reply to Krinkle from comment #38)

[...]

In theory this could still be exploited by being a patient person and
waiting for the moment one month has ended. Then the first user to become
active in the new month would be the only person with a newer month
timestamp.

[...]

Could this be mitigated by excluding data for the current month?

(In reply to Tim Landscheidt from comment #39)

(In reply to Krinkle from comment #38)

[...]

In theory this could still be exploited by being a patient person and
waiting for the moment one month has ended. Then the first user to become
active in the new month would be the only person with a newer month
timestamp.

[...]

Could this be mitigated by excluding data for the current month?

Essentially, yes. See my comment 17:

(citing comment #17 from Krinkle)

[..] no matter how much we crop the timestamp (even if we'd crop it to the
month or year), one could in theory still retrieve the preferences of one
user by performing queries in a loop until the next day, month or year
starts and correlate that to the one user currently editing on that wiki..

The only way I can think of that would mitigate that is forcing the current
matching crop of the date being delayed (e.g. all users active today on
2014-03-17 would be listed under 2014-03-16 until this day is over, so that
all active users have had a chance to be active that day).

However "excluding data" isn't trivial. It'll make statistics and sum queries act very counter intuitive. If you'd exclude all rows where timestamp == current YYYY-MM, that would essentially hide all values of active users, and make them re-appear next month and back and forth like that.

To avoid strange statistical variance like that, I'd recommend making timestamp of the currently running month appear as last month's. That way no data is excluded. This seems like a simple enough variance that tools should be able to work with without much hassle.

To avoid strange statistical variance like that, I'd recommend making timestamp of the currently running month appear as last month's.

I don't see how that would help (other than making the window of opportunity smaller). You could still poll the data on the first day of the month to get an accurate edit timestamp. Since the attack is based on reading *when* the timestamp changes and correlating that with edit times, *what* the timestamp changes to is irrelevant. If you want users of the table to have a precision of T, you could protect users who edit more often than T by setting the timestamps to some deterministically chosen point within T so that changes within that interval are suppressed, but information would still leak about users who edit less often. For T = 1 day, that's hardly acceptable, even on large wikis.

We could randomize the last few bits of the date, that would make it hard to tell when it changes, but it would still be doable via statistical means... really the only reliable way to mitigate this is to use a temporary table instead of a view, and only update it once a day/week/whatever so that the timing of the field changing becomes completely disconnected from the timing of the edit.

(In reply to Tisza Gergő from comment #35)

Maybe instead of a live view, this table could be generated by a query which
runs at the end of the day; that way no information is leaked by the timing
of the change. Although that still does not solve the issue for small wikis
since we probably want to run it more often than once a month.

If we only provide one month granularity in the answer, then it can be run only once a month, right? Sounds like the use cases here don't need precise/fine-grained answers to this question?

What is the current status of the bug?
It becomes very importantly to fix it in the very soon future, as the toolserver is going to be down in 3 weeks

http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2014-June/000722.html

Changing to major per previous comment.

Current status of the bug is "questions have been posed about privacy concerns". It would be helpful to have feedback from people who use the field; in particular, my question in comment 42 and response to Krinkle's last statement in comment 40 would be good.

Hi Luis, thanks for the update.

(In reply to Krinkle from comment #40)

To avoid strange statistical variance like that, I'd recommend making
timestamp of the currently running month appear as last month's. That way no
data is excluded. This seems like a simple enough variance that tools should
be able to work with without much hassle.

I think it is good solution. The timestamp is used mainly for filtering active vs non-active users.

(In reply to Luis Villa (WMF Legal) from comment #42)

(In reply to Tisza Gergő from comment #35)

Maybe instead of a live view, this table could be generated by a query which
runs at the end of the day; that way no information is leaked by the timing
of the change. Although that still does not solve the issue for small wikis
since we probably want to run it more often than once a month.

If we only provide one month granularity in the answer, then it can be run
only once a month, right? Sounds like the use cases here don't need
precise/fine-grained answers to this question?

*My use case: the user_properties_anonym table is used for generating[[meta:Gadgets]] - statistics on the most popular gadgets accross Wikimedia projects.

  • (other example for use case) User preferences statistics are provided by MZMcBride in [[en:Wikipedia:Database reports/User preferences]]

So I belive one month resolution is fine for statistics use cases, and running it once a month could be enougth.
As Krinkle wrote above, month resolution (YYYY-MM) make it hard to associate it to specific user.

However for very very small wikis I think it isn't possible to provide such statistics: refering to http://stats.wikimedia.org/EN/TablesWikipediansEditsGt5.htm (Wikipedias by active users) some languages such as Tsonga or Oromo have only one active user. I think there should be a threshold on the number of active users in a wiki.

Active editors are not the only users changing preferences on a wiki, there are also many global users changing preferences everywhere (or in many places); only user_touched knows that.

Hi Luis, do the comments 46 and 47 provide the answers you need? If not, who can help? We are ... sort of in a hurry by now.

Summary of what was discussed during office hour (https://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/20140611.txt, from [17:17:25] on):

  • This will probably take longer for WMF Legal, so it's unlikely that it will be done before Toolserver shutdown.
  • As far as we know, "live" tools depended on #64115 (filtered view) which has been resolved. But no live tools seem to depend on this one. Please correct me, if I am wrong!

Still, it would be interesting to get an update from Luis. :)

(In reply to Silke Meyer (WMDE) from comment #49)

  • As far as we know, "live" tools depended on #64115 (filtered view) which

has been resolved. But no live tools seem to depend on this one. Please
correct me, if I am wrong!

http://toolserver.org/~liangent/gadget_usage/

If the users are OK with the month-filtered view, and exclusion of small wikis, that seems like a workable compromise to me.

(In reply to Luis Villa (WMF Legal) from comment #51)

If the users are OK with the month-filtered view, and exclusion of small
wikis, that seems like a workable compromise to me.

Well to speed up the process and keep compatibility at the same time, it would be better to add the view now, with YYYYMMDD rewritten as YYYYMM01 on larger wikis, and as 19700101 on small wikis, and we can keep discussing afterwards...

He7d3r set Security to None.

If the users are OK with the month-filtered view, and exclusion of small wikis, that seems like a workable compromise to me.

Okay, this sounds like sign-off to me.

Who does this typically? @coren or @Springle? Is there documentation for what's needed to add/remove a database table or column from Labs' views? I vaguely remember there now being two filters in place, maybe.

If anyone has links to the code repos or documentation (or even another task where we did something similar), that would likely help move this forward. At this point, it sounds like there's general consensus to move forward, this just hasn't been written out and gone through Gerrit/code review.

MZMcBride added a subscriber: yuvipanda.

There are, in essence, two steps. The underlying table needs to be replicated, and a view need to be created to it.

Access to the underlying table is already there (it is used to provide the view into limited number of keys); all that is needed is the creation of a new view that carefully aggregates the data and respects the size limits, etc.

The script that creates the views is there:

https://git.wikimedia.org/blob/operations%2Fsoftware/HEAD/maintain-replicas%2Fmaintain-replicas.pl

I'm usually the one who creates views, but my schedule is a bit on the buried side. I'd be happy to review a patch adding the view - look starting line 90, it should be fairly evident how to add one.

A brief summary of past discussion and verification that we understood each other.

A database view with a whitelisted set of keys is available with connections to individual accounts (fancysig, language, nickname, timecorrection, etc.). This information is already available by other means and is provided as convenience to perform queries on. This database view has since been created!

An anonymised database view will be created as follows:

  • up_user: omit. This view will not associate preferences with users.
  • up_property: included. As was Toolserver, the set of properties exposed may be quite large and controlled by a whitelist (to avoid exposing new sensitive properties by accident). The whitelist includes wildcard patterns for things like gadget-*.
  • up_value: included.
  • _user_touched_cropped: generated based on user.user_touched value where user.user_id = up_user. Trimmed to YYYY-MM. Values for the current month are displayed as last month's.

The reason the current month is displayed as last month's is to reduce chances of identifying users via their activity at the start of a month (which is not limited by the size of a wiki). This because using timestamps of YYYY-MD essentially reset every month. This in contrary to Special:ActiveUsers, which rolls continuously using relative timespan of "last 30 days". Without this modification, one could link a recently active user on a wiki at the start of a month. This modification would extend that timeframe from mere seconds to two full months (a user would have to be the only one active for two months). Note that this is for a timespan for two months, not one. Is that acceptable or should be still disable this view on small wikis? (Toolserver had it enabled for all wikis.)

The one thing I am missing to implement this at this time is either
(a) the whitelist of properties to collate or
(b) the method by which that whitelist is generated.

The one thing I am missing to implement this at this time is either
(a) the whitelist of properties to collate or
(b) the method by which that whitelist is generated.

Can we start with "gadget-*" and then decide on the others in a separate future task?

Can we start with "gadget-*" and then decide on the others in a separate future task?

Sounds like a reasonable first pass. I'll have to do some testing to see if wildcard matching is sufficiently efficient for the view to be usable; but otherwise I'll collect the existing gadget- names and use those as our first whitelist.

Change 197055 had a related patch set uploaded (by coren):
Update maintain-replicas

https://gerrit.wikimedia.org/r/197055

The view has been created (on all but size 1 wikis) but the timestamps currently return NULL because none of the user_touched values are on the replicas at this time (they are redundantly nulled during replication and in the views).

Change 197055 merged by coren:
Update maintain-replicas

https://gerrit.wikimedia.org/r/197055

It was user_properties_anonym on the Toolserver and it's user_properties_anon now. Probably doesn't really matter, just noting.

Thank you for https://gerrit.wikimedia.org/r/197055!

Looking at https://en.wikipedia.org/wiki/Wikipedia:Database_reports/User_preferences/Configuration, it looks like "gender", "language", and "skin" would be the easiest targets next. If anyone would prefer to close this task as resolved and file new tasks for additions to the whitelisted user properties list, I'm fine with that.

Thank you for https://gerrit.wikimedia.org/r/197055!

Looking at https://en.wikipedia.org/wiki/Wikipedia:Database_reports/User_preferences/Configuration, it looks like "gender", "language", and "skin" would be the easiest targets next. If anyone would prefer to close this task as resolved and file new tasks for additions to the whitelisted user properties list, I'm fine with that.

AFAICS, and as noted in this task's description, the user_properties table already contains "preferences that are considered to be public", such as the ones you mentioned.

Just tried:

SELECT up_value FROM user_properties INNER JOIN user ON user_id = up_user AND user_name = 'Ricordisamoa' AND up_property = 'gender';

AFAICS, and as noted in this task's description, the user_properties table already contains "preferences that are considered to be public", such as the ones you mentioned.

Just tried:

SELECT up_value FROM user_properties INNER JOIN user ON user_id = up_user AND user_name = 'Ricordisamoa' AND up_property = 'gender';

Huh, fair play. I guess most of this database report got fixed with the resolution of T66115: Tool Labs: Provide filtered view of user_properties table containing short list of properties, linked to userID.

The view exists, and is working. Whitelisting other properties is fairly easy, but requests for this should be made in separate tasks.

Re-opening since the user_properties_anon table is still missing the upa_touched timestamp field (per @coren earlier's comment, this is blocked on T92841).

valhallasw lowered the priority of this task from High to Medium.Jul 2 2015, 8:29 PM
valhallasw added a subscriber: valhallasw.

The underlying data has been backfilled.