
Implement Schema:Print purging strategy
Closed, Resolved | Public | 1 Estimated Story Points

Description

According to https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging#How_to_change_the_purging_strategy_of_a_schema:

Instrumented your code to emit events to that new schema/revision and deployed it.

So this has to be done once T169730 is in production.

A/C

Refer to the talk page at https://meta.wikimedia.org/wiki/Schema_talk:Print to learn about the schema's purge strategy.

Event Timeline

ovasileva triaged this task as Medium priority. Sep 11 2017, 11:30 AM
ovasileva added a project: Web-Team-Backlog.
ovasileva moved this task from Incoming to Upcoming on the Web-Team-Backlog board.

Team decided to not point this in grooming

MBinder_WMF set the point value for this task to 0. Sep 12 2017, 5:00 PM
ovasileva changed the point value for this task from 0 to 1. Sep 20 2017, 5:29 PM

Change 379829 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[operations/puppet@production] Implement Schema:Print purging strategy

https://gerrit.wikimedia.org/r/379829

@ovasileva - do we also need to store the sessionToken (to detect how many prints we have per session)? It's kind of user-related data (each user gets a unique sessionToken per session). If we don't want to query historical data for prints-per-session then there is no need to store that property.

This has been sitting here for a week. To get code review we'll need to be a little more proactive. Have we reached out to operations/analytics about this patch?

@mforns it looks like the fix for T169730 is in production now (it's in 1.30.0-wmf.19 which is everywhere).

This has been sitting here for a week. To get code review we'll need to be a little more proactive. Have we reached out to operations/analytics about this patch?

Yes, the patch has a couple of people from analytics as reviewers.

@bmansurov @Jdlrobson
Yes, thanks! Will have a look at this tomorrow.

BTW, thanks a lot for taking the time and effort to create that puppet change!

(Moving discussion back here from Gerrit)

@mforns has voiced concerns that preserving the skin field might cause privacy problems in case of rare values (cf. T169730#3654848 ), and explained that the technical limitations of the purging procedure don't allow us to resolve this by simply aggregating all desktop skins into a common value.
On the other hand, we will need to be able to restrict analysis to desktop only, also in the longer term (for the purged data), and that's only possible using the skin field currently. An easy way out would be to simply limit the schema to desktop only, which would actually be consistent with the original task description: T169730: Define and implement instrumentation for printing on desktop web
@bmansurov, is this something we could still do easily (despite T169730#3654848)?

@bmansurov, is this something we could still do easily (despite T169730#3654848)?

@Tbayer, yes we can do this easily. Are we talking about logging only for Vector, Monobook, Modern, and Cologne blue skins? Or do we need a subset of these?

Cross-posting from https://phabricator.wikimedia.org/T169730#3664250, but we would want to keep the Minerva part of the instrumentation as well so that we can track T177215: Build download button for mobile PDF download, unless we want to pull the mobile portion into a separate schema

need to be able to restrict analysis to desktop only,

The minerva skin is available on desktop so this statement doesn't make sense. I've asked for clarification with regards to privacy problems.

restrict analysis to desktop only... that's only possible using the skin field currently

The skin field should not be used in this way. People can use Vector on the mobile site and they can use the minerva skin on their desktop. We should really be looking at userAgent here or webHost (but that is still flawed albeit not as flawed as skin is). Skin is a terrible indication for whether a user is on mobile.

The following queries show skin cannot be relied on as a field meaning "desktop":

SELECT * FROM Print_17199246 WHERE event_skin = 'minerva' AND webHost NOT LIKE '%m.%';
SELECT * FROM Print_17199246 WHERE event_skin = 'vector' AND webHost LIKE '%m.%';
SELECT * FROM Print_17199246 WHERE event_skin = 'minerva' AND userAgent LIKE '%os_family": "Windows%';
SELECT * FROM Print_17199246 WHERE event_skin = 'vector' AND userAgent LIKE '%os_family": "iOS%';

@Jdlrobson Answering your comment on Gerrit:

Could you elaborate on this by giving an example? Why would this (the skin field) be potentially identifying?

I think the skin field could be potentially identifying, because some skins might be uncommon enough that they are used by very few people.
For instance: Imagine we have an event collected from catalan wikipedia with skin="blah". And there's only 3 people in the world with that skin (me and 2 more). I'm from Barcelona and the other 2 live in New Zealand, so this may indicate that the event was generated by me. This, of course, depends on the frequency (distribution) of the skins.

@Tbayer linked an analysis of the distributions of skins here: https://phabricator.wikimedia.org/T169730#3654848. This looks like apioutput, modern and monobook are a lot smaller than vector and minerva. So that's why I thought those smaller values could be identifying. Also I guess there could be other small values (or new small values could be added in the future).

My suggestion would be to bucketize all small skin values into an "other" category, like: vector, minerva and other. I don't know if that would be good enough for you, or whether that would defeat the purpose of that field. This could be done in the instrumentation code, right? What do you think?
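A minimal sketch of the bucketing suggested above, written in Python purely for illustration (the real instrumentation lives in the WikimediaEvents extension and is not shown in this thread). The function name `bucketize_skin` and the constant `KEPT_SKINS` are hypothetical.

```python
# Skins that would be logged verbatim; the two names come from the suggestion above.
KEPT_SKINS = frozenset({"vector", "minerva"})


def bucketize_skin(skin: str) -> str:
    """Return the skin unchanged if it is one of the kept values, otherwise 'other'."""
    return skin if skin in KEPT_SKINS else "other"


if __name__ == "__main__":
    # Rare or unknown skins all collapse into the same bucket.
    for skin in ["vector", "minerva", "monobook", "apioutput"]:
        print(skin, "->", bucketize_skin(skin))
```

With this shape, any skin added in the future also falls into "other" without a code change, which matches the concern about new small values.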

And there's only 3 people in the world with that skin (me and 2 more). I'm from Barcelona and the other 2 live in New Zealand, so this may indicate that the event was generated by me.

But skin is private information? How would I know that you are one of those 3 people?

@Jdlrobson
You're right, one would have to know which skin I use. For example: If someone suspects that I printed a certain article (and they have access to the Schema:Print data), then they can potentially confirm their suspicion by, e.g., peeking at my laptop and seeing my Wikipedia skin.
I know the possibility of this actually happening is rather microscopic, but the theory is there.

@Jdlrobson (c.c.@Tbayer)
Reevaluating this... I think you've got a very good point.

skin is not like editCount, which is a publicly accessible fact; skin is not an easy thing to find out. Both this and the fact that this schema is 10%-sampled make the data less sensitive. I'm not sure now if we should purge the skin field or not. But still, the threat is there. Hmmm! These anonymization tasks are very subjective sometimes. Also, I'm no expert in data privacy; I give my opinion based on the experience with previous EL data audits and other anonymization tasks we've done in the Analytics team.

Maybe we should discuss with others?

Hey all!
Speaking with the team, we agreed that Schema:Print's skin field is a tricky case, and that we needed to dig a little bit more. So looking deeper into it:


Skins are accessible from MediaWiki's user_properties table. When analyzing the risk of a privacy leak, we assume that someone from outside the Foundation gains access to EL data. We should therefore also assume that they could gain access to the user_properties table. That would be enough to list all users with a given skin and compare them with the skins in the Schema:Print events.

Regarding frequencies of the skins: the frequencies extracted from Schema:Print are not totally accurate, in the sense that they may vary a lot, because currently they are highly coupled with the wikis they were generated from. I think the user_properties tables are a more accurate source for frequencies. I executed the following query on a small-to-medium wiki (cawiki) to see whether there are frequencies that could be identifying:

select up_value, count(up_value) from user_properties where up_property = 'skin' group by up_value;
+-------------+-----------------+
| up_value    | count(up_value) |
+-------------+-----------------+
|             |           29029 |
| 0           |             118 |
| 2           |               3 |
| amethyst    |               1 |
| chick       |              21 |
| cologneblue |              67 |
| modern      |             155 |
| monobook    |            5258 |
| myskin      |              10 |
| nostalgia   |              15 |
| simple      |               8 |
| standard    |              55 |
| vector      |             588 |
+-------------+-----------------+

I'd say, except for vector (default) and monobook, all other skins could be identifying here. So my conclusion is that the field skin is indeed a potential identifier combined with the field wiki.
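As a rough sketch of that reading of the table, the counts above can be checked against a minimum group size. The cutoff `MIN_USERS = 500` is purely an assumption chosen for illustration; no threshold is stated in this thread, and the helper name `rare_values` is hypothetical.

```python
# Counts copied from the cawiki query result above (up_value -> number of users).
CAWIKI_SKIN_COUNTS = {
    "": 29029, "0": 118, "2": 3, "amethyst": 1, "chick": 21,
    "cologneblue": 67, "modern": 155, "monobook": 5258, "myskin": 10,
    "nostalgia": 15, "simple": 8, "standard": 55, "vector": 588,
}

# Hypothetical cutoff: values shared by fewer users than this are treated as rare.
MIN_USERS = 500


def rare_values(counts: dict[str, int], min_users: int) -> list[str]:
    """Return the values whose user count is below min_users, sorted by name."""
    return sorted(value for value, n in counts.items() if n < min_users)


rare = rare_values(CAWIKI_SKIN_COUNTS, MIN_USERS)
print(rare)
```

Under this assumed cutoff, only vector, monobook and the empty value stay above the line; every other value in the table is flagged.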

@Tbayer Would it be possible, as @Jdlrobson suggested, to use webHost instead of skin to determine whether the event comes from desktop or mobile? This way we could purge the skin field after 90 days.

ovasileva raised the priority of this task from Medium to High. Oct 10 2017, 5:19 PM

@Tbayer - I'm actually curious to see what the split will be between desktop and mobile users for printing in Minerva - is there a chance we can look into this?

I see two solutions:

a) Can we just drop the skin property and use isMobile instead? We will lose the historical data but we will get things done.
b) A slightly easier approach is to limit skins to [vector|minerva|other].

I would go with b), implement that, and close this task.

I spoke with @ovasileva and we will go with the second approach, limit skins only to vector|minerva|other. I'll prepare a follow-up patch

Change 386911 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/extensions/WikimediaEvents@master] Limit logged skins for print event only to vector and minerva

https://gerrit.wikimedia.org/r/386911

Change 386911 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Limit logged skins for print event only to vector and minerva

https://gerrit.wikimedia.org/r/386911

@mforns - we changed the values stored in skin property. now it will be only vector, minerva or other. Can we proceed with the review of this task?

Thanks @pmiazga for pushing this forward.

In T175395#3665372 I suggested that we could bucketize skin and have it contain only (vector | minerva | other).
This was based on the skin frequencies of the current events collected for the Print schema T169730#3654848,
where minerva was rather frequent next to vector.

However, after a deeper look later on in T175395#3671720, I mentioned that the skin frequencies extracted from the Print schema
were not accurate, and gave an example of frequencies extracted directly from the user_properties table of a small wiki (cawiki),
where minerva has no occurrences.

I'd say, except for vector (default) and monobook, all other skins could be identifying here. So my conclusion is that the field skin is indeed a potential identifier combined with the field wiki.

A look at ptwiki confirms that minerva's frequency is very low, and as such, potentially identifying.

select up_value, count(up_value) from user_properties where up_property = 'skin' group by up_value;
+-------------+-----------------+
| up_value    | count(up_value) |
+-------------+-----------------+
|             |          378072 |
| 0           |             400 |
| 1           |               9 |
| 2           |              24 |
| amethyst    |              27 |
| chick       |            3629 |
| cologneblue |            9665 |
| minerva     |              24 |
| minervaneue |              21 |
| modern      |            8811 |
| monobook    |          161803 |
| myskin      |            1127 |
| nostalgia   |            1505 |
| simple      |            1033 |
| standard    |            2599 |
| vector      |           14487 |
+-------------+-----------------+

I still think that, except for vector and monobook, all other values are potentially identifying.
I'm sorry if I was unclear in my latest comment about that.
I would advise against storing the skin field with (vector | minerva | other) for more than 90 days.

In my latest comment T175395#3671720 I asked if it was possible to use the field webHost to split between desktop and mobile. Would this be possible?

@Tbayer Would it be possible, as @Jdlrobson suggested, to use webHost instead of skin to determine whether the event comes from desktop or mobile? This way we could purge the skin field after 90 days.

So as I've pointed out the webHost is flawed as the skin can be changed on desktop.

I'm still not sure I fully understand the concern here with privacy.

The Minerva print events are low on ptwiki but that's not surprising given printing is harder on mobile devices (although that will change when we add a button).

I'm still not sure I understand the concern here. If I know few people print on the minerva skin, and I see Steve print a page, and Steve is lucky enough to be in the small fraction of users sampled, then yes, I can now link him to a session and see all his browser history. That sequence of events is highly unlikely, however, and seems no different from observing somebody displaying a page preview on the minerva skin on a specific page.

If I know Steve uses the Minerva skin on desktop and is in a minority of 3 users in that skin, then yes, I could identify print events. But given that every single user on mobile uses that skin but doesn't print, and the sample is lls, it seems a stretch for me to claim I now know Steve's session.

I'd love to chat over this not on phabricator as I feel like I'm missing something or we are unnecessarily concerned about this.

@mforns - We're missing one very important bit. You're showing us results of user_properties table query which shows us the number of users who changed the skin by themselves (went to the preference page and changed the skin to Minerva for example). That query does not include all mobile readers.

As an example: by default English Wikipedia uses vector skin, but if you go to en.m.wikipedia.org we will load the minerva skin because you visit the page in the mobile mode. In other words, the mobile mode doesn't respect the user_properties.skin, it always uses the minerva skin (you can override that by a GET parameter but that's a different story, not important in that case).

Because of that, the minerva skin is not identifying as it's applied to millions of requests (but it's not reflected in user settings). It would be identifying only if we include "is_mobile" flag. In only that one scenario (is_mobile = false and skin=minerva) it's potentially identifying (users who changed the skin on preferences page).
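The rule described in this comment can be restated as a small predicate. This is only a sketch: `is_mobile` is the hypothetical flag mentioned above, not a field that the thread confirms exists in the Print schema, and the function name is made up for illustration.

```python
def is_potentially_identifying(skin: str, is_mobile: bool) -> bool:
    """True only for the one combination called out above:
    the minerva skin seen outside mobile mode, which implies the user
    picked it on the preferences page.
    """
    return skin == "minerva" and not is_mobile


if __name__ == "__main__":
    # Mobile mode always serves minerva, so that combination is not rare.
    print(is_potentially_identifying("minerva", is_mobile=True))   # False
    # minerva on desktop means an explicit preference change.
    print(is_potentially_identifying("minerva", is_mobile=False))  # True
```

Without such a flag stored next to skin, the two cases cannot be told apart in the data, which is why the bucketized skin value alone is argued to be safe here.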

@pmiazga
Oh, I understand! I didn't know that. Thanks for the explanation.
Then yes, you're totally right, skin = (vector | minerva | other) will be fine.

@Jdlrobson
Even if this task seems solved now, I'd also like to chat over this case in person/hangouts.

@ All,
I +1'd the white-list patch, and asked someone from ops to merge.
Sorry for being fussy/annoying with this privacy thing. Thanks for your patience!

https://gerrit.wikimedia.org/r/386911 is on the train and will not be deployed until Thursday. I guess this is blocked until then?

Change 379829 merged by Elukey:
[operations/puppet@production] Implement Schema:Print purging strategy

https://gerrit.wikimedia.org/r/379829

Will we be purging data up until today's deploy differently from the data afterwards?

@phuedx
No, all data will be purged the same.

A thing we could do, if you want to keep data consistent with the last deployment, is altering the pre-deployment data to transform everything that is not "vector" or "minerva" into "other".

@phuedx and @Tbayer
But consider this only for the sake of consistency.
Regarding privacy, I think we're OK with this small sample of the skin field being not purged/not bucketized.

@mforns It's not necessary for analysis purposes, but can't hurt much either.
BTW I will follow up on some other loose ends here soon and then close this task.

Hi all!

We've noticed that the Print schema has been sending events to EL for a couple days.
But mysql is receiving a lot of events, around 200 per second.
Is that expected?
This volume is more than mysql can handle: after a couple of weeks the table will be so huge that it will be very difficult to query/maintain.
We are going to blacklist it for mysql insertion, sorry.
The data will still be normally inserted, available and queryable in Hive's EventLogging refined tables.
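To put the reported rate in perspective, here is a back-of-the-envelope calculation. It assumes the rate stays constant at 200 events per second and reads "a couple weeks" as 14 days; both are simplifications, not figures from the thread.

```python
EVENTS_PER_SECOND = 200          # rate reported above
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
DAYS = 14                        # assumption: "a couple weeks" taken as two weeks

rows_per_day = EVENTS_PER_SECOND * SECONDS_PER_DAY
rows_total = rows_per_day * DAYS

print(f"{rows_per_day:,} rows per day")
print(f"{rows_total:,} rows after {DAYS} days")
```

That works out to roughly 17 million rows per day and about 242 million rows in two weeks under these assumptions.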

This can be attributed to T181297 which added events for impressions. @Tbayer warned about the consequences of this but the sampling rate was not changed :/

Is anything left to do here..... ?

@Jdlrobson

Is anything left to do here..... ?

I don't think so!
The schema is whitelisted, including the bucketized skin field.
Whitelisted fields will be kept indefinitely.