
Homepage: add schemas to EventLogging whitelist
Closed, Resolved (Public)

Description

We've got approval from Legal for our data retention plan (ref: T219250#5150014), and the Data retention guidelines have now been updated with an exception for the Homepage project. It's time to add the Homepage schemas to the EventLogging whitelist. This task tracks that change; once the schemas are whitelisted, their fields will be kept or purged in accordance with the measurement specification.
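To illustrate what keeping, hashing, and purging fields means in practice, here is a minimal Python sketch. It is not the actual analytics/refinery whitelist format or sanitization code. The field names (`action`, `help_panel_token`, `user_id`), the policy dictionary, and the salt are all hypothetical, and the sketch assumes that any field not listed in the policy gets purged.

```python
import hashlib
import hmac


def sanitize(event, policy, salt):
    """Apply a keep/hash policy to one event; unlisted fields are dropped.

    `policy` maps field name -> "keep" or "hash" (hypothetical format).
    """
    out = {}
    for field, value in event.items():
        action = policy.get(field)  # None means the field is purged
        if action == "keep":
            out[field] = value
        elif action == "hash":
            # Keyed hash so the raw token cannot be recovered without the salt.
            digest = hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()
    return out


# Hypothetical event and policy, for illustration only.
event = {"action": "open", "help_panel_token": "abc123", "user_id": 42}
policy = {"action": "keep", "help_panel_token": "hash"}
sanitized = sanitize(event, policy, salt=b"example-salt")
```

In this sketch `action` survives unchanged, `help_panel_token` is replaced by a hex digest, and `user_id` is dropped because the policy does not mention it.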

Event Timeline

Change 516520 had a related patch set uploaded (by Nettrom; owner: Nettrom):
[analytics/refinery@master] Add HomepageModule and HomepageVisit, hash HelpPanel token

https://gerrit.wikimedia.org/r/516520

Upon reviewing this patch, @mforns raised concerns about the information we store about the mentors in this schema. It likely makes it possible to connect mentor and mentee, so the suggestion is to bucket edit counts. Thinking about this, we might want to bucket "time since last activity" as well to make both pieces of information less identifying.

At the same time, the pool of mentors is limited and available on-wiki, if I remember correctly. That means anyone who is really interested in learning this likely has only a small list of possible options to check.

Bucketing these two pieces of information won't affect our research questions, as far as I can tell. It'll just mean we use categories as predictors instead of continuous variables (besides, if I really want to use the exact count for this, I can calculate that for the analysis).
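As a rough illustration of the bucketing idea, here is a small Python sketch that maps an exact edit count and a "time since last activity" value to category labels. The bucket boundaries and labels are assumptions made up for this example; nothing in this task specifies them.

```python
from bisect import bisect_right

# Hypothetical boundaries: each edge is the first value of the next bucket.
EDIT_COUNT_EDGES = [1, 100, 1000, 10000]
EDIT_COUNT_LABELS = ["0", "1-99", "100-999", "1000-9999", "10000+"]

# Hypothetical boundaries for days since the mentor's last activity.
IDLE_DAYS_EDGES = [1, 7, 30]
IDLE_DAYS_LABELS = ["<1d", "1-6d", "7-29d", "30d+"]


def bucket(value, edges, labels):
    """Return the label of the bucket that `value` falls into."""
    if value < 0:
        raise ValueError("value must be non-negative")
    return labels[bisect_right(edges, value)]


def bucket_edit_count(edit_count):
    return bucket(edit_count, EDIT_COUNT_EDGES, EDIT_COUNT_LABELS)


def bucket_idle_days(days):
    return bucket(days, IDLE_DAYS_EDGES, IDLE_DAYS_LABELS)
```

With these assumed boundaries, a mentor with 2,500 edits who was last active 3 days ago would be logged as `"1000-9999"` and `"1-6d"`, which the analysis can then treat as categorical predictors.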

I'm not against this change, but also wanted to make sure thoughts, ideas, and concerns from the rest of the team are heard before I suggest we implement it. Pinging @MMiller_WMF and our engineers: @SBisson @kostajh @Catrope

As you said, mentors are few and publicly listed on wiki. Also, as soon as a mentee asks a question, their mentor-mentee relationship is made public on the mentor's talk page. What is currently private is the mentor assignment before any question is posted.

Thinking about the size of the edit count buckets, they would have to be fairly large for a mentor not to be identifiable, especially when cross-referenced with the "time since last activity" bucket. In other words, is it possible that to anonymize this data, the buckets would have to be so large that most or all mentors end up in the same buckets, making the data useless for analysis?
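One way to make this trade-off concrete is to count how many mentors share each combination of buckets: a group of size 1 still points to a single mentor, while one group containing everybody carries no information. The Python sketch below does that count. The mentor numbers and bucket edges are invented for illustration and are not real data.

```python
from bisect import bisect_right
from collections import Counter


def group_sizes(mentors, edit_edges, idle_edges):
    """Count mentors per (edit-count bucket, idle-days bucket) pair.

    `mentors` is a list of (edit_count, idle_days) tuples.
    """
    return Counter(
        (bisect_right(edit_edges, edits), bisect_right(idle_edges, idle))
        for edits, idle in mentors
    )


def smallest_group(mentors, edit_edges, idle_edges):
    """Size of the smallest bucket combination (0 if there are no mentors)."""
    sizes = group_sizes(mentors, edit_edges, idle_edges)
    return min(sizes.values()) if sizes else 0


# Invented (edit_count, idle_days) values for five hypothetical mentors.
mentors = [(50, 0), (80, 2), (5000, 0), (7000, 3), (120000, 40)]

# Fine-grained buckets: every mentor ends up alone in their combination.
fine = smallest_group(mentors, [100, 1000, 10000, 100000], [1, 7, 30])

# No boundaries at all: a single bucket holding all mentors.
coarse = smallest_group(mentors, [], [])
```

With a small mentor pool, the fine-grained buckets leave each mentor alone in their group (`fine` is 1), while the only way to hide everyone in this toy data is to collapse to a single bucket (`coarse` is 5), which is the situation described above.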

Thanks @nettrom_WMF for bringing this up.
If the mentor-mentee relation is already public on wiki and users know that (as @SBisson said), I think it's OK to keep that information in the events!
No need to bucket edit counts or time since last activity.

Very interesting. I think about it like this:

  • It is public who is a mentor in general.
  • Once a newcomer takes action on their relationship with a mentor by asking them a question, it becomes public that that mentor is assigned to that user.
  • So, as @SBisson says, the only secret is which mentor is assigned to which newcomer before the newcomer takes action on the relationship. And I can't think of a reason why that information would be secret or revealing. @mforns, do you see any reason there?

@MMiller_WMF No, I agree with you; it seems OK to me to keep that information.

Change 516520 merged by Mforns:
[analytics/refinery@master] Add HomepageModule and HomepageVisit, hash HelpPanel token

https://gerrit.wikimedia.org/r/516520

Merged it, thanks for the clarifications!

Have confirmed that the data is now flowing into the Data Lake and appears to be correct. Closing this as resolved.