Whitelist Portal and WikipediaApp event data for (sanitized) long-term storage
Closed, Resolved, Public

Description

Hello, all!

Last year, around this time, we requested that event.centralnoticebannerhistory data be sanitized and whitelisted for long-term storage so that the Fundraising Analytics team could utilize this data year over year for long-term trend analysis of our annual fundraising campaigns. See https://phabricator.wikimedia.org/T245285 for additional information on this and linked requests.

Fundraising efforts continue to expand into additional channels, which are becoming increasingly important to track over time. Similarly to event.centralnoticebannerhistory, we would also like to sanitize and whitelist:

event.mobilewikiappfeed
event.WikipediaPortal
event.MobileWikiAppiOSFeed

What other information or actions do you need from me for this request?

@Pcoombe or others, please add any other relevant tables used for campaign reporting as well!

Event Timeline

Could we add event.MobileWikiAppiOSFeed as well? I think that's all, thanks!

Thanks, @Pcoombe! I added that to the task description to make sure it's captured.

fdans moved this task from Incoming to Ops Week on the Analytics board.

Hi @EYener!

Please add the schemas (and fields) that you want to be kept indefinitely to the include-list in the refinery repository under static-data/eventlogging/whitelist.yaml. You can create a Gerrit patch with those changes, and add any of us in Analytics as a reviewer (you can add me for this one). Maybe this documentation can help you decide which fields to keep and which to discard. We in Analytics will also review, and let you know if we see any issues.

Thanks!

Thank you @mforns, and apologies that it took me a while to return to this. I've submitted three individual reviews for the 3 schemas we would like to whitelist, and I'm looking forward to the review.

Hi all! I reviewed the include-list patches and left some comments there.
Please don't feel overwhelmed by the review! Let's discuss and arrive at a solution :]
Thanks for making these changes.

Hi @Jdrewniak and @mpopov
I'm pinging you here to discuss the WikipediaPortal schema.
I've seen you listed as schema owners on the schema's talk page, and I suppose you worked on the schema creation and instrumentation.

@EYener wants to add this and 2 other related schemas to the sanitization include-list.
See the Gerrit changes:
https://gerrit.wikimedia.org/r/c/analytics/refinery/+/666223
https://gerrit.wikimedia.org/r/c/analytics/refinery/+/666227
https://gerrit.wikimedia.org/r/c/analytics/refinery/+/666229
I posted some privacy questions in the CRs that I'd like to clarify, and I wonder if you know anything about them.

BTW @EYener, I stumbled upon T262433 and thought it was strange, given that you want to keep this data indefinitely. Just a heads-up.

@mforns thanks for linking to T262433! I've added myself as a subscriber. Heads up for @mpopov, as I believe you authored this (?): we do use this schema in Fundraising, and would like to use it more. Keeping eyes on the Portal is a good health metric for how users interact with Wikipedia, and the same can be said for the WikiApp and WikiAppiOS schemas. We run fundraising campaigns in each of these places, and keeping a close eye on how user behavior changes over time, across several years of fundraising campaigns, is important to ensuring that we are not under-serving a particular user base that sees our campaigns. We also want to ensure that there are no technical issues with our campaigns in a particular region / access method / etc.

I'm happy to discuss use cases in more detail as we talk more about these schemas! Very generally: long-term tracking of useragent and (broad, country and continent) geo data helps us in Fundraising ensure technical campaign health and follow user preference and accessibility trends.

Just closed that task to remove the instrumentation and updated the migration status of the WikipediaPortal schema in the audit spreadsheet.

The instrumentation will need to be updated to use the Event Platform. Here are the relevant files:

@EYener: Since you're the only users of that data at this point, I'm thinking maybe someone in fr-tech can become maintainer of the analytics code in wikimedia/portals repo (linked above) to perform the eventual migration to MEP. @Jdrewniak what do you think?

@mforns: If/when fr-tech claims WikipediaPortal schema/instrumentation it should probably be added to T259163, right?

@mpopov that's right, the client-side instrumentation will have to be updated for the Event Platform.

@EYener The portals use a custom implementation of event-logging (made as small as possible) and register the events without the help of the mw.track event-bus. Luckily, the code is pretty brief and although I'm not sure what has to change to migrate to the Event Platform, I'm available to help with code-review and guide anyone along as needed.

Hi all! Independent of the important security questions raised in the gerrit review, I want to ask about the possibility of accessing Portal data from Nov. 30, 2020 onward.

I realize that this data is past 90 days old at this point and is no longer available in Hive. Is there any mechanism for reviving this data?

The largest Fundraising effort of the year, for which we want to be able to whitelist Portal data in some aspect, began on Nov. 30, 2020 and ran until Jan. 1 (inclusive) (@Pcoombe - please correct me if Portal did not run this full time!).

We are particularly interested in the early days, which are now no longer available in Hive, as the Portal banners did not follow our traditional YoY expected behavior this year. Is there any way that this data can be recovered and temporarily stored while we discuss a sustainable long-term solution?

CC @jrobell for awareness.

Hi @EYener

> I realize that this data is past 90 days old at this point and is no longer available in Hive. Is there any mechanism for reviving this data?

Unfortunately, unsanitized data older than 90 days is irrevocably deleted to comply with our data retention guidelines.
In this case, the oldest date for which we have data for the WikipediaPortal schema is Dec 5th 2020.
Note that with every day that passes, the oldest day of data is deleted.
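
As a rough illustration of the rolling 90-day window described above, the sketch below computes the oldest day of unsanitized data still available on a given date. The function name, the `RETENTION_DAYS` constant, and the example dates are assumptions made for illustration; the exact cutoff semantics of the real purge job are not stated in this thread.

```python
from datetime import date, timedelta

# Assumed retention window, taken from the "90 days" mentioned in this thread.
RETENTION_DAYS = 90


def oldest_retained_day(today: date, retention_days: int = RETENTION_DAYS) -> date:
    """Return the oldest day of unsanitized data still present on `today`.

    Assumes a simple rolling window in which the day exactly
    `retention_days` days ago is the oldest one that survives.
    """
    return today - timedelta(days=retention_days)


# Hypothetical example date: on 2021-03-05 the window reaches back to 2020-12-05.
print(oldest_retained_day(date(2021, 3, 5)))
```

Under this assumption the cutoff advances by one day every day, which is why any include-list change has to land before the days of interest fall out of the window.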

> Is there any way that this data can be recovered and temporarily stored while we discuss a sustainable long-term solution?

We cannot recover the deleted data; it just doesn't exist anywhere, sorry.
But we can stop deletion after 90 days if the Legal team grants an exception to the data retention guidelines for these particular data sets.
I recommend you contact Legal for that.

In the meantime, I think it would be cool for us to have a meeting (@EYener, @Jdrewniak, @mpopov and me) sooner rather than later.
Maybe we can find a quick way to decide what data can be kept indefinitely.
And if so, I could help you with the include-list changes and retroactively apply sanitization to the data that is about to be deleted ASAP, so that we don't lose it.
If you like this idea, please set up a meeting!

Thank you so much for the quick response, @mforns! I figured that I had missed the preservation window, but did want to ask.

Unfortunately, as I did not get this process going in time, I think we will just have to start fresh with future campaigns! I appreciate the offer to meet and discuss this issue so that we can reach a solution quickly. I will schedule something next week after I sync up with @jrobell to see if anyone else should be included in that meeting.

Thank you again!

Thanks @EYener. That's correct that the 2020 portal banners ran from Nov 30 to Jan 1 inclusive.

I had honestly given up on looking at the portal eventlogging data from 2020. Overall it looked very consistent with what we saw in 2019, and I couldn't find anything that would explain the decrease in donations. So I don't mind too much if the remainder of the 2020 data gets deleted, but it would be great to have sanitized data retained for future campaigns.

Hi @mforns, getting back to you on this. I'll schedule a meeting for next week with you, myself, @Jdrewniak, and @mpopov. @Pcoombe, please feel free to attend as well, since you are also a user of this data and know our needs (and future needs!) so well! I will look for a time.

Hi @mforns - thanks again for the help and advice earlier this week. I have updated the gerrit reviews with the information we discussed. I used the same branches - I hope that is okay? Please let me know if there is a preferred or better way to push changes - always want to learn more! Also, I am not 100% confident that I've spelled everything correctly!

As it turns out, in the process of alphabetically listing these schemas in the whitelist .yaml file, I discovered that MobileWikiAppFeed is already being whitelisted: event_sanitized.mobilewikiappfeed
I added a few additional fields to the request.

Re: the Portal schema, I have for now commented out the request to whitelist the referer field. @jlinehan, we are hoping you can help with this request! To summarize: We would like to keep the referer field in the event.wikipediaportal schema for whitelisting, provided there is a safe way to do so. @mpopov suggested that this field be modified to collect only hostnames. Would this be possible? Should I log this request separately?
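
To make the hostname-only suggestion concrete, here is a minimal sketch of reducing a referer URL to its hostname, so that paths, query strings, and fragments (which may carry search terms or other sensitive details) are dropped. It is written in Python purely for illustration: the actual portal instrumentation is client-side code, and the function name `referer_hostname` is hypothetical, not part of any existing schema or library.

```python
from typing import Optional
from urllib.parse import urlsplit


def referer_hostname(referer: Optional[str]) -> Optional[str]:
    """Return only the lowercased hostname of a referer URL.

    Returns None when the value is missing, is not an absolute URL,
    or cannot be parsed, so that no raw referer text leaks through.
    """
    if not referer:
        return None
    try:
        host = urlsplit(referer.strip()).hostname
    except ValueError:
        # urlsplit raises ValueError on malformed input such as a bad IPv6 literal.
        return None
    return host or None


# The query string, which could contain a user's search terms, is discarded.
print(referer_hostname("https://www.google.com/search?q=private+query"))
```

Returning `None` for anything unparseable is a deliberately conservative choice in this sketch: a malformed value is dropped rather than stored verbatim.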

Thank you!

Hi, just pinging back on this ticket. @mforns are the new gerrit review requests looking better?

Hi @EYener, sorry for the delay, I've been off for a couple of days. Looking now.

@EYener, I reviewed the changes. They look a lot better, thanks.
The only changes needed are very minor. Please have a look and let me know.
I'd ask you to combine all commits for a single schema into one atomic commit;
I explain what I mean in the comments. Let me know if I can help :]
Cheers

Update for those on this task:

All three tables have been whitelisted and are live in the event_sanitized db:
event_sanitized.mobilewikiappfeed
event_sanitized.mobilewikiappiosfeed
event_sanitized.wikipediaportal

Thank you @mforns for your help and support, particularly while I was OOO!

I would like to see if we can continue to work on re-instrumentation for WikipediaPortal: the referer field could be of a lot of use if we could whitelist this in a more secure format. Tagging @jlinehan - do you think we can modify this field to retain only hostnames?

Hi all! I am going to close out this ticket, since the main point - whitelisting the 3 mentioned schemas - has been resolved.

The outstanding question of modifying the event.WikipediaPortal schema referer field to collect only the hostname seems to be a separate issue, which I believe will get more traction if we take the discussion to its own Phabricator task. I have created this ticket for that issue: https://phabricator.wikimedia.org/T279952

Thanks again for your help!