Popups schema is keeping page/session id longer than 90 days
Closed, Resolved, Public

Description

Per our data retention guidelines, we should not retain long term the articles read by users together with a sessionId. Please see: https://meta.wikimedia.org/wiki/Data_retention_guidelines#How_long_do_we_retain_non-public_data?

I am not sure if this schema is used, but if so, its whitelist settings should be adjusted.

Popups:
    event:
        action: keep
        api: keep
        checkin: keep
        duration: keep
        editCountBucket: keep
        hovercardsSuppressedByGadget: keep
        isAnon: keep
        linkInteractionToken: keep
        namespaceIdHover: keep
        namespaceIdSource: keep
        pageIdSource: keep
        pageTitleHover: keep -> should be removed
        pageTitleSource: keep -> should be removed
        pageToken: keep
        perceivedWait: keep
        popupDelay: keep
        popupEnabled: keep
        previewCountBucket: keep
        previewType: keep
        sessionID: keep -> should be removed
        sessionToken: keep
        totalInteractionTime: keep
        version: keep
    webHost: keep
    wiki: keep

Event Timeline

Thanks for flagging this. I will check with @ovasileva about which of the two is more important from a product perspective, but I guess we will want to drop the page titles and keep the session-related data. (CC @phuedx)

Ping @phuedx: can we have someone submit a fix for this issue? See similar work being done by the "new contributors" team: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/471038/

@Tbayer: Did anything come of this conversation?

I put it on the agenda for our regular check-in meeting, which hasn't yet happened since T207670#4687248 but will on Monday.

ovasileva triaged this task as Medium priority. Nov 5 2018, 7:33 PM

Discussed this with @Tbayer and we decided to keep session ID and stop collecting page ID.

Did you mean "keeping page ID" here? If you meant "collecting page ID", then we'll need to queue up a change to the Page Previews codebase to remove that field.

Change 471980 had a related patch set uploaded (by HaeB; owner: HaeB):
[analytics/refinery@master] Remove page IDs and titles from Popups whitelist

https://gerrit.wikimedia.org/r/471980

Indeed, as mentioned in the task's name and description, this is about retaining data beyond 90 days, see https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging. I.e. we decided to stop retaining page IDs (beyond 90 days). I have submitted a patch to remove pageIdSource, pageTitleHover, and pageTitleSource from the whitelist.
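
For reference, here is a sketch of what the Popups whitelist entry would look like once those three fields are dropped. It is derived only from the field list in the task description and the decision described in this thread (session-related fields kept, page ID and titles removed); it is not copied from change 471980, and the nesting of `event` under `Popups` is an assumption about the whitelist file layout.

```yaml
# Assumed result: the original list minus pageIdSource, pageTitleHover, pageTitleSource.
# Fields not listed here would no longer be retained beyond 90 days.
Popups:
    event:
        action: keep
        api: keep
        checkin: keep
        duration: keep
        editCountBucket: keep
        hovercardsSuppressedByGadget: keep
        isAnon: keep
        linkInteractionToken: keep
        namespaceIdHover: keep
        namespaceIdSource: keep
        pageToken: keep
        perceivedWait: keep
        popupDelay: keep
        popupEnabled: keep
        previewCountBucket: keep
        previewType: keep
        sessionID: keep
        sessionToken: keep
        totalInteractionTime: keep
        version: keep
    webHost: keep
    wiki: keep
```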

This comment was removed by Tbayer.

Not sure what to do with this one in the Needs Analysis column, @ovasileva + @phuedx: is this something the team needs to estimate and work on, or is this work @Tbayer is carrying out?

Change 471980 merged by Nuria:
[analytics/refinery@master] Remove page IDs and titles from Popups whitelist

https://gerrit.wikimedia.org/r/471980

@Nuria can we close out this task? Is there any other work that needs to be done on the Product Analytics side?