
Sanitize pageview_hourly - subtasked {mole}
Open, Lowest, Public, 0 Story Points

Description

Anonymize data in pageview_hourly in accordance with the data retention guidelines.

This includes completely filtering out data that allows for identity reconstruction. Note that this might not mean dropping all user agents, only sparse ones; the same applies to locations.
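
As a rough illustration of the "sparse values" idea (the table name pageview_hourly_sanitized, the exact column list, and the 100-view threshold are assumptions for this sketch, not the agreed approach), the filter could look something like the following HQL, which keeps a city value only when enough views in the hour share it:

```
-- Hedged sketch only: the real pageview_hourly schema has more columns, and the
-- threshold below is an illustrative assumption, not an agreed-upon cutoff.
-- Idea: a city shared by fewer than 100 views in an hour is "sparse" and could
-- help re-identify a reader, so it is replaced with a generic value.
INSERT OVERWRITE TABLE pageview_hourly_sanitized
PARTITION (year = 2015, month = 10, day = 1, hour = 0)
SELECT
    project,
    page_title,
    access_method,
    IF(views_sharing_city >= 100, city, 'Unknown') AS city,
    view_count
FROM (
    SELECT
        project,
        page_title,
        access_method,
        city,
        view_count,
        SUM(view_count) OVER (PARTITION BY city) AS views_sharing_city
    FROM pageview_hourly
    WHERE year = 2015 AND month = 10 AND day = 1 AND hour = 0
) hourly_with_counts;
```

The same pattern would apply to sparse user_agent_map values.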

Event Timeline

Nuria created this task. Oct 5 2015, 5:10 PM
Nuria raised the priority of this task to Needs Triage.
Nuria updated the task description.
Nuria added a subscriber: Nuria.
Restricted Application added a subscriber: Aklapper. Oct 5 2015, 5:10 PM
Nuria triaged this task as High priority. Oct 5 2015, 5:10 PM
Nuria set Security to None.
Nuria added a subscriber: Analytics-Backlog.

Hi @Nuria. Please associate at least one project with this task, otherwise nobody can find this task when searching in the corresponding project(s). Thanks! :)

kevinator renamed this task from "Anonymize data in pageview_hourly to comply with privacy policy" to "Sanitize pageview_hourly". Oct 7 2015, 8:58 PM
kevinator updated the task description.
kevinator updated the task description.

Question: Does this involve removing values of country data and/or access_method as well, or just the more fine-grained user_agent_map and city? (It's not clear from the linked Phabricator task and mediawiki.org page.) Some people were interested earlier in analyses that would need country data from June.

Nuria added a comment. Oct 21 2015, 7:40 PM

For now we are concerned just with user agent map.

Task breakdown:

  1. Write an oozie coordinator to backfill sanitization from the existing pageview_hourly into pageview_hourly_new (path TBD); see the parameterized HQL sketch after this list.
  2. Backfill by hand: create pageview_hourly_new, then launch and monitor the oozie job from 1.
  3. Modify the existing pageview_hourly oozie job to include the HQL from 1 in its workflow: insert data from webrequest into pageview_hourly_tmp, then sanitize it into pageview_hourly.
  4. When the backfill is done, deploy (see the DDL sketch after this list):
    • Stop the pageview_hourly job, and wait until the jobs consuming it are idle, waiting on new data.
    • Drop the pageview_hourly and pageview_hourly_new tables in Hive.
    • Rename the data folders: pageview_hourly becomes pageview_hourly_tmp, and pageview_hourly_new becomes pageview_hourly.
    • Recreate the pageview_hourly and pageview_hourly_tmp tables using the new paths from the previous step.
    • Recreate the partitions for the existing data in pageview_hourly and pageview_hourly_tmp.
    • Deploy the oozie change from 3 and run it starting at the time the backfill stopped.
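
For illustration, a hedged sketch of what the shared sanitization HQL behind steps 1 and 3 could look like. The ${...} placeholders would be filled in by the oozie coordinator (pageview_hourly into pageview_hourly_new for the backfill, pageview_hourly_tmp into pageview_hourly once deployed); the parameter names, the user_agent_map keys, and the 100-view threshold are assumptions for this sketch, not the actual job definition:

```
-- Hedged sketch of the sanitization HQL shared by steps 1 and 3.
-- ${...} values are oozie parameters (the names here are assumptions).
-- Rarity is judged on assumed user_agent_map keys; the real keys and the
-- threshold would have to match the actual schema and retention guidelines.
INSERT OVERWRITE TABLE ${destination_table}
PARTITION (year = ${year}, month = ${month}, day = ${day}, hour = ${hour})
SELECT
    project,
    page_title,
    access_method,
    -- Keep the full user agent map only when enough views share the same
    -- browser/OS combination in this hour; otherwise generalize it.
    IF(views_sharing_ua >= 100,
       user_agent_map,
       map('browser_family', 'Other', 'os_family', 'Other')) AS user_agent_map,
    view_count
FROM (
    SELECT
        project,
        page_title,
        access_method,
        user_agent_map,
        view_count,
        SUM(view_count) OVER (
            PARTITION BY user_agent_map['browser_family'],
                         user_agent_map['os_family']
        ) AS views_sharing_ua
    FROM ${source_table}
    WHERE year = ${year} AND month = ${month}
      AND day = ${day} AND hour = ${hour}
) hourly_with_counts;
```

Keeping the query parameterized means the backfill coordinator and the regular workflow can share the exact same file and differ only in the values oozie substitutes in.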
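
Similarly, a hedged sketch of the Hive side of step 4, to run after the oozie job is stopped and the HDFS folders have been renamed. The column list, storage format, and LOCATION paths are illustrative assumptions and would need to match the real schema and cluster layout; dropping the tables is only safe here on the assumption that they are external, so only the metastore entries go away:

```
-- Hedged sketch of the step-4 cut-over DDL; schema, format, and paths are
-- illustrative assumptions. Assumes the tables are EXTERNAL, so DROP TABLE
-- removes only the metastore entry, not the data that was just renamed.

DROP TABLE IF EXISTS pageview_hourly;
DROP TABLE IF EXISTS pageview_hourly_new;

-- Recreate pageview_hourly on top of the renamed (already sanitized) folder.
CREATE EXTERNAL TABLE pageview_hourly (
    project        STRING,
    page_title     STRING,
    access_method  STRING,
    user_agent_map MAP<STRING,STRING>,
    view_count     BIGINT
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET
LOCATION '/wmf/data/wmf/pageview/hourly';  -- assumed path, formerly pageview_hourly_new

-- Recreate pageview_hourly_tmp on top of the old, unsanitized folder.
CREATE EXTERNAL TABLE pageview_hourly_tmp LIKE pageview_hourly
LOCATION '/wmf/data/wmf/pageview/hourly_tmp';  -- assumed path, formerly pageview_hourly

-- Re-register the partitions already present on disk for both tables
-- (assumes the usual year=.../month=... directory layout; otherwise the
-- partitions would have to be added explicitly with ALTER TABLE ADD PARTITION).
MSCK REPAIR TABLE pageview_hourly;
MSCK REPAIR TABLE pageview_hourly_tmp;
```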

Hopefully that should do it :)

JAllemandou moved this task from Next Up to Paused on the Analytics-Kanban board. Nov 23 2015, 5:43 PM
Milimetric renamed this task from "Sanitize pageview_hourly - subtasked [0 pts] {hawk}" to "Sanitize pageview_hourly - subtasked {hawk}". Feb 22 2016, 9:06 PM
Milimetric set the point value for this task to 0.
Nuria renamed this task from "Sanitize pageview_hourly - subtasked {hawk}" to "Sanitize pageview_hourly - subtasked {mole}". Apr 18 2016, 4:35 PM
Nuria edited projects: added Analytics; removed Analytics-Kanban. Mar 8 2018, 6:38 PM
Nuria edited projects: added Analytics-Kanban; removed Analytics.
Nuria lowered the priority of this task from High to Lowest. Jul 22 2019, 8:50 PM

As we have learned that sanitization will not work, this needs an entirely different approach.