
Homepage: purge sanitized event data through 2019-11-04
Closed, Resolved (Public)

Description

To stay compliant with the Growth Team's data retention approvals for experiments, we would like to have some of the sanitized data related to the Homepage experiment deleted ASAP (rather than waiting for the 270-day automatic deletion to roll around).

Per this diff to the Data Retention Guidelines, our rolling approval took effect on 2019-11-05. We are therefore asking to have data prior to, and including, 2019-11-04 deleted. The following tables are affected:

  • event_sanitized.homepagevisit
  • event_sanitized.homepagemodule
  • event_sanitized.helppanel

Please let @nettrom_WMF or @MMiller_WMF know what questions there might be about this.

Event Timeline

Do we need to delete all data in the tables or just some specific partitions?

fdans moved this task from Incoming to Ops Week on the Analytics board.

Do we need to delete all data in the tables or just some specific partitions?

Since these are EventLogging tables, they're partitioned by year/month/day/hour, so all partitions for dates <= 2019-11-04 should be deleted. (In other words, data from 2019-11-05 onwards should still be available.)
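To make the boundary explicit, here is a minimal sketch of the cutoff rule described above. The name `should_purge` and the constant `KEEP_FROM` are made up for illustration; this is not how the purge itself was implemented.

```python
from datetime import date

# Assumption: first day whose data must be kept, taken from the task description.
KEEP_FROM = date(2019, 11, 5)


def should_purge(year: int, month: int, day: int) -> bool:
    """Return True if the partition for this calendar day falls before the cutoff."""
    return date(year, month, day) < KEEP_FROM


# Hourly partitions follow the decision made for their day.
print(should_purge(2019, 11, 4))  # last day that should be deleted
print(should_purge(2019, 11, 5))  # first day that should be kept
```

Comparing against the first day to keep (rather than the last day to delete) avoids any ambiguity about whether the boundary date itself is included.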

Purge complete!

For event_sanitized.homepagevisit

PYTHONPATH=/srv/deployment/analytics/refinery/python /srv/deployment/analytics/refinery/bin/refinery-drop-older-than -d event_sanitized -t event_sanitized.homepagevisit -b /wmf/data/event_sanitized/HomepageVisit --path-format='year=(?P<year>[0-9]+)(/month=(?P<month>[0-9]+)(/day=(?P<day>[0-9]+)(/hour=(?P<hour>[0-9]+))?)?)?' -o 2019-11-05

For event_sanitized.homepagemodule

PYTHONPATH=/srv/deployment/analytics/refinery/python /srv/deployment/analytics/refinery/bin/refinery-drop-older-than -d event_sanitized -t event_sanitized.homepagemodule -b /wmf/data/event_sanitized/HomepageModule --path-format='year=(?P<year>[0-9]+)(/month=(?P<month>[0-9]+)(/day=(?P<day>[0-9]+)(/hour=(?P<hour>[0-9]+))?)?)?' -o 2019-11-05

For event_sanitized.helppanel

PYTHONPATH=/srv/deployment/analytics/refinery/python /srv/deployment/analytics/refinery/bin/refinery-drop-older-than -d event_sanitized -t event_sanitized.helppanel -b /wmf/data/event_sanitized/HelpPanel --path-format='year=(?P<year>[0-9]+)(/month=(?P<month>[0-9]+)(/day=(?P<day>[0-9]+)(/hour=(?P<hour>[0-9]+))?)?)?' -o 2019-11-05
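The three commands differ only in the table name and base path; the `--path-format` pattern is the same in each. As a rough illustration of what that pattern captures, the sketch below applies it to partition paths relative to the base directory. The helper `parse_partition` is hypothetical and is not taken from the refinery script, whose internals are not shown in this task.

```python
import re

# Pattern copied verbatim from the --path-format argument in the commands above.
PATH_FORMAT = re.compile(
    r"year=(?P<year>[0-9]+)(/month=(?P<month>[0-9]+)"
    r"(/day=(?P<day>[0-9]+)(/hour=(?P<hour>[0-9]+))?)?)?"
)


def parse_partition(relative_path: str) -> dict:
    """Hypothetical helper: extract the partition fields present in a relative path."""
    match = PATH_FORMAT.fullmatch(relative_path)
    if match is None:
        raise ValueError(f"not a partition path: {relative_path!r}")
    # Optional levels (month, day, hour) that are absent come back as None; skip them.
    return {
        name: int(value)
        for name, value in match.groupdict().items()
        if value is not None
    }


print(parse_partition("year=2019/month=11/day=4/hour=23"))
print(parse_partition("year=2019/month=11"))
```

Because the month, day, and hour groups are nested optionals, the same pattern matches directories at any depth of the year/month/day/hour hierarchy.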

Verified in Hive that no data dated 2019-11-04 or earlier remains for any of the three schemas, so this task can be closed. Thanks for your work on this, @fdans!
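The queries used for this verification are not recorded in the task. As one possible sketch, the snippet below builds a HiveQL count of rows left in partitions on or before the cutoff day; a result of zero for each table would be consistent with the purge. It assumes the partition columns are named `year`, `month`, and `day`, as the path format suggests, and `leftover_rows_query` is a name made up for illustration.

```python
def leftover_rows_query(table: str, year: int, month: int, day: int) -> str:
    """Build a HiveQL query counting rows in partitions on or before the given day."""
    return (
        f"SELECT COUNT(*) AS leftover_rows FROM {table} "
        f"WHERE year < {year} "
        f"OR (year = {year} AND month < {month}) "
        f"OR (year = {year} AND month = {month} AND day <= {day})"
    )


# Table names are taken from the task description.
TABLES = (
    "event_sanitized.homepagevisit",
    "event_sanitized.homepagemodule",
    "event_sanitized.helppanel",
)

for table in TABLES:
    print(leftover_rows_query(table, 2019, 11, 4))
```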