Use two tables: populate pageview_hourly from pageview_hourly_unsanitized instead of using the webrequest table as the source for validation, which is too slow.
Generate the filters from a rolling window over the previous day or week of pageview_hourly_unsanitized (thresholds TBD).
Test the validity of the sanitisation (nuria has existing code).
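As a rough illustration of the two-table idea, the sketch below builds an allowlist from a rolling window over the unsanitized rows and then uses it to populate the sanitized rows for one day. It is plain Python rather than the Hive code the task calls for, and everything specific in it is an assumption: the row fields (`day`, `project`, `country`, `view_count`), the choice to mask `country`, the `"Unknown"` placeholder, the 7-day window, and the threshold of 100 are all hypothetical, since the thresholds are still TBD.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical values: the task leaves the real thresholds "TBD".
WINDOW_DAYS = 7
MIN_VIEWS = 100


def build_allowlist(unsanitized_rows, target_day,
                    window_days=WINDOW_DAYS, min_views=MIN_VIEWS):
    """Return (project, country) pairs with at least min_views views
    in the window_days days before target_day (target_day excluded)."""
    start = target_day - timedelta(days=window_days)
    totals = Counter()
    for row in unsanitized_rows:
        if start <= row["day"] < target_day:
            totals[(row["project"], row["country"])] += row["view_count"]
    return {pair for pair, total in totals.items() if total >= min_views}


def sanitize(unsanitized_rows, target_day, allowlist):
    """Yield target_day rows for the sanitized table, masking the
    country of any (project, country) pair not in the allowlist."""
    for row in unsanitized_rows:
        if row["day"] != target_day:
            continue
        out = dict(row)  # copy so the unsanitized row is left untouched
        if (row["project"], row["country"]) not in allowlist:
            out["country"] = "Unknown"  # hypothetical placeholder value
        yield out


# Made-up sample rows standing in for pageview_hourly_unsanitized.
rows = [
    {"day": date(2016, 1, 1), "project": "en.wikipedia", "country": "US", "view_count": 500},
    {"day": date(2016, 1, 1), "project": "en.wikipedia", "country": "XX", "view_count": 3},
    {"day": date(2016, 1, 2), "project": "en.wikipedia", "country": "US", "view_count": 10},
    {"day": date(2016, 1, 2), "project": "en.wikipedia", "country": "XX", "view_count": 1},
]
target = date(2016, 1, 2)
allowlist = build_allowlist(rows, target)
sanitized = list(sanitize(rows, target, allowlist))
```

Using an allowlist rather than a blocklist means a pair that never appeared in the window is masked by default, which seems the safer choice for anonymization, though the actual strategy may differ.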
Description
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
analytics/refinery/source | master | +303 -0 | [WIP] Sanitize pageview_hourly table
Status | Assigned | Task
---|---|---
Resolved | None | T120037 Vital Signs: Please provide an "all languages" de-duplicated stream for the Community/Content groups of metrics
Resolved | None | T120036 Vital Signs: Please make the data for enwiki and other big wikis less sad, and not just be missing for most days
Open | None | T130256 Wikistats 2.0.
Resolved | None | T90759 Create Daily & Monthly pageview dump with country data and Visualize on UI
Open | None | T114675 Sanitize pageview_hourly
Duplicate | None | T118841 Deploy pageview sanitization and start ongoing process {hawk}
Declined | None | T118839 Productionize Pageview_sanitization hive code with Oozie job and refinery inclusion {hawk}
Resolved | JAllemandou | T118838 Write hive code doing pageview data anonimisation with two tables {hawk}
Event Timeline
After meeting with team: we are going to have our anonymization strategy peer-reviewed by research before we roll out implementation.
Change 260408 had a related patch set uploaded (by Mforns):
[WIP] Sanitize pageview_hourly table
Change 260408 abandoned by Mforns:
[WIP] Sanitize pageview_hourly table
Reason:
This change is obsolete; it was the base for https://gerrit.wikimedia.org/r/#/c/271033/
The actual development is being done in the latter.