
Make a copy of webrequest logs during the survey
Closed, Resolved · Public

Description

Webrequest logs get auto-purged at 60 days. We need to make a copy of them for the period of June 21 to 30 (note that the survey ran from 2017-06-22 (13:09 UTC) to 2017-06-29 (23:19 UTC)). We will keep this copy for the full 90-day period allowed by the privacy policy and anonymize/aggregate the PII right before that point.

Note that this research involves de-biasing the results based on webrequest logs, so we need to keep this data for as long as possible, until the research is finished.

The following extraction is considered:

CREATE TABLE motivations.all_requests AS
SELECT
    client_ip,
    user_agent,
    geocoded_data,
    user_agent_map,
    ts,
    referer,
    title,
    uri_path,
    uri_host,
    uri_query,
    http_status,
    is_pageview,
    access_method,
    referer_class,
    normalized_host,
    pageview_info,
    year,
    month,
    day,
    hour,
    agent_type
FROM
    wmf.webrequest
WHERE
    year = 2017
    AND month = 6
    AND day IN (21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
    AND webrequest_source = 'text'
    AND access_method != 'mobile app'
    AND agent_type = 'user';

Event Timeline

leila created this task. · Jul 25 2017, 4:53 PM
Restricted Application added a subscriber: Aklapper. · Jul 25 2017, 4:53 PM
leila added a comment. · Jul 25 2017, 5:00 PM

@Cervisiarius do we need the x_forwarded_for header?

@flemmerich do we need accept_language? If that captures the browser language, it's good to include, in case we need to fine-tune device IDs as we build sessions.

We have not used "accept_language" so far. I don't think we necessarily need it for building sessions, since we already have browser information in user_agent_map. It could be an interesting additional feature for the analysis, though (specifically if we get to a cross-edition browsing behavior analysis).

leila added a comment. · Jul 26 2017, 2:35 PM

Please include it, @flemmerich.
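
For illustration only, a minimal sketch of the amended extraction, assuming accept_language is simply added to the column list of the query in the description; this is not necessarily the query that was eventually run, and x_forwarded_for could be added the same way if @Cervisiarius confirms it is needed:

CREATE TABLE motivations.all_requests AS
SELECT
    client_ip,
    user_agent,
    accept_language,  -- added per the discussion above
    geocoded_data,
    user_agent_map,
    ts,
    referer,
    title,
    uri_path,
    uri_host,
    uri_query,
    http_status,
    is_pageview,
    access_method,
    referer_class,
    normalized_host,
    pageview_info,
    year,
    month,
    day,
    hour,
    agent_type
FROM
    wmf.webrequest
WHERE
    year = 2017
    AND month = 6
    AND day IN (21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
    AND webrequest_source = 'text'
    AND access_method != 'mobile app'
    AND agent_type = 'user';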

Nuria added a comment. · Jul 26 2017, 3:11 PM

If we have a legal exception to keep this data, it must be copied within Hadoop; it cannot leave the cluster, as it is much too large.

leila added a comment. · Jul 26 2017, 5:27 PM

@Nuria the data will remain in Hadoop, in HDFS format. We will keep the complete data for 90 days and remove PII right before the 90-day mark. We don't need a legal exception for that, correct?

Nuria added a comment. · Jul 26 2017, 5:46 PM

Removing PII doesn't prevent the data from being cross-checked with other sources and that PII inferred, if records are not aggregated (simplistic example).
Long term, we only keep pageview data aggregated on the article dimension in pageview_hourly; other data is removed at the 90-day mark.
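
For illustration, a rough sketch of the kind of aggregation this points toward, assuming a hypothetical motivations.pageviews_aggregated table (the actual pageview_hourly pipeline is more involved): grouping on the article dimension drops the per-request identifiers (client_ip, user_agent, geocoded_data, user_agent_map) entirely.

CREATE TABLE motivations.pageviews_aggregated AS
SELECT
    pageview_info['project']    AS project,
    pageview_info['page_title'] AS page_title,
    access_method,
    referer_class,
    year,
    month,
    day,
    hour,
    COUNT(*) AS view_count  -- per-request identifiers disappear in the aggregation
FROM
    motivations.all_requests
WHERE
    is_pageview
GROUP BY
    pageview_info['project'],
    pageview_info['page_title'],
    access_method,
    referer_class,
    year,
    month,
    day,
    hour;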

leila added a comment. · Jul 26 2017, 5:53 PM

Right. :) I'll check with Legal and apply for an exemption if needed. This is a repeat of the exact same study as last year's, and I don't expect it to need an exemption, but I'll check.

Nuria added a comment. · Jul 26 2017, 6:39 PM

Update:

After an IRC conversation: we do not need all requests, nor do we need all dimensions, so it is likely we can get a sufficiently anonymized dataset.

leila added a comment.Aug 3 2017, 8:50 PM

@flemmerich please go ahead and make a copy of the week's data, just to make sure we don't lose it accidentally. Nuria and I will continue conversations on aggregating/anonymizing/purging.

leila added a comment. · Mar 11 2018, 6:14 PM

I forgot to update this task earlier: flemmerich made the copy, we built the traces, sampled them, dropped the PII fields from the sampled traces, and completely dropped the part that was not sampled.
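
For the record, a rough sketch of what that final step can look like; the table names (motivations.traces, motivations.traces_final) and the sampled flag are assumptions for illustration, not taken from this task:

CREATE TABLE motivations.traces_final AS
SELECT
    -- PII columns such as client_ip, user_agent, geocoded_data and
    -- user_agent_map are deliberately not selected
    ts,
    referer,
    uri_host,
    uri_path,
    uri_query,
    http_status,
    is_pageview,
    access_method,
    referer_class,
    normalized_host,
    pageview_info,
    agent_type
FROM
    motivations.traces
WHERE
    sampled;  -- keep only the sampled part of the traces

-- the unsampled copy, which still contains PII, is then dropped entirely
DROP TABLE motivations.traces;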

leila closed this task as Resolved. · Mar 11 2018, 6:14 PM