
Implement Unique Devices report on cluster using x-analytics header & last access date {bear} [13 pts]
Closed, Resolved, Public

Description

Cluster report that looks at x-Analytics header and extracts the date to calculate uniques.

Ideally, the work in the refined tables to parse the x-analytics header, and the work to tag bots and WMF spiders in our request flow, will be done first.

The logic for parsing the date that is passed along to count uniques is here:
https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution#How_will_we_be_counting:_Plain_English

Careful with bots; those should be filtered out.
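
To make the description concrete, here is a minimal Python sketch of the counting step. It is not the Hive/Oozie job itself. It rests on several assumptions that are not stated in this task: that the X-Analytics value is a `;`-separated list of `key=value` pairs, that the last-access date sits under the key `WMF-Last-Access`, that the date is formatted as `%Y-%m-%d`, and that a request counts as a new unique for the day (or month) when the date is missing or earlier than the day (or the first of the month). The function names are hypothetical.

```python
from datetime import date, datetime


def parse_x_analytics(header):
    """Split a value such as 'a=1;b=2' into a dict (header format is assumed)."""
    pairs = {}
    for item in header.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            pairs[key.strip()] = value.strip()
    return pairs


def last_access(header, key="WMF-Last-Access", fmt="%Y-%m-%d"):
    """Return the last-access date, or None if it is missing or unparseable."""
    raw = parse_x_analytics(header).get(key)
    if raw is None:
        return None
    try:
        return datetime.strptime(raw, fmt).date()
    except ValueError:
        return None


def count_uniques(requests, day):
    """Count (daily, monthly) uniques among requests made on `day`.

    `requests` is an iterable of (x_analytics_header, is_bot) tuples.
    Bots are skipped; a missing date is treated as a first visit.
    """
    month_start = day.replace(day=1)
    daily = monthly = 0
    for header, is_bot in requests:
        if is_bot:
            continue
        seen = last_access(header)
        if seen is None or seen < day:
            daily += 1
        if seen is None or seen < month_start:
            monthly += 1
    return daily, monthly
```

In this sketch a device that last visited yesterday adds to the daily count but not to the monthly one, while a device last seen in the previous month adds to both.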

Event Timeline

Nuria created this task. Mar 17 2015, 4:20 PM
Nuria updated the task description.
Nuria raised the priority of this task from to High.
Nuria added subscribers: DarTar, Eloquence, Rdicerb and 10 others.
Nuria added a comment. Mar 18 2015, 2:54 PM

Let's see: Task T88814 has two parts:

Part #1 VCL changes, code & deploy (https://phabricator.wikimedia.org/T92435)
Part #2 hadoop job (this one)

Changes for these two tasks can be done _somewhat_ in parallel.

This will change a little bit once the UA map is in the refined tables. For now, you can use the UDF to determine if it's a spider. Here's a snippet of my Hive code to find pageviews by spiders.

WHERE
    year = 2015 AND month = 3
    AND is_pageview = TRUE
    -- ua() is the user-agent UDF mentioned above
    AND ua(user_agent)['device_family'] = "Spider"
kevinator renamed this task from Cluster report that looks at x-Analytics header and extracts the date to calculate uniques. to Implement Unique Clients report on cluster using x-analytics header & last access date {bear}. Mar 19 2015, 2:40 PM
kevinator set Security to None.

Hi all, I looked through this again and wanted to get some clarification on it.

When I talked with Aaron about setting the cookie, it was going to be a generic cookie for the month, which expired at the beginning of the next month. I'm more concerned about the privacy implications of setting a specific date in the cookie and having that cookie live for more than a month.

What's the reason for tracking the day of the last visit, and having the cookie live beyond the end of the month?

Nuria added a comment. Edited Mar 19 2015, 6:00 PM

> When I talked with Aaron about setting the cookie, it was going to be a generic cookie for the month, which expired at the beginning of the next month.
> I'm more concerned about the privacy implications of setting a specific date in the cookie and having that cookie live for more than a month.

Cookie lives forever, but its value is changing on every access per @Halfak spec (idea hasn't changed).

Please see: https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution

> Cookie lives forever, but its value is changing on every access per @Halfak spec (idea hasn't changed).
>
> Please see: https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution

If the idea hasn't changed, then the telephone game seems to have garbled it; see discussion on T88813 for what "we" had thought was the proposal.

Nuria added a comment. Mar 19 2015, 6:36 PM

@Anomie
Let us know if there are privacy concerns, but per the comments above we do not think there are any. Note that both tickets reference the same document (https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution).
Also note that the value of the cookie is just a date, and it changes daily because we are counting daily and monthly uniques with this solution. The cookie does not store anything else.

Anomie added a comment. Edited Mar 19 2015, 6:45 PM

Both tasks may refer to the same document, but the discussion on the other ticket clearly doesn't match what the document currently says. Specifically, it's the difference between "you visited Wikipedia sometime this month" (or not) and "you last visited Wikipedia on March 18" (and that's if you're careful to set the expiration to day-boundaries rather than "now+T seconds").

Day granularity versus month granularity on the cookie is more of a privacy concern, but whether it's still an acceptable level I don't know.
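
The distinction between day-boundary expiration and "now+T seconds" can be sketched as follows. This is an illustration only, not the VCL that sets the cookie; the function names are hypothetical and UTC is assumed. A boundary-aligned expiry is the same for every visitor within the period, whereas a "now+T" expiry encodes the exact time of the visit.

```python
from datetime import datetime, timedelta, timezone


def expiry_now_plus(now, seconds):
    """Expiry at now+T: the value reveals the exact moment of the visit."""
    return now + timedelta(seconds=seconds)


def expiry_next_day(now):
    """Expiry at the next midnight: reveals only the day of the visit."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)


def expiry_next_month(now):
    """Expiry at the start of the next month: reveals only the month."""
    first = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if first.month == 12:
        return first.replace(year=first.year + 1, month=1)
    return first.replace(month=first.month + 1)


visit = datetime(2015, 3, 18, 14, 54, tzinfo=timezone.utc)
print(expiry_now_plus(visit, 86400))  # depends on the time of the visit
print(expiry_next_day(visit))         # same for every visit on March 18
print(expiry_next_month(visit))       # same for every visit in March
```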

I merged in T88814 because it's the same work, but that task did not have "daily" in the title and it is one of the requirements.

> I merged in T88814 because it's the same work, but that task did not have "daily" in the title and it is one of the requirements.

@kevinator can you point me to who is driving the "daily" requirement? That makes me uncomfortable from a privacy perspective.

> Cookie lives forever, but its value is changing on every access per @Halfak spec (idea hasn't changed).

@Nuria, cookies live for as long as we tell them to. What's the expiration date you're planning to set on the cookie?

> @Nuria, cookies live for as long as we tell them to. What's the expiration date you're planning to set on the cookie?

Apparently until 3000-01-01T01:01:01Z, based on Gerrit change 196009. Is the Y2038 problem still a thing?
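
For context on the Y2038 question, a quick arithmetic check in Python: the expiry quoted above, expressed as seconds since the Unix epoch, is compared with the largest value a signed 32-bit counter can hold. This only shows the size of the numbers; whether any client actually stores cookie expiry in 32 bits is not established in this thread.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1  # largest signed 32-bit value

# Expiry quoted in the comment above.
expiry = datetime(3000, 1, 1, 1, 1, 1, tzinfo=timezone.utc)
expiry_seconds = int(expiry.timestamp())

# Last instant representable as a signed 32-bit Unix timestamp.
limit = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)

fits_in_int32 = expiry_seconds <= INT32_MAX
print(expiry_seconds, limit.isoformat(), fits_in_int32)
```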

Nuria added a comment. Mar 20 2015, 8:07 PM

@csteipp:

I was planning to let the cookie live forever; its value is updated upon access once a day.

Nuria added a comment. Mar 20 2015, 9:15 PM

@csteipp:
Having the cookie live forever means that no date calculations are needed in VCL for an expiration date. I cannot see a privacy issue with this, given that the cookie only holds YYYY_DD_MM; let me know otherwise.

@csteipp, as the Product Manager for analytics, I am driving this requirement. Monthly and Daily active users are standard web metrics. I'd like to know more about any privacy issues around daily uniques. Perhaps the best place to document this is the talk page: https://wikitech.wikimedia.org/wiki/Talk:Analytics/Unique_clients/Last_access_solution

kevinator moved this task from Tasked_Hidden to Next Up on the Analytics-Kanban board.
kevinator renamed this task from Implement Unique Clients report on cluster using x-analytics header & last access date {bear} to Implement Unique Clients report on cluster using x-analytics header & last access date {bear} [13 pts].

Change 216341 had a related patch set uploaded (by Madhuvishy):
[WIP] Daily and monthly uniques oozie jobs based on WMF-Last-Access cookie

https://gerrit.wikimedia.org/r/216341

We have enough data to start the validation task... I recommend we call this done.
We can deploy automated reporting in this task: T103376

kevinator closed this task as Resolved. Jun 23 2015, 3:42 PM

We have some preliminary data, which is inconclusive. The next task is to vet it with a researcher to help us improve it.

Change 216341 merged by Ottomata:
Daily last_access uniques oozie job

https://gerrit.wikimedia.org/r/216341

Nuria renamed this task from Implement Unique Clients report on cluster using x-analytics header & last access date {bear} [13 pts] to Implement Unique Devices report on cluster using x-analytics header & last access date {bear} [13 pts]. Feb 3 2016, 4:57 PM