
Generate test data for Pageview API [5 pts]
Closed, Resolved (Public)

Authored By
kevinator
Jun 8 2015, 10:29 PM
Referenced Files
F190565: project_cube_hist.hql
Jul 9 2015, 1:33 PM
F190581: k_anonymity_analysis.ods
Jul 9 2015, 1:33 PM
F190564: page_title_hist.hql
Jul 9 2015, 1:33 PM
F188462: project_cube.hql
Jul 3 2015, 2:58 PM
F188461: page_title_hourly.hql
Jul 3 2015, 2:58 PM

Description

Generate test data for one day with these dimensions:

  • sub-cube: project, day/hour, agent type, pseudo-k anonymized with k = 100
  • hourly data, diminishing resolution: project, dialect, article

Output is TSV on HDFS (see the export sketch below)

  • header could look like:
    • dim1, dim2, dim3, ... , count
    • A, B, null, ... , 120 (null means "all" dim3 values)
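
A minimal export sketch of this format, assuming a wmf.pageview_hourly source table, these column names, and an illustrative output path (none of this is the final script). INSERT OVERWRITE DIRECTORY writes a headerless TSV straight to HDFS, and WITH ROLLUP produces the NULL rows that stand for "all" values of a dimension:

SET hive.exec.compress.output=true;    -- gzip the output files
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE DIRECTORY '/user/joal/api_data_sample/example'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT
  project,
  agent_type,                -- NULL on rollup rows means "all" agent types
  SUM(view_count) AS view_count
FROM wmf.pageview_hourly     -- assumed source table
WHERE year = 2015 AND month = 7 AND day = 1
GROUP BY project, agent_type WITH ROLLUP;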

Event Timeline

kevinator raised the priority of this task from to Medium.
kevinator updated the task description.
kevinator subscribed.

Datasets for day 2015-07-01 were exported using the attached hive scripts.

The two datasets are gzipped TSVs without headers (headers are difficult to generate with hive). Both ensure k-anonymity with k=100: points where view_count < 100 are discarded from the detailed view, but are still taken into account for aggregation (see the sketch below).
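
As an illustration of that rule (a sketch with assumed table and column names, not the attached scripts), small hourly points drop out of the detailed rows but still count toward the day-level totals:

WITH hourly AS (
  SELECT project, agent_type, hour, SUM(view_count) AS view_count
  FROM wmf.pageview_hourly                 -- assumed source table
  WHERE year = 2015 AND month = 7 AND day = 1
  GROUP BY project, agent_type, hour
)
SELECT * FROM (
  -- Day-level rows aggregate ALL hourly points, including those below k.
  SELECT '2015-07-01' AS day, '-' AS time, project, agent_type,
         SUM(view_count) AS view_count
  FROM hourly
  GROUP BY project, agent_type
  UNION ALL
  -- Hourly points below k are discarded from the detailed view.
  SELECT '2015-07-01' AS day, CAST(hour AS STRING) AS time, project,
         agent_type, view_count
  FROM hourly
  WHERE view_count >= 100
) unioned;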

Exported datasets are in hdfs:

  • /user/joal/api_data_sample/project_cube/000000_0.gz
    • Columns: day, time, project, agent_type, view_count
    • day-aggregated datapoints have time = '-'
    • ~100 KB
  • /user/joal/api_data_sample/page_title_hourly/000000_0.gz
    • Columns: day, time, project, language_variant, page_title, view_count
    • Bots and undefined pages ('-') are removed.
    • ~3.5 MB --> k-anonymity with k = 100 removes a lot of lines! Maybe we should drop it to 20? To be discussed.

Distribution analysis for both page_title_hourly and project_cube

  1. page_title_hourly
      • ~60% of page_titles have 1 view, representing ~15% of view_counts
      • ~25% of page_titles have 2 to 4 views, representing ~15% of view_counts
      • ~7% of page_titles have 5 to 9 views, representing ~10% of view_counts
      • ~4% of page_titles have 10 to 24 views, representing ~15% of view_counts
      • ~1% of page_titles have 25 to 49 views, representing ~10% of view_counts
      • ~0.5% of page_titles have 50 to 99 views, representing ~8% of view_counts
      • ~0.25% of page_titles have 100 and more views, representing ~25% of view_counts
    • k=100 is too big: we lose 75% of view_counts. Moving it to 10 would allow us to keep ~60% of view_counts, but we'd still lose 90% of page_titles... Hard choice.
  2. project_cube_hourly
      • ~4% of projects/agent_type have 1 view, representing ~0.0002% of view_counts
      • ~7% of projects/agent_type have 2 to 4 views, representing ~0.001% of view_counts
      • ~7% of projects/agent_type have 5 to 9 views, representing ~0.003% of view_counts
      • ~11% of projects/agent_type have 10 to 24 views, representing ~0.01% of view_counts
      • ~9% of projects/agent_type have 25 to 49 views, representing ~0.02% of view_counts
      • ~11% of projects/agent_type have 50 to 99 views, representing ~0.05% of view_counts
      • ~50% of projects/agent_type have 100 and more views, representing ~99.9% of view_counts
    • Keeping k=100 for projects is very much ok from a view_counts perspective, but we still lose a lot of project data.

I also tried page_title_daily (instead of hourly) and it's better:

    • ~47% of page_titles have 1 view, representing ~3% of view_counts
    • ~28% of page_titles have 2 to 4 views, representing ~5% of view_counts
    • ~10% of page_titles have 5 to 9 views, representing ~5% of view_counts
    • ~7% of page_titles have 10 to 24 views, representing ~8% of view_counts
    • ~3% of page_titles have 25 to 49 views, representing ~7% of view_counts
    • ~2% of page_titles have 50 to 99 views, representing ~8% of view_counts
    • ~2% of page_titles have 100 and more views, representing ~64% of view_counts
  • With k=100 we still lose 98% of page_titles, but we keep 64% of view_counts. Moving it to 10 would allow us to keep ~88% of view_counts, but we'd still lose 75% of page_titles... Hard as well. (A sketch of the bucketing behind these numbers follows.)
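
These histograms come from bucketing per-title view counts, roughly like the query below (assumed table and column names; the attached page_title_hist.hql is the authoritative version):

SELECT bucket,
       COUNT(*)        AS page_titles,  -- basis for the page_titles shares
       SUM(view_count) AS views         -- basis for the view_counts shares
FROM (
  SELECT view_count,
         CASE
           WHEN view_count = 1   THEN '1'
           WHEN view_count < 5   THEN '2-4'
           WHEN view_count < 10  THEN '5-9'
           WHEN view_count < 25  THEN '10-24'
           WHEN view_count < 50  THEN '25-49'
           WHEN view_count < 100 THEN '50-99'
           ELSE '100+'
         END AS bucket
  FROM (
    -- Daily variant; add hour to the GROUP BY for the hourly one.
    SELECT project, page_title, SUM(view_count) AS view_count
    FROM wmf.pageview_hourly            -- assumed source table
    WHERE year = 2015 AND month = 7 AND day = 1
      AND agent_type = 'user'           -- bots removed, as in the export
      AND page_title != '-'             -- undefined pages removed
    GROUP BY project, page_title
  ) per_title
) bucketed
GROUP BY bucket;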

I'm not sure k-anonymization is the right choice here. I think even if we chose a huge K, we would still be vulnerable to the problems that l-diversity and t-closeness solve. We should talk more about this, because the truncation you're seeing would make the dataset useless, and we may as well not release it.

1) Page title hourly

The hive query looks good to me!
I agree with you that K=100 seems too much in this case.
I like the idea of having just daily granularity and maybe a lower K.
Even if we lose 75%-90% of page_titles, this would be a nice feature.
So: +2!

2) Project cube hourly

I see that the hive query calculates daily and hourly data. However, I had understood that we would generate all combinations of dimension values, like (note that the view counts are not real):

day            hour         project          ua         view count

-              -            -                -          23454365      // all view counts for all projects, all time, all agent types
-              -            -                user       23408620      // all view counts for that agent type
-              -            -                spider     45745
-              -            aa.wikipedia     -          567           // all view counts for this project
-              -            aa.wikipedia     user       500
-              -            aa.wikipedia     spider     67
-              -            ab.wikipedia     -          362

...

-              00:00:00     -                -          902090        // all view counts for that hour of the day
-              00:00:00     -                user       900000
-              00:00:00     -                spider     2090

...

2015-07-01     -            -                -          5134557       // all view counts for that day (this is already there!)

...

2015-07-01     18:00:00     nap.wikipedia    user       120           // specific count (this is already there!)

I'd say we need all those extra aggregations to permit such queries as "all view counts for a certain project", no?
If we go for this, and apply K~100 only after aggregation, our users will always be able to relax their queries a bit to reach a correct value that is not truncated by anonymization. And if the result of their query is '-', they'll know that the value is < K.
Does this make sense?
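
In hive, something along these lines would generate all those combinations, with K applied only after aggregation (a sketch assuming a wmf.pageview_hourly source table; '-' marks a dimension aggregated over all its values):

SELECT
  COALESCE(CAST(day AS STRING), '-')  AS day,   -- day formatting simplified
  COALESCE(CAST(hour AS STRING), '-') AS hour,
  COALESCE(project, '-')              AS project,
  COALESCE(agent_type, '-')           AS agent_type,
  SUM(view_count)                     AS view_count
FROM wmf.pageview_hourly              -- assumed source table
WHERE year = 2015 AND month = 7 AND day = 1
GROUP BY day, hour, project, agent_type WITH CUBE
HAVING SUM(view_count) >= 100;        -- K applied only after aggregation

In a real script one would check GROUPING__ID rather than rely on COALESCE, to distinguish cube-generated NULLs from NULLs already present in the data.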

The team has decided not to k-anonymize the data delivered for the API because we are not exposing any geolocation and therefore cannot identify any editor.