
Generate test data for Pageview API {slug} [5 pts]
Closed, Resolved · Public

Description

Generate test data for one day with these dimensions:

  • sub-cube: project, day/hour, agent type, pseudo-k anonymized with k = 100
  • hourly data, diminishing resolution: project, dialect, article

Output is TSV on HDFS

  • header could look like:
    • dim1, dim2, dim3, ... , count
    • A, B, null, ... , 120 (null means the row aggregates over all dim3 values)

Event Timeline

kevinator raised the priority of this task to Normal.
kevinator updated the task description. (Show Details)
kevinator added a subscriber: kevinator.
Restricted Application added a subscriber: Aklapper. · Jun 8 2015, 10:29 PM
ggellerman moved this task from Incoming to Tasked on the Analytics-Backlog board.
ggellerman edited projects, added Analytics-Kanban; removed Analytics-Backlog.
ggellerman set Security to None.
ggellerman moved this task from Tasked_Hidden to In Progress on the Analytics-Kanban board.

Datasets for day 2015-07-01 were exported using the attached hive scripts.

The two datasets are gzipped TSVs without headers (headers are difficult to generate with hive), and both ensure k-anonymity with k = 100: data points where view_count < 100 are discarded from the detailed view, but are still taken into account for the aggregated rows (see the sketch after the list below).

Exported datasets are in hdfs:

  • /user/joal/api_data_sample/project_cube/000000_0.gz
    • Columns: day, time, project, agent_type, view_count
    • day-aggregated datapoints have time = '-'
    • ~100 KB
  • /user/joal/api_data_sample/page_title_hourly/000000_0.gz
    • Columns: day, time, project, language_variant, page_title, view_count
    • Bots and undefined pages ('-') are removed.
    • ~3.5 MB; k-anonymity with k = 100 removes a lot of lines! Maybe we should drop k to 20? To be discussed.
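
To make the k = 100 logic concrete, here is a minimal HiveQL sketch of the project_cube export. This is a sketch under assumed names only: pageview_raw and its columns are hypothetical stand-ins, not the attached scripts.

```sql
-- Sketch only: pageview_raw is an assumed source-table name with one row
-- per (day, time, project, agent_type); the attached scripts may differ.
SELECT day, time, project, agent_type, view_count
FROM (
    -- Day-aggregated rows (time = '-'), computed over ALL data points,
    -- including those discarded from the detailed view below.
    SELECT day, '-' AS time, project, agent_type,
           SUM(view_count) AS view_count
    FROM pageview_raw
    GROUP BY day, project, agent_type

    UNION ALL

    -- Detailed hourly rows: drop any point with fewer than k = 100 views.
    SELECT day, time, project, agent_type,
           SUM(view_count) AS view_count
    FROM pageview_raw
    GROUP BY day, time, project, agent_type
    HAVING SUM(view_count) >= 100
) unioned;
```

The key point is that the HAVING filter applies only to the detailed rollup; the day-level totals are computed first, over everything, so the discarded hourly points still contribute to them.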

Distribution analysis for both page_title_hourly and project_cube

  1. page_title_hourly
      • ~60% of page_titles have 1 view, representing ~15% of view_counts
      • ~25% of page_titles have 2 to 4 views, representing ~15% of view_counts
      • ~7% of page_titles have 5 to 9 views, representing ~10% of view_counts
      • ~4% of page_titles have 10 to 24 views, representing ~15% of view_counts
      • ~1% of page_titles have 25 to 49 views, representing ~10% of view_counts
      • ~0.5% of page_titles have 50 to 99 views, representing ~8% of view_counts
      • ~0.25% of page_titles have 100 and more views, representing ~25% of view_counts
    • k=100 is too big: we lose 75% of view_counts. Moving it to 10 would let us keep ~60% of view_counts, but we'd still lose 90% of page_titles ... Hard choice. (A bucketing query like the sketch after this list can reproduce these percentages.)
  2. project_cube_hourly
      • ~4% of projects/agent_type have 1 view, representing ~0.0002% of view_counts
      • ~7% of projects/agent_type have 2 to 4 views, representing ~0.001% of view_counts
      • ~7% of projects/agent_type have 5 to 9 views, representing ~0.003% of view_counts
      • ~11% of projects/agent_type have 10 to 24 views, representing ~0.01% of view_counts
      • ~9% of projects/agent_type have 25 to 49 views, representing ~0.02% of view_counts
      • ~11% of projects/agent_type have 50 to 99 views, representing ~0.05% of view_counts
      • ~50% of projects/agent_type have 100 and more views, representing ~99.9% of view_counts
    • Keeping k=100 for projects is very much OK from a view_counts perspective, but we still lose a lot of project data.
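
Percentages like the above (and the daily ones below) can be derived with a bucketing query along these lines. Again a sketch with assumed names: page_title_hourly and its columns are hypothetical stand-ins for the real dataset.

```sql
-- Sketch only: page_title_hourly is an assumed table name. One inner row
-- = one (day, time, project, page_title) data point; switch the inner
-- GROUP BY to (day, project, page_title) for the daily variant below.
SELECT bucket,
       COUNT(*)        AS n_data_points,
       SUM(view_count) AS total_views
FROM (
    SELECT SUM(view_count) AS view_count,
           CASE
             WHEN SUM(view_count) = 1   THEN '1'
             WHEN SUM(view_count) <= 4  THEN '2-4'
             WHEN SUM(view_count) <= 9  THEN '5-9'
             WHEN SUM(view_count) <= 24 THEN '10-24'
             WHEN SUM(view_count) <= 49 THEN '25-49'
             WHEN SUM(view_count) <= 99 THEN '50-99'
             ELSE '100+'
           END AS bucket
    FROM page_title_hourly
    GROUP BY day, time, project, page_title
) per_point
GROUP BY bucket;
```

Dividing n_data_points and total_views by their grand totals gives the percentage pairs quoted above.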

I also tried page_title_daily (instead of hourly) and it's better:

    • ~47% of page_titles have 1 view, representing ~3% of view_counts
    • ~28% of page_titles have 2 to 4 views, representing ~5% of view_counts
    • ~10% of page_titles have 5 to 9 views, representing ~5% of view_counts
    • ~7% of page_titles have 10 to 24 views, representing ~8% of view_counts
    • ~3% of page_titles have 25 to 49 views, representing ~7% of view_counts
    • ~2% of page_titles have 50 to 99 views, representing ~8% of view_counts
    • ~2% of page_titles have 100 and more views, representing ~64% of view_counts
  • With k=100 we still lose 98% of page_titles, but we keep 64% of view_counts. Moving it to 10 would let us keep ~88% of view_counts, but we'd still lose 75% of page_titles ... Hard as well.

I'm not sure k-anonymization is the right choice here. Even if we chose a huge k, we would still be vulnerable to the problems that l-diversity and t-closeness solve (for example, if all k records in a group share the same sensitive value, k-anonymity alone still reveals it). We should talk more about this, because the truncation you're seeing would make the dataset useless, and then we may as well not release it.

mforns added a subscriber: mforns. · Jul 10 2015, 7:03 PM

1) Page title hourly

The hive query looks good to me!
I agree with you that K=100 seems too much in this case.
I like the idea of having just daily granularity and maybe a lower K.
Even if we lose 75%-90% of page_titles, this would be a nice feature.
So: +2!

2) Project cube hourly

I see that the hive query calculates daily and hourly data. However, I had understood that we would generate all combinations of dimension values, like (note that the view counts are not real):

day            hour         project          ua         view count

-              -            -                -          23454365      // all view counts for all projects, all time, all agent types
-              -            -                user       23408620      // all view counts for that agent type
-              -            -                spider     45745
-              -            aa.wikipedia     -          567           // all view counts for this project
-              -            aa.wikipedia     user       500
-              -            aa.wikipedia     spider     67
-              -            ab.wikipedia     -          362

...

-              00:00:00     -                -          902090        // all view counts for that hour of the day
-              00:00:00     -                user       900000
-              00:00:00     -                spider     2090

...

2015-07-01     -            -                -          5134557       // all view counts for that day (this is already there!)

...

2015-07-01     18:00:00     nap.wikipedia    user       120           // specific count (this is already there!)

I'd say we need all those extra aggregations to permit such queries as "all view counts for a certain project", no?
If we go for this, and apply a K of ~100 only after aggregation, our users will always be able to relax their queries a bit to get a correct value that isn't truncated by anonymization. And if the result of their query is '-', they'll know that the value is < K.
Does this make sense?
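
For what it's worth, Hive can generate all those dimension combinations in one pass with WITH CUBE. A minimal sketch, with pageview_raw again a hypothetical source-table name and '-' as the marker for aggregated dimensions, as in the table above:

```sql
-- Sketch only: pageview_raw is an assumed source-table name; '-' marks an
-- aggregated ("all values") dimension, matching the table above.
SELECT COALESCE(day, '-')        AS day,
       COALESCE(hour, '-')       AS hour,
       COALESCE(project, '-')    AS project,
       COALESCE(agent_type, '-') AS agent_type,
       SUM(view_count)           AS view_count
FROM pageview_raw
GROUP BY day, hour, project, agent_type WITH CUBE
-- Apply K only after aggregation, as proposed above.
HAVING SUM(view_count) >= 100;
```

With 4 dimensions the cube yields 2^4 = 16 grouping sets, so applying K after aggregation (the HAVING clause) also keeps the output size manageable.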

The team has decided not to k-anonymize the data delivered for the API because we are not exposing any geolocation and therefore cannot identify any editor.

kevinator closed this task as Resolved. · Jul 22 2015, 5:43 PM