
Generate test data for Pageview API [5 pts]
Closed, Resolved (Public)

Authored By
kevinator
Jun 8 2015, 10:29 PM
Referenced Files
F190565: project_cube_hist.hql
Jul 9 2015, 1:33 PM
F190581: k_anonymity_analysis.ods
Jul 9 2015, 1:33 PM
F190564: page_title_hist.hql
Jul 9 2015, 1:33 PM
F188462: project_cube.hql
Jul 3 2015, 2:58 PM
F188461: page_title_hourly.hql
Jul 3 2015, 2:58 PM

Description

Generate test data for one day with these dimensions:

  • sub-cube: project, day/hour, agent type, pseudo-k anonymized with k = 100
  • hourly data, diminishing resolution: project, dialect, article

Output is TSV on HDFS (see the export sketch below)

  • header could look like:
    • dim1, dim2, dim3, ... , count
    • A, B, null, ... , 120 (null means "all" dim3 values)
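
A minimal export sketch of this format, assuming a wmf.pageview_hourly source table, these column names, and an illustrative output path (none of this is the final script). INSERT OVERWRITE DIRECTORY writes a headerless TSV straight to HDFS, and WITH ROLLUP produces the NULL rows that stand for "all" values of a dimension:

SET hive.exec.compress.output=true;    -- gzip the output files
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE DIRECTORY '/user/joal/api_data_sample/example'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT
  project,
  agent_type,                -- NULL on rollup rows means "all" agent types
  SUM(view_count) AS view_count
FROM wmf.pageview_hourly     -- assumed source table
WHERE year = 2015 AND month = 7 AND day = 1
GROUP BY project, agent_type WITH ROLLUP;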

Event Timeline

kevinator raised the priority of this task from to Medium.
kevinator updated the task description.
kevinator subscribed.

Datasets for day 2015-07-01 were exported using the attached hive scripts.

The two datasets are gzipped TSVs without headers (headers are difficult to generate with hive). Both ensure k-anonymity with k=100: points where view_count < 100 are discarded from the detailed view, but are still taken into account for aggregation (see the sketch below).
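
As an illustration of that rule (a sketch with assumed table and column names, not the attached scripts), small hourly points drop out of the detailed rows but still count toward the day-level totals:

WITH hourly AS (
  SELECT project, agent_type, hour, SUM(view_count) AS view_count
  FROM wmf.pageview_hourly                 -- assumed source table
  WHERE year = 2015 AND month = 7 AND day = 1
  GROUP BY project, agent_type, hour
)
SELECT * FROM (
  -- Day-level rows aggregate ALL hourly points, including those below k.
  SELECT '2015-07-01' AS day, '-' AS time, project, agent_type,
         SUM(view_count) AS view_count
  FROM hourly
  GROUP BY project, agent_type
  UNION ALL
  -- Hourly points below k are discarded from the detailed view.
  SELECT '2015-07-01' AS day, CAST(hour AS STRING) AS time, project,
         agent_type, view_count
  FROM hourly
  WHERE view_count >= 100
) unioned;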

Exported datasets are in hdfs:

  • /user/joal/api_data_sample/project_cube/000000_0.gz
    • Columns: day, time, project, agent_type, view_count
    • day-aggregated datapoints have time = '-'
    • ~100 KB
  • /user/joal/api_data_sample/page_title_hourly/000000_0.gz
    • Columns: day, time, project, language_variant, page_title, view_count
    • Bots and undefined pages ('-') are removed.
    • ~3.5 MB --> k-anonymity with k = 100 removes a lot of lines! Maybe we should drop it to 20? To be discussed.

Distribution analysis for both page_title_hourly and project_cube

  1. page_title_hourly
      • ~60% of page_titles have 1 view, representing ~15% of view_counts
      • ~25% of page_titles have 2 to 4 views, representing ~15% of view_counts
      • ~7% of page_titles have 5 to 9 views, representing ~10% of view_counts
      • ~4% of page_titles have 10 to 24 views, representing ~15% of view_counts
      • ~1% of page_titles have 25 to 49 views, representing ~10% of view_counts
      • ~0.5% of page_titles have 50 to 99 views, representing ~8% of view_counts
      • ~0.25% of page_titles have 100 and more views, representing ~25% of view_counts
    • k=100 is too big: we lose 75% of view_counts. Moving it to 10 would allow us to keep ~60% of view_counts, but we'd still lose 90% of page_titles... Hard choice.
  2. project_cube_hourly
      • ~4% of projects/agent_type have 1 view, representing ~0.0002% of view_counts
      • ~7% of projects/agent_type have 2 to 4 views, representing ~0.001% of view_counts
      • ~7% of projects/agent_type have 5 to 9 views, representing ~0.003% of view_counts
      • ~11% of projects/agent_type have 10 to 24 views, representing ~0.01% of view_counts
      • ~9% of projects/agent_type have 25 to 49 views, representing ~0.02% of view_counts
      • ~11% of projects/agent_type have 50 to 99 views, representing ~0.05% of view_counts
      • ~50% of projects/agent_type have 100 and more views, representing ~99.9% of view_counts
    • Keeping k=100 for projects is very much ok from a view_counts perspective, but we still lose a lot of project data.

I also tried page_title_daily (instead of hourly) and it's better:

    • ~47% of page_titles have 1 view, representing ~3% of view_counts
    • ~28% of page_titles have 2 to 4 views, representing ~5% of view_counts
    • ~10% of page_titles have 5 to 9 views, representing ~5% of view_counts
    • ~7% of page_titles have 10 to 24 views, representing ~8% of view_counts
    • ~3% of page_titles have 25 to 49 views, representing ~7% of view_counts
    • ~2% of page_titles have 50 to 99 views, representing ~8% of view_counts
    • ~2% of page_titles have 100 and more views, representing ~64% of view_counts
  • With k=100 we still lose 98% of page_titles, but we keep 64% of view_counts. Moving it to 10 would allow us to keep ~88% of view_counts, but we'd still lose 75% of page_titles... Hard as well. (A sketch of the bucketing behind these numbers follows.)
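
These histograms come from bucketing per-title view counts, roughly like the query below (assumed table and column names; the attached page_title_hist.hql is the authoritative version):

SELECT bucket,
       COUNT(*)        AS page_titles,  -- basis for the page_titles shares
       SUM(view_count) AS views         -- basis for the view_counts shares
FROM (
  SELECT view_count,
         CASE
           WHEN view_count = 1   THEN '1'
           WHEN view_count < 5   THEN '2-4'
           WHEN view_count < 10  THEN '5-9'
           WHEN view_count < 25  THEN '10-24'
           WHEN view_count < 50  THEN '25-49'
           WHEN view_count < 100 THEN '50-99'
           ELSE '100+'
         END AS bucket
  FROM (
    -- Daily variant; add hour to the GROUP BY for the hourly one.
    SELECT project, page_title, SUM(view_count) AS view_count
    FROM wmf.pageview_hourly            -- assumed source table
    WHERE year = 2015 AND month = 7 AND day = 1
      AND agent_type = 'user'           -- bots removed, as in the export
      AND page_title != '-'             -- undefined pages removed
    GROUP BY project, page_title
  ) per_title
) bucketed
GROUP BY bucket;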

I'm not sure k-anonymization is the right choice here. I think even if we chose a huge K, we would still be vulnerable to the problems that l-diversity and t-closeness solve. We should talk more about this, because the truncation you're seeing would make the dataset useless, and we may as well not release it.

1) Page title hourly

The hive query looks good to me!
I agree with you that K=100 seems too much in this case.
I like the idea of having just daily granularity and maybe a lower K.
Even if we lose 75%-90% of page_titles, this would be a nice feature.
So: +2!

2) Project cube hourly

I see that the hive query calculates daily and hourly data. However, I had understood that we would generate all combinations of dimension values, like (note that the view counts are not real):

day            hour         project          ua         view count

-              -            -                -          23454365      // all view counts for all projects, all time, all agent types
-              -            -                user       23408620      // all view counts for that agent type
-              -            -                spider     45745
-              -            aa.wikipedia     -          567           // all view counts for this project
-              -            aa.wikipedia     user       500
-              -            aa.wikipedia     spider     67
-              -            ab.wikipedia     -          362

...

-              00:00:00     -                -          902090        // all view counts for that hour of the day
-              00:00:00     -                user       900000
-              00:00:00     -                spider     2090

...

2015-07-01     -            -                -          5134557       // all view counts for that day (this is already there!)

...

2015-07-01     18:00:00     nap.wikipedia    user       120           // specific count (this is already there!)

I'd say we need all those extra aggregations to permit such queries as "all view counts for a certain project", no?
If we go for this, and apply K~100 only after aggregation, our users will always be able to relax their queries a bit to reach a correct value that is not truncated by anonymization. And if the result of their query is '-', they'll know that the value is < K.
Does this make sense?
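
In hive, something along these lines would generate all those combinations, with K applied only after aggregation (a sketch assuming a wmf.pageview_hourly source table; '-' marks a dimension aggregated over all its values):

SELECT
  COALESCE(CAST(day AS STRING), '-')  AS day,   -- day formatting simplified
  COALESCE(CAST(hour AS STRING), '-') AS hour,
  COALESCE(project, '-')              AS project,
  COALESCE(agent_type, '-')           AS agent_type,
  SUM(view_count)                     AS view_count
FROM wmf.pageview_hourly              -- assumed source table
WHERE year = 2015 AND month = 7 AND day = 1
GROUP BY day, hour, project, agent_type WITH CUBE
HAVING SUM(view_count) >= 100;        -- K applied only after aggregation

In a real script one would check GROUPING__ID rather than rely on COALESCE, to distinguish cube-generated NULLs from NULLs already present in the data.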

The team has decided not to k-anonymize the data delivered for the API because we are not exposing any geolocation and therefore cannot identify any editor.