Authored by Shilad on Sep 28 2017, 6:47 AM.


SessionPagesBuilder creates a table representing viewed pages grouped by
browser session. The output is a table containing columns for wiki, date,
timestamp, and a space-separated list of all the page ids viewed in the
session, in order.
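
As a rough illustration of the output shape, here is a minimal sketch in
plain Python rather than the Spark code in this patch. The input layout
(wiki, session_id, unix_ts, page_id), the session_id key, and the choice of
the first view's timestamp for the timestamp column are all assumptions made
for the example, not details taken from the job.

```python
from collections import defaultdict
from datetime import datetime, timezone


def build_sessions(views):
    """Group views into one row per session.

    views: iterable of (wiki, session_id, unix_ts, page_id) tuples.
    This layout is hypothetical; the real job reads its own input table.
    Returns sorted rows of (wiki, date, timestamp, "id1 id2 ...").
    """
    grouped = defaultdict(list)
    for wiki, session_id, ts, page_id in views:
        grouped[(wiki, session_id)].append((ts, page_id))

    rows = []
    for (wiki, _session_id), hits in grouped.items():
        # Order the views within a session by timestamp.
        hits.sort(key=lambda hit: hit[0])
        start = hits[0][0]  # assumption: the row carries the first view's time
        date = datetime.fromtimestamp(start, tz=timezone.utc).strftime("%Y-%m-%d")
        pages = " ".join(str(page_id) for _, page_id in hits)
        rows.append((wiki, date, start, pages))
    return sorted(rows)
```

Keeping the page ids in one space-separated string means each session stays
a single row, which is convenient for later steps that treat a session as a
sequence of tokens.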

The job now runs on the cluster in a reasonable amount of time (10 min for
a day's worth of views).

SessionPruner filters the session table and removes any views of pages
whose view count is below some threshold. As a side effect it creates a
page frequency table.
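
The pruning step can be sketched along the same lines, again in plain
Python and not as the patch's implementation. The sketch assumes the
threshold applies to a page's total view count across all sessions, that
the comparison is inclusive, and that sessions left with no pages are
dropped; none of these is confirmed by the summary above. The names
prune_sessions and min_views are hypothetical.

```python
from collections import Counter


def prune_sessions(sessions, min_views):
    """Drop views of rarely viewed pages.

    sessions: list of space-separated page id strings, one per session.
    min_views: hypothetical threshold; pages with fewer total views are removed.
    Returns (pruned_sessions, frequency_table).
    """
    # The frequency table is needed for filtering, so it comes for free.
    freq = Counter(page for session in sessions for page in session.split())

    pruned = []
    for session in sessions:
        kept = [page for page in session.split() if freq[page] >= min_views]
        if kept:  # assumption: sessions that end up empty are discarded
            pruned.append(" ".join(kept))
    return pruned, freq
```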

The testing harness creates fake test data and compares the results
computed by Spark against results computed in memory.
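
The comparison pattern looks roughly like the sketch below. To keep it
runnable without a cluster, a second pure-Python implementation stands in
for the Spark job; the data generator, field layout, and function names are
all hypothetical.

```python
import random
from itertools import groupby


def fake_views(n, seed=0):
    """Generate n fake (wiki, session_id, unix_ts, page_id) views, reproducibly."""
    rng = random.Random(seed)
    return [
        (
            rng.choice(["enwiki", "dewiki"]),
            rng.randrange(5),                   # small id space so sessions repeat
            1506556800 + rng.randrange(86400),  # some second within one day
            rng.randrange(1, 20),
        )
        for _ in range(n)
    ]


def reference_sessions(views):
    """Straightforward in-memory grouping used as the expected result."""
    out = {}
    for wiki, sid, ts, page in views:
        out.setdefault((wiki, sid), []).append((ts, page))
    return {key: [page for _, page in sorted(hits)] for key, hits in out.items()}


def sessions_under_test(views):
    """Stand-in for the Spark job: sort once, then group adjacent rows."""
    ordered = sorted(views)
    return {
        key: [view[3] for view in group]
        for key, group in groupby(ordered, key=lambda view: (view[0], view[1]))
    }


def check(n=200, seed=0):
    """Return True when both implementations agree on the same fake data."""
    views = fake_views(n, seed)
    return reference_sessions(views) == sessions_under_test(views)
```

Seeding the generator keeps a failing case reproducible, which matters when
the two sides disagree only on rare inputs such as timestamp ties.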


  • Oozify job (may require switching to Spark 2)

Complete pass at session creation pipeline

  Added session pruner
  Switched to use tables instead of files
  Cleaned up tests

