Spark job to collect page ids viewed in each session
SessionPagesBuilder creates a table of viewed pages grouped by browser
session. The output table contains columns for wiki, date, timestamp, and
a space-separated list of all the page ids viewed in the session, in
order.
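
As a rough illustration of the grouping step, here is a minimal in-memory sketch in plain Python (not the Spark job itself). The `View` record, the `build_session_pages` name, and the choice to use the first view's timestamp and its UTC date for the session row are all assumptions made for the example; the real schema may differ.

```python
import time
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class View(NamedTuple):
    # Hypothetical input record; the real job's input schema may differ.
    wiki: str
    session_id: str
    timestamp: int  # seconds since the Unix epoch
    page_id: int


def build_session_pages(views: Iterable[View]) -> List[Tuple[str, str, int, str]]:
    """Return one (wiki, date, timestamp, page_ids) row per session."""
    sessions = defaultdict(list)
    for view in views:
        sessions[(view.wiki, view.session_id)].append(view)

    rows = []
    for (wiki, _session_id), session_views in sessions.items():
        # Order the views within a session by time.
        session_views.sort(key=lambda v: v.timestamp)
        # Assumption: the session row carries the first view's timestamp.
        first = session_views[0].timestamp
        date = time.strftime("%Y-%m-%d", time.gmtime(first))
        page_ids = " ".join(str(v.page_id) for v in session_views)
        rows.append((wiki, date, first, page_ids))
    # Sort so the output order is deterministic.
    return sorted(rows)
```

Views that arrive out of order within a session are sorted by timestamp before the page ids are joined.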
The job now runs on the cluster in a reasonable amount of time (10 min for
a day's worth of views).
SessionPruner filters the session table and removes views of any page
whose view count is below some threshold. As a side effect it creates a
page view frequency table.
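
The pruning step can be sketched in plain Python as follows. This is an illustration under assumptions, not the actual SessionPruner: it assumes the threshold applies to a page's total view count across all sessions, and that sessions left empty after pruning are dropped. The `prune_sessions` name and `min_views` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def prune_sessions(
    session_pages: List[str], min_views: int
) -> Tuple[List[str], Dict[str, int]]:
    """Drop views of rarely viewed pages; also return the frequency table.

    Each input string is a space-separated list of page ids for one session.
    """
    # Frequency table: total number of views per page id across all sessions.
    freq = Counter(pid for pages in session_pages for pid in pages.split())

    pruned = []
    for pages in session_pages:
        kept = [pid for pid in pages.split() if freq[pid] >= min_views]
        # Assumption: a session with no remaining views is dropped entirely.
        if kept:
            pruned.append(" ".join(kept))
    return pruned, dict(freq)
```

The frequency table has to be computed before filtering anyway, which is why it comes out as a by-product.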
The testing harness creates fake test data and compares the results
computed by Spark against results computed in memory.
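
A minimal sketch of the two pieces such a harness needs: seeded fake data and an order-insensitive comparison, since rows collected from Spark generally come back in no guaranteed order. The `fake_views` and `same_rows` helpers and the row layout are hypothetical and only illustrate the idea.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

# Hypothetical row layout for the example: (session_id, timestamp, page_id).
Row = Tuple[str, int, int]


def fake_views(n_sessions: int, max_views: int, seed: int = 0) -> List[Row]:
    """Generate reproducible fake views, shuffled to mimic unordered input."""
    rng = random.Random(seed)
    rows = []
    for s in range(n_sessions):
        # Every session gets at least one view.
        for i in range(rng.randint(1, max_views)):
            rows.append(("s%d" % s, 1000 * s + i, rng.randint(1, 50)))
    rng.shuffle(rows)
    return rows


def same_rows(actual: Sequence[tuple], expected: Sequence[tuple]) -> bool:
    """Compare two result sets as multisets, ignoring row order."""
    return Counter(actual) == Counter(expected)
```

A fixed seed keeps failures reproducible, and comparing as multisets still catches missing or duplicated rows.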
- Oozify job (may require switching to Spark 2)
Complete pass at session creation pipeline:
- Added session pruner
- Switched to use tables instead of files
- Cleaned up tests