Spark job to create page ids viewed in each session

Authored by Shilad on Thu, Sep 28, 6:47 AM.

Description

SessionPagesBuilder creates a table of viewed pages grouped by browser
session. The output table has columns for wiki, date, timestamp, and a
space-separated list of the page ids viewed in the session, in view
order.
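
As a rough illustration of the output shape described above, here is a
minimal in-memory sketch in plain Python. It is not the Spark job itself.
The input tuple layout, the `client_id` session key, the 30-minute
inactivity gap, and the use of the session start as the row timestamp are
all assumptions made for the example; the description does not state how
the job defines a browser session.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Assumption: a session ends after 30 minutes of inactivity. The change
# description does not say which rule the real job applies.
SESSION_GAP_SECONDS = 30 * 60


def _row(wiki, start_ts, pages):
    # One output row: wiki, date, timestamp, space-separated page ids.
    date = datetime.fromtimestamp(start_ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return (wiki, date, start_ts, " ".join(str(p) for p in pages))


def build_sessions(views, gap=SESSION_GAP_SECONDS):
    """Group views into sessions.

    views: iterable of (wiki, client_id, unix_ts, page_id) tuples, where
    client_id is a hypothetical stand-in for whatever identifies a browser.
    Returns sorted (wiki, date, start_ts, page_ids) rows.
    """
    by_client = defaultdict(list)
    for wiki, client_id, ts, page_id in views:
        by_client[(wiki, client_id)].append((ts, page_id))

    rows = []
    for (wiki, _client), events in by_client.items():
        events.sort()  # order each client's views by time
        start_ts, last_ts, pages = None, None, []
        for ts, page_id in events:
            # Close the current session when the gap is exceeded.
            if pages and ts - last_ts > gap:
                rows.append(_row(wiki, start_ts, pages))
                pages = []
            if not pages:
                start_ts = ts
            pages.append(page_id)
            last_ts = ts
        if pages:
            rows.append(_row(wiki, start_ts, pages))
    return sorted(rows)
```

In a Spark job the same grouping would more likely be expressed with a
partition-and-sort step per client rather than a Python dictionary; the
sketch only shows the intended row layout.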

The job now runs on the cluster in a reasonable amount of time (10 min for
a day's worth of views).

SessionPruner filters the session table, removing views of pages whose
view count falls below a threshold. As a side effect, it creates a page
frequency table.
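
The pruning step can be sketched in plain Python as follows. This is an
illustration under assumptions, not the SessionPruner code: it assumes
the threshold is a minimum view count per (wiki, page id), that
frequencies are counted over the same session table being pruned, and
that sessions left with no pages are dropped. The names `prune_sessions`
and `min_views` are hypothetical.

```python
from collections import Counter


def prune_sessions(rows, min_views):
    """Drop views of rarely viewed pages.

    rows: iterable of (wiki, date, timestamp, page_ids) tuples, where
    page_ids is a space-separated string of page ids.
    Returns (pruned_rows, freq), where freq maps (wiki, page_id) to the
    number of views seen in rows.
    """
    rows = list(rows)

    # First pass: build the frequency table.
    freq = Counter()
    for wiki, _date, _ts, page_ids in rows:
        for page_id in page_ids.split():
            freq[(wiki, page_id)] += 1

    # Second pass: keep only views of pages at or above the threshold.
    pruned = []
    for wiki, date, ts, page_ids in rows:
        kept = [p for p in page_ids.split() if freq[(wiki, p)] >= min_views]
        if kept:  # assumption: sessions with no remaining pages are dropped
            pruned.append((wiki, date, ts, " ".join(kept)))
    return pruned, freq
```

Returning the frequency table alongside the pruned rows mirrors the side
effect mentioned above, since the counts have to be computed before any
view can be filtered.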

The test harness generates fake data and compares the results computed
by Spark against results computed in memory.
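
The general idea of this kind of test can be sketched as below. It is a
simplified stand-in, not the harness in this change: both
implementations here are plain Python, with a sort-and-group function
playing the role of the Spark computation and a dictionary loop playing
the role of the in-memory reference. All function names and the fake
data layout are hypothetical.

```python
import random
from itertools import groupby


def fake_views(n, seed=0):
    # Deterministic fake (wiki, page_id) views so failures are reproducible.
    rng = random.Random(seed)
    return [(rng.choice(["enwiki", "dewiki"]), rng.randrange(1, 20))
            for _ in range(n)]


def count_pipeline(views):
    # Stand-in for the distributed job: sort, then group equal keys.
    return {key: sum(1 for _ in group) for key, group in groupby(sorted(views))}


def count_reference(views):
    # Straightforward in-memory reference implementation.
    counts = {}
    for view in views:
        counts[view] = counts.get(view, 0) + 1
    return counts


def results_match(n=1000, seed=0):
    # The test passes when both implementations agree on the same fake data.
    views = fake_views(n, seed)
    return count_pipeline(views) == count_reference(views)
```

Keeping the reference implementation deliberately simple makes it easier
to trust when the two results disagree.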

TODO:

  • Oozify job (may require switching to Spark 2)

Bug: T174796
Change-Id: I55395459d80d73f3d065967ce95d6506698d128e

Complete pass at session creation pipeline

  Added session pruner
  Switched to use tables instead of files
  Cleaned up tests

Change-Id: I19160e16d8140d03d81a4226e3974f42ec1e3602