Model user behavior and detect when reality heavily deviated from expectation
Open, Normal · Public · 20 Story Points

Description

Develop a time series model for search usage that runs every day and forecasts search usage for the next day (or next few days). We want to know if what we saw on a given day (e.g. yesterday) is within our expectation or vastly different. We can use these models to automatically suggest/alert when we should investigate specific days.
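As a rough illustration of the "is yesterday within our expectation?" check described above, here is a minimal Python sketch. It is not the model used for this task: it assumes a seasonal-naive forecast (same weekday last week) and a normal-approximation range built from past forecast errors. The function names, the 7-day season, and the z value of 1.96 are all assumptions made for the example.

```python
from math import sqrt
from statistics import mean


def expected_range(history, season=7, z=1.96):
    """Return (lower, point, upper) for the day after the end of `history`.

    Point forecast: the value observed one season (7 days) earlier.
    Range: point +/- z * RMS of past seasonal-naive forecast errors.
    """
    if len(history) < 2 * season:
        raise ValueError("need at least two full seasons of history")
    # Errors the same forecasting rule would have made inside the history.
    errors = [history[i] - history[i - season] for i in range(season, len(history))]
    spread = z * sqrt(mean([e * e for e in errors]))
    point = history[-season]
    return point - spread, point, point + spread


def is_unexpected(observed, history, season=7, z=1.96):
    """True when the observed value falls outside the expected range."""
    lower, _, upper = expected_range(history, season, z)
    return observed < lower or observed > upper
```

A daily job could call `is_unexpected(yesterday_count, counts_before_yesterday)` and flag the day for investigation when it returns `True`.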

mpopov created this task. Sep 10 2015, 8:35 PM
mpopov updated the task description.
mpopov raised the priority of this task from to High.
mpopov added a project: Discovery.
mpopov added a subscriber: mpopov.
Restricted Application added a subscriber: Aklapper. Sep 10 2015, 8:35 PM
mpopov moved this task from Needs triage to Analysis on the Discovery board. Sep 10 2015, 8:35 PM
mpopov claimed this task.
mpopov set Security to None.
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov edited a custom field.
Deskana lowered the priority of this task from High to Normal. Nov 10 2015, 9:11 PM
Deskana edited projects, added Discovery; removed Discovery-Analysis (Current work).
Deskana added a subscriber: Deskana.
mpopov edited a custom field.
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.

To-do:

  • Create and fit appropriate models of usage/traffic for event counts, load times, WDQS SPARQL usage, etc. (using the Shiny app developed earlier)
  • Use those models:
    • To detect outliers (check with SPARQL data from 2015-11-04 -- 2015-11-06 and 2015-11-08)
    • In a Bayesian, daily-updating system (needs to be researched)
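One way the "Bayesian, daily-updating" idea in the to-do list could look is sketched below in Python. This is only an illustration under strong assumptions, not the approach that was researched for this task: it uses a conjugate normal-normal model with a known day-to-day noise variance and ignores trend and weekly seasonality. The class name `DailyBelief` and all numbers are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class DailyBelief:
    mean: float            # current belief about the typical daily value
    variance: float        # uncertainty about that typical value
    noise_variance: float  # assumed known day-to-day observation variance

    def predictive_interval(self, z=1.96):
        """Range in which the next day's value is expected to fall."""
        sd = sqrt(self.variance + self.noise_variance)
        return self.mean - z * sd, self.mean + z * sd

    def is_outlier(self, observed, z=1.96):
        lower, upper = self.predictive_interval(z)
        return not (lower <= observed <= upper)

    def update(self, observed):
        """Conjugate normal-normal update after seeing one more day."""
        precision = 1.0 / self.variance + 1.0 / self.noise_variance
        new_variance = 1.0 / precision
        new_mean = new_variance * (
            self.mean / self.variance + observed / self.noise_variance
        )
        return DailyBelief(new_mean, new_variance, self.noise_variance)
```

Each day the job would first call `is_outlier` on the new observation and then `update`. Whether flagged days should still update the belief is a design choice this sketch leaves open.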
Deskana moved this task from Analysis to On Sprint Board on the Discovery board. Dec 17 2015, 9:14 PM

@mpopov noted that this was a 10% time project, but I think it also makes sense to prioritise it more highly. So, now it's official work. :-)

Started prototyping an ARIMA model-based forecasting system: https://github.com/bearloga/branch/blob/master/arima_forecasting.pdf

Going to mess around with a few more models (note to self: e.g. GARCH) as well as Bayesian approaches to ARIMA and GARCH.

Then going to put together an experimental dashboard for these predictions.
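For readers unfamiliar with ARIMA, the smallest useful member of the family can be sketched in a few lines of Python. This is not the model from the linked PDF; it is an ARIMA(1,1,0) without a constant, fitted by conditional least squares, and the function names are made up for the example. A real prototype would use a proper statistics package and select the model order from the data.

```python
def fit_ar1_on_differences(series):
    """Estimate phi in d[t] = phi * d[t-1] + noise, where d is the
    first difference of the series (the "I" and "AR" parts of ARIMA(1,1,0))."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    if len(diffs) < 2:
        raise ValueError("need at least three observations")
    numerator = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    if denominator == 0:
        raise ValueError("differences are all zero; phi is not identifiable")
    return numerator / denominator


def forecast_next(series, phi):
    """One-step-ahead forecast: last value plus the predicted next difference."""
    last_diff = series[-1] - series[-2]
    return series[-1] + phi * last_diff
```

Usage would be `phi = fit_ar1_on_differences(counts)` followed by `forecast_next(counts, phi)`, refitted each day as new data arrives.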

Just to note here that the reason this is flipping backwards and forwards between "Backlog" and "In progress" is that this task keeps getting bumped for higher priority work. This isn't a problem; on the contrary, it means that we've got our prioritisation really clear. :-)

mpopov set the point value for this task to 20. Apr 12 2016, 8:11 PM
debt added a subscriber: debt.

Moving to backlog again; @mpopov will work on it as he can.

debt moved this task from Needs triage to Later on the Discovery-Analysis board. May 31 2016, 8:57 PM
debt moved this task from Later to Up Next on the Discovery-Analysis board. Sep 20 2016, 8:19 PM

Change 314339 had a related patch set uploaded (by Bearloga):
Deploy brand new Forecasts dashboard

https://gerrit.wikimedia.org/r/314339

Change 314339 merged by Bearloga:
Deploy brand new Forecasts dashboard

https://gerrit.wikimedia.org/r/314339

mpopov added a comment. Oct 5 2016, 8:13 PM

Brand-new dashboard up: http://discovery-experimental.wmflabs.org/forecast/

Having issues installing some R packages on stat1002 so can't automate daily forecasts yet buuuut hopefully will be able to soon.

mpopov added a subscriber: Gehel. Oct 7 2016, 5:35 PM

Stuck at not being able to install bsts on stat1002. Full description of problem & outputs: https://gist.github.com/bearloga/98fb72b57c71477c6b13b395e4c0d9ea

Hoping somebody with more C++ and make experience will be able to help me :\ Also don't know if it's an ops issue (pinging @Gehel) because the package installs OK on stat1003, which is theoretically supposed to have the same core software as stat1002 (other than stat1002-specific stuff like Hive and Beeline).

Change 325870 had a related patch set uploaded (by Bearloga):
[WIP] Migrate to Reportupdater framework

https://gerrit.wikimedia.org/r/325870

Change 325870 merged by Bearloga:
[wikimedia/discovery/golden] Migrate to Reportupdater framework

https://gerrit.wikimedia.org/r/325870

Change 343323 had a related patch set uploaded (by Bearloga):
[wikimedia/discovery/golden] Enable forecasting modules

https://gerrit.wikimedia.org/r/343323

Change 343323 merged by Chelsyx:
[wikimedia/discovery/golden] Enable forecasting modules

https://gerrit.wikimedia.org/r/343323

Change 344677 had a related patch set uploaded (by Bearloga):
[wikimedia/discovery/golden@master] Adds Prophet forecasting

https://gerrit.wikimedia.org/r/344677

Change 344967 had a related patch set uploaded (by Bearloga):
[wikimedia/discovery/golden@master] Fix how dates work for forecasting

https://gerrit.wikimedia.org/r/344967

Change 344967 merged by Bearloga:
[wikimedia/discovery/golden@master] Fix how dates work for forecasting

https://gerrit.wikimedia.org/r/344967

mpopov added a subscriber: Smalyshev.

Status Update:

Up Next:

  • Email notifications when percent error between predicted and observed exceeds a particular deviance threshold
  • More metrics (once the dust settles on Discovery-related stuff, I'll need to discuss with @debt & @Deskana which metrics would be nice to have forecasts & deviance notifications for)
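The percent-error check behind the planned notifications could be as simple as the Python sketch below. It is an assumption-laden illustration, not the planned implementation: percent error is taken relative to the observed value, the 20% default threshold is arbitrary, the function names are hypothetical, and the sketch only builds the message text rather than sending any email.

```python
def percent_error(predicted, observed):
    """Absolute percent error of the forecast, relative to the observed value."""
    if observed == 0:
        raise ValueError("percent error is undefined when observed is zero")
    return 100.0 * abs(observed - predicted) / abs(observed)


def needs_notification(predicted, observed, threshold_percent=20.0):
    """True when the forecast missed by more than the allowed threshold."""
    return percent_error(predicted, observed) > threshold_percent


def build_alert(metric, date, predicted, observed, threshold_percent=20.0):
    """Return a notification message, or None when no alert is needed."""
    if not needs_notification(predicted, observed, threshold_percent):
        return None
    error = percent_error(predicted, observed)
    return (
        f"{metric} on {date}: observed {observed}, predicted {predicted} "
        f"({error:.1f}% error, threshold {threshold_percent:.1f}%)"
    )
```

Metrics that routinely sit near zero would need a different error measure, since percent error blows up as the observed value approaches zero.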

Moving to Backlog column for now until I can work on this again (when more urgent tasks are done with).

Side Note:

@Smalyshev: was there any announcement on 27 March 2017? (https://discovery-experimental.wmflabs.org/forecast/#wdqs_homepage)

Change 344677 merged by Bearloga:
[wikimedia/discovery/golden@master] Adds Prophet forecasting

https://gerrit.wikimedia.org/r/344677

Gehel removed a subscriber: Gehel. Jun 20 2017, 1:30 PM