Develop a time series model for search usage that runs every day and forecasts search usage for the next day (or next few days). We want to know if what we saw on a given day (e.g. yesterday) is within our expectation or vastly different. We can use these models to automatically suggest/alert when we should investigate specific days.
|Open||mpopov||T150370 [EPIC][Search][Dashboard] Add "well-behaved searchers" filter|
|Resolved||mpopov||T150915 [Dashboard] Migrate golden to Reportupdater infrastructure|
|Resolved||Ottomata||T147682 Can't install R package Boom (& bsts) on stat1002 (but can on stat1003)|
|Open||mpopov||T112170 Model user behavior and detect when reality heavily deviated from expectation|
|Resolved||mpopov||T122937 Experimental forecast dashboard|
|Resolved||mpopov||T120285 Set up an experimental Discovery dashboard which people can push any graphs or features they want to|
- Create and fit appropriate models of usage/traffic for event counts, load times, WDQS SPARQL usage, etc. (using the Shiny app developed earlier)
- Use those models:
- To detect outliers (check with SPARQL data from 2015-11-04 -- 2015-11-06 and 2015-11-08)
- In a Bayesian, daily-updating system (needs to be researched)
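To illustrate the outlier-detection idea, here's a minimal sketch in Python (the actual pipeline uses fitted R models like bsts/ARIMA; this stand-in just uses a mean ± z·sd band over recent history, and the daily counts are made up):

```python
import statistics

def is_anomalous(history, observed, z=3.0):
    """Flag `observed` if it falls outside mean +/- z*sd of recent history.

    Crude stand-in for the prediction interval a fitted forecasting
    model would produce; real usage would forecast rather than average.
    """
    mu = statistics.fmean(history)
    sd = statistics.stdev(history)
    return abs(observed - mu) > z * sd

# Hypothetical daily SPARQL request counts (illustrative, not real data):
baseline = [1000, 1020, 980, 1010, 995, 1005, 990]
print(is_anomalous(baseline, 1015))  # within expectation -> False
print(is_anomalous(baseline, 2500))  # vastly different -> True, investigate
```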
Started prototyping an ARIMA-based forecasting system: https://github.com/bearloga/branch/blob/master/arima_forecasting.pdf
Going to mess around with a few more models (note to self: e.g. GARCH) as well as Bayesian approaches to ARIMA and GARCH.
Then going to put together an experimental dashboard for these predictions.
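For a rough sense of what these forecasts do, here's a toy AR(1) forecaster in Python (the prototype itself does full ARIMA model selection in R; this sketch just estimates y_t − μ = φ·(y_{t−1} − μ) by least squares and iterates it forward):

```python
def ar1_forecast(series, horizon=1):
    """Produce `horizon` forecasts from a toy AR(1) fit.

    Sketch only: estimates the lag-1 coefficient phi on the
    mean-centered series, then iterates the recursion forward.
    """
    mu = sum(series) / len(series)
    centered = [y - mu for y in series]
    # Least-squares estimate of phi from consecutive pairs:
    num = sum(a * b for a, b in zip(centered[1:], centered[:-1]))
    den = sum(a * a for a in centered[:-1])
    phi = num / den
    forecasts, last = [], centered[-1]
    for _ in range(horizon):
        last = phi * last
        forecasts.append(mu + last)
    return forecasts

# A perfectly alternating series has phi = -1, so the next value flips:
print(ar1_forecast([1, 2, 1, 2, 1, 2]))  # -> [1.0]
```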
Just to note here that the reason this is flipping backwards and forwards between "Backlog" and "In progress" is because this task keeps getting bumped for higher priority work. This isn't a problem; on the contrary, it means that we've got our prioritisation really clear. :-)
Brand-new dashboard up: http://discovery-experimental.wmflabs.org/forecast/
Having issues installing some R packages on stat1002, so can't automate daily forecasts yet, but hopefully will be able to soon.
Stuck at not being able to install bsts on stat1002. Full description of problem & outputs: https://gist.github.com/bearloga/98fb72b57c71477c6b13b395e4c0d9ea
Hoping somebody with more C++ and make experience will be able to help me :\ Also don't know if it's an ops issue (pinging @Gehel), because the package installs OK on stat1003, which is supposed to have the same core software as stat1002 (other than stat1002-specific stuff like Hive and Beeline).
- Our data pipeline is generating daily forecasts for Cirrus API usage, zero results rate, WDQS homepage traffic & SPARQL endpoint usage: https://datasets.wikimedia.org/aggregate-datasets/discovery-forecasts/
- There is a dashboard for visualizing these forecasts on the experimental space: https://discovery-experimental.wmflabs.org/forecast/
- There's a patch to add forecasting via Facebook's Prophet procedure: https://gerrit.wikimedia.org/r/#/c/344677/
- Email notifications when percent error between predicted and observed exceeds a particular deviance threshold
- More metrics (once the dust settles on Discovery-related stuff, I'll need to discuss with @debt & @Deskana which metrics would be nice to have forecasts & deviance notifications for)
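The email-notification idea above could be sketched like this (a minimal Python sketch; the threshold value and function name are illustrative and would need to be tuned per metric before wiring up actual emails):

```python
def deviance_alert(predicted, observed, threshold_pct=20.0):
    """Return True if the percent error between predicted and observed
    exceeds the notification threshold.

    threshold_pct is illustrative; each metric would get its own
    tuned value before this drives email notifications.
    """
    pct_error = abs(observed - predicted) / predicted * 100
    return pct_error > threshold_pct

print(deviance_alert(1000, 1100))  # 10% error -> False, no alert
print(deviance_alert(1000, 1500))  # 50% error -> True, send notification
```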
Moving to Backlog column for now until I can work on this again (when more urgent tasks are done with).
@Smalyshev: was there any announcement on 27 March 2017? (https://discovery-experimental.wmflabs.org/forecast/#wdqs_homepage)