Backfill data from mediacounts into the mediarequests tables in Cassandra so that historical mediarequest data is available
This data will not include the referrer dimension, so we will not know which project made the original request
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T210313 Statistics for views of individual Wikimedia images |
| Resolved | | fdans | T234591 Make job to backfill data from mediacounts into mediarequests tables in cassandra so as to have historical mediarequest data |
| Resolved | | fdans | T237119 Create script that returns oozie time intervals every time a coordinator is started from a cron job |
Given the many recent issues with Cassandra not being able to keep up with loading, let's stop loading manually and instead set up a job that loads about 20 days of data at a time, at a time of day when Cassandra has more resources available (so that it does not coincide with pageview loading). Assigning to @fdans
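For illustration, splitting a backfill range into chunks of about 20 days could look roughly like the sketch below. This is not the actual job: the function name `chunk_intervals` and the half-open `[start, end)` convention are assumptions made for the example.

```python
from datetime import date, timedelta


def chunk_intervals(start, end, days=20):
    """Split [start, end) into consecutive windows of at most `days` days.

    Hypothetical helper: the real job may slice the range differently.
    """
    step = timedelta(days=days)
    intervals = []
    cursor = start
    while cursor < end:
        # Clamp the last window so it never runs past the end of the range.
        upper = min(cursor + step, end)
        intervals.append((cursor, upper))
        cursor = upper
    return intervals


if __name__ == "__main__":
    for lower, upper in chunk_intervals(date(2015, 1, 1), date(2015, 3, 1)):
        print(lower.isoformat(), upper.isoformat())
```

Each run of the job would then load a single window, so that one run never asks Cassandra to ingest more than about 20 days of data.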
Per-file jobs that failed during backfilling:
Restarting these before rerunning backfilling.
Once these are complete, the backfilled range will be Jan 1st 2015 to Sep 9th 2015
I think we should start backfilling from 2019 backwards so as to have a "continuous" dataset. For a live API that (in theory) can be queried, that probably makes more sense than having data ranges in 2015 and 2019 and nothing in between.
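A rough sketch of what walking backwards could mean in practice: generate the load windows newest first, so the loaded range always stays contiguous with the most recent data. The name `backward_windows` and the 20-day default are assumptions for illustration, not part of the actual job.

```python
from datetime import date, timedelta


def backward_windows(newest, oldest, days=20):
    """Yield (start, end) windows from `newest` back to `oldest`, newest first.

    `end` is exclusive. Hypothetical helper, shown only to illustrate the order.
    """
    step = timedelta(days=days)
    end = newest
    while end > oldest:
        # Clamp the final window so it does not go past the oldest date.
        start = max(end - step, oldest)
        yield (start, end)
        end = start


if __name__ == "__main__":
    for start, end in backward_windows(date(2019, 1, 1), date(2018, 11, 1)):
        print(start.isoformat(), end.isoformat())
```

With this ordering, stopping the backfill at any point leaves a single continuous range ending at the most recent data rather than two disjoint ranges.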