The WMF's analytics team publishes a file with near-real-time WMDE banner impression data every 15 minutes. The files are publicly hosted in an indexed directory on analytics.wikimedia.org.
**Acceptance Criteria**
* The application has an entry point (command) that fetches the data.
* A file is only picked up/processed once.
* The fetched data is enriched with the banner keyword.
* If a file is missing, the command writes an error message, stops further processing, and exits with a non-zero status.
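The fetch flow implied by these criteria could look like the following sketch. It is written in Python for illustration only (the real entry point will be a PHP console command), and all function names, the time range, and the callback signatures are assumptions:

```python
from datetime import datetime, timedelta, timezone
import sys


def expected_files(start: datetime, end: datetime):
    """Yield the expected file name for every 15-minute span in [start, end)."""
    t = start
    while t < end:
        # Naming convention from the notes: beginning of the time span, UTC assumed.
        yield t.strftime("banner_impressions_%Y%m%d_%H%M.csv")
        t += timedelta(minutes=15)


def run(fetch, already_processed, store):
    """Fetch each pending file once; abort with a non-zero status if one is missing."""
    start = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)  # example range
    end = start + timedelta(hours=1)
    for name in expected_files(start, end):
        if already_processed(name):
            continue  # a file is only picked up/processed once
        data = fetch(name)
        if data is None:
            print(f"Error: file {name} is missing", file=sys.stderr)
            return 1  # non-zero exit status; further processing is stopped
        store(name, data)
    return 0
```

The `fetch`, `already_processed`, and `store` callbacks stand in for the downloader, processed-file tracking, and storage classes this ticket asks for.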
**Notes**
Example for a publicly hosted dataset provided by the Analytics Team of the WMF:
https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/browser/
**Implementation Hints**
* The entry point can be added as a new command to the existing `console.php` (see https://github.com/wmde/fundraising-backend/blob/master/tools/console.php).
* The outcome of this task is a set of classes for downloading the data, parsing it, and remembering which files have already been processed.
* After finishing the ticket, the entry point script should call the downloader and parser classes.
* Guzzle should be used for the HTTP requests (for testability); it might even help with parsing the files.
* The data download should be hidden behind an interface for testability.
* The current code (in `FetchImpressionsCommand`) queries the banner page on the Meta-Wiki to retrieve the banner keyword based on the banner name; this behavior must be kept.
* Watch out for time zones when interpreting the timestamps in the file names!
* Data sources will be used in parallel for a short time. For now, the fetched data should be stored in a different database table.
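The separation the hints describe (download behind an interface, a separate record of processed files) could be sketched as follows. This is Python for illustration only; the real classes will be PHP, and all class and method names here are invented:

```python
from abc import ABC, abstractmethod
from typing import Optional


class ImpressionFileSource(ABC):
    """Abstraction over the HTTP download (Guzzle in the real PHP code)."""

    @abstractmethod
    def fetch(self, filename: str) -> Optional[str]:
        """Return the file contents, or None if the file is missing."""


class ProcessedFileLog(ABC):
    """Remembers which files were already processed (a database table in production)."""

    @abstractmethod
    def has_processed(self, filename: str) -> bool: ...

    @abstractmethod
    def mark_processed(self, filename: str) -> None: ...


class InMemoryProcessedFileLog(ProcessedFileLog):
    """In-memory implementation, useful for tests."""

    def __init__(self):
        self._seen = set()

    def has_processed(self, filename: str) -> bool:
        return filename in self._seen

    def mark_processed(self, filename: str) -> None:
        self._seen.add(filename)
```

Putting the download behind `ImpressionFileSource` lets tests substitute a fake source, and `ProcessedFileLog` enforces the "a file is only processed once" criterion.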
**Notes on the data files**
* There is one file for each time span.
* The file is named `banner_impressions_YYYYMMDD_hhmm.csv`, using the beginning of the time span.
* The file contains comma-separated values:
* Banner name
* Impression count (extrapolated)
* The files are published on analytics.wikimedia.org.
* Files older than 30 days are deleted regularly.
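Given the format above, each row can be parsed into a (banner name, impression count) pair. A Python sketch for illustration (the production parser will be a PHP class; the banner names in the example are made up):

```python
import csv
import io


def parse_impressions_csv(content: str):
    """Parse rows of 'banner name, extrapolated impression count'."""
    rows = []
    for row in csv.reader(io.StringIO(content)):
        if not row:
            continue  # skip blank lines
        name, count = row[0], int(row[1])
        rows.append((name, count))
    return rows


# Example (hypothetical banner names):
# parse_impressions_csv("B24_WMDE_Test,1500\nB24_WMDE_Control,1420\n")
# → [("B24_WMDE_Test", 1500), ("B24_WMDE_Control", 1420)]
```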