The WMF's analytics team provides files with real-time data on WMDE banner impressions every 15 minutes. The data is publicly hosted in an indexed directory on analytics.wikimedia.org.
**Acceptance Criteria**
* The application has an entry point that takes care of picking up the data.
* A file is only picked up/processed once.
* The data we receive is complemented with the banner keyword.
* The numbers are projected to resemble a 100% sample rate.
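The projection in the last criterion could look something like this minimal sketch. The helper name and the example sample rate are illustrative only; the actual sample rate is still an open question below:

```php
<?php
// Hypothetical helper: extrapolate a sampled impression count to a 100% sample.
// $sampleRate is the fraction of impressions that were recorded, e.g. 0.01 for 1%.
function projectTo100Percent(int $sampledCount, float $sampleRate): int {
    if ($sampleRate <= 0.0 || $sampleRate > 1.0) {
        throw new InvalidArgumentException('Sample rate must be in (0, 1]');
    }
    return (int) round($sampledCount / $sampleRate);
}

// 420 impressions recorded at a 1% sample rate ≈ 42000 actual impressions
echo projectTo100Percent(420, 0.01), "\n"; // prints 42000
```

If the file already contains extrapolated values (see the questions below), this step becomes a no-op.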
**Notes**
Example of a publicly hosted dataset provided by the WMF Analytics Team:
https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/browser/
**Questions**
* How are the files created?
  * Does the file get appended to every 15 minutes?
  * Does the file get overwritten every 15 minutes?
  * Is there one file per date range?
* How often are the files cleaned up?
* What exactly does the file format look like?
* What is the sample rate of the banner impressions, and are the values already projected in the file?
**Implementation Hints**
* The entry point can be added as a new command to the existing `console.php` (see https://github.com/wmde/fundraising-backend/blob/master/tools/console.php).
* The outcome of this task is a set of classes for downloading the data, parsing the data, and remembering which data has already been processed.
* After finishing the ticket, the entry point script should call the downloader and parser classes and leave a TODO for processing/storing the data.
* Guzzle should be used for testability and might even help with parsing the files.
* The data download should happen behind an interface for testability.
* The current code (in `FetchImpressionsCommand`) queries the banner page on the Meta-Wiki to retrieve the banner keyword based on the banner name; we need to keep doing this.
* Watch out for time zones!
* Both data sources will be used in parallel for a short time. For now, the fetched data should be stored in a separate database table.
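The download/parse/remember split suggested in the hints could translate into interfaces along these lines. All names here are illustrative and do not exist in the fundraising-backend codebase:

```php
<?php
// Illustrative interfaces for the download/parse/remember split.

interface ImpressionFileDownloader {
    /** Return the raw CSV contents of the file at the given URL. */
    public function download(string $url): string;
}

interface ProcessedFileLog {
    public function hasBeenProcessed(string $fileName): bool;
    public function markAsProcessed(string $fileName): void;
}

// A trivial in-memory log, useful for tests; the production implementation
// would persist file names (e.g. in the database).
class InMemoryProcessedFileLog implements ProcessedFileLog {
    private array $processed = [];

    public function hasBeenProcessed(string $fileName): bool {
        return isset($this->processed[$fileName]);
    }

    public function markAsProcessed(string $fileName): void {
        $this->processed[$fileName] = true;
    }
}
```

A Guzzle-backed `ImpressionFileDownloader` implementation could wrap `Client::get()`; in tests it can be replaced by a stub that returns fixture CSV, which is the testability benefit the hints are after.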
**Notes**
* There is one file for each time span.
* The file is named `banner_impressions_YYYYMMDD_hhmm.csv`, the timestamp marking the beginning of the time span.
* The file contains comma-separated values:
  * Banner name
  * Impression count (extrapolated)
* The files are published on analytics.wikimedia.org.
* Files older than 30 days are deleted regularly.
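Given the naming scheme and CSV layout above, a parser could extract the time span and rows like this. This is a sketch: the UTC time zone and the exact column layout are assumptions to be verified, and the function names are made up:

```php
<?php
// Parse the start of the time span out of a file name like
// banner_impressions_20240131_1215.csv (format taken from the notes above).
function parseTimeSpanStart(string $fileName): DateTimeImmutable {
    if (!preg_match('/banner_impressions_(\d{8})_(\d{4})\.csv$/', $fileName, $m)) {
        throw new InvalidArgumentException("Unexpected file name: $fileName");
    }
    // '!' resets unspecified fields (e.g. seconds) to zero.
    return DateTimeImmutable::createFromFormat(
        '!Ymd Hi',
        $m[1] . ' ' . $m[2],
        new DateTimeZone('UTC') // UTC is an assumption -- see "time zones" hint!
    );
}

// Parse one CSV line into [bannerName, extrapolatedImpressionCount].
function parseImpressionRow(string $line): array {
    [$banner, $count] = str_getcsv($line);
    return [$banner, (int) $count];
}

echo parseTimeSpanStart('banner_impressions_20240131_1215.csv')->format('Y-m-d H:i'), "\n";
// prints 2024-01-31 12:15
```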