The WMF's analytics team provides files with real-time data on WMDE banner impressions every 15 minutes. The data is publicly hosted in an indexed directory on analytics.wikimedia.org.
**Acceptance Criteria**
* The application has an entry point that takes care of picking up the data.
* A file is only picked up/processed once.
* The data we receive is complemented with the banner keyword.
* The numbers are projected to resemble a 100% sample rate.
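The projection in the last criterion could look something like this minimal sketch. The helper name and the example sample rate are illustrative only; the actual sample rate is still an open question below:

```php
<?php
// Hypothetical helper: extrapolate a sampled impression count to a 100% sample.
// $sampleRate is the fraction of impressions that were recorded, e.g. 0.01 for 1%.
function projectTo100Percent(int $sampledCount, float $sampleRate): int {
    if ($sampleRate <= 0.0 || $sampleRate > 1.0) {
        throw new InvalidArgumentException('Sample rate must be in (0, 1]');
    }
    return (int) round($sampledCount / $sampleRate);
}

// 420 impressions recorded at a 1% sample rate ≈ 42000 actual impressions
echo projectTo100Percent(420, 0.01), "\n"; // prints 42000
```

If the file already contains extrapolated values (see the questions below), this step becomes a no-op.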
**Notes**
Example of a publicly hosted dataset provided by the WMF Analytics Team:
https://analytics.wikimedia.org/published/datasets/periodic/reports/metrics/browser/
**Questions**
* How are the files created?
  * Does the file get appended to every 15 minutes?
  * Does the file get overwritten every 15 minutes?
  * Is there one file per date range?
* How often are the files cleaned up?
* What exactly does the file format look like?
* What is the sample rate of the banner impressions, and are the values already projected in the file?
**Implementation Hints**
* The entry point can be added as a new command to the existing `console.php` (see https://github.com/wmde/fundraising-backend/blob/master/tools/console.php).
* The outcome of this task is a set of classes for downloading the data, parsing the data, and remembering which data has already been processed.
* After finishing the ticket, the entry point script should call the downloader and parser classes and leave a TODO for processing/storing the data.
* Guzzle should be used for testability and might even help with parsing the files.
* The data download should happen behind an interface for testability.
* The current code (in `FetchImpressionsCommand`) queries the banner page on the Meta-Wiki to retrieve the banner keyword based on the banner name; we need to keep doing this.
* Watch out for time zones!
* Both data sources will be used in parallel for a short time. For now, the fetched data should be stored in a separate database table.
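The download/parse/remember split suggested in the hints could translate into interfaces along these lines. All names here are illustrative and do not exist in the fundraising-backend codebase:

```php
<?php
// Illustrative interfaces for the download/parse/remember split.

interface ImpressionFileDownloader {
    /** Return the raw CSV contents of the file at the given URL. */
    public function download(string $url): string;
}

interface ProcessedFileLog {
    public function hasBeenProcessed(string $fileName): bool;
    public function markAsProcessed(string $fileName): void;
}

// A trivial in-memory log, useful for tests; the production implementation
// would persist file names (e.g. in the database).
class InMemoryProcessedFileLog implements ProcessedFileLog {
    private array $processed = [];

    public function hasBeenProcessed(string $fileName): bool {
        return isset($this->processed[$fileName]);
    }

    public function markAsProcessed(string $fileName): void {
        $this->processed[$fileName] = true;
    }
}
```

A Guzzle-backed `ImpressionFileDownloader` implementation could wrap `Client::get()`; in tests it can be replaced by a stub that returns fixture CSV, which is the testability benefit the hints are after.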
**Notes**
* There is one file for each time span.
* The file is named `banner_impressions_YYYYMMDD_hhmm.csv`, the timestamp marking the beginning of the time span.
* The file contains comma-separated values:
  * Banner name
  * Impression count (extrapolated)
* The files are published on analytics.wikimedia.org.
* Files older than 30 days are deleted regularly.
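Given the naming scheme and CSV layout above, a parser could extract the time span and rows like this. This is a sketch: the UTC time zone and the exact column layout are assumptions to be verified, and the function names are made up:

```php
<?php
// Parse the start of the time span out of a file name like
// banner_impressions_20240131_1215.csv (format taken from the notes above).
function parseTimeSpanStart(string $fileName): DateTimeImmutable {
    if (!preg_match('/banner_impressions_(\d{8})_(\d{4})\.csv$/', $fileName, $m)) {
        throw new InvalidArgumentException("Unexpected file name: $fileName");
    }
    // '!' resets unspecified fields (e.g. seconds) to zero.
    return DateTimeImmutable::createFromFormat(
        '!Ymd Hi',
        $m[1] . ' ' . $m[2],
        new DateTimeZone('UTC') // UTC is an assumption -- see "time zones" hint!
    );
}

// Parse one CSV line into [bannerName, extrapolatedImpressionCount].
function parseImpressionRow(string $line): array {
    [$banner, $count] = str_getcsv($line);
    return [$banner, (int) $count];
}

echo parseTimeSpanStart('banner_impressions_20240131_1215.csv')->format('Y-m-d H:i'), "\n";
// prints 2024-01-31 12:15
```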