Every 15 minutes, the WMF's analytics team publishes a file containing real-time data on WMDE banner impressions. The files are publicly hosted in an indexed directory on analytics.wikimedia.org.
Acceptance Criteria
- The application has an entry point that takes care of picking up the data.
- A file is picked up and processed only once.
- The data we receive is complemented with the banner keyword.
- If a file is missing, the command writes an error output and exits with a non-zero status. Further processing is stopped.
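The criteria above could translate into a control flow like the following minimal sketch. The function name `processImpressionFiles` and all four callables are hypothetical stand-ins for the real downloader, parser/storage, and "already processed" bookkeeping; the actual command would wire in the project's own classes.

```php
<?php
declare(strict_types=1);

/**
 * Processes the given files in order and returns a process exit code.
 * All callables are hypothetical placeholders, not existing project code.
 *
 * @param string[] $fileNames
 * @param callable $download    fn(string $fileName): ?string, null means "file is missing"
 * @param callable $isProcessed fn(string $fileName): bool
 * @param callable $process     fn(string $fileName, string $content): void, also records the file as processed
 * @param callable $writeError  fn(string $message): void
 */
function processImpressionFiles(
    array $fileNames,
    callable $download,
    callable $isProcessed,
    callable $process,
    callable $writeError
): int {
    foreach ($fileNames as $fileName) {
        if ($isProcessed($fileName)) {
            continue; // a file is only picked up once
        }
        $content = $download($fileName);
        if ($content === null) {
            $writeError("Missing impression file: $fileName");
            return 1; // non-zero status, further processing is stopped
        }
        $process($fileName, $content);
    }
    return 0;
}
```

In a real command, `$writeError` would probably write to STDERR and the returned integer would become the exit status of the console command.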
Notes
The data sets can be accessed from:
https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/WMDE_Banners
Implementation Hints
- The entry point can be added as a new command to the existing console.php (see https://github.com/wmde/fundraising-backend/blob/master/tools/console.php).
- The outcome of this task is a set of classes for downloading the data, parsing the data, and remembering which data has already been processed.
- After finishing the ticket, the entry point script should call the downloader and parser classes.
- Guzzle should be used for testability and might even help with parsing the files.
- Data download should happen behind an interface for testability.
- The current code (in FetchImpressionsCommand) queries the banner page on the Meta-Wiki to retrieve the banner keyword based on the banner name, which we need to keep doing.
- Pay attention to time zones when mapping time spans to file names and when storing the data.
- Data sources will be used in parallel for a short time. For now, the fetched data should be stored in a different database table.
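To illustrate the hint about hiding the download behind an interface, here is a rough sketch. The interface and class names (`ImpressionFileDownloader`, `GuzzleImpressionFileDownloader`, `InMemoryImpressionFileDownloader`) are made up for this example, and treating a 404 response as "file is missing" is an assumption. The Guzzle-backed class presumes Guzzle is installed via Composer; the in-memory class is the kind of test double the interface makes possible.

```php
<?php
declare(strict_types=1);

use GuzzleHttp\ClientInterface;

// Hypothetical interface: the command and the parser only depend on this.
interface ImpressionFileDownloader
{
    /** Returns the raw CSV content, or null if the file does not exist. */
    public function download(string $fileName): ?string;
}

// Possible production variant, assuming Guzzle is available.
final class GuzzleImpressionFileDownloader implements ImpressionFileDownloader
{
    private ClientInterface $client;
    private string $baseUrl;

    public function __construct(ClientInterface $client, string $baseUrl)
    {
        $this->client = $client;
        $this->baseUrl = rtrim($baseUrl, '/');
    }

    public function download(string $fileName): ?string
    {
        // With http_errors disabled, Guzzle returns 4xx/5xx responses
        // instead of throwing; connection failures still throw.
        $response = $this->client->request(
            'GET',
            $this->baseUrl . '/' . $fileName,
            ['http_errors' => false]
        );
        if ($response->getStatusCode() === 404) {
            return null; // assumption: 404 means the file is missing
        }
        if ($response->getStatusCode() !== 200) {
            throw new RuntimeException('Unexpected HTTP status ' . $response->getStatusCode());
        }
        return (string) $response->getBody();
    }
}

// Test double that serves file contents from an array.
final class InMemoryImpressionFileDownloader implements ImpressionFileDownloader
{
    /** @var array<string, string> */
    private array $files;

    public function __construct(array $files)
    {
        $this->files = $files;
    }

    public function download(string $fileName): ?string
    {
        return $this->files[$fileName] ?? null;
    }
}
```

Tests for the command and the parser could then use the in-memory variant, while the Guzzle variant itself could be tested with Guzzle's mock handler instead of real HTTP requests.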
Notes
- There is one file for each time span.
- The file is named banner_impressions_YYYYMMDD_hhmm.csv, using the beginning of the time span.
- The file contains comma-separated values:
  - Banner name
  - Impression count (extrapolated)
- The files are published on analytics.wikimedia.org.
- Files older than 30 days are deleted regularly.
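The naming scheme and file layout described above might be handled as in the following sketch. It rests on several assumptions the ticket does not confirm: that the timestamp in the file name is in UTC, that the files have no header row, and that impression counts are whole numbers. The function names `impressionFileName` and `parseImpressions` are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Builds the file name for the time span beginning at $start.
 * Assumption: file names use UTC timestamps.
 */
function impressionFileName(DateTimeImmutable $start): string
{
    $utc = $start->setTimezone(new DateTimeZone('UTC'));
    return 'banner_impressions_' . $utc->format('Ymd_Hi') . '.csv';
}

/**
 * Parses file content into a map of banner name to impression count.
 * Assumption: no header row, exactly two columns, integer counts.
 *
 * @return array<string, int>
 */
function parseImpressions(string $csv): array
{
    $impressions = [];
    foreach (preg_split('/\R/', trim($csv)) as $line) {
        if ($line === '') {
            continue; // tolerate empty files and blank lines
        }
        $fields = str_getcsv($line, ',', '"', '');
        if (count($fields) !== 2 || !ctype_digit(trim($fields[1]))) {
            throw new InvalidArgumentException("Unexpected line: $line");
        }
        $impressions[trim($fields[0])] = (int) trim($fields[1]);
    }
    return $impressions;
}
```

Under the UTC assumption, a time span starting at 14:15 Europe/Berlin on 5 March 2019 would map to `banner_impressions_20190305_1315.csv`. Converting explicitly in one place keeps the time zone decision visible and easy to change if the files turn out to use a different zone.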