
Pick up banner impression data from analytics web server
Closed, Resolved, Public

Description

The WMF's analytics team provides files containing real-time data of WMDE banner impressions every 15 minutes. The data is publicly hosted on an indexed directory on analytics.wikimedia.org.

Acceptance Criteria

  • The application has an entry point that takes care of picking up the data.
  • A file is only picked up/processed once.
  • The data we receive is supplemented with the banner keyword.
  • If a file is missing, the command writes an error output and exits with a non-zero status. Further processing is stopped.
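The criteria above could be sketched roughly as follows. This is a language-neutral illustration in Python, not the project's PHP code; `process_pending_files`, `fetch`, `processed` and `handle` are hypothetical names, and a real implementation would persist the set of processed files instead of keeping it in memory.

```python
import sys
from typing import Callable, Iterable, Optional, Set


def process_pending_files(
    file_names: Iterable[str],
    fetch: Callable[[str], Optional[str]],  # returns the file content, or None if missing
    processed: Set[str],                    # names of files that were already handled
    handle: Callable[[str, str], None],     # parses and stores one file
) -> int:
    """Return an exit status: 0 on success, 1 as soon as a file is missing."""
    for name in file_names:
        if name in processed:
            continue  # a file is only picked up once
        content = fetch(name)
        if content is None:
            # Missing file: write an error and stop all further processing.
            print(f"error: missing file {name}", file=sys.stderr)
            return 1
        handle(name, content)
        processed.add(name)
    return 0
```

A console command would pass the returned value to `sys.exit()`, so that a missing file results in a non-zero status.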

Notes
The data sets can be accessed from:
https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/WMDE_Banners

Implementation Hints

  • The entry point can be added as a new command to the existing console.php (see https://github.com/wmde/fundraising-backend/blob/master/tools/console.php).
  • The outcome of this task is a set of classes for downloading the data, parsing the data, and remembering which data has already been processed.
  • After finishing the ticket, the entry point script should call the downloader and parser classes.
  • Guzzle should be used for testability and might even help with parsing the files.
  • The data download should happen behind an interface for testability.
  • The current code (in FetchImpressionsCommand) queries the banner page on the Meta-Wiki to retrieve the banner keyword based on the banner name, which we need to keep doing.
  • Mind the time zones! The data is expected to use UTC.
  • Data sources will be used in parallel for a short time. For now, the fetched data should be stored in a different database table.
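To illustrate the hint about hiding the download behind an interface, here is a minimal sketch. It is written in Python for brevity, whereas the actual implementation would be PHP with Guzzle. The names `ImpressionFileDownloader`, `HttpDownloader` and `InMemoryDownloader` are hypothetical, and the assumption that each file sits directly below the published directory URL is not confirmed by this task.

```python
import urllib.error
import urllib.request
from typing import Dict, Optional, Protocol

# Directory URL taken from the task description.
BASE_URL = (
    "https://analytics.wikimedia.org/published/datasets/"
    "wmde-analytics-engineering/WMDE_Banners"
)


class ImpressionFileDownloader(Protocol):
    def download(self, file_name: str) -> Optional[str]:
        """Return the file content, or None if the file does not exist."""
        ...


class HttpDownloader:
    """Fetches files over HTTP (assumes the files sit directly below base_url)."""

    def __init__(self, base_url: str = BASE_URL) -> None:
        self._base_url = base_url.rstrip("/")

    def download(self, file_name: str) -> Optional[str]:
        try:
            with urllib.request.urlopen(f"{self._base_url}/{file_name}") as response:
                return response.read().decode("utf-8")
        except urllib.error.HTTPError as error:
            if error.code == 404:
                return None  # missing file; the caller decides how to react
            raise


class InMemoryDownloader:
    """Test double that serves file contents from a dictionary."""

    def __init__(self, files: Dict[str, str]) -> None:
        self._files = files

    def download(self, file_name: str) -> Optional[str]:
        return self._files.get(file_name)
```

Code that depends only on `ImpressionFileDownloader` can be tested with `InMemoryDownloader` without any network access.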

Notes

  • There is one file for each time span.
  • The file is named banner_impressions_YYYYMMDD_hhmm.csv, using the beginning of the time span.
  • The file contains comma-separated values:
    • Banner name
    • Impression count (extrapolated)
  • The files are published on analytics.wikimedia.org.
  • Files older than 30 days are deleted regularly.
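The naming scheme and file format described above could be handled along these lines. This is a hedged sketch in Python, not the project's code. It assumes that file names use UTC, that spans are 15 minutes long, that the files have no header row, and that impression counts are integers; none of these details is confirmed beyond the notes in this task. The function names are hypothetical.

```python
import csv
import io
from datetime import datetime, timedelta, timezone
from typing import Iterator, List, Tuple

INTERVAL = timedelta(minutes=15)  # one file per 15-minute time span


def file_name_for(span_start: datetime) -> str:
    """Build the file name for a span; expects a timezone-aware datetime."""
    utc = span_start.astimezone(timezone.utc)  # assumption: names are in UTC
    return utc.strftime("banner_impressions_%Y%m%d_%H%M.csv")


def span_starts(after: datetime, until: datetime) -> Iterator[datetime]:
    """Yield the span starts following the last stored one, up to `until`."""
    current = after + INTERVAL
    while current <= until:
        yield current
        current += INTERVAL


def parse_impressions(content: str) -> List[Tuple[str, int]]:
    """Parse rows of 'banner name, impression count' (assumes no header row)."""
    rows = []
    for row in csv.reader(io.StringIO(content)):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        rows.append((row[0].strip(), int(row[1])))
    return rows
```

Starting from the last stored date, `span_starts` combined with `file_name_for` yields the names of the files that still need to be picked up.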

Event Timeline

The naming schema needs to follow a naming convention ("banner impressions" prefix, date and time).
The data format should not change (currently, it is the number of impressions and the banner name). Be sure to use UTC.
A rotation of one month is fine.

kai.nissen set the point value for this task to 13.

This task needs to be split into:

  • banner keyword fetching
  • server data fetching
  • data processing
  • storage interaction
    • writing new data
    • getting the last stored date
kai.nissen removed the point value for this task.