|operations/mediawiki-config : master||CentralNotice EventLogging banner impression data test|
- Mentioned In
- T90917: Get banner count via Special:BannerLoader
T214709: Remove hacky EventLogging-duplicating code and use new lightweight EL facilities
T223323: review status of event logging
T203142: [FundraisingLandingPage] should not double escape
T200721: pgehres: Remove unused columns
T198641: Rename pgehres database
- Mentioned Here
- T236834: FRUEC: Detailed comparison of events in old and new log files for banner impression pipeline
T236835: FRUEC: Debug minor discrepancy in landing page data between old and new pipelines
T237553: FRUEC: Discuss with stakeholders, Analytics and fr-tech implications and options for new Landing Page pipeline
T237997: FRUEC: For legacy compatibility, empty language property should default to 'en' for LandingPage impressions
T238592: Ask other teams for input on extra entries in new pipeline landing page logs
T239570: Investigate options for dropped CN EventLogging events for new pipeline
T242022: Verify no losses in kafka->kafkatee->logfile data pipeline segment
T242065: FRUEC: Ensure compatibility with legacy behaviour for missing and empty values, for all LandingPage event properties
T195594: New scripts to ingress data from Kafkatee into MySQL
T196563: Write a specification for mapping banner/landing page impression event properties -> database schema
T196564: DB schemas (production changes and test DB) and SQL commands to run for new banner and LP impression data from EL
T198752: Queries and maybe scripts to verify equivalence of data in new-Kafka-pipeline-testing and pgehres production databases
T192839: [Spike] Plan ingress system to use with new Kafka topic for landing page and impression data
T189613: Make a rough timeline/roadmap to replace usage of Kafkatee by FR with an event-based system to count pageviews to donatewiki
T185932: CentralNotice: use EventLogging instead of custom beacon
T185933: Donatewiki: use EventLogging to log pageloads
T186047: centralnotice_analytics: adapt ImpressionsQuery for EventLogging-based impressions recording
T186048: Adapt ingress of CN data into Druid to EventLogging-based impression recording
- T185932: CentralNotice: use EventLogging instead of custom beacon
- T185933: Donatewiki: use EventLogging to log pageloads
- T186047: centralnotice_analytics: adapt ImpressionsQuery for EventLogging-based impressions recording
- T186048: Adapt ingress of CN data into Druid to EventLogging-based impression recording
As of this writing, the open tasks attached to this epic roughly cover what needs to be done. As we move forward, we'll probably think of more specific ones related to parallel deployment of the new system and switching off the old one.
What the epic doesn't show is which tasks already have some progress, and what the priorities are. Here are some tasks that already have some progress. All of them are currently in review.
- T196563: Write a specification for mapping banner/landing page impression event properties -> database schema
- T196564: DB schemas (production changes and test DB) and SQL commands to run for new banner and LP impression data from EL
- T198752: Queries and maybe scripts to verify equivalence of data in new-Kafka-pipeline-testing and pgehres production databases
- T195594: New scripts to ingress data from Kafkatee into MySQL
I'd suggest we first finish the review of these tasks, verify the data on the part of the pipeline that is already active (EventLogging data), and then maybe plan the deployment process in more detail.
Here's an overview of the status of this project:
- Data specification for the new pipeline is ready, except for possible adjustments for empty LandingPage values (see below).
- Log files are created from the new Kafka streams.
- Parallel testing database is on production.
- FRUEC (new ingress script) is functional and can be deployed and run as a job on production.
- No automated tests or CI yet.
- No attempts yet to run real queries for Advancement on data from the new pipeline.
- Major issue requiring attention: discrepancies between data in the old and new pipelines.
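As an illustration of the possible adjustment for empty LandingPage values mentioned above (T237997, T242065), here is a minimal sketch in Python. It is not FRUEC code: the function name, the dict-based event representation and the `language` key are assumptions made for the example. Only the 'en' default for an empty language comes from T237997.

```python
def apply_legacy_defaults(event, defaults=None):
    """Return a copy of `event` with missing or empty properties replaced.

    Hypothetical helper, not part of FRUEC. `event` is assumed to be a
    plain dict of LandingPage event properties.
    """
    if defaults is None:
        # Per T237997: an empty language should default to 'en'.
        defaults = {"language": "en"}
    normalized = dict(event)  # leave the caller's dict untouched
    for key, fallback in defaults.items():
        value = normalized.get(key)
        if value is None or value == "":
            normalized[key] = fallback
    return normalized
```

Other properties covered by T242065 could be handled the same way by passing a larger `defaults` mapping, once the legacy behaviour for each of them is confirmed.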
Details on discrepancies between new and old data pipelines:
- No significant discrepancies attributable to the FRUEC script were detected.
- Likely no major issues with Kafkatee filtering of events received from Analytics' Kafka streams. Task about further verification: T242022.
- Major issues exist regarding the initial sections of the pipelines, that is, how data is initially collected from the client.
- For the CN (banner) pipeline, both old and new events are generated client-side.
- For the Landing Page pipeline, the old events are generated from requests for the base page HTML, and new events are generated client-side.
- Strangely, the new LandingPage pipeline also consistently contains extra events, that is, events with no corresponding event in the old pipeline. Task to figure out what's going on: T238592.
- Since the new pipeline events arrive on the same EventLogging URL, it is likely that 12-15% are being blocked by adblock filters, just like CN pipeline events.
- Analysis of data: T236835.
- What to do: T237553.
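The kind of comparison described above (events dropped from the new pipeline, and extra events that appear only in it) can be sketched roughly as follows. This is a rough sketch in Python, not the analysis actually used in T236834 or T236835. It assumes each event has already been reduced to a hashable matching key. How such a key would be built from real log lines is not specified here, and `compare_pipelines` is a hypothetical name.

```python
from collections import Counter


def compare_pipelines(old_keys, new_keys):
    """Count matched, missing and extra events between two pipelines.

    `old_keys` and `new_keys` are iterables of hashable matching keys
    (hypothetical; one key per logged event). Duplicate keys are
    counted, not collapsed.
    """
    old_counts = Counter(old_keys)
    new_counts = Counter(new_keys)

    matched = sum((old_counts & new_counts).values())   # present in both
    missing = sum((old_counts - new_counts).values())   # old only
    extra = sum((new_counts - old_counts).values())     # new only

    old_total = sum(old_counts.values())
    return {
        "matched": matched,
        "missing_from_new": missing,
        "extra_in_new": extra,
        # Share of old-pipeline events with no counterpart in the new one.
        "missing_ratio": missing / old_total if old_total else 0.0,
    }
```

A `missing_ratio` in the 0.12 to 0.15 range would be consistent with the adblock estimate above, while a non-zero `extra_in_new` corresponds to the unexplained extra LandingPage events.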