Here's a summary of the discrepancies between the new and old pipelines for Landing Pages (it also includes measurements obtained in T235284).
Mon, Nov 11
FRUEC currently accepts LandingPage events that have no language property in the JSON input. However, if the property exists and its value is an empty string, the event is marked invalid and not counted. In that same case, the legacy system (DjangoBannerStats) defaults the language to 'en'.
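To make the difference concrete, here is a minimal sketch of the two behaviours as described above. It is not the actual FRUEC or DjangoBannerStats code: the function names are hypothetical, and what the legacy system does when the property is missing entirely is an assumption.

```python
def fruec_accepts(event: dict) -> bool:
    """Model of the FRUEC behaviour described above (hypothetical helper).

    A missing language property is accepted; a property that is present
    but set to an empty string makes the event invalid.
    """
    return "language" not in event or event["language"] != ""


def legacy_language(event: dict) -> str:
    """Model of the legacy behaviour described above (hypothetical helper).

    An empty language defaults to 'en'. Treating a missing property the
    same way is an assumption made only for this sketch.
    """
    return event.get("language") or "en"


# An event with an empty language is dropped by one pipeline and counted
# as English by the other, which skews per-language comparisons.
empty_language_event = {"landingpage": "Example", "language": ""}
dropped_by_fruec = not fruec_accepts(empty_language_event)
counted_as = legacy_language(empty_language_event)
```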
Fri, Nov 8
Wed, Nov 6
Mon, Nov 4
Matching on landing page, country, language, and other event fields, with old log timestamps always earlier than the new log timestamps, by at most 30 seconds, we get:
Trying different options for better matching... There's just a small improvement from matching each old log event to the new log event with the closest timestamp, that is, removing the requirement that the new log event come after the old log one. With this method, we get 136 unmatched events in the new log, and 510 in the old one.
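For reference, the closest-timestamp matching can be sketched roughly as follows. This is an illustration, not the script actually used: the function name, the (key, timestamp) event shape, and the greedy one-to-one strategy are assumptions, and it also assumes the 30-second window still applies once the ordering requirement is dropped.

```python
from collections import defaultdict


def match_events(old_events, new_events, max_delta=30.0):
    """Greedy one-to-one matching on equal keys and closest timestamps.

    Each event is a (key, timestamp) pair, where key is a tuple of the
    fields that must be equal (landing page, country, language, ...) and
    timestamp is in seconds. Returns (unmatched_old, unmatched_new).
    """
    # Group the new log timestamps by their matching key.
    new_by_key = defaultdict(list)
    for key, ts in new_events:
        new_by_key[key].append(ts)

    unmatched_old = []
    for key, ts in sorted(old_events, key=lambda e: e[1]):
        candidates = new_by_key.get(key, [])
        if candidates:
            # Closest new timestamp, whether before or after the old one.
            best = min(candidates, key=lambda c: abs(c - ts))
            if abs(best - ts) <= max_delta:
                candidates.remove(best)  # each new event matches only once
                continue
        unmatched_old.append((key, ts))

    # Whatever is left in the groups was never matched.
    unmatched_new = [(k, t) for k, tss in new_by_key.items() for t in tss]
    return unmatched_old, unmatched_new


old = [(("A", "US", "en"), 100.0), (("A", "US", "en"), 200.0), (("B", "FR", "fr"), 50.0)]
new = [(("A", "US", "en"), 95.0), (("A", "US", "en"), 300.0)]
unmatched_old, unmatched_new = match_events(old, new)
```

In this toy input, the old event at 100.0 matches the new event at 95.0 even though the new one is earlier, which is exactly what the relaxed rule allows.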
I've dug into this some more, improving filtering compared to what was done for T235284. I've got some more clarity about what the differences are, but it's still pretty ugly.
Fri, Nov 1
Thu, Oct 31
Scheduled to deploy in a few minutes...
Thanks so much for flagging this, and many apologies for the noise!! There's now a patch in review for T236627: CentralNotice: Adapt impression event schema for campaign fallback.
Here's a patch!!! Apologies for the trouble...!
Wed, Oct 30
Tue, Oct 29
Note: the second bullet point from the task description has been spun out as T236845.
This task was about the large-scale discrepancy, which I think we can consider to be solved. There are still smaller unexplained differences between data we're getting from the old and new pipelines. I've created new tasks to investigate that: T236835 and T236834.
Mon, Oct 28
I've dug deeper into the Landing Page discrepancy, comparing sequences of log entries from the same IPs in both old and new pipelines.
Sun, Oct 27
Fri, Oct 25
So actually it seems the problem is not duplicate entries in the old logs, but rather a few IP addresses hammering on the site using some sort of script, which doesn't run client-side code, so it doesn't generate any client-side events.
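A rough way to spot this kind of traffic is to count entries per IP in both logs and look for addresses that are heavy in the old (server-side) log but absent from the new (client-side) one. The sketch below assumes the IPs have already been extracted from each log; the function name and the threshold are hypothetical.

```python
from collections import Counter


def server_only_ips(old_ips, new_ips, min_hits=100):
    """Return {ip: old_count} for IPs with many old-log entries and no new-log events."""
    old_counts = Counter(old_ips)
    new_counts = Counter(new_ips)
    return {
        ip: n
        for ip, n in old_counts.items()
        # Counter returns 0 for IPs that never appear in the new log.
        if n >= min_hits and new_counts[ip] == 0
    }


# Toy data using documentation-range addresses.
old_ips = ["192.0.2.1"] * 150 + ["192.0.2.2"] * 3 + ["192.0.2.3"] * 120
new_ips = ["192.0.2.2"] * 3 + ["192.0.2.3"] * 118
suspects = server_only_ips(old_ips, new_ips)
```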
Made some progress on figuring out what the difference is between old and new logs. It looks like there are a lot of duplicate entries in the old log files.
Wed, Oct 23
Wed, Oct 16
Looking at LandingPage counts, it seems there is a significant difference either in the contents of the log files for the two pipelines, or in how the scripts filter events, or both.
Tue, Oct 15
Working on the assumption that the cause or causes of the discrepancy could be different for LandingPage events than for CentralNotice ones... It seems that the discrepancy in CentralNotice event counts can be explained by the fact that neither FRUEC nor the legacy script takes the client-side sample rate into account.
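As an illustration of what taking the sample rate into account could look like, one common approach is to weight each logged event by the inverse of the rate it was sampled at. This is only a sketch: the sample_rate field name is hypothetical, and neither script is known to work this way.

```python
def estimated_total(events):
    """Estimate the true number of events from sampled log entries.

    Each event is a dict with a hypothetical 'sample_rate' field in (0, 1].
    An event logged at rate 0.25 stands in for about 4 real events.
    """
    return sum(1.0 / event["sample_rate"] for event in events)


logged = [
    {"sample_rate": 1.0},
    {"sample_rate": 0.5},
    {"sample_rate": 0.25},
    {"sample_rate": 0.25},
]
raw_count = len(logged)             # what a script ignoring the rate reports
weighted = estimated_total(logged)  # estimate after weighting by 1 / rate
```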
Also a large difference in log file contents for landing pages for a more recent date:
Mon, Oct 14
There's a wide difference in the number of events in the files being consumed:
Oct 13 2019
Now deployed, seems to work!
Oct 11 2019
Oct 10 2019
Here are some suggestions on what to smoke test for any changes to the fallback loop. You could check all the following situations, each time looking for errors in the browser console and checking the contents of mw.centralNotice.data.
Oct 8 2019
I'm closing this task, as the initial work is done, and the remaining work is now tracked separately.
Oct 7 2019
This needs updating due to this schema change: https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/FRUEC/+/541155