Problem
For the Android user contribution screen schema, events are currently both sent to the legacy eventlogging service and enqueued by the Metrics Platform client library for submission to the new system. A number of events / app_install_ids that are present in the legacy eventlogging system are not found in the new system. A smaller number of events / app_install_ids are found in the new system but not in the legacy system.
@SNowick_WMF wrote in T228179#6947595:
Found differences between event counts on MEP and Legacy tables. On further investigation, the most obvious difference is based on presence of app_install_id, where users on Legacy were not showing up at all on MEP. This anomaly was mostly isolated to event ipblock. The ipblock is the event with the highest counts on both tables. (Superset data table)
Engineering has speculated that this may be caused by users quitting the app after receiving the ipblock notice, so that their event data is never sent from the client. There is also a possibility that the bundling of events we are using for sending data to MEP is an issue.
The plan is to change how clients send events on user app close, which @Sharvaniharan will implement for the next available release, after which we will see if that explains the data loss. Another possibility is to un-bundle events sent to MEP if we find the first test doesn't resolve the data loss.
Counts below are app_install_ids that do not appear in the corresponding table, by event:
event          legacy
caption_view   14
caption_view2  8
desc_view      29
desc_view2     28
filt_all       40
filt_caption   43
filt_desc      33
filt_tag       55
ip_block       3212
misc_view      13
misc_view2     8
open_hist      114
tag_view       9
tag_view2      4

modern (only seven non-empty cells survive, so they cannot be matched to events): 1 1 1 2 2 53 5
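The per-event counts above amount to a set difference in each direction. A minimal sketch of that comparison follows, assuming the distinct app_install_ids for each event have already been loaded into memory from both tables. The class name `InstallIdDiff` and the method `missingFrom` are hypothetical and are not part of any existing tooling.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper; the names here are illustrative only. */
final class InstallIdDiff {

    /**
     * For each event name in `source`, returns the app_install_ids that never
     * appear under the same event name in `other`. Events with no missing IDs
     * are omitted from the result.
     */
    static Map<String, Set<String>> missingFrom(
            Map<String, Set<String>> source,
            Map<String, Set<String>> other) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : source.entrySet()) {
            // Copy first so the caller's sets are never modified.
            Set<String> missing = new TreeSet<>(entry.getValue());
            missing.removeAll(other.getOrDefault(entry.getKey(), Set.of()));
            if (!missing.isEmpty()) {
                result.put(entry.getKey(), missing);
            }
        }
        return result;
    }
}
```

Calling `missingFrom(legacy, mep)` would give the IDs behind the "legacy" row, and `missingFrom(mep, legacy)` those behind the "modern" row; taking the size of each set yields the counts.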
@SNowick_WMF wrote in T228179#7011084:
After the changes made in version 2.7.50350-r-2021-04-07, we are still seeing the issue where app_install_ids present on Legacy are not showing up at all on MEP, along with a smaller number of app_install_ids that appear on MEP and not on Legacy.
An 11 day data sample showed the following:
Event         MEP Uniques  Legacy Uniques
ip_block      4            146
other events  4            10
Hypotheses
- Events are sent to the old system immediately but enqueued for a period of time with the new client library. A small number of enqueued events that were submitted to the old system may be lost before submission to the new system when the user quits the app.
- It would not be unexpected for transient network or server errors to produce a small number of discrepancies on either side. (These should be approximately equally distributed between legacy and modern, assuming that the 5xx error rate for eventgate and for the MediaWiki appservers is roughly equal.)
- The current Android implementation does not have a separate "input buffer" for holding events prior to validation, so it is possible that a small number of otherwise-valid events are being lost because stream configs haven't yet been fetched when the client attempts to send them.
- In a situation with intermittent network connectivity, the device may be online when we perform the connectivity check but go offline before we actually attempt to send events. Rather than attempting to track network-enabled state, we should probably send unconditionally and handle network errors appropriately.
- In the old system, isEventLoggingEnabled is checked at the point of event creation, but in the MEP client it is not checked until the point of submission. There is a small possibility that this state could change between the two checks. The MEP client should be updated to check at the point of event creation.
- In theory, we could lose events if the app is using outdated stream configs that were stored from the previous run. However, that is probably not affecting current totals because the stream config for the relevant stream has not recently changed. Regardless, to match the other implementations, this client should stop storing stream configs between runs, and fetch a fresh copy on each launch.
- Users may be using ad blocking software that blocks requests to either or both endpoints. This depends on the ad blocking configuration used and is entirely out of our control.
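Three of the hypotheses above suggest concrete client changes: checking the enabled flag at event creation, holding events in an input buffer until stream configs arrive, and sending unconditionally while handling network errors. The following is a minimal sketch of how those could fit together. It is not the Metrics Platform client's actual code; every class, method, and field name here is hypothetical, and validation is reduced to checking that the event's stream is configured.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names come from the real client library. */
final class BufferingClientSketch {

    static final class Event {
        final String stream;
        final String name;
        Event(String stream, String name) { this.stream = stream; this.name = name; }
    }

    /** Stand-in for the network layer; throws on any send failure. */
    interface Sender { void send(Event event) throws IOException; }

    private final BooleanSupplier loggingEnabled;                 // stands in for isEventLoggingEnabled
    private final Queue<Event> inputBuffer = new ArrayDeque<>();  // awaiting stream configs
    private final Queue<Event> outputQueue = new ArrayDeque<>();  // validated, awaiting send
    private Set<String> configuredStreams;                        // null until configs are fetched

    BufferingClientSketch(BooleanSupplier loggingEnabled) {
        this.loggingEnabled = loggingEnabled;
    }

    /** Checks the enabled flag at creation time, then buffers or validates the event. */
    void submit(Event event) {
        if (!loggingEnabled.getAsBoolean()) return;
        if (configuredStreams == null) {
            inputBuffer.add(event);   // hold until stream configs arrive
        } else {
            validateAndEnqueue(event);
        }
    }

    /** Called once stream configs are fetched; drains the input buffer through validation. */
    void onStreamConfigsFetched(Set<String> streams) {
        configuredStreams = new HashSet<>(streams);
        Event event;
        while ((event = inputBuffer.poll()) != null) {
            validateAndEnqueue(event);
        }
    }

    /** Attempts to send without a connectivity pre-check; returns the number of events sent. */
    int flush(Sender sender) {
        int sent = 0;
        while (!outputQueue.isEmpty()) {
            try {
                sender.send(outputQueue.peek());
            } catch (IOException e) {
                break;   // keep this event and the rest for a later retry
            }
            outputQueue.remove();   // dequeue only after a successful send
            sent++;
        }
        return sent;
    }

    /** Snapshot of validated events that have not yet been sent. */
    List<Event> pending() { return new ArrayList<>(outputQueue); }

    private void validateAndEnqueue(Event event) {
        if (configuredStreams.contains(event.stream)) {
            outputQueue.add(event);
        }
    }
}
```

In this sketch an event created while logging is enabled survives a later change to the flag, which mirrors the legacy behavior, and a failed send leaves the queue untouched rather than dropping events. It does not address events lost when the process is killed before `flush` runs; that would need persistence or a flush on app close.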