
[SPIKE] Investigate legacy vs. modern event submission inconsistencies for Android user contribution screen
Closed, Resolved · Public · Spike

Description

Problem

For the Android user contribution screen schema, events are currently both sent to the legacy eventlogging service and enqueued by the Metrics Platform client library for submission to the new system. A number of events / app_install_ids that are present in the legacy eventlogging system are not found in the new system. A smaller number of events / app_install_ids are found in the new system but not in the legacy system.

@SNowick_WMF wrote in T228179#6947595:

Found differences between event counts on MEP and Legacy tables, on further investigation the most obvious difference is based on presence of app_install_id where users on Legacy were not showing up at all on MEP. This anomaly was mostly isolated to event ipblock. The ipblock is the event with the highest counts on both tables. (Superset data table

Engineering has speculated that this may be caused by users quitting the app after receiving the ipblock notice, so that their event data is never sent from the client. There is also a possibility that the bundling of events we are using for sending data to MEP is an issue.

The plan is to change how clients send events on user app close, which @Sharvaniharan will implement for the next available release, after which we will see if that explains the data loss. Another possibility is to un-bundle events sent to MEP if we find the first test doesn't resolve the data loss.

Counts below are app_install_ids that do not appear in the corresponding table, by event:

datacaption_viewcaption_view2desc_viewdesc_view2filt_allfilt_captionfilt_descfilt_tagip_blockmisc_viewmisc_view2open_histtag_viewtag_view2
legacy148292840433355321213811494
modern11122535

@SNowick_WMF wrote in T228179#7011084:

After the changes made in version 2.7.50350-r-2021-04-07, we are still seeing the issue where app_install_ids present on Legacy do not show up at all on MEP, along with a smaller number of app_install_ids that appear on MEP but not on Legacy.

An 11 day data sample showed the following:

EventMEP UniquesLegacy Uniques
ip_block4146
other events410

Hypotheses

  1. Events are sent to the old system immediately but enqueued for a period of time with the new client library. A small number of enqueued events that were submitted to the old system may be lost before submission to the new system when the user quits the app.
  2. It would not be unexpected for transient network or server errors to produce a small number of discrepancies on either side. (These should be approximately equally distributed between legacy and modern, assuming that the 5xx error rate for eventgate and the MediaWiki appservers is roughly equal.)
  3. The current Android implementation does not have a separate "input buffer" for holding events prior to validation, so it is possible that a small number of otherwise-valid events are being lost because stream configs haven't yet been fetched when the client attempts to send them.
  4. In a situation with intermittent network connectivity, the device may be online when we perform the connectivity check but go offline before we actually attempt to send events. Rather than attempting to track network-enabled state, we should probably send unconditionally and handle network errors appropriately.
  5. In the old system, isEventLoggingEnabled is checked at the point of event creation, but in the MEP client it is not checked until the point of submission. There is a small possibility that this state could change between the two checks. The MEP client should be updated to check at the point of event creation.
  6. In theory, we could lose events if the app is using outdated stream configs that were stored from the previous run. However, that is probably not affecting current totals because the stream config for the relevant stream has not recently changed. Regardless, to match the other implementations, this client should stop storing stream configs between runs, and fetch a fresh copy on each launch.
  7. Users may be using ad blocking software that blocks requests to either or both endpoints. This depends on the ad blocking configuration used and is entirely out of our control.
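Hypothesis 3 can be illustrated with a minimal sketch of an "input buffer", written in Python for brevity rather than in the app's own language. Everything here is hypothetical: the class name, the method names, the stream name, and the shape of the stream configs are assumptions made for illustration and are not taken from the Android client.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional


class BufferedEventClient:
    """Hypothetical client that holds events until stream configs arrive."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send
        # None means "stream configs have not been fetched yet".
        self._stream_configs: Optional[Dict[str, dict]] = None
        self._input_buffer: Deque[dict] = deque()

    def submit(self, event: dict) -> None:
        if self._stream_configs is None:
            # Configs not fetched yet: park the event instead of dropping it.
            self._input_buffer.append(event)
            return
        self._validate_and_send(event)

    def on_stream_configs_fetched(self, configs: Dict[str, dict]) -> None:
        self._stream_configs = configs
        # Drain, in order, everything that arrived before the configs did.
        while self._input_buffer:
            self._validate_and_send(self._input_buffer.popleft())

    def _validate_and_send(self, event: dict) -> None:
        # Only events whose stream is configured are sent.
        if event["stream"] in self._stream_configs:
            self._send(event)


# Assumed stream name, for illustration only.
STREAM = "android.user_contribution_screen"

sent: List[dict] = []
client = BufferedEventClient(sent.append)
client.submit({"stream": STREAM, "action": "ip_block"})   # arrives before configs
client.on_stream_configs_fetched({STREAM: {}})
client.submit({"stream": STREAM, "action": "open_hist"})  # arrives after configs
```

In this sketch the early ip_block event is delivered once the configs arrive. Without the buffer it would have failed validation and been dropped.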

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". Apr 23 2021, 7:54 PM
Restricted Application added a subscriber: Aklapper.
Mholloway renamed this task from [SPIKE] Investigate metrics platform test inconsistencies in Android to [SPIKE] Investigate legacy vs. modern event submission inconsistencies for Android user contribution screen. Apr 26 2021, 6:26 PM

This appears to be the MEP client behavior change that was shipped in 2.7.50350-r-2021-04-07: https://github.com/wikimedia/apps-android-wikipedia/pull/2249

An easy way to test for remaining events lost according to hypothesis (1) would be for the MEP client library to send events immediately like the legacy instrumentation does, either for an identifiable subset of users, and/or temporarily for testing in a beta release.
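The difference that hypothesis (1) points at can be sketched as a toy simulation. This is an assumption-laden illustration in Python, not the client library's code: both classes and their methods are hypothetical, and "quitting" is modelled simply as never calling flush() again.

```python
from typing import List


class ImmediateSender:
    """Legacy-style: each event goes out as soon as it is logged."""

    def __init__(self) -> None:
        self.delivered: List[str] = []

    def log(self, action: str) -> None:
        self.delivered.append(action)


class BatchingSender:
    """Queue-style: events wait in memory until flush() is called."""

    def __init__(self) -> None:
        self.queue: List[str] = []
        self.delivered: List[str] = []

    def log(self, action: str) -> None:
        self.queue.append(action)

    def flush(self) -> None:
        self.delivered.extend(self.queue)
        self.queue.clear()


legacy, modern = ImmediateSender(), BatchingSender()
for action in ["open_hist", "filt_all"]:
    legacy.log(action)
    modern.log(action)
modern.flush()  # a normal, timely flush

legacy.log("ip_block")
modern.log("ip_block")
# The user quits here: the process ends before the next flush,
# so whatever is still in modern.queue never leaves the device.

lost = [a for a in legacy.delivered if a not in modern.delivered]
```

Here `lost` contains only the ip_block event, which the immediate sender delivered and the batching sender still holds in its queue.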

Mholloway triaged this task as Medium priority.
Mholloway moved this task from Inbox to Doing on the Better Use Of Data board.


I wrote up a change to send events immediately in Beta in order to establish whether we are still leaking events somewhere in the validation and queueing process. Unfortunately, I wasn't yet quite up to speed on the current release cycle, and I didn't get a PR submitted/reviewed/merged before yesterday's Beta release. Regardless, I suspect that we are losing events for one or more of Hypotheses 3-5 in the task description, and I have another (larger) patch in progress to address those.

@Mholloway We discussed this in our stand-up today... This sounds like a good next step so we can have a proper comparison between the old and new systems. Please open the pr and we will review this on priority. We are planning to re-release to beta as we need to make a minor update, so we can squeeze this in with those changes. Also, I took a cursory look at the change you have made. I see that you have put it behind the beta flag. However we feel it will be beneficial to make this comparison even in production, so this can just be a general change to send events as they come on all flavors. Please lmk if you have questions.

Early results from last week's Beta release (2.7.50355-beta-2021-04-29)[1]:

Event          Legacy  Modern
open_hist          73      73
filt_all           32      32
filt_desc          30      30
filt_caption       24      24
ip_block           19      19
filt_tag           18      18
desc_view           8       8
tag_view            6       6
caption_view        4       4
caption_view2       3       3
tag_view2           3       3
desc_view2          1       1

Looking great! Assuming this holds, that confirms that it's the MEP batching system that's leaking events. (ip_block events, while still occurring more frequently than I'd have expected, are now at a less surprising level following T281189).

However, there is a discrepancy in the unique app install IDs found between the tables, which is surprising in light of the above. There are 40 unique app install IDs in the legacy table,[2] of which 2 do not appear in the modern table,[3] and 41 unique app install IDs in the modern table,[4] of which 3 do not appear in the legacy table.[5] 38 unique app install IDs are found in both tables.[6] Is there a mistake in one of these queries that could explain this?
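As a quick sanity check, the ID counts above are at least consistent with one another: 40 = 38 + 2 and 41 = 38 + 3. The sketch below reproduces that arithmetic with Python sets. The IDs are synthetic placeholders chosen only to match the reported counts. They are not real app_install_ids.

```python
# Synthetic IDs only: chosen to reproduce the reported counts, not real data.
both = {f"shared-{i}" for i in range(38)}
legacy_ids = both | {"legacy-only-0", "legacy-only-1"}
modern_ids = both | {"modern-only-0", "modern-only-1", "modern-only-2"}

legacy_only = legacy_ids - modern_ids  # corresponds to query [3] (EXCEPT)
modern_only = modern_ids - legacy_ids  # corresponds to query [5] (EXCEPT)
in_both = legacy_ids & modern_ids      # corresponds to query [6] (INTERSECT)

# Each table's distinct IDs split into "shared" plus "only in this table".
assert len(legacy_ids) == len(in_both) + len(legacy_only)
assert len(modern_ids) == len(in_both) + len(modern_only)
```

Because the set identities hold for the reported numbers, queries [2] through [6] appear to agree with each other. That narrows the question to how they relate to query [1].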

[1]

WITH user_event_counts_legacy AS (
  SELECT
    event.app_install_id AS app_install_id, event.action AS action, COUNT(1) AS n_legacy_events
  FROM
    event.mobilewikiappusercontribution
  WHERE
    year = 2021 AND month > 3 AND useragent.wmf_app_version = '2.7.50355-beta-2021-04-29'
  GROUP BY
    event.app_install_id, event.action
), user_event_counts_modern AS (
  SELECT
    app_install_id, action, COUNT(1) AS n_modern_events
  FROM
    event.android_user_contribution_screen
  WHERE
    year = 2021 AND month > 3 AND user_agent_map['wmf_app_version'] = '2.7.50355-beta-2021-04-29'
  GROUP BY
    app_install_id, action
), user_event_counts_joined AS (
  SELECT
    app_install_id, action, n_legacy_events, n_modern_events
  FROM
    user_event_counts_legacy AS legacy
  JOIN
    user_event_counts_modern AS modern USING (app_install_id, action)
)
SELECT
  action, SUM(n_legacy_events) AS n_total_legacy_events, SUM(n_modern_events) AS n_total_modern_events
FROM
  user_event_counts_joined
GROUP BY
  action
ORDER BY
  n_total_legacy_events DESC;

[2]

SELECT DISTINCT event.app_install_id FROM mobilewikiappusercontribution WHERE year = 2021 AND month > 3 AND useragent.wmf_app_version = '2.7.50355-beta-2021-04-29';

[3]

SELECT DISTINCT event.app_install_id FROM mobilewikiappusercontribution WHERE year = 2021 AND month > 3 AND useragent.wmf_app_version = '2.7.50355-beta-2021-04-29'
EXCEPT
SELECT DISTINCT app_install_id FROM android_user_contribution_screen WHERE year = 2021 AND month > 3 AND user_agent_map['wmf_app_version'] = '2.7.50355-beta-2021-04-29';

[4]

SELECT DISTINCT app_install_id FROM android_user_contribution_screen WHERE year = 2021 AND month > 3 AND user_agent_map['wmf_app_version'] = '2.7.50355-beta-2021-04-29';

[5]

SELECT DISTINCT app_install_id FROM android_user_contribution_screen WHERE year = 2021 AND month > 3 AND user_agent_map['wmf_app_version'] = '2.7.50355-beta-2021-04-29'
EXCEPT
SELECT DISTINCT event.app_install_id FROM mobilewikiappusercontribution WHERE year = 2021 AND month > 3 AND useragent.wmf_app_version = '2.7.50355-beta-2021-04-29';

[6]

SELECT DISTINCT event.app_install_id FROM mobilewikiappusercontribution WHERE year = 2021 AND month > 3 AND useragent.wmf_app_version = '2.7.50355-beta-2021-04-29'
INTERSECT
SELECT DISTINCT app_install_id FROM android_user_contribution_screen WHERE year = 2021 AND month > 3 AND user_agent_map['wmf_app_version'] = '2.7.50355-beta-2021-04-29';
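One property of query [1] worth keeping in mind is that its inner JOIN ... USING (app_install_id, action) keeps only the (app_install_id, action) pairs present in both tables. Events for pairs that exist in only one table are dropped before the per-action sums are taken. Whether this accounts for the observation above is not established here. The sketch below only shows the effect, using made-up counts and Python dictionaries in place of the two tables.

```python
from collections import Counter

# Made-up per-(app_install_id, action) counts; not real data.
legacy = {("id-a", "ip_block"): 2, ("id-b", "ip_block"): 1, ("id-c", "open_hist"): 4}
modern = {("id-a", "ip_block"): 2, ("id-b", "ip_block"): 1, ("id-d", "open_hist"): 3}


def totals_by_action(keys, table):
    """Sum counts per action over the given keys; missing keys count as 0."""
    totals = Counter()
    for key in keys:
        totals[key[1]] += table.get(key, 0)
    return dict(totals)


inner_keys = legacy.keys() & modern.keys()  # pairs an inner JOIN keeps
outer_keys = legacy.keys() | modern.keys()  # pairs a FULL OUTER JOIN keeps

inner_legacy = totals_by_action(inner_keys, legacy)
inner_modern = totals_by_action(inner_keys, modern)
outer_legacy = totals_by_action(outer_keys, legacy)
outer_modern = totals_by_action(outer_keys, modern)
```

With the inner-join keys, both sides report 3 ip_block events and no open_hist events at all, so the totals match. With the outer-join keys, the open_hist totals differ (4 vs. 3), which exposes the pairs that exist on only one side.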

Closing per tech sync discussion. It is expected that the numbers of events submitted to the two systems could vary on the margins for a number of reasons as described in the task description (see Hypotheses).

Thanks for this, @Mholloway. I did notice some anomalies using query [1] for event counts previously. I use that as a first-pass query, but it may need to be refined. Looking at the parity between app_install_ids in both tables is a much more reliable way to assess whether data is missing, so I generally don't rely on results from query [1].

Recording this here for posterity:
As of version 2.7.50359-r-2021-05-13, changes were made to the Android app, and we wanted to see whether they corrected the MEP vs. Legacy anomalies. As requested by @Sharvaniharan, I ran a check on the data we have accumulated since that release. We are still seeing anomalies between MEP and Legacy, although it does look like ip_block is no longer the main action where data is inconsistent.

It's my understanding that we are no longer focusing on solving for these event count/user issues and will be looking at trends/directional analysis to ascertain impact of data differences. The event counts for both of these tables are very low so pursuing that may not be worth the time it takes. Considering this ticket is closed I'm thinking we should use comparisons of new vs old non-mobile app data, following the QA @Mayakp.wiki has been working on.