Task for discussing the integration of the new MEP client for the Android Wikipedia app. Subtasks can be created as needed.
Description
Details
Status | Subtype | Assigned | Task |
---|---|---|---|
Open | | None | T228175 Event Platform Client Libraries |
Resolved | | Sharvaniharan | T228179 Event Platform Client — Android |
Resolved | | Sharvaniharan | T286000 Android Legacy to MEP Instrumentation - MobileWikiAppDailyStats |
Resolved | | SNowick_WMF | T286001 Android Legacy to MEP Data QA - MobileWikiAppDailyStats |
Event Timeline
Assuming this task is about Product-Analytics / Product-Infrastructure-Team-Backlog-Deprecated hence adding project tags so others can find this task under these projects.
@Dbrant Let's use schema MobileWikiAppUserContribution as our first schema to move. I can work with @mpopov and @Mholloway to add description info in Gerrit.
Change 637754 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[schemas/event/secondary@master] Created /analytics/mobile_apps/user_contribution/1.0.0
Note: Doc for this schema is here if anyone needs flow chart, etc. Mikhail added the description for each action (thanks!)
Change 637754 merged by Mholloway:
[schemas/event/secondary@master] Created /analytics/mobile_apps/android_user_contribution_screen
Thanks @Mholloway. Mikhail and I discussed using the descriptions later to make a data directory from all the info in the schema, so putting them in at the beginning is important. If you run into questions in the future, let me know and I can help with descriptions as well.
Change 639284 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/mediawiki-config@master] Add event stream config for android.user_contributions_screen
Change 639284 merged by jenkins-bot:
[operations/mediawiki-config@master] Add event stream config for android.user_contributions_screen
Mentioned in SAL (#wikimedia-operations) [2020-12-01T17:34:54Z] <mholloway-shell@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Add event stream config for android.user_contributions_screen T228179 (duration: 01m 07s)
Superset table comparing Legacy and MEP event counts. Will update when app has been live for enough time to accumulate data.
Found differences between event counts on the MEP and Legacy tables. On further investigation, the most obvious difference is in app_install_id: some users present on Legacy were not showing up at all on MEP. This anomaly was mostly isolated to the ipblock event, which is the event with the highest counts on both tables. (Superset data table)
Engineering has speculated that this may be caused by users quitting the app after receiving the ipblock notice, so that their event data is never sent from the client. It is also possible that the bundling of events we use for sending data to MEP is an issue.
The plan is to change how clients send events when the user closes the app, which @Sharvaniharan will implement for the next available release, after which we will see whether that explains the data loss. If that first test doesn't resolve the data loss, another possibility is to un-bundle the events sent to MEP.
Counts below are app_install_ids that do not appear in the corresponding table, by event:
data | caption_view | caption_view2 | desc_view | desc_view2 | filt_all | filt_caption | filt_desc | filt_tag | ip_block | misc_view | misc_view2 | open_hist | tag_view | tag_view2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
legacy | 14 | 8 | 29 | 28 | 40 | 43 | 33 | 55 | 3212 | 13 | 8 | 114 | 9 | 4 |
modern | 1 | 1 | 1 | 2 | 2 | 53 | 5 |
Following the changes made in version 2.7.50350-r-2021-04-07, we are still seeing the issue where app_install_ids present on Legacy do not show up at all on MEP, along with a smaller number of app_install_ids that appear on MEP but not on Legacy.
An 11-day data sample showed the following:
Event | MEP Uniques | Legacy Uniques |
---|---|---|
ip_block | 4 | 146 |
other events | 4 | 10 |
I assume this is the change in 2.7.50350-r-2021-04-07 that you were referring to in T228179#7011084?
Do you happen to have a SQL Lab link for this query? I'd love to see the full event data.
It's pretty weird that the discrepancies are so skewed toward ip_block events. It seems like that event type should be pretty rare overall in the full data set. (Is it?)
Hi @Mholloway, I use one Presto query on Superset to compare event counts on both tables (Query 1). I also query both tables separately (Queries 2 and 3) for all events and app_install_ids, which I then dedupe by app_install_id using R on my desktop. I can make a notebook and/or send you the code I use to dedupe, lmk.
Query 1:
    WITH user_event_counts_legacy AS (
        SELECT
            event.app_install_id AS app_install_id,
            event.action AS action,
            COUNT(1) AS n_legacy_events
        FROM event.mobilewikiappusercontribution
        WHERE year = 2021 AND month >= 4 AND day >= 2
            AND useragent.wmf_app_version >= '2.7.50350'
        GROUP BY event.app_install_id, event.action
    ),
    user_event_counts_modern AS (
        SELECT
            app_install_id,
            action,
            COUNT(1) AS n_modern_events
        FROM event.android_user_contribution_screen
        WHERE year = 2021 AND month >= 4 AND day >= 2
            AND user_agent_map['wmf_app_version'] >= '2.7.50350'
        GROUP BY app_install_id, action
    ),
    user_event_counts_joined AS (
        SELECT
            app_install_id,
            action,
            n_legacy_events,
            n_modern_events
        FROM user_event_counts_legacy AS legacy
        JOIN user_event_counts_modern AS modern
            USING (app_install_id, action)
    )
    SELECT
        action,
        SUM(n_legacy_events) AS n_total_legacy_events,
        SUM(n_modern_events) AS n_total_modern_events
    FROM user_event_counts_joined
    GROUP BY action
Query 2 - Legacy Data
    SELECT
        DATE(SUBSTRING(dt,1,10)) as date,
        event.app_install_id AS user,
        event.action AS action,
        useragent.wmf_app_version as useragent
    FROM event.mobilewikiappusercontribution
    WHERE year = 2021 AND month >= 4 AND day >= 1
        AND useragent.wmf_app_version >= '2.7.50350'
    GROUP BY DATE(SUBSTRING(dt,1,10)), event.app_install_id, event.action, useragent
Query 3 - MEP data
    SELECT
        SUBSTR(meta.dt, 1, 10) as date,
        app_install_id as user,
        action as action,
        user_agent_map['wmf_app_version'] as useragent
    FROM event.android_user_contribution_screen
    WHERE year = 2021 AND month >= 4 AND day >= 1
        AND user_agent_map['wmf_app_version'] >= '2.7.50350'
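The dedupe-and-compare step that follows Queries 2 and 3 was done in R and is not shown in this thread. As a rough illustration only, here is a minimal Python sketch of one way to count, per action, the app_install_ids that appear in one result set but not the other. The function names and the sample rows are hypothetical and are not the code or data actually used.

```python
from collections import defaultdict


def ids_by_action(rows):
    """Collapse (app_install_id, action) rows into {action: set of ids}.

    Using a set per action dedupes repeated events from the same install.
    """
    grouped = defaultdict(set)
    for app_install_id, action in rows:
        grouped[action].add(app_install_id)
    return grouped


def missing_counts(source_rows, other_rows):
    """Per action, count ids seen in source_rows but absent from other_rows."""
    source = ids_by_action(source_rows)
    other = ids_by_action(other_rows)
    return {
        action: len(ids - other.get(action, set()))
        for action, ids in source.items()
    }


# Made-up rows standing in for the exports of Query 2 (Legacy) and Query 3 (MEP).
legacy_rows = [("a", "ip_block"), ("a", "ip_block"), ("b", "ip_block"), ("c", "open_hist")]
modern_rows = [("a", "ip_block"), ("c", "open_hist"), ("d", "open_hist")]

legacy_only = missing_counts(legacy_rows, modern_rows)  # ids on Legacy, not on MEP
modern_only = missing_counts(modern_rows, legacy_rows)  # ids on MEP, not on Legacy
print(legacy_only)
print(modern_only)
```

Running the comparison in both directions matters here, since the thread reports ids missing from MEP as well as a smaller number missing from Legacy.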
@Mholloway I ran the event count comparison query (Query 1) while I had Superset open. As of 2021-04-21, here are the event counts by table, in order of frequency:
Event | Legacy | Modern |
---|---|---|
ip_block | 7379 | 7156 |
open_hist | 2779 | 2741 |
filt_desc | 1099 | 1093 |
filt_all | 941 | 939 |
filt_caption | 854 | 823 |
desc_view | 736 | 734 |
filt_tag | 662 | 666 |
desc_view2 | 414 | 414 |
caption_view | 263 | 263 |
misc_view | 204 | 202 |
caption_view2 | 138 | 138 |
misc_view2 | 118 | 116 |
tag_view | 84 | 84 |
tag_view2 | 54 | 54 |
paused | 10 | 10 |
disabled | 4 | 4 |
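To put the gaps in this table on a common scale, each difference can be expressed as a fraction of the Legacy count. The following is a minimal sketch using the counts above; the function name is illustrative and this calculation was not part of the original analysis.

```python
def relative_gap(legacy, modern):
    """Difference between Legacy and Modern counts as a fraction of the Legacy count.

    Positive when Legacy has more events, negative when Modern has more.
    """
    return (legacy - modern) / legacy


# (Legacy, Modern) counts copied from the table above.
counts = {
    "ip_block": (7379, 7156),
    "open_hist": (2779, 2741),
    "filt_desc": (1099, 1093),
    "filt_all": (941, 939),
    "filt_caption": (854, 823),
    "desc_view": (736, 734),
    "filt_tag": (662, 666),
    "desc_view2": (414, 414),
    "caption_view": (263, 263),
    "misc_view": (204, 202),
    "caption_view2": (138, 138),
    "misc_view2": (118, 116),
    "tag_view": (84, 84),
    "tag_view2": (54, 54),
    "paused": (10, 10),
    "disabled": (4, 4),
}

gaps = {event: relative_gap(legacy, modern) for event, (legacy, modern) in counts.items()}

# Largest relative gaps first.
for event, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{event:14s} {gap:7.2%}")
```

Note that Query 1 joins the two tables on (app_install_id, action), so these totals only cover pairs present in both tables.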
Thanks for that table, @SNowick_WMF. It's surprising that there are so many ip_block events relative to everything else, but after looking more closely at the code, I see that this event type can be logged without the user contributions screen even being opened, so the numbers seem plausible with that in mind.
There are two questions here in my mind:
- Why are so many ip_block events being issued?
- What accounts for the low-but-persistent level of other discrepancies?
On the first question, see T281182 and T281189. On the second, see T281001. Given that this task is closed, it's probably best if we move all further discussion to these new tickets.