Event Platform Client — Android
Closed, Resolved (Public)

Description

Task for discussing the integration of the new MEP client for the Android Wikipedia app. Subtasks can be created as needed.

Event Timeline

Assuming this task is about Product-Analytics / #Reading-Infrastructure-Team-Backlog, so adding project tags so that others can find this task under these projects.

jlinehan moved this task from Inbox to Epics on the Better Use Of Data board.
jlinehan moved this task from Epics to Inbox on the Better Use Of Data board.
jlinehan moved this task from Inbox to Epics on the Better Use Of Data board.
jlinehan renamed this task from Event Platform Client Library: Android to EPC Implementation: Android. Aug 27 2019, 1:22 PM
jlinehan renamed this task from EPC Implementation: Android to EPC Impl: Android. Aug 27 2019, 1:39 PM
jlinehan renamed this task from EPC Impl: Android to EPC Implementation: Android. Aug 27 2019, 1:43 PM
jlinehan renamed this task from EPC Implementation: Android to EPC Android Implementation.
jlinehan renamed this task from EPC Android Implementation to EPC Android. Aug 27 2019, 1:48 PM
jlinehan raised the priority of this task from Medium to High. Sep 10 2019, 3:39 PM
jlinehan renamed this task from EPC Android to MEP Client Android. Feb 18 2020, 4:07 PM
jlinehan updated the task description. (Show Details)
jlinehan added a subscriber: mpopov.
Mholloway renamed this task from MEP Client Android to Event Platform Client — Android. Oct 5 2020, 8:48 PM
Mholloway claimed this task.
Mholloway moved this task from Task Backlog to Doing on the Product-Data-Infrastructure board.

@Dbrant Let's use schema MobileWikiAppUserContribution as our first schema to move. I can work with @mpopov and @Mholloway to add description info in Gerrit.

Change 637754 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[schemas/event/secondary@master] Created /analytics/mobile_apps/user_contribution/1.0.0

https://gerrit.wikimedia.org/r/637754

Note: Doc for this schema is here if anyone needs flow chart, etc. Mikhail added the description for each action (thanks!)

Mikhail added the description for each action (thanks!)

All credit for that goes to @Mholloway :) thanks, Michael!

Change 637754 merged by Mholloway:
[schemas/event/secondary@master] Created /analytics/mobile_apps/android_user_contribution_screen

https://gerrit.wikimedia.org/r/637754

Mikhail added the description for each action (thanks!)

All credit for that goes to @Mholloway :) thanks, Michael!

Thanks @Mholloway. Mikhail and I discussed using the descriptions later to make a data directory from all the info in the schema, so putting them in at the beginning is important. If you run into questions in the future, let me know and I can help with descriptions as well.

Change 639284 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/mediawiki-config@master] Add event stream config for android.user_contributions_screen

https://gerrit.wikimedia.org/r/639284

Change 639284 merged by jenkins-bot:
[operations/mediawiki-config@master] Add event stream config for android.user_contributions_screen

https://gerrit.wikimedia.org/r/639284

Mentioned in SAL (#wikimedia-operations) [2020-12-01T17:34:54Z] <mholloway-shell@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Add event stream config for android.user_contributions_screen T228179 (duration: 01m 07s)

Superset table comparing Legacy and MEP event counts. Will update when app has been live for enough time to accumulate data.

Found differences between event counts on the MEP and Legacy tables. On further investigation, the most obvious difference is in app_install_id: some users present on Legacy do not show up on MEP at all. This anomaly was mostly isolated to the ip_block event, which is the event with the highest counts on both tables. (Superset data table)

Engineering has speculated that this may be caused by users quitting the app after receiving the ip_block notice, so that their event data is never sent from the client. It is also possible that the bundling of events we use when sending data to MEP is an issue.

The plan is to change how clients send events on user app close, which @Sharvaniharan will implement for next available release, after which we will see if that explains data loss. Another possibility is to un-bundle events sent to MEP if we find the first test doesn't resolve the data loss.

Counts below are app_install_ids that do not appear in the corresponding table, by event:

data | caption_view | caption_view2 | desc_view | desc_view2 | filt_all | filt_caption | filt_desc | filt_tag | ip_block | misc_view | misc_view2 | open_hist | tag_view | tag_view2
legacy148292840433355321213811494
modern11122535
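
The per-event comparison described above (app_install_ids present in one table but missing from the other) can be sketched as a set difference per action. This is only an illustration: the function names, the `legacy_only` / `modern_only` labels, and the sample rows are made up, and the input is assumed to be (app_install_id, action) pairs pulled from each table.

```python
from collections import defaultdict


def ids_by_action(rows):
    """Group (app_install_id, action) rows into {action: set of ids}."""
    grouped = defaultdict(set)
    for app_install_id, action in rows:
        grouped[action].add(app_install_id)
    return grouped


def missing_counts(legacy_rows, modern_rows):
    """For each action, count ids seen in one table but absent from the other."""
    legacy = ids_by_action(legacy_rows)
    modern = ids_by_action(modern_rows)
    result = {}
    for action in sorted(set(legacy) | set(modern)):
        legacy_ids = legacy.get(action, set())
        modern_ids = modern.get(action, set())
        result[action] = {
            "legacy_only": len(legacy_ids - modern_ids),
            "modern_only": len(modern_ids - legacy_ids),
        }
    return result


# Made-up sample rows, for illustration only (not real data).
legacy_rows = [("a", "ip_block"), ("b", "ip_block"), ("a", "open_hist")]
modern_rows = [("a", "ip_block"), ("a", "open_hist"), ("c", "open_hist")]
print(missing_counts(legacy_rows, modern_rows))
```

Working on sets rather than raw event counts means a user who fired the same event many times is counted once, which matches a comparison by app_install_id.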

After the changes made in version 2.7.50350-r-2021-04-07, we are still seeing app_install_ids that are present on Legacy but do not show up at all on MEP, along with a smaller number of app_install_ids that appear on MEP but not on Legacy.

An 11 day data sample showed the following:

EventMEP UniquesLegacy Uniques
ip_block4146
other events410

The plan is to change how clients send events on user app close, which @Sharvaniharan will implement for next available release, after which we will see if that explains data loss.

I assume this is the change in 2.7.50350-r-2021-04-07 that you were referring to in T228179#7011084?

An 11 day data sample showed the following:

EventMEP UniquesLegacy Uniques
ip_block4146
other events410

Do you happen to have a SQL Lab link for this query? I'd love to see the full event data.

It's pretty weird that the discrepancies are so skewed toward ip_block events. It seems like that event type should be pretty rare overall in the full data set. (Is it?)

Hi @Mholloway, I use one Presto query on Superset to compare event counts on both tables (1). I also query both tables separately (2, 3) for all events and app_install_ids, which I then dedupe by app_install_id using R on my desktop (I can make a notebook and/or send you the code I use to dedupe, lmk).

Query 1:

WITH user_event_counts_legacy AS (
    SELECT
        event.app_install_id AS app_install_id,
        event.action AS action,
        COUNT(1) AS n_legacy_events
    FROM event.mobilewikiappusercontribution
    WHERE year = 2021 AND month >= 4 AND day >= 2
        AND useragent.wmf_app_version >= '2.7.50350'
    GROUP BY event.app_install_id, event.action
), user_event_counts_modern AS (
    SELECT
        app_install_id,
        action,
        COUNT(1) AS n_modern_events
    FROM event.android_user_contribution_screen
    WHERE year = 2021 AND month >= 4 AND day >= 2
        AND user_agent_map['wmf_app_version'] >= '2.7.50350'
    GROUP BY app_install_id, action
), user_event_counts_joined AS (
    SELECT
        app_install_id, action, n_legacy_events, n_modern_events
    FROM user_event_counts_legacy AS legacy
    JOIN user_event_counts_modern AS modern
        USING (app_install_id, action)
)
SELECT
    action,
    SUM(n_legacy_events) AS n_total_legacy_events,
    SUM(n_modern_events) AS n_total_modern_events
FROM user_event_counts_joined
GROUP BY action

Query 2 - Legacy Data

SELECT
    DATE(SUBSTRING(dt,1,10)) AS date,
    event.app_install_id AS user,
    event.action AS action,
    useragent.wmf_app_version AS useragent
FROM event.mobilewikiappusercontribution
WHERE year = 2021 AND month >= 4 AND day >= 1
    AND useragent.wmf_app_version >= '2.7.50350'
GROUP BY DATE(SUBSTRING(dt,1,10)), event.app_install_id, event.action, useragent

Query 3 - MEP data

SELECT
    SUBSTR(meta.dt, 1, 10) AS date,
    app_install_id AS user,
    action AS action,
    user_agent_map['wmf_app_version'] AS useragent
FROM event.android_user_contribution_screen
WHERE year = 2021 AND month >= 4 AND day >= 1
    AND user_agent_map['wmf_app_version'] >= '2.7.50350'
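
The dedupe step mentioned above is done in R and that code is not shown here. As a rough equivalent, here is a minimal Python sketch, assuming the results of Query 2 or Query 3 have been exported as CSV with the column aliases used in those queries (date, user, action, useragent). The function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def unique_users_per_action(csv_text):
    """Count distinct `user` values per `action`, ignoring repeated rows."""
    users = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Adding to a set drops repeats of the same (action, user) pair.
        users[row["action"]].add(row["user"])
    return {action: len(ids) for action, ids in users.items()}


# Made-up export, for illustration only (not real data).
sample_export = """date,user,action,useragent
2021-04-02,id-1,ip_block,2.7.50350
2021-04-03,id-1,ip_block,2.7.50350
2021-04-03,id-2,ip_block,2.7.50350
2021-04-03,id-2,open_hist,2.7.50350
"""
print(unique_users_per_action(sample_export))
```

Running the same function over the Legacy and MEP exports gives unique-user counts per event that can be compared side by side.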

@Mholloway I ran the event count comparison query (Query 1) while I had Superset open. As of 2021-04-21, here are the event counts by table, in order of frequency:

Event         | Legacy | Modern
ip_block      | 7379   | 7156
open_hist     | 2779   | 2741
filt_desc     | 1099   | 1093
filt_all      | 941    | 939
filt_caption  | 854    | 823
desc_view     | 736    | 734
filt_tag      | 662    | 666
desc_view2    | 414    | 414
caption_view  | 263    | 263
misc_view     | 204    | 202
caption_view2 | 138    | 138
misc_view2    | 118    | 116
tag_view      | 84     | 84
tag_view2     | 54     | 54
paused        | 10     | 10
disabled      | 4      | 4
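
To put the gaps in the table above on a common scale, the difference between Legacy and Modern can be expressed as a percentage of the Legacy count. The sketch below is illustrative only: the counts are my reading of the table (each row taken as a Legacy value followed by a Modern value), and the `discrepancy` helper is a hypothetical name.

```python
# (legacy, modern) event counts, as read from the table above.
counts = {
    "ip_block": (7379, 7156),
    "open_hist": (2779, 2741),
    "filt_desc": (1099, 1093),
    "filt_all": (941, 939),
    "filt_caption": (854, 823),
    "desc_view": (736, 734),
    "filt_tag": (662, 666),
    "desc_view2": (414, 414),
    "caption_view": (263, 263),
    "misc_view": (204, 202),
    "caption_view2": (138, 138),
    "misc_view2": (118, 116),
    "tag_view": (84, 84),
    "tag_view2": (54, 54),
    "paused": (10, 10),
    "disabled": (4, 4),
}


def discrepancy(legacy, modern):
    """Return (absolute difference, difference as a percent of legacy)."""
    diff = legacy - modern
    return diff, 100.0 * diff / legacy


for event, (legacy, modern) in counts.items():
    diff, pct = discrepancy(legacy, modern)
    print(f"{event:<14}{legacy:>6}{modern:>6}{diff:>6}{pct:>8.2f}%")
```

A negative difference (as for filt_tag) means the Modern table recorded more events than Legacy for that action.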

Thanks for that table, @SNowick_WMF. It's surprising that there are so many ip_block events relative to everything else, but after looking more closely at the code, I see that this event type can be logged without the user contributions screen even being opened, so the numbers seem plausible with that in mind.

There are two questions here in my mind:

  1. Why are so many ip_block events being issued?
  2. What accounts for the low-but-persistent level of other discrepancies?

On the first question, see T281182 and T281189. On the second, see T281001. Given that this task is closed, it's probably best if we move all further discussion to these new tickets.