Initial basic report for mobile language selector entry point instrumentation
Closed, Resolved, Public


As the initial entry points for Section Translation are implemented (T286645) and instrumented (T301222), we want to check the initial results we get from it.
Since it is early for a more elaborate visualization (to be done when more entry points are created and instrumented), this ticket proposes to create a simple query and report to show: The language selector entry point events grouped by month and wiki, next to the sections published in each case.

In this way, we can get a sense of the users that the entry point brings and the result in terms of published sections over time.
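A minimal sketch of the proposed aggregation, in Python. The row layout (date, wiki, kind) and the sample values are hypothetical placeholders, not real data and not the actual content_translation_event field names.

```python
from collections import Counter

# Hypothetical rows: (ISO date, wiki, kind). Illustrative placeholders only,
# not real data and not the actual content_translation_event field names.
SAMPLE_ROWS = [
    ("2022-08-03", "bnwiki", "entry_point"),
    ("2022-08-17", "bnwiki", "entry_point"),
    ("2022-08-17", "bnwiki", "published"),
    ("2022-09-02", "fawiki", "entry_point"),
]


def monthly_report(rows):
    """Return (month, wiki, n_entry_events, n_published) tuples, sorted."""
    entries, published = Counter(), Counter()
    for date, wiki, kind in rows:
        key = (date[:7], wiki)  # "YYYY-MM" prefix of the ISO date
        if kind == "entry_point":
            entries[key] += 1
        elif kind == "published":
            published[key] += 1
    keys = sorted(set(entries) | set(published))
    return [(m, w, entries[(m, w)], published[(m, w)]) for m, w in keys]


for row in monthly_report(SAMPLE_ROWS):
    print(*row, sep="\t")
```

Keeping entry events and published sections in separate counters keyed by the same (month, wiki) pair lets the two numbers sit side by side in each report row, even when one of them is zero.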

New Entry Points for Review:

Entry point name | Event source
newbytranslation (To confirm) | invite_new_article_creation

Result: Published report

Event Timeline

Pginer-WMF created this task.

This analysis will start following confirmation of data in T295756.

Quick update:
The content_translation_event schema was updated to include the new event_source values instrumented in T301222. Once those updates are deployed, we should start to see events logged in content_translation_event and I can begin reviewing aggregate data to determine the basic usage of each entry point. Note: I'd recommend waiting a week or two from deployment to obtain a sufficient number of events for review.

Currently, we have only logged the following entry point events in August 2022:


Data via:

COUNT(*) AS n_events
YEAR = 2022 

It was surprising to see the access method being "desktop" for all these entry points.
Should we double-check that the events from mobile are captured correctly?

Hi @ngkountas I took a look into why the new section entry points are not yet being logged in the content_translation_event schema and identified a few possible issues to be resolved:

  • The instrument needs to be updated to send events to schema version 1.2.0 (The version number was updated in T287403). We're currently seeing validation errors because the events are being sent to schema 1.0.0 which does not include the new event_source names.
  • Per @Pginer-WMF's comment above, it looks like we are logging all of the new section entry points from mobile incorrectly as access_method: desktop. These should be logged as access_method: mobile web.
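As a rough illustration of why events sent against the older schema version fail validation, here is a small Python sketch. The enum contents are assumptions: only the five event_source names come from this task, while "legacy_source", the split between versions, and the helper name `validation_errors` are hypothetical and do not reflect the real schema files.

```python
# Hypothetical enum contents. Only the five new event_source names are taken
# from this task; "legacy_source" and the per-version split are assumptions.
ALLOWED_EVENT_SOURCES = {
    "1.0.0": {"legacy_source"},
    "1.2.0": {
        "legacy_source",
        "content_language_selector",
        "direct",
        "direct_preselect",
        "frequent_languages",
        "invite_new_article_creation",
    },
}
ALLOWED_ACCESS_METHODS = {"desktop", "mobile web"}


def validation_errors(event, schema_version):
    """Return a list of problems; an empty list means the event is valid."""
    errors = []
    source = event.get("event_source")
    if source not in ALLOWED_EVENT_SOURCES[schema_version]:
        errors.append(f"event_source {source!r} not allowed in schema {schema_version}")
    method = event.get("access_method")
    if method not in ALLOWED_ACCESS_METHODS:
        errors.append(f"access_method {method!r} is not a known value")
    return errors


event = {"event_source": "frequent_languages", "access_method": "mobile web"}
print(validation_errors(event, "1.0.0"))  # one error: source unknown to 1.0.0
print(validation_errors(event, "1.2.0"))  # no errors
```

Note that a check like this cannot catch the second issue: access_method: desktop is a valid value, so mislabelled mobile events pass validation and have to be spotted in the aggregate data instead.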

Here's a screenshot of a validation error we're receiving for one of the event_source: frequent_languages events:

[Image: content_translation_event_validation_error.png]

Let me know if there are any other details or info that would be helpful

Change 829188 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] SX Instrumentation: Fix event logging schema version

Change 829190 had a related patch set uploaded (by Nik Gkountas; author: Nik Gkountas):

[mediawiki/extensions/ContentTranslation@master] SX Instrumentation: Fix event logging access_method

@MNeisler thank you very much for your input! Both issues are fixed in the above (currently in-review) patches. @Pginer-WMF currently only SX logs events for the content_translation_event schema. However, as Megan also noted, there was a bug inside the SX codebase, which caused all those events to be logged with the access_method set to desktop, while in fact it should be mobile web.

Change 829190 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] SX Instrumentation: Fix event logging access_method

Change 829188 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] SX Instrumentation: Fix event logging schema version

Change 829549 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20220905

Change 829549 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] CX3 Build 0.2.0+20220905

Thanks @ngkountas for submitting those fixes! I rechecked the aggregate data in content_translation_event and confirmed that we are now logging events for the new SX entry points and they are all now correctly logged with access_method = 'mobile web'.

Below are the events logged for today (9 September 2022)

event_type | event_source | access_method | n_events
dashboard_open | content_language_selector | mobile web | 115
dashboard_open | direct | mobile web | 66
dashboard_open | direct_preselect | mobile web | 4
dashboard_open | frequent_languages | mobile web | 377
dashboard_open | invite_new_article_creation | mobile web | 10

The frequent_languages entry point currently seems to have the most events logged today.
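To put the single-day counts in proportion, the short Python sketch below converts them to percentage shares. The counts are copied from the 9 September 2022 table above; the helper name `shares` is only illustrative.

```python
# Counts copied from the 9 September 2022 table above (event_source -> events).
EVENTS_BY_SOURCE = {
    "content_language_selector": 115,
    "direct": 66,
    "direct_preselect": 4,
    "frequent_languages": 377,
    "invite_new_article_creation": 10,
}


def shares(counts):
    """Return each source's percentage of the total number of events."""
    total = sum(counts.values())
    return {source: 100 * n / total for source, n in counts.items()}


# Print sources from most to least used, with one decimal place.
for source, pct in sorted(shares(EVENTS_BY_SOURCE).items(), key=lambda kv: -kv[1]):
    print(f"{source}: {pct:.1f}%")
```

On this one day, frequent_languages accounts for roughly two thirds of the 572 logged events.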

I'm going to wait about a week to collect some more data for review and then will plan to explore the aggregate data in more detail and provide an initial report on trends identified for each entry point.

I created a ticket to instrument the follow-up invite: T317995: Instrument follow-up invite shown after publishing
This can provide an idea of how many of the translations are part of a sequence for the user.

@Pginer-WMF I completed a more detailed QA and review of the aggregate data collected on the newly instrumented SX entry points from 9 September 2022 (when we were logging all events as expected) to 20 September 2022. Please see the initial report and summary of findings below.

Issues identified in QA

(cc @ngkountas)

  • We are currently missing the following fields with these events (all logged as NULL), which I expected to be included with any dashboard_open events based on the current schema. Can these be added to the instrument?
  • We are currently not logging any section translation entry events by unregistered users (user_is_anonymous = TRUE). Is that expected?

Section Translation Entry Point Usage Trends

  • The frequent_languages entry point has been the most used entry point (68% of all instrumented section entry point clicks from 09 September through 20 September 2022).

[Image: sx_entry_events_bytype.png]

  • There have been no significant increases or decreases in daily usage of any of the entry points over the reviewed two-week time period.
  • The majority of entry point events (68.4%) were completed by new editors logged as having 0 cumulative edits. Note: This may be related to the incorrect tagging of anon editors (see QA note above). This trend also holds when broken down by event source type, except for the direct_preselect entry point, where more entry events were completed by experienced users with over 1,000 edits.

[Image: sx_entry_event_byuserexp.png]

  • We logged events at 70 distinct Wikipedias. Note: Wiki DB in this analysis refers to the database code for the wiki which the user is currently interacting with (and which is shown in the current URL). This is less meaningful than usual because Content Translation presents a single, cross-wiki interface to the user. The target and source language fields would be more useful once added. See QA note above.
  • Persian Wikipedia (fawiki) had the most logged section translation entry point events (17.5%), followed by Bengali (bnwiki) and Turkish (trwiki) Wikipedias.

[Image: sx_entry_events_bywiki.png]

  • The frequent_languages entry point is the most used at each Wikipedia, except for Tswana (tnwiki), where no frequent_languages events were logged.
  • Bengali Wikipedia (bnwiki) had the highest usage of the direct_preselect entry point (11.1% compared to under 5% at other reviewed Wikipedias).

This analysis is currently limited to the usage of the entry points. Further instrumentation of the content_translation_event schema including all editor and publish events would help us further understand the user's interactions with section and content translation from start to finish.

Thanks a lot for this analysis @MNeisler. This is super informative and has already helped to surface ideas on what to investigate next. Regarding the report just a couple of minor considerations for adjustment:

  • Events by Session shows how many sections are translated in a given session and numbers are organized in different buckets. Currently the initial bucket is for 1-5 translations and I was wondering if it makes sense to break it into smaller buckets since these seem to cover different scenarios. A user making just one translation represents a different behaviour than making a longer sequence. I'm not sure which is the right division but maybe something like 1, 2-3, 4-5, or 1, 2-5 could be more meaningful.
  • In addition to capturing the events per session, it would be great to show them per user, to get a sense of the percentage of recurring users accessing the tool vs. those accessing it just once.
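The finer split suggested in the first point could be sketched as follows in Python. The bucket boundaries follow the "1, 2-3, 4-5" proposal; the final "6+" bucket and the function name `bucket` are assumptions added so that every count has a label.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper bounds for the proposed "1, 2-3, 4-5" split. The trailing
# "6+" label is an assumption so that larger counts still land somewhere.
UPPER_BOUNDS = [1, 3, 5]
LABELS = ["1", "2-3", "4-5", "6+"]


def bucket(n_events):
    """Map a per-session event count (>= 1) to its bucket label."""
    if n_events < 1:
        raise ValueError("a session has at least one event")
    return LABELS[bisect_left(UPPER_BOUNDS, n_events)]


# Hypothetical per-session counts, for illustration only.
sample = [1, 1, 2, 3, 5, 8]
print(Counter(bucket(n) for n in sample))
```

Trying the alternative "1, 2-5" split only requires changing the two lists, which makes it cheap to compare divisions before settling on one.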

Regarding the question of anonymous users, Nik can confirm, but I think entry points are only shown for logged-in editors. Thus, it is expected not to get anonymous users. This makes it more unexpected to get a high volume of logged-in editors without a previous edit (and more interesting to check if they end up publishing a translation).

I created some follow-up tickets as next steps to expand our learnings in this area (and correct/check different aspects):

@Pginer-WMF - Thanks for the review and for creating the follow-up tickets!
I've made the adjustments you identified in T295757#8272043. The updates can be found in the report and are also summarized below. Please let me know if you have any questions.

Update Number of Events by Session chart to show smaller buckets
Based on the distribution of data, I decided to remove the event grouping and instead show the frequency for each number of events (1, 2, 3, 4, 5, etc.) in a session. There is one 'over 20 events' bucket that includes a limited number of outlier instances.

[Image: sx_entry_events_bysession (4).png]

About 40% of sessions include 1 entry event click and 18% of sessions include 2 clicks. The percent of sessions continues to decrease as the number of events in a session increases.
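A small Python sketch of this per-count distribution with a single overflow bucket. The input list is hypothetical sample data, and `session_distribution` is an illustrative name rather than the code behind the report.

```python
from collections import Counter

CAP = 20  # counts above this fall into the single "over 20" bucket


def session_distribution(events_per_session):
    """Return the percentage of sessions for each event count, capped at CAP."""
    labels = [str(n) if n <= CAP else "over 20" for n in events_per_session]
    counts = Counter(labels)
    total = len(labels)
    return {label: 100 * c / total for label, c in counts.items()}


# Hypothetical per-session event counts, for illustration only.
sample = [1, 1, 2, 3, 25]
for label, pct in session_distribution(sample).items():
    print(f"{label}: {pct:.0f}% of sessions")
```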

Added Number of Distinct Users by Session chart
Note: I recommend looking at sessions per user instead of events. I think this provides more insight into how many times the user is accessing the tool since it is possible for a user to make many clicks (events) within a single session. Happy to provide data on events per user as well if that would be helpful.

[Image: sx_entry_sessions_byusers (1).png]

The majority of users (86.4%) have had just 1 recorded content translation session during the reviewed time period (9 Sept 2022 - 20 Sept 2022). This is a short time frame, and we will likely see more recurring users after we collect data for a longer period of time.
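For reference, counting sessions per user rather than events per user can be sketched like this in Python. The (user, session) pairs are hypothetical sample data, and the identifiers are placeholders rather than actual schema field names.

```python
from collections import defaultdict


def sessions_per_user(events):
    """Count distinct sessions per user from (user_id, session_id) pairs."""
    sessions = defaultdict(set)
    for user, session in events:
        sessions[user].add(session)  # repeated clicks in one session collapse
    return {user: len(s) for user, s in sessions.items()}


def single_session_share(events):
    """Percentage of users with exactly one recorded session."""
    per_user = sessions_per_user(events)
    single = sum(1 for n in per_user.values() if n == 1)
    return 100 * single / len(per_user)


# Hypothetical events, one pair per logged click; not real data.
sample = [("u1", "s1"), ("u1", "s1"), ("u2", "s2"),
          ("u2", "s3"), ("u3", "s4"), ("u4", "s5")]
print(sessions_per_user(sample))
print(single_session_share(sample))
```

Collecting session IDs into a set per user is what keeps a user with many clicks in one session from being counted as a recurring user.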

The update is perfect. Thanks a lot!