
Measure the impact of MinT for Wiki Readers MVP
Open, Medium, Public

Assigned To
None
Authored By
Pginer-WMF
Sep 3 2024, 11:21 AM

Description

As part of the work on MinT for Wiki Readers MVP (T359072), we want to learn about its impact.
This ticket proposes to analyze the available data to identify signs that show both its usefulness to users and any potential problematic consequences based on the data we capture from the 23 pilot Wikipedias.

An initial analysis was produced after the high peak of activity that occurred when the entry point at the article footer (T363338) was made available (it was later disabled to avoid the servers going over capacity). We want to make sure that the article footer entry point has been restored before completing the analysis in the current ticket.

As part of this ticket, we may want to explore some relevant questions:

  • User interest. How many users access machine-translated articles, and what percentage of their wiki's mobile web traffic do they represent (our initial target was to reach 3%)?
  • Negative effect on other reading activities. Does reading translations result in reading fewer articles on the wiki?
  • Negative effect on other editing activities. Does the availability of translations result in less editing activity to create articles on the local wiki?
  • Funnel overview. What is their usage pattern? Do they spend time reading several sections of the article? Do they navigate to other articles or translations? How much time do users spend on a machine-translated article, and how does this compare to the time spent on other articles?
  • Content discovery. What is the difference in coverage between the translation consumed and the local article available (e.g., MT may be used more on articles that exist but are shorter than 5 paragraphs, compared to missing articles or longer ones)? Do people select different source languages based on the content available in them? Which kinds of topics do users read as machine translations, what coverage does the wiki provide for those, and how do they compare to non-translated articles?
  • Appeal to translators. Is this a feature only useful for readers, or is it also encouraging the creation of translations? How is the availability of an MT option affecting the mobile translation funnel (e.g., is it reducing the number of people who access Content Translation and leave without translating because they were actually looking for MT to read)?

More context about the annual plan target:

This area of work is in support of the Wikimedia Foundation Product & Technology Objective and Key Results (OKRs) of the 2023-24 period. Although the MVP was launched at the end of such period, the idea was to leave time for users to use the feature before assessing the impact. In particular, the target was defined for the key result WE2.2 in the following hypothesis:

Scaling Open Translation service will increase page interactions from underserved communities
Exposing our machine translation service to support the automatic translation of Wikipedia article contents. Make the service available for 10 languages not supported by commercial vendors such as Google Translate. This will increase page interactions from underserved communities by making millions of Wikipedia articles more accessible with automatic translation.
Success will be measured after 6 months by a 3% increase in page interactions on the pilot wikis driven by the views to automatic translations of content.


This ticket is focused on measurements. Related tickets enable direct user input:

Event Timeline

Pginer-WMF triaged this task as Medium priority.
Pginer-WMF moved this task from Backlog to Product integration on the MinT board.
Pginer-WMF added a subscriber: PWaigi-WMF.

Please help me to understand the measurement plans here. Using a real example will help us gain clarity.

  • The Irak article on ff.wikipedia.org is a single-line article of 127 bytes
  • It is the most viewed article on ff.wikipedia.org, with 6,483 pageviews in 1/1/2023 - 11/30/2024 (23 months)
  • It receives an average of 9 pageviews (user, not automatic) per day
  • It has 0 edits and 0 editors for the period of 1/1/2023 - 11/30/2024 (23 months)

image.png (777×1 px, 431 KB)

Suppose we add entry points for MinT for Readers on the Irak page. The entry point is a footer link to access the machine-translated version.
Can you explain what we are measuring here?

  • User interest. How many users access machine-translated articles, and what percentage of their wiki's mobile web traffic do they represent (our initial target was to reach 3%)?
  • Will there be an increase in the number of page views to this page, indicating people are accessing the one-line page so that they can read the English (for example) version with more content by clicking on the footer link? For unknown reasons, the 'Irak' article's viewership is declining from last year (16 page views to 9 page views per day).
  • If so, please provide some example numbers indicating the user interest

Negative effect on other reading activities. Does reading translations result in reading fewer articles on the wiki?

Since the 'Irak' article is the most-read article on ff.wikipedia.org, would a decrease in its average page views indicate a negative effect? I understand that a single article cannot be used as a proxy for the entire wiki. So let us look at the pageviews for the entire wiki:

image.png (639×1 px, 405 KB)

We see that there is an average of 2,209 page views per day on the wiki for the last 23 months. It shows the readership is increasing on a small scale, but with fluctuations. Since there can be many reasons for ups and downs in this graph, we need to find a way to attribute the ups/downs that are related to the availability of synthetic content. That sounds hard to measure. What would be a practical and sensible way to measure it?

Negative effect on other editing activities. Does the availability of translations result in less editing activity to create articles on the local wiki?

Currently there are 0 edits or 0 editors for Irak article in ff.wikipedia.org for past 23 months. Anything other than 0 would be very nice.

There are around 2k monthly edits on ff.wikipedia.org these days, which is greater than last year.

image.png (662×1 px, 68 KB)

There is an increased activity this year for creating new articles:

image.png (662×1 px, 70 KB)

Interestingly, all those new articles were created using translations with CX, with high-machine-translation tags. Most of the edit activity is contributed by a user who created 2,065 new articles in the first 70 days of joining that wiki (30 articles per day), and another user with 1,400+ edits in a similar time span.

Since these unusual edit activities make up most of the edit activity on ff.wikipedia.org, what is our expectation on the increase or decrease in editor activity?

Funnel overview. What is their usage pattern? Do they spend time reading several sections of the article? Do they navigate to other articles or translations?

  • If users land on the Special:AutomaticTranslation page but leave without clicking the explicit CTA to read the translation, would that count as 'non-interest'?
  • If users land on the Special:AutomaticTranslation page, access the lead section translation, and do not expand any other sections, what would be the interpretation?
  • If the user expands all sections on the Special:AutomaticTranslation page, I guess the time spent on the page clearly indicates interest.
  • How do we measure navigating to other articles from Special:AutomaticTranslation? Do we have instrumentation for that?
  • Since there is no edit button on the MinT for Wiki Readers page, we won't know if readers try to edit or correct the page. The edit icon at the top of the page might look like a call to action, but it is not a button.

image.png (108×451 px, 10 KB)

  • However, if users try to change the language on the page, there are options inviting them to edit. That seems like something to fix? I don't think people will often change the language and then notice the call for contributions.

image.png (72×1 px, 2 KB)

image.png (398×496 px, 21 KB)

Content discovery. Do people select different source languages based on their content available?

That is an interesting thing to measure. I hope we have instrumentation for this. However, the language selector is very minimal (with bugs) and has no indication of which language to choose to read more machine-translated content.

There is also the question of whether people choose different articles now that they discover the tool.
The "Random Topic" button on the page is a dummy button, though; it has no action.

image.png (474×512 px, 25 KB)

Appeal to translators. Is this a feature only useful for readers, or is it also encouraging the creation of translations?

As noted above, there is unusual translation activity on ff.wikipedia.org without the presence of the MinT for Wiki Readers entry point. I wonder how to reconcile that interest (although unusual) with the introduction of the new entry point to MinT for Readers.
For readers of machine translation to become editors - please note the above comments regarding the edit options being under the language selector.

The regions where these languages are spoken also need to be considered to get the full picture. Nigeria, for example, has primary education in English and a literacy rate of 68%. How much of this literate, internet-accessing population, whose primary language is English, would like to use the ff, ig, or ki languages to read encyclopedic content is very much the question here. The very low traffic to these wikis underlines this issue.

The Iraq article on ki.wikipedia.org has 2 page views per day on average this year. While Iraq was the top-visited article in 2023, this year the article about "Wikipedia" seems to be the top one, with 28 page views per day.
ki.wikipedia.org has fewer than 10 new articles created per month, with about 10 editors per month.

image.png (749×1 px, 477 KB)

It will be challenging to measure any ups or downs in activity on very low-activity wikis like this. These wikis are, unfortunately, basically inactive, and there are valid reasons for that considering the demographics of their communities.

Please help me to understand the measurement plans here.

The idea of the experiment is to cover a broad set of wikis and to get the data based on the activity on all articles. For example, the situation for a particular article on Fula Wikipedia may be very different from the situation for another article on Korean Wikipedia. In the initial enablement, data shows that Korean was getting over 1,000 times more user sessions requesting MT than Fula (4K vs. just 3). Persian got more than 5X the requests of Korean (30K vs. 4K). So demand is expected to diverge a lot by wiki, but looking at the overall numbers, as opposed to specific articles, will provide a clearer understanding of those differences. By including a broad range of wikis in the experiment, we can get a broader view of how useful MT is (or is not) in different contexts.

This ticket is intended to capture which aspects we want to measure to identify the potential positive and negative impact of making MT available for readers on different wikis. The description is an initial brainstorm of aspects we may want to measure to capture potential positive and negative effects; if there are new ones to consider, feel free to add them. I'll defer to the analytics expert, @KCVelaga_WMF, to turn those into measurable signals and to answer the specific questions.

@santhosh @Pginer-WMF Sorry for the delay in responding here. There is also a more detailed measurement plan, which I can revise based on the discussion here, and we can all agree on that before the experiment.

While it is hard to conclude how we would interpret something with only one article, it does help us with grounding, and we should think of how we can interpret such a situation.

Can you explain what are we measuring here?
Will there be an increase in the number of page views to this page?

Not really, we won't be measuring that at the page level. There might be a secondary effect in the long term (like users coming to the page because they want to read the additional machine-translated content), which might be hard to measure within the scope of the experiment. The pageview count of the page itself will likely not increase, because once the user clicks to view the machine-translated article, they are redirected away from that page, and the view is no longer associated with it. At that point, we want to measure the user's intention to read more (and whether the feature is successful in meeting that need), and decide at what point we can consider that a page view (maybe something like a virtual page view, if not a direct one).

There will be some noise: when a new button is enabled, users might have a tendency to simply click to see what it is, rather than actually reading the article, and it will be important to differentiate this. We can think of ways to filter out accidental clicks, such as very fast clicks/transitions between steps, but there are likely better ways of doing that, including dwell time. Another approach is, instead of filtering out noise, to define what we want to capture as a signal. This can be something like: the user opens Special:AutomaticTranslation, proceeds to view the translation, spends at least 5 seconds, and maybe expands a section or two -- only then do we consider that the user has really read the translated content, and count it as a pageview (probably virtual, or some other kind). These are not direct pageviews, and not even virtual ones (which are used for Wikipedia preview fetches and similar things), so we can consider a better name.
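The signal definition above (open, view the translation, dwell at least 5 seconds, expand a section) could be sketched as a simple session classifier. This is a minimal illustration with hypothetical event names and fields, not the actual instrumentation schema:

```python
# Minimal sketch of a "genuine read" classifier. Event names ("open",
# "view_translation", "expand_section") and the 5-second threshold are
# illustrative assumptions, not the real MinT event schema.

MIN_DWELL_SECONDS = 5  # assumption: below this we treat the visit as an accidental click

def is_genuine_read(session_events):
    """Return True if a Special:AutomaticTranslation session looks like a real read.

    `session_events` is a list of dicts like {"action": "open", "ts": 0},
    with `ts` in seconds from session start.
    """
    actions = {e["action"] for e in session_events}
    if "view_translation" not in actions:
        return False  # the user never clicked the CTA to read the translation
    timestamps = [e["ts"] for e in session_events]
    dwell = max(timestamps) - min(timestamps)
    expanded = "expand_section" in actions
    return dwell >= MIN_DWELL_SECONDS and expanded

# Example: open -> view -> expand a section over 9 seconds counts as a read
events = [
    {"action": "open", "ts": 0},
    {"action": "view_translation", "ts": 2},
    {"action": "expand_section", "ts": 9},
]
print(is_genuine_read(events))  # True
```

A session that bounces immediately (for example, just an "open" event) would be classified as noise under this definition.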

From here, there are several types of views we can think of (listed in the measurement plan doc):

  • Primary referred pageviews: views generated by clicking the option to view human-generated content (we can use a provenance parameter) (primary-pv)
  • Secondary referred views:
    • Virtual page previews from the primary referred view
    • Link clicks (navigation to other articles) from the primary referred view
    • User clicks to view another machine translation (within MinT for Readers; which, if I am not wrong, happens when there is no base article present and the user clicks on an interwiki link).

To summarize, we have (taking the above example):

  1. Views to the machine translated content (of Irak)
    1. differentiate actual signal (intention to read) from accidental clicks
  2. Views to other articles (referred from the machine translated content)
    1. for example, Irak to Aasiya
  3. Virtual pageview to an inter-wiki link present in Aasiya (say Duungal)
  4. Clicks to view Duungal (direct); a pageview to Duungal
  5. At 1A, if there is no human-translated article present for Aasiya, I am assuming a machine-translated version would be shown.
    1. This will also bring the user back to the 1A step in the user funnel.

So based on that we can now define what is success for the experiment.

The original hypothesis is

3% increase in page interactions (including direct views and views to automatic translations of content) for pilot Wikipedias experimenting machine translation for the first time.

From the points above, we can consider page interactions due to MinT for Readers to be 1 + 2 + 5.
The primary metric for evaluation will be: (page interactions due to MinT for Readers + total mobile user pageviews of the site) / total mobile user pageviews of the site.
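As a worked example of the primary metric, with made-up numbers rather than real data:

```python
def page_interaction_uplift(mint_interactions, total_mobile_pageviews):
    """Relative page interactions once MinT-driven views are included.

    mint_interactions: views of types 1 + 2 + 5 from the funnel above
    total_mobile_pageviews: total mobile user pageviews of the site
    """
    return (mint_interactions + total_mobile_pageviews) / total_mobile_pageviews

# Illustrative numbers (not real data): 3,300 MinT interactions on top of
# 100,000 mobile pageviews gives a 3.3% uplift, which would meet the 3% target.
uplift = page_interaction_uplift(3_300, 100_000)
print(f"{uplift - 1:.1%}")  # 3.3%
```

The target from the WE2.2 hypothesis corresponds to this ratio reaching at least 1.03.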

Since the 'Irak' article is the most-read article on ff.wikipedia.org, would a decrease in its average page views indicate a negative effect? I understand that a single article cannot be used as a proxy for the entire wiki. So let us look at the pageviews for the entire wiki:

True, but we won't be looking at single articles. I am assuming that on pilot wikis, the footer entry point will be enabled on all pages. A decline in the pageviews of one article shouldn't really affect the overall calculation, since the calculation is a percentage of the total mobile user pageviews of the site. If mobile user pageviews of the site decrease after the feature is enabled, then that is a concern, and we should investigate further whether it is caused by the feature. In the measurement plan, we have defined some guardrail metrics, some of which are mentioned as potential negative effects in the ticket description. We'll evaluate the primary metric, but also ensure that the guardrails are not violated.

Since there can be many reasons for ups and downs in this graph, we need to find a way to attribute ups/downs that are related to availability of synthetic content. That sounds hard to measure. What would be a practical and sensible way to measure it?

Agreed. Points 3 & 4 from the types of views mentioned above can be used to attribute the ups. We can use previous years' data (provided it is accurate) to remove seasonality effects. The downs are still hard to attribute, especially because the effect takes a long time to materialize. I think a practical way is to define a good set of guardrail metrics and investigate further if they are violated.
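The year-over-year baseline idea can be sketched as follows, using illustrative monthly pageview numbers (not real data): each month after the launch is compared to the same month in the previous year, so seasonal patterns common to both years cancel out.

```python
# Sketch of a seasonality-adjusted comparison using the previous year's data
# as a baseline. The monthly numbers below are illustrative, not real data.

def yoy_adjusted_change(current, baseline):
    """Month-by-month relative change vs. the same month last year."""
    return [(c - b) / b for c, b in zip(current, baseline)]

pageviews_after_launch = [68_000, 70_000, 75_000]  # months after the feature launch
pageviews_prev_year    = [66_000, 67_000, 69_000]  # same months, previous year

for month, change in zip(["Sep", "Oct", "Nov"],
                         yoy_adjusted_change(pageviews_after_launch, pageviews_prev_year)):
    print(month, f"{change:+.1%}")
```

A consistently positive adjusted change after the launch is weak evidence of an uplift; a negative one would trigger the guardrail investigation described above.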

Since these unusual edit activities make up most of the edit activity on ff.wikipedia.org, what is our expectation on the increase or decrease in editor activity?

From my understanding, the effect on editing activities is a guardrail, rather than an indicator of success. We want to ensure that the synthetic content doesn't negatively impact the usual editing activity levels. In the example mentioned, if one highly active user stops editing, it might appear as a negative effect. However, we can reduce this noise: for example, by observing the distribution of edits across users, we can see whether the effect is caused by a few users or is a larger trend. I will expand below on how we can contribute to increases.
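One simple way to check whether a few users dominate the edit distribution is to look at the share of edits contributed by the top-k editors. A sketch, using approximate numbers from the ff.wikipedia.org example above (the usernames are placeholders):

```python
from collections import Counter

def top_editor_share(edit_log, k=2):
    """Fraction of all edits made by the k most active editors.

    `edit_log` is a list with one username per edit.
    """
    counts = Counter(edit_log)
    total = sum(counts.values())
    top = sum(n for _, n in counts.most_common(k))
    return top / total

# Approximate ff.wikipedia.org numbers from the discussion: one editor with
# 2,065 article creations, another with 1,400+ edits, plus a small long tail.
edit_log = ["UserA"] * 2065 + ["UserB"] * 1400 + ["UserC"] * 300 + ["UserD"] * 100
print(f"{top_editor_share(edit_log):.0%}")  # 90%
```

If the top two editors account for ~90% of edits, a drop in the wiki-level edit count most likely reflects a change in those individuals' behavior rather than a community-wide effect of the feature, which is exactly the noise the guardrail analysis should control for.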

Funnel overview questions

If users land on the Special:AutomaticTranslation page but leave without clicking the explicit CTA to read the translation, would that count as 'non-interest'?
If users land on the Special:AutomaticTranslation page, access the lead section translation, and do not expand any other sections, what would be the interpretation?
If the user expands all sections on the Special:AutomaticTranslation page, I guess the time spent on the page clearly indicates interest.

The first one is mostly a no, the second is a maybe, and the third is a yes. As I mentioned above, we have to define what the team considers signal (a true intention to read) vs. noise (accidental clicks, curiosity about a new feature).

How do we measure navigating to other articles from Special:AutomaticTranslation? Do we have instrumentation for that?

For direct pageviews, we can use the provenance parameter I mentioned above. The instrumentation is still pending for this.

Since there is no edit button on the MinT for Wiki Readers page, we won't know if readers try to edit or correct the page. The edit icon at the top of the page might look like a call to action, but it is not a button.
However, if users try to change the language on the page, there are options inviting them to edit. That seems like something to fix? I don't think people will often change the language and then notice the call for contributions.

Yes, ideally this has to be active and instrumented. In the MinT for Readers instrumentation spec, I initially proposed that events be connected with the ServerSideAccountCreation schema and the CX event schema. If a user clicks to edit and they are registered, they log in and go to CX/SX (I assume) to review, edit, and publish the translation. We can use a new event_source for CX events to track this. If it is an unregistered user, the account creation step comes in.

That is an interesting thing to measure. I hope we have instrumentation for this. However, the language selector is very minimal (with bugs) and has no indication of which language to choose to read more machine-translated content.

We do have instrumentation to track language changes in the MinT for Readers feature (for both source and target languages). But as already mentioned, I can't think of a good way to identify why users are changing the source language. One approach could be for the instrumentation to prompt the user for feedback at some point after they change the source.

Additionally, there is also an option for users to give feedback; I am not sure if the button is active or not. We have planned instrumentation for that, to capture qualitative feedback from users via the action_context of the schema. It would be great to have this working before the experiment, as it adds more nuance to the quantitative measurements.


I hope that answers the questions asked. Let me know if I missed anything. Based on the feedback, I can revise the measurement plan specific to the experiment, and we can use that as a reference.

Another aspect that we may want to consider is the response time for machine translation. Based on input from @KCVelaga I created a ticket (T385784) to consider avoiding long response times or at least measuring them to be able to separate the results based on this factor to identify its influence.


As part of the explorations related to T385784, a ticket has been created for the instrumentation of the response time: T386362: Measure response time of machine translation

KCVelaga_WMF reopened this task as Open.
KCVelaga_WMF removed KCVelaga_WMF as the assignee of this task.
KCVelaga_WMF added subscribers: OSefu-WMF, cchen.