Measure the impact of externally-originated contributions
Open, Normal, Public

Description

When a page is delivered through an external translation service we want to provide paths for users to still be able to contribute (T212300), and we want to have a clear understanding about how these affect users and the content they produce.

In order to support this, the following aspects will be measured:

Access to contribution

  • Reading to contribution funnel. We want to measure how many people move through the workflow we are providing from reading to contribution: access the translated page → access the contribution options page → access the local/original article to contribute → complete a contribution. Capturing both the number of users at each stage and the percentage that move on or drop off will give a good idea of how users move through the process.
  • Comparison with local workflows. To better understand the above, it would be useful to compare these numbers with the standard contribution workflow on regular articles: in particular, what percentage of readers access the edit action, and what percentage of those make a contribution. This will show whether users coming from an external automatic translation are more or less likely to try to contribute and to complete the contribution.

We may want this analysis both per wiki and aggregated across all wikis.
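As a rough illustration of the funnel metrics described above, the sketch below turns per-stage user counts into conversion and drop-off percentages. The stage names and the input shape are assumptions made for the example, not names used by the actual instrumentation.

```python
# Hypothetical stage labels for the reading-to-contribution funnel;
# the real instrumentation may name these differently.
FUNNEL_STAGES = [
    "translated_page",
    "contribution_options",
    "article_to_contribute",
    "contribution_completed",
]


def funnel_summary(stage_counts):
    """Return one row per stage with user count, conversion and drop-off.

    stage_counts maps a stage label to the number of users who reached it.
    Conversion and drop-off are relative to the previous stage; the first
    stage has no previous stage, so both are None there.
    """
    rows = []
    previous = None
    for stage in FUNNEL_STAGES:
        users = stage_counts.get(stage, 0)
        if previous is None:
            conversion = None
        elif previous == 0:
            conversion = 0.0  # nobody reached the previous stage
        else:
            conversion = users / previous
        rows.append({
            "stage": stage,
            "users": users,
            "conversion_from_previous": conversion,
            "drop_off_from_previous": (
                None if conversion is None else 1.0 - conversion
            ),
        })
        previous = users
    return rows
```

The same function could be run once per wiki and once on summed counts to get the per-wiki and aggregated views.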

Content produced

  • Content created. How many edits and pages were created as a result of people coming from an externally translated page. This gives an idea of the volume of content generated when coming from an external automatic translation. A revision tag (T209132) is available to identify content created in this way.
    • Comparison with local wiki. Comparing the above numbers with the overall number of pages/edits created in the local wiki will show what percentage of total contributions originate from an external automatic translation.
  • Content survival. Checking whether contributions that originated from an external automatic translation have been reverted gives an idea of the quality of those contributions.
    • Comparison with local wiki. Comparing the above numbers with the usual revert/deletion rates for the local wiki will show whether users coming from an external automatic translation are more or less likely to meet community quality standards with their contributions.

We may want this analysis both per wiki and aggregated across all wikis.
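To make the "content produced" comparisons concrete, here is a minimal sketch that computes the share of externally originated edits and compares revert rates. It assumes each edit has already been reduced to a record with two boolean keys, `external` (the edit carries the revision tag) and `reverted`; both key names are hypothetical, and how reverts are detected is left out.

```python
def rate(part, whole):
    """Return part / whole, or None when the denominator is zero."""
    return part / whole if whole else None


def summarize_contributions(edits):
    """Summarize a list of edit records.

    Each record is a dict with boolean keys 'external' (tagged as coming
    from an external automatic translation) and 'reverted'. Both key names
    are assumptions made for this sketch.
    """
    external = [e for e in edits if e["external"]]
    other = [e for e in edits if not e["external"]]
    return {
        "external_edits": len(external),
        "external_share": rate(len(external), len(edits)),
        "external_revert_rate": rate(
            sum(e["reverted"] for e in external), len(external)
        ),
        "other_revert_rate": rate(
            sum(e["reverted"] for e in other), len(other)
        ),
    }
```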

Restricted Application added a subscriber: Aklapper. Dec 20 2018, 11:08 AM
Pginer-WMF triaged this task as Normal priority. Dec 20 2018, 11:08 AM
atgo added subscribers: pau, DFoy. Dec 20 2018, 7:53 PM

@pau this looks great. I can't for the moment think of anything that we should add. Once we get feedback from @DFoy about the specific community concerns we may want to add something, but for now looks good.

Should we tag product analytics in?

@kzimmerman adding you here for visibility. We're still getting the task finalized, but this is how things are shaping up for Toledo

cc/ @dr0ptp4kt

Thanks @atgo ! Tagging with product analytics so we can start incorporating this into our planning in the new year.

atgo added a comment. Jan 3 2019, 7:32 PM

@pau @dr0ptp4kt want to make sure that readership of translated results is also tracked and reported. I think that's represented in the "contribution funnel" piece, but want to call it out and am making some minor adjustments accordingly.

atgo updated the task description. Jan 3 2019, 7:33 PM

> @pau @dr0ptp4kt want to make sure that readership of translated results is also tracked and reported. I think that's represented in the "contribution funnel" piece, but want to call it out and am making some minor adjustments accordingly.

Great. Making it more explicit makes sense, and the updated description looks good. Thanks!

Tbayer added a subscriber: Tbayer. Jan 8 2019, 3:36 AM

> @pau @dr0ptp4kt want to make sure that readership of translated results is also tracked and reported. I think that's represented in the "contribution funnel" piece, but want to call it out and am making some minor adjustments accordingly.

On that matter, we should consider making use of the already existing "virtual pageview" framework to track readership of these translated results. This would have various benefits, e.g. making stats for this new way of reading Wikipedia content directly comparable with our data for normal pageviews, in various dimensions.

@Tbayer what do you have in mind? Heads up, T208795 captures the first concrete case where the full transcoding indeed goes all the way through the Wikimedia servers and stuff is already counted as a pageview but there's an X-Analytics key-value made available for query purposes.

Tbayer added a comment. Edited Jan 9 2019, 1:24 AM

> @Tbayer what do you have in mind? Heads up, T208795 captures the first concrete case where the full transcoding indeed goes all the way through the Wikimedia servers and stuff is already counted as a pageview but there's an X-Analytics key-value made available for query purposes.

I see, thanks! Having that X-Analytics tag in the webrequest data is great, but that still leaves open the question how these particular requests should be processed and tallied. It seems that they are currently recorded as regular pageviews, without any possibility (after the data has been aggregated in the pageview_hourly table and the source webrequest data has expired) to distinguish these Google-translated views from normal pageviews where the page is read in the original language. We should discuss whether that's really what we want from a product analytics perspective. Again, an alternative proposal would be to register (and aggregate) them as a virtual pageview instead, using the existing Hive table - i.e. a new way of reading our content (just like page previews was). Or we could add a field to the pageview_hourly table distinguishing translated from regular views. Happy to follow up elsewhere on the details and tradeoffs.

Claiming for now until I have identified an analyst to work on the details. I will attend meetings/discussions and track in the meantime

kzimmerman moved this task from Triage to Backlog on the Product-Analytics board. Jan 10 2019, 9:07 PM

Change 483681 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/extensions/ExternalGuidance@master] Add analytics trackers

https://gerrit.wikimedia.org/r/483681

In https://gerrit.wikimedia.org/r/483681 the following 4 events are emitted as counters, with the keys given below:

Event key | Context
MediaWiki.ExternalGuidance.init.serviceName.fromLang.toLang | Emitted when the external context is detected.
MediaWiki.ExternalGuidance.specialpage.serviceName.fromLang.toLang | Emitted when the special page is visited from the contribute link.
MediaWiki.ExternalGuidance.createpage.serviceName.fromLang.toLang | Emitted when the create page button is clicked on the special page.
MediaWiki.ExternalGuidance.mtinfo.serviceName.fromLang.toLang | Emitted when the service information overlay is accessed.
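For anyone consuming these counters later, the key layout can be sketched as a small build/parse helper. This is only an illustration of the `MediaWiki.ExternalGuidance.<event>.<serviceName>.<fromLang>.<toLang>` pattern listed above; the helper names are hypothetical, and it assumes none of the segments contains a dot.

```python
from typing import NamedTuple

PREFIX = "MediaWiki.ExternalGuidance"


class CounterKey(NamedTuple):
    event: str      # e.g. init, specialpage, createpage, mtinfo
    service: str    # name of the external translation service
    from_lang: str  # language of the original article
    to_lang: str    # language the article was translated to


def build_counter_key(event, service, from_lang, to_lang):
    """Join the segments into a counter key (hypothetical helper)."""
    return ".".join([PREFIX, event, service, from_lang, to_lang])


def parse_counter_key(key):
    """Split a counter key back into its segments.

    Assumes no segment contains a dot; raises ValueError otherwise.
    """
    parts = key.split(".")
    if len(parts) != 6 or ".".join(parts[:2]) != PREFIX:
        raise ValueError("unexpected counter key: %r" % key)
    return CounterKey(*parts[2:])
```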

Change 483681 merged by jenkins-bot:
[mediawiki/extensions/ExternalGuidance@master] Add analytics trackers

https://gerrit.wikimedia.org/r/483681

atgo added a comment. Jan 18 2019, 9:43 PM

Hey y'all. @chelsyx is going to help us on the analysis side. She's just getting up to speed and may have some changes, but this looks good at a first pass.

Hi @santhosh , I have a couple of questions about the counters in T212414#4872124:

1. I want to double-check whether my understanding of the keys is correct:

  • MediaWiki.ExternalGuidance.init.serviceName.fromLang.toLang is emitted when a page is requested by the translation service, correct?
  • MediaWiki.ExternalGuidance.mtinfo.serviceName.fromLang.toLang is emitted when the user clicks on the "Automatic translation" button (T212329), correct?
  • When MediaWiki.ExternalGuidance.createpage.serviceName.fromLang.toLang is emitted, how do we distinguish whether the user clicks to contribute in the local language or in the original language?

2. Where are we going to save these events? In the public mediawiki databases?

> • MediaWiki.ExternalGuidance.init.serviceName.fromLang.toLang is emitted when a page is requested by the translation service, correct?

No. It is emitted when our code detects that the page is presented to a user by an external service (also known as external context detection). At this point we do our banner injection. If this event is emitted, it means a user saw a page from the fromLang Wikipedia translated to toLang in an external context like Google Translate.

> • MediaWiki.ExternalGuidance.mtinfo.serviceName.fromLang.toLang is emitted when the user clicks on the "Automatic translation" button (T212329), correct?

Yes.

> • When MediaWiki.ExternalGuidance.createpage.serviceName.fromLang.toLang is emitted, how do we distinguish whether the user clicks to contribute in the local language or in the original language?

I have not added any event for 'contributing to original language'. That is a good catch. Will add one for that.

> 2. Where are we going to save these events? In the public mediawiki databases?

All these events are special events since they are keyed with the 'counter' prefix. They go to https://wikitech.wikimedia.org/wiki/Graphite and can be monitored and analysed using dashboards and graphs at https://grafana.wikimedia.org. I have not set up a dashboard for these counters yet, but I will create one once we get events.

Change 488351 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/extensions/ExternalGuidance@master] Add a tracker event for editing the original source article

https://gerrit.wikimedia.org/r/488351

chelsyx added a comment. Edited Wed, Feb 6, 8:16 PM

> • When MediaWiki.ExternalGuidance.createpage.serviceName.fromLang.toLang is emitted, how do we distinguish whether the user clicks to contribute in the local language or in the original language?
>
> I have not added any event for 'contributing to original language'. That is a good catch. Will add one for that.

Thanks @santhosh !

> 2. Where are we going to save these events? In the public mediawiki databases?
>
> All these events are special events since they are keyed with the 'counter' prefix. They go to https://wikitech.wikimedia.org/wiki/Graphite and can be monitored and analysed using dashboards and graphs at https://grafana.wikimedia.org. I have not set up a dashboard for these counters yet, but I will create one once we get events.

To my understanding, Graphite doesn't support aggregation (e.g. by month) on the dashboard. And if I want to export data from the dashboard to compare with local workflows, the output is in JSON, which is not very convenient for analysis and reporting. Can we send the events to a relational database, either directly or after transforming the JSON?
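For reference, the "transform the JSON" option could look roughly like the sketch below. It assumes the export has the shape returned by Graphite's render API with `format=json` (a list of series, each with a `target` and a list of `[value, timestamp]` datapoints) and that the values are per-interval counts, so that summing them per month is meaningful. The function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def monthly_totals(series):
    """Flatten Graphite-style JSON into (target, month, total) rows.

    series is a list of {"target": str, "datapoints": [[value, ts], ...]}
    where ts is a Unix timestamp in seconds and value may be None.
    """
    totals = defaultdict(float)
    for entry in series:
        target = entry["target"]
        for value, timestamp in entry["datapoints"]:
            if value is None:
                continue  # Graphite uses null for intervals without data
            month = datetime.fromtimestamp(
                timestamp, tz=timezone.utc
            ).strftime("%Y-%m")
            totals[(target, month)] += value
    return [
        {"target": target, "month": month, "total": total}
        for (target, month), total in sorted(totals.items())
    ]
```

Rows in this shape can be written to CSV or inserted into a relational table for comparison with other sources.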

> To my understanding, Graphite doesn't support aggregation (e.g. by month) on the dashboard.

Aggregation is possible for any time period. In fact, you have a wide set of analytical operators available to plot in graphs, such as sum, rate, mean, median etc.
Some examples where aggregation and fancy charting are used: https://grafana.wikimedia.org/d/000000290/wikidata-query-service-ui?refresh=1m&orgId=1, https://grafana.wikimedia.org/d/000000593/service-cxserver?refresh=5m&orgId=1 and https://grafana.wikimedia.org/d/000000598/content-translation?orgId=1 You can also export data for any time duration. Eventlogging data is by default available for the last 90 days, but Grafana retains data for more than a year, in my experience.

Here is an example that shows aggregation per month (you can change it to any time duration):
https://grafana.wikimedia.org/d/000000593/service-cxserver?refresh=5m&orgId=1&panelId=7&fullscreen&from=1549477800000&to=1549513810698

@santhosh Thanks for the links!

The executives are asking for some type of dashboard so that they can access the following metrics in the same place:

  • Readership: The number of page views from the Google translation service, compared with normal page views (T208795)
  • Contributions: Metrics about 1) access to contribution and 2) content produced (described in this ticket)

This means we need to pipe data from multiple sources to the same place: pageviews from hive table, eventlogging (EditAttemptStep table), mediawiki tables (revision table, change_tag table, etc), and the events you generate for this task. Additionally, to compute revert rate of edits, we need to use the mwreverts python package (or a complex query) to pre-process the mediawiki data. In the future, we might need to aggregate the numbers by platform, users' geolocation, etc, which requires the user agent information.

To my understanding, building the "multiple sources -> statsd -> Graphite" pipe is not a trivial effort, if it is possible at all; we can consult with analytics engineering. I think eventlogging is more flexible and can accommodate these needs. In fact, we can whitelist the eventlogging table if needed, so that the data won't be purged after 90 days. Or we can just aggregate the data (so PII is removed) and keep the aggregated data forever.

Hello A-team! We are asked to build a dashboard and need to pipe data from multiple sources to the same place: pageviews, eventlogging, mediawiki tables--see T212414#4937357 and the ticket description for more details. Can you offer some suggestions regarding the data pipeline and the dashboarding tools?

Nuria added a subscriber: Nuria. Mon, Feb 11, 5:03 PM

Seems like there are several issues here. From the requests, it is not clear to us that you actually have the data right now to implement a pipeline, correct? It seems this ticket still needs instrumentation work? Not super clear on that, but provided that you have all the data you need, it seems you need a spark job that munches the data and aggregates it in a way that is usable for this purpose; once that is done, you could expose it via superset by loading that data into druid or mysql. Not sure whether graphite/statsd has a place here. It seems some prior work persisted events to graphite, but for multi-dimensional data graphite is really not the best approach. We can talk in more detail in a meeting if needed.

Nuria added a comment. Edited Mon, Feb 11, 5:10 PM

> Again, an alternative proposal would be to register (and aggregate) them as a virtual pageview instead, using the existing Hive table - i.e. a new way of reading our content (just like page previews was). Or we could add a field to the pageview_hourly table distinguishing translated from regular views.

If a custom event is needed, one can be emitted similar to (or augmenting) https://meta.wikimedia.org/wiki/Schema:VirtualPageView
As we discussed (extensively) with popups, we will not be modifying pageview_hourly; rather, events can be sent with the data of interest that we need to keep track of.

Change 489966 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/extensions/ExternalGuidance@master] Eventlogging integration

https://gerrit.wikimedia.org/r/489966

@santhosh Thanks so much for the event logging! https://meta.wikimedia.org/w/index.php?title=Schema:ExternalGuidance&oldid=18870706

I think it would be helpful if we can add the following information to the schema. What do you think?

  • I think it would be helpful if we could distinguish whether the user is creating a new page or editing an existing page on the local wiki. To do that, we can 1) add a field named "type" with enum [new, existing] for all events (if possible), or 2) add a new action named "edit_existing".
  • Can we add a session_token (mw.user.sessionId() would be great) so that we can join this table with https://meta.wikimedia.org/wiki/Schema:EditAttemptStep ?
  • Analytics engineering suggests that we use snake_case instead of camelCase for field names, because sql/hive is case insensitive.
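To show why the session token matters, here is a minimal sketch of the join it would enable: counting sessions that entered a contribution path from the external context and then saved an edit. The record shape (dicts with `session_token` and `action` keys), the entry action names, and the `saveSuccess` action used for a completed edit are assumptions for the example and should be checked against the actual schemas.

```python
def contribution_completion(
    guidance_events,
    edit_events,
    entry_actions=("createpage", "edit-original"),
    success_action="saveSuccess",
):
    """Join two event lists on session_token and report completion.

    guidance_events: ExternalGuidance-style records (assumed shape).
    edit_events: EditAttemptStep-style records (assumed shape).
    """
    entered = {
        e["session_token"]
        for e in guidance_events
        if e["action"] in entry_actions
    }
    saved = {
        e["session_token"]
        for e in edit_events
        if e["action"] == success_action
    }
    completed = entered & saved  # sessions present in both sets
    return {
        "sessions_entered": len(entered),
        "sessions_completed": len(completed),
        "completion_rate": (
            len(completed) / len(entered) if entered else None
        ),
    }
```

In practice this would more likely be a SQL/Hive join on the token column; the set intersection here just makes the logic explicit.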

> @santhosh Thanks so much for the event logging! https://meta.wikimedia.org/w/index.php?title=Schema:ExternalGuidance&oldid=18870706
>
> I think it would be helpful if we can add the following information to the schema. What do you think?
>
> • I think it would be helpful if we could distinguish whether the user is creating a new page or editing an existing page on the local wiki. To do that, we can 1) add a field named "type" with enum [new, existing] for all events (if possible), or 2) add a new action named "edit_existing".

"createpage" and "edit-original" value for action field should already cover them with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ExternalGuidance/+/488351

> • Can we add a session_token (mw.user.sessionId() would be great) so that we can join this table with https://meta.wikimedia.org/wiki/Schema:EditAttemptStep ?
> • Analytics engineering suggests that we use snake_case instead of camelCase for field names, because sql/hive is case insensitive.

Done. See https://meta.wikimedia.org/w/index.php?title=Schema:ExternalGuidance&oldid=18870832. I will point the code to this revision of the schema.

> @santhosh Thanks so much for the event logging! https://meta.wikimedia.org/w/index.php?title=Schema:ExternalGuidance&oldid=18870706
>
> I think it would be helpful if we can add the following information to the schema. What do you think?
>
> • I think it would be helpful if we could distinguish whether the user is creating a new page or editing an existing page on the local wiki. To do that, we can 1) add a field named "type" with enum [new, existing] for all events (if possible), or 2) add a new action named "edit_existing".
>
> "createpage" and "edit-original" value for action field should already cover them with https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ExternalGuidance/+/488351

"createpage" and "edit-original" distinguish contributing in the local language from contributing in the original language. What I was trying to add is, for the local wiki, to distinguish creating a new page when no page with the same title exists from expanding the existing article when a page with the same title exists, i.e. page 1 vs 2, or page 3 vs 4 in this design doc https://drive.google.com/file/d/1ua7fNGZM2n66Cr7VxOG1mNG4_B2DvfRe/view

Change 489966 merged by Santhosh:
[mediawiki/extensions/ExternalGuidance@master] Eventlogging integration

https://gerrit.wikimedia.org/r/489966

Change 488351 merged by jenkins-bot:
[mediawiki/extensions/ExternalGuidance@master] Add a tracker event for editing the original source article

https://gerrit.wikimedia.org/r/488351

Change 490281 had a related patch set uploaded (by Santhosh; owner: Santhosh):
[mediawiki/extensions/ExternalGuidance@master] Eventlogging: Add new action name for editing an existing page

https://gerrit.wikimedia.org/r/490281

> What I was trying to add is, for the local wiki, to distinguish creating a new page when no page with the same title exists from expanding the existing article when a page with the same title exists

Got it. So I added editpage as the action value if the page exists; 'createpage' is used when the page does not exist. Patch https://gerrit.wikimedia.org/r/488351 Schema change:
https://meta.wikimedia.org/w/index.php?title=Schema%3AExternalGuidance&type=revision&diff=18873656&oldid=18870832

> Got it. So I added editpage as the action value if the page exists; 'createpage' is used when the page does not exist. Patch https://gerrit.wikimedia.org/r/488351 Schema change:
> https://meta.wikimedia.org/w/index.php?title=Schema%3AExternalGuidance&type=revision&diff=18873656&oldid=18870832

Awesome! Thank you!

Nuria added a comment. Wed, Feb 13, 7:26 PM

Summarizing the discussion we had in the meeting last Monday: there are two types of insights that @Pginer-WMF is asking for in this ticket. Some of them are exploratory ("comparing content created on a wiki versus content created on a wiki via coming in through an external translation service"); these would benefit from ad-hoc, one-off exploration of the data per wiki, per language, per language pair, etc. Others are oriented to really "measure" the population we are dealing with, that is, our userbase when it comes to this feature ("We want to measure how many people move through the workflow we are providing from reading to contribution: access the translated page → access the contribution options page → access the local/original article to contribute → complete a contribution.").

Our recommendation is to first measure the population we are dealing with; once we have an estimate of the users of the feature and of funnel usage, we can compare outcomes of this workflow versus other contribution workflows.

Change 490281 merged by jenkins-bot:
[mediawiki/extensions/ExternalGuidance@master] Eventlogging: Add new action name for editing an existing page

https://gerrit.wikimedia.org/r/490281

atgo reassigned this task from kzimmerman to chelsyx. Fri, Feb 15, 12:51 AM

Moving to @chelsyx per @kzimmerman request.

@dr0ptp4kt and I were chatting earlier and realized that we have a gap in the analysis that we'd like to include: impact to search traffic. What is the impact of this project on search?

chelsyx moved this task from Backlog to Doing on the Product-Analytics board. Fri, Feb 15, 12:59 AM

@chelsyx I still don't see a table for ExternalGuidance in db1108 log schema. As per documentation that table should get automatically created. Any idea why that is not happening?