
Measure 2018-19 annual plan metrics for the Language team
Closed, Resolved · Public

Description

Time constraint: This should be finished by 30 September, so that the metrics are available as Content Translation v2 is made available to users and as the team prepares for its June–September quarterly check-in.

Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". Aug 28 2018, 6:26 PM
nshahquinn-wmf added a comment (edited). Sep 24 2018, 11:29 PM

@Pginer-WMF, @Arrbee, T194641 and T194647 require us to have "pre-publication" data about when users started, continued, and published their translations. As far as I can tell, the only source for this data is ContentTranslation events. However, that data is currently purged after 90 days.

The CX MediaWiki tables contain some of this data (start date, last update, current status), but it's not enough to fully calculate these metrics. For example, they don't log publishing failures. Also, it might show that a user started a translation in July and last updated it in September, but it doesn't show whether they also used CX in August.

Am I missing something or do we need to protect the event tables from purging? I can take care of requesting that from Analytics.

In T199342#4613098, @Neil_P._Quinn_WMF wrote:

@Pginer-WMF, @Arrbee, T194641 and T194647 require us to have "pre-publication" data about when users started, continued, and published their translations. As far as I can tell, the only source for this data is ContentTranslation events.

I'm pretty sure that you can also get most of this data from the cx_translations table in the wikishared database.

However, that data is currently purged after 90 days.

90 days is the default, which can be changed. We can also run a daily script that copies the necessary data to another database. This is more or less the direction that I'm taking with https://phabricator.wikimedia.org/T189475#4581765 .
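The daily copy job described above could look something like the following sketch. All table and column names here (events, uuid, timestamp, payload) are invented stand-ins rather than the real EventLogging schema, and SQLite stands in for the MariaDB hosts:

```python
import sqlite3

# In-memory SQLite databases stand in for the source (purged after 90 days)
# and destination (long-term) databases.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")

src.execute("CREATE TABLE events (uuid TEXT PRIMARY KEY, timestamp TEXT, payload TEXT)")
dst.execute("CREATE TABLE events (uuid TEXT PRIMARY KEY, timestamp TEXT, payload TEXT)")
src.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("a", "20180901000000", "started"), ("b", "20180902000000", "published")],
)

# Copy rows to the destination; INSERT OR IGNORE (keyed on the uuid primary
# key) makes the job idempotent, so it is safe to re-run daily.
rows = src.execute("SELECT uuid, timestamp, payload FROM events").fetchall()
dst.executemany("INSERT OR IGNORE INTO events VALUES (?, ?, ?)", rows)
dst.commit()

copied = dst.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(copied)  # → 2
```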

The CX MediaWiki tables contain some of this data (start date, last update, current status), but it's not enough to fully calculate these metrics. For example, they don't log publishing failures.

Indeed. To check for failures, we'll need to make another query to https://meta.wikimedia.org/wiki/Schema:ContentTranslationError .
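Once the error events are available, combining them with publish attempts is straightforward. A minimal sketch with invented in-memory data (the event shapes below are illustrative stand-ins, not the actual ContentTranslationError schema fields):

```python
# Hypothetical publish attempts and error events; in practice these would
# come from the ContentTranslation and ContentTranslationError streams.
publish_attempts = [
    {"token": "UserA", "translation_id": 1},
    {"token": "UserB", "translation_id": 2},
    {"token": "UserC", "translation_id": 3},
]
errors = [
    {"token": "UserB", "translation_id": 2},  # UserB's publish failed
]

# A publish attempt succeeded if no matching error event exists.
failed_ids = {e["translation_id"] for e in errors}
succeeded = [a for a in publish_attempts if a["translation_id"] not in failed_ids]
success_rate = len(succeeded) / len(publish_attempts)
print(f"{success_rate:.0%}")  # → 67%
```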

Also, it might show that a user started a translation in July and last updated it in September, but it doesn't show whether they also used CX in August.

Yeah... I don't think that we log opening the translation just to work on it, but @santhosh should confirm.

Am I missing something or do we need to protect the event tables from purging? I can take care of requesting that from Analytics.

I should add that I planned to refresh the way in which Schema:ContentTranslationError works. See T204273. So if you're also using this data, then we definitely need to coordinate.

Nuria added a subscriber: Nuria. Feb 27 2019, 9:26 PM

I can take care of requesting that from Analytics.

Please submit a code patch to the eventlogging whitelist. Please remember that we do not keep data about users and the pages they visited beyond 90 days. We do keep data about users and the pages they edited permanently, as all edits are public.

kzimmerman added subscribers: chelsyx, kzimmerman.

@Pginer-WMF @Arrbee I'm assigning this to Chelsy to wrap this up for the year. Can you verify the Annual Plan metrics needed and get @chelsyx up to speed?

Is there a list of the tables used by the team? And how can I find their schemas? So far, I have only found these links:

@Pginer-WMF @Arrbee I'm assigning this to Chelsy to wrap this up for the year. Can you verify the Annual Plan metrics needed and get @chelsyx up to speed?

These are the metrics proposed in order of priority:

  1. Surviving translations (T194650)
  2. Retention of translators (T195949)
  3. Success rate for translators (T194647)
  4. Number of translators (T194641)

If it is only possible to complete two, the first two are preferred. We are also quite flexible regarding the definitions; if there is some complexity in these, we are OK with a simplified version that provides a rough idea.

Is there a list of the tables used by the team? And how can I find their schemas?

@Amire80, can you provide more details about this?

Chelsy, Amir has created several queries and reports in this area that may be useful to extract some relevant info meanwhile. But, he can provide much more details about the tables available.

Nuria added a comment. Jun 27 2019, 3:04 PM

@chelsyx: ping us if you need help. When dealing with eventlogging data, it is easy to import it into Druid so that it is available in Superset. There are some reports for translations built on eventlogging MySQL data; we should probably move those to pull data from Hadoop.

@Pginer-WMF please let us know if you have access to superset http://superset.wikimedia.org

Nuria added a comment. Jun 27 2019, 3:07 PM

Also, please keep in mind that the DB replicas are also accessible from Superset.

@Pginer-WMF please let us know if you have access to superset http://superset.wikimedia.org

Yes. I have access to superset.

@chelsyx thanks for taking this on! I have some work in progress on notebook1003 in /home/neilpquinn-wmf/proj/2018-19-Language-annual-plan-metrics/Language-metrics.ipynb.

Here's an HTML version:

I can put the notebook on GitHub too if that would be helpful.

A brain dump (feel free to ask more questions!):

  • Data on users' translations (whether in-progress or published) is stored in the wikishared.cx_translations table on x1. The translation_started_by and translation_last_update_by columns use the global user ID, so you'll have to join it to centralauth.global_user on s7. I started doing this before the Analytics MariaDB replicas were split across multiple hosts, but now you'll have to do something to bridge the gap between x1 and s7.
  • Published translations can be identified in the MediaWiki history by the revision tag contenttranslation; the subset of those translations made using ContentTranslation version 2 is additionally tagged with contenttranslation-v2.
  • ContentTranslation logs to several different event streams whose names start with ContentTranslation. All of them have a field called token which actually contains the user name of the event performer. You might need to use the ContentTranslation stream because it logs a continue event when the user does further work on a translation without publishing (which is relevant to measuring the number of newcomers using CX, since we've defined "using" to include "working on without publishing"). The wikishared.cx_translations table tells you when a translation was started and when it was last worked on, but doesn't give any details about intermediate activity. I meant to whitelist the ContentTranslation stream, but I never got around to it, so currently we only have 90 days of past data for it. Potentially we could redefine "number of newcomers using CX" to mitigate this difficulty.
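Since a single SQL join across the x1 and s7 hosts isn't possible, one way to bridge the gap is to query each replica separately and join the results client-side, e.g. in pandas. A minimal sketch with invented rows standing in for the query results (the column names follow the tables mentioned above):

```python
import pandas as pd

# Stand-in for the result of a query against wikishared.cx_translations on x1.
translations = pd.DataFrame({
    "translation_started_by": [101, 102],  # global user IDs
    "translation_status": ["published", "draft"],
})

# Stand-in for the result of a query against centralauth.global_user on s7.
global_users = pd.DataFrame({
    "gu_id": [101, 102],
    "gu_name": ["Alice", "Bob"],
})

# Client-side equivalent of the cross-host join on the global user ID.
joined = translations.merge(
    global_users, left_on="translation_started_by", right_on="gu_id", how="left"
)
print(joined[["gu_name", "translation_status"]])
```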

Is there a list of the tables used by the team? And how can I find their schemas?

@Amire80, can you provide more details about this?
Chelsy, Amir has created several queries and reports in this area that may be useful to extract some relevant info meanwhile. But, he can provide much more details about the tables available.

Thanks @Pginer-WMF ! This is helpful.

Thanks so much @Neil_P._Quinn_WMF !

I can put the notebook on GitHub too if that would be helpful.

Yes, that would be very helpful!

Data on users' translations (whether in-progress or published) is stored in the wikishared.cx_translations table on x1. The translation_started_by and translation_last_update_by columns use the global user ID, so you'll have to join it to centralauth.global_user on s7. I started doing this before the Analytics MariaDB replicas were split across multiple hosts, but now you'll have to do something to bridge the gap between x1 and s7.

I noticed you used event_sanitized.serversideaccountcreation for "Surviving translations by newcomers" in the notebook. Is the userId in event_sanitized.serversideaccountcreation the global user ID? Is the dt in this table the earliest user registration date across all wikis?

I posted other questions about specific tasks in their corresponding tickets.

I can put the notebook on GitHub too if that would be helpful.

Yes, that would be very helpful!

Okay, it's at https://github.com/wikimedia-research/2018-19-Language-annual-plan-metrics. Feel free to mercilessly overwrite (convert it to R, etc.) 😁

I noticed you used event_sanitized.serversideaccountcreation for "Surviving translations by newcomers" in the notebook. Is the userId in event_sanitized.serversideaccountcreation the global user ID?

I've never actually used that field, but I'm 95% sure that it's just the local user ID. The global user ID is almost never used (which is why I was surprised when I found that the CX tables do use it!).

Is the dt in this table the earliest user registration date across all wikis?

Yes! The table logs only true creations, not autocreations, so I used it as an easy way of getting a user's global registration date.
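Using that registration date, flagging "newcomer" translations reduces to a date arithmetic check. A sketch with invented data (the 30-day window and all names are illustrative assumptions, not the task's official definition):

```python
from datetime import datetime, timedelta

# Hypothetical global registration dates, as recovered from the dt field of
# ServerSideAccountCreation events (which logs only true creations).
registrations = {"Alice": datetime(2018, 7, 1), "Bob": datetime(2018, 1, 15)}

# Hypothetical translation start dates from wikishared.cx_translations.
translations = [
    {"user": "Alice", "started": datetime(2018, 7, 10)},
    {"user": "Bob", "started": datetime(2018, 9, 1)},
]

# Assumed definition: a newcomer starts translating within 30 days of registering.
NEWCOMER_WINDOW = timedelta(days=30)
newcomer_translations = [
    t for t in translations
    if t["started"] - registrations[t["user"]] <= NEWCOMER_WINDOW
]
print([t["user"] for t in newcomer_translations])  # → ['Alice']
```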

kzimmerman removed chelsyx as the assignee of this task. Jul 9 2019, 9:32 PM
kzimmerman moved this task from Next Up to Doing on the Product-Analytics board.

Unassigning @chelsyx because she's moving on from the Foundation.

Assigning to @Neil_P._Quinn_WMF to discuss with @Pginer-WMF

nshahquinn-wmf closed this task as Resolved (edited). Sep 30 2019, 3:45 PM

The 2018-19 fiscal year has ended, so there's no longer any value in delivering these metrics as a package. The tasks for the individual metrics will stay open on the backlog.