Page MenuHomePhabricator

Analyze the usage of different machine translation engines in Content Translation
Closed, ResolvedPublic

Description

There must be a dashboard, probably using Dashiki, that will show how many articles were translated with each of the translation engines.

The basic query is something along these lines:

select
  cxc_origin as engine,
  count(distinct(cxc_translation_id)) as translations
from
  cx_corpora
where
  cxc_timestamp between 20180901000000 and 20181001000000
group by
  cxc_origin;

And we'll probably want a line chart with a line for each engine.

Details

Related Gerrit Patches:
operations/puppet : productionAdd reportupdater job for language team metric
analytics/limn-language-data : masterAdd scheduling for Content Translation MT engine data

Event Timeline

Amire80 created this task.Oct 23 2018, 4:09 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 23 2018, 4:09 PM
Arrbee assigned this task to Amire80.Oct 24 2018, 7:06 AM

Change 469390 had a related patch set uploaded (by Amire80; owner: Amire80):
[analytics/limn-language-data@master] Add scheduling for Content Translation MT engine data

https://gerrit.wikimedia.org/r/469390

Change 469390 merged by Milimetric:
[analytics/limn-language-data@master] Add scheduling for Content Translation MT engine data

https://gerrit.wikimedia.org/r/469390

Change 473056 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[operations/puppet@production] Add reportupdater job for language team metric

https://gerrit.wikimedia.org/r/473056

Change 473056 merged by Elukey:
[operations/puppet@production] Add reportupdater job for language team metric

https://gerrit.wikimedia.org/r/473056

Please review if it makes sense:
https://language-reportcard.wmflabs.org/cx2/#mt-engines

I'm taking this straight from the database. "Undefined" and "Scratch" may need some explanation. New engines will be added automatically when they will start getting used.

Etonkovidova closed this task as Resolved.Nov 22 2018, 12:33 AM
Etonkovidova added a subscriber: Etonkovidova.

@Amire80 - below are my notes (most likely not important):

(1) Running the query takes quite a long time:

[wikishared]> select cxc_origin as engine, count(distinct(cxc_translation_id)) as translations from  cx_corpora where cxc_timestamp between 20180901000000 and 20181001000000 group by cxc_origin;
+------------+--------------+
| engine     | translations |
+------------+--------------+
| Apertium   |         2175 |
| LingoCloud |          196 |
| Matxin     |           12 |
| Yandex     |         7253 |
| Youdao     |           68 |
| no-mt      |            2 |
| source     |        12573 |
| source-mt  |            5 |
| user       |        12063 |
+------------+--------------+
9 rows in set (2 min 40.82 sec)

And possible keys is NULL:

 EXPLAIN EXTENDED select cxc_origin as engine, count(distinct(cxc_translation_id)) as translations from  cx_corpora where cxc_timestamp between 20180901000000 and 20181001000000 group by cxc_origin\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: cx_corpora
         type: index
possible_keys: NULL
          key: cx_corpora_unique
      key_len: 85
          ref: NULL
         rows: 12537240
     filtered: 100.00
        Extra: Using where; Using index; Using filesort
1 row in set, 1 warning (0.00 sec)

(2) There are no undefined value for cxc_origin (and the number of scratch records is very small) .

[wikishared]> select cxc_origin as engine, count(distinct(cxc_translation_id)) as translations from  cx_corpora  group by cxc_origin order by translations  desc;
+------------+--------------+
| engine     | translations |
+------------+--------------+
| source     |       409877 |
| user       |       394384 |
| Yandex     |       206382 |
| Apertium   |        80694 |
| Youdao     |         1298 |
| LingoCloud |          502 |
| Matxin     |          265 |
| scratch    |          103 |
| source-mt  |           63 |
| no-mt      |           26 |
| TestClient |            5 |
| disable-mt |            1 |
+------------+--------------+
12 rows in set (2 min 22.71 sec)
Pginer-WMF added a subscriber: Pginer-WMF.

This graph seems really useful. Thanks! I created a follow-up ticket with some minor adjustments (mainly labelling) that will help clarify potential confusions: T210135: Adjust MT graph to clarify the presented concepts