Maniphest T207765

Analyze the usage of different machine translation engines in Content Translation
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Amire80
	Oct 23 2018, 4:09 PM

Description

There must be a dashboard, probably using Dashiki, that will show how many articles were translated with each of the translation engines.

The basic query is something along these lines:

select
  cxc_origin as engine,
  count(distinct(cxc_translation_id)) as translations
from
  cx_corpora
where
  cxc_timestamp between 20180901000000 and 20181001000000
group by
  cxc_origin;

And we'll probably want a line chart with a line for each engine.

Details

	Subject	Repo	Branch	Lines +/-
	Add reportupdater job for language team metric	operations/puppet	production	+4 -0
	Add scheduling for Content Translation MT engine data	analytics/limn-language-data	master	+28 -0

Customize query in gerrit

Related Objects

Mentioned In: T210135: Adjust MT graph to clarify the presented concepts
Mentioned Here: T210135: Adjust MT graph to clarify the presented concepts

Event Timeline

Amire80 created this task.Oct 23 2018, 4:09 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 23 2018, 4:09 PM

Amire80 edited projects, added ContentTranslation; removed CX-analytics.Oct 23 2018, 4:09 PM

Pginer-WMF triaged this task as Medium priority.Oct 23 2018, 4:23 PM

Pginer-WMF added a project: Language-Team (Language-2018-October-December).

Pginer-WMF moved this task from Backlog to Maintenance backlog on the Language-Team (Language-2018-October-December) board.

Arrbee assigned this task to Amire80.Oct 24 2018, 7:06 AM

Amire80 moved this task from Maintenance backlog to In Progress on the Language-Team (Language-2018-October-December) board.Oct 24 2018, 7:06 AM

Change 469390 had a related patch set uploaded (by Amire80; owner: Amire80):
[analytics/limn-language-data@master] Add scheduling for Content Translation MT engine data

https://gerrit.wikimedia.org/r/469390

gerritbot added a project: Patch-For-Review.Oct 24 2018, 9:05 AM

Amire80 moved this task from In Progress to In Review on the Language-Team (Language-2018-October-December) board.Oct 24 2018, 9:05 AM

Amire80 moved this task from In Review to Blocked on the Language-Team (Language-2018-October-December) board.Oct 24 2018, 12:04 PM

Arrbee moved this task from Blocked to In Progress on the Language-Team (Language-2018-October-December) board.Oct 26 2018, 7:13 AM

Arrbee edited projects, added CX-analytics; removed ContentTranslation.Oct 29 2018, 1:01 PM

Change 469390 merged by Milimetric:
[analytics/limn-language-data@master] Add scheduling for Content Translation MT engine data

https://gerrit.wikimedia.org/r/469390

Change 473056 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[operations/puppet@production] Add reportupdater job for language team metric

https://gerrit.wikimedia.org/r/473056

Change 473056 merged by Elukey:
[operations/puppet@production] Add reportupdater job for language team metric

https://gerrit.wikimedia.org/r/473056

Please review if it makes sense:
https://language-reportcard.wmflabs.org/cx2/#mt-engines

I'm taking this straight from the database. "Undefined" and "Scratch" may need some explanation. New engines will be added automatically when they will start getting used.

Amire80 moved this task from In Progress to QA on the Language-Team (Language-2018-October-December) board.Nov 20 2018, 4:52 PM

Amire80 removed a project: Patch-For-Review.

@Amire80 - below are my notes (most likely not important):

(1) Running the query takes quite a long time:

[wikishared]> select cxc_origin as engine, count(distinct(cxc_translation_id)) as translations from  cx_corpora where cxc_timestamp between 20180901000000 and 20181001000000 group by cxc_origin;
+------------+--------------+
| engine     | translations |
+------------+--------------+
| Apertium   |         2175 |
| LingoCloud |          196 |
| Matxin     |           12 |
| Yandex     |         7253 |
| Youdao     |           68 |
| no-mt      |            2 |
| source     |        12573 |
| source-mt  |            5 |
| user       |        12063 |
+------------+--------------+
9 rows in set (2 min 40.82 sec)

And possible keys is NULL:

 EXPLAIN EXTENDED select cxc_origin as engine, count(distinct(cxc_translation_id)) as translations from  cx_corpora where cxc_timestamp between 20180901000000 and 20181001000000 group by cxc_origin\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: cx_corpora
         type: index
possible_keys: NULL
          key: cx_corpora_unique
      key_len: 85
          ref: NULL
         rows: 12537240
     filtered: 100.00
        Extra: Using where; Using index; Using filesort
1 row in set, 1 warning (0.00 sec)

(2) There are no undefined value for cxc_origin (and the number of scratch records is very small) .

[wikishared]> select cxc_origin as engine, count(distinct(cxc_translation_id)) as translations from  cx_corpora  group by cxc_origin order by translations  desc;
+------------+--------------+
| engine     | translations |
+------------+--------------+
| source     |       409877 |
| user       |       394384 |
| Yandex     |       206382 |
| Apertium   |        80694 |
| Youdao     |         1298 |
| LingoCloud |          502 |
| Matxin     |          265 |
| scratch    |          103 |
| source-mt  |           63 |
| no-mt      |           26 |
| TestClient |            5 |
| disable-mt |            1 |
+------------+--------------+
12 rows in set (2 min 22.71 sec)

This graph seems really useful. Thanks! I created a follow-up ticket with some minor adjustments (mainly labelling) that will help clarify potential confusions: T210135: Adjust MT graph to clarify the presented concepts

• Petar.petkovic moved this task from QA to Done on the Language-Team (Language-2018-October-December) board.Nov 26 2018, 4:01 PM

• Petar.petkovic removed a subscriber: gerritbot.

Analyze the usage of different machine translation engines in Content TranslationClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

Analyze the usage of different machine translation engines in Content Translation
Closed, ResolvedPublic
Actions