
Adjust MT graph to clarify the presented concepts
Closed, Resolved · Public

Description

The graph showing the use of Machine Translation (MT) services was created recently (T207765). It provides very useful information, but some of the labelling can cause confusion. The main points we want to address:

  • The MT graph covers both CX1 and CX2 versions (while the graphs in the other tabs are CX2-specific).
  • In addition to translation services, we want the default options to start a translation to be clearly represented. Currently "scratch" (i.e., starting with an empty paragraph) appears, but the option for copying the source text does not. In addition, a useless "undefined" line, always at 0, is plotted.

The proposed changes to solve the issues above are:

  • Rename the top title for the whole dashboard as "Weekly Content Translation activity"
  • Make the CX version explicit in the tabs for each graph. Change the tab labels to be: "Translations (CX2)", "Translators (CX2)", "MT Engines (all versions)"
  • Rename "scratch" as "No MT: Start with empty paragraph"
  • Show a "No MT: Copy original content" line capturing those cases where the source text is used instead of MT.
  • Remove the "undefined" line in the graph.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 22 2018, 10:48 AM
Amire80 updated the task description. · Nov 25 2018, 4:06 PM

I did the first two trivial points.

For the rest, we should do it together with T210350 because it will require redeployment.

@Milimetric, do you know why "undefined" is shown there? I cannot see anything like that in the tsv file, or when I run the SQL query myself.

The current query filters things out like this:
cxc_origin not in ('disable-mt', 'no-mt', 'user', 'source', 'source-mt')

If I don't filter anything out, the results since the beginning of time are as follows:

wikiadmin@10.64.32.11(wikishared)> select
    ->   cxc_origin as engine,
    ->   count(distinct(cxc_translation_id)) as translations
    -> from
    ->   cx_corpora
    -> group by
    ->   cxc_origin;
+------------+--------------+
| engine     | translations |
+------------+--------------+
| Apertium   |        80943 |
| LingoCloud |          528 |
| Matxin     |          270 |
| TestClient |            9 |
| Yandex     |       207248 |
| Youdao     |         1298 |
| disable-mt |            1 |
| no-mt      |           26 |
| scratch    |          103 |
| source     |       411324 |
| source-mt  |           64 |
| user       |       395770 |
+------------+--------------+
12 rows in set (13.87 sec)

@santhosh, how should I modify the filtering to properly show those cases where the source text is used instead of MT, as @Pginer-WMF asks?

In this table, 'disable-mt', 'source-mt', 'TestClient', 'no-mt', and 'scratch' can be ignored. They are there because of development iterations, which is also obvious from the low numbers.

  • 'source' represents the content from the source article; we save every source section for the purpose of the parallel corpora.
  • 'Apertium', 'Youdao', 'Yandex', 'Matxin', and 'LingoCloud' represent translations that started with an initial translation by these MT engines.
  • 'user' represents the translations done on top of an initial translation by any MT engine, by copying the source text, or by starting from scratch.

It is not straightforward to know whether a section translation was initially started by copying the source text, since it will be marked as 'user' in the corpora table. For a given translation_id and section_id, ordered by ascending timestamp value, if there is no origin value that is an MT service, you can say it was started from scratch or from source content. That's all you can know from the corpora table, since it is designed to hold the parallel corpora and not the translation workflow information.
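The rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the row layout (translation id, section id, origin, timestamp) is a hypothetical simplification of the corpora table, the function name `started_without_mt` is made up, and the engine names are taken from the query output earlier in this task.

```python
from collections import defaultdict

# Engine names as they appear as cxc_origin values in the query output above.
MT_ENGINES = {"Apertium", "LingoCloud", "Matxin", "Yandex", "Youdao"}


def started_without_mt(rows):
    """Map each (translation_id, section_id) to True if no MT origin is recorded.

    `rows` is an iterable of (translation_id, section_id, origin, timestamp)
    tuples. This layout is an assumption made for the example.
    """
    origins_by_section = defaultdict(set)
    for translation_id, section_id, origin, _timestamp in rows:
        origins_by_section[(translation_id, section_id)].add(origin)

    # A section with no MT engine among its origins was started either from
    # scratch or from the copied source text; the two cannot be told apart.
    return {
        key: not (origins & MT_ENGINES)
        for key, origins in origins_by_section.items()
    }


# Made-up sample rows, not real data.
rows = [
    (1, "s1", "source", 100),
    (1, "s1", "Yandex", 101),
    (1, "s1", "user", 102),
    (2, "s1", "source", 200),
    (2, "s1", "user", 201),
]
print(started_without_mt(rows))
```

Note that the check only asks whether any MT origin exists for the section, so the timestamp ordering does not affect the answer in this sketch.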

How to interpret this data:

  • 411324 sections were translated so far.
  • 395770 sections were edited by users on top of the initial translation.
  • The initial translations were from Apertium for 80943 sections, LingoCloud for 528 sections, Matxin for 270 sections, Yandex for 207248 sections, and Youdao for 1298 sections (total 290287).
  • The remaining 411324 − 290287 = 121037 sections were started from scratch or on top of source content. In other words, 121037 sections did not use any MT engine.
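The subtraction in this interpretation can be reproduced directly from the per-origin counts in the query output above. The short Python sketch below does only that. It assumes, as the interpretation does, that each counted item is attributed to at most one MT engine; if the same item appears under two engines, the difference would undercount the "no MT" cases.

```python
# Counts per cxc_origin, copied from the query output earlier in this task.
counts = {
    "Apertium": 80943,
    "LingoCloud": 528,
    "Matxin": 270,
    "Yandex": 207248,
    "Youdao": 1298,
    "source": 411324,
}

MT_ENGINES = ("Apertium", "LingoCloud", "Matxin", "Yandex", "Youdao")

# Everything that started with an initial machine translation.
mt_total = sum(counts[engine] for engine in MT_ENGINES)

# Everything with a saved source section but no MT engine.
no_mt_total = counts["source"] - mt_total

print("Started with MT:", mt_total)
print("Started without MT:", no_mt_total)
```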

Thanks for the details, @santhosh.
Based on that it seems that:

  • We may want to create just one line for "No MT: Copy source or start from scratch" instead of two separate ones. We need to explore if the computation described above (i.e., calculating the difference between all the engines and the total of sections) is easy to support as part of the query.
  • We don't want to show outdated items ('disable-mt', 'source-mt', 'TestClient', 'no-mt', 'scratch'), or those that are too general, such as "source" and "user".
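One possible way to fold the "No MT" computation into a single query is sketched below. It runs against an in-memory SQLite table with made-up rows, so it is not the production query and not necessarily what was eventually deployed. Only the `cxc_translation_id` and `cxc_origin` columns from the query above are used, and the "No MT" label is a placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cx_corpora (cxc_translation_id INTEGER, cxc_origin TEXT)"
)
# Made-up sample rows, not real data.
conn.executemany(
    "INSERT INTO cx_corpora VALUES (?, ?)",
    [
        (1, "source"), (1, "Yandex"), (1, "user"),
        (2, "source"), (2, "user"),
        (3, "source"), (3, "Apertium"),
        (4, "source"), (4, "scratch"),
    ],
)

# Per translation, pick an MT engine if one was ever recorded, else 'No MT';
# then count translations per resulting label.
QUERY = """
SELECT engine, COUNT(*) AS translations
FROM (
    SELECT
        cxc_translation_id,
        COALESCE(
            MAX(CASE WHEN cxc_origin IN
                ('Apertium', 'LingoCloud', 'Matxin', 'Yandex', 'Youdao')
                THEN cxc_origin END),
            'No MT'
        ) AS engine
    FROM cx_corpora
    GROUP BY cxc_translation_id
) AS per_translation
GROUP BY engine
ORDER BY engine
"""

results = conn.execute(QUERY).fetchall()
print(results)
```

One limitation of this sketch: a translation that used more than one engine is attributed to a single engine (the alphabetically last one), whereas the current query counts it once under each engine.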

The undefined line is due to an empty line at the end of the file. I'll fix it as a bug in dashiki and deploy your dashboard again; no other action will be required. See T210570.
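To illustrate this class of bug (this is not Dashiki's actual code, and the sample data is made up): splitting a TSV file on newlines leaves empty strings at the end when the file ends with a blank line, and a charting layer can turn such a row into a series with no name. A minimal Python sketch of the problem and of a parser that skips blank lines:

```python
def parse_tsv(text):
    """Split TSV text into a header and data rows, skipping blank lines."""
    lines = [line for line in text.split("\n") if line.strip()]
    header = lines[0].split("\t")
    rows = [line.split("\t") for line in lines[1:]]
    return header, rows


# Made-up sample file content ending with an extra empty line.
tsv = "date\tengine\ttranslations\n2018-11-18\tYandex\t120\n\n"

# A naive split keeps the trailing empty strings as if they were rows.
naive = tsv.split("\n")
print(naive[-2:])

header, rows = parse_tsv(tsv)
print(header)
print(rows)
```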

Nikerabbit updated the task description. · Jan 10 2019, 12:52 PM

This is supposed to be fixed in https://gerrit.wikimedia.org/r/#/c/analytics/limn-language-data/+/498981/ , which is waiting for a rerun at the moment.

The report was rerun and the graph is now using the new logic.

Amire80 updated the task description. · Mar 27 2019, 2:38 PM

Great, thank you!

Pginer-WMF closed this task as Resolved. · Mar 27 2019, 4:27 PM