cxserver: Design and implement a compact configuration to represent fast growing language pairs
Closed, Resolved | Public | 1 Estimated Story Points

Description

The current JSON configuration is not compact and will grow very quickly as we add more languages. We need a better configuration format that represents the permutations the languages can create.

Event Timeline

santhosh raised the priority of this task from to Needs Triage.
santhosh updated the task description. (Show Details)
Arrbee triaged this task as High priority. Feb 2 2015, 6:58 AM
Arrbee moved this task from Needs Triage to Long term on the ContentTranslation board.
Arrbee raised the priority of this task from High to Needs Triage. Feb 2 2015, 9:35 AM
Arrbee subscribed.
Amire80 triaged this task as Medium priority. Feb 4 2015, 1:43 AM
Amire80 set Security to None.
Arrbee raised the priority of this task from Medium to High. Feb 4 2015, 4:01 PM

A proposal for new configuration

registry: {
		source: [ 'af', 'an', 'ar', 'bg', 'bs', 'ca', 'cr', 'cy', 'en', 'eo', 'es', 'fr', 'gl', 'hi', 'hr', 'id', 'kk', 'mk', 'ms', 'mt', 'nl', 'oc', 'pt', 'ru', 'tt', 'ur' ],
		target: [ 'af', 'an', 'ar', 'bg', 'bs', 'ca', 'cr', 'cy', 'eo', 'es', 'fr', 'gl', 'hi', 'hr', 'id', 'kk', 'mk', 'ms', 'mt', 'nl', 'oc', 'pt', 'ru', 'tt', 'ur' ],
		mt: {
			Apertium: {
				af: 'nl',
				ar: 'mt',
				an: 'es',
				bg: 'mk',
				br: 'fr',
				ca: [ 'es', 'en', 'eo', 'fr', 'oc', 'pt' ],
				cy: 'en',
				en: [ 'bs', 'ca', 'cr', 'eo', 'es', 'gl', 'hr', 'sr' ],
				eo: 'en',
				es: [ 'ca', 'pt', 'it', 'oc', 'en', 'fr', 'an', 'eo', 'gl' ],
				eu: 'en',
				fr: [ 'ca', 'eo', 'es' ],
				gl: [ 'en', 'es', 'pt' ],
				hi: 'ur',
				id: 'ms',
				is: 'en',
				kk: 'tt',
				mk: [ 'bg', 'bs', 'hr', 'mk', 'sr' ],
				ms: 'id',
				mt: 'ar',
				nb: [ 'da', 'nn' ],
				nl: 'af',
				nn: [ 'da', 'nb' ],
				oc: [ 'es', 'ca' ],
				pt: [ 'ca', 'es', 'gl' ],
				ro: 'es',
				sh: 'sl',
				sl: [ 'bs', 'cr', 'hr', 'sr' ],
				sv: [ 'da', 'is' ],
				tt: 'kk',
				ur: 'hi'
			},
			Yandex: {
				en: 'ru'
			}
		},
		dictionary: {
			JsonDict: {
				en: 'es',
				es: 'ca'
			}
		}
}
  • If we do not want to publish to a certain wiki but still want it as a source, we can do that: note that en is missing from target.
  • A big change is that there are no hand-picked pairs. Every combination of source and target is available.
  • mt and dictionary, however, are configured per pair.
  • It is not possible to blacklist a specific pair. I don't see any reason for such a requirement, but tools (mt, dictionary, ...) can be configured for specific pairs.
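
To make the bullets above concrete, here is a minimal JavaScript sketch of how a consumer might read the proposed registry. The helper names (isPairAvailable, toolsForPair) and the trimmed-down registry object are hypothetical and are not taken from cxserver. The sketch assumes only that a tool entry maps a source language to either a single target or an array of targets, as in the proposal, and that a language is never paired with itself.

```javascript
'use strict';

// Trimmed-down copy of the proposed registry, for illustration only.
const registry = {
	source: [ 'ca', 'en', 'es', 'ru' ],
	target: [ 'ca', 'es', 'ru' ], // 'en' is a source but not a target
	mt: {
		Apertium: { ca: [ 'es', 'en' ], en: [ 'ca', 'es' ], es: [ 'ca', 'en' ] },
		Yandex: { en: 'ru' }
	},
	dictionary: {
		JsonDict: { en: 'es', es: 'ca' }
	}
};

// A tool entry may be a single language code or an array of codes.
function toArray( value ) {
	if ( value === undefined ) {
		return [];
	}
	return Array.isArray( value ) ? value : [ value ];
}

// Every source can be combined with every target.
// Assumption: a language is not offered as a target for itself.
function isPairAvailable( reg, from, to ) {
	return from !== to && reg.source.includes( from ) && reg.target.includes( to );
}

// Names of the providers of a tool type ('mt' or 'dictionary')
// that are configured for the given pair.
function toolsForPair( reg, type, from, to ) {
	const providers = reg[ type ] || {};
	return Object.keys( providers ).filter(
		( name ) => toArray( providers[ name ][ from ] ).includes( to )
	);
}

console.log( isPairAvailable( registry, 'en', 'ru' ) ); // true
console.log( isPairAvailable( registry, 'es', 'en' ) ); // false: 'en' is not a target
console.log( toolsForPair( registry, 'mt', 'en', 'ru' ) ); // [ 'Yandex' ]
console.log( toolsForPair( registry, 'mt', 'ru', 'ca' ) ); // []
```

In this reading, ru to ca is still an available pair even though no MT engine is configured for it, which matches the idea that pairs are not hand picked while tools are.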

There is also a design implication. Currently, the source selector lists the languages without MT as "disabled-ish". I am not sure this communicates the intention, namely that the languages that are not disabled have MT. Why is MT used as the single criterion for enabling a language?

pasted_file (356×444 px, 15 KB)

Our observations so far suggest that people do use CX without MT too. So IMO we should treat all languages the same. If we grey out a language, people are less likely to select it. Instead, we should provide an explicit indication of what each choice offers: selecting this language gives you MT and a dictionary, selecting that one gives you no MT but a dictionary, and so on.
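
As a rough illustration of that explicit indication, the following JavaScript sketch builds, for a given source language, a list of all target languages annotated with whether MT and a dictionary are configured. The function names (hasTool, describeTargets), the shape of the returned objects, and the small registry are assumptions made for this example and are not part of cxserver or the CX front end.

```javascript
'use strict';

// Small illustrative registry in the proposed format.
const registry = {
	source: [ 'en', 'es' ],
	target: [ 'ca', 'es', 'ru' ],
	mt: { Apertium: { en: [ 'ca', 'es' ], es: 'ca' }, Yandex: { en: 'ru' } },
	dictionary: { JsonDict: { en: 'es', es: 'ca' } }
};

// True if at least one provider lists the pair from -> to.
function hasTool( providers, from, to ) {
	return Object.keys( providers || {} ).some( ( name ) => {
		const targets = providers[ name ][ from ];
		// [].concat() accepts both a single code and an array of codes.
		return targets !== undefined && [].concat( targets ).includes( to );
	} );
}

// One entry per target language, none of them disabled;
// the flags let a selector show which tools each choice offers.
function describeTargets( reg, from ) {
	return reg.target
		.filter( ( to ) => to !== from )
		.map( ( to ) => ( {
			language: to,
			mt: hasTool( reg.mt, from, to ),
			dictionary: hasTool( reg.dictionary, from, to )
		} ) );
}

console.log( describeTargets( registry, 'es' ) );
// [ { language: 'ca', mt: true, dictionary: true },
//   { language: 'ru', mt: false, dictionary: false } ]
```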

In addition to this, we need to take care of requirements like:

  • en-ru has two MT engines. How do we configure which one is the default? How do we configure no-MT as the default?

My rough plan is:

mt: {
	defaults: {
		'en-ru': 'Yandex' // or 'no-MT' to make no MT the default
	},
	Apertium: {},
	...
}

The dictionary configuration would work in a similar way.

Persistent user preferences for these things are out of scope here.
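
The rough plan above could be resolved along the following lines. This is only a sketch under stated assumptions: defaultMTEngine, enginesForPair, the NO_MT constant, the OtherEngine entry and the rule that the first engine listed in the file wins when there is no override are all hypothetical, and none of this is the merged implementation.

```javascript
'use strict';

const NO_MT = 'no-MT'; // marker value from the plan above

// Illustrative configuration. 'OtherEngine' is a made-up second engine
// so that en-ru has two candidates.
const mt = {
	defaults: {
		'en-ru': 'Yandex',
		'es-ca': NO_MT
	},
	OtherEngine: { en: 'ru' },
	Apertium: { en: 'es', es: 'ca' },
	Yandex: { en: 'ru' }
};

// Engines that list the pair, in the order they appear in the file.
function enginesForPair( mtConfig, from, to ) {
	return Object.keys( mtConfig ).filter( ( name ) => {
		if ( name === 'defaults' ) {
			return false;
		}
		const targets = mtConfig[ name ][ from ];
		return targets !== undefined && [].concat( targets ).includes( to );
	} );
}

// Returns the engine name to preselect, or null for "no MT".
function defaultMTEngine( mtConfig, from, to ) {
	const engines = enginesForPair( mtConfig, from, to );
	const override = ( mtConfig.defaults || {} )[ from + '-' + to ];
	if ( override === NO_MT ) {
		return null;
	}
	if ( override !== undefined && engines.includes( override ) ) {
		return override;
	}
	// Assumption: without an override, file order decides.
	return engines.length > 0 ? engines[ 0 ] : null;
}

console.log( defaultMTEngine( mt, 'en', 'ru' ) ); // 'Yandex' (override beats file order)
console.log( defaultMTEngine( mt, 'es', 'ca' ) ); // null (no-MT is the default)
console.log( defaultMTEngine( mt, 'en', 'es' ) ); // 'Apertium'
```

An override that names an engine which does not support the pair is ignored here rather than treated as an error. That is a design choice of the sketch, not something the discussion settles.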

Apertium provides support for very specific language pairs, so we need to be able to capture those.
For MT engines that support all combinations of a set of languages, we could describe that support more compactly. For example:

Yandex: {
	'*': ['en', 'ru', 'es', 'ca']
}

instead of:

Yandex: {
	en: ['en', 'ru', 'es', 'ca'],
	ru: ['en', 'ru', 'es', 'ca'],
	es: ['en', 'ru', 'es', 'ca'],
	ca: ['en', 'ru', 'es', 'ca']
}

For MT we may have a default preference order (e.g., the order in which engines are defined in the file) so that we only need to define preferences when overriding that order.
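
A loader could turn the wildcard shorthand back into the explicit per-source form before the rest of the code sees it. The sketch below is one hypothetical way to do that in JavaScript. The function name expandWildcard is an assumption, the '*' key is quoted so that the object literal is valid JavaScript, and the extra en to uk entry is made up to show how specific pairs could be merged with the wildcard set. To match the explicit form shown above, each language keeps itself in its own target list; filtering that out would be a one-line change.

```javascript
'use strict';

// Accept both a single language code and an array of codes.
function toArray( value ) {
	return Array.isArray( value ) ? value : [ value ];
}

// Expand a '*' entry into one entry per language, then merge in
// any explicitly listed pairs without creating duplicates.
function expandWildcard( engineConfig ) {
	const expanded = {};
	const all = toArray( engineConfig[ '*' ] || [] );
	for ( const lang of all ) {
		expanded[ lang ] = all.slice();
	}
	for ( const from of Object.keys( engineConfig ) ) {
		if ( from === '*' ) {
			continue;
		}
		const targets = new Set( expanded[ from ] || [] );
		toArray( engineConfig[ from ] ).forEach( ( to ) => targets.add( to ) );
		expanded[ from ] = Array.from( targets );
	}
	return expanded;
}

// 'uk' is an invented extra target, used only for illustration.
const yandex = { '*': [ 'en', 'ru', 'es', 'ca' ], en: 'uk' };

console.log( expandWildcard( yandex ) );
// {
//   en: [ 'en', 'ru', 'es', 'ca', 'uk' ],
//   ru: [ 'en', 'ru', 'es', 'ca' ],
//   es: [ 'en', 'ru', 'es', 'ca' ],
//   ca: [ 'en', 'ru', 'es', 'ca' ]
// }
```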

Apertium provides support for very specific language pairs, so we need to be able to capture those.
For MT engines that support all combinations of a set of languages, we could describe that support more compactly. For example:

Yandex: {
	'*': ['en', 'ru', 'es', 'ca']
}

We see that we cannot enable all language pairs even if Yandex supports all combinations, because many translations are done by pivoting through English. So in practice, just as with Apertium, we will have to pick pairs.

For MT we may have a default preference order (e.g., the order in which engines are defined in the file) so that we only need to define preferences when overriding that order.

Yes, so my idea is to use the 'defaults' key only when we need an override.

We see that we cannot enable all language pairs even if Yandex supports all combinations, because many translations are done by pivoting through English. So in practice, just as with Apertium, we will have to pick pairs.

Yes, we need a way to specify specific pairs, but a way to define "all combinations of the following languages: A, B, C, D" (plus some additional specific pairs) seems convenient for keeping the configuration simple. Consider also that we can make "no-MT" the default value for some of those pairs (which almost disables them but leaves the final choice to the user).

There is also a design implication. Currently, the source selector lists the languages without MT as "disabled-ish". I am not sure this communicates the intention, namely that the languages that are not disabled have MT. Why is MT used as the single criterion for enabling a language?

What you describe is neither the current behaviour nor the expected one (see screenshot below). The greyed-out languages are supposed to represent the target languages that are not available for the currently selected source language. We should probably show them (otherwise users may miss that the tool as a whole supports them) but make them non-selectable.

In any case, I think we should move towards supporting all combinations of a given set of languages (even if some lack MT or have it disabled by default). That would avoid the need to disable specific pairs.

Screen_Shot_2015-02-05_at_23.01.23.png (957×782 px, 95 KB)

gerritbot subscribed.

Change 189942 had a related patch set uploaded (by Santhosh):
Registry: Use a compact structure

https://gerrit.wikimedia.org/r/189942

Patch-For-Review

Change 190158 had a related patch set uploaded (by Santhosh):
Support new language configuration format

https://gerrit.wikimedia.org/r/190158

Patch-For-Review

santhosh lowered the priority of this task from High to Medium. Feb 12 2015, 8:07 AM
santhosh moved this task from In Progress to In Review on the LE-Sprint-82 board.

Change 189942 merged by jenkins-bot:
Registry: Use a compact structure

https://gerrit.wikimedia.org/r/189942

Change 190158 merged by jenkins-bot:
Support new language configuration format

https://gerrit.wikimedia.org/r/190158

Arrbee assigned this task to santhosh.