
Let's discuss LanguageConverter
Closed, Resolved (Public)


LanguageConverter needs love. Perhaps ContentTranslation is the way forward.

We need to agree on a roadmap to support similar language pairs, to minimize the amount of effort needed to share content between similar wikis. "Traditional" and "Simplified" characters on zhwiki are one end of the spectrum, "British" and "American" English are the other. There are a *lot* of languages in between.

We will come to a decision on how to best handle language variants, as on the Chinese (zhwiki) and Serbian (srwiki) wikis. There are two proposed options, both of them a lot of work:

  1. Deprecate language converter and split the articles in the database (perhaps even going as far as splitting the wikis where the community desires this, but the db split need not be user-visible). Use the Content Translation tool to maintain parallel content.
  2. Add Language Converter support to Parsoid and Visual Editor. Convert hard-to-edit inline conversion dictionaries to Glossaries. Add language/variant tagging and tracking to Visual Editor.

Are there better ways? We need to come to a decision and invest the resources necessary to make one or the other of these happen.

(Continuing the discussion started at T87652: The future of Language Converter (or, why do [[Red]] and [[Orange]] spell colo(u)r differently?))


  • Agree on a plan for long-term support of wikis with language/script variants.

Event Timeline

Cwek triaged this task as Lowest priority.
Cwek set Security to None.
Cwek changed the task status from Invalid to Declined. Sep 18 2015, 9:39 AM
Cwek claimed this task.
Cwek added a subscriber: Cwek.

This is one person's wish, not a consensus of the community.

Cwek removed Cwek as the assignee of this task. Sep 18 2015, 9:39 AM
Cwek removed a subscriber: Cwek.
This comment was removed by cscott.
Aklapper raised the priority of this task from Lowest to Needs Triage. Sep 19 2015, 6:39 AM
Aklapper added a subscriber: Cwek.

@Cwek: This is a valid proposal for a session at the Wikimedia Developer Summit (see the associated project of this task). Talk proposals at a conference do not need a consensus of a community. Hence reopening this task.

Well, I see a foolish project. More than one person has wanted to split zh.wp, whether for technical or business reasons, and they have all failed.

Our predecessors built it to keep Chinese culture connected as far as possible, but someone always wants to break that rule.

I've removed the sloppy term "split the wikis" from the topic description, as what is really meant is splitting db content. The split may or may not be user-visible: that depends on political and community factors orthogonal to the work described here.

I thought about renaming the topic to "a plan for LanguageConverter", but decided not to. Not many people in the organization know what LC is. Keeping the focus on our need to support the billion+ potential editors of Chinese Wikipedia seems to motivate the topic better.

cscott renamed this task from A plan for zhwiki to A plan for Chinese Wikipedia. Sep 19 2015, 2:43 PM

The "plan for Chinese Wikipedia" sounds controversial. Whose plan for Chinese Wikipedia?

It's great to think of new approaches or develop new features for Wikipedia, but you may not want local volunteers to feel that someone is talking down to them from above. You may not want to give volunteers the impression that the WMF, or anyone else, can make a plan for a Wikipedia without involving its community. And you may not want to anger volunteers by making them feel the WMF can simply overrule a community consensus that many people spent a great deal of time and overcame many obstacles to reach, uniting the communities.

Maybe my comment is too harsh, or maybe the grammar and implementation of the language converter do need amending for VE. But Plan A will be a joke in zh.wp.

OK, maybe all the plans are jokes. The language converter is a display-time converter, not a writer's converter. Because the source text is a mixture of different scripts, the first step of Plan A, 'split the articles in the database', will get you into a lot of trouble.
Look at the example on our wiki (帮助:繁简处理/转换原理). The string “内存” in Simplified characters can be converted as a word to “記憶體” in Traditional characters; the word means computer memory. But when it appears inside the sentence "人体内存在很多微生物" in Simplified source text, the Maximum Matching Method splits the sentence into the tokens "人体 内存 在 ..." and converts it to “人體記憶體在很多微生物” in Traditional characters, which is wrong; the sentence needs to be split into the tokens “人体 内 存在 ...”. We need a better tokenizer that understands the meaning of the words and splits them into the correct tokens.
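The segmentation failure described here can be reproduced with a small sketch. This is not the LanguageConverter implementation: the dictionary `TABLE` and the helpers `forward_max_match` and `convert` are hypothetical names, and the entries are a toy subset chosen only to illustrate greedy forward maximum matching.

```python
# Toy conversion table (Simplified -> Traditional). Illustrative entries only,
# not the real zhwiki conversion tables.
TABLE = {
    "人体": "人體",
    "内存": "記憶體",  # "memory" as a computing term
    "存在": "存在",
    "很多": "很多",
    "微生物": "微生物",
    "内": "內",
    "在": "在",
}


def forward_max_match(text, table):
    """Greedy left-to-right segmentation: always take the longest known word."""
    max_len = max(len(word) for word in table)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in table or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


def convert(tokens, table):
    """Convert token by token; unknown tokens pass through unchanged."""
    return "".join(table.get(token, token) for token in tokens)


sentence = "人体内存在很多微生物"
greedy_tokens = forward_max_match(sentence, TABLE)
greedy_result = convert(greedy_tokens, TABLE)

# The segmentation a human reader would choose, supplied by hand here.
intended_result = convert(["人体", "内", "存在", "很多", "微生物"], TABLE)

print(greedy_tokens)    # ['人体', '内存', '在', '很多', '微生物']
print(greedy_result)    # 人體記憶體在很多微生物
print(intended_result)  # 人體內存在很多微生物
```

The greedy matcher picks "内存" because it is the longest dictionary entry at that position; choosing "内" followed by "存在" requires context that a purely table-driven matcher does not have.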
And if the language converter is built into VE, either it will show the original source text, which has no conversion problem, or it will show the converted string, which has the problem shown in the example. It may even have problems like these:

  1. The source is "人体内存在很多微生物". If a writer using Traditional mode opens the source and saves it, what will the source become? "人體記憶體在很多微生物"? Or does nothing change?
  2. If a writer using Traditional mode changes "人體記憶體在很多微生物" to "人體記體在很多微生物", does that break the meaning of the source string "人体内存在很多微生物"?

If there are no ideas for solving these problems, all the plans are jokes.

@Cwek You are exactly right. This problem is hard and we need to come up with a solution which both makes it easier to edit content on zhwiki as well as respects the political sensitivities involved.

One of the potential benefits of the Content Translation approach is that we would be treating simplified-to-traditional conversion as a "translation", rather than a mechanical transformation. As Language Converter demonstrates, we can often do a very good translation using just rules and tables, but with Content Translation we would have the full translation machinery available for the hard cases you illustrate. But we need to pay close attention to the user experience as well as the quality of the "translation" so we don't end up with something which is a joke.

Krenair renamed this task from A plan for Chinese Wikipedia to A plan for LanguageConverter. Sep 28 2015, 4:40 PM

Congratulations! This is one of the 52 proposals that made it through the first deadline of the Wikimedia-Developer-Summit-2016 selection process. Please pay attention to the next one: > By 6 Nov 2015, all Summit proposals must have active discussions and a Summit plan documented in the description. Proposals not reaching this critical mass can continue at their own path out of the Summit.

It's kind of weird to add ContentTranslation to zh.wp, because writing content that mixes Simplified and Traditional characters is very common in Chinese culture, especially in Hong Kong and Taiwan, where we use Traditional Chinese as our main writing script. If you add ContentTranslation to zh.wp, I can't imagine what kind of disaster will happen on zh.wp.... Orz

BTW, the zh.wp database is already a mix of Simplified and Traditional Chinese. A single article may contain both scripts, so it's not possible to split them into two separate databases.

@Cwek: This is a valid proposal for a session at the Wikimedia Developer Summit (see the associated project of this task). Talk proposals at a conference do not need a consensus of a community. Hence reopening this task.

Let's try to clarify this is just a discussion proposal, not an actual proposal for the language converter code.

Nemo_bis renamed this task from A plan for LanguageConverter to Discussing technical ideas for LanguageConverter. Oct 28 2015, 6:03 PM
cscott renamed this task from Discussing technical ideas for LanguageConverter to Let's discuss LanguageConverter. Nov 6 2015, 7:02 PM
cscott updated the task description.

I would prefer to not have "splitting the wikis" in the task description, because I would rather not have it in the agenda for discussion. I think it's potentially distracting from the technical discussion we need to have at the dev summit.

There are close language pairs, and there are distant language pairs. In MediaWiki terminology they are called variants and languages respectively. Ultimately it is for the community to decide what category they are in. As far as I'm concerned, the Chinese Wikipedia community have made their preference clear by inventing the concept of close language pairs in MediaWiki and spending months of volunteer effort writing the relevant supporting software. But it is a community decision and so the relevant stakeholders will not be present at the dev summit.

@cscott tells me that users commonly understand only one variant, and therefore simultaneous editing of both variants is difficult. Fine, but this is a problem the community has knowingly entered into, and I think we can support them in that choice.

I don't buy the idea of splitting a wiki in a way that is not user-visible, since the identity of a wiki is a frontend concept. When you edit a wiki, your changes are immediately visible, that's the definition of a wiki. If you edit in one variant and then your changes are immediately visible in the other variant, then you have one wiki. If your changes go into a translation work queue and then (possibly months later) someone accepts the suggested machine translation and thus makes it visible in the other variant, then you have two wikis. I think we can support zh.wp's decision to be one wiki in this sense. At least it is a point of competitive differentiation compared to the other online Chinese encyclopedias. According to this blog post, variant support is the main reason we are winning in Hong Kong.

If we accept that machine translations will be published without user review, as an essential part of what gives a variant wiki an integrated identity, then it makes sense to work on improving the machine translation, for example by improving word segmentation as @Cwek suggests.

Very well said: we should copy the above text into some essay on Meta-Wiki.

Can we strike proposed option 1 so that the discussion can focus on fleshing out option 2 or coming up with further options?

As far as I'm concerned, the Chinese Wikipedia community have made their preference clear by inventing the concept of close language pairs in MediaWiki and spending months of volunteer effort writing the relevant supporting software. But it is a community decision and so the relevant stakeholders will not be present at the dev summit.

I agree with this completely and that is my opinion as well as in T484#1654868 .. i.e. non-technical factors influence the specific technological choices made and those ought to be considered / respected.

@cscott: could you summarize the discussion at last year's summit (T87652), what's still the same, and how you hope we can make more/further progress this year?

Is this session proposal to repeat the session from last year (T87652), and expect a different result?

@RobLa-WMF -- perhaps. It really depends on who shows up and what the context of the summit is; whether folks feel empowered.

@tstarling, @Nemo_bis -- I agree with your point that "identity of the wiki is a front end concept", but I disagree in one fundamental way: to date we have not given wikis any alternatives. Yes, they chose "immediate update" (aka, LanguageConverter) -- but they didn't have a choice. I do not believe it is accurate to draw far-ranging conclusions based on the state of the wiki in 2002. But (again, agreeing with you) we shouldn't discount their choices either.

Further, we cannot discount the communities' voiced dissatisfaction with the LanguageConverter technology. It's a bit callous to say, "well, you chose this in 2002, now you're stuck with it."

The fundamental question is: can we give these wikis better tools to author (and translate) their content? I believe that we can. (For both options 1 and 2.)

A secondary question is: can we change the backend representation of their content and still respect frontend preferences? Again, I believe the answer is yes. (For both options 1 and 2.)

I'm actually optimistic we might be able to find a "third way" that bridges options 1 and 2 better, to account for the large amount of existing LanguageConverter text while providing a framework that, longer term, can incorporate Machine Translation technologies better. And I'm hopeful that further work on Content Translation will lead to more unified and less schizophrenic editing experiences for bilingual editors. That's the direction I'd like to see us work towards, but we can't get there by disqualifying either LanguageConverter or ContentTranslation at the outset.

I think I have to concur with @tstarling's statement "I think it's potentially distracting from the technical discussion we need to have at the dev summit." Note, we still have it on the schedule, but as my reiteration of Tim's point makes clear, I have some lingering doubts about this. On the one hand, it's a really, really important topic. On the other hand, we're just not prepared to have the conversation that @cscott wants to have.

Given that we're less than a week away, what's a realistic Dev Summit goal for this session? We need to avoid presenting new ideas at the summit or jamming ideas down people's throats; it's more about getting a deeper understanding on issues through real discussion (not mere sequencing of diverse presentations, no matter how short). Are we prepared to have a real discussion on this topic, or should we drop it?

I'm pretty sure @tstarling was not proposing that we drop this session from the summit. He was merely referring to the "splitting the wikis" wording, which I agree is inflammatory. It's also something I've tried hard to separate here: I've made clear, both in the task description and in comments above, that we can discuss backend changes *orthogonally* to front-end changes. The communities have made it clear (for example) that zhwiki should be a single wiki. There are political and cultural reasons. That doesn't mean we can't discuss ways to improve the editing situation on zhwiki; we just have to be careful about our language when doing so.

In any case, just to name proponents, @GWicke has been a strong advocate of "option 1" (deprecate LanguageConverter) and @Nemo_bis (as he comments above) is a strong advocate of "anything but option 1" -- I guess that means "option 2"? I don't have a dog in this race myself, but I would like to gather the proponents of the various sides together and attempt to find a consensus plan.

From the discussion above, we see that we need live conversion of content (an edit to one variant is reflected immediately in the other variant), and hence conversion without human intervention.

Content Translation is actively adding more machine translation backend services and exposing them as a web API, for any use inside the WMF.

So, about improving the tooling, I have some questions. Are there better language converters outside MediaWiki? Do MT services for these language pairs give better results than our language converter? If so, can we plug them in as backends to the CX translation service? Can we combine the MW language converter with such external tools, place them behind this kind of web API, and have the MW editing workflow use that API to instantly convert the changed portions?

Just a note, Language Converter in MediaWiki is not merely a representative transformation; it's hooked into many MediaWiki components, a bit deeper than what CX reaches currently (most importantly, link resolution).

depends on who shows up and what the context of the summit is; whether folks feel empowered.

As far as I can see, people are largely the same:

@santhosh yes, that's an important part of the UX difference. One question is: can we add a mode to ContentTranslation that works in this way -- we get a machine translation of edited content immediately, and then editors can "fix it up" later. This is a different working model, to be sure, but it seems to be more appropriate when the "machine translation" is really good -- think of script conversion or "American English to British English" translation. It would be nice to be able to address both UX modes with the same underlying representations and mechanisms. (One interesting idea would be to take our existing LanguageConverter PHP code and package it as a "translation service" which CX could use.)
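To make the "publish immediately, fix up later" mode concrete, here is a minimal sketch under stated assumptions. None of these names exist in ContentTranslation, cxserver, or LanguageConverter: `Backend`, `TableBackend`, and `VariantStore` are hypothetical, and the table-based backend merely stands in for existing rule-and-table conversion packaged behind a translation-service interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol, Tuple


class Backend(Protocol):
    """Anything that can turn text in one variant into another variant."""

    def translate(self, text: str, source: str, target: str) -> str: ...


class TableBackend:
    """Rule/table conversion, standing in for a LanguageConverter-like service."""

    def __init__(self, tables: Dict[Tuple[str, str], Dict[str, str]]):
        self.tables = tables

    def translate(self, text: str, source: str, target: str) -> str:
        # Naive substring replacement; a real converter needs word segmentation.
        for old, new in self.tables.get((source, target), {}).items():
            text = text.replace(old, new)
        return text


@dataclass
class VariantStore:
    """Publishes machine conversions at once and remembers what needs review."""

    backend: Backend
    variants: List[str]
    published: Dict[str, str] = field(default_factory=dict)
    needs_review: List[str] = field(default_factory=list)

    def save(self, text: str, variant: str) -> None:
        # The variant the editor wrote in is authoritative.
        self.published[variant] = text
        if variant in self.needs_review:
            self.needs_review.remove(variant)
        # Every other variant is visible immediately, flagged for later fix-up.
        for other in self.variants:
            if other != variant:
                self.published[other] = self.backend.translate(text, variant, other)
                if other not in self.needs_review:
                    self.needs_review.append(other)


store = VariantStore(
    backend=TableBackend({("en-us", "en-gb"): {"color": "colour"}}),
    variants=["en-us", "en-gb"],
)
store.save("The color red", "en-us")
print(store.published["en-gb"])  # The colour red
print(store.needs_review)        # ['en-gb']
```

Because the store depends only on the `Backend` interface, a machine translation service could in principle be swapped in for the table backend without changing the editing workflow.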

@liangent Yes, that's another deficiency in the current ContentTranslation mechanism. To some degree it's also a weakness in LanguageConverter, since the tight integration causes some issues, especially recurrent security problems. It would certainly be worthwhile to very precisely describe the hooks that LanguageConverter uses, and decide for each whether this can be made into a more general hook, deprecated, ported to ContentTranslation, etc. As a single example, a lot of our interlanguage link functionality has recently migrated to a Wikidata representation. Could LanguageConverter's link resolution mechanism be ported to this same mechanism? If not, could it be refactored into a more general feature which could be enabled more broadly? As folks who've talked to me know, I'd like to see more pieces of LanguageConverter enabled on *all* wikis, since many wikis running the same code helps prevent code rot.

@Nemo_bis Well, just as a start, we've got a year's more experience with ContentTranslation since the last dev summit, and our users seem to really enjoy it. Last year CX was very much still an experiment. That will influence the discussion, certainly. The technology leadership situation is also very different this year. Without getting too far into the weeds, our current ED is actively looking to define a strategy (which hopefully includes reaching the audiences LanguageConverter currently serves), and we seem to have a very functional RFC process now. Any outcomes of this session which can be phrased as concrete RFCs (or features on the language team's roadmap?) seem to have a solid path forward.

As far as I understand, the desired outcome from the dev summit will include the formation of working groups who will continue to push topics forward over the coming year. So the session will be worthwhile to begin the discussion and identify members (both present at the summit as well as folks like @liangent who are not present but should be involved) so that the work can continue.

@liangent, CX has automatic link adaptation (also images, namespaces, galleries, extensions, etc.) between source and target, but it happens in the translation tool (the MW extension) and not in cxserver.

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure 1) that the session Etherpad notes are linked from this task, 2) that followup tasks for any actions identified have been created and linked from this task, 3) to change the status of this task to "resolved". If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!

Notes are at The only action item is

"Amir: Define the desired behavior from the user perspective according to the suggestion to write in your preferred variant"

which is tracked.