
Wikimedia Developer Summit 2018 Topic: Next Steps for Languages and Cross Project Collaboration
Closed, Resolved · Public

Description

Participants, please read, think about, and research these ahead of time:

Session description:

  • Next Steps for Languages and Cross Project Collaboration

Goals:

  • Explore the language infrastructure capabilities and opportunities for the Wikimedia movement.
  • Identify projects and tasks to meet the strategic goals of Wikimedia in this area.
  • See also DevSummit purpose and results for guidance

Related position statements:

Structure (rough draft):

  • The session will be divided into two parts: the first half for language infrastructure, and the second half for cross-project collaboration. Some discussion may overlap between the two topics.
  • The session will open with an introduction to the topic, how it relates to the WMF Vision, and which problems we would like to discuss.
  • Identify the key challenges we face in meeting the goals set by the WMF vision for languages.
  • Propose ways to support future needs: concrete solutions and technologies to explore.
  • The session will NOT be used to discuss any of the topics in-depth. Concrete engineering questions are not in scope.

Related background reading:

Session notes:


Topic Leaders (@Lydia_Pintscher @santhosh)

This is one of the 8 Wikimedia Developer Summit 2018 topics.


Post-event Summary:

  • ...

Action items:

  • ...

Event Timeline

Rfarrand created this task.
Aklapper renamed this task from Wikimedia Developer Summit 218 Topic: Next Steps for Languages and Cross Project Collaboration to Wikimedia Developer Summit 2018 Topic: Next Steps for Languages and Cross Project Collaboration. Dec 20 2017, 12:10 AM

Machine translation is a hot topic, and we should discuss it. The big risk here is that the apparent leading solutions are closed source, which I don't think is an acceptable risk in the long run. We need to figure out how to use machine translation in a sustainable manner: we need to ensure we have control over it, so that it cannot be taken away from us and so that we can support small languages that commercial entities do not usually support.

We should partner with other organisations doing machine translation engine development. I have a specific proposal, which I plan to research for my PhD, to explore how we can better integrate Apertium into Content Translation so that we enable continuous improvements to machine translation quality. The easier part, then, is to open up the cxserver translation API and integrate it into more places, such as structured discussions.
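To make the last point concrete, here is a minimal sketch of what consuming such a translation service could look like. The route (`/v2/translate/...`), the form payload, and the `contents` response field are assumptions modelled on cxserver's public machine translation API; check the service documentation for the exact interface.

```typescript
// Hedged sketch: request a machine translation from a cxserver-style
// endpoint. The route and payload shape are assumptions; consult the
// cxserver documentation for the exact API.
const CXSERVER = 'https://cxserver.wikimedia.org';

async function translate(
  from: string,     // source language code, e.g. 'en'
  to: string,       // target language code, e.g. 'es'
  provider: string, // MT backend, e.g. 'Apertium'
  html: string      // source content as HTML
): Promise<string> {
  const res = await fetch(`${CXSERVER}/v2/translate/${from}/${to}/${provider}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({ html })
  });
  if (!res.ok) {
    throw new Error(`cxserver returned ${res.status}`);
  }
  const data = await res.json();
  return data.contents; // assumed response field
}

// Example: translate a discussion post from English to Spanish.
translate('en', 'es', 'Apertium', '<p>Hello world</p>').then(console.log);
```

A client like this is all a structured-discussions integration would need, which is why opening up the API is the comparatively easy part.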

Second point: I think we should discuss being the forerunners of i18n and language support. This means, apart from resourcing, that we should build solutions that are usable outside our immediate needs. We have a great PHP messaging library, but it is so integrated into MediaWiki that nobody else can use it. We have at least two different messaging libraries for JavaScript, neither of which has many third-party users (jquery.i18n has some, but it is limited in functionality). These libraries need to be able to function stand-alone in addition to being used within MediaWiki and our other products, and they need to be high quality, well documented, and easy to use.
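As a point of reference, jquery.i18n can already be used stand-alone, outside MediaWiki. A minimal sketch based on the library's documented usage (the page must load jquery and jquery.i18n; `$` is typed loosely here because the plugin ships no TypeScript definitions):

```typescript
// Stand-alone use of jquery.i18n, one of the JS messaging libraries
// mentioned above. Requires jquery and jquery.i18n to be loaded.
declare const $: any;

// Register messages for two locales, then resolve one with a parameter.
$.i18n().load({
  en: { 'greeting-user': 'Welcome, $1!' },
  fi: { 'greeting-user': 'Tervetuloa, $1!' }
});

$.i18n().locale = 'fi';
console.log($.i18n('greeting-user', 'Maija')); // "Tervetuloa, Maija!"
```

This is roughly the level of "works anywhere" ergonomics the PHP library would also need before third parties could adopt it.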

Third point: There are many other issues; see for example T183313#3899947 and https://www.mediawiki.org/wiki/Internationalisation_wishlist_2017. It's time to start making a language strategy that clearly articulates what should be done, in which order, and with what impact.

According to the schedule we have one 60 minute slot and one 30 minute slot with a keynote and breaks between them.

@Nikerabbit, I want to add two comments:

  • It is true that the leading solutions are closed source, but there are also many very good open-source translation tools in the neural machine translation (NMT) area. We have checked these tools and the related models, and I can assure you that they provide very good translations if you feed them enough (~100M) training sentences.
  • NMT technology is evolving very fast, so it is not yet very stable for adopters.

@Mountain Those are very good points. I am also aware of open-source NMT toolkits, although I have yet to try one myself. Both NMT and RBMT require certain expertise to build, and that expertise does not exist in large amounts inside the WMF; hence I think partnering with like-minded organizations working on open-source MT is required. The second point is, of course, that many of our small languages will not have that kind of training data. For specific source-target language pairs the amount of data is even more limited, although advances in using data from other languages might make this issue less important. For those languages, RBMT can still provide a solution. Thirdly, training NMT models requires a lot of processing power, so if we want to start training models ourselves, that should be brought up in tech/budgeting/annual planning to ensure sufficient processing power is available.
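For the RBMT fallback, Apertium can be driven locally through its command-line pipeline. A minimal sketch, assuming the `apertium` CLI and an installed en-es language-pair package (pair names vary by installation):

```typescript
// Hedged sketch: calling a locally installed Apertium pipeline as a
// fallback RBMT engine. The apertium CLI reads stdin and writes stdout;
// the en-es pair name is an assumption about what is installed.
import { execFileSync } from 'node:child_process';

function apertiumTranslate(pair: string, text: string): string {
  return execFileSync('apertium', [pair], {
    input: text,
    encoding: 'utf8'
  }).trim();
}

console.log(apertiumTranslate('en-es', 'The cat is on the table.'));
```

Because the pipeline is rule-based, improving output quality means improving the dictionaries and transfer rules, which is exactly the kind of continuous-improvement loop the Content Translation integration proposal above is about.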

I also have three questions:

  • When high-quality translation tools are pervasively available (this trend is happening right now in China), many local readers will read the high-quality English Wikipedia rather than their own language version, with the help of translation tools. How do the local communities grow themselves?
  • Is it feasible to unite users from different languages to work together, at the page level, task level, or project level? This is from a collaboration point of view.
  • Is it possible to unite all language versions into one? This is from a knowledge point of view.

@Lydia_Pintscher @santhosh Thank you for organizing your session at the 2018 Dev Summit. Please make sure to document your outcomes and next steps, and link to any tasks that were created as outcomes of this session. Otherwise, please resolve this and the corresponding Phabricator tasks when there is nothing left to do in them.

Rfarrand claimed this task.