Wikimania 2019 Hackathon Session: How to work with cross-lingual word-embeddings
Word-embeddings allow machines to measure the semantic distance between a pair of words or sentences. This is done by converting each string (a word or sentence) into a vector, making it possible to perform mathematical operations on those strings. For example, it is possible to measure the distance between cat and dog, which should be smaller than the distance between cat and car (since cat and dog are both animals).
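As a minimal sketch of this idea, the snippet below measures cosine similarity between toy 3-dimensional vectors. Real embeddings (such as FastText's) have 300 dimensions, and the numbers here are made up purely for illustration:

```python
# Toy illustration of semantic distance between embedding vectors.
# The 3-d vectors below are synthetic; real models use ~300 dimensions.
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

# "cat" should be more similar to "dog" than to "car"
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))
```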
Recently, researchers have been working on making those embeddings cross-lingual, allowing the distance between strings in different languages to be measured. Translations such as cat [en] and gato [es] should then be very similar (ideally identical) in the vector space.
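A common way to make two monolingual spaces comparable is to learn an orthogonal (Procrustes) mapping from one space to the other using a small seed dictionary of translation pairs. The sketch below uses synthetic data: the "Spanish" matrix is just a hidden rotation of the "English" one, so the learned mapping should recover the rotation. This is an illustration of the general technique, not the exact procedure used in the session:

```python
# Hedged sketch of orthogonal (Procrustes) alignment between two
# embedding spaces. All vectors here are synthetic stand-ins.
import numpy as np

def procrustes_alignment(X, Y):
    """Return the orthogonal W minimizing ||X @ W - Y|| (Frobenius norm)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 5))                  # "English" vectors
R = np.linalg.qr(rng.standard_normal((5, 5)))[0]   # hidden rotation
X = Y @ R.T                                        # "Spanish" = rotated English

W = procrustes_alignment(X, Y)
# After alignment, the mapped "Spanish" vectors match the "English" ones.
print(np.allclose(X @ W, Y, atol=1e-8))
```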
The session will be organized as follows:
First Part: Understanding and playing with cross-lingual word-embeddings [40 mins]
- What is a word-embedding
- How to use FastText in Python.
- How to align models in different languages.
Second Part: Use cases on section alignment and recommendation [20 mins]
If you are only interested in using the APIs, you are welcome to come just to the second part of the session.
Materials and recommendations:
If you want to do hands-on work and try your own alignments, you will need to install some packages and download some data in advance:
- You will need a machine with at least 16GB of RAM.
- Install Python 3.
- Install FastText for Python
- Download the models (bin+vec) in English and Spanish. You can also download any pair of languages contained in this list: ["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "he", "it", "ta", "id", "fa", "ca"]
- Clone this GitHub repository. (UPDATED)
If you want to know more about word-embedding alignments, check this repository.