Wikimania 2019 Hackathon Session: How to work with cross-lingual word-embeddings
Word-embeddings allow machines to measure the semantic distance between a pair of words or sentences. This is done by converting each string (a word or sentence) into a vector, making it possible to perform mathematical operations on those strings. For example, it is possible to measure the distance between cat and dog, which should be smaller than the distance between cat and car (since cat and dog are both animals).
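As a minimal sketch of this idea, the snippet below measures cosine similarity between toy 3-dimensional vectors. Real embeddings (such as FastText's) have 300 dimensions, and the numbers here are made up purely for illustration:

```python
# Toy illustration of semantic distance between embedding vectors.
# The 3-d vectors below are synthetic; real models use ~300 dimensions.
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

# "cat" should be more similar to "dog" than to "car"
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))
```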
Recently, researchers have been working on making those embeddings cross-lingual, allowing the distance between strings in different languages to be measured. Translations such as cat [en] and gato [es] should then be very similar (ideally identical) in the vector space.
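A common way to make two monolingual spaces comparable is to learn an orthogonal (Procrustes) mapping from one space to the other using a small seed dictionary of translation pairs. The sketch below uses synthetic data: the "Spanish" matrix is just a hidden rotation of the "English" one, so the learned mapping should recover the rotation. This is an illustration of the general technique, not the exact procedure used in the session:

```python
# Hedged sketch of orthogonal (Procrustes) alignment between two
# embedding spaces. All vectors here are synthetic stand-ins.
import numpy as np

def procrustes_alignment(X, Y):
    """Return the orthogonal W minimizing ||X @ W - Y|| (Frobenius norm)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 5))                  # "English" vectors
R = np.linalg.qr(rng.standard_normal((5, 5)))[0]   # hidden rotation
X = Y @ R.T                                        # "Spanish" = rotated English

W = procrustes_alignment(X, Y)
# After alignment, the mapped "Spanish" vectors match the "English" ones.
print(np.allclose(X @ W, Y, atol=1e-8))
```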
The session will be organized as follows:
First Part: Understanding and playing with cross-lingual word-embeddings [40 mins]
- What is a word-embedding
- How to use FastText in Python.
- How to align models in different languages.
Second Part: Use cases on section alignment and recommendation [20 mins]
If you are only interested in using the APIs, you are welcome to come just to the second part of the session.
Materials and recommendations:
If you want to do hands-on work and try your own alignments, you will need to install some packages and download some data in advance:
- You will need a machine with at least 16GB of RAM.
- Install Python 3.
- Install FastText for Python
- Download the models (bin+vec) in English and Spanish. You can also download any pair of languages contained in this list: ["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "he", "it", "ta", "id", "fa", "ca"]
- Clone this GitHub repository. (UPDATED)
If you want to know more about word-embedding alignments, check this repository.